Enable Intel Icelake CPU support, backport related patches from upstream.
Fix kabi broken introduced by the patch set with patches 0288 - 0292.
Alexander Shishkin (28): Intel: perf/ring_buffer: Fix AUX software double buffering Intel: perf/x86/intel/pt: Remove software double buffering PMU capability intel_th: Only create useful device nodes intel_th: Rework resource passing between glue layers and core intel_th: Skip subdevices if their MMIO is missing intel_th: Add "rtit" source device intel_th: Communicate IRQ via resource intel_th: pci: Use MSI interrupt signalling intel_th: msu: Start handling IRQs intel_th: Only report useful IRQs to subdevices intel_th: msu: Replace open-coded list_{first,last,next}_entry variants intel_th: msu: Switch over to scatterlist intel_th: msu: Factor out pipeline draining intel_th: gth: Factor out trace start/stop intel_th: Add switch triggering support intel_th: msu: Correct the block wrap detection intel_th: msu: Add a sysfs attribute to trigger window switch intel_th: msu: Add current window tracking intel_th: msu: Support multipage blocks intel_th: msu: Split sgt array and pointer in multiwindow mode intel_th: msu: Start read iterator from a non-empty window intel_th: msu: Introduce buffer interface intel_th: msu-sink: An example msu buffer "sink" intel_th: gth: Fix the window switching sequence intel_th: msu: Fix an uninitialized mutex intel_th: Fix freeing IRQs intel_th: msu: Fix window switching without windows intel_th: msu: Fix the unexpected state warning
Alexandru Gagniuc (1): Intel:PCI: pciehp: Wait for PDS if in-band presence is disabled
Alison Schofield (1): acpi/hmat: Update acpi_hmat_type enum with ACPI_HMAT_TYPE_PROXIMITY
Andi Kleen (3): Intel: perf/x86/intel: Extract memory code PEBS parser for reuse Intel: perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them Intel: perf tools x86: Add support for recording and printing XMM registers
Andrew Murray (2): Intel: perf/core: Add function to test for event exclusion flags Intel: perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
Andy Lutomirski (1): x86/traps: Stop using ist_enter/exit() in do_int3()
Andy Shevchenko (4): Intel:PCI/AER: Use match_string() helper to simplify the code Intel:PCI/AER: Use for_each_set_bit() to simplify code Intel:PCI/AER: Fix kernel-doc warnings intel_th: pti: Use sysfs_match_string() helper
Aneesh Kumar K.V (1): drivers/dax: Allow to include DEV_DAX_PMEM as builtin
Arnaldo Carvalho de Melo (3): Intel: perf record: Fix suggestion to get list of registers usable with --user-regs and --intr-regs Intel: perf parse-regs: Improve error output when faced with unknown register name tools x86 uapi asm: Sync the pt_regs.h copy with the kernel sources
Bharat Kumar Gogada (1): Intel:PCI: Enable SERR# forwarding for all bridges
Bjorn Helgaas (5): PCI: Add pci_speed_string() PCI: Use pci_speed_string() for all PCI/PCI-X/PCIe strings Intel:PCI/ASPM: Save LTR Capability for suspend/resume Intel:PCI/portdrv: Use conventional Device ID table formatting Intel:PCI: Use dev_printk() when possible
Chen Yu (3): Intel: intel_idle: Customize IceLake server support tools/power turbostat: Support Ice Lake server intel_idle: Fix max_cstate for processor models without C-state tables
Colin Ian King (2): intel_th: msu: Fix missing allocation failure check on a kstrndup intel_th: msu: Fix overflow in shift of an unsigned int
Dan Carpenter (2): tools/power/x86/intel-speed-select: Fix a read overflow in isst_set_tdp_level_msr() node: fix device cleanups in error handling code
Dan Williams (11): Intel: device-dax: Kill dax_region ida Intel: device-dax: Kill dax_region base Intel: device-dax: Remove multi-resource infrastructure Intel: device-dax: Start defining a dax bus model Intel: device-dax: Introduce bus + driver model Intel: device-dax: Move resource pinning+mapping into the common driver Intel: device-dax: Add support for a dax override driver Intel: device-dax: Add /sys/class/dax backwards compatibility Intel: acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node Intel: device-dax: Auto-bind device after successful new_id Intel: device-dax: Add a 'target_node' attribute
Dave Hansen (4): Intel: mm/resource: Move HMM pr_debug() deeper into resource code Intel: mm/memory-hotplug: Allow memory resources to be children mm/resource: Let walk_system_ram_range() search child resources Intel: device-dax: "Hotplug" persistent memory for use like normal RAM
Dave Jiang (7): Intel: dmaengine: ioatdma: Add Snow Ridge ioatdma device id Intel: dmaengine: ioatdma: disable DCA enabling on IOATDMA v3.4 Intel: dmaengine: ioatdma: add descriptor pre-fetch support for v3.4 Intel: dmaengine: ioatdma: support latency tolerance report (LTR) for v3.4 Intel: ntb: intel: Add Icelake (gen4) support for Intel NTB Intel: ntb: intel: fix static declaration Intel: ntb: intel: add hw workaround for NTB BAR alignment
Dinghao Liu (1): ntb: intel: Fix memleak in intel_ntb_pci_probe
Dongdong Liu (1): PCI/AER: Initialize aer_fifo
Erik Schmauss (1): Intel: ACPICA: ACPI 6.3: HMAT updates
Felipe Balbi (1): Intel: PCI: Add support for Immediate Readiness
Frederick Lawler (1): Intel:PCI/DPC: Log messages with pci_dev, not pcie_device
Guoqing Jiang (11): Intel: intel_idle: Use ACPI _CST on server systems Intel: perf/x86/intel: Add more Icelake CPUIDs Intel:PCI/AER: Use threaded IRQ for bottom half Intel:PCI/AER: Use managed resource allocations Intel:PCI/AER: Log messages with pci_dev, not pcie_device Intel:PCI: pciehp: Disable in-band presence detect when possible perf/x86/intel: Export mem events only if there's PEBS support Intel: perf/x86: Use the new pmu::update_attrs attribute group intel: perf/x86/intel: Use update attributes for skylake format Intel: perf/x86/amd: Constrain Large Increment per Cycle events Intel: perf/x86/intel: Support TopDown metrics on Ice Lake
Gustavo A. R. Silva (1): intel_th: Mark expected switch fall-throughs
Gustavo Pimentel (1): PCI: Decode PCIe 32 GT/s link speed
Honghui Zhang (1): Intel:PCI/portdrv: Support PCIe services on subtractive decode bridges
Jann Horn (2): Intel: x86/extable: Introduce _ASM_EXTABLE_UA for uaccess fixups Intel: x86/insn-eval: Add support for 64-bit kernel mode
Jay Fang (1): PCI/PME: Fix kernel-doc of pcie_pme_resume() and pcie_pme_remove()
Jiri Olsa (7): Intel: sysfs: Add sysfs_update_groups function Intel: perf/core: Add attr_groups_update into struct pmu Intel: perf/x86: Get rid of x86_pmu::event_attrs Intel: perf/x86: Add is_visible attribute_group callback for base events Intel: perf/x86: Use update attribute groups for caps Intel: perf/x86: Use update attribute groups for extra format Intel: perf/x86: Use update attribute groups for default attributes
Jonathan Corbet (1): docs: fix numaperf.rst and add it to the doc tree
Kan Liang (39): perf/x86/intel: Add Icelake support perf/x86/intel: Add Tremont core PMU support Intel: perf/x86: Support outputting XMM registers Intel: perf/x86/intel/ds: Extract code of event update in short period Intel: perf/x86/intel: Support adaptive PEBS v4 Intel: perf/x86/intel/uncore: Add Intel Icelake uncore support Intel: perf parse-regs: Split parse_regs Intel: perf parse-regs: Add generic support for arch__intr/user_reg_mask() Intel: perf regs x86: Add X86 specific arch__intr_reg_mask() Intel: perf/x86: Disable extended registers for non-supported PMUs Intel: perf/x86/regs: Check reserved bits Intel: perf/x86: Clean up PEBS_XMM_REGS Intel: perf/x86: Remove pmu->pebs_no_xmm_regs Intel: perf/x86/regs: Use PERF_REG_EXTENDED_MASK Intel: perf/x86/intel/uncore: Add uncore support for Snow Ridge server Intel: perf/x86/intel/uncore: Factor out box ref/unref functions Intel: perf/x86/intel/uncore: Support MMIO type uncore blocks Intel: perf/x86/intel/uncore: Clean up client IMC Intel: perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge Intel: perf/x86/intel/uncore: Factor out __snr_uncore_mmio_init_box Intel: perf/x86/intel/uncore: Add box_offsets for free-running counters Intel: perf/x86/intel/uncore: Add Ice Lake server uncore support Intel: perf/x86/intel: Factor out common code of PMI handler Intel: perf/x86/intel: Fix SLOTS PEBS event constraint Intel: perf/x86: Use event_base_rdpmc for the RDPMC userspace support Intel: perf/x86/intel: Name the global status bit in NMI handler Intel: perf/x86/intel: Introduce the fourth fixed counter Intel: perf/x86/intel: Move BTS index to 47 Intel: perf/x86/intel: Fix the name of perf METRICS Intel: perf/x86/intel: Use switch in intel_pmu_disable/enable_event Intel: perf/core: Add a new PERF_EV_CAP_SIBLING event capability Intel: perf/x86/intel: Generic support for hardware TopDown metrics Intel: perf/x86: Add a macro for RDPMC offset of fixed counters Intel: perf/x86/intel: Support per-thread RDPMC TopDown metrics Intel: perf/x86/intel: Check perf metrics feature for each CPU perf/x86/intel/uncore: Fix missing marker for snr_uncore_imc_freerunning_events perf/x86/intel/uncore: Reduce the number of CBOX counters perf/x86/intel/uncore: Fix the scale of the IMC free-running events perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
Keith Busch (23): PCI/AER: Remove error source from AER struct aer_rpc PCI/AER: Use kfifo for tracking events instead of reimplementing it Intel: acpi: Create subtable parsing infrastructure Intel: acpi: Add HMAT to generic parsing tables Intel: acpi/hmat: Parse and report heterogeneous memory node: Link memory nodes to their compute nodes Intel: node: Add heterogenous memory access attributes Intel: node: Add memory-side caching attributes Intel: acpi/hmat: Register processor domain to its memory Intel: acpi/hmat: Register performance attributes Intel: acpi/hmat: Register memory side cache attributes Intel: doc/mm: New documentation for memory performance PCI: portdrv: Restore PCI config state on slot reset Intel:PCI/DPC: Save and restore config state PCI/ERR: Handle fatal error recovery PCI/ERR: Simplify broadcast callouts Intel:PCI/ERR: Always report current recovery status for udev PCI: Unify device inaccessible Intel:PCI: Make link active reporting detection generic Intel:PCI/AER: Remove unused aer_error_resume() Intel:PCI/AER: Use kfifo_in_spinlocked() to insert locked elements Intel:PCI/AER: Reuse existing pcie_port_find_device() interface Intel:PCI/AER: Abstract AER interrupt handling
Kim Phillips (2): perf/x86/amd: Add support for Large Increment per Cycle Events perf/x86/amd: Fix sampling Large Increment per Cycle events
Kuppuswamy Sathyanarayanan (2): PCI/ERR: Combine pci_channel_io_frozen cases PCI/ERR: Update error status after reset_link()
Len Brown (9): Intel: topology: Simplify cputopology.txt formatting and wording Intel: x86/topology: Add CPUID.1F multi-die/package support Intel: x86/topology: Create topology_max_die_per_package() Intel: cpu/topology: Export die_id Intel: x86/topology: Define topology_die_id() Intel: x86/topology: Define topology_logical_die_id() tools/power turbostat: reduce debug output tools/power turbostat: consolidate duplicate model numbers tools/power turbostat: Fix Haswell Core systems
Leonid Ravich (1): Intel: NTB: add new parameter to peer_db_addr() db_bit and db_data
Like Xu (4): perf/x86/core: Refactor hw->idx checks and cleanup Intel: perf/x86/lbr: Add interface to get LBR information Intel: perf/x86: Add constraint to create guest LBR event without hw counter Intel: perf/x86: Keep LBR records unchanged in host context for guest usage
Lukas Wunner (1): PCI: Simplify disconnected marking
Mel Gorman (1): intel_idle: Ignore _CST if control cannot be taken from the platform
Mika Westerberg (2): Intel:PCI: Make pcie_downstream_port() available outside of access.c Intel:PCI: Get rid of dev->has_secondary_link flag
Mohan Kumar (2): Intel:PCI: Replace printk(KERN_INFO) with pr_info(), etc Intel:PCI: Replace dev_printk(KERN_DEBUG) with dev_info(), etc
Olof Johansson (1): Intel:PCI/DPC: Add "pcie_ports=dpc-native" to allow DPC without AER control
Otto Sabart (1): doc: trace: fix reference to cpuidle documentation file
Patel, Mayurkumar (1): Intel:PCI/AER: Save AER Capability for suspend/resume
Pavel Tatashin (1): device-dax: fix memory and resource leak if hotplug fails
Peter Zijlstra (4): perf/x86: Support constraint ranges Intel: perf/x86: Fix n_metric for cancelled txn Intel: x86/uaccess: Move copy_user_handle_tail() into asm Intel: hardirq/nmi: Allow nested nmi_enter()
Qian Cai (2): acpi/hmat: fix memory leaks in hmat_init() acpi/hmat: fix an uninitialized memory_target
Qiuxu Zhuo (9): Intel: EDAC, skx_common: Separate common code out from skx_edac EDAC, skx_edac: Delete duplicated code Intel: EDAC, i10nm: Add a driver for Intel 10nm server processors Intel: EDAC, skx, i10nm: Make skx_common.c a pure library Intel: EDAC, i10nm: Add Intel additional Ice-Lake support Intel: EDAC, i10nm: Check ECC enabling status per channel Intel: EDAC, skx, i10nm: Fix source ID register offset Intel: EDAC, {skx,i10nm}: Make some configurations CPU model specific Intel: EDAC/i10nm: Update driver to support different bus number config register offsets
Rafael J. Wysocki (12): Intel: ACPI: processor: Export function to claim _CST control Intel: ACPI: processor: Introduce acpi_processor_evaluate_cst() Intel: ACPI: processor: Clean up acpi_processor_evaluate_cst() Intel: ACPI: processor: Export acpi_processor_evaluate_cst() Intel: intel_idle: Refactor intel_idle_cpuidle_driver_init() Intel: intel_idle: Use ACPI _CST for processor models without C-state tables Intel: Documentation: admin-guide: PM: Add cpuidle document Intel: cpuidle: Allow idle states to be disabled by default Intel: intel_idle: Allow ACPI _CST to be used for selected known processors Intel: intel_idle: Add module parameter to prevent ACPI _CST from being used Intel: ACPI: processor: Make ACPI_PROCESSOR_CSTATE depend on ACPI_PROCESSOR Intel: Documentation: admin-guide: PM: Add intel_idle document
Shaokun Zhang (1): intel_th: msu: Fix unused variable warning on arm64 platform
Srinivas Pandruvada (13): Intel: platform/x86: ISST: Update ioctl-number.txt for Intel Speed Select interface Intel: platform/x86: ISST: Add common API to register and handle ioctls Intel: platform/x86: ISST: Store per CPU information Intel: platform/x86: ISST: Add IOCTL to Translate Linux logical CPU to PUNIT CPU number Intel: platform/x86: ISST: Add Intel Speed Select mmio interface Intel: platform/x86: ISST: Add Intel Speed Select mailbox interface via PCI Intel: platform/x86: ISST: Add Intel Speed Select mailbox interface via MSRs Intel: platform/x86: ISST: Add Intel Speed Select PUNIT MSR interface Intel: platform/x86: ISST: Restore state on resume Intel: tools/power/x86: A tool to validate Intel Speed Select commands Intel: ICX: platform/x86: ISST: Allow additional core-power mailbox commands Intel: ICX: platform/x86: ISST: Fix wrong unregister type Intel: platform/x86: ISST: Increase timeout
Stephen Rothwell (1): intel_rapl: need linux/cpuhotplug.h for enum cpuhp_state
Stuart Hayes (1): PCI: pciehp: Add DMI table for in-band presence detection disabled
Subbaraya Sundeep (2): Intel:PCI: Assign bus numbers present in EA capability for bridges PCI: Do not use bus number zero from EA capability
Sumeet Pawnikar (1): powercap: RAPL: remove unused local MSR define
Thomas Gleixner (2): genirq: Provide interrupt injection mechanism Intel:PCI/AER: Fix the broken interrupt injection
Tony Luck (5): Intel: EDAC, skx_common: Add code to recognise new compound error code Intel: EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode. Intel: EDAC, skx: Retrieve and print retry_rd_err_log registers MAINTAINERS: Update entry for EDAC-SKYLAKE MAINTAINERS: Add entry for EDAC-I10NM
Vishal Verma (1): Intel: device-dax: Add a 'modalias' attribute to DAX 'bus' devices
Wang Hai (1): device-dax/core: Fix memory leak when rmmod dax.ko
Wei Li (1): kabi: Fix "Intel: perf/core: Add attr_groups_update into struct pmu"
Wei Wang (1): Intel: perf/x86: Fix variable types for LBR registers
Wei Yongjun (1): intel_th: msu: Fix possible memory leak in mode_store()
Yangtao Li (1): Intel: cpuidle: use BIT() for idle state flags and remove CPUIDLE_DRIVER_FLAGS_MASK
Yanjiang Jin (1): Intel:PCI/AER: Queue one GHES event, not several uninitialized ones
Yicong Yang (3): PCI/AER: Log which device prevents error recovery PCI: Add 32 GT/s decoding in some macros PCI: Add PCIE_LNKCAP2_SLS2SPEED() macro
YueHaibing (2): Intel:PCI/ERR: Remove duplicated include from err.c intel_th: msu: Remove set but not used variable 'last'
Yunying Sun (1): Intel: perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register
Zhang Rui (18): Intel: powercap/intel_rapl: Simplify rapl_find_package() Intel: powercap/intel_rapl: Support multi-die/package Intel: powercap/intel_rapl: Update RAPL domain name and debug messages Intel: intel_rapl: use reg instead of msr Intel: intel_rapl: remove hardcoded register index Intel: intel_rapl: introduce intel_rapl.h Intel: intel_rapl: introduce struct rapl_if_private Intel: intel_rapl: abstract register address Intel: intel_rapl: abstract register access operations Intel: intel_rapl: cleanup some functions Intel: intel_rapl: cleanup hardcoded MSR access intel_rapl: abstract RAPL common code Intel: intel_rapl: support 64 bit register Intel: intel_rapl: support two power limits for every RAPL domain Intel: intel_rapl: Fix module autoloading issue Intel: powercap/intel_rapl: add support for IceLake desktop Intel: powercap/intel_rapl: add support for ICX Intel: powercap/intel_rapl: add support for ICX-D
Zheng Zengkai (9): irqchip: phytium-2500: Fix compilation issues hulk_defconfig: Enable some Icelake support configs openeuler_defconfig: Enable some Icelake support configs hulk_defconfig: Adjust some configs for Intel icelake support openeuler_defconfig: Adjust some configs for Intel icelake support kabi: Fix "PCI: Decode PCIe 32 GT/s link speed" PCI: kabi: fix kabi broken for struct pci_dev kabi: Fix "perf/x86/intel: Support per-thread RDPMC TopDown metrics" x86: Fix kabi broken for struct cpuinfo_x86
Documentation/ABI/obsolete/sysfs-class-dax | 22 + Documentation/ABI/stable/sysfs-devices-node | 87 +- .../testing/sysfs-bus-intel_th-devices-msc | 11 +- .../ABI/testing/sysfs-devices-system-cpu | 6 + Documentation/PCI/pci-error-recovery.txt | 35 +- .../admin-guide/kernel-parameters.txt | 2 + Documentation/admin-guide/mm/index.rst | 1 + Documentation/admin-guide/mm/numaperf.rst | 169 ++ Documentation/admin-guide/pm/intel_idle.rst | 246 +++ .../admin-guide/pm/working-state.rst | 2 + Documentation/cpuidle/core.txt | 23 - Documentation/cpuidle/sysfs.txt | 98 - Documentation/cputopology.txt | 55 +- Documentation/ioctl/ioctl-number.txt | 1 + Documentation/trace/coresight-cpu-debug.txt | 2 +- Documentation/x86/topology.txt | 4 + MAINTAINERS | 10 +- arch/arm64/configs/hulk_defconfig | 2 + arch/arm64/configs/openeuler_defconfig | 2 + arch/arm64/kernel/acpi_numa.c | 2 +- arch/arm64/kernel/smp.c | 4 +- arch/ia64/kernel/acpi.c | 14 +- arch/x86/configs/hulk_defconfig | 18 +- arch/x86/configs/openeuler_defconfig | 18 +- arch/x86/events/amd/core.c | 107 +- arch/x86/events/core.c | 315 ++-- arch/x86/events/intel/core.c | 929 ++++++++-- arch/x86/events/intel/ds.c | 501 ++++- arch/x86/events/intel/lbr.c | 84 +- arch/x86/events/intel/pt.c | 3 +- arch/x86/events/intel/uncore.c | 137 +- arch/x86/events/intel/uncore.h | 40 +- arch/x86/events/intel/uncore_snb.c | 107 +- arch/x86/events/intel/uncore_snbep.c | 1119 ++++++++++++ arch/x86/events/perf_event.h | 160 +- arch/x86/include/asm/asm.h | 31 +- arch/x86/include/asm/futex.h | 6 +- arch/x86/include/asm/intel_ds.h | 2 +- arch/x86/include/asm/msr-index.h | 4 + arch/x86/include/asm/perf_event.h | 181 +- arch/x86/include/asm/processor.h | 6 +- arch/x86/include/asm/ptrace.h | 13 + arch/x86/include/asm/topology.h | 16 + arch/x86/include/asm/uaccess.h | 16 +- arch/x86/include/asm/uaccess_64.h | 3 - arch/x86/include/uapi/asm/perf_regs.h | 26 +- arch/x86/kernel/acpi/boot.c | 36 +- arch/x86/kernel/cpu/common.c | 1 + arch/x86/kernel/cpu/topology.c | 88 +- arch/x86/kernel/perf_regs.c | 28 +- arch/x86/kernel/smpboot.c | 47 + arch/x86/kernel/traps.c | 21 +- arch/x86/lib/checksum_32.S | 4 +- arch/x86/lib/copy_user_64.S | 138 +- arch/x86/lib/csum-copy_64.S | 8 +- arch/x86/lib/getuser.S | 12 +- arch/x86/lib/insn-eval.c | 26 +- arch/x86/lib/putuser.S | 10 +- arch/x86/lib/usercopy_32.c | 126 +- arch/x86/lib/usercopy_64.c | 24 +- arch/x86/mm/extable.c | 8 + drivers/acpi/Kconfig | 2 + drivers/acpi/Makefile | 1 + drivers/acpi/acpi_processor.c | 182 ++ drivers/acpi/hmat/Kconfig | 11 + drivers/acpi/hmat/Makefile | 1 + drivers/acpi/hmat/hmat.c | 666 +++++++ drivers/acpi/nfit/core.c | 8 +- drivers/acpi/numa.c | 17 +- drivers/acpi/processor_idle.c | 174 +- drivers/acpi/scan.c | 4 +- drivers/acpi/tables.c | 76 +- drivers/base/Kconfig | 8 + drivers/base/memory.c | 1 + drivers/base/node.c | 350 +++- drivers/base/topology.c | 4 + drivers/cpuidle/cpuidle.c | 7 +- drivers/cpuidle/sysfs.c | 10 + drivers/dax/Kconfig | 27 +- drivers/dax/Makefile | 6 +- drivers/dax/bus.c | 503 ++++++ drivers/dax/bus.h | 61 + drivers/dax/dax-private.h | 34 +- drivers/dax/dax.h | 18 - drivers/dax/device-dax.h | 25 - drivers/dax/device.c | 363 +--- drivers/dax/kmem.c | 111 ++ drivers/dax/pmem.c | 153 -- drivers/dax/pmem/Makefile | 7 + drivers/dax/pmem/compat.c | 73 + drivers/dax/pmem/core.c | 71 + drivers/dax/pmem/pmem.c | 40 + drivers/dax/super.c | 42 +- drivers/dma/ioat/dma.c | 12 + drivers/dma/ioat/dma.h | 2 +- drivers/dma/ioat/hw.h | 3 + drivers/dma/ioat/init.c | 40 +- drivers/dma/ioat/registers.h | 24 + drivers/edac/Kconfig | 12 + drivers/edac/Makefile | 7 +- drivers/edac/i10nm_base.c | 344 ++++ drivers/edac/skx_base.c | 752 ++++++++ drivers/edac/skx_common.c | 657 +++++++ drivers/edac/skx_common.h | 153 ++ drivers/edac/skx_edac.c | 1357 -------------- drivers/hwtracing/intel_th/Makefile | 3 + drivers/hwtracing/intel_th/acpi.c | 10 +- drivers/hwtracing/intel_th/core.c | 146 +- drivers/hwtracing/intel_th/gth.c | 130 +- drivers/hwtracing/intel_th/gth.h | 19 + drivers/hwtracing/intel_th/intel_th.h | 32 +- drivers/hwtracing/intel_th/msu-sink.c | 116 ++ drivers/hwtracing/intel_th/msu.c | 832 +++++++-- drivers/hwtracing/intel_th/msu.h | 28 +- drivers/hwtracing/intel_th/pci.c | 32 +- drivers/hwtracing/intel_th/pti.c | 16 +- drivers/hwtracing/intel_th/sth.c | 4 + drivers/idle/intel_idle.c | 388 +++- drivers/irqchip/irq-gic-phytium-2500-its.c | 6 +- drivers/irqchip/irq-gic-phytium-2500.c | 10 +- drivers/irqchip/irq-gic-v2m.c | 2 +- drivers/irqchip/irq-gic-v3-its-pci-msi.c | 2 +- drivers/irqchip/irq-gic-v3-its-platform-msi.c | 2 +- drivers/irqchip/irq-gic-v3-its.c | 6 +- drivers/irqchip/irq-gic-v3.c | 10 +- drivers/irqchip/irq-gic.c | 4 +- drivers/mailbox/pcc.c | 2 +- drivers/ntb/hw/intel/Makefile | 2 +- drivers/ntb/hw/intel/ntb_hw_gen1.c | 72 +- drivers/ntb/hw/intel/ntb_hw_gen1.h | 6 +- drivers/ntb/hw/intel/ntb_hw_gen3.c | 44 +- drivers/ntb/hw/intel/ntb_hw_gen3.h | 8 + drivers/ntb/hw/intel/ntb_hw_gen4.c | 552 ++++++ drivers/ntb/hw/intel/ntb_hw_gen4.h | 100 + drivers/ntb/hw/intel/ntb_hw_intel.h | 12 + drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 9 +- drivers/nvdimm/e820.c | 1 + drivers/nvdimm/nd.h | 2 +- drivers/nvdimm/of_pmem.c | 1 + drivers/nvdimm/region_devs.c | 1 + drivers/pci/access.c | 11 +- drivers/pci/bus.c | 5 +- drivers/pci/hotplug/pciehp.h | 1 + drivers/pci/hotplug/pciehp_hpc.c | 75 +- drivers/pci/hotplug/pciehp_pci.c | 9 +- drivers/pci/pci-acpi.c | 11 +- drivers/pci/pci-stub.c | 10 +- drivers/pci/pci-sysfs.c | 27 +- drivers/pci/pci.c | 143 +- drivers/pci/pci.h | 101 +- drivers/pci/pcie/Kconfig | 2 +- drivers/pci/pcie/aer.c | 340 ++-- drivers/pci/pcie/aer_inject.c | 48 +- drivers/pci/pcie/aspm.c | 8 +- drivers/pci/pcie/dpc.c | 106 +- drivers/pci/pcie/err.c | 210 +-- drivers/pci/pcie/pme.c | 4 +- drivers/pci/pcie/portdrv.h | 6 +- drivers/pci/pcie/portdrv_core.c | 12 +- drivers/pci/pcie/portdrv_pci.c | 24 +- drivers/pci/probe.c | 193 +- drivers/pci/quirks.c | 15 +- drivers/pci/setup-bus.c | 30 +- drivers/pci/slot.c | 39 +- drivers/pci/vc.c | 4 +- drivers/pci/xen-pcifront.c | 7 +- drivers/platform/x86/Kconfig | 2 + drivers/platform/x86/Makefile | 1 + .../x86/intel_speed_select_if/Kconfig | 17 + .../x86/intel_speed_select_if/Makefile | 10 + .../intel_speed_select_if/isst_if_common.c | 675 +++++++ .../intel_speed_select_if/isst_if_common.h | 69 + .../intel_speed_select_if/isst_if_mbox_msr.c | 216 +++ .../intel_speed_select_if/isst_if_mbox_pci.c | 213 +++ .../x86/intel_speed_select_if/isst_if_mmio.c | 180 ++ drivers/powercap/Kconfig | 11 +- drivers/powercap/Makefile | 4 +- .../{intel_rapl.c => intel_rapl_common.c} | 857 ++++----- drivers/powercap/intel_rapl_msr.c | 183 ++ fs/sysfs/group.c | 54 +- include/acpi/actbl1.h | 10 +- include/linux/acpi.h | 26 +- include/linux/aer.h | 4 + include/linux/cpuidle.h | 10 +- include/linux/hardirq.h | 5 +- include/linux/intel_rapl.h | 155 ++ include/linux/intel_th.h | 79 + include/linux/interrupt.h | 2 + include/linux/libnvdimm.h | 1 + include/linux/node.h | 71 + include/linux/ntb.h | 10 +- include/linux/pci.h | 15 +- include/linux/perf_event.h | 47 +- include/linux/perf_regs.h | 8 + include/linux/preempt.h | 4 +- include/linux/sysfs.h | 8 + include/linux/topology.h | 3 + include/uapi/linux/isst_if.h | 172 ++ include/uapi/linux/pci_regs.h | 13 + kernel/events/core.c | 90 +- kernel/events/ring_buffer.c | 3 +- kernel/irq/Kconfig | 5 + kernel/irq/chip.c | 2 +- kernel/irq/debugfs.c | 34 +- kernel/irq/internals.h | 2 +- kernel/irq/resend.c | 53 +- kernel/resource.c | 14 +- mm/memory_hotplug.c | 33 +- tools/Makefile | 11 +- tools/arch/x86/include/uapi/asm/perf_regs.h | 26 +- tools/perf/Documentation/perf-record.txt | 3 +- tools/perf/arch/x86/include/perf_regs.h | 25 +- tools/perf/arch/x86/util/perf_regs.c | 44 + tools/perf/builtin-record.c | 4 +- tools/perf/util/parse-regs-options.c | 33 +- tools/perf/util/parse-regs-options.h | 3 +- tools/perf/util/perf_regs.c | 10 + tools/perf/util/perf_regs.h | 3 + tools/power/x86/intel-speed-select/Build | 1 + tools/power/x86/intel-speed-select/Makefile | 56 + .../x86/intel-speed-select/isst-config.c | 1607 +++++++++++++++++ .../power/x86/intel-speed-select/isst-core.c | 721 ++++++++ .../x86/intel-speed-select/isst-display.c | 479 +++++ tools/power/x86/intel-speed-select/isst.h | 231 +++ tools/power/x86/turbostat/turbostat.c | 80 +- tools/testing/nvdimm/Kbuild | 7 +- tools/testing/nvdimm/dax-dev.c | 16 +- 227 files changed, 17781 insertions(+), 4637 deletions(-) create mode 100644 Documentation/ABI/obsolete/sysfs-class-dax create mode 100644 Documentation/admin-guide/mm/numaperf.rst create mode 100644 Documentation/admin-guide/pm/intel_idle.rst delete mode 100644 Documentation/cpuidle/core.txt delete mode 100644 Documentation/cpuidle/sysfs.txt create mode 100644 drivers/acpi/hmat/Kconfig create mode 100644 drivers/acpi/hmat/Makefile create mode 100644 drivers/acpi/hmat/hmat.c create mode 100644 drivers/dax/bus.c create mode 100644 drivers/dax/bus.h delete mode 100644 drivers/dax/dax.h delete mode 100644 drivers/dax/device-dax.h create mode 100644 drivers/dax/kmem.c delete mode 100644 drivers/dax/pmem.c create mode 100644 drivers/dax/pmem/Makefile create mode 100644 drivers/dax/pmem/compat.c create mode 100644 drivers/dax/pmem/core.c create mode 100644 drivers/dax/pmem/pmem.c create mode 100644 drivers/edac/i10nm_base.c create mode 100644 drivers/edac/skx_base.c create mode 100644 drivers/edac/skx_common.c create mode 100644 drivers/edac/skx_common.h delete mode 100644 drivers/edac/skx_edac.c create mode 100644 drivers/hwtracing/intel_th/msu-sink.c create mode 100644 drivers/ntb/hw/intel/ntb_hw_gen4.c create mode 100644 drivers/ntb/hw/intel/ntb_hw_gen4.h create mode 100644 drivers/platform/x86/intel_speed_select_if/Kconfig create mode 100644 drivers/platform/x86/intel_speed_select_if/Makefile create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_common.c create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_common.h create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c rename drivers/powercap/{intel_rapl.c => intel_rapl_common.c} (61%) create mode 100644 drivers/powercap/intel_rapl_msr.c create mode 100644 include/linux/intel_rapl.h create mode 100644 include/linux/intel_th.h create mode 100644 include/uapi/linux/isst_if.h create mode 100644 tools/power/x86/intel-speed-select/Build create mode 100644 tools/power/x86/intel-speed-select/Makefile create mode 100644 tools/power/x86/intel-speed-select/isst-config.c create mode 100644 tools/power/x86/intel-speed-select/isst-core.c create mode 100644 tools/power/x86/intel-speed-select/isst-display.c create mode 100644 tools/power/x86/intel-speed-select/isst.h
From: Felipe Balbi felipe.balbi@linux.intel.com
mainline inclusion from mainline-v4.20-rc1 commit d6112f8def514e019658bcc9b57d53acdb71ca3f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d6112f8def514e019658bcc9b57d53acdb71ca3f upstream.
PCIe r4.0, sec 7.5.1.1.4 defines a new bit in the Status Register:
Immediate Readiness – This optional bit, when Set, indicates the Function is guaranteed to be ready to successfully complete valid configuration accesses at any time following any reset that the host is capable of issuing Configuration Requests to this Function.
When this bit is Set, for accesses to this Function, software is exempt from all requirements to delay configuration accesses following any type of reset, including but not limited to the timing requirements defined in Section 6.6.
This means that all delays after a Conventional or Function Reset can be skipped.
This patch reads such bit and caches its value in a flag inside struct pci_dev to be checked later if we should delay or can skip delays after a reset. While at that, also move the explicit msleep(100) call from pcie_flr() and pci_af_flr() to pci_dev_wait().
Change-Id: I1c0f7772895b3ccf89092661f95e460f44ba4a2d Signed-off-by: Felipe Balbi felipe.balbi@linux.intel.com [bhelgaas: rename PCI_STATUS_IMMEDIATE to PCI_STATUS_IMM_READY] Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: Lin Wang lin.x.wang@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com --- drivers/pci/pci.c | 13 ++++++++++++- include/linux/pci.h | 1 + include/uapi/linux/pci_regs.h | 1 + 3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index f6e3a813caf1..09a19caa32eb 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1009,7 +1009,7 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state) * because have already delayed for the bridge. */ if (dev->runtime_d3cold) { - if (dev->d3cold_delay) + if (dev->d3cold_delay && !dev->imm_ready) msleep(dev->d3cold_delay); /* * When powering on a bridge from D3cold, the @@ -2708,6 +2708,7 @@ EXPORT_SYMBOL_GPL(pci_d3cold_disable); void pci_pm_init(struct pci_dev *dev) { int pm; + u16 status; u16 pmc;
pm_runtime_forbid(&dev->dev); @@ -2770,6 +2771,10 @@ void pci_pm_init(struct pci_dev *dev) /* Disable the PME# generation functionality */ pci_pme_active(dev, false); } + + pci_read_config_word(dev, PCI_STATUS, &status); + if (status & PCI_STATUS_IMM_READY) + dev->imm_ready = 1; }
static unsigned long pci_ea_flags(struct pci_dev *dev, u8 prop) @@ -4444,6 +4449,9 @@ int pcie_flr(struct pci_dev *dev)
pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
+ if (dev->imm_ready) + return 0; + /* * Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within * 100ms, but may silently discard requests while the FLR is in @@ -4485,6 +4493,9 @@ static int pci_af_flr(struct pci_dev *dev, int probe)
pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
+ if (dev->imm_ready) + return 0; + /* * Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006, * updated 27 July 2006; a device must complete an FLR within diff --git a/include/linux/pci.h b/include/linux/pci.h index c97de5c8bc35..4988b58628a2 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -326,6 +326,7 @@ struct pci_dev { pci_power_t current_state; /* Current operating state. In ACPI, this is D0-D3, D0 being fully functional, and D3 being off. */ + unsigned int imm_ready:1; /* Supports Immediate Readiness */ u8 pm_cap; /* PM capability offset */ unsigned int pme_support:5; /* Bitmask of states from which PME# can be generated */ diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 662bac7a677a..6050817163a6 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -52,6 +52,7 @@ #define PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */
#define PCI_STATUS 0x06 /* 16 bits */ +#define PCI_STATUS_IMM_READY 0x01 /* Immediate Readiness */ #define PCI_STATUS_INTERRUPT 0x08 /* Interrupt status */ #define PCI_STATUS_CAP_LIST 0x10 /* Support Capability List */ #define PCI_STATUS_66MHZ 0x20 /* Support 66 MHz PCI 2.1 bus */
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit fcd4d369034a819aa393f65c3a8f58db9ab5ed2a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The AER struct aer_rpc was carrying a copy of the error source simply as a temperary variable. Remove that from the structure and use a stack variable for the purpose.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 1563e22600ec..52aa1555620a 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -44,7 +44,6 @@ struct aer_rpc { struct pci_dev *rpd; /* Root Port device */ struct work_struct dpc_handler; struct aer_err_source e_sources[AER_ERROR_SOURCES_MAX]; - struct aer_err_info e_info; unsigned short prod_idx; /* Error Producer Index */ unsigned short cons_idx; /* Error Consumer Index */ int isr; @@ -1172,7 +1171,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc, struct aer_err_source *e_src) { struct pci_dev *pdev = rpc->rpd; - struct aer_err_info *e_info = &rpc->e_info; + struct aer_err_info e_info;
pci_rootport_aer_stats_incr(pdev, e_src);
@@ -1181,36 +1180,36 @@ static void aer_isr_one_error(struct aer_rpc *rpc, * uncorrectable error being logged. Report correctable error first. */ if (e_src->status & PCI_ERR_ROOT_COR_RCV) { - e_info->id = ERR_COR_ID(e_src->id); - e_info->severity = AER_CORRECTABLE; + e_info.id = ERR_COR_ID(e_src->id); + e_info.severity = AER_CORRECTABLE;
if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV) - e_info->multi_error_valid = 1; + e_info.multi_error_valid = 1; else - e_info->multi_error_valid = 0; - aer_print_port_info(pdev, e_info); + e_info.multi_error_valid = 0; + aer_print_port_info(pdev, &e_info);
- if (find_source_device(pdev, e_info)) - aer_process_err_devices(e_info); + if (find_source_device(pdev, &e_info)) + aer_process_err_devices(&e_info); }
if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) { - e_info->id = ERR_UNCOR_ID(e_src->id); + e_info.id = ERR_UNCOR_ID(e_src->id);
if (e_src->status & PCI_ERR_ROOT_FATAL_RCV) - e_info->severity = AER_FATAL; + e_info.severity = AER_FATAL; else - e_info->severity = AER_NONFATAL; + e_info.severity = AER_NONFATAL;
if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV) - e_info->multi_error_valid = 1; + e_info.multi_error_valid = 1; else - e_info->multi_error_valid = 0; + e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, e_info); + aer_print_port_info(pdev, &e_info);
- if (find_source_device(pdev, e_info)) - aer_process_err_devices(e_info); + if (find_source_device(pdev, &e_info)) + aer_process_err_devices(&e_info); } }
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 27c1ce8bbed7e7f0e4a87cf4a93f09be26d62ada category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The kernel provides a generic FIFO implementation, so no need to reinvent that capability in a driver. Replace the AER-specific implementation with the kernel-provided kfifo. Since the interrupt handler producer and work queue consumer run single threaded, there is no need for additional locking, so remove that lock, too.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 88 ++++++------------------------------------ 1 file changed, 11 insertions(+), 77 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 52aa1555620a..14acf51cd9c3 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -30,7 +30,7 @@ #include "../pci.h" #include "portdrv.h"
-#define AER_ERROR_SOURCES_MAX 100 +#define AER_ERROR_SOURCES_MAX 128
#define AER_MAX_TYPEOF_COR_ERRS 16 /* as per PCI_ERR_COR_STATUS */ #define AER_MAX_TYPEOF_UNCOR_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/ @@ -43,14 +43,8 @@ struct aer_err_source { struct aer_rpc { struct pci_dev *rpd; /* Root Port device */ struct work_struct dpc_handler; - struct aer_err_source e_sources[AER_ERROR_SOURCES_MAX]; - unsigned short prod_idx; /* Error Producer Index */ - unsigned short cons_idx; /* Error Consumer Index */ + DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX); int isr; - spinlock_t e_lock; /* - * Lock access to Error Status/ID Regs - * and error producer/consumer index - */ struct mutex rpc_mutex; /* * only one thread could do * recovery on the same @@ -1213,35 +1207,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc, } }
-/** - * get_e_source - retrieve an error source - * @rpc: pointer to the root port which holds an error - * @e_src: pointer to store retrieved error source - * - * Return 1 if an error source is retrieved, otherwise 0. - * - * Invoked by DPC handler to consume an error. - */ -static int get_e_source(struct aer_rpc *rpc, struct aer_err_source *e_src) -{ - unsigned long flags; - - /* Lock access to Root error producer/consumer index */ - spin_lock_irqsave(&rpc->e_lock, flags); - if (rpc->prod_idx == rpc->cons_idx) { - spin_unlock_irqrestore(&rpc->e_lock, flags); - return 0; - } - - *e_src = rpc->e_sources[rpc->cons_idx]; - rpc->cons_idx++; - if (rpc->cons_idx == AER_ERROR_SOURCES_MAX) - rpc->cons_idx = 0; - spin_unlock_irqrestore(&rpc->e_lock, flags); - - return 1; -} - /** * aer_isr - consume errors detected by root port * @work: definition of this work item @@ -1254,7 +1219,7 @@ static void aer_isr(struct work_struct *work) struct aer_err_source uninitialized_var(e_src);
mutex_lock(&rpc->rpc_mutex); - while (get_e_source(rpc, &e_src)) + while (kfifo_get(&rpc->aer_fifo, &e_src)) aer_isr_one_error(rpc, &e_src); mutex_unlock(&rpc->rpc_mutex); } @@ -1268,51 +1233,23 @@ static void aer_isr(struct work_struct *work) */ irqreturn_t aer_irq(int irq, void *context) { - unsigned int status, id; struct pcie_device *pdev = (struct pcie_device *)context; struct aer_rpc *rpc = get_service_data(pdev); - int next_prod_idx; - unsigned long flags; - int pos; - - pos = pdev->port->aer_cap; - /* - * Must lock access to Root Error Status Reg, Root Error ID Reg, - * and Root error producer/consumer index - */ - spin_lock_irqsave(&rpc->e_lock, flags); + struct pci_dev *rp = rpc->rpd; + struct aer_err_source e_src = {}; + int pos = rp->aer_cap;
- /* Read error status */ - pci_read_config_dword(pdev->port, pos + PCI_ERR_ROOT_STATUS, &status); - if (!(status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV))) { - spin_unlock_irqrestore(&rpc->e_lock, flags); + pci_read_config_dword(rp, pos + PCI_ERR_ROOT_STATUS, &e_src.status); + if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV))) return IRQ_NONE; - }
- /* Read error source and clear error status */ - pci_read_config_dword(pdev->port, pos + PCI_ERR_ROOT_ERR_SRC, &id); - pci_write_config_dword(pdev->port, pos + PCI_ERR_ROOT_STATUS, status); + pci_read_config_dword(rp, pos + PCI_ERR_ROOT_ERR_SRC, &e_src.id); + pci_write_config_dword(rp, pos + PCI_ERR_ROOT_STATUS, e_src.status);
- /* Store error source for later DPC handler */ - next_prod_idx = rpc->prod_idx + 1; - if (next_prod_idx == AER_ERROR_SOURCES_MAX) - next_prod_idx = 0; - if (next_prod_idx == rpc->cons_idx) { - /* - * Error Storm Condition - possibly the same error occurred. - * Drop the error. - */ - spin_unlock_irqrestore(&rpc->e_lock, flags); + if (!kfifo_put(&rpc->aer_fifo, e_src)) return IRQ_HANDLED; - } - rpc->e_sources[rpc->prod_idx].status = status; - rpc->e_sources[rpc->prod_idx].id = id; - rpc->prod_idx = next_prod_idx; - spin_unlock_irqrestore(&rpc->e_lock, flags);
- /* Invoke DPC handler */ schedule_work(&rpc->dpc_handler); - return IRQ_HANDLED; } EXPORT_SYMBOL_GPL(aer_irq); @@ -1437,9 +1374,6 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev) if (!rpc) return NULL;
- /* Initialize Root lock access, e_lock, to Root Error Status Reg */ - spin_lock_init(&rpc->e_lock); - rpc->rpd = dev->port; INIT_WORK(&rpc->dpc_handler, aer_isr); mutex_init(&rpc->rpc_mutex);
From: Dongdong Liu liudongdong3@huawei.com
mainline inclusion from mainline-v5.6-rc1 commit d95f20c4f07020ebc605f3b46af4b6db9eb5fc99 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Previously we did not call INIT_KFIFO() for aer_fifo. This leads to kfifo_put() sometimes returning 0 (queue full) when in fact it is not.
It is easy to reproduce the problem by using aer-inject:
$ aer-inject -s :82:00.0 multiple-corr-nonfatal
The content of the multiple-corr-nonfatal file is as below:
AER COR RCVR HL 0 1 2 3 AER UNCOR POISON_TLP HL 4 5 6 7
Fixes: 27c1ce8bbed7 ("PCI/AER: Use kfifo for tracking events instead of reimplementing it") Link: https://lore.kernel.org/r/1579767991-103898-1-git-send-email-liudongdong3@hu... Signed-off-by: Dongdong Liu liudongdong3@huawei.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 14acf51cd9c3..76ea0959ff0c 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1425,7 +1425,8 @@ static int aer_probe(struct pcie_device *dev) aer_remove(dev); return -ENOMEM; } - + + INIT_KFIFO(rpc->aer_fifo); /* Request IRQ ISR */ status = request_irq(dev->irq, aer_irq, IRQF_SHARED, "aerdrv", dev); if (status) {
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.6-rc1 commit 01daacfb9035e5b86d43a01f11a0614648f306c1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
PCI error recovery will fail if any device under the Root Port doesn't have an error_detected callback. Currently only the failure result is printed, which is not enough to identify the driver that lacks the callback.
Log a message to identify the device with no error_detected callback.
[bhelgaas: tweak log message] Link: https://lore.kernel.org/r/1576237474-32021-1-git-send-email-yangyicong@hisil... Signed-off-by: Yicong Yang yangyicong@hisilicon.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/err.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 2c3b5bd59b18..571ac3d9c326 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -69,10 +69,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) * error callbacks of "any" device in the subtree, and will * exit in the disconnected error state. */ - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { vote = PCI_ERS_RESULT_NO_AER_DRIVER; - else + pci_info(dev, "AER: Can't recover (no error_detected callback)\n"); + } else { vote = PCI_ERS_RESULT_NONE; + } } else { err_handler = dev->driver->err_handler; vote = err_handler->error_detected(dev, result_data->state);
From: Gustavo Pimentel Gustavo.Pimentel@synopsys.com
mainline inclusion from mainline-v5.3-rc1 commit de76cda215d56256ffcda7ffa538b70f9fb301a7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
PCIe r5.0, sec 7.5.3.18, defines a new 32.0 GT/s bit in the Supported Link Speeds Vector of Link Capabilities 2. Decode this new speed. This does not affect the speed of the link, which should be negotiated automatically by the hardware; it only adds decoding when showing the speed to the user.
Previously, reading the speed of a link operating at this speed showed "Unknown speed" instead of "32.0 GT/s".
Link: https://lore.kernel.org/lkml/92365e3caf0fc559f9ab14bcd053bfc92d4f661c.155966... Signed-off-by: Gustavo Pimentel gustavo.pimentel@synopsys.com [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci-sysfs.c | 3 +++ drivers/pci/pci.c | 4 +++- drivers/pci/probe.c | 2 +- drivers/pci/slot.c | 1 + include/linux/pci.h | 1 + include/uapi/linux/pci_regs.h | 4 ++++ 6 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index fcc997cbf843..e10fb16f8b13 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -182,6 +182,9 @@ static ssize_t current_link_speed_show(struct device *dev, return -EINVAL;
switch (linkstat & PCI_EXP_LNKSTA_CLS) { + case PCI_EXP_LNKSTA_CLS_32_0GB: + speed = "32 GT/s"; + break; case PCI_EXP_LNKSTA_CLS_16_0GB: speed = "16 GT/s"; break; diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 09a19caa32eb..ecf1d1a2e16b 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5584,7 +5584,9 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev) */ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2); if (lnkcap2) { /* PCIe r3.0-compliant */ - if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB) + if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_32_0GB) + return PCIE_SPEED_32_0GT; + else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB) return PCIE_SPEED_16_0GT; else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB) return PCIE_SPEED_8_0GT; diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 3731d17e1af1..695dd25a917f 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -667,7 +667,7 @@ const unsigned char pcie_link_speed[] = { PCIE_SPEED_5_0GT, /* 2 */ PCIE_SPEED_8_0GT, /* 3 */ PCIE_SPEED_16_0GT, /* 4 */ - PCI_SPEED_UNKNOWN, /* 5 */ + PCIE_SPEED_32_0GT, /* 5 */ PCI_SPEED_UNKNOWN, /* 6 */ PCI_SPEED_UNKNOWN, /* 7 */ PCI_SPEED_UNKNOWN, /* 8 */ diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c index dfbe9cbf292c..d575583e49c2 100644 --- a/drivers/pci/slot.c +++ b/drivers/pci/slot.c @@ -75,6 +75,7 @@ static const char *pci_bus_speed_strings[] = { "5.0 GT/s PCIe", /* 0x15 */ "8.0 GT/s PCIe", /* 0x16 */ "16.0 GT/s PCIe", /* 0x17 */ + "32.0 GT/s PCIe", /* 0x18 */ };
static ssize_t bus_speed_read(enum pci_bus_speed speed, char *buf) diff --git a/include/linux/pci.h b/include/linux/pci.h index 4988b58628a2..72382af281fb 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -259,6 +259,7 @@ enum pci_bus_speed { PCIE_SPEED_5_0GT = 0x15, PCIE_SPEED_8_0GT = 0x16, PCIE_SPEED_16_0GT = 0x17, + PCIE_SPEED_32_0GT = 0x18, PCI_SPEED_UNKNOWN = 0xff, };
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 6050817163a6..992f4d384ba1 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -524,6 +524,7 @@ #define PCI_EXP_LNKCAP_SLS_5_0GB 0x00000002 /* LNKCAP2 SLS Vector bit 1 */ #define PCI_EXP_LNKCAP_SLS_8_0GB 0x00000003 /* LNKCAP2 SLS Vector bit 2 */ #define PCI_EXP_LNKCAP_SLS_16_0GB 0x00000004 /* LNKCAP2 SLS Vector bit 3 */ +#define PCI_EXP_LNKCAP_SLS_32_0GB 0x00000005 /* LNKCAP2 SLS Vector bit 4 */ #define PCI_EXP_LNKCAP_MLW 0x000003f0 /* Maximum Link Width */ #define PCI_EXP_LNKCAP_ASPMS 0x00000c00 /* ASPM Support */ #define PCI_EXP_LNKCAP_L0SEL 0x00007000 /* L0s Exit Latency */ @@ -552,6 +553,7 @@ #define PCI_EXP_LNKSTA_CLS_5_0GB 0x0002 /* Current Link Speed 5.0GT/s */ #define PCI_EXP_LNKSTA_CLS_8_0GB 0x0003 /* Current Link Speed 8.0GT/s */ #define PCI_EXP_LNKSTA_CLS_16_0GB 0x0004 /* Current Link Speed 16.0GT/s */ +#define PCI_EXP_LNKSTA_CLS_32_0GB 0x0005 /* Current Link Speed 32.0GT/s */ #define PCI_EXP_LNKSTA_NLW 0x03f0 /* Negotiated Link Width */ #define PCI_EXP_LNKSTA_NLW_X1 0x0010 /* Current Link Width x1 */ #define PCI_EXP_LNKSTA_NLW_X2 0x0020 /* Current Link Width x2 */ @@ -657,6 +659,7 @@ #define PCI_EXP_LNKCAP2_SLS_5_0GB 0x00000004 /* Supported Speed 5GT/s */ #define PCI_EXP_LNKCAP2_SLS_8_0GB 0x00000008 /* Supported Speed 8GT/s */ #define PCI_EXP_LNKCAP2_SLS_16_0GB 0x00000010 /* Supported Speed 16GT/s */ +#define PCI_EXP_LNKCAP2_SLS_32_0GB 0x00000020 /* Supported Speed 32GT/s */ #define PCI_EXP_LNKCAP2_CROSSLINK 0x00000100 /* Crosslink supported */ #define PCI_EXP_LNKCTL2 48 /* Link Control 2 */ #define PCI_EXP_LNKCTL2_TLS 0x000f @@ -664,6 +667,7 @@ #define PCI_EXP_LNKCTL2_TLS_5_0GT 0x0002 /* Supported Speed 5GT/s */ #define PCI_EXP_LNKCTL2_TLS_8_0GT 0x0003 /* Supported Speed 8GT/s */ #define PCI_EXP_LNKCTL2_TLS_16_0GT 0x0004 /* Supported Speed 16GT/s */ +#define PCI_EXP_LNKCTL2_TLS_32_0GT 0x0005 /* Supported Speed 32GT/s */ #define PCI_EXP_LNKSTA2 50 /* Link Status 2 */ #define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 52 /* v2 endpoints with link end here */ #define PCI_EXP_SLTCAP2 52 /* Slot Capabilities 2 */
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.7-rc1 commit 9cb3985af63555810bb07de50acdf4170771451d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Link speed 32.0 GT/s is supported in PCIe r5.0. Add this speed to PCIE_SPEED2STR() and PCIE_SPEED2MBS_ENC() to correctly decode it.
This is complementary to de76cda215d5 ("PCI: Decode PCIe 32 GT/s link speed").
Link: https://lore.kernel.org/r/1581937984-40353-2-git-send-email-yangyicong@hisil... Signed-off-by: Yicong Yang yangyicong@hisilicon.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 76b34efa67d0..456bdc2b8831 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -255,7 +255,8 @@ void pci_disable_bridge_window(struct pci_dev *dev);
/* PCIe link information */ #define PCIE_SPEED2STR(speed) \ - ((speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \ + ((speed) == PCIE_SPEED_32_0GT ? "32 GT/s" : \ + (speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \ (speed) == PCIE_SPEED_8_0GT ? "8 GT/s" : \ (speed) == PCIE_SPEED_5_0GT ? "5 GT/s" : \ (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \ @@ -263,7 +264,8 @@ void pci_disable_bridge_window(struct pci_dev *dev);
/* PCIe speed to Mb/s reduced by encoding overhead */ #define PCIE_SPEED2MBS_ENC(speed) \ - ((speed) == PCIE_SPEED_16_0GT ? 16000*128/130 : \ + ((speed) == PCIE_SPEED_32_0GT ? 32000*128/130 : \ + (speed) == PCIE_SPEED_16_0GT ? 16000*128/130 : \ (speed) == PCIE_SPEED_8_0GT ? 8000*128/130 : \ (speed) == PCIE_SPEED_5_0GT ? 5000*8/10 : \ (speed) == PCIE_SPEED_2_5GT ? 2500*8/10 : \
From: Bjorn Helgaas bhelgaas@google.com
mainline inclusion from mainline-v5.7-rc1 commit e56faff57f0b39661093c00e0262d4ab9088830e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Add pci_speed_string() to return a text description of the supplied bus or link speed. The slot code previously used the private pci_bus_speed_strings[] array for this purpose, but adding this interface will enable us to consolidate similar code elsewhere.
Export pcie_link_speed[] and pci_speed_string() so they can be used by modules.
Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.h | 1 + drivers/pci/probe.c | 40 ++++++++++++++++++++++++++++++++++++++++ drivers/pci/slot.c | 38 +------------------------------------- include/linux/pci.h | 2 +- 4 files changed, 43 insertions(+), 38 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 456bdc2b8831..83576d906a1e 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -271,6 +271,7 @@ void pci_disable_bridge_window(struct pci_dev *dev); (speed) == PCIE_SPEED_2_5GT ? 2500*8/10 : \ 0)
+const char *pci_speed_string(enum pci_bus_speed speed); enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev); enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev); u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed, diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 695dd25a917f..8265fc4bc036 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -642,6 +642,7 @@ void pci_free_host_bridge(struct pci_host_bridge *bridge) } EXPORT_SYMBOL(pci_free_host_bridge);
+/* Indexed by PCI_X_SSTATUS_FREQ (secondary bus mode and frequency) */ static const unsigned char pcix_bus_speed[] = { PCI_SPEED_UNKNOWN, /* 0 */ PCI_SPEED_66MHz_PCIX, /* 1 */ @@ -661,6 +662,7 @@ static const unsigned char pcix_bus_speed[] = { PCI_SPEED_133MHz_PCIX_533 /* F */ };
+/* Indexed by PCI_EXP_LNKCAP_SLS, PCI_EXP_LNKSTA_CLS */ const unsigned char pcie_link_speed[] = { PCI_SPEED_UNKNOWN, /* 0 */ PCIE_SPEED_2_5GT, /* 1 */ @@ -679,6 +681,44 @@ const unsigned char pcie_link_speed[] = { PCI_SPEED_UNKNOWN, /* E */ PCI_SPEED_UNKNOWN /* F */ }; +EXPORT_SYMBOL_GPL(pcie_link_speed); + +const char *pci_speed_string(enum pci_bus_speed speed) +{ + /* Indexed by the pci_bus_speed enum */ + static const char *speed_strings[] = { + "33 MHz PCI", /* 0x00 */ + "66 MHz PCI", /* 0x01 */ + "66 MHz PCI-X", /* 0x02 */ + "100 MHz PCI-X", /* 0x03 */ + "133 MHz PCI-X", /* 0x04 */ + NULL, /* 0x05 */ + NULL, /* 0x06 */ + NULL, /* 0x07 */ + NULL, /* 0x08 */ + "66 MHz PCI-X 266", /* 0x09 */ + "100 MHz PCI-X 266", /* 0x0a */ + "133 MHz PCI-X 266", /* 0x0b */ + "Unknown AGP", /* 0x0c */ + "1x AGP", /* 0x0d */ + "2x AGP", /* 0x0e */ + "4x AGP", /* 0x0f */ + "8x AGP", /* 0x10 */ + "66 MHz PCI-X 533", /* 0x11 */ + "100 MHz PCI-X 533", /* 0x12 */ + "133 MHz PCI-X 533", /* 0x13 */ + "2.5 GT/s PCIe", /* 0x14 */ + "5.0 GT/s PCIe", /* 0x15 */ + "8.0 GT/s PCIe", /* 0x16 */ + "16.0 GT/s PCIe", /* 0x17 */ + "32.0 GT/s PCIe", /* 0x18 */ + }; + + if (speed < ARRAY_SIZE(speed_strings)) + return speed_strings[speed]; + return "Unknown"; +} +EXPORT_SYMBOL_GPL(pci_speed_string);
void pcie_update_link_speed(struct pci_bus *bus, u16 linksta) { diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c index d575583e49c2..7f6c8e732bad 100644 --- a/drivers/pci/slot.c +++ b/drivers/pci/slot.c @@ -49,45 +49,9 @@ static ssize_t address_read_file(struct pci_slot *slot, char *buf) slot->number); }
-/* these strings match up with the values in pci_bus_speed */ -static const char *pci_bus_speed_strings[] = { - "33 MHz PCI", /* 0x00 */ - "66 MHz PCI", /* 0x01 */ - "66 MHz PCI-X", /* 0x02 */ - "100 MHz PCI-X", /* 0x03 */ - "133 MHz PCI-X", /* 0x04 */ - NULL, /* 0x05 */ - NULL, /* 0x06 */ - NULL, /* 0x07 */ - NULL, /* 0x08 */ - "66 MHz PCI-X 266", /* 0x09 */ - "100 MHz PCI-X 266", /* 0x0a */ - "133 MHz PCI-X 266", /* 0x0b */ - "Unknown AGP", /* 0x0c */ - "1x AGP", /* 0x0d */ - "2x AGP", /* 0x0e */ - "4x AGP", /* 0x0f */ - "8x AGP", /* 0x10 */ - "66 MHz PCI-X 533", /* 0x11 */ - "100 MHz PCI-X 533", /* 0x12 */ - "133 MHz PCI-X 533", /* 0x13 */ - "2.5 GT/s PCIe", /* 0x14 */ - "5.0 GT/s PCIe", /* 0x15 */ - "8.0 GT/s PCIe", /* 0x16 */ - "16.0 GT/s PCIe", /* 0x17 */ - "32.0 GT/s PCIe", /* 0x18 */ -}; - static ssize_t bus_speed_read(enum pci_bus_speed speed, char *buf) { - const char *speed_string; - - if (speed < ARRAY_SIZE(pci_bus_speed_strings)) - speed_string = pci_bus_speed_strings[speed]; - else - speed_string = "Unknown"; - - return sprintf(buf, "%s\n", speed_string); + return sprintf(buf, "%s\n", pci_speed_string(speed)); }
static ssize_t max_speed_read_file(struct pci_slot *slot, char *buf) diff --git a/include/linux/pci.h b/include/linux/pci.h index 72382af281fb..596386c5b707 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -234,7 +234,7 @@ enum pcie_link_width { PCIE_LNK_WIDTH_UNKNOWN = 0xff, };
-/* Based on the PCI Hotplug Spec, but some values are made up by us */ +/* See matching string table in pci_speed_string() */ enum pci_bus_speed { PCI_SPEED_33MHz = 0x00, PCI_SPEED_66MHz = 0x01,
From: Bjorn Helgaas bhelgaas@google.com
mainline inclusion from mainline-v5.7-rc1 commit 6348a34dcb98d8e285685a205f2a601817fa2d38 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Previously some PCI speed strings came from pci_speed_string(), some came from the PCIe-specific PCIE_SPEED2STR(), and some came from a PCIe-specific switch statement. These methods were inconsistent:
pci_speed_string() PCIE_SPEED2STR() switch ------------------ ---------------- ------ 33 MHz PCI ... 2.5 GT/s PCIe 2.5 GT/s 2.5 GT/s 5.0 GT/s PCIe 5 GT/s 5 GT/s 8.0 GT/s PCIe 8 GT/s 8 GT/s 16.0 GT/s PCIe 16 GT/s 16 GT/s 32.0 GT/s PCIe 32 GT/s 32 GT/s
Standardize on pci_speed_string() as the single source of these strings.
Note that this adds ".0" and "PCIe" to some messages, including sysfs "max_link_speed" files, a brcmstb "link up" message, and the link status dmesg logging, e.g.,
nvme 0000:01:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link at 0000:00:01.1 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
I think it's better to standardize on a single version of the speed text. Previously we had strings like this:
/sys/bus/pci/slots/0/cur_bus_speed: 8.0 GT/s PCIe /sys/bus/pci/slots/0/max_bus_speed: 8.0 GT/s PCIe /sys/devices/pci0000:00/0000:00:1c.0/current_link_speed: 8 GT/s /sys/devices/pci0000:00/0000:00:1c.0/max_link_speed: 8 GT/s
This changes the latter two to match the slots files:
/sys/devices/pci0000:00/0000:00:1c.0/current_link_speed: 8.0 GT/s PCIe /sys/devices/pci0000:00/0000:00:1c.0/max_link_speed: 8.0 GT/s PCIe
Based-on-patch by: Yicong Yang yangyicong@hisilicon.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci-sysfs.c | 27 +++++---------------------- drivers/pci/pci.c | 6 +++--- drivers/pci/pci.h | 9 --------- 3 files changed, 8 insertions(+), 34 deletions(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index e10fb16f8b13..a62349a44c8c 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -156,7 +156,8 @@ static ssize_t max_link_speed_show(struct device *dev, { struct pci_dev *pdev = to_pci_dev(dev);
- return sprintf(buf, "%s\n", PCIE_SPEED2STR(pcie_get_speed_cap(pdev))); + return sprintf(buf, "%s\n", + pci_speed_string(pcie_get_speed_cap(pdev))); } static DEVICE_ATTR_RO(max_link_speed);
@@ -175,33 +176,15 @@ static ssize_t current_link_speed_show(struct device *dev, struct pci_dev *pci_dev = to_pci_dev(dev); u16 linkstat; int err; - const char *speed; + enum pci_bus_speed speed;
err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat); if (err) return -EINVAL;
- switch (linkstat & PCI_EXP_LNKSTA_CLS) { - case PCI_EXP_LNKSTA_CLS_32_0GB: - speed = "32 GT/s"; - break; - case PCI_EXP_LNKSTA_CLS_16_0GB: - speed = "16 GT/s"; - break; - case PCI_EXP_LNKSTA_CLS_8_0GB: - speed = "8 GT/s"; - break; - case PCI_EXP_LNKSTA_CLS_5_0GB: - speed = "5 GT/s"; - break; - case PCI_EXP_LNKSTA_CLS_2_5GB: - speed = "2.5 GT/s"; - break; - default: - speed = "Unknown speed"; - } + speed = pcie_link_speed[linkstat & PCI_EXP_LNKSTA_CLS];
- return sprintf(buf, "%s\n", speed); + return sprintf(buf, "%s\n", pci_speed_string(speed)); } static DEVICE_ATTR_RO(current_link_speed);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index ecf1d1a2e16b..056b8b10e675 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5671,14 +5671,14 @@ void __pcie_print_link_status(struct pci_dev *dev, bool verbose) if (bw_avail >= bw_cap && verbose) pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n", bw_cap / 1000, bw_cap % 1000, - PCIE_SPEED2STR(speed_cap), width_cap); + pci_speed_string(speed_cap), width_cap); else if (bw_avail < bw_cap) pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n", bw_avail / 1000, bw_avail % 1000, - PCIE_SPEED2STR(speed), width, + pci_speed_string(speed), width, limiting_dev ? pci_name(limiting_dev) : "<unknown>", bw_cap / 1000, bw_cap % 1000, - PCIE_SPEED2STR(speed_cap), width_cap); + pci_speed_string(speed_cap), width_cap); }
/** diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 83576d906a1e..a65bca8e4d12 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -253,15 +253,6 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx); void pci_reassigndev_resource_alignment(struct pci_dev *dev); void pci_disable_bridge_window(struct pci_dev *dev);
-/* PCIe link information */ -#define PCIE_SPEED2STR(speed) \ - ((speed) == PCIE_SPEED_32_0GT ? "32 GT/s" : \ - (speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \ - (speed) == PCIE_SPEED_8_0GT ? "8 GT/s" : \ - (speed) == PCIE_SPEED_5_0GT ? "5 GT/s" : \ - (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \ - "Unknown speed") - /* PCIe speed to Mb/s reduced by encoding overhead */ #define PCIE_SPEED2MBS_ENC(speed) \ ((speed) == PCIE_SPEED_32_0GT ? 32000*128/130 : \
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.7-rc1 commit 757bfaa2c3515803dde9a6728bbf8c8a3c5f098a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Add PCIE_LNKCAP2_SLS2SPEED macro for transforming raw Link Capabilities 2 values to the pci_bus_speed. This is next to PCIE_SPEED2MBS_ENC() to make it easier to update both places when adding support for new speeds.
Link: https://lore.kernel.org/r/1581937984-40353-10-git-send-email-yangyicong@hisi... Signed-off-by: Yicong Yang yangyicong@hisilicon.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.c | 17 ++++------------- drivers/pci/pci.h | 9 +++++++++ 2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 056b8b10e675..3f82033dc8a0 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5583,19 +5583,10 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev) * where only 2.5 GT/s and 5.0 GT/s speeds were defined. */ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2); - if (lnkcap2) { /* PCIe r3.0-compliant */ - if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_32_0GB) - return PCIE_SPEED_32_0GT; - else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB) - return PCIE_SPEED_16_0GT; - else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB) - return PCIE_SPEED_8_0GT; - else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_5_0GB) - return PCIE_SPEED_5_0GT; - else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_2_5GB) - return PCIE_SPEED_2_5GT; - return PCI_SPEED_UNKNOWN; - } + + /* PCIe r3.0-compliant */ + if (lnkcap2) + return PCIE_LNKCAP2_SLS2SPEED(lnkcap2);
pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap); if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index a65bca8e4d12..a10629ba980e 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -253,6 +253,15 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx); void pci_reassigndev_resource_alignment(struct pci_dev *dev); void pci_disable_bridge_window(struct pci_dev *dev);
+/* PCIe link information from Link Capabilities 2 */ +#define PCIE_LNKCAP2_SLS2SPEED(lnkcap2) \ + ((lnkcap2) & PCI_EXP_LNKCAP2_SLS_32_0GB ? PCIE_SPEED_32_0GT : \ + (lnkcap2) & PCI_EXP_LNKCAP2_SLS_16_0GB ? PCIE_SPEED_16_0GT : \ + (lnkcap2) & PCI_EXP_LNKCAP2_SLS_8_0GB ? PCIE_SPEED_8_0GT : \ + (lnkcap2) & PCI_EXP_LNKCAP2_SLS_5_0GB ? PCIE_SPEED_5_0GT : \ + (lnkcap2) & PCI_EXP_LNKCAP2_SLS_2_5GB ? PCIE_SPEED_2_5GT : \ + PCI_SPEED_UNKNOWN) + /* PCIe speed to Mb/s reduced by encoding overhead */ #define PCIE_SPEED2MBS_ENC(speed) \ ((speed) == PCIE_SPEED_32_0GT ? 32000*128/130 : \
From: Jay Fang f.fangjian@huawei.com
mainline inclusion from mainline-v5.8-rc1 commit 5dda3ba6fc9c5d784b48687a3f3003023a0d7c74 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Fix kernel-doc of the "srv" parameter to pcie_pme_resume() and pcie_pme_remove(). Building with W=1 produced these warnings:
drivers/pci/pcie/pme.c:414: warning: Function parameter or member 'srv' not described in 'pcie_pme_resume' drivers/pci/pcie/pme.c:437: warning: Function parameter or member 'srv' not described in 'pcie_pme_remove'
Link: https://lore.kernel.org/r/1589612414-61682-1-git-send-email-f.fangjian@huawe... Signed-off-by: Jay Fang f.fangjian@huawei.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: liyouhong liyouhong@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/pme.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/pme.c b/drivers/pci/pcie/pme.c index 54d593d10396..94479ec041c2 100644 --- a/drivers/pci/pcie/pme.c +++ b/drivers/pci/pcie/pme.c @@ -406,7 +406,7 @@ static int pcie_pme_suspend(struct pcie_device *srv)
/** * pcie_pme_resume - Resume PCIe PME service device. - * @srv - PCIe service device to resume. + * @srv: PCIe service device to resume. */ static int pcie_pme_resume(struct pcie_device *srv) { @@ -429,7 +429,7 @@ static int pcie_pme_resume(struct pcie_device *srv)
/** * pcie_pme_remove - Prepare PCIe PME service device for removal. - * @srv - PCIe service device to remove. + * @srv: PCIe service device to remove. */ static void pcie_pme_remove(struct pcie_device *srv) {
From: Peter Zijlstra peterz@infradead.org
mainline inclusion from mainline-v5.2-rc1 commit 63b79f6ebc464afb730bc45762c820795e276da1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Icelake extended the general counters to 8, even when SMT is enabled. However only a (large) subset of the events can be used on all 8 counters.
The events that can or cannot be used on all counters are organized in ranges.
A lot of scheduler constraints are required to handle all this.
To avoid blowing up the tables add event code ranges to the constraint tables, and a new inline function to match them.
Originally-by: Andi Kleen ak@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org # developer hat on Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org # maintainer hat on Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-8-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 2 +- arch/x86/events/intel/ds.c | 2 +- arch/x86/events/perf_event.h | 44 +++++++++++++++++++++++++++++++----- 3 files changed, 40 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 2dd8b0d64295..710c24846df2 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2573,7 +2573,7 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
if (x86_pmu.event_constraints) { for_each_event_constraint(c, x86_pmu.event_constraints) { - if ((event->hw.config & c->cmask) == c->code) { + if (constraint_match(c, event->hw.config)) { event->hw.flags |= c->flags; return c; } diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index ad57183333f6..462dab3c0819 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -858,7 +858,7 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
if (x86_pmu.pebs_constraints) { for_each_event_constraint(c, x86_pmu.pebs_constraints) { - if ((event->hw.config & c->cmask) == c->code) { + if (constraint_match(c, event->hw.config)) { event->hw.flags |= c->flags; return c; } diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index dd24cac3d5e5..eebd9dbebcf9 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -49,12 +49,19 @@ struct event_constraint { unsigned long idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; u64 idxmsk64; }; - u64 code; - u64 cmask; - int weight; - int overlap; - int flags; + u64 code; + u64 cmask; + int weight; + int overlap; + int flags; + unsigned int size; }; + +static inline bool constraint_match(struct event_constraint *c, u64 ecode) +{ + return ((ecode & c->cmask) - c->code) <= (u64)c->size; +} + /* * struct hw_perf_event.flags flags */ @@ -257,18 +264,29 @@ struct cpu_hw_events { void *kfree_on_online[X86_PERF_KFREE_MAX]; };
-#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\ +#define __EVENT_CONSTRAINT_RANGE(c, e, n, m, w, o, f) { \ { .idxmsk64 = (n) }, \ .code = (c), \ + .size = (e) - (c), \ .cmask = (m), \ .weight = (w), \ .overlap = (o), \ .flags = f, \ }
+#define __EVENT_CONSTRAINT(c, n, m, w, o, f) \ + __EVENT_CONSTRAINT_RANGE(c, c, n, m, w, o, f) + #define EVENT_CONSTRAINT(c, n, m) \ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)
+/* + * The constraint_match() function only works for 'simple' event codes + * and not for extended (AMD64_EVENTSEL_EVENT) events codes. + */ +#define EVENT_CONSTRAINT_RANGE(c, e, n, m) \ + __EVENT_CONSTRAINT_RANGE(c, e, n, m, HWEIGHT(n), 0, 0) + #define INTEL_EXCLEVT_CONSTRAINT(c, n) \ __EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\ 0, PERF_X86_EVENT_EXCL) @@ -303,6 +321,12 @@ struct cpu_hw_events { #define INTEL_EVENT_CONSTRAINT(c, n) \ EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT)
+/* + * Constraint on a range of Event codes + */ +#define INTEL_EVENT_CONSTRAINT_RANGE(c, e, n) \ + EVENT_CONSTRAINT_RANGE(c, e, n, ARCH_PERFMON_EVENTSEL_EVENT) + /* * Constraint on the Event code + UMask + fixed-mask * @@ -350,6 +374,9 @@ struct cpu_hw_events { #define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \ EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
+#define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n) \ + EVENT_CONSTRAINT_RANGE(c, e, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS) + /* Check only flags, but allow all event/umask */ #define INTEL_ALL_EVENT_CONSTRAINT(code, n) \ EVENT_CONSTRAINT(code, n, X86_ALL_EVENT_FLAGS) @@ -366,6 +393,11 @@ struct cpu_hw_events { ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \ HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)
+#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(code, end, n) \ + __EVENT_CONSTRAINT_RANGE(code, end, n, \ + ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \ + HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW) + #define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(code, n) \ __EVENT_CONSTRAINT(code, n, \ ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 6017608936c1825ff5d7325270484042f597edff category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Add Icelake core PMU perf code, including constraint tables and the main enable code.
Icelake expanded the generic counters to always 8 even with HT on, but a range of events cannot be scheduled on the extra 4 counters. Add new constraint ranges to describe this to the scheduler. The number of constraints that need to be checked is larger now than with earlier CPUs. At some point we may need a new data structure to look them up more efficiently than with linear search. So far it still seems to be acceptable however.
Icelake added a new fixed counter SLOTS. Full support for it is added later in the patch series.
The cache events table is identical to Skylake.
Compare to PEBS instruction event on generic counter, fixed counter 0 has less skid. Force instruction:ppp always in fixed counter 0.
Originally-by: Andi Kleen ak@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-9-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 110 ++++++++++++++++++++++++++++++ arch/x86/events/intel/ds.c | 25 ++++++- arch/x86/events/perf_event.h | 2 + arch/x86/include/asm/intel_ds.h | 2 +- arch/x86/include/asm/perf_event.h | 2 +- 5 files changed, 137 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 710c24846df2..63a140bc3add 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -238,6 +238,35 @@ static struct extra_reg intel_skl_extra_regs[] __read_mostly = { EVENT_EXTRA_END };
+static struct event_constraint intel_icl_event_constraints[] = { + FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */ + INTEL_UEVENT_CONSTRAINT(0x1c0, 0), /* INST_RETIRED.PREC_DIST */ + FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */ + FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */ + FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */ + INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf), + INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf), + INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */ + INTEL_EVENT_CONSTRAINT_RANGE(0x48, 0x54, 0xf), + INTEL_EVENT_CONSTRAINT_RANGE(0x60, 0x8b, 0xf), + INTEL_UEVENT_CONSTRAINT(0x04a3, 0xff), /* CYCLE_ACTIVITY.STALLS_TOTAL */ + INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff), /* CYCLE_ACTIVITY.STALLS_MEM_ANY */ + INTEL_EVENT_CONSTRAINT(0xa3, 0xf), /* CYCLE_ACTIVITY.* */ + INTEL_EVENT_CONSTRAINT_RANGE(0xa8, 0xb0, 0xf), + INTEL_EVENT_CONSTRAINT_RANGE(0xb7, 0xbd, 0xf), + INTEL_EVENT_CONSTRAINT_RANGE(0xd0, 0xe6, 0xf), + INTEL_EVENT_CONSTRAINT_RANGE(0xf0, 0xf4, 0xf), + EVENT_CONSTRAINT_END +}; + +static struct extra_reg intel_icl_extra_regs[] __read_mostly = { + INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff9fffull, RSP_0), + INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1), + INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd), + INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE), + EVENT_EXTRA_END +}; + EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3"); EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3"); EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2"); @@ -3240,6 +3269,9 @@ static struct event_constraint counter0_constraint = static struct event_constraint counter2_constraint = EVENT_CONSTRAINT(0, 0x4, 0);
+static struct event_constraint fixed0_constraint = + FIXED_EVENT_CONSTRAINT(0x00c0, 0); + static struct event_constraint * hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx, struct perf_event *event) @@ -3258,6 +3290,21 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx, return c; }
+static struct event_constraint * +icl_get_event_constraints(struct cpu_hw_events *cpuc, int idx, + struct perf_event *event) +{ + /* + * Fixed counter 0 has less skid. + * Force instruction:ppp in Fixed counter 0 + */ + if ((event->attr.precise_ip == 3) && + constraint_match(&fixed0_constraint, event->hw.config)) + return &fixed0_constraint; + + return hsw_get_event_constraints(cpuc, idx, event); +} + static struct event_constraint * glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx, struct perf_event *event) @@ -3924,6 +3971,42 @@ static struct attribute *hsw_tsx_events_attrs[] = { NULL };
+EVENT_ATTR_STR(tx-capacity-read, tx_capacity_read, "event=0x54,umask=0x80"); +EVENT_ATTR_STR(tx-capacity-write, tx_capacity_write, "event=0x54,umask=0x2"); +EVENT_ATTR_STR(el-capacity-read, el_capacity_read, "event=0x54,umask=0x80"); +EVENT_ATTR_STR(el-capacity-write, el_capacity_write, "event=0x54,umask=0x2"); + +static struct attribute *icl_events_attrs[] = { + EVENT_PTR(mem_ld_hsw), + EVENT_PTR(mem_st_hsw), + NULL, +}; + +static struct attribute *icl_tsx_events_attrs[] = { + EVENT_PTR(tx_start), + EVENT_PTR(tx_abort), + EVENT_PTR(tx_commit), + EVENT_PTR(tx_capacity_read), + EVENT_PTR(tx_capacity_write), + EVENT_PTR(tx_conflict), + EVENT_PTR(el_start), + EVENT_PTR(el_abort), + EVENT_PTR(el_commit), + EVENT_PTR(el_capacity_read), + EVENT_PTR(el_capacity_write), + EVENT_PTR(el_conflict), + EVENT_PTR(cycles_t), + EVENT_PTR(cycles_ct), + NULL, +}; + +static __init struct attribute **get_icl_events_attrs(void) +{ + return boot_cpu_has(X86_FEATURE_RTM) ? + merge_attr(icl_events_attrs, icl_tsx_events_attrs) : + icl_events_attrs; +} + static __init struct attribute **get_hsw_events_attrs(void) { return boot_cpu_has(X86_FEATURE_RTM) ? @@ -4478,6 +4561,33 @@ __init int intel_pmu_init(void) name = "skylake"; break;
+ case INTEL_FAM6_ICELAKE_MOBILE: + x86_pmu.late_ack = true; + memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids)); + memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs)); + hw_cache_event_ids[C(ITLB)][C(OP_READ)][C(RESULT_ACCESS)] = -1; + intel_pmu_lbr_init_skl(); + + x86_pmu.event_constraints = intel_icl_event_constraints; + x86_pmu.pebs_constraints = intel_icl_pebs_event_constraints; + x86_pmu.extra_regs = intel_icl_extra_regs; + x86_pmu.pebs_aliases = NULL; + x86_pmu.pebs_prec_dist = true; + x86_pmu.flags |= PMU_FL_HAS_RSP_1; + x86_pmu.flags |= PMU_FL_NO_HT_SHARING; + + x86_pmu.hw_config = hsw_hw_config; + x86_pmu.get_event_constraints = icl_get_event_constraints; + extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? + hsw_format_attr : nhm_format_attr; + extra_attr = merge_attr(extra_attr, skl_format_attr); + x86_pmu.cpu_events = get_icl_events_attrs(); + x86_pmu.lbr_pt_coexist = true; + intel_pmu_pebs_data_source_skl(false); + pr_cont("Icelake events, "); + name = "icelake"; + break; + default: switch (x86_pmu.version) { case 1: diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 462dab3c0819..63ebe2caef73 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -849,6 +849,26 @@ struct event_constraint intel_skl_pebs_event_constraints[] = { EVENT_CONSTRAINT_END };
+struct event_constraint intel_icl_pebs_event_constraints[] = { + INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x100000000ULL), /* INST_RETIRED.PREC_DIST */ + INTEL_FLAGS_UEVENT_CONSTRAINT(0x0400, 0x400000000ULL), /* SLOTS */ + + INTEL_PLD_CONSTRAINT(0x1cd, 0xff), /* MEM_TRANS_RETIRED.LOAD_LATENCY */ + INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x1d0, 0xf), /* MEM_INST_RETIRED.LOAD */ + INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x2d0, 0xf), /* MEM_INST_RETIRED.STORE */ + + INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(0xd1, 0xd4, 0xf), /* MEM_LOAD_*_RETIRED.* */ + + INTEL_FLAGS_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_INST_RETIRED.* */ + + /* + * Everything else is handled by PMU_FL_PEBS_ALL, because we + * need the full constraints from the main table. + */ + + EVENT_CONSTRAINT_END +}; + struct event_constraint *intel_pebs_constraints(struct perf_event *event) { struct event_constraint *c; @@ -960,7 +980,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << hwc->idx;
- if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) + if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) && (x86_pmu.version < 5)) cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32); else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST) cpuc->pebs_enabled |= 1ULL << 63; @@ -1004,7 +1024,8 @@ void intel_pmu_pebs_disable(struct perf_event *event)
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
- if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) + if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) && + (x86_pmu.version < 5)) cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32)); else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST) cpuc->pebs_enabled &= ~(1ULL << 63); diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index eebd9dbebcf9..f7e66a6ea4b1 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -973,6 +973,8 @@ extern struct event_constraint intel_bdw_pebs_event_constraints[];
extern struct event_constraint intel_skl_pebs_event_constraints[];
+extern struct event_constraint intel_icl_pebs_event_constraints[]; + struct event_constraint *intel_pebs_constraints(struct perf_event *event);
void intel_pmu_pebs_add(struct perf_event *event); diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h index ae26df1c2789..8380c3ddd4b2 100644 --- a/arch/x86/include/asm/intel_ds.h +++ b/arch/x86/include/asm/intel_ds.h @@ -8,7 +8,7 @@
/* The maximal number of PEBS events: */ #define MAX_PEBS_EVENTS 8 -#define MAX_FIXED_PEBS_EVENTS 3 +#define MAX_FIXED_PEBS_EVENTS 4
/* * A debug store configuration. diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 620c3e3564be..7a5ca63a3cca 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -7,7 +7,7 @@ */
#define INTEL_PMC_MAX_GENERIC 32 -#define INTEL_PMC_MAX_FIXED 3 +#define INTEL_PMC_MAX_FIXED 4 #define INTEL_PMC_IDX_FIXED 32
#define X86_PMC_IDX_MAX 64
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 6daeb8737f8a93c6d3a3ae57e23dd3dbe8b239da category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Add perf core PMU support for Intel Tremont CPU.
The init code is based on Goldmont plus.
The generic purpose counter 0 and fixed counter 0 have less skid. Force :ppp events on generic purpose counter 0. Force instruction:ppp on generic purpose counter 0 and fixed counter 0.
Updates LLC cache event table and OFFCORE_RESPONSE mask.
Adaptive PEBS, which is already enabled on ICL, is also supported on Tremont. No extra code required.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/1554922629-126287-3-git-send-email-kan.liang@linux... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 91 ++++++++++++++++++++++++++++++++++++ 1 file changed, 91 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 63a140bc3add..012b89a113c0 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -1851,6 +1851,45 @@ static __initconst const u64 glp_hw_cache_extra_regs }, };
+#define TNT_LOCAL_DRAM BIT_ULL(26) +#define TNT_DEMAND_READ GLM_DEMAND_DATA_RD +#define TNT_DEMAND_WRITE GLM_DEMAND_RFO +#define TNT_LLC_ACCESS GLM_ANY_RESPONSE +#define TNT_SNP_ANY (SNB_SNP_NOT_NEEDED|SNB_SNP_MISS| \ + SNB_NO_FWD|SNB_SNP_FWD|SNB_HITM) +#define TNT_LLC_MISS (TNT_SNP_ANY|SNB_NON_DRAM|TNT_LOCAL_DRAM) + +static __initconst const u64 tnt_hw_cache_extra_regs + [PERF_COUNT_HW_CACHE_MAX] + [PERF_COUNT_HW_CACHE_OP_MAX] + [PERF_COUNT_HW_CACHE_RESULT_MAX] = { + [C(LL)] = { + [C(OP_READ)] = { + [C(RESULT_ACCESS)] = TNT_DEMAND_READ| + TNT_LLC_ACCESS, + [C(RESULT_MISS)] = TNT_DEMAND_READ| + TNT_LLC_MISS, + }, + [C(OP_WRITE)] = { + [C(RESULT_ACCESS)] = TNT_DEMAND_WRITE| + TNT_LLC_ACCESS, + [C(RESULT_MISS)] = TNT_DEMAND_WRITE| + TNT_LLC_MISS, + }, + [C(OP_PREFETCH)] = { + [C(RESULT_ACCESS)] = 0x0, + [C(RESULT_MISS)] = 0x0, + }, + }, +}; + +static struct extra_reg intel_tnt_extra_regs[] __read_mostly = { + /* must define OFFCORE_RSP_X first, see intel_fixup_er() */ + INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0xffffff9fffull, RSP_0), + INTEL_UEVENT_EXTRA_REG(0x02b7, MSR_OFFCORE_RSP_1, 0xffffff9fffull, RSP_1), + EVENT_EXTRA_END +}; + #define KNL_OT_L2_HITE BIT_ULL(19) /* Other Tile L2 Hit */ #define KNL_OT_L2_HITF BIT_ULL(20) /* Other Tile L2 Hit */ #define KNL_MCDRAM_LOCAL BIT_ULL(21) @@ -3272,6 +3311,9 @@ static struct event_constraint counter2_constraint = static struct event_constraint fixed0_constraint = FIXED_EVENT_CONSTRAINT(0x00c0, 0);
+static struct event_constraint fixed0_counter0_constraint = + INTEL_ALL_EVENT_CONSTRAINT(0, 0x100000001ULL); + static struct event_constraint * hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx, struct perf_event *event) @@ -3320,6 +3362,29 @@ glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx, return c; }
+static struct event_constraint * +tnt_get_event_constraints(struct cpu_hw_events *cpuc, int idx, + struct perf_event *event) +{ + struct event_constraint *c; + + /* + * :ppp means to do reduced skid PEBS, + * which is available on PMC0 and fixed counter 0. + */ + if (event->attr.precise_ip == 3) { + /* Force instruction:ppp on PMC0 and Fixed counter 0 */ + if (constraint_match(&fixed0_constraint, event->hw.config)) + return &fixed0_counter0_constraint; + + return &counter0_constraint; + } + + c = intel_get_event_constraints(cpuc, idx, event); + + return c; +} + static bool allow_tsx_force_abort = true;
static struct event_constraint * @@ -4317,6 +4382,32 @@ __init int intel_pmu_init(void) name = "goldmont_plus"; break;
+ case INTEL_FAM6_ATOM_TREMONT_X: + x86_pmu.late_ack = true; + memcpy(hw_cache_event_ids, glp_hw_cache_event_ids, + sizeof(hw_cache_event_ids)); + memcpy(hw_cache_extra_regs, tnt_hw_cache_extra_regs, + sizeof(hw_cache_extra_regs)); + hw_cache_event_ids[C(ITLB)][C(OP_READ)][C(RESULT_ACCESS)] = -1; + + intel_pmu_lbr_init_skl(); + + x86_pmu.event_constraints = intel_slm_event_constraints; + x86_pmu.extra_regs = intel_tnt_extra_regs; + /* + * It's recommended to use CPU_CLK_UNHALTED.CORE_P + NPEBS + * for precise cycles. + */ + x86_pmu.pebs_aliases = NULL; + x86_pmu.pebs_prec_dist = true; + x86_pmu.lbr_pt_coexist = true; + x86_pmu.flags |= PMU_FL_HAS_RSP_1; + x86_pmu.get_event_constraints = tnt_get_event_constraints; + extra_attr = slm_format_attr; + pr_cont("Tremont events, "); + name = "Tremont"; + break; + case INTEL_FAM6_WESTMERE: case INTEL_FAM6_WESTMERE_EP: case INTEL_FAM6_WESTMERE_EX:
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 4d75873f814055359bb6722c4e35a185d02157a8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 4d75873f814055359bb6722c4e35a185d02157a8 upsream.
Add Snowridge Xeon-D ioatdma PCI device id. Also applies for Icelake SP Xeon. This introduces ioatdma v3.4 platform. Also bumping driver version to 5.0 since we are adding additional code for 3.4 support.
Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Vinod Koul vkoul@kernel.org Signed-off-by: Lin Wang lin.x.wang@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dma/ioat/dma.h | 2 +- drivers/dma/ioat/hw.h | 2 ++ drivers/dma/ioat/init.c | 3 +++ 3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h index 1ab42ec2b7ff..aaafd0e882b5 100644 --- a/drivers/dma/ioat/dma.h +++ b/drivers/dma/ioat/dma.h @@ -27,7 +27,7 @@ #include "registers.h" #include "hw.h"
-#define IOAT_DMA_VERSION "4.00" +#define IOAT_DMA_VERSION "5.00"
#define IOAT_DMA_DCA_ANY_CPU ~0
diff --git a/drivers/dma/ioat/hw.h b/drivers/dma/ioat/hw.h index abcc51b343ce..3e38877cae6b 100644 --- a/drivers/dma/ioat/hw.h +++ b/drivers/dma/ioat/hw.h @@ -66,6 +66,8 @@
#define PCI_DEVICE_ID_INTEL_IOAT_SKX 0x2021
+#define PCI_DEVICE_ID_INTEL_IOAT_ICX 0x0b00 + #define IOAT_VER_1_2 0x12 /* Version 1.2 */ #define IOAT_VER_2_0 0x20 /* Version 2.0 */ #define IOAT_VER_3_0 0x30 /* Version 3.0 */ diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c index 0fec3c554fe3..ec5d0ac295d9 100644 --- a/drivers/dma/ioat/init.c +++ b/drivers/dma/ioat/init.c @@ -119,6 +119,9 @@ static const struct pci_device_id ioat_pci_tbl[] = { { PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IOAT_BDXDE2) }, { PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IOAT_BDXDE3) },
+ /* I/OAT v3.4 platforms */ + { PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_IOAT_ICX) }, + { 0, } }; MODULE_DEVICE_TABLE(pci, ioat_pci_tbl);
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 11e31e281bd8f482a9277268f7b0d9c213584271 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 11e31e281bd8f482a9277268f7b0d9c213584271 upstream.
IOATDMA v3.4 does not support DCA. Disable
Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Vinod Koul vkoul@kernel.org Signed-off-by: Lin Wang lin.x.wang@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dma/ioat/hw.h | 1 + drivers/dma/ioat/init.c | 2 ++ 2 files changed, 3 insertions(+)
diff --git a/drivers/dma/ioat/hw.h b/drivers/dma/ioat/hw.h index 3e38877cae6b..781c94de8e81 100644 --- a/drivers/dma/ioat/hw.h +++ b/drivers/dma/ioat/hw.h @@ -73,6 +73,7 @@ #define IOAT_VER_3_0 0x30 /* Version 3.0 */ #define IOAT_VER_3_2 0x32 /* Version 3.2 */ #define IOAT_VER_3_3 0x33 /* Version 3.3 */ +#define IOAT_VER_3_4 0x34 /* Version 3.4 */
int system_has_dca_enabled(struct pci_dev *pdev); diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c index ec5d0ac295d9..9f9e9488fd18 100644 --- a/drivers/dma/ioat/init.c +++ b/drivers/dma/ioat/init.c @@ -1360,6 +1360,8 @@ static int ioat_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) pci_set_drvdata(pdev, device);
device->version = readb(device->reg_base + IOAT_VER_OFFSET); + if (device->version >= IOAT_VER_3_4) + ioat_dca_enabled = 0; if (device->version >= IOAT_VER_3_0) { if (is_skx_ioat(pdev)) device->version = IOAT_VER_3_2;
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.1-rc1 commit e0100d40906d5dbe6d09d31083c1a5aaccc947fa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e0100d40906d5dbe6d09d31083c1a5aaccc947fa upstream.
Adding support for new feature on ioatdma 3.4 hardware that provides descriptor pre-fetching in order to reduce small DMA latencies.
Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Vinod Koul vkoul@kernel.org Signed-off-by: Lin Wang lin.x.wang@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dma/ioat/dma.c | 12 ++++++++++++ drivers/dma/ioat/init.c | 8 ++++++-- drivers/dma/ioat/registers.h | 10 ++++++++++ 3 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/dma/ioat/dma.c b/drivers/dma/ioat/dma.c index 23fb2fa04000..f373a139e0c3 100644 --- a/drivers/dma/ioat/dma.c +++ b/drivers/dma/ioat/dma.c @@ -372,6 +372,7 @@ struct ioat_ring_ent ** ioat_alloc_ring(struct dma_chan *c, int order, gfp_t flags) { struct ioatdma_chan *ioat_chan = to_ioat_chan(c); + struct ioatdma_device *ioat_dma = ioat_chan->ioat_dma; struct ioat_ring_ent **ring; int total_descs = 1 << order; int i, chunks; @@ -437,6 +438,17 @@ ioat_alloc_ring(struct dma_chan *c, int order, gfp_t flags) } ring[i]->hw->next = ring[0]->txd.phys;
+ /* setup descriptor pre-fetching for v3.4 */ + if (ioat_dma->cap & IOAT_CAP_DPS) { + u16 drsctl = IOAT_CHAN_DRSZ_2MB | IOAT_CHAN_DRS_EN; + + if (chunks == 1) + drsctl |= IOAT_CHAN_DRS_AUTOWRAP; + + writew(drsctl, ioat_chan->reg_base + IOAT_CHAN_DRSCTL_OFFSET); + + } + return ring; }
diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c index 9f9e9488fd18..aaa904c08802 100644 --- a/drivers/dma/ioat/init.c +++ b/drivers/dma/ioat/init.c @@ -138,10 +138,10 @@ static int ioat3_dma_self_test(struct ioatdma_device *ioat_dma); static int ioat_dca_enabled = 1; module_param(ioat_dca_enabled, int, 0644); MODULE_PARM_DESC(ioat_dca_enabled, "control support of dca service (default: 1)"); -int ioat_pending_level = 4; +int ioat_pending_level = 7; module_param(ioat_pending_level, int, 0644); MODULE_PARM_DESC(ioat_pending_level, - "high-water mark for pushing ioat descriptors (default: 4)"); + "high-water mark for pushing ioat descriptors (default: 7)"); static char ioat_interrupt_style[32] = "msix"; module_param_string(ioat_interrupt_style, ioat_interrupt_style, sizeof(ioat_interrupt_style), 0644); @@ -1188,6 +1188,10 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca) if (err) return err;
+ if (ioat_dma->cap & IOAT_CAP_DPS) + writeb(ioat_pending_level + 1, + ioat_dma->reg_base + IOAT_PREFETCH_LIMIT_OFFSET); + return 0; }
diff --git a/drivers/dma/ioat/registers.h b/drivers/dma/ioat/registers.h index 2f3bbc88ff2a..2b517d6db5fd 100644 --- a/drivers/dma/ioat/registers.h +++ b/drivers/dma/ioat/registers.h @@ -84,6 +84,9 @@ #define IOAT_CAP_PQ 0x00000200 #define IOAT_CAP_DWBES 0x00002000 #define IOAT_CAP_RAID16SS 0x00020000 +#define IOAT_CAP_DPS 0x00800000 + +#define IOAT_PREFETCH_LIMIT_OFFSET 0x4C /* CHWPREFLMT */
#define IOAT_CHANNEL_MMIO_SIZE 0x80 /* Each Channel MMIO space is this size */
@@ -243,4 +246,11 @@
#define IOAT_CHANERR_MASK_OFFSET 0x2C /* 32-bit Channel Error Register */
+#define IOAT_CHAN_DRSCTL_OFFSET 0xB6 +#define IOAT_CHAN_DRSZ_4KB 0x0000 +#define IOAT_CHAN_DRSZ_8KB 0x0001 +#define IOAT_CHAN_DRSZ_2MB 0x0009 +#define IOAT_CHAN_DRS_EN 0x0100 +#define IOAT_CHAN_DRS_AUTOWRAP 0x0200 + #endif /* _IOAT_REGISTERS_H_ */
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 528314b503f855b268ae7861ea4e206fbbfb8356 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 528314b503f855b268ae7861ea4e206fbbfb8356 upstream.
IOATDMA 3.4 supports PCIe LTR mechanism. The registers are non-standard PCIe LTR support. This needs to be setup in order to not suffer performance impact and provide proper power management. The channel is set to active when it is allocated, and to passive when it's freed.
Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Vinod Koul vkoul@kernel.org Signed-off-by: Lin Wang lin.x.wang@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dma/ioat/init.c | 27 +++++++++++++++++++++++++++ drivers/dma/ioat/registers.h | 14 ++++++++++++++ 2 files changed, 41 insertions(+)
diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c index aaa904c08802..5d0ee7a30275 100644 --- a/drivers/dma/ioat/init.c +++ b/drivers/dma/ioat/init.c @@ -638,6 +638,11 @@ static void ioat_free_chan_resources(struct dma_chan *c) ioat_stop(ioat_chan); ioat_reset_hw(ioat_chan);
+ /* Put LTR to idle */ + if (ioat_dma->version >= IOAT_VER_3_4) + writeb(IOAT_CHAN_LTR_SWSEL_IDLE, + ioat_chan->reg_base + IOAT_CHAN_LTR_SWSEL_OFFSET); + spin_lock_bh(&ioat_chan->cleanup_lock); spin_lock_bh(&ioat_chan->prep_lock); descs = ioat_ring_space(ioat_chan); @@ -727,6 +732,28 @@ static int ioat_alloc_chan_resources(struct dma_chan *c) spin_unlock_bh(&ioat_chan->prep_lock); spin_unlock_bh(&ioat_chan->cleanup_lock);
+ /* Setting up LTR values for 3.4 or later */ + if (ioat_chan->ioat_dma->version >= IOAT_VER_3_4) { + u32 lat_val; + + lat_val = IOAT_CHAN_LTR_ACTIVE_SNVAL | + IOAT_CHAN_LTR_ACTIVE_SNLATSCALE | + IOAT_CHAN_LTR_ACTIVE_SNREQMNT; + writel(lat_val, ioat_chan->reg_base + + IOAT_CHAN_LTR_ACTIVE_OFFSET); + + lat_val = IOAT_CHAN_LTR_IDLE_SNVAL | + IOAT_CHAN_LTR_IDLE_SNLATSCALE | + IOAT_CHAN_LTR_IDLE_SNREQMNT; + writel(lat_val, ioat_chan->reg_base + + IOAT_CHAN_LTR_IDLE_OFFSET); + + /* Select to active */ + writeb(IOAT_CHAN_LTR_SWSEL_ACTIVE, + ioat_chan->reg_base + + IOAT_CHAN_LTR_SWSEL_OFFSET); + } + ioat_start_null_desc(ioat_chan);
/* check that we got off the ground */ diff --git a/drivers/dma/ioat/registers.h b/drivers/dma/ioat/registers.h index 2b517d6db5fd..99c1c24d465d 100644 --- a/drivers/dma/ioat/registers.h +++ b/drivers/dma/ioat/registers.h @@ -253,4 +253,18 @@ #define IOAT_CHAN_DRS_EN 0x0100 #define IOAT_CHAN_DRS_AUTOWRAP 0x0200
+#define IOAT_CHAN_LTR_SWSEL_OFFSET 0xBC +#define IOAT_CHAN_LTR_SWSEL_ACTIVE 0x0 +#define IOAT_CHAN_LTR_SWSEL_IDLE 0x1 + +#define IOAT_CHAN_LTR_ACTIVE_OFFSET 0xC0 +#define IOAT_CHAN_LTR_ACTIVE_SNVAL 0x0000 /* 0 us */ +#define IOAT_CHAN_LTR_ACTIVE_SNLATSCALE 0x0800 /* 1us scale */ +#define IOAT_CHAN_LTR_ACTIVE_SNREQMNT 0x8000 /* snoop req enable */ + +#define IOAT_CHAN_LTR_IDLE_OFFSET 0xC4 +#define IOAT_CHAN_LTR_IDLE_SNVAL 0x0258 /* 600 us */ +#define IOAT_CHAN_LTR_IDLE_SNLATSCALE 0x0800 /* 1us scale */ +#define IOAT_CHAN_LTR_IDLE_SNREQMNT 0x8000 /* snoop req enable */ + #endif /* _IOAT_REGISTERS_H_ */
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 21b9e979501fdb5f6797193d70428a2b00bd5247 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 21b9e979501fdb5f6797193d70428a2b00bd5247 upstream.
Commit bbb3be170ac2 "device-dax: fix sysfs duplicate warnings" arranged for passing a dax instance-id to devm_create_dax_dev(), rather than generating one internally. Remove the dax_region ida and related code.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/dax-private.h | 3 --- drivers/dax/device.c | 24 +++--------------------- 2 files changed, 3 insertions(+), 24 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index b6fc4f04636d..d1b36a42132f 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -28,7 +28,6 @@ */ struct dax_region { int id; - struct ida ida; void *base; struct kref kref; struct device *dev; @@ -42,7 +41,6 @@ struct dax_region { * @region - parent region * @dax_dev - core dax functionality * @dev - device core - * @id - child id in the region * @num_resources - number of physical address extents in this device * @res - array of physical address ranges */ @@ -50,7 +48,6 @@ struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; struct device dev; - int id; int num_resources; struct resource res[0]; }; diff --git a/drivers/dax/device.c b/drivers/dax/device.c index a89ebd94c670..f8f6ee0b2c1b 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -128,7 +128,6 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, dax_region->pfn_flags = pfn_flags; kref_init(&dax_region->kref); dax_region->id = region_id; - ida_init(&dax_region->ida); dax_region->align = align; dax_region->dev = parent; dax_region->base = addr; @@ -580,8 +579,6 @@ static void dev_dax_release(struct device *dev) struct dax_region *dax_region = dev_dax->region; struct dax_device *dax_dev = dev_dax->dax_dev;
- if (dev_dax->id >= 0) - ida_simple_remove(&dax_region->ida, dev_dax->id); dax_region_put(dax_region); put_dax(dax_dev); kfree(dev_dax); @@ -640,19 +637,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, }
if (i < count) - goto err_id; - - if (id < 0) { - id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL); - dev_dax->id = id; - if (id < 0) { - rc = id; - goto err_id; - } - } else { - /* region provider owns @id lifetime */ - dev_dax->id = -1; - } + goto err;
/* * No 'host' or dax_operations since there is no access to this @@ -661,7 +646,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, dax_dev = alloc_dax(dev_dax, NULL, NULL); if (!dax_dev) { rc = -ENOMEM; - goto err_dax; + goto err; }
/* from here on we're committed to teardown via dax_dev_release() */ @@ -698,10 +683,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
return dev_dax;
- err_dax: - if (dev_dax->id >= 0) - ida_simple_remove(&dax_region->ida, dev_dax->id); - err_id: + err: kfree(dev_dax);
return ERR_PTR(rc);
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 93694f9630b0ed29cda61df58e480dcb34ef52fd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 93694f9630b0ed29cda61df58e480dcb34ef52fd upstream.
Nothing consumes this attribute of a region and devres otherwise remembers the value for de-allocation purposes.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/dax-private.h | 2 -- drivers/dax/device-dax.h | 5 ++--- drivers/dax/device.c | 3 +-- drivers/dax/pmem.c | 2 +- 4 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index d1b36a42132f..9b393c218fe4 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -19,7 +19,6 @@ /** * struct dax_region - mapping infrastructure for dax devices * @id: kernel-wide unique region for a memory range - * @base: linear address corresponding to @res * @kref: to pin while other agents have a need to do lookups * @dev: parent device backing this region * @align: allocation and mapping alignment for child dax devices @@ -28,7 +27,6 @@ */ struct dax_region { int id; - void *base; struct kref kref; struct device *dev; unsigned int align; diff --git a/drivers/dax/device-dax.h b/drivers/dax/device-dax.h index 688b051750bd..4f1c69e1b3a2 100644 --- a/drivers/dax/device-dax.h +++ b/drivers/dax/device-dax.h @@ -17,9 +17,8 @@ struct dev_dax; struct resource; struct dax_region; void dax_region_put(struct dax_region *dax_region); -struct dax_region *alloc_dax_region(struct device *parent, - int region_id, struct resource *res, unsigned int align, - void *addr, unsigned long flags); +struct dax_region *alloc_dax_region(struct device *parent, int region_id, + struct resource *res, unsigned int align, unsigned long flags); struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, struct resource *res, int count); #endif /* __DEVICE_DAX_H__ */ diff --git a/drivers/dax/device.c b/drivers/dax/device.c index f8f6ee0b2c1b..59518cefd0b2 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -100,7 +100,7 @@ static void dax_region_unregister(void *region) }
struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, unsigned int align, void *addr, + struct resource *res, unsigned int align, unsigned long pfn_flags) { struct dax_region *dax_region; @@ -130,7 +130,6 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, dax_region->id = region_id; dax_region->align = align; dax_region->dev = parent; - dax_region->base = addr; if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) { kfree(dax_region); return NULL; diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 2c1f459c0c63..72a76105eb02 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -125,7 +125,7 @@ static int dax_pmem_probe(struct device *dev) return -EINVAL;
dax_region = alloc_dax_region(dev, region_id, &res, - le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP); + le32_to_cpu(pfn_sb->align), PFN_DEV|PFN_MAP); if (!dax_region) return -ENOMEM;
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 753a0850e707e9a8c5861356222f9b9e4eba7945 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 753a0850e707e9a8c5861356222f9b9e4eba7945 upstream.
The multi-resource implementation anticipated discontiguous sub-division support. That has not yet materialized, delete the infrastructure and related code.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/dax-private.h | 4 --- drivers/dax/device-dax.h | 3 +-- drivers/dax/device.c | 49 ++++++---------------------------- drivers/dax/pmem.c | 3 +-- tools/testing/nvdimm/dax-dev.c | 16 +++-------- 5 files changed, 13 insertions(+), 62 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 9b393c218fe4..dbd077653b5c 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -39,14 +39,10 @@ struct dax_region { * @region - parent region * @dax_dev - core dax functionality * @dev - device core - * @num_resources - number of physical address extents in this device - * @res - array of physical address ranges */ struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; struct device dev; - int num_resources; - struct resource res[0]; }; #endif diff --git a/drivers/dax/device-dax.h b/drivers/dax/device-dax.h index 4f1c69e1b3a2..e9be99584b92 100644 --- a/drivers/dax/device-dax.h +++ b/drivers/dax/device-dax.h @@ -19,6 +19,5 @@ struct dax_region; void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, unsigned int align, unsigned long flags); -struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, - int id, struct resource *res, int count); +struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id); #endif /* __DEVICE_DAX_H__ */ diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 59518cefd0b2..d3706a8f0ea5 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -151,11 +151,7 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { struct dev_dax *dev_dax = to_dev_dax(dev); - unsigned long long size = 0; - int i; - - for (i = 0; i < dev_dax->num_resources; i++) - size += resource_size(&dev_dax->res[i]); + unsigned long long size = resource_size(&dev_dax->region->res);
return sprintf(buf, "%llu\n", size); } @@ -224,21 +220,11 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size) { - struct resource *res; - /* gcc-4.6.3-nolibc for i386 complains that this is uninitialized */ - phys_addr_t uninitialized_var(phys); - int i; - - for (i = 0; i < dev_dax->num_resources; i++) { - res = &dev_dax->res[i]; - phys = pgoff * PAGE_SIZE + res->start; - if (phys >= res->start && phys <= res->end) - break; - pgoff -= PHYS_PFN(resource_size(res)); - } + struct resource *res = &dev_dax->region->res; + phys_addr_t phys;
- if (i < dev_dax->num_resources) { - res = &dev_dax->res[i]; + phys = pgoff * PAGE_SIZE + res->start; + if (phys >= res->start && phys <= res->end) { if (phys + size - 1 <= res->end) return phys; } @@ -606,8 +592,7 @@ static void unregister_dev_dax(void *dev) put_device(dev); }
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, - int id, struct resource *res, int count) +struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id) { struct device *parent = dax_region->dev; struct dax_device *dax_dev; @@ -615,29 +600,12 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, struct inode *inode; struct device *dev; struct cdev *cdev; - int rc, i; + int rc;
- if (!count) - return ERR_PTR(-EINVAL); - - dev_dax = kzalloc(struct_size(dev_dax, res, count), GFP_KERNEL); + dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL); if (!dev_dax) return ERR_PTR(-ENOMEM);
- for (i = 0; i < count; i++) { - if (!IS_ALIGNED(res[i].start, dax_region->align) - || !IS_ALIGNED(resource_size(&res[i]), - dax_region->align)) { - rc = -EINVAL; - break; - } - dev_dax->res[i].start = res[i].start; - dev_dax->res[i].end = res[i].end; - } - - if (i < count) - goto err; - /* * No 'host' or dax_operations since there is no access to this * device outside of mmap of the resulting character device. @@ -657,7 +625,6 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, cdev_init(cdev, &dax_fops); cdev->owner = parent->driver->owner;
- dev_dax->num_resources = count; dev_dax->dax_dev = dax_dev; dev_dax->region = dax_region; kref_get(&dax_region->kref); diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 72a76105eb02..6c61a019f997 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -129,8 +129,7 @@ static int dax_pmem_probe(struct device *dev) if (!dax_region) return -ENOMEM;
- /* TODO: support for subdividing a dax region... */ - dev_dax = devm_create_dev_dax(dax_region, id, &res, 1); + dev_dax = devm_create_dev_dax(dax_region, id);
/* child dev_dax instances now own the lifetime of the dax_region */ dax_region_put(dax_region); diff --git a/tools/testing/nvdimm/dax-dev.c b/tools/testing/nvdimm/dax-dev.c index 36ee3d8797c3..f36e708265b8 100644 --- a/tools/testing/nvdimm/dax-dev.c +++ b/tools/testing/nvdimm/dax-dev.c @@ -17,20 +17,11 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size) { - struct resource *res; + struct resource *res = &dev_dax->region->res; phys_addr_t addr; - int i;
- for (i = 0; i < dev_dax->num_resources; i++) { - res = &dev_dax->res[i]; - addr = pgoff * PAGE_SIZE + res->start; - if (addr >= res->start && addr <= res->end) - break; - pgoff -= PHYS_PFN(resource_size(res)); - } - - if (i < dev_dax->num_resources) { - res = &dev_dax->res[i]; + addr = pgoff * PAGE_SIZE + res->start; + if (addr >= res->start && addr <= res->end) { if (addr + size - 1 <= res->end) { if (get_nfit_res(addr)) { struct page *page; @@ -44,6 +35,5 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, return addr; } } - return -1; }
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 51cf784c42d07fbd62cb604836a9270cf3361509 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 51cf784c42d07fbd62cb604836a9270cf3361509 upstream.
Towards eliminating the dax_class, move the dax-device-attribute enabling to a new bus.c file in the core. The amount of code thrash of sub-sequent patches is reduced as no logic changes are made, just pure code movement.
A temporary export of unregister_dex_dax() and dax_attribute_groups is needed to preserve compilation, but those symbols become static again in a follow-on patch.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/Makefile | 1 + drivers/dax/bus.c | 174 +++++++++++++++++++++++++++++++++ drivers/dax/bus.h | 15 +++ drivers/dax/dax-private.h | 14 +++ drivers/dax/dax.h | 18 ---- drivers/dax/device-dax.h | 23 ----- drivers/dax/device.c | 185 +----------------------------------- drivers/dax/pmem.c | 2 +- drivers/dax/super.c | 1 + tools/testing/nvdimm/Kbuild | 1 + 10 files changed, 210 insertions(+), 224 deletions(-) create mode 100644 drivers/dax/bus.c create mode 100644 drivers/dax/bus.h delete mode 100644 drivers/dax/dax.h delete mode 100644 drivers/dax/device-dax.h
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 574286fac87c..658e6b9b1d74 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -4,5 +4,6 @@ obj-$(CONFIG_DEV_DAX) += device_dax.o obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
dax-y := super.o +dax-y += bus.o dax_pmem-y := pmem.o device_dax-y := device.o diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c new file mode 100644 index 000000000000..8a398e8e1956 --- /dev/null +++ b/drivers/dax/bus.c @@ -0,0 +1,174 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */ +#include <linux/device.h> +#include <linux/slab.h> +#include <linux/dax.h> +#include "dax-private.h" +#include "bus.h" + +/* + * Rely on the fact that drvdata is set before the attributes are + * registered, and that the attributes are unregistered before drvdata + * is cleared to assume that drvdata is always valid. + */ +static ssize_t id_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + + return sprintf(buf, "%d\n", dax_region->id); +} +static DEVICE_ATTR_RO(id); + +static ssize_t region_size_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + + return sprintf(buf, "%llu\n", (unsigned long long) + resource_size(&dax_region->res)); +} +static struct device_attribute dev_attr_region_size = __ATTR(size, 0444, + region_size_show, NULL); + +static ssize_t align_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + + return sprintf(buf, "%u\n", dax_region->align); +} +static DEVICE_ATTR_RO(align); + +static struct attribute *dax_region_attributes[] = { + &dev_attr_region_size.attr, + &dev_attr_align.attr, + &dev_attr_id.attr, + NULL, +}; + +static const struct attribute_group dax_region_attribute_group = { + .name = "dax_region", + .attrs = dax_region_attributes, +}; + +static const struct attribute_group *dax_region_attribute_groups[] = { + &dax_region_attribute_group, + NULL, +}; + +static void dax_region_free(struct kref *kref) +{ + struct dax_region *dax_region; + + dax_region = container_of(kref, struct dax_region, kref); + kfree(dax_region); +} + +void dax_region_put(struct dax_region *dax_region) +{ + kref_put(&dax_region->kref, dax_region_free); +} +EXPORT_SYMBOL_GPL(dax_region_put); + +static void dax_region_unregister(void *region) +{ + struct dax_region *dax_region = region; + + sysfs_remove_groups(&dax_region->dev->kobj, + dax_region_attribute_groups); + dax_region_put(dax_region); +} + +struct dax_region *alloc_dax_region(struct device *parent, int region_id, + struct resource *res, unsigned int align, + unsigned long pfn_flags) +{ + struct dax_region *dax_region; + + /* + * The DAX core assumes that it can store its private data in + * parent->driver_data. This WARN is a reminder / safeguard for + * developers of device-dax drivers. + */ + if (dev_get_drvdata(parent)) { + dev_WARN(parent, "dax core failed to setup private data\n"); + return NULL; + } + + if (!IS_ALIGNED(res->start, align) + || !IS_ALIGNED(resource_size(res), align)) + return NULL; + + dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL); + if (!dax_region) + return NULL; + + dev_set_drvdata(parent, dax_region); + memcpy(&dax_region->res, res, sizeof(*res)); + dax_region->pfn_flags = pfn_flags; + kref_init(&dax_region->kref); + dax_region->id = region_id; + dax_region->align = align; + dax_region->dev = parent; + if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) { + kfree(dax_region); + return NULL; + } + + kref_get(&dax_region->kref); + if (devm_add_action_or_reset(parent, dax_region_unregister, dax_region)) + return NULL; + return dax_region; +} +EXPORT_SYMBOL_GPL(alloc_dax_region); + +static ssize_t size_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + unsigned long long size = resource_size(&dev_dax->region->res); + + return sprintf(buf, "%llu\n", size); +} +static DEVICE_ATTR_RO(size); + +static struct attribute *dev_dax_attributes[] = { + &dev_attr_size.attr, + NULL, +}; + +static const struct attribute_group dev_dax_attribute_group = { + .attrs = dev_dax_attributes, +}; + +const struct attribute_group *dax_attribute_groups[] = { + &dev_dax_attribute_group, + NULL, +}; +EXPORT_SYMBOL_GPL(dax_attribute_groups); + +void kill_dev_dax(struct dev_dax *dev_dax) +{ + struct dax_device *dax_dev = dev_dax->dax_dev; + struct inode *inode = dax_inode(dax_dev); + + kill_dax(dax_dev); + unmap_mapping_range(inode->i_mapping, 0, 0, 1); +} +EXPORT_SYMBOL_GPL(kill_dev_dax); + +void unregister_dev_dax(void *dev) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_device *dax_dev = dev_dax->dax_dev; + struct inode *inode = dax_inode(dax_dev); + struct cdev *cdev = inode->i_cdev; + + dev_dbg(dev, "trace\n"); + + kill_dev_dax(dev_dax); + cdev_device_del(cdev, dev); + put_device(dev); +} +EXPORT_SYMBOL_GPL(unregister_dev_dax); diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h new file mode 100644 index 000000000000..840865aa69e8 --- /dev/null +++ b/drivers/dax/bus.h @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016 - 2018 Intel Corporation. All rights reserved. */ +#ifndef __DAX_BUS_H__ +#define __DAX_BUS_H__ +struct device; +struct dev_dax; +struct resource; +struct dax_device; +struct dax_region; +void dax_region_put(struct dax_region *dax_region); +struct dax_region *alloc_dax_region(struct device *parent, int region_id, + struct resource *res, unsigned int align, unsigned long flags); +struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id); +void kill_dev_dax(struct dev_dax *dev_dax); +#endif /* __DAX_BUS_H__ */ diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index dbd077653b5c..620c3f4eefe7 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -16,6 +16,15 @@ #include <linux/device.h> #include <linux/cdev.h>
+/* private routines between core files */ +struct dax_device; +struct dax_device *inode_dax(struct inode *inode); +struct inode *dax_inode(struct dax_device *dax_dev); + +/* temporary until devm_create_dax_dev moves to bus.c */ +extern const struct attribute_group *dax_attribute_groups[]; +void unregister_dev_dax(void *dev); + /** * struct dax_region - mapping infrastructure for dax devices * @id: kernel-wide unique region for a memory range @@ -45,4 +54,9 @@ struct dev_dax { struct dax_device *dax_dev; struct device dev; }; + +static inline struct dev_dax *to_dev_dax(struct device *dev) +{ + return container_of(dev, struct dev_dax, dev); +} #endif diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h deleted file mode 100644 index f9e5feea742c..000000000000 --- a/drivers/dax/dax.h +++ /dev/null @@ -1,18 +0,0 @@ -/* - * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - */ -#ifndef __DAX_H__ -#define __DAX_H__ -struct dax_device; -struct dax_device *inode_dax(struct inode *inode); -struct inode *dax_inode(struct dax_device *dax_dev); -#endif /* __DAX_H__ */ diff --git a/drivers/dax/device-dax.h b/drivers/dax/device-dax.h deleted file mode 100644 index e9be99584b92..000000000000 --- a/drivers/dax/device-dax.h +++ /dev/null @@ -1,23 +0,0 @@ -/* - * Copyright(c) 2016 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - */ -#ifndef __DEVICE_DAX_H__ -#define __DEVICE_DAX_H__ -struct device; -struct dev_dax; -struct resource; -struct dax_region; -void dax_region_put(struct dax_region *dax_region); -struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, unsigned int align, unsigned long flags); -struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id); -#endif /* __DEVICE_DAX_H__ */ diff --git a/drivers/dax/device.c b/drivers/dax/device.c index d3706a8f0ea5..f2ce8af2e6cc 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -1,15 +1,5 @@ -/* - * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - */ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ #include <linux/pagemap.h> #include <linux/module.h> #include <linux/device.h> @@ -21,156 +11,10 @@ #include <linux/mm.h> #include <linux/mman.h> #include "dax-private.h" -#include "dax.h" +#include "bus.h"
static struct class *dax_class;
-/* - * Rely on the fact that drvdata is set before the attributes are - * registered, and that the attributes are unregistered before drvdata - * is cleared to assume that drvdata is always valid. - */ -static ssize_t id_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct dax_region *dax_region = dev_get_drvdata(dev); - - return sprintf(buf, "%d\n", dax_region->id); -} -static DEVICE_ATTR_RO(id); - -static ssize_t region_size_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct dax_region *dax_region = dev_get_drvdata(dev); - - return sprintf(buf, "%llu\n", (unsigned long long) - resource_size(&dax_region->res)); -} -static struct device_attribute dev_attr_region_size = __ATTR(size, 0444, - region_size_show, NULL); - -static ssize_t align_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct dax_region *dax_region = dev_get_drvdata(dev); - - return sprintf(buf, "%u\n", dax_region->align); -} -static DEVICE_ATTR_RO(align); - -static struct attribute *dax_region_attributes[] = { - &dev_attr_region_size.attr, - &dev_attr_align.attr, - &dev_attr_id.attr, - NULL, -}; - -static const struct attribute_group dax_region_attribute_group = { - .name = "dax_region", - .attrs = dax_region_attributes, -}; - -static const struct attribute_group *dax_region_attribute_groups[] = { - &dax_region_attribute_group, - NULL, -}; - -static void dax_region_free(struct kref *kref) -{ - struct dax_region *dax_region; - - dax_region = container_of(kref, struct dax_region, kref); - kfree(dax_region); -} - -void dax_region_put(struct dax_region *dax_region) -{ - kref_put(&dax_region->kref, dax_region_free); -} -EXPORT_SYMBOL_GPL(dax_region_put); - -static void dax_region_unregister(void *region) -{ - struct dax_region *dax_region = region; - - sysfs_remove_groups(&dax_region->dev->kobj, - dax_region_attribute_groups); - dax_region_put(dax_region); -} - -struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, unsigned int align, - unsigned long pfn_flags) -{ - struct dax_region *dax_region; - - /* - * The DAX core assumes that it can store its private data in - * parent->driver_data. This WARN is a reminder / safeguard for - * developers of device-dax drivers. - */ - if (dev_get_drvdata(parent)) { - dev_WARN(parent, "dax core failed to setup private data\n"); - return NULL; - } - - if (!IS_ALIGNED(res->start, align) - || !IS_ALIGNED(resource_size(res), align)) - return NULL; - - dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL); - if (!dax_region) - return NULL; - - dev_set_drvdata(parent, dax_region); - memcpy(&dax_region->res, res, sizeof(*res)); - dax_region->pfn_flags = pfn_flags; - kref_init(&dax_region->kref); - dax_region->id = region_id; - dax_region->align = align; - dax_region->dev = parent; - if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) { - kfree(dax_region); - return NULL; - } - - kref_get(&dax_region->kref); - if (devm_add_action_or_reset(parent, dax_region_unregister, dax_region)) - return NULL; - return dax_region; -} -EXPORT_SYMBOL_GPL(alloc_dax_region); - -static struct dev_dax *to_dev_dax(struct device *dev) -{ - return container_of(dev, struct dev_dax, dev); -} - -static ssize_t size_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct dev_dax *dev_dax = to_dev_dax(dev); - unsigned long long size = resource_size(&dev_dax->region->res); - - return sprintf(buf, "%llu\n", size); -} -static DEVICE_ATTR_RO(size); - -static struct attribute *dev_dax_attributes[] = { - &dev_attr_size.attr, - NULL, -}; - -static const struct attribute_group dev_dax_attribute_group = { - .attrs = dev_dax_attributes, -}; - -static const struct attribute_group *dax_attribute_groups[] = { - &dev_dax_attribute_group, - NULL, -}; - static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, const char *func) { @@ -569,29 +413,6 @@ static void dev_dax_release(struct device *dev) kfree(dev_dax); }
-static void kill_dev_dax(struct dev_dax *dev_dax) -{ - struct dax_device *dax_dev = dev_dax->dax_dev; - struct inode *inode = dax_inode(dax_dev); - - kill_dax(dax_dev); - unmap_mapping_range(inode->i_mapping, 0, 0, 1); -} - -static void unregister_dev_dax(void *dev) -{ - struct dev_dax *dev_dax = to_dev_dax(dev); - struct dax_device *dax_dev = dev_dax->dax_dev; - struct inode *inode = dax_inode(dax_dev); - struct cdev *cdev = inode->i_cdev; - - dev_dbg(dev, "trace\n"); - - kill_dev_dax(dev_dax); - cdev_device_del(cdev, dev); - put_device(dev); -} - struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id) { struct device *parent = dax_region->dev; diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 6c61a019f997..e1b50d8fd1f8 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -16,7 +16,7 @@ #include <linux/pfn_t.h> #include "../nvdimm/pfn.h" #include "../nvdimm/nd.h" -#include "device-dax.h" +#include "bus.h"
struct dax_pmem { struct device *dev; diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 6e928f37d084..0ecc1a2cf1cc 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -22,6 +22,7 @@ #include <linux/uio.h> #include <linux/dax.h> #include <linux/fs.h> +#include "dax-private.h"
static dev_t dax_devt; DEFINE_STATIC_SRCU(dax_srcu); diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild index 0392153a0009..bfc4d8e98452 100644 --- a/tools/testing/nvdimm/Kbuild +++ b/tools/testing/nvdimm/Kbuild @@ -55,6 +55,7 @@ nd_e820-y := $(NVDIMM_SRC)/e820.o nd_e820-y += config_check.o
dax-y := $(DAX_SRC)/super.o +dax-y += $(DAX_SRC)/bus.o dax-y += config_check.o
device_dax-y := $(DAX_SRC)/device.o
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 9567da0b408a2553d32ca83cba4f1fc5a8aad459 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9567da0b408a2553d32ca83cba4f1fc5a8aad459 upstream.
In support of multiple device-dax instances per device-dax-region and allowing the 'kmem' driver to attach to dax-instances instead of the current device-node access, convert the dax sub-system from a class to a bus. Recall that the kmem driver takes reserved / special purpose memories and assigns them to be managed by the core-mm.
Aside from the fact the device-dax instances are registered and probed on a bus, two other lifetime-management changes are made:
1/ Delay attaching a cdev until driver probe time
2/ A new run_dax() helper is introduced to allow restoring dax-operation after a kill_dax() event. So, at driver ->probe() time we run_dax() and at ->remove() time we kill_dax() and invalidate all mappings.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/bus.c | 133 +++++++++++++++++++++++++++++++++++--- drivers/dax/bus.h | 16 +++++ drivers/dax/dax-private.h | 6 +- drivers/dax/device.c | 95 +++++++++------------------ drivers/dax/super.c | 40 ++++++++---- 5 files changed, 203 insertions(+), 87 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 8a398e8e1956..0cff32102c4c 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -6,6 +6,33 @@ #include "dax-private.h" #include "bus.h"
+static int dax_bus_uevent(struct device *dev, struct kobj_uevent_env *env) +{ + /* + * We only ever expect to handle device-dax instances, i.e. the + * @type argument to MODULE_ALIAS_DAX_DEVICE() is always zero + */ + return add_uevent_var(env, "MODALIAS=" DAX_DEVICE_MODALIAS_FMT, 0); +} + +static int dax_bus_match(struct device *dev, struct device_driver *drv); + +static struct bus_type dax_bus_type = { + .name = "dax", + .uevent = dax_bus_uevent, + .match = dax_bus_match, +}; + +static int dax_bus_match(struct device *dev, struct device_driver *drv) +{ + /* + * The drivers that can register on the 'dax' bus are private to + * drivers/dax/ so any device and driver on the bus always + * match. + */ + return 1; +} + /* * Rely on the fact that drvdata is set before the attributes are * registered, and that the attributes are unregistered before drvdata @@ -142,11 +169,10 @@ static const struct attribute_group dev_dax_attribute_group = { .attrs = dev_dax_attributes, };
-const struct attribute_group *dax_attribute_groups[] = { +static const struct attribute_group *dax_attribute_groups[] = { &dev_dax_attribute_group, NULL, }; -EXPORT_SYMBOL_GPL(dax_attribute_groups);
void kill_dev_dax(struct dev_dax *dev_dax) { @@ -158,17 +184,108 @@ void kill_dev_dax(struct dev_dax *dev_dax) } EXPORT_SYMBOL_GPL(kill_dev_dax);
-void unregister_dev_dax(void *dev) +static void dev_dax_release(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; struct dax_device *dax_dev = dev_dax->dax_dev; - struct inode *inode = dax_inode(dax_dev); - struct cdev *cdev = inode->i_cdev;
- dev_dbg(dev, "trace\n"); + dax_region_put(dax_region); + put_dax(dax_dev); + kfree(dev_dax); +} + +static void unregister_dev_dax(void *dev) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + dev_dbg(dev, "%s\n", __func__);
kill_dev_dax(dev_dax); - cdev_device_del(cdev, dev); + device_del(dev); put_device(dev); } -EXPORT_SYMBOL_GPL(unregister_dev_dax); + +struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id) +{ + struct device *parent = dax_region->dev; + struct dax_device *dax_dev; + struct dev_dax *dev_dax; + struct inode *inode; + struct device *dev; + int rc = -ENOMEM; + + if (id < 0) + return ERR_PTR(-EINVAL); + + dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL); + if (!dev_dax) + return ERR_PTR(-ENOMEM); + + /* + * No 'host' or dax_operations since there is no access to this + * device outside of mmap of the resulting character device. + */ + dax_dev = alloc_dax(dev_dax, NULL, NULL); + if (!dax_dev) + goto err; + + /* a device_dax instance is dead while the driver is not attached */ + kill_dax(dax_dev); + + /* from here on we're committed to teardown via dax_dev_release() */ + dev = &dev_dax->dev; + device_initialize(dev); + + dev_dax->dax_dev = dax_dev; + dev_dax->region = dax_region; + kref_get(&dax_region->kref); + + inode = dax_inode(dax_dev); + dev->devt = inode->i_rdev; + dev->bus = &dax_bus_type; + dev->parent = parent; + dev->groups = dax_attribute_groups; + dev->release = dev_dax_release; + dev_set_name(dev, "dax%d.%d", dax_region->id, id); + + rc = device_add(dev); + if (rc) { + kill_dev_dax(dev_dax); + put_device(dev); + return ERR_PTR(rc); + } + + rc = devm_add_action_or_reset(dax_region->dev, unregister_dev_dax, dev); + if (rc) + return ERR_PTR(rc); + + return dev_dax; + + err: + kfree(dev_dax); + + return ERR_PTR(rc); +} +EXPORT_SYMBOL_GPL(devm_create_dev_dax); + +int __dax_driver_register(struct device_driver *drv, + struct module *module, const char *mod_name) +{ + drv->owner = module; + drv->name = mod_name; + drv->mod_name = mod_name; + drv->bus = &dax_bus_type; + return driver_register(drv); +} +EXPORT_SYMBOL_GPL(__dax_driver_register); + +int __init dax_bus_init(void) +{ + return bus_register(&dax_bus_type); +} + +void __exit dax_bus_exit(void) +{ + bus_unregister(&dax_bus_type); +} diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 840865aa69e8..ea509504df3a 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -11,5 +11,21 @@ void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, unsigned int align, unsigned long flags); struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id); +int __dax_driver_register(struct device_driver *drv, + struct module *module, const char *mod_name); +#define dax_driver_register(driver) \ + __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME) void kill_dev_dax(struct dev_dax *dev_dax); + +/* + * While run_dax() is potentially a generic operation that could be + * defined in include/linux/dax.h we don't want to grow any users + * outside of drivers/dax/ + */ +void run_dax(struct dax_device *dax_dev); + +#define MODULE_ALIAS_DAX_DEVICE(type) \ + MODULE_ALIAS("dax:t" __stringify(type) "*") +#define DAX_DEVICE_MODALIAS_FMT "dax:t%d" + #endif /* __DAX_BUS_H__ */ diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 620c3f4eefe7..c3a121700837 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -20,10 +20,8 @@ struct dax_device; struct dax_device *inode_dax(struct inode *inode); struct inode *dax_inode(struct dax_device *dax_dev); - -/* temporary until devm_create_dax_dev moves to bus.c */ -extern const struct attribute_group *dax_attribute_groups[]; -void unregister_dev_dax(void *dev); +int dax_bus_init(void); +void dax_bus_exit(void);
/** * struct dax_region - mapping infrastructure for dax devices diff --git a/drivers/dax/device.c b/drivers/dax/device.c index f2ce8af2e6cc..20759c3f0b91 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -13,8 +13,6 @@ #include "dax-private.h" #include "bus.h"
-static struct class *dax_class; - static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, const char *func) { @@ -402,93 +400,64 @@ static const struct file_operations dax_fops = { .mmap_supported_flags = MAP_SYNC, };
-static void dev_dax_release(struct device *dev) +static void dev_dax_cdev_del(void *cdev) { - struct dev_dax *dev_dax = to_dev_dax(dev); - struct dax_region *dax_region = dev_dax->region; - struct dax_device *dax_dev = dev_dax->dax_dev; + cdev_del(cdev); +}
- dax_region_put(dax_region); - put_dax(dax_dev); - kfree(dev_dax); +static void dev_dax_kill(void *dev_dax) +{ + kill_dev_dax(dev_dax); }
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id) +static int dev_dax_probe(struct device *dev) { - struct device *parent = dax_region->dev; - struct dax_device *dax_dev; - struct dev_dax *dev_dax; + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_device *dax_dev = dev_dax->dax_dev; struct inode *inode; - struct device *dev; struct cdev *cdev; int rc;
- dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL); - if (!dev_dax) - return ERR_PTR(-ENOMEM); - - /* - * No 'host' or dax_operations since there is no access to this - * device outside of mmap of the resulting character device. - */ - dax_dev = alloc_dax(dev_dax, NULL, NULL); - if (!dax_dev) { - rc = -ENOMEM; - goto err; - } - - /* from here on we're committed to teardown via dax_dev_release() */ - dev = &dev_dax->dev; - device_initialize(dev); - inode = dax_inode(dax_dev); cdev = inode->i_cdev; cdev_init(cdev, &dax_fops); - cdev->owner = parent->driver->owner; - - dev_dax->dax_dev = dax_dev; - dev_dax->region = dax_region; - kref_get(&dax_region->kref); - - dev->devt = inode->i_rdev; - dev->class = dax_class; - dev->parent = parent; - dev->groups = dax_attribute_groups; - dev->release = dev_dax_release; - dev_set_name(dev, "dax%d.%d", dax_region->id, id); - - rc = cdev_device_add(cdev, dev); - if (rc) { - kill_dev_dax(dev_dax); - put_device(dev); - return ERR_PTR(rc); - } - - rc = devm_add_action_or_reset(dax_region->dev, unregister_dev_dax, dev); + cdev->owner = dev->driver->owner; + cdev_set_parent(cdev, &dev->kobj); + rc = cdev_add(cdev, dev->devt, 1); if (rc) - return ERR_PTR(rc); + return rc;
- return dev_dax; + rc = devm_add_action_or_reset(dev, dev_dax_cdev_del, cdev); + if (rc) + return rc;
- err: - kfree(dev_dax); + run_dax(dax_dev); + return devm_add_action_or_reset(dev, dev_dax_kill, dev_dax); +}
- return ERR_PTR(rc); +static int dev_dax_remove(struct device *dev) +{ + /* all probe actions are unwound by devm */ + return 0; } -EXPORT_SYMBOL_GPL(devm_create_dev_dax); + +static struct device_driver device_dax_driver = { + .probe = dev_dax_probe, + .remove = dev_dax_remove, +};
static int __init dax_init(void) { - dax_class = class_create(THIS_MODULE, "dax"); - return PTR_ERR_OR_ZERO(dax_class); + return dax_driver_register(&device_dax_driver); }
static void __exit dax_exit(void) { - class_destroy(dax_class); + driver_unregister(&device_dax_driver); }
MODULE_AUTHOR("Intel Corporation"); MODULE_LICENSE("GPL v2"); -subsys_initcall(dax_init); +module_init(dax_init); module_exit(dax_exit); +MODULE_ALIAS_DAX_DEVICE(0); diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0ecc1a2cf1cc..ccb22d8db3a2 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -366,11 +366,15 @@ void kill_dax(struct dax_device *dax_dev) spin_lock(&dax_host_lock); hlist_del_init(&dax_dev->list); spin_unlock(&dax_host_lock); - - dax_dev->private = NULL; } EXPORT_SYMBOL_GPL(kill_dax);
+void run_dax(struct dax_device *dax_dev) +{ + set_bit(DAXDEV_ALIVE, &dax_dev->flags); +} +EXPORT_SYMBOL_GPL(run_dax); + static struct inode *dax_alloc_inode(struct super_block *sb) { struct dax_device *dax_dev; @@ -585,6 +589,8 @@ EXPORT_SYMBOL_GPL(dax_inode);
void *dax_get_private(struct dax_device *dax_dev) { + if (!test_bit(DAXDEV_ALIVE, &dax_dev->flags)) + return NULL; return dax_dev->private; } EXPORT_SYMBOL_GPL(dax_get_private); @@ -598,7 +604,7 @@ static void init_once(void *_dax_dev) inode_init_once(inode); }
-static int __dax_fs_init(void) +static int dax_fs_init(void) { int rc;
@@ -630,35 +636,45 @@ static int __dax_fs_init(void) return rc; }
-static void __dax_fs_exit(void) +static void dax_fs_exit(void) { kern_unmount(dax_mnt); unregister_filesystem(&dax_fs_type); kmem_cache_destroy(dax_cache); }
-static int __init dax_fs_init(void) +static int __init dax_core_init(void) { int rc;
- rc = __dax_fs_init(); + rc = dax_fs_init(); if (rc) return rc;
rc = alloc_chrdev_region(&dax_devt, 0, MINORMASK+1, "dax"); if (rc) - __dax_fs_exit(); - return rc; + goto err_chrdev; + + rc = dax_bus_init(); + if (rc) + goto err_bus; + return 0; + +err_bus: + unregister_chrdev_region(dax_devt, MINORMASK+1); +err_chrdev: + dax_fs_exit(); + return 0; }
-static void __exit dax_fs_exit(void) +static void __exit dax_core_exit(void) { unregister_chrdev_region(dax_devt, MINORMASK+1); ida_destroy(&dax_minor_ida); - __dax_fs_exit(); + dax_fs_exit(); }
MODULE_AUTHOR("Intel Corporation"); MODULE_LICENSE("GPL v2"); -subsys_initcall(dax_fs_init); -module_exit(dax_fs_exit); +subsys_initcall(dax_core_init); +module_exit(dax_core_exit);
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 89ec9f2cfa36cc5fca2fb445ed221bb9add7b536 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 89ec9f2cfa36cc5fca2fb445ed221bb9add7b536 upstream.
Move the responsibility of calling devm_request_resource() and devm_memremap_pages() into the common device-dax driver. This is another preparatory step to allowing an alternate personality driver for a device-dax range.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/bus.c | 6 ++- drivers/dax/bus.h | 3 +- drivers/dax/dax-private.h | 9 +++- drivers/dax/device.c | 61 ++++++++++++++++++++++++++ drivers/dax/pmem.c | 90 ++++++--------------------------------- 5 files changed, 90 insertions(+), 79 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 0cff32102c4c..69aae2cbd45f 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */ +#include <linux/memremap.h> #include <linux/device.h> #include <linux/slab.h> #include <linux/dax.h> @@ -206,7 +207,8 @@ static void unregister_dev_dax(void *dev) put_device(dev); }
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id) +struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, + struct dev_pagemap *pgmap) { struct device *parent = dax_region->dev; struct dax_device *dax_dev; @@ -222,6 +224,8 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id) if (!dev_dax) return ERR_PTR(-ENOMEM);
+ memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap)); + /* * No 'host' or dax_operations since there is no access to this * device outside of mmap of the resulting character device. diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index ea509504df3a..e08e0c394983 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -10,7 +10,8 @@ struct dax_region; void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, unsigned int align, unsigned long flags); -struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id); +struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, + struct dev_pagemap *pgmap); int __dax_driver_register(struct device_driver *drv, struct module *module, const char *mod_name); #define dax_driver_register(driver) \ diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index c3a121700837..a82ce48f5884 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -42,15 +42,22 @@ struct dax_region { };
/** - * struct dev_dax - instance data for a subdivision of a dax region + * struct dev_dax - instance data for a subdivision of a dax region, and + * data while the device is activated in the driver. * @region - parent region * @dax_dev - core dax functionality * @dev - device core + * @pgmap - pgmap for memmap setup / lifetime (driver owned) + * @ref: pgmap reference count (driver owned) + * @cmp: @ref final put completion (driver owned) */ struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; struct device dev; + struct dev_pagemap pgmap; + struct percpu_ref ref; + struct completion cmp; };
static inline struct dev_dax *to_dev_dax(struct device *dev) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 20759c3f0b91..d164d39fb401 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ +#include <linux/memremap.h> #include <linux/pagemap.h> #include <linux/module.h> #include <linux/device.h> @@ -13,6 +14,38 @@ #include "dax-private.h" #include "bus.h"
+static struct dev_dax *ref_to_dev_dax(struct percpu_ref *ref) +{ + return container_of(ref, struct dev_dax, ref); +} + +static void dev_dax_percpu_release(struct percpu_ref *ref) +{ + struct dev_dax *dev_dax = ref_to_dev_dax(ref); + + dev_dbg(&dev_dax->dev, "%s\n", __func__); + complete(&dev_dax->cmp); +} + +static void dev_dax_percpu_exit(void *data) +{ + struct percpu_ref *ref = data; + struct dev_dax *dev_dax = ref_to_dev_dax(ref); + + dev_dbg(&dev_dax->dev, "%s\n", __func__); + wait_for_completion(&dev_dax->cmp); + percpu_ref_exit(ref); +} + +static void dev_dax_percpu_kill(struct percpu_ref *data) +{ + struct percpu_ref *ref = data; + struct dev_dax *dev_dax = ref_to_dev_dax(ref); + + dev_dbg(&dev_dax->dev, "%s\n", __func__); + percpu_ref_kill(ref); +} + static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, const char *func) { @@ -414,10 +447,38 @@ static int dev_dax_probe(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); struct dax_device *dax_dev = dev_dax->dax_dev; + struct resource *res = &dev_dax->region->res; struct inode *inode; struct cdev *cdev; + void *addr; int rc;
+ /* 1:1 map region resource range to device-dax instance range */ + if (!devm_request_mem_region(dev, res->start, resource_size(res), + dev_name(dev))) { + dev_warn(dev, "could not reserve region %pR\n", res); + return -EBUSY; + } + + init_completion(&dev_dax->cmp); + rc = percpu_ref_init(&dev_dax->ref, dev_dax_percpu_release, 0, + GFP_KERNEL); + if (rc) + return rc; + + rc = devm_add_action_or_reset(dev, dev_dax_percpu_exit, &dev_dax->ref); + if (rc) + return rc; + + dev_dax->pgmap.ref = &dev_dax->ref; + dev_dax->pgmap.kill = dev_dax_percpu_kill; + addr = devm_memremap_pages(dev, &dev_dax->pgmap); + if (IS_ERR(addr)) { + devm_remove_action(dev, dev_dax_percpu_exit, &dev_dax->ref); + percpu_ref_exit(&dev_dax->ref); + return PTR_ERR(addr); + } + inode = dax_inode(dax_dev); cdev = inode->i_cdev; cdev_init(cdev, &dax_fops); diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index e1b50d8fd1f8..d3cefa7868ac 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -18,54 +18,16 @@ #include "../nvdimm/nd.h" #include "bus.h"
-struct dax_pmem { - struct device *dev; - struct percpu_ref ref; - struct dev_pagemap pgmap; - struct completion cmp; -}; - -static struct dax_pmem *to_dax_pmem(struct percpu_ref *ref) -{ - return container_of(ref, struct dax_pmem, ref); -} - -static void dax_pmem_percpu_release(struct percpu_ref *ref) -{ - struct dax_pmem *dax_pmem = to_dax_pmem(ref); - - dev_dbg(dax_pmem->dev, "trace\n"); - complete(&dax_pmem->cmp); -} - -static void dax_pmem_percpu_exit(void *data) -{ - struct percpu_ref *ref = data; - struct dax_pmem *dax_pmem = to_dax_pmem(ref); - - dev_dbg(dax_pmem->dev, "trace\n"); - wait_for_completion(&dax_pmem->cmp); - percpu_ref_exit(ref); -} - -static void dax_pmem_percpu_kill(struct percpu_ref *ref) -{ - struct dax_pmem *dax_pmem = to_dax_pmem(ref); - - dev_dbg(dax_pmem->dev, "trace\n"); - percpu_ref_kill(ref); -} - static int dax_pmem_probe(struct device *dev) { - void *addr; struct resource res; int rc, id, region_id; + resource_size_t offset; struct nd_pfn_sb *pfn_sb; struct dev_dax *dev_dax; - struct dax_pmem *dax_pmem; struct nd_namespace_io *nsio; struct dax_region *dax_region; + struct dev_pagemap pgmap = { 0 }; struct nd_namespace_common *ndns; struct nd_dax *nd_dax = to_nd_dax(dev); struct nd_pfn *nd_pfn = &nd_dax->nd_pfn; @@ -75,61 +37,37 @@ static int dax_pmem_probe(struct device *dev) return PTR_ERR(ndns); nsio = to_nd_namespace_io(&ndns->dev);
- dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL); - if (!dax_pmem) - return -ENOMEM; - /* parse the 'pfn' info block via ->rw_bytes */ rc = devm_nsio_enable(dev, nsio); if (rc) return rc; - rc = nvdimm_setup_pfn(nd_pfn, &dax_pmem->pgmap); + rc = nvdimm_setup_pfn(nd_pfn, &pgmap); if (rc) return rc; devm_nsio_disable(dev, nsio);
- pfn_sb = nd_pfn->pfn_sb; - - if (!devm_request_mem_region(dev, nsio->res.start, - resource_size(&nsio->res), + /* reserve the metadata area, device-dax will reserve the data */ + pfn_sb = nd_pfn->pfn_sb; + offset = le64_to_cpu(pfn_sb->dataoff); + if (!devm_request_mem_region(dev, nsio->res.start, offset, dev_name(&ndns->dev))) { - dev_warn(dev, "could not reserve region %pR\n", &nsio->res); - return -EBUSY; - } - - dax_pmem->dev = dev; - init_completion(&dax_pmem->cmp); - rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0, - GFP_KERNEL); - if (rc) - return rc; - - rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref); - if (rc) { - percpu_ref_exit(&dax_pmem->ref); - return rc; - } - - dax_pmem->pgmap.ref = &dax_pmem->ref; - dax_pmem->pgmap.kill = dax_pmem_percpu_kill; - addr = devm_memremap_pages(dev, &dax_pmem->pgmap); - if (IS_ERR(addr)) - return PTR_ERR(addr); - - /* adjust the dax_region resource to the start of data */ - memcpy(&res, &dax_pmem->pgmap.res, sizeof(res)); - res.start += le64_to_cpu(pfn_sb->dataoff); + dev_warn(dev, "could not reserve metadata\n"); + return -EBUSY; + }
rc = sscanf(dev_name(&ndns->dev), "namespace%d.%d", ®ion_id, &id); if (rc != 2) return -EINVAL;
+ /* adjust the dax_region resource to the start of data */ + memcpy(&res, &pgmap.res, sizeof(res)); + res.start += offset; dax_region = alloc_dax_region(dev, region_id, &res, le32_to_cpu(pfn_sb->align), PFN_DEV|PFN_MAP); if (!dax_region) return -ENOMEM;
- dev_dax = devm_create_dev_dax(dax_region, id); + dev_dax = devm_create_dev_dax(dax_region, id, &pgmap);
/* child dev_dax instances now own the lifetime of the dax_region */ dax_region_put(dax_region);
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit d200781ef237a354d918ceff5cee350d88a93d42 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d200781ef237a354d918ceff5cee350d88a93d42 upstream.
Introduce the 'new_id' concept for enabling a custom device-driver attach policy for dax-bus drivers. The intended use is to have a mechanism for hot-plugging device-dax ranges into the page allocator on-demand. With this in place the default policy of using device-dax for performance differentiated memory can be overridden by user-space policy that can arrange for the memory range to be managed as 'System RAM' with user-defined NUMA and other performance attributes.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/bus.c | 145 +++++++++++++++++++++++++++++++++++++++++-- drivers/dax/bus.h | 10 ++- drivers/dax/device.c | 11 ++-- 3 files changed, 156 insertions(+), 10 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 69aae2cbd45f..17af6fbc3be5 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -2,11 +2,21 @@ /* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */ #include <linux/memremap.h> #include <linux/device.h> +#include <linux/mutex.h> +#include <linux/list.h> #include <linux/slab.h> #include <linux/dax.h> #include "dax-private.h" #include "bus.h"
+static DEFINE_MUTEX(dax_bus_lock); + +#define DAX_NAME_LEN 30 +struct dax_id { + struct list_head list; + char dev_name[DAX_NAME_LEN]; +}; + static int dax_bus_uevent(struct device *dev, struct kobj_uevent_env *env) { /* @@ -16,22 +26,115 @@ static int dax_bus_uevent(struct device *dev, struct kobj_uevent_env *env) return add_uevent_var(env, "MODALIAS=" DAX_DEVICE_MODALIAS_FMT, 0); }
+static struct dax_device_driver *to_dax_drv(struct device_driver *drv) +{ + return container_of(drv, struct dax_device_driver, drv); +} + +static struct dax_id *__dax_match_id(struct dax_device_driver *dax_drv, + const char *dev_name) +{ + struct dax_id *dax_id; + + lockdep_assert_held(&dax_bus_lock); + + list_for_each_entry(dax_id, &dax_drv->ids, list) + if (sysfs_streq(dax_id->dev_name, dev_name)) + return dax_id; + return NULL; +} + +static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev) +{ + int match; + + mutex_lock(&dax_bus_lock); + match = !!__dax_match_id(dax_drv, dev_name(dev)); + mutex_unlock(&dax_bus_lock); + + return match; +} + +static ssize_t do_id_store(struct device_driver *drv, const char *buf, + size_t count, bool add) +{ + struct dax_device_driver *dax_drv = to_dax_drv(drv); + unsigned int region_id, id; + char devname[DAX_NAME_LEN]; + struct dax_id *dax_id; + ssize_t rc = count; + int fields; + + fields = sscanf(buf, "dax%d.%d", ®ion_id, &id); + if (fields != 2) + return -EINVAL; + sprintf(devname, "dax%d.%d", region_id, id); + if (!sysfs_streq(buf, devname)) + return -EINVAL; + + mutex_lock(&dax_bus_lock); + dax_id = __dax_match_id(dax_drv, buf); + if (!dax_id) { + if (add) { + dax_id = kzalloc(sizeof(*dax_id), GFP_KERNEL); + if (dax_id) { + strncpy(dax_id->dev_name, buf, DAX_NAME_LEN); + list_add(&dax_id->list, &dax_drv->ids); + } else + rc = -ENOMEM; + } else + /* nothing to remove */; + } else if (!add) { + list_del(&dax_id->list); + kfree(dax_id); + } else + /* dax_id already added */; + mutex_unlock(&dax_bus_lock); + return rc; +} + +static ssize_t new_id_store(struct device_driver *drv, const char *buf, + size_t count) +{ + return do_id_store(drv, buf, count, true); +} +static DRIVER_ATTR_WO(new_id); + +static ssize_t remove_id_store(struct device_driver *drv, const char *buf, + size_t count) +{ + return do_id_store(drv, buf, count, false); +} +static DRIVER_ATTR_WO(remove_id); + +static struct attribute *dax_drv_attrs[] = { + &driver_attr_new_id.attr, + &driver_attr_remove_id.attr, + NULL, +}; +ATTRIBUTE_GROUPS(dax_drv); + static int dax_bus_match(struct device *dev, struct device_driver *drv);
static struct bus_type dax_bus_type = { .name = "dax", .uevent = dax_bus_uevent, .match = dax_bus_match, + .drv_groups = dax_drv_groups, };
static int dax_bus_match(struct device *dev, struct device_driver *drv) { + struct dax_device_driver *dax_drv = to_dax_drv(drv); + /* - * The drivers that can register on the 'dax' bus are private to - * drivers/dax/ so any device and driver on the bus always - * match. + * All but the 'device-dax' driver, which has 'match_always' + * set, requires an exact id match. */ - return 1; + if (dax_drv->match_always) + return 1; + + return dax_match_id(dax_drv, dev); }
/* @@ -273,17 +376,49 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, } EXPORT_SYMBOL_GPL(devm_create_dev_dax);
-int __dax_driver_register(struct device_driver *drv, +static int match_always_count; + +int __dax_driver_register(struct dax_device_driver *dax_drv, struct module *module, const char *mod_name) { + struct device_driver *drv = &dax_drv->drv; + int rc = 0; + + INIT_LIST_HEAD(&dax_drv->ids); drv->owner = module; drv->name = mod_name; drv->mod_name = mod_name; drv->bus = &dax_bus_type; + + /* there can only be one default driver */ + mutex_lock(&dax_bus_lock); + match_always_count += dax_drv->match_always; + if (match_always_count > 1) { + match_always_count--; + WARN_ON(1); + rc = -EINVAL; + } + mutex_unlock(&dax_bus_lock); + if (rc) + return rc; return driver_register(drv); } EXPORT_SYMBOL_GPL(__dax_driver_register);
+void dax_driver_unregister(struct dax_device_driver *dax_drv) +{ + struct dax_id *dax_id, *_id; + + mutex_lock(&dax_bus_lock); + match_always_count -= dax_drv->match_always; + list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) { + list_del(&dax_id->list); + kfree(dax_id); + } + mutex_unlock(&dax_bus_lock); +} +EXPORT_SYMBOL_GPL(dax_driver_unregister); + int __init dax_bus_init(void) { return bus_register(&dax_bus_type); diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index e08e0c394983..395ab812367c 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -12,10 +12,18 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, unsigned int align, unsigned long flags); struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, struct dev_pagemap *pgmap); -int __dax_driver_register(struct device_driver *drv, + +struct dax_device_driver { + struct device_driver drv; + struct list_head ids; + int match_always; +}; + +int __dax_driver_register(struct dax_device_driver *dax_drv, struct module *module, const char *mod_name); #define dax_driver_register(driver) \ __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME) +void dax_driver_unregister(struct dax_device_driver *dax_drv); void kill_dev_dax(struct dev_dax *dev_dax);
/* diff --git a/drivers/dax/device.c b/drivers/dax/device.c index d164d39fb401..7b64fd828008 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -502,9 +502,12 @@ static int dev_dax_remove(struct device *dev) return 0; }
-static struct device_driver device_dax_driver = { - .probe = dev_dax_probe, - .remove = dev_dax_remove, +static struct dax_device_driver device_dax_driver = { + .drv = { + .probe = dev_dax_probe, + .remove = dev_dax_remove, + }, + .match_always = 1, };
static int __init dax_init(void) @@ -514,7 +517,7 @@ static int __init dax_init(void)
static void __exit dax_exit(void) { - driver_unregister(&device_dax_driver); + dax_driver_unregister(&device_dax_driver); }
MODULE_AUTHOR("Intel Corporation");
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 730926c3b0998943654019f00296cf8e3b02277e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 730926c3b0998943654019f00296cf8e3b02277e upstream.
On the expectation that some environments may not upgrade libdaxctl (userspace component that depends on the /sys/class/dax hierarchy), provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat driver implements the original /sys/class/dax sysfs layout rather than /sys/bus/dax. When userspace is upgraded it can blacklist this module and switch to the dax_pmem driver going forward.
CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according to the dax_pmem entry in Documentation/ABI/obsolete/.
Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ABI/obsolete/sysfs-class-dax | 22 +++++++ drivers/dax/Kconfig | 12 +++- drivers/dax/Makefile | 4 +- drivers/dax/bus.c | 29 +++++++-- drivers/dax/bus.h | 26 +++++++- drivers/dax/device.c | 9 ++- drivers/dax/pmem/Makefile | 7 +++ drivers/dax/pmem/compat.c | 73 ++++++++++++++++++++++ drivers/dax/{pmem.c => pmem/core.c} | 51 +++++---------- drivers/dax/pmem/pmem.c | 40 ++++++++++++ tools/testing/nvdimm/Kbuild | 6 +- 11 files changed, 229 insertions(+), 50 deletions(-) create mode 100644 Documentation/ABI/obsolete/sysfs-class-dax create mode 100644 drivers/dax/pmem/Makefile create mode 100644 drivers/dax/pmem/compat.c rename drivers/dax/{pmem.c => pmem/core.c} (57%) create mode 100644 drivers/dax/pmem/pmem.c
diff --git a/Documentation/ABI/obsolete/sysfs-class-dax b/Documentation/ABI/obsolete/sysfs-class-dax new file mode 100644 index 000000000000..2cb9fc5e8bd1 --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-class-dax @@ -0,0 +1,22 @@ +What: /sys/class/dax/ +Date: May, 2016 +KernelVersion: v4.7 +Contact: linux-nvdimm@lists.01.org +Description: Device DAX is the device-centric analogue of Filesystem + DAX (CONFIG_FS_DAX). It allows memory ranges to be + allocated and mapped without need of an intervening file + system. Device DAX is strict, precise and predictable. + Specifically this interface: + + 1/ Guarantees fault granularity with respect to a given + page size (pte, pmd, or pud) set at configuration time. + + 2/ Enforces deterministic behavior by being strict about + what fault scenarios are supported. + + The /sys/class/dax/ interface enumerates all the + device-dax instances in the system. The ABI is + deprecated and will be removed after 2020. It is + replaced with the DAX bus interface /sys/bus/dax/ where + device-dax instances can be found under + /sys/bus/dax/devices/ diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index e0700bf4893a..6fc96f03920e 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -23,12 +23,22 @@ config DEV_DAX config DEV_DAX_PMEM tristate "PMEM DAX: direct access to persistent memory" depends on LIBNVDIMM && NVDIMM_DAX && DEV_DAX + depends on m # until we can kill DEV_DAX_PMEM_COMPAT default DEV_DAX help Support raw access to persistent memory. Note that this driver consumes memory ranges allocated and exported by the libnvdimm sub-system.
- Say Y if unsure + Say M if unsure + +config DEV_DAX_PMEM_COMPAT + tristate "PMEM DAX: support the deprecated /sys/class/dax interface" + depends on DEV_DAX_PMEM + default DEV_DAX_PMEM + help + Older versions of the libdaxctl library expect to find all + device-dax instances under /sys/class/dax. If libdaxctl in + your distribution is older than v58 say M, otherwise say N.
endif diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 658e6b9b1d74..233bbffccbe6 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -1,9 +1,9 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DAX) += dax.o obj-$(CONFIG_DEV_DAX) += device_dax.o -obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
dax-y := super.o dax-y += bus.o -dax_pmem-y := pmem.o device_dax-y := device.o + +obj-y += pmem/ diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 17af6fbc3be5..568168500217 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -9,6 +9,8 @@ #include "dax-private.h" #include "bus.h"
+static struct class *dax_class; + static DEFINE_MUTEX(dax_bus_lock);
#define DAX_NAME_LEN 30 @@ -310,8 +312,8 @@ static void unregister_dev_dax(void *dev) put_device(dev); }
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, - struct dev_pagemap *pgmap) +struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, + struct dev_pagemap *pgmap, enum dev_dax_subsys subsys) { struct device *parent = dax_region->dev; struct dax_device *dax_dev; @@ -350,7 +352,10 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id,
inode = dax_inode(dax_dev); dev->devt = inode->i_rdev; - dev->bus = &dax_bus_type; + if (subsys == DEV_DAX_BUS) + dev->bus = &dax_bus_type; + else + dev->class = dax_class; dev->parent = parent; dev->groups = dax_attribute_groups; dev->release = dev_dax_release; @@ -374,7 +379,7 @@ struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id,
return ERR_PTR(rc); } -EXPORT_SYMBOL_GPL(devm_create_dev_dax); +EXPORT_SYMBOL_GPL(__devm_create_dev_dax);
static int match_always_count;
@@ -407,6 +412,7 @@ EXPORT_SYMBOL_GPL(__dax_driver_register);
void dax_driver_unregister(struct dax_device_driver *dax_drv) { + struct device_driver *drv = &dax_drv->drv; struct dax_id *dax_id, *_id;
mutex_lock(&dax_bus_lock); @@ -416,15 +422,28 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv) kfree(dax_id); } mutex_unlock(&dax_bus_lock); + driver_unregister(drv); } EXPORT_SYMBOL_GPL(dax_driver_unregister);
int __init dax_bus_init(void) { - return bus_register(&dax_bus_type); + int rc; + + if (IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)) { + dax_class = class_create(THIS_MODULE, "dax"); + if (IS_ERR(dax_class)) + return PTR_ERR(dax_class); + } + + rc = bus_register(&dax_bus_type); + if (rc) + class_destroy(dax_class); + return rc; }
void __exit dax_bus_exit(void) { bus_unregister(&dax_bus_type); + class_destroy(dax_class); } diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 395ab812367c..ce977552ffb5 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -2,7 +2,8 @@ /* Copyright(c) 2016 - 2018 Intel Corporation. All rights reserved. */ #ifndef __DAX_BUS_H__ #define __DAX_BUS_H__ -struct device; +#include <linux/device.h> + struct dev_dax; struct resource; struct dax_device; @@ -10,8 +11,23 @@ struct dax_region; void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, unsigned int align, unsigned long flags); -struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, int id, - struct dev_pagemap *pgmap); + +enum dev_dax_subsys { + DEV_DAX_BUS, + DEV_DAX_CLASS, +}; + +struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, + struct dev_pagemap *pgmap, enum dev_dax_subsys subsys); + +static inline struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, + int id, struct dev_pagemap *pgmap) +{ + return __devm_create_dev_dax(dax_region, id, pgmap, DEV_DAX_BUS); +} + +/* to be deleted when DEV_DAX_CLASS is removed */ +struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys);
struct dax_device_driver { struct device_driver drv; @@ -26,6 +42,10 @@ int __dax_driver_register(struct dax_device_driver *dax_drv, void dax_driver_unregister(struct dax_device_driver *dax_drv); void kill_dev_dax(struct dev_dax *dev_dax);
+#if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT) +int dev_dax_probe(struct device *dev); +#endif + /* * While run_dax() is potentially a generic operation that could be * defined in include/linux/dax.h we don't want to grow any users diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 7b64fd828008..996d68ff992a 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -443,7 +443,7 @@ static void dev_dax_kill(void *dev_dax) kill_dev_dax(dev_dax); }
-static int dev_dax_probe(struct device *dev) +int dev_dax_probe(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); struct dax_device *dax_dev = dev_dax->dax_dev; @@ -482,7 +482,11 @@ static int dev_dax_probe(struct device *dev) inode = dax_inode(dax_dev); cdev = inode->i_cdev; cdev_init(cdev, &dax_fops); - cdev->owner = dev->driver->owner; + if (dev->class) { + /* for the CONFIG_DEV_DAX_PMEM_COMPAT case */ + cdev->owner = dev->parent->driver->owner; + } else + cdev->owner = dev->driver->owner; cdev_set_parent(cdev, &dev->kobj); rc = cdev_add(cdev, dev->devt, 1); if (rc) @@ -495,6 +499,7 @@ static int dev_dax_probe(struct device *dev) run_dax(dax_dev); return devm_add_action_or_reset(dev, dev_dax_kill, dev_dax); } +EXPORT_SYMBOL_GPL(dev_dax_probe);
static int dev_dax_remove(struct device *dev) { diff --git a/drivers/dax/pmem/Makefile b/drivers/dax/pmem/Makefile new file mode 100644 index 000000000000..e2e79bd3fdcf --- /dev/null +++ b/drivers/dax/pmem/Makefile @@ -0,0 +1,7 @@ +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem_core.o +obj-$(CONFIG_DEV_DAX_PMEM_COMPAT) += dax_pmem_compat.o + +dax_pmem-y := pmem.o +dax_pmem_core-y := core.o +dax_pmem_compat-y := compat.o diff --git a/drivers/dax/pmem/compat.c b/drivers/dax/pmem/compat.c new file mode 100644 index 000000000000..d7b15e6f30c5 --- /dev/null +++ b/drivers/dax/pmem/compat.c @@ -0,0 +1,73 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016 - 2018 Intel Corporation. All rights reserved. */ +#include <linux/percpu-refcount.h> +#include <linux/memremap.h> +#include <linux/module.h> +#include <linux/pfn_t.h> +#include <linux/nd.h> +#include "../bus.h" + +/* we need the private definitions to implement compat suport */ +#include "../dax-private.h" + +static int dax_pmem_compat_probe(struct device *dev) +{ + struct dev_dax *dev_dax = __dax_pmem_probe(dev, DEV_DAX_CLASS); + int rc; + + if (IS_ERR(dev_dax)) + return PTR_ERR(dev_dax); + + if (!devres_open_group(&dev_dax->dev, dev_dax, GFP_KERNEL)) + return -ENOMEM; + + device_lock(&dev_dax->dev); + rc = dev_dax_probe(&dev_dax->dev); + device_unlock(&dev_dax->dev); + + devres_close_group(&dev_dax->dev, dev_dax); + if (rc) + devres_release_group(&dev_dax->dev, dev_dax); + + return rc; +} + +static int dax_pmem_compat_release(struct device *dev, void *data) +{ + device_lock(dev); + devres_release_group(dev, to_dev_dax(dev)); + device_unlock(dev); + + return 0; +} + +static int dax_pmem_compat_remove(struct device *dev) +{ + device_for_each_child(dev, NULL, dax_pmem_compat_release); + return 0; +} + +static struct nd_device_driver dax_pmem_compat_driver = { + .probe = dax_pmem_compat_probe, + .remove = dax_pmem_compat_remove, + .drv = { + .name = "dax_pmem_compat", + }, + .type = ND_DRIVER_DAX_PMEM, +}; + +static int __init dax_pmem_compat_init(void) +{ + return nd_driver_register(&dax_pmem_compat_driver); +} +module_init(dax_pmem_compat_init); + +static void __exit dax_pmem_compat_exit(void) +{ + driver_unregister(&dax_pmem_compat_driver.drv); +} +module_exit(dax_pmem_compat_exit); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Intel Corporation"); +MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM); diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem/core.c similarity index 57% rename from drivers/dax/pmem.c rename to drivers/dax/pmem/core.c index d3cefa7868ac..bdcff1b14e95 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem/core.c @@ -1,24 +1,13 @@ -/* - * Copyright(c) 2016 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - */ -#include <linux/percpu-refcount.h> +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016 - 2018 Intel Corporation. All rights reserved. */ #include <linux/memremap.h> #include <linux/module.h> #include <linux/pfn_t.h> -#include "../nvdimm/pfn.h" -#include "../nvdimm/nd.h" -#include "bus.h" +#include "../../nvdimm/pfn.h" +#include "../../nvdimm/nd.h" +#include "../bus.h"
-static int dax_pmem_probe(struct device *dev) +struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) { struct resource res; int rc, id, region_id; @@ -34,16 +23,16 @@ static int dax_pmem_probe(struct device *dev)
ndns = nvdimm_namespace_common_probe(dev); if (IS_ERR(ndns)) - return PTR_ERR(ndns); + return ERR_CAST(ndns); nsio = to_nd_namespace_io(&ndns->dev);
/* parse the 'pfn' info block via ->rw_bytes */ rc = devm_nsio_enable(dev, nsio); if (rc) - return rc; + return ERR_PTR(rc); rc = nvdimm_setup_pfn(nd_pfn, &pgmap); if (rc) - return rc; + return ERR_PTR(rc); devm_nsio_disable(dev, nsio);
/* reserve the metadata area, device-dax will reserve the data */ @@ -52,12 +41,12 @@ static int dax_pmem_probe(struct device *dev) if (!devm_request_mem_region(dev, nsio->res.start, offset, dev_name(&ndns->dev))) { dev_warn(dev, "could not reserve metadata\n"); - return -EBUSY; + return ERR_PTR(-EBUSY); }
rc = sscanf(dev_name(&ndns->dev), "namespace%d.%d", ®ion_id, &id); if (rc != 2) - return -EINVAL; + return ERR_PTR(-EINVAL);
/* adjust the dax_region resource to the start of data */ memcpy(&res, &pgmap.res, sizeof(res)); @@ -65,26 +54,16 @@ static int dax_pmem_probe(struct device *dev) dax_region = alloc_dax_region(dev, region_id, &res, le32_to_cpu(pfn_sb->align), PFN_DEV|PFN_MAP); if (!dax_region) - return -ENOMEM; + return ERR_PTR(-ENOMEM);
- dev_dax = devm_create_dev_dax(dax_region, id, &pgmap); + dev_dax = __devm_create_dev_dax(dax_region, id, &pgmap, subsys);
/* child dev_dax instances now own the lifetime of the dax_region */ dax_region_put(dax_region);
- return PTR_ERR_OR_ZERO(dev_dax); + return dev_dax; } - -static struct nd_device_driver dax_pmem_driver = { - .probe = dax_pmem_probe, - .drv = { - .name = "dax_pmem", - }, - .type = ND_DRIVER_DAX_PMEM, -}; - -module_nd_driver(dax_pmem_driver); +EXPORT_SYMBOL_GPL(__dax_pmem_probe);
MODULE_LICENSE("GPL v2"); MODULE_AUTHOR("Intel Corporation"); -MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM); diff --git a/drivers/dax/pmem/pmem.c b/drivers/dax/pmem/pmem.c new file mode 100644 index 000000000000..0ae4238a0ef8 --- /dev/null +++ b/drivers/dax/pmem/pmem.c @@ -0,0 +1,40 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016 - 2018 Intel Corporation. All rights reserved. */ +#include <linux/percpu-refcount.h> +#include <linux/memremap.h> +#include <linux/module.h> +#include <linux/pfn_t.h> +#include <linux/nd.h> +#include "../bus.h" + +static int dax_pmem_probe(struct device *dev) +{ + return PTR_ERR_OR_ZERO(__dax_pmem_probe(dev, DEV_DAX_BUS)); +} + +static struct nd_device_driver dax_pmem_driver = { + .probe = dax_pmem_probe, + .drv = { + .name = "dax_pmem", + }, + .type = ND_DRIVER_DAX_PMEM, +}; + +static int __init dax_pmem_init(void) +{ + return nd_driver_register(&dax_pmem_driver); +} +module_init(dax_pmem_init); + +static void __exit dax_pmem_exit(void) +{ + driver_unregister(&dax_pmem_driver.drv); +} +module_exit(dax_pmem_exit); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Intel Corporation"); +#if !IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT) +/* For compat builds, don't load this module by default */ +MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM); +#endif diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild index bfc4d8e98452..1e37719bc37e 100644 --- a/tools/testing/nvdimm/Kbuild +++ b/tools/testing/nvdimm/Kbuild @@ -34,6 +34,8 @@ obj-$(CONFIG_DAX) += dax.o endif obj-$(CONFIG_DEV_DAX) += device_dax.o obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem_core.o +obj-$(CONFIG_DEV_DAX_PMEM_COMPAT) += dax_pmem_compat.o
nfit-y := $(ACPI_SRC)/core.o nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o @@ -63,7 +65,9 @@ device_dax-y += dax-dev.o device_dax-y += device_dax_test.o device_dax-y += config_check.o
-dax_pmem-y := $(DAX_SRC)/pmem.o +dax_pmem-y := $(DAX_SRC)/pmem/pmem.o +dax_pmem_core-y := $(DAX_SRC)/pmem/core.o +dax_pmem_compat-y := $(DAX_SRC)/pmem/compat.o dax_pmem-y += config_check.o
libnvdimm-y := $(NVDIMM_SRC)/core.o
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream.
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware Interface Table), is the first known instance of a memory range described by a unique "target" proximity domain. Where "initiator" and "target" proximity domains is an approach that the ACPI HMAT (Heterogeneous Memory Attributes Table) uses to described the unique performance properties of a memory range relative to a given initiator (e.g. CPU or DMA device).
Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y char-device follows the traditional notion of 'numa-node' where the attribute conveys the closest online numa-node. That numa-node attribute is useful for cpu-binding and memory-binding processes *near* the device. However, when the memory range backing a 'pmem', or 'dax' device is onlined (memory hot-add) the memory-only-numa-node representing that address needs to be differentiated from the set of online nodes. In other words, the numa-node association of the device depends on whether you can bind processes *near* the cpu-numa-node in the offline device-case, or bind process *on* the memory-range directly after the backing address range is onlined.
Allow for the case that platform firmware describes persistent memory with a unique proximity domain, i.e. when it is distinct from the proximity of DRAM and CPUs that are on the same socket. Plumb the Linux numa-node translation of that proximity through the libnvdimm region device to namespaces that are in device-dax mode. With this in place the proposed kmem driver [1] can optionally discover a unique numa-node number for the address range as it transitions the memory from an offline state managed by a device-driver to an online memory range managed by the core-mm.
[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
Reported-by: Fan Du fan.du@intel.com Cc: Michael Ellerman mpe@ellerman.id.au Cc: "Oliver O'Halloran" oohall@gmail.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Jérôme Glisse jglisse@redhat.com Reviewed-by: Yang Shi yang.shi@linux.alibaba.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/nfit/core.c | 8 ++++++-- drivers/acpi/numa.c | 1 + drivers/dax/bus.c | 4 +++- drivers/dax/bus.h | 3 ++- drivers/dax/dax-private.h | 4 ++++ drivers/dax/pmem/core.c | 4 +++- drivers/nvdimm/e820.c | 1 + drivers/nvdimm/nd.h | 2 +- drivers/nvdimm/of_pmem.c | 1 + drivers/nvdimm/region_devs.c | 1 + include/linux/acpi.h | 5 +++++ include/linux/libnvdimm.h | 1 + 12 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 8340c81b258b..af280ae0cc7c 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2804,11 +2804,15 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc, ndr_desc->res = &res; ndr_desc->provider_data = nfit_spa; ndr_desc->attr_groups = acpi_nfit_region_attribute_groups; - if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) + if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) { ndr_desc->numa_node = acpi_map_pxm_to_online_node( spa->proximity_domain); - else + ndr_desc->target_node = acpi_map_pxm_to_node( + spa->proximity_domain); + } else { ndr_desc->numa_node = NUMA_NO_NODE; + ndr_desc->target_node = NUMA_NO_NODE; + }
/* * Persistence domain bits are hierarchical, if diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c index a28ff3cfbc29..63d4f6b15e4d 100644 --- a/drivers/acpi/numa.c +++ b/drivers/acpi/numa.c @@ -85,6 +85,7 @@ int acpi_map_pxm_to_node(int pxm)
return node; } +EXPORT_SYMBOL(acpi_map_pxm_to_node);
/** * acpi_map_pxm_to_online_node - Map proximity ID to online node diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 568168500217..c620ad52d7e5 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -214,7 +214,7 @@ static void dax_region_unregister(void *region) }
struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, unsigned int align, + struct resource *res, int target_node, unsigned int align, unsigned long pfn_flags) { struct dax_region *dax_region; @@ -244,6 +244,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, dax_region->id = region_id; dax_region->align = align; dax_region->dev = parent; + dax_region->target_node = target_node; if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) { kfree(dax_region); return NULL; @@ -348,6 +349,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
dev_dax->dax_dev = dax_dev; dev_dax->region = dax_region; + dev_dax->target_node = dax_region->target_node; kref_get(&dax_region->kref);
inode = dax_inode(dax_dev); diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index ce977552ffb5..8619e3299943 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -10,7 +10,8 @@ struct dax_device; struct dax_region; void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, unsigned int align, unsigned long flags); + struct resource *res, int target_node, unsigned int align, + unsigned long flags);
enum dev_dax_subsys { DEV_DAX_BUS, diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index a82ce48f5884..a45612148ca0 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -26,6 +26,7 @@ void dax_bus_exit(void); /** * struct dax_region - mapping infrastructure for dax devices * @id: kernel-wide unique region for a memory range + * @target_node: effective numa node if this memory range is onlined * @kref: to pin while other agents have a need to do lookups * @dev: parent device backing this region * @align: allocation and mapping alignment for child dax devices @@ -34,6 +35,7 @@ void dax_bus_exit(void); */ struct dax_region { int id; + int target_node; struct kref kref; struct device *dev; unsigned int align; @@ -46,6 +48,7 @@ struct dax_region { * data while the device is activated in the driver. * @region - parent region * @dax_dev - core dax functionality + * @target_node: effective numa node if dev_dax memory range is onlined * @dev - device core * @pgmap - pgmap for memmap setup / lifetime (driver owned) * @ref: pgmap reference count (driver owned) @@ -54,6 +57,7 @@ struct dax_region { struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; + int target_node; struct device dev; struct dev_pagemap pgmap; struct percpu_ref ref; diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c index bdcff1b14e95..f71019ce0647 100644 --- a/drivers/dax/pmem/core.c +++ b/drivers/dax/pmem/core.c @@ -20,6 +20,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) struct nd_namespace_common *ndns; struct nd_dax *nd_dax = to_nd_dax(dev); struct nd_pfn *nd_pfn = &nd_dax->nd_pfn; + struct nd_region *nd_region = to_nd_region(dev->parent);
ndns = nvdimm_namespace_common_probe(dev); if (IS_ERR(ndns)) @@ -52,7 +53,8 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) memcpy(&res, &pgmap.res, sizeof(res)); res.start += offset; dax_region = alloc_dax_region(dev, region_id, &res, - le32_to_cpu(pfn_sb->align), PFN_DEV|PFN_MAP); + nd_region->target_node, le32_to_cpu(pfn_sb->align), + PFN_DEV|PFN_MAP); if (!dax_region) return ERR_PTR(-ENOMEM);
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c index 521eaf53a52a..36be9b619187 100644 --- a/drivers/nvdimm/e820.c +++ b/drivers/nvdimm/e820.c @@ -47,6 +47,7 @@ static int e820_register_one(struct resource *res, void *data) ndr_desc.res = res; ndr_desc.attr_groups = e820_pmem_region_attribute_groups; ndr_desc.numa_node = e820_range_to_nid(res->start); + ndr_desc.target_node = ndr_desc.numa_node; set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) return -ENXIO; diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index 01e194a5824e..5259b8953cc6 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -157,7 +157,7 @@ struct nd_region { u16 ndr_mappings; u64 ndr_size; u64 ndr_start; - int id, num_lanes, ro, numa_node; + int id, num_lanes, ro, numa_node, target_node; void *provider_data; struct kernfs_node *bb_state; struct badblocks bb; diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c index 0a701837dfc0..ecaaa27438e2 100644 --- a/drivers/nvdimm/of_pmem.c +++ b/drivers/nvdimm/of_pmem.c @@ -68,6 +68,7 @@ static int of_pmem_region_probe(struct platform_device *pdev) memset(&ndr_desc, 0, sizeof(ndr_desc)); ndr_desc.attr_groups = region_attr_groups; ndr_desc.numa_node = dev_to_node(&pdev->dev); + ndr_desc.target_node = ndr_desc.numa_node; ndr_desc.res = &pdev->resource[i]; ndr_desc.of_node = np; set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index 609fc450522a..a5c80767d81b 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -1064,6 +1064,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus, nd_region->flags = ndr_desc->flags; nd_region->ro = ro; nd_region->numa_node = ndr_desc->numa_node; + nd_region->target_node = ndr_desc->target_node; ida_init(&nd_region->ns_ida); ida_init(&nd_region->btt_ida); ida_init(&nd_region->pfn_ida); diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 5844342e5737..698da32ea4a0 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -400,12 +400,17 @@ extern bool acpi_osi_is_win8(void);
#ifdef CONFIG_ACPI_NUMA int acpi_map_pxm_to_online_node(int pxm); +int acpi_map_pxm_to_node(int pxm); int acpi_get_node(acpi_handle handle); #else static inline int acpi_map_pxm_to_online_node(int pxm) { return 0; } +static inline int acpi_map_pxm_to_node(int pxm) +{ + return 0; +} static inline int acpi_get_node(acpi_handle handle) { return 0; diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 097072c5a852..941102c0c81f 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -124,6 +124,7 @@ struct nd_region_desc { void *provider_data; int num_lanes; int numa_node; + int target_node; unsigned long flags; struct device_node *of_node; };
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 664525b2d84abca1074c9546654ae9689de8a818 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 664525b2d84abca1074c9546654ae9689de8a818 upstream.
The typical 'new_id' attribute behavior is to immediately attach a device to its driver after a new device-id is added. Implement this behavior for the dax bus.
Reported-by: Alexander Duyck alexander.h.duyck@linux.intel.com Reported-by: Brice Goglin Brice.Goglin@inria.fr Cc: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/bus.c | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index c620ad52d7e5..a410154d75fb 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -57,8 +57,13 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev) return match; }
+enum id_action { + ID_REMOVE, + ID_ADD, +}; + static ssize_t do_id_store(struct device_driver *drv, const char *buf, - size_t count, bool add) + size_t count, enum id_action action) { struct dax_device_driver *dax_drv = to_dax_drv(drv); unsigned int region_id, id; @@ -77,7 +82,7 @@ static ssize_t do_id_store(struct device_driver *drv, const char *buf, mutex_lock(&dax_bus_lock); dax_id = __dax_match_id(dax_drv, buf); if (!dax_id) { - if (add) { + if (action == ID_ADD) { dax_id = kzalloc(sizeof(*dax_id), GFP_KERNEL); if (dax_id) { strncpy(dax_id->dev_name, buf, DAX_NAME_LEN); @@ -86,26 +91,33 @@ static ssize_t do_id_store(struct device_driver *drv, const char *buf, rc = -ENOMEM; } else /* nothing to remove */; - } else if (!add) { + } else if (action == ID_REMOVE) { list_del(&dax_id->list); kfree(dax_id); } else /* dax_id already added */; mutex_unlock(&dax_bus_lock); - return rc; + + if (rc < 0) + return rc; + if (action == ID_ADD) + rc = driver_attach(drv); + if (rc) + return rc; + return count; }
static ssize_t new_id_store(struct device_driver *drv, const char *buf, size_t count) { - return do_id_store(drv, buf, count, true); + return do_id_store(drv, buf, count, ID_ADD); } static DRIVER_ATTR_WO(new_id);
static ssize_t remove_id_store(struct device_driver *drv, const char *buf, size_t count) { - return do_id_store(drv, buf, count, false); + return do_id_store(drv, buf, count, ID_REMOVE); } static DRIVER_ATTR_WO(remove_id);
From: Dan Williams dan.j.williams@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 21c75763a3ae18679e5c4e2260aa9379b073566b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 21c75763a3ae18679e5c4e2260aa9379b073566b upstream.
The target-node attribute is the Linux numa-node that a device-dax instance may create when it is online. Prior to being online the device's 'numa_node' property reflects the closest online cpu node which is the typical expectation of a device 'numa_node'. Once it is online it becomes its own distinct numa node, i.e. 'target_node'.
Export the 'target_node' property to give userspace tooling the ability to predict the effective numa-node from a device-dax instance configured to provide 'System RAM' capacity.
Cc: Vishal Verma vishal.l.verma@intel.com Reported-by: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/bus.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index a410154d75fb..28c3324271ac 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -279,13 +279,41 @@ static ssize_t size_show(struct device *dev, } static DEVICE_ATTR_RO(size);
+static int dev_dax_target_node(struct dev_dax *dev_dax) +{ + struct dax_region *dax_region = dev_dax->region; + + return dax_region->target_node; +} + +static ssize_t target_node_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%d\n", dev_dax_target_node(dev_dax)); +} +static DEVICE_ATTR_RO(target_node); + +static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) +{ + struct device *dev = container_of(kobj, struct device, kobj); + struct dev_dax *dev_dax = to_dev_dax(dev); + + if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0) + return 0; + return a->mode; +} + static struct attribute *dev_dax_attributes[] = { &dev_attr_size.attr, + &dev_attr_target_node.attr, NULL, };
static const struct attribute_group dev_dax_attribute_group = { .attrs = dev_dax_attributes, + .is_visible = dev_dax_visible, };
static const struct attribute_group *dax_attribute_groups[] = {
From: Vishal Verma vishal.l.verma@intel.com
mainline inclusion from mainline-v5.1-rc1 commit c347bd71dcdb2d0ac8b3a771486584dca8c8dd80 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit c347bd71dcdb2d0ac8b3a771486584dca8c8dd80 upstream.
Add a 'modalias' attribute to devices under the DAX bus so that userspace is able to dynamically load modules as needed.
Normally, udev can get the modalias from 'uevent', and that is correctly set up by the DAX bus. However other tooling such as 'libndctl' for interacting with drivers/nvdimm/, and 'libdaxctl' for drivers/dax/ can also use the modalias to dynamically load modules via libkmod lookups.
The 'nd' bus set up by the libnvdimm subsystem exports a modalias attribute. Imitate this to export the same for the 'dax' bus.
Cc: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Vishal Verma vishal.l.verma@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/bus.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 28c3324271ac..2109cfe80219 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -295,6 +295,17 @@ static ssize_t target_node_show(struct device *dev, } static DEVICE_ATTR_RO(target_node);
+static ssize_t modalias_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + /* + * We only ever expect to handle device-dax instances, i.e. the + * @type argument to MODULE_ALIAS_DAX_DEVICE() is always zero + */ + return sprintf(buf, DAX_DEVICE_MODALIAS_FMT "\n", 0); +} +static DEVICE_ATTR_RO(modalias); + static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); @@ -306,6 +317,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) }
static struct attribute *dev_dax_attributes[] = { + &dev_attr_modalias.attr, &dev_attr_size.attr, &dev_attr_target_node.attr, NULL,
From: Dave Hansen dave.hansen@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit b926b7f3baecb2a855db629e6822e1a85212e91c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream.
HMM consumes physical address space for its own use, even though nothing is mapped or accessible there. It uses a special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY) to uniquely identify these areas.
When HMM consumes address space, it makes a best guess about what to consume. However, it is possible that a future memory or device hotplug can collide with the reserved area. In the case of these conflicts, there is an error message in register_memory_resource().
Later patches in this series move register_memory_resource() from using request_resource_conflict() to __request_region(). Unfortunately, __request_region() does not return the conflict like the previous function did, which makes it impossible to check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting resource.
Instead of warning in register_memory_resource(), move the check into the core resource code itself (__request_region()) where the conflicting resource _is_ available. This has the added bonus of producing a warning in case of HMM conflicts with devices *or* RAM address space, as opposed to the RAM- only warnings that were there previously.
Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Jerome Glisse jglisse@redhat.com Cc: Dan Williams dan.j.williams@intel.com Cc: Dave Jiang dave.jiang@intel.com Cc: Ross Zwisler zwisler@kernel.org Cc: Vishal Verma vishal.l.verma@intel.com Cc: Tom Lendacky thomas.lendacky@amd.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying ying.huang@intel.com Cc: Fengguang Wu fengguang.wu@intel.com Cc: Keith Busch keith.busch@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/resource.c | 9 +++++++++ mm/memory_hotplug.c | 5 ----- 2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/resource.c b/kernel/resource.c index 7dd11c78c738..2cf0edd8223f 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1126,6 +1126,15 @@ struct resource * __request_region(struct resource *parent, conflict = __request_resource(parent, res); if (!conflict) break; + /* + * mm/hmm.c reserves physical addresses which then + * become unavailable to other users. Conflicts are + * not expected. Warn to aid debugging if encountered. + */ + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) { + pr_warn("Unaddressable device %s %pR conflicts with %pR", + conflict->name, conflict, res); + } if (conflict != parent) { if (!(conflict->flags & IORESOURCE_BUSY)) { parent = conflict; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2de62385e336..3f4cf3ab7b51 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -111,11 +111,6 @@ static struct resource *register_memory_resource(u64 start, u64 size) res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; conflict = request_resource_conflict(&iomem_resource, res); if (conflict) { - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) { - pr_debug("Device unaddressable memory block " - "memory hotplug at %#010llx !\n", - (unsigned long long)start); - } pr_debug("System RAM resource %pR cannot be added\n", res); kfree(res); return ERR_PTR(-EEXIST);
From: Dave Hansen dave.hansen@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit 2794129e902d8eb69413d884dc6404b8716ed9ed category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream.
The mm/resource.c code is used to manage the physical address space. The current resource configuration can be viewed in /proc/iomem. An example of this is at the bottom of this description.
The nvdimm subsystem "owns" the physical address resources which map to persistent memory and has resources inserted for them as "Persistent Memory". The best way to repurpose this for volatile use is to leave the existing resource in place, but add a "System RAM" resource underneath it. This clearly communicates the ownership relationship of this memory.
The request_resource_conflict() API only deals with the top-level resources. Replace it with __request_region() which will search for !IORESOURCE_BUSY areas lower in the resource tree than the top level.
We *could* also simply truncate the existing top-level "Persistent Memory" resource and take over the released address space. But, this means that if we ever decide to hot-unplug the "RAM" and give it back, we need to recreate the original setup, which may mean going back to the BIOS tables.
This should have no real effect on the existing collision detection because the areas that truly conflict should be marked IORESOURCE_BUSY.
00000000-00000fff : Reserved 00001000-0009fbff : System RAM 0009fc00-0009ffff : Reserved 000a0000-000bffff : PCI Bus 0000:00 000c0000-000c97ff : Video ROM 000c9800-000ca5ff : Adapter ROM 000f0000-000fffff : Reserved 000f0000-000fffff : System ROM 00100000-9fffffff : System RAM 01000000-01e071d0 : Kernel code 01e071d1-027dfdff : Kernel data 02dc6000-0305dfff : Kernel bss a0000000-afffffff : Persistent Memory (legacy) a0000000-a7ffffff : System RAM b0000000-bffdffff : System RAM bffe0000-bfffffff : Reserved c0000000-febfffff : PCI Bus 0000:00
Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Dan Williams dan.j.williams@intel.com Reviewed-by: Vishal Verma vishal.l.verma@intel.com Cc: Dave Jiang dave.jiang@intel.com Cc: Ross Zwisler zwisler@kernel.org Cc: Vishal Verma vishal.l.verma@intel.com Cc: Tom Lendacky thomas.lendacky@amd.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying ying.huang@intel.com Cc: Fengguang Wu fengguang.wu@intel.com Cc: Borislav Petkov bp@suse.de Cc: Bjorn Helgaas bhelgaas@google.com Cc: Yaowei Bai baiyaowei@cmss.chinamobile.com Cc: Takashi Iwai tiwai@suse.de Cc: Jerome Glisse jglisse@redhat.com Cc: Keith Busch keith.busch@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory_hotplug.c | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 3f4cf3ab7b51..135348d1302c 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -100,19 +100,21 @@ void mem_hotplug_done(void) /* add this memory to iomem resource */ static struct resource *register_memory_resource(u64 start, u64 size) { - struct resource *res, *conflict; - res = kzalloc(sizeof(struct resource), GFP_KERNEL); - if (!res) - return ERR_PTR(-ENOMEM); - - res->name = "System RAM"; - res->start = start; - res->end = start + size - 1; - res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; - conflict = request_resource_conflict(&iomem_resource, res); - if (conflict) { - pr_debug("System RAM resource %pR cannot be added\n", res); - kfree(res); + struct resource *res; + unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; + char *resource_name = "System RAM"; + + /* + * Request ownership of the new memory range. This might be + * a child of an existing resource that was present but + * not marked as busy. + */ + res = __request_region(&iomem_resource, start, size, + resource_name, flags); + + if (!res) { + pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n", + start, start + size); return ERR_PTR(-EEXIST); } return res;
From: Dave Hansen dave.hansen@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
In the process of onlining memory, we use walk_system_ram_range() to find the actual RAM areas inside of the area being onlined.
However, it currently only finds memory resources which are "top-level" iomem_resources. Children are not currently searched which causes it to skip System RAM in areas like this (in the format of /proc/iomem):
a0000000-bfffffff : Persistent Memory (legacy) a0000000-afffffff : System RAM
Changing the true->false here allows children to be searched as well. We need this because we add a new "System RAM" resource underneath the "persistent memory" resource when we use persistent memory in a volatile mode.
Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Cc: Keith Busch keith.busch@intel.com Cc: Dan Williams dan.j.williams@intel.com Cc: Dave Jiang dave.jiang@intel.com Cc: Ross Zwisler zwisler@kernel.org Cc: Vishal Verma vishal.l.verma@intel.com Cc: Tom Lendacky thomas.lendacky@amd.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying ying.huang@intel.com Cc: Fengguang Wu fengguang.wu@intel.com Cc: Borislav Petkov bp@suse.de Cc: Bjorn Helgaas bhelgaas@google.com Cc: Yaowei Bai baiyaowei@cmss.chinamobile.com Cc: Takashi Iwai tiwai@suse.de Cc: Jerome Glisse jglisse@redhat.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/resource.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/resource.c b/kernel/resource.c index 2cf0edd8223f..55a3c6010682 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -448,6 +448,9 @@ int walk_mem_res(u64 start, u64 end, void *arg, * This function calls the @func callback against all memory ranges of type * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY. * It is to be used only for System RAM. + * + * This will find System RAM ranges that are children of top-level resources + * in addition to top-level System RAM resources. */ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, void *arg, int (*func)(unsigned long, unsigned long, void *)) @@ -463,7 +466,7 @@ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages, flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; while (start < end && !find_next_iomem_res(start, end, flags, IORES_DESC_NONE, - true, &res)) { + false, &res)) { pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT; end_pfn = (res.end + 1) >> PAGE_SHIFT; if (end_pfn > pfn)
From: Dave Hansen dave.hansen@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 upstream.
This is intended for use with NVDIMMs that are physically persistent (physically like flash) so that they can be used as a cost-effective RAM replacement. Intel Optane DC persistent memory is one implementation of this kind of NVDIMM.
Currently, a persistent memory region is "owned" by a device driver, either the "Direct DAX" or "Filesystem DAX" drivers. These drivers allow applications to explicitly use persistent memory, generally by being modified to use special, new libraries. (DIMM-based persistent memory hardware/software is described in great detail here: Documentation/nvdimm/nvdimm.txt).
However, this limits persistent memory use to applications which *have* been modified. To make it more broadly usable, this driver "hotplugs" memory into the kernel, to be managed and used just like normal RAM would be.
To make this work, management software must remove the device from being controlled by the "Device DAX" infrastructure:
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
and then tell the new driver that it can bind to the device:
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
After this, there will be a number of new memory sections visible in sysfs that can be onlined, or that may get onlined by existing udev-initiated memory hotplug rules.
This rebinding procedure is currently a one-way trip. Once memory is bound to "kmem", it's there permanently and can not be unbound and assigned back to device_dax.
The kmem driver will never bind to a dax device unless the device is *explicitly* bound to the driver. There are two reasons for this: One, since it is a one-way trip, it can not be undone if bound incorrectly. Two, the kmem driver destroys data on the device. Think of if you had good data on a pmem device. It would be catastrophic if you compile-in "kmem", but leave out the "device_dax" driver. kmem would take over the device and write volatile data all over your good data.
This inherits any existing NUMA information for the newly-added memory from the persistent memory device that came from the firmware. On Intel platforms, the firmware has guarantees that require each socket's persistent memory to be in a separate memory-only NUMA node. That means that this patch is not expected to create NUMA nodes, but will simply hotplug memory into existing nodes.
Because NUMA nodes are created, the existing NUMA APIs and tools are sufficient to create policies for applications or memory areas to have affinity for or an aversion to using this memory.
There is currently some metadata at the beginning of pmem regions. The section-size memory hotplug restrictions, plus this small reserved area can cause the "loss" of a section or two of capacity. This should be fixable in follow-on patches. But, as a first step, losing 256MB of memory (worst case) out of hundreds of gigabytes is a good tradeoff vs. the required code to fix this up precisely. This calculation is also the reason we export memory_block_size_bytes().
Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Dan Williams dan.j.williams@intel.com Reviewed-by: Keith Busch keith.busch@intel.com Cc: Dave Jiang dave.jiang@intel.com Cc: Ross Zwisler zwisler@kernel.org Cc: Vishal Verma vishal.l.verma@intel.com Cc: Tom Lendacky thomas.lendacky@amd.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying ying.huang@intel.com Cc: Fengguang Wu fengguang.wu@intel.com Cc: Borislav Petkov bp@suse.de Cc: Bjorn Helgaas bhelgaas@google.com Cc: Yaowei Bai baiyaowei@cmss.chinamobile.com Cc: Takashi Iwai tiwai@suse.de Cc: Jerome Glisse jglisse@redhat.com Reviewed-by: Vishal Verma vishal.l.verma@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/base/memory.c | 1 + drivers/dax/Kconfig | 16 +++++++ drivers/dax/Makefile | 1 + drivers/dax/kmem.c | 108 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 126 insertions(+) create mode 100644 drivers/dax/kmem.c
diff --git a/drivers/base/memory.c b/drivers/base/memory.c index c4a79040955d..a4fbf5f5a11f 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -93,6 +93,7 @@ unsigned long __weak memory_block_size_bytes(void) { return MIN_MEMORY_BLOCK_SIZE; } +EXPORT_SYMBOL_GPL(memory_block_size_bytes);
static unsigned long get_memory_block_size(void) { diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 6fc96f03920e..5ef624fe3934 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -32,6 +32,22 @@ config DEV_DAX_PMEM
Say M if unsure
+config DEV_DAX_KMEM + tristate "KMEM DAX: volatile-use of persistent memory" + default DEV_DAX + depends on DEV_DAX + depends on MEMORY_HOTPLUG # for add_memory() and friends + help + Support access to persistent memory as if it were RAM. This + allows easier use of persistent memory by unmodified + applications. + + To use this feature, a DAX device must be unbound from the + device_dax driver (PMEM DAX) and bound to this kmem driver + on each boot. + + Say N if unsure. + config DEV_DAX_PMEM_COMPAT tristate "PMEM DAX: support the deprecated /sys/class/dax interface" depends on DEV_DAX_PMEM diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 233bbffccbe6..81f7d54dadfb 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DAX) += dax.o obj-$(CONFIG_DEV_DAX) += device_dax.o +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
dax-y := super.o dax-y += bus.o diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c new file mode 100644 index 000000000000..a02318c6d28a --- /dev/null +++ b/drivers/dax/kmem.c @@ -0,0 +1,108 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016-2019 Intel Corporation. All rights reserved. */ +#include <linux/memremap.h> +#include <linux/pagemap.h> +#include <linux/memory.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/pfn_t.h> +#include <linux/slab.h> +#include <linux/dax.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include "dax-private.h" +#include "bus.h" + +int dev_dax_kmem_probe(struct device *dev) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct resource *res = &dev_dax->region->res; + resource_size_t kmem_start; + resource_size_t kmem_size; + resource_size_t kmem_end; + struct resource *new_res; + int numa_node; + int rc; + + /* + * Ensure good NUMA information for the persistent memory. + * Without this check, there is a risk that slow memory + * could be mixed in a node with faster memory, causing + * unavoidable performance issues. + */ + numa_node = dev_dax->target_node; + if (numa_node < 0) { + dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n", + res, numa_node); + return -EINVAL; + } + + /* Hotplug starting at the beginning of the next block: */ + kmem_start = ALIGN(res->start, memory_block_size_bytes()); + + kmem_size = resource_size(res); + /* Adjust the size down to compensate for moving up kmem_start: */ + kmem_size -= kmem_start - res->start; + /* Align the size down to cover only complete blocks: */ + kmem_size &= ~(memory_block_size_bytes() - 1); + kmem_end = kmem_start + kmem_size; + + /* Region is permanently reserved. Hot-remove not yet implemented. */ + new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev)); + if (!new_res) { + dev_warn(dev, "could not reserve region [%pa-%pa]\n", + &kmem_start, &kmem_end); + return -EBUSY; + } + + /* + * Set flags appropriate for System RAM. Leave ..._BUSY clear + * so that add_memory() can add a child resource. Do not + * inherit flags from the parent since it may set new flags + * unknown to us that will break add_memory() below. + */ + new_res->flags = IORESOURCE_SYSTEM_RAM; + new_res->name = dev_name(dev); + + rc = add_memory(numa_node, new_res->start, resource_size(new_res)); + if (rc) + return rc; + + return 0; +} + +static int dev_dax_kmem_remove(struct device *dev) +{ + /* + * Purposely leak the request_mem_region() for the device-dax + * range and return '0' to ->remove() attempts. The removal of + * the device from the driver always succeeds, but the region + * is permanently pinned as reserved by the unreleased + * request_mem_region(). + */ + return 0; +} + +static struct dax_device_driver device_dax_kmem_driver = { + .drv = { + .probe = dev_dax_kmem_probe, + .remove = dev_dax_kmem_remove, + }, +}; + +static int __init dax_kmem_init(void) +{ + return dax_driver_register(&device_dax_kmem_driver); +} + +static void __exit dax_kmem_exit(void) +{ + dax_driver_unregister(&device_dax_kmem_driver); +} + +MODULE_AUTHOR("Intel Corporation"); +MODULE_LICENSE("GPL v2"); +module_init(dax_kmem_init); +module_exit(dax_kmem_exit); +MODULE_ALIAS_DAX_DEVICE(0);
From: Erik Schmauss erik.schmauss@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 9a8d961f1ef835b0d338fbe13da03cb424e87ae5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9a8d961f1ef835b0d338fbe13da03cb424e87ae5 upstream.
ACPICA commit a216e8ca9f7c79f90788b193e2e61151b2c973c0
This change reserves several field and renames subtable 0 to "memory proximity domain attributes"
Link: https://github.com/acpica/acpica/commit/a216e8ca Signed-off-by: Erik Schmauss erik.schmauss@intel.com Signed-off-by: Bob Moore robert.moore@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/acpi/actbl1.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index ab424509cae9..7df78a9bddb1 100644 --- a/include/acpi/actbl1.h +++ b/include/acpi/actbl1.h @@ -1406,17 +1406,17 @@ struct acpi_hmat_structure { * HMAT Structures, correspond to Type in struct acpi_hmat_structure */
-/* 0: Memory subystem address range */ +/* 0: Memory proximity domain attributes */
-struct acpi_hmat_address_range { +struct acpi_hmat_proximity_domain { struct acpi_hmat_structure header; u16 flags; u16 reserved1; u32 processor_PD; /* Processor proximity domain */ u32 memory_PD; /* Memory proximity domain */ u32 reserved2; - u64 physical_address_base; /* Physical address range base */ - u64 physical_address_length; /* Physical address range length */ + u64 reserved3; + u64 reserved4; };
/* Masks for Flags field above */
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 60574d1e05b094d222162260dd9cac49f4d0996a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 60574d1e05b094d222162260dd9cac49f4d0996a upstream.
Parsing entries in an ACPI table had assumed a generic header structure. There is no standard ACPI header, though, so less common layouts with different field sizes required custom parsers to go through their subtable entry list.
Create the infrastructure for adding different table types so parsing the entries array may be more reused for all ACPI system tables and the common code doesn't need to be duplicated.
Reviewed-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Jonathan Cameron Jonathan.Cameron@huawei.com Tested-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/arm64/kernel/acpi_numa.c | 2 +- arch/arm64/kernel/smp.c | 4 +- arch/ia64/kernel/acpi.c | 14 ++-- arch/x86/kernel/acpi/boot.c | 36 +++++----- drivers/acpi/numa.c | 16 ++--- drivers/acpi/scan.c | 4 +- drivers/acpi/tables.c | 67 ++++++++++++++++--- drivers/irqchip/irq-gic-v2m.c | 2 +- drivers/irqchip/irq-gic-v3-its-pci-msi.c | 2 +- drivers/irqchip/irq-gic-v3-its-platform-msi.c | 2 +- drivers/irqchip/irq-gic-v3-its.c | 6 +- drivers/irqchip/irq-gic-v3.c | 10 +-- drivers/irqchip/irq-gic.c | 4 +- drivers/mailbox/pcc.c | 2 +- include/linux/acpi.h | 5 +- 15 files changed, 113 insertions(+), 63 deletions(-)
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c index 4f4f1815e047..b63705407e5d 100644 --- a/arch/arm64/kernel/acpi_numa.c +++ b/arch/arm64/kernel/acpi_numa.c @@ -46,7 +46,7 @@ static inline int get_cpu_for_acpi_id(u32 uid) return -EINVAL; }
-static int __init acpi_parse_gicc_pxm(struct acpi_subtable_header *header, +static int __init acpi_parse_gicc_pxm(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_srat_gicc_affinity *pa; diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 090e7603c2da..fe562778de35 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -753,7 +753,7 @@ acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor) }
static int __init -acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header, +acpi_parse_gic_cpu_interface(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *processor; @@ -762,7 +762,7 @@ acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header, if (BAD_MADT_GICC_ENTRY(processor, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
acpi_map_gic_cpu_interface(processor);
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c index 1dacbf5e9e09..0ad2bb4d6043 100644 --- a/arch/ia64/kernel/acpi.c +++ b/arch/ia64/kernel/acpi.c @@ -177,7 +177,7 @@ struct acpi_table_madt *acpi_madt __initdata; static u8 has_8259;
static int __init -acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header, +acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_local_apic_override *lapic; @@ -195,7 +195,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header, }
static int __init -acpi_parse_lsapic(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_lsapic(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_local_sapic *lsapic;
@@ -216,7 +216,7 @@ acpi_parse_lsapic(struct acpi_subtable_header * header, const unsigned long end) }
static int __init -acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_local_apic_nmi *lacpi_nmi;
@@ -230,7 +230,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e }
static int __init -acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_iosapic(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_io_sapic *iosapic;
@@ -245,7 +245,7 @@ acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end static unsigned int __initdata acpi_madt_rev;
static int __init -acpi_parse_plat_int_src(struct acpi_subtable_header * header, +acpi_parse_plat_int_src(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_interrupt_source *plintsrc; @@ -329,7 +329,7 @@ unsigned int get_cpei_target_cpu(void) }
static int __init -acpi_parse_int_src_ovr(struct acpi_subtable_header * header, +acpi_parse_int_src_ovr(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_interrupt_override *p; @@ -350,7 +350,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header, }
static int __init -acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_nmi_source *nmi_src;
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 94554a5dc2df..6fc7e69b4255 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -196,7 +196,7 @@ static int acpi_register_lapic(int id, u32 acpiid, u8 enabled) }
static int __init -acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) +acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_local_x2apic *processor = NULL; #ifdef CONFIG_X86_X2APIC @@ -209,7 +209,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) if (BAD_MADT_ENTRY(processor, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
#ifdef CONFIG_X86_X2APIC apic_id = processor->local_apic_id; @@ -241,7 +241,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) }
static int __init -acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_local_apic *processor = NULL;
@@ -250,7 +250,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end) if (BAD_MADT_ENTRY(processor, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
/* Ignore invalid ID */ if (processor->id == 0xff) @@ -271,7 +271,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end) }
static int __init -acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end) +acpi_parse_sapic(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_local_sapic *processor = NULL;
@@ -280,7 +280,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end) if (BAD_MADT_ENTRY(processor, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
acpi_register_lapic((processor->id << 8) | processor->eid,/* APIC ID */ processor->processor_id, /* ACPI ID */ @@ -290,7 +290,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end) }
static int __init -acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header, +acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_local_apic_override *lapic_addr_ovr = NULL; @@ -300,7 +300,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header, if (BAD_MADT_ENTRY(lapic_addr_ovr, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
acpi_lapic_addr = lapic_addr_ovr->address;
@@ -308,7 +308,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header, }
static int __init -acpi_parse_x2apic_nmi(struct acpi_subtable_header *header, +acpi_parse_x2apic_nmi(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_local_x2apic_nmi *x2apic_nmi = NULL; @@ -318,7 +318,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header, if (BAD_MADT_ENTRY(x2apic_nmi, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
if (x2apic_nmi->lint != 1) printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n"); @@ -327,7 +327,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header, }
static int __init -acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_local_apic_nmi *lapic_nmi = NULL;
@@ -336,7 +336,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e if (BAD_MADT_ENTRY(lapic_nmi, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
if (lapic_nmi->lint != 1) printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n"); @@ -448,7 +448,7 @@ static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity, }
static int __init -acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_ioapic(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_io_apic *ioapic = NULL; struct ioapic_domain_cfg cfg = { @@ -461,7 +461,7 @@ acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end) if (BAD_MADT_ENTRY(ioapic, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
/* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */ if (ioapic->global_irq_base < nr_legacy_irqs()) @@ -507,7 +507,7 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger, }
static int __init -acpi_parse_int_src_ovr(struct acpi_subtable_header * header, +acpi_parse_int_src_ovr(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_interrupt_override *intsrc = NULL; @@ -517,7 +517,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header, if (BAD_MADT_ENTRY(intsrc, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
if (intsrc->source_irq == acpi_gbl_FADT.sci_interrupt) { acpi_sci_ioapic_setup(intsrc->source_irq, @@ -549,7 +549,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header, }
static int __init -acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end) +acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_nmi_source *nmi_src = NULL;
@@ -558,7 +558,7 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end if (BAD_MADT_ENTRY(nmi_src, end)) return -EINVAL;
- acpi_table_print_madt_entry(header); + acpi_table_print_madt_entry(&header->common);
/* TBD: Support nimsrc entries? */
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c index 63d4f6b15e4d..9d7acf895264 100644 --- a/drivers/acpi/numa.c +++ b/drivers/acpi/numa.c @@ -340,7 +340,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa) }
static int __init -acpi_parse_x2apic_affinity(struct acpi_subtable_header *header, +acpi_parse_x2apic_affinity(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_srat_x2apic_cpu_affinity *processor_affinity; @@ -349,7 +349,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header, if (!processor_affinity) return -EINVAL;
- acpi_table_print_srat_entry(header); + acpi_table_print_srat_entry(&header->common);
/* let architecture-dependent part to do it */ acpi_numa_x2apic_affinity_init(processor_affinity); @@ -358,7 +358,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header, }
static int __init -acpi_parse_processor_affinity(struct acpi_subtable_header *header, +acpi_parse_processor_affinity(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_srat_cpu_affinity *processor_affinity; @@ -367,7 +367,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header, if (!processor_affinity) return -EINVAL;
- acpi_table_print_srat_entry(header); + acpi_table_print_srat_entry(&header->common);
/* let architecture-dependent part to do it */ acpi_numa_processor_affinity_init(processor_affinity); @@ -376,7 +376,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header, }
static int __init -acpi_parse_gicc_affinity(struct acpi_subtable_header *header, +acpi_parse_gicc_affinity(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_srat_gicc_affinity *processor_affinity; @@ -385,7 +385,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header, if (!processor_affinity) return -EINVAL;
- acpi_table_print_srat_entry(header); + acpi_table_print_srat_entry(&header->common);
/* let architecture-dependent part to do it */ acpi_numa_gicc_affinity_init(processor_affinity); @@ -396,7 +396,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header, static int __initdata parsed_numa_memblks;
static int __init -acpi_parse_memory_affinity(struct acpi_subtable_header * header, +acpi_parse_memory_affinity(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_srat_mem_affinity *memory_affinity; @@ -405,7 +405,7 @@ acpi_parse_memory_affinity(struct acpi_subtable_header * header, if (!memory_affinity) return -EINVAL;
- acpi_table_print_srat_entry(header); + acpi_table_print_srat_entry(&header->common);
/* let architecture-dependent part to do it */ if (!acpi_numa_memory_affinity_init(memory_affinity)) diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 3110ac8176cb..42ea4f784fa5 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -2259,10 +2259,10 @@ static struct acpi_probe_entry *ape; static int acpi_probe_count; static DEFINE_MUTEX(acpi_probe_mutex);
-static int __init acpi_match_madt(struct acpi_subtable_header *header, +static int __init acpi_match_madt(union acpi_subtable_headers *header, const unsigned long end) { - if (!ape->subtable_valid || ape->subtable_valid(header, ape)) + if (!ape->subtable_valid || ape->subtable_valid(&header->common, ape)) if (!ape->probe_subtbl(header, end)) acpi_probe_count++;
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index 11caaa455f3e..ce0daf5c6293 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -50,6 +50,15 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;
static int acpi_apic_instance __initdata;
+enum acpi_subtable_type { + ACPI_SUBTABLE_COMMON, +}; + +struct acpi_subtable_entry { + union acpi_subtable_headers *hdr; + enum acpi_subtable_type type; +}; + /* * Disable table checksum verification for the early stage due to the size * limitation of the current x86 early mapping implementation. @@ -218,6 +227,42 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header) } }
+static unsigned long __init +acpi_get_entry_type(struct acpi_subtable_entry *entry) +{ + switch (entry->type) { + case ACPI_SUBTABLE_COMMON: + return entry->hdr->common.type; + } + return 0; +} + +static unsigned long __init +acpi_get_entry_length(struct acpi_subtable_entry *entry) +{ + switch (entry->type) { + case ACPI_SUBTABLE_COMMON: + return entry->hdr->common.length; + } + return 0; +} + +static unsigned long __init +acpi_get_subtable_header_length(struct acpi_subtable_entry *entry) +{ + switch (entry->type) { + case ACPI_SUBTABLE_COMMON: + return sizeof(entry->hdr->common); + } + return 0; +} + +static enum acpi_subtable_type __init +acpi_get_subtable_type(char *id) +{ + return ACPI_SUBTABLE_COMMON; +} + /** * acpi_parse_entries_array - for each proc_num find a suitable subtable * @@ -247,8 +292,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size, struct acpi_subtable_proc *proc, int proc_num, unsigned int max_entries) { - struct acpi_subtable_header *entry; - unsigned long table_end; + struct acpi_subtable_entry entry; + unsigned long table_end, subtable_len, entry_len; int count = 0; int errs = 0; int i; @@ -271,19 +316,20 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
/* Parse all entries looking for a match. */
- entry = (struct acpi_subtable_header *) + entry.type = acpi_get_subtable_type(id); + entry.hdr = (union acpi_subtable_headers *) ((unsigned long)table_header + table_size); + subtable_len = acpi_get_subtable_header_length(&entry);
- while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) < - table_end) { + while (((unsigned long)entry.hdr) + subtable_len < table_end) { if (max_entries && count >= max_entries) break;
for (i = 0; i < proc_num; i++) { - if (entry->type != proc[i].id) + if (acpi_get_entry_type(&entry) != proc[i].id) continue; if (!proc[i].handler || - (!errs && proc[i].handler(entry, table_end))) { + (!errs && proc[i].handler(entry.hdr, table_end))) { errs++; continue; } @@ -298,13 +344,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size, * If entry->length is 0, break from this loop to avoid * infinite loop. */ - if (entry->length == 0) { + entry_len = acpi_get_entry_length(&entry); + if (entry_len == 0) { pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id); return -EINVAL; }
- entry = (struct acpi_subtable_header *) - ((unsigned long)entry + entry->length); + entry.hdr = (union acpi_subtable_headers *) + ((unsigned long)entry.hdr + entry_len); }
if (max_entries && count > max_entries) { diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c index f5fe0100f9ff..de14e06fd9ec 100644 --- a/drivers/irqchip/irq-gic-v2m.c +++ b/drivers/irqchip/irq-gic-v2m.c @@ -446,7 +446,7 @@ static struct fwnode_handle *gicv2m_get_fwnode(struct device *dev) }
static int __init -acpi_parse_madt_msi(struct acpi_subtable_header *header, +acpi_parse_madt_msi(union acpi_subtable_headers *header, const unsigned long end) { int ret; diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c index 8d6d009d1d58..c81d5b81da56 100644 --- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c +++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c @@ -159,7 +159,7 @@ static int __init its_pci_of_msi_init(void) #ifdef CONFIG_ACPI
static int __init -its_pci_msi_parse_madt(struct acpi_subtable_header *header, +its_pci_msi_parse_madt(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_translator *its_entry; diff --git a/drivers/irqchip/irq-gic-v3-its-platform-msi.c b/drivers/irqchip/irq-gic-v3-its-platform-msi.c index 7b8e87b493fe..9cdcda5bb3bd 100644 --- a/drivers/irqchip/irq-gic-v3-its-platform-msi.c +++ b/drivers/irqchip/irq-gic-v3-its-platform-msi.c @@ -117,7 +117,7 @@ static int __init its_pmsi_init_one(struct fwnode_handle *fwnode,
#ifdef CONFIG_ACPI static int __init -its_pmsi_parse_madt(struct acpi_subtable_header *header, +its_pmsi_parse_madt(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_translator *its_entry; diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index 489e2e85ce3a..1a2ecfa23fd8 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -4102,13 +4102,13 @@ static int __init acpi_get_its_numa_node(u32 its_id) return NUMA_NO_NODE; }
-static int __init gic_acpi_match_srat_its(struct acpi_subtable_header *header, +static int __init gic_acpi_match_srat_its(union acpi_subtable_headers *header, const unsigned long end) { return 0; }
-static int __init gic_acpi_parse_srat_its(struct acpi_subtable_header *header, +static int __init gic_acpi_parse_srat_its(union acpi_subtable_headers *header, const unsigned long end) { int node; @@ -4175,7 +4175,7 @@ static int __init acpi_get_its_numa_node(u32 its_id) { return NUMA_NO_NODE; } static void __init acpi_its_srat_maps_free(void) { } #endif
-static int __init gic_acpi_parse_madt_its(struct acpi_subtable_header *header, +static int __init gic_acpi_parse_madt_its(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_translator *its_entry; diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 8d2558873561..10edf545e702 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -1846,7 +1846,7 @@ gic_acpi_register_redist(phys_addr_t phys_base, void __iomem *redist_base) }
static int __init -gic_acpi_parse_madt_redist(struct acpi_subtable_header *header, +gic_acpi_parse_madt_redist(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_redistributor *redist = @@ -1864,7 +1864,7 @@ gic_acpi_parse_madt_redist(struct acpi_subtable_header *header, }
static int __init -gic_acpi_parse_madt_gicc(struct acpi_subtable_header *header, +gic_acpi_parse_madt_gicc(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *gicc = @@ -1906,14 +1906,14 @@ static int __init gic_acpi_collect_gicr_base(void) return -ENODEV; }
-static int __init gic_acpi_match_gicr(struct acpi_subtable_header *header, +static int __init gic_acpi_match_gicr(union acpi_subtable_headers *header, const unsigned long end) { /* Subtable presence means that redist exists, that's it */ return 0; }
-static int __init gic_acpi_match_gicc(struct acpi_subtable_header *header, +static int __init gic_acpi_match_gicc(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *gicc = @@ -1983,7 +1983,7 @@ static bool __init acpi_validate_gic_table(struct acpi_subtable_header *header, return true; }
-static int __init gic_acpi_parse_virt_madt_gicc(struct acpi_subtable_header *header, +static int __init gic_acpi_parse_virt_madt_gicc(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *gicc = diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index 76e61392c3df..71e20e9f3150 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -1518,7 +1518,7 @@ static struct } acpi_data __initdata;
static int __init -gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header, +gic_acpi_parse_madt_cpu(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *processor; @@ -1550,7 +1550,7 @@ gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header, }
/* The things you have to do to just *count* something... */ -static int __init acpi_dummy_func(struct acpi_subtable_header *header, +static int __init acpi_dummy_func(union acpi_subtable_headers *header, const unsigned long end) { return 0; diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c index 256f18b67e8a..08a0a3517138 100644 --- a/drivers/mailbox/pcc.c +++ b/drivers/mailbox/pcc.c @@ -382,7 +382,7 @@ static const struct mbox_chan_ops pcc_chan_ops = { * * This gets called for each entry in the PCC table. */ -static int parse_pcc_subspace(struct acpi_subtable_header *header, +static int parse_pcc_subspace(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_pcct_subspace *ss = (struct acpi_pcct_subspace *) header; diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 698da32ea4a0..344e6f793e3d 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -141,10 +141,13 @@ enum acpi_address_range_id {
/* Table Handlers */ +union acpi_subtable_headers { + struct acpi_subtable_header common; +};
typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
-typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header, +typedef int (*acpi_tbl_entry_handler)(union acpi_subtable_headers *header, const unsigned long end);
/* Debugger support */
hulk inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
60574d1e05b0 ("acpi: Create subtable parsing infrastructure") modifies the argument 1 type of acpi_tbl_entry_handler from struct acpi_subtable_header * to union acpi_subtable_headers *, leading to compilation errors in original ft2500 irqchip driver.
Adapt the phytium ft2500 irqchip driver using the new argument type.
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/irqchip/irq-gic-phytium-2500-its.c | 6 +++--- drivers/irqchip/irq-gic-phytium-2500.c | 10 +++++----- 2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/irqchip/irq-gic-phytium-2500-its.c b/drivers/irqchip/irq-gic-phytium-2500-its.c index cdb58bf99c7c..fff1c8546d23 100644 --- a/drivers/irqchip/irq-gic-phytium-2500-its.c +++ b/drivers/irqchip/irq-gic-phytium-2500-its.c @@ -3944,13 +3944,13 @@ static int __init acpi_get_its_numa_node(u32 its_id) return NUMA_NO_NODE; }
-static int __init gic_acpi_match_srat_its(struct acpi_subtable_header *header, +static int __init gic_acpi_match_srat_its(union acpi_subtable_headers *header, const unsigned long end) { return 0; }
-static int __init gic_acpi_parse_srat_its(struct acpi_subtable_header *header, +static int __init gic_acpi_parse_srat_its(union acpi_subtable_headers *header, const unsigned long end) { int node; @@ -4017,7 +4017,7 @@ static int __init acpi_get_its_numa_node(u32 its_id) { return NUMA_NO_NODE; } static void __init acpi_its_srat_maps_free(void) { } #endif
-static int __init gic_acpi_parse_madt_its(struct acpi_subtable_header *header, +static int __init gic_acpi_parse_madt_its(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_translator *its_entry; diff --git a/drivers/irqchip/irq-gic-phytium-2500.c b/drivers/irqchip/irq-gic-phytium-2500.c index fc69057e53f6..8674463a08c6 100644 --- a/drivers/irqchip/irq-gic-phytium-2500.c +++ b/drivers/irqchip/irq-gic-phytium-2500.c @@ -1823,7 +1823,7 @@ gic_acpi_register_redist(phys_addr_t phys_base, void __iomem *redist_base) }
static int __init -gic_acpi_parse_madt_redist(struct acpi_subtable_header *header, +gic_acpi_parse_madt_redist(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_redistributor *redist = @@ -1841,7 +1841,7 @@ gic_acpi_parse_madt_redist(struct acpi_subtable_header *header, }
static int __init -gic_acpi_parse_madt_gicc(struct acpi_subtable_header *header, +gic_acpi_parse_madt_gicc(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *gicc = @@ -1883,14 +1883,14 @@ static int __init gic_acpi_collect_gicr_base(void) return -ENODEV; }
-static int __init gic_acpi_match_gicr(struct acpi_subtable_header *header, +static int __init gic_acpi_match_gicr(union acpi_subtable_headers *header, const unsigned long end) { /* Subtable presence means that redist exists, that's it */ return 0; }
-static int __init gic_acpi_match_gicc(struct acpi_subtable_header *header, +static int __init gic_acpi_match_gicc(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *gicc = @@ -1960,7 +1960,7 @@ static bool __init acpi_validate_gic_table(struct acpi_subtable_header *header, return true; }
-static int __init gic_acpi_parse_virt_madt_gicc(struct acpi_subtable_header *header, +static int __init gic_acpi_parse_virt_madt_gicc(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_interrupt *gicc =
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 3bc0e8eb179deebf1c06f5c4261d362c24b26ce1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3bc0e8eb179deebf1c06f5c4261d362c24b26ce1 upstream.
The Heterogeneous Memory Attribute Table (HMAT) header has different field lengths than the existing parsing uses. Add the HMAT type to the parsing rules so it may be generically parsed.
Reviewed-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Jonathan Cameron Jonathan.Cameron@huawei.com Tested-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/tables.c | 9 +++++++++ include/linux/acpi.h | 1 + 2 files changed, 10 insertions(+)
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index ce0daf5c6293..39fda61fba32 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -52,6 +52,7 @@ static int acpi_apic_instance __initdata;
enum acpi_subtable_type { ACPI_SUBTABLE_COMMON, + ACPI_SUBTABLE_HMAT, };
struct acpi_subtable_entry { @@ -233,6 +234,8 @@ acpi_get_entry_type(struct acpi_subtable_entry *entry) switch (entry->type) { case ACPI_SUBTABLE_COMMON: return entry->hdr->common.type; + case ACPI_SUBTABLE_HMAT: + return entry->hdr->hmat.type; } return 0; } @@ -243,6 +246,8 @@ acpi_get_entry_length(struct acpi_subtable_entry *entry) switch (entry->type) { case ACPI_SUBTABLE_COMMON: return entry->hdr->common.length; + case ACPI_SUBTABLE_HMAT: + return entry->hdr->hmat.length; } return 0; } @@ -253,6 +258,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry) switch (entry->type) { case ACPI_SUBTABLE_COMMON: return sizeof(entry->hdr->common); + case ACPI_SUBTABLE_HMAT: + return sizeof(entry->hdr->hmat); } return 0; } @@ -260,6 +267,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry) static enum acpi_subtable_type __init acpi_get_subtable_type(char *id) { + if (strncmp(id, ACPI_SIG_HMAT, 4) == 0) + return ACPI_SUBTABLE_HMAT; return ACPI_SUBTABLE_COMMON; }
diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 344e6f793e3d..c34ba2a1c2e6 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -143,6 +143,7 @@ enum acpi_address_range_id { /* Table Handlers */ union acpi_subtable_headers { struct acpi_subtable_header common; + struct acpi_hmat_structure hmat; };
typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 3accf7ae37a96c3bf4b51999f3c395ac5ffcd6d4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3accf7ae37a96c3bf4b51999f3c395ac5ffcd6d4 upstream.
Systems may provide different memory types and export this information in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these tables provided by the platform and report the memory access and caching attributes to the kernel messages.
Reviewed-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Jonathan Cameron Jonathan.Cameron@huawei.com Tested-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/Kconfig | 1 + drivers/acpi/Makefile | 1 + drivers/acpi/hmat/Kconfig | 7 ++ drivers/acpi/hmat/Makefile | 1 + drivers/acpi/hmat/hmat.c | 236 +++++++++++++++++++++++++++++++++++++ 5 files changed, 246 insertions(+) create mode 100644 drivers/acpi/hmat/Kconfig create mode 100644 drivers/acpi/hmat/Makefile create mode 100644 drivers/acpi/hmat/hmat.c
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index e955a30f1a52..89d5fd8020ec 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -470,6 +470,7 @@ config ACPI_REDUCED_HARDWARE_ONLY If you are unsure what to do, do not enable this option.
source "drivers/acpi/nfit/Kconfig" +source "drivers/acpi/hmat/Kconfig"
source "drivers/acpi/apei/Kconfig" source "drivers/acpi/dptf/Kconfig" diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index edc039313cd6..696ea10f7788 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -79,6 +79,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o obj-$(CONFIG_ACPI) += container.o obj-$(CONFIG_ACPI_THERMAL) += thermal.o obj-$(CONFIG_ACPI_NFIT) += nfit/ +obj-$(CONFIG_ACPI_HMAT) += hmat/ obj-$(CONFIG_ACPI) += acpi_memhotplug.o obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o obj-$(CONFIG_ACPI_BATTERY) += battery.o diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig new file mode 100644 index 000000000000..2f7111b7af62 --- /dev/null +++ b/drivers/acpi/hmat/Kconfig @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 +config ACPI_HMAT + bool "ACPI Heterogeneous Memory Attribute Table Support" + depends on ACPI_NUMA + help + If set, this option has the kernel parse and report the + platform's ACPI HMAT (Heterogeneous Memory Attributes Table). diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile new file mode 100644 index 000000000000..e909051d3d00 --- /dev/null +++ b/drivers/acpi/hmat/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ACPI_HMAT) := hmat.o diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c new file mode 100644 index 000000000000..4758beb3b2c1 --- /dev/null +++ b/drivers/acpi/hmat/hmat.c @@ -0,0 +1,236 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2019, Intel Corporation. + * + * Heterogeneous Memory Attributes Table (HMAT) representation + * + * This program parses and reports the platform's HMAT tables, and registers + * the applicable attributes with the node's interfaces. + */ + +#include <linux/acpi.h> +#include <linux/bitops.h> +#include <linux/device.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/node.h> +#include <linux/sysfs.h> + +static __initdata u8 hmat_revision; + +static __init const char *hmat_data_type(u8 type) +{ + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + return "Access Latency"; + case ACPI_HMAT_READ_LATENCY: + return "Read Latency"; + case ACPI_HMAT_WRITE_LATENCY: + return "Write Latency"; + case ACPI_HMAT_ACCESS_BANDWIDTH: + return "Access Bandwidth"; + case ACPI_HMAT_READ_BANDWIDTH: + return "Read Bandwidth"; + case ACPI_HMAT_WRITE_BANDWIDTH: + return "Write Bandwidth"; + default: + return "Reserved"; + } +} + +static __init const char *hmat_data_type_suffix(u8 type) +{ + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + case ACPI_HMAT_READ_LATENCY: + case ACPI_HMAT_WRITE_LATENCY: + return " nsec"; + case ACPI_HMAT_ACCESS_BANDWIDTH: + case ACPI_HMAT_READ_BANDWIDTH: + case ACPI_HMAT_WRITE_BANDWIDTH: + return " MB/s"; + default: + return ""; + } +} + +static __init u32 hmat_normalize(u16 entry, u64 base, u8 type) +{ + u32 value; + + /* + * Check for invalid and overflow values + */ + if (entry == 0xffff || !entry) + return 0; + else if (base > (UINT_MAX / (entry))) + return 0; + + /* + * Divide by the base unit for version 1, convert latency from + * picosenonds to nanoseconds if revision 2. + */ + value = entry * base; + if (hmat_revision == 1) { + if (value < 10) + return 0; + value = DIV_ROUND_UP(value, 10); + } else if (hmat_revision == 2) { + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + case ACPI_HMAT_READ_LATENCY: + case ACPI_HMAT_WRITE_LATENCY: + value = DIV_ROUND_UP(value, 1000); + break; + default: + break; + } + } + return value; +} + +static __init int hmat_parse_locality(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_hmat_locality *hmat_loc = (void *)header; + unsigned int init, targ, total_size, ipds, tpds; + u32 *inits, *targs, value; + u16 *entries; + u8 type; + + if (hmat_loc->header.length < sizeof(*hmat_loc)) { + pr_notice("HMAT: Unexpected locality header length: %d\n", + hmat_loc->header.length); + return -EINVAL; + } + + type = hmat_loc->data_type; + ipds = hmat_loc->number_of_initiator_Pds; + tpds = hmat_loc->number_of_target_Pds; + total_size = sizeof(*hmat_loc) + sizeof(*entries) * ipds * tpds + + sizeof(*inits) * ipds + sizeof(*targs) * tpds; + if (hmat_loc->header.length < total_size) { + pr_notice("HMAT: Unexpected locality header length:%d, minimum required:%d\n", + hmat_loc->header.length, total_size); + return -EINVAL; + } + + pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n", + hmat_loc->flags, hmat_data_type(type), ipds, tpds, + hmat_loc->entry_base_unit); + + inits = (u32 *)(hmat_loc + 1); + targs = inits + ipds; + entries = (u16 *)(targs + tpds); + for (init = 0; init < ipds; init++) { + for (targ = 0; targ < tpds; targ++) { + value = hmat_normalize(entries[init * tpds + targ], + hmat_loc->entry_base_unit, + type); + pr_info(" Initiator-Target[%d-%d]:%d%s\n", + inits[init], targs[targ], value, + hmat_data_type_suffix(type)); + } + } + + return 0; +} + +static __init int hmat_parse_cache(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_hmat_cache *cache = (void *)header; + u32 attrs; + + if (cache->header.length < sizeof(*cache)) { + pr_notice("HMAT: Unexpected cache header length: %d\n", + cache->header.length); + return -EINVAL; + } + + attrs = cache->cache_attributes; + pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n", + cache->memory_PD, cache->cache_size, attrs, + cache->number_of_SMBIOShandles); + + return 0; +} + +static int __init hmat_parse_proximity_domain(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_hmat_proximity_domain *p = (void *)header; + + if (p->header.length != sizeof(*p)) { + pr_notice("HMAT: Unexpected address range header length: %d\n", + p->header.length); + return -EINVAL; + } + + if (hmat_revision == 1) + pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n", + p->reserved3, p->reserved4, p->flags, p->processor_PD, + p->memory_PD); + else + pr_info("HMAT: Memory Flags:%04x Processor Domain:%d Memory Domain:%d\n", + p->flags, p->processor_PD, p->memory_PD); + + return 0; +} + +static int __init hmat_parse_subtable(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_hmat_structure *hdr = (void *)header; + + if (!hdr) + return -EINVAL; + + switch (hdr->type) { + case ACPI_HMAT_TYPE_ADDRESS_RANGE: + return hmat_parse_proximity_domain(header, end); + case ACPI_HMAT_TYPE_LOCALITY: + return hmat_parse_locality(header, end); + case ACPI_HMAT_TYPE_CACHE: + return hmat_parse_cache(header, end); + default: + return -EINVAL; + } +} + +static __init int hmat_init(void) +{ + struct acpi_table_header *tbl; + enum acpi_hmat_type i; + acpi_status status; + + if (srat_disabled()) + return 0; + + status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl); + if (ACPI_FAILURE(status)) + return 0; + + hmat_revision = tbl->revision; + switch (hmat_revision) { + case 1: + case 2: + break; + default: + pr_notice("Ignoring HMAT: Unknown revision:%d\n", hmat_revision); + goto out_put; + } + + for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) { + if (acpi_table_parse_entries(ACPI_SIG_HMAT, + sizeof(struct acpi_table_hmat), i, + hmat_parse_subtable, 0) < 0) { + pr_notice("Ignoring HMAT: Invalid table"); + goto out_put; + } + } +out_put: + acpi_put_table(tbl); + return 0; +} +subsys_initcall(hmat_init);
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 08d9dbe72b1f899468b2b34f9309e88a84f440f2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Systems may be constructed with various specialized nodes. Some nodes may provide memory, some provide compute devices that access and use that memory, and others may provide both. Nodes that provide memory are referred to as memory targets, and nodes that can initiate memory access are referred to as memory initiators.
Memory targets will often have varying access characteristics from different initiators, and platforms may have ways to express those relationships. In preparation for these systems, provide interfaces for the kernel to export the memory relationship among different nodes memory targets and their initiators with symlinks to each other.
If a system provides access locality for each initiator-target pair, nodes may be grouped into ranked access classes relative to other nodes. The new interface allows a subsystem to register relationships of varying classes if available and desired to be exported.
A memory initiator may have multiple memory targets in the same access class. The target memory's initiators in a given class indicate the nodes access characteristics share the same performance relative to other linked initiator nodes. Each target within an initiator's access class, though, do not necessarily perform the same as each other.
A memory target node may have multiple memory initiators. All linked initiators in a target's class have the same access characteristics to that target.
The following example show the nodes' new sysfs hierarchy for a memory target node 'Y' with access class 0 from initiator node 'X':
# symlinks -v /sys/devices/system/node/nodeX/access0/ relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
# symlinks -v /sys/devices/system/node/nodeY/access0/ relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
The new attributes are added to the sysfs stable documentation.
Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Reviewed-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ABI/stable/sysfs-devices-node | 25 +++- drivers/base/node.c | 142 +++++++++++++++++++- include/linux/node.h | 6 + 3 files changed, 171 insertions(+), 2 deletions(-)
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 3e90e1f3bf0a..433bcc04e542 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -90,4 +90,27 @@ Date: December 2009 Contact: Lee Schermerhorn lee.schermerhorn@hp.com Description: The node's huge page size control/query attributes. - See Documentation/admin-guide/mm/hugetlbpage.rst \ No newline at end of file + See Documentation/admin-guide/mm/hugetlbpage.rst + +What: /sys/devices/system/node/nodeX/accessY/ +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The node's relationship to other nodes for access class "Y". + +What: /sys/devices/system/node/nodeX/accessY/initiators/ +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The directory containing symlinks to memory initiator + nodes that have class "Y" access to this target node's + memory. CPUs and other memory initiators in nodes not in + the list accessing this node's memory may have different + performance. + +What: /sys/devices/system/node/nodeX/accessY/targets/ +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The directory containing symlinks to memory targets that + this initiator node has class "Y" access. diff --git a/drivers/base/node.c b/drivers/base/node.c index 0938337daa7f..b81d0d18c873 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -17,6 +17,7 @@ #include <linux/nodemask.h> #include <linux/cpu.h> #include <linux/device.h> +#include <linux/pm_runtime.h> #include <linux/swap.h> #include <linux/slab.h>
@@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev, static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
+/** + * struct node_access_nodes - Access class device to hold user visible + * relationships to other nodes. + * @dev: Device for this memory access class + * @list_node: List element in the node's access list + * @access: The access class rank + */ +struct node_access_nodes { + struct device dev; + struct list_head list_node; + unsigned access; +}; +#define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev) + +static struct attribute *node_init_access_node_attrs[] = { + NULL, +}; + +static struct attribute *node_targ_access_node_attrs[] = { + NULL, +}; + +static const struct attribute_group initiators = { + .name = "initiators", + .attrs = node_init_access_node_attrs, +}; + +static const struct attribute_group targets = { + .name = "targets", + .attrs = node_targ_access_node_attrs, +}; + +static const struct attribute_group *node_access_node_groups[] = { + &initiators, + &targets, + NULL, +}; + +static void node_remove_accesses(struct node *node) +{ + struct node_access_nodes *c, *cnext; + + list_for_each_entry_safe(c, cnext, &node->access_list, list_node) { + list_del(&c->list_node); + device_unregister(&c->dev); + } +} + +static void node_access_release(struct device *dev) +{ + kfree(to_access_nodes(dev)); +} + +static struct node_access_nodes *node_init_node_access(struct node *node, + unsigned access) +{ + struct node_access_nodes *access_node; + struct device *dev; + + list_for_each_entry(access_node, &node->access_list, list_node) + if (access_node->access == access) + return access_node; + + access_node = kzalloc(sizeof(*access_node), GFP_KERNEL); + if (!access_node) + return NULL; + + access_node->access = access; + dev = &access_node->dev; + dev->parent = &node->dev; + dev->release = node_access_release; + dev->groups = node_access_node_groups; + if (dev_set_name(dev, "access%u", access)) + goto free; + + if (device_register(dev)) + goto free_name; + + pm_runtime_no_callbacks(dev); + list_add_tail(&access_node->list_node, &node->access_list); + return access_node; +free_name: + kfree_const(dev->kobj.name); +free: + kfree(access_node); + return NULL; +} + #define K(x) ((x) << (PAGE_SHIFT - 10)) static ssize_t node_read_meminfo(struct device *dev, struct device_attribute *attr, char *buf) @@ -340,7 +429,7 @@ static int register_node(struct node *node, int num) void unregister_node(struct node *node) { hugetlb_unregister_node(node); /* no-op, if memoryless node */ - + node_remove_accesses(node); device_unregister(&node->dev); }
@@ -372,6 +461,56 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid) kobject_name(&node_devices[nid]->dev.kobj)); }
+/** + * register_memory_node_under_compute_node - link memory node to its compute + * node for a given access class. + * @mem_node: Memory node number + * @cpu_node: Cpu node number + * @access: Access class to register + * + * Description: + * For use with platforms that may have separate memory and compute nodes. + * This function will export node relationships linking which memory + * initiator nodes can access memory targets at a given ranked access + * class. + */ +int register_memory_node_under_compute_node(unsigned int mem_nid, + unsigned int cpu_nid, + unsigned access) +{ + struct node *init_node, *targ_node; + struct node_access_nodes *initiator, *target; + int ret; + + if (!node_online(cpu_nid) || !node_online(mem_nid)) + return -ENODEV; + + init_node = node_devices[cpu_nid]; + targ_node = node_devices[mem_nid]; + initiator = node_init_node_access(init_node, access); + target = node_init_node_access(targ_node, access); + if (!initiator || !target) + return -ENOMEM; + + ret = sysfs_add_link_to_group(&initiator->dev.kobj, "targets", + &targ_node->dev.kobj, + dev_name(&targ_node->dev)); + if (ret) + return ret; + + ret = sysfs_add_link_to_group(&target->dev.kobj, "initiators", + &init_node->dev.kobj, + dev_name(&init_node->dev)); + if (ret) + goto err; + + return 0; + err: + sysfs_remove_link_from_group(&initiator->dev.kobj, "targets", + dev_name(&targ_node->dev)); + return ret; +} + int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) { struct device *obj; @@ -588,6 +727,7 @@ int __register_one_node(int nid) register_cpu_under_node(cpu, nid); }
+ INIT_LIST_HEAD(&node_devices[nid]->access_list); /* initialize work queue for memory hot plug */ init_node_hugetlb_work(nid);
diff --git a/include/linux/node.h b/include/linux/node.h index a79ec4492650..1f25e9655d11 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -17,10 +17,12 @@
#include <linux/device.h> #include <linux/cpumask.h> +#include <linux/list.h> #include <linux/workqueue.h>
struct node { struct device dev; + struct list_head access_list;
#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS) struct work_struct node_work; @@ -77,6 +79,10 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg); extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
+extern int register_memory_node_under_compute_node(unsigned int mem_nid, + unsigned int cpu_nid, + unsigned access); + #ifdef CONFIG_HUGETLBFS extern void register_hugetlbfs_with_node(node_registration_func_t doregister, node_registration_func_t unregister);
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit e1cf33aafb8462c7d0a0e6349925870316f040ee category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e1cf33aafb8462c7d0a0e6349925870316f040ee upstream.
Heterogeneous memory systems provide memory nodes with different latency and bandwidth performance attributes. Provide a new kernel interface for subsystems to register the attributes under the memory target node's initiator access class. If the system provides this information, applications may query these attributes when deciding which node to request memory.
The following example shows the new sysfs hierarchy for a node exporting performance attributes:
# tree -P "read*|write*"/sys/devices/system/node/nodeY/accessZ/initiators/ /sys/devices/system/node/nodeY/accessZ/initiators/ |-- read_bandwidth |-- read_latency |-- write_bandwidth `-- write_latency
The bandwidth is exported as MB/s and latency is reported in nanoseconds. The values are taken from the platform as reported by the manufacturer.
Memory accesses from an initiator node that is not one of the memory's access "Z" initiator nodes linked in the same directory may observe different performance than reported here. When a subsystem makes use of this interface, initiators of a different access number may not have the same performance relative to initiators in other access numbers, or omitted from the any access class' initiators.
Descriptions for memory access initiator performance access attributes are added to sysfs stable documentation.
Acked-by: Jonathan Cameron Jonathan.Cameron@huawei.com Tested-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Reviewed-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ABI/stable/sysfs-devices-node | 28 ++++++++++ drivers/base/Kconfig | 8 +++ drivers/base/node.c | 59 +++++++++++++++++++++ include/linux/node.h | 26 +++++++++ 4 files changed, 121 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 433bcc04e542..735a40a3f9b2 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -114,3 +114,31 @@ Contact: Keith Busch keith.busch@intel.com Description: The directory containing symlinks to memory targets that this initiator node has class "Y" access. + +What: /sys/devices/system/node/nodeX/accessY/initiators/read_bandwidth +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + This node's read bandwidth in MB/s when accessed from + nodes found in this access class's linked initiators. + +What: /sys/devices/system/node/nodeX/accessY/initiators/read_latency +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + This node's read latency in nanoseconds when accessed + from nodes found in this access class's linked initiators. + +What: /sys/devices/system/node/nodeX/accessY/initiators/write_bandwidth +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + This node's write bandwidth in MB/s when accessed from + found in this access class's linked initiators. + +What: /sys/devices/system/node/nodeX/accessY/initiators/write_latency +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + This node's write latency in nanoseconds when access + from nodes found in this class's linked initiators. diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 3e63a900b330..32dc81bd7056 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE unusable. You should say N here unless you are explicitly looking to test this functionality.
+config HMEM_REPORTING + bool + default n + depends on NUMA + help + Enable reporting for heterogenous memory access attributes under + their non-uniform memory nodes. + source "drivers/base/test/Kconfig"
config SYS_HYPERVISOR diff --git a/drivers/base/node.c b/drivers/base/node.c index b81d0d18c873..e0d497f13f1e 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -71,6 +71,9 @@ struct node_access_nodes { struct device dev; struct list_head list_node; unsigned access; +#ifdef CONFIG_HMEM_REPORTING + struct node_hmem_attrs hmem_attrs; +#endif }; #define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev)
@@ -148,6 +151,62 @@ static struct node_access_nodes *node_init_node_access(struct node *node, return NULL; }
+#ifdef CONFIG_HMEM_REPORTING +#define ACCESS_ATTR(name) \ +static ssize_t name##_show(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + return sprintf(buf, "%u\n", to_access_nodes(dev)->hmem_attrs.name); \ +} \ +static DEVICE_ATTR_RO(name); + +ACCESS_ATTR(read_bandwidth) +ACCESS_ATTR(read_latency) +ACCESS_ATTR(write_bandwidth) +ACCESS_ATTR(write_latency) + +static struct attribute *access_attrs[] = { + &dev_attr_read_bandwidth.attr, + &dev_attr_read_latency.attr, + &dev_attr_write_bandwidth.attr, + &dev_attr_write_latency.attr, + NULL, +}; + +/** + * node_set_perf_attrs - Set the performance values for given access class + * @nid: Node identifier to be set + * @hmem_attrs: Heterogeneous memory performance attributes + * @access: The access class the for the given attributes + */ +void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs, + unsigned access) +{ + struct node_access_nodes *c; + struct node *node; + int i; + + if (WARN_ON_ONCE(!node_online(nid))) + return; + + node = node_devices[nid]; + c = node_init_node_access(node, access); + if (!c) + return; + + c->hmem_attrs = *hmem_attrs; + for (i = 0; access_attrs[i] != NULL; i++) { + if (sysfs_add_file_to_group(&c->dev.kobj, access_attrs[i], + "initiators")) { + pr_info("failed to add performance attribute to node %d\n", + nid); + break; + } + } +} +#endif + #define K(x) ((x) << (PAGE_SHIFT - 10)) static ssize_t node_read_meminfo(struct device *dev, struct device_attribute *attr, char *buf) diff --git a/include/linux/node.h b/include/linux/node.h index 1f25e9655d11..526f1523c0f1 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -20,6 +20,32 @@ #include <linux/list.h> #include <linux/workqueue.h>
+/** + * struct node_hmem_attrs - heterogeneous memory performance attributes + * + * @read_bandwidth: Read bandwidth in MB/s + * @write_bandwidth: Write bandwidth in MB/s + * @read_latency: Read latency in nanoseconds + * @write_latency: Write latency in nanoseconds + */ +struct node_hmem_attrs { + unsigned int read_bandwidth; + unsigned int write_bandwidth; + unsigned int read_latency; + unsigned int write_latency; +}; + +#ifdef CONFIG_HMEM_REPORTING +void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs, + unsigned access); +#else +static inline void node_set_perf_attrs(unsigned int nid, + struct node_hmem_attrs *hmem_attrs, + unsigned access) +{ +} +#endif + struct node { struct device dev; struct list_head access_list;
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit acc02a109b0497e917c83f986a89c51e47d0022c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit acc02a109b0497e917c83f986a89c51e47d0022c upstream.
System memory may have caches to help improve access speed to frequently requested address ranges. While the system provided cache is transparent to the software accessing these memory ranges, applications can optimize their own access based on cache attributes.
Provide a new API for the kernel to register these memory-side caches under the memory node that provides it.
The new sysfs representation is modeled from the existing cpu cacheinfo attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/. Unlike CPU cacheinfo though, the node cache level is reported from the view of the memory. A higher level number is nearer to the CPU, while lower levels are closer to the last level memory.
The exported attributes are the cache size, the line size, associativity indexing, and write back policy, and add the attributes for the system memory caches to sysfs stable documentation.
Signed-off-by: Keith Busch keith.busch@intel.com Reviewed-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Reviewed-by: Brice Goglin Brice.Goglin@inria.fr Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ABI/stable/sysfs-devices-node | 34 +++++ drivers/base/node.c | 151 ++++++++++++++++++++ include/linux/node.h | 39 +++++ 3 files changed, 224 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 735a40a3f9b2..f7ce68fbd4b9 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -142,3 +142,37 @@ Contact: Keith Busch keith.busch@intel.com Description: This node's write latency in nanoseconds when access from nodes found in this class's linked initiators. + +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/ +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The directory containing attributes for the memory-side cache + level 'Y'. + +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/indexing +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The caches associativity indexing: 0 for direct mapped, + non-zero if indexed. + +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/line_size +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The number of bytes accessed from the next cache level on a + cache miss. + +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/size +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The size of this memory side cache in bytes. + +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/write_policy +Date: December 2018 +Contact: Keith Busch keith.busch@intel.com +Description: + The cache write policy: 0 for write-back, 1 for write-through, + other or unknown. diff --git a/drivers/base/node.c b/drivers/base/node.c index e0d497f13f1e..07b9424ca9eb 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -205,6 +205,155 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs, } } } + +/** + * struct node_cache_info - Internal tracking for memory node caches + * @dev: Device represeting the cache level + * @node: List element for tracking in the node + * @cache_attrs:Attributes for this cache level + */ +struct node_cache_info { + struct device dev; + struct list_head node; + struct node_cache_attrs cache_attrs; +}; +#define to_cache_info(device) container_of(device, struct node_cache_info, dev) + +#define CACHE_ATTR(name, fmt) \ +static ssize_t name##_show(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\ +} \ +DEVICE_ATTR_RO(name); + +CACHE_ATTR(size, "%llu") +CACHE_ATTR(line_size, "%u") +CACHE_ATTR(indexing, "%u") +CACHE_ATTR(write_policy, "%u") + +static struct attribute *cache_attrs[] = { + &dev_attr_indexing.attr, + &dev_attr_size.attr, + &dev_attr_line_size.attr, + &dev_attr_write_policy.attr, + NULL, +}; +ATTRIBUTE_GROUPS(cache); + +static void node_cache_release(struct device *dev) +{ + kfree(dev); +} + +static void node_cacheinfo_release(struct device *dev) +{ + struct node_cache_info *info = to_cache_info(dev); + kfree(info); +} + +static void node_init_cache_dev(struct node *node) +{ + struct device *dev; + + dev = kzalloc(sizeof(*dev), GFP_KERNEL); + if (!dev) + return; + + dev->parent = &node->dev; + dev->release = node_cache_release; + if (dev_set_name(dev, "memory_side_cache")) + goto free_dev; + + if (device_register(dev)) + goto free_name; + + pm_runtime_no_callbacks(dev); + node->cache_dev = dev; + return; +free_name: + kfree_const(dev->kobj.name); +free_dev: + kfree(dev); +} + +/** + * node_add_cache() - add cache attribute to a memory node + * @nid: Node identifier that has new cache attributes + * @cache_attrs: Attributes for the cache being added + */ +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs) +{ + struct node_cache_info *info; + struct device *dev; + struct node *node; + + if (!node_online(nid) || !node_devices[nid]) + return; + + node = node_devices[nid]; + list_for_each_entry(info, &node->cache_attrs, node) { + if (info->cache_attrs.level == cache_attrs->level) { + dev_warn(&node->dev, + "attempt to add duplicate cache level:%d\n", + cache_attrs->level); + return; + } + } + + if (!node->cache_dev) + node_init_cache_dev(node); + if (!node->cache_dev) + return; + + info = kzalloc(sizeof(*info), GFP_KERNEL); + if (!info) + return; + + dev = &info->dev; + dev->parent = node->cache_dev; + dev->release = node_cacheinfo_release; + dev->groups = cache_groups; + if (dev_set_name(dev, "index%d", cache_attrs->level)) + goto free_cache; + + info->cache_attrs = *cache_attrs; + if (device_register(dev)) { + dev_warn(&node->dev, "failed to add cache level:%d\n", + cache_attrs->level); + goto free_name; + } + pm_runtime_no_callbacks(dev); + list_add_tail(&info->node, &node->cache_attrs); + return; +free_name: + kfree_const(dev->kobj.name); +free_cache: + kfree(info); +} + +static void node_remove_caches(struct node *node) +{ + struct node_cache_info *info, *next; + + if (!node->cache_dev) + return; + + list_for_each_entry_safe(info, next, &node->cache_attrs, node) { + list_del(&info->node); + device_unregister(&info->dev); + } + device_unregister(node->cache_dev); +} + +static void node_init_caches(unsigned int nid) +{ + INIT_LIST_HEAD(&node_devices[nid]->cache_attrs); +} +#else +static void node_init_caches(unsigned int nid) { } +static void node_remove_caches(struct node *node) { } #endif
#define K(x) ((x) << (PAGE_SHIFT - 10)) @@ -489,6 +638,7 @@ void unregister_node(struct node *node) { hugetlb_unregister_node(node); /* no-op, if memoryless node */ node_remove_accesses(node); + node_remove_caches(node); device_unregister(&node->dev); }
@@ -789,6 +939,7 @@ int __register_one_node(int nid) INIT_LIST_HEAD(&node_devices[nid]->access_list); /* initialize work queue for memory hot plug */ init_node_hugetlb_work(nid); + node_init_caches(nid);
return error; } diff --git a/include/linux/node.h b/include/linux/node.h index 526f1523c0f1..d5215b2319b9 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -35,10 +35,45 @@ struct node_hmem_attrs { unsigned int write_latency; };
+enum cache_indexing { + NODE_CACHE_DIRECT_MAP, + NODE_CACHE_INDEXED, + NODE_CACHE_OTHER, +}; + +enum cache_write_policy { + NODE_CACHE_WRITE_BACK, + NODE_CACHE_WRITE_THROUGH, + NODE_CACHE_WRITE_OTHER, +}; + +/** + * struct node_cache_attrs - system memory caching attributes + * + * @indexing: The ways memory blocks may be placed in cache + * @write_policy: Write back or write through policy + * @size: Total size of cache in bytes + * @line_size: Number of bytes fetched on a cache miss + * @level: The cache hierarchy level + */ +struct node_cache_attrs { + enum cache_indexing indexing; + enum cache_write_policy write_policy; + u64 size; + u16 line_size; + u8 level; +}; + #ifdef CONFIG_HMEM_REPORTING +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs); void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs, unsigned access); #else +static inline void node_add_cache(unsigned int nid, + struct node_cache_attrs *cache_attrs) +{ +} + static inline void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs, unsigned access) @@ -53,6 +88,10 @@ struct node { #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS) struct work_struct node_work; #endif +#ifdef CONFIG_HMEM_REPORTING + struct list_head cache_attrs; + struct device *cache_dev; +#endif };
struct memory_block;
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 665ac7e92757fb74d249b2817dd91b3d9439d0a0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 665ac7e92757fb74d249b2817dd91b3d9439d0a0 upstream.
If the HMAT Subsystem Address Range provides a valid processor proximity domain for a memory domain, or a processor domain matches the performance access of the valid processor proximity domain, register the memory target with that initiator so this relationship will be visible under the node's sysfs directory.
Since HMAT requires valid address ranges have an equivalent SRAT entry, verify each memory target satisfies this requirement.
Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Reviewed-by: Brice Goglin Brice.Goglin@inria.fr Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/hmat/Kconfig | 3 +- drivers/acpi/hmat/hmat.c | 392 +++++++++++++++++++++++++++++++++++++- 2 files changed, 393 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig index 2f7111b7af62..13cddd612a52 100644 --- a/drivers/acpi/hmat/Kconfig +++ b/drivers/acpi/hmat/Kconfig @@ -4,4 +4,5 @@ config ACPI_HMAT depends on ACPI_NUMA help If set, this option has the kernel parse and report the - platform's ACPI HMAT (Heterogeneous Memory Attributes Table). + platform's ACPI HMAT (Heterogeneous Memory Attributes Table), + and register memory initiators with their targets. diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c index 4758beb3b2c1..01a6eddac6f7 100644 --- a/drivers/acpi/hmat/hmat.c +++ b/drivers/acpi/hmat/hmat.c @@ -13,11 +13,105 @@ #include <linux/device.h> #include <linux/init.h> #include <linux/list.h> +#include <linux/list_sort.h> #include <linux/node.h> #include <linux/sysfs.h>
static __initdata u8 hmat_revision;
+static __initdata LIST_HEAD(targets); +static __initdata LIST_HEAD(initiators); +static __initdata LIST_HEAD(localities); + +/* + * The defined enum order is used to prioritize attributes to break ties when + * selecting the best performing node. + */ +enum locality_types { + WRITE_LATENCY, + READ_LATENCY, + WRITE_BANDWIDTH, + READ_BANDWIDTH, +}; + +static struct memory_locality *localities_types[4]; + +struct memory_target { + struct list_head node; + unsigned int memory_pxm; + unsigned int processor_pxm; + struct node_hmem_attrs hmem_attrs; +}; + +struct memory_initiator { + struct list_head node; + unsigned int processor_pxm; +}; + +struct memory_locality { + struct list_head node; + struct acpi_hmat_locality *hmat_loc; +}; + +static __init struct memory_initiator *find_mem_initiator(unsigned int cpu_pxm) +{ + struct memory_initiator *initiator; + + list_for_each_entry(initiator, &initiators, node) + if (initiator->processor_pxm == cpu_pxm) + return initiator; + return NULL; +} + +static __init struct memory_target *find_mem_target(unsigned int mem_pxm) +{ + struct memory_target *target; + + list_for_each_entry(target, &targets, node) + if (target->memory_pxm == mem_pxm) + return target; + return NULL; +} + +static __init void alloc_memory_initiator(unsigned int cpu_pxm) +{ + struct memory_initiator *initiator; + + if (pxm_to_node(cpu_pxm) == NUMA_NO_NODE) + return; + + initiator = find_mem_initiator(cpu_pxm); + if (initiator) + return; + + initiator = kzalloc(sizeof(*initiator), GFP_KERNEL); + if (!initiator) + return; + + initiator->processor_pxm = cpu_pxm; + list_add_tail(&initiator->node, &initiators); +} + +static __init void alloc_memory_target(unsigned int mem_pxm) +{ + struct memory_target *target; + + if (pxm_to_node(mem_pxm) == NUMA_NO_NODE) + return; + + target = find_mem_target(mem_pxm); + if (target) + return; + + target = kzalloc(sizeof(*target), GFP_KERNEL); + if (!target) + return; + + target->memory_pxm = mem_pxm; + target->processor_pxm = PXM_INVAL; + list_add_tail(&target->node, &targets); +} + static __init const char *hmat_data_type(u8 type) { switch (type) { @@ -89,14 +183,83 @@ static __init u32 hmat_normalize(u16 entry, u64 base, u8 type) return value; }
+static __init void hmat_update_target_access(struct memory_target *target, + u8 type, u32 value) +{ + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + target->hmem_attrs.read_latency = value; + target->hmem_attrs.write_latency = value; + break; + case ACPI_HMAT_READ_LATENCY: + target->hmem_attrs.read_latency = value; + break; + case ACPI_HMAT_WRITE_LATENCY: + target->hmem_attrs.write_latency = value; + break; + case ACPI_HMAT_ACCESS_BANDWIDTH: + target->hmem_attrs.read_bandwidth = value; + target->hmem_attrs.write_bandwidth = value; + break; + case ACPI_HMAT_READ_BANDWIDTH: + target->hmem_attrs.read_bandwidth = value; + break; + case ACPI_HMAT_WRITE_BANDWIDTH: + target->hmem_attrs.write_bandwidth = value; + break; + default: + break; + } +} + +static __init void hmat_add_locality(struct acpi_hmat_locality *hmat_loc) +{ + struct memory_locality *loc; + + loc = kzalloc(sizeof(*loc), GFP_KERNEL); + if (!loc) { + pr_notice_once("Failed to allocate HMAT locality\n"); + return; + } + + loc->hmat_loc = hmat_loc; + list_add_tail(&loc->node, &localities); + + switch (hmat_loc->data_type) { + case ACPI_HMAT_ACCESS_LATENCY: + localities_types[READ_LATENCY] = loc; + localities_types[WRITE_LATENCY] = loc; + break; + case ACPI_HMAT_READ_LATENCY: + localities_types[READ_LATENCY] = loc; + break; + case ACPI_HMAT_WRITE_LATENCY: + localities_types[WRITE_LATENCY] = loc; + break; + case ACPI_HMAT_ACCESS_BANDWIDTH: + localities_types[READ_BANDWIDTH] = loc; + localities_types[WRITE_BANDWIDTH] = loc; + break; + case ACPI_HMAT_READ_BANDWIDTH: + localities_types[READ_BANDWIDTH] = loc; + break; + case ACPI_HMAT_WRITE_BANDWIDTH: + localities_types[WRITE_BANDWIDTH] = loc; + break; + default: + break; + } +} + static __init int hmat_parse_locality(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_hmat_locality *hmat_loc = (void *)header; + struct memory_target *target; unsigned int init, targ, total_size, ipds, tpds; u32 *inits, *targs, value; u16 *entries; - u8 type; + u8 type, mem_hier;
if (hmat_loc->header.length < sizeof(*hmat_loc)) { pr_notice("HMAT: Unexpected locality header length: %d\n", @@ -105,6 +268,7 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header, }
type = hmat_loc->data_type; + mem_hier = hmat_loc->flags & ACPI_HMAT_MEMORY_HIERARCHY; ipds = hmat_loc->number_of_initiator_Pds; tpds = hmat_loc->number_of_target_Pds; total_size = sizeof(*hmat_loc) + sizeof(*entries) * ipds * tpds + @@ -123,6 +287,7 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header, targs = inits + ipds; entries = (u16 *)(targs + tpds); for (init = 0; init < ipds; init++) { + alloc_memory_initiator(inits[init]); for (targ = 0; targ < tpds; targ++) { value = hmat_normalize(entries[init * tpds + targ], hmat_loc->entry_base_unit, @@ -130,9 +295,18 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header, pr_info(" Initiator-Target[%d-%d]:%d%s\n", inits[init], targs[targ], value, hmat_data_type_suffix(type)); + + if (mem_hier == ACPI_HMAT_MEMORY) { + target = find_mem_target(targs[targ]); + if (target && target->processor_pxm == inits[init]) + hmat_update_target_access(target, type, value); + } } }
+ if (mem_hier == ACPI_HMAT_MEMORY) + hmat_add_locality(hmat_loc); + return 0; }
@@ -160,6 +334,7 @@ static int __init hmat_parse_proximity_domain(union acpi_subtable_headers *heade const unsigned long end) { struct acpi_hmat_proximity_domain *p = (void *)header; + struct memory_target *target;
if (p->header.length != sizeof(*p)) { pr_notice("HMAT: Unexpected address range header length: %d\n", @@ -175,6 +350,23 @@ static int __init hmat_parse_proximity_domain(union acpi_subtable_headers *heade pr_info("HMAT: Memory Flags:%04x Processor Domain:%d Memory Domain:%d\n", p->flags, p->processor_PD, p->memory_PD);
+ if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) { + target = find_mem_target(p->memory_PD); + if (!target) { + pr_debug("HMAT: Memory Domain missing from SRAT\n"); + return -EINVAL; + } + } + if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) { + int p_node = pxm_to_node(p->processor_PD); + + if (p_node == NUMA_NO_NODE) { + pr_debug("HMAT: Invalid Processor Domain\n"); + return -EINVAL; + } + target->processor_pxm = p_node; + } + return 0; }
@@ -198,6 +390,191 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header, } }
+static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_srat_mem_affinity *ma = (void *)header; + + if (!ma) + return -EINVAL; + if (!(ma->flags & ACPI_SRAT_MEM_ENABLED)) + return 0; + alloc_memory_target(ma->proximity_domain); + return 0; +} + +static __init u32 hmat_initiator_perf(struct memory_target *target, + struct memory_initiator *initiator, + struct acpi_hmat_locality *hmat_loc) +{ + unsigned int ipds, tpds, i, idx = 0, tdx = 0; + u32 *inits, *targs; + u16 *entries; + + ipds = hmat_loc->number_of_initiator_Pds; + tpds = hmat_loc->number_of_target_Pds; + inits = (u32 *)(hmat_loc + 1); + targs = inits + ipds; + entries = (u16 *)(targs + tpds); + + for (i = 0; i < ipds; i++) { + if (inits[i] == initiator->processor_pxm) { + idx = i; + break; + } + } + + if (i == ipds) + return 0; + + for (i = 0; i < tpds; i++) { + if (targs[i] == target->memory_pxm) { + tdx = i; + break; + } + } + if (i == tpds) + return 0; + + return hmat_normalize(entries[idx * tpds + tdx], + hmat_loc->entry_base_unit, + hmat_loc->data_type); +} + +static __init bool hmat_update_best(u8 type, u32 value, u32 *best) +{ + bool updated = false; + + if (!value) + return false; + + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + case ACPI_HMAT_READ_LATENCY: + case ACPI_HMAT_WRITE_LATENCY: + if (!*best || *best > value) { + *best = value; + updated = true; + } + break; + case ACPI_HMAT_ACCESS_BANDWIDTH: + case ACPI_HMAT_READ_BANDWIDTH: + case ACPI_HMAT_WRITE_BANDWIDTH: + if (!*best || *best < value) { + *best = value; + updated = true; + } + break; + } + + return updated; +} + +static int initiator_cmp(void *priv, struct list_head *a, struct list_head *b) +{ + struct memory_initiator *ia; + struct memory_initiator *ib; + unsigned long *p_nodes = priv; + + ia = list_entry(a, struct memory_initiator, node); + ib = list_entry(b, struct memory_initiator, node); + + set_bit(ia->processor_pxm, p_nodes); + set_bit(ib->processor_pxm, p_nodes); + + return ia->processor_pxm - ib->processor_pxm; +} + +static __init void hmat_register_target_initiators(struct memory_target *target) +{ + static DECLARE_BITMAP(p_nodes, MAX_NUMNODES); + struct memory_initiator *initiator; + unsigned int mem_nid, cpu_nid; + struct memory_locality *loc = NULL; + u32 best = 0; + int i; + + mem_nid = pxm_to_node(target->memory_pxm); + /* + * If the Address Range Structure provides a local processor pxm, link + * only that one. Otherwise, find the best performance attributes and + * register all initiators that match. + */ + if (target->processor_pxm != PXM_INVAL) { + cpu_nid = pxm_to_node(target->processor_pxm); + register_memory_node_under_compute_node(mem_nid, cpu_nid, 0); + return; + } + + if (list_empty(&localities)) + return; + + /* + * We need the initiator list sorted so we can use bitmap_clear for + * previously set initiators when we find a better memory accessor. + * We'll also use the sorting to prime the candidate nodes with known + * initiators. + */ + bitmap_zero(p_nodes, MAX_NUMNODES); + list_sort(p_nodes, &initiators, initiator_cmp); + for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) { + loc = localities_types[i]; + if (!loc) + continue; + + best = 0; + list_for_each_entry(initiator, &initiators, node) { + u32 value; + + if (!test_bit(initiator->processor_pxm, p_nodes)) + continue; + + value = hmat_initiator_perf(target, initiator, loc->hmat_loc); + if (hmat_update_best(loc->hmat_loc->data_type, value, &best)) + bitmap_clear(p_nodes, 0, initiator->processor_pxm); + if (value != best) + clear_bit(initiator->processor_pxm, p_nodes); + } + if (best) + hmat_update_target_access(target, loc->hmat_loc->data_type, best); + } + + for_each_set_bit(i, p_nodes, MAX_NUMNODES) { + cpu_nid = pxm_to_node(i); + register_memory_node_under_compute_node(mem_nid, cpu_nid, 0); + } +} + +static __init void hmat_register_targets(void) +{ + struct memory_target *target; + + list_for_each_entry(target, &targets, node) + hmat_register_target_initiators(target); +} + +static __init void hmat_free_structures(void) +{ + struct memory_target *target, *tnext; + struct memory_locality *loc, *lnext; + struct memory_initiator *initiator, *inext; + + list_for_each_entry_safe(target, tnext, &targets, node) { + list_del(&target->node); + kfree(target); + } + + list_for_each_entry_safe(initiator, inext, &initiators, node) { + list_del(&initiator->node); + kfree(initiator); + } + + list_for_each_entry_safe(loc, lnext, &localities, node) { + list_del(&loc->node); + kfree(loc); + } +} + static __init int hmat_init(void) { struct acpi_table_header *tbl; @@ -207,6 +584,17 @@ static __init int hmat_init(void) if (srat_disabled()) return 0;
+ status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl); + if (ACPI_FAILURE(status)) + return 0; + + if (acpi_table_parse_entries(ACPI_SIG_SRAT, + sizeof(struct acpi_table_srat), + ACPI_SRAT_TYPE_MEMORY_AFFINITY, + srat_parse_mem_affinity, 0) < 0) + goto out_put; + acpi_put_table(tbl); + status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl); if (ACPI_FAILURE(status)) return 0; @@ -229,7 +617,9 @@ static __init int hmat_init(void) goto out_put; } } + hmat_register_targets(); out_put: + hmat_free_structures(); acpi_put_table(tbl); return 0; }
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 8d59f5a2ca765f65fd86db86f233678c3770819f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8d59f5a2ca765f65fd86db86f233678c3770819f upstream.
Save the best performance access attributes and register these with the memory's node if HMAT provides the locality table. While HMAT does make it possible to know performance for all possible initiator-target pairings, we export only the local pairings at this time.
Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Keith Busch keith.busch@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/hmat/Kconfig | 5 ++++- drivers/acpi/hmat/hmat.c | 10 +++++++++- 2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig index 13cddd612a52..95a29964dbea 100644 --- a/drivers/acpi/hmat/Kconfig +++ b/drivers/acpi/hmat/Kconfig @@ -2,7 +2,10 @@ config ACPI_HMAT bool "ACPI Heterogeneous Memory Attribute Table Support" depends on ACPI_NUMA + select HMEM_REPORTING help If set, this option has the kernel parse and report the platform's ACPI HMAT (Heterogeneous Memory Attributes Table), - and register memory initiators with their targets. + register memory initiators with their targets, and export + performance attributes through the node's sysfs device if + provided. diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c index 01a6eddac6f7..7a3a2b50cadd 100644 --- a/drivers/acpi/hmat/hmat.c +++ b/drivers/acpi/hmat/hmat.c @@ -545,12 +545,20 @@ static __init void hmat_register_target_initiators(struct memory_target *target) } }
+static __init void hmat_register_target_perf(struct memory_target *target) +{ + unsigned mem_nid = pxm_to_node(target->memory_pxm); + node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0); +} + static __init void hmat_register_targets(void) { struct memory_target *target;
- list_for_each_entry(target, &targets, node) + list_for_each_entry(target, &targets, node) { hmat_register_target_initiators(target); + hmat_register_target_perf(target); + } }
static __init void hmat_free_structures(void)
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit d9e8844c7d8165d97848950ae6bf66b2be86ef06 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d9e8844c7d8165d97848950ae6bf66b2be86ef06 upstream.
Register memory side cache attributes with the memory's node if HMAT provides the side cache iniformation table.
Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Keith Busch keith.busch@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/hmat/hmat.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c index 7a3a2b50cadd..b7824a0309f7 100644 --- a/drivers/acpi/hmat/hmat.c +++ b/drivers/acpi/hmat/hmat.c @@ -314,6 +314,7 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_hmat_cache *cache = (void *)header; + struct node_cache_attrs cache_attrs; u32 attrs;
if (cache->header.length < sizeof(*cache)) { @@ -327,6 +328,37 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header, cache->memory_PD, cache->cache_size, attrs, cache->number_of_SMBIOShandles);
+ cache_attrs.size = cache->cache_size; + cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4; + cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16; + + switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) { + case ACPI_HMAT_CA_DIRECT_MAPPED: + cache_attrs.indexing = NODE_CACHE_DIRECT_MAP; + break; + case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING: + cache_attrs.indexing = NODE_CACHE_INDEXED; + break; + case ACPI_HMAT_CA_NONE: + default: + cache_attrs.indexing = NODE_CACHE_OTHER; + break; + } + + switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) { + case ACPI_HMAT_CP_WB: + cache_attrs.write_policy = NODE_CACHE_WRITE_BACK; + break; + case ACPI_HMAT_CP_WT: + cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH; + break; + case ACPI_HMAT_CP_NONE: + default: + cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER; + break; + } + + node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs); return 0; }
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 13bac55ef7aef8ecb67ff3005d24b05a464d28ea category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 13bac55ef7aef8ecb67ff3005d24b05a464d28ea upstream.
Platforms may provide system memory where some physical address ranges perform differently than others, or is cached by the system on the memory side.
Add documentation describing a high level overview of such systems and the perforamnce and caching attributes the kernel provides for applications wishing to query this information.
Reviewed-by: Mike Rapoport rppt@linux.ibm.com Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Signed-off-by: Keith Busch keith.busch@intel.com Tested-by: Brice Goglin Brice.Goglin@inria.fr Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Fan Du fan.du@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/admin-guide/mm/numaperf.rst | 169 ++++++++++++++++++++++ 1 file changed, 169 insertions(+) create mode 100644 Documentation/admin-guide/mm/numaperf.rst
diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst new file mode 100644 index 000000000000..b79f70c04397 --- /dev/null +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -0,0 +1,169 @@ +.. _numaperf: + +============= +NUMA Locality +============= + +Some platforms may have multiple types of memory attached to a compute +node. These disparate memory ranges may share some characteristics, such +as CPU cache coherence, but may have different performance. For example, +different media types and buses affect bandwidth and latency. + +A system supports such heterogeneous memory by grouping each memory type +under different domains, or "nodes", based on locality and performance +characteristics. Some memory may share the same node as a CPU, and others +are provided as memory only nodes. While memory only nodes do not provide +CPUs, they may still be local to one or more compute nodes relative to +other nodes. The following diagram shows one such example of two compute +nodes with local memory and a memory only node for each of compute node: + + +------------------+ +------------------+ + | Compute Node 0 +-----+ Compute Node 1 | + | Local Node0 Mem | | Local Node1 Mem | + +--------+---------+ +--------+---------+ + | | + +--------+---------+ +--------+---------+ + | Slower Node2 Mem | | Slower Node3 Mem | + +------------------+ +--------+---------+ + +A "memory initiator" is a node containing one or more devices such as +CPUs or separate memory I/O devices that can initiate memory requests. +A "memory target" is a node containing one or more physical address +ranges accessible from one or more memory initiators. + +When multiple memory initiators exist, they may not all have the same +performance when accessing a given memory target. Each initiator-target +pair may be organized into different ranked access classes to represent +this relationship. The highest performing initiator to a given target +is considered to be one of that target's local initiators, and given +the highest access class, 0. Any given target may have one or more +local initiators, and any given initiator may have multiple local +memory targets. + +To aid applications matching memory targets with their initiators, the +kernel provides symlinks to each other. The following example lists the +relationship for the access class "0" memory initiators and targets:: + + # symlinks -v /sys/devices/system/node/nodeX/access0/targets/ + relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY + + # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/ + relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX + +A memory initiator may have multiple memory targets in the same access +class. The target memory's initiators in a given class indicate the +nodes' access characteristics share the same performance relative to other +linked initiator nodes. Each target within an initiator's access class, +though, do not necessarily perform the same as each other. + +================ +NUMA Performance +================ + +Applications may wish to consider which node they want their memory to +be allocated from based on the node's performance characteristics. If +the system provides these attributes, the kernel exports them under the +node sysfs hierarchy by appending the attributes directory under the +memory node's access class 0 initiators as follows:: + + /sys/devices/system/node/nodeY/access0/initiators/ + +These attributes apply only when accessed from nodes that have the +are linked under the this access's inititiators. + +The performance characteristics the kernel provides for the local initiators +are exported are as follows:: + + # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/ + /sys/devices/system/node/nodeY/access0/initiators/ + |-- read_bandwidth + |-- read_latency + |-- write_bandwidth + `-- write_latency + +The bandwidth attributes are provided in MiB/second. + +The latency attributes are provided in nanoseconds. + +The values reported here correspond to the rated latency and bandwidth +for the platform. + +========== +NUMA Cache +========== + +System memory may be constructed in a hierarchy of elements with various +performance characteristics in order to provide large address space of +slower performing memory cached by a smaller higher performing memory. The +system physical addresses memory initiators are aware of are provided +by the last memory level in the hierarchy. The system meanwhile uses +higher performing memory to transparently cache access to progressively +slower levels. + +The term "far memory" is used to denote the last level memory in the +hierarchy. Each increasing cache level provides higher performing +initiator access, and the term "near memory" represents the fastest +cache provided by the system. + +This numbering is different than CPU caches where the cache level (ex: +L1, L2, L3) uses the CPU-side view where each increased level is lower +performing. In contrast, the memory cache level is centric to the last +level memory, so the higher numbered cache level corresponds to memory +nearer to the CPU, and further from far memory. + +The memory-side caches are not directly addressable by software. When +software accesses a system address, the system will return it from the +near memory cache if it is present. If it is not present, the system +accesses the next level of memory until there is either a hit in that +cache level, or it reaches far memory. + +An application does not need to know about caching attributes in order +to use the system. Software may optionally query the memory cache +attributes in order to maximize the performance out of such a setup. +If the system provides a way for the kernel to discover this information, +for example with ACPI HMAT (Heterogeneous Memory Attribute Table), +the kernel will append these attributes to the NUMA node memory target. + +When the kernel first registers a memory cache with a node, the kernel +will create the following directory:: + + /sys/devices/system/node/nodeX/memory_side_cache/ + +If that directory is not present, the system either does not not provide +a memory-side cache, or that information is not accessible to the kernel. + +The attributes for each level of cache is provided under its cache +level index:: + + /sys/devices/system/node/nodeX/memory_side_cache/indexA/ + /sys/devices/system/node/nodeX/memory_side_cache/indexB/ + /sys/devices/system/node/nodeX/memory_side_cache/indexC/ + +Each cache level's directory provides its attributes. For example, the +following shows a single cache level and the attributes available for +software to query:: + + # tree sys/devices/system/node/node0/memory_side_cache/ + /sys/devices/system/node/node0/memory_side_cache/ + |-- index1 + | |-- indexing + | |-- line_size + | |-- size + | `-- write_policy + +The "indexing" will be 0 if it is a direct-mapped cache, and non-zero +for any other indexed based, multi-way associativity. + +The "line_size" is the number of bytes accessed from the next cache +level on a miss. + +The "size" is the number of bytes provided by this cache level. + +The "write_policy" will be 0 for write-back, and non-zero for +write-through caching. + +======== +See Also +======== +.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf + Section 5.2.27
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 878068ea270ea82767ff1d26c91583263c81fba0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 878068ea270ea82767ff1d26c91583263c81fba0 upstream.
Starting from Icelake, XMM registers can be collected in PEBS record. But current code only output the pt_regs.
Add a new struct x86_perf_regs for both pt_regs and xmm_regs. The xmm_regs will be used later to keep a pointer to PEBS record which has XMM information.
XMM registers are 128 bit. To simplify the code, they are handled like two different registers, which means setting two bits in the register bitmap. This also allows only sampling the lower 64bit bits in XMM.
The index of XMM registers starts from 32. There are 16 XMM registers. So all reserved space for regs are used. Remove REG_RESERVED.
Add PERF_REG_X86_XMM_MAX, which stands for the max number of all x86 regs including both GPRs and XMM.
Add REG_NOSUPPORT for 32bit to exclude unsupported registers.
Previous platforms can not collect XMM information in PEBS record. Adding pebs_no_xmm_regs to indicate the unsupported platforms.
The common code still validates the supported registers. However, it cannot check model specific registers, e.g. XMM. Add extra check in x86_pmu_hw_config() to reject invalid config of regs_user and regs_intr. The regs_user never supports XMM collection. The regs_intr only supports XMM collection when sampling PEBS event on icelake and later platforms.
Originally-by: Andi Kleen ak@linux.intel.com Suggested-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-3-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 15 ++++++++++++ arch/x86/events/intel/ds.c | 2 ++ arch/x86/events/perf_event.h | 33 +++++++++++++++++++++------ arch/x86/include/asm/perf_event.h | 5 ++++ arch/x86/include/uapi/asm/perf_regs.h | 23 ++++++++++++++++++- arch/x86/kernel/perf_regs.c | 27 ++++++++++++++++------ 6 files changed, 90 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 640f85da2b34..206ce31116b2 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -565,6 +565,21 @@ int x86_pmu_hw_config(struct perf_event *event) return -EINVAL; }
+ /* sample_regs_user never support XMM registers */ + if (unlikely(event->attr.sample_regs_user & PEBS_XMM_REGS)) + return -EINVAL; + /* + * Besides the general purpose registers, XMM registers may + * be collected in PEBS on some platforms, e.g. Icelake + */ + if (unlikely(event->attr.sample_regs_intr & PEBS_XMM_REGS)) { + if (x86_pmu.pebs_no_xmm_regs) + return -EINVAL; + + if (!event->attr.precise_ip) + return -EINVAL; + } + return x86_setup_perfctr(event); }
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 63ebe2caef73..50fd2d411e70 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -1649,6 +1649,8 @@ void __init intel_ds_init(void) x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS); x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS); x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE; + if (x86_pmu.version <= 4) + x86_pmu.pebs_no_xmm_regs = 1; if (x86_pmu.pebs) { char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-'; int format = x86_pmu.intel_cap.pebs_format; diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index f7e66a6ea4b1..c33cbf345293 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -123,6 +123,24 @@ struct amd_nb { (1ULL << PERF_REG_X86_R14) | \ (1ULL << PERF_REG_X86_R15))
+#define PEBS_XMM_REGS \ + ((1ULL << PERF_REG_X86_XMM0) | \ + (1ULL << PERF_REG_X86_XMM1) | \ + (1ULL << PERF_REG_X86_XMM2) | \ + (1ULL << PERF_REG_X86_XMM3) | \ + (1ULL << PERF_REG_X86_XMM4) | \ + (1ULL << PERF_REG_X86_XMM5) | \ + (1ULL << PERF_REG_X86_XMM6) | \ + (1ULL << PERF_REG_X86_XMM7) | \ + (1ULL << PERF_REG_X86_XMM8) | \ + (1ULL << PERF_REG_X86_XMM9) | \ + (1ULL << PERF_REG_X86_XMM10) | \ + (1ULL << PERF_REG_X86_XMM11) | \ + (1ULL << PERF_REG_X86_XMM12) | \ + (1ULL << PERF_REG_X86_XMM13) | \ + (1ULL << PERF_REG_X86_XMM14) | \ + (1ULL << PERF_REG_X86_XMM15)) + /* * Per register state. */ @@ -639,13 +657,14 @@ struct x86_pmu { /* * Intel DebugStore bits */ - unsigned int bts :1, - bts_active :1, - pebs :1, - pebs_active :1, - pebs_broken :1, - pebs_prec_dist :1, - pebs_no_tlb :1; + unsigned int bts :1, + bts_active :1, + pebs :1, + pebs_active :1, + pebs_broken :1, + pebs_prec_dist :1, + pebs_no_tlb :1, + pebs_no_xmm_regs :1; int pebs_record_size; int pebs_buffer_size; void (*drain_pebs)(struct pt_regs *regs); diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 7a5ca63a3cca..af66f97544a2 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -252,6 +252,11 @@ extern void perf_events_lapic_init(void); #define PERF_EFLAGS_VM (1UL << 5)
struct pt_regs; +struct x86_perf_regs { + struct pt_regs regs; + u64 *xmm_regs; +}; + extern unsigned long perf_instruction_pointer(struct pt_regs *regs); extern unsigned long perf_misc_flags(struct pt_regs *regs); #define perf_misc_flags(regs) perf_misc_flags(regs) diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h index f3329cabce5c..ac67bbea10ca 100644 --- a/arch/x86/include/uapi/asm/perf_regs.h +++ b/arch/x86/include/uapi/asm/perf_regs.h @@ -27,8 +27,29 @@ enum perf_event_x86_regs { PERF_REG_X86_R13, PERF_REG_X86_R14, PERF_REG_X86_R15, - + /* These are the limits for the GPRs. */ PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1, PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1, + + /* These all need two bits set because they are 128bit */ + PERF_REG_X86_XMM0 = 32, + PERF_REG_X86_XMM1 = 34, + PERF_REG_X86_XMM2 = 36, + PERF_REG_X86_XMM3 = 38, + PERF_REG_X86_XMM4 = 40, + PERF_REG_X86_XMM5 = 42, + PERF_REG_X86_XMM6 = 44, + PERF_REG_X86_XMM7 = 46, + PERF_REG_X86_XMM8 = 48, + PERF_REG_X86_XMM9 = 50, + PERF_REG_X86_XMM10 = 52, + PERF_REG_X86_XMM11 = 54, + PERF_REG_X86_XMM12 = 56, + PERF_REG_X86_XMM13 = 58, + PERF_REG_X86_XMM14 = 60, + PERF_REG_X86_XMM15 = 62, + + /* These include both GPRs and XMMX registers */ + PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2, }; #endif /* _ASM_X86_PERF_REGS_H */ diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c index c06c4c16c6b6..07c30ee17425 100644 --- a/arch/x86/kernel/perf_regs.c +++ b/arch/x86/kernel/perf_regs.c @@ -59,18 +59,34 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
u64 perf_reg_value(struct pt_regs *regs, int idx) { + struct x86_perf_regs *perf_regs; + + if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) { + perf_regs = container_of(regs, struct x86_perf_regs, regs); + if (!perf_regs->xmm_regs) + return 0; + return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0]; + } + if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset))) return 0;
return regs_get_register(regs, pt_regs_offset[idx]); }
-#define REG_RESERVED (~((1ULL << PERF_REG_X86_MAX) - 1ULL)) - #ifdef CONFIG_X86_32 +#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \ + (1ULL << PERF_REG_X86_R9) | \ + (1ULL << PERF_REG_X86_R10) | \ + (1ULL << PERF_REG_X86_R11) | \ + (1ULL << PERF_REG_X86_R12) | \ + (1ULL << PERF_REG_X86_R13) | \ + (1ULL << PERF_REG_X86_R14) | \ + (1ULL << PERF_REG_X86_R15)) + int perf_reg_validate(u64 mask) { - if (!mask || mask & REG_RESERVED) + if (!mask || (mask & REG_NOSUPPORT)) return -EINVAL;
return 0; @@ -96,10 +112,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
int perf_reg_validate(u64 mask) { - if (!mask || mask & REG_RESERVED) - return -EINVAL; - - if (mask & REG_NOSUPPORT) + if (!mask || (mask & REG_NOSUPPORT)) return -EINVAL;
return 0;
From: Andi Kleen ak@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 48f38aa4cc5a48bc0fe85c5c4b1ab171fbb539b6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 48f38aa4cc5a48bc0fe85c5c4b1ab171fbb539b6 upstream.
Extract some code related to memory profiling from the PEBS record parser into separate functions. It can be reused by the upcoming adaptive PEBS parser. No functional changes. Rename intel_hsw_weight to intel_get_tsx_weight, and intel_hsw_transaction to intel_get_tsx_transaction. Because the input is not the hsw pebs format anymore.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-4-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/ds.c | 63 ++++++++++++++++++++------------------ 1 file changed, 34 insertions(+), 29 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 50fd2d411e70..09f484114c7b 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -1146,34 +1146,50 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs) return 0; }
-static inline u64 intel_hsw_weight(struct pebs_record_skl *pebs) +static inline u64 intel_get_tsx_weight(u64 tsx_tuning) { - if (pebs->tsx_tuning) { - union hsw_tsx_tuning tsx = { .value = pebs->tsx_tuning }; + if (tsx_tuning) { + union hsw_tsx_tuning tsx = { .value = tsx_tuning }; return tsx.cycles_last_block; } return 0; }
-static inline u64 intel_hsw_transaction(struct pebs_record_skl *pebs) +static inline u64 intel_get_tsx_transaction(u64 tsx_tuning, u64 ax) { - u64 txn = (pebs->tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32; + u64 txn = (tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
/* For RTM XABORTs also log the abort code from AX */ - if ((txn & PERF_TXN_TRANSACTION) && (pebs->ax & 1)) - txn |= ((pebs->ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT; + if ((txn & PERF_TXN_TRANSACTION) && (ax & 1)) + txn |= ((ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT; return txn; }
+#define PERF_X86_EVENT_PEBS_HSW_PREC \ + (PERF_X86_EVENT_PEBS_ST_HSW | \ + PERF_X86_EVENT_PEBS_LD_HSW | \ + PERF_X86_EVENT_PEBS_NA_HSW) + +static u64 get_data_src(struct perf_event *event, u64 aux) +{ + u64 val = PERF_MEM_NA; + int fl = event->hw.flags; + bool fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC); + + if (fl & PERF_X86_EVENT_PEBS_LDLAT) + val = load_latency_data(aux); + else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC)) + val = precise_datala_hsw(event, aux); + else if (fst) + val = precise_store_data(aux); + return val; +} + static void setup_pebs_sample_data(struct perf_event *event, struct pt_regs *iregs, void *__pebs, struct perf_sample_data *data, struct pt_regs *regs) { -#define PERF_X86_EVENT_PEBS_HSW_PREC \ - (PERF_X86_EVENT_PEBS_ST_HSW | \ - PERF_X86_EVENT_PEBS_LD_HSW | \ - PERF_X86_EVENT_PEBS_NA_HSW) /* * We cast to the biggest pebs_record but are careful not to * unconditionally access the 'extra' entries. @@ -1181,17 +1197,13 @@ static void setup_pebs_sample_data(struct perf_event *event, struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct pebs_record_skl *pebs = __pebs; u64 sample_type; - int fll, fst, dsrc; - int fl = event->hw.flags; + int fll;
if (pebs == NULL) return;
sample_type = event->attr.sample_type; - dsrc = sample_type & PERF_SAMPLE_DATA_SRC; - - fll = fl & PERF_X86_EVENT_PEBS_LDLAT; - fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC); + fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
perf_sample_data_init(data, 0, event->hw.last_period);
@@ -1206,16 +1218,8 @@ static void setup_pebs_sample_data(struct perf_event *event, /* * data.data_src encodes the data source */ - if (dsrc) { - u64 val = PERF_MEM_NA; - if (fll) - val = load_latency_data(pebs->dse); - else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC)) - val = precise_datala_hsw(event, pebs->dse); - else if (fst) - val = precise_store_data(pebs->dse); - data->data_src.val = val; - } + if (sample_type & PERF_SAMPLE_DATA_SRC) + data->data_src.val = get_data_src(event, pebs->dse);
/* * We must however always use iregs for the unwinder to stay sane; the @@ -1302,10 +1306,11 @@ static void setup_pebs_sample_data(struct perf_event *event, if (x86_pmu.intel_cap.pebs_format >= 2) { /* Only set the TSX weight when no memory weight. */ if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll) - data->weight = intel_hsw_weight(pebs); + data->weight = intel_get_tsx_weight(pebs->tsx_tuning);
if (sample_type & PERF_SAMPLE_TRANSACTION) - data->txn = intel_hsw_transaction(pebs); + data->txn = intel_get_tsx_transaction(pebs->tsx_tuning, + pebs->ax); }
/*
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 477f00f9617009a9a3a9271885231573b728ca4f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 477f00f9617009a9a3a9271885231573b728ca4f upstream.
The drain_pebs() could be called twice in a short period for auto-reload event in pmu::read(). The intel_pmu_save_and_restart_reload() should be called to update the event->count.
This case should also be handled on Icelake. Extract the code for later reuse.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-5-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/ds.c | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 09f484114c7b..ffaecf809d41 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -1512,6 +1512,25 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs) __intel_pmu_pebs_event(event, iregs, at, top, 0, n); }
+static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, int size) +{ + struct perf_event *event; + int bit; + + /* + * The drain_pebs() could be called twice in a short period + * for auto-reload event in pmu::read(). There are no + * overflows have happened in between. + * It needs to call intel_pmu_save_and_restart_reload() to + * update the event->count for this case. + */ + for_each_set_bit(bit, (unsigned long *)&cpuc->pebs_enabled, size) { + event = cpuc->events[bit]; + if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD) + intel_pmu_save_and_restart_reload(event, 0); + } +} + static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); @@ -1539,19 +1558,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) }
if (unlikely(base >= top)) { - /* - * The drain_pebs() could be called twice in a short period - * for auto-reload event in pmu::read(). There are no - * overflows have happened in between. - * It needs to call intel_pmu_save_and_restart_reload() to - * update the event->count for this case. - */ - for_each_set_bit(bit, (unsigned long *)&cpuc->pebs_enabled, - size) { - event = cpuc->events[bit]; - if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD) - intel_pmu_save_and_restart_reload(event, 0); - } + intel_pmu_pebs_event_update_no_drain(cpuc, size); return; }
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit c22497f5838c237e3094a4dfb99d1c5de6353239 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit c22497f5838c237e3094a4dfb99d1c5de6353239 upstream.
Adaptive PEBS is a new way to report PEBS sampling information. Instead of a fixed size record for all PEBS events it allows to configure the PEBS record to only include the information needed. Events can then opt in to use such an extended record, or stay with a basic record which only contains the IP.
The major new feature is to support LBRs in PEBS record. Besides normal LBR, this allows (much faster) large PEBS, while still supporting callstacks through callstack LBR. So essentially a lot of profiling can now be done without frequent interrupts, dropping the overhead significantly.
The main requirement still is to use a period, and not use frequency mode, because frequency mode requires reevaluating the frequency on each overflow.
The floating point state (XMM) is also supported, which allows efficient profiling of FP function arguments.
Introduce specific drain function to handle variable length records. Use a new callback to parse the new record format, and also handle the STATUS field now being at a different offset.
Add code to set up the configuration register. Since there is only a single register, all events either get the full super set of all events, or only the basic record.
Originally-by: Andi Kleen ak@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-6-kan.liang@linux.intel.com [ Renamed GPRS => GP. ] Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 7 + arch/x86/events/intel/ds.c | 380 ++++++++++++++++++++++++++++-- arch/x86/events/intel/lbr.c | 22 ++ arch/x86/events/perf_event.h | 11 +- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/perf_event.h | 43 ++++ 6 files changed, 438 insertions(+), 26 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 012b89a113c0..d6abf1bcdb95 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2198,6 +2198,11 @@ static void intel_pmu_enable_fixed(struct perf_event *event) bits <<= (idx * 4); mask = 0xfULL << (idx * 4);
+ if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip) { + bits |= ICL_FIXED_0_ADAPTIVE << (idx * 4); + mask |= ICL_FIXED_0_ADAPTIVE << (idx * 4); + } + rdmsrl(hwc->config_base, ctrl_val); ctrl_val &= ~mask; ctrl_val |= bits; @@ -3498,6 +3503,8 @@ static struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu) { + cpuc->pebs_record_size = x86_pmu.pebs_record_size; + if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) { cpuc->shared_regs = allocate_shared_regs(cpu); if (!cpuc->shared_regs) diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index ffaecf809d41..f3d6493a04a7 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -926,17 +926,87 @@ static inline void pebs_update_threshold(struct cpu_hw_events *cpuc)
if (cpuc->n_pebs == cpuc->n_large_pebs) { threshold = ds->pebs_absolute_maximum - - reserved * x86_pmu.pebs_record_size; + reserved * cpuc->pebs_record_size; } else { - threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size; + threshold = ds->pebs_buffer_base + cpuc->pebs_record_size; }
ds->pebs_interrupt_threshold = threshold; }
+static void adaptive_pebs_record_size_update(void) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + u64 pebs_data_cfg = cpuc->pebs_data_cfg; + int sz = sizeof(struct pebs_basic); + + if (pebs_data_cfg & PEBS_DATACFG_MEMINFO) + sz += sizeof(struct pebs_meminfo); + if (pebs_data_cfg & PEBS_DATACFG_GP) + sz += sizeof(struct pebs_gprs); + if (pebs_data_cfg & PEBS_DATACFG_XMMS) + sz += sizeof(struct pebs_xmm); + if (pebs_data_cfg & PEBS_DATACFG_LBRS) + sz += x86_pmu.lbr_nr * sizeof(struct pebs_lbr_entry); + + cpuc->pebs_record_size = sz; +} + +#define PERF_PEBS_MEMINFO_TYPE (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC | \ + PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT | \ + PERF_SAMPLE_TRANSACTION) + +static u64 pebs_update_adaptive_cfg(struct perf_event *event) +{ + struct perf_event_attr *attr = &event->attr; + u64 sample_type = attr->sample_type; + u64 pebs_data_cfg = 0; + bool gprs, tsx_weight; + + if (!(sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) && + attr->precise_ip > 1) + return pebs_data_cfg; + + if (sample_type & PERF_PEBS_MEMINFO_TYPE) + pebs_data_cfg |= PEBS_DATACFG_MEMINFO; + + /* + * We need GPRs when: + * + user requested them + * + precise_ip < 2 for the non event IP + * + For RTM TSX weight we need GPRs for the abort code. + */ + gprs = (sample_type & PERF_SAMPLE_REGS_INTR) && + (attr->sample_regs_intr & PEBS_GP_REGS); + + tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT) && + ((attr->config & INTEL_ARCH_EVENT_MASK) == + x86_pmu.rtm_abort_event); + + if (gprs || (attr->precise_ip < 2) || tsx_weight) + pebs_data_cfg |= PEBS_DATACFG_GP; + + if ((sample_type & PERF_SAMPLE_REGS_INTR) && + (attr->sample_regs_intr & PEBS_XMM_REGS)) + pebs_data_cfg |= PEBS_DATACFG_XMMS; + + if (sample_type & PERF_SAMPLE_BRANCH_STACK) { + /* + * For now always log all LBRs. Could configure this + * later. + */ + pebs_data_cfg |= PEBS_DATACFG_LBRS | + ((x86_pmu.lbr_nr-1) << PEBS_DATACFG_LBR_SHIFT); + } + + return pebs_data_cfg; +} + static void -pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu) +pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, + struct perf_event *event, bool add) { + struct pmu *pmu = event->ctx->pmu; /* * Make sure we get updated with the first PEBS * event. It will trigger also during removal, but @@ -953,6 +1023,29 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct pmu *pmu) update = true; }
+ /* + * The PEBS record doesn't shrink on pmu::del(). Doing so would require + * iterating all remaining PEBS events to reconstruct the config. + */ + if (x86_pmu.intel_cap.pebs_baseline && add) { + u64 pebs_data_cfg; + + /* Clear pebs_data_cfg and pebs_record_size for first PEBS. */ + if (cpuc->n_pebs == 1) { + cpuc->pebs_data_cfg = 0; + cpuc->pebs_record_size = sizeof(struct pebs_basic); + } + + pebs_data_cfg = pebs_update_adaptive_cfg(event); + + /* Update pebs_record_size if new event requires more data. */ + if (pebs_data_cfg & ~cpuc->pebs_data_cfg) { + cpuc->pebs_data_cfg |= pebs_data_cfg; + adaptive_pebs_record_size_update(); + update = true; + } + } + if (update) pebs_update_threshold(cpuc); } @@ -967,7 +1060,7 @@ void intel_pmu_pebs_add(struct perf_event *event) if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS) cpuc->n_large_pebs++;
- pebs_update_state(needed_cb, cpuc, event->ctx->pmu); + pebs_update_state(needed_cb, cpuc, event, true); }
void intel_pmu_pebs_enable(struct perf_event *event) @@ -985,6 +1078,14 @@ void intel_pmu_pebs_enable(struct perf_event *event) else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST) cpuc->pebs_enabled |= 1ULL << 63;
+ if (x86_pmu.intel_cap.pebs_baseline) { + hwc->config |= ICL_EVENTSEL_ADAPTIVE; + if (cpuc->pebs_data_cfg != cpuc->active_pebs_data_cfg) { + wrmsrl(MSR_PEBS_DATA_CFG, cpuc->pebs_data_cfg); + cpuc->active_pebs_data_cfg = cpuc->pebs_data_cfg; + } + } + /* * Use auto-reload if possible to save a MSR write in the PMI. * This must be done in pmu::start(), because PERF_EVENT_IOC_PERIOD. @@ -1011,7 +1112,7 @@ void intel_pmu_pebs_del(struct perf_event *event) if (hwc->flags & PERF_X86_EVENT_LARGE_PEBS) cpuc->n_large_pebs--;
- pebs_update_state(needed_cb, cpuc, event->ctx->pmu); + pebs_update_state(needed_cb, cpuc, event, false); }
void intel_pmu_pebs_disable(struct perf_event *event) @@ -1165,6 +1266,13 @@ static inline u64 intel_get_tsx_transaction(u64 tsx_tuning, u64 ax) return txn; }
+static inline u64 get_pebs_status(void *n) +{ + if (x86_pmu.intel_cap.pebs_format < 4) + return ((struct pebs_record_nhm *)n)->status; + return ((struct pebs_basic *)n)->applicable_counters; +} + #define PERF_X86_EVENT_PEBS_HSW_PREC \ (PERF_X86_EVENT_PEBS_ST_HSW | \ PERF_X86_EVENT_PEBS_LD_HSW | \ @@ -1185,7 +1293,7 @@ static u64 get_data_src(struct perf_event *event, u64 aux) return val; }
-static void setup_pebs_sample_data(struct perf_event *event, +static void setup_pebs_fixed_sample_data(struct perf_event *event, struct pt_regs *iregs, void *__pebs, struct perf_sample_data *data, struct pt_regs *regs) @@ -1327,6 +1435,140 @@ static void setup_pebs_sample_data(struct perf_event *event, data->br_stack = &cpuc->lbr_stack; }
+static void adaptive_pebs_save_regs(struct pt_regs *regs, + struct pebs_gprs *gprs) +{ + regs->ax = gprs->ax; + regs->bx = gprs->bx; + regs->cx = gprs->cx; + regs->dx = gprs->dx; + regs->si = gprs->si; + regs->di = gprs->di; + regs->bp = gprs->bp; + regs->sp = gprs->sp; +#ifndef CONFIG_X86_32 + regs->r8 = gprs->r8; + regs->r9 = gprs->r9; + regs->r10 = gprs->r10; + regs->r11 = gprs->r11; + regs->r12 = gprs->r12; + regs->r13 = gprs->r13; + regs->r14 = gprs->r14; + regs->r15 = gprs->r15; +#endif +} + +/* + * With adaptive PEBS the layout depends on what fields are configured. + */ + +static void setup_pebs_adaptive_sample_data(struct perf_event *event, + struct pt_regs *iregs, void *__pebs, + struct perf_sample_data *data, + struct pt_regs *regs) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + struct pebs_basic *basic = __pebs; + void *next_record = basic + 1; + u64 sample_type; + u64 format_size; + struct pebs_meminfo *meminfo = NULL; + struct pebs_gprs *gprs = NULL; + struct x86_perf_regs *perf_regs; + + if (basic == NULL) + return; + + perf_regs = container_of(regs, struct x86_perf_regs, regs); + perf_regs->xmm_regs = NULL; + + sample_type = event->attr.sample_type; + format_size = basic->format_size; + perf_sample_data_init(data, 0, event->hw.last_period); + data->period = event->hw.last_period; + + if (event->attr.use_clockid == 0) + data->time = native_sched_clock_from_tsc(basic->tsc); + + /* + * We must however always use iregs for the unwinder to stay sane; the + * record BP,SP,IP can point into thin air when the record is from a + * previous PMI context or an (I)RET happened between the record and + * PMI. + */ + if (sample_type & PERF_SAMPLE_CALLCHAIN) + data->callchain = perf_callchain(event, iregs); + + *regs = *iregs; + /* The ip in basic is EventingIP */ + set_linear_ip(regs, basic->ip); + regs->flags = PERF_EFLAGS_EXACT; + + /* + * The record for MEMINFO is in front of GP + * But PERF_SAMPLE_TRANSACTION needs gprs->ax. + * Save the pointer here but process later. + */ + if (format_size & PEBS_DATACFG_MEMINFO) { + meminfo = next_record; + next_record = meminfo + 1; + } + + if (format_size & PEBS_DATACFG_GP) { + gprs = next_record; + next_record = gprs + 1; + + if (event->attr.precise_ip < 2) { + set_linear_ip(regs, gprs->ip); + regs->flags &= ~PERF_EFLAGS_EXACT; + } + + if (sample_type & PERF_SAMPLE_REGS_INTR) + adaptive_pebs_save_regs(regs, gprs); + } + + if (format_size & PEBS_DATACFG_MEMINFO) { + if (sample_type & PERF_SAMPLE_WEIGHT) + data->weight = meminfo->latency ?: + intel_get_tsx_weight(meminfo->tsx_tuning); + + if (sample_type & PERF_SAMPLE_DATA_SRC) + data->data_src.val = get_data_src(event, meminfo->aux); + + if (sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR)) + data->addr = meminfo->address; + + if (sample_type & PERF_SAMPLE_TRANSACTION) + data->txn = intel_get_tsx_transaction(meminfo->tsx_tuning, + gprs ? gprs->ax : 0); + } + + if (format_size & PEBS_DATACFG_XMMS) { + struct pebs_xmm *xmm = next_record; + + next_record = xmm + 1; + perf_regs->xmm_regs = xmm->xmm; + } + + if (format_size & PEBS_DATACFG_LBRS) { + struct pebs_lbr *lbr = next_record; + int num_lbr = ((format_size >> PEBS_DATACFG_LBR_SHIFT) + & 0xff) + 1; + next_record = next_record + num_lbr*sizeof(struct pebs_lbr_entry); + + if (has_branch_stack(event)) { + intel_pmu_store_pebs_lbrs(lbr); + data->br_stack = &cpuc->lbr_stack; + } + } + + WARN_ONCE(next_record != __pebs + (format_size >> 48), + "PEBS record size %llu, expected %llu, config %llx\n", + format_size >> 48, + (u64)(next_record - __pebs), + basic->format_size); +} + static inline void * get_next_pebs_record_by_bit(void *base, void *top, int bit) { @@ -1344,19 +1586,19 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit) if (base == NULL) return NULL;
- for (at = base; at < top; at += x86_pmu.pebs_record_size) { - struct pebs_record_nhm *p = at; + for (at = base; at < top; at += cpuc->pebs_record_size) { + unsigned long status = get_pebs_status(at);
- if (test_bit(bit, (unsigned long *)&p->status)) { + if (test_bit(bit, (unsigned long *)&status)) { /* PEBS v3 has accurate status bits */ if (x86_pmu.intel_cap.pebs_format >= 3) return at;
- if (p->status == (1 << bit)) + if (status == (1 << bit)) return at;
/* clear non-PEBS bit and re-check */ - pebs_status = p->status & cpuc->pebs_enabled; + pebs_status = status & cpuc->pebs_enabled; pebs_status &= PEBS_COUNTER_MASK; if (pebs_status == (1 << bit)) return at; @@ -1436,11 +1678,18 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count) static void __intel_pmu_pebs_event(struct perf_event *event, struct pt_regs *iregs, void *base, void *top, - int bit, int count) + int bit, int count, + void (*setup_sample)(struct perf_event *, + struct pt_regs *, + void *, + struct perf_sample_data *, + struct pt_regs *)) { + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct hw_perf_event *hwc = &event->hw; struct perf_sample_data data; - struct pt_regs regs; + struct x86_perf_regs perf_regs; + struct pt_regs *regs = &perf_regs.regs; void *at = get_next_pebs_record_by_bit(base, top, bit);
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) { @@ -1455,20 +1704,20 @@ static void __intel_pmu_pebs_event(struct perf_event *event, return;
while (count > 1) { - setup_pebs_sample_data(event, iregs, at, &data, ®s); - perf_event_output(event, &data, ®s); - at += x86_pmu.pebs_record_size; + setup_sample(event, iregs, at, &data, regs); + perf_event_output(event, &data, regs); + at += cpuc->pebs_record_size; at = get_next_pebs_record_by_bit(at, top, bit); count--; }
- setup_pebs_sample_data(event, iregs, at, &data, ®s); + setup_sample(event, iregs, at, &data, regs);
/* * All but the last records are processed. * The last one is left to be able to call the overflow handler. */ - if (perf_event_overflow(event, &data, ®s)) { + if (perf_event_overflow(event, &data, regs)) { x86_pmu_stop(event, 0); return; } @@ -1509,7 +1758,8 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs) return; }
- __intel_pmu_pebs_event(event, iregs, at, top, 0, n); + __intel_pmu_pebs_event(event, iregs, at, top, 0, n, + setup_pebs_fixed_sample_data); }
static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, int size) @@ -1571,8 +1821,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
/* PEBS v3 has more accurate status bits */ if (x86_pmu.intel_cap.pebs_format >= 3) { - for_each_set_bit(bit, (unsigned long *)&pebs_status, - size) + for_each_set_bit(bit, (unsigned long *)&pebs_status, size) counts[bit]++;
continue; @@ -1611,8 +1860,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) * If collision happened, the record will be dropped. */ if (p->status != (1ULL << bit)) { - for_each_set_bit(i, (unsigned long *)&pebs_status, - x86_pmu.max_pebs_events) + for_each_set_bit(i, (unsigned long *)&pebs_status, size) error[i]++; continue; } @@ -1620,7 +1868,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) counts[bit]++; }
- for (bit = 0; bit < size; bit++) { + for_each_set_bit(bit, (unsigned long *)&mask, size) { if ((counts[bit] == 0) && (error[bit] == 0)) continue;
@@ -1641,11 +1889,66 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
if (counts[bit]) { __intel_pmu_pebs_event(event, iregs, base, - top, bit, counts[bit]); + top, bit, counts[bit], + setup_pebs_fixed_sample_data); } } }
+static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs) +{ + short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {}; + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + struct debug_store *ds = cpuc->ds; + struct perf_event *event; + void *base, *at, *top; + int bit, size; + u64 mask; + + if (!x86_pmu.pebs_active) + return; + + base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base; + top = (struct pebs_basic *)(unsigned long)ds->pebs_index; + + ds->pebs_index = ds->pebs_buffer_base; + + mask = ((1ULL << x86_pmu.max_pebs_events) - 1) | + (((1ULL << x86_pmu.num_counters_fixed) - 1) << INTEL_PMC_IDX_FIXED); + size = INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed; + + if (unlikely(base >= top)) { + intel_pmu_pebs_event_update_no_drain(cpuc, size); + return; + } + + for (at = base; at < top; at += cpuc->pebs_record_size) { + u64 pebs_status; + + pebs_status = get_pebs_status(at) & cpuc->pebs_enabled; + pebs_status &= mask; + + for_each_set_bit(bit, (unsigned long *)&pebs_status, size) + counts[bit]++; + } + + for_each_set_bit(bit, (unsigned long *)&mask, size) { + if (counts[bit] == 0) + continue; + + event = cpuc->events[bit]; + if (WARN_ON_ONCE(!event)) + continue; + + if (WARN_ON_ONCE(!event->attr.precise_ip)) + continue; + + __intel_pmu_pebs_event(event, iregs, base, + top, bit, counts[bit], + setup_pebs_adaptive_sample_data); + } +} + /* * BTS, PEBS probe and setup */ @@ -1665,8 +1968,12 @@ void __init intel_ds_init(void) x86_pmu.pebs_no_xmm_regs = 1; if (x86_pmu.pebs) { char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-'; + char *pebs_qual = ""; int format = x86_pmu.intel_cap.pebs_format;
+ if (format < 4) + x86_pmu.intel_cap.pebs_baseline = 0; + switch (format) { case 0: pr_cont("PEBS fmt0%c, ", pebs_type); @@ -1702,6 +2009,29 @@ void __init intel_ds_init(void) x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME; break;
+ case 4: + x86_pmu.drain_pebs = intel_pmu_drain_pebs_icl; + x86_pmu.pebs_record_size = sizeof(struct pebs_basic); + if (x86_pmu.intel_cap.pebs_baseline) { + x86_pmu.large_pebs_flags |= + PERF_SAMPLE_BRANCH_STACK | + PERF_SAMPLE_TIME; + x86_pmu.flags |= PMU_FL_PEBS_ALL; + pebs_qual = "-baseline"; + } else { + /* Only basic record supported */ + x86_pmu.pebs_no_xmm_regs = 1; + x86_pmu.large_pebs_flags &= + ~(PERF_SAMPLE_ADDR | + PERF_SAMPLE_TIME | + PERF_SAMPLE_DATA_SRC | + PERF_SAMPLE_TRANSACTION | + PERF_SAMPLE_REGS_USER | + PERF_SAMPLE_REGS_INTR); + } + pr_cont("PEBS fmt4%c%s, ", pebs_type, pebs_qual); + break; + default: pr_cont("no PEBS fmt%d%c, ", format, pebs_type); x86_pmu.pebs = 0; diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index c88ed39582a1..58889f952959 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1079,6 +1079,28 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc) } }
+void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + int i; + + cpuc->lbr_stack.nr = x86_pmu.lbr_nr; + for (i = 0; i < x86_pmu.lbr_nr; i++) { + u64 info = lbr->lbr[i].info; + struct perf_branch_entry *e = &cpuc->lbr_entries[i]; + + e->from = lbr->lbr[i].from; + e->to = lbr->lbr[i].to; + e->mispred = !!(info & LBR_INFO_MISPRED); + e->predicted = !(info & LBR_INFO_MISPRED); + e->in_tx = !!(info & LBR_INFO_IN_TX); + e->abort = !!(info & LBR_INFO_ABORT); + e->cycles = info & LBR_INFO_CYCLES; + e->reserved = 0; + } + intel_pmu_lbr_filter(cpuc); +} + /* * Map interface branch filters onto LBR filters */ diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index c33cbf345293..da2ff0455d29 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -232,6 +232,11 @@ struct cpu_hw_events { int n_pebs; int n_large_pebs;
+ /* Current super set of events hardware configuration */ + u64 pebs_data_cfg; + u64 active_pebs_data_cfg; + int pebs_record_size; + /* * Intel LBR bits */ @@ -523,6 +528,7 @@ union perf_capabilities { * values > 32bit. */ u64 full_width_write:1; + u64 pebs_baseline:1; }; u64 capabilities; }; @@ -667,11 +673,12 @@ struct x86_pmu { pebs_no_xmm_regs :1; int pebs_record_size; int pebs_buffer_size; + int max_pebs_events; void (*drain_pebs)(struct pt_regs *regs); struct event_constraint *pebs_constraints; void (*pebs_aliases)(struct perf_event *event); - int max_pebs_events; unsigned long large_pebs_flags; + u64 rtm_abort_event;
/* * Intel LBR @@ -1012,6 +1019,8 @@ void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
void intel_pmu_auto_reload_read(struct perf_event *event);
+void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr); + void intel_ds_init(void);
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in); diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 4d69cee70fbe..76b1c43547b0 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -143,6 +143,7 @@ #define LBR_INFO_CYCLES 0xffff
#define MSR_IA32_PEBS_ENABLE 0x000003f1 +#define MSR_PEBS_DATA_CFG 0x000003f2 #define MSR_IA32_DS_AREA 0x00000600 #define MSR_IA32_PERF_CAPABILITIES 0x00000345 #define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6 diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index af66f97544a2..b4085e08cc5e 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -32,6 +32,8 @@
#define HSW_IN_TX (1ULL << 32) #define HSW_IN_TX_CHECKPOINTED (1ULL << 33) +#define ICL_EVENTSEL_ADAPTIVE (1ULL << 34) +#define ICL_FIXED_0_ADAPTIVE (1ULL << 32)
#define AMD64_EVENTSEL_INT_CORE_ENABLE (1ULL << 36) #define AMD64_EVENTSEL_GUESTONLY (1ULL << 40) @@ -87,6 +89,12 @@ #define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6 #define ARCH_PERFMON_EVENTS_COUNT 7
+#define PEBS_DATACFG_MEMINFO BIT_ULL(0) +#define PEBS_DATACFG_GP BIT_ULL(1) +#define PEBS_DATACFG_XMMS BIT_ULL(2) +#define PEBS_DATACFG_LBRS BIT_ULL(3) +#define PEBS_DATACFG_LBR_SHIFT 24 + /* * Intel "Architectural Performance Monitoring" CPUID * detection/enumeration details: @@ -176,6 +184,41 @@ struct x86_pmu_capability { #define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58) #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+/* + * Adaptive PEBS v4 + */ + +struct pebs_basic { + u64 format_size; + u64 ip; + u64 applicable_counters; + u64 tsc; +}; + +struct pebs_meminfo { + u64 address; + u64 aux; + u64 latency; + u64 tsx_tuning; +}; + +struct pebs_gprs { + u64 flags, ip, ax, cx, dx, bx, sp, bp, si, di; + u64 r8, r9, r10, r11, r12, r13, r14, r15; +}; + +struct pebs_xmm { + u64 xmm[16*2]; /* two entries for each register */ +}; + +struct pebs_lbr_entry { + u64 from, to, info; +}; + +struct pebs_lbr { + struct pebs_lbr_entry lbr[0]; /* Variable length */ +}; + /* * IBS cpuid feature detection */
From: Andi Kleen ak@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit d3617b98b04583df222f34992e65712862a77bf1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d3617b98b04583df222f34992e65712862a77bf1 upstream.
With adaptive PEBS the CPU can directly supply the LBR information, so we don't need to read it again. But the LBRs still need to be enabled. Add a special count to the cpuc that distinguishes these two cases, and avoid reading the LBRs unnecessarily when PEBS is active.
Signed-off-by: Andi Kleen ak@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-7-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/lbr.c | 13 ++++++++++++- arch/x86/events/perf_event.h | 1 + 2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 58889f952959..b8cac31f089c 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -488,6 +488,8 @@ void intel_pmu_lbr_add(struct perf_event *event) * be 'new'. Conversely, a new event can get installed through the * context switch path for the first time. */ + if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0) + cpuc->lbr_pebs_users++; perf_sched_cb_inc(event->ctx->pmu); if (!cpuc->lbr_users++ && !event->total_time_running) intel_pmu_lbr_reset(); @@ -507,8 +509,11 @@ void intel_pmu_lbr_del(struct perf_event *event) task_ctx->lbr_callstack_users--; }
+ if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0) + cpuc->lbr_pebs_users--; cpuc->lbr_users--; WARN_ON_ONCE(cpuc->lbr_users < 0); + WARN_ON_ONCE(cpuc->lbr_pebs_users < 0); perf_sched_cb_dec(event->ctx->pmu); }
@@ -658,7 +663,13 @@ void intel_pmu_lbr_read(void) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (!cpuc->lbr_users) + /* + * Don't read when all LBRs users are using adaptive PEBS. + * + * This could be smarter and actually check the event, + * but this simple approach seems to work for now. + */ + if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users) return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index da2ff0455d29..55b32febbe33 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -241,6 +241,7 @@ struct cpu_hw_events { * Intel LBR bits */ int lbr_users; + int lbr_pebs_users; struct perf_branch_stack lbr_stack; struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES]; struct er_account *lbr_sel;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 6e394376ee89233508fa21d006546357f8efee31 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6e394376ee89233508fa21d006546357f8efee31 upstream.
Add Intel Icelake uncore support:
- The init code is based on Skylake - Add new PCI id for IMC - New MSR address for CBOX - Get CBOX# from CNL_UNC_CBO_CONFIG MSR directly - Create a new PMU for fixed clocktick counter
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-13-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 6 ++ arch/x86/events/intel/uncore.h | 1 + arch/x86/events/intel/uncore_snb.c | 91 ++++++++++++++++++++++++++++++ 3 files changed, 98 insertions(+)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 7098b9b05d56..62e074d223b9 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1406,6 +1406,11 @@ static const struct intel_uncore_init_fun skx_uncore_init __initconst = { .pci_init = skx_uncore_pci_init, };
+static const struct intel_uncore_init_fun icl_uncore_init __initconst = { + .cpu_init = icl_uncore_cpu_init, + .pci_init = skl_uncore_pci_init, +}; + static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM_EP, nhm_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM, nhm_uncore_init), @@ -1432,6 +1437,7 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_UNCORE_MODEL_MATCH(INTEL_FAM6_SKYLAKE_X, skx_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_MOBILE, skl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_DESKTOP, skl_uncore_init), + X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_MOBILE, icl_uncore_init), {}, };
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index 40e040ec31b5..b6ceb61b9921 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -493,6 +493,7 @@ int skl_uncore_pci_init(void); void snb_uncore_cpu_init(void); void nhm_uncore_cpu_init(void); void skl_uncore_cpu_init(void); +void icl_uncore_cpu_init(void); int snb_pci2phy_map_init(int devid);
/* uncore_snbep.c */ diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c index 2d328386f83a..2c3f428f5709 100644 --- a/arch/x86/events/intel/uncore_snb.c +++ b/arch/x86/events/intel/uncore_snb.c @@ -34,6 +34,8 @@ #define PCI_DEVICE_ID_INTEL_CFL_4S_S_IMC 0x3e33 #define PCI_DEVICE_ID_INTEL_CFL_6S_S_IMC 0x3eca #define PCI_DEVICE_ID_INTEL_CFL_8S_S_IMC 0x3e32 +#define PCI_DEVICE_ID_INTEL_ICL_U_IMC 0x8a02 +#define PCI_DEVICE_ID_INTEL_ICL_U2_IMC 0x8a12
/* SNB event control */ #define SNB_UNC_CTL_EV_SEL_MASK 0x000000ff @@ -93,6 +95,12 @@ #define SKL_UNC_PERF_GLOBAL_CTL 0xe01 #define SKL_UNC_GLOBAL_CTL_CORE_ALL ((1 << 5) - 1)
+/* ICL Cbo register */ +#define ICL_UNC_CBO_CONFIG 0x396 +#define ICL_UNC_NUM_CBO_MASK 0xf +#define ICL_UNC_CBO_0_PER_CTR0 0x702 +#define ICL_UNC_CBO_MSR_OFFSET 0x8 + DEFINE_UNCORE_FORMAT_ATTR(event, event, "config:0-7"); DEFINE_UNCORE_FORMAT_ATTR(umask, umask, "config:8-15"); DEFINE_UNCORE_FORMAT_ATTR(edge, edge, "config:18"); @@ -276,6 +284,70 @@ void skl_uncore_cpu_init(void) snb_uncore_arb.ops = &skl_uncore_msr_ops; }
+static struct intel_uncore_type icl_uncore_cbox = { + .name = "cbox", + .num_counters = 4, + .perf_ctr_bits = 44, + .perf_ctr = ICL_UNC_CBO_0_PER_CTR0, + .event_ctl = SNB_UNC_CBO_0_PERFEVTSEL0, + .event_mask = SNB_UNC_RAW_EVENT_MASK, + .msr_offset = ICL_UNC_CBO_MSR_OFFSET, + .ops = &skl_uncore_msr_ops, + .format_group = &snb_uncore_format_group, +}; + +static struct uncore_event_desc icl_uncore_events[] = { + INTEL_UNCORE_EVENT_DESC(clockticks, "event=0xff"), + { /* end: all zeroes */ }, +}; + +static struct attribute *icl_uncore_clock_formats_attr[] = { + &format_attr_event.attr, + NULL, +}; + +static struct attribute_group icl_uncore_clock_format_group = { + .name = "format", + .attrs = icl_uncore_clock_formats_attr, +}; + +static struct intel_uncore_type icl_uncore_clockbox = { + .name = "clock", + .num_counters = 1, + .num_boxes = 1, + .fixed_ctr_bits = 48, + .fixed_ctr = SNB_UNC_FIXED_CTR, + .fixed_ctl = SNB_UNC_FIXED_CTR_CTRL, + .single_fixed = 1, + .event_mask = SNB_UNC_CTL_EV_SEL_MASK, + .format_group = &icl_uncore_clock_format_group, + .ops = &skl_uncore_msr_ops, + .event_descs = icl_uncore_events, +}; + +static struct intel_uncore_type *icl_msr_uncores[] = { + &icl_uncore_cbox, + &snb_uncore_arb, + &icl_uncore_clockbox, + NULL, +}; + +static int icl_get_cbox_num(void) +{ + u64 num_boxes; + + rdmsrl(ICL_UNC_CBO_CONFIG, num_boxes); + + return num_boxes & ICL_UNC_NUM_CBO_MASK; +} + +void icl_uncore_cpu_init(void) +{ + uncore_msr_uncores = icl_msr_uncores; + icl_uncore_cbox.num_boxes = icl_get_cbox_num(); + snb_uncore_arb.ops = &skl_uncore_msr_ops; +} + enum { SNB_PCI_UNCORE_IMC, }; @@ -669,6 +741,18 @@ static const struct pci_device_id skl_uncore_pci_ids[] = { { /* end: all zeroes */ }, };
+static const struct pci_device_id icl_uncore_pci_ids[] = { + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICL_U_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* IMC */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICL_U2_IMC), + .driver_data = UNCORE_PCI_DEV_DATA(SNB_PCI_UNCORE_IMC, 0), + }, + { /* end: all zeroes */ }, +}; + static struct pci_driver snb_uncore_pci_driver = { .name = "snb_uncore", .id_table = snb_uncore_pci_ids, @@ -694,6 +778,11 @@ static struct pci_driver skl_uncore_pci_driver = { .id_table = skl_uncore_pci_ids, };
+static struct pci_driver icl_uncore_pci_driver = { + .name = "icl_uncore", + .id_table = icl_uncore_pci_ids, +}; + struct imc_uncore_pci_dev { __u32 pci_id; struct pci_driver *driver; @@ -733,6 +822,8 @@ static const struct imc_uncore_pci_dev desktop_imc_pci_ids[] = { IMC_DEV(CFL_4S_S_IMC, &skl_uncore_pci_driver), /* 8th Gen Core S 4 Cores Server */ IMC_DEV(CFL_6S_S_IMC, &skl_uncore_pci_driver), /* 8th Gen Core S 6 Cores Server */ IMC_DEV(CFL_8S_S_IMC, &skl_uncore_pci_driver), /* 8th Gen Core S 8 Cores Server */ + IMC_DEV(ICL_U_IMC, &icl_uncore_pci_driver), /* 10th Gen Core Mobile */ + IMC_DEV(ICL_U2_IMC, &icl_uncore_pci_driver), /* 10th Gen Core Mobile */ { /* end marker */ } };
From: Andi Kleen ak@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit ca138a7aabc68bf727918bb40ce08157cd5ec0a5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ca138a7aabc68bf727918bb40ce08157cd5ec0a5 upstream.
Icelake and later platforms support collecting XMM registers with PEBS event.
Add support for 'perf script' to dump them, and support for the register parser in 'perf record -I=' ... to configure them.
For now they are just printed in hex, we could potentially later add other formats too.
Committer testing:
Before:
# perf record -IXMM0 Warning: unknown register XMM0, check man page or run 'perf record -I?'
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
# # perf record -I? available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11 R12 R13 R14 R15
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>] #
After:
# perf record -IXMM0 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cycles). /bin/dmesg | grep -i perf may provide additional information.
# # perf record -I? available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11 R12 R13 R14 R15 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 XMM9 XMM10 XMM11 XMM12 XMM13 XMM14 XMM15
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use -I ? to list register names #
More work is needed to, when faced with such error, warn the user that that register is not available on the running platform.
Signed-off-by: Andi Kleen ak@linux.intel.com Tested-by: Arnaldo Carvalho de Melo acme@redhat.com Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Cc: Peter Zijlstra peterz@infradead.org Link: http://lkml.kernel.org/r/20190506141926.13659-1-kan.liang@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/arch/x86/include/perf_regs.h | 25 +++++++++++++++++++++++-- tools/perf/arch/x86/util/perf_regs.c | 16 ++++++++++++++++ tools/perf/util/perf_regs.h | 1 + 3 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h index 7f6d538f8a89..b7321337d100 100644 --- a/tools/perf/arch/x86/include/perf_regs.h +++ b/tools/perf/arch/x86/include/perf_regs.h @@ -8,9 +8,9 @@
void perf_regs_load(u64 *regs);
+#define PERF_REGS_MAX PERF_REG_X86_XMM_MAX #ifndef HAVE_ARCH_X86_64_SUPPORT #define PERF_REGS_MASK ((1ULL << PERF_REG_X86_32_MAX) - 1) -#define PERF_REGS_MAX PERF_REG_X86_32_MAX #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_32 #else #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \ @@ -18,7 +18,6 @@ void perf_regs_load(u64 *regs); (1ULL << PERF_REG_X86_FS) | \ (1ULL << PERF_REG_X86_GS)) #define PERF_REGS_MASK (((1ULL << PERF_REG_X86_64_MAX) - 1) & ~REG_NOSUPPORT) -#define PERF_REGS_MAX PERF_REG_X86_64_MAX #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_64 #endif #define PERF_REG_IP PERF_REG_X86_IP @@ -77,6 +76,28 @@ static inline const char *perf_reg_name(int id) case PERF_REG_X86_R15: return "R15"; #endif /* HAVE_ARCH_X86_64_SUPPORT */ + +#define XMM(x) \ + case PERF_REG_X86_XMM ## x: \ + case PERF_REG_X86_XMM ## x + 1: \ + return "XMM" #x; + XMM(0) + XMM(1) + XMM(2) + XMM(3) + XMM(4) + XMM(5) + XMM(6) + XMM(7) + XMM(8) + XMM(9) + XMM(10) + XMM(11) + XMM(12) + XMM(13) + XMM(14) + XMM(15) +#undef XMM default: return NULL; } diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c index fead6b3b4206..71d7604dbf0b 100644 --- a/tools/perf/arch/x86/util/perf_regs.c +++ b/tools/perf/arch/x86/util/perf_regs.c @@ -31,6 +31,22 @@ const struct sample_reg sample_reg_masks[] = { SMPL_REG(R14, PERF_REG_X86_R14), SMPL_REG(R15, PERF_REG_X86_R15), #endif + SMPL_REG2(XMM0, PERF_REG_X86_XMM0), + SMPL_REG2(XMM1, PERF_REG_X86_XMM1), + SMPL_REG2(XMM2, PERF_REG_X86_XMM2), + SMPL_REG2(XMM3, PERF_REG_X86_XMM3), + SMPL_REG2(XMM4, PERF_REG_X86_XMM4), + SMPL_REG2(XMM5, PERF_REG_X86_XMM5), + SMPL_REG2(XMM6, PERF_REG_X86_XMM6), + SMPL_REG2(XMM7, PERF_REG_X86_XMM7), + SMPL_REG2(XMM8, PERF_REG_X86_XMM8), + SMPL_REG2(XMM9, PERF_REG_X86_XMM9), + SMPL_REG2(XMM10, PERF_REG_X86_XMM10), + SMPL_REG2(XMM11, PERF_REG_X86_XMM11), + SMPL_REG2(XMM12, PERF_REG_X86_XMM12), + SMPL_REG2(XMM13, PERF_REG_X86_XMM13), + SMPL_REG2(XMM14, PERF_REG_X86_XMM14), + SMPL_REG2(XMM15, PERF_REG_X86_XMM15), SMPL_REG_END };
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h index f732e3af2bd4..9f9b25bf7106 100644 --- a/tools/perf/util/perf_regs.h +++ b/tools/perf/util/perf_regs.h @@ -12,6 +12,7 @@ struct sample_reg { uint64_t mask; }; #define SMPL_REG(n, b) { .name = #n, .mask = 1ULL << (b) } +#define SMPL_REG2(n, b) { .name = #n, .mask = 3ULL << (b) } #define SMPL_REG_END { .name = NULL }
extern const struct sample_reg sample_reg_masks[];
From: Arnaldo Carvalho de Melo acme@redhat.com
mainline inclusion from mainline-v5.2-rc1 commit 8e5bc76f2ce39ffd48350d6dd9318d78d07b0bfa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8e5bc76f2ce39ffd48350d6dd9318d78d07b0bfa upstream.
$ perf record -h -I
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use -I ? to list register names
$ m $ perf record -I ? Workload failed: No such file or directory $
After:
$ perf record -h -I
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use '-I?' to list register names
$ $ perf record -I? available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11 R12 R13 R14 R15
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use '-I?' to list register names $
Cc: Adrian Hunter adrian.hunter@intel.com Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Cc: Kan Liang kan.liang@linux.intel.com Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Fixes: bcc84ec65ad1 ("perf record: Add ability to name registers to record") Link: https://lkml.kernel.org/n/tip-r0xhfhy5radmkhhcbcfs5izf@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/builtin-record.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index 44e8fff2ea1c..ced7848c2cf0 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1637,10 +1637,10 @@ static struct option __record_options[] = { "use per-thread mmaps"), OPT_CALLBACK_OPTARG('I', "intr-regs", &record.opts.sample_intr_regs, NULL, "any register", "sample selected machine registers on interrupt," - " use -I ? to list register names", parse_regs), + " use '-I?' to list register names", parse_regs), OPT_CALLBACK_OPTARG(0, "user-regs", &record.opts.sample_user_regs, NULL, "any register", "sample selected machine registers on interrupt," - " use -I ? to list register names", parse_regs), + " use '-I?' to list register names", parse_regs), OPT_BOOLEAN(0, "running-time", &record.opts.running_time, "Record running/enabled time of read (:S) events"), OPT_CALLBACK('k', "clockid", &record.opts,
From: Arnaldo Carvalho de Melo acme@redhat.com
mainline inclusion from mainline-v5.2-rc1 commit 4c1cf20334ae6e390e1ea787ef76c40a161370d1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 4c1cf20334ae6e390e1ea787ef76c40a161370d1 upstream.
Add quotes around the register name and suggest using 'perf record -I?' to get the list of available registers.
Before:
# perf record -Idi,xmm20,xmm1 Warning: unknown register xmm20, check man page
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use -I ? to list register names # # perf record -Idi,xmm20,xmm1 Warning: unknown register "xmm20", check man page or run "perf record -I?"
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use -I ? to list register names #
Cc: Adrian Hunter adrian.hunter@intel.com Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Cc: Kan Liang kan.liang@linux.intel.com Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Link: https://lkml.kernel.org/n/tip-9a9hyuum8c0oggg86xd3sxc5@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/util/parse-regs-options.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c index e6599e290f46..9cb187a20fe2 100644 --- a/tools/perf/util/parse-regs-options.c +++ b/tools/perf/util/parse-regs-options.c @@ -48,8 +48,7 @@ parse_regs(const struct option *opt, const char *str, int unset) break; } if (!r->name) { - ui__warning("unknown register %s," - " check man page\n", s); + ui__warning("Unknown register "%s", check man page or run "perf record -I?"\n", s); goto error; }
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit aeea9062d949584ac1f2f9a20f0e5ed306539a3e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit aeea9062d949584ac1f2f9a20f0e5ed306539a3e upstream.
The available registers for --int-regs and --user-regs may be different, e.g. XMM registers.
Split parse_regs into two dedicated functions for --int-regs and --user-regs respectively.
Modify the warning message. "--user-regs=?" should be applied to show the available registers for --user-regs.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Tested-by: Ravi Bangoria ravi.bangoria@linux.ibm.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Link: http://lkml.kernel.org/r/1557865174-56264-1-git-send-email-kan.liang@linux.i... [ Changed docs as suggested by Ravi and agreed by Kan ] Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/Documentation/perf-record.txt | 3 ++- tools/perf/builtin-record.c | 4 ++-- tools/perf/util/parse-regs-options.c | 19 ++++++++++++++++--- tools/perf/util/parse-regs-options.h | 3 ++- 4 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index 246dee081efd..bdb3f1f19386 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -392,7 +392,8 @@ symbolic names, e.g. on x86, ax, si. To list the available registers use --intr-regs=ax,bx. The list of register is architecture dependent.
--user-regs:: -Capture user registers at sample time. Same arguments as -I. +Similar to -I, but capture user registers at sample time. To list the available +user registers use --user-regs=?.
--running-time:: Record running and enabled time for read events (:S) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index ced7848c2cf0..1fef7676417d 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1637,10 +1637,10 @@ static struct option __record_options[] = { "use per-thread mmaps"), OPT_CALLBACK_OPTARG('I', "intr-regs", &record.opts.sample_intr_regs, NULL, "any register", "sample selected machine registers on interrupt," - " use '-I?' to list register names", parse_regs), + " use '-I?' to list register names", parse_intr_regs), OPT_CALLBACK_OPTARG(0, "user-regs", &record.opts.sample_user_regs, NULL, "any register", "sample selected machine registers on interrupt," - " use '-I?' to list register names", parse_regs), + " use '--user-regs=?' to list register names", parse_user_regs), OPT_BOOLEAN(0, "running-time", &record.opts.running_time, "Record running/enabled time of read (:S) events"), OPT_CALLBACK('k', "clockid", &record.opts, diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c index 9cb187a20fe2..b21617f2bec1 100644 --- a/tools/perf/util/parse-regs-options.c +++ b/tools/perf/util/parse-regs-options.c @@ -5,8 +5,8 @@ #include <subcmd/parse-options.h> #include "util/parse-regs-options.h"
-int -parse_regs(const struct option *opt, const char *str, int unset) +static int +__parse_regs(const struct option *opt, const char *str, int unset, bool intr) { uint64_t *mode = (uint64_t *)opt->value; const struct sample_reg *r; @@ -48,7 +48,8 @@ parse_regs(const struct option *opt, const char *str, int unset) break; } if (!r->name) { - ui__warning("Unknown register "%s", check man page or run "perf record -I?"\n", s); + ui__warning("Unknown register "%s", check man page or run "perf record %s?"\n", + s, intr ? "-I" : "--user-regs="); goto error; }
@@ -69,3 +70,15 @@ parse_regs(const struct option *opt, const char *str, int unset) free(os); return ret; } + +int +parse_user_regs(const struct option *opt, const char *str, int unset) +{ + return __parse_regs(opt, str, unset, false); +} + +int +parse_intr_regs(const struct option *opt, const char *str, int unset) +{ + return __parse_regs(opt, str, unset, true); +} diff --git a/tools/perf/util/parse-regs-options.h b/tools/perf/util/parse-regs-options.h index cdefb1acf6be..2b23d25c6394 100644 --- a/tools/perf/util/parse-regs-options.h +++ b/tools/perf/util/parse-regs-options.h @@ -2,5 +2,6 @@ #ifndef _PERF_PARSE_REGS_OPTIONS_H #define _PERF_PARSE_REGS_OPTIONS_H 1 struct option; -int parse_regs(const struct option *opt, const char *str, int unset); +int parse_user_regs(const struct option *opt, const char *str, int unset); +int parse_intr_regs(const struct option *opt, const char *str, int unset); #endif /* _PERF_PARSE_REGS_OPTIONS_H */
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit af785e75bf616704cab031e66403b6adcf5b700a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit af785e75bf616704cab031e66403b6adcf5b700a upstream.
There may be different register mask for use with intr or user on some platforms, e.g. Icelake.
Add weak functions arch__intr_reg_mask() and arch__user_reg_mask() to return intr and user register mask respectively.
Check mask before printing or comparing the register name.
Generic code always return PERF_REGS_MASK. No functional change.
Suggested-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Tested-by: Ravi Bangoria ravi.bangoria@linux.ibm.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Link: http://lkml.kernel.org/r/1557865174-56264-2-git-send-email-kan.liang@linux.i... Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/util/parse-regs-options.c | 13 ++++++++++--- tools/perf/util/perf_regs.c | 10 ++++++++++ tools/perf/util/perf_regs.h | 2 ++ 3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c index b21617f2bec1..08581e276225 100644 --- a/tools/perf/util/parse-regs-options.c +++ b/tools/perf/util/parse-regs-options.c @@ -12,6 +12,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr) const struct sample_reg *r; char *s, *os = NULL, *p; int ret = -1; + uint64_t mask;
if (unset) return 0; @@ -22,6 +23,11 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr) if (*mode) return -1;
+ if (intr) + mask = arch__intr_reg_mask(); + else + mask = arch__user_reg_mask(); + /* str may be NULL in case no arg is passed to -I */ if (str) { /* because str is read-only */ @@ -37,14 +43,15 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr) if (!strcmp(s, "?")) { fprintf(stderr, "available registers: "); for (r = sample_reg_masks; r->name; r++) { - fprintf(stderr, "%s ", r->name); + if (r->mask & mask) + fprintf(stderr, "%s ", r->name); } fputc('\n', stderr); /* just printing available regs */ return -1; } for (r = sample_reg_masks; r->name; r++) { - if (!strcasecmp(s, r->name)) + if ((r->mask & mask) && !strcasecmp(s, r->name)) break; } if (!r->name) { @@ -65,7 +72,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
/* default to all possible regs */ if (*mode == 0) - *mode = PERF_REGS_MASK; + *mode = mask; error: free(os); return ret; diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c index 2acfcc527cac..2774cec1f15f 100644 --- a/tools/perf/util/perf_regs.c +++ b/tools/perf/util/perf_regs.c @@ -13,6 +13,16 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused, return SDT_ARG_SKIP; }
+uint64_t __weak arch__intr_reg_mask(void) +{ + return PERF_REGS_MASK; +} + +uint64_t __weak arch__user_reg_mask(void) +{ + return PERF_REGS_MASK; +} + #ifdef HAVE_PERF_REGS_SUPPORT int perf_reg_value(u64 *valp, struct regs_dump *regs, int id) { diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h index 9f9b25bf7106..2f41b10698ad 100644 --- a/tools/perf/util/perf_regs.h +++ b/tools/perf/util/perf_regs.h @@ -23,6 +23,8 @@ enum { };
int arch_sdt_arg_parse_op(char *old_op, char **new_op); +uint64_t arch__intr_reg_mask(void); +uint64_t arch__user_reg_mask(void);
#ifdef HAVE_PERF_REGS_SUPPORT #include <perf_regs.h>
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 6466ec14aaf44ff14a05369dcf0929d0f01171c6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6466ec14aaf44ff14a05369dcf0929d0f01171c6 upstream.
XMM registers can be collected on Icelake and later platforms.
Add specific arch__intr_reg_mask(), which creating an event to check if the kernel and hardware can collect XMM registers.
Test on Skylake which doesn't support XMM registers collection. There is nothing changed.
#perf record -I? available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11 R12 R13 R14 R15
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use '-I?' to list register names
#perf record -I [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.905 MB perf.data (2520 samples) ]
#perf evlist -v cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|REGS_INTR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1, sample_regs_intr: 0xff0fff
Test on Icelake which support XMM registers collection.
#perf record -I? available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11 R12 R13 R14 R15 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 XMM9 XMM10 XMM11 XMM12 XMM13 XMM14 XMM15
Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>]
-I, --intr-regs[=<any register>] sample selected machine registers on interrupt, use '-I?' to list register names
#perf record -I [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.800 MB perf.data (318 samples) ]
#perf evlist -v cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|REGS_INTR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1, sample_regs_intr: 0xffffffff00ff0fff
Committer notes:
Don't set attr.sample_period as a named struct init, as it is part of an unnamed union in 'struct perf_event_attr', and doing so breaks the build on older gcc versions, such as:
gcc version 4.1.2 20080704 (Red Hat 4.1.2-55) gcc version 4.4.7 20120313 (Red Hat 4.4.7-23) (GCC)
arch/x86/util/perf_regs.c: In function 'arch__intr_reg_mask': arch/x86/util/perf_regs.c:279: error: unknown field 'sample_period' specified in initializer cc1: warnings being treated as errors arch/x86/util/perf_regs.c:279: warning: missing braces around initializer arch/x86/util/perf_regs.c:279: warning: (near initialization for 'attr.<anonymous>')
Signed-off-by: Kan Liang kan.liang@linux.intel.com [ Only on a lenovo t480s, a skylake machine, where the XMM registers didn't show up in -I?/--user-regs=? as expected ] Tested-by: Arnaldo Carvalho de Melo acme@redhat.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Link: http://lkml.kernel.org/r/1557865174-56264-3-git-send-email-kan.liang@linux.i... Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/arch/x86/include/perf_regs.h | 1 + tools/perf/arch/x86/util/perf_regs.c | 28 +++++++++++++++++++++++++ 2 files changed, 29 insertions(+)
diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h index b7321337d100..b7cd91a9014f 100644 --- a/tools/perf/arch/x86/include/perf_regs.h +++ b/tools/perf/arch/x86/include/perf_regs.h @@ -9,6 +9,7 @@ void perf_regs_load(u64 *regs);
#define PERF_REGS_MAX PERF_REG_X86_XMM_MAX +#define PERF_XMM_REGS_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1)) #ifndef HAVE_ARCH_X86_64_SUPPORT #define PERF_REGS_MASK ((1ULL << PERF_REG_X86_32_MAX) - 1) #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_32 diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c index 71d7604dbf0b..7886ca5263e3 100644 --- a/tools/perf/arch/x86/util/perf_regs.c +++ b/tools/perf/arch/x86/util/perf_regs.c @@ -270,3 +270,31 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
return SDT_ARG_VALID; } + +uint64_t arch__intr_reg_mask(void) +{ + struct perf_event_attr attr = { + .type = PERF_TYPE_HARDWARE, + .config = PERF_COUNT_HW_CPU_CYCLES, + .sample_type = PERF_SAMPLE_REGS_INTR, + .sample_regs_intr = PERF_XMM_REGS_MASK, + .precise_ip = 1, + .disabled = 1, + .exclude_kernel = 1, + }; + int fd; + /* + * In an unnamed union, init it here to build on older gcc versions + */ + attr.sample_period = 1; + + event_attr_init(&attr); + + fd = sys_perf_event_open(&attr, 0, -1, -1, 0); + if (fd != -1) { + close(fd); + return (PERF_XMM_REGS_MASK | PERF_REGS_MASK); + } + + return PERF_REGS_MASK; +}
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.1 commit 26ae4f4406f88d82d79c85c11ac5fae18213cd38 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 26ae4f4406f88d82d79c85c11ac5fae18213cd38 upstream.
This recent commit:
5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")
overlooked the fact that the previous one page granularity of the AUX buffer provided an implicit double buffering capability to the PMU driver, which went away when the entire buffer became one high-order page.
Always make the full-trace mode AUX allocation at least two-part to preserve the previous behavior and allow the implicit double buffering to continue.
Reported-by: Ammy Yi ammy.yi@intel.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Acked-by: Peter Zijlstra a.p.zijlstra@chello.nl Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: adrian.hunter@intel.com Fixes: 5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") Link: http://lkml.kernel.org/r/20190503085536.24119-2-alexander.shishkin@linux.int... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/events/ring_buffer.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 12f351b253bb..34e1667e4254 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -630,8 +630,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event, * PMU requests more than one contiguous chunks of memory * for SW double buffering */ - if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) && - !overwrite) { + if (!overwrite) { if (!max_order) return -EINVAL;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.1 commit 72e830f68428ab9ea9eca65d160795f4e02cecfc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 72e830f68428ab9ea9eca65d160795f4e02cecfc upstream.
Now that all AUX allocations are high-order by default, the software double buffering PMU capability doesn't make sense any more, get rid of it. In case some PMUs choose to opt out, we can re-introduce it.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Acked-by: Peter Zijlstra a.p.zijlstra@chello.nl Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: adrian.hunter@intel.com Link: http://lkml.kernel.org/r/20190503085536.24119-3-alexander.shishkin@linux.int... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/pt.c | 3 +-- include/linux/perf_event.h | 1 - 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c index f03100bc5fd1..db969f3f175c 100644 --- a/arch/x86/events/intel/pt.c +++ b/arch/x86/events/intel/pt.c @@ -1515,8 +1515,7 @@ static __init int pt_init(void) }
if (!pt_cap_get(PT_CAP_topa_multiple_entries)) - pt_pmu.pmu.capabilities = - PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF; + pt_pmu.pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG;
pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE; pt_pmu.pmu.attr_groups = pt_attr_groups; diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index cf3da70056f7..7d47ec097c5d 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -240,7 +240,6 @@ struct perf_event; #define PERF_PMU_CAP_NO_INTERRUPT 0x01 #define PERF_PMU_CAP_NO_NMI 0x02 #define PERF_PMU_CAP_AUX_NO_SG 0x04 -#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x08 #define PERF_PMU_CAP_EXCLUSIVE 0x10 #define PERF_PMU_CAP_ITRACE 0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x40
From: Andrew Murray andrew.murray@arm.com
mainline inclusion from mainline-v5.1-rc1 commit 486efe9f8e30bac1e236f867df164f4966f3e207 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 486efe9f8e30bac1e236f867df164f4966f3e207 upstream.
Add a function that tests if any of the perf event exclusion flags are set on a given event.
Signed-off-by: Andrew Murray andrew.murray@arm.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Borislav Petkov bp@alien8.de Cc: Ivan Kokshaysky ink@jurassic.park.msu.ru Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mark Rutland mark.rutland@arm.com Cc: Matt Turner mattst88@gmail.com Cc: Michael Ellerman mpe@ellerman.id.au Cc: Paul Mackerras paulus@samba.org Cc: Peter Zijlstra peterz@infradead.org Cc: Richard Henderson rth@twiddle.net Cc: Russell King linux@armlinux.org.uk Cc: Sascha Hauer s.hauer@pengutronix.de Cc: Shawn Guo shawnguo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Will Deacon will.deacon@arm.com Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-3-git-send-email-andrew.murray@ar... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/perf_event.h | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 7d47ec097c5d..fbdde119b1cc 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1013,6 +1013,15 @@ perf_event__output_id_sample(struct perf_event *event, extern void perf_log_lost_samples(struct perf_event *event, u64 lost);
+static inline bool event_has_any_exclude_flag(struct perf_event *event) +{ + struct perf_event_attr *attr = &event->attr; + + return attr->exclude_idle || attr->exclude_user || + attr->exclude_kernel || attr->exclude_hv || + attr->exclude_guest || attr->exclude_host; +} + static inline bool is_sampling_event(struct perf_event *event) { return event->attr.sample_period != 0;
From: Andrew Murray andrew.murray@arm.com
mainline inclusion from mainline-v5.1-rc1 commit cc6795aeffea0a80d0baf9ad31ba926a6c42cef5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit cc6795aeffea0a80d0baf9ad31ba926a6c42cef5 upstream.
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the events attribute flags. This approach is error prone and often inconsistent.
Let's instead allow PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU.
Signed-off-by: Andrew Murray andrew.murray@arm.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Borislav Petkov bp@alien8.de Cc: Ivan Kokshaysky ink@jurassic.park.msu.ru Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mark Rutland mark.rutland@arm.com Cc: Matt Turner mattst88@gmail.com Cc: Michael Ellerman mpe@ellerman.id.au Cc: Paul Mackerras paulus@samba.org Cc: Peter Zijlstra peterz@infradead.org Cc: Richard Henderson rth@twiddle.net Cc: Russell King linux@armlinux.org.uk Cc: Sascha Hauer s.hauer@pengutronix.de Cc: Shawn Guo shawnguo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Will Deacon will.deacon@arm.com Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@ar... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/perf_event.h | 1 + kernel/events/core.c | 9 +++++++++ 2 files changed, 10 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index fbdde119b1cc..78c77a91c3b6 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -243,6 +243,7 @@ struct perf_event; #define PERF_PMU_CAP_EXCLUSIVE 0x10 #define PERF_PMU_CAP_ITRACE 0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x40 +#define PERF_PMU_CAP_NO_EXCLUDE 0x80
/** * struct pmu - generic performance monitoring unit diff --git a/kernel/events/core.c b/kernel/events/core.c index f6da9e11693d..728186f7b5df 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9880,6 +9880,15 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) if (ctx) perf_event_ctx_unlock(event->group_leader, ctx);
+ if (!ret) { + if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && + event_has_any_exclude_flag(event)) { + if (event->destroy) + event->destroy(event); + ret = -EINVAL; + } + } + if (ret) module_put(pmu->module);
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc7 commit e321d02db87af7840da29ef833a2a71fc0eab198 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e321d02db87af7840da29ef833a2a71fc0eab198 upstream.
The perf fuzzer caused Skylake machine to crash:
[ 9680.085831] Call Trace: [ 9680.088301] <IRQ> [ 9680.090363] perf_output_sample_regs+0x43/0xa0 [ 9680.094928] perf_output_sample+0x3aa/0x7a0 [ 9680.099181] perf_event_output_forward+0x53/0x80 [ 9680.103917] __perf_event_overflow+0x52/0xf0 [ 9680.108266] ? perf_trace_run_bpf_submit+0xc0/0xc0 [ 9680.113108] perf_swevent_hrtimer+0xe2/0x150 [ 9680.117475] ? check_preempt_wakeup+0x181/0x230 [ 9680.122091] ? check_preempt_curr+0x62/0x90 [ 9680.126361] ? ttwu_do_wakeup+0x19/0x140 [ 9680.130355] ? try_to_wake_up+0x54/0x460 [ 9680.134366] ? reweight_entity+0x15b/0x1a0 [ 9680.138559] ? __queue_work+0x103/0x3f0 [ 9680.142472] ? update_dl_rq_load_avg+0x1cd/0x270 [ 9680.147194] ? timerqueue_del+0x1e/0x40 [ 9680.151092] ? __remove_hrtimer+0x35/0x70 [ 9680.155191] __hrtimer_run_queues+0x100/0x280 [ 9680.159658] hrtimer_interrupt+0x100/0x220 [ 9680.163835] smp_apic_timer_interrupt+0x6a/0x140 [ 9680.168555] apic_timer_interrupt+0xf/0x20 [ 9680.172756] </IRQ>
The XMM registers can only be collected by PEBS hardware events on the platforms with PEBS baseline support, e.g. Icelake, not software/probe events.
Add capabilities flag PERF_PMU_CAP_EXTENDED_REGS to indicate the PMU which support extended registers. For X86, the extended registers are XMM registers.
Add has_extended_regs() to check if extended registers are applied.
The generic code define the mask of extended registers as 0 if arch headers haven't overridden it.
Originally-by: Peter Zijlstra (Intel) peterz@infradead.org Reported-by: Vince Weaver vincent.weaver@maine.edu Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Fixes: 878068ea270e ("perf/x86: Support outputting XMM registers") Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.i... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 5 +++++ arch/x86/events/intel/ds.c | 1 + arch/x86/events/perf_event.h | 1 + arch/x86/include/uapi/asm/perf_regs.h | 3 +++ include/linux/perf_event.h | 1 + include/linux/perf_regs.h | 8 ++++++++ kernel/events/core.c | 18 ++++++++++++++---- 7 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 206ce31116b2..082fe2b988a1 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -681,6 +681,11 @@ static inline int is_x86_event(struct perf_event *event) return event->pmu == &pmu; }
+struct pmu *x86_get_pmu(void) +{ + return &pmu; +} + /* * Event scheduler state: * diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index f3d6493a04a7..17fbbc490d10 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -2018,6 +2018,7 @@ void __init intel_ds_init(void) PERF_SAMPLE_TIME; x86_pmu.flags |= PMU_FL_PEBS_ALL; pebs_qual = "-baseline"; + x86_get_pmu()->capabilities |= PERF_PMU_CAP_EXTENDED_REGS; } else { /* Only basic record supported */ x86_pmu.pebs_no_xmm_regs = 1; diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 55b32febbe33..aab2fa34c683 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -773,6 +773,7 @@ static struct perf_pmu_events_ht_attr event_attr_##v = { \ .event_str_ht = ht, \ }
+struct pmu *x86_get_pmu(void); extern struct x86_pmu x86_pmu __read_mostly;
static inline bool x86_pmu_has_lbr_callstack(void) diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h index ac67bbea10ca..7c9d2bb3833b 100644 --- a/arch/x86/include/uapi/asm/perf_regs.h +++ b/arch/x86/include/uapi/asm/perf_regs.h @@ -52,4 +52,7 @@ enum perf_event_x86_regs { /* These include both GPRs and XMMX registers */ PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2, }; + +#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1)) + #endif /* _ASM_X86_PERF_REGS_H */ diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 78c77a91c3b6..f6174c8a8c8a 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -240,6 +240,7 @@ struct perf_event; #define PERF_PMU_CAP_NO_INTERRUPT 0x01 #define PERF_PMU_CAP_NO_NMI 0x02 #define PERF_PMU_CAP_AUX_NO_SG 0x04 +#define PERF_PMU_CAP_EXTENDED_REGS 0x08 #define PERF_PMU_CAP_EXCLUSIVE 0x10 #define PERF_PMU_CAP_ITRACE 0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x40 diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h index 476747456bca..2d12e97d5e7b 100644 --- a/include/linux/perf_regs.h +++ b/include/linux/perf_regs.h @@ -11,6 +11,11 @@ struct perf_regs {
#ifdef CONFIG_HAVE_PERF_REGS #include <asm/perf_regs.h> + +#ifndef PERF_REG_EXTENDED_MASK +#define PERF_REG_EXTENDED_MASK 0 +#endif + u64 perf_reg_value(struct pt_regs *regs, int idx); int perf_reg_validate(u64 mask); u64 perf_reg_abi(struct task_struct *task); @@ -18,6 +23,9 @@ void perf_get_regs_user(struct perf_regs *regs_user, struct pt_regs *regs, struct pt_regs *regs_user_copy); #else + +#define PERF_REG_EXTENDED_MASK 0 + static inline u64 perf_reg_value(struct pt_regs *regs, int idx) { return 0; diff --git a/kernel/events/core.c b/kernel/events/core.c index 728186f7b5df..d7aa8971b278 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9850,6 +9850,12 @@ void perf_pmu_unregister(struct pmu *pmu) } EXPORT_SYMBOL_GPL(perf_pmu_unregister);
+static inline bool has_extended_regs(struct perf_event *event) +{ + return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) || + (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK); +} + static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) { struct perf_event_context *ctx = NULL; @@ -9881,12 +9887,16 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) perf_event_ctx_unlock(event->group_leader, ctx);
if (!ret) { + if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) && + has_extended_regs(event)) + ret = -EOPNOTSUPP; + if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && - event_has_any_exclude_flag(event)) { - if (event->destroy) - event->destroy(event); + event_has_any_exclude_flag(event)) ret = -EINVAL; - } + + if (ret && event->destroy) + event->destroy(event); }
if (ret)
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc7 commit 90d424915ab6550826d297fd62df8ee255345b95 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 90d424915ab6550826d297fd62df8ee255345b95 upstream.
The perf fuzzer triggers a warning which map to:
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset))) return 0;
The bits between XMM registers and generic registers are reserved. But perf_reg_validate() doesn't check these bits.
Add PERF_REG_X86_RESERVED for reserved bits on X86. Check the reserved bits in perf_reg_validate().
Reported-by: Vince Weaver vincent.weaver@maine.edu Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Fixes: 878068ea270e ("perf/x86: Support outputting XMM registers") Link: https://lkml.kernel.org/r/1559081314-9714-2-git-send-email-kan.liang@linux.i... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/kernel/perf_regs.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c index 07c30ee17425..bb7e1132290b 100644 --- a/arch/x86/kernel/perf_regs.c +++ b/arch/x86/kernel/perf_regs.c @@ -74,6 +74,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx) return regs_get_register(regs, pt_regs_offset[idx]); }
+#define PERF_REG_X86_RESERVED (((1ULL << PERF_REG_X86_XMM0) - 1) & \ + ~((1ULL << PERF_REG_X86_MAX) - 1)) + #ifdef CONFIG_X86_32 #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \ (1ULL << PERF_REG_X86_R9) | \ @@ -86,7 +89,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
int perf_reg_validate(u64 mask) { - if (!mask || (mask & REG_NOSUPPORT)) + if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))) return -EINVAL;
return 0; @@ -112,7 +115,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
int perf_reg_validate(u64 mask) { - if (!mask || (mask & REG_NOSUPPORT)) + if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))) return -EINVAL;
return 0;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc7 commit dce86ac75d772047e9bc606154704aa73bfd4c83 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit dce86ac75d772047e9bc606154704aa73bfd4c83 upstream.
Use generic macro PERF_REG_EXTENDED_MASK to replace PEBS_XMM_REGS to avoid duplication.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lkml.kernel.org/r/1559081314-9714-3-git-send-email-kan.liang@linux.i... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 4 ++-- arch/x86/events/intel/ds.c | 2 +- arch/x86/events/perf_event.h | 18 ------------------ 3 files changed, 3 insertions(+), 21 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 082fe2b988a1..d940e7bc2d25 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -566,13 +566,13 @@ int x86_pmu_hw_config(struct perf_event *event) }
/* sample_regs_user never support XMM registers */ - if (unlikely(event->attr.sample_regs_user & PEBS_XMM_REGS)) + if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK)) return -EINVAL; /* * Besides the general purpose registers, XMM registers may * be collected in PEBS on some platforms, e.g. Icelake */ - if (unlikely(event->attr.sample_regs_intr & PEBS_XMM_REGS)) { + if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) { if (x86_pmu.pebs_no_xmm_regs) return -EINVAL;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 17fbbc490d10..767606851492 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -987,7 +987,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event) pebs_data_cfg |= PEBS_DATACFG_GP;
if ((sample_type & PERF_SAMPLE_REGS_INTR) && - (attr->sample_regs_intr & PEBS_XMM_REGS)) + (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)) pebs_data_cfg |= PEBS_DATACFG_XMMS;
if (sample_type & PERF_SAMPLE_BRANCH_STACK) { diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index aab2fa34c683..d639627c66de 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -123,24 +123,6 @@ struct amd_nb { (1ULL << PERF_REG_X86_R14) | \ (1ULL << PERF_REG_X86_R15))
-#define PEBS_XMM_REGS \ - ((1ULL << PERF_REG_X86_XMM0) | \ - (1ULL << PERF_REG_X86_XMM1) | \ - (1ULL << PERF_REG_X86_XMM2) | \ - (1ULL << PERF_REG_X86_XMM3) | \ - (1ULL << PERF_REG_X86_XMM4) | \ - (1ULL << PERF_REG_X86_XMM5) | \ - (1ULL << PERF_REG_X86_XMM6) | \ - (1ULL << PERF_REG_X86_XMM7) | \ - (1ULL << PERF_REG_X86_XMM8) | \ - (1ULL << PERF_REG_X86_XMM9) | \ - (1ULL << PERF_REG_X86_XMM10) | \ - (1ULL << PERF_REG_X86_XMM11) | \ - (1ULL << PERF_REG_X86_XMM12) | \ - (1ULL << PERF_REG_X86_XMM13) | \ - (1ULL << PERF_REG_X86_XMM14) | \ - (1ULL << PERF_REG_X86_XMM15)) - /* * Per register state. */
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc7 commit cd6b984f6d8cd615755b5404a51b7efe45215f28 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit cd6b984f6d8cd615755b5404a51b7efe45215f28 upstream.
We don't need pmu->pebs_no_xmm_regs anymore, the capabilities PERF_PMU_CAP_EXTENDED_REGS can be used to check if XMM registers collection is supported.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lkml.kernel.org/r/1559081314-9714-4-git-send-email-kan.liang@linux.i... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 2 +- arch/x86/events/intel/ds.c | 3 --- arch/x86/events/perf_event.h | 3 +-- 3 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index d940e7bc2d25..817b53d2ce2e 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -573,7 +573,7 @@ int x86_pmu_hw_config(struct perf_event *event) * be collected in PEBS on some platforms, e.g. Icelake */ if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) { - if (x86_pmu.pebs_no_xmm_regs) + if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS)) return -EINVAL;
if (!event->attr.precise_ip) diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 767606851492..5c57ea8dce6d 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -1964,8 +1964,6 @@ void __init intel_ds_init(void) x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS); x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS); x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE; - if (x86_pmu.version <= 4) - x86_pmu.pebs_no_xmm_regs = 1; if (x86_pmu.pebs) { char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-'; char *pebs_qual = ""; @@ -2021,7 +2019,6 @@ void __init intel_ds_init(void) x86_get_pmu()->capabilities |= PERF_PMU_CAP_EXTENDED_REGS; } else { /* Only basic record supported */ - x86_pmu.pebs_no_xmm_regs = 1; x86_pmu.large_pebs_flags &= ~(PERF_SAMPLE_ADDR | PERF_SAMPLE_TIME | diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index d639627c66de..bfd6f58d3a35 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -652,8 +652,7 @@ struct x86_pmu { pebs_active :1, pebs_broken :1, pebs_prec_dist :1, - pebs_no_tlb :1, - pebs_no_xmm_regs :1; + pebs_no_tlb :1; int pebs_record_size; int pebs_buffer_size; int max_pebs_events;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.2-rc7 commit 8b12b812f5367c2469fb937da7e28dd321ad8d7b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8b12b812f5367c2469fb937da7e28dd321ad8d7b upstream.
Use the macro defined in kernel ABI header to replace the local name.
No functional change.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: https://lkml.kernel.org/r/1559081314-9714-5-git-send-email-kan.liang@linux.i... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Shen, Xiaochen xiaochen.shen@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/arch/x86/include/uapi/asm/perf_regs.h | 3 +++ tools/perf/arch/x86/include/perf_regs.h | 1 - tools/perf/arch/x86/util/perf_regs.c | 4 ++-- 3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h index f3329cabce5c..62a6965637fd 100644 --- a/tools/arch/x86/include/uapi/asm/perf_regs.h +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h @@ -31,4 +31,7 @@ enum perf_event_x86_regs { PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1, PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1, }; + +#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1)) + #endif /* _ASM_X86_PERF_REGS_H */ diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h index b7cd91a9014f..b7321337d100 100644 --- a/tools/perf/arch/x86/include/perf_regs.h +++ b/tools/perf/arch/x86/include/perf_regs.h @@ -9,7 +9,6 @@ void perf_regs_load(u64 *regs);
#define PERF_REGS_MAX PERF_REG_X86_XMM_MAX -#define PERF_XMM_REGS_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1)) #ifndef HAVE_ARCH_X86_64_SUPPORT #define PERF_REGS_MASK ((1ULL << PERF_REG_X86_32_MAX) - 1) #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_32 diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c index 7886ca5263e3..3666c0076df9 100644 --- a/tools/perf/arch/x86/util/perf_regs.c +++ b/tools/perf/arch/x86/util/perf_regs.c @@ -277,7 +277,7 @@ uint64_t arch__intr_reg_mask(void) .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES, .sample_type = PERF_SAMPLE_REGS_INTR, - .sample_regs_intr = PERF_XMM_REGS_MASK, + .sample_regs_intr = PERF_REG_EXTENDED_MASK, .precise_ip = 1, .disabled = 1, .exclude_kernel = 1, @@ -293,7 +293,7 @@ uint64_t arch__intr_reg_mask(void) fd = sys_perf_event_open(&attr, 0, -1, -1, 0); if (fd != -1) { close(fd); - return (PERF_XMM_REGS_MASK | PERF_REGS_MASK); + return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK); }
return PERF_REGS_MASK;
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 3a1c779fb8f71e772e2145e68c262936ada815ed category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3a1c779fb8f71e772e2145e68c262936ada815ed upstream.
Syntax only, no functional or semantic change.
Signed-off-by: Len Brown len.brown@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Paul E. McKenney paulmck@linux.vnet.ibm.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Will Deacon will.deacon@arm.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1ca56f8ea922a67f0017bd645912ea02a65a85ec.1551160674... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/cputopology.txt | 46 +++++++++++++++++------------------ 1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index c6e7e9196a8b..cb61277e2308 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -3,79 +3,79 @@ How CPU topology info is exported via sysfs ===========================================
Export CPU topology info via sysfs. Items (attributes) are similar -to /proc/cpuinfo output of some architectures: +to /proc/cpuinfo output of some architectures. They reside in +/sys/devices/system/cpu/cpuX/topology/:
-1) /sys/devices/system/cpu/cpuX/topology/physical_package_id: +physical_package_id:
physical package id of cpuX. Typically corresponds to a physical socket number, but the actual value is architecture and platform dependent.
-2) /sys/devices/system/cpu/cpuX/topology/core_id: +core_id:
the CPU core ID of cpuX. Typically it is the hardware platform's identifier (rather than the kernel's). The actual value is architecture and platform dependent.
-3) /sys/devices/system/cpu/cpuX/topology/book_id: +book_id:
the book ID of cpuX. Typically it is the hardware platform's identifier (rather than the kernel's). The actual value is architecture and platform dependent.
-4) /sys/devices/system/cpu/cpuX/topology/drawer_id: +drawer_id:
the drawer ID of cpuX. Typically it is the hardware platform's identifier (rather than the kernel's). The actual value is architecture and platform dependent.
-5) /sys/devices/system/cpu/cpuX/topology/thread_siblings: +thread_siblings:
internal kernel map of cpuX's hardware threads within the same core as cpuX.
-6) /sys/devices/system/cpu/cpuX/topology/thread_siblings_list: +thread_siblings_list:
human-readable list of cpuX's hardware threads within the same core as cpuX.
-7) /sys/devices/system/cpu/cpuX/topology/core_siblings: +core_siblings:
internal kernel map of cpuX's hardware threads within the same physical_package_id.
-8) /sys/devices/system/cpu/cpuX/topology/core_siblings_list: +core_siblings_list:
human-readable list of cpuX's hardware threads within the same physical_package_id.
-9) /sys/devices/system/cpu/cpuX/topology/book_siblings: +book_siblings:
internal kernel map of cpuX's hardware threads within the same book_id.
-10) /sys/devices/system/cpu/cpuX/topology/book_siblings_list: +book_siblings_list:
human-readable list of cpuX's hardware threads within the same book_id.
-11) /sys/devices/system/cpu/cpuX/topology/drawer_siblings: +drawer_siblings:
internal kernel map of cpuX's hardware threads within the same drawer_id.
-12) /sys/devices/system/cpu/cpuX/topology/drawer_siblings_list: +drawer_siblings_list:
human-readable list of cpuX's hardware threads within the same drawer_id.
-To implement it in an architecture-neutral way, a new source file, -drivers/base/topology.c, is to export the 6 to 12 attributes. The book -and drawer related sysfs files will only be created if CONFIG_SCHED_BOOK -and CONFIG_SCHED_DRAWER are selected. +Architecture-neutral, drivers/base/topology.c, exports these attributes. +However, the book and drawer related sysfs files will only be created if +CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
-CONFIG_SCHED_BOOK and CONFIG_DRAWER are currently only used on s390, where -they reflect the cpu and cache hierarchy. +CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390, +where they reflect the cpu and cache hierarchy.
For an architecture to support this feature, it must define some of these macros in include/asm-XXX/topology.h:: @@ -98,10 +98,10 @@ To be consistent on all architectures, include/linux/topology.h provides default definitions for any of the above macros that are not defined by include/asm-XXX/topology.h:
-1) physical_package_id: -1 -2) core_id: 0 -3) sibling_cpumask: just the given CPU -4) core_cpumask: just the given CPU +1) topology_physical_package_id: -1 +2) topology_core_id: 0 +3) topology_sibling_cpumask: just the given CPU +4) topology_core_cpumask: just the given CPU
For architectures that don't support books (CONFIG_SCHED_BOOK) there are no default definitions for topology_book_id() and topology_book_cpumask().
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 7745f03eb39587dd15a1fb26e6223678b8e906d2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7745f03eb39587dd15a1fb26e6223678b8e906d2 upstream.
Some new systems have multiple software-visible die within each package.
Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle either CPUID.B, or the new CPUID.1F.
Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result.
die_id will be non-zero only for multi-die/package systems.
Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: linux-doc@vger.kernel.org Link: https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/x86/topology.txt | 4 ++ arch/x86/include/asm/processor.h | 4 +- arch/x86/kernel/cpu/topology.c | 85 +++++++++++++++++++++++++------- arch/x86/kernel/smpboot.c | 2 + 4 files changed, 75 insertions(+), 20 deletions(-)
diff --git a/Documentation/x86/topology.txt b/Documentation/x86/topology.txt index 2953e3ec9a02..e49487f51a25 100644 --- a/Documentation/x86/topology.txt +++ b/Documentation/x86/topology.txt @@ -46,6 +46,10 @@ The topology of a system is described in the units of:
The number of cores in a package. This information is retrieved via CPUID.
+ - cpuinfo_x86.x86_max_dies: + + The number of dies in a package. This information is retrieved via CPUID. + - cpuinfo_x86.phys_proc_id:
The physical ID of the package. This information is retrieved via CPUID diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index d0ccdd0ee457..37610d873b09 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -117,7 +117,8 @@ struct cpuinfo_x86 { int x86_power; unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ - u16 x86_max_cores; + u16 x86_max_cores; + u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; @@ -129,6 +130,7 @@ struct cpuinfo_x86 { u16 logical_proc_id; /* Core id: */ u16 cpu_core_id; + u16 cpu_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 71ca064e3794..3f7141dd4ce8 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -13,33 +13,63 @@ /* leaf 0xb SMT level */ #define SMT_LEVEL 0
-/* leaf 0xb sub-leaf types */ +/* extended topology sub-leaf types */ #define INVALID_TYPE 0 #define SMT_TYPE 1 #define CORE_TYPE 2 +#define DIE_TYPE 5
#define LEAFB_SUBTYPE(ecx) (((ecx) >> 8) & 0xff) #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f) #define LEVEL_MAX_SIBLINGS(ebx) ((ebx) & 0xffff)
-int detect_extended_topology_early(struct cpuinfo_x86 *c) -{ #ifdef CONFIG_SMP +/* + * Check if given CPUID extended toplogy "leaf" is implemented + */ +static int check_extended_topology_leaf(int leaf) +{ unsigned int eax, ebx, ecx, edx;
- if (c->cpuid_level < 0xb) + cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + + if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE)) return -1;
- cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + return 0; +} +/* + * Return best CPUID Extended Toplogy Leaf supported + */ +static int detect_extended_topology_leaf(struct cpuinfo_x86 *c) +{ + if (c->cpuid_level >= 0x1f) { + if (check_extended_topology_leaf(0x1f) == 0) + return 0x1f; + }
- /* - * check if the cpuid leaf 0xb is actually implemented. - */ - if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE)) + if (c->cpuid_level >= 0xb) { + if (check_extended_topology_leaf(0xb) == 0) + return 0xb; + } + + return -1; +} +#endif + +int detect_extended_topology_early(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int eax, ebx, ecx, edx; + int leaf; + + leaf = detect_extended_topology_leaf(c); + if (leaf < 0) return -1;
set_cpu_cap(c, X86_FEATURE_XTOPOLOGY);
+ cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); /* * initial apic id, which also represents 32-bit extended x2apic id. */ @@ -50,7 +80,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c) }
/* - * Check for extended topology enumeration cpuid leaf 0xb and if it + * Check for extended topology enumeration cpuid leaf, and if it * exists, use it for populating initial_apicid and cpu topology * detection. */ @@ -58,22 +88,28 @@ int detect_extended_topology(struct cpuinfo_x86 *c) { #ifdef CONFIG_SMP unsigned int eax, ebx, ecx, edx, sub_index; - unsigned int ht_mask_width, core_plus_mask_width; + unsigned int ht_mask_width, core_plus_mask_width, die_plus_mask_width; unsigned int core_select_mask, core_level_siblings; + unsigned int die_select_mask, die_level_siblings; + int leaf;
- if (detect_extended_topology_early(c) < 0) + leaf = detect_extended_topology_leaf(c); + if (leaf < 0) return -1;
/* * Populate HT related information from sub-leaf level 0. */ - cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx); + c->initial_apicid = edx; core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx); core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax); + die_level_siblings = LEVEL_MAX_SIBLINGS(ebx); + die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
sub_index = 1; do { - cpuid_count(0xb, sub_index, &eax, &ebx, &ecx, &edx); + cpuid_count(leaf, sub_index, &eax, &ebx, &ecx, &edx);
/* * Check for the Core type in the implemented sub leaves. @@ -81,23 +117,34 @@ int detect_extended_topology(struct cpuinfo_x86 *c) if (LEAFB_SUBTYPE(ecx) == CORE_TYPE) { core_level_siblings = LEVEL_MAX_SIBLINGS(ebx); core_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax); - break; + die_level_siblings = core_level_siblings; + die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax); + } + if (LEAFB_SUBTYPE(ecx) == DIE_TYPE) { + die_level_siblings = LEVEL_MAX_SIBLINGS(ebx); + die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax); }
sub_index++; } while (LEAFB_SUBTYPE(ecx) != INVALID_TYPE);
core_select_mask = (~(-1 << core_plus_mask_width)) >> ht_mask_width; - - c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, ht_mask_width) - & core_select_mask; - c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, core_plus_mask_width); + die_select_mask = (~(-1 << die_plus_mask_width)) >> + core_plus_mask_width; + + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, + ht_mask_width) & core_select_mask; + c->cpu_die_id = apic->phys_pkg_id(c->initial_apicid, + core_plus_mask_width) & die_select_mask; + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, + die_plus_mask_width); /* * Reinit the apicid, now that we have extended initial_apicid. */ c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
c->x86_max_cores = (core_level_siblings / smp_num_siblings); + c->x86_max_dies = (die_level_siblings / core_level_siblings); #endif return 0; } diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index f51307d8ee7c..aedd9cafc441 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -393,6 +393,7 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
if (c->phys_proc_id == o->phys_proc_id && + c->cpu_die_id == o->cpu_die_id && per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) { if (c->cpu_core_id == o->cpu_core_id) return topology_sane(c, o, "smt"); @@ -404,6 +405,7 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) }
} else if (c->phys_proc_id == o->phys_proc_id && + c->cpu_die_id == o->cpu_die_id && c->cpu_core_id == o->cpu_core_id) { return topology_sane(c, o, "smt"); }
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 14d96d6c06b5d8116b8d52c9c5530f5528ef1e61 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 14d96d6c06b5d8116b8d52c9c5530f5528ef1e61 upstream.
topology_max_packages() is available to size resources to cover all packages in the system.
But now multi-die/package systems are coming up, and some resources are per-die.
Create topology_max_die_per_package(), for detecting multi-die/package systems, and sizing any per-die resources.
Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/e6eaf384571ae52ac7d0ca41510b7fb7d2fda0e4.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/processor.h | 1 - arch/x86/include/asm/topology.h | 10 ++++++++++ arch/x86/kernel/cpu/topology.c | 5 ++++- 3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 37610d873b09..e8fe2828a0d7 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -118,7 +118,6 @@ struct cpuinfo_x86 { unsigned long loops_per_jiffy; /* cpuid returned max cores value: */ u16 x86_max_cores; - u16 x86_max_dies; u16 apicid; u16 initial_apicid; u16 x86_clflush_size; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38a1c33..e0232f7042c3 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); extern unsigned int __max_logical_packages; #define topology_max_packages() (__max_logical_packages)
+extern unsigned int __max_die_per_package; + +static inline int topology_max_die_per_package(void) +{ + return __max_die_per_package; +} + extern int __max_smt_threads;
static inline int topology_max_smt_threads(void) @@ -131,6 +138,9 @@ bool topology_smt_supported(void); static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } +static inline int topology_phys_to_logical_die(unsigned int die, + unsigned int cpu) { return 0; } +static inline int topology_max_die_per_package(void) { return 1; } static inline int topology_max_smt_threads(void) { return 1; } static inline bool topology_is_primary_thread(unsigned int cpu) { return true; } static inline bool topology_smt_supported(void) { return false; } diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 3f7141dd4ce8..873125122218 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -24,6 +24,9 @@ #define LEVEL_MAX_SIBLINGS(ebx) ((ebx) & 0xffff)
#ifdef CONFIG_SMP +unsigned int __max_die_per_package __read_mostly = 1; +EXPORT_SYMBOL(__max_die_per_package); + /* * Check if given CPUID extended toplogy "leaf" is implemented */ @@ -144,7 +147,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c) c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
c->x86_max_cores = (core_level_siblings / smp_num_siblings); - c->x86_max_dies = (die_level_siblings / core_level_siblings); + __max_die_per_package = (die_level_siblings / core_level_siblings); #endif return 0; }
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 0e344d8c709fe01d882fc0fb5452bedfe5eba67a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 0e344d8c709fe01d882fc0fb5452bedfe5eba67a upstream.
Export die_id in cpu topology, for the benefit of hardware that has multiple-die/package.
Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: linux-doc@vger.kernel.org Link: https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/cputopology.txt | 15 ++++++++++++--- drivers/base/topology.c | 4 ++++ include/linux/topology.h | 3 +++ 3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index cb61277e2308..2ff8a1e9a2db 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -12,6 +12,12 @@ physical_package_id: socket number, but the actual value is architecture and platform dependent.
+die_id: + + the CPU die ID of cpuX. Typically it is the hardware platform's + identifier (rather than the kernel's). The actual value is + architecture and platform dependent. + core_id:
the CPU core ID of cpuX. Typically it is the hardware platform's @@ -81,6 +87,7 @@ For an architecture to support this feature, it must define some of these macros in include/asm-XXX/topology.h::
#define topology_physical_package_id(cpu) + #define topology_die_id(cpu) #define topology_core_id(cpu) #define topology_book_id(cpu) #define topology_drawer_id(cpu) @@ -99,9 +106,11 @@ provides default definitions for any of the above macros that are not defined by include/asm-XXX/topology.h:
1) topology_physical_package_id: -1 -2) topology_core_id: 0 -3) topology_sibling_cpumask: just the given CPU -4) topology_core_cpumask: just the given CPU +2) topology_die_id: -1 +3) topology_core_id: 0 +4) topology_sibling_cpumask: just the given CPU +5) topology_core_cpumask: just the given CPU +6) topology_die_cpumask: just the given CPU
For architectures that don't support books (CONFIG_SCHED_BOOK) there are no default definitions for topology_book_id() and topology_book_cpumask(). diff --git a/drivers/base/topology.c b/drivers/base/topology.c index 5fd9f167ecc1..50352cf96f85 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -43,6 +43,9 @@ static ssize_t name##_list_show(struct device *dev, \ define_id_show_func(physical_package_id); static DEVICE_ATTR_RO(physical_package_id);
+define_id_show_func(die_id); +static DEVICE_ATTR_RO(die_id); + define_id_show_func(core_id); static DEVICE_ATTR_RO(core_id);
@@ -72,6 +75,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list);
static struct attribute *default_attrs[] = { &dev_attr_physical_package_id.attr, + &dev_attr_die_id.attr, &dev_attr_core_id.attr, &dev_attr_thread_siblings.attr, &dev_attr_thread_siblings_list.attr, diff --git a/include/linux/topology.h b/include/linux/topology.h index cb0775e1ee4b..5cc8595dd0e4 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -184,6 +184,9 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif +#ifndef topology_die_id +#define topology_die_id(cpu) ((void)(cpu), -1) +#endif #ifndef topology_core_id #define topology_core_id(cpu) ((void)(cpu), 0) #endif
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 306a0de329f77537f29022c2982006f9145d782d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 306a0de329f77537f29022c2982006f9145d782d upstream.
topology_die_id(cpu) is a simple macro for use inside the kernel to get the die_id associated with the given cpu.
Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/6463bc422b1b05445a502dc505c1d7c6756bda6a.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/topology.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index e0232f7042c3..3777dbe9c0ff 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
#define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id) #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id) +#define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id) #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id)
#ifdef CONFIG_SMP
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 212bf4fdb7f9eeeb99afd97ebad677d43e7b55ac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 212bf4fdb7f9eeeb99afd97ebad677d43e7b55ac upstream.
Define topology_logical_die_id() ala existing topology_logical_package_id()
Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Zhang Rui rui.zhang@intel.com Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/2f3526e25ae14fbeff26fb26e877d159df8946d9.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/processor.h | 1 + arch/x86/include/asm/topology.h | 5 ++++ arch/x86/kernel/cpu/common.c | 1 + arch/x86/kernel/smpboot.c | 45 ++++++++++++++++++++++++++++++++ 4 files changed, 52 insertions(+)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index e8fe2828a0d7..6f422c01c0aa 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -130,6 +130,7 @@ struct cpuinfo_x86 { /* Core id: */ u16 cpu_core_id; u16 cpu_die_id; + u16 logical_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 3777dbe9c0ff..9de16b4f6023 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
#define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id) #define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id) +#define topology_logical_die_id(cpu) (cpu_data(cpu).logical_die_id) #define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id) #define topology_core_id(cpu) (cpu_data(cpu).cpu_core_id)
@@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void) }
int topology_update_package_map(unsigned int apicid, unsigned int cpu); +int topology_update_die_map(unsigned int dieid, unsigned int cpu); int topology_phys_to_logical_pkg(unsigned int pkg); +int topology_phys_to_logical_die(unsigned int die, unsigned int cpu); bool topology_is_primary_thread(unsigned int cpu); bool topology_smt_supported(void); #else #define topology_max_packages() (1) static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } +static inline int +topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; } static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } static inline int topology_phys_to_logical_die(unsigned int die, unsigned int cpu) { return 0; } diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 103f06b68f72..988d9c6a5c89 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1391,6 +1391,7 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c) cpu, apicid, c->initial_apicid); } BUG_ON(topology_update_package_map(c->phys_proc_id, cpu)); + BUG_ON(topology_update_die_map(c->cpu_die_id, cpu)); #else c->logical_proc_id = 0; #endif diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index aedd9cafc441..08b5f534fabb 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -100,6 +100,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info); unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); static unsigned int logical_packages __read_mostly; +static unsigned int logical_die __read_mostly;
/* Maximum number of SMT threads on any online core */ int __read_mostly __max_smt_threads = 1; @@ -306,6 +307,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg) return -1; } EXPORT_SYMBOL(topology_phys_to_logical_pkg); +/** + * topology_phys_to_logical_die - Map a physical die id to logical + * + * Returns logical die id or -1 if not found + */ +int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu) +{ + int cpu; + int proc_id = cpu_data(cur_cpu).phys_proc_id; + + for_each_possible_cpu(cpu) { + struct cpuinfo_x86 *c = &cpu_data(cpu); + + if (c->initialized && c->cpu_die_id == die_id && + c->phys_proc_id == proc_id) + return c->logical_die_id; + } + return -1; +} +EXPORT_SYMBOL(topology_phys_to_logical_die);
/** * topology_update_package_map - Update the physical to logical package map @@ -330,6 +351,29 @@ int topology_update_package_map(unsigned int pkg, unsigned int cpu) cpu_data(cpu).logical_proc_id = new; return 0; } +/** + * topology_update_die_map - Update the physical to logical die map + * @die: The die id as retrieved via CPUID + * @cpu: The cpu for which this is updated + */ +int topology_update_die_map(unsigned int die, unsigned int cpu) +{ + int new; + + /* Already available somewhere? */ + new = topology_phys_to_logical_die(die, cpu); + if (new >= 0) + goto found; + + new = logical_die++; + if (new != die) { + pr_info("CPU %u Converting physical %u to logical die %u\n", + cpu, die, new); + } +found: + cpu_data(cpu).logical_die_id = new; + return 0; +}
void __init smp_store_boot_cpu_info(void) { @@ -339,6 +383,7 @@ void __init smp_store_boot_cpu_info(void) *c = boot_cpu_data; c->cpu_index = id; topology_update_package_map(c->phys_proc_id, id); + topology_update_die_map(c->cpu_die_id, id); c->initialized = true; }
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit aadf7b38337108627b014fb285147aacdfafe42e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit aadf7b38337108627b014fb285147aacdfafe42e upstream.
Simplify how the code to discover a package is called. Rename find_package_by_id() to rapl_find_package_domain()
Syntax only, no functional or semantic change.
Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: linux-pm@vger.kernel.org Link: https://lkml.kernel.org/r/ae3d1903407fd6e3684234b674f4f0e62c2ab54c.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 8cbfcce57a06..4d753445e28b 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -264,8 +264,9 @@ static struct powercap_control_type *control_type; /* PowerCap Controller */ static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */
/* caller to ensure CPU hotplug lock is held */ -static struct rapl_package *find_package_by_id(int id) +static struct rapl_package *rapl_find_package_domain(int cpu) { + int id = topology_physical_package_id(cpu); struct rapl_package *rp;
list_for_each_entry(rp, &rapl_packages, plist) { @@ -1305,7 +1306,7 @@ static int __init rapl_register_psys(void) rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; rd->rpl[1].name = pl2_name; - rd->rp = find_package_by_id(0); + rd->rp = rapl_find_package_domain(0);
power_zone = powercap_register_zone(&rd->power_zone, control_type, "psys", NULL, @@ -1461,8 +1462,9 @@ static void rapl_remove_package(struct rapl_package *rp) }
/* called from CPU hotplug notifier, hotplug lock held */ -static struct rapl_package *rapl_add_package(int cpu, int pkgid) +static struct rapl_package *rapl_add_package(int cpu) { + int id = topology_physical_package_id(cpu); struct rapl_package *rp; int ret;
@@ -1471,7 +1473,7 @@ static struct rapl_package *rapl_add_package(int cpu, int pkgid) return ERR_PTR(-ENOMEM);
/* add the new package to the list */ - rp->id = pkgid; + rp->id = id; rp->lead_cpu = cpu;
/* check if the package contains valid domains */ @@ -1502,12 +1504,11 @@ static struct rapl_package *rapl_add_package(int cpu, int pkgid) */ static int rapl_cpu_online(unsigned int cpu) { - int pkgid = topology_physical_package_id(cpu); struct rapl_package *rp;
- rp = find_package_by_id(pkgid); + rp = rapl_find_package_domain(cpu); if (!rp) { - rp = rapl_add_package(cpu, pkgid); + rp = rapl_add_package(cpu); if (IS_ERR(rp)) return PTR_ERR(rp); } @@ -1517,11 +1518,10 @@ static int rapl_cpu_online(unsigned int cpu)
static int rapl_cpu_down_prep(unsigned int cpu) { - int pkgid = topology_physical_package_id(cpu); struct rapl_package *rp; int lead_cpu;
- rp = find_package_by_id(pkgid); + rp = rapl_find_package_domain(cpu); if (!rp) return 0;
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 32fb480e0a2cf1f71e4174d6477198c94dbc746c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 32fb480e0a2cf1f71e4174d6477198c94dbc746c upstream.
RAPL "package" domains are actually implemented in hardware per-die. Thus, the new multi-die/package systems have mulitple domains within each physical package.
Update the intel_rapl driver to be "die aware" -- exporting multiple domains within a single package, when present. No change on single die/package systems.
Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: linux-pm@vger.kernel.org Link: https://lkml.kernel.org/r/9fcb4719aeb7efccf3bc75ed8dd559e46121649f.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 4d753445e28b..55f75025d247 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -266,7 +266,7 @@ static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */ /* caller to ensure CPU hotplug lock is held */ static struct rapl_package *rapl_find_package_domain(int cpu) { - int id = topology_physical_package_id(cpu); + int id = topology_logical_die_id(cpu); struct rapl_package *rp;
list_for_each_entry(rp, &rapl_packages, plist) { @@ -1464,7 +1464,7 @@ static void rapl_remove_package(struct rapl_package *rp) /* called from CPU hotplug notifier, hotplug lock held */ static struct rapl_package *rapl_add_package(int cpu) { - int id = topology_physical_package_id(cpu); + int id = topology_logical_die_id(cpu); struct rapl_package *rp; int ret;
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 9ea7612c46586d9eacfd517e73ff76ef294feca0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9ea7612c46586d9eacfd517e73ff76ef294feca0 upstream.
The RAPL domain "name" attribute contains "Package-N", which is ambiguous on multi-die per-package systems.
Update the name to "package-X-die-Y" on those systems.
No change on systems without multi-die/package.
Update driver debug messages.
Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: linux-pm@vger.kernel.org Link: https://lkml.kernel.org/r/6510b784e16374447965925588ec6e46d5d007d8.155776931... Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 57 ++++++++++++++++++++--------------- 1 file changed, 32 insertions(+), 25 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 55f75025d247..0c861567d37b 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -178,12 +178,15 @@ struct rapl_domain { #define power_zone_to_rapl_domain(_zone) \ container_of(_zone, struct rapl_domain, power_zone)
+/* maximum rapl package domain name: package-%d-die-%d */ +#define PACKAGE_DOMAIN_NAME_LENGTH 30
-/* Each physical package contains multiple domains, these are the common + +/* Each rapl package contains multiple domains, these are the common * data across RAPL domains within a package. */ struct rapl_package { - unsigned int id; /* physical package/socket id */ + unsigned int id; /* logical die id, equals physical 1-die systems */ unsigned int nr_domains; unsigned long domain_map; /* bit map of active domains */ unsigned int power_unit; @@ -198,6 +201,7 @@ struct rapl_package { int lead_cpu; /* one active cpu per package for access */ /* Track active cpus */ struct cpumask cpumask; + char name[PACKAGE_DOMAIN_NAME_LENGTH]; };
struct rapl_defaults { @@ -926,8 +930,8 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; rp->time_unit = 1000000 / (1 << value);
- pr_debug("Core CPU package %d energy=%dpJ, time=%dus, power=%duW\n", - rp->id, rp->energy_unit, rp->time_unit, rp->power_unit); + pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n", + rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
return 0; } @@ -951,8 +955,8 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; rp->time_unit = 1000000 / (1 << value);
- pr_debug("Atom package %d energy=%dpJ, time=%dus, power=%duW\n", - rp->id, rp->energy_unit, rp->time_unit, rp->power_unit); + pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n", + rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
return 0; } @@ -1186,7 +1190,7 @@ static void rapl_update_domain_data(struct rapl_package *rp) u64 val;
for (dmn = 0; dmn < rp->nr_domains; dmn++) { - pr_debug("update package %d domain %s data\n", rp->id, + pr_debug("update %s domain %s data\n", rp->name, rp->domains[dmn].name); /* exclude non-raw primitives */ for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) { @@ -1211,7 +1215,6 @@ static void rapl_unregister_powercap(void) static int rapl_package_register_powercap(struct rapl_package *rp) { struct rapl_domain *rd; - char dev_name[17]; /* max domain name = 7 + 1 + 8 for int + 1 for null*/ struct powercap_zone *power_zone = NULL; int nr_pl, ret;
@@ -1222,20 +1225,16 @@ static int rapl_package_register_powercap(struct rapl_package *rp) for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) { if (rd->id == RAPL_DOMAIN_PACKAGE) { nr_pl = find_nr_power_limit(rd); - pr_debug("register socket %d package domain %s\n", - rp->id, rd->name); - memset(dev_name, 0, sizeof(dev_name)); - snprintf(dev_name, sizeof(dev_name), "%s-%d", - rd->name, rp->id); + pr_debug("register package domain %s\n", rp->name); power_zone = powercap_register_zone(&rd->power_zone, control_type, - dev_name, NULL, + rp->name, NULL, &zone_ops[rd->id], nr_pl, &constraint_ops); if (IS_ERR(power_zone)) { - pr_debug("failed to register package, %d\n", - rp->id); + pr_debug("failed to register power zone %s\n", + rp->name); return PTR_ERR(power_zone); } /* track parent zone in per package/socket data */ @@ -1261,8 +1260,8 @@ static int rapl_package_register_powercap(struct rapl_package *rp) &constraint_ops);
if (IS_ERR(power_zone)) { - pr_debug("failed to register power_zone, %d:%s:%s\n", - rp->id, rd->name, dev_name); + pr_debug("failed to register power_zone, %s:%s\n", + rp->name, rd->name); ret = PTR_ERR(power_zone); goto err_cleanup; } @@ -1275,7 +1274,7 @@ static int rapl_package_register_powercap(struct rapl_package *rp) * failed after the first domain setup. */ while (--rd >= rp->domains) { - pr_debug("unregister package %d domain %s\n", rp->id, rd->name); + pr_debug("unregister %s domain %s\n", rp->name, rd->name); powercap_unregister_zone(control_type, &rd->power_zone); }
@@ -1385,8 +1384,8 @@ static void rapl_detect_powerlimit(struct rapl_domain *rd) /* check if the domain is locked by BIOS, ignore if MSR doesn't exist */ if (!rapl_read_data_raw(rd, FW_LOCK, false, &val64)) { if (val64) { - pr_info("RAPL package %d domain %s locked by BIOS\n", - rd->rp->id, rd->name); + pr_info("RAPL %s domain %s locked by BIOS\n", + rd->rp->name, rd->name); rd->state |= DOMAIN_STATE_BIOS_LOCKED; } } @@ -1415,10 +1414,10 @@ static int rapl_detect_domains(struct rapl_package *rp, int cpu) } rp->nr_domains = bitmap_weight(&rp->domain_map, RAPL_DOMAIN_MAX); if (!rp->nr_domains) { - pr_debug("no valid rapl domains found in package %d\n", rp->id); + pr_debug("no valid rapl domains found in %s\n", rp->name); return -ENODEV; } - pr_debug("found %d domains on package %d\n", rp->nr_domains, rp->id); + pr_debug("found %d domains on %s\n", rp->nr_domains, rp->name);
rp->domains = kcalloc(rp->nr_domains + 1, sizeof(struct rapl_domain), GFP_KERNEL); @@ -1451,8 +1450,8 @@ static void rapl_remove_package(struct rapl_package *rp) rd_package = rd; continue; } - pr_debug("remove package, undo power limit on %d: %s\n", - rp->id, rd->name); + pr_debug("remove package, undo power limit on %s: %s\n", + rp->name, rd->name); powercap_unregister_zone(control_type, &rd->power_zone); } /* do parent zone last */ @@ -1466,6 +1465,7 @@ static struct rapl_package *rapl_add_package(int cpu) { int id = topology_logical_die_id(cpu); struct rapl_package *rp; + struct cpuinfo_x86 *c = &cpu_data(cpu); int ret;
rp = kzalloc(sizeof(struct rapl_package), GFP_KERNEL); @@ -1476,6 +1476,13 @@ static struct rapl_package *rapl_add_package(int cpu) rp->id = id; rp->lead_cpu = cpu;
+ if (topology_max_die_per_package() > 1) + snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, + "package-%d-die-%d", c->phys_proc_id, c->cpu_die_id); + else + snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, "package-%d", + c->phys_proc_id); + /* check if the package contains valid domains */ if (rapl_detect_domains(rp, cpu) || rapl_defaults->check_unit(rp, cpu)) {
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit f7c4e0c89bbd0f008b33d9dce02e207d9dea9f54 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit f7c4e0c89bbd0f008b33d9dce02e207d9dea9f54 upstream.
To support both MSR and MMIO Interface, use 'reg' to discribe RAPL registers instead of 'msr'.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 98 +++++++++++++++++------------------ 1 file changed, 49 insertions(+), 49 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 0c861567d37b..fb7d47560491 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -95,13 +95,13 @@ enum rapl_domain_type { RAPL_DOMAIN_MAX, };
-enum rapl_domain_msr_id { - RAPL_DOMAIN_MSR_LIMIT, - RAPL_DOMAIN_MSR_STATUS, - RAPL_DOMAIN_MSR_PERF, - RAPL_DOMAIN_MSR_POLICY, - RAPL_DOMAIN_MSR_INFO, - RAPL_DOMAIN_MSR_MAX, +enum rapl_domain_reg_id { + RAPL_DOMAIN_REG_LIMIT, + RAPL_DOMAIN_REG_STATUS, + RAPL_DOMAIN_REG_PERF, + RAPL_DOMAIN_REG_POLICY, + RAPL_DOMAIN_REG_INFO, + RAPL_DOMAIN_REG_MAX, };
/* per domain data, some are optional */ @@ -166,7 +166,7 @@ struct rapl_package; struct rapl_domain { const char *name; enum rapl_domain_type id; - int msrs[RAPL_DOMAIN_MSR_MAX]; + int regs[RAPL_DOMAIN_REG_MAX]; struct powercap_zone power_zone; struct rapl_domain_data rdd; struct rapl_power_limit rpl[NR_POWER_LIMITS]; @@ -228,7 +228,7 @@ struct rapl_primitive_info { const char *name; u64 mask; int shift; - enum rapl_domain_msr_id id; + enum rapl_domain_reg_id id; enum unit_type unit; u32 flag; }; @@ -654,11 +654,11 @@ static void rapl_init_domains(struct rapl_package *rp) case BIT(RAPL_DOMAIN_PACKAGE): rd->name = rapl_domain_names[RAPL_DOMAIN_PACKAGE]; rd->id = RAPL_DOMAIN_PACKAGE; - rd->msrs[0] = MSR_PKG_POWER_LIMIT; - rd->msrs[1] = MSR_PKG_ENERGY_STATUS; - rd->msrs[2] = MSR_PKG_PERF_STATUS; - rd->msrs[3] = 0; - rd->msrs[4] = MSR_PKG_POWER_INFO; + rd->regs[0] = MSR_PKG_POWER_LIMIT; + rd->regs[1] = MSR_PKG_ENERGY_STATUS; + rd->regs[2] = MSR_PKG_PERF_STATUS; + rd->regs[3] = 0; + rd->regs[4] = MSR_PKG_POWER_INFO; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; @@ -667,33 +667,33 @@ static void rapl_init_domains(struct rapl_package *rp) case BIT(RAPL_DOMAIN_PP0): rd->name = rapl_domain_names[RAPL_DOMAIN_PP0]; rd->id = RAPL_DOMAIN_PP0; - rd->msrs[0] = MSR_PP0_POWER_LIMIT; - rd->msrs[1] = MSR_PP0_ENERGY_STATUS; - rd->msrs[2] = 0; - rd->msrs[3] = MSR_PP0_POLICY; - rd->msrs[4] = 0; + rd->regs[0] = MSR_PP0_POWER_LIMIT; + rd->regs[1] = MSR_PP0_ENERGY_STATUS; + rd->regs[2] = 0; + rd->regs[3] = MSR_PP0_POLICY; + rd->regs[4] = 0; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; break; case BIT(RAPL_DOMAIN_PP1): rd->name = rapl_domain_names[RAPL_DOMAIN_PP1]; rd->id = RAPL_DOMAIN_PP1; - rd->msrs[0] = MSR_PP1_POWER_LIMIT; - rd->msrs[1] = MSR_PP1_ENERGY_STATUS; - rd->msrs[2] = 0; - rd->msrs[3] = MSR_PP1_POLICY; - rd->msrs[4] = 0; + rd->regs[0] = MSR_PP1_POWER_LIMIT; + rd->regs[1] = MSR_PP1_ENERGY_STATUS; + rd->regs[2] = 0; + rd->regs[3] = MSR_PP1_POLICY; + rd->regs[4] = 0; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; break; case BIT(RAPL_DOMAIN_DRAM): rd->name = rapl_domain_names[RAPL_DOMAIN_DRAM]; rd->id = RAPL_DOMAIN_DRAM; - rd->msrs[0] = MSR_DRAM_POWER_LIMIT; - rd->msrs[1] = MSR_DRAM_ENERGY_STATUS; - rd->msrs[2] = MSR_DRAM_PERF_STATUS; - rd->msrs[3] = 0; - rd->msrs[4] = MSR_DRAM_POWER_INFO; + rd->regs[0] = MSR_DRAM_POWER_LIMIT; + rd->regs[1] = MSR_DRAM_ENERGY_STATUS; + rd->regs[2] = MSR_DRAM_PERF_STATUS; + rd->regs[3] = 0; + rd->regs[4] = MSR_DRAM_POWER_INFO; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->domain_energy_unit = @@ -748,37 +748,37 @@ static u64 rapl_unit_xlate(struct rapl_domain *rd, enum unit_type type, static struct rapl_primitive_info rpi[] = { /* name, mask, shift, msr index, unit divisor */ PRIMITIVE_INFO_INIT(ENERGY_COUNTER, ENERGY_STATUS_MASK, 0, - RAPL_DOMAIN_MSR_STATUS, ENERGY_UNIT, 0), + RAPL_DOMAIN_REG_STATUS, ENERGY_UNIT, 0), PRIMITIVE_INFO_INIT(POWER_LIMIT1, POWER_LIMIT1_MASK, 0, - RAPL_DOMAIN_MSR_LIMIT, POWER_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(POWER_LIMIT2, POWER_LIMIT2_MASK, 32, - RAPL_DOMAIN_MSR_LIMIT, POWER_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(FW_LOCK, POWER_PP_LOCK, 31, - RAPL_DOMAIN_MSR_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL1_ENABLE, POWER_LIMIT1_ENABLE, 15, - RAPL_DOMAIN_MSR_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL1_CLAMP, POWER_LIMIT1_CLAMP, 16, - RAPL_DOMAIN_MSR_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL2_ENABLE, POWER_LIMIT2_ENABLE, 47, - RAPL_DOMAIN_MSR_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL2_CLAMP, POWER_LIMIT2_CLAMP, 48, - RAPL_DOMAIN_MSR_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(TIME_WINDOW1, TIME_WINDOW1_MASK, 17, - RAPL_DOMAIN_MSR_LIMIT, TIME_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(TIME_WINDOW2, TIME_WINDOW2_MASK, 49, - RAPL_DOMAIN_MSR_LIMIT, TIME_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(THERMAL_SPEC_POWER, POWER_INFO_THERMAL_SPEC_MASK, - 0, RAPL_DOMAIN_MSR_INFO, POWER_UNIT, 0), + 0, RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(MAX_POWER, POWER_INFO_MAX_MASK, 32, - RAPL_DOMAIN_MSR_INFO, POWER_UNIT, 0), + RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(MIN_POWER, POWER_INFO_MIN_MASK, 16, - RAPL_DOMAIN_MSR_INFO, POWER_UNIT, 0), + RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(MAX_TIME_WINDOW, POWER_INFO_MAX_TIME_WIN_MASK, 48, - RAPL_DOMAIN_MSR_INFO, TIME_UNIT, 0), + RAPL_DOMAIN_REG_INFO, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(THROTTLED_TIME, PERF_STATUS_THROTTLE_TIME_MASK, 0, - RAPL_DOMAIN_MSR_PERF, TIME_UNIT, 0), + RAPL_DOMAIN_REG_PERF, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(PRIORITY_LEVEL, PP_POLICY_MASK, 0, - RAPL_DOMAIN_MSR_POLICY, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_POLICY, ARBITRARY_UNIT, 0), /* non-hardware */ PRIMITIVE_INFO_INIT(AVERAGE_POWER, 0, 0, 0, POWER_UNIT, RAPL_PRIMITIVE_DERIVED), @@ -810,7 +810,7 @@ static int rapl_read_data_raw(struct rapl_domain *rd, if (!rp->name || rp->flag & RAPL_PRIMITIVE_DUMMY) return -EINVAL;
- msr = rd->msrs[rp->id]; + msr = rd->regs[rp->id]; if (!msr) return -EINVAL;
@@ -886,7 +886,7 @@ static int rapl_write_data_raw(struct rapl_domain *rd,
memset(&ma, 0, sizeof(ma));
- ma.msr_no = rd->msrs[rp->id]; + ma.msr_no = rd->regs[rp->id]; ma.clear_mask = rp->mask; ma.set_mask = bits;
@@ -1299,8 +1299,8 @@ static int __init rapl_register_psys(void)
rd->name = rapl_domain_names[RAPL_DOMAIN_PLATFORM]; rd->id = RAPL_DOMAIN_PLATFORM; - rd->msrs[0] = MSR_PLATFORM_POWER_LIMIT; - rd->msrs[1] = MSR_PLATFORM_ENERGY_STATUS; + rd->regs[0] = MSR_PLATFORM_POWER_LIMIT; + rd->regs[1] = MSR_PLATFORM_ENERGY_STATUS; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE;
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 8310e8202f24d674b6b2bd341af15d72299f696d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8310e8202f24d674b6b2bd341af15d72299f696d upstream.
enum rapl_domain_reg_id is defined for the RAPL registers for each RAPL domain, thus use it whenever possible.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 44 +++++++++++++++++------------------ 1 file changed, 22 insertions(+), 22 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index fb7d47560491..6cb123678e1c 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -654,11 +654,11 @@ static void rapl_init_domains(struct rapl_package *rp) case BIT(RAPL_DOMAIN_PACKAGE): rd->name = rapl_domain_names[RAPL_DOMAIN_PACKAGE]; rd->id = RAPL_DOMAIN_PACKAGE; - rd->regs[0] = MSR_PKG_POWER_LIMIT; - rd->regs[1] = MSR_PKG_ENERGY_STATUS; - rd->regs[2] = MSR_PKG_PERF_STATUS; - rd->regs[3] = 0; - rd->regs[4] = MSR_PKG_POWER_INFO; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PKG_POWER_LIMIT; + rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PKG_ENERGY_STATUS; + rd->regs[RAPL_DOMAIN_REG_PERF] = MSR_PKG_PERF_STATUS; + rd->regs[RAPL_DOMAIN_REG_POLICY] = 0; + rd->regs[RAPL_DOMAIN_REG_INFO] = MSR_PKG_POWER_INFO; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; @@ -667,33 +667,33 @@ static void rapl_init_domains(struct rapl_package *rp) case BIT(RAPL_DOMAIN_PP0): rd->name = rapl_domain_names[RAPL_DOMAIN_PP0]; rd->id = RAPL_DOMAIN_PP0; - rd->regs[0] = MSR_PP0_POWER_LIMIT; - rd->regs[1] = MSR_PP0_ENERGY_STATUS; - rd->regs[2] = 0; - rd->regs[3] = MSR_PP0_POLICY; - rd->regs[4] = 0; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PP0_POWER_LIMIT; + rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PP0_ENERGY_STATUS; + rd->regs[RAPL_DOMAIN_REG_PERF] = 0; + rd->regs[RAPL_DOMAIN_REG_POLICY] = MSR_PP0_POLICY; + rd->regs[RAPL_DOMAIN_REG_INFO] = 0; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; break; case BIT(RAPL_DOMAIN_PP1): rd->name = rapl_domain_names[RAPL_DOMAIN_PP1]; rd->id = RAPL_DOMAIN_PP1; - rd->regs[0] = MSR_PP1_POWER_LIMIT; - rd->regs[1] = MSR_PP1_ENERGY_STATUS; - rd->regs[2] = 0; - rd->regs[3] = MSR_PP1_POLICY; - rd->regs[4] = 0; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PP1_POWER_LIMIT; + rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PP1_ENERGY_STATUS; + rd->regs[RAPL_DOMAIN_REG_PERF] = 0; + rd->regs[RAPL_DOMAIN_REG_POLICY] = MSR_PP1_POLICY; + rd->regs[RAPL_DOMAIN_REG_INFO] = 0; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; break; case BIT(RAPL_DOMAIN_DRAM): rd->name = rapl_domain_names[RAPL_DOMAIN_DRAM]; rd->id = RAPL_DOMAIN_DRAM; - rd->regs[0] = MSR_DRAM_POWER_LIMIT; - rd->regs[1] = MSR_DRAM_ENERGY_STATUS; - rd->regs[2] = MSR_DRAM_PERF_STATUS; - rd->regs[3] = 0; - rd->regs[4] = MSR_DRAM_POWER_INFO; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_DRAM_POWER_LIMIT; + rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_DRAM_ENERGY_STATUS; + rd->regs[RAPL_DOMAIN_REG_PERF] = MSR_DRAM_PERF_STATUS; + rd->regs[RAPL_DOMAIN_REG_POLICY] = 0; + rd->regs[RAPL_DOMAIN_REG_INFO] = MSR_DRAM_POWER_INFO; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->domain_energy_unit = @@ -1299,8 +1299,8 @@ static int __init rapl_register_psys(void)
rd->name = rapl_domain_names[RAPL_DOMAIN_PLATFORM]; rd->id = RAPL_DOMAIN_PLATFORM; - rd->regs[0] = MSR_PLATFORM_POWER_LIMIT; - rd->regs[1] = MSR_PLATFORM_ENERGY_STATUS; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PLATFORM_POWER_LIMIT; + rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PLATFORM_ENERGY_STATUS; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE;
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit ff956826a403f5cf189978d5ff6b3eb53aa11610 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ff956826a403f5cf189978d5ff6b3eb53aa11610 upstream.
Create a new header file for the common definitions that might be used by different RAPL Interface.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- MAINTAINERS | 1 + drivers/powercap/intel_rapl.c | 101 +----------------------------- include/linux/intel_rapl.h | 113 ++++++++++++++++++++++++++++++++++ 3 files changed, 116 insertions(+), 99 deletions(-) create mode 100644 include/linux/intel_rapl.h
diff --git a/MAINTAINERS b/MAINTAINERS index c2bbf0fdc886..e01480643d53 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11693,6 +11693,7 @@ F: drivers/base/power/ F: include/linux/pm.h F: include/linux/pm_* F: include/linux/powercap.h +F: include/linux/intel_rapl.h F: drivers/powercap/ F: kernel/configs/nopm.config
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 6cb123678e1c..a5e4f6b97621 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -30,8 +30,9 @@ #include <linux/cpu.h> #include <linux/powercap.h> #include <linux/suspend.h> -#include <asm/iosf_mbi.h> +#include <linux/intel_rapl.h>
+#include <asm/iosf_mbi.h> #include <asm/processor.h> #include <asm/cpu_device_id.h> #include <asm/intel-family.h> @@ -86,59 +87,9 @@ enum unit_type { TIME_UNIT, };
-enum rapl_domain_type { - RAPL_DOMAIN_PACKAGE, /* entire package/socket */ - RAPL_DOMAIN_PP0, /* core power plane */ - RAPL_DOMAIN_PP1, /* graphics uncore */ - RAPL_DOMAIN_DRAM,/* DRAM control_type */ - RAPL_DOMAIN_PLATFORM, /* PSys control_type */ - RAPL_DOMAIN_MAX, -}; - -enum rapl_domain_reg_id { - RAPL_DOMAIN_REG_LIMIT, - RAPL_DOMAIN_REG_STATUS, - RAPL_DOMAIN_REG_PERF, - RAPL_DOMAIN_REG_POLICY, - RAPL_DOMAIN_REG_INFO, - RAPL_DOMAIN_REG_MAX, -}; - /* per domain data, some are optional */ -enum rapl_primitives { - ENERGY_COUNTER, - POWER_LIMIT1, - POWER_LIMIT2, - FW_LOCK, - - PL1_ENABLE, /* power limit 1, aka long term */ - PL1_CLAMP, /* allow frequency to go below OS request */ - PL2_ENABLE, /* power limit 2, aka short term, instantaneous */ - PL2_CLAMP, - - TIME_WINDOW1, /* long term */ - TIME_WINDOW2, /* short term */ - THERMAL_SPEC_POWER, - MAX_POWER, - - MIN_POWER, - MAX_TIME_WINDOW, - THROTTLED_TIME, - PRIORITY_LEVEL, - - /* below are not raw primitive data */ - AVERAGE_POWER, - NR_RAPL_PRIMITIVES, -}; - #define NR_RAW_PRIMITIVES (NR_RAPL_PRIMITIVES - 2)
-/* Can be expanded to include events, etc.*/ -struct rapl_domain_data { - u64 primitives[NR_RAPL_PRIMITIVES]; - unsigned long timestamp; -}; - struct msrl_action { u32 msr_no; u64 clear_mask; @@ -150,60 +101,12 @@ struct msrl_action { #define DOMAIN_STATE_POWER_LIMIT_SET BIT(1) #define DOMAIN_STATE_BIOS_LOCKED BIT(2)
-#define NR_POWER_LIMITS (2) -struct rapl_power_limit { - struct powercap_zone_constraint *constraint; - int prim_id; /* primitive ID used to enable */ - struct rapl_domain *domain; - const char *name; - u64 last_power_limit; -}; - static const char pl1_name[] = "long_term"; static const char pl2_name[] = "short_term";
-struct rapl_package; -struct rapl_domain { - const char *name; - enum rapl_domain_type id; - int regs[RAPL_DOMAIN_REG_MAX]; - struct powercap_zone power_zone; - struct rapl_domain_data rdd; - struct rapl_power_limit rpl[NR_POWER_LIMITS]; - u64 attr_map; /* track capabilities */ - unsigned int state; - unsigned int domain_energy_unit; - struct rapl_package *rp; -}; #define power_zone_to_rapl_domain(_zone) \ container_of(_zone, struct rapl_domain, power_zone)
-/* maximum rapl package domain name: package-%d-die-%d */ -#define PACKAGE_DOMAIN_NAME_LENGTH 30 - - -/* Each rapl package contains multiple domains, these are the common - * data across RAPL domains within a package. - */ -struct rapl_package { - unsigned int id; /* logical die id, equals physical 1-die systems */ - unsigned int nr_domains; - unsigned long domain_map; /* bit map of active domains */ - unsigned int power_unit; - unsigned int energy_unit; - unsigned int time_unit; - struct rapl_domain *domains; /* array of domains, sized at runtime */ - struct powercap_zone *power_zone; /* keep track of parent zone */ - unsigned long power_limit_irq; /* keep track of package power limit - * notify interrupt enable status. - */ - struct list_head plist; - int lead_cpu; /* one active cpu per package for access */ - /* Track active cpus */ - struct cpumask cpumask; - char name[PACKAGE_DOMAIN_NAME_LENGTH]; -}; - struct rapl_defaults { u8 floor_freq_reg_addr; int (*check_unit)(struct rapl_package *rp, int cpu); diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h new file mode 100644 index 000000000000..94716036d829 --- /dev/null +++ b/include/linux/intel_rapl.h @@ -0,0 +1,113 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Data types and headers for RAPL support + * + * Copyright (C) 2019 Intel Corporation. + * + * Author: Zhang Rui rui.zhang@intel.com + */ + +#ifndef __INTEL_RAPL_H__ +#define __INTEL_RAPL_H__ + +#include <linux/types.h> +#include <linux/powercap.h> + +enum rapl_domain_type { + RAPL_DOMAIN_PACKAGE, /* entire package/socket */ + RAPL_DOMAIN_PP0, /* core power plane */ + RAPL_DOMAIN_PP1, /* graphics uncore */ + RAPL_DOMAIN_DRAM, /* DRAM control_type */ + RAPL_DOMAIN_PLATFORM, /* PSys control_type */ + RAPL_DOMAIN_MAX, +}; + +enum rapl_domain_reg_id { + RAPL_DOMAIN_REG_LIMIT, + RAPL_DOMAIN_REG_STATUS, + RAPL_DOMAIN_REG_PERF, + RAPL_DOMAIN_REG_POLICY, + RAPL_DOMAIN_REG_INFO, + RAPL_DOMAIN_REG_MAX, +}; + +struct rapl_package; + +enum rapl_primitives { + ENERGY_COUNTER, + POWER_LIMIT1, + POWER_LIMIT2, + FW_LOCK, + + PL1_ENABLE, /* power limit 1, aka long term */ + PL1_CLAMP, /* allow frequency to go below OS request */ + PL2_ENABLE, /* power limit 2, aka short term, instantaneous */ + PL2_CLAMP, + + TIME_WINDOW1, /* long term */ + TIME_WINDOW2, /* short term */ + THERMAL_SPEC_POWER, + MAX_POWER, + + MIN_POWER, + MAX_TIME_WINDOW, + THROTTLED_TIME, + PRIORITY_LEVEL, + + /* below are not raw primitive data */ + AVERAGE_POWER, + NR_RAPL_PRIMITIVES, +}; + +struct rapl_domain_data { + u64 primitives[NR_RAPL_PRIMITIVES]; + unsigned long timestamp; +}; + +#define NR_POWER_LIMITS (2) +struct rapl_power_limit { + struct powercap_zone_constraint *constraint; + int prim_id; /* primitive ID used to enable */ + struct rapl_domain *domain; + const char *name; + u64 last_power_limit; +}; + +struct rapl_package; + +struct rapl_domain { + const char *name; + enum rapl_domain_type id; + int regs[RAPL_DOMAIN_REG_MAX]; + struct powercap_zone power_zone; + struct rapl_domain_data rdd; + struct rapl_power_limit rpl[NR_POWER_LIMITS]; + u64 attr_map; /* track capabilities */ + unsigned int state; + unsigned int domain_energy_unit; + struct rapl_package *rp; +}; + +/* maximum rapl package domain name: package-%d-die-%d */ +#define PACKAGE_DOMAIN_NAME_LENGTH 30 + +struct rapl_package { + unsigned int id; /* logical die id, equals physical 1-die systems */ + unsigned int nr_domains; + unsigned long domain_map; /* bit map of active domains */ + unsigned int power_unit; + unsigned int energy_unit; + unsigned int time_unit; + struct rapl_domain *domains; /* array of domains, sized at runtime */ + struct powercap_zone *power_zone; /* keep track of parent zone */ + unsigned long power_limit_irq; /* keep track of package power limit + * notify interrupt enable status. + */ + struct list_head plist; + int lead_cpu; /* one active cpu per package for access */ + /* Track active cpus */ + struct cpumask cpumask; + char name[PACKAGE_DOMAIN_NAME_LENGTH]; +}; + +#endif /* __INTEL_RAPL_H__ */
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 7ebf8eff63b4f349e7b2ded6aa5036d94bdf94b9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7ebf8eff63b4f349e7b2ded6aa5036d94bdf94b9 upstream.
Introduce a new structure, rapl_if_private, to save the private data for different RAPL Interface.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 59 +++++++++++++++++------------------ include/linux/intel_rapl.h | 15 +++++++++ 2 files changed, 44 insertions(+), 30 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index a5e4f6b97621..26aa43c9a1a0 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -87,6 +87,9 @@ enum unit_type { TIME_UNIT, };
+/* private data for RAPL MSR Interface */ +static struct rapl_if_priv rapl_msr_priv; + /* per domain data, some are optional */ #define NR_RAW_PRIMITIVES (NR_RAPL_PRIMITIVES - 2)
@@ -167,17 +170,14 @@ static const char * const rapl_domain_names[] = { "psys", };
-static struct powercap_control_type *control_type; /* PowerCap Controller */ -static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */ - /* caller to ensure CPU hotplug lock is held */ -static struct rapl_package *rapl_find_package_domain(int cpu) +static struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv) { int id = topology_logical_die_id(cpu); struct rapl_package *rp;
list_for_each_entry(rp, &rapl_packages, plist) { - if (rp->id == id) + if (rp->id == id && rp->priv->control_type == priv->control_type) return rp; }
@@ -1107,12 +1107,12 @@ static void rapl_update_domain_data(struct rapl_package *rp)
static void rapl_unregister_powercap(void) { - if (platform_rapl_domain) { - powercap_unregister_zone(control_type, - &platform_rapl_domain->power_zone); - kfree(platform_rapl_domain); + if (&rapl_msr_priv.platform_rapl_domain) { + powercap_unregister_zone(rapl_msr_priv.control_type, + &rapl_msr_priv.platform_rapl_domain->power_zone); + kfree(rapl_msr_priv.platform_rapl_domain); } - powercap_unregister_control_type(control_type); + powercap_unregister_control_type(rapl_msr_priv.control_type); }
static int rapl_package_register_powercap(struct rapl_package *rp) @@ -1130,7 +1130,7 @@ static int rapl_package_register_powercap(struct rapl_package *rp) nr_pl = find_nr_power_limit(rd); pr_debug("register package domain %s\n", rp->name); power_zone = powercap_register_zone(&rd->power_zone, - control_type, + rp->priv->control_type, rp->name, NULL, &zone_ops[rd->id], nr_pl, @@ -1157,7 +1157,7 @@ static int rapl_package_register_powercap(struct rapl_package *rp) /* number of power limits per domain varies */ nr_pl = find_nr_power_limit(rd); power_zone = powercap_register_zone(&rd->power_zone, - control_type, rd->name, + rp->priv->control_type, rd->name, rp->power_zone, &zone_ops[rd->id], nr_pl, &constraint_ops); @@ -1178,7 +1178,7 @@ static int rapl_package_register_powercap(struct rapl_package *rp) */ while (--rd >= rp->domains) { pr_debug("unregister %s domain %s\n", rp->name, rd->name); - powercap_unregister_zone(control_type, &rd->power_zone); + powercap_unregister_zone(rp->priv->control_type, &rd->power_zone); }
return ret; @@ -1208,9 +1208,9 @@ static int __init rapl_register_psys(void) rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; rd->rpl[1].name = pl2_name; - rd->rp = rapl_find_package_domain(0); + rd->rp = rapl_find_package_domain(0, &rapl_msr_priv);
- power_zone = powercap_register_zone(&rd->power_zone, control_type, + power_zone = powercap_register_zone(&rd->power_zone, rapl_msr_priv.control_type, "psys", NULL, &zone_ops[RAPL_DOMAIN_PLATFORM], 2, &constraint_ops); @@ -1220,17 +1220,17 @@ static int __init rapl_register_psys(void) return PTR_ERR(power_zone); }
- platform_rapl_domain = rd; + rapl_msr_priv.platform_rapl_domain = rd;
return 0; }
static int __init rapl_register_powercap(void) { - control_type = powercap_register_control_type(NULL, "intel-rapl", NULL); - if (IS_ERR(control_type)) { + rapl_msr_priv.control_type = powercap_register_control_type(NULL, "intel-rapl", NULL); + if (IS_ERR(rapl_msr_priv.control_type)) { pr_debug("failed to register powercap control_type.\n"); - return PTR_ERR(control_type); + return PTR_ERR(rapl_msr_priv.control_type); } return 0; } @@ -1355,16 +1355,16 @@ static void rapl_remove_package(struct rapl_package *rp) } pr_debug("remove package, undo power limit on %s: %s\n", rp->name, rd->name); - powercap_unregister_zone(control_type, &rd->power_zone); + powercap_unregister_zone(rp->priv->control_type, &rd->power_zone); } /* do parent zone last */ - powercap_unregister_zone(control_type, &rd_package->power_zone); + powercap_unregister_zone(rp->priv->control_type, &rd_package->power_zone); list_del(&rp->plist); kfree(rp); }
/* called from CPU hotplug notifier, hotplug lock held */ -static struct rapl_package *rapl_add_package(int cpu) +static struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv) { int id = topology_logical_die_id(cpu); struct rapl_package *rp; @@ -1378,6 +1378,7 @@ static struct rapl_package *rapl_add_package(int cpu) /* add the new package to the list */ rp->id = id; rp->lead_cpu = cpu; + rp->priv = priv;
if (topology_max_die_per_package() > 1) snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, @@ -1416,9 +1417,9 @@ static int rapl_cpu_online(unsigned int cpu) { struct rapl_package *rp;
- rp = rapl_find_package_domain(cpu); + rp = rapl_find_package_domain(cpu, &rapl_msr_priv); if (!rp) { - rp = rapl_add_package(cpu); + rp = rapl_add_package(cpu, &rapl_msr_priv); if (IS_ERR(rp)) return PTR_ERR(rp); } @@ -1431,7 +1432,7 @@ static int rapl_cpu_down_prep(unsigned int cpu) struct rapl_package *rp; int lead_cpu;
- rp = rapl_find_package_domain(cpu); + rp = rapl_find_package_domain(cpu, &rapl_msr_priv); if (!rp) return 0;
@@ -1444,8 +1445,6 @@ static int rapl_cpu_down_prep(unsigned int cpu) return 0; }
-static enum cpuhp_state pcap_rapl_online; - static void power_limit_state_save(void) { struct rapl_package *rp; @@ -1555,7 +1554,7 @@ static int __init rapl_init(void) rapl_cpu_online, rapl_cpu_down_prep); if (ret < 0) goto err_unreg; - pcap_rapl_online = ret; + rapl_msr_priv.pcap_rapl_online = ret;
/* Don't bail out if PSys is not supported */ rapl_register_psys(); @@ -1567,7 +1566,7 @@ static int __init rapl_init(void) return 0;
err_unreg_all: - cpuhp_remove_state(pcap_rapl_online); + cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online);
err_unreg: rapl_unregister_powercap(); @@ -1577,7 +1576,7 @@ static int __init rapl_init(void) static void __exit rapl_exit(void) { unregister_pm_notifier(&rapl_pm_notifier); - cpuhp_remove_state(pcap_rapl_online); + cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online); rapl_unregister_powercap(); }
diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index 94716036d829..7bf1683e4a63 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -88,6 +88,20 @@ struct rapl_domain { struct rapl_package *rp; };
+/** + * struct rapl_if_priv: private data for different RAPL interfaces + * @control_type: Each RAPL interface must have its own powercap + * control type. + * @platform_rapl_domain: Optional. Some RAPL interface may have platform + * level RAPL control. + * @pcap_rapl_online: CPU hotplug state for each RAPL interface. + */ +struct rapl_if_priv { + struct powercap_control_type *control_type; + struct rapl_domain *platform_rapl_domain; + enum cpuhp_state pcap_rapl_online; +}; + /* maximum rapl package domain name: package-%d-die-%d */ #define PACKAGE_DOMAIN_NAME_LENGTH 30
@@ -108,6 +122,7 @@ struct rapl_package { /* Track active cpus */ struct cpumask cpumask; char name[PACKAGE_DOMAIN_NAME_LENGTH]; + struct rapl_if_priv *priv; };
#endif /* __INTEL_RAPL_H__ */
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 7fde2712a7adab721eaabafbd8ff93dff3262d35 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7fde2712a7adab721eaabafbd8ff93dff3262d35 upstream.
MSR and MMIO RAPL interface have different sets of registers, thus the RAPL register address should be obtained from interface specific structure, i.e. struct rapl_if_private, instead.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 73 ++++++++++++++++------------------- include/linux/intel_rapl.h | 4 ++ 2 files changed, 37 insertions(+), 40 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 26aa43c9a1a0..a8f1982e2fda 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -88,7 +88,19 @@ enum unit_type { };
/* private data for RAPL MSR Interface */ -static struct rapl_if_priv rapl_msr_priv; +static struct rapl_if_priv rapl_msr_priv = { + .reg_unit = MSR_RAPL_POWER_UNIT, + .regs[RAPL_DOMAIN_PACKAGE] = { + MSR_PKG_POWER_LIMIT, MSR_PKG_ENERGY_STATUS, MSR_PKG_PERF_STATUS, 0, MSR_PKG_POWER_INFO }, + .regs[RAPL_DOMAIN_PP0] = { + MSR_PP0_POWER_LIMIT, MSR_PP0_ENERGY_STATUS, 0, MSR_PP0_POLICY, 0 }, + .regs[RAPL_DOMAIN_PP1] = { + MSR_PP1_POWER_LIMIT, MSR_PP1_ENERGY_STATUS, 0, MSR_PP1_POLICY, 0 }, + .regs[RAPL_DOMAIN_DRAM] = { + MSR_DRAM_POWER_LIMIT, MSR_DRAM_ENERGY_STATUS, MSR_DRAM_PERF_STATUS, 0, MSR_DRAM_POWER_INFO }, + .regs[RAPL_DOMAIN_PLATFORM] = { + MSR_PLATFORM_POWER_LIMIT, MSR_PLATFORM_ENERGY_STATUS, 0, 0, 0}, +};
/* per domain data, some are optional */ #define NR_RAW_PRIMITIVES (NR_RAPL_PRIMITIVES - 2) @@ -553,15 +565,17 @@ static void rapl_init_domains(struct rapl_package *rp)
for (i = 0; i < RAPL_DOMAIN_MAX; i++) { unsigned int mask = rp->domain_map & (1 << i); + + rd->regs[RAPL_DOMAIN_REG_LIMIT] = rp->priv->regs[i][RAPL_DOMAIN_REG_LIMIT]; + rd->regs[RAPL_DOMAIN_REG_STATUS] = rp->priv->regs[i][RAPL_DOMAIN_REG_STATUS]; + rd->regs[RAPL_DOMAIN_REG_PERF] = rp->priv->regs[i][RAPL_DOMAIN_REG_PERF]; + rd->regs[RAPL_DOMAIN_REG_POLICY] = rp->priv->regs[i][RAPL_DOMAIN_REG_POLICY]; + rd->regs[RAPL_DOMAIN_REG_INFO] = rp->priv->regs[i][RAPL_DOMAIN_REG_INFO]; + switch (mask) { case BIT(RAPL_DOMAIN_PACKAGE): rd->name = rapl_domain_names[RAPL_DOMAIN_PACKAGE]; rd->id = RAPL_DOMAIN_PACKAGE; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PKG_POWER_LIMIT; - rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PKG_ENERGY_STATUS; - rd->regs[RAPL_DOMAIN_REG_PERF] = MSR_PKG_PERF_STATUS; - rd->regs[RAPL_DOMAIN_REG_POLICY] = 0; - rd->regs[RAPL_DOMAIN_REG_INFO] = MSR_PKG_POWER_INFO; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; @@ -570,33 +584,18 @@ static void rapl_init_domains(struct rapl_package *rp) case BIT(RAPL_DOMAIN_PP0): rd->name = rapl_domain_names[RAPL_DOMAIN_PP0]; rd->id = RAPL_DOMAIN_PP0; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PP0_POWER_LIMIT; - rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PP0_ENERGY_STATUS; - rd->regs[RAPL_DOMAIN_REG_PERF] = 0; - rd->regs[RAPL_DOMAIN_REG_POLICY] = MSR_PP0_POLICY; - rd->regs[RAPL_DOMAIN_REG_INFO] = 0; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; break; case BIT(RAPL_DOMAIN_PP1): rd->name = rapl_domain_names[RAPL_DOMAIN_PP1]; rd->id = RAPL_DOMAIN_PP1; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PP1_POWER_LIMIT; - rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PP1_ENERGY_STATUS; - rd->regs[RAPL_DOMAIN_REG_PERF] = 0; - rd->regs[RAPL_DOMAIN_REG_POLICY] = MSR_PP1_POLICY; - rd->regs[RAPL_DOMAIN_REG_INFO] = 0; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; break; case BIT(RAPL_DOMAIN_DRAM): rd->name = rapl_domain_names[RAPL_DOMAIN_DRAM]; rd->id = RAPL_DOMAIN_DRAM; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_DRAM_POWER_LIMIT; - rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_DRAM_ENERGY_STATUS; - rd->regs[RAPL_DOMAIN_REG_PERF] = MSR_DRAM_PERF_STATUS; - rd->regs[RAPL_DOMAIN_REG_POLICY] = 0; - rd->regs[RAPL_DOMAIN_REG_INFO] = MSR_DRAM_POWER_INFO; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->domain_energy_unit = @@ -818,9 +817,9 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) u64 msr_val; u32 value;
- if (rdmsrl_safe_on_cpu(cpu, MSR_RAPL_POWER_UNIT, &msr_val)) { + if (rdmsrl_safe_on_cpu(cpu, rp->priv->reg_unit, &msr_val)) { pr_err("Failed to read power unit MSR 0x%x on CPU %d, exit.\n", - MSR_RAPL_POWER_UNIT, cpu); + rp->priv->reg_unit, cpu); return -ENODEV; }
@@ -844,9 +843,9 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) u64 msr_val; u32 value;
- if (rdmsrl_safe_on_cpu(cpu, MSR_RAPL_POWER_UNIT, &msr_val)) { + if (rdmsrl_safe_on_cpu(cpu, rp->priv->reg_unit, &msr_val)) { pr_err("Failed to read power unit MSR 0x%x on CPU %d, exit.\n", - MSR_RAPL_POWER_UNIT, cpu); + rp->priv->reg_unit, cpu); return -ENODEV; } value = (msr_val & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET; @@ -1190,10 +1189,10 @@ static int __init rapl_register_psys(void) struct powercap_zone *power_zone; u64 val;
- if (rdmsrl_safe_on_cpu(0, MSR_PLATFORM_ENERGY_STATUS, &val) || !val) + if (rdmsrl_safe_on_cpu(0, rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS], &val) || !val) return -ENODEV;
- if (rdmsrl_safe_on_cpu(0, MSR_PLATFORM_POWER_LIMIT, &val) || !val) + if (rdmsrl_safe_on_cpu(0, rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT], &val) || !val) return -ENODEV;
rd = kzalloc(sizeof(*rd), GFP_KERNEL); @@ -1202,8 +1201,8 @@ static int __init rapl_register_psys(void)
rd->name = rapl_domain_names[RAPL_DOMAIN_PLATFORM]; rd->id = RAPL_DOMAIN_PLATFORM; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = MSR_PLATFORM_POWER_LIMIT; - rd->regs[RAPL_DOMAIN_REG_STATUS] = MSR_PLATFORM_ENERGY_STATUS; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT]; + rd->regs[RAPL_DOMAIN_REG_STATUS] = rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS]; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; @@ -1235,23 +1234,17 @@ static int __init rapl_register_powercap(void) return 0; }
-static int rapl_check_domain(int cpu, int domain) +static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp) { - unsigned msr; + u32 reg; u64 val = 0;
switch (domain) { case RAPL_DOMAIN_PACKAGE: - msr = MSR_PKG_ENERGY_STATUS; - break; case RAPL_DOMAIN_PP0: - msr = MSR_PP0_ENERGY_STATUS; - break; case RAPL_DOMAIN_PP1: - msr = MSR_PP1_ENERGY_STATUS; - break; case RAPL_DOMAIN_DRAM: - msr = MSR_DRAM_ENERGY_STATUS; + reg = rp->priv->regs[domain][RAPL_DOMAIN_REG_STATUS]; break; case RAPL_DOMAIN_PLATFORM: /* PSYS(PLATFORM) is not a CPU domain, so avoid printng error */ @@ -1263,7 +1256,7 @@ static int rapl_check_domain(int cpu, int domain) /* make sure domain counters are available and contains non-zero * values, otherwise skip it. */ - if (rdmsrl_safe_on_cpu(cpu, msr, &val) || !val) + if (rdmsrl_safe_on_cpu(cpu, reg, &val) || !val) return -ENODEV;
return 0; @@ -1310,7 +1303,7 @@ static int rapl_detect_domains(struct rapl_package *rp, int cpu)
for (i = 0; i < RAPL_DOMAIN_MAX; i++) { /* use physical package id to read counters */ - if (!rapl_check_domain(cpu, i)) { + if (!rapl_check_domain(cpu, i, rp)) { rp->domain_map |= 1 << i; pr_info("Found RAPL domain %s\n", rapl_domain_names[i]); } diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index 7bf1683e4a63..ec2c9e83274f 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -95,11 +95,15 @@ struct rapl_domain { * @platform_rapl_domain: Optional. Some RAPL interface may have platform * level RAPL control. * @pcap_rapl_online: CPU hotplug state for each RAPL interface. + * @reg_unit: Register for getting energy/power/time unit. + * @regs: Register sets for different RAPL Domains. */ struct rapl_if_priv { struct powercap_control_type *control_type; struct rapl_domain *platform_rapl_domain; enum cpuhp_state pcap_rapl_online; + u32 reg_unit; + u32 regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX]; };
/* maximum rapl package domain name: package-%d-die-%d */
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit beea8df821d928e7755917da6c1e45d6afde5148 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit beea8df821d928e7755917da6c1e45d6afde5148 upstream.
MSR and MMIO RAPL interfaces have different ways to access the registers, thus in order to abstract the register access operations, two callbacks, .read_raw()/.write_raw() are introduced, and they should be implemented by MSR RAPL and MMIO RAPL interface driver respectly.
This patch implements them for the MSR I/F only.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 110 ++++++++++++++++++---------------- include/linux/intel_rapl.h | 13 ++++ 2 files changed, 70 insertions(+), 53 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index a8f1982e2fda..eb0b8ffa2452 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -105,13 +105,6 @@ static struct rapl_if_priv rapl_msr_priv = { /* per domain data, some are optional */ #define NR_RAW_PRIMITIVES (NR_RAPL_PRIMITIVES - 2)
-struct msrl_action { - u32 msr_no; - u64 clear_mask; - u64 set_mask; - int err; -}; - #define DOMAIN_STATE_INACTIVE BIT(0) #define DOMAIN_STATE_POWER_LIMIT_SET BIT(1) #define DOMAIN_STATE_BIOS_LOCKED BIT(2) @@ -704,16 +697,16 @@ static int rapl_read_data_raw(struct rapl_domain *rd, enum rapl_primitives prim, bool xlate, u64 *data) { - u64 value, final; - u32 msr; + u64 value; struct rapl_primitive_info *rp = &rpi[prim]; + struct reg_action ra; int cpu;
if (!rp->name || rp->flag & RAPL_PRIMITIVE_DUMMY) return -EINVAL;
- msr = rd->regs[rp->id]; - if (!msr) + ra.reg = rd->regs[rp->id]; + if (!ra.reg) return -EINVAL;
cpu = rd->rp->lead_cpu; @@ -729,47 +722,23 @@ static int rapl_read_data_raw(struct rapl_domain *rd, return 0; }
- if (rdmsrl_safe_on_cpu(cpu, msr, &value)) { - pr_debug("failed to read msr 0x%x on cpu %d\n", msr, cpu); + ra.mask = rp->mask; + + if (rd->rp->priv->read_raw(cpu, &ra)) { + pr_debug("failed to read reg 0x%x on cpu %d\n", ra.reg, cpu); return -EIO; }
- final = value & rp->mask; - final = final >> rp->shift; + value = ra.value >> rp->shift; + if (xlate) - *data = rapl_unit_xlate(rd, rp->unit, final, 0); + *data = rapl_unit_xlate(rd, rp->unit, value, 0); else - *data = final; + *data = value;
return 0; }
- -static int msrl_update_safe(u32 msr_no, u64 clear_mask, u64 set_mask) -{ - int err; - u64 val; - - err = rdmsrl_safe(msr_no, &val); - if (err) - goto out; - - val &= ~clear_mask; - val |= set_mask; - - err = wrmsrl_safe(msr_no, val); - -out: - return err; -} - -static void msrl_update_func(void *info) -{ - struct msrl_action *ma = info; - - ma->err = msrl_update_safe(ma->msr_no, ma->clear_mask, ma->set_mask); -} - /* Similar use of primitive info in the read counterpart */ static int rapl_write_data_raw(struct rapl_domain *rd, enum rapl_primitives prim, @@ -778,7 +747,7 @@ static int rapl_write_data_raw(struct rapl_domain *rd, struct rapl_primitive_info *rp = &rpi[prim]; int cpu; u64 bits; - struct msrl_action ma; + struct reg_action ra; int ret;
cpu = rd->rp->lead_cpu; @@ -786,17 +755,13 @@ static int rapl_write_data_raw(struct rapl_domain *rd, bits <<= rp->shift; bits &= rp->mask;
- memset(&ma, 0, sizeof(ma)); + memset(&ra, 0, sizeof(ra));
- ma.msr_no = rd->regs[rp->id]; - ma.clear_mask = rp->mask; - ma.set_mask = bits; + ra.reg = rd->regs[rp->id]; + ra.mask = rp->mask; + ra.value = bits;
- ret = smp_call_function_single(cpu, msrl_update_func, &ma, 1); - if (ret) - WARN_ON_ONCE(ret); - else - ret = ma.err; + ret = rd->rp->priv->write_raw(cpu, &ra);
return ret; } @@ -1524,6 +1489,43 @@ static struct notifier_block rapl_pm_notifier = { .notifier_call = rapl_pm_callback, };
+static int rapl_msr_read_raw(int cpu, struct reg_action *ra) +{ + if (rdmsrl_safe_on_cpu(cpu, ra->reg, &ra->value)) { + pr_debug("failed to read msr 0x%x on cpu %d\n", ra->reg, cpu); + return -EIO; + } + ra->value &= ra->mask; + return 0; +} + +static void rapl_msr_update_func(void *info) +{ + struct reg_action *ra = info; + u64 val; + + ra->err = rdmsrl_safe(ra->reg, &val); + if (ra->err) + return; + + val &= ~ra->mask; + val |= ra->value; + + ra->err = wrmsrl_safe(ra->reg, val); +} + + +static int rapl_msr_write_raw(int cpu, struct reg_action *ra) +{ + int ret; + + ret = smp_call_function_single(cpu, rapl_msr_update_func, ra, 1); + if (WARN_ON_ONCE(ret)) + return ret; + + return ra->err; +} + static int __init rapl_init(void) { const struct x86_cpu_id *id; @@ -1539,6 +1541,8 @@ static int __init rapl_init(void)
rapl_defaults = (struct rapl_defaults *)id->driver_data;
+ rapl_msr_priv.read_raw = rapl_msr_read_raw; + rapl_msr_priv.write_raw = rapl_msr_write_raw; ret = rapl_register_powercap(); if (ret) return ret; diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index ec2c9e83274f..ff215d64d114 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -88,6 +88,13 @@ struct rapl_domain { struct rapl_package *rp; };
+struct reg_action { + u32 reg; + u64 mask; + u64 value; + int err; +}; + /** * struct rapl_if_priv: private data for different RAPL interfaces * @control_type: Each RAPL interface must have its own powercap @@ -97,6 +104,10 @@ struct rapl_domain { * @pcap_rapl_online: CPU hotplug state for each RAPL interface. * @reg_unit: Register for getting energy/power/time unit. * @regs: Register sets for different RAPL Domains. + * @read_raw: Callback for reading RAPL interface specific + * registers. + * @write_raw: Callback for writing RAPL interface specific + * registers. */ struct rapl_if_priv { struct powercap_control_type *control_type; @@ -104,6 +115,8 @@ struct rapl_if_priv { enum cpuhp_state pcap_rapl_online; u32 reg_unit; u32 regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX]; + int (*read_raw)(int cpu, struct reg_action *ra); + int (*write_raw)(int cpu, struct reg_action *ra); };
/* maximum rapl package domain name: package-%d-die-%d */
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 8a00676cd690941c9a18bd390c3b2cade631c516 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8a00676cd690941c9a18bd390c3b2cade631c516 upstream.
Previously, there are three functions: rapl_register_psys(), which registers platform rapl domain. rapl_register_powercap(), which registers powercap control type. rapl_unregsiter_powercap(), which unregisters platform rapl domain and powercap control type.
This is confusing as the function name does not describe what it does clearly.
With this patch, the three functions are removed, and two new functions rapl_register_platform_domain()/rapl_unregister_platform_domain() are introduced instead, and they do exactly what their function name describes.
Plus, as part of the common code, hardcoded MSR accesses in these functions are converted to follow the abstracted register access.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 62 +++++++++++++++++------------------ 1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index eb0b8ffa2452..357ea9657ce1 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -1069,16 +1069,6 @@ static void rapl_update_domain_data(struct rapl_package *rp)
}
-static void rapl_unregister_powercap(void) -{ - if (&rapl_msr_priv.platform_rapl_domain) { - powercap_unregister_zone(rapl_msr_priv.control_type, - &rapl_msr_priv.platform_rapl_domain->power_zone); - kfree(rapl_msr_priv.platform_rapl_domain); - } - powercap_unregister_control_type(rapl_msr_priv.control_type); -} - static int rapl_package_register_powercap(struct rapl_package *rp) { struct rapl_domain *rd; @@ -1148,16 +1138,23 @@ static int rapl_package_register_powercap(struct rapl_package *rp) return ret; }
-static int __init rapl_register_psys(void) +static int __init rapl_add_platform_domain(struct rapl_if_priv *priv) { struct rapl_domain *rd; struct powercap_zone *power_zone; - u64 val; + struct reg_action ra; + int ret;
- if (rdmsrl_safe_on_cpu(0, rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS], &val) || !val) + ra.reg = priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS]; + ra.mask = ~0; + ret = priv->read_raw(0, &ra); + if (ret || !ra.value) return -ENODEV;
- if (rdmsrl_safe_on_cpu(0, rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT], &val) || !val) + ra.reg = priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT]; + ra.mask = ~0; + ret = priv->read_raw(0, &ra); + if (ret || !ra.value) return -ENODEV;
rd = kzalloc(sizeof(*rd), GFP_KERNEL); @@ -1166,15 +1163,15 @@ static int __init rapl_register_psys(void)
rd->name = rapl_domain_names[RAPL_DOMAIN_PLATFORM]; rd->id = RAPL_DOMAIN_PLATFORM; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT]; - rd->regs[RAPL_DOMAIN_REG_STATUS] = rapl_msr_priv.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS]; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT]; + rd->regs[RAPL_DOMAIN_REG_STATUS] = priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS]; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; rd->rpl[1].name = pl2_name; - rd->rp = rapl_find_package_domain(0, &rapl_msr_priv); + rd->rp = rapl_find_package_domain(0, priv);
- power_zone = powercap_register_zone(&rd->power_zone, rapl_msr_priv.control_type, + power_zone = powercap_register_zone(&rd->power_zone, priv->control_type, "psys", NULL, &zone_ops[RAPL_DOMAIN_PLATFORM], 2, &constraint_ops); @@ -1184,19 +1181,18 @@ static int __init rapl_register_psys(void) return PTR_ERR(power_zone); }
- rapl_msr_priv.platform_rapl_domain = rd; + priv->platform_rapl_domain = rd;
return 0; }
-static int __init rapl_register_powercap(void) +static void rapl_remove_platform_domain(struct rapl_if_priv *priv) { - rapl_msr_priv.control_type = powercap_register_control_type(NULL, "intel-rapl", NULL); - if (IS_ERR(rapl_msr_priv.control_type)) { - pr_debug("failed to register powercap control_type.\n"); - return PTR_ERR(rapl_msr_priv.control_type); + if (priv->platform_rapl_domain) { + powercap_unregister_zone(priv->control_type, + &priv->platform_rapl_domain->power_zone); + kfree(priv->platform_rapl_domain); } - return 0; }
static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp) @@ -1543,9 +1539,12 @@ static int __init rapl_init(void)
rapl_msr_priv.read_raw = rapl_msr_read_raw; rapl_msr_priv.write_raw = rapl_msr_write_raw; - ret = rapl_register_powercap(); - if (ret) - return ret; + + rapl_msr_priv.control_type = powercap_register_control_type(NULL, "intel-rapl", NULL); + if (IS_ERR(rapl_msr_priv.control_type)) { + pr_debug("failed to register powercap control_type.\n"); + return PTR_ERR(rapl_msr_priv.control_type); + }
ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powercap/rapl:online", rapl_cpu_online, rapl_cpu_down_prep); @@ -1554,7 +1553,7 @@ static int __init rapl_init(void) rapl_msr_priv.pcap_rapl_online = ret;
/* Don't bail out if PSys is not supported */ - rapl_register_psys(); + rapl_add_platform_domain(&rapl_msr_priv);
ret = register_pm_notifier(&rapl_pm_notifier); if (ret) @@ -1566,7 +1565,7 @@ static int __init rapl_init(void) cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online);
err_unreg: - rapl_unregister_powercap(); + powercap_unregister_control_type(rapl_msr_priv.control_type); return ret; }
@@ -1574,7 +1573,8 @@ static void __exit rapl_exit(void) { unregister_pm_notifier(&rapl_pm_notifier); cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online); - rapl_unregister_powercap(); + rapl_remove_platform_domain(&rapl_msr_priv); + powercap_unregister_control_type(rapl_msr_priv.control_type); }
module_init(rapl_init);
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 1193b1658d16f03cdb2edbac5f2a796ccca225af category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 1193b1658d16f03cdb2edbac5f2a796ccca225af upstream.
There are still some places in the common code that have hardcoded MSR access, convert them to follow the abstracted register access.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl.c | 38 ++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 16 deletions(-)
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c index 357ea9657ce1..b5cad6f9518d 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@ -779,22 +779,24 @@ static int rapl_write_data_raw(struct rapl_domain *rd, */ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) { - u64 msr_val; + struct reg_action ra; u32 value;
- if (rdmsrl_safe_on_cpu(cpu, rp->priv->reg_unit, &msr_val)) { - pr_err("Failed to read power unit MSR 0x%x on CPU %d, exit.\n", + ra.reg = rp->priv->reg_unit; + ra.mask = ~0; + if (rp->priv->read_raw(cpu, &ra)) { + pr_err("Failed to read power unit REG 0x%x on CPU %d, exit.\n", rp->priv->reg_unit, cpu); return -ENODEV; }
- value = (msr_val & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET; + value = (ra.value & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET; rp->energy_unit = ENERGY_UNIT_SCALE * 1000000 / (1 << value);
- value = (msr_val & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET; + value = (ra.value & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET; rp->power_unit = 1000000 / (1 << value);
- value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; + value = (ra.value & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; rp->time_unit = 1000000 / (1 << value);
pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n", @@ -805,21 +807,24 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu)
static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) { - u64 msr_val; + struct reg_action ra; u32 value;
- if (rdmsrl_safe_on_cpu(cpu, rp->priv->reg_unit, &msr_val)) { - pr_err("Failed to read power unit MSR 0x%x on CPU %d, exit.\n", + ra.reg = rp->priv->reg_unit; + ra.mask = ~0; + if (rp->priv->read_raw(cpu, &ra)) { + pr_err("Failed to read power unit REG 0x%x on CPU %d, exit.\n", rp->priv->reg_unit, cpu); return -ENODEV; } - value = (msr_val & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET; + + value = (ra.value & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET; rp->energy_unit = ENERGY_UNIT_SCALE * 1 << value;
- value = (msr_val & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET; + value = (ra.value & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET; rp->power_unit = (1 << value) * 1000;
- value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; + value = (ra.value & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET; rp->time_unit = 1000000 / (1 << value);
pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n", @@ -1197,15 +1202,14 @@ static void rapl_remove_platform_domain(struct rapl_if_priv *priv)
static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp) { - u32 reg; - u64 val = 0; + struct reg_action ra;
switch (domain) { case RAPL_DOMAIN_PACKAGE: case RAPL_DOMAIN_PP0: case RAPL_DOMAIN_PP1: case RAPL_DOMAIN_DRAM: - reg = rp->priv->regs[domain][RAPL_DOMAIN_REG_STATUS]; + ra.reg = rp->priv->regs[domain][RAPL_DOMAIN_REG_STATUS]; break; case RAPL_DOMAIN_PLATFORM: /* PSYS(PLATFORM) is not a CPU domain, so avoid printng error */ @@ -1217,7 +1221,9 @@ static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp) /* make sure domain counters are available and contains non-zero * values, otherwise skip it. */ - if (rdmsrl_safe_on_cpu(cpu, reg, &val) || !val) + + ra.mask = ~0; + if (rp->priv->read_raw(cpu, &ra) || !ra.value) return -ENODEV;
return 0;
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 3382388d714891fc0f575926189f33d22e7c960b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3382388d714891fc0f575926189f33d22e7c960b upstream.
Split intel_rapl.c to intel_rapl_common.c and intel_rapl_msr.c, where intel_rapl_common.c contains the common code that can be used by both MSR and MMIO interface. intel_rapl_msr.c contains the implementation of RAPL MSR interface.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/Kconfig | 11 +- drivers/powercap/Makefile | 4 +- .../{intel_rapl.c => intel_rapl_common.c} | 556 +++++++----------- drivers/powercap/intel_rapl_msr.c | 163 +++++ include/linux/intel_rapl.h | 7 + 5 files changed, 397 insertions(+), 344 deletions(-) rename drivers/powercap/{intel_rapl.c => intel_rapl_common.c} (72%) create mode 100644 drivers/powercap/intel_rapl_msr.c
diff --git a/drivers/powercap/Kconfig b/drivers/powercap/Kconfig index 6ac27e5908f5..8ef2332b5863 100644 --- a/drivers/powercap/Kconfig +++ b/drivers/powercap/Kconfig @@ -15,14 +15,17 @@ menuconfig POWERCAP
if POWERCAP # Client driver configurations go here. +config INTEL_RAPL_CORE + tristate + config INTEL_RAPL - tristate "Intel RAPL Support" + tristate "Intel RAPL Support via MSR Interface" depends on X86 && IOSF_MBI - default n + select INTEL_RAPL_CORE ---help--- This enables support for the Intel Running Average Power Limit (RAPL) - technology which allows power limits to be enforced and monitored on - modern Intel processors (Sandy Bridge and later). + technology via MSR interface, which allows power limits to be enforced + and monitored on modern Intel processors (Sandy Bridge and later).
In RAPL, the platform level settings are divided into domains for fine grained control. These domains include processor package, DRAM diff --git a/drivers/powercap/Makefile b/drivers/powercap/Makefile index 1b328854b36e..7255c94ec61c 100644 --- a/drivers/powercap/Makefile +++ b/drivers/powercap/Makefile @@ -1,3 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0-only obj-$(CONFIG_POWERCAP) += powercap_sys.o -obj-$(CONFIG_INTEL_RAPL) += intel_rapl.o +obj-$(CONFIG_INTEL_RAPL_CORE) += intel_rapl_common.o +obj-$(CONFIG_INTEL_RAPL) += intel_rapl_msr.o obj-$(CONFIG_IDLE_INJECT) += idle_inject.o diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl_common.c similarity index 72% rename from drivers/powercap/intel_rapl.c rename to drivers/powercap/intel_rapl_common.c index b5cad6f9518d..34a82531a7cf 100644 --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl_common.c @@ -1,19 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0-only /* - * Intel Running Average Power Limit (RAPL) Driver - * Copyright (c) 2013, Intel Corporation. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. - * - * You should have received a copy of the GNU General Public License along with - * this program; if not, write to the Free Software Foundation, Inc. - * + * Common code for Intel Running Average Power Limit (RAPL) support. + * Copyright (c) 2019, Intel Corporation. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -30,10 +18,10 @@ #include <linux/cpu.h> #include <linux/powercap.h> #include <linux/suspend.h> +#include <asm/iosf_mbi.h> #include <linux/intel_rapl.h>
-#include <asm/iosf_mbi.h> -#include <asm/processor.h> +#include <linux/processor.h> #include <asm/cpu_device_id.h> #include <asm/intel-family.h>
@@ -74,34 +62,19 @@ #define PP_POLICY_MASK 0x1F
/* Non HW constants */ -#define RAPL_PRIMITIVE_DERIVED BIT(1) /* not from raw data */ +#define RAPL_PRIMITIVE_DERIVED BIT(1) /* not from raw data */ #define RAPL_PRIMITIVE_DUMMY BIT(2)
#define TIME_WINDOW_MAX_MSEC 40000 #define TIME_WINDOW_MIN_MSEC 250 -#define ENERGY_UNIT_SCALE 1000 /* scale from driver unit to powercap unit */ +#define ENERGY_UNIT_SCALE 1000 /* scale from driver unit to powercap unit */ enum unit_type { - ARBITRARY_UNIT, /* no translation */ + ARBITRARY_UNIT, /* no translation */ POWER_UNIT, ENERGY_UNIT, TIME_UNIT, };
-/* private data for RAPL MSR Interface */ -static struct rapl_if_priv rapl_msr_priv = { - .reg_unit = MSR_RAPL_POWER_UNIT, - .regs[RAPL_DOMAIN_PACKAGE] = { - MSR_PKG_POWER_LIMIT, MSR_PKG_ENERGY_STATUS, MSR_PKG_PERF_STATUS, 0, MSR_PKG_POWER_INFO }, - .regs[RAPL_DOMAIN_PP0] = { - MSR_PP0_POWER_LIMIT, MSR_PP0_ENERGY_STATUS, 0, MSR_PP0_POLICY, 0 }, - .regs[RAPL_DOMAIN_PP1] = { - MSR_PP1_POWER_LIMIT, MSR_PP1_ENERGY_STATUS, 0, MSR_PP1_POLICY, 0 }, - .regs[RAPL_DOMAIN_DRAM] = { - MSR_DRAM_POWER_LIMIT, MSR_DRAM_ENERGY_STATUS, MSR_DRAM_PERF_STATUS, 0, MSR_DRAM_POWER_INFO }, - .regs[RAPL_DOMAIN_PLATFORM] = { - MSR_PLATFORM_POWER_LIMIT, MSR_PLATFORM_ENERGY_STATUS, 0, 0, 0}, -}; - /* per domain data, some are optional */ #define NR_RAW_PRIMITIVES (NR_RAPL_PRIMITIVES - 2)
@@ -120,7 +93,7 @@ struct rapl_defaults { int (*check_unit)(struct rapl_package *rp, int cpu); void (*set_floor_freq)(struct rapl_domain *rd, bool mode); u64 (*compute_time_window)(struct rapl_package *rp, u64 val, - bool to_raw); + bool to_raw); unsigned int dram_domain_energy_unit; }; static struct rapl_defaults *rapl_defaults; @@ -155,19 +128,20 @@ struct rapl_primitive_info {
static void rapl_init_domains(struct rapl_package *rp); static int rapl_read_data_raw(struct rapl_domain *rd, - enum rapl_primitives prim, - bool xlate, u64 *data); + enum rapl_primitives prim, + bool xlate, u64 *data); static int rapl_write_data_raw(struct rapl_domain *rd, - enum rapl_primitives prim, - unsigned long long value); + enum rapl_primitives prim, + unsigned long long value); static u64 rapl_unit_xlate(struct rapl_domain *rd, - enum unit_type type, u64 value, - int to_raw); + enum unit_type type, u64 value, int to_raw); static void package_power_limit_irq_save(struct rapl_package *rp); +static int rapl_init_core(void); +static void rapl_remove_core(void);
-static LIST_HEAD(rapl_packages); /* guarded by CPU hotplug lock */ +static LIST_HEAD(rapl_packages); /* guarded by CPU hotplug lock */
-static const char * const rapl_domain_names[] = { +static const char *const rapl_domain_names[] = { "package", "core", "uncore", @@ -175,21 +149,8 @@ static const char * const rapl_domain_names[] = { "psys", };
-/* caller to ensure CPU hotplug lock is held */ -static struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv) -{ - int id = topology_logical_die_id(cpu); - struct rapl_package *rp; - - list_for_each_entry(rp, &rapl_packages, plist) { - if (rp->id == id && rp->priv->control_type == priv->control_type) - return rp; - } - - return NULL; -} - -static int get_energy_counter(struct powercap_zone *power_zone, u64 *energy_raw) +static int get_energy_counter(struct powercap_zone *power_zone, + u64 *energy_raw) { struct rapl_domain *rd; u64 energy_now; @@ -288,50 +249,49 @@ static int get_domain_enable(struct powercap_zone *power_zone, bool *mode) static const struct powercap_zone_ops zone_ops[] = { /* RAPL_DOMAIN_PACKAGE */ { - .get_energy_uj = get_energy_counter, - .get_max_energy_range_uj = get_max_energy_counter, - .release = release_zone, - .set_enable = set_domain_enable, - .get_enable = get_domain_enable, - }, + .get_energy_uj = get_energy_counter, + .get_max_energy_range_uj = get_max_energy_counter, + .release = release_zone, + .set_enable = set_domain_enable, + .get_enable = get_domain_enable, + }, /* RAPL_DOMAIN_PP0 */ { - .get_energy_uj = get_energy_counter, - .get_max_energy_range_uj = get_max_energy_counter, - .release = release_zone, - .set_enable = set_domain_enable, - .get_enable = get_domain_enable, - }, + .get_energy_uj = get_energy_counter, + .get_max_energy_range_uj = get_max_energy_counter, + .release = release_zone, + .set_enable = set_domain_enable, + .get_enable = get_domain_enable, + }, /* RAPL_DOMAIN_PP1 */ { - .get_energy_uj = get_energy_counter, - .get_max_energy_range_uj = get_max_energy_counter, - .release = release_zone, - .set_enable = set_domain_enable, - .get_enable = get_domain_enable, - }, + .get_energy_uj = get_energy_counter, + .get_max_energy_range_uj = get_max_energy_counter, + .release = release_zone, + .set_enable = set_domain_enable, + .get_enable = get_domain_enable, + }, /* RAPL_DOMAIN_DRAM */ { - .get_energy_uj = get_energy_counter, - .get_max_energy_range_uj = get_max_energy_counter, - .release = release_zone, - .set_enable = set_domain_enable, - .get_enable = get_domain_enable, - }, + .get_energy_uj = get_energy_counter, + .get_max_energy_range_uj = get_max_energy_counter, + .release = release_zone, + .set_enable = set_domain_enable, + .get_enable = get_domain_enable, + }, /* RAPL_DOMAIN_PLATFORM */ { - .get_energy_uj = get_energy_counter, - .get_max_energy_range_uj = get_max_energy_counter, - .release = release_zone, - .set_enable = set_domain_enable, - .get_enable = get_domain_enable, - }, + .get_energy_uj = get_energy_counter, + .get_max_energy_range_uj = get_max_energy_counter, + .release = release_zone, + .set_enable = set_domain_enable, + .get_enable = get_domain_enable, + }, };
- /* * Constraint index used by powercap can be different than power limit (PL) - * index in that some PLs maybe missing due to non-existant MSRs. So we + * index in that some PLs maybe missing due to non-existent MSRs. So we * need to convert here by finding the valid PLs only (name populated). */ static int contraint_to_pl(struct rapl_domain *rd, int cid) @@ -350,7 +310,7 @@ static int contraint_to_pl(struct rapl_domain *rd, int cid) }
static int set_power_limit(struct powercap_zone *power_zone, int cid, - u64 power_limit) + u64 power_limit) { struct rapl_domain *rd; struct rapl_package *rp; @@ -368,8 +328,8 @@ static int set_power_limit(struct powercap_zone *power_zone, int cid, rp = rd->rp;
if (rd->state & DOMAIN_STATE_BIOS_LOCKED) { - dev_warn(&power_zone->dev, "%s locked by BIOS, monitoring only\n", - rd->name); + dev_warn(&power_zone->dev, + "%s locked by BIOS, monitoring only\n", rd->name); ret = -EACCES; goto set_exit; } @@ -392,7 +352,7 @@ static int set_power_limit(struct powercap_zone *power_zone, int cid, }
static int get_current_power_limit(struct powercap_zone *power_zone, int cid, - u64 *data) + u64 *data) { struct rapl_domain *rd; u64 val; @@ -431,7 +391,7 @@ static int get_current_power_limit(struct powercap_zone *power_zone, int cid, }
static int set_time_window(struct powercap_zone *power_zone, int cid, - u64 window) + u64 window) { struct rapl_domain *rd; int ret = 0; @@ -461,7 +421,8 @@ static int set_time_window(struct powercap_zone *power_zone, int cid, return ret; }
-static int get_time_window(struct powercap_zone *power_zone, int cid, u64 *data) +static int get_time_window(struct powercap_zone *power_zone, int cid, + u64 *data) { struct rapl_domain *rd; u64 val; @@ -496,7 +457,8 @@ static int get_time_window(struct powercap_zone *power_zone, int cid, u64 *data) return ret; }
-static const char *get_constraint_name(struct powercap_zone *power_zone, int cid) +static const char *get_constraint_name(struct powercap_zone *power_zone, + int cid) { struct rapl_domain *rd; int id; @@ -509,9 +471,7 @@ static const char *get_constraint_name(struct powercap_zone *power_zone, int cid return NULL; }
- -static int get_max_power(struct powercap_zone *power_zone, int id, - u64 *data) +static int get_max_power(struct powercap_zone *power_zone, int id, u64 *data) { struct rapl_domain *rd; u64 val; @@ -559,11 +519,16 @@ static void rapl_init_domains(struct rapl_package *rp) for (i = 0; i < RAPL_DOMAIN_MAX; i++) { unsigned int mask = rp->domain_map & (1 << i);
- rd->regs[RAPL_DOMAIN_REG_LIMIT] = rp->priv->regs[i][RAPL_DOMAIN_REG_LIMIT]; - rd->regs[RAPL_DOMAIN_REG_STATUS] = rp->priv->regs[i][RAPL_DOMAIN_REG_STATUS]; - rd->regs[RAPL_DOMAIN_REG_PERF] = rp->priv->regs[i][RAPL_DOMAIN_REG_PERF]; - rd->regs[RAPL_DOMAIN_REG_POLICY] = rp->priv->regs[i][RAPL_DOMAIN_REG_POLICY]; - rd->regs[RAPL_DOMAIN_REG_INFO] = rp->priv->regs[i][RAPL_DOMAIN_REG_INFO]; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = + rp->priv->regs[i][RAPL_DOMAIN_REG_LIMIT]; + rd->regs[RAPL_DOMAIN_REG_STATUS] = + rp->priv->regs[i][RAPL_DOMAIN_REG_STATUS]; + rd->regs[RAPL_DOMAIN_REG_PERF] = + rp->priv->regs[i][RAPL_DOMAIN_REG_PERF]; + rd->regs[RAPL_DOMAIN_REG_POLICY] = + rp->priv->regs[i][RAPL_DOMAIN_REG_POLICY]; + rd->regs[RAPL_DOMAIN_REG_INFO] = + rp->priv->regs[i][RAPL_DOMAIN_REG_INFO];
switch (mask) { case BIT(RAPL_DOMAIN_PACKAGE): @@ -592,7 +557,7 @@ static void rapl_init_domains(struct rapl_package *rp) rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->domain_energy_unit = - rapl_defaults->dram_domain_energy_unit; + rapl_defaults->dram_domain_energy_unit; if (rd->domain_energy_unit) pr_info("DRAM domain energy unit %dpj\n", rd->domain_energy_unit); @@ -606,7 +571,7 @@ static void rapl_init_domains(struct rapl_package *rp) }
static u64 rapl_unit_xlate(struct rapl_domain *rd, enum unit_type type, - u64 value, int to_raw) + u64 value, int to_raw) { u64 units = 1; struct rapl_package *rp = rd->rp; @@ -643,40 +608,40 @@ static u64 rapl_unit_xlate(struct rapl_domain *rd, enum unit_type type, static struct rapl_primitive_info rpi[] = { /* name, mask, shift, msr index, unit divisor */ PRIMITIVE_INFO_INIT(ENERGY_COUNTER, ENERGY_STATUS_MASK, 0, - RAPL_DOMAIN_REG_STATUS, ENERGY_UNIT, 0), + RAPL_DOMAIN_REG_STATUS, ENERGY_UNIT, 0), PRIMITIVE_INFO_INIT(POWER_LIMIT1, POWER_LIMIT1_MASK, 0, - RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(POWER_LIMIT2, POWER_LIMIT2_MASK, 32, - RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(FW_LOCK, POWER_PP_LOCK, 31, - RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL1_ENABLE, POWER_LIMIT1_ENABLE, 15, - RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL1_CLAMP, POWER_LIMIT1_CLAMP, 16, - RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL2_ENABLE, POWER_LIMIT2_ENABLE, 47, - RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL2_CLAMP, POWER_LIMIT2_CLAMP, 48, - RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(TIME_WINDOW1, TIME_WINDOW1_MASK, 17, - RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(TIME_WINDOW2, TIME_WINDOW2_MASK, 49, - RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0), + RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(THERMAL_SPEC_POWER, POWER_INFO_THERMAL_SPEC_MASK, - 0, RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), + 0, RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(MAX_POWER, POWER_INFO_MAX_MASK, 32, - RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), + RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(MIN_POWER, POWER_INFO_MIN_MASK, 16, - RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), + RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(MAX_TIME_WINDOW, POWER_INFO_MAX_TIME_WIN_MASK, 48, - RAPL_DOMAIN_REG_INFO, TIME_UNIT, 0), + RAPL_DOMAIN_REG_INFO, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(THROTTLED_TIME, PERF_STATUS_THROTTLE_TIME_MASK, 0, - RAPL_DOMAIN_REG_PERF, TIME_UNIT, 0), + RAPL_DOMAIN_REG_PERF, TIME_UNIT, 0), PRIMITIVE_INFO_INIT(PRIORITY_LEVEL, PP_POLICY_MASK, 0, - RAPL_DOMAIN_REG_POLICY, ARBITRARY_UNIT, 0), + RAPL_DOMAIN_REG_POLICY, ARBITRARY_UNIT, 0), /* non-hardware */ PRIMITIVE_INFO_INIT(AVERAGE_POWER, 0, 0, 0, POWER_UNIT, - RAPL_PRIMITIVE_DERIVED), + RAPL_PRIMITIVE_DERIVED), {NULL, 0, 0, 0}, };
@@ -694,8 +659,7 @@ static struct rapl_primitive_info rpi[] = { * 63-------------------------- 31--------------------------- 0 */ static int rapl_read_data_raw(struct rapl_domain *rd, - enum rapl_primitives prim, - bool xlate, u64 *data) + enum rapl_primitives prim, bool xlate, u64 *data) { u64 value; struct rapl_primitive_info *rp = &rpi[prim]; @@ -711,7 +675,7 @@ static int rapl_read_data_raw(struct rapl_domain *rd,
cpu = rd->rp->lead_cpu;
- /* special-case package domain, which uses a different bit*/ + /* special-case package domain, which uses a different bit */ if (prim == FW_LOCK && rd->id == RAPL_DOMAIN_PACKAGE) { rp->mask = POWER_PACKAGE_LOCK; rp->shift = 63; @@ -741,8 +705,8 @@ static int rapl_read_data_raw(struct rapl_domain *rd,
/* Similar use of primitive info in the read counterpart */ static int rapl_write_data_raw(struct rapl_domain *rd, - enum rapl_primitives prim, - unsigned long long value) + enum rapl_primitives prim, + unsigned long long value) { struct rapl_primitive_info *rp = &rpi[prim]; int cpu; @@ -786,7 +750,7 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) ra.mask = ~0; if (rp->priv->read_raw(cpu, &ra)) { pr_err("Failed to read power unit REG 0x%x on CPU %d, exit.\n", - rp->priv->reg_unit, cpu); + rp->priv->reg_unit, cpu); return -ENODEV; }
@@ -800,7 +764,7 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) rp->time_unit = 1000000 / (1 << value);
pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n", - rp->name, rp->energy_unit, rp->time_unit, rp->power_unit); + rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
return 0; } @@ -814,7 +778,7 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) ra.mask = ~0; if (rp->priv->read_raw(cpu, &ra)) { pr_err("Failed to read power unit REG 0x%x on CPU %d, exit.\n", - rp->priv->reg_unit, cpu); + rp->priv->reg_unit, cpu); return -ENODEV; }
@@ -828,7 +792,7 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) rp->time_unit = 1000000 / (1 << value);
pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n", - rp->name, rp->energy_unit, rp->time_unit, rp->power_unit); + rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
return 0; } @@ -848,7 +812,6 @@ static void power_limit_irq_save_cpu(void *info) wrmsr_safe(MSR_IA32_PACKAGE_THERM_INTERRUPT, l, h); }
- /* REVISIT: * When package power limit is set artificially low by RAPL, LVT * thermal interrupt for package power limit should be ignored @@ -932,9 +895,9 @@ static void set_floor_freq_atom(struct rapl_domain *rd, bool enable) }
static u64 rapl_compute_time_window_core(struct rapl_package *rp, u64 value, - bool to_raw) + bool to_raw) { - u64 f, y; /* fraction and exp. used for time unit */ + u64 f, y; /* fraction and exp. used for time unit */
/* * Special processing based on 2^Y*(1+F/4), refer @@ -954,7 +917,7 @@ static u64 rapl_compute_time_window_core(struct rapl_package *rp, u64 value, }
static u64 rapl_compute_time_window_atom(struct rapl_package *rp, u64 value, - bool to_raw) + bool to_raw) { /* * Atom time unit encoding is straight forward val * time_unit, @@ -962,8 +925,8 @@ static u64 rapl_compute_time_window_atom(struct rapl_package *rp, u64 value, */ if (!to_raw) return (value) ? value *= rp->time_unit : rp->time_unit; - else - value = div64_u64(value, rp->time_unit); + + value = div64_u64(value, rp->time_unit);
return value; } @@ -1010,49 +973,45 @@ static const struct rapl_defaults rapl_defaults_cht = { .compute_time_window = rapl_compute_time_window_atom, };
-#define RAPL_CPU(_model, _ops) { \ - .vendor = X86_VENDOR_INTEL, \ - .family = 6, \ - .model = _model, \ - .driver_data = (kernel_ulong_t)&_ops, \ - } - static const struct x86_cpu_id rapl_ids[] __initconst = { - RAPL_CPU(INTEL_FAM6_SANDYBRIDGE, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_SANDYBRIDGE_X, rapl_defaults_core), - - RAPL_CPU(INTEL_FAM6_IVYBRIDGE, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_IVYBRIDGE_X, rapl_defaults_core), - - RAPL_CPU(INTEL_FAM6_HASWELL_CORE, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_HASWELL_ULT, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_HASWELL_GT3E, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_HASWELL_X, rapl_defaults_hsw_server), - - RAPL_CPU(INTEL_FAM6_BROADWELL_CORE, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_BROADWELL_GT3E, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_BROADWELL_XEON_D, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_BROADWELL_X, rapl_defaults_hsw_server), - - RAPL_CPU(INTEL_FAM6_SKYLAKE_DESKTOP, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_SKYLAKE_MOBILE, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_SKYLAKE_X, rapl_defaults_hsw_server), - RAPL_CPU(INTEL_FAM6_KABYLAKE_MOBILE, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_KABYLAKE_DESKTOP, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_CANNONLAKE_MOBILE, rapl_defaults_core), - - RAPL_CPU(INTEL_FAM6_ATOM_SILVERMONT, rapl_defaults_byt), - RAPL_CPU(INTEL_FAM6_ATOM_AIRMONT, rapl_defaults_cht), - RAPL_CPU(INTEL_FAM6_ATOM_SILVERMONT_MID, rapl_defaults_tng), - RAPL_CPU(INTEL_FAM6_ATOM_AIRMONT_MID, rapl_defaults_ann), - RAPL_CPU(INTEL_FAM6_ATOM_GOLDMONT, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS, rapl_defaults_core), - RAPL_CPU(INTEL_FAM6_ATOM_GOLDMONT_X, rapl_defaults_core), - - RAPL_CPU(INTEL_FAM6_XEON_PHI_KNL, rapl_defaults_hsw_server), - RAPL_CPU(INTEL_FAM6_XEON_PHI_KNM, rapl_defaults_hsw_server), + INTEL_CPU_FAM6(SANDYBRIDGE, rapl_defaults_core), + INTEL_CPU_FAM6(SANDYBRIDGE_X, rapl_defaults_core), + + INTEL_CPU_FAM6(IVYBRIDGE, rapl_defaults_core), + INTEL_CPU_FAM6(IVYBRIDGE_X, rapl_defaults_core), + + INTEL_CPU_FAM6(HASWELL_CORE, rapl_defaults_core), + INTEL_CPU_FAM6(HASWELL_ULT, rapl_defaults_core), + INTEL_CPU_FAM6(HASWELL_GT3E, rapl_defaults_core), + INTEL_CPU_FAM6(HASWELL_X, rapl_defaults_hsw_server), + + INTEL_CPU_FAM6(BROADWELL_CORE, rapl_defaults_core), + INTEL_CPU_FAM6(BROADWELL_GT3E, rapl_defaults_core), + INTEL_CPU_FAM6(BROADWELL_XEON_D, rapl_defaults_core), + INTEL_CPU_FAM6(BROADWELL_X, rapl_defaults_hsw_server), + + INTEL_CPU_FAM6(SKYLAKE_DESKTOP, rapl_defaults_core), + INTEL_CPU_FAM6(SKYLAKE_MOBILE, rapl_defaults_core), + INTEL_CPU_FAM6(SKYLAKE_X, rapl_defaults_hsw_server), + INTEL_CPU_FAM6(KABYLAKE_MOBILE, rapl_defaults_core), + INTEL_CPU_FAM6(KABYLAKE_DESKTOP, rapl_defaults_core), + INTEL_CPU_FAM6(CANNONLAKE_MOBILE, rapl_defaults_core), + INTEL_CPU_FAM6(ICELAKE_MOBILE, rapl_defaults_core), + + INTEL_CPU_FAM6(ATOM_SILVERMONT, rapl_defaults_byt), + INTEL_CPU_FAM6(ATOM_AIRMONT, rapl_defaults_cht), + INTEL_CPU_FAM6(ATOM_SILVERMONT_MID, rapl_defaults_tng), + INTEL_CPU_FAM6(ATOM_AIRMONT_MID, rapl_defaults_ann), + INTEL_CPU_FAM6(ATOM_GOLDMONT, rapl_defaults_core), + INTEL_CPU_FAM6(ATOM_GOLDMONT_PLUS, rapl_defaults_core), + INTEL_CPU_FAM6(ATOM_GOLDMONT_X, rapl_defaults_core), + INTEL_CPU_FAM6(ATOM_TREMONT_X, rapl_defaults_core), + + INTEL_CPU_FAM6(XEON_PHI_KNL, rapl_defaults_hsw_server), + INTEL_CPU_FAM6(XEON_PHI_KNM, rapl_defaults_hsw_server), {} }; + MODULE_DEVICE_TABLE(x86cpu, rapl_ids);
/* Read once for all raw primitive data for domains */ @@ -1068,7 +1027,7 @@ static void rapl_update_domain_data(struct rapl_package *rp) for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) { if (!rapl_read_data_raw(&rp->domains[dmn], prim, rpi[prim].unit, &val)) - rp->domains[dmn].rdd.primitives[prim] = val; + rp->domains[dmn].rdd.primitives[prim] = val; } }
@@ -1083,20 +1042,18 @@ static int rapl_package_register_powercap(struct rapl_package *rp) /* Update the domain data of the new package */ rapl_update_domain_data(rp);
- /* first we register package domain as the parent zone*/ + /* first we register package domain as the parent zone */ for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) { if (rd->id == RAPL_DOMAIN_PACKAGE) { nr_pl = find_nr_power_limit(rd); pr_debug("register package domain %s\n", rp->name); power_zone = powercap_register_zone(&rd->power_zone, - rp->priv->control_type, - rp->name, NULL, - &zone_ops[rd->id], - nr_pl, - &constraint_ops); + rp->priv->control_type, rp->name, + NULL, &zone_ops[rd->id], nr_pl, + &constraint_ops); if (IS_ERR(power_zone)) { pr_debug("failed to register power zone %s\n", - rp->name); + rp->name); return PTR_ERR(power_zone); } /* track parent zone in per package/socket data */ @@ -1109,21 +1066,21 @@ static int rapl_package_register_powercap(struct rapl_package *rp) pr_err("no package domain found, unknown topology!\n"); return -ENODEV; } - /* now register domains as children of the socket/package*/ + /* now register domains as children of the socket/package */ for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) { if (rd->id == RAPL_DOMAIN_PACKAGE) continue; /* number of power limits per domain varies */ nr_pl = find_nr_power_limit(rd); power_zone = powercap_register_zone(&rd->power_zone, - rp->priv->control_type, rd->name, - rp->power_zone, - &zone_ops[rd->id], nr_pl, - &constraint_ops); + rp->priv->control_type, + rd->name, rp->power_zone, + &zone_ops[rd->id], nr_pl, + &constraint_ops);
if (IS_ERR(power_zone)) { pr_debug("failed to register power_zone, %s:%s\n", - rp->name, rd->name); + rp->name, rd->name); ret = PTR_ERR(power_zone); goto err_cleanup; } @@ -1137,13 +1094,14 @@ static int rapl_package_register_powercap(struct rapl_package *rp) */ while (--rd >= rp->domains) { pr_debug("unregister %s domain %s\n", rp->name, rd->name); - powercap_unregister_zone(rp->priv->control_type, &rd->power_zone); + powercap_unregister_zone(rp->priv->control_type, + &rd->power_zone); }
return ret; }
-static int __init rapl_add_platform_domain(struct rapl_if_priv *priv) +int rapl_add_platform_domain(struct rapl_if_priv *priv) { struct rapl_domain *rd; struct powercap_zone *power_zone; @@ -1168,8 +1126,10 @@ static int __init rapl_add_platform_domain(struct rapl_if_priv *priv)
rd->name = rapl_domain_names[RAPL_DOMAIN_PLATFORM]; rd->id = RAPL_DOMAIN_PLATFORM; - rd->regs[RAPL_DOMAIN_REG_LIMIT] = priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT]; - rd->regs[RAPL_DOMAIN_REG_STATUS] = priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS]; + rd->regs[RAPL_DOMAIN_REG_LIMIT] = + priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_LIMIT]; + rd->regs[RAPL_DOMAIN_REG_STATUS] = + priv->regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS]; rd->rpl[0].prim_id = PL1_ENABLE; rd->rpl[0].name = pl1_name; rd->rpl[1].prim_id = PL2_ENABLE; @@ -1190,15 +1150,17 @@ static int __init rapl_add_platform_domain(struct rapl_if_priv *priv)
return 0; } +EXPORT_SYMBOL_GPL(rapl_add_platform_domain);
-static void rapl_remove_platform_domain(struct rapl_if_priv *priv) +void rapl_remove_platform_domain(struct rapl_if_priv *priv) { if (priv->platform_rapl_domain) { powercap_unregister_zone(priv->control_type, - &priv->platform_rapl_domain->power_zone); + &priv->platform_rapl_domain->power_zone); kfree(priv->platform_rapl_domain); } } +EXPORT_SYMBOL_GPL(rapl_remove_platform_domain);
static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp) { @@ -1229,13 +1191,12 @@ static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp) return 0; }
- /* * Check if power limits are available. Two cases when they are not available: * 1. Locked by BIOS, in this case we still provide read-only access so that * users can see what limit is set by the BIOS. * 2. Some CPUs make some domains monitoring only which means PLx MSRs may not - * exist at all. In this case, we do not show the contraints in powercap. + * exist at all. In this case, we do not show the constraints in powercap. * * Called after domains are detected and initialized. */ @@ -1252,9 +1213,10 @@ static void rapl_detect_powerlimit(struct rapl_domain *rd) rd->state |= DOMAIN_STATE_BIOS_LOCKED; } } - /* check if power limit MSRs exists, otherwise domain is monitoring only */ + /* check if power limit MSR exists, otherwise domain is monitoring only */ for (i = 0; i < NR_POWER_LIMITS; i++) { int prim = rd->rpl[i].prim_id; + if (rapl_read_data_raw(rd, prim, false, &val64)) rd->rpl[i].name = NULL; } @@ -1275,7 +1237,7 @@ static int rapl_detect_domains(struct rapl_package *rp, int cpu) pr_info("Found RAPL domain %s\n", rapl_domain_names[i]); } } - rp->nr_domains = bitmap_weight(&rp->domain_map, RAPL_DOMAIN_MAX); + rp->nr_domains = bitmap_weight(&rp->domain_map, RAPL_DOMAIN_MAX); if (!rp->nr_domains) { pr_debug("no valid rapl domains found in %s\n", rp->name); return -ENODEV; @@ -1283,7 +1245,7 @@ static int rapl_detect_domains(struct rapl_package *rp, int cpu) pr_debug("found %d domains on %s\n", rp->nr_domains, rp->name);
rp->domains = kcalloc(rp->nr_domains + 1, sizeof(struct rapl_domain), - GFP_KERNEL); + GFP_KERNEL); if (!rp->domains) return -ENOMEM;
@@ -1296,7 +1258,7 @@ static int rapl_detect_domains(struct rapl_package *rp, int cpu) }
/* called from CPU hotplug notifier, hotplug lock held */ -static void rapl_remove_package(struct rapl_package *rp) +void rapl_remove_package(struct rapl_package *rp) { struct rapl_domain *rd, *rd_package = NULL;
@@ -1315,22 +1277,47 @@ static void rapl_remove_package(struct rapl_package *rp) } pr_debug("remove package, undo power limit on %s: %s\n", rp->name, rd->name); - powercap_unregister_zone(rp->priv->control_type, &rd->power_zone); + powercap_unregister_zone(rp->priv->control_type, + &rd->power_zone); } /* do parent zone last */ - powercap_unregister_zone(rp->priv->control_type, &rd_package->power_zone); + powercap_unregister_zone(rp->priv->control_type, + &rd_package->power_zone); list_del(&rp->plist); + if (list_empty(&rapl_packages)) + rapl_remove_core(); kfree(rp); } +EXPORT_SYMBOL_GPL(rapl_remove_package); + +/* caller to ensure CPU hotplug lock is held */ +struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv) +{ + int id = topology_logical_die_id(cpu); + struct rapl_package *rp; + + list_for_each_entry(rp, &rapl_packages, plist) { + if (rp->id == id + && rp->priv->control_type == priv->control_type) + return rp; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(rapl_find_package_domain);
/* called from CPU hotplug notifier, hotplug lock held */ -static struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv) +struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv) { int id = topology_logical_die_id(cpu); struct rapl_package *rp; struct cpuinfo_x86 *c = &cpu_data(cpu); int ret;
+ ret = rapl_init_core(); + if (ret) + return ERR_PTR(ret); + rp = kzalloc(sizeof(struct rapl_package), GFP_KERNEL); if (!rp) return ERR_PTR(-ENOMEM); @@ -1342,14 +1329,13 @@ static struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv)
if (topology_max_die_per_package() > 1) snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, - "package-%d-die-%d", c->phys_proc_id, c->cpu_die_id); + "package-%d-die-%d", c->phys_proc_id, c->cpu_die_id); else snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, "package-%d", - c->phys_proc_id); + c->phys_proc_id);
/* check if the package contains valid domains */ - if (rapl_detect_domains(rp, cpu) || - rapl_defaults->check_unit(rp, cpu)) { + if (rapl_detect_domains(rp, cpu) || rapl_defaults->check_unit(rp, cpu)) { ret = -ENODEV; goto err_free_package; } @@ -1365,45 +1351,7 @@ static struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv) kfree(rp); return ERR_PTR(ret); } - -/* Handles CPU hotplug on multi-socket systems. - * If a CPU goes online as the first CPU of the physical package - * we add the RAPL package to the system. Similarly, when the last - * CPU of the package is removed, we remove the RAPL package and its - * associated domains. Cooling devices are handled accordingly at - * per-domain level. - */ -static int rapl_cpu_online(unsigned int cpu) -{ - struct rapl_package *rp; - - rp = rapl_find_package_domain(cpu, &rapl_msr_priv); - if (!rp) { - rp = rapl_add_package(cpu, &rapl_msr_priv); - if (IS_ERR(rp)) - return PTR_ERR(rp); - } - cpumask_set_cpu(cpu, &rp->cpumask); - return 0; -} - -static int rapl_cpu_down_prep(unsigned int cpu) -{ - struct rapl_package *rp; - int lead_cpu; - - rp = rapl_find_package_domain(cpu, &rapl_msr_priv); - if (!rp) - return 0; - - cpumask_clear_cpu(cpu, &rp->cpumask); - lead_cpu = cpumask_first(&rp->cpumask); - if (lead_cpu >= nr_cpu_ids) - rapl_remove_package(rp); - else if (rp->lead_cpu == cpu) - rp->lead_cpu = lead_cpu; - return 0; -} +EXPORT_SYMBOL_GPL(rapl_add_package);
static void power_limit_state_save(void) { @@ -1421,17 +1369,15 @@ static void power_limit_state_save(void) switch (rd->rpl[i].prim_id) { case PL1_ENABLE: ret = rapl_read_data_raw(rd, - POWER_LIMIT1, - true, - &rd->rpl[i].last_power_limit); + POWER_LIMIT1, true, + &rd->rpl[i].last_power_limit); if (ret) rd->rpl[i].last_power_limit = 0; break; case PL2_ENABLE: ret = rapl_read_data_raw(rd, - POWER_LIMIT2, - true, - &rd->rpl[i].last_power_limit); + POWER_LIMIT2, true, + &rd->rpl[i].last_power_limit); if (ret) rd->rpl[i].last_power_limit = 0; break; @@ -1457,15 +1403,13 @@ static void power_limit_state_restore(void) switch (rd->rpl[i].prim_id) { case PL1_ENABLE: if (rd->rpl[i].last_power_limit) - rapl_write_data_raw(rd, - POWER_LIMIT1, - rd->rpl[i].last_power_limit); + rapl_write_data_raw(rd, POWER_LIMIT1, + rd->rpl[i].last_power_limit); break; case PL2_ENABLE: if (rd->rpl[i].last_power_limit) - rapl_write_data_raw(rd, - POWER_LIMIT2, - rd->rpl[i].last_power_limit); + rapl_write_data_raw(rd, POWER_LIMIT2, + rd->rpl[i].last_power_limit); break; } } @@ -1474,7 +1418,7 @@ static void power_limit_state_restore(void) }
static int rapl_pm_callback(struct notifier_block *nb, - unsigned long mode, void *_unused) + unsigned long mode, void *_unused) { switch (mode) { case PM_SUSPEND_PREPARE: @@ -1491,101 +1435,35 @@ static struct notifier_block rapl_pm_notifier = { .notifier_call = rapl_pm_callback, };
-static int rapl_msr_read_raw(int cpu, struct reg_action *ra) -{ - if (rdmsrl_safe_on_cpu(cpu, ra->reg, &ra->value)) { - pr_debug("failed to read msr 0x%x on cpu %d\n", ra->reg, cpu); - return -EIO; - } - ra->value &= ra->mask; - return 0; -} - -static void rapl_msr_update_func(void *info) -{ - struct reg_action *ra = info; - u64 val; - - ra->err = rdmsrl_safe(ra->reg, &val); - if (ra->err) - return; - - val &= ~ra->mask; - val |= ra->value; - - ra->err = wrmsrl_safe(ra->reg, val); -} - - -static int rapl_msr_write_raw(int cpu, struct reg_action *ra) -{ - int ret; - - ret = smp_call_function_single(cpu, rapl_msr_update_func, ra, 1); - if (WARN_ON_ONCE(ret)) - return ret; - - return ra->err; -} - -static int __init rapl_init(void) +static int rapl_init_core(void) { const struct x86_cpu_id *id; int ret;
+ if (rapl_defaults) + return 0; + id = x86_match_cpu(rapl_ids); if (!id) { pr_err("driver does not support CPU family %d model %d\n", - boot_cpu_data.x86, boot_cpu_data.x86_model); + boot_cpu_data.x86, boot_cpu_data.x86_model);
return -ENODEV; }
rapl_defaults = (struct rapl_defaults *)id->driver_data;
- rapl_msr_priv.read_raw = rapl_msr_read_raw; - rapl_msr_priv.write_raw = rapl_msr_write_raw; - - rapl_msr_priv.control_type = powercap_register_control_type(NULL, "intel-rapl", NULL); - if (IS_ERR(rapl_msr_priv.control_type)) { - pr_debug("failed to register powercap control_type.\n"); - return PTR_ERR(rapl_msr_priv.control_type); - } - - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powercap/rapl:online", - rapl_cpu_online, rapl_cpu_down_prep); - if (ret < 0) - goto err_unreg; - rapl_msr_priv.pcap_rapl_online = ret; - - /* Don't bail out if PSys is not supported */ - rapl_add_platform_domain(&rapl_msr_priv); - ret = register_pm_notifier(&rapl_pm_notifier); - if (ret) - goto err_unreg_all;
return 0; - -err_unreg_all: - cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online); - -err_unreg: - powercap_unregister_control_type(rapl_msr_priv.control_type); - return ret; }
-static void __exit rapl_exit(void) +static void rapl_remove_core(void) { unregister_pm_notifier(&rapl_pm_notifier); - cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online); - rapl_remove_platform_domain(&rapl_msr_priv); - powercap_unregister_control_type(rapl_msr_priv.control_type); + rapl_defaults = NULL; }
-module_init(rapl_init); -module_exit(rapl_exit); - -MODULE_DESCRIPTION("Driver for Intel RAPL (Running Average Power Limit)"); +MODULE_DESCRIPTION("Intel Runtime Average Power Limit (RAPL) common code"); MODULE_AUTHOR("Jacob Pan jacob.jun.pan@intel.com"); MODULE_LICENSE("GPL v2"); diff --git a/drivers/powercap/intel_rapl_msr.c b/drivers/powercap/intel_rapl_msr.c new file mode 100644 index 000000000000..89645222e3e0 --- /dev/null +++ b/drivers/powercap/intel_rapl_msr.c @@ -0,0 +1,163 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Intel Running Average Power Limit (RAPL) Driver via MSR interface + * Copyright (c) 2019, Intel Corporation. + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/list.h> +#include <linux/types.h> +#include <linux/device.h> +#include <linux/slab.h> +#include <linux/log2.h> +#include <linux/bitmap.h> +#include <linux/delay.h> +#include <linux/sysfs.h> +#include <linux/cpu.h> +#include <linux/powercap.h> +#include <linux/suspend.h> +#include <linux/intel_rapl.h> +#include <linux/processor.h> + +#include <asm/iosf_mbi.h> +#include <asm/cpu_device_id.h> +#include <asm/intel-family.h> + +/* Local defines */ +#define MSR_PLATFORM_POWER_LIMIT 0x0000065C + +/* private data for RAPL MSR Interface */ +static struct rapl_if_priv rapl_msr_priv = { + .reg_unit = MSR_RAPL_POWER_UNIT, + .regs[RAPL_DOMAIN_PACKAGE] = { + MSR_PKG_POWER_LIMIT, MSR_PKG_ENERGY_STATUS, MSR_PKG_PERF_STATUS, 0, MSR_PKG_POWER_INFO }, + .regs[RAPL_DOMAIN_PP0] = { + MSR_PP0_POWER_LIMIT, MSR_PP0_ENERGY_STATUS, 0, MSR_PP0_POLICY, 0 }, + .regs[RAPL_DOMAIN_PP1] = { + MSR_PP1_POWER_LIMIT, MSR_PP1_ENERGY_STATUS, 0, MSR_PP1_POLICY, 0 }, + .regs[RAPL_DOMAIN_DRAM] = { + MSR_DRAM_POWER_LIMIT, MSR_DRAM_ENERGY_STATUS, MSR_DRAM_PERF_STATUS, 0, MSR_DRAM_POWER_INFO }, + .regs[RAPL_DOMAIN_PLATFORM] = { + MSR_PLATFORM_POWER_LIMIT, MSR_PLATFORM_ENERGY_STATUS, 0, 0, 0}, +}; + +/* Handles CPU hotplug on multi-socket systems. + * If a CPU goes online as the first CPU of the physical package + * we add the RAPL package to the system. Similarly, when the last + * CPU of the package is removed, we remove the RAPL package and its + * associated domains. Cooling devices are handled accordingly at + * per-domain level. + */ +static int rapl_cpu_online(unsigned int cpu) +{ + struct rapl_package *rp; + + rp = rapl_find_package_domain(cpu, &rapl_msr_priv); + if (!rp) { + rp = rapl_add_package(cpu, &rapl_msr_priv); + if (IS_ERR(rp)) + return PTR_ERR(rp); + } + cpumask_set_cpu(cpu, &rp->cpumask); + return 0; +} + +static int rapl_cpu_down_prep(unsigned int cpu) +{ + struct rapl_package *rp; + int lead_cpu; + + rp = rapl_find_package_domain(cpu, &rapl_msr_priv); + if (!rp) + return 0; + + cpumask_clear_cpu(cpu, &rp->cpumask); + lead_cpu = cpumask_first(&rp->cpumask); + if (lead_cpu >= nr_cpu_ids) + rapl_remove_package(rp); + else if (rp->lead_cpu == cpu) + rp->lead_cpu = lead_cpu; + return 0; +} + +static int rapl_msr_read_raw(int cpu, struct reg_action *ra) +{ + if (rdmsrl_safe_on_cpu(cpu, ra->reg, &ra->value)) { + pr_debug("failed to read msr 0x%x on cpu %d\n", ra->reg, cpu); + return -EIO; + } + ra->value &= ra->mask; + return 0; +} + +static void rapl_msr_update_func(void *info) +{ + struct reg_action *ra = info; + u64 val; + + ra->err = rdmsrl_safe(ra->reg, &val); + if (ra->err) + return; + + val &= ~ra->mask; + val |= ra->value; + + ra->err = wrmsrl_safe(ra->reg, val); +} + +static int rapl_msr_write_raw(int cpu, struct reg_action *ra) +{ + int ret; + + ret = smp_call_function_single(cpu, rapl_msr_update_func, ra, 1); + if (WARN_ON_ONCE(ret)) + return ret; + + return ra->err; +} + +static int __init rapl_msr_init(void) +{ + int ret; + + rapl_msr_priv.read_raw = rapl_msr_read_raw; + rapl_msr_priv.write_raw = rapl_msr_write_raw; + + rapl_msr_priv.control_type = powercap_register_control_type(NULL, "intel-rapl", NULL); + if (IS_ERR(rapl_msr_priv.control_type)) { + pr_debug("failed to register powercap control_type.\n"); + return PTR_ERR(rapl_msr_priv.control_type); + } + + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powercap/rapl:online", + rapl_cpu_online, rapl_cpu_down_prep); + if (ret < 0) + goto out; + rapl_msr_priv.pcap_rapl_online = ret; + + /* Don't bail out if PSys is not supported */ + rapl_add_platform_domain(&rapl_msr_priv); + + return 0; + +out: + if (ret) + powercap_unregister_control_type(rapl_msr_priv.control_type); + return ret; +} + +static void __exit rapl_msr_exit(void) +{ + cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online); + rapl_remove_platform_domain(&rapl_msr_priv); + powercap_unregister_control_type(rapl_msr_priv.control_type); +} + +module_init(rapl_msr_init); +module_exit(rapl_msr_exit); + +MODULE_DESCRIPTION("Driver for Intel RAPL (Running Average Power Limit) control via MSR interface"); +MODULE_AUTHOR("Zhang Rui rui.zhang@intel.com"); +MODULE_LICENSE("GPL v2"); diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index ff215d64d114..9579f458fe4d 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -142,4 +142,11 @@ struct rapl_package { struct rapl_if_priv *priv; };
+struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv); +struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv); +void rapl_remove_package(struct rapl_package *rp); + +int rapl_add_platform_domain(struct rapl_if_priv *priv); +void rapl_remove_platform_domain(struct rapl_if_priv *priv); + #endif /* __INTEL_RAPL_H__ */
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit d978e755aabe215cb67bf713e103ed3916ec306d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d978e755aabe215cb67bf713e103ed3916ec306d upstream.
RAPL MMIO interface uses 64 bit registers, thus force use 64 bit register for all the RAPL code.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 6 +++--- drivers/powercap/intel_rapl_msr.c | 11 +++++++---- include/linux/intel_rapl.h | 8 ++++---- 3 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index 34a82531a7cf..8e4de036f6d0 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -689,7 +689,7 @@ static int rapl_read_data_raw(struct rapl_domain *rd, ra.mask = rp->mask;
if (rd->rp->priv->read_raw(cpu, &ra)) { - pr_debug("failed to read reg 0x%x on cpu %d\n", ra.reg, cpu); + pr_debug("failed to read reg 0x%llx on cpu %d\n", ra.reg, cpu); return -EIO; }
@@ -749,7 +749,7 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu) ra.reg = rp->priv->reg_unit; ra.mask = ~0; if (rp->priv->read_raw(cpu, &ra)) { - pr_err("Failed to read power unit REG 0x%x on CPU %d, exit.\n", + pr_err("Failed to read power unit REG 0x%llx on CPU %d, exit.\n", rp->priv->reg_unit, cpu); return -ENODEV; } @@ -777,7 +777,7 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu) ra.reg = rp->priv->reg_unit; ra.mask = ~0; if (rp->priv->read_raw(cpu, &ra)) { - pr_err("Failed to read power unit REG 0x%x on CPU %d, exit.\n", + pr_err("Failed to read power unit REG 0x%llx on CPU %d, exit.\n", rp->priv->reg_unit, cpu); return -ENODEV; } diff --git a/drivers/powercap/intel_rapl_msr.c b/drivers/powercap/intel_rapl_msr.c index 89645222e3e0..6cd8a8fb9238 100644 --- a/drivers/powercap/intel_rapl_msr.c +++ b/drivers/powercap/intel_rapl_msr.c @@ -84,8 +84,10 @@ static int rapl_cpu_down_prep(unsigned int cpu)
static int rapl_msr_read_raw(int cpu, struct reg_action *ra) { - if (rdmsrl_safe_on_cpu(cpu, ra->reg, &ra->value)) { - pr_debug("failed to read msr 0x%x on cpu %d\n", ra->reg, cpu); + u32 msr = (u32)ra->reg; + + if (rdmsrl_safe_on_cpu(cpu, msr, &ra->value)) { + pr_debug("failed to read msr 0x%x on cpu %d\n", msr, cpu); return -EIO; } ra->value &= ra->mask; @@ -95,16 +97,17 @@ static int rapl_msr_read_raw(int cpu, struct reg_action *ra) static void rapl_msr_update_func(void *info) { struct reg_action *ra = info; + u32 msr = (u32)ra->reg; u64 val;
- ra->err = rdmsrl_safe(ra->reg, &val); + ra->err = rdmsrl_safe(msr, &val); if (ra->err) return;
val &= ~ra->mask; val |= ra->value;
- ra->err = wrmsrl_safe(ra->reg, val); + ra->err = wrmsrl_safe(msr, val); }
static int rapl_msr_write_raw(int cpu, struct reg_action *ra) diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index 9579f458fe4d..649e19981eb0 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -78,7 +78,7 @@ struct rapl_package; struct rapl_domain { const char *name; enum rapl_domain_type id; - int regs[RAPL_DOMAIN_REG_MAX]; + u64 regs[RAPL_DOMAIN_REG_MAX]; struct powercap_zone power_zone; struct rapl_domain_data rdd; struct rapl_power_limit rpl[NR_POWER_LIMITS]; @@ -89,7 +89,7 @@ struct rapl_domain { };
struct reg_action { - u32 reg; + u64 reg; u64 mask; u64 value; int err; @@ -113,8 +113,8 @@ struct rapl_if_priv { struct powercap_control_type *control_type; struct rapl_domain *platform_rapl_domain; enum cpuhp_state pcap_rapl_online; - u32 reg_unit; - u32 regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX]; + u64 reg_unit; + u64 regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX]; int (*read_raw)(int cpu, struct reg_action *ra); int (*write_raw)(int cpu, struct reg_action *ra); };
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 0c2ddedd8bcb88c4100acb9e0fc5ac8752d09501 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 0c2ddedd8bcb88c4100acb9e0fc5ac8752d09501 upstream.
RAPL MSR interface supports 2 power limits for package domain, and 1 power limit for other domains, while RAPL MMIO interface supports 2 power limits for both package and dram domains. And when 2 power limits are supported, the FW_LOCK bit is in bit 63 of the register, instead of bit 31.
Remove the assumption that only pakcage domain supports 2 power limits. And allow the RAPL interface driver to specify the number of power limits supported, for every single RAPL domain it owns..
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 72 ++++++++++------------------ drivers/powercap/intel_rapl_msr.c | 1 + include/linux/intel_rapl.h | 2 + 3 files changed, 28 insertions(+), 47 deletions(-)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index 8e4de036f6d0..db8df19d8133 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -38,8 +38,8 @@ #define POWER_LIMIT2_MASK (0x7FFFULL<<32) #define POWER_LIMIT2_ENABLE BIT_ULL(47) #define POWER_LIMIT2_CLAMP BIT_ULL(48) -#define POWER_PACKAGE_LOCK BIT_ULL(63) -#define POWER_PP_LOCK BIT(31) +#define POWER_HIGH_LOCK BIT_ULL(63) +#define POWER_LOW_LOCK BIT(31)
#define TIME_WINDOW1_MASK (0x7FULL<<17) #define TIME_WINDOW2_MASK (0x7FULL<<49) @@ -513,60 +513,38 @@ static const struct powercap_zone_constraint_ops constraint_ops = { /* called after domain detection and package level data are set */ static void rapl_init_domains(struct rapl_package *rp) { - int i; + enum rapl_domain_type i; + enum rapl_domain_reg_id j; struct rapl_domain *rd = rp->domains;
for (i = 0; i < RAPL_DOMAIN_MAX; i++) { unsigned int mask = rp->domain_map & (1 << i);
- rd->regs[RAPL_DOMAIN_REG_LIMIT] = - rp->priv->regs[i][RAPL_DOMAIN_REG_LIMIT]; - rd->regs[RAPL_DOMAIN_REG_STATUS] = - rp->priv->regs[i][RAPL_DOMAIN_REG_STATUS]; - rd->regs[RAPL_DOMAIN_REG_PERF] = - rp->priv->regs[i][RAPL_DOMAIN_REG_PERF]; - rd->regs[RAPL_DOMAIN_REG_POLICY] = - rp->priv->regs[i][RAPL_DOMAIN_REG_POLICY]; - rd->regs[RAPL_DOMAIN_REG_INFO] = - rp->priv->regs[i][RAPL_DOMAIN_REG_INFO]; - - switch (mask) { - case BIT(RAPL_DOMAIN_PACKAGE): - rd->name = rapl_domain_names[RAPL_DOMAIN_PACKAGE]; - rd->id = RAPL_DOMAIN_PACKAGE; - rd->rpl[0].prim_id = PL1_ENABLE; - rd->rpl[0].name = pl1_name; + if (!mask) + continue; + + rd->rp = rp; + rd->name = rapl_domain_names[i]; + rd->id = i; + rd->rpl[0].prim_id = PL1_ENABLE; + rd->rpl[0].name = pl1_name; + /* some domain may support two power limits */ + if (rp->priv->limits[i] == 2) { rd->rpl[1].prim_id = PL2_ENABLE; rd->rpl[1].name = pl2_name; - break; - case BIT(RAPL_DOMAIN_PP0): - rd->name = rapl_domain_names[RAPL_DOMAIN_PP0]; - rd->id = RAPL_DOMAIN_PP0; - rd->rpl[0].prim_id = PL1_ENABLE; - rd->rpl[0].name = pl1_name; - break; - case BIT(RAPL_DOMAIN_PP1): - rd->name = rapl_domain_names[RAPL_DOMAIN_PP1]; - rd->id = RAPL_DOMAIN_PP1; - rd->rpl[0].prim_id = PL1_ENABLE; - rd->rpl[0].name = pl1_name; - break; - case BIT(RAPL_DOMAIN_DRAM): - rd->name = rapl_domain_names[RAPL_DOMAIN_DRAM]; - rd->id = RAPL_DOMAIN_DRAM; - rd->rpl[0].prim_id = PL1_ENABLE; - rd->rpl[0].name = pl1_name; + } + + for (j = 0; j < RAPL_DOMAIN_REG_MAX; j++) + rd->regs[j] = rp->priv->regs[i][j]; + + if (i == RAPL_DOMAIN_DRAM) { rd->domain_energy_unit = rapl_defaults->dram_domain_energy_unit; if (rd->domain_energy_unit) pr_info("DRAM domain energy unit %dpj\n", rd->domain_energy_unit); - break; - } - if (mask) { - rd->rp = rp; - rd++; } + rd++; } }
@@ -613,7 +591,7 @@ static struct rapl_primitive_info rpi[] = { RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), PRIMITIVE_INFO_INIT(POWER_LIMIT2, POWER_LIMIT2_MASK, 32, RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0), - PRIMITIVE_INFO_INIT(FW_LOCK, POWER_PP_LOCK, 31, + PRIMITIVE_INFO_INIT(FW_LOCK, POWER_LOW_LOCK, 31, RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), PRIMITIVE_INFO_INIT(PL1_ENABLE, POWER_LIMIT1_ENABLE, 15, RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0), @@ -675,9 +653,9 @@ static int rapl_read_data_raw(struct rapl_domain *rd,
cpu = rd->rp->lead_cpu;
- /* special-case package domain, which uses a different bit */ - if (prim == FW_LOCK && rd->id == RAPL_DOMAIN_PACKAGE) { - rp->mask = POWER_PACKAGE_LOCK; + /* domain with 2 limits has different bit */ + if (prim == FW_LOCK && rd->rp->priv->limits[rd->id] == 2) { + rp->mask = POWER_HIGH_LOCK; rp->shift = 63; } /* non-hardware data are collected by the polling thread */ diff --git a/drivers/powercap/intel_rapl_msr.c b/drivers/powercap/intel_rapl_msr.c index 6cd8a8fb9238..bc14a4579acb 100644 --- a/drivers/powercap/intel_rapl_msr.c +++ b/drivers/powercap/intel_rapl_msr.c @@ -41,6 +41,7 @@ static struct rapl_if_priv rapl_msr_priv = { MSR_DRAM_POWER_LIMIT, MSR_DRAM_ENERGY_STATUS, MSR_DRAM_PERF_STATUS, 0, MSR_DRAM_POWER_INFO }, .regs[RAPL_DOMAIN_PLATFORM] = { MSR_PLATFORM_POWER_LIMIT, MSR_PLATFORM_ENERGY_STATUS, 0, 0, 0}, + .limits[RAPL_DOMAIN_PACKAGE] = 2, };
/* Handles CPU hotplug on multi-socket systems. diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index 649e19981eb0..0c179d92d110 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -104,6 +104,7 @@ struct reg_action { * @pcap_rapl_online: CPU hotplug state for each RAPL interface. * @reg_unit: Register for getting energy/power/time unit. * @regs: Register sets for different RAPL Domains. + * @limits: Number of power limits supported by each domain. * @read_raw: Callback for reading RAPL interface specific * registers. * @write_raw: Callback for writing RAPL interface specific @@ -115,6 +116,7 @@ struct rapl_if_priv { enum cpuhp_state pcap_rapl_online; u64 reg_unit; u64 regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX]; + int limits[RAPL_DOMAIN_MAX]; int (*read_raw)(int cpu, struct reg_action *ra); int (*write_raw)(int cpu, struct reg_action *ra); };
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit abcfaeb3f5dc8bded4ba446eb2fb017a7a41d9bc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit abcfaeb3f5dc8bded4ba446eb2fb017a7a41d9bc upstream.
intel_rapl driver used to have a list of cpuids, which is used to 1. check if the processor support RAPL MSRs 2. do some cpu model specific setting 3. module autoloading
Now, the cpu model specific setting are moved to intel_rapl_common.c as part of the common code, because the setup is also needed by RAPL MMIO interface on those platforms. But removing the cpuid list from intel_rapl MSR interface driver results in that the driver can not be loaded automatically.
Maintaining another copy of the cpuid list in intel_rapl_msr.c does not make sense because it increases the complexity when enabling RAPL support on a new cpu model.
Fix the problem by creating an "intel_rapl_msr" platform device in the common code, and make RAPL MSR interface driver (intel_rapl_msr.c) probe the platform device directly.
Reviewed-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Tested-by: Pandruvada, Srinivas srinivas.pandruvada@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 45 +++++++++++++++++----------- drivers/powercap/intel_rapl_msr.c | 24 ++++++++++++--- 2 files changed, 48 insertions(+), 21 deletions(-)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index db8df19d8133..f1b7bcc32891 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -18,10 +18,11 @@ #include <linux/cpu.h> #include <linux/powercap.h> #include <linux/suspend.h> -#include <asm/iosf_mbi.h> #include <linux/intel_rapl.h> - #include <linux/processor.h> +#include <linux/platform_device.h> + +#include <asm/iosf_mbi.h> #include <asm/cpu_device_id.h> #include <asm/intel-family.h>
@@ -136,8 +137,6 @@ static int rapl_write_data_raw(struct rapl_domain *rd, static u64 rapl_unit_xlate(struct rapl_domain *rd, enum unit_type type, u64 value, int to_raw); static void package_power_limit_irq_save(struct rapl_package *rp); -static int rapl_init_core(void); -static void rapl_remove_core(void);
static LIST_HEAD(rapl_packages); /* guarded by CPU hotplug lock */
@@ -1262,8 +1261,6 @@ void rapl_remove_package(struct rapl_package *rp) powercap_unregister_zone(rp->priv->control_type, &rd_package->power_zone); list_del(&rp->plist); - if (list_empty(&rapl_packages)) - rapl_remove_core(); kfree(rp); } EXPORT_SYMBOL_GPL(rapl_remove_package); @@ -1292,10 +1289,6 @@ struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv) struct cpuinfo_x86 *c = &cpu_data(cpu); int ret;
- ret = rapl_init_core(); - if (ret) - return ERR_PTR(ret); - rp = kzalloc(sizeof(struct rapl_package), GFP_KERNEL); if (!rp) return ERR_PTR(-ENOMEM); @@ -1413,14 +1406,13 @@ static struct notifier_block rapl_pm_notifier = { .notifier_call = rapl_pm_callback, };
-static int rapl_init_core(void) +static struct platform_device *rapl_msr_platdev; + +static int __init rapl_init(void) { const struct x86_cpu_id *id; int ret;
- if (rapl_defaults) - return 0; - id = x86_match_cpu(rapl_ids); if (!id) { pr_err("driver does not support CPU family %d model %d\n", @@ -1432,16 +1424,35 @@ static int rapl_init_core(void) rapl_defaults = (struct rapl_defaults *)id->driver_data;
ret = register_pm_notifier(&rapl_pm_notifier); + if (ret) + return ret;
- return 0; + rapl_msr_platdev = platform_device_alloc("intel_rapl_msr", 0); + if (!rapl_msr_platdev) { + ret = -ENOMEM; + goto end; + } + + ret = platform_device_add(rapl_msr_platdev); + if (ret) + platform_device_put(rapl_msr_platdev); + +end: + if (ret) + unregister_pm_notifier(&rapl_pm_notifier); + + return ret; }
-static void rapl_remove_core(void) +static void __exit rapl_exit(void) { + platform_device_unregister(rapl_msr_platdev); unregister_pm_notifier(&rapl_pm_notifier); - rapl_defaults = NULL; }
+module_init(rapl_init); +module_exit(rapl_exit); + MODULE_DESCRIPTION("Intel Runtime Average Power Limit (RAPL) common code"); MODULE_AUTHOR("Jacob Pan jacob.jun.pan@intel.com"); MODULE_LICENSE("GPL v2"); diff --git a/drivers/powercap/intel_rapl_msr.c b/drivers/powercap/intel_rapl_msr.c index bc14a4579acb..d5487965bdfe 100644 --- a/drivers/powercap/intel_rapl_msr.c +++ b/drivers/powercap/intel_rapl_msr.c @@ -20,6 +20,7 @@ #include <linux/suspend.h> #include <linux/intel_rapl.h> #include <linux/processor.h> +#include <linux/platform_device.h>
#include <asm/iosf_mbi.h> #include <asm/cpu_device_id.h> @@ -122,7 +123,7 @@ static int rapl_msr_write_raw(int cpu, struct reg_action *ra) return ra->err; }
-static int __init rapl_msr_init(void) +static int rapl_msr_probe(struct platform_device *pdev) { int ret;
@@ -152,15 +153,30 @@ static int __init rapl_msr_init(void) return ret; }
-static void __exit rapl_msr_exit(void) +static int rapl_msr_remove(struct platform_device *pdev) { cpuhp_remove_state(rapl_msr_priv.pcap_rapl_online); rapl_remove_platform_domain(&rapl_msr_priv); powercap_unregister_control_type(rapl_msr_priv.control_type); + return 0; }
-module_init(rapl_msr_init); -module_exit(rapl_msr_exit); +static const struct platform_device_id rapl_msr_ids[] = { + { .name = "intel_rapl_msr", }, + {} +}; +MODULE_DEVICE_TABLE(platform, rapl_msr_ids); + +static struct platform_driver intel_rapl_msr_driver = { + .probe = rapl_msr_probe, + .remove = rapl_msr_remove, + .id_table = rapl_msr_ids, + .driver = { + .name = "intel_rapl_msr", + }, +}; + +module_platform_driver(intel_rapl_msr_driver);
MODULE_DESCRIPTION("Driver for Intel RAPL (Running Average Power Limit) control via MSR interface"); MODULE_AUTHOR("Zhang Rui rui.zhang@intel.com");
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 0ab74bcd1b50821391b264150d26b7f03ba6740b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 0ab74bcd1b50821391b264150d26b7f03ba6740b upstream.
Add IceLake desktop support in intel_rapl driver
Signed-off-by: Gayatri Kammela gayatri.kammela@intel.com Signed-off-by: Joe Konno joe.konno@intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index f1b7bcc32891..e9e2342616c6 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -974,6 +974,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = { INTEL_CPU_FAM6(KABYLAKE_DESKTOP, rapl_defaults_core), INTEL_CPU_FAM6(CANNONLAKE_MOBILE, rapl_defaults_core), INTEL_CPU_FAM6(ICELAKE_MOBILE, rapl_defaults_core), + INTEL_CPU_FAM6(ICELAKE_DESKTOP, rapl_defaults_core),
INTEL_CPU_FAM6(ATOM_SILVERMONT, rapl_defaults_byt), INTEL_CPU_FAM6(ATOM_AIRMONT, rapl_defaults_cht),
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit cceb1d9dfa680e8b0f5d70d87c2ee25903070b96 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit cceb1d9dfa680e8b0f5d70d87c2ee25903070b96 upstream.
Add ICX support in intel_rapl driver
Signed-off-by: Jacob Pan jacob.jun.pan@linux.intel.com Signed-off-by: Rajneesh Bhardwaj rajneesh.bhardwaj@linux.intel.com Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index e9e2342616c6..3a5440d90017 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -975,6 +975,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = { INTEL_CPU_FAM6(CANNONLAKE_MOBILE, rapl_defaults_core), INTEL_CPU_FAM6(ICELAKE_MOBILE, rapl_defaults_core), INTEL_CPU_FAM6(ICELAKE_DESKTOP, rapl_defaults_core), + INTEL_CPU_FAM6(ICELAKE_X, rapl_defaults_hsw_server),
INTEL_CPU_FAM6(ATOM_SILVERMONT, rapl_defaults_byt), INTEL_CPU_FAM6(ATOM_AIRMONT, rapl_defaults_cht),
From: Zhang Rui rui.zhang@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 3231a21d5ca6f6baea95588406775304f35a203e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3231a21d5ca6f6baea95588406775304f35a203e upstream.
Add ICX-D support in intel_rapl driver
Signed-off-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index 3a5440d90017..b624a88b2c25 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -976,6 +976,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = { INTEL_CPU_FAM6(ICELAKE_MOBILE, rapl_defaults_core), INTEL_CPU_FAM6(ICELAKE_DESKTOP, rapl_defaults_core), INTEL_CPU_FAM6(ICELAKE_X, rapl_defaults_hsw_server), + INTEL_CPU_FAM6(ICELAKE_XEON_D, rapl_defaults_hsw_server),
INTEL_CPU_FAM6(ATOM_SILVERMONT, rapl_defaults_byt), INTEL_CPU_FAM6(ATOM_AIRMONT, rapl_defaults_cht),
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 88a242c9874044740ab990d4fccba8bb90cb924b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 88a242c9874044740ab990d4fccba8bb90cb924b upstream.
Parts of skx_edac can be shared with the Intel 10nm server EDAC driver.
Carve out the common parts from skx_edac in preparation to support both skx_edac driver and i10nm_edac drivers.
Co-developed-by: Tony Luck tony.luck@intel.com Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: James Morse james.morse@arm.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/skx_common.c | 689 ++++++++++++++++++++++++++++++++++++++ drivers/edac/skx_common.h | 152 +++++++++ 2 files changed, 841 insertions(+) create mode 100644 drivers/edac/skx_common.c create mode 100644 drivers/edac/skx_common.h
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c new file mode 100644 index 000000000000..fb3d9a563ba8 --- /dev/null +++ b/drivers/edac/skx_common.c @@ -0,0 +1,689 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Common codes for both the skx_edac driver and Intel 10nm server EDAC driver. + * Originally split out from the skx_edac driver. + * + * Copyright (c) 2018, Intel Corporation. + */ + +#include <linux/acpi.h> +#include <linux/dmi.h> +#include <linux/adxl.h> +#include <acpi/nfit.h> +#include <asm/mce.h> +#include "edac_module.h" +#include "skx_common.h" + +static const char * const component_names[] = { + [INDEX_SOCKET] = "ProcessorSocketId", + [INDEX_MEMCTRL] = "MemoryControllerId", + [INDEX_CHANNEL] = "ChannelId", + [INDEX_DIMM] = "DimmSlotId", +}; + +static int component_indices[ARRAY_SIZE(component_names)]; +static int adxl_component_count; +static const char * const *adxl_component_names; +static u64 *adxl_values; +static char *adxl_msg; + +static char skx_msg[MSG_SIZE]; +static skx_decode_f skx_decode; +static u64 skx_tolm, skx_tohm; +static LIST_HEAD(dev_edac_list); + +int __init skx_adxl_get(void) +{ + const char * const *names; + int i, j; + + names = adxl_get_component_names(); + if (!names) { + skx_printk(KERN_NOTICE, "No firmware support for address translation.\n"); + return -ENODEV; + } + + for (i = 0; i < INDEX_MAX; i++) { + for (j = 0; names[j]; j++) { + if (!strcmp(component_names[i], names[j])) { + component_indices[i] = j; + break; + } + } + + if (!names[j]) + goto err; + } + + adxl_component_names = names; + while (*names++) + adxl_component_count++; + + adxl_values = kcalloc(adxl_component_count, sizeof(*adxl_values), + GFP_KERNEL); + if (!adxl_values) { + adxl_component_count = 0; + return -ENOMEM; + } + + adxl_msg = kzalloc(MSG_SIZE, GFP_KERNEL); + if (!adxl_msg) { + adxl_component_count = 0; + kfree(adxl_values); + return -ENOMEM; + } + + return 0; +err: + skx_printk(KERN_ERR, "'%s' is not matched from DSM parameters: ", + component_names[i]); + for (j = 0; names[j]; j++) + skx_printk(KERN_CONT, "%s ", names[j]); + skx_printk(KERN_CONT, "\n"); + + return -ENODEV; +} + +void __exit skx_adxl_put(void) +{ + kfree(adxl_values); + kfree(adxl_msg); +} + +static bool skx_adxl_decode(struct decoded_addr *res) +{ + int i, len = 0; + + if (res->addr >= skx_tohm || (res->addr >= skx_tolm && + res->addr < BIT_ULL(32))) { + edac_dbg(0, "Address 0x%llx out of range\n", res->addr); + return false; + } + + if (adxl_decode(res->addr, adxl_values)) { + edac_dbg(0, "Failed to decode 0x%llx\n", res->addr); + return false; + } + + res->socket = (int)adxl_values[component_indices[INDEX_SOCKET]]; + res->imc = (int)adxl_values[component_indices[INDEX_MEMCTRL]]; + res->channel = (int)adxl_values[component_indices[INDEX_CHANNEL]]; + res->dimm = (int)adxl_values[component_indices[INDEX_DIMM]]; + + for (i = 0; i < adxl_component_count; i++) { + if (adxl_values[i] == ~0x0ull) + continue; + + len += snprintf(adxl_msg + len, MSG_SIZE - len, " %s:0x%llx", + adxl_component_names[i], adxl_values[i]); + if (MSG_SIZE - len <= 0) + break; + } + + return true; +} + +void skx_set_decode(skx_decode_f decode) +{ + skx_decode = decode; +} + +int skx_get_src_id(struct skx_dev *d, u8 *id) +{ + u32 reg; + + if (pci_read_config_dword(d->util_all, 0xf0, ®)) { + skx_printk(KERN_ERR, "Failed to read src id\n"); + return -ENODEV; + } + + *id = GET_BITFIELD(reg, 12, 14); + return 0; +} + +int skx_get_node_id(struct skx_dev *d, u8 *id) +{ + u32 reg; + + if (pci_read_config_dword(d->util_all, 0xf4, ®)) { + skx_printk(KERN_ERR, "Failed to read node id\n"); + return -ENODEV; + } + + *id = GET_BITFIELD(reg, 0, 2); + return 0; +} + +static int get_width(u32 mtr) +{ + switch (GET_BITFIELD(mtr, 8, 9)) { + case 0: + return DEV_X4; + case 1: + return DEV_X8; + case 2: + return DEV_X16; + } + return DEV_UNKNOWN; +} + +/* + * We use the per-socket device @did to count how many sockets are present, + * and to detemine which PCI buses are associated with each socket. Allocate + * and build the full list of all the skx_dev structures that we need here. + */ +int skx_get_all_bus_mappings(unsigned int did, int off, enum type type, + struct list_head **list) +{ + struct pci_dev *pdev, *prev; + struct skx_dev *d; + u32 reg; + int ndev = 0; + + prev = NULL; + for (;;) { + pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, prev); + if (!pdev) + break; + ndev++; + d = kzalloc(sizeof(*d), GFP_KERNEL); + if (!d) { + pci_dev_put(pdev); + return -ENOMEM; + } + + if (pci_read_config_dword(pdev, off, ®)) { + kfree(d); + pci_dev_put(pdev); + skx_printk(KERN_ERR, "Failed to read bus idx\n"); + return -ENODEV; + } + + d->bus[0] = GET_BITFIELD(reg, 0, 7); + d->bus[1] = GET_BITFIELD(reg, 8, 15); + if (type == SKX) { + d->seg = pci_domain_nr(pdev->bus); + d->bus[2] = GET_BITFIELD(reg, 16, 23); + d->bus[3] = GET_BITFIELD(reg, 24, 31); + } else { + d->seg = GET_BITFIELD(reg, 16, 23); + } + + edac_dbg(2, "busses: 0x%x, 0x%x, 0x%x, 0x%x\n", + d->bus[0], d->bus[1], d->bus[2], d->bus[3]); + list_add_tail(&d->list, &dev_edac_list); + prev = pdev; + } + + if (list) + *list = &dev_edac_list; + return ndev; +} + +int skx_get_hi_lo(unsigned int did, int off[], u64 *tolm, u64 *tohm) +{ + struct pci_dev *pdev; + u32 reg; + + pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, NULL); + if (!pdev) { + skx_printk(KERN_ERR, "Can't get tolm/tohm\n"); + return -ENODEV; + } + + if (pci_read_config_dword(pdev, off[0], ®)) { + skx_printk(KERN_ERR, "Failed to read tolm\n"); + goto fail; + } + skx_tolm = reg; + + if (pci_read_config_dword(pdev, off[1], ®)) { + skx_printk(KERN_ERR, "Failed to read lower tohm\n"); + goto fail; + } + skx_tohm = reg; + + if (pci_read_config_dword(pdev, off[2], ®)) { + skx_printk(KERN_ERR, "Failed to read upper tohm\n"); + goto fail; + } + skx_tohm |= (u64)reg << 32; + + pci_dev_put(pdev); + *tolm = skx_tolm; + *tohm = skx_tohm; + edac_dbg(2, "tolm = 0x%llx tohm = 0x%llx\n", skx_tolm, skx_tohm); + return 0; +fail: + pci_dev_put(pdev); + return -ENODEV; +} + +static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add, + int minval, int maxval, const char *name) +{ + u32 val = GET_BITFIELD(reg, lobit, hibit); + + if (val < minval || val > maxval) { + edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); + return -EINVAL; + } + return val + add; +} + +#define numrank(reg) skx_get_dimm_attr(reg, 12, 13, 0, 0, 2, "ranks") +#define numrow(reg) skx_get_dimm_attr(reg, 2, 4, 12, 1, 6, "rows") +#define numcol(reg) skx_get_dimm_attr(reg, 0, 1, 10, 0, 2, "cols") + +int skx_get_dimm_info(u32 mtr, u32 amap, struct dimm_info *dimm, + struct skx_imc *imc, int chan, int dimmno) +{ + int banks = 16, ranks, rows, cols, npages; + u64 size; + + ranks = numrank(mtr); + rows = numrow(mtr); + cols = numcol(mtr); + + /* + * Compute size in 8-byte (2^3) words, then shift to MiB (2^20) + */ + size = ((1ull << (rows + cols + ranks)) * banks) >> (20 - 3); + npages = MiB_TO_PAGES(size); + + edac_dbg(0, "mc#%d: channel %d, dimm %d, %lld MiB (%d pages) bank: %d, rank: %d, row: 0x%x, col: 0x%x\n", + imc->mc, chan, dimmno, size, npages, + banks, 1 << ranks, rows, cols); + + imc->chan[chan].dimms[dimmno].close_pg = GET_BITFIELD(mtr, 0, 0); + imc->chan[chan].dimms[dimmno].bank_xor_enable = GET_BITFIELD(mtr, 9, 9); + imc->chan[chan].dimms[dimmno].fine_grain_bank = GET_BITFIELD(amap, 0, 0); + imc->chan[chan].dimms[dimmno].rowbits = rows; + imc->chan[chan].dimms[dimmno].colbits = cols; + + dimm->nr_pages = npages; + dimm->grain = 32; + dimm->dtype = get_width(mtr); + dimm->mtype = MEM_DDR4; + dimm->edac_mode = EDAC_SECDED; /* likely better than this */ + snprintf(dimm->label, sizeof(dimm->label), "CPU_SrcID#%u_MC#%u_Chan#%u_DIMM#%u", + imc->src_id, imc->lmc, chan, dimmno); + + return 1; +} + +int skx_get_nvdimm_info(struct dimm_info *dimm, struct skx_imc *imc, + int chan, int dimmno, const char *mod_str) +{ + int smbios_handle; + u32 dev_handle; + u16 flags; + u64 size = 0; + + dev_handle = ACPI_NFIT_BUILD_DEVICE_HANDLE(dimmno, chan, imc->lmc, + imc->src_id, 0); + + smbios_handle = nfit_get_smbios_id(dev_handle, &flags); + if (smbios_handle == -EOPNOTSUPP) { + pr_warn_once("%s: Can't find size of NVDIMM. Try enabling CONFIG_ACPI_NFIT\n", mod_str); + goto unknown_size; + } + + if (smbios_handle < 0) { + skx_printk(KERN_ERR, "Can't find handle for NVDIMM ADR=0x%x\n", dev_handle); + goto unknown_size; + } + + if (flags & ACPI_NFIT_MEM_MAP_FAILED) { + skx_printk(KERN_ERR, "NVDIMM ADR=0x%x is not mapped\n", dev_handle); + goto unknown_size; + } + + size = dmi_memdev_size(smbios_handle); + if (size == ~0ull) + skx_printk(KERN_ERR, "Can't find size for NVDIMM ADR=0x%x/SMBIOS=0x%x\n", + dev_handle, smbios_handle); + +unknown_size: + dimm->nr_pages = size >> PAGE_SHIFT; + dimm->grain = 32; + dimm->dtype = DEV_UNKNOWN; + dimm->mtype = MEM_NVDIMM; + dimm->edac_mode = EDAC_SECDED; /* likely better than this */ + + edac_dbg(0, "mc#%d: channel %d, dimm %d, %llu MiB (%u pages)\n", + imc->mc, chan, dimmno, size >> 20, dimm->nr_pages); + + snprintf(dimm->label, sizeof(dimm->label), "CPU_SrcID#%u_MC#%u_Chan#%u_DIMM#%u", + imc->src_id, imc->lmc, chan, dimmno); + + return (size == 0 || size == ~0ull) ? 0 : 1; +} + +int skx_register_mci(struct skx_imc *imc, struct pci_dev *pdev, + const char *ctl_name, const char *mod_str, + get_dimm_config_f get_dimm_config) +{ + struct mem_ctl_info *mci; + struct edac_mc_layer layers[2]; + struct skx_pvt *pvt; + int rc; + + /* Allocate a new MC control structure */ + layers[0].type = EDAC_MC_LAYER_CHANNEL; + layers[0].size = NUM_CHANNELS; + layers[0].is_virt_csrow = false; + layers[1].type = EDAC_MC_LAYER_SLOT; + layers[1].size = NUM_DIMMS; + layers[1].is_virt_csrow = true; + mci = edac_mc_alloc(imc->mc, ARRAY_SIZE(layers), layers, + sizeof(struct skx_pvt)); + + if (unlikely(!mci)) + return -ENOMEM; + + edac_dbg(0, "MC#%d: mci = %p\n", imc->mc, mci); + + /* Associate skx_dev and mci for future usage */ + imc->mci = mci; + pvt = mci->pvt_info; + pvt->imc = imc; + + mci->ctl_name = kasprintf(GFP_KERNEL, "%s#%d IMC#%d", ctl_name, + imc->node_id, imc->lmc); + if (!mci->ctl_name) { + rc = -ENOMEM; + goto fail0; + } + + mci->mtype_cap = MEM_FLAG_DDR4 | MEM_FLAG_NVDIMM; + mci->edac_ctl_cap = EDAC_FLAG_NONE; + mci->edac_cap = EDAC_FLAG_NONE; + mci->mod_name = mod_str; + mci->dev_name = pci_name(pdev); + mci->ctl_page_to_phys = NULL; + + rc = get_dimm_config(mci); + if (rc < 0) + goto fail; + + /* Record ptr to the generic device */ + mci->pdev = &pdev->dev; + + /* Add this new MC control structure to EDAC's list of MCs */ + if (unlikely(edac_mc_add_mc(mci))) { + edac_dbg(0, "MC: failed edac_mc_add_mc()\n"); + rc = -EINVAL; + goto fail; + } + + return 0; + +fail: + kfree(mci->ctl_name); +fail0: + edac_mc_free(mci); + imc->mci = NULL; + return rc; +} + +static void skx_unregister_mci(struct skx_imc *imc) +{ + struct mem_ctl_info *mci = imc->mci; + + if (!mci) + return; + + edac_dbg(0, "MC%d: mci = %p\n", imc->mc, mci); + + /* Remove MC sysfs nodes */ + edac_mc_del_mc(mci->pdev); + + edac_dbg(1, "%s: free mci struct\n", mci->ctl_name); + kfree(mci->ctl_name); + edac_mc_free(mci); +} + +static struct mem_ctl_info *get_mci(int src_id, int lmc) +{ + struct skx_dev *d; + + if (lmc > NUM_IMC - 1) { + skx_printk(KERN_ERR, "Bad lmc %d\n", lmc); + return NULL; + } + + list_for_each_entry(d, &dev_edac_list, list) { + if (d->imc[0].src_id == src_id) + return d->imc[lmc].mci; + } + + skx_printk(KERN_ERR, "No mci for src_id %d lmc %d\n", src_id, lmc); + return NULL; +} + +static void skx_mce_output_error(struct mem_ctl_info *mci, + const struct mce *m, + struct decoded_addr *res) +{ + enum hw_event_mc_err_type tp_event; + char *type, *optype; + bool ripv = GET_BITFIELD(m->mcgstatus, 0, 0); + bool overflow = GET_BITFIELD(m->status, 62, 62); + bool uncorrected_error = GET_BITFIELD(m->status, 61, 61); + bool recoverable; + u32 core_err_cnt = GET_BITFIELD(m->status, 38, 52); + u32 mscod = GET_BITFIELD(m->status, 16, 31); + u32 errcode = GET_BITFIELD(m->status, 0, 15); + u32 optypenum = GET_BITFIELD(m->status, 4, 6); + + recoverable = GET_BITFIELD(m->status, 56, 56); + + if (uncorrected_error) { + core_err_cnt = 1; + if (ripv) { + type = "FATAL"; + tp_event = HW_EVENT_ERR_FATAL; + } else { + type = "NON_FATAL"; + tp_event = HW_EVENT_ERR_UNCORRECTED; + } + } else { + type = "CORRECTED"; + tp_event = HW_EVENT_ERR_CORRECTED; + } + + /* + * According with Table 15-9 of the Intel Architecture spec vol 3A, + * memory errors should fit in this mask: + * 000f 0000 1mmm cccc (binary) + * where: + * f = Correction Report Filtering Bit. If 1, subsequent errors + * won't be shown + * mmm = error type + * cccc = channel + * If the mask doesn't match, report an error to the parsing logic + */ + if (!((errcode & 0xef80) == 0x80)) { + optype = "Can't parse: it is not a mem"; + } else { + switch (optypenum) { + case 0: + optype = "generic undef request error"; + break; + case 1: + optype = "memory read error"; + break; + case 2: + optype = "memory write error"; + break; + case 3: + optype = "addr/cmd error"; + break; + case 4: + optype = "memory scrubbing error"; + break; + default: + optype = "reserved"; + break; + } + } + if (adxl_component_count) { + snprintf(skx_msg, MSG_SIZE, "%s%s err_code:0x%04x:0x%04x %s", + overflow ? " OVERFLOW" : "", + (uncorrected_error && recoverable) ? " recoverable" : "", + mscod, errcode, adxl_msg); + } else { + snprintf(skx_msg, MSG_SIZE, + "%s%s err_code:0x%04x:0x%04x socket:%d imc:%d rank:%d bg:%d ba:%d row:0x%x col:0x%x", + overflow ? " OVERFLOW" : "", + (uncorrected_error && recoverable) ? " recoverable" : "", + mscod, errcode, + res->socket, res->imc, res->rank, + res->bank_group, res->bank_address, res->row, res->column); + } + + edac_dbg(0, "%s\n", skx_msg); + + /* Call the helper to output message */ + edac_mc_handle_error(tp_event, mci, core_err_cnt, + m->addr >> PAGE_SHIFT, m->addr & ~PAGE_MASK, 0, + res->channel, res->dimm, -1, + optype, skx_msg); +} + +int skx_mce_check_error(struct notifier_block *nb, unsigned long val, + void *data) +{ + struct mce *mce = (struct mce *)data; + struct decoded_addr res; + struct mem_ctl_info *mci; + char *type; + + if (edac_get_report_status() == EDAC_REPORTING_DISABLED) + return NOTIFY_DONE; + + /* ignore unless this is memory related with an address */ + if ((mce->status & 0xefff) >> 7 != 1 || !(mce->status & MCI_STATUS_ADDRV)) + return NOTIFY_DONE; + + memset(&res, 0, sizeof(res)); + res.addr = mce->addr; + + if (adxl_component_count) { + if (!skx_adxl_decode(&res)) + return NOTIFY_DONE; + + mci = get_mci(res.socket, res.imc); + } else { + if (!skx_decode || !skx_decode(&res)) + return NOTIFY_DONE; + + mci = res.dev->imc[res.imc].mci; + } + + if (!mci) + return NOTIFY_DONE; + + if (mce->mcgstatus & MCG_STATUS_MCIP) + type = "Exception"; + else + type = "Event"; + + skx_mc_printk(mci, KERN_DEBUG, "HANDLING MCE MEMORY ERROR\n"); + + skx_mc_printk(mci, KERN_DEBUG, "CPU %d: Machine Check %s: 0x%llx " + "Bank %d: 0x%llx\n", mce->extcpu, type, + mce->mcgstatus, mce->bank, mce->status); + skx_mc_printk(mci, KERN_DEBUG, "TSC 0x%llx ", mce->tsc); + skx_mc_printk(mci, KERN_DEBUG, "ADDR 0x%llx ", mce->addr); + skx_mc_printk(mci, KERN_DEBUG, "MISC 0x%llx ", mce->misc); + + skx_mc_printk(mci, KERN_DEBUG, "PROCESSOR %u:0x%x TIME %llu SOCKET " + "%u APIC 0x%x\n", mce->cpuvendor, mce->cpuid, + mce->time, mce->socketid, mce->apicid); + + skx_mce_output_error(mci, mce, &res); + + return NOTIFY_DONE; +} + +void skx_remove(void) +{ + int i, j; + struct skx_dev *d, *tmp; + + edac_dbg(0, "\n"); + + list_for_each_entry_safe(d, tmp, &dev_edac_list, list) { + list_del(&d->list); + for (i = 0; i < NUM_IMC; i++) { + if (d->imc[i].mci) + skx_unregister_mci(&d->imc[i]); + + if (d->imc[i].mdev) + pci_dev_put(d->imc[i].mdev); + + if (d->imc[i].mbase) + iounmap(d->imc[i].mbase); + + for (j = 0; j < NUM_CHANNELS; j++) { + if (d->imc[i].chan[j].cdev) + pci_dev_put(d->imc[i].chan[j].cdev); + } + } + if (d->util_all) + pci_dev_put(d->util_all); + if (d->sad_all) + pci_dev_put(d->sad_all); + if (d->uracu) + pci_dev_put(d->uracu); + + kfree(d); + } +} + +#ifdef CONFIG_EDAC_DEBUG +/* + * Debug feature. + * Exercise the address decode logic by writing an address to + * /sys/kernel/debug/edac/dirname/addr. + */ +static struct dentry *skx_test; + +static int debugfs_u64_set(void *data, u64 val) +{ + struct mce m; + + pr_warn_once("Fake error to 0x%llx injected via debugfs\n", val); + + memset(&m, 0, sizeof(m)); + /* ADDRV + MemRd + Unknown channel */ + m.status = MCI_STATUS_ADDRV + 0x90; + /* One corrected error */ + m.status |= BIT_ULL(38); + m.addr = val; + skx_mce_check_error(NULL, 0, &m); + + return 0; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u64_wo, NULL, debugfs_u64_set, "%llu\n"); + +void setup_skx_debug(const char *dirname) +{ + skx_test = edac_debugfs_create_dir(dirname); + if (!skx_test) + return; + + if (!edac_debugfs_create_file("addr", 0200, skx_test, + NULL, &fops_u64_wo)) { + debugfs_remove(skx_test); + skx_test = NULL; + } +} + +void teardown_skx_debug(void) +{ + debugfs_remove_recursive(skx_test); +} +#endif /*CONFIG_EDAC_DEBUG*/ diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h new file mode 100644 index 000000000000..d25374e34d4f --- /dev/null +++ b/drivers/edac/skx_common.h @@ -0,0 +1,152 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Common codes for both the skx_edac driver and Intel 10nm server EDAC driver. + * Originally split out from the skx_edac driver. + * + * Copyright (c) 2018, Intel Corporation. + */ + +#ifndef _SKX_COMM_EDAC_H +#define _SKX_COMM_EDAC_H + +#define MSG_SIZE 1024 + +/* + * Debug macros + */ +#define skx_printk(level, fmt, arg...) \ + edac_printk(level, "skx", fmt, ##arg) + +#define skx_mc_printk(mci, level, fmt, arg...) \ + edac_mc_chipset_printk(mci, level, "skx", fmt, ##arg) + +/* + * Get a bit field at register value <v>, from bit <lo> to bit <hi> + */ +#define GET_BITFIELD(v, lo, hi) \ + (((v) & GENMASK_ULL((hi), (lo))) >> (lo)) + +#define SKX_NUM_IMC 2 /* Memory controllers per socket */ +#define SKX_NUM_CHANNELS 3 /* Channels per memory controller */ +#define SKX_NUM_DIMMS 2 /* Max DIMMS per channel */ + +#define I10NM_NUM_IMC 4 +#define I10NM_NUM_CHANNELS 2 +#define I10NM_NUM_DIMMS 2 + +#define MAX(a, b) ((a) > (b) ? (a) : (b)) +#define NUM_IMC MAX(SKX_NUM_IMC, I10NM_NUM_IMC) +#define NUM_CHANNELS MAX(SKX_NUM_CHANNELS, I10NM_NUM_CHANNELS) +#define NUM_DIMMS MAX(SKX_NUM_DIMMS, I10NM_NUM_DIMMS) + +#define IS_DIMM_PRESENT(r) GET_BITFIELD(r, 15, 15) +#define IS_NVDIMM_PRESENT(r, i) GET_BITFIELD(r, i, i) + +/* + * Each cpu socket contains some pci devices that provide global + * information, and also some that are local to each of the two + * memory controllers on the die. + */ +struct skx_dev { + struct list_head list; + u8 bus[4]; + int seg; + struct pci_dev *sad_all; + struct pci_dev *util_all; + struct pci_dev *uracu; /* for i10nm CPU */ + u32 mcroute; + struct skx_imc { + struct mem_ctl_info *mci; + struct pci_dev *mdev; /* for i10nm CPU */ + void __iomem *mbase; /* for i10nm CPU */ + u8 mc; /* system wide mc# */ + u8 lmc; /* socket relative mc# */ + u8 src_id, node_id; + struct skx_channel { + struct pci_dev *cdev; + struct skx_dimm { + u8 close_pg; + u8 bank_xor_enable; + u8 fine_grain_bank; + u8 rowbits; + u8 colbits; + } dimms[NUM_DIMMS]; + } chan[NUM_CHANNELS]; + } imc[NUM_IMC]; +}; + +struct skx_pvt { + struct skx_imc *imc; +}; + +enum type { + SKX, + I10NM +}; + +enum { + INDEX_SOCKET, + INDEX_MEMCTRL, + INDEX_CHANNEL, + INDEX_DIMM, + INDEX_MAX +}; + +struct decoded_addr { + struct skx_dev *dev; + u64 addr; + int socket; + int imc; + int channel; + u64 chan_addr; + int sktways; + int chanways; + int dimm; + int rank; + int channel_rank; + u64 rank_address; + int row; + int column; + int bank_address; + int bank_group; +}; + +typedef int (*get_dimm_config_f)(struct mem_ctl_info *mci); +typedef bool (*skx_decode_f)(struct decoded_addr *res); + +int __init skx_adxl_get(void); +void __exit skx_adxl_put(void); +void skx_set_decode(skx_decode_f decode); + +int skx_get_src_id(struct skx_dev *d, u8 *id); +int skx_get_node_id(struct skx_dev *d, u8 *id); + +int skx_get_all_bus_mappings(unsigned int did, int off, enum type, + struct list_head **list); + +int skx_get_hi_lo(unsigned int did, int off[], u64 *tolm, u64 *tohm); + +int skx_get_dimm_info(u32 mtr, u32 amap, struct dimm_info *dimm, + struct skx_imc *imc, int chan, int dimmno); + +int skx_get_nvdimm_info(struct dimm_info *dimm, struct skx_imc *imc, + int chan, int dimmno, const char *mod_str); + +int skx_register_mci(struct skx_imc *imc, struct pci_dev *pdev, + const char *ctl_name, const char *mod_str, + get_dimm_config_f get_dimm_config); + +int skx_mce_check_error(struct notifier_block *nb, unsigned long val, + void *data); + +void skx_remove(void); + +#ifdef CONFIG_EDAC_DEBUG +void setup_skx_debug(const char *dirname); +void teardown_skx_debug(void); +#else +static inline void setup_skx_debug(const char *dirname) {} +static inline void teardown_skx_debug(void) {} +#endif /*CONFIG_EDAC_DEBUG*/ + +#endif /* _SKX_COMM_EDAC_H */
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.1-rc1 commit 98f2fc829e3b55a0ff1b82753fc1e10941535cb9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Delete the duplicated code from skx_edac.c and rename skx_edac.c to skx_base.c. Update the Makefile to build the skx_edac driver from skx_base.c and skx_common.c.
Add SPDX to skx_base.c and clean out unnecessary #include lines.
[ bp: Drop the license boilerplate - there's an SPDX identifier now. ]
Co-developed-by: Tony Luck tony.luck@intel.com Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: James Morse james.morse@arm.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190130191519.15393-4-tony.luck@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/Makefile | 4 +- drivers/edac/skx_base.c | 650 +++++++++++++++++++ drivers/edac/skx_edac.c | 1357 --------------------------------------- 3 files changed, 653 insertions(+), 1358 deletions(-) create mode 100644 drivers/edac/skx_base.c delete mode 100644 drivers/edac/skx_edac.c
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index 02b43a7d8c3e..b250d59910ba 100644 --- a/drivers/edac/Makefile +++ b/drivers/edac/Makefile @@ -30,7 +30,6 @@ obj-$(CONFIG_EDAC_I5400) += i5400_edac.o obj-$(CONFIG_EDAC_I7300) += i7300_edac.o obj-$(CONFIG_EDAC_I7CORE) += i7core_edac.o obj-$(CONFIG_EDAC_SBRIDGE) += sb_edac.o -obj-$(CONFIG_EDAC_SKX) += skx_edac.o obj-$(CONFIG_EDAC_PND2) += pnd2_edac.o obj-$(CONFIG_EDAC_E7XXX) += e7xxx_edac.o obj-$(CONFIG_EDAC_E752X) += e752x_edac.o @@ -58,6 +57,9 @@ obj-$(CONFIG_EDAC_MPC85XX) += mpc85xx_edac_mod.o layerscape_edac_mod-y := fsl_ddr_edac.o layerscape_edac.o obj-$(CONFIG_EDAC_LAYERSCAPE) += layerscape_edac_mod.o
+skx_edac-y := skx_common.o skx_base.o +obj-$(CONFIG_EDAC_SKX) += skx_edac.o + obj-$(CONFIG_EDAC_MV64X60) += mv64x60_edac.o obj-$(CONFIG_EDAC_CELL) += cell_edac.o obj-$(CONFIG_EDAC_PPC4XX) += ppc4xx_edac.o diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c new file mode 100644 index 000000000000..adae4c848ca1 --- /dev/null +++ b/drivers/edac/skx_base.c @@ -0,0 +1,650 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * EDAC driver for Intel(R) Xeon(R) Skylake processors + * Copyright (c) 2016, Intel Corporation. + */ + +#include <linux/kernel.h> +#include <linux/processor.h> +#include <asm/cpu_device_id.h> +#include <asm/intel-family.h> +#include <asm/mce.h> + +#include "edac_module.h" +#include "skx_common.h" + +#define EDAC_MOD_STR "skx_edac" + +/* + * Debug macros + */ +#define skx_printk(level, fmt, arg...) \ + edac_printk(level, "skx", fmt, ##arg) + +#define skx_mc_printk(mci, level, fmt, arg...) \ + edac_mc_chipset_printk(mci, level, "skx", fmt, ##arg) + +static struct list_head *skx_edac_list; + +static u64 skx_tolm, skx_tohm; +static int skx_num_sockets; +static unsigned int nvdimm_count; + +#define MASK26 0x3FFFFFF /* Mask for 2^26 */ +#define MASK29 0x1FFFFFFF /* Mask for 2^29 */ + +static struct skx_dev *get_skx_dev(struct pci_bus *bus, u8 idx) +{ + struct skx_dev *d; + + list_for_each_entry(d, skx_edac_list, list) { + if (d->seg == pci_domain_nr(bus) && d->bus[idx] == bus->number) + return d; + } + + return NULL; +} + +enum munittype { + CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD +}; + +struct munit { + u16 did; + u16 devfn[SKX_NUM_IMC]; + u8 busidx; + u8 per_socket; + enum munittype mtype; +}; + +/* + * List of PCI device ids that we need together with some device + * number and function numbers to tell which memory controller the + * device belongs to. + */ +static const struct munit skx_all_munits[] = { + { 0x2054, { }, 1, 1, SAD_ALL }, + { 0x2055, { }, 1, 1, UTIL_ALL }, + { 0x2040, { PCI_DEVFN(10, 0), PCI_DEVFN(12, 0) }, 2, 2, CHAN0 }, + { 0x2044, { PCI_DEVFN(10, 4), PCI_DEVFN(12, 4) }, 2, 2, CHAN1 }, + { 0x2048, { PCI_DEVFN(11, 0), PCI_DEVFN(13, 0) }, 2, 2, CHAN2 }, + { 0x208e, { }, 1, 0, SAD }, + { } +}; + +static int get_all_munits(const struct munit *m) +{ + struct pci_dev *pdev, *prev; + struct skx_dev *d; + u32 reg; + int i = 0, ndev = 0; + + prev = NULL; + for (;;) { + pdev = pci_get_device(PCI_VENDOR_ID_INTEL, m->did, prev); + if (!pdev) + break; + ndev++; + if (m->per_socket == SKX_NUM_IMC) { + for (i = 0; i < SKX_NUM_IMC; i++) + if (m->devfn[i] == pdev->devfn) + break; + if (i == SKX_NUM_IMC) + goto fail; + } + d = get_skx_dev(pdev->bus, m->busidx); + if (!d) + goto fail; + + /* Be sure that the device is enabled */ + if (unlikely(pci_enable_device(pdev) < 0)) { + skx_printk(KERN_ERR, "Couldn't enable device %04x:%04x\n", + PCI_VENDOR_ID_INTEL, m->did); + goto fail; + } + + switch (m->mtype) { + case CHAN0: case CHAN1: case CHAN2: + pci_dev_get(pdev); + d->imc[i].chan[m->mtype].cdev = pdev; + break; + case SAD_ALL: + pci_dev_get(pdev); + d->sad_all = pdev; + break; + case UTIL_ALL: + pci_dev_get(pdev); + d->util_all = pdev; + break; + case SAD: + /* + * one of these devices per core, including cores + * that don't exist on this SKU. Ignore any that + * read a route table of zero, make sure all the + * non-zero values match. + */ + pci_read_config_dword(pdev, 0xB4, ®); + if (reg != 0) { + if (d->mcroute == 0) { + d->mcroute = reg; + } else if (d->mcroute != reg) { + skx_printk(KERN_ERR, "mcroute mismatch\n"); + goto fail; + } + } + ndev--; + break; + } + + prev = pdev; + } + + return ndev; +fail: + pci_dev_put(pdev); + return -ENODEV; +} + +static const struct x86_cpu_id skx_cpuids[] = { + { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X, 0, 0 }, + { } +}; +MODULE_DEVICE_TABLE(x86cpu, skx_cpuids); + +#define SKX_GET_MTMTR(dev, reg) \ + pci_read_config_dword((dev), 0x87c, &(reg)) + +static bool skx_check_ecc(struct pci_dev *pdev) +{ + u32 mtmtr; + + SKX_GET_MTMTR(pdev, mtmtr); + + return !!GET_BITFIELD(mtmtr, 2, 2); +} + +static int skx_get_dimm_config(struct mem_ctl_info *mci) +{ + struct skx_pvt *pvt = mci->pvt_info; + struct skx_imc *imc = pvt->imc; + u32 mtr, amap, mcddrtcfg; + struct dimm_info *dimm; + int i, j; + int ndimms; + + for (i = 0; i < SKX_NUM_CHANNELS; i++) { + ndimms = 0; + pci_read_config_dword(imc->chan[i].cdev, 0x8C, &amap); + pci_read_config_dword(imc->chan[i].cdev, 0x400, &mcddrtcfg); + for (j = 0; j < SKX_NUM_DIMMS; j++) { + dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms, + mci->n_layers, i, j, 0); + pci_read_config_dword(imc->chan[i].cdev, + 0x80 + 4 * j, &mtr); + if (IS_DIMM_PRESENT(mtr)) { + ndimms += skx_get_dimm_info(mtr, amap, dimm, imc, i, j); + } else if (IS_NVDIMM_PRESENT(mcddrtcfg, j)) { + ndimms += skx_get_nvdimm_info(dimm, imc, i, j, + EDAC_MOD_STR); + nvdimm_count++; + } + } + if (ndimms && !skx_check_ecc(imc->chan[0].cdev)) { + skx_printk(KERN_ERR, "ECC is disabled on imc %d\n", imc->mc); + return -ENODEV; + } + } + + return 0; +} + +#define SKX_MAX_SAD 24 + +#define SKX_GET_SAD(d, i, reg) \ + pci_read_config_dword((d)->sad_all, 0x60 + 8 * (i), &(reg)) +#define SKX_GET_ILV(d, i, reg) \ + pci_read_config_dword((d)->sad_all, 0x64 + 8 * (i), &(reg)) + +#define SKX_SAD_MOD3MODE(sad) GET_BITFIELD((sad), 30, 31) +#define SKX_SAD_MOD3(sad) GET_BITFIELD((sad), 27, 27) +#define SKX_SAD_LIMIT(sad) (((u64)GET_BITFIELD((sad), 7, 26) << 26) | MASK26) +#define SKX_SAD_MOD3ASMOD2(sad) GET_BITFIELD((sad), 5, 6) +#define SKX_SAD_ATTR(sad) GET_BITFIELD((sad), 3, 4) +#define SKX_SAD_INTERLEAVE(sad) GET_BITFIELD((sad), 1, 2) +#define SKX_SAD_ENABLE(sad) GET_BITFIELD((sad), 0, 0) + +#define SKX_ILV_REMOTE(tgt) (((tgt) & 8) == 0) +#define SKX_ILV_TARGET(tgt) ((tgt) & 7) + +static bool skx_sad_decode(struct decoded_addr *res) +{ + struct skx_dev *d = list_first_entry(skx_edac_list, typeof(*d), list); + u64 addr = res->addr; + int i, idx, tgt, lchan, shift; + u32 sad, ilv; + u64 limit, prev_limit; + int remote = 0; + + /* Simple sanity check for I/O space or out of range */ + if (addr >= skx_tohm || (addr >= skx_tolm && addr < BIT_ULL(32))) { + edac_dbg(0, "Address 0x%llx out of range\n", addr); + return false; + } + +restart: + prev_limit = 0; + for (i = 0; i < SKX_MAX_SAD; i++) { + SKX_GET_SAD(d, i, sad); + limit = SKX_SAD_LIMIT(sad); + if (SKX_SAD_ENABLE(sad)) { + if (addr >= prev_limit && addr <= limit) + goto sad_found; + } + prev_limit = limit + 1; + } + edac_dbg(0, "No SAD entry for 0x%llx\n", addr); + return false; + +sad_found: + SKX_GET_ILV(d, i, ilv); + + switch (SKX_SAD_INTERLEAVE(sad)) { + case 0: + idx = GET_BITFIELD(addr, 6, 8); + break; + case 1: + idx = GET_BITFIELD(addr, 8, 10); + break; + case 2: + idx = GET_BITFIELD(addr, 12, 14); + break; + case 3: + idx = GET_BITFIELD(addr, 30, 32); + break; + } + + tgt = GET_BITFIELD(ilv, 4 * idx, 4 * idx + 3); + + /* If point to another node, find it and start over */ + if (SKX_ILV_REMOTE(tgt)) { + if (remote) { + edac_dbg(0, "Double remote!\n"); + return false; + } + remote = 1; + list_for_each_entry(d, skx_edac_list, list) { + if (d->imc[0].src_id == SKX_ILV_TARGET(tgt)) + goto restart; + } + edac_dbg(0, "Can't find node %d\n", SKX_ILV_TARGET(tgt)); + return false; + } + + if (SKX_SAD_MOD3(sad) == 0) { + lchan = SKX_ILV_TARGET(tgt); + } else { + switch (SKX_SAD_MOD3MODE(sad)) { + case 0: + shift = 6; + break; + case 1: + shift = 8; + break; + case 2: + shift = 12; + break; + default: + edac_dbg(0, "illegal mod3mode\n"); + return false; + } + switch (SKX_SAD_MOD3ASMOD2(sad)) { + case 0: + lchan = (addr >> shift) % 3; + break; + case 1: + lchan = (addr >> shift) % 2; + break; + case 2: + lchan = (addr >> shift) % 2; + lchan = (lchan << 1) | !lchan; + break; + case 3: + lchan = ((addr >> shift) % 2) << 1; + break; + } + lchan = (lchan << 1) | (SKX_ILV_TARGET(tgt) & 1); + } + + res->dev = d; + res->socket = d->imc[0].src_id; + res->imc = GET_BITFIELD(d->mcroute, lchan * 3, lchan * 3 + 2); + res->channel = GET_BITFIELD(d->mcroute, lchan * 2 + 18, lchan * 2 + 19); + + edac_dbg(2, "0x%llx: socket=%d imc=%d channel=%d\n", + res->addr, res->socket, res->imc, res->channel); + return true; +} + +#define SKX_MAX_TAD 8 + +#define SKX_GET_TADBASE(d, mc, i, reg) \ + pci_read_config_dword((d)->imc[mc].chan[0].cdev, 0x850 + 4 * (i), &(reg)) +#define SKX_GET_TADWAYNESS(d, mc, i, reg) \ + pci_read_config_dword((d)->imc[mc].chan[0].cdev, 0x880 + 4 * (i), &(reg)) +#define SKX_GET_TADCHNILVOFFSET(d, mc, ch, i, reg) \ + pci_read_config_dword((d)->imc[mc].chan[ch].cdev, 0x90 + 4 * (i), &(reg)) + +#define SKX_TAD_BASE(b) ((u64)GET_BITFIELD((b), 12, 31) << 26) +#define SKX_TAD_SKT_GRAN(b) GET_BITFIELD((b), 4, 5) +#define SKX_TAD_CHN_GRAN(b) GET_BITFIELD((b), 6, 7) +#define SKX_TAD_LIMIT(b) (((u64)GET_BITFIELD((b), 12, 31) << 26) | MASK26) +#define SKX_TAD_OFFSET(b) ((u64)GET_BITFIELD((b), 4, 23) << 26) +#define SKX_TAD_SKTWAYS(b) (1 << GET_BITFIELD((b), 10, 11)) +#define SKX_TAD_CHNWAYS(b) (GET_BITFIELD((b), 8, 9) + 1) + +/* which bit used for both socket and channel interleave */ +static int skx_granularity[] = { 6, 8, 12, 30 }; + +static u64 skx_do_interleave(u64 addr, int shift, int ways, u64 lowbits) +{ + addr >>= shift; + addr /= ways; + addr <<= shift; + + return addr | (lowbits & ((1ull << shift) - 1)); +} + +static bool skx_tad_decode(struct decoded_addr *res) +{ + int i; + u32 base, wayness, chnilvoffset; + int skt_interleave_bit, chn_interleave_bit; + u64 channel_addr; + + for (i = 0; i < SKX_MAX_TAD; i++) { + SKX_GET_TADBASE(res->dev, res->imc, i, base); + SKX_GET_TADWAYNESS(res->dev, res->imc, i, wayness); + if (SKX_TAD_BASE(base) <= res->addr && res->addr <= SKX_TAD_LIMIT(wayness)) + goto tad_found; + } + edac_dbg(0, "No TAD entry for 0x%llx\n", res->addr); + return false; + +tad_found: + res->sktways = SKX_TAD_SKTWAYS(wayness); + res->chanways = SKX_TAD_CHNWAYS(wayness); + skt_interleave_bit = skx_granularity[SKX_TAD_SKT_GRAN(base)]; + chn_interleave_bit = skx_granularity[SKX_TAD_CHN_GRAN(base)]; + + SKX_GET_TADCHNILVOFFSET(res->dev, res->imc, res->channel, i, chnilvoffset); + channel_addr = res->addr - SKX_TAD_OFFSET(chnilvoffset); + + if (res->chanways == 3 && skt_interleave_bit > chn_interleave_bit) { + /* Must handle channel first, then socket */ + channel_addr = skx_do_interleave(channel_addr, chn_interleave_bit, + res->chanways, channel_addr); + channel_addr = skx_do_interleave(channel_addr, skt_interleave_bit, + res->sktways, channel_addr); + } else { + /* Handle socket then channel. Preserve low bits from original address */ + channel_addr = skx_do_interleave(channel_addr, skt_interleave_bit, + res->sktways, res->addr); + channel_addr = skx_do_interleave(channel_addr, chn_interleave_bit, + res->chanways, res->addr); + } + + res->chan_addr = channel_addr; + + edac_dbg(2, "0x%llx: chan_addr=0x%llx sktways=%d chanways=%d\n", + res->addr, res->chan_addr, res->sktways, res->chanways); + return true; +} + +#define SKX_MAX_RIR 4 + +#define SKX_GET_RIRWAYNESS(d, mc, ch, i, reg) \ + pci_read_config_dword((d)->imc[mc].chan[ch].cdev, \ + 0x108 + 4 * (i), &(reg)) +#define SKX_GET_RIRILV(d, mc, ch, idx, i, reg) \ + pci_read_config_dword((d)->imc[mc].chan[ch].cdev, \ + 0x120 + 16 * (idx) + 4 * (i), &(reg)) + +#define SKX_RIR_VALID(b) GET_BITFIELD((b), 31, 31) +#define SKX_RIR_LIMIT(b) (((u64)GET_BITFIELD((b), 1, 11) << 29) | MASK29) +#define SKX_RIR_WAYS(b) (1 << GET_BITFIELD((b), 28, 29)) +#define SKX_RIR_CHAN_RANK(b) GET_BITFIELD((b), 16, 19) +#define SKX_RIR_OFFSET(b) ((u64)(GET_BITFIELD((b), 2, 15) << 26)) + +static bool skx_rir_decode(struct decoded_addr *res) +{ + int i, idx, chan_rank; + int shift; + u32 rirway, rirlv; + u64 rank_addr, prev_limit = 0, limit; + + if (res->dev->imc[res->imc].chan[res->channel].dimms[0].close_pg) + shift = 6; + else + shift = 13; + + for (i = 0; i < SKX_MAX_RIR; i++) { + SKX_GET_RIRWAYNESS(res->dev, res->imc, res->channel, i, rirway); + limit = SKX_RIR_LIMIT(rirway); + if (SKX_RIR_VALID(rirway)) { + if (prev_limit <= res->chan_addr && + res->chan_addr <= limit) + goto rir_found; + } + prev_limit = limit; + } + edac_dbg(0, "No RIR entry for 0x%llx\n", res->addr); + return false; + +rir_found: + rank_addr = res->chan_addr >> shift; + rank_addr /= SKX_RIR_WAYS(rirway); + rank_addr <<= shift; + rank_addr |= res->chan_addr & GENMASK_ULL(shift - 1, 0); + + res->rank_address = rank_addr; + idx = (res->chan_addr >> shift) % SKX_RIR_WAYS(rirway); + + SKX_GET_RIRILV(res->dev, res->imc, res->channel, idx, i, rirlv); + res->rank_address = rank_addr - SKX_RIR_OFFSET(rirlv); + chan_rank = SKX_RIR_CHAN_RANK(rirlv); + res->channel_rank = chan_rank; + res->dimm = chan_rank / 4; + res->rank = chan_rank % 4; + + edac_dbg(2, "0x%llx: dimm=%d rank=%d chan_rank=%d rank_addr=0x%llx\n", + res->addr, res->dimm, res->rank, + res->channel_rank, res->rank_address); + return true; +} + +static u8 skx_close_row[] = { + 15, 16, 17, 18, 20, 21, 22, 28, 10, 11, 12, 13, 29, 30, 31, 32, 33 +}; + +static u8 skx_close_column[] = { + 3, 4, 5, 14, 19, 23, 24, 25, 26, 27 +}; + +static u8 skx_open_row[] = { + 14, 15, 16, 20, 28, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33 +}; + +static u8 skx_open_column[] = { + 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 +}; + +static u8 skx_open_fine_column[] = { + 3, 4, 5, 7, 8, 9, 10, 11, 12, 13 +}; + +static int skx_bits(u64 addr, int nbits, u8 *bits) +{ + int i, res = 0; + + for (i = 0; i < nbits; i++) + res |= ((addr >> bits[i]) & 1) << i; + return res; +} + +static int skx_bank_bits(u64 addr, int b0, int b1, int do_xor, int x0, int x1) +{ + int ret = GET_BITFIELD(addr, b0, b0) | (GET_BITFIELD(addr, b1, b1) << 1); + + if (do_xor) + ret ^= GET_BITFIELD(addr, x0, x0) | (GET_BITFIELD(addr, x1, x1) << 1); + + return ret; +} + +static bool skx_mad_decode(struct decoded_addr *r) +{ + struct skx_dimm *dimm = &r->dev->imc[r->imc].chan[r->channel].dimms[r->dimm]; + int bg0 = dimm->fine_grain_bank ? 6 : 13; + + if (dimm->close_pg) { + r->row = skx_bits(r->rank_address, dimm->rowbits, skx_close_row); + r->column = skx_bits(r->rank_address, dimm->colbits, skx_close_column); + r->column |= 0x400; /* C10 is autoprecharge, always set */ + r->bank_address = skx_bank_bits(r->rank_address, 8, 9, dimm->bank_xor_enable, 22, 28); + r->bank_group = skx_bank_bits(r->rank_address, 6, 7, dimm->bank_xor_enable, 20, 21); + } else { + r->row = skx_bits(r->rank_address, dimm->rowbits, skx_open_row); + if (dimm->fine_grain_bank) + r->column = skx_bits(r->rank_address, dimm->colbits, skx_open_fine_column); + else + r->column = skx_bits(r->rank_address, dimm->colbits, skx_open_column); + r->bank_address = skx_bank_bits(r->rank_address, 18, 19, dimm->bank_xor_enable, 22, 23); + r->bank_group = skx_bank_bits(r->rank_address, bg0, 17, dimm->bank_xor_enable, 20, 21); + } + r->row &= (1u << dimm->rowbits) - 1; + + edac_dbg(2, "0x%llx: row=0x%x col=0x%x bank_addr=%d bank_group=%d\n", + r->addr, r->row, r->column, r->bank_address, + r->bank_group); + return true; +} + +static bool skx_decode(struct decoded_addr *res) +{ + return skx_sad_decode(res) && skx_tad_decode(res) && + skx_rir_decode(res) && skx_mad_decode(res); +} + +static struct notifier_block skx_mce_dec = { + .notifier_call = skx_mce_check_error, + .priority = MCE_PRIO_EDAC, +}; + +/* + * skx_init: + * make sure we are running on the correct cpu model + * search for all the devices we need + * check which DIMMs are present. + */ +static int __init skx_init(void) +{ + const struct x86_cpu_id *id; + const struct munit *m; + const char *owner; + int rc = 0, i, off[3] = {0xd0, 0xd4, 0xd8}; + u8 mc = 0, src_id, node_id; + struct skx_dev *d; + + edac_dbg(2, "\n"); + + owner = edac_get_owner(); + if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR))) + return -EBUSY; + + id = x86_match_cpu(skx_cpuids); + if (!id) + return -ENODEV; + + rc = skx_get_hi_lo(0x2034, off, &skx_tolm, &skx_tohm); + if (rc) + return rc; + + rc = skx_get_all_bus_mappings(0x2016, 0xcc, SKX, &skx_edac_list); + if (rc < 0) + goto fail; + if (rc == 0) { + edac_dbg(2, "No memory controllers found\n"); + return -ENODEV; + } + skx_num_sockets = rc; + + for (m = skx_all_munits; m->did; m++) { + rc = get_all_munits(m); + if (rc < 0) + goto fail; + if (rc != m->per_socket * skx_num_sockets) { + edac_dbg(2, "Expected %d, got %d of 0x%x\n", + m->per_socket * skx_num_sockets, rc, m->did); + rc = -ENODEV; + goto fail; + } + } + + list_for_each_entry(d, skx_edac_list, list) { + rc = skx_get_src_id(d, &src_id); + if (rc < 0) + goto fail; + rc = skx_get_node_id(d, &node_id); + if (rc < 0) + goto fail; + edac_dbg(2, "src_id=%d node_id=%d\n", src_id, node_id); + for (i = 0; i < SKX_NUM_IMC; i++) { + d->imc[i].mc = mc++; + d->imc[i].lmc = i; + d->imc[i].src_id = src_id; + d->imc[i].node_id = node_id; + rc = skx_register_mci(&d->imc[i], d->imc[i].chan[0].cdev, + "Skylake Socket", EDAC_MOD_STR, + skx_get_dimm_config); + if (rc < 0) + goto fail; + } + } + + skx_set_decode(skx_decode); + + if (nvdimm_count && skx_adxl_get() == -ENODEV) + skx_printk(KERN_NOTICE, "Only decoding DDR4 address!\n"); + + /* Ensure that the OPSTATE is set correctly for POLL or NMI */ + opstate_init(); + + setup_skx_debug("skx_test"); + + mce_register_decode_chain(&skx_mce_dec); + + return 0; +fail: + skx_remove(); + return rc; +} + +static void __exit skx_exit(void) +{ + edac_dbg(2, "\n"); + mce_unregister_decode_chain(&skx_mce_dec); + teardown_skx_debug(); + if (nvdimm_count) + skx_adxl_put(); + skx_remove(); +} + +module_init(skx_init); +module_exit(skx_exit); + +module_param(edac_op_state, int, 0444); +MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll,1=NMI"); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Tony Luck"); +MODULE_DESCRIPTION("MC Driver for Intel Skylake server processors"); diff --git a/drivers/edac/skx_edac.c b/drivers/edac/skx_edac.c deleted file mode 100644 index a99ea61dad32..000000000000 --- a/drivers/edac/skx_edac.c +++ /dev/null @@ -1,1357 +0,0 @@ -/* - * EDAC driver for Intel(R) Xeon(R) Skylake processors - * Copyright (c) 2016, Intel Corporation. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. - */ - -#include <linux/module.h> -#include <linux/init.h> -#include <linux/acpi.h> -#include <linux/dmi.h> -#include <linux/pci.h> -#include <linux/pci_ids.h> -#include <linux/slab.h> -#include <linux/delay.h> -#include <linux/edac.h> -#include <linux/mmzone.h> -#include <linux/smp.h> -#include <linux/bitmap.h> -#include <linux/math64.h> -#include <linux/mod_devicetable.h> -#include <linux/adxl.h> -#include <acpi/nfit.h> -#include <asm/cpu_device_id.h> -#include <asm/intel-family.h> -#include <asm/processor.h> -#include <asm/mce.h> - -#include "edac_module.h" - -#define EDAC_MOD_STR "skx_edac" -#define MSG_SIZE 1024 - -/* - * Debug macros - */ -#define skx_printk(level, fmt, arg...) \ - edac_printk(level, "skx", fmt, ##arg) - -#define skx_mc_printk(mci, level, fmt, arg...) \ - edac_mc_chipset_printk(mci, level, "skx", fmt, ##arg) - -/* - * Get a bit field at register value <v>, from bit <lo> to bit <hi> - */ -#define GET_BITFIELD(v, lo, hi) \ - (((v) & GENMASK_ULL((hi), (lo))) >> (lo)) - -static LIST_HEAD(skx_edac_list); - -static u64 skx_tolm, skx_tohm; -static char *skx_msg; -static unsigned int nvdimm_count; - -enum { - INDEX_SOCKET, - INDEX_MEMCTRL, - INDEX_CHANNEL, - INDEX_DIMM, - INDEX_MAX -}; - -static const char * const component_names[] = { - [INDEX_SOCKET] = "ProcessorSocketId", - [INDEX_MEMCTRL] = "MemoryControllerId", - [INDEX_CHANNEL] = "ChannelId", - [INDEX_DIMM] = "DimmSlotId", -}; - -static int component_indices[ARRAY_SIZE(component_names)]; -static int adxl_component_count; -static const char * const *adxl_component_names; -static u64 *adxl_values; -static char *adxl_msg; - -#define NUM_IMC 2 /* memory controllers per socket */ -#define NUM_CHANNELS 3 /* channels per memory controller */ -#define NUM_DIMMS 2 /* Max DIMMS per channel */ - -#define MASK26 0x3FFFFFF /* Mask for 2^26 */ -#define MASK29 0x1FFFFFFF /* Mask for 2^29 */ - -/* - * Each cpu socket contains some pci devices that provide global - * information, and also some that are local to each of the two - * memory controllers on the die. - */ -struct skx_dev { - struct list_head list; - u8 bus[4]; - int seg; - struct pci_dev *sad_all; - struct pci_dev *util_all; - u32 mcroute; - struct skx_imc { - struct mem_ctl_info *mci; - u8 mc; /* system wide mc# */ - u8 lmc; /* socket relative mc# */ - u8 src_id, node_id; - struct skx_channel { - struct pci_dev *cdev; - struct skx_dimm { - u8 close_pg; - u8 bank_xor_enable; - u8 fine_grain_bank; - u8 rowbits; - u8 colbits; - } dimms[NUM_DIMMS]; - } chan[NUM_CHANNELS]; - } imc[NUM_IMC]; -}; -static int skx_num_sockets; - -struct skx_pvt { - struct skx_imc *imc; -}; - -struct decoded_addr { - struct skx_dev *dev; - u64 addr; - int socket; - int imc; - int channel; - u64 chan_addr; - int sktways; - int chanways; - int dimm; - int rank; - int channel_rank; - u64 rank_address; - int row; - int column; - int bank_address; - int bank_group; -}; - -static struct skx_dev *get_skx_dev(struct pci_bus *bus, u8 idx) -{ - struct skx_dev *d; - - list_for_each_entry(d, &skx_edac_list, list) { - if (d->seg == pci_domain_nr(bus) && d->bus[idx] == bus->number) - return d; - } - - return NULL; -} - -enum munittype { - CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD -}; - -struct munit { - u16 did; - u16 devfn[NUM_IMC]; - u8 busidx; - u8 per_socket; - enum munittype mtype; -}; - -/* - * List of PCI device ids that we need together with some device - * number and function numbers to tell which memory controller the - * device belongs to. - */ -static const struct munit skx_all_munits[] = { - { 0x2054, { }, 1, 1, SAD_ALL }, - { 0x2055, { }, 1, 1, UTIL_ALL }, - { 0x2040, { PCI_DEVFN(10, 0), PCI_DEVFN(12, 0) }, 2, 2, CHAN0 }, - { 0x2044, { PCI_DEVFN(10, 4), PCI_DEVFN(12, 4) }, 2, 2, CHAN1 }, - { 0x2048, { PCI_DEVFN(11, 0), PCI_DEVFN(13, 0) }, 2, 2, CHAN2 }, - { 0x208e, { }, 1, 0, SAD }, - { } -}; - -/* - * We use the per-socket device 0x2016 to count how many sockets are present, - * and to detemine which PCI buses are associated with each socket. Allocate - * and build the full list of all the skx_dev structures that we need here. - */ -static int get_all_bus_mappings(void) -{ - struct pci_dev *pdev, *prev; - struct skx_dev *d; - u32 reg; - int ndev = 0; - - prev = NULL; - for (;;) { - pdev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x2016, prev); - if (!pdev) - break; - ndev++; - d = kzalloc(sizeof(*d), GFP_KERNEL); - if (!d) { - pci_dev_put(pdev); - return -ENOMEM; - } - d->seg = pci_domain_nr(pdev->bus); - pci_read_config_dword(pdev, 0xCC, ®); - d->bus[0] = GET_BITFIELD(reg, 0, 7); - d->bus[1] = GET_BITFIELD(reg, 8, 15); - d->bus[2] = GET_BITFIELD(reg, 16, 23); - d->bus[3] = GET_BITFIELD(reg, 24, 31); - edac_dbg(2, "busses: %x, %x, %x, %x\n", - d->bus[0], d->bus[1], d->bus[2], d->bus[3]); - list_add_tail(&d->list, &skx_edac_list); - skx_num_sockets++; - prev = pdev; - } - - return ndev; -} - -static int get_all_munits(const struct munit *m) -{ - struct pci_dev *pdev, *prev; - struct skx_dev *d; - u32 reg; - int i = 0, ndev = 0; - - prev = NULL; - for (;;) { - pdev = pci_get_device(PCI_VENDOR_ID_INTEL, m->did, prev); - if (!pdev) - break; - ndev++; - if (m->per_socket == NUM_IMC) { - for (i = 0; i < NUM_IMC; i++) - if (m->devfn[i] == pdev->devfn) - break; - if (i == NUM_IMC) - goto fail; - } - d = get_skx_dev(pdev->bus, m->busidx); - if (!d) - goto fail; - - /* Be sure that the device is enabled */ - if (unlikely(pci_enable_device(pdev) < 0)) { - skx_printk(KERN_ERR, - "Couldn't enable %04x:%04x\n", PCI_VENDOR_ID_INTEL, m->did); - goto fail; - } - - switch (m->mtype) { - case CHAN0: case CHAN1: case CHAN2: - pci_dev_get(pdev); - d->imc[i].chan[m->mtype].cdev = pdev; - break; - case SAD_ALL: - pci_dev_get(pdev); - d->sad_all = pdev; - break; - case UTIL_ALL: - pci_dev_get(pdev); - d->util_all = pdev; - break; - case SAD: - /* - * one of these devices per core, including cores - * that don't exist on this SKU. Ignore any that - * read a route table of zero, make sure all the - * non-zero values match. - */ - pci_read_config_dword(pdev, 0xB4, ®); - if (reg != 0) { - if (d->mcroute == 0) - d->mcroute = reg; - else if (d->mcroute != reg) { - skx_printk(KERN_ERR, - "mcroute mismatch\n"); - goto fail; - } - } - ndev--; - break; - } - - prev = pdev; - } - - return ndev; -fail: - pci_dev_put(pdev); - return -ENODEV; -} - -static const struct x86_cpu_id skx_cpuids[] = { - { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X, 0, 0 }, - { } -}; -MODULE_DEVICE_TABLE(x86cpu, skx_cpuids); - -static u8 get_src_id(struct skx_dev *d) -{ - u32 reg; - - pci_read_config_dword(d->util_all, 0xF0, ®); - - return GET_BITFIELD(reg, 12, 14); -} - -static u8 skx_get_node_id(struct skx_dev *d) -{ - u32 reg; - - pci_read_config_dword(d->util_all, 0xF4, ®); - - return GET_BITFIELD(reg, 0, 2); -} - -static int get_dimm_attr(u32 reg, int lobit, int hibit, int add, int minval, - int maxval, char *name) -{ - u32 val = GET_BITFIELD(reg, lobit, hibit); - - if (val < minval || val > maxval) { - edac_dbg(2, "bad %s = %d (raw=%x)\n", name, val, reg); - return -EINVAL; - } - return val + add; -} - -#define IS_DIMM_PRESENT(mtr) GET_BITFIELD((mtr), 15, 15) -#define IS_NVDIMM_PRESENT(mcddrtcfg, i) GET_BITFIELD((mcddrtcfg), (i), (i)) - -#define numrank(reg) get_dimm_attr((reg), 12, 13, 0, 0, 2, "ranks") -#define numrow(reg) get_dimm_attr((reg), 2, 4, 12, 1, 6, "rows") -#define numcol(reg) get_dimm_attr((reg), 0, 1, 10, 0, 2, "cols") - -static int get_width(u32 mtr) -{ - switch (GET_BITFIELD(mtr, 8, 9)) { - case 0: - return DEV_X4; - case 1: - return DEV_X8; - case 2: - return DEV_X16; - } - return DEV_UNKNOWN; -} - -static int skx_get_hi_lo(void) -{ - struct pci_dev *pdev; - u32 reg; - - pdev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x2034, NULL); - if (!pdev) { - edac_dbg(0, "Can't get tolm/tohm\n"); - return -ENODEV; - } - - pci_read_config_dword(pdev, 0xD0, ®); - skx_tolm = reg; - pci_read_config_dword(pdev, 0xD4, ®); - skx_tohm = reg; - pci_read_config_dword(pdev, 0xD8, ®); - skx_tohm |= (u64)reg << 32; - - pci_dev_put(pdev); - edac_dbg(2, "tolm=%llx tohm=%llx\n", skx_tolm, skx_tohm); - - return 0; -} - -static int get_dimm_info(u32 mtr, u32 amap, struct dimm_info *dimm, - struct skx_imc *imc, int chan, int dimmno) -{ - int banks = 16, ranks, rows, cols, npages; - u64 size; - - ranks = numrank(mtr); - rows = numrow(mtr); - cols = numcol(mtr); - - /* - * Compute size in 8-byte (2^3) words, then shift to MiB (2^20) - */ - size = ((1ull << (rows + cols + ranks)) * banks) >> (20 - 3); - npages = MiB_TO_PAGES(size); - - edac_dbg(0, "mc#%d: channel %d, dimm %d, %lld MiB (%d pages) bank: %d, rank: %d, row: %#x, col: %#x\n", - imc->mc, chan, dimmno, size, npages, - banks, 1 << ranks, rows, cols); - - imc->chan[chan].dimms[dimmno].close_pg = GET_BITFIELD(mtr, 0, 0); - imc->chan[chan].dimms[dimmno].bank_xor_enable = GET_BITFIELD(mtr, 9, 9); - imc->chan[chan].dimms[dimmno].fine_grain_bank = GET_BITFIELD(amap, 0, 0); - imc->chan[chan].dimms[dimmno].rowbits = rows; - imc->chan[chan].dimms[dimmno].colbits = cols; - - dimm->nr_pages = npages; - dimm->grain = 32; - dimm->dtype = get_width(mtr); - dimm->mtype = MEM_DDR4; - dimm->edac_mode = EDAC_SECDED; /* likely better than this */ - snprintf(dimm->label, sizeof(dimm->label), "CPU_SrcID#%u_MC#%u_Chan#%u_DIMM#%u", - imc->src_id, imc->lmc, chan, dimmno); - - return 1; -} - -static int get_nvdimm_info(struct dimm_info *dimm, struct skx_imc *imc, - int chan, int dimmno) -{ - int smbios_handle; - u32 dev_handle; - u16 flags; - u64 size = 0; - - nvdimm_count++; - - dev_handle = ACPI_NFIT_BUILD_DEVICE_HANDLE(dimmno, chan, imc->lmc, - imc->src_id, 0); - - smbios_handle = nfit_get_smbios_id(dev_handle, &flags); - if (smbios_handle == -EOPNOTSUPP) { - pr_warn_once(EDAC_MOD_STR ": Can't find size of NVDIMM. Try enabling CONFIG_ACPI_NFIT\n"); - goto unknown_size; - } - - if (smbios_handle < 0) { - skx_printk(KERN_ERR, "Can't find handle for NVDIMM ADR=%x\n", dev_handle); - goto unknown_size; - } - - if (flags & ACPI_NFIT_MEM_MAP_FAILED) { - skx_printk(KERN_ERR, "NVDIMM ADR=%x is not mapped\n", dev_handle); - goto unknown_size; - } - - size = dmi_memdev_size(smbios_handle); - if (size == ~0ull) - skx_printk(KERN_ERR, "Can't find size for NVDIMM ADR=%x/SMBIOS=%x\n", - dev_handle, smbios_handle); - -unknown_size: - dimm->nr_pages = size >> PAGE_SHIFT; - dimm->grain = 32; - dimm->dtype = DEV_UNKNOWN; - dimm->mtype = MEM_NVDIMM; - dimm->edac_mode = EDAC_SECDED; /* likely better than this */ - - edac_dbg(0, "mc#%d: channel %d, dimm %d, %llu MiB (%u pages)\n", - imc->mc, chan, dimmno, size >> 20, dimm->nr_pages); - - snprintf(dimm->label, sizeof(dimm->label), "CPU_SrcID#%u_MC#%u_Chan#%u_DIMM#%u", - imc->src_id, imc->lmc, chan, dimmno); - - return (size == 0 || size == ~0ull) ? 0 : 1; -} - -#define SKX_GET_MTMTR(dev, reg) \ - pci_read_config_dword((dev), 0x87c, ®) - -static bool skx_check_ecc(struct pci_dev *pdev) -{ - u32 mtmtr; - - SKX_GET_MTMTR(pdev, mtmtr); - - return !!GET_BITFIELD(mtmtr, 2, 2); -} - -static int skx_get_dimm_config(struct mem_ctl_info *mci) -{ - struct skx_pvt *pvt = mci->pvt_info; - struct skx_imc *imc = pvt->imc; - u32 mtr, amap, mcddrtcfg; - struct dimm_info *dimm; - int i, j; - int ndimms; - - for (i = 0; i < NUM_CHANNELS; i++) { - ndimms = 0; - pci_read_config_dword(imc->chan[i].cdev, 0x8C, &amap); - pci_read_config_dword(imc->chan[i].cdev, 0x400, &mcddrtcfg); - for (j = 0; j < NUM_DIMMS; j++) { - dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms, - mci->n_layers, i, j, 0); - pci_read_config_dword(imc->chan[i].cdev, - 0x80 + 4*j, &mtr); - if (IS_DIMM_PRESENT(mtr)) - ndimms += get_dimm_info(mtr, amap, dimm, imc, i, j); - else if (IS_NVDIMM_PRESENT(mcddrtcfg, j)) - ndimms += get_nvdimm_info(dimm, imc, i, j); - } - if (ndimms && !skx_check_ecc(imc->chan[0].cdev)) { - skx_printk(KERN_ERR, "ECC is disabled on imc %d\n", imc->mc); - return -ENODEV; - } - } - - return 0; -} - -static void skx_unregister_mci(struct skx_imc *imc) -{ - struct mem_ctl_info *mci = imc->mci; - - if (!mci) - return; - - edac_dbg(0, "MC%d: mci = %p\n", imc->mc, mci); - - /* Remove MC sysfs nodes */ - edac_mc_del_mc(mci->pdev); - - edac_dbg(1, "%s: free mci struct\n", mci->ctl_name); - kfree(mci->ctl_name); - edac_mc_free(mci); -} - -static int skx_register_mci(struct skx_imc *imc) -{ - struct mem_ctl_info *mci; - struct edac_mc_layer layers[2]; - struct pci_dev *pdev = imc->chan[0].cdev; - struct skx_pvt *pvt; - int rc; - - /* allocate a new MC control structure */ - layers[0].type = EDAC_MC_LAYER_CHANNEL; - layers[0].size = NUM_CHANNELS; - layers[0].is_virt_csrow = false; - layers[1].type = EDAC_MC_LAYER_SLOT; - layers[1].size = NUM_DIMMS; - layers[1].is_virt_csrow = true; - mci = edac_mc_alloc(imc->mc, ARRAY_SIZE(layers), layers, - sizeof(struct skx_pvt)); - - if (unlikely(!mci)) - return -ENOMEM; - - edac_dbg(0, "MC#%d: mci = %p\n", imc->mc, mci); - - /* Associate skx_dev and mci for future usage */ - imc->mci = mci; - pvt = mci->pvt_info; - pvt->imc = imc; - - mci->ctl_name = kasprintf(GFP_KERNEL, "Skylake Socket#%d IMC#%d", - imc->node_id, imc->lmc); - if (!mci->ctl_name) { - rc = -ENOMEM; - goto fail0; - } - - mci->mtype_cap = MEM_FLAG_DDR4 | MEM_FLAG_NVDIMM; - mci->edac_ctl_cap = EDAC_FLAG_NONE; - mci->edac_cap = EDAC_FLAG_NONE; - mci->mod_name = EDAC_MOD_STR; - mci->dev_name = pci_name(imc->chan[0].cdev); - mci->ctl_page_to_phys = NULL; - - rc = skx_get_dimm_config(mci); - if (rc < 0) - goto fail; - - /* record ptr to the generic device */ - mci->pdev = &pdev->dev; - - /* add this new MC control structure to EDAC's list of MCs */ - if (unlikely(edac_mc_add_mc(mci))) { - edac_dbg(0, "MC: failed edac_mc_add_mc()\n"); - rc = -EINVAL; - goto fail; - } - - return 0; - -fail: - kfree(mci->ctl_name); -fail0: - edac_mc_free(mci); - imc->mci = NULL; - return rc; -} - -#define SKX_MAX_SAD 24 - -#define SKX_GET_SAD(d, i, reg) \ - pci_read_config_dword((d)->sad_all, 0x60 + 8 * (i), ®) -#define SKX_GET_ILV(d, i, reg) \ - pci_read_config_dword((d)->sad_all, 0x64 + 8 * (i), ®) - -#define SKX_SAD_MOD3MODE(sad) GET_BITFIELD((sad), 30, 31) -#define SKX_SAD_MOD3(sad) GET_BITFIELD((sad), 27, 27) -#define SKX_SAD_LIMIT(sad) (((u64)GET_BITFIELD((sad), 7, 26) << 26) | MASK26) -#define SKX_SAD_MOD3ASMOD2(sad) GET_BITFIELD((sad), 5, 6) -#define SKX_SAD_ATTR(sad) GET_BITFIELD((sad), 3, 4) -#define SKX_SAD_INTERLEAVE(sad) GET_BITFIELD((sad), 1, 2) -#define SKX_SAD_ENABLE(sad) GET_BITFIELD((sad), 0, 0) - -#define SKX_ILV_REMOTE(tgt) (((tgt) & 8) == 0) -#define SKX_ILV_TARGET(tgt) ((tgt) & 7) - -static bool skx_sad_decode(struct decoded_addr *res) -{ - struct skx_dev *d = list_first_entry(&skx_edac_list, typeof(*d), list); - u64 addr = res->addr; - int i, idx, tgt, lchan, shift; - u32 sad, ilv; - u64 limit, prev_limit; - int remote = 0; - - /* Simple sanity check for I/O space or out of range */ - if (addr >= skx_tohm || (addr >= skx_tolm && addr < BIT_ULL(32))) { - edac_dbg(0, "Address %llx out of range\n", addr); - return false; - } - -restart: - prev_limit = 0; - for (i = 0; i < SKX_MAX_SAD; i++) { - SKX_GET_SAD(d, i, sad); - limit = SKX_SAD_LIMIT(sad); - if (SKX_SAD_ENABLE(sad)) { - if (addr >= prev_limit && addr <= limit) - goto sad_found; - } - prev_limit = limit + 1; - } - edac_dbg(0, "No SAD entry for %llx\n", addr); - return false; - -sad_found: - SKX_GET_ILV(d, i, ilv); - - switch (SKX_SAD_INTERLEAVE(sad)) { - case 0: - idx = GET_BITFIELD(addr, 6, 8); - break; - case 1: - idx = GET_BITFIELD(addr, 8, 10); - break; - case 2: - idx = GET_BITFIELD(addr, 12, 14); - break; - case 3: - idx = GET_BITFIELD(addr, 30, 32); - break; - } - - tgt = GET_BITFIELD(ilv, 4 * idx, 4 * idx + 3); - - /* If point to another node, find it and start over */ - if (SKX_ILV_REMOTE(tgt)) { - if (remote) { - edac_dbg(0, "Double remote!\n"); - return false; - } - remote = 1; - list_for_each_entry(d, &skx_edac_list, list) { - if (d->imc[0].src_id == SKX_ILV_TARGET(tgt)) - goto restart; - } - edac_dbg(0, "Can't find node %d\n", SKX_ILV_TARGET(tgt)); - return false; - } - - if (SKX_SAD_MOD3(sad) == 0) - lchan = SKX_ILV_TARGET(tgt); - else { - switch (SKX_SAD_MOD3MODE(sad)) { - case 0: - shift = 6; - break; - case 1: - shift = 8; - break; - case 2: - shift = 12; - break; - default: - edac_dbg(0, "illegal mod3mode\n"); - return false; - } - switch (SKX_SAD_MOD3ASMOD2(sad)) { - case 0: - lchan = (addr >> shift) % 3; - break; - case 1: - lchan = (addr >> shift) % 2; - break; - case 2: - lchan = (addr >> shift) % 2; - lchan = (lchan << 1) | !lchan; - break; - case 3: - lchan = ((addr >> shift) % 2) << 1; - break; - } - lchan = (lchan << 1) | (SKX_ILV_TARGET(tgt) & 1); - } - - res->dev = d; - res->socket = d->imc[0].src_id; - res->imc = GET_BITFIELD(d->mcroute, lchan * 3, lchan * 3 + 2); - res->channel = GET_BITFIELD(d->mcroute, lchan * 2 + 18, lchan * 2 + 19); - - edac_dbg(2, "%llx: socket=%d imc=%d channel=%d\n", - res->addr, res->socket, res->imc, res->channel); - return true; -} - -#define SKX_MAX_TAD 8 - -#define SKX_GET_TADBASE(d, mc, i, reg) \ - pci_read_config_dword((d)->imc[mc].chan[0].cdev, 0x850 + 4 * (i), ®) -#define SKX_GET_TADWAYNESS(d, mc, i, reg) \ - pci_read_config_dword((d)->imc[mc].chan[0].cdev, 0x880 + 4 * (i), ®) -#define SKX_GET_TADCHNILVOFFSET(d, mc, ch, i, reg) \ - pci_read_config_dword((d)->imc[mc].chan[ch].cdev, 0x90 + 4 * (i), ®) - -#define SKX_TAD_BASE(b) ((u64)GET_BITFIELD((b), 12, 31) << 26) -#define SKX_TAD_SKT_GRAN(b) GET_BITFIELD((b), 4, 5) -#define SKX_TAD_CHN_GRAN(b) GET_BITFIELD((b), 6, 7) -#define SKX_TAD_LIMIT(b) (((u64)GET_BITFIELD((b), 12, 31) << 26) | MASK26) -#define SKX_TAD_OFFSET(b) ((u64)GET_BITFIELD((b), 4, 23) << 26) -#define SKX_TAD_SKTWAYS(b) (1 << GET_BITFIELD((b), 10, 11)) -#define SKX_TAD_CHNWAYS(b) (GET_BITFIELD((b), 8, 9) + 1) - -/* which bit used for both socket and channel interleave */ -static int skx_granularity[] = { 6, 8, 12, 30 }; - -static u64 skx_do_interleave(u64 addr, int shift, int ways, u64 lowbits) -{ - addr >>= shift; - addr /= ways; - addr <<= shift; - - return addr | (lowbits & ((1ull << shift) - 1)); -} - -static bool skx_tad_decode(struct decoded_addr *res) -{ - int i; - u32 base, wayness, chnilvoffset; - int skt_interleave_bit, chn_interleave_bit; - u64 channel_addr; - - for (i = 0; i < SKX_MAX_TAD; i++) { - SKX_GET_TADBASE(res->dev, res->imc, i, base); - SKX_GET_TADWAYNESS(res->dev, res->imc, i, wayness); - if (SKX_TAD_BASE(base) <= res->addr && res->addr <= SKX_TAD_LIMIT(wayness)) - goto tad_found; - } - edac_dbg(0, "No TAD entry for %llx\n", res->addr); - return false; - -tad_found: - res->sktways = SKX_TAD_SKTWAYS(wayness); - res->chanways = SKX_TAD_CHNWAYS(wayness); - skt_interleave_bit = skx_granularity[SKX_TAD_SKT_GRAN(base)]; - chn_interleave_bit = skx_granularity[SKX_TAD_CHN_GRAN(base)]; - - SKX_GET_TADCHNILVOFFSET(res->dev, res->imc, res->channel, i, chnilvoffset); - channel_addr = res->addr - SKX_TAD_OFFSET(chnilvoffset); - - if (res->chanways == 3 && skt_interleave_bit > chn_interleave_bit) { - /* Must handle channel first, then socket */ - channel_addr = skx_do_interleave(channel_addr, chn_interleave_bit, - res->chanways, channel_addr); - channel_addr = skx_do_interleave(channel_addr, skt_interleave_bit, - res->sktways, channel_addr); - } else { - /* Handle socket then channel. Preserve low bits from original address */ - channel_addr = skx_do_interleave(channel_addr, skt_interleave_bit, - res->sktways, res->addr); - channel_addr = skx_do_interleave(channel_addr, chn_interleave_bit, - res->chanways, res->addr); - } - - res->chan_addr = channel_addr; - - edac_dbg(2, "%llx: chan_addr=%llx sktways=%d chanways=%d\n", - res->addr, res->chan_addr, res->sktways, res->chanways); - return true; -} - -#define SKX_MAX_RIR 4 - -#define SKX_GET_RIRWAYNESS(d, mc, ch, i, reg) \ - pci_read_config_dword((d)->imc[mc].chan[ch].cdev, \ - 0x108 + 4 * (i), ®) -#define SKX_GET_RIRILV(d, mc, ch, idx, i, reg) \ - pci_read_config_dword((d)->imc[mc].chan[ch].cdev, \ - 0x120 + 16 * idx + 4 * (i), ®) - -#define SKX_RIR_VALID(b) GET_BITFIELD((b), 31, 31) -#define SKX_RIR_LIMIT(b) (((u64)GET_BITFIELD((b), 1, 11) << 29) | MASK29) -#define SKX_RIR_WAYS(b) (1 << GET_BITFIELD((b), 28, 29)) -#define SKX_RIR_CHAN_RANK(b) GET_BITFIELD((b), 16, 19) -#define SKX_RIR_OFFSET(b) ((u64)(GET_BITFIELD((b), 2, 15) << 26)) - -static bool skx_rir_decode(struct decoded_addr *res) -{ - int i, idx, chan_rank; - int shift; - u32 rirway, rirlv; - u64 rank_addr, prev_limit = 0, limit; - - if (res->dev->imc[res->imc].chan[res->channel].dimms[0].close_pg) - shift = 6; - else - shift = 13; - - for (i = 0; i < SKX_MAX_RIR; i++) { - SKX_GET_RIRWAYNESS(res->dev, res->imc, res->channel, i, rirway); - limit = SKX_RIR_LIMIT(rirway); - if (SKX_RIR_VALID(rirway)) { - if (prev_limit <= res->chan_addr && - res->chan_addr <= limit) - goto rir_found; - } - prev_limit = limit; - } - edac_dbg(0, "No RIR entry for %llx\n", res->addr); - return false; - -rir_found: - rank_addr = res->chan_addr >> shift; - rank_addr /= SKX_RIR_WAYS(rirway); - rank_addr <<= shift; - rank_addr |= res->chan_addr & GENMASK_ULL(shift - 1, 0); - - res->rank_address = rank_addr; - idx = (res->chan_addr >> shift) % SKX_RIR_WAYS(rirway); - - SKX_GET_RIRILV(res->dev, res->imc, res->channel, idx, i, rirlv); - res->rank_address = rank_addr - SKX_RIR_OFFSET(rirlv); - chan_rank = SKX_RIR_CHAN_RANK(rirlv); - res->channel_rank = chan_rank; - res->dimm = chan_rank / 4; - res->rank = chan_rank % 4; - - edac_dbg(2, "%llx: dimm=%d rank=%d chan_rank=%d rank_addr=%llx\n", - res->addr, res->dimm, res->rank, - res->channel_rank, res->rank_address); - return true; -} - -static u8 skx_close_row[] = { - 15, 16, 17, 18, 20, 21, 22, 28, 10, 11, 12, 13, 29, 30, 31, 32, 33 -}; -static u8 skx_close_column[] = { - 3, 4, 5, 14, 19, 23, 24, 25, 26, 27 -}; -static u8 skx_open_row[] = { - 14, 15, 16, 20, 28, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33 -}; -static u8 skx_open_column[] = { - 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 -}; -static u8 skx_open_fine_column[] = { - 3, 4, 5, 7, 8, 9, 10, 11, 12, 13 -}; - -static int skx_bits(u64 addr, int nbits, u8 *bits) -{ - int i, res = 0; - - for (i = 0; i < nbits; i++) - res |= ((addr >> bits[i]) & 1) << i; - return res; -} - -static int skx_bank_bits(u64 addr, int b0, int b1, int do_xor, int x0, int x1) -{ - int ret = GET_BITFIELD(addr, b0, b0) | (GET_BITFIELD(addr, b1, b1) << 1); - - if (do_xor) - ret ^= GET_BITFIELD(addr, x0, x0) | (GET_BITFIELD(addr, x1, x1) << 1); - - return ret; -} - -static bool skx_mad_decode(struct decoded_addr *r) -{ - struct skx_dimm *dimm = &r->dev->imc[r->imc].chan[r->channel].dimms[r->dimm]; - int bg0 = dimm->fine_grain_bank ? 6 : 13; - - if (dimm->close_pg) { - r->row = skx_bits(r->rank_address, dimm->rowbits, skx_close_row); - r->column = skx_bits(r->rank_address, dimm->colbits, skx_close_column); - r->column |= 0x400; /* C10 is autoprecharge, always set */ - r->bank_address = skx_bank_bits(r->rank_address, 8, 9, dimm->bank_xor_enable, 22, 28); - r->bank_group = skx_bank_bits(r->rank_address, 6, 7, dimm->bank_xor_enable, 20, 21); - } else { - r->row = skx_bits(r->rank_address, dimm->rowbits, skx_open_row); - if (dimm->fine_grain_bank) - r->column = skx_bits(r->rank_address, dimm->colbits, skx_open_fine_column); - else - r->column = skx_bits(r->rank_address, dimm->colbits, skx_open_column); - r->bank_address = skx_bank_bits(r->rank_address, 18, 19, dimm->bank_xor_enable, 22, 23); - r->bank_group = skx_bank_bits(r->rank_address, bg0, 17, dimm->bank_xor_enable, 20, 21); - } - r->row &= (1u << dimm->rowbits) - 1; - - edac_dbg(2, "%llx: row=%x col=%x bank_addr=%d bank_group=%d\n", - r->addr, r->row, r->column, r->bank_address, - r->bank_group); - return true; -} - -static bool skx_decode(struct decoded_addr *res) -{ - - return skx_sad_decode(res) && skx_tad_decode(res) && - skx_rir_decode(res) && skx_mad_decode(res); -} - -#ifdef CONFIG_EDAC_DEBUG -/* - * Debug feature. Make /sys/kernel/debug/skx_edac_test/addr. - * Write an address to this file to exercise the address decode - * logic in this driver. - */ -static struct dentry *skx_test; -static u64 skx_fake_addr; - -static int debugfs_u64_set(void *data, u64 val) -{ - struct decoded_addr res; - - res.addr = val; - skx_decode(&res); - - return 0; -} - -DEFINE_SIMPLE_ATTRIBUTE(fops_u64_wo, NULL, debugfs_u64_set, "%llu\n"); - -static struct dentry *mydebugfs_create(const char *name, umode_t mode, - struct dentry *parent, u64 *value) -{ - return debugfs_create_file(name, mode, parent, value, &fops_u64_wo); -} - -static void setup_skx_debug(void) -{ - skx_test = debugfs_create_dir("skx_edac_test", NULL); - mydebugfs_create("addr", S_IWUSR, skx_test, &skx_fake_addr); -} - -static void teardown_skx_debug(void) -{ - debugfs_remove_recursive(skx_test); -} -#else -static void setup_skx_debug(void) -{ -} - -static void teardown_skx_debug(void) -{ -} -#endif /*CONFIG_EDAC_DEBUG*/ - -static bool skx_adxl_decode(struct decoded_addr *res) - -{ - int i, len = 0; - - if (res->addr >= skx_tohm || (res->addr >= skx_tolm && - res->addr < BIT_ULL(32))) { - edac_dbg(0, "Address 0x%llx out of range\n", res->addr); - return false; - } - - if (adxl_decode(res->addr, adxl_values)) { - edac_dbg(0, "Failed to decode 0x%llx\n", res->addr); - return false; - } - - res->socket = (int)adxl_values[component_indices[INDEX_SOCKET]]; - res->imc = (int)adxl_values[component_indices[INDEX_MEMCTRL]]; - res->channel = (int)adxl_values[component_indices[INDEX_CHANNEL]]; - res->dimm = (int)adxl_values[component_indices[INDEX_DIMM]]; - - for (i = 0; i < adxl_component_count; i++) { - if (adxl_values[i] == ~0x0ull) - continue; - - len += snprintf(adxl_msg + len, MSG_SIZE - len, " %s:0x%llx", - adxl_component_names[i], adxl_values[i]); - if (MSG_SIZE - len <= 0) - break; - } - - return true; -} - -static void skx_mce_output_error(struct mem_ctl_info *mci, - const struct mce *m, - struct decoded_addr *res) -{ - enum hw_event_mc_err_type tp_event; - char *type, *optype; - bool ripv = GET_BITFIELD(m->mcgstatus, 0, 0); - bool overflow = GET_BITFIELD(m->status, 62, 62); - bool uncorrected_error = GET_BITFIELD(m->status, 61, 61); - bool recoverable; - u32 core_err_cnt = GET_BITFIELD(m->status, 38, 52); - u32 mscod = GET_BITFIELD(m->status, 16, 31); - u32 errcode = GET_BITFIELD(m->status, 0, 15); - u32 optypenum = GET_BITFIELD(m->status, 4, 6); - - recoverable = GET_BITFIELD(m->status, 56, 56); - - if (uncorrected_error) { - core_err_cnt = 1; - if (ripv) { - type = "FATAL"; - tp_event = HW_EVENT_ERR_FATAL; - } else { - type = "NON_FATAL"; - tp_event = HW_EVENT_ERR_UNCORRECTED; - } - } else { - type = "CORRECTED"; - tp_event = HW_EVENT_ERR_CORRECTED; - } - - /* - * According with Table 15-9 of the Intel Architecture spec vol 3A, - * memory errors should fit in this mask: - * 000f 0000 1mmm cccc (binary) - * where: - * f = Correction Report Filtering Bit. If 1, subsequent errors - * won't be shown - * mmm = error type - * cccc = channel - * If the mask doesn't match, report an error to the parsing logic - */ - if (!((errcode & 0xef80) == 0x80)) { - optype = "Can't parse: it is not a mem"; - } else { - switch (optypenum) { - case 0: - optype = "generic undef request error"; - break; - case 1: - optype = "memory read error"; - break; - case 2: - optype = "memory write error"; - break; - case 3: - optype = "addr/cmd error"; - break; - case 4: - optype = "memory scrubbing error"; - break; - default: - optype = "reserved"; - break; - } - } - if (adxl_component_count) { - snprintf(skx_msg, MSG_SIZE, "%s%s err_code:%04x:%04x %s", - overflow ? " OVERFLOW" : "", - (uncorrected_error && recoverable) ? " recoverable" : "", - mscod, errcode, adxl_msg); - } else { - snprintf(skx_msg, MSG_SIZE, - "%s%s err_code:%04x:%04x socket:%d imc:%d rank:%d bg:%d ba:%d row:%x col:%x", - overflow ? " OVERFLOW" : "", - (uncorrected_error && recoverable) ? " recoverable" : "", - mscod, errcode, - res->socket, res->imc, res->rank, - res->bank_group, res->bank_address, res->row, res->column); - } - - edac_dbg(0, "%s\n", skx_msg); - - /* Call the helper to output message */ - edac_mc_handle_error(tp_event, mci, core_err_cnt, - m->addr >> PAGE_SHIFT, m->addr & ~PAGE_MASK, 0, - res->channel, res->dimm, -1, - optype, skx_msg); -} - -static struct mem_ctl_info *get_mci(int src_id, int lmc) -{ - struct skx_dev *d; - - if (lmc > NUM_IMC - 1) { - skx_printk(KERN_ERR, "Bad lmc %d\n", lmc); - return NULL; - } - - list_for_each_entry(d, &skx_edac_list, list) { - if (d->imc[0].src_id == src_id) - return d->imc[lmc].mci; - } - - skx_printk(KERN_ERR, "No mci for src_id %d lmc %d\n", src_id, lmc); - - return NULL; -} - -static int skx_mce_check_error(struct notifier_block *nb, unsigned long val, - void *data) -{ - struct mce *mce = (struct mce *)data; - struct decoded_addr res; - struct mem_ctl_info *mci; - char *type; - - if (edac_get_report_status() == EDAC_REPORTING_DISABLED) - return NOTIFY_DONE; - - /* ignore unless this is memory related with an address */ - if ((mce->status & 0xefff) >> 7 != 1 || !(mce->status & MCI_STATUS_ADDRV)) - return NOTIFY_DONE; - - memset(&res, 0, sizeof(res)); - res.addr = mce->addr; - - if (adxl_component_count) { - if (!skx_adxl_decode(&res)) - return NOTIFY_DONE; - - mci = get_mci(res.socket, res.imc); - } else { - if (!skx_decode(&res)) - return NOTIFY_DONE; - - mci = res.dev->imc[res.imc].mci; - } - - if (!mci) - return NOTIFY_DONE; - - if (mce->mcgstatus & MCG_STATUS_MCIP) - type = "Exception"; - else - type = "Event"; - - skx_mc_printk(mci, KERN_DEBUG, "HANDLING MCE MEMORY ERROR\n"); - - skx_mc_printk(mci, KERN_DEBUG, "CPU %d: Machine Check %s: %Lx " - "Bank %d: %016Lx\n", mce->extcpu, type, - mce->mcgstatus, mce->bank, mce->status); - skx_mc_printk(mci, KERN_DEBUG, "TSC %llx ", mce->tsc); - skx_mc_printk(mci, KERN_DEBUG, "ADDR %llx ", mce->addr); - skx_mc_printk(mci, KERN_DEBUG, "MISC %llx ", mce->misc); - - skx_mc_printk(mci, KERN_DEBUG, "PROCESSOR %u:%x TIME %llu SOCKET " - "%u APIC %x\n", mce->cpuvendor, mce->cpuid, - mce->time, mce->socketid, mce->apicid); - - skx_mce_output_error(mci, mce, &res); - - return NOTIFY_DONE; -} - -static struct notifier_block skx_mce_dec = { - .notifier_call = skx_mce_check_error, - .priority = MCE_PRIO_EDAC, -}; - -static void skx_remove(void) -{ - int i, j; - struct skx_dev *d, *tmp; - - edac_dbg(0, "\n"); - - list_for_each_entry_safe(d, tmp, &skx_edac_list, list) { - list_del(&d->list); - for (i = 0; i < NUM_IMC; i++) { - skx_unregister_mci(&d->imc[i]); - for (j = 0; j < NUM_CHANNELS; j++) - pci_dev_put(d->imc[i].chan[j].cdev); - } - pci_dev_put(d->util_all); - pci_dev_put(d->sad_all); - - kfree(d); - } -} - -static void __init skx_adxl_get(void) -{ - const char * const *names; - int i, j; - - names = adxl_get_component_names(); - if (!names) { - skx_printk(KERN_NOTICE, "No firmware support for address translation."); - skx_printk(KERN_CONT, " Only decoding DDR4 address!\n"); - return; - } - - for (i = 0; i < INDEX_MAX; i++) { - for (j = 0; names[j]; j++) { - if (!strcmp(component_names[i], names[j])) { - component_indices[i] = j; - break; - } - } - - if (!names[j]) - goto err; - } - - adxl_component_names = names; - while (*names++) - adxl_component_count++; - - adxl_values = kcalloc(adxl_component_count, sizeof(*adxl_values), - GFP_KERNEL); - if (!adxl_values) { - adxl_component_count = 0; - return; - } - - adxl_msg = kzalloc(MSG_SIZE, GFP_KERNEL); - if (!adxl_msg) { - adxl_component_count = 0; - kfree(adxl_values); - } - - return; -err: - skx_printk(KERN_ERR, "'%s' is not matched from DSM parameters: ", - component_names[i]); - for (j = 0; names[j]; j++) - skx_printk(KERN_CONT, "%s ", names[j]); - skx_printk(KERN_CONT, "\n"); -} - -static void __exit skx_adxl_put(void) -{ - kfree(adxl_values); - kfree(adxl_msg); -} - -/* - * skx_init: - * make sure we are running on the correct cpu model - * search for all the devices we need - * check which DIMMs are present. - */ -static int __init skx_init(void) -{ - const struct x86_cpu_id *id; - const struct munit *m; - const char *owner; - int rc = 0, i; - u8 mc = 0, src_id, node_id; - struct skx_dev *d; - - edac_dbg(2, "\n"); - - owner = edac_get_owner(); - if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR))) - return -EBUSY; - - id = x86_match_cpu(skx_cpuids); - if (!id) - return -ENODEV; - - rc = skx_get_hi_lo(); - if (rc) - return rc; - - rc = get_all_bus_mappings(); - if (rc < 0) - goto fail; - if (rc == 0) { - edac_dbg(2, "No memory controllers found\n"); - return -ENODEV; - } - - for (m = skx_all_munits; m->did; m++) { - rc = get_all_munits(m); - if (rc < 0) - goto fail; - if (rc != m->per_socket * skx_num_sockets) { - edac_dbg(2, "Expected %d, got %d of %x\n", - m->per_socket * skx_num_sockets, rc, m->did); - rc = -ENODEV; - goto fail; - } - } - - list_for_each_entry(d, &skx_edac_list, list) { - src_id = get_src_id(d); - node_id = skx_get_node_id(d); - edac_dbg(2, "src_id=%d node_id=%d\n", src_id, node_id); - for (i = 0; i < NUM_IMC; i++) { - d->imc[i].mc = mc++; - d->imc[i].lmc = i; - d->imc[i].src_id = src_id; - d->imc[i].node_id = node_id; - rc = skx_register_mci(&d->imc[i]); - if (rc < 0) - goto fail; - } - } - - skx_msg = kzalloc(MSG_SIZE, GFP_KERNEL); - if (!skx_msg) { - rc = -ENOMEM; - goto fail; - } - - if (nvdimm_count) - skx_adxl_get(); - - /* Ensure that the OPSTATE is set correctly for POLL or NMI */ - opstate_init(); - - setup_skx_debug(); - - mce_register_decode_chain(&skx_mce_dec); - - return 0; -fail: - skx_remove(); - return rc; -} - -static void __exit skx_exit(void) -{ - edac_dbg(2, "\n"); - mce_unregister_decode_chain(&skx_mce_dec); - skx_remove(); - if (nvdimm_count) - skx_adxl_put(); - kfree(skx_msg); - teardown_skx_debug(); -} - -module_init(skx_init); -module_exit(skx_exit); - -module_param(edac_op_state, int, 0444); -MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll,1=NMI"); - -MODULE_LICENSE("GPL v2"); -MODULE_AUTHOR("Tony Luck"); -MODULE_DESCRIPTION("MC Driver for Intel Skylake server processors");
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.1-rc1 commit d4dc89d069aab9074e2493a4c2f3969a0a0b91c1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d4dc89d069aab9074e2493a4c2f3969a0a0b91c1 upstream.
This driver supports the Intel 10nm series server integrated memory controller. It gets the memory capacity and topology information by reading the registers in PCI configuration space and memory-mapped I/O.
It decodes the memory error address to the platform specific address by using the ACPI Address Translation (ADXL) Device Specific Method (DSM).
Co-developed-by: Tony Luck tony.luck@intel.com Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: James Morse james.morse@arm.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190130191519.15393-5-tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/Kconfig | 12 ++ drivers/edac/Makefile | 3 + drivers/edac/i10nm_base.c | 275 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 290 insertions(+) create mode 100644 drivers/edac/i10nm_base.c
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index c927c8a68dd9..1886c8e49604 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -241,6 +241,18 @@ config EDAC_SKX system has non-volatile DIMMs you should also manually select CONFIG_ACPI_NFIT.
+config EDAC_I10NM + tristate "Intel 10nm server Integrated MC" + depends on PCI && X86_64 && X86_MCE_INTEL && PCI_MMCONFIG + depends on ACPI_NFIT || !ACPI_NFIT # if ACPI_NFIT=m, EDAC_I10NM can't be y + select DMI + select ACPI_ADXL if ACPI + help + Support for error detection and correction the Intel + 10nm server Integrated Memory Controllers. If your + system has non-volatile DIMMs you should also manually + select CONFIG_ACPI_NFIT. + config EDAC_PND2 tristate "Intel Pondicherry2" depends on PCI && X86_64 && X86_MCE_INTEL diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index b250d59910ba..3cd809d92c74 100644 --- a/drivers/edac/Makefile +++ b/drivers/edac/Makefile @@ -60,6 +60,9 @@ obj-$(CONFIG_EDAC_LAYERSCAPE) += layerscape_edac_mod.o skx_edac-y := skx_common.o skx_base.o obj-$(CONFIG_EDAC_SKX) += skx_edac.o
+i10nm_edac-y := skx_common.o i10nm_base.o +obj-$(CONFIG_EDAC_I10NM) += i10nm_edac.o + obj-$(CONFIG_EDAC_MV64X60) += mv64x60_edac.o obj-$(CONFIG_EDAC_CELL) += cell_edac.o obj-$(CONFIG_EDAC_PPC4XX) += ppc4xx_edac.o diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c new file mode 100644 index 000000000000..c334fb7c63df --- /dev/null +++ b/drivers/edac/i10nm_base.c @@ -0,0 +1,275 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for Intel(R) 10nm server memory controller. + * Copyright (c) 2019, Intel Corporation. + * + */ + +#include <linux/kernel.h> +#include <asm/cpu_device_id.h> +#include <asm/intel-family.h> +#include <asm/mce.h> +#include "edac_module.h" +#include "skx_common.h" + +#define I10NM_REVISION "v0.0.3" +#define EDAC_MOD_STR "i10nm_edac" + +/* Debug macros */ +#define i10nm_printk(level, fmt, arg...) \ + edac_printk(level, "i10nm", fmt, ##arg) + +#define I10NM_GET_SCK_BAR(d, reg) \ + pci_read_config_dword((d)->uracu, 0xd0, &(reg)) +#define I10NM_GET_IMC_BAR(d, i, reg) \ + pci_read_config_dword((d)->uracu, 0xd8 + (i) * 4, &(reg)) +#define I10NM_GET_DIMMMTR(m, i, j) \ + (*(u32 *)((m)->mbase + 0x2080c + (i) * 0x4000 + (j) * 4)) +#define I10NM_GET_MCDDRTCFG(m, i, j) \ + (*(u32 *)((m)->mbase + 0x20970 + (i) * 0x4000 + (j) * 4)) + +#define I10NM_GET_SCK_MMIO_BASE(reg) (GET_BITFIELD(reg, 0, 28) << 23) +#define I10NM_GET_IMC_MMIO_OFFSET(reg) (GET_BITFIELD(reg, 0, 10) << 12) +#define I10NM_GET_IMC_MMIO_SIZE(reg) ((GET_BITFIELD(reg, 13, 23) - \ + GET_BITFIELD(reg, 0, 10) + 1) << 12) + +static struct list_head *i10nm_edac_list; + +static struct pci_dev *pci_get_dev_wrapper(int dom, unsigned int bus, + unsigned int dev, unsigned int fun) +{ + struct pci_dev *pdev; + + pdev = pci_get_domain_bus_and_slot(dom, bus, PCI_DEVFN(dev, fun)); + if (!pdev) { + edac_dbg(2, "No device %02x:%02x.%x\n", + bus, dev, fun); + return NULL; + } + + if (unlikely(pci_enable_device(pdev) < 0)) { + edac_dbg(2, "Failed to enable device %02x:%02x.%x\n", + bus, dev, fun); + return NULL; + } + + pci_dev_get(pdev); + + return pdev; +} + +static int i10nm_get_all_munits(void) +{ + struct pci_dev *mdev; + void __iomem *mbase; + unsigned long size; + struct skx_dev *d; + int i, j = 0; + u32 reg, off; + u64 base; + + list_for_each_entry(d, i10nm_edac_list, list) { + d->util_all = pci_get_dev_wrapper(d->seg, d->bus[1], 29, 1); + if (!d->util_all) + return -ENODEV; + + d->uracu = pci_get_dev_wrapper(d->seg, d->bus[0], 0, 1); + if (!d->uracu) + return -ENODEV; + + if (I10NM_GET_SCK_BAR(d, reg)) { + i10nm_printk(KERN_ERR, "Failed to socket bar\n"); + return -ENODEV; + } + + base = I10NM_GET_SCK_MMIO_BASE(reg); + edac_dbg(2, "socket%d mmio base 0x%llx (reg 0x%x)\n", + j++, base, reg); + + for (i = 0; i < I10NM_NUM_IMC; i++) { + mdev = pci_get_dev_wrapper(d->seg, d->bus[0], + 12 + i, 0); + if (i == 0 && !mdev) { + i10nm_printk(KERN_ERR, "No IMC found\n"); + return -ENODEV; + } + if (!mdev) + continue; + + d->imc[i].mdev = mdev; + + if (I10NM_GET_IMC_BAR(d, i, reg)) { + i10nm_printk(KERN_ERR, "Failed to get mc bar\n"); + return -ENODEV; + } + + off = I10NM_GET_IMC_MMIO_OFFSET(reg); + size = I10NM_GET_IMC_MMIO_SIZE(reg); + edac_dbg(2, "mc%d mmio base 0x%llx size 0x%lx (reg 0x%x)\n", + i, base + off, size, reg); + + mbase = ioremap(base + off, size); + if (!mbase) { + i10nm_printk(KERN_ERR, "Failed to ioremap 0x%llx\n", + base + off); + return -ENODEV; + } + + d->imc[i].mbase = mbase; + } + } + + return 0; +} + +static const struct x86_cpu_id i10nm_cpuids[] = { + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_TREMONT_X, 0, 0 }, + { } +}; +MODULE_DEVICE_TABLE(x86cpu, i10nm_cpuids); + +static bool i10nm_check_ecc(struct skx_imc *imc, int chan) +{ + u32 mcmtr; + + mcmtr = *(u32 *)(imc->mbase + 0x20ef8 + chan * 0x4000); + edac_dbg(1, "ch%d mcmtr reg %x\n", chan, mcmtr); + + return !!GET_BITFIELD(mcmtr, 2, 2); +} + +static int i10nm_get_dimm_config(struct mem_ctl_info *mci) +{ + struct skx_pvt *pvt = mci->pvt_info; + struct skx_imc *imc = pvt->imc; + struct dimm_info *dimm; + u32 mtr, mcddrtcfg; + int i, j, ndimms; + + for (i = 0; i < I10NM_NUM_CHANNELS; i++) { + if (!imc->mbase) + continue; + + ndimms = 0; + for (j = 0; j < I10NM_NUM_DIMMS; j++) { + dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms, + mci->n_layers, i, j, 0); + mtr = I10NM_GET_DIMMMTR(imc, i, j); + mcddrtcfg = I10NM_GET_MCDDRTCFG(imc, i, j); + edac_dbg(1, "dimmmtr 0x%x mcddrtcfg 0x%x (mc%d ch%d dimm%d)\n", + mtr, mcddrtcfg, imc->mc, i, j); + + if (IS_DIMM_PRESENT(mtr)) + ndimms += skx_get_dimm_info(mtr, 0, dimm, + imc, i, j); + else if (IS_NVDIMM_PRESENT(mcddrtcfg, j)) + ndimms += skx_get_nvdimm_info(dimm, imc, i, j, + EDAC_MOD_STR); + } + if (ndimms && !i10nm_check_ecc(imc, 0)) { + i10nm_printk(KERN_ERR, "ECC is disabled on imc %d\n", + imc->mc); + return -ENODEV; + } + } + + return 0; +} + +static struct notifier_block i10nm_mce_dec = { + .notifier_call = skx_mce_check_error, + .priority = MCE_PRIO_EDAC, +}; + +static int __init i10nm_init(void) +{ + u8 mc = 0, src_id = 0, node_id = 0; + const struct x86_cpu_id *id; + const char *owner; + struct skx_dev *d; + int rc, i, off[3] = {0xd0, 0xc8, 0xcc}; + u64 tolm, tohm; + + edac_dbg(2, "\n"); + + owner = edac_get_owner(); + if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR))) + return -EBUSY; + + id = x86_match_cpu(i10nm_cpuids); + if (!id) + return -ENODEV; + + rc = skx_get_hi_lo(0x09a2, off, &tolm, &tohm); + if (rc) + return rc; + + rc = skx_get_all_bus_mappings(0x3452, 0xcc, I10NM, &i10nm_edac_list); + if (rc < 0) + goto fail; + if (rc == 0) { + i10nm_printk(KERN_ERR, "No memory controllers found\n"); + return -ENODEV; + } + + rc = i10nm_get_all_munits(); + if (rc < 0) + goto fail; + + list_for_each_entry(d, i10nm_edac_list, list) { + rc = skx_get_src_id(d, &src_id); + if (rc < 0) + goto fail; + + rc = skx_get_node_id(d, &node_id); + if (rc < 0) + goto fail; + + edac_dbg(2, "src_id = %d node_id = %d\n", src_id, node_id); + for (i = 0; i < I10NM_NUM_IMC; i++) { + if (!d->imc[i].mdev) + continue; + + d->imc[i].mc = mc++; + d->imc[i].lmc = i; + d->imc[i].src_id = src_id; + d->imc[i].node_id = node_id; + + rc = skx_register_mci(&d->imc[i], d->imc[i].mdev, + "Intel_10nm Socket", EDAC_MOD_STR, + i10nm_get_dimm_config); + if (rc < 0) + goto fail; + } + } + + rc = skx_adxl_get(); + if (rc) + goto fail; + + opstate_init(); + mce_register_decode_chain(&i10nm_mce_dec); + setup_skx_debug("i10nm_test"); + + i10nm_printk(KERN_INFO, "%s\n", I10NM_REVISION); + + return 0; +fail: + skx_remove(); + return rc; +} + +static void __exit i10nm_exit(void) +{ + edac_dbg(2, "\n"); + teardown_skx_debug(); + mce_unregister_decode_chain(&i10nm_mce_dec); + skx_adxl_put(); + skx_remove(); +} + +module_init(i10nm_init); +module_exit(i10nm_exit); + +MODULE_LICENSE("GPL v2"); +MODULE_DESCRIPTION("MC Driver for Intel 10nm server processors");
From: Tony Luck tony.luck@intel.com
mainline inclusion from mainline-v5.1-rc1 commit cbfa482f7e2becbb774dd30117efac48819252f8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit cbfa482f7e2becbb774dd30117efac48819252f8 upstream.
A new error code for systems that use DRAM as an extra level of cache looks like:
000F 0010 1MMM CCCC
where the MMM and CCCC bits are used for the same purpose as the original code. For this new class of errors the ADXL translation will provide details of both the DIMM used as cache for the error location and the component that is being cached.
Note: This new error code is first supported in Skylake. Older EDAC drivers do not need to be updated.
Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: Aristeu Rozanski aris@redhat.com Cc: James Morse james.morse@arm.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: Qiuxu Zhuo qiuxu.zhuo@intel.com Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190205182109.27828-1-tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/skx_common.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index fb3d9a563ba8..0821e74513c4 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -494,9 +494,11 @@ static void skx_mce_output_error(struct mem_ctl_info *mci, }
/* - * According with Table 15-9 of the Intel Architecture spec vol 3A, - * memory errors should fit in this mask: + * According to Intel Architecture spec vol 3B, + * Table 15-10 "IA32_MCi_Status [15:0] Compound Error Code Encoding" + * memory errors should fit one of these masks: * 000f 0000 1mmm cccc (binary) + * 000f 0010 1mmm cccc (binary) [RAM used as cache] * where: * f = Correction Report Filtering Bit. If 1, subsequent errors * won't be shown @@ -504,7 +506,7 @@ static void skx_mce_output_error(struct mem_ctl_info *mci, * cccc = channel * If the mask doesn't match, report an error to the parsing logic */ - if (!((errcode & 0xef80) == 0x80)) { + if (!((errcode & 0xef80) == 0x80 || (errcode & 0xef80) == 0x280)) { optype = "Can't parse: it is not a mem"; } else { switch (optypenum) {
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.2-rc1 commit fe783516e3016652b74ac92fb8b3fc2b1c0e9d5b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit fe783516e3016652b74ac92fb8b3fc2b1c0e9d5b upstream.
The following Kconfig constellations fail randconfig builds:
CONFIG_ACPI_NFIT=y CONFIG_EDAC_DEBUG=y CONFIG_EDAC_SKX=m CONFIG_EDAC_I10NM=y
or
CONFIG_ACPI_NFIT=y CONFIG_EDAC_DEBUG=y CONFIG_EDAC_SKX=y CONFIG_EDAC_I10NM=m
with: ... CC [M] drivers/edac/skx_common.o ... .../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module'
That is because if one of the two drivers - skx_edac or i10nm_edac - is built-in and the other one is a module, the shared file skx_common.c gets linked into a module object by kbuild. Therefore, when linking that same file into vmlinux, the '__this_module' symbol used in debugfs isn't defined, leading to the above error.
Fix it by moving all debugfs code from skx_common.c to both skx_base.c and i10nm_base.c respectively. Thus, skx_common.c doesn't refer to the '__this_module' symbol anymore.
Clarify skx_common.c's purpose at the top of the file for future reference, while at it.
[ bp: Make text more readable. ]
Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Reported-by: Arnd Bergmann arnd@arndb.de Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: James Morse james.morse@arm.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/i10nm_base.c | 52 +++++++++++++++++++++++++++++++++-- drivers/edac/skx_base.c | 50 +++++++++++++++++++++++++++++++++- drivers/edac/skx_common.c | 57 +++++++-------------------------------- drivers/edac/skx_common.h | 8 ------ 4 files changed, 109 insertions(+), 58 deletions(-)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c index c334fb7c63df..87baed8e0807 100644 --- a/drivers/edac/i10nm_base.c +++ b/drivers/edac/i10nm_base.c @@ -181,6 +181,54 @@ static struct notifier_block i10nm_mce_dec = { .priority = MCE_PRIO_EDAC, };
+#ifdef CONFIG_EDAC_DEBUG +/* + * Debug feature. + * Exercise the address decode logic by writing an address to + * /sys/kernel/debug/edac/i10nm_test/addr. + */ +static struct dentry *i10nm_test; + +static int debugfs_u64_set(void *data, u64 val) +{ + struct mce m; + + pr_warn_once("Fake error to 0x%llx injected via debugfs\n", val); + + memset(&m, 0, sizeof(m)); + /* ADDRV + MemRd + Unknown channel */ + m.status = MCI_STATUS_ADDRV + 0x90; + /* One corrected error */ + m.status |= BIT_ULL(38); + m.addr = val; + skx_mce_check_error(NULL, 0, &m); + + return 0; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u64_wo, NULL, debugfs_u64_set, "%llu\n"); + +static void setup_i10nm_debug(void) +{ + i10nm_test = edac_debugfs_create_dir("i10nm_test"); + if (!i10nm_test) + return; + + if (!edac_debugfs_create_file("addr", 0200, i10nm_test, + NULL, &fops_u64_wo)) { + debugfs_remove(i10nm_test); + i10nm_test = NULL; + } +} + +static void teardown_i10nm_debug(void) +{ + debugfs_remove_recursive(i10nm_test); +} +#else +static inline void setup_i10nm_debug(void) {} +static inline void teardown_i10nm_debug(void) {} +#endif /*CONFIG_EDAC_DEBUG*/ + static int __init i10nm_init(void) { u8 mc = 0, src_id = 0, node_id = 0; @@ -249,7 +297,7 @@ static int __init i10nm_init(void)
opstate_init(); mce_register_decode_chain(&i10nm_mce_dec); - setup_skx_debug("i10nm_test"); + setup_i10nm_debug();
i10nm_printk(KERN_INFO, "%s\n", I10NM_REVISION);
@@ -262,7 +310,7 @@ static int __init i10nm_init(void) static void __exit i10nm_exit(void) { edac_dbg(2, "\n"); - teardown_skx_debug(); + teardown_i10nm_debug(); mce_unregister_decode_chain(&i10nm_mce_dec); skx_adxl_put(); skx_remove(); diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c index adae4c848ca1..c6c8be2814ec 100644 --- a/drivers/edac/skx_base.c +++ b/drivers/edac/skx_base.c @@ -540,6 +540,54 @@ static struct notifier_block skx_mce_dec = { .priority = MCE_PRIO_EDAC, };
+#ifdef CONFIG_EDAC_DEBUG +/* + * Debug feature. + * Exercise the address decode logic by writing an address to + * /sys/kernel/debug/edac/skx_test/addr. + */ +static struct dentry *skx_test; + +static int debugfs_u64_set(void *data, u64 val) +{ + struct mce m; + + pr_warn_once("Fake error to 0x%llx injected via debugfs\n", val); + + memset(&m, 0, sizeof(m)); + /* ADDRV + MemRd + Unknown channel */ + m.status = MCI_STATUS_ADDRV + 0x90; + /* One corrected error */ + m.status |= BIT_ULL(38); + m.addr = val; + skx_mce_check_error(NULL, 0, &m); + + return 0; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u64_wo, NULL, debugfs_u64_set, "%llu\n"); + +static void setup_skx_debug(void) +{ + skx_test = edac_debugfs_create_dir("skx_test"); + if (!skx_test) + return; + + if (!edac_debugfs_create_file("addr", 0200, skx_test, + NULL, &fops_u64_wo)) { + debugfs_remove(skx_test); + skx_test = NULL; + } +} + +static void teardown_skx_debug(void) +{ + debugfs_remove_recursive(skx_test); +} +#else +static inline void setup_skx_debug(void) {} +static inline void teardown_skx_debug(void) {} +#endif /*CONFIG_EDAC_DEBUG*/ + /* * skx_init: * make sure we are running on the correct cpu model @@ -619,7 +667,7 @@ static int __init skx_init(void) /* Ensure that the OPSTATE is set correctly for POLL or NMI */ opstate_init();
- setup_skx_debug("skx_test"); + setup_skx_debug();
mce_register_decode_chain(&skx_mce_dec);
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index 0821e74513c4..b0dddcfa9baa 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -1,7 +1,15 @@ // SPDX-License-Identifier: GPL-2.0 /* - * Common codes for both the skx_edac driver and Intel 10nm server EDAC driver. - * Originally split out from the skx_edac driver. + * + * Shared code by both skx_edac and i10nm_edac. Originally split out + * from the skx_edac driver. + * + * This file is linked into both skx_edac and i10nm_edac drivers. In + * order to avoid link errors, this file must be like a pure library + * without including symbols and defines which would otherwise conflict, + * when linked once into a module and into a built-in object, at the + * same time. For example, __this_module symbol references when that + * file is being linked into a built-in object. * * Copyright (c) 2018, Intel Corporation. */ @@ -644,48 +652,3 @@ void skx_remove(void) kfree(d); } } - -#ifdef CONFIG_EDAC_DEBUG -/* - * Debug feature. - * Exercise the address decode logic by writing an address to - * /sys/kernel/debug/edac/dirname/addr. - */ -static struct dentry *skx_test; - -static int debugfs_u64_set(void *data, u64 val) -{ - struct mce m; - - pr_warn_once("Fake error to 0x%llx injected via debugfs\n", val); - - memset(&m, 0, sizeof(m)); - /* ADDRV + MemRd + Unknown channel */ - m.status = MCI_STATUS_ADDRV + 0x90; - /* One corrected error */ - m.status |= BIT_ULL(38); - m.addr = val; - skx_mce_check_error(NULL, 0, &m); - - return 0; -} -DEFINE_SIMPLE_ATTRIBUTE(fops_u64_wo, NULL, debugfs_u64_set, "%llu\n"); - -void setup_skx_debug(const char *dirname) -{ - skx_test = edac_debugfs_create_dir(dirname); - if (!skx_test) - return; - - if (!edac_debugfs_create_file("addr", 0200, skx_test, - NULL, &fops_u64_wo)) { - debugfs_remove(skx_test); - skx_test = NULL; - } -} - -void teardown_skx_debug(void) -{ - debugfs_remove_recursive(skx_test); -} -#endif /*CONFIG_EDAC_DEBUG*/ diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index d25374e34d4f..d18fa98669af 100644 --- a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -141,12 +141,4 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
void skx_remove(void);
-#ifdef CONFIG_EDAC_DEBUG -void setup_skx_debug(const char *dirname); -void teardown_skx_debug(void); -#else -static inline void setup_skx_debug(const char *dirname) {} -static inline void teardown_skx_debug(void) {} -#endif /*CONFIG_EDAC_DEBUG*/ - #endif /* _SKX_COMM_EDAC_H */
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 5c5d3ac2064ae2466c81d40186bcc09b2d5b7892 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 5c5d3ac2064ae2466c81d40186bcc09b2d5b7892 upstream.
Two new CPU models share the same memory controller architecture with Jacobsville/Tremont, so can use the same i10nm EDAC driver.
Add ICX and ICX-D CPU model numbers for EDAC support.
Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/i10nm_base.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c index 87baed8e0807..eb13f07cdaa5 100644 --- a/drivers/edac/i10nm_base.c +++ b/drivers/edac/i10nm_base.c @@ -124,6 +124,8 @@ static int i10nm_get_all_munits(void)
static const struct x86_cpu_id i10nm_cpuids[] = { { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_TREMONT_X, 0, 0 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_X, 0, 0 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_XEON_D, 0, 0 }, { } }; MODULE_DEVICE_TABLE(x86cpu, i10nm_cpuids);
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.3-rc1 commit c4a1dd9e83ceceef6ffba82b8b274ab9b929ea14 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit c4a1dd9e83ceceef6ffba82b8b274ab9b929ea14 upstream.
The i10nm_edac only checks the ECC enabling status for the first channel of the memory controller. If there aren't memory DIMMs populated on the first channel, but at least one DIMM populated on the second channel, it will wrongly report that the ECC for the memory controller is disabled that fails to load the i10nm_edac driver. Fix it by checking ECC enabling status per channel.
[Tony: Also report which channel has ECC disabled]
Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/i10nm_base.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c index eb13f07cdaa5..6c435db0bcd9 100644 --- a/drivers/edac/i10nm_base.c +++ b/drivers/edac/i10nm_base.c @@ -168,9 +168,9 @@ static int i10nm_get_dimm_config(struct mem_ctl_info *mci) ndimms += skx_get_nvdimm_info(dimm, imc, i, j, EDAC_MOD_STR); } - if (ndimms && !i10nm_check_ecc(imc, 0)) { - i10nm_printk(KERN_ERR, "ECC is disabled on imc %d\n", - imc->mc); + if (ndimms && !i10nm_check_ecc(imc, i)) { + i10nm_printk(KERN_ERR, "ECC is disabled on imc %d channel %d\n", + imc->mc, i); return -ENODEV; } }
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.3-rc1 commit 1dc78f1ffa3a386b986b659884952d816021f38f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 1dc78f1ffa3a386b986b659884952d816021f38f upstream.
The source ID register offset for Skylake server is 0xf0, while for Icelake server is 0xf8. Pass the correct offset to get the source ID.
Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/i10nm_base.c | 2 +- drivers/edac/skx_base.c | 2 +- drivers/edac/skx_common.c | 4 ++-- drivers/edac/skx_common.h | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c index 6c435db0bcd9..216c36a0505e 100644 --- a/drivers/edac/i10nm_base.c +++ b/drivers/edac/i10nm_base.c @@ -267,7 +267,7 @@ static int __init i10nm_init(void) goto fail;
list_for_each_entry(d, i10nm_edac_list, list) { - rc = skx_get_src_id(d, &src_id); + rc = skx_get_src_id(d, 0xf8, &src_id); if (rc < 0) goto fail;
diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c index c6c8be2814ec..c0ff22b96ec4 100644 --- a/drivers/edac/skx_base.c +++ b/drivers/edac/skx_base.c @@ -639,7 +639,7 @@ static int __init skx_init(void) }
list_for_each_entry(d, skx_edac_list, list) { - rc = skx_get_src_id(d, &src_id); + rc = skx_get_src_id(d, 0xf0, &src_id); if (rc < 0) goto fail; rc = skx_get_node_id(d, &node_id); diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index b0dddcfa9baa..d8ff63d91b86 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -136,11 +136,11 @@ void skx_set_decode(skx_decode_f decode) skx_decode = decode; }
-int skx_get_src_id(struct skx_dev *d, u8 *id) +int skx_get_src_id(struct skx_dev *d, int off, u8 *id) { u32 reg;
- if (pci_read_config_dword(d->util_all, 0xf0, ®)) { + if (pci_read_config_dword(d->util_all, off, ®)) { skx_printk(KERN_ERR, "Failed to read src id\n"); return -ENODEV; } diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index d18fa98669af..08cc971a50ea 100644 --- a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -118,7 +118,7 @@ int __init skx_adxl_get(void); void __exit skx_adxl_put(void); void skx_set_decode(skx_decode_f decode);
-int skx_get_src_id(struct skx_dev *d, u8 *id); +int skx_get_src_id(struct skx_dev *d, int off, u8 *id); int skx_get_node_id(struct skx_dev *d, u8 *id);
int skx_get_all_bus_mappings(unsigned int did, int off, enum type,
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 2ee5bfc1efc81179c73abcd33098dd2c86019146 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 2ee5bfc1efc81179c73abcd33098dd2c86019146 upstream.
Reserve ioctl numbers for intel Speed Select Technology interface drivers.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ioctl/ioctl-number.txt | 1 + 1 file changed, 1 insertion(+)
diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 13a7c999c04a..516e6e201186 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -346,3 +346,4 @@ Code Seq#(hex) Include File Comments 0xF6 all LTTng Linux Trace Toolkit Next Generation mailto:mathieu.desnoyers@efficios.com 0xFD all linux/dm-ioctl.h +0xFE all linux/isst_if.h
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 35f2c14d2a076b063a76c5bf275c46c0743ba3a0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 35f2c14d2a076b063a76c5bf275c46c0743ba3a0 upstream.
Encapsulate common functions which all Intel Speed Select Technology interface drivers can use. This creates API to register misc device for user kernel communication and handle all common IOCTLs. As part of the registry it allows a callback which is to handle domain specific ioctl processing.
There can be multiple drivers register for services, which can be built as modules. So this driver handle contention during registry and as well as during removal. Once user space opened the misc device, the registered driver will be prevented from removal. Also once misc device is opened by the user space new client driver can't register, till the misc device is closed.
There are two types of client drivers, one to handle mail box interface and the other is to allow direct read/write to some specific MMIO space.
This common driver implements IOCTL ISST_IF_GET_PLATFORM_INFO.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/platform/x86/Kconfig | 2 + drivers/platform/x86/Makefile | 1 + .../x86/intel_speed_select_if/Kconfig | 17 ++ .../x86/intel_speed_select_if/Makefile | 7 + .../intel_speed_select_if/isst_if_common.c | 182 ++++++++++++++++++ .../intel_speed_select_if/isst_if_common.h | 60 ++++++ include/uapi/linux/isst_if.h | 41 ++++ 7 files changed, 310 insertions(+) create mode 100644 drivers/platform/x86/intel_speed_select_if/Kconfig create mode 100644 drivers/platform/x86/intel_speed_select_if/Makefile create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_common.c create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_common.h create mode 100644 include/uapi/linux/isst_if.h
diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig index 1e2524de6a63..66d4b7faca61 100644 --- a/drivers/platform/x86/Kconfig +++ b/drivers/platform/x86/Kconfig @@ -1243,6 +1243,8 @@ config INTEL_ATOMISP2_PM To compile this driver as a module, choose M here: the module will be called intel_atomisp2_pm.
+source "drivers/platform/x86/intel_speed_select_if/Kconfig" + endif # X86_PLATFORM_DEVICES
config PMC_ATOM diff --git a/drivers/platform/x86/Makefile b/drivers/platform/x86/Makefile index dc29af4d8e2f..fb426466817b 100644 --- a/drivers/platform/x86/Makefile +++ b/drivers/platform/x86/Makefile @@ -93,3 +93,4 @@ obj-$(CONFIG_INTEL_TURBO_MAX_3) += intel_turbo_max_3.o obj-$(CONFIG_INTEL_CHTDC_TI_PWRBTN) += intel_chtdc_ti_pwrbtn.o obj-$(CONFIG_I2C_MULTI_INSTANTIATE) += i2c-multi-instantiate.o obj-$(CONFIG_INTEL_ATOMISP2_PM) += intel_atomisp2_pm.o +obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += intel_speed_select_if/ diff --git a/drivers/platform/x86/intel_speed_select_if/Kconfig b/drivers/platform/x86/intel_speed_select_if/Kconfig new file mode 100644 index 000000000000..ce3e3dc076d2 --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/Kconfig @@ -0,0 +1,17 @@ +menu "Intel Speed Select Technology interface support" + depends on PCI + depends on X86_64 || COMPILE_TEST + +config INTEL_SPEED_SELECT_INTERFACE + tristate "Intel(R) Speed Select Technology interface drivers" + help + This config enables the Intel(R) Speed Select Technology interface + drivers. The Intel(R) speed select technology features are non + architectural and only supported on specific Xeon(R) servers. + These drivers provide interface to directly communicate with hardware + via MMIO and Mail boxes to enumerate and control all the speed select + features. + + Enable this config, if there is a need to enable and control the + Intel(R) Speed Select Technology features from the user space. +endmenu diff --git a/drivers/platform/x86/intel_speed_select_if/Makefile b/drivers/platform/x86/intel_speed_select_if/Makefile new file mode 100644 index 000000000000..c12687672fc9 --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile - Intel Speed Select Interface drivers +# Copyright (c) 2019, Intel Corporation. +# + +obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_common.o diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c new file mode 100644 index 000000000000..ab2bb4862dc8 --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -0,0 +1,182 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel Speed Select Interface: Common functions + * Copyright (c) 2019, Intel Corporation. + * All rights reserved. + * + * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com + */ + +#include <linux/fs.h> +#include <linux/miscdevice.h> +#include <linux/module.h> +#include <linux/uaccess.h> +#include <uapi/linux/isst_if.h> + +#include "isst_if_common.h" + +static struct isst_if_cmd_cb punit_callbacks[ISST_IF_DEV_MAX]; + +static int isst_if_get_platform_info(void __user *argp) +{ + struct isst_if_platform_info info; + + info.api_version = ISST_IF_API_VERSION, + info.driver_version = ISST_IF_DRIVER_VERSION, + info.max_cmds_per_ioctl = ISST_IF_CMD_LIMIT, + info.mbox_supported = punit_callbacks[ISST_IF_DEV_MBOX].registered; + info.mmio_supported = punit_callbacks[ISST_IF_DEV_MMIO].registered; + + if (copy_to_user(argp, &info, sizeof(info))) + return -EFAULT; + + return 0; +} + +static long isst_if_def_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + void __user *argp = (void __user *)arg; + long ret = -ENOTTY; + + switch (cmd) { + case ISST_IF_GET_PLATFORM_INFO: + ret = isst_if_get_platform_info(argp); + break; + default: + break; + } + + return ret; +} + +static DEFINE_MUTEX(punit_misc_dev_lock); +static int misc_usage_count; +static int misc_device_ret; +static int misc_device_open; + +static int isst_if_open(struct inode *inode, struct file *file) +{ + int i, ret = 0; + + /* Fail open, if a module is going away */ + mutex_lock(&punit_misc_dev_lock); + for (i = 0; i < ISST_IF_DEV_MAX; ++i) { + struct isst_if_cmd_cb *cb = &punit_callbacks[i]; + + if (cb->registered && !try_module_get(cb->owner)) { + ret = -ENODEV; + break; + } + } + if (ret) { + int j; + + for (j = 0; j < i; ++j) { + struct isst_if_cmd_cb *cb; + + cb = &punit_callbacks[j]; + if (cb->registered) + module_put(cb->owner); + } + } else { + misc_device_open++; + } + mutex_unlock(&punit_misc_dev_lock); + + return ret; +} + +static int isst_if_relase(struct inode *inode, struct file *f) +{ + int i; + + mutex_lock(&punit_misc_dev_lock); + misc_device_open--; + for (i = 0; i < ISST_IF_DEV_MAX; ++i) { + struct isst_if_cmd_cb *cb = &punit_callbacks[i]; + + if (cb->registered) + module_put(cb->owner); + } + mutex_unlock(&punit_misc_dev_lock); + + return 0; +} + +static const struct file_operations isst_if_char_driver_ops = { + .open = isst_if_open, + .unlocked_ioctl = isst_if_def_ioctl, + .release = isst_if_relase, +}; + +static struct miscdevice isst_if_char_driver = { + .minor = MISC_DYNAMIC_MINOR, + .name = "isst_interface", + .fops = &isst_if_char_driver_ops, +}; + +/** + * isst_if_cdev_register() - Register callback for IOCTL + * @device_type: The device type this callback handling. + * @cb: Callback structure. + * + * This function registers a callback to device type. On very first call + * it will register a misc device, which is used for user kernel interface. + * Other calls simply increment ref count. Registry will fail, if the user + * already opened misc device for operation. Also if the misc device + * creation failed, then it will not try again and all callers will get + * failure code. + * + * Return: Return the return value from the misc creation device or -EINVAL + * for unsupported device type. + */ +int isst_if_cdev_register(int device_type, struct isst_if_cmd_cb *cb) +{ + if (misc_device_ret) + return misc_device_ret; + + if (device_type >= ISST_IF_DEV_MAX) + return -EINVAL; + + mutex_lock(&punit_misc_dev_lock); + if (misc_device_open) { + mutex_unlock(&punit_misc_dev_lock); + return -EAGAIN; + } + if (!misc_usage_count) { + misc_device_ret = misc_register(&isst_if_char_driver); + if (misc_device_ret) + goto unlock_exit; + } + memcpy(&punit_callbacks[device_type], cb, sizeof(*cb)); + punit_callbacks[device_type].registered = 1; + misc_usage_count++; +unlock_exit: + mutex_unlock(&punit_misc_dev_lock); + + return misc_device_ret; +} +EXPORT_SYMBOL_GPL(isst_if_cdev_register); + +/** + * isst_if_cdev_unregister() - Unregister callback for IOCTL + * @device_type: The device type to unregister. + * + * This function unregisters the previously registered callback. If this + * is the last callback unregistering, then misc device is removed. + * + * Return: None. + */ +void isst_if_cdev_unregister(int device_type) +{ + mutex_lock(&punit_misc_dev_lock); + misc_usage_count--; + punit_callbacks[device_type].registered = 0; + if (!misc_usage_count && !misc_device_ret) + misc_deregister(&isst_if_char_driver); + mutex_unlock(&punit_misc_dev_lock); +} +EXPORT_SYMBOL_GPL(isst_if_cdev_unregister); + +MODULE_LICENSE("GPL v2"); diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h new file mode 100644 index 000000000000..11f339226fb4 --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h @@ -0,0 +1,60 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Intel Speed Select Interface: Drivers Internal defines + * Copyright (c) 2019, Intel Corporation. + * All rights reserved. + * + * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com + */ + +#ifndef __ISST_IF_COMMON_H +#define __ISST_IF_COMMON_H + +/* + * Validate maximum commands in a single request. + * This is enough to handle command to every core in one ioctl, or all + * possible message id to one CPU. Limit is also helpful for resonse time + * per IOCTL request, as PUNIT may take different times to process each + * request and may hold for long for too many commands. + */ +#define ISST_IF_CMD_LIMIT 64 + +#define ISST_IF_API_VERSION 0x01 +#define ISST_IF_DRIVER_VERSION 0x01 + +#define ISST_IF_DEV_MBOX 0 +#define ISST_IF_DEV_MMIO 1 +#define ISST_IF_DEV_MAX 2 + +/** + * struct isst_if_cmd_cb - Used to register a IOCTL handler + * @registered: Used by the common code to store registry. Caller don't + * to touch this field + * @cmd_size: The command size of the individual command in IOCTL + * @offset: Offset to the first valid member in command structure. + * This will be the offset of the start of the command + * after command count field + * @cmd_callback: Callback function to handle IOCTL. The callback has the + * command pointer with data for command. There is a pointer + * called write_only, which when set, will not copy the + * response to user ioctl buffer. The "resume" argument + * can be used to avoid storing the command for replay + * during system resume + * + * This structure is used to register an handler for IOCTL. To avoid + * code duplication common code handles all the IOCTL command read/write + * including handling multiple command in single IOCTL. The caller just + * need to execute a command via the registered callback. + */ +struct isst_if_cmd_cb { + int registered; + int cmd_size; + int offset; + struct module *owner; + long (*cmd_callback)(u8 *ptr, int *write_only, int resume); +}; + +/* Internal interface functions */ +int isst_if_cdev_register(int type, struct isst_if_cmd_cb *cb); +void isst_if_cdev_unregister(int type); +#endif diff --git a/include/uapi/linux/isst_if.h b/include/uapi/linux/isst_if.h new file mode 100644 index 000000000000..fa94480b5f74 --- /dev/null +++ b/include/uapi/linux/isst_if.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Intel Speed Select Interface: OS to hardware Interface + * Copyright (c) 2019, Intel Corporation. + * All rights reserved. + * + * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com + */ + +#ifndef __ISST_IF_H +#define __ISST_IF_H + +#include <linux/types.h> + +/** + * struct isst_if_platform_info - Define platform information + * @api_version: Version of the firmware document, which this driver + * can communicate + * @driver_version: Driver version, which will help user to send right + * commands. Even if the firmware is capable, driver may + * not be ready + * @max_cmds_per_ioctl: Returns the maximum number of commands driver will + * accept in a single ioctl + * @mbox_supported: Support of mail box interface + * @mmio_supported: Support of mmio interface for core-power feature + * + * Used to return output of IOCTL ISST_IF_GET_PLATFORM_INFO. This + * information can be used by the user space, to get the driver, firmware + * support and also number of commands to send in a single IOCTL request. + */ +struct isst_if_platform_info { + __u16 api_version; + __u16 driver_version; + __u16 max_cmds_per_ioctl; + __u8 mbox_supported; + __u8 mmio_supported; +}; + +#define ISST_IF_MAGIC 0xFE +#define ISST_IF_GET_PLATFORM_INFO _IOR(ISST_IF_MAGIC, 0, struct isst_if_platform_info *) +#endif
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 8fbfb6fc67819c1274584ff902f7d03aafe38dab category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 8fbfb6fc67819c1274584ff902f7d03aafe38dab upstream.
There are two per CPU data needs to be stored and cached to avoid repeated MSR readings for accessing them later:
- Physical to logical CPU conversion The PUNIT uses a different CPU numbering scheme which is not APIC id based. So we need to establish relationship between PUNIT CPU number and Linux logical CPU numbering which is based on APIC id. There is an MSR 0x53 (MSR_THREAD_ID), which gets physical CPU number for the local CPU where it is read. Also the CPU mask in some messages will inform which CPUs needs to be online/offline for a TDP level. During TDP switch if user offlined some CPUs, then the physical CPU mask can't be converted as we can't read MSR on an offlined CPU to go to a lower TDP level by onlining more CPUs. So the mapping needs to be established at the boot up time.
- Bus number corresponding to a CPU A group of CPUs are in a control of a PUNIT. The PUNIT device is exported as PCI device. To do operation on a PUNIT for a CPU, we need to find out to which PCI device it is related to. This is done by reading MSR 0x128 (MSR_CPU_BUS_NUMBER).
So during CPU online stages the above MSRs are read and stored. Later this stored information is used to process IOCTLs request from the user space.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../intel_speed_select_if/isst_if_common.c | 114 +++++++++++++++++- .../intel_speed_select_if/isst_if_common.h | 1 + 2 files changed, 114 insertions(+), 1 deletion(-)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index ab2bb4862dc8..0e16cbf685d0 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -7,14 +7,22 @@ * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com */
+#include <linux/cpufeature.h> +#include <linux/cpuhotplug.h> #include <linux/fs.h> #include <linux/miscdevice.h> #include <linux/module.h> +#include <linux/pci.h> +#include <linux/sched/signal.h> +#include <linux/slab.h> #include <linux/uaccess.h> #include <uapi/linux/isst_if.h>
#include "isst_if_common.h"
+#define MSR_THREAD_ID_INFO 0x53 +#define MSR_CPU_BUS_NUMBER 0x128 + static struct isst_if_cmd_cb punit_callbacks[ISST_IF_DEV_MAX];
static int isst_if_get_platform_info(void __user *argp) @@ -33,6 +41,99 @@ static int isst_if_get_platform_info(void __user *argp) return 0; }
+ +struct isst_if_cpu_info { + /* For BUS 0 and BUS 1 only, which we need for PUNIT interface */ + int bus_info[2]; + int punit_cpu_id; +}; + +static struct isst_if_cpu_info *isst_cpu_info; + +/** + * isst_if_get_pci_dev() - Get the PCI device instance for a CPU + * @cpu: Logical CPU number. + * @bus_number: The bus number assigned by the hardware. + * @dev: The device number assigned by the hardware. + * @fn: The function number assigned by the hardware. + * + * Using cached bus information, find out the PCI device for a bus number, + * device and function. + * + * Return: Return pci_dev pointer or NULL. + */ +struct pci_dev *isst_if_get_pci_dev(int cpu, int bus_no, int dev, int fn) +{ + int bus_number; + + if (bus_no < 0 || bus_no > 1 || cpu < 0 || cpu >= nr_cpu_ids || + cpu >= num_possible_cpus()) + return NULL; + + bus_number = isst_cpu_info[cpu].bus_info[bus_no]; + if (bus_number < 0) + return NULL; + + return pci_get_domain_bus_and_slot(0, bus_number, PCI_DEVFN(dev, fn)); +} +EXPORT_SYMBOL_GPL(isst_if_get_pci_dev); + +static int isst_if_cpu_online(unsigned int cpu) +{ + u64 data; + int ret; + + ret = rdmsrl_safe(MSR_CPU_BUS_NUMBER, &data); + if (ret) { + /* This is not a fatal error on MSR mailbox only I/F */ + isst_cpu_info[cpu].bus_info[0] = -1; + isst_cpu_info[cpu].bus_info[1] = -1; + } else { + isst_cpu_info[cpu].bus_info[0] = data & 0xff; + isst_cpu_info[cpu].bus_info[1] = (data >> 8) & 0xff; + } + + ret = rdmsrl_safe(MSR_THREAD_ID_INFO, &data); + if (ret) { + isst_cpu_info[cpu].punit_cpu_id = -1; + return ret; + } + isst_cpu_info[cpu].punit_cpu_id = data; + + return 0; +} + +static int isst_if_online_id; + +static int isst_if_cpu_info_init(void) +{ + int ret; + + isst_cpu_info = kcalloc(num_possible_cpus(), + sizeof(*isst_cpu_info), + GFP_KERNEL); + if (!isst_cpu_info) + return -ENOMEM; + + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, + "platform/x86/isst-if:online", + isst_if_cpu_online, NULL); + if (ret < 0) { + kfree(isst_cpu_info); + return ret; + } + + isst_if_online_id = ret; + + return 0; +} + +static void isst_if_cpu_info_exit(void) +{ + cpuhp_remove_state(isst_if_online_id); + kfree(isst_cpu_info); +}; + static long isst_if_def_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -145,9 +246,18 @@ int isst_if_cdev_register(int device_type, struct isst_if_cmd_cb *cb) return -EAGAIN; } if (!misc_usage_count) { + int ret; + misc_device_ret = misc_register(&isst_if_char_driver); if (misc_device_ret) goto unlock_exit; + + ret = isst_if_cpu_info_init(); + if (ret) { + misc_deregister(&isst_if_char_driver); + misc_device_ret = ret; + goto unlock_exit; + } } memcpy(&punit_callbacks[device_type], cb, sizeof(*cb)); punit_callbacks[device_type].registered = 1; @@ -173,8 +283,10 @@ void isst_if_cdev_unregister(int device_type) mutex_lock(&punit_misc_dev_lock); misc_usage_count--; punit_callbacks[device_type].registered = 0; - if (!misc_usage_count && !misc_device_ret) + if (!misc_usage_count && !misc_device_ret) { misc_deregister(&isst_if_char_driver); + isst_if_cpu_info_exit(); + } mutex_unlock(&punit_misc_dev_lock); } EXPORT_SYMBOL_GPL(isst_if_cdev_unregister); diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h index 11f339226fb4..dade77c58b22 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h @@ -57,4 +57,5 @@ struct isst_if_cmd_cb { /* Internal interface functions */ int isst_if_cdev_register(int type, struct isst_if_cmd_cb *cb); void isst_if_cdev_unregister(int type); +struct pci_dev *isst_if_get_pci_dev(int cpu, int bus, int dev, int fn); #endif
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit fb5b36a413b9f30fba573fc2a596ab7142dfaf12 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit fb5b36a413b9f30fba573fc2a596ab7142dfaf12 upstream.
Add processing for IOCTL command ISST_IF_GET_PHY_ID. This converts from the Linux logical CPU to PUNIT CPU numbering scheme.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../intel_speed_select_if/isst_if_common.c | 74 +++++++++++++++++++ include/uapi/linux/isst_if.h | 28 +++++++ 2 files changed, 102 insertions(+)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index 0e16cbf685d0..72e74d72724b 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -134,16 +134,90 @@ static void isst_if_cpu_info_exit(void) kfree(isst_cpu_info); };
+static long isst_if_proc_phyid_req(u8 *cmd_ptr, int *write_only, int resume) +{ + struct isst_if_cpu_map *cpu_map; + + cpu_map = (struct isst_if_cpu_map *)cmd_ptr; + if (cpu_map->logical_cpu >= nr_cpu_ids || + cpu_map->logical_cpu >= num_possible_cpus()) + return -EINVAL; + + *write_only = 0; + cpu_map->physical_cpu = isst_cpu_info[cpu_map->logical_cpu].punit_cpu_id; + + return 0; +} + +static long isst_if_exec_multi_cmd(void __user *argp, struct isst_if_cmd_cb *cb) +{ + unsigned char __user *ptr; + u32 cmd_count; + u8 *cmd_ptr; + long ret; + int i; + + /* Each multi command has u32 command count as the first field */ + if (copy_from_user(&cmd_count, argp, sizeof(cmd_count))) + return -EFAULT; + + if (!cmd_count || cmd_count > ISST_IF_CMD_LIMIT) + return -EINVAL; + + cmd_ptr = kmalloc(cb->cmd_size, GFP_KERNEL); + if (!cmd_ptr) + return -ENOMEM; + + /* cb->offset points to start of the command after the command count */ + ptr = argp + cb->offset; + + for (i = 0; i < cmd_count; ++i) { + int wr_only; + + if (signal_pending(current)) { + ret = -EINTR; + break; + } + + if (copy_from_user(cmd_ptr, ptr, cb->cmd_size)) { + ret = -EFAULT; + break; + } + + ret = cb->cmd_callback(cmd_ptr, &wr_only, 0); + if (ret) + break; + + if (!wr_only && copy_to_user(ptr, cmd_ptr, cb->cmd_size)) { + ret = -EFAULT; + break; + } + + ptr += cb->cmd_size; + } + + kfree(cmd_ptr); + + return i ? i : ret; +} + static long isst_if_def_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { void __user *argp = (void __user *)arg; + struct isst_if_cmd_cb cmd_cb; long ret = -ENOTTY;
switch (cmd) { case ISST_IF_GET_PLATFORM_INFO: ret = isst_if_get_platform_info(argp); break; + case ISST_IF_GET_PHY_ID: + cmd_cb.cmd_size = sizeof(struct isst_if_cpu_map); + cmd_cb.offset = offsetof(struct isst_if_cpu_maps, cpu_map); + cmd_cb.cmd_callback = isst_if_proc_phyid_req; + ret = isst_if_exec_multi_cmd(argp, &cmd_cb); + break; default: break; } diff --git a/include/uapi/linux/isst_if.h b/include/uapi/linux/isst_if.h index fa94480b5f74..15d1f286a830 100644 --- a/include/uapi/linux/isst_if.h +++ b/include/uapi/linux/isst_if.h @@ -36,6 +36,34 @@ struct isst_if_platform_info { __u8 mmio_supported; };
+/** + * struct isst_if_cpu_map - CPU mapping between logical and physical CPU + * @logical_cpu: Linux logical CPU number + * @physical_cpu: PUNIT CPU number + * + * Used to convert from Linux logical CPU to PUNIT CPU numbering scheme. + * The PUNIT CPU number is different than APIC ID based CPU numbering. + */ +struct isst_if_cpu_map { + __u32 logical_cpu; + __u32 physical_cpu; +}; + +/** + * struct isst_if_cpu_maps - structure for CPU map IOCTL + * @cmd_count: Number of CPU mapping command in cpu_map[] + * @cpu_map[]: Holds one or more CPU map data structure + * + * This structure used with ioctl ISST_IF_GET_PHY_ID to send + * one or more CPU mapping commands. Here IOCTL return value indicates + * number of commands sent or error number if no commands have been sent. + */ +struct isst_if_cpu_maps { + __u32 cmd_count; + struct isst_if_cpu_map cpu_map[1]; +}; + #define ISST_IF_MAGIC 0xFE #define ISST_IF_GET_PLATFORM_INFO _IOR(ISST_IF_MAGIC, 0, struct isst_if_platform_info *) +#define ISST_IF_GET_PHY_ID _IOWR(ISST_IF_MAGIC, 1, struct isst_if_cpu_map *) #endif
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit d3a23584294c1f379239a3b52bac13e03fecd147 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d3a23584294c1f379239a3b52bac13e03fecd147 upstream.
Added MMIO interface to read/write specific offsets in PUNIT PCI device which export core priortization. This MMIO interface can be used using ioctl interface on /dev/isst_interface using IOCTL ISST_IF_IO_CMD.
This MMIO interface is used by the intel-speed-select tool under tools/x86/power to enumerate and set core priority. The MMIO offsets and semantics of the message can be checked from the source code of the tool.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../x86/intel_speed_select_if/Makefile | 1 + .../intel_speed_select_if/isst_if_common.c | 6 + .../intel_speed_select_if/isst_if_common.h | 2 + .../x86/intel_speed_select_if/isst_if_mmio.c | 131 ++++++++++++++++++ include/uapi/linux/isst_if.h | 33 +++++ 5 files changed, 173 insertions(+) create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c
diff --git a/drivers/platform/x86/intel_speed_select_if/Makefile b/drivers/platform/x86/intel_speed_select_if/Makefile index c12687672fc9..7e94919208d3 100644 --- a/drivers/platform/x86/intel_speed_select_if/Makefile +++ b/drivers/platform/x86/intel_speed_select_if/Makefile @@ -5,3 +5,4 @@ #
obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_common.o +obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_mmio.o diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index 72e74d72724b..3f96a3925bc6 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -206,6 +206,7 @@ static long isst_if_def_ioctl(struct file *file, unsigned int cmd, { void __user *argp = (void __user *)arg; struct isst_if_cmd_cb cmd_cb; + struct isst_if_cmd_cb *cb; long ret = -ENOTTY;
switch (cmd) { @@ -218,6 +219,11 @@ static long isst_if_def_ioctl(struct file *file, unsigned int cmd, cmd_cb.cmd_callback = isst_if_proc_phyid_req; ret = isst_if_exec_multi_cmd(argp, &cmd_cb); break; + case ISST_IF_IO_CMD: + cb = &punit_callbacks[ISST_IF_DEV_MMIO]; + if (cb->registered) + ret = isst_if_exec_multi_cmd(argp, cb); + break; default: break; } diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h index dade77c58b22..cdc7d019748a 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h @@ -10,6 +10,8 @@ #ifndef __ISST_IF_COMMON_H #define __ISST_IF_COMMON_H
+#define INTEL_RAPL_PRIO_DEVID_0 0x3451 + /* * Validate maximum commands in a single request. * This is enough to handle command to every core in one ioctl, or all diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c new file mode 100644 index 000000000000..1c25a1235b9e --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c @@ -0,0 +1,131 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel Speed Select Interface: MMIO Interface + * Copyright (c) 2019, Intel Corporation. + * All rights reserved. + * + * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com + */ + +#include <linux/module.h> +#include <linux/pci.h> +#include <linux/sched/signal.h> +#include <linux/uaccess.h> +#include <uapi/linux/isst_if.h> + +#include "isst_if_common.h" + +struct isst_if_device { + void __iomem *punit_mmio; + struct mutex mutex; +}; + +static long isst_if_mmio_rd_wr(u8 *cmd_ptr, int *write_only, int resume) +{ + struct isst_if_device *punit_dev; + struct isst_if_io_reg *io_reg; + struct pci_dev *pdev; + + io_reg = (struct isst_if_io_reg *)cmd_ptr; + if (io_reg->reg < 0x04 || io_reg->reg > 0xD0) + return -EINVAL; + + if (io_reg->read_write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + pdev = isst_if_get_pci_dev(io_reg->logical_cpu, 0, 0, 1); + if (!pdev) + return -EINVAL; + + punit_dev = pci_get_drvdata(pdev); + if (!punit_dev) + return -EINVAL; + + /* + * Ensure that operation is complete on a PCI device to avoid read + * write race by using per PCI device mutex. + */ + mutex_lock(&punit_dev->mutex); + if (io_reg->read_write) { + writel(io_reg->value, punit_dev->punit_mmio+io_reg->reg); + *write_only = 1; + } else { + io_reg->value = readl(punit_dev->punit_mmio+io_reg->reg); + *write_only = 0; + } + mutex_unlock(&punit_dev->mutex); + + return 0; +} + +static const struct pci_device_id isst_if_ids[] = { + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, INTEL_RAPL_PRIO_DEVID_0)}, + { 0 }, +}; +MODULE_DEVICE_TABLE(pci, isst_if_ids); + +static int isst_if_probe(struct pci_dev *pdev, const struct pci_device_id *ent) +{ + struct isst_if_device *punit_dev; + struct isst_if_cmd_cb cb; + u32 mmio_base, pcu_base; + u64 base_addr; + int ret; + + punit_dev = devm_kzalloc(&pdev->dev, sizeof(*punit_dev), GFP_KERNEL); + if (!punit_dev) + return -ENOMEM; + + ret = pcim_enable_device(pdev); + if (ret) + return ret; + + ret = pci_read_config_dword(pdev, 0xD0, &mmio_base); + if (ret) + return ret; + + ret = pci_read_config_dword(pdev, 0xFC, &pcu_base); + if (ret) + return ret; + + pcu_base &= GENMASK(10, 0); + base_addr = (u64)mmio_base << 23 | (u64) pcu_base << 12; + punit_dev->punit_mmio = devm_ioremap(&pdev->dev, base_addr, 256); + if (!punit_dev->punit_mmio) + return -ENOMEM; + + mutex_init(&punit_dev->mutex); + pci_set_drvdata(pdev, punit_dev); + + memset(&cb, 0, sizeof(cb)); + cb.cmd_size = sizeof(struct isst_if_io_reg); + cb.offset = offsetof(struct isst_if_io_regs, io_reg); + cb.cmd_callback = isst_if_mmio_rd_wr; + cb.owner = THIS_MODULE; + ret = isst_if_cdev_register(ISST_IF_DEV_MMIO, &cb); + if (ret) + mutex_destroy(&punit_dev->mutex); + + return ret; +} + +static void isst_if_remove(struct pci_dev *pdev) +{ + struct isst_if_device *punit_dev; + + punit_dev = pci_get_drvdata(pdev); + isst_if_cdev_unregister(ISST_IF_DEV_MBOX); + mutex_destroy(&punit_dev->mutex); +} + +static struct pci_driver isst_if_pci_driver = { + .name = "isst_if_pci", + .id_table = isst_if_ids, + .probe = isst_if_probe, + .remove = isst_if_remove, +}; + +module_pci_driver(isst_if_pci_driver); + +MODULE_LICENSE("GPL v2"); +MODULE_DESCRIPTION("Intel speed select interface mmio driver"); diff --git a/include/uapi/linux/isst_if.h b/include/uapi/linux/isst_if.h index 15d1f286a830..fe2492ade078 100644 --- a/include/uapi/linux/isst_if.h +++ b/include/uapi/linux/isst_if.h @@ -63,7 +63,40 @@ struct isst_if_cpu_maps { struct isst_if_cpu_map cpu_map[1]; };
+/** + * struct isst_if_io_reg - Read write PUNIT IO register + * @read_write: Value 0: Read, 1: Write + * @logical_cpu: Logical CPU number to get target PCI device. + * @reg: PUNIT register offset + * @value: For write operation value to write and for + * for read placeholder read value + * + * Structure to specify read/write data to PUNIT registers. + */ +struct isst_if_io_reg { + __u32 read_write; /* Read:0, Write:1 */ + __u32 logical_cpu; + __u32 reg; + __u32 value; +}; + +/** + * struct isst_if_io_regs - structure for IO register commands + * @cmd_count: Number of io reg commands in io_reg[] + * @io_reg[]: Holds one or more io_reg command structure + * + * This structure used with ioctl ISST_IF_IO_CMD to send + * one or more read/write commands to PUNIT. Here IOCTL return value + * indicates number of requests sent or error number if no requests have + * been sent. + */ +struct isst_if_io_regs { + __u32 req_count; + struct isst_if_io_reg io_reg[1]; +}; + #define ISST_IF_MAGIC 0xFE #define ISST_IF_GET_PLATFORM_INFO _IOR(ISST_IF_MAGIC, 0, struct isst_if_platform_info *) #define ISST_IF_GET_PHY_ID _IOWR(ISST_IF_MAGIC, 1, struct isst_if_cpu_map *) +#define ISST_IF_IO_CMD _IOW(ISST_IF_MAGIC, 2, struct isst_if_io_regs *) #endif
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 31a166fe9c269af17977e650846ee4ea50361c07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 31a166fe9c269af17977e650846ee4ea50361c07 upstream.
Add an IOCTL to send mailbox commands to PUNIT using PUNIT PCI device. A limited set of mailbox commands can be sent to PUNIT.
This MMIO interface is used by the intel-speed-select tool under tools/x86/power to enumerate and control Intel Speed Select features. The MBOX commands ids and semantics of the message can be checked from the source code of the tool.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../x86/intel_speed_select_if/Makefile | 1 + .../intel_speed_select_if/isst_if_common.c | 85 ++++++++ .../intel_speed_select_if/isst_if_common.h | 3 + .../intel_speed_select_if/isst_if_mbox_pci.c | 199 ++++++++++++++++++ include/uapi/linux/isst_if.h | 38 ++++ 5 files changed, 326 insertions(+) create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c
diff --git a/drivers/platform/x86/intel_speed_select_if/Makefile b/drivers/platform/x86/intel_speed_select_if/Makefile index 7e94919208d3..8dec8c858649 100644 --- a/drivers/platform/x86/intel_speed_select_if/Makefile +++ b/drivers/platform/x86/intel_speed_select_if/Makefile @@ -6,3 +6,4 @@
obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_common.o obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_mmio.o +obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_mbox_pci.o diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index 3f96a3925bc6..391fc3f12161 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -25,6 +25,86 @@
static struct isst_if_cmd_cb punit_callbacks[ISST_IF_DEV_MAX];
+struct isst_valid_cmd_ranges { + u16 cmd; + u16 sub_cmd_beg; + u16 sub_cmd_end; +}; + +struct isst_cmd_set_req_type { + u16 cmd; + u16 sub_cmd; + u16 param; +}; + +static const struct isst_valid_cmd_ranges isst_valid_cmds[] = { + {0xD0, 0x00, 0x03}, + {0x7F, 0x00, 0x0B}, + {0x7F, 0x10, 0x12}, + {0x7F, 0x20, 0x23}, +}; + +static const struct isst_cmd_set_req_type isst_cmd_set_reqs[] = { + {0xD0, 0x00, 0x08}, + {0xD0, 0x01, 0x08}, + {0xD0, 0x02, 0x08}, + {0xD0, 0x03, 0x08}, + {0x7F, 0x02, 0x00}, + {0x7F, 0x08, 0x00}, +}; + +/** + * isst_if_mbox_cmd_invalid() - Check invalid mailbox commands + * @cmd: Pointer to the command structure to verify. + * + * Invalid command to PUNIT to may result in instability of the platform. + * This function has a whitelist of commands, which are allowed. + * + * Return: Return true if the command is invalid, else false. + */ +bool isst_if_mbox_cmd_invalid(struct isst_if_mbox_cmd *cmd) +{ + int i; + + if (cmd->logical_cpu >= nr_cpu_ids) + return true; + + for (i = 0; i < ARRAY_SIZE(isst_valid_cmds); ++i) { + if (cmd->command == isst_valid_cmds[i].cmd && + (cmd->sub_command >= isst_valid_cmds[i].sub_cmd_beg && + cmd->sub_command <= isst_valid_cmds[i].sub_cmd_end)) { + return false; + } + } + + return true; +} +EXPORT_SYMBOL_GPL(isst_if_mbox_cmd_invalid); + +/** + * isst_if_mbox_cmd_set_req() - Check mailbox command is a set request + * @cmd: Pointer to the command structure to verify. + * + * Check if the given mail box level is set request and not a get request. + * + * Return: Return true if the command is set_req, else false. + */ +bool isst_if_mbox_cmd_set_req(struct isst_if_mbox_cmd *cmd) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(isst_cmd_set_reqs); ++i) { + if (cmd->command == isst_cmd_set_reqs[i].cmd && + cmd->sub_command == isst_cmd_set_reqs[i].sub_cmd && + cmd->parameter == isst_cmd_set_reqs[i].param) { + return true; + } + } + + return false; +} +EXPORT_SYMBOL_GPL(isst_if_mbox_cmd_set_req); + static int isst_if_get_platform_info(void __user *argp) { struct isst_if_platform_info info; @@ -224,6 +304,11 @@ static long isst_if_def_ioctl(struct file *file, unsigned int cmd, if (cb->registered) ret = isst_if_exec_multi_cmd(argp, cb); break; + case ISST_IF_MBOX_COMMAND: + cb = &punit_callbacks[ISST_IF_DEV_MBOX]; + if (cb->registered) + ret = isst_if_exec_multi_cmd(argp, cb); + break; default: break; } diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h index cdc7d019748a..7c0f71221da7 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h @@ -11,6 +11,7 @@ #define __ISST_IF_COMMON_H
#define INTEL_RAPL_PRIO_DEVID_0 0x3451 +#define INTEL_CFG_MBOX_DEVID_0 0x3459
/* * Validate maximum commands in a single request. @@ -60,4 +61,6 @@ struct isst_if_cmd_cb { int isst_if_cdev_register(int type, struct isst_if_cmd_cb *cb); void isst_if_cdev_unregister(int type); struct pci_dev *isst_if_get_pci_dev(int cpu, int bus, int dev, int fn); +bool isst_if_mbox_cmd_set_req(struct isst_if_mbox_cmd *mbox_cmd); +bool isst_if_mbox_cmd_invalid(struct isst_if_mbox_cmd *cmd); #endif diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c new file mode 100644 index 000000000000..1c4f2893cd80 --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c @@ -0,0 +1,199 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel Speed Select Interface: Mbox via PCI Interface + * Copyright (c) 2019, Intel Corporation. + * All rights reserved. + * + * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com + */ + +#include <linux/cpufeature.h> +#include <linux/module.h> +#include <linux/pci.h> +#include <linux/sched/signal.h> +#include <linux/uaccess.h> +#include <uapi/linux/isst_if.h> + +#include "isst_if_common.h" + +#define PUNIT_MAILBOX_DATA 0xA0 +#define PUNIT_MAILBOX_INTERFACE 0xA4 +#define PUNIT_MAILBOX_BUSY_BIT 31 + +/* + * Commands has variable amount of processing time. Most of the commands will + * be done in 0-3 tries, but some takes up to 50. + * The real processing time was observed as 25us for the most of the commands + * at 2GHz. It is possible to optimize this count taking samples on customer + * systems. + */ +#define OS_MAILBOX_RETRY_COUNT 50 + +struct isst_if_device { + struct mutex mutex; +}; + +static int isst_if_mbox_cmd(struct pci_dev *pdev, + struct isst_if_mbox_cmd *mbox_cmd) +{ + u32 retries, data; + int ret; + + /* Poll for rb bit == 0 */ + retries = OS_MAILBOX_RETRY_COUNT; + do { + ret = pci_read_config_dword(pdev, PUNIT_MAILBOX_INTERFACE, + &data); + if (ret) + return ret; + + if (data & BIT_ULL(PUNIT_MAILBOX_BUSY_BIT)) { + ret = -EBUSY; + continue; + } + ret = 0; + break; + } while (--retries); + + if (ret) + return ret; + + /* Write DATA register */ + ret = pci_write_config_dword(pdev, PUNIT_MAILBOX_DATA, + mbox_cmd->req_data); + if (ret) + return ret; + + /* Write command register */ + data = BIT_ULL(PUNIT_MAILBOX_BUSY_BIT) | + (mbox_cmd->parameter & GENMASK_ULL(13, 0)) << 16 | + (mbox_cmd->sub_command << 8) | + mbox_cmd->command; + + ret = pci_write_config_dword(pdev, PUNIT_MAILBOX_INTERFACE, data); + if (ret) + return ret; + + /* Poll for rb bit == 0 */ + retries = OS_MAILBOX_RETRY_COUNT; + do { + ret = pci_read_config_dword(pdev, PUNIT_MAILBOX_INTERFACE, + &data); + if (ret) + return ret; + + if (data & BIT_ULL(PUNIT_MAILBOX_BUSY_BIT)) { + ret = -EBUSY; + continue; + } + + if (data & 0xff) + return -ENXIO; + + ret = pci_read_config_dword(pdev, PUNIT_MAILBOX_DATA, &data); + if (ret) + return ret; + + mbox_cmd->resp_data = data; + ret = 0; + break; + } while (--retries); + + return ret; +} + +static long isst_if_mbox_proc_cmd(u8 *cmd_ptr, int *write_only, int resume) +{ + struct isst_if_mbox_cmd *mbox_cmd; + struct isst_if_device *punit_dev; + struct pci_dev *pdev; + int ret; + + mbox_cmd = (struct isst_if_mbox_cmd *)cmd_ptr; + + if (isst_if_mbox_cmd_invalid(mbox_cmd)) + return -EINVAL; + + if (isst_if_mbox_cmd_set_req(mbox_cmd) && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + pdev = isst_if_get_pci_dev(mbox_cmd->logical_cpu, 1, 30, 1); + if (!pdev) + return -EINVAL; + + punit_dev = pci_get_drvdata(pdev); + if (!punit_dev) + return -EINVAL; + + /* + * Basically we are allowing one complete mailbox transaction on + * a mapped PCI device at a time. + */ + mutex_lock(&punit_dev->mutex); + ret = isst_if_mbox_cmd(pdev, mbox_cmd); + mutex_unlock(&punit_dev->mutex); + if (ret) + return ret; + + *write_only = 0; + + return 0; +} + +static const struct pci_device_id isst_if_mbox_ids[] = { + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, INTEL_CFG_MBOX_DEVID_0)}, + { 0 }, +}; +MODULE_DEVICE_TABLE(pci, isst_if_mbox_ids); + +static int isst_if_mbox_probe(struct pci_dev *pdev, + const struct pci_device_id *ent) +{ + struct isst_if_device *punit_dev; + struct isst_if_cmd_cb cb; + int ret; + + punit_dev = devm_kzalloc(&pdev->dev, sizeof(*punit_dev), GFP_KERNEL); + if (!punit_dev) + return -ENOMEM; + + ret = pcim_enable_device(pdev); + if (ret) + return ret; + + mutex_init(&punit_dev->mutex); + pci_set_drvdata(pdev, punit_dev); + + memset(&cb, 0, sizeof(cb)); + cb.cmd_size = sizeof(struct isst_if_mbox_cmd); + cb.offset = offsetof(struct isst_if_mbox_cmds, mbox_cmd); + cb.cmd_callback = isst_if_mbox_proc_cmd; + cb.owner = THIS_MODULE; + ret = isst_if_cdev_register(ISST_IF_DEV_MBOX, &cb); + + if (ret) + mutex_destroy(&punit_dev->mutex); + + return ret; +} + +static void isst_if_mbox_remove(struct pci_dev *pdev) +{ + struct isst_if_device *punit_dev; + + punit_dev = pci_get_drvdata(pdev); + isst_if_cdev_unregister(ISST_IF_DEV_MBOX); + mutex_destroy(&punit_dev->mutex); +} + +static struct pci_driver isst_if_pci_driver = { + .name = "isst_if_mbox_pci", + .id_table = isst_if_mbox_ids, + .probe = isst_if_mbox_probe, + .remove = isst_if_mbox_remove, +}; + +module_pci_driver(isst_if_pci_driver); + +MODULE_LICENSE("GPL v2"); +MODULE_DESCRIPTION("Intel speed select interface pci mailbox driver"); diff --git a/include/uapi/linux/isst_if.h b/include/uapi/linux/isst_if.h index fe2492ade078..e4b1c2ec3279 100644 --- a/include/uapi/linux/isst_if.h +++ b/include/uapi/linux/isst_if.h @@ -95,8 +95,46 @@ struct isst_if_io_regs { struct isst_if_io_reg io_reg[1]; };
+/** + * struct isst_if_mbox_cmd - Structure to define mail box command + * @logical_cpu: Logical CPU number to get target PCI device + * @parameter: Mailbox parameter value + * @req_data: Request data for the mailbox + * @resp_data: Response data for mailbox command response + * @command: Mailbox command value + * @sub_command: Mailbox sub command value + * @reserved: Unused, set to 0 + * + * Structure to specify mailbox command to be sent to PUNIT. + */ +struct isst_if_mbox_cmd { + __u32 logical_cpu; + __u32 parameter; + __u32 req_data; + __u32 resp_data; + __u16 command; + __u16 sub_command; + __u32 reserved; +}; + +/** + * struct isst_if_mbox_cmds - structure for mailbox commands + * @cmd_count: Number of mailbox commands in mbox_cmd[] + * @mbox_cmd[]: Holds one or more mbox commands + * + * This structure used with ioctl ISST_IF_MBOX_COMMAND to send + * one or more mailbox commands to PUNIT. Here IOCTL return value + * indicates number of commands sent or error number if no commands have + * been sent. + */ +struct isst_if_mbox_cmds { + __u32 cmd_count; + struct isst_if_mbox_cmd mbox_cmd[1]; +}; + #define ISST_IF_MAGIC 0xFE #define ISST_IF_GET_PLATFORM_INFO _IOR(ISST_IF_MAGIC, 0, struct isst_if_platform_info *) #define ISST_IF_GET_PHY_ID _IOWR(ISST_IF_MAGIC, 1, struct isst_if_cpu_map *) #define ISST_IF_IO_CMD _IOW(ISST_IF_MAGIC, 2, struct isst_if_io_regs *) +#define ISST_IF_MBOX_COMMAND _IOWR(ISST_IF_MAGIC, 3, struct isst_if_mbox_cmds *) #endif
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 71b21bd7f68a6ee59003f63d2e4f84fd9b0a8d07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 71b21bd7f68a6ee59003f63d2e4f84fd9b0a8d07 upstream.
Add an IOCTL to send mailbox commands to PUNIT using PUNIT MSRs for mailbox. Some CPU models don't have PCI device, so need to use MSRs. A limited set of mailbox commands can be sent to PUNIT.
This MMIO interface is used by the intel-speed-select tool under tools/x86/power to enumerate and control Intel Speed Select features. The MBOX commands ids and semantics of the message can be checked from the source code of the tool.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../x86/intel_speed_select_if/Makefile | 1 + .../intel_speed_select_if/isst_if_mbox_msr.c | 180 ++++++++++++++++++ 2 files changed, 181 insertions(+) create mode 100644 drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c
diff --git a/drivers/platform/x86/intel_speed_select_if/Makefile b/drivers/platform/x86/intel_speed_select_if/Makefile index 8dec8c858649..856076206f35 100644 --- a/drivers/platform/x86/intel_speed_select_if/Makefile +++ b/drivers/platform/x86/intel_speed_select_if/Makefile @@ -7,3 +7,4 @@ obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_common.o obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_mmio.o obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_mbox_pci.o +obj-$(CONFIG_INTEL_SPEED_SELECT_INTERFACE) += isst_if_mbox_msr.o diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c new file mode 100644 index 000000000000..d4bfc32546b5 --- /dev/null +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c @@ -0,0 +1,180 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel Speed Select Interface: Mbox via MSR Interface + * Copyright (c) 2019, Intel Corporation. + * All rights reserved. + * + * Author: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com + */ + +#include <linux/module.h> +#include <linux/cpuhotplug.h> +#include <linux/pci.h> +#include <linux/sched/signal.h> +#include <linux/slab.h> +#include <linux/topology.h> +#include <linux/uaccess.h> +#include <uapi/linux/isst_if.h> +#include <asm/cpu_device_id.h> +#include <asm/intel-family.h> + +#include "isst_if_common.h" + +#define MSR_OS_MAILBOX_INTERFACE 0xB0 +#define MSR_OS_MAILBOX_DATA 0xB1 +#define MSR_OS_MAILBOX_BUSY_BIT 31 + +/* + * Based on experiments count is never more than 1, as the MSR overhead + * is enough to finish the command. So here this is the worst case number. + */ +#define OS_MAILBOX_RETRY_COUNT 3 + +static int isst_if_send_mbox_cmd(u8 command, u8 sub_command, u32 parameter, + u32 command_data, u32 *response_data) +{ + u32 retries; + u64 data; + int ret; + + /* Poll for rb bit == 0 */ + retries = OS_MAILBOX_RETRY_COUNT; + do { + rdmsrl(MSR_OS_MAILBOX_INTERFACE, data); + if (data & BIT_ULL(MSR_OS_MAILBOX_BUSY_BIT)) { + ret = -EBUSY; + continue; + } + ret = 0; + break; + } while (--retries); + + if (ret) + return ret; + + /* Write DATA register */ + wrmsrl(MSR_OS_MAILBOX_DATA, command_data); + + /* Write command register */ + data = BIT_ULL(MSR_OS_MAILBOX_BUSY_BIT) | + (parameter & GENMASK_ULL(13, 0)) << 16 | + (sub_command << 8) | + command; + wrmsrl(MSR_OS_MAILBOX_INTERFACE, data); + + /* Poll for rb bit == 0 */ + retries = OS_MAILBOX_RETRY_COUNT; + do { + rdmsrl(MSR_OS_MAILBOX_INTERFACE, data); + if (data & BIT_ULL(MSR_OS_MAILBOX_BUSY_BIT)) { + ret = -EBUSY; + continue; + } + + if (data & 0xff) + return -ENXIO; + + if (response_data) { + rdmsrl(MSR_OS_MAILBOX_DATA, data); + *response_data = data; + } + ret = 0; + break; + } while (--retries); + + return ret; +} + +struct msrl_action { + int err; + struct isst_if_mbox_cmd *mbox_cmd; +}; + +/* revisit, smp_call_function_single should be enough for atomic mailbox! */ +static void msrl_update_func(void *info) +{ + struct msrl_action *act = info; + + act->err = isst_if_send_mbox_cmd(act->mbox_cmd->command, + act->mbox_cmd->sub_command, + act->mbox_cmd->parameter, + act->mbox_cmd->req_data, + &act->mbox_cmd->resp_data); +} + +static long isst_if_mbox_proc_cmd(u8 *cmd_ptr, int *write_only, int resume) +{ + struct msrl_action action; + int ret; + + action.mbox_cmd = (struct isst_if_mbox_cmd *)cmd_ptr; + + if (isst_if_mbox_cmd_invalid(action.mbox_cmd)) + return -EINVAL; + + if (isst_if_mbox_cmd_set_req(action.mbox_cmd) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* + * To complete mailbox command, we need to access two MSRs. + * So we don't want race to complete a mailbox transcation. + * Here smp_call ensures that msrl_update_func() has no race + * and also with wait flag, wait for completion. + * smp_call_function_single is using get_cpu() and put_cpu(). + */ + ret = smp_call_function_single(action.mbox_cmd->logical_cpu, + msrl_update_func, &action, 1); + if (ret) + return ret; + + *write_only = 0; + + return action.err; +} + +#define ICPU(model) { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, } + +static const struct x86_cpu_id isst_if_cpu_ids[] = { + ICPU(INTEL_FAM6_SKYLAKE_X), + {} +}; +MODULE_DEVICE_TABLE(x86cpu, isst_if_cpu_ids); + +static int __init isst_if_mbox_init(void) +{ + struct isst_if_cmd_cb cb; + const struct x86_cpu_id *id; + u64 data; + int ret; + + id = x86_match_cpu(isst_if_cpu_ids); + if (!id) + return -ENODEV; + + /* Check presence of mailbox MSRs */ + ret = rdmsrl_safe(MSR_OS_MAILBOX_INTERFACE, &data); + if (ret) + return ret; + + ret = rdmsrl_safe(MSR_OS_MAILBOX_DATA, &data); + if (ret) + return ret; + + memset(&cb, 0, sizeof(cb)); + cb.cmd_size = sizeof(struct isst_if_mbox_cmd); + cb.offset = offsetof(struct isst_if_mbox_cmds, mbox_cmd); + cb.cmd_callback = isst_if_mbox_proc_cmd; + cb.owner = THIS_MODULE; + return isst_if_cdev_register(ISST_IF_DEV_MBOX, &cb); +} +module_init(isst_if_mbox_init) + +static void __exit isst_if_mbox_exit(void) +{ + isst_if_cdev_unregister(ISST_IF_DEV_MBOX); +} +module_exit(isst_if_mbox_exit) + +MODULE_LICENSE("GPL v2"); +MODULE_DESCRIPTION("Intel speed select interface mailbox driver");
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit e765f37b9b8b4fa65682e9a78a2ca2b11d3d9096 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e765f37b9b8b4fa65682e9a78a2ca2b11d3d9096 upstream.
While using new non arhitectural features using PUNIT Mailbox and MMIO read/write interface, still there is need to operate using MSRs to control PUNIT. User space could have used user user-space MSR interface for this, but when user space MSR access is disabled, then it can't. Here only limited number of MSRs are allowed using this new interface.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../intel_speed_select_if/isst_if_common.c | 59 +++++++++++++++++++ include/uapi/linux/isst_if.h | 32 ++++++++++ 2 files changed, 91 insertions(+)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index 391fc3f12161..de2fb5292f1c 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -25,6 +25,11 @@
static struct isst_if_cmd_cb punit_callbacks[ISST_IF_DEV_MAX];
+static int punit_msr_white_list[] = { + MSR_TURBO_RATIO_LIMIT, + MSR_CONFIG_TDP_CONTROL, +}; + struct isst_valid_cmd_ranges { u16 cmd; u16 sub_cmd_beg; @@ -229,6 +234,54 @@ static long isst_if_proc_phyid_req(u8 *cmd_ptr, int *write_only, int resume) return 0; }
+static bool match_punit_msr_white_list(int msr) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(punit_msr_white_list); ++i) { + if (punit_msr_white_list[i] == msr) + return true; + } + + return false; +} + +static long isst_if_msr_cmd_req(u8 *cmd_ptr, int *write_only, int resume) +{ + struct isst_if_msr_cmd *msr_cmd; + int ret; + + msr_cmd = (struct isst_if_msr_cmd *)cmd_ptr; + + if (!match_punit_msr_white_list(msr_cmd->msr)) + return -EINVAL; + + if (msr_cmd->logical_cpu >= nr_cpu_ids) + return -EINVAL; + + if (msr_cmd->read_write) { + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + ret = wrmsrl_safe_on_cpu(msr_cmd->logical_cpu, + msr_cmd->msr, + msr_cmd->data); + *write_only = 1; + } else { + u64 data; + + ret = rdmsrl_safe_on_cpu(msr_cmd->logical_cpu, + msr_cmd->msr, &data); + if (!ret) { + msr_cmd->data = data; + *write_only = 0; + } + } + + + return ret; +} + static long isst_if_exec_multi_cmd(void __user *argp, struct isst_if_cmd_cb *cb) { unsigned char __user *ptr; @@ -309,6 +362,12 @@ static long isst_if_def_ioctl(struct file *file, unsigned int cmd, if (cb->registered) ret = isst_if_exec_multi_cmd(argp, cb); break; + case ISST_IF_MSR_COMMAND: + cmd_cb.cmd_size = sizeof(struct isst_if_msr_cmd); + cmd_cb.offset = offsetof(struct isst_if_msr_cmds, msr_cmd); + cmd_cb.cmd_callback = isst_if_msr_cmd_req; + ret = isst_if_exec_multi_cmd(argp, &cmd_cb); + break; default: break; } diff --git a/include/uapi/linux/isst_if.h b/include/uapi/linux/isst_if.h index e4b1c2ec3279..d10b832c58c5 100644 --- a/include/uapi/linux/isst_if.h +++ b/include/uapi/linux/isst_if.h @@ -132,9 +132,41 @@ struct isst_if_mbox_cmds { struct isst_if_mbox_cmd mbox_cmd[1]; };
+/** + * struct isst_if_msr_cmd - Structure to define msr command + * @read_write: Value 0: Read, 1: Write + * @logical_cpu: Logical CPU number + * @msr: MSR number + * @data: For write operation, data to write, for read + * place holder + * + * Structure to specify MSR command related to PUNIT. + */ +struct isst_if_msr_cmd { + __u32 read_write; /* Read:0, Write:1 */ + __u32 logical_cpu; + __u64 msr; + __u64 data; +}; + +/** + * struct isst_if_msr_cmds - structure for msr commands + * @cmd_count: Number of mailbox commands in msr_cmd[] + * @msr_cmd[]: Holds one or more msr commands + * + * This structure used with ioctl ISST_IF_MSR_COMMAND to send + * one or more MSR commands. IOCTL return value indicates number of + * commands sent or error number if no commands have been sent. + */ +struct isst_if_msr_cmds { + __u32 cmd_count; + struct isst_if_msr_cmd msr_cmd[1]; +}; + #define ISST_IF_MAGIC 0xFE #define ISST_IF_GET_PLATFORM_INFO _IOR(ISST_IF_MAGIC, 0, struct isst_if_platform_info *) #define ISST_IF_GET_PHY_ID _IOWR(ISST_IF_MAGIC, 1, struct isst_if_cpu_map *) #define ISST_IF_IO_CMD _IOW(ISST_IF_MAGIC, 2, struct isst_if_io_regs *) #define ISST_IF_MBOX_COMMAND _IOWR(ISST_IF_MAGIC, 3, struct isst_if_mbox_cmds *) +#define ISST_IF_MSR_COMMAND _IOWR(ISST_IF_MAGIC, 4, struct isst_if_msr_cmds *) #endif
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit f607874f35cbd276a837d7147d4e1ec752dfef44 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit f607874f35cbd276a837d7147d4e1ec752dfef44 upstream.
Commands which causes PUNIT writes, store them and restore them on system resume. The driver stores all such requests in a hash table and stores the the latest mailbox request parameters. On resume these commands mail box commands are executed again. There are only 5 such mail box commands which will trigger such processing so a very low overhead in store and execute on resume. Also there is no order requirement for mail box commands for these write/set commands. There is one MSR request for changing turbo ratio limits, this also stored and get restored on resume and cpu online.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../intel_speed_select_if/isst_if_common.c | 154 ++++++++++++++++++ .../intel_speed_select_if/isst_if_common.h | 3 + .../intel_speed_select_if/isst_if_mbox_msr.c | 38 ++++- .../intel_speed_select_if/isst_if_mbox_pci.c | 15 ++ .../x86/intel_speed_select_if/isst_if_mmio.c | 49 ++++++ 5 files changed, 258 insertions(+), 1 deletion(-)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index de2fb5292f1c..68d75391db57 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -10,6 +10,7 @@ #include <linux/cpufeature.h> #include <linux/cpuhotplug.h> #include <linux/fs.h> +#include <linux/hashtable.h> #include <linux/miscdevice.h> #include <linux/module.h> #include <linux/pci.h> @@ -58,6 +59,151 @@ static const struct isst_cmd_set_req_type isst_cmd_set_reqs[] = { {0x7F, 0x08, 0x00}, };
+struct isst_cmd { + struct hlist_node hnode; + u64 data; + u32 cmd; + int cpu; + int mbox_cmd_type; + u32 param; +}; + +static DECLARE_HASHTABLE(isst_hash, 8); +static DEFINE_MUTEX(isst_hash_lock); + +static int isst_store_new_cmd(int cmd, u32 cpu, int mbox_cmd_type, u32 param, + u32 data) +{ + struct isst_cmd *sst_cmd; + + sst_cmd = kmalloc(sizeof(*sst_cmd), GFP_KERNEL); + if (!sst_cmd) + return -ENOMEM; + + sst_cmd->cpu = cpu; + sst_cmd->cmd = cmd; + sst_cmd->mbox_cmd_type = mbox_cmd_type; + sst_cmd->param = param; + sst_cmd->data = data; + + hash_add(isst_hash, &sst_cmd->hnode, sst_cmd->cmd); + + return 0; +} + +static void isst_delete_hash(void) +{ + struct isst_cmd *sst_cmd; + struct hlist_node *tmp; + int i; + + hash_for_each_safe(isst_hash, i, tmp, sst_cmd, hnode) { + hash_del(&sst_cmd->hnode); + kfree(sst_cmd); + } +} + +/** + * isst_store_cmd() - Store command to a hash table + * @cmd: Mailbox command. + * @sub_cmd: Mailbox sub-command or MSR id. + * @mbox_cmd_type: Mailbox or MSR command. + * @param: Mailbox parameter. + * @data: Mailbox request data or MSR data. + * + * Stores the command to a hash table if there is no such command already + * stored. If already stored update the latest parameter and data for the + * command. + * + * Return: Return result of store to hash table, 0 for success, others for + * failure. + */ +int isst_store_cmd(int cmd, int sub_cmd, u32 cpu, int mbox_cmd_type, + u32 param, u64 data) +{ + struct isst_cmd *sst_cmd; + int full_cmd, ret; + + full_cmd = (cmd & GENMASK_ULL(15, 0)) << 16; + full_cmd |= (sub_cmd & GENMASK_ULL(15, 0)); + mutex_lock(&isst_hash_lock); + hash_for_each_possible(isst_hash, sst_cmd, hnode, full_cmd) { + if (sst_cmd->cmd == full_cmd && sst_cmd->cpu == cpu && + sst_cmd->mbox_cmd_type == mbox_cmd_type) { + sst_cmd->param = param; + sst_cmd->data = data; + mutex_unlock(&isst_hash_lock); + return 0; + } + } + + ret = isst_store_new_cmd(full_cmd, cpu, mbox_cmd_type, param, data); + mutex_unlock(&isst_hash_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(isst_store_cmd); + +static void isst_mbox_resume_command(struct isst_if_cmd_cb *cb, + struct isst_cmd *sst_cmd) +{ + struct isst_if_mbox_cmd mbox_cmd; + int wr_only; + + mbox_cmd.command = (sst_cmd->cmd & GENMASK_ULL(31, 16)) >> 16; + mbox_cmd.sub_command = sst_cmd->cmd & GENMASK_ULL(15, 0); + mbox_cmd.parameter = sst_cmd->param; + mbox_cmd.req_data = sst_cmd->data; + mbox_cmd.logical_cpu = sst_cmd->cpu; + (cb->cmd_callback)((u8 *)&mbox_cmd, &wr_only, 1); +} + +/** + * isst_resume_common() - Process Resume request + * + * On resume replay all mailbox commands and MSRs. + * + * Return: None. + */ +void isst_resume_common(void) +{ + struct isst_cmd *sst_cmd; + int i; + + hash_for_each(isst_hash, i, sst_cmd, hnode) { + struct isst_if_cmd_cb *cb; + + if (sst_cmd->mbox_cmd_type) { + cb = &punit_callbacks[ISST_IF_DEV_MBOX]; + if (cb->registered) + isst_mbox_resume_command(cb, sst_cmd); + } else { + wrmsrl_safe_on_cpu(sst_cmd->cpu, sst_cmd->cmd, + sst_cmd->data); + } + } +} +EXPORT_SYMBOL_GPL(isst_resume_common); + +static void isst_restore_msr_local(int cpu) +{ + struct isst_cmd *sst_cmd; + int i; + + mutex_lock(&isst_hash_lock); + for (i = 0; i < ARRAY_SIZE(punit_msr_white_list); ++i) { + if (!punit_msr_white_list[i]) + break; + + hash_for_each_possible(isst_hash, sst_cmd, hnode, + punit_msr_white_list[i]) { + if (!sst_cmd->mbox_cmd_type && sst_cmd->cpu == cpu) + wrmsrl_safe(sst_cmd->cmd, sst_cmd->data); + } + } + mutex_unlock(&isst_hash_lock); +} + /** * isst_if_mbox_cmd_invalid() - Check invalid mailbox commands * @cmd: Pointer to the command structure to verify. @@ -185,6 +331,8 @@ static int isst_if_cpu_online(unsigned int cpu) } isst_cpu_info[cpu].punit_cpu_id = data;
+ isst_restore_msr_local(cpu); + return 0; }
@@ -267,6 +415,10 @@ static long isst_if_msr_cmd_req(u8 *cmd_ptr, int *write_only, int resume) msr_cmd->msr, msr_cmd->data); *write_only = 1; + if (!ret && !resume) + ret = isst_store_cmd(0, msr_cmd->msr, + msr_cmd->logical_cpu, + 0, 0, msr_cmd->data); } else { u64 data;
@@ -507,6 +659,8 @@ void isst_if_cdev_unregister(int device_type) mutex_lock(&punit_misc_dev_lock); misc_usage_count--; punit_callbacks[device_type].registered = 0; + if (device_type == ISST_IF_DEV_MBOX) + isst_delete_hash(); if (!misc_usage_count && !misc_device_ret) { misc_deregister(&isst_if_char_driver); isst_if_cpu_info_exit(); diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h index 7c0f71221da7..1409a5bb5582 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.h +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.h @@ -63,4 +63,7 @@ void isst_if_cdev_unregister(int type); struct pci_dev *isst_if_get_pci_dev(int cpu, int bus, int dev, int fn); bool isst_if_mbox_cmd_set_req(struct isst_if_mbox_cmd *mbox_cmd); bool isst_if_mbox_cmd_invalid(struct isst_if_mbox_cmd *cmd); +int isst_store_cmd(int cmd, int sub_command, u32 cpu, int mbox_cmd, + u32 param, u64 data); +void isst_resume_common(void); #endif diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c index d4bfc32546b5..89b042aecef3 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_msr.c @@ -12,6 +12,7 @@ #include <linux/pci.h> #include <linux/sched/signal.h> #include <linux/slab.h> +#include <linux/suspend.h> #include <linux/topology.h> #include <linux/uaccess.h> #include <uapi/linux/isst_if.h> @@ -128,11 +129,37 @@ static long isst_if_mbox_proc_cmd(u8 *cmd_ptr, int *write_only, int resume) if (ret) return ret;
+ if (!action.err && !resume && isst_if_mbox_cmd_set_req(action.mbox_cmd)) + action.err = isst_store_cmd(action.mbox_cmd->command, + action.mbox_cmd->sub_command, + action.mbox_cmd->logical_cpu, 1, + action.mbox_cmd->parameter, + action.mbox_cmd->req_data); *write_only = 0;
return action.err; }
+ +static int isst_pm_notify(struct notifier_block *nb, + unsigned long mode, void *_unused) +{ + switch (mode) { + case PM_POST_HIBERNATION: + case PM_POST_RESTORE: + case PM_POST_SUSPEND: + isst_resume_common(); + break; + default: + break; + } + return 0; +} + +static struct notifier_block isst_pm_nb = { + .notifier_call = isst_pm_notify, +}; + #define ICPU(model) { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, }
static const struct x86_cpu_id isst_if_cpu_ids[] = { @@ -166,12 +193,21 @@ static int __init isst_if_mbox_init(void) cb.offset = offsetof(struct isst_if_mbox_cmds, mbox_cmd); cb.cmd_callback = isst_if_mbox_proc_cmd; cb.owner = THIS_MODULE; - return isst_if_cdev_register(ISST_IF_DEV_MBOX, &cb); + ret = isst_if_cdev_register(ISST_IF_DEV_MBOX, &cb); + if (ret) + return ret; + + ret = register_pm_notifier(&isst_pm_nb); + if (ret) + isst_if_cdev_unregister(ISST_IF_DEV_MBOX); + + return ret; } module_init(isst_if_mbox_init)
static void __exit isst_if_mbox_exit(void) { + unregister_pm_notifier(&isst_pm_nb); isst_if_cdev_unregister(ISST_IF_DEV_MBOX); } module_exit(isst_if_mbox_exit) diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c index 1c4f2893cd80..de4169d0796b 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c @@ -131,6 +131,12 @@ static long isst_if_mbox_proc_cmd(u8 *cmd_ptr, int *write_only, int resume) */ mutex_lock(&punit_dev->mutex); ret = isst_if_mbox_cmd(pdev, mbox_cmd); + if (!ret && !resume && isst_if_mbox_cmd_set_req(mbox_cmd)) + ret = isst_store_cmd(mbox_cmd->command, + mbox_cmd->sub_command, + mbox_cmd->logical_cpu, 1, + mbox_cmd->parameter, + mbox_cmd->req_data); mutex_unlock(&punit_dev->mutex); if (ret) return ret; @@ -186,11 +192,20 @@ static void isst_if_mbox_remove(struct pci_dev *pdev) mutex_destroy(&punit_dev->mutex); }
+static int __maybe_unused isst_if_resume(struct device *device) +{ + isst_resume_common(); + return 0; +} + +static SIMPLE_DEV_PM_OPS(isst_if_pm_ops, NULL, isst_if_resume); + static struct pci_driver isst_if_pci_driver = { .name = "isst_if_mbox_pci", .id_table = isst_if_mbox_ids, .probe = isst_if_mbox_probe, .remove = isst_if_mbox_remove, + .driver.pm = &isst_if_pm_ops, };
module_pci_driver(isst_if_pci_driver); diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c index 1c25a1235b9e..f7266a115a08 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c @@ -15,8 +15,20 @@
#include "isst_if_common.h"
+struct isst_mmio_range { + int beg; + int end; +}; + +struct isst_mmio_range mmio_range[] = { + {0x04, 0x14}, + {0x20, 0xD0}, +}; + struct isst_if_device { void __iomem *punit_mmio; + u32 range_0[5]; + u32 range_1[45]; struct mutex mutex; };
@@ -118,11 +130,48 @@ static void isst_if_remove(struct pci_dev *pdev) mutex_destroy(&punit_dev->mutex); }
+static int __maybe_unused isst_if_suspend(struct device *device) +{ + struct pci_dev *pdev = to_pci_dev(device); + struct isst_if_device *punit_dev; + int i; + + punit_dev = pci_get_drvdata(pdev); + for (i = 0; i < ARRAY_SIZE(punit_dev->range_0); ++i) + punit_dev->range_0[i] = readl(punit_dev->punit_mmio + + mmio_range[0].beg + 4 * i); + for (i = 0; i < ARRAY_SIZE(punit_dev->range_1); ++i) + punit_dev->range_1[i] = readl(punit_dev->punit_mmio + + mmio_range[1].beg + 4 * i); + + return 0; +} + +static int __maybe_unused isst_if_resume(struct device *device) +{ + struct pci_dev *pdev = to_pci_dev(device); + struct isst_if_device *punit_dev; + int i; + + punit_dev = pci_get_drvdata(pdev); + for (i = 0; i < ARRAY_SIZE(punit_dev->range_0); ++i) + writel(punit_dev->range_0[i], punit_dev->punit_mmio + + mmio_range[0].beg + 4 * i); + for (i = 0; i < ARRAY_SIZE(punit_dev->range_1); ++i) + writel(punit_dev->range_1[i], punit_dev->punit_mmio + + mmio_range[1].beg + 4 * i); + + return 0; +} + +static SIMPLE_DEV_PM_OPS(isst_if_pm_ops, isst_if_suspend, isst_if_resume); + static struct pci_driver isst_if_pci_driver = { .name = "isst_if_pci", .id_table = isst_if_ids, .probe = isst_if_probe, .remove = isst_if_remove, + .driver.pm = &isst_if_pm_ops, };
module_pci_driver(isst_if_pci_driver);
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 3fb4f7cd472c7f5905c91508e988f6b28372210d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3fb4f7cd472c7f5905c91508e988f6b28372210d upstream.
The Intel(R) Speed select technologies contains four features.
Performance profile:An non architectural mechanism that allows multiple optimized performance profiles per system via static and/or dynamic adjustment of core count, workload, Tjmax, and TDP, etc. aka ISS in the documentation.
Base Frequency: Enables users to increase guaranteed base frequency on certain cores (high priority cores) in exchange for lower base frequency on remaining cores (low priority cores). aka PBF in the documenation.
Turbo frequency: Enables the ability to set different turbo ratio limits to cores based on priority. aka FACT in the documentation.
Core power: An Interface that allows user to define per core/tile priority.
There is a multi level help for commands and options. This can be used to check required arguments for each feature and commands for the feature.
To start navigating the features start with
$sudo intel-speed-select --help
For help on a specific feature for example $sudo intel-speed-select perf-profile --help
To get help for a command for a feature for example $sudo intel-speed-select perf-profile get-lock-status --help
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Acked-by: Len Brown len.brown@intel.com Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/Makefile | 11 +- tools/power/x86/intel-speed-select/Build | 1 + tools/power/x86/intel-speed-select/Makefile | 56 + .../x86/intel-speed-select/isst-config.c | 1607 +++++++++++++++++ .../power/x86/intel-speed-select/isst-core.c | 721 ++++++++ .../x86/intel-speed-select/isst-display.c | 479 +++++ tools/power/x86/intel-speed-select/isst.h | 231 +++ 7 files changed, 3101 insertions(+), 5 deletions(-) create mode 100644 tools/power/x86/intel-speed-select/Build create mode 100644 tools/power/x86/intel-speed-select/Makefile create mode 100644 tools/power/x86/intel-speed-select/isst-config.c create mode 100644 tools/power/x86/intel-speed-select/isst-core.c create mode 100644 tools/power/x86/intel-speed-select/isst-display.c create mode 100644 tools/power/x86/intel-speed-select/isst.h
diff --git a/tools/Makefile b/tools/Makefile index be02c8b904db..a078ba8e69c8 100644 --- a/tools/Makefile +++ b/tools/Makefile @@ -17,6 +17,7 @@ help: @echo ' gpio - GPIO tools' @echo ' hv - tools used when in Hyper-V clients' @echo ' iio - IIO tools' + @echo ' intel-speed-select - Intel Speed Select tool' @echo ' kvm_stat - top-like utility for displaying kvm statistics' @echo ' leds - LEDs tools' @echo ' liblockdep - user-space wrapper for kernel locking-validator' @@ -79,7 +80,7 @@ perf: FORCE selftests: FORCE $(call descend,testing/$@)
-turbostat x86_energy_perf_policy: FORCE +turbostat x86_energy_perf_policy intel-speed-select: FORCE $(call descend,power/x86/$@)
tmon: FORCE @@ -111,7 +112,7 @@ liblockdep_install: selftests_install: $(call descend,testing/$(@:_install=),install)
-turbostat_install x86_energy_perf_policy_install: +turbostat_install x86_energy_perf_policy_install intel-speed-select_install: $(call descend,power/x86/$(@:_install=),install)
tmon_install: @@ -128,7 +129,7 @@ install: acpi_install cgroup_install cpupower_install gpio_install \ perf_install selftests_install turbostat_install usb_install \ virtio_install vm_install bpf_install x86_energy_perf_policy_install \ tmon_install freefall_install objtool_install kvm_stat_install \ - wmi_install + wmi_install intel-speed-select_install
acpi_clean: $(call descend,power/acpi,clean) @@ -158,7 +159,7 @@ perf_clean: selftests_clean: $(call descend,testing/$(@:_clean=),clean)
-turbostat_clean x86_energy_perf_policy_clean: +turbostat_clean x86_energy_perf_policy_clean intel-speed-select_clean: $(call descend,power/x86/$(@:_clean=),clean)
tmon_clean: @@ -174,6 +175,6 @@ clean: acpi_clean cgroup_clean cpupower_clean hv_clean firewire_clean \ perf_clean selftests_clean turbostat_clean spi_clean usb_clean virtio_clean \ vm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \ freefall_clean build_clean libbpf_clean libsubcmd_clean liblockdep_clean \ - gpio_clean objtool_clean leds_clean wmi_clean + gpio_clean objtool_clean leds_clean wmi_clean intel-speed-select_clean
.PHONY: FORCE diff --git a/tools/power/x86/intel-speed-select/Build b/tools/power/x86/intel-speed-select/Build new file mode 100644 index 000000000000..b61456d75190 --- /dev/null +++ b/tools/power/x86/intel-speed-select/Build @@ -0,0 +1 @@ +intel-speed-select-y += isst-config.o isst-core.o isst-display.o diff --git a/tools/power/x86/intel-speed-select/Makefile b/tools/power/x86/intel-speed-select/Makefile new file mode 100644 index 000000000000..12c6939dca2a --- /dev/null +++ b/tools/power/x86/intel-speed-select/Makefile @@ -0,0 +1,56 @@ +# SPDX-License-Identifier: GPL-2.0 +include ../../../scripts/Makefile.include + +bindir ?= /usr/bin + +ifeq ($(srctree),) +srctree := $(patsubst %/,%,$(dir $(CURDIR))) +srctree := $(patsubst %/,%,$(dir $(srctree))) +srctree := $(patsubst %/,%,$(dir $(srctree))) +srctree := $(patsubst %/,%,$(dir $(srctree))) +endif + +# Do not use make's built-in rules +# (this improves performance and avoids hard-to-debug behaviour); +MAKEFLAGS += -r + +override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include + +ALL_TARGETS := intel-speed-select +ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS)) + +all: $(ALL_PROGRAMS) + +export srctree OUTPUT CC LD CFLAGS +include $(srctree)/tools/build/Makefile.include + +# +# We need the following to be outside of kernel tree +# +$(OUTPUT)include/linux/isst_if.h: ../../../../include/uapi/linux/isst_if.h + mkdir -p $(OUTPUT)include/linux 2>&1 || true + ln -sf $(CURDIR)/../../../../include/uapi/linux/isst_if.h $@ + +prepare: $(OUTPUT)include/linux/isst_if.h + +ISST_IN := $(OUTPUT)intel-speed-select-in.o + +$(ISST_IN): prepare FORCE + $(Q)$(MAKE) $(build)=intel-speed-select +$(OUTPUT)intel-speed-select: $(ISST_IN) + $(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $< -o $@ + +clean: + rm -f $(ALL_PROGRAMS) + rm -rf $(OUTPUT)include/linux/isst_if.h + find $(if $(OUTPUT),$(OUTPUT),.) -name '*.o' -delete -o -name '.*.d' -delete + +install: $(ALL_PROGRAMS) + install -d -m 755 $(DESTDIR)$(bindir); \ + for program in $(ALL_PROGRAMS); do \ + install $$program $(DESTDIR)$(bindir); \ + done + +FORCE: + +.PHONY: all install clean FORCE prepare diff --git a/tools/power/x86/intel-speed-select/isst-config.c b/tools/power/x86/intel-speed-select/isst-config.c new file mode 100644 index 000000000000..91c5ad1685a1 --- /dev/null +++ b/tools/power/x86/intel-speed-select/isst-config.c @@ -0,0 +1,1607 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel Speed Select -- Enumerate and control features + * Copyright (c) 2019 Intel Corporation. + */ + +#include <linux/isst_if.h> + +#include "isst.h" + +struct process_cmd_struct { + char *feature; + char *command; + void (*process_fn)(void); +}; + +static const char *version_str = "v1.0"; +static const int supported_api_ver = 1; +static struct isst_if_platform_info isst_platform_info; +static char *progname; +static int debug_flag; +static FILE *outf; + +static int cpu_model; + +#define MAX_CPUS_IN_ONE_REQ 64 +static short max_target_cpus; +static unsigned short target_cpus[MAX_CPUS_IN_ONE_REQ]; + +static int topo_max_cpus; +static size_t present_cpumask_size; +static cpu_set_t *present_cpumask; +static size_t target_cpumask_size; +static cpu_set_t *target_cpumask; +static int tdp_level = 0xFF; +static int fact_bucket = 0xFF; +static int fact_avx = 0xFF; +static unsigned long long fact_trl; +static int out_format_json; +static int cmd_help; + +/* clos related */ +static int current_clos = -1; +static int clos_epp = -1; +static int clos_prop_prio = -1; +static int clos_min = -1; +static int clos_max = -1; +static int clos_desired = -1; +static int clos_priority_type; + +struct _cpu_map { + unsigned short core_id; + unsigned short pkg_id; + unsigned short die_id; + unsigned short punit_cpu; + unsigned short punit_cpu_core; +}; +struct _cpu_map *cpu_map; + +void debug_printf(const char *format, ...) +{ + va_list args; + + va_start(args, format); + + if (debug_flag) + vprintf(format, args); + + va_end(args); +} + +static void update_cpu_model(void) +{ + unsigned int ebx, ecx, edx; + unsigned int fms, family; + + __cpuid(1, fms, ebx, ecx, edx); + family = (fms >> 8) & 0xf; + cpu_model = (fms >> 4) & 0xf; + if (family == 6 || family == 0xf) + cpu_model += ((fms >> 16) & 0xf) << 4; +} + +/* Open a file, and exit on failure */ +static FILE *fopen_or_exit(const char *path, const char *mode) +{ + FILE *filep = fopen(path, mode); + + if (!filep) + err(1, "%s: open failed", path); + + return filep; +} + +/* Parse a file containing a single int */ +static int parse_int_file(int fatal, const char *fmt, ...) +{ + va_list args; + char path[PATH_MAX]; + FILE *filep; + int value; + + va_start(args, fmt); + vsnprintf(path, sizeof(path), fmt, args); + va_end(args); + if (fatal) { + filep = fopen_or_exit(path, "r"); + } else { + filep = fopen(path, "r"); + if (!filep) + return -1; + } + if (fscanf(filep, "%d", &value) != 1) + err(1, "%s: failed to parse number from file", path); + fclose(filep); + + return value; +} + +int cpufreq_sysfs_present(void) +{ + DIR *dir; + + dir = opendir("/sys/devices/system/cpu/cpu0/cpufreq"); + if (dir) { + closedir(dir); + return 1; + } + + return 0; +} + +int out_format_is_json(void) +{ + return out_format_json; +} + +int get_physical_package_id(int cpu) +{ + return parse_int_file( + 1, "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", + cpu); +} + +int get_physical_core_id(int cpu) +{ + return parse_int_file( + 1, "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu); +} + +int get_physical_die_id(int cpu) +{ + int ret; + + ret = parse_int_file(0, "/sys/devices/system/cpu/cpu%d/topology/die_id", + cpu); + if (ret < 0) + ret = 0; + + return ret; +} + +int get_topo_max_cpus(void) +{ + return topo_max_cpus; +} + +#define MAX_PACKAGE_COUNT 8 +#define MAX_DIE_PER_PACKAGE 2 +static void for_each_online_package_in_set(void (*callback)(int, void *, void *, + void *, void *), + void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int max_packages[MAX_PACKAGE_COUNT * MAX_PACKAGE_COUNT]; + int pkg_index = 0, i; + + memset(max_packages, 0xff, sizeof(max_packages)); + for (i = 0; i < topo_max_cpus; ++i) { + int j, online, pkg_id, die_id = 0, skip = 0; + + if (!CPU_ISSET_S(i, present_cpumask_size, present_cpumask)) + continue; + if (i) + online = parse_int_file( + 1, "/sys/devices/system/cpu/cpu%d/online", i); + else + online = + 1; /* online entry for CPU 0 needs some special configs */ + + die_id = get_physical_die_id(i); + if (die_id < 0) + die_id = 0; + pkg_id = get_physical_package_id(i); + /* Create an unique id for package, die combination to store */ + pkg_id = (MAX_PACKAGE_COUNT * pkg_id + die_id); + + for (j = 0; j < pkg_index; ++j) { + if (max_packages[j] == pkg_id) { + skip = 1; + break; + } + } + + if (!skip && online && callback) { + callback(i, arg1, arg2, arg3, arg4); + max_packages[pkg_index++] = pkg_id; + } + } +} + +static void for_each_online_target_cpu_in_set( + void (*callback)(int, void *, void *, void *, void *), void *arg1, + void *arg2, void *arg3, void *arg4) +{ + int i; + + for (i = 0; i < topo_max_cpus; ++i) { + int online; + + if (!CPU_ISSET_S(i, target_cpumask_size, target_cpumask)) + continue; + if (i) + online = parse_int_file( + 1, "/sys/devices/system/cpu/cpu%d/online", i); + else + online = + 1; /* online entry for CPU 0 needs some special configs */ + + if (online && callback) + callback(i, arg1, arg2, arg3, arg4); + } +} + +#define BITMASK_SIZE 32 +static void set_max_cpu_num(void) +{ + FILE *filep; + unsigned long dummy; + + topo_max_cpus = 0; + filep = fopen_or_exit( + "/sys/devices/system/cpu/cpu0/topology/thread_siblings", "r"); + while (fscanf(filep, "%lx,", &dummy) == 1) + topo_max_cpus += BITMASK_SIZE; + fclose(filep); + topo_max_cpus--; /* 0 based */ + + debug_printf("max cpus %d\n", topo_max_cpus); +} + +size_t alloc_cpu_set(cpu_set_t **cpu_set) +{ + cpu_set_t *_cpu_set; + size_t size; + + _cpu_set = CPU_ALLOC((topo_max_cpus + 1)); + if (_cpu_set == NULL) + err(3, "CPU_ALLOC"); + size = CPU_ALLOC_SIZE((topo_max_cpus + 1)); + CPU_ZERO_S(size, _cpu_set); + + *cpu_set = _cpu_set; + return size; +} + +void free_cpu_set(cpu_set_t *cpu_set) +{ + CPU_FREE(cpu_set); +} + +static int cpu_cnt[MAX_PACKAGE_COUNT][MAX_DIE_PER_PACKAGE]; +static void set_cpu_present_cpu_mask(void) +{ + size_t size; + DIR *dir; + int i; + + size = alloc_cpu_set(&present_cpumask); + present_cpumask_size = size; + for (i = 0; i < topo_max_cpus; ++i) { + char buffer[256]; + + snprintf(buffer, sizeof(buffer), + "/sys/devices/system/cpu/cpu%d", i); + dir = opendir(buffer); + if (dir) { + int pkg_id, die_id; + + CPU_SET_S(i, size, present_cpumask); + die_id = get_physical_die_id(i); + if (die_id < 0) + die_id = 0; + + pkg_id = get_physical_package_id(i); + if (pkg_id < MAX_PACKAGE_COUNT && + die_id < MAX_DIE_PER_PACKAGE) + cpu_cnt[pkg_id][die_id]++; + } + closedir(dir); + } +} + +int get_cpu_count(int pkg_id, int die_id) +{ + if (pkg_id < MAX_PACKAGE_COUNT && die_id < MAX_DIE_PER_PACKAGE) + return cpu_cnt[pkg_id][die_id] + 1; + + return 0; +} + +static void set_cpu_target_cpu_mask(void) +{ + size_t size; + int i; + + size = alloc_cpu_set(&target_cpumask); + target_cpumask_size = size; + for (i = 0; i < max_target_cpus; ++i) { + if (!CPU_ISSET_S(target_cpus[i], present_cpumask_size, + present_cpumask)) + continue; + + CPU_SET_S(target_cpus[i], size, target_cpumask); + } +} + +static void create_cpu_map(void) +{ + const char *pathname = "/dev/isst_interface"; + int i, fd = 0; + struct isst_if_cpu_maps map; + + cpu_map = malloc(sizeof(*cpu_map) * topo_max_cpus); + if (!cpu_map) + err(3, "cpumap"); + + fd = open(pathname, O_RDWR); + if (fd < 0) + err(-1, "%s open failed", pathname); + + for (i = 0; i < topo_max_cpus; ++i) { + if (!CPU_ISSET_S(i, present_cpumask_size, present_cpumask)) + continue; + + map.cmd_count = 1; + map.cpu_map[0].logical_cpu = i; + + debug_printf(" map logical_cpu:%d\n", + map.cpu_map[0].logical_cpu); + if (ioctl(fd, ISST_IF_GET_PHY_ID, &map) == -1) { + perror("ISST_IF_GET_PHY_ID"); + fprintf(outf, "Error: map logical_cpu:%d\n", + map.cpu_map[0].logical_cpu); + continue; + } + cpu_map[i].core_id = get_physical_core_id(i); + cpu_map[i].pkg_id = get_physical_package_id(i); + cpu_map[i].die_id = get_physical_die_id(i); + cpu_map[i].punit_cpu = map.cpu_map[0].physical_cpu; + cpu_map[i].punit_cpu_core = (map.cpu_map[0].physical_cpu >> + 1); // shift to get core id + + debug_printf( + "map logical_cpu:%d core: %d die:%d pkg:%d punit_cpu:%d punit_core:%d\n", + i, cpu_map[i].core_id, cpu_map[i].die_id, + cpu_map[i].pkg_id, cpu_map[i].punit_cpu, + cpu_map[i].punit_cpu_core); + } + + if (fd) + close(fd); +} + +int find_logical_cpu(int pkg_id, int die_id, int punit_core_id) +{ + int i; + + for (i = 0; i < topo_max_cpus; ++i) { + if (cpu_map[i].pkg_id == pkg_id && + cpu_map[i].die_id == die_id && + cpu_map[i].punit_cpu_core == punit_core_id) + return i; + } + + return -EINVAL; +} + +void set_cpu_mask_from_punit_coremask(int cpu, unsigned long long core_mask, + size_t core_cpumask_size, + cpu_set_t *core_cpumask, int *cpu_cnt) +{ + int i, cnt = 0; + int die_id, pkg_id; + + *cpu_cnt = 0; + die_id = get_physical_die_id(cpu); + pkg_id = get_physical_package_id(cpu); + + for (i = 0; i < 64; ++i) { + if (core_mask & BIT(i)) { + int j; + + for (j = 0; j < topo_max_cpus; ++j) { + if (cpu_map[j].pkg_id == pkg_id && + cpu_map[j].die_id == die_id && + cpu_map[j].punit_cpu_core == i) { + CPU_SET_S(j, core_cpumask_size, + core_cpumask); + ++cnt; + } + } + } + } + + *cpu_cnt = cnt; +} + +int find_phy_core_num(int logical_cpu) +{ + if (logical_cpu < topo_max_cpus) + return cpu_map[logical_cpu].punit_cpu_core; + + return -EINVAL; +} + +static int isst_send_mmio_command(unsigned int cpu, unsigned int reg, int write, + unsigned int *value) +{ + struct isst_if_io_regs io_regs; + const char *pathname = "/dev/isst_interface"; + int cmd; + int fd; + + debug_printf("mmio_cmd cpu:%d reg:%d write:%d\n", cpu, reg, write); + + fd = open(pathname, O_RDWR); + if (fd < 0) + err(-1, "%s open failed", pathname); + + io_regs.req_count = 1; + io_regs.io_reg[0].logical_cpu = cpu; + io_regs.io_reg[0].reg = reg; + cmd = ISST_IF_IO_CMD; + if (write) { + io_regs.io_reg[0].read_write = 1; + io_regs.io_reg[0].value = *value; + } else { + io_regs.io_reg[0].read_write = 0; + } + + if (ioctl(fd, cmd, &io_regs) == -1) { + perror("ISST_IF_IO_CMD"); + fprintf(outf, "Error: mmio_cmd cpu:%d reg:%x read_write:%x\n", + cpu, reg, write); + } else { + if (!write) + *value = io_regs.io_reg[0].value; + + debug_printf( + "mmio_cmd response: cpu:%d reg:%x rd_write:%x resp:%x\n", + cpu, reg, write, *value); + } + + close(fd); + + return 0; +} + +int isst_send_mbox_command(unsigned int cpu, unsigned char command, + unsigned char sub_command, unsigned int parameter, + unsigned int req_data, unsigned int *resp) +{ + const char *pathname = "/dev/isst_interface"; + int fd; + struct isst_if_mbox_cmds mbox_cmds = { 0 }; + + debug_printf( + "mbox_send: cpu:%d command:%x sub_command:%x parameter:%x req_data:%x\n", + cpu, command, sub_command, parameter, req_data); + + if (isst_platform_info.mmio_supported && command == CONFIG_CLOS) { + unsigned int value; + int write = 0; + int clos_id, core_id, ret = 0; + + debug_printf("CLOS %d\n", cpu); + + if (parameter & BIT(MBOX_CMD_WRITE_BIT)) { + value = req_data; + write = 1; + } + + switch (sub_command) { + case CLOS_PQR_ASSOC: + core_id = parameter & 0xff; + ret = isst_send_mmio_command( + cpu, PQR_ASSOC_OFFSET + core_id * 4, write, + &value); + if (!ret && !write) + *resp = value; + break; + case CLOS_PM_CLOS: + clos_id = parameter & 0x03; + ret = isst_send_mmio_command( + cpu, PM_CLOS_OFFSET + clos_id * 4, write, + &value); + if (!ret && !write) + *resp = value; + break; + case CLOS_PM_QOS_CONFIG: + ret = isst_send_mmio_command(cpu, PM_QOS_CONFIG_OFFSET, + write, &value); + if (!ret && !write) + *resp = value; + break; + case CLOS_STATUS: + break; + default: + break; + } + return ret; + } + + mbox_cmds.cmd_count = 1; + mbox_cmds.mbox_cmd[0].logical_cpu = cpu; + mbox_cmds.mbox_cmd[0].command = command; + mbox_cmds.mbox_cmd[0].sub_command = sub_command; + mbox_cmds.mbox_cmd[0].parameter = parameter; + mbox_cmds.mbox_cmd[0].req_data = req_data; + + fd = open(pathname, O_RDWR); + if (fd < 0) + err(-1, "%s open failed", pathname); + + if (ioctl(fd, ISST_IF_MBOX_COMMAND, &mbox_cmds) == -1) { + perror("ISST_IF_MBOX_COMMAND"); + fprintf(outf, + "Error: mbox_cmd cpu:%d command:%x sub_command:%x parameter:%x req_data:%x\n", + cpu, command, sub_command, parameter, req_data); + } else { + *resp = mbox_cmds.mbox_cmd[0].resp_data; + debug_printf( + "mbox_cmd response: cpu:%d command:%x sub_command:%x parameter:%x req_data:%x resp:%x\n", + cpu, command, sub_command, parameter, req_data, *resp); + } + + close(fd); + + return 0; +} + +int isst_send_msr_command(unsigned int cpu, unsigned int msr, int write, + unsigned long long *req_resp) +{ + struct isst_if_msr_cmds msr_cmds; + const char *pathname = "/dev/isst_interface"; + int fd; + + fd = open(pathname, O_RDWR); + if (fd < 0) + err(-1, "%s open failed", pathname); + + msr_cmds.cmd_count = 1; + msr_cmds.msr_cmd[0].logical_cpu = cpu; + msr_cmds.msr_cmd[0].msr = msr; + msr_cmds.msr_cmd[0].read_write = write; + if (write) + msr_cmds.msr_cmd[0].data = *req_resp; + + if (ioctl(fd, ISST_IF_MSR_COMMAND, &msr_cmds) == -1) { + perror("ISST_IF_MSR_COMMAD"); + fprintf(outf, "Error: msr_cmd cpu:%d msr:%x read_write:%d\n", + cpu, msr, write); + } else { + if (!write) + *req_resp = msr_cmds.msr_cmd[0].data; + + debug_printf( + "msr_cmd response: cpu:%d msr:%x rd_write:%x resp:%llx %llx\n", + cpu, msr, write, *req_resp, msr_cmds.msr_cmd[0].data); + } + + close(fd); + + return 0; +} + +static int isst_fill_platform_info(void) +{ + const char *pathname = "/dev/isst_interface"; + int fd; + + fd = open(pathname, O_RDWR); + if (fd < 0) + err(-1, "%s open failed", pathname); + + if (ioctl(fd, ISST_IF_GET_PLATFORM_INFO, &isst_platform_info) == -1) { + perror("ISST_IF_GET_PLATFORM_INFO"); + close(fd); + return -1; + } + + close(fd); + + return 0; +} + +static void isst_print_platform_information(void) +{ + struct isst_if_platform_info platform_info; + const char *pathname = "/dev/isst_interface"; + int fd; + + fd = open(pathname, O_RDWR); + if (fd < 0) + err(-1, "%s open failed", pathname); + + if (ioctl(fd, ISST_IF_GET_PLATFORM_INFO, &platform_info) == -1) { + perror("ISST_IF_GET_PLATFORM_INFO"); + } else { + fprintf(outf, "Platform: API version : %d\n", + platform_info.api_version); + fprintf(outf, "Platform: Driver version : %d\n", + platform_info.driver_version); + fprintf(outf, "Platform: mbox supported : %d\n", + platform_info.mbox_supported); + fprintf(outf, "Platform: mmio supported : %d\n", + platform_info.mmio_supported); + } + + close(fd); + + exit(0); +} + +static void exec_on_get_ctdp_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int (*fn_ptr)(int cpu, void *arg); + int ret; + + fn_ptr = arg1; + ret = fn_ptr(cpu, arg2); + if (ret) + perror("get_tdp_*"); + else + isst_display_result(cpu, outf, "perf-profile", (char *)arg3, + *(unsigned int *)arg4); +} + +#define _get_tdp_level(desc, suffix, object, help) \ + static void get_tdp_##object(void) \ + { \ + struct isst_pkg_ctdp ctdp; \ +\ + if (cmd_help) { \ + fprintf(stderr, \ + "Print %s [No command arguments are required]\n", \ + help); \ + exit(0); \ + } \ + isst_ctdp_display_information_start(outf); \ + if (max_target_cpus) \ + for_each_online_target_cpu_in_set( \ + exec_on_get_ctdp_cpu, isst_get_ctdp_##suffix, \ + &ctdp, desc, &ctdp.object); \ + else \ + for_each_online_package_in_set(exec_on_get_ctdp_cpu, \ + isst_get_ctdp_##suffix, \ + &ctdp, desc, \ + &ctdp.object); \ + isst_ctdp_display_information_end(outf); \ + } + +_get_tdp_level("get-config-levels", levels, levels, "TDP levels"); +_get_tdp_level("get-config-version", levels, version, "TDP version"); +_get_tdp_level("get-config-enabled", levels, enabled, "TDP enable status"); +_get_tdp_level("get-config-current_level", levels, current_level, + "Current TDP Level"); +_get_tdp_level("get-lock-status", levels, locked, "TDP lock status"); + +static void dump_isst_config_for_cpu(int cpu, void *arg1, void *arg2, + void *arg3, void *arg4) +{ + struct isst_pkg_ctdp pkg_dev; + int ret; + + memset(&pkg_dev, 0, sizeof(pkg_dev)); + ret = isst_get_process_ctdp(cpu, tdp_level, &pkg_dev); + if (ret) { + perror("isst_get_process_ctdp"); + } else { + isst_ctdp_display_information(cpu, outf, tdp_level, &pkg_dev); + isst_get_process_ctdp_complete(cpu, &pkg_dev); + } +} + +static void dump_isst_config(void) +{ + if (cmd_help) { + fprintf(stderr, + "Print Intel(R) Speed Select Technology Performance profile configuration\n"); + fprintf(stderr, + "including base frequency and turbo frequency configurations\n"); + fprintf(stderr, "Optional: -l|--level : Specify tdp level\n"); + fprintf(stderr, + "\tIf no arguments, dump information for all TDP levels\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + + if (max_target_cpus) + for_each_online_target_cpu_in_set(dump_isst_config_for_cpu, + NULL, NULL, NULL, NULL); + else + for_each_online_package_in_set(dump_isst_config_for_cpu, NULL, + NULL, NULL, NULL); + + isst_ctdp_display_information_end(outf); +} + +static void set_tdp_level_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int ret; + + ret = isst_set_tdp_level(cpu, tdp_level); + if (ret) + perror("set_tdp_level_for_cpu"); + else + isst_display_result(cpu, outf, "perf-profile", "set_tdp_level", + ret); +} + +static void set_tdp_level(void) +{ + if (cmd_help) { + fprintf(stderr, "Set Config TDP level\n"); + fprintf(stderr, + "\t Arguments: -l|--level : Specify tdp level\n"); + exit(0); + } + + if (tdp_level == 0xff) { + fprintf(outf, "Invalid command: specify tdp_level\n"); + exit(1); + } + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_tdp_level_for_cpu, NULL, + NULL, NULL, NULL); + else + for_each_online_package_in_set(set_tdp_level_for_cpu, NULL, + NULL, NULL, NULL); + isst_ctdp_display_information_end(outf); +} + +static void dump_pbf_config_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + struct isst_pbf_info pbf_info; + int ret; + + ret = isst_get_pbf_info(cpu, tdp_level, &pbf_info); + if (ret) { + perror("isst_get_pbf_info"); + } else { + isst_pbf_display_information(cpu, outf, tdp_level, &pbf_info); + isst_get_pbf_info_complete(&pbf_info); + } +} + +static void dump_pbf_config(void) +{ + if (cmd_help) { + fprintf(stderr, + "Print Intel(R) Speed Select Technology base frequency configuration for a TDP level\n"); + fprintf(stderr, + "\tArguments: -l|--level : Specify tdp level\n"); + exit(0); + } + + if (tdp_level == 0xff) { + fprintf(outf, "Invalid command: specify tdp_level\n"); + exit(1); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(dump_pbf_config_for_cpu, NULL, + NULL, NULL, NULL); + else + for_each_online_package_in_set(dump_pbf_config_for_cpu, NULL, + NULL, NULL, NULL); + isst_ctdp_display_information_end(outf); +} + +static void set_pbf_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int ret; + int status = *(int *)arg4; + + ret = isst_set_pbf_fact_status(cpu, 1, status); + if (ret) { + perror("isst_set_pbf"); + } else { + if (status) + isst_display_result(cpu, outf, "base-freq", "enable", + ret); + else + isst_display_result(cpu, outf, "base-freq", "disable", + ret); + } +} + +static void set_pbf_enable(void) +{ + int status = 1; + + if (cmd_help) { + fprintf(stderr, + "Enable Intel Speed Select Technology base frequency feature [No command arguments are required]\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_pbf_for_cpu, NULL, NULL, + NULL, &status); + else + for_each_online_package_in_set(set_pbf_for_cpu, NULL, NULL, + NULL, &status); + isst_ctdp_display_information_end(outf); +} + +static void set_pbf_disable(void) +{ + int status = 0; + + if (cmd_help) { + fprintf(stderr, + "Disable Intel Speed Select Technology base frequency feature [No command arguments are required]\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_pbf_for_cpu, NULL, NULL, + NULL, &status); + else + for_each_online_package_in_set(set_pbf_for_cpu, NULL, NULL, + NULL, &status); + isst_ctdp_display_information_end(outf); +} + +static void dump_fact_config_for_cpu(int cpu, void *arg1, void *arg2, + void *arg3, void *arg4) +{ + struct isst_fact_info fact_info; + int ret; + + ret = isst_get_fact_info(cpu, tdp_level, &fact_info); + if (ret) + perror("isst_get_fact_bucket_info"); + else + isst_fact_display_information(cpu, outf, tdp_level, fact_bucket, + fact_avx, &fact_info); +} + +static void dump_fact_config(void) +{ + if (cmd_help) { + fprintf(stderr, + "Print complete Intel Speed Select Technology turbo frequency configuration for a TDP level. Other arguments are optional.\n"); + fprintf(stderr, + "\tArguments: -l|--level : Specify tdp level\n"); + fprintf(stderr, + "\tArguments: -b|--bucket : Bucket index to dump\n"); + fprintf(stderr, + "\tArguments: -r|--trl-type : Specify trl type: sse|avx2|avx512\n"); + exit(0); + } + + if (tdp_level == 0xff) { + fprintf(outf, "Invalid command: specify tdp_level\n"); + exit(1); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(dump_fact_config_for_cpu, + NULL, NULL, NULL, NULL); + else + for_each_online_package_in_set(dump_fact_config_for_cpu, NULL, + NULL, NULL, NULL); + isst_ctdp_display_information_end(outf); +} + +static void set_fact_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int ret; + int status = *(int *)arg4; + + ret = isst_set_pbf_fact_status(cpu, 0, status); + if (ret) + perror("isst_set_fact"); + else { + if (status) { + struct isst_pkg_ctdp pkg_dev; + + ret = isst_get_ctdp_levels(cpu, &pkg_dev); + if (ret) { + isst_display_result(cpu, outf, "turbo-freq", + "enable", ret); + return; + } + ret = isst_set_trl(cpu, fact_trl); + isst_display_result(cpu, outf, "turbo-freq", "enable", + ret); + } else { + /* Since we modified TRL during Fact enable, restore it */ + isst_set_trl_from_current_tdp(cpu, fact_trl); + isst_display_result(cpu, outf, "turbo-freq", "disable", + ret); + } + } +} + +static void set_fact_enable(void) +{ + int status = 1; + + if (cmd_help) { + fprintf(stderr, + "Enable Intel Speed Select Technology Turbo frequency feature\n"); + fprintf(stderr, + "Optional: -t|--trl : Specify turbo ratio limit\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_fact_for_cpu, NULL, NULL, + NULL, &status); + else + for_each_online_package_in_set(set_fact_for_cpu, NULL, NULL, + NULL, &status); + isst_ctdp_display_information_end(outf); +} + +static void set_fact_disable(void) +{ + int status = 0; + + if (cmd_help) { + fprintf(stderr, + "Disable Intel Speed Select Technology turbo frequency feature\n"); + fprintf(stderr, + "Optional: -t|--trl : Specify turbo ratio limit\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_fact_for_cpu, NULL, NULL, + NULL, &status); + else + for_each_online_package_in_set(set_fact_for_cpu, NULL, NULL, + NULL, &status); + isst_ctdp_display_information_end(outf); +} + +static void enable_clos_qos_config(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int ret; + int status = *(int *)arg4; + + ret = isst_pm_qos_config(cpu, status, clos_priority_type); + if (ret) { + perror("isst_pm_qos_config"); + } else { + if (status) + isst_display_result(cpu, outf, "core-power", "enable", + ret); + else + isst_display_result(cpu, outf, "core-power", "disable", + ret); + } +} + +static void set_clos_enable(void) +{ + int status = 1; + + if (cmd_help) { + fprintf(stderr, "Enable core-power for a package/die\n"); + fprintf(stderr, + "\tClos Enable: Specify priority type with [--priority|-p]\n"); + fprintf(stderr, "\t\t 0: Proportional, 1: Ordered\n"); + exit(0); + } + + if (cpufreq_sysfs_present()) { + fprintf(stderr, + "cpufreq subsystem and core-power enable will interfere with each other!\n"); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(enable_clos_qos_config, NULL, + NULL, NULL, &status); + else + for_each_online_package_in_set(enable_clos_qos_config, NULL, + NULL, NULL, &status); + isst_ctdp_display_information_end(outf); +} + +static void set_clos_disable(void) +{ + int status = 0; + + if (cmd_help) { + fprintf(stderr, + "Disable core-power: [No command arguments are required]\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(enable_clos_qos_config, NULL, + NULL, NULL, &status); + else + for_each_online_package_in_set(enable_clos_qos_config, NULL, + NULL, NULL, &status); + isst_ctdp_display_information_end(outf); +} + +static void dump_clos_config_for_cpu(int cpu, void *arg1, void *arg2, + void *arg3, void *arg4) +{ + struct isst_clos_config clos_config; + int ret; + + ret = isst_pm_get_clos(cpu, current_clos, &clos_config); + if (ret) + perror("isst_pm_get_clos"); + else + isst_clos_display_information(cpu, outf, current_clos, + &clos_config); +} + +static void dump_clos_config(void) +{ + if (cmd_help) { + fprintf(stderr, + "Print Intel Speed Select Technology core power configuration\n"); + fprintf(stderr, + "\tArguments: [-c | --clos]: Specify clos id\n"); + exit(0); + } + if (current_clos < 0 || current_clos > 3) { + fprintf(stderr, "Invalid clos id\n"); + exit(0); + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(dump_clos_config_for_cpu, + NULL, NULL, NULL, NULL); + else + for_each_online_package_in_set(dump_clos_config_for_cpu, NULL, + NULL, NULL, NULL); + isst_ctdp_display_information_end(outf); +} + +static void set_clos_config_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + struct isst_clos_config clos_config; + int ret; + + clos_config.pkg_id = get_physical_package_id(cpu); + clos_config.die_id = get_physical_die_id(cpu); + + clos_config.epp = clos_epp; + clos_config.clos_prop_prio = clos_prop_prio; + clos_config.clos_min = clos_min; + clos_config.clos_max = clos_max; + clos_config.clos_desired = clos_desired; + ret = isst_set_clos(cpu, current_clos, &clos_config); + if (ret) + perror("isst_set_clos"); + else + isst_display_result(cpu, outf, "core-power", "config", ret); +} + +static void set_clos_config(void) +{ + if (cmd_help) { + fprintf(stderr, + "Set core-power configuration for one of the four clos ids\n"); + fprintf(stderr, + "\tSpecify targeted clos id with [--clos|-c]\n"); + fprintf(stderr, "\tSpecify clos EPP with [--epp|-e]\n"); + fprintf(stderr, + "\tSpecify clos Proportional Priority [--weight|-w]\n"); + fprintf(stderr, "\tSpecify clos min with [--min|-n]\n"); + fprintf(stderr, "\tSpecify clos max with [--max|-m]\n"); + fprintf(stderr, "\tSpecify clos desired with [--desired|-d]\n"); + exit(0); + } + + if (current_clos < 0 || current_clos > 3) { + fprintf(stderr, "Invalid clos id\n"); + exit(0); + } + if (clos_epp < 0 || clos_epp > 0x0F) { + fprintf(stderr, "clos epp is not specified, default: 0\n"); + clos_epp = 0; + } + if (clos_prop_prio < 0 || clos_prop_prio > 0x0F) { + fprintf(stderr, + "clos frequency weight is not specified, default: 0\n"); + clos_prop_prio = 0; + } + if (clos_min < 0) { + fprintf(stderr, "clos min is not specified, default: 0\n"); + clos_min = 0; + } + if (clos_max < 0) { + fprintf(stderr, "clos max is not specified, default: 0xff\n"); + clos_max = 0xff; + } + if (clos_desired < 0) { + fprintf(stderr, "clos desired is not specified, default: 0\n"); + clos_desired = 0x00; + } + + isst_ctdp_display_information_start(outf); + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_clos_config_for_cpu, NULL, + NULL, NULL, NULL); + else + for_each_online_package_in_set(set_clos_config_for_cpu, NULL, + NULL, NULL, NULL); + isst_ctdp_display_information_end(outf); +} + +static void set_clos_assoc_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int ret; + + ret = isst_clos_associate(cpu, current_clos); + if (ret) + perror("isst_clos_associate"); + else + isst_display_result(cpu, outf, "core-power", "assoc", ret); +} + +static void set_clos_assoc(void) +{ + if (cmd_help) { + fprintf(stderr, "Associate a clos id to a CPU\n"); + fprintf(stderr, + "\tSpecify targeted clos id with [--clos|-c]\n"); + exit(0); + } + + if (current_clos < 0 || current_clos > 3) { + fprintf(stderr, "Invalid clos id\n"); + exit(0); + } + if (max_target_cpus) + for_each_online_target_cpu_in_set(set_clos_assoc_for_cpu, NULL, + NULL, NULL, NULL); + else { + fprintf(stderr, + "Invalid target cpu. Specify with [-c|--cpu]\n"); + } +} + +static void get_clos_assoc_for_cpu(int cpu, void *arg1, void *arg2, void *arg3, + void *arg4) +{ + int clos, ret; + + ret = isst_clos_get_assoc_status(cpu, &clos); + if (ret) + perror("isst_clos_get_assoc_status"); + else + isst_display_result(cpu, outf, "core-power", "get-assoc", clos); +} + +static void get_clos_assoc(void) +{ + if (cmd_help) { + fprintf(stderr, "Get associate clos id to a CPU\n"); + fprintf(stderr, "\tSpecify targeted cpu id with [--cpu|-c]\n"); + exit(0); + } + if (max_target_cpus) + for_each_online_target_cpu_in_set(get_clos_assoc_for_cpu, NULL, + NULL, NULL, NULL); + else { + fprintf(stderr, + "Invalid target cpu. Specify with [-c|--cpu]\n"); + } +} + +static struct process_cmd_struct isst_cmds[] = { + { "perf-profile", "get-lock-status", get_tdp_locked }, + { "perf-profile", "get-config-levels", get_tdp_levels }, + { "perf-profile", "get-config-version", get_tdp_version }, + { "perf-profile", "get-config-enabled", get_tdp_enabled }, + { "perf-profile", "get-config-current-level", get_tdp_current_level }, + { "perf-profile", "set-config-level", set_tdp_level }, + { "perf-profile", "info", dump_isst_config }, + { "base-freq", "info", dump_pbf_config }, + { "base-freq", "enable", set_pbf_enable }, + { "base-freq", "disable", set_pbf_disable }, + { "turbo-freq", "info", dump_fact_config }, + { "turbo-freq", "enable", set_fact_enable }, + { "turbo-freq", "disable", set_fact_disable }, + { "core-power", "info", dump_clos_config }, + { "core-power", "enable", set_clos_enable }, + { "core-power", "disable", set_clos_disable }, + { "core-power", "config", set_clos_config }, + { "core-power", "assoc", set_clos_assoc }, + { "core-power", "get-assoc", get_clos_assoc }, + { NULL, NULL, NULL } +}; + +/* + * parse cpuset with following syntax + * 1,2,4..6,8-10 and set bits in cpu_subset + */ +void parse_cpu_command(char *optarg) +{ + unsigned int start, end; + char *next; + + next = optarg; + + while (next && *next) { + if (*next == '-') /* no negative cpu numbers */ + goto error; + + start = strtoul(next, &next, 10); + + if (max_target_cpus < MAX_CPUS_IN_ONE_REQ) + target_cpus[max_target_cpus++] = start; + + if (*next == '\0') + break; + + if (*next == ',') { + next += 1; + continue; + } + + if (*next == '-') { + next += 1; /* start range */ + } else if (*next == '.') { + next += 1; + if (*next == '.') + next += 1; /* start range */ + else + goto error; + } + + end = strtoul(next, &next, 10); + if (end <= start) + goto error; + + while (++start <= end) { + if (max_target_cpus < MAX_CPUS_IN_ONE_REQ) + target_cpus[max_target_cpus++] = start; + } + + if (*next == ',') + next += 1; + else if (*next != '\0') + goto error; + } + +#ifdef DEBUG + { + int i; + + for (i = 0; i < max_target_cpus; ++i) + printf("cpu [%d] in arg\n", target_cpus[i]); + } +#endif + return; + +error: + fprintf(stderr, ""--cpu %s" malformed\n", optarg); + exit(-1); +} + +static void parse_cmd_args(int argc, int start, char **argv) +{ + int opt; + int option_index; + + static struct option long_options[] = { + { "bucket", required_argument, 0, 'b' }, + { "level", required_argument, 0, 'l' }, + { "trl-type", required_argument, 0, 'r' }, + { "trl", required_argument, 0, 't' }, + { "help", no_argument, 0, 'h' }, + { "clos", required_argument, 0, 'c' }, + { "desired", required_argument, 0, 'd' }, + { "epp", required_argument, 0, 'e' }, + { "min", required_argument, 0, 'n' }, + { "max", required_argument, 0, 'm' }, + { "priority", required_argument, 0, 'p' }, + { "weight", required_argument, 0, 'w' }, + { 0, 0, 0, 0 } + }; + + option_index = start; + + optind = start + 1; + while ((opt = getopt_long(argc, argv, "b:l:t:c:d:e:n:m:p:w:h", + long_options, &option_index)) != -1) { + switch (opt) { + case 'b': + fact_bucket = atoi(optarg); + break; + case 'h': + cmd_help = 1; + break; + case 'l': + tdp_level = atoi(optarg); + break; + case 't': + sscanf(optarg, "0x%llx", &fact_trl); + break; + case 'r': + if (!strncmp(optarg, "sse", 3)) { + fact_avx = 0x01; + } else if (!strncmp(optarg, "avx2", 4)) { + fact_avx = 0x02; + } else if (!strncmp(optarg, "avx512", 4)) { + fact_avx = 0x04; + } else { + fprintf(outf, "Invalid sse,avx options\n"); + exit(1); + } + break; + /* CLOS related */ + case 'c': + current_clos = atoi(optarg); + printf("clos %d\n", current_clos); + break; + case 'd': + clos_desired = atoi(optarg); + break; + case 'e': + clos_epp = atoi(optarg); + break; + case 'n': + clos_min = atoi(optarg); + break; + case 'm': + clos_max = atoi(optarg); + break; + case 'p': + clos_priority_type = atoi(optarg); + break; + case 'w': + clos_prop_prio = atoi(optarg); + break; + default: + printf("no match\n"); + } + } +} + +static void isst_help(void) +{ + printf("perf-profile:\tAn architectural mechanism that allows multiple optimized \n\ + performance profiles per system via static and/or dynamic\n\ + adjustment of core count, workload, Tjmax, and\n\ + TDP, etc.\n"); + printf("\nCommands : For feature=perf-profile\n"); + printf("\tinfo\n"); + printf("\tget-lock-status\n"); + printf("\tget-config-levels\n"); + printf("\tget-config-version\n"); + printf("\tget-config-enabled\n"); + printf("\tget-config-current-level\n"); + printf("\tset-config-level\n"); +} + +static void pbf_help(void) +{ + printf("base-freq:\tEnables users to increase guaranteed base frequency\n\ + on certain cores (high priority cores) in exchange for lower\n\ + base frequency on remaining cores (low priority cores).\n"); + printf("\tcommand : info\n"); + printf("\tcommand : enable\n"); + printf("\tcommand : disable\n"); +} + +static void fact_help(void) +{ + printf("turbo-freq:\tEnables the ability to set different turbo ratio\n\ + limits to cores based on priority.\n"); + printf("\nCommand: For feature=turbo-freq\n"); + printf("\tcommand : info\n"); + printf("\tcommand : enable\n"); + printf("\tcommand : disable\n"); +} + +static void core_power_help(void) +{ + printf("core-power:\tInterface that allows user to define per core/tile\n\ + priority.\n"); + printf("\nCommands : For feature=core-power\n"); + printf("\tinfo\n"); + printf("\tenable\n"); + printf("\tdisable\n"); + printf("\tconfig\n"); + printf("\tassoc\n"); + printf("\tget-assoc\n"); +} + +struct process_cmd_help_struct { + char *feature; + void (*process_fn)(void); +}; + +static struct process_cmd_help_struct isst_help_cmds[] = { + { "perf-profile", isst_help }, + { "base-freq", pbf_help }, + { "turbo-freq", fact_help }, + { "core-power", core_power_help }, + { NULL, NULL } +}; + +void process_command(int argc, char **argv) +{ + int i = 0, matched = 0; + char *feature = argv[optind]; + char *cmd = argv[optind + 1]; + + if (!feature || !cmd) + return; + + debug_printf("feature name [%s] command [%s]\n", feature, cmd); + if (!strcmp(cmd, "-h") || !strcmp(cmd, "--help")) { + while (isst_help_cmds[i].feature) { + if (!strcmp(isst_help_cmds[i].feature, feature)) { + isst_help_cmds[i].process_fn(); + exit(0); + } + ++i; + } + } + + create_cpu_map(); + + i = 0; + while (isst_cmds[i].feature) { + if (!strcmp(isst_cmds[i].feature, feature) && + !strcmp(isst_cmds[i].command, cmd)) { + parse_cmd_args(argc, optind + 1, argv); + isst_cmds[i].process_fn(); + matched = 1; + break; + } + ++i; + } + + if (!matched) + fprintf(stderr, "Invalid command\n"); +} + +static void usage(void) +{ + printf("Intel(R) Speed Select Technology\n"); + printf("\nUsage:\n"); + printf("intel-speed-select [OPTIONS] FEATURE COMMAND COMMAND_ARGUMENTS\n"); + printf("\nUse this tool to enumerate and control the Intel Speed Select Technology features,\n"); + printf("\nFEATURE : [perf-profile|base-freq|turbo-freq|core-power]\n"); + printf("\nFor help on each feature, use --h|--help\n"); + printf("\tFor example: intel-speed-select perf-profile -h\n"); + + printf("\nFor additional help on each command for a feature, use --h|--help\n"); + printf("\tFor example: intel-speed-select perf-profile get-lock-status -h\n"); + printf("\t\t This will print help for the command "get-lock-status" for the feature "perf-profile"\n"); + + printf("\nOPTIONS\n"); + printf("\t[-c|--cpu] : logical cpu number\n"); + printf("\t\tDefault: Die scoped for all dies in the system with multiple dies/package\n"); + printf("\t\t\t Or Package scoped for all Packages when each package contains one die\n"); + printf("\t[-d|--debug] : Debug mode\n"); + printf("\t[-h|--help] : Print help\n"); + printf("\t[-i|--info] : Print platform information\n"); + printf("\t[-o|--out] : Output file\n"); + printf("\t\t\tDefault : stderr\n"); + printf("\t[-f|--format] : output format [json|text]. Default: text\n"); + printf("\t[-v|--version] : Print version\n"); + + printf("\nResult format\n"); + printf("\tResult display uses a common format for each command:\n"); + printf("\tResults are formatted in text/JSON with\n"); + printf("\t\tPackage, Die, CPU, and command specific results.\n"); + printf("\t\t\tFor Set commands, status is 0 for success and rest for failures\n"); + exit(1); +} + +static void print_version(void) +{ + fprintf(outf, "Version %s\n", version_str); + fprintf(outf, "Build date %s time %s\n", __DATE__, __TIME__); + exit(0); +} + +static void cmdline(int argc, char **argv) +{ + int opt; + int option_index = 0; + + static struct option long_options[] = { + { "cpu", required_argument, 0, 'c' }, + { "debug", no_argument, 0, 'd' }, + { "format", required_argument, 0, 'f' }, + { "help", no_argument, 0, 'h' }, + { "info", no_argument, 0, 'i' }, + { "out", required_argument, 0, 'o' }, + { "version", no_argument, 0, 'v' }, + { 0, 0, 0, 0 } + }; + + progname = argv[0]; + while ((opt = getopt_long_only(argc, argv, "+c:df:hio:v", long_options, + &option_index)) != -1) { + switch (opt) { + case 'c': + parse_cpu_command(optarg); + break; + case 'd': + debug_flag = 1; + printf("Debug Mode ON\n"); + break; + case 'f': + if (!strncmp(optarg, "json", 4)) + out_format_json = 1; + break; + case 'h': + usage(); + break; + case 'i': + isst_print_platform_information(); + break; + case 'o': + if (outf) + fclose(outf); + outf = fopen_or_exit(optarg, "w"); + break; + case 'v': + print_version(); + break; + default: + usage(); + } + } + + if (geteuid() != 0) { + fprintf(stderr, "Must run as root\n"); + exit(0); + } + + if (optind > (argc - 2)) { + fprintf(stderr, "Feature name and|or command not specified\n"); + exit(0); + } + update_cpu_model(); + printf("Intel(R) Speed Select Technology\n"); + printf("Executing on CPU model:%d[0x%x]\n", cpu_model, cpu_model); + set_max_cpu_num(); + set_cpu_present_cpu_mask(); + set_cpu_target_cpu_mask(); + isst_fill_platform_info(); + if (isst_platform_info.api_version > supported_api_ver) { + printf("Incompatible API versions; Upgrade of tool is required\n"); + exit(0); + } + + process_command(argc, argv); +} + +int main(int argc, char **argv) +{ + outf = stderr; + cmdline(argc, argv); + return 0; +} diff --git a/tools/power/x86/intel-speed-select/isst-core.c b/tools/power/x86/intel-speed-select/isst-core.c new file mode 100644 index 000000000000..8de4ac39a008 --- /dev/null +++ b/tools/power/x86/intel-speed-select/isst-core.c @@ -0,0 +1,721 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel Speed Select -- Enumerate and control features + * Copyright (c) 2019 Intel Corporation. + */ + +#include "isst.h" + +int isst_get_ctdp_levels(int cpu, struct isst_pkg_ctdp *pkg_dev) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_GET_LEVELS_INFO, 0, 0, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_GET_LEVELS_INFO resp:%x\n", cpu, resp); + + pkg_dev->version = resp & 0xff; + pkg_dev->levels = (resp >> 8) & 0xff; + pkg_dev->current_level = (resp >> 16) & 0xff; + pkg_dev->locked = !!(resp & BIT(24)); + pkg_dev->enabled = !!(resp & BIT(31)); + + return 0; +} + +int isst_get_ctdp_control(int cpu, int config_index, + struct isst_pkg_ctdp_level_info *ctdp_level) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_GET_TDP_CONTROL, 0, + config_index, &resp); + if (ret) + return ret; + + ctdp_level->fact_support = resp & BIT(0); + ctdp_level->pbf_support = !!(resp & BIT(1)); + ctdp_level->fact_enabled = !!(resp & BIT(16)); + ctdp_level->pbf_enabled = !!(resp & BIT(17)); + + debug_printf( + "cpu:%d CONFIG_TDP_GET_TDP_CONTROL resp:%x fact_support:%d pbf_support: %d fact_enabled:%d pbf_enabled:%d\n", + cpu, resp, ctdp_level->fact_support, ctdp_level->pbf_support, + ctdp_level->fact_enabled, ctdp_level->pbf_enabled); + + return 0; +} + +int isst_get_tdp_info(int cpu, int config_index, + struct isst_pkg_ctdp_level_info *ctdp_level) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, CONFIG_TDP_GET_TDP_INFO, + 0, config_index, &resp); + if (ret) + return ret; + + ctdp_level->pkg_tdp = resp & GENMASK(14, 0); + ctdp_level->tdp_ratio = (resp & GENMASK(23, 16)) >> 16; + + debug_printf( + "cpu:%d ctdp:%d CONFIG_TDP_GET_TDP_INFO resp:%x tdp_ratio:%d pkg_tdp:%d\n", + cpu, config_index, resp, ctdp_level->tdp_ratio, + ctdp_level->pkg_tdp); + return 0; +} + +int isst_get_pwr_info(int cpu, int config_index, + struct isst_pkg_ctdp_level_info *ctdp_level) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, CONFIG_TDP_GET_PWR_INFO, + 0, config_index, &resp); + if (ret) + return ret; + + ctdp_level->pkg_max_power = resp & GENMASK(14, 0); + ctdp_level->pkg_min_power = (resp & GENMASK(30, 16)) >> 16; + + debug_printf( + "cpu:%d ctdp:%d CONFIG_TDP_GET_PWR_INFO resp:%x pkg_max_power:%d pkg_min_power:%d\n", + cpu, config_index, resp, ctdp_level->pkg_max_power, + ctdp_level->pkg_min_power); + + return 0; +} + +int isst_get_tjmax_info(int cpu, int config_index, + struct isst_pkg_ctdp_level_info *ctdp_level) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, CONFIG_TDP_GET_TJMAX_INFO, + 0, config_index, &resp); + if (ret) + return ret; + + ctdp_level->t_proc_hot = resp & GENMASK(7, 0); + + debug_printf( + "cpu:%d ctdp:%d CONFIG_TDP_GET_TJMAX_INFO resp:%x t_proc_hot:%d\n", + cpu, config_index, resp, ctdp_level->t_proc_hot); + + return 0; +} + +int isst_get_coremask_info(int cpu, int config_index, + struct isst_pkg_ctdp_level_info *ctdp_level) +{ + unsigned int resp; + int i, ret; + + ctdp_level->cpu_count = 0; + for (i = 0; i < 2; ++i) { + unsigned long long mask; + int cpu_count = 0; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_GET_CORE_MASK, 0, + (i << 8) | config_index, &resp); + if (ret) + return ret; + + debug_printf( + "cpu:%d ctdp:%d mask:%d CONFIG_TDP_GET_CORE_MASK resp:%x\n", + cpu, config_index, i, resp); + + mask = (unsigned long long)resp << (32 * i); + set_cpu_mask_from_punit_coremask(cpu, mask, + ctdp_level->core_cpumask_size, + ctdp_level->core_cpumask, + &cpu_count); + ctdp_level->cpu_count += cpu_count; + debug_printf("cpu:%d ctdp:%d mask:%d cpu count:%d\n", cpu, + config_index, i, ctdp_level->cpu_count); + } + + return 0; +} + +int isst_get_get_trl(int cpu, int level, int avx_level, int *trl) +{ + unsigned int req, resp; + int ret; + + req = level | (avx_level << 16); + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_GET_TURBO_LIMIT_RATIOS, 0, req, + &resp); + if (ret) + return ret; + + debug_printf( + "cpu:%d CONFIG_TDP_GET_TURBO_LIMIT_RATIOS req:%x resp:%x\n", + cpu, req, resp); + + trl[0] = resp & GENMASK(7, 0); + trl[1] = (resp & GENMASK(15, 8)) >> 8; + trl[2] = (resp & GENMASK(23, 16)) >> 16; + trl[3] = (resp & GENMASK(31, 24)) >> 24; + + req = level | BIT(8) | (avx_level << 16); + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_GET_TURBO_LIMIT_RATIOS, 0, req, + &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_GET_TURBO_LIMIT req:%x resp:%x\n", cpu, + req, resp); + + trl[4] = resp & GENMASK(7, 0); + trl[5] = (resp & GENMASK(15, 8)) >> 8; + trl[6] = (resp & GENMASK(23, 16)) >> 16; + trl[7] = (resp & GENMASK(31, 24)) >> 24; + + return 0; +} + +int isst_set_tdp_level_msr(int cpu, int tdp_level) +{ + int ret; + + debug_printf("cpu: tdp_level via MSR %d\n", cpu, tdp_level); + + if (isst_get_config_tdp_lock_status(cpu)) { + debug_printf("cpu: tdp_locked %d\n", cpu); + return -1; + } + + if (tdp_level > 2) + return -1; /* invalid value */ + + ret = isst_send_msr_command(cpu, 0x64b, 1, + (unsigned long long *)&tdp_level); + if (ret) + return ret; + + debug_printf("cpu: tdp_level via MSR successful %d\n", cpu, tdp_level); + + return 0; +} + +int isst_set_tdp_level(int cpu, int tdp_level) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, CONFIG_TDP_SET_LEVEL, 0, + tdp_level, &resp); + if (ret) + return isst_set_tdp_level_msr(cpu, tdp_level); + + return 0; +} + +int isst_get_pbf_info(int cpu, int level, struct isst_pbf_info *pbf_info) +{ + unsigned int req, resp; + int i, ret; + + pbf_info->core_cpumask_size = alloc_cpu_set(&pbf_info->core_cpumask); + + for (i = 0; i < 2; ++i) { + unsigned long long mask; + int count; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_PBF_GET_CORE_MASK_INFO, + 0, (i << 8) | level, &resp); + if (ret) + return ret; + + debug_printf( + "cpu:%d CONFIG_TDP_PBF_GET_CORE_MASK_INFO resp:%x\n", + cpu, resp); + + mask = (unsigned long long)resp << (32 * i); + set_cpu_mask_from_punit_coremask(cpu, mask, + pbf_info->core_cpumask_size, + pbf_info->core_cpumask, + &count); + } + + req = level; + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_PBF_GET_P1HI_P1LO_INFO, 0, req, + &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_PBF_GET_P1HI_P1LO_INFO resp:%x\n", cpu, + resp); + + pbf_info->p1_low = resp & 0xff; + pbf_info->p1_high = (resp & GENMASK(15, 8)) >> 8; + + req = level; + ret = isst_send_mbox_command( + cpu, CONFIG_TDP, CONFIG_TDP_PBF_GET_TDP_INFO, 0, req, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_PBF_GET_TDP_INFO resp:%x\n", cpu, resp); + + pbf_info->tdp = resp & 0xffff; + + req = level; + ret = isst_send_mbox_command( + cpu, CONFIG_TDP, CONFIG_TDP_PBF_GET_TJ_MAX_INFO, 0, req, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_PBF_GET_TJ_MAX_INFO resp:%x\n", cpu, + resp); + pbf_info->t_control = (resp >> 8) & 0xff; + pbf_info->t_prochot = resp & 0xff; + + return 0; +} + +void isst_get_pbf_info_complete(struct isst_pbf_info *pbf_info) +{ + free_cpu_set(pbf_info->core_cpumask); +} + +int isst_set_pbf_fact_status(int cpu, int pbf, int enable) +{ + struct isst_pkg_ctdp pkg_dev; + struct isst_pkg_ctdp_level_info ctdp_level; + int current_level; + unsigned int req = 0, resp; + int ret; + + ret = isst_get_ctdp_levels(cpu, &pkg_dev); + if (ret) + return ret; + + current_level = pkg_dev.current_level; + + ret = isst_get_ctdp_control(cpu, current_level, &ctdp_level); + if (ret) + return ret; + + if (pbf) { + if (ctdp_level.fact_enabled) + req = BIT(16); + + if (enable) + req |= BIT(17); + else + req &= ~BIT(17); + } else { + if (ctdp_level.pbf_enabled) + req = BIT(17); + + if (enable) + req |= BIT(16); + else + req &= ~BIT(16); + } + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_SET_TDP_CONTROL, 0, req, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_SET_TDP_CONTROL pbf/fact:%d req:%x\n", + cpu, pbf, req); + + return 0; +} + +int isst_get_fact_bucket_info(int cpu, int level, + struct isst_fact_bucket_info *bucket_info) +{ + unsigned int resp; + int i, k, ret; + + for (i = 0; i < 2; ++i) { + int j; + + ret = isst_send_mbox_command( + cpu, CONFIG_TDP, + CONFIG_TDP_GET_FACT_HP_TURBO_LIMIT_NUMCORES, 0, + (i << 8) | level, &resp); + if (ret) + return ret; + + debug_printf( + "cpu:%d CONFIG_TDP_GET_FACT_HP_TURBO_LIMIT_NUMCORES index:%d level:%d resp:%x\n", + cpu, i, level, resp); + + for (j = 0; j < 4; ++j) { + bucket_info[j + (i * 4)].high_priority_cores_count = + (resp >> (j * 8)) & 0xff; + } + } + + for (k = 0; k < 3; ++k) { + for (i = 0; i < 2; ++i) { + int j; + + ret = isst_send_mbox_command( + cpu, CONFIG_TDP, + CONFIG_TDP_GET_FACT_HP_TURBO_LIMIT_RATIOS, 0, + (k << 16) | (i << 8) | level, &resp); + if (ret) + return ret; + + debug_printf( + "cpu:%d CONFIG_TDP_GET_FACT_HP_TURBO_LIMIT_RATIOS index:%d level:%d avx:%d resp:%x\n", + cpu, i, level, k, resp); + + for (j = 0; j < 4; ++j) { + switch (k) { + case 0: + bucket_info[j + (i * 4)].sse_trl = + (resp >> (j * 8)) & 0xff; + break; + case 1: + bucket_info[j + (i * 4)].avx_trl = + (resp >> (j * 8)) & 0xff; + break; + case 2: + bucket_info[j + (i * 4)].avx512_trl = + (resp >> (j * 8)) & 0xff; + break; + default: + break; + } + } + } + } + + return 0; +} + +int isst_get_fact_info(int cpu, int level, struct isst_fact_info *fact_info) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_TDP, + CONFIG_TDP_GET_FACT_LP_CLIPPING_RATIO, 0, + level, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CONFIG_TDP_GET_FACT_LP_CLIPPING_RATIO resp:%x\n", + cpu, resp); + + fact_info->lp_clipping_ratio_license_sse = resp & 0xff; + fact_info->lp_clipping_ratio_license_avx2 = (resp >> 8) & 0xff; + fact_info->lp_clipping_ratio_license_avx512 = (resp >> 16) & 0xff; + + ret = isst_get_fact_bucket_info(cpu, level, fact_info->bucket_info); + + return ret; +} + +int isst_set_trl(int cpu, unsigned long long trl) +{ + int ret; + + if (!trl) + trl = 0xFFFFFFFFFFFFFFFFULL; + + ret = isst_send_msr_command(cpu, 0x1AD, 1, &trl); + if (ret) + return ret; + + return 0; +} + +int isst_set_trl_from_current_tdp(int cpu, unsigned long long trl) +{ + unsigned long long msr_trl; + int ret; + + if (trl) { + msr_trl = trl; + } else { + struct isst_pkg_ctdp pkg_dev; + int trl[8]; + int i; + + ret = isst_get_ctdp_levels(cpu, &pkg_dev); + if (ret) + return ret; + + ret = isst_get_get_trl(cpu, pkg_dev.current_level, 0, trl); + if (ret) + return ret; + + msr_trl = 0; + for (i = 0; i < 8; ++i) { + unsigned long long _trl = trl[i]; + + msr_trl |= (_trl << (i * 8)); + } + } + ret = isst_send_msr_command(cpu, 0x1AD, 1, &msr_trl); + if (ret) + return ret; + + return 0; +} + +/* Return 1 if locked */ +int isst_get_config_tdp_lock_status(int cpu) +{ + unsigned long long tdp_control = 0; + int ret; + + ret = isst_send_msr_command(cpu, 0x64b, 0, &tdp_control); + if (ret) + return ret; + + ret = !!(tdp_control & BIT(31)); + + return ret; +} + +void isst_get_process_ctdp_complete(int cpu, struct isst_pkg_ctdp *pkg_dev) +{ + int i; + + if (!pkg_dev->processed) + return; + + for (i = 0; i < pkg_dev->levels; ++i) { + struct isst_pkg_ctdp_level_info *ctdp_level; + + ctdp_level = &pkg_dev->ctdp_level[i]; + if (ctdp_level->pbf_support) + free_cpu_set(ctdp_level->pbf_info.core_cpumask); + free_cpu_set(ctdp_level->core_cpumask); + } +} + +int isst_get_process_ctdp(int cpu, int tdp_level, struct isst_pkg_ctdp *pkg_dev) +{ + int i, ret; + + if (pkg_dev->processed) + return 0; + + ret = isst_get_ctdp_levels(cpu, pkg_dev); + if (ret) + return ret; + + debug_printf("cpu: %d ctdp enable:%d current level: %d levels:%d\n", + cpu, pkg_dev->enabled, pkg_dev->current_level, + pkg_dev->levels); + + for (i = 0; i <= pkg_dev->levels; ++i) { + struct isst_pkg_ctdp_level_info *ctdp_level; + + if (tdp_level != 0xff && i != tdp_level) + continue; + + debug_printf("cpu:%d Get Information for TDP level:%d\n", cpu, + i); + ctdp_level = &pkg_dev->ctdp_level[i]; + + ctdp_level->processed = 1; + ctdp_level->level = i; + ctdp_level->control_cpu = cpu; + ctdp_level->pkg_id = get_physical_package_id(cpu); + ctdp_level->die_id = get_physical_die_id(cpu); + + ret = isst_get_ctdp_control(cpu, i, ctdp_level); + if (ret) + return ret; + + ret = isst_get_tdp_info(cpu, i, ctdp_level); + if (ret) + return ret; + + ret = isst_get_pwr_info(cpu, i, ctdp_level); + if (ret) + return ret; + + ret = isst_get_tjmax_info(cpu, i, ctdp_level); + if (ret) + return ret; + + ctdp_level->core_cpumask_size = + alloc_cpu_set(&ctdp_level->core_cpumask); + ret = isst_get_coremask_info(cpu, i, ctdp_level); + if (ret) + return ret; + + ret = isst_get_get_trl(cpu, i, 0, + ctdp_level->trl_sse_active_cores); + if (ret) + return ret; + + ret = isst_get_get_trl(cpu, i, 1, + ctdp_level->trl_avx_active_cores); + if (ret) + return ret; + + ret = isst_get_get_trl(cpu, i, 2, + ctdp_level->trl_avx_512_active_cores); + if (ret) + return ret; + + if (ctdp_level->pbf_support) { + ret = isst_get_pbf_info(cpu, i, &ctdp_level->pbf_info); + if (!ret) + ctdp_level->pbf_found = 1; + } + + if (ctdp_level->fact_support) { + ret = isst_get_fact_info(cpu, i, + &ctdp_level->fact_info); + if (ret) + return ret; + } + } + + pkg_dev->processed = 1; + + return 0; +} + +int isst_pm_qos_config(int cpu, int enable_clos, int priority_type) +{ + unsigned int req, resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_CLOS, CLOS_PM_QOS_CONFIG, 0, 0, + &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CLOS_PM_QOS_CONFIG resp:%x\n", cpu, resp); + + req = resp; + + if (enable_clos) + req = req | BIT(1); + else + req = req & ~BIT(1); + + if (priority_type) + req = req | BIT(2); + else + req = req & ~BIT(2); + + ret = isst_send_mbox_command(cpu, CONFIG_CLOS, CLOS_PM_QOS_CONFIG, + BIT(MBOX_CMD_WRITE_BIT), req, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CLOS_PM_QOS_CONFIG priority type:%d req:%x\n", cpu, + priority_type, req); + + return 0; +} + +int isst_pm_get_clos(int cpu, int clos, struct isst_clos_config *clos_config) +{ + unsigned int resp; + int ret; + + ret = isst_send_mbox_command(cpu, CONFIG_CLOS, CLOS_PM_CLOS, clos, 0, + &resp); + if (ret) + return ret; + + clos_config->pkg_id = get_physical_package_id(cpu); + clos_config->die_id = get_physical_die_id(cpu); + + clos_config->epp = resp & 0x0f; + clos_config->clos_prop_prio = (resp >> 4) & 0x0f; + clos_config->clos_min = (resp >> 8) & 0xff; + clos_config->clos_max = (resp >> 16) & 0xff; + clos_config->clos_desired = (resp >> 24) & 0xff; + + return 0; +} + +int isst_set_clos(int cpu, int clos, struct isst_clos_config *clos_config) +{ + unsigned int req, resp; + unsigned int param; + int ret; + + req = clos_config->epp & 0x0f; + req |= (clos_config->clos_prop_prio & 0x0f) << 4; + req |= (clos_config->clos_min & 0xff) << 8; + req |= (clos_config->clos_max & 0xff) << 16; + req |= (clos_config->clos_desired & 0xff) << 24; + + param = BIT(MBOX_CMD_WRITE_BIT) | clos; + + ret = isst_send_mbox_command(cpu, CONFIG_CLOS, CLOS_PM_CLOS, param, req, + &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CLOS_PM_CLOS param:%x req:%x\n", cpu, param, req); + + return 0; +} + +int isst_clos_get_assoc_status(int cpu, int *clos_id) +{ + unsigned int resp; + unsigned int param; + int core_id, ret; + + core_id = find_phy_core_num(cpu); + param = core_id; + + ret = isst_send_mbox_command(cpu, CONFIG_CLOS, CLOS_PQR_ASSOC, param, 0, + &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CLOS_PQR_ASSOC param:%x resp:%x\n", cpu, param, + resp); + *clos_id = (resp >> 16) & 0x03; + + return 0; +} + +int isst_clos_associate(int cpu, int clos_id) +{ + unsigned int req, resp; + unsigned int param; + int core_id, ret; + + req = (clos_id & 0x03) << 16; + core_id = find_phy_core_num(cpu); + param = BIT(MBOX_CMD_WRITE_BIT) | core_id; + + ret = isst_send_mbox_command(cpu, CONFIG_CLOS, CLOS_PQR_ASSOC, param, + req, &resp); + if (ret) + return ret; + + debug_printf("cpu:%d CLOS_PQR_ASSOC param:%x req:%x\n", cpu, param, + req); + + return 0; +} diff --git a/tools/power/x86/intel-speed-select/isst-display.c b/tools/power/x86/intel-speed-select/isst-display.c new file mode 100644 index 000000000000..f368b8323742 --- /dev/null +++ b/tools/power/x86/intel-speed-select/isst-display.c @@ -0,0 +1,479 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel dynamic_speed_select -- Enumerate and control features + * Copyright (c) 2019 Intel Corporation. + */ + +#include "isst.h" + +#define DISP_FREQ_MULTIPLIER 100000 + +static void printcpumask(int str_len, char *str, int mask_size, + cpu_set_t *cpu_mask) +{ + int i, max_cpus = get_topo_max_cpus(); + unsigned int *mask; + int size, index, curr_index; + + size = max_cpus / (sizeof(unsigned int) * 8); + if (max_cpus % (sizeof(unsigned int) * 8)) + size++; + + mask = calloc(size, sizeof(unsigned int)); + if (!mask) + return; + + for (i = 0; i < max_cpus; ++i) { + int mask_index, bit_index; + + if (!CPU_ISSET_S(i, mask_size, cpu_mask)) + continue; + + mask_index = i / (sizeof(unsigned int) * 8); + bit_index = i % (sizeof(unsigned int) * 8); + mask[mask_index] |= BIT(bit_index); + } + + curr_index = 0; + for (i = size - 1; i >= 0; --i) { + index = snprintf(&str[curr_index], str_len - curr_index, "%08x", + mask[i]); + curr_index += index; + if (i) { + strncat(&str[curr_index], ",", str_len - curr_index); + curr_index++; + } + } + + free(mask); +} + +static void format_and_print_txt(FILE *outf, int level, char *header, + char *value) +{ + char *spaces = " "; + static char delimiters[256]; + int i, j = 0; + + if (!level) + return; + + if (level == 1) { + strcpy(delimiters, " "); + } else { + for (i = 0; i < level - 1; ++i) + j += snprintf(&delimiters[j], sizeof(delimiters) - j, + "%s", spaces); + } + + if (header && value) { + fprintf(outf, "%s", delimiters); + fprintf(outf, "%s:%s\n", header, value); + } else if (header) { + fprintf(outf, "%s", delimiters); + fprintf(outf, "%s\n", header); + } +} + +static int last_level; +static void format_and_print(FILE *outf, int level, char *header, char *value) +{ + char *spaces = " "; + static char delimiters[256]; + int i; + + if (!out_format_is_json()) { + format_and_print_txt(outf, level, header, value); + return; + } + + if (level == 0) { + if (header) + fprintf(outf, "{"); + else + fprintf(outf, "\n}\n"); + + } else { + int j = 0; + + for (i = 0; i < level; ++i) + j += snprintf(&delimiters[j], sizeof(delimiters) - j, + "%s", spaces); + + if (last_level == level) + fprintf(outf, ",\n"); + + if (value) { + if (last_level != level) + fprintf(outf, "\n"); + + fprintf(outf, "%s"%s": ", delimiters, header); + fprintf(outf, ""%s"", value); + } else { + for (i = last_level - 1; i >= level; --i) { + int k = 0; + + for (j = i; j > 0; --j) + k += snprintf(&delimiters[k], + sizeof(delimiters) - k, + "%s", spaces); + if (i == level && header) + fprintf(outf, "\n%s},", delimiters); + else + fprintf(outf, "\n%s}", delimiters); + } + if (abs(last_level - level) < 3) + fprintf(outf, "\n"); + if (header) + fprintf(outf, "%s"%s": {", delimiters, + header); + } + } + + last_level = level; +} + +static void print_packag_info(int cpu, FILE *outf) +{ + char header[256]; + + snprintf(header, sizeof(header), "package-%d", + get_physical_package_id(cpu)); + format_and_print(outf, 1, header, NULL); + snprintf(header, sizeof(header), "die-%d", get_physical_die_id(cpu)); + format_and_print(outf, 2, header, NULL); + snprintf(header, sizeof(header), "cpu-%d", cpu); + format_and_print(outf, 3, header, NULL); +} + +static void _isst_pbf_display_information(int cpu, FILE *outf, int level, + struct isst_pbf_info *pbf_info, + int disp_level) +{ + char header[256]; + char value[256]; + + snprintf(header, sizeof(header), "speed-select-base-freq"); + format_and_print(outf, disp_level, header, NULL); + + snprintf(header, sizeof(header), "high-priority-base-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + pbf_info->p1_high * DISP_FREQ_MULTIPLIER); + format_and_print(outf, disp_level + 1, header, value); + + snprintf(header, sizeof(header), "high-priority-cpu-mask"); + printcpumask(sizeof(value), value, pbf_info->core_cpumask_size, + pbf_info->core_cpumask); + format_and_print(outf, disp_level + 1, header, value); + + snprintf(header, sizeof(header), "low-priority-base-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + pbf_info->p1_low * DISP_FREQ_MULTIPLIER); + format_and_print(outf, disp_level + 1, header, value); + + snprintf(header, sizeof(header), "tjunction-temperature(C)"); + snprintf(value, sizeof(value), "%d", pbf_info->t_prochot); + format_and_print(outf, disp_level + 1, header, value); + + snprintf(header, sizeof(header), "thermal-design-power(W)"); + snprintf(value, sizeof(value), "%d", pbf_info->tdp); + format_and_print(outf, disp_level + 1, header, value); +} + +static void _isst_fact_display_information(int cpu, FILE *outf, int level, + int fact_bucket, int fact_avx, + struct isst_fact_info *fact_info, + int base_level) +{ + struct isst_fact_bucket_info *bucket_info = fact_info->bucket_info; + char header[256]; + char value[256]; + int j; + + snprintf(header, sizeof(header), "speed-select-turbo-freq"); + format_and_print(outf, base_level, header, NULL); + for (j = 0; j < ISST_FACT_MAX_BUCKETS; ++j) { + if (fact_bucket != 0xff && fact_bucket != j) + continue; + + if (!bucket_info[j].high_priority_cores_count) + break; + + snprintf(header, sizeof(header), "bucket-%d", j); + format_and_print(outf, base_level + 1, header, NULL); + + snprintf(header, sizeof(header), "high-priority-cores-count"); + snprintf(value, sizeof(value), "%d", + bucket_info[j].high_priority_cores_count); + format_and_print(outf, base_level + 2, header, value); + + if (fact_avx & 0x01) { + snprintf(header, sizeof(header), + "high-priority-max-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + bucket_info[j].sse_trl * DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 2, header, value); + } + + if (fact_avx & 0x02) { + snprintf(header, sizeof(header), + "high-priority-max-avx2-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + bucket_info[j].avx_trl * DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 2, header, value); + } + + if (fact_avx & 0x04) { + snprintf(header, sizeof(header), + "high-priority-max-avx512-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + bucket_info[j].avx512_trl * + DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 2, header, value); + } + } + snprintf(header, sizeof(header), + "speed-select-turbo-freq-clip-frequencies"); + format_and_print(outf, base_level + 1, header, NULL); + snprintf(header, sizeof(header), "low-priority-max-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + fact_info->lp_clipping_ratio_license_sse * + DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 2, header, value); + snprintf(header, sizeof(header), + "low-priority-max-avx2-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + fact_info->lp_clipping_ratio_license_avx2 * + DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 2, header, value); + snprintf(header, sizeof(header), + "low-priority-max-avx512-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + fact_info->lp_clipping_ratio_license_avx512 * + DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 2, header, value); +} + +void isst_ctdp_display_information(int cpu, FILE *outf, int tdp_level, + struct isst_pkg_ctdp *pkg_dev) +{ + char header[256]; + char value[256]; + int i, base_level = 1; + + print_packag_info(cpu, outf); + + for (i = 0; i <= pkg_dev->levels; ++i) { + struct isst_pkg_ctdp_level_info *ctdp_level; + int j; + + ctdp_level = &pkg_dev->ctdp_level[i]; + if (!ctdp_level->processed) + continue; + + snprintf(header, sizeof(header), "perf-profile-level-%d", + ctdp_level->level); + format_and_print(outf, base_level + 3, header, NULL); + + snprintf(header, sizeof(header), "cpu-count"); + j = get_cpu_count(get_physical_die_id(cpu), + get_physical_die_id(cpu)); + snprintf(value, sizeof(value), "%d", j); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), "enable-cpu-mask"); + printcpumask(sizeof(value), value, + ctdp_level->core_cpumask_size, + ctdp_level->core_cpumask); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), "thermal-design-power-ratio"); + snprintf(value, sizeof(value), "%d", ctdp_level->tdp_ratio); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), "base-frequency(KHz)"); + snprintf(value, sizeof(value), "%d", + ctdp_level->tdp_ratio * DISP_FREQ_MULTIPLIER); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), + "speed-select-turbo-freq-support"); + snprintf(value, sizeof(value), "%d", ctdp_level->fact_support); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), + "speed-select-base-freq-support"); + snprintf(value, sizeof(value), "%d", ctdp_level->pbf_support); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), + "speed-select-base-freq-enabled"); + snprintf(value, sizeof(value), "%d", ctdp_level->pbf_enabled); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), + "speed-select-turbo-freq-enabled"); + snprintf(value, sizeof(value), "%d", ctdp_level->fact_enabled); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), "thermal-design-power(W)"); + snprintf(value, sizeof(value), "%d", ctdp_level->pkg_tdp); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), "tjunction-max(C)"); + snprintf(value, sizeof(value), "%d", ctdp_level->t_proc_hot); + format_and_print(outf, base_level + 4, header, value); + + snprintf(header, sizeof(header), "turbo-ratio-limits-sse"); + format_and_print(outf, base_level + 4, header, NULL); + for (j = 0; j < 8; ++j) { + snprintf(header, sizeof(header), "bucket-%d", j); + format_and_print(outf, base_level + 5, header, NULL); + + snprintf(header, sizeof(header), "core-count"); + snprintf(value, sizeof(value), "%d", j); + format_and_print(outf, base_level + 6, header, value); + + snprintf(header, sizeof(header), "turbo-ratio"); + snprintf(value, sizeof(value), "%d", + ctdp_level->trl_sse_active_cores[j]); + format_and_print(outf, base_level + 6, header, value); + } + snprintf(header, sizeof(header), "turbo-ratio-limits-avx"); + format_and_print(outf, base_level + 4, header, NULL); + for (j = 0; j < 8; ++j) { + snprintf(header, sizeof(header), "bucket-%d", j); + format_and_print(outf, base_level + 5, header, NULL); + + snprintf(header, sizeof(header), "core-count"); + snprintf(value, sizeof(value), "%d", j); + format_and_print(outf, base_level + 6, header, value); + + snprintf(header, sizeof(header), "turbo-ratio"); + snprintf(value, sizeof(value), "%d", + ctdp_level->trl_avx_active_cores[j]); + format_and_print(outf, base_level + 6, header, value); + } + + snprintf(header, sizeof(header), "turbo-ratio-limits-avx512"); + format_and_print(outf, base_level + 4, header, NULL); + for (j = 0; j < 8; ++j) { + snprintf(header, sizeof(header), "bucket-%d", j); + format_and_print(outf, base_level + 5, header, NULL); + + snprintf(header, sizeof(header), "core-count"); + snprintf(value, sizeof(value), "%d", j); + format_and_print(outf, base_level + 6, header, value); + + snprintf(header, sizeof(header), "turbo-ratio"); + snprintf(value, sizeof(value), "%d", + ctdp_level->trl_avx_512_active_cores[j]); + format_and_print(outf, base_level + 6, header, value); + } + if (ctdp_level->pbf_support) + _isst_pbf_display_information(cpu, outf, i, + &ctdp_level->pbf_info, + base_level + 4); + if (ctdp_level->fact_support) + _isst_fact_display_information(cpu, outf, i, 0xff, 0xff, + &ctdp_level->fact_info, + base_level + 4); + } + + format_and_print(outf, 1, NULL, NULL); +} + +void isst_ctdp_display_information_start(FILE *outf) +{ + last_level = 0; + format_and_print(outf, 0, "start", NULL); +} + +void isst_ctdp_display_information_end(FILE *outf) +{ + format_and_print(outf, 0, NULL, NULL); +} + +void isst_pbf_display_information(int cpu, FILE *outf, int level, + struct isst_pbf_info *pbf_info) +{ + print_packag_info(cpu, outf); + _isst_pbf_display_information(cpu, outf, level, pbf_info, 4); + format_and_print(outf, 1, NULL, NULL); +} + +void isst_fact_display_information(int cpu, FILE *outf, int level, + int fact_bucket, int fact_avx, + struct isst_fact_info *fact_info) +{ + print_packag_info(cpu, outf); + _isst_fact_display_information(cpu, outf, level, fact_bucket, fact_avx, + fact_info, 4); + format_and_print(outf, 1, NULL, NULL); +} + +void isst_clos_display_information(int cpu, FILE *outf, int clos, + struct isst_clos_config *clos_config) +{ + char header[256]; + char value[256]; + + snprintf(header, sizeof(header), "package-%d", + get_physical_package_id(cpu)); + format_and_print(outf, 1, header, NULL); + snprintf(header, sizeof(header), "die-%d", get_physical_die_id(cpu)); + format_and_print(outf, 2, header, NULL); + snprintf(header, sizeof(header), "cpu-%d", cpu); + format_and_print(outf, 3, header, NULL); + + snprintf(header, sizeof(header), "core-power"); + format_and_print(outf, 4, header, NULL); + + snprintf(header, sizeof(header), "clos"); + snprintf(value, sizeof(value), "%d", clos); + format_and_print(outf, 5, header, value); + + snprintf(header, sizeof(header), "epp"); + snprintf(value, sizeof(value), "%d", clos_config->epp); + format_and_print(outf, 5, header, value); + + snprintf(header, sizeof(header), "clos-proportional-priority"); + snprintf(value, sizeof(value), "%d", clos_config->clos_prop_prio); + format_and_print(outf, 5, header, value); + + snprintf(header, sizeof(header), "clos-min"); + snprintf(value, sizeof(value), "%d", clos_config->clos_min); + format_and_print(outf, 5, header, value); + + snprintf(header, sizeof(header), "clos-max"); + snprintf(value, sizeof(value), "%d", clos_config->clos_max); + format_and_print(outf, 5, header, value); + + snprintf(header, sizeof(header), "clos-desired"); + snprintf(value, sizeof(value), "%d", clos_config->clos_desired); + format_and_print(outf, 5, header, value); + + format_and_print(outf, 1, NULL, NULL); +} + +void isst_display_result(int cpu, FILE *outf, char *feature, char *cmd, + int result) +{ + char header[256]; + char value[256]; + + snprintf(header, sizeof(header), "package-%d", + get_physical_package_id(cpu)); + format_and_print(outf, 1, header, NULL); + snprintf(header, sizeof(header), "die-%d", get_physical_die_id(cpu)); + format_and_print(outf, 2, header, NULL); + snprintf(header, sizeof(header), "cpu-%d", cpu); + format_and_print(outf, 3, header, NULL); + snprintf(header, sizeof(header), "%s", feature); + format_and_print(outf, 4, header, NULL); + snprintf(header, sizeof(header), "%s", cmd); + snprintf(value, sizeof(value), "%d", result); + format_and_print(outf, 5, header, value); + + format_and_print(outf, 1, NULL, NULL); +} diff --git a/tools/power/x86/intel-speed-select/isst.h b/tools/power/x86/intel-speed-select/isst.h new file mode 100644 index 000000000000..221881761609 --- /dev/null +++ b/tools/power/x86/intel-speed-select/isst.h @@ -0,0 +1,231 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Intel Speed Select -- Enumerate and control features + * Copyright (c) 2019 Intel Corporation. + */ + +#ifndef _ISST_H_ +#define _ISST_H_ + +#include <stdio.h> +#include <unistd.h> +#include <sys/types.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/resource.h> +#include <getopt.h> +#include <err.h> +#include <fcntl.h> +#include <signal.h> +#include <sys/time.h> +#include <limits.h> +#include <stdlib.h> +#include <string.h> +#include <cpuid.h> +#include <dirent.h> +#include <errno.h> + +#include <stdarg.h> +#include <sys/ioctl.h> + +#define BIT(x) (1 << (x)) +#define GENMASK(h, l) (((~0UL) << (l)) & (~0UL >> (sizeof(long) * 8 - 1 - (h)))) +#define GENMASK_ULL(h, l) \ + (((~0ULL) << (l)) & (~0ULL >> (sizeof(long long) * 8 - 1 - (h)))) + +#define CONFIG_TDP 0x7f +#define CONFIG_TDP_GET_LEVELS_INFO 0x00 +#define CONFIG_TDP_GET_TDP_CONTROL 0x01 +#define CONFIG_TDP_SET_TDP_CONTROL 0x02 +#define CONFIG_TDP_GET_TDP_INFO 0x03 +#define CONFIG_TDP_GET_PWR_INFO 0x04 +#define CONFIG_TDP_GET_TJMAX_INFO 0x05 +#define CONFIG_TDP_GET_CORE_MASK 0x06 +#define CONFIG_TDP_GET_TURBO_LIMIT_RATIOS 0x07 +#define CONFIG_TDP_SET_LEVEL 0x08 +#define CONFIG_TDP_GET_UNCORE_P0_P1_INFO 0X09 +#define CONFIG_TDP_GET_P1_INFO 0x0a +#define CONFIG_TDP_GET_MEM_FREQ 0x0b + +#define CONFIG_TDP_GET_FACT_HP_TURBO_LIMIT_NUMCORES 0x10 +#define CONFIG_TDP_GET_FACT_HP_TURBO_LIMIT_RATIOS 0x11 +#define CONFIG_TDP_GET_FACT_LP_CLIPPING_RATIO 0x12 + +#define CONFIG_TDP_PBF_GET_CORE_MASK_INFO 0x20 +#define CONFIG_TDP_PBF_GET_P1HI_P1LO_INFO 0x21 +#define CONFIG_TDP_PBF_GET_TJ_MAX_INFO 0x22 +#define CONFIG_TDP_PBF_GET_TDP_INFO 0X23 + +#define CONFIG_CLOS 0xd0 +#define CLOS_PQR_ASSOC 0x00 +#define CLOS_PM_CLOS 0x01 +#define CLOS_PM_QOS_CONFIG 0x02 +#define CLOS_STATUS 0x03 + +#define MBOX_CMD_WRITE_BIT 0x08 + +#define PM_QOS_INFO_OFFSET 0x00 +#define PM_QOS_CONFIG_OFFSET 0x04 +#define PM_CLOS_OFFSET 0x08 +#define PQR_ASSOC_OFFSET 0x20 + +struct isst_clos_config { + int pkg_id; + int die_id; + unsigned char epp; + unsigned char clos_prop_prio; + unsigned char clos_min; + unsigned char clos_max; + unsigned char clos_desired; +}; + +struct isst_fact_bucket_info { + int high_priority_cores_count; + int sse_trl; + int avx_trl; + int avx512_trl; +}; + +struct isst_pbf_info { + int pbf_acticated; + int pbf_available; + size_t core_cpumask_size; + cpu_set_t *core_cpumask; + int p1_high; + int p1_low; + int t_control; + int t_prochot; + int tdp; +}; + +#define ISST_TRL_MAX_ACTIVE_CORES 8 +#define ISST_FACT_MAX_BUCKETS 8 +struct isst_fact_info { + int lp_clipping_ratio_license_sse; + int lp_clipping_ratio_license_avx2; + int lp_clipping_ratio_license_avx512; + struct isst_fact_bucket_info bucket_info[ISST_FACT_MAX_BUCKETS]; +}; + +struct isst_pkg_ctdp_level_info { + int processed; + int control_cpu; + int pkg_id; + int die_id; + int level; + int fact_support; + int pbf_support; + int fact_enabled; + int pbf_enabled; + int tdp_ratio; + int active; + int tdp_control; + int pkg_tdp; + int pkg_min_power; + int pkg_max_power; + int fact; + int t_proc_hot; + int uncore_p0; + int uncore_p1; + int sse_p1; + int avx2_p1; + int avx512_p1; + int mem_freq; + size_t core_cpumask_size; + cpu_set_t *core_cpumask; + int cpu_count; + int trl_sse_active_cores[ISST_TRL_MAX_ACTIVE_CORES]; + int trl_avx_active_cores[ISST_TRL_MAX_ACTIVE_CORES]; + int trl_avx_512_active_cores[ISST_TRL_MAX_ACTIVE_CORES]; + int kobj_bucket_index; + int active_bucket; + int fact_max_index; + int fact_max_config; + int pbf_found; + int pbf_active; + struct isst_pbf_info pbf_info; + struct isst_fact_info fact_info; +}; + +#define ISST_MAX_TDP_LEVELS (4 + 1) /* +1 for base config */ +struct isst_pkg_ctdp { + int locked; + int version; + int processed; + int levels; + int current_level; + int enabled; + struct isst_pkg_ctdp_level_info ctdp_level[ISST_MAX_TDP_LEVELS]; +}; + +extern int get_topo_max_cpus(void); +extern int get_cpu_count(int pkg_id, int die_id); + +/* Common interfaces */ +extern void debug_printf(const char *format, ...); +extern int out_format_is_json(void); +extern int get_physical_package_id(int cpu); +extern int get_physical_die_id(int cpu); +extern size_t alloc_cpu_set(cpu_set_t **cpu_set); +extern void free_cpu_set(cpu_set_t *cpu_set); +extern int find_logical_cpu(int pkg_id, int die_id, int phy_cpu); +extern int find_phy_cpu_num(int logical_cpu); +extern int find_phy_core_num(int logical_cpu); +extern void set_cpu_mask_from_punit_coremask(int cpu, + unsigned long long core_mask, + size_t core_cpumask_size, + cpu_set_t *core_cpumask, + int *cpu_cnt); + +extern int isst_send_mbox_command(unsigned int cpu, unsigned char command, + unsigned char sub_command, + unsigned int write, + unsigned int req_data, unsigned int *resp); + +extern int isst_send_msr_command(unsigned int cpu, unsigned int command, + int write, unsigned long long *req_resp); + +extern int isst_get_ctdp_levels(int cpu, struct isst_pkg_ctdp *pkg_dev); +extern int isst_get_process_ctdp(int cpu, int tdp_level, + struct isst_pkg_ctdp *pkg_dev); +extern void isst_get_process_ctdp_complete(int cpu, + struct isst_pkg_ctdp *pkg_dev); +extern void isst_ctdp_display_information(int cpu, FILE *outf, int tdp_level, + struct isst_pkg_ctdp *pkg_dev); +extern void isst_ctdp_display_information_start(FILE *outf); +extern void isst_ctdp_display_information_end(FILE *outf); +extern void isst_pbf_display_information(int cpu, FILE *outf, int level, + struct isst_pbf_info *info); +extern int isst_set_tdp_level(int cpu, int tdp_level); +extern int isst_set_tdp_level_msr(int cpu, int tdp_level); +extern int isst_set_pbf_fact_status(int cpu, int pbf, int enable); +extern int isst_get_pbf_info(int cpu, int level, + struct isst_pbf_info *pbf_info); +extern void isst_get_pbf_info_complete(struct isst_pbf_info *pbf_info); +extern int isst_get_fact_info(int cpu, int level, + struct isst_fact_info *fact_info); +extern int isst_get_fact_bucket_info(int cpu, int level, + struct isst_fact_bucket_info *bucket_info); +extern void isst_fact_display_information(int cpu, FILE *outf, int level, + int fact_bucket, int fact_avx, + struct isst_fact_info *fact_info); +extern int isst_set_trl(int cpu, unsigned long long trl); +extern int isst_set_trl_from_current_tdp(int cpu, unsigned long long trl); +extern int isst_get_config_tdp_lock_status(int cpu); + +extern int isst_pm_qos_config(int cpu, int enable_clos, int priority_type); +extern int isst_pm_get_clos(int cpu, int clos, + struct isst_clos_config *clos_config); +extern int isst_set_clos(int cpu, int clos, + struct isst_clos_config *clos_config); +extern int isst_clos_associate(int cpu, int clos); +extern int isst_clos_get_assoc_status(int cpu, int *clos_id); +extern void isst_clos_display_information(int cpu, FILE *outf, int clos, + struct isst_clos_config *clos_config); + +extern int isst_read_reg(unsigned short reg, unsigned int *val); +extern int isst_write_reg(int reg, unsigned int val); + +extern void isst_display_result(int cpu, FILE *outf, char *feature, char *cmd, + int result); +#endif
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit bc94638886ab21f8247d3f7f39573d3feb7d8284 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit bc94638886ab21f8247d3f7f39573d3feb7d8284 upstream
The intel_idle driver will be modified to use ACPI _CST subsequently and it will need to notify the platform firmware of that if acpi_gbl_FADT.cst_control is set, so add a routine for this purpose, acpi_processor_claim_cst_control(), to acpi_processor.c (so that it is always present which is required by intel_idle) and export it to allow the ACPI processor driver (which is modular) to call it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/acpi_processor.c | 25 +++++++++++++++++++++++++ drivers/acpi/processor_idle.c | 12 ++++-------- include/linux/acpi.h | 6 ++++++ 3 files changed, 35 insertions(+), 8 deletions(-)
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index a448cdf56718..6de4e6b9f8b7 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -708,3 +708,28 @@ void __init acpi_processor_init(void) acpi_scan_add_handler_with_hotplug(&processor_handler, "processor"); acpi_scan_add_handler(&processor_container_handler); } + +#ifdef CONFIG_ACPI_PROCESSOR_CSTATE +/** + * acpi_processor_claim_cst_control - Request _CST control from the platform. + */ +bool acpi_processor_claim_cst_control(void) +{ + static bool cst_control_claimed; + acpi_status status; + + if (!acpi_gbl_FADT.cst_control || cst_control_claimed) + return true; + + status = acpi_os_write_port(acpi_gbl_FADT.smi_command, + acpi_gbl_FADT.cst_control, 8); + if (ACPI_FAILURE(status)) { + pr_warn("ACPI: Failed to claim processor _CST control\n"); + return false; + } + + cst_control_claimed = true; + return true; +} +EXPORT_SYMBOL_GPL(acpi_processor_claim_cst_control); +#endif /* CONFIG_ACPI_PROCESSOR_CSTATE */ diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 5c46435f81b5..3e398ada8641 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -946,7 +946,6 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
static inline void acpi_processor_cstate_first_run_checks(void) { - acpi_status status; static int first_run;
if (first_run) @@ -958,13 +957,10 @@ static inline void acpi_processor_cstate_first_run_checks(void) max_cstate); first_run++;
- if (acpi_gbl_FADT.cst_control && !nocst) { - status = acpi_os_write_port(acpi_gbl_FADT.smi_command, - acpi_gbl_FADT.cst_control, 8); - if (ACPI_FAILURE(status)) - ACPI_EXCEPTION((AE_INFO, status, - "Notifying BIOS of _CST ability failed")); - } + if (nocst) + return; + + acpi_processor_claim_cst_control(); } #else
diff --git a/include/linux/acpi.h b/include/linux/acpi.h index c34ba2a1c2e6..d51bac1277eb 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -295,6 +295,12 @@ static inline bool invalid_phys_cpuid(phys_cpuid_t phys_id)
/* Validate the processor object's proc_id */ bool acpi_duplicate_processor_id(int proc_id); +/* Processor _CTS control */ +#ifdef CONFIG_ACPI_PROCESSOR_CSTATE +bool acpi_processor_claim_cst_control(void); +#else +static inline bool acpi_processor_claim_cst_control(void) { return false; } +#endif
#ifdef CONFIG_ACPI_HOTPLUG_CPU /* Arch dependent functions for cpu hotplug support */
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 987c785319b99e32602f7f86cfae3cf9b81e402b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 987c785319b99e32602f7f86cfae3cf9b81e402b upstream
In order to separate the ACPI _CST evaluation from checks specific to the ACPI processor driver, move the majority of the acpi_processor_get_power_info_cst() function body to a new function, acpi_processor_evaluate_cst(), that will extract the C-states information from _CST output, and redefine acpi_processor_get_power_info_cst() as a wrapper around it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/processor_idle.c | 52 +++++++++++++++++++++-------------- 1 file changed, 32 insertions(+), 20 deletions(-)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 3e398ada8641..2bcc53281164 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -304,21 +304,17 @@ static int acpi_processor_get_power_info_default(struct acpi_processor *pr) return 0; }
-static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) +static int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, + struct acpi_processor_power *info) { + struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL }; + union acpi_object *cst; acpi_status status; u64 count; - int current_count; + int current_count = 0; int i, ret = 0; - struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL }; - union acpi_object *cst; - - if (nocst) - return -ENODEV;
- current_count = 0; - - status = acpi_evaluate_object(pr->handle, "_CST", NULL, &buffer); + status = acpi_evaluate_object(handle, "_CST", NULL, &buffer); if (ACPI_FAILURE(status)) { ACPI_DEBUG_PRINT((ACPI_DB_INFO, "No _CST, giving up\n")); return -ENODEV; @@ -342,9 +338,6 @@ static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) goto end; }
- /* Tell driver that at least _CST is supported. */ - pr->flags.has_cst = 1; - for (i = 1; i <= count; i++) { union acpi_object *element; union acpi_object *obj; @@ -390,7 +383,7 @@ static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) cx.entry_method = ACPI_CSTATE_SYSTEMIO; if (reg->space_id == ACPI_ADR_SPACE_FIXED_HARDWARE) { if (acpi_processor_ffh_cstate_probe - (pr->id, &cx, reg) == 0) { + (cpu, &cx, reg) == 0) { cx.entry_method = ACPI_CSTATE_FFH; } else if (cx.type == ACPI_STATE_C1) { /* @@ -439,7 +432,7 @@ static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) continue;
current_count++; - memcpy(&(pr->power.states[current_count]), &cx, sizeof(cx)); + memcpy(&info->states[current_count], &cx, sizeof(cx));
/* * We support total ACPI_PROCESSOR_MAX_POWER - 1 @@ -453,12 +446,9 @@ static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) } }
- ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Found %d power states\n", - current_count)); + acpi_handle_info(handle, "Found %d idle states\n", current_count);
- /* Validate number of power states discovered */ - if (current_count < 2) - ret = -EFAULT; + info->count = current_count;
end: kfree(buffer.pointer); @@ -466,6 +456,28 @@ static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) return ret; }
+static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) +{ + int ret; + + if (nocst) + return -ENODEV; + + ret = acpi_processor_evaluate_cst(pr->handle, pr->id, &pr->power); + if (ret) + return ret; + + /* + * It is expected that there will be at least 2 states, C1 and + * something else (C2 or C3), so fail if that is not the case. + */ + if (pr->power.count < 2) + return -EFAULT; + + pr->flags.has_cst = 1; + return 0; +} + static void acpi_processor_power_verify_c3(struct acpi_processor *pr, struct acpi_processor_cx *cx) {
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit aa659a3fca79abd9a3ca3ddc535db631b01c9a02 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit aa659a3fca79abd9a3ca3ddc535db631b01c9a02 upstream
Clean up acpi_processor_evaluate_cst() in multiple ways:
* Rename current_count to last_index which matches the purpose of the variable better.
* Consistently use acpi_handle_*() for printing messages and make the messages cleaner.
* Drop redundant parens and braces.
* Rewrite and clarify comments.
* Rearrange checks and drop the redundant ones.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/processor_idle.c | 114 ++++++++++++++++------------------ 1 file changed, 52 insertions(+), 62 deletions(-)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 2bcc53281164..5641d5c8eede 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -311,29 +311,29 @@ static int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, union acpi_object *cst; acpi_status status; u64 count; - int current_count = 0; + int last_index = 0; int i, ret = 0;
status = acpi_evaluate_object(handle, "_CST", NULL, &buffer); if (ACPI_FAILURE(status)) { - ACPI_DEBUG_PRINT((ACPI_DB_INFO, "No _CST, giving up\n")); + acpi_handle_debug(handle, "No _CST\n"); return -ENODEV; }
cst = buffer.pointer;
- /* There must be at least 2 elements */ - if (!cst || (cst->type != ACPI_TYPE_PACKAGE) || cst->package.count < 2) { - pr_err("not enough elements in _CST\n"); + /* There must be at least 2 elements. */ + if (!cst || cst->type != ACPI_TYPE_PACKAGE || cst->package.count < 2) { + acpi_handle_warn(handle, "Invalid _CST output\n"); ret = -EFAULT; goto end; }
count = cst->package.elements[0].integer.value;
- /* Validate number of power states. */ + /* Validate the number of C-states. */ if (count < 1 || count != cst->package.count - 1) { - pr_err("count given by _CST is not valid\n"); + acpi_handle_warn(handle, "Inconsistent _CST data\n"); ret = -EFAULT; goto end; } @@ -344,111 +344,101 @@ static int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, struct acpi_power_register *reg; struct acpi_processor_cx cx;
+ /* + * If there is not enough space for all C-states, skip the + * excess ones and log a warning. + */ + if (last_index >= ACPI_PROCESSOR_MAX_POWER - 1) { + acpi_handle_warn(handle, + "No room for more idle states (limit: %d)\n", + ACPI_PROCESSOR_MAX_POWER - 1); + break; + } + memset(&cx, 0, sizeof(cx));
- element = &(cst->package.elements[i]); + element = &cst->package.elements[i]; if (element->type != ACPI_TYPE_PACKAGE) continue;
if (element->package.count != 4) continue;
- obj = &(element->package.elements[0]); + obj = &element->package.elements[0];
if (obj->type != ACPI_TYPE_BUFFER) continue;
reg = (struct acpi_power_register *)obj->buffer.pointer;
- if (reg->space_id != ACPI_ADR_SPACE_SYSTEM_IO && - (reg->space_id != ACPI_ADR_SPACE_FIXED_HARDWARE)) - continue; - - /* There should be an easy way to extract an integer... */ - obj = &(element->package.elements[1]); + obj = &element->package.elements[1]; if (obj->type != ACPI_TYPE_INTEGER) continue;
cx.type = obj->integer.value; /* - * Some buggy BIOSes won't list C1 in _CST - - * Let acpi_processor_get_power_info_default() handle them later + * There are known cases in which the _CST output does not + * contain C1, so if the type of the first state found is not + * C1, leave an empty slot for C1 to be filled in later. */ if (i == 1 && cx.type != ACPI_STATE_C1) - current_count++; + last_index = 1;
cx.address = reg->address; - cx.index = current_count + 1; + cx.index = last_index + 1;
- cx.entry_method = ACPI_CSTATE_SYSTEMIO; if (reg->space_id == ACPI_ADR_SPACE_FIXED_HARDWARE) { - if (acpi_processor_ffh_cstate_probe - (cpu, &cx, reg) == 0) { - cx.entry_method = ACPI_CSTATE_FFH; + if (!acpi_processor_ffh_cstate_probe(cpu, &cx, reg)) { + /* + * In the majority of cases _CST describes C1 as + * a FIXED_HARDWARE C-state, but if the command + * line forbids using MWAIT, use CSTATE_HALT for + * C1 regardless. + */ + if (cx.type == ACPI_STATE_C1 && + boot_option_idle_override == IDLE_NOMWAIT) { + cx.entry_method = ACPI_CSTATE_HALT; + snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); + } else { + cx.entry_method = ACPI_CSTATE_FFH; + } } else if (cx.type == ACPI_STATE_C1) { /* - * C1 is a special case where FIXED_HARDWARE - * can be handled in non-MWAIT way as well. - * In that case, save this _CST entry info. - * Otherwise, ignore this info and continue. + * In the special case of C1, FIXED_HARDWARE can + * be handled by executing the HLT instruction. */ cx.entry_method = ACPI_CSTATE_HALT; snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); } else { continue; } - if (cx.type == ACPI_STATE_C1 && - (boot_option_idle_override == IDLE_NOMWAIT)) { - /* - * In most cases the C1 space_id obtained from - * _CST object is FIXED_HARDWARE access mode. - * But when the option of idle=halt is added, - * the entry_method type should be changed from - * CSTATE_FFH to CSTATE_HALT. - * When the option of idle=nomwait is added, - * the C1 entry_method type should be - * CSTATE_HALT. - */ - cx.entry_method = ACPI_CSTATE_HALT; - snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); - } - } else { + } else if (reg->space_id == ACPI_ADR_SPACE_SYSTEM_IO) { + cx.entry_method = ACPI_CSTATE_SYSTEMIO; snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI IOPORT 0x%x", cx.address); + } else { + continue; }
- if (cx.type == ACPI_STATE_C1) { + if (cx.type == ACPI_STATE_C1) cx.valid = 1; - }
- obj = &(element->package.elements[2]); + obj = &element->package.elements[2]; if (obj->type != ACPI_TYPE_INTEGER) continue;
cx.latency = obj->integer.value;
- obj = &(element->package.elements[3]); + obj = &element->package.elements[3]; if (obj->type != ACPI_TYPE_INTEGER) continue;
- current_count++; - memcpy(&info->states[current_count], &cx, sizeof(cx)); - - /* - * We support total ACPI_PROCESSOR_MAX_POWER - 1 - * (From 1 through ACPI_PROCESSOR_MAX_POWER - 1) - */ - if (current_count >= (ACPI_PROCESSOR_MAX_POWER - 1)) { - pr_warn("Limiting number of power states to max (%d)\n", - ACPI_PROCESSOR_MAX_POWER); - pr_warn("Please increase ACPI_PROCESSOR_MAX_POWER if needed.\n"); - break; - } + memcpy(&info->states[++last_index], &cx, sizeof(cx)); }
- acpi_handle_info(handle, "Found %d idle states\n", current_count); + acpi_handle_info(handle, "Found %d idle states\n", last_index);
- info->count = current_count; + info->count = last_index;
end: kfree(buffer.pointer);
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 77fb4e0a559a960eb36d0b2c50c781c5492577eb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 77fb4e0a559a960eb36d0b2c50c781c5492577eb upsteam
The intel_idle driver will be modified to use ACPI _CST subsequently and it will need to call acpi_processor_evaluate_cst(), so move that function to acpi_processor.c so that it is always present (which is required by intel_idle) and export it to modules to allow the ACPI processor driver (which is modular) to call it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/acpi_processor.c | 157 ++++++++++++++++++++++++++++++++++ drivers/acpi/processor_idle.c | 142 ------------------------------ include/linux/acpi.h | 9 ++ 3 files changed, 166 insertions(+), 142 deletions(-)
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 6de4e6b9f8b7..cfdf7cf6d8f1 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -732,4 +732,161 @@ bool acpi_processor_claim_cst_control(void) return true; } EXPORT_SYMBOL_GPL(acpi_processor_claim_cst_control); + +/** + * acpi_processor_evaluate_cst - Evaluate the processor _CST control method. + * @handle: ACPI handle of the processor object containing the _CST. + * @cpu: The numeric ID of the target CPU. + * @info: Object write the C-states information into. + * + * Extract the C-state information for the given CPU from the output of the _CST + * control method under the corresponding ACPI processor object (or processor + * device object) and populate @info with it. + * + * If any ACPI_ADR_SPACE_FIXED_HARDWARE C-states are found, invoke + * acpi_processor_ffh_cstate_probe() to verify them and update the + * cpu_cstate_entry data for @cpu. + */ +int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, + struct acpi_processor_power *info) +{ + struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL }; + union acpi_object *cst; + acpi_status status; + u64 count; + int last_index = 0; + int i, ret = 0; + + status = acpi_evaluate_object(handle, "_CST", NULL, &buffer); + if (ACPI_FAILURE(status)) { + acpi_handle_debug(handle, "No _CST\n"); + return -ENODEV; + } + + cst = buffer.pointer; + + /* There must be at least 2 elements. */ + if (!cst || cst->type != ACPI_TYPE_PACKAGE || cst->package.count < 2) { + acpi_handle_warn(handle, "Invalid _CST output\n"); + ret = -EFAULT; + goto end; + } + + count = cst->package.elements[0].integer.value; + + /* Validate the number of C-states. */ + if (count < 1 || count != cst->package.count - 1) { + acpi_handle_warn(handle, "Inconsistent _CST data\n"); + ret = -EFAULT; + goto end; + } + + for (i = 1; i <= count; i++) { + union acpi_object *element; + union acpi_object *obj; + struct acpi_power_register *reg; + struct acpi_processor_cx cx; + + /* + * If there is not enough space for all C-states, skip the + * excess ones and log a warning. + */ + if (last_index >= ACPI_PROCESSOR_MAX_POWER - 1) { + acpi_handle_warn(handle, + "No room for more idle states (limit: %d)\n", + ACPI_PROCESSOR_MAX_POWER - 1); + break; + } + + memset(&cx, 0, sizeof(cx)); + + element = &cst->package.elements[i]; + if (element->type != ACPI_TYPE_PACKAGE) + continue; + + if (element->package.count != 4) + continue; + + obj = &element->package.elements[0]; + + if (obj->type != ACPI_TYPE_BUFFER) + continue; + + reg = (struct acpi_power_register *)obj->buffer.pointer; + + obj = &element->package.elements[1]; + if (obj->type != ACPI_TYPE_INTEGER) + continue; + + cx.type = obj->integer.value; + /* + * There are known cases in which the _CST output does not + * contain C1, so if the type of the first state found is not + * C1, leave an empty slot for C1 to be filled in later. + */ + if (i == 1 && cx.type != ACPI_STATE_C1) + last_index = 1; + + cx.address = reg->address; + cx.index = last_index + 1; + + if (reg->space_id == ACPI_ADR_SPACE_FIXED_HARDWARE) { + if (!acpi_processor_ffh_cstate_probe(cpu, &cx, reg)) { + /* + * In the majority of cases _CST describes C1 as + * a FIXED_HARDWARE C-state, but if the command + * line forbids using MWAIT, use CSTATE_HALT for + * C1 regardless. + */ + if (cx.type == ACPI_STATE_C1 && + boot_option_idle_override == IDLE_NOMWAIT) { + cx.entry_method = ACPI_CSTATE_HALT; + snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); + } else { + cx.entry_method = ACPI_CSTATE_FFH; + } + } else if (cx.type == ACPI_STATE_C1) { + /* + * In the special case of C1, FIXED_HARDWARE can + * be handled by executing the HLT instruction. + */ + cx.entry_method = ACPI_CSTATE_HALT; + snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); + } else { + continue; + } + } else if (reg->space_id == ACPI_ADR_SPACE_SYSTEM_IO) { + cx.entry_method = ACPI_CSTATE_SYSTEMIO; + snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI IOPORT 0x%x", + cx.address); + } else { + continue; + } + + if (cx.type == ACPI_STATE_C1) + cx.valid = 1; + + obj = &element->package.elements[2]; + if (obj->type != ACPI_TYPE_INTEGER) + continue; + + cx.latency = obj->integer.value; + + obj = &element->package.elements[3]; + if (obj->type != ACPI_TYPE_INTEGER) + continue; + + memcpy(&info->states[++last_index], &cx, sizeof(cx)); + } + + acpi_handle_info(handle, "Found %d idle states\n", last_index); + + info->count = last_index; + + end: + kfree(buffer.pointer); + + return ret; +} +EXPORT_SYMBOL_GPL(acpi_processor_evaluate_cst); #endif /* CONFIG_ACPI_PROCESSOR_CSTATE */ diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 5641d5c8eede..b97860ec75b6 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -304,148 +304,6 @@ static int acpi_processor_get_power_info_default(struct acpi_processor *pr) return 0; }
-static int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, - struct acpi_processor_power *info) -{ - struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL }; - union acpi_object *cst; - acpi_status status; - u64 count; - int last_index = 0; - int i, ret = 0; - - status = acpi_evaluate_object(handle, "_CST", NULL, &buffer); - if (ACPI_FAILURE(status)) { - acpi_handle_debug(handle, "No _CST\n"); - return -ENODEV; - } - - cst = buffer.pointer; - - /* There must be at least 2 elements. */ - if (!cst || cst->type != ACPI_TYPE_PACKAGE || cst->package.count < 2) { - acpi_handle_warn(handle, "Invalid _CST output\n"); - ret = -EFAULT; - goto end; - } - - count = cst->package.elements[0].integer.value; - - /* Validate the number of C-states. */ - if (count < 1 || count != cst->package.count - 1) { - acpi_handle_warn(handle, "Inconsistent _CST data\n"); - ret = -EFAULT; - goto end; - } - - for (i = 1; i <= count; i++) { - union acpi_object *element; - union acpi_object *obj; - struct acpi_power_register *reg; - struct acpi_processor_cx cx; - - /* - * If there is not enough space for all C-states, skip the - * excess ones and log a warning. - */ - if (last_index >= ACPI_PROCESSOR_MAX_POWER - 1) { - acpi_handle_warn(handle, - "No room for more idle states (limit: %d)\n", - ACPI_PROCESSOR_MAX_POWER - 1); - break; - } - - memset(&cx, 0, sizeof(cx)); - - element = &cst->package.elements[i]; - if (element->type != ACPI_TYPE_PACKAGE) - continue; - - if (element->package.count != 4) - continue; - - obj = &element->package.elements[0]; - - if (obj->type != ACPI_TYPE_BUFFER) - continue; - - reg = (struct acpi_power_register *)obj->buffer.pointer; - - obj = &element->package.elements[1]; - if (obj->type != ACPI_TYPE_INTEGER) - continue; - - cx.type = obj->integer.value; - /* - * There are known cases in which the _CST output does not - * contain C1, so if the type of the first state found is not - * C1, leave an empty slot for C1 to be filled in later. - */ - if (i == 1 && cx.type != ACPI_STATE_C1) - last_index = 1; - - cx.address = reg->address; - cx.index = last_index + 1; - - if (reg->space_id == ACPI_ADR_SPACE_FIXED_HARDWARE) { - if (!acpi_processor_ffh_cstate_probe(cpu, &cx, reg)) { - /* - * In the majority of cases _CST describes C1 as - * a FIXED_HARDWARE C-state, but if the command - * line forbids using MWAIT, use CSTATE_HALT for - * C1 regardless. - */ - if (cx.type == ACPI_STATE_C1 && - boot_option_idle_override == IDLE_NOMWAIT) { - cx.entry_method = ACPI_CSTATE_HALT; - snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); - } else { - cx.entry_method = ACPI_CSTATE_FFH; - } - } else if (cx.type == ACPI_STATE_C1) { - /* - * In the special case of C1, FIXED_HARDWARE can - * be handled by executing the HLT instruction. - */ - cx.entry_method = ACPI_CSTATE_HALT; - snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI HLT"); - } else { - continue; - } - } else if (reg->space_id == ACPI_ADR_SPACE_SYSTEM_IO) { - cx.entry_method = ACPI_CSTATE_SYSTEMIO; - snprintf(cx.desc, ACPI_CX_DESC_LEN, "ACPI IOPORT 0x%x", - cx.address); - } else { - continue; - } - - if (cx.type == ACPI_STATE_C1) - cx.valid = 1; - - obj = &element->package.elements[2]; - if (obj->type != ACPI_TYPE_INTEGER) - continue; - - cx.latency = obj->integer.value; - - obj = &element->package.elements[3]; - if (obj->type != ACPI_TYPE_INTEGER) - continue; - - memcpy(&info->states[++last_index], &cx, sizeof(cx)); - } - - acpi_handle_info(handle, "Found %d idle states\n", last_index); - - info->count = last_index; - - end: - kfree(buffer.pointer); - - return ret; -} - static int acpi_processor_get_power_info_cst(struct acpi_processor *pr) { int ret; diff --git a/include/linux/acpi.h b/include/linux/acpi.h index d51bac1277eb..4a0142276cb8 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -296,10 +296,19 @@ static inline bool invalid_phys_cpuid(phys_cpuid_t phys_id) /* Validate the processor object's proc_id */ bool acpi_duplicate_processor_id(int proc_id); /* Processor _CTS control */ +struct acpi_processor_power; + #ifdef CONFIG_ACPI_PROCESSOR_CSTATE bool acpi_processor_claim_cst_control(void); +int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, + struct acpi_processor_power *info); #else static inline bool acpi_processor_claim_cst_control(void) { return false; } +static inline int acpi_processor_evaluate_cst(acpi_handle handle, u32 cpu, + struct acpi_processor_power *info) +{ + return -ENODEV; +} #endif
#ifdef CONFIG_ACPI_HOTPLUG_CPU
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 9f3d6daf61e5156139cd05643f7f1c2a9b7b49b0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9f3d6daf61e5156139cd05643f7f1c2a9b7b49b0 upstream
Move the C-state verification and checks from intel_idle_cpuidle_driver_init() to a separate function, intel_idle_verify_cstate(), and make the former call it after checking the CPUIDLE_FLAG_UNUSABLE state flag.
Also combine the drv->states[] updates with the incrementation of drv->state_count.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 49 ++++++++++++++++++++------------------- 1 file changed, 25 insertions(+), 24 deletions(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index c4bb67ed8da3..4bb1fc966c95 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -956,6 +956,22 @@ static void intel_idle_s2idle(struct cpuidle_device *dev, mwait_idle_with_hints(eax, ecx); }
+static bool intel_idle_verify_cstate(unsigned int mwait_hint) +{ + unsigned int mwait_cstate = MWAIT_HINT2CSTATE(mwait_hint) + 1; + unsigned int num_substates = (mwait_substates >> mwait_cstate * 4) & + MWAIT_SUBSTATE_MASK; + + /* Ignore the C-state if there are NO sub-states in CPUID for it. */ + if (num_substates == 0) + return false; + + if (mwait_cstate > 2 && !boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) + mark_tsc_unstable("TSC halts in idle states deeper than C2"); + + return true; +} + static void __setup_broadcast_timer(bool on) { if (on) @@ -1346,10 +1362,10 @@ static void __init intel_idle_cpuidle_driver_init(void) drv->state_count = 1;
for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) { - int num_substates, mwait_hint, mwait_cstate; + unsigned int mwait_hint;
- if ((cpuidle_state_table[cstate].enter == NULL) && - (cpuidle_state_table[cstate].enter_s2idle == NULL)) + if (!cpuidle_state_table[cstate].enter && + !cpuidle_state_table[cstate].enter_s2idle) break;
if (cstate + 1 > max_cstate) { @@ -1357,34 +1373,19 @@ static void __init intel_idle_cpuidle_driver_init(void) break; }
- mwait_hint = flg2MWAIT(cpuidle_state_table[cstate].flags); - mwait_cstate = MWAIT_HINT2CSTATE(mwait_hint); - - /* number of sub-states for this state in CPUID.MWAIT */ - num_substates = (mwait_substates >> ((mwait_cstate + 1) * 4)) - & MWAIT_SUBSTATE_MASK; - - /* if NO sub-states for this state in CPUID, skip it */ - if (num_substates == 0) - continue; - - /* if state marked as disabled, skip it */ + /* If marked as unusable, skip this state. */ if (cpuidle_state_table[cstate].disabled != 0) { pr_debug("state %s is disabled\n", cpuidle_state_table[cstate].name); continue; }
+ mwait_hint = flg2MWAIT(cpuidle_state_table[cstate].flags); + if (!intel_idle_verify_cstate(mwait_hint)) + continue;
- if (((mwait_cstate + 1) > 2) && - !boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) - mark_tsc_unstable("TSC halts in idle" - " states deeper than C2"); - - drv->states[drv->state_count] = /* structure copy */ - cpuidle_state_table[cstate]; - - drv->state_count += 1; + /* Structure copy. */ + drv->states[drv->state_count++] = cpuidle_state_table[cstate]; }
if (icpu->byt_auto_demotion_disable_flag) {
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 18734958e9bfbc055805d110a38dc76307eba742 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 18734958e9bfbc055805d110a38dc76307eba742 upstream
Modify the intel_idle driver to get the C-states information from ACPI _CST if the processor model is not recognized by it.
The processor is still required to support MWAIT and the information from ACPI _CST will only be used if all of the C-states listed by _CST are of the ACPI_CSTATE_FFH type (which means that they are expected to be entered via MWAIT).
Moreover, the driver assumes that the _CST information is the same for all CPUs in the system, so it is sufficient to evaluate _CST for one of them and extract the common list of C-states from there. Also _CST is evaluated once at the system initialization time and the driver does not respond to _CST change notifications (that can be changed in the future).
The main functional difference between intel_idle with this change and the ACPI processor driver is that the former sets the target residency to be equal to the exit latency (provided by _CST) for C1-type C-states and to 3 times the exit latency value for the other C-state types, whereas the latter obtains the target residency by multiplying the exit latency by the same number (2 by default) for all C-state types. Therefore it is expected that in general using the former instead of the latter on the same system will lead to improved energy-efficiency.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 190 ++++++++++++++++++++++++++++++++------ 1 file changed, 162 insertions(+), 28 deletions(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 4bb1fc966c95..22fffcbebbf5 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -53,6 +53,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/acpi.h> #include <linux/kernel.h> #include <linux/cpuidle.h> #include <linux/tick.h> @@ -1125,6 +1126,129 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = { {} };
+#define INTEL_CPU_FAM6_MWAIT \ + { X86_VENDOR_INTEL, 6, X86_MODEL_ANY, X86_FEATURE_MWAIT, 0 } + +static const struct x86_cpu_id intel_mwait_ids[] __initconst = { + INTEL_CPU_FAM6_MWAIT, + {} +}; + +static bool intel_idle_max_cstate_reached(int cstate) +{ + if (cstate + 1 > max_cstate) { + pr_info("max_cstate %d reached\n", max_cstate); + return true; + } + return false; +} + +#ifdef CONFIG_ACPI_PROCESSOR_CSTATE +#include <acpi/processor.h> + +static struct acpi_processor_power acpi_state_table; + +/** + * intel_idle_cst_usable - Check if the _CST information can be used. + * + * Check if all of the C-states listed by _CST in the max_cstate range are + * ACPI_CSTATE_FFH, which means that they should be entered via MWAIT. + */ +static bool intel_idle_cst_usable(void) +{ + int cstate, limit; + + limit = min_t(int, min_t(int, CPUIDLE_STATE_MAX, max_cstate + 1), + acpi_state_table.count); + + for (cstate = 1; cstate < limit; cstate++) { + struct acpi_processor_cx *cx = &acpi_state_table.states[cstate]; + + if (cx->entry_method != ACPI_CSTATE_FFH) + return false; + } + + return true; +} + +static bool intel_idle_acpi_cst_extract(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) { + struct acpi_processor *pr = per_cpu(processors, cpu); + + if (!pr) + continue; + + if (acpi_processor_evaluate_cst(pr->handle, cpu, &acpi_state_table)) + continue; + + acpi_state_table.count++; + + if (!intel_idle_cst_usable()) + continue; + + if (!acpi_processor_claim_cst_control()) { + acpi_state_table.count = 0; + return false; + } + + return true; + } + + pr_debug("ACPI _CST not found or not usable\n"); + return false; +} + +static void intel_idle_init_cstates_acpi(struct cpuidle_driver *drv) +{ + int cstate, limit = min_t(int, CPUIDLE_STATE_MAX, acpi_state_table.count); + + /* + * If limit > 0, intel_idle_cst_usable() has returned 'true', so all of + * the interesting states are ACPI_CSTATE_FFH. + */ + for (cstate = 1; cstate < limit; cstate++) { + struct acpi_processor_cx *cx; + struct cpuidle_state *state; + + if (intel_idle_max_cstate_reached(cstate)) + break; + + cx = &acpi_state_table.states[cstate]; + + state = &drv->states[drv->state_count++]; + + snprintf(state->name, CPUIDLE_NAME_LEN, "C%d_ACPI", cstate); + strlcpy(state->desc, cx->desc, CPUIDLE_DESC_LEN); + state->exit_latency = cx->latency; + /* + * For C1-type C-states use the same number for both the exit + * latency and target residency, because that is the case for + * C1 in the majority of the static C-states tables above. + * For the other types of C-states, however, set the target + * residency to 3 times the exit latency which should lead to + * a reasonable balance between energy-efficiency and + * performance in the majority of interesting cases. + */ + state->target_residency = cx->latency; + if (cx->type > ACPI_STATE_C1) + state->target_residency *= 3; + + state->flags = MWAIT2flg(cx->address); + if (cx->type > ACPI_STATE_C2) + state->flags |= CPUIDLE_FLAG_TLB_FLUSHED; + + state->enter = intel_idle; + state->enter_s2idle = intel_idle_s2idle; + } +} +#else /* !CONFIG_ACPI_PROCESSOR_CSTATE */ +static inline bool intel_idle_acpi_cst_extract(void) { return false; } +static inline void intel_idle_init_cstates_acpi(struct cpuidle_driver *drv) { } +#endif /* !CONFIG_ACPI_PROCESSOR_CSTATE */ + /* * intel_idle_probe() */ @@ -1139,17 +1263,15 @@ static int __init intel_idle_probe(void) }
id = x86_match_cpu(intel_idle_ids); - if (!id) { - if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && - boot_cpu_data.x86 == 6) - pr_debug("does not run on family %d model %d\n", - boot_cpu_data.x86, boot_cpu_data.x86_model); - return -ENODEV; - } - - if (!boot_cpu_has(X86_FEATURE_MWAIT)) { - pr_debug("Please enable MWAIT in BIOS SETUP\n"); - return -ENODEV; + if (id) { + if (!boot_cpu_has(X86_FEATURE_MWAIT)) { + pr_debug("Please enable MWAIT in BIOS SETUP\n"); + return -ENODEV; + } + } else { + id = x86_match_cpu(intel_mwait_ids); + if (!id) + return -ENODEV; }
if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF) @@ -1165,7 +1287,10 @@ static int __init intel_idle_probe(void) pr_debug("MWAIT substates: 0x%x\n", mwait_substates);
icpu = (const struct idle_cpu *)id->driver_data; - cpuidle_state_table = icpu->state_table; + if (icpu) + cpuidle_state_table = icpu->state_table; + else if (!intel_idle_acpi_cst_extract()) + return -ENODEV;
pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n", boot_cpu_data.x86_model); @@ -1347,31 +1472,19 @@ static void intel_idle_state_table_update(void) } }
-/* - * intel_idle_cpuidle_driver_init() - * allocate, initialize cpuidle_states - */ -static void __init intel_idle_cpuidle_driver_init(void) +static void intel_idle_init_cstates_icpu(struct cpuidle_driver *drv) { int cstate; - struct cpuidle_driver *drv = &intel_idle_driver; - - intel_idle_state_table_update(); - - cpuidle_poll_state_init(drv); - drv->state_count = 1;
for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) { unsigned int mwait_hint;
- if (!cpuidle_state_table[cstate].enter && - !cpuidle_state_table[cstate].enter_s2idle) + if (intel_idle_max_cstate_reached(cstate)) break;
- if (cstate + 1 > max_cstate) { - pr_info("max_cstate %d reached\n", max_cstate); + if (!cpuidle_state_table[cstate].enter && + !cpuidle_state_table[cstate].enter_s2idle) break; - }
/* If marked as unusable, skip this state. */ if (cpuidle_state_table[cstate].disabled != 0) { @@ -1394,6 +1507,24 @@ static void __init intel_idle_cpuidle_driver_init(void) } }
+/* + * intel_idle_cpuidle_driver_init() + * allocate, initialize cpuidle_states + */ +static void __init intel_idle_cpuidle_driver_init(void) +{ + struct cpuidle_driver *drv = &intel_idle_driver; + + intel_idle_state_table_update(); + + cpuidle_poll_state_init(drv); + drv->state_count = 1; + + if (icpu) + intel_idle_init_cstates_icpu(drv); + else + intel_idle_init_cstates_acpi(drv); +}
/* * intel_idle_cpu_init() @@ -1412,6 +1543,9 @@ static int intel_idle_cpu_init(unsigned int cpu) return -EIO; }
+ if (!icpu) + return 0; + if (icpu->auto_demotion_disable_flags) auto_demotion_disable();
From: Yangtao Li tiny.windzz@gmail.com
mainline inclusion from mainline-v5.1-rc1 commit 44021606298870e4adc641ef3927e7bb47ca8236 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 44021606298870e4adc641ef3927e7bb47ca8236 upstream
Use BIT() macro to do a small tidy-up.
CPUIDLE_DRIVER_FLAGS_MASK is not used, so remove it.
Signed-off-by: Yangtao Li tiny.windzz@gmail.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/cpuidle.h | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 9dd5a6b27b21..8623aac181b1 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -67,11 +67,9 @@ struct cpuidle_state {
/* Idle State Flags */ #define CPUIDLE_FLAG_NONE (0x00) -#define CPUIDLE_FLAG_POLLING (0x01) /* polling state */ -#define CPUIDLE_FLAG_COUPLED (0x02) /* state applies to multiple cpus */ -#define CPUIDLE_FLAG_TIMER_STOP (0x04) /* timer is stopped on this state */ - -#define CPUIDLE_DRIVER_FLAGS_MASK (0xFFFF0000) +#define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */ +#define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */ +#define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
struct cpuidle_device_kobj; struct cpuidle_state_kobj;
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.0-rc1 commit aa5eee355b466cb33f97f79bed9740a472c4ab73 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit aa5eee355b466cb33f97f79bed9740a472c4ab73 upstream
Important information is missing from user/admin cpuidle documentation available today, so add a new user/admin document for cpuidle containing current and comprehensive information to admin-guide and drop the old .txt documents it is replacing.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Reviewed-by: Viresh Kumar viresh.kumar@linaro.org Reviewed-by: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../admin-guide/pm/working-state.rst | 1 + Documentation/cpuidle/core.txt | 23 ----- Documentation/cpuidle/sysfs.txt | 98 ------------------- 3 files changed, 1 insertion(+), 121 deletions(-) delete mode 100644 Documentation/cpuidle/core.txt delete mode 100644 Documentation/cpuidle/sysfs.txt
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index fa01bf083dfe..b6cef9b5e961 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -5,5 +5,6 @@ Working-State Power Management .. toctree:: :maxdepth: 2
+ cpuidle cpufreq intel_pstate diff --git a/Documentation/cpuidle/core.txt b/Documentation/cpuidle/core.txt deleted file mode 100644 index 63ecc5dc9d8a..000000000000 --- a/Documentation/cpuidle/core.txt +++ /dev/null @@ -1,23 +0,0 @@ - - Supporting multiple CPU idle levels in kernel - - cpuidle - -General Information: - -Various CPUs today support multiple idle levels that are differentiated -by varying exit latencies and power consumption during idle. -cpuidle is a generic in-kernel infrastructure that separates -idle policy (governor) from idle mechanism (driver) and provides a -standardized infrastructure to support independent development of -governors and drivers. - -cpuidle resides under drivers/cpuidle. - -Boot options: -"cpuidle_sysfs_switch" -enables current_governor interface in /sys/devices/system/cpu/cpuidle/, -which can be used to switch governors at run time. This boot option -is meant for developer testing only. In normal usage, kernel picks the -best governor based on governor ratings. -SEE ALSO: sysfs.txt in this directory. diff --git a/Documentation/cpuidle/sysfs.txt b/Documentation/cpuidle/sysfs.txt deleted file mode 100644 index d1587f434e7b..000000000000 --- a/Documentation/cpuidle/sysfs.txt +++ /dev/null @@ -1,98 +0,0 @@ - - - Supporting multiple CPU idle levels in kernel - - cpuidle sysfs - -System global cpuidle related information and tunables are under -/sys/devices/system/cpu/cpuidle - -The current interfaces in this directory has self-explanatory names: -* current_driver -* current_governor_ro - -With cpuidle_sysfs_switch boot option (meant for developer testing) -following objects are visible instead. -* current_driver -* available_governors -* current_governor -In this case users can switch the governor at run time by writing -to current_governor. - - -Per logical CPU specific cpuidle information are under -/sys/devices/system/cpu/cpuX/cpuidle -for each online cpu X - --------------------------------------------------------------------------------- -# ls -lR /sys/devices/system/cpu/cpu0/cpuidle/ -/sys/devices/system/cpu/cpu0/cpuidle/: -total 0 -drwxr-xr-x 2 root root 0 Feb 8 10:42 state0 -drwxr-xr-x 2 root root 0 Feb 8 10:42 state1 -drwxr-xr-x 2 root root 0 Feb 8 10:42 state2 -drwxr-xr-x 2 root root 0 Feb 8 10:42 state3 - -/sys/devices/system/cpu/cpu0/cpuidle/state0: -total 0 --r--r--r-- 1 root root 4096 Feb 8 10:42 desc --rw-r--r-- 1 root root 4096 Feb 8 10:42 disable --r--r--r-- 1 root root 4096 Feb 8 10:42 latency --r--r--r-- 1 root root 4096 Feb 8 10:42 name --r--r--r-- 1 root root 4096 Feb 8 10:42 power --r--r--r-- 1 root root 4096 Feb 8 10:42 residency --r--r--r-- 1 root root 4096 Feb 8 10:42 time --r--r--r-- 1 root root 4096 Feb 8 10:42 usage - -/sys/devices/system/cpu/cpu0/cpuidle/state1: -total 0 --r--r--r-- 1 root root 4096 Feb 8 10:42 desc --rw-r--r-- 1 root root 4096 Feb 8 10:42 disable --r--r--r-- 1 root root 4096 Feb 8 10:42 latency --r--r--r-- 1 root root 4096 Feb 8 10:42 name --r--r--r-- 1 root root 4096 Feb 8 10:42 power --r--r--r-- 1 root root 4096 Feb 8 10:42 residency --r--r--r-- 1 root root 4096 Feb 8 10:42 time --r--r--r-- 1 root root 4096 Feb 8 10:42 usage - -/sys/devices/system/cpu/cpu0/cpuidle/state2: -total 0 --r--r--r-- 1 root root 4096 Feb 8 10:42 desc --rw-r--r-- 1 root root 4096 Feb 8 10:42 disable --r--r--r-- 1 root root 4096 Feb 8 10:42 latency --r--r--r-- 1 root root 4096 Feb 8 10:42 name --r--r--r-- 1 root root 4096 Feb 8 10:42 power --r--r--r-- 1 root root 4096 Feb 8 10:42 residency --r--r--r-- 1 root root 4096 Feb 8 10:42 time --r--r--r-- 1 root root 4096 Feb 8 10:42 usage - -/sys/devices/system/cpu/cpu0/cpuidle/state3: -total 0 --r--r--r-- 1 root root 4096 Feb 8 10:42 desc --rw-r--r-- 1 root root 4096 Feb 8 10:42 disable --r--r--r-- 1 root root 4096 Feb 8 10:42 latency --r--r--r-- 1 root root 4096 Feb 8 10:42 name --r--r--r-- 1 root root 4096 Feb 8 10:42 power --r--r--r-- 1 root root 4096 Feb 8 10:42 residency --r--r--r-- 1 root root 4096 Feb 8 10:42 time --r--r--r-- 1 root root 4096 Feb 8 10:42 usage --------------------------------------------------------------------------------- - - -* desc : Small description about the idle state (string) -* disable : Option to disable this idle state (bool) -> see note below -* latency : Latency to exit out of this idle state (in microseconds) -* residency : Time after which a state becomes more effecient than any - shallower state (in microseconds) -* name : Name of the idle state (string) -* power : Power consumed while in this idle state (in milliwatts) -* time : Total time spent in this idle state (in microseconds) -* usage : Number of times this state was entered (count) - -Note: -The behavior and the effect of the disable variable depends on the -implementation of a particular governor. In the ladder governor, for -example, it is not coherent, i.e. if one is disabling a light state, -then all deeper states are disabled as well, but the disable variable -does not reflect it. Likewise, if one enables a deep state but a lighter -state still is disabled, then this has no effect.
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 75a80267410e38ab76c4ceb39753f96d72113781 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 75a80267410e38ab76c4ceb39753f96d72113781 upstream
In certain situations it may be useful to prevent some idle states from being used by default while allowing user space to enable them later on.
For this purpose, introduce a new state flag, CPUIDLE_FLAG_OFF, to mark idle states that should be disabled by default, make the core set CPUIDLE_STATE_DISABLED_BY_USER for those states at the initialization time and add a new state attribute in sysfs, "default_status", to inform user space of the initial status of the given idle state ("disabled" if CPUIDLE_FLAG_OFF is set for it, "enabled" otherwise).
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ABI/testing/sysfs-devices-system-cpu | 6 ++++++ drivers/cpuidle/cpuidle.c | 7 ++++++- drivers/cpuidle/sysfs.c | 10 ++++++++++ include/linux/cpuidle.h | 2 ++ 4 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index b9c14c11efc5..77f517d5337a 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -188,6 +188,12 @@ Description: does not reflect it. Likewise, if one enables a deep state but a lighter state still is disabled, then this has no effect.
+What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/default_status +Date: December 2019 +KernelVersion: v5.6 +Contact: Linux power management list linux-pm@vger.kernel.org +Description: + (RO) The default status of this state, "enabled" or "disabled".
What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/residency Date: March 2014 diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 42fc94773753..98c03272131b 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -534,8 +534,8 @@ static void __cpuidle_device_init(struct cpuidle_device *dev) */ static int __cpuidle_register_device(struct cpuidle_device *dev) { - int ret; struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev); + int i, ret;
if (!try_module_get(drv->owner)) return -EINVAL; @@ -543,6 +543,11 @@ static int __cpuidle_register_device(struct cpuidle_device *dev) per_cpu(cpuidle_devices, dev->cpu) = dev; list_add(&dev->device_list, &cpuidle_detected_devices);
+ for (i = 0; i < drv->state_count; i++) { + if (drv->states[i].flags & CPUIDLE_FLAG_OFF) + dev->states_usage[i].disable |= CPUIDLE_STATE_DISABLED_BY_USER; + } + ret = cpuidle_coupled_register_device(dev); if (ret) __cpuidle_unregister_device(dev); diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index 4c8042f19a96..38986a36197e 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -292,6 +292,14 @@ static ssize_t show_state_##_name(struct cpuidle_state *state, \ return sprintf(buf, "%s\n", state->_name);\ }
+static ssize_t show_state_default_status(struct cpuidle_state *state, + struct cpuidle_state_usage *state_usage, + char *buf) +{ + return sprintf(buf, "%s\n", + state->flags & CPUIDLE_FLAG_OFF ? "disabled" : "enabled"); +} + define_show_state_function(exit_latency) define_show_state_function(target_residency) define_show_state_function(power_usage) @@ -310,6 +318,7 @@ define_one_state_ro(power, show_state_power_usage); define_one_state_ro(usage, show_state_usage); define_one_state_ro(time, show_state_time); define_one_state_rw(disable, show_state_disable, store_state_disable); +define_one_state_ro(default_status, show_state_default_status);
static struct attribute *cpuidle_state_default_attrs[] = { &attr_name.attr, @@ -320,6 +329,7 @@ static struct attribute *cpuidle_state_default_attrs[] = { &attr_usage.attr, &attr_time.attr, &attr_disable.attr, + &attr_default_status.attr, NULL };
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 8623aac181b1..8b0ab79f7f7d 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -70,6 +70,7 @@ struct cpuidle_state { #define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */ #define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */ #define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */ +#define CPUIDLE_FLAG_OFF BIT(4) /* disable this state by default */
struct cpuidle_device_kobj; struct cpuidle_state_kobj; @@ -117,6 +118,7 @@ static inline int cpuidle_get_last_residency(struct cpuidle_device *dev) /**************************** * CPUIDLE DRIVER INTERFACE * ****************************/ +#define CPUIDLE_STATE_DISABLED_BY_USER BIT(0)
struct cpuidle_driver { const char *name;
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit bff8e60a86f4960133f90ad9add9adeb082b8154 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit bff8e60a86f4960133f90ad9add9adeb082b8154 upstream
Update the intel_idle driver to get the C-states information from ACPI _CST in some cases in which the processor is known to the driver, as long as that information is available and the new use_acpi flag is set in the profile of the processor in question.
In the cases when there is a specific table of C-states for the given processor in the driver, that table is used as the primary source of information on the available C-states, but if ACPI _CST is present, the C-states that are not listed by it will not be enabled by default (they still can be enabled later by user space via sysfs, though).
The new CPUIDLE_FLAG_ALWAYS_ENABLE flag can be used for marking C-states that should be enabled by default even if they are not listed by ACPI _CST.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 45 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 42 insertions(+), 3 deletions(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 22fffcbebbf5..487edbad098d 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -92,6 +92,7 @@ struct idle_cpu { unsigned long auto_demotion_disable_flags; bool byt_auto_demotion_disable_flag; bool disable_promotion_to_c1e; + bool use_acpi; };
static const struct idle_cpu *icpu; @@ -102,6 +103,11 @@ static void intel_idle_s2idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index); static struct cpuidle_state *cpuidle_state_table;
+/* + * Enable this state by default even if the ACPI _CST does not list it. + */ +#define CPUIDLE_FLAG_ALWAYS_ENABLE BIT(15) + /* * Set this flag for states where the HW flushes the TLB for us * and so we don't need cross-calls to keep it consistent. @@ -1244,9 +1250,33 @@ static void intel_idle_init_cstates_acpi(struct cpuidle_driver *drv) state->enter_s2idle = intel_idle_s2idle; } } + +static bool intel_idle_off_by_default(u32 mwait_hint) +{ + int cstate, limit; + + /* + * If there are no _CST C-states, do not disable any C-states by + * default. + */ + if (!acpi_state_table.count) + return false; + + limit = min_t(int, CPUIDLE_STATE_MAX, acpi_state_table.count); + /* + * If limit > 0, intel_idle_cst_usable() has returned 'true', so all of + * the interesting states are ACPI_CSTATE_FFH. + */ + for (cstate = 1; cstate < limit; cstate++) { + if (acpi_state_table.states[cstate].address == mwait_hint) + return false; + } + return true; +} #else /* !CONFIG_ACPI_PROCESSOR_CSTATE */ static inline bool intel_idle_acpi_cst_extract(void) { return false; } static inline void intel_idle_init_cstates_acpi(struct cpuidle_driver *drv) { } +static inline bool intel_idle_off_by_default(u32 mwait_hint) { return false; } #endif /* !CONFIG_ACPI_PROCESSOR_CSTATE */
/* @@ -1287,10 +1317,13 @@ static int __init intel_idle_probe(void) pr_debug("MWAIT substates: 0x%x\n", mwait_substates);
icpu = (const struct idle_cpu *)id->driver_data; - if (icpu) + if (icpu) { cpuidle_state_table = icpu->state_table; - else if (!intel_idle_acpi_cst_extract()) + if (icpu->use_acpi) + intel_idle_acpi_cst_extract(); + } else if (!intel_idle_acpi_cst_extract()) { return -ENODEV; + }
pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n", boot_cpu_data.x86_model); @@ -1498,7 +1531,13 @@ static void intel_idle_init_cstates_icpu(struct cpuidle_driver *drv) continue;
/* Structure copy. */ - drv->states[drv->state_count++] = cpuidle_state_table[cstate]; + drv->states[drv->state_count] = cpuidle_state_table[cstate]; + + if (icpu->use_acpi && intel_idle_off_by_default(mwait_hint) && + !(cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_ALWAYS_ENABLE)) + drv->states[drv->state_count].flags |= CPUIDLE_FLAG_OFF; + + drv->state_count++; }
if (icpu->byt_auto_demotion_disable_flag) {
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 4ec32d9e8e5b6d6eb491eeee3938665d8a2388fa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 4ec32d9e8e5b6d6eb491eeee3938665d8a2388fa upstream
Add a new module parameter called "no_acpi" to the intel_idle driver to allow the driver to be prevented from using ACPI _CST via kernel command line.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 487edbad098d..cfd3ecafed7b 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -1152,6 +1152,10 @@ static bool intel_idle_max_cstate_reached(int cstate) #ifdef CONFIG_ACPI_PROCESSOR_CSTATE #include <acpi/processor.h>
+static bool no_acpi __read_mostly; +module_param(no_acpi, bool, 0444); +MODULE_PARM_DESC(no_acpi, "Do not use ACPI _CST for building the idle states list"); + static struct acpi_processor_power acpi_state_table;
/** @@ -1181,6 +1185,11 @@ static bool intel_idle_acpi_cst_extract(void) { unsigned int cpu;
+ if (no_acpi) { + pr_debug("Not allowed to use ACPI _CST\n"); + return false; + } + for_each_possible_cpu(cpu) { struct acpi_processor *pr = per_cpu(processors, cpu);
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.6-rc1 commit e6d4f08a677654385869ba0c39d7c9ceec47e5c5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e6d4f08a677654385869ba0c39d7c9ceec47e5c5 upstream
In many cases, especially on server systems, it is desirable to avoid enabling C-states that have been disabled in the platform firmware (BIOS) setup, except for C1E.
As a rule, the C-states disabled this way are not listed by ACPI _CST, so if that is used by intel_idle along with the specific table of C-states that it has for the given processor, the C-states disabled through the platform firmware will not be enabled by default by intel_idle.
Accordingly, set the use_acpi flag (introduced previously) in all server processor profiles defined in intel_idle (so as to make it use ACPI _CST to decide which C-states to enable by default) and set the CPUIDLE_FLAG_ALWAYS_ENABLE flag (also introduced previously) for C1E in all C-states tables in intel_idle that contain C1 too (so that C1E is enabled regardless of whether or not it is listed by ACPI _CST).
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 70 ++++++++++++++++++++++++++++----------- 1 file changed, 50 insertions(+), 20 deletions(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index cfd3ecafed7b..a921070184a0 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -143,7 +143,7 @@ static struct cpuidle_state nehalem_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -180,7 +180,7 @@ static struct cpuidle_state snb_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -315,7 +315,7 @@ static struct cpuidle_state ivb_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -360,7 +360,7 @@ static struct cpuidle_state ivt_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 80, .enter = &intel_idle, @@ -397,7 +397,7 @@ static struct cpuidle_state ivt_cstates_4s[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 250, .enter = &intel_idle, @@ -434,7 +434,7 @@ static struct cpuidle_state ivt_cstates_8s[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 500, .enter = &intel_idle, @@ -471,7 +471,7 @@ static struct cpuidle_state hsw_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -539,7 +539,7 @@ static struct cpuidle_state bdw_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -608,7 +608,7 @@ static struct cpuidle_state skl_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -677,7 +677,7 @@ static struct cpuidle_state skx_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -827,7 +827,7 @@ static struct cpuidle_state bxt_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -888,7 +888,7 @@ static struct cpuidle_state dnv_cstates[] = { { .name = "C1E", .desc = "MWAIT 0x01", - .flags = MWAIT2flg(0x01), + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, .exit_latency = 10, .target_residency = 20, .enter = &intel_idle, @@ -1010,6 +1010,13 @@ static const struct idle_cpu idle_cpu_nehalem = { .disable_promotion_to_c1e = true, };
+static const struct idle_cpu idle_cpu_nhx = { + .state_table = nehalem_cstates, + .auto_demotion_disable_flags = NHM_C1_AUTO_DEMOTE | NHM_C3_AUTO_DEMOTE, + .disable_promotion_to_c1e = true, + .use_acpi = true, +}; + static const struct idle_cpu idle_cpu_atom = { .state_table = atom_cstates, }; @@ -1028,6 +1035,12 @@ static const struct idle_cpu idle_cpu_snb = { .disable_promotion_to_c1e = true, };
+static const struct idle_cpu idle_cpu_snx = { + .state_table = snb_cstates, + .disable_promotion_to_c1e = true, + .use_acpi = true, +}; + static const struct idle_cpu idle_cpu_byt = { .state_table = byt_cstates, .disable_promotion_to_c1e = true, @@ -1048,6 +1061,7 @@ static const struct idle_cpu idle_cpu_ivb = { static const struct idle_cpu idle_cpu_ivt = { .state_table = ivt_cstates, .disable_promotion_to_c1e = true, + .use_acpi = true, };
static const struct idle_cpu idle_cpu_hsw = { @@ -1055,11 +1069,23 @@ static const struct idle_cpu idle_cpu_hsw = { .disable_promotion_to_c1e = true, };
+static const struct idle_cpu idle_cpu_hsx = { + .state_table = hsw_cstates, + .disable_promotion_to_c1e = true, + .use_acpi = true, +}; + static const struct idle_cpu idle_cpu_bdw = { .state_table = bdw_cstates, .disable_promotion_to_c1e = true, };
+static const struct idle_cpu idle_cpu_bdx = { + .state_table = bdw_cstates, + .disable_promotion_to_c1e = true, + .use_acpi = true, +}; + static const struct idle_cpu idle_cpu_skl = { .state_table = skl_cstates, .disable_promotion_to_c1e = true, @@ -1068,15 +1094,18 @@ static const struct idle_cpu idle_cpu_skl = { static const struct idle_cpu idle_cpu_skx = { .state_table = skx_cstates, .disable_promotion_to_c1e = true, + .use_acpi = true, };
static const struct idle_cpu idle_cpu_avn = { .state_table = avn_cstates, .disable_promotion_to_c1e = true, + .use_acpi = true, };
static const struct idle_cpu idle_cpu_knl = { .state_table = knl_cstates, + .use_acpi = true, };
static const struct idle_cpu idle_cpu_bxt = { @@ -1087,6 +1116,7 @@ static const struct idle_cpu idle_cpu_bxt = { static const struct idle_cpu idle_cpu_dnv = { .state_table = dnv_cstates, .disable_promotion_to_c1e = true, + .use_acpi = true, };
#define ICPU(model, cpu) \ @@ -1094,16 +1124,16 @@ static const struct idle_cpu idle_cpu_dnv = {
static const struct x86_cpu_id intel_idle_ids[] __initconst = { ICPU(INTEL_FAM6_NEHALEM_EP, idle_cpu_nehalem), - ICPU(INTEL_FAM6_NEHALEM, idle_cpu_nehalem), + ICPU(INTEL_FAM6_NEHALEM, idle_cpu_nhx), ICPU(INTEL_FAM6_NEHALEM_G, idle_cpu_nehalem), ICPU(INTEL_FAM6_WESTMERE, idle_cpu_nehalem), - ICPU(INTEL_FAM6_WESTMERE_EP, idle_cpu_nehalem), - ICPU(INTEL_FAM6_NEHALEM_EX, idle_cpu_nehalem), + ICPU(INTEL_FAM6_WESTMERE_EP, idle_cpu_nhx), + ICPU(INTEL_FAM6_NEHALEM_EX, idle_cpu_nhx), ICPU(INTEL_FAM6_ATOM_BONNELL, idle_cpu_atom), ICPU(INTEL_FAM6_ATOM_BONNELL_MID, idle_cpu_lincroft), - ICPU(INTEL_FAM6_WESTMERE_EX, idle_cpu_nehalem), + ICPU(INTEL_FAM6_WESTMERE_EX, idle_cpu_nhx), ICPU(INTEL_FAM6_SANDYBRIDGE, idle_cpu_snb), - ICPU(INTEL_FAM6_SANDYBRIDGE_X, idle_cpu_snb), + ICPU(INTEL_FAM6_SANDYBRIDGE_X, idle_cpu_snx), ICPU(INTEL_FAM6_ATOM_SALTWELL, idle_cpu_atom), ICPU(INTEL_FAM6_ATOM_SILVERMONT, idle_cpu_byt), ICPU(INTEL_FAM6_ATOM_SILVERMONT_MID, idle_cpu_tangier), @@ -1111,14 +1141,14 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = { ICPU(INTEL_FAM6_IVYBRIDGE, idle_cpu_ivb), ICPU(INTEL_FAM6_IVYBRIDGE_X, idle_cpu_ivt), ICPU(INTEL_FAM6_HASWELL_CORE, idle_cpu_hsw), - ICPU(INTEL_FAM6_HASWELL_X, idle_cpu_hsw), + ICPU(INTEL_FAM6_HASWELL_X, idle_cpu_hsx), ICPU(INTEL_FAM6_HASWELL_ULT, idle_cpu_hsw), ICPU(INTEL_FAM6_HASWELL_GT3E, idle_cpu_hsw), ICPU(INTEL_FAM6_ATOM_SILVERMONT_X, idle_cpu_avn), ICPU(INTEL_FAM6_BROADWELL_CORE, idle_cpu_bdw), ICPU(INTEL_FAM6_BROADWELL_GT3E, idle_cpu_bdw), - ICPU(INTEL_FAM6_BROADWELL_X, idle_cpu_bdw), - ICPU(INTEL_FAM6_BROADWELL_XEON_D, idle_cpu_bdw), + ICPU(INTEL_FAM6_BROADWELL_X, idle_cpu_bdx), + ICPU(INTEL_FAM6_BROADWELL_XEON_D, idle_cpu_bdx), ICPU(INTEL_FAM6_SKYLAKE_MOBILE, idle_cpu_skl), ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, idle_cpu_skl), ICPU(INTEL_FAM6_KABYLAKE_MOBILE, idle_cpu_skl),
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit 239ed06d0eefc4ee351af69bd64742ada56c4bdb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 239ed06d0eefc4ee351af69bd64742ada56c4bdb upstream
To avoid build errors when CONFIG_ACPI_PROCESSOR_CSTATE is set and CONFIG_ACPI_PROCESSOR is not (that may appear in randconfig builds), make the former depend on the latter.
Acked-by: Len Brown len.brown@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/Kconfig | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 89d5fd8020ec..fc99ce413a76 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -244,6 +244,7 @@ config ACPI_CPU_FREQ_PSS
config ACPI_PROCESSOR_CSTATE def_bool y + depends on ACPI_PROCESSOR depends on IA64 || X86
config ACPI_PROCESSOR_IDLE
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
mainline inclusion from mainline-v5.6-rc1 commit a3299182216397a0b943d2549d1997f4eba2bdd2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit a3299182216397a0b943d2549d1997f4eba2bdd2 upstream
Add an admin-guide document for the intel_idle driver to describe how it works: how it enumerates idle states, what happens during the initialization of it, how it can be controlled via the kernel command line and so on.
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Reviewed-by: Randy Dunlap rdunlap@infradead.org Signed-off-by: yjia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/admin-guide/pm/intel_idle.rst | 246 ++++++++++++++++++ .../admin-guide/pm/working-state.rst | 1 + 2 files changed, 247 insertions(+) create mode 100644 Documentation/admin-guide/pm/intel_idle.rst
diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst new file mode 100644 index 000000000000..afbf778035f8 --- /dev/null +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -0,0 +1,246 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +============================================== +``intel_idle`` CPU Idle Time Management Driver +============================================== + +:Copyright: |copy| 2020 Intel Corporation + +:Author: Rafael J. Wysocki rafael.j.wysocki@intel.com + + +General Information +=================== + +``intel_idle`` is a part of the +:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel +(``CPUIdle``). It is the default CPU idle time management driver for the +Nehalem and later generations of Intel processors, but the level of support for +a particular processor model in it depends on whether or not it recognizes that +processor model and may also depend on information coming from the platform +firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle`` +works in general, so this is the time to get familiar with :doc:`cpuidle` if you +have not done that yet.] + +``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the +logical CPU executing it is idle and so it may be possible to put some of the +processor's functional blocks into low-power states. That instruction takes two +arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the +first of which, referred to as a *hint*, can be used by the processor to +determine what can be done (for details refer to Intel Software Developer’s +Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in +which the support for the ``MWAIT`` instruction has been disabled (for example, +via the platform firmware configuration menu) or which do not support that +instruction at all. + +``intel_idle`` is not modular, so it cannot be unloaded, which means that the +only way to pass early-configuration-time parameters to it is via the kernel +command line. + + +.. _intel-idle-enumeration-of-states: + +Enumeration of Idle States +========================== + +Each ``MWAIT`` hint value is interpreted by the processor as a license to +reconfigure itself in a certain way in order to save energy. The processor +configurations (with reduced power draw) resulting from that are referred to +as C-states (in the ACPI terminology) or idle states. The list of meaningful +``MWAIT`` hint values and idle states (i.e. low-power configurations of the +processor) corresponding to them depends on the processor model and it may also +depend on the configuration of the platform. + +In order to create a list of available idle states required by the ``CPUIdle`` +subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`), +``intel_idle`` can use two sources of information: static tables of idle states +for different processor models included in the driver itself and the ACPI tables +of the system. The former are always used if the processor model at hand is +recognized by ``intel_idle`` and the latter are used if that is required for +the given processor model (which is the case for all server processor models +recognized by ``intel_idle``) or if the processor model is not recognized. + +If the ACPI tables are going to be used for building the list of available idle +states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI +objects corresponding to the CPUs in the system (refer to the ACPI specification +[2]_ for the description of ``_CST`` and its output package). Because the +``CPUIdle`` subsystem expects that the list of idle states supplied by the +driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is +registered as the ``CPUIdle`` driver for all of the CPUs in the system, the +driver looks for the first ``_CST`` object returning at least one valid idle +state description and such that all of the idle states included in its return +package are of the FFH (Functional Fixed Hardware) type, which means that the +``MWAIT`` instruction is expected to be used to tell the processor that it can +enter one of them. The return package of that ``_CST`` is then assumed to be +applicable to all of the other CPUs in the system and the idle state +descriptions extracted from it are stored in a preliminary list of idle states +coming from the ACPI tables. [This step is skipped if ``intel_idle`` is +configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.] + +Next, the first (index 0) entry in the list of available idle states is +initialized to represent a "polling idle state" (a pseudo-idle state in which +the target CPU continuously fetches and executes instructions), and the +subsequent (real) idle state entries are populated as follows. + +If the processor model at hand is recognized by ``intel_idle``, there is a +(static) table of idle state descriptions for it in the driver. In that case, +the "internal" table is the primary source of information on idle states and the +information from it is copied to the final list of available idle states. If +using the ACPI tables for the enumeration of idle states is not required +(depending on the processor model), all of the listed idle state are enabled by +default (so all of them will be taken into consideration by ``CPUIdle`` +governors during CPU idle state selection). Otherwise, some of the listed idle +states may not be enabled by default if there are no matching entries in the +preliminary list of idle states coming from the ACPI tables. In that case user +space still can enable them later (on a per-CPU basis) with the help of +the ``disable`` idle state attribute in ``sysfs`` (see +:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that +the idle states "known" to the driver may not be enabled by default if they have +not been exposed by the platform firmware (through the ACPI tables). + +If the given processor model is not recognized by ``intel_idle``, but it +supports ``MWAIT``, the preliminary list of idle states coming from the ACPI +tables is used for building the final list that will be supplied to the +``CPUIdle`` core during driver registration. For each idle state in that list, +the description, ``MWAIT`` hint and exit latency are copied to the corresponding +entry in the final list of idle states. The name of the idle state represented +by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is +"CX_ACPI", where X is the index of that idle state in the final list (note that +the minimum value of X is 1, because 0 is reserved for the "polling" state), and +its target residency is based on the exit latency value. Specifically, for +C1-type idle states the exit latency value is also used as the target residency +(for compatibility with the majority of the "internal" tables of idle states for +various processor models recognized by ``intel_idle``) and for the other idle +state types (C2 and C3) the target residency value is 3 times the exit latency +(again, that is because it reflects the target residency to exit latency ratio +in the majority of cases for the processor models recognized by ``intel_idle``). +All of the idle states in the final list are enabled by default in this case. + + +.. _intel-idle-initialization: + +Initialization +============== + +The initialization of ``intel_idle`` starts with checking if the kernel command +line options forbid the use of the ``MWAIT`` instruction. If that is the case, +an error code is returned right away. + +The next step is to check whether or not the processor model is known to the +driver, which determines the idle states enumeration method (see +`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor +supports ``MWAIT`` (the initialization fails if that is not the case). Then, +the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the +driver initialization fails if the level of support is not as expected (for +example, if the total number of ``MWAIT`` substates returned is 0). + +Next, if the driver is not configured to ignore the ACPI tables (see +`below <intel-idle-parameters_>`_), the idle states information provided by the +platform firmware is extracted from them. + +Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of +available idle states is created as explained +`above <intel-idle-enumeration-of-states_>`_. + +Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver() +as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback +for configuring individual CPUs is registered via cpuhp_setup_state(), which +(among other things) causes the callback routine to be invoked for all of the +CPUs present in the system at that time (each CPU executes its own instance of +the callback routine). That routine registers a ``CPUIdle`` device for the CPU +running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and +optionally performs some CPU-specific initialization actions that may be +required for the given processor model. + + +.. _intel-idle-parameters: + +Kernel Command Line Options and Module Parameters +================================================= + +The *x86* architecture support code recognizes three kernel command line +options related to CPU idle time management: ``idle=poll``, ``idle=halt``, +and ``idle=nomwait``. If any of them is present in the kernel command line, the +``MWAIT`` instruction is not allowed to be used, so the initialization of +``intel_idle`` will fail. + +Apart from that there are two module parameters recognized by ``intel_idle`` +itself that can be set via the kernel command line (they cannot be updated via +sysfs, so that is the only way to change their values). + +The ``max_cstate`` parameter value is the maximum idle state index in the list +of idle states supplied to the ``CPUIdle`` core during the registration of the +driver. It is also the maximum number of regular (non-polling) idle states that +can be used by ``intel_idle``, so the enumeration of idle states is terminated +after finding that number of usable idle states (the other idle states that +potentially might have been used if ``max_cstate`` had been greater are not +taken into consideration at all). Setting ``max_cstate`` can prevent +``intel_idle`` from exposing idle states that are regarded as "too deep" for +some reason to the ``CPUIdle`` core, but it does so by making them effectively +invisible until the system is shut down and started again which may not always +be desirable. In practice, it is only really necessary to do that if the idle +states in question cannot be enabled during system startup, because in the +working state of the system the CPU power management quality of service (PM +QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states +even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`). +Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. + +The ``noacpi`` module parameter (which is recognized by ``intel_idle`` if the +kernel has been configured with ACPI support), can be set to make the driver +ignore the system's ACPI tables entirely (it is unset by default). + + +.. _intel-idle-core-and-package-idle-states: + +Core and Package Levels of Idle States +====================================== + +Typically, in a processor supporting the ``MWAIT`` instruction there are (at +least) two levels of idle states (or C-states). One level, referred to as +"core C-states", covers individual cores in the processor, whereas the other +level, referred to as "package C-states", covers the entire processor package +and it may also involve other components of the system (GPUs, memory +controllers, I/O hubs etc.). + +Some of the ``MWAIT`` hint values allow the processor to use core C-states only +(most importantly, that is the case for the ``MWAIT`` hint value corresponding +to the ``C1`` idle state), but the majority of them give it a license to put +the target core (i.e. the core containing the logical CPU executing ``MWAIT`` +with the given hint value) into a specific core C-state and then (if possible) +to enter a specific package C-state at the deeper level. For example, the +``MWAIT`` hint value representing the ``C3`` idle state allows the processor to +put the target core into the low-power state referred to as "core ``C3``" (or +``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core +have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value +representing a deeper idle state), and in addition to that (in the majority of +cases) it gives the processor a license to put the entire package (possibly +including some non-CPU components such as a GPU or a memory controller) into the +low-power state referred to as "package ``C3``" (or ``PC3``), which happens if +all of the cores have gone into the ``CC3`` state and (possibly) some additional +conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may +be required to be in a certain GPU-specific low-power state for ``PC3`` to be +reachable). + +As a rule, there is no simple way to make the processor use core C-states only +if the conditions for entering the corresponding package C-states are met, so +the logical CPU executing ``MWAIT`` with a hint value that is not core-level +only (like for ``C1``) must always assume that this may cause the processor to +enter a package C-state. [That is why the exit latency and target residency +values corresponding to the majority of ``MWAIT`` hint values in the "internal" +tables of idle states in ``intel_idle`` reflect the properties of package +C-states.] If using package C-states is not desirable at all, either +:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of +``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to +restrict the range of permissible idle states to the ones with core-level only +``MWAIT`` hint values (like ``C1``). + + +References +========== + +.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*, + https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32... + +.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*, + https://uefi.org/specifications diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index b6cef9b5e961..ec695d3c7782 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -6,5 +6,6 @@ Working-State Power Management :maxdepth: 2
cpuidle + intel_idle cpufreq intel_pstate
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.3-rc1 commit faaeff98666c24376cebd0b106504d05a36881d1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit faaeff98666c24376cebd0b106504d05a36881d1 upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
Add new model number for Icelake desktop and server to perf.
The data source encoding for Icelake server is the same as Skylake server.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: bp@alien8.de Cc: qiuxu.zhuo@intel.com Cc: rui.zhang@intel.com Cc: tony.luck@intel.com Link: https://lkml.kernel.org/r/20190603134122.13853-2-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index d6abf1bcdb95..09434c273ef1 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4175,6 +4175,7 @@ __init int intel_pmu_init(void) struct event_constraint *c; unsigned int unused; struct extra_reg *er; + bool pmem = false; int version, i; char *name;
@@ -4612,9 +4613,10 @@ __init int intel_pmu_init(void) name = "knights-landing"; break;
+ case INTEL_FAM6_SKYLAKE_X: + pmem = true; case INTEL_FAM6_SKYLAKE_MOBILE: case INTEL_FAM6_SKYLAKE_DESKTOP: - case INTEL_FAM6_SKYLAKE_X: case INTEL_FAM6_KABYLAKE_MOBILE: case INTEL_FAM6_KABYLAKE_DESKTOP: x86_pmu.late_ack = true; @@ -4644,8 +4646,7 @@ __init int intel_pmu_init(void) extra_attr = merge_attr(extra_attr, skl_format_attr); to_free = extra_attr; x86_pmu.cpu_events = get_hsw_events_attrs(); - intel_pmu_pebs_data_source_skl( - boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X); + intel_pmu_pebs_data_source_skl(pmem);
if (boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) { x86_pmu.flags |= PMU_FL_TFA; @@ -4659,7 +4660,11 @@ __init int intel_pmu_init(void) name = "skylake"; break;
+ case INTEL_FAM6_ICELAKE_X: + case INTEL_FAM6_ICELAKE_XEON_D: + pmem = true; case INTEL_FAM6_ICELAKE_MOBILE: + case INTEL_FAM6_ICELAKE_DESKTOP: x86_pmu.late_ack = true; memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids)); memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs)); @@ -4681,7 +4686,7 @@ __init int intel_pmu_init(void) extra_attr = merge_attr(extra_attr, skl_format_attr); x86_pmu.cpu_events = get_icl_events_attrs(); x86_pmu.lbr_pt_coexist = true; - intel_pmu_pebs_data_source_skl(false); + intel_pmu_pebs_data_source_skl(pmem); pr_cont("Icelake events, "); name = "icelake"; break;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 210cc5f9db7a5c66b7ca6290b7d35cc7db7e9dbd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 210cc5f9db7a5c66b7ca6290b7d35cc7db7e9dbd upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
The uncore subsystem on Snow Ridge is similar as previous SKX server. The uncore units on Snow Ridge include Ubox, Chabox, IIO, IRP, M2PCIE, PCU, M2M, PCIE3 and IMC.
- The config register encoding and pci device IDs are changed. - For CHA, the umask_ext and filter_tid fields are changed. - For IIO, the ch_mask and fc_mask fields are changed. - For M2M, the mask_ext field is changed. - Add new PCIe3 unit for PCIe3 root port which provides the interface between PCIe devices, plugged into the PCIe port, and the components (in M2IOSF). - IMC can only be accessed via MMIO on Snow Ridge now. Current common code doesn't support it yet. IMC will be supported in following patches. - There are 9 free running counters for IIO CLOCKS and bandwidth In. - Full uncore event list is not published yet. Event constrain is not included in this patch. It will be added later separately.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: acme@kernel.org Cc: eranian@google.com Link: https://lkml.kernel.org/r/1556672028-119221-3-git-send-email-kan.liang@linux... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 6 + arch/x86/events/intel/uncore.h | 2 + arch/x86/events/intel/uncore_snbep.c | 403 +++++++++++++++++++++++++++ 3 files changed, 411 insertions(+)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 62e074d223b9..550035b9e41c 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1411,6 +1411,11 @@ static const struct intel_uncore_init_fun icl_uncore_init __initconst = { .pci_init = skl_uncore_pci_init, };
+static const struct intel_uncore_init_fun snr_uncore_init __initconst = { + .cpu_init = snr_uncore_cpu_init, + .pci_init = snr_uncore_pci_init, +}; + static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM_EP, nhm_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_NEHALEM, nhm_uncore_init), @@ -1438,6 +1443,7 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_MOBILE, skl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_DESKTOP, skl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_MOBILE, icl_uncore_init), + X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ATOM_TREMONT_X, snr_uncore_init), {}, };
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index b6ceb61b9921..4bbe5f7c1d11 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -509,6 +509,8 @@ int knl_uncore_pci_init(void); void knl_uncore_cpu_init(void); int skx_uncore_pci_init(void); void skx_uncore_cpu_init(void); +int snr_uncore_pci_init(void); +void snr_uncore_cpu_init(void);
/* uncore_nhmex.c */ void nhmex_uncore_cpu_init(void); diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index 8e4e8e423839..851f76fc61dd 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -324,12 +324,64 @@ #define SKX_M2M_PCI_PMON_CTR0 0x200 #define SKX_M2M_PCI_PMON_BOX_CTL 0x258
+/* SNR Ubox */ +#define SNR_U_MSR_PMON_CTR0 0x1f98 +#define SNR_U_MSR_PMON_CTL0 0x1f91 +#define SNR_U_MSR_PMON_UCLK_FIXED_CTL 0x1f93 +#define SNR_U_MSR_PMON_UCLK_FIXED_CTR 0x1f94 + +/* SNR CHA */ +#define SNR_CHA_RAW_EVENT_MASK_EXT 0x3ffffff +#define SNR_CHA_MSR_PMON_CTL0 0x1c01 +#define SNR_CHA_MSR_PMON_CTR0 0x1c08 +#define SNR_CHA_MSR_PMON_BOX_CTL 0x1c00 +#define SNR_C0_MSR_PMON_BOX_FILTER0 0x1c05 + + +/* SNR IIO */ +#define SNR_IIO_MSR_PMON_CTL0 0x1e08 +#define SNR_IIO_MSR_PMON_CTR0 0x1e01 +#define SNR_IIO_MSR_PMON_BOX_CTL 0x1e00 +#define SNR_IIO_MSR_OFFSET 0x10 +#define SNR_IIO_PMON_RAW_EVENT_MASK_EXT 0x7ffff + +/* SNR IRP */ +#define SNR_IRP0_MSR_PMON_CTL0 0x1ea8 +#define SNR_IRP0_MSR_PMON_CTR0 0x1ea1 +#define SNR_IRP0_MSR_PMON_BOX_CTL 0x1ea0 +#define SNR_IRP_MSR_OFFSET 0x10 + +/* SNR M2PCIE */ +#define SNR_M2PCIE_MSR_PMON_CTL0 0x1e58 +#define SNR_M2PCIE_MSR_PMON_CTR0 0x1e51 +#define SNR_M2PCIE_MSR_PMON_BOX_CTL 0x1e50 +#define SNR_M2PCIE_MSR_OFFSET 0x10 + +/* SNR PCU */ +#define SNR_PCU_MSR_PMON_CTL0 0x1ef1 +#define SNR_PCU_MSR_PMON_CTR0 0x1ef8 +#define SNR_PCU_MSR_PMON_BOX_CTL 0x1ef0 +#define SNR_PCU_MSR_PMON_BOX_FILTER 0x1efc + +/* SNR M2M */ +#define SNR_M2M_PCI_PMON_CTL0 0x468 +#define SNR_M2M_PCI_PMON_CTR0 0x440 +#define SNR_M2M_PCI_PMON_BOX_CTL 0x438 +#define SNR_M2M_PCI_PMON_UMASK_EXT 0xff + +/* SNR PCIE3 */ +#define SNR_PCIE3_PCI_PMON_CTL0 0x508 +#define SNR_PCIE3_PCI_PMON_CTR0 0x4e8 +#define SNR_PCIE3_PCI_PMON_BOX_CTL 0x4e4 + DEFINE_UNCORE_FORMAT_ATTR(event, event, "config:0-7"); DEFINE_UNCORE_FORMAT_ATTR(event2, event, "config:0-6"); DEFINE_UNCORE_FORMAT_ATTR(event_ext, event, "config:0-7,21"); DEFINE_UNCORE_FORMAT_ATTR(use_occ_ctr, use_occ_ctr, "config:7"); DEFINE_UNCORE_FORMAT_ATTR(umask, umask, "config:8-15"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext, umask, "config:8-15,32-43,45-55"); +DEFINE_UNCORE_FORMAT_ATTR(umask_ext2, umask, "config:8-15,32-57"); +DEFINE_UNCORE_FORMAT_ATTR(umask_ext3, umask, "config:8-15,32-39"); DEFINE_UNCORE_FORMAT_ATTR(qor, qor, "config:16"); DEFINE_UNCORE_FORMAT_ATTR(edge, edge, "config:18"); DEFINE_UNCORE_FORMAT_ATTR(tid_en, tid_en, "config:19"); @@ -343,11 +395,14 @@ DEFINE_UNCORE_FORMAT_ATTR(occ_invert, occ_invert, "config:30"); DEFINE_UNCORE_FORMAT_ATTR(occ_edge, occ_edge, "config:14-51"); DEFINE_UNCORE_FORMAT_ATTR(occ_edge_det, occ_edge_det, "config:31"); DEFINE_UNCORE_FORMAT_ATTR(ch_mask, ch_mask, "config:36-43"); +DEFINE_UNCORE_FORMAT_ATTR(ch_mask2, ch_mask, "config:36-47"); DEFINE_UNCORE_FORMAT_ATTR(fc_mask, fc_mask, "config:44-46"); +DEFINE_UNCORE_FORMAT_ATTR(fc_mask2, fc_mask, "config:48-50"); DEFINE_UNCORE_FORMAT_ATTR(filter_tid, filter_tid, "config1:0-4"); DEFINE_UNCORE_FORMAT_ATTR(filter_tid2, filter_tid, "config1:0"); DEFINE_UNCORE_FORMAT_ATTR(filter_tid3, filter_tid, "config1:0-5"); DEFINE_UNCORE_FORMAT_ATTR(filter_tid4, filter_tid, "config1:0-8"); +DEFINE_UNCORE_FORMAT_ATTR(filter_tid5, filter_tid, "config1:0-9"); DEFINE_UNCORE_FORMAT_ATTR(filter_cid, filter_cid, "config1:5"); DEFINE_UNCORE_FORMAT_ATTR(filter_link, filter_link, "config1:5-8"); DEFINE_UNCORE_FORMAT_ATTR(filter_link2, filter_link, "config1:6-8"); @@ -3968,3 +4023,351 @@ int skx_uncore_pci_init(void) }
/* end of SKX uncore support */ + +/* SNR uncore support */ + +static struct intel_uncore_type snr_uncore_ubox = { + .name = "ubox", + .num_counters = 2, + .num_boxes = 1, + .perf_ctr_bits = 48, + .fixed_ctr_bits = 48, + .perf_ctr = SNR_U_MSR_PMON_CTR0, + .event_ctl = SNR_U_MSR_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .fixed_ctr = SNR_U_MSR_PMON_UCLK_FIXED_CTR, + .fixed_ctl = SNR_U_MSR_PMON_UCLK_FIXED_CTL, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +static struct attribute *snr_uncore_cha_formats_attr[] = { + &format_attr_event.attr, + &format_attr_umask_ext2.attr, + &format_attr_edge.attr, + &format_attr_tid_en.attr, + &format_attr_inv.attr, + &format_attr_thresh8.attr, + &format_attr_filter_tid5.attr, + NULL, +}; +static const struct attribute_group snr_uncore_chabox_format_group = { + .name = "format", + .attrs = snr_uncore_cha_formats_attr, +}; + +static int snr_cha_hw_config(struct intel_uncore_box *box, struct perf_event *event) +{ + struct hw_perf_event_extra *reg1 = &event->hw.extra_reg; + + reg1->reg = SNR_C0_MSR_PMON_BOX_FILTER0 + + box->pmu->type->msr_offset * box->pmu->pmu_idx; + reg1->config = event->attr.config1 & SKX_CHA_MSR_PMON_BOX_FILTER_TID; + reg1->idx = 0; + + return 0; +} + +static void snr_cha_enable_event(struct intel_uncore_box *box, + struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + struct hw_perf_event_extra *reg1 = &hwc->extra_reg; + + if (reg1->idx != EXTRA_REG_NONE) + wrmsrl(reg1->reg, reg1->config); + + wrmsrl(hwc->config_base, hwc->config | SNBEP_PMON_CTL_EN); +} + +static struct intel_uncore_ops snr_uncore_chabox_ops = { + .init_box = ivbep_uncore_msr_init_box, + .disable_box = snbep_uncore_msr_disable_box, + .enable_box = snbep_uncore_msr_enable_box, + .disable_event = snbep_uncore_msr_disable_event, + .enable_event = snr_cha_enable_event, + .read_counter = uncore_msr_read_counter, + .hw_config = snr_cha_hw_config, +}; + +static struct intel_uncore_type snr_uncore_chabox = { + .name = "cha", + .num_counters = 4, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = SNR_CHA_MSR_PMON_CTL0, + .perf_ctr = SNR_CHA_MSR_PMON_CTR0, + .box_ctl = SNR_CHA_MSR_PMON_BOX_CTL, + .msr_offset = HSWEP_CBO_MSR_OFFSET, + .event_mask = HSWEP_S_MSR_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_CHA_RAW_EVENT_MASK_EXT, + .ops = &snr_uncore_chabox_ops, + .format_group = &snr_uncore_chabox_format_group, +}; + +static struct attribute *snr_uncore_iio_formats_attr[] = { + &format_attr_event.attr, + &format_attr_umask.attr, + &format_attr_edge.attr, + &format_attr_inv.attr, + &format_attr_thresh9.attr, + &format_attr_ch_mask2.attr, + &format_attr_fc_mask2.attr, + NULL, +}; + +static const struct attribute_group snr_uncore_iio_format_group = { + .name = "format", + .attrs = snr_uncore_iio_formats_attr, +}; + +static struct intel_uncore_type snr_uncore_iio = { + .name = "iio", + .num_counters = 4, + .num_boxes = 5, + .perf_ctr_bits = 48, + .event_ctl = SNR_IIO_MSR_PMON_CTL0, + .perf_ctr = SNR_IIO_MSR_PMON_CTR0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_IIO_PMON_RAW_EVENT_MASK_EXT, + .box_ctl = SNR_IIO_MSR_PMON_BOX_CTL, + .msr_offset = SNR_IIO_MSR_OFFSET, + .ops = &ivbep_uncore_msr_ops, + .format_group = &snr_uncore_iio_format_group, +}; + +static struct intel_uncore_type snr_uncore_irp = { + .name = "irp", + .num_counters = 2, + .num_boxes = 5, + .perf_ctr_bits = 48, + .event_ctl = SNR_IRP0_MSR_PMON_CTL0, + .perf_ctr = SNR_IRP0_MSR_PMON_CTR0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_IRP0_MSR_PMON_BOX_CTL, + .msr_offset = SNR_IRP_MSR_OFFSET, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +static struct intel_uncore_type snr_uncore_m2pcie = { + .name = "m2pcie", + .num_counters = 4, + .num_boxes = 5, + .perf_ctr_bits = 48, + .event_ctl = SNR_M2PCIE_MSR_PMON_CTL0, + .perf_ctr = SNR_M2PCIE_MSR_PMON_CTR0, + .box_ctl = SNR_M2PCIE_MSR_PMON_BOX_CTL, + .msr_offset = SNR_M2PCIE_MSR_OFFSET, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +static int snr_pcu_hw_config(struct intel_uncore_box *box, struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + struct hw_perf_event_extra *reg1 = &hwc->extra_reg; + int ev_sel = hwc->config & SNBEP_PMON_CTL_EV_SEL_MASK; + + if (ev_sel >= 0xb && ev_sel <= 0xe) { + reg1->reg = SNR_PCU_MSR_PMON_BOX_FILTER; + reg1->idx = ev_sel - 0xb; + reg1->config = event->attr.config1 & (0xff << reg1->idx); + } + return 0; +} + +static struct intel_uncore_ops snr_uncore_pcu_ops = { + IVBEP_UNCORE_MSR_OPS_COMMON_INIT(), + .hw_config = snr_pcu_hw_config, + .get_constraint = snbep_pcu_get_constraint, + .put_constraint = snbep_pcu_put_constraint, +}; + +static struct intel_uncore_type snr_uncore_pcu = { + .name = "pcu", + .num_counters = 4, + .num_boxes = 1, + .perf_ctr_bits = 48, + .perf_ctr = SNR_PCU_MSR_PMON_CTR0, + .event_ctl = SNR_PCU_MSR_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_PCU_MSR_PMON_BOX_CTL, + .num_shared_regs = 1, + .ops = &snr_uncore_pcu_ops, + .format_group = &skx_uncore_pcu_format_group, +}; + +enum perf_uncore_snr_iio_freerunning_type_id { + SNR_IIO_MSR_IOCLK, + SNR_IIO_MSR_BW_IN, + + SNR_IIO_FREERUNNING_TYPE_MAX, +}; + +static struct freerunning_counters snr_iio_freerunning[] = { + [SNR_IIO_MSR_IOCLK] = { 0x1eac, 0x1, 0x10, 1, 48 }, + [SNR_IIO_MSR_BW_IN] = { 0x1f00, 0x1, 0x10, 8, 48 }, +}; + +static struct uncore_event_desc snr_uncore_iio_freerunning_events[] = { + /* Free-Running IIO CLOCKS Counter */ + INTEL_UNCORE_EVENT_DESC(ioclk, "event=0xff,umask=0x10"), + /* Free-Running IIO BANDWIDTH IN Counters */ + INTEL_UNCORE_EVENT_DESC(bw_in_port0, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(bw_in_port0.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port0.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1, "event=0xff,umask=0x21"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2, "event=0xff,umask=0x22"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3, "event=0xff,umask=0x23"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4, "event=0xff,umask=0x24"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5, "event=0xff,umask=0x25"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6, "event=0xff,umask=0x26"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7, "event=0xff,umask=0x27"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7.unit, "MiB"), + { /* end: all zeroes */ }, +}; + +static struct intel_uncore_type snr_uncore_iio_free_running = { + .name = "iio_free_running", + .num_counters = 9, + .num_boxes = 5, + .num_freerunning_types = SNR_IIO_FREERUNNING_TYPE_MAX, + .freerunning = snr_iio_freerunning, + .ops = &skx_uncore_iio_freerunning_ops, + .event_descs = snr_uncore_iio_freerunning_events, + .format_group = &skx_uncore_iio_freerunning_format_group, +}; + +static struct intel_uncore_type *snr_msr_uncores[] = { + &snr_uncore_ubox, + &snr_uncore_chabox, + &snr_uncore_iio, + &snr_uncore_irp, + &snr_uncore_m2pcie, + &snr_uncore_pcu, + &snr_uncore_iio_free_running, + NULL, +}; + +void snr_uncore_cpu_init(void) +{ + uncore_msr_uncores = snr_msr_uncores; +} + +static void snr_m2m_uncore_pci_init_box(struct intel_uncore_box *box) +{ + struct pci_dev *pdev = box->pci_dev; + int box_ctl = uncore_pci_box_ctl(box); + + __set_bit(UNCORE_BOX_FLAG_CTL_OFFS8, &box->flags); + pci_write_config_dword(pdev, box_ctl, IVBEP_PMON_BOX_CTL_INT); +} + +static struct intel_uncore_ops snr_m2m_uncore_pci_ops = { + .init_box = snr_m2m_uncore_pci_init_box, + .disable_box = snbep_uncore_pci_disable_box, + .enable_box = snbep_uncore_pci_enable_box, + .disable_event = snbep_uncore_pci_disable_event, + .enable_event = snbep_uncore_pci_enable_event, + .read_counter = snbep_uncore_pci_read_counter, +}; + +static struct attribute *snr_m2m_uncore_formats_attr[] = { + &format_attr_event.attr, + &format_attr_umask_ext3.attr, + &format_attr_edge.attr, + &format_attr_inv.attr, + &format_attr_thresh8.attr, + NULL, +}; + +static const struct attribute_group snr_m2m_uncore_format_group = { + .name = "format", + .attrs = snr_m2m_uncore_formats_attr, +}; + +static struct intel_uncore_type snr_uncore_m2m = { + .name = "m2m", + .num_counters = 4, + .num_boxes = 1, + .perf_ctr_bits = 48, + .perf_ctr = SNR_M2M_PCI_PMON_CTR0, + .event_ctl = SNR_M2M_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_M2M_PCI_PMON_UMASK_EXT, + .box_ctl = SNR_M2M_PCI_PMON_BOX_CTL, + .ops = &snr_m2m_uncore_pci_ops, + .format_group = &snr_m2m_uncore_format_group, +}; + +static struct intel_uncore_type snr_uncore_pcie3 = { + .name = "pcie3", + .num_counters = 4, + .num_boxes = 1, + .perf_ctr_bits = 48, + .perf_ctr = SNR_PCIE3_PCI_PMON_CTR0, + .event_ctl = SNR_PCIE3_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_PCIE3_PCI_PMON_BOX_CTL, + .ops = &ivbep_uncore_pci_ops, + .format_group = &ivbep_uncore_format_group, +}; + +enum { + SNR_PCI_UNCORE_M2M, + SNR_PCI_UNCORE_PCIE3, +}; + +static struct intel_uncore_type *snr_pci_uncores[] = { + [SNR_PCI_UNCORE_M2M] = &snr_uncore_m2m, + [SNR_PCI_UNCORE_PCIE3] = &snr_uncore_pcie3, + NULL, +}; + +static const struct pci_device_id snr_uncore_pci_ids[] = { + { /* M2M */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(12, 0, SNR_PCI_UNCORE_M2M, 0), + }, + { /* PCIe3 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x334a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(4, 0, SNR_PCI_UNCORE_PCIE3, 0), + }, + { /* end: all zeroes */ } +}; + +static struct pci_driver snr_uncore_pci_driver = { + .name = "snr_uncore", + .id_table = snr_uncore_pci_ids, +}; + +int snr_uncore_pci_init(void) +{ + /* SNR UBOX DID */ + int ret = snbep_pci2phy_map_init(0x3460, SKX_CPUNODEID, + SKX_GIDNIDMAP, true); + + if (ret) + return ret; + + uncore_pci_uncores = snr_pci_uncores; + uncore_pci_driver = &snr_uncore_pci_driver; + return 0; +} + +/* end of SNR uncore support */
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit c8872d90e0a3651a096860d3241625ccfa1647e0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit c8872d90e0a3651a096860d3241625ccfa1647e0 upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
For uncore box which can only be accessed by MSR, its reference box->refcnt is updated in CPU hot plug. The uncore boxes need to be initalized and exited accordingly for the first/last CPU of a socket.
Starts from Snow Ridge server, a new type of uncore box is introduced, which can only be accessed by MMIO. The driver needs to map/unmap MMIO space for the first/last CPU of a socket.
Extract the codes of box ref/unref and init/exit for reuse later.
There is no functional change.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: acme@kernel.org Cc: eranian@google.com Link: https://lkml.kernel.org/r/1556672028-119221-4-git-send-email-kan.liang@linux... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 56 ++++++++++++++++++++++------------ 1 file changed, 37 insertions(+), 19 deletions(-)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 550035b9e41c..ea10b6935c6b 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1179,12 +1179,27 @@ static void uncore_change_context(struct intel_uncore_type **uncores, uncore_change_type_ctx(*uncores, old_cpu, new_cpu); }
-static int uncore_event_cpu_offline(unsigned int cpu) +static void uncore_box_unref(struct intel_uncore_type **types, int id) { - struct intel_uncore_type *type, **types = uncore_msr_uncores; + struct intel_uncore_type *type; struct intel_uncore_pmu *pmu; struct intel_uncore_box *box; - int i, pkg, target; + int i; + + for (; *types; types++) { + type = *types; + pmu = type->pmus; + for (i = 0; i < type->num_boxes; i++, pmu++) { + box = pmu->boxes[id]; + if (box && atomic_dec_return(&box->refcnt) == 0) + uncore_box_exit(box); + } + } +} + +static int uncore_event_cpu_offline(unsigned int cpu) +{ + int die, target;
/* Check if exiting cpu is used for collecting uncore events */ if (!cpumask_test_and_clear_cpu(cpu, &uncore_cpu_mask)) @@ -1203,16 +1218,8 @@ static int uncore_event_cpu_offline(unsigned int cpu)
unref: /* Clear the references */ - pkg = topology_logical_package_id(cpu); - for (; *types; types++) { - type = *types; - pmu = type->pmus; - for (i = 0; i < type->num_boxes; i++, pmu++) { - box = pmu->boxes[pkg]; - if (box && atomic_dec_return(&box->refcnt) == 0) - uncore_box_exit(box); - } - } + die = topology_logical_package_id(cpu); + uncore_box_unref(uncore_msr_uncores, die); return 0; }
@@ -1255,15 +1262,15 @@ static int allocate_boxes(struct intel_uncore_type **types, return -ENOMEM; }
-static int uncore_event_cpu_online(unsigned int cpu) +static int uncore_box_ref(struct intel_uncore_type **types, + int id, unsigned int cpu) { - struct intel_uncore_type *type, **types = uncore_msr_uncores; + struct intel_uncore_type *type; struct intel_uncore_pmu *pmu; struct intel_uncore_box *box; - int i, ret, pkg, target; + int i, ret;
- pkg = topology_logical_package_id(cpu); - ret = allocate_boxes(types, pkg, cpu); + ret = allocate_boxes(types, id, cpu); if (ret) return ret;
@@ -1271,11 +1278,22 @@ static int uncore_event_cpu_online(unsigned int cpu) type = *types; pmu = type->pmus; for (i = 0; i < type->num_boxes; i++, pmu++) { - box = pmu->boxes[pkg]; + box = pmu->boxes[id]; if (box && atomic_inc_return(&box->refcnt) == 1) uncore_box_init(box); } } + return 0; +} + +static int uncore_event_cpu_online(unsigned int cpu) +{ + int ret, die, target; + + die = topology_logical_die_id(cpu); + ret = uncore_box_ref(uncore_msr_uncores, die, cpu); + if (ret) + return ret;
/* * Check if there is an online cpu in the package
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 3da04b8a00dd6d39970b9e764b78c5dfb40ec013 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3da04b8a00dd6d39970b9e764b78c5dfb40ec013 upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
A new MMIO type uncore box is introduced on Snow Ridge server. The counters of MMIO type uncore box can only be accessed by MMIO.
Add a new uncore type, uncore_mmio_uncores, for MMIO type uncore blocks.
Support MMIO type uncore blocks in CPU hot plug. The MMIO space has to be map/unmap for the first/last CPU. The context also need to be migrated if the bind CPU changes.
Add mmio_init() to init and register PMUs for MMIO type uncore blocks.
Add a helper to calculate the box_ctl address.
The helpers which calculate ctl/ctr can be shared with PCI type uncore blocks.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: acme@kernel.org Cc: eranian@google.com Link: https://lkml.kernel.org/r/1556672028-119221-5-git-send-email-kan.liang@linux... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 53 +++++++++++++++++++++++++++++----- arch/x86/events/intel/uncore.h | 21 ++++++++++---- 2 files changed, 61 insertions(+), 13 deletions(-)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index ea10b6935c6b..066331c7be02 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -7,6 +7,7 @@ static struct intel_uncore_type *empty_uncore[] = { NULL, }; struct intel_uncore_type **uncore_msr_uncores = empty_uncore; struct intel_uncore_type **uncore_pci_uncores = empty_uncore; +struct intel_uncore_type **uncore_mmio_uncores = empty_uncore;
static bool pcidrv_registered; struct pci_driver *uncore_pci_driver; @@ -1214,12 +1215,14 @@ static int uncore_event_cpu_offline(unsigned int cpu) target = -1;
uncore_change_context(uncore_msr_uncores, cpu, target); + uncore_change_context(uncore_mmio_uncores, cpu, target); uncore_change_context(uncore_pci_uncores, cpu, target);
unref: /* Clear the references */ die = topology_logical_package_id(cpu); uncore_box_unref(uncore_msr_uncores, die); + uncore_box_unref(uncore_mmio_uncores, die); return 0; }
@@ -1288,12 +1291,13 @@ static int uncore_box_ref(struct intel_uncore_type **types,
static int uncore_event_cpu_online(unsigned int cpu) { - int ret, die, target; + int die, target, msr_ret, mmio_ret;
- die = topology_logical_die_id(cpu); - ret = uncore_box_ref(uncore_msr_uncores, die, cpu); - if (ret) - return ret; + die = topology_logical_die_id(cpu); + msr_ret = uncore_box_ref(uncore_msr_uncores, die, cpu); + mmio_ret = uncore_box_ref(uncore_mmio_uncores, die, cpu); + if (msr_ret && mmio_ret) + return -ENOMEM;
/* * Check if there is an online cpu in the package @@ -1305,7 +1309,10 @@ static int uncore_event_cpu_online(unsigned int cpu)
cpumask_set_cpu(cpu, &uncore_cpu_mask);
- uncore_change_context(uncore_msr_uncores, -1, cpu); + if (!msr_ret) + uncore_change_context(uncore_msr_uncores, -1, cpu); + if (!mmio_ret) + uncore_change_context(uncore_mmio_uncores, -1, cpu); uncore_change_context(uncore_pci_uncores, -1, cpu); return 0; } @@ -1353,12 +1360,35 @@ static int __init uncore_cpu_init(void) return ret; }
+static int __init uncore_mmio_init(void) +{ + struct intel_uncore_type **types = uncore_mmio_uncores; + int ret; + + ret = uncore_types_init(types, true); + if (ret) + goto err; + + for (; *types; types++) { + ret = type_pmu_register(*types); + if (ret) + goto err; + } + return 0; +err: + uncore_types_exit(uncore_mmio_uncores); + uncore_mmio_uncores = empty_uncore; + return ret; +} + + #define X86_UNCORE_MODEL_MATCH(model, init) \ { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&init }
struct intel_uncore_init_fun { void (*cpu_init)(void); int (*pci_init)(void); + void (*mmio_init)(void); };
static const struct intel_uncore_init_fun nhm_uncore_init __initconst = { @@ -1471,7 +1501,7 @@ static int __init intel_uncore_init(void) { const struct x86_cpu_id *id; struct intel_uncore_init_fun *uncore_init; - int pret = 0, cret = 0, ret; + int pret = 0, cret = 0, mret = 0, ret;
id = x86_match_cpu(intel_uncore_match); if (!id) @@ -1494,7 +1524,12 @@ static int __init intel_uncore_init(void) cret = uncore_cpu_init(); }
- if (cret && pret) + if (uncore_init->mmio_init) { + uncore_init->mmio_init(); + mret = uncore_mmio_init(); + } + + if (cret && pret && mret) return -ENODEV;
/* Install hotplug callbacks to setup the targets for each package */ @@ -1508,6 +1543,7 @@ static int __init intel_uncore_init(void)
err: uncore_types_exit(uncore_msr_uncores); + uncore_types_exit(uncore_mmio_uncores); uncore_pci_exit(); return ret; } @@ -1517,6 +1553,7 @@ static void __exit intel_uncore_exit(void) { cpuhp_remove_state(CPUHP_AP_PERF_X86_UNCORE_ONLINE); uncore_types_exit(uncore_msr_uncores); + uncore_types_exit(uncore_mmio_uncores); uncore_pci_exit(); } module_exit(intel_uncore_exit); diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index 4bbe5f7c1d11..1189e46481bc 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -56,7 +56,10 @@ struct intel_uncore_type { unsigned fixed_ctr; unsigned fixed_ctl; unsigned box_ctl; - unsigned msr_offset; + union { + unsigned msr_offset; + unsigned mmio_offset; + }; unsigned num_shared_regs:8; unsigned single_fixed:1; unsigned pair_ctr_ctl:1; @@ -183,6 +186,13 @@ static inline bool uncore_pmc_freerunning(int idx) return idx == UNCORE_PMC_IDX_FREERUNNING; }
+static inline +unsigned int uncore_mmio_box_ctl(struct intel_uncore_box *box) +{ + return box->pmu->type->box_ctl + + box->pmu->type->mmio_offset * box->pmu->pmu_idx; +} + static inline unsigned uncore_pci_box_ctl(struct intel_uncore_box *box) { return box->pmu->type->box_ctl; @@ -313,7 +323,7 @@ unsigned uncore_msr_perf_ctr(struct intel_uncore_box *box, int idx) static inline unsigned uncore_fixed_ctl(struct intel_uncore_box *box) { - if (box->pci_dev) + if (box->pci_dev || box->io_addr) return uncore_pci_fixed_ctl(box); else return uncore_msr_fixed_ctl(box); @@ -322,7 +332,7 @@ unsigned uncore_fixed_ctl(struct intel_uncore_box *box) static inline unsigned uncore_fixed_ctr(struct intel_uncore_box *box) { - if (box->pci_dev) + if (box->pci_dev || box->io_addr) return uncore_pci_fixed_ctr(box); else return uncore_msr_fixed_ctr(box); @@ -331,7 +341,7 @@ unsigned uncore_fixed_ctr(struct intel_uncore_box *box) static inline unsigned uncore_event_ctl(struct intel_uncore_box *box, int idx) { - if (box->pci_dev) + if (box->pci_dev || box->io_addr) return uncore_pci_event_ctl(box, idx); else return uncore_msr_event_ctl(box, idx); @@ -340,7 +350,7 @@ unsigned uncore_event_ctl(struct intel_uncore_box *box, int idx) static inline unsigned uncore_perf_ctr(struct intel_uncore_box *box, int idx) { - if (box->pci_dev) + if (box->pci_dev || box->io_addr) return uncore_pci_perf_ctr(box, idx); else return uncore_msr_perf_ctr(box, idx); @@ -478,6 +488,7 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
extern struct intel_uncore_type **uncore_msr_uncores; extern struct intel_uncore_type **uncore_pci_uncores; +extern struct intel_uncore_type **uncore_mmio_uncores; extern struct pci_driver *uncore_pci_driver; extern raw_spinlock_t pci2phy_map_lock; extern struct list_head pci2phy_map_head;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit 07ce734dd8adc0f170d43c15a9b91b707a21b9d7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 07ce734dd8adc0f170d43c15a9b91b707a21b9d7 upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
The client IMC block is accessed by MMIO. Current code uses an informal way to access the block, which is not recommended.
Clean up the code by using __iomem annotation and the accessor functions (read[lq]()).
Move exit_box() and read_counter() to generic code, which can be shared with the server code later.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: acme@kernel.org Cc: eranian@google.com Link: https://lkml.kernel.org/r/1556672028-119221-6-git-send-email-kan.liang@linux... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 15 +++++++++++++++ arch/x86/events/intel/uncore.h | 6 +++++- arch/x86/events/intel/uncore_snb.c | 16 ++-------------- 3 files changed, 22 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 066331c7be02..c451f2deffef 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -119,6 +119,21 @@ u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *eve return count; }
+void uncore_mmio_exit_box(struct intel_uncore_box *box) +{ + if (box->io_addr) + iounmap(box->io_addr); +} + +u64 uncore_mmio_read_counter(struct intel_uncore_box *box, + struct perf_event *event) +{ + if (!box->io_addr) + return 0; + + return readq(box->io_addr + event->hw.event_base); +} + /* * generic get constraint function for shared match/mask registers. */ diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index 1189e46481bc..a3dbaf6fc273 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -2,6 +2,7 @@ #include <linux/slab.h> #include <linux/pci.h> #include <asm/apicdef.h> +#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/perf_event.h> #include "../perf_event.h" @@ -128,7 +129,7 @@ struct intel_uncore_box { struct hrtimer hrtimer; struct list_head list; struct list_head active_list; - void *io_addr; + void __iomem *io_addr; struct intel_uncore_extra_reg shared_regs[0]; };
@@ -473,6 +474,9 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu); u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event); +void uncore_mmio_exit_box(struct intel_uncore_box *box); +u64 uncore_mmio_read_counter(struct intel_uncore_box *box, + struct perf_event *event); void uncore_pmu_start_hrtimer(struct intel_uncore_box *box); void uncore_pmu_cancel_hrtimer(struct intel_uncore_box *box); void uncore_pmu_event_start(struct perf_event *event, int flags); diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c index 2c3f428f5709..c3314c2395ac 100644 --- a/arch/x86/events/intel/uncore_snb.c +++ b/arch/x86/events/intel/uncore_snb.c @@ -416,11 +416,6 @@ static void snb_uncore_imc_init_box(struct intel_uncore_box *box) box->hrtimer_duration = UNCORE_SNB_IMC_HRTIMER_INTERVAL; }
-static void snb_uncore_imc_exit_box(struct intel_uncore_box *box) -{ - iounmap(box->io_addr); -} - static void snb_uncore_imc_enable_box(struct intel_uncore_box *box) {}
@@ -433,13 +428,6 @@ static void snb_uncore_imc_enable_event(struct intel_uncore_box *box, struct per static void snb_uncore_imc_disable_event(struct intel_uncore_box *box, struct perf_event *event) {}
-static u64 snb_uncore_imc_read_counter(struct intel_uncore_box *box, struct perf_event *event) -{ - struct hw_perf_event *hwc = &event->hw; - - return (u64)*(unsigned int *)(box->io_addr + hwc->event_base); -} - /* * Keep the custom event_init() function compatible with old event * encoding for free running counters. @@ -571,13 +559,13 @@ static struct pmu snb_uncore_imc_pmu = {
static struct intel_uncore_ops snb_uncore_imc_ops = { .init_box = snb_uncore_imc_init_box, - .exit_box = snb_uncore_imc_exit_box, + .exit_box = uncore_mmio_exit_box, .enable_box = snb_uncore_imc_enable_box, .disable_box = snb_uncore_imc_disable_box, .disable_event = snb_uncore_imc_disable_event, .enable_event = snb_uncore_imc_enable_event, .hw_config = snb_uncore_imc_hw_config, - .read_counter = snb_uncore_imc_read_counter, + .read_counter = uncore_mmio_read_counter, };
static struct intel_uncore_type snb_uncore_imc = {
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit ee49532b38dd084650bf715eabe7e3828fb8d275 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ee49532b38dd084650bf715eabe7e3828fb8d275 upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
IMC uncore unit can only be accessed via MMIO on Snow Ridge. The MMIO space of IMC uncore is at the specified offsets from the MEM0_BAR. Add snr_uncore_get_mc_dev() to locate the PCI device with MMIO_BASE and MEM0_BAR register.
Add new ops to access the IMC registers via MMIO.
Add 3 new free running counters for clocks, read and write bandwidth.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: acme@kernel.org Cc: eranian@google.com Link: https://lkml.kernel.org/r/1556672028-119221-7-git-send-email-kan.liang@linux... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 3 +- arch/x86/events/intel/uncore.h | 2 + arch/x86/events/intel/uncore_snbep.c | 197 +++++++++++++++++++++++++++ 3 files changed, 201 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index c451f2deffef..6dd1548f20e7 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -28,7 +28,7 @@ struct event_constraint uncore_constraint_empty =
MODULE_LICENSE("GPL");
-static int uncore_pcibus_to_physid(struct pci_bus *bus) +int uncore_pcibus_to_physid(struct pci_bus *bus) { struct pci2phy_map *map; int phys_id = -1; @@ -1477,6 +1477,7 @@ static const struct intel_uncore_init_fun icl_uncore_init __initconst = { static const struct intel_uncore_init_fun snr_uncore_init __initconst = { .cpu_init = snr_uncore_cpu_init, .pci_init = snr_uncore_pci_init, + .mmio_init = snr_uncore_mmio_init, };
static const struct x86_cpu_id intel_uncore_match[] __initconst = { diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index a3dbaf6fc273..07c5fcc7cc3a 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -156,6 +156,7 @@ struct pci2phy_map { };
struct pci2phy_map *__find_pci2phy_map(int segment); +int uncore_pcibus_to_physid(struct pci_bus *bus);
ssize_t uncore_event_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf); @@ -526,6 +527,7 @@ int skx_uncore_pci_init(void); void skx_uncore_cpu_init(void); int snr_uncore_pci_init(void); void snr_uncore_cpu_init(void); +void snr_uncore_mmio_init(void);
/* uncore_nhmex.c */ void nhmex_uncore_cpu_init(void); diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index 851f76fc61dd..7420e8b6534b 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -374,6 +374,19 @@ #define SNR_PCIE3_PCI_PMON_CTR0 0x4e8 #define SNR_PCIE3_PCI_PMON_BOX_CTL 0x4e4
+/* SNR IMC */ +#define SNR_IMC_MMIO_PMON_FIXED_CTL 0x54 +#define SNR_IMC_MMIO_PMON_FIXED_CTR 0x38 +#define SNR_IMC_MMIO_PMON_CTL0 0x40 +#define SNR_IMC_MMIO_PMON_CTR0 0x8 +#define SNR_IMC_MMIO_PMON_BOX_CTL 0x22800 +#define SNR_IMC_MMIO_OFFSET 0x4000 +#define SNR_IMC_MMIO_SIZE 0x4000 +#define SNR_IMC_MMIO_BASE_OFFSET 0xd0 +#define SNR_IMC_MMIO_BASE_MASK 0x1FFFFFFF +#define SNR_IMC_MMIO_MEM0_OFFSET 0xd8 +#define SNR_IMC_MMIO_MEM0_MASK 0x7FF + DEFINE_UNCORE_FORMAT_ATTR(event, event, "config:0-7"); DEFINE_UNCORE_FORMAT_ATTR(event2, event, "config:0-6"); DEFINE_UNCORE_FORMAT_ATTR(event_ext, event, "config:0-7,21"); @@ -4370,4 +4383,188 @@ int snr_uncore_pci_init(void) return 0; }
+static struct pci_dev *snr_uncore_get_mc_dev(int id) +{ + struct pci_dev *mc_dev = NULL; + int phys_id, pkg; + + while (1) { + mc_dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x3451, mc_dev); + if (!mc_dev) + break; + phys_id = uncore_pcibus_to_physid(mc_dev->bus); + if (phys_id < 0) + continue; + pkg = topology_phys_to_logical_pkg(phys_id); + if (pkg < 0) + continue; + else if (pkg == id) + break; + } + return mc_dev; +} + +static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) +{ + struct pci_dev *pdev = snr_uncore_get_mc_dev(box->pkgid); + unsigned int box_ctl = uncore_mmio_box_ctl(box); + resource_size_t addr; + u32 pci_dword; + + if (!pdev) + return; + + pci_read_config_dword(pdev, SNR_IMC_MMIO_BASE_OFFSET, &pci_dword); + addr = (pci_dword & SNR_IMC_MMIO_BASE_MASK) << 23; + + pci_read_config_dword(pdev, SNR_IMC_MMIO_MEM0_OFFSET, &pci_dword); + addr |= (pci_dword & SNR_IMC_MMIO_MEM0_MASK) << 12; + + addr += box_ctl; + + box->io_addr = ioremap(addr, SNR_IMC_MMIO_SIZE); + if (!box->io_addr) + return; + + writel(IVBEP_PMON_BOX_CTL_INT, box->io_addr); +} + +static void snr_uncore_mmio_disable_box(struct intel_uncore_box *box) +{ + u32 config; + + if (!box->io_addr) + return; + + config = readl(box->io_addr); + config |= SNBEP_PMON_BOX_CTL_FRZ; + writel(config, box->io_addr); +} + +static void snr_uncore_mmio_enable_box(struct intel_uncore_box *box) +{ + u32 config; + + if (!box->io_addr) + return; + + config = readl(box->io_addr); + config &= ~SNBEP_PMON_BOX_CTL_FRZ; + writel(config, box->io_addr); +} + +static void snr_uncore_mmio_enable_event(struct intel_uncore_box *box, + struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + if (!box->io_addr) + return; + + writel(hwc->config | SNBEP_PMON_CTL_EN, + box->io_addr + hwc->config_base); +} + +static void snr_uncore_mmio_disable_event(struct intel_uncore_box *box, + struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + if (!box->io_addr) + return; + + writel(hwc->config, box->io_addr + hwc->config_base); +} + +static struct intel_uncore_ops snr_uncore_mmio_ops = { + .init_box = snr_uncore_mmio_init_box, + .exit_box = uncore_mmio_exit_box, + .disable_box = snr_uncore_mmio_disable_box, + .enable_box = snr_uncore_mmio_enable_box, + .disable_event = snr_uncore_mmio_disable_event, + .enable_event = snr_uncore_mmio_enable_event, + .read_counter = uncore_mmio_read_counter, +}; + +static struct uncore_event_desc snr_uncore_imc_events[] = { + INTEL_UNCORE_EVENT_DESC(clockticks, "event=0x00,umask=0x00"), + INTEL_UNCORE_EVENT_DESC(cas_count_read, "event=0x04,umask=0x0f"), + INTEL_UNCORE_EVENT_DESC(cas_count_read.scale, "6.103515625e-5"), + INTEL_UNCORE_EVENT_DESC(cas_count_read.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(cas_count_write, "event=0x04,umask=0x30"), + INTEL_UNCORE_EVENT_DESC(cas_count_write.scale, "6.103515625e-5"), + INTEL_UNCORE_EVENT_DESC(cas_count_write.unit, "MiB"), + { /* end: all zeroes */ }, +}; + +static struct intel_uncore_type snr_uncore_imc = { + .name = "imc", + .num_counters = 4, + .num_boxes = 2, + .perf_ctr_bits = 48, + .fixed_ctr_bits = 48, + .fixed_ctr = SNR_IMC_MMIO_PMON_FIXED_CTR, + .fixed_ctl = SNR_IMC_MMIO_PMON_FIXED_CTL, + .event_descs = snr_uncore_imc_events, + .perf_ctr = SNR_IMC_MMIO_PMON_CTR0, + .event_ctl = SNR_IMC_MMIO_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_IMC_MMIO_PMON_BOX_CTL, + .mmio_offset = SNR_IMC_MMIO_OFFSET, + .ops = &snr_uncore_mmio_ops, + .format_group = &skx_uncore_format_group, +}; + +enum perf_uncore_snr_imc_freerunning_type_id { + SNR_IMC_DCLK, + SNR_IMC_DDR, + + SNR_IMC_FREERUNNING_TYPE_MAX, +}; + +static struct freerunning_counters snr_imc_freerunning[] = { + [SNR_IMC_DCLK] = { 0x22b0, 0x0, 0, 1, 48 }, + [SNR_IMC_DDR] = { 0x2290, 0x8, 0, 2, 48 }, +}; + +static struct uncore_event_desc snr_uncore_imc_freerunning_events[] = { + INTEL_UNCORE_EVENT_DESC(dclk, "event=0xff,umask=0x10"), + + INTEL_UNCORE_EVENT_DESC(read, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(read.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(write, "event=0xff,umask=0x21"), + INTEL_UNCORE_EVENT_DESC(write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(write.unit, "MiB"), +}; + +static struct intel_uncore_ops snr_uncore_imc_freerunning_ops = { + .init_box = snr_uncore_mmio_init_box, + .exit_box = uncore_mmio_exit_box, + .read_counter = uncore_mmio_read_counter, + .hw_config = uncore_freerunning_hw_config, +}; + +static struct intel_uncore_type snr_uncore_imc_free_running = { + .name = "imc_free_running", + .num_counters = 3, + .num_boxes = 1, + .num_freerunning_types = SNR_IMC_FREERUNNING_TYPE_MAX, + .freerunning = snr_imc_freerunning, + .ops = &snr_uncore_imc_freerunning_ops, + .event_descs = snr_uncore_imc_freerunning_events, + .format_group = &skx_uncore_iio_freerunning_format_group, +}; + +static struct intel_uncore_type *snr_mmio_uncores[] = { + &snr_uncore_imc, + &snr_uncore_imc_free_running, + NULL, +}; + +void snr_uncore_mmio_init(void) +{ + uncore_mmio_uncores = snr_mmio_uncores; +} + /* end of SNR uncore support */
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.7-rc1 commit 3442a9ecb8e72a33c28a2b969b766c659830e410 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3442a9ecb8e72a33c28a2b969b766c659830e410 upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
The IMC uncore unit in Ice Lake server can only be accessed by MMIO, which is similar as Snow Ridge. Factor out __snr_uncore_mmio_init_box which can be shared with Ice Lake server in the following patch.
No functional changes.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/1584470314-46657-2-git-send-email-kan.liang@linux.... Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore_snbep.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index 7420e8b6534b..b453da617506 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -4404,10 +4404,10 @@ static struct pci_dev *snr_uncore_get_mc_dev(int id) return mc_dev; }
-static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) +static void __snr_uncore_mmio_init_box(struct intel_uncore_box *box, + unsigned int box_ctl, int mem_offset) { struct pci_dev *pdev = snr_uncore_get_mc_dev(box->pkgid); - unsigned int box_ctl = uncore_mmio_box_ctl(box); resource_size_t addr; u32 pci_dword;
@@ -4417,7 +4417,7 @@ static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) pci_read_config_dword(pdev, SNR_IMC_MMIO_BASE_OFFSET, &pci_dword); addr = (pci_dword & SNR_IMC_MMIO_BASE_MASK) << 23;
- pci_read_config_dword(pdev, SNR_IMC_MMIO_MEM0_OFFSET, &pci_dword); + pci_read_config_dword(pdev, mem_offset, &pci_dword); addr |= (pci_dword & SNR_IMC_MMIO_MEM0_MASK) << 12;
addr += box_ctl; @@ -4429,6 +4429,12 @@ static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) writel(IVBEP_PMON_BOX_CTL_INT, box->io_addr); }
+static void snr_uncore_mmio_init_box(struct intel_uncore_box *box) +{ + __snr_uncore_mmio_init_box(box, uncore_mmio_box_ctl(box), + SNR_IMC_MMIO_MEM0_OFFSET); +} + static void snr_uncore_mmio_disable_box(struct intel_uncore_box *box) { u32 config;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.7-rc1 commit bc88a2fe216a51e8ab46d61f89d0c1b5a400470e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit bc88a2fe216a51e8ab46d61f89d0c1b5a400470e upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
The offset between uncore boxes of free-running counters varies, e.g. IIO free-running counters on Ice Lake server.
Add box_offsets, an array of offsets between adjacent uncore boxes.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/1584470314-46657-1-git-send-email-kan.liang@linux.... Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index 07c5fcc7cc3a..26941507370e 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -147,6 +147,7 @@ struct freerunning_counters { unsigned int box_offset; unsigned int num_counters; unsigned int bits; + unsigned *box_offsets; };
struct pci2phy_map { @@ -303,7 +304,9 @@ unsigned int uncore_freerunning_counter(struct intel_uncore_box *box,
return pmu->type->freerunning[type].counter_base + pmu->type->freerunning[type].counter_offset * idx + - pmu->type->freerunning[type].box_offset * pmu->pmu_idx; + (pmu->type->freerunning[type].box_offsets ? + pmu->type->freerunning[type].box_offsets[pmu->pmu_idx] : + pmu->type->freerunning[type].box_offset * pmu->pmu_idx); }
static inline
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.7-rc1 commit 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
The uncore subsystem in Ice Lake server is similar to previous server. There are some differences in config register encoding and pci device IDs. The uncore PMON units in Ice Lake server include Ubox, Chabox, IIO, IRP, M2PCIE, PCU, M2M, PCIE3 and IMC.
- For CHA, filter 1 register has been removed. The filter 0 register can be used by and of CHA events to be filterd by Thread/Core-ID. To do so, the control register's tid_en bit must be set to 1. - For IIO, there are some changes on event constraints. The MSR address and MSR offsets among counters are also changed. - For IRP, the MSR address and MSR offsets among counters are changed. - For M2PCIE, the counters are accessed by MSR now. Add new MSR address and MSR offsets. Change event constraints. - To determine the number of CHAs, have to read CAPID6(Low) and CAPID7 (High) now. - For M2M, update the PCICFG address and Device ID. - For UPI, update the PCICFG address, Device ID and counter address. - For M3UPI, update the PCICFG address, Device ID, counter address and event constraints. - For IMC, update the formular to calculate MMIO BAR address, which is MMIO_BASE + specific MEM_BAR offset.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Ingo Molnar mingo@kernel.org Link: https://lkml.kernel.org/r/1585842411-150452-1-git-send-email-kan.liang@linux... Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore.c | 8 + arch/x86/events/intel/uncore.h | 3 + arch/x86/events/intel/uncore_snbep.c | 511 +++++++++++++++++++++++++++ 3 files changed, 522 insertions(+)
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 6dd1548f20e7..b5523a5e4d90 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -1474,6 +1474,12 @@ static const struct intel_uncore_init_fun icl_uncore_init __initconst = { .pci_init = skl_uncore_pci_init, };
+static const struct intel_uncore_init_fun icx_uncore_init __initconst = { + .cpu_init = icx_uncore_cpu_init, + .pci_init = icx_uncore_pci_init, + .mmio_init = icx_uncore_mmio_init, +}; + static const struct intel_uncore_init_fun snr_uncore_init __initconst = { .cpu_init = snr_uncore_cpu_init, .pci_init = snr_uncore_pci_init, @@ -1507,6 +1513,8 @@ static const struct x86_cpu_id intel_uncore_match[] __initconst = { X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_MOBILE, skl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_KABYLAKE_DESKTOP, skl_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_MOBILE, icl_uncore_init), + X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_XEON_D, icx_uncore_init), + X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ICELAKE_X, icx_uncore_init), X86_UNCORE_MODEL_MATCH(INTEL_FAM6_ATOM_TREMONT_X, snr_uncore_init), {}, }; diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h index 26941507370e..f57651c4176e 100644 --- a/arch/x86/events/intel/uncore.h +++ b/arch/x86/events/intel/uncore.h @@ -531,6 +531,9 @@ void skx_uncore_cpu_init(void); int snr_uncore_pci_init(void); void snr_uncore_cpu_init(void); void snr_uncore_mmio_init(void); +int icx_uncore_pci_init(void); +void icx_uncore_cpu_init(void); +void icx_uncore_mmio_init(void);
/* uncore_nhmex.c */ void nhmex_uncore_cpu_init(void); diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index b453da617506..06a3e8ae370c 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -387,6 +387,42 @@ #define SNR_IMC_MMIO_MEM0_OFFSET 0xd8 #define SNR_IMC_MMIO_MEM0_MASK 0x7FF
+/* ICX CHA */ +#define ICX_C34_MSR_PMON_CTR0 0xb68 +#define ICX_C34_MSR_PMON_CTL0 0xb61 +#define ICX_C34_MSR_PMON_BOX_CTL 0xb60 +#define ICX_C34_MSR_PMON_BOX_FILTER0 0xb65 + +/* ICX IIO */ +#define ICX_IIO_MSR_PMON_CTL0 0xa58 +#define ICX_IIO_MSR_PMON_CTR0 0xa51 +#define ICX_IIO_MSR_PMON_BOX_CTL 0xa50 + +/* ICX IRP */ +#define ICX_IRP0_MSR_PMON_CTL0 0xa4d +#define ICX_IRP0_MSR_PMON_CTR0 0xa4b +#define ICX_IRP0_MSR_PMON_BOX_CTL 0xa4a + +/* ICX M2PCIE */ +#define ICX_M2PCIE_MSR_PMON_CTL0 0xa46 +#define ICX_M2PCIE_MSR_PMON_CTR0 0xa41 +#define ICX_M2PCIE_MSR_PMON_BOX_CTL 0xa40 + +/* ICX UPI */ +#define ICX_UPI_PCI_PMON_CTL0 0x350 +#define ICX_UPI_PCI_PMON_CTR0 0x320 +#define ICX_UPI_PCI_PMON_BOX_CTL 0x318 +#define ICX_UPI_CTL_UMASK_EXT 0xffffff + +/* ICX M3UPI*/ +#define ICX_M3UPI_PCI_PMON_CTL0 0xd8 +#define ICX_M3UPI_PCI_PMON_CTR0 0xa8 +#define ICX_M3UPI_PCI_PMON_BOX_CTL 0xa0 + +/* ICX IMC */ +#define ICX_NUMBER_IMC_CHN 2 +#define ICX_IMC_MEM_STRIDE 0x4 + DEFINE_UNCORE_FORMAT_ATTR(event, event, "config:0-7"); DEFINE_UNCORE_FORMAT_ATTR(event2, event, "config:0-6"); DEFINE_UNCORE_FORMAT_ATTR(event_ext, event, "config:0-7,21"); @@ -395,6 +431,7 @@ DEFINE_UNCORE_FORMAT_ATTR(umask, umask, "config:8-15"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext, umask, "config:8-15,32-43,45-55"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext2, umask, "config:8-15,32-57"); DEFINE_UNCORE_FORMAT_ATTR(umask_ext3, umask, "config:8-15,32-39"); +DEFINE_UNCORE_FORMAT_ATTR(umask_ext4, umask, "config:8-15,32-55"); DEFINE_UNCORE_FORMAT_ATTR(qor, qor, "config:16"); DEFINE_UNCORE_FORMAT_ATTR(edge, edge, "config:18"); DEFINE_UNCORE_FORMAT_ATTR(tid_en, tid_en, "config:19"); @@ -4574,3 +4611,477 @@ void snr_uncore_mmio_init(void) }
/* end of SNR uncore support */ + +/* ICX uncore support */ + +static unsigned icx_cha_msr_offsets[] = { + 0x2a0, 0x2ae, 0x2bc, 0x2ca, 0x2d8, 0x2e6, 0x2f4, 0x302, 0x310, + 0x31e, 0x32c, 0x33a, 0x348, 0x356, 0x364, 0x372, 0x380, 0x38e, + 0x3aa, 0x3b8, 0x3c6, 0x3d4, 0x3e2, 0x3f0, 0x3fe, 0x40c, 0x41a, + 0x428, 0x436, 0x444, 0x452, 0x460, 0x46e, 0x47c, 0x0, 0xe, + 0x1c, 0x2a, 0x38, 0x46, +}; + +static int icx_cha_hw_config(struct intel_uncore_box *box, struct perf_event *event) +{ + struct hw_perf_event_extra *reg1 = &event->hw.extra_reg; + bool tie_en = !!(event->hw.config & SNBEP_CBO_PMON_CTL_TID_EN); + + if (tie_en) { + reg1->reg = ICX_C34_MSR_PMON_BOX_FILTER0 + + icx_cha_msr_offsets[box->pmu->pmu_idx]; + reg1->config = event->attr.config1 & SKX_CHA_MSR_PMON_BOX_FILTER_TID; + reg1->idx = 0; + } + + return 0; +} + +static struct intel_uncore_ops icx_uncore_chabox_ops = { + .init_box = ivbep_uncore_msr_init_box, + .disable_box = snbep_uncore_msr_disable_box, + .enable_box = snbep_uncore_msr_enable_box, + .disable_event = snbep_uncore_msr_disable_event, + .enable_event = snr_cha_enable_event, + .read_counter = uncore_msr_read_counter, + .hw_config = icx_cha_hw_config, +}; + +static struct intel_uncore_type icx_uncore_chabox = { + .name = "cha", + .num_counters = 4, + .perf_ctr_bits = 48, + .event_ctl = ICX_C34_MSR_PMON_CTL0, + .perf_ctr = ICX_C34_MSR_PMON_CTR0, + .box_ctl = ICX_C34_MSR_PMON_BOX_CTL, + .msr_offsets = icx_cha_msr_offsets, + .event_mask = HSWEP_S_MSR_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_CHA_RAW_EVENT_MASK_EXT, + .constraints = skx_uncore_chabox_constraints, + .ops = &icx_uncore_chabox_ops, + .format_group = &snr_uncore_chabox_format_group, +}; + +static unsigned icx_msr_offsets[] = { + 0x0, 0x20, 0x40, 0x90, 0xb0, 0xd0, +}; + +static struct event_constraint icx_uncore_iio_constraints[] = { + UNCORE_EVENT_CONSTRAINT(0x02, 0x3), + UNCORE_EVENT_CONSTRAINT(0x03, 0x3), + UNCORE_EVENT_CONSTRAINT(0x83, 0x3), + UNCORE_EVENT_CONSTRAINT(0xc0, 0xc), + UNCORE_EVENT_CONSTRAINT(0xc5, 0xc), + EVENT_CONSTRAINT_END +}; + +static struct intel_uncore_type icx_uncore_iio = { + .name = "iio", + .num_counters = 4, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = ICX_IIO_MSR_PMON_CTL0, + .perf_ctr = ICX_IIO_MSR_PMON_CTR0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_IIO_PMON_RAW_EVENT_MASK_EXT, + .box_ctl = ICX_IIO_MSR_PMON_BOX_CTL, + .msr_offsets = icx_msr_offsets, + .constraints = icx_uncore_iio_constraints, + .ops = &skx_uncore_iio_ops, + .format_group = &snr_uncore_iio_format_group, +}; + +static struct intel_uncore_type icx_uncore_irp = { + .name = "irp", + .num_counters = 2, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = ICX_IRP0_MSR_PMON_CTL0, + .perf_ctr = ICX_IRP0_MSR_PMON_CTR0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = ICX_IRP0_MSR_PMON_BOX_CTL, + .msr_offsets = icx_msr_offsets, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +static struct event_constraint icx_uncore_m2pcie_constraints[] = { + UNCORE_EVENT_CONSTRAINT(0x14, 0x3), + UNCORE_EVENT_CONSTRAINT(0x23, 0x3), + UNCORE_EVENT_CONSTRAINT(0x2d, 0x3), + EVENT_CONSTRAINT_END +}; + +static struct intel_uncore_type icx_uncore_m2pcie = { + .name = "m2pcie", + .num_counters = 4, + .num_boxes = 6, + .perf_ctr_bits = 48, + .event_ctl = ICX_M2PCIE_MSR_PMON_CTL0, + .perf_ctr = ICX_M2PCIE_MSR_PMON_CTR0, + .box_ctl = ICX_M2PCIE_MSR_PMON_BOX_CTL, + .msr_offsets = icx_msr_offsets, + .constraints = icx_uncore_m2pcie_constraints, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .ops = &ivbep_uncore_msr_ops, + .format_group = &ivbep_uncore_format_group, +}; + +enum perf_uncore_icx_iio_freerunning_type_id { + ICX_IIO_MSR_IOCLK, + ICX_IIO_MSR_BW_IN, + + ICX_IIO_FREERUNNING_TYPE_MAX, +}; + +static unsigned icx_iio_clk_freerunning_box_offsets[] = { + 0x0, 0x20, 0x40, 0x90, 0xb0, 0xd0, +}; + +static unsigned icx_iio_bw_freerunning_box_offsets[] = { + 0x0, 0x10, 0x20, 0x90, 0xa0, 0xb0, +}; + +static struct freerunning_counters icx_iio_freerunning[] = { + [ICX_IIO_MSR_IOCLK] = { 0xa55, 0x1, 0x20, 1, 48, icx_iio_clk_freerunning_box_offsets }, + [ICX_IIO_MSR_BW_IN] = { 0xaa0, 0x1, 0x10, 8, 48, icx_iio_bw_freerunning_box_offsets }, +}; + +static struct uncore_event_desc icx_uncore_iio_freerunning_events[] = { + /* Free-Running IIO CLOCKS Counter */ + INTEL_UNCORE_EVENT_DESC(ioclk, "event=0xff,umask=0x10"), + /* Free-Running IIO BANDWIDTH IN Counters */ + INTEL_UNCORE_EVENT_DESC(bw_in_port0, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(bw_in_port0.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port0.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1, "event=0xff,umask=0x21"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port1.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2, "event=0xff,umask=0x22"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port2.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3, "event=0xff,umask=0x23"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port3.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4, "event=0xff,umask=0x24"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port4.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5, "event=0xff,umask=0x25"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port5.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6, "event=0xff,umask=0x26"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port6.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7, "event=0xff,umask=0x27"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(bw_in_port7.unit, "MiB"), + { /* end: all zeroes */ }, +}; + +static struct intel_uncore_type icx_uncore_iio_free_running = { + .name = "iio_free_running", + .num_counters = 9, + .num_boxes = 6, + .num_freerunning_types = ICX_IIO_FREERUNNING_TYPE_MAX, + .freerunning = icx_iio_freerunning, + .ops = &skx_uncore_iio_freerunning_ops, + .event_descs = icx_uncore_iio_freerunning_events, + .format_group = &skx_uncore_iio_freerunning_format_group, +}; + +static struct intel_uncore_type *icx_msr_uncores[] = { + &skx_uncore_ubox, + &icx_uncore_chabox, + &icx_uncore_iio, + &icx_uncore_irp, + &icx_uncore_m2pcie, + &skx_uncore_pcu, + &icx_uncore_iio_free_running, + NULL, +}; + +/* + * To determine the number of CHAs, it should read CAPID6(Low) and CAPID7 (High) + * registers which located at Device 30, Function 3 + */ +#define ICX_CAPID6 0x9c +#define ICX_CAPID7 0xa0 + +static u64 icx_count_chabox(void) +{ + struct pci_dev *dev = NULL; + u64 caps = 0; + + dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x345b, dev); + if (!dev) + goto out; + + pci_read_config_dword(dev, ICX_CAPID6, (u32 *)&caps); + pci_read_config_dword(dev, ICX_CAPID7, (u32 *)&caps + 1); +out: + pci_dev_put(dev); + return hweight64(caps); +} + +void icx_uncore_cpu_init(void) +{ + u64 num_boxes = icx_count_chabox(); + + if (WARN_ON(num_boxes > ARRAY_SIZE(icx_cha_msr_offsets))) + return; + icx_uncore_chabox.num_boxes = num_boxes; + uncore_msr_uncores = icx_msr_uncores; +} + +static struct intel_uncore_type icx_uncore_m2m = { + .name = "m2m", + .num_counters = 4, + .num_boxes = 4, + .perf_ctr_bits = 48, + .perf_ctr = SNR_M2M_PCI_PMON_CTR0, + .event_ctl = SNR_M2M_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_M2M_PCI_PMON_BOX_CTL, + .ops = &snr_m2m_uncore_pci_ops, + .format_group = &skx_uncore_format_group, +}; + +static struct attribute *icx_upi_uncore_formats_attr[] = { + &format_attr_event.attr, + &format_attr_umask_ext4.attr, + &format_attr_edge.attr, + &format_attr_inv.attr, + &format_attr_thresh8.attr, + NULL, +}; + +static const struct attribute_group icx_upi_uncore_format_group = { + .name = "format", + .attrs = icx_upi_uncore_formats_attr, +}; + +static struct intel_uncore_type icx_uncore_upi = { + .name = "upi", + .num_counters = 4, + .num_boxes = 3, + .perf_ctr_bits = 48, + .perf_ctr = ICX_UPI_PCI_PMON_CTR0, + .event_ctl = ICX_UPI_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = ICX_UPI_CTL_UMASK_EXT, + .box_ctl = ICX_UPI_PCI_PMON_BOX_CTL, + .ops = &skx_upi_uncore_pci_ops, + .format_group = &icx_upi_uncore_format_group, +}; + +static struct event_constraint icx_uncore_m3upi_constraints[] = { + UNCORE_EVENT_CONSTRAINT(0x1c, 0x1), + UNCORE_EVENT_CONSTRAINT(0x1d, 0x1), + UNCORE_EVENT_CONSTRAINT(0x1e, 0x1), + UNCORE_EVENT_CONSTRAINT(0x1f, 0x1), + UNCORE_EVENT_CONSTRAINT(0x40, 0x7), + UNCORE_EVENT_CONSTRAINT(0x4e, 0x7), + UNCORE_EVENT_CONSTRAINT(0x4f, 0x7), + UNCORE_EVENT_CONSTRAINT(0x50, 0x7), + EVENT_CONSTRAINT_END +}; + +static struct intel_uncore_type icx_uncore_m3upi = { + .name = "m3upi", + .num_counters = 4, + .num_boxes = 3, + .perf_ctr_bits = 48, + .perf_ctr = ICX_M3UPI_PCI_PMON_CTR0, + .event_ctl = ICX_M3UPI_PCI_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = ICX_M3UPI_PCI_PMON_BOX_CTL, + .constraints = icx_uncore_m3upi_constraints, + .ops = &ivbep_uncore_pci_ops, + .format_group = &skx_uncore_format_group, +}; + +enum { + ICX_PCI_UNCORE_M2M, + ICX_PCI_UNCORE_UPI, + ICX_PCI_UNCORE_M3UPI, +}; + +static struct intel_uncore_type *icx_pci_uncores[] = { + [ICX_PCI_UNCORE_M2M] = &icx_uncore_m2m, + [ICX_PCI_UNCORE_UPI] = &icx_uncore_upi, + [ICX_PCI_UNCORE_M3UPI] = &icx_uncore_m3upi, + NULL, +}; + +static const struct pci_device_id icx_uncore_pci_ids[] = { + { /* M2M 0 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(12, 0, ICX_PCI_UNCORE_M2M, 0), + }, + { /* M2M 1 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(13, 0, ICX_PCI_UNCORE_M2M, 1), + }, + { /* M2M 2 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(14, 0, ICX_PCI_UNCORE_M2M, 2), + }, + { /* M2M 3 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x344a), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(15, 0, ICX_PCI_UNCORE_M2M, 3), + }, + { /* UPI Link 0 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3441), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(2, 1, ICX_PCI_UNCORE_UPI, 0), + }, + { /* UPI Link 1 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3441), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(3, 1, ICX_PCI_UNCORE_UPI, 1), + }, + { /* UPI Link 2 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3441), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(4, 1, ICX_PCI_UNCORE_UPI, 2), + }, + { /* M3UPI Link 0 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3446), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(5, 1, ICX_PCI_UNCORE_M3UPI, 0), + }, + { /* M3UPI Link 1 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3446), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(6, 1, ICX_PCI_UNCORE_M3UPI, 1), + }, + { /* M3UPI Link 2 */ + PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x3446), + .driver_data = UNCORE_PCI_DEV_FULL_DATA(7, 1, ICX_PCI_UNCORE_M3UPI, 2), + }, + { /* end: all zeroes */ } +}; + +static struct pci_driver icx_uncore_pci_driver = { + .name = "icx_uncore", + .id_table = icx_uncore_pci_ids, +}; + +int icx_uncore_pci_init(void) +{ + /* ICX UBOX DID */ + int ret = snbep_pci2phy_map_init(0x3450, SKX_CPUNODEID, + SKX_GIDNIDMAP, true); + + if (ret) + return ret; + + uncore_pci_uncores = icx_pci_uncores; + uncore_pci_driver = &icx_uncore_pci_driver; + return 0; +} + +static void icx_uncore_imc_init_box(struct intel_uncore_box *box) +{ + unsigned int box_ctl = box->pmu->type->box_ctl + + box->pmu->type->mmio_offset * (box->pmu->pmu_idx % ICX_NUMBER_IMC_CHN); + int mem_offset = (box->pmu->pmu_idx / ICX_NUMBER_IMC_CHN) * ICX_IMC_MEM_STRIDE + + SNR_IMC_MMIO_MEM0_OFFSET; + + __snr_uncore_mmio_init_box(box, box_ctl, mem_offset); +} + +static struct intel_uncore_ops icx_uncore_mmio_ops = { + .init_box = icx_uncore_imc_init_box, + .exit_box = uncore_mmio_exit_box, + .disable_box = snr_uncore_mmio_disable_box, + .enable_box = snr_uncore_mmio_enable_box, + .disable_event = snr_uncore_mmio_disable_event, + .enable_event = snr_uncore_mmio_enable_event, + .read_counter = uncore_mmio_read_counter, +}; + +static struct intel_uncore_type icx_uncore_imc = { + .name = "imc", + .num_counters = 4, + .num_boxes = 8, + .perf_ctr_bits = 48, + .fixed_ctr_bits = 48, + .fixed_ctr = SNR_IMC_MMIO_PMON_FIXED_CTR, + .fixed_ctl = SNR_IMC_MMIO_PMON_FIXED_CTL, + .event_descs = hswep_uncore_imc_events, + .perf_ctr = SNR_IMC_MMIO_PMON_CTR0, + .event_ctl = SNR_IMC_MMIO_PMON_CTL0, + .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .box_ctl = SNR_IMC_MMIO_PMON_BOX_CTL, + .mmio_offset = SNR_IMC_MMIO_OFFSET, + .ops = &icx_uncore_mmio_ops, + .format_group = &skx_uncore_format_group, +}; + +enum perf_uncore_icx_imc_freerunning_type_id { + ICX_IMC_DCLK, + ICX_IMC_DDR, + ICX_IMC_DDRT, + + ICX_IMC_FREERUNNING_TYPE_MAX, +}; + +static struct freerunning_counters icx_imc_freerunning[] = { + [ICX_IMC_DCLK] = { 0x22b0, 0x0, 0, 1, 48 }, + [ICX_IMC_DDR] = { 0x2290, 0x8, 0, 2, 48 }, + [ICX_IMC_DDRT] = { 0x22a0, 0x8, 0, 2, 48 }, +}; + +static struct uncore_event_desc icx_uncore_imc_freerunning_events[] = { + INTEL_UNCORE_EVENT_DESC(dclk, "event=0xff,umask=0x10"), + + INTEL_UNCORE_EVENT_DESC(read, "event=0xff,umask=0x20"), + INTEL_UNCORE_EVENT_DESC(read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(read.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(write, "event=0xff,umask=0x21"), + INTEL_UNCORE_EVENT_DESC(write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(write.unit, "MiB"), + + INTEL_UNCORE_EVENT_DESC(ddrt_read, "event=0xff,umask=0x30"), + INTEL_UNCORE_EVENT_DESC(ddrt_read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(ddrt_read.unit, "MiB"), + INTEL_UNCORE_EVENT_DESC(ddrt_write, "event=0xff,umask=0x31"), + INTEL_UNCORE_EVENT_DESC(ddrt_write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(ddrt_write.unit, "MiB"), + { /* end: all zeroes */ }, +}; + +static void icx_uncore_imc_freerunning_init_box(struct intel_uncore_box *box) +{ + int mem_offset = box->pmu->pmu_idx * ICX_IMC_MEM_STRIDE + + SNR_IMC_MMIO_MEM0_OFFSET; + + __snr_uncore_mmio_init_box(box, uncore_mmio_box_ctl(box), mem_offset); +} + +static struct intel_uncore_ops icx_uncore_imc_freerunning_ops = { + .init_box = icx_uncore_imc_freerunning_init_box, + .exit_box = uncore_mmio_exit_box, + .read_counter = uncore_mmio_read_counter, + .hw_config = uncore_freerunning_hw_config, +}; + +static struct intel_uncore_type icx_uncore_imc_free_running = { + .name = "imc_free_running", + .num_counters = 5, + .num_boxes = 4, + .num_freerunning_types = ICX_IMC_FREERUNNING_TYPE_MAX, + .freerunning = icx_imc_freerunning, + .ops = &icx_uncore_imc_freerunning_ops, + .event_descs = icx_uncore_imc_freerunning_events, + .format_group = &skx_uncore_iio_freerunning_format_group, +}; + +static struct intel_uncore_type *icx_mmio_uncores[] = { + &icx_uncore_imc, + &icx_uncore_imc_free_running, + NULL, +}; + +void icx_uncore_mmio_init(void) +{ + uncore_mmio_uncores = icx_mmio_uncores; +} + +/* end of ICX uncore support */
From: Yunying Sun yunying.sun@intel.com
mainline inclusion from mainline-v5.3-rc2 commit 3b238a64c3009fed36eaea1af629d9377759d87d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3b238a64c3009fed36eaea1af629d9377759d87d upstream.
Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x register is valid, and used for counting hardware generated prefetches of L3 cache. Update the bitmask to allow bit 13.
Before: $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3 Performance counter stats for 'sleep 3': <not supported> cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
After: $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3 Performance counter stats for 'sleep 3': 9,293 cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Kan Liang kan.liang@linux.intel.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: bp@alien8.de Cc: hpa@zytor.com Cc: jolsa@redhat.com Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 09434c273ef1..28a0629c076f 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -260,8 +260,8 @@ static struct event_constraint intel_icl_event_constraints[] = { };
static struct extra_reg intel_icl_extra_regs[] __read_mostly = { - INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff9fffull, RSP_0), - INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1), + INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffffbfffull, RSP_0), + INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffffbfffull, RSP_1), INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd), INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE), EVENT_EXTRA_END
From: Leonid Ravich lravich@gmail.com
mainline inclusion from mainline-v5.1-rc1 commit ebb09b33c60c46fd4f7ffa0af9e693eebe765d1b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ebb09b33c60c46fd4f7ffa0af9e693eebe765d1b upstream.
Backport summary: Backport to kernel 4.19.57 for 4th Gen Intel NTB support
NTB door bell usage depends on NTB hardware. ex: intel NTB gen1 has one peer door bell register which can be controlled by the bitmap writen to it, while Intel NTB gen3 has a registers per door bell and the data trigering the each door bell is always 1.
therefore exposing only peer door bell address forcing the user to be aware of such low level details
Signed-off-by: Leonid Ravich Leonid.Ravich@emc.com Acked-by: Logan Gunthorpe logang@deltatee.com Acked-by: Dave Jiang dave.jiang@intel.com Acked-by: Allen Hubbe allenbh@gmail.com Signed-off-by: Jon Mason jdmason@kudzu.us Signed-off-by: Zheng Wu wu.zheng@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/ntb/hw/intel/ntb_hw_gen1.c | 25 ++++++++++++++----- drivers/ntb/hw/intel/ntb_hw_gen1.h | 5 ++-- drivers/ntb/hw/intel/ntb_hw_gen3.c | 33 +++++++++++++++++++++++++- drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 9 ++++++- include/linux/ntb.h | 10 +++++--- 5 files changed, 69 insertions(+), 13 deletions(-)
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.c b/drivers/ntb/hw/intel/ntb_hw_gen1.c index 2ad263f708da..bb57ec239029 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen1.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen1.c @@ -180,7 +180,7 @@ int ndev_mw_to_bar(struct intel_ntb_dev *ndev, int idx) return ndev->reg->mw_bar[idx]; }
-static inline int ndev_db_addr(struct intel_ntb_dev *ndev, +void ndev_db_addr(struct intel_ntb_dev *ndev, phys_addr_t *db_addr, resource_size_t *db_size, phys_addr_t reg_addr, unsigned long reg) { @@ -196,8 +196,6 @@ static inline int ndev_db_addr(struct intel_ntb_dev *ndev, *db_size = ndev->reg->db_size; dev_dbg(&ndev->ntb.pdev->dev, "Peer db size %llx\n", *db_size); } - - return 0; }
u64 ndev_db_read(struct intel_ntb_dev *ndev, @@ -1111,13 +1109,28 @@ int intel_ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits) ndev->self_reg->db_mask); }
-int intel_ntb_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, - resource_size_t *db_size) +static int intel_ntb_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, + resource_size_t *db_size, u64 *db_data, int db_bit) { + u64 db_bits; struct intel_ntb_dev *ndev = ntb_ndev(ntb);
- return ndev_db_addr(ndev, db_addr, db_size, ndev->peer_addr, + if (unlikely(db_bit >= BITS_PER_LONG_LONG)) + return -EINVAL; + + db_bits = BIT_ULL(db_bit); + + if (unlikely(db_bits & ~ntb_ndev(ntb)->db_valid_mask)) + return -EINVAL; + + ndev_db_addr(ndev, db_addr, db_size, ndev->peer_addr, ndev->peer_reg->db_bell); + + if (db_data) + *db_data = db_bits; + + + return 0; }
static int intel_ntb_peer_db_set(struct ntb_dev *ntb, u64 db_bits) diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.h b/drivers/ntb/hw/intel/ntb_hw_gen1.h index ad8ec1444436..544cf5c06f4d 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen1.h +++ b/drivers/ntb/hw/intel/ntb_hw_gen1.h @@ -147,6 +147,9 @@ extern struct intel_b2b_addr xeon_b2b_dsd_addr; int ndev_init_isr(struct intel_ntb_dev *ndev, int msix_min, int msix_max, int msix_shift, int total_shift); enum ntb_topo xeon_ppd_topo(struct intel_ntb_dev *ndev, u8 ppd); +void ndev_db_addr(struct intel_ntb_dev *ndev, + phys_addr_t *db_addr, resource_size_t *db_size, + phys_addr_t reg_addr, unsigned long reg); u64 ndev_db_read(struct intel_ntb_dev *ndev, void __iomem *mmio); int ndev_db_write(struct intel_ntb_dev *ndev, u64 db_bits, void __iomem *mmio); @@ -166,8 +169,6 @@ int intel_ntb_db_vector_count(struct ntb_dev *ntb); u64 intel_ntb_db_vector_mask(struct ntb_dev *ntb, int db_vector); int intel_ntb_db_set_mask(struct ntb_dev *ntb, u64 db_bits); int intel_ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits); -int intel_ntb_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, - resource_size_t *db_size); int intel_ntb_spad_is_unsafe(struct ntb_dev *ntb); int intel_ntb_spad_count(struct ntb_dev *ntb); u32 intel_ntb_spad_read(struct ntb_dev *ntb, int idx); diff --git a/drivers/ntb/hw/intel/ntb_hw_gen3.c b/drivers/ntb/hw/intel/ntb_hw_gen3.c index b3fa24778f94..f475b56a3f49 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen3.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen3.c @@ -532,6 +532,37 @@ static int intel_ntb3_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx, return 0; }
+int intel_ntb3_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, + resource_size_t *db_size, + u64 *db_data, int db_bit) +{ + phys_addr_t db_addr_base; + struct intel_ntb_dev *ndev = ntb_ndev(ntb); + + if (unlikely(db_bit >= BITS_PER_LONG_LONG)) + return -EINVAL; + + if (unlikely(BIT_ULL(db_bit) & ~ntb_ndev(ntb)->db_valid_mask)) + return -EINVAL; + + ndev_db_addr(ndev, &db_addr_base, db_size, ndev->peer_addr, + ndev->peer_reg->db_bell); + + if (db_addr) { + *db_addr = db_addr_base + (db_bit * 4); + dev_dbg(&ndev->ntb.pdev->dev, "Peer db addr %llx db bit %d\n", + *db_addr, db_bit); + } + + if (db_data) { + *db_data = 1; + dev_dbg(&ndev->ntb.pdev->dev, "Peer db data %llx db bit %d\n", + *db_data, db_bit); + } + + return 0; +} + static int intel_ntb3_peer_db_set(struct ntb_dev *ntb, u64 db_bits) { struct intel_ntb_dev *ndev = ntb_ndev(ntb); @@ -584,7 +615,7 @@ const struct ntb_dev_ops intel_ntb3_ops = { .db_clear = intel_ntb3_db_clear, .db_set_mask = intel_ntb_db_set_mask, .db_clear_mask = intel_ntb_db_clear_mask, - .peer_db_addr = intel_ntb_peer_db_addr, + .peer_db_addr = intel_ntb3_peer_db_addr, .peer_db_set = intel_ntb3_peer_db_set, .spad_is_unsafe = intel_ntb_spad_is_unsafe, .spad_count = intel_ntb_spad_count, diff --git a/drivers/ntb/hw/mscc/ntb_hw_switchtec.c b/drivers/ntb/hw/mscc/ntb_hw_switchtec.c index 5ee5f40b4dfc..9fa1b991dd52 100644 --- a/drivers/ntb/hw/mscc/ntb_hw_switchtec.c +++ b/drivers/ntb/hw/mscc/ntb_hw_switchtec.c @@ -707,11 +707,16 @@ static u64 switchtec_ntb_db_read_mask(struct ntb_dev *ntb)
static int switchtec_ntb_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, - resource_size_t *db_size) + resource_size_t *db_size, + u64 *db_data, + int db_bit) { struct switchtec_ntb *sndev = ntb_sndev(ntb); unsigned long offset;
+ if (unlikely(db_bit >= BITS_PER_LONG_LONG)) + return -EINVAL; + offset = (unsigned long)sndev->mmio_peer_dbmsg->odb - (unsigned long)sndev->stdev->mmio;
@@ -721,6 +726,8 @@ static int switchtec_ntb_peer_db_addr(struct ntb_dev *ntb, *db_addr = pci_resource_start(ntb->pdev, 0) + offset; if (db_size) *db_size = sizeof(u32); + if (db_data) + *db_data = BIT_ULL(db_bit) << sndev->db_peer_shift;
return 0; } diff --git a/include/linux/ntb.h b/include/linux/ntb.h index 181d16601dd9..56a92e3ae3ae 100644 --- a/include/linux/ntb.h +++ b/include/linux/ntb.h @@ -296,7 +296,8 @@ struct ntb_dev_ops { int (*db_clear_mask)(struct ntb_dev *ntb, u64 db_bits);
int (*peer_db_addr)(struct ntb_dev *ntb, - phys_addr_t *db_addr, resource_size_t *db_size); + phys_addr_t *db_addr, resource_size_t *db_size, + u64 *db_data, int db_bit); u64 (*peer_db_read)(struct ntb_dev *ntb); int (*peer_db_set)(struct ntb_dev *ntb, u64 db_bits); int (*peer_db_clear)(struct ntb_dev *ntb, u64 db_bits); @@ -1078,6 +1079,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits) * @ntb: NTB device context. * @db_addr: OUT - The address of the peer doorbell register. * @db_size: OUT - The number of bytes to write the peer doorbell register. + * @db_data: OUT - The data of peer doorbell register + * @db_bit: door bell bit number * * Return the address of the peer doorbell register. This may be used, for * example, by drivers that offload memory copy operations to a dma engine. @@ -1091,12 +1094,13 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits) */ static inline int ntb_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, - resource_size_t *db_size) + resource_size_t *db_size, + u64 *db_data, int db_bit) { if (!ntb->ops->peer_db_addr) return -EINVAL;
- return ntb->ops->peer_db_addr(ntb, db_addr, db_size); + return ntb->ops->peer_db_addr(ntb, db_addr, db_size, db_data, db_bit); }
/**
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.8-rc1 commit 26bfe3d0b227ab6d38692640b44ce48f2d857602 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 26bfe3d0b227ab6d38692640b44ce48f2d857602 upstream.
Backport summary: Backport to kernel 4.19.57 for 4th Gen Intel NTB support
Adding 4th generation Intel NTB support bits. There are a lot of common parts that the gen4 NTB has with gen3 NTB on Skylake. The commonalities are reused in gen4 Icelake NTB.
Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Jon Mason jdmason@kudzu.us Signed-off-by: Zheng Wu wu.zheng@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/ntb/hw/intel/Makefile | 2 +- drivers/ntb/hw/intel/ntb_hw_gen1.c | 45 +-- drivers/ntb/hw/intel/ntb_hw_gen3.c | 11 +- drivers/ntb/hw/intel/ntb_hw_gen3.h | 8 + drivers/ntb/hw/intel/ntb_hw_gen4.c | 500 ++++++++++++++++++++++++++++ drivers/ntb/hw/intel/ntb_hw_gen4.h | 87 +++++ drivers/ntb/hw/intel/ntb_hw_intel.h | 12 + 7 files changed, 639 insertions(+), 26 deletions(-) create mode 100644 drivers/ntb/hw/intel/ntb_hw_gen4.c create mode 100644 drivers/ntb/hw/intel/ntb_hw_gen4.h
diff --git a/drivers/ntb/hw/intel/Makefile b/drivers/ntb/hw/intel/Makefile index 4ff22af967c6..36c69412dc49 100644 --- a/drivers/ntb/hw/intel/Makefile +++ b/drivers/ntb/hw/intel/Makefile @@ -1,2 +1,2 @@ obj-$(CONFIG_NTB_INTEL) += ntb_hw_intel.o -ntb_hw_intel-y := ntb_hw_gen1.o ntb_hw_gen3.o +ntb_hw_intel-y := ntb_hw_gen1.o ntb_hw_gen3.o ntb_hw_gen4.o diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.c b/drivers/ntb/hw/intel/ntb_hw_gen1.c index bb57ec239029..cae70e7fe5bd 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen1.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen1.c @@ -60,6 +60,7 @@ #include "ntb_hw_intel.h" #include "ntb_hw_gen1.h" #include "ntb_hw_gen3.h" +#include "ntb_hw_gen4.h"
#define NTB_NAME "ntb_hw_intel" #define NTB_DESC "Intel(R) PCI-E Non-Transparent Bridge Driver" @@ -762,6 +763,8 @@ static ssize_t ndev_debugfs_read(struct file *filp, char __user *ubuf, return ndev_ntb_debugfs_read(filp, ubuf, count, offp); else if (pdev_is_gen3(ndev->ntb.pdev)) return ndev_ntb3_debugfs_read(filp, ubuf, count, offp); + else if (pdev_is_gen4(ndev->ntb.pdev)) + return ndev_ntb4_debugfs_read(filp, ubuf, count, offp);
return -ENXIO; } @@ -1858,16 +1861,15 @@ static int intel_ntb_pci_probe(struct pci_dev *pdev, int rc, node;
node = dev_to_node(&pdev->dev); + ndev = kzalloc_node(sizeof(*ndev), GFP_KERNEL, node); + if (!ndev) { + rc = -ENOMEM; + goto err_ndev; + }
- if (pdev_is_gen1(pdev)) { - ndev = kzalloc_node(sizeof(*ndev), GFP_KERNEL, node); - if (!ndev) { - rc = -ENOMEM; - goto err_ndev; - } - - ndev_init_struct(ndev, pdev); + ndev_init_struct(ndev, pdev);
+ if (pdev_is_gen1(pdev)) { rc = intel_ntb_init_pci(ndev, pdev); if (rc) goto err_init_pci; @@ -1875,17 +1877,8 @@ static int intel_ntb_pci_probe(struct pci_dev *pdev, rc = xeon_init_dev(ndev); if (rc) goto err_init_dev; - } else if (pdev_is_gen3(pdev)) { - ndev = kzalloc_node(sizeof(*ndev), GFP_KERNEL, node); - if (!ndev) { - rc = -ENOMEM; - goto err_ndev; - } - - ndev_init_struct(ndev, pdev); ndev->ntb.ops = &intel_ntb3_ops; - rc = intel_ntb_init_pci(ndev, pdev); if (rc) goto err_init_pci; @@ -1893,7 +1886,15 @@ static int intel_ntb_pci_probe(struct pci_dev *pdev, rc = gen3_init_dev(ndev); if (rc) goto err_init_dev; + } else if (pdev_is_gen4(pdev)) { + ndev->ntb.ops = &intel_ntb4_ops; + rc = intel_ntb_init_pci(ndev, pdev); + if (rc) + goto err_init_pci;
+ rc = gen4_init_dev(ndev); + if (rc) + goto err_init_dev; } else { rc = -EINVAL; goto err_ndev; @@ -1915,7 +1916,7 @@ static int intel_ntb_pci_probe(struct pci_dev *pdev,
err_register: ndev_deinit_debugfs(ndev); - if (pdev_is_gen1(pdev) || pdev_is_gen3(pdev)) + if (pdev_is_gen1(pdev) || pdev_is_gen3(pdev) || pdev_is_gen4(pdev)) xeon_deinit_dev(ndev); err_init_dev: intel_ntb_deinit_pci(ndev); @@ -1931,7 +1932,7 @@ static void intel_ntb_pci_remove(struct pci_dev *pdev)
ntb_unregister_device(&ndev->ntb); ndev_deinit_debugfs(ndev); - if (pdev_is_gen1(pdev) || pdev_is_gen3(pdev)) + if (pdev_is_gen1(pdev) || pdev_is_gen3(pdev) || pdev_is_gen4(pdev)) xeon_deinit_dev(ndev); intel_ntb_deinit_pci(ndev); kfree(ndev); @@ -2036,6 +2037,7 @@ static const struct file_operations intel_ntb_debugfs_info = { };
static const struct pci_device_id intel_ntb_pci_tbl[] = { + /* GEN1 */ {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_JSF)}, {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_SNB)}, {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_IVT)}, @@ -2051,7 +2053,12 @@ static const struct pci_device_id intel_ntb_pci_tbl[] = { {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_SS_IVT)}, {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_SS_HSX)}, {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_SS_BDX)}, + + /* GEN3 */ {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_SKX)}, + + /* GEN4 */ + {PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_ICX)}, {0} }; MODULE_DEVICE_TABLE(pci, intel_ntb_pci_tbl); diff --git a/drivers/ntb/hw/intel/ntb_hw_gen3.c b/drivers/ntb/hw/intel/ntb_hw_gen3.c index f475b56a3f49..9116333e01a8 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen3.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen3.c @@ -415,9 +415,8 @@ ssize_t ndev_ntb3_debugfs_read(struct file *filp, char __user *ubuf, return ret; }
-static int intel_ntb3_link_enable(struct ntb_dev *ntb, - enum ntb_speed max_speed, - enum ntb_width max_width) +int intel_ntb3_link_enable(struct ntb_dev *ntb, enum ntb_speed max_speed, + enum ntb_width max_width) { struct intel_ntb_dev *ndev; u32 ntb_ctl; @@ -563,7 +562,7 @@ int intel_ntb3_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, return 0; }
-static int intel_ntb3_peer_db_set(struct ntb_dev *ntb, u64 db_bits) +int intel_ntb3_peer_db_set(struct ntb_dev *ntb, u64 db_bits) { struct intel_ntb_dev *ndev = ntb_ndev(ntb); int bit; @@ -581,7 +580,7 @@ static int intel_ntb3_peer_db_set(struct ntb_dev *ntb, u64 db_bits) return 0; }
-static u64 intel_ntb3_db_read(struct ntb_dev *ntb) +u64 intel_ntb3_db_read(struct ntb_dev *ntb) { struct intel_ntb_dev *ndev = ntb_ndev(ntb);
@@ -590,7 +589,7 @@ static u64 intel_ntb3_db_read(struct ntb_dev *ntb) ndev->self_reg->db_clear); }
-static int intel_ntb3_db_clear(struct ntb_dev *ntb, u64 db_bits) +int intel_ntb3_db_clear(struct ntb_dev *ntb, u64 db_bits) { struct intel_ntb_dev *ndev = ntb_ndev(ntb);
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen3.h b/drivers/ntb/hw/intel/ntb_hw_gen3.h index 75fb86ca27bb..2bc5d8356045 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen3.h +++ b/drivers/ntb/hw/intel/ntb_hw_gen3.h @@ -104,6 +104,14 @@ static inline void gen3_db_iowrite(u64 bits, void __iomem *mmio) ssize_t ndev_ntb3_debugfs_read(struct file *filp, char __user *ubuf, size_t count, loff_t *offp); int gen3_init_dev(struct intel_ntb_dev *ndev); +int intel_ntb3_link_enable(struct ntb_dev *ntb, enum ntb_speed max_speed, + enum ntb_width max_width); +u64 intel_ntb3_db_read(struct ntb_dev *ntb); +int intel_ntb3_db_clear(struct ntb_dev *ntb, u64 db_bits); +int intel_ntb3_peer_db_set(struct ntb_dev *ntb, u64 db_bits); +int intel_ntb3_peer_db_addr(struct ntb_dev *ntb, phys_addr_t *db_addr, + resource_size_t *db_size, + u64 *db_data, int db_bit);
extern const struct ntb_dev_ops intel_ntb3_ops;
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen4.c b/drivers/ntb/hw/intel/ntb_hw_gen4.c new file mode 100644 index 000000000000..aab614099d34 --- /dev/null +++ b/drivers/ntb/hw/intel/ntb_hw_gen4.c @@ -0,0 +1,500 @@ +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) +/* Copyright(c) 2020 Intel Corporation. All rights reserved. */ +#include <linux/debugfs.h> +#include <linux/delay.h> +#include <linux/init.h> +#include <linux/interrupt.h> +#include <linux/module.h> +#include <linux/pci.h> +#include <linux/random.h> +#include <linux/slab.h> +#include <linux/ntb.h> +#include <linux/log2.h> + +#include "ntb_hw_intel.h" +#include "ntb_hw_gen1.h" +#include "ntb_hw_gen3.h" +#include "ntb_hw_gen4.h" + +static int gen4_poll_link(struct intel_ntb_dev *ndev); +static int gen4_link_is_up(struct intel_ntb_dev *ndev); + +static const struct intel_ntb_reg gen4_reg = { + .poll_link = gen4_poll_link, + .link_is_up = gen4_link_is_up, + .db_ioread = gen3_db_ioread, + .db_iowrite = gen3_db_iowrite, + .db_size = sizeof(u32), + .ntb_ctl = GEN4_NTBCNTL_OFFSET, + .mw_bar = {2, 4}, +}; + +static const struct intel_ntb_alt_reg gen4_pri_reg = { + .db_clear = GEN4_IM_INT_STATUS_OFFSET, + .db_mask = GEN4_IM_INT_DISABLE_OFFSET, + .spad = GEN4_IM_SPAD_OFFSET, +}; + +static const struct intel_ntb_xlat_reg gen4_sec_xlat = { + .bar2_limit = GEN4_IM23XLMT_OFFSET, + .bar2_xlat = GEN4_IM23XBASE_OFFSET, + .bar2_idx = GEN4_IM23XBASEIDX_OFFSET, +}; + +static const struct intel_ntb_alt_reg gen4_b2b_reg = { + .db_bell = GEN4_IM_DOORBELL_OFFSET, + .spad = GEN4_EM_SPAD_OFFSET, +}; + +static int gen4_poll_link(struct intel_ntb_dev *ndev) +{ + u16 reg_val; + + /* + * We need to write to DLLSCS bit in the SLOTSTS before we + * can clear the hardware link interrupt on ICX NTB. + */ + iowrite16(GEN4_SLOTSTS_DLLSCS, ndev->self_mmio + GEN4_SLOTSTS); + ndev->reg->db_iowrite(ndev->db_link_mask, + ndev->self_mmio + + ndev->self_reg->db_clear); + + reg_val = ioread16(ndev->self_mmio + GEN4_LINK_STATUS_OFFSET); + if (reg_val == ndev->lnk_sta) + return 0; + + ndev->lnk_sta = reg_val; + + return 1; +} + +static int gen4_link_is_up(struct intel_ntb_dev *ndev) +{ + return NTB_LNK_STA_ACTIVE(ndev->lnk_sta); +} + +static int gen4_init_isr(struct intel_ntb_dev *ndev) +{ + int i; + + /* + * The MSIX vectors and the interrupt status bits are not lined up + * on Gen3 (Skylake) and Gen4. By default the link status bit is bit + * 32, however it is by default MSIX vector0. We need to fixup to + * line them up. The vectors at reset is 1-32,0. We need to reprogram + * to 0-32. + */ + for (i = 0; i < GEN4_DB_MSIX_VECTOR_COUNT; i++) + iowrite8(i, ndev->self_mmio + GEN4_INTVEC_OFFSET + i); + + return ndev_init_isr(ndev, GEN4_DB_MSIX_VECTOR_COUNT, + GEN4_DB_MSIX_VECTOR_COUNT, + GEN4_DB_MSIX_VECTOR_SHIFT, + GEN4_DB_TOTAL_SHIFT); +} + +static int gen4_setup_b2b_mw(struct intel_ntb_dev *ndev, + const struct intel_b2b_addr *addr, + const struct intel_b2b_addr *peer_addr) +{ + struct pci_dev *pdev; + void __iomem *mmio; + phys_addr_t bar_addr; + + pdev = ndev->ntb.pdev; + mmio = ndev->self_mmio; + + /* setup incoming bar limits == base addrs (zero length windows) */ + bar_addr = addr->bar2_addr64; + iowrite64(bar_addr, mmio + GEN4_IM23XLMT_OFFSET); + bar_addr = ioread64(mmio + GEN4_IM23XLMT_OFFSET); + dev_dbg(&pdev->dev, "IM23XLMT %#018llx\n", bar_addr); + + bar_addr = addr->bar4_addr64; + iowrite64(bar_addr, mmio + GEN4_IM45XLMT_OFFSET); + bar_addr = ioread64(mmio + GEN4_IM45XLMT_OFFSET); + dev_dbg(&pdev->dev, "IM45XLMT %#018llx\n", bar_addr); + + /* zero incoming translation addrs */ + iowrite64(0, mmio + GEN4_IM23XBASE_OFFSET); + iowrite64(0, mmio + GEN4_IM45XBASE_OFFSET); + + ndev->peer_mmio = ndev->self_mmio; + + return 0; +} + +static int gen4_init_ntb(struct intel_ntb_dev *ndev) +{ + int rc; + + + ndev->mw_count = XEON_MW_COUNT; + ndev->spad_count = GEN4_SPAD_COUNT; + ndev->db_count = GEN4_DB_COUNT; + ndev->db_link_mask = GEN4_DB_LINK_BIT; + + ndev->self_reg = &gen4_pri_reg; + ndev->xlat_reg = &gen4_sec_xlat; + ndev->peer_reg = &gen4_b2b_reg; + + if (ndev->ntb.topo == NTB_TOPO_B2B_USD) + rc = gen4_setup_b2b_mw(ndev, &xeon_b2b_dsd_addr, + &xeon_b2b_usd_addr); + else + rc = gen4_setup_b2b_mw(ndev, &xeon_b2b_usd_addr, + &xeon_b2b_dsd_addr); + if (rc) + return rc; + + ndev->db_valid_mask = BIT_ULL(ndev->db_count) - 1; + + ndev->reg->db_iowrite(ndev->db_valid_mask, + ndev->self_mmio + + ndev->self_reg->db_mask); + + return 0; +} + +static enum ntb_topo gen4_ppd_topo(struct intel_ntb_dev *ndev, u32 ppd) +{ + switch (ppd & GEN4_PPD_TOPO_MASK) { + case GEN4_PPD_TOPO_B2B_USD: + return NTB_TOPO_B2B_USD; + case GEN4_PPD_TOPO_B2B_DSD: + return NTB_TOPO_B2B_DSD; + } + + return NTB_TOPO_NONE; +} + +int gen4_init_dev(struct intel_ntb_dev *ndev) +{ + struct pci_dev *pdev = ndev->ntb.pdev; + u32 ppd1/*, ppd0*/; + u16 lnkctl; + int rc; + + ndev->reg = &gen4_reg; + + ppd1 = ioread32(ndev->self_mmio + GEN4_PPD1_OFFSET); + ndev->ntb.topo = gen4_ppd_topo(ndev, ppd1); + dev_dbg(&pdev->dev, "ppd %#x topo %s\n", ppd1, + ntb_topo_string(ndev->ntb.topo)); + if (ndev->ntb.topo == NTB_TOPO_NONE) + return -EINVAL; + + rc = gen4_init_ntb(ndev); + if (rc) + return rc; + + /* init link setup */ + lnkctl = ioread16(ndev->self_mmio + GEN4_LINK_CTRL_OFFSET); + lnkctl |= GEN4_LINK_CTRL_LINK_DISABLE; + iowrite16(lnkctl, ndev->self_mmio + GEN4_LINK_CTRL_OFFSET); + + return gen4_init_isr(ndev); +} + +ssize_t ndev_ntb4_debugfs_read(struct file *filp, char __user *ubuf, + size_t count, loff_t *offp) +{ + struct intel_ntb_dev *ndev; + void __iomem *mmio; + char *buf; + size_t buf_size; + ssize_t ret, off; + union { u64 v64; u32 v32; u16 v16; } u; + + ndev = filp->private_data; + mmio = ndev->self_mmio; + + buf_size = min(count, 0x800ul); + + buf = kmalloc(buf_size, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + off = 0; + + off += scnprintf(buf + off, buf_size - off, + "NTB Device Information:\n"); + + off += scnprintf(buf + off, buf_size - off, + "Connection Topology -\t%s\n", + ntb_topo_string(ndev->ntb.topo)); + + off += scnprintf(buf + off, buf_size - off, + "NTB CTL -\t\t%#06x\n", ndev->ntb_ctl); + off += scnprintf(buf + off, buf_size - off, + "LNK STA (cached) -\t\t%#06x\n", ndev->lnk_sta); + + if (!ndev->reg->link_is_up(ndev)) + off += scnprintf(buf + off, buf_size - off, + "Link Status -\t\tDown\n"); + else { + off += scnprintf(buf + off, buf_size - off, + "Link Status -\t\tUp\n"); + off += scnprintf(buf + off, buf_size - off, + "Link Speed -\t\tPCI-E Gen %u\n", + NTB_LNK_STA_SPEED(ndev->lnk_sta)); + off += scnprintf(buf + off, buf_size - off, + "Link Width -\t\tx%u\n", + NTB_LNK_STA_WIDTH(ndev->lnk_sta)); + } + + off += scnprintf(buf + off, buf_size - off, + "Memory Window Count -\t%u\n", ndev->mw_count); + off += scnprintf(buf + off, buf_size - off, + "Scratchpad Count -\t%u\n", ndev->spad_count); + off += scnprintf(buf + off, buf_size - off, + "Doorbell Count -\t%u\n", ndev->db_count); + off += scnprintf(buf + off, buf_size - off, + "Doorbell Vector Count -\t%u\n", ndev->db_vec_count); + off += scnprintf(buf + off, buf_size - off, + "Doorbell Vector Shift -\t%u\n", ndev->db_vec_shift); + + off += scnprintf(buf + off, buf_size - off, + "Doorbell Valid Mask -\t%#llx\n", ndev->db_valid_mask); + off += scnprintf(buf + off, buf_size - off, + "Doorbell Link Mask -\t%#llx\n", ndev->db_link_mask); + off += scnprintf(buf + off, buf_size - off, + "Doorbell Mask Cached -\t%#llx\n", ndev->db_mask); + + u.v64 = ndev_db_read(ndev, mmio + ndev->self_reg->db_mask); + off += scnprintf(buf + off, buf_size - off, + "Doorbell Mask -\t\t%#llx\n", u.v64); + + off += scnprintf(buf + off, buf_size - off, + "\nNTB Incoming XLAT:\n"); + + u.v64 = ioread64(mmio + GEN4_IM23XBASE_OFFSET); + off += scnprintf(buf + off, buf_size - off, + "IM23XBASE -\t\t%#018llx\n", u.v64); + + u.v64 = ioread64(mmio + GEN4_IM45XBASE_OFFSET); + off += scnprintf(buf + off, buf_size - off, + "IM45XBASE -\t\t%#018llx\n", u.v64); + + u.v64 = ioread64(mmio + GEN4_IM23XLMT_OFFSET); + off += scnprintf(buf + off, buf_size - off, + "IM23XLMT -\t\t\t%#018llx\n", u.v64); + + u.v64 = ioread64(mmio + GEN4_IM45XLMT_OFFSET); + off += scnprintf(buf + off, buf_size - off, + "IM45XLMT -\t\t\t%#018llx\n", u.v64); + + off += scnprintf(buf + off, buf_size - off, + "\nNTB Statistics:\n"); + + off += scnprintf(buf + off, buf_size - off, + "\nNTB Hardware Errors:\n"); + + if (!pci_read_config_word(ndev->ntb.pdev, + GEN4_DEVSTS_OFFSET, &u.v16)) + off += scnprintf(buf + off, buf_size - off, + "DEVSTS -\t\t%#06x\n", u.v16); + + u.v16 = ioread16(mmio + GEN4_LINK_STATUS_OFFSET); + off += scnprintf(buf + off, buf_size - off, + "LNKSTS -\t\t%#06x\n", u.v16); + + if (!pci_read_config_dword(ndev->ntb.pdev, + GEN4_UNCERRSTS_OFFSET, &u.v32)) + off += scnprintf(buf + off, buf_size - off, + "UNCERRSTS -\t\t%#06x\n", u.v32); + + if (!pci_read_config_dword(ndev->ntb.pdev, + GEN4_CORERRSTS_OFFSET, &u.v32)) + off += scnprintf(buf + off, buf_size - off, + "CORERRSTS -\t\t%#06x\n", u.v32); + + ret = simple_read_from_buffer(ubuf, count, offp, buf, off); + kfree(buf); + return ret; +} + +static int intel_ntb4_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx, + dma_addr_t addr, resource_size_t size) +{ + struct intel_ntb_dev *ndev = ntb_ndev(ntb); + unsigned long xlat_reg, limit_reg, idx_reg; + unsigned short base_idx, reg_val16; + resource_size_t bar_size, mw_size; + void __iomem *mmio; + u64 base, limit, reg_val; + int bar; + + if (pidx != NTB_DEF_PEER_IDX) + return -EINVAL; + + if (idx >= ndev->b2b_idx && !ndev->b2b_off) + idx += 1; + + bar = ndev_mw_to_bar(ndev, idx); + if (bar < 0) + return bar; + + bar_size = pci_resource_len(ndev->ntb.pdev, bar); + + if (idx == ndev->b2b_idx) + mw_size = bar_size - ndev->b2b_off; + else + mw_size = bar_size; + + /* hardware requires that addr is aligned to bar size */ + if (addr & (bar_size - 1)) + return -EINVAL; + + /* make sure the range fits in the usable mw size */ + if (size > mw_size) + return -EINVAL; + + mmio = ndev->self_mmio; + xlat_reg = ndev->xlat_reg->bar2_xlat + (idx * 0x10); + limit_reg = ndev->xlat_reg->bar2_limit + (idx * 0x10); + idx_reg = ndev->xlat_reg->bar2_idx + (idx * 0x2); + base = pci_resource_start(ndev->ntb.pdev, bar); + + /* Set the limit if supported, if size is not mw_size */ + if (limit_reg && size != mw_size) { + limit = base + size; + base_idx = __ilog2_u64(size); + } else { + limit = base + mw_size; + base_idx = __ilog2_u64(mw_size); + } + + + /* set and verify setting the translation address */ + iowrite64(addr, mmio + xlat_reg); + reg_val = ioread64(mmio + xlat_reg); + if (reg_val != addr) { + iowrite64(0, mmio + xlat_reg); + return -EIO; + } + + dev_dbg(&ntb->pdev->dev, "BAR %d IMXBASE: %#Lx\n", bar, reg_val); + + /* set and verify setting the limit */ + iowrite64(limit, mmio + limit_reg); + reg_val = ioread64(mmio + limit_reg); + if (reg_val != limit) { + iowrite64(base, mmio + limit_reg); + iowrite64(0, mmio + xlat_reg); + return -EIO; + } + + dev_dbg(&ntb->pdev->dev, "BAR %d IMXLMT: %#Lx\n", bar, reg_val); + + iowrite16(base_idx, mmio + idx_reg); + reg_val16 = ioread16(mmio + idx_reg); + if (reg_val16 != base_idx) { + iowrite64(base, mmio + limit_reg); + iowrite64(0, mmio + xlat_reg); + iowrite16(0, mmio + idx_reg); + return -EIO; + } + + dev_dbg(&ntb->pdev->dev, "BAR %d IMBASEIDX: %#x\n", bar, reg_val16); + + return 0; +} + +static int intel_ntb4_link_enable(struct ntb_dev *ntb, + enum ntb_speed max_speed, enum ntb_width max_width) +{ + struct intel_ntb_dev *ndev; + u32 ntb_ctl, ppd0; + u16 lnkctl; + + ndev = container_of(ntb, struct intel_ntb_dev, ntb); + + dev_dbg(&ntb->pdev->dev, + "Enabling link with max_speed %d max_width %d\n", + max_speed, max_width); + + if (max_speed != NTB_SPEED_AUTO) + dev_dbg(&ntb->pdev->dev, + "ignoring max_speed %d\n", max_speed); + if (max_width != NTB_WIDTH_AUTO) + dev_dbg(&ntb->pdev->dev, + "ignoring max_width %d\n", max_width); + + ntb_ctl = NTB_CTL_E2I_BAR23_SNOOP | NTB_CTL_I2E_BAR23_SNOOP; + ntb_ctl |= NTB_CTL_E2I_BAR45_SNOOP | NTB_CTL_I2E_BAR45_SNOOP; + iowrite32(ntb_ctl, ndev->self_mmio + ndev->reg->ntb_ctl); + + lnkctl = ioread16(ndev->self_mmio + GEN4_LINK_CTRL_OFFSET); + lnkctl &= ~GEN4_LINK_CTRL_LINK_DISABLE; + iowrite16(lnkctl, ndev->self_mmio + GEN4_LINK_CTRL_OFFSET); + + /* start link training in PPD0 */ + ppd0 = ioread32(ndev->self_mmio + GEN4_PPD0_OFFSET); + ppd0 |= GEN4_PPD_LINKTRN; + iowrite32(ppd0, ndev->self_mmio + GEN4_PPD0_OFFSET); + + /* make sure link training has started */ + ppd0 = ioread32(ndev->self_mmio + GEN4_PPD0_OFFSET); + if (!(ppd0 & GEN4_PPD_LINKTRN)) { + dev_warn(&ntb->pdev->dev, "Link is not training\n"); + return -ENXIO; + } + + ndev->dev_up = 1; + + return 0; +} + +int intel_ntb4_link_disable(struct ntb_dev *ntb) +{ + struct intel_ntb_dev *ndev; + u32 ntb_cntl; + u16 lnkctl; + + ndev = container_of(ntb, struct intel_ntb_dev, ntb); + + dev_dbg(&ntb->pdev->dev, "Disabling link\n"); + + /* clear the snoop bits */ + ntb_cntl = ioread32(ndev->self_mmio + ndev->reg->ntb_ctl); + ntb_cntl &= ~(NTB_CTL_E2I_BAR23_SNOOP | NTB_CTL_I2E_BAR23_SNOOP); + ntb_cntl &= ~(NTB_CTL_E2I_BAR45_SNOOP | NTB_CTL_I2E_BAR45_SNOOP); + iowrite32(ntb_cntl, ndev->self_mmio + ndev->reg->ntb_ctl); + + lnkctl = ioread16(ndev->self_mmio + GEN4_LINK_CTRL_OFFSET); + lnkctl |= GEN4_LINK_CTRL_LINK_DISABLE; + iowrite16(lnkctl, ndev->self_mmio + GEN4_LINK_CTRL_OFFSET); + + ndev->dev_up = 0; + + return 0; +} + +const struct ntb_dev_ops intel_ntb4_ops = { + .mw_count = intel_ntb_mw_count, + .mw_get_align = intel_ntb_mw_get_align, + .mw_set_trans = intel_ntb4_mw_set_trans, + .peer_mw_count = intel_ntb_peer_mw_count, + .peer_mw_get_addr = intel_ntb_peer_mw_get_addr, + .link_is_up = intel_ntb_link_is_up, + .link_enable = intel_ntb4_link_enable, + .link_disable = intel_ntb4_link_disable, + .db_valid_mask = intel_ntb_db_valid_mask, + .db_vector_count = intel_ntb_db_vector_count, + .db_vector_mask = intel_ntb_db_vector_mask, + .db_read = intel_ntb3_db_read, + .db_clear = intel_ntb3_db_clear, + .db_set_mask = intel_ntb_db_set_mask, + .db_clear_mask = intel_ntb_db_clear_mask, + .peer_db_addr = intel_ntb3_peer_db_addr, + .peer_db_set = intel_ntb3_peer_db_set, + .spad_is_unsafe = intel_ntb_spad_is_unsafe, + .spad_count = intel_ntb_spad_count, + .spad_read = intel_ntb_spad_read, + .spad_write = intel_ntb_spad_write, + .peer_spad_addr = intel_ntb_peer_spad_addr, + .peer_spad_read = intel_ntb_peer_spad_read, + .peer_spad_write = intel_ntb_peer_spad_write, +}; + diff --git a/drivers/ntb/hw/intel/ntb_hw_gen4.h b/drivers/ntb/hw/intel/ntb_hw_gen4.h new file mode 100644 index 000000000000..7d7a82e8518d --- /dev/null +++ b/drivers/ntb/hw/intel/ntb_hw_gen4.h @@ -0,0 +1,87 @@ +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */ +/* Copyright(c) 2020 Intel Corporation. All rights reserved. */ +#ifndef _NTB_INTEL_GEN4_H_ +#define _NTB_INTEL_GEN4_H_ + +#include "ntb_hw_intel.h" + +/* Intel Gen4 NTB hardware */ +/* PCIe config space */ +#define GEN4_IMBAR23SZ_OFFSET 0x00c4 +#define GEN4_IMBAR45SZ_OFFSET 0x00c5 +#define GEN4_EMBAR23SZ_OFFSET 0x00c6 +#define GEN4_EMBAR45SZ_OFFSET 0x00c7 +#define GEN4_DEVCTRL_OFFSET 0x0048 +#define GEN4_DEVSTS_OFFSET 0x004a +#define GEN4_UNCERRSTS_OFFSET 0x0104 +#define GEN4_CORERRSTS_OFFSET 0x0110 + +/* BAR0 MMIO */ +#define GEN4_NTBCNTL_OFFSET 0x0000 +#define GEN4_IM23XBASE_OFFSET 0x0010 /* IMBAR1XBASE */ +#define GEN4_IM23XLMT_OFFSET 0x0018 /* IMBAR1XLMT */ +#define GEN4_IM45XBASE_OFFSET 0x0020 /* IMBAR2XBASE */ +#define GEN4_IM45XLMT_OFFSET 0x0028 /* IMBAR2XLMT */ +#define GEN4_IM_INT_STATUS_OFFSET 0x0040 +#define GEN4_IM_INT_DISABLE_OFFSET 0x0048 +#define GEN4_INTVEC_OFFSET 0x0050 /* 0-32 vecs */ +#define GEN4_IM23XBASEIDX_OFFSET 0x0074 +#define GEN4_IM45XBASEIDX_OFFSET 0x0076 +#define GEN4_IM_SPAD_OFFSET 0x0080 /* 0-15 SPADs */ +#define GEN4_IM_SPAD_SEM_OFFSET 0x00c0 /* SPAD hw semaphore */ +#define GEN4_IM_SPAD_STICKY_OFFSET 0x00c4 /* sticky SPAD */ +#define GEN4_IM_DOORBELL_OFFSET 0x0100 /* 0-31 doorbells */ +#define GEN4_EM_SPAD_OFFSET 0x8080 +/* note, link status is now in MMIO and not config space for NTB */ +#define GEN4_LINK_CTRL_OFFSET 0xb050 +#define GEN4_LINK_STATUS_OFFSET 0xb052 +#define GEN4_PPD0_OFFSET 0xb0d4 +#define GEN4_PPD1_OFFSET 0xb4c0 +#define GEN4_LTSSMSTATEJMP 0xf040 + +#define GEN4_PPD_CLEAR_TRN 0x0001 +#define GEN4_PPD_LINKTRN 0x0008 +#define GEN4_PPD_CONN_MASK 0x0300 +#define GEN4_PPD_CONN_B2B 0x0200 +#define GEN4_PPD_DEV_MASK 0x1000 +#define GEN4_PPD_DEV_DSD 0x1000 +#define GEN4_PPD_DEV_USD 0x0000 +#define GEN4_LINK_CTRL_LINK_DISABLE 0x0010 + +#define GEN4_SLOTSTS 0xb05a +#define GEN4_SLOTSTS_DLLSCS 0x100 + +#define GEN4_PPD_TOPO_MASK (GEN4_PPD_CONN_MASK | GEN4_PPD_DEV_MASK) +#define GEN4_PPD_TOPO_B2B_USD (GEN4_PPD_CONN_B2B | GEN4_PPD_DEV_USD) +#define GEN4_PPD_TOPO_B2B_DSD (GEN4_PPD_CONN_B2B | GEN4_PPD_DEV_DSD) + +#define GEN4_DB_COUNT 32 +#define GEN4_DB_LINK 32 +#define GEN4_DB_LINK_BIT BIT_ULL(GEN4_DB_LINK) +#define GEN4_DB_MSIX_VECTOR_COUNT 33 +#define GEN4_DB_MSIX_VECTOR_SHIFT 1 +#define GEN4_DB_TOTAL_SHIFT 33 +#define GEN4_SPAD_COUNT 16 + +#define NTB_CTL_E2I_BAR23_SNOOP 0x000004 +#define NTB_CTL_E2I_BAR23_NOSNOOP 0x000008 +#define NTB_CTL_I2E_BAR23_SNOOP 0x000010 +#define NTB_CTL_I2E_BAR23_NOSNOOP 0x000020 +#define NTB_CTL_E2I_BAR45_SNOOP 0x000040 +#define NTB_CTL_E2I_BAR45_NOSNOO 0x000080 +#define NTB_CTL_I2E_BAR45_SNOOP 0x000100 +#define NTB_CTL_I2E_BAR45_NOSNOOP 0x000200 +#define NTB_CTL_BUSNO_DIS_INC 0x000400 +#define NTB_CTL_LINK_DOWN 0x010000 + +#define NTB_SJC_FORCEDETECT 0x000004 + +ssize_t ndev_ntb4_debugfs_read(struct file *filp, char __user *ubuf, + size_t count, loff_t *offp); +int gen4_init_dev(struct intel_ntb_dev *ndev); +ssize_t ndev_ntb4_debugfs_read(struct file *filp, char __user *ubuf, + size_t count, loff_t *offp); + +extern const struct ntb_dev_ops intel_ntb4_ops; + +#endif diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.h b/drivers/ntb/hw/intel/ntb_hw_intel.h index c49ff8970ce3..4a7d8411b807 100644 --- a/drivers/ntb/hw/intel/ntb_hw_intel.h +++ b/drivers/ntb/hw/intel/ntb_hw_intel.h @@ -71,6 +71,7 @@ #define PCI_DEVICE_ID_INTEL_NTB_PS_BDX 0x6F0E #define PCI_DEVICE_ID_INTEL_NTB_SS_BDX 0x6F0F #define PCI_DEVICE_ID_INTEL_NTB_B2B_SKX 0x201C +#define PCI_DEVICE_ID_INTEL_NTB_B2B_ICX 0x347e
/* Ntb control and link status */ #define NTB_CTL_CFG_LOCK BIT(0) @@ -119,6 +120,7 @@ struct intel_ntb_xlat_reg { unsigned long bar0_base; unsigned long bar2_xlat; unsigned long bar2_limit; + unsigned short bar2_idx; };
struct intel_b2b_addr { @@ -181,6 +183,9 @@ struct intel_ntb_dev {
struct dentry *debugfs_dir; struct dentry *debugfs_info; + + /* gen4 entries */ + int dev_up; };
#define ntb_ndev(__ntb) container_of(__ntb, struct intel_ntb_dev, ntb) @@ -247,4 +252,11 @@ static inline void _iowrite64(u64 val, void __iomem *mmio) #endif #endif
+static inline int pdev_is_gen4(struct pci_dev *pdev) +{ + if (pdev->device == PCI_DEVICE_ID_INTEL_NTB_B2B_ICX) + return 1; + + return 0; +} #endif
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.8-rc1 commit 893733c58d59f43e4c5219bf9ab9e9d39bb14289 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit: 893733c58d59f43e4c5219bf9ab9e9d39bb14289 upstream.
Backport summary: Backport to kernel 4.19.57 for 4th Gen Intel NTB support
intel_ntb4_link_disable() missing static declaration.
Reported-by: kbuild test robot lkp@intel.com Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Jon Mason jdmason@kudzu.us Signed-off-by: Zheng Wu wu.zheng@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/ntb/hw/intel/ntb_hw_gen4.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen4.c b/drivers/ntb/hw/intel/ntb_hw_gen4.c index aab614099d34..70c759a95bfe 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen4.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen4.c @@ -446,7 +446,7 @@ static int intel_ntb4_link_enable(struct ntb_dev *ntb, return 0; }
-int intel_ntb4_link_disable(struct ntb_dev *ntb) +static int intel_ntb4_link_disable(struct ntb_dev *ntb) { struct intel_ntb_dev *ndev; u32 ntb_cntl;
From: Dave Jiang dave.jiang@intel.com
mainline inclusion from mainline-v5.8-rc1 commit 134a86545c6072c049a2c6e77d6843e650df511d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit: 134a86545c6072c049a2c6e77d6843e650df511d upstream.
Backport summary: Backport to kernel 4.19.57 for 4th Gen Intel NTB support
Add NTB_HWERR_BAR_ALIGN hw errata flag to work around issue where the aligment for the XLAT base must be BAR size aligned rather than 4k page aligned. On ICX platform, the XLAT base can be 4k page size aligned rather than BAR size aligned unlike the previous gen Intel NTB. However, a silicon errata prevented this from working as expected and a workaround is introduced to resolve the issue.
Signed-off-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Jon Mason jdmason@kudzu.us Signed-off-by: Zheng Wu wu.zheng@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/ntb/hw/intel/ntb_hw_gen1.h | 1 + drivers/ntb/hw/intel/ntb_hw_gen4.c | 78 +++++++++++++++++++++++++----- drivers/ntb/hw/intel/ntb_hw_gen4.h | 13 +++++ 3 files changed, 79 insertions(+), 13 deletions(-)
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.h b/drivers/ntb/hw/intel/ntb_hw_gen1.h index 544cf5c06f4d..1b759942d8af 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen1.h +++ b/drivers/ntb/hw/intel/ntb_hw_gen1.h @@ -140,6 +140,7 @@ #define NTB_HWERR_SB01BASE_LOCKUP BIT_ULL(1) #define NTB_HWERR_B2BDOORBELL_BIT14 BIT_ULL(2) #define NTB_HWERR_MSIX_VECTOR32_BAD BIT_ULL(3) +#define NTB_HWERR_BAR_ALIGN BIT_ULL(4)
extern struct intel_b2b_addr xeon_b2b_usd_addr; extern struct intel_b2b_addr xeon_b2b_dsd_addr; diff --git a/drivers/ntb/hw/intel/ntb_hw_gen4.c b/drivers/ntb/hw/intel/ntb_hw_gen4.c index 70c759a95bfe..bc4541cbf8c6 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen4.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen4.c @@ -177,6 +177,9 @@ int gen4_init_dev(struct intel_ntb_dev *ndev)
ndev->reg = &gen4_reg;
+ if (pdev_is_ICX(pdev)) + ndev->hwerr_flags |= NTB_HWERR_BAR_ALIGN; + ppd1 = ioread32(ndev->self_mmio + GEN4_PPD1_OFFSET); ndev->ntb.topo = gen4_ppd_topo(ndev, ppd1); dev_dbg(&pdev->dev, "ppd %#x topo %s\n", ppd1, @@ -342,9 +345,14 @@ static int intel_ntb4_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx, else mw_size = bar_size;
- /* hardware requires that addr is aligned to bar size */ - if (addr & (bar_size - 1)) - return -EINVAL; + if (ndev->hwerr_flags & NTB_HWERR_BAR_ALIGN) { + /* hardware requires that addr is aligned to bar size */ + if (addr & (bar_size - 1)) + return -EINVAL; + } else { + if (addr & (PAGE_SIZE - 1)) + return -EINVAL; + }
/* make sure the range fits in the usable mw size */ if (size > mw_size) @@ -353,7 +361,6 @@ static int intel_ntb4_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx, mmio = ndev->self_mmio; xlat_reg = ndev->xlat_reg->bar2_xlat + (idx * 0x10); limit_reg = ndev->xlat_reg->bar2_limit + (idx * 0x10); - idx_reg = ndev->xlat_reg->bar2_idx + (idx * 0x2); base = pci_resource_start(ndev->ntb.pdev, bar);
/* Set the limit if supported, if size is not mw_size */ @@ -387,16 +394,19 @@ static int intel_ntb4_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx,
dev_dbg(&ntb->pdev->dev, "BAR %d IMXLMT: %#Lx\n", bar, reg_val);
- iowrite16(base_idx, mmio + idx_reg); - reg_val16 = ioread16(mmio + idx_reg); - if (reg_val16 != base_idx) { - iowrite64(base, mmio + limit_reg); - iowrite64(0, mmio + xlat_reg); - iowrite16(0, mmio + idx_reg); - return -EIO; + if (ndev->hwerr_flags & NTB_HWERR_BAR_ALIGN) { + idx_reg = ndev->xlat_reg->bar2_idx + (idx * 0x2); + iowrite16(base_idx, mmio + idx_reg); + reg_val16 = ioread16(mmio + idx_reg); + if (reg_val16 != base_idx) { + iowrite64(base, mmio + limit_reg); + iowrite64(0, mmio + xlat_reg); + iowrite16(0, mmio + idx_reg); + return -EIO; + } + dev_dbg(&ntb->pdev->dev, "BAR %d IMBASEIDX: %#x\n", bar, reg_val16); }
- dev_dbg(&ntb->pdev->dev, "BAR %d IMBASEIDX: %#x\n", bar, reg_val16);
return 0; } @@ -471,9 +481,51 @@ static int intel_ntb4_link_disable(struct ntb_dev *ntb) return 0; }
+static int intel_ntb4_mw_get_align(struct ntb_dev *ntb, int pidx, int idx, + resource_size_t *addr_align, + resource_size_t *size_align, + resource_size_t *size_max) +{ + struct intel_ntb_dev *ndev = ntb_ndev(ntb); + resource_size_t bar_size, mw_size; + int bar; + + if (pidx != NTB_DEF_PEER_IDX) + return -EINVAL; + + if (idx >= ndev->b2b_idx && !ndev->b2b_off) + idx += 1; + + bar = ndev_mw_to_bar(ndev, idx); + if (bar < 0) + return bar; + + bar_size = pci_resource_len(ndev->ntb.pdev, bar); + + if (idx == ndev->b2b_idx) + mw_size = bar_size - ndev->b2b_off; + else + mw_size = bar_size; + + if (addr_align) { + if (ndev->hwerr_flags & NTB_HWERR_BAR_ALIGN) + *addr_align = pci_resource_len(ndev->ntb.pdev, bar); + else + *addr_align = PAGE_SIZE; + } + + if (size_align) + *size_align = 1; + + if (size_max) + *size_max = mw_size; + + return 0; +} + const struct ntb_dev_ops intel_ntb4_ops = { .mw_count = intel_ntb_mw_count, - .mw_get_align = intel_ntb_mw_get_align, + .mw_get_align = intel_ntb4_mw_get_align, .mw_set_trans = intel_ntb4_mw_set_trans, .peer_mw_count = intel_ntb_peer_mw_count, .peer_mw_get_addr = intel_ntb_peer_mw_get_addr, diff --git a/drivers/ntb/hw/intel/ntb_hw_gen4.h b/drivers/ntb/hw/intel/ntb_hw_gen4.h index 7d7a82e8518d..a868c788de02 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen4.h +++ b/drivers/ntb/hw/intel/ntb_hw_gen4.h @@ -5,6 +5,10 @@
#include "ntb_hw_intel.h"
+/* Supported PCI device revision range for ICX */ +#define PCI_DEVICE_REVISION_ICX_MIN 0x2 +#define PCI_DEVICE_REVISION_ICX_MAX 0xF + /* Intel Gen4 NTB hardware */ /* PCIe config space */ #define GEN4_IMBAR23SZ_OFFSET 0x00c4 @@ -84,4 +88,13 @@ ssize_t ndev_ntb4_debugfs_read(struct file *filp, char __user *ubuf,
extern const struct ntb_dev_ops intel_ntb4_ops;
+static inline int pdev_is_ICX(struct pci_dev *pdev) +{ + if (pdev_is_gen4(pdev) && + pdev->revision >= PCI_DEVICE_REVISION_ICX_MIN && + pdev->revision <= PCI_DEVICE_REVISION_ICX_MAX) + return 1; + return 0; +} + #endif
From: Lukas Wunner lukas@wunner.de
mainline inclusion from mainline-v4.20-rc1 commit a50ac6bfd6042b16e0de4ac3264c407e678c9b10 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 89ee9f768003 ("PCI: Add device disconnected state") iterates over the devices on a parent bus, marks each as disconnected, then marks each device's children as disconnected using pci_walk_bus().
The same can be achieved more succinctly by calling pci_walk_bus() on the parent bus. Moreover, this does not need to wait until acquiring pci_lock_rescan_remove(), so move it out of that critical section.
The critical section in err.c contains a pci_dev_get() / pci_dev_put() pair which was apparently copy-pasted from pciehp_pci.c. In the latter it serves the purpose of holding the struct pci_dev in place until the Command register is updated. err.c doesn't do anything like that, hence the pair is unnecessary. Remove it.
Signed-off-by: Lukas Wunner lukas@wunner.de Signed-off-by: Bjorn Helgaas bhelgaas@google.com Cc: Keith Busch keith.busch@intel.com Cc: Oza Pawandeep poza@codeaurora.org Cc: Sinan Kaya okaya@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/hotplug/pciehp_pci.c | 9 +++------ drivers/pci/pcie/err.c | 8 ++------ 2 files changed, 5 insertions(+), 12 deletions(-)
diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c index 18f83e554c73..0322bd4f0a7a 100644 --- a/drivers/pci/hotplug/pciehp_pci.c +++ b/drivers/pci/hotplug/pciehp_pci.c @@ -91,6 +91,9 @@ void pciehp_unconfigure_device(struct slot *p_slot, bool presence) ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n", __func__, pci_domain_nr(parent), parent->number);
+ if (!presence) + pci_walk_bus(parent, pci_dev_set_disconnected, NULL); + pci_lock_rescan_remove();
/* @@ -102,12 +105,6 @@ void pciehp_unconfigure_device(struct slot *p_slot, bool presence) list_for_each_entry_safe_reverse(dev, temp, &parent->devices, bus_list) { pci_dev_get(dev); - if (!presence) { - pci_dev_set_disconnected(dev, NULL); - if (pci_has_subordinate(dev)) - pci_walk_bus(dev->subordinate, - pci_dev_set_disconnected, NULL); - } pci_stop_and_remove_bus_device(dev); /* * Ensure that no new Requests will be generated from diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 571ac3d9c326..1cae9d2c365d 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -241,17 +241,13 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service) udev = dev->bus->self;
parent = udev->subordinate; + pci_walk_bus(parent, pci_dev_set_disconnected, NULL); + pci_lock_rescan_remove(); pci_dev_get(dev); list_for_each_entry_safe_reverse(pdev, temp, &parent->devices, bus_list) { - pci_dev_get(pdev); - pci_dev_set_disconnected(pdev, NULL); - if (pci_has_subordinate(pdev)) - pci_walk_bus(pdev->subordinate, - pci_dev_set_disconnected, NULL); pci_stop_and_remove_bus_device(pdev); - pci_dev_put(pdev); }
result = reset_link(udev, service);
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 874b3251113a1e2cbe79c24994dc03fe4fe4b99b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The port's config space may be cleared after a link reset, which wipes out the bridge's bus and memory windows. Restore the config space that was saved during probe so we can access downstream devices.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/portdrv_pci.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index 23a5a0c2c3fe..17256733fa43 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -146,6 +146,13 @@ static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev, return PCI_ERS_RESULT_CAN_RECOVER; }
+static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev) +{ + pci_restore_state(dev); + pci_save_state(dev); + return PCI_ERS_RESULT_RECOVERED; +} + static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev) { return PCI_ERS_RESULT_RECOVERED; @@ -185,6 +192,7 @@ static const struct pci_device_id port_pci_ids[] = { {
static const struct pci_error_handlers pcie_portdrv_err_handler = { .error_detected = pcie_portdrv_error_detected, + .slot_reset = pcie_portdrv_slot_reset, .mmio_enabled = pcie_portdrv_mmio_enabled, .resume = pcie_portdrv_err_resume, };
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 4f802170a861265680cad03f47b19c4c3a137052 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 4f802170a861265680cad03f47b19c4c3a137052 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
This patch provides DPC save and restore capabilities. This is necessary for the driver to observe DPC events in the event the configuration space needs to be restored after a reset.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org (cherry picked from commit 4f802170a861265680cad03f47b19c4c3a137052) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.c | 2 ++ drivers/pci/pci.h | 8 ++++++ drivers/pci/pcie/dpc.c | 61 +++++++++++++++++++++++++++++++++++++----- 3 files changed, 65 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 3f82033dc8a0..394b3d979955 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1308,6 +1308,7 @@ int pci_save_state(struct pci_dev *dev) if (i != 0) return i;
+ pci_save_dpc_state(dev); return pci_save_vc_state(dev); } EXPORT_SYMBOL(pci_save_state); @@ -1413,6 +1414,7 @@ void pci_restore_state(struct pci_dev *dev) pci_restore_ats_state(dev); pci_restore_vc_state(dev); pci_restore_rebar_state(dev); + pci_restore_dpc_state(dev);
pci_cleanup_aer_error_status_regs(dev);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index a10629ba980e..41a630f180e0 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -366,6 +366,14 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info); void aer_print_error(struct pci_dev *dev, struct aer_err_info *info); #endif /* CONFIG_PCIEAER */
+#ifdef CONFIG_PCIE_DPC +void pci_save_dpc_state(struct pci_dev *dev); +void pci_restore_dpc_state(struct pci_dev *dev); +#else +static inline void pci_save_dpc_state(struct pci_dev *dev) {} +static inline void pci_restore_dpc_state(struct pci_dev *dev) {} +#endif + #ifdef CONFIG_PCI_ATS void pci_restore_ats_state(struct pci_dev *dev); #else diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index 118b5bcae42e..5b6b821ff124 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -44,6 +44,58 @@ static const char * const rp_pio_error_string[] = { "Memory Request Completion Timeout", /* Bit Position 18 */ };
+static struct dpc_dev *to_dpc_dev(struct pci_dev *dev) +{ + struct device *device; + + device = pcie_port_find_device(dev, PCIE_PORT_SERVICE_DPC); + if (!device) + return NULL; + return get_service_data(to_pcie_device(device)); +} + +void pci_save_dpc_state(struct pci_dev *dev) +{ + struct dpc_dev *dpc; + struct pci_cap_saved_state *save_state; + u16 *cap; + + if (!pci_is_pcie(dev)) + return; + + dpc = to_dpc_dev(dev); + if (!dpc) + return; + + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC); + if (!save_state) + return; + + cap = (u16 *)&save_state->cap.data[0]; + pci_read_config_word(dev, dpc->cap_pos + PCI_EXP_DPC_CTL, cap); +} + +void pci_restore_dpc_state(struct pci_dev *dev) +{ + struct dpc_dev *dpc; + struct pci_cap_saved_state *save_state; + u16 *cap; + + if (!pci_is_pcie(dev)) + return; + + dpc = to_dpc_dev(dev); + if (!dpc) + return; + + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC); + if (!save_state) + return; + + cap = (u16 *)&save_state->cap.data[0]; + pci_write_config_word(dev, dpc->cap_pos + PCI_EXP_DPC_CTL, *cap); +} + static int dpc_wait_rp_inactive(struct dpc_dev *dpc) { unsigned long timeout = jiffies + HZ; @@ -67,18 +119,13 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc) static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev) { struct dpc_dev *dpc; - struct pcie_device *pciedev; - struct device *devdpc; - u16 cap;
/* * DPC disables the Link automatically in hardware, so it has * already been reset by the time we get here. */ - devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC); - pciedev = to_pcie_device(devdpc); - dpc = get_service_data(pciedev); + dpc = to_dpc_dev(pdev); cap = dpc->cap_pos;
/* @@ -284,6 +331,8 @@ static int dpc_probe(struct pcie_device *dev) FLAG(cap, PCI_EXP_DPC_CAP_POISONED_TLP), FLAG(cap, PCI_EXP_DPC_CAP_SW_TRIGGER), dpc->rp_log_size, FLAG(cap, PCI_EXP_DPC_CAP_DL_ACTIVE)); + + pci_add_ext_cap_save_buffer(pdev, PCI_EXT_CAP_ID_DPC, sizeof(u16)); return status; }
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit bdb5ac85777de67c909c9ad4327f03f7648b543f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
We don't need to be paranoid about the topology changing while handling an error. If the device has changed in a hotplug capable slot, we can rely on the presence detection handling to react to a changing topology.
Restore the fatal error handling behavior that existed before merging DPC with AER with 7e9084b36740 ("PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices").
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/PCI/pci-error-recovery.txt | 35 ++++------- drivers/pci/pci.h | 4 +- drivers/pci/pcie/aer.c | 12 ++-- drivers/pci/pcie/dpc.c | 4 +- drivers/pci/pcie/err.c | 75 ++---------------------- 5 files changed, 28 insertions(+), 102 deletions(-)
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt index 688b69121e82..0b6bb3ef449e 100644 --- a/Documentation/PCI/pci-error-recovery.txt +++ b/Documentation/PCI/pci-error-recovery.txt @@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error event will be platform-dependent, but will follow the general sequence described below.
-STEP 0: Error Event: ERR_NONFATAL +STEP 0: Error Event ------------------- A PCI bus error is detected by the PCI hardware. On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0xffffffff, @@ -228,7 +228,13 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform proceeds to STEP 4 (Slot Reset)
-STEP 3: Slot Reset +STEP 3: Link Reset +------------------ +The platform resets the link. This is a PCI-Express specific step +and is done whenever a fatal error has been detected that can be +"solved" by resetting the link. + +STEP 4: Slot Reset ------------------
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the @@ -314,7 +320,7 @@ Failure).
However, it probably should.
-STEP 4: Resume Operations +STEP 5: Resume Operations ------------------------- The platform will call the resume() callback on all affected device drivers if all drivers on the segment have returned @@ -326,7 +332,7 @@ a result code. At this point, if a new error happens, the platform will restart a new error recovery sequence.
-STEP 5: Permanent Failure +STEP 6: Permanent Failure ------------------------- A "permanent failure" has occurred, and the platform cannot recover the device. The platform will call error_detected() with a @@ -349,27 +355,6 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt for additional detail on real-life experience of the causes of software errors.
-STEP 0: Error Event: ERR_FATAL -------------------- -PCI bus error is detected by the PCI hardware. On powerpc, the slot is -isolated, in that all I/O is blocked: all reads return 0xffffffff, all -writes are ignored. - -STEP 1: Remove devices --------------------- -Platform removes the devices depending on the error agent, it could be -this port for all subordinates or upstream component (likely downstream -port) - -STEP 2: Reset link --------------------- -The platform resets the link. This is a PCI-Express specific step and is -done whenever a fatal error has been detected that can be "solved" by -resetting the link. - -STEP 3: Re-enumerate the devices --------------------- -Initiates the re-enumeration.
Conclusion; General Remarks --------------------------- diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 41a630f180e0..d7ada43cec93 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -451,8 +451,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev) #endif
/* PCI error reporting and recovery */ -void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service); -void pcie_do_nonfatal_recovery(struct pci_dev *dev); +void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, + u32 service);
bool pcie_wait_for_link(struct pci_dev *pdev, bool active); #ifdef CONFIG_PCIEASPM diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 76ea0959ff0c..dbd464399065 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1003,9 +1003,11 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) info->status); pci_aer_clear_device_status(dev); } else if (info->severity == AER_NONFATAL) - pcie_do_nonfatal_recovery(dev); + pcie_do_recovery(dev, pci_channel_io_normal, + PCIE_PORT_SERVICE_AER); else if (info->severity == AER_FATAL) - pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER); + pcie_do_recovery(dev, pci_channel_io_frozen, + PCIE_PORT_SERVICE_AER); pci_dev_put(dev); }
@@ -1041,9 +1043,11 @@ static void aer_recover_work_func(struct work_struct *work) } cper_print_aer(pdev, entry.severity, entry.regs); if (entry.severity == AER_NONFATAL) - pcie_do_nonfatal_recovery(pdev); + pcie_do_recovery(pdev, pci_channel_io_normal, + PCIE_PORT_SERVICE_AER); else if (entry.severity == AER_FATAL) - pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_AER); + pcie_do_recovery(pdev, pci_channel_io_frozen, + PCIE_PORT_SERVICE_AER); pci_dev_put(pdev); } } diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index 5b6b821ff124..f57736783c83 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -238,7 +238,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1; ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5; - dev_warn(dev, "DPC %s detected, remove downstream devices\n", + dev_warn(dev, "DPC %s detected\n", (reason == 0) ? "unmasked uncorrectable error" : (reason == 1) ? "ERR_NONFATAL" : (reason == 2) ? "ERR_FATAL" : @@ -258,7 +258,7 @@ static irqreturn_t dpc_handler(int irq, void *context) }
/* We configure DPC so it only triggers on ERR_FATAL */ - pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC); + pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);
return IRQ_HANDLED; } diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 1cae9d2c365d..a83ac6e41b83 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -220,77 +220,10 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev, return result_data.result; }
-/** - * pcie_do_fatal_recovery - handle fatal error recovery process - * @dev: pointer to a pci_dev data structure of agent detecting an error - * - * Invoked when an error is fatal. Once being invoked, removes the devices - * beneath this AER agent, followed by reset link e.g. secondary bus reset - * followed by re-enumeration of devices. - */ -void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service) -{ - struct pci_dev *udev; - struct pci_bus *parent; - struct pci_dev *pdev, *temp; - pci_ers_result_t result; - - if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) - udev = dev; - else - udev = dev->bus->self; - - parent = udev->subordinate; - pci_walk_bus(parent, pci_dev_set_disconnected, NULL); - - pci_lock_rescan_remove(); - pci_dev_get(dev); - list_for_each_entry_safe_reverse(pdev, temp, &parent->devices, - bus_list) { - pci_stop_and_remove_bus_device(pdev); - } - - result = reset_link(udev, service); - - if ((service == PCIE_PORT_SERVICE_AER) && - (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) { - /* - * If the error is reported by a bridge, we think this error - * is related to the downstream link of the bridge, so we - * do error recovery on all subordinates of the bridge instead - * of the bridge and clear the error status of the bridge. - */ - pci_aer_clear_fatal_status(dev); - pci_aer_clear_device_status(dev); - } - - if (result == PCI_ERS_RESULT_RECOVERED) { - if (pcie_wait_for_link(udev, true)) - pci_rescan_bus(udev->bus); - pci_info(dev, "Device recovery from fatal error successful\n"); - } else { - pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); - pci_info(dev, "Device recovery from fatal error failed\n"); - } - - pci_dev_put(dev); - pci_unlock_rescan_remove(); -} - -/** - * pcie_do_nonfatal_recovery - handle nonfatal error recovery process - * @dev: pointer to a pci_dev data structure of agent detecting an error - * - * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast - * error detected message to all downstream drivers within a hierarchy in - * question and return the returned code. - */ -void pcie_do_nonfatal_recovery(struct pci_dev *dev) +void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, + u32 service) { pci_ers_result_t status; - enum pci_channel_state state; - - state = pci_channel_io_normal;
/* * Error recovery runs on all subordinates of the first downstream port. @@ -305,6 +238,10 @@ void pcie_do_nonfatal_recovery(struct pci_dev *dev) "error_detected", report_error_detected);
+ if (state == pci_channel_io_frozen && + reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) + goto failed; + if (status == PCI_ERS_RESULT_CAN_RECOVER) status = broadcast_error_message(dev, state,
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 542aeb9c8f930e4099432cb0bec17b92c0175e08 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
There is no point in having a generic broadcast function if it needs to have special cases for each callback it broadcasts.
Abstract the error broadcast to only the necessary information and removes the now unnecessary helper to walk the bus.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/err.c | 107 +++++++++++++++-------------------------- 1 file changed, 38 insertions(+), 69 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index a83ac6e41b83..210995849746 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -19,11 +19,6 @@ #include "portdrv.h" #include "../pci.h"
-struct aer_broadcast_data { - enum pci_channel_state state; - enum pci_ers_result result; -}; - static pci_ers_result_t merge_result(enum pci_ers_result orig, enum pci_ers_result new) { @@ -49,16 +44,15 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig, return orig; }
-static int report_error_detected(struct pci_dev *dev, void *data) +static int report_error_detected(struct pci_dev *dev, + enum pci_channel_state state, + enum pci_ers_result *result) { pci_ers_result_t vote; const struct pci_error_handlers *err_handler; - struct aer_broadcast_data *result_data; - - result_data = (struct aer_broadcast_data *) data;
device_lock(&dev->dev); - dev->error_state = result_data->state; + dev->error_state = state;
if (!dev->driver || !dev->driver->err_handler || @@ -77,22 +71,29 @@ static int report_error_detected(struct pci_dev *dev, void *data) } } else { err_handler = dev->driver->err_handler; - vote = err_handler->error_detected(dev, result_data->state); + vote = err_handler->error_detected(dev, state); pci_uevent_ers(dev, PCI_ERS_RESULT_NONE); }
- result_data->result = merge_result(result_data->result, vote); + *result = merge_result(*result, vote); device_unlock(&dev->dev); return 0; }
+static int report_frozen_detected(struct pci_dev *dev, void *data) +{ + return report_error_detected(dev, pci_channel_io_frozen, data); +} + +static int report_normal_detected(struct pci_dev *dev, void *data) +{ + return report_error_detected(dev, pci_channel_io_normal, data); +} + static int report_mmio_enabled(struct pci_dev *dev, void *data) { - pci_ers_result_t vote; + pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; - struct aer_broadcast_data *result_data; - - result_data = (struct aer_broadcast_data *) data;
device_lock(&dev->dev); if (!dev->driver || @@ -102,7 +103,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
err_handler = dev->driver->err_handler; vote = err_handler->mmio_enabled(dev); - result_data->result = merge_result(result_data->result, vote); + *result = merge_result(*result, vote); out: device_unlock(&dev->dev); return 0; @@ -110,11 +111,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
static int report_slot_reset(struct pci_dev *dev, void *data) { - pci_ers_result_t vote; + pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; - struct aer_broadcast_data *result_data; - - result_data = (struct aer_broadcast_data *) data;
device_lock(&dev->dev); if (!dev->driver || @@ -124,7 +122,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
err_handler = dev->driver->err_handler; vote = err_handler->slot_reset(dev); - result_data->result = merge_result(result_data->result, vote); + *result = merge_result(*result, vote); out: device_unlock(&dev->dev); return 0; @@ -191,39 +189,11 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service) return status; }
-/** - * broadcast_error_message - handle message broadcast to downstream drivers - * @dev: pointer to from where in a hierarchy message is broadcasted down - * @state: error state - * @error_mesg: message to print - * @cb: callback to be broadcasted - * - * Invoked during error recovery process. Once being invoked, the content - * of error severity will be broadcasted to all downstream drivers in a - * hierarchy in question. - */ -static pci_ers_result_t broadcast_error_message(struct pci_dev *dev, - enum pci_channel_state state, - char *error_mesg, - int (*cb)(struct pci_dev *, void *)) -{ - struct aer_broadcast_data result_data; - - pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg); - result_data.state = state; - if (cb == report_error_detected) - result_data.result = PCI_ERS_RESULT_CAN_RECOVER; - else - result_data.result = PCI_ERS_RESULT_RECOVERED; - - pci_walk_bus(dev->subordinate, cb, &result_data); - return result_data.result; -} - void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, u32 service) { - pci_ers_result_t status; + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER; + struct pci_bus *bus;
/* * Error recovery runs on all subordinates of the first downstream port. @@ -232,21 +202,23 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT || pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)) dev = dev->bus->self; + bus = dev->subordinate;
- status = broadcast_error_message(dev, - state, - "error_detected", - report_error_detected); + pci_dbg(dev, "broadcast error_detected message\n"); + if (state == pci_channel_io_frozen) + pci_walk_bus(bus, report_frozen_detected, &status); + else + pci_walk_bus(bus, report_normal_detected, &status);
if (state == pci_channel_io_frozen && reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) goto failed;
- if (status == PCI_ERS_RESULT_CAN_RECOVER) - status = broadcast_error_message(dev, - state, - "mmio_enabled", - report_mmio_enabled); + if (status == PCI_ERS_RESULT_CAN_RECOVER) { + status = PCI_ERS_RESULT_RECOVERED; + pci_dbg(dev, "broadcast mmio_enabled message\n"); + pci_walk_bus(bus, report_mmio_enabled, &status); + }
if (status == PCI_ERS_RESULT_NEED_RESET) { /* @@ -254,19 +226,16 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, * functions to reset slot before calling * drivers' slot_reset callbacks? */ - status = broadcast_error_message(dev, - state, - "slot_reset", - report_slot_reset); + status = PCI_ERS_RESULT_RECOVERED; + pci_dbg(dev, "broadcast slot_reset message\n"); + pci_walk_bus(bus, report_slot_reset, &status); }
if (status != PCI_ERS_RESULT_RECOVERED) goto failed;
- broadcast_error_message(dev, - state, - "resume", - report_resume); + pci_dbg(dev, "broadcast resume message\n"); + pci_walk_bus(bus, report_resume, &status);
pci_aer_clear_device_status(dev); pci_cleanup_aer_uncorrect_error_status(dev);
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 7b42d97e99d3a2babffd1b3456ded08b54981538 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7b42d97e99d3a2babffd1b3456ded08b54981538 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
A device still participates in error recovery even if it doesn't have the error callbacks.
Always provide the status for user event watchers.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org (cherry picked from commit 7b42d97e99d3a2babffd1b3456ded08b54981538) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/err.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 210995849746..a588747d9b2e 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -72,9 +72,8 @@ static int report_error_detected(struct pci_dev *dev, } else { err_handler = dev->driver->err_handler; vote = err_handler->error_detected(dev, state); - pci_uevent_ers(dev, PCI_ERS_RESULT_NONE); } - + pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(&dev->dev); return 0; @@ -142,8 +141,8 @@ static int report_resume(struct pci_dev *dev, void *data)
err_handler = dev->driver->err_handler; err_handler->resume(dev); - pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED); out: + pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED); device_unlock(&dev->dev); return 0; }
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit a6bd101b8f84f9b98768e9ab1e418c239e2e669f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Bring surprise removals and permanent failures together so we no longer need separate flags. The implementation enforces that error handling will not be able to override a surprise removal's permanent channel failure.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.h | 60 ++++++++++++++++++++++++++++++++++++++---- drivers/pci/pcie/err.c | 10 +++---- 2 files changed, 59 insertions(+), 11 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index d7ada43cec93..8398071edd46 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -313,21 +313,71 @@ struct pci_sriov { KABI_RESERVE(8) };
-/* pci_dev priv_flags */ -#define PCI_DEV_DISCONNECTED 0 -#define PCI_DEV_ADDED 1 +/** + * pci_dev_set_io_state - Set the new error state if possible. + * + * @dev - pci device to set new error_state + * @new - the state we want dev to be in + * + * Must be called with device_lock held. + * + * Returns true if state has been changed to the requested state. + */ +static inline bool pci_dev_set_io_state(struct pci_dev *dev, + pci_channel_state_t new) +{ + bool changed = false; + + device_lock_assert(&dev->dev); + switch (new) { + case pci_channel_io_perm_failure: + switch (dev->error_state) { + case pci_channel_io_frozen: + case pci_channel_io_normal: + case pci_channel_io_perm_failure: + changed = true; + break; + } + break; + case pci_channel_io_frozen: + switch (dev->error_state) { + case pci_channel_io_frozen: + case pci_channel_io_normal: + changed = true; + break; + } + break; + case pci_channel_io_normal: + switch (dev->error_state) { + case pci_channel_io_frozen: + case pci_channel_io_normal: + changed = true; + break; + } + break; + } + if (changed) + dev->error_state = new; + return changed; +}
static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) { - set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags); + device_lock(&dev->dev); + pci_dev_set_io_state(dev, pci_channel_io_perm_failure); + device_unlock(&dev->dev); + return 0; }
static inline bool pci_dev_is_disconnected(const struct pci_dev *dev) { - return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags); + return dev->error_state == pci_channel_io_perm_failure; }
+/* pci_dev priv_flags */ +#define PCI_DEV_ADDED 0 + static inline void pci_dev_assign_added(struct pci_dev *dev, bool added) { assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added); diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index a588747d9b2e..72ddb07eb23c 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,9 +52,8 @@ static int report_error_detected(struct pci_dev *dev, const struct pci_error_handlers *err_handler;
device_lock(&dev->dev); - dev->error_state = state; - - if (!dev->driver || + if (!pci_dev_set_io_state(dev, state) || + !dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* @@ -132,9 +131,8 @@ static int report_resume(struct pci_dev *dev, void *data) const struct pci_error_handlers *err_handler;
device_lock(&dev->dev); - dev->error_state = pci_channel_io_normal; - - if (!dev->driver || + if (!pci_dev_set_io_state(dev, pci_channel_io_normal) || + !dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->resume) goto out;
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit f0157160b359b1d263ee9d4e0a435a7ad85bbcea category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit f0157160b359b1d263ee9d4e0a435a7ad85bbcea upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The spec has timing requirements when waiting for a link to become active after a conventional reset. Implement those hard delays when waiting for an active link so pciehp and dpc drivers don't need to duplicate this.
For devices that don't support data link layer active reporting, wait the fixed time recommended by the PCIe spec.
Signed-off-by: Keith Busch keith.busch@intel.com [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Sinan Kaya okaya@kernel.org (cherry picked from commit f0157160b359b1d263ee9d4e0a435a7ad85bbcea) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com
Conflicts: drivers/pci/hotplug/pciehp.h Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/hotplug/pciehp_hpc.c | 22 ++------------------- drivers/pci/pci.c | 33 ++++++++++++++++++++++++++------ drivers/pci/pcie/dpc.c | 4 +++- drivers/pci/probe.c | 1 + include/linux/pci.h | 1 + 5 files changed, 34 insertions(+), 27 deletions(-)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 7a3cf281ae06..0bb738591d27 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -212,13 +212,6 @@ bool pciehp_check_link_active(struct controller *ctrl) return ret; }
-static void pcie_wait_link_active(struct controller *ctrl) -{ - struct pci_dev *pdev = ctrl_dev(ctrl); - - pcie_wait_for_link(pdev, true); -} - static bool pci_bus_check_dev(struct pci_bus *bus, int devfn) { u32 l; @@ -251,18 +244,9 @@ int pciehp_check_link_status(struct controller *ctrl) bool found; u16 lnk_status;
- /* - * Data Link Layer Link Active Reporting must be capable for - * hot-plug capable downstream port. But old controller might - * not implement it. In this case, we wait for 1000 ms. - */ - if (ctrl->link_active_reporting) - pcie_wait_link_active(ctrl); - else - msleep(1000); + if (!pcie_wait_for_link(pdev, true)) + return -1;
- /* wait 100ms before read pci conf, and try in 1s */ - msleep(100); found = pci_bus_check_dev(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
@@ -972,8 +956,6 @@ struct controller *pcie_init(struct pcie_device *dev)
/* Check if Data Link Layer Link Active Reporting is implemented */ pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap); - if (link_cap & PCI_EXP_LNKCAP_DLLLARC) - ctrl->link_active_reporting = 1;
/* Clear all remaining event bits in Slot Status register. */ pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 394b3d979955..e764d90b8c9d 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4566,21 +4566,42 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active) bool ret; u16 lnk_status;
+ /* + * Some controllers might not implement link active reporting. In this + * case, we wait for 1000 + 100 ms. + */ + if (!pdev->link_active_reporting) { + msleep(1100); + return true; + } + + /* + * PCIe r4.0 sec 6.6.1, a component must enter LTSSM Detect within 20ms, + * after which we should expect an link active if the reset was + * successful. If so, software must wait a minimum 100ms before sending + * configuration requests to devices downstream this port. + * + * If the link fails to activate, either the device was physically + * removed or the link is permanently failed. + */ + if (active) + msleep(20); for (;;) { pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status); ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA); if (ret == active) - return true; + break; if (timeout <= 0) break; msleep(10); timeout -= 10; } - - pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n", - active ? "set" : "cleared"); - - return false; + if (active && ret) + msleep(100); + else if (ret != active) + pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n", + active ? "set" : "cleared"); + return ret == active; }
void pci_reset_secondary_bus(struct pci_dev *dev) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index f57736783c83..7b77754a82de 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -140,10 +140,12 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev) pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS, PCI_EXP_DPC_STATUS_TRIGGER);
+ if (!pcie_wait_for_link(pdev, true)) + return PCI_ERS_RESULT_DISCONNECT; + return PCI_ERS_RESULT_RECOVERED; }
- static void dpc_process_rp_pio_error(struct dpc_dev *dpc) { struct device *dev = &dpc->dev->device; diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 8265fc4bc036..b798b9dbdea0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -809,6 +809,7 @@ static void pci_set_bus_speed(struct pci_bus *bus)
pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap); bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS]; + bridge->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta); pcie_update_link_speed(bus, linksta); diff --git a/include/linux/pci.h b/include/linux/pci.h index 596386c5b707..3ac8633deab5 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -411,6 +411,7 @@ struct pci_dev { unsigned int has_secondary_link:1; unsigned int non_compliant_bars:1; /* Broken BARs; ignore them */ unsigned int is_probed:1; /* Device probing in progress */ + unsigned int link_active_reporting:1;/* Device capable of reporting link active */ pci_dev_flags_t dev_flags; atomic_t enable_cnt; /* pci_enable_device has been called */
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from mainline-v4.20-rc1 commit 479e01a402f006746324a04a72bd949ceca5e73d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 479e01a402f006746324a04a72bd949ceca5e73d upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Remove duplicated include.
Signed-off-by: YueHaibing yuehaibing@huawei.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 479e01a402f006746324a04a72bd949ceca5e73d) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/err.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 72ddb07eb23c..288641b30116 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -12,7 +12,6 @@
#include <linux/pci.h> #include <linux/module.h> -#include <linux/pci.h> #include <linux/kernel.h> #include <linux/errno.h> #include <linux/aer.h>
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 3e41a317ae456bbd7ae08d03746024ec29a7bf31 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3e41a317ae456bbd7ae08d03746024ec29a7bf31 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The error recovery callbacks are only run on child devices. A Root Port is never a child device, so this error resume callback was never invoked.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 3e41a317ae456bbd7ae08d03746024ec29a7bf31) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 13 ------------- 1 file changed, 13 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index dbd464399065..ced4462738ed 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1481,18 +1481,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev) return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED; }
-/** - * aer_error_resume - clean up corresponding error status bits - * @dev: pointer to Root Port's pci_dev data structure - * - * Invoked by Port Bus driver during nonfatal recovery. - */ -static void aer_error_resume(struct pci_dev *dev) -{ - pci_aer_clear_device_status(dev); - pci_cleanup_aer_uncorrect_error_status(dev); -} - static struct pcie_port_service_driver aerdriver = { .name = "aer", .port_type = PCI_EXP_TYPE_ROOT_PORT, @@ -1500,7 +1488,6 @@ static struct pcie_port_service_driver aerdriver = {
.probe = aer_probe, .remove = aer_remove, - .error_resume = aer_error_resume, .reset_link = aer_root_reset, };
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit ecae65e133f2e0647e6364d691130ff551382d91 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ecae65e133f2e0647e6364d691130ff551382d91 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Use the recommended kernel API for writing to a concurrently-accessed kfifo. No functional change here.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit ecae65e133f2e0647e6364d691130ff551382d91) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index ced4462738ed..1072229eb9fc 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1063,7 +1063,6 @@ static DECLARE_WORK(aer_recover_work, aer_recover_work_func); void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn, int severity, struct aer_capability_regs *aer_regs) { - unsigned long flags; struct aer_recover_entry entry = { .bus = bus, .devfn = devfn, @@ -1072,13 +1071,12 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn, .regs = aer_regs, };
- spin_lock_irqsave(&aer_recover_ring_lock, flags); - if (kfifo_put(&aer_recover_ring, entry)) + if (kfifo_in_spinlocked(&aer_recover_ring, &entry, sizeof(entry), + &aer_recover_ring_lock)) schedule_work(&aer_recover_work); else pr_err("AER recover: Buffer overflow when recovering AER for %04x:%02x:%02x:%x\n", domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); - spin_unlock_irqrestore(&aer_recover_ring_lock, flags); } EXPORT_SYMBOL_GPL(aer_recover_queue); #endif
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v4.20-rc1 commit 6200cc5ee2baa573e7ac4dbcfca750e0b777c37d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6200cc5ee2baa573e7ac4dbcfca750e0b777c37d upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The threaded IRQ is naturally single threaded as desired, so use that to simplify the AER bottom half handler. Since the root port structure has much less to do now, remove the rpc construction helper routine.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 6200cc5ee2baa573e7ac4dbcfca750e0b777c37d) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 60 +++++++++--------------------------------- 1 file changed, 13 insertions(+), 47 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 1072229eb9fc..d5a99fd3e564 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -42,14 +42,7 @@ struct aer_err_source {
struct aer_rpc { struct pci_dev *rpd; /* Root Port device */ - struct work_struct dpc_handler; DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX); - int isr; - struct mutex rpc_mutex; /* - * only one thread could do - * recovery on the same - * root port hierarchy - */ };
/* AER stats for the device */ @@ -1215,15 +1208,18 @@ static void aer_isr_one_error(struct aer_rpc *rpc, * * Invoked, as DPC, when root port records new detected error */ -static void aer_isr(struct work_struct *work) +static irqreturn_t aer_isr(int irq, void *context) { - struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler); + struct pcie_device *dev = (struct pcie_device *)context; + struct aer_rpc *rpc = get_service_data(dev); struct aer_err_source uninitialized_var(e_src);
- mutex_lock(&rpc->rpc_mutex); + if (kfifo_is_empty(&rpc->aer_fifo)) + return IRQ_NONE; + while (kfifo_get(&rpc->aer_fifo, &e_src)) aer_isr_one_error(rpc, &e_src); - mutex_unlock(&rpc->rpc_mutex); + return IRQ_HANDLED; }
/** @@ -1251,8 +1247,7 @@ irqreturn_t aer_irq(int irq, void *context) if (!kfifo_put(&rpc->aer_fifo, e_src)) return IRQ_HANDLED;
- schedule_work(&rpc->dpc_handler); - return IRQ_HANDLED; + return IRQ_WAKE_THREAD; } EXPORT_SYMBOL_GPL(aer_irq);
@@ -1362,30 +1357,6 @@ static void aer_disable_rootport(struct aer_rpc *rpc) pci_write_config_dword(pdev, pos + PCI_ERR_ROOT_STATUS, reg32); }
-/** - * aer_alloc_rpc - allocate Root Port data structure - * @dev: pointer to the pcie_dev data structure - * - * Invoked when Root Port's AER service is loaded. - */ -static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev) -{ - struct aer_rpc *rpc; - - rpc = kzalloc(sizeof(struct aer_rpc), GFP_KERNEL); - if (!rpc) - return NULL; - - rpc->rpd = dev->port; - INIT_WORK(&rpc->dpc_handler, aer_isr); - mutex_init(&rpc->rpc_mutex); - - /* Use PCIe bus function to store rpc into PCIe device */ - set_service_data(dev, rpc); - - return rpc; -} - /** * aer_remove - clean up resources * @dev: pointer to the pcie_dev data structure @@ -1397,11 +1368,6 @@ static void aer_remove(struct pcie_device *dev) struct aer_rpc *rpc = get_service_data(dev);
if (rpc) { - /* If register interrupt service, it must be free. */ - if (rpc->isr) - free_irq(dev->irq, dev); - - flush_work(&rpc->dpc_handler); aer_disable_rootport(rpc); kfree(rpc); set_service_data(dev, NULL); @@ -1421,16 +1387,18 @@ static int aer_probe(struct pcie_device *dev) struct device *device = &dev->port->dev;
/* Alloc rpc data structure */ - rpc = aer_alloc_rpc(dev); + rpc = kzalloc(sizeof(struct aer_rpc), GFP_KERNEL); if (!rpc) { dev_printk(KERN_DEBUG, device, "alloc AER rpc failed\n"); - aer_remove(dev); return -ENOMEM; } + rpc->rpd = dev->port; + set_service_data(dev, rpc); INIT_KFIFO(rpc->aer_fifo); /* Request IRQ ISR */ - status = request_irq(dev->irq, aer_irq, IRQF_SHARED, "aerdrv", dev); + status = request_threaded_irq(dev->irq, aer_irq, aer_isr, + IRQF_SHARED, "aerdrv", dev); if (status) { dev_printk(KERN_DEBUG, device, "request AER IRQ %d failed\n", dev->irq); @@ -1438,8 +1406,6 @@ static int aer_probe(struct pcie_device *dev) return status; }
- rpc->isr = 1; - aer_enable_rootport(rpc); dev_info(device, "AER enabled with IRQ %d\n", dev->irq); return 0;
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v4.20-rc1 commit 369fd7b00fce169570d6a74cb369e60dbfc95fb4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 369fd7b00fce169570d6a74cb369e60dbfc95fb4 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Use the managed device resource allocations for the service data so the AER driver doesn't need to manage it, further simplifying this driver.
Link: https://lore.kernel.org/linux-pci/20180918235848.26694-12-keith.busch@intel.... Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 369fd7b00fce169570d6a74cb369e60dbfc95fb4) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index d5a99fd3e564..1b713319be60 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1367,11 +1367,7 @@ static void aer_remove(struct pcie_device *dev) { struct aer_rpc *rpc = get_service_data(dev);
- if (rpc) { - aer_disable_rootport(rpc); - kfree(rpc); - set_service_data(dev, NULL); - } + aer_disable_rootport(rpc); }
/** @@ -1384,10 +1380,9 @@ static int aer_probe(struct pcie_device *dev) { int status; struct aer_rpc *rpc; - struct device *device = &dev->port->dev; + struct device *device = &dev->device;
- /* Alloc rpc data structure */ - rpc = kzalloc(sizeof(struct aer_rpc), GFP_KERNEL); + rpc = devm_kzalloc(device, sizeof(struct aer_rpc), GFP_KERNEL); if (!rpc) { dev_printk(KERN_DEBUG, device, "alloc AER rpc failed\n"); return -ENOMEM; @@ -1396,13 +1391,11 @@ static int aer_probe(struct pcie_device *dev) set_service_data(dev, rpc); INIT_KFIFO(rpc->aer_fifo); - /* Request IRQ ISR */ - status = request_threaded_irq(dev->irq, aer_irq, aer_isr, - IRQF_SHARED, "aerdrv", dev); + status = devm_request_threaded_irq(device, dev->irq, aer_irq, aer_isr, + IRQF_SHARED, "aerdrv", dev); if (status) { dev_printk(KERN_DEBUG, device, "request AER IRQ %d failed\n", dev->irq); - aer_remove(dev); return status; }
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 0e98db259fd8760fde556e640b447dadeceefc96 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 0e98db259fd8760fde556e640b447dadeceefc96 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The port services driver already provides a method to find the pcie_device for a service. Export that function, use it from the aer_inject module, and remove the duplicate functionality.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 0e98db259fd8760fde556e640b447dadeceefc96) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer_inject.c | 25 ++++--------------------- drivers/pci/pcie/portdrv_core.c | 1 + 2 files changed, 5 insertions(+), 21 deletions(-)
diff --git a/drivers/pci/pcie/aer_inject.c b/drivers/pci/pcie/aer_inject.c index 03ee265abc66..3bfcffc39e58 100644 --- a/drivers/pci/pcie/aer_inject.c +++ b/drivers/pci/pcie/aer_inject.c @@ -320,32 +320,13 @@ static int pci_bus_set_aer_ops(struct pci_bus *bus) return 0; }
-static int find_aer_device_iter(struct device *device, void *data) -{ - struct pcie_device **result = data; - struct pcie_device *pcie_dev; - - if (device->bus == &pcie_port_bus_type) { - pcie_dev = to_pcie_device(device); - if (pcie_dev->service & PCIE_PORT_SERVICE_AER) { - *result = pcie_dev; - return 1; - } - } - return 0; -} - -static int find_aer_device(struct pci_dev *dev, struct pcie_device **result) -{ - return device_for_each_child(&dev->dev, result, find_aer_device_iter); -} - static int aer_inject(struct aer_error_inj *einj) { struct aer_error *err, *rperr; struct aer_error *err_alloc = NULL, *rperr_alloc = NULL; struct pci_dev *dev, *rpdev; struct pcie_device *edev; + struct device *device; unsigned long flags; unsigned int devfn = PCI_DEVFN(einj->dev, einj->fn); int pos_cap_err, rp_pos_cap_err; @@ -481,7 +462,9 @@ static int aer_inject(struct aer_error_inj *einj) if (ret) goto out_put;
- if (find_aer_device(rpdev, &edev)) { + device = pcie_port_find_device(rpdev, PCIE_PORT_SERVICE_AER); + if (device) { + edev = to_pcie_device(device); if (!get_service_data(edev)) { dev_warn(&edev->device, "aer_inject: AER service is not initialized\n"); diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index 904846d280ba..a3d1cc42a393 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -468,6 +468,7 @@ struct device *pcie_port_find_device(struct pci_dev *dev, device = pdrvs.dev; return device; } +EXPORT_SYMBOL_GPL(pcie_port_find_device);
/** * pcie_port_device_remove - unregister PCI Express port service devices
From: Keith Busch keith.busch@intel.com
mainline inclusion from mainline-v4.20-rc1 commit 390e2db8248075ae2f31a7046a88eda0f9784310 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 390e2db8248075ae2f31a7046a88eda0f9784310 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The aer_inject module was directly calling aer_irq(). This required the AER driver export its private IRQ handler for no other reason than to support error injection. A driver should not have to expose its private interfaces, so use the IRQ subsystem to route injection to the AER driver, and make aer_irq() a private interface.
This provides additional benefits:
First, directly calling the IRQ handler bypassed the IRQ subsytem so the injection wasn't really synthesizing what happens if a shared AER interrupt occurs.
The error injection had to provide the callback data directly, which may be racing with a removal that is freeing that structure. The IRQ subsystem can handle that race.
Finally, using the IRQ subsystem automatically reacts to threaded IRQs, keeping the error injection abstracted from that implementation detail.
Signed-off-by: Keith Busch keith.busch@intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 390e2db8248075ae2f31a7046a88eda0f9784310) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 3 +-- drivers/pci/pcie/aer_inject.c | 5 ++++- drivers/pci/pcie/portdrv.h | 4 ---- 3 files changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 1b713319be60..0c60ff28e7fa 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1229,7 +1229,7 @@ static irqreturn_t aer_isr(int irq, void *context) * * Invoked when Root Port detects AER messages. */ -irqreturn_t aer_irq(int irq, void *context) +static irqreturn_t aer_irq(int irq, void *context) { struct pcie_device *pdev = (struct pcie_device *)context; struct aer_rpc *rpc = get_service_data(pdev); @@ -1249,7 +1249,6 @@ irqreturn_t aer_irq(int irq, void *context)
return IRQ_WAKE_THREAD; } -EXPORT_SYMBOL_GPL(aer_irq);
static int set_device_error_reporting(struct pci_dev *dev, void *data) { diff --git a/drivers/pci/pcie/aer_inject.c b/drivers/pci/pcie/aer_inject.c index 3bfcffc39e58..47ce382939fd 100644 --- a/drivers/pci/pcie/aer_inject.c +++ b/drivers/pci/pcie/aer_inject.c @@ -14,6 +14,7 @@
#include <linux/module.h> #include <linux/init.h> +#include <linux/irq.h> #include <linux/miscdevice.h> #include <linux/pci.h> #include <linux/slab.h> @@ -474,7 +475,9 @@ static int aer_inject(struct aer_error_inj *einj) dev_info(&edev->device, "aer_inject: Injecting errors %08x/%08x into device %s\n", einj->cor_status, einj->uncor_status, pci_name(dev)); - aer_irq(-1, edev); + local_irq_disable(); + generic_handle_irq(edev->irq); + local_irq_enable(); } else { pci_err(rpdev, "aer_inject: AER device not found\n"); ret = -ENODEV; diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h index 2498b2d34009..55fe9f7602de 100644 --- a/drivers/pci/pcie/portdrv.h +++ b/drivers/pci/pcie/portdrv.h @@ -147,10 +147,6 @@ static inline int pcie_aer_get_firmware_first(struct pci_dev *pci_dev) } #endif
-#ifdef CONFIG_PCIEAER -irqreturn_t aer_irq(int irq, void *context); -#endif - struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev, u32 service); struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
From: Yanjiang Jin yanjiang.jin@hxt-semitech.com
mainline inclusion from mainline-v4.20-rc1 commit 1063a5148ac9d1606e80886fa53ee57d45fb4589 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 1063a5148ac9d1606e80886fa53ee57d45fb4589 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to insert locked elements") replaced kfifo_put() with kfifo_in_spinlocked(), but passed the *size* of the queue entry, where kfifo_in_spinlocked() expects the *number* of entries to be copied.
We want to insert only one element into kfifo, not "sizeof(entry) = 16". Without this patch, we would get 15 uninitialized elements.
Fixes: ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to insert locked elements") Signed-off-by: Yanjiang Jin yanjiang.jin@hxt-semitech.com [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Keith Busch keith.busch@intel.com (cherry picked from commit 1063a5148ac9d1606e80886fa53ee57d45fb4589)
Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 0c60ff28e7fa..6c99792cf11d 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1064,7 +1064,7 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn, .regs = aer_regs, };
- if (kfifo_in_spinlocked(&aer_recover_ring, &entry, sizeof(entry), + if (kfifo_in_spinlocked(&aer_recover_ring, &entry, 1, &aer_recover_ring_lock)) schedule_work(&aer_recover_work); else
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit 807ffb1e1eabbcdcd46494ee415317aa80ed415c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 807ffb1e1eabbcdcd46494ee415317aa80ed415c upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
match_string() returns the array index of a matching string. Use it instead of the open-coded implementation.
Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 807ffb1e1eabbcdcd46494ee415317aa80ed415c) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 6c99792cf11d..1aecf833cc06 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -117,7 +117,7 @@ bool pci_aer_available(void)
static int ecrc_policy = ECRC_POLICY_DEFAULT;
-static const char *ecrc_policy_str[] = { +static const char * const ecrc_policy_str[] = { [ECRC_POLICY_DEFAULT] = "bios", [ECRC_POLICY_OFF] = "off", [ECRC_POLICY_ON] = "on" @@ -203,11 +203,8 @@ void pcie_ecrc_get_policy(char *str) { int i;
- for (i = 0; i < ARRAY_SIZE(ecrc_policy_str); i++) - if (!strncmp(str, ecrc_policy_str[i], - strlen(ecrc_policy_str[i]))) - break; - if (i >= ARRAY_SIZE(ecrc_policy_str)) + i = match_string(ecrc_policy_str, ARRAY_SIZE(ecrc_policy_str), str); + if (i < 0) return;
ecrc_policy = i;
From: Bharat Kumar Gogada bharat.kumar.gogada@xilinx.com
mainline inclusion from mainline-v5.1-rc1 commit b4f6dcb9d35688392d668c46e834f72c55900b49 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit b4f6dcb9d35688392d668c46e834f72c55900b49 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
As per Figure 6-3 in PCIe r4.0, sec 6.2.6, ERR_ messages will be forwarded from the secondary interface to the primary interface, if the SERR# Enable bit in the Bridge Control register is set.
It seems clear that an ACPI hotplug parameter method (_HPP or _HPX) that tells us to "enable SERR in the command register" (ACPI v6.2, sec 6.2.8, 6.2.9.1) refers to PCI_COMMAND_SERR, which enables reporting of errors by the function itself.
For bridges, we also interpreted that to mean we should enable PCI_BRIDGE_CTL_SERR, which enables *forwarding* of errors by the bridge. But we didn't enable PCI_BRIDGE_CTL_SERR anywhere else, which means we never enabled it for non-ACPI systems or ACPI systems that didn't supply hotplug parameters.
That means errors reported below bridges were often never forwarded up to a Root Port where they could be signaled via AER.
Enable PCI_BRIDGE_CTL_SERR for all bridges so we can get better error reporting for downstream devices.
Signed-off-by: Bharat Kumar Gogada bharat.kumar.gogada@xilinx.com [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit b4f6dcb9d35688392d668c46e834f72c55900b49)
Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/probe.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index b798b9dbdea0..e520c57f515b 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1927,8 +1927,6 @@ static void program_hpp_type0(struct pci_dev *dev, struct hpp_type0 *hpp) pci_write_config_byte(dev, PCI_SEC_LATENCY_TIMER, hpp->latency_timer); pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &pci_bctl); - if (hpp->enable_serr) - pci_bctl |= PCI_BRIDGE_CTL_SERR; if (hpp->enable_perr) pci_bctl |= PCI_BRIDGE_CTL_PARITY; pci_write_config_word(dev, PCI_BRIDGE_CONTROL, pci_bctl); @@ -2210,6 +2208,24 @@ static void pci_configure_eetlp_prefix(struct pci_dev *dev) #endif }
+static void pci_configure_serr(struct pci_dev *dev) +{ + u16 control; + + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) { + + /* + * A bridge will not forward ERR_ messages coming from an + * endpoint unless SERR# forwarding is enabled. + */ + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); + if (!(control & PCI_BRIDGE_CTL_SERR)) { + control |= PCI_BRIDGE_CTL_SERR; + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); + } + } +} + static void pci_configure_device(struct pci_dev *dev) { struct hotplug_params hpp; @@ -2220,6 +2236,7 @@ static void pci_configure_device(struct pci_dev *dev) pci_configure_relaxed_ordering(dev); pci_configure_ltr(dev); pci_configure_eetlp_prefix(dev); + pci_configure_serr(dev);
memset(&hpp, 0, sizeof(hpp)); ret = pci_get_hp_params(dev, &hpp);
From: Bjorn Helgaas bhelgaas@google.com
mainline inclusion from mainline-v5.1-rc1 commit dbbfadf2319005cf528b0f15f12a05d4e4644303 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit dbbfadf2319005cf528b0f15f12a05d4e4644303 upstream.
Latency Tolerance Reporting (LTR) allows Endpoints and Switch Upstream Ports to report their latency requirements to upstream components. If ASPM L1 PM substates are enabled, the LTR information helps determine when a Link enters L1.2 [1].
Software must set the maximum latency values in the LTR Capability based on characteristics of the platform, then set LTR Mechanism Enable in the Device Control 2 register in the PCIe Capability. The device can then use LTR to report its latency tolerance.
If the device reports a maximum latency value of zero, that means the device requires the highest possible performance and the ASPM L1.2 substate is effectively disabled.
We put devices in D3 for suspend, and we assume their internal state is lost. On resume, previously we did not restore the LTR Capability, but we did restore the LTR Mechanism Enable bit, so devices would request the highest possible performance and ASPM L1.2 wouldn't be used.
[1] PCIe r4.0, sec 5.5.1 Link: https://bugzilla.kernel.org/show_bug.cgi?id=201469 Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit dbbfadf2319005cf528b0f15f12a05d4e4644303)
Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 51 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index e764d90b8c9d..b7884a757e2d 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1247,7 +1247,6 @@ static void pci_restore_pcie_state(struct pci_dev *dev) pcie_capability_write_word(dev, PCI_EXP_SLTCTL2, cap[i++]); }
- static int pci_save_pcix_state(struct pci_dev *dev) { int pos; @@ -1284,6 +1283,45 @@ static void pci_restore_pcix_state(struct pci_dev *dev) pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]); }
+static void pci_save_ltr_state(struct pci_dev *dev) +{ + int ltr; + struct pci_cap_saved_state *save_state; + u16 *cap; + + if (!pci_is_pcie(dev)) + return; + + ltr = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_LTR); + if (!ltr) + return; + + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_LTR); + if (!save_state) { + pci_err(dev, "no suspend buffer for LTR; ASPM issues possible after resume\n"); + return; + } + + cap = (u16 *)&save_state->cap.data[0]; + pci_read_config_word(dev, ltr + PCI_LTR_MAX_SNOOP_LAT, cap++); + pci_read_config_word(dev, ltr + PCI_LTR_MAX_NOSNOOP_LAT, cap++); +} + +static void pci_restore_ltr_state(struct pci_dev *dev) +{ + struct pci_cap_saved_state *save_state; + int ltr; + u16 *cap; + + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_LTR); + ltr = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_LTR); + if (!save_state || !ltr) + return; + + cap = (u16 *)&save_state->cap.data[0]; + pci_write_config_word(dev, ltr + PCI_LTR_MAX_SNOOP_LAT, *cap++); + pci_write_config_word(dev, ltr + PCI_LTR_MAX_NOSNOOP_LAT, *cap++); +}
/** * pci_save_state - save the PCI configuration space of a device before suspending @@ -1308,6 +1346,7 @@ int pci_save_state(struct pci_dev *dev) if (i != 0) return i;
+ pci_save_ltr_state(dev); pci_save_dpc_state(dev); return pci_save_vc_state(dev); } @@ -1407,7 +1446,12 @@ void pci_restore_state(struct pci_dev *dev) if (!dev->state_saved) return;
- /* PCI Express register must be restored first */ + /* + * Restore max latencies (in the LTR capability) before enabling + * LTR itself (in the PCIe capability). + */ + pci_restore_ltr_state(dev); + pci_restore_pcie_state(dev); pci_restore_pasid_state(dev); pci_restore_pri_state(dev); @@ -3037,6 +3081,11 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev) if (error) pci_err(dev, "unable to preallocate PCI-X save buffer\n");
+ error = pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_LTR, + 2 * sizeof(u16)); + if (error) + pci_err(dev, "unable to allocate suspend buffer for LTR\n"); + pci_allocate_vc_save_buffers(dev); }
From: Bjorn Helgaas bhelgaas@google.com
mainline inclusion from mainline-v5.1-rc1 commit c89f7f98c971e0cabc819b6c0fe6bf509287b7e0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit c89f7f98c971e0cabc819b6c0fe6bf509287b7e0 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The pci_device_id table was technically correct, but unusually formatted, which made adding entries error-prone. Change the format so it's obvious how to add entries.
Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit c89f7f98c971e0cabc819b6c0fe6bf509287b7e0) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/portdrv_pci.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index 17256733fa43..3746545bb627 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -184,10 +184,10 @@ static void pcie_portdrv_err_resume(struct pci_dev *dev) /* * LINUX Device Driver Model */ -static const struct pci_device_id port_pci_ids[] = { { +static const struct pci_device_id port_pci_ids[] = { /* handle any PCI-Express port */ - PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0), - }, { /* end: all zeroes */ } + { PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0) }, + { }, };
static const struct pci_error_handlers pcie_portdrv_err_handler = {
From: Honghui Zhang honghui.zhang@mediatek.com
mainline inclusion from mainline-v5.1-rc1 commit f0cfecea8d1e8e0cd5d5053f9452b3a450f49eb5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit f0cfecea8d1e8e0cd5d5053f9452b3a450f49eb5 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The Class Code for subtractive decode PCI-to-PCI bridge is 060401h; add an entry to make portdrv support this type of bridge. This allows use of PCIe services on subtractive decode ports.
Signed-off-by: Honghui Zhang honghui.zhang@mediatek.com [bhelgaas: add braces surrounding entry] Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit f0cfecea8d1e8e0cd5d5053f9452b3a450f49eb5)
Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/portdrv_pci.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index 3746545bb627..e6671fec17e1 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -187,6 +187,8 @@ static void pcie_portdrv_err_resume(struct pci_dev *dev) static const struct pci_device_id port_pci_ids[] = { /* handle any PCI-Express port */ { PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0) }, + /* subtractive decode PCI-to-PCI bridge, class type is 060401h */ + { PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x01), ~0) }, { }, };
From: Bjorn Helgaas bhelgaas@google.com
mainline inclusion from mainline-v5.2-rc1 commit 7db4af43c97b68dc65394c799b86cdd0fffe5f8d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7db4af43c97b68dc65394c799b86cdd0fffe5f8d upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Use dev_printk() when possible. This makes messages more consistent with other device-related messages and, in some cases, adds useful information.
Signed-off-by: Bjorn Helgaas bhelgaas@google.com (cherry picked from commit 7db4af43c97b68dc65394c799b86cdd0fffe5f8d) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci-sysfs.c | 3 +-- drivers/pci/quirks.c | 6 +++--- 2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index a62349a44c8c..6ec8679bafe9 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -1125,8 +1125,7 @@ void pci_create_legacy_files(struct pci_bus *b) kfree(b->legacy_io); b->legacy_io = NULL; kzalloc_err: - printk(KERN_WARNING "pci: warning: could not create legacy I/O port and ISA memory resources to sysfs\n"); - return; + dev_warn(&b->dev, "could not create legacy I/O port and ISA memory resources in sysfs\n"); }
void pci_remove_legacy_files(struct pci_bus *b) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index a76dc1509f25..66944220cbe7 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -177,9 +177,9 @@ static int __init pci_apply_final_quirks(void) if (!tmp || cls == tmp) continue;
- printk(KERN_DEBUG "PCI: CLS mismatch (%u != %u), using %u bytes\n", - cls << 2, tmp << 2, - pci_dfl_cache_line_size << 2); + pci_printk(KERN_DEBUG, dev, "CLS mismatch (%u != %u), using %u bytes\n", + cls << 2, tmp << 2, + pci_dfl_cache_line_size << 2); pci_cache_line_size = pci_dfl_cache_line_size; } }
From: Mohan Kumar mohankumar718@gmail.com
mainline inclusion from mainline-v5.2-rc1 commit 25da8dbaaf0679b3b22c783952a8392071cfa135 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 25da8dbaaf0679b3b22c783952a8392071cfa135 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Replace printk() with pr_*() to be more consistent with other logging and avoid checkpatch warnings.
Link: https://lore.kernel.org/lkml/1555733026-19609-1-git-send-email-mohankumar718... Link: https://lore.kernel.org/lkml/1555733130-19804-1-git-send-email-mohankumar718... Signed-off-by: Mohan Kumar mohankumar718@gmail.com [bhelgaas: squash in similar changes from second patch in series] Signed-off-by: Bjorn Helgaas bhelgaas@google.com
(cherry picked from commit 25da8dbaaf0679b3b22c783952a8392071cfa135) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/bus.c | 2 +- drivers/pci/pci-acpi.c | 11 ++++------- drivers/pci/pci-stub.c | 10 ++++------ drivers/pci/pci.c | 3 +-- drivers/pci/quirks.c | 2 +- drivers/pci/slot.c | 2 +- 6 files changed, 12 insertions(+), 18 deletions(-)
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 5cb40b2518f9..2179a8baef52 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -23,7 +23,7 @@ void pci_add_resource_offset(struct list_head *resources, struct resource *res,
entry = resource_list_create_entry(res, 0); if (!entry) { - printk(KERN_ERR "PCI: can't add host bridge window %pR\n", res); + pr_err("PCI: can't add host bridge window %pR\n", res); return; }
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c index e442ab398227..15ba77cd1055 100644 --- a/drivers/pci/pci-acpi.c +++ b/drivers/pci/pci-acpi.c @@ -140,8 +140,7 @@ static acpi_status decode_type0_hpx_record(union acpi_object *record, hpx->t0->enable_perr = fields[5].integer.value; break; default: - printk(KERN_WARNING - "%s: Type 0 Revision %d record not supported\n", + pr_warn("%s: Type 0 Revision %d record not supported\n", __func__, revision); return AE_ERROR; } @@ -169,8 +168,7 @@ static acpi_status decode_type1_hpx_record(union acpi_object *record, hpx->t1->tot_max_split = fields[4].integer.value; break; default: - printk(KERN_WARNING - "%s: Type 1 Revision %d record not supported\n", + pr_warn("%s: Type 1 Revision %d record not supported\n", __func__, revision); return AE_ERROR; } @@ -211,8 +209,7 @@ static acpi_status decode_type2_hpx_record(union acpi_object *record, hpx->t2->sec_unc_err_mask_or = fields[17].integer.value; break; default: - printk(KERN_WARNING - "%s: Type 2 Revision %d record not supported\n", + pr_warn("%s: Type 2 Revision %d record not supported\n", __func__, revision); return AE_ERROR; } @@ -272,7 +269,7 @@ static acpi_status acpi_run_hpx(acpi_handle handle, struct hotplug_params *hpx) goto exit; break; default: - printk(KERN_ERR "%s: Type %d record not supported\n", + pr_err("%s: Type %d record not supported\n", __func__, type); status = AE_ERROR; goto exit; diff --git a/drivers/pci/pci-stub.c b/drivers/pci/pci-stub.c index 66f8a59fadbd..e408099fea52 100644 --- a/drivers/pci/pci-stub.c +++ b/drivers/pci/pci-stub.c @@ -66,20 +66,18 @@ static int __init pci_stub_init(void) &class, &class_mask);
if (fields < 2) { - printk(KERN_WARNING - "pci-stub: invalid id string "%s"\n", id); + pr_warn("pci-stub: invalid ID string "%s"\n", id); continue; }
- printk(KERN_INFO - "pci-stub: add %04X:%04X sub=%04X:%04X cls=%08X/%08X\n", + pr_info("pci-stub: add %04X:%04X sub=%04X:%04X cls=%08X/%08X\n", vendor, device, subvendor, subdevice, class, class_mask);
rc = pci_add_dynid(&stub_driver, vendor, device, subvendor, subdevice, class, class_mask, 0); if (rc) - printk(KERN_WARNING - "pci-stub: failed to add dynamic id (%d)\n", rc); + pr_warn("pci-stub: failed to add dynamic ID (%d)\n", + rc); }
return 0; diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index b7884a757e2d..84f7b906f4bb 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -6283,8 +6283,7 @@ static int __init pci_setup(char *str) } else if (!strncmp(str, "disable_acs_redir=", 18)) { disable_acs_redir_param = str + 18; } else { - printk(KERN_ERR "PCI: Unknown option `%s'\n", - str); + pr_err("PCI: Unknown option `%s'\n", str); } } str = k; diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 66944220cbe7..f01fc695bed6 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -2588,7 +2588,7 @@ static void nvbridge_check_legacy_irq_routing(struct pci_dev *dev) pci_read_config_dword(dev, 0x74, &cfg);
if (cfg & ((1 << 2) | (1 << 15))) { - printk(KERN_INFO "Rewriting IRQ routing register on MCP55\n"); + pr_info("Rewriting IRQ routing register on MCP55\n"); cfg &= ~((1 << 2) | (1 << 15)); pci_write_config_dword(dev, 0x74, cfg); } diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c index 7f6c8e732bad..19de071ca698 100644 --- a/drivers/pci/slot.c +++ b/drivers/pci/slot.c @@ -370,7 +370,7 @@ static int pci_slot_init(void) pci_slots_kset = kset_create_and_add("slots", NULL, &pci_bus_kset->kobj); if (!pci_slots_kset) { - printk(KERN_ERR "PCI: Slot initialization failure\n"); + pr_err("PCI: Slot initialization failure\n"); return -ENOMEM; } return 0;
From: Mohan Kumar mohankumar718@gmail.com
mainline inclusion from mainline-v5.2-rc1 commit 34c6b7105e5a11174f856483cde8ad6e61b7236a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 34c6b7105e5a11174f856483cde8ad6e61b7236a upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Replace dev_printk(KERN_DEBUG) with dev_info(), etc to be more consistent with other logging and avoid checkpatch warnings.
The KERN_DEBUG messages could be converted to dev_dbg(), but that depends on CONFIG_DYNAMIC_DEBUG and DEBUG, and we want most of these messages to *always* be in the dmesg log.
Link: https://lore.kernel.org/lkml/1555733240-19875-1-git-send-email-mohankumar718... Signed-off-by: Mohan Kumar mohankumar718@gmail.com [bhelgaas: commit log] Signed-off-by: Bjorn Helgaas bhelgaas@google.com
(cherry picked from commit 34c6b7105e5a11174f856483cde8ad6e61b7236a) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/bus.c | 3 +-- drivers/pci/pci.c | 14 +++++++------- drivers/pci/probe.c | 21 +++++++++------------ drivers/pci/quirks.c | 13 ++++++------- drivers/pci/setup-bus.c | 30 ++++++++++++++---------------- drivers/pci/xen-pcifront.c | 7 +++---- 6 files changed, 40 insertions(+), 48 deletions(-)
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 2179a8baef52..495059d923f7 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -288,8 +288,7 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx) res->end = end; res->flags &= ~IORESOURCE_UNSET; orig_res.flags &= ~IORESOURCE_UNSET; - pci_printk(KERN_DEBUG, dev, "%pR clipped to %pR\n", - &orig_res, res); + pci_info(dev, "%pR clipped to %pR\n", &orig_res, res);
return true; } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 84f7b906f4bb..212776f4ad17 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -2794,14 +2794,14 @@ void pci_pm_init(struct pci_dev *dev) dev->d2_support = true;
if (dev->d1_support || dev->d2_support) - pci_printk(KERN_DEBUG, dev, "supports%s%s\n", + pci_info(dev, "supports%s%s\n", dev->d1_support ? " D1" : "", dev->d2_support ? " D2" : ""); }
pmc &= PCI_PM_CAP_PME_MASK; if (pmc) { - pci_printk(KERN_DEBUG, dev, "PME# supported from%s%s%s%s%s\n", + pci_info(dev, "PME# supported from%s%s%s%s%s\n", (pmc & PCI_PM_CAP_PME_D0) ? " D0" : "", (pmc & PCI_PM_CAP_PME_D1) ? " D1" : "", (pmc & PCI_PM_CAP_PME_D2) ? " D2" : "", @@ -2969,16 +2969,16 @@ static int pci_ea_read(struct pci_dev *dev, int offset) res->flags = flags;
if (bei <= PCI_EA_BEI_BAR5) - pci_printk(KERN_DEBUG, dev, "BAR %d: %pR (from Enhanced Allocation, properties %#02x)\n", + pci_info(dev, "BAR %d: %pR (from Enhanced Allocation, properties %#02x)\n", bei, res, prop); else if (bei == PCI_EA_BEI_ROM) - pci_printk(KERN_DEBUG, dev, "ROM: %pR (from Enhanced Allocation, properties %#02x)\n", + pci_info(dev, "ROM: %pR (from Enhanced Allocation, properties %#02x)\n", res, prop); else if (bei >= PCI_EA_BEI_VF_BAR0 && bei <= PCI_EA_BEI_VF_BAR5) - pci_printk(KERN_DEBUG, dev, "VF BAR %d: %pR (from Enhanced Allocation, properties %#02x)\n", + pci_info(dev, "VF BAR %d: %pR (from Enhanced Allocation, properties %#02x)\n", bei - PCI_EA_BEI_VF_BAR0, res, prop); else - pci_printk(KERN_DEBUG, dev, "BEI %d res: %pR (from Enhanced Allocation, properties %#02x)\n", + pci_info(dev, "BEI %d res: %pR (from Enhanced Allocation, properties %#02x)\n", bei, res, prop);
out: @@ -4206,7 +4206,7 @@ int pci_set_cacheline_size(struct pci_dev *dev) if (cacheline_size == pci_cache_line_size) return 0;
- pci_printk(KERN_DEBUG, dev, "cache line size of %d is not supported\n", + pci_info(dev, "cache line size of %d is not supported\n", pci_cache_line_size << 2);
return -EINVAL; diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index e520c57f515b..2425bd2dddcc 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -317,7 +317,7 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, res->flags = 0; out: if (res->flags) - pci_printk(KERN_DEBUG, dev, "reg 0x%x: %pR\n", pos, res); + pci_info(dev, "reg 0x%x: %pR\n", pos, res);
return (res->flags & IORESOURCE_MEM_64) ? 1 : 0; } @@ -435,7 +435,7 @@ static void pci_read_bridge_io(struct pci_bus *child) region.start = base; region.end = limit + io_granularity - 1; pcibios_bus_to_resource(dev->bus, res, ®ion); - pci_printk(KERN_DEBUG, dev, " bridge window %pR\n", res); + pci_info(dev, " bridge window %pR\n", res); } }
@@ -457,7 +457,7 @@ static void pci_read_bridge_mmio(struct pci_bus *child) region.start = base; region.end = limit + 0xfffff; pcibios_bus_to_resource(dev->bus, res, ®ion); - pci_printk(KERN_DEBUG, dev, " bridge window %pR\n", res); + pci_info(dev, " bridge window %pR\n", res); } }
@@ -510,7 +510,7 @@ static void pci_read_bridge_mmio_pref(struct pci_bus *child) region.start = base; region.end = limit + 0xfffff; pcibios_bus_to_resource(dev->bus, res, ®ion); - pci_printk(KERN_DEBUG, dev, " bridge window %pR\n", res); + pci_info(dev, " bridge window %pR\n", res); } }
@@ -540,8 +540,7 @@ void pci_read_bridge_bases(struct pci_bus *child) if (res && res->flags) { pci_bus_add_resource(child, res, PCI_SUBTRACTIVE_DECODE); - pci_printk(KERN_DEBUG, dev, - " bridge window %pR (subtractive decode)\n", + pci_info(dev, " bridge window %pR (subtractive decode)\n", res); } } @@ -1711,7 +1710,7 @@ int pci_setup_device(struct pci_dev *dev) dev->revision = class & 0xff; dev->class = class >> 8; /* upper 3 bytes */
- pci_printk(KERN_DEBUG, dev, "[%04x:%04x] type %02x class %#08x\n", + pci_info(dev, "[%04x:%04x] type %02x class %#08x\n", dev->vendor, dev->device, dev->hdr_type, dev->class);
if (pci_early_dump) @@ -3549,7 +3548,7 @@ int pci_bus_insert_busn_res(struct pci_bus *b, int bus, int bus_max) conflict = request_resource_conflict(parent_res, res);
if (conflict) - dev_printk(KERN_DEBUG, &b->dev, + dev_info(&b->dev, "busn_res: can not insert %pR under %s%pR (conflicts with %s %pR)\n", res, pci_is_root_bus(b) ? "domain " : "", parent_res, conflict->name, conflict); @@ -3569,8 +3568,7 @@ int pci_bus_update_busn_res_end(struct pci_bus *b, int bus_max)
size = bus_max - res->start + 1; ret = adjust_resource(res, res->start, size); - dev_printk(KERN_DEBUG, &b->dev, - "busn_res: %pR end %s updated to %02x\n", + dev_info(&b->dev, "busn_res: %pR end %s updated to %02x\n", &old_res, ret ? "can not be" : "is", bus_max);
if (!ret && !res->parent) @@ -3588,8 +3586,7 @@ void pci_bus_release_busn_res(struct pci_bus *b) return;
ret = release_resource(res); - dev_printk(KERN_DEBUG, &b->dev, - "busn_res: %pR %s released\n", + dev_info(&b->dev, "busn_res: %pR %s released\n", res, ret ? "can not be" : "is"); }
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index f01fc695bed6..961003c6dc80 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -159,8 +159,7 @@ static int __init pci_apply_final_quirks(void) u8 tmp;
if (pci_cache_line_size) - printk(KERN_DEBUG "PCI: CLS %u bytes\n", - pci_cache_line_size << 2); + pr_info("PCI: CLS %u bytes\n", pci_cache_line_size << 2);
pci_apply_fixup_final_quirks = true; for_each_pci_dev(dev) { @@ -177,16 +176,16 @@ static int __init pci_apply_final_quirks(void) if (!tmp || cls == tmp) continue;
- pci_printk(KERN_DEBUG, dev, "CLS mismatch (%u != %u), using %u bytes\n", - cls << 2, tmp << 2, - pci_dfl_cache_line_size << 2); + pci_info(dev, "CLS mismatch (%u != %u), using %u bytes\n", + cls << 2, tmp << 2, + pci_dfl_cache_line_size << 2); pci_cache_line_size = pci_dfl_cache_line_size; } }
if (!pci_cache_line_size) { - printk(KERN_DEBUG "PCI: CLS %u bytes, default %u\n", - cls << 2, pci_dfl_cache_line_size << 2); + pr_info("PCI: CLS %u bytes, default %u\n", cls << 2, + pci_dfl_cache_line_size << 2); pci_cache_line_size = cls ? cls : pci_dfl_cache_line_size; }
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index 81b772783dbd..39d19302f3cb 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -255,10 +255,9 @@ static void reassign_resources_sorted(struct list_head *realloc_head, (IORESOURCE_STARTALIGN|IORESOURCE_SIZEALIGN); if (pci_reassign_resource(add_res->dev, idx, add_size, align)) - pci_printk(KERN_DEBUG, add_res->dev, - "failed to add %llx res[%d]=%pR\n", - (unsigned long long)add_size, - idx, res); + pci_info(add_res->dev, "failed to add %llx res[%d]=%pR\n", + (unsigned long long) add_size, idx, + res); } out: list_del(&add_res->list); @@ -914,9 +913,9 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size, if (size1 > size0 && realloc_head) { add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align); - pci_printk(KERN_DEBUG, bus->self, "bridge window %pR to %pR add_size %llx\n", - b_res, &bus->busn_res, - (unsigned long long)size1-size0); + pci_info(bus->self, "bridge window %pR to %pR add_size %llx\n", + b_res, &bus->busn_res, + (unsigned long long) size1 - size0); } }
@@ -1061,7 +1060,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, b_res->flags |= IORESOURCE_STARTALIGN; if (size1 > size0 && realloc_head) { add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align); - pci_printk(KERN_DEBUG, bus->self, "bridge window %pR to %pR add_size %llx add_align %llx\n", + pci_info(bus->self, "bridge window %pR to %pR add_size %llx add_align %llx\n", b_res, &bus->busn_res, (unsigned long long) (size1 - size0), (unsigned long long) add_align); @@ -1529,8 +1528,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus, release_child_resources(r); if (!release_resource(r)) { type = old_flags = r->flags & PCI_RES_TYPE_MASK; - pci_printk(KERN_DEBUG, dev, "resource %d %pR released\n", - PCI_BRIDGE_RESOURCES + idx, r); + pci_info(dev, "resource %d %pR released\n", + PCI_BRIDGE_RESOURCES + idx, r); /* keep the old size */ r->end = resource_size(r) - 1; r->start = 0; @@ -1594,7 +1593,7 @@ static void pci_bus_dump_res(struct pci_bus *bus) if (!res || !res->end || !res->flags) continue;
- dev_printk(KERN_DEBUG, &bus->dev, "resource %d %pR\n", i, res); + dev_info(&bus->dev, "resource %d %pR\n", i, res); } }
@@ -1733,9 +1732,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) int max_depth = pci_bus_get_depth(bus);
pci_try_num = max_depth + 1; - dev_printk(KERN_DEBUG, &bus->dev, - "max bus depth: %d pci_try_num: %d\n", - max_depth, pci_try_num); + dev_info(&bus->dev, "max bus depth: %d pci_try_num: %d\n", + max_depth, pci_try_num); }
again: @@ -1769,8 +1767,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) goto dump; }
- dev_printk(KERN_DEBUG, &bus->dev, - "No. %d try to assign unassigned res\n", tried_times + 1); + dev_info(&bus->dev, "No. %d try to assign unassigned res\n", + tried_times + 1);
/* third times and later will not check if it is leaf */ if ((tried_times + 1) > 2) diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c index eba6e33147a2..fbd8d9f54d6a 100644 --- a/drivers/pci/xen-pcifront.c +++ b/drivers/pci/xen-pcifront.c @@ -291,8 +291,7 @@ static int pci_frontend_enable_msix(struct pci_dev *dev, vector[i] = op.msix_entries[i].vector; } } else { - printk(KERN_DEBUG "enable msix get value %x\n", - op.value); + pr_info("enable msix get value %x\n", op.value); err = op.value; } } else { @@ -364,12 +363,12 @@ static void pci_frontend_disable_msi(struct pci_dev *dev) err = do_pci_op(pdev, &op); if (err == XEN_PCI_ERR_dev_not_found) { /* XXX No response from backend, what shall we do? */ - printk(KERN_DEBUG "get no response from backend for disable MSI\n"); + pr_info("get no response from backend for disable MSI\n"); return; } if (err) /* how can pciback notify us fail? */ - printk(KERN_DEBUG "get fake response frombackend\n"); + pr_info("get fake response from backend\n"); }
static struct xen_pci_frontend_ops pci_frontend_ops = {
From: Frederick Lawler fred@fredlawl.com
mainline inclusion from mainline-v5.2-rc1 commit 10a9990c10447a7bfe9dc016629898814741d090 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 10a9990c10447a7bfe9dc016629898814741d090 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Log messages with pci_dev, not pcie_device. Factor out common message prefixes with dev_fmt().
Example output change:
- dpc 0000:00:01.1:pcie008: DPC error containment capabilities... + pcieport 0000:00:01.1: DPC: error containment capabilities...
Link: https://lore.kernel.org/lkml/20190509141456.223614-4-helgaas@kernel.org Signed-off-by: Frederick Lawler fred@fredlawl.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Keith Busch keith.busch@intel.com (cherry picked from commit 10a9990c10447a7bfe9dc016629898814741d090) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/dpc.c | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index 7b77754a82de..a32ec3487a8d 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -6,6 +6,8 @@ * Copyright (C) 2016 Intel Corp. */
+#define dev_fmt(fmt) "DPC: " fmt + #include <linux/aer.h> #include <linux/delay.h> #include <linux/interrupt.h> @@ -100,7 +102,6 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc) { unsigned long timeout = jiffies + HZ; struct pci_dev *pdev = dpc->dev->port; - struct device *dev = &dpc->dev->device; u16 cap = dpc->cap_pos, status;
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status); @@ -110,7 +111,7 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc) pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status); } if (status & PCI_EXP_DPC_RP_BUSY) { - dev_warn(dev, "DPC root port still busy\n"); + pci_warn(pdev, "root port still busy\n"); return -EBUSY; } return 0; @@ -148,7 +149,6 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
static void dpc_process_rp_pio_error(struct dpc_dev *dpc) { - struct device *dev = &dpc->dev->device; struct pci_dev *pdev = dpc->dev->port; u16 cap = dpc->cap_pos, dpc_status, first_error; u32 status, mask, sev, syserr, exc, dw0, dw1, dw2, dw3, log, prefix; @@ -156,13 +156,13 @@ static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS, &status); pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_MASK, &mask); - dev_err(dev, "rp_pio_status: %#010x, rp_pio_mask: %#010x\n", + pci_err(pdev, "rp_pio_status: %#010x, rp_pio_mask: %#010x\n", status, mask);
pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_SEVERITY, &sev); pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_SYSERROR, &syserr); pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_EXCEPTION, &exc); - dev_err(dev, "RP PIO severity=%#010x, syserror=%#010x, exception=%#010x\n", + pci_err(pdev, "RP PIO severity=%#010x, syserror=%#010x, exception=%#010x\n", sev, syserr, exc);
/* Get First Error Pointer */ @@ -171,7 +171,7 @@ static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
for (i = 0; i < ARRAY_SIZE(rp_pio_error_string); i++) { if ((status & ~mask) & (1 << i)) - dev_err(dev, "[%2d] %s%s\n", i, rp_pio_error_string[i], + pci_err(pdev, "[%2d] %s%s\n", i, rp_pio_error_string[i], first_error == i ? " (First)" : ""); }
@@ -185,18 +185,18 @@ static void dpc_process_rp_pio_error(struct dpc_dev *dpc) &dw2); pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_HEADER_LOG + 12, &dw3); - dev_err(dev, "TLP Header: %#010x %#010x %#010x %#010x\n", + pci_err(pdev, "TLP Header: %#010x %#010x %#010x %#010x\n", dw0, dw1, dw2, dw3);
if (dpc->rp_log_size < 5) goto clear_status; pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_IMPSPEC_LOG, &log); - dev_err(dev, "RP PIO ImpSpec Log %#010x\n", log); + pci_err(pdev, "RP PIO ImpSpec Log %#010x\n", log);
for (i = 0; i < dpc->rp_log_size - 5; i++) { pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG, &prefix); - dev_err(dev, "TLP Prefix Header: dw%d, %#010x\n", i, prefix); + pci_err(pdev, "TLP Prefix Header: dw%d, %#010x\n", i, prefix); } clear_status: pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS, status); @@ -229,18 +229,17 @@ static irqreturn_t dpc_handler(int irq, void *context) struct aer_err_info info; struct dpc_dev *dpc = context; struct pci_dev *pdev = dpc->dev->port; - struct device *dev = &dpc->dev->device; u16 cap = dpc->cap_pos, status, source, reason, ext_reason;
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status); pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
- dev_info(dev, "DPC containment event, status:%#06x source:%#06x\n", + pci_info(pdev, "containment event, status:%#06x source:%#06x\n", status, source);
reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1; ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5; - dev_warn(dev, "DPC %s detected\n", + pci_warn(pdev, "%s detected\n", (reason == 0) ? "unmasked uncorrectable error" : (reason == 1) ? "ERR_NONFATAL" : (reason == 2) ? "ERR_FATAL" : @@ -307,7 +306,7 @@ static int dpc_probe(struct pcie_device *dev) dpc_handler, IRQF_SHARED, "pcie-dpc", dpc); if (status) { - dev_warn(device, "request IRQ%d failed: %d\n", dev->irq, + pci_warn(pdev, "request IRQ%d failed: %d\n", dev->irq, status); return status; } @@ -319,7 +318,7 @@ static int dpc_probe(struct pcie_device *dev) if (dpc->rp_extensions) { dpc->rp_log_size = (cap & PCI_EXP_DPC_RP_PIO_LOG_SIZE) >> 8; if (dpc->rp_log_size < 4 || dpc->rp_log_size > 9) { - dev_err(device, "RP PIO log size %u is invalid\n", + pci_err(pdev, "RP PIO log size %u is invalid\n", dpc->rp_log_size); dpc->rp_log_size = 0; } @@ -328,11 +327,11 @@ static int dpc_probe(struct pcie_device *dev) ctl = (ctl & 0xfff4) | PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN; pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, ctl);
- dev_info(device, "DPC error containment capabilities: Int Msg #%d, RPExt%c PoisonedTLP%c SwTrigger%c RP PIO Log %d, DL_ActiveErr%c\n", - cap & PCI_EXP_DPC_IRQ, FLAG(cap, PCI_EXP_DPC_CAP_RP_EXT), - FLAG(cap, PCI_EXP_DPC_CAP_POISONED_TLP), - FLAG(cap, PCI_EXP_DPC_CAP_SW_TRIGGER), dpc->rp_log_size, - FLAG(cap, PCI_EXP_DPC_CAP_DL_ACTIVE)); + pci_info(pdev, "error containment capabilities: Int Msg #%d, RPExt%c PoisonedTLP%c SwTrigger%c RP PIO Log %d, DL_ActiveErr%c\n", + cap & PCI_EXP_DPC_IRQ, FLAG(cap, PCI_EXP_DPC_CAP_RP_EXT), + FLAG(cap, PCI_EXP_DPC_CAP_POISONED_TLP), + FLAG(cap, PCI_EXP_DPC_CAP_SW_TRIGGER), dpc->rp_log_size, + FLAG(cap, PCI_EXP_DPC_CAP_DL_ACTIVE));
pci_add_ext_cap_save_buffer(pdev, PCI_EXT_CAP_ID_DPC, sizeof(u16)); return status;
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.2-rc1 commit 9cc6f75b27e76d38fa7c2825c4a9a64fe26e4c77 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9cc6f75b27e76d38fa7c2825c4a9a64fe26e4c77 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Log messages with pci_dev, not pcie_device. Factor out common message prefixes with dev_fmt().
Example output change:
- aer 0000:00:00.0:pci002: AER enabled with IRQ ... + pcieport 0000:00:00.0: AER: enabled with IRQ ...
Link: https://lore.kernel.org/lkml/20190509141456.223614-5-helgaas@kernel.org Signed-off-by: Frederick Lawler fred@fredlawl.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Keith Busch keith.busch@intel.com Reviewed-by: Andy Shevchenko andy.shevchenko@gmail.com (cherry picked from commit 9cc6f75b27e76d38fa7c2825c4a9a64fe26e4c77) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 21 +++++++++++++-------- drivers/pci/pcie/aer_inject.c | 20 ++++++++++---------- 2 files changed, 23 insertions(+), 18 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 1aecf833cc06..3d8265e7d581 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -12,6 +12,9 @@ * Andrew Patterson andrew.patterson@hp.com */
+#define pr_fmt(fmt) "AER: " fmt +#define dev_fmt pr_fmt + #include <linux/cper.h> #include <linux/pci.h> #include <linux/pci-acpi.h> @@ -779,10 +782,11 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info) u8 bus = info->id >> 8; u8 devfn = info->id & 0xff;
- pci_info(dev, "AER: %s%s error received: %04x:%02x:%02x.%d\n", - info->multi_error_valid ? "Multiple " : "", - aer_error_severity_string[info->severity], - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); + pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", + info->multi_error_valid ? "Multiple " : "", + aer_error_severity_string[info->severity], + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), + PCI_FUNC(devfn)); }
#ifdef CONFIG_ACPI_APEI_PCIEAER @@ -1377,26 +1381,27 @@ static int aer_probe(struct pcie_device *dev) int status; struct aer_rpc *rpc; struct device *device = &dev->device; + struct pci_dev *port = dev->port;
rpc = devm_kzalloc(device, sizeof(struct aer_rpc), GFP_KERNEL); if (!rpc) { dev_printk(KERN_DEBUG, device, "alloc AER rpc failed\n"); return -ENOMEM; } - rpc->rpd = dev->port; + + rpc->rpd = port; set_service_data(dev, rpc); INIT_KFIFO(rpc->aer_fifo); status = devm_request_threaded_irq(device, dev->irq, aer_irq, aer_isr, IRQF_SHARED, "aerdrv", dev); if (status) { - dev_printk(KERN_DEBUG, device, "request AER IRQ %d failed\n", - dev->irq); + pci_err(port, "request AER IRQ %d failed\n", dev->irq); return status; }
aer_enable_rootport(rpc); - dev_info(device, "AER enabled with IRQ %d\n", dev->irq); + pci_info(port, "enabled with IRQ %d\n", dev->irq); return 0; }
diff --git a/drivers/pci/pcie/aer_inject.c b/drivers/pci/pcie/aer_inject.c index 47ce382939fd..83d18551ba79 100644 --- a/drivers/pci/pcie/aer_inject.c +++ b/drivers/pci/pcie/aer_inject.c @@ -12,6 +12,8 @@ * Huang Ying ying.huang@intel.com */
+#define dev_fmt(fmt) "aer_inject: " fmt + #include <linux/module.h> #include <linux/init.h> #include <linux/irq.h> @@ -339,14 +341,14 @@ static int aer_inject(struct aer_error_inj *einj) return -ENODEV; rpdev = pcie_find_root_port(dev); if (!rpdev) { - pci_err(dev, "aer_inject: Root port not found\n"); + pci_err(dev, "Root port not found\n"); ret = -ENODEV; goto out_put; }
pos_cap_err = dev->aer_cap; if (!pos_cap_err) { - pci_err(dev, "aer_inject: Device doesn't support AER\n"); + pci_err(dev, "Device doesn't support AER\n"); ret = -EPROTONOSUPPORT; goto out_put; } @@ -357,7 +359,7 @@ static int aer_inject(struct aer_error_inj *einj)
rp_pos_cap_err = rpdev->aer_cap; if (!rp_pos_cap_err) { - pci_err(rpdev, "aer_inject: Root port doesn't support AER\n"); + pci_err(rpdev, "Root port doesn't support AER\n"); ret = -EPROTONOSUPPORT; goto out_put; } @@ -405,14 +407,14 @@ static int aer_inject(struct aer_error_inj *einj) if (!aer_mask_override && einj->cor_status && !(einj->cor_status & ~cor_mask)) { ret = -EINVAL; - pci_warn(dev, "aer_inject: The correctable error(s) is masked by device\n"); + pci_warn(dev, "The correctable error(s) is masked by device\n"); spin_unlock_irqrestore(&inject_lock, flags); goto out_put; } if (!aer_mask_override && einj->uncor_status && !(einj->uncor_status & ~uncor_mask)) { ret = -EINVAL; - pci_warn(dev, "aer_inject: The uncorrectable error(s) is masked by device\n"); + pci_warn(dev, "The uncorrectable error(s) is masked by device\n"); spin_unlock_irqrestore(&inject_lock, flags); goto out_put; } @@ -467,19 +469,17 @@ static int aer_inject(struct aer_error_inj *einj) if (device) { edev = to_pcie_device(device); if (!get_service_data(edev)) { - dev_warn(&edev->device, - "aer_inject: AER service is not initialized\n"); + pci_warn(edev->port, "AER service is not initialized\n"); ret = -EPROTONOSUPPORT; goto out_put; } - dev_info(&edev->device, - "aer_inject: Injecting errors %08x/%08x into device %s\n", + pci_info(edev->port, "Injecting errors %08x/%08x into device %s\n", einj->cor_status, einj->uncor_status, pci_name(dev)); local_irq_disable(); generic_handle_irq(edev->irq); local_irq_enable(); } else { - pci_err(rpdev, "aer_inject: AER device not found\n"); + pci_err(rpdev, "AER device not found\n"); ret = -ENODEV; } out_put:
From: Subbaraya Sundeep sbhatta@marvell.com
mainline inclusion from mainline-v5.2-rc1 commit 2dbce590117981196fe355efc0569bc6f949ae9b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 2dbce590117981196fe355efc0569bc6f949ae9b upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The "Enhanced Allocation (EA) for Memory and I/O Resources" ECN, approved 23 October 2014, sec 6.9.1.2, specifies a second DW in the capability for type 1 (bridge) functions to describe fixed secondary and subordinate bus numbers. This ECN was included in the PCIe r4.0 spec, but sec 6.9.1.2 was omitted, presumably by mistake.
Read fixed bus numbers from the EA capability for bridges.
Signed-off-by: Subbaraya Sundeep sbhatta@marvell.com [bhelgaas: add pci_ea_fixed_busnrs() return value] Signed-off-by: Bjorn Helgaas bhelgaas@google.com
(cherry picked from commit 2dbce590117981196fe355efc0569bc6f949ae9b) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/probe.c | 54 ++++++++++++++++++++++++++++++++--- include/uapi/linux/pci_regs.h | 6 ++++ 2 files changed, 56 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 2425bd2dddcc..6a1566ba1cb4 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1131,6 +1131,36 @@ static void pci_enable_crs(struct pci_dev *pdev)
static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus, unsigned int available_buses); +/** + * pci_ea_fixed_busnrs() - Read fixed Secondary and Subordinate bus + * numbers from EA capability. + * @dev: Bridge + * @sec: updated with secondary bus number from EA + * @sub: updated with subordinate bus number from EA + * + * If @dev is a bridge with EA capability, update @sec and @sub with + * fixed bus numbers from the capability and return true. Otherwise, + * return false. + */ +static bool pci_ea_fixed_busnrs(struct pci_dev *dev, u8 *sec, u8 *sub) +{ + int ea, offset; + u32 dw; + + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) + return false; + + /* find PCI EA capability in list */ + ea = pci_find_capability(dev, PCI_CAP_ID_EA); + if (!ea) + return false; + + offset = ea + PCI_EA_FIRST_ENT; + pci_read_config_dword(dev, offset, &dw); + *sec = dw & PCI_EA_SEC_BUS_MASK; + *sub = (dw & PCI_EA_SUB_BUS_MASK) >> PCI_EA_SUB_BUS_SHIFT; + return true; +}
/* * pci_scan_bridge_extend() - Scan buses behind a bridge @@ -1165,6 +1195,9 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, u16 bctl; u8 primary, secondary, subordinate; int broken = 0; + bool fixed_buses; + u8 fixed_sec, fixed_sub; + int next_busnr;
/* * Make sure the bridge is powered on to be able to access config @@ -1264,17 +1297,24 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, /* Clear errors */ pci_write_config_word(dev, PCI_STATUS, 0xffff);
+ /* Read bus numbers from EA Capability (if present) */ + fixed_buses = pci_ea_fixed_busnrs(dev, &fixed_sec, &fixed_sub); + if (fixed_buses) + next_busnr = fixed_sec; + else + next_busnr = max + 1; + /* * Prevent assigning a bus number that already exists. * This can happen when a bridge is hot-plugged, so in this * case we only re-scan this bus. */ - child = pci_find_bus(pci_domain_nr(bus), max+1); + child = pci_find_bus(pci_domain_nr(bus), next_busnr); if (!child) { - child = pci_add_new_bus(bus, dev, max+1); + child = pci_add_new_bus(bus, dev, next_busnr); if (!child) goto out; - pci_bus_insert_busn_res(child, max+1, + pci_bus_insert_busn_res(child, next_busnr, bus->busn_res.end); } max++; @@ -1335,7 +1375,13 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, max += i; }
- /* Set subordinate bus number to its real value */ + /* + * Set subordinate bus number to its real value. + * If fixed subordinate bus number exists from EA + * capability then use it. + */ + if (fixed_buses) + max = fixed_sub; pci_bus_update_busn_res_end(child, max); pci_write_config_byte(dev, PCI_SUBORDINATE_BUS, max); } diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 992f4d384ba1..123be51eba11 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -372,6 +372,12 @@ #define PCI_EA_FIRST_ENT_BRIDGE 8 /* First EA Entry for Bridges */ #define PCI_EA_ES 0x00000007 /* Entry Size */ #define PCI_EA_BEI 0x000000f0 /* BAR Equivalent Indicator */ + +/* EA fixed Secondary and Subordinate bus numbers for Bridge */ +#define PCI_EA_SEC_BUS_MASK 0xff +#define PCI_EA_SUB_BUS_MASK 0xff00 +#define PCI_EA_SUB_BUS_SHIFT 8 + /* 0-5 map to BARs 0-5 respectively */ #define PCI_EA_BEI_BAR0 0 #define PCI_EA_BEI_BAR5 5
From: Mika Westerberg mika.westerberg@linux.intel.com
mainline inclusion from mainline-v5.4-rc1 commit 984998e3404e9073479281dbba8af36b104e8c00 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 984998e3404e9073479281dbba8af36b104e8c00 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
pcie_downstream_port() is useful in other places where code needs to determine whether the PCIe port is downstream so make it available outside of access.c.
Link: https://lore.kernel.org/r/20190822085553.62697-1-mika.westerberg@linux.intel... Signed-off-by: Mika Westerberg mika.westerberg@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com (cherry picked from commit 984998e3404e9073479281dbba8af36b104e8c00) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/access.c | 9 --------- drivers/pci/pci.h | 9 +++++++++ 2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/access.c b/drivers/pci/access.c index c5204892289b..79267c254458 100644 --- a/drivers/pci/access.c +++ b/drivers/pci/access.c @@ -336,15 +336,6 @@ static inline int pcie_cap_version(const struct pci_dev *dev) return pcie_caps_reg(dev) & PCI_EXP_FLAGS_VERS; }
-static bool pcie_downstream_port(const struct pci_dev *dev) -{ - int type = pci_pcie_type(dev); - - return type == PCI_EXP_TYPE_ROOT_PORT || - type == PCI_EXP_TYPE_DOWNSTREAM || - type == PCI_EXP_TYPE_PCIE_BRIDGE; -} - bool pcie_cap_has_lnkctl(const struct pci_dev *dev) { int type = pci_pcie_type(dev); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 8398071edd46..54f25661911e 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -114,6 +114,15 @@ static inline bool pci_power_manageable(struct pci_dev *pci_dev) return !pci_has_subordinate(pci_dev) || pci_dev->bridge_d3; }
+static inline bool pcie_downstream_port(const struct pci_dev *dev) +{ + int type = pci_pcie_type(dev); + + return type == PCI_EXP_TYPE_ROOT_PORT || + type == PCI_EXP_TYPE_DOWNSTREAM || + type == PCI_EXP_TYPE_PCIE_BRIDGE; +} + int pci_vpd_init(struct pci_dev *dev); void pci_vpd_release(struct pci_dev *dev); void pcie_vpd_create_sysfs_dev_files(struct pci_dev *dev);
From: Mika Westerberg mika.westerberg@linux.intel.com
mainline inclusion from mainline-v5.4-rc1 commit ca78410403dd64ac0ee0e3cc8646b38335271bfd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ca78410403dd64ac0ee0e3cc8646b38335271bfd upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
In some systems, the Device/Port Type in the PCI Express Capabilities register incorrectly identifies upstream ports as downstream ports.
d0751b98dfa3 ("PCI: Add dev->has_secondary_link to track downstream PCIe links") addressed this by adding pci_dev.has_secondary_link, which is set for downstream ports. But this is confusing because pci_pcie_type() sometimes gives the wrong answer, and it's not obvious that we should use pci_dev.has_secondary_link instead.
Reduce the confusion by correcting the type of the port itself so that pci_pcie_type() returns the actual type regardless of what the Device/Port Type register claims it is. Update the users to call pci_pcie_type() and pcie_downstream_port() accordingly, and remove pci_dev.has_secondary_link completely.
Link: https://lore.kernel.org/linux-pci/20190703133953.GK128603@google.com/ Suggested-by: Bjorn Helgaas bhelgaas@google.com Link: https://lore.kernel.org/r/20190822085553.62697-2-mika.westerberg@linux.intel... Signed-off-by: Mika Westerberg mika.westerberg@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com (cherry picked from commit ca78410403dd64ac0ee0e3cc8646b38335271bfd) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci.c | 2 +- drivers/pci/pcie/aspm.c | 8 +++---- drivers/pci/pcie/err.c | 2 +- drivers/pci/probe.c | 48 ++++++++++++++++++++++++----------------- drivers/pci/vc.c | 4 +++- include/linux/pci.h | 1 - 6 files changed, 37 insertions(+), 28 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 212776f4ad17..a8675df3311e 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -3549,7 +3549,7 @@ int pci_enable_atomic_ops_to_root(struct pci_dev *dev, u32 cap_mask) }
/* Ensure upstream ports don't block AtomicOps on egress */ - if (!bridge->has_secondary_link) { + if (pci_pcie_type(bridge) == PCI_EXP_TYPE_UPSTREAM) { pcie_capability_read_dword(bridge, PCI_EXP_DEVCTL2, &ctl2); if (ctl2 & PCI_EXP_DEVCTL2_ATOMIC_EGRESS_BLOCK) diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c index e1946ec5de57..54fc2856995d 100644 --- a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c @@ -920,10 +920,10 @@ void pcie_aspm_init_link_state(struct pci_dev *pdev)
/* * We allocate pcie_link_state for the component on the upstream - * end of a Link, so there's nothing to do unless this device has a - * Link on its secondary side. + * end of a Link, so there's nothing to do unless this device is + * downstream port. */ - if (!pdev->has_secondary_link) + if (!pcie_downstream_port(pdev)) return;
/* VIA has a strange chipset, root port is under a bridge */ @@ -1078,7 +1078,7 @@ static void __pci_disable_link_state(struct pci_dev *pdev, int state, bool sem) if (!pci_is_pcie(pdev)) return;
- if (pdev->has_secondary_link) + if (pcie_downstream_port(pdev)) parent = pdev; if (!parent || !parent->link_state) return; diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 288641b30116..98acf944a27f 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -168,7 +168,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service) driver = pcie_port_find_service(dev, service); if (driver && driver->reset_link) { status = driver->reset_link(dev); - } else if (dev->has_secondary_link) { + } else if (pcie_downstream_port(dev)) { status = default_reset_link(dev); } else { pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n", diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 6a1566ba1cb4..83d2c9da238e 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1475,26 +1475,38 @@ void set_pcie_port_type(struct pci_dev *pdev) pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, ®16); pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD;
+ parent = pci_upstream_bridge(pdev); + if (!parent) + return; + /* - * A Root Port or a PCI-to-PCIe bridge is always the upstream end - * of a Link. No PCIe component has two Links. Two Links are - * connected by a Switch that has a Port on each Link and internal - * logic to connect the two Ports. + * Some systems do not identify their upstream/downstream ports + * correctly so detect impossible configurations here and correct + * the port type accordingly. */ type = pci_pcie_type(pdev); - if (type == PCI_EXP_TYPE_ROOT_PORT || - type == PCI_EXP_TYPE_PCIE_BRIDGE) - pdev->has_secondary_link = 1; - else if (type == PCI_EXP_TYPE_UPSTREAM || - type == PCI_EXP_TYPE_DOWNSTREAM) { - parent = pci_upstream_bridge(pdev); - + if (type == PCI_EXP_TYPE_DOWNSTREAM) { /* - * Usually there's an upstream device (Root Port or Switch - * Downstream Port), but we can't assume one exists. + * If pdev claims to be downstream port but the parent + * device is also downstream port assume pdev is actually + * upstream port. */ - if (parent && !parent->has_secondary_link) - pdev->has_secondary_link = 1; + if (pcie_downstream_port(parent)) { + pci_info(pdev, "claims to be downstream port but is acting as upstream port, correcting type\n"); + pdev->pcie_flags_reg &= ~PCI_EXP_FLAGS_TYPE; + pdev->pcie_flags_reg |= PCI_EXP_TYPE_UPSTREAM; + } + } else if (type == PCI_EXP_TYPE_UPSTREAM) { + /* + * If pdev claims to be upstream port but the parent + * device is also upstream port assume pdev is actually + * downstream port. + */ + if (pci_pcie_type(parent) == PCI_EXP_TYPE_UPSTREAM) { + pci_info(pdev, "claims to be upstream port but is acting as downstream port, correcting type\n"); + pdev->pcie_flags_reg &= ~PCI_EXP_FLAGS_TYPE; + pdev->pcie_flags_reg |= PCI_EXP_TYPE_DOWNSTREAM; + } } }
@@ -3112,12 +3124,8 @@ static int only_one_child(struct pci_bus *bus) * A PCIe Downstream Port normally leads to a Link with only Device * 0 on it (PCIe spec r3.1, sec 7.3.1). As an optimization, scan * only for Device 0 in that situation. - * - * Checking has_secondary_link is a hack to identify Downstream - * Ports because sometimes Switches are configured such that the - * PCIe Port Type labels are backwards. */ - if (bridge && pci_is_pcie(bridge) && bridge->has_secondary_link) + if (bridge && pci_is_pcie(bridge) && pcie_downstream_port(bridge)) return 1;
return 0; diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c index 5acd9c02683a..9ae9fb9339e8 100644 --- a/drivers/pci/vc.c +++ b/drivers/pci/vc.c @@ -13,6 +13,8 @@ #include <linux/pci_regs.h> #include <linux/types.h>
+#include "pci.h" + /** * pci_vc_save_restore_dwords - Save or restore a series of dwords * @dev: device @@ -105,7 +107,7 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res) struct pci_dev *link = NULL;
/* Enable VCs from the downstream device */ - if (!dev->has_secondary_link) + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev)) return;
ctrl_pos = pos + PCI_VC_RES_CTRL + (res * PCI_CAP_VC_PER_VC_SIZEOF); diff --git a/include/linux/pci.h b/include/linux/pci.h index 3ac8633deab5..e3c478b6eb78 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -408,7 +408,6 @@ struct pci_dev { unsigned int broken_intx_masking:1; /* INTx masking can't be used */ unsigned int io_window_1k:1; /* Intel bridge 1K I/O windows */ unsigned int irq_managed:1; - unsigned int has_secondary_link:1; unsigned int non_compliant_bars:1; /* Broken BARs; ignore them */ unsigned int is_probed:1; /* Device probing in progress */ unsigned int link_active_reporting:1;/* Device capable of reporting link active */
From: "Patel, Mayurkumar" mayurkumar.patel@intel.com
mainline inclusion from mainline-v5.5-rc1 commit af65d1ad416bc6e069ccb9e649faeda224248f96 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit af65d1ad416bc6e069ccb9e649faeda224248f96 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Previously we did not save and restore the AER configuration on suspend/resume, so the configuration may be lost after resume.
Save the AER configuration during suspend and restore it during resume.
[bhelgaas: commit log] Link: https://lore.kernel.org/r/92EBB4272BF81E4089A7126EC1E7B28492C3B007@IRSMSX101... Signed-off-by: Mayurkumar Patel mayurkumar.patel@intel.com Signed-off-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com
(cherry picked from commit af65d1ad416bc6e069ccb9e649faeda224248f96) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/access.c | 2 +- drivers/pci/pci.c | 2 ++ drivers/pci/pci.h | 1 + drivers/pci/pcie/aer.c | 62 ++++++++++++++++++++++++++++++++++++++++-- include/linux/aer.h | 4 +++ 5 files changed, 68 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/access.c b/drivers/pci/access.c index 79267c254458..dfb2fefbbce1 100644 --- a/drivers/pci/access.c +++ b/drivers/pci/access.c @@ -355,7 +355,7 @@ static inline bool pcie_cap_has_sltctl(const struct pci_dev *dev) pcie_caps_reg(dev) & PCI_EXP_FLAGS_SLOT; }
-static inline bool pcie_cap_has_rtctl(const struct pci_dev *dev) +bool pcie_cap_has_rtctl(const struct pci_dev *dev) { int type = pci_pcie_type(dev);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index a8675df3311e..671b67e5b864 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1348,6 +1348,7 @@ int pci_save_state(struct pci_dev *dev)
pci_save_ltr_state(dev); pci_save_dpc_state(dev); + pci_save_aer_state(dev); return pci_save_vc_state(dev); } EXPORT_SYMBOL(pci_save_state); @@ -1461,6 +1462,7 @@ void pci_restore_state(struct pci_dev *dev) pci_restore_dpc_state(dev);
pci_cleanup_aer_error_status_regs(dev); + pci_restore_aer_state(dev);
pci_restore_config_space(dev);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 54f25661911e..ff5dde9e6745 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -12,6 +12,7 @@ extern const unsigned char pcie_link_speed[]; extern bool pci_early_dump;
bool pcie_cap_has_lnkctl(const struct pci_dev *dev); +bool pcie_cap_has_rtctl(const struct pci_dev *dev);
/* Functions internal to the PCI core code */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 3d8265e7d581..5123c1ef6412 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -448,12 +448,70 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev) return 0; }
+void pci_save_aer_state(struct pci_dev *dev) +{ + struct pci_cap_saved_state *save_state; + u32 *cap; + int pos; + + pos = dev->aer_cap; + if (!pos) + return; + + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR); + if (!save_state) + return; + + cap = &save_state->cap.data[0]; + pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, cap++); + pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, cap++); + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, cap++); + pci_read_config_dword(dev, pos + PCI_ERR_CAP, cap++); + if (pcie_cap_has_rtctl(dev)) + pci_read_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, cap++); +} + +void pci_restore_aer_state(struct pci_dev *dev) +{ + struct pci_cap_saved_state *save_state; + u32 *cap; + int pos; + + pos = dev->aer_cap; + if (!pos) + return; + + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR); + if (!save_state) + return; + + cap = &save_state->cap.data[0]; + pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, *cap++); + pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, *cap++); + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, *cap++); + pci_write_config_dword(dev, pos + PCI_ERR_CAP, *cap++); + if (pcie_cap_has_rtctl(dev)) + pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, *cap++); +} + void pci_aer_init(struct pci_dev *dev) { + int n; + dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); + if (!dev->aer_cap) + return;
- if (dev->aer_cap) - dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL); + dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL); + + /* + * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER, + * PCI_ERR_COR_MASK, and PCI_ERR_CAP. Root and Root Complex Event + * Collectors also implement PCI_ERR_ROOT_COMMAND (PCIe r5.0, sec + * 7.8.4). + */ + n = pcie_cap_has_rtctl(dev) ? 5 : 4; + pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
pci_cleanup_aer_error_status_regs(dev); } diff --git a/include/linux/aer.h b/include/linux/aer.h index 514bffa11dbb..fa19e01f418a 100644 --- a/include/linux/aer.h +++ b/include/linux/aer.h @@ -46,6 +46,8 @@ int pci_enable_pcie_error_reporting(struct pci_dev *dev); int pci_disable_pcie_error_reporting(struct pci_dev *dev); int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); int pci_cleanup_aer_error_status_regs(struct pci_dev *dev); +void pci_save_aer_state(struct pci_dev *dev); +void pci_restore_aer_state(struct pci_dev *dev); #else static inline int pci_enable_pcie_error_reporting(struct pci_dev *dev) { @@ -63,6 +65,8 @@ static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev) { return -EINVAL; } +static inline void pci_save_aer_state(struct pci_dev *dev) {} +static inline void pci_restore_aer_state(struct pci_dev *dev) {} #endif
void cper_print_aer(struct pci_dev *dev, int aer_severity,
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
mainline inclusion from mainline-v5.5-rc1 commit 6a8c97345a15f9c60ff6c6ac1629a3e9ec140320 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6a8c97345a15f9c60ff6c6ac1629a3e9ec140320 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Simplify error counting code by using for_each_set_bit() library function.
Link: https://lore.kernel.org/r/20190827151823.75312-1-andriy.shevchenko@linux.int... Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com (cherry picked from commit 6a8c97345a15f9c60ff6c6ac1629a3e9ec140320) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 5123c1ef6412..342caa9bea45 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -15,6 +15,7 @@ #define pr_fmt(fmt) "AER: " fmt #define dev_fmt pr_fmt
+#include <linux/bitops.h> #include <linux/cper.h> #include <linux/pci.h> #include <linux/pci-acpi.h> @@ -715,7 +716,8 @@ const struct attribute_group aer_stats_attr_group = { static void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info) { - int status, i, max = -1; + unsigned long status = info->status & ~info->mask; + int i, max = -1; u64 *counter = NULL; struct aer_stats *aer_stats = pdev->aer_stats;
@@ -740,10 +742,8 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev, break; }
- status = (info->status & ~info->mask); - for (i = 0; i < max; i++) - if (status & (1 << i)) - counter[i]++; + for_each_set_bit(i, &status, max) + counter[i]++; }
static void pci_rootport_aer_stats_incr(struct pci_dev *pdev, @@ -775,14 +775,11 @@ static void __print_tlp_header(struct pci_dev *dev, static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info) { - int i, status; + unsigned long status = info->status & ~info->mask; const char *errmsg = NULL; - status = (info->status & ~info->mask); - - for (i = 0; i < 32; i++) { - if (!(status & (1 << i))) - continue; + int i;
+ for_each_set_bit(i, &status, 32) { if (info->severity == AER_CORRECTABLE) errmsg = i < ARRAY_SIZE(aer_correctable_error_string) ? aer_correctable_error_string[i] : NULL;
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
mainline inclusion from mainline-v5.5-rc1 commit 161eea1b25268a1f7fa8d220a3f0c350b4cc491c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 161eea1b25268a1f7fa8d220a3f0c350b4cc491c upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Kernel-doc validator complains:
aer.c:207: warning: Function parameter or member 'str' not described in 'pcie_ecrc_get_policy' aer.c:1209: warning: Function parameter or member 'irq' not described in 'aer_isr' aer.c:1209: warning: Function parameter or member 'context' not described in 'aer_isr' aer.c:1209: warning: Excess function parameter 'work' description in 'aer_isr'
Fix the above accordingly.
Link: https://lore.kernel.org/r/20190827151823.75312-2-andriy.shevchenko@linux.int... Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com (cherry picked from commit 161eea1b25268a1f7fa8d220a3f0c350b4cc491c) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/aer.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 342caa9bea45..57584cbaa428 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -202,6 +202,7 @@ void pcie_set_ecrc_checking(struct pci_dev *dev)
/** * pcie_ecrc_get_policy - parse kernel command-line ecrc option + * @str: ECRC policy from kernel command line to use */ void pcie_ecrc_get_policy(char *str) { @@ -1260,7 +1261,8 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
/** * aer_isr - consume errors detected by root port - * @work: definition of this work item + * @irq: IRQ assigned to Root Port + * @context: pointer to Root Port data structure * * Invoked, as DPC, when root port records new detected error */
From: Olof Johansson olof@lixom.net
mainline inclusion from mainline-v5.5-rc1 commit 35a0b2378c199d4f26e458b2ca38ea56aaf2d9b8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 35a0b2378c199d4f26e458b2ca38ea56aaf2d9b8 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
Prior to eed85ff4c0da7 ("PCI/DPC: Enable DPC only if AER is available"), Linux handled DPC events regardless of whether firmware had granted it ownership of AER or DPC, e.g., via _OSC.
PCIe r5.0, sec 6.2.10, recommends that the OS link control of DPC to control of AER, so after eed85ff4c0da7, Linux handles DPC events only if it has control of AER.
On platforms that do not grant OS control of AER via _OSC, Linux DPC handling worked before eed85ff4c0da7 but not after.
To make Linux DPC handling work on those platforms the same way they did before, add a "pcie_ports=dpc-native" kernel parameter that makes Linux handle DPC events regardless of whether it has control of AER.
[bhelgaas: commit log, move pcie_ports_dpc_native to drivers/pci/] Link: https://lore.kernel.org/r/20191023192205.97024-1-olof@lixom.net Signed-off-by: Olof Johansson olof@lixom.net Signed-off-by: Bjorn Helgaas bhelgaas@google.com
(cherry picked from commit 35a0b2378c199d4f26e458b2ca38ea56aaf2d9b8) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/admin-guide/kernel-parameters.txt | 2 ++ drivers/pci/pcie/dpc.c | 2 +- drivers/pci/pcie/portdrv.h | 2 ++ drivers/pci/pcie/portdrv_core.c | 11 ++++++++--- drivers/pci/pcie/portdrv_pci.c | 8 ++++++++ 5 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 74285118ff44..54127ef75155 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3483,6 +3483,8 @@ even if the platform doesn't give the OS permission to use them. This may cause conflicts if the platform also tries to use these services. + dpc-native Use native PCIe service for DPC only. May + cause conflicts if firmware uses AER or DPC. compat Disable native PCIe services (PME, AER, DPC, PCIe hotplug).
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index a32ec3487a8d..e06f42f58d3d 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -291,7 +291,7 @@ static int dpc_probe(struct pcie_device *dev) int status; u16 ctl, cap;
- if (pcie_aer_get_firmware_first(pdev)) + if (pcie_aer_get_firmware_first(pdev) && !pcie_ports_dpc_native) return -ENOTSUPP;
dpc = devm_kzalloc(device, sizeof(*dpc), GFP_KERNEL); diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h index 55fe9f7602de..c42cb76fa67f 100644 --- a/drivers/pci/pcie/portdrv.h +++ b/drivers/pci/pcie/portdrv.h @@ -23,6 +23,8 @@
#define PCIE_PORT_DEVICE_MAXSERVICES 4
+extern bool pcie_ports_dpc_native; + #ifdef CONFIG_PCIEAER int pcie_aer_init(void); #else diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index a3d1cc42a393..54772e5b7f8e 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -248,9 +248,14 @@ static int get_port_device_capability(struct pci_dev *dev) pcie_pme_interrupt_enable(dev, false); }
- if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && - pci_aer_available() && services & PCIE_PORT_SERVICE_AER) - services |= PCIE_PORT_SERVICE_DPC; + /* + * With dpc-native, allow Linux to use DPC even if it doesn't have + * permission to use AER. + */ + if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && + pci_aer_available() && + (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER))) + services |= PCIE_PORT_SERVICE_DPC;
return services; } diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index e6671fec17e1..94c2fd71c6ca 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -29,12 +29,20 @@ bool pcie_ports_disabled; */ bool pcie_ports_native;
+/* + * If the user specified "pcie_ports=dpc-native", use the Linux DPC PCIe + * service even if the platform hasn't given us permission. + */ +bool pcie_ports_dpc_native; + static int __init pcie_port_setup(char *str) { if (!strncmp(str, "compat", 6)) pcie_ports_disabled = true; else if (!strncmp(str, "native", 6)) pcie_ports_native = true; + else if (!strncmp(str, "dpc-native", 10)) + pcie_ports_dpc_native = true;
return 1; }
From: Thomas Gleixner tglx@linutronix.de
mainline inclusion from mainline-v5.7-rc1 commit acd26bcf362708594ea081ef55140e37d0854ed2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Error injection mechanisms need a half ways safe way to inject interrupts as invoking generic_handle_irq() or the actual device interrupt handler directly from e.g. a debugfs write is not guaranteed to be safe.
On x86 generic_handle_irq() is unsafe due to the hardware trainwreck which is the base of x86 interrupt delivery and affinity management.
Move the irq debugfs injection code into a separate function which can be used by error injection code as well.
The implementation prevents at least that state is corrupted, but it cannot close a very tiny race window on x86 which might result in a stale and not serviced device interrupt under very unlikely circumstances.
This is explicitly for debugging and testing and not for production use or abuse in random driver code.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Reviewed-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Acked-by: Marc Zyngier maz@kernel.org Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/interrupt.h | 2 ++ kernel/irq/Kconfig | 5 ++++ kernel/irq/chip.c | 2 +- kernel/irq/debugfs.c | 34 +------------------------ kernel/irq/internals.h | 2 +- kernel/irq/resend.c | 53 +++++++++++++++++++++++++++++++++++++-- 6 files changed, 61 insertions(+), 37 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 4562f67f950b..97de36a38770 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -235,6 +235,8 @@ extern void enable_percpu_nmi(unsigned int irq, unsigned int type); extern int prepare_percpu_nmi(unsigned int irq); extern void teardown_percpu_nmi(unsigned int irq);
+extern int irq_inject_interrupt(unsigned int irq); + /* The following three functions are for the core kernel use only. */ extern void suspend_device_irqs(void); extern void resume_device_irqs(void); diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index 5f3e2baefca9..14a85d0161ea 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -42,6 +42,10 @@ config GENERIC_IRQ_MIGRATION config AUTO_IRQ_AFFINITY bool
+# Interrupt injection mechanism +config GENERIC_IRQ_INJECTION + bool + # Tasklet based software resend for pending interrupts on enable_irq() config HARDIRQS_SW_RESEND bool @@ -123,6 +127,7 @@ config SPARSE_IRQ config GENERIC_IRQ_DEBUGFS bool "Expose irq internals in debugfs" depends on DEBUG_FS + select GENERIC_IRQ_INJECTION default n ---help---
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 77c3d715a01f..fc03e7271169 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -278,7 +278,7 @@ int irq_startup(struct irq_desc *desc, bool resend, bool force) } } if (resend) - check_irq_resend(desc); + check_irq_resend(desc, false);
return ret; } diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index 859412734183..216db115b68a 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -190,39 +190,7 @@ static ssize_t irq_debug_write(struct file *file, const char __user *user_buf, return -EFAULT;
if (!strncmp(buf, "trigger", size)) { - unsigned long flags; - int err; - - /* Try the HW interface first */ - err = irq_set_irqchip_state(irq_desc_get_irq(desc), - IRQCHIP_STATE_PENDING, true); - if (!err) - return count; - - /* - * Otherwise, try to inject via the resend interface, - * which may or may not succeed. - */ - chip_bus_lock(desc); - raw_spin_lock_irqsave(&desc->lock, flags); - - /* - * Don't allow injection when the interrupt is: - * - Level or NMI type - * - not activated - * - replaying already - */ - if (irq_settings_is_level(desc) || - !irqd_is_activated(&desc->irq_data) || - (desc->istate & (IRQS_NMI | IRQS_REPLAY))) { - err = -EINVAL; - } else { - desc->istate |= IRQS_PENDING; - err = check_irq_resend(desc); - } - - raw_spin_unlock_irqrestore(&desc->lock, flags); - chip_bus_sync_unlock(desc); + int err = irq_inject_interrupt(irq_desc_get_irq(desc));
return err ? err : count; } diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index 36c193df9dac..871d1c12eae9 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -108,7 +108,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc); irqreturn_t handle_irq_event(struct irq_desc *desc);
/* Resending of interrupts :*/ -int check_irq_resend(struct irq_desc *desc); +int check_irq_resend(struct irq_desc *desc, bool inject); bool irq_wait_for_poll(struct irq_desc *desc); void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index 7de48bc06c75..c0a2d182a060 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -103,7 +103,7 @@ static int try_retrigger(struct irq_desc *desc) * * Is called with interrupts disabled and desc->lock held. */ -int check_irq_resend(struct irq_desc *desc) +int check_irq_resend(struct irq_desc *desc, bool inject) { int err = 0;
@@ -119,7 +119,7 @@ int check_irq_resend(struct irq_desc *desc) if (desc->istate & IRQS_REPLAY) return -EBUSY;
- if (!(desc->istate & IRQS_PENDING)) + if (!(desc->istate & IRQS_PENDING) && !inject) return 0;
desc->istate &= ~IRQS_PENDING; @@ -132,3 +132,52 @@ int check_irq_resend(struct irq_desc *desc) desc->istate |= IRQS_REPLAY; return err; } + +#ifdef CONFIG_GENERIC_IRQ_INJECTION +/** + * irq_inject_interrupt - Inject an interrupt for testing/error injection + * @irq: The interrupt number + * + * This function must only be used for debug and testing purposes! + * + * Especially on x86 this can cause a premature completion of an interrupt + * affinity change causing the interrupt line to become stale. Very + * unlikely, but possible. + * + * The injection can fail for various reasons: + * - Interrupt is not activated + * - Interrupt is NMI type or currently replaying + * - Interrupt is level type + * - Interrupt does not support hardware retrigger and software resend is + * either not enabled or not possible for the interrupt. + */ +int irq_inject_interrupt(unsigned int irq) +{ + struct irq_desc *desc; + unsigned long flags; + int err; + + /* Try the state injection hardware interface first */ + if (!irq_set_irqchip_state(irq, IRQCHIP_STATE_PENDING, true)) + return 0; + + /* That failed, try via the resend mechanism */ + desc = irq_get_desc_buslock(irq, &flags, 0); + if (!desc) + return -EINVAL; + + /* + * Only try to inject when the interrupt is: + * - not NMI type + * - activated + */ + if ((desc->istate & IRQS_NMI) || !irqd_is_activated(&desc->irq_data)) + err = -EINVAL; + else + err = check_irq_resend(desc, true); + + irq_put_desc_busunlock(desc, flags); + return err; +} +EXPORT_SYMBOL_GPL(irq_inject_interrupt); +#endif
From: Thomas Gleixner tglx@linutronix.de
mainline inclusion from mainline-v5.7-rc1 commit 9ae0522537852408f0f48af888e44d6876777463 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9ae0522537852408f0f48af888e44d6876777463 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The AER error injection mechanism just blindly abuses generic_handle_irq() which is really not meant for consumption by random drivers. The include of linux/irq.h should have been a red flag in the first place. Driver code, unless implementing interrupt chips or low level hypervisor functionality has absolutely no business with that.
Invoking generic_handle_irq() from non interrupt handling context can have nasty side effects at least on x86 due to the hardware trainwreck which makes interrupt affinity changes a fragile beast. Sathyanarayanan triggered a NULL pointer dereference in the low level APIC code that way. While the particular pointer could be checked this would only paper over the issue because there are other ways to trigger warnings or silently corrupt state.
Invoke the new irq_inject_interrupt() mechanism, which has the necessary sanity checks in place and injects the interrupt via the irq_retrigger() mechanism, which is at least halfways safe vs. the fragile x86 affinity change mechanics.
It's safe on x86 as it does not corrupt state, but it still can cause a premature completion of an interrupt affinity change causing the interrupt line to become stale. Very unlikely, but possible.
For regular operations this is a non issue as AER error injection is meant for debugging and testing and not for usage on production systems. People using this should better know what they are doing.
Fixes: 390e2db82480 ("PCI/AER: Abstract AER interrupt handling") Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Reviewed-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Cc: Bjorn Helgaas bhelgaas@google.com Link: https://lkml.kernel.org/r/20200306130624.098374457@linutronix.de (cherry picked from commit 9ae0522537852408f0f48af888e44d6876777463) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com
Conflicts: drivers/pci/pcie/Kconfig Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/Kconfig | 2 +- drivers/pci/pcie/aer_inject.c | 6 ++---- 2 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig index 0a1e9d379bc5..49b3b7001de8 100644 --- a/drivers/pci/pcie/Kconfig +++ b/drivers/pci/pcie/Kconfig @@ -36,7 +36,7 @@ config PCIEAER config PCIEAER_INJECT tristate "PCI Express error injection support" depends on PCIEAER - default n + select GENERIC_IRQ_INJECTION help This enables PCI Express Root Port Advanced Error Reporting (AER) software error injector. diff --git a/drivers/pci/pcie/aer_inject.c b/drivers/pci/pcie/aer_inject.c index 83d18551ba79..5a1a82b2a4f6 100644 --- a/drivers/pci/pcie/aer_inject.c +++ b/drivers/pci/pcie/aer_inject.c @@ -16,7 +16,7 @@
#include <linux/module.h> #include <linux/init.h> -#include <linux/irq.h> +#include <linux/interrupt.h> #include <linux/miscdevice.h> #include <linux/pci.h> #include <linux/slab.h> @@ -475,9 +475,7 @@ static int aer_inject(struct aer_error_inj *einj) } pci_info(edev->port, "Injecting errors %08x/%08x into device %s\n", einj->cor_status, einj->uncor_status, pci_name(dev)); - local_irq_disable(); - generic_handle_irq(edev->irq); - local_irq_enable(); + ret = irq_inject_interrupt(edev->irq); } else { pci_err(rpdev, "AER device not found\n"); ret = -ENODEV;
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.7-rc1 commit 202853595e53f981c86656c49fc1cc1e3620f558 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 202853595e53f981c86656c49fc1cc1e3620f558 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
The presence detect state (PDS) is normally a logical OR of in-band and out-of-band (OOB) presence detect. As of PCIe 4.0, there is the option to disable in-band presence so that the PDS bit always reflects the state of the out-of-band presence.
The recommendation of the PCIe spec is to disable in-band presence whenever supported (PCIe r5.0, appendix I implementation note):
Due to architectural issues, the in-band (Physical-Layer-based) portion of the PD mechanism is deprecated for use with async hot-plug. One issue is that in-band PD as architected does not detect adapter removal during certain LTSSM states, notably the L1 and Disabled States. Another issue is that when both in-band and OOB PD are being used together, the Presence Detect State bit and its associated interrupt mechanism always reflect the logical OR of the inband and OOB PD states, and with some hot-plug hardware configurations, it is important for software to detect and respond to in-band and OOB PD events independently. If OOB PD is being used and the associated DSP supports In-Band PD Disable, it is recommended that the In-Band PD Disable bit be Set, and the Presence Detect State bit and its associated interrupt mechanism be used exclusively for OOB PD. As a substitute for in-band PD with async hot-plug, the reference model uses either the DPC or the DLL Link Active mechanism.
Link: https://lore.kernel.org/r/20191025190047.38130-2-stuart.w.hayes@gmail.com [bhelgaas: move PCI_EXP_SLTCAP2 read earlier & print PCI_EXP_SLTCAP2_IBPD value (suggested by Lukas)] Signed-off-by: Alexandru Gagniuc mr.nuke.me@gmail.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Shevchenko andy.shevchenko@gmail.com Reviewed-by: Lukas Wunner lukas@wunner.de
(cherry picked from commit 202853595e53f981c86656c49fc1cc1e3620f558) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn
Conflicts: drivers/pci/hotplug/pciehp.h drivers/pci/hotplug/pciehp_hpc.c Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/hotplug/pciehp.h | 1 + drivers/pci/hotplug/pciehp_hpc.c | 12 ++++++++++-- include/uapi/linux/pci_regs.h | 2 ++ 3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h index 1fe9cd746904..23ee5d82b3e3 100644 --- a/drivers/pci/hotplug/pciehp.h +++ b/drivers/pci/hotplug/pciehp.h @@ -113,6 +113,7 @@ struct slot { struct controller { struct mutex ctrl_lock; struct pcie_device *pcie; + unsigned int inband_presence_disabled:1; struct rw_semaphore reset_lock; struct slot *slot; wait_queue_head_t queue; diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 0bb738591d27..0fff48173247 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -926,7 +926,7 @@ static inline void dbg_ctrl(struct controller *ctrl) struct controller *pcie_init(struct pcie_device *dev) { struct controller *ctrl; - u32 slot_cap, link_cap; + u32 slot_cap, slot_cap2, link_cap; u8 poweron; struct pci_dev *pdev = dev->port;
@@ -954,6 +954,13 @@ struct controller *pcie_init(struct pcie_device *dev) init_waitqueue_head(&ctrl->queue); dbg_ctrl(ctrl);
+ pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP2, &slot_cap2); + if (slot_cap2 & PCI_EXP_SLTCAP2_IBPD) { + pcie_write_cmd_nowait(ctrl, PCI_EXP_SLTCTL_IBPD_DISABLE, + PCI_EXP_SLTCTL_IBPD_DISABLE); + ctrl->inband_presence_disabled = 1; + } + /* Check if Data Link Layer Link Active Reporting is implemented */ pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
@@ -963,7 +970,7 @@ struct controller *pcie_init(struct pcie_device *dev) PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC | PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC);
- ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c%s\n", + ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c IbPresDis%c LLActRep%c%s\n", (slot_cap & PCI_EXP_SLTCAP_PSN) >> 19, FLAG(slot_cap, PCI_EXP_SLTCAP_ABP), FLAG(slot_cap, PCI_EXP_SLTCAP_PCP), @@ -974,6 +981,7 @@ struct controller *pcie_init(struct pcie_device *dev) FLAG(slot_cap, PCI_EXP_SLTCAP_HPS), FLAG(slot_cap, PCI_EXP_SLTCAP_EIP), FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS), + FLAG(slot_cap2, PCI_EXP_SLTCAP2_IBPD), FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC), pdev->broken_cmd_compl ? " (with Cmd Compl erratum)" : "");
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index 123be51eba11..c209c4d1730a 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -605,6 +605,7 @@ #define PCI_EXP_SLTCTL_PWR_OFF 0x0400 /* Power Off */ #define PCI_EXP_SLTCTL_EIC 0x0800 /* Electromechanical Interlock Control */ #define PCI_EXP_SLTCTL_DLLSCE 0x1000 /* Data Link Layer State Changed Enable */ +#define PCI_EXP_SLTCTL_IBPD_DISABLE 0x4000 /* In-band PD disable */ #define PCI_EXP_SLTSTA 26 /* Slot Status */ #define PCI_EXP_SLTSTA_ABP 0x0001 /* Attention Button Pressed */ #define PCI_EXP_SLTSTA_PFD 0x0002 /* Power Fault Detected */ @@ -677,6 +678,7 @@ #define PCI_EXP_LNKSTA2 50 /* Link Status 2 */ #define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 52 /* v2 endpoints with link end here */ #define PCI_EXP_SLTCAP2 52 /* Slot Capabilities 2 */ +#define PCI_EXP_SLTCAP2_IBPD 0x00000001 /* In-band PD Disable Supported */ #define PCI_EXP_SLTCTL2 56 /* Slot Control 2 */ #define PCI_EXP_SLTSTA2 58 /* Slot Status 2 */
From: Alexandru Gagniuc mr.nuke.me@gmail.com
mainline inclusion from mainline-v5.7-rc1 commit f496648b99f8f7f6711f7c30a6327381f37dd1e8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit f496648b99f8f7f6711f7c30a6327381f37dd1e8 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support.
When in-band presence detect is disabled, PDS may come up at any time or not at all. PDS being low may indicate that the card is still mating, and we could expect contact bounce to bring down the link as well.
It is reasonable to assume that most cards will mate in a hotplug slot in about a second. Thus, when we know PDS only reflects out-of-band presence detect, it's worthwhile to wait the extra second or so to make sure the card is properly mated before loading the driver and to prevent the hotplug code from disabling a device if the presence detect change goes active after the device is enabled.
Link: https://lore.kernel.org/r/20191025190047.38130-3-stuart.w.hayes@gmail.com [bhelgaas: use ctrl_info() instead of pci_info()] Signed-off-by: Alexandru Gagniuc mr.nuke.me@gmail.com Signed-off-by: Stuart Hayes stuart.w.hayes@gmail.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Shevchenko andy.shevchenko@gmail.com Reviewed-by: Lukas Wunner lukas@wunner.de
(cherry picked from commit f496648b99f8f7f6711f7c30a6327381f37dd1e8) Signed-off-by: Ethan Zhao haifeng.zhao@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/hotplug/pciehp_hpc.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 0fff48173247..9d39647a6100 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -238,6 +238,22 @@ static bool pci_bus_check_dev(struct pci_bus *bus, int devfn) return found; }
+static void pcie_wait_for_presence(struct pci_dev *pdev) +{ + int timeout = 1250; + u16 slot_status; + + do { + pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status); + if (slot_status & PCI_EXP_SLTSTA_PDS) + return; + msleep(10); + timeout -= 10; + } while (timeout > 0); + + pci_info(pdev, "Timeout waiting for Presence Detect\n"); +} + int pciehp_check_link_status(struct controller *ctrl) { struct pci_dev *pdev = ctrl_dev(ctrl); @@ -247,6 +263,9 @@ int pciehp_check_link_status(struct controller *ctrl) if (!pcie_wait_for_link(pdev, true)) return -1;
+ if (ctrl->inband_presence_disabled) + pcie_wait_for_presence(pdev); + found = pci_bus_check_dev(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
From: Stuart Hayes stuart.w.hayes@gmail.com
mainline inclusion from mainline-v5.7-rc1 commit 0b382546d863f2f09eecaccda95a0b4bfd148f92 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Some systems have in-band presence detection disabled for hot-plug PCI slots but do not report this in the slot capabilities 2 (SLTCAP2) register. On these systems, presence detect can become active well after the link is reported to be active, which can cause the slots to be disabled after a device is connected.
Add a DMI table to flag these systems as having in-band presence detect disabled.
Link: https://lore.kernel.org/r/20191025190047.38130-4-stuart.w.hayes@gmail.com Signed-off-by: Stuart Hayes stuart.w.hayes@gmail.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Shevchenko andy.shevchenko@gmail.com Reviewed-by: Lukas Wunner lukas@wunner.de Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/hotplug/pciehp_hpc.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 9d39647a6100..c49d1fc86cc1 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -12,6 +12,7 @@ * Send feedback to greg@kroah.com,kristen.c.accardi@intel.com */
+#include <linux/dmi.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/types.h> @@ -27,6 +28,24 @@ #include "../pci.h" #include "pciehp.h"
+static const struct dmi_system_id inband_presence_disabled_dmi_table[] = { + /* + * Match all Dell systems, as some Dell systems have inband + * presence disabled on NVMe slots (but don't support the bit to + * report it). Setting inband presence disabled should have no + * negative effect, except on broken hotplug slots that never + * assert presence detect--and those will still work, they will + * just have a bit of extra delay before being probed. + */ + { + .ident = "Dell System", + .matches = { + DMI_MATCH(DMI_OEM_STRING, "Dell System"), + }, + }, + {} +}; + static irqreturn_t pciehp_isr(int irq, void *dev_id); static irqreturn_t pciehp_ist(int irq, void *dev_id); static int pciehp_poll(void *data); @@ -980,6 +999,9 @@ struct controller *pcie_init(struct pcie_device *dev) ctrl->inband_presence_disabled = 1; }
+ if (dmi_first_match(inband_presence_disabled_dmi_table)) + ctrl->inband_presence_disabled = 1; + /* Check if Data Link Layer Link Active Reporting is implemented */ pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v4.20-rc1 commit ba12d20edc5caf9835006d8f3efd4ed18465c75b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ba12d20edc5caf9835006d8f3efd4ed18465c75b upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The Arch Perfmon v4 PMI handler is substantially different than the older PMI handler. Instead of adding more and more ifs cleanly fork the new handler into a new function, with the main common code factored out into a common function.
Fix complaint from checkpatch.pl by removing "false" from "static bool warned".
No functional change.
Based-on-code-from: Andi Kleen ak@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: acme@kernel.org Link: http://lkml.kernel.org/r/1533712328-2834-1-git-send-email-kan.liang@linux.in... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 109 +++++++++++++++++++---------------- 1 file changed, 60 insertions(+), 49 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 28a0629c076f..c74e675e432c 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2308,59 +2308,15 @@ static void intel_pmu_reset(void) local_irq_restore(flags); }
-/* - * This handler is triggered by the local APIC, so the APIC IRQ handling - * rules apply: - */ -static int intel_pmu_handle_irq(struct pt_regs *regs) +static int handle_pmi_common(struct pt_regs *regs, u64 status) { struct perf_sample_data data; - struct cpu_hw_events *cpuc; - int bit, loops; - u64 status; - int handled; - int pmu_enabled; - - cpuc = this_cpu_ptr(&cpu_hw_events); - - /* - * Save the PMU state. - * It needs to be restored when leaving the handler. - */ - pmu_enabled = cpuc->enabled; - /* - * No known reason to not always do late ACK, - * but just in case do it opt-in. - */ - if (!x86_pmu.late_ack) - apic_write(APIC_LVTPC, APIC_DM_NMI); - intel_bts_disable_local(); - cpuc->enabled = 0; - __intel_pmu_disable_all(); - handled = intel_pmu_drain_bts_buffer(); - handled += intel_bts_interrupt(); - status = intel_pmu_get_status(); - if (!status) - goto done; - - loops = 0; -again: - intel_pmu_lbr_read(); - intel_pmu_ack_status(status); - if (++loops > 100) { - static bool warned = false; - if (!warned) { - WARN(1, "perfevents: irq loop stuck!\n"); - perf_event_print_debug(); - warned = true; - } - intel_pmu_reset(); - goto done; - } + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + int bit; + int handled = 0;
inc_irq_stat(apic_perf_irqs);
- /* * Ignore a range of extra bits in status that do not indicate * overflow by themselves. @@ -2369,7 +2325,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs) GLOBAL_STATUS_ASIF | GLOBAL_STATUS_LBRS_FROZEN); if (!status) - goto done; + return 0; /* * In case multiple PEBS events are sampled at the same time, * it is possible to have GLOBAL_STATUS bit 62 set indicating @@ -2439,6 +2395,61 @@ static int intel_pmu_handle_irq(struct pt_regs *regs) x86_pmu_stop(event, 0); }
+ return handled; +} + +/* + * This handler is triggered by the local APIC, so the APIC IRQ handling + * rules apply: + */ +static int intel_pmu_handle_irq(struct pt_regs *regs) +{ + struct cpu_hw_events *cpuc; + int loops; + u64 status; + int handled; + int pmu_enabled; + + cpuc = this_cpu_ptr(&cpu_hw_events); + + /* + * Save the PMU state. + * It needs to be restored when leaving the handler. + */ + pmu_enabled = cpuc->enabled; + /* + * No known reason to not always do late ACK, + * but just in case do it opt-in. + */ + if (!x86_pmu.late_ack) + apic_write(APIC_LVTPC, APIC_DM_NMI); + intel_bts_disable_local(); + cpuc->enabled = 0; + __intel_pmu_disable_all(); + handled = intel_pmu_drain_bts_buffer(); + handled += intel_bts_interrupt(); + status = intel_pmu_get_status(); + if (!status) + goto done; + + loops = 0; +again: + intel_pmu_lbr_read(); + intel_pmu_ack_status(status); + if (++loops > 100) { + static bool warned; + + if (!warned) { + WARN(1, "perfevents: irq loop stuck!\n"); + perf_event_print_debug(); + warned = true; + } + intel_pmu_reset(); + goto done; + } + + handled += handle_pmi_common(regs, status); + /* * Repeat if there is more work to be done: */
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v4.20-rc1 commit d4ae552982de39417d17f823df1f06b1cbc3686c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Memory events depends on PEBS support and access to LDLAT MSR, but we display them in /sys/devices/cpu/events even if the CPU does not provide those, like for KVM guests.
That brings the false assumption that those events should be available, while they fail event to open.
Separating the mem-* events attributes and merging them with cpu_events only if there's PEBS support detected.
We could also check if LDLAT MSR is available, but the PEBS check seems to cover the need now.
Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Michael Petlan mpetlan@redhat.com Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Link: http://lkml.kernel.org/r/20180906135748.GC9577@krava Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 8 ++--- arch/x86/events/intel/core.c | 69 +++++++++++++++++++++++++++--------- 2 files changed, 56 insertions(+), 21 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 817b53d2ce2e..0fc49dda85a3 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1629,9 +1629,9 @@ __init struct attribute **merge_attr(struct attribute **a, struct attribute **b) struct attribute **new; int j, i;
- for (j = 0; a[j]; j++) + for (j = 0; a && a[j]; j++) ; - for (i = 0; b[i]; i++) + for (i = 0; b && b[i]; i++) j++; j++;
@@ -1640,9 +1640,9 @@ __init struct attribute **merge_attr(struct attribute **a, struct attribute **b) return NULL;
j = 0; - for (i = 0; a[i]; i++) + for (i = 0; a && a[i]; i++) new[j++] = a[i]; - for (i = 0; b[i]; i++) + for (i = 0; b && b[i]; i++) new[j++] = b[i]; new[j] = NULL;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index c74e675e432c..6815e62beaab 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -271,7 +271,7 @@ EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3"); EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3"); EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");
-static struct attribute *nhm_events_attrs[] = { +static struct attribute *nhm_mem_events_attrs[] = { EVENT_PTR(mem_ld_nhm), NULL, }; @@ -307,8 +307,6 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale, "4", "2");
static struct attribute *snb_events_attrs[] = { - EVENT_PTR(mem_ld_snb), - EVENT_PTR(mem_st_snb), EVENT_PTR(td_slots_issued), EVENT_PTR(td_slots_retired), EVENT_PTR(td_fetch_bubbles), @@ -319,6 +317,12 @@ static struct attribute *snb_events_attrs[] = { NULL, };
+static struct attribute *snb_mem_events_attrs[] = { + EVENT_PTR(mem_ld_snb), + EVENT_PTR(mem_st_snb), + NULL, +}; + static struct event_constraint intel_hsw_event_constraints[] = { FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */ FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */ @@ -4026,8 +4030,6 @@ EVENT_ATTR_STR(cycles-t, cycles_t, "event=0x3c,in_tx=1"); EVENT_ATTR_STR(cycles-ct, cycles_ct, "event=0x3c,in_tx=1,in_tx_cp=1");
static struct attribute *hsw_events_attrs[] = { - EVENT_PTR(mem_ld_hsw), - EVENT_PTR(mem_st_hsw), EVENT_PTR(td_slots_issued), EVENT_PTR(td_slots_retired), EVENT_PTR(td_fetch_bubbles), @@ -4038,6 +4040,12 @@ static struct attribute *hsw_events_attrs[] = { NULL };
+static struct attribute *hsw_mem_events_attrs[] = { + EVENT_PTR(mem_ld_hsw), + EVENT_PTR(mem_st_hsw), + NULL, +}; + static struct attribute *hsw_tsx_events_attrs[] = { EVENT_PTR(tx_start), EVENT_PTR(tx_commit), @@ -4090,13 +4098,6 @@ static __init struct attribute **get_icl_events_attrs(void) icl_events_attrs; }
-static __init struct attribute **get_hsw_events_attrs(void) -{ - return boot_cpu_has(X86_FEATURE_RTM) ? - merge_attr(hsw_events_attrs, hsw_tsx_events_attrs) : - hsw_events_attrs; -} - static ssize_t freeze_on_smi_show(struct device *cdev, struct device_attribute *attr, char *buf) @@ -4176,9 +4177,32 @@ static struct attribute *intel_pmu_attrs[] = { NULL, };
+static __init struct attribute ** +get_events_attrs(struct attribute **base, + struct attribute **mem, + struct attribute **tsx) +{ + struct attribute **attrs = base; + struct attribute **old; + + if (mem && x86_pmu.pebs) + attrs = merge_attr(attrs, mem); + + if (tsx && boot_cpu_has(X86_FEATURE_RTM)) { + old = attrs; + attrs = merge_attr(attrs, tsx); + if (old != base) + kfree(old); + } + + return attrs; +} + __init int intel_pmu_init(void) { struct attribute **extra_attr = NULL; + struct attribute **mem_attr = NULL; + struct attribute **tsx_attr = NULL; struct attribute **to_free = NULL; union cpuid10_edx edx; union cpuid10_eax eax; @@ -4289,7 +4313,7 @@ __init int intel_pmu_init(void) x86_pmu.extra_regs = intel_nehalem_extra_regs; x86_pmu.limit_period = nhm_limit_period;
- x86_pmu.cpu_events = nhm_events_attrs; + mem_attr = nhm_mem_events_attrs;
/* UOPS_ISSUED.STALLED_CYCLES */ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = @@ -4443,7 +4467,7 @@ __init int intel_pmu_init(void) x86_pmu.extra_regs = intel_westmere_extra_regs; x86_pmu.flags |= PMU_FL_HAS_RSP_1;
- x86_pmu.cpu_events = nhm_events_attrs; + mem_attr = nhm_mem_events_attrs;
/* UOPS_ISSUED.STALLED_CYCLES */ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = @@ -4483,6 +4507,7 @@ __init int intel_pmu_init(void) x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
x86_pmu.cpu_events = snb_events_attrs; + mem_attr = snb_mem_events_attrs;
/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = @@ -4523,6 +4548,7 @@ __init int intel_pmu_init(void) x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
x86_pmu.cpu_events = snb_events_attrs; + mem_attr = snb_mem_events_attrs;
/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */ intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = @@ -4557,10 +4583,12 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config; x86_pmu.get_event_constraints = hsw_get_event_constraints; - x86_pmu.cpu_events = get_hsw_events_attrs(); + x86_pmu.cpu_events = hsw_events_attrs; x86_pmu.lbr_double_abort = true; extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; + mem_attr = hsw_mem_events_attrs; + tsx_attr = hsw_tsx_events_attrs; pr_cont("Haswell events, "); name = "haswell"; break; @@ -4596,10 +4624,12 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config; x86_pmu.get_event_constraints = hsw_get_event_constraints; - x86_pmu.cpu_events = get_hsw_events_attrs(); + x86_pmu.cpu_events = hsw_events_attrs; x86_pmu.limit_period = bdw_limit_period; extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; + mem_attr = hsw_mem_events_attrs; + tsx_attr = hsw_tsx_events_attrs; pr_cont("Broadwell events, "); name = "broadwell"; break; @@ -4656,7 +4686,9 @@ __init int intel_pmu_init(void) hsw_format_attr : nhm_format_attr; extra_attr = merge_attr(extra_attr, skl_format_attr); to_free = extra_attr; - x86_pmu.cpu_events = get_hsw_events_attrs(); + x86_pmu.cpu_events = hsw_events_attrs; + mem_attr = hsw_mem_events_attrs; + tsx_attr = hsw_tsx_events_attrs; intel_pmu_pebs_data_source_skl(pmem);
if (boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) { @@ -4728,6 +4760,9 @@ __init int intel_pmu_init(void) WARN_ON(!x86_pmu.format_attrs); }
+ x86_pmu.cpu_events = get_events_attrs(x86_pmu.cpu_events, + mem_attr, tsx_attr); + if (x86_pmu.num_counters > INTEL_PMC_MAX_GENERIC) { WARN(1, KERN_ERR "hw perf events %d > max(%d), clipping!", x86_pmu.num_counters, INTEL_PMC_MAX_GENERIC);
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit aac1f7f95f115d5a5329be05b80022e72df7d080 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit aac1f7f95f115d5a5329be05b80022e72df7d080 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Adding sysfs_update_groups function to update multiple groups.
sysfs_update_groups - given a directory kobject, create a bunch of attribute groups @kobj: The kobject to update the group on @groups: The attribute groups to update, NULL terminated
This function update a bunch of attribute groups. If an error occurs when updating a group, all previously updated groups will be removed together with already existing (not updated) attributes.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-2-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/sysfs/group.c | 54 +++++++++++++++++++++++++++++++------------ include/linux/sysfs.h | 8 +++++++ 2 files changed, 47 insertions(+), 15 deletions(-)
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c index 57038604d4a8..d41c21fef138 100644 --- a/fs/sysfs/group.c +++ b/fs/sysfs/group.c @@ -175,6 +175,26 @@ int sysfs_create_group(struct kobject *kobj, } EXPORT_SYMBOL_GPL(sysfs_create_group);
+static int internal_create_groups(struct kobject *kobj, int update, + const struct attribute_group **groups) +{ + int error = 0; + int i; + + if (!groups) + return 0; + + for (i = 0; groups[i]; i++) { + error = internal_create_group(kobj, update, groups[i]); + if (error) { + while (--i >= 0) + sysfs_remove_group(kobj, groups[i]); + break; + } + } + return error; +} + /** * sysfs_create_groups - given a directory kobject, create a bunch of attribute groups * @kobj: The kobject to create the group on @@ -191,24 +211,28 @@ EXPORT_SYMBOL_GPL(sysfs_create_group); int sysfs_create_groups(struct kobject *kobj, const struct attribute_group **groups) { - int error = 0; - int i; - - if (!groups) - return 0; - - for (i = 0; groups[i]; i++) { - error = sysfs_create_group(kobj, groups[i]); - if (error) { - while (--i >= 0) - sysfs_remove_group(kobj, groups[i]); - break; - } - } - return error; + return internal_create_groups(kobj, 0, groups); } EXPORT_SYMBOL_GPL(sysfs_create_groups);
+/** + * sysfs_update_groups - given a directory kobject, create a bunch of attribute groups + * @kobj: The kobject to update the group on + * @groups: The attribute groups to update, NULL terminated + * + * This function update a bunch of attribute groups. If an error occurs when + * updating a group, all previously updated groups will be removed together + * with already existing (not updated) attributes. + * + * Returns 0 on success or error code from sysfs_update_group on failure. + */ +int sysfs_update_groups(struct kobject *kobj, + const struct attribute_group **groups) +{ + return internal_create_groups(kobj, 1, groups); +} +EXPORT_SYMBOL_GPL(sysfs_update_groups); + /** * sysfs_update_group - given a directory kobject, update an attribute group * @kobj: The kobject to update the group on diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index 1cd7bad56075..9b6fff250c23 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -268,6 +268,8 @@ int __must_check sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp); int __must_check sysfs_create_groups(struct kobject *kobj, const struct attribute_group **groups); +int __must_check sysfs_update_groups(struct kobject *kobj, + const struct attribute_group **groups); int sysfs_update_group(struct kobject *kobj, const struct attribute_group *grp); void sysfs_remove_group(struct kobject *kobj, @@ -438,6 +440,12 @@ static inline int sysfs_create_groups(struct kobject *kobj, return 0; }
+static inline int sysfs_update_groups(struct kobject *kobj, + const struct attribute_group **groups) +{ + return 0; +} + static inline int sysfs_update_group(struct kobject *kobj, const struct attribute_group *grp) {
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit f3a3a8257e5a1a5e67cbb1afdbc4c1c6a26f1b22 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit f3a3a8257e5a1a5e67cbb1afdbc4c1c6a26f1b22 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Adding attr_update attribute group into pmu, to allow having multiple attribute groups for same group name.
This will allow us to update "events" or "format" directories with attributes that depend on various HW conditions.
For example having group_format_extra group that updates "format" directory only if pmu version is 2 and higher:
static umode_t exra_is_visible(struct kobject *kobj, struct attribute *attr, int i) { return x86_pmu.version >= 2 ? attr->mode : 0; }
static struct attribute_group group_format_extra = { .name = "format", .is_visible = exra_is_visible, };
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-3-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/perf_event.h | 1 + kernel/events/core.c | 6 ++++++ 2 files changed, 7 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f6174c8a8c8a..f78841df7a46 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -255,6 +255,7 @@ struct pmu { struct module *module; struct device *dev; const struct attribute_group **attr_groups; + const struct attribute_group **attr_update; const char *name; int type;
diff --git a/kernel/events/core.c b/kernel/events/core.c index d7aa8971b278..5d60d0f1c940 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9691,6 +9691,12 @@ static int pmu_dev_alloc(struct pmu *pmu) if (ret) goto del_dev;
+ if (pmu->attr_update) + ret = sysfs_update_groups(&pmu->dev->kobj, pmu->attr_update); + + if (ret) + goto del_dev; + out: return ret;
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit 21b0dbc5e8b050e40a93a1f8cdef277502a4fc90 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 21b0dbc5e8b050e40a93a1f8cdef277502a4fc90 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Nobody is using that.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-4-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 3 --- arch/x86/events/perf_event.h | 1 - 2 files changed, 4 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 0fc49dda85a3..35dd1714ad51 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1823,9 +1823,6 @@ static int __init init_hw_perf_events(void) x86_pmu_caps_group.attrs = tmp; }
- if (x86_pmu.event_attrs) - x86_pmu_events_group.attrs = x86_pmu.event_attrs; - if (!x86_pmu.events_sysfs_show) x86_pmu_events_group.attrs = &empty_attrs; else diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index bfd6f58d3a35..a32630defbbc 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -616,7 +616,6 @@ struct x86_pmu { int attr_rdpmc_broken; int attr_rdpmc; struct attribute **format_attrs; - struct attribute **event_attrs; struct attribute **caps_attrs;
ssize_t (*events_sysfs_show)(char *page, u64 config);
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.3-rc1 commit baa0c83363c7aafb04734acf4ac252be8e13bd88 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit baa0c83363c7aafb04734acf4ac252be8e13bd88 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Using the new pmu::update_attrs attribute group to create detected events for x86_pmu.
Moving the topdown/memory/tsx attributes to separate attribute groups with specific is_visible functions.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-5-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 10 +---- arch/x86/events/intel/core.c | 86 ++++++++++++++++++++---------------- arch/x86/events/perf_event.h | 2 +- 3 files changed, 52 insertions(+), 46 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 35dd1714ad51..d56c9b087b1a 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1828,14 +1828,6 @@ static int __init init_hw_perf_events(void) else filter_events(x86_pmu_events_group.attrs);
- if (x86_pmu.cpu_events) { - struct attribute **tmp; - - tmp = merge_attr(x86_pmu_events_group.attrs, x86_pmu.cpu_events); - if (!WARN_ON(!tmp)) - x86_pmu_events_group.attrs = tmp; - } - if (x86_pmu.attrs) { struct attribute **tmp;
@@ -1844,6 +1836,8 @@ static int __init init_hw_perf_events(void) x86_pmu_attr_group.attrs = tmp; }
+ pmu.attr_update = x86_pmu.attr_update; + pr_info("... version: %d\n", x86_pmu.version); pr_info("... bit width: %d\n", x86_pmu.cntval_bits); pr_info("... generic registers: %d\n", x86_pmu.num_counters); diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 6815e62beaab..07e0f0a487b8 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4091,13 +4091,6 @@ static struct attribute *icl_tsx_events_attrs[] = { NULL, };
-static __init struct attribute **get_icl_events_attrs(void) -{ - return boot_cpu_has(X86_FEATURE_RTM) ? - merge_attr(icl_events_attrs, icl_tsx_events_attrs) : - icl_events_attrs; -} - static ssize_t freeze_on_smi_show(struct device *cdev, struct device_attribute *attr, char *buf) @@ -4177,32 +4170,47 @@ static struct attribute *intel_pmu_attrs[] = { NULL, };
-static __init struct attribute ** -get_events_attrs(struct attribute **base, - struct attribute **mem, - struct attribute **tsx) +static umode_t +tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i) { - struct attribute **attrs = base; - struct attribute **old; + return boot_cpu_has(X86_FEATURE_RTM) ? attr->mode : 0; +}
- if (mem && x86_pmu.pebs) - attrs = merge_attr(attrs, mem); +static umode_t +pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i) +{ + return x86_pmu.pebs ? attr->mode : 0; +}
- if (tsx && boot_cpu_has(X86_FEATURE_RTM)) { - old = attrs; - attrs = merge_attr(attrs, tsx); - if (old != base) - kfree(old); - } +static struct attribute_group group_events_td = { + .name = "events", +};
- return attrs; -} +static struct attribute_group group_events_mem = { + .name = "events", + .is_visible = pebs_is_visible, +}; + +static struct attribute_group group_events_tsx = { + .name = "events", + .is_visible = tsx_is_visible, +}; + +static const struct attribute_group *attr_update[] = { + &group_events_td, + &group_events_mem, + &group_events_tsx, + NULL, +}; + +static struct attribute *empty_attrs;
__init int intel_pmu_init(void) { - struct attribute **extra_attr = NULL; - struct attribute **mem_attr = NULL; - struct attribute **tsx_attr = NULL; + struct attribute **extra_attr = &empty_attrs; + struct attribute **td_attr = &empty_attrs; + struct attribute **mem_attr = &empty_attrs; + struct attribute **tsx_attr = &empty_attrs; struct attribute **to_free = NULL; union cpuid10_edx edx; union cpuid10_eax eax; @@ -4364,7 +4372,7 @@ __init int intel_pmu_init(void) x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints; x86_pmu.extra_regs = intel_slm_extra_regs; x86_pmu.flags |= PMU_FL_HAS_RSP_1; - x86_pmu.cpu_events = slm_events_attrs; + td_attr = slm_events_attrs; extra_attr = slm_format_attr; pr_cont("Silvermont events, "); name = "silvermont"; @@ -4391,7 +4399,7 @@ __init int intel_pmu_init(void) x86_pmu.pebs_prec_dist = true; x86_pmu.lbr_pt_coexist = true; x86_pmu.flags |= PMU_FL_HAS_RSP_1; - x86_pmu.cpu_events = glm_events_attrs; + td_attr = glm_events_attrs; extra_attr = slm_format_attr; pr_cont("Goldmont events, "); name = "goldmont"; @@ -4417,7 +4425,7 @@ __init int intel_pmu_init(void) x86_pmu.flags |= PMU_FL_HAS_RSP_1; x86_pmu.flags |= PMU_FL_PEBS_ALL; x86_pmu.get_event_constraints = glp_get_event_constraints; - x86_pmu.cpu_events = glm_events_attrs; + td_attr = glm_events_attrs; /* Goldmont Plus has 4-wide pipeline */ event_attr_td_total_slots_scale_glm.event_str = "4"; extra_attr = slm_format_attr; @@ -4506,7 +4514,7 @@ __init int intel_pmu_init(void) x86_pmu.flags |= PMU_FL_HAS_RSP_1; x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
- x86_pmu.cpu_events = snb_events_attrs; + td_attr = snb_events_attrs; mem_attr = snb_mem_events_attrs;
/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */ @@ -4547,7 +4555,7 @@ __init int intel_pmu_init(void) x86_pmu.flags |= PMU_FL_HAS_RSP_1; x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
- x86_pmu.cpu_events = snb_events_attrs; + td_attr = snb_events_attrs; mem_attr = snb_mem_events_attrs;
/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */ @@ -4583,10 +4591,10 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config; x86_pmu.get_event_constraints = hsw_get_event_constraints; - x86_pmu.cpu_events = hsw_events_attrs; x86_pmu.lbr_double_abort = true; extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; + td_attr = hsw_events_attrs; mem_attr = hsw_mem_events_attrs; tsx_attr = hsw_tsx_events_attrs; pr_cont("Haswell events, "); @@ -4624,10 +4632,10 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config; x86_pmu.get_event_constraints = hsw_get_event_constraints; - x86_pmu.cpu_events = hsw_events_attrs; x86_pmu.limit_period = bdw_limit_period; extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; + td_attr = hsw_events_attrs; mem_attr = hsw_mem_events_attrs; tsx_attr = hsw_tsx_events_attrs; pr_cont("Broadwell events, "); @@ -4686,7 +4694,7 @@ __init int intel_pmu_init(void) hsw_format_attr : nhm_format_attr; extra_attr = merge_attr(extra_attr, skl_format_attr); to_free = extra_attr; - x86_pmu.cpu_events = hsw_events_attrs; + td_attr = hsw_events_attrs; mem_attr = hsw_mem_events_attrs; tsx_attr = hsw_tsx_events_attrs; intel_pmu_pebs_data_source_skl(pmem); @@ -4727,7 +4735,8 @@ __init int intel_pmu_init(void) extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; extra_attr = merge_attr(extra_attr, skl_format_attr); - x86_pmu.cpu_events = get_icl_events_attrs(); + mem_attr = icl_events_attrs; + tsx_attr = icl_tsx_events_attrs; x86_pmu.lbr_pt_coexist = true; intel_pmu_pebs_data_source_skl(pmem); pr_cont("Icelake events, "); @@ -4760,8 +4769,11 @@ __init int intel_pmu_init(void) WARN_ON(!x86_pmu.format_attrs); }
- x86_pmu.cpu_events = get_events_attrs(x86_pmu.cpu_events, - mem_attr, tsx_attr); + group_events_td.attrs = td_attr; + group_events_mem.attrs = mem_attr; + group_events_tsx.attrs = tsx_attr; + + x86_pmu.attr_update = attr_update;
if (x86_pmu.num_counters > INTEL_PMC_MAX_GENERIC) { WARN(1, KERN_ERR "hw perf events %d > max(%d), clipping!", diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index a32630defbbc..a1defdcfac7c 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -619,7 +619,7 @@ struct x86_pmu { struct attribute **caps_attrs;
ssize_t (*events_sysfs_show)(char *page, u64 config); - struct attribute **cpu_events; + const struct attribute_group **attr_update;
unsigned long attr_freeze_on_smi; struct attribute **attrs;
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit 3d5672735b2348f5b13679a27f90c0847d22125d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3d5672735b2348f5b13679a27f90c0847d22125d upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
We dont need to pre-filter out unsupported base events, we can just use its group's is_visible function to do this.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-6-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 53 ++++++++++++------------------------------ 1 file changed, 15 insertions(+), 38 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index d56c9b087b1a..b64cb3b373b5 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1587,42 +1587,6 @@ static struct attribute_group x86_pmu_format_group = { .attrs = NULL, };
-/* - * Remove all undefined events (x86_pmu.event_map(id) == 0) - * out of events_attr attributes. - */ -static void __init filter_events(struct attribute **attrs) -{ - struct device_attribute *d; - struct perf_pmu_events_attr *pmu_attr; - int offset = 0; - int i, j; - - for (i = 0; attrs[i]; i++) { - d = (struct device_attribute *)attrs[i]; - pmu_attr = container_of(d, struct perf_pmu_events_attr, attr); - /* str trumps id */ - if (pmu_attr->event_str) - continue; - if (x86_pmu.event_map(i + offset)) - continue; - - for (j = i; attrs[j]; j++) - attrs[j] = attrs[j + 1]; - - /* Check the shifted attr. */ - i--; - - /* - * event_map() is index based, the attrs array is organized - * by increasing event index. If we shift the events, then - * we need to compensate for the event_map(), otherwise - * we are looking up the wrong event in the map - */ - offset++; - } -} - /* Merge two pointer arrays */ __init struct attribute **merge_attr(struct attribute **a, struct attribute **b) { @@ -1713,9 +1677,24 @@ static struct attribute *events_attr[] = { NULL, };
+/* + * Remove all undefined events (x86_pmu.event_map(id) == 0) + * out of events_attr attributes. + */ +static umode_t +is_visible(struct kobject *kobj, struct attribute *attr, int idx) +{ + struct perf_pmu_events_attr *pmu_attr; + + pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr.attr); + /* str trumps id */ + return pmu_attr->event_str || x86_pmu.event_map(idx) ? attr->mode : 0; +} + static struct attribute_group x86_pmu_events_group = { .name = "events", .attrs = events_attr, + .is_visible = is_visible, };
ssize_t x86_event_sysfs_show(char *page, u64 config, u64 event) @@ -1825,8 +1804,6 @@ static int __init init_hw_perf_events(void)
if (!x86_pmu.events_sysfs_show) x86_pmu_events_group.attrs = &empty_attrs; - else - filter_events(x86_pmu_events_group.attrs);
if (x86_pmu.attrs) { struct attribute **tmp;
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit 1f157286829c78c0bd8e495951a5c098d88e3d1a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 1f157286829c78c0bd8e495951a5c098d88e3d1a upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Using the new pmu::update_attrs attribute group for "caps" directory.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-7-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 8 -------- arch/x86/events/intel/core.c | 25 ++++++++++++++++++++----- arch/x86/events/perf_event.h | 1 - 3 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index b64cb3b373b5..cd83e72be297 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1794,14 +1794,6 @@ static int __init init_hw_perf_events(void)
x86_pmu_format_group.attrs = x86_pmu.format_attrs;
- if (x86_pmu.caps_attrs) { - struct attribute **tmp; - - tmp = merge_attr(x86_pmu_caps_group.attrs, x86_pmu.caps_attrs); - if (!WARN_ON(!tmp)) - x86_pmu_caps_group.attrs = tmp; - } - if (!x86_pmu.events_sysfs_show) x86_pmu_events_group.attrs = &empty_attrs;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 07e0f0a487b8..fec415d7ffbb 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4182,6 +4182,12 @@ pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i) return x86_pmu.pebs ? attr->mode : 0; }
+static umode_t +lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i) +{ + return x86_pmu.lbr_nr ? attr->mode : 0; +} + static struct attribute_group group_events_td = { .name = "events", }; @@ -4196,10 +4202,23 @@ static struct attribute_group group_events_tsx = { .is_visible = tsx_is_visible, };
+static struct attribute_group group_caps_gen = { + .name = "caps", + .attrs = intel_pmu_caps_attrs, +}; + +static struct attribute_group group_caps_lbr = { + .name = "caps", + .attrs = lbr_attrs, + .is_visible = lbr_is_visible, +}; + static const struct attribute_group *attr_update[] = { &group_events_td, &group_events_mem, &group_events_tsx, + &group_caps_gen, + &group_caps_lbr, NULL, };
@@ -4821,12 +4840,8 @@ __init int intel_pmu_init(void) x86_pmu.lbr_nr = 0; }
- x86_pmu.caps_attrs = intel_pmu_caps_attrs; - - if (x86_pmu.lbr_nr) { - x86_pmu.caps_attrs = merge_attr(x86_pmu.caps_attrs, lbr_attrs); + if (x86_pmu.lbr_nr) pr_cont("%d-deep LBR, ", x86_pmu.lbr_nr); - }
/* * Access extra MSR may cause #GP under certain circumstances. diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index a1defdcfac7c..752b0b636b7a 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -616,7 +616,6 @@ struct x86_pmu { int attr_rdpmc_broken; int attr_rdpmc; struct attribute **format_attrs; - struct attribute **caps_attrs;
ssize_t (*events_sysfs_show)(char *page, u64 config); const struct attribute_group **attr_update;
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit 3ea40ac77261530b2c96734b99c0c9f1dc1d729d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3ea40ac77261530b2c96734b99c0c9f1dc1d729d upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Using the new pmu::update_attrs attribute group for extra "format" directory.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-8-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index fec415d7ffbb..63f5877fb6a9 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4188,6 +4188,12 @@ lbr_is_visible(struct kobject *kobj, struct attribute *attr, int i) return x86_pmu.lbr_nr ? attr->mode : 0; }
+static umode_t +exra_is_visible(struct kobject *kobj, struct attribute *attr, int i) +{ + return x86_pmu.version >= 2 ? attr->mode : 0; +} + static struct attribute_group group_events_td = { .name = "events", }; @@ -4213,12 +4219,18 @@ static struct attribute_group group_caps_lbr = { .is_visible = lbr_is_visible, };
+static struct attribute_group group_format_extra = { + .name = "format", + .is_visible = exra_is_visible, +}; + static const struct attribute_group *attr_update[] = { &group_events_td, &group_events_mem, &group_events_tsx, &group_caps_gen, &group_caps_lbr, + &group_format_extra, NULL, };
@@ -4782,15 +4794,10 @@ __init int intel_pmu_init(void)
snprintf(pmu_name_str, sizeof pmu_name_str, "%s", name);
- if (version >= 2 && extra_attr) { - x86_pmu.format_attrs = merge_attr(intel_arch3_formats_attr, - extra_attr); - WARN_ON(!x86_pmu.format_attrs); - } - group_events_td.attrs = td_attr; group_events_mem.attrs = mem_attr; group_events_tsx.attrs = tsx_attr; + group_format_extra.attrs = extra_attr;
x86_pmu.attr_update = attr_update;
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.3-rc1 commit b657688069a24c3c81b6de22e0e57e1785d9211f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit b657688069a24c3c81b6de22e0e57e1785d9211f upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Using the new pmu::update_attrs attribute group for skylake specific format attributes.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-9-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 63f5877fb6a9..78852be4e05f 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4224,6 +4224,11 @@ static struct attribute_group group_format_extra = { .is_visible = exra_is_visible, };
+static struct attribute_group group_format_extra_skl = { + .name = "format", + .is_visible = exra_is_visible, +}; + static const struct attribute_group *attr_update[] = { &group_events_td, &group_events_mem, @@ -4231,6 +4236,7 @@ static const struct attribute_group *attr_update[] = { &group_caps_gen, &group_caps_lbr, &group_format_extra, + &group_format_extra_skl, NULL, };
@@ -4238,11 +4244,11 @@ static struct attribute *empty_attrs;
__init int intel_pmu_init(void) { + struct attribute **extra_skl_attr = &empty_attrs; struct attribute **extra_attr = &empty_attrs; struct attribute **td_attr = &empty_attrs; struct attribute **mem_attr = &empty_attrs; struct attribute **tsx_attr = &empty_attrs; - struct attribute **to_free = NULL; union cpuid10_edx edx; union cpuid10_eax eax; union cpuid10_ebx ebx; @@ -4723,8 +4729,7 @@ __init int intel_pmu_init(void) x86_pmu.get_event_constraints = hsw_get_event_constraints; extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; - extra_attr = merge_attr(extra_attr, skl_format_attr); - to_free = extra_attr; + extra_skl_attr = skl_format_attr; td_attr = hsw_events_attrs; mem_attr = hsw_mem_events_attrs; tsx_attr = hsw_tsx_events_attrs; @@ -4765,7 +4770,7 @@ __init int intel_pmu_init(void) x86_pmu.get_event_constraints = icl_get_event_constraints; extra_attr = boot_cpu_has(X86_FEATURE_RTM) ? hsw_format_attr : nhm_format_attr; - extra_attr = merge_attr(extra_attr, skl_format_attr); + extra_skl_attr = skl_format_attr; mem_attr = icl_events_attrs; tsx_attr = icl_tsx_events_attrs; x86_pmu.lbr_pt_coexist = true; @@ -4798,6 +4803,7 @@ __init int intel_pmu_init(void) group_events_mem.attrs = mem_attr; group_events_tsx.attrs = tsx_attr; group_format_extra.attrs = extra_attr; + group_format_extra_skl.attrs = extra_skl_attr;
x86_pmu.attr_update = attr_update;
@@ -4871,7 +4877,6 @@ __init int intel_pmu_init(void) pr_cont("full-width counters, "); }
- kfree(to_free); return 0; }
From: Jiri Olsa jolsa@kernel.org
mainline inclusion from mainline-v5.3-rc1 commit 6a9f4efe78af6069a11946c64d3d4c86cb42046b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6a9f4efe78af6069a11946c64d3d4c86cb42046b upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Using the new pmu::update_attrs attribute group for default attributes - freeze_on_smi, allow_tsx_force_abort.
Signed-off-by: Jiri Olsa jolsa@kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Arnaldo Carvalho de Melo acme@kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20190512155518.21468-10-jolsa@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 34 ---------------------------------- arch/x86/events/intel/core.c | 9 +++++---- arch/x86/events/perf_event.h | 3 --- 3 files changed, 5 insertions(+), 41 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index cd83e72be297..dc0dab436aff 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1587,32 +1587,6 @@ static struct attribute_group x86_pmu_format_group = { .attrs = NULL, };
-/* Merge two pointer arrays */ -__init struct attribute **merge_attr(struct attribute **a, struct attribute **b) -{ - struct attribute **new; - int j, i; - - for (j = 0; a && a[j]; j++) - ; - for (i = 0; b && b[i]; i++) - j++; - j++; - - new = kmalloc_array(j, sizeof(struct attribute *), GFP_KERNEL); - if (!new) - return NULL; - - j = 0; - for (i = 0; a && a[i]; i++) - new[j++] = a[i]; - for (i = 0; b && b[i]; i++) - new[j++] = b[i]; - new[j] = NULL; - - return new; -} - ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr, char *page) { struct perf_pmu_events_attr *pmu_attr = \ @@ -1797,14 +1771,6 @@ static int __init init_hw_perf_events(void) if (!x86_pmu.events_sysfs_show) x86_pmu_events_group.attrs = &empty_attrs;
- if (x86_pmu.attrs) { - struct attribute **tmp; - - tmp = merge_attr(x86_pmu_attr_group.attrs, x86_pmu.attrs); - if (!WARN_ON(!tmp)) - x86_pmu_attr_group.attrs = tmp; - } - pmu.attr_update = x86_pmu.attr_update;
pr_info("... version: %d\n", x86_pmu.version); diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 78852be4e05f..0f73b1e21031 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3773,8 +3773,6 @@ static __initconst const struct x86_pmu core_pmu = { .check_period = intel_pmu_check_period, };
-static struct attribute *intel_pmu_attrs[]; - static __initconst const struct x86_pmu intel_pmu = { .name = "Intel", .handle_irq = intel_pmu_handle_irq, @@ -3806,8 +3804,6 @@ static __initconst const struct x86_pmu intel_pmu = { .format_attrs = intel_arch3_formats_attr, .events_sysfs_show = intel_event_sysfs_show,
- .attrs = intel_pmu_attrs, - .cpu_prepare = intel_pmu_cpu_prepare, .cpu_starting = intel_pmu_cpu_starting, .cpu_dying = intel_pmu_cpu_dying, @@ -4229,6 +4225,10 @@ static struct attribute_group group_format_extra_skl = { .is_visible = exra_is_visible, };
+static struct attribute_group group_default = { + .attrs = intel_pmu_attrs, +}; + static const struct attribute_group *attr_update[] = { &group_events_td, &group_events_mem, @@ -4237,6 +4237,7 @@ static const struct attribute_group *attr_update[] = { &group_caps_lbr, &group_format_extra, &group_format_extra_skl, + &group_default, NULL, };
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 752b0b636b7a..f53e89395620 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -621,7 +621,6 @@ struct x86_pmu { const struct attribute_group **attr_update;
unsigned long attr_freeze_on_smi; - struct attribute **attrs;
/* * CPU Hotplug hooks @@ -886,8 +885,6 @@ static inline void set_linear_ip(struct pt_regs *regs, unsigned long ip) ssize_t x86_event_sysfs_show(char *page, u64 config, u64 event); ssize_t intel_event_sysfs_show(char *page, u64 config);
-struct attribute **merge_attr(struct attribute **a, struct attribute **b); - ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr, char *page); ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.3-rc2 commit 3d0c3953601d250175c7684ec0d9df612061dae5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3d0c3953601d250175c7684ec0d9df612061dae5 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Sampling SLOTS event and ref-cycles event in a group on Icelake gives EINVAL.
SLOTS event is the event stands for the fixed counter 3, not fixed counter 2. Wrong mask was set to SLOTS event in intel_icl_pebs_event_constraints[].
Reported-by: Andi Kleen ak@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support") Link: https://lkml.kernel.org/r/20190723200429.8180-1-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/ds.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 5c57ea8dce6d..b0ba46b82832 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -851,7 +851,7 @@ struct event_constraint intel_skl_pebs_event_constraints[] = {
struct event_constraint intel_icl_pebs_event_constraints[] = { INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x100000000ULL), /* INST_RETIRED.PREC_DIST */ - INTEL_FLAGS_UEVENT_CONSTRAINT(0x0400, 0x400000000ULL), /* SLOTS */ + INTEL_FLAGS_UEVENT_CONSTRAINT(0x0400, 0x800000000ULL), /* SLOTS */
INTEL_PLD_CONSTRAINT(0x1cd, 0xff), /* MEM_TRANS_RETIRED.LOAD_LATENCY */ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x1d0, 0xf), /* MEM_INST_RETIRED.LOAD */
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.6-rc1 commit 471af006a747f1c535c8a8c6c0973c320fe01b22 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 471af006a747f1c535c8a8c6c0973c320fe01b22 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
AMD Family 17h processors and above gain support for Large Increment per Cycle events. Unfortunately there is no CPUID or equivalent bit that indicates whether the feature exists or not, so we continue to determine eligibility based on a CPU family number comparison.
For Large Increment per Cycle events, we add a f17h-and-compatibles get_event_constraints_f17h() that returns an even counter bitmask: Large Increment per Cycle events can only be placed on PMCs 0, 2, and 4 out of the currently available 0-5. The only currently public event that requires this feature to report valid counts is PMCx003 "Retired SSE/AVX Operations".
Note that the CPU family logic in amd_core_pmu_init() is changed so as to be able to selectively add initialization for features available in ranges of backward-compatible CPU families. This Large Increment per Cycle feature is expected to be retained in future families.
A side-effect of assigning a new get_constraints function for f17h disables calling the old (prior to f15h) amd_get_event_constraints implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors"), which is no longer necessary since those North Bridge event codes are obsoleted.
Also fix a spelling mistake whilst in the area (calulating -> calculating).
Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors") Signed-off-by: Kim Phillips kim.phillips@amd.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/amd/core.c | 89 ++++++++++++++++++++++++------------ arch/x86/events/perf_event.h | 2 + 2 files changed, 62 insertions(+), 29 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c index 1d3db3d22626..8df43d62ec0c 100644 --- a/arch/x86/events/amd/core.c +++ b/arch/x86/events/amd/core.c @@ -300,6 +300,25 @@ static inline int amd_pmu_addr_offset(int index, bool eventsel) return offset; }
+/* + * AMD64 events are detected based on their event codes. + */ +static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc) +{ + return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff); +} + +static inline bool amd_is_pair_event_code(struct hw_perf_event *hwc) +{ + if (!(x86_pmu.flags & PMU_FL_PAIR)) + return false; + + switch (amd_get_event_code(hwc)) { + case 0x003: return true; /* Retired SSE/AVX FLOPs */ + default: return false; + } +} + static int amd_core_hw_config(struct perf_event *event) { if (event->attr.exclude_host && event->attr.exclude_guest) @@ -318,14 +337,6 @@ static int amd_core_hw_config(struct perf_event *event) return 0; }
-/* - * AMD64 events are detected based on their event codes. - */ -static inline unsigned int amd_get_event_code(struct hw_perf_event *hwc) -{ - return ((hwc->config >> 24) & 0x0f00) | (hwc->config & 0x00ff); -} - static inline int amd_is_nb_event(struct hw_perf_event *hwc) { return (hwc->config & 0xe0) == 0xe0; @@ -863,6 +874,20 @@ amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, int idx, } }
+static struct event_constraint pair_constraint; + +static struct event_constraint * +amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx, + struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + if (amd_is_pair_event_code(hwc)) + return &pair_constraint; + + return &unconstrained; +} + static ssize_t amd_event_sysfs_show(char *page, u64 config) { u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) | @@ -906,33 +931,15 @@ static __initconst const struct x86_pmu amd_pmu = {
static int __init amd_core_pmu_init(void) { + u64 even_ctr_mask = 0ULL; + int i; + if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE)) return 0;
/* Avoid calulating the value each time in the NMI handler */ perf_nmi_window = msecs_to_jiffies(100);
- switch (boot_cpu_data.x86) { - case 0x15: - pr_cont("Fam15h "); - x86_pmu.get_event_constraints = amd_get_event_constraints_f15h; - break; - case 0x17: - pr_cont("Fam17h "); - /* - * In family 17h, there are no event constraints in the PMC hardware. - * We fallback to using default amd_get_event_constraints. - */ - break; - case 0x18: - pr_cont("Fam18h "); - /* Using default amd_get_event_constraints. */ - break; - default: - pr_err("core perfctr but no constraints; unknown hardware!\n"); - return -ENODEV; - } - /* * If core performance counter extensions exists, we must use * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also @@ -947,6 +954,30 @@ static int __init amd_core_pmu_init(void) */ x86_pmu.amd_nb_constraints = 0;
+ if (boot_cpu_data.x86 == 0x15) { + pr_cont("Fam15h "); + x86_pmu.get_event_constraints = amd_get_event_constraints_f15h; + } + if (boot_cpu_data.x86 >= 0x17) { + pr_cont("Fam17h+ "); + /* + * Family 17h and compatibles have constraints for Large + * Increment per Cycle events: they may only be assigned an + * even numbered counter that has a consecutive adjacent odd + * numbered counter following it. + */ + for (i = 0; i < x86_pmu.num_counters - 1; i += 2) + even_ctr_mask |= 1 << i; + + pair_constraint = (struct event_constraint) + __EVENT_CONSTRAINT(0, even_ctr_mask, 0, + x86_pmu.num_counters / 2, 0, + PERF_X86_EVENT_PAIR); + + x86_pmu.get_event_constraints = amd_get_event_constraints_f17h; + x86_pmu.flags |= PMU_FL_PAIR; + } + pr_cont("core perfctr, "); return 0; } diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index f53e89395620..0fa60c706b1e 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -77,6 +77,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode) #define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */ #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */ #define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */ +#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
struct amd_nb { @@ -725,6 +726,7 @@ do { \ #define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */ #define PMU_FL_PEBS_ALL 0x10 /* all events are valid PEBS events */ #define PMU_FL_TFA 0x20 /* deal with TSX force abort */ +#define PMU_FL_PAIR 0x40 /* merge counters for large incr. events */
#define EVENT_VAR(_id) event_attr_##_id #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
From: Kim Phillips kim.phillips@amd.com
mainline inclusion from mainline-v5.6-rc1 commit 5738891229a25e9e678122a843cbf0466a456d0c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Description of hardware operation ---------------------------------
The core AMD PMU has a 4-bit wide per-cycle increment for each performance monitor counter. That works for most events, but now with AMD Family 17h and above processors, some events can occur more than 15 times in a cycle. Those events are called "Large Increment per Cycle" events. In order to count these events, two adjacent h/w PMCs get their count signals merged to form 8 bits per cycle total. In addition, the PERF_CTR count registers are merged to be able to count up to 64 bits.
Normally, events like instructions retired, get programmed on a single counter like so:
PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
The next counter at MSRs 0xc0010202-3 remains unused, or can be used independently to count something else.
When counting Large Increment per Cycle events, such as FLOPs, however, we now have to reserve the next counter and program the PERF_CTL (config) register with the Merge event (0xFFF), like so:
PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # rd 64-bit cnt, wr lo 48b PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # wr hi 16-bits count
The count is widened from the normal 48-bits to 64 bits by having the second counter carry the higher 16 bits of the count in its lower 16 bits of its counter register.
The odd counter, e.g., PERF_CTL1, is programmed with the enabled Merge event before the even counter, PERF_CTL0.
The Large Increment feature is available starting with Family 17h. For more details, search any Family 17h PPR for the "Large Increment per Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this version:
https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_...
Description of software operation ---------------------------------
The following steps are taken in order to support reserving and enabling the extra counter for Large Increment per Cycle events:
1. In the main x86 scheduler, we reduce the number of available counters by the number of Large Increment per Cycle events being scheduled, tracked by a new cpuc variable 'n_pair' and a new amd_put_event_constraints_f17h(). This improves the counter scheduler success rate.
2. In perf_assign_events(), if a counter is assigned to a Large Increment event, we increment the current counter variable, so the counter used for the Merge event is removed from assignment consideration by upcoming event assignments.
3. In find_counter(), if a counter has been found for the Large Increment event, we set the next counter as used, to prevent other events from using it.
4. We perform steps 2 & 3 also in the x86 scheduler fastpath, i.e., we add Merge event accounting to the existing used_mask logic.
5. Finally, we add on the programming of Merge event to the neighbouring PMC counters in the counter enable/disable{_all} code paths.
Currently, software does not support a single PMU with mixed 48- and 64-bit counting, so Large increment event counts are limited to 48 bits. In set_period, we zero-out the upper 16 bits of the count, so the hardware doesn't copy them to the even counter's higher bits.
Simple invocation example showing counting 8 FLOPs per 256-bit/%ymm vaddps instruction executed in a loop 100 million times:
perf stat -e cpu/fp_ret_sse_avx_ops.all/,cpu/instructions/ <workload>
Performance counter stats for '<workload>':
800,000,000 cpu/fp_ret_sse_avx_ops.all/u 300,042,101 cpu/instructions/u
Prior to this patch, the reported SSE/AVX FLOPs retired count would be wrong.
[peterz: lots of renames and edits to the code]
Signed-off-by: Kim Phillips kim.phillips@amd.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/amd/core.c | 18 +++++++++ arch/x86/events/core.c | 74 ++++++++++++++++++++++++++++-------- arch/x86/events/perf_event.h | 18 +++++++++ 3 files changed, 95 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c index 8df43d62ec0c..a58b71bc197b 100644 --- a/arch/x86/events/amd/core.c +++ b/arch/x86/events/amd/core.c @@ -13,6 +13,10 @@ static DEFINE_PER_CPU(unsigned long, perf_nmi_tstamp); static unsigned long perf_nmi_window;
+/* AMD Event 0xFFF: Merge. Used with Large Increment per Cycle events */ +#define AMD_MERGE_EVENT ((0xFULL << 32) | 0xFFULL) +#define AMD_MERGE_EVENT_ENABLE (AMD_MERGE_EVENT | ARCH_PERFMON_EVENTSEL_ENABLE) + static __initconst const u64 amd_hw_cache_event_ids [PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] @@ -334,6 +338,9 @@ static int amd_core_hw_config(struct perf_event *event) else if (event->attr.exclude_guest) event->hw.config |= AMD64_EVENTSEL_HOSTONLY;
+ if ((x86_pmu.flags & PMU_FL_PAIR) && amd_is_pair_event_code(&event->hw)) + event->hw.flags |= PERF_X86_EVENT_PAIR; + return 0; }
@@ -888,6 +895,15 @@ amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx, return &unconstrained; }
+static void amd_put_event_constraints_f17h(struct cpu_hw_events *cpuc, + struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + if (is_counter_pair(hwc)) + --cpuc->n_pair; +} + static ssize_t amd_event_sysfs_show(char *page, u64 config) { u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) | @@ -975,6 +991,8 @@ static int __init amd_core_pmu_init(void) PERF_X86_EVENT_PAIR);
x86_pmu.get_event_constraints = amd_get_event_constraints_f17h; + x86_pmu.put_event_constraints = amd_put_event_constraints_f17h; + x86_pmu.perf_ctr_pair_en = AMD_MERGE_EVENT_ENABLE; x86_pmu.flags |= PMU_FL_PAIR; }
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index dc0dab436aff..e201ecae1414 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -617,6 +617,7 @@ void x86_pmu_disable_all(void) int idx;
for (idx = 0; idx < x86_pmu.num_counters; idx++) { + struct hw_perf_event *hwc = &cpuc->events[idx]->hw; u64 val;
if (!test_bit(idx, cpuc->active_mask)) @@ -626,6 +627,8 @@ void x86_pmu_disable_all(void) continue; val &= ~ARCH_PERFMON_EVENTSEL_ENABLE; wrmsrl(x86_pmu_config_addr(idx), val); + if (is_counter_pair(hwc)) + wrmsrl(x86_pmu_config_addr(idx + 1), 0); } }
@@ -699,7 +702,7 @@ struct sched_state { int counter; /* counter index */ int unassigned; /* number of events to be assigned left */ int nr_gp; /* number of GP counters used */ - unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; + u64 used; };
/* Total max is X86_PMC_IDX_MAX, but we are O(n!) limited */ @@ -756,8 +759,12 @@ static bool perf_sched_restore_state(struct perf_sched *sched) sched->saved_states--; sched->state = sched->saved[sched->saved_states];
- /* continue with next counter: */ - clear_bit(sched->state.counter++, sched->state.used); + /* this assignment didn't work out */ + /* XXX broken vs EVENT_PAIR */ + sched->state.used &= ~BIT_ULL(sched->state.counter); + + /* try the next one */ + sched->state.counter++;
return true; } @@ -782,20 +789,32 @@ static bool __perf_sched_find_counter(struct perf_sched *sched) if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) { idx = INTEL_PMC_IDX_FIXED; for_each_set_bit_from(idx, c->idxmsk, X86_PMC_IDX_MAX) { - if (!__test_and_set_bit(idx, sched->state.used)) - goto done; + u64 mask = BIT_ULL(idx); + + if (sched->state.used & mask) + continue; + + sched->state.used |= mask; + goto done; } }
/* Grab the first unused counter starting with idx */ idx = sched->state.counter; for_each_set_bit_from(idx, c->idxmsk, INTEL_PMC_IDX_FIXED) { - if (!__test_and_set_bit(idx, sched->state.used)) { - if (sched->state.nr_gp++ >= sched->max_gp) - return false; + u64 mask = BIT_ULL(idx);
- goto done; - } + if (c->flags & PERF_X86_EVENT_PAIR) + mask |= mask << 1; + + if (sched->state.used & mask) + continue; + + if (sched->state.nr_gp++ >= sched->max_gp) + return false; + + sched->state.used |= mask; + goto done; }
return false; @@ -872,12 +891,10 @@ EXPORT_SYMBOL_GPL(perf_assign_events); int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) { struct event_constraint *c; - unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; struct perf_event *e; int i, wmin, wmax, unsched = 0; struct hw_perf_event *hwc; - - bitmap_zero(used_mask, X86_PMC_IDX_MAX); + u64 used_mask = 0;
if (x86_pmu.start_scheduling) x86_pmu.start_scheduling(cpuc); @@ -895,6 +912,8 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) * fastpath, try to reuse previous register */ for (i = 0; i < n; i++) { + u64 mask; + hwc = &cpuc->event_list[i]->hw; c = cpuc->event_constraint[i];
@@ -906,11 +925,16 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) if (!test_bit(hwc->idx, c->idxmsk)) break;
+ mask = BIT_ULL(hwc->idx); + if (is_counter_pair(hwc)) + mask |= mask << 1; + /* not already used */ - if (test_bit(hwc->idx, used_mask)) + if (used_mask & mask) break;
- __set_bit(hwc->idx, used_mask); + used_mask |= mask; + if (assign) assign[i] = hwc->idx; } @@ -933,6 +957,15 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) READ_ONCE(cpuc->excl_cntrs->exclusive_present)) gpmax /= 2;
+ /* + * Reduce the amount of available counters to allow fitting + * the extra Merge events needed by large increment events. + */ + if (x86_pmu.flags & PMU_FL_PAIR) { + gpmax = x86_pmu.num_counters - cpuc->n_pair; + WARN_ON(gpmax <= 0); + } + unsched = perf_assign_events(cpuc->event_constraint, n, wmin, wmax, gpmax, assign); } @@ -997,6 +1030,8 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader, return -EINVAL; cpuc->event_list[n] = leader; n++; + if (is_counter_pair(&leader->hw)) + cpuc->n_pair++; } if (!dogrp) return n; @@ -1011,6 +1046,8 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
cpuc->event_list[n] = event; n++; + if (is_counter_pair(&event->hw)) + cpuc->n_pair++; } return n; } @@ -1175,6 +1212,13 @@ int x86_perf_event_set_period(struct perf_event *event)
wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ /* + * Clear the Merge event counter's upper 16 bits since + * we currently declare a 48-bit counter width + */ + if (is_counter_pair(hwc)) + wrmsrl(x86_pmu_event_addr(idx + 1), 0); + /* * Due to erratum on certan cpu we need * a second write to be sure the register diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 0fa60c706b1e..24ee113959f6 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -267,6 +267,7 @@ struct cpu_hw_events { struct amd_nb *amd_nb; /* Inverted mask of bits to clear in the perf_ctr ctrl registers */ u64 perf_ctr_virt_mask; + int n_pair; /* Large increment events */
void *kfree_on_online[X86_PERF_KFREE_MAX]; }; @@ -679,6 +680,7 @@ struct x86_pmu { * AMD bits */ unsigned int amd_nb_constraints : 1; + u64 perf_ctr_pair_en;
/* * Extra registers for events @@ -822,6 +824,11 @@ int x86_pmu_hw_config(struct perf_event *event);
void x86_pmu_disable_all(void);
+static inline bool is_counter_pair(struct hw_perf_event *hwc) +{ + return hwc->flags & PERF_X86_EVENT_PAIR; +} + static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc, u64 enable_mask) { @@ -829,6 +836,14 @@ static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
if (hwc->extra_reg.reg) wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config); + + /* + * Add enabled Merge event on next counter + * if large increment event being enabled on this counter + */ + if (is_counter_pair(hwc)) + wrmsrl(x86_pmu_config_addr(hwc->idx + 1), x86_pmu.perf_ctr_pair_en); + wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask); }
@@ -845,6 +860,9 @@ static inline void x86_pmu_disable_event(struct perf_event *event) struct hw_perf_event *hwc = &event->hw;
wrmsrl(hwc->config_base, hwc->config); + + if (is_counter_pair(hwc)) + wrmsrl(x86_pmu_config_addr(hwc->idx + 1), 0); }
void x86_pmu_enable_event(struct perf_event *event);
From: Wei Wang wei.w.wang@intel.com
mainline inclusion from mainline-v5.9-rc1 commit 3cb9d5464c1ceea86f6225089b2f7965989cf316 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3cb9d5464c1ceea86f6225089b2f7965989cf316 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The MSR variable type can be 'unsigned int', which uses less memory than the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't be a negative number, so make it 'unsigned int' as well.
Suggested-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Wei Wang wei.w.wang@intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200613080958.132489-2-like.xu@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/perf_event.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 24ee113959f6..21745fd662d6 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -664,8 +664,8 @@ struct x86_pmu { /* * Intel LBR */ - unsigned long lbr_tos, lbr_from, lbr_to; /* MSR base regs */ - int lbr_nr; /* hardware stack size */ + unsigned int lbr_tos, lbr_from, lbr_to, + lbr_nr; /* LBR base regs and size */ u64 lbr_sel_mask; /* LBR_SELECT valid bits */ const int *lbr_sel_map; /* lbr_select mappings */ bool lbr_double_abort; /* duplicated lbr aborts */
From: Like Xu like.xu@linux.intel.com
mainline inclusion from mainline-v5.9-rc1 commit 027440b5d426a51f33b515bbd236cc479d1e051f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
For intel_pmu_en/disable_event(), reorder the branches checks for hw->idx and make them sorted by probability: gp,fixed,bts,others.
Clean up the x86_assign_hw_event() by converting multiple if-else statements to a switch statement.
To skip x86_perf_event_update() and x86_perf_event_set_period(), it's generic to replace "idx == INTEL_PMC_IDX_FIXED_BTS" check with '!hwc->event_base' because that should be 0 for all non-gp/fixed cases.
Wrap related bit operations into intel_set/clear_masks() and make the main path more cleaner and readable.
No functional changes.
Signed-off-by: Like Xu like.xu@linux.intel.com Original-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200613080958.132489-3-like.xu@linux.intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 25 +++++++---- arch/x86/events/intel/core.c | 85 +++++++++++++++++++----------------- 2 files changed, 62 insertions(+), 48 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index e201ecae1414..25a1b9953c1d 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -70,10 +70,9 @@ u64 x86_perf_event_update(struct perf_event *event) struct hw_perf_event *hwc = &event->hw; int shift = 64 - x86_pmu.cntval_bits; u64 prev_raw_count, new_raw_count; - int idx = hwc->idx; u64 delta;
- if (idx == INTEL_PMC_IDX_FIXED_BTS) + if (unlikely(!hwc->event_base)) return 0;
/* @@ -1056,22 +1055,30 @@ static inline void x86_assign_hw_event(struct perf_event *event, struct cpu_hw_events *cpuc, int i) { struct hw_perf_event *hwc = &event->hw; + int idx;
- hwc->idx = cpuc->assign[i]; + idx = hwc->idx = cpuc->assign[i]; hwc->last_cpu = smp_processor_id(); hwc->last_tag = ++cpuc->tags[i];
- if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) { + switch (hwc->idx) { + case INTEL_PMC_IDX_FIXED_BTS: hwc->config_base = 0; hwc->event_base = 0; - } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) { + break; + + case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1: hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL; - hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED); - hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30; - } else { + hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + + (idx - INTEL_PMC_IDX_FIXED); + hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30; + break; + + default: hwc->config_base = x86_pmu_config_addr(hwc->idx); hwc->event_base = x86_pmu_event_addr(hwc->idx); hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx); + break; } }
@@ -1171,7 +1178,7 @@ int x86_perf_event_set_period(struct perf_event *event) s64 period = hwc->sample_period; int ret = 0, idx = hwc->idx;
- if (idx == INTEL_PMC_IDX_FIXED_BTS) + if (unlikely(!hwc->event_base)) return 0;
/* diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 0f73b1e21031..e5ef223a6093 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2114,8 +2114,35 @@ static inline void intel_pmu_ack_status(u64 ack) wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack); }
-static void intel_pmu_disable_fixed(struct hw_perf_event *hwc) +static inline bool event_is_checkpointed(struct perf_event *event) +{ + return unlikely(event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0; +} + +static inline void intel_set_masks(struct perf_event *event, int idx) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + + if (event->attr.exclude_host) + __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask); + if (event->attr.exclude_guest) + __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask); + if (event_is_checkpointed(event)) + __set_bit(idx, (unsigned long *)&cpuc->intel_cp_status); +} + +static inline void intel_clear_masks(struct perf_event *event, int idx) { + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + + __clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask); + __clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask); + __clear_bit(idx, (unsigned long *)&cpuc->intel_cp_status); +} + +static void intel_pmu_disable_fixed(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; int idx = hwc->idx - INTEL_PMC_IDX_FIXED; u64 ctrl_val, mask;
@@ -2126,31 +2153,22 @@ static void intel_pmu_disable_fixed(struct hw_perf_event *hwc) wrmsrl(hwc->config_base, ctrl_val); }
-static inline bool event_is_checkpointed(struct perf_event *event) -{ - return (event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0; -} - static void intel_pmu_disable_event(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + int idx = hwc->idx;
- if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) { + if (idx < INTEL_PMC_IDX_FIXED) { + intel_clear_masks(event, idx); + x86_pmu_disable_event(event); + } else if (idx < INTEL_PMC_IDX_FIXED_BTS) { + intel_clear_masks(event, idx); + intel_pmu_disable_fixed(event); + } else if (idx == INTEL_PMC_IDX_FIXED_BTS) { intel_pmu_disable_bts(); intel_pmu_drain_bts_buffer(); - return; }
- cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx); - cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx); - cpuc->intel_cp_status &= ~(1ull << hwc->idx); - - if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) - intel_pmu_disable_fixed(hwc); - else - x86_pmu_disable_event(event); - /* * Needs to be called after x86_pmu_disable_event, * so we don't trigger the event without PEBS bit set. @@ -2216,33 +2234,22 @@ static void intel_pmu_enable_fixed(struct perf_event *event) static void intel_pmu_enable_event(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; - struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); - - if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) { - if (!__this_cpu_read(cpu_hw_events.enabled)) - return; - - intel_pmu_enable_bts(hwc->config); - return; - } - - if (event->attr.exclude_host) - cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx); - if (event->attr.exclude_guest) - cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx); - - if (unlikely(event_is_checkpointed(event))) - cpuc->intel_cp_status |= (1ull << hwc->idx); + int idx = hwc->idx;
if (unlikely(event->attr.precise_ip)) intel_pmu_pebs_enable(event);
- if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) { + if (idx < INTEL_PMC_IDX_FIXED) { + intel_set_masks(event, idx); + __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE); + } else if (idx < INTEL_PMC_IDX_FIXED_BTS) { + intel_set_masks(event, idx); intel_pmu_enable_fixed(event); - return; + } else if (idx == INTEL_PMC_IDX_FIXED_BTS) { + if (!__this_cpu_read(cpu_hw_events.enabled)) + return; + intel_pmu_enable_bts(hwc->config); } - - __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE); }
static void intel_pmu_add_event(struct perf_event *event)
From: Like Xu like.xu@linux.intel.com
mainline inclusion from mainline-v5.9-rc1 commit b2d6504761a50b9493eb4b20f6e188b673f20c32 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit b2d6504761a50b9493eb4b20f6e188b673f20c32 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The LBR records msrs are model specific. The perf subsystem has already obtained the base addresses of LBR records based on the cpu model.
Therefore, an interface is added to allow callers outside the perf subsystem to obtain these LBR information. It's useful for hypervisors to emulate the LBR feature for guests with less code.
Signed-off-by: Like Xu like.xu@linux.intel.com Signed-off-by: Wei Wang wei.w.wang@intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200613080958.132489-4-like.xu@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/lbr.c | 20 ++++++++++++++++++++ arch/x86/include/asm/perf_event.h | 16 ++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index b8cac31f089c..15aa2fbdd60e 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1310,3 +1310,23 @@ void intel_pmu_lbr_init_knl(void) if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP) x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS; } + +/** + * x86_perf_get_lbr - get the LBR records information + * + * @lbr: the caller's memory to store the LBR records information + * + * Returns: 0 indicates the LBR info has been successfully obtained + */ +int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) +{ + int lbr_fmt = x86_pmu.intel_cap.lbr_format; + + lbr->nr = x86_pmu.lbr_nr; + lbr->from = x86_pmu.lbr_from; + lbr->to = x86_pmu.lbr_to; + lbr->info = (lbr_fmt == LBR_FORMAT_INFO) ? MSR_LBR_INFO_0 : 0; + + return 0; +} +EXPORT_SYMBOL_GPL(x86_perf_get_lbr); diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index b4085e08cc5e..b0174e83ddda 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -322,6 +322,13 @@ struct perf_guest_switch_msr { u64 host, guest; };
+struct x86_pmu_lbr { + unsigned int nr; + unsigned int from; + unsigned int to; + unsigned int info; +}; + extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr); extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap); extern void perf_check_microcode(void); @@ -341,6 +348,15 @@ static inline void perf_events_lapic_init(void) { } static inline void perf_check_microcode(void) { } #endif
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL) +extern int x86_perf_get_lbr(struct x86_pmu_lbr *lbr); +#else +static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) +{ + return -1; +} +#endif + #ifdef CONFIG_CPU_SUP_INTEL extern void intel_pt_handle_vmx(int on); #endif
From: Like Xu like.xu@linux.intel.com
mainline inclusion from mainline-v5.9-rc1 commit 097e4311cda952dfb047f2a49d35aa5de500d474 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 097e4311cda952dfb047f2a49d35aa5de500d474 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The hypervisor may request the perf subsystem to schedule a time window to directly access the LBR records msrs for its own use. Normally, it would create a guest LBR event with callstack mode enabled, which is scheduled along with other ordinary LBR events on the host but in an exclusive way.
To avoid wasting a counter for the guest LBR event, the perf tracks its hw->idx via INTEL_PMC_IDX_FIXED_VLBR and assigns it with a fake VLBR counter with the help of new vlbr_constraint. As with the BTS event, there is actually no hardware counter assigned for the guest LBR event.
Signed-off-by: Like Xu like.xu@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200514083054.62538-5-like.xu@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 1 + arch/x86/events/intel/core.c | 18 ++++++++++++++++++ arch/x86/events/intel/lbr.c | 4 ++++ arch/x86/events/perf_event.h | 1 + arch/x86/include/asm/perf_event.h | 22 +++++++++++++++++++++- 5 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 25a1b9953c1d..b4b3ecc9bbba 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1063,6 +1063,7 @@ static inline void x86_assign_hw_event(struct perf_event *event,
switch (hwc->idx) { case INTEL_PMC_IDX_FIXED_BTS: + case INTEL_PMC_IDX_FIXED_VLBR: hwc->config_base = 0; hwc->event_base = 0; break; diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index e5ef223a6093..ec4373143d16 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2494,6 +2494,20 @@ intel_bts_constraints(struct perf_event *event) return NULL; }
+/* + * Note: matches a fake event, like Fixed2. + */ +static struct event_constraint * +intel_vlbr_constraints(struct perf_event *event) +{ + struct event_constraint *c = &vlbr_constraint; + + if (unlikely(constraint_match(c, event->hw.config))) + return c; + + return NULL; +} + static int intel_alt_er(int idx, u64 config) { int alt_idx = idx; @@ -2684,6 +2698,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx, { struct event_constraint *c;
+ c = intel_vlbr_constraints(event); + if (c) + return c; + c = intel_bts_constraints(event); if (c) return c; diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 15aa2fbdd60e..eea75f1fc245 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1330,3 +1330,7 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) return 0; } EXPORT_SYMBOL_GPL(x86_perf_get_lbr); + +struct event_constraint vlbr_constraint = + FIXED_EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, + (INTEL_PMC_IDX_FIXED_VLBR - INTEL_PMC_IDX_FIXED)); diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 21745fd662d6..639a8a6ba05c 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -966,6 +966,7 @@ void release_ds_buffers(void); void reserve_ds_buffers(void);
extern struct event_constraint bts_constraint; +extern struct event_constraint vlbr_constraint;
void intel_pmu_enable_bts(u64 config);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index b0174e83ddda..c4729277d31d 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -181,9 +181,29 @@ struct x86_pmu_capability { #define GLOBAL_STATUS_UNC_OVF BIT_ULL(61) #define GLOBAL_STATUS_ASIF BIT_ULL(60) #define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59) -#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58) +#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58 +#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT) #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+/* + * We model guest LBR event tracing as another fixed-mode PMC like BTS. + * + * We choose bit 58 because it's used to indicate LBR stack frozen state + * for architectural perfmon v4, also we unconditionally mask that bit in + * the handle_pmi_common(), so it'll never be set in the overflow handling. + * + * With this fake counter assigned, the guest LBR event user (such as KVM), + * can program the LBR registers on its own, and we don't actually do anything + * with then in the host context. + */ +#define INTEL_PMC_IDX_FIXED_VLBR (GLOBAL_STATUS_LBRS_FROZEN_BIT) + +/* + * Pseudo-encoding the guest LBR event as event=0x00,umask=0x1b, + * since it would claim bit 58 which is effectively Fixed26. + */ +#define INTEL_FIXED_VLBR_EVENT 0x1b00 + /* * Adaptive PEBS v4 */
From: Like Xu like.xu@linux.intel.com
mainline inclusion from mainline-v5.9-rc1 commit e1ad1ac2deb8f90af9f12ff316989dd5675dec11 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e1ad1ac2deb8f90af9f12ff316989dd5675dec11 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
When a guest wants to use the LBR registers, its hypervisor creates a guest LBR event and let host perf schedules it. The LBR records msrs are accessible to the guest when its guest LBR event is scheduled on by the perf subsystem.
Before scheduling this event out, we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT. Otherwise, some unexpected branch operations may interfere with guest behavior, pollute LBR records, and even cause host branches leakage. In addition, the read operation on host is also avoidable.
To ensure that guest LBR records are not lost during the context switch, the guest LBR event would enable the callstack mode which could save/restore guest unread LBR records with the help of intel_pmu_lbr_sched_task() naturally.
However, the guest LBR_SELECT may changes for its own use and the host LBR event doesn't save/restore it. To ensure that we doesn't lost the guest LBR_SELECT value when the guest LBR event is running, the vlbr_constraint is bound up with a new constraint flag PERF_X86_EVENT_LBR_SELECT.
Signed-off-by: Like Xu like.xu@linux.intel.com Signed-off-by: Wei Wang wei.w.wang@intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200514083054.62538-6-like.xu@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 4 ++++ arch/x86/events/intel/lbr.c | 31 ++++++++++++++++++++++++++----- arch/x86/events/perf_event.h | 3 +++ 3 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index ec4373143d16..28359c6d9039 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2167,6 +2167,8 @@ static void intel_pmu_disable_event(struct perf_event *event) } else if (idx == INTEL_PMC_IDX_FIXED_BTS) { intel_pmu_disable_bts(); intel_pmu_drain_bts_buffer(); + } else if (idx == INTEL_PMC_IDX_FIXED_VLBR) { + intel_clear_masks(event, idx); }
/* @@ -2249,6 +2251,8 @@ static void intel_pmu_enable_event(struct perf_event *event) if (!__this_cpu_read(cpu_hw_events.enabled)) return; intel_pmu_enable_bts(hwc->config); + } else if (idx == INTEL_PMC_IDX_FIXED_VLBR) { + intel_set_masks(event, idx); } }
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index eea75f1fc245..a3dc3b5bc78d 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -383,6 +383,9 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
wrmsrl(x86_pmu.lbr_tos, tos); task_ctx->lbr_stack_state = LBR_NONE; + + if (cpuc->lbr_select) + wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel); }
static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx) @@ -415,6 +418,9 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
cpuc->last_task_ctx = task_ctx; cpuc->last_log_id = ++task_ctx->log_id; + + if (cpuc->lbr_select) + rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel); }
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in) @@ -462,6 +468,9 @@ void intel_pmu_lbr_add(struct perf_event *event) if (!x86_pmu.lbr_nr) return;
+ if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT) + cpuc->lbr_select = 1; + cpuc->br_sel = event->hw.branch_reg.reg;
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) { @@ -509,6 +518,9 @@ void intel_pmu_lbr_del(struct perf_event *event) task_ctx->lbr_callstack_users--; }
+ if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT) + cpuc->lbr_select = 0; + if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0) cpuc->lbr_pebs_users--; cpuc->lbr_users--; @@ -517,11 +529,19 @@ void intel_pmu_lbr_del(struct perf_event *event) perf_sched_cb_dec(event->ctx->pmu); }
+static inline bool vlbr_exclude_host(void) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + + return test_bit(INTEL_PMC_IDX_FIXED_VLBR, + (unsigned long *)&cpuc->intel_ctrl_guest_mask); +} + void intel_pmu_lbr_enable_all(bool pmi) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users) + if (cpuc->lbr_users && !vlbr_exclude_host()) __intel_pmu_lbr_enable(pmi); }
@@ -529,7 +549,7 @@ void intel_pmu_lbr_disable_all(void) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users) + if (cpuc->lbr_users && !vlbr_exclude_host()) __intel_pmu_lbr_disable(); }
@@ -669,7 +689,8 @@ void intel_pmu_lbr_read(void) * This could be smarter and actually check the event, * but this simple approach seems to work for now. */ - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users) + if (!cpuc->lbr_users || vlbr_exclude_host() || + cpuc->lbr_users == cpuc->lbr_pebs_users) return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) @@ -1332,5 +1353,5 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr) EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
struct event_constraint vlbr_constraint = - FIXED_EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, - (INTEL_PMC_IDX_FIXED_VLBR - INTEL_PMC_IDX_FIXED)); + __EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, (1ULL << INTEL_PMC_IDX_FIXED_VLBR), + FIXED_EVENT_FLAGS, 1, 0, PERF_X86_EVENT_LBR_SELECT); diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 639a8a6ba05c..697a776b839b 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -78,6 +78,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode) #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */ #define PERF_X86_EVENT_LARGE_PEBS 0x0800 /* use large PEBS */ #define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */ +#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
struct amd_nb { @@ -231,6 +232,7 @@ struct cpu_hw_events { u64 br_sel; struct x86_perf_task_context *last_task_ctx; int last_log_id; + int lbr_select;
/* * Intel host/guest exclude bits @@ -703,6 +705,7 @@ struct x86_perf_task_context { u64 lbr_from[MAX_LBR_ENTRIES]; u64 lbr_to[MAX_LBR_ENTRIES]; u64 lbr_info[MAX_LBR_ENTRIES]; + u64 lbr_sel; int tos; int valid_lbrs; int lbr_callstack_users;
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 75608cb02ea5dd997990e2998eca3670cb71a18c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 75608cb02ea5dd997990e2998eca3670cb71a18c upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The RDPMC index is always re-calculated for the RDPMC userspace support, which is unnecessary.
The RDPMC index value is stored in the variable event_base_rdpmc for the kernel usage, which can be used for RDPMC userspace support as well.
Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-2-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index b4b3ecc9bbba..60bca5389ed5 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2149,17 +2149,12 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
static int x86_pmu_event_idx(struct perf_event *event) { - int idx = event->hw.idx; + struct hw_perf_event *hwc = &event->hw;
- if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED)) + if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return 0;
- if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) { - idx -= INTEL_PMC_IDX_FIXED; - idx |= 1 << 30; - } - - return idx + 1; + return hwc->event_base_rdpmc + 1; }
static ssize_t get_attr_rdpmc(struct device *cdev,
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 60a2a271cf05cf046c522e1d7f62116b4bcb32a2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 60a2a271cf05cf046c522e1d7f62116b4bcb32a2 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Magic numbers are used in the current NMI handler for the global status bit. Use a meaningful name to replace the magic numbers to improve the readability of the code.
Remove a Tab for all GLOBAL_STATUS_* and INTEL_PMC_IDX_FIXED_BTS macros to reduce the length of the line.
Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-3-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 4 ++-- arch/x86/include/asm/perf_event.h | 22 ++++++++++++---------- 2 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 28359c6d9039..4f1845b59a08 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2369,7 +2369,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) /* * PEBS overflow sets bit 62 in the global status register */ - if (__test_and_clear_bit(62, (unsigned long *)&status)) { + if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long *)&status)) { handled++; x86_pmu.drain_pebs(regs); status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI; @@ -2378,7 +2378,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) /* * Intel PT */ - if (__test_and_clear_bit(55, (unsigned long *)&status)) { + if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) { handled++; intel_pt_interrupt(); } diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index c4729277d31d..65f6cfb37a7c 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -174,16 +174,18 @@ struct x86_pmu_capability { * values are used by actual fixed events and higher values are used * to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr. */ -#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16) - -#define GLOBAL_STATUS_COND_CHG BIT_ULL(63) -#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(62) -#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61) -#define GLOBAL_STATUS_ASIF BIT_ULL(60) -#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59) -#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58 -#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT) -#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55) +#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16) + +#define GLOBAL_STATUS_COND_CHG BIT_ULL(63) +#define GLOBAL_STATUS_BUFFER_OVF_BIT 62 +#define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT) +#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61) +#define GLOBAL_STATUS_ASIF BIT_ULL(60) +#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59) +#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58 +#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT) +#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55 +#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
/* * We model guest LBR event tracing as another fixed-mode PMC like BTS.
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 6f7225099d5f3ec3019f380a0da2b456b7796cb0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6f7225099d5f3ec3019f380a0da2b456b7796cb0 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake to measure the level 1 TopDown events.
Add MSR address and macros for the new fixed counter, which will be used in a later patch.
Add comments to explain the event encoding rules for the fixed counters.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-4-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/perf_event.h | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 65f6cfb37a7c..ece7c182d26c 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -146,12 +146,24 @@ struct x86_pmu_capability { */
/* - * All 3 fixed-mode PMCs are configured via this single MSR: + * All the fixed-mode PMCs are configured via this single MSR: */ #define MSR_ARCH_PERFMON_FIXED_CTR_CTRL 0x38d
/* - * The counts are available in three separate MSRs: + * There is no event-code assigned to the fixed-mode PMCs. + * + * For a fixed-mode PMC, which has an equivalent event on a general-purpose + * PMC, the event-code of the equivalent event is used for the fixed-mode PMC, + * e.g., Instr_Retired.Any and CPU_CLK_Unhalted.Core. + * + * For a fixed-mode PMC, which doesn't have an equivalent event, a + * pseudo-encoding is used, e.g., CPU_CLK_Unhalted.Ref and TOPDOWN.SLOTS. + * The pseudo event-code for a fixed-mode PMC must be 0x00. + * The pseudo umask-code is 0xX. The X equals the index of the fixed + * counter + 1, e.g., the fixed counter 2 has the pseudo-encoding 0x0300. + * + * The counts are available in separate MSRs: */
/* Instr_Retired.Any: */ @@ -162,11 +174,16 @@ struct x86_pmu_capability { #define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a #define INTEL_PMC_IDX_FIXED_CPU_CYCLES (INTEL_PMC_IDX_FIXED + 1)
-/* CPU_CLK_Unhalted.Ref: */ +/* CPU_CLK_Unhalted.Ref: event=0x00,umask=0x3 (pseudo-encoding) */ #define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b #define INTEL_PMC_IDX_FIXED_REF_CYCLES (INTEL_PMC_IDX_FIXED + 2) #define INTEL_PMC_MSK_FIXED_REF_CYCLES (1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
+/* TOPDOWN.SLOTS: event=0x00,umask=0x4 (pseudo-encoding) */ +#define MSR_ARCH_PERFMON_FIXED_CTR3 0x30c +#define INTEL_PMC_IDX_FIXED_SLOTS (INTEL_PMC_IDX_FIXED + 3) +#define INTEL_PMC_MSK_FIXED_SLOTS (1ULL << INTEL_PMC_IDX_FIXED_SLOTS) + /* * We model BTS tracing as another fixed-mode PMC. *
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit d39fcc32893dac2d02900d99c38276a00cc54d60 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit d39fcc32893dac2d02900d99c38276a00cc54d60 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The bit 48 in the PERF_GLOBAL_STATUS is used to indicate the overflow status of the PERF_METRICS counters.
Move the BTS index to the bit 47.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-5-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/perf_event.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index ece7c182d26c..16be5f25b48e 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -187,11 +187,11 @@ struct x86_pmu_capability { /* * We model BTS tracing as another fixed-mode PMC. * - * We choose a value in the middle of the fixed event range, since lower + * We choose the value 47 for the fixed index of BTS, since lower * values are used by actual fixed events and higher values are used * to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr. */ -#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 16) +#define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
#define GLOBAL_STATUS_COND_CHG BIT_ULL(63) #define GLOBAL_STATUS_BUFFER_OVF_BIT 62
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit bbdbde2a415d9f479803266cae6fb0c1a9f6c80e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit bbdbde2a415d9f479803266cae6fb0c1a9f6c80e upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Bit 15 of the PERF_CAPABILITIES MSR indicates that the perf METRICS feature is supported. The perf METRICS is not a PEBS feature.
Rename pebs_metrics_available perf_metrics.
The bit is not used in the current code. It will be used in a later patch.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-6-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/perf_event.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 697a776b839b..ea049119210b 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -516,6 +516,7 @@ union perf_capabilities { */ u64 full_width_write:1; u64 pebs_baseline:1; + u64 perf_metrics:1; }; u64 capabilities; };
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 58da7dbe6f036fefe504a4bb452afbd39bba73f7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 58da7dbe6f036fefe504a4bb452afbd39bba73f7 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Currently, the if-else is used in the intel_pmu_disable/enable_event to check the type of an event. It works well, but with more and more types added later, e.g., perf metrics, compared to the switch statement, the if-else may impair the readability of the code.
There is no harm to use the switch statement to replace the if-else here. Also, some optimizing compilers may compile a switch statement into a jump-table which is more efficient than if-else for a large number of cases. The performance gain may not be observed for now, because the number of cases is only 5, but the benefits may be observed with more and more types added in the future.
Use switch to replace the if-else in the intel_pmu_disable/enable_event.
If the idx is invalid, print a warning.
For the case INTEL_PMC_IDX_FIXED_BTS in intel_pmu_disable_event, don't need to check the event->attr.precise_ip. Use return for the case.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-7-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 34 ++++++++++++++++++++++++++-------- 1 file changed, 26 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 4f1845b59a08..51d8383b4373 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2158,17 +2158,27 @@ static void intel_pmu_disable_event(struct perf_event *event) struct hw_perf_event *hwc = &event->hw; int idx = hwc->idx;
- if (idx < INTEL_PMC_IDX_FIXED) { + switch (idx) { + case 0 ... INTEL_PMC_IDX_FIXED - 1: intel_clear_masks(event, idx); x86_pmu_disable_event(event); - } else if (idx < INTEL_PMC_IDX_FIXED_BTS) { + break; + case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1: intel_clear_masks(event, idx); intel_pmu_disable_fixed(event); - } else if (idx == INTEL_PMC_IDX_FIXED_BTS) { + break; + case INTEL_PMC_IDX_FIXED_BTS: intel_pmu_disable_bts(); intel_pmu_drain_bts_buffer(); - } else if (idx == INTEL_PMC_IDX_FIXED_VLBR) { + return; + case INTEL_PMC_IDX_FIXED_VLBR: + intel_clear_masks(event, idx); + break; + default: intel_clear_masks(event, idx); + pr_warn("Failed to disable the event with invalid index %d\n", + idx); + return; }
/* @@ -2241,18 +2251,26 @@ static void intel_pmu_enable_event(struct perf_event *event) if (unlikely(event->attr.precise_ip)) intel_pmu_pebs_enable(event);
- if (idx < INTEL_PMC_IDX_FIXED) { + switch (idx) { + case 0 ... INTEL_PMC_IDX_FIXED - 1: intel_set_masks(event, idx); __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE); - } else if (idx < INTEL_PMC_IDX_FIXED_BTS) { + break; + case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1: intel_set_masks(event, idx); intel_pmu_enable_fixed(event); - } else if (idx == INTEL_PMC_IDX_FIXED_BTS) { + break; + case INTEL_PMC_IDX_FIXED_BTS: if (!__this_cpu_read(cpu_hw_events.enabled)) return; intel_pmu_enable_bts(hwc->config); - } else if (idx == INTEL_PMC_IDX_FIXED_VLBR) { + break; + case INTEL_PMC_IDX_FIXED_VLBR: intel_set_masks(event, idx); + break; + default: + pr_warn("Failed to enable the event with invalid index %d\n", + idx); } }
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 9f0c4fa111dc909ca545c45ea20ec84da555ce16 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9f0c4fa111dc909ca545c45ea20ec84da555ce16 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Current perf assumes that events in a group are independent. Close an event doesn't impact the value of the other events in the same group. If the closed event is a member, after the event closure, other events are still running like a group. If the closed event is a leader, other events are running as singleton events.
Add PERF_EV_CAP_SIBLING to allow events to indicate they require being part of a group, and when the leader dies they cannot exist independently.
Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-8-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/perf_event.h | 4 ++++ kernel/events/core.c | 43 +++++++++++++++++++++++++++++++++----- 2 files changed, 42 insertions(+), 5 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f78841df7a46..87f465412af4 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -526,9 +526,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *, * PERF_EV_CAP_SOFTWARE: Is a software event. * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read * from any CPU in the package where it is active. + * PERF_EV_CAP_SIBLING: An event with this flag must be a group sibling and + * cannot be a group leader. If an event with this flag is detached from the + * group it is scheduled out and moved into an unrecoverable ERROR state. */ #define PERF_EV_CAP_SOFTWARE BIT(0) #define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1) +#define PERF_EV_CAP_SIBLING BIT(2)
#define SWEVENT_HLIST_BITS 8 #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS) diff --git a/kernel/events/core.c b/kernel/events/core.c index 5d60d0f1c940..56617b912475 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1892,8 +1892,29 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) ctx->generation++; }
+static void put_event(struct perf_event *event); +static void event_sched_out(struct perf_event *event, + struct perf_cpu_context *cpuctx, + struct perf_event_context *ctx); + +/* + * Events that have PERF_EV_CAP_SIBLING require being part of a group and + * cannot exist on their own, schedule them out and move them into the ERROR + * state. Also see _perf_event_enable(), it will not be able to recover + * this ERROR state. + */ +static inline void perf_remove_sibling_event(struct perf_event *event) +{ + struct perf_event_context *ctx = event->ctx; + struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); + + event_sched_out(event, cpuctx, ctx); + perf_event_set_state(event, PERF_EVENT_STATE_ERROR); +} + static void perf_group_detach(struct perf_event *event) { + struct perf_event *leader = event->group_leader; struct perf_event *sibling, *tmp; struct perf_event_context *ctx = event->ctx;
@@ -1910,7 +1931,7 @@ static void perf_group_detach(struct perf_event *event) /* * If this is a sibling, remove it from its group. */ - if (event->group_leader != event) { + if (leader != event) { list_del_init(&event->sibling_list); event->group_leader->nr_siblings--; goto out; @@ -1923,6 +1944,9 @@ static void perf_group_detach(struct perf_event *event) */ list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
+ if (sibling->event_caps & PERF_EV_CAP_SIBLING) + perf_remove_sibling_event(sibling); + sibling->group_leader = sibling; list_del_init(&sibling->sibling_list);
@@ -1944,10 +1968,10 @@ static void perf_group_detach(struct perf_event *event) }
out: - perf_event__header_size(event->group_leader); - - for_each_sibling_event(tmp, event->group_leader) + for_each_sibling_event(tmp, leader) perf_event__header_size(tmp); + + perf_event__header_size(leader); }
static bool is_orphaned_event(struct perf_event *event) @@ -2727,6 +2751,7 @@ static void _perf_event_enable(struct perf_event *event) raw_spin_lock_irq(&ctx->lock); if (event->state >= PERF_EVENT_STATE_INACTIVE || event->state < PERF_EVENT_STATE_ERROR) { +out: raw_spin_unlock_irq(&ctx->lock); return; } @@ -2738,8 +2763,16 @@ static void _perf_event_enable(struct perf_event *event) * has gone back into error state, as distinct from the task having * been scheduled away before the cross-call arrived. */ - if (event->state == PERF_EVENT_STATE_ERROR) + if (event->state == PERF_EVENT_STATE_ERROR) { + /* + * Detached SIBLING events cannot leave ERROR state. + */ + if (event->event_caps & PERF_EV_CAP_SIBLING && + event->group_leader == event) + goto out; + event->state = PERF_EVENT_STATE_OFF; + } raw_spin_unlock_irq(&ctx->lock);
event_function_call(event, __perf_event_enable, NULL);
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 7b2c05a15d29d0570a0d21da1e4fd5cbc85cbf13 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7b2c05a15d29d0570a0d21da1e4fd5cbc85cbf13 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Intro Reviewed-by: Xie XiuQi xiexiuqi@huawei.com
=====
The TopDown Microarchitecture Analysis (TMA) Method is a structured analysis methodology to identify critical performance bottlenecks in out-of-order processors. Current perf has supported the method.
The method works well, but there is one problem. To collect the TopDown events, several GP counters have to be used. If a user wants to collect other events at the same time, the multiplexing probably be triggered, which impacts the accuracy.
To free up the scarce GP counters, the hardware TopDown metrics feature is introduced from Ice Lake. The hardware implements an additional "metrics" register and a new Fixed Counter 3 that measures pipeline "slots". The TopDown events can be calculated from them instead.
Events ======
The level 1 TopDown has four metrics. There is no event-code assigned to the TopDown metrics. Four metric events are exported as separate perf events, which map to the internal "metrics" counter register. Those events do not exist in hardware, but can be allocated by the scheduler.
For the event mapping, a special 0x00 event code is used, which is reserved for fake events. The metric events start from umask 0x10.
When setting up the metric events, they point to the Fixed Counter 3. They have to be specially handled. - Add the update_topdown_event() callback to read the additional metrics MSR and generate the metrics. - Add the set_topdown_event_period() callback to initialize metrics MSR and the fixed counter 3. - Add a variable n_metric_event to track the number of the accepted metrics events. The sharing between multiple users of the same metric without multiplexing is not allowed. - Only enable/disable the fixed counter 3 when there are no other active TopDown events, which avoid the unnecessary writing of the fixed control register. - Disable the PMU when reading the metrics event. The metrics MSR and the fixed counter 3 are read separately. The values may be modified by an NMI.
All four metric events don't support sampling. Since they will be handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is introduced to indicate this case.
The slots event can support both sampling and counting. For counting, the flag is also applied. For sampling, it will be handled normally as other normal events.
Groups ======
The slots event is required in a Topdown group. To avoid reading the METRICS register multiple times, the metrics and slots value can only be updated by slots event in a group. All active slots and metrics events will be updated one time. Therefore, the slots event must be before any metric events in a Topdown group.
NMI ======
The METRICS related register may be overflow. The bit 48 of the STATUS register will be set. If so, PERF_METRICS and Fixed counter 3 are required to be reset. The patch also update all active slots and metrics events in the NMI handler.
The update_topdown_event() has to read two registers separately. The values may be modified by an NMI. PMU has to be disabled before calling the function.
RDPMC ======
RDPMC is temporarily disabled. A later patch will enable it.
Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-9-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 63 ++++++++++++--- arch/x86/events/intel/core.c | 124 ++++++++++++++++++++++++++++-- arch/x86/events/perf_event.h | 37 +++++++++ arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/perf_event.h | 47 +++++++++++ 5 files changed, 257 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 60bca5389ed5..40f4a63e4759 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -75,6 +75,9 @@ u64 x86_perf_event_update(struct perf_event *event) if (unlikely(!hwc->event_base)) return 0;
+ if (unlikely(is_topdown_count(event)) && x86_pmu.update_topdown_event) + return x86_pmu.update_topdown_event(event); + /* * Careful: an NMI might modify the previous event value. * @@ -1010,6 +1013,42 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) return unsched ? -EINVAL : 0; }
+static int add_nr_metric_event(struct cpu_hw_events *cpuc, + struct perf_event *event) +{ + if (is_metric_event(event)) { + if (cpuc->n_metric == INTEL_TD_METRIC_NUM) + return -EINVAL; + cpuc->n_metric++; + } + + return 0; +} + +static void del_nr_metric_event(struct cpu_hw_events *cpuc, + struct perf_event *event) +{ + if (is_metric_event(event)) + cpuc->n_metric--; +} + +static int collect_event(struct cpu_hw_events *cpuc, struct perf_event *event, + int max_count, int n) +{ + + if (x86_pmu.intel_cap.perf_metrics && add_nr_metric_event(cpuc, event)) + return -EINVAL; + + if (n >= max_count + cpuc->n_metric) + return -EINVAL; + + cpuc->event_list[n] = event; + if (is_counter_pair(&event->hw)) + cpuc->n_pair++; + + return 0; +} + /* * dogrp: true if must collect siblings events (group) * returns total number of events and error code @@ -1025,28 +1064,22 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader, n = cpuc->n_events;
if (is_x86_event(leader)) { - if (n >= max_count) + if (collect_event(cpuc, leader, max_count, n)) return -EINVAL; - cpuc->event_list[n] = leader; n++; - if (is_counter_pair(&leader->hw)) - cpuc->n_pair++; } + if (!dogrp) return n;
for_each_sibling_event(event, leader) { - if (!is_x86_event(event) || - event->state <= PERF_EVENT_STATE_OFF) + if (!is_x86_event(event) || event->state <= PERF_EVENT_STATE_OFF) continue;
- if (n >= max_count) + if (collect_event(cpuc, event, max_count, n)) return -EINVAL;
- cpuc->event_list[n] = event; n++; - if (is_counter_pair(&event->hw)) - cpuc->n_pair++; } return n; } @@ -1068,6 +1101,10 @@ static inline void x86_assign_hw_event(struct perf_event *event, hwc->event_base = 0; break;
+ case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END: + /* All the metric events are mapped onto the fixed counter 3. */ + idx = INTEL_PMC_IDX_FIXED_SLOTS; + /* fall through */ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1: hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL; hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + @@ -1182,6 +1219,10 @@ int x86_perf_event_set_period(struct perf_event *event) if (unlikely(!hwc->event_base)) return 0;
+ if (unlikely(is_topdown_count(event)) && + x86_pmu.set_topdown_event_period) + return x86_pmu.set_topdown_event_period(event); + /* * If we are way outside a reasonable range then just skip forward: */ @@ -1470,6 +1511,8 @@ static void x86_pmu_del(struct perf_event *event, int flags) cpuc->event_constraint[i-1] = cpuc->event_constraint[i]; } --cpuc->n_events; + if (x86_pmu.intel_cap.perf_metrics) + del_nr_metric_event(cpuc, event);
perf_event_update_userpage(event);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 51d8383b4373..0f64fdbd5f98 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2143,11 +2143,24 @@ static inline void intel_clear_masks(struct perf_event *event, int idx) static void intel_pmu_disable_fixed(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; - int idx = hwc->idx - INTEL_PMC_IDX_FIXED; u64 ctrl_val, mask; + int idx = hwc->idx;
- mask = 0xfULL << (idx * 4); + if (is_topdown_idx(idx)) { + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + + /* + * When there are other active TopDown events, + * don't disable the fixed counter 3. + */ + if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx)) + return; + idx = INTEL_PMC_IDX_FIXED_SLOTS; + }
+ intel_clear_masks(event, idx); + + mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4); rdmsrl(hwc->config_base, ctrl_val); ctrl_val &= ~mask; wrmsrl(hwc->config_base, ctrl_val); @@ -2164,7 +2177,7 @@ static void intel_pmu_disable_event(struct perf_event *event) x86_pmu_disable_event(event); break; case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1: - intel_clear_masks(event, idx); + case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END: intel_pmu_disable_fixed(event); break; case INTEL_PMC_IDX_FIXED_BTS: @@ -2197,10 +2210,26 @@ static void intel_pmu_del_event(struct perf_event *event) intel_pmu_pebs_del(event); }
+static void intel_pmu_read_topdown_event(struct perf_event *event) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + + /* Only need to call update_topdown_event() once for group read. */ + if ((cpuc->txn_flags & PERF_PMU_TXN_READ) && + !is_slots_event(event)) + return; + + perf_pmu_disable(event->pmu); + x86_pmu.update_topdown_event(event); + perf_pmu_enable(event->pmu); +} + static void intel_pmu_read_event(struct perf_event *event) { if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD) intel_pmu_auto_reload_read(event); + else if (is_topdown_count(event) && x86_pmu.update_topdown_event) + intel_pmu_read_topdown_event(event); else x86_perf_event_update(event); } @@ -2208,8 +2237,22 @@ static void intel_pmu_read_event(struct perf_event *event) static void intel_pmu_enable_fixed(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; - int idx = hwc->idx - INTEL_PMC_IDX_FIXED; u64 ctrl_val, mask, bits = 0; + int idx = hwc->idx; + + if (is_topdown_idx(idx)) { + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + /* + * When there are other active TopDown events, + * don't enable the fixed counter 3 again. + */ + if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx)) + return; + + idx = INTEL_PMC_IDX_FIXED_SLOTS; + } + + intel_set_masks(event, idx);
/* * Enable IRQ generation (0x8), if not PEBS, @@ -2229,6 +2272,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event) if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY) bits |= 0x4;
+ idx -= INTEL_PMC_IDX_FIXED; bits <<= (idx * 4); mask = 0xfULL << (idx * 4);
@@ -2257,7 +2301,7 @@ static void intel_pmu_enable_event(struct perf_event *event) __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE); break; case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1: - intel_set_masks(event, idx); + case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END: intel_pmu_enable_fixed(event); break; case INTEL_PMC_IDX_FIXED_BTS: @@ -2401,6 +2445,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) intel_pt_interrupt(); }
+ /* + * Intel Perf mertrics + */ + if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) { + handled++; + if (x86_pmu.update_topdown_event) + x86_pmu.update_topdown_event(NULL); + } + /* * Checkpointed counters can lead to 'spurious' PMIs because the * rollback caused by the PMI will have cleared the overflow status @@ -3238,6 +3291,58 @@ static int intel_pmu_hw_config(struct perf_event *event) if (event->attr.type != PERF_TYPE_RAW) return 0;
+ /* + * Config Topdown slots and metric events + * + * The slots event on Fixed Counter 3 can support sampling, + * which will be handled normally in x86_perf_event_update(). + * + * Metric events don't support sampling and require being paired + * with a slots event as group leader. When the slots event + * is used in a metrics group, it too cannot support sampling. + */ + if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) { + if (event->attr.config1 || event->attr.config2) + return -EINVAL; + + /* + * The TopDown metrics events and slots event don't + * support any filters. + */ + if (event->attr.config & X86_ALL_EVENT_FLAGS) + return -EINVAL; + + if (is_metric_event(event)) { + struct perf_event *leader = event->group_leader; + + /* The metric events don't support sampling. */ + if (is_sampling_event(event)) + return -EINVAL; + + /* The metric events require a slots group leader. */ + if (!is_slots_event(leader)) + return -EINVAL; + + /* + * The leader/SLOTS must not be a sampling event for + * metric use; hardware requires it starts at 0 when used + * in conjunction with MSR_PERF_METRICS. + */ + if (is_sampling_event(leader)) + return -EINVAL; + + event->event_caps |= PERF_EV_CAP_SIBLING; + /* + * Only once we have a METRICs sibling do we + * need TopDown magic. + */ + leader->hw.flags |= PERF_X86_EVENT_TOPDOWN; + event->hw.flags |= PERF_X86_EVENT_TOPDOWN; + + event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED; + } + } + if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)) return 0;
@@ -4877,6 +4982,15 @@ __init int intel_pmu_init(void) * counter, so do not extend mask to generic counters */ for_each_event_constraint(c, x86_pmu.event_constraints) { + /* + * Don't extend the topdown slots and metrics + * events to the generic counters. + */ + if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) { + c->weight = hweight64(c->idxmsk64); + continue; + } + if (c->cmask == FIXED_EVENT_FLAGS && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) { c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1; diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index ea049119210b..70845e020dac 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -80,6 +80,31 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode) #define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */ #define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
+#define PERF_X86_EVENT_TOPDOWN 0x4000 /* Count Topdown slots/metrics events */ + +static inline bool is_topdown_count(struct perf_event *event) +{ + return event->hw.flags & PERF_X86_EVENT_TOPDOWN; +} + +static inline bool is_metric_event(struct perf_event *event) +{ + u64 config = event->attr.config; + + return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) && + ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) && + ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX); +} + +static inline bool is_slots_event(struct perf_event *event) +{ + return (event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_TD_SLOTS; +} + +static inline bool is_topdown_event(struct perf_event *event) +{ + return is_metric_event(event) || is_slots_event(event); +}
struct amd_nb { int nb_id; /* NorthBridge id */ @@ -263,6 +288,12 @@ struct cpu_hw_events { */ u64 tfa_shadow;
+ /* + * Perf Metrics + */ + /* number of accepted metrics events */ + int n_metric; + /* * AMD specific bits */ @@ -679,6 +710,12 @@ struct x86_pmu { */ atomic_t lbr_exclusive[x86_lbr_exclusive_max];
+ /* + * Intel perf metrics + */ + u64 (*update_topdown_event)(struct perf_event *event); + int (*set_topdown_event_period)(struct perf_event *event); + /* * AMD bits */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 76b1c43547b0..882cd97bf092 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -768,6 +768,7 @@ #define MSR_CORE_PERF_FIXED_CTR0 0x00000309 #define MSR_CORE_PERF_FIXED_CTR1 0x0000030a #define MSR_CORE_PERF_FIXED_CTR2 0x0000030b +#define MSR_CORE_PERF_FIXED_CTR3 0x0000030c #define MSR_CORE_PERF_FIXED_CTR_CTRL 0x0000038d #define MSR_CORE_PERF_GLOBAL_STATUS 0x0000038e #define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 16be5f25b48e..af3e1a79fe48 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -193,6 +193,52 @@ struct x86_pmu_capability { */ #define INTEL_PMC_IDX_FIXED_BTS (INTEL_PMC_IDX_FIXED + 15)
+/* + * The PERF_METRICS MSR is modeled as several magic fixed-mode PMCs, one for + * each TopDown metric event. + * + * Internally the TopDown metric events are mapped to the FxCtr 3 (SLOTS). + */ +#define INTEL_PMC_IDX_METRIC_BASE (INTEL_PMC_IDX_FIXED + 16) +#define INTEL_PMC_IDX_TD_RETIRING (INTEL_PMC_IDX_METRIC_BASE + 0) +#define INTEL_PMC_IDX_TD_BAD_SPEC (INTEL_PMC_IDX_METRIC_BASE + 1) +#define INTEL_PMC_IDX_TD_FE_BOUND (INTEL_PMC_IDX_METRIC_BASE + 2) +#define INTEL_PMC_IDX_TD_BE_BOUND (INTEL_PMC_IDX_METRIC_BASE + 3) +#define INTEL_PMC_IDX_METRIC_END INTEL_PMC_IDX_TD_BE_BOUND +#define INTEL_PMC_MSK_TOPDOWN ((0xfull << INTEL_PMC_IDX_METRIC_BASE) | \ + INTEL_PMC_MSK_FIXED_SLOTS) + +/* + * There is no event-code assigned to the TopDown events. + * + * For the slots event, use the pseudo code of the fixed counter 3. + * + * For the metric events, the pseudo event-code is 0x00. + * The pseudo umask-code starts from the middle of the pseudo event + * space, 0x80. + */ +#define INTEL_TD_SLOTS 0x0400 /* TOPDOWN.SLOTS */ +/* Level 1 metrics */ +#define INTEL_TD_METRIC_RETIRING 0x8000 /* Retiring metric */ +#define INTEL_TD_METRIC_BAD_SPEC 0x8100 /* Bad speculation metric */ +#define INTEL_TD_METRIC_FE_BOUND 0x8200 /* FE bound metric */ +#define INTEL_TD_METRIC_BE_BOUND 0x8300 /* BE bound metric */ +#define INTEL_TD_METRIC_MAX INTEL_TD_METRIC_BE_BOUND +#define INTEL_TD_METRIC_NUM 4 + +static inline bool is_metric_idx(int idx) +{ + return (unsigned)(idx - INTEL_PMC_IDX_METRIC_BASE) < INTEL_TD_METRIC_NUM; +} + +static inline bool is_topdown_idx(int idx) +{ + return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS; +} + +#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit) \ + (~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN) + #define GLOBAL_STATUS_COND_CHG BIT_ULL(63) #define GLOBAL_STATUS_BUFFER_OVF_BIT 62 #define GLOBAL_STATUS_BUFFER_OVF BIT_ULL(GLOBAL_STATUS_BUFFER_OVF_BIT) @@ -203,6 +249,7 @@ struct x86_pmu_capability { #define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT) #define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55 #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT) +#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
/* * We model guest LBR event tracing as another fixed-mode PMC like BTS.
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 0e2e45e2ded4988f5641115fd996c75dc32e4be3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 0e2e45e2ded4988f5641115fd996c75dc32e4be3 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
The RDPMC base offset of fixed counters is hard-code. Use a meaningful name to replace the magic number to improve the readability of the code.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-10-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 3 ++- arch/x86/include/asm/perf_event.h | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 40f4a63e4759..0879a5398dbe 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1109,7 +1109,8 @@ static inline void x86_assign_hw_event(struct perf_event *event, hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL; hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (idx - INTEL_PMC_IDX_FIXED); - hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30; + hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | + INTEL_PMC_FIXED_RDPMC_BASE; break;
default: diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index af3e1a79fe48..41995204e9d0 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -145,6 +145,9 @@ struct x86_pmu_capability { * Fixed-purpose performance events: */
+/* RDPMC offset for Fixed PMCs */ +#define INTEL_PMC_FIXED_RDPMC_BASE (1 << 30) + /* * All the fixed-mode PMCs are configured via this single MSR: */
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.10-rc1 commit 59a854e2f3b90ad2cc7368ae392de40b981ad51d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 59a854e2f3b90ad2cc7368ae392de40b981ad51d upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Ice Lake supports the hardware TopDown metrics feature, which can free up the scarce GP counters.
Update the event constraints for the metrics events. The metric counters do not exist, which are mapped to a dummy offset. The sharing between multiple users of the same metric without multiplexing is not allowed.
Implement set_topdown_event_period for Ice Lake. The values in PERF_METRICS MSR are derived from the fixed counter 3. Both registers should start from zero.
Implement update_topdown_event for Ice Lake. The metric is reported by multiplying the metric (fraction) with slots. To maintain accurate measurements, both registers are cleared for each update. The fixed counter 3 should always be cleared before the PERF_METRICS.
Implement td_attr for the new metrics events and the new slots fixed counter. Make them visible to the perf user tools.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-11-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 118 ++++++++++++++++++++++++++++++ arch/x86/events/perf_event.h | 13 ++++ arch/x86/include/asm/msr-index.h | 2 + arch/x86/include/asm/perf_event.h | 2 + 4 files changed, 135 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 0f64fdbd5f98..d087b471a002 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -244,6 +244,10 @@ static struct event_constraint intel_icl_event_constraints[] = { FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */ FIXED_EVENT_CONSTRAINT(0x0300, 2), /* CPU_CLK_UNHALTED.REF */ FIXED_EVENT_CONSTRAINT(0x0400, 3), /* SLOTS */ + METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_RETIRING, 0), + METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BAD_SPEC, 1), + METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_FE_BOUND, 2), + METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BE_BOUND, 3), INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf), INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf), INTEL_EVENT_CONSTRAINT(0x32, 0xf), /* SW_PREFETCH_ACCESS.* */ @@ -306,6 +310,12 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles, EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale, "4", "2");
+EVENT_ATTR_STR(slots, slots, "event=0x00,umask=0x4"); +EVENT_ATTR_STR(topdown-retiring, td_retiring, "event=0x00,umask=0x80"); +EVENT_ATTR_STR(topdown-bad-spec, td_bad_spec, "event=0x00,umask=0x81"); +EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound, "event=0x00,umask=0x82"); +EVENT_ATTR_STR(topdown-be-bound, td_be_bound, "event=0x00,umask=0x83"); + static struct attribute *snb_events_attrs[] = { EVENT_PTR(td_slots_issued), EVENT_PTR(td_slots_retired), @@ -2210,6 +2220,99 @@ static void intel_pmu_del_event(struct perf_event *event) intel_pmu_pebs_del(event); }
+static int icl_set_topdown_event_period(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + s64 left = local64_read(&hwc->period_left); + + /* + * The values in PERF_METRICS MSR are derived from fixed counter 3. + * Software should start both registers, PERF_METRICS and fixed + * counter 3, from zero. + * Clear PERF_METRICS and Fixed counter 3 in initialization. + * After that, both MSRs will be cleared for each read. + * Don't need to clear them again. + */ + if (left == x86_pmu.max_period) { + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0); + wrmsrl(MSR_PERF_METRICS, 0); + local64_set(&hwc->period_left, 0); + } + + perf_event_update_userpage(event); + + return 0; +} + +static inline u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx) +{ + u32 val; + + /* + * The metric is reported as an 8bit integer fraction + * suming up to 0xff. + * slots-in-metric = (Metric / 0xff) * slots + */ + val = (metric >> ((idx - INTEL_PMC_IDX_METRIC_BASE) * 8)) & 0xff; + return mul_u64_u32_div(slots, val, 0xff); +} + +static void __icl_update_topdown_event(struct perf_event *event, + u64 slots, u64 metrics) +{ + int idx = event->hw.idx; + u64 delta; + + if (is_metric_idx(idx)) + delta = icl_get_metrics_event_value(metrics, slots, idx); + else + delta = slots; + + local64_add(delta, &event->count); +} + +/* + * Update all active Topdown events. + * + * The PERF_METRICS and Fixed counter 3 are read separately. The values may be + * modify by a NMI. PMU has to be disabled before calling this function. + */ +static u64 icl_update_topdown_event(struct perf_event *event) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + struct perf_event *other; + u64 slots, metrics; + int idx; + + /* read Fixed counter 3 */ + rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots); + if (!slots) + return 0; + + /* read PERF_METRICS */ + rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics); + + for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) { + if (!is_topdown_idx(idx)) + continue; + other = cpuc->events[idx]; + __icl_update_topdown_event(other, slots, metrics); + } + + /* + * Check and update this event, which may have been cleared + * in active_mask e.g. x86_pmu_stop() + */ + if (event && !test_bit(event->hw.idx, cpuc->active_mask)) + __icl_update_topdown_event(event, slots, metrics); + + /* The fixed counter 3 has to be written before the PERF_METRICS. */ + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0); + wrmsrl(MSR_PERF_METRICS, 0); + + return slots; +} + static void intel_pmu_read_topdown_event(struct perf_event *event) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); @@ -4221,6 +4324,15 @@ static struct attribute *icl_events_attrs[] = { NULL, };
+static struct attribute *icl_td_events_attrs[] = { + EVENT_PTR(slots), + EVENT_PTR(td_retiring), + EVENT_PTR(td_bad_spec), + EVENT_PTR(td_fe_bound), + EVENT_PTR(td_be_bound), + NULL, +}; + static struct attribute *icl_tsx_events_attrs[] = { EVENT_PTR(tx_start), EVENT_PTR(tx_abort), @@ -4925,9 +5037,12 @@ __init int intel_pmu_init(void) hsw_format_attr : nhm_format_attr; extra_skl_attr = skl_format_attr; mem_attr = icl_events_attrs; + td_attr = icl_td_events_attrs; tsx_attr = icl_tsx_events_attrs; x86_pmu.lbr_pt_coexist = true; intel_pmu_pebs_data_source_skl(pmem); + x86_pmu.update_topdown_event = icl_update_topdown_event; + x86_pmu.set_topdown_event_period = icl_set_topdown_event_period; pr_cont("Icelake events, "); name = "icelake"; break; @@ -5039,6 +5154,9 @@ __init int intel_pmu_init(void) pr_cont("full-width counters, "); }
+ if (x86_pmu.intel_cap.perf_metrics) + x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS; + return 0; }
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 70845e020dac..ff2f0307ca8a 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -385,6 +385,19 @@ struct cpu_hw_events { #define FIXED_EVENT_CONSTRAINT(c, n) \ EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
+/* + * The special metric counters do not actually exist. They are calculated from + * the combination of the FxCtr3 + MSR_PERF_METRICS. + * + * The special metric counters are mapped to a dummy offset for the scheduler. + * The sharing between multiple users of the same metric without multiplexing + * is not allowed, even though the hardware supports that in principle. + */ + +#define METRIC_EVENT_CONSTRAINT(c, n) \ + EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_METRIC_BASE + n)), \ + INTEL_ARCH_EVENT_MASK) + /* * Constraint on the Event code + UMask */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 882cd97bf092..cc387a1da9f9 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -774,6 +774,8 @@ #define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f #define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x00000390
+#define MSR_PERF_METRICS 0x00000329 + /* Geode defined MSRs */ #define MSR_GEODE_BUSCONT_CONF0 0x00001900
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 41995204e9d0..27afef6a050a 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -147,6 +147,7 @@ struct x86_pmu_capability {
/* RDPMC offset for Fixed PMCs */ #define INTEL_PMC_FIXED_RDPMC_BASE (1 << 30) +#define INTEL_PMC_FIXED_RDPMC_METRICS (1 << 29)
/* * All the fixed-mode PMCs are configured via this single MSR: @@ -254,6 +255,7 @@ static inline bool is_topdown_idx(int idx) #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT) #define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
+#define GLOBAL_CTRL_EN_PERF_METRICS 48 /* * We model guest LBR event tracing as another fixed-mode PMC like BTS. *
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 2cb5383b30d47c446ec7d884cd80f93ffcc31817 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 2cb5383b30d47c446ec7d884cd80f93ffcc31817 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
Starts from Ice Lake, the TopDown metrics are directly available as fixed counters and do not require generic counters. Also, the TopDown metrics can be collected per thread. Extend the RDPMC usage to support per-thread TopDown metrics.
The RDPMC index of the PERF_METRICS will be output if RDPMC users ask for the RDPMC index of the metrics events.
To support per thread RDPMC TopDown, the metrics and slots counters have to be saved/restored during the context switching.
The last_period and period_left are not used in the counting mode. Use the fields for saved_metric and saved_slots.
Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200723171117.9918-12-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 5 +- arch/x86/events/intel/core.c | 90 +++++++++++++++++++++++++++++++----- include/linux/perf_event.h | 29 ++++++++---- 3 files changed, 102 insertions(+), 22 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 0879a5398dbe..6e2a5ae4608b 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2198,7 +2198,10 @@ static int x86_pmu_event_idx(struct perf_event *event) if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return 0;
- return hwc->event_base_rdpmc + 1; + if (is_metric_idx(hwc->idx)) + return INTEL_PMC_FIXED_RDPMC_METRICS + 1; + else + return hwc->event_base_rdpmc + 1; }
static ssize_t get_attr_rdpmc(struct device *cdev, diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index d087b471a002..c10bdf5799d8 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2236,7 +2236,13 @@ static int icl_set_topdown_event_period(struct perf_event *event) if (left == x86_pmu.max_period) { wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0); wrmsrl(MSR_PERF_METRICS, 0); - local64_set(&hwc->period_left, 0); + hwc->saved_slots = 0; + hwc->saved_metric = 0; + } + + if ((hwc->saved_slots) && is_slots_event(event)) { + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, hwc->saved_slots); + wrmsrl(MSR_PERF_METRICS, hwc->saved_metric); }
perf_event_update_userpage(event); @@ -2257,7 +2263,7 @@ static inline u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx) return mul_u64_u32_div(slots, val, 0xff); }
-static void __icl_update_topdown_event(struct perf_event *event, +static u64 icl_get_topdown_value(struct perf_event *event, u64 slots, u64 metrics) { int idx = event->hw.idx; @@ -2268,7 +2274,50 @@ static void __icl_update_topdown_event(struct perf_event *event, else delta = slots;
- local64_add(delta, &event->count); + return delta; +} + +static void __icl_update_topdown_event(struct perf_event *event, + u64 slots, u64 metrics, + u64 last_slots, u64 last_metrics) +{ + u64 delta, last = 0; + + delta = icl_get_topdown_value(event, slots, metrics); + if (last_slots) + last = icl_get_topdown_value(event, last_slots, last_metrics); + + /* + * The 8bit integer fraction of metric may be not accurate, + * especially when the changes is very small. + * For example, if only a few bad_spec happens, the fraction + * may be reduced from 1 to 0. If so, the bad_spec event value + * will be 0 which is definitely less than the last value. + * Avoid update event->count for this case. + */ + if (delta > last) { + delta -= last; + local64_add(delta, &event->count); + } +} + +static void update_saved_topdown_regs(struct perf_event *event, + u64 slots, u64 metrics) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + struct perf_event *other; + int idx; + + event->hw.saved_slots = slots; + event->hw.saved_metric = metrics; + + for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) { + if (!is_topdown_idx(idx)) + continue; + other = cpuc->events[idx]; + other->hw.saved_slots = slots; + other->hw.saved_metric = metrics; + } }
/* @@ -2282,6 +2331,7 @@ static u64 icl_update_topdown_event(struct perf_event *event) struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); struct perf_event *other; u64 slots, metrics; + bool reset = true; int idx;
/* read Fixed counter 3 */ @@ -2296,19 +2346,39 @@ static u64 icl_update_topdown_event(struct perf_event *event) if (!is_topdown_idx(idx)) continue; other = cpuc->events[idx]; - __icl_update_topdown_event(other, slots, metrics); + __icl_update_topdown_event(other, slots, metrics, + event ? event->hw.saved_slots : 0, + event ? event->hw.saved_metric : 0); }
/* * Check and update this event, which may have been cleared * in active_mask e.g. x86_pmu_stop() */ - if (event && !test_bit(event->hw.idx, cpuc->active_mask)) - __icl_update_topdown_event(event, slots, metrics); + if (event && !test_bit(event->hw.idx, cpuc->active_mask)) { + __icl_update_topdown_event(event, slots, metrics, + event->hw.saved_slots, + event->hw.saved_metric);
- /* The fixed counter 3 has to be written before the PERF_METRICS. */ - wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0); - wrmsrl(MSR_PERF_METRICS, 0); + /* + * In x86_pmu_stop(), the event is cleared in active_mask first, + * then drain the delta, which indicates context switch for + * counting. + * Save metric and slots for context switch. + * Don't need to reset the PERF_METRICS and Fixed counter 3. + * Because the values will be restored in next schedule in. + */ + update_saved_topdown_regs(event, slots, metrics); + reset = false; + } + + if (reset) { + /* The fixed counter 3 has to be written before the PERF_METRICS. */ + wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0); + wrmsrl(MSR_PERF_METRICS, 0); + if (event) + update_saved_topdown_regs(event, 0, 0); + }
return slots; } @@ -3441,8 +3511,6 @@ static int intel_pmu_hw_config(struct perf_event *event) */ leader->hw.flags |= PERF_X86_EVENT_TOPDOWN; event->hw.flags |= PERF_X86_EVENT_TOPDOWN; - - event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED; } }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 87f465412af4..0550194b3252 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -198,17 +198,26 @@ struct hw_perf_event { */ u64 sample_period;
- /* - * The period we started this sample with. - */ - u64 last_period; + union { + struct { /* Sampling */ + /* + * The period we started this sample with. + */ + u64 last_period;
- /* - * However much is left of the current period; note that this is - * a full 64bit value and allows for generation of periods longer - * than hardware might allow. - */ - local64_t period_left; + /* + * However much is left of the current period; + * note that this is a full 64bit value and + * allows for generation of periods longer + * than hardware might allow. + */ + local64_t period_left; + }; + struct { /* Topdown events counting for context switch */ + u64 saved_metric; + u64 saved_slots; + }; + };
/* * State for throttling the event, see __perf_event_overflow() and
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 80a5ce116fc084e8a25d5a936617699e2931b611 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 80a5ce116fc084e8a25d5a936617699e2931b611 upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
It might be possible that different CPUs have different CPU metrics on a platform. In this case, writing the GLOBAL_CTRL_EN_PERF_METRICS bit to the GLOBAL_CTRL register of a CPU, which doesn't support the TopDown perf metrics feature, causes MSR access error.
Current TopDown perf metrics feature is enumerated using the boot CPU's PERF_CAPABILITIES MSR. The MSR only indicates the boot CPU supports this feature.
Check the PERF_CAPABILITIES MSR for each CPU. If any CPU doesn't support the perf metrics feature, disable the feature globally.
Fixes: 59a854e2f3b9 ("perf/x86/intel: Support TopDown metrics on Ice Lake") Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20201001211711.25708-1-kan.liang@linux.intel.com Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/core.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index c10bdf5799d8..c4ca55ac07bf 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3920,6 +3920,17 @@ static void intel_pmu_cpu_starting(int cpu) if (x86_pmu.version > 1) flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
+ /* Disable perf metrics if any added CPU doesn't support it. */ + if (x86_pmu.intel_cap.perf_metrics) { + union perf_capabilities perf_cap; + + rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap.capabilities); + if (!perf_cap.perf_metrics) { + x86_pmu.intel_cap.perf_metrics = 0; + x86_pmu.intel_ctrl &= ~(1ULL << GLOBAL_CTRL_EN_PERF_METRICS); + } + } + if (!cpuc->shared_regs) return;
From: Peter Zijlstra peterz@infradead.org
mainline inclusion from mainline-v5.10-rc1 commit 3dbde69575637658d2094ee4416c21bc22eb89fe category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3dbde69575637658d2094ee4416c21bc22eb89fe upstream Backport summary: backport to kernel 4.19.57 for ICX perf topdown support
When a group that has TopDown members is failed to be scheduled, any later TopDown groups will not return valid values.
Here is an example.
A background perf that occupies all the GP counters and the fixed counter 1. $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles, cycles,cycles}:D" -a
A user monitors a TopDown group. It works well, because the fixed counter 3 and the PERF_METRICS are available. $perf stat -x, --topdown -- ./workload retiring,bad speculation,frontend bound,backend bound, 18.0,16.1,40.4,25.5,
Then the user tries to monitor a group that has TopDown members. Because of the cycles event, the group is failed to be scheduled. $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound, topdown-fe-bound,topdown-bad-spec,cycles}' -- ./workload <not counted>,,slots,0,0.00,, <not counted>,,topdown-retiring,0,0.00,, <not counted>,,topdown-be-bound,0,0.00,, <not counted>,,topdown-fe-bound,0,0.00,, <not counted>,,topdown-bad-spec,0,0.00,, <not counted>,,cycles,0,0.00,,
The user tries to monitor a TopDown group again. It doesn't work anymore. $perf stat -x, --topdown -- ./workload
,,,,,
In a txn, cancel_txn() is to truncate the event_list for a canceled group and update the number of events added in this transaction. However, the number of TopDown events added in this transaction is not updated. The kernel will probably fail to add new Topdown events.
Fixes: 7b2c05a15d29 ("perf/x86/intel: Generic support for hardware TopDown metrics") Reported-by: Andi Kleen ak@linux.intel.com Reported-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Kan Liang kan.liang@linux.intel.com Link: https://lkml.kernel.org/r/20201005082611.GH2628@hirez.programming.kicks-ass.... Signed-off-by: Yunying Sun yunying.sun@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 3 +++ arch/x86/events/perf_event.h | 1 + 2 files changed, 4 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 6e2a5ae4608b..d5c95336ec15 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1020,6 +1020,7 @@ static int add_nr_metric_event(struct cpu_hw_events *cpuc, if (cpuc->n_metric == INTEL_TD_METRIC_NUM) return -EINVAL; cpuc->n_metric++; + cpuc->n_txn_metric++; }
return 0; @@ -1941,6 +1942,7 @@ static void x86_pmu_start_txn(struct pmu *pmu, unsigned int txn_flags)
perf_pmu_disable(pmu); __this_cpu_write(cpu_hw_events.n_txn, 0); + __this_cpu_write(cpu_hw_events.n_txn_metric, 0); }
/* @@ -1966,6 +1968,7 @@ static void x86_pmu_cancel_txn(struct pmu *pmu) */ __this_cpu_sub(cpu_hw_events.n_added, __this_cpu_read(cpu_hw_events.n_txn)); __this_cpu_sub(cpu_hw_events.n_events, __this_cpu_read(cpu_hw_events.n_txn)); + __this_cpu_sub(cpu_hw_events.n_metric, __this_cpu_read(cpu_hw_events.n_txn_metric)); perf_pmu_enable(pmu); }
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index ff2f0307ca8a..73173f058dc6 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -220,6 +220,7 @@ struct cpu_hw_events { they've never been enabled yet */ int n_txn; /* the # last events in the below arrays; added in the current transaction */ + int n_txn_metric; int assign[X86_PMC_IDX_MAX]; /* event to counter assignment */ u64 tags[X86_PMC_IDX_MAX];
From: Tony Luck tony.luck@intel.com
mainline inclusion from mainline-v5.5-rc1 commit 29b8e84fbc23cb2b70317b745641ea0569426872 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 29b8e84fbc23cb2b70317b745641ea0569426872 upstream
Backport summary: Backport to kernel 4.19.57 to support SKX EDAC
Simplifies the code a little.
Acked-by: Aristeu Rozanski aris@redhat.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/skx_common.c | 48 +++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 25 deletions(-)
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index d8ff63d91b86..58b8348d0f71 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -100,6 +100,7 @@ void __exit skx_adxl_put(void)
static bool skx_adxl_decode(struct decoded_addr *res) { + struct skx_dev *d; int i, len = 0;
if (res->addr >= skx_tohm || (res->addr >= skx_tolm && @@ -118,6 +119,24 @@ static bool skx_adxl_decode(struct decoded_addr *res) res->channel = (int)adxl_values[component_indices[INDEX_CHANNEL]]; res->dimm = (int)adxl_values[component_indices[INDEX_DIMM]];
+ if (res->imc > NUM_IMC - 1) { + skx_printk(KERN_ERR, "Bad imc %d\n", res->imc); + return false; + } + + list_for_each_entry(d, &dev_edac_list, list) { + if (d->imc[0].src_id == res->socket) { + res->dev = d; + break; + } + } + + if (!res->dev) { + skx_printk(KERN_ERR, "No device for src_id %d imc %d\n", + res->socket, res->imc); + return false; + } + for (i = 0; i < adxl_component_count; i++) { if (adxl_values[i] == ~0x0ull) continue; @@ -452,24 +471,6 @@ static void skx_unregister_mci(struct skx_imc *imc) edac_mc_free(mci); }
-static struct mem_ctl_info *get_mci(int src_id, int lmc) -{ - struct skx_dev *d; - - if (lmc > NUM_IMC - 1) { - skx_printk(KERN_ERR, "Bad lmc %d\n", lmc); - return NULL; - } - - list_for_each_entry(d, &dev_edac_list, list) { - if (d->imc[0].src_id == src_id) - return d->imc[lmc].mci; - } - - skx_printk(KERN_ERR, "No mci for src_id %d lmc %d\n", src_id, lmc); - return NULL; -} - static void skx_mce_output_error(struct mem_ctl_info *mci, const struct mce *m, struct decoded_addr *res) @@ -583,15 +584,12 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, if (adxl_component_count) { if (!skx_adxl_decode(&res)) return NOTIFY_DONE; - - mci = get_mci(res.socket, res.imc); - } else { - if (!skx_decode || !skx_decode(&res)) - return NOTIFY_DONE; - - mci = res.dev->imc[res.imc].mci; + } else if (!skx_decode || !skx_decode(&res)) { + return NOTIFY_DONE; }
+ mci = res.dev->imc[res.imc].mci; + if (!mci) return NOTIFY_DONE;
From: Tony Luck tony.luck@intel.com
mainline inclusion from mainline-v5.5-rc1 commit e80634a75aba90e7485cd1fdb463fcac5d45f14d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit e80634a75aba90e7485cd1fdb463fcac5d45f14d upstream
Backport summary: Backport to kernel 4.19.57 to support SKX EDAC
Skylake logs some additional useful information in per-channel registers in addition the the architectural status/addr/misc logged in the machine check bank.
Pick up this information and add it to the EDAC log:
retry_rd_err_[five 32-bit register values]
Sorry, no definitions for these registers. OEMs and DIMM vendors will be able to use them to isolate which cells in the DIMM are causing problems.
correrrcnt[per rank corrected error counts]
Note that if additional errors are logged while these registers are being read, you may see a jumble of values some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank. If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error.
With this change EDAC logs look like this:
EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])
Acked-by: Aristeu Rozanski aris@redhat.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/skx_base.c | 51 ++++++++++++++++++++++++++++++++++++--- drivers/edac/skx_common.c | 12 ++++++--- drivers/edac/skx_common.h | 4 ++- 3 files changed, 60 insertions(+), 7 deletions(-)
diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c index c0ff22b96ec4..f16ed60055af 100644 --- a/drivers/edac/skx_base.c +++ b/drivers/edac/skx_base.c @@ -46,7 +46,8 @@ static struct skx_dev *get_skx_dev(struct pci_bus *bus, u8 idx) }
enum munittype { - CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD + CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD, + ERRCHAN0, ERRCHAN1, ERRCHAN2, };
struct munit { @@ -68,6 +69,9 @@ static const struct munit skx_all_munits[] = { { 0x2040, { PCI_DEVFN(10, 0), PCI_DEVFN(12, 0) }, 2, 2, CHAN0 }, { 0x2044, { PCI_DEVFN(10, 4), PCI_DEVFN(12, 4) }, 2, 2, CHAN1 }, { 0x2048, { PCI_DEVFN(11, 0), PCI_DEVFN(13, 0) }, 2, 2, CHAN2 }, + { 0x2043, { PCI_DEVFN(10, 3), PCI_DEVFN(12, 3) }, 2, 2, ERRCHAN0 }, + { 0x2047, { PCI_DEVFN(10, 7), PCI_DEVFN(12, 7) }, 2, 2, ERRCHAN1 }, + { 0x204b, { PCI_DEVFN(11, 3), PCI_DEVFN(13, 3) }, 2, 2, ERRCHAN2 }, { 0x208e, { }, 1, 0, SAD }, { } }; @@ -104,10 +108,18 @@ static int get_all_munits(const struct munit *m) }
switch (m->mtype) { - case CHAN0: case CHAN1: case CHAN2: + case CHAN0: + case CHAN1: + case CHAN2: pci_dev_get(pdev); d->imc[i].chan[m->mtype].cdev = pdev; break; + case ERRCHAN0: + case ERRCHAN1: + case ERRCHAN2: + pci_dev_get(pdev); + d->imc[i].chan[m->mtype - ERRCHAN0].edev = pdev; + break; case SAD_ALL: pci_dev_get(pdev); d->sad_all = pdev; @@ -216,6 +228,39 @@ static int skx_get_dimm_config(struct mem_ctl_info *mci) #define SKX_ILV_REMOTE(tgt) (((tgt) & 8) == 0) #define SKX_ILV_TARGET(tgt) ((tgt) & 7)
+static void skx_show_retry_rd_err_log(struct decoded_addr *res, + char *msg, int len) +{ + u32 log0, log1, log2, log3, log4; + u32 corr0, corr1, corr2, corr3; + struct pci_dev *edev; + int n; + + edev = res->dev->imc[res->imc].chan[res->channel].edev; + + pci_read_config_dword(edev, 0x154, &log0); + pci_read_config_dword(edev, 0x148, &log1); + pci_read_config_dword(edev, 0x150, &log2); + pci_read_config_dword(edev, 0x15c, &log3); + pci_read_config_dword(edev, 0x114, &log4); + + n = snprintf(msg, len, " retry_rd_err_log[%.8x %.8x %.8x %.8x %.8x]", + log0, log1, log2, log3, log4); + + pci_read_config_dword(edev, 0x104, &corr0); + pci_read_config_dword(edev, 0x108, &corr1); + pci_read_config_dword(edev, 0x10c, &corr2); + pci_read_config_dword(edev, 0x110, &corr3); + + if (len - n > 0) + snprintf(msg + n, len - n, + " correrrcnt[%.4x %.4x %.4x %.4x %.4x %.4x %.4x %.4x]", + corr0 & 0xffff, corr0 >> 16, + corr1 & 0xffff, corr1 >> 16, + corr2 & 0xffff, corr2 >> 16, + corr3 & 0xffff, corr3 >> 16); +} + static bool skx_sad_decode(struct decoded_addr *res) { struct skx_dev *d = list_first_entry(skx_edac_list, typeof(*d), list); @@ -659,7 +704,7 @@ static int __init skx_init(void) } }
- skx_set_decode(skx_decode); + skx_set_decode(skx_decode, skx_show_retry_rd_err_log);
if (nvdimm_count && skx_adxl_get() == -ENODEV) skx_printk(KERN_NOTICE, "Only decoding DDR4 address!\n"); diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index 58b8348d0f71..9174836ba85d 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -37,6 +37,7 @@ static char *adxl_msg;
static char skx_msg[MSG_SIZE]; static skx_decode_f skx_decode; +static skx_show_retry_log_f skx_show_retry_rd_err_log; static u64 skx_tolm, skx_tohm; static LIST_HEAD(dev_edac_list);
@@ -150,9 +151,10 @@ static bool skx_adxl_decode(struct decoded_addr *res) return true; }
-void skx_set_decode(skx_decode_f decode) +void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log) { skx_decode = decode; + skx_show_retry_rd_err_log = show_retry_log; }
int skx_get_src_id(struct skx_dev *d, int off, u8 *id) @@ -481,6 +483,7 @@ static void skx_mce_output_error(struct mem_ctl_info *mci, bool overflow = GET_BITFIELD(m->status, 62, 62); bool uncorrected_error = GET_BITFIELD(m->status, 61, 61); bool recoverable; + int len; u32 core_err_cnt = GET_BITFIELD(m->status, 38, 52); u32 mscod = GET_BITFIELD(m->status, 16, 31); u32 errcode = GET_BITFIELD(m->status, 0, 15); @@ -540,12 +543,12 @@ static void skx_mce_output_error(struct mem_ctl_info *mci, } } if (adxl_component_count) { - snprintf(skx_msg, MSG_SIZE, "%s%s err_code:0x%04x:0x%04x %s", + len = snprintf(skx_msg, MSG_SIZE, "%s%s err_code:0x%04x:0x%04x %s", overflow ? " OVERFLOW" : "", (uncorrected_error && recoverable) ? " recoverable" : "", mscod, errcode, adxl_msg); } else { - snprintf(skx_msg, MSG_SIZE, + len = snprintf(skx_msg, MSG_SIZE, "%s%s err_code:0x%04x:0x%04x socket:%d imc:%d rank:%d bg:%d ba:%d row:0x%x col:0x%x", overflow ? " OVERFLOW" : "", (uncorrected_error && recoverable) ? " recoverable" : "", @@ -554,6 +557,9 @@ static void skx_mce_output_error(struct mem_ctl_info *mci, res->bank_group, res->bank_address, res->row, res->column); }
+ if (skx_show_retry_rd_err_log) + skx_show_retry_rd_err_log(res, skx_msg + len, MSG_SIZE - len); + edac_dbg(0, "%s\n", skx_msg);
/* Call the helper to output message */ diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index 08cc971a50ea..60d1ea669afd 100644 --- a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -64,6 +64,7 @@ struct skx_dev { u8 src_id, node_id; struct skx_channel { struct pci_dev *cdev; + struct pci_dev *edev; struct skx_dimm { u8 close_pg; u8 bank_xor_enable; @@ -113,10 +114,11 @@ struct decoded_addr {
typedef int (*get_dimm_config_f)(struct mem_ctl_info *mci); typedef bool (*skx_decode_f)(struct decoded_addr *res); +typedef void (*skx_show_retry_log_f)(struct decoded_addr *res, char *msg, int len);
int __init skx_adxl_get(void); void __exit skx_adxl_put(void); -void skx_set_decode(skx_decode_f decode); +void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log);
int skx_get_src_id(struct skx_dev *d, int off, u8 *id); int skx_get_node_id(struct skx_dev *d, u8 *id);
From: Andy Lutomirski luto@kernel.org
mainline inclusion from mainline-v5.7-rc1 commit 65c668f5faebf549db086b7a6841b6f4187b4e4f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
harder to change ist_enter() and ist_exit()'s behavior. Instead open-code the very small amount of required logic.
Signed-off-by: Andy Lutomirski luto@kernel.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Alexandre Chartre alexandre.chartre@oracle.com Reviewed-by: Andy Lutomirski luto@kernel.org Link: https://lkml.kernel.org/r/20200225220217.150607679@linutronix.de Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/kernel/traps.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index e6db475164ed..e832a9becc19 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -593,14 +593,20 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) return;
/* - * Use ist_enter despite the fact that we don't use an IST stack. - * We can be called from a kprobe in non-CONTEXT_KERNEL kernel - * mode or even during context tracking state changes. + * Unlike any other non-IST entry, we can be called from a kprobe in + * non-CONTEXT_KERNEL kernel mode or even during context tracking + * state changes. Make sure that we wake up RCU even if we're coming + * from kernel code. * - * This means that we can't schedule. That's okay. + * This means that we can't schedule even if we came from a + * preemptible kernel context. That's okay. */ - ist_enter(regs); + if (!user_mode(regs)) { + rcu_nmi_enter(); + preempt_disable(); + } RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); + #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP) @@ -621,7 +627,10 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) cond_local_irq_disable(regs);
exit: - ist_exit(regs); + if (!user_mode(regs)) { + preempt_enable_no_resched(); + rcu_nmi_exit(); + } } NOKPROBE_SYMBOL(do_int3);
From: Jann Horn jannh@google.com
mainline inclusion from mainline-v4.20-rc1 commit 75045f77f7a73e617494d7a1fcf4e9c1849cec39 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 75045f77f7a73e617494d7a1fcf4e9c1849cec39 upstream Backport summary: Backport to kernel 4.19.57 to enhance MCA-R for copyin
Currently, most fixups for attempting to access userspace memory are handled using _ASM_EXTABLE, which is also used for various other types of fixups (e.g. safe MSR access, IRET failures, and a bunch of other things). In order to make it possible to add special safety checks to uaccess fixups (in particular, checking whether the fault address is actually in userspace), introduce a new exception table handler ex_handler_uaccess() and wire it up to all the user access fixups (excluding ones that already use _ASM_EXTABLE_EX).
Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Kees Cook keescook@chromium.org Cc: Andy Lutomirski luto@kernel.org Cc: kernel-hardening@lists.openwall.com Cc: dvyukov@google.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: "Naveen N. Rao" naveen.n.rao@linux.vnet.ibm.com Cc: Anil S Keshavamurthy anil.s.keshavamurthy@intel.com Cc: "David S. Miller" davem@davemloft.net Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: linux-fsdevel@vger.kernel.org Cc: Borislav Petkov bp@alien8.de Link: https://lkml.kernel.org/r/20180828201421.157735-5-jannh@google.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/asm.h | 6 ++ arch/x86/include/asm/futex.h | 6 +- arch/x86/include/asm/uaccess.h | 16 ++--- arch/x86/lib/checksum_32.S | 4 +- arch/x86/lib/copy_user_64.S | 90 +++++++++++------------ arch/x86/lib/csum-copy_64.S | 8 ++- arch/x86/lib/getuser.S | 12 ++-- arch/x86/lib/putuser.S | 10 +-- arch/x86/lib/usercopy_32.c | 126 ++++++++++++++++----------------- arch/x86/lib/usercopy_64.c | 4 +- arch/x86/mm/extable.c | 8 +++ 11 files changed, 154 insertions(+), 136 deletions(-)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 990770f9e76b..13fe8d6a880a 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -130,6 +130,9 @@ # define _ASM_EXTABLE(from, to) \ _ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+# define _ASM_EXTABLE_UA(from, to) \ + _ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess) + # define _ASM_EXTABLE_FAULT(from, to) \ _ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
@@ -182,6 +185,9 @@ # define _ASM_EXTABLE(from, to) \ _ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+# define _ASM_EXTABLE_UA(from, to) \ + _ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess) + # define _ASM_EXTABLE_FAULT(from, to) \ _ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
diff --git a/arch/x86/include/asm/futex.h b/arch/x86/include/asm/futex.h index de4d68852d3a..13c83fe97988 100644 --- a/arch/x86/include/asm/futex.h +++ b/arch/x86/include/asm/futex.h @@ -20,7 +20,7 @@ "3:\tmov\t%3, %1\n" \ "\tjmp\t2b\n" \ "\t.previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "=r" (oldval), "=r" (ret), "+m" (*uaddr) \ : "i" (-EFAULT), "0" (oparg), "1" (0))
@@ -36,8 +36,8 @@ "4:\tmov\t%5, %1\n" \ "\tjmp\t3b\n" \ "\t.previous\n" \ - _ASM_EXTABLE(1b, 4b) \ - _ASM_EXTABLE(2b, 4b) \ + _ASM_EXTABLE_UA(1b, 4b) \ + _ASM_EXTABLE_UA(2b, 4b) \ : "=&a" (oldval), "=&r" (ret), \ "+m" (*uaddr), "=&r" (tem) \ : "r" (oparg), "i" (-EFAULT), "1" (0)) diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index a2373a6f304e..40c2dd62fc1b 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -195,8 +195,8 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) "4: movl %3,%0\n" \ " jmp 3b\n" \ ".previous\n" \ - _ASM_EXTABLE(1b, 4b) \ - _ASM_EXTABLE(2b, 4b) \ + _ASM_EXTABLE_UA(1b, 4b) \ + _ASM_EXTABLE_UA(2b, 4b) \ : "=r" (err) \ : "A" (x), "r" (addr), "i" (errret), "0" (err))
@@ -382,7 +382,7 @@ do { \ " xor"itype" %"rtype"1,%"rtype"1\n" \ " jmp 2b\n" \ ".previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "=r" (err), ltype(x) \ : "m" (__m(addr)), "i" (errret), "0" (err))
@@ -475,7 +475,7 @@ struct __large_struct { unsigned long buf[100]; }; "3: mov %3,%0\n" \ " jmp 2b\n" \ ".previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "=r"(err) \ : ltype(x), "m" (__m(addr)), "i" (errret), "0" (err))
@@ -603,7 +603,7 @@ extern void __cmpxchg_wrong_size(void) "3:\tmov %3, %0\n" \ "\tjmp 2b\n" \ "\t.previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ : "i" (-EFAULT), "q" (__new), "1" (__old) \ : "memory" \ @@ -619,7 +619,7 @@ extern void __cmpxchg_wrong_size(void) "3:\tmov %3, %0\n" \ "\tjmp 2b\n" \ "\t.previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ : "i" (-EFAULT), "r" (__new), "1" (__old) \ : "memory" \ @@ -635,7 +635,7 @@ extern void __cmpxchg_wrong_size(void) "3:\tmov %3, %0\n" \ "\tjmp 2b\n" \ "\t.previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ : "i" (-EFAULT), "r" (__new), "1" (__old) \ : "memory" \ @@ -654,7 +654,7 @@ extern void __cmpxchg_wrong_size(void) "3:\tmov %3, %0\n" \ "\tjmp 2b\n" \ "\t.previous\n" \ - _ASM_EXTABLE(1b, 3b) \ + _ASM_EXTABLE_UA(1b, 3b) \ : "+r" (__ret), "=a" (__old), "+m" (*(ptr)) \ : "i" (-EFAULT), "r" (__new), "1" (__old) \ : "memory" \ diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S index 46e71a74e612..ad8e0906d1ea 100644 --- a/arch/x86/lib/checksum_32.S +++ b/arch/x86/lib/checksum_32.S @@ -273,11 +273,11 @@ unsigned int csum_partial_copy_generic (const char *src, char *dst,
#define SRC(y...) \ 9999: y; \ - _ASM_EXTABLE(9999b, 6001f) + _ASM_EXTABLE_UA(9999b, 6001f)
#define DST(y...) \ 9999: y; \ - _ASM_EXTABLE(9999b, 6002f) + _ASM_EXTABLE_UA(9999b, 6002f)
#ifndef CONFIG_X86_USE_PPRO_CHECKSUM
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 020f75cc8cf6..80cfad666210 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -92,26 +92,26 @@ ENTRY(copy_user_generic_unrolled) 60: jmp copy_user_handle_tail /* ecx is zerorest also */ .previous
- _ASM_EXTABLE(1b,30b) - _ASM_EXTABLE(2b,30b) - _ASM_EXTABLE(3b,30b) - _ASM_EXTABLE(4b,30b) - _ASM_EXTABLE(5b,30b) - _ASM_EXTABLE(6b,30b) - _ASM_EXTABLE(7b,30b) - _ASM_EXTABLE(8b,30b) - _ASM_EXTABLE(9b,30b) - _ASM_EXTABLE(10b,30b) - _ASM_EXTABLE(11b,30b) - _ASM_EXTABLE(12b,30b) - _ASM_EXTABLE(13b,30b) - _ASM_EXTABLE(14b,30b) - _ASM_EXTABLE(15b,30b) - _ASM_EXTABLE(16b,30b) - _ASM_EXTABLE(18b,40b) - _ASM_EXTABLE(19b,40b) - _ASM_EXTABLE(21b,50b) - _ASM_EXTABLE(22b,50b) + _ASM_EXTABLE_UA(1b,30b) + _ASM_EXTABLE_UA(2b,30b) + _ASM_EXTABLE_UA(3b,30b) + _ASM_EXTABLE_UA(4b,30b) + _ASM_EXTABLE_UA(5b,30b) + _ASM_EXTABLE_UA(6b,30b) + _ASM_EXTABLE_UA(7b,30b) + _ASM_EXTABLE_UA(8b,30b) + _ASM_EXTABLE_UA(9b,30b) + _ASM_EXTABLE_UA(10b,30b) + _ASM_EXTABLE_UA(11b,30b) + _ASM_EXTABLE_UA(12b,30b) + _ASM_EXTABLE_UA(13b,30b) + _ASM_EXTABLE_UA(14b,30b) + _ASM_EXTABLE_UA(15b,30b) + _ASM_EXTABLE_UA(16b,30b) + _ASM_EXTABLE_UA(18b,40b) + _ASM_EXTABLE_UA(19b,40b) + _ASM_EXTABLE_UA(21b,50b) + _ASM_EXTABLE_UA(22b,50b) ENDPROC(copy_user_generic_unrolled) EXPORT_SYMBOL(copy_user_generic_unrolled)
@@ -156,8 +156,8 @@ ENTRY(copy_user_generic_string) jmp copy_user_handle_tail .previous
- _ASM_EXTABLE(1b,11b) - _ASM_EXTABLE(3b,12b) + _ASM_EXTABLE_UA(1b,11b) + _ASM_EXTABLE_UA(3b,12b) ENDPROC(copy_user_generic_string) EXPORT_SYMBOL(copy_user_generic_string)
@@ -189,7 +189,7 @@ ENTRY(copy_user_enhanced_fast_string) jmp copy_user_handle_tail .previous
- _ASM_EXTABLE(1b,12b) + _ASM_EXTABLE_UA(1b,12b) ENDPROC(copy_user_enhanced_fast_string) EXPORT_SYMBOL(copy_user_enhanced_fast_string)
@@ -319,27 +319,27 @@ ENTRY(__copy_user_nocache) jmp copy_user_handle_tail .previous
- _ASM_EXTABLE(1b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(2b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(3b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(4b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(5b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(6b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(7b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(8b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(9b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(10b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(11b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(12b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(13b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(14b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(15b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(16b,.L_fixup_4x8b_copy) - _ASM_EXTABLE(20b,.L_fixup_8b_copy) - _ASM_EXTABLE(21b,.L_fixup_8b_copy) - _ASM_EXTABLE(30b,.L_fixup_4b_copy) - _ASM_EXTABLE(31b,.L_fixup_4b_copy) - _ASM_EXTABLE(40b,.L_fixup_1b_copy) - _ASM_EXTABLE(41b,.L_fixup_1b_copy) + _ASM_EXTABLE_UA(1b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(2b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(3b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(4b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(5b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(6b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(7b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(8b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(9b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(10b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(11b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(12b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(13b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(14b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(15b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(16b,.L_fixup_4x8b_copy) + _ASM_EXTABLE_UA(20b,.L_fixup_8b_copy) + _ASM_EXTABLE_UA(21b,.L_fixup_8b_copy) + _ASM_EXTABLE_UA(30b,.L_fixup_4b_copy) + _ASM_EXTABLE_UA(31b,.L_fixup_4b_copy) + _ASM_EXTABLE_UA(40b,.L_fixup_1b_copy) + _ASM_EXTABLE_UA(41b,.L_fixup_1b_copy) ENDPROC(__copy_user_nocache) EXPORT_SYMBOL(__copy_user_nocache) diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S index 45a53dfe1859..a4a379e79259 100644 --- a/arch/x86/lib/csum-copy_64.S +++ b/arch/x86/lib/csum-copy_64.S @@ -31,14 +31,18 @@
.macro source 10: - _ASM_EXTABLE(10b, .Lbad_source) + _ASM_EXTABLE_UA(10b, .Lbad_source) .endm
.macro dest 20: - _ASM_EXTABLE(20b, .Lbad_dest) + _ASM_EXTABLE_UA(20b, .Lbad_dest) .endm
+ /* + * No _ASM_EXTABLE_UA; this is used for intentional prefetch on a + * potentially unmapped kernel address. + */ .macro ignore L=.Lignore 30: _ASM_EXTABLE(30b, \L) diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index 49b167f73215..74fdff968ea3 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -132,12 +132,12 @@ bad_get_user_8: END(bad_get_user_8) #endif
- _ASM_EXTABLE(1b,bad_get_user) - _ASM_EXTABLE(2b,bad_get_user) - _ASM_EXTABLE(3b,bad_get_user) + _ASM_EXTABLE_UA(1b, bad_get_user) + _ASM_EXTABLE_UA(2b, bad_get_user) + _ASM_EXTABLE_UA(3b, bad_get_user) #ifdef CONFIG_X86_64 - _ASM_EXTABLE(4b,bad_get_user) + _ASM_EXTABLE_UA(4b, bad_get_user) #else - _ASM_EXTABLE(4b,bad_get_user_8) - _ASM_EXTABLE(5b,bad_get_user_8) + _ASM_EXTABLE_UA(4b, bad_get_user_8) + _ASM_EXTABLE_UA(5b, bad_get_user_8) #endif diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S index 96dce5fe2a35..d2e5c9c39601 100644 --- a/arch/x86/lib/putuser.S +++ b/arch/x86/lib/putuser.S @@ -94,10 +94,10 @@ bad_put_user: EXIT END(bad_put_user)
- _ASM_EXTABLE(1b,bad_put_user) - _ASM_EXTABLE(2b,bad_put_user) - _ASM_EXTABLE(3b,bad_put_user) - _ASM_EXTABLE(4b,bad_put_user) + _ASM_EXTABLE_UA(1b, bad_put_user) + _ASM_EXTABLE_UA(2b, bad_put_user) + _ASM_EXTABLE_UA(3b, bad_put_user) + _ASM_EXTABLE_UA(4b, bad_put_user) #ifdef CONFIG_X86_32 - _ASM_EXTABLE(5b,bad_put_user) + _ASM_EXTABLE_UA(5b, bad_put_user) #endif diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c index ab88150ae1ef..bfd94e7812fc 100644 --- a/arch/x86/lib/usercopy_32.c +++ b/arch/x86/lib/usercopy_32.c @@ -47,8 +47,8 @@ do { \ "3: lea 0(%2,%0,4),%0\n" \ " jmp 2b\n" \ ".previous\n" \ - _ASM_EXTABLE(0b,3b) \ - _ASM_EXTABLE(1b,2b) \ + _ASM_EXTABLE_UA(0b, 3b) \ + _ASM_EXTABLE_UA(1b, 2b) \ : "=&c"(size), "=&D" (__d0) \ : "r"(size & 3), "0"(size / 4), "1"(addr), "a"(0)); \ } while (0) @@ -153,44 +153,44 @@ __copy_user_intel(void __user *to, const void *from, unsigned long size) "101: lea 0(%%eax,%0,4),%0\n" " jmp 100b\n" ".previous\n" - _ASM_EXTABLE(1b,100b) - _ASM_EXTABLE(2b,100b) - _ASM_EXTABLE(3b,100b) - _ASM_EXTABLE(4b,100b) - _ASM_EXTABLE(5b,100b) - _ASM_EXTABLE(6b,100b) - _ASM_EXTABLE(7b,100b) - _ASM_EXTABLE(8b,100b) - _ASM_EXTABLE(9b,100b) - _ASM_EXTABLE(10b,100b) - _ASM_EXTABLE(11b,100b) - _ASM_EXTABLE(12b,100b) - _ASM_EXTABLE(13b,100b) - _ASM_EXTABLE(14b,100b) - _ASM_EXTABLE(15b,100b) - _ASM_EXTABLE(16b,100b) - _ASM_EXTABLE(17b,100b) - _ASM_EXTABLE(18b,100b) - _ASM_EXTABLE(19b,100b) - _ASM_EXTABLE(20b,100b) - _ASM_EXTABLE(21b,100b) - _ASM_EXTABLE(22b,100b) - _ASM_EXTABLE(23b,100b) - _ASM_EXTABLE(24b,100b) - _ASM_EXTABLE(25b,100b) - _ASM_EXTABLE(26b,100b) - _ASM_EXTABLE(27b,100b) - _ASM_EXTABLE(28b,100b) - _ASM_EXTABLE(29b,100b) - _ASM_EXTABLE(30b,100b) - _ASM_EXTABLE(31b,100b) - _ASM_EXTABLE(32b,100b) - _ASM_EXTABLE(33b,100b) - _ASM_EXTABLE(34b,100b) - _ASM_EXTABLE(35b,100b) - _ASM_EXTABLE(36b,100b) - _ASM_EXTABLE(37b,100b) - _ASM_EXTABLE(99b,101b) + _ASM_EXTABLE_UA(1b, 100b) + _ASM_EXTABLE_UA(2b, 100b) + _ASM_EXTABLE_UA(3b, 100b) + _ASM_EXTABLE_UA(4b, 100b) + _ASM_EXTABLE_UA(5b, 100b) + _ASM_EXTABLE_UA(6b, 100b) + _ASM_EXTABLE_UA(7b, 100b) + _ASM_EXTABLE_UA(8b, 100b) + _ASM_EXTABLE_UA(9b, 100b) + _ASM_EXTABLE_UA(10b, 100b) + _ASM_EXTABLE_UA(11b, 100b) + _ASM_EXTABLE_UA(12b, 100b) + _ASM_EXTABLE_UA(13b, 100b) + _ASM_EXTABLE_UA(14b, 100b) + _ASM_EXTABLE_UA(15b, 100b) + _ASM_EXTABLE_UA(16b, 100b) + _ASM_EXTABLE_UA(17b, 100b) + _ASM_EXTABLE_UA(18b, 100b) + _ASM_EXTABLE_UA(19b, 100b) + _ASM_EXTABLE_UA(20b, 100b) + _ASM_EXTABLE_UA(21b, 100b) + _ASM_EXTABLE_UA(22b, 100b) + _ASM_EXTABLE_UA(23b, 100b) + _ASM_EXTABLE_UA(24b, 100b) + _ASM_EXTABLE_UA(25b, 100b) + _ASM_EXTABLE_UA(26b, 100b) + _ASM_EXTABLE_UA(27b, 100b) + _ASM_EXTABLE_UA(28b, 100b) + _ASM_EXTABLE_UA(29b, 100b) + _ASM_EXTABLE_UA(30b, 100b) + _ASM_EXTABLE_UA(31b, 100b) + _ASM_EXTABLE_UA(32b, 100b) + _ASM_EXTABLE_UA(33b, 100b) + _ASM_EXTABLE_UA(34b, 100b) + _ASM_EXTABLE_UA(35b, 100b) + _ASM_EXTABLE_UA(36b, 100b) + _ASM_EXTABLE_UA(37b, 100b) + _ASM_EXTABLE_UA(99b, 101b) : "=&c"(size), "=&D" (d0), "=&S" (d1) : "1"(to), "2"(from), "0"(size) : "eax", "edx", "memory"); @@ -259,26 +259,26 @@ static unsigned long __copy_user_intel_nocache(void *to, "9: lea 0(%%eax,%0,4),%0\n" "16: jmp 8b\n" ".previous\n" - _ASM_EXTABLE(0b,16b) - _ASM_EXTABLE(1b,16b) - _ASM_EXTABLE(2b,16b) - _ASM_EXTABLE(21b,16b) - _ASM_EXTABLE(3b,16b) - _ASM_EXTABLE(31b,16b) - _ASM_EXTABLE(4b,16b) - _ASM_EXTABLE(41b,16b) - _ASM_EXTABLE(10b,16b) - _ASM_EXTABLE(51b,16b) - _ASM_EXTABLE(11b,16b) - _ASM_EXTABLE(61b,16b) - _ASM_EXTABLE(12b,16b) - _ASM_EXTABLE(71b,16b) - _ASM_EXTABLE(13b,16b) - _ASM_EXTABLE(81b,16b) - _ASM_EXTABLE(14b,16b) - _ASM_EXTABLE(91b,16b) - _ASM_EXTABLE(6b,9b) - _ASM_EXTABLE(7b,16b) + _ASM_EXTABLE_UA(0b, 16b) + _ASM_EXTABLE_UA(1b, 16b) + _ASM_EXTABLE_UA(2b, 16b) + _ASM_EXTABLE_UA(21b, 16b) + _ASM_EXTABLE_UA(3b, 16b) + _ASM_EXTABLE_UA(31b, 16b) + _ASM_EXTABLE_UA(4b, 16b) + _ASM_EXTABLE_UA(41b, 16b) + _ASM_EXTABLE_UA(10b, 16b) + _ASM_EXTABLE_UA(51b, 16b) + _ASM_EXTABLE_UA(11b, 16b) + _ASM_EXTABLE_UA(61b, 16b) + _ASM_EXTABLE_UA(12b, 16b) + _ASM_EXTABLE_UA(71b, 16b) + _ASM_EXTABLE_UA(13b, 16b) + _ASM_EXTABLE_UA(81b, 16b) + _ASM_EXTABLE_UA(14b, 16b) + _ASM_EXTABLE_UA(91b, 16b) + _ASM_EXTABLE_UA(6b, 9b) + _ASM_EXTABLE_UA(7b, 16b) : "=&c"(size), "=&D" (d0), "=&S" (d1) : "1"(to), "2"(from), "0"(size) : "eax", "edx", "memory"); @@ -321,9 +321,9 @@ do { \ "3: lea 0(%3,%0,4),%0\n" \ " jmp 2b\n" \ ".previous\n" \ - _ASM_EXTABLE(4b,5b) \ - _ASM_EXTABLE(0b,3b) \ - _ASM_EXTABLE(1b,2b) \ + _ASM_EXTABLE_UA(4b, 5b) \ + _ASM_EXTABLE_UA(0b, 3b) \ + _ASM_EXTABLE_UA(1b, 2b) \ : "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2) \ : "3"(size), "0"(size), "1"(to), "2"(from) \ : "memory"); \ diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c index 44d89a432380..accb61e4920f 100644 --- a/arch/x86/lib/usercopy_64.c +++ b/arch/x86/lib/usercopy_64.c @@ -37,8 +37,8 @@ unsigned long __clear_user(void __user *addr, unsigned long size) "3: lea 0(%[size1],%[size8],8),%[size8]\n" " jmp 2b\n" ".previous\n" - _ASM_EXTABLE(0b,3b) - _ASM_EXTABLE(1b,2b) + _ASM_EXTABLE_UA(0b, 3b) + _ASM_EXTABLE_UA(1b, 2b) : [size8] "=&c"(size), [dst] "=&D" (__d0) : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr)); clac(); diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c index 45f5d6cf65ae..dc72b2d17ac6 100644 --- a/arch/x86/mm/extable.c +++ b/arch/x86/mm/extable.c @@ -108,6 +108,14 @@ __visible bool ex_handler_fprestore(const struct exception_table_entry *fixup, } EXPORT_SYMBOL_GPL(ex_handler_fprestore);
+bool ex_handler_uaccess(const struct exception_table_entry *fixup, + struct pt_regs *regs, int trapnr) +{ + regs->ip = ex_fixup_addr(fixup); + return true; +} +EXPORT_SYMBOL(ex_handler_uaccess); + __visible bool ex_handler_ext(const struct exception_table_entry *fixup, struct pt_regs *regs, int trapnr) {
From: Jann Horn jannh@google.com
mainline inclusion from mainline-v5.6-rc1 commit 7be4412721aee25e35583a20a896085dc6b99c3e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 7be4412721aee25e35583a20a896085dc6b99c3e upstream Backport summary: Backport to kernel 4.19.57 to enhance MCA-R for copyin
To support evaluating 64-bit kernel mode instructions:
* Replace existing checks for user_64bit_mode() with a new helper that checks whether code is being executed in either 64-bit kernel mode or 64-bit user mode.
* Select the GS base depending on whether the instruction is being evaluated in kernel mode.
Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Borislav Petkov bp@suse.de Cc: Alexander Potapenko glider@google.com Cc: Andrey Konovalov andreyknvl@google.com Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Andy Lutomirski luto@kernel.org Cc: Dmitry Vyukov dvyukov@google.com Cc: "Gustavo A. R. Silva" gustavo@embeddedor.com Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: kasan-dev@googlegroups.com Cc: Oleg Nesterov oleg@redhat.com Cc: Sean Christopherson sean.j.christopherson@intel.com Cc: Thomas Gleixner tglx@linutronix.de Cc: x86-ml x86@kernel.org Link: https://lkml.kernel.org/r/20191218231150.12139-1-jannh@google.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/ptrace.h | 13 +++++++++++++ arch/x86/lib/insn-eval.c | 26 +++++++++++++++----------- 2 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index ee696efec99f..bb85b5108544 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -159,6 +159,19 @@ static inline bool user_64bit_mode(struct pt_regs *regs) #endif }
+/* + * Determine whether the register set came from any context that is running in + * 64-bit mode. + */ +static inline bool any_64bit_mode(struct pt_regs *regs) +{ +#ifdef CONFIG_X86_64 + return !user_mode(regs) || user_64bit_mode(regs); +#else + return false; +#endif +} + #ifdef CONFIG_X86_64 #define current_user_stack_pointer() current_pt_regs()->sp #define compat_user_stack_pointer() current_pt_regs()->sp diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 87dcba101e56..ec1670db5e4d 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -155,7 +155,7 @@ static bool check_seg_overrides(struct insn *insn, int regoff) */ static int resolve_default_seg(struct insn *insn, struct pt_regs *regs, int off) { - if (user_64bit_mode(regs)) + if (any_64bit_mode(regs)) return INAT_SEG_REG_IGNORE; /* * Resolve the default segment register as described in Section 3.7.4 @@ -264,7 +264,7 @@ static int resolve_seg_reg(struct insn *insn, struct pt_regs *regs, int regoff) * which may be invalid at this point. */ if (regoff == offsetof(struct pt_regs, ip)) { - if (user_64bit_mode(regs)) + if (any_64bit_mode(regs)) return INAT_SEG_REG_IGNORE; else return INAT_SEG_REG_CS; @@ -287,7 +287,7 @@ static int resolve_seg_reg(struct insn *insn, struct pt_regs *regs, int regoff) * In long mode, segment override prefixes are ignored, except for * overrides for FS and GS. */ - if (user_64bit_mode(regs)) { + if (any_64bit_mode(regs)) { if (idx != INAT_SEG_REG_FS && idx != INAT_SEG_REG_GS) idx = INAT_SEG_REG_IGNORE; @@ -644,23 +644,27 @@ unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx) */ return (unsigned long)(sel << 4);
- if (user_64bit_mode(regs)) { + if (any_64bit_mode(regs)) { /* * Only FS or GS will have a base address, the rest of * the segments' bases are forced to 0. */ unsigned long base;
- if (seg_reg_idx == INAT_SEG_REG_FS) + if (seg_reg_idx == INAT_SEG_REG_FS) { rdmsrl(MSR_FS_BASE, base); - else if (seg_reg_idx == INAT_SEG_REG_GS) + } else if (seg_reg_idx == INAT_SEG_REG_GS) { /* * swapgs was called at the kernel entry point. Thus, * MSR_KERNEL_GS_BASE will have the user-space GS base. */ - rdmsrl(MSR_KERNEL_GS_BASE, base); - else + if (user_mode(regs)) + rdmsrl(MSR_KERNEL_GS_BASE, base); + else + rdmsrl(MSR_GS_BASE, base); + } else { base = 0; + } return base; }
@@ -701,7 +705,7 @@ static unsigned long get_seg_limit(struct pt_regs *regs, int seg_reg_idx) if (sel < 0) return 0;
- if (user_64bit_mode(regs) || v8086_mode(regs)) + if (any_64bit_mode(regs) || v8086_mode(regs)) return -1L;
if (!sel) @@ -946,7 +950,7 @@ static int get_eff_addr_modrm(struct insn *insn, struct pt_regs *regs, * following instruction. */ if (*regoff == -EDOM) { - if (user_64bit_mode(regs)) + if (any_64bit_mode(regs)) tmp = regs->ip + insn->length; else tmp = 0; @@ -1248,7 +1252,7 @@ static void __user *get_addr_ref_32(struct insn *insn, struct pt_regs *regs) * After computed, the effective address is treated as an unsigned * quantity. */ - if (!user_64bit_mode(regs) && ((unsigned int)eff_addr > seg_limit)) + if (!any_64bit_mode(regs) && ((unsigned int)eff_addr > seg_limit)) goto out;
/*
From: Peter Zijlstra peterz@infradead.org
mainline inclusion from mainline-v5.2-rc1 commit 3693ca81151eacd498675baae56abede577e8b31 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 3693ca81151eacd498675baae56abede577e8b31 upstream Backport summary: Backport to kernel 4.19.57 to enhance MCA-R for copyin
By writing the function in asm we avoid cross object code flow and objtool no longer gets confused about a 'stray' CLAC.
Also; the asm version is actually _simpler_.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Borislav Petkov bp@alien8.de Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/include/asm/asm.h | 25 ---------------- arch/x86/include/asm/uaccess_64.h | 3 -- arch/x86/lib/copy_user_64.S | 48 +++++++++++++++++++++++++++++++ arch/x86/lib/usercopy_64.c | 20 ------------- 4 files changed, 48 insertions(+), 48 deletions(-)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 13fe8d6a880a..bcd69fc6cf34 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -147,31 +147,6 @@ _ASM_ALIGN ; \ _ASM_PTR (entry); \ .popsection - -.macro ALIGN_DESTINATION - /* check for bad alignment of destination */ - movl %edi,%ecx - andl $7,%ecx - jz 102f /* already aligned */ - subl $8,%ecx - negl %ecx - subl %ecx,%edx -100: movb (%rsi),%al -101: movb %al,(%rdi) - incq %rsi - incq %rdi - decl %ecx - jnz 100b -102: - .section .fixup,"ax" -103: addl %ecx,%edx /* ecx is zerorest also */ - jmp copy_user_handle_tail - .previous - - _ASM_EXTABLE(100b,103b) - _ASM_EXTABLE(101b,103b) - .endm - #else # define _EXPAND_EXTABLE_HANDLE(x) #x # define _ASM_EXTABLE_HANDLE(from, to, handler) \ diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index a9d637bc301d..5cd1caa8bc65 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -207,9 +207,6 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size) return __copy_user_flushcache(dst, src, size); }
-unsigned long -copy_user_handle_tail(char *to, char *from, unsigned len); - unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len);
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 80cfad666210..944fcc662f78 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -16,6 +16,30 @@ #include <asm/smap.h> #include <asm/export.h>
+.macro ALIGN_DESTINATION + /* check for bad alignment of destination */ + movl %edi,%ecx + andl $7,%ecx + jz 102f /* already aligned */ + subl $8,%ecx + negl %ecx + subl %ecx,%edx +100: movb (%rsi),%al +101: movb %al,(%rdi) + incq %rsi + incq %rdi + decl %ecx + jnz 100b +102: + .section .fixup,"ax" +103: addl %ecx,%edx /* ecx is zerorest also */ + jmp copy_user_handle_tail + .previous + + _ASM_EXTABLE_UA(100b, 103b) + _ASM_EXTABLE_UA(101b, 103b) + .endm + /* * copy_user_generic_unrolled - memory copy with exception handling. * This version is for CPUs like P4 that don't have efficient micro @@ -193,6 +217,30 @@ ENTRY(copy_user_enhanced_fast_string) ENDPROC(copy_user_enhanced_fast_string) EXPORT_SYMBOL(copy_user_enhanced_fast_string)
+/* + * Try to copy last bytes and clear the rest if needed. + * Since protection fault in copy_from/to_user is not a normal situation, + * it is not necessary to optimize tail handling. + * + * Input: + * rdi destination + * rsi source + * rdx count + * + * Output: + * eax uncopied bytes or 0 if successful. + */ +ALIGN; +copy_user_handle_tail: + movl %edx,%ecx +1: rep movsb +2: mov %ecx,%eax + ASM_CLAC + ret + + _ASM_EXTABLE_UA(1b, 2b) +ENDPROC(copy_user_handle_tail) + /* * copy_user_nocache - Uncached memory copy with exception handling * This will force destination out of cache for more performance. diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c index accb61e4920f..975060f5768d 100644 --- a/arch/x86/lib/usercopy_64.c +++ b/arch/x86/lib/usercopy_64.c @@ -54,26 +54,6 @@ unsigned long clear_user(void __user *to, unsigned long n) } EXPORT_SYMBOL(clear_user);
-/* - * Try to copy last bytes and clear the rest if needed. - * Since protection fault in copy_from/to_user is not a normal situation, - * it is not necessary to optimize tail handling. - */ -__visible unsigned long -copy_user_handle_tail(char *to, char *from, unsigned len) -{ - for (; len; --len, to++) { - char c; - - if (__get_user_nocheck(c, from++, sizeof(char))) - break; - if (__put_user_nocheck(c, to, sizeof(char))) - break; - } - clac(); - return len; -} - /* * Similar to copy_user_handle_tail, probe for the write fault point, * but reuse __memcpy_mcsafe in case a new read error is encountered.
From: Chen Yu yu.c.chen@intel.com
mainline inclusion from mainline-v5.9-rc1 commit a472ad2bcea479ba068880125d7273fc95c14b70 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit a472ad2bcea479ba068880125d7273fc95c14b70 upstream
Backport summary: Backport to 4.19.57 for customize ICX intel_idle support
On ICX platform, the C1E auto-promotion is enabled by default. As a result, the CPU might fall into C1E more offen than previous platforms. Besides, the C1E is not exposed to sysfs on ICX, which is inconsistent with previous server platforms.
So disable C1E auto-promotion and expose C1E as a separate idle state, so the C1E and C6 can be disabled via sysfs when necessary.
Beside C1 and C1E, the exit latency of C6 was measured by a dedicated tool. However the exit latency(41us) exposed by _CST is much smaller than the one we measured(128us). This is probably due to the _CST uses the exit latency when woken up from PC0+C6, rather than PC6+C6 when C6 was measured. Choose the latter as we need the longest latency in theory.
Reported-by: kernel test robot lkp@intel.com Tested-by: Artem Bityutskiy artem.bityutskiy@linux.intel.com Acked-by: Artem Bityutskiy artem.bityutskiy@linux.intel.com Reviewed-by: Zhang Rui rui.zhang@intel.com Signed-off-by: Chen Yu yu.c.chen@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: yingbao jia yingbao.jia@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 40 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index a921070184a0..b34a4b2760c0 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -694,6 +694,37 @@ static struct cpuidle_state skx_cstates[] = { .enter = NULL } };
+ +static struct cpuidle_state icx_cstates[] = { + { + .name = "C1", + .desc = "MWAIT 0x00", + .flags = MWAIT2flg(0x00), + .exit_latency = 1, + .target_residency = 1, + .enter = &intel_idle, + .enter_s2idle = intel_idle_s2idle, }, + { + .name = "C1E", + .desc = "MWAIT 0x01", + .flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE, + .exit_latency = 4, + .target_residency = 4, + .enter = &intel_idle, + .enter_s2idle = intel_idle_s2idle, }, + { + .name = "C6", + .desc = "MWAIT 0x20", + .flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED, + .exit_latency = 128, + .target_residency = 384, + .enter = &intel_idle, + .enter_s2idle = intel_idle_s2idle, }, + { + .enter = NULL } +}; + + static struct cpuidle_state atom_cstates[] = { { .name = "C1E", @@ -1097,6 +1128,14 @@ static const struct idle_cpu idle_cpu_skx = { .use_acpi = true, };
+ +static const struct idle_cpu idle_cpu_icx = { + .state_table = icx_cstates, + .disable_promotion_to_c1e = true, + .use_acpi = true, +}; + + static const struct idle_cpu idle_cpu_avn = { .state_table = avn_cstates, .disable_promotion_to_c1e = true, @@ -1154,6 +1193,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = { ICPU(INTEL_FAM6_KABYLAKE_MOBILE, idle_cpu_skl), ICPU(INTEL_FAM6_KABYLAKE_DESKTOP, idle_cpu_skl), ICPU(INTEL_FAM6_SKYLAKE_X, idle_cpu_skx), + ICPU(INTEL_FAM6_ICELAKE_X, idle_cpu_icx), ICPU(INTEL_FAM6_XEON_PHI_KNL, idle_cpu_knl), ICPU(INTEL_FAM6_XEON_PHI_KNM, idle_cpu_knl), ICPU(INTEL_FAM6_ATOM_GOLDMONT, idle_cpu_bxt),
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.8-rc1 commit ee5340abab3babb91c1807cea47de4468b2dfc91 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ee5340abab3babb91c1807cea47de4468b2dfc91 upstream
Backport summary: Backport to kernel 4.19.57 for ICX EDAC support
The device ID for configuration agent PCI device and the offset for bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Reviewed-by: Borislav Petkov bp@suse.de Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/i10nm_base.c | 17 +++++++++++++---- drivers/edac/skx_base.c | 13 +++++++++++-- drivers/edac/skx_common.c | 11 +++++------ drivers/edac/skx_common.h | 11 +++++++++-- 4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c index 216c36a0505e..5a47f55f8145 100644 --- a/drivers/edac/i10nm_base.c +++ b/drivers/edac/i10nm_base.c @@ -122,10 +122,16 @@ static int i10nm_get_all_munits(void) return 0; }
+static struct res_config i10nm_cfg = { + .type = I10NM, + .decs_did = 0x3452, + .busno_cfg_offset = 0xcc, +}; + static const struct x86_cpu_id i10nm_cpuids[] = { - { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_TREMONT_X, 0, 0 }, - { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_X, 0, 0 }, - { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_XEON_D, 0, 0 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_TREMONT_X, 0, (kernel_ulong_t)&i10nm_cfg}, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_X, 0, (kernel_ulong_t)&i10nm_cfg}, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_XEON_D, 0, (kernel_ulong_t)&i10nm_cfg}, { } }; MODULE_DEVICE_TABLE(x86cpu, i10nm_cpuids); @@ -235,6 +241,7 @@ static int __init i10nm_init(void) { u8 mc = 0, src_id = 0, node_id = 0; const struct x86_cpu_id *id; + struct res_config *cfg; const char *owner; struct skx_dev *d; int rc, i, off[3] = {0xd0, 0xc8, 0xcc}; @@ -250,11 +257,13 @@ static int __init i10nm_init(void) if (!id) return -ENODEV;
+ cfg = (struct res_config *)id->driver_data; + rc = skx_get_hi_lo(0x09a2, off, &tolm, &tohm); if (rc) return rc;
- rc = skx_get_all_bus_mappings(0x3452, 0xcc, I10NM, &i10nm_edac_list); + rc = skx_get_all_bus_mappings(cfg, &i10nm_edac_list); if (rc < 0) goto fail; if (rc == 0) { diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c index f16ed60055af..ebde975b178b 100644 --- a/drivers/edac/skx_base.c +++ b/drivers/edac/skx_base.c @@ -157,8 +157,14 @@ static int get_all_munits(const struct munit *m) return -ENODEV; }
+static struct res_config skx_cfg = { + .type = SKX, + .decs_did = 0x2016, + .busno_cfg_offset = 0xcc, +}; + static const struct x86_cpu_id skx_cpuids[] = { - { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X, 0, 0 }, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X, 0, (kernel_ulong_t)&skx_cfg}, { } }; MODULE_DEVICE_TABLE(x86cpu, skx_cpuids); @@ -642,6 +648,7 @@ static inline void teardown_skx_debug(void) {} static int __init skx_init(void) { const struct x86_cpu_id *id; + struct res_config *cfg; const struct munit *m; const char *owner; int rc = 0, i, off[3] = {0xd0, 0xd4, 0xd8}; @@ -658,11 +665,13 @@ static int __init skx_init(void) if (!id) return -ENODEV;
+ cfg = (struct res_config *)id->driver_data; + rc = skx_get_hi_lo(0x2034, off, &skx_tolm, &skx_tohm); if (rc) return rc;
- rc = skx_get_all_bus_mappings(0x2016, 0xcc, SKX, &skx_edac_list); + rc = skx_get_all_bus_mappings(cfg, &skx_edac_list); if (rc < 0) goto fail; if (rc == 0) { diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index 9174836ba85d..1f784b461d45 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -197,12 +197,11 @@ static int get_width(u32 mtr) }
/* - * We use the per-socket device @did to count how many sockets are present, + * We use the per-socket device @cfg->did to count how many sockets are present, * and to detemine which PCI buses are associated with each socket. Allocate * and build the full list of all the skx_dev structures that we need here. */ -int skx_get_all_bus_mappings(unsigned int did, int off, enum type type, - struct list_head **list) +int skx_get_all_bus_mappings(struct res_config *cfg, struct list_head **list) { struct pci_dev *pdev, *prev; struct skx_dev *d; @@ -211,7 +210,7 @@ int skx_get_all_bus_mappings(unsigned int did, int off, enum type type,
prev = NULL; for (;;) { - pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, prev); + pdev = pci_get_device(PCI_VENDOR_ID_INTEL, cfg->decs_did, prev); if (!pdev) break; ndev++; @@ -221,7 +220,7 @@ int skx_get_all_bus_mappings(unsigned int did, int off, enum type type, return -ENOMEM; }
- if (pci_read_config_dword(pdev, off, ®)) { + if (pci_read_config_dword(pdev, cfg->busno_cfg_offset, ®)) { kfree(d); pci_dev_put(pdev); skx_printk(KERN_ERR, "Failed to read bus idx\n"); @@ -230,7 +229,7 @@ int skx_get_all_bus_mappings(unsigned int did, int off, enum type type,
d->bus[0] = GET_BITFIELD(reg, 0, 7); d->bus[1] = GET_BITFIELD(reg, 8, 15); - if (type == SKX) { + if (cfg->type == SKX) { d->seg = pci_domain_nr(pdev->bus); d->bus[2] = GET_BITFIELD(reg, 16, 23); d->bus[3] = GET_BITFIELD(reg, 24, 31); diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index 60d1ea669afd..19dd8c099520 100644 --- a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -112,6 +112,14 @@ struct decoded_addr { int bank_group; };
+struct res_config { + enum type type; + /* Configuration agent device ID */ + unsigned int decs_did; + /* Default bus number configuration register offset */ + int busno_cfg_offset; +}; + typedef int (*get_dimm_config_f)(struct mem_ctl_info *mci); typedef bool (*skx_decode_f)(struct decoded_addr *res); typedef void (*skx_show_retry_log_f)(struct decoded_addr *res, char *msg, int len); @@ -123,8 +131,7 @@ void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log); int skx_get_src_id(struct skx_dev *d, int off, u8 *id); int skx_get_node_id(struct skx_dev *d, u8 *id);
-int skx_get_all_bus_mappings(unsigned int did, int off, enum type, - struct list_head **list); +int skx_get_all_bus_mappings(struct res_config *cfg, struct list_head **list);
int skx_get_hi_lo(unsigned int did, int off[], u64 *tolm, u64 *tohm);
From: Qiuxu Zhuo qiuxu.zhuo@intel.com
mainline inclusion from mainline-v5.8-rc1 commit ce20670828c1228ecd37befbdda87a1f87a803b9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit ce20670828c1228ecd37befbdda87a1f87a803b9 upstream
Backport summary: Backport to kernel 4.19.57 for ICX-D EDAC support
The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville servers with CPU stepping >=4, the offset for bus number configuration register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the steppings use the updated 0xd0 offset.
Fix the issue by using the appropriate offset for bus number configuration register according to the CPU model number and stepping.
Reported-by: Jerry Chen jerry.t.chen@intel.com Reported-and-tested-by: Jin Wen wen.jin@intel.com Signed-off-by: Qiuxu Zhuo qiuxu.zhuo@intel.com Signed-off-by: Tony Luck tony.luck@intel.com Reviewed-by: Borislav Petkov bp@suse.de Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/edac/i10nm_base.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c index 5a47f55f8145..1c644cc02487 100644 --- a/drivers/edac/i10nm_base.c +++ b/drivers/edac/i10nm_base.c @@ -122,16 +122,22 @@ static int i10nm_get_all_munits(void) return 0; }
-static struct res_config i10nm_cfg = { +static struct res_config i10nm_cfg0 = { .type = I10NM, .decs_did = 0x3452, .busno_cfg_offset = 0xcc, };
+static struct res_config i10nm_cfg1 = { + .type = I10NM, + .decs_did = 0x3452, + .busno_cfg_offset = 0xd0, +}; + static const struct x86_cpu_id i10nm_cpuids[] = { - { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_TREMONT_X, 0, (kernel_ulong_t)&i10nm_cfg}, - { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_X, 0, (kernel_ulong_t)&i10nm_cfg}, - { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_XEON_D, 0, (kernel_ulong_t)&i10nm_cfg}, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_TREMONT_X, 0, (kernel_ulong_t)&i10nm_cfg0}, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_X, 0, (kernel_ulong_t)&i10nm_cfg0}, + { X86_VENDOR_INTEL, 6, INTEL_FAM6_ICELAKE_XEON_D, 0, (kernel_ulong_t)&i10nm_cfg1}, { } }; MODULE_DEVICE_TABLE(x86cpu, i10nm_cpuids); @@ -259,6 +265,10 @@ static int __init i10nm_init(void)
cfg = (struct res_config *)id->driver_data;
+ /* Newer steppings have different offset for ATOM_TREMONT_D/ICELAKE_X */ + if (boot_cpu_data.x86_stepping >= 4) + cfg->busno_cfg_offset = 0xd0; + rc = skx_get_hi_lo(0x09a2, off, &tolm, &tohm); if (rc) return rc;
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.6-rc1 commit 9749b376be181a98c75b6c2093e6fc30d92e38cc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 9749b376be181a98c75b6c2093e6fc30d92e38cc upstream
Backport summary: Backport to kernel 4.19.57 to support ICX ISST
To discover core-power capability, some new mailbox commands are added. Allow those commands to execute.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/platform/x86/intel_speed_select_if/isst_if_common.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c index 68d75391db57..64a1818eabaa 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_common.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_common.c @@ -48,6 +48,8 @@ static const struct isst_valid_cmd_ranges isst_valid_cmds[] = { {0x7F, 0x00, 0x0B}, {0x7F, 0x10, 0x12}, {0x7F, 0x20, 0x23}, + {0x94, 0x03, 0x03}, + {0x95, 0x03, 0x03}, };
static const struct isst_cmd_set_req_type isst_cmd_set_reqs[] = { @@ -57,6 +59,7 @@ static const struct isst_cmd_set_req_type isst_cmd_set_reqs[] = { {0xD0, 0x03, 0x08}, {0x7F, 0x02, 0x00}, {0x7F, 0x08, 0x00}, + {0x95, 0x03, 0x03}, };
struct isst_cmd {
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.7-rc1 commit 6cc8f6598978b8f30e70bc12f28fbbc9e26227cc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 6cc8f6598978b8f30e70bc12f28fbbc9e26227cc upstream
Backport summary: Backport to kernel 4.19.57 to support ICX ISST
The MMIO driver is not unregistering with the correct type with the ISST common core during module removal. This should be unregistered with ISST_IF_DEV_MMIO instead of ISST_IF_DEV_MBOX.
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c index f7266a115a08..bd3b867a4958 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mmio.c @@ -126,7 +126,7 @@ static void isst_if_remove(struct pci_dev *pdev) struct isst_if_device *punit_dev;
punit_dev = pci_get_drvdata(pdev); - isst_if_cdev_unregister(ISST_IF_DEV_MBOX); + isst_if_cdev_unregister(ISST_IF_DEV_MMIO); mutex_destroy(&punit_dev->mutex); }
From: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com
mainline inclusion from mainline-v5.8-rc1 commit 2adaec46178be3b499e133923af227abbc23fb57 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 2adaec46178be3b499e133923af227abbc23fb57 upstream
Backport summary: Backport to kernel 4.19.57 to support ICX ISST
Fix timeout issue on some Ice Lake servers, where mail box command is timing out before the response,
Signed-off-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../x86/intel_speed_select_if/isst_if_mbox_pci.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c index de4169d0796b..d84e2174cbde 100644 --- a/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c +++ b/drivers/platform/x86/intel_speed_select_if/isst_if_mbox_pci.c @@ -21,13 +21,12 @@ #define PUNIT_MAILBOX_BUSY_BIT 31
/* - * Commands has variable amount of processing time. Most of the commands will - * be done in 0-3 tries, but some takes up to 50. - * The real processing time was observed as 25us for the most of the commands - * at 2GHz. It is possible to optimize this count taking samples on customer - * systems. + * The average time to complete some commands is about 40us. The current + * count is enough to satisfy 40us. But when the firmware is very busy, this + * causes timeout occasionally. So increase to deal with some worst case + * scenarios. Most of the command still complete in few us. */ -#define OS_MAILBOX_RETRY_COUNT 50 +#define OS_MAILBOX_RETRY_COUNT 100
struct isst_if_device { struct mutex mutex;
From: Peter Zijlstra peterz@infradead.org
mainline inclusion from mainline-v5.8-rc1 commit 69ea03b56ed2c7189ccd0b5910ad39f3cad1df21 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
commit 69ea03b56ed2c7189ccd0b5910ad39f3cad1df21 upstream Backport summary: Backport to kernel 4.19.57 to enhance MCA-R for copyin, backporting removes the changes to non-x86 architecture.
Since there are already a number of sites (ARM64, PowerPC) that effectively nest nmi_enter(), make the primitive support this before adding even more.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Marc Zyngier maz@kernel.org Acked-by: Will Deacon will@kernel.org Cc: Michael Ellerman mpe@ellerman.id.au Link: https://lkml.kernel.org/r/20200505134100.864179229@linutronix.de Signed-off-by: Youquan Song youquan.song@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/hardirq.h | 5 ++++- include/linux/preempt.h | 4 ++-- 2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index da0af631ded5..9c4ed2c4b45c 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -65,13 +65,16 @@ extern void irq_exit(void); #define arch_nmi_exit() do { } while (0) #endif
+/* + * nmi_enter() can nest up to 15 times; see NMI_BITS. + */ #define nmi_enter() \ do { \ arch_nmi_enter(); \ printk_nmi_enter(); \ lockdep_off(); \ ftrace_nmi_enter(); \ - BUG_ON(in_nmi()); \ + BUG_ON(in_nmi() == NMI_MASK); \ preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \ rcu_nmi_enter(); \ trace_hardirq_enter(); \ diff --git a/include/linux/preempt.h b/include/linux/preempt.h index c01813c3fbe9..f10333a2b676 100644 --- a/include/linux/preempt.h +++ b/include/linux/preempt.h @@ -26,13 +26,13 @@ * PREEMPT_MASK: 0x000000ff * SOFTIRQ_MASK: 0x0000ff00 * HARDIRQ_MASK: 0x000f0000 - * NMI_MASK: 0x00100000 + * NMI_MASK: 0x00f00000 * PREEMPT_NEED_RESCHED: 0x80000000 */ #define PREEMPT_BITS 8 #define SOFTIRQ_BITS 8 #define HARDIRQ_BITS 4 -#define NMI_BITS 1 +#define NMI_BITS 4
#define PREEMPT_SHIFT 0 #define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS)
From: Kim Phillips kim.phillips@amd.com
mainline inclusion from mainline-v5.10-rc1 commit 26e52558ead4b39c0e0fe7bf08f82f5a9777a412 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 5738891229a2 ("perf/x86/amd: Add support for Large Increment per Cycle Events") mistakenly zeroes the upper 16 bits of the count in set_period(). That's fine for counting with perf stat, but not sampling with perf record when only Large Increment events are being sampled. To enable sampling, we sign extend the upper 16 bits of the merged counter pair as described in the Family 17h PPRs:
"Software wanting to preload a value to a merged counter pair writes the high-order 16-bit value to the low-order 16 bits of the odd counter and then writes the low-order 48-bit value to the even counter. Reading the even counter of the merged counter pair returns the full 64-bit value."
Fixes: 5738891229a2 ("perf/x86/amd: Add support for Large Increment per Cycle Events") Signed-off-by: Kim Phillips kim.phillips@amd.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index d5c95336ec15..4e5873593056 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1264,11 +1264,11 @@ int x86_perf_event_set_period(struct perf_event *event) wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
/* - * Clear the Merge event counter's upper 16 bits since + * Sign extend the Merge event counter's upper 16 bits since * we currently declare a 48-bit counter width */ if (is_counter_pair(hwc)) - wrmsrl(x86_pmu_event_addr(idx + 1), 0); + wrmsrl(x86_pmu_event_addr(idx + 1), 0xffff);
/* * Due to erratum on certan cpu we need
From: "Gustavo A. R. Silva" gustavo@embeddedor.com
mainline inclusion from mainline-v5.1-rc1 commit 8d86f6b4306f4be38b39d420456edf61a7b7119e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through.
This patch fixes the following warnings:
drivers/hwtracing/intel_th/sth.c: In function ‘sth_stm_packet’: drivers/hwtracing/intel_th/sth.c:86:7: warning: this statement may fall through [-Wimplicit-fallthrough=] reg += 4; ~~~~^~~~ drivers/hwtracing/intel_th/sth.c:87:2: note: here case STP_PACKET_XSYNC: ^~~~ drivers/hwtracing/intel_th/sth.c:88:7: warning: this statement may fall through [-Wimplicit-fallthrough=] reg += 8; ~~~~^~~~ drivers/hwtracing/intel_th/sth.c:89:2: note: here case STP_PACKET_TRIG: ^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva gustavo@embeddedor.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/sth.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/hwtracing/intel_th/sth.c b/drivers/hwtracing/intel_th/sth.c index 4b7ae47789d2..3a1f4e650378 100644 --- a/drivers/hwtracing/intel_th/sth.c +++ b/drivers/hwtracing/intel_th/sth.c @@ -84,8 +84,12 @@ static ssize_t notrace sth_stm_packet(struct stm_data *stm_data, /* Global packets (GERR, XSYNC, TRIG) are sent with register writes */ case STP_PACKET_GERR: reg += 4; + /* fall through */ + case STP_PACKET_XSYNC: reg += 8; + /* fall through */ + case STP_PACKET_TRIG: if (flags & STP_PACKET_TIMESTAMPED) reg += 4;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit ba828cc9dcc8ffefd0d17165be9b132bd5232063 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Right now, the driver will create a device node for each output port, with the intent to provide read access to that port's data. However, only the memory ports are readable this way (msc0, msc1). Other output ports don't need device nodes, so remove them.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index a8e750291643..e05c5fdfb597 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -422,6 +422,7 @@ static const struct intel_th_subdevice { unsigned nres; unsigned type; unsigned otype; + bool mknode; unsigned scrpd; int id; } intel_th_subdevices[] = { @@ -456,6 +457,7 @@ static const struct intel_th_subdevice { .name = "msc", .id = 0, .type = INTEL_TH_OUTPUT, + .mknode = true, .otype = GTH_MSU, .scrpd = SCRPD_MEM_IS_PRIM_DEST | SCRPD_MSC0_IS_ENABLED, }, @@ -476,6 +478,7 @@ static const struct intel_th_subdevice { .name = "msc", .id = 1, .type = INTEL_TH_OUTPUT, + .mknode = true, .otype = GTH_MSU, .scrpd = SCRPD_MEM_IS_PRIM_DEST | SCRPD_MSC1_IS_ENABLED, }, @@ -633,7 +636,8 @@ intel_th_subdevice_alloc(struct intel_th *th, goto fail_put_device;
if (subdev->type == INTEL_TH_OUTPUT) { - thdev->dev.devt = MKDEV(th->major, th->num_thdevs); + if (subdev->mknode) + thdev->dev.devt = MKDEV(th->major, th->num_thdevs); thdev->output.type = subdev->otype; thdev->output.port = -1; thdev->output.scratchpad = subdev->scrpd;
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
mainline inclusion from mainline-v5.1-rc1 commit 1d2ef028bf9a9272c658782d9c75f60658770ba4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Use sysfs_match_string() helper instead of open coded variant.
Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/pti.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/hwtracing/intel_th/pti.c b/drivers/hwtracing/intel_th/pti.c index 56694339cb06..0da6b787f553 100644 --- a/drivers/hwtracing/intel_th/pti.c +++ b/drivers/hwtracing/intel_th/pti.c @@ -272,19 +272,17 @@ static ssize_t lpp_dest_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t size) { struct pti_device *pti = dev_get_drvdata(dev); - ssize_t ret = -EINVAL; int i;
- for (i = 0; i < ARRAY_SIZE(lpp_dest_str); i++) - if (sysfs_streq(buf, lpp_dest_str[i])) - break; + i = sysfs_match_string(lpp_dest_str, buf); + if (i < 0) + return i;
- if (i < ARRAY_SIZE(lpp_dest_str) && pti->lpp_dest_mask & BIT(i)) { - pti->lpp_dest = i; - ret = size; - } + if (!(pti->lpp_dest_mask & BIT(i))) + return -EINVAL;
- return ret; + pti->lpp_dest = i; + return size; }
static DEVICE_ATTR_RW(lpp_dest);
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit db73a059de00eed721f13051c0d6ff3e7de90fe8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Currently, MMIO resource numbers in the TH driver core correspond to PCI BAR numbers, because in the beginning there was only the PCI glue layer. This created some confusion when the ACPI glue layer was added.
To avoid confusion and remove glue-specific code from the driver core, split the resource indices between core and glue layers and change the API so that the driver core receives the MMIO resources in the same fixed order. At the same time, make the IRQ always be a parameter to intel_th_alloc() instead of sometimes passing it as a resource.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/acpi.c | 12 ++++++++++-- drivers/hwtracing/intel_th/core.c | 27 +++++++-------------------- drivers/hwtracing/intel_th/intel_th.h | 8 ++++---- drivers/hwtracing/intel_th/pci.c | 15 ++++++++++++--- 4 files changed, 33 insertions(+), 29 deletions(-)
diff --git a/drivers/hwtracing/intel_th/acpi.c b/drivers/hwtracing/intel_th/acpi.c index 87bc3744755f..b528e5b113ff 100644 --- a/drivers/hwtracing/intel_th/acpi.c +++ b/drivers/hwtracing/intel_th/acpi.c @@ -37,15 +37,23 @@ MODULE_DEVICE_TABLE(acpi, intel_th_acpi_ids); static int intel_th_acpi_probe(struct platform_device *pdev) { struct acpi_device *adev = ACPI_COMPANION(&pdev->dev); + struct resource resource[TH_MMIO_END]; const struct acpi_device_id *id; struct intel_th *th; + int i, r, irq = -1;
id = acpi_match_device(intel_th_acpi_ids, &pdev->dev); if (!id) return -ENODEV;
- th = intel_th_alloc(&pdev->dev, (void *)id->driver_data, - pdev->resource, pdev->num_resources, -1); + for (i = 0, r = 0; i < pdev->num_resources && r < TH_MMIO_END; i++) + if (pdev->resource[i].flags & IORESOURCE_IRQ) + irq = pdev->resource[i].start; + else if (pdev->resource[i].flags & IORESOURCE_MEM) + resource[r++] = pdev->resource[i]; + + th = intel_th_alloc(&pdev->dev, (void *)id->driver_data, resource, r, + irq); if (IS_ERR(th)) return PTR_ERR(th);
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index e05c5fdfb597..d9961f23743b 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -491,7 +491,7 @@ static const struct intel_th_subdevice { .flags = IORESOURCE_MEM, }, { - .start = 1, /* use resource[1] */ + .start = TH_MMIO_SW, .end = 0, .flags = IORESOURCE_MEM, }, @@ -584,7 +584,6 @@ intel_th_subdevice_alloc(struct intel_th *th, struct intel_th_device *thdev; struct resource res[3]; unsigned int req = 0; - bool is64bit = false; int r, err;
thdev = intel_th_device_alloc(th, subdev->type, subdev->name, @@ -594,18 +593,12 @@ intel_th_subdevice_alloc(struct intel_th *th,
thdev->drvdata = th->drvdata;
- for (r = 0; r < th->num_resources; r++) - if (th->resource[r].flags & IORESOURCE_MEM_64) { - is64bit = true; - break; - } - memcpy(res, subdev->res, sizeof(struct resource) * subdev->nres);
for (r = 0; r < subdev->nres; r++) { struct resource *devres = th->resource; - int bar = 0; /* cut subdevices' MMIO from resource[0] */ + int bar = TH_MMIO_CONFIG;
/* * Take .end == 0 to mean 'take the whole bar', @@ -614,8 +607,6 @@ intel_th_subdevice_alloc(struct intel_th *th, */ if (!res[r].end && res[r].flags == IORESOURCE_MEM) { bar = res[r].start; - if (is64bit) - bar *= 2; res[r].start = 0; res[r].end = resource_size(&devres[bar]) - 1; } @@ -808,8 +799,7 @@ static const struct file_operations intel_th_output_fops = { /** * intel_th_alloc() - allocate a new Intel TH device and its subdevices * @dev: parent device - * @devres: parent's resources - * @ndevres: number of resources + * @devres: resources indexed by th_mmio_idx * @irq: irq number */ struct intel_th * @@ -819,12 +809,8 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, struct intel_th *th; int err, r;
- if (irq == -1) - for (r = 0; r < ndevres; r++) - if (devres[r].flags & IORESOURCE_IRQ) { - irq = devres[r].start; - break; - } + if (ndevres < TH_MMIO_END) + return ERR_PTR(-EINVAL);
th = kzalloc(sizeof(*th), GFP_KERNEL); if (!th) @@ -845,7 +831,8 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, th->dev = dev; th->drvdata = drvdata;
- th->resource = devres; + for (r = 0; r < ndevres; r++) + th->resource[r] = devres[r]; th->num_resources = ndevres; th->irq = irq;
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 780206dc9012..8c90c8d01867 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -225,9 +225,9 @@ int intel_th_set_output(struct intel_th_device *thdev, unsigned int master); int intel_th_output_enable(struct intel_th *th, unsigned int otype);
-enum { +enum th_mmio_idx { TH_MMIO_CONFIG = 0, - TH_MMIO_SW = 2, + TH_MMIO_SW = 1, TH_MMIO_END, };
@@ -244,7 +244,7 @@ enum { * @hub: "switch" subdevice (GTH) * @resource: resources of the entire controller * @num_thdevs: number of devices in the @thdev array - * @num_resources: number or resources in the @resource array + * @num_resources: number of resources in the @resource array * @irq: irq number * @id: this Intel TH controller's device ID in the system * @major: device node major for output devices @@ -256,7 +256,7 @@ struct intel_th { struct intel_th_device *hub; struct intel_th_drvdata *drvdata;
- struct resource *resource; + struct resource resource[TH_MMIO_END]; int (*activate)(struct intel_th *); void (*deactivate)(struct intel_th *); unsigned int num_thdevs; diff --git a/drivers/hwtracing/intel_th/pci.c b/drivers/hwtracing/intel_th/pci.c index e63a0c24e76b..998c0d20b136 100644 --- a/drivers/hwtracing/intel_th/pci.c +++ b/drivers/hwtracing/intel_th/pci.c @@ -17,7 +17,12 @@
#define DRIVER_NAME "intel_th_pci"
-#define BAR_MASK (BIT(TH_MMIO_CONFIG) | BIT(TH_MMIO_SW)) +enum { + TH_PCI_CONFIG_BAR = 0, + TH_PCI_STH_SW_BAR = 2, +}; + +#define BAR_MASK (BIT(TH_PCI_CONFIG_BAR) | BIT(TH_PCI_STH_SW_BAR))
#define PCI_REG_NPKDSC 0x80 #define NPKDSC_TSACT BIT(5) @@ -66,6 +71,10 @@ static int intel_th_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { struct intel_th_drvdata *drvdata = (void *)id->driver_data; + struct resource resource[TH_MMIO_END] = { + [TH_MMIO_CONFIG] = pdev->resource[TH_PCI_CONFIG_BAR], + [TH_MMIO_SW] = pdev->resource[TH_PCI_STH_SW_BAR], + }; struct intel_th *th; int err;
@@ -77,8 +86,8 @@ static int intel_th_pci_probe(struct pci_dev *pdev, if (err) return err;
- th = intel_th_alloc(&pdev->dev, drvdata, pdev->resource, - DEVICE_COUNT_RESOURCE, pdev->irq); + th = intel_th_alloc(&pdev->dev, drvdata, resource, TH_MMIO_END, + pdev->irq); if (IS_ERR(th)) return PTR_ERR(th);
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 23f667494b4d65d230cf61ad0d0b96c72f3a163c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
If a subdevice requires an MMIO region that wasn't in the resources passed down from the glue layer, don't instantiate it, but don't error out. This means that that particular subdevice doesn't exist for this instance of Intel TH, which is a perfectly normal situation. This applies, for example, to the "rtit" source device.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index d9961f23743b..e1c47ca7b2bd 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -607,6 +607,9 @@ intel_th_subdevice_alloc(struct intel_th *th, */ if (!res[r].end && res[r].flags == IORESOURCE_MEM) { bar = res[r].start; + err = -ENODEV; + if (bar >= th->num_resources) + goto fail_put_device; res[r].start = 0; res[r].end = resource_size(&devres[bar]) - 1; } @@ -745,8 +748,13 @@ static int intel_th_populate(struct intel_th *th)
thdev = intel_th_subdevice_alloc(th, subdev); /* note: caller should free subdevices from th::thdev[] */ - if (IS_ERR(thdev)) + if (IS_ERR(thdev)) { + /* ENODEV for individual subdevices is allowed */ + if (PTR_ERR(thdev) == -ENODEV) + continue; + return PTR_ERR(thdev); + }
th->thdev[th->num_thdevs++] = thdev; } @@ -809,9 +817,6 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, struct intel_th *th; int err, r;
- if (ndevres < TH_MMIO_END) - return ERR_PTR(-EINVAL); - th = kzalloc(sizeof(*th), GFP_KERNEL); if (!th) return ERR_PTR(-ENOMEM);
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit fc027f4ce7c718660e046c3269b303bdbe692fda category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
In some versions of Intel TH, the Software Trace Hub (STH) has a second MMIO BAR dedicated to the input from Intel PT. This calls for a new subdevice that will be enumerated if the corresponding BAR is present.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 18 ++++++++++++++++++ drivers/hwtracing/intel_th/intel_th.h | 1 + drivers/hwtracing/intel_th/pci.c | 11 ++++++++--- 3 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index e1c47ca7b2bd..eec07ed2de29 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -500,6 +500,24 @@ static const struct intel_th_subdevice { .name = "sth", .type = INTEL_TH_SOURCE, }, + { + .nres = 2, + .res = { + { + .start = REG_STH_OFFSET, + .end = REG_STH_OFFSET + REG_STH_LENGTH - 1, + .flags = IORESOURCE_MEM, + }, + { + .start = TH_MMIO_RTIT, + .end = 0, + .flags = IORESOURCE_MEM, + }, + }, + .id = -1, + .name = "rtit", + .type = INTEL_TH_SOURCE, + }, { .nres = 1, .res = { diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 8c90c8d01867..3fca86d78fdd 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -228,6 +228,7 @@ int intel_th_output_enable(struct intel_th *th, unsigned int otype); enum th_mmio_idx { TH_MMIO_CONFIG = 0, TH_MMIO_SW = 1, + TH_MMIO_RTIT = 2, TH_MMIO_END, };
diff --git a/drivers/hwtracing/intel_th/pci.c b/drivers/hwtracing/intel_th/pci.c index 998c0d20b136..843a2a915307 100644 --- a/drivers/hwtracing/intel_th/pci.c +++ b/drivers/hwtracing/intel_th/pci.c @@ -20,6 +20,7 @@ enum { TH_PCI_CONFIG_BAR = 0, TH_PCI_STH_SW_BAR = 2, + TH_PCI_RTIT_BAR = 4, };
#define BAR_MASK (BIT(TH_PCI_CONFIG_BAR) | BIT(TH_PCI_STH_SW_BAR)) @@ -75,8 +76,8 @@ static int intel_th_pci_probe(struct pci_dev *pdev, [TH_MMIO_CONFIG] = pdev->resource[TH_PCI_CONFIG_BAR], [TH_MMIO_SW] = pdev->resource[TH_PCI_STH_SW_BAR], }; + int err, r = TH_MMIO_SW + 1; struct intel_th *th; - int err;
err = pcim_enable_device(pdev); if (err) @@ -86,8 +87,12 @@ static int intel_th_pci_probe(struct pci_dev *pdev, if (err) return err;
- th = intel_th_alloc(&pdev->dev, drvdata, resource, TH_MMIO_END, - pdev->irq); + if (pdev->resource[TH_PCI_RTIT_BAR].start) { + resource[TH_MMIO_RTIT] = pdev->resource[TH_PCI_RTIT_BAR]; + r++; + } + + th = intel_th_alloc(&pdev->dev, drvdata, resource, r, pdev->irq); if (IS_ERR(th)) return PTR_ERR(th);
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 62a593022c32380d040303a5e3d6b67fd9c415bc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Currently, the IRQ is passed between the glue layers and the core as a separate argument, while the MMIO resources are passed as resources. This also limits the number of IRQs thus used to one, while the current versions of Intel TH use a different MSI vector for each interrupt triggering event, of which there are 7.
Change this to pass IRQ in the resources array.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/acpi.c | 10 ++++------ drivers/hwtracing/intel_th/core.c | 23 ++++++++++++++++++----- drivers/hwtracing/intel_th/intel_th.h | 2 +- drivers/hwtracing/intel_th/pci.c | 9 +++++++-- 4 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/drivers/hwtracing/intel_th/acpi.c b/drivers/hwtracing/intel_th/acpi.c index b528e5b113ff..87f9024e4bbb 100644 --- a/drivers/hwtracing/intel_th/acpi.c +++ b/drivers/hwtracing/intel_th/acpi.c @@ -40,20 +40,18 @@ static int intel_th_acpi_probe(struct platform_device *pdev) struct resource resource[TH_MMIO_END]; const struct acpi_device_id *id; struct intel_th *th; - int i, r, irq = -1; + int i, r;
id = acpi_match_device(intel_th_acpi_ids, &pdev->dev); if (!id) return -ENODEV;
for (i = 0, r = 0; i < pdev->num_resources && r < TH_MMIO_END; i++) - if (pdev->resource[i].flags & IORESOURCE_IRQ) - irq = pdev->resource[i].start; - else if (pdev->resource[i].flags & IORESOURCE_MEM) + if (pdev->resource[i].flags & + (IORESOURCE_IRQ | IORESOURCE_MEM)) resource[r++] = pdev->resource[i];
- th = intel_th_alloc(&pdev->dev, (void *)id->driver_data, resource, r, - irq); + th = intel_th_alloc(&pdev->dev, (void *)id->driver_data, resource, r); if (IS_ERR(th)) return PTR_ERR(th);
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index eec07ed2de29..8f709228def1 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -830,10 +830,10 @@ static const struct file_operations intel_th_output_fops = { */ struct intel_th * intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, - struct resource *devres, unsigned int ndevres, int irq) + struct resource *devres, unsigned int ndevres) { + int err, r, nr_mmios = 0; struct intel_th *th; - int err, r;
th = kzalloc(sizeof(*th), GFP_KERNEL); if (!th) @@ -851,13 +851,26 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, err = th->major; goto err_ida; } + th->irq = -1; th->dev = dev; th->drvdata = drvdata;
for (r = 0; r < ndevres; r++) - th->resource[r] = devres[r]; - th->num_resources = ndevres; - th->irq = irq; + switch (devres[r].flags & IORESOURCE_TYPE_BITS) { + case IORESOURCE_MEM: + th->resource[nr_mmios++] = devres[r]; + break; + case IORESOURCE_IRQ: + if (th->irq == -1) + th->irq = devres[r].start; + break; + default: + dev_warn(dev, "Unknown resource type %lx\n", + devres[r].flags); + break; + } + + th->num_resources = nr_mmios;
dev_set_drvdata(dev, th);
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 3fca86d78fdd..6c6eb87e48a0 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -213,7 +213,7 @@ static inline struct intel_th *to_intel_th(struct intel_th_device *thdev)
struct intel_th * intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, - struct resource *devres, unsigned int ndevres, int irq); + struct resource *devres, unsigned int ndevres); void intel_th_free(struct intel_th *th);
int intel_th_driver_register(struct intel_th_driver *thdrv); diff --git a/drivers/hwtracing/intel_th/pci.c b/drivers/hwtracing/intel_th/pci.c index 843a2a915307..a4a5efc8435d 100644 --- a/drivers/hwtracing/intel_th/pci.c +++ b/drivers/hwtracing/intel_th/pci.c @@ -72,7 +72,7 @@ static int intel_th_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { struct intel_th_drvdata *drvdata = (void *)id->driver_data; - struct resource resource[TH_MMIO_END] = { + struct resource resource[TH_MMIO_END + 1] = { [TH_MMIO_CONFIG] = pdev->resource[TH_PCI_CONFIG_BAR], [TH_MMIO_SW] = pdev->resource[TH_PCI_STH_SW_BAR], }; @@ -92,7 +92,12 @@ static int intel_th_pci_probe(struct pci_dev *pdev, r++; }
- th = intel_th_alloc(&pdev->dev, drvdata, resource, r, pdev->irq); + if (pdev->irq > 0) { + resource[r].flags = IORESOURCE_IRQ; + resource[r++].start = pdev->irq; + } + + th = intel_th_alloc(&pdev->dev, drvdata, resource, r); if (IS_ERR(th)) return PTR_ERR(th);
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 7b7036d47c356a40818e516a69ac81a5dcc1613f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Since Intel TH is capable of MSI interrupt signalling, make use of it. The way it works is, each of the 7 interrupt triggering events has its own vector in this mode, as opposed to interrupt line delivery, where all events are signalled via the same line. Failing to enable MSI, the driver falls back to using an interrupt line.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/intel_th.h | 3 +++ drivers/hwtracing/intel_th/pci.c | 16 ++++++++++------ 2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 6c6eb87e48a0..db3ad8ca1c48 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -238,6 +238,9 @@ enum th_mmio_idx { #define TH_CONFIGURABLE_MASTERS 256 #define TH_MSC_MAX 2
+/* Maximum IRQ vectors */ +#define TH_NVEC_MAX 8 + /** * struct intel_th - Intel TH controller * @dev: driver core's device diff --git a/drivers/hwtracing/intel_th/pci.c b/drivers/hwtracing/intel_th/pci.c index a4a5efc8435d..fecba4190bbe 100644 --- a/drivers/hwtracing/intel_th/pci.c +++ b/drivers/hwtracing/intel_th/pci.c @@ -72,11 +72,11 @@ static int intel_th_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { struct intel_th_drvdata *drvdata = (void *)id->driver_data; - struct resource resource[TH_MMIO_END + 1] = { + struct resource resource[TH_MMIO_END + TH_NVEC_MAX] = { [TH_MMIO_CONFIG] = pdev->resource[TH_PCI_CONFIG_BAR], [TH_MMIO_SW] = pdev->resource[TH_PCI_STH_SW_BAR], }; - int err, r = TH_MMIO_SW + 1; + int err, r = TH_MMIO_SW + 1, i; struct intel_th *th;
err = pcim_enable_device(pdev); @@ -92,10 +92,12 @@ static int intel_th_pci_probe(struct pci_dev *pdev, r++; }
- if (pdev->irq > 0) { - resource[r].flags = IORESOURCE_IRQ; - resource[r++].start = pdev->irq; - } + err = pci_alloc_irq_vectors(pdev, 1, 8, PCI_IRQ_ALL_TYPES); + if (err > 0) + for (i = 0; i < err; i++, r++) { + resource[r].flags = IORESOURCE_IRQ; + resource[r].start = pci_irq_vector(pdev, i); + }
th = intel_th_alloc(&pdev->dev, drvdata, resource, r); if (IS_ERR(th)) @@ -114,6 +116,8 @@ static void intel_th_pci_remove(struct pci_dev *pdev) struct intel_th *th = pci_get_drvdata(pdev);
intel_th_free(th); + + pci_free_irq_vectors(pdev); }
static const struct intel_th_drvdata intel_th_2x = {
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit aac8da65174a35749fcf21dbca4c1be314b562b5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
We intend to use the interrupt to detect Last Block condition in the MSU driver, which we can use for double-buffering software-managed data transfers.
Add an interrupt handler to the MSU driver.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 32 ++++++++++++++ drivers/hwtracing/intel_th/intel_th.h | 4 +- drivers/hwtracing/intel_th/msu.c | 64 ++++++++++++++++++++++++++- drivers/hwtracing/intel_th/msu.h | 8 ++++ 4 files changed, 106 insertions(+), 2 deletions(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index 8f709228def1..a82ed64c2662 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -822,6 +822,28 @@ static const struct file_operations intel_th_output_fops = { .llseek = noop_llseek, };
+static irqreturn_t intel_th_irq(int irq, void *data) +{ + struct intel_th *th = data; + irqreturn_t ret = IRQ_NONE; + struct intel_th_driver *d; + int i; + + for (i = 0; i < th->num_thdevs; i++) { + if (th->thdev[i]->type != INTEL_TH_OUTPUT) + continue; + + d = to_intel_th_driver(th->thdev[i]->dev.driver); + if (d && d->irq) + ret |= d->irq(th->thdev[i]); + } + + if (ret == IRQ_NONE) + pr_warn_ratelimited("nobody cared for irq\n"); + + return ret; +} + /** * intel_th_alloc() - allocate a new Intel TH device and its subdevices * @dev: parent device @@ -861,6 +883,12 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata, th->resource[nr_mmios++] = devres[r]; break; case IORESOURCE_IRQ: + err = devm_request_irq(dev, devres[r].start, + intel_th_irq, IRQF_SHARED, + dev_name(dev), th); + if (err) + goto err_chrdev; + if (th->irq == -1) th->irq = devres[r].start; break; @@ -887,6 +915,10 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata,
return th;
+err_chrdev: + __unregister_chrdev(th->major, 0, TH_POSSIBLE_OUTPUTS, + "intel_th/output"); + err_ida: ida_simple_remove(&intel_th_ida, th->id);
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index db3ad8ca1c48..59038215489a 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -8,6 +8,8 @@ #ifndef __INTEL_TH_H__ #define __INTEL_TH_H__
+#include <linux/irqreturn.h> + /* intel_th_device device types */ enum { /* Devices that generate trace data */ @@ -160,7 +162,7 @@ struct intel_th_driver { void (*disable)(struct intel_th_device *thdev, struct intel_th_output *output); /* output ops */ - void (*irq)(struct intel_th_device *thdev); + irqreturn_t (*irq)(struct intel_th_device *thdev); int (*activate)(struct intel_th_device *thdev); void (*deactivate)(struct intel_th_device *thdev); /* file_operations for those who want a device node */ diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 3cdf85b1ce4f..d665087af647 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -102,6 +102,7 @@ struct msc_iter { */ struct msc { void __iomem *reg_base; + void __iomem *msu_base; struct intel_th_device *thdev;
struct list_head win_list; @@ -122,7 +123,8 @@ struct msc {
/* config */ unsigned int enabled : 1, - wrap : 1; + wrap : 1, + do_irq : 1; unsigned int mode; unsigned int burst_len; unsigned int index; @@ -476,6 +478,40 @@ static void msc_buffer_clear_hw_header(struct msc *msc) } }
+static int intel_th_msu_init(struct msc *msc) +{ + u32 mintctl, msusts; + + if (!msc->do_irq) + return 0; + + mintctl = ioread32(msc->msu_base + REG_MSU_MINTCTL); + mintctl |= msc->index ? M1BLIE : M0BLIE; + iowrite32(mintctl, msc->msu_base + REG_MSU_MINTCTL); + if (mintctl != ioread32(msc->msu_base + REG_MSU_MINTCTL)) { + dev_info(msc_dev(msc), "MINTCTL ignores writes: no usable interrupts\n"); + msc->do_irq = 0; + return 0; + } + + msusts = ioread32(msc->msu_base + REG_MSU_MSUSTS); + iowrite32(msusts, msc->msu_base + REG_MSU_MSUSTS); + + return 0; +} + +static void intel_th_msu_deinit(struct msc *msc) +{ + u32 mintctl; + + if (!msc->do_irq) + return; + + mintctl = ioread32(msc->msu_base + REG_MSU_MINTCTL); + mintctl &= msc->index ? ~M1BLIE : ~M0BLIE; + iowrite32(mintctl, msc->msu_base + REG_MSU_MINTCTL); +} + /** * msc_configure() - set up MSC hardware * @msc: the MSC device to configure @@ -1295,6 +1331,21 @@ static int intel_th_msc_init(struct msc *msc) return 0; }
+static irqreturn_t intel_th_msc_interrupt(struct intel_th_device *thdev) +{ + struct msc *msc = dev_get_drvdata(&thdev->dev); + u32 msusts = ioread32(msc->msu_base + REG_MSU_MSUSTS); + u32 mask = msc->index ? MSUSTS_MSC1BLAST : MSUSTS_MSC0BLAST; + + if (!(msusts & mask)) { + if (msc->enabled) + return IRQ_HANDLED; + return IRQ_NONE; + } + + return IRQ_HANDLED; +} + static const char * const msc_mode[] = { [MSC_MODE_SINGLE] = "single", [MSC_MODE_MULTI] = "multi", @@ -1500,10 +1551,19 @@ static int intel_th_msc_probe(struct intel_th_device *thdev) if (!msc) return -ENOMEM;
+ res = intel_th_device_get_resource(thdev, IORESOURCE_IRQ, 1); + if (!res) + msc->do_irq = 1; + msc->index = thdev->id;
msc->thdev = thdev; msc->reg_base = base + msc->index * 0x100; + msc->msu_base = base; + + err = intel_th_msu_init(msc); + if (err) + return err;
err = intel_th_msc_init(msc); if (err) @@ -1520,6 +1580,7 @@ static void intel_th_msc_remove(struct intel_th_device *thdev) int ret;
intel_th_msc_deactivate(thdev); + intel_th_msu_deinit(msc);
/* * Buffers should not be used at this point except if the @@ -1533,6 +1594,7 @@ static void intel_th_msc_remove(struct intel_th_device *thdev) static struct intel_th_driver intel_th_msc_driver = { .probe = intel_th_msc_probe, .remove = intel_th_msc_remove, + .irq = intel_th_msc_interrupt, .activate = intel_th_msc_activate, .deactivate = intel_th_msc_deactivate, .fops = &intel_th_msc_fops, diff --git a/drivers/hwtracing/intel_th/msu.h b/drivers/hwtracing/intel_th/msu.h index 9cc8aced6116..e8cb819a3804 100644 --- a/drivers/hwtracing/intel_th/msu.h +++ b/drivers/hwtracing/intel_th/msu.h @@ -11,6 +11,7 @@ enum { REG_MSU_MSUPARAMS = 0x0000, REG_MSU_MSUSTS = 0x0008, + REG_MSU_MINTCTL = 0x0004, /* MSU-global interrupt control */ REG_MSU_MSC0CTL = 0x0100, /* MSC0 control */ REG_MSU_MSC0STS = 0x0104, /* MSC0 status */ REG_MSU_MSC0BAR = 0x0108, /* MSC0 output base address */ @@ -28,6 +29,8 @@ enum {
/* MSUSTS bits */ #define MSUSTS_MSU_INT BIT(0) +#define MSUSTS_MSC0BLAST BIT(16) +#define MSUSTS_MSC1BLAST BIT(24)
/* MSCnCTL bits */ #define MSC_EN BIT(0) @@ -36,6 +39,11 @@ enum { #define MSC_MODE (BIT(4) | BIT(5)) #define MSC_LEN (BIT(8) | BIT(9) | BIT(10))
+/* MINTCTL bits */ +#define MICDE BIT(0) +#define M0BLIE BIT(16) +#define M1BLIE BIT(24) + /* MSC operating modes (MSC_MODE) */ enum { MSC_MODE_SINGLE = 0,
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 4c5bb6eb4055adcefaeb5da56dfbecf7df3695d7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The only type of IRQ triggering event that is useful to us at the moment is the "last block" interrupt of the MSU. This interrupt can only be enabled via "MINTCTL" register that doesn't exist in earlier version of the Intel TH.
Enumerate the presence of MINTCTL via per-device driver data structure and only instantiate the IRQ resource for subdevices if this capability is present.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 7 ++++++- drivers/hwtracing/intel_th/intel_th.h | 2 ++ drivers/hwtracing/intel_th/pci.c | 1 + 3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index a82ed64c2662..b7bb258bcdd0 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -639,7 +639,12 @@ intel_th_subdevice_alloc(struct intel_th *th, dev_dbg(th->dev, "%s:%d @ %pR\n", subdev->name, r, &res[r]); } else if (res[r].flags & IORESOURCE_IRQ) { - res[r].start = th->irq; + /* + * Only pass on the IRQ if we have useful interrupts: + * the ones that can be configured via MINTCTL. + */ + if (INTEL_TH_CAP(th, has_mintctl) && th->irq != -1) + res[r].start = th->irq; } }
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 59038215489a..8b986ba160e4 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -44,10 +44,12 @@ struct intel_th_output { /** * struct intel_th_drvdata - describes hardware capabilities and quirks * @tscu_enable: device needs SW to enable time stamping unit + * @has_mintctl: device has interrupt control (MINTCTL) register * @host_mode_only: device can only operate in 'host debugger' mode */ struct intel_th_drvdata { unsigned int tscu_enable : 1, + has_mintctl : 1, host_mode_only : 1; };
diff --git a/drivers/hwtracing/intel_th/pci.c b/drivers/hwtracing/intel_th/pci.c index fecba4190bbe..e9d90b53bbc4 100644 --- a/drivers/hwtracing/intel_th/pci.c +++ b/drivers/hwtracing/intel_th/pci.c @@ -122,6 +122,7 @@ static void intel_th_pci_remove(struct pci_dev *pdev)
static const struct intel_th_drvdata intel_th_2x = { .tscu_enable = 1, + .has_mintctl = 1, };
static const struct pci_device_id intel_th_pci_id_table[] = {
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 0de9e0351d4d4266b3e3f4ecf405c21af61a1f9e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
There are a few places in the code where open-coded versions of list entry accessors list_first_entry()/list_last_entry()/list_next_entry() are used.
Replace those with the standard macros.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index d665087af647..b4c79f2d40cf 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -179,7 +179,7 @@ static struct msc_window *msc_oldest_window(struct msc *msc) return win; }
- return list_entry(msc->win_list.next, struct msc_window, entry); + return list_first_entry(&msc->win_list, struct msc_window, entry); }
/** @@ -230,10 +230,10 @@ static inline bool msc_is_last_win(struct msc_window *win) static struct msc_window *msc_next_window(struct msc_window *win) { if (msc_is_last_win(win)) - return list_entry(win->msc->win_list.next, struct msc_window, - entry); + return list_first_entry(&win->msc->win_list, struct msc_window, + entry);
- return list_entry(win->entry.next, struct msc_window, entry); + return list_next_entry(win, entry); }
static struct msc_block_desc *msc_iter_bdesc(struct msc_iter *iter) @@ -759,8 +759,9 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) return -ENOMEM;
if (!list_empty(&msc->win_list)) { - struct msc_window *prev = list_entry(msc->win_list.prev, - struct msc_window, entry); + struct msc_window *prev = list_last_entry(&msc->win_list, + struct msc_window, + entry);
win->pgoff = prev->pgoff + prev->nr_blocks; } @@ -863,11 +864,10 @@ static void msc_buffer_relink(struct msc *msc) */ if (msc_is_last_win(win)) { sw_tag |= MSC_SW_TAG_LASTWIN; - next_win = list_entry(msc->win_list.next, - struct msc_window, entry); + next_win = list_first_entry(&msc->win_list, + struct msc_window, entry); } else { - next_win = list_entry(win->entry.next, - struct msc_window, entry); + next_win = list_next_entry(win, entry); }
for (blk = 0; blk < win->nr_blocks; blk++) {
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit ba39bd8306057fb343dfb75d93a76d824b625236 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Instead of using a home-grown array of pointers to the DMA pages, switch over to scatterlist data types and accessors, which has all the convenient accessors, can be used to batch-map DMA memory and is convenient for passing around between different layers, which will be useful when MSU buffer management has to cross the boundaries of the MSU driver.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 163 ++++++++++++++++++++----------- 1 file changed, 104 insertions(+), 59 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index b4c79f2d40cf..0f41d526d433 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -28,29 +28,19 @@
#define msc_dev(x) (&(x)->thdev->dev)
-/** - * struct msc_block - multiblock mode block descriptor - * @bdesc: pointer to hardware descriptor (beginning of the block) - * @addr: physical address of the block - */ -struct msc_block { - struct msc_block_desc *bdesc; - dma_addr_t addr; -}; - /** * struct msc_window - multiblock mode window descriptor * @entry: window list linkage (msc::win_list) * @pgoff: page offset into the buffer that this window starts at * @nr_blocks: number of blocks (pages) in this window - * @block: array of block descriptors + * @sgt: array of block descriptors */ struct msc_window { struct list_head entry; unsigned long pgoff; unsigned int nr_blocks; struct msc *msc; - struct msc_block block[0]; + struct sg_table sgt; };
/** @@ -143,6 +133,24 @@ static inline bool msc_block_is_empty(struct msc_block_desc *bdesc) return false; }
+static inline struct msc_block_desc * +msc_win_block(struct msc_window *win, unsigned int block) +{ + return sg_virt(&win->sgt.sgl[block]); +} + +static inline dma_addr_t +msc_win_baddr(struct msc_window *win, unsigned int block) +{ + return sg_dma_address(&win->sgt.sgl[block]); +} + +static inline unsigned long +msc_win_bpfn(struct msc_window *win, unsigned int block) +{ + return msc_win_baddr(win, block) >> PAGE_SHIFT; +} + /** * msc_oldest_window() - locate the window with oldest data * @msc: MSC device @@ -168,11 +176,11 @@ static struct msc_window *msc_oldest_window(struct msc *msc) * something like 2, in which case we're good */ list_for_each_entry(win, &msc->win_list, entry) { - if (win->block[0].addr == win_addr) + if (sg_dma_address(win->sgt.sgl) == win_addr) found++;
/* skip the empty ones */ - if (msc_block_is_empty(win->block[0].bdesc)) + if (msc_block_is_empty(msc_win_block(win, 0))) continue;
if (found) @@ -191,7 +199,7 @@ static struct msc_window *msc_oldest_window(struct msc *msc) static unsigned int msc_win_oldest_block(struct msc_window *win) { unsigned int blk; - struct msc_block_desc *bdesc = win->block[0].bdesc; + struct msc_block_desc *bdesc = msc_win_block(win, 0);
/* without wrapping, first block is the oldest */ if (!msc_block_wrapped(bdesc)) @@ -202,7 +210,7 @@ static unsigned int msc_win_oldest_block(struct msc_window *win) * oldest data for this window. */ for (blk = 0; blk < win->nr_blocks; blk++) { - bdesc = win->block[blk].bdesc; + bdesc = msc_win_block(win, blk);
if (msc_block_last_written(bdesc)) return blk; @@ -238,7 +246,7 @@ static struct msc_window *msc_next_window(struct msc_window *win)
static struct msc_block_desc *msc_iter_bdesc(struct msc_iter *iter) { - return iter->win->block[iter->block].bdesc; + return msc_win_block(iter->win, iter->block); }
static void msc_iter_init(struct msc_iter *iter) @@ -471,7 +479,7 @@ static void msc_buffer_clear_hw_header(struct msc *msc) offsetof(struct msc_block_desc, hw_tag);
for (blk = 0; blk < win->nr_blocks; blk++) { - struct msc_block_desc *bdesc = win->block[blk].bdesc; + struct msc_block_desc *bdesc = msc_win_block(win, blk);
memset(&bdesc->hw_tag, 0, hw_sz); } @@ -734,6 +742,40 @@ static struct page *msc_buffer_contig_get_page(struct msc *msc, return virt_to_page(msc->base + (pgoff << PAGE_SHIFT)); }
+static int __msc_buffer_win_alloc(struct msc_window *win, + unsigned int nr_blocks) +{ + struct scatterlist *sg_ptr; + void *block; + int i, ret; + + ret = sg_alloc_table(&win->sgt, nr_blocks, GFP_KERNEL); + if (ret) + return -ENOMEM; + + for_each_sg(win->sgt.sgl, sg_ptr, nr_blocks, i) { + block = dma_alloc_coherent(msc_dev(win->msc)->parent->parent, + PAGE_SIZE, &sg_dma_address(sg_ptr), + GFP_KERNEL); + if (!block) + goto err_nomem; + + sg_set_buf(sg_ptr, block, PAGE_SIZE); + } + + return nr_blocks; + +err_nomem: + for (i--; i >= 0; i--) + dma_free_coherent(msc_dev(win->msc)->parent->parent, PAGE_SIZE, + msc_win_block(win, i), + msc_win_baddr(win, i)); + + sg_free_table(&win->sgt); + + return -ENOMEM; +} + /** * msc_buffer_win_alloc() - alloc a window for a multiblock mode * @msc: MSC device @@ -747,45 +789,48 @@ static struct page *msc_buffer_contig_get_page(struct msc *msc, static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) { struct msc_window *win; - unsigned long size = PAGE_SIZE; - int i, ret = -ENOMEM; + int ret = -ENOMEM, i;
if (!nr_blocks) return 0;
- win = kzalloc(offsetof(struct msc_window, block[nr_blocks]), - GFP_KERNEL); + /* + * This limitation hold as long as we need random access to the + * block. When that changes, this can go away. + */ + if (nr_blocks > SG_MAX_SINGLE_ALLOC) + return -EINVAL; + + win = kzalloc(sizeof(*win), GFP_KERNEL); if (!win) return -ENOMEM;
+ win->msc = msc; + if (!list_empty(&msc->win_list)) { struct msc_window *prev = list_last_entry(&msc->win_list, struct msc_window, entry);
+ /* This works as long as blocks are page-sized */ win->pgoff = prev->pgoff + prev->nr_blocks; }
- for (i = 0; i < nr_blocks; i++) { - win->block[i].bdesc = - dma_alloc_coherent(msc_dev(msc)->parent->parent, size, - &win->block[i].addr, GFP_KERNEL); - - if (!win->block[i].bdesc) - goto err_nomem; + ret = __msc_buffer_win_alloc(win, nr_blocks); + if (ret < 0) + goto err_nomem;
#ifdef CONFIG_X86 + for (i = 0; i < ret; i++) /* Set the page as uncached */ - set_memory_uc((unsigned long)win->block[i].bdesc, 1); + set_memory_uc((unsigned long)msc_win_block(win, i), 1); #endif - }
- win->msc = msc; - win->nr_blocks = nr_blocks; + win->nr_blocks = ret;
if (list_empty(&msc->win_list)) { - msc->base = win->block[0].bdesc; - msc->base_addr = win->block[0].addr; + msc->base = msc_win_block(win, 0); + msc->base_addr = msc_win_baddr(win, 0); }
list_add_tail(&win->entry, &msc->win_list); @@ -794,19 +839,25 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) return 0;
err_nomem: - for (i--; i >= 0; i--) { -#ifdef CONFIG_X86 - /* Reset the page to write-back before releasing */ - set_memory_wb((unsigned long)win->block[i].bdesc, 1); -#endif - dma_free_coherent(msc_dev(msc)->parent->parent, size, - win->block[i].bdesc, win->block[i].addr); - } kfree(win);
return ret; }
+static void __msc_buffer_win_free(struct msc *msc, struct msc_window *win) +{ + int i; + + for (i = 0; i < win->nr_blocks; i++) { + struct page *page = sg_page(&win->sgt.sgl[i]); + + page->mapping = NULL; + dma_free_coherent(msc_dev(win->msc)->parent->parent, PAGE_SIZE, + msc_win_block(win, i), msc_win_baddr(win, i)); + } + sg_free_table(&win->sgt); +} + /** * msc_buffer_win_free() - free a window from MSC's window list * @msc: MSC device @@ -827,17 +878,13 @@ static void msc_buffer_win_free(struct msc *msc, struct msc_window *win) msc->base_addr = 0; }
- for (i = 0; i < win->nr_blocks; i++) { - struct page *page = virt_to_page(win->block[i].bdesc); - - page->mapping = NULL; #ifdef CONFIG_X86 - /* Reset the page to write-back before releasing */ - set_memory_wb((unsigned long)win->block[i].bdesc, 1); + for (i = 0; i < win->nr_blocks; i++) + /* Reset the page to write-back */ + set_memory_wb((unsigned long)msc_win_block(win, i), 1); #endif - dma_free_coherent(msc_dev(win->msc)->parent->parent, PAGE_SIZE, - win->block[i].bdesc, win->block[i].addr); - } + + __msc_buffer_win_free(msc, win);
kfree(win); } @@ -871,11 +918,11 @@ static void msc_buffer_relink(struct msc *msc) }
for (blk = 0; blk < win->nr_blocks; blk++) { - struct msc_block_desc *bdesc = win->block[blk].bdesc; + struct msc_block_desc *bdesc = msc_win_block(win, blk);
memset(bdesc, 0, sizeof(*bdesc));
- bdesc->next_win = next_win->block[0].addr >> PAGE_SHIFT; + bdesc->next_win = msc_win_bpfn(next_win, 0);
/* * Similarly to last window, last block should point @@ -883,11 +930,9 @@ static void msc_buffer_relink(struct msc *msc) */ if (blk == win->nr_blocks - 1) { sw_tag |= MSC_SW_TAG_LASTBLK; - bdesc->next_blk = - win->block[0].addr >> PAGE_SHIFT; + bdesc->next_blk = msc_win_bpfn(win, 0); } else { - bdesc->next_blk = - win->block[blk + 1].addr >> PAGE_SHIFT; + bdesc->next_blk = msc_win_bpfn(win, blk + 1); }
bdesc->sw_tag = sw_tag; @@ -1062,7 +1107,7 @@ static struct page *msc_buffer_get_page(struct msc *msc, unsigned long pgoff)
found: pgoff -= win->pgoff; - return virt_to_page(win->block[pgoff].bdesc); + return sg_page(&win->sgt.sgl[pgoff]); }
/**
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 8d4155126e32fdceda7e1520e77a7beac2217f4a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The code that waits for the pipeline empty condition of the MSU is currently called in the path that disables the trace. We will also need this in the window switch trigger sequence. Therefore, factor out this code and make it accessible to the GTH device.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/gth.c | 7 +++++++ drivers/hwtracing/intel_th/intel_th.h | 4 ++++ drivers/hwtracing/intel_th/msu.c | 28 +++++++++++++++++---------- 3 files changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/hwtracing/intel_th/gth.c b/drivers/hwtracing/intel_th/gth.c index edc52d75e6bd..7e6b70047125 100644 --- a/drivers/hwtracing/intel_th/gth.c +++ b/drivers/hwtracing/intel_th/gth.c @@ -469,6 +469,10 @@ static void intel_th_gth_disable(struct intel_th_device *thdev, struct intel_th_output *output) { struct gth_device *gth = dev_get_drvdata(&thdev->dev); + struct intel_th_device *outdev = + container_of(output, struct intel_th_device, output); + struct intel_th_driver *outdrv = + to_intel_th_driver(outdev->dev.driver); unsigned long count; int master; u32 reg; @@ -492,6 +496,9 @@ static void intel_th_gth_disable(struct intel_th_device *thdev, cpu_relax(); }
+ if (outdrv->wait_empty) + outdrv->wait_empty(outdev); + /* clear force capture done for next captures */ iowrite32(0xfc, gth->base + REG_GTH_SCR2);
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 8b986ba160e4..74a6a4071e7f 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -20,6 +20,8 @@ enum { INTEL_TH_SWITCH, };
+struct intel_th_device; + /** * struct intel_th_output - descriptor INTEL_TH_OUTPUT type devices * @port: output port number, assigned by the switch @@ -27,6 +29,7 @@ enum { * @scratchpad: scratchpad bits to flag when this output is enabled * @multiblock: true for multiblock output configuration * @active: true when this output is enabled + * @wait_empty: wait for device pipeline to be empty * * Output port descriptor, used by switch driver to tell which output * port this output device corresponds to. Filled in at output device's @@ -165,6 +168,7 @@ struct intel_th_driver { struct intel_th_output *output); /* output ops */ irqreturn_t (*irq)(struct intel_th_device *thdev); + void (*wait_empty)(struct intel_th_device *thdev); int (*activate)(struct intel_th_device *thdev); void (*deactivate)(struct intel_th_device *thdev); /* file_operations for those who want a device node */ diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 0f41d526d433..bede06c3663a 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -577,23 +577,14 @@ static int msc_configure(struct msc *msc) */ static void msc_disable(struct msc *msc) { - unsigned long count; u32 reg;
lockdep_assert_held(&msc->buf_mutex);
intel_th_trace_disable(msc->thdev);
- for (reg = 0, count = MSC_PLE_WAITLOOP_DEPTH; - count && !(reg & MSCSTS_PLE); count--) { - reg = ioread32(msc->reg_base + REG_MSU_MSC0STS); - cpu_relax(); - } - - if (!count) - dev_dbg(msc_dev(msc), "timeout waiting for MSC0 PLE\n"); - if (msc->mode == MSC_MODE_SINGLE) { + reg = ioread32(msc->reg_base + REG_MSU_MSC0STS); msc->single_wrap = !!(reg & MSCSTS_WRAPSTAT);
reg = ioread32(msc->reg_base + REG_MSU_MSC0MWP); @@ -1360,6 +1351,22 @@ static const struct file_operations intel_th_msc_fops = { .owner = THIS_MODULE, };
+static void intel_th_msc_wait_empty(struct intel_th_device *thdev) +{ + struct msc *msc = dev_get_drvdata(&thdev->dev); + unsigned long count; + u32 reg; + + for (reg = 0, count = MSC_PLE_WAITLOOP_DEPTH; + count && !(reg & MSCSTS_PLE); count--) { + reg = __raw_readl(msc->reg_base + REG_MSU_MSC0STS); + cpu_relax(); + } + + if (!count) + dev_dbg(msc_dev(msc), "timeout waiting for MSC0 PLE\n"); +} + static int intel_th_msc_init(struct msc *msc) { atomic_set(&msc->user_count, -1); @@ -1640,6 +1647,7 @@ static struct intel_th_driver intel_th_msc_driver = { .probe = intel_th_msc_probe, .remove = intel_th_msc_remove, .irq = intel_th_msc_interrupt, + .wait_empty = intel_th_msc_wait_empty, .activate = intel_th_msc_activate, .deactivate = intel_th_msc_deactivate, .fops = &intel_th_msc_fops,
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 9958e02523eea424d4848ef426c5aa7e07e4e207 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The trace enable/disable functions of the GTH include the code that starts and stops trace flom from the sources. This start/stop functionality will also be used in the window switch trigger sequence.
Factor out start/stop code from the larger trace enable/disable code in preparation for the window switch sequence.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/gth.c | 93 ++++++++++++++++++++++---------- 1 file changed, 64 insertions(+), 29 deletions(-)
diff --git a/drivers/hwtracing/intel_th/gth.c b/drivers/hwtracing/intel_th/gth.c index 7e6b70047125..94143ef0abaf 100644 --- a/drivers/hwtracing/intel_th/gth.c +++ b/drivers/hwtracing/intel_th/gth.c @@ -457,37 +457,28 @@ static int intel_th_output_attributes(struct gth_device *gth) }
/** - * intel_th_gth_disable() - disable tracing to an output device - * @thdev: GTH device - * @output: output device's descriptor + * intel_th_gth_stop() - stop tracing to an output device + * @gth: GTH device + * @output: output device's descriptor + * @capture_done: set when no more traces will be captured * - * This will deconfigure all masters set to output to this device, - * disable tracing using force storeEn off signal and wait for the - * "pipeline empty" bit for corresponding output port. + * This will stop tracing using force storeEn off signal and wait for the + * pipelines to be empty for the corresponding output port. */ -static void intel_th_gth_disable(struct intel_th_device *thdev, - struct intel_th_output *output) +static void intel_th_gth_stop(struct gth_device *gth, + struct intel_th_output *output, + bool capture_done) { - struct gth_device *gth = dev_get_drvdata(&thdev->dev); struct intel_th_device *outdev = container_of(output, struct intel_th_device, output); struct intel_th_driver *outdrv = to_intel_th_driver(outdev->dev.driver); unsigned long count; - int master; u32 reg; - - spin_lock(>h->gth_lock); - output->active = false; - - for_each_set_bit(master, gth->output[output->port].master, - TH_CONFIGURABLE_MASTERS) { - gth_master_set(gth, master, -1); - } - spin_unlock(>h->gth_lock); + u32 scr2 = 0xfc | (capture_done ? 1 : 0);
iowrite32(0, gth->base + REG_GTH_SCR); - iowrite32(0xfd, gth->base + REG_GTH_SCR2); + iowrite32(scr2, gth->base + REG_GTH_SCR2);
/* wait on pipeline empty for the given port */ for (reg = 0, count = GTH_PLE_WAITLOOP_DEPTH; @@ -496,15 +487,63 @@ static void intel_th_gth_disable(struct intel_th_device *thdev, cpu_relax(); }
+ if (!count) + dev_dbg(gth->dev, "timeout waiting for GTH[%d] PLE\n", + output->port); + + /* wait on output piepline empty */ if (outdrv->wait_empty) outdrv->wait_empty(outdev);
/* clear force capture done for next captures */ iowrite32(0xfc, gth->base + REG_GTH_SCR2); +}
- if (!count) - dev_dbg(&thdev->dev, "timeout waiting for GTH[%d] PLE\n", - output->port); +/** + * intel_th_gth_start() - start tracing to an output device + * @gth: GTH device + * @output: output device's descriptor + * + * This will start tracing using force storeEn signal. + */ +static void intel_th_gth_start(struct gth_device *gth, + struct intel_th_output *output) +{ + u32 scr = 0xfc0000; + + if (output->multiblock) + scr |= 0xff; + + iowrite32(scr, gth->base + REG_GTH_SCR); + iowrite32(0, gth->base + REG_GTH_SCR2); +} + +/** + * intel_th_gth_disable() - disable tracing to an output device + * @thdev: GTH device + * @output: output device's descriptor + * + * This will deconfigure all masters set to output to this device, + * disable tracing using force storeEn off signal and wait for the + * "pipeline empty" bit for corresponding output port. + */ +static void intel_th_gth_disable(struct intel_th_device *thdev, + struct intel_th_output *output) +{ + struct gth_device *gth = dev_get_drvdata(&thdev->dev); + int master; + u32 reg; + + spin_lock(>h->gth_lock); + output->active = false; + + for_each_set_bit(master, gth->output[output->port].master, + TH_CONFIGURABLE_MASTERS + 1) { + gth_master_set(gth, master, -1); + } + spin_unlock(>h->gth_lock); + + intel_th_gth_stop(gth, output, true);
reg = ioread32(gth->base + REG_GTH_SCRPD0); reg &= ~output->scratchpad; @@ -533,8 +572,8 @@ static void intel_th_gth_enable(struct intel_th_device *thdev, { struct gth_device *gth = dev_get_drvdata(&thdev->dev); struct intel_th *th = to_intel_th(thdev); - u32 scr = 0xfc0000, scrpd; int master; + u32 scrpd;
spin_lock(>h->gth_lock); for_each_set_bit(master, gth->output[output->port].master, @@ -542,9 +581,6 @@ static void intel_th_gth_enable(struct intel_th_device *thdev, gth_master_set(gth, master, output->port); }
- if (output->multiblock) - scr |= 0xff; - output->active = true; spin_unlock(>h->gth_lock);
@@ -555,8 +591,7 @@ static void intel_th_gth_enable(struct intel_th_device *thdev, scrpd |= output->scratchpad; iowrite32(scrpd, gth->base + REG_GTH_SCRPD0);
- iowrite32(scr, gth->base + REG_GTH_SCR); - iowrite32(0, gth->base + REG_GTH_SCR2); + intel_th_gth_start(gth, output); }
/**
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 8116db57cf1618cb21dab957952e7bd1395430da category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Add support for asserting window switch trigger when tracing to MSU output ports. This allows for software controlled switching between windows of the MSU buffer, which can be used for double buffering while exporting the trace data further from the MSU.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 25 ++++++++++++++++-- drivers/hwtracing/intel_th/gth.c | 37 +++++++++++++++++++++++++++ drivers/hwtracing/intel_th/gth.h | 19 ++++++++++++++ drivers/hwtracing/intel_th/intel_th.h | 6 +++++ 4 files changed, 85 insertions(+), 2 deletions(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index b7bb258bcdd0..5cab5194c9b9 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -430,9 +430,9 @@ static const struct intel_th_subdevice { .nres = 1, .res = { { - /* Handle TSCU from GTH driver */ + /* Handle TSCU and CTS from GTH driver */ .start = REG_GTH_OFFSET, - .end = REG_TSCU_OFFSET + REG_TSCU_LENGTH - 1, + .end = REG_CTS_OFFSET + REG_CTS_LENGTH - 1, .flags = IORESOURCE_MEM, }, }, @@ -983,6 +983,27 @@ int intel_th_trace_enable(struct intel_th_device *thdev) } EXPORT_SYMBOL_GPL(intel_th_trace_enable);
+/** + * intel_th_trace_switch() - execute a switch sequence + * @thdev: output device that requests tracing switch + */ +int intel_th_trace_switch(struct intel_th_device *thdev) +{ + struct intel_th_device *hub = to_intel_th_device(thdev->dev.parent); + struct intel_th_driver *hubdrv = to_intel_th_driver(hub->dev.driver); + + if (WARN_ON_ONCE(hub->type != INTEL_TH_SWITCH)) + return -EINVAL; + + if (WARN_ON_ONCE(thdev->type != INTEL_TH_OUTPUT)) + return -EINVAL; + + hubdrv->trig_switch(hub, &thdev->output); + + return 0; +} +EXPORT_SYMBOL_GPL(intel_th_trace_switch); + /** * intel_th_trace_disable() - disable tracing for an output device * @thdev: output device that requests tracing be disabled diff --git a/drivers/hwtracing/intel_th/gth.c b/drivers/hwtracing/intel_th/gth.c index 94143ef0abaf..c9672b74761a 100644 --- a/drivers/hwtracing/intel_th/gth.c +++ b/drivers/hwtracing/intel_th/gth.c @@ -308,6 +308,11 @@ static int intel_th_gth_reset(struct gth_device *gth) iowrite32(0, gth->base + REG_GTH_SCR); iowrite32(0xfc, gth->base + REG_GTH_SCR2);
+ /* setup CTS for single trigger */ + iowrite32(CTS_EVENT_ENABLE_IF_ANYTHING, gth->base + REG_CTS_C0S0_EN); + iowrite32(CTS_ACTION_CONTROL_SET_STATE(CTS_STATE_IDLE) | + CTS_ACTION_CONTROL_TRIGGER, gth->base + REG_CTS_C0S0_ACT); + return 0; }
@@ -594,6 +599,37 @@ static void intel_th_gth_enable(struct intel_th_device *thdev, intel_th_gth_start(gth, output); }
+/** + * intel_th_gth_switch() - execute a switch sequence + * @thdev: GTH device + * @output: output device's descriptor + * + * This will execute a switch sequence that will trigger a switch window + * when tracing to MSC in multi-block mode. + */ +static void intel_th_gth_switch(struct intel_th_device *thdev, + struct intel_th_output *output) +{ + struct gth_device *gth = dev_get_drvdata(&thdev->dev); + unsigned long count; + u32 reg; + + /* trigger */ + iowrite32(0, gth->base + REG_CTS_CTL); + iowrite32(CTS_CTL_SEQUENCER_ENABLE, gth->base + REG_CTS_CTL); + /* wait on trigger status */ + for (reg = 0, count = CTS_TRIG_WAITLOOP_DEPTH; + count && !(reg & BIT(4)); count--) { + reg = ioread32(gth->base + REG_CTS_STAT); + cpu_relax(); + } + if (!count) + dev_dbg(&thdev->dev, "timeout waiting for CTS Trigger\n"); + + intel_th_gth_stop(gth, output, false); + intel_th_gth_start(gth, output); +} + /** * intel_th_gth_assign() - assign output device to a GTH output port * @thdev: GTH device @@ -777,6 +813,7 @@ static struct intel_th_driver intel_th_gth_driver = { .unassign = intel_th_gth_unassign, .set_output = intel_th_gth_set_output, .enable = intel_th_gth_enable, + .trig_switch = intel_th_gth_switch, .disable = intel_th_gth_disable, .driver = { .name = "gth", diff --git a/drivers/hwtracing/intel_th/gth.h b/drivers/hwtracing/intel_th/gth.h index 6f2b0b930875..bfcc0fd01177 100644 --- a/drivers/hwtracing/intel_th/gth.h +++ b/drivers/hwtracing/intel_th/gth.h @@ -49,6 +49,12 @@ enum { REG_GTH_SCRPD3 = 0xec, /* ScratchPad[3] */ REG_TSCU_TSUCTRL = 0x2000, /* TSCU control register */ REG_TSCU_TSCUSTAT = 0x2004, /* TSCU status register */ + + /* Common Capture Sequencer (CTS) registers */ + REG_CTS_C0S0_EN = 0x30c0, /* clause_event_enable_c0s0 */ + REG_CTS_C0S0_ACT = 0x3180, /* clause_action_control_c0s0 */ + REG_CTS_STAT = 0x32a0, /* cts_status */ + REG_CTS_CTL = 0x32a4, /* cts_control */ };
/* waiting for Pipeline Empty bit(s) to assert for GTH */ @@ -57,4 +63,17 @@ enum { #define TSUCTRL_CTCRESYNC BIT(0) #define TSCUSTAT_CTCSYNCING BIT(1)
+/* waiting for Trigger status to assert for CTS */ +#define CTS_TRIG_WAITLOOP_DEPTH 10000 + +#define CTS_EVENT_ENABLE_IF_ANYTHING BIT(31) +#define CTS_ACTION_CONTROL_STATE_OFF 27 +#define CTS_ACTION_CONTROL_SET_STATE(x) \ + (((x) & 0x1f) << CTS_ACTION_CONTROL_STATE_OFF) +#define CTS_ACTION_CONTROL_TRIGGER BIT(4) + +#define CTS_STATE_IDLE 0x10u + +#define CTS_CTL_SEQUENCER_ENABLE BIT(0) + #endif /* __INTEL_TH_GTH_H__ */ diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 74a6a4071e7f..0df480072b6c 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -164,6 +164,8 @@ struct intel_th_driver { struct intel_th_device *othdev); void (*enable)(struct intel_th_device *thdev, struct intel_th_output *output); + void (*trig_switch)(struct intel_th_device *thdev, + struct intel_th_output *output); void (*disable)(struct intel_th_device *thdev, struct intel_th_output *output); /* output ops */ @@ -228,6 +230,7 @@ int intel_th_driver_register(struct intel_th_driver *thdrv); void intel_th_driver_unregister(struct intel_th_driver *thdrv);
int intel_th_trace_enable(struct intel_th_device *thdev); +int intel_th_trace_switch(struct intel_th_device *thdev); int intel_th_trace_disable(struct intel_th_device *thdev); int intel_th_set_output(struct intel_th_device *thdev, unsigned int master); @@ -308,6 +311,9 @@ enum { REG_TSCU_OFFSET = 0x2000, REG_TSCU_LENGTH = 0x1000,
+ REG_CTS_OFFSET = 0x3000, + REG_CTS_LENGTH = 0x1000, + /* Software Trace Hub (STH) [0x4000..0x4fff] */ REG_STH_OFFSET = 0x4000, REG_STH_LENGTH = 0x2000,
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 4840572d3d7e66d7b55d3fc3b0f52711fd172eb8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
In multi window mode the MSU will set "window wrap" bit to indicate block wrapping as well. Take this into account when checking data blocks.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hwtracing/intel_th/msu.h b/drivers/hwtracing/intel_th/msu.h index e8cb819a3804..574c16004cb2 100644 --- a/drivers/hwtracing/intel_th/msu.h +++ b/drivers/hwtracing/intel_th/msu.h @@ -95,7 +95,7 @@ static inline unsigned long msc_data_sz(struct msc_block_desc *bdesc)
static inline bool msc_block_wrapped(struct msc_block_desc *bdesc) { - if (bdesc->hw_tag & MSC_HW_TAG_BLOCKWRAP) + if (bdesc->hw_tag & (MSC_HW_TAG_BLOCKWRAP | MSC_HW_TAG_WINWRAP)) return true;
return false;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit 6cac7866c27418cd784fbb8041c2dddd6d966bc7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Now that we have the means to trigger a window switch for the MSU trace store, add a sysfs file to allow triggering it from userspace.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../testing/sysfs-bus-intel_th-devices-msc | 8 ++++++ drivers/hwtracing/intel_th/msu.c | 28 +++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc b/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc index b940c5d91cf7..f54ae244f3f1 100644 --- a/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc +++ b/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc @@ -30,4 +30,12 @@ Description: (RW) Configure MSC buffer size for "single" or "multi" modes. there are no active users and tracing is not enabled) and then allocates a new one.
+What: /sys/bus/intel_th/devices/<intel_th_id>-msc<msc-id>/win_switch +Date: May 2019 +KernelVersion: 5.2 +Contact: Alexander Shishkin alexander.shishkin@linux.intel.com +Description: (RW) Trigger window switch for the MSC's buffer, in + multi-window mode. In "multi" mode, accepts writes of "1", thereby + triggering a window switch for the buffer. Returns an error in any + other operating mode or attempts to write something other than "1".
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index bede06c3663a..94ff72cbf0cb 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -1572,10 +1572,38 @@ nr_pages_store(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(nr_pages);
+static ssize_t +win_switch_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t size) +{ + struct msc *msc = dev_get_drvdata(dev); + unsigned long val; + int ret; + + ret = kstrtoul(buf, 10, &val); + if (ret) + return ret; + + if (val != 1) + return -EINVAL; + + mutex_lock(&msc->buf_mutex); + if (msc->mode != MSC_MODE_MULTI) + ret = -ENOTSUPP; + else + ret = intel_th_trace_switch(msc->thdev); + mutex_unlock(&msc->buf_mutex); + + return ret ? ret : size; +} + +static DEVICE_ATTR_WO(win_switch); + static struct attribute *msc_output_attrs[] = { &dev_attr_wrap.attr, &dev_attr_mode.attr, &dev_attr_nr_pages.attr, + &dev_attr_win_switch.attr, NULL, };
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.2-rc1 commit aad14ad3cf3a63bd258b65e18d49c3eb8472d344 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Now that we have a way to switch between MSC buffer windows, add code to track the current window. The hardware register NWSA that contains the address of the next window is unfortunately not always usable, and since the driver has full control of the window switching, there is no reason not to keep this on the software side.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 79 ++++++++++++++++++++------------ 1 file changed, 49 insertions(+), 30 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 94ff72cbf0cb..024f04826d9e 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -75,6 +75,7 @@ struct msc_iter { * @thdev: intel_th_device pointer * @win_list: list of windows in multiblock mode * @single_sgt: single mode buffer + * @cur_win: current window * @nr_pages: total number of pages allocated for this buffer * @single_sz: amount of data in single mode * @single_wrap: single mode wrap occurred @@ -97,6 +98,7 @@ struct msc {
struct list_head win_list; struct sg_table single_sgt; + struct msc_window *cur_win; unsigned long nr_pages; unsigned long single_sz; unsigned int single_wrap : 1; @@ -151,6 +153,31 @@ msc_win_bpfn(struct msc_window *win, unsigned int block) return msc_win_baddr(win, block) >> PAGE_SHIFT; }
+/** + * msc_is_last_win() - check if a window is the last one for a given MSC + * @win: window + * Return: true if @win is the last window in MSC's multiblock buffer + */ +static inline bool msc_is_last_win(struct msc_window *win) +{ + return win->entry.next == &win->msc->win_list; +} + +/** + * msc_next_window() - return next window in the multiblock buffer + * @win: current window + * + * Return: window following the current one + */ +static struct msc_window *msc_next_window(struct msc_window *win) +{ + if (msc_is_last_win(win)) + return list_first_entry(&win->msc->win_list, struct msc_window, + entry); + + return list_next_entry(win, entry); +} + /** * msc_oldest_window() - locate the window with oldest data * @msc: MSC device @@ -162,9 +189,7 @@ msc_win_bpfn(struct msc_window *win, unsigned int block) */ static struct msc_window *msc_oldest_window(struct msc *msc) { - struct msc_window *win; - u32 reg = ioread32(msc->reg_base + REG_MSU_MSC0NWSA); - unsigned long win_addr = (unsigned long)reg << PAGE_SHIFT; + struct msc_window *win, *next = msc_next_window(msc->cur_win); unsigned int found = 0;
if (list_empty(&msc->win_list)) @@ -176,7 +201,7 @@ static struct msc_window *msc_oldest_window(struct msc *msc) * something like 2, in which case we're good */ list_for_each_entry(win, &msc->win_list, entry) { - if (sg_dma_address(win->sgt.sgl) == win_addr) + if (win == next) found++;
/* skip the empty ones */ @@ -219,31 +244,6 @@ static unsigned int msc_win_oldest_block(struct msc_window *win) return 0; }
-/** - * msc_is_last_win() - check if a window is the last one for a given MSC - * @win: window - * Return: true if @win is the last window in MSC's multiblock buffer - */ -static inline bool msc_is_last_win(struct msc_window *win) -{ - return win->entry.next == &win->msc->win_list; -} - -/** - * msc_next_window() - return next window in the multiblock buffer - * @win: current window - * - * Return: window following the current one - */ -static struct msc_window *msc_next_window(struct msc_window *win) -{ - if (msc_is_last_win(win)) - return list_first_entry(&win->msc->win_list, struct msc_window, - entry); - - return list_next_entry(win, entry); -} - static struct msc_block_desc *msc_iter_bdesc(struct msc_iter *iter) { return msc_win_block(iter->win, iter->block); @@ -822,6 +822,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) if (list_empty(&msc->win_list)) { msc->base = msc_win_block(win, 0); msc->base_addr = msc_win_baddr(win, 0); + msc->cur_win = win; }
list_add_tail(&win->entry, &msc->win_list); @@ -1383,6 +1384,24 @@ static int intel_th_msc_init(struct msc *msc) return 0; }
+static void msc_win_switch(struct msc *msc) +{ + struct msc_window *last, *first; + + first = list_first_entry(&msc->win_list, struct msc_window, entry); + last = list_last_entry(&msc->win_list, struct msc_window, entry); + + if (msc_is_last_win(msc->cur_win)) + msc->cur_win = first; + else + msc->cur_win = list_next_entry(msc->cur_win, entry); + + msc->base = msc_win_block(msc->cur_win, 0); + msc->base_addr = msc_win_baddr(msc->cur_win, 0); + + intel_th_trace_switch(msc->thdev); +} + static irqreturn_t intel_th_msc_interrupt(struct intel_th_device *thdev) { struct msc *msc = dev_get_drvdata(&thdev->dev); @@ -1591,7 +1610,7 @@ win_switch_store(struct device *dev, struct device_attribute *attr, if (msc->mode != MSC_MODE_MULTI) ret = -ENOTSUPP; else - ret = intel_th_trace_switch(msc->thdev); + msc_win_switch(msc); mutex_unlock(&msc->buf_mutex);
return ret ? ret : size;
From: Shaokun Zhang zhangshaokun@hisilicon.com
mainline inclusion from mainline-v5.3-rc1 commit b96fb368b08f1637cbf780a6b83e36c2c5ed4ff5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit ba39bd8306057 ("intel_th: msu: Switch over to scatterlist") introduced the following warnings on non-x86 architectures, as a result of reordering the multi mode buffer allocation sequence:
drivers/hwtracing/intel_th/msu.c: In function ‘msc_buffer_win_alloc’: drivers/hwtracing/intel_th/msu.c:783:21: warning: unused variable ‘i’ [-Wunused-variable] int ret = -ENOMEM, i; ^ drivers/hwtracing/intel_th/msu.c: In function ‘msc_buffer_win_free’: drivers/hwtracing/intel_th/msu.c:863:6: warning: unused variable ‘i’ [-Wunused-variable] int i; ^
Fix this compiler warning by factoring out set_memory sequences and making them x86-only.
Suggested-by: Alexander Shishkin alexander.shishkin@linux.intel.com Signed-off-by: Shaokun Zhang zhangshaokun@hisilicon.com Fixes: ba39bd8306057 ("intel_th: msu: Switch over to scatterlist") Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20190621161930.60785-2-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 40 +++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 024f04826d9e..02cee33c3030 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -767,6 +767,30 @@ static int __msc_buffer_win_alloc(struct msc_window *win, return -ENOMEM; }
+#ifdef CONFIG_X86 +static void msc_buffer_set_uc(struct msc_window *win, unsigned int nr_blocks) +{ + int i; + + for (i = 0; i < nr_blocks; i++) + /* Set the page as uncached */ + set_memory_uc((unsigned long)msc_win_block(win, i), 1); +} + +static void msc_buffer_set_wb(struct msc_window *win) +{ + int i; + + for (i = 0; i < win->nr_blocks; i++) + /* Reset the page to write-back */ + set_memory_wb((unsigned long)msc_win_block(win, i), 1); +} +#else /* !X86 */ +static inline void +msc_buffer_set_uc(struct msc_window *win, unsigned int nr_blocks) {} +static inline void msc_buffer_set_wb(struct msc_window *win) {} +#endif /* CONFIG_X86 */ + /** * msc_buffer_win_alloc() - alloc a window for a multiblock mode * @msc: MSC device @@ -780,7 +804,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win, static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) { struct msc_window *win; - int ret = -ENOMEM, i; + int ret = -ENOMEM;
if (!nr_blocks) return 0; @@ -811,11 +835,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) if (ret < 0) goto err_nomem;
-#ifdef CONFIG_X86 - for (i = 0; i < ret; i++) - /* Set the page as uncached */ - set_memory_uc((unsigned long)msc_win_block(win, i), 1); -#endif + msc_buffer_set_uc(win, ret);
win->nr_blocks = ret;
@@ -860,8 +880,6 @@ static void __msc_buffer_win_free(struct msc *msc, struct msc_window *win) */ static void msc_buffer_win_free(struct msc *msc, struct msc_window *win) { - int i; - msc->nr_pages -= win->nr_blocks;
list_del(&win->entry); @@ -870,11 +888,7 @@ static void msc_buffer_win_free(struct msc *msc, struct msc_window *win) msc->base_addr = 0; }
-#ifdef CONFIG_X86 - for (i = 0; i < win->nr_blocks; i++) - /* Reset the page to write-back */ - set_memory_wb((unsigned long)msc_win_block(win, i), 1); -#endif + msc_buffer_set_wb(win);
__msc_buffer_win_free(msc, win);
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from mainline-v5.3-rc1 commit 9800db282dff675dd700d5985d90b605c34b5ccd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit aad14ad3cf3a ("intel_th: msu: Add current window tracking") added the following gcc warning:
drivers/hwtracing/intel_th/msu.c: In function msc_win_switch: drivers/hwtracing/intel_th/msu.c:1389:21: warning: variable last set but not used [-Wunused-but-set-variable]
Fix it by removing the variable.
Signed-off-by: YueHaibing yuehaibing@huawei.com Fixes: aad14ad3cf3a ("intel_th: msu: Add current window tracking") Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20190621161930.60785-3-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 02cee33c3030..cfd48c81b9d9 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -1400,10 +1400,9 @@ static int intel_th_msc_init(struct msc *msc)
static void msc_win_switch(struct msc *msc) { - struct msc_window *last, *first; + struct msc_window *first;
first = list_first_entry(&msc->win_list, struct msc_window, entry); - last = list_last_entry(&msc->win_list, struct msc_window, entry);
if (msc_is_last_win(msc->cur_win)) msc->cur_win = first;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit fa52b3fe5e9383022f83382f5ffcb386b9b11b4f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Now that the MSU is using scatterlist, we can support multipage blocks. At the moment, the code assumes that all blocks are page-sized, but in larger buffers it may make sense to chunk together larger blocks of memory. One place where one-to-many relationship needs to be handled is the MSU buffer's mmap path.
Get rid of the implicit assumption that all blocks are page-sized.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/20190627125152.54905-3-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 56 ++++++++++++++++++++++---------- 1 file changed, 38 insertions(+), 18 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index cfd48c81b9d9..c9c29f56d093 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -33,12 +33,14 @@ * @entry: window list linkage (msc::win_list) * @pgoff: page offset into the buffer that this window starts at * @nr_blocks: number of blocks (pages) in this window + * @nr_segs: number of segments in this window (<= @nr_blocks) * @sgt: array of block descriptors */ struct msc_window { struct list_head entry; unsigned long pgoff; unsigned int nr_blocks; + unsigned int nr_segs; struct msc *msc; struct sg_table sgt; }; @@ -141,6 +143,12 @@ msc_win_block(struct msc_window *win, unsigned int block) return sg_virt(&win->sgt.sgl[block]); }
+static inline size_t +msc_win_actual_bsz(struct msc_window *win, unsigned int block) +{ + return win->sgt.sgl[block].length; +} + static inline dma_addr_t msc_win_baddr(struct msc_window *win, unsigned int block) { @@ -234,7 +242,7 @@ static unsigned int msc_win_oldest_block(struct msc_window *win) * with wrapping, last written block contains both the newest and the * oldest data for this window. */ - for (blk = 0; blk < win->nr_blocks; blk++) { + for (blk = 0; blk < win->nr_segs; blk++) { bdesc = msc_win_block(win, blk);
if (msc_block_last_written(bdesc)) @@ -366,7 +374,7 @@ static int msc_iter_block_advance(struct msc_iter *iter) return msc_iter_win_advance(iter);
/* block advance */ - if (++iter->block == iter->win->nr_blocks) + if (++iter->block == iter->win->nr_segs) iter->block = 0;
/* no wrapping, sanity check in case there is no last written block */ @@ -478,7 +486,7 @@ static void msc_buffer_clear_hw_header(struct msc *msc) size_t hw_sz = sizeof(struct msc_block_desc) - offsetof(struct msc_block_desc, hw_tag);
- for (blk = 0; blk < win->nr_blocks; blk++) { + for (blk = 0; blk < win->nr_segs; blk++) { struct msc_block_desc *bdesc = msc_win_block(win, blk);
memset(&bdesc->hw_tag, 0, hw_sz); @@ -734,17 +742,17 @@ static struct page *msc_buffer_contig_get_page(struct msc *msc, }
static int __msc_buffer_win_alloc(struct msc_window *win, - unsigned int nr_blocks) + unsigned int nr_segs) { struct scatterlist *sg_ptr; void *block; int i, ret;
- ret = sg_alloc_table(&win->sgt, nr_blocks, GFP_KERNEL); + ret = sg_alloc_table(&win->sgt, nr_segs, GFP_KERNEL); if (ret) return -ENOMEM;
- for_each_sg(win->sgt.sgl, sg_ptr, nr_blocks, i) { + for_each_sg(win->sgt.sgl, sg_ptr, nr_segs, i) { block = dma_alloc_coherent(msc_dev(win->msc)->parent->parent, PAGE_SIZE, &sg_dma_address(sg_ptr), GFP_KERNEL); @@ -754,7 +762,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win, sg_set_buf(sg_ptr, block, PAGE_SIZE); }
- return nr_blocks; + return nr_segs;
err_nomem: for (i--; i >= 0; i--) @@ -768,11 +776,11 @@ static int __msc_buffer_win_alloc(struct msc_window *win, }
#ifdef CONFIG_X86 -static void msc_buffer_set_uc(struct msc_window *win, unsigned int nr_blocks) +static void msc_buffer_set_uc(struct msc_window *win, unsigned int nr_segs) { int i;
- for (i = 0; i < nr_blocks; i++) + for (i = 0; i < nr_segs; i++) /* Set the page as uncached */ set_memory_uc((unsigned long)msc_win_block(win, i), 1); } @@ -781,13 +789,13 @@ static void msc_buffer_set_wb(struct msc_window *win) { int i;
- for (i = 0; i < win->nr_blocks; i++) + for (i = 0; i < win->nr_segs; i++) /* Reset the page to write-back */ set_memory_wb((unsigned long)msc_win_block(win, i), 1); } #else /* !X86 */ static inline void -msc_buffer_set_uc(struct msc_window *win, unsigned int nr_blocks) {} +msc_buffer_set_uc(struct msc_window *win, unsigned int nr_segs) {} static inline void msc_buffer_set_wb(struct msc_window *win) {} #endif /* CONFIG_X86 */
@@ -827,7 +835,6 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) struct msc_window, entry);
- /* This works as long as blocks are page-sized */ win->pgoff = prev->pgoff + prev->nr_blocks; }
@@ -837,7 +844,8 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks)
msc_buffer_set_uc(win, ret);
- win->nr_blocks = ret; + win->nr_segs = ret; + win->nr_blocks = nr_blocks;
if (list_empty(&msc->win_list)) { msc->base = msc_win_block(win, 0); @@ -860,7 +868,7 @@ static void __msc_buffer_win_free(struct msc *msc, struct msc_window *win) { int i;
- for (i = 0; i < win->nr_blocks; i++) { + for (i = 0; i < win->nr_segs; i++) { struct page *page = sg_page(&win->sgt.sgl[i]);
page->mapping = NULL; @@ -923,7 +931,7 @@ static void msc_buffer_relink(struct msc *msc) next_win = list_next_entry(win, entry); }
- for (blk = 0; blk < win->nr_blocks; blk++) { + for (blk = 0; blk < win->nr_segs; blk++) { struct msc_block_desc *bdesc = msc_win_block(win, blk);
memset(bdesc, 0, sizeof(*bdesc)); @@ -934,7 +942,7 @@ static void msc_buffer_relink(struct msc *msc) * Similarly to last window, last block should point * to the first one. */ - if (blk == win->nr_blocks - 1) { + if (blk == win->nr_segs - 1) { sw_tag |= MSC_SW_TAG_LASTBLK; bdesc->next_blk = msc_win_bpfn(win, 0); } else { @@ -942,7 +950,7 @@ static void msc_buffer_relink(struct msc *msc) }
bdesc->sw_tag = sw_tag; - bdesc->block_sz = PAGE_SIZE / 64; + bdesc->block_sz = msc_win_actual_bsz(win, blk) / 64; } }
@@ -1101,6 +1109,7 @@ static int msc_buffer_free_unless_used(struct msc *msc) static struct page *msc_buffer_get_page(struct msc *msc, unsigned long pgoff) { struct msc_window *win; + unsigned int blk;
if (msc->mode == MSC_MODE_SINGLE) return msc_buffer_contig_get_page(msc, pgoff); @@ -1113,7 +1122,18 @@ static struct page *msc_buffer_get_page(struct msc *msc, unsigned long pgoff)
found: pgoff -= win->pgoff; - return sg_page(&win->sgt.sgl[pgoff]); + + for (blk = 0; blk < win->nr_segs; blk++) { + struct page *page = sg_page(&win->sgt.sgl[blk]); + size_t pgsz = PFN_DOWN(msc_win_actual_bsz(win, blk)); + + if (pgoff < pgsz) + return page + pgoff; + + pgoff -= pgsz; + } + + return NULL; }
/**
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit bbbc08a154a1750371f8347a5353819551d11abb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
To allow the use of externally allocated SG tables further down the line, change the code to reference the table via a pointer and make it point to the locally allocated table by default.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/20190627125152.54905-4-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index c9c29f56d093..aedfd8629e58 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -34,6 +34,7 @@ * @pgoff: page offset into the buffer that this window starts at * @nr_blocks: number of blocks (pages) in this window * @nr_segs: number of segments in this window (<= @nr_blocks) + * @_sgt: array of block descriptors * @sgt: array of block descriptors */ struct msc_window { @@ -42,7 +43,8 @@ struct msc_window { unsigned int nr_blocks; unsigned int nr_segs; struct msc *msc; - struct sg_table sgt; + struct sg_table _sgt; + struct sg_table *sgt; };
/** @@ -140,19 +142,19 @@ static inline bool msc_block_is_empty(struct msc_block_desc *bdesc) static inline struct msc_block_desc * msc_win_block(struct msc_window *win, unsigned int block) { - return sg_virt(&win->sgt.sgl[block]); + return sg_virt(&win->sgt->sgl[block]); }
static inline size_t msc_win_actual_bsz(struct msc_window *win, unsigned int block) { - return win->sgt.sgl[block].length; + return win->sgt->sgl[block].length; }
static inline dma_addr_t msc_win_baddr(struct msc_window *win, unsigned int block) { - return sg_dma_address(&win->sgt.sgl[block]); + return sg_dma_address(&win->sgt->sgl[block]); }
static inline unsigned long @@ -748,11 +750,11 @@ static int __msc_buffer_win_alloc(struct msc_window *win, void *block; int i, ret;
- ret = sg_alloc_table(&win->sgt, nr_segs, GFP_KERNEL); + ret = sg_alloc_table(win->sgt, nr_segs, GFP_KERNEL); if (ret) return -ENOMEM;
- for_each_sg(win->sgt.sgl, sg_ptr, nr_segs, i) { + for_each_sg(win->sgt->sgl, sg_ptr, nr_segs, i) { block = dma_alloc_coherent(msc_dev(win->msc)->parent->parent, PAGE_SIZE, &sg_dma_address(sg_ptr), GFP_KERNEL); @@ -770,7 +772,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win, msc_win_block(win, i), msc_win_baddr(win, i));
- sg_free_table(&win->sgt); + sg_free_table(win->sgt);
return -ENOMEM; } @@ -829,6 +831,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) return -ENOMEM;
win->msc = msc; + win->sgt = &win->_sgt;
if (!list_empty(&msc->win_list)) { struct msc_window *prev = list_last_entry(&msc->win_list, @@ -869,13 +872,13 @@ static void __msc_buffer_win_free(struct msc *msc, struct msc_window *win) int i;
for (i = 0; i < win->nr_segs; i++) { - struct page *page = sg_page(&win->sgt.sgl[i]); + struct page *page = sg_page(&win->sgt->sgl[i]);
page->mapping = NULL; dma_free_coherent(msc_dev(win->msc)->parent->parent, PAGE_SIZE, msc_win_block(win, i), msc_win_baddr(win, i)); } - sg_free_table(&win->sgt); + sg_free_table(win->sgt); }
/** @@ -1124,7 +1127,7 @@ static struct page *msc_buffer_get_page(struct msc *msc, unsigned long pgoff) pgoff -= win->pgoff;
for (blk = 0; blk < win->nr_segs; blk++) { - struct page *page = sg_page(&win->sgt.sgl[blk]); + struct page *page = sg_page(&win->sgt->sgl[blk]); size_t pgsz = PFN_DOWN(msc_win_actual_bsz(win, blk));
if (pgoff < pgsz)
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.3-rc1 commit f505e91ef511a20b2bc3d0eaca41faecdb715027 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
In multi-window mode, the read iterator is supposed to start from the window with the oldest data, which is, chronologically, the next window after the one with the newest data. This, however, fails to take into account the potentially empty windows, so in short trace sessions it's possible to have a lot of zeroes read from the character device first.
Fix this by skipping over the empty windows in initialization of the read iterator.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/20190627125152.54905-5-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 42 +++++++++++++++++++++++++------- 1 file changed, 33 insertions(+), 9 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index aedfd8629e58..8ab28e5fb366 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -189,17 +189,18 @@ static struct msc_window *msc_next_window(struct msc_window *win) }
/** - * msc_oldest_window() - locate the window with oldest data + * msc_find_window() - find a window matching a given sg_table * @msc: MSC device + * @sgt: SG table of the window + * @nonempty: skip over empty windows * - * This should only be used in multiblock mode. Caller should hold the - * msc::user_count reference. - * - * Return: the oldest window with valid data + * Return: MSC window structure pointer or NULL if the window + * could not be found. */ -static struct msc_window *msc_oldest_window(struct msc *msc) +static struct msc_window * +msc_find_window(struct msc *msc, struct sg_table *sgt, bool nonempty) { - struct msc_window *win, *next = msc_next_window(msc->cur_win); + struct msc_window *win; unsigned int found = 0;
if (list_empty(&msc->win_list)) @@ -211,17 +212,40 @@ static struct msc_window *msc_oldest_window(struct msc *msc) * something like 2, in which case we're good */ list_for_each_entry(win, &msc->win_list, entry) { - if (win == next) + if (win->sgt == sgt) found++;
/* skip the empty ones */ - if (msc_block_is_empty(msc_win_block(win, 0))) + if (nonempty && msc_block_is_empty(msc_win_block(win, 0))) continue;
if (found) return win; }
+ return NULL; +} + +/** + * msc_oldest_window() - locate the window with oldest data + * @msc: MSC device + * + * This should only be used in multiblock mode. Caller should hold the + * msc::user_count reference. + * + * Return: the oldest window with valid data + */ +static struct msc_window *msc_oldest_window(struct msc *msc) +{ + struct msc_window *win; + + if (list_empty(&msc->win_list)) + return NULL; + + win = msc_find_window(msc, msc_next_window(msc->cur_win)->sgt, true); + if (win) + return win; + return list_first_entry(&msc->win_list, struct msc_window, entry); }
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.4-rc1 commit 615c164da0eb42cbfb1688cb429cc4d5039db5d8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Introduces a concept of external buffers, which is a mechanism for creating trace sinks that would receive trace data from MSC buffers and transfer it elsewhere.
A external buffer can implement its own window allocation/deallocation if it has to. It must provide a callback that's used to notify it when a window fills up, so that it can then start a DMA transaction from that window 'elsewhere'. This window remains in a 'locked' state and won't be used for storing new trace data until the buffer 'unlocks' it with a provided API call, at which point the window can be used again for storing trace data.
This relies on a functional "last block" interrupt, so not all versions of Trace Hub can use this feature, which does not reflect on existing users.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/20190705141425.19894-2-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../testing/sysfs-bus-intel_th-devices-msc | 3 +- MAINTAINERS | 1 + drivers/hwtracing/intel_th/msu.c | 390 +++++++++++++++++- drivers/hwtracing/intel_th/msu.h | 20 +- include/linux/intel_th.h | 79 ++++ 5 files changed, 462 insertions(+), 31 deletions(-) create mode 100644 include/linux/intel_th.h
diff --git a/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc b/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc index f54ae244f3f1..456cb62b384c 100644 --- a/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc +++ b/Documentation/ABI/testing/sysfs-bus-intel_th-devices-msc @@ -12,7 +12,8 @@ Description: (RW) Configure MSC operating mode: - "single", for contiguous buffer mode (high-order alloc); - "multi", for multiblock mode; - "ExI", for DCI handler mode; - - "debug", for debug mode. + - "debug", for debug mode; + - any of the currently loaded buffer sinks. If operating mode changes, existing buffer is deallocated, provided there are no active users and tracing is not enabled, otherwise the write will fail. diff --git a/MAINTAINERS b/MAINTAINERS index e01480643d53..d03a2927f434 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7638,6 +7638,7 @@ M: Alexander Shishkin alexander.shishkin@linux.intel.com S: Supported F: Documentation/trace/intel_th.rst F: drivers/hwtracing/intel_th/ +F: include/linux/intel_th.h
INTEL(R) TRUSTED EXECUTION TECHNOLOGY (TXT) M: Ning Sun ning.sun@intel.com diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 8ab28e5fb366..f6f12ce2c783 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -17,21 +17,48 @@ #include <linux/mm.h> #include <linux/fs.h> #include <linux/io.h> +#include <linux/workqueue.h> #include <linux/dma-mapping.h>
#ifdef CONFIG_X86 #include <asm/set_memory.h> #endif
+#include <linux/intel_th.h> #include "intel_th.h" #include "msu.h"
#define msc_dev(x) (&(x)->thdev->dev)
+/* + * Lockout state transitions: + * READY -> INUSE -+-> LOCKED -+-> READY -> etc. + * -----------/ + * WIN_READY: window can be used by HW + * WIN_INUSE: window is in use + * WIN_LOCKED: window is filled up and is being processed by the buffer + * handling code + * + * All state transitions happen automatically, except for the LOCKED->READY, + * which needs to be signalled by the buffer code by calling + * intel_th_msc_window_unlock(). + * + * When the interrupt handler has to switch to the next window, it checks + * whether it's READY, and if it is, it performs the switch and tracing + * continues. If it's LOCKED, it stops the trace. + */ +enum lockout_state { + WIN_READY = 0, + WIN_INUSE, + WIN_LOCKED +}; + /** * struct msc_window - multiblock mode window descriptor * @entry: window list linkage (msc::win_list) * @pgoff: page offset into the buffer that this window starts at + * @lockout: lockout state, see comment below + * @lo_lock: lockout state serialization * @nr_blocks: number of blocks (pages) in this window * @nr_segs: number of segments in this window (<= @nr_blocks) * @_sgt: array of block descriptors @@ -40,6 +67,8 @@ struct msc_window { struct list_head entry; unsigned long pgoff; + enum lockout_state lockout; + spinlock_t lo_lock; unsigned int nr_blocks; unsigned int nr_segs; struct msc *msc; @@ -77,6 +106,8 @@ struct msc_iter { * struct msc - MSC device representation * @reg_base: register window base address * @thdev: intel_th_device pointer + * @mbuf: MSU buffer, if assigned + * @mbuf_priv MSU buffer's private data, if @mbuf * @win_list: list of windows in multiblock mode * @single_sgt: single mode buffer * @cur_win: current window @@ -100,6 +131,10 @@ struct msc { void __iomem *msu_base; struct intel_th_device *thdev;
+ const struct msu_buffer *mbuf; + void *mbuf_priv; + + struct work_struct work; struct list_head win_list; struct sg_table single_sgt; struct msc_window *cur_win; @@ -126,6 +161,101 @@ struct msc { unsigned int index; };
+static LIST_HEAD(msu_buffer_list); +static struct mutex msu_buffer_mutex; + +/** + * struct msu_buffer_entry - internal MSU buffer bookkeeping + * @entry: link to msu_buffer_list + * @mbuf: MSU buffer object + * @owner: module that provides this MSU buffer + */ +struct msu_buffer_entry { + struct list_head entry; + const struct msu_buffer *mbuf; + struct module *owner; +}; + +static struct msu_buffer_entry *__msu_buffer_entry_find(const char *name) +{ + struct msu_buffer_entry *mbe; + + lockdep_assert_held(&msu_buffer_mutex); + + list_for_each_entry(mbe, &msu_buffer_list, entry) { + if (!strcmp(mbe->mbuf->name, name)) + return mbe; + } + + return NULL; +} + +static const struct msu_buffer * +msu_buffer_get(const char *name) +{ + struct msu_buffer_entry *mbe; + + mutex_lock(&msu_buffer_mutex); + mbe = __msu_buffer_entry_find(name); + if (mbe && !try_module_get(mbe->owner)) + mbe = NULL; + mutex_unlock(&msu_buffer_mutex); + + return mbe ? mbe->mbuf : NULL; +} + +static void msu_buffer_put(const struct msu_buffer *mbuf) +{ + struct msu_buffer_entry *mbe; + + mutex_lock(&msu_buffer_mutex); + mbe = __msu_buffer_entry_find(mbuf->name); + if (mbe) + module_put(mbe->owner); + mutex_unlock(&msu_buffer_mutex); +} + +int intel_th_msu_buffer_register(const struct msu_buffer *mbuf, + struct module *owner) +{ + struct msu_buffer_entry *mbe; + int ret = 0; + + mbe = kzalloc(sizeof(*mbe), GFP_KERNEL); + if (!mbe) + return -ENOMEM; + + mutex_lock(&msu_buffer_mutex); + if (__msu_buffer_entry_find(mbuf->name)) { + ret = -EEXIST; + kfree(mbe); + goto unlock; + } + + mbe->mbuf = mbuf; + mbe->owner = owner; + list_add_tail(&mbe->entry, &msu_buffer_list); +unlock: + mutex_unlock(&msu_buffer_mutex); + + return ret; +} +EXPORT_SYMBOL_GPL(intel_th_msu_buffer_register); + +void intel_th_msu_buffer_unregister(const struct msu_buffer *mbuf) +{ + struct msu_buffer_entry *mbe; + + mutex_lock(&msu_buffer_mutex); + mbe = __msu_buffer_entry_find(mbuf->name); + if (mbe) { + list_del(&mbe->entry); + kfree(mbe); + } + mutex_unlock(&msu_buffer_mutex); +} +EXPORT_SYMBOL_GPL(intel_th_msu_buffer_unregister); + static inline bool msc_block_is_empty(struct msc_block_desc *bdesc) { /* header hasn't been written */ @@ -188,6 +318,25 @@ static struct msc_window *msc_next_window(struct msc_window *win) return list_next_entry(win, entry); }
+static size_t msc_win_total_sz(struct msc_window *win) +{ + unsigned int blk; + size_t size = 0; + + for (blk = 0; blk < win->nr_segs; blk++) { + struct msc_block_desc *bdesc = msc_win_block(win, blk); + + if (msc_block_wrapped(bdesc)) + return win->nr_blocks << PAGE_SHIFT; + + size += msc_total_sz(bdesc); + if (msc_block_last_written(bdesc)) + break; + } + + return size; +} + /** * msc_find_window() - find a window matching a given sg_table * @msc: MSC device @@ -527,6 +676,9 @@ static int intel_th_msu_init(struct msc *msc) if (!msc->do_irq) return 0;
+ if (!msc->mbuf) + return 0; + mintctl = ioread32(msc->msu_base + REG_MSU_MINTCTL); mintctl |= msc->index ? M1BLIE : M0BLIE; iowrite32(mintctl, msc->msu_base + REG_MSU_MINTCTL); @@ -554,6 +706,44 @@ static void intel_th_msu_deinit(struct msc *msc) iowrite32(mintctl, msc->msu_base + REG_MSU_MINTCTL); }
+static int msc_win_set_lockout(struct msc_window *win, + enum lockout_state expect, + enum lockout_state new) +{ + enum lockout_state old; + unsigned long flags; + int ret = 0; + + if (!win->msc->mbuf) + return 0; + + spin_lock_irqsave(&win->lo_lock, flags); + old = win->lockout; + + if (old != expect) { + ret = -EINVAL; + dev_warn_ratelimited(msc_dev(win->msc), + "expected lockout state %d, got %d\n", + expect, old); + goto unlock; + } + + win->lockout = new; + +unlock: + spin_unlock_irqrestore(&win->lo_lock, flags); + + if (ret) { + if (expect == WIN_READY && old == WIN_LOCKED) + return -EBUSY; + + /* from intel_th_msc_window_unlock(), don't warn if not locked */ + if (expect == WIN_LOCKED && old == new) + return 0; + } + + return ret; +} /** * msc_configure() - set up MSC hardware * @msc: the MSC device to configure @@ -571,8 +761,12 @@ static int msc_configure(struct msc *msc) if (msc->mode > MSC_MODE_MULTI) return -ENOTSUPP;
- if (msc->mode == MSC_MODE_MULTI) - msc_buffer_clear_hw_header(msc); + if (msc->mode == MSC_MODE_MULTI) { + if (msc_win_set_lockout(msc->cur_win, WIN_READY, WIN_INUSE)) + return -EBUSY; + + msc_buffer_clear_hw_header(msc); + }
reg = msc->base_addr >> PAGE_SHIFT; iowrite32(reg, msc->reg_base + REG_MSU_MSC0BAR); @@ -594,10 +788,14 @@ static int msc_configure(struct msc *msc)
iowrite32(reg, msc->reg_base + REG_MSU_MSC0CTL);
+ intel_th_msu_init(msc); + msc->thdev->output.multiblock = msc->mode == MSC_MODE_MULTI; intel_th_trace_enable(msc->thdev); msc->enabled = 1;
+ if (msc->mbuf && msc->mbuf->activate) + msc->mbuf->activate(msc->mbuf_priv);
return 0; } @@ -611,10 +809,17 @@ static int msc_configure(struct msc *msc) */ static void msc_disable(struct msc *msc) { + struct msc_window *win = msc->cur_win; u32 reg;
lockdep_assert_held(&msc->buf_mutex);
+ if (msc->mode == MSC_MODE_MULTI) + msc_win_set_lockout(win, WIN_INUSE, WIN_LOCKED); + + if (msc->mbuf && msc->mbuf->deactivate) + msc->mbuf->deactivate(msc->mbuf_priv); + intel_th_msu_deinit(msc); intel_th_trace_disable(msc->thdev);
if (msc->mode == MSC_MODE_SINGLE) { @@ -630,6 +835,11 @@ static void msc_disable(struct msc *msc) reg = ioread32(msc->reg_base + REG_MSU_MSC0CTL); reg &= ~MSC_EN; iowrite32(reg, msc->reg_base + REG_MSU_MSC0CTL); + + if (msc->mbuf && msc->mbuf->ready) + msc->mbuf->ready(msc->mbuf_priv, win->sgt, + msc_win_total_sz(win)); + msc->enabled = 0;
iowrite32(0, msc->reg_base + REG_MSU_MSC0BAR); @@ -640,6 +850,10 @@ static void msc_disable(struct msc *msc)
reg = ioread32(msc->reg_base + REG_MSU_MSC0STS); dev_dbg(msc_dev(msc), "MSCnSTS: %08x\n", reg); + + reg = ioread32(msc->reg_base + REG_MSU_MSUSTS); + reg &= msc->index ? MSUSTS_MSC1BLAST : MSUSTS_MSC0BLAST; + iowrite32(reg, msc->reg_base + REG_MSU_MSUSTS); }
static int intel_th_msc_activate(struct intel_th_device *thdev) @@ -856,6 +1070,8 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks)
win->msc = msc; win->sgt = &win->_sgt; + win->lockout = WIN_READY; + spin_lock_init(&win->lo_lock);
if (!list_empty(&msc->win_list)) { struct msc_window *prev = list_last_entry(&msc->win_list, @@ -865,8 +1081,13 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) win->pgoff = prev->pgoff + prev->nr_blocks; }
- ret = __msc_buffer_win_alloc(win, nr_blocks); - if (ret < 0) + if (msc->mbuf && msc->mbuf->alloc_window) + ret = msc->mbuf->alloc_window(msc->mbuf_priv, &win->sgt, + nr_blocks << PAGE_SHIFT); + else + ret = __msc_buffer_win_alloc(win, nr_blocks); + + if (ret <= 0) goto err_nomem;
msc_buffer_set_uc(win, ret); @@ -925,7 +1146,10 @@ static void msc_buffer_win_free(struct msc *msc, struct msc_window *win)
msc_buffer_set_wb(win);
- __msc_buffer_win_free(msc, win); + if (msc->mbuf && msc->mbuf->free_window) + msc->mbuf->free_window(msc->mbuf_priv, win->sgt); + else + __msc_buffer_win_free(msc, win);
kfree(win); } @@ -1462,18 +1686,77 @@ static void msc_win_switch(struct msc *msc) intel_th_trace_switch(msc->thdev); }
+/** + * intel_th_msc_window_unlock - put the window back in rotation + * @dev: MSC device to which this relates + * @sgt: buffer's sg_table for the window, does nothing if NULL + */ +void intel_th_msc_window_unlock(struct device *dev, struct sg_table *sgt) +{ + struct msc *msc = dev_get_drvdata(dev); + struct msc_window *win; + + if (!sgt) + return; + + win = msc_find_window(msc, sgt, false); + if (!win) + return; + + msc_win_set_lockout(win, WIN_LOCKED, WIN_READY); +} +EXPORT_SYMBOL_GPL(intel_th_msc_window_unlock); + +static void msc_work(struct work_struct *work) +{ + struct msc *msc = container_of(work, struct msc, work); + + intel_th_msc_deactivate(msc->thdev); +} + static irqreturn_t intel_th_msc_interrupt(struct intel_th_device *thdev) { struct msc *msc = dev_get_drvdata(&thdev->dev); u32 msusts = ioread32(msc->msu_base + REG_MSU_MSUSTS); u32 mask = msc->index ? MSUSTS_MSC1BLAST : MSUSTS_MSC0BLAST; + struct msc_window *win, *next_win;
- if (!(msusts & mask)) { - if (msc->enabled) - return IRQ_HANDLED; + if (!msc->do_irq || !msc->mbuf) return IRQ_NONE; + + msusts &= mask; + + if (!msusts) + return msc->enabled ? IRQ_HANDLED : IRQ_NONE; + + iowrite32(msusts, msc->msu_base + REG_MSU_MSUSTS); + + if (!msc->enabled) + return IRQ_NONE; + + /* grab the window before we do the switch */ + win = msc->cur_win; + if (!win) + return IRQ_HANDLED; + next_win = msc_next_window(win); + if (!next_win) + return IRQ_HANDLED; + + /* next window: if READY, proceed, if LOCKED, stop the trace */ + if (msc_win_set_lockout(next_win, WIN_READY, WIN_INUSE)) { + schedule_work(&msc->work); + return IRQ_HANDLED; }
+ /* current window: INUSE -> LOCKED */ + msc_win_set_lockout(win, WIN_INUSE, WIN_LOCKED); + + msc_win_switch(msc); + + if (msc->mbuf && msc->mbuf->ready) + msc->mbuf->ready(msc->mbuf_priv, win->sgt, + msc_win_total_sz(win)); + return IRQ_HANDLED; }
@@ -1511,21 +1794,43 @@ wrap_store(struct device *dev, struct device_attribute *attr, const char *buf,
static DEVICE_ATTR_RW(wrap);
+static void msc_buffer_unassign(struct msc *msc) +{ + lockdep_assert_held(&msc->buf_mutex); + + if (!msc->mbuf) + return; + + msc->mbuf->unassign(msc->mbuf_priv); + msu_buffer_put(msc->mbuf); + msc->mbuf_priv = NULL; + msc->mbuf = NULL; +} + static ssize_t mode_show(struct device *dev, struct device_attribute *attr, char *buf) { struct msc *msc = dev_get_drvdata(dev); + const char *mode = msc_mode[msc->mode]; + ssize_t ret; + + mutex_lock(&msc->buf_mutex); + if (msc->mbuf) + mode = msc->mbuf->name; + ret = scnprintf(buf, PAGE_SIZE, "%s\n", mode); + mutex_unlock(&msc->buf_mutex);
- return scnprintf(buf, PAGE_SIZE, "%s\n", msc_mode[msc->mode]); + return ret; }
static ssize_t mode_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t size) { + const struct msu_buffer *mbuf = NULL; struct msc *msc = dev_get_drvdata(dev); size_t len = size; - char *cp; + char *cp, *mode; int i, ret;
if (!capable(CAP_SYS_RAWIO)) @@ -1535,17 +1840,59 @@ mode_store(struct device *dev, struct device_attribute *attr, const char *buf, if (cp) len = cp - buf;
- for (i = 0; i < ARRAY_SIZE(msc_mode); i++) - if (!strncmp(msc_mode[i], buf, len)) - goto found; + mode = kstrndup(buf, len, GFP_KERNEL); + i = match_string(msc_mode, ARRAY_SIZE(msc_mode), mode); + if (i >= 0) + goto found; + + /* Buffer sinks only work with a usable IRQ */ + if (!msc->do_irq) { + kfree(mode); + return -EINVAL; + } + + mbuf = msu_buffer_get(mode); + kfree(mode); + if (mbuf) + goto found;
return -EINVAL;
found: mutex_lock(&msc->buf_mutex); + ret = 0; + + /* Same buffer: do nothing */ + if (mbuf && mbuf == msc->mbuf) { + /* put the extra reference we just got */ + msu_buffer_put(mbuf); + goto unlock; + } + ret = msc_buffer_unlocked_free_unless_used(msc); - if (!ret) - msc->mode = i; + if (ret) + goto unlock; + + if (mbuf) { + void *mbuf_priv = mbuf->assign(dev, &i); + + if (!mbuf_priv) { + ret = -ENOMEM; + goto unlock; + } + + msc_buffer_unassign(msc); + msc->mbuf_priv = mbuf_priv; + msc->mbuf = mbuf; + } else { + msc_buffer_unassign(msc); + } + + msc->mode = i; + +unlock: + if (ret && mbuf) + msu_buffer_put(mbuf); mutex_unlock(&msc->buf_mutex);
return ret ? ret : size; @@ -1667,7 +2014,12 @@ win_switch_store(struct device *dev, struct device_attribute *attr, return -EINVAL;
mutex_lock(&msc->buf_mutex); - if (msc->mode != MSC_MODE_MULTI) + /* + * Window switch can only happen in the "multi" mode. + * If a external buffer is engaged, they have the full + * control over window switching. + */ + if (msc->mode != MSC_MODE_MULTI || msc->mbuf) ret = -ENOTSUPP; else msc_win_switch(msc); @@ -1720,10 +2072,7 @@ static int intel_th_msc_probe(struct intel_th_device *thdev) msc->reg_base = base + msc->index * 0x100; msc->msu_base = base;
- err = intel_th_msu_init(msc); - if (err) - return err; - + INIT_WORK(&msc->work, msc_work); err = intel_th_msc_init(msc); if (err) return err; @@ -1739,7 +2088,6 @@ static void intel_th_msc_remove(struct intel_th_device *thdev) int ret;
intel_th_msc_deactivate(thdev); - intel_th_msu_deinit(msc);
/* * Buffers should not be used at this point except if the diff --git a/drivers/hwtracing/intel_th/msu.h b/drivers/hwtracing/intel_th/msu.h index 574c16004cb2..3f527dd4d727 100644 --- a/drivers/hwtracing/intel_th/msu.h +++ b/drivers/hwtracing/intel_th/msu.h @@ -44,14 +44,6 @@ enum { #define M0BLIE BIT(16) #define M1BLIE BIT(24)
-/* MSC operating modes (MSC_MODE) */ -enum { - MSC_MODE_SINGLE = 0, - MSC_MODE_MULTI, - MSC_MODE_EXI, - MSC_MODE_DEBUG, -}; - /* MSCnSTS bits */ #define MSCSTS_WRAPSTAT BIT(1) /* Wrap occurred */ #define MSCSTS_PLE BIT(2) /* Pipeline Empty */ @@ -93,6 +85,16 @@ static inline unsigned long msc_data_sz(struct msc_block_desc *bdesc) return bdesc->valid_dw * 4 - MSC_BDESC; }
+static inline unsigned long msc_total_sz(struct msc_block_desc *bdesc) +{ + return bdesc->valid_dw * 4; +} + +static inline unsigned long msc_block_sz(struct msc_block_desc *bdesc) +{ + return bdesc->block_sz * 64 - MSC_BDESC; +} + static inline bool msc_block_wrapped(struct msc_block_desc *bdesc) { if (bdesc->hw_tag & (MSC_HW_TAG_BLOCKWRAP | MSC_HW_TAG_WINWRAP)) @@ -104,7 +106,7 @@ static inline bool msc_block_wrapped(struct msc_block_desc *bdesc) static inline bool msc_block_last_written(struct msc_block_desc *bdesc) { if ((bdesc->hw_tag & MSC_HW_TAG_ENDBIT) || - (msc_data_sz(bdesc) != DATA_IN_PAGE)) + (msc_data_sz(bdesc) != msc_block_sz(bdesc))) return true;
return false; diff --git a/include/linux/intel_th.h b/include/linux/intel_th.h new file mode 100644 index 000000000000..9b7f4c22499c --- /dev/null +++ b/include/linux/intel_th.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Intel(R) Trace Hub data structures for implementing buffer sinks. + * + * Copyright (C) 2019 Intel Corporation. + */ + +#ifndef _INTEL_TH_H_ +#define _INTEL_TH_H_ + +#include <linux/scatterlist.h> + +/* MSC operating modes (MSC_MODE) */ +enum { + MSC_MODE_SINGLE = 0, + MSC_MODE_MULTI, + MSC_MODE_EXI, + MSC_MODE_DEBUG, +}; + +struct msu_buffer { + const char *name; + /* + * ->assign() called when buffer 'mode' is set to this driver + * (aka mode_store()) + * @device: struct device * of the msc + * @mode: allows the driver to set HW mode (see the enum above) + * Returns: a pointer to a private structure associated with this + * msc or NULL in case of error. This private structure + * will then be passed into all other callbacks. + */ + void *(*assign)(struct device *dev, int *mode); + /* ->unassign(): some other mode is selected, clean up */ + void (*unassign)(void *priv); + /* + * ->alloc_window(): allocate memory for the window of a given + * size + * @sgt: pointer to sg_table, can be overridden by the buffer + * driver, or kept intact + * Returns: number of sg table entries <= number of pages; + * 0 is treated as an allocation failure. + */ + int (*alloc_window)(void *priv, struct sg_table **sgt, + size_t size); + void (*free_window)(void *priv, struct sg_table *sgt); + /* ->activate(): trace has started */ + void (*activate)(void *priv); + /* ->deactivate(): trace is about to stop */ + void (*deactivate)(void *priv); + /* + * ->ready(): window @sgt is filled up to the last block OR + * tracing is stopped by the user; this window contains + * @bytes data. The window in question transitions into + * the "LOCKED" state, indicating that it can't be used + * by hardware. To clear this state and make the window + * available to the hardware again, call + * intel_th_msc_window_unlock(). + */ + int (*ready)(void *priv, struct sg_table *sgt, size_t bytes); +}; + +int intel_th_msu_buffer_register(const struct msu_buffer *mbuf, + struct module *owner); +void intel_th_msu_buffer_unregister(const struct msu_buffer *mbuf); +void intel_th_msc_window_unlock(struct device *dev, struct sg_table *sgt); + +#define module_intel_th_msu_buffer(__buffer) \ +static int __init __buffer##_init(void) \ +{ \ + return intel_th_msu_buffer_register(&(__buffer), THIS_MODULE); \ +} \ +module_init(__buffer##_init); \ +static void __exit __buffer##_exit(void) \ +{ \ + intel_th_msu_buffer_unregister(&(__buffer)); \ +} \ +module_exit(__buffer##_exit); + +#endif /* _INTEL_TH_H_ */
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.4-rc1 commit f220df66f67684246ae1bf4a4e479efc7c2f325a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
This patch adds an example MSU buffer "sink", which consumes trace data from MSC buffers.
Functionally, it acts similarly to "multi" mode with automatic window switching.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/20190705141425.19894-3-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/Makefile | 3 + drivers/hwtracing/intel_th/msu-sink.c | 116 ++++++++++++++++++++++++++ 2 files changed, 119 insertions(+) create mode 100644 drivers/hwtracing/intel_th/msu-sink.c
diff --git a/drivers/hwtracing/intel_th/Makefile b/drivers/hwtracing/intel_th/Makefile index d9252fa8d9ca..b63eb8f309ad 100644 --- a/drivers/hwtracing/intel_th/Makefile +++ b/drivers/hwtracing/intel_th/Makefile @@ -20,3 +20,6 @@ intel_th_msu-y := msu.o
obj-$(CONFIG_INTEL_TH_PTI) += intel_th_pti.o intel_th_pti-y := pti.o + +obj-$(CONFIG_INTEL_TH_MSU) += intel_th_msu_sink.o +intel_th_msu_sink-y := msu-sink.o diff --git a/drivers/hwtracing/intel_th/msu-sink.c b/drivers/hwtracing/intel_th/msu-sink.c new file mode 100644 index 000000000000..2c7f5116be12 --- /dev/null +++ b/drivers/hwtracing/intel_th/msu-sink.c @@ -0,0 +1,116 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * An example software sink buffer for Intel TH MSU. + * + * Copyright (C) 2019 Intel Corporation. + */ + +#include <linux/intel_th.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/device.h> +#include <linux/dma-mapping.h> + +#define MAX_SGTS 16 + +struct msu_sink_private { + struct device *dev; + struct sg_table **sgts; + unsigned int nr_sgts; +}; + +static void *msu_sink_assign(struct device *dev, int *mode) +{ + struct msu_sink_private *priv; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) + return NULL; + + priv->sgts = kcalloc(MAX_SGTS, sizeof(void *), GFP_KERNEL); + if (!priv->sgts) { + kfree(priv); + return NULL; + } + + priv->dev = dev; + *mode = MSC_MODE_MULTI; + + return priv; +} + +static void msu_sink_unassign(void *data) +{ + struct msu_sink_private *priv = data; + + kfree(priv->sgts); + kfree(priv); +} + +/* See also: msc.c: __msc_buffer_win_alloc() */ +static int msu_sink_alloc_window(void *data, struct sg_table **sgt, size_t size) +{ + struct msu_sink_private *priv = data; + unsigned int nents; + struct scatterlist *sg_ptr; + void *block; + int ret, i; + + if (priv->nr_sgts == MAX_SGTS) + return -ENOMEM; + + nents = DIV_ROUND_UP(size, PAGE_SIZE); + + ret = sg_alloc_table(*sgt, nents, GFP_KERNEL); + if (ret) + return -ENOMEM; + + priv->sgts[priv->nr_sgts++] = *sgt; + + for_each_sg((*sgt)->sgl, sg_ptr, nents, i) { + block = dma_alloc_coherent(priv->dev->parent->parent, + PAGE_SIZE, &sg_dma_address(sg_ptr), + GFP_KERNEL); + sg_set_buf(sg_ptr, block, PAGE_SIZE); + } + + return nents; +} + +/* See also: msc.c: __msc_buffer_win_free() */ +static void msu_sink_free_window(void *data, struct sg_table *sgt) +{ + struct msu_sink_private *priv = data; + struct scatterlist *sg_ptr; + int i; + + for_each_sg(sgt->sgl, sg_ptr, sgt->nents, i) { + dma_free_coherent(priv->dev->parent->parent, PAGE_SIZE, + sg_virt(sg_ptr), sg_dma_address(sg_ptr)); + } + + sg_free_table(sgt); + priv->nr_sgts--; +} + +static int msu_sink_ready(void *data, struct sg_table *sgt, size_t bytes) +{ + struct msu_sink_private *priv = data; + + intel_th_msc_window_unlock(priv->dev, sgt); + + return 0; +} + +static const struct msu_buffer sink_mbuf = { + .name = "sink", + .assign = msu_sink_assign, + .unassign = msu_sink_unassign, + .alloc_window = msu_sink_alloc_window, + .free_window = msu_sink_free_window, + .ready = msu_sink_ready, +}; + +module_intel_th_msu_buffer(sink_mbuf); + +MODULE_LICENSE("GPL v2");
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.0-rc1 commit 0ec712e36c1d5bafe6e65e53a19f136b778866cd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
A recent turbostat release increased topo.max_cpu_num to make it convenient to handle sysfs bitmaps of 32-cpus.
But users, who regularly make use of "--debug", then saw a bunch of output for cpus that were not present.
Remove that extra output by checking a cpu is online before dumping its info.
Signed-off-by: Len Brown len.brown@intel.com Cc: Prarit Bhargava prarit@redhat.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Reviewed-by: 江国庆 jiangguoqing@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/power/x86/turbostat/turbostat.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 99a3bf1bd332..add69424ac54 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -5002,6 +5002,8 @@ void topology_probe() return;
for (i = 0; i <= topo.max_cpu_num; ++i) { + if (cpu_is_not_present(i)) + continue; fprintf(outf, "cpu %d pkg %d node %d lnode %d core %d thread %d\n", i, cpus[i].physical_package_id,
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.0-rc1 commit f5a4c76ad7de96d47baef3d8810a88b10d60ec82 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Often a new processor gets a new model number, but from a turbostat point of view, it is the same as a previous model. Support duplicates with 1-line updates, rather than error-prone scattering of model #'s.
Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Reviewed-by: 江国庆 jiangguoqing@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/power/x86/turbostat/turbostat.c | 85 ++++++++++++--------------- 1 file changed, 39 insertions(+), 46 deletions(-)
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index add69424ac54..47015cb75405 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -3182,13 +3182,8 @@ int probe_nhm_msrs(unsigned int family, unsigned int model) bclk = discover_bclk(family, model);
switch (model) { - case INTEL_FAM6_NEHALEM_EP: /* Core i7, Xeon 5500 series - Bloomfield, Gainstown NHM-EP */ case INTEL_FAM6_NEHALEM: /* Core i7 and i5 Processor - Clarksfield, Lynnfield, Jasper Forest */ - case 0x1F: /* Core i7 and i5 Processor - Nehalem */ - case INTEL_FAM6_WESTMERE: /* Westmere Client - Clarkdale, Arrandale */ - case INTEL_FAM6_WESTMERE_EP: /* Westmere EP - Gulftown */ case INTEL_FAM6_NEHALEM_EX: /* Nehalem-EX Xeon - Beckton */ - case INTEL_FAM6_WESTMERE_EX: /* Westmere-EX Xeon - Eagleton */ pkg_cstate_limits = nhm_pkg_cstate_limits; break; case INTEL_FAM6_SANDYBRIDGE: /* SNB */ @@ -3200,16 +3195,11 @@ int probe_nhm_msrs(unsigned int family, unsigned int model) break; case INTEL_FAM6_HASWELL_CORE: /* HSW */ case INTEL_FAM6_HASWELL_X: /* HSX */ - case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ case INTEL_FAM6_BROADWELL_X: /* BDX */ - case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ pkg_cstate_limits = hsw_pkg_cstate_limits; has_misc_feature_control = 1; @@ -3228,7 +3218,6 @@ int probe_nhm_msrs(unsigned int family, unsigned int model) no_MSR_MISC_PWR_MGMT = 1; break; case INTEL_FAM6_XEON_PHI_KNL: /* PHI */ - case INTEL_FAM6_XEON_PHI_KNM: pkg_cstate_limits = phi_pkg_cstate_limits; break; case INTEL_FAM6_ATOM_GOLDMONT: /* BXT */ @@ -3289,7 +3278,6 @@ int is_bdx(unsigned int family, unsigned int model)
switch (model) { case INTEL_FAM6_BROADWELL_X: - case INTEL_FAM6_BROADWELL_XEON_D: return 1; } return 0; @@ -3315,9 +3303,7 @@ int has_turbo_ratio_limit(unsigned int family, unsigned int model) switch (model) { /* Nehalem compatible, but do not include turbo-ratio limit support */ case INTEL_FAM6_NEHALEM_EX: /* Nehalem-EX Xeon - Beckton */ - case INTEL_FAM6_WESTMERE_EX: /* Westmere-EX Xeon - Eagleton */ case INTEL_FAM6_XEON_PHI_KNL: /* PHI - Knights Landing (different MSR definition) */ - case INTEL_FAM6_XEON_PHI_KNM: return 0; default: return 1; @@ -3372,7 +3358,6 @@ int has_knl_turbo_ratio_limit(unsigned int family, unsigned int model)
switch (model) { case INTEL_FAM6_XEON_PHI_KNL: /* Knights Landing */ - case INTEL_FAM6_XEON_PHI_KNM: return 1; default: return 0; @@ -3406,21 +3391,15 @@ int has_config_tdp(unsigned int family, unsigned int model) case INTEL_FAM6_IVYBRIDGE: /* IVB */ case INTEL_FAM6_HASWELL_CORE: /* HSW */ case INTEL_FAM6_HASWELL_X: /* HSX */ - case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ case INTEL_FAM6_BROADWELL_X: /* BDX */ - case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ case INTEL_FAM6_SKYLAKE_X: /* SKX */
case INTEL_FAM6_XEON_PHI_KNL: /* Knights Landing */ - case INTEL_FAM6_XEON_PHI_KNM: return 1; default: return 0; @@ -3824,9 +3803,7 @@ rapl_dram_energy_units_probe(int model, double rapl_energy_units) switch (model) { case INTEL_FAM6_HASWELL_X: /* HSX */ case INTEL_FAM6_BROADWELL_X: /* BDX */ - case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ case INTEL_FAM6_XEON_PHI_KNL: /* KNL */ - case INTEL_FAM6_XEON_PHI_KNM: return (rapl_dram_energy_units = 15.3 / 1000000); default: return (rapl_energy_units); @@ -3846,7 +3823,6 @@ void rapl_probe_intel(unsigned int family, unsigned int model) case INTEL_FAM6_SANDYBRIDGE: case INTEL_FAM6_IVYBRIDGE: case INTEL_FAM6_HASWELL_CORE: /* HSW */ - case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ @@ -3870,9 +3846,6 @@ void rapl_probe_intel(unsigned int family, unsigned int model) BIC_PRESENT(BIC_PkgWatt); break; case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ do_rapl = RAPL_PKG | RAPL_CORES | RAPL_CORE_POLICY | RAPL_DRAM | RAPL_DRAM_PERF_STATUS | RAPL_PKG_PERF_STATUS | RAPL_GFX | RAPL_PKG_POWER_INFO; BIC_PRESENT(BIC_PKG__); @@ -3891,10 +3864,8 @@ void rapl_probe_intel(unsigned int family, unsigned int model) break; case INTEL_FAM6_HASWELL_X: /* HSX */ case INTEL_FAM6_BROADWELL_X: /* BDX */ - case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ case INTEL_FAM6_SKYLAKE_X: /* SKX */ case INTEL_FAM6_XEON_PHI_KNL: /* KNL */ - case INTEL_FAM6_XEON_PHI_KNM: do_rapl = RAPL_PKG | RAPL_DRAM | RAPL_DRAM_POWER_INFO | RAPL_DRAM_PERF_STATUS | RAPL_PKG_PERF_STATUS | RAPL_PKG_POWER_INFO; BIC_PRESENT(BIC_PKG__); BIC_PRESENT(BIC_RAM__); @@ -4044,7 +4015,6 @@ void perf_limit_reasons_probe(unsigned int family, unsigned int model)
switch (model) { case INTEL_FAM6_HASWELL_CORE: /* HSW */ - case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ do_gfx_perf_limit_reasons = 1; case INTEL_FAM6_HASWELL_X: /* HSX */ @@ -4264,16 +4234,11 @@ int has_snb_msrs(unsigned int family, unsigned int model) case INTEL_FAM6_IVYBRIDGE_X: /* IVB Xeon */ case INTEL_FAM6_HASWELL_CORE: /* HSW */ case INTEL_FAM6_HASWELL_X: /* HSW */ - case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ case INTEL_FAM6_BROADWELL_X: /* BDX */ - case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ case INTEL_FAM6_SKYLAKE_X: /* SKX */ case INTEL_FAM6_ATOM_GOLDMONT: /* BXT */ @@ -4302,12 +4267,9 @@ int has_hsw_msrs(unsigned int family, unsigned int model) return 0;
switch (model) { - case INTEL_FAM6_HASWELL_ULT: /* HSW */ + case INTEL_FAM6_HASWELL_CORE: case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ case INTEL_FAM6_ATOM_GOLDMONT: /* BXT */ case INTEL_FAM6_ATOM_GOLDMONT_PLUS: @@ -4331,9 +4293,6 @@ int has_skl_msrs(unsigned int family, unsigned int model)
switch (model) { case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ return 1; } @@ -4358,7 +4317,6 @@ int is_knl(unsigned int family, unsigned int model) return 0; switch (model) { case INTEL_FAM6_XEON_PHI_KNL: /* KNL */ - case INTEL_FAM6_XEON_PHI_KNM: return 1; } return 0; @@ -4572,6 +4530,42 @@ void decode_c6_demotion_policy_msr(void) base_cpu, msr, msr & (1 << 0) ? "EN" : "DIS"); }
+/* + * When models are the same, for the purpose of turbostat, reuse + */ +unsigned int intel_model_duplicates(unsigned int model) +{ + + switch(model) { + case INTEL_FAM6_NEHALEM_EP: /* Core i7, Xeon 5500 series - Bloomfield, Gainstown NHM-EP */ + case INTEL_FAM6_NEHALEM: /* Core i7 and i5 Processor - Clarksfield, Lynnfield, Jasper Forest */ + case 0x1F: /* Core i7 and i5 Processor - Nehalem */ + case INTEL_FAM6_WESTMERE: /* Westmere Client - Clarkdale, Arrandale */ + case INTEL_FAM6_WESTMERE_EP: /* Westmere EP - Gulftown */ + return INTEL_FAM6_NEHALEM; + + case INTEL_FAM6_NEHALEM_EX: /* Nehalem-EX Xeon - Beckton */ + case INTEL_FAM6_WESTMERE_EX: /* Westmere-EX Xeon - Eagleton */ + return INTEL_FAM6_NEHALEM_EX; + + case INTEL_FAM6_XEON_PHI_KNM: + return INTEL_FAM6_XEON_PHI_KNL; + + case INTEL_FAM6_HASWELL_ULT: + return INTEL_FAM6_HASWELL_CORE; + + case INTEL_FAM6_BROADWELL_X: + case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ + return INTEL_FAM6_BROADWELL_X; + + case INTEL_FAM6_SKYLAKE_MOBILE: + case INTEL_FAM6_SKYLAKE_DESKTOP: + case INTEL_FAM6_KABYLAKE_MOBILE: + case INTEL_FAM6_KABYLAKE_DESKTOP: + return INTEL_FAM6_SKYLAKE_MOBILE; + } + return model; +} void process_cpuid() { unsigned int eax, ebx, ecx, edx; @@ -4627,6 +4621,8 @@ void process_cpuid() edx_flags & (1 << 28) ? "HT" : "-", edx_flags & (1 << 29) ? "TM" : "-"); } + if (genuine_intel) + model = intel_model_duplicates(model);
if (!(edx_flags & (1 << 5))) errx(1, "CPUID: no MSR"); @@ -4718,9 +4714,6 @@ void process_cpuid() if (crystal_hz == 0) switch(model) { case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ - case INTEL_FAM6_SKYLAKE_DESKTOP: /* SKL */ - case INTEL_FAM6_KABYLAKE_MOBILE: /* KBL */ - case INTEL_FAM6_KABYLAKE_DESKTOP: /* KBL */ crystal_hz = 24000000; /* 24.0 MHz */ break; case INTEL_FAM6_ATOM_GOLDMONT_X: /* DNV */
From: Chen Yu yu.c.chen@intel.com
mainline inclusion from mainline-v5.6-rc7 commit 23274faf96500700da83c4f0ff12d78ae03d5604 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
From a turbostat point of view, Ice Lake server looks like Sky Lake server.
(cherry picked from commit 23274faf96500700da83c4f0ff12d78ae03d5604) Signed-off-by: Chen Yu yu.c.chen@intel.com Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Reviewed-by: 江国庆 jiangguoqing@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/power/x86/turbostat/turbostat.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 47015cb75405..a6a0d1661604 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -4563,6 +4563,9 @@ unsigned int intel_model_duplicates(unsigned int model) case INTEL_FAM6_KABYLAKE_MOBILE: case INTEL_FAM6_KABYLAKE_DESKTOP: return INTEL_FAM6_SKYLAKE_MOBILE; + + case INTEL_FAM6_ICELAKE_X: + return INTEL_FAM6_SKYLAKE_X; } return model; }
From: Len Brown len.brown@intel.com
mainline inclusion from mainline-v5.3-rc7 commit cd188af5282d9f9e65f63915b13239bafc746f8d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
turbostat: cpu0: msr offset 0x630 read failed: Input/output error
because Haswell Core does not have C8-C10.
Output C8-C10 only on Haswell ULT.
Fixes: f5a4c76ad7de ("tools/power turbostat: consolidate duplicate model numbers")
Reported-by: Prarit Bhargava prarit@redhat.com Suggested-by: Kosuke Tatsukawa tatsu@ab.jp.nec.com Signed-off-by: Len Brown len.brown@intel.com Signed-off-by: Jackie Liu liuyun01@kylinos.cn Reviewed-by: 江国庆 jiangguoqing@kylinos.cn Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/power/x86/turbostat/turbostat.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index a6a0d1661604..9805314c3d65 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -3195,6 +3195,7 @@ int probe_nhm_msrs(unsigned int family, unsigned int model) break; case INTEL_FAM6_HASWELL_CORE: /* HSW */ case INTEL_FAM6_HASWELL_X: /* HSX */ + case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ @@ -3391,6 +3392,7 @@ int has_config_tdp(unsigned int family, unsigned int model) case INTEL_FAM6_IVYBRIDGE: /* IVB */ case INTEL_FAM6_HASWELL_CORE: /* HSW */ case INTEL_FAM6_HASWELL_X: /* HSX */ + case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ @@ -3823,6 +3825,7 @@ void rapl_probe_intel(unsigned int family, unsigned int model) case INTEL_FAM6_SANDYBRIDGE: case INTEL_FAM6_IVYBRIDGE: case INTEL_FAM6_HASWELL_CORE: /* HSW */ + case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ @@ -4015,6 +4018,7 @@ void perf_limit_reasons_probe(unsigned int family, unsigned int model)
switch (model) { case INTEL_FAM6_HASWELL_CORE: /* HSW */ + case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ do_gfx_perf_limit_reasons = 1; case INTEL_FAM6_HASWELL_X: /* HSX */ @@ -4234,6 +4238,7 @@ int has_snb_msrs(unsigned int family, unsigned int model) case INTEL_FAM6_IVYBRIDGE_X: /* IVB Xeon */ case INTEL_FAM6_HASWELL_CORE: /* HSW */ case INTEL_FAM6_HASWELL_X: /* HSW */ + case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_HASWELL_GT3E: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_BROADWELL_GT3E: /* BDW */ @@ -4267,7 +4272,7 @@ int has_hsw_msrs(unsigned int family, unsigned int model) return 0;
switch (model) { - case INTEL_FAM6_HASWELL_CORE: + case INTEL_FAM6_HASWELL_ULT: /* HSW */ case INTEL_FAM6_BROADWELL_CORE: /* BDW */ case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */ case INTEL_FAM6_CANNONLAKE_MOBILE: /* CNL */ @@ -4551,9 +4556,6 @@ unsigned int intel_model_duplicates(unsigned int model) case INTEL_FAM6_XEON_PHI_KNM: return INTEL_FAM6_XEON_PHI_KNL;
- case INTEL_FAM6_HASWELL_ULT: - return INTEL_FAM6_HASWELL_CORE; - case INTEL_FAM6_BROADWELL_X: case INTEL_FAM6_BROADWELL_XEON_D: /* BDX-DE */ return INTEL_FAM6_BROADWELL_X;
hulk inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
In order to support Intel Icelake platform, following configs need to be set as suggested by Intel:
CONFIG_ACPI_HMAT=y CONFIG_STM=m CONFIG_STM_DUMMY=m CONFIG_STM_SOURCE_CONSOLE=m CONFIG_STM_SOURCE_HEARTBEAT=m CONFIG_STM_SOURCE_FTRACE=m CONFIG_INTEL_TH=m CONFIG_INTEL_TH_PCI=m CONFIG_INTEL_TH_ACPI=m CONFIG_INTEL_TH_GTH=m CONFIG_INTEL_TH_STH=m CONFIG_INTEL_TH_MSU=m CONFIG_INTEL_TH_PTI=m
Set above configs in hulk_defconfig by default.
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/configs/hulk_defconfig | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig index 9c7d6be7fd72..5c5897c269c1 100644 --- a/arch/x86/configs/hulk_defconfig +++ b/arch/x86/configs/hulk_defconfig @@ -519,6 +519,7 @@ CONFIG_ACPI_HED=y # CONFIG_ACPI_CUSTOM_METHOD is not set CONFIG_ACPI_BGRT=y CONFIG_ACPI_NFIT=m +CONFIG_ACPI_HMAT=y CONFIG_HAVE_ACPI_APEI=y CONFIG_HAVE_ACPI_APEI_NMI=y CONFIG_ACPI_APEI=y @@ -6581,8 +6582,18 @@ CONFIG_NVMEM=y # # HW tracing support # -# CONFIG_STM is not set -# CONFIG_INTEL_TH is not set +CONFIG_STM=m +CONFIG_STM_DUMMY=m +CONFIG_STM_SOURCE_CONSOLE=m +CONFIG_STM_SOURCE_HEARTBEAT=m +CONFIG_STM_SOURCE_FTRACE=m +CONFIG_INTEL_TH=m +CONFIG_INTEL_TH_PCI=m +CONFIG_INTEL_TH_ACPI=m +CONFIG_INTEL_TH_GTH=m +CONFIG_INTEL_TH_STH=m +CONFIG_INTEL_TH_MSU=m +CONFIG_INTEL_TH_PTI=m # CONFIG_FPGA is not set # CONFIG_UNISYS_VISORBUS is not set # CONFIG_SIOX is not set
hulk inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
In order to support Intel Icelake platform, following configs need to be set as suggested by Intel:
CONFIG_ACPI_HMAT=y CONFIG_STM=m CONFIG_STM_DUMMY=m CONFIG_STM_SOURCE_CONSOLE=m CONFIG_STM_SOURCE_HEARTBEAT=m CONFIG_STM_SOURCE_FTRACE=m CONFIG_INTEL_TH=m CONFIG_INTEL_TH_PCI=m CONFIG_INTEL_TH_ACPI=m CONFIG_INTEL_TH_GTH=m CONFIG_INTEL_TH_STH=m CONFIG_INTEL_TH_MSU=m CONFIG_INTEL_TH_PTI=m
Set above configs in openeuler_defconfig by default.
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/configs/openeuler_defconfig | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index c83f5231e357..6fe8a777fa1c 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -517,6 +517,7 @@ CONFIG_ACPI_HED=y # CONFIG_ACPI_CUSTOM_METHOD is not set CONFIG_ACPI_BGRT=y CONFIG_ACPI_NFIT=m +CONFIG_ACPI_HMAT=y CONFIG_HAVE_ACPI_APEI=y CONFIG_HAVE_ACPI_APEI_NMI=y CONFIG_ACPI_APEI=y @@ -6590,8 +6591,18 @@ CONFIG_NVMEM=y # # HW tracing support # -# CONFIG_STM is not set -# CONFIG_INTEL_TH is not set +CONFIG_STM=m +CONFIG_STM_DUMMY=m +CONFIG_STM_SOURCE_CONSOLE=m +CONFIG_STM_SOURCE_HEARTBEAT=m +CONFIG_STM_SOURCE_FTRACE=m +CONFIG_INTEL_TH=m +CONFIG_INTEL_TH_PCI=m +CONFIG_INTEL_TH_ACPI=m +CONFIG_INTEL_TH_GTH=m +CONFIG_INTEL_TH_STH=m +CONFIG_INTEL_TH_MSU=m +CONFIG_INTEL_TH_PTI=m # CONFIG_FPGA is not set # CONFIG_UNISYS_VISORBUS is not set # CONFIG_SIOX is not set
From: Otto Sabart ottosabart@seberm.com
mainline inclusion from mainline-v5.0-rc2 commit 7604bf0920985c9280c8b24e2f0c3e4ed47f502f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Old cpuidle/sysfs.txt file was replaced in aa5eee355b46. So, refer to an updated file.
Fixes: aa5eee355b46 (Documentation: admin-guide: PM: Add cpuidle document) Signed-off-by: Otto Sabart ottosabart@seberm.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/trace/coresight-cpu-debug.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/trace/coresight-cpu-debug.txt index 89ab09e78e8d..f07e38094b40 100644 --- a/Documentation/trace/coresight-cpu-debug.txt +++ b/Documentation/trace/coresight-cpu-debug.txt @@ -165,7 +165,7 @@ Do some work... The same can also be done from an application program.
Disable specific CPU's specific idle state from cpuidle sysfs (see -Documentation/cpuidle/sysfs.txt): +Documentation/admin-guide/pm/cpuidle.rst): # echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable
From: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com
mainline inclusion from mainline-v5.2-rc1 commit 67476656febd7ec5f1fe1aeec3c441fcf53b1e45 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
This move the dependency to DEV_DAX_PMEM_COMPAT such that only if DEV_DAX_PMEM is built as module we can allow the compat support.
This allows to test the new code easily in a emulation setup where we often build things without module support.
Cc: stable@vger.kernel.org Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility") Signed-off-by: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 5ef624fe3934..a59f338f520f 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -23,7 +23,6 @@ config DEV_DAX config DEV_DAX_PMEM tristate "PMEM DAX: direct access to persistent memory" depends on LIBNVDIMM && NVDIMM_DAX && DEV_DAX - depends on m # until we can kill DEV_DAX_PMEM_COMPAT default DEV_DAX help Support raw access to persistent memory. Note that this @@ -50,7 +49,7 @@ config DEV_DAX_KMEM
config DEV_DAX_PMEM_COMPAT tristate "PMEM DAX: support the deprecated /sys/class/dax interface" - depends on DEV_DAX_PMEM + depends on m && DEV_DAX_PMEM=m default DEV_DAX_PMEM help Older versions of the libdaxctl library expect to find all
From: Qian Cai cai@lca.pw
mainline inclusion from mainline-v5.2-rc1 commit e174e78efa6b2421f070ad9321545a87b2ebe28c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The commit 665ac7e92757 ("acpi/hmat: Register processor domain to its memory") introduced some memory leaks below due to it fails to release the heap memory in an error path, and then those statically-allocated __initdata memory which reference them get freed during boot renders those heap memory as leaks. Since it is valid to pass NULL to acpi_put_table(), it is fine to call it even if acpi_get_table() returns an error.
unreferenced object 0xc8ff8008349e9400 (size 128): comm "swapper/0", pid 1, jiffies 4294709236 (age 48121.476s) hex dump (first 32 bytes): 00 d0 9e 34 08 80 ff 84 d8 00 43 11 00 10 ff ff ...4......C..... 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000869d4503>] __kmalloc+0x568/0x600 [<0000000070fd6afb>] alloc_memory_target+0x50/0xd8 [<00000000efa2081e>] srat_parse_mem_affinity+0x58/0x5c [<000000008bfaef74>] acpi_parse_entries_array+0x1c8/0x2c0 [<0000000022804877>] acpi_table_parse_entries_array+0x11c/0x138 [<00000000ffe9cd34>] acpi_table_parse_entries+0x7c/0xac [<00000000a7023afd>] hmat_init+0x90/0x174 [<00000000694a86c1>] do_one_initcall+0x2d8/0x5f8 [<0000000024889da9>] do_initcall_level+0x37c/0x3fc [<000000009be02908>] do_basic_setup+0x38/0x50 [<0000000037b3ac0a>] kernel_init_freeable+0x194/0x258 [<00000000f5741184>] kernel_init+0x18/0x334 [<000000007b30f423>] ret_from_fork+0x10/0x18 [<000000006c7147a8>] 0xffffffffffffffff
Signed-off-by: Qian Cai cai@lca.pw Fixes: 665ac7e92757 ("acpi/hmat: Register processor domain to its memory") Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/hmat/hmat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c index b7824a0309f7..69cdb3cea545 100644 --- a/drivers/acpi/hmat/hmat.c +++ b/drivers/acpi/hmat/hmat.c @@ -637,7 +637,7 @@ static __init int hmat_init(void)
status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl); if (ACPI_FAILURE(status)) - return 0; + goto out_put;
hmat_revision = tbl->revision; switch (hmat_revision) {
From: Alison Schofield alison.schofield@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 57f5cf6ed8bdbbcb1abf3477beeac199dcdba316 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
ACPI 6.3 changed the subtable "Memory Subsystem Address Range Structure" to "Memory Proximity Domain Attributes Structure".
Updating and renaming of the structure was included in commit: ACPICA: ACPI 6.3: HMAT updates (9a8d961f1ef835b0d338fbe13da03cb424e87ae5)
Rename the enum type to match the subtable and structure naming.
Signed-off-by: Alison Schofield alison.schofield@intel.com Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/hmat/hmat.c | 4 ++-- include/acpi/actbl1.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c index 69cdb3cea545..f16656047952 100644 --- a/drivers/acpi/hmat/hmat.c +++ b/drivers/acpi/hmat/hmat.c @@ -411,7 +411,7 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header, return -EINVAL;
switch (hdr->type) { - case ACPI_HMAT_TYPE_ADDRESS_RANGE: + case ACPI_HMAT_TYPE_PROXIMITY: return hmat_parse_proximity_domain(header, end); case ACPI_HMAT_TYPE_LOCALITY: return hmat_parse_locality(header, end); @@ -649,7 +649,7 @@ static __init int hmat_init(void) goto out_put; }
- for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) { + for (i = ACPI_HMAT_TYPE_PROXIMITY; i < ACPI_HMAT_TYPE_RESERVED; i++) { if (acpi_table_parse_entries(ACPI_SIG_HMAT, sizeof(struct acpi_table_hmat), i, hmat_parse_subtable, 0) < 0) { diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index 7df78a9bddb1..2030c0f553ae 100644 --- a/include/acpi/actbl1.h +++ b/include/acpi/actbl1.h @@ -1390,7 +1390,7 @@ struct acpi_table_hmat { /* Values for HMAT structure types */
enum acpi_hmat_type { - ACPI_HMAT_TYPE_ADDRESS_RANGE = 0, /* Memory subystem address range */ + ACPI_HMAT_TYPE_PROXIMITY = 0, /* Memory proximity domain attributes */ ACPI_HMAT_TYPE_LOCALITY = 1, /* System locality latency and bandwidth information */ ACPI_HMAT_TYPE_CACHE = 2, /* Memory side cache information */ ACPI_HMAT_TYPE_RESERVED = 3 /* 3 and greater are reserved */
From: Qian Cai cai@lca.pw
mainline inclusion from mainline-v5.2-rc1 commit ab3a9f2ccc080d27873f76869c9a780be45e581e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The commit 665ac7e92757 ("acpi/hmat: Register processor domain to its memory") introduced an uninitialized "struct memory_target" that could cause an incorrect branching.
drivers/acpi/hmat/hmat.c:385:6: warning: variable 'target' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/acpi/hmat/hmat.c:392:6: note: uninitialized use occurs here if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) { ^~~~~~ drivers/acpi/hmat/hmat.c:385:2: note: remove the 'if' if its condition is always true if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/acpi/hmat/hmat.c:369:30: note: initialize the variable 'target' to silence this warning struct memory_target *target; ^ = NULL
Signed-off-by: Qian Cai cai@lca.pw Reviewed-by: Mukesh Ojha mojha@codeaurora.org Fixes: 665ac7e92757 ("acpi/hmat: Register processor domain to its memory") Reviewed-by: Nathan Chancellor natechancellor@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/hmat/hmat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c index f16656047952..96b7d39a97c6 100644 --- a/drivers/acpi/hmat/hmat.c +++ b/drivers/acpi/hmat/hmat.c @@ -366,7 +366,7 @@ static int __init hmat_parse_proximity_domain(union acpi_subtable_headers *heade const unsigned long end) { struct acpi_hmat_proximity_domain *p = (void *)header; - struct memory_target *target; + struct memory_target *target = NULL;
if (p->header.length != sizeof(*p)) { pr_notice("HMAT: Unexpected address range header length: %d\n",
From: Jonathan Corbet corbet@lwn.net
mainline inclusion from mainline-v5.2-rc3 commit 8867f6109b84f50d959af2b26440a7faacc7633a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 13bac55ef7ae ("doc/mm: New documentation for memory performance") added numaperf.rst, but did not add it to the TOC tree. There was also an incorrectly marked literal block leading to this warning sequence:
numaperf.rst:24: WARNING: Unexpected indentation. numaperf.rst:24: WARNING: Inline substitution_reference start-string without end-string. numaperf.rst:25: WARNING: Block quote ends without a blank line; unexpected unindent.
Fix the block and add the file to the document tree.
Fixes: 13bac55ef7ae ("doc/mm: New documentation for memory performance") Cc: Keith Busch keith.busch@intel.com Reviewed-by: Mike Rapoport rppt@linux.ibm.com Signed-off-by: Jonathan Corbet corbet@lwn.net Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/admin-guide/mm/index.rst | 1 + Documentation/admin-guide/mm/numaperf.rst | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index ceead68c2df7..0cabbba2f885 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -30,6 +30,7 @@ the Linux memory management. idle_page_tracking ksm numa_memory_policy + numaperf pagemap soft-dirty transhuge diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst index b79f70c04397..c067ed145158 100644 --- a/Documentation/admin-guide/mm/numaperf.rst +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -15,7 +15,7 @@ characteristics. Some memory may share the same node as a CPU, and others are provided as memory only nodes. While memory only nodes do not provide CPUs, they may still be local to one or more compute nodes relative to other nodes. The following diagram shows one such example of two compute -nodes with local memory and a memory only node for each of compute node: +nodes with local memory and a memory only node for each of compute node::
+------------------+ +------------------+ | Compute Node 0 +-----+ Compute Node 1 |
From: Arnaldo Carvalho de Melo acme@redhat.com
mainline inclusion from mainline-v5.2-rc1 commit 0ceb5499a8001e5ddac2c8bd7b45eb4c643469ad category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
To get the changes in:
878068ea270e ("perf/x86: Support outputting XMM registers")
That will be used in a followup patch to allow users to ask for some or all of those registers to be collected in certain contatexts.
This silences the following perf build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/perf_regs.h' differs from latest version at 'arch/x86/include/uapi/asm/perf_regs.h' diff -u tools/arch/x86/include/uapi/asm/perf_regs.h arch/x86/include/uapi/asm/perf_regs.h
Cc: Adrian Hunter adrian.hunter@intel.com Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Andi Kleen ak@linux.intel.com Cc: Jiri Olsa jolsa@kernel.org Cc: Kan Liang kan.liang@linux.intel.com Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Link: https://lkml.kernel.org/n/tip-6pjnnrzqt3x3n2cd6br3wk7k@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/arch/x86/include/uapi/asm/perf_regs.h | 23 ++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h index 62a6965637fd..7c9d2bb3833b 100644 --- a/tools/arch/x86/include/uapi/asm/perf_regs.h +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h @@ -27,9 +27,30 @@ enum perf_event_x86_regs { PERF_REG_X86_R13, PERF_REG_X86_R14, PERF_REG_X86_R15, - + /* These are the limits for the GPRs. */ PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1, PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1, + + /* These all need two bits set because they are 128bit */ + PERF_REG_X86_XMM0 = 32, + PERF_REG_X86_XMM1 = 34, + PERF_REG_X86_XMM2 = 36, + PERF_REG_X86_XMM3 = 38, + PERF_REG_X86_XMM4 = 40, + PERF_REG_X86_XMM5 = 42, + PERF_REG_X86_XMM6 = 44, + PERF_REG_X86_XMM7 = 46, + PERF_REG_X86_XMM8 = 48, + PERF_REG_X86_XMM9 = 50, + PERF_REG_X86_XMM10 = 52, + PERF_REG_X86_XMM11 = 54, + PERF_REG_X86_XMM12 = 56, + PERF_REG_X86_XMM13 = 58, + PERF_REG_X86_XMM14 = 60, + PERF_REG_X86_XMM15 = 62, + + /* These include both GPRs and XMMX registers */ + PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2, };
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
From: Tony Luck tony.luck@intel.com
mainline inclusion from mainline-v5.2-rc1 commit 76fc276f4a91bdf96940c276398954f291daf034 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Code refactoring to share some source code with a new EDAC driver resulted in renaming one file (skx_edac.c became skx_base.c) and adding a new file (skx_common.c).
Update the file pattern in MAINTAINERS to take account of this change.
Fixes: 98f2fc829e3b ("EDAC, skx_edac: Delete duplicated code") Reported-by: Joe Perches joe@perches.com Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: "David S. Miller" davem@davemloft.net Cc: Aristeu Rozanski aris@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mauro Carvalho Chehab mchehab+samsung@kernel.org Cc: Nicolas Ferre nicolas.ferre@microchip.com Cc: Qiuxu Zhuo qiuxu.zhuo@intel.com Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190325232932.GA8869@agluck-desk Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS index d03a2927f434..3e6138e4a4e8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5360,7 +5360,7 @@ EDAC-SKYLAKE M: Tony Luck tony.luck@intel.com L: linux-edac@vger.kernel.org S: Maintained -F: drivers/edac/skx_edac.c +F: drivers/edac/skx_*.c
EDAC-TI M: Tero Kristo t-kristo@ti.com
From: Tony Luck tony.luck@intel.com
mainline inclusion from mainline-v5.2-rc1 commit bcc5c1bbf76c40764735a5b3bdb884575e7fd822 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Updating the MAINTAINERS file when adding this new driver was forgotten. Fix it.
Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Signed-off-by: Tony Luck tony.luck@intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: "David S. Miller" davem@davemloft.net Cc: Aristeu Rozanski aris@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mauro Carvalho Chehab mchehab+samsung@kernel.org Cc: Nicolas Ferre nicolas.ferre@microchip.com Cc: Qiuxu Zhuo qiuxu.zhuo@intel.com Cc: linux-edac linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20190325235627.GA11938@agluck-desk Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- MAINTAINERS | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS index 3e6138e4a4e8..15609985b6cc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5279,6 +5279,12 @@ L: linux-edac@vger.kernel.org S: Maintained F: drivers/edac/ghes_edac.c
+EDAC-I10NM +M: Tony Luck tony.luck@intel.com +L: linux-edac@vger.kernel.org +S: Maintained +F: drivers/edac/i10nm_base.c + EDAC-I3000 L: linux-edac@vger.kernel.org S: Orphan
From: Pavel Tatashin pasha.tatashin@soleen.com
mainline inclusion from mainline-v5.3-rc1 commit 31e4ca92a7dd4cdebd7fe1456b3b0b6ace9a816f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Patch series ""Hotremove" persistent memory", v6.
Recently, adding a persistent memory to be used like a regular RAM was added to Linux. This work extends this functionality to also allow hot removing persistent memory.
We (Microsoft) have an important use case for this functionality.
The requirement is for physical machines with small amount of RAM (~8G) to be able to reboot in a very short period of time (<1s). Yet, there is a userland state that is expensive to recreate (~2G).
The solution is to boot machines with 2G preserved for persistent memory.
Copy the state, and hotadd the persistent memory so machine still has all 8G available for runtime. Before reboot, offline and hotremove device-dax 2G, copy the memory that is needed to be preserved to pmem0 device, and reboot.
The series of operations look like this:
1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps. and free ramdisk. 2. Convert raw pmem0 to devdax ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f 3. Hotadd to System RAM echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id echo online_movable > /sys/devices/system/memoryXXX/state 4. Before reboot hotremove device-dax memory from System RAM echo offline > /sys/devices/system/memoryXXX/state echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind 5. Create raw pmem0 device ndctl create-namespace --mode raw -e namespace0.0 -f 6. Copy the state that was stored by apps to ramdisk to pmem device 7. Do kexec reboot or reboot through firmware if firmware does not zero memory in pmem0 region (These machines have only regular volatile memory). So to have pmem0 device either memmap kernel parameter is used, or devices nodes in dtb are specified.
This patch (of 3):
When add_memory() fails, the resource and the memory should be freed.
Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM") Signed-off-by: Pavel Tatashin pasha.tatashin@soleen.com Reviewed-by: Dave Hansen dave.hansen@intel.com Cc: Bjorn Helgaas bhelgaas@google.com Cc: Borislav Petkov bp@suse.de Cc: Dan Williams dan.j.williams@intel.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Dave Jiang dave.jiang@intel.com Cc: David Hildenbrand david@redhat.com Cc: Fengguang Wu fengguang.wu@intel.com Cc: Huang Ying ying.huang@intel.com Cc: James Morris jmorris@namei.org Cc: Jérôme Glisse jglisse@redhat.com Cc: Keith Busch keith.busch@intel.com Cc: Michal Hocko mhocko@suse.com Cc: Ross Zwisler zwisler@kernel.org Cc: Sasha Levin sashal@kernel.org Cc: Takashi Iwai tiwai@suse.de Cc: Tom Lendacky thomas.lendacky@amd.com Cc: Vishal Verma vishal.l.verma@intel.com Cc: Yaowei Bai baiyaowei@cmss.chinamobile.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/kmem.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a02318c6d28a..4c0131857133 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -66,8 +66,11 @@ int dev_dax_kmem_probe(struct device *dev) new_res->name = dev_name(dev);
rc = add_memory(numa_node, new_res->start, resource_size(new_res)); - if (rc) + if (rc) { + release_resource(new_res); + kfree(new_res); return rc; + }
return 0; }
From: Stephen Rothwell sfr@canb.auug.org.au
mainline inclusion from mainline-v5.3-rc1 commit 8da04e05cdfc715d414a1c5f8318c03030eb68fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Fixes: 7ebf8eff63b4 ("intel_rapl: introduce struct rapl_if_private") Signed-off-by: Stephen Rothwell sfr@canb.auug.org.au Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/intel_rapl.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h index 0c179d92d110..efb3ce892c20 100644 --- a/include/linux/intel_rapl.h +++ b/include/linux/intel_rapl.h @@ -12,6 +12,7 @@
#include <linux/types.h> #include <linux/powercap.h> +#include <linux/cpuhotplug.h>
enum rapl_domain_type { RAPL_DOMAIN_PACKAGE, /* entire package/socket */
From: Dan Carpenter dan.carpenter@oracle.com
mainline inclusion from mainline-v5.4-rc1 commit 010764b8856e5ee5056113704dc4b914ebc88f1d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The isst_send_msr_command() function will read 8 bytes but we are passing an address to an int (4 bytes) so it results in a read overflow.
Fixes: 3fb4f7cd472c ("tools/power/x86: A tool to validate Intel Speed Select commands") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Acked-by: Srinivas Pandruvada srinivas.pandruvada@linux.intel.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/power/x86/intel-speed-select/isst-core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/power/x86/intel-speed-select/isst-core.c b/tools/power/x86/intel-speed-select/isst-core.c index 8de4ac39a008..f724322856ed 100644 --- a/tools/power/x86/intel-speed-select/isst-core.c +++ b/tools/power/x86/intel-speed-select/isst-core.c @@ -190,6 +190,7 @@ int isst_get_get_trl(int cpu, int level, int avx_level, int *trl)
int isst_set_tdp_level_msr(int cpu, int tdp_level) { + unsigned long long level = tdp_level; int ret;
debug_printf("cpu: tdp_level via MSR %d\n", cpu, tdp_level); @@ -202,8 +203,7 @@ int isst_set_tdp_level_msr(int cpu, int tdp_level) if (tdp_level > 2) return -1; /* invalid value */
- ret = isst_send_msr_command(cpu, 0x64b, 1, - (unsigned long long *)&tdp_level); + ret = isst_send_msr_command(cpu, 0x64b, 1, &level); if (ret) return ret;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.4-rc7 commit 87c0b9c79ec136ea76a14a88d675a746bc6a87f9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 8116db57cf16 ("intel_th: Add switch triggering support") added a trigger assertion of the CTS, but forgot to de-assert it at the end of the sequence. This results in window switches randomly not happening.
Fix that by de-asserting the trigger at the end of the window switch sequence.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Fixes: 8116db57cf16 ("intel_th: Add switch triggering support") Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20191028070651.9770-2-alexander.shishkin@linux.int... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/gth.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/hwtracing/intel_th/gth.c b/drivers/hwtracing/intel_th/gth.c index c9672b74761a..4a77e3517311 100644 --- a/drivers/hwtracing/intel_th/gth.c +++ b/drivers/hwtracing/intel_th/gth.c @@ -626,6 +626,9 @@ static void intel_th_gth_switch(struct intel_th_device *thdev, if (!count) dev_dbg(&thdev->dev, "timeout waiting for CTS Trigger\n");
+ /* De-assert the trigger */ + iowrite32(0, gth->base + REG_CTS_CTL); + intel_th_gth_stop(gth, output, false); intel_th_gth_start(gth, output); }
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.4-rc7 commit e5a340f770278f4de42e8bac19f2ebeb77ddfae4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 615c164da0eb ("intel_th: msu: Introduce buffer interface") added a mutex that it forgot to initialize, resulting in a lockdep splat.
Fix that by initializing the mutex statically.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Fixes: 615c164da0eb ("intel_th: msu: Introduce buffer interface") Link: https://lore.kernel.org/r/20191028070651.9770-3-alexander.shishkin@linux.int... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index f6f12ce2c783..96d423422e91 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -162,7 +162,7 @@ struct msc { };
static LIST_HEAD(msu_buffer_list); -static struct mutex msu_buffer_mutex; +static DEFINE_MUTEX(msu_buffer_mutex);
/** * struct msu_buffer_entry - internal MSU buffer bookkeeping
From: Colin Ian King colin.king@canonical.com
mainline inclusion from mainline-v5.4-rc7 commit 063f097fd65a90fca2cd49411a2d6e35b8ca25db category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 615c164da0eb ("intel_th: msu: Introduce buffer interface") forgot to add a NULL pointer check for the value returned from kstrdup(), which will be troublesome if the allocation fails.
Fix that by adding the check.
Addresses-Coverity: ("Dereference null return") Fixes: 615c164da0eb ("intel_th: msu: Introduce buffer interface") Signed-off-by: Colin Ian King colin.king@canonical.com [alexander.shishkin: amended the commit message] Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/lkml/20190726120421.9650-1-colin.king@canonical.com/ Link: https://lore.kernel.org/r/20191028070651.9770-4-alexander.shishkin@linux.int... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 96d423422e91..78d36689d0d8 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -1841,6 +1841,9 @@ mode_store(struct device *dev, struct device_attribute *attr, const char *buf, len = cp - buf;
mode = kstrndup(buf, len, GFP_KERNEL); + if (!mode) + return -ENOMEM; + i = match_string(msc_mode, ARRAY_SIZE(msc_mode), mode); if (i >= 0) goto found;
From: Colin Ian King colin.king@canonical.com
mainline inclusion from mainline-v5.4-rc7 commit 8e3ef7b444aec3d1059085ce41edaa76ee7340e7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The shift of the unsigned int win->nr_blocks by PAGE_SHIFT may potentially overflow. Note that the intended return of this shift is expected to be a size_t however the shift is being performed as an unsigned int. Fix this by casting win->nr_blocks to a size_t before performing the shift.
Addresses-Coverity: ("Unintentional integer overflow") Fixes: 615c164da0eb ("intel_th: msu: Introduce buffer interface") Signed-off-by: Colin Ian King colin.king@canonical.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/lkml/20190726113151.8967-1-colin.king@canonical.com/ Link: https://lore.kernel.org/r/20191028070651.9770-5-alexander.shishkin@linux.int... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 78d36689d0d8..76127836d4f8 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -327,7 +327,7 @@ static size_t msc_win_total_sz(struct msc_window *win) struct msc_block_desc *bdesc = msc_win_block(win, blk);
if (msc_block_wrapped(bdesc)) - return win->nr_blocks << PAGE_SHIFT; + return (size_t)win->nr_blocks << PAGE_SHIFT;
size += msc_total_sz(bdesc); if (msc_block_last_written(bdesc))
From: Wei Yongjun weiyongjun1@huawei.com
mainline inclusion from mainline-v5.4-rc7 commit 1fa1b6ca0fda97cbfccdc6b80b1a6b2920751665 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
'mode' is malloced in mode_store() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak.
Fixes: 615c164da0eb ("intel_th: msu: Introduce buffer interface") Signed-off-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/lkml/20190801013825.182543-1-weiyongjun1@huawei.com/ Link: https://lore.kernel.org/r/20191028070651.9770-6-alexander.shishkin@linux.int... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 76127836d4f8..6b48362df260 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -1845,8 +1845,10 @@ mode_store(struct device *dev, struct device_attribute *attr, const char *buf, return -ENOMEM;
i = match_string(msc_mode, ARRAY_SIZE(msc_mode), mode); - if (i >= 0) + if (i >= 0) { + kfree(mode); goto found; + }
/* Buffer sinks only work with a usable IRQ */ if (!msc->do_irq) {
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.5-rc7 commit fa694ae532836bd2f4cd659e9b4032abaf9fa9e5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
An Oops during the boot is found on some SNR machines. It turns out this is because the snr_uncore_imc_freerunning_events[] array was missing an end-marker.
Fixes: ee49532b38dd ("perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge") Reported-by: Like Xu like.xu@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Ingo Molnar mingo@kernel.org Tested-by: Like Xu like.xu@linux.intel.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200116200210.18937-1-kan.liang@linux.intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore_snbep.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index 06a3e8ae370c..07c3c7f87731 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -4579,6 +4579,7 @@ static struct uncore_event_desc snr_uncore_imc_freerunning_events[] = { INTEL_UNCORE_EVENT_DESC(write, "event=0xff,umask=0x21"), INTEL_UNCORE_EVENT_DESC(write.scale, "3.814697266e-6"), INTEL_UNCORE_EVENT_DESC(write.unit, "MiB"), + { /* end: all zeroes */ }, };
static struct intel_uncore_ops snr_uncore_imc_freerunning_ops = {
From: Subbaraya Sundeep sbhatta@marvell.com
mainline inclusion from mainline-v5.5-rc1 commit 73884a7082f466ce6686bb8dd7e6571dd42313b4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
As per PCIe r5.0, sec 7.8.5.2, fixed bus numbers of a bridge must be zero when no function that uses EA is located behind it. Hence, if EA supplies bus numbers of zero, assign bus numbers normally. A secondary bus can never have a bus number of zero, so setting a bridge's Secondary Bus Number to zero makes downstream devices unreachable.
[bhelgaas: retain bool return value so "zero is invalid" logic is local] Fixes: 2dbce5901179 ("PCI: Assign bus numbers present in EA capability for bridges") Link: https://lore.kernel.org/r/1572850664-9861-1-git-send-email-sundeep.lkml@gmai... Signed-off-by: Subbaraya Sundeep sbhatta@marvell.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/probe.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 83d2c9da238e..055a2c47ae3a 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1138,14 +1138,15 @@ static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus, * @sec: updated with secondary bus number from EA * @sub: updated with subordinate bus number from EA * - * If @dev is a bridge with EA capability, update @sec and @sub with - * fixed bus numbers from the capability and return true. Otherwise, - * return false. + * If @dev is a bridge with EA capability that specifies valid secondary + * and subordinate bus numbers, return true with the bus numbers in @sec + * and @sub. Otherwise return false. */ static bool pci_ea_fixed_busnrs(struct pci_dev *dev, u8 *sec, u8 *sub) { int ea, offset; u32 dw; + u8 ea_sec, ea_sub;
if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) return false; @@ -1157,8 +1158,13 @@ static bool pci_ea_fixed_busnrs(struct pci_dev *dev, u8 *sec, u8 *sub)
offset = ea + PCI_EA_FIRST_ENT; pci_read_config_dword(dev, offset, &dw); - *sec = dw & PCI_EA_SEC_BUS_MASK; - *sub = (dw & PCI_EA_SUB_BUS_MASK) >> PCI_EA_SUB_BUS_SHIFT; + ea_sec = dw & PCI_EA_SEC_BUS_MASK; + ea_sub = (dw & PCI_EA_SUB_BUS_MASK) >> PCI_EA_SUB_BUS_SHIFT; + if (ea_sec == 0 || ea_sub < ea_sec) + return false; + + *sec = ea_sec; + *sub = ea_sub; return true; }
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.5-rc3 commit ab832e38e4f0f45b16c3633714d868b7ec6b33b4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit aac8da65174a ("intel_th: msu: Start handling IRQs") implicitly relies on the use of devm_request_irq() to subsequently free the irqs on device removal, but in case of the pci_free_irq_vectors() API, the handlers need to be freed before it is called. Therefore, at the moment the driver's remove path trips a BUG_ON(irq_has_action()):
kernel BUG at drivers/pci/msi.c:375! invalid opcode: 0000 1 SMP CPU: 2 PID: 818 Comm: rmmod Not tainted 5.5.0-rc1+ #1 RIP: 0010:free_msi_irqs+0x67/0x1c0 pci_disable_msi+0x116/0x150 pci_free_irq_vectors+0x1b/0x20 intel_th_pci_remove+0x22/0x30 [intel_th_pci] pci_device_remove+0x3e/0xb0 device_release_driver_internal+0xf0/0x1c0 driver_detach+0x4c/0x8f bus_remove_driver+0x5c/0xd0 driver_unregister+0x31/0x50 pci_unregister_driver+0x40/0x90 intel_th_pci_driver_exit+0x10/0xad6 [intel_th_pci] __x64_sys_delete_module+0x147/0x290 ? exit_to_usermode_loop+0xd7/0x120 do_syscall_64+0x57/0x1b0 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by explicitly freeing irqs before freeing the vectors. We keep using the devm_* variants because they are still useful in early error paths.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Fixes: aac8da65174a ("intel_th: msu: Start handling IRQs") Reported-by: Ammy Yi ammy.yi@intel.com Tested-by: Ammy Yi ammy.yi@intel.com Cc: stable@vger.kernel.org # v5.2+ Link: https://lore.kernel.org/r/20191217115527.74383-4-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/core.c | 7 ++++--- drivers/hwtracing/intel_th/intel_th.h | 2 ++ 2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/hwtracing/intel_th/core.c b/drivers/hwtracing/intel_th/core.c index 5cab5194c9b9..52f000d37bac 100644 --- a/drivers/hwtracing/intel_th/core.c +++ b/drivers/hwtracing/intel_th/core.c @@ -843,9 +843,6 @@ static irqreturn_t intel_th_irq(int irq, void *data) ret |= d->irq(th->thdev[i]); }
- if (ret == IRQ_NONE) - pr_warn_ratelimited("nobody cared for irq\n"); - return ret; }
@@ -896,6 +893,7 @@ intel_th_alloc(struct device *dev, struct intel_th_drvdata *drvdata,
if (th->irq == -1) th->irq = devres[r].start; + th->num_irqs++; break; default: dev_warn(dev, "Unknown resource type %lx\n", @@ -949,6 +947,9 @@ void intel_th_free(struct intel_th *th)
th->num_thdevs = 0;
+ for (i = 0; i < th->num_irqs; i++) + devm_free_irq(th->dev, th->irq + i, th); + pm_runtime_get_sync(th->dev); pm_runtime_forbid(th->dev);
diff --git a/drivers/hwtracing/intel_th/intel_th.h b/drivers/hwtracing/intel_th/intel_th.h index 0df480072b6c..6f4f5486fe6d 100644 --- a/drivers/hwtracing/intel_th/intel_th.h +++ b/drivers/hwtracing/intel_th/intel_th.h @@ -261,6 +261,7 @@ enum th_mmio_idx { * @num_thdevs: number of devices in the @thdev array * @num_resources: number of resources in the @resource array * @irq: irq number + * @num_irqs: number of IRQs is use * @id: this Intel TH controller's device ID in the system * @major: device node major for output devices */ @@ -277,6 +278,7 @@ struct intel_th { unsigned int num_thdevs; unsigned int num_resources; int irq; + int num_irqs;
int id; int major;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.5-rc3 commit 05b686b573cfb35a227c30787083a6631ff0f0c9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit 6cac7866c2741 ("intel_th: msu: Add a sysfs attribute to trigger window switch") adds a NULL pointer dereference in the case when there are no windows allocated:
BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 1 SMP CPU: 5 PID: 1110 Comm: bash Not tainted 5.5.0-rc1+ #1 RIP: 0010:msc_win_switch+0xa/0x80 [intel_th_msu] Call Trace: ? win_switch_store+0x9b/0xc0 [intel_th_msu] dev_attr_store+0x17/0x30 sysfs_kf_write+0x3e/0x50 kernfs_fop_write+0xda/0x1b0 __vfs_write+0x1b/0x40 vfs_write+0xb9/0x1a0 ksys_write+0x67/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x57/0x1d0 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix that by disallowing window switching with multiwindow buffers without windows.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Fixes: 6cac7866c274 ("intel_th: msu: Add a sysfs attribute to trigger window switch") Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Reported-by: Ammy Yi ammy.yi@intel.com Tested-by: Ammy Yi ammy.yi@intel.com Cc: stable@vger.kernel.org # v5.2+ Link: https://lore.kernel.org/r/20191217115527.74383-5-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 6b48362df260..2be1cb1e78a4 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -1669,10 +1669,13 @@ static int intel_th_msc_init(struct msc *msc) return 0; }
-static void msc_win_switch(struct msc *msc) +static int msc_win_switch(struct msc *msc) { struct msc_window *first;
+ if (list_empty(&msc->win_list)) + return -EINVAL; + first = list_first_entry(&msc->win_list, struct msc_window, entry);
if (msc_is_last_win(msc->cur_win)) @@ -1684,6 +1687,8 @@ static void msc_win_switch(struct msc *msc) msc->base_addr = msc_win_baddr(msc->cur_win, 0);
intel_th_trace_switch(msc->thdev); + + return 0; }
/** @@ -2018,16 +2023,15 @@ win_switch_store(struct device *dev, struct device_attribute *attr, if (val != 1) return -EINVAL;
+ ret = -EINVAL; mutex_lock(&msc->buf_mutex); /* * Window switch can only happen in the "multi" mode. * If a external buffer is engaged, they have the full * control over window switching. */ - if (msc->mode != MSC_MODE_MULTI || msc->mbuf) - ret = -ENOTSUPP; - else - msc_win_switch(msc); + if (msc->mode == MSC_MODE_MULTI && !msc->mbuf) + ret = msc_win_switch(msc); mutex_unlock(&msc->buf_mutex);
return ret ? ret : size;
From: Alexander Shishkin alexander.shishkin@linux.intel.com
mainline inclusion from mainline-v5.6-rc7 commit 885f123554bbdc1807ca25a374be6e9b3bddf4de category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The unexpected state warning should only warn on illegal state transitions. Fix that.
Signed-off-by: Alexander Shishkin alexander.shishkin@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Fixes: 615c164da0eb4 ("intel_th: msu: Introduce buffer interface") Cc: stable@vger.kernel.org # v5.4+ Link: https://lore.kernel.org/r/20200317062215.15598-5-alexander.shishkin@linux.in... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/hwtracing/intel_th/msu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 2be1cb1e78a4..a0e61b8c966b 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -722,9 +722,6 @@ static int msc_win_set_lockout(struct msc_window *win,
if (old != expect) { ret = -EINVAL; - dev_warn_ratelimited(msc_dev(win->msc), - "expected lockout state %d, got %d\n", - expect, old); goto unlock; }
@@ -740,6 +737,10 @@ static int msc_win_set_lockout(struct msc_window *win, /* from intel_th_msc_window_unlock(), don't warn if not locked */ if (expect == WIN_LOCKED && old == new) return 0; + + dev_warn_ratelimited(msc_dev(win->msc), + "expected lockout state %d, got %d\n", + expect, old); }
return ret;
From: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com
mainline inclusion from mainline-v5.7-rc1 commit b5dfbeacf74865a8d62a4f70f501cdc61510f8e0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
pcie_do_recovery() had two "if (state == pci_channel_io_frozen)" cases right after each other. Combine them to make this easier to read. No functional change intended.
Link: https://lore.kernel.org/r/20200317170654.GA23125@infradead.org [bhelgaas: split from https://lore.kernel.org/r/a255fcb3a3fdebcd90f84e08b555f1786eb8eba2.158500008...] Signed-off-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/err.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 98acf944a27f..5bfc574f5519 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -201,14 +201,13 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, bus = dev->subordinate;
pci_dbg(dev, "broadcast error_detected message\n"); - if (state == pci_channel_io_frozen) + if (state == pci_channel_io_frozen) { pci_walk_bus(bus, report_frozen_detected, &status); - else + if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) + goto failed; + } else { pci_walk_bus(bus, report_normal_detected, &status); - - if (state == pci_channel_io_frozen && - reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) - goto failed; + }
if (status == PCI_ERS_RESULT_CAN_RECOVER) { status = PCI_ERS_RESULT_RECOVERED;
From: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com
mainline inclusion from mainline-v5.7-rc1 commit 6d2c89441571ea534d6240f7724f518936c44f8d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses reset_link() to recover from fatal errors. But during fatal error recovery, if the initial value of error status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then even after successful recovery (using reset_link()) pcie_do_recovery() will report the recovery result as failure. Update the status of error after reset_link().
You can reproduce this issue by triggering a SW DPC using "DPC Software Trigger" bit in "DPC Control Register". You should see recovery failed dmesg log as below:
pcieport 0000:00:16.0: DPC: containment event, status:0x1f27 source:0x0000 pcieport 0000:00:16.0: DPC: software trigger detected pci 0000:04:00.0: AER: can't recover (no error_detected callback) pcieport 0000:00:16.0: AER: device recovery failed
Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") Link: https://lore.kernel.org/r/a255fcb3a3fdebcd90f84e08b555f1786eb8eba2.158500008... [bhelgaas: split pci_channel_io_frozen simplification to separate patch] Signed-off-by: Kuppuswamy Sathyanarayanan sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Acked-by: Keith Busch keith.busch@intel.com Cc: Ashok Raj ashok.raj@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pcie/err.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 5bfc574f5519..6d3d5b6a5c44 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -203,7 +203,8 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, pci_dbg(dev, "broadcast error_detected message\n"); if (state == pci_channel_io_frozen) { pci_walk_bus(bus, report_frozen_detected, &status); - if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED) + status = reset_link(dev, service); + if (status != PCI_ERS_RESULT_RECOVERED) goto failed; } else { pci_walk_bus(bus, report_normal_detected, &status);
From: Sumeet Pawnikar sumeet.r.pawnikar@intel.com
mainline inclusion from mainline-v5.8-rc1 commit 03c3b413a14d8bbf0afc8e069cf2a65df580c30c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Remove unused PLATFORM_POWER_LIMIT MSR local definition from file intel_rapl_common.c. This was missed while splitting old RAPL code intel_rapl.c file into two new files intel_rapl_msr.c and intel_rapl_common.c as per the commit 3382388d7148 ("intel_rapl: abstract RAPL common code"). Currently, this #define entry is being used only in intel_rapl_msr.c file and local definition present in this file.
Signed-off-by: Sumeet Pawnikar sumeet.r.pawnikar@intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/powercap/intel_rapl_common.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c index b624a88b2c25..d85e8e079184 100644 --- a/drivers/powercap/intel_rapl_common.c +++ b/drivers/powercap/intel_rapl_common.c @@ -26,9 +26,6 @@ #include <asm/cpu_device_id.h> #include <asm/intel-family.h>
-/* Local defines */ -#define MSR_PLATFORM_POWER_LIMIT 0x0000065C - /* bitmasks for RAPL MSRs, used by primitive access functions */ #define ENERGY_STATUS_MASK 0xffffffff
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit ee139385432e919f4d1f59b80edbc073cdae1391 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
An oops is triggered by the fuzzy test.
[ 327.853081] unchecked MSR access error: RDMSR from 0x70c at rIP: 0xffffffffc082c820 (uncore_msr_read_counter+0x10/0x50 [intel_uncore]) [ 327.853083] Call Trace: [ 327.853085] <IRQ> [ 327.853089] uncore_pmu_event_start+0x85/0x170 [intel_uncore] [ 327.853093] uncore_pmu_event_add+0x1a4/0x410 [intel_uncore] [ 327.853097] ? event_sched_in.isra.118+0xca/0x240
There are 2 GP counters for each CBOX, but the current code claims 4 counters. Accessing the invalid registers triggers the oops.
Fixes: 6e394376ee89 ("perf/x86/intel/uncore: Add Intel Icelake uncore support") Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200925134905.8839-3-kan.liang@linux.intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore_snb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c index c3314c2395ac..7eea98642b37 100644 --- a/arch/x86/events/intel/uncore_snb.c +++ b/arch/x86/events/intel/uncore_snb.c @@ -286,7 +286,7 @@ void skl_uncore_cpu_init(void)
static struct intel_uncore_type icl_uncore_cbox = { .name = "cbox", - .num_counters = 4, + .num_counters = 2, .perf_ctr_bits = 44, .perf_ctr = ICL_UNC_CBO_0_PER_CTR0, .event_ctl = SNB_UNC_CBO_0_PERFEVTSEL0,
From: Chen Yu yu.c.chen@intel.com
mainline inclusion from mainline-v5.10-rc2 commit 4e0ba5577dba686f96c1c10ef4166380667fdec7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Currently intel_idle driver gets the c-state information from ACPI _CST if the processor model is not recognized by it. However the c-state in _CST starts with index 1 which is different from the index in intel_idle driver's internal c-state table.
While intel_idle_max_cstate_reached() was previously introduced to deal with intel_idle driver's internal c-state table, re-using this function directly on _CST is incorrect.
Fix this by subtracting 1 from the index when checking max_cstate in the _CST case.
For example, append intel_idle.max_cstate=1 in boot command line, Before the patch: grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name POLL After the patch: grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
Fixes: 18734958e9bf ("intel_idle: Use ACPI _CST for processor models without C-state tables") Reported-by: Pengfei Xu pengfei.xu@intel.com Cc: 5.6+ stable@vger.kernel.org # 5.6+ Signed-off-by: Chen Yu yu.c.chen@intel.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index b34a4b2760c0..907f80e02379 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -1298,7 +1298,7 @@ static void intel_idle_init_cstates_acpi(struct cpuidle_driver *drv) struct acpi_processor_cx *cx; struct cpuidle_state *state;
- if (intel_idle_max_cstate_reached(cstate)) + if (intel_idle_max_cstate_reached(cstate - 1)) break;
cx = &acpi_state_table.states[cstate];
From: Mel Gorman mgorman@techsingularity.net
mainline inclusion from mainline-v5.10-rc1 commit 75af76d0a34e048651a6af311781d7206b6964c7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
e6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems") avoids enabling c-states that have been disabled by the platform with the exception of C1E.
Unfortunately, BIOS implementations are not always consistent in terms of how capabilities are advertised and control cannot always be handed over. If control cannot be handed over then intel_idle reports that "ACPI _CST not found or not usable" but does not clear acpi_state_table.count meaning the information is still partially used.
This patch ignores ACPI information if CST control cannot be requested from the platform. This was only observed on a number of Haswell platforms that had identical CPUs but not identical BIOS versions. While this problem may be rare overall, 24 separate test cases bisected to this specific commit across 4 separate test machines and is worth addressing. If the situation occurs, the kernel behaves as it did before commit e6d4f08a6776 and uses any c-states that are discovered.
The affected test cases were all ones that involved a small number of processes -- exec microbenchmark, pipe microbenchmark, git test suite, netperf, tbench with one client and system call microbenchmark. Each case benefits from being able to use turboboost which is prevented if the lower c-states are unavailable. This may mask real regressions specific to older hardware so it is worth addressing.
C-state status before and after the patch
5.9.0-vanilla POLL latency:0 disabled:0 default:enabled 5.9.0-vanilla C1 latency:2 disabled:0 default:enabled 5.9.0-vanilla C1E latency:10 disabled:0 default:enabled 5.9.0-vanilla C3 latency:33 disabled:1 default:disabled 5.9.0-vanilla C6 latency:133 disabled:1 default:disabled 5.9.0-ignore-cst-v1r1 POLL latency:0 disabled:0 default:enabled 5.9.0-ignore-cst-v1r1 C1 latency:2 disabled:0 default:enabled 5.9.0-ignore-cst-v1r1 C1E latency:10 disabled:0 default:enabled 5.9.0-ignore-cst-v1r1 C3 latency:33 disabled:0 default:enabled 5.9.0-ignore-cst-v1r1 C6 latency:133 disabled:0 default:enabled
Patch enables C3/C6.
Netperf UDP_STREAM
netperf-udp 5.5.0 5.9.0 vanilla ignore-cst-v1r1 Hmean send-64 193.41 ( 0.00%) 226.54 * 17.13%* Hmean send-128 392.16 ( 0.00%) 450.54 * 14.89%* Hmean send-256 769.94 ( 0.00%) 881.85 * 14.53%* Hmean send-1024 2994.21 ( 0.00%) 3468.95 * 15.85%* Hmean send-2048 5725.60 ( 0.00%) 6628.99 * 15.78%* Hmean send-3312 8468.36 ( 0.00%) 10288.02 * 21.49%* Hmean send-4096 10135.46 ( 0.00%) 12387.57 * 22.22%* Hmean send-8192 17142.07 ( 0.00%) 19748.11 * 15.20%* Hmean send-16384 28539.71 ( 0.00%) 30084.45 * 5.41%*
Fixes: e6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems") Signed-off-by: Mel Gorman mgorman@techsingularity.net Cc: 5.6+ stable@vger.kernel.org # 5.6+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/idle/intel_idle.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 907f80e02379..acbe78001830 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -1274,14 +1274,13 @@ static bool intel_idle_acpi_cst_extract(void) if (!intel_idle_cst_usable()) continue;
- if (!acpi_processor_claim_cst_control()) { - acpi_state_table.count = 0; - return false; - } + if (!acpi_processor_claim_cst_control()) + break;
return true; }
+ acpi_state_table.count = 0; pr_debug("ACPI _CST not found or not usable\n"); return false; }
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.10-rc1 commit 8191016a026b8dfbb14dea64efc8e723ee99fe65 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The "MiB" result of the IMC free-running bandwidth events, uncore_imc_free_running/read/ and uncore_imc_free_running/write/ are 16 times too small.
The "MiB" value equals the raw IMC free-running bandwidth counter value times a "scale" which is inaccurate.
The IMC free-running bandwidth events should be incremented per 64B cache line, not DWs (4 bytes). The "scale" should be 6.103515625e-5. Fix the "scale" for both Snow Ridge and Ice Lake.
Fixes: 2b3b76b5ec67 ("perf/x86/intel/uncore: Add Ice Lake server uncore support") Fixes: ee49532b38dd ("perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge") Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200928133240.12977-1-kan.liang@linux.intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore_snbep.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index 07c3c7f87731..cfc1eef2c440 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -4574,10 +4574,10 @@ static struct uncore_event_desc snr_uncore_imc_freerunning_events[] = { INTEL_UNCORE_EVENT_DESC(dclk, "event=0xff,umask=0x10"),
INTEL_UNCORE_EVENT_DESC(read, "event=0xff,umask=0x20"), - INTEL_UNCORE_EVENT_DESC(read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(read.scale, "6.103515625e-5"), INTEL_UNCORE_EVENT_DESC(read.unit, "MiB"), INTEL_UNCORE_EVENT_DESC(write, "event=0xff,umask=0x21"), - INTEL_UNCORE_EVENT_DESC(write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(write.scale, "6.103515625e-5"), INTEL_UNCORE_EVENT_DESC(write.unit, "MiB"), { /* end: all zeroes */ }, }; @@ -5033,17 +5033,17 @@ static struct uncore_event_desc icx_uncore_imc_freerunning_events[] = { INTEL_UNCORE_EVENT_DESC(dclk, "event=0xff,umask=0x10"),
INTEL_UNCORE_EVENT_DESC(read, "event=0xff,umask=0x20"), - INTEL_UNCORE_EVENT_DESC(read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(read.scale, "6.103515625e-5"), INTEL_UNCORE_EVENT_DESC(read.unit, "MiB"), INTEL_UNCORE_EVENT_DESC(write, "event=0xff,umask=0x21"), - INTEL_UNCORE_EVENT_DESC(write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(write.scale, "6.103515625e-5"), INTEL_UNCORE_EVENT_DESC(write.unit, "MiB"),
INTEL_UNCORE_EVENT_DESC(ddrt_read, "event=0xff,umask=0x30"), - INTEL_UNCORE_EVENT_DESC(ddrt_read.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(ddrt_read.scale, "6.103515625e-5"), INTEL_UNCORE_EVENT_DESC(ddrt_read.unit, "MiB"), INTEL_UNCORE_EVENT_DESC(ddrt_write, "event=0xff,umask=0x31"), - INTEL_UNCORE_EVENT_DESC(ddrt_write.scale, "3.814697266e-6"), + INTEL_UNCORE_EVENT_DESC(ddrt_write.scale, "6.103515625e-5"), INTEL_UNCORE_EVENT_DESC(ddrt_write.unit, "MiB"), { /* end: all zeroes */ }, };
From: Dinghao Liu dinghao.liu@zju.edu.cn
mainline inclusion from mainline-v5.10-rc1 commit dbb8df5c2d27610a87b0168a8acc89d73fbfde94 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
The default error branch of a series of pdev_is_gen calls should free ndev just like what we've done in these calls.
Fixes: 26bfe3d0b227 ("ntb: intel: Add Icelake (gen4) support for Intel NTB") Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Acked-by: Dave Jiang dave.jiang@intel.com Signed-off-by: Jon Mason jdmason@kudzu.us Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/ntb/hw/intel/ntb_hw_gen1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.c b/drivers/ntb/hw/intel/ntb_hw_gen1.c index cae70e7fe5bd..adf790452209 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen1.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen1.c @@ -1897,7 +1897,7 @@ static int intel_ntb_pci_probe(struct pci_dev *pdev, goto err_init_dev; } else { rc = -EINVAL; - goto err_ndev; + goto err_init_pci; }
ndev_reset_unsafe_flags(ndev);
From: Wang Hai wanghai38@huawei.com
mainline inclusion from mainline-v5.11-rc1 commit 1aa574312518ef1d60d2dc62d58f7021db3b163a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
When I repeatedly modprobe and rmmod dax.ko, kmemleak report a memory leak as follows:
unreferenced object 0xffff9a5588c05088 (size 8): comm "modprobe", pid 261, jiffies 4294693644 (age 42.063s) ... backtrace: [<00000000e007ced0>] kstrdup+0x35/0x70 [<000000002ae73897>] kstrdup_const+0x3d/0x50 [<000000002b00c9c3>] kvasprintf_const+0xbc/0xf0 [<000000008023282f>] kobject_set_name_vargs+0x3b/0xd0 [<00000000d2cbaa4e>] kobject_set_name+0x62/0x90 [<00000000202e7a22>] bus_register+0x7f/0x2b0 [<000000000b77792c>] 0xffffffffc02840f7 [<000000002d5be5ac>] 0xffffffffc02840b4 [<00000000dcafb7cd>] do_one_initcall+0x58/0x240 [<00000000049fe480>] do_init_module+0x56/0x1e2 [<0000000022671491>] load_module+0x2517/0x2840 [<000000001a2201cb>] __do_sys_finit_module+0x9c/0xe0 [<000000003eb304e7>] do_syscall_64+0x33/0x40 [<0000000051c5fd06>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
When rmmod dax is executed, dax_bus_exit() is missing. This patch can fix this bug.
Fixes: 9567da0b408a ("device-dax: Introduce bus + driver model") Cc: stable@vger.kernel.org Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201201135929.66530-1-wanghai38@huawei.com Signed-off-by: Dan Williams dan.j.williams@intel.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dax/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c index ccb22d8db3a2..a5588c00c8dd 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -669,6 +669,7 @@ static int __init dax_core_init(void)
static void __exit dax_core_exit(void) { + dax_bus_exit(); unregister_chrdev_region(dax_devt, MINORMASK+1); ida_destroy(&dax_minor_ida); dax_fs_exit();
From: Dan Carpenter dan.carpenter@oracle.com
mainline inclusion from mainline-v5.13-rc1 commit 4ce535ec0084f0d712317cb99d383cad3288e713 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
We can't use kfree() to free device managed resources so the kfree(dev) is against the rules.
It's easier to write this code if we open code the device_register() as a device_initialize() and device_add(). That way if dev_set_name() set name fails we can call put_device() and it will clean up correctly.
Fixes: acc02a109b04 ("node: Add memory-side caching attributes") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Link: https://lore.kernel.org/r/YHA0JUra+F64+NpB@mwanda Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/base/node.c | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c index 07b9424ca9eb..ac44db3f63c7 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -261,21 +261,20 @@ static void node_init_cache_dev(struct node *node) if (!dev) return;
+ device_initialize(dev); dev->parent = &node->dev; dev->release = node_cache_release; if (dev_set_name(dev, "memory_side_cache")) - goto free_dev; + goto put_device;
- if (device_register(dev)) - goto free_name; + if (device_add(dev)) + goto put_device;
pm_runtime_no_callbacks(dev); node->cache_dev = dev; return; -free_name: - kfree_const(dev->kobj.name); -free_dev: - kfree(dev); +put_device: + put_device(dev); }
/** @@ -312,25 +311,24 @@ void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs) return;
dev = &info->dev; + device_initialize(dev); dev->parent = node->cache_dev; dev->release = node_cacheinfo_release; dev->groups = cache_groups; if (dev_set_name(dev, "index%d", cache_attrs->level)) - goto free_cache; + goto put_device;
info->cache_attrs = *cache_attrs; - if (device_register(dev)) { + if (device_add(dev)) { dev_warn(&node->dev, "failed to add cache level:%d\n", cache_attrs->level); - goto free_name; + goto put_device; } pm_runtime_no_callbacks(dev); list_add_tail(&info->node, &node->cache_attrs); return; -free_name: - kfree_const(dev->kobj.name); -free_cache: - kfree(info); +put_device: + put_device(dev); }
static void node_remove_caches(struct node *node)
From: Kan Liang kan.liang@linux.intel.com
mainline inclusion from mainline-v5.13-rc6 commit 848ff3768684701a4ce73a2ec0e5d438d4e2b0da category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
--------------------------------
Perf tool errors out with the latest event list for the Ice Lake server.
event syntax error: 'unc_m2m_imc_reads.to_pmm' ___ value too big for format, maximum is 255
The same as the Snow Ridge server, the M2M uncore unit in the Ice Lake server has the unit mask extension field as well.
Fixes: 2b3b76b5ec67 ("perf/x86/intel/uncore: Add Ice Lake server uncore support") Reported-by: Jin Yao yao.jin@linux.intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1622552943-119174-1-git-send-email-kan.liang@linux... Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/events/intel/uncore_snbep.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index cfc1eef2c440..01f53d0d3af1 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -4842,9 +4842,10 @@ static struct intel_uncore_type icx_uncore_m2m = { .perf_ctr = SNR_M2M_PCI_PMON_CTR0, .event_ctl = SNR_M2M_PCI_PMON_CTL0, .event_mask = SNBEP_PMON_RAW_EVENT_MASK, + .event_mask_ext = SNR_M2M_PCI_PMON_UMASK_EXT, .box_ctl = SNR_M2M_PCI_PMON_BOX_CTL, .ops = &snr_m2m_uncore_pci_ops, - .format_group = &skx_uncore_format_group, + .format_group = &snr_m2m_uncore_format_group, };
static struct attribute *icx_upi_uncore_formats_attr[] = {
hulk inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
CONFIG_ACPI_HMAT and CONFIG_DEV_DAX_KMEM are newly introduced, so set default value for them in arch/arm64/configs/hulk_defconfig.
Enable CONFIG_EDAC_I10NM, CONFIG_INTEL_SPEED_SELECT_INTERFACE and CONFIG_DEV_DAX_KMEM as module by default in arch/x86/configs/hulk_defconfig.
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/arm64/configs/hulk_defconfig | 2 ++ arch/x86/configs/hulk_defconfig | 3 +++ 2 files changed, 5 insertions(+)
diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig index 7f2e8d13930c..88ae2449e40d 100644 --- a/arch/arm64/configs/hulk_defconfig +++ b/arch/arm64/configs/hulk_defconfig @@ -645,6 +645,7 @@ CONFIG_ACPI_HED=y # CONFIG_ACPI_BGRT is not set CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y CONFIG_ACPI_NFIT=m +# CONFIG_ACPI_HMAT is not set CONFIG_HAVE_ACPI_APEI=y CONFIG_ACPI_APEI=y CONFIG_ACPI_APEI_GHES=y @@ -4785,6 +4786,7 @@ CONFIG_OF_PMEM=m CONFIG_DAX_DRIVER=y CONFIG_DAX=y CONFIG_DEV_DAX=m +# CONFIG_DEV_DAX_KMEM is not set CONFIG_NVMEM=y # CONFIG_QCOM_QFPROM is not set
diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig index 5c5897c269c1..983d52ccad82 100644 --- a/arch/x86/configs/hulk_defconfig +++ b/arch/x86/configs/hulk_defconfig @@ -5791,6 +5791,7 @@ CONFIG_EDAC_I5100=m CONFIG_EDAC_I7300=m CONFIG_EDAC_SBRIDGE=m CONFIG_EDAC_SKX=m +CONFIG_EDAC_I10NM=m CONFIG_EDAC_PND2=m CONFIG_RTC_LIB=y CONFIG_RTC_MC146818_LIB=y @@ -6071,6 +6072,7 @@ CONFIG_PVPANIC=y CONFIG_MLX_PLATFORM=m CONFIG_INTEL_TURBO_MAX_3=y # CONFIG_I2C_MULTI_INSTANTIATE is not set +CONFIG_INTEL_SPEED_SELECT_INTERFACE=m CONFIG_PMC_ATOM=y # CONFIG_CHROME_PLATFORMS is not set CONFIG_MELLANOX_PLATFORM=y @@ -6577,6 +6579,7 @@ CONFIG_DAX_DRIVER=y CONFIG_DAX=y CONFIG_DEV_DAX=m CONFIG_DEV_DAX_PMEM=m +CONFIG_DEV_DAX_KMEM=m CONFIG_NVMEM=y
#
hulk inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
CONFIG_ACPI_HMAT and CONFIG_DEV_DAX_KMEM are newly introduced, set default value for them in arch/arm64/configs/openeuler_defconfig.
Enable CONFIG_EDAC_I10NM, CONFIG_INTEL_SPEED_SELECT_INTERFACE and CONFIG_DEV_DAX_KMEM as module by default in arch/x86/configs/openeuler_defconfig.
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/arm64/configs/openeuler_defconfig | 2 ++ arch/x86/configs/openeuler_defconfig | 3 +++ 2 files changed, 5 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 78bb69111d85..1845c5074b68 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -635,6 +635,7 @@ CONFIG_ACPI_HED=y # CONFIG_ACPI_BGRT is not set CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y CONFIG_ACPI_NFIT=m +# CONFIG_ACPI_HMAT is not set CONFIG_HAVE_ACPI_APEI=y CONFIG_ACPI_APEI=y CONFIG_ACPI_APEI_GHES=y @@ -5114,6 +5115,7 @@ CONFIG_OF_PMEM=m CONFIG_DAX_DRIVER=y CONFIG_DAX=y CONFIG_DEV_DAX=m +# CONFIG_DEV_DAX_KMEM is not set CONFIG_NVMEM=y # CONFIG_QCOM_QFPROM is not set
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index 6fe8a777fa1c..5a56a81e39e1 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -5793,6 +5793,7 @@ CONFIG_EDAC_I5100=m CONFIG_EDAC_I7300=m CONFIG_EDAC_SBRIDGE=m CONFIG_EDAC_SKX=m +CONFIG_EDAC_I10NM=m CONFIG_EDAC_PND2=m CONFIG_RTC_LIB=y CONFIG_RTC_MC146818_LIB=y @@ -6073,6 +6074,7 @@ CONFIG_MLX_PLATFORM=m CONFIG_INTEL_TURBO_MAX_3=y # CONFIG_I2C_MULTI_INSTANTIATE is not set # CONFIG_INTEL_ATOMISP2_PM is not set +CONFIG_INTEL_SPEED_SELECT_INTERFACE=m CONFIG_PMC_ATOM=y # CONFIG_CHROME_PLATFORMS is not set CONFIG_MELLANOX_PLATFORM=y @@ -6586,6 +6588,7 @@ CONFIG_DAX_DRIVER=y CONFIG_DAX=y CONFIG_DEV_DAX=m CONFIG_DEV_DAX_PMEM=m +CONFIG_DEV_DAX_KMEM=m CONFIG_NVMEM=y
#
hulk inclusion category: Bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
Fix kabi broken introduced by the following commit: "PCI: Decode PCIe 32 GT/s link speed".
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com --- include/linux/pci.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/include/linux/pci.h b/include/linux/pci.h index e3c478b6eb78..def71add9f86 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -259,7 +259,9 @@ enum pci_bus_speed { PCIE_SPEED_5_0GT = 0x15, PCIE_SPEED_8_0GT = 0x16, PCIE_SPEED_16_0GT = 0x17, +#ifndef __GENKSYMS__ PCIE_SPEED_32_0GT = 0x18, +#endif PCI_SPEED_UNKNOWN = 0xff, };
hulk inclusion category: Bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
Fix kabi broken introduced by the following commits: "PCI: Add support for Immediate Readiness" "PCI: Make link active reporting detection generic" "PCI: Get rid of dev->has_secondary_link flag"
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: wangxiongfeng 00379786 wangxiongfeng2@huawei.com --- include/linux/pci.h | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/pci.h b/include/linux/pci.h index def71add9f86..de9c199ca606 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -329,7 +329,6 @@ struct pci_dev { pci_power_t current_state; /* Current operating state. In ACPI, this is D0-D3, D0 being fully functional, and D3 being off. */ - unsigned int imm_ready:1; /* Supports Immediate Readiness */ u8 pm_cap; /* PM capability offset */ unsigned int pme_support:5; /* Bitmask of states from which PME# can be generated */ @@ -410,9 +409,9 @@ struct pci_dev { unsigned int broken_intx_masking:1; /* INTx masking can't be used */ unsigned int io_window_1k:1; /* Intel bridge 1K I/O windows */ unsigned int irq_managed:1; + unsigned int has_secondary_link:1; unsigned int non_compliant_bars:1; /* Broken BARs; ignore them */ unsigned int is_probed:1; /* Device probing in progress */ - unsigned int link_active_reporting:1;/* Device capable of reporting link active */ pci_dev_flags_t dev_flags; atomic_t enable_cnt; /* pci_enable_device has been called */
@@ -462,7 +461,17 @@ struct pci_dev { unsigned long slot_being_removed_rescanned; struct pci_dev *rpdev; /* root port pci_dev */
+#ifndef __GENKSYMS__ + union { + struct { + unsigned int imm_ready:1; /* Supports Immediate Readiness */ + unsigned int link_active_reporting:1;/* Device capable of reporting link active */ + }; + unsigned long padding; + }; +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4)
hulk inclusion category: Bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
Fix kabi broken introduced by the following commit: "perf/x86/intel: Support per-thread RDPMC TopDown metrics".
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com --- include/linux/perf_event.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0550194b3252..53869bbd8b5d 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -198,6 +198,7 @@ struct hw_perf_event { */ u64 sample_period;
+#ifndef __GENKSYMS__ union { struct { /* Sampling */ /* @@ -218,6 +219,20 @@ struct hw_perf_event { u64 saved_slots; }; }; +#else + /* + * The period we started this sample with. + */ + u64 last_period; + + /* + * However much is left of the current period; + * note that this is a full 64bit value and + * allows for generation of periods longer + * than hardware might allow. + */ + local64_t period_left; +#endif
/* * State for throttling the event, see __perf_event_overflow() and
hulk inclusion category: Bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
Fix kabi broken introduced by the following commits: "x86/topology: Add CPUID.1F multi-die/package support" "x86/topology: Create topology_max_die_per_package()" "x86/topology: Define topology_logical_die_id()"
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com --- arch/x86/include/asm/processor.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 6f422c01c0aa..e119c78581d1 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -129,14 +129,16 @@ struct cpuinfo_x86 { u16 logical_proc_id; /* Core id: */ u16 cpu_core_id; - u16 cpu_die_id; - u16 logical_die_id; /* Index into per_cpu list: */ u16 cpu_index; u32 microcode; /* Address space bits used by the cache internally */ u8 x86_cache_bits; unsigned initialized : 1; +#ifndef __GENKSYMS__ + u16 cpu_die_id; + u16 logical_die_id; +#endif } __randomize_layout;
struct cpuid_regs {
From: Wei Li liwei391@huawei.com
hulk inclusion category: Bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I47H3V CVE: NA
-------------------------------------------------
Use the new pmu_attr_update_notifier_chain to update attribute groups to fix kabi broken introduced by the following commit: "Intel: perf/core: Add attr_groups_update into struct pmu" "Intel: perf/x86: Use the new pmu::update_attrs attribute group".
Signed-off-by: Wei Li liwei391@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com --- arch/x86/events/core.c | 18 +++++++++++++++++- include/linux/perf_event.h | 4 +++- kernel/events/core.c | 28 +++++++++++++++++++++++++--- 3 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 4e5873593056..7239f6109cc0 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1810,6 +1810,22 @@ ssize_t x86_event_sysfs_show(char *page, u64 config, u64 event) static struct attribute_group x86_pmu_attr_group; static struct attribute_group x86_pmu_caps_group;
+static int x86_pmu_attr_update_notify(struct notifier_block *nb, + unsigned long ret, void *group) +{ + struct pmu *tmp = group; + + if (tmp != &pmu) + return NOTIFY_DONE; + + *(int *)ret = sysfs_update_groups(&tmp->dev->kobj, x86_pmu.attr_update); + return NOTIFY_STOP; +} + +static struct notifier_block x86_pmu_attr_update_notifier = { + .notifier_call = &x86_pmu_attr_update_notify, +}; + static int __init init_hw_perf_events(void) { struct x86_pmu_quirk *quirk; @@ -1868,7 +1884,7 @@ static int __init init_hw_perf_events(void) if (!x86_pmu.events_sysfs_show) x86_pmu_events_group.attrs = &empty_attrs;
- pmu.attr_update = x86_pmu.attr_update; + pmu_attr_update_register_notifier(&x86_pmu_attr_update_notifier);
pr_info("... version: %d\n", x86_pmu.version); pr_info("... bit width: %d\n", x86_pmu.cntval_bits); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 53869bbd8b5d..41bc1d3b311f 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -279,7 +279,6 @@ struct pmu { struct module *module; struct device *dev; const struct attribute_group **attr_groups; - const struct attribute_group **attr_update; const char *name; int type;
@@ -897,6 +896,9 @@ extern void perf_event_itrace_started(struct perf_event *event); extern int perf_pmu_register(struct pmu *pmu, const char *name, int type); extern void perf_pmu_unregister(struct pmu *pmu);
+extern int pmu_attr_update_register_notifier(struct notifier_block *nb); +extern int pmu_attr_update_unregister_notifier(struct notifier_block *nb); + extern int perf_num_counters(void); extern const char *perf_pmu_name(void); extern void __perf_event_task_sched_in(struct task_struct *prev, diff --git a/kernel/events/core.c b/kernel/events/core.c index 56617b912475..09869287b7ae 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9690,6 +9690,30 @@ static struct bus_type pmu_bus = { .name = "event_source", .dev_groups = pmu_dev_groups, }; +static BLOCKING_NOTIFIER_HEAD(pmu_attr_update_notifier_chain); + +int pmu_attr_update_register_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register( + &pmu_attr_update_notifier_chain, nb); +} +EXPORT_SYMBOL_GPL(pmu_attr_update_register_notifier); + +int pmu_attr_update_unregister_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister( + &pmu_attr_update_notifier_chain, nb); +} +EXPORT_SYMBOL_GPL(pmu_attr_update_unregister_notifier); + +static int pmu_attr_update_notifier_call_chain(struct pmu *pmu) +{ + int ret = 0; + + blocking_notifier_call_chain(&pmu_attr_update_notifier_chain, + (unsigned long)&ret, (void *)pmu); + return ret; +}
static void pmu_dev_release(struct device *dev) { @@ -9724,9 +9748,7 @@ static int pmu_dev_alloc(struct pmu *pmu) if (ret) goto del_dev;
- if (pmu->attr_update) - ret = sysfs_update_groups(&pmu->dev->kobj, pmu->attr_update); - + ret = pmu_attr_update_notifier_call_chain(pmu); if (ret) goto del_dev;