From: Jann Horn jannh@google.com
stable inclusion from stable-v5.10.142 commit 895428ee124ad70b9763259308354877b725c31d category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5PE9S CVE: CVE-2022-39188
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=l...
--------------------------------
commit b67fbebd4cf980aecbcc750e1462128bffe8ae15 upstream.
Some drivers rely on having all VMAs through which a PFN might be accessible listed in the rmap for correctness. However, on X86, it was possible for a VMA with stale TLB entries to not be listed in the rmap.
This was fixed in mainline with commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"), but that commit relies on preceding refactoring in commit 18ba064e42df3 ("mmu_gather: Let there be one tlb_{start,end}_vma() implementation") and commit 1e9fdf21a4339 ("mmu_gather: Remove per arch tlb_{start,end}_vma()").
This patch provides equivalent protection without needing that refactoring, by forcing a TLB flush between removing PTEs in unmap_vmas() and the call to unlink_file_vma() in free_pgtables().
[This is a stable-specific rewrite of the upstream commit!] Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: ze zuo zuoze1@huawei.com Reviewed-by: Chen Wandun chenwandun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/mmap.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c index 5489d70db84e..7fba5d89ecde 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2962,6 +2962,18 @@ static void unmap_region(struct mm_struct *mm, tlb_gather_mmu(&tlb, mm, start, end); update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end); + + /* + * Ensure we have no stale TLB entries by the time this mapping is + * removed from the rmap. + * Note that we don't have to worry about nested flushes here because + * we're holding the mm semaphore for removing the mapping - so any + * concurrent flush in this region has to be coming through the rmap, + * and we synchronize against that using the rmap lock. + */ + if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) != 0) + tlb_flush_mmu(&tlb); + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, next ? next->vm_start : USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, start, end);
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.12-rc1 commit 566c6120f095 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5QUS6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------------------------------------------------
Currently we use concrete version to determine the max_cmd_dword. New entries should be added for compatible hardwares of new version or on new platform, otherwise the device will use 16 dwords instead of 64 even if it supports, which will degrade the performance. This will decrease the compatibility and the maintainability.
Drop the switch-case statement of the version checking. Only version less than 0x351 supports maximum 16 command dwords.
Signed-off-by: Yicong Yang yangyicong@hisilicon.com Acked-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/1610526716-14882-1-git-send-email-yangyicong@hisil... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Yicong Yang yangyicong@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/spi/spi-hisi-sfc-v3xx.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/spi/spi-hisi-sfc-v3xx.c b/drivers/spi/spi-hisi-sfc-v3xx.c index 4650b483a33d..832b80e7ef67 100644 --- a/drivers/spi/spi-hisi-sfc-v3xx.c +++ b/drivers/spi/spi-hisi-sfc-v3xx.c @@ -465,14 +465,10 @@ static int hisi_sfc_v3xx_probe(struct platform_device *pdev)
version = readl(host->regbase + HISI_SFC_V3XX_VERSION);
- switch (version) { - case 0x351: + if (version >= 0x351) host->max_cmd_dword = 64; - break; - default: + else host->max_cmd_dword = 16; - break; - }
ret = devm_spi_register_controller(dev, ctlr); if (ret)
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.13-rc1 commit 4a46f88681ca category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5QUS6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------------------------------------------------
We use ACPI_PTR() and related ifendif protection for the id table. This is unnecessary as the struct acpi_device_id is defined in mod_devicetable.h and doesn't rely on ACPI. The driver doesn't use any ACPI apis, so it can be compiled in the ACPI=n case with no warnings.
So remove the ACPI_PTR and related ifendif protection, also replace the header acpi.h with mod_devicetable.h.
Acked-by: John Garry john.garry@huawei.com Signed-off-by: Yicong Yang yangyicong@hisilicon.com Link: https://lore.kernel.org/r/1618228708-37949-3-git-send-email-yangyicong@hisil... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Yicong Yang yangyicong@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/spi/spi-hisi-sfc-v3xx.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/spi/spi-hisi-sfc-v3xx.c b/drivers/spi/spi-hisi-sfc-v3xx.c index 832b80e7ef67..943b2a77be00 100644 --- a/drivers/spi/spi-hisi-sfc-v3xx.c +++ b/drivers/spi/spi-hisi-sfc-v3xx.c @@ -5,13 +5,13 @@ // Copyright (c) 2019 HiSilicon Technologies Co., Ltd. // Author: John Garry john.garry@huawei.com
-#include <linux/acpi.h> #include <linux/bitops.h> #include <linux/completion.h> #include <linux/dmi.h> #include <linux/interrupt.h> #include <linux/iopoll.h> #include <linux/module.h> +#include <linux/mod_devicetable.h> #include <linux/platform_device.h> #include <linux/slab.h> #include <linux/spi/spi.h> @@ -484,18 +484,16 @@ static int hisi_sfc_v3xx_probe(struct platform_device *pdev) return ret; }
-#if IS_ENABLED(CONFIG_ACPI) static const struct acpi_device_id hisi_sfc_v3xx_acpi_ids[] = { {"HISI0341", 0}, {} }; MODULE_DEVICE_TABLE(acpi, hisi_sfc_v3xx_acpi_ids); -#endif
static struct platform_driver hisi_sfc_v3xx_spi_driver = { .driver = { .name = "hisi-sfc-v3xx", - .acpi_match_table = ACPI_PTR(hisi_sfc_v3xx_acpi_ids), + .acpi_match_table = hisi_sfc_v3xx_acpi_ids, }, .probe = hisi_sfc_v3xx_probe, };
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.13-rc1 commit 4c84e42d29af category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5QUS6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------------------------------------------------
We mask the irq when the command completion is timeout. This won't stop the already running irq handler. Use sychronize_irq() after we mask the irq, to make sure there is no running handler.
Acked-by: John Garry john.garry@huawei.com Signed-off-by: Yicong Yang yangyicong@hisilicon.com Link: https://lore.kernel.org/r/1618228708-37949-2-git-send-email-yangyicong@hisil... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Yicong Yang yangyicong@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/spi/spi-hisi-sfc-v3xx.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/spi/spi-hisi-sfc-v3xx.c b/drivers/spi/spi-hisi-sfc-v3xx.c index 943b2a77be00..cbde2620e118 100644 --- a/drivers/spi/spi-hisi-sfc-v3xx.c +++ b/drivers/spi/spi-hisi-sfc-v3xx.c @@ -331,6 +331,7 @@ static int hisi_sfc_v3xx_generic_exec_op(struct hisi_sfc_v3xx_host *host, ret = 0;
hisi_sfc_v3xx_disable_int(host); + synchronize_irq(host->irq); host->completion = NULL; } else { ret = hisi_sfc_v3xx_wait_cmd_idle(host);
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-v5.12-rc1 commit 6d2386e36440 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5QUS6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------------------------------------------------
The address mode is either 3 or 4 for the controller, which is configured by the firmware and cannot be modified in the OS driver. Get the firmware configuration and add address mode check in the .supports_op() to block invalid operations.
Signed-off-by: Yicong Yang yangyicong@hisilicon.com Acked-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/1611740450-47975-3-git-send-email-yangyicong@hisil... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Yicong Yang yangyicong@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/spi/spi-hisi-sfc-v3xx.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/drivers/spi/spi-hisi-sfc-v3xx.c b/drivers/spi/spi-hisi-sfc-v3xx.c index cbde2620e118..d3a23b1c2a4c 100644 --- a/drivers/spi/spi-hisi-sfc-v3xx.c +++ b/drivers/spi/spi-hisi-sfc-v3xx.c @@ -19,6 +19,8 @@
#define HISI_SFC_V3XX_VERSION (0x1f8)
+#define HISI_SFC_V3XX_GLB_CFG (0x100) +#define HISI_SFC_V3XX_GLB_CFG_CS0_ADDR_MODE BIT(2) #define HISI_SFC_V3XX_RAW_INT_STAT (0x120) #define HISI_SFC_V3XX_INT_STAT (0x124) #define HISI_SFC_V3XX_INT_MASK (0x128) @@ -75,6 +77,7 @@ struct hisi_sfc_v3xx_host { void __iomem *regbase; int max_cmd_dword; struct completion *completion; + u8 address_mode; int irq; };
@@ -168,10 +171,18 @@ static int hisi_sfc_v3xx_adjust_op_size(struct spi_mem *mem, static bool hisi_sfc_v3xx_supports_op(struct spi_mem *mem, const struct spi_mem_op *op) { + struct spi_device *spi = mem->spi; + struct hisi_sfc_v3xx_host *host; + + host = spi_controller_get_devdata(spi->master); + if (op->data.buswidth > 4 || op->dummy.buswidth > 4 || op->addr.buswidth > 4 || op->cmd.buswidth > 4) return false;
+ if (op->addr.nbytes != host->address_mode && op->addr.nbytes) + return false; + return spi_mem_default_supports_op(mem, op); }
@@ -417,7 +428,7 @@ static int hisi_sfc_v3xx_probe(struct platform_device *pdev) struct device *dev = &pdev->dev; struct hisi_sfc_v3xx_host *host; struct spi_controller *ctlr; - u32 version; + u32 version, glb_config; int ret;
ctlr = spi_alloc_master(&pdev->dev, sizeof(*host)); @@ -464,6 +475,18 @@ static int hisi_sfc_v3xx_probe(struct platform_device *pdev) ctlr->num_chipselect = 1; ctlr->mem_ops = &hisi_sfc_v3xx_mem_ops;
+ /* + * The address mode of the controller is either 3 or 4, + * which is indicated by the address mode bit in + * the global config register. The register is read only + * for the OS driver. + */ + glb_config = readl(host->regbase + HISI_SFC_V3XX_GLB_CFG); + if (glb_config & HISI_SFC_V3XX_GLB_CFG_CS0_ADDR_MODE) + host->address_mode = 4; + else + host->address_mode = 3; + version = readl(host->regbase + HISI_SFC_V3XX_VERSION);
if (version >= 0x351)
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-remotes/origin/next commit 24b6c7798a0122012ca848ea0d25e973334266b0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RP8T CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/commit/?...
--------------------------------------------------------------------------
The DMA operations of HiSilicon PTT device can only work properly with identical mappings. So add a quirk for the device to force the domain as passthrough.
Acked-by: Will Deacon will@kernel.org Signed-off-by: Yicong Yang yangyicong@hisilicon.com Reviewed-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/20220816114414.4092-2-yangyicong@huawei.com Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index b22d0187ea8a..46c15673788b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -4315,6 +4315,26 @@ static int arm_smmu_device_set_config(struct device *dev, int type, void *data) } }
+/* + * HiSilicon PCIe tune and trace device can be used to trace TLP headers on the + * PCIe link and save the data to memory by DMA. The hardware is restricted to + * use identity mapping only. + */ +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ + (pdev)->device == 0xa12e) + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable = arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -4350,6 +4370,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type = arm_smmu_def_domain_type, .aux_attach_dev = arm_smmu_aux_attach_dev, .aux_detach_dev = arm_smmu_aux_detach_dev, .aux_get_pasid = arm_smmu_aux_get_pasid,
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-remotes/origin/next commit ff0de066b4632ccb2b2e50f90c0c5be7f4689de7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RP8T CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/commit/?...
--------------------------------------------------------------------------
HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic and trace the TLP headers.
Add the driver for the device to enable the trace function. Register PMU device of PTT trace, then users can use trace through perf command. The driver makes use of perf AUX trace function and support the following events to configure the trace:
- filter: select Root port or Endpoint to trace - type: select the type of traced TLP headers - direction: select the direction of traced TLP headers - format: select the data format of the traced TLP headers
This patch initially add basic trace support of PTT device.
Acked-by: Mathieu Poirier mathieu.poirier@linaro.org Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Reviewed-by: John Garry john.garry@huawei.com Signed-off-by: Yicong Yang yangyicong@hisilicon.com Link: https://lore.kernel.org/r/20220816114414.4092-3-yangyicong@huawei.com Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig | 2 + drivers/hwtracing/ptt/Kconfig | 12 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 916 +++++++++++++++++++++++++++++++ drivers/hwtracing/ptt/hisi_ptt.h | 177 ++++++ 6 files changed, 1110 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h
diff --git a/drivers/Makefile b/drivers/Makefile index 9d67932a5037..fd9c0b3da5f1 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -177,6 +177,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_VENDOR_HOOKS) += hooks/ obj-$(CONFIG_NVMEM) += nvmem/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig"
source "drivers/hwtracing/intel_th/Kconfig"
+source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index 000000000000..6d46a09ffeb9 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index 000000000000..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index 000000000000..1a56b31b921a --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,916 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. + * Author: Yicong Yang yangyicong@hisilicon.com + */ + +#include <linux/bitfield.h> +#include <linux/bitops.h> +#include <linux/cpuhotplug.h> +#include <linux/delay.h> +#include <linux/dma-iommu.h> +#include <linux/dma-mapping.h> +#include <linux/interrupt.h> +#include <linux/io.h> +#include <linux/iommu.h> +#include <linux/iopoll.h> +#include <linux/module.h> +#include <linux/sysfs.h> +#include <linux/vmalloc.h> + +#include "hisi_ptt.h" + +/* Dynamic CPU hotplug state used by PTT */ +static enum cpuhp_state hisi_ptt_pmu_online; + +static u16 hisi_ptt_get_filter_val(u16 devid, bool is_port) +{ + if (is_port) + return BIT(HISI_PCIE_CORE_PORT_ID(devid & 0xff)); + + return devid; +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static void hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, + HISI_PTT_RESET_TIMEOUT_US); +} + +static void hisi_ptt_trace_end(struct hisi_ptt *hisi_ptt) +{ + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + hisi_ptt->trace_ctrl.started = false; +} + +static int hisi_ptt_trace_start(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + u32 val; + int i; + + /* Check device idle before start trace */ + if (!hisi_ptt_wait_trace_hw_idle(hisi_ptt)) { + pci_err(hisi_ptt->pdev, "Failed to start trace, the device is still busy\n"); + return -EBUSY; + } + + ctrl->started = true; + + /* Reset the DMA before start tracing */ + val = readl(hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + val |= HISI_PTT_TRACE_CTRL_RST; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + hisi_ptt_wait_dma_reset_done(hisi_ptt); + + val = readl(hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + val &= ~HISI_PTT_TRACE_CTRL_RST; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + /* Reset the index of current buffer */ + hisi_ptt->trace_ctrl.buf_index = 0; + + /* Zero the trace buffers */ + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) + memset(ctrl->trace_buf[i].addr, 0, HISI_PTT_TRACE_BUF_SIZE); + + /* Clear the interrupt status */ + writel(HISI_PTT_TRACE_INT_STAT_MASK, hisi_ptt->iobase + HISI_PTT_TRACE_INT_STAT); + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_INT_MASK); + + /* Set the trace control register */ + val = FIELD_PREP(HISI_PTT_TRACE_CTRL_TYPE_SEL, ctrl->type); + val |= FIELD_PREP(HISI_PTT_TRACE_CTRL_RXTX_SEL, ctrl->direction); + val |= FIELD_PREP(HISI_PTT_TRACE_CTRL_DATA_FORMAT, ctrl->format); + val |= FIELD_PREP(HISI_PTT_TRACE_CTRL_TARGET_SEL, hisi_ptt->trace_ctrl.filter); + if (!hisi_ptt->trace_ctrl.is_port) + val |= HISI_PTT_TRACE_CTRL_FILTER_MODE; + + /* Start the Trace */ + val |= HISI_PTT_TRACE_CTRL_EN; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + return 0; +} + +static int hisi_ptt_update_aux(struct hisi_ptt *hisi_ptt, int index, bool stop) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct perf_output_handle *handle = &ctrl->handle; + struct perf_event *event = handle->event; + struct hisi_ptt_pmu_buf *buf; + size_t size; + void *addr; + + buf = perf_get_aux(handle); + if (!buf || !handle->size) + return -EINVAL; + + addr = ctrl->trace_buf[ctrl->buf_index].addr; + + /* + * If we're going to stop, read the size of already traced data from + * HISI_PTT_TRACE_WR_STS. Otherwise we're coming from the interrupt, + * the data size is always HISI_PTT_TRACE_BUF_SIZE. + */ + if (stop) { + u32 reg; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS); + size = FIELD_GET(HISI_PTT_TRACE_WR_STS_WRITE, reg); + } else { + size = HISI_PTT_TRACE_BUF_SIZE; + } + + memcpy(buf->base + buf->pos, addr, size); + buf->pos += size; + + /* + * Just commit the traced data if we're going to stop. Otherwise if the + * resident AUX buffer cannot contain the data of next trace buffer, + * apply a new one. + */ + if (stop) { + perf_aux_output_end(handle, buf->pos); + } else if (buf->length - buf->pos < HISI_PTT_TRACE_BUF_SIZE) { + perf_aux_output_end(handle, buf->pos); + + buf = perf_aux_output_begin(handle, event); + if (!buf) + return -EINVAL; + + buf->pos = handle->head % buf->length; + if (buf->length - buf->pos < HISI_PTT_TRACE_BUF_SIZE) { + perf_aux_output_end(handle, 0); + return -EINVAL; + } + } + + return 0; +} + +static irqreturn_t hisi_ptt_isr(int irq, void *context) +{ + struct hisi_ptt *hisi_ptt = context; + u32 status, buf_idx; + + status = readl(hisi_ptt->iobase + HISI_PTT_TRACE_INT_STAT); + if (!(status & HISI_PTT_TRACE_INT_STAT_MASK)) + return IRQ_NONE; + + buf_idx = ffs(status) - 1; + + /* Clear the interrupt status of buffer @buf_idx */ + writel(status, hisi_ptt->iobase + HISI_PTT_TRACE_INT_STAT); + + /* + * Update the AUX buffer and cache the current buffer index, + * as we need to know this and save the data when the trace + * is ended out of the interrupt handler. End the trace + * if the updating fails. + */ + if (hisi_ptt_update_aux(hisi_ptt, buf_idx, false)) + hisi_ptt_trace_end(hisi_ptt); + else + hisi_ptt->trace_ctrl.buf_index = (buf_idx + 1) % HISI_PTT_TRACE_BUF_CNT; + + return IRQ_HANDLED; +} + +static void hisi_ptt_irq_free_vectors(void *pdev) +{ + pci_free_irq_vectors(pdev); +} + +static int hisi_ptt_register_irq(struct hisi_ptt *hisi_ptt) +{ + struct pci_dev *pdev = hisi_ptt->pdev; + int ret; + + ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI); + if (ret < 0) { + pci_err(pdev, "failed to allocate irq vector, ret = %d\n", ret); + return ret; + } + + ret = devm_add_action_or_reset(&pdev->dev, hisi_ptt_irq_free_vectors, pdev); + if (ret < 0) + return ret; + + ret = devm_request_threaded_irq(&pdev->dev, + pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), + NULL, hisi_ptt_isr, 0, + DRV_NAME, hisi_ptt); + if (ret) { + pci_err(pdev, "failed to request irq %d, ret = %d\n", + pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), ret); + return ret; + } + + return 0; +} + +static int hisi_ptt_init_filters(struct pci_dev *pdev, void *data) +{ + struct hisi_ptt_filter_desc *filter; + struct hisi_ptt *hisi_ptt = data; + + /* + * We won't fail the probe if filter allocation failed here. The filters + * should be partial initialized and users would know which filter fails + * through the log. Other functions of PTT device are still available. + */ + filter = kzalloc(sizeof(*filter), GFP_KERNEL); + if (!filter) { + pci_err(hisi_ptt->pdev, "failed to add filter %s\n", pci_name(pdev)); + return -ENOMEM; + } + + filter->devid = PCI_DEVID(pdev->bus->number, pdev->devfn); + + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) { + filter->is_port = true; + list_add_tail(&filter->list, &hisi_ptt->port_filters); + + /* Update the available port mask */ + hisi_ptt->port_mask |= hisi_ptt_get_filter_val(filter->devid, true); + } else { + list_add_tail(&filter->list, &hisi_ptt->req_filters); + } + + return 0; +} + +static void hisi_ptt_release_filters(void *data) +{ + struct hisi_ptt_filter_desc *filter, *tmp; + struct hisi_ptt *hisi_ptt = data; + + list_for_each_entry_safe(filter, tmp, &hisi_ptt->req_filters, list) { + list_del(&filter->list); + kfree(filter); + } + + list_for_each_entry_safe(filter, tmp, &hisi_ptt->port_filters, list) { + list_del(&filter->list); + kfree(filter); + } +} + +static int hisi_ptt_config_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + ctrl->trace_buf = devm_kcalloc(dev, HISI_PTT_TRACE_BUF_CNT, + sizeof(*ctrl->trace_buf), GFP_KERNEL); + if (!ctrl->trace_buf) + return -ENOMEM; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; ++i) { + ctrl->trace_buf[i].addr = dmam_alloc_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + &ctrl->trace_buf[i].dma, + GFP_KERNEL); + if (!ctrl->trace_buf[i].addr) + return -ENOMEM; + } + + /* Configure the trace DMA buffer */ + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) { + writel(lower_32_bits(ctrl->trace_buf[i].dma), + hisi_ptt->iobase + HISI_PTT_TRACE_ADDR_BASE_LO_0 + + i * HISI_PTT_TRACE_ADDR_STRIDE); + writel(upper_32_bits(ctrl->trace_buf[i].dma), + hisi_ptt->iobase + HISI_PTT_TRACE_ADDR_BASE_HI_0 + + i * HISI_PTT_TRACE_ADDR_STRIDE); + } + writel(HISI_PTT_TRACE_BUF_SIZE, hisi_ptt->iobase + HISI_PTT_TRACE_ADDR_SIZE); + + return 0; +} + +static int hisi_ptt_init_ctrls(struct hisi_ptt *hisi_ptt) +{ + struct pci_dev *pdev = hisi_ptt->pdev; + struct pci_bus *bus; + int ret; + u32 reg; + + INIT_LIST_HEAD(&hisi_ptt->port_filters); + INIT_LIST_HEAD(&hisi_ptt->req_filters); + + ret = hisi_ptt_config_trace_buf(hisi_ptt); + if (ret) + return ret; + + /* + * The device range register provides the information about the root + * ports which the RCiEP can control and trace. The RCiEP and the root + * ports which it supports are on the same PCIe core, with same domain + * number but maybe different bus number. The device range register + * will tell us which root ports we can support, Bit[31:16] indicates + * the upper BDF numbers of the root port, while Bit[15:0] indicates + * the lower. + */ + reg = readl(hisi_ptt->iobase + HISI_PTT_DEVICE_RANGE); + hisi_ptt->upper_bdf = FIELD_GET(HISI_PTT_DEVICE_RANGE_UPPER, reg); + hisi_ptt->lower_bdf = FIELD_GET(HISI_PTT_DEVICE_RANGE_LOWER, reg); + + bus = pci_find_bus(pci_domain_nr(pdev->bus), PCI_BUS_NUM(hisi_ptt->upper_bdf)); + if (bus) + pci_walk_bus(bus, hisi_ptt_init_filters, hisi_ptt); + + ret = devm_add_action_or_reset(&pdev->dev, hisi_ptt_release_filters, hisi_ptt); + if (ret) + return ret; + + hisi_ptt->trace_ctrl.on_cpu = -1; + return 0; +} + +static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + const cpumask_t *cpumask = cpumask_of_node(dev_to_node(&hisi_ptt->pdev->dev)); + + return cpumap_print_to_pagebuf(true, buf, cpumask); +} +static DEVICE_ATTR_RO(cpumask); + +static struct attribute *hisi_ptt_cpumask_attrs[] = { + &dev_attr_cpumask.attr, + NULL +}; + +static const struct attribute_group hisi_ptt_cpumask_attr_group = { + .attrs = hisi_ptt_cpumask_attrs, +}; + +/* + * Bit 19 indicates the filter type, 1 for Root Port filter and 0 for Requester + * filter. Bit[15:0] indicates the filter value, for Root Port filter it's + * a bit mask of desired ports and for Requester filter it's the Requester ID + * of the desired PCIe function. Bit[18:16] is reserved for extension. + * + * See hisi_ptt.rst documentation for detailed information. + */ +PMU_FORMAT_ATTR(filter, "config:0-19"); +PMU_FORMAT_ATTR(direction, "config:20-23"); +PMU_FORMAT_ATTR(type, "config:24-31"); +PMU_FORMAT_ATTR(format, "config:32-35"); + +static struct attribute *hisi_ptt_pmu_format_attrs[] = { + &format_attr_filter.attr, + &format_attr_direction.attr, + &format_attr_type.attr, + &format_attr_format.attr, + NULL +}; + +static struct attribute_group hisi_ptt_pmu_format_group = { + .name = "format", + .attrs = hisi_ptt_pmu_format_attrs, +}; + +static const struct attribute_group *hisi_ptt_pmu_groups[] = { + &hisi_ptt_cpumask_attr_group, + &hisi_ptt_pmu_format_group, + NULL +}; + +static int hisi_ptt_trace_valid_direction(u32 val) +{ + /* + * The direction values have different effects according to the data + * format (specified in the parentheses). TLP set A/B means different + * set of TLP types. See hisi_ptt.rst documentation for more details. + */ + static const u32 hisi_ptt_trace_available_direction[] = { + 0, /* inbound(4DW) or reserved(8DW) */ + 1, /* outbound(4DW) */ + 2, /* {in, out}bound(4DW) or inbound(8DW), TLP set A */ + 3, /* {in, out}bound(4DW) or inbound(8DW), TLP set B */ + }; + int i; + + for (i = 0; i < ARRAY_SIZE(hisi_ptt_trace_available_direction); i++) { + if (val == hisi_ptt_trace_available_direction[i]) + return 0; + } + + return -EINVAL; +} + +static int hisi_ptt_trace_valid_type(u32 val) +{ + /* Different types can be set simultaneously */ + static const u32 hisi_ptt_trace_available_type[] = { + 1, /* posted_request */ + 2, /* non-posted_request */ + 4, /* completion */ + }; + int i; + + if (!val) + return -EINVAL; + + /* + * Walk the available list and clear the valid bits of + * the config. If there is any resident bit after the + * walk then the config is invalid. + */ + for (i = 0; i < ARRAY_SIZE(hisi_ptt_trace_available_type); i++) + val &= ~hisi_ptt_trace_available_type[i]; + + if (val) + return -EINVAL; + + return 0; +} + +static int hisi_ptt_trace_valid_format(u32 val) +{ + static const u32 hisi_ptt_trace_availble_format[] = { + 0, /* 4DW */ + 1, /* 8DW */ + }; + int i; + + for (i = 0; i < ARRAY_SIZE(hisi_ptt_trace_availble_format); i++) { + if (val == hisi_ptt_trace_availble_format[i]) + return 0; + } + + return -EINVAL; +} + +static int hisi_ptt_trace_valid_filter(struct hisi_ptt *hisi_ptt, u64 config) +{ + unsigned long val, port_mask = hisi_ptt->port_mask; + struct hisi_ptt_filter_desc *filter; + + hisi_ptt->trace_ctrl.is_port = FIELD_GET(HISI_PTT_PMU_FILTER_IS_PORT, config); + val = FIELD_GET(HISI_PTT_PMU_FILTER_VAL_MASK, config); + + /* + * Port filters are defined as bit mask. For port filters, check + * the bits in the @val are within the range of hisi_ptt->port_mask + * and whether it's empty or not, otherwise user has specified + * some unsupported root ports. + * + * For Requester ID filters, walk the available filter list to see + * whether we have one matched. + */ + if (!hisi_ptt->trace_ctrl.is_port) { + list_for_each_entry(filter, &hisi_ptt->req_filters, list) { + if (val == hisi_ptt_get_filter_val(filter->devid, filter->is_port)) + return 0; + } + } else if (bitmap_subset(&val, &port_mask, BITS_PER_LONG)) { + return 0; + } + + return -EINVAL; +} + +static void hisi_ptt_pmu_init_configs(struct hisi_ptt *hisi_ptt, struct perf_event *event) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + u32 val; + + val = FIELD_GET(HISI_PTT_PMU_FILTER_VAL_MASK, event->attr.config); + hisi_ptt->trace_ctrl.filter = val; + + val = FIELD_GET(HISI_PTT_PMU_DIRECTION_MASK, event->attr.config); + ctrl->direction = val; + + val = FIELD_GET(HISI_PTT_PMU_TYPE_MASK, event->attr.config); + ctrl->type = val; + + val = FIELD_GET(HISI_PTT_PMU_FORMAT_MASK, event->attr.config); + ctrl->format = val; +} + +static int hisi_ptt_pmu_event_init(struct perf_event *event) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(event->pmu); + int ret; + u32 val; + + if (event->cpu < 0) { + dev_dbg(event->pmu->dev, "Per-task mode not supported\n"); + return -EOPNOTSUPP; + } + + if (event->attr.type != hisi_ptt->hisi_ptt_pmu.type) + return -ENOENT; + + ret = hisi_ptt_trace_valid_filter(hisi_ptt, event->attr.config); + if (ret < 0) + return ret; + + val = FIELD_GET(HISI_PTT_PMU_DIRECTION_MASK, event->attr.config); + ret = hisi_ptt_trace_valid_direction(val); + if (ret < 0) + return ret; + + val = FIELD_GET(HISI_PTT_PMU_TYPE_MASK, event->attr.config); + ret = hisi_ptt_trace_valid_type(val); + if (ret < 0) + return ret; + + val = FIELD_GET(HISI_PTT_PMU_FORMAT_MASK, event->attr.config); + return hisi_ptt_trace_valid_format(val); +} + +static void *hisi_ptt_pmu_setup_aux(struct perf_event *event, void **pages, + int nr_pages, bool overwrite) +{ + struct hisi_ptt_pmu_buf *buf; + struct page **pagelist; + int i; + + if (overwrite) { + dev_warn(event->pmu->dev, "Overwrite mode is not supported\n"); + return NULL; + } + + /* If the pages size less than buffers, we cannot start trace */ + if (nr_pages < HISI_PTT_TRACE_TOTAL_BUF_SIZE / PAGE_SIZE) + return NULL; + + buf = kzalloc(sizeof(*buf), GFP_KERNEL); + if (!buf) + return NULL; + + pagelist = kcalloc(nr_pages, sizeof(*pagelist), GFP_KERNEL); + if (!pagelist) + goto err; + + for (i = 0; i < nr_pages; i++) + pagelist[i] = virt_to_page(pages[i]); + + buf->base = vmap(pagelist, nr_pages, VM_MAP, PAGE_KERNEL); + if (!buf->base) { + kfree(pagelist); + goto err; + } + + buf->nr_pages = nr_pages; + buf->length = nr_pages * PAGE_SIZE; + buf->pos = 0; + + kfree(pagelist); + return buf; +err: + kfree(buf); + return NULL; +} + +static void hisi_ptt_pmu_free_aux(void *aux) +{ + struct hisi_ptt_pmu_buf *buf = aux; + + vunmap(buf->base); + kfree(buf); +} + +static void hisi_ptt_pmu_start(struct perf_event *event, int flags) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(event->pmu); + struct perf_output_handle *handle = &hisi_ptt->trace_ctrl.handle; + struct hw_perf_event *hwc = &event->hw; + struct device *dev = event->pmu->dev; + struct hisi_ptt_pmu_buf *buf; + int cpu = event->cpu; + int ret; + + hwc->state = 0; + + /* Serialize the perf process if user specified several CPUs */ + spin_lock(&hisi_ptt->pmu_lock); + if (hisi_ptt->trace_ctrl.started) { + dev_dbg(dev, "trace has already started\n"); + goto stop; + } + + /* + * Handle the interrupt on the same cpu which starts the trace to avoid + * context mismatch. Otherwise we'll trigger the WARN from the perf + * core in event_function_local(). If CPU passed is offline we'll fail + * here, just log it since we can do nothing here. + */ + ret = irq_set_affinity(pci_irq_vector(hisi_ptt->pdev, HISI_PTT_TRACE_DMA_IRQ), + cpumask_of(cpu)); + if (ret) + dev_warn(dev, "failed to set the affinity of trace interrupt\n"); + + hisi_ptt->trace_ctrl.on_cpu = cpu; + + buf = perf_aux_output_begin(handle, event); + if (!buf) { + dev_dbg(dev, "aux output begin failed\n"); + goto stop; + } + + buf->pos = handle->head % buf->length; + + hisi_ptt_pmu_init_configs(hisi_ptt, event); + + ret = hisi_ptt_trace_start(hisi_ptt); + if (ret) { + dev_dbg(dev, "trace start failed, ret = %d\n", ret); + perf_aux_output_end(handle, 0); + goto stop; + } + + spin_unlock(&hisi_ptt->pmu_lock); + return; +stop: + event->hw.state |= PERF_HES_STOPPED; + spin_unlock(&hisi_ptt->pmu_lock); +} + +static void hisi_ptt_pmu_stop(struct perf_event *event, int flags) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(event->pmu); + struct hw_perf_event *hwc = &event->hw; + + if (hwc->state & PERF_HES_STOPPED) + return; + + spin_lock(&hisi_ptt->pmu_lock); + if (hisi_ptt->trace_ctrl.started) { + hisi_ptt_trace_end(hisi_ptt); + + if (!hisi_ptt_wait_trace_hw_idle(hisi_ptt)) + dev_warn(event->pmu->dev, "Device is still busy\n"); + + hisi_ptt_update_aux(hisi_ptt, hisi_ptt->trace_ctrl.buf_index, true); + } + spin_unlock(&hisi_ptt->pmu_lock); + + hwc->state |= PERF_HES_STOPPED; + perf_event_update_userpage(event); + hwc->state |= PERF_HES_UPTODATE; +} + +static int hisi_ptt_pmu_add(struct perf_event *event, int flags) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(event->pmu); + struct hw_perf_event *hwc = &event->hw; + int cpu = event->cpu; + + /* Only allow the cpus on the device's node to add the event */ + if (!cpumask_test_cpu(cpu, cpumask_of_node(dev_to_node(&hisi_ptt->pdev->dev)))) + return 0; + + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE; + + if (flags & PERF_EF_START) { + hisi_ptt_pmu_start(event, PERF_EF_RELOAD); + if (hwc->state & PERF_HES_STOPPED) + return -EINVAL; + } + + return 0; +} + +static void hisi_ptt_pmu_del(struct perf_event *event, int flags) +{ + hisi_ptt_pmu_stop(event, PERF_EF_UPDATE); +} + +static void hisi_ptt_remove_cpuhp_instance(void *hotplug_node) +{ + cpuhp_state_remove_instance_nocalls(hisi_ptt_pmu_online, hotplug_node); +} + +static void hisi_ptt_unregister_pmu(void *pmu) +{ + perf_pmu_unregister(pmu); +} + +static int hisi_ptt_register_pmu(struct hisi_ptt *hisi_ptt) +{ + u16 core_id, sicl_id; + char *pmu_name; + u32 reg; + int ret; + + ret = cpuhp_state_add_instance_nocalls(hisi_ptt_pmu_online, + &hisi_ptt->hotplug_node); + if (ret) + return ret; + + ret = devm_add_action_or_reset(&hisi_ptt->pdev->dev, + hisi_ptt_remove_cpuhp_instance, + &hisi_ptt->hotplug_node); + if (ret) + return ret; + + spin_lock_init(&hisi_ptt->pmu_lock); + + hisi_ptt->hisi_ptt_pmu = (struct pmu) { + .module = THIS_MODULE, + .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE, + .task_ctx_nr = perf_sw_context, + .attr_groups = hisi_ptt_pmu_groups, + .event_init = hisi_ptt_pmu_event_init, + .setup_aux = hisi_ptt_pmu_setup_aux, + .free_aux = hisi_ptt_pmu_free_aux, + .start = hisi_ptt_pmu_start, + .stop = hisi_ptt_pmu_stop, + .add = hisi_ptt_pmu_add, + .del = hisi_ptt_pmu_del, + }; + + reg = readl(hisi_ptt->iobase + HISI_PTT_LOCATION); + core_id = FIELD_GET(HISI_PTT_CORE_ID, reg); + sicl_id = FIELD_GET(HISI_PTT_SICL_ID, reg); + + pmu_name = devm_kasprintf(&hisi_ptt->pdev->dev, GFP_KERNEL, "hisi_ptt%u_%u", + sicl_id, core_id); + if (!pmu_name) + return -ENOMEM; + + ret = perf_pmu_register(&hisi_ptt->hisi_ptt_pmu, pmu_name, -1); + if (ret) + return ret; + + return devm_add_action_or_reset(&hisi_ptt->pdev->dev, + hisi_ptt_unregister_pmu, + &hisi_ptt->hisi_ptt_pmu); +} + +/* + * The DMA of PTT trace can only use direct mappings due to some + * hardware restriction. Check whether there is no IOMMU or the + * policy of the IOMMU domain is passthrough, otherwise the trace + * cannot work. + * + * The PTT device is supposed to behind an ARM SMMUv3, which + * should have passthrough the device by a quirk. + */ +static int hisi_ptt_check_iommu_mapping(struct pci_dev *pdev) +{ + struct iommu_domain *iommu_domain; + + iommu_domain = iommu_get_domain_for_dev(&pdev->dev); + if (!iommu_domain || iommu_domain->type == IOMMU_DOMAIN_IDENTITY) + return 0; + + return -EOPNOTSUPP; +} + +static int hisi_ptt_probe(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + struct hisi_ptt *hisi_ptt; + int ret; + + ret = hisi_ptt_check_iommu_mapping(pdev); + if (ret) { + pci_err(pdev, "requires direct DMA mappings\n"); + return ret; + } + + hisi_ptt = devm_kzalloc(&pdev->dev, sizeof(*hisi_ptt), GFP_KERNEL); + if (!hisi_ptt) + return -ENOMEM; + + hisi_ptt->pdev = pdev; + pci_set_drvdata(pdev, hisi_ptt); + + ret = pcim_enable_device(pdev); + if (ret) { + pci_err(pdev, "failed to enable device, ret = %d\n", ret); + return ret; + } + + ret = pcim_iomap_regions(pdev, BIT(2), DRV_NAME); + if (ret) { + pci_err(pdev, "failed to remap io memory, ret = %d\n", ret); + return ret; + } + + hisi_ptt->iobase = pcim_iomap_table(pdev)[2]; + + ret = dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64)); + if (ret) { + pci_err(pdev, "failed to set 64 bit dma mask, ret = %d\n", ret); + return ret; + } + + pci_set_master(pdev); + + ret = hisi_ptt_register_irq(hisi_ptt); + if (ret) + return ret; + + ret = hisi_ptt_init_ctrls(hisi_ptt); + if (ret) { + pci_err(pdev, "failed to init controls, ret = %d\n", ret); + return ret; + } + + ret = hisi_ptt_register_pmu(hisi_ptt); + if (ret) { + pci_err(pdev, "failed to register PMU device, ret = %d", ret); + return ret; + } + + return 0; +} + +static const struct pci_device_id hisi_ptt_id_tbl[] = { + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, 0xa12e) }, + { } +}; +MODULE_DEVICE_TABLE(pci, hisi_ptt_id_tbl); + +static struct pci_driver hisi_ptt_driver = { + .name = DRV_NAME, + .id_table = hisi_ptt_id_tbl, + .probe = hisi_ptt_probe, +}; + +static int hisi_ptt_cpu_teardown(unsigned int cpu, struct hlist_node *node) +{ + struct hisi_ptt *hisi_ptt; + struct device *dev; + int target, src; + + hisi_ptt = hlist_entry_safe(node, struct hisi_ptt, hotplug_node); + src = hisi_ptt->trace_ctrl.on_cpu; + dev = hisi_ptt->hisi_ptt_pmu.dev; + + if (!hisi_ptt->trace_ctrl.started || src != cpu) + return 0; + + target = cpumask_any_but(cpumask_of_node(dev_to_node(&hisi_ptt->pdev->dev)), cpu); + if (target >= nr_cpu_ids) { + dev_err(dev, "no available cpu for perf context migration\n"); + return 0; + } + + perf_pmu_migrate_context(&hisi_ptt->hisi_ptt_pmu, src, target); + + /* + * Also make sure the interrupt bind to the migrated CPU as well. Warn + * the user on failure here. + */ + if (irq_set_affinity(pci_irq_vector(hisi_ptt->pdev, HISI_PTT_TRACE_DMA_IRQ), + cpumask_of(target))) + dev_warn(dev, "failed to set the affinity of trace interrupt\n"); + + hisi_ptt->trace_ctrl.on_cpu = target; + return 0; +} + +static int __init hisi_ptt_init(void) +{ + int ret; + + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, DRV_NAME, NULL, + hisi_ptt_cpu_teardown); + if (ret < 0) + return ret; + hisi_ptt_pmu_online = ret; + + ret = pci_register_driver(&hisi_ptt_driver); + if (ret) + cpuhp_remove_multi_state(hisi_ptt_pmu_online); + + return ret; +} +module_init(hisi_ptt_init); + +static void __exit hisi_ptt_exit(void) +{ + pci_unregister_driver(&hisi_ptt_driver); + cpuhp_remove_multi_state(hisi_ptt_pmu_online); +} +module_exit(hisi_ptt_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Yicong Yang yangyicong@hisilicon.com"); +MODULE_DESCRIPTION("Driver for HiSilicon PCIe tune and trace device"); diff --git a/drivers/hwtracing/ptt/hisi_ptt.h b/drivers/hwtracing/ptt/hisi_ptt.h new file mode 100644 index 000000000000..10446dce8a86 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.h @@ -0,0 +1,177 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. + * Author: Yicong Yang yangyicong@hisilicon.com + */ + +#ifndef _HISI_PTT_H +#define _HISI_PTT_H + +#include <linux/bits.h> +#include <linux/cpumask.h> +#include <linux/list.h> +#include <linux/pci.h> +#include <linux/perf_event.h> +#include <linux/spinlock.h> +#include <linux/types.h> + +#define DRV_NAME "hisi_ptt" + +/* + * The definition of the device registers and register fields. + */ +#define HISI_PTT_TRACE_ADDR_SIZE 0x0800 +#define HISI_PTT_TRACE_ADDR_BASE_LO_0 0x0810 +#define HISI_PTT_TRACE_ADDR_BASE_HI_0 0x0814 +#define HISI_PTT_TRACE_ADDR_STRIDE 0x8 +#define HISI_PTT_TRACE_CTRL 0x0850 +#define HISI_PTT_TRACE_CTRL_EN BIT(0) +#define HISI_PTT_TRACE_CTRL_RST BIT(1) +#define HISI_PTT_TRACE_CTRL_RXTX_SEL GENMASK(3, 2) +#define HISI_PTT_TRACE_CTRL_TYPE_SEL GENMASK(7, 4) +#define HISI_PTT_TRACE_CTRL_DATA_FORMAT BIT(14) +#define HISI_PTT_TRACE_CTRL_FILTER_MODE BIT(15) +#define HISI_PTT_TRACE_CTRL_TARGET_SEL GENMASK(31, 16) +#define HISI_PTT_TRACE_INT_STAT 0x0890 +#define HISI_PTT_TRACE_INT_STAT_MASK GENMASK(3, 0) +#define HISI_PTT_TRACE_INT_MASK 0x0894 +#define HISI_PTT_TRACE_WR_STS 0x08a0 +#define HISI_PTT_TRACE_WR_STS_WRITE GENMASK(27, 0) +#define HISI_PTT_TRACE_WR_STS_BUFFER GENMASK(29, 28) +#define HISI_PTT_TRACE_STS 0x08b0 +#define HISI_PTT_TRACE_IDLE BIT(0) +#define HISI_PTT_DEVICE_RANGE 0x0fe0 +#define HISI_PTT_DEVICE_RANGE_UPPER GENMASK(31, 16) +#define HISI_PTT_DEVICE_RANGE_LOWER GENMASK(15, 0) +#define HISI_PTT_LOCATION 0x0fe8 +#define HISI_PTT_CORE_ID GENMASK(15, 0) +#define HISI_PTT_SICL_ID GENMASK(31, 16) + +/* Parameters of PTT trace DMA part. */ +#define HISI_PTT_TRACE_DMA_IRQ 0 +#define HISI_PTT_TRACE_BUF_CNT 4 +#define HISI_PTT_TRACE_BUF_SIZE SZ_4M +#define HISI_PTT_TRACE_TOTAL_BUF_SIZE (HISI_PTT_TRACE_BUF_SIZE * \ + HISI_PTT_TRACE_BUF_CNT) +/* Wait time for hardware DMA to reset */ +#define HISI_PTT_RESET_TIMEOUT_US 10UL +#define HISI_PTT_RESET_POLL_INTERVAL_US 1UL +/* Poll timeout and interval for waiting hardware work to finish */ +#define HISI_PTT_WAIT_TRACE_TIMEOUT_US 100UL +#define HISI_PTT_WAIT_POLL_INTERVAL_US 10UL + +#define HISI_PCIE_CORE_PORT_ID(devfn) ((PCI_SLOT(devfn) & 0x7) << 1) + +/* Definition of the PMU configs */ +#define HISI_PTT_PMU_FILTER_IS_PORT BIT(19) +#define HISI_PTT_PMU_FILTER_VAL_MASK GENMASK(15, 0) +#define HISI_PTT_PMU_DIRECTION_MASK GENMASK(23, 20) +#define HISI_PTT_PMU_TYPE_MASK GENMASK(31, 24) +#define HISI_PTT_PMU_FORMAT_MASK GENMASK(35, 32) + +/** + * struct hisi_ptt_dma_buffer - Describe a single trace buffer of PTT trace. + * The detail of the data format is described + * in the documentation of PTT device. + * @dma: DMA address of this buffer visible to the device + * @addr: virtual address of this buffer visible to the cpu + */ +struct hisi_ptt_dma_buffer { + dma_addr_t dma; + void *addr; +}; + +/** + * struct hisi_ptt_trace_ctrl - Control and status of PTT trace + * @trace_buf: array of the trace buffers for holding the trace data. + * the length will be HISI_PTT_TRACE_BUF_CNT. + * @handle: perf output handle of current trace session + * @buf_index: the index of current using trace buffer + * @on_cpu: current tracing cpu + * @started: current trace status, true for started + * @is_port: whether we're tracing root port or not + * @direction: direction of the TLP headers to trace + * @filter: filter value for tracing the TLP headers + * @format: format of the TLP headers to trace + * @type: type of the TLP headers to trace + */ +struct hisi_ptt_trace_ctrl { + struct hisi_ptt_dma_buffer *trace_buf; + struct perf_output_handle handle; + u32 buf_index; + int on_cpu; + bool started; + bool is_port; + u32 direction:2; + u32 filter:16; + u32 format:1; + u32 type:4; +}; + +/** + * struct hisi_ptt_filter_desc - Descriptor of the PTT trace filter + * @list: entry of this descriptor in the filter list + * @is_port: the PCI device of the filter is a Root Port or not + * @devid: the PCI device's devid of the filter + */ +struct hisi_ptt_filter_desc { + struct list_head list; + bool is_port; + u16 devid; +}; + +/** + * struct hisi_ptt_pmu_buf - Descriptor of the AUX buffer of PTT trace + * @length: size of the AUX buffer + * @nr_pages: number of pages of the AUX buffer + * @base: start address of AUX buffer + * @pos: position in the AUX buffer to commit traced data + */ +struct hisi_ptt_pmu_buf { + size_t length; + int nr_pages; + void *base; + long pos; +}; + +/** + * struct hisi_ptt - Per PTT device data + * @trace_ctrl: the control information of PTT trace + * @hotplug_node: node for register cpu hotplug event + * @hisi_ptt_pmu: the pum device of trace + * @iobase: base IO address of the device + * @pdev: pci_dev of this PTT device + * @pmu_lock: lock to serialize the perf process + * @upper_bdf: the upper BDF range of the PCI devices managed by this PTT device + * @lower_bdf: the lower BDF range of the PCI devices managed by this PTT device + * @port_filters: the filter list of root ports + * @req_filters: the filter list of requester ID + * @port_mask: port mask of the managed root ports + */ +struct hisi_ptt { + struct hisi_ptt_trace_ctrl trace_ctrl; + struct hlist_node hotplug_node; + struct pmu hisi_ptt_pmu; + void __iomem *iobase; + struct pci_dev *pdev; + spinlock_t pmu_lock; + u32 upper_bdf; + u32 lower_bdf; + + /* + * The trace TLP headers can either be filtered by certain + * root port, or by the requester ID. Organize the filters + * by @port_filters and @req_filters here. The mask of all + * the valid ports is also cached for doing sanity check + * of user input. + */ + struct list_head port_filters; + struct list_head req_filters; + u16 port_mask; +}; + +#define to_hisi_ptt(pmu) container_of(pmu, struct hisi_ptt, hisi_ptt_pmu) + +#endif /* _HISI_PTT_H */
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-remotes/origin/next commit 5ca57b03d8c5de4c59234cc11fe9dd9f13d57f48 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RP8T CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/commit/?...
--------------------------------------------------------------------------
Add tune function for the HiSilicon Tune and Trace device. The interface of tune is exposed through sysfs attributes of PTT PMU device.
Acked-by: Mathieu Poirier mathieu.poirier@linaro.org Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Reviewed-by: John Garry john.garry@huawei.com Signed-off-by: Yicong Yang yangyicong@hisilicon.com Link: https://lore.kernel.org/r/20220816114414.4092-4-yangyicong@huawei.com Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/hwtracing/ptt/hisi_ptt.c | 131 +++++++++++++++++++++++++++++++ drivers/hwtracing/ptt/hisi_ptt.h | 23 ++++++ 2 files changed, 154 insertions(+)
diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c index 1a56b31b921a..666a0f14b6c4 100644 --- a/drivers/hwtracing/ptt/hisi_ptt.c +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -25,6 +25,135 @@ /* Dynamic CPU hotplug state used by PTT */ static enum cpuhp_state hisi_ptt_pmu_online;
+static bool hisi_ptt_wait_tuning_finish(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout(hisi_ptt->iobase + HISI_PTT_TUNING_INT_STAT, + val, !(val & HISI_PTT_TUNING_INT_STAT_MASK), + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TUNE_TIMEOUT_US); +} + +static ssize_t hisi_ptt_tune_attr_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + u32 reg; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + mutex_lock(&hisi_ptt->tune_lock); + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + desc->event_code); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + /* Write all 1 to indicates it's the read process */ + writel(~0U, hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (!hisi_ptt_wait_tuning_finish(hisi_ptt)) { + mutex_unlock(&hisi_ptt->tune_lock); + return -ETIMEDOUT; + } + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + reg &= HISI_PTT_TUNING_DATA_VAL_MASK; + val = FIELD_GET(HISI_PTT_TUNING_DATA_VAL_MASK, reg); + + mutex_unlock(&hisi_ptt->tune_lock); + return sysfs_emit(buf, "%u\n", val); +} + +static ssize_t hisi_ptt_tune_attr_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + u32 reg; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + if (kstrtou16(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&hisi_ptt->tune_lock); + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + desc->event_code); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + writel(FIELD_PREP(HISI_PTT_TUNING_DATA_VAL_MASK, val), + hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (!hisi_ptt_wait_tuning_finish(hisi_ptt)) { + mutex_unlock(&hisi_ptt->tune_lock); + return -ETIMEDOUT; + } + + mutex_unlock(&hisi_ptt->tune_lock); + return count; +} + +#define HISI_PTT_TUNE_ATTR(_name, _val, _show, _store) \ + static struct hisi_ptt_tune_desc _name##_desc = { \ + .name = #_name, \ + .event_code = (_val), \ + }; \ + static struct dev_ext_attribute hisi_ptt_##_name##_attr = { \ + .attr = __ATTR(_name, 0600, _show, _store), \ + .var = &_name##_desc, \ + } + +#define HISI_PTT_TUNE_ATTR_COMMON(_name, _val) \ + HISI_PTT_TUNE_ATTR(_name, _val, \ + hisi_ptt_tune_attr_show, \ + hisi_ptt_tune_attr_store) + +/* + * The value of the tuning event are composed of two parts: main event code + * in BIT[0,15] and subevent code in BIT[16,23]. For example, qox_tx_cpl is + * a subevent of 'Tx path QoS control' which for tuning the weight of Tx + * completion TLPs. See hisi_ptt.rst documentation for more information. + */ +#define HISI_PTT_TUNE_QOS_TX_CPL (0x4 | (3 << 16)) +#define HISI_PTT_TUNE_QOS_TX_NP (0x4 | (4 << 16)) +#define HISI_PTT_TUNE_QOS_TX_P (0x4 | (5 << 16)) +#define HISI_PTT_TUNE_RX_ALLOC_BUF_LEVEL (0x5 | (6 << 16)) +#define HISI_PTT_TUNE_TX_ALLOC_BUF_LEVEL (0x5 | (7 << 16)) + +HISI_PTT_TUNE_ATTR_COMMON(qos_tx_cpl, HISI_PTT_TUNE_QOS_TX_CPL); +HISI_PTT_TUNE_ATTR_COMMON(qos_tx_np, HISI_PTT_TUNE_QOS_TX_NP); +HISI_PTT_TUNE_ATTR_COMMON(qos_tx_p, HISI_PTT_TUNE_QOS_TX_P); +HISI_PTT_TUNE_ATTR_COMMON(rx_alloc_buf_level, HISI_PTT_TUNE_RX_ALLOC_BUF_LEVEL); +HISI_PTT_TUNE_ATTR_COMMON(tx_alloc_buf_level, HISI_PTT_TUNE_TX_ALLOC_BUF_LEVEL); + +static struct attribute *hisi_ptt_tune_attrs[] = { + &hisi_ptt_qos_tx_cpl_attr.attr.attr, + &hisi_ptt_qos_tx_np_attr.attr.attr, + &hisi_ptt_qos_tx_p_attr.attr.attr, + &hisi_ptt_rx_alloc_buf_level_attr.attr.attr, + &hisi_ptt_tx_alloc_buf_level_attr.attr.attr, + NULL, +}; + +static struct attribute_group hisi_ptt_tune_group = { + .name = "tune", + .attrs = hisi_ptt_tune_attrs, +}; + static u16 hisi_ptt_get_filter_val(u16 devid, bool is_port) { if (is_port) @@ -393,6 +522,7 @@ static struct attribute_group hisi_ptt_pmu_format_group = { static const struct attribute_group *hisi_ptt_pmu_groups[] = { &hisi_ptt_cpumask_attr_group, &hisi_ptt_pmu_format_group, + &hisi_ptt_tune_group, NULL };
@@ -727,6 +857,7 @@ static int hisi_ptt_register_pmu(struct hisi_ptt *hisi_ptt) if (ret) return ret;
+ mutex_init(&hisi_ptt->tune_lock); spin_lock_init(&hisi_ptt->pmu_lock);
hisi_ptt->hisi_ptt_pmu = (struct pmu) { diff --git a/drivers/hwtracing/ptt/hisi_ptt.h b/drivers/hwtracing/ptt/hisi_ptt.h index 10446dce8a86..5beb1648c93a 100644 --- a/drivers/hwtracing/ptt/hisi_ptt.h +++ b/drivers/hwtracing/ptt/hisi_ptt.h @@ -12,6 +12,7 @@ #include <linux/bits.h> #include <linux/cpumask.h> #include <linux/list.h> +#include <linux/mutex.h> #include <linux/pci.h> #include <linux/perf_event.h> #include <linux/spinlock.h> @@ -22,6 +23,11 @@ /* * The definition of the device registers and register fields. */ +#define HISI_PTT_TUNING_CTRL 0x0000 +#define HISI_PTT_TUNING_CTRL_CODE GENMASK(15, 0) +#define HISI_PTT_TUNING_CTRL_SUB GENMASK(23, 16) +#define HISI_PTT_TUNING_DATA 0x0004 +#define HISI_PTT_TUNING_DATA_VAL_MASK GENMASK(15, 0) #define HISI_PTT_TRACE_ADDR_SIZE 0x0800 #define HISI_PTT_TRACE_ADDR_BASE_LO_0 0x0810 #define HISI_PTT_TRACE_ADDR_BASE_HI_0 0x0814 @@ -37,6 +43,8 @@ #define HISI_PTT_TRACE_INT_STAT 0x0890 #define HISI_PTT_TRACE_INT_STAT_MASK GENMASK(3, 0) #define HISI_PTT_TRACE_INT_MASK 0x0894 +#define HISI_PTT_TUNING_INT_STAT 0x0898 +#define HISI_PTT_TUNING_INT_STAT_MASK BIT(0) #define HISI_PTT_TRACE_WR_STS 0x08a0 #define HISI_PTT_TRACE_WR_STS_WRITE GENMASK(27, 0) #define HISI_PTT_TRACE_WR_STS_BUFFER GENMASK(29, 28) @@ -59,6 +67,7 @@ #define HISI_PTT_RESET_TIMEOUT_US 10UL #define HISI_PTT_RESET_POLL_INTERVAL_US 1UL /* Poll timeout and interval for waiting hardware work to finish */ +#define HISI_PTT_WAIT_TUNE_TIMEOUT_US 1000000UL #define HISI_PTT_WAIT_TRACE_TIMEOUT_US 100UL #define HISI_PTT_WAIT_POLL_INTERVAL_US 10UL
@@ -71,6 +80,18 @@ #define HISI_PTT_PMU_TYPE_MASK GENMASK(31, 24) #define HISI_PTT_PMU_FORMAT_MASK GENMASK(35, 32)
+/** + * struct hisi_ptt_tune_desc - Describe tune event for PTT tune + * @hisi_ptt: PTT device this tune event belongs to + * @name: name of this event + * @event_code: code of the event + */ +struct hisi_ptt_tune_desc { + struct hisi_ptt *hisi_ptt; + const char *name; + u32 event_code; +}; + /** * struct hisi_ptt_dma_buffer - Describe a single trace buffer of PTT trace. * The detail of the data format is described @@ -143,6 +164,7 @@ struct hisi_ptt_pmu_buf { * @hisi_ptt_pmu: the pum device of trace * @iobase: base IO address of the device * @pdev: pci_dev of this PTT device + * @tune_lock: lock to serialize the tune process * @pmu_lock: lock to serialize the perf process * @upper_bdf: the upper BDF range of the PCI devices managed by this PTT device * @lower_bdf: the lower BDF range of the PCI devices managed by this PTT device @@ -156,6 +178,7 @@ struct hisi_ptt { struct pmu hisi_ptt_pmu; void __iomem *iobase; struct pci_dev *pdev; + struct mutex tune_lock; spinlock_t pmu_lock; u32 upper_bdf; u32 lower_bdf;
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-remotes/origin/next commit a7112b747c324dda8937d4f47b14dc0af0b465d1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RP8T CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/commit/?...
--------------------------------------------------------------------------
Document the introduction and usage of HiSilicon PTT device driver as well as the sysfs attributes description provided by the driver.
Signed-off-by: Yicong Yang yangyicong@hisilicon.com Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Reviewed-by: Bagas Sanjaya bagasdotme@gmail.com [Fixed month and kernel version] Link: https://lore.kernel.org/r/20220816114414.4092-5-yangyicong@huawei.com Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../ABI/testing/sysfs-devices-hisi_ptt | 61 ++++ Documentation/trace/hisi-ptt.rst | 298 ++++++++++++++++++ Documentation/trace/index.rst | 1 + 3 files changed, 360 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-devices-hisi_ptt create mode 100644 Documentation/trace/hisi-ptt.rst
diff --git a/Documentation/ABI/testing/sysfs-devices-hisi_ptt b/Documentation/ABI/testing/sysfs-devices-hisi_ptt new file mode 100644 index 000000000000..82de6d710266 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-devices-hisi_ptt @@ -0,0 +1,61 @@ +What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune +Date: October 2022 +KernelVersion: 6.1 +Contact: Yicong Yang yangyicong@hisilicon.com +Description: This directory contains files for tuning the PCIe link + parameters(events). Each file is named after the event + of the PCIe link. + + See Documentation/trace/hisi-ptt.rst for more information. + +What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/qos_tx_cpl +Date: October 2022 +KernelVersion: 6.1 +Contact: Yicong Yang yangyicong@hisilicon.com +Description: (RW) Controls the weight of Tx completion TLPs, which influence + the proportion of outbound completion TLPs on the PCIe link. + The available tune data is [0, 1, 2]. Writing a negative value + will return an error, and out of range values will be converted + to 2. The value indicates a probable level of the event. + +What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/qos_tx_np +Date: October 2022 +KernelVersion: 6.1 +Contact: Yicong Yang yangyicong@hisilicon.com +Description: (RW) Controls the weight of Tx non-posted TLPs, which influence + the proportion of outbound non-posted TLPs on the PCIe link. + The available tune data is [0, 1, 2]. Writing a negative value + will return an error, and out of range values will be converted + to 2. The value indicates a probable level of the event. + +What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/qos_tx_p +Date: October 2022 +KernelVersion: 6.1 +Contact: Yicong Yang yangyicong@hisilicon.com +Description: (RW) Controls the weight of Tx posted TLPs, which influence the + proportion of outbound posted TLPs on the PCIe link. + The available tune data is [0, 1, 2]. Writing a negative value + will return an error, and out of range values will be converted + to 2. The value indicates a probable level of the event. + +What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/rx_alloc_buf_level +Date: October 2022 +KernelVersion: 6.1 +Contact: Yicong Yang yangyicong@hisilicon.com +Description: (RW) Control the allocated buffer watermark for inbound packets. + The packets will be stored in the buffer first and then transmitted + either when the watermark reached or when timed out. + The available tune data is [0, 1, 2]. Writing a negative value + will return an error, and out of range values will be converted + to 2. The value indicates a probable level of the event. + +What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/tx_alloc_buf_level +Date: October 2022 +KernelVersion: 6.1 +Contact: Yicong Yang yangyicong@hisilicon.com +Description: (RW) Control the allocated buffer watermark of outbound packets. + The packets will be stored in the buffer first and then transmitted + either when the watermark reached or when timed out. + The available tune data is [0, 1, 2]. Writing a negative value + will return an error, and out of range values will be converted + to 2. The value indicates a probable level of the event. diff --git a/Documentation/trace/hisi-ptt.rst b/Documentation/trace/hisi-ptt.rst new file mode 100644 index 000000000000..4f87d8e21065 --- /dev/null +++ b/Documentation/trace/hisi-ptt.rst @@ -0,0 +1,298 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +HiSilicon PCIe Tune and Trace device +====================================== + +Introduction +============ + +HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex +integrated Endpoint (RCiEP) device, providing the capability +to dynamically monitor and tune the PCIe link's events (tune), +and trace the TLP headers (trace). The two functions are independent, +but is recommended to use them together to analyze and enhance the +PCIe link's performance. + +On Kunpeng 930 SoC, the PCIe Root Complex is composed of several +PCIe cores. Each PCIe core includes several Root Ports and a PTT +RCiEP, like below. The PTT device is capable of tuning and +tracing the links of the PCIe core. +:: + + +--------------Core 0-------+ + | | [ PTT ] | + | | [Root Port]---[Endpoint] + | | [Root Port]---[Endpoint] + | | [Root Port]---[Endpoint] + Root Complex |------Core 1-------+ + | | [ PTT ] | + | | [Root Port]---[ Switch ]---[Endpoint] + | | [Root Port]---[Endpoint] `-[Endpoint] + | | [Root Port]---[Endpoint] + +---------------------------+ + +The PTT device driver registers one PMU device for each PTT device. +The name of each PTT device is composed of 'hisi_ptt' prefix with +the id of the SICL and the Core where it locates. The Kunpeng 930 +SoC encapsulates multiple CPU dies (SCCL, Super CPU Cluster) and +IO dies (SICL, Super I/O Cluster), where there's one PCIe Root +Complex for each SICL. +:: + + /sys/devices/hisi_ptt<sicl_id>_<core_id> + +Tune +==== + +PTT tune is designed for monitoring and adjusting PCIe link parameters (events). +Currently we support events in 2 classes. The scope of the events +covers the PCIe core to which the PTT device belongs. + +Each event is presented as a file under $(PTT PMU dir)/tune, and +a simple open/read/write/close cycle will be used to tune the event. +:: + + $ cd /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune + $ ls + qos_tx_cpl qos_tx_np qos_tx_p + tx_path_rx_req_alloc_buf_level + tx_path_tx_req_alloc_buf_level + $ cat qos_tx_dp + 1 + $ echo 2 > qos_tx_dp + $ cat qos_tx_dp + 2 + +Current value (numerical value) of the event can be simply read +from the file, and the desired value written to the file to tune. + +1. Tx Path QoS Control +------------------------ + +The following files are provided to tune the QoS of the tx path of +the PCIe core. + +- qos_tx_cpl: weight of Tx completion TLPs +- qos_tx_np: weight of Tx non-posted TLPs +- qos_tx_p: weight of Tx posted TLPs + +The weight influences the proportion of certain packets on the PCIe link. +For example, for the storage scenario, increase the proportion +of the completion packets on the link to enhance the performance as +more completions are consumed. + +The available tune data of these events is [0, 1, 2]. +Writing a negative value will return an error, and out of range +values will be converted to 2. Note that the event value just +indicates a probable level, but is not precise. + +2. Tx Path Buffer Control +------------------------- + +Following files are provided to tune the buffer of tx path of the PCIe core. + +- rx_alloc_buf_level: watermark of Rx requested +- tx_alloc_buf_level: watermark of Tx requested + +These events influence the watermark of the buffer allocated for each +type. Rx means the inbound while Tx means outbound. The packets will +be stored in the buffer first and then transmitted either when the +watermark reached or when timed out. For a busy direction, you should +increase the related buffer watermark to avoid frequently posting and +thus enhance the performance. In most cases just keep the default value. + +The available tune data of above events is [0, 1, 2]. +Writing a negative value will return an error, and out of range +values will be converted to 2. Note that the event value just +indicates a probable level, but is not precise. + +Trace +===== + +PTT trace is designed for dumping the TLP headers to the memory, which +can be used to analyze the transactions and usage condition of the PCIe +Link. You can choose to filter the traced headers by either Requester ID, +or those downstream of a set of Root Ports on the same core of the PTT +device. It's also supported to trace the headers of certain type and of +certain direction. + +You can use the perf command `perf record` to set the parameters, start +trace and get the data. It's also supported to decode the trace +data with `perf report`. The control parameters for trace is inputted +as event code for each events, which will be further illustrated later. +An example usage is like +:: + + $ perf record -e hisi_ptt0_2/filter=0x80001,type=1,direction=1, + format=1/ -- sleep 5 + +This will trace the TLP headers downstream root port 0000:00:10.1 (event +code for event 'filter' is 0x80001) with type of posted TLP requests, +direction of inbound and traced data format of 8DW. + +1. Filter +--------- + +The TLP headers to trace can be filtered by the Root Ports or the Requester ID +of the Endpoint, which are located on the same core of the PTT device. You can +set the filter by specifying the `filter` parameter which is required to start +the trace. The parameter value is 20 bit. Bit 19 indicates the filter type. +1 for Root Port filter and 0 for Requester filter. Bit[15:0] indicates the +filter value. The value for a Root Port is a mask of the core port id which is +calculated from its PCI Slot ID as (slotid & 7) * 2. The value for a Requester +is the Requester ID (Device ID of the PCIe function). Bit[18:16] is currently +reserved for extension. + +For example, if the desired filter is Endpoint function 0000:01:00.1 the filter +value will be 0x00101. If the desired filter is Root Port 0000:00:10.0 then +then filter value is calculated as 0x80001. + +Note that multiple Root Ports can be specified at one time, but only one +Endpoint function can be specified in one trace. Specifying both Root Port +and function at the same time is not supported. Driver maintains a list of +available filters and will check the invalid inputs. + +Currently the available filters are detected in driver's probe. If the supported +devices are removed/added after probe, you may need to reload the driver to update +the filters. + +2. Type +------- + +You can trace the TLP headers of certain types by specifying the `type` +parameter, which is required to start the trace. The parameter value is +8 bit. Current supported types and related values are shown below: + +- 8'b00000001: posted requests (P) +- 8'b00000010: non-posted requests (NP) +- 8'b00000100: completions (CPL) + +You can specify multiple types when tracing inbound TLP headers, but can only +specify one when tracing outbound TLP headers. + +3. Direction +------------ + +You can trace the TLP headers from certain direction, which is relative +to the Root Port or the PCIe core, by specifying the `direction` parameter. +This is optional and the default parameter is inbound. The parameter value +is 4 bit. When the desired format is 4DW, directions and related values +supported are shown below: + +- 4'b0000: inbound TLPs (P, NP, CPL) +- 4'b0001: outbound TLPs (P, NP, CPL) +- 4'b0010: outbound TLPs (P, NP, CPL) and inbound TLPs (P, NP, CPL B) +- 4'b0011: outbound TLPs (P, NP, CPL) and inbound TLPs (CPL A) + +When the desired format is 8DW, directions and related values supported are +shown below: + +- 4'b0000: reserved +- 4'b0001: outbound TLPs (P, NP, CPL) +- 4'b0010: inbound TLPs (P, NP, CPL B) +- 4'b0011: inbound TLPs (CPL A) + +Inbound completions are classified into two types: + +- completion A (CPL A): completion of CHI/DMA/Native non-posted requests, except for CPL B +- completion B (CPL B): completion of DMA remote2local and P2P non-posted requests + +4. Format +-------------- + +You can change the format of the traced TLP headers by specifying the +`format` parameter. The default format is 4DW. The parameter value is 4 bit. +Current supported formats and related values are shown below: + +- 4'b0000: 4DW length per TLP header +- 4'b0001: 8DW length per TLP header + +The traced TLP header format is different from the PCIe standard. + +When using the 8DW data format, the entire TLP header is logged +(Header DW0-3 shown below). For example, the TLP header for Memory +Reads with 64-bit addresses is shown in PCIe r5.0, Figure 2-17; +the header for Configuration Requests is shown in Figure 2.20, etc. + +In addition, 8DW trace buffer entries contain a timestamp and +possibly a prefix for a PASID TLP prefix (see Figure 6-20, PCIe r5.0). +Otherwise this field will be all 0. + +The bit[31:11] of DW0 is always 0x1fffff, which can be +used to distinguish the data format. 8DW format is like +:: + + bits [ 31:11 ][ 10:0 ] + |---------------------------------------|-------------------| + DW0 [ 0x1fffff ][ Reserved (0x7ff) ] + DW1 [ Prefix ] + DW2 [ Header DW0 ] + DW3 [ Header DW1 ] + DW4 [ Header DW2 ] + DW5 [ Header DW3 ] + DW6 [ Reserved (0x0) ] + DW7 [ Time ] + +When using the 4DW data format, DW0 of the trace buffer entry +contains selected fields of DW0 of the TLP, together with a +timestamp. DW1-DW3 of the trace buffer entry contain DW1-DW3 +directly from the TLP header. + +4DW format is like +:: + + bits [31:30] [ 29:25 ][24][23][22][21][ 20:11 ][ 10:0 ] + |-----|---------|---|---|---|---|-------------|-------------| + DW0 [ Fmt ][ Type ][T9][T8][TH][SO][ Length ][ Time ] + DW1 [ Header DW1 ] + DW2 [ Header DW2 ] + DW3 [ Header DW3 ] + +5. Memory Management +-------------------- + +The traced TLP headers will be written to the memory allocated +by the driver. The hardware accepts 4 DMA address with same size, +and writes the buffer sequentially like below. If DMA addr 3 is +finished and the trace is still on, it will return to addr 0. +:: + + +->[DMA addr 0]->[DMA addr 1]->[DMA addr 2]->[DMA addr 3]-+ + +---------------------------------------------------------+ + +Driver will allocate each DMA buffer of 4MiB. The finished buffer +will be copied to the perf AUX buffer allocated by the perf core. +Once the AUX buffer is full while the trace is still on, driver +will commit the AUX buffer first and then apply for a new one with +the same size. The size of AUX buffer is default to 16MiB. User can +adjust the size by specifying the `-m` parameter of the perf command. + +6. Decoding +----------- + +You can decode the traced data with `perf report -D` command (currently +only support to dump the raw trace data). The traced data will be decoded +according to the format described previously (take 8DW as an example): +:: + + [...perf headers and other information] + . ... HISI PTT data: size 4194304 bytes + . 00000000: 00 00 00 00 Prefix + . 00000004: 01 00 00 60 Header DW0 + . 00000008: 0f 1e 00 01 Header DW1 + . 0000000c: 04 00 00 00 Header DW2 + . 00000010: 40 00 81 02 Header DW3 + . 00000014: 33 c0 04 00 Time + . 00000020: 00 00 00 00 Prefix + . 00000024: 01 00 00 60 Header DW0 + . 00000028: 0f 1e 00 01 Header DW1 + . 0000002c: 04 00 00 00 Header DW2 + . 00000030: 40 00 81 02 Header DW3 + . 00000034: 02 00 00 00 Time + . 00000040: 00 00 00 00 Prefix + . 00000044: 01 00 00 60 Header DW0 + . 00000048: 0f 1e 00 01 Header DW1 + . 0000004c: 04 00 00 00 Header DW2 + . 00000050: 40 00 81 02 Header DW3 + [...] diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 3769b9b7aed8..4f8b7e5637b2 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -30,3 +30,4 @@ Linux Tracing Technologies stm sys-t coresight/index + hisi-ptt
From: Yicong Yang yangyicong@hisilicon.com
mainline inclusion from mainline-remotes/origin/next commit 366317eae983a0d96aeed78ad219b9c4ed2a719a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RP8T CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/commit/?...
--------------------------------------------------------------------------
Add maintainer for driver and documentation of HiSilicon PTT device.
Signed-off-by: Yicong Yang yangyicong@hisilicon.com Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com Link: https://lore.kernel.org/r/20220816114414.4092-6-yangyicong@huawei.com Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- MAINTAINERS | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS index d597d8fc67b1..1bc063c07158 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7983,6 +7983,14 @@ F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst F: Documentation/admin-guide/perf/hisi-pmu.rst F: drivers/perf/hisilicon
+HISILICON PTT DRIVER +M: Yicong Yang yangyicong@hisilicon.com +L: linux-kernel@vger.kernel.org +S: Maintained +F: Documentation/ABI/testing/sysfs-devices-hisi_ptt +F: Documentation/trace/hisi-ptt.rst +F: drivers/hwtracing/ptt/ + HISILICON QM AND ZIP Controller DRIVER M: Zhou Wang wangzhou1@hisilicon.com L: linux-crypto@vger.kernel.org
From: Stephen Rothwell sfr@canb.auug.org.au
mainline inclusion from mainline-remotes/origin/next commit 366317eae983a0d96aeed78ad219b9c4ed2a719a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RP8T CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/coresight/linux.git/commit/?...
--------------------------------------------------------------------------
drivers/hwtracing/ptt/hisi_ptt.c:13:10: fatal error: linux/dma-iommu.h: No such file or directory 13 | #include <linux/dma-iommu.h> | ^~~~~~~~~~~~~~~~~~~
Caused by:
commit ff0de066b463 ("hwtracing: hisi_ptt: Add trace function support for HiSilicon PCIe Tune and Trace device")
interacting with:
commit f2042ed21da7 ("iommu/dma: Make header private")
from the iommu tree.
Signed-off-by: Stephen Rothwell sfr@canb.auug.org.au Acked-by: Robin Murphy robin.murphy@arm.com Acked-by: Yicong Yang yangyicong@hisilicon.com [Fixed subject line and added changelog text] Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org Signed-off-by: Wangming Shao shaowangming@h-partners.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/hwtracing/ptt/hisi_ptt.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c index 666a0f14b6c4..5d5526aa60c4 100644 --- a/drivers/hwtracing/ptt/hisi_ptt.c +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -10,7 +10,6 @@ #include <linux/bitops.h> #include <linux/cpuhotplug.h> #include <linux/delay.h> -#include <linux/dma-iommu.h> #include <linux/dma-mapping.h> #include <linux/interrupt.h> #include <linux/io.h>
From: Like Xu likexu@tencent.com
mainline inclusion from mainline-v5.14 commit e79f49c37ccf273c8aba733f803b3774ebfbe581 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5RD6Y CVE: NA
-------------
Based on our observations, after any vm-exit associated with vPMU, there are at least two or more perf interfaces to be called for guest counter emulation, such as perf_event_{pause, read_value, period}(), and each one will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes more severe when guest use counters in a multiplexed manner.
Holding a lock once and completing the KVM request operations in the perf context would introduce a set of impractical new interfaces. So we can further optimize the vPMU implementation by avoiding repeated calls to these interfaces in the KVM context for at least one pattern:
After we call perf_event_pause() once, the event will be disabled and its internal count will be reset to 0. So there is no need to pause it again or read its value. Once the event is paused, event period will not be updated until the next time it's resumed or reprogrammed. And there is also no need to call perf_event_period twice for a non-running counter, considering the perf_event for a running counter is never paused.
Based on this implementation, for the following common usage of sampling 4 events using perf on a 4u8g guest:
echo 0 > /proc/sys/kernel/watchdog echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent for i in `seq 1 1 10` do taskset -c 0 perf record \ -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \ /root/br_instr a done
the average latency of the guest NMI handler is reduced from 37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server. Also, in addition to collecting more samples, no loss of sampling accuracy was observed compared to before the optimization.
Signed-off-by: Like Xu likexu@tencent.com Message-Id: 20210728120705.6855-1-likexu@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Acked-by: Peter Zijlstra peterz@infradead.org Signed-off-by: yezengruan yezengruan@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/pmu.c | 5 ++++- arch/x86/kvm/pmu.h | 2 +- arch/x86/kvm/vmx/pmu_intel.c | 4 ++-- 4 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 01544554a4c2..0440c95d55a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -425,6 +425,7 @@ struct kvm_pmc { * ctrl value for fixed counters. */ u64 current_config; + bool is_paused; };
struct kvm_pmu { diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index f2c3869475d9..4b7cdd549b4b 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -137,18 +137,20 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, pmc->perf_event = event; pmc_to_pmu(pmc)->event_count++; clear_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi); + pmc->is_paused = false; }
static void pmc_pause_counter(struct kvm_pmc *pmc) { u64 counter = pmc->counter;
- if (!pmc->perf_event) + if (!pmc->perf_event || pmc->is_paused) return;
/* update counter, reset event value to avoid redundant accumulation */ counter += perf_event_pause(pmc->perf_event, true); pmc->counter = counter & pmc_bitmask(pmc); + pmc->is_paused = true; }
static bool pmc_resume_counter(struct kvm_pmc *pmc) @@ -163,6 +165,7 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
/* reuse perf_event to serve as pmc_reprogram_counter() does*/ perf_event_enable(pmc->perf_event); + pmc->is_paused = false;
clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi); return true; diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index cd35624595bf..e8620ec6014d 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -54,7 +54,7 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc) u64 counter, enabled, running;
counter = pmc->counter; - if (pmc->perf_event) + if (pmc->perf_event && !pmc->is_paused) counter += perf_event_read_value(pmc->perf_event, &enabled, &running); /* FIXME: Scaling needed? */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 44cd13790810..6427d95de01c 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -438,13 +438,13 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) !(msr & MSR_PMC_FULL_WIDTH_BIT)) data = (s64)(s32)data; pmc->counter += data - pmc_read_counter(pmc); - if (pmc->perf_event) + if (pmc->perf_event && !pmc->is_paused) perf_event_period(pmc->perf_event, get_sample_period(pmc, data)); return 0; } else if ((pmc = get_fixed_pmc(pmu, msr))) { pmc->counter += data - pmc_read_counter(pmc); - if (pmc->perf_event) + if (pmc->perf_event && !pmc->is_paused) perf_event_period(pmc->perf_event, get_sample_period(pmc, data)); return 0;
From: Like Xu likexu@tencent.com
mainline inclusion from mainline-v5.18 commit 75189d1de1b377e580ebd2d2c55914631eac9c64 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5RD6Y CVE: NA
-------------
NMI-watchdog is one of the favorite features of kernel developers, but it does not work in AMD guest even with vPMU enabled and worse, the system misrepresents this capability via /proc.
This is a PMC emulation error. KVM does not pass the latest valid value to perf_event in time when guest NMI-watchdog is running, thus the perf_event corresponding to the watchdog counter will enter the old state at some point after the first guest NMI injection, forcing the hardware register PMC0 to be constantly written to 0x800000000001.
Meanwhile, the running counter should accurately reflect its new value based on the latest coordinated pmc->counter (from vPMC's point of view) rather than the value written directly by the guest.
Fixes: 168d918f2643 ("KVM: x86: Adjust counter sample period after a wrmsr") Reported-by: Dongli Cao caodongli@kingsoft.com Signed-off-by: Like Xu likexu@tencent.com Reviewed-by: Yanan Wang wangyanan55@huawei.com Tested-by: Yanan Wang wangyanan55@huawei.com Reviewed-by: Jim Mattson jmattson@google.com Message-Id: 20220409015226.38619-1-likexu@tencent.com Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: yezengruan yezengruan@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/x86/kvm/pmu.h | 9 +++++++++ arch/x86/kvm/svm/pmu.c | 1 + arch/x86/kvm/vmx/pmu_intel.c | 8 ++------ 3 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index e8620ec6014d..005f580ca5e7 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -141,6 +141,15 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value) return sample_period; }
+static inline void pmc_update_sample_period(struct kvm_pmc *pmc) +{ + if (!pmc->perf_event || pmc->is_paused) + return; + + perf_event_period(pmc->perf_event, + get_sample_period(pmc, pmc->counter)); +} + void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c index 49e5be735f14..663d943f85db 100644 --- a/arch/x86/kvm/svm/pmu.c +++ b/arch/x86/kvm/svm/pmu.c @@ -270,6 +270,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); if (pmc) { pmc->counter += data - pmc_read_counter(pmc); + pmc_update_sample_period(pmc); return 0; } /* MSR_EVNTSELn */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 6427d95de01c..a38df824262c 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -438,15 +438,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) !(msr & MSR_PMC_FULL_WIDTH_BIT)) data = (s64)(s32)data; pmc->counter += data - pmc_read_counter(pmc); - if (pmc->perf_event && !pmc->is_paused) - perf_event_period(pmc->perf_event, - get_sample_period(pmc, data)); + pmc_update_sample_period(pmc); return 0; } else if ((pmc = get_fixed_pmc(pmu, msr))) { pmc->counter += data - pmc_read_counter(pmc); - if (pmc->perf_event && !pmc->is_paused) - perf_event_period(pmc->perf_event, - get_sample_period(pmc, data)); + pmc_update_sample_period(pmc); return 0; } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) { if (data == pmc->eventsel)
From: Zhen Lei thunder.leizhen@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5P8OD CVE: NA
-------------------------------------------------------------------------
The value of 'end' for both for_each_mem_range() and __map_memblock() is 'start + size', not 'start + size - 1'. So if the end value of a memory block is 4G, then: if (eflags && (end >= SZ_4G)) { //end=SZ_4G if (start < SZ_4G) { ... ... start = SZ_4G; } }
//Now, start=end=SZ_4G, all [start,...) will be mapped __map_memblock(pgdp, start, end, ..xxx..);
Fixes: e26eee769978 ("arm64: kdump: Don't force page-level mappings for memory above 4G") Signed-off-by: Zhen Lei thunder.leizhen@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/mm/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index e76765354040..a31f2124705e 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -523,13 +523,13 @@ static void __init map_mem(pgd_t *pgdp) break;
#ifdef CONFIG_KEXEC_CORE - if (eflags && (end >= SZ_4G)) { + if (eflags && (end > SZ_4G)) { /* * The memory block cross the 4G boundary. * Forcibly use page-level mappings for memory under 4G. */ if (start < SZ_4G) { - __map_memblock(pgdp, start, SZ_4G - 1, + __map_memblock(pgdp, start, SZ_4G, pgprot_tagged(PAGE_KERNEL), flags | eflags); start = SZ_4G; }
From: Baokun Li libaokun1@huawei.com
hulk inclusion category: bugfix bugzilla: 187600, https://gitee.com/openeuler/kernel/issues/I5SV2U CVE: NA
--------------------------------
If the starting position of our insert range happens to be in the hole between the two ext4_extent_idx, because the lblk of the ext4_extent in the previous ext4_extent_idx is always less than the start, which leads to the "extent" variable access across the boundary, the following UAF is triggered:
================================================================== BUG: KASAN: use-after-free in ext4_ext_shift_extents+0x257/0x790 Read of size 4 at addr ffff88819807a008 by task fallocate/8010 CPU: 3 PID: 8010 Comm: fallocate Tainted: G E 5.10.0+ #492 Call Trace: dump_stack+0x7d/0xa3 print_address_description.constprop.0+0x1e/0x220 kasan_report.cold+0x67/0x7f ext4_ext_shift_extents+0x257/0x790 ext4_insert_range+0x5b6/0x700 ext4_fallocate+0x39e/0x3d0 vfs_fallocate+0x26f/0x470 ksys_fallocate+0x3a/0x70 __x64_sys_fallocate+0x4f/0x60 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ==================================================================
For right shifts, we can divide them into the following situations:
1. When the first ee_block of ext4_extent_idx is greater than or equal to start, make right shifts directly from the first ee_block. 1) If it is greater than start, we need to continue searching in the previous ext4_extent_idx. 2) If it is equal to start, we can exit the loop (iterator=NULL).
2. When the first ee_block of ext4_extent_idx is less than start, then traverse from the last extent to find the first extent whose ee_block is less than start. 1) If extent is still the last extent after traversal, it means that the last ee_block of ext4_extent_idx is less than start, that is, start is located in the hole between idx and (idx+1), so we can exit the loop directly (break) without right shifts. 2) Otherwise, make right shifts at the corresponding position of the found extent, and then exit the loop (iterator=NULL).
Fixes: 331573febb6a ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/extents.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 9d06695c04ab..0f93d72d9301 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -5191,6 +5191,7 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle, * and it is decreased till we reach start. */ again: + ret = 0; if (SHIFT == SHIFT_LEFT) iterator = &start; else @@ -5234,14 +5235,21 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle, ext4_ext_get_actual_len(extent); } else { extent = EXT_FIRST_EXTENT(path[depth].p_hdr); - if (le32_to_cpu(extent->ee_block) > 0) + if (le32_to_cpu(extent->ee_block) > start) *iterator = le32_to_cpu(extent->ee_block) - 1; - else - /* Beginning is reached, end of the loop */ + else if (le32_to_cpu(extent->ee_block) == start) iterator = NULL; - /* Update path extent in case we need to stop */ - while (le32_to_cpu(extent->ee_block) < start) + else { + extent = EXT_LAST_EXTENT(path[depth].p_hdr); + while (le32_to_cpu(extent->ee_block) >= start) + extent--; + + if (extent == EXT_LAST_EXTENT(path[depth].p_hdr)) + break; + extent++; + iterator = NULL; + } path[depth].p_ext = extent; } ret = ext4_ext_shift_path_extents(path, shift, inode,
From: Keqian Zhu zhukeqian1@huawei.com
mainline inclusion from mainline-v5.14-rc1 commit fd6f17bade21 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5R1MW CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------------------------------------------------
The MMIO regions may be unmapped for many reasons and can be remapped by stage2 fault path. Map MMIO regions at creation time becomes a minor optimization and makes these two mapping path hard to sync.
Remove the mapping code while keep the useful sanity check.
Signed-off-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Heng Zhang zhangheng191@h-partners.com Link: https://lore.kernel.org/r/20210507110322.23348-2-zhukeqian1@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/kvm/mmu.c | 37 +++---------------------------------- 1 file changed, 3 insertions(+), 34 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index bbc4cc26c92a..c3231b1ee288 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1302,7 +1302,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, { hva_t hva = mem->userspace_addr; hva_t reg_end = hva + mem->memory_size; - bool writable = !(mem->flags & KVM_MEM_READONLY); int ret = 0;
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE && @@ -1319,8 +1318,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, mmap_read_lock(current->mm); /* * A memory region could potentially cover multiple VMAs, and any holes - * between them, so iterate over all of them to find out if we can map - * any of them right now. + * between them, so iterate over all of them. * * +--------------------------------------------+ * +---------------+----------------+ +----------------+ @@ -1331,50 +1329,21 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, */ do { struct vm_area_struct *vma = find_vma(current->mm, hva); - hva_t vm_start, vm_end;
if (!vma || vma->vm_start >= reg_end) break;
- /* - * Take the intersection of this VMA with the memory region - */ - vm_start = max(hva, vma->vm_start); - vm_end = min(reg_end, vma->vm_end); - if (vma->vm_flags & VM_PFNMAP) { - gpa_t gpa = mem->guest_phys_addr + - (vm_start - mem->userspace_addr); - phys_addr_t pa; - - pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT; - pa += vm_start - vma->vm_start;
/* IO region dirty page logging not allowed */ if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) { ret = -EINVAL; - goto out; - } - - ret = kvm_phys_addr_ioremap(kvm, gpa, pa, - vm_end - vm_start, - writable); - if (ret) break; + } } - hva = vm_end; + hva = min(reg_end, vma->vm_end); } while (hva < reg_end);
- if (change == KVM_MR_FLAGS_ONLY) - goto out; - - spin_lock(&kvm->mmu_lock); - if (ret) - unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size); - else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) - stage2_flush_memslot(kvm, memslot); - spin_unlock(&kvm->mmu_lock); -out: mmap_read_unlock(current->mm); return ret; }
From: Keqian Zhu zhukeqian1@huawei.com
mainline inclusion from mainline-v5.14-rc1 commit 2aa53d68cee6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5R1MW CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
------------------------------------------------------------------
The MMIO region of a device maybe huge (GB level), try to use block mapping in stage2 to speedup both map and unmap.
Compared to normal memory mapping, we should consider two more points when try block mapping for MMIO region:
1. For normal memory mapping, the PA(host physical address) and HVA have same alignment within PUD_SIZE or PMD_SIZE when we use the HVA to request hugepage, so we don't need to consider PA alignment when verifing block mapping. But for device memory mapping, the PA and HVA may have different alignment.
2. For normal memory mapping, we are sure hugepage size properly fit into vma, so we don't check whether the mapping size exceeds the boundary of vma. But for device memory mapping, we should pay attention to this.
This adds get_vma_page_shift() to get page shift for both normal memory and device MMIO region, and check these two points when selecting block mapping size for MMIO region.
Signed-off-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Heng Zhang zhangheng191@h-partners.com Link: https://lore.kernel.org/r/20210507110322.23348-3-zhukeqian1@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++-------- 1 file changed, 51 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index c3231b1ee288..23a51f0bab75 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, return PAGE_SIZE; }
+static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva) +{ + unsigned long pa; + + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP)) + return huge_page_shift(hstate_vma(vma)); + + if (!(vma->vm_flags & VM_PFNMAP)) + return PAGE_SHIFT; + + VM_BUG_ON(is_vm_hugetlb_page(vma)); + + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start); + +#ifndef __PAGETABLE_PMD_FOLDED + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) && + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start && + ALIGN(hva, PUD_SIZE) <= vma->vm_end) + return PUD_SHIFT; +#endif + + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) && + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start && + ALIGN(hva, PMD_SIZE) <= vma->vm_end) + return PMD_SHIFT; + + return PAGE_SHIFT; +} + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_memory_slot *memslot, unsigned long hva, unsigned long fault_status) @@ -770,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return -EFAULT; }
- /* Let's check if we will get back a huge page backed by hugetlbfs */ + /* + * Let's check if we will get back a huge page backed by hugetlbfs, or + * get block mapping for device MMIO region. + */ mmap_read_lock(current->mm); vma = find_vma_intersection(current->mm, hva, hva + 1); if (unlikely(!vma)) { @@ -779,15 +811,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return -EFAULT; }
- if (is_vm_hugetlb_page(vma)) - vma_shift = huge_page_shift(hstate_vma(vma)); - else - vma_shift = PAGE_SHIFT; - - if (logging_active || - (vma->vm_flags & VM_PFNMAP)) { + /* + * logging_active is guaranteed to never be true for VM_PFNMAP + * memslots. + */ + if (logging_active) { force_pte = true; vma_shift = PAGE_SHIFT; + } else { + vma_shift = get_vma_page_shift(vma, hva); }
switch (vma_shift) { @@ -855,8 +887,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return -EFAULT;
if (kvm_is_device_pfn(pfn)) { + /* + * If the page was identified as device early by looking at + * the VMA flags, vma_pagesize is already representing the + * largest quantity we can map. If instead it was mapped + * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE + * and must not be upgraded. + * + * In both cases, we don't let transparent_hugepage_adjust() + * change things at the last minute. + */ device = true; - force_pte = true; } else if (logging_active && !write_fault) { /* * Only actually map the page as writable if this was a write @@ -877,7 +918,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * If we are not forced to use page mapping, check if we are * backed by a THP and thus use block mapping if possible. */ - if (vma_pagesize == PAGE_SIZE && !force_pte) + if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) vma_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &fault_ipa); if (writable)