From: Kan Liang <kan.liang(a)linux.intel.com>
mainline inclusion
from mainline-v6.11-rc7
commit 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAR511
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
--------------------------------
Running the ltp test cve-2015-3290 concurrently reports the following
warnings.
perfevents: irq loop stuck!
WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174
intel_pmu_handle_irq+0x285/0x370
Call Trace:
<NMI>
? __warn+0xa4/0x220
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? report_bug+0x3e/0xa0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x50
? asm_exc_invalid_op+0x1a/0x20
? irq_work_claim+0x1e/0x40
? intel_pmu_handle_irq+0x285/0x370
perf_event_nmi_handler+0x3d/0x60
nmi_handle+0x104/0x330
Thanks to Thomas Gleixner's analysis, the issue is caused by the low
initial period (1) of the frequency estimation algorithm, which triggers
hardware defects, specifically errata HSW11 and HSW143. (For details,
please refer to https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/)
HSW11 requires a period larger than 100 for the INST_RETIRED.ALL
event, but the initial period in freq mode is 1. The erratum is the
same as BDM11, which is already handled in the kernel. A minimum
period of 128 is enforced on HSW as well.
HSW143 states that fixed counter 1 may overcount by 32 when
Hyper-Threading is enabled. However, testing shows that the hardware
has more issues than the erratum describes. Besides fixed counter 1, the
message 'interrupt took too long' can be observed on any counter that was
armed with a period < 32 when two events expired in the same NMI. A
minimum period of 32 is enforced for the rest of the events.
The recommended workaround code for HSW143 is not implemented because
it only addresses the issue for the fixed counter and adds overhead
through extra MSR writes. No related overcounting issue has been
reported so far.
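The clamping rule described above can be sketched in user space as follows.
This is only an illustration, not the kernel code: the demo_* names are
hypothetical, the mask value 0xffff (event select plus umask bits) is an
assumption about what INTEL_ARCH_EVENT_MASK covers, and 0x01c0 is the raw
encoding of event 0xc0 with umask 0x01.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for INTEL_ARCH_EVENT_MASK: event select + umask bits. */
#define DEMO_ARCH_EVENT_MASK	0xffffULL
/* INST_RETIRED.ALL: event 0xc0, umask 0x01 (hypothetical macro name). */
#define DEMO_INST_RETIRED_ALL	0x01c0ULL

/* True if the raw config selects INST_RETIRED.ALL, ignoring other bits. */
static bool demo_is_inst_retired_all(uint64_t config)
{
	return (config & DEMO_ARCH_EVENT_MASK) == DEMO_INST_RETIRED_ALL;
}

/* Raise 'left' to 128 for INST_RETIRED.ALL and to 32 for everything else. */
static uint64_t demo_hsw_limit_period(uint64_t config, uint64_t left)
{
	uint64_t min = demo_is_inst_retired_all(config) ? 128 : 32;

	return left < min ? min : left;
}
```

With this sketch, an initial period of 1 becomes 128 for INST_RETIRED.ALL and
32 for any other event, while periods already above the minimum are returned
unchanged.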
Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support")
Reported-by: Li Huafei <lihuafei1(a)huawei.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.…
Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
Conflicts:
arch/x86/events/intel/core.c
[ Adapted x86_pmu::limit_period signature due to commit 28f0f3c44b5c
("perf/x86: Change x86_pmu::limit_period signature") not backported. ]
Signed-off-by: Li Huafei <lihuafei1(a)huawei.com>
---
arch/x86/events/intel/core.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 4372ed2d1637..1786e8d85b6b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4414,6 +4414,25 @@ static u8 adl_get_hybrid_cpu_type(void)
 	return hybrid_big;
 }
 
+static inline bool erratum_hsw11(struct perf_event *event)
+{
+	return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
+		X86_CONFIG(.event=0xc0, .umask=0x01);
+}
+
+/*
+ * The HSW11 requires a period larger than 100 which is the same as the BDM11.
+ * A minimum period of 128 is enforced as well for the INST_RETIRED.ALL.
+ *
+ * The message 'interrupt took too long' can be observed on any counter which
+ * was armed with a period < 32 and two events expired in the same NMI.
+ * A minimum period of 32 is enforced for the rest of the events.
+ */
+static u64 hsw_limit_period(struct perf_event *event, u64 left)
+{
+	return max(left, erratum_hsw11(event) ? 128ULL : 32ULL);
+}
+
 /*
  * Broadwell:
  *
@@ -4431,8 +4450,7 @@ static u8 adl_get_hybrid_cpu_type(void)
  */
 static u64 bdw_limit_period(struct perf_event *event, u64 left)
 {
-	if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
-			X86_CONFIG(.event=0xc0, .umask=0x01)) {
+	if (erratum_hsw11(event)) {
 		if (left < 128)
 			left = 128;
 		left &= ~0x3fULL;
@@ -6406,6 +6424,7 @@ __init int intel_pmu_init(void)
 
 		x86_pmu.hw_config = hsw_hw_config;
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
+		x86_pmu.limit_period = hsw_limit_period;
 		x86_pmu.lbr_double_abort = true;
 		extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
 			hsw_format_attr : nhm_format_attr;
--
2.25.1
Hi Neil,
FYI, the error/warning still remains.
tree: https://gitee.com/openeuler/kernel.git openEuler-1.0-LTS
head: ecca3ab84cbf3cf5ec32574bcec8a068e79d14df
commit: d80280f7df4f4be25d67a6460b63b3d2e0b5170d [3158/23766] perf: add arm64 smmuv3 pmu driver
config: arm64-randconfig-004-20240925 (https://download.01.org/0day-ci/archive/20240926/202409260104.P8TNlohM-lkp@…)
compiler: aarch64-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240926/202409260104.P8TNlohM-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409260104.P8TNlohM-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/perf/arm_smmuv3_pmu.c:459:9: warning: no previous prototype for 'smmu_pmu_event_show' [-Wmissing-prototypes]
459 | ssize_t smmu_pmu_event_show(struct device *dev,
| ^~~~~~~~~~~~~~~~~~~
vim +/smmu_pmu_event_show +459 drivers/perf/arm_smmuv3_pmu.c
458
> 459 ssize_t smmu_pmu_event_show(struct device *dev,
460 struct device_attribute *attr, char *page)
461 {
462 struct perf_pmu_events_attr *pmu_attr;
463
464 pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
465
466 return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
467 }
468
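For context, -Wmissing-prototypes fires when a function with external
linkage is defined without a previous declaration. The two usual remedies
are to mark the function static when it is only used in one file, or to
declare it in a shared header otherwise. The sketch below shows the first
option with a hypothetical demo_event_show() helper; it is not the driver's
code, and which remedy fits smmu_pmu_event_show() depends on whether other
files call it, which the report does not say.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for a sysfs show callback. Because it is static,
 * the compiler needs no earlier prototype and W=1 stays quiet.
 */
static int demo_event_show(char *page, size_t len, unsigned long long id)
{
	/* Same format string as in the excerpt above. */
	return snprintf(page, len, "event=0x%02llx\n", id);
}
```

Calling demo_event_show(buf, sizeof(buf), 0x1) writes "event=0x01" followed
by a newline into buf.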
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
tree: https://gitee.com/openeuler/kernel.git openEuler-1.0-LTS
head: ecca3ab84cbf3cf5ec32574bcec8a068e79d14df
commit: 8ce6cbc44f9ca78cac43506c84fcdd7beadee07f [15191/23766] block: fix use-after-free in disk_part_iter_next
config: x86_64-buildonly-randconfig-002-20240923 (https://download.01.org/0day-ci/archive/20240925/202409252335.Yggdg4Gw-lkp@…)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240925/202409252335.Yggdg4Gw-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409252335.Yggdg4Gw-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from block/genhd.c:10:
In file included from include/linux/blkdev.h:16:
include/linux/pagemap.h:425:21: warning: cast from 'int (*)(struct file *, struct page *)' to 'filler_t *' (aka 'int (*)(void *, struct page *)') converts to incompatible function type [-Wcast-function-type-strict]
425 | filler_t *filler = (filler_t *)mapping->a_ops->readpage;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
block/genhd.c:532: warning: Function parameter or member 'devt' not described in 'blk_invalidate_devt'
>> block/.tmp_genhd.o: warning: objtool: __device_add_disk()+0x387: unreachable instruction
objdump-func vmlinux.o __device_add_disk:
--
tree: https://gitee.com/openeuler/kernel.git OLK-6.6
head: 81a41d2ac1de43215c014bc71d907a026042e55b
commit: ab331ac5b797eb3889777f3d8d98a86069c5720e [14108/14122] arm64/mpam: Check mpam_detect_is_enabled() before accessing MPAM registers
config: arm64-randconfig-001-20240925 (https://download.01.org/0day-ci/archive/20240925/202409252253.TKzwXzes-lkp@…)
compiler: aarch64-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240925/202409252253.TKzwXzes-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409252253.TKzwXzes-lkp@intel.com/
All errors (new ones prefixed by >>):
>> arch/arm64/kernel/cpufeature.c:2313:6: error: redefinition of 'mpam_detect_is_enabled'
2313 | bool mpam_detect_is_enabled(void)
| ^~~~~~~~~~~~~~~~~~~~~~
In file included from arch/arm64/include/asm/ptrace.h:11,
from arch/arm64/include/asm/irqflags.h:10,
from include/linux/irqflags.h:17,
from include/linux/rcupdate.h:26,
from include/linux/rculist.h:11,
from include/linux/pid.h:5,
from include/linux/sched.h:14,
from include/linux/sched/task_stack.h:9,
from include/linux/elfcore.h:7,
from include/linux/crash_core.h:6,
from include/linux/kexec.h:18,
from include/linux/crash_dump.h:5,
from arch/arm64/kernel/cpufeature.c:67:
arch/arm64/include/asm/cpufeature.h:864:20: note: previous definition of 'mpam_detect_is_enabled' with type 'bool(void)' {aka '_Bool(void)'}
864 | static inline bool mpam_detect_is_enabled(void)
| ^~~~~~~~~~~~~~~~~~~~~~
arch/arm64/kernel/cpufeature.c:2133:13: warning: 'enable_pseudo_nmi' defined but not used [-Wunused-variable]
2133 | static bool enable_pseudo_nmi;
| ^~~~~~~~~~~~~~~~~
vim +/mpam_detect_is_enabled +2313 arch/arm64/kernel/cpufeature.c
2311
2312 static bool __read_mostly mpam_force_enabled;
> 2313 bool mpam_detect_is_enabled(void)
2314 {
2315 return mpam_force_enabled;
2316 }
2317
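A redefinition error of this kind typically appears when a header provides
a static inline stub and a .c file also defines a function with the same
name in the same build configuration. A common pattern is to make the two
mutually exclusive behind one switch, as sketched below. DEMO_HAVE_MPAM and
the demo_* names are hypothetical placeholders, not the actual Kconfig
symbol or the actual fix for this report.

```c
#include <stdbool.h>

/* Hypothetical switch standing in for a Kconfig option. */
#define DEMO_HAVE_MPAM 1

/* "Header" part: a declaration when enabled, an inline stub when not. */
#if DEMO_HAVE_MPAM
bool demo_mpam_detect_is_enabled(void);
#else
static inline bool demo_mpam_detect_is_enabled(void)
{
	return false;
}
#endif

/* ".c file" part: the out-of-line definition, built only when enabled. */
#if DEMO_HAVE_MPAM
static bool demo_mpam_force_enabled;

bool demo_mpam_detect_is_enabled(void)
{
	return demo_mpam_force_enabled;
}
#endif
```

Setting DEMO_HAVE_MPAM to 0 leaves only the inline stub, so exactly one
definition is visible in either configuration.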