New subject: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for reporting the CPU cache errors

12 Jan 2021

L2 cache corrected errors have occasionally been detected on a few
systems based on the ARM64 server chip Kunpeng 920, so CPU cache
corrected errors do not seem to be uncommon. Detecting failures early,
by monitoring the cache corrected errors and taking preventive action,
could prevent more serious hardware failures.

With firmware-first error handling on ARM64 hardware platforms,
the CPU cache corrected error count is not recorded.

To address this, the suggestion was to create a CPU EDAC device
and device blocks for the CPU caches. The EDAC device blocks would be
created based on the cache information obtained from cpu_cacheinfo.
A user-space application could then monitor the recorded corrected
error count to detect hardware failures early and take preventive
action.
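
As a rough illustration of the user-space side, the sketch below reads
a corrected error counter from a text file, of the kind EDAC devices
expose as ce_count, and checks whether it grew by a given threshold.
This is not part of the series: the helper names read_ce_count() and
ce_threshold_crossed() are hypothetical, and the actual sysfs path is
whatever the Documentation/ABI/testing/sysfs-devices-edac entry in
this series defines, so the caller has to supply it.

```c
#include <stdio.h>

/*
 * Read an unsigned counter from a sysfs-style text file.
 * The path is supplied by the caller; no particular sysfs layout
 * is assumed here.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_ce_count(const char *path, unsigned long long *count)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", count) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Return 1 if the counter grew by at least 'threshold' since the
 * previous sample, 0 otherwise. A counter that went backwards
 * (for example after a reset) is treated as "not crossed".
 */
static int ce_threshold_crossed(unsigned long long prev,
				unsigned long long cur,
				unsigned long long threshold)
{
	return cur >= prev && cur - prev >= threshold;
}
```

A monitoring daemon could call read_ce_count() periodically for each
cache block and trigger its preventive action (for example, taking
the affected CPU offline) once ce_threshold_crossed() returns 1. The
polling interval and the threshold are policy choices left to the
application.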

Changes:
RFC V1 -> RFC V2:
1. Addressed feedback from Boris.
1.1. Added the reason for this patch.
1.2. Changed CPU errors to CPU cache errors in drivers/edac/Kconfig.
1.3. Changed the EDAC cache list to percpu variables.
2. Changes in the descriptions.

Shiju Jose (2):
  EDAC/ghes: Add EDAC device for reporting the CPU cache errors
  ACPI / APEI: Add reporting ARM64 CPU cache corrected error count

 Documentation/ABI/testing/sysfs-devices-edac |  15 ++
 drivers/acpi/apei/ghes.c                     |  76 +++++++-
 drivers/edac/Kconfig                         |  10 +
 drivers/edac/ghes_edac.c                     | 181 +++++++++++++++++++
 include/acpi/ghes.h                          |  27 +++
 include/linux/cper.h                         |   4 +
 6 files changed, 309 insertions(+), 4 deletions(-)

-- 
2.17.1

[RFC PATCH 0/2] EDAC/ghes: Add EDAC device for recording the CPU cache error count

Shiju Jose