New subject: [PATCH v2 1/2] EDAC/ghes: Add EDAC device for reporting the CPU cache errors

26 Jan 2021

Corrected CPU cache errors are occasionally detected on a
few of our ARM64 hardware boards. Although such errors are rare,
the possibility of them occurring frequently cannot be ruled
out. Detecting a failure early, by monitoring the corrected
cache errors for frequent occurrences and taking preventive
action, could prevent more serious hardware faults.

On Intel architectures, corrected cache errors are reported and
the affected cores are offlined using an architecture-specific method:
http://www.mcelog.org/cache.html

However, for firmware-first error reporting, specifically on
ARM64 architectures, there is no provision for reporting the
corrected cache error count to user space and taking preventive
action such as offlining the affected cores.

For this purpose, it was suggested to create a CPU EDAC device
for the CPU caches, which reports the cache error count for
firmware-first error reporting.

A user-space application could monitor the recorded corrected error
count to detect hardware failures early and could take preventive
action, such as offlining the corresponding CPU core(s).
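
As an illustration only, the monitoring described above could be sketched
in user space roughly as follows. This is a minimal sketch, not part of
the series: the function names, the threshold policy, and the location of
the corrected error count file are assumptions (the actual sysfs layout is
defined by the patches, not here). The only interface taken as given is
the standard CPU hotplug control file, /sys/devices/system/cpu/cpuN/online,
where writing 0 offlines the core.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read an unsigned decimal counter from a sysfs-style text file.
 * Returns 0 on success and stores the value in *value, -1 on failure.
 */
int read_counter(const char *path, unsigned long *value)
{
	char buf[32];
	char *end;
	FILE *f;

	assert(path != NULL && value != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*value = strtoul(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;
	return 0;
}

/*
 * Request that a CPU be taken offline by writing "0" to its hotplug
 * control file, e.g. /sys/devices/system/cpu/cpu3/online.
 * Returns 0 on success, -1 on failure.
 */
int offline_cpu(const char *online_path)
{
	FILE *f = fopen(online_path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs("0\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Hypothetical policy: offline the CPU once its corrected error count
 * reaches the threshold. ce_count_path is an assumed counter file.
 * Returns 1 if the CPU was offlined, 0 if the count is below the
 * threshold, -1 on error.
 */
int check_and_offline(const char *ce_count_path, const char *online_path,
		      unsigned long threshold)
{
	unsigned long count;

	if (read_counter(ce_count_path, &count) != 0)
		return -1;
	if (count < threshold)
		return 0;
	return offline_cpu(online_path) == 0 ? 1 : -1;
}
```

A monitoring daemon might call check_and_offline() periodically for each
core, passing that core's corrected error count file and its hotplug
control file. The threshold is a policy decision left to the
administrator, and a real tool would likely look at the error rate over
time rather than a single absolute count.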

Changes:
RFC V1 -> RFC V2:
1. Addressed feedback from Boris.
1.1. Added the reason for this patch.
1.2. Changed CPU errors to CPU cache errors in drivers/edac/Kconfig.
1.3. Changed the EDAC cache list to percpu variables.
1.4. Changed the configuration to depend on ARM64.
1.5. Moved discovery of cacheinfo to ghes_scan_system().
2. Changes in the descriptions.

Shiju Jose (2):
  EDAC/ghes: Add EDAC device for reporting the CPU cache errors
  ACPI / APEI: Add reporting ARM64 CPU cache corrected error count

 Documentation/ABI/testing/sysfs-devices-edac |  15 ++
 drivers/acpi/apei/ghes.c                     |  76 +++++++-
 drivers/edac/Kconfig                         |  12 ++
 drivers/edac/ghes_edac.c                     | 186 +++++++++++++++++++
 include/acpi/ghes.h                          |  27 +++
 include/linux/cper.h                         |   4 +
 6 files changed, 316 insertions(+), 4 deletions(-)

-- 
2.17.1

[PATCH v2 0/2] EDAC/ghes: Add EDAC device for reporting the CPU cache error count

Shiju Jose
