L2 cache corrected errors have occasionally been detected on a few systems with the ARM64 server chip Kunpeng 920, so CPU cache errors do not seem to be uncommon. Detecting failures early, by monitoring the cache corrected errors and taking preventive action, could prevent more serious hardware failures.
With firmware-first error handling on ARM64 hardware platforms, the CPU cache corrected error count is not recorded.
For this purpose, the suggestion was to create a CPU EDAC device and device blocks for the CPU caches. The EDAC device blocks would be created based on the cache information obtained from cpu_cacheinfo. A user-space application could then monitor the recorded corrected error count to detect hardware failures early and take preventive action.
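The user-space side of this idea can be sketched roughly as follows. This is only an illustration, not part of the series: it assumes the corrected error count is exposed as a text file holding a single decimal number, as existing EDAC sysfs counters such as ce_count are. The helper names read_ce_count() and ce_threshold_exceeded() are hypothetical, and the actual sysfs path created by these patches is not assumed here; the caller supplies it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a sysfs-style text file.
 * 'path' is supplied by the caller; no particular EDAC path is assumed.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_ce_count(const char *path, unsigned long long *count)
{
    FILE *f;
    int matched;

    assert(path != NULL && count != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    matched = fscanf(f, "%llu", count);
    fclose(f);

    return matched == 1 ? 0 : -1;
}

/*
 * Decide whether the counter grew by at least 'threshold' between two
 * samples. A counter that went backwards is treated as a reset, not as
 * new errors, so it never triggers the threshold.
 */
int ce_threshold_exceeded(unsigned long long prev,
                          unsigned long long cur,
                          unsigned long long threshold)
{
    if (cur < prev)
        return 0;

    return cur - prev >= threshold;
}
```

A monitoring daemon could call read_ce_count() periodically for each cache block and, when ce_threshold_exceeded() reports a jump, take a policy decision such as raising an alert or offlining the affected CPU. The threshold and the polling interval are policy choices left to the application.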
Changes:

RFC V1 -> RFC V2:
1. Addressed feedback from Boris.
   1.1. Added the reason for this patch.
   1.2. Changed "CPU errors" to "CPU cache errors" in drivers/edac/Kconfig.
   1.3. Changed the EDAC cache list to percpu variables.
2. Changes in the descriptions.
Shiju Jose (2):
  EDAC/ghes: Add EDAC device for reporting the CPU cache errors
  ACPI / APEI: Add reporting ARM64 CPU cache corrected error count
 Documentation/ABI/testing/sysfs-devices-edac |  15 ++
 drivers/acpi/apei/ghes.c                     |  76 +++++++-
 drivers/edac/Kconfig                         |  10 +
 drivers/edac/ghes_edac.c                     | 181 +++++++++++++++++++
 include/acpi/ghes.h                          |  27 +++
 include/linux/cper.h                         |   4 +
 6 files changed, 309 insertions(+), 4 deletions(-)