On a few systems, the Kunpeng 920 ARM64 server chip has occasionally reported corrected L2 cache errors, so CPU cache errors do occur in practice. Monitoring how frequently corrected cache errors occur allows failures to be detected early, and taking preventive action at that point could prevent more serious hardware faults.
With firmware-first error handling, especially on ARM64 architectures, there is currently no provision for recording the CPU cache corrected error count.
On Intel architectures, reporting corrected cache errors and offlining the affected cores is done through a more architecture-specific method: http://www.mcelog.org/cache.html
For this purpose, the suggestion was to create an EDAC device for the CPU caches to record the cache error count. The EDAC device blocks for the CPU caches are created based on the cache information obtained from cpu_cacheinfo. A user-space application can then monitor the recorded corrected error count to detect hardware failures early and take preventive action.
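As a rough illustration of the user-space side, the sketch below polls a corrected error counter exposed as a text file and checks whether it grew by at least a given threshold since the previous poll. The helper names read_ce_count() and ce_threshold_exceeded() are hypothetical, and the actual sysfs path is whatever the Documentation/ABI entry in this series defines; it is not assumed here. The preventive action itself (for example, offlining a core) is left to the caller.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read an unsigned decimal counter from a sysfs-style text file.
 * 'path' is supplied by the caller; no particular EDAC path is assumed.
 * Returns 0 on success, -1 if the file cannot be read or parsed.
 */
static int read_ce_count(const char *path, unsigned long long *count)
{
	char buf[32];
	char *end;
	unsigned long long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	val = strtoull(buf, &end, 10);
	if (errno || end == buf)
		return -1;

	*count = val;
	return 0;
}

/*
 * Return 1 if the counter grew by at least 'threshold' between two polls.
 * A counter that went backwards (e.g. after a reset) is treated as
 * "no new errors" rather than as a large increase.
 */
static int ce_threshold_exceeded(unsigned long long prev,
				 unsigned long long cur,
				 unsigned long long threshold)
{
	if (cur < prev)
		return 0;
	return (cur - prev) >= threshold;
}
```

A monitoring daemon would call read_ce_count() periodically for each cache block, keep the previous value, and trigger its policy when ce_threshold_exceeded() returns 1. The polling interval and threshold are policy choices that this series does not prescribe.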
Changes:

RFC V1 -> RFC V2:
1. Addressed feedback from Boris.
 1.1. Added the reason for this patch.
 1.2. Changed CPU errors to CPU cache errors in drivers/edac/Kconfig.
 1.3. Changed the EDAC cache list to percpu variables.
2. Changes in the descriptions.
Shiju Jose (2):
  EDAC/ghes: Add EDAC device for reporting the CPU cache errors
  ACPI / APEI: Add reporting ARM64 CPU cache corrected error count
 Documentation/ABI/testing/sysfs-devices-edac |  15 ++
 drivers/acpi/apei/ghes.c                     |  76 +++++++-
 drivers/edac/Kconfig                         |  10 +
 drivers/edac/ghes_edac.c                     | 181 +++++++++++++++++++
 include/acpi/ghes.h                          |  27 +++
 include/linux/cper.h                         |   4 +
 6 files changed, 309 insertions(+), 4 deletions(-)