Corrected CPU cache errors are occasionally detected on a few of our ARM64 hardware boards. Although they are rare, the possibility of such errors occurring frequently cannot be ruled out. Detecting failures early, by monitoring corrected cache errors for frequent occurrences and taking preventive action, could prevent more serious hardware faults.
On Intel architectures, corrected cache errors are reported and the affected cores are offlined using an architecture-specific method: http://www.mcelog.org/cache.html
However, for firmware-first error reporting, specifically on the ARM64 architecture, there is no provision for reporting the corrected cache error count to user space or for taking preventive action such as offlining the affected cores.
For this purpose, it was suggested to create a CPU EDAC device for the CPU caches that reports the cache error count in the firmware-first error reporting case.
A user-space application could monitor the recorded corrected error count to detect hardware failure early and could take preventive action, such as offlining the corresponding CPU core(s).
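As an illustration of the user-space side only, here is a minimal Python sketch of such a monitor. It is not part of this series. The counter layout (`cpu<N>/ce_count` under an EDAC root directory) and the helper names are assumptions made for the example; the actual attribute names are whatever the patch documents in Documentation/ABI/testing/sysfs-devices-edac. The only existing interface it relies on is the CPU hotplug file `/sys/devices/system/cpu/cpu<N>/online`.

```python
from pathlib import Path

# ASSUMPTION: hypothetical layout of the per-CPU corrected error counters.
# The real paths are defined by the sysfs ABI documented in the patch.
CE_COUNT_FMT = "cpu{cpu}/ce_count"


def read_ce_counts(edac_root, cpus):
    """Return {cpu: corrected error count} read from files under edac_root."""
    counts = {}
    for cpu in cpus:
        path = Path(edac_root) / CE_COUNT_FMT.format(cpu=cpu)
        counts[cpu] = int(path.read_text().strip())
    return counts


def cpus_over_threshold(previous, current, threshold):
    """Return CPUs whose count grew by at least `threshold` since the last poll."""
    return sorted(
        cpu for cpu, count in current.items()
        if count - previous.get(cpu, 0) >= threshold
    )


def offline_cpu(cpu, cpu_root="/sys/devices/system/cpu"):
    """Take one core offline via the CPU hotplug sysfs file (requires root)."""
    (Path(cpu_root) / f"cpu{cpu}" / "online").write_text("0")
```

A monitoring loop would call `read_ce_counts()` periodically, pass the previous and current snapshots to `cpus_over_threshold()`, and call `offline_cpu()` for each returned core. The threshold and polling interval are policy decisions left to the application.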
Changes:

RFC V1 -> RFC V2:
1. Fixed feedback by Boris.
 1.1. Added reason of this patch.
 1.2. Changed CPU errors to CPU cache errors in the drivers/edac/Kconfig.
 1.3. Changed EDAC cache list to percpu variables.
 1.4. Changed configuration depends on ARM64.
 1.5. Moved discovery of cacheinfo to ghes_scan_system().
2. Changes in the descriptions.

Shiju Jose (2):
  EDAC/ghes: Add EDAC device for reporting the CPU cache errors
  ACPI / APEI: Add reporting ARM64 CPU cache corrected error count

 Documentation/ABI/testing/sysfs-devices-edac |  15 ++
 drivers/acpi/apei/ghes.c                     |  76 +++++++-
 drivers/edac/Kconfig                         |  12 ++
 drivers/edac/ghes_edac.c                     | 186 +++++++++++++++++++
 include/acpi/ghes.h                          |  27 +++
 include/linux/cper.h                         |   4 +
 6 files changed, 316 insertions(+), 4 deletions(-)