Hi Boris,
-----Original Message----- From: Borislav Petkov [mailto:bp@alien8.de] Sent: 18 January 2021 18:37 To: Shiju Jose shiju.jose@huawei.com Cc: linux-edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux- kernel@vger.kernel.org; james.morse@arm.com; mchehab+huawei@kernel.org; tony.luck@intel.com; rjw@rjwysocki.net; lenb@kernel.org; rrichter@marvell.com; Jonathan Cameron jonathan.cameron@huawei.com; tanxiaofei tanxiaofei@huawei.com; linuxarm@openeuler.org Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches
On Fri, Jan 15, 2021 at 11:06:30AM +0000, Shiju Jose wrote:
L2 cache corrected errors are detected occasionally on few of our ARM64 hardware boards. Though it is rare, the probability of the CPU cache errors frequently occurring can't be avoided. The earlier failure detection by monitoring the cache corrected errors for the frequent occurrences and taking preventive action could prevent more serious hardware faults.
On Intel architectures, cache corrected errors are reported and the affected cores are offline in the architecture specific method. http://www.mcelog.org/cache.html
However for the firmware-first error reporting, specifically on ARM64 architectures, there is no provision present for reporting the cache corrected error count to the user-space and taking preventive action such as offline the affected cores.
How hard was it to write that in your first submission? What do you think would be the best way to persuade a patch reviewer/maintainer to take a look at your submission?
Why a separate Kconfig item?
CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this feature optional only for the platforms which need this and supported.
- depends on EDAC_GHES
depends on EDAC_GHES hardly expresses which platforms need it/support it.
If anything, depends on ARM64.
Sure. I will add dependency on ARM64. This EDAC code for the cache errors is architecture independent for the firmware-first error reporting and could be used for other architectures, though now we need it for the ARM64.
Init stuff belongs into ghes_scan_system().
Did you mean calling ghes_edac_create_cpu_device() in the
ghes_scan_system()?
I mean, all hardware discovery needs to happen in ghes_scan_system
- you don't need to call those from outside the driver, in
ghes_edac_register().
sure. Will modify.
-- Regards/Gruss, Boris.
Thanks, Shiju