Hi Boris,
Thanks for the feedback. Apologies for the delay.
-----Original Message----- From: Borislav Petkov [mailto:bp@alien8.de] Sent: 31 December 2020 16:44 To: Shiju Jose shiju.jose@huawei.com Cc: linux-edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux- kernel@vger.kernel.org; james.morse@arm.com; mchehab+huawei@kernel.org; tony.luck@intel.com; rjw@rjwysocki.net; lenb@kernel.org; rrichter@marvell.com; Linuxarm linuxarm@huawei.com; xuwei (O) xuwei5@huawei.com; Jonathan Cameron jonathan.cameron@huawei.com; John Garry john.garry@huawei.com; tanxiaofei tanxiaofei@huawei.com; Shameerali Kolothum Thodi shameerali.kolothum.thodi@huawei.com; Salil Mehta salil.mehta@huawei.com Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches
On Tue, Dec 08, 2020 at 05:29:58PM +0000, Shiju Jose wrote:
The corrected error count on the CPU caches required reporting to the user-space for the predictive failure analysis. For this purpose, add an EDAC device and device blocks for the CPU caches found. The cache's corrected error count would be stored in the /sys/devices/system/edac/cpu/cpu*/cache*/ce_count.
This still doesn't begin to explain why the kernel needs this. I had already asked whether errors in CPU caches are something which happen often enough so that software should count them but nothing came. So pls justify first why this wants to be added to the kernel.
L2 cache corrected errors are detected occasionally on few of our ARM64 hardware boards. Though it is rare, the probability of the CPU cache errors frequently occurring can't be avoided. The earlier failure detection by monitoring the cache corrected errors for the frequent occurrences and taking preventive action could prevent more serious hardware faults.
On Intel architectures, cache corrected errors are reported and the affected cores are offline in the architecture specific method. http://www.mcelog.org/cache.html
However for the firmware-first error reporting, specifically on ARM64 architectures, there is no provision present for reporting the cache corrected error count to the user-space and taking preventive action such as offline the affected cores.
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index 7a47680d6f07..c73eeab27ac9 100644 --- a/drivers/edac/Kconfig +++ b/drivers/edac/Kconfig @@ -74,6 +74,16 @@ config EDAC_GHES
In doubt, say 'Y'.
+config EDAC_GHES_CPU_ERROR
- bool "EDAC device for reporting firmware-first BIOS detected CPU
error count"
Why a separate Kconfig item?
CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this feature optional only for the platforms which need this and supported.
- depends on EDAC_GHES
- help
EDAC device for the firmware-first BIOS detected CPU error count
+reported
Well this is not what it is doing - you're talking about cache errors. "CPU errors" can be a lot more than just cache errors.
Sure. I will change.
+static void ghes_edac_create_cpu_device(struct device *dev) {
- int cpu;
- struct cpu_cacheinfo *this_cpu_ci;
- /*
* Find the maximum number of caches present in the CPU heirarchy
* among the online CPUs.
*/
- for_each_online_cpu(cpu) {
this_cpu_ci = get_cpu_cacheinfo(cpu);
if (!this_cpu_ci)
continue;
if (max_number_of_caches < this_cpu_ci->num_leaves)
max_number_of_caches = this_cpu_ci->num_leaves;
So this is counting the number of cache levels on the system? So you want to count the errors per cache levels?
Yes. This was the suggestion from James and to offline the affected cores for the shared cache.
- }
- if (!max_number_of_caches)
return;
- /*
* EDAC device interface only supports creating the CPU cache
hierarchy for alls
* the CPUs together. Thus need to allocate cpu_edac_block_list for
the
* max_number_of_caches among all the CPUs irrespective of the
number of caches
* per CPU might vary.
*/
So this is lumping all the caches together into a single list? What for? To untangle to the proper ones when the error gets reported?
Have you heard of percpu variables?
Yes. Changed the list to the percpu variable.
@@ -624,6 +787,10 @@ int ghes_edac_register(struct ghes *ghes, struct
device *dev)
ghes_pvt = pvt; spin_unlock_irqrestore(&ghes_lock, flags);
+#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
- ghes_edac_create_cpu_device(dev);
+#endif
Init stuff belongs into ghes_scan_system().
Did you mean calling ghes_edac_create_cpu_device() in the ghes_scan_system()?
...
Ok, I've seen enough. "required reporting to the user-space for the predictive failure analysis." is not even trying to explain *why* you're doing this, what *actual* problem it is addressing and why should the kernel get it.
And without a proper problem definition of what you're trying to solve, this is not going anywhere.
-- Regards/Gruss, Boris.
Thanks, Shiju
On Fri, Jan 15, 2021 at 11:06:30AM +0000, Shiju Jose wrote:
L2 cache corrected errors are detected occasionally on few of our ARM64 hardware boards. Though it is rare, the probability of the CPU cache errors frequently occurring can't be avoided. The earlier failure detection by monitoring the cache corrected errors for the frequent occurrences and taking preventive action could prevent more serious hardware faults.
On Intel architectures, cache corrected errors are reported and the affected cores are offline in the architecture specific method. http://www.mcelog.org/cache.html
However for the firmware-first error reporting, specifically on ARM64 architectures, there is no provision present for reporting the cache corrected error count to the user-space and taking preventive action such as offline the affected cores.
How hard was it to write that in your first submission? What do you think would be the best way to persuade a patch reviewer/maintainer to take a look at your submission?
Why a separate Kconfig item?
CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this feature optional only for the platforms which need this and supported.
- depends on EDAC_GHES
depends on EDAC_GHES hardly expresses which platforms need it/support it.
If anything, depends on ARM64.
Init stuff belongs into ghes_scan_system().
Did you mean calling ghes_edac_create_cpu_device() in the ghes_scan_system()?
I mean, all hardware discovery needs to happen in ghes_scan_system - you don't need to call those from outside the driver, in ghes_edac_register().
Hi Boris,
-----Original Message----- From: Borislav Petkov [mailto:bp@alien8.de] Sent: 18 January 2021 18:37 To: Shiju Jose shiju.jose@huawei.com Cc: linux-edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux- kernel@vger.kernel.org; james.morse@arm.com; mchehab+huawei@kernel.org; tony.luck@intel.com; rjw@rjwysocki.net; lenb@kernel.org; rrichter@marvell.com; Jonathan Cameron jonathan.cameron@huawei.com; tanxiaofei tanxiaofei@huawei.com; linuxarm@openeuler.org Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches
On Fri, Jan 15, 2021 at 11:06:30AM +0000, Shiju Jose wrote:
L2 cache corrected errors are detected occasionally on few of our ARM64 hardware boards. Though it is rare, the probability of the CPU cache errors frequently occurring can't be avoided. The earlier failure detection by monitoring the cache corrected errors for the frequent occurrences and taking preventive action could prevent more serious hardware faults.
On Intel architectures, cache corrected errors are reported and the affected cores are offline in the architecture specific method. http://www.mcelog.org/cache.html
However for the firmware-first error reporting, specifically on ARM64 architectures, there is no provision present for reporting the cache corrected error count to the user-space and taking preventive action such as offline the affected cores.
How hard was it to write that in your first submission? What do you think would be the best way to persuade a patch reviewer/maintainer to take a look at your submission?
Why a separate Kconfig item?
CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this feature optional only for the platforms which need this and supported.
- depends on EDAC_GHES
depends on EDAC_GHES hardly expresses which platforms need it/support it.
If anything, depends on ARM64.
Sure. I will add dependency on ARM64. This EDAC code for the cache errors is architecture independent for the firmware-first error reporting and could be used for other architectures, though now we need it for the ARM64.
Init stuff belongs into ghes_scan_system().
Did you mean calling ghes_edac_create_cpu_device() in the
ghes_scan_system()?
I mean, all hardware discovery needs to happen in ghes_scan_system
- you don't need to call those from outside the driver, in
ghes_edac_register().
sure. Will modify.
-- Regards/Gruss, Boris.
Thanks, Shiju
On Tue, Jan 19, 2021 at 10:04:23AM +0000, Shiju Jose wrote:
This EDAC code for the cache errors is architecture independent for the firmware-first error reporting and could be used for other architectures,
I'm not so sure about that because you're lumping all the cache hierarchies together and there might be L3 slices on some x86 incarnations, for example, which do not belong to the core you're reporting the error for... It would need to be tested though.
Also, if this is a firmware-first mode, then I would expect the firmware to also report which cache triggered the error and thus not need any OS glue at all.
Therefore ARM only and I'd need an ACK from ARM folks whether they want it this way.
Thx.