On Tue, Jun 15, 2021 at 5:47 AM Xiaofei Tan tanxiaofei@huawei.com wrote:
Hi Rafael,
On 2021/6/14 23:46, Rafael J. Wysocki wrote:
On Fri, Jun 11, 2021 at 2:40 PM Xiaofei Tan tanxiaofei@huawei.com wrote:
Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work"), do_sea() would unconditionally signal the affected task from the arch code. Since that change, the GHES driver sends the signals.
This exposes a problem: errors that the GHES driver doesn't understand, or doesn't handle effectively, are silently ignored. As a result, the errors are taken again and circulate endlessly, and the user-space task gets stuck in this loop.
Existing firmware on Kunpeng9xx systems reports cache errors with the 'ARM Processor Error' CPER records.
Do memory failure handling for ARM Processor Error Section just like for Memory Error Section.
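For illustration only, the idea above can be sketched in user-space C. This is not the patch itself: every type and function name below (the sketch_ prefix) is a hypothetical stand-in for the kernel's CPER structures and memory-failure path. It also assumes that only recoverable cache errors carrying a valid physical address are forwarded.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CPER "ARM Processor Error" info entries. */
enum sketch_err_type { SKETCH_CACHE_ERROR, SKETCH_TLB_ERROR, SKETCH_BUS_ERROR };

struct sketch_arm_err_info {
	enum sketch_err_type type;
	bool has_phys_addr;	/* models the "physical address valid" bit */
	uint64_t phys_addr;
};

/* Bookkeeping so the sketch can be observed; not part of any real API. */
static uint64_t sketch_last_queued;
static size_t sketch_queued_count;

/* Hypothetical stand-in for queueing memory-failure work on one address. */
static bool sketch_do_memory_failure(uint64_t phys_addr)
{
	sketch_last_queued = phys_addr;
	sketch_queued_count++;
	return true;
}

/*
 * Walk the error info entries and hand recoverable cache errors with a
 * valid physical address to memory-failure handling, as is done for the
 * Memory Error Section. Returns true if any work was queued.
 */
bool sketch_handle_arm_hw_error(const struct sketch_arm_err_info *info,
				size_t n, bool recoverable)
{
	bool queued = false;

	if (!recoverable)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (info[i].type != SKETCH_CACHE_ERROR || !info[i].has_phys_addr)
			continue;	/* nothing we know how to recover from */
		if (sketch_do_memory_failure(info[i].phys_addr))
			queued = true;
	}
	return queued;
}
```

A caller could use the return value to tell "recovery was queued" from "nothing was done", and in the second case fall back to signalling the task rather than letting it retake the same error.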
Still, I'm not convinced that this is the right way to address the problem.
In particular, is it guaranteed that "ARM Processor Error" will always mean "memory failure" on all platforms?
There are two sources of ARM processor cache errors (the second does not apply to platforms that don't support a poison mechanism).
1. The error occurs in the cache itself. If it is transient, we have a chance to recover by doing memory failure handling. If it is persistent, it has to be handled elsewhere, for example by cache way isolation in firmware or by triggering CPU core isolation from user space. I think most platforms can't support such features, so the simplest and most effective way is to report it as a fatal error and do the isolation during the firmware start-up phase.
2. The error is transferred from another RAS node. If it comes from DDR, I think there is no doubt, and this is the case we have met most often. If it comes from elsewhere in the SoC, such as internal SRAM (far less likely than DDR), the error is still in the hardware, but the RAS node that detected the SRAM error will also report it.
To sum up, this is effective in most situations and does no harm in the others.
OK, so applied as 5.14 material under edited subject.
Thanks!