From: Miaohe Lin linmiaohe@huawei.com
mainline inclusion from mainline-v6.1-rc1 commit 0792a4a6195a6d67a4ead2554da393cbd8dcdf5a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB5LNL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If try_to_unmap() fails, the hwpoisoned page still resides in the address space of some processes. We should kill these processes or the hwpoisoned page might be consumed later. collect_procs() is always called to collect relevant processes now so they can be killed later if unmap fails.
[linmiaohe@huawei.com: v2] Link: https://lkml.kernel.org/r/20220823032346.4260-6-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220818130016.45313-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin linmiaohe@huawei.com Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Andrew Morton akpm@linux-foundation.org
Conflicts: mm/memory-failure.c [Ma Wupeng: context conflicts] Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- mm/memory-failure.c | 32 ++++++++++++++------------------ 1 file changed, 14 insertions(+), 18 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index b83482faf38a..1cab0871c77d 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -6,16 +6,16 @@ * High level machine check handler. Handles pages reported by the * hardware as being corrupted usually due to a multi-bit ECC memory or cache * failure. - * + * * In addition there is a "soft offline" entry point that allows stop using * not-yet-corrupted-by-suspicious pages without killing anything. * * Handles page cache pages in various states. The tricky part - * here is that we can access any page asynchronously in respect to - * other VM users, because memory failures could happen anytime and - * anywhere. This could violate some of their assumptions. This is why - * this code has to be extremely careful. Generally it tries to use - * normal locking rules, as in get the standard locks, even if that means + * here is that we can access any page asynchronously in respect to + * other VM users, because memory failures could happen anytime and + * anywhere. This could violate some of their assumptions. This is why + * this code has to be extremely careful. Generally it tries to use + * normal locking rules, as in get the standard locks, even if that means * the error handling takes potentially a long time. * * It can be very tempting to add handling for obscure cases here. @@ -25,12 +25,12 @@ * https://git.kernel.org/cgit/utils/cpu/mce/mce-test.git/ * - The case actually shows up as a frequent (top 10) page state in * tools/vm/page-types when running a real workload. - * + * * There are several operations here with exponential complexity because - * of unsuitable VM data structures. For example the operation to map back - * from RMAP chains to processes has to walk the complete process list and + * of unsuitable VM data structures. For example the operation to map back + * from RMAP chains to processes has to walk the complete process list and * has non linear complexity with the number. But since memory corruptions - * are rare we hope to get away with this. This avoids impacting the core + * are rare we hope to get away with this. This avoids impacting the core * VM. */ #include <linux/kernel.h> @@ -1184,7 +1184,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, struct address_space *mapping; LIST_HEAD(tokill); bool unmap_success = true; - int kill = 1, forcekill; + int forcekill; struct page *hpage = *hpagep; bool mlocked = PageMlocked(hpage);
@@ -1227,7 +1227,6 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, if (page_mkclean(hpage)) { SetPageDirty(hpage); } else { - kill = 0; ttu |= TTU_IGNORE_HWPOISON; pr_info("Memory failure: %#lx: corrupted page was clean: dropped without side effects\n", pfn); @@ -1238,12 +1237,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, * First collect all the processes that have the page * mapped in dirty form. This has to be done before try_to_unmap, * because ttu takes the rmap data structures down. - * - * Error handling: We ignore errors here because - * there's nothing that can be done. */ - if (kill) - collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED); + collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
if (!PageHuge(hpage)) { unmap_success = try_to_unmap(hpage, ttu); @@ -1290,7 +1285,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, * use a more force-full uncatchable kill to prevent * any accesses to the poisoned memory. */ - forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL); + forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL) || + !unmap_success; kill_procs(&tokill, forcekill, !unmap_success, pfn, flags);
return unmap_success;