From: Naoya Horiguchi naoya.horiguchi@nec.com
mainline inclusion from linux-v5.10-rc1 commit 7d9d46ac87f91b8dedad5241d64382b650e26487 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Drop the PageHuge check, which is dead code since memory_failure() forks into memory_failure_hugetlb() for hugetlb pages.
memory_failure() and memory_failure_hugetlb() share some functions like hwpoison_user_mappings() and identify_page_state(), so they should properly handle 4kB pages, thp, and hugetlb.
Signed-off-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Oscar Salvador osalvador@suse.de Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: tong tiangen tongtiangen@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 72e1746e5386f..a432ba37e132a 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1373,10 +1373,7 @@ int memory_failure(unsigned long pfn, int flags) * page_remove_rmap() in try_to_unmap_one(). So to determine page status * correctly, we save a copy of the page flags at this time. */ - if (PageHuge(p)) - page_flags = hpage->flags; - else - page_flags = p->flags; + page_flags = p->flags;
/* * unpoison always clear PG_hwpoison inside page lock
From: Naoya Horiguchi naoya.horiguchi@nec.com
mainline inclusion from linux-v5.10-rc1 commit 1b473becde09d1aec17334a34af70ccdee9fe680 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
hpage is never used after try_to_split_thp_page() in memory_failure(), so we don't have to update hpage. So let's not recalculate/use hpage.
Signed-off-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Oscar Salvador osalvador@suse.de Suggested-by: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index a432ba37e132a..075926ca7aa57 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1333,7 +1333,6 @@ int memory_failure(unsigned long pfn, int flags) } unlock_page(p); VM_BUG_ON_PAGE(!page_count(p), p); - hpage = compound_head(p); }
/* @@ -1417,11 +1416,8 @@ int memory_failure(unsigned long pfn, int flags) /* * Now take care of user space mappings. * Abort on fail: __delete_from_page_cache() assumes unmapped page. - * - * When the raw error page is thp tail page, hpage points to the raw - * page after thp split. */ - if (!hwpoison_user_mappings(p, pfn, flags, &hpage)) { + if (!hwpoison_user_mappings(p, pfn, flags, &p)) { action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED); res = -EBUSY; goto unlock_page;
From: Naoya Horiguchi naoya.horiguchi@nec.com
mainline inclusion from linux-v5.10-rc1 commit fd476720c9ba3cf16617d074a94a0852be468545 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Another memory error injection interface debugfs:hwpoison/corrupt-pfn also takes bogus refcount for hwpoison_filter(). It's justified because this does a coarse filter, expecting that memory_failure() redoes the check for sure.
Signed-off-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Oscar Salvador osalvador@suse.de Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/hwpoison-inject.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-)
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c index b6ac70616c321..766062ce62a21 100644 --- a/mm/hwpoison-inject.c +++ b/mm/hwpoison-inject.c @@ -25,11 +25,6 @@ static int hwpoison_inject(void *data, u64 val)
p = pfn_to_page(pfn); hpage = compound_head(p); - /* - * This implies unable to support free buddy pages. - */ - if (!get_hwpoison_page(p)) - return 0;
if (!hwpoison_filter_enable) goto inject; @@ -39,23 +34,20 @@ static int hwpoison_inject(void *data, u64 val) * This implies unable to support non-LRU pages. */ if (!PageLRU(hpage) && !PageHuge(p)) - goto put_out; + return 0;
/* - * do a racy check with elevated page count, to make sure PG_hwpoison - * will only be set for the targeted owner (or on a free page). + * do a racy check to make sure PG_hwpoison will only be set for + * the targeted owner (or on a free page). * memory_failure() will redo the check reliably inside page lock. */ err = hwpoison_filter(hpage); if (err) - goto put_out; + return 0;
inject: pr_info("Injecting memory failure at pfn %#lx\n", pfn); - return memory_failure(pfn, MF_COUNT_INCREASED); -put_out: - put_hwpoison_page(p); - return 0; + return memory_failure(pfn, 0); }
static int hwpoison_unpoison(void *data, u64 val)
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit dc7560b496f9e045c675ae160afda010ec0c77f6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Make a proper if-else condition for {hard,soft}-offline.
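For illustration only, the resulting control flow can be sketched in userspace as below. toy_soft_offline(), toy_hard_offline() and toy_inject_error() are hypothetical stand-ins, not kernel functions; the point is that both branches assign ret and the error check happens in one place.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for soft_offline_page(); always succeeds here. */
static int toy_soft_offline(unsigned long pfn)
{
	(void)pfn;
	return 0;
}

/* Hypothetical stand-in for memory_failure(); pretend pfn 3 is busy. */
static int toy_hard_offline(unsigned long pfn)
{
	return pfn == 3 ? -EBUSY : 0;
}

/* Walk [start, end): each branch sets ret, one shared check follows. */
static int toy_inject_error(int soft, unsigned long start, unsigned long end)
{
	for (unsigned long pfn = start; pfn < end; pfn++) {
		int ret;

		if (soft)
			ret = toy_soft_offline(pfn);
		else
			ret = toy_hard_offline(pfn);

		if (ret)
			return ret;
	}
	return 0;
}
```

The sketch leaves out the page lookup, the pr_info() calls and the put_page() that the real hard-offline branch performs before calling memory_failure().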
[akpm: refactor comment] Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/madvise.c | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 242a88ae3acf1..6022627646fa2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -638,7 +638,6 @@ static long madvise_remove(struct vm_area_struct *vma, static int madvise_inject_error(int behavior, unsigned long start, unsigned long end) { - struct page *page; struct zone *zone; unsigned int order;
@@ -647,6 +646,7 @@ static int madvise_inject_error(int behavior,
for (; start < end; start += PAGE_SIZE << order) { + struct page *page; unsigned long pfn; int ret;
@@ -669,25 +669,23 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) { pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", - pfn, start); + pfn, start);
ret = soft_offline_page(page, MF_COUNT_INCREASED); - if (ret) - return ret; - continue; + } else { + pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", + pfn, start); + /* + * Drop the page reference taken by + * get_user_pages_fast(). In the absence of + * MF_COUNT_INCREASED the memory_failure() routine is + * responsible for pinning the page to prevent it + * from being released back to the page allocator. + */ + put_page(page); + ret = memory_failure(pfn, 0); }
- pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", - pfn, start); - - /* - * Drop the page reference taken by get_user_pages_fast(). In - * the absence of MF_COUNT_INCREASED the memory_failure() - * routine is responsible for pinning the page to prevent it - * from being released back to the page allocator. - */ - put_page(page); - ret = memory_failure(pfn, 0); if (ret) return ret; }
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit dd6e2402fad966290f35dc687294fb6049714aac category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
keep put_hwpoison_page to avoid kapi change
After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp refcounting"), put_hwpoison_page got reduced to a put_page. Let us just use put_page instead.
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 075926ca7aa57..fb328fcac431a 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1126,7 +1126,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) pr_err("Memory failure: %#lx: just unpoisoned\n", pfn); num_poisoned_pages_dec(); unlock_page(head); - put_hwpoison_page(head); + put_page(head); return 0; }
@@ -1327,7 +1327,7 @@ int memory_failure(unsigned long pfn, int flags) pfn); if (TestClearPageHWPoison(p)) num_poisoned_pages_dec(); - put_hwpoison_page(p); + put_page(p); res = -EBUSY; goto unlock_mutex; } @@ -1381,14 +1381,14 @@ int memory_failure(unsigned long pfn, int flags) pr_err("Memory failure: %#lx: just unpoisoned\n", pfn); num_poisoned_pages_dec(); unlock_page(p); - put_hwpoison_page(p); + put_page(p); goto unlock_mutex; } if (hwpoison_filter(p)) { if (TestClearPageHWPoison(p)) num_poisoned_pages_dec(); unlock_page(p); - put_hwpoison_page(p); + put_page(p); goto unlock_mutex; }
@@ -1623,9 +1623,9 @@ int unpoison_memory(unsigned long pfn) } unlock_page(page);
- put_hwpoison_page(page); + put_page(page); if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1)) - put_hwpoison_page(page); + put_page(page);
return 0; } @@ -1689,7 +1689,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags) /* * Try to free it. */ - put_hwpoison_page(page); + put_page(page); shake_page(page, 1);
/* @@ -1698,7 +1698,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags) ret = __get_any_page(page, pfn, 0); if (ret == 1 && !PageLRU(page)) { /* Drop page reference which is from __get_any_page() */ - put_hwpoison_page(page); + put_page(page); pr_info("soft_offline: %#lx: unknown non LRU page type %lx (%pGp)\n", pfn, page->flags, &page->flags); return -EIO; @@ -1721,7 +1721,7 @@ static int soft_offline_huge_page(struct page *page, int flags) lock_page(hpage); if (PageHWPoison(hpage)) { unlock_page(hpage); - put_hwpoison_page(hpage); + put_page(hpage); pr_info("soft offline: %#lx hugepage already poisoned\n", pfn); return -EBUSY; } @@ -1732,7 +1732,7 @@ static int soft_offline_huge_page(struct page *page, int flags) * get_any_page() and isolate_huge_page() takes a refcount each, * so need to drop one here. */ - put_hwpoison_page(hpage); + put_page(hpage); if (!ret) { pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn); return -EBUSY; @@ -1781,7 +1781,7 @@ static int __soft_offline_page(struct page *page, int flags) wait_on_page_writeback(page); if (PageHWPoison(page)) { unlock_page(page); - put_hwpoison_page(page); + put_page(page); pr_info("soft offline: %#lx page already poisoned\n", pfn); return -EBUSY; } @@ -1796,7 +1796,7 @@ static int __soft_offline_page(struct page *page, int flags) * would need to fix isolation locking first. */ if (ret == 1) { - put_hwpoison_page(page); + put_page(page); pr_info("soft_offline: %#lx: invalidated\n", pfn); SetPageHWPoison(page); num_poisoned_pages_inc(); @@ -1816,7 +1816,7 @@ static int __soft_offline_page(struct page *page, int flags) * Drop page reference which is came from get_any_page() * successful isolate_lru_page() already took another one. 
*/ - put_hwpoison_page(page); + put_page(page); if (!ret) { LIST_HEAD(pagelist); /* @@ -1860,7 +1860,7 @@ static int soft_offline_in_use_page(struct page *page, int flags) pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page)); else pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page)); - put_hwpoison_page(page); + put_page(page); return -EBUSY; } unlock_page(page); @@ -1934,7 +1934,7 @@ int soft_offline_page(struct page *page, int flags) if (PageHWPoison(page)) { pr_info("soft offline: %#lx page already poisoned\n", pfn); if (flags & MF_COUNT_INCREASED) - put_hwpoison_page(page); + put_page(page); return -EBUSY; }
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit 694bf0b0cdf91be50e6f037ca93733ed83ca1187 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Place the THP's page handling in a helper and use it from both hard and soft-offline machinery, so we get rid of some duplicated code.
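A rough userspace sketch of the idea, under the assumption that the only thing differing between the two callers is the log prefix. struct toy_thp, toy_log and toy_try_to_split() are made up for illustration; page locking and the real split_huge_page() are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a transparent huge page. */
struct toy_thp {
	unsigned long pfn;
	bool anon;        /* is it anonymous memory? */
	bool splittable;  /* would the split succeed? */
	int refcount;
};

/* Last message "printed"; stands in for pr_info(). */
static char toy_log[128];

/* One helper for both callers; msg carries the caller-specific prefix. */
static int toy_try_to_split(struct toy_thp *page, const char *msg)
{
	if (!page->anon || !page->splittable) {
		snprintf(toy_log, sizeof(toy_log), "%s: %#lx: %s", msg, page->pfn,
			 page->anon ? "thp split failed" : "non anonymous thp");
		page->refcount--;   /* drop the caller's reference on failure */
		return -EBUSY;
	}
	return 0;
}
```

A soft-offline caller would pass "soft offline" and a hard-offline caller "Memory Failure", mirroring the two call sites in the diff.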
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 51 ++++++++++++++++++++------------------------- 1 file changed, 23 insertions(+), 28 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index fb328fcac431a..40fdc743b5e90 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1085,6 +1085,25 @@ static int identify_page_state(unsigned long pfn, struct page *p, return page_action(ps, p, pfn); }
+static int try_to_split_thp_page(struct page *page, const char *msg) +{ + lock_page(page); + if (!PageAnon(page) || unlikely(split_huge_page(page))) { + unsigned long pfn = page_to_pfn(page); + + unlock_page(page); + if (!PageAnon(page)) + pr_info("%s: %#lx: non anonymous thp\n", msg, pfn); + else + pr_info("%s: %#lx: thp split failed\n", msg, pfn); + put_page(page); + return -EBUSY; + } + unlock_page(page); + + return 0; +} + static int memory_failure_hugetlb(unsigned long pfn, int flags) { struct page *p = pfn_to_page(pfn); @@ -1316,22 +1335,8 @@ int memory_failure(unsigned long pfn, int flags) }
if (PageTransHuge(hpage)) { - lock_page(p); - if (!PageAnon(p) || unlikely(split_huge_page(p))) { - unlock_page(p); - if (!PageAnon(p)) - pr_err("Memory failure: %#lx: non anonymous thp\n", - pfn); - else - pr_err("Memory failure: %#lx: thp split failed\n", - pfn); - if (TestClearPageHWPoison(p)) - num_poisoned_pages_dec(); - put_page(p); - res = -EBUSY; - goto unlock_mutex; - } - unlock_page(p); + if (try_to_split_thp_page(p, "Memory Failure") < 0) + return -EBUSY; VM_BUG_ON_PAGE(!page_count(p), p); }
@@ -1852,19 +1857,9 @@ static int soft_offline_in_use_page(struct page *page, int flags) int mt; struct page *hpage = compound_head(page);
- if (!PageHuge(page) && PageTransHuge(hpage)) { - lock_page(page); - if (!PageAnon(page) || unlikely(split_huge_page(page))) { - unlock_page(page); - if (!PageAnon(page)) - pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page)); - else - pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page)); - put_page(page); + if (!PageHuge(page) && PageTransHuge(hpage)) + if (try_to_split_thp_page(page, "soft offline") < 0) return -EBUSY; - } - unlock_page(page); - }
/* * Setting MIGRATE_ISOLATE here ensures that the page will be linked
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit 06be6ff3d2ec8be806b859fc054a1909b16d2473 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
When trying to soft-offline a free page, we need to first take it off the buddy allocator. Once we know it is out of reach, we can safely flag it as poisoned.
take_page_off_buddy will be used to take a page meant to be poisoned off the buddy allocator. take_page_off_buddy calls break_down_buddy_pages, which splits a higher-order page in case our page belongs to one.
Once the page is under our control, we call page_handle_poison to set it as poisoned and grab a refcount on it.
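The splitting step can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions: struct toy_zone, TOY_MAX_ORDER and the toy_* functions are hypothetical, a free block is recorded by the order stored at its first page index, and the zone lock, migratetypes and guard pages of the real code are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ORDER 4                          /* hypothetical: orders 0..3 */
#define TOY_NPAGES ((size_t)1 << (TOY_MAX_ORDER - 1))

/* free_order[i] >= 0: a free block of that order starts at page i. */
struct toy_zone {
	int free_order[TOY_NPAGES];
};

static void toy_zone_init(struct toy_zone *z)
{
	for (size_t i = 0; i < TOY_NPAGES; i++)
		z->free_order[i] = -1;
	z->free_order[0] = TOY_MAX_ORDER - 1;    /* one block covers everything */
}

/* Halve [head, head + 2^high) until target stands alone; free the other halves. */
static void toy_break_down(struct toy_zone *z, size_t head, size_t target, int high)
{
	size_t size = (size_t)1 << high;

	while (high > 0) {
		size_t buddy;

		high--;
		size >>= 1;
		if (target >= head + size) {
			buddy = head;            /* lower half is freed */
			head += size;            /* keep splitting the upper half */
		} else {
			buddy = head + size;     /* upper half is freed */
		}
		z->free_order[buddy] = high;
	}
}

/* Find the free block containing target, remove it, and give back the rest. */
static bool toy_take_page_off_buddy(struct toy_zone *z, size_t target)
{
	for (int order = 0; order < TOY_MAX_ORDER; order++) {
		size_t head = target & ~(((size_t)1 << order) - 1);
		int block_order = z->free_order[head];

		if (block_order >= order) {
			z->free_order[head] = -1;
			toy_break_down(z, head, target, block_order);
			return true;
		}
	}
	return false;
}
```

Taking page 5 out of a single order-3 block leaves free blocks of order 2 at page 0, order 1 at page 6 and order 0 at page 4, while page 5 itself is on no list.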
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Conflicts: mm/page_alloc.c Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/page-flags.h | 1 + mm/memory-failure.c | 18 ++++++---- mm/page_alloc.c | 73 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 86 insertions(+), 6 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 7eb776a677d7a..0c5d1c4c71e62 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -371,6 +371,7 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY) TESTSCFLAG(HWPoison, hwpoison, PF_ANY) #define __PG_HWPOISON (1UL << PG_hwpoison) extern bool set_hwpoison_free_buddy_page(struct page *page); +extern bool take_page_off_buddy(struct page *page); #else PAGEFLAG_FALSE(HWPoison) static inline bool set_hwpoison_free_buddy_page(struct page *page) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 40fdc743b5e90..cbd2c62895ffe 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -68,6 +68,13 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
+static void page_handle_poison(struct page *page) +{ + SetPageHWPoison(page); + page_ref_inc(page); + num_poisoned_pages_inc(); +} + #if defined(CONFIG_HWPOISON_INJECT) || defined(CONFIG_HWPOISON_INJECT_MODULE)
u32 hwpoison_filter_enable = 0; @@ -1880,14 +1887,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)
static int soft_offline_free_page(struct page *page) { - int rc = dissolve_free_huge_page(page); + int rc = -EBUSY;
- if (!rc) { - if (set_hwpoison_free_buddy_page(page)) - num_poisoned_pages_inc(); - else - rc = -EBUSY; + if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) { + page_handle_poison(page); + rc = 0; } + return rc; }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3fb21ea5dcf9b..32aba0270d653 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8537,6 +8537,79 @@ bool is_free_buddy_page(struct page *page) }
#ifdef CONFIG_MEMORY_FAILURE +/* + * Break down a higher-order page in sub-pages, and keep our target out of + * buddy allocator. + */ +static void break_down_buddy_pages(struct zone *zone, struct page *page, + struct page *target, int low, int high, + int migratetype) +{ + unsigned long size = 1 << high; + struct page *current_buddy, *next_page; + + while (high > low) { + high--; + size >>= 1; + + if (target >= &page[size]) { + next_page = page + size; + current_buddy = page; + } else { + next_page = page; + current_buddy = page + size; + } + + if (set_page_guard(zone, current_buddy, high, migratetype)) + continue; + + if (current_buddy != target) { + list_add(&current_buddy->lru, + &zone->free_area[high].free_list[migratetype]); + zone->free_area[high].nr_free++; + set_page_order(current_buddy, high); + page = next_page; + } + } +} + +/* + * Take a page that will be marked as poisoned off the buddy allocator. + */ +bool take_page_off_buddy(struct page *page) +{ + struct zone *zone = page_zone(page); + unsigned long pfn = page_to_pfn(page); + unsigned long flags; + unsigned int order; + bool ret = false; + + spin_lock_irqsave(&zone->lock, flags); + for (order = 0; order < MAX_ORDER; order++) { + struct page *page_head = page - (pfn & ((1 << order) - 1)); + int buddy_order = page_order(page_head); + + if (PageBuddy(page_head) && buddy_order >= order) { + unsigned long pfn_head = page_to_pfn(page_head); + int migratetype = get_pfnblock_migratetype(page_head, + pfn_head); + + list_del(&page_head->lru); + __ClearPageBuddy(page_head); + set_page_private(page_head, 0); + zone->free_area[buddy_order].nr_free--; + break_down_buddy_pages(zone, page_head, page, 0, + buddy_order, migratetype); + ret = true; + break; + } + if (page_count(page_head) > 0) + break; + } + spin_unlock_irqrestore(&zone->lock, flags); + return ret; +} + /* * Set PG_hwpoison flag if a given page is confirmed to be a free page. This * test is performed under the zone lock to prevent a race against page
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit 79f5f8fab482dfff62948214468ac4ebbf0a016f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
keep set_hwpoison_free_buddy_page exported to avoid kapi change.
This patch changes the way we set and handle in-use poisoned pages. Until now, poisoned pages were released to the buddy allocator, trusting that the checks that take place at allocation time would act as a safety net and would skip that page.
This has proved to be wrong, as there are pfn walkers, like compaction, that only care whether the page is in a buddy freelist.
Although this might not be the only user, having poisoned pages in the buddy allocator seems a bad idea as we should only have free pages that are ready and meant to be used as such.
Before explaining the taken approach, let us break down the kind of pages we can soft offline.
- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)
* Normal pages (order-0 and anon-THP)
- If they are clean and unmapped page cache pages, we invalidate them by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.
Either way, do not call put_page directly from those paths. Instead, we keep the page and send it to page_handle_poison to perform the right handling.
page_handle_poison sets the HWPoison flag and does the last put_page.
Down the chain, we placed a check for HWPoison page in free_pages_prepare, that just skips any poisoned page, so those pages do not end up in any pcplist/freelist.
After that, we set the refcount on the page to 1 and we increment the poisoned pages counter.
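The interplay between the last put_page and the check in the free path can be sketched with a toy refcount model. Everything prefixed toy_ is hypothetical; toy_free_page() only mimics the early return described for free_pages_prepare and does not model memcg uncharging or page owner reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced model of struct page. */
struct toy_page {
	int refcount;
	bool hwpoison;
	bool on_freelist;
};

static long toy_num_poisoned;

/* Free path: a poisoned page is never handed to a pcplist/freelist. */
static void toy_free_page(struct toy_page *p)
{
	if (p->hwpoison)
		return;               /* skip, as the new check does */
	p->on_freelist = true;
}

static void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		toy_free_page(p);
}

/* Flag first, then drop the last reference, then pin the page again. */
static void toy_page_handle_poison(struct toy_page *p, bool release)
{
	p->hwpoison = true;
	if (release)
		toy_put_page(p);
	p->refcount++;
	toy_num_poisoned++;
}
```

Because the flag is set before the final put, the page reaches the free path already marked and is skipped; it ends up with a refcount of 1 and on no freelist.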
If we see that the check in free_pages_prepare creates trouble, we can always do what we do for free pages:
- wait until the page hits buddy's freelists
- take it off, and flag it
The downside of the above approach is that we could race with an allocation, so by the time we want to take the page off the buddy, the page has been already allocated so we cannot soft offline it. But the user could always retry it.
* Hugetlb pages
- We isolate-and-migrate them
After the migration has been successful, we call dissolve_free_huge_page, and we set HWPoison on the page if we succeed. Hugetlb has a slightly different handling though.
While for non-hugetlb pages we cared about closing the race with an allocation, doing so for hugetlb pages requires quite some additional and intrusive code (we would need to hook in free_huge_page and some other places). So I decided to not make the code overly complicated and just fail normally if the page was allocated in the meantime.
We can always build on top of this.
As a bonus, because of the way we now handle in-use pages, we no longer need the put-as-isolation-migratetype dance, which guarded against poisoned pages ending up in pcplists.
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com
Conflicts: mm/page_alloc.c
Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 43 ++++++++++++++----------------------------- mm/migrate.c | 11 +++-------- mm/page_alloc.c | 11 +++++++++++ 3 files changed, 28 insertions(+), 37 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index cbd2c62895ffe..15888de023fe7 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -68,9 +68,11 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
-static void page_handle_poison(struct page *page) +static void page_handle_poison(struct page *page, bool release) { SetPageHWPoison(page); + if (release) + put_page(page); page_ref_inc(page); num_poisoned_pages_inc(); } @@ -1761,19 +1763,13 @@ static int soft_offline_huge_page(struct page *page, int flags) ret = -EBUSY; } else { /* - * We set PG_hwpoison only when the migration source hugepage - * was successfully dissolved, because otherwise hwpoisoned - * hugepage remains on free hugepage list, then userspace will - * find it as SIGBUS by allocation failure. That's not expected - * in soft-offlining. + * We set PG_hwpoison only when we were able to take the page + * off the buddy. */ - ret = dissolve_free_huge_page(page); - if (!ret) { - if (set_hwpoison_free_buddy_page(page)) - num_poisoned_pages_inc(); - else - ret = -EBUSY; - } + if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) + page_handle_poison(page, false); + else + ret = -EBUSY; } return ret; } @@ -1808,10 +1804,8 @@ static int __soft_offline_page(struct page *page, int flags) * would need to fix isolation locking first. */ if (ret == 1) { - put_page(page); pr_info("soft_offline: %#lx: invalidated\n", pfn); - SetPageHWPoison(page); - num_poisoned_pages_inc(); + page_handle_poison(page, true); return 0; }
@@ -1842,7 +1836,9 @@ static int __soft_offline_page(struct page *page, int flags) list_add(&page->lru, &pagelist); ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); - if (ret) { + if (!ret) { + page_handle_poison(page, true); + } else { if (!list_empty(&pagelist)) putback_movable_pages(&pagelist);
@@ -1861,27 +1857,16 @@ static int __soft_offline_page(struct page *page, int flags) static int soft_offline_in_use_page(struct page *page, int flags) { int ret; - int mt; struct page *hpage = compound_head(page);
if (!PageHuge(page) && PageTransHuge(hpage)) if (try_to_split_thp_page(page, "soft offline") < 0) return -EBUSY;
- /* - * Setting MIGRATE_ISOLATE here ensures that the page will be linked - * to free list immediately (not via pcplist) when released after - * successful page migration. Otherwise we can't guarantee that the - * page is really free after put_page() returns, so - * set_hwpoison_free_buddy_page() highly likely fails. - */ - mt = get_pageblock_migratetype(page); - set_pageblock_migratetype(page, MIGRATE_ISOLATE); if (PageHuge(page)) ret = soft_offline_huge_page(page, flags); else ret = __soft_offline_page(page, flags); - set_pageblock_migratetype(page, mt); return ret; }
@@ -1890,7 +1875,7 @@ static int soft_offline_free_page(struct page *page) int rc = -EBUSY;
if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) { - page_handle_poison(page); + page_handle_poison(page, false); rc = 0; }
diff --git a/mm/migrate.c b/mm/migrate.c index eb27e8e2bf213..f7721d0aece5f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1217,16 +1217,11 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page, * we want to retry. */ if (rc == MIGRATEPAGE_SUCCESS) { - put_page(page); - if (reason == MR_MEMORY_FAILURE) { + if (reason != MR_MEMORY_FAILURE) /* - * Set PG_HWPoison on just freed page - * intentionally. Although it's rather weird, - * it's how HWPoison flag works at the moment. + * We release the page in page_handle_poison. */ - if (set_hwpoison_free_buddy_page(page)) - num_poisoned_pages_inc(); - } + put_page(page); } else { if (rc != -EAGAIN) { if (likely(!__PageMovable(page))) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 32aba0270d653..8ddd186f68c02 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1061,6 +1061,17 @@ static __always_inline bool free_pages_prepare(struct page *page,
trace_mm_page_free(page, order);
+ if (unlikely(PageHWPoison(page)) && !order) { + /* + * Do not let hwpoison pages hit pcplists/buddy + * Untie memcg state and reset page's owner + */ + if (memcg_kmem_enabled() && PageKmemcg(page)) + __memcg_kmem_uncharge(page, order); + reset_page_owner(page, order); + return false; + } + /* * Check tail pages before head page information is cleared to * avoid checking PageCompound for order-0 pages.
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit 6b9a217eda4a13cc72914fdf7433712122ff595b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Merging soft_offline_huge_page and __soft_offline_page lets us get rid of quite some duplicated code, and makes the code much easier to follow.
Now, __soft_offline_page will handle both normal and hugetlb pages.
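As a reading aid, the decision order of the merged function can be approximated by the sketch below. It is an assumption-laden simplification: struct toy_target and enum toy_result are invented, each kernel step is reduced to a boolean, and the extra dissolve/take-off-buddy step that can still fail for hugetlb pages after migration is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_result {
	TOY_ALREADY_POISONED,
	TOY_INVALIDATED,
	TOY_MIGRATED,
	TOY_FAILED,
};

/* Hypothetical summary of what the kernel would find out about the page. */
struct toy_target {
	bool huge;                  /* hugetlb page? */
	bool hwpoison;              /* already poisoned? */
	bool clean_unmapped_cache;  /* would invalidation succeed? */
	bool isolatable;
	bool migratable;
};

static enum toy_result toy_soft_offline_page(const struct toy_target *t)
{
	if (t->hwpoison)
		return TOY_ALREADY_POISONED;

	/* Invalidation is only attempted for non-hugetlb pages. */
	if (!t->huge && t->clean_unmapped_cache)
		return TOY_INVALIDATED;

	/* Otherwise both page kinds take the isolate-and-migrate path. */
	if (t->isolatable && t->migratable)
		return TOY_MIGRATED;

	return TOY_FAILED;
}
```

Note that a hugetlb page never takes the invalidation shortcut in this model, matching the !PageHuge(page) guards added in the diff.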
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com
Conflicts: mm/page_alloc.c mm/memory-failure.c
Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 179 ++++++++++++++++++++------------------------ 1 file changed, 80 insertions(+), 99 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 15888de023fe7..6f16e24969843 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -68,13 +68,31 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
-static void page_handle_poison(struct page *page, bool release) +static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release) { + if (hugepage_or_freepage) { + /* + * Doing this check for free pages is also fine since dissolve_free_huge_page + * returns 0 for non-hugetlb pages as well. + */ + if (dissolve_free_huge_page(page) || !take_page_off_buddy(page)) + /* + * We could fail to take off the target page from buddy + * for example due to racy page allocaiton, but that's + * acceptable because soft-offlined page is not broken + * and if someone really want to use it, they should + * take it. + */ + return false; + } + SetPageHWPoison(page); if (release) put_page(page); page_ref_inc(page); num_poisoned_pages_inc(); + + return true; }
#if defined(CONFIG_HWPOISON_INJECT) || defined(CONFIG_HWPOISON_INJECT_MODULE) @@ -1721,63 +1739,49 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags) return ret; }
-static int soft_offline_huge_page(struct page *page, int flags) +static bool isolate_page(struct page *page, struct list_head *pagelist) { - int ret; - unsigned long pfn = page_to_pfn(page); - struct page *hpage = compound_head(page); - LIST_HEAD(pagelist); + bool isolated = false; + bool lru = PageLRU(page);
- /* - * This double-check of PageHWPoison is to avoid the race with - * memory_failure(). See also comment in __soft_offline_page(). - */ - lock_page(hpage); - if (PageHWPoison(hpage)) { - unlock_page(hpage); - put_page(hpage); - pr_info("soft offline: %#lx hugepage already poisoned\n", pfn); - return -EBUSY; + if (PageHuge(page)) { + isolated = isolate_huge_page(page, pagelist); + } else { + if (lru) + isolated = !isolate_lru_page(page); + else + isolated = !isolate_movable_page(page, ISOLATE_UNEVICTABLE); + if (isolated) + list_add(&page->lru, pagelist); } - unlock_page(hpage);
- ret = isolate_huge_page(hpage, &pagelist); + if (isolated && lru) + inc_node_page_state(page, NR_ISOLATED_ANON + + page_is_file_cache(page)); /* - * get_any_page() and isolate_huge_page() takes a refcount each, - * so need to drop one here. + * If we succeed to isolate the page, we grabbed another refcount on + * the page, so we can safely drop the one we got from get_any_pages(). + * If we failed to isolate the page, it means that we cannot go further + * and we will return an error, so drop the reference we got from + * get_any_pages() as well. */ - put_page(hpage); - if (!ret) { - pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn); - return -EBUSY; - } - - ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, - MIGRATE_SYNC, MR_MEMORY_FAILURE); - if (ret) { - pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n", - pfn, ret, page->flags, &page->flags); - if (!list_empty(&pagelist)) - putback_movable_pages(&pagelist); - if (ret > 0) - ret = -EBUSY; - } else { - /* - * We set PG_hwpoison only when we were able to take the page - * off the buddy. - */ - if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) - page_handle_poison(page, false); - else - ret = -EBUSY; - } - return ret; + put_page(page); + return isolated; }
-static int __soft_offline_page(struct page *page, int flags) +/* + * __soft_offline_page handles hugetlb-pages and non-hugetlb pages. + * If the page is a non-dirty unmapped page-cache page, it simply invalidates. + * If the page is mapped, it migrates the contents over. + */ +static int __soft_offline_page(struct page *page) { - int ret; + int ret = 0; unsigned long pfn = page_to_pfn(page); + struct page *hpage = compound_head(page); + char const *msg_page[] = {"page", "hugepage"}; + bool huge = PageHuge(page); + LIST_HEAD(pagelist);
/* * Check PageHWPoison again inside page lock because PageHWPoison @@ -1786,98 +1790,75 @@ static int __soft_offline_page(struct page *page, int flags) * so there's no race between soft_offline_page() and memory_failure(). */ lock_page(page); - wait_on_page_writeback(page); + if (!PageHuge(page)) + wait_on_page_writeback(page); if (PageHWPoison(page)) { unlock_page(page); put_page(page); pr_info("soft offline: %#lx page already poisoned\n", pfn); return -EBUSY; } - /* - * Try to invalidate first. This should work for - * non dirty unmapped page cache pages. - */ - ret = invalidate_inode_page(page); + + if (!PageHuge(page)) + /* + * Try to invalidate first. This should work for + * non dirty unmapped page cache pages. + */ + ret = invalidate_inode_page(page); unlock_page(page); + /* * RED-PEN would be better to keep it isolated here, but we * would need to fix isolation locking first. */ - if (ret == 1) { + if (ret) { pr_info("soft_offline: %#lx: invalidated\n", pfn); - page_handle_poison(page, true); + page_handle_poison(page, false, true); return 0; }
- /* - * Simple invalidation didn't work. - * Try to migrate to a new page instead. migrate.c - * handles a large number of cases for us. - */ - if (PageLRU(page)) - ret = isolate_lru_page(page); - else - ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE); - /* - * Drop page reference which is came from get_any_page() - * successful isolate_lru_page() already took another one. - */ - put_page(page); - if (!ret) { - LIST_HEAD(pagelist); - /* - * After isolated lru page, the PageLRU will be cleared, - * so use !__PageMovable instead for LRU page's mapping - * cannot have PAGE_MAPPING_MOVABLE. - */ - if (!__PageMovable(page)) - inc_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_cache(page)); - list_add(&page->lru, &pagelist); + if (isolate_page(hpage, &pagelist)) { ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); if (!ret) { - page_handle_poison(page, true); + bool release = !huge; + + if (!page_handle_poison(page, huge, release)) + ret = -EBUSY; } else { if (!list_empty(&pagelist)) putback_movable_pages(&pagelist);
- pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n", - pfn, ret, page->flags, &page->flags); + pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n", + pfn, msg_page[huge], ret, page->flags, &page->flags); if (ret > 0) ret = -EBUSY; } } else { - pr_info("soft offline: %#lx: isolation failed, page count %d, type %lx (%pGp)\n", - pfn, page_count(page), page->flags, &page->flags); + pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n", + pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags); + ret = -EBUSY; } return ret; }
-static int soft_offline_in_use_page(struct page *page, int flags) +static int soft_offline_in_use_page(struct page *page) { - int ret; struct page *hpage = compound_head(page);
if (!PageHuge(page) && PageTransHuge(hpage)) if (try_to_split_thp_page(page, "soft offline") < 0) return -EBUSY;
- if (PageHuge(page)) - ret = soft_offline_huge_page(page, flags); - else - ret = __soft_offline_page(page, flags); - return ret; + return __soft_offline_page(page); }
static int soft_offline_free_page(struct page *page) { - int rc = -EBUSY; + int rc = 0;
- if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) { - page_handle_poison(page, false); - rc = 0; - } + if (!page_handle_poison(page, true, false)) + rc = -EBUSY;
return rc; } @@ -1929,7 +1910,7 @@ int soft_offline_page(struct page *page, int flags) put_online_mems();
if (ret > 0) - ret = soft_offline_in_use_page(page, flags); + ret = soft_offline_in_use_page(page); else if (ret == 0) ret = soft_offline_free_page(page);
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit 5a2ffca3c23333a41cf8604f62994cfd28e4267b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Currently, there is an inconsistency when calling soft-offline from different paths on a page that is already poisoned.
1) madvise:
madvise_inject_error skips any poisoned page and continues the loop. If that was the only page to madvise, it returns 0.
2) /sys/devices/system/memory/:
When calling soft_offline_page_store()->soft_offline_page(), we return -EBUSY in case the page is already poisoned. This is inconsistent with a) the above example and b) memory_failure, where we return 0 if the page was poisoned.
Fix this by dropping the PageHWPoison() check in madvise_inject_error, and let soft_offline_page return 0 if it finds the page already poisoned.
Please note that this is a user-API change: for a page that is already poisoned, soft_offline_page_store()->soft_offline_page() now returns 0 instead of -EBUSY.
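As a rough illustration of the inconsistency described above, the following userspace sketch models both entry points before and after the change. It is not kernel code: struct model_page, model_soft_offline_page(), model_madvise_soft_offline() and the `fixed` switch are hypothetical names invented for this example, and the "offline" step is reduced to setting a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the poison bit is modeled. */
struct model_page {
	bool hwpoison;
};

/*
 * Models soft_offline_page() as reached from the sysfs store handler.
 * `fixed` selects the behavior after this patch.
 */
static int model_soft_offline_page(struct model_page *page, bool fixed)
{
	assert(page != NULL);

	if (page->hwpoison)
		return fixed ? 0 : -EBUSY;

	/* Assumption: pretend isolation and migration always succeed. */
	page->hwpoison = true;
	return 0;
}

/* Models madvise_inject_error() for a range that covers a single page. */
static int model_madvise_soft_offline(struct model_page *page, bool fixed)
{
	assert(page != NULL);

	/* The old code skipped poisoned pages and fell through to return 0. */
	if (!fixed && page->hwpoison)
		return 0;

	return model_soft_offline_page(page, fixed);
}
```

In this model, the two paths disagree on an already-poisoned page when `fixed` is false (0 versus -EBUSY) and both return 0 when it is true.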
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/madvise.c | 5 ----- mm/memory-failure.c | 4 ++-- 2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 6022627646fa2..f0d3d0aaa1167 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -662,11 +662,6 @@ static int madvise_inject_error(int behavior, */ order = compound_order(compound_head(page));
- if (PageHWPoison(page)) { - put_page(page); - continue; - } - if (behavior == MADV_SOFT_OFFLINE) { pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", pfn, start); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 6f16e24969843..385d2d570435c 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1796,7 +1796,7 @@ static int __soft_offline_page(struct page *page) unlock_page(page); put_page(page); pr_info("soft offline: %#lx page already poisoned\n", pfn); - return -EBUSY; + return 0; }
if (!PageHuge(page)) @@ -1902,7 +1902,7 @@ int soft_offline_page(struct page *page, int flags) pr_info("soft offline: %#lx page already poisoned\n", pfn); if (flags & MF_COUNT_INCREASED) put_page(page); - return -EBUSY; + return 0; }
get_online_mems();
From: Naoya Horiguchi naoya.horiguchi@nec.com
mainline inclusion from linux-v5.10-rc1 commit 5d1fd5dc877bc1c670e7b1c174aa659b76c07de1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
memory_failure() is supposed to call action_result() whenever it handles a memory error event, but the call is missing in one case: when try_to_split_thp_page() fails. So let's add it.

include/ras/ras_event.h is also missing some other MF_MSG_* entries, so this patch adds them as well.
Signed-off-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Oscar Salvador osalvador@suse.de Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/mm.h | 1 + include/ras/ras_event.h | 3 +++ mm/memory-failure.c | 5 ++++- 3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 630b103065f4c..b318e9c6cc43d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2907,6 +2907,7 @@ enum mf_action_page_type { MF_MSG_BUDDY, MF_MSG_BUDDY_2ND, MF_MSG_DAX, + MF_MSG_UNSPLIT_THP, MF_MSG_UNKNOWN, };
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h index 2d6a662886e6d..1a9c67d6e7b8e 100644 --- a/include/ras/ras_event.h +++ b/include/ras/ras_event.h @@ -399,6 +399,7 @@ TRACE_EVENT(aer_event, EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" ) \ EM ( MF_MSG_HUGE, "huge page" ) \ EM ( MF_MSG_FREE_HUGE, "free huge page" ) \ + EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" ) \ EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \ EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" ) \ EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" ) \ @@ -411,6 +412,8 @@ TRACE_EVENT(aer_event, EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" ) \ EM ( MF_MSG_BUDDY, "free buddy page" ) \ EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" ) \ + EM ( MF_MSG_DAX, "dax page" ) \ + EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" ) \ EMe ( MF_MSG_UNKNOWN, "unknown page" )
/* diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 385d2d570435c..a04980a8980ef 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -587,6 +587,7 @@ static const char * const action_page_types[] = { [MF_MSG_BUDDY] = "free buddy page", [MF_MSG_BUDDY_2ND] = "free buddy page (2nd try)", [MF_MSG_DAX] = "dax page", + [MF_MSG_UNSPLIT_THP] = "unsplit thp", [MF_MSG_UNKNOWN] = "unknown page", };
@@ -1362,8 +1363,10 @@ int memory_failure(unsigned long pfn, int flags) }
if (PageTransHuge(hpage)) { - if (try_to_split_thp_page(p, "Memory Failure") < 0) + if (try_to_split_thp_page(p, "Memory Failure") < 0) { + action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); return -EBUSY; + } VM_BUG_ON_PAGE(!page_count(p), p); }
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.10-rc1 commit b94e02822debdf0cc473556aad7dcc859f216653 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Aristeu Rozanski reported that a customer test case started to report -EBUSY after the hwpoison rework patchset.
There is a race window between spotting a free page and taking it off its buddy freelist, so by the time we try to take it off, the page might already have been allocated.

Handle this race window by retrying once: if the page was allocated under us, go back and handle it as whatever type of page it has become.
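The retry-once pattern can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: every model_* name, the `alloc_races` flag and the page states are assumptions made up for illustration. Unlike the hunk below, the sketch stores the result of the free-page helper in `ret`, so a failure on the second attempt would also be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_FREE, MODEL_IN_USE };

/* Hypothetical page model for this sketch. */
struct model_page {
	enum model_state state;
	bool alloc_races;	/* page gets allocated right after we saw it free */
	bool poisoned;
};

/* Models get_any_page(): 1 for an in-use page, 0 for a free page. */
static int model_get_any_page(const struct model_page *p)
{
	return p->state == MODEL_IN_USE;
}

/* Models soft_offline_free_page(): fails if the page was allocated under us. */
static int model_offline_free(struct model_page *p)
{
	if (p->alloc_races) {
		p->alloc_races = false;
		p->state = MODEL_IN_USE;
		return -EBUSY;
	}
	p->poisoned = true;
	return 0;
}

/* Models soft_offline_in_use_page(); assumed to always succeed here. */
static int model_offline_in_use(struct model_page *p)
{
	p->poisoned = true;
	return 0;
}

static int model_soft_offline_page(struct model_page *p)
{
	bool try_again = true;
	int ret;

	assert(p != NULL);
retry:
	ret = model_get_any_page(p);
	if (ret > 0) {
		ret = model_offline_in_use(p);
	} else if (ret == 0) {
		ret = model_offline_free(p);
		if (ret && try_again) {
			/* Lost the race once: re-evaluate the page type. */
			try_again = false;
			goto retry;
		}
	}
	return ret;
}
```

With `alloc_races` set, the first pass fails in the free-page path, the second pass sees the page as in use and offlines it through the in-use path.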
Signed-off-by: Oscar Salvador osalvador@suse.de Reported-by: Aristeu Rozanski aris@ruivo.org Tested-by: Aristeu Rozanski aris@ruivo.org Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index a04980a8980ef..63b26a9cca335 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1892,6 +1892,7 @@ int soft_offline_page(struct page *page, int flags) { int ret; unsigned long pfn = page_to_pfn(page); + bool try_again = true;
if (is_zone_device_page(page)) { pr_debug_ratelimited("soft_offline: %#lx page is device page\n", @@ -1908,6 +1909,7 @@ int soft_offline_page(struct page *page, int flags) return 0; }
+retry: get_online_mems(); ret = get_any_page(page, pfn, flags); put_online_mems(); @@ -1915,7 +1917,10 @@ int soft_offline_page(struct page *page, int flags) if (ret > 0) ret = soft_offline_in_use_page(page); else if (ret == 0) - ret = soft_offline_free_page(page); + if (soft_offline_free_page(page) && try_again) { + try_again = false; + goto retry; + }
return ret; }
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.11-rc1 commit 17e395b60f5b3dea204fcae60c7b38e84a00d87a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
A page with 0-refcount and !PageBuddy could perfectly well be a pcppage. Currently, we bail out with an error if we encounter such a page, meaning that we handle pcppages neither from the hard-offline nor from the soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and retrying the check. It might be that pcplists have been spilled into the buddy allocator and so we can handle it.
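A minimal userspace sketch of the drain-and-retry idea follows. The page locations, the refcount handling and all model_* helpers are assumptions for illustration only; in particular, model_drain() merely moves one page from a "pcplist" state to a "buddy" state and does not reflect how drain_all_pages() works internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_where { MODEL_PCPLIST, MODEL_BUDDY, MODEL_IN_USE };

/* Hypothetical page model for this sketch. */
struct model_page {
	enum model_where where;
	int refcount;
};

/* Models __get_hwpoison_page(): only an in-use page can be pinned. */
static int model_try_get(struct model_page *p)
{
	if (p->where != MODEL_IN_USE)
		return 0;
	p->refcount++;
	return 1;
}

/* Models draining: a page sitting in a pcplist is returned to buddy. */
static void model_drain(struct model_page *p)
{
	if (p->where == MODEL_PCPLIST)
		p->where = MODEL_BUDDY;
}

static int model_get_hwpoison_page(struct model_page *p)
{
	bool drained = false;
	int ret;

	assert(p != NULL);
retry:
	ret = model_try_get(p);
	if (!ret && p->where != MODEL_BUDDY && p->refcount == 0 && !drained) {
		/* Possibly a pcppage: drain once and check again. */
		model_drain(p);
		drained = true;
		goto retry;
	}
	return ret;
}
```

In the model, a page that starts in a pcplist still yields 0, but it ends up in the buddy state, so the caller's free-buddy-page check can now handle it.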
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Acked-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 63b26a9cca335..7a69a295fc3ef 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -953,13 +953,13 @@ static int page_action(struct page_state *ps, struct page *p, }
/** - * get_hwpoison_page() - Get refcount for memory error handling: + * __get_hwpoison_page() - Get refcount for memory error handling: * @page: raw error page (hit by memory error) * * Return: return 0 if failed to grab the refcount, otherwise true (some * non-zero value.) */ -int get_hwpoison_page(struct page *page) +static int __get_hwpoison_page(struct page *page) { struct page *head = compound_head(page);
@@ -988,6 +988,26 @@ int get_hwpoison_page(struct page *page)
return 0; } + +int get_hwpoison_page(struct page *p) +{ + int ret; + bool drained = false; + +retry: + ret = __get_hwpoison_page(p); + if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) { + /* + * The page might be in a pcplist, so try to drain those + * and see if we are lucky. + */ + drain_all_pages(page_zone(p)); + drained = true; + goto retry; + } + + return ret; +} EXPORT_SYMBOL_GPL(get_hwpoison_page);
/*
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.11-rc1 commit a8b2c2ce89d4e01062de69b89cafad97cd0fc01b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
The crux of the matter is that historically we left poisoned pages in the buddy system because we have some checks in place when allocating a page that are gatekeeper for poisoned pages. Unfortunately, we do have other users (e.g: compaction [1]) that scan buddy freelists and try to get a page from there without checking whether the page is HWPoison.
As I stated already, I think it is fundamentally wrong to keep HWPoison pages within the buddy system, checks in place or not.
Let us fix this the same way we did for soft_offline [2], taking the page off the buddy freelist so it is completely unreachable.
Note that this is fairly simple to trigger, as we only need to poison free buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress workload.

Just for reference, I put a dump_page() in compaction_alloc() to trigger for HWPoison pages:
  page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
  flags: 0xfffffc0800000(hwpoison)
  raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
  raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: compaction_alloc

  CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
   dump_stack+0x6d/0x8b
   compaction_alloc+0xb2/0xc0
   migrate_pages+0x2a6/0x12a0
   compact_zone+0x5eb/0x11c0
   proactive_compact_node+0x89/0xf0
   kcompactd+0x2d0/0x3a0
   kthread+0x118/0x130
   ret_from_fork+0x22/0x30
After that, if e.g. a process faults in the page, it will get killed unexpectedly. Fix it by containing the page immediately.
Besides that, two more changes can be noticed:
* MF_DELAYED no longer suits, as we are fixing the issue by containing the page immediately, so it no longer relies on the allocation-time checks to stop a HWPoison page from being handed over. The page cannot be allocated again unless it is unpoisoned, so the situation is fixed. Because of that, let us use MF_RECOVERED from now on.
* The second block that handles PageBuddy pages is no longer needed: We call shake_page and then check whether the page is Buddy because shake_page calls drain_all_pages, which sends pcp-pages back to the buddy freelists, so we could have a chance to handle free pages. Currently, get_hwpoison_page already calls drain_all_pages, and we call get_hwpoison_page right before coming here, so we should be on the safe side.
[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u [2] https://patchwork.kernel.org/cover/11792607/
[osalvador@suse.de: take the poisoned subpage off the buddy freelists] Link: https://lkml.kernel.org/r/20201013144447.6706-4-osalvador@suse.de
Link: https://lkml.kernel.org/r/20201013144447.6706-3-osalvador@suse.de Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/memory-failure.c
Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memory-failure.c | 45 ++++++++++++++++++++++++++++++--------------- 1 file changed, 30 insertions(+), 15 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 7a69a295fc3ef..cd3394dd70e16 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -814,7 +814,7 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn) */ static int me_huge_page(struct page *p, unsigned long pfn) { - int res = 0; + int res; struct page *hpage = compound_head(p); struct address_space *mapping;
@@ -825,6 +825,7 @@ static int me_huge_page(struct page *p, unsigned long pfn) if (mapping) { res = truncate_error_page(hpage, pfn, mapping); } else { + res = MF_FAILED; unlock_page(hpage); /* * migration entry prevents later access on error anonymous @@ -833,8 +834,10 @@ static int me_huge_page(struct page *p, unsigned long pfn) */ if (PageAnon(hpage)) put_page(hpage); - dissolve_free_huge_page(p); - res = MF_RECOVERED; + if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { + page_ref_inc(p); + res = MF_RECOVERED; + } lock_page(hpage); }
@@ -1181,9 +1184,13 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) } } unlock_page(head); - dissolve_free_huge_page(p); - action_result(pfn, MF_MSG_FREE_HUGE, MF_DELAYED); - return 0; + res = MF_FAILED; + if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { + page_ref_inc(p); + res = MF_RECOVERED; + } + action_result(pfn, MF_MSG_FREE_HUGE, res); + return res == MF_RECOVERED ? 0 : -EBUSY; }
lock_page(head); @@ -1326,6 +1333,7 @@ int memory_failure(unsigned long pfn, int flags) struct dev_pagemap *pgmap; int res = 0; unsigned long page_flags; + bool retry = true; static DEFINE_MUTEX(mf_mutex);
if (!sysctl_memory_failure_recovery) @@ -1346,6 +1354,7 @@ int memory_failure(unsigned long pfn, int flags)
mutex_lock(&mf_mutex);
+try_again: if (PageHuge(p)) { res = memory_failure_hugetlb(pfn, flags); goto unlock_mutex; @@ -1374,7 +1383,21 @@ int memory_failure(unsigned long pfn, int flags) */ if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) { if (is_free_buddy_page(p)) { - action_result(pfn, MF_MSG_BUDDY, MF_DELAYED); + if (take_page_off_buddy(p)) { + page_ref_inc(p); + res = MF_RECOVERED; + } else { + /* We lost the race, try again */ + if (retry) { + ClearPageHWPoison(p); + num_poisoned_pages_dec(); + retry = false; + goto try_again; + } + res = MF_FAILED; + } + action_result(pfn, MF_MSG_BUDDY, res); + res = res == MF_RECOVERED ? 0 : -EBUSY; } else { action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); res = -EBUSY; @@ -1399,14 +1422,6 @@ int memory_failure(unsigned long pfn, int flags) * walked by the page reclaim code, however that's not a big loss. */ shake_page(p, 0); - /* shake_page could have turned it free. */ - if (!PageLRU(p) && is_free_buddy_page(p)) { - if (flags & MF_COUNT_INCREASED) - action_result(pfn, MF_MSG_BUDDY, MF_DELAYED); - else - action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED); - goto unlock_mutex; - }
lock_page(p);
From: Oscar Salvador osalvador@suse.de
mainline inclusion from linux-v5.11-rc1 commit 32409cba3f66810626c1c15b728c31968d6bfa92 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
The memory_failure and soft_offline_page paths now drain pcplists by calling get_hwpoison_page.

memory_failure flags the page as HWPoison beforehand, so that page can no longer go into a pcplist, and soft_offline_page only flags a page as HWPoison if 1) we took the page off a buddy freelist, 2) the page was in use and we migrated it, or 3) it was a clean pagecache page.

Because of that, a page can no longer be both poisoned and in a pcplist, so the pcplist drain at the end of madvise_inject_error is no longer needed.
Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Acked-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/madvise.c | 5 ----- 1 file changed, 5 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c index f0d3d0aaa1167..e187cd74e9254 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -638,7 +638,6 @@ static long madvise_remove(struct vm_area_struct *vma, static int madvise_inject_error(int behavior, unsigned long start, unsigned long end) { - struct zone *zone; unsigned int order;
if (!capable(CAP_SYS_ADMIN)) @@ -685,10 +684,6 @@ static int madvise_inject_error(int behavior, return ret; }
- /* Ensure that all poisoned pages are removed from per-cpu lists */ - for_each_populated_zone(zone) - drain_all_pages(zone); - return 0; } #endif
From: Ding Hui dinghui@sangfor.com.cn
mainline inclusion from linux-v5.13-rc5 commit bac9c6fa1f929213bbd0ac9cdf21e8e2f0916828 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA
--------------------------------
Recently we found that a lot of MemFree is left in /proc/meminfo after soft offlining a lot of pages, which is not quite correct.

Before Oscar's rework of soft offline for free pages [1], if we soft offlined free pages, these pages were left in buddy with the HWPoison flag set, and NR_FREE_PAGES was not updated immediately. So the difference between NR_FREE_PAGES and the real number of available free pages was also large at the beginning.

However, as the workload ran, whenever an alloc function subsequently caught a HWPoison page, we removed it from buddy, updated NR_FREE_PAGES and tried again, so NR_FREE_PAGES got closer and closer to the real number of available free pages (regardless of unpoison_memory()).

Now, when offlining free pages, after a successful call to take_page_off_buddy() the page no longer belongs to the buddy allocator and will not be used any more, but we missed accounting NR_FREE_PAGES in this situation, and there is no chance for it to be updated later.

Do the update in take_page_off_buddy() like rmqueue() does, but avoid double counting if someone has already called set_migratetype_isolate() on the page.
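The accounting rule can be illustrated with a small userspace model. It rests on the assumption that isolating a pageblock already subtracts that block's free pages from the counter, which is why an isolated page must not be subtracted a second time. struct model_zone and the model_* functions are hypothetical names for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_migratetype { MODEL_MOVABLE, MODEL_ISOLATE };

/* Hypothetical zone model: a single counter standing in for NR_FREE_PAGES. */
struct model_zone {
	long nr_free_pages;
};

/*
 * Models the accounting side of isolating a pageblock (assumption: its
 * free pages are removed from the counter at isolation time).
 */
static void model_isolate_block(struct model_zone *zone, long free_in_block)
{
	assert(zone != NULL);
	zone->nr_free_pages -= free_in_block;
}

/* Models take_page_off_buddy() with the accounting added by this patch. */
static bool model_take_page_off_buddy(struct model_zone *zone,
				      enum model_migratetype migratetype)
{
	assert(zone != NULL);

	/* Isolated pages were already subtracted: avoid double counting. */
	if (migratetype != MODEL_ISOLATE)
		zone->nr_free_pages -= 1;
	return true;
}
```

Starting from 100 free pages, taking a movable page leaves 99; isolating a block with 4 free pages leaves 95, and taking one of those isolated pages keeps the counter at 95.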
[1]: commit 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages") Signed-off-by: Ding Hui dinghui@sangfor.com.cn Suggested-by: Naoya Horiguchi naoya.horiguchi@nec.com Reviewed-by: Oscar Salvador osalvador@suse.de Acked-by: David Hildenbrand david@redhat.com Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_alloc.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8ddd186f68c02..200e19fe216ae 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8611,6 +8611,8 @@ bool take_page_off_buddy(struct page *page) zone->free_area[buddy_order].nr_free--; break_down_buddy_pages(zone, page_head, page, 0, buddy_order, migratetype); + if (!is_migrate_isolate(migratetype)) + __mod_zone_freepage_state(zone, -1, migratetype); ret = true; break; }