From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
mainline inclusion
from mainline-v5.8-rc1
commit ffe945e633b527d5a4577b42cbadec3c7cbcf096
category: bugfix
bugzilla: 36230
CVE: NA
-------------------------------------------------
__collapse_huge_page_swapin() checks the number of referenced PTEs to decide if the memory range is hot enough to justify swapin.
We have a few problems with this approach:

- It is way too late: we can do the check much earlier and save time. khugepaged_scan_pmd() already knows whether we have any pages to swap in and the number of referenced pages.

- It stops the collapse altogether if there are not enough referenced pages, not only the swap-in.

Fix it by making the right check early. We can also avoid the additional page table scan if khugepaged_scan_pmd() hasn't found any swap entries.
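For reference, the reworked decision in khugepaged_scan_pmd() (distilled from the diff below) reads roughly as:

	if (!writable) {
		result = SCAN_PAGE_RO;
	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
		/* demand half of the range young only if we must swap in */
		result = SCAN_LACK_REFERENCED_PAGE;
	} else {
		result = SCAN_SUCCEED;
		ret = 1;
	}

Read-only ranges still bail out first; the HPAGE_PMD_NR/2 threshold now applies only when swap-in (unmapped != 0) would actually be needed.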
Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-3-kirill.shutemov@linux.intel....
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/khugepaged.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5883fd75d6fc..0ad9f2b2b33e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -905,11 +905,6 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		.pgoff = linear_page_index(vma, address),
 	};
 
-	/* we only decide to swapin, if there is enough young ptes */
-	if (referenced < HPAGE_PMD_NR/2) {
-		trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-		return false;
-	}
 	vmf.pte = pte_offset_map(pmd, address);
 	for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
 			vmf.pte++, vmf.address += PAGE_SIZE) {
@@ -949,7 +944,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 static void collapse_huge_page(struct mm_struct *mm,
 				   unsigned long address,
 				   struct page **hpage,
-				   int node, int referenced)
+				   int node, int referenced, int unmapped)
 {
 	pmd_t *pmd, _pmd;
 	pte_t *pte;
@@ -1007,7 +1002,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * If it fails, we release mmap_sem and jump out_nolock.
 	 * Continuing to collapse causes inconsistency.
 	 */
-	if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) {
+	if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
+						     pmd, referenced)) {
 		mem_cgroup_cancel_charge(new_page, memcg, true);
 		up_read(&mm->mmap_sem);
 		goto out_nolock;
@@ -1214,22 +1210,21 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced++;
 	}
-	if (writable) {
-		if (referenced) {
-			result = SCAN_SUCCEED;
-			ret = 1;
-		} else {
-			result = SCAN_LACK_REFERENCED_PAGE;
-		}
-	} else {
+	if (!writable) {
 		result = SCAN_PAGE_RO;
+	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
+		result = SCAN_LACK_REFERENCED_PAGE;
+	} else {
+		result = SCAN_SUCCEED;
+		ret = 1;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
 		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_sem released */
-		collapse_huge_page(mm, address, hpage, node, referenced);
+		collapse_huge_page(mm, address, hpage, node,
+				   referenced, unmapped);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
mainline inclusion
from mainline-v5.8-rc1
commit a980df33e9351e5474c06ec0fd96b2f409e2ff22
category: bugfix
bugzilla: 36242
CVE: NA
-------------------------------------------------
Having a page in the LRU add cache elevates the page refcount and gives a false negative on PageLRU(). This reduces the collapse success rate.

Drain all LRU add caches before scanning. This happens relatively rarely and should not disturb the system too much.
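As the diff below shows, the drain sits at the top of khugepaged_do_scan(), once per scan pass; lru_add_drain_all() flushes the per-CPU LRU add caches on every CPU, so cached pages reach the LRU lists before khugepaged tests PageLRU():

	static void khugepaged_do_scan(void)
	{
		...
		/* flush every CPU's LRU add cache before this scan pass */
		lru_add_drain_all();

		while (progress < pages) {
			...
		}
	}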
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-4-kirill.shutemov@linux.intel....
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/khugepaged.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0ad9f2b2b33e..a0eae9df34bd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1834,6 +1834,8 @@ static void khugepaged_do_scan(void)
 
 	barrier(); /* write khugepaged_pages_to_scan to local stack */
 
+	lru_add_drain_all();
+
 	while (progress < pages) {
 		if (!khugepaged_prealloc_page(&hpage, &wait))
 			break;
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
mainline inclusion
from mainline-v5.8-rc1
commit ae2c5d8042426b69c5f4a74296d1a20bb769a8ad
category: bugfix
bugzilla: 36222
CVE: NA
-------------------------------------------------
collapse_huge_page() tries to swap in pages that are part of the PMD range. A just-swapped-in page goes through the LRU add cache, and the cache takes an extra reference on the page.

The extra reference can make the collapse fail: the following __collapse_huge_page_isolate() checks the refcount and aborts the collapse when it sees an unexpected refcount.

The fix is to drain the local LRU add cache in __collapse_huge_page_swapin() if we successfully swapped in any pages.
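The fix itself is small (see the diff below); the per-CPU lru_add_drain() is expected to suffice here, since the swap-ins were just performed in this context and so landed in the current CPU's pagevec:

	/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
	if (swapped_in)
		lru_add_drain();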
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-5-kirill.shutemov@linux.intel....
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/khugepaged.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a0eae9df34bd..ad386978d7e0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -937,6 +937,11 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	}
 	vmf.pte--;
 	pte_unmap(vmf.pte);
+
+	/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
+	if (swapped_in)
+		lru_add_drain();
+
 	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
 	return true;
 }
From: Vishal Verma <vishal.l.verma@intel.com>
mainline inclusion
from mainline-v5.8-rc1
commit fa6d9ec790550b758215b6c6fa9f940878c3e2a2
category: bugfix
bugzilla: 36877
CVE: NA
-------------------------------------------------
A misbehaving qemu created a situation where the ACPI SRAT table advertised one fewer proximity domain than intended. The NFIT table, however, described all the expected proximity domains. This caused the device dax driver to assign an impossible target_node to the device, and when the device was hotplugged as system memory, it failed with the following signature:
BUG: kernel NULL pointer dereference, address: 0000000000000088
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G O 5.6.0-rc5 #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44 00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b 84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0 80 00 00 00 fe f0 80 a3 38 20
RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
Call Trace:
 kswapd+0x103/0x520
 kthread+0x120/0x140
 ret_from_fork+0x3a/0x50
Add a check in the add_memory path to fail if the node to which we are adding memory is not in the node_possible_map.
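The check (as in the diff below) relies on node_possible(), which tests the nid against node_possible_map, so a node the firmware never declared is rejected with a warning before any hotplug work begins:

	/* reject a nid the firmware never declared as possible */
	if (!node_possible(nid)) {
		WARN(1, "node %d was absent from the node_possible_map\n", nid);
		return -EINVAL;
	}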
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20200416225438.15208-1-vishal.l.verma@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memory_hotplug.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0031c5063aa1..bdafc63ab789 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1058,6 +1058,11 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	if (ret)
 		return ret;
 
+	if (!node_possible(nid)) {
+		WARN(1, "node %d was absent from the node_possible_map\n", nid);
+		return -EINVAL;
+	}
+
 	mem_hotplug_begin();
 
 	/*
From: SeongJae Park <sjpark@amazon.de>
mainline inclusion
from mainline-v5.8-rc1
commit 92fb1db26eefc11554820f11ce8e92007da2fbf4
category: bugfix
bugzilla: 37584
CVE: NA
-------------------------------------------------
'Idle page tracking' users can pass a random pfn that might be mapped to an offline page. To avoid accessing such pages, this commit changes page_idle_get_page() to use pfn_to_online_page() instead of the pfn_valid() and pfn_to_page() combination, so that pfns mapped to offline pages are skipped.
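After the change (see the diff below), the helper boils down to the following; pfn_to_online_page() returns NULL both for invalid pfns and for pfns in offlined memory sections, so the separate pfn_valid() check becomes unnecessary:

	static struct page *page_idle_get_page(unsigned long pfn)
	{
		/* NULL for invalid pfns and for offlined sections */
		struct page *page = pfn_to_online_page(pfn);

		if (!page || !PageLRU(page) ||
		    !get_page_unless_zero(page))
			return NULL;
		...
	}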
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200605092502.18018-2-sjpark@amazon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/page_idle.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 52ed59bbc275..7881bc643bbb 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -4,6 +4,7 @@
 #include <linux/fs.h>
 #include <linux/sysfs.h>
 #include <linux/kobject.h>
+#include <linux/memory_hotplug.h>
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/pagemap.h>
@@ -30,13 +31,9 @@
  */
 static struct page *page_idle_get_page(unsigned long pfn)
 {
-	struct page *page;
+	struct page *page = pfn_to_online_page(pfn);
 	struct zone *zone;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;