1. Backport the patch series "mm: thp: use generic THP migration for NUMA hinting fault"
2. Support THP migration in numa-affinity
3. Some minor optimizations
Changes from v1: backport one more "bugfix" patch (the last one) to avoid a CI failure.
Aneesh Kumar K.V (1):
  mm/migrate: fix NR_ISOLATED corruption on 64-bit

Huang Ying (1):
  mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code

Nanyong Sun (3):
  mm: fix KABI broken in struct vm_fault
  mm: numa-affinity: support THP migration
  mm: numa-affinity: delete the duplicate numa_migrate_prep

Yang Shi (7):
  mm: memory: add orig_pmd to struct vm_fault
  mm: memory: make numa_migrate_prep() non-static
  mm: thp: refactor NUMA fault handling
  mm: migrate: account THP NUMA migration counters correctly
  mm: migrate: don't split THP for misplaced NUMA page
  mm: migrate: check mapcount for THP instead of refcount
  mm: thp: skip make PMD PROT_NONE if THP migration is not supported

Ze Zuo (1):
  mm: numa-affinity: backport some migrate policy from AutoNuma
 include/linux/huge_mm.h |   9 +-
 include/linux/migrate.h |  23 -----
 include/linux/mm.h      |   8 +-
 mm/huge_memory.c        | 167 +++++++++----------------------
 mm/internal.h           |  21 +---
 mm/mem_sampling.c       |  89 ++++++++++++++---
 mm/memory.c             |  31 +++---
 mm/migrate.c            | 205 +++++++++------------------------------
 8 files changed, 194 insertions(+), 359 deletions(-)
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been successfully converted to a pull request! Pull request link: https://gitee.com/openeuler/kernel/pulls/10336 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/P...
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit 5db4f15c4fd7ae74dd40c6f84bf56dfcf13d10cf
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
When the THP NUMA fault support was added, THP migration was not supported yet, so an ad hoc THP migration was implemented in the NUMA fault handling. THP migration has been supported since v4.14, so it doesn't make much sense to keep another THP migration implementation rather than using the generic migration code. It is definitely a maintenance burden to keep two THP migration implementations for different code paths, and it is more error prone. Using the generic THP migration implementation allows us to remove the duplicate code and some hacks needed by the old ad hoc implementation.
A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both THP and NUMA balancing. Most of them support THP migration except for S390. Zi Yan tried to add THP migration support for S390 before, but it was not accepted due to the design of the S390 PMD. For the discussion, please see: https://lkml.org/lkml/2018/4/27/953.
Per the discussion with Gerald Schaefer in v1, it is acceptable to skip huge PMDs for S390 for now.
I saw there were some gup hacks in the git history, but I didn't figure out whether they have been removed, since I only found the FOLL_NUMA code in the current gup implementation and it seems useful.
Patches #1 ~ #2 are preparation patches. Patch #3 is the real meat. Patches #4 ~ #6 keep the counters and behaviors consistent with before. Patch #7 skips changing huge PMDs to prot_none if THP migration is not supported.
Test
----
Did some tests to measure the latency of do_huge_pmd_numa_page. The test VM has 80 vcpus and 64G memory. The test creates 2 processes that together consume 128G memory, which incurs memory pressure and causes THP splits. It also creates 80 processes to hog CPU, and the memory consumer processes are bound to different nodes periodically in order to increase NUMA faults.
The below test script is used:
echo 3 > /proc/sys/vm/drop_caches
./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h & PID=$!
./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
sleep 5
PID_1=`pgrep -P $PID | awk 'NR == 1'`
PID_2=`pgrep -P $PID | awk 'NR == 2'`

JOB1=`pgrep -P $PID_1`
JOB2=`pgrep -P $PID_2`

while [ -d "/proc/$PID" ]
do
    taskset -apc 8 $JOB1
    taskset -apc 8 $JOB2
    sleep 300
    taskset -apc 58 $JOB1
    taskset -apc 58 $JOB2
    sleep 300
done
With the above test, the latency histogram of do_huge_pmd_numa_page is shown below. Since the number of do_huge_pmd_numa_page calls varies drastically between runs (presumably due to the scheduler), I converted the raw numbers to percentages.
                     patched               base
@us[stress-ng]:
[0]                  3.57%                 0.16%
[1]                  55.68%                18.36%
[2, 4)               10.46%                40.44%
[4, 8)               7.26%                 17.82%
[8, 16)              21.12%                13.41%
[16, 32)             1.06%                 4.27%
[32, 64)             0.56%                 4.07%
[64, 128)            0.16%                 0.35%
[128, 256)           < 0.1%                < 0.1%
[256, 512)           < 0.1%                < 0.1%
[512, 1K)            < 0.1%                < 0.1%
[1K, 2K)             < 0.1%                < 0.1%
[2K, 4K)             < 0.1%                < 0.1%
[4K, 8K)             < 0.1%                < 0.1%
[8K, 16K)            < 0.1%                < 0.1%
[16K, 32K)           < 0.1%                < 0.1%
[32K, 64K)           < 0.1%                < 0.1%
Per the result, the patched kernel is even slightly better than the base kernel. I think this is because lock contention against THP split is lower than in the base kernel due to the refactor.
To exclude the effect of THP split, I also tested without memory pressure. No obvious regression is spotted. Below is the test result *without* memory pressure.
                     patched               base
@us[stress-ng]:
[0]                  7.97%                 18.4%
[1]                  69.63%                58.24%
[2, 4)               4.18%                 2.63%
[4, 8)               0.22%                 0.17%
[8, 16)              1.03%                 0.92%
[16, 32)             0.14%                 < 0.1%
[32, 64)             < 0.1%                < 0.1%
[64, 128)            < 0.1%                < 0.1%
[128, 256)           < 0.1%                < 0.1%
[256, 512)           0.45%                 1.19%
[512, 1K)            15.45%                17.27%
[1K, 2K)             < 0.1%                < 0.1%
[2K, 4K)             < 0.1%                < 0.1%
[4K, 8K)             < 0.1%                < 0.1%
[8K, 16K)            0.86%                 0.88%
[16K, 32K)           < 0.1%                0.15%
[32K, 64K)           < 0.1%                < 0.1%
[64K, 128K)          < 0.1%                < 0.1%
[128K, 256K)         < 0.1%                < 0.1%
The series also survived a series of tests by Mel that exercise NUMA balancing migrations.
This patch (of 7):
Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the huge page fault paths can be removed, just like its PTE counterpart.
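The caller-side effect is small; condensed from the __handle_mm_fault() hunk in the diff below, the PMD value is stashed in the fault structure once, and the huge-PMD helpers read it from there instead of taking an extra pmd_t argument:

	/* Condensed sketch of the caller change in __handle_mm_fault() */
	vmf.orig_pmd = *vmf.pmd;			/* was: pmd_t orig_pmd = *vmf.pmd; */
	barrier();
	if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
		return do_huge_pmd_numa_page(&vmf);	/* was: (&vmf, orig_pmd) */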
Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Acked-by: Mel Gorman mgorman@suse.de Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Zi Yan ziy@nvidia.com Cc: Huang Ying ying.huang@intel.com Cc: Michal Hocko mhocko@suse.com Cc: Hugh Dickins hughd@google.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Christian Borntraeger borntraeger@de.ibm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- include/linux/huge_mm.h | 9 ++++----- include/linux/mm.h | 7 ++++++- mm/huge_memory.c | 9 ++++++--- mm/memory.c | 26 +++++++++++++------------- 4 files changed, 29 insertions(+), 22 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b993ae44111c..efb370e79ac3 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); -void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); +void huge_pmd_set_accessed(struct vm_fault *vmf); int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, pud_t *dst_pud, pud_t *src_pud, unsigned long addr, struct vm_area_struct *vma); @@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) } #endif
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd); +vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf); struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, unsigned int flags); @@ -291,7 +291,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap);
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd); +vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
extern struct page *huge_zero_page; extern unsigned long huge_zero_pfn; @@ -444,8 +444,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, return NULL; }
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, - pmd_t orig_pmd) +static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { return 0; } diff --git a/include/linux/mm.h b/include/linux/mm.h index e0d269b83c8f..de5dd997522b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -579,7 +579,12 @@ struct vm_fault { pud_t *pud; /* Pointer to pud entry matching * the 'address' */ - pte_t orig_pte; /* Value of PTE at the time of fault */ + union { + pte_t orig_pte; /* Value of PTE at the time of fault */ + pmd_t orig_pmd; /* Value of PMD at the time of fault, + * used by PMD fault only. + */ + };
struct page *cow_page; /* Page handler may use for COW fault */ struct page *page; /* ->fault handlers should return a diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 20d548da4660..7e02102b275e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1262,11 +1262,12 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) +void huge_pmd_set_accessed(struct vm_fault *vmf) { pmd_t entry; unsigned long haddr; bool write = vmf->flags & FAULT_FLAG_WRITE; + pmd_t orig_pmd = vmf->orig_pmd;
vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) @@ -1283,11 +1284,12 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) spin_unlock(vmf->ptl); }
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) +vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; + pmd_t orig_pmd = vmf->orig_pmd;
vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); @@ -1423,9 +1425,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, }
/* NUMA hinting page fault entry point for trans huge pmds */ -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) +vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; + pmd_t pmd = vmf->orig_pmd; struct anon_vma *anon_vma = NULL; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; diff --git a/mm/memory.c b/mm/memory.c index cd72c3d63e60..5a309417efb7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4490,12 +4490,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) }
/* `inline' is required to avoid gcc 4.1.2 build error */ -static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) +static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { if (vma_is_anonymous(vmf->vma)) { - if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd)) + if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) return handle_userfault(vmf, VM_UFFD_WP); - return do_huge_pmd_wp_page(vmf, orig_pmd); + return do_huge_pmd_wp_page(vmf); } if (vmf->vma->vm_ops->huge_fault) { vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); @@ -4716,26 +4716,26 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { - pmd_t orig_pmd = *vmf.pmd; + vmf.orig_pmd = *vmf.pmd;
barrier(); - if (unlikely(is_swap_pmd(orig_pmd))) { + if (unlikely(is_swap_pmd(vmf.orig_pmd))) { VM_BUG_ON(thp_migration_supported() && - !is_pmd_migration_entry(orig_pmd)); - if (is_pmd_migration_entry(orig_pmd)) + !is_pmd_migration_entry(vmf.orig_pmd)); + if (is_pmd_migration_entry(vmf.orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); return 0; } - if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { - if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) - return do_huge_pmd_numa_page(&vmf, orig_pmd); + if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) { + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) + return do_huge_pmd_numa_page(&vmf);
- if (dirty && !pmd_write(orig_pmd)) { - ret = wp_huge_pmd(&vmf, orig_pmd); + if (dirty && !pmd_write(vmf.orig_pmd)) { + ret = wp_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { - huge_pmd_set_accessed(&vmf, orig_pmd); + huge_pmd_set_accessed(&vmf); return 0; } }
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
--------------------------------
Commit 5daea808f6d4 ("mm: memory: add orig_pmd to struct vm_fault") replaced the pte_t orig_pte with a union in struct vm_fault, which does not actually affect the KABI. Use KABI_REPLACE to fix it.
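For context, the idea behind a KABI_REPLACE()-style wrapper is sketched below. This is an illustration of the concept only (the member names here are illustrative), not the actual macro from the openEuler tree, which typically also embeds a build-time size/alignment check:

	/* Illustration only: the new anonymous union is overlaid on the slot
	 * that used to hold orig_pte, so the offset and size seen by KABI
	 * tooling stay unchanged. */
	union {
		union {
			pte_t orig_pte;		/* Value of PTE at the time of fault */
			pmd_t orig_pmd;		/* Value of PMD, PMD fault only */
		};
		pte_t kabi_reserved_orig_pte;	/* original member, kept for KABI checks */
	};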
Fixes: 5daea808f6d4 ("mm: memory: add orig_pmd to struct vm_fault") Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- include/linux/mm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index de5dd997522b..627f997bc547 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -579,12 +579,13 @@ struct vm_fault { pud_t *pud; /* Pointer to pud entry matching * the 'address' */ + KABI_REPLACE(pte_t orig_pte, union { pte_t orig_pte; /* Value of PTE at the time of fault */ pmd_t orig_pmd; /* Value of PMD at the time of fault, * used by PMD fault only. */ - }; + })
struct page *cow_page; /* Page handler may use for COW fault */ struct page *page; /* ->fault handlers should return a
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit f4c0d8367ea492cdfc7f6d14763c02f472731592
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------
numa_migrate_prep() will also be used by the huge NUMA fault path in the following patch, so make it non-static.
Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Acked-by: Mel Gorman mgorman@suse.de Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Huang Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/internal.h | 3 +++ mm/memory.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h index a9fc47f3677f..800f8f42fdb7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -673,4 +673,7 @@ struct migration_target_control {
DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int page_nid, int *flags); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory.c b/mm/memory.c index 5a309417efb7..0c4da925e8ad 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4375,9 +4375,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf) return ret; }
-static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, - unsigned long addr, int page_nid, - int *flags) +int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int page_nid, int *flags) { get_page(page);
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit c5b5a3dd2c1fa61049b7789ce596faff4d659a61
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When the THP NUMA fault support was added, THP migration was not supported yet, so an ad hoc THP migration was implemented in the NUMA fault handling. THP migration has been supported since v4.14, so it doesn't make much sense to keep another THP migration implementation rather than using the generic migration code.
This patch reworks the NUMA fault handling to use the generic migration implementation to migrate misplaced pages. There is no functional change.
After the refactor, the flow of NUMA fault handling looks just like its PTE counterpart:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP
If migration fails, just restore the old normal PMD.
In the old code the anon_vma lock was needed to serialize THP migration against THP split, but the THP code has been reworked a lot since then, and it seems the anon_vma lock is no longer required to avoid the race.
The page refcount elevation while holding the ptl should prevent THP split.
Use migrate_misplaced_page() for both base page and THP NUMA hinting faults and remove all the dead and duplicate code.
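Condensed from the do_huge_pmd_numa_page() rework in the diff below (error handling, statistics and the PMD-restore path are trimmed), the new handler reads roughly:

	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
		spin_unlock(vmf->ptl);
		goto out;
	}
	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
	page = vm_normal_page_pmd(vma, haddr, pmd);
	target_nid = numa_migrate_prep(page, vma, haddr, page_nid, &flags);	/* takes a page reference */
	spin_unlock(vmf->ptl);
	migrated = migrate_misplaced_page(page, vma, target_nid);	/* isolates from LRU, calls migrate_pages() */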
[dan.carpenter@oracle.com: fix a double unlock bug] Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Acked-by: Mel Gorman mgorman@suse.de Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Huang Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- include/linux/migrate.h | 23 ------ mm/huge_memory.c | 146 ++++++++++---------------------- mm/internal.h | 18 ---- mm/migrate.c | 178 ++++++++-------------------------------- 4 files changed, 77 insertions(+), 288 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h index ade4993f5fab..188f4b414bf8 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -105,14 +105,9 @@ static inline void __ClearPageMovable(struct page *page) #endif
#ifdef CONFIG_NUMA_BALANCING -extern bool pmd_trans_migrating(pmd_t pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); #else -static inline bool pmd_trans_migrating(pmd_t pmd) -{ - return false; -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { @@ -120,24 +115,6 @@ static inline int migrate_misplaced_page(struct page *page, } #endif /* CONFIG_NUMA_BALANCING */
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm, - struct vm_area_struct *vma, - pmd_t *pmd, pmd_t entry, - unsigned long address, - struct page *page, int node); -#else -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm, - struct vm_area_struct *vma, - pmd_t *pmd, pmd_t entry, - unsigned long address, - struct page *page, int node) -{ - return -EAGAIN; -} -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/ - - #ifdef CONFIG_MIGRATION
/* diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7e02102b275e..d5cb55cbf578 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1428,95 +1428,22 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; - pmd_t pmd = vmf->orig_pmd; - struct anon_vma *anon_vma = NULL; + pmd_t oldpmd = vmf->orig_pmd; + pmd_t pmd; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - int page_nid = NUMA_NO_NODE, this_nid = numa_node_id(); + int page_nid = NUMA_NO_NODE; int target_nid, last_cpupid = -1; - bool page_locked; bool migrated = false; - bool was_writable; + bool was_writable = pmd_savedwrite(oldpmd); int flags = 0;
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_same(pmd, *vmf->pmd))) - goto out_unlock; - - /* - * If there are potential migrations, wait for completion and retry - * without disrupting NUMA hinting information. Do not relock and - * check_same as the page may no longer be mapped. - */ - if (unlikely(pmd_trans_migrating(*vmf->pmd))) { - page = pmd_page(*vmf->pmd); - if (!get_page_unless_zero(page)) - goto out_unlock; + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { spin_unlock(vmf->ptl); - put_and_wait_on_page_locked(page); goto out; }
- page = pmd_page(pmd); - BUG_ON(is_huge_zero_page(page)); - page_nid = page_to_nid(page); - last_cpupid = page_cpupid_last(page); - count_vm_numa_event(NUMA_HINT_FAULTS); - if (page_nid == this_nid) { - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); - flags |= TNF_FAULT_LOCAL; - } - - /* See similar comment in do_numa_page for explanation */ - if (!pmd_savedwrite(pmd)) - flags |= TNF_NO_GROUP; - - /* - * Acquire the page lock to serialise THP migrations but avoid dropping - * page_table_lock if at all possible - */ - page_locked = trylock_page(page); - target_nid = mpol_misplaced(page, vma, haddr); - if (target_nid == NUMA_NO_NODE) { - /* If the page was locked, there are no parallel migrations */ - if (page_locked) - goto clear_pmdnuma; - } - - /* Migration could have started since the pmd_trans_migrating check */ - if (!page_locked) { - page_nid = NUMA_NO_NODE; - if (!get_page_unless_zero(page)) - goto out_unlock; - spin_unlock(vmf->ptl); - put_and_wait_on_page_locked(page); - goto out; - } - - /* - * Page is misplaced. Page lock serialises migrations. Acquire anon_vma - * to serialises splits - */ - get_page(page); - spin_unlock(vmf->ptl); - anon_vma = page_lock_anon_vma_read(page); - - /* Confirm the PMD did not change while page_table_lock was released */ - spin_lock(vmf->ptl); - if (unlikely(!pmd_same(pmd, *vmf->pmd))) { - unlock_page(page); - put_page(page); - page_nid = NUMA_NO_NODE; - goto out_unlock; - } - - /* Bail if we fail to protect against THP splits for any reason */ - if (unlikely(!anon_vma)) { - put_page(page); - page_nid = NUMA_NO_NODE; - goto clear_pmdnuma; - } - /* * Since we took the NUMA fault, we must have observed the !accessible * bit. Make sure all other CPUs agree with that, to avoid them @@ -1543,43 +1470,58 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) haddr + HPAGE_PMD_SIZE); }
- /* - * Migrate the THP to the requested node, returns with page unlocked - * and access rights restored. - */ + pmd = pmd_modify(oldpmd, vma->vm_page_prot); + page = vm_normal_page_pmd(vma, haddr, pmd); + if (!page) + goto out_map; + + /* See similar comment in do_numa_page for explanation */ + if (!was_writable) + flags |= TNF_NO_GROUP; + + page_nid = page_to_nid(page); + last_cpupid = page_cpupid_last(page); + target_nid = numa_migrate_prep(page, vma, haddr, page_nid, + &flags); + + if (target_nid == NUMA_NO_NODE) { + put_page(page); + goto out_map; + } + spin_unlock(vmf->ptl);
- migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma, - vmf->pmd, pmd, vmf->address, page, target_nid); + migrated = migrate_misplaced_page(page, vma, target_nid); if (migrated) { flags |= TNF_MIGRATED; page_nid = target_nid; - } else + } else { flags |= TNF_MIGRATE_FAIL; - - goto out; -clear_pmdnuma: - BUG_ON(!PageLocked(page)); - was_writable = pmd_savedwrite(pmd); - pmd = pmd_modify(pmd, vma->vm_page_prot); - pmd = pmd_mkyoung(pmd); - if (was_writable) - pmd = pmd_mkwrite(pmd); - set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); - update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - unlock_page(page); -out_unlock: - spin_unlock(vmf->ptl); + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { + spin_unlock(vmf->ptl); + goto out; + } + goto out_map; + }
out: - if (anon_vma) - page_unlock_anon_vma_read(anon_vma); - if (page_nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
return 0; + +out_map: + /* Restore the PMD */ + pmd = pmd_modify(oldpmd, vma->vm_page_prot); + pmd = pmd_mkyoung(pmd); + if (was_writable) + pmd = pmd_mkwrite(pmd); + set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); + spin_unlock(vmf->ptl); + goto out; }
/* diff --git a/mm/internal.h b/mm/internal.h index 800f8f42fdb7..0c4e6959b83c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -384,23 +384,6 @@ extern unsigned int munlock_vma_page(struct page *page); */ extern void clear_page_mlock(struct page *page);
-/* - * mlock_migrate_page - called only from migrate_misplaced_transhuge_page() - * (because that does not go through the full procedure of migration ptes): - * to migrate the Mlocked page flag; update statistics. - */ -static inline void mlock_migrate_page(struct page *newpage, struct page *page) -{ - if (TestClearPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - - /* Holding pmd lock, no change in irq context: __mod is safe */ - __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - SetPageMlocked(newpage); - __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages); - } -} - extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
/* @@ -476,7 +459,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, #else /* !CONFIG_MMU */ static inline void clear_page_mlock(struct page *page) { } static inline void mlock_vma_page(struct page *page) { } -static inline void mlock_migrate_page(struct page *new, struct page *old) { }
#endif /* !CONFIG_MMU */
diff --git a/mm/migrate.c b/mm/migrate.c index c8491a744e8c..488e445b8ba4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2082,6 +2082,23 @@ static struct page *alloc_misplaced_dst_page(struct page *page, return newpage; }
+static struct page *alloc_misplaced_dst_page_thp(struct page *page, + unsigned long data) +{ + int nid = (int) data; + struct page *newpage; + + newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), + HPAGE_PMD_ORDER); + if (!newpage) + goto out; + + prep_transhuge_page(newpage); + +out: + return newpage; +} + static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) { int page_lru; @@ -2123,12 +2140,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) return 1; }
-bool pmd_trans_migrating(pmd_t pmd) -{ - struct page *page = pmd_page(pmd); - return PageLocked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on @@ -2141,6 +2152,20 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int isolated; int nr_remaining; LIST_HEAD(migratepages); + new_page_t *new; + bool compound; + + /* + * PTE mapped THP or HugeTLB page can't reach here so the page could + * be either base page or THP. And it must be head page if it is + * THP. + */ + compound = PageTransHuge(page); + + if (compound) + new = alloc_misplaced_dst_page_thp; + else + new = alloc_misplaced_dst_page;
/* * Don't migrate file pages that are mapped in multiple processes @@ -2162,9 +2187,8 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, goto out;
list_add(&page->lru, &migratepages); - nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page, - NULL, node, MIGRATE_ASYNC, - MR_NUMA_MISPLACED); + nr_remaining = migrate_pages(&migratepages, *new, NULL, node, + MIGRATE_ASYNC, MR_NUMA_MISPLACED); if (nr_remaining) { if (!list_empty(&migratepages)) { list_del(&page->lru); @@ -2184,142 +2208,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, } #endif /* CONFIG_NUMA_BALANCING */
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) -/* - * Migrates a THP to a given target node. page must be locked and is unlocked - * before returning. - */ -int migrate_misplaced_transhuge_page(struct mm_struct *mm, - struct vm_area_struct *vma, - pmd_t *pmd, pmd_t entry, - unsigned long address, - struct page *page, int node) -{ - spinlock_t *ptl; - pg_data_t *pgdat = NODE_DATA(node); - int isolated = 0; - struct page *new_page = NULL; - int page_lru = page_is_file_lru(page); - unsigned long start = address & HPAGE_PMD_MASK; - - new_page = alloc_pages_node(node, - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), - HPAGE_PMD_ORDER); - if (!new_page) - goto out_fail; - prep_transhuge_page(new_page); - - isolated = numamigrate_isolate_page(pgdat, page); - if (!isolated) { - put_page(new_page); - goto out_fail; - } - - /* Prepare a page as a migration target */ - __SetPageLocked(new_page); - if (PageSwapBacked(page)) - __SetPageSwapBacked(new_page); - - /* anon mapping, we can simply copy page->mapping to the new page: */ - new_page->mapping = page->mapping; - new_page->index = page->index; - /* flush the cache before copying using the kernel virtual address */ - flush_cache_range(vma, start, start + HPAGE_PMD_SIZE); - migrate_page_copy(new_page, page); - WARN_ON(PageLRU(new_page)); - - /* Recheck the target PMD */ - ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) { - spin_unlock(ptl); - - /* Reverse changes made by migrate_page_copy() */ - if (TestClearPageActive(new_page)) - SetPageActive(page); - if (TestClearPageUnevictable(new_page)) - SetPageUnevictable(page); - - unlock_page(new_page); - put_page(new_page); /* Free it */ - - /* Retake the callers reference and putback on LRU */ - get_page(page); - putback_lru_page(page); - mod_node_page_state(page_pgdat(page), - NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); - - goto out_unlock; - } - - entry = mk_huge_pmd(new_page, vma->vm_page_prot); - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - - /* - * Overwrite the old entry under pagetable lock and establish - * the new PTE. Any parallel GUP will either observe the old - * page blocking on the page lock, block on the page table - * lock or observe the new page. The SetPageUptodate on the - * new page and page_add_new_anon_rmap guarantee the copy is - * visible before the pagetable update. - */ - reliable_page_counter(new_page, vma->vm_mm, HPAGE_PMD_NR); - page_add_anon_rmap(new_page, vma, start, true); - /* - * At this point the pmd is numa/protnone (i.e. non present) and the TLB - * has already been flushed globally. So no TLB can be currently - * caching this non present pmd mapping. There's no need to clear the - * pmd before doing set_pmd_at(), nor to flush the TLB after - * set_pmd_at(). Clearing the pmd here would introduce a race - * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the - * mmap_lock for reading. If the pmd is set to NULL at any given time, - * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this - * pmd. - */ - set_pmd_at(mm, start, pmd, entry); - update_mmu_cache_pmd(vma, address, &entry); - - page_ref_unfreeze(page, 2); - mlock_migrate_page(new_page, page); - reliable_page_counter(page, vma->vm_mm, -HPAGE_PMD_NR); - page_remove_rmap(page, true); - set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED); - - spin_unlock(ptl); - - /* Take an "isolate" reference and put new page on the LRU. 
*/ - get_page(new_page); - putback_lru_page(new_page); - - unlock_page(new_page); - unlock_page(page); - put_page(page); /* Drop the rmap reference */ - put_page(page); /* Drop the LRU isolation reference */ - - count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR); - count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR); - - mod_node_page_state(page_pgdat(page), - NR_ISOLATED_ANON + page_lru, - -HPAGE_PMD_NR); - return isolated; - -out_fail: - count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); - ptl = pmd_lock(mm, pmd); - if (pmd_same(*pmd, entry)) { - entry = pmd_modify(entry, vma->vm_page_prot); - set_pmd_at(mm, start, pmd, entry); - update_mmu_cache_pmd(vma, address, &entry); - } - spin_unlock(ptl); - -out_unlock: - unlock_page(page); - put_page(page); - return 0; -} -#endif /* CONFIG_NUMA_BALANCING */ - #endif /* CONFIG_NUMA */
#ifdef CONFIG_DEVICE_PRIVATE
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit c5fc5c3ae0c849c713c4291addb5fce699ad0972
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that both base page and THP NUMA migration are done via migrate_misplaced_page(), keep the counters correct for THP.
Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Acked-by: Mel Gorman mgorman@suse.de Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Huang Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/migrate.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 488e445b8ba4..99a04ac960b3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2154,6 +2154,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, LIST_HEAD(migratepages); new_page_t *new; bool compound; + unsigned int nr_pages = thp_nr_pages(page);
/* * PTE mapped THP or HugeTLB page can't reach here so the page could @@ -2192,13 +2193,13 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, if (nr_remaining) { if (!list_empty(&migratepages)) { list_del(&page->lru); - dec_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_lru(page)); + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + + page_is_file_lru(page), -nr_pages); putback_lru_page(page); } isolated = 0; } else - count_vm_numa_event(NUMA_PAGE_MIGRATE); + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages); BUG_ON(!list_empty(&migratepages)); return isolated;
From: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com
mainline inclusion
from mainline-v5.14-rc4
commit b5916c025432b7c776b6bb13617485fbc0bd3ebd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit"), avoid using unsigned int for nr_pages. With an unsigned int type, the negated value wraps to a large unsigned int, which then converts to a large positive signed long.
Symptoms include CMA allocations hanging forever due to alloc_contig_range->...->isolate_migratepages_block waiting forever in "while (unlikely(too_many_isolated(pgdat)))".
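A minimal stand-alone illustration of the sign-conversion problem (a hypothetical user-space program, not kernel code): the node statistics delta is a signed long, and with an unsigned nr_pages the negation wraps around before it is widened, so the isolated-page counter goes up by about 4.29 billion instead of down by 512:

	#include <stdio.h>

	int main(void)
	{
		unsigned int nr_pages = 512;	/* HPAGE_PMD_NR with 4K pages */
		int nr_pages_fixed = 512;

		/* -nr_pages wraps to 2^32 - 512 before the conversion to long */
		long bad_delta  = -nr_pages;		/* 4294966784 on 64-bit */
		long good_delta = -nr_pages_fixed;	/* -512 */

		printf("unsigned: %ld, int: %ld\n", bad_delta, good_delta);
		return 0;
	}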
Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly") Signed-off-by: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Reported-by: Michael Ellerman mpe@ellerman.id.au Reported-by: Alexey Kardashevskiy aik@ozlabs.ru Reviewed-by: Yang Shi shy828301@gmail.com Cc: Mel Gorman mgorman@suse.de Cc: Nicholas Piggin npiggin@gmail.com Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 99a04ac960b3..ce7b7989286b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2154,7 +2154,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, LIST_HEAD(migratepages); new_page_t *new; bool compound; - unsigned int nr_pages = thp_nr_pages(page); + int nr_pages = thp_nr_pages(page);
/* * PTE mapped THP or HugeTLB page can't reach here so the page could
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit b0b515bfb3f4f3dc208862989e38ee5268a1003f
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The old behavior didn't split the THP if migration failed due to lack of memory on the target node, but the generic THP migration path does split the THP, so keep the old behavior for misplaced NUMA page migration.
Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Acked-by: Mel Gorman mgorman@suse.de Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Huang Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/migrate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c index ce7b7989286b..33badf35219e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1487,6 +1487,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, struct page *page2; int swapwrite = current->flags & PF_SWAPWRITE; int rc, nr_subpages; + bool nosplit = (reason == MR_NUMA_MISPLACED);
if (!swapwrite) current->flags |= PF_SWAPWRITE; @@ -1527,8 +1528,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, * pages are added to the tail of the list so * we encounter them after the rest of the list * is processed. + * THP NUMA faulting doesn't split THP to retry. */ - if (is_thp) { + if (is_thp && !nosplit) { lock_page(page); rc = split_huge_page_to_list(page, from); unlock_page(page);
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit 662aeea7536d84d7e1d01739694e4748ba294ce0
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The generic migration path will check the refcount, so there is no need to check it here. But the old code actually prevented migrating a shared THP (mapped by multiple processes), so bail out early if the mapcount is > 1 to keep that behavior.
Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Huang Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Mel Gorman mgorman@suse.de Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/migrate.c | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 33badf35219e..3f5b217d5af1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2107,6 +2107,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
+ /* Do not migrate THP mapped by multiple processes */ + if (PageTransHuge(page) && total_mapcount(page) > 1) + return 0; + /* Avoid migrating to a node that is nearly full */ if (!migrate_balanced_pgdat(pgdat, compound_nr(page))) return 0; @@ -2117,18 +2121,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) if (isolate_lru_page(page)) return 0;
- /* - * migrate_misplaced_transhuge_page() skips page migration's usual - * check on page_count(), so we must do it here, now that the page - * has been isolated: a GUP pin, or any other pin, prevents migration. - * The expected page count is 3: 1 for page's mapcount and 1 for the - * caller's pin and 1 for the reference taken by isolate_lru_page(). - */ - if (PageTransHuge(page) && page_count(page) != 3) { - putback_lru_page(page); - return 0; - } - page_lru = page_is_file_lru(page); mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru, thp_nr_pages(page));
From: Yang Shi shy828301@gmail.com
mainline inclusion
from mainline-v5.14-rc1
commit e346e6688c4aa18588f2c6a75b572d8ca7a65f5f
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both NUMA balancing and THP. But S390 doesn't support THP migration so NUMA balancing actually can't migrate any misplaced pages.
Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be wasted by pointless NUMA hinting faults on S390.
Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com Signed-off-by: Yang Shi shy828301@gmail.com Acked-by: Mel Gorman mgorman@suse.de Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Huang Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/huge_memory.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d5cb55cbf578..72b4f10b822f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1751,6 +1751,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, * Returns * - 0 if PMD could not be locked * - 1 if PMD was locked but protections unchange and TLB flush unnecessary + * or if prot_numa but THP migration is not supported * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, @@ -1765,6 +1766,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, bool uffd_wp = cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ if (prot_numa && !thp_migration_supported()) + return 1; + ptl = __pmd_trans_huge_lock(pmd, vma); if (!ptl) return 0;
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
--------------------------------
In the NUMA balancing scenario, support PMD-level THP migration, following the flow of the do_huge_pmd_numa_page() function in AutoNuma:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP
The page refcount elevation while holding the ptl should prevent THP split.
In conclusion, as in AutoNuma, these pages will not be migrated:
1. Read-only file pages, see do_numa_access().
2. Non-normal pages like huge zero pages and devmap pages, see vm_normal_page_pmd().
3. Shared libraries and dirty pages, see migrate_misplaced_page().
4. THPs mapped by multiple processes, see numamigrate_isolate_page().
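Unlike the fault path, the sampling path starts from a sampled virtual address rather than a faulting PMD, so the new do_thp_numa_access() first walks the page tables down to the PMD and revalidates it under the PMD lock before reusing the same migrate_misplaced_page() flow. Condensed from the diff below (the pgd/p4d/pud presence checks, the NULL/compound_head validation and the error paths are trimmed):

	/* Condensed sketch of do_thp_numa_access() */
	haddr = vaddr & HPAGE_PMD_MASK;
	pmd = pmd_offset(pud, vaddr);		/* after pgd/p4d/pud_offset() */
	ptl = pmd_lock(mm, pmd);
	pmde = READ_ONCE(*pmd);
	if (pmd_trans_huge(pmde)) {
		hpage = vm_normal_page_pmd(vma, haddr, pmde);
		target_nid = numa_migrate_prep(hpage, vma, haddr, page_nid, &flags);
		spin_unlock(ptl);
		migrated = migrate_misplaced_page(hpage, vma, target_nid);
	}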
Signed-off-by: Nanyong Sun sunnanyong@huawei.com Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mem_sampling.c | 78 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 76 insertions(+), 2 deletions(-)
diff --git a/mm/mem_sampling.c b/mm/mem_sampling.c index 0eaea2680d83..e0470052ae9c 100644 --- a/mm/mem_sampling.c +++ b/mm/mem_sampling.c @@ -144,6 +144,79 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, return mpol_misplaced(page, vma, addr); }
+static inline void do_thp_numa_access(struct mm_struct *mm, + struct vm_area_struct *vma, + u64 vaddr, struct page *page) +{ + int page_nid = NUMA_NO_NODE; + int target_nid, last_cpupid = -1; + bool migrated = false; + int flags = 0; + struct page *hpage = NULL; + u64 haddr = vaddr & HPAGE_PMD_MASK; + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd, pmde; + spinlock_t *ptl; + + pgd = pgd_offset(mm, vaddr); + if (!pgd_present(*pgd)) + return; + + p4d = p4d_offset(pgd, vaddr); + if (!p4d_present(*p4d)) + return; + + pud = pud_offset(p4d, vaddr); + if (!pud_present(*pud)) + return; + + pmd = pmd_offset(pud, vaddr); + pmde = READ_ONCE(*pmd); + /* TODO: handle PTE-mapped THP */ + if (!pmd_trans_huge(pmde)) + return; + + ptl = pmd_lock(mm, pmd); + pmde = READ_ONCE(*pmd); + if (unlikely(!pmd_trans_huge(pmde))) + goto out_unlock; + + hpage = vm_normal_page_pmd(vma, haddr, pmde); + if (!hpage || hpage != compound_head(page)) + goto out_unlock; + + page_nid = page_to_nid(hpage); + last_cpupid = page_cpupid_last(hpage); + target_nid = numa_migrate_prep(hpage, vma, haddr, page_nid, + &flags); + spin_unlock(ptl); + if (target_nid == NUMA_NO_NODE) { + put_page(hpage); + goto out; + } + + migrated = migrate_misplaced_page(hpage, vma, target_nid); + if (migrated) { + flags |= TNF_MIGRATED; + page_nid = target_nid; + } else { + flags |= TNF_MIGRATE_FAIL; + } + +out: + trace_mm_numa_migrating(haddr, page_nid, target_nid, flags&TNF_MIGRATED); + if (page_nid != NUMA_NO_NODE) + task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, + flags); + + return; + +out_unlock: + spin_unlock(ptl); +} + /* * Called from task_work context to act upon the page access. * @@ -190,9 +263,10 @@ static void do_numa_access(struct task_struct *p, u64 vaddr, u64 paddr) if (unlikely(!PageLRU(page))) goto out_unlock;
- /* TODO: handle PTE-mapped THP or PMD-mapped THP*/ - if (PageCompound(page)) + if (PageCompound(page)) { + do_thp_numa_access(mm, vma, vaddr, page); goto out_unlock; + }
/* * Flag if the page is shared between multiple address spaces. This
From: Ze Zuo zuoze1@huawei.com
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
--------------------------------
Skip COW pages and KSM pages for base pages, like change_pte_range() does.
Signed-off-by: Ze Zuo zuoze1@huawei.com Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/mem_sampling.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/mem_sampling.c b/mm/mem_sampling.c index e0470052ae9c..7d3a159c8f16 100644 --- a/mm/mem_sampling.c +++ b/mm/mem_sampling.c @@ -23,6 +23,7 @@ #include <linux/migrate.h> #include <linux/sched/numa_balancing.h> #include <trace/events/kmem.h> +#include "internal.h"
struct mem_sampling_ops_struct mem_sampling_ops;
@@ -257,7 +258,7 @@ static void do_numa_access(struct task_struct *p, u64 vaddr, u64 paddr) goto out_unlock;
page = pfn_to_online_page(PHYS_PFN(paddr)); - if (!page || is_zone_device_page(page)) + if (!page || is_zone_device_page(page) || PageKsm(page)) goto out_unlock;
if (unlikely(!PageLRU(page))) @@ -275,6 +276,10 @@ static void do_numa_access(struct task_struct *p, u64 vaddr, u64 paddr) if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED)) flags |= TNF_SHARED;
+ /* Also skip shared copy-on-write pages */ + if (is_cow_mapping(vma->vm_flags) && page_count(page) != 1) + goto out_unlock; + last_cpupid = page_cpupid_last(page); page_nid = page_to_nid(page);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
--------------------------------
Now that numa_migrate_prep() is no longer static in the mm core, we can delete the duplicate one here.
Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/mem_sampling.c | 16 ---------------- 1 file changed, 16 deletions(-)
diff --git a/mm/mem_sampling.c b/mm/mem_sampling.c index 7d3a159c8f16..1d8a831be531 100644 --- a/mm/mem_sampling.c +++ b/mm/mem_sampling.c @@ -129,22 +129,6 @@ static void mem_sampling_process(struct mem_sampling_record *record_base, int nr }
#ifdef CONFIG_NUMABALANCING_MEM_SAMPLING - -static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, - unsigned long addr, int page_nid, - int *flags) -{ - get_page(page); - - count_vm_numa_event(NUMA_HINT_FAULTS); - if (page_nid == numa_node_id()) { - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); - *flags |= TNF_FAULT_LOCAL; - } - - return mpol_misplaced(page, vma, addr); -} - static inline void do_thp_numa_access(struct mm_struct *mm, struct vm_area_struct *vma, u64 vaddr, struct page *page)
From: Huang Ying ying.huang@intel.com
mainline inclusion
from mainline-v5.15-rc1
commit f00230ff8411eaecbea1f2e528e205424f3725ba
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFONL
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Before commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling"), the TLB flushing was done in do_huge_pmd_numa_page() itself via flush_tlb_range().
But after commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling"), the TLB flushing is done in migrate_pages() anyway, via the following code path:
  do_huge_pmd_numa_page
    migrate_misplaced_page
      migrate_pages
So now the TLB flushing code in do_huge_pmd_numa_page() has become unnecessary, and it is deleted in this patch to simplify the code. This is only a code cleanup; there's no visible performance difference.
The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is deleted too, because migrate_pages() takes care of that when the CPU TLB is flushed.
Link: https://lkml.kernel.org/r/20210720065529.716031-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Yang Shi shy828301@gmail.com Cc: Dan Carpenter dan.carpenter@oracle.com Cc: Mel Gorman mgorman@suse.de Cc: Christian Borntraeger borntraeger@de.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Hugh Dickins hughd@google.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Michal Hocko mhocko@suse.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Nanyong Sun sunnanyong@huawei.com --- mm/huge_memory.c | 26 -------------------------- 1 file changed, 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 72b4f10b822f..eb293d17a104 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1444,32 +1444,6 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) goto out; }
- /* - * Since we took the NUMA fault, we must have observed the !accessible - * bit. Make sure all other CPUs agree with that, to avoid them - * modifying the page we're about to migrate. - * - * Must be done under PTL such that we'll observe the relevant - * inc_tlb_flush_pending(). - * - * We are not sure a pending tlb flush here is for a huge page - * mapping or not. Hence use the tlb range variant - */ - if (mm_tlb_flush_pending(vma->vm_mm)) { - flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); - /* - * change_huge_pmd() released the pmd lock before - * invalidating the secondary MMUs sharing the primary - * MMU pagetables (with ->invalidate_range()). The - * mmu_notifier_invalidate_range_end() (which - * internally calls ->invalidate_range()) in - * change_pmd_range() will run after us, so we can't - * rely on it here and we need an explicit invalidate. - */ - mmu_notifier_invalidate_range(vma->vm_mm, haddr, - haddr + HPAGE_PMD_SIZE); - } - pmd = pmd_modify(oldpmd, vma->vm_page_prot); page = vm_normal_page_pmd(vma, haddr, pmd); if (!page)