MADV_DONTNEED is currently disabled for hugetlb mappings. This certainly makes sense in shared file mappings as the pagecache maintains a reference to the page and it will never be freed. However, it could be useful to unmap and free pages in private mappings.
Mike Kravetz (2): mm: enable MADV_DONTNEED for hugetlb mappings madvise: use zap_page_range_single for madvise dontneed
Rik van Riel (1): mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs
include/linux/mm.h | 2 ++ mm/madvise.c | 47 +++++++++++++++++++++++++++++++++++++++++----- mm/memory.c | 14 +++++++++++--- 3 files changed, 55 insertions(+), 8 deletions(-)
From: Mike Kravetz mike.kravetz@oracle.com
mainline inclusion from mainline-v5.18-rc1 commit 90e7e7f5ef3f712da992d505aee2d921797c3f96 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9GVYW CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Add hugetlb MADV_DONTNEED support", v3.
Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP testing. However, mremap support was recently added in commit 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma"). While attempting to enable mremap support in the test, it was discovered that the mremap test indirectly depends on MADV_DONTNEED.
madvise does not allow MADV_DONTNEED for hugetlb mappings. However, that is primarily due to the check in can_madv_lru_vma(). By simply removing the check and adding huge page alignment, MADV_DONTNEED can be made to work for hugetlb mappings.
Do note that there is no compelling use case for adding this support. This was discussed in the RFC [1]. However, adding support makes sense as it is fairly trivial and brings hugetlb functionality more in line with 'normal' memory.
After enabling support, add selftest for MADV_DONTNEED as well as MADV_REMOVE. Then update userfaultfd selftest.
If new functionality is accepted, then madvise man page will be updated to indicate hugetlb is supported. It will also be updated to clarify what happens to the passed length argument.
This patch (of 3):
MADV_DONTNEED is currently disabled for hugetlb mappings. This certainly makes sense in shared file mappings as the pagecache maintains a reference to the page and it will never be freed. However, it could be useful to unmap and free pages in private mappings. In addition, userfaultfd minor fault users may be able to simplify code by using MADV_DONTNEED.
The primary thing preventing MADV_DONTNEED from working on hugetlb mappings is a check in can_madv_lru_vma(). To allow support for hugetlb mappings create and use a new routine madvise_dontneed_free_valid_vma() that allows hugetlb mappings in this specific case.
For normal mappings, madvise requires the start address be PAGE aligned and rounds up length to the next multiple of PAGE_SIZE. Do similarly for hugetlb mappings: require start address be huge page size aligned and round up length to the next multiple of huge page size. Use the new madvise_dontneed_free_valid_vma routine to check alignment and round up length/end. zap_page_range requires this alignment for hugetlb vmas otherwise we will hit BUGs.
Link: https://lkml.kernel.org/r/20220215002348.128823-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220215002348.128823-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz mike.kravetz@oracle.com Cc: Naoya Horiguchi naoya.horiguchi@linux.dev Cc: David Hildenbrand david@redhat.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: Mina Almasry almasrymina@google.com Cc: Michal Hocko mhocko@suse.com Cc: Peter Xu peterx@redhat.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Shuah Khan skhan@linuxfoundation.org Cc: Mike Rapoport rppt@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/madvise.c
Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/madvise.c | 31 +++++++++++++++++++++++++++++-- 1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c index bd851f2c687f..d2b9f8ec770f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -502,6 +502,11 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, tlb_end_vma(tlb, vma); }
+static inline bool can_madv_lru_non_huge_vma(struct vm_area_struct *vma) +{ + return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP)); +} + static long madvise_cold(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start_addr, unsigned long end_addr) @@ -771,6 +776,23 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, return 0; }
+static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, + unsigned long start, + unsigned long *end, + int behavior) +{ + if (!is_vm_hugetlb_page(vma)) + return can_madv_lru_non_huge_vma(vma); + + if (behavior != MADV_DONTNEED) + return false; + if (start & ~huge_page_mask(hstate_vma(vma))) + return false; + + *end = ALIGN(*end, huge_page_size(hstate_vma(vma))); + return true; +} + static long madvise_dontneed_free(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, @@ -779,7 +801,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, struct mm_struct *mm = vma->vm_mm;
*prev = vma; - if (!can_madv_lru_vma(vma)) + if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior)) return -EINVAL;
if (!userfaultfd_remove(vma, start, end)) { @@ -801,7 +823,12 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, */ return -ENOMEM; } - if (!can_madv_lru_vma(vma)) + /* + * Potential end adjustment for hugetlb vma is OK as + * the check below keeps end within vma. + */ + if (!madvise_dontneed_free_valid_vma(vma, start, &end, + behavior)) return -EINVAL; if (end > vma->vm_end) { /*
From: Rik van Riel riel@surriel.com
mainline inclusion from mainline-v6.1-rc3 commit 8ebe0a5eaaeb099de03d09ad20f54ed962e2261e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9GVYW CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A common use case for hugetlbfs is for the application to create memory pools backed by huge pages, which then get handed over to some malloc library (eg. jemalloc) for further management.
That malloc library may be doing MADV_DONTNEED calls on memory that is no longer needed, expecting those calls to happen on PAGE_SIZE boundaries.
However, currently the MADV_DONTNEED code rounds up any such requests to HPAGE_PMD_SIZE boundaries. This leads to undesired outcomes when jemalloc expects a 4kB MADV_DONTNEED, but 2MB of memory get zeroed out, instead.
Use of pre-built shared libraries means that user code does not always know the page size of every memory arena in use.
Avoid unexpected data loss with MADV_DONTNEED by rounding up only to PAGE_SIZE (in do_madvise), and rounding down to huge page granularity.
That way programs will only get as much memory zeroed out as they requested.
Link: https://lkml.kernel.org/r/20221021192805.366ad573@imladris.surriel.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Rik van Riel riel@surriel.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: David Hildenbrand david@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/madvise.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index d2b9f8ec770f..e409e9d39d45 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -789,7 +789,14 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, if (start & ~huge_page_mask(hstate_vma(vma))) return false;
- *end = ALIGN(*end, huge_page_size(hstate_vma(vma))); + /* + * Madvise callers expect the length to be rounded up to PAGE_SIZE + * boundaries, and may be unaware that this VMA uses huge pages. + * Avoid unexpected data loss by rounding down the number of + * huge pages freed. + */ + *end = ALIGN_DOWN(*end, huge_page_size(hstate_vma(vma))); + return true; }
@@ -804,6 +811,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior)) return -EINVAL;
+ if (start == end) + return 0; + if (!userfaultfd_remove(vma, start, end)) { *prev = NULL; /* mmap_lock has been dropped, prev is stale */
From: Mike Kravetz mike.kravetz@oracle.com
mainline inclusion from mainline-v6.1-rc8 commit 21b85b09527c28e242db55c1b751f7f7549b830c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9GVYW CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This series addresses the issue first reported in [1], and fully described in patch 2. Patches 1 and 2 address the user visible issue and are tagged for stable backports.
While exploring solutions to this issue, related problems with mmu notification calls were discovered. This is addressed in the patch "hugetlb: remove duplicate mmu notifications:". Since there are no user visible effects, this third is not tagged for stable backports.
Previous discussions suggested further cleanup by removing the routine zap_page_range. This is possible because zap_page_range_single is now exported, and all callers of zap_page_range pass ranges entirely within a single vma. This work will be done in a later patch so as not to distract from this bug fix.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6Ju...
This patch (of 2):
Expose the routine zap_page_range_single to zap a range within a single vma. The madvise routine madvise_dontneed_single_vma can use this routine as it explicitly operates on a single vma. Also, update the mmu notification range in zap_page_range_single to take hugetlb pmd sharing into account. This is required as MADV_DONTNEED supports hugetlb vmas.
Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz mike.kravetz@oracle.com Reported-by: Wei Chen harperchen1110@gmail.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: David Hildenbrand david@redhat.com Cc: Matthew Wilcox willy@infradead.org Cc: Mina Almasry almasrymina@google.com Cc: Nadav Amit nadav.amit@gmail.com Cc: Naoya Horiguchi naoya.horiguchi@linux.dev Cc: Peter Xu peterx@redhat.com Cc: Rik van Riel riel@surriel.com Cc: Vlastimil Babka vbabka@suse.cz Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org
Conflicts: include/linux/mm.h mm/memory.c
Signed-off-by: Ze Zuo zuoze1@huawei.com --- include/linux/mm.h | 2 ++ mm/madvise.c | 6 +++--- mm/memory.c | 14 +++++++++++--- 3 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index ef3baa5941d7..350d5ce400e7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1696,6 +1696,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size); void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size); +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, + unsigned long size, struct zap_details *details); void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long start, unsigned long end);
diff --git a/mm/madvise.c b/mm/madvise.c index e409e9d39d45..885fc6e31f4f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -754,8 +754,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma, * Application no longer needs these pages. If the pages are dirty, * it's OK to just throw them away. The app will be more careful about * data it wants to keep. Be sure to free swap resources too. The - * zap_page_range call sets things up for shrink_active_list to actually free - * these pages later if no one else has touched them in the meantime, + * zap_page_range_single call sets things up for shrink_active_list to actually + * free these pages later if no one else has touched them in the meantime, * although we could add these pages to a global reuse list for * shrink_active_list to pick up before reclaiming other pages. * @@ -772,7 +772,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma, static long madvise_dontneed_single_vma(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - zap_page_range(vma, start, end - start); + zap_page_range_single(vma, start, end - start, NULL); return 0; }
diff --git a/mm/memory.c b/mm/memory.c index 68e92af0bfa5..d7a134a35a95 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1577,19 +1577,27 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, * * The range must fit into one VMA. */ -static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { + const unsigned long end = address + size; struct mmu_notifier_range range; struct mmu_gather tlb;
lru_add_drain(); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, - address, address + size); + address, end); + if (is_vm_hugetlb_page(vma)) + adjust_range_if_pmd_sharing_possible(vma, &range.start, + &range.end); tlb_gather_mmu(&tlb, vma->vm_mm, address, range.end); update_hiwater_rss(vma->vm_mm); mmu_notifier_invalidate_range_start(&range); - unmap_single_vma(&tlb, vma, address, range.end, details); + /* + * unmap 'address-end' not 'range.start-range.end' as range + * could have been expanded for hugetlb pmd sharing. + */ + unmap_single_vma(&tlb, vma, address, end, details); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb, address, range.end); }
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/6207 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/6207 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...