[PATCH OLK-6.6 00/14] mm: add huge pfnmap support for remap_pfn_range()
mm: add huge pfnmap support for remap_pfn_range()

Overview
========

This patch series adds huge page support for remap_pfn_range(),
automatically creating huge mappings when prerequisites are satisfied
(size, alignment, architecture support, etc.) and falling back to normal
page mappings otherwise.

This work builds on Peter Xu's previous efforts on huge pfnmap support [0].

TODO
====

- Add PUD-level huge page support. Currently, only PMD-level huge pages
  are supported.
- Consider the logic related to vmap_page_range and extract reusable
  common code.

Tests Done
==========

- Cross-build tests.

- Performance tests with a custom device driver implementing mmap() with
  remap_pfn_range():

  - lat_mem_rd benchmark modified to use mmap(device_fd) instead of
    malloc() shows around 40% improvement in memory access latency with
    huge page support compared to normal page mappings.

    numactl -C 0 lat_mem_rd -t 4096M (stride=64)

    Memory Size (MB)  Without Huge Mapping  With Huge Mapping  Improvement
    ----------------  --------------------  -----------------  -----------
    64.00             148.858 ns            100.780 ns         32.3%
    128.00            164.745 ns            103.537 ns         37.2%
    256.00            169.907 ns            103.179 ns         39.3%
    512.00            171.285 ns            103.072 ns         39.8%
    1024.00           173.054 ns            103.055 ns         40.4%
    2048.00           172.820 ns            103.091 ns         40.3%
    4096.00           172.877 ns            103.115 ns         40.4%

  - Custom memory copy operations on mmap(device_fd) show around 18%
    performance improvement with huge page support compared to normal
    page mappings.

    numactl -C 0 memcpy_test (memory copy performance test)

    Memory Size (MB)  Without Huge Mapping  With Huge Mapping  Improvement
    ----------------  --------------------  -----------------  -----------
    1024.00           95.76 ms              77.91 ms           18.6%
    2048.00           190.87 ms             155.64 ms          18.5%
    4096.00           380.84 ms             311.45 ms          18.2%

[0] https://lore.kernel.org/all/20240826204353.2228736-2-peterx@redhat.com/T/#u

David Hildenbrand (1):
  mm/huge_memory: check pmd_special() only after pmd_present()

Peter Xu (11):
  mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  mm: drop is_huge_zero_pud()
  mm: mark special bits for huge pfn mappings when inject
  mm: allow THP orders for PFNMAPs
  mm/gup: detect huge pfnmap entries in gup-fast
  mm/pagewalk: check pfnmap for folio_walk_start()
  mm/fork: accept huge pfnmap entries
  mm: always define pxx_pgprot()
  mm/x86: support large pfn mappings
  mm/arm64: support large pfn mappings
  arm64: mm: Drop dead code for pud special bit handling

Yin Tirui (2):
  pgtable: add pte_clrhuge() implementation
  mm: introduce remap_pfn_range_try_pmd() for PMD-level hugepage mapping

 arch/arm/include/asm/pgtable-3level.h        |   1 +
 arch/arm64/Kconfig                           |   1 +
 arch/arm64/include/asm/pgtable.h             |  38 ++++++
 arch/loongarch/include/asm/pgtable.h         |   6 +
 arch/mips/include/asm/pgtable.h              |   6 +
 arch/powerpc/include/asm/book3s/32/pgtable.h |   5 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |   7 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   7 ++
 arch/powerpc/include/asm/nohash/pgtable.h    |   7 ++
 arch/powerpc/include/asm/pgtable.h           |   1 +
 arch/riscv/include/asm/pgtable.h             |   5 +
 arch/s390/include/asm/pgtable.h              |   6 +
 arch/sparc/include/asm/pgtable_64.h          |   6 +
 arch/sw_64/include/asm/pgtable.h             |   6 +
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  80 +++++++-----
 fs/dax.c                                     |   2 +-
 include/linux/huge_mm.h                      |  16 +--
 include/linux/mm.h                           |  26 ++++
 include/linux/pgtable.h                      |  14 ++-
 mm/Kconfig                                   |  13 ++
 mm/gup.c                                     |   6 +
 mm/huge_memory.c                             |  93 ++++++++----
 mm/memory.c                                  | 126 +++++++++++++----
 24 files changed, 383 insertions(+), 96 deletions(-)

--
2.43.0
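For illustration, a minimal sketch of the kind of driver mmap() handler this
series targets is shown below. The "struct mydev" and its fields are
hypothetical; remap_pfn_range() and pgprot_noncached() are existing kernel
APIs. With this series applied, a request whose VA, PFN and size are all
PMD-aligned is mapped with huge entries automatically; everything else keeps
falling back to normal PTE mappings as before.

/*
 * Hedged sketch of a driver mmap() handler that benefits from this series.
 * "struct mydev" and its fields are made up for illustration only.
 */
#include <linux/fs.h>
#include <linux/mm.h>

struct mydev {
	phys_addr_t bar_base;		/* hypothetical MMIO BAR base */
	resource_size_t bar_size;	/* hypothetical BAR size */
};

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mydev *dev = file->private_data;
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > dev->bar_size)
		return -EINVAL;

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	/*
	 * Today this creates PTE mappings only; with this series, PMD
	 * mappings are created automatically when VA/PFN/size align.
	 */
	return remap_pfn_range(vma, vma->vm_start,
			       dev->bar_base >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}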
From: Peter Xu <peterx@redhat.com>

mainline inclusion
from mainline-v6.12-rc1
commit 6857be5fecaebd9773ff27b6d29b6fff3b1abbce
category: feature
bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...

--------------------------------

Patch series "mm: Support huge pfnmaps", v2.

Overview
========

This series implements huge pfnmap support for mm in general.  Huge
pfnmap allows e.g. VM_PFNMAP vmas to map at either PMD or PUD level,
similar to what we already do with dax / thp / hugetlb, so as to benefit
from TLB hits.  Now we extend that idea to PFN mappings, e.g. PCI MMIO
BARs, which can grow as large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
patch (from Alex Williamson) will be the first user of huge pfnmap, so as
to enable the vfio-pci driver to fault in huge pfn mappings.

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings, because of PFNMAP's specialty of having no
vmemmap backing it, so most of the kernel routines on huge mappings
should simply already fail for them, like GUP or the old-school
follow_page() (which was recently rewritten into the folio_walk* APIs by
David).

One catch is that PUD handling is still immature in generic paths here
and there, as DAX is so far the only user.  This patchset will add the
2nd user.  Hugetlb can be a 3rd user if the hugetlb unification work goes
smoothly, but that is to be discussed later.

The other catch is how to let gup-fast work for such huge mappings even
though there's no direct way of knowing whether it's a normal page or an
MMIO mapping.  This series keeps the pte_special solution, reusing the
same idea of setting a special bit on pfnmap PMDs/PUDs so that gup-fast
will be able to identify them and fail properly.

Along the way, we'll also notice that the major pgtable pfn walker, aka
follow_pte(), will need to retire soon due to the fact that it only works
with ptes.  A new set of simple APIs is introduced (the follow_pfnmap*
API) to do whatever follow_pte() can already do, and additionally to
process huge pfnmaps.  Half of this series is about that, converting all
existing pfnmap walkers to use the new API properly.  Hopefully the new
API also looks better by not exposing e.g. pgtable lock details to the
callers, so that it can be used in an even more straightforward way.

Here, three more options will be introduced and involved in huge pfnmap:

  - ARCH_SUPPORTS_HUGE_PFNMAP

    Arch developers will need to select this option in the arch's Kconfig
    when huge pfnmap is supported.  After this patchset is applied, both
    x86_64 and arm64 will start to enable it by default.

  - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

    These options are for driver developers to identify whether the
    current arch / config supports huge pfnmaps, and to decide whether
    the huge pfnmap APIs can be used to inject them.  One can refer to
    the last vfio-pci patch from Alex for their proper use in a device
    driver.
So after the whole set is applied, and if one enables some dynamic debug
lines in the vfio-pci core files, we should observe things like:

  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it says that vfio-pci faults in PMDs properly for a
few BAR 0 offsets.

Patch Layout
============

Patch 1:     Introduce the new options mentioned above for huge PFNMAPs
Patch 2:     A tiny cleanup
Patch 3-8:   Preparation patches for huge pfnmap (including the special
             bit for pmd/pud)
Patch 9-16:  Introduce follow_pfnmap*() API, use it everywhere, and then
             drop the follow_pte() API
Patch 17:    Add huge pfnmap support for x86_64
Patch 18:    Add huge pfnmap support for arm64
Patch 19:    Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
to be a plan to support arm64 1G later on top of this series [2].

Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs.  With the new option,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements match on the VAs.

When the support is there, it should be able to silently benefit all
drivers that use remap_pfn_range() in their mmap() handlers, with a
better TLB hit rate and overall faster MMIO accesses, similar to what the
processor gains from hugepages.

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied.  Besides the generic remap_pfn_range() optimization above, a
device driver can also try to optimize its mmap() towards a better VA
alignment for either PMD or PUD sizes.  This may, IIUC, normally require
userspace changes, as the driver doesn't normally decide the VA used to
map a BAR.  But I don't think I know all the drivers well enough to see
the full picture.

Credits all go to Alex for help testing the GPU/NIC use cases above.

[0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
[2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com

This patch (of 19):

Introduce the option to support the special bit in pmd/pud entries.
Archs can start to define pmd_special / pud_special when supported, by
selecting the new option.  Per-arch support will be added later.

Before that, create fallbacks for these helpers so that they are always
available.
Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/mm.h [Context conflicts.] Signed-off-by: Yin Tirui <yintirui@huawei.com> --- include/linux/mm.h | 24 ++++++++++++++++++++++++ mm/Kconfig | 13 +++++++++++++ 2 files changed, 37 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index e364a2846a78..f0efb11b915d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2779,6 +2779,30 @@ static inline pte_t pte_mkspecial(pte_t pte) } #endif +#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +static inline bool pmd_special(pmd_t pmd) +{ + return false; +} + +static inline pmd_t pmd_mkspecial(pmd_t pmd) +{ + return pmd; +} +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */ + +#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP +static inline bool pud_special(pud_t pud) +{ + return false; +} + +static inline pud_t pud_mkspecial(pud_t pud) +{ + return pud; +} +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */ + #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pte_devmap(pte_t pte) { diff --git a/mm/Kconfig b/mm/Kconfig index 5972d143fb2b..4eb0642b71e5 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -903,6 +903,19 @@ endif # TRANSPARENT_HUGEPAGE config PGTABLE_HAS_HUGE_LEAVES def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE +# TODO: Allow to be enabled without THP +config ARCH_SUPPORTS_HUGE_PFNMAP + def_bool n + depends on TRANSPARENT_HUGEPAGE + +config ARCH_SUPPORTS_PMD_PFNMAP + def_bool y + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE + +config ARCH_SUPPORTS_PUD_PFNMAP + def_bool y + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD + # # UP and nommu archs use km based percpu allocator # -- 2.43.0
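As a hedged driver-side sketch of how the new Kconfig symbols are meant to be
consumed (the helper below is illustrative, not a kernel API): only attempt a
PMD-level pfnmap when the architecture selected ARCH_SUPPORTS_PMD_PFNMAP and
the VA/PFN pair is co-aligned to PMD_SIZE.

#include <linux/mm.h>

/* Illustrative helper, not a kernel API. */
static bool can_use_pmd_pfnmap(struct vm_area_struct *vma,
			       unsigned long pfn, unsigned long size)
{
	if (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP))
		return false;
	if (size < PMD_SIZE)
		return false;
	/* VA and PA must be co-aligned for a huge leaf */
	return !((vma->vm_start | (pfn << PAGE_SHIFT)) & (PMD_SIZE - 1));
}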
From: Peter Xu <peterx@redhat.com>

mainline inclusion
from mainline-v6.12-rc1
commit ef713ec3a566d3e5e011c5d6201eb661ebf94c1f
category: feature
bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...

--------------------------------

is_huge_zero_pud() has constantly returned false since 2017.  One
assertion was added in 2019, but it should never have triggered; in other
words, what is checked there should be asserted instead.  If the huge
zero pud hasn't existed for 7 years, it is better to remove the helper
now and only add it back when it is actually needed.

Link: https://lkml.kernel.org/r/20240826204353.2228736-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/huge_mm.h
[Context conflicts.]
Signed-off-by: Yin Tirui <yintirui@huawei.com> --- include/linux/huge_mm.h | 10 ---------- mm/huge_memory.c | 13 +------------ 2 files changed, 1 insertion(+), 22 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index cfe42c43b55b..ef0a514a9b65 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -453,11 +453,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd) return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd); } -static inline bool is_huge_zero_pud(pud_t pud) -{ - return false; -} - struct page *mm_get_huge_zero_page(struct mm_struct *mm); void mm_put_huge_zero_page(struct mm_struct *mm); @@ -594,11 +589,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd) return false; } -static inline bool is_huge_zero_pud(pud_t pud) -{ - return false; -} - static inline void mm_put_huge_zero_page(struct mm_struct *mm) { return; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a28dda799978..e92fa0ea0555 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1659,10 +1659,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, ptl = pud_lock(mm, pud); if (!pud_none(*pud)) { if (write) { - if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) { - WARN_ON_ONCE(!is_huge_zero_pud(*pud)); + if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn))) goto out_unlock; - } entry = pud_mkyoung(*pud); entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma); if (pudp_set_access_flags(vma, addr, pud, entry, 1)) @@ -1954,15 +1952,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud))) goto out_unlock; - /* - * When page table lock is held, the huge zero pud should not be - * under splitting since we don't split the page itself, only pud to - * a page table. - */ - if (is_huge_zero_pud(pud)) { - /* No huge zero pud yet */ - } - /* * TODO: once we support anonymous pages, use * folio_try_dup_anon_rmap_*() and split if duplicating fails. -- 2.43.0
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 3c8e44c9b369b3d422516b3f2bf47a6e3c61d1ea category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- We need these special bits to be around on pfnmaps. Mark properly for !devmap case, reflecting that there's no page struct backing the entry. Link: https://lkml.kernel.org/r/20240826204353.2228736-4-peterx@redhat.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yin Tirui <yintirui@huawei.com> --- mm/huge_memory.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e92fa0ea0555..83a205344606 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1576,6 +1576,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, entry = pmd_mkhuge(pfn_t_pmd(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pmd_mkdevmap(entry); + else + entry = pmd_mkspecial(entry); if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); @@ -1672,6 +1674,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, entry = pud_mkhuge(pfn_t_pud(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pud_mkdevmap(entry); + else + entry = pud_mkspecial(entry); if (write) { entry = pud_mkyoung(pud_mkdirty(entry)); entry = maybe_pud_mkwrite(entry, vma); -- 2.43.0
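A condensed restatement of the insertion rule above (illustrative, not the
exact kernel code): a pfn-based huge leaf either has struct pages behind it
(devmap), or it does not, in which case the new special bit records that
fact for later page-table walkers such as gup-fast.

#include <linux/pfn_t.h>
#include <linux/mm.h>

/* Illustrative helper, simplified from insert_pfn_pmd(). */
static pmd_t mk_pfnmap_pmd(pfn_t pfn, pgprot_t prot)
{
	pmd_t entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));

	if (pfn_t_devmap(pfn))
		entry = pmd_mkdevmap(entry);	/* struct pages exist */
	else
		entry = pmd_mkspecial(entry);	/* pure pfnmap, no folio */

	return entry;
}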
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 5dd40721f147e83733ad34848330913cb633046e category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- This enables PFNMAPs to be mapped at either pmd/pud layers. Generalize the dax case into vma_is_special_huge() so as to cover both. Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL. Link: https://lkml.kernel.org/r/20240826204353.2228736-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yin Tirui <yintirui@huawei.com> --- include/linux/huge_mm.h | 6 +++--- mm/huge_memory.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ef0a514a9b65..31d8dde1eef2 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -82,9 +82,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; /* * Mask of all large folio orders supported for file THP. Folios in a DAX * file is never split and the MAX_PAGECACHE_ORDER limit does not apply to - * it. + * it. Same to PFNMAPs where there's neither page* nor pagecache. */ -#define THP_ORDERS_ALL_FILE_DAX \ +#define THP_ORDERS_ALL_SPECIAL \ (BIT(PMD_ORDER) | BIT(PUD_ORDER)) #define THP_ORDERS_ALL_FILE_DEFAULT \ ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0)) @@ -93,7 +93,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; * Mask of all large folio orders supported for THP. */ #define THP_ORDERS_ALL \ - (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT) + (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT) #define TVA_SMAPS (1 << 0) /* Will be used for procfs */ #define TVA_IN_PF (1 << 1) /* Page fault handler */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 83a205344606..afa7cbb6da21 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -97,8 +97,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, /* Check the intersection of requested and supported orders. */ if (vma_is_anonymous(vma)) supported_orders = THP_ORDERS_ALL_ANON; - else if (vma_is_dax(vma)) - supported_orders = THP_ORDERS_ALL_FILE_DAX; + else if (vma_is_special_huge(vma)) + supported_orders = THP_ORDERS_ALL_SPECIAL; else supported_orders = THP_ORDERS_ALL_FILE_DEFAULT; -- 2.43.0
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit ae3c99e650da4a8f4deb3670c29059de375a88be category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- Since gup-fast doesn't have the vma reference, teach it to detect such huge pfnmaps by checking the special bit for pmd/pud too, just like ptes. Link: https://lkml.kernel.org/r/20240826204353.2228736-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/gup.c [Context conflicts.] Signed-off-by: Yin Tirui <yintirui@huawei.com> --- mm/gup.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index e38d3bd4d3f9..20b918ed1b4a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2883,6 +2883,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (!pmd_access_permitted(orig, flags & FOLL_WRITE)) return 0; + if (pmd_special(orig)) + return 0; + if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; @@ -2927,6 +2930,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, if (!pud_access_permitted(orig, flags & FOLL_WRITE)) return 0; + if (pud_special(orig)) + return 0; + if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; -- 2.43.0
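A simplified sketch of the check added to gup_huge_pmd() (illustrative, not
the exact kernel code): without a vma to consult, the special bit is the only
hint that a huge leaf has no struct page behind it, so fast GUP bails out and
lets the slow path handle, and ultimately reject, the pfnmap.

/* Illustrative helper, simplified from gup_huge_pmd(). */
static int can_fast_gup_pmd(pmd_t orig, unsigned int flags)
{
	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
		return 0;
	if (pmd_special(orig))		/* huge pfnmap: nothing to pin */
		return 0;
	return 1;
}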
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 10d83d7781a8a6ff02bafd172c1ab183b27f8d5a category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- Teach folio_walk_start() to recognize special pmd/pud mappings, and fail them properly as it means there's no folio backing them. [peterx@redhat.com: remove some stale comments, per David] Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/pagewalk.c [Context conflicts.] Signed-off-by: Yin Tirui <yintirui@huawei.com> --- mm/memory.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index cf573f8f7764..a590e0c80378 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -674,11 +674,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, { unsigned long pfn = pmd_pfn(pmd); - /* - * There is no pmd_special() but there may be special pmds, e.g. - * in a direct-access (dax) mapping, so let's just replicate the - * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here. - */ + /* Currently it's only used for huge pfnmaps */ + if (unlikely(pmd_special(pmd))) + return NULL; + if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { if (!pfn_valid(pfn)) -- 2.43.0
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit bc02afbd4d73c4424ea12a0c35fa96e27172e8cb category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is much easier, the write bit needs to be persisted though for writable and shared pud mappings like PFNMAP ones, otherwise a follow up write in either parent or child process will trigger a write fault. Do the same for pmd level. Link: https://lkml.kernel.org/r/20240826204353.2228736-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yin Tirui <yintirui@huawei.com> --- mm/huge_memory.c | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index afa7cbb6da21..a7381323b826 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1789,6 +1789,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pgtable_t pgtable = NULL; int ret = -ENOMEM; + pmd = pmdp_get_lockless(src_pmd); + if (unlikely(pmd_special(pmd))) { + dst_ptl = pmd_lock(dst_mm, dst_pmd); + src_ptl = pmd_lockptr(src_mm, src_pmd); + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); + /* + * No need to recheck the pmd, it can't change with write + * mmap lock held here. + * + * Meanwhile, making sure it's not a CoW VMA with writable + * mapping, otherwise it means either the anon page wrongly + * applied special bit, or we made the PRIVATE mapping be + * able to wrongly write to the backend MMIO. + */ + VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)); + goto set_pmd; + } + /* Skip if can be re-fill on fault */ if (!vma_is_anonymous(dst_vma)) return 0; @@ -1871,7 +1889,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmdp_set_wrprotect(src_mm, addr, src_pmd); if (!userfaultfd_wp(dst_vma)) pmd = pmd_clear_uffd_wp(pmd); - pmd = pmd_mkold(pmd_wrprotect(pmd)); + pmd = pmd_wrprotect(pmd); +set_pmd: + pmd = pmd_mkold(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); ret = 0; @@ -1960,8 +1980,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, * TODO: once we support anonymous pages, use * folio_try_dup_anon_rmap_*() and split if duplicating fails. 
*/ - pudp_set_wrprotect(src_mm, addr, src_pud); - pud = pud_mkold(pud_wrprotect(pud)); + if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) { + pudp_set_wrprotect(src_mm, addr, src_pud); + pud = pud_wrprotect(pud); + } + pud = pud_mkold(pud); set_pud_at(dst_mm, addr, dst_pud, pud); ret = 0; -- 2.43.0
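The copy-time rule can be summarized with the following hedged sketch
(simplified from the PUD path above; the real code also write-protects the
source entry via pudp_set_wrprotect()): only CoW mappings get write-protected
at fork, while shared writable pfnmaps keep the write bit so the child does
not take a spurious write fault on first access.

/* Illustrative helper, not the exact kernel code. */
static pud_t fork_copy_pfnmap_pud(struct vm_area_struct *vma, pud_t pud)
{
	if (is_cow_mapping(vma->vm_flags) && pud_write(pud))
		pud = pud_wrprotect(pud);	/* CoW: fault on next write */

	return pud_mkold(pud);			/* shared pfnmap keeps write bit */
}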
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 0515e022e167cfacf1fee092eb93aa9514e23c0a category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- There're: - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that support pte_pgprot(). - 2 archs (x86, sparc) that support pmd_pgprot(). - 1 arch (x86) that support pud_pgprot(). Always define them to be used in generic code, and then we don't need to fiddle with "#ifdef"s when doing so. Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yin Tirui <yintirui@huawei.com> --- arch/arm64/include/asm/pgtable.h | 1 + arch/powerpc/include/asm/pgtable.h | 1 + arch/s390/include/asm/pgtable.h | 1 + arch/sparc/include/asm/pgtable_64.h | 1 + include/linux/pgtable.h | 12 ++++++++++++ 5 files changed, 16 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 2b66aab73dbc..7f945403a041 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -347,6 +347,7 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) /* * Select all bits except the pfn */ +#define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { unsigned long pfn = pte_pfn(pte); diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index db2fe941e4c8..c395ca4e7696 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -65,6 +65,7 @@ static inline unsigned long pte_pfn(pte_t pte) /* * Select all bits except the pfn */ +#define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { unsigned long pte_flags; diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 93b90757d226..1f4a630f22b5 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -912,6 +912,7 @@ static inline int pte_unused(pte_t pte) * young/old accounting is not supported, i.e _PAGE_PROTECT and _PAGE_INVALID * must not be set. 
*/ +#define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { unsigned long pte_flags = pte_val(pte) & _PAGE_CHG_MASK; diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index be9bcc50e4cb..5877be1af301 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -782,6 +782,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) return __pmd(pte_val(pte)); } +#define pmd_pgprot pmd_pgprot static inline pgprot_t pmd_pgprot(pmd_t entry) { unsigned long val = pmd_val(entry); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e33015ae2c4d..b69f295b5f91 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1943,6 +1943,18 @@ typedef unsigned int pgtbl_mod_mask; #define MAX_PTRS_PER_P4D PTRS_PER_P4D #endif +#ifndef pte_pgprot +#define pte_pgprot(x) ((pgprot_t) {0}) +#endif + +#ifndef pmd_pgprot +#define pmd_pgprot(x) ((pgprot_t) {0}) +#endif + +#ifndef pud_pgprot +#define pud_pgprot(x) ((pgprot_t) {0}) +#endif + /* description of effects of mapping type and prot in current implementation. * this is due to the limited x86 page protection hardware. The expected * behavior is in parens: -- 2.43.0
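The reason generic code wants pmd_pgprot()/pud_pgprot() can be shown with a
short hedged sketch: given a huge leaf, its protection bits can be carried
over when deriving smaller entries, which is essentially what the split path
later in this series does.  Note the generic fallbacks return an empty
pgprot, so callers like this are expected to run only on configs that provide
the real per-arch helpers.

/* Illustrative helper, not a kernel API. */
static pte_t pte_from_huge_pmd(pmd_t pmd)
{
	return pfn_pte(pmd_pfn(pmd), pmd_pgprot(pmd));
}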
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 75182022a0439788415b2dd1db3086e07aa506f7 category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- Helpers to install and detect special pmd/pud entries. In short, bit 9 on x86 is not used for pmd/pud, so we can directly define them the same as the pte level. One note is that it's also used in _PAGE_BIT_CPA_TEST but that is only used in the debug test, and shouldn't conflict in this case. One note is that pxx_set|clear_flags() for pmd/pud will need to be moved upper so that they can be referenced by the new special bit helpers. There's no change in the code that was moved. Link: https://lkml.kernel.org/r/20240826204353.2228736-18-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: arch/x86/Kconfig [Context conflicts.] 
Signed-off-by: Yin Tirui <yintirui@huawei.com> --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 80 ++++++++++++++++++++++------------ 2 files changed, 53 insertions(+), 28 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 05225ce10c34..d97109f9d034 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -28,6 +28,7 @@ config X86_64 select ARCH_HAS_GIGANTIC_PAGE select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select ARCH_USE_CMPXCHG_LOCKREF select HAVE_ARCH_SOFT_DIRTY select MODULES_USE_ELF_RELA diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 993d49cd379a..9ef5315c931e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -121,6 +121,34 @@ extern pmdval_t early_pmd_flags; #define arch_end_context_switch(prev) do {} while(0) #endif /* CONFIG_PARAVIRT_XXL */ +static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) +{ + pmdval_t v = native_pmd_val(pmd); + + return native_make_pmd(v | set); +} + +static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) +{ + pmdval_t v = native_pmd_val(pmd); + + return native_make_pmd(v & ~clear); +} + +static inline pud_t pud_set_flags(pud_t pud, pudval_t set) +{ + pudval_t v = native_pud_val(pud); + + return native_make_pud(v | set); +} + +static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear) +{ + pudval_t v = native_pud_val(pud); + + return native_make_pud(v & ~clear); +} + /* * The following only work if pte_present() is true. * Undefined behaviour if not.. @@ -304,6 +332,30 @@ static inline int pud_devmap(pud_t pud) } #endif +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +static inline bool pmd_special(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SPECIAL; +} + +static inline pmd_t pmd_mkspecial(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SPECIAL); +} +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */ + +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP +static inline bool pud_special(pud_t pud) +{ + return pud_flags(pud) & _PAGE_SPECIAL; +} + +static inline pud_t pud_mkspecial(pud_t pud) +{ + return pud_set_flags(pud, _PAGE_SPECIAL); +} +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */ + static inline int pgd_devmap(pgd_t pgd) { return 0; @@ -474,20 +526,6 @@ static inline pte_t pte_mkdevmap(pte_t pte) return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); } -static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) -{ - pmdval_t v = native_pmd_val(pmd); - - return native_make_pmd(v | set); -} - -static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) -{ - pmdval_t v = native_pmd_val(pmd); - - return native_make_pmd(v & ~clear); -} - /* See comments above mksaveddirty_shift() */ static inline pmd_t pmd_mksaveddirty(pmd_t pmd) { @@ -582,20 +620,6 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); #define pmd_mkwrite pmd_mkwrite -static inline pud_t pud_set_flags(pud_t pud, pudval_t set) -{ - pudval_t v = native_pud_val(pud); - - return native_make_pud(v | set); -} - -static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear) -{ - pudval_t v = native_pud_val(pud); - - return native_make_pud(v & ~clear); -} - /* See comments above mksaveddirty_shift() */ static inline pud_t pud_mksaveddirty(pud_t pud) { -- 2.43.0
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 3e509c9b03f9abc7804c80bed266a6cc4286a5a8 category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on pmds/puds. Provide the pmd/pud helpers to set/get special bit. There's one more thing missing for arm64 which is the pxx_pgprot() for pmd/pud. Add them too, which is mostly the same as the pte version by dropping the pfn field. These helpers are essential to be used in the new follow_pfnmap*() API to report valid pgprot_t results. Note that arm64 doesn't yet support huge PUD yet, but it's still straightforward to provide the pud helpers that we need altogether. Only PMD helpers will make an immediate benefit until arm64 will support huge PUDs first in general (e.g. in THPs). Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: arch/arm64/include/asm/pgtable.h [Context conflicts.] 
Signed-off-by: Yin Tirui <yintirui@huawei.com> --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/pgtable.h | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 95974b69e202..c3b38c890b45 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -109,6 +109,7 @@ config ARM64 select ARCH_SUPPORTS_SCHED_SOFT_QUOTA select ARCH_SUPPORTS_PAGE_TABLE_CHECK select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7f945403a041..3ab1bf347951 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -515,6 +515,14 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP))); } +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +#define pmd_special(pte) (!!((pmd_val(pte) & PTE_SPECIAL))) +static inline pmd_t pmd_mkspecial(pmd_t pmd) +{ + return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL)); +} +#endif + #define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd)) #define __phys_to_pmd_val(phys) __phys_to_pte_val(phys) #define pmd_pfn(pmd) ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT) @@ -532,6 +540,27 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) #define pud_pfn(pud) ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT) #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)) +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP +#define pud_special(pte) pte_special(pud_pte(pud)) +#define pud_mkspecial(pte) pte_pud(pte_mkspecial(pud_pte(pud))) +#endif + +#define pmd_pgprot pmd_pgprot +static inline pgprot_t pmd_pgprot(pmd_t pmd) +{ + unsigned long pfn = pmd_pfn(pmd); + + return __pgprot(pmd_val(pfn_pmd(pfn, __pgprot(0))) ^ pmd_val(pmd)); +} + +#define pud_pgprot pud_pgprot +static inline pgprot_t pud_pgprot(pud_t pud) +{ + unsigned long pfn = pud_pfn(pud); + + return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud)); +} + static inline void __set_pte_at(struct mm_struct *mm, unsigned long __always_unused addr, pte_t *ptep, pte_t pte, unsigned int nr) -- 2.43.0
From: David Hildenbrand <david@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit 47fa30118f02dc50e1c57242c6b72542c871b178 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- We should only check for pmd_special() after we made sure that we have a present PMD. For example, if we have a migration PMD, pmd_special() might indicate that we have a special PMD although we really don't. This fixes confusing migration entries as PFN mappings, and not doing what we are supposed to do in the "is_swap_pmd()" case further down in the function -- including messing up COW, page table handling and accounting. Link: https://lkml.kernel.org/r/20240926154234.2247217-1-david@redhat.com Fixes: bc02afbd4d73 ("mm/fork: accept huge pfnmap entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+bf2c35fa302ebe3c7471@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/66f15c8d.050a0220.c23dd.000f.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yin Tirui <yintirui@huawei.com> --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a7381323b826..140dbdaa99f2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1790,7 +1790,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, int ret = -ENOMEM; pmd = pmdp_get_lockless(src_pmd); - if (unlikely(pmd_special(pmd))) { + if (unlikely(pmd_present(pmd) && pmd_special(pmd))) { dst_ptl = pmd_lock(dst_mm, dst_pmd); src_ptl = pmd_lockptr(src_mm, src_pmd); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); -- 2.43.0
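The rule this fix enforces can be captured in a small hedged helper
(illustrative, not a kernel API):

static bool pmd_is_huge_pfnmap(pmd_t pmd)
{
	/*
	 * A non-present PMD (e.g. a migration entry) encodes a swap
	 * entry, so its bits must not be interpreted as hardware flags
	 * such as the special bit; the presence check must come first.
	 */
	return pmd_present(pmd) && pmd_special(pmd);
}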
From: Peter Xu <peterx@redhat.com> mainline inclusion from mainline-v6.15-rc1 commit 0fff2aa96f6be6d33b584d73b16d3672fd30fd5c category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- Keith Busch observed some incorrect macros defined in arm64 code [1]. It turns out the two lines should never be needed and won't be exposed to anyone, because aarch64 doesn't select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD, hence ARCH_SUPPORTS_PUD_PFNMAP is always N. The only archs that support THP PUDs so far are x86 and powerpc. Instead of fixing the lines (with no way to test it..), remove the two lines that are in reality dead code, to avoid confusing readers. Fixes tag is attached to reflect where the wrong macros were introduced, but explicitly not copying stable, because there's no real issue to be fixed. So it's only about removing the dead code so far. [1] https://lore.kernel.org/all/Z9tDjOk-JdV_fCY4@kbusch-mbp.dhcp.thefacebook.com... Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Will Deacon <will@kernel.org> Fixes: 3e509c9b03f9 ("mm/arm64: support large pfn mappings") Reported-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Donald Dutile <ddutile@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20250320183405.12659-1-peterx@redhat.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Conflicts: arch/arm64/include/asm/pgtable.h [Context conflicts.] Signed-off-by: Yin Tirui <yintirui@huawei.com> --- arch/arm64/include/asm/pgtable.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 3ab1bf347951..38c9d78b639c 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -540,11 +540,6 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd) #define pud_pfn(pud) ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT) #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)) -#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP -#define pud_special(pte) pte_special(pud_pte(pud)) -#define pud_mkspecial(pte) pte_pud(pte_mkspecial(pud_pte(pud))) -#endif - #define pmd_pgprot pmd_pgprot static inline pgprot_t pmd_pgprot(pmd_t pmd) { -- 2.43.0
hulk inclusion category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 ---------------------------------------- Add pte_clrhuge() helper function for architectures that support transparent huge pages to convert a huge PTE entry to regular PTE. This function provides the inverse operation of pte_mkhuge() and is needed when splitting huge page mappings, where individual PTE entries are created from the original huge page mapping. Signed-off-by: Yin Tirui <yintirui@huawei.com> --- arch/arm/include/asm/pgtable-3level.h | 1 + arch/arm64/include/asm/pgtable.h | 8 ++++++++ arch/loongarch/include/asm/pgtable.h | 6 ++++++ arch/mips/include/asm/pgtable.h | 6 ++++++ arch/powerpc/include/asm/book3s/32/pgtable.h | 5 +++++ arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++++ arch/powerpc/include/asm/nohash/32/pte-8xx.h | 7 +++++++ arch/powerpc/include/asm/nohash/pgtable.h | 7 +++++++ arch/riscv/include/asm/pgtable.h | 5 +++++ arch/s390/include/asm/pgtable.h | 5 +++++ arch/sparc/include/asm/pgtable_64.h | 5 +++++ arch/sw_64/include/asm/pgtable.h | 6 ++++++ 12 files changed, 66 insertions(+) diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 71c3add6417f..661b16d93fde 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -173,6 +173,7 @@ static inline pmd_t *pud_pgtable(pud_t pud) #define pte_huge(pte) (pte_val(pte) && !(pte_val(pte) & PTE_TABLE_BIT)) #define pte_mkhuge(pte) (__pte(pte_val(pte) & ~PTE_TABLE_BIT)) +#define pte_clrhuge(pte) (__pte(pte_val(pte) | PTE_TABLE_BIT)) #define pmd_isset(pmd, val) ((u32)(val) == (val) ? pmd_val(pmd) & (val) \ : !!(pmd_val(pmd) & (val))) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 38c9d78b639c..d8f8a804375b 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -249,6 +249,14 @@ static inline pte_t pte_mkpresent(pte_t pte) return set_pte_bit(pte, __pgprot(PTE_VALID)); } +static inline pte_t pte_clrhuge(pte_t pte) +{ + pteval_t mask = PTE_TYPE_MASK & ~PTE_VALID; + pteval_t val = PTE_TYPE_PAGE & ~PTE_VALID; + + return __pte((pte_val(pte) & ~mask) | val); +} + static inline pmd_t pmd_mkcont(pmd_t pmd) { return __pmd(pmd_val(pmd) | PMD_SECT_CONT); diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index 4f4498ce2255..28fda616206a 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -405,6 +405,12 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_clrhuge(pte_t pte) +{ + pte_val(pte) &= ~_PAGE_HUGE; + return pte; +} + #if defined(CONFIG_ARCH_HAS_PTE_SPECIAL) static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; return pte; } diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index daa48f28ce5e..80d652cfc113 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -409,6 +409,12 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_clrhuge(pte_t pte) +{ + pte_val(pte) &= ~_PAGE_HUGE; + return pte; +} + #define pmd_write pmd_write static inline int pmd_write(pmd_t pmd) { diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 9b13eb14e21b..1b028a99d380 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ 
b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -518,6 +518,11 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_clrhuge(pte_t pte) +{ + return pte; +} + static inline pte_t pte_mkprivileged(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_USER); diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8a6e6b6daa90..b98d3f903d5a 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -598,6 +598,11 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_clrhuge(pte_t pte) +{ + return pte; +} + static inline pte_t pte_mkdevmap(pte_t pte) { return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SPECIAL | _PAGE_DEVMAP)); diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h index e6fe1d5731f2..70f5309afde5 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h @@ -143,6 +143,13 @@ static inline pte_t pte_mkhuge(pte_t pte) #define pte_mkhuge pte_mkhuge +static inline pte_t pte_clrhuge(pte_t pte) +{ + return __pte(pte_val(pte) & ~(_PAGE_SPS | _PAGE_HUGE)); +} + +#define pte_clrhuge pte_clrhuge + static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, unsigned long clr, unsigned long set, int huge); diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index c721478c5934..c97cd2b66499 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -132,6 +132,13 @@ static inline pte_t pte_mkhuge(pte_t pte) } #endif +#ifndef pte_clrhuge +static inline pte_t pte_clrhuge(pte_t pte) +{ + return __pte(pte_val(pte)); +} +#endif + #ifndef pte_mkprivileged static inline pte_t pte_mkprivileged(pte_t pte) { diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index e58315cedfd3..168194ea5237 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -438,6 +438,11 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_clrhuge(pte_t pte) +{ + return pte; +} + #ifdef CONFIG_RISCV_ISA_SVNAPOT #define pte_leaf_size(pte) (pte_napot(pte) ? 
\ napot_cont_size(napot_cont_order(pte)) :\ diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 1f4a630f22b5..f9ddc893b518 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1058,6 +1058,11 @@ static inline pte_t pte_mkhuge(pte_t pte) { return set_pte_bit(pte, __pgprot(_PAGE_LARGE)); } + +static inline pte_t pte_clrhuge(pte_t pte) +{ + return clear_pte_bit(pte, __pgprot(_PAGE_LARGE)); +} #endif #define IPTE_GLOBAL 0 diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 5877be1af301..0d892e824d15 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -420,6 +420,11 @@ static inline pte_t pte_mkhuge(pte_t pte) return __pte(pte_val(pte) | __pte_default_huge_mask()); } +static inline pte_t pte_clrhuge(pte_t pte) +{ + return __pte(pte_val(pte) & ~__pte_default_huge_mask()); +} + static inline bool is_default_hugetlb_pte(pte_t pte) { unsigned long mask = __pte_default_huge_mask(); diff --git a/arch/sw_64/include/asm/pgtable.h b/arch/sw_64/include/asm/pgtable.h index 2614b47d25dc..4a919c76bd48 100644 --- a/arch/sw_64/include/asm/pgtable.h +++ b/arch/sw_64/include/asm/pgtable.h @@ -622,6 +622,12 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_clrhuge(pte_t pte) +{ + pte_val(pte) &= ~_PAGE_LEAF; + return pte; +} + static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; -- 2.43.0
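A hedged usage sketch of the new helper (names simplified, not the exact
kernel code): when a special huge PMD is split in the next patch, each new
PTE is derived from the old PMD's pfn and pgprot, with the huge attribute
cleared via pte_clrhuge().

/* Illustrative helper mirroring the split path of the next patch. */
static pte_t pte_from_split_pmd(pmd_t old_pmd)
{
	return pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)));
}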
hulk inclusion category: feature bugzilla: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 ---------------------------------------- Implement special huge PMD splitting by utilizing the pgtable deposit/ withdraw mechanism. When splitting is needed, the deposited pgtable is withdrawn and populated with individual PTEs created from the original huge mapping, using pte_clrhuge() to clear huge page attributes. Update arch_needs_pgtable_deposit() to return true when PMD pfnmap support is enabled, ensuring proper pgtable management for huge pfnmap operations. Signed-off-by: Yin Tirui <yintirui@huawei.com> --- arch/arm64/include/asm/pgtable.h | 5 + arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +- fs/dax.c | 2 +- include/linux/mm.h | 2 + include/linux/pgtable.h | 2 +- mm/huge_memory.c | 43 +++++-- mm/memory.c | 117 +++++++++++++++---- 7 files changed, 140 insertions(+), 33 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index d8f8a804375b..a06e2dcdc6ee 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -529,6 +529,11 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd) { return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL)); } + +extern bool nohugepfnmap; +#define arch_needs_pgtable_deposit(vma) \ + (nohugepfnmap ? false : (!vma_is_dax(vma) && vma_is_special_huge(vma))) + #endif #define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd)) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index b98d3f903d5a..508c3d543185 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1399,7 +1399,7 @@ extern int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, * slot information. 
*/ #define arch_needs_pgtable_deposit arch_needs_pgtable_deposit -static inline bool arch_needs_pgtable_deposit(void) +static inline bool arch_needs_pgtable_deposit(struct vm_area_struct *vma) { if (radix_enabled()) return false; diff --git a/fs/dax.c b/fs/dax.c index 8c09578fa035..a65e1a088a8a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1221,7 +1221,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_PMD | DAX_ZERO_PAGE); - if (arch_needs_pgtable_deposit()) { + if (arch_needs_pgtable_deposit(vma)) { pgtable = pte_alloc_one(vma->vm_mm); if (!pgtable) return VM_FAULT_OOM; diff --git a/include/linux/mm.h b/include/linux/mm.h index f0efb11b915d..e8d8b032024a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3627,6 +3627,8 @@ struct vm_area_struct *find_extend_vma_locked(struct mm_struct *, unsigned long addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); +int remap_pfn_range_try_pmd(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot); int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot); int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index b69f295b5f91..afef31772096 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -940,7 +940,7 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp); #endif #ifndef arch_needs_pgtable_deposit -#define arch_needs_pgtable_deposit() (false) +#define arch_needs_pgtable_deposit(vma) (false) #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 140dbdaa99f2..dab73dd757b0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1629,7 +1629,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write) if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; - if (arch_needs_pgtable_deposit()) { + if (arch_needs_pgtable_deposit(vma)) { pgtable = pte_alloc_one(vma->vm_mm); if (!pgtable) return VM_FAULT_OOM; @@ -1791,6 +1791,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = pmdp_get_lockless(src_pmd); if (unlikely(pmd_present(pmd) && pmd_special(pmd))) { + pgtable = pte_alloc_one(dst_mm); + if (unlikely(!pgtable)) + goto out; dst_ptl = pmd_lock(dst_mm, dst_pmd); src_ptl = pmd_lockptr(src_mm, src_pmd); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); @@ -1804,6 +1807,12 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, * able to wrongly write to the backend MMIO. 
*/ VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)); + + /* dax won't reach here, it will be intercepted at vma_needs_copy() */ + VM_WARN_ON_ONCE(vma_is_dax(src_vma)); + + mm_inc_nr_ptes(dst_mm); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); goto set_pmd; } @@ -2448,7 +2457,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); if (vma_is_special_huge(vma)) { - if (arch_needs_pgtable_deposit()) + if (arch_needs_pgtable_deposit(vma)) zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); } else if (is_huge_zero_pmd(orig_pmd)) { @@ -2480,7 +2489,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); } else { - if (arch_needs_pgtable_deposit()) + if (arch_needs_pgtable_deposit(vma)) zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PMD_NR); @@ -2862,14 +2871,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (!vma_is_anonymous(vma)) { old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); - /* - * We are going to unmap this huge page. So - * just go ahead and zap it - */ - if (arch_needs_pgtable_deposit()) - zap_deposited_table(mm, pmd); - if (vma_is_special_huge(vma)) + if (vma_is_special_huge(vma)) { + pte_t entry; + + if (vma_is_dax(vma)) + return; + pgtable = pgtable_trans_huge_withdraw(mm, pmd); + if (unlikely(!pgtable)) + return; + pmd_populate(mm, &_pmd, pgtable); + pte = pte_offset_map(&_pmd, haddr); + entry = pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd))); + set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR); + pte_unmap(pte); + + smp_wmb(); /* make pte visible before pmd */ + pmd_populate(mm, pmd, pgtable); return; + } else if (arch_needs_pgtable_deposit(vma)) { + /* Zap for the non-special mappings. 
*/ + zap_deposited_table(mm, pmd); + } + if (unlikely(is_pmd_migration_entry(old_pmd))) { swp_entry_t entry; diff --git a/mm/memory.c b/mm/memory.c index a590e0c80378..f0584fec93c9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2585,9 +2585,59 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, return err; } +#if defined(CONFIG_ARM64) && defined(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP) +bool __ro_after_init nohugepfnmap; + +static int __init set_nohugepfnmap(char *str) +{ + nohugepfnmap = true; + return 0; +} +early_param("nohugepfnmap", set_nohugepfnmap); + +static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, unsigned long end, + unsigned long pfn, pgprot_t prot, + unsigned int page_shift) +{ + pgtable_t pgtable; + spinlock_t *ptl; + + if (nohugepfnmap) + return 0; + + if (page_shift < PMD_SHIFT) + return 0; + + if ((end - addr) != PMD_SIZE) + return 0; + + if (!IS_ALIGNED(addr, PMD_SIZE)) + return 0; + + if (!IS_ALIGNED(pfn, HPAGE_PMD_NR)) + return 0; + + if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) + return 0; + + pgtable = pte_alloc_one(mm); + if (unlikely(!pgtable)) + return 0; + + mm_inc_nr_ptes(mm); + ptl = pmd_lock(mm, pmd); + set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot)))); + pgtable_trans_huge_deposit(mm, pmd, pgtable); + spin_unlock(ptl); + + return 1; +} +#endif + static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, unsigned long addr, unsigned long end, - unsigned long pfn, pgprot_t prot) + unsigned long pfn, pgprot_t prot, unsigned int page_shift) { pmd_t *pmd; unsigned long next; @@ -2600,6 +2650,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, VM_BUG_ON(pmd_trans_huge(*pmd)); do { next = pmd_addr_end(addr, end); +#if defined(CONFIG_ARM64) && defined(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP) + if (remap_try_huge_pmd(mm, pmd, addr, next, + pfn + (addr >> PAGE_SHIFT), prot, page_shift)) { + continue; + } +#endif err = remap_pte_range(mm, pmd, addr, next, pfn + (addr >> PAGE_SHIFT), prot); if (err) @@ -2610,7 +2666,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d, unsigned long addr, unsigned long end, - unsigned long pfn, pgprot_t prot) + unsigned long pfn, pgprot_t prot, unsigned int page_shift) { pud_t *pud; unsigned long next; @@ -2623,7 +2679,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d, do { next = pud_addr_end(addr, end); err = remap_pmd_range(mm, pud, addr, next, - pfn + (addr >> PAGE_SHIFT), prot); + pfn + (addr >> PAGE_SHIFT), prot, page_shift); if (err) return err; } while (pud++, addr = next, addr != end); @@ -2632,7 +2688,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d, static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd, unsigned long addr, unsigned long end, - unsigned long pfn, pgprot_t prot) + unsigned long pfn, pgprot_t prot, unsigned int page_shift) { p4d_t *p4d; unsigned long next; @@ -2645,7 +2701,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd, do { next = p4d_addr_end(addr, end); err = remap_pud_range(mm, p4d, addr, next, - pfn + (addr >> PAGE_SHIFT), prot); + pfn + (addr >> PAGE_SHIFT), prot, page_shift); if (err) return err; } while (p4d++, addr = next, addr != end); @@ -2653,7 +2709,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd, } static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, unsigned 
long size, pgprot_t prot) + unsigned long pfn, unsigned long size, pgprot_t prot, unsigned int page_shift) { pgd_t *pgd; unsigned long next; @@ -2697,7 +2753,7 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad do { next = pgd_addr_end(addr, end); err = remap_p4d_range(mm, pgd, addr, next, - pfn + (addr >> PAGE_SHIFT), prot); + pfn + (addr >> PAGE_SHIFT), prot, page_shift); if (err) return err; } while (pgd++, addr = next, addr != end); @@ -2705,15 +2761,10 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad return 0; } -/* - * Variant of remap_pfn_range that does not call track_pfn_remap. The caller - * must have pre-validated the caching bits of the pgprot_t. - */ -int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, unsigned long size, pgprot_t prot) +static int __remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot, unsigned int page_shift) { - int error = remap_pfn_range_internal(vma, addr, pfn, size, prot); - + int error = remap_pfn_range_internal(vma, addr, pfn, size, prot, page_shift); if (!error) return 0; @@ -2726,6 +2777,16 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, return error; } +/* + * Variant of remap_pfn_range that does not call track_pfn_remap. The caller + * must have pre-validated the caching bits of the pgprot_t. + */ +int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot) +{ + return __remap_pfn_range_notrack(vma, addr, pfn, size, prot, PAGE_SHIFT); +} + /** * remap_pfn_range - remap kernel memory to userspace * @vma: user vma to map to @@ -2738,8 +2799,9 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, * * Return: %0 on success, negative error code otherwise. */ -int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, unsigned long size, pgprot_t prot) +int __remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot, + unsigned int page_shift) { int err; @@ -2747,13 +2809,28 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, if (err) return -EINVAL; - err = remap_pfn_range_notrack(vma, addr, pfn, size, prot); + err = __remap_pfn_range_notrack(vma, addr, pfn, size, prot, page_shift); if (err) untrack_pfn(vma, pfn, PAGE_ALIGN(size), true); return err; } + +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot) +{ + return __remap_pfn_range(vma, addr, pfn, size, prot, PAGE_SHIFT); +} EXPORT_SYMBOL(remap_pfn_range); +#if defined(CONFIG_ARM64) && defined(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP) +int remap_pfn_range_try_pmd(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, unsigned long size, pgprot_t prot) +{ + return __remap_pfn_range(vma, addr, pfn, size, prot, PMD_SHIFT); +} +EXPORT_SYMBOL_GPL(remap_pfn_range_try_pmd); +#endif + /** * vm_iomap_memory - remap memory to userspace * @vma: user vma to map to @@ -4964,7 +5041,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) * Archs like ppc64 need additional space to store information * related to pte entry. Use the preallocated table for that. 
*/ - if (arch_needs_pgtable_deposit() && !vmf->prealloc_pte) { + if (arch_needs_pgtable_deposit(vma) && !vmf->prealloc_pte) { vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; @@ -4987,7 +5064,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) /* * deposit and withdraw with pmd lock held */ - if (arch_needs_pgtable_deposit()) + if (arch_needs_pgtable_deposit(vma)) deposit_prealloc_pte(vmf); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); -- 2.43.0
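For driver authors, the new interface is called from an mmap() handler in place of remap_pfn_range(). A minimal sketch, assuming a hypothetical device whose BAR physical address sits in mydev->mmio_phys (the struct, field and function names are illustrative, not from this series):

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mydev *dev = file->private_data;
	unsigned long size = vma->vm_end - vma->vm_start;

	/*
	 * Creates PMD-sized leaf entries wherever the size, VA/PFN
	 * alignment and architecture support allow it, and falls back
	 * to ordinary PTE mappings for the remainder of the range.
	 */
	return remap_pfn_range_try_pmd(vma, vma->vm_start,
				       dev->mmio_phys >> PAGE_SHIFT,
				       size, vma->vm_page_prot);
}

Note that in this series the helper is only built on arm64 with CONFIG_ARCH_SUPPORTS_PMD_PFNMAP, and the "nohugepfnmap" early parameter disables the PMD path at boot, so a driver that must build on other architectures would need to guard the call accordingly or fall back to plain remap_pfn_range().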
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/18903 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/RP4...