From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.9-rc1 commit 6e8f588708971e0626f5be808e8c4b6cdb86eb0b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3.
Now that the rmap overhaul[1] is upstream that provides a clean interface for rmap batching, let's implement PTE batching during fork when processing PTE-mapped THPs.
This series is partially based on Ryan's previous work[2] to implement cont-pte support on arm64, but its a complete rewrite based on [1] to optimize all architectures independent of any such PTE bits, and to use the new rmap batching functions that simplify the code and prepare for further rmap accounting changes.
We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch and (c) perform batch PTE setting/updates.
While this series should be beneficial for adding cont-pte support on ARM64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount.
Independent of all that, this series results in a speedup during fork with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for fork() (shorter is better):
Folio Size | v6.8-rc1 | New | Change ------------------------------------------ 4KiB | 0.014328 | 0.014035 | - 2% 16KiB | 0.014263 | 0.01196 | -16% 32KiB | 0.014334 | 0.01094 | -24% 64KiB | 0.014046 | 0.010444 | -26% 128KiB | 0.014011 | 0.010063 | -28% 256KiB | 0.013993 | 0.009938 | -29% 512KiB | 0.013983 | 0.00985 | -30% 1024KiB | 0.013986 | 0.00982 | -30% 2048KiB | 0.014305 | 0.010076 | -30%
Note that these numbers are even better than the ones from v1 (verified over multiple reboots), even though there were only minimal code changes. Well, I removed a pte_mkclean() call for anon folios, maybe that also plays a role.
But my experience is that fork() is extremely sensitive to code size, inlining, ... so I suspect we'll see on other architectures rather a change of -20% instead of -30%, and it will be easy to "lose" some of that speedup in the future by subtle code changes.
Next up is PTE batching when unmapping. Only tested on x86-64. Compile-tested on most other architectures.
[1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
This patch (of 15):
Since the high bits [51:48] of an OA are not stored contiguously in the PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE to the pte to get the pte with the next pfn. This works until the pfn crosses the 48-bit boundary, at which point we overflow into the upper attributes.
Of course one could argue (and Matthew Wilcox has :) that we will never see a folio cross this boundary because we only allow naturally aligned power-of-2 allocation, so this would require a half-petabyte folio. So its only a theoretical bug. But its better that the code is robust regardless.
I've implemented pte_next_pfn() as part of the fix, which is an opt-in core-mm interface. So that is now available to the core-mm, which will be needed shortly to support forthcoming fork()-batching optimizations.
Link: https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240129124649.189745-2-david@redhat.com Fixes: 4a169d61c2ed ("arm64: implement the new page table range API") Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43ad01@arm.co... Signed-off-by: Ryan Roberts ryan.roberts@arm.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Catalin Marinas catalin.marinas@arm.com Reviewed-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Alexandre Ghiti alexghiti@rivosinc.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 6e8f588708971e0626f5be808e8c4b6cdb86eb0b) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/arm64/include/asm/pgtable.h | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 79ce70fbb751..52d0b0a763f1 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) mte_sync_tags(pte, nr_pages); }
+/* + * Select all bits except the pfn + */ +static inline pgprot_t pte_pgprot(pte_t pte) +{ + unsigned long pfn = pte_pfn(pte); + + return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); +} + +#define pte_next_pfn pte_next_pfn +static inline pte_t pte_next_pfn(pte_t pte) +{ + return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)); +} + static inline void set_ptes(struct mm_struct *mm, unsigned long __always_unused addr, pte_t *ptep, pte_t pte, unsigned int nr) @@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm, if (--nr == 0) break; ptep++; - pte_val(pte) += PAGE_SIZE; + pte = pte_next_pfn(pte); } } #define set_ptes set_ptes @@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte) return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE)); }
-/* - * Select all bits except the pfn - */ -static inline pgprot_t pte_pgprot(pte_t pte) -{ - unsigned long pfn = pte_pfn(pte); - - return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); -} - #ifdef CONFIG_NUMA_BALANCING /* * See the comment in include/linux/pgtable.h