kernel@openeuler.org
[PATCH v1 openEuler-26.09] Add copy to/from/in user with vectorization support
by Nikita Panov 29 Jan '26
From: Artem Kuzin <artem.kuzin(a)huawei.com>

kunpeng inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8445

-------------------------------------------------

1. This implementation uses st1/ld1 4-vector instructions, which allow copying 64 bytes at once.
2. The vectorized copy code is used only if the data block to copy is larger than 128 bytes.
3. To use this functionality, set the configuration switch CONFIG_USE_VECTORIZED_COPY=y.
4. The code can be used on any ARMv8 variant.
5. In-kernel copy functions such as memcpy() are not supported yet, but can be enabled in the future.
6. For now we use a lightweight version of register context saving/restoration (4 registers).

We introduce vectorization support for the copy_from/to/in_user functions. It currently works in parallel with the original FPSIMD/SVE vectorization and does not affect it in any way.

A dedicated flag in the task struct, TIF_KERNEL_FPSIMD, is set while lightweight vectorization is in use in the kernel. The task struct gains two fields: a user-space FPSIMD state and a kernel FPSIMD state. The user-space state is used by kernel_fpsimd_begin() and kernel_fpsimd_end(), which wrap lightweight FPSIMD context usage in kernel space. The kernel state is used to manage thread switches.

Nested calls of kernel_neon_begin()/kernel_fpsimd_begin() are not supported, and there are no plans to support them in the future; this is not necessary. We save the lightweight FPSIMD context in kernel_fpsimd_begin() and restore it in kernel_fpsimd_end(). On a thread switch we preserve the kernel FPSIMD context and restore the user-space one, if any, which prevents corruption of the user-space FPSIMD state. Before switching to the next thread we restore its kernel FPSIMD context, if any.

Using FPSIMD in bottom halves is allowed: when a bottom half preempts a task that holds the lightweight context, we check the TIF_KERNEL_FPSIMD flag and save/restore the contexts. Context management is quite lightweight and runs only when TIF_KERNEL_FPSIMD is set.

To enable this feature, manually modify one of the appropriate entries:

/proc/sys/vm/copy_from_user_threshold
/proc/sys/vm/copy_in_user_threshold
/proc/sys/vm/copy_to_user_threshold

The allowed values are the following:
-1        - feature disabled (the regular copy routines are always used)
0         - feature always enabled
n (n > 0) - feature enabled when the copied size is at least n bytes

P.S.: What I personally don't like in the current approach:
1. The additional fields and flag in the task struct look quite ugly.
2. There is no way to configure, from user space, the size of the chunk copied using FPSIMD.
3. The FPSIMD-based memory movement is not generic; it needs to be enabled for memmove(), memcpy() and friends in the future.
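As a concrete illustration of the sysctl entries listed above, here is a minimal user-space sketch that writes a threshold value. Only the /proc/sys/vm paths come from the patch; set_copy_threshold() and the 64 KiB example value are hypothetical, and the assumption that the threshold is compared against the copy size in bytes follows the checks in raw_copy_to_user()/raw_copy_from_user() in the diff below.

#include <stdio.h>

/* Hypothetical helper: write a threshold to one of the vm sysctl entries.
 * A value of -1 keeps the vectorized path disabled; n > 0 enables it for
 * copies of at least n bytes (per the comparison in the patch below). */
static int set_copy_threshold(const char *entry, long value)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/vm/%s", entry);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", value);
	return fclose(f);
}

int main(void)
{
	/* Example: vectorize copy_to_user() for copies of 64 KiB and larger.
	 * The 64 KiB figure is illustrative, not a recommendation. */
	if (set_copy_threshold("copy_to_user_threshold", 64 * 1024) != 0)
		perror("copy_to_user_threshold");
	return 0;
}

Writing -1 back restores the default, i.e. the regular copy routines are used for every size.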
Co-developed-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Signed-off-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Co-developed-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Artem Kuzin <artem.kuzin(a)huawei.com> --- arch/arm64/Kconfig | 15 ++ arch/arm64/configs/openeuler_defconfig | 2 + arch/arm64/include/asm/fpsimd.h | 15 ++ arch/arm64/include/asm/fpsimdmacros.h | 14 ++ arch/arm64/include/asm/neon.h | 28 ++++ arch/arm64/include/asm/processor.h | 10 ++ arch/arm64/include/asm/thread_info.h | 5 + arch/arm64/include/asm/uaccess.h | 218 ++++++++++++++++++++++++- arch/arm64/kernel/entry-fpsimd.S | 22 +++ arch/arm64/kernel/fpsimd.c | 102 +++++++++++- arch/arm64/kernel/process.c | 2 +- arch/arm64/lib/copy_from_user.S | 30 ++++ arch/arm64/lib/copy_template_fpsimd.S | 180 ++++++++++++++++++++ arch/arm64/lib/copy_to_user.S | 30 ++++ kernel/softirq.c | 34 ++++ kernel/sysctl.c | 34 ++++ 16 files changed, 734 insertions(+), 7 deletions(-) create mode 100644 arch/arm64/lib/copy_template_fpsimd.S diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index c3b38c890b45..8904e6476e3b 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1870,6 +1870,21 @@ config ARM64_ILP32 is an ABI where long and pointers are 32bits but it uses the AARCH64 instruction set. +config USE_VECTORIZED_COPY + bool "Use vectorized instructions in copy_to/from user" + depends on KERNEL_MODE_NEON + default y + help + This option turns on vectorization to speed up copy_to/from_user routines. + +config VECTORIZED_COPY_VALIDATE + bool "Validate result of vectorized copy using regular implementation" + depends on KERNEL_MODE_NEON + depends on USE_VECTORIZED_COPY + default n + help + This option turns on vectorization to speed up copy_to/from_user routines. 
+ menuconfig AARCH32_EL0 bool "Kernel support for 32-bit EL0" depends on ARM64_4K_PAGES || EXPERT diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 9e7bc82cba3a..9843dec071bf 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -527,6 +527,8 @@ CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY=y # CONFIG_RODATA_FULL_DEFAULT_ENABLED is not set # CONFIG_ARM64_SW_TTBR0_PAN is not set CONFIG_ARM64_TAGGED_ADDR_ABI=y +CONFIG_USE_VECTORIZED_COPY=y +# CONFIG_VECTORIZED_COPY_VALIDATE is not set CONFIG_AARCH32_EL0=y # CONFIG_KUSER_HELPERS is not set # CONFIG_COMPAT_ALIGNMENT_FIXUPS is not set diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index b6c6949984d8..1fc9089b4a47 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -46,6 +46,21 @@ struct task_struct; +#ifdef CONFIG_USE_VECTORIZED_COPY +extern void fpsimd_save_state_light(struct fpsimd_state *state); +extern void fpsimd_load_state_light(struct fpsimd_state *state); +#else +static inline void fpsimd_save_state_light(struct fpsimd_state *state) +{ + (void) state; +} + +static inline void fpsimd_load_state_light(struct fpsimd_state *state) +{ + (void) state; +} +#endif + extern void fpsimd_save_state(struct user_fpsimd_state *state); extern void fpsimd_load_state(struct user_fpsimd_state *state); diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h index cdf6a35e3994..df9d3ed91931 100644 --- a/arch/arm64/include/asm/fpsimdmacros.h +++ b/arch/arm64/include/asm/fpsimdmacros.h @@ -8,6 +8,20 @@ #include <asm/assembler.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* Lightweight fpsimd context saving/restoration. + * Necessary for vectorized kernel memory movement + * implementation + */ +.macro fpsimd_save_light state + st1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm + +.macro fpsimd_restore_light state + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm +#endif + .macro fpsimd_save state, tmpnr stp q0, q1, [\state, #16 * 0] stp q2, q3, [\state, #16 * 2] diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h index d4b1d172a79b..ab84b194d7b3 100644 --- a/arch/arm64/include/asm/neon.h +++ b/arch/arm64/include/asm/neon.h @@ -16,4 +16,32 @@ void kernel_neon_begin(void); void kernel_neon_end(void); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void); +void kernel_fpsimd_end(void); +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state); +void _kernel_fpsimd_load(struct fpsimd_state *state); +#else +bool kernel_fpsimd_begin(void) +{ + return false; +} + +void kernel_fpsimd_end(void) +{ +} + +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + (void) state; +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + (void) state; +} +#endif + #endif /* ! 
__ASM_NEON_H */ diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index 9e688b1b13d4..9b81dbcd2126 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -153,6 +153,10 @@ struct cpu_context { unsigned long pc; }; +struct fpsimd_state { + __uint128_t v[4]; +}; + struct thread_struct { struct cpu_context cpu_context; /* cpu context */ @@ -196,6 +200,12 @@ struct thread_struct { KABI_RESERVE(6) KABI_RESERVE(7) KABI_RESERVE(8) +#ifdef CONFIG_USE_VECTORIZED_COPY + KABI_EXTEND( + struct fpsimd_state ustate; + struct fpsimd_state kstate; + ) +#endif }; static inline unsigned int thread_get_vl(struct thread_struct *thread, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 379d24059f5b..60d0be8a2d58 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -89,6 +89,9 @@ void arch_setup_new_exec(void); #define TIF_SME 27 /* SME in use */ #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */ #define TIF_32BIT_AARCH64 29 /* 32 bit process on AArch64(ILP32) */ +#define TIF_KERNEL_FPSIMD 31 /* Use FPSIMD in kernel */ +#define TIF_PRIV_UACC_ENABLED 32 /* Whether priviliged uaccess was manually enabled */ + #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -107,6 +110,8 @@ void arch_setup_new_exec(void); #define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT) #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL) #define _TIF_32BIT_AARCH64 (1 << TIF_32BIT_AARCH64) +#define _TIF_KERNEL_FPSIMD (1 << TIF_KERNEL_FPSIMD) +#define _TIF_PRIV_UACC_ENABLED (1 << TIF_PRIV_UACC_ENABLED) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index dd0877a75922..fc9f1a40624d 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -26,6 +26,10 @@ #include <asm/memory.h> #include <asm/extable.h> +#ifndef __GENKSYMS__ +#include <asm/neon.h> +#endif + static inline int __access_ok(const void __user *ptr, unsigned long size); /* @@ -134,7 +138,7 @@ static inline void __uaccess_enable_hw_pan(void) CONFIG_ARM64_PAN)); } -static inline void uaccess_disable_privileged(void) +static inline void __uaccess_disable_privileged(void) { mte_disable_tco(); @@ -144,7 +148,22 @@ static inline void uaccess_disable_privileged(void) __uaccess_enable_hw_pan(); } -static inline void uaccess_enable_privileged(void) +static inline void uaccess_disable_privileged(void) +{ + preempt_disable(); + + if (!test_and_clear_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_disable_privileged(); + + preempt_enable(); +} + +static inline void __uaccess_enable_privileged(void) { mte_enable_tco(); @@ -154,6 +173,47 @@ static inline void uaccess_enable_privileged(void) __uaccess_disable_hw_pan(); } +static inline void uaccess_enable_privileged(void) +{ + preempt_disable(); + + if (test_and_set_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_enable_privileged(); + + preempt_enable(); +} + +static inline void uaccess_priviliged_context_switch(struct task_struct *next) +{ + bool curr_enabled = !!test_thread_flag(TIF_PRIV_UACC_ENABLED); + bool next_enabled = !!test_ti_thread_flag(&next->thread_info, TIF_PRIV_UACC_ENABLED); + + if (curr_enabled == next_enabled) + return; + + if 
(curr_enabled) + __uaccess_disable_privileged(); + else + __uaccess_enable_privileged(); +} + +static inline void uaccess_priviliged_state_save(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_disable_privileged(); +} + +static inline void uaccess_priviliged_state_restore(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_enable_privileged(); +} + /* * Sanitize a uaccess pointer such that it cannot reach any kernel address. * @@ -391,7 +451,97 @@ do { \ } while (0); \ } while(0) -extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n); +#define USER_COPY_CHUNK_SIZE 4096 + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_from_user_threshold; + +#define verify_fpsimd_copy(to, from, n, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FPSIMD:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FPSIMD:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + __verify_ret; \ +}) + +#define compare_fpsimd_copy(to, from, n, ret_fpsimd, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FIXUP:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FIXUP:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + if (ret_fpsimd != ret) { \ + pr_err("FIXUP:%s difference between FPSIMD %lu and regular %lu\n", __func__, n - ret_fpsimd, n - ret); \ + __verify_ret |= 1; \ + } else { \ + __verify_ret = 0; \ + } \ + __verify_ret; \ +}) + +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_from_user_fpsimd(void *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_from_user(void *to, const void __user *from, unsigned long n) +{ + unsigned long __acfu_ret; + + if (sysctl_copy_from_user_threshold == -1 || n < sysctl_copy_from_user_threshold) { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user(to, + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __acfu_ret_fpsimd; + + uaccess_enable_privileged(); + __acfu_ret_fpsimd = __arch_copy_from_user_fpsimd((to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + __acfu_ret = __acfu_ret_fpsimd; + kernel_fpsimd_end(); +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret)) { + + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret_fpsimd, __acfu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + + return __acfu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + #define raw_copy_from_user(to, from, n) \ ({ \ unsigned long __acfu_ret; \ @@ -402,7 +552,66 @@ extern unsigned long __must_check __arch_copy_from_user(void *to, const void __u __acfu_ret; \ }) -extern unsigned long __must_check __arch_copy_to_user(void __user *to, const void *from, unsigned 
long n); +#endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_to_user_threshold; + +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_to_user_fpsimd(void __user *to, const void *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_to_user(void __user *to, const void *from, unsigned long n) +{ + unsigned long __actu_ret; + + + if (sysctl_copy_to_user_threshold == -1 || n < sysctl_copy_to_user_threshold) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __actu_ret_fpsimd; + + uaccess_enable_privileged(); + __actu_ret_fpsimd = __arch_copy_to_user_fpsimd(__uaccess_mask_ptr(to), + from, n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __actu_ret = __actu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret)) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret_fpsimd, __actu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } + } + + return __actu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + #define raw_copy_to_user(to, from, n) \ ({ \ unsigned long __actu_ret; \ @@ -412,6 +621,7 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi uaccess_ttbr0_disable(); \ __actu_ret; \ }) +#endif static __must_check __always_inline bool user_access_begin(const void __user *ptr, size_t len) { diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S index 6325db1a2179..6660465f1b7c 100644 --- a/arch/arm64/kernel/entry-fpsimd.S +++ b/arch/arm64/kernel/entry-fpsimd.S @@ -11,6 +11,28 @@ #include <asm/assembler.h> #include <asm/fpsimdmacros.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* + * Save the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_save_state_light) + fpsimd_save_light x0 + ret +SYM_FUNC_END(fpsimd_save_state_light) + +/* + * Load the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_load_state_light) + fpsimd_restore_light x0 + ret +SYM_FUNC_END(fpsimd_load_state_light) +#endif + /* * Save the FP registers. * diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 998906b75075..1b6b1accfbbc 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1579,6 +1579,11 @@ void do_fpsimd_exc(unsigned long esr, struct pt_regs *regs) current); } +#ifdef CONFIG_USE_VECTORIZED_COPY +static void kernel_fpsimd_rollback_changes(void); +static void kernel_fpsimd_restore_changes(struct task_struct *tsk); +#endif + void fpsimd_thread_switch(struct task_struct *next) { bool wrong_task, wrong_cpu; @@ -1587,10 +1592,11 @@ void fpsimd_thread_switch(struct task_struct *next) return; __get_cpu_fpsimd_context(); - +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_rollback_changes(); +#endif /* Save unsaved fpsimd state, if any: */ fpsimd_save(); - /* * Fix up TIF_FOREIGN_FPSTATE to correctly describe next's * state. 
For kernel threads, FPSIMD registers are never loaded @@ -1603,6 +1609,9 @@ void fpsimd_thread_switch(struct task_struct *next) update_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE, wrong_task || wrong_cpu); +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_restore_changes(next); +#endif __put_cpu_fpsimd_context(); } @@ -1933,6 +1942,95 @@ void kernel_neon_end(void) } EXPORT_SYMBOL_GPL(kernel_neon_end); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void) +{ + if (WARN_ON(!system_capabilities_finalized()) || + !system_supports_fpsimd() || + in_irq() || irqs_disabled() || in_nmi()) + return false; + + preempt_disable(); + if (test_and_set_thread_flag(TIF_KERNEL_FPSIMD)) { + preempt_enable(); + + WARN_ON(1); + return false; + } + + /* + * Leaving streaming mode enabled will cause issues for any kernel + * NEON and leaving streaming mode or ZA enabled may increase power + * consumption. + */ + if (system_supports_sme()) + sme_smstop(); + + fpsimd_save_state_light(&current->thread.ustate); + preempt_enable(); + + return true; +} +EXPORT_SYMBOL(kernel_fpsimd_begin); + +void kernel_fpsimd_end(void) +{ + if (!system_supports_fpsimd()) + return; + + preempt_disable(); + if (test_and_clear_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(&current->thread.ustate); + + preempt_enable(); +} +EXPORT_SYMBOL(kernel_fpsimd_end); + +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_save_state_light(state); +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(state); +} + +static void kernel_fpsimd_rollback_changes(void) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&current->thread.kstate); + fpsimd_load_state_light(&current->thread.ustate); + } +} + +static void kernel_fpsimd_restore_changes(struct task_struct *tsk) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_ti_thread_flag(task_thread_info(tsk), TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&tsk->thread.ustate); + fpsimd_load_state_light(&tsk->thread.kstate); + } +} +#endif + #ifdef CONFIG_EFI static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index e9e5ce956f15..fd895189cb7e 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -529,7 +529,7 @@ struct task_struct *__switch_to(struct task_struct *prev, struct task_struct *next) { struct task_struct *last; - + uaccess_priviliged_context_switch(next); fpsimd_thread_switch(next); tls_thread_switch(next); hw_breakpoint_thread_switch(next); diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 34e317907524..60dc63e10233 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -71,3 +71,33 @@ USER(9998f, ldtrb tmp1w, [srcin]) ret SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user) + + + +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + KERNEL_ME_SAFE(9998f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_from_user_fpsimd) + 
add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 // Nothing to copy + ret + + // Exception fixups +9997: cmp dst, dstin + b.ne 9998f + // Before being absolutely sure we couldn't copy anything, try harder +USER(9998f, ldtrb tmp1w, [srcin]) + strb tmp1w, [dst], #1 +9998: sub x0, end, dst // bytes not copied + ret +SYM_FUNC_END(__arch_copy_from_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_from_user_fpsimd) +#endif \ No newline at end of file diff --git a/arch/arm64/lib/copy_template_fpsimd.S b/arch/arm64/lib/copy_template_fpsimd.S new file mode 100644 index 000000000000..9b2e7ce1e4d2 --- /dev/null +++ b/arch/arm64/lib/copy_template_fpsimd.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + */ + +/* + * Copy a buffer from src to dest (alignment handled by the hardware) + * + * Parameters: + * x0 - dest + * x1 - src + * x2 - n + * Returns: + * x0 - dest + */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + +V_a .req v20 +V_b .req v21 +V_c .req v22 +V_d .req v23 + + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15_fpsimd + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned_fpsimd + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwriting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. + */ + tbz tmp2, #0, 1f + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 +1: + tbz tmp2, #1, 2f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +2: + tbz tmp2, #2, 3f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +3: + tbz tmp2, #3, .LSrcAligned_fpsimd + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 + +.LSrcAligned_fpsimd: + cmp count, #64 + b.ge .Lcpy_over64_fpsimd + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63_fpsimd: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15_fpsimd + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +1: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +2: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +.Ltiny15_fpsimd: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. 
+ */ + tbz count, #3, 1f + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 +1: + tbz count, #2, 2f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +2: + tbz count, #1, 3f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +3: + tbz count, #0, .Lexitfunc_fpsimd + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 + + b .Lexitfunc_fpsimd + +.Lcpy_over64_fpsimd: + subs count, count, #128 + b.ge .Lcpy_body_large_fpsimd + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 + ldp1 B_l, B_h, src, #16 + ldp1 C_l, C_h, src, #16 + stp1 B_l, B_h, dst, #16 + stp1 C_l, C_h, dst, #16 + ldp1 D_l, D_h, src, #16 + stp1 D_l, D_h, dst, #16 + + tst count, #0x3f + b.ne .Ltail63_fpsimd + b .Lexitfunc_fpsimd + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large_fpsimd: + /* pre-get 64 bytes data. */ + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add src, src, #64 + +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add dst, dst, #64 + add src, src, #64 + + subs count, count, #64 + b.ge 1b + + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + add dst, dst, #64 + + tst count, #0x3f + b.ne .Ltail63_fpsimd +.Lexitfunc_fpsimd: diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 2ac716c0d6d8..c190e5f8a989 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -71,3 +71,33 @@ USER(9998f, sttrb tmp1w, [dst]) ret SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user) + + +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro ldsve reg1, reg2, reg3, reg4, ptr + KERNEL_ME_SAFE(9998f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_to_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret + + // Exception fixups +9997: cmp dst, dstin + b.ne 9998f + // Before being absolutely sure we couldn't copy anything, try harder +KERNEL_ME_SAFE(9998f, ldrb tmp1w, [srcin]) +USER(9998f, sttrb tmp1w, [dst]) + add dst, dst, #1 +9998: sub x0, end, dst // bytes not copied + ret +SYM_FUNC_END(__arch_copy_to_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_to_user_fpsimd) +#endif diff --git a/kernel/softirq.c b/kernel/softirq.c index f8cf88cc46c6..39b84ffbf4e5 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -30,6 +30,10 @@ #include <asm/softirq_stack.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +#include <asm/fpsimd.h> +#endif + #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -524,6 +528,9 @@ static void handle_softirqs(bool ksirqd) __u32 pending; int softirq_bit; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif /* * Mask out PF_MEMALLOC as the current task context is borrowed for the * softirq. 
A softirq handled, such as network RX, might set PF_MEMALLOC @@ -533,10 +540,16 @@ static void handle_softirqs(bool ksirqd) pending = local_softirq_pending(); + softirq_handle_begin(); in_hardirq = lockdep_softirq_start(); account_softirq_enter(current); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + restart: /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); @@ -585,7 +598,14 @@ static void handle_softirqs(bool ksirqd) account_softirq_exit(current); lockdep_softirq_end(in_hardirq); + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif + softirq_handle_end(); + current_restore_flags(old_flags, PF_MEMALLOC); } @@ -819,12 +839,21 @@ static void tasklet_action_common(struct softirq_action *a, { struct tasklet_struct *list; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif + local_irq_disable(); list = tl_head->head; tl_head->head = NULL; tl_head->tail = &tl_head->head; local_irq_enable(); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + while (list) { struct tasklet_struct *t = list; @@ -856,6 +885,11 @@ static void tasklet_action_common(struct softirq_action *a, __raise_softirq_irqoff(softirq_nr); local_irq_enable(); } + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif } static __latent_entropy void tasklet_action(struct softirq_action *a) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e84df0818517..6f8e22102bdc 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -137,6 +137,17 @@ int sysctl_legacy_va_layout; #endif /* CONFIG_SYSCTL */ +#ifdef CONFIG_USE_VECTORIZED_COPY +int sysctl_copy_to_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_to_user_threshold); + +int sysctl_copy_from_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_from_user_threshold); + +int sysctl_copy_in_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_in_user_threshold); +#endif + /* * /proc/sys support */ @@ -2250,6 +2261,29 @@ static struct ctl_table vm_table[] = { .extra1 = (void *)&mmap_rnd_compat_bits_min, .extra2 = (void *)&mmap_rnd_compat_bits_max, }, +#endif +#ifdef CONFIG_USE_VECTORIZED_COPY + { + .procname = "copy_to_user_threshold", + .data = &sysctl_copy_to_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_from_user_threshold", + .data = &sysctl_copy_from_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_in_user_threshold", + .data = &sysctl_copy_in_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, #endif { } }; -- 2.34.1
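To make the "64 bytes at once" idea from the commit message easier to follow outside of assembly, here is a minimal user-space analogy using NEON intrinsics. It is not part of the patch: the kernel implementation is the hand-written ld1/st1 loop in copy_template_fpsimd.S above, with exception fixups that this sketch omits, and copy64_neon() is a hypothetical name.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes (n assumed to be a multiple of 64) moving four 16-byte
 * vector registers per iteration, mirroring the 4-vector ld1/st1 scheme. */
static void copy64_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
	while (n >= 64) {
		uint8x16_t a = vld1q_u8(src);
		uint8x16_t b = vld1q_u8(src + 16);
		uint8x16_t c = vld1q_u8(src + 32);
		uint8x16_t d = vld1q_u8(src + 48);

		vst1q_u8(dst, a);
		vst1q_u8(dst + 16, b);
		vst1q_u8(dst + 32, c);
		vst1q_u8(dst + 48, d);

		src += 64;
		dst += 64;
		n -= 64;
	}
}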
[PATCH v1 openEuler-25.03] Add copy to/from/in user with vectorization support
by Nikita Panov 28 Jan '26
From: Artem Kuzin <artem.kuzin(a)huawei.com>

kunpeng inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8445

-------------------------------------------------

1. This implementation uses st1/ld1 4-vector instructions, which allow copying 64 bytes at once.
2. The vectorized copy code is used only if the data block to copy is larger than 128 bytes.
3. To use this functionality, set the configuration switch CONFIG_USE_VECTORIZED_COPY=y.
4. The code can be used on any ARMv8 variant.
5. In-kernel copy functions such as memcpy() are not supported yet, but can be enabled in the future.
6. For now we use a lightweight version of register context saving/restoration (4 registers).

We introduce vectorization support for the copy_from/to/in_user functions. It currently works in parallel with the original FPSIMD/SVE vectorization and does not affect it in any way.

A dedicated flag in the task struct, TIF_KERNEL_FPSIMD, is set while lightweight vectorization is in use in the kernel. The task struct gains two fields: a user-space FPSIMD state and a kernel FPSIMD state. The user-space state is used by kernel_fpsimd_begin() and kernel_fpsimd_end(), which wrap lightweight FPSIMD context usage in kernel space. The kernel state is used to manage thread switches.

Nested calls of kernel_neon_begin()/kernel_fpsimd_begin() are not supported, and there are no plans to support them in the future; this is not necessary. We save the lightweight FPSIMD context in kernel_fpsimd_begin() and restore it in kernel_fpsimd_end(). On a thread switch we preserve the kernel FPSIMD context and restore the user-space one, if any, which prevents corruption of the user-space FPSIMD state. Before switching to the next thread we restore its kernel FPSIMD context, if any.

Using FPSIMD in bottom halves is allowed: when a bottom half preempts a task that holds the lightweight context, we check the TIF_KERNEL_FPSIMD flag and save/restore the contexts. Context management is quite lightweight and runs only when TIF_KERNEL_FPSIMD is set.

To enable this feature, manually modify one of the appropriate entries:

/proc/sys/vm/copy_from_user_threshold
/proc/sys/vm/copy_in_user_threshold
/proc/sys/vm/copy_to_user_threshold

The allowed values are the following:
-1        - feature disabled (the regular copy routines are always used)
0         - feature always enabled
n (n > 0) - feature enabled when the copied size is at least n bytes

P.S.: What I personally don't like in the current approach:
1. The additional fields and flag in the task struct look quite ugly.
2. There is no way to configure, from user space, the size of the chunk copied using FPSIMD.
3. The FPSIMD-based memory movement is not generic; it needs to be enabled for memmove(), memcpy() and friends in the future.
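For readers skimming the diff below, this is a simplified sketch of the calling pattern the patch establishes around a vectorized copy: kernel_fpsimd_begin() saves the lightweight v20-v23 state (or refuses in IRQ/NMI context), privileged uaccess brackets the copy, and kernel_fpsimd_end() restores the saved state. copy_with_fpsimd() and fallback_copy() are placeholders, not symbols from the patch, and the sketch is not compilable on its own since it relies on kernel-internal helpers.

/* Simplified restatement of the dispatch in raw_copy_from_user() below. */
static unsigned long copy_block(void *dst, const void __user *src,
				unsigned long n)
{
	unsigned long ret;

	if (!kernel_fpsimd_begin())
		return fallback_copy(dst, src, n);	/* FPSIMD unusable here */

	uaccess_enable_privileged();			/* sets TIF_PRIV_UACC_ENABLED */
	ret = copy_with_fpsimd(dst, src, n);		/* placeholder for the asm routine */
	uaccess_disable_privileged();

	kernel_fpsimd_end();				/* reload the saved v20-v23 state */
	return ret;
}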
Co-developed-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Signed-off-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Co-developed-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Artem Kuzin <artem.kuzin(a)huawei.com> --- arch/arm64/Kconfig | 15 ++ arch/arm64/configs/openeuler_defconfig | 2 + arch/arm64/include/asm/fpsimd.h | 15 ++ arch/arm64/include/asm/fpsimdmacros.h | 14 ++ arch/arm64/include/asm/neon.h | 28 ++++ arch/arm64/include/asm/processor.h | 10 ++ arch/arm64/include/asm/thread_info.h | 5 + arch/arm64/include/asm/uaccess.h | 218 ++++++++++++++++++++++++- arch/arm64/kernel/entry-fpsimd.S | 22 +++ arch/arm64/kernel/fpsimd.c | 102 +++++++++++- arch/arm64/kernel/process.c | 2 +- arch/arm64/lib/copy_from_user.S | 30 ++++ arch/arm64/lib/copy_template_fpsimd.S | 180 ++++++++++++++++++++ arch/arm64/lib/copy_to_user.S | 30 ++++ kernel/softirq.c | 34 ++++ kernel/sysctl.c | 34 ++++ 16 files changed, 734 insertions(+), 7 deletions(-) create mode 100644 arch/arm64/lib/copy_template_fpsimd.S diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index d3ce44c166ce..0cf5ab2d7574 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1828,6 +1828,21 @@ config ARM64_ILP32 is an ABI where long and pointers are 32bits but it uses the AARCH64 instruction set. +config USE_VECTORIZED_COPY + bool "Use vectorized instructions in copy_to/from user" + depends on KERNEL_MODE_NEON + default y + help + This option turns on vectorization to speed up copy_to/from_user routines. + +config VECTORIZED_COPY_VALIDATE + bool "Validate result of vectorized copy using regular implementation" + depends on KERNEL_MODE_NEON + depends on USE_VECTORIZED_COPY + default n + help + This option turns on vectorization to speed up copy_to/from_user routines. 
+ menuconfig AARCH32_EL0 bool "Kernel support for 32-bit EL0" depends on ARM64_4K_PAGES || EXPERT diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 8f97574813ca..dbad22bcbd57 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -499,6 +499,8 @@ CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY=y # CONFIG_RODATA_FULL_DEFAULT_ENABLED is not set # CONFIG_ARM64_SW_TTBR0_PAN is not set CONFIG_ARM64_TAGGED_ADDR_ABI=y +CONFIG_USE_VECTORIZED_COPY=y +# CONFIG_VECTORIZED_COPY_VALIDATE is not set CONFIG_AARCH32_EL0=y # CONFIG_KUSER_HELPERS is not set # CONFIG_COMPAT_ALIGNMENT_FIXUPS is not set diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index 40a99d8607fe..f71b3ac578c1 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -46,6 +46,21 @@ struct task_struct; +#ifdef CONFIG_USE_VECTORIZED_COPY +extern void fpsimd_save_state_light(struct fpsimd_state *state); +extern void fpsimd_load_state_light(struct fpsimd_state *state); +#else +static inline void fpsimd_save_state_light(struct fpsimd_state *state) +{ + (void) state; +} + +static inline void fpsimd_load_state_light(struct fpsimd_state *state) +{ + (void) state; +} +#endif + extern void fpsimd_save_state(struct user_fpsimd_state *state); extern void fpsimd_load_state(struct user_fpsimd_state *state); diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h index cdf6a35e3994..df9d3ed91931 100644 --- a/arch/arm64/include/asm/fpsimdmacros.h +++ b/arch/arm64/include/asm/fpsimdmacros.h @@ -8,6 +8,20 @@ #include <asm/assembler.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* Lightweight fpsimd context saving/restoration. + * Necessary for vectorized kernel memory movement + * implementation + */ +.macro fpsimd_save_light state + st1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm + +.macro fpsimd_restore_light state + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm +#endif + .macro fpsimd_save state, tmpnr stp q0, q1, [\state, #16 * 0] stp q2, q3, [\state, #16 * 2] diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h index d4b1d172a79b..ab84b194d7b3 100644 --- a/arch/arm64/include/asm/neon.h +++ b/arch/arm64/include/asm/neon.h @@ -16,4 +16,32 @@ void kernel_neon_begin(void); void kernel_neon_end(void); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void); +void kernel_fpsimd_end(void); +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state); +void _kernel_fpsimd_load(struct fpsimd_state *state); +#else +bool kernel_fpsimd_begin(void) +{ + return false; +} + +void kernel_fpsimd_end(void) +{ +} + +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + (void) state; +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + (void) state; +} +#endif + #endif /* ! 
__ASM_NEON_H */ diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index 9e688b1b13d4..9b81dbcd2126 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -153,6 +153,10 @@ struct cpu_context { unsigned long pc; }; +struct fpsimd_state { + __uint128_t v[4]; +}; + struct thread_struct { struct cpu_context cpu_context; /* cpu context */ @@ -196,6 +200,12 @@ struct thread_struct { KABI_RESERVE(6) KABI_RESERVE(7) KABI_RESERVE(8) +#ifdef CONFIG_USE_VECTORIZED_COPY + KABI_EXTEND( + struct fpsimd_state ustate; + struct fpsimd_state kstate; + ) +#endif }; static inline unsigned int thread_get_vl(struct thread_struct *thread, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 379d24059f5b..60d0be8a2d58 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -89,6 +89,9 @@ void arch_setup_new_exec(void); #define TIF_SME 27 /* SME in use */ #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */ #define TIF_32BIT_AARCH64 29 /* 32 bit process on AArch64(ILP32) */ +#define TIF_KERNEL_FPSIMD 31 /* Use FPSIMD in kernel */ +#define TIF_PRIV_UACC_ENABLED 32 /* Whether priviliged uaccess was manually enabled */ + #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -107,6 +110,8 @@ void arch_setup_new_exec(void); #define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT) #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL) #define _TIF_32BIT_AARCH64 (1 << TIF_32BIT_AARCH64) +#define _TIF_KERNEL_FPSIMD (1 << TIF_KERNEL_FPSIMD) +#define _TIF_PRIV_UACC_ENABLED (1 << TIF_PRIV_UACC_ENABLED) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index dd0877a75922..fc9f1a40624d 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -26,6 +26,10 @@ #include <asm/memory.h> #include <asm/extable.h> +#ifndef __GENKSYMS__ +#include <asm/neon.h> +#endif + static inline int __access_ok(const void __user *ptr, unsigned long size); /* @@ -134,7 +138,7 @@ static inline void __uaccess_enable_hw_pan(void) CONFIG_ARM64_PAN)); } -static inline void uaccess_disable_privileged(void) +static inline void __uaccess_disable_privileged(void) { mte_disable_tco(); @@ -144,7 +148,22 @@ static inline void uaccess_disable_privileged(void) __uaccess_enable_hw_pan(); } -static inline void uaccess_enable_privileged(void) +static inline void uaccess_disable_privileged(void) +{ + preempt_disable(); + + if (!test_and_clear_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_disable_privileged(); + + preempt_enable(); +} + +static inline void __uaccess_enable_privileged(void) { mte_enable_tco(); @@ -154,6 +173,47 @@ static inline void uaccess_enable_privileged(void) __uaccess_disable_hw_pan(); } +static inline void uaccess_enable_privileged(void) +{ + preempt_disable(); + + if (test_and_set_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_enable_privileged(); + + preempt_enable(); +} + +static inline void uaccess_priviliged_context_switch(struct task_struct *next) +{ + bool curr_enabled = !!test_thread_flag(TIF_PRIV_UACC_ENABLED); + bool next_enabled = !!test_ti_thread_flag(&next->thread_info, TIF_PRIV_UACC_ENABLED); + + if (curr_enabled == next_enabled) + return; + + if 
(curr_enabled) + __uaccess_disable_privileged(); + else + __uaccess_enable_privileged(); +} + +static inline void uaccess_priviliged_state_save(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_disable_privileged(); +} + +static inline void uaccess_priviliged_state_restore(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_enable_privileged(); +} + /* * Sanitize a uaccess pointer such that it cannot reach any kernel address. * @@ -391,7 +451,97 @@ do { \ } while (0); \ } while(0) -extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n); +#define USER_COPY_CHUNK_SIZE 4096 + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_from_user_threshold; + +#define verify_fpsimd_copy(to, from, n, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FPSIMD:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FPSIMD:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + __verify_ret; \ +}) + +#define compare_fpsimd_copy(to, from, n, ret_fpsimd, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FIXUP:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FIXUP:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + if (ret_fpsimd != ret) { \ + pr_err("FIXUP:%s difference between FPSIMD %lu and regular %lu\n", __func__, n - ret_fpsimd, n - ret); \ + __verify_ret |= 1; \ + } else { \ + __verify_ret = 0; \ + } \ + __verify_ret; \ +}) + +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_from_user_fpsimd(void *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_from_user(void *to, const void __user *from, unsigned long n) +{ + unsigned long __acfu_ret; + + if (sysctl_copy_from_user_threshold == -1 || n < sysctl_copy_from_user_threshold) { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user(to, + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __acfu_ret_fpsimd; + + uaccess_enable_privileged(); + __acfu_ret_fpsimd = __arch_copy_from_user_fpsimd((to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + __acfu_ret = __acfu_ret_fpsimd; + kernel_fpsimd_end(); +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret)) { + + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret_fpsimd, __acfu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + + return __acfu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + #define raw_copy_from_user(to, from, n) \ ({ \ unsigned long __acfu_ret; \ @@ -402,7 +552,66 @@ extern unsigned long __must_check __arch_copy_from_user(void *to, const void __u __acfu_ret; \ }) -extern unsigned long __must_check __arch_copy_to_user(void __user *to, const void *from, unsigned 
long n); +#endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_to_user_threshold; + +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_to_user_fpsimd(void __user *to, const void *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_to_user(void __user *to, const void *from, unsigned long n) +{ + unsigned long __actu_ret; + + + if (sysctl_copy_to_user_threshold == -1 || n < sysctl_copy_to_user_threshold) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __actu_ret_fpsimd; + + uaccess_enable_privileged(); + __actu_ret_fpsimd = __arch_copy_to_user_fpsimd(__uaccess_mask_ptr(to), + from, n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __actu_ret = __actu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret)) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret_fpsimd, __actu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } + } + + return __actu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + #define raw_copy_to_user(to, from, n) \ ({ \ unsigned long __actu_ret; \ @@ -412,6 +621,7 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi uaccess_ttbr0_disable(); \ __actu_ret; \ }) +#endif static __must_check __always_inline bool user_access_begin(const void __user *ptr, size_t len) { diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S index 6325db1a2179..6660465f1b7c 100644 --- a/arch/arm64/kernel/entry-fpsimd.S +++ b/arch/arm64/kernel/entry-fpsimd.S @@ -11,6 +11,28 @@ #include <asm/assembler.h> #include <asm/fpsimdmacros.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* + * Save the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_save_state_light) + fpsimd_save_light x0 + ret +SYM_FUNC_END(fpsimd_save_state_light) + +/* + * Load the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_load_state_light) + fpsimd_restore_light x0 + ret +SYM_FUNC_END(fpsimd_load_state_light) +#endif + /* * Save the FP registers. * diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 0137d987631e..19fcf3a3ac66 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1577,6 +1577,11 @@ void do_fpsimd_exc(unsigned long esr, struct pt_regs *regs) current); } +#ifdef CONFIG_USE_VECTORIZED_COPY +static void kernel_fpsimd_rollback_changes(void); +static void kernel_fpsimd_restore_changes(struct task_struct *tsk); +#endif + void fpsimd_thread_switch(struct task_struct *next) { bool wrong_task, wrong_cpu; @@ -1585,10 +1590,11 @@ void fpsimd_thread_switch(struct task_struct *next) return; __get_cpu_fpsimd_context(); - +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_rollback_changes(); +#endif /* Save unsaved fpsimd state, if any: */ fpsimd_save(); - /* * Fix up TIF_FOREIGN_FPSTATE to correctly describe next's * state. 
For kernel threads, FPSIMD registers are never loaded @@ -1601,6 +1607,9 @@ void fpsimd_thread_switch(struct task_struct *next) update_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE, wrong_task || wrong_cpu); +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_restore_changes(next); +#endif __put_cpu_fpsimd_context(); } @@ -1956,6 +1965,95 @@ void kernel_neon_end(void) } EXPORT_SYMBOL_GPL(kernel_neon_end); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void) +{ + if (WARN_ON(!system_capabilities_finalized()) || + !system_supports_fpsimd() || + in_irq() || irqs_disabled() || in_nmi()) + return false; + + preempt_disable(); + if (test_and_set_thread_flag(TIF_KERNEL_FPSIMD)) { + preempt_enable(); + + WARN_ON(1); + return false; + } + + /* + * Leaving streaming mode enabled will cause issues for any kernel + * NEON and leaving streaming mode or ZA enabled may increase power + * consumption. + */ + if (system_supports_sme()) + sme_smstop(); + + fpsimd_save_state_light(&current->thread.ustate); + preempt_enable(); + + return true; +} +EXPORT_SYMBOL(kernel_fpsimd_begin); + +void kernel_fpsimd_end(void) +{ + if (!system_supports_fpsimd()) + return; + + preempt_disable(); + if (test_and_clear_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(&current->thread.ustate); + + preempt_enable(); +} +EXPORT_SYMBOL(kernel_fpsimd_end); + +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_save_state_light(state); +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(state); +} + +static void kernel_fpsimd_rollback_changes(void) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&current->thread.kstate); + fpsimd_load_state_light(&current->thread.ustate); + } +} + +static void kernel_fpsimd_restore_changes(struct task_struct *tsk) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_ti_thread_flag(task_thread_info(tsk), TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&tsk->thread.ustate); + fpsimd_load_state_light(&tsk->thread.kstate); + } +} +#endif + #ifdef CONFIG_EFI static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 068e5bb2661b..bbeb36e671de 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -524,7 +524,7 @@ struct task_struct *__switch_to(struct task_struct *prev, struct task_struct *next) { struct task_struct *last; - + uaccess_priviliged_context_switch(next); fpsimd_thread_switch(next); tls_thread_switch(next); hw_breakpoint_thread_switch(next); diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 34e317907524..60dc63e10233 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -71,3 +71,33 @@ USER(9998f, ldtrb tmp1w, [srcin]) ret SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user) + + + +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + KERNEL_ME_SAFE(9998f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_from_user_fpsimd) + 
add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 // Nothing to copy + ret + + // Exception fixups +9997: cmp dst, dstin + b.ne 9998f + // Before being absolutely sure we couldn't copy anything, try harder +USER(9998f, ldtrb tmp1w, [srcin]) + strb tmp1w, [dst], #1 +9998: sub x0, end, dst // bytes not copied + ret +SYM_FUNC_END(__arch_copy_from_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_from_user_fpsimd) +#endif \ No newline at end of file diff --git a/arch/arm64/lib/copy_template_fpsimd.S b/arch/arm64/lib/copy_template_fpsimd.S new file mode 100644 index 000000000000..9b2e7ce1e4d2 --- /dev/null +++ b/arch/arm64/lib/copy_template_fpsimd.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + */ + +/* + * Copy a buffer from src to dest (alignment handled by the hardware) + * + * Parameters: + * x0 - dest + * x1 - src + * x2 - n + * Returns: + * x0 - dest + */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + +V_a .req v20 +V_b .req v21 +V_c .req v22 +V_d .req v23 + + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15_fpsimd + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned_fpsimd + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwriting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. + */ + tbz tmp2, #0, 1f + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 +1: + tbz tmp2, #1, 2f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +2: + tbz tmp2, #2, 3f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +3: + tbz tmp2, #3, .LSrcAligned_fpsimd + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 + +.LSrcAligned_fpsimd: + cmp count, #64 + b.ge .Lcpy_over64_fpsimd + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63_fpsimd: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15_fpsimd + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +1: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +2: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +.Ltiny15_fpsimd: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. 
+ */ + tbz count, #3, 1f + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 +1: + tbz count, #2, 2f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +2: + tbz count, #1, 3f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +3: + tbz count, #0, .Lexitfunc_fpsimd + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 + + b .Lexitfunc_fpsimd + +.Lcpy_over64_fpsimd: + subs count, count, #128 + b.ge .Lcpy_body_large_fpsimd + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 + ldp1 B_l, B_h, src, #16 + ldp1 C_l, C_h, src, #16 + stp1 B_l, B_h, dst, #16 + stp1 C_l, C_h, dst, #16 + ldp1 D_l, D_h, src, #16 + stp1 D_l, D_h, dst, #16 + + tst count, #0x3f + b.ne .Ltail63_fpsimd + b .Lexitfunc_fpsimd + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large_fpsimd: + /* pre-get 64 bytes data. */ + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add src, src, #64 + +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add dst, dst, #64 + add src, src, #64 + + subs count, count, #64 + b.ge 1b + + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + add dst, dst, #64 + + tst count, #0x3f + b.ne .Ltail63_fpsimd +.Lexitfunc_fpsimd: diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 2ac716c0d6d8..c190e5f8a989 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -71,3 +71,33 @@ USER(9998f, sttrb tmp1w, [dst]) ret SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user) + + +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro ldsve reg1, reg2, reg3, reg4, ptr + KERNEL_ME_SAFE(9998f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_to_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret + + // Exception fixups +9997: cmp dst, dstin + b.ne 9998f + // Before being absolutely sure we couldn't copy anything, try harder +KERNEL_ME_SAFE(9998f, ldrb tmp1w, [srcin]) +USER(9998f, sttrb tmp1w, [dst]) + add dst, dst, #1 +9998: sub x0, end, dst // bytes not copied + ret +SYM_FUNC_END(__arch_copy_to_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_to_user_fpsimd) +#endif diff --git a/kernel/softirq.c b/kernel/softirq.c index cd8770b2f76c..e8ce3275a099 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -30,6 +30,10 @@ #include <asm/softirq_stack.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +#include <asm/fpsimd.h> +#endif + #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -517,6 +521,9 @@ static void handle_softirqs(bool ksirqd) __u32 pending; int softirq_bit; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif /* * Mask out PF_MEMALLOC as the current task context is borrowed for the * softirq. 
A softirq handled, such as network RX, might set PF_MEMALLOC @@ -526,10 +533,16 @@ static void handle_softirqs(bool ksirqd) pending = local_softirq_pending(); + softirq_handle_begin(); in_hardirq = lockdep_softirq_start(); account_softirq_enter(current); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + restart: /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); @@ -578,7 +591,14 @@ static void handle_softirqs(bool ksirqd) account_softirq_exit(current); lockdep_softirq_end(in_hardirq); + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif + softirq_handle_end(); + current_restore_flags(old_flags, PF_MEMALLOC); } @@ -812,12 +832,21 @@ static void tasklet_action_common(struct softirq_action *a, { struct tasklet_struct *list; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif + local_irq_disable(); list = tl_head->head; tl_head->head = NULL; tl_head->tail = &tl_head->head; local_irq_enable(); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + while (list) { struct tasklet_struct *t = list; @@ -849,6 +878,11 @@ static void tasklet_action_common(struct softirq_action *a, __raise_softirq_irqoff(softirq_nr); local_irq_enable(); } + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif } static __latent_entropy void tasklet_action(struct softirq_action *a) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e84df0818517..6f8e22102bdc 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -137,6 +137,17 @@ int sysctl_legacy_va_layout; #endif /* CONFIG_SYSCTL */ +#ifdef CONFIG_USE_VECTORIZED_COPY +int sysctl_copy_to_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_to_user_threshold); + +int sysctl_copy_from_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_from_user_threshold); + +int sysctl_copy_in_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_in_user_threshold); +#endif + /* * /proc/sys support */ @@ -2250,6 +2261,29 @@ static struct ctl_table vm_table[] = { .extra1 = (void *)&mmap_rnd_compat_bits_min, .extra2 = (void *)&mmap_rnd_compat_bits_max, }, +#endif +#ifdef CONFIG_USE_VECTORIZED_COPY + { + .procname = "copy_to_user_threshold", + .data = &sysctl_copy_to_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_from_user_threshold", + .data = &sysctl_copy_from_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_in_user_threshold", + .data = &sysctl_copy_in_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, #endif { } }; -- 2.34.1
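The three sysctl knobs added at the end of this patch are plain integers handled by proc_dointvec, so the feature can be switched at runtime from user space. Below is a minimal, illustrative C sketch (not part of the patch) that enables the vectorized path for copies of 64 KiB and larger; the 64 KiB value is an arbitrary example, not a tuned recommendation.

/*
 * Illustrative user-space helper, not part of the patch: write a byte
 * threshold into the sysctl entries added above so that copies of at
 * least that size take the FPSIMD-assisted path. Requires root.
 */
#include <stdio.h>

static int write_threshold(const char *path, long bytes)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%ld\n", bytes);
	return fclose(f);
}

int main(void)
{
	const long bytes = 64 * 1024;	/* example threshold: 64 KiB */

	write_threshold("/proc/sys/vm/copy_from_user_threshold", bytes);
	write_threshold("/proc/sys/vm/copy_to_user_threshold", bytes);
	write_threshold("/proc/sys/vm/copy_in_user_threshold", bytes);
	return 0;
}

Writing -1 back into the same files restores the default behaviour, in which the regular copy routines are always used.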
[PATCH v1 OLK-6.6] Add copy to/from/in user with vectorization support
by Nikita Panov 28 Jan '26

From: Artem Kuzin <artem.kuzin(a)huawei.com>

kunpeng inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8445

-------------------------------------------------

1. This implementation uses st1/ld1 4-vector instructions, which allow copying 64 bytes at once
2. The vectorized copy code is used only if the data block to copy is larger than 128 bytes
3. To use this functionality you need to set the configuration switch CONFIG_USE_VECTORIZED_COPY=y
4. The code can be used on any ARMv8 variant
5. In-kernel copy functions such as memcpy are not supported now, but can be enabled in the future
6. For now we use a lightweight version of register context saving/restoration (4 registers)

We introduce support of vectorization for the copy_from/to/in_user functions. Currently it works in parallel with the original FPSIMD/SVE vectorization and does not affect it in any way. We have a special flag in the task struct, TIF_KERNEL_FPSIMD, that is set while lightweight vectorization is in use in the kernel. The task struct has been extended with two fields: a user space fpsimd state and a kernel fpsimd state. The user space fpsimd state is used by the kernel_fpsimd_begin() and kernel_fpsimd_end() functions that wrap lightweight FPSIMD context usage in kernel space. The kernel fpsimd state is used to manage thread switches.

There is no support for nested calls of kernel_neon_begin()/kernel_fpsimd_begin(), and there are no plans to support this in the future; it is not necessary. We save the lightweight FPSIMD context in kernel_fpsimd_begin() and restore it in kernel_fpsimd_end(). On thread switch we preserve the kernel FPSIMD context and restore the user space one, if any. This prevents corruption of the user space FPSIMD state. Before switching to the next thread we restore its kernel FPSIMD context, if any. It is allowed to use FPSIMD in bottom halves, because in case of BH preemption we check the TIF_KERNEL_FPSIMD flag and save/restore the contexts. Context management is quite lightweight and is executed only when the TIF_KERNEL_FPSIMD flag is set.

To enable this feature, you need to manually modify one of the appropriate entries:
/proc/sys/vm/copy_from_user_threshold
/proc/sys/vm/copy_in_user_threshold
/proc/sys/vm/copy_to_user_threshold

Allowed values are the following:
-1 - feature disabled (default)
0 - feature always enabled
n (n > 0) - feature enabled if the copied size is at least n bytes

P.S.: What I personally don't like in the current approach:
1. The additional fields and flag in the task struct look quite ugly
2. There is no way to configure the size of the chunk copied with FPSIMD from user space
3. FPSIMD-based memory movement is not generic; it needs to be enabled for memmove(), memcpy() and friends in the future.
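For readers skimming the description above, the following condensed sketch (editorial, not taken from the patch) shows the calling pattern that kernel_fpsimd_begin()/kernel_fpsimd_end() are meant to support. The hypothetical do_fpsimd_copy_from_user() helper only illustrates the begin/fallback/end shape; the real raw_copy_from_user() wrapper in the uaccess.h hunk below additionally masks the user pointer, toggles privileged uaccess and honours the sysctl threshold.

#include <asm/neon.h>		/* kernel_fpsimd_begin()/kernel_fpsimd_end() */
#include <asm/uaccess.h>	/* __arch_copy_from_user{,_fpsimd}() */

/*
 * Sketch only: bracket the vectorized copy with the lightweight FPSIMD
 * context helpers and fall back to the regular routine when FPSIMD
 * cannot be used (IRQ context, nesting, no FPSIMD support).
 */
static unsigned long do_fpsimd_copy_from_user(void *to,
					      const void __user *from,
					      unsigned long n)
{
	unsigned long ret;

	if (!kernel_fpsimd_begin())
		return __arch_copy_from_user(to, from, n);

	/* v20-v23 are saved in current->thread.ustate, safe to clobber */
	ret = __arch_copy_from_user_fpsimd(to, from, n);

	kernel_fpsimd_end();	/* restore the saved user-space FPSIMD state */

	return ret;
}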
Co-developed-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Signed-off-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Co-developed-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Artem Kuzin <artem.kuzin(a)huawei.com> --- arch/arm64/Kconfig | 15 ++ arch/arm64/configs/openeuler_defconfig | 2 + arch/arm64/include/asm/fpsimd.h | 15 ++ arch/arm64/include/asm/fpsimdmacros.h | 14 ++ arch/arm64/include/asm/neon.h | 28 ++++ arch/arm64/include/asm/processor.h | 10 ++ arch/arm64/include/asm/thread_info.h | 5 + arch/arm64/include/asm/uaccess.h | 218 ++++++++++++++++++++++++- arch/arm64/kernel/entry-fpsimd.S | 22 +++ arch/arm64/kernel/fpsimd.c | 102 +++++++++++- arch/arm64/kernel/process.c | 2 +- arch/arm64/lib/copy_from_user.S | 30 ++++ arch/arm64/lib/copy_template_fpsimd.S | 180 ++++++++++++++++++++ arch/arm64/lib/copy_to_user.S | 30 ++++ kernel/softirq.c | 34 ++++ kernel/sysctl.c | 34 ++++ 16 files changed, 734 insertions(+), 7 deletions(-) create mode 100644 arch/arm64/lib/copy_template_fpsimd.S diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index ef8c524a296d..15ec2232994a 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1870,6 +1870,21 @@ config ARM64_ILP32 is an ABI where long and pointers are 32bits but it uses the AARCH64 instruction set. +config USE_VECTORIZED_COPY + bool "Use vectorized instructions in copy_to/from user" + depends on KERNEL_MODE_NEON + default y + help + This option turns on vectorization to speed up copy_to/from_user routines. + +config VECTORIZED_COPY_VALIDATE + bool "Validate result of vectorized copy using regular implementation" + depends on KERNEL_MODE_NEON + depends on USE_VECTORIZED_COPY + default n + help + This option turns on vectorization to speed up copy_to/from_user routines. 
+ menuconfig AARCH32_EL0 bool "Kernel support for 32-bit EL0" depends on ARM64_4K_PAGES || EXPERT diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 425616aa8422..331077d556ca 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -525,6 +525,8 @@ CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY=y # CONFIG_RODATA_FULL_DEFAULT_ENABLED is not set # CONFIG_ARM64_SW_TTBR0_PAN is not set CONFIG_ARM64_TAGGED_ADDR_ABI=y +CONFIG_USE_VECTORIZED_COPY=y +# CONFIG_VECTORIZED_COPY_VALIDATE is not set CONFIG_AARCH32_EL0=y # CONFIG_KUSER_HELPERS is not set # CONFIG_COMPAT_ALIGNMENT_FIXUPS is not set diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index b6c6949984d8..1fc9089b4a47 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -46,6 +46,21 @@ struct task_struct; +#ifdef CONFIG_USE_VECTORIZED_COPY +extern void fpsimd_save_state_light(struct fpsimd_state *state); +extern void fpsimd_load_state_light(struct fpsimd_state *state); +#else +static inline void fpsimd_save_state_light(struct fpsimd_state *state) +{ + (void) state; +} + +static inline void fpsimd_load_state_light(struct fpsimd_state *state) +{ + (void) state; +} +#endif + extern void fpsimd_save_state(struct user_fpsimd_state *state); extern void fpsimd_load_state(struct user_fpsimd_state *state); diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h index cdf6a35e3994..df9d3ed91931 100644 --- a/arch/arm64/include/asm/fpsimdmacros.h +++ b/arch/arm64/include/asm/fpsimdmacros.h @@ -8,6 +8,20 @@ #include <asm/assembler.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* Lightweight fpsimd context saving/restoration. + * Necessary for vectorized kernel memory movement + * implementation + */ +.macro fpsimd_save_light state + st1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm + +.macro fpsimd_restore_light state + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm +#endif + .macro fpsimd_save state, tmpnr stp q0, q1, [\state, #16 * 0] stp q2, q3, [\state, #16 * 2] diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h index d4b1d172a79b..ab84b194d7b3 100644 --- a/arch/arm64/include/asm/neon.h +++ b/arch/arm64/include/asm/neon.h @@ -16,4 +16,32 @@ void kernel_neon_begin(void); void kernel_neon_end(void); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void); +void kernel_fpsimd_end(void); +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state); +void _kernel_fpsimd_load(struct fpsimd_state *state); +#else +bool kernel_fpsimd_begin(void) +{ + return false; +} + +void kernel_fpsimd_end(void) +{ +} + +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + (void) state; +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + (void) state; +} +#endif + #endif /* ! 
__ASM_NEON_H */ diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index 9e688b1b13d4..9b81dbcd2126 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -153,6 +153,10 @@ struct cpu_context { unsigned long pc; }; +struct fpsimd_state { + __uint128_t v[4]; +}; + struct thread_struct { struct cpu_context cpu_context; /* cpu context */ @@ -196,6 +200,12 @@ struct thread_struct { KABI_RESERVE(6) KABI_RESERVE(7) KABI_RESERVE(8) +#ifdef CONFIG_USE_VECTORIZED_COPY + KABI_EXTEND( + struct fpsimd_state ustate; + struct fpsimd_state kstate; + ) +#endif }; static inline unsigned int thread_get_vl(struct thread_struct *thread, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 379d24059f5b..60d0be8a2d58 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -89,6 +89,9 @@ void arch_setup_new_exec(void); #define TIF_SME 27 /* SME in use */ #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */ #define TIF_32BIT_AARCH64 29 /* 32 bit process on AArch64(ILP32) */ +#define TIF_KERNEL_FPSIMD 31 /* Use FPSIMD in kernel */ +#define TIF_PRIV_UACC_ENABLED 32 /* Whether priviliged uaccess was manually enabled */ + #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -107,6 +110,8 @@ void arch_setup_new_exec(void); #define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT) #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL) #define _TIF_32BIT_AARCH64 (1 << TIF_32BIT_AARCH64) +#define _TIF_KERNEL_FPSIMD (1 << TIF_KERNEL_FPSIMD) +#define _TIF_PRIV_UACC_ENABLED (1 << TIF_PRIV_UACC_ENABLED) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index dd0877a75922..fc9f1a40624d 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -26,6 +26,10 @@ #include <asm/memory.h> #include <asm/extable.h> +#ifndef __GENKSYMS__ +#include <asm/neon.h> +#endif + static inline int __access_ok(const void __user *ptr, unsigned long size); /* @@ -134,7 +138,7 @@ static inline void __uaccess_enable_hw_pan(void) CONFIG_ARM64_PAN)); } -static inline void uaccess_disable_privileged(void) +static inline void __uaccess_disable_privileged(void) { mte_disable_tco(); @@ -144,7 +148,22 @@ static inline void uaccess_disable_privileged(void) __uaccess_enable_hw_pan(); } -static inline void uaccess_enable_privileged(void) +static inline void uaccess_disable_privileged(void) +{ + preempt_disable(); + + if (!test_and_clear_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_disable_privileged(); + + preempt_enable(); +} + +static inline void __uaccess_enable_privileged(void) { mte_enable_tco(); @@ -154,6 +173,47 @@ static inline void uaccess_enable_privileged(void) __uaccess_disable_hw_pan(); } +static inline void uaccess_enable_privileged(void) +{ + preempt_disable(); + + if (test_and_set_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_enable_privileged(); + + preempt_enable(); +} + +static inline void uaccess_priviliged_context_switch(struct task_struct *next) +{ + bool curr_enabled = !!test_thread_flag(TIF_PRIV_UACC_ENABLED); + bool next_enabled = !!test_ti_thread_flag(&next->thread_info, TIF_PRIV_UACC_ENABLED); + + if (curr_enabled == next_enabled) + return; + + if 
(curr_enabled) + __uaccess_disable_privileged(); + else + __uaccess_enable_privileged(); +} + +static inline void uaccess_priviliged_state_save(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_disable_privileged(); +} + +static inline void uaccess_priviliged_state_restore(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_enable_privileged(); +} + /* * Sanitize a uaccess pointer such that it cannot reach any kernel address. * @@ -391,7 +451,97 @@ do { \ } while (0); \ } while(0) -extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n); +#define USER_COPY_CHUNK_SIZE 4096 + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_from_user_threshold; + +#define verify_fpsimd_copy(to, from, n, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FPSIMD:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FPSIMD:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + __verify_ret; \ +}) + +#define compare_fpsimd_copy(to, from, n, ret_fpsimd, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FIXUP:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FIXUP:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + if (ret_fpsimd != ret) { \ + pr_err("FIXUP:%s difference between FPSIMD %lu and regular %lu\n", __func__, n - ret_fpsimd, n - ret); \ + __verify_ret |= 1; \ + } else { \ + __verify_ret = 0; \ + } \ + __verify_ret; \ +}) + +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_from_user_fpsimd(void *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_from_user(void *to, const void __user *from, unsigned long n) +{ + unsigned long __acfu_ret; + + if (sysctl_copy_from_user_threshold == -1 || n < sysctl_copy_from_user_threshold) { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user(to, + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __acfu_ret_fpsimd; + + uaccess_enable_privileged(); + __acfu_ret_fpsimd = __arch_copy_from_user_fpsimd((to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + __acfu_ret = __acfu_ret_fpsimd; + kernel_fpsimd_end(); +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret)) { + + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret_fpsimd, __acfu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + + return __acfu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + #define raw_copy_from_user(to, from, n) \ ({ \ unsigned long __acfu_ret; \ @@ -402,7 +552,66 @@ extern unsigned long __must_check __arch_copy_from_user(void *to, const void __u __acfu_ret; \ }) -extern unsigned long __must_check __arch_copy_to_user(void __user *to, const void *from, unsigned 
long n); +#endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_to_user_threshold; + +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_to_user_fpsimd(void __user *to, const void *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_to_user(void __user *to, const void *from, unsigned long n) +{ + unsigned long __actu_ret; + + + if (sysctl_copy_to_user_threshold == -1 || n < sysctl_copy_to_user_threshold) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __actu_ret_fpsimd; + + uaccess_enable_privileged(); + __actu_ret_fpsimd = __arch_copy_to_user_fpsimd(__uaccess_mask_ptr(to), + from, n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __actu_ret = __actu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret)) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret_fpsimd, __actu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } + } + + return __actu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + #define raw_copy_to_user(to, from, n) \ ({ \ unsigned long __actu_ret; \ @@ -412,6 +621,7 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi uaccess_ttbr0_disable(); \ __actu_ret; \ }) +#endif static __must_check __always_inline bool user_access_begin(const void __user *ptr, size_t len) { diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S index 6325db1a2179..6660465f1b7c 100644 --- a/arch/arm64/kernel/entry-fpsimd.S +++ b/arch/arm64/kernel/entry-fpsimd.S @@ -11,6 +11,28 @@ #include <asm/assembler.h> #include <asm/fpsimdmacros.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* + * Save the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_save_state_light) + fpsimd_save_light x0 + ret +SYM_FUNC_END(fpsimd_save_state_light) + +/* + * Load the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_load_state_light) + fpsimd_restore_light x0 + ret +SYM_FUNC_END(fpsimd_load_state_light) +#endif + /* * Save the FP registers. * diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index b86a50646700..103559cccb07 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1579,6 +1579,11 @@ void do_fpsimd_exc(unsigned long esr, struct pt_regs *regs) current); } +#ifdef CONFIG_USE_VECTORIZED_COPY +static void kernel_fpsimd_rollback_changes(void); +static void kernel_fpsimd_restore_changes(struct task_struct *tsk); +#endif + void fpsimd_thread_switch(struct task_struct *next) { bool wrong_task, wrong_cpu; @@ -1587,10 +1592,11 @@ void fpsimd_thread_switch(struct task_struct *next) return; __get_cpu_fpsimd_context(); - +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_rollback_changes(); +#endif /* Save unsaved fpsimd state, if any: */ fpsimd_save(); - /* * Fix up TIF_FOREIGN_FPSTATE to correctly describe next's * state. 
For kernel threads, FPSIMD registers are never loaded @@ -1603,6 +1609,9 @@ void fpsimd_thread_switch(struct task_struct *next) update_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE, wrong_task || wrong_cpu); +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_restore_changes(next); +#endif __put_cpu_fpsimd_context(); } @@ -1933,6 +1942,95 @@ void kernel_neon_end(void) } EXPORT_SYMBOL_GPL(kernel_neon_end); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void) +{ + if (WARN_ON(!system_capabilities_finalized()) || + !system_supports_fpsimd() || + in_irq() || irqs_disabled() || in_nmi()) + return false; + + preempt_disable(); + if (test_and_set_thread_flag(TIF_KERNEL_FPSIMD)) { + preempt_enable(); + + WARN_ON(1); + return false; + } + + /* + * Leaving streaming mode enabled will cause issues for any kernel + * NEON and leaving streaming mode or ZA enabled may increase power + * consumption. + */ + if (system_supports_sme()) + sme_smstop(); + + fpsimd_save_state_light(&current->thread.ustate); + preempt_enable(); + + return true; +} +EXPORT_SYMBOL(kernel_fpsimd_begin); + +void kernel_fpsimd_end(void) +{ + if (!system_supports_fpsimd()) + return; + + preempt_disable(); + if (test_and_clear_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(&current->thread.ustate); + + preempt_enable(); +} +EXPORT_SYMBOL(kernel_fpsimd_end); + +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_save_state_light(state); +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(state); +} + +static void kernel_fpsimd_rollback_changes(void) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&current->thread.kstate); + fpsimd_load_state_light(&current->thread.ustate); + } +} + +static void kernel_fpsimd_restore_changes(struct task_struct *tsk) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_ti_thread_flag(task_thread_info(tsk), TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&tsk->thread.ustate); + fpsimd_load_state_light(&tsk->thread.kstate); + } +} +#endif + #ifdef CONFIG_EFI static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index e9e5ce956f15..fd895189cb7e 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -529,7 +529,7 @@ struct task_struct *__switch_to(struct task_struct *prev, struct task_struct *next) { struct task_struct *last; - + uaccess_priviliged_context_switch(next); fpsimd_thread_switch(next); tls_thread_switch(next); hw_breakpoint_thread_switch(next); diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 34e317907524..60dc63e10233 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -71,3 +71,33 @@ USER(9998f, ldtrb tmp1w, [srcin]) ret SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user) + + + +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + KERNEL_ME_SAFE(9998f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_from_user_fpsimd) + 
add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 // Nothing to copy + ret + + // Exception fixups +9997: cmp dst, dstin + b.ne 9998f + // Before being absolutely sure we couldn't copy anything, try harder +USER(9998f, ldtrb tmp1w, [srcin]) + strb tmp1w, [dst], #1 +9998: sub x0, end, dst // bytes not copied + ret +SYM_FUNC_END(__arch_copy_from_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_from_user_fpsimd) +#endif \ No newline at end of file diff --git a/arch/arm64/lib/copy_template_fpsimd.S b/arch/arm64/lib/copy_template_fpsimd.S new file mode 100644 index 000000000000..9b2e7ce1e4d2 --- /dev/null +++ b/arch/arm64/lib/copy_template_fpsimd.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + */ + +/* + * Copy a buffer from src to dest (alignment handled by the hardware) + * + * Parameters: + * x0 - dest + * x1 - src + * x2 - n + * Returns: + * x0 - dest + */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + +V_a .req v20 +V_b .req v21 +V_c .req v22 +V_d .req v23 + + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15_fpsimd + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned_fpsimd + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwriting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. + */ + tbz tmp2, #0, 1f + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 +1: + tbz tmp2, #1, 2f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +2: + tbz tmp2, #2, 3f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +3: + tbz tmp2, #3, .LSrcAligned_fpsimd + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 + +.LSrcAligned_fpsimd: + cmp count, #64 + b.ge .Lcpy_over64_fpsimd + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63_fpsimd: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15_fpsimd + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +1: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +2: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +.Ltiny15_fpsimd: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. 
+ */ + tbz count, #3, 1f + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 +1: + tbz count, #2, 2f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +2: + tbz count, #1, 3f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +3: + tbz count, #0, .Lexitfunc_fpsimd + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 + + b .Lexitfunc_fpsimd + +.Lcpy_over64_fpsimd: + subs count, count, #128 + b.ge .Lcpy_body_large_fpsimd + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 + ldp1 B_l, B_h, src, #16 + ldp1 C_l, C_h, src, #16 + stp1 B_l, B_h, dst, #16 + stp1 C_l, C_h, dst, #16 + ldp1 D_l, D_h, src, #16 + stp1 D_l, D_h, dst, #16 + + tst count, #0x3f + b.ne .Ltail63_fpsimd + b .Lexitfunc_fpsimd + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large_fpsimd: + /* pre-get 64 bytes data. */ + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add src, src, #64 + +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add dst, dst, #64 + add src, src, #64 + + subs count, count, #64 + b.ge 1b + + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + add dst, dst, #64 + + tst count, #0x3f + b.ne .Ltail63_fpsimd +.Lexitfunc_fpsimd: diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 2ac716c0d6d8..c190e5f8a989 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -71,3 +71,33 @@ USER(9998f, sttrb tmp1w, [dst]) ret SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user) + + +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro ldsve reg1, reg2, reg3, reg4, ptr + KERNEL_ME_SAFE(9998f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_to_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret + + // Exception fixups +9997: cmp dst, dstin + b.ne 9998f + // Before being absolutely sure we couldn't copy anything, try harder +KERNEL_ME_SAFE(9998f, ldrb tmp1w, [srcin]) +USER(9998f, sttrb tmp1w, [dst]) + add dst, dst, #1 +9998: sub x0, end, dst // bytes not copied + ret +SYM_FUNC_END(__arch_copy_to_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_to_user_fpsimd) +#endif diff --git a/kernel/softirq.c b/kernel/softirq.c index bd10ff418865..9935a11be1e8 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -30,6 +30,10 @@ #include <asm/softirq_stack.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +#include <asm/fpsimd.h> +#endif + #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -542,6 +546,9 @@ static void handle_softirqs(bool ksirqd) __u32 pending; int softirq_bit; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif /* * Mask out PF_MEMALLOC as the current task context is borrowed for the * softirq. 
A softirq handled, such as network RX, might set PF_MEMALLOC @@ -551,10 +558,16 @@ static void handle_softirqs(bool ksirqd) pending = local_softirq_pending(); + softirq_handle_begin(); in_hardirq = lockdep_softirq_start(); account_softirq_enter(current); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + restart: /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); @@ -603,7 +616,14 @@ static void handle_softirqs(bool ksirqd) account_softirq_exit(current); lockdep_softirq_end(in_hardirq); + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif + softirq_handle_end(); + current_restore_flags(old_flags, PF_MEMALLOC); } @@ -837,12 +857,21 @@ static void tasklet_action_common(struct softirq_action *a, { struct tasklet_struct *list; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif + local_irq_disable(); list = tl_head->head; tl_head->head = NULL; tl_head->tail = &tl_head->head; local_irq_enable(); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + while (list) { struct tasklet_struct *t = list; @@ -874,6 +903,11 @@ static void tasklet_action_common(struct softirq_action *a, __raise_softirq_irqoff(softirq_nr); local_irq_enable(); } + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif } static __latent_entropy void tasklet_action(struct softirq_action *a) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e84df0818517..6f8e22102bdc 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -137,6 +137,17 @@ int sysctl_legacy_va_layout; #endif /* CONFIG_SYSCTL */ +#ifdef CONFIG_USE_VECTORIZED_COPY +int sysctl_copy_to_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_to_user_threshold); + +int sysctl_copy_from_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_from_user_threshold); + +int sysctl_copy_in_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_in_user_threshold); +#endif + /* * /proc/sys support */ @@ -2250,6 +2261,29 @@ static struct ctl_table vm_table[] = { .extra1 = (void *)&mmap_rnd_compat_bits_min, .extra2 = (void *)&mmap_rnd_compat_bits_max, }, +#endif +#ifdef CONFIG_USE_VECTORIZED_COPY + { + .procname = "copy_to_user_threshold", + .data = &sysctl_copy_to_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_from_user_threshold", + .data = &sysctl_copy_from_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_in_user_threshold", + .data = &sysctl_copy_in_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, #endif { } }; -- 2.34.1
[PATCH v3 OLK-5.10] Add copy to/from/in user with vectorization support
by Nikita Panov 28 Jan '26

From: Artem Kuzin <artem.kuzin(a)huawei.com>

kunpeng inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8445

-------------------------------------------------

1. This implementation uses st1/ld1 4-vector instructions, which allow copying 64 bytes at once
2. The vectorized copy code is used only if the data block to copy is larger than 128 bytes
3. To use this functionality you need to set the configuration switch CONFIG_USE_VECTORIZED_COPY=y
4. The code can be used on any ARMv8 variant
5. In-kernel copy functions such as memcpy are not supported now, but can be enabled in the future
6. For now we use a lightweight version of register context saving/restoration (4 registers)

We introduce support of vectorization for the copy_from/to/in_user functions. Currently it works in parallel with the original FPSIMD/SVE vectorization and does not affect it in any way. We have a special flag in the task struct, TIF_KERNEL_FPSIMD, that is set while lightweight vectorization is in use in the kernel. The task struct has been extended with two fields: a user space fpsimd state and a kernel fpsimd state. The user space fpsimd state is used by the kernel_fpsimd_begin() and kernel_fpsimd_end() functions that wrap lightweight FPSIMD context usage in kernel space. The kernel fpsimd state is used to manage thread switches.

There is no support for nested calls of kernel_neon_begin()/kernel_fpsimd_begin(), and there are no plans to support this in the future; it is not necessary. We save the lightweight FPSIMD context in kernel_fpsimd_begin() and restore it in kernel_fpsimd_end(). On thread switch we preserve the kernel FPSIMD context and restore the user space one, if any. This prevents corruption of the user space FPSIMD state. Before switching to the next thread we restore its kernel FPSIMD context, if any. It is allowed to use FPSIMD in bottom halves, because in case of BH preemption we check the TIF_KERNEL_FPSIMD flag and save/restore the contexts. Context management is quite lightweight and is executed only when the TIF_KERNEL_FPSIMD flag is set.

To enable this feature, you need to manually modify one of the appropriate entries:
/proc/sys/vm/copy_from_user_threshold
/proc/sys/vm/copy_in_user_threshold
/proc/sys/vm/copy_to_user_threshold

Allowed values are the following:
-1 - feature disabled (default)
0 - feature always enabled
n (n > 0) - feature enabled if the copied size is at least n bytes

P.S.: What I personally don't like in the current approach:
1. The additional fields and flag in the task struct look quite ugly
2. There is no way to configure the size of the chunk copied with FPSIMD from user space
3. FPSIMD-based memory movement is not generic; it needs to be enabled for memmove(), memcpy() and friends in the future.
Co-developed-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Signed-off-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Co-developed-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Artem Kuzin <artem.kuzin(a)huawei.com> --- arch/arm64/Kconfig | 15 ++ arch/arm64/configs/openeuler_defconfig | 2 + arch/arm64/include/asm/fpsimd.h | 15 ++ arch/arm64/include/asm/fpsimdmacros.h | 14 ++ arch/arm64/include/asm/neon.h | 28 +++ arch/arm64/include/asm/processor.h | 10 + arch/arm64/include/asm/thread_info.h | 4 + arch/arm64/include/asm/uaccess.h | 274 ++++++++++++++++++++++++- arch/arm64/kernel/entry-fpsimd.S | 22 ++ arch/arm64/kernel/fpsimd.c | 102 ++++++++- arch/arm64/kernel/process.c | 2 +- arch/arm64/lib/copy_from_user.S | 18 ++ arch/arm64/lib/copy_in_user.S | 19 ++ arch/arm64/lib/copy_template_fpsimd.S | 180 ++++++++++++++++ arch/arm64/lib/copy_to_user.S | 19 ++ kernel/softirq.c | 31 ++- kernel/sysctl.c | 35 ++++ 17 files changed, 782 insertions(+), 8 deletions(-) create mode 100644 arch/arm64/lib/copy_template_fpsimd.S diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index eb30ef59aca2..959af31f7e70 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1470,6 +1470,21 @@ config ARM64_ILP32 is an ABI where long and pointers are 32bits but it uses the AARCH64 instruction set. +config USE_VECTORIZED_COPY + bool "Use vectorized instructions in copy_to/from user" + depends on KERNEL_MODE_NEON + default y + help + This option turns on vectorization to speed up copy_to/from_user routines. + +config VECTORIZED_COPY_VALIDATE + bool "Validate result of vectorized copy using regular implementation" + depends on KERNEL_MODE_NEON + depends on USE_VECTORIZED_COPY + default n + help + This option turns on vectorization to speed up copy_to/from_user routines. 
+ menuconfig AARCH32_EL0 bool "Kernel support for 32-bit EL0" depends on ARM64_4K_PAGES || EXPERT diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index be1faf2da008..84408352a95e 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -484,6 +484,8 @@ CONFIG_ARM64_PMEM_RESERVE=y CONFIG_ARM64_PMEM_LEGACY=m # CONFIG_ARM64_SW_TTBR0_PAN is not set CONFIG_ARM64_TAGGED_ADDR_ABI=y +CONFIG_USE_VECTORIZED_COPY=y +# CONFIG_VECTORIZED_COPY_VALIDATE is not set CONFIG_AARCH32_EL0=y # CONFIG_KUSER_HELPERS is not set CONFIG_ARMV8_DEPRECATED=y diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index 22f6c6e23441..cb53767105ef 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -46,6 +46,21 @@ struct task_struct; +#ifdef CONFIG_USE_VECTORIZED_COPY +extern void fpsimd_save_state_light(struct fpsimd_state *state); +extern void fpsimd_load_state_light(struct fpsimd_state *state); +#else +static inline void fpsimd_save_state_light(struct fpsimd_state *state) +{ + (void) state; +} + +static inline void fpsimd_load_state_light(struct fpsimd_state *state) +{ + (void) state; +} +#endif + extern void fpsimd_save_state(struct user_fpsimd_state *state); extern void fpsimd_load_state(struct user_fpsimd_state *state); diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h index ea2577e159f6..62f5f8a0540a 100644 --- a/arch/arm64/include/asm/fpsimdmacros.h +++ b/arch/arm64/include/asm/fpsimdmacros.h @@ -8,6 +8,20 @@ #include <asm/assembler.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* Lightweight fpsimd context saving/restoration. + * Necessary for vectorized kernel memory movement + * implementation + */ +.macro fpsimd_save_light state + st1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm + +.macro fpsimd_restore_light state + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm +#endif + .macro fpsimd_save state, tmpnr stp q0, q1, [\state, #16 * 0] stp q2, q3, [\state, #16 * 2] diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h index d4b1d172a79b..ab84b194d7b3 100644 --- a/arch/arm64/include/asm/neon.h +++ b/arch/arm64/include/asm/neon.h @@ -16,4 +16,32 @@ void kernel_neon_begin(void); void kernel_neon_end(void); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void); +void kernel_fpsimd_end(void); +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state); +void _kernel_fpsimd_load(struct fpsimd_state *state); +#else +bool kernel_fpsimd_begin(void) +{ + return false; +} + +void kernel_fpsimd_end(void) +{ +} + +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + (void) state; +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + (void) state; +} +#endif + #endif /* ! 
__ASM_NEON_H */ diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index 66186f3ab550..d6ca823f7f0f 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -137,6 +137,10 @@ struct cpu_context { unsigned long pc; }; +struct fpsimd_state { + __uint128_t v[4]; +}; + struct thread_struct { struct cpu_context cpu_context; /* cpu context */ @@ -174,6 +178,12 @@ struct thread_struct { KABI_RESERVE(6) KABI_RESERVE(7) KABI_RESERVE(8) +#ifdef CONFIG_USE_VECTORIZED_COPY + KABI_EXTEND( + struct fpsimd_state ustate; + struct fpsimd_state kstate; + ) +#endif }; static inline unsigned int thread_get_vl(struct thread_struct *thread, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 390d9612546b..2e395ebcc856 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -89,6 +89,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define TIF_PATCH_PENDING 28 /* pending live patching update */ #define TIF_SME 29 /* SME in use */ #define TIF_SME_VL_INHERIT 30 /* Inherit SME vl_onexec across exec */ +#define TIF_KERNEL_FPSIMD 31 /* Use FPSIMD in kernel */ +#define TIF_PRIV_UACC_ENABLED 32 /* Whether priviliged uaccess was manually enabled */ #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -108,6 +110,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_32BIT_AARCH64 (1 << TIF_32BIT_AARCH64) #define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) +#define _TIF_KERNEL_FPSIMD (1 << TIF_KERNEL_FPSIMD) +#define _TIF_PRIV_UACC_ENABLED (1 << TIF_PRIV_UACC_ENABLED) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 03c2db710f92..4e4eec098cbc 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -24,6 +24,10 @@ #include <asm/memory.h> #include <asm/extable.h> +#ifndef __GENKSYMS__ +#include <asm/neon.h> +#endif + #define HAVE_GET_KERNEL_NOFAULT /* @@ -174,7 +178,7 @@ static inline void __uaccess_enable_hw_pan(void) CONFIG_ARM64_PAN)); } -static inline void uaccess_disable_privileged(void) +static inline void __uaccess_disable_privileged(void) { if (uaccess_ttbr0_disable()) return; @@ -182,7 +186,22 @@ static inline void uaccess_disable_privileged(void) __uaccess_enable_hw_pan(); } -static inline void uaccess_enable_privileged(void) +static inline void uaccess_disable_privileged(void) +{ + preempt_disable(); + + if (!test_and_clear_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_disable_privileged(); + + preempt_enable(); +} + +static inline void __uaccess_enable_privileged(void) { if (uaccess_ttbr0_enable()) return; @@ -190,6 +209,47 @@ static inline void uaccess_enable_privileged(void) __uaccess_disable_hw_pan(); } +static inline void uaccess_enable_privileged(void) +{ + preempt_disable(); + + if (test_and_set_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_enable_privileged(); + + preempt_enable(); +} + +static inline void uaccess_priviliged_context_switch(struct task_struct *next) +{ + bool curr_enabled = !!test_thread_flag(TIF_PRIV_UACC_ENABLED); + bool next_enabled = !!test_ti_thread_flag(&next->thread_info, TIF_PRIV_UACC_ENABLED); + + if (curr_enabled == 
next_enabled) + return; + + if (curr_enabled) + __uaccess_disable_privileged(); + else + __uaccess_enable_privileged(); +} + +static inline void uaccess_priviliged_state_save(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_disable_privileged(); +} + +static inline void uaccess_priviliged_state_restore(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_enable_privileged(); +} + /* * Sanitise a uaccess pointer such that it becomes NULL if above the maximum * user address. In case the pointer is tagged (has the top byte set), untag @@ -386,7 +446,97 @@ do { \ goto err_label; \ } while(0) -extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n); +#define USER_COPY_CHUNK_SIZE 4096 + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_from_user_threshold; + +#define verify_fpsimd_copy(to, from, n, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FPSIMD:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FPSIMD:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + __verify_ret; \ +}) + +#define compare_fpsimd_copy(to, from, n, ret_fpsimd, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FIXUP:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FIXUP:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + if (ret_fpsimd != ret) { \ + pr_err("FIXUP:%s difference between FPSIMD %lu and regular %lu\n", __func__, n - ret_fpsimd, n - ret); \ + __verify_ret |= 1; \ + } else { \ + __verify_ret = 0; \ + } \ + __verify_ret; \ +}) + +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_from_user_fpsimd(void *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_from_user(void *to, const void __user *from, unsigned long n) +{ + unsigned long __acfu_ret; + + if (sysctl_copy_from_user_threshold == -1 || n < sysctl_copy_from_user_threshold) { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user(to, + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __acfu_ret_fpsimd; + + uaccess_enable_privileged(); + __acfu_ret_fpsimd = __arch_copy_from_user_fpsimd((to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + __acfu_ret = __acfu_ret_fpsimd; + kernel_fpsimd_end(); +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret)) { + + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret_fpsimd, __acfu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + + return __acfu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + #define raw_copy_from_user(to, from, n) \ ({ \ unsigned long __acfu_ret; \ @@ -397,7 +547,66 @@ extern unsigned long __must_check __arch_copy_from_user(void *to, const void __u 
__acfu_ret; \ }) -extern unsigned long __must_check __arch_copy_to_user(void __user *to, const void *from, unsigned long n); +#endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_to_user_threshold; + +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_to_user_fpsimd(void __user *to, const void *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_to_user(void __user *to, const void *from, unsigned long n) +{ + unsigned long __actu_ret; + + + if (sysctl_copy_to_user_threshold == -1 || n < sysctl_copy_to_user_threshold) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __actu_ret_fpsimd; + + uaccess_enable_privileged(); + __actu_ret_fpsimd = __arch_copy_to_user_fpsimd(__uaccess_mask_ptr(to), + from, n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __actu_ret = __actu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret)) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret_fpsimd, __actu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } + } + + return __actu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + #define raw_copy_to_user(to, from, n) \ ({ \ unsigned long __actu_ret; \ @@ -407,7 +616,62 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi uaccess_ttbr0_disable(); \ __actu_ret; \ }) +#endif +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_in_user_threshold; + +extern unsigned long __must_check +__arch_copy_in_user(void __user *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_in_user_fpsimd(void __user *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_in_user(void __user *to, const void __user *from, unsigned long n) +{ + unsigned long __aciu_ret; + + if (sysctl_copy_in_user_threshold == -1 || n < sysctl_copy_in_user_threshold) { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __aciu_ret_fpsimd; + + uaccess_enable_privileged(); + __aciu_ret_fpsimd = __arch_copy_in_user_fpsimd(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __aciu_ret = __aciu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), __uaccess_mask_ptr(from), n, + __aciu_ret)) { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), __uaccess_mask_ptr(from), n, + __aciu_ret_fpsimd, __aciu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + return 
__aciu_ret; +} +#else extern unsigned long __must_check __arch_copy_in_user(void __user *to, const void __user *from, unsigned long n); #define raw_copy_in_user(to, from, n) \ ({ \ @@ -419,6 +683,8 @@ extern unsigned long __must_check __arch_copy_in_user(void __user *to, const voi __aciu_ret; \ }) +#endif + #define INLINE_COPY_TO_USER #define INLINE_COPY_FROM_USER diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S index 8d12aaac7862..848ca6a351d7 100644 --- a/arch/arm64/kernel/entry-fpsimd.S +++ b/arch/arm64/kernel/entry-fpsimd.S @@ -11,6 +11,28 @@ #include <asm/assembler.h> #include <asm/fpsimdmacros.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* + * Save the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_save_state_light) + fpsimd_save_light x0 + ret +SYM_FUNC_END(fpsimd_save_state_light) + +/* + * Load the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_load_state_light) + fpsimd_restore_light x0 + ret +SYM_FUNC_END(fpsimd_load_state_light) +#endif + /* * Save the FP registers. * diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index c2489a72b0b9..1a08c19a181f 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1492,6 +1492,11 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs) current); } +#ifdef CONFIG_USE_VECTORIZED_COPY +static void kernel_fpsimd_rollback_changes(void); +static void kernel_fpsimd_restore_changes(struct task_struct *tsk); +#endif + void fpsimd_thread_switch(struct task_struct *next) { bool wrong_task, wrong_cpu; @@ -1500,10 +1505,11 @@ void fpsimd_thread_switch(struct task_struct *next) return; __get_cpu_fpsimd_context(); - +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_rollback_changes(); +#endif /* Save unsaved fpsimd state, if any: */ fpsimd_save(); - /* * Fix up TIF_FOREIGN_FPSTATE to correctly describe next's * state. For kernel threads, FPSIMD registers are never loaded @@ -1516,6 +1522,9 @@ void fpsimd_thread_switch(struct task_struct *next) update_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE, wrong_task || wrong_cpu); +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_restore_changes(next); +#endif __put_cpu_fpsimd_context(); } @@ -1835,6 +1844,95 @@ void kernel_neon_end(void) } EXPORT_SYMBOL(kernel_neon_end); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void) +{ + if (WARN_ON(!system_capabilities_finalized()) || + !system_supports_fpsimd() || + in_irq() || irqs_disabled() || in_nmi()) + return false; + + preempt_disable(); + if (test_and_set_thread_flag(TIF_KERNEL_FPSIMD)) { + preempt_enable(); + + WARN_ON(1); + return false; + } + + /* + * Leaving streaming mode enabled will cause issues for any kernel + * NEON and leaving streaming mode or ZA enabled may increase power + * consumption. 
+ */ + if (system_supports_sme()) + sme_smstop(); + + fpsimd_save_state_light(&current->thread.ustate); + preempt_enable(); + + return true; +} +EXPORT_SYMBOL(kernel_fpsimd_begin); + +void kernel_fpsimd_end(void) +{ + if (!system_supports_fpsimd()) + return; + + preempt_disable(); + if (test_and_clear_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(&current->thread.ustate); + + preempt_enable(); +} +EXPORT_SYMBOL(kernel_fpsimd_end); + +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_save_state_light(state); +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(state); +} + +static void kernel_fpsimd_rollback_changes(void) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&current->thread.kstate); + fpsimd_load_state_light(&current->thread.ustate); + } +} + +static void kernel_fpsimd_restore_changes(struct task_struct *tsk) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_ti_thread_flag(task_thread_info(tsk), TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&tsk->thread.ustate); + fpsimd_load_state_light(&tsk->thread.kstate); + } +} +#endif + #ifdef CONFIG_EFI static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 14300c9e06d5..338d40725a5d 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -572,7 +572,7 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev, struct task_struct *next) { struct task_struct *last; - + uaccess_priviliged_context_switch(next); fpsimd_thread_switch(next); tls_thread_switch(next); hw_breakpoint_thread_switch(next); diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index dfc33ce09e72..94290069d97d 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -63,6 +63,24 @@ SYM_FUNC_START(__arch_copy_from_user) SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + USER_MC(9998f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_from_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 // Nothing to copy + ret +SYM_FUNC_END(__arch_copy_from_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_from_user_fpsimd) +#endif .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/arch/arm64/lib/copy_in_user.S b/arch/arm64/lib/copy_in_user.S index dbea3799c3ef..cbc09c377050 100644 --- a/arch/arm64/lib/copy_in_user.S +++ b/arch/arm64/lib/copy_in_user.S @@ -64,6 +64,25 @@ SYM_FUNC_START(__arch_copy_in_user) SYM_FUNC_END(__arch_copy_in_user) EXPORT_SYMBOL(__arch_copy_in_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_in_user_fpsimd) + add end, x0, x2 + mov srcin, x1 
+#include "copy_template_fpsimd.S" + mov x0, #0 + ret +SYM_FUNC_END(__arch_copy_in_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_in_user_fpsimd) +#endif + .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/arch/arm64/lib/copy_template_fpsimd.S b/arch/arm64/lib/copy_template_fpsimd.S new file mode 100644 index 000000000000..9b2e7ce1e4d2 --- /dev/null +++ b/arch/arm64/lib/copy_template_fpsimd.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + */ + +/* + * Copy a buffer from src to dest (alignment handled by the hardware) + * + * Parameters: + * x0 - dest + * x1 - src + * x2 - n + * Returns: + * x0 - dest + */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + +V_a .req v20 +V_b .req v21 +V_c .req v22 +V_d .req v23 + + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15_fpsimd + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned_fpsimd + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwriting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. + */ + tbz tmp2, #0, 1f + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 +1: + tbz tmp2, #1, 2f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +2: + tbz tmp2, #2, 3f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +3: + tbz tmp2, #3, .LSrcAligned_fpsimd + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 + +.LSrcAligned_fpsimd: + cmp count, #64 + b.ge .Lcpy_over64_fpsimd + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63_fpsimd: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15_fpsimd + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +1: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +2: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +.Ltiny15_fpsimd: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. 
+ */ + tbz count, #3, 1f + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 +1: + tbz count, #2, 2f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +2: + tbz count, #1, 3f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +3: + tbz count, #0, .Lexitfunc_fpsimd + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 + + b .Lexitfunc_fpsimd + +.Lcpy_over64_fpsimd: + subs count, count, #128 + b.ge .Lcpy_body_large_fpsimd + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 + ldp1 B_l, B_h, src, #16 + ldp1 C_l, C_h, src, #16 + stp1 B_l, B_h, dst, #16 + stp1 C_l, C_h, dst, #16 + ldp1 D_l, D_h, src, #16 + stp1 D_l, D_h, dst, #16 + + tst count, #0x3f + b.ne .Ltail63_fpsimd + b .Lexitfunc_fpsimd + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large_fpsimd: + /* pre-get 64 bytes data. */ + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add src, src, #64 + +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add dst, dst, #64 + add src, src, #64 + + subs count, count, #64 + b.ge 1b + + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + add dst, dst, #64 + + tst count, #0x3f + b.ne .Ltail63_fpsimd +.Lexitfunc_fpsimd: diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 34154e7c8577..d0211fce4923 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -62,6 +62,25 @@ SYM_FUNC_START(__arch_copy_to_user) SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER_MC(9998f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_to_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret +SYM_FUNC_END(__arch_copy_to_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_to_user_fpsimd) +#endif + .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/kernel/softirq.c b/kernel/softirq.c index 9fc69e6e2c11..e3f73422829d 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -26,6 +26,10 @@ #include <linux/tick.h> #include <linux/irq.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +#include <asm/fpsimd.h> +#endif + #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -262,6 +266,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) __u32 pending; int softirq_bit; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif /* * Mask out PF_MEMALLOC as the current task context is borrowed for the * softirq. 
A softirq handled, such as network RX, might set PF_MEMALLOC @@ -273,8 +280,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) account_irq_enter_time(current); __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif in_hardirq = lockdep_softirq_start(); - restart: /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); @@ -322,6 +332,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) lockdep_softirq_end(in_hardirq); account_irq_exit_time(current); + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif __local_bh_enable(SOFTIRQ_OFFSET); WARN_ON_ONCE(in_interrupt()); current_restore_flags(old_flags, PF_MEMALLOC); @@ -612,12 +627,21 @@ static void tasklet_action_common(struct softirq_action *a, { struct tasklet_struct *list; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif + local_irq_disable(); list = tl_head->head; tl_head->head = NULL; tl_head->tail = &tl_head->head; local_irq_enable(); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + while (list) { struct tasklet_struct *t = list; @@ -645,6 +669,11 @@ static void tasklet_action_common(struct softirq_action *a, __raise_softirq_irqoff(softirq_nr); local_irq_enable(); } + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif } static __latent_entropy void tasklet_action(struct softirq_action *a) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 0b1c13a05332..9ec07294429b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -210,6 +210,17 @@ static int max_extfrag_threshold = 1000; #endif /* CONFIG_SYSCTL */ +#ifdef CONFIG_USE_VECTORIZED_COPY +int sysctl_copy_to_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_to_user_threshold); + +int sysctl_copy_from_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_from_user_threshold); + +int sysctl_copy_in_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_in_user_threshold); +#endif + #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL) static int bpf_stats_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -3385,6 +3396,30 @@ static struct ctl_table vm_table[] = { .extra2 = SYSCTL_ONE, }, #endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + { + .procname = "copy_to_user_threshold", + .data = &sysctl_copy_to_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_from_user_threshold", + .data = &sysctl_copy_from_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_in_user_threshold", + .data = &sysctl_copy_in_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, +#endif { } }; -- 2.34.1
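For readers tuning the new sysctls, the following stand-alone sketch (not part of the patch) restates the dispatch condition used by raw_copy_{to,from,in}_user() above, so the meaning of the threshold value is explicit: -1 keeps every copy on the regular __arch_copy_*_user() routines, 0 sends every copy through the FPSIMD variant, and a positive value selects the FPSIMD variant once the copy is at least that many bytes. In the kernel the FPSIMD routine is additionally gated on kernel_fpsimd_begin() succeeding; that check is omitted here.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors: if (sysctl_..._threshold == -1 || n < sysctl_..._threshold) use the regular copy. */
static bool takes_fpsimd_path(long threshold, size_t n)
{
	if (threshold == -1 || n < (size_t)threshold)
		return false;		/* __arch_copy_*_user() */
	return true;			/* __arch_copy_*_user_fpsimd() */
}

int main(void)
{
	assert(!takes_fpsimd_path(-1, 1 << 20));	/* -1: vectorized path never taken */
	assert(takes_fpsimd_path(0, 64));		/* 0: vectorized path always taken */
	assert(!takes_fpsimd_path(65536, 4096));	/* below the threshold: regular copy */
	assert(takes_fpsimd_path(65536, 1 << 20));	/* at or above the threshold: FPSIMD copy */
	return 0;
}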
[PATCH v2 OLK-5.10] Add copy to/from/in user with vectorization support
by Nikita Panov 28 Jan '26
From: Artem Kuzin <artem.kuzin(a)huawei.com>

kunpeng inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8445

-------------------------------------------------

1. This implementation uses st1/ld1 4-vector instructions, which copy 64 bytes at once.
2. The vectorized copy code is used only if the data block to copy is more than 128 bytes.
3. To use this functionality you need to set the configuration switch CONFIG_USE_VECTORIZED_COPY=y.
4. The code can be used on any ARMv8 variant.
5. In-kernel copy functions such as memcpy() are not supported yet, but can be enabled in the future.
6. For now we use a lightweight version of register context saving/restoration (4 registers).

We introduce support for vectorization of the copy_from/to/in_user functions. It currently works in parallel with the original FPSIMD/SVE vectorization and does not affect it in any way.

A special flag in the task struct, TIF_KERNEL_FPSIMD, is set while the lightweight vectorization is in use in the kernel. The task struct has been extended with two fields: the user-space FPSIMD state and the kernel FPSIMD state. The user-space state is used by kernel_fpsimd_begin() and kernel_fpsimd_end(), which wrap lightweight FPSIMD context usage in kernel space. The kernel state is used to manage thread switches.

Nested calls of kernel_neon_begin()/kernel_fpsimd_begin() are not supported, and there are no plans to support them in the future; this is not necessary. We save the lightweight FPSIMD context in kernel_fpsimd_begin() and restore it in kernel_fpsimd_end(). On a thread switch we preserve the kernel FPSIMD context and restore the user-space one, if any. This prevents corruption of the user-space FPSIMD state. Before switching to the next thread we restore its kernel FPSIMD context, if any.

It is allowed to use FPSIMD in bottom halves: in case of BH preemption we check the TIF_KERNEL_FPSIMD flag and save/restore the contexts. Context management is quite lightweight and is executed only when the TIF_KERNEL_FPSIMD flag is set.

To enable this feature, you need to manually modify one of the appropriate entries:
/proc/sys/vm/copy_from_user_threshold
/proc/sys/vm/copy_in_user_threshold
/proc/sys/vm/copy_to_user_threshold

Allowed values are the following:
-1        - feature disabled (the regular copy routines are always used; this is the default)
 0        - feature always enabled
 n (n > 0) - feature enabled if the copied size is at least n bytes

P.S.: What I personally don't like in the current approach:
1. The additional fields and flag in the task struct look quite ugly.
2. There is no way to configure the size of the chunk to copy using FPSIMD from user space.
3. FPSIMD-based memory movement is not generic; it needs to be enabled for memmove(), memcpy() and friends in the future.
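The kernel_fpsimd_begin()/kernel_fpsimd_end() pairing described above reduces to the pattern below. This is an illustrative sketch rather than code from this series; do_vectorized_copy() and do_scalar_copy() are placeholder names for the vectorized body and its scalar fallback.

#include <asm/neon.h>	/* kernel_fpsimd_begin()/kernel_fpsimd_end() added by this series */

/* Placeholders for the real copy bodies; any caller must provide both paths. */
unsigned long do_vectorized_copy(void *dst, const void *src, unsigned long n);
unsigned long do_scalar_copy(void *dst, const void *src, unsigned long n);

static unsigned long copy_chunk(void *dst, const void *src, unsigned long n)
{
	unsigned long not_copied;

	if (kernel_fpsimd_begin()) {
		/* v20-v23 are saved in current->thread.ustate and free to clobber. */
		not_copied = do_vectorized_copy(dst, src, n);
		kernel_fpsimd_end();	/* restores v20-v23 */
	} else {
		/* begin() refuses in IRQ/NMI context, with IRQs off, or when already active. */
		not_copied = do_scalar_copy(dst, src, n);
	}
	return not_copied;
}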
Co-developed-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Signed-off-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Co-developed-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Artem Kuzin <artem.kuzin(a)huawei.com> --- arch/arm64/Kconfig | 15 ++ arch/arm64/include/asm/fpsimd.h | 15 ++ arch/arm64/include/asm/fpsimdmacros.h | 14 ++ arch/arm64/include/asm/neon.h | 28 +++ arch/arm64/include/asm/processor.h | 10 + arch/arm64/include/asm/thread_info.h | 4 + arch/arm64/include/asm/uaccess.h | 274 +++++++++++++++++++++++++- arch/arm64/kernel/entry-fpsimd.S | 22 +++ arch/arm64/kernel/fpsimd.c | 102 +++++++++- arch/arm64/kernel/process.c | 2 +- arch/arm64/lib/copy_from_user.S | 18 ++ arch/arm64/lib/copy_in_user.S | 19 ++ arch/arm64/lib/copy_template_fpsimd.S | 180 +++++++++++++++++ arch/arm64/lib/copy_to_user.S | 19 ++ kernel/softirq.c | 31 ++- kernel/sysctl.c | 35 ++++ 16 files changed, 780 insertions(+), 8 deletions(-) create mode 100644 arch/arm64/lib/copy_template_fpsimd.S diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index eb30ef59aca2..959af31f7e70 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1470,6 +1470,21 @@ config ARM64_ILP32 is an ABI where long and pointers are 32bits but it uses the AARCH64 instruction set. +config USE_VECTORIZED_COPY + bool "Use vectorized instructions in copy_to/from user" + depends on KERNEL_MODE_NEON + default y + help + This option turns on vectorization to speed up copy_to/from_user routines. + +config VECTORIZED_COPY_VALIDATE + bool "Validate result of vectorized copy using regular implementation" + depends on KERNEL_MODE_NEON + depends on USE_VECTORIZED_COPY + default n + help + This option turns on vectorization to speed up copy_to/from_user routines. + menuconfig AARCH32_EL0 bool "Kernel support for 32-bit EL0" depends on ARM64_4K_PAGES || EXPERT diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index 22f6c6e23441..cb53767105ef 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -46,6 +46,21 @@ struct task_struct; +#ifdef CONFIG_USE_VECTORIZED_COPY +extern void fpsimd_save_state_light(struct fpsimd_state *state); +extern void fpsimd_load_state_light(struct fpsimd_state *state); +#else +static inline void fpsimd_save_state_light(struct fpsimd_state *state) +{ + (void) state; +} + +static inline void fpsimd_load_state_light(struct fpsimd_state *state) +{ + (void) state; +} +#endif + extern void fpsimd_save_state(struct user_fpsimd_state *state); extern void fpsimd_load_state(struct user_fpsimd_state *state); diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h index ea2577e159f6..62f5f8a0540a 100644 --- a/arch/arm64/include/asm/fpsimdmacros.h +++ b/arch/arm64/include/asm/fpsimdmacros.h @@ -8,6 +8,20 @@ #include <asm/assembler.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* Lightweight fpsimd context saving/restoration. 
+ * Necessary for vectorized kernel memory movement + * implementation + */ +.macro fpsimd_save_light state + st1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm + +.macro fpsimd_restore_light state + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm +#endif + .macro fpsimd_save state, tmpnr stp q0, q1, [\state, #16 * 0] stp q2, q3, [\state, #16 * 2] diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h index d4b1d172a79b..ab84b194d7b3 100644 --- a/arch/arm64/include/asm/neon.h +++ b/arch/arm64/include/asm/neon.h @@ -16,4 +16,32 @@ void kernel_neon_begin(void); void kernel_neon_end(void); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void); +void kernel_fpsimd_end(void); +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state); +void _kernel_fpsimd_load(struct fpsimd_state *state); +#else +bool kernel_fpsimd_begin(void) +{ + return false; +} + +void kernel_fpsimd_end(void) +{ +} + +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + (void) state; +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + (void) state; +} +#endif + #endif /* ! __ASM_NEON_H */ diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index 66186f3ab550..d6ca823f7f0f 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -137,6 +137,10 @@ struct cpu_context { unsigned long pc; }; +struct fpsimd_state { + __uint128_t v[4]; +}; + struct thread_struct { struct cpu_context cpu_context; /* cpu context */ @@ -174,6 +178,12 @@ struct thread_struct { KABI_RESERVE(6) KABI_RESERVE(7) KABI_RESERVE(8) +#ifdef CONFIG_USE_VECTORIZED_COPY + KABI_EXTEND( + struct fpsimd_state ustate; + struct fpsimd_state kstate; + ) +#endif }; static inline unsigned int thread_get_vl(struct thread_struct *thread, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 390d9612546b..2e395ebcc856 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -89,6 +89,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define TIF_PATCH_PENDING 28 /* pending live patching update */ #define TIF_SME 29 /* SME in use */ #define TIF_SME_VL_INHERIT 30 /* Inherit SME vl_onexec across exec */ +#define TIF_KERNEL_FPSIMD 31 /* Use FPSIMD in kernel */ +#define TIF_PRIV_UACC_ENABLED 32 /* Whether priviliged uaccess was manually enabled */ #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -108,6 +110,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_32BIT_AARCH64 (1 << TIF_32BIT_AARCH64) #define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) +#define _TIF_KERNEL_FPSIMD (1 << TIF_KERNEL_FPSIMD) +#define _TIF_PRIV_UACC_ENABLED (1 << TIF_PRIV_UACC_ENABLED) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 03c2db710f92..4e4eec098cbc 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -24,6 +24,10 @@ #include <asm/memory.h> #include <asm/extable.h> +#ifndef __GENKSYMS__ +#include <asm/neon.h> +#endif + #define HAVE_GET_KERNEL_NOFAULT /* @@ -174,7 +178,7 @@ static inline void __uaccess_enable_hw_pan(void) CONFIG_ARM64_PAN)); } -static inline void 
uaccess_disable_privileged(void) +static inline void __uaccess_disable_privileged(void) { if (uaccess_ttbr0_disable()) return; @@ -182,7 +186,22 @@ static inline void uaccess_disable_privileged(void) __uaccess_enable_hw_pan(); } -static inline void uaccess_enable_privileged(void) +static inline void uaccess_disable_privileged(void) +{ + preempt_disable(); + + if (!test_and_clear_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_disable_privileged(); + + preempt_enable(); +} + +static inline void __uaccess_enable_privileged(void) { if (uaccess_ttbr0_enable()) return; @@ -190,6 +209,47 @@ static inline void uaccess_enable_privileged(void) __uaccess_disable_hw_pan(); } +static inline void uaccess_enable_privileged(void) +{ + preempt_disable(); + + if (test_and_set_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_enable_privileged(); + + preempt_enable(); +} + +static inline void uaccess_priviliged_context_switch(struct task_struct *next) +{ + bool curr_enabled = !!test_thread_flag(TIF_PRIV_UACC_ENABLED); + bool next_enabled = !!test_ti_thread_flag(&next->thread_info, TIF_PRIV_UACC_ENABLED); + + if (curr_enabled == next_enabled) + return; + + if (curr_enabled) + __uaccess_disable_privileged(); + else + __uaccess_enable_privileged(); +} + +static inline void uaccess_priviliged_state_save(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_disable_privileged(); +} + +static inline void uaccess_priviliged_state_restore(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_enable_privileged(); +} + /* * Sanitise a uaccess pointer such that it becomes NULL if above the maximum * user address. In case the pointer is tagged (has the top byte set), untag @@ -386,7 +446,97 @@ do { \ goto err_label; \ } while(0) -extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n); +#define USER_COPY_CHUNK_SIZE 4096 + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_from_user_threshold; + +#define verify_fpsimd_copy(to, from, n, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FPSIMD:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FPSIMD:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + __verify_ret; \ +}) + +#define compare_fpsimd_copy(to, from, n, ret_fpsimd, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? 
n - ret : n); \ + if (__verify_ret) \ + pr_err("FIXUP:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FIXUP:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + if (ret_fpsimd != ret) { \ + pr_err("FIXUP:%s difference between FPSIMD %lu and regular %lu\n", __func__, n - ret_fpsimd, n - ret); \ + __verify_ret |= 1; \ + } else { \ + __verify_ret = 0; \ + } \ + __verify_ret; \ +}) + +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_from_user_fpsimd(void *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_from_user(void *to, const void __user *from, unsigned long n) +{ + unsigned long __acfu_ret; + + if (sysctl_copy_from_user_threshold == -1 || n < sysctl_copy_from_user_threshold) { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user(to, + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __acfu_ret_fpsimd; + + uaccess_enable_privileged(); + __acfu_ret_fpsimd = __arch_copy_from_user_fpsimd((to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + __acfu_ret = __acfu_ret_fpsimd; + kernel_fpsimd_end(); +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret)) { + + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret_fpsimd, __acfu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + + return __acfu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + #define raw_copy_from_user(to, from, n) \ ({ \ unsigned long __acfu_ret; \ @@ -397,7 +547,66 @@ extern unsigned long __must_check __arch_copy_from_user(void *to, const void __u __acfu_ret; \ }) -extern unsigned long __must_check __arch_copy_to_user(void __user *to, const void *from, unsigned long n); +#endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_to_user_threshold; + +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_to_user_fpsimd(void __user *to, const void *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_to_user(void __user *to, const void *from, unsigned long n) +{ + unsigned long __actu_ret; + + + if (sysctl_copy_to_user_threshold == -1 || n < sysctl_copy_to_user_threshold) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __actu_ret_fpsimd; + + uaccess_enable_privileged(); + __actu_ret_fpsimd = __arch_copy_to_user_fpsimd(__uaccess_mask_ptr(to), + from, n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __actu_ret = __actu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret)) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + 
__actu_ret_fpsimd, __actu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } + } + + return __actu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + #define raw_copy_to_user(to, from, n) \ ({ \ unsigned long __actu_ret; \ @@ -407,7 +616,62 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi uaccess_ttbr0_disable(); \ __actu_ret; \ }) +#endif +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_in_user_threshold; + +extern unsigned long __must_check +__arch_copy_in_user(void __user *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_in_user_fpsimd(void __user *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_in_user(void __user *to, const void __user *from, unsigned long n) +{ + unsigned long __aciu_ret; + + if (sysctl_copy_in_user_threshold == -1 || n < sysctl_copy_in_user_threshold) { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __aciu_ret_fpsimd; + + uaccess_enable_privileged(); + __aciu_ret_fpsimd = __arch_copy_in_user_fpsimd(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __aciu_ret = __aciu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), __uaccess_mask_ptr(from), n, + __aciu_ret)) { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), __uaccess_mask_ptr(from), n, + __aciu_ret_fpsimd, __aciu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + return __aciu_ret; +} +#else extern unsigned long __must_check __arch_copy_in_user(void __user *to, const void __user *from, unsigned long n); #define raw_copy_in_user(to, from, n) \ ({ \ @@ -419,6 +683,8 @@ extern unsigned long __must_check __arch_copy_in_user(void __user *to, const voi __aciu_ret; \ }) +#endif + #define INLINE_COPY_TO_USER #define INLINE_COPY_FROM_USER diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S index 8d12aaac7862..848ca6a351d7 100644 --- a/arch/arm64/kernel/entry-fpsimd.S +++ b/arch/arm64/kernel/entry-fpsimd.S @@ -11,6 +11,28 @@ #include <asm/assembler.h> #include <asm/fpsimdmacros.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* + * Save the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_save_state_light) + fpsimd_save_light x0 + ret +SYM_FUNC_END(fpsimd_save_state_light) + +/* + * Load the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_load_state_light) + fpsimd_restore_light x0 + ret +SYM_FUNC_END(fpsimd_load_state_light) +#endif + /* * Save the FP registers. 
* diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index c2489a72b0b9..1a08c19a181f 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1492,6 +1492,11 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs) current); } +#ifdef CONFIG_USE_VECTORIZED_COPY +static void kernel_fpsimd_rollback_changes(void); +static void kernel_fpsimd_restore_changes(struct task_struct *tsk); +#endif + void fpsimd_thread_switch(struct task_struct *next) { bool wrong_task, wrong_cpu; @@ -1500,10 +1505,11 @@ void fpsimd_thread_switch(struct task_struct *next) return; __get_cpu_fpsimd_context(); - +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_rollback_changes(); +#endif /* Save unsaved fpsimd state, if any: */ fpsimd_save(); - /* * Fix up TIF_FOREIGN_FPSTATE to correctly describe next's * state. For kernel threads, FPSIMD registers are never loaded @@ -1516,6 +1522,9 @@ void fpsimd_thread_switch(struct task_struct *next) update_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE, wrong_task || wrong_cpu); +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_restore_changes(next); +#endif __put_cpu_fpsimd_context(); } @@ -1835,6 +1844,95 @@ void kernel_neon_end(void) } EXPORT_SYMBOL(kernel_neon_end); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void) +{ + if (WARN_ON(!system_capabilities_finalized()) || + !system_supports_fpsimd() || + in_irq() || irqs_disabled() || in_nmi()) + return false; + + preempt_disable(); + if (test_and_set_thread_flag(TIF_KERNEL_FPSIMD)) { + preempt_enable(); + + WARN_ON(1); + return false; + } + + /* + * Leaving streaming mode enabled will cause issues for any kernel + * NEON and leaving streaming mode or ZA enabled may increase power + * consumption. + */ + if (system_supports_sme()) + sme_smstop(); + + fpsimd_save_state_light(&current->thread.ustate); + preempt_enable(); + + return true; +} +EXPORT_SYMBOL(kernel_fpsimd_begin); + +void kernel_fpsimd_end(void) +{ + if (!system_supports_fpsimd()) + return; + + preempt_disable(); + if (test_and_clear_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(&current->thread.ustate); + + preempt_enable(); +} +EXPORT_SYMBOL(kernel_fpsimd_end); + +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_save_state_light(state); +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(state); +} + +static void kernel_fpsimd_rollback_changes(void) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&current->thread.kstate); + fpsimd_load_state_light(&current->thread.ustate); + } +} + +static void kernel_fpsimd_restore_changes(struct task_struct *tsk) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_ti_thread_flag(task_thread_info(tsk), TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&tsk->thread.ustate); + fpsimd_load_state_light(&tsk->thread.kstate); + } +} +#endif + #ifdef CONFIG_EFI static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 14300c9e06d5..338d40725a5d 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -572,7 +572,7 @@ __notrace_funcgraph struct task_struct 
*__switch_to(struct task_struct *prev, struct task_struct *next) { struct task_struct *last; - + uaccess_priviliged_context_switch(next); fpsimd_thread_switch(next); tls_thread_switch(next); hw_breakpoint_thread_switch(next); diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index dfc33ce09e72..94290069d97d 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -63,6 +63,24 @@ SYM_FUNC_START(__arch_copy_from_user) SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + USER_MC(9998f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_from_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 // Nothing to copy + ret +SYM_FUNC_END(__arch_copy_from_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_from_user_fpsimd) +#endif .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/arch/arm64/lib/copy_in_user.S b/arch/arm64/lib/copy_in_user.S index dbea3799c3ef..cbc09c377050 100644 --- a/arch/arm64/lib/copy_in_user.S +++ b/arch/arm64/lib/copy_in_user.S @@ -64,6 +64,25 @@ SYM_FUNC_START(__arch_copy_in_user) SYM_FUNC_END(__arch_copy_in_user) EXPORT_SYMBOL(__arch_copy_in_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_in_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret +SYM_FUNC_END(__arch_copy_in_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_in_user_fpsimd) +#endif + .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/arch/arm64/lib/copy_template_fpsimd.S b/arch/arm64/lib/copy_template_fpsimd.S new file mode 100644 index 000000000000..9b2e7ce1e4d2 --- /dev/null +++ b/arch/arm64/lib/copy_template_fpsimd.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + */ + +/* + * Copy a buffer from src to dest (alignment handled by the hardware) + * + * Parameters: + * x0 - dest + * x1 - src + * x2 - n + * Returns: + * x0 - dest + */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + +V_a .req v20 +V_b .req v21 +V_c .req v22 +V_d .req v23 + + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15_fpsimd + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned_fpsimd + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwriting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. 
+ */ + tbz tmp2, #0, 1f + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 +1: + tbz tmp2, #1, 2f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +2: + tbz tmp2, #2, 3f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +3: + tbz tmp2, #3, .LSrcAligned_fpsimd + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 + +.LSrcAligned_fpsimd: + cmp count, #64 + b.ge .Lcpy_over64_fpsimd + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63_fpsimd: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15_fpsimd + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +1: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +2: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +.Ltiny15_fpsimd: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. + */ + tbz count, #3, 1f + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 +1: + tbz count, #2, 2f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +2: + tbz count, #1, 3f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +3: + tbz count, #0, .Lexitfunc_fpsimd + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 + + b .Lexitfunc_fpsimd + +.Lcpy_over64_fpsimd: + subs count, count, #128 + b.ge .Lcpy_body_large_fpsimd + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 + ldp1 B_l, B_h, src, #16 + ldp1 C_l, C_h, src, #16 + stp1 B_l, B_h, dst, #16 + stp1 C_l, C_h, dst, #16 + ldp1 D_l, D_h, src, #16 + stp1 D_l, D_h, dst, #16 + + tst count, #0x3f + b.ne .Ltail63_fpsimd + b .Lexitfunc_fpsimd + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large_fpsimd: + /* pre-get 64 bytes data. */ + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add src, src, #64 + +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. 
+ */ + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add dst, dst, #64 + add src, src, #64 + + subs count, count, #64 + b.ge 1b + + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + add dst, dst, #64 + + tst count, #0x3f + b.ne .Ltail63_fpsimd +.Lexitfunc_fpsimd: diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 34154e7c8577..d0211fce4923 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -62,6 +62,25 @@ SYM_FUNC_START(__arch_copy_to_user) SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER_MC(9998f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_to_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret +SYM_FUNC_END(__arch_copy_to_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_to_user_fpsimd) +#endif + .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/kernel/softirq.c b/kernel/softirq.c index 9fc69e6e2c11..e3f73422829d 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -26,6 +26,10 @@ #include <linux/tick.h> #include <linux/irq.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +#include <asm/fpsimd.h> +#endif + #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -262,6 +266,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) __u32 pending; int softirq_bit; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif /* * Mask out PF_MEMALLOC as the current task context is borrowed for the * softirq. A softirq handled, such as network RX, might set PF_MEMALLOC @@ -273,8 +280,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) account_irq_enter_time(current); __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif in_hardirq = lockdep_softirq_start(); - restart: /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); @@ -322,6 +332,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) lockdep_softirq_end(in_hardirq); account_irq_exit_time(current); + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif __local_bh_enable(SOFTIRQ_OFFSET); WARN_ON_ONCE(in_interrupt()); current_restore_flags(old_flags, PF_MEMALLOC); @@ -612,12 +627,21 @@ static void tasklet_action_common(struct softirq_action *a, { struct tasklet_struct *list; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif + local_irq_disable(); list = tl_head->head; tl_head->head = NULL; tl_head->tail = &tl_head->head; local_irq_enable(); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + while (list) { struct tasklet_struct *t = list; @@ -645,6 +669,11 @@ static void tasklet_action_common(struct softirq_action *a, __raise_softirq_irqoff(softirq_nr); local_irq_enable(); } + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif } static __latent_entropy void tasklet_action(struct softirq_action *a) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 0b1c13a05332..9ec07294429b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -210,6 +210,17 @@ static int 
max_extfrag_threshold = 1000; #endif /* CONFIG_SYSCTL */ +#ifdef CONFIG_USE_VECTORIZED_COPY +int sysctl_copy_to_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_to_user_threshold); + +int sysctl_copy_from_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_from_user_threshold); + +int sysctl_copy_in_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_in_user_threshold); +#endif + #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL) static int bpf_stats_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -3385,6 +3396,30 @@ static struct ctl_table vm_table[] = { .extra2 = SYSCTL_ONE, }, #endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + { + .procname = "copy_to_user_threshold", + .data = &sysctl_copy_to_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_from_user_threshold", + .data = &sysctl_copy_from_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_in_user_threshold", + .data = &sysctl_copy_in_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, +#endif { } }; -- 2.34.1
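Once a threshold has been written to the sysctl entries, ordinary buffered file I/O is enough to exercise the new path: read() and write() on a regular file reach raw_copy_to_user() and raw_copy_from_user() through the page cache. Note that buffered I/O is copied between the page cache and user memory in page-sized pieces, so the threshold has to be no larger than the page size (for example 2048) for this particular check to hit the vectorized routines. The program below is a hypothetical test, not part of the series.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumes copy_to_user_threshold and copy_from_user_threshold were set to 2048. */
	const size_t len = 1 << 20;		/* 1 MiB, copied in page-sized chunks */
	char *buf = malloc(len);
	int fd = open("/tmp/veccopy-test", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (!buf || fd < 0)
		return 1;

	memset(buf, 0xa5, len);
	if (write(fd, buf, len) != (ssize_t)len)	/* copy_from_user() side */
		return 1;
	if (lseek(fd, 0, SEEK_SET) != 0)
		return 1;
	if (read(fd, buf, len) != (ssize_t)len)		/* copy_to_user() side */
		return 1;

	close(fd);
	free(buf);
	return 0;
}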
[PATCH v1 OLK-5.10] Add copy to/from/in user with vectorization support
by Nikita Panov 28 Jan '26
From: Artem Kuzin <artem.kuzin(a)huawei.com>

kunpeng inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8445

-------------------------------------------------

1. This implementation uses st1/ld1 4-vector instructions, which copy 64 bytes at once.
2. The vectorized copy code is used only if the data block to copy is more than 128 bytes.
3. To use this functionality you need to set the configuration switch CONFIG_USE_VECTORIZED_COPY=y.
4. The code can be used on any ARMv8 variant.
5. In-kernel copy functions such as memcpy() are not supported yet, but can be enabled in the future.
6. For now we use a lightweight version of register context saving/restoration (4 registers).

We introduce support for vectorization of the copy_from/to/in_user functions. It currently works in parallel with the original FPSIMD/SVE vectorization and does not affect it in any way.

A special flag in the task struct, TIF_KERNEL_FPSIMD, is set while the lightweight vectorization is in use in the kernel. The task struct has been extended with two fields: the user-space FPSIMD state and the kernel FPSIMD state. The user-space state is used by kernel_fpsimd_begin() and kernel_fpsimd_end(), which wrap lightweight FPSIMD context usage in kernel space. The kernel state is used to manage thread switches.

Nested calls of kernel_neon_begin()/kernel_fpsimd_begin() are not supported, and there are no plans to support them in the future; this is not necessary. We save the lightweight FPSIMD context in kernel_fpsimd_begin() and restore it in kernel_fpsimd_end(). On a thread switch we preserve the kernel FPSIMD context and restore the user-space one, if any. This prevents corruption of the user-space FPSIMD state. Before switching to the next thread we restore its kernel FPSIMD context, if any.

It is allowed to use FPSIMD in bottom halves: in case of BH preemption we check the TIF_KERNEL_FPSIMD flag and save/restore the contexts. Context management is quite lightweight and is executed only when the TIF_KERNEL_FPSIMD flag is set.

To enable this feature, you need to manually modify one of the appropriate entries:
/proc/sys/vm/copy_from_user_threshold
/proc/sys/vm/copy_in_user_threshold
/proc/sys/vm/copy_to_user_threshold

Allowed values are the following:
-1        - feature disabled (the regular copy routines are always used; this is the default)
 0        - feature always enabled
 n (n > 0) - feature enabled if the copied size is at least n bytes

P.S.: What I personally don't like in the current approach:
1. The additional fields and flag in the task struct look quite ugly.
2. There is no way to configure the size of the chunk to copy using FPSIMD from user space.
3. FPSIMD-based memory movement is not generic; it needs to be enabled for memmove(), memcpy() and friends in the future.
Co-developed-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Signed-off-by: Alexander Kozhevnikov <alexander.kozhevnikov(a)huawei-partners.com> Co-developed-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Nikita Panov <panov.nikita(a)huawei.com> Signed-off-by: Artem Kuzin <artem.kuzin(a)huawei.com> --- arch/arm64/Kconfig | 15 ++ arch/arm64/include/asm/fpsimd.h | 15 ++ arch/arm64/include/asm/fpsimdmacros.h | 14 ++ arch/arm64/include/asm/neon.h | 28 +++ arch/arm64/include/asm/processor.h | 7 + arch/arm64/include/asm/thread_info.h | 4 + arch/arm64/include/asm/uaccess.h | 271 +++++++++++++++++++++++++- arch/arm64/kernel/entry-fpsimd.S | 22 +++ arch/arm64/kernel/fpsimd.c | 102 +++++++++- arch/arm64/kernel/process.c | 2 +- arch/arm64/lib/copy_from_user.S | 18 ++ arch/arm64/lib/copy_in_user.S | 19 ++ arch/arm64/lib/copy_template_fpsimd.S | 180 +++++++++++++++++ arch/arm64/lib/copy_to_user.S | 19 ++ kernel/softirq.c | 31 ++- kernel/sysctl.c | 35 ++++ 16 files changed, 774 insertions(+), 8 deletions(-) create mode 100644 arch/arm64/lib/copy_template_fpsimd.S diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index eb30ef59aca2..959af31f7e70 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1470,6 +1470,21 @@ config ARM64_ILP32 is an ABI where long and pointers are 32bits but it uses the AARCH64 instruction set. +config USE_VECTORIZED_COPY + bool "Use vectorized instructions in copy_to/from user" + depends on KERNEL_MODE_NEON + default y + help + This option turns on vectorization to speed up copy_to/from_user routines. + +config VECTORIZED_COPY_VALIDATE + bool "Validate result of vectorized copy using regular implementation" + depends on KERNEL_MODE_NEON + depends on USE_VECTORIZED_COPY + default n + help + This option turns on vectorization to speed up copy_to/from_user routines. + menuconfig AARCH32_EL0 bool "Kernel support for 32-bit EL0" depends on ARM64_4K_PAGES || EXPERT diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index 22f6c6e23441..cb53767105ef 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -46,6 +46,21 @@ struct task_struct; +#ifdef CONFIG_USE_VECTORIZED_COPY +extern void fpsimd_save_state_light(struct fpsimd_state *state); +extern void fpsimd_load_state_light(struct fpsimd_state *state); +#else +static inline void fpsimd_save_state_light(struct fpsimd_state *state) +{ + (void) state; +} + +static inline void fpsimd_load_state_light(struct fpsimd_state *state) +{ + (void) state; +} +#endif + extern void fpsimd_save_state(struct user_fpsimd_state *state); extern void fpsimd_load_state(struct user_fpsimd_state *state); diff --git a/arch/arm64/include/asm/fpsimdmacros.h b/arch/arm64/include/asm/fpsimdmacros.h index ea2577e159f6..62f5f8a0540a 100644 --- a/arch/arm64/include/asm/fpsimdmacros.h +++ b/arch/arm64/include/asm/fpsimdmacros.h @@ -8,6 +8,20 @@ #include <asm/assembler.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* Lightweight fpsimd context saving/restoration. 
+ * Necessary for vectorized kernel memory movement + * implementation + */ +.macro fpsimd_save_light state + st1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm + +.macro fpsimd_restore_light state + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [\state] +.endm +#endif + .macro fpsimd_save state, tmpnr stp q0, q1, [\state, #16 * 0] stp q2, q3, [\state, #16 * 2] diff --git a/arch/arm64/include/asm/neon.h b/arch/arm64/include/asm/neon.h index d4b1d172a79b..ab84b194d7b3 100644 --- a/arch/arm64/include/asm/neon.h +++ b/arch/arm64/include/asm/neon.h @@ -16,4 +16,32 @@ void kernel_neon_begin(void); void kernel_neon_end(void); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void); +void kernel_fpsimd_end(void); +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state); +void _kernel_fpsimd_load(struct fpsimd_state *state); +#else +bool kernel_fpsimd_begin(void) +{ + return false; +} + +void kernel_fpsimd_end(void) +{ +} + +/* Functions to use in non-preemptible context */ +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + (void) state; +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + (void) state; +} +#endif + #endif /* ! __ASM_NEON_H */ diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index 66186f3ab550..3f6784867508 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -137,6 +137,10 @@ struct cpu_context { unsigned long pc; }; +struct fpsimd_state { + __uint128_t v[4]; +}; + struct thread_struct { struct cpu_context cpu_context; /* cpu context */ @@ -166,6 +170,9 @@ struct thread_struct { u64 sctlr_tcf0; u64 gcr_user_incl; #endif + struct fpsimd_state ustate; + struct fpsimd_state kstate; + KABI_USE(1, unsigned int vl[ARM64_VEC_MAX]) KABI_USE(2, unsigned int vl_onexec[ARM64_VEC_MAX]) KABI_USE(3, u64 tpidr2_el0) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 390d9612546b..2e395ebcc856 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -89,6 +89,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define TIF_PATCH_PENDING 28 /* pending live patching update */ #define TIF_SME 29 /* SME in use */ #define TIF_SME_VL_INHERIT 30 /* Inherit SME vl_onexec across exec */ +#define TIF_KERNEL_FPSIMD 31 /* Use FPSIMD in kernel */ +#define TIF_PRIV_UACC_ENABLED 32 /* Whether priviliged uaccess was manually enabled */ #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -108,6 +110,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_32BIT_AARCH64 (1 << TIF_32BIT_AARCH64) #define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) +#define _TIF_KERNEL_FPSIMD (1 << TIF_KERNEL_FPSIMD) +#define _TIF_PRIV_UACC_ENABLED (1 << TIF_PRIV_UACC_ENABLED) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 03c2db710f92..60ffb6aee7bf 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -23,6 +23,7 @@ #include <asm/ptrace.h> #include <asm/memory.h> #include <asm/extable.h> +#include <asm/neon.h> #define HAVE_GET_KERNEL_NOFAULT @@ -174,7 +175,7 @@ static inline void __uaccess_enable_hw_pan(void) CONFIG_ARM64_PAN)); } -static inline void uaccess_disable_privileged(void) +static 
inline void __uaccess_disable_privileged(void) { if (uaccess_ttbr0_disable()) return; @@ -182,7 +183,22 @@ static inline void uaccess_disable_privileged(void) __uaccess_enable_hw_pan(); } -static inline void uaccess_enable_privileged(void) +static inline void uaccess_disable_privileged(void) +{ + preempt_disable(); + + if (!test_and_clear_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_disable_privileged(); + + preempt_enable(); +} + +static inline void __uaccess_enable_privileged(void) { if (uaccess_ttbr0_enable()) return; @@ -190,6 +206,47 @@ static inline void uaccess_enable_privileged(void) __uaccess_disable_hw_pan(); } +static inline void uaccess_enable_privileged(void) +{ + preempt_disable(); + + if (test_and_set_thread_flag(TIF_PRIV_UACC_ENABLED)) { + WARN_ON(1); + preempt_enable(); + return; + } + + __uaccess_enable_privileged(); + + preempt_enable(); +} + +static inline void uaccess_priviliged_context_switch(struct task_struct *next) +{ + bool curr_enabled = !!test_thread_flag(TIF_PRIV_UACC_ENABLED); + bool next_enabled = !!test_ti_thread_flag(&next->thread_info, TIF_PRIV_UACC_ENABLED); + + if (curr_enabled == next_enabled) + return; + + if (curr_enabled) + __uaccess_disable_privileged(); + else + __uaccess_enable_privileged(); +} + +static inline void uaccess_priviliged_state_save(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_disable_privileged(); +} + +static inline void uaccess_priviliged_state_restore(void) +{ + if (test_thread_flag(TIF_PRIV_UACC_ENABLED)) + __uaccess_enable_privileged(); +} + /* * Sanitise a uaccess pointer such that it becomes NULL if above the maximum * user address. In case the pointer is tagged (has the top byte set), untag @@ -386,7 +443,97 @@ do { \ goto err_label; \ } while(0) -extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n); +#define USER_COPY_CHUNK_SIZE 4096 + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_from_user_threshold; + +#define verify_fpsimd_copy(to, from, n, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? n - ret : n); \ + if (__verify_ret) \ + pr_err("FPSIMD:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FPSIMD:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + __verify_ret; \ +}) + +#define compare_fpsimd_copy(to, from, n, ret_fpsimd, ret) \ +({ \ + unsigned long __verify_ret = 0; \ + __verify_ret = memcmp(to, from, ret ? 
n - ret : n); \ + if (__verify_ret) \ + pr_err("FIXUP:%s inconsistent state\n", __func__); \ + if (ret) \ + pr_err("FIXUP:%s failed to copy data, expected=%lu, copied=%lu\n", __func__, n, n - ret); \ + __verify_ret |= ret; \ + if (ret_fpsimd != ret) { \ + pr_err("FIXUP:%s difference between FPSIMD %lu and regular %lu\n", __func__, n - ret_fpsimd, n - ret); \ + __verify_ret |= 1; \ + } else { \ + __verify_ret = 0; \ + } \ + __verify_ret; \ +}) + +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_from_user_fpsimd(void *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_from_user(void *to, const void __user *from, unsigned long n) +{ + unsigned long __acfu_ret; + + if (sysctl_copy_from_user_threshold == -1 || n < sysctl_copy_from_user_threshold) { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user(to, + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __acfu_ret_fpsimd; + + uaccess_enable_privileged(); + __acfu_ret_fpsimd = __arch_copy_from_user_fpsimd((to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + __acfu_ret = __acfu_ret_fpsimd; + kernel_fpsimd_end(); +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret)) { + + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(to, __uaccess_mask_ptr(from), n, + __acfu_ret_fpsimd, __acfu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __acfu_ret = __arch_copy_from_user((to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + + return __acfu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_from_user(void *to, const void __user *from, unsigned long n); + #define raw_copy_from_user(to, from, n) \ ({ \ unsigned long __acfu_ret; \ @@ -397,7 +544,66 @@ extern unsigned long __must_check __arch_copy_from_user(void *to, const void __u __acfu_ret; \ }) -extern unsigned long __must_check __arch_copy_to_user(void __user *to, const void *from, unsigned long n); +#endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_to_user_threshold; + +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_to_user_fpsimd(void __user *to, const void *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_to_user(void __user *to, const void *from, unsigned long n) +{ + unsigned long __actu_ret; + + + if (sysctl_copy_to_user_threshold == -1 || n < sysctl_copy_to_user_threshold) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __actu_ret_fpsimd; + + uaccess_enable_privileged(); + __actu_ret_fpsimd = __arch_copy_to_user_fpsimd(__uaccess_mask_ptr(to), + from, n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __actu_ret = __actu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + __actu_ret)) { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), from, n, + 
__actu_ret_fpsimd, __actu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __actu_ret = __arch_copy_to_user(__uaccess_mask_ptr(to), + from, n); + uaccess_ttbr0_disable(); + } + } + + return __actu_ret; +} +#else +extern unsigned long __must_check +__arch_copy_to_user(void __user *to, const void *from, unsigned long n); + #define raw_copy_to_user(to, from, n) \ ({ \ unsigned long __actu_ret; \ @@ -407,7 +613,62 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi uaccess_ttbr0_disable(); \ __actu_ret; \ }) +#endif +#ifdef CONFIG_USE_VECTORIZED_COPY + +extern int sysctl_copy_in_user_threshold; + +extern unsigned long __must_check +__arch_copy_in_user(void __user *to, const void __user *from, unsigned long n); + +extern unsigned long __must_check +__arch_copy_in_user_fpsimd(void __user *to, const void __user *from, unsigned long n); + +static __always_inline unsigned long __must_check +raw_copy_in_user(void __user *to, const void __user *from, unsigned long n) +{ + unsigned long __aciu_ret; + + if (sysctl_copy_in_user_threshold == -1 || n < sysctl_copy_in_user_threshold) { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } else { + if (kernel_fpsimd_begin()) { + unsigned long __aciu_ret_fpsimd; + + uaccess_enable_privileged(); + __aciu_ret_fpsimd = __arch_copy_in_user_fpsimd(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_disable_privileged(); + + kernel_fpsimd_end(); + __aciu_ret = __aciu_ret_fpsimd; +#ifdef CONFIG_VECTORIZED_COPY_VALIDATE + if (verify_fpsimd_copy(__uaccess_mask_ptr(to), __uaccess_mask_ptr(from), n, + __aciu_ret)) { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + + compare_fpsimd_copy(__uaccess_mask_ptr(to), __uaccess_mask_ptr(from), n, + __aciu_ret_fpsimd, __aciu_ret); + } +#endif + } else { + uaccess_ttbr0_enable(); + __aciu_ret = __arch_copy_in_user(__uaccess_mask_ptr(to), + __uaccess_mask_ptr(from), n); + uaccess_ttbr0_disable(); + } + } + + return __aciu_ret; +} +#else extern unsigned long __must_check __arch_copy_in_user(void __user *to, const void __user *from, unsigned long n); #define raw_copy_in_user(to, from, n) \ ({ \ @@ -419,6 +680,8 @@ extern unsigned long __must_check __arch_copy_in_user(void __user *to, const voi __aciu_ret; \ }) +#endif + #define INLINE_COPY_TO_USER #define INLINE_COPY_FROM_USER diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S index 8d12aaac7862..848ca6a351d7 100644 --- a/arch/arm64/kernel/entry-fpsimd.S +++ b/arch/arm64/kernel/entry-fpsimd.S @@ -11,6 +11,28 @@ #include <asm/assembler.h> #include <asm/fpsimdmacros.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +/* + * Save the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_save_state_light) + fpsimd_save_light x0 + ret +SYM_FUNC_END(fpsimd_save_state_light) + +/* + * Load the FP registers. + * + * x0 - pointer to struct fpsimd_state_light + */ +SYM_FUNC_START(fpsimd_load_state_light) + fpsimd_restore_light x0 + ret +SYM_FUNC_END(fpsimd_load_state_light) +#endif + /* * Save the FP registers. 
* diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index c2489a72b0b9..1a08c19a181f 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -1492,6 +1492,11 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs) current); } +#ifdef CONFIG_USE_VECTORIZED_COPY +static void kernel_fpsimd_rollback_changes(void); +static void kernel_fpsimd_restore_changes(struct task_struct *tsk); +#endif + void fpsimd_thread_switch(struct task_struct *next) { bool wrong_task, wrong_cpu; @@ -1500,10 +1505,11 @@ void fpsimd_thread_switch(struct task_struct *next) return; __get_cpu_fpsimd_context(); - +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_rollback_changes(); +#endif /* Save unsaved fpsimd state, if any: */ fpsimd_save(); - /* * Fix up TIF_FOREIGN_FPSTATE to correctly describe next's * state. For kernel threads, FPSIMD registers are never loaded @@ -1516,6 +1522,9 @@ void fpsimd_thread_switch(struct task_struct *next) update_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE, wrong_task || wrong_cpu); +#ifdef CONFIG_USE_VECTORIZED_COPY + kernel_fpsimd_restore_changes(next); +#endif __put_cpu_fpsimd_context(); } @@ -1835,6 +1844,95 @@ void kernel_neon_end(void) } EXPORT_SYMBOL(kernel_neon_end); +#ifdef CONFIG_USE_VECTORIZED_COPY +bool kernel_fpsimd_begin(void) +{ + if (WARN_ON(!system_capabilities_finalized()) || + !system_supports_fpsimd() || + in_irq() || irqs_disabled() || in_nmi()) + return false; + + preempt_disable(); + if (test_and_set_thread_flag(TIF_KERNEL_FPSIMD)) { + preempt_enable(); + + WARN_ON(1); + return false; + } + + /* + * Leaving streaming mode enabled will cause issues for any kernel + * NEON and leaving streaming mode or ZA enabled may increase power + * consumption. + */ + if (system_supports_sme()) + sme_smstop(); + + fpsimd_save_state_light(&current->thread.ustate); + preempt_enable(); + + return true; +} +EXPORT_SYMBOL(kernel_fpsimd_begin); + +void kernel_fpsimd_end(void) +{ + if (!system_supports_fpsimd()) + return; + + preempt_disable(); + if (test_and_clear_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(&current->thread.ustate); + + preempt_enable(); +} +EXPORT_SYMBOL(kernel_fpsimd_end); + +void _kernel_fpsimd_save(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_save_state_light(state); +} + +void _kernel_fpsimd_load(struct fpsimd_state *state) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) + fpsimd_load_state_light(state); +} + +static void kernel_fpsimd_rollback_changes(void) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_thread_flag(TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&current->thread.kstate); + fpsimd_load_state_light(&current->thread.ustate); + } +} + +static void kernel_fpsimd_restore_changes(struct task_struct *tsk) +{ + if (!system_supports_fpsimd()) + return; + + BUG_ON(preemptible()); + if (test_ti_thread_flag(task_thread_info(tsk), TIF_KERNEL_FPSIMD)) { + fpsimd_save_state_light(&tsk->thread.ustate); + fpsimd_load_state_light(&tsk->thread.kstate); + } +} +#endif + #ifdef CONFIG_EFI static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 14300c9e06d5..338d40725a5d 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -572,7 +572,7 @@ __notrace_funcgraph struct task_struct 
*__switch_to(struct task_struct *prev, struct task_struct *next) { struct task_struct *last; - + uaccess_priviliged_context_switch(next); fpsimd_thread_switch(next); tls_thread_switch(next); hw_breakpoint_thread_switch(next); diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index dfc33ce09e72..94290069d97d 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -63,6 +63,24 @@ SYM_FUNC_START(__arch_copy_from_user) SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + USER_MC(9998f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_from_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 // Nothing to copy + ret +SYM_FUNC_END(__arch_copy_from_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_from_user_fpsimd) +#endif .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/arch/arm64/lib/copy_in_user.S b/arch/arm64/lib/copy_in_user.S index dbea3799c3ef..cbc09c377050 100644 --- a/arch/arm64/lib/copy_in_user.S +++ b/arch/arm64/lib/copy_in_user.S @@ -64,6 +64,25 @@ SYM_FUNC_START(__arch_copy_in_user) SYM_FUNC_END(__arch_copy_in_user) EXPORT_SYMBOL(__arch_copy_in_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER(9997f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_in_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret +SYM_FUNC_END(__arch_copy_in_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_in_user_fpsimd) +#endif + .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/arch/arm64/lib/copy_template_fpsimd.S b/arch/arm64/lib/copy_template_fpsimd.S new file mode 100644 index 000000000000..9b2e7ce1e4d2 --- /dev/null +++ b/arch/arm64/lib/copy_template_fpsimd.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + */ + +/* + * Copy a buffer from src to dest (alignment handled by the hardware) + * + * Parameters: + * x0 - dest + * x1 - src + * x2 - n + * Returns: + * x0 - dest + */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + +V_a .req v20 +V_b .req v21 +V_c .req v22 +V_d .req v23 + + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15_fpsimd + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned_fpsimd + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwriting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. 
+ */ + tbz tmp2, #0, 1f + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 +1: + tbz tmp2, #1, 2f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +2: + tbz tmp2, #2, 3f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +3: + tbz tmp2, #3, .LSrcAligned_fpsimd + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 + +.LSrcAligned_fpsimd: + cmp count, #64 + b.ge .Lcpy_over64_fpsimd + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63_fpsimd: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15_fpsimd + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +1: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +2: + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 +.Ltiny15_fpsimd: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. + */ + tbz count, #3, 1f + ldr1 tmp1, src, #8 + str1 tmp1, dst, #8 +1: + tbz count, #2, 2f + ldr1 tmp1w, src, #4 + str1 tmp1w, dst, #4 +2: + tbz count, #1, 3f + ldrh1 tmp1w, src, #2 + strh1 tmp1w, dst, #2 +3: + tbz count, #0, .Lexitfunc_fpsimd + ldrb1 tmp1w, src, #1 + strb1 tmp1w, dst, #1 + + b .Lexitfunc_fpsimd + +.Lcpy_over64_fpsimd: + subs count, count, #128 + b.ge .Lcpy_body_large_fpsimd + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp1 A_l, A_h, src, #16 + stp1 A_l, A_h, dst, #16 + ldp1 B_l, B_h, src, #16 + ldp1 C_l, C_h, src, #16 + stp1 B_l, B_h, dst, #16 + stp1 C_l, C_h, dst, #16 + ldp1 D_l, D_h, src, #16 + stp1 D_l, D_h, dst, #16 + + tst count, #0x3f + b.ne .Ltail63_fpsimd + b .Lexitfunc_fpsimd + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large_fpsimd: + /* pre-get 64 bytes data. */ + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add src, src, #64 + +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. 
+ */ + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + ldsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, src + add dst, dst, #64 + add src, src, #64 + + subs count, count, #64 + b.ge 1b + + stsve V_a.16b, V_b.16b, V_c.16b, V_d.16b, dst + add dst, dst, #64 + + tst count, #0x3f + b.ne .Ltail63_fpsimd +.Lexitfunc_fpsimd: diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 34154e7c8577..d0211fce4923 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -62,6 +62,25 @@ SYM_FUNC_START(__arch_copy_to_user) SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user) +#ifdef CONFIG_USE_VECTORIZED_COPY + .macro stsve reg1, reg2, reg3, reg4, ptr + USER(9997f, st1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + + .macro ldsve reg1, reg2, reg3, reg4, ptr + USER_MC(9998f, ld1 {\reg1, \reg2, \reg3, \reg4}, [\ptr]) + .endm + +SYM_FUNC_START(__arch_copy_to_user_fpsimd) + add end, x0, x2 + mov srcin, x1 +#include "copy_template_fpsimd.S" + mov x0, #0 + ret +SYM_FUNC_END(__arch_copy_to_user_fpsimd) +EXPORT_SYMBOL(__arch_copy_to_user_fpsimd) +#endif + .section .fixup,"ax" .align 2 9997: cmp dst, dstin diff --git a/kernel/softirq.c b/kernel/softirq.c index 9fc69e6e2c11..e3f73422829d 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -26,6 +26,10 @@ #include <linux/tick.h> #include <linux/irq.h> +#ifdef CONFIG_USE_VECTORIZED_COPY +#include <asm/fpsimd.h> +#endif + #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -262,6 +266,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) __u32 pending; int softirq_bit; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif /* * Mask out PF_MEMALLOC as the current task context is borrowed for the * softirq. A softirq handled, such as network RX, might set PF_MEMALLOC @@ -273,8 +280,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) account_irq_enter_time(current); __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif in_hardirq = lockdep_softirq_start(); - restart: /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); @@ -322,6 +332,11 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) lockdep_softirq_end(in_hardirq); account_irq_exit_time(current); + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif __local_bh_enable(SOFTIRQ_OFFSET); WARN_ON_ONCE(in_interrupt()); current_restore_flags(old_flags, PF_MEMALLOC); @@ -612,12 +627,21 @@ static void tasklet_action_common(struct softirq_action *a, { struct tasklet_struct *list; +#ifdef CONFIG_USE_VECTORIZED_COPY + struct fpsimd_state state; +#endif + local_irq_disable(); list = tl_head->head; tl_head->head = NULL; tl_head->tail = &tl_head->head; local_irq_enable(); +#ifdef CONFIG_USE_VECTORIZED_COPY + _kernel_fpsimd_save(&state); + uaccess_priviliged_state_save(); +#endif + while (list) { struct tasklet_struct *t = list; @@ -645,6 +669,11 @@ static void tasklet_action_common(struct softirq_action *a, __raise_softirq_irqoff(softirq_nr); local_irq_enable(); } + +#ifdef CONFIG_USE_VECTORIZED_COPY + uaccess_priviliged_state_restore(); + _kernel_fpsimd_load(&state); +#endif } static __latent_entropy void tasklet_action(struct softirq_action *a) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 0b1c13a05332..9ec07294429b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -210,6 +210,17 @@ static int 
max_extfrag_threshold = 1000; #endif /* CONFIG_SYSCTL */ +#ifdef CONFIG_USE_VECTORIZED_COPY +int sysctl_copy_to_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_to_user_threshold); + +int sysctl_copy_from_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_from_user_threshold); + +int sysctl_copy_in_user_threshold = -1; +EXPORT_SYMBOL(sysctl_copy_in_user_threshold); +#endif + #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL) static int bpf_stats_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -3385,6 +3396,30 @@ static struct ctl_table vm_table[] = { .extra2 = SYSCTL_ONE, }, #endif + +#ifdef CONFIG_USE_VECTORIZED_COPY + { + .procname = "copy_to_user_threshold", + .data = &sysctl_copy_to_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_from_user_threshold", + .data = &sysctl_copy_from_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "copy_in_user_threshold", + .data = &sysctl_copy_in_user_threshold, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, +#endif { } }; -- 2.34.1
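
As a quick orientation for the uaccess changes above, here is a condensed, illustrative sketch of the dispatch that the patched raw_copy_from_user() performs. The helper name below is invented, and the CONFIG_VECTORIZED_COPY_VALIDATE double-check path is omitted; all other identifiers come from the patch itself.

/*
 * Illustrative only: mirrors the logic of the patched raw_copy_from_user().
 */
static __always_inline unsigned long
raw_copy_from_user_sketch(void *to, const void __user *from, unsigned long n)
{
	unsigned long ret;

	/* Threshold of -1, or a copy below the threshold: plain GPR copy. */
	if (sysctl_copy_from_user_threshold == -1 ||
	    n < sysctl_copy_from_user_threshold) {
		uaccess_ttbr0_enable();
		ret = __arch_copy_from_user(to, __uaccess_mask_ptr(from), n);
		uaccess_ttbr0_disable();
		return ret;
	}

	/* Large copy: try the 64-byte-per-iteration ld1/st1 routine. */
	if (kernel_fpsimd_begin()) {
		uaccess_enable_privileged();
		ret = __arch_copy_from_user_fpsimd(to, __uaccess_mask_ptr(from), n);
		uaccess_disable_privileged();
		kernel_fpsimd_end();
		return ret;
	}

	/* FPSIMD not usable here (e.g. IRQ context): fall back to the GPR copy. */
	uaccess_ttbr0_enable();
	ret = __arch_copy_from_user(to, __uaccess_mask_ptr(from), n);
	uaccess_ttbr0_disable();
	return ret;
}

The copy_to_user and copy_in_user wrappers in the patch follow the same shape, each gated by its own sysctl threshold.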
[PATCH OLK-5.10] ext4: fix e4b bitmap inconsistency reports
by Yongjian Sun 28 Jan '26

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8330

----------------------------------------------------------------------

A bitmap inconsistency issue was observed during stress tests under
mixed huge-page workloads. Ext4 reported multiple e4b bitmap check
failures like:

ext4_mb_complex_scan_group:2508: group 350, 8179 free clusters as per group info. But got 8192 blocks

Analysis and experimentation confirmed that the issue is caused by a
race condition between page migration and bitmap modification. Although
this timing window is extremely narrow, it is still hit in practice:

folio_lock                        ext4_mb_load_buddy
__migrate_folio
  check ref count
  folio_mc_copy                     __filemap_get_folio
                                      folio_try_get(folio)
                                    ......
                                    mb_mark_used
                                  ext4_mb_unload_buddy
  __folio_migrate_mapping
    folio_ref_freeze
folio_unlock

The root cause of this issue is that the fast path of load_buddy only
increments the folio's reference count, which is insufficient to
prevent concurrent folio migration. We observed that the folio
migration process acquires the folio lock. Therefore, we can determine
whether to take the fast path in load_buddy by checking the lock
status. If the folio is locked, we opt for the slow path (which
acquires the lock) to close this concurrency window.

Additionally, this change addresses the following issues:

When the DOUBLE_CHECK macro is enabled to inspect bitmap-related
issues, the following error may be triggered:

corruption in group 324 at byte 784(6272): f in copy != ff on disk/prealloc

Analysis reveals that this is a false positive. There is a specific
race window where the bitmap and the group descriptor become
momentarily inconsistent, leading to this error report:

ext4_mb_load_buddy                  ext4_mb_load_buddy
  __filemap_get_folio(create|lock)
    folio_lock
    ext4_mb_init_cache
    folio_mark_uptodate               __filemap_get_folio(no lock)
                                      ......
                                      mb_mark_used
                                        mb_mark_used_double
                                          mb_cmp_bitmaps
    mb_set_bits(e4b->bd_bitmap)
    folio_unlock

The original logic assumed that since mb_cmp_bitmaps is called when the
bitmap is newly loaded from disk, the folio lock would be sufficient to
prevent concurrent access. However, this overlooks a specific race
condition: if another process attempts to load buddy and finds the
folio is already in an uptodate state, it will immediately begin using
it without holding folio lock.

Fixes: 9bb3fd60f91e ("arm64: mm: Add copy mc support for all migrate_page")
Signed-off-by: Yongjian Sun <sunyongjian1(a)huawei.com>
---
 fs/ext4/mballoc.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 522d2ec128ef..9d4e8e3c74e2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1217,16 +1217,17 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 	/* we could use find_or_create_page(), but it locks page
 	 * what we'd like to avoid in fast path ... */
 	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
-	if (page == NULL || !PageUptodate(page)) {
+	if (page == NULL || !PageUptodate(page) || PageLocked(page)) {
+		/*
+		 * PageLocked is employed to detect ongoing page
+		 * migrations, since concurrent migrations can lead to
+		 * bitmap inconsistency. And if we are not uptodate that
+		 * implies somebody just created the page but is yet to
+		 * initialize it. We can drop the page reference and
+		 * try to get the page with lock in both cases to avoid
+		 * concurrency.
+		 */
 		if (page)
-			/*
-			 * drop the page reference and try
-			 * to get the page with lock. If we
-			 * are not uptodate that implies
-			 * somebody just created the page but
-			 * is yet to initialize the same. So
-			 * wait for it to initialize.
-			 */
 			put_page(page);
 		page = find_or_create_page(inode->i_mapping, pnum, gfp);
 		if (page) {
@@ -1261,7 +1262,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 	poff = block % blocks_per_page;

 	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
-	if (page == NULL || !PageUptodate(page)) {
+	if (page == NULL || !PageUptodate(page) || PageLocked(page)) {
 		if (page)
 			put_page(page);
 		page = find_or_create_page(inode->i_mapping, pnum, gfp);
--
2.39.2
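
The fast-path test that the hunks above open-code in both call sites can be read as a single predicate. The helper below is purely illustrative (it does not exist in the patch) and only restates that condition:

/* Only trust a page we merely hold a reference on if it is cached,
 * fully initialized, and not locked (a held lock may mean migration). */
static bool mb_buddy_page_usable(struct page *page)
{
	return page && PageUptodate(page) && !PageLocked(page);
}

If the predicate fails, the code drops the reference and retries through find_or_create_page(), which takes the page lock and therefore serializes against both initialization and migration.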
[PATCH OLK-6.6] udp: Deal with race between UDP socket address change and rehash
by Zhang Changzhong 28 Jan '26

From: Stefano Brivio <sbrivio(a)redhat.com> mainline inclusion from mainline-v6.14-rc1 commit a502ea6fa94b1f7be72a24bcf9e3f5f6b7e6e90c category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/10960 CVE: CVE-2024-57974 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… ------------------------------------------------- If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated. Secondary hash chains were introduced by commit 30fff9231fad ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect(). This operation was introduced by commit 719f835853a9 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending. This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow. Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer: LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4)) dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in while :; do taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc & sleep 0.1 || sleep 1 taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null wait done where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration): 2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device: LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')" while :; do ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc & sleep 0.2 || sleep 1 socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null wait cmp tmp.in tmp.out done Once this fails: tmp.in tmp.out differ: char 8193, line 29 we can finally have a look at what's going on: $ tshark -r pasta.pcap 1 0.000000 :: ? ff02::16 ICMPv6 110 Multicast Listener Report Message v2 2 0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 3 0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 4 0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 5 0.168827 c6:47:05:8d:dc:04 ? Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164 6 0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55 7 0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 8 0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable) 9 0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 
1337 Len=8192 10 0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 11 0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096 12 0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0 On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message. In another variant of this reproducer, starting the client with: strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log & and connecting to the socat server using a loopback address: socat OPEN:tmp.in UDP4:localhost:1337,shut-null we can more clearly observe a sendmmsg() call failing after the first datagram is delivered: [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0 [...] [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable) [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1 [...] [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused) and, somewhat confusingly, after a connect() on the same socket succeeded. Until commit 4cdeeee9252a ("net: udp: prefer listeners bound to an address"), the race between receive address change and lookup didn't actually cause visible issues, because, once the lookup based on the secondary hash chain failed, we would still attempt a lookup based on the primary hash (destination port only), and find the socket with the outdated secondary hash. That change, however, dropped port-only lookups altogether, as side effect, making the race visible. To fix this, while avoiding the need to make address changes and rehash atomic against lookups, reintroduce primary hash lookups as fallback, if lookups based on four-tuple and secondary hashes fail. To this end, introduce a simplified lookup implementation, which doesn't take care of SO_REUSEPORT groups: if we have one, there are multiple sockets that would match the four-tuple or secondary hash, meaning that we can't run into this race at all. v2: - instead of synchronising lookup operations against address change plus rehash, reintroduce a simplified version of the original primary hash lookup as fallback v1: - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr usage (Kuniyuki Iwashima) - directly use sk_rcv_saddr for IPv4 receive addresses instead of fetching inet_rcv_saddr (Kuniyuki Iwashima) - move inet_update_saddr() to inet_hashtables.h and use that to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima) - rebase onto net-next, update commit message accordingly Reported-by: Ed Santiago <santiago(a)redhat.com> Link: https://github.com/containers/podman/issues/24147 Analysed-by: David Gibson <david(a)gibson.dropbear.id.au> Fixes: 30fff9231fad ("udp: bind() optimisation") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> Reviewed-by: Eric Dumazet <edumazet(a)google.com> Reviewed-by: Willem de Bruijn <willemb(a)google.com> Signed-off-by: David S. Miller <davem(a)davemloft.net> Conflicts: net/ipv4/udp.c net/ipv6/udp.c [context conflicts, and fix some discards qualifiers error.] 
Signed-off-by: Liu Jian <liujian56(a)huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong(a)huawei.com> --- net/ipv4/udp.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ net/ipv6/udp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 108 insertions(+) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 8a34e22..4c1f82f 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -421,6 +421,49 @@ u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport, udp_ehash_secret + net_hash_mix(net)); } +/** + * udp4_lib_lookup1() - Simplified lookup using primary hash (destination port) + * @net: Network namespace + * @saddr: Source address, network order + * @sport: Source port, network order + * @daddr: Destination address, network order + * @hnum: Destination port, host order + * @dif: Destination interface index + * @sdif: Destination bridge port index, if relevant + * @udptable: Set of UDP hash tables + * + * Simplified lookup to be used as fallback if no sockets are found due to a + * potential race between (receive) address change, and lookup happening before + * the rehash operation. This function ignores SO_REUSEPORT groups while scoring + * result sockets, because if we have one, we don't need the fallback at all. + * + * Called under rcu_read_lock(). + * + * Return: socket with highest matching score if any, NULL if none + */ +static struct sock *udp4_lib_lookup1(const struct net *net, + __be32 saddr, __be16 sport, + __be32 daddr, unsigned int hnum, + int dif, int sdif, + const struct udp_table *udptable) +{ + unsigned int slot = udp_hashfn(net, hnum, udptable->mask); + struct udp_hslot *hslot = &udptable->hash[slot]; + struct sock *sk, *result = NULL; + int score, badness = 0; + + sk_for_each_rcu(sk, &hslot->head) { + score = compute_score(sk, (struct net *)net, + saddr, sport, daddr, hnum, dif, sdif); + if (score > badness) { + result = sk; + badness = score; + } + } + + return result; +} + /* called with rcu_read_lock() */ static struct sock *udp4_lib_lookup2(struct net *net, __be32 saddr, __be16 sport, @@ -526,6 +569,19 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, result = udp4_lib_lookup2(net, saddr, sport, htonl(INADDR_ANY), hnum, dif, sdif, hslot2, skb); + if (!IS_ERR_OR_NULL(result)) + goto done; + + /* Primary hash (destination port) lookup as fallback for this race: + * 1. __ip4_datagram_connect() sets sk_rcv_saddr + * 2. lookup (this function): new sk_rcv_saddr, hashes not updated yet + * 3. rehash operation updating _secondary and four-tuple_ hashes + * The primary hash doesn't need an update after 1., so, thanks to this + * further step, 1. and 3. don't need to be atomic against the lookup. 
+ */ + result = udp4_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif, + udptable); + done: if (IS_ERR(result)) return NULL; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 9ff8e72..7c9f77f 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -162,6 +162,51 @@ static int compute_score(struct sock *sk, struct net *net, return score; } +/** + * udp6_lib_lookup1() - Simplified lookup using primary hash (destination port) + * @net: Network namespace + * @saddr: Source address, network order + * @sport: Source port, network order + * @daddr: Destination address, network order + * @hnum: Destination port, host order + * @dif: Destination interface index + * @sdif: Destination bridge port index, if relevant + * @udptable: Set of UDP hash tables + * + * Simplified lookup to be used as fallback if no sockets are found due to a + * potential race between (receive) address change, and lookup happening before + * the rehash operation. This function ignores SO_REUSEPORT groups while scoring + * result sockets, because if we have one, we don't need the fallback at all. + * + * Called under rcu_read_lock(). + * + * Return: socket with highest matching score if any, NULL if none + */ +static struct sock *udp6_lib_lookup1(const struct net *net, + const struct in6_addr *saddr, __be16 sport, + const struct in6_addr *daddr, + unsigned int hnum, int dif, int sdif, + const struct udp_table *udptable) +{ + unsigned int slot = udp_hashfn(net, hnum, udptable->mask); + struct udp_hslot *hslot = &udptable->hash[slot]; + struct sock *sk, *result = NULL; + int score, badness = 0; + + sk_for_each_rcu(sk, &hslot->head) { + score = compute_score(sk, (struct net *)net, + (struct in6_addr *)saddr, sport, + (struct in6_addr *)daddr, hnum, + dif, sdif); + if (score > badness) { + result = sk; + badness = score; + } + } + + return result; +} + /* called with rcu_read_lock() */ static struct sock *udp6_lib_lookup2(struct net *net, const struct in6_addr *saddr, __be16 sport, @@ -266,6 +311,13 @@ struct sock *__udp6_lib_lookup(struct net *net, result = udp6_lib_lookup2(net, saddr, sport, &in6addr_any, hnum, dif, sdif, hslot2, skb); + if (!IS_ERR_OR_NULL(result)) + goto done; + + /* Cover address change/lookup/rehash race: see __udp4_lib_lookup() */ + result = udp6_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif, + udptable); + done: if (IS_ERR(result)) return NULL; -- 2.9.5
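
A condensed sketch of the resulting lookup order in __udp4_lib_lookup() may help when reading the hunk above. The function below is illustrative only: SO_REUSEPORT handling and the hslot2 computations are trimmed, and the two hslot2 pointers are passed in purely to keep the sketch short.

static struct sock *udp4_lookup_order_sketch(struct net *net,
					     __be32 saddr, __be16 sport,
					     __be32 daddr, unsigned int hnum,
					     int dif, int sdif,
					     struct udp_hslot *hslot2,
					     struct udp_hslot *hslot2_any,
					     struct udp_table *udptable,
					     struct sk_buff *skb)
{
	struct sock *result;

	/* 1. Secondary hash: sockets bound to (daddr, hnum). */
	result = udp4_lib_lookup2(net, saddr, sport, daddr, hnum,
				  dif, sdif, hslot2, skb);
	if (!IS_ERR_OR_NULL(result))
		goto done;

	/* 2. Secondary hash for wildcard-bound sockets (INADDR_ANY, hnum). */
	result = udp4_lib_lookup2(net, saddr, sport, htonl(INADDR_ANY), hnum,
				  dif, sdif, hslot2_any, skb);
	if (!IS_ERR_OR_NULL(result))
		goto done;

	/* 3. New fallback: primary hash (destination port only), which stays
	 * valid while a connect()-triggered rehash is still pending. */
	result = udp4_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif,
				  udptable);
done:
	return IS_ERR(result) ? NULL : result;
}

The IPv6 path mirrors this, with udp6_lib_lookup1() as the final step.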
[PATCH OLK-6.6] x86/fpu: Clear XSTATE_BV[i] in guest XSAVE state whenever XFD[i]=1
by Zhang Kunbo 28 Jan '26

From: Sean Christopherson <seanjc(a)google.com> stable inclusion from stable-v6.12.67 commit f577508cc8a0adb8b4ebe9480bba7683b6149930 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/13516 CVE: CVE-2026-23005 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- commit b45f721775947a84996deb5c661602254ce25ce6 upstream. When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in response to a guest WRMSR, clear XFD-disabled features in the saved (or to be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for features that are disabled via the guest's XFD. Because the kernel executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1 will cause XRSTOR to #NM and panic the kernel. E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV: ------------[ cut here ]------------ WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848 Modules linked in: kvm_intel kvm irqbypass CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:exc_device_not_available+0x101/0x110 Call Trace: <TASK> asm_exc_device_not_available+0x1a/0x20 RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90 switch_fpu_return+0x4a/0xb0 kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm] kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x62/0x940 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1, and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's call to fpu_update_guest_xfd(). and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE: ------------[ cut here ]------------ WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867 Modules linked in: kvm_intel kvm irqbypass CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:exc_device_not_available+0x101/0x110 Call Trace: <TASK> asm_exc_device_not_available+0x1a/0x20 RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90 fpu_swap_kvm_fpstate+0x6b/0x120 kvm_load_guest_fpu+0x30/0x80 [kvm] kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm] kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x62/0x940 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- The new behavior is consistent with the AMX architecture. Per Intel's SDM, XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD (and non-compacted XSAVE saves the initial configuration of the state component): If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i, the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1; instead, it operates as if XINUSE[i] = 0 (and the state component was in its initial state): it saves bit i of XSTATE_BV field of the XSAVE header as 0; in addition, XSAVE saves the initial configuration of the state component (the other instructions do not save state component i). Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using a constant XFD based on the set of enabled features when XSAVEing for a struct fpu_guest. 
However, having XSTATE_BV[i]=1 for XFD-disabled features can only happen in the above interrupt case, or in similar scenarios involving preemption on preemptible kernels, because fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the outgoing FPU state with the current XFD; and that is (on all but the first WRMSR to XFD) the guest XFD. Therefore, XFD can only go out of sync with XSTATE_BV in the above interrupt case, or in similar scenarios involving preemption on preemptible kernels, and it we can consider it (de facto) part of KVM ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features. Reported-by: Paolo Bonzini <pbonzini(a)redhat.com> Cc: stable(a)vger.kernel.org Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14) Signed-off-by: Sean Christopherson <seanjc(a)google.com> [Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate to kvm_vcpu_ioctl_x86_set_xsave. - Paolo] Reviewed-by: Binbin Wu <binbin.wu(a)linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: arch/x86/kvm/x86.c [Context difference. ] Signed-off-by: Zhang Kunbo <zhangkunbo(a)huawei.com> --- arch/x86/kernel/fpu/core.c | 32 +++++++++++++++++++++++++++++--- arch/x86/kvm/x86.c | 9 +++++++++ 2 files changed, 38 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1f0871be9d53..fd53ff26cd42 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -296,10 +296,29 @@ EXPORT_SYMBOL_GPL(fpu_enable_guest_xfd_features); #ifdef CONFIG_X86_64 void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) { + struct fpstate *fpstate = guest_fpu->fpstate; + fpregs_lock(); - guest_fpu->fpstate->xfd = xfd; - if (guest_fpu->fpstate->in_use) - xfd_update_state(guest_fpu->fpstate); + + /* + * KVM's guest ABI is that setting XFD[i]=1 *can* immediately revert the + * save state to its initial configuration. Likewise, KVM_GET_XSAVE does + * the same as XSAVE and returns XSTATE_BV[i]=0 whenever XFD[i]=1. + * + * If the guest's FPU state is in hardware, just update XFD: the XSAVE + * in fpu_swap_kvm_fpstate will clear XSTATE_BV[i] whenever XFD[i]=1. + * + * If however the guest's FPU state is NOT resident in hardware, clear + * disabled components in XSTATE_BV now, or a subsequent XRSTOR will + * attempt to load disabled components and generate #NM _in the host_. + */ + if (xfd && test_thread_flag(TIF_NEED_FPU_LOAD)) + fpstate->regs.xsave.header.xfeatures &= ~xfd; + + fpstate->xfd = xfd; + if (fpstate->in_use) + xfd_update_state(fpstate); + fpregs_unlock(); } EXPORT_SYMBOL_GPL(fpu_update_guest_xfd); @@ -407,6 +426,13 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, if (ustate->xsave.header.xfeatures & ~xcr0) return -EINVAL; + /* + * Disabled features must be in their initial state, otherwise XRSTOR + * causes an exception. + */ + if (WARN_ON_ONCE(ustate->xsave.header.xfeatures & kstate->xfd)) + return -EINVAL; + /* * Nullify @vpkru to preserve its current value if PKRU's bit isn't set * in the header. 
KVM's odd ABI is to leave PKRU untouched in this diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2139f728aecc..57fc6916906b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5541,9 +5541,18 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, struct kvm_xsave *guest_xsave) { + union fpregs_state *xstate = (union fpregs_state *)guest_xsave->region; + if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) return 0; + /* + * For backwards compatibility, do not expect disabled features to be in + * their initial state. XSTATE_BV[i] must still be cleared whenever + * XFD[i]=1, or XRSTOR would cause a #NM. + */ + xstate->xsave.header.xfeatures &= ~vcpu->arch.guest_fpu.fpstate->xfd; + return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, guest_xsave->region, kvm_caps.supported_xcr0, -- 2.34.1
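
The invariant the patch enforces can be summed up in one line: before any XRSTOR that runs with the guest's XFD, no bit may be set in XSTATE_BV that is also set in XFD. A minimal illustrative helper (not present in the patch itself; names follow its code):

/* Strip XFD-disabled components from the saved XSTATE_BV, matching what
 * hardware XSAVE would have recorded with that XFD value in effect. */
static void clear_xfd_disabled_features(struct fpstate *fpstate, u64 xfd)
{
	fpstate->regs.xsave.header.xfeatures &= ~xfd;
}

The patch applies exactly this masking in fpu_update_guest_xfd() when the guest's FPU state is not resident in hardware, and in kvm_vcpu_ioctl_x86_set_xsave() before user-supplied XSAVE data is loaded.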