From: Li Nan <linan122@huawei.com>

hulk inclusion
category: bugfix
bugzilla: 187584, https://gitee.com/openeuler/kernel/issues/I5QW2R
CVE: NA
--------------------------------
This reverts commit 36f5d7662495aa5ad4ec197443e69e01384eda3c.
There are two wbt_enable_default() calls in bfq_exit_queue(). Although this does not cause any fault, revert one of them.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/bfq-iosched.c | 2 --
 1 file changed, 2 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4bfea5e5354e..1aec01c0a707 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6418,8 +6418,6 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	spin_unlock_irq(&bfqd->lock);
 #endif
 
-	wbt_enable_default(bfqd->queue);
-
 	kfree(bfqd);
 
 	/* Re-enable throttling in case elevator disabled it */
From: Ye Bin <yebin10@huawei.com>

mainline inclusion
from mainline-v5.19-rc3
commit 9b6641dd95a0c441b277dd72ba22fed8d61f76ad
category: bugfix
bugzilla: 186927, https://gitee.com/src-openeuler/kernel/issues/I5YIY6
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We hit the following issue:

[home]# mount /dev/sda test
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
[home]# dmesg
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
EXT4-fs (sda): Errors on filesystem, clearing orphan list.
EXT4-fs (sda): recovery complete
EXT4-fs (sda): mounted filesystem with ordered data mode. Quota mode: none.
[home]# debugfs /dev/sda
debugfs 1.46.5 (30-Dec-2021)
Checksum errors in superblock!  Retrying...
The reason is that ext4_orphan_cleanup() resets 's_last_orphan' but does not update the superblock checksum.

To solve the above issue, defer updating the superblock checksum until after ext4_orphan_cleanup().
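In outline, the resulting mount-time ordering is (a condensed sketch of ext4_fill_super() with error handling and surrounding code elided; only the relative order of these calls matters):

	/* 1) initialize the free space/inode counters ... */
	err = percpu_counter_init(&sbi->s_freeinodes_counter, freei, GFP_KERNEL);
	/* 2) ... let orphan cleanup clear es->s_last_orphan ... */
	ext4_orphan_cleanup(sb, es);
	/* 3) ... and only then seal the superblock checksum. */
	ext4_superblock_csum_set(sb);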
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220525012904.1604737-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 fs/ext4/super.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 602db1346670..5b98734f9f3d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5078,14 +5078,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		err = percpu_counter_init(&sbi->s_freeinodes_counter, freei,
 					  GFP_KERNEL);
 	}
-	/*
-	 * Update the checksum after updating free space/inode
-	 * counters. Otherwise the superblock can have an incorrect
-	 * checksum in the buffer cache until it is written out and
-	 * e2fsprogs programs trying to open a file system immediately
-	 * after it is mounted can fail.
-	 */
-	ext4_superblock_csum_set(sb);
 	if (!err)
 		err = percpu_counter_init(&sbi->s_dirs_counter,
 					  ext4_count_dirs(sb), GFP_KERNEL);
@@ -5140,6 +5132,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
 		ext4_orphan_cleanup(sb, es);
 		EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
+	/*
+	 * Update the checksum after updating free space/inode counters and
+	 * ext4_orphan_cleanup. Otherwise the superblock can have an incorrect
+	 * checksum in the buffer cache until it is written out and
+	 * e2fsprogs programs trying to open a file system immediately
+	 * after it is mounted can fail.
+	 */
+	ext4_superblock_csum_set(sb);
 	if (needs_recovery) {
 		ext4_msg(sb, KERN_INFO, "recovery complete");
 		err = ext4_mark_recovery_complete(sb, es);
From: Kees Cook <keescook@chromium.org>

mainline inclusion
from mainline-v5.13-rc1
commit 0d66ccc1627013c95f1e7ef10b95b8451cd7834e
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As shown in the comment in jump_label.h, choosing the initial state of static branches changes the assembly layout. If the condition is expected to be likely it's inline, and if unlikely it is out of line via a jump.
A few places in the kernel use (or could be using) a CONFIG to choose the default state, which would give a small performance benefit to their compile-time declared default. Provide the infrastructure to do this.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210401232347.2791257-2-keescook@chromium.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/jump_label.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 470a7e5a7756..6dec6543393b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -388,6 +388,21 @@ struct static_key_false {
 		[0 ... (count) - 1] = STATIC_KEY_FALSE_INIT,	\
 	}
 
+#define _DEFINE_STATIC_KEY_1(name)	DEFINE_STATIC_KEY_TRUE(name)
+#define _DEFINE_STATIC_KEY_0(name)	DEFINE_STATIC_KEY_FALSE(name)
+#define DEFINE_STATIC_KEY_MAYBE(cfg, name)			\
+	__PASTE(_DEFINE_STATIC_KEY_, IS_ENABLED(cfg))(name)
+
+#define _DEFINE_STATIC_KEY_RO_1(name)	DEFINE_STATIC_KEY_TRUE_RO(name)
+#define _DEFINE_STATIC_KEY_RO_0(name)	DEFINE_STATIC_KEY_FALSE_RO(name)
+#define DEFINE_STATIC_KEY_MAYBE_RO(cfg, name)			\
+	__PASTE(_DEFINE_STATIC_KEY_RO_, IS_ENABLED(cfg))(name)
+
+#define _DECLARE_STATIC_KEY_1(name)	DECLARE_STATIC_KEY_TRUE(name)
+#define _DECLARE_STATIC_KEY_0(name)	DECLARE_STATIC_KEY_FALSE(name)
+#define DECLARE_STATIC_KEY_MAYBE(cfg, name)			\
+	__PASTE(_DECLARE_STATIC_KEY_, IS_ENABLED(cfg))(name)
+
 extern bool ____wrong_branch_error(void);
 
 #define static_key_enabled(x)						\
@@ -488,6 +503,10 @@ extern bool ____wrong_branch_error(void);
 
 #endif /* CONFIG_JUMP_LABEL */
 
+#define static_branch_maybe(config, x)					\
+	(IS_ENABLED(config) ? static_branch_likely(x)			\
+			    : static_branch_unlikely(x))
+
 /*
  * Advanced usage; refcount, branch is enabled when: count != 0
  */
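For context, a minimal usage sketch of the new macros (the config symbol, key name, and functions below are hypothetical, not part of this patch):

	/* Branch direction defaults to the compile-time state of CONFIG_FOO_ON. */
	DEFINE_STATIC_KEY_MAYBE(CONFIG_FOO_ON, foo_key);

	static void foo_work(void) { /* hypothetical slow-path work */ }

	static void foo_hot_path(void)
	{
		/* Inline branch if CONFIG_FOO_ON=y, out-of-line jump otherwise. */
		if (static_branch_maybe(CONFIG_FOO_ON, &foo_key))
			foo_work();
	}

The key can still be flipped at runtime with static_branch_enable()/static_branch_disable(); the config option only chooses which state is the fall-through fast path.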
From: Kees Cook <keescook@chromium.org>

mainline inclusion
from mainline-v5.13-rc1
commit 39218ff4c625dbf2e68224024fe0acaa60bcd51a
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This provides the ability for architectures to enable kernel stack base address offset randomization. This feature is controlled by the boot param "randomize_kstack_offset=on/off", with its default value set by CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
This feature is based on the original idea from the last public release of PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All the credit for the original idea goes to the PaX team. Note that the design and implementation of this upstream randomize_kstack_offset feature differs greatly from the RANDKSTACK feature (see below).
Reasoning for the feature:
This feature aims to make harder the various stack-based attacks that rely on deterministic stack structure. We have had many such attacks in the past (just to name a few):
https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux...
As Linux kernel stack protections have been constantly improving (vmap-based stack allocation with guard pages, removal of thread_info, STACKLEAK), attackers have had to find new ways for their exploits to work. They have done so, continuing to rely on the kernel's stack determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT were not relevant. For example, the following recent attacks would have been hampered if the stack offset was non-deterministic between syscalls:
https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf (page 70: targeting the pt_regs copy with linear stack overflow)
https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html (leaked stack address from one syscall as a target during next syscall)
The main idea is that since the stack offset is randomized on each system call, it is harder for an attack to reliably land in any particular place on the thread stack, even with address exposures, as the stack base will change on the next syscall. Also, since randomization is performed after placing pt_regs, the ptrace-based approach[1] to discover the randomized offset during a long-running syscall should not be possible.
Design description:
During most of the kernel's execution, it runs on the "thread stack", which is pretty deterministic in its structure: it is fixed in size, and on every entry from userspace to kernel on a syscall the thread stack starts construction from an address fetched from the per-cpu cpu_current_top_of_stack variable. The first element to be pushed to the thread stack is the pt_regs struct that stores all required CPU registers and syscall parameters. Finally the specific syscall function is called, with the stack being used as the kernel executes the resulting request.
The goal of the randomize_kstack_offset feature is to add a random offset after pt_regs has been pushed to the stack and before the rest of the thread stack is used during syscall processing, and to change it every time a process issues a syscall. The source of randomness is currently architecture-defined (but x86 is using the low byte of rdtsc()). Future improvements for different entropy sources are possible, but out of scope for this patch. Furthermore, to add more unpredictability, new offsets are chosen at the end of syscalls (the timing of which should be less easy to measure from userspace than at syscall entry time), and stored in a per-CPU variable, so that the life of the value does not stay explicitly tied to a single task.
As suggested by Andy Lutomirski, the offset is added using alloca() and an empty asm() statement with an output constraint, since it avoids changes to assembly syscall entry code, to the unwinder, and provides correct stack alignment as defined by the compiler.
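As a rough stand-alone illustration of that trick (the function name here is invented; the kernel's real implementation is the add_random_kstack_offset() macro added below, and this sketch uses the input-register constraint that a later patch in this series settles on):

	#include <alloca.h>
	#include <stdint.h>

	static inline void offset_stack(uint32_t entropy)
	{
		/* Grow the stack by a bounded, data-dependent amount. */
		uint8_t *ptr = alloca(entropy & 0x3FF);	/* cap at 10 bits */

		/* The empty asm makes "ptr" escape, so the compiler must keep
		 * the allocation, while stack alignment stays compiler-defined. */
		asm volatile("" : : "r"(ptr) : "memory");
	}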
In order to make this available by default with zero performance impact for those that don't want it, it is boot-time selectable with static branches. This way, if the overhead is not wanted, it can just be left turned off with no performance impact.
The generated assembly for x86_64 with GCC looks like this:
...
ffffffff81003977: 65 8b 05 02 ea 00 7f    mov    %gs:0x7f00ea02(%rip),%eax
                                          # 12380 <kstack_offset>
ffffffff8100397e: 25 ff 03 00 00          and    $0x3ff,%eax
ffffffff81003983: 48 83 c0 0f             add    $0xf,%rax
ffffffff81003987: 25 f8 07 00 00          and    $0x7f8,%eax
ffffffff8100398c: 48 29 c4                sub    %rax,%rsp
ffffffff8100398f: 48 8d 44 24 0f          lea    0xf(%rsp),%rax
ffffffff81003994: 48 83 e0 f0             and    $0xfffffffffffffff0,%rax
...
As a result of the above stack alignment, this patch introduces about 5 bits of randomness after pt_regs is spilled to the thread stack on x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for stack alignment). The amount of entropy could be adjusted based on how much of the stack space we wish to trade for security.
My measure of syscall performance overhead (on x86_64):
lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null

	randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
	randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
So, roughly 0.9% overhead growth for a no-op syscall, which is very manageable. And for people that don't want this, it's off by default.
There are two gotchas with using the alloca() trick. First, compilers that have Stack Clash protection (-fstack-clash-protection) enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to any dynamic stack allocations. While the randomization offset is always less than a page, the resulting assembly would still contain (unreachable!) probing routines, bloating the resulting assembly. To avoid this, -fno-stack-clash-protection is unconditionally added to the kernel Makefile since this is the only dynamic stack allocation in the kernel (now that VLAs have been removed) and it is provably safe from Stack Clash style attacks.
The second gotcha with alloca() is a negative interaction with -fstack-protector*, in that it sees the alloca() as an array allocation, which triggers the unconditional addition of the stack canary function pre/post-amble, slowing down syscalls regardless of the static branch state. In order to avoid adding this unneeded check and its associated performance impact, architectures need to carefully remove uses of -fstack-protector-strong (or -fstack-protector) in the compilation units that use the add_random_kstack_offset() macro, and to audit the resulting stack mitigation coverage (to make sure no desired coverage disappears). No change is visible for this on x86 because the stack protector is already unconditionally disabled for the compilation unit, but the change is required on arm64. There is, unfortunately, no attribute that can be used to disable the stack protector for specific functions.
Comparison to PaX RANDKSTACK feature:
The RANDKSTACK feature randomizes the location of the stack start (cpu_current_top_of_stack), i.e. including the location of the pt_regs structure itself on the stack. Initially this patch followed the same approach, but during the recent discussions[2], it has been determined to be of little value since, if ptrace functionality is available for an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write different offsets in the pt_regs struct, observe the cache behavior of the pt_regs accesses, and figure out the random stack offset. Another difference is that the random offset is stored in a per-cpu variable, rather than having it be per-thread. As a result, these implementations differ a fair bit in their implementation details and results, though obviously the intent is similar.
[1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4B...
[2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshet...
[3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org

conflict:
  Documentation/admin-guide/kernel-parameters.txt
  arch/Kconfig

Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 .../admin-guide/kernel-parameters.txt | 11 ++++
 Makefile                              |  4 ++
 arch/Kconfig                          | 23 ++++++++
 include/linux/randomize_kstack.h      | 54 +++++++++++++++++++
 init/main.c                           | 23 ++++++++
 5 files changed, 115 insertions(+)
 create mode 100644 include/linux/randomize_kstack.h
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index de74fc62be19..1e8a33bc52e9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4241,6 +4241,17 @@
 			fully seed the kernel's CRNG. Default is controlled
 			by CONFIG_RANDOM_TRUST_BOOTLOADER.
 
+	randomize_kstack_offset=
+			[KNL] Enable or disable kernel stack offset
+			randomization, which provides roughly 5 bits of
+			entropy, frustrating memory corruption attacks
+			that depend on stack address determinism or
+			cross-syscall address exposures. This is only
+			available on architectures that have defined
+			CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET.
+			Format: <bool> (1/Y/y=enable, 0/N/n=disable)
+			Default is CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
+
 	ras=option[,option,...]	[KNL] RAS-specific options
 
 		cec_disable	[X86]
diff --git a/Makefile b/Makefile
index 85d8936b0ffa..29dee6e7a872 100644
--- a/Makefile
+++ b/Makefile
@@ -825,6 +825,10 @@ KBUILD_CFLAGS += -ftrivial-auto-var-init=zero
 KBUILD_CFLAGS += -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
 endif
 
+# While VLAs have been removed, GCC produces unreachable stack probes
+# for the randomize_kstack_offset feature. Disable it for all compilers.
+KBUILD_CFLAGS	+= $(call cc-option, -fno-stack-clash-protection)
+
 DEBUG_CFLAGS	:=
 
 # Workaround for GCC versions < 5.0
diff --git a/arch/Kconfig b/arch/Kconfig
index 7a8e3d45b2a1..0454a4b1da2a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -989,6 +989,29 @@ config VMAP_STACK
 	  virtual mappings with real shadow memory, and KASAN_VMALLOC must
 	  be enabled.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization with calls to add_random_kstack_offset()
+	  during syscall entry and choose_random_kstack_offset() during
+	  syscall exit. Careful removal of -fstack-protector-strong and
+	  -fstack-protector should also be applied to the entry code and
+	  closely examined, as the artificial stack bump looks like an array
+	  to the compiler, so it will attempt to add canary checks regardless
+	  of the static branch state.
+
+config RANDOMIZE_KSTACK_OFFSET_DEFAULT
+	bool "Randomize kernel stack offset on syscall entry"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	help
+	  The kernel stack offset can be randomized (after pt_regs) by
+	  roughly 5 bits of entropy, frustrating memory corruption
+	  attacks that depend on stack address determinism or
+	  cross-syscall address exposures. This feature is controlled
+	  by kernel boot param "randomize_kstack_offset=on/off", and this
+	  config chooses the default boot state.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
new file mode 100644
index 000000000000..fd80fab663a9
--- /dev/null
+++ b/include/linux/randomize_kstack.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_RANDOMIZE_KSTACK_H
+#define _LINUX_RANDOMIZE_KSTACK_H
+
+#include <linux/kernel.h>
+#include <linux/jump_label.h>
+#include <linux/percpu-defs.h>
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
+			 randomize_kstack_offset);
+DECLARE_PER_CPU(u32, kstack_offset);
+
+/*
+ * Do not use this anywhere else in the kernel. This is used here because
+ * it provides an arch-agnostic way to grow the stack with correct
+ * alignment. Also, since this use is being explicitly masked to a max of
+ * 10 bits, stack-clash style attacks are unlikely. For more details see
+ * "VLAs" in Documentation/process/deprecated.rst
+ */
+void *__builtin_alloca(size_t size);
+/*
+ * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
+ * "VLA" from being unbounded (see above). 10 bits leaves enough room for
+ * per-arch offset masks to reduce entropy (by removing higher bits, since
+ * high entropy may overly constrain usable stack space), and for
+ * compiler/arch-specific stack alignment to remove the lower bits.
+ */
+#define KSTACK_OFFSET_MAX(x)	((x) & 0x3FF)
+
+/*
+ * These macros must be used during syscall entry when interrupts and
+ * preempt are disabled, and after user registers have been stored to
+ * the stack.
+ */
+#define add_random_kstack_offset() do {					\
+	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
+				&randomize_kstack_offset)) {		\
+		u32 offset = raw_cpu_read(kstack_offset);		\
+		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
+		/* Keep allocation even after "ptr" loses scope. */	\
+		asm volatile("" : "=o"(*ptr) :: "memory");		\
+	}								\
+} while (0)
+
+#define choose_random_kstack_offset(rand) do {				\
+	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
+				&randomize_kstack_offset)) {		\
+		u32 offset = raw_cpu_read(kstack_offset);		\
+		offset ^= (rand);					\
+		raw_cpu_write(kstack_offset, offset);			\
+	}								\
+} while (0)
+
+#endif
diff --git a/init/main.c b/init/main.c
index 7f4e8a8964b1..e1d179fa1f4a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -846,6 +846,29 @@ static void __init mm_init(void)
 	pti_init();
 }
 
+#ifdef CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
+			   randomize_kstack_offset);
+DEFINE_PER_CPU(u32, kstack_offset);
+
+static int __init early_randomize_kstack_offset(char *buf)
+{
+	int ret;
+	bool bool_result;
+
+	ret = kstrtobool(buf, &bool_result);
+	if (ret)
+		return ret;
+
+	if (bool_result)
+		static_branch_enable(&randomize_kstack_offset);
+	else
+		static_branch_disable(&randomize_kstack_offset);
+	return 0;
+}
+early_param("randomize_kstack_offset", early_randomize_kstack_offset);
+#endif
+
 void __init __weak arch_call_rest_init(void)
 {
 	rest_init();
From: Kees Cook <keescook@chromium.org>

mainline inclusion
from mainline-v5.13-rc1
commit fe950f6020338c8ac668ef823bb692d36b7542a2
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Allow for a randomized stack offset on a per-syscall basis, with roughly 5-6 bits of entropy, depending on compiler and word size. Since the method of offsetting uses macros, this cannot live in the common entry code (the stack offset needs to be retained for the life of the syscall, which means it needs to happen at the actual entry point).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-5-keescook@chromium.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/x86/Kconfig                    |  1 +
 arch/x86/entry/common.c             |  3 +++
 arch/x86/include/asm/entry-common.h | 16 ++++++++++++++++
 3 files changed, 20 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c31746505bb3..42f03c50857f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -163,6 +163,7 @@ config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD	if X86_64
 	select HAVE_ARCH_USERFAULTFD_WP		if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_ASM_MODVERSIONS
 	select HAVE_CMPXCHG_DOUBLE
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 93a3122cd15f..1f179e317f0c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,6 +38,7 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
@@ -83,6 +84,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
 	unsigned int nr = syscall_32_enter(regs);
 
+	add_random_kstack_offset();
 	/*
 	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
 	 * orig_ax, the unsigned int return value truncates it. This may
@@ -102,6 +104,7 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	unsigned int nr = syscall_32_enter(regs);
 	int res;
 
+	add_random_kstack_offset();
 	/*
 	 * This cannot use syscall_enter_from_user_mode() as it has to
 	 * fetch EBP before invoking any of the syscall entry work
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 4a382fb6a9ef..50ded30ae54e 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_ENTRY_COMMON_H
 #define _ASM_X86_ENTRY_COMMON_H
 
+#include <linux/randomize_kstack.h>
 #include <linux/user-return-notifier.h>
 
 #include <asm/nospec-branch.h>
@@ -72,6 +73,21 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 	 */
 	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
 #endif
+
+	/*
+	 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
+	 * but not enough for x86 stack utilization comfort. To keep
+	 * reasonable stack head room, reduce the maximum offset to 8 bits.
+	 *
+	 * The actual entropy will be further reduced by the compiler when
+	 * applying stack alignment constraints (see cc_stack_align4/8 in
+	 * arch/x86/Makefile), which will remove the 3 (x86_64) or 2 (ia32)
+	 * low bits from any entropy chosen here.
+	 *
+	 * Therefore, final stack offset entropy will be 5 (x86_64) or
+	 * 6 (ia32) bits.
+	 */
+	choose_random_kstack_offset(rdtsc() & 0xFF);
 }
 #define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
From: Kees Cook <keescook@chromium.org>

mainline inclusion
from mainline-v5.13-rc1
commit 70918779aec9bd01d16f4e6e800ffe423d196021
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Allow for a randomized stack offset on a per-syscall basis, with roughly 5 bits of entropy. (And include the AAPCS rationale, thanks to Mark Rutland.)
In order to avoid unconditional stack canaries on syscall entry (due to the use of alloca()), also disable stack protector to avoid triggering needless checks and slowing down the entry path. As there is no general way to control stack protector coverage with a function attribute[1], this must be disabled at the compilation unit level. This isn't a problem here, though, since stack protector was not triggered before: examining the resulting syscall.o, there are no changes in canary coverage (none before, none now).
[1] a working __attribute__((no_stack_protector)) has been added to GCC and Clang but has not been released in any version yet:
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=346b302d09c1e6db56d9fe69048ac...
https://reviews.llvm.org/rG4fbf84c1732fca596ad1d6e96015e19760eb8a9b
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210401232347.2791257-6-keescook@chromium.org

conflict:
  arch/arm64/Kconfig
  arch/arm64/kernel/syscall.c

Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/Kconfig          |  1 +
 arch/arm64/kernel/Makefile  |  5 +++++
 arch/arm64/kernel/syscall.c | 16 ++++++++++++++++
 3 files changed, 22 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 28b4e754e856..6d288cfa313f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -144,6 +144,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 4cf75b247461..312c164db2ed 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -9,6 +9,11 @@ CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_insn.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_return_address.o = $(CC_FLAGS_FTRACE)
 
+# Remove stack protector to avoid triggering unneeded stack canary
+# checks due to randomize_kstack_offset.
+CFLAGS_REMOVE_syscall.o	 = -fstack-protector -fstack-protector-strong
+CFLAGS_syscall.o	+= -fno-stack-protector
+
 # Object file lists.
 obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o	\
 			   entry-common.o entry-fpsimd.o process.o ptrace.o \
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 66ca9534bd69..2a106e67a4cb 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -5,6 +5,7 @@
 #include <linux/errno.h>
 #include <linux/nospec.h>
 #include <linux/ptrace.h>
+#include <linux/randomize_kstack.h>
 #include <linux/syscalls.h>
 
 #include <asm/daifflags.h>
@@ -42,6 +43,8 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
 {
 	long ret;
 
+	add_random_kstack_offset();
+
 	if (scno < sc_nr) {
 		syscall_fn_t syscall_fn;
 		syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)];
@@ -51,6 +54,19 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
 	}
 
 	syscall_set_return_value(current, regs, 0, ret);
+
+	/*
+	 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
+	 * but not enough for arm64 stack utilization comfort. To keep
+	 * reasonable stack head room, reduce the maximum offset to 9 bits.
+	 *
+	 * The actual entropy will be further reduced by the compiler when
+	 * applying stack alignment constraints: the AAPCS mandates a
+	 * 16-byte (i.e. 4-bit) aligned SP at function boundaries.
+	 *
+	 * The resulting 5 bits of entropy is seen in SP[8:4].
+	 */
+	choose_random_kstack_offset(get_random_int() & 0x1FF);
 }
 
 static inline bool has_syscall_work(unsigned long flags)
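Spelling out the entropy arithmetic from the comment above (illustrative only):

	/* get_random_int() & 0x1FF       -> 9 random bits, offset in [0, 511]
	 * KSTACK_OFFSET_MAX(x) = x & 0x3FF -> no further reduction here
	 * AAPCS 16-byte SP alignment     -> compiler discards the low 4 bits
	 * 9 - 4 = 5 usable bits of entropy, visible in SP[8:4]
	 */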
From: Kees Cook <keescook@chromium.org>

mainline inclusion
from mainline-v5.13-rc1
commit 68ef8735d253f3d840082b78f996bf2d89ee6e5f
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
For validating the stack offset behavior, report the offset from a given process's first seen stack address. Add a script to calculate the results to the LKDTM kselftests.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-7-keescook@chromium.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/misc/lkdtm/bugs.c                     | 17 +++++++++
 drivers/misc/lkdtm/core.c                     |  1 +
 drivers/misc/lkdtm/lkdtm.h                    |  1 +
 tools/testing/selftests/lkdtm/.gitignore      |  1 +
 tools/testing/selftests/lkdtm/Makefile        |  1 +
 .../testing/selftests/lkdtm/stack-entropy.sh  | 36 +++++++++++++++++++
 6 files changed, 57 insertions(+)
 create mode 100755 tools/testing/selftests/lkdtm/stack-entropy.sh
diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
index d39b8139b096..5094835d2567 100644
--- a/drivers/misc/lkdtm/bugs.c
+++ b/drivers/misc/lkdtm/bugs.c
@@ -134,6 +134,23 @@ noinline void lkdtm_CORRUPT_STACK_STRONG(void)
 	__lkdtm_CORRUPT_STACK((void *)&data);
 }
 
+static pid_t stack_pid;
+static unsigned long stack_addr;
+
+void lkdtm_REPORT_STACK(void)
+{
+	volatile uintptr_t magic;
+	pid_t pid = task_pid_nr(current);
+
+	if (pid != stack_pid) {
+		pr_info("Starting stack offset tracking for pid %d\n", pid);
+		stack_pid = pid;
+		stack_addr = (uintptr_t)&magic;
+	}
+
+	pr_info("Stack offset: %d\n", (int)(stack_addr - (uintptr_t)&magic));
+}
+
 void lkdtm_UNALIGNED_LOAD_STORE_WRITE(void)
 {
 	static u8 data[5] __attribute__((aligned(4))) = {1, 2, 3, 4, 5};
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 32b3d77368e3..2c0bb716a119 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -110,6 +110,7 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(EXHAUST_STACK),
 	CRASHTYPE(CORRUPT_STACK),
 	CRASHTYPE(CORRUPT_STACK_STRONG),
+	CRASHTYPE(REPORT_STACK),
 	CRASHTYPE(CORRUPT_LIST_ADD),
 	CRASHTYPE(CORRUPT_LIST_DEL),
 	CRASHTYPE(STACK_GUARD_PAGE_LEADING),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 6dec4c9b442f..6e35ab171ea3 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -17,6 +17,7 @@ void lkdtm_LOOP(void);
 void lkdtm_EXHAUST_STACK(void);
 void lkdtm_CORRUPT_STACK(void);
 void lkdtm_CORRUPT_STACK_STRONG(void);
+void lkdtm_REPORT_STACK(void);
 void lkdtm_UNALIGNED_LOAD_STORE_WRITE(void);
 void lkdtm_SOFTLOCKUP(void);
 void lkdtm_HARDLOCKUP(void);
diff --git a/tools/testing/selftests/lkdtm/.gitignore b/tools/testing/selftests/lkdtm/.gitignore
index f26212605b6b..d4b0be857deb 100644
--- a/tools/testing/selftests/lkdtm/.gitignore
+++ b/tools/testing/selftests/lkdtm/.gitignore
@@ -1,2 +1,3 @@
 *.sh
 !run.sh
+!stack-entropy.sh
diff --git a/tools/testing/selftests/lkdtm/Makefile b/tools/testing/selftests/lkdtm/Makefile
index 1bcc9ee990eb..c71109ceeb2d 100644
--- a/tools/testing/selftests/lkdtm/Makefile
+++ b/tools/testing/selftests/lkdtm/Makefile
@@ -5,6 +5,7 @@ include ../lib.mk
 
 # NOTE: $(OUTPUT) won't get default value if used before lib.mk
 TEST_FILES := tests.txt
+TEST_PROGS := stack-entropy.sh
 TEST_GEN_PROGS = $(patsubst %,$(OUTPUT)/%.sh,$(shell awk '{print $$1}' tests.txt | sed -e 's/#//'))
 all: $(TEST_GEN_PROGS)
 
diff --git a/tools/testing/selftests/lkdtm/stack-entropy.sh b/tools/testing/selftests/lkdtm/stack-entropy.sh
new file mode 100755
index 000000000000..b1b8a5097cbb
--- /dev/null
+++ b/tools/testing/selftests/lkdtm/stack-entropy.sh
@@ -0,0 +1,36 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Measure kernel stack entropy by sampling via LKDTM's REPORT_STACK test.
+set -e
+samples="${1:-1000}"
+
+# Capture dmesg continuously since it may fill up depending on sample size.
+log=$(mktemp -t stack-entropy-XXXXXX)
+dmesg --follow >"$log" & pid=$!
+report=-1
+for i in $(seq 1 $samples); do
+	echo "REPORT_STACK" >/sys/kernel/debug/provoke-crash/DIRECT
+	if [ -t 1 ]; then
+		percent=$(( 100 * $i / $samples ))
+		if [ "$percent" -ne "$report" ]; then
+			/bin/echo -en "$percent%\r"
+			report="$percent"
+		fi
+	fi
+done
+kill "$pid"
+
+# Count unique offsets since last run.
+seen=$(tac "$log" | grep -m1 -B"$samples"0 'Starting stack offset' | \
+	grep 'Stack offset' | awk '{print $NF}' | sort | uniq -c | wc -l)
+bits=$(echo "obase=2; $seen" | bc | wc -L)
+echo "Bits of stack entropy: $bits"
+rm -f "$log"
+
+# We would expect any functional stack randomization to be at least 5 bits.
+if [ "$bits" -lt 5 ]; then
+	exit 1
+else
+	exit 0
+fi
From: Nick Desaulniers <ndesaulniers@google.com>

mainline inclusion
from mainline-v5.13-rc1
commit 2515dd6ce8e545b0b2eece84920048ef9ed846c4
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
"o" isn't a common asm() constraint to use; it triggers an assertion in assert-enabled builds of LLVM that it's not recognized when targeting aarch64 (though it appears to fall back to "m"). It's fixed in LLVM 13 now, but there isn't really a good reason to use "o" in particular here. To avoid causing build issues for those using assert-enabled builds of earlier LLVM versions, the constraint needs changing.
Instead, if the point is to retain the __builtin_alloca(), make ptr appear to "escape" via being an input to an empty inline asm block. This is preferable anyway, since otherwise this looks like a dead store.
While the use of "r" was considered in
https://lore.kernel.org/lkml/202104011447.2E7F543@keescook/
it was only tested as an output (which looks like a dead store, and wasn't sufficient).
Use "r" as an input constraint instead, which behaves correctly across compilers and architectures.
Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://reviews.llvm.org/D100412
Link: https://bugs.llvm.org/show_bug.cgi?id=49956
Link: https://lore.kernel.org/r/20210419231741.4084415-1-keescook@chromium.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/randomize_kstack.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index fd80fab663a9..bebc911161b6 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -38,7 +38,7 @@ void *__builtin_alloca(size_t size);
 		u32 offset = raw_cpu_read(kstack_offset);		\
 		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
 		/* Keep allocation even after "ptr" loses scope. */	\
-		asm volatile("" : "=o"(*ptr) :: "memory");		\
+		asm volatile("" :: "r"(ptr) : "memory");		\
 	}								\
 } while (0)
From: Marco Elver <elver@google.com>

mainline inclusion
from mainline-v5.18-rc1
commit 8cb37a5974a48569aab8a1736d21399fddbdbdb2
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The randomize_kstack_offset feature is unconditionally compiled in when the architecture supports it.
To add constraints on compiler versions, we require a dedicated Kconfig variable. Therefore, introduce RANDOMIZE_KSTACK_OFFSET.
Furthermore, this option is now also configurable by EXPERT kernels: while the feature is supposed to have zero performance overhead when disabled, due to its use of static branches, there are a few cases where giving a distribution the option to disable the feature entirely makes sense. For example, in very resource-constrained environments that would never enable the feature to begin with, the additional kernel code size increase would be redundant.
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220131090521.1947110-1-elver@google.com
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/Kconfig                     | 23 ++++++++++++++++++-----
 include/linux/randomize_kstack.h |  5 +++++
 init/main.c                      |  2 +-
 3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 0454a4b1da2a..3bd3412f78f4 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1001,16 +1001,29 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	  to the compiler, so it will attempt to add canary checks regardless
 	  of the static branch state.
 
-config RANDOMIZE_KSTACK_OFFSET_DEFAULT
-	bool "Randomize kernel stack offset on syscall entry"
+config RANDOMIZE_KSTACK_OFFSET
+	bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
+	default y
 	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	help
 	  The kernel stack offset can be randomized (after pt_regs) by
 	  roughly 5 bits of entropy, frustrating memory corruption
 	  attacks that depend on stack address determinism or
-	  cross-syscall address exposures. This feature is controlled
-	  by kernel boot param "randomize_kstack_offset=on/off", and this
-	  config chooses the default boot state.
+	  cross-syscall address exposures.
+
+	  The feature is controlled via the "randomize_kstack_offset=on/off"
+	  kernel boot param, and if turned off has zero overhead due to its use
+	  of static branches (see JUMP_LABEL).
+
+	  If unsure, say Y.
+
+config RANDOMIZE_KSTACK_OFFSET_DEFAULT
+	bool "Default state of kernel stack offset randomization"
+	depends on RANDOMIZE_KSTACK_OFFSET
+	help
+	  Kernel stack offset randomization is controlled by kernel boot param
+	  "randomize_kstack_offset=on/off", and this config chooses the default
+	  boot state.
 
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index bebc911161b6..91f1b990a3c3 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_RANDOMIZE_KSTACK_H
 #define _LINUX_RANDOMIZE_KSTACK_H
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
 #include <linux/kernel.h>
 #include <linux/jump_label.h>
 #include <linux/percpu-defs.h>
@@ -50,5 +51,9 @@ void *__builtin_alloca(size_t size);
 		raw_cpu_write(kstack_offset, offset);			\
 	}								\
 } while (0)
+#else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
+#define add_random_kstack_offset()		do { } while (0)
+#define choose_random_kstack_offset(rand)	do { } while (0)
+#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 
 #endif
diff --git a/init/main.c b/init/main.c
index e1d179fa1f4a..3660c43291ce 100644
--- a/init/main.c
+++ b/init/main.c
@@ -846,7 +846,7 @@ static void __init mm_init(void)
 	pti_init();
 }
 
-#ifdef CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
 DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			   randomize_kstack_offset);
 DEFINE_PER_CPU(u32, kstack_offset);
From: "GONG, Ruiqi" gongruiqi1@huawei.com
mainline inclusion from mainline-v6.0-rc1 commit 375561bd6195a31bf4c109732bd538cb97a941f4 category: featrue bugzilla: https://gitee.com/openeuler/kernel/issues/I5YQ6Z CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Fix the following Sparse warnings that got noticed when the PPC-dev patchwork was checking another patch (see the link below):
init/main.c:862:1: warning: symbol 'randomize_kstack_offset' was not declared. Should it be static?
init/main.c:864:1: warning: symbol 'kstack_offset' was not declared. Should it be static?
These are in fact triggered on all architectures that have HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET support (for instance x86, arm64, etc.).
Link: https://lore.kernel.org/lkml/e7b0d68b-914d-7283-827c-101988923929@huawei.com...
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220629060423.2515693-1-gongruiqi1@huawei.com
conflict: init/main.c
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 init/main.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/init/main.c b/init/main.c
index 3660c43291ce..21b65f18ba83 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/kcsan.h>
 #include <linux/init_syscalls.h>
+#include <linux/randomize_kstack.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v5.15-rc1
commit 89f871af1b26d98d983cba7ed0e86effa45ba5f8
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I60HCD
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If blk_mq_request_issue_directly() fails in blk_insert_cloned_request(), the request will already have been accounted as started. Currently, blk_insert_cloned_request() is only called by dm, and such a request won't be accounted as done by dm.

In the normal path, io is accounted as started in blk_mq_bio_to_request(), when the request is allocated, and such io is accounted as done in __blk_mq_end_request_acct() whether it succeeded or failed. Thus add blk_account_io_done() to fix the problem.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220126012132.3111551-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflict: block/blk-core.c

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/blk-core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index a4ec5e168312..a18cfc467d41 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1223,7 +1223,10 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 	 * bypass a potential scheduler on the bottom device for
 	 * insert.
 	 */
-	return blk_mq_request_issue_directly(rq, true);
+	ret = blk_mq_request_issue_directly(rq, true);
+	if (ret)
+		blk_account_io_done(rq, ktime_get_ns());
+	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
From: Lei Chen <lennychen@tencent.com>

stable inclusion
from stable-v5.10.152
commit 392536023da18086d57565e716ed50193869b8e7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------
commit 5a20d073ec54a72d9a732fa44bfe14954eb6332f upstream.
It's unnecessary to call wbt_update_limits() explicitly within wbt_init(), because it will be called by the subsequent call to wbt_queue_depth_changed().
Signed-off-by: Lei Chen <lennychen@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/blk-wbt.c | 1 -
 1 file changed, 1 deletion(-)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 35d81b5deae1..4ec0a018a2ad 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -840,7 +840,6 @@ int wbt_init(struct request_queue *q)
 	rwb->enable_state = WBT_STATE_ON_DEFAULT;
 	rwb->wc = 1;
 	rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
-	wbt_update_limits(rwb);
 
 	/*
 	 * Assign rwb and add the stats callback.
From: Yu Kuai <yukuai3@huawei.com>

stable inclusion
from stable-v5.10.152
commit 910ba49b33450a878128adc7d9c419dd97efd923
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------
commit 8c5035dfbb9475b67c82b3fdb7351236525bf52b upstream.
Our test found a problem where the wbt inflight counter becomes negative, which causes an io hang (note that this problem doesn't exist in mainline):
t1: device create		t2: issue io
add_disk
 blk_register_queue
  wbt_enable_default
   wbt_init
    rq_qos_add
    // wb_normal is still 0
				/*
				 * in mainline, disk can't be opened before
				 * bdev_add(), however, in old kernels, disk
				 * can be opened before blk_register_queue().
				 */
				blkdev_issue_flush
				// disk size is 0, however, it's not checked
				 submit_bio_wait
				  submit_bio
				   blk_mq_submit_bio
				    rq_qos_throttle
				     wbt_wait
				      bio_to_wbt_flags
				       rwb_enabled
				       // wb_normal is 0, inflight is not increased

wbt_queue_depth_changed(&rwb->rqos);
 wbt_update_limits
 // wb_normal is initialized
				    rq_qos_track
				     wbt_track
				      rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
				      // wb_normal is not 0, wbt_flags will be set

t3: io completion
blk_mq_free_request
 rq_qos_done
  wbt_done
   wbt_is_tracked
   // return true
   __wbt_done
    wbt_rqw_done
     atomic_dec_return(&rqw->inflight);
     // inflight is decreased
commit 8235b5c1e8c1 ("block: call bdev_add later in device_add_disk") can avoid this problem, however it's better to fix this problem in wbt:
1) Lower kernels can't backport this patch due to lots of refactoring.
2) The root cause is that wbt calls rq_qos_add() before wb_normal is
   initialized.
Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/blk-wbt.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4ec0a018a2ad..bafdb8098893 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -840,6 +840,10 @@ int wbt_init(struct request_queue *q)
 	rwb->enable_state = WBT_STATE_ON_DEFAULT;
 	rwb->wc = 1;
 	rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
+	rwb->min_lat_nsec = wbt_default_latency_nsec(q);
+
+	wbt_queue_depth_changed(&rwb->rqos);
+	wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 
 	/*
 	 * Assign rwb and add the stats callback.
@@ -847,10 +851,5 @@ int wbt_init(struct request_queue *q)
 	rq_qos_add(q, &rwb->rqos);
 	blk_stat_add_callback(q, rwb->cb);
 
-	rwb->min_lat_nsec = wbt_default_latency_nsec(q);
-
-	wbt_queue_depth_changed(&rwb->rqos);
-	wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
-
 	return 0;
 }
From: Yu Kuai <yukuai3@huawei.com>

stable inclusion
from stable-v5.10.152
commit 31b1570677e8bf85f48be8eb95e21804399b8295
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------
commit 285febabac4a16655372d23ff43e89ff6f216691 upstream.
commit 8c5035dfbb94 ("blk-wbt: call rq_qos_add() after wb_normal is initialized") moves wbt_set_write_cache() before rq_qos_add(), which is wrong because wbt_rq_qos() is still NULL.
Fix the problem by removing wbt_set_write_cache() and setting 'rwb->wc' directly. Note that this patch also removes the redundant setting of 'rwb->wc'.
Fixes: 8c5035dfbb94 ("blk-wbt: call rq_qos_add() after wb_normal is initialized")
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/blk-wbt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index bafdb8098893..6f63920f073c 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -838,12 +838,11 @@ int wbt_init(struct request_queue *q)
 	rwb->last_comp = rwb->last_issue = jiffies;
 	rwb->win_nsec = RWB_WINDOW_NSEC;
 	rwb->enable_state = WBT_STATE_ON_DEFAULT;
-	rwb->wc = 1;
+	rwb->wc = test_bit(QUEUE_FLAG_WC, &q->queue_flags);
 	rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
 	rwb->min_lat_nsec = wbt_default_latency_nsec(q);
 
 	wbt_queue_depth_changed(&rwb->rqos);
-	wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 
 	/*
 	 * Assign rwb and add the stats callback.
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60IHY
CVE: NA
--------------------------------
This reverts commit eeabdc14ef8231fea94074b744d9648805a4015b, to prepare for backporting the solution from mainline.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/bfq-cgroup.c | 16 +++++-----------
 block/bfq-wf2q.c   |  9 ---------
 2 files changed, 5 insertions(+), 20 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index f84a88b7a09d..abd01025d043 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -643,7 +643,6 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		   struct bfq_group *bfqg)
 {
 	struct bfq_entity *entity = &bfqq->entity;
-	struct bfq_group *old_parent = bfqq_group(bfqq);
 
 	if (bfqq == &bfqd->oom_bfqq)
 		return;
@@ -667,22 +666,18 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		bfq_deactivate_bfqq(bfqd, bfqq, false, false);
 	else if (entity->on_st_or_in_serv)
 		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+	bfqg_and_blkg_put(bfqq_group(bfqq));
 
 	entity->parent = bfqg->my_entity;
 	entity->sched_data = &bfqg->sched_data;
 	/* pin down bfqg and its associated blkg */
 	bfqg_and_blkg_get(bfqg);
 
-	/*
-	 * Don't leave the bfqq->pos_root to old bfqg, since the ref to old
-	 * bfqg will be released and the bfqg might be freed.
-	 */
-	if (unlikely(!bfqd->nonrot_with_queueing))
-		bfq_pos_tree_add_move(bfqd, bfqq);
-	bfqg_and_blkg_put(old_parent);
-
-	if (bfq_bfqq_busy(bfqq))
+	if (bfq_bfqq_busy(bfqq)) {
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 		bfq_activate_bfqq(bfqd, bfqq);
+	}
 
 	if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
 		bfq_schedule_dispatch(bfqd);
@@ -964,7 +959,6 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 
 put_async_queues:
 	bfq_put_async_queues(bfqd, bfqg);
-	pd->plid = BLKCG_MAX_POLS;
 
 	spin_unlock_irqrestore(&bfqd->lock, flags);
 	/*
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 5a6cb0513c4f..26776bdbdf36 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1695,15 +1695,6 @@ void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  */
 void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-	/* If parent group is offlined, move the bfqq to root group */
-	if (bfqq->entity.parent) {
-		struct bfq_group *bfqg = bfq_bfqq_to_bfqg(bfqq);
-
-		if (bfqg->pd.plid >= BLKCG_MAX_POLS)
-			bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
-	}
-#endif
 	bfq_log_bfqq(bfqd, bfqq, "add to busy");
 
 	bfq_activate_bfqq(bfqd, bfqq);
From: Jan Kara <jack@suse.cz>

stable inclusion
from stable-v5.10.121
commit 70a7dea84639bcd029130e00e01792eb9207fb38
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60IHY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 09f871868080c33992cd6a9b72a5ca49582578fa upstream.
Track whether a bfq_group is still online. We cannot rely on blkcg_gq->online, because that gets cleared only after all policies are offlined, and we need something that gets updated already under bfqd->lock when we are cleaning up our bfq_group, to be able to guarantee that when we see an online bfq_group, it will stay online while we are holding bfqd->lock.
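The consumer pattern this enables looks roughly like the following (a hypothetical sketch, not part of this diff; the actual users are added by later patches in this series):

	spin_lock_irq(&bfqd->lock);
	if (bfqg->online) {
		/* Safe: bfq_pd_offline() clears the flag under the same
		 * bfqd->lock, so the group cannot go offline while we
		 * hold the lock. */
		use_bfqg(bfqg);		/* hypothetical work */
	}
	spin_unlock_irq(&bfqd->lock);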
CC: <stable@vger.kernel.org>
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/bfq-cgroup.c  | 3 ++-
 block/bfq-iosched.h | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index abd01025d043..6846bfe03912 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -555,6 +555,7 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
 				   */
 	bfqg->bfqd = bfqd;
 	bfqg->active_entities = 0;
+	bfqg->online = true;
 	bfqg->rq_pos_tree = RB_ROOT;
 }
 
@@ -601,7 +602,6 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
 	struct bfq_entity *entity;
 
 	bfqg = bfq_lookup_bfqg(bfqd, blkcg);
-
 	if (unlikely(!bfqg))
 		return NULL;
 
@@ -959,6 +959,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 
 put_async_queues:
 	bfq_put_async_queues(bfqd, bfqg);
+	bfqg->online = false;
 
 	spin_unlock_irqrestore(&bfqd->lock, flags);
 	/*
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index fb51e5ce9400..b116ed48d27d 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -901,6 +901,8 @@ struct bfq_group {
 
 	/* reference counter (see comments in bfq_bic_update_cgroup) */
 	int ref;
+	/* Is bfq_group still online? */
+	bool online;
 
 	struct bfq_entity entity;
 	struct bfq_sched_data sched_data;
From: Jan Kara <jack@suse.cz>

stable inclusion
from stable-v5.10.121
commit 0285718e28259e41f405a038ee0e6bb984fd1b34
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60IHY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 4e54a2493e582361adc3bfbf06c7d50d19d18837 upstream.
BFQ's usage of __bio_blkcg() is a relic from the past. Furthermore, if a bio were not associated with any blkcg, the usage of __bio_blkcg() in BFQ would be prone to races with the task being migrated between cgroups, as __bio_blkcg() calls at different places could return different blkcgs.

Convert BFQ to the new situation where bio->bi_blkg is initialized in bio_set_dev() and thus practically always valid. This allows us to save the blkcg_gq lookup and noticeably simplify the code.
CC: <stable@vger.kernel.org>
Fixes: 0fe061b9f03c ("blkcg: fix ref count issue with bio_blkcg() using task_css")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/bfq-cgroup.c  | 63 +++++++++++++++++----------------------------
 block/bfq-iosched.c | 10 +-------
 block/bfq-iosched.h |  3 +--
 3 files changed, 25 insertions(+), 51 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 6846bfe03912..168faa2c8c3b 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -584,27 +584,11 @@ static void bfq_group_set_parent(struct bfq_group *bfqg, entity->sched_data = &parent->sched_data; }
-static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd, - struct blkcg *blkcg) +static void bfq_link_bfqg(struct bfq_data *bfqd, struct bfq_group *bfqg) { - struct blkcg_gq *blkg; - - blkg = blkg_lookup(blkcg, bfqd->queue); - if (likely(blkg)) - return blkg_to_bfqg(blkg); - return NULL; -} - -struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, - struct blkcg *blkcg) -{ - struct bfq_group *bfqg, *parent; + struct bfq_group *parent; struct bfq_entity *entity;
- bfqg = bfq_lookup_bfqg(bfqd, blkcg); - if (unlikely(!bfqg)) - return NULL; - /* * Update chain of bfq_groups as we might be handling a leaf group * which, along with some of its relatives, has not been hooked yet @@ -621,8 +605,15 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, bfq_group_set_parent(curr_bfqg, parent); } } +}
- return bfqg; +struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio) +{ + struct blkcg_gq *blkg = bio->bi_blkg; + + if (!blkg) + return bfqd->root_group; + return blkg_to_bfqg(blkg); }
/** @@ -694,25 +685,15 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, * Move bic to blkcg, assuming that bfqd->lock is held; which makes * sure that the reference to cgroup is valid across the call (see * comments in bfq_bic_update_cgroup on this issue) - * - * NOTE: an alternative approach might have been to store the current - * cgroup in bfqq and getting a reference to it, reducing the lookup - * time here, at the price of slightly more complex code. */ -static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd, - struct bfq_io_cq *bic, - struct blkcg *blkcg) +static void *__bfq_bic_change_cgroup(struct bfq_data *bfqd, + struct bfq_io_cq *bic, + struct bfq_group *bfqg) { struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0); struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1); - struct bfq_group *bfqg; struct bfq_entity *entity;
- bfqg = bfq_find_set_group(bfqd, blkcg); - - if (unlikely(!bfqg)) - bfqg = bfqd->root_group; - if (async_bfqq) { entity = &async_bfqq->entity;
@@ -764,20 +745,24 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd, void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) { struct bfq_data *bfqd = bic_to_bfqd(bic); - struct bfq_group *bfqg = NULL; + struct bfq_group *bfqg = bfq_bio_bfqg(bfqd, bio); uint64_t serial_nr;
- rcu_read_lock(); - serial_nr = __bio_blkcg(bio)->css.serial_nr; + serial_nr = bfqg_to_blkg(bfqg)->blkcg->css.serial_nr;
/* * Check whether blkcg has changed. The condition may trigger * spuriously on a newly created cic but there's no harm. */ if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr)) - goto out; + return;
- bfqg = __bfq_bic_change_cgroup(bfqd, bic, __bio_blkcg(bio)); + /* + * New cgroup for this process. Make sure it is linked to bfq internal + * cgroup hierarchy. + */ + bfq_link_bfqg(bfqd, bfqg); + __bfq_bic_change_cgroup(bfqd, bic, bfqg); /* * Update blkg_path for bfq_log_* functions. We cache this * path, and update it here, for the following @@ -830,8 +815,6 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) */ blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path)); bic->blkcg_serial_nr = serial_nr; -out: - rcu_read_unlock(); }
/** @@ -1449,7 +1432,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd) bfq_end_wr_async_queues(bfqd, bfqd->root_group); }
-struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, struct blkcg *blkcg) +struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio) { return bfqd->root_group; } diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 1aec01c0a707..6edc00da5b57 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -5175,14 +5175,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, struct bfq_queue *bfqq; struct bfq_group *bfqg;
- rcu_read_lock(); - - bfqg = bfq_find_set_group(bfqd, __bio_blkcg(bio)); - if (!bfqg) { - bfqq = &bfqd->oom_bfqq; - goto out; - } - + bfqg = bfq_bio_bfqg(bfqd, bio); if (!is_sync) { async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class, ioprio); @@ -5226,7 +5219,6 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, out: bfqq->ref++; /* get a process reference to this queue */ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, bfqq->ref); - rcu_read_unlock(); return bfqq; }
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index b116ed48d27d..2a4a6f44efff 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -984,8 +984,7 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg); void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio); void bfq_end_wr_async(struct bfq_data *bfqd); -struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, - struct blkcg *blkcg); +struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio); struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg); struct bfq_group *bfqq_group(struct bfq_queue *bfqq); struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node);
From: Jan Kara jack@suse.cz
stable inclusion from stable-v5.10.121 commit 51f724bffa3403a5236597e6b75df7329c1ec6e9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60IHY CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 075a53b78b815301f8d3dd1ee2cd99554e34f0dd upstream.
Bios queued into the BFQ IO scheduler can be associated with a cgroup that was already offlined. This may then cause insertion of this bfq_group into a service tree. But this bfq_group will get freed as soon as the last bio associated with it is completed, leading to use-after-free issues for service tree users. Fix the problem by making sure we always operate on an online bfq_group. If the bfq_group associated with the bio is not online, we pick the first online parent.
CC: stable@vger.kernel.org Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support") Tested-by: "yukuai (C)" yukuai3@huawei.com Signed-off-by: Jan Kara jack@suse.cz Reviewed-by: Christoph Hellwig hch@lst.de Link: https://lore.kernel.org/r/20220401102752.8599-9-jack@suse.cz Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/bfq-cgroup.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 168faa2c8c3b..f99351017182 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -610,10 +610,19 @@ static void bfq_link_bfqg(struct bfq_data *bfqd, struct bfq_group *bfqg) struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio) { struct blkcg_gq *blkg = bio->bi_blkg; + struct bfq_group *bfqg;
- if (!blkg) - return bfqd->root_group; - return blkg_to_bfqg(blkg); + while (blkg) { + bfqg = blkg_to_bfqg(blkg); + if (bfqg->online) { + bio_associate_blkg_from_css(bio, &blkg->blkcg->css); + return bfqg; + } + blkg = blkg->parent; + } + bio_associate_blkg_from_css(bio, + &bfqg_to_blkg(bfqd->root_group)->blkcg->css); + return bfqd->root_group; }
/**
From: Yu Kuai yukuai3@huawei.com
stable inclusion from stable-v5.10.147 commit cce5dc03338e25e910fb5a2c4f2ce8a79644370f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60JC5 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
This reverts commit 24cd0b9bfdff126c066032b0d40ab0962d35e777.
1) Commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails") tries to fix the problem that iova allocation can fail while there is still free space available. It is not backported to 5.10 stable.
2) Commit fce54ed02757 ("scsi: hisi_sas: Limit max hw sectors for v3 HW") fixes the performance regression introduced by 1). However, it is only a temporary solution and causes an io performance regression of its own, because it limits the max io size to PAGE_SIZE * 32 (128k for a 4k page size).
3) John Garry posted a patchset to fix the problem properly.
4) The temporary solution is therefore reverted.
It's odd that the patch in 2) was backported to 5.10 stable alone; the right thing to do is to backport them all together.
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: John Garry john.garry@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 7 ------- 1 file changed, 7 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c index dcbda8edd03f..fd5bdb0afa71 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c @@ -2755,7 +2755,6 @@ static int slave_configure_v3_hw(struct scsi_device *sdev) struct hisi_hba *hisi_hba = shost_priv(shost); struct device *dev = hisi_hba->dev; int ret = sas_slave_configure(sdev); - unsigned int max_sectors;
if (ret) return ret; @@ -2773,12 +2772,6 @@ static int slave_configure_v3_hw(struct scsi_device *sdev) } }
- /* Set according to IOMMU IOVA caching limit */ - max_sectors = min_t(size_t, queue_max_hw_sectors(sdev->request_queue), - (PAGE_SIZE * 32) >> SECTOR_SHIFT); - - blk_queue_max_hw_sectors(sdev->request_queue, max_sectors); - return 0; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v5.16-rc2 commit 76dd298094f484c6250ebd076fa53287477b2328 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5VGU9 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Our syzkaller reported a null pointer dereference; the root cause is the following:
__blk_mq_alloc_map_and_rqs
 set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
  blk_mq_alloc_map_and_rqs
   blk_mq_alloc_rqs
    // failed due to oom
    alloc_pages_node
   // set->tags[hctx_idx] is still NULL
   blk_mq_free_rqs
    drv_tags = set->tags[hctx_idx];
    // null pointer dereference is triggered
    blk_mq_clear_rq_mapping(drv_tags, ...)
This is because commit 63064be150e4 ("blk-mq: Add blk_mq_alloc_map_and_rqs()") merged the two steps:
1) set->tags[hctx_idx] = blk_mq_alloc_rq_map() 2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])
into one step:
set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()
Since tags is not initialized yet in this case, fix the problem by checking for a NULL pointer in blk_mq_clear_rq_mapping().
Fixes: 63064be150e4 ("blk-mq: Add blk_mq_alloc_map_and_rqs()") Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-mq.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index 5cd17ca527ea..5f896a12b8e4 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2438,8 +2438,11 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags, struct page *page; unsigned long flags;
- /* There is no need to clear a driver tags own mapping */ - if (drv_tags == tags) + /* + * There is no need to clear mapping if driver tags is not initialized + * or the mapping belongs to the driver tags. + */ + if (!drv_tags || drv_tags == tags) return;
list_for_each_entry(page, &tags->page_list, lru) {
From: Li Huafei lihuafei1@huawei.com
mainline inclusion from mainline-v6.1-rc4 commit 0e792b89e6800cd9cb4757a76a96f7ef3e8b6294 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I600G0 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
KASAN reported a use-after-free with ftrace ops [1]. It was found from vmcore that perf had registered two ops with the same content successively, both dynamic. After unregistering the second ops, a use-after-free occurred.
In ftrace_shutdown(), when the second ops is unregistered, the FTRACE_UPDATE_CALLS command is not set because there is another enabled ops with the same content. Also, both ops are dynamic and the ftrace callback function is ftrace_ops_list_func, so the FTRACE_UPDATE_TRACE_FUNC command will not be set. Eventually the value of 'command' will be 0 and ftrace_shutdown() will skip the rcu synchronization.
However, ftrace may be activated. When the ops is released, another CPU may be accessing the ops. Add the missing synchronization to fix this problem.
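The required pattern is the usual RCU unpublish-then-free discipline; a minimal sketch of it (illustrative only, with made-up helper names, not the actual ftrace code, which reuses ftrace_shutdown()'s existing synchronization path as shown in the diff below):

	static void shutdown_and_free_dynamic_ops(struct ftrace_ops *ops)
	{
		unpublish_ops(ops);	/* made up: unlink from the ops list */

		/*
		 * Even when no FTRACE_UPDATE_* command is needed, callers
		 * such as ftrace_ops_list_func() may still be walking the
		 * list on another CPU, so wait for them before freeing.
		 */
		synchronize_rcu();
		if (IS_ENABLED(CONFIG_PREEMPTION))
			synchronize_rcu_tasks();

		free_ops(ops);		/* made up: now safe to free */
	}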
[1]
BUG: KASAN: use-after-free in __ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline]
BUG: KASAN: use-after-free in ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049
Read of size 8 at addr ffff56551965bbc8 by task syz-executor.2/14468

CPU: 1 PID: 14468 Comm: syz-executor.2 Not tainted 5.10.0 #7
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x40c arch/arm64/kernel/stacktrace.c:132
 show_stack+0x30/0x40 arch/arm64/kernel/stacktrace.c:196
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b4/0x248 lib/dump_stack.c:118
 print_address_description.constprop.0+0x28/0x48c mm/kasan/report.c:387
 __kasan_report mm/kasan/report.c:547 [inline]
 kasan_report+0x118/0x210 mm/kasan/report.c:564
 check_memory_region_inline mm/kasan/generic.c:187 [inline]
 __asan_load8+0x98/0xc0 mm/kasan/generic.c:253
 __ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline]
 ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049
 ftrace_graph_call+0x0/0x4
 __might_sleep+0x8/0x100 include/linux/perf_event.h:1170
 __might_fault mm/memory.c:5183 [inline]
 __might_fault+0x58/0x70 mm/memory.c:5171
 do_strncpy_from_user lib/strncpy_from_user.c:41 [inline]
 strncpy_from_user+0x1f4/0x4b0 lib/strncpy_from_user.c:139
 getname_flags+0xb0/0x31c fs/namei.c:149
 getname+0x2c/0x40 fs/namei.c:209
 [...]

Allocated by task 14445:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:48
 kasan_set_track mm/kasan/common.c:56 [inline]
 __kasan_kmalloc mm/kasan/common.c:479 [inline]
 __kasan_kmalloc.constprop.0+0x110/0x13c mm/kasan/common.c:449
 kasan_kmalloc+0xc/0x14 mm/kasan/common.c:493
 kmem_cache_alloc_trace+0x440/0x924 mm/slub.c:2950
 kmalloc include/linux/slab.h:563 [inline]
 kzalloc include/linux/slab.h:675 [inline]
 perf_event_alloc.part.0+0xb4/0x1350 kernel/events/core.c:11230
 perf_event_alloc kernel/events/core.c:11733 [inline]
 __do_sys_perf_event_open kernel/events/core.c:11831 [inline]
 __se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723
 __arm64_sys_perf_event_open+0x6c/0x80 kernel/events/core.c:11723
 [...]

Freed by task 14445:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:48
 kasan_set_track+0x24/0x34 mm/kasan/common.c:56
 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:358
 __kasan_slab_free.part.0+0x11c/0x1b0 mm/kasan/common.c:437
 __kasan_slab_free mm/kasan/common.c:445 [inline]
 kasan_slab_free+0x2c/0x40 mm/kasan/common.c:446
 slab_free_hook mm/slub.c:1569 [inline]
 slab_free_freelist_hook mm/slub.c:1608 [inline]
 slab_free mm/slub.c:3179 [inline]
 kfree+0x12c/0xc10 mm/slub.c:4176
 perf_event_alloc.part.0+0xa0c/0x1350 kernel/events/core.c:11434
 perf_event_alloc kernel/events/core.c:11733 [inline]
 __do_sys_perf_event_open kernel/events/core.c:11831 [inline]
 __se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723
 [...]
Link: https://lore.kernel.org/linux-trace-kernel/20221103031010.166498-1-lihuafei1...
Fixes: edb096e00724f ("ftrace: Fix memleak when unregistering dynamic ops when tracing disabled") Cc: stable@vger.kernel.org Suggested-by: Steven Rostedt rostedt@goodmis.org Signed-off-by: Li Huafei lihuafei1@huawei.com Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Li Huafei lihuafei1@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/trace/ftrace.c | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 4f40bc2f90a7..945e87b0084e 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2937,18 +2937,8 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command) command |= FTRACE_UPDATE_TRACE_FUNC; }
- if (!command || !ftrace_enabled) { - /* - * If these are dynamic or per_cpu ops, they still - * need their data freed. Since, function tracing is - * not currently active, we can just free them - * without synchronizing all CPUs. - */ - if (ops->flags & FTRACE_OPS_FL_DYNAMIC) - goto free_ops; - - return 0; - } + if (!command || !ftrace_enabled) + goto out;
/* * If the ops uses a trampoline, then it needs to be @@ -2985,6 +2975,7 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command) removed_ops = NULL; ops->flags &= ~FTRACE_OPS_FL_REMOVING;
+out: /* * Dynamic ops may be freed, we must make sure that all * callers are done before leaving this function. @@ -3012,7 +3003,6 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command) if (IS_ENABLED(CONFIG_PREEMPTION)) synchronize_rcu_tasks();
- free_ops: ftrace_trampoline_free(ops); }
From: Luo Meng luomeng12@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60Q98 CVE: NA
--------------------------------
A crash occurred as follows:
BUG: unable to handle page fault for address: 000000011241cec7
sd 5:0:0:1: [sdl] Synchronizing SCSI cache
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 2465367 Comm: multipath Kdump: loaded Tainted: G W O 5.10.0-60.18.0.50.h478.eulerosv2r11.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220525_182517-szxrtosci10000 04/01/2014
RIP: 0010:kernfs_new_node+0x22/0x60
Code: cc cc 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 41 89 cb 0f b7 ca 48 89 f2 53 48 8b 47 08 48 89 fb 48 89 de 48 85 c0 48 0f 44 c7 <48> 8b 78 50 41 51 45 89 c1 45 89 d8 e8 4d ee ff ff 5a 49 89 c4 48
RSP: 0018:ffffa178419539e8 EFLAGS: 00010206
RAX: 000000011241ce77 RBX: ffff9596828395a0 RCX: 000000000000a1ff
RDX: ffff9595ada828b0 RSI: ffff9596828395a0 RDI: ffff9596828395a0
RBP: ffff95959a9a2a80 R08: 0000000000000000 R09: 0000000000000004
R10: ffff9595ca0bf930 R11: 0000000000000000 R12: ffff9595ada828b0
R13: ffff9596828395a0 R14: 0000000000000001 R15: ffff9595948c5c80
FS: 00007f64baa10200(0000) GS:ffff9596bad80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000011241cec7 CR3: 000000011923e003 CR4: 0000000000170ee0
Call Trace:
 kernfs_create_link+0x31/0xa0
 sysfs_do_create_link_sd+0x61/0xc0
 bd_link_disk_holder+0x10a/0x180
 dm_get_table_device+0x10b/0x1f0 [dm_mod]
 __dm_get_device+0x1e2/0x280 [dm_mod]
 ? kmem_cache_alloc_trace+0x2fb/0x410
 parse_path+0xca/0x200 [dm_multipath]
 parse_priority_group+0x19d/0x1f0 [dm_multipath]
 multipath_ctr+0x27a/0x491 [dm_multipath]
 dm_table_add_target+0x177/0x360 [dm_mod]
 table_load+0x12b/0x380 [dm_mod]
 ctl_ioctl+0x199/0x290 [dm_mod]
 ? dev_suspend+0xd0/0xd0 [dm_mod]
 dm_ctl_ioctl+0xa/0x20 [dm_mod]
 __se_sys_ioctl+0x85/0xc0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x61/0xc6
This can be easily reproduced:
1. Add a delay before ret = add_symlink(bdev->bd_part->holder_dir...) in bd_link_disk_holder().
2. dmsetup create xxx --table "0 1000 linear /dev/sda 0"
3. echo 1 > /sys/block/sda/device/delete
Deleting /dev/sda releases holder_dir, but add_symlink() still uses holder_dir, so a use-after-free occurs in this case.
Fix this problem by taking a reference count on holder_dir.
Signed-off-by: Luo Meng luomeng12@huawei.com Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/block_dev.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 46801789f2dc..1cd90013f6ac 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1597,6 +1597,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, void *holder, } } bdev->bd_openers++; + kobject_get(bdev->bd_part->holder_dir); if (for_part) bdev->bd_part_count++; if (claiming) @@ -1818,6 +1819,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) if (for_part) bdev->bd_part_count--;
+ kobject_put(bdev->bd_part->holder_dir); if (!--bdev->bd_openers) { WARN_ON_ONCE(bdev->bd_holders); sync_blockdev(bdev);
From: Yu Kuai yukuai3@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60Q98 CVE: NA
--------------------------------
An official solution will be applied to mainline, and this temporary solution cannot thoroughly fix the use-after-free of 'bd_holder_dir'; hence revert it. The official solution will be backported in the next patch.
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/block_dev.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 1cd90013f6ac..46801789f2dc 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1597,7 +1597,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, void *holder, } } bdev->bd_openers++; - kobject_get(bdev->bd_part->holder_dir); if (for_part) bdev->bd_part_count++; if (claiming) @@ -1819,7 +1818,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) if (for_part) bdev->bd_part_count--;
- kobject_put(bdev->bd_part->holder_dir); if (!--bdev->bd_openers) { WARN_ON_ONCE(bdev->bd_holders); sync_blockdev(bdev);
From: Yu Kuai yukuai3@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60Q98 CVE: NA
--------------------------------
Currently, the callers of bd_link_disk_holder() get 'bdev' via blkdev_get_by_dev(), which looks up 'bdev' by the inode number 'dev'. However, del_gendisk() can be called concurrently, and 'bd_holder_dir' can be freed before bd_link_disk_holder() accesses it, thus triggering a use-after-free:
t1:                              t2:
bdev = blkdev_get_by_dev
                                 del_gendisk
                                  kobject_put(bd_holder_dir)
                                   kobject_free()
bd_link_disk_holder
Fix the problem by checking that the disk is still live and grabbing a reference to 'bd_holder_dir' first in bd_link_disk_holder().
Link: https://lore.kernel.org/all/20221103025541.1875809-3-yukuai1@huaweicloud.com... Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/block_dev.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 46801789f2dc..07cbe6190463 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1266,16 +1266,31 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) struct bd_holder_disk *holder; int ret = 0;
- mutex_lock(&bdev->bd_mutex); + /* + * bdev could be deleted beneath us which would implicitly destroy + * the holder directory. Hold on to it. + */ + down_read(&bdev->bd_disk->lookup_sem); + if (!(disk->flags & GENHD_FL_UP)) { + up_read(&bdev->bd_disk->lookup_sem); + return -ENODEV; + }
+ kobject_get(bdev->bd_part->holder_dir); + up_read(&bdev->bd_disk->lookup_sem); + + mutex_lock(&bdev->bd_mutex); WARN_ON_ONCE(!bdev->bd_holder);
/* FIXME: remove the following once add_disk() handles errors */ - if (WARN_ON(!disk->slave_dir || !bdev->bd_part->holder_dir)) + if (WARN_ON(!disk->slave_dir || !bdev->bd_part->holder_dir)) { + kobject_put(bdev->bd_part->holder_dir); goto out_unlock; + }
holder = bd_find_holder_disk(bdev, disk); if (holder) { + kobject_put(bdev->bd_part->holder_dir); holder->refcnt++; goto out_unlock; } @@ -1297,11 +1312,6 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) ret = add_symlink(bdev->bd_part->holder_dir, &disk_to_dev(disk)->kobj); if (ret) goto out_del; - /* - * bdev could be deleted beneath us which would implicitly destroy - * the holder directory. Hold on to it. - */ - kobject_get(bdev->bd_part->holder_dir);
list_add(&holder->list, &bdev->bd_holder_disks); goto out_unlock; @@ -1312,6 +1322,8 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) kfree(holder); out_unlock: mutex_unlock(&bdev->bd_mutex); + if (ret) + kobject_put(bdev->bd_part->holder_dir); return ret; } EXPORT_SYMBOL_GPL(bd_link_disk_holder);
From: Baokun Li libaokun1@huawei.com
hulk inclusion category: bugfix bugzilla: 187327,https://gitee.com/openeuler/kernel/issues/I6111I CVE: NA
--------------------------------
Print an error message when 'hc' and 'md' do not match, which makes it easier to locate the cause of the problem.
Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/dm-ioctl.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c index b839705654d4..b012a2748af8 100644 --- a/drivers/md/dm-ioctl.c +++ b/drivers/md/dm-ioctl.c @@ -2033,6 +2033,8 @@ int dm_copy_name_and_uuid(struct mapped_device *md, char *name, char *uuid) mutex_lock(&dm_hash_cells_mutex); hc = dm_get_mdptr(md); if (!hc || hc->md != md) { + if (hc) + DMERR("hash cell and mapped device do not match!"); r = -ENXIO; goto out; }
From: Yanan Wang wangyanan55@huawei.com
virt inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5WHHV CVE: NA
----------------------------------------------------
The "ncsnp" is an implementation specific CPU virtualization feature on Hisi 1620 series CPUs. This feature works just like ARM standard S2FWB to reduce some cache management operations in virtualization.
Given that it's a Hisi-specific feature, let's restrict the detection to Hisi CPUs only. To realize this:
1) Add a sub-directory `hisilicon/` within arch/arm64/kvm to hold code for Hisi-specific virtualization features.
2) Add a new kconfig option `CONFIG_KVM_HISI_VIRT` for users to select the whole set of Hisi-specific virtualization features.
3) Add a generic global KVM variable `kvm_ncsnp_support`, which is `false` by default and only re-initialized when `CONFIG_KVM_HISI_VIRT` is enabled.
Signed-off-by: Yanan Wang wangyanan55@huawei.com Reviewed-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/configs/openeuler_defconfig | 1 + arch/arm64/include/asm/hisi_cpu_model.h | 21 ---- arch/arm64/include/asm/kvm_host.h | 3 +- arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/Makefile | 2 +- arch/arm64/kvm/arm.c | 13 ++- arch/arm64/kvm/hisi_cpu_model.c | 117 ---------------------- arch/arm64/kvm/hisilicon/Kconfig | 7 ++ arch/arm64/kvm/hisilicon/Makefile | 2 + arch/arm64/kvm/hisilicon/hisi_virt.c | 124 ++++++++++++++++++++++++ arch/arm64/kvm/hisilicon/hisi_virt.h | 19 ++++ 11 files changed, 166 insertions(+), 144 deletions(-) delete mode 100644 arch/arm64/include/asm/hisi_cpu_model.h delete mode 100644 arch/arm64/kvm/hisi_cpu_model.c create mode 100644 arch/arm64/kvm/hisilicon/Kconfig create mode 100644 arch/arm64/kvm/hisilicon/Makefile create mode 100644 arch/arm64/kvm/hisilicon/hisi_virt.c create mode 100644 arch/arm64/kvm/hisilicon/hisi_virt.h
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 21c2f95ac2f3..2c16a55630e9 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -746,6 +746,7 @@ CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL=y CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y CONFIG_HAVE_KVM_IRQ_BYPASS=y CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE=y +CONFIG_KVM_HISI_VIRT=y CONFIG_KVM_XFER_TO_GUEST_WORK=y CONFIG_KVM_ARM_PMU=y CONFIG_ARM64_CRYPTO=y diff --git a/arch/arm64/include/asm/hisi_cpu_model.h b/arch/arm64/include/asm/hisi_cpu_model.h deleted file mode 100644 index e0da0ef61613..000000000000 --- a/arch/arm64/include/asm/hisi_cpu_model.h +++ /dev/null @@ -1,21 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-or-later -/* - * Copyright(c) 2019 Huawei Technologies Co., Ltd - */ - -#ifndef __HISI_CPU_MODEL_H__ -#define __HISI_CPU_MODEL_H__ - -enum hisi_cpu_type { - HI_1612, - HI_1616, - HI_1620, - UNKNOWN_HI_TYPE -}; - -extern enum hisi_cpu_type hi_cpu_type; -extern bool kvm_ncsnp_support; - -void probe_hisi_cpu_type(void); -void probe_hisi_ncsnp_support(void); -#endif /* __HISI_CPU_MODEL_H__ */ diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 1a80df133d9d..71a3ba24b287 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -26,7 +26,6 @@ #include <asm/kvm.h> #include <asm/kvm_asm.h> #include <asm/thread_info.h> -#include <asm/hisi_cpu_model.h>
#define __KVM_HAVE_ARCH_INTC_INITIALIZED
@@ -715,4 +714,6 @@ extern unsigned int twedel; #define use_twed() (false) #endif
+extern bool kvm_ncsnp_support; + #endif /* __ARM64_KVM_HOST_H__ */ diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index bc6b692128c9..d984a6041860 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -49,6 +49,7 @@ menuconfig KVM if KVM
source "virt/kvm/Kconfig" +source "arch/arm64/kvm/hisilicon/Kconfig"
config KVM_ARM_PMU bool "Virtual Performance Monitoring Unit (PMU) support" diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile index 02612bfbbde7..6dc8c914c99b 100644 --- a/arch/arm64/kvm/Makefile +++ b/arch/arm64/kvm/Makefile @@ -17,7 +17,6 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \ guest.o debug.o reset.o sys_regs.o \ vgic-sys-reg-v3.o fpsimd.o pmu.o \ aarch32.o arch_timer.o trng.o\ - hisi_cpu_model.o \ vgic/vgic.o vgic/vgic-init.o \ vgic/vgic-irqfd.o vgic/vgic-v2.o \ vgic/vgic-v3.o vgic/vgic-v4.o \ @@ -26,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \ vgic/vgic-its.o vgic/vgic-debug.o
kvm-$(CONFIG_KVM_ARM_PMU) += pmu-emul.o +obj-$(CONFIG_KVM_HISI_VIRT) += hisilicon/ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 384cc56a6549..469f324ce536 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -47,6 +47,10 @@ __asm__(".arch_extension virt"); #endif
+#ifdef CONFIG_KVM_HISI_VIRT +#include "hisilicon/hisi_virt.h" +#endif + DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); @@ -59,8 +63,7 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
static bool vgic_present;
-/* Hisi cpu type enum */ -enum hisi_cpu_type hi_cpu_type = UNKNOWN_HI_TYPE; +/* Capability of non-cacheable snooping */ bool kvm_ncsnp_support;
static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled); @@ -1859,9 +1862,11 @@ int kvm_arch_init(void *opaque) return -ENODEV; }
- /* Probe the Hisi CPU type */ +#ifdef CONFIG_KVM_HISI_VIRT probe_hisi_cpu_type(); - probe_hisi_ncsnp_support(); + kvm_ncsnp_support = hisi_ncsnp_supported(); +#endif + kvm_info("KVM ncsnp %s\n", kvm_ncsnp_support ? "enabled" : "disabled");
in_hyp_mode = is_kernel_in_hyp_mode();
diff --git a/arch/arm64/kvm/hisi_cpu_model.c b/arch/arm64/kvm/hisi_cpu_model.c deleted file mode 100644 index 52eecf1ba1cf..000000000000 --- a/arch/arm64/kvm/hisi_cpu_model.c +++ /dev/null @@ -1,117 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-or-later -/* - * Copyright(c) 2019 Huawei Technologies Co., Ltd - */ - -#include <linux/acpi.h> -#include <linux/of.h> -#include <linux/init.h> -#include <linux/kvm_host.h> - -#ifdef CONFIG_ACPI - -/* ACPI Hisi oem table id str */ -const char *oem_str[] = { - "HIP06", /* Hisi 1612 */ - "HIP07", /* Hisi 1616 */ - "HIP08" /* Hisi 1620 */ -}; - -/* - * Get Hisi oem table id. - */ -static void acpi_get_hw_cpu_type(void) -{ - struct acpi_table_header *table; - acpi_status status; - int i, str_size = ARRAY_SIZE(oem_str); - - /* Get oem table id from ACPI table header */ - status = acpi_get_table(ACPI_SIG_DSDT, 0, &table); - if (ACPI_FAILURE(status)) { - pr_err("Failed to get ACPI table: %s\n", - acpi_format_exception(status)); - return; - } - - for (i = 0; i < str_size; ++i) { - if (!strncmp(oem_str[i], table->oem_table_id, 5)) { - hi_cpu_type = i; - return; - } - } -} - -#else -static void acpi_get_hw_cpu_type(void) {} -#endif - -/* of Hisi cpu model str */ -const char *of_model_str[] = { - "Hi1612", - "Hi1616" -}; - -static void of_get_hw_cpu_type(void) -{ - const char *cpu_type; - int ret, i, str_size = ARRAY_SIZE(of_model_str); - - ret = of_property_read_string(of_root, "model", &cpu_type); - if (ret < 0) { - pr_err("Failed to get Hisi cpu model by OF.\n"); - return; - } - - for (i = 0; i < str_size; ++i) { - if (strstr(cpu_type, of_model_str[i])) { - hi_cpu_type = i; - return; - } - } -} - -void probe_hisi_cpu_type(void) -{ - if (!acpi_disabled) - acpi_get_hw_cpu_type(); - else - of_get_hw_cpu_type(); - - if (hi_cpu_type == UNKNOWN_HI_TYPE) - pr_warn("UNKNOWN Hisi cpu type.\n"); -} - -#define NCSNP_MMIO_BASE 0x20107E238 - -/* - * We have the fantastic HHA ncsnp capability on Kunpeng 920, - * with which hypervisor doesn't need to perform a lot of cache - * maintenance like before (in case the guest has non-cacheable - * Stage-1 mappings). - */ -void probe_hisi_ncsnp_support(void) -{ - void __iomem *base; - unsigned int high; - - kvm_ncsnp_support = false; - - if (hi_cpu_type != HI_1620) - goto out; - - base = ioremap(NCSNP_MMIO_BASE, 4); - if (!base) { - pr_err("Unable to map MMIO region when probing ncsnp!\n"); - goto out; - } - - high = readl_relaxed(base) >> 28; - iounmap(base); - if (high != 0x1) - kvm_ncsnp_support = true; - -out: - kvm_info("Hisi ncsnp: %s\n", kvm_ncsnp_support ? "enabled" : - "disabled"); -} diff --git a/arch/arm64/kvm/hisilicon/Kconfig b/arch/arm64/kvm/hisilicon/Kconfig new file mode 100644 index 000000000000..6536f897a32e --- /dev/null +++ b/arch/arm64/kvm/hisilicon/Kconfig @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0-only +config KVM_HISI_VIRT + bool "HiSilicon SoC specific virtualization features" + depends on ARCH_HISI + help + Support for HiSilicon SoC specific virtualization features. + On non-HiSilicon platforms, say N here. 
diff --git a/arch/arm64/kvm/hisilicon/Makefile b/arch/arm64/kvm/hisilicon/Makefile new file mode 100644 index 000000000000..849f99d1526d --- /dev/null +++ b/arch/arm64/kvm/hisilicon/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_KVM_HISI_VIRT) += hisi_virt.o diff --git a/arch/arm64/kvm/hisilicon/hisi_virt.c b/arch/arm64/kvm/hisilicon/hisi_virt.c new file mode 100644 index 000000000000..9587f9508a79 --- /dev/null +++ b/arch/arm64/kvm/hisilicon/hisi_virt.c @@ -0,0 +1,124 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright(c) 2022 Huawei Technologies Co., Ltd + */ + +#include <linux/acpi.h> +#include <linux/of.h> +#include <linux/init.h> +#include <linux/kvm_host.h> +#include "hisi_virt.h" + +static enum hisi_cpu_type cpu_type = UNKNOWN_HI_TYPE; + +static const char * const hisi_cpu_type_str[] = { + "Hisi1612", + "Hisi1616", + "Hisi1620", + "Unknown" +}; + +/* ACPI Hisi oem table id str */ +static const char * const oem_str[] = { + "HIP06", /* Hisi 1612 */ + "HIP07", /* Hisi 1616 */ + "HIP08" /* Hisi 1620 */ +}; + +/* + * Probe Hisi CPU type form ACPI. + */ +static enum hisi_cpu_type acpi_get_hisi_cpu_type(void) +{ + struct acpi_table_header *table; + acpi_status status; + int i, str_size = ARRAY_SIZE(oem_str); + + /* Get oem table id from ACPI table header */ + status = acpi_get_table(ACPI_SIG_DSDT, 0, &table); + if (ACPI_FAILURE(status)) { + pr_warn("Failed to get ACPI table: %s\n", + acpi_format_exception(status)); + return UNKNOWN_HI_TYPE; + } + + for (i = 0; i < str_size; ++i) { + if (!strncmp(oem_str[i], table->oem_table_id, 5)) + return i; + } + + return UNKNOWN_HI_TYPE; +} + +/* of Hisi cpu model str */ +static const char * const of_model_str[] = { + "Hi1612", + "Hi1616" +}; + +/* + * Probe Hisi CPU type from DT. + */ +static enum hisi_cpu_type of_get_hisi_cpu_type(void) +{ + const char *model; + int ret, i, str_size = ARRAY_SIZE(of_model_str); + + /* + * Note: There may not be a "model" node in FDT, which + * is provided by the vendor. In this case, we are not + * able to get CPU type information through this way. + */ + ret = of_property_read_string(of_root, "model", &model); + if (ret < 0) { + pr_warn("Failed to get Hisi cpu model by OF.\n"); + return UNKNOWN_HI_TYPE; + } + + for (i = 0; i < str_size; ++i) { + if (strstr(model, of_model_str[i])) + return i; + } + + return UNKNOWN_HI_TYPE; +} + +void probe_hisi_cpu_type(void) +{ + if (!acpi_disabled) + cpu_type = acpi_get_hisi_cpu_type(); + else + cpu_type = of_get_hisi_cpu_type(); + + kvm_info("detected: Hisi CPU type '%s'\n", hisi_cpu_type_str[cpu_type]); +} + +/* + * We have the fantastic HHA ncsnp capability on Kunpeng 920, + * with which hypervisor doesn't need to perform a lot of cache + * maintenance like before (in case the guest has non-cacheable + * Stage-1 mappings). 
+ */ +#define NCSNP_MMIO_BASE 0x20107E238 +bool hisi_ncsnp_supported(void) +{ + void __iomem *base; + unsigned int high; + bool supported = false; + + if (cpu_type != HI_1620) + return supported; + + base = ioremap(NCSNP_MMIO_BASE, 4); + if (!base) { + pr_warn("Unable to map MMIO region when probing ncsnp!\n"); + return supported; + } + + high = readl_relaxed(base) >> 28; + iounmap(base); + if (high != 0x1) + supported = true; + + return supported; +} diff --git a/arch/arm64/kvm/hisilicon/hisi_virt.h b/arch/arm64/kvm/hisilicon/hisi_virt.h new file mode 100644 index 000000000000..ef8de6a2101e --- /dev/null +++ b/arch/arm64/kvm/hisilicon/hisi_virt.h @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright(c) 2022 Huawei Technologies Co., Ltd + */ + +#ifndef __HISI_VIRT_H__ +#define __HISI_VIRT_H__ + +enum hisi_cpu_type { + HI_1612, + HI_1616, + HI_1620, + UNKNOWN_HI_TYPE +}; + +void probe_hisi_cpu_type(void); +bool hisi_ncsnp_supported(void); + +#endif /* __HISI_VIRT_H__ */
From: Li Nan linan122@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I617GN CVE: NA
--------------------------------
q->tag_set can be NULL in blk_mq_queue_tag_busy_iter() while the queue has not been fully initialized:
CPU0                                    CPU1
dm_mq_init_request_queue
 md->tag_set = kzalloc_node
 blk_mq_init_allocated_queue
  q->mq_ops = set->ops;
                                        diskstats_show
                                         part_get_stat_info
                                          if (q->mq_ops)
                                           blk_mq_in_flight_with_stat
                                            blk_mq_queue_tag_busy_iter
                                             if (blk_mq_is_shared_tags(q->tag_set->flags))
                                             // q->tag_set is NULL here
  q->tag_set = set
 blk_register_queue
  blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q)
The same bug exists when reading /sys/block/[device]/inflight. Fix it by checking the 'QUEUE_FLAG_REGISTERED' flag. Although this may temporarily cause some io not to be counted, it doesn't hurt in real use cases.
Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-mq-tag.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 24b48a2f7fba..87bb146c7d44 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -515,6 +515,13 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request); void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, void *priv) { + /* + * For dm, it can run here after register_disk, but the queue has not + * been initialized yet. Check QUEUE_FLAG_REGISTERED prevent null point + * access. + */ + if (!blk_queue_registered(q)) + return; /* * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx * while the queue is frozen. So we can use q_usage_counter to avoid
From: Li Nan linan122@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I617GN CVE: NA
--------------------------------
Since commit 8b97d51a0c9c, blk_mq_queue_tag_busy_iter() returns directly if the queue has not been registered. However, scsi_scan issues io before the queue is registered, and this causes an io hang because some special scsi drivers (e.g. ata_piix) rely on blk_mq_timeout_work() to complete io while the driver is initializing during the scan. Fix the bug by moving the QUEUE_FLAG_REGISTERED check up into the callers.
Fixes: 8b97d51a0c9c ("[Huawei] blk-mq: fix null pointer dereference in blk_mq_queue_tag_busy_ite") Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-mq-tag.c | 7 ------- block/blk-mq.c | 12 ++++++++---- 2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 87bb146c7d44..24b48a2f7fba 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -515,13 +515,6 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request); void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, void *priv) { - /* - * For dm, it can run here after register_disk, but the queue has not - * been initialized yet. Check QUEUE_FLAG_REGISTERED prevent null point - * access. - */ - if (!blk_queue_registered(q)) - return; /* * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx * while the queue is frozen. So we can use q_usage_counter to avoid diff --git a/block/blk-mq.c b/block/blk-mq.c index 5f896a12b8e4..427457d43d07 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -151,7 +151,8 @@ unsigned int blk_mq_in_flight_with_stat(struct request_queue *q, { struct mq_inflight mi = { .part = part };
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_with_stat, &mi); + if (blk_queue_registered(q)) + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_with_stat, &mi);
return mi.inflight[0] + mi.inflight[1]; } @@ -174,7 +175,8 @@ unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part) { struct mq_inflight mi = { .part = part };
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); + if (blk_queue_registered(q)) + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
return mi.inflight[0] + mi.inflight[1]; } @@ -184,7 +186,8 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part, { struct mq_inflight mi = { .part = part };
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); + if (blk_queue_registered(q)) + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); inflight[0] = mi.inflight[0]; inflight[1] = mi.inflight[1]; } @@ -974,7 +977,8 @@ bool blk_mq_queue_inflight(struct request_queue *q) { bool busy = false;
- blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy); + if (blk_queue_registered(q)) + blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy); return busy; } EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
From: Anshuman Khandual anshuman.khandual@arm.com
mainline inclusion from mainline-v5.12-rc3 commit 79cc2ed5a716544621b11a3f90550e5c7d314306 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I611C3 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Currently, without THP being enabled, MAX_ORDER via FORCE_MAX_ZONEORDER gets reduced to 11, which falls below HUGETLB_PAGE_ORDER for certain 16K and 64K page size configurations. This is problematic and throws up the following warning during boot, as pageblock_order (derived from HUGETLB_PAGE_ORDER) exceeds MAX_ORDER.
WARNING: CPU: 7 PID: 127 at mm/vmstat.c:1092 __fragmentation_index+0x58/0x70
Modules linked in:
CPU: 7 PID: 127 Comm: kswapd0 Not tainted 5.12.0-rc1-00005-g0221e3101a1 #237
Hardware name: linux,dummy-virt (DT)
pstate: 20400005 (nzCv daif +PAN -UAO -TCO BTYPE=--)
pc : __fragmentation_index+0x58/0x70
lr : fragmentation_index+0x88/0xa8
sp : ffff800016ccfc00
x29: ffff800016ccfc00 x28: 0000000000000000
x27: ffff800011fd4000 x26: 0000000000000002
x25: ffff800016ccfda0 x24: 0000000000000002
x23: 0000000000000640 x22: ffff0005ffcb5b18
x21: 0000000000000002 x20: 000000000000000d
x19: ffff0005ffcb3980 x18: 0000000000000004
x17: 0000000000000001 x16: 0000000000000019
x15: ffff800011ca7fb8 x14: 00000000000002b3
x13: 0000000000000000 x12: 00000000000005e0
x11: 0000000000000003 x10: 0000000000000080
x9 : ffff800011c93948 x8 : 0000000000000000
x7 : 0000000000000000 x6 : 0000000000007000
x5 : 0000000000007944 x4 : 0000000000000032
x3 : 000000000000001c x2 : 000000000000000b
x1 : ffff800016ccfc10 x0 : 000000000000000d
Call trace:
 __fragmentation_index+0x58/0x70
 compaction_suitable+0x58/0x78
 wakeup_kcompactd+0x8c/0xd8
 balance_pgdat+0x570/0x5d0
 kswapd+0x1e0/0x388
 kthread+0x154/0x158
 ret_from_fork+0x10/0x30
This solves the problem by keeping FORCE_MAX_ZONEORDER unchanged with or without THP on 16K and 64K page size configurations, making sure that HUGETLB_PAGE_ORDER (and pageblock_order) never exceeds MAX_ORDER.
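For a quick sanity check of the orders involved (illustrative arithmetic assuming the usual arm64 translation geometry; the macro values below are assumptions spelled out for illustration, not code from the patch):

	/*
	 * 64K pages: PMD_SHIFT = 29, so a PMD huge page is 512MB and
	 *            HUGETLB_PAGE_ORDER = 29 - 16 = 13; MAX_ORDER = 14
	 *            keeps 13 <= MAX_ORDER - 1, while 11 does not.
	 * 16K pages: PMD_SHIFT = 25, so a PMD huge page is 32MB and
	 *            HUGETLB_PAGE_ORDER = 25 - 14 = 11; MAX_ORDER = 12
	 *            keeps 11 <= MAX_ORDER - 1, while 11 does not.
	 */
	#define ASSUMED_PAGE_SHIFT_64K	16
	#define ASSUMED_PMD_SHIFT_64K	29
	_Static_assert(ASSUMED_PMD_SHIFT_64K - ASSUMED_PAGE_SHIFT_64K <= 14 - 1,
		       "HUGETLB_PAGE_ORDER must not exceed MAX_ORDER - 1");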
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual anshuman.khandual@arm.com Acked-by: Catalin Marinas catalin.marinas@arm.com Link: https://lore.kernel.org/r/1614597914-28565-1-git-send-email-anshuman.khandua... Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Zhang Peng zhangpeng362@huawei.com Reviewed-by: Chen Wandun chenwandun@huawei.com Reviewed-by: Nanyong Sun sunnanyong@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6d288cfa313f..ddbdbc7ecdd9 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1287,8 +1287,8 @@ config XEN
config FORCE_MAX_ZONEORDER int - default "14" if (ARM64_64K_PAGES && TRANSPARENT_HUGEPAGE) - default "12" if (ARM64_16K_PAGES && TRANSPARENT_HUGEPAGE) + default "14" if ARM64_64K_PAGES + default "12" if ARM64_16K_PAGES default "11" help The kernel memory allocator divides physically contiguous memory
From: Lorenz Bauer lmb@cloudflare.com
stable inclusion from stable-v5.10.135 commit 6aad811b37eeeba902b14cc4ab698d2b37bb4fb9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZWFM
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 607b9cc92bd7208338d714a22b8082fe83bcb177 upstream.
Share the timing / signal interruption logic between the different implementations of PROG_TEST_RUN. There is a change in behaviour as well: we now check the loop exit condition before checking for pending signals. This resolves an edge case where a signal arrives during the last iteration; instead of aborting with EINTR, we return the successful result to user space.
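Condensed, the new exit logic looks like this (a sketch mirroring bpf_test_timer_continue() in the diff below, not a literal copy):

	static bool test_timer_continue(u32 *i, u32 repeat, int *err)
	{
		if (++(*i) >= repeat) {
			/*
			 * The exit condition is evaluated first, so a signal
			 * arriving during the final iteration no longer turns
			 * a fully completed run into -EINTR.
			 */
			*err = 0;
			return false;
		}
		if (signal_pending(current)) {
			*err = -EINTR;
			return false;
		}
		return true;
	}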
Signed-off-by: Lorenz Bauer lmb@cloudflare.com Signed-off-by: Alexei Starovoitov ast@kernel.org Acked-by: Andrii Nakryiko andrii@kernel.org Link: https://lore.kernel.org/bpf/20210303101816.36774-2-lmb@cloudflare.com [dtcccc: fix conflicts in bpf_test_run()] Signed-off-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: net/bpf/test_run.c Signed-off-by: Pu Lehui pulehui@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/bpf/test_run.c | 142 +++++++++++++++++++++++++-------------------- 1 file changed, 78 insertions(+), 64 deletions(-)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index f266a9453c8e..3a8f5a2d0e74 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -16,16 +16,80 @@ #define CREATE_TRACE_POINTS #include <trace/events/bpf_test_run.h>
+struct bpf_test_timer { + enum { NO_PREEMPT, NO_MIGRATE } mode; + u32 i; + u64 time_start, time_spent; +}; + +static void bpf_test_timer_enter(struct bpf_test_timer *t) + __acquires(rcu) +{ + rcu_read_lock(); + if (t->mode == NO_PREEMPT) + preempt_disable(); + else + migrate_disable(); + + t->time_start = ktime_get_ns(); +} + +static void bpf_test_timer_leave(struct bpf_test_timer *t) + __releases(rcu) +{ + t->time_start = 0; + + if (t->mode == NO_PREEMPT) + preempt_enable(); + else + migrate_enable(); + rcu_read_unlock(); +} + +static bool bpf_test_timer_continue(struct bpf_test_timer *t, u32 repeat, int *err, u32 *duration) + __must_hold(rcu) +{ + t->i++; + if (t->i >= repeat) { + /* We're done. */ + t->time_spent += ktime_get_ns() - t->time_start; + do_div(t->time_spent, t->i); + *duration = t->time_spent > U32_MAX ? U32_MAX : (u32)t->time_spent; + *err = 0; + goto reset; + } + + if (signal_pending(current)) { + /* During iteration: we've been cancelled, abort. */ + *err = -EINTR; + goto reset; + } + + if (need_resched()) { + /* During iteration: we need to reschedule between runs. */ + t->time_spent += ktime_get_ns() - t->time_start; + bpf_test_timer_leave(t); + cond_resched(); + bpf_test_timer_enter(t); + } + + /* Do another round. */ + return true; + +reset: + t->i = 0; + return false; +} + static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *retval, u32 *time, bool xdp) { struct bpf_prog_array_item item = {.prog = prog}; struct bpf_run_ctx *old_ctx; struct bpf_cg_run_ctx run_ctx; + struct bpf_test_timer t = { NO_MIGRATE }; enum bpf_cgroup_storage_type stype; - u64 time_start, time_spent = 0; - int ret = 0; - u32 i; + int ret;
for_each_cgroup_storage_type(stype) { item.cgroup_storage[stype] = bpf_cgroup_storage_alloc(prog, stype); @@ -40,42 +104,17 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, if (!repeat) repeat = 1;
- rcu_read_lock(); - migrate_disable(); - time_start = ktime_get_ns(); + bpf_test_timer_enter(&t); old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); - for (i = 0; i < repeat; i++) { + do { run_ctx.prog_item = &item; - if (xdp) *retval = bpf_prog_run_xdp(prog, ctx); else *retval = BPF_PROG_RUN(prog, ctx); - - if (signal_pending(current)) { - ret = -EINTR; - break; - } - - if (need_resched()) { - time_spent += ktime_get_ns() - time_start; - migrate_enable(); - rcu_read_unlock(); - - cond_resched(); - - rcu_read_lock(); - migrate_disable(); - time_start = ktime_get_ns(); - } - } + } while (bpf_test_timer_continue(&t, repeat, &ret, time)); bpf_reset_run_ctx(old_ctx); - time_spent += ktime_get_ns() - time_start; - migrate_enable(); - rcu_read_unlock(); - - do_div(time_spent, repeat); - *time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent; + bpf_test_timer_leave(&t);
for_each_cgroup_storage_type(stype) bpf_cgroup_storage_free(item.cgroup_storage[stype]); @@ -691,18 +730,17 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { + struct bpf_test_timer t = { NO_PREEMPT }; u32 size = kattr->test.data_size_in; struct bpf_flow_dissector ctx = {}; u32 repeat = kattr->test.repeat; struct bpf_flow_keys *user_ctx; struct bpf_flow_keys flow_keys; - u64 time_start, time_spent = 0; const struct ethhdr *eth; unsigned int flags = 0; u32 retval, duration; void *data; int ret; - u32 i;
if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR) return -EINVAL; @@ -738,39 +776,15 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, ctx.data = data; ctx.data_end = (__u8 *)data + size;
- rcu_read_lock(); - preempt_disable(); - time_start = ktime_get_ns(); - for (i = 0; i < repeat; i++) { + bpf_test_timer_enter(&t); + do { retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN, size, flags); + } while (bpf_test_timer_continue(&t, repeat, &ret, &duration)); + bpf_test_timer_leave(&t);
- if (signal_pending(current)) { - preempt_enable(); - rcu_read_unlock(); - - ret = -EINTR; - goto out; - } - - if (need_resched()) { - time_spent += ktime_get_ns() - time_start; - preempt_enable(); - rcu_read_unlock(); - - cond_resched(); - - rcu_read_lock(); - preempt_disable(); - time_start = ktime_get_ns(); - } - } - time_spent += ktime_get_ns() - time_start; - preempt_enable(); - rcu_read_unlock(); - - do_div(time_spent, repeat); - duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent; + if (ret < 0) + goto out;
ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys), retval, duration);
From: Lorenz Bauer lmb@cloudflare.com
stable inclusion from stable-v5.10.135 commit 6d3fad2b44eb9d226a896d1c93909f0fd2e1b9ea category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZWFM
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 7c32e8f8bc33a5f4b113a630857e46634e3e143b upstream.
Allow passing sk_lookup programs to PROG_TEST_RUN. User space provides the full bpf_sk_lookup struct as context. Since the context includes a socket pointer that can't be exposed to user space, we define that PROG_TEST_RUN returns the cookie of the selected socket, or zero, in place of the socket pointer.
We don't support testing programs that select a reuseport socket, since this would mean running another (unrelated) BPF program from the sk_lookup test handler.
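From user space, the new mode can be driven roughly as follows (a sketch against the current libbpf API; the addresses and ports are made-up values for illustration, and a loaded BPF_PROG_TYPE_SK_LOOKUP program fd is assumed):

	#include <bpf/bpf.h>
	#include <linux/bpf.h>
	#include <sys/socket.h>
	#include <arpa/inet.h>
	#include <stdio.h>

	static void run_sk_lookup_test(int prog_fd)
	{
		struct bpf_sk_lookup ctx = {
			.family      = AF_INET,
			.protocol    = IPPROTO_TCP,
			.local_ip4   = htonl(0x7f000001),	/* 127.0.0.1 */
			.local_port  = 8080,			/* host byte order */
			.remote_ip4  = htonl(0x7f000001),
			.remote_port = htons(40000),		/* network byte order */
		};
		LIBBPF_OPTS(bpf_test_run_opts, opts,
			.ctx_in       = &ctx,
			.ctx_size_in  = sizeof(ctx),
			.ctx_out      = &ctx,
			.ctx_size_out = sizeof(ctx),
			.repeat       = 1,
		);

		if (bpf_prog_test_run_opts(prog_fd, &opts))
			return;

		/* The selected socket comes back as a cookie, or 0 if none. */
		printf("retval=%u cookie=%llu\n", opts.retval,
		       (unsigned long long)ctx.cookie);
	}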
Signed-off-by: Lorenz Bauer lmb@cloudflare.com Signed-off-by: Alexei Starovoitov ast@kernel.org Link: https://lore.kernel.org/bpf/20210303101816.36774-3-lmb@cloudflare.com Signed-off-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Pu Lehui pulehui@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/bpf.h | 10 ++++ include/uapi/linux/bpf.h | 5 +- net/bpf/test_run.c | 105 +++++++++++++++++++++++++++++++++ net/core/filter.c | 1 + tools/include/uapi/linux/bpf.h | 5 +- 5 files changed, 124 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 3154be71a80c..8d95f4c66275 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1545,6 +1545,9 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, int bpf_prog_test_run_raw_tp(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); +int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr); bool btf_ctx_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, struct bpf_insn_access_aux *info); @@ -1759,6 +1762,13 @@ static inline int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, return -ENOTSUPP; }
+static inline int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; +} + static inline void bpf_map_put(struct bpf_map *map) { } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index fdd082dbed40..0365f22651e0 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5007,7 +5007,10 @@ struct bpf_pidns_info {
/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */ struct bpf_sk_lookup { - __bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */ + union { + __bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */ + __u64 cookie; /* Non-zero if socket was selected in PROG_TEST_RUN */ + };
__u32 family; /* Protocol family (AF_INET, AF_INET6) */ __u32 protocol; /* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */ diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 3a8f5a2d0e74..0dfef59cf3de 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -10,8 +10,10 @@ #include <net/bpf_sk_storage.h> #include <net/sock.h> #include <net/tcp.h> +#include <net/net_namespace.h> #include <linux/error-injection.h> #include <linux/smp.h> +#include <linux/sock_diag.h>
#define CREATE_TRACE_POINTS #include <trace/events/bpf_test_run.h> @@ -797,3 +799,106 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, kfree(data); return ret; } + +int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + struct bpf_test_timer t = { NO_PREEMPT }; + struct bpf_prog_array *progs = NULL; + struct bpf_sk_lookup_kern ctx = {}; + u32 repeat = kattr->test.repeat; + struct bpf_sk_lookup *user_ctx; + u32 retval, duration; + int ret = -EINVAL; + + if (prog->type != BPF_PROG_TYPE_SK_LOOKUP) + return -EINVAL; + + if (kattr->test.flags || kattr->test.cpu) + return -EINVAL; + + if (kattr->test.data_in || kattr->test.data_size_in || kattr->test.data_out || + kattr->test.data_size_out) + return -EINVAL; + + if (!repeat) + repeat = 1; + + user_ctx = bpf_ctx_init(kattr, sizeof(*user_ctx)); + if (IS_ERR(user_ctx)) + return PTR_ERR(user_ctx); + + if (!user_ctx) + return -EINVAL; + + if (user_ctx->sk) + goto out; + + if (!range_is_zero(user_ctx, offsetofend(typeof(*user_ctx), local_port), sizeof(*user_ctx))) + goto out; + + if (user_ctx->local_port > U16_MAX || user_ctx->remote_port > U16_MAX) { + ret = -ERANGE; + goto out; + } + + ctx.family = (u16)user_ctx->family; + ctx.protocol = (u16)user_ctx->protocol; + ctx.dport = (u16)user_ctx->local_port; + ctx.sport = (__force __be16)user_ctx->remote_port; + + switch (ctx.family) { + case AF_INET: + ctx.v4.daddr = (__force __be32)user_ctx->local_ip4; + ctx.v4.saddr = (__force __be32)user_ctx->remote_ip4; + break; + +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + ctx.v6.daddr = (struct in6_addr *)user_ctx->local_ip6; + ctx.v6.saddr = (struct in6_addr *)user_ctx->remote_ip6; + break; +#endif + + default: + ret = -EAFNOSUPPORT; + goto out; + } + + progs = bpf_prog_array_alloc(1, GFP_KERNEL); + if (!progs) { + ret = -ENOMEM; + goto out; + } + + progs->items[0].prog = prog; + + bpf_test_timer_enter(&t); + do { + ctx.selected_sk = NULL; + retval = BPF_PROG_SK_LOOKUP_RUN_ARRAY(progs, ctx, BPF_PROG_RUN); + } while (bpf_test_timer_continue(&t, repeat, &ret, &duration)); + bpf_test_timer_leave(&t); + + if (ret < 0) + goto out; + + user_ctx->cookie = 0; + if (ctx.selected_sk) { + if (ctx.selected_sk->sk_reuseport && !ctx.no_reuseport) { + ret = -EOPNOTSUPP; + goto out; + } + + user_ctx->cookie = sock_gen_cookie(ctx.selected_sk); + } + + ret = bpf_test_finish(kattr, uattr, NULL, 0, retval, duration); + if (!ret) + ret = bpf_ctx_finish(kattr, uattr, user_ctx, sizeof(*user_ctx)); + +out: + bpf_prog_array_free(progs); + kfree(user_ctx); + return ret; +} diff --git a/net/core/filter.c b/net/core/filter.c index 38153331b7b0..160f13758713 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -10338,6 +10338,7 @@ static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type, }
const struct bpf_prog_ops sk_lookup_prog_ops = { + .test_run = bpf_prog_test_run_sk_lookup, };
const struct bpf_verifier_ops sk_lookup_verifier_ops = { diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 2ca3caed8838..6f2e2e303697 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5007,7 +5007,10 @@ struct bpf_pidns_info {
/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */ struct bpf_sk_lookup { - __bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */ + union { + __bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */ + __u64 cookie; /* Non-zero if socket was selected in PROG_TEST_RUN */ + };
__u32 family; /* Protocol family (AF_INET, AF_INET6) */ __u32 protocol; /* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
From: Lorenz Bauer lmb@cloudflare.com
stable inclusion from stable-v5.10.135 commit 4bfc9dc60873923ffa64ee77084bac55031a30a0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZWFM
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit b4f894633fa14d7d46ba7676f950b90a401504bb upstream.
sk_lookup doesn't allow setting data_in for bpf_prog_run. This doesn't play well with the verifier tests, since they always set a 64 byte input buffer. Allow not running verifier tests by setting bpf_test.runs to a negative value and don't run the ctx access case for sk_lookup. We have dedicated ctx access tests so skipping here doesn't reduce coverage.
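As a sketch of the mechanism (condensed from the hunks below; the surrounding harness code is elided), a negative runs value now means "load and verify only, never execute":

  /* struct bpf_test: runs becomes signed so -1 can opt out of execution */
  int runs;                       /* was: uint8_t runs; */

  /* do_test_single(): skip BPF_PROG_TEST_RUN for opted-out tests */
  if (!alignment_prevented_execution && fd_prog >= 0 && test->runs >= 0) {
          /* ... run the program and check retval ... */
  }

  /* ctx_sk_lookup.c test definition: */
  .runs = -1,     /* sk_lookup forbids data_in, so don't execute */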
Signed-off-by: Lorenz Bauer lmb@cloudflare.com Signed-off-by: Alexei Starovoitov ast@kernel.org Link: https://lore.kernel.org/bpf/20210303101816.36774-6-lmb@cloudflare.com Signed-off-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Pu Lehui pulehui@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/testing/selftests/bpf/test_verifier.c | 4 ++-- tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c | 1 + 2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 0fc813235575..961c17b4681e 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -101,7 +101,7 @@ struct bpf_test { enum bpf_prog_type prog_type; uint8_t flags; void (*fill_helper)(struct bpf_test *self); - uint8_t runs; + int runs; #define bpf_testdata_struct_t \ struct { \ uint32_t retval, retval_unpriv; \ @@ -1064,7 +1064,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
run_errs = 0; run_successes = 0; - if (!alignment_prevented_execution && fd_prog >= 0) { + if (!alignment_prevented_execution && fd_prog >= 0 && test->runs >= 0) { uint32_t expected_val; int i;
diff --git a/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c b/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c index 2ad5f974451c..fd3b62a084b9 100644 --- a/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c +++ b/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c @@ -239,6 +239,7 @@ .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SK_LOOKUP, .expected_attach_type = BPF_SK_LOOKUP, + .runs = -1, }, /* invalid 8-byte reads from a 4-byte fields in bpf_sk_lookup */ {
From: GUO Zihua guozihua@huawei.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61O87 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
--------------------------------
Currently ima_lsm_copy_rule() sets the args_p field of the source rule to NULL so that the source rule can be freed afterward. It does not make sense for this behavior to live inside a "copy" function, so move it outside and let the caller handle this field.
ima_lsm_copy_rule() now produces a shallow copy of the original entry, including the args_p field, meaning that only the lsm.rule references and the rule itself should be freed for the original rule. Thus, instead of calling ima_lsm_free_rule(), which frees the args_p field as well as lsm.rule, free the lsm.rule references directly.
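To make the ownership transfer concrete, here is the resulting cleanup in ima_lsm_update_rule() in sketch form (condensed from the diff below):

  nentry = ima_lsm_copy_rule(entry);      /* shallow copy, args_p now shared */
  if (!nentry)
          return -ENOMEM;

  list_replace_rcu(&entry->list, &nentry->list);
  synchronize_rcu();

  /*
   * Do NOT call ima_lsm_free_rule(entry): that would free args_p, which
   * nentry now owns. Free only the stale LSM rule references.
   */
  for (i = 0; i < MAX_LSM_RULES; i++)
          ima_filter_rule_free(entry->lsm[i].rule);
  kfree(entry);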
Signed-off-by: GUO Zihua guozihua@huawei.com Reviewed-by: Roberto Sassu roberto.sassu@huawei.com Signed-off-by: Mimi Zohar zohar@linux.ibm.com Conflicts: security/integrity/ima/ima_policy.c Signed-off-by: GUO Zihua guozihua@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- security/integrity/ima/ima_policy.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c index b1ab4b3d99fb..d39118c1ad3d 100644 --- a/security/integrity/ima/ima_policy.c +++ b/security/integrity/ima/ima_policy.c @@ -399,12 +399,6 @@ static struct ima_rule_entry *ima_lsm_copy_rule(struct ima_rule_entry *entry)
nentry->lsm[i].type = entry->lsm[i].type; nentry->lsm[i].args_p = entry->lsm[i].args_p; - /* - * Remove the reference from entry so that the associated - * memory will not be freed during a later call to - * ima_lsm_free_rule(entry). - */ - entry->lsm[i].args_p = NULL;
ima_filter_rule_init(nentry->lsm[i].type, Audit_equal, nentry->lsm[i].args_p, @@ -418,6 +412,7 @@ static struct ima_rule_entry *ima_lsm_copy_rule(struct ima_rule_entry *entry)
static int ima_lsm_update_rule(struct ima_rule_entry *entry) { + int i; struct ima_rule_entry *nentry;
nentry = ima_lsm_copy_rule(entry); @@ -432,7 +427,8 @@ static int ima_lsm_update_rule(struct ima_rule_entry *entry) * references and the entry itself. All other memory refrences will now * be owned by nentry. */ - ima_lsm_free_rule(entry); + for (i = 0; i < MAX_LSM_RULES; i++) + ima_filter_rule_free(entry->lsm[i].rule); kfree(entry);
return 0;
From: GUO Zihua guozihua@huawei.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61O87 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
--------------------------------
IMA relies on the blocking LSM policy notifier callback to update the LSM based IMA policy rules.
When SELinux updates its policies, IMA is notified and starts updating all its LSM rules one by one. During this time, ima_filter_rule_match() returns -ESTALE if it is called with an LSM rule that has not yet been updated. ima_match_rules() does not handle -ESTALE and considers the LSM rule a match, causing extra files to be measured by IMA.
Fix it by matching against a re-initialized temporary copy of the rule whenever ima_filter_rule_match() returns -ESTALE. The original rule in the rule list will be updated by the LSM policy notifier callback.
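In sketch form, the retry added to ima_match_rules() looks like this (condensed from the diff below; the surrounding loop is elided):

  retry:
          rc = ima_filter_rule_match(osid, lsm_rule->lsm[i].type,
                                     Audit_equal, lsm_rule->lsm[i].rule);
          if (rc == -ESTALE && !rule_reinitialized) {
                  /* rebuild the LSM rules from args_p against the new policy */
                  lsm_rule = ima_lsm_copy_rule(rule);
                  if (lsm_rule) {
                          rule_reinitialized = true;
                          goto retry;
                  }
          }

  /* once matching is done, the temporary copy is freed */
  if (rule_reinitialized) {
          for (i = 0; i < MAX_LSM_RULES; i++)
                  ima_filter_rule_free(lsm_rule->lsm[i].rule);
          kfree(lsm_rule);
  }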
Fixes: b16942455193 ("ima: use the lsm policy update notifier") Signed-off-by: GUO Zihua guozihua@huawei.com Reviewed-by: Roberto Sassu roberto.sassu@huawei.com Signed-off-by: Mimi Zohar zohar@linux.ibm.com Conflicts: security/integrity/ima/ima_policy.c Signed-off-by: GUO Zihua guozihua@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- security/integrity/ima/ima_policy.c | 41 ++++++++++++++++++++++------- 1 file changed, 32 insertions(+), 9 deletions(-)
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c index d39118c1ad3d..274f4c7c99f4 100644 --- a/security/integrity/ima/ima_policy.c +++ b/security/integrity/ima/ima_policy.c @@ -528,6 +528,9 @@ static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode, const char *keyring) { int i; + bool result = false; + struct ima_rule_entry *lsm_rule = rule; + bool rule_reinitialized = false;
if (func == KEY_CHECK) { return (rule->flags & IMA_FUNC) && (rule->func == func) && @@ -573,34 +576,54 @@ static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode, int rc = 0; u32 osid;
- if (!rule->lsm[i].rule) { - if (!rule->lsm[i].args_p) + if (!lsm_rule->lsm[i].rule) { + if (!lsm_rule->lsm[i].args_p) continue; else return false; } + +retry: switch (i) { case LSM_OBJ_USER: case LSM_OBJ_ROLE: case LSM_OBJ_TYPE: security_inode_getsecid(inode, &osid); - rc = ima_filter_rule_match(osid, rule->lsm[i].type, + rc = ima_filter_rule_match(osid, lsm_rule->lsm[i].type, Audit_equal, - rule->lsm[i].rule); + lsm_rule->lsm[i].rule); break; case LSM_SUBJ_USER: case LSM_SUBJ_ROLE: case LSM_SUBJ_TYPE: - rc = ima_filter_rule_match(secid, rule->lsm[i].type, + rc = ima_filter_rule_match(secid, lsm_rule->lsm[i].type, Audit_equal, - rule->lsm[i].rule); + lsm_rule->lsm[i].rule); default: break; } - if (!rc) - return false; + + if (rc == -ESTALE && !rule_reinitialized) { + lsm_rule = ima_lsm_copy_rule(rule); + if (lsm_rule) { + rule_reinitialized = true; + goto retry; + } + } + if (!rc) { + result = false; + goto out; + } } - return true; + result = true; + +out: + if (rule_reinitialized) { + for (i = 0; i < MAX_LSM_RULES; i++) + ima_filter_rule_free(lsm_rule->lsm[i].rule); + kfree(lsm_rule); + } + return result; }
/*
From: Tejun Heo tj@kernel.org
mainline inclusion from mainline-v5.18-rc1 commit 6b2b04590b51aa4cf395fcd185ce439cab5961dc category: bugfix bugzilla: 187443, https://gitee.com/openeuler/kernel/issues/I5Z7O2 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs...
---------------------------
blk-iocost and iolatency are cgroup aware rq-qos policies but they didn't disable merges across different cgroups. This obviously can lead to accounting and control errors but more importantly to priority inversions - e.g. an IO which belongs to a higher priority cgroup or IO class may end up getting throttled incorrectly because it gets merged to an IO issued from a low priority cgroup.
Fix it by adding blk_cgroup_mergeable() which is called from merge paths and rejects cross-cgroup and cross-issue_as_root merges.
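In sketch form, the new helper and one of its call sites (condensed from the hunks below):

  /* reject merges unless both IOs share a blkg and issue_as_root state */
  static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio)
  {
          return rq->bio->bi_blkg == bio->bi_blkg &&
                 bio_issue_as_root_blkg(rq->bio) ==
                 bio_issue_as_root_blkg(bio);
  }

  /* e.g. in blk_rq_merge_ok(): */
  if (!blk_cgroup_mergeable(rq, bio))
          return false;   /* don't merge across cgroup boundaries */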
Signed-off-by: Tejun Heo tj@kernel.org Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Cc: Josef Bacik jbacik@fb.com Link: https://lore.kernel.org/r/Yi/eE/6zFNyWJ+qd@slm.duckdns.org Signed-off-by: Jens Axboe axboe@kernel.dk
conflicts: block/blk-merge.c include/linux/blk-cgroup.h
Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-merge.c | 11 +++++++++++ include/linux/blk-cgroup.h | 17 +++++++++++++++++ 2 files changed, 28 insertions(+)
diff --git a/block/blk-merge.c b/block/blk-merge.c index a3f31d1f3fcc..827e43fe33b1 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -7,6 +7,7 @@ #include <linux/bio.h> #include <linux/blkdev.h> #include <linux/scatterlist.h> +#include <linux/blk-cgroup.h>
#include <trace/events/block.h>
@@ -554,6 +555,9 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq) static inline int ll_new_hw_segment(struct request *req, struct bio *bio, unsigned int nr_phys_segs) { + if (!blk_cgroup_mergeable(req, bio)) + goto no_merge; + if (blk_integrity_merge_bio(req->q, req, bio) == false) goto no_merge;
@@ -650,6 +654,9 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req, if (total_phys_segments > blk_rq_get_max_segments(req)) return 0;
+ if (!blk_cgroup_mergeable(req, next->bio)) + return 0; + if (blk_integrity_merge_rq(q, req, next) == false) return 0;
@@ -860,6 +867,10 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio) if (rq->rq_disk != bio->bi_disk) return false;
+ /* don't merge across cgroup boundaries */ + if (!blk_cgroup_mergeable(rq, bio)) + return false; + /* only merge integrity protected bio into ditto rq */ if (blk_integrity_merge_bio(rq->q, rq, bio) == false) return false; diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index b44db9835489..dac9804907df 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -25,6 +25,7 @@ #include <linux/atomic.h> #include <linux/kthread.h> #include <linux/fs.h> +#include <linux/blk-mq.h>
/* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */ #define BLKG_STAT_CPU_BATCH (INT_MAX / 2) @@ -610,6 +611,21 @@ static inline void blkcg_clear_delay(struct blkcg_gq *blkg) atomic_dec(&blkg->blkcg->css.cgroup->congestion_count); }
+/** + * blk_cgroup_mergeable - Determine whether to allow or disallow merges + * @rq: request to merge into + * @bio: bio to merge + * + * @bio and @rq should belong to the same cgroup and their issue_as_root should + * match. The latter is necessary as we don't want to throttle e.g. a metadata + * update because it happens to be next to a regular IO. + */ +static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) +{ + return rq->bio->bi_blkg == bio->bi_blkg && + bio_issue_as_root_blkg(rq->bio) == bio_issue_as_root_blkg(bio); +} + void blk_cgroup_bio_start(struct bio *bio); void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta); void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay); @@ -665,6 +681,7 @@ static inline void blkg_put(struct blkcg_gq *blkg) { } static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; } static inline void blkcg_bio_issue_init(struct bio *bio) { } static inline void blk_cgroup_bio_start(struct bio *bio) { } +static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) { return true; }
#define blk_queue_for_each_rl(rl, q) \ for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
From: Li Nan linan122@huawei.com
hulk inclusion category: bugfix bugzilla: 187443, https://gitee.com/openeuler/kernel/issues/I5Z7O2 CVE: NA
--------------------------------
Including additional header files and adding a new function in blk-cgroup.h would break KABI, so move the changes to blk-mq.h instead. bio_issue_as_root_blkg() is needed by blk_cgroup_mergeable(), so move it as well. It is also used by iocost, so include blk-mq.h in blk-iocost.c.
Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-iocost.c | 1 + block/blk-merge.c | 1 - block/blk-mq.h | 34 ++++++++++++++++++++++++++++++++++ include/linux/blk-cgroup.h | 33 --------------------------------- 4 files changed, 35 insertions(+), 34 deletions(-)
diff --git a/block/blk-iocost.c b/block/blk-iocost.c index 08e4ba856e3b..462dbb766ed1 100644 --- a/block/blk-iocost.c +++ b/block/blk-iocost.c @@ -184,6 +184,7 @@ #include "blk-rq-qos.h" #include "blk-stat.h" #include "blk-wbt.h" +#include "blk-mq.h"
#ifdef CONFIG_TRACEPOINTS
diff --git a/block/blk-merge.c b/block/blk-merge.c index 827e43fe33b1..117a160444af 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -7,7 +7,6 @@ #include <linux/bio.h> #include <linux/blkdev.h> #include <linux/scatterlist.h> -#include <linux/blk-cgroup.h>
#include <trace/events/block.h>
diff --git a/block/blk-mq.h b/block/blk-mq.h index 5572277cf9a3..1c86f7d56e72 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -338,5 +338,39 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx, return __blk_mq_active_requests(hctx) < depth; }
+/** + * bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg + * @return: true if this bio needs to be submitted with the root blkg context. + * + * In order to avoid priority inversions we sometimes need to issue a bio as if + * it were attached to the root blkg, and then backcharge to the actual owning + * blkg. The idea is we do bio_blkcg() to look up the actual context for the + * bio and attach the appropriate blkg to the bio. Then we call this helper and + * if it is true run with the root blkg for that queue and then do any + * backcharging to the originating cgroup once the io is complete. + */ +static inline bool bio_issue_as_root_blkg(struct bio *bio) +{ + return (bio->bi_opf & (REQ_META | REQ_SWAP)) != 0; +} + +#ifdef CONFIG_BLK_CGROUP +/** + * blk_cgroup_mergeable - Determine whether to allow or disallow merges + * @rq: request to merge into + * @bio: bio to merge + * + * @bio and @rq should belong to the same cgroup and their issue_as_root should + * match. The latter is necessary as we don't want to throttle e.g. a metadata + * update because it happens to be next to a regular IO. + */ +static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) +{ + return rq->bio->bi_blkg == bio->bi_blkg && + bio_issue_as_root_blkg(rq->bio) == bio_issue_as_root_blkg(bio); +} +#else /* CONFIG_BLK_CGROUP */ +static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) { return true; } +#endif /* CONFIG_BLK_CGROUP */
#endif diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index dac9804907df..994ff06de40f 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -25,7 +25,6 @@ #include <linux/atomic.h> #include <linux/kthread.h> #include <linux/fs.h> -#include <linux/blk-mq.h>
/* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */ #define BLKG_STAT_CPU_BATCH (INT_MAX / 2) @@ -297,22 +296,6 @@ static inline bool blk_cgroup_congested(void) return ret; }
-/** - * bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg - * @return: true if this bio needs to be submitted with the root blkg context. - * - * In order to avoid priority inversions we sometimes need to issue a bio as if - * it were attached to the root blkg, and then backcharge to the actual owning - * blkg. The idea is we do bio_blkcg() to look up the actual context for the - * bio and attach the appropriate blkg to the bio. Then we call this helper and - * if it is true run with the root blkg for that queue and then do any - * backcharging to the originating cgroup once the io is complete. - */ -static inline bool bio_issue_as_root_blkg(struct bio *bio) -{ - return (bio->bi_opf & (REQ_META | REQ_SWAP)) != 0; -} - /** * blkcg_parent - get the parent of a blkcg * @blkcg: blkcg of interest @@ -611,21 +594,6 @@ static inline void blkcg_clear_delay(struct blkcg_gq *blkg) atomic_dec(&blkg->blkcg->css.cgroup->congestion_count); }
-/** - * blk_cgroup_mergeable - Determine whether to allow or disallow merges - * @rq: request to merge into - * @bio: bio to merge - * - * @bio and @rq should belong to the same cgroup and their issue_as_root should - * match. The latter is necessary as we don't want to throttle e.g. a metadata - * update because it happens to be next to a regular IO. - */ -static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) -{ - return rq->bio->bi_blkg == bio->bi_blkg && - bio_issue_as_root_blkg(rq->bio) == bio_issue_as_root_blkg(bio); -} - void blk_cgroup_bio_start(struct bio *bio); void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta); void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay); @@ -681,7 +649,6 @@ static inline void blkg_put(struct blkcg_gq *blkg) { } static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; } static inline void blkcg_bio_issue_init(struct bio *bio) { } static inline void blk_cgroup_bio_start(struct bio *bio) { } -static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) { return true; }
#define blk_queue_for_each_rl(rl, q) \ for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
From: Long Li leo.lilong@huawei.com
hulk inclusion category: bugfix bugzilla: 187286, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA
--------------------------------
The following error occurred during the fsstress test:
XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
The problem is that an inode race condition causes an incorrect i_nlink value to be written to disk and later read back into memory. Consider the following call graphs: for an inode that is marked both XFS_IFLUSHING and XFS_IRECLAIMABLE, i_nlink is reset to 1 and only later restored to its original value in xfs_reinit_inode(). Therefore, the on-disk i_nlink of a directory may be set to 1.
xfsaild
  xfs_inode_item_push
    xfs_iflush_cluster
      xfs_iflush
        xfs_inode_to_disk
xfs_iget
  xfs_iget_cache_hit
    xfs_iget_recycle
      xfs_reinit_inode
        inode_init_always
xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal inode state and can race with other RCU protected inode lookups. On the read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu + ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from racing inode updates (during transactions) by that lock.
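Condensed from the diff below, the recycle path now takes the ILOCK exclusively and backs off if it cannot:

  /* in xfs_iget_recycle(): serialize against xfs_iflush_cluster(), which
   * holds ILOCK_SHARED while writing the inode to disk */
  if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
          return -EAGAIN;

  error = xfs_reinit_inode(mp, inode);
  xfs_iunlock(ip, XFS_ILOCK_EXCL);

  /* in xfs_iget_cache_hit(): -EAGAIN means retry the lookup */
  error = xfs_iget_recycle(pag, ip);
  if (error == -EAGAIN)
          goto out_skip;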
Signed-off-by: Long Li leo.lilong@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_icache.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 986d087df226..93e24f85eb8d 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -361,6 +361,9 @@ xfs_iget_recycle(
trace_xfs_iget_recycle(ip);
+ if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) + return -EAGAIN; + /* * We need to make it look like the inode is being reclaimed to prevent * the actual reclaim workers from stomping over us while we recycle @@ -374,6 +377,7 @@ xfs_iget_recycle(
ASSERT(!rwsem_is_locked(&inode->i_rwsem)); error = xfs_reinit_inode(mp, inode); + xfs_iunlock(ip, XFS_ILOCK_EXCL); if (error) { bool wake;
@@ -542,6 +546,8 @@ xfs_iget_cache_hit( if (ip->i_flags & XFS_IRECLAIMABLE) { /* Drops i_flags_lock and RCU read lock. */ error = xfs_iget_recycle(pag, ip); + if (error == -EAGAIN) + goto out_skip; if (error) return error; } else {
From: Qi Liu liuqi115@huawei.com
mainline inclusion from mainline-v5.17-rc1 commit 20c634932ae8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I62482 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
A user may issue a control phy command from sysfs at any time, even if the controller is resetting.
If a phy is disabled by hardreset/linkreset command before calling get_phys_state() in the reset path, the saved phy state may be incorrect.
To avoid incorrectly recording the phy state, use hisi_hba.sem to ensure that the controller reset cannot run at the same time as the phy control function.
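The serialization itself is small; in sketch form (condensed from the diff below):

  /* in hisi_sas_control_phy(): hold hisi_hba.sem across the phy operation
   * so a controller reset cannot sample a half-changed phy state */
  down(&hisi_hba->sem);
  phy->reset_completion = &completion;
  /* ... hardreset / linkreset / enable / disable ... */
  phy->reset_completion = NULL;
  up(&hisi_hba->sem);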
Link: https://lore.kernel.org/r/1639579061-179473-6-git-send-email-john.garry@huaw... Signed-off-by: Qi Liu liuqi115@huawei.com Signed-off-by: John Garry john.garry@huawei.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com Reviewed-by: Jason Yan yanaijie@huawei.com Reviewed-by: Xiang Chen chenxiang66@hisilicon.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/scsi/hisi_sas/hisi_sas_main.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c index a1c6a67da132..482be0a461f8 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_main.c +++ b/drivers/scsi/hisi_sas/hisi_sas_main.c @@ -1168,6 +1168,7 @@ static int hisi_sas_control_phy(struct asd_sas_phy *sas_phy, enum phy_func func, u8 sts = phy->phy_attached; int ret = 0;
+ down(&hisi_hba->sem); phy->reset_completion = &completion;
switch (func) { @@ -1211,6 +1212,7 @@ static int hisi_sas_control_phy(struct asd_sas_phy *sas_phy, enum phy_func func, out: phy->reset_completion = NULL;
+ up(&hisi_hba->sem); return ret; }
From: Qi Liu liuqi115@huawei.com
mainline inclusion from mainline-v5.17-rc1 commit 16775db613c2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I62482 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
If we issue a controller reset command while executing an FLR, a hung task may be observed:
Call trace:
 __switch_to+0x158/0x1cc
 __schedule+0x2e8/0x85c
 schedule+0x7c/0x110
 schedule_timeout+0x190/0x1cc
 __down+0x7c/0xd4
 down+0x5c/0x7c
 hisi_sas_task_exec+0x510/0x680 [hisi_sas_main]
 hisi_sas_queue_command+0x24/0x30 [hisi_sas_main]
 smp_execute_task_sg+0xf4/0x23c [libsas]
 sas_smp_phy_control+0x110/0x1e0 [libsas]
 transport_sas_phy_reset+0xc8/0x190 [libsas]
 phy_reset_work+0x2c/0x40 [libsas]
 process_one_work+0x1dc/0x48c
 worker_thread+0x15c/0x464
 kthread+0x160/0x170
 ret_from_fork+0x10/0x18
This is a race condition which occurs when the FLR completes first.
Here the host HISI_SAS_RESETTING_BIT flag gets out of sync, as HISI_SAS_RESETTING_BIT is not always cleared with the hisi_hba.sem held; so now only set/unset HISI_SAS_RESETTING_BIT under hisi_hba.sem.
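Condensed from the diff below, the flag is now only manipulated with the semaphore held, and the FLR prepare path takes the semaphore too:

  /* in hisi_sas_controller_prereset(): take the sem before the bit */
  down(&hisi_hba->sem);
  if (test_and_set_bit(HISI_SAS_RESET_BIT, &hisi_hba->flags)) {
          up(&hisi_hba->sem);
          return -1;
  }

  /* in hisi_sas_controller_reset_done(): clear the bit, then release */
  clear_bit(HISI_SAS_RESET_BIT, &hisi_hba->flags);
  up(&hisi_hba->sem);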
Link: https://lore.kernel.org/r/1639579061-179473-7-git-send-email-john.garry@huaw... Signed-off-by: Qi Liu liuqi115@huawei.com Signed-off-by: John Garry john.garry@huawei.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com Reviewed-by: Jason Yan yanaijie@huawei.com Reviewed-by: Xiang Chen chenxiang66@hisilicon.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/scsi/hisi_sas/hisi_sas_main.c | 8 +++++--- drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 1 + 2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c index 482be0a461f8..09809c3bd317 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_main.c +++ b/drivers/scsi/hisi_sas/hisi_sas_main.c @@ -1600,7 +1600,6 @@ void hisi_sas_controller_reset_prepare(struct hisi_hba *hisi_hba) { struct Scsi_Host *shost = hisi_hba->shost;
- down(&hisi_hba->sem); hisi_hba->phy_state = hisi_hba->hw->get_phys_state(hisi_hba);
scsi_block_requests(shost); @@ -1626,9 +1625,9 @@ void hisi_sas_controller_reset_done(struct hisi_hba *hisi_hba) if (hisi_hba->reject_stp_links_msk) hisi_sas_terminate_stp_reject(hisi_hba); hisi_sas_reset_init_all_devices(hisi_hba); - up(&hisi_hba->sem); scsi_unblock_requests(shost); clear_bit(HISI_SAS_RESET_BIT, &hisi_hba->flags); + up(&hisi_hba->sem);
hisi_sas_rescan_topology(hisi_hba, hisi_hba->phy_state); } @@ -1639,8 +1638,11 @@ static int hisi_sas_controller_prereset(struct hisi_hba *hisi_hba) if (!hisi_hba->hw->soft_reset) return -1;
- if (test_and_set_bit(HISI_SAS_RESET_BIT, &hisi_hba->flags)) + down(&hisi_hba->sem); + if (test_and_set_bit(HISI_SAS_RESET_BIT, &hisi_hba->flags)) { + up(&hisi_hba->sem); return -1; + }
if (hisi_sas_debugfs_enable && hisi_hba->debugfs_itct[0].itct) hisi_hba->hw->debugfs_snapshot_regs(hisi_hba); diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c index fd5bdb0afa71..8dc86bebe5d2 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c @@ -4908,6 +4908,7 @@ static void hisi_sas_reset_prepare_v3_hw(struct pci_dev *pdev) int rc;
dev_info(dev, "FLR prepare\n"); + down(&hisi_hba->sem); set_bit(HISI_SAS_RESET_BIT, &hisi_hba->flags); hisi_sas_controller_reset_prepare(hisi_hba);
From: Yu Liao liaoyu15@huawei.com
hulk inclusion category: feature bugzilla: 186841, https://gitee.com/openeuler/kernel/issues/I61188 CVE: NA
--------------------------------
Enable CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE by default.
Signed-off-by: Yu Liao liaoyu15@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/configs/openeuler_defconfig | 2 +- arch/x86/configs/openeuler_defconfig | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 2c16a55630e9..34a7b8d500a1 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -651,7 +651,7 @@ CONFIG_FW_CFG_SYSFS=y # CONFIG_EFI_ESRT=y CONFIG_EFI_VARS_PSTORE=y -# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set +CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE=y CONFIG_EFI_FAKE_MEMMAP=y CONFIG_EFI_MAX_FAKE_MEM=8 CONFIG_EFI_SOFT_RESERVE=y diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index 18a190653df5..71b2ad5ccc73 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -690,7 +690,7 @@ CONFIG_FW_CFG_SYSFS=y # CONFIG_EFI_VARS is not set CONFIG_EFI_ESRT=y CONFIG_EFI_VARS_PSTORE=y -# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set +CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE=y CONFIG_EFI_RUNTIME_MAP=y # CONFIG_EFI_FAKE_MEMMAP is not set CONFIG_EFI_SOFT_RESERVE=y
From: Luís Henriques lhenriques@suse.de
stable inclusion from stable-v5.10.146 commit 958b0ee23f5ac106e7cc11472b71aa2ea9a033bc category: bugfix bugzilla: 187444, https://gitee.com/openeuler/kernel/issues/I6261Z CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 29a5b8a137ac8eb410cc823653a29ac0e7b7e1b0 upstream.
When walking through an inode extents, the ext4_ext_binsearch_idx() function assumes that the extent header has been previously validated. However, there are no checks that verify that the number of entries (eh->eh_entries) is non-zero when depth is > 0. And this will lead to problems because the EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:
[ 135.245946] ------------[ cut here ]------------
[ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
[ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
[ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
[ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[ 135.256475] Code:
[ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[ 135.277952] Call Trace:
[ 135.278635] <TASK>
[ 135.279247] ? preempt_count_add+0x6d/0xa0
[ 135.280358] ? percpu_counter_add_batch+0x55/0xb0
[ 135.281612] ? _raw_read_unlock+0x18/0x30
[ 135.282704] ext4_map_blocks+0x294/0x5a0
[ 135.283745] ? xa_load+0x6f/0xa0
[ 135.284562] ext4_mpage_readpages+0x3d6/0x770
[ 135.285646] read_pages+0x67/0x1d0
[ 135.286492] ? folio_add_lru+0x51/0x80
[ 135.287441] page_cache_ra_unbounded+0x124/0x170
[ 135.288510] filemap_get_pages+0x23d/0x5a0
[ 135.289457] ? path_openat+0xa72/0xdd0
[ 135.290332] filemap_read+0xbf/0x300
[ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40
[ 135.292192] new_sync_read+0x103/0x170
[ 135.293014] vfs_read+0x15d/0x180
[ 135.293745] ksys_read+0xa1/0xe0
[ 135.294461] do_syscall_64+0x3c/0x80
[ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0
This patch simply adds an extra check in __ext4_ext_check(), verifying that eh_entries is not 0 when eh_depth is > 0.
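Restated from the diff below, the added check rejects an index node with no entries:

  /* in __ext4_ext_check(): an interior (depth > 0) node must have entries */
  if (unlikely((eh->eh_entries == 0) && (depth > 0))) {
          error_msg = "eh_entries is 0 but eh_depth is > 0";
          goto corrupted;
  }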
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941 Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283 Cc: Baokun Li libaokun1@huawei.com Cc: stable@kernel.org Signed-off-by: Luís Henriques lhenriques@suse.de Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Baokun Li libaokun1@huawei.com Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/extents.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 6202bd153934..e42a78170109 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -459,6 +459,10 @@ static int __ext4_ext_check(const char *function, unsigned int line, error_msg = "invalid eh_entries"; goto corrupted; } + if (unlikely((eh->eh_entries == 0) && (depth > 0))) { + error_msg = "eh_entries is 0 but eh_depth is > 0"; + goto corrupted; + } if (!ext4_valid_extent_entries(inode, eh, lblk, &pblk, depth)) { error_msg = "invalid extent entries"; goto corrupted;
From: Selvin Xavier selvin.xavier@broadcom.com
mainline inclusion from mainline-v5.16-rc1 commit 5ec0a6fcb60ea430f8ee7e0bec22db9b22f856d3 category: bugfix bugzilla: 187447,https://gitee.com/openeuler/kernel/issues/I612GP
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The host crashes when pci_enable_atomic_ops_to_root() is called for VFs on virtual buses, because the virtual buses added for SR-IOV have bus->self set to NULL.
PID: 4481 TASK: ffff89c6941b0000 CPU: 53 COMMAND: "bash"
...
 #3 [ffff9a9481713808] oops_end at ffffffffb9025cd6
 #4 [ffff9a9481713828] page_fault_oops at ffffffffb906e417
 #5 [ffff9a9481713888] exc_page_fault at ffffffffb9a0ad14
 #6 [ffff9a94817138b0] asm_exc_page_fault at ffffffffb9c00ace
    [exception RIP: pcie_capability_read_dword+28]
    RIP: ffffffffb952fd5c RSP: ffff9a9481713960 RFLAGS: 00010246
    RAX: 0000000000000001 RBX: ffff89c6b1096000 RCX: 0000000000000000
    RDX: ffff9a9481713990 RSI: 0000000000000024 RDI: 0000000000000000
    RBP: 0000000000000080 R8: 0000000000000008 R9: ffff89c64341a2f8
    R10: 0000000000000002 R11: 0000000000000000 R12: ffff89c648bab000
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff89c648bab0c8
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff9a9481713988] pci_enable_atomic_ops_to_root at ffffffffb95359a6
 #8 [ffff9a94817139c0] bnxt_qplib_determine_atomics at ffffffffc08c1a33 [bnxt_re]
 #9 [ffff9a94817139d0] bnxt_re_dev_init at ffffffffc08ba2d1 [bnxt_re]
Per PCIe r5.0, sec 9.3.5.10, the AtomicOp Requester Enable bit in Device Control 2 is reserved for VFs. The PF value applies to all associated VFs.
Return -EINVAL if pci_enable_atomic_ops_to_root() is called for a VF.
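Restated from the diff below, the guard is a single early check:

  /* in pci_enable_atomic_ops_to_root(): VFs have no AtomicOp Requester
   * Enable of their own; the PF setting applies to them */
  if (dev->is_virtfn)
          return -EINVAL;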
Link: https://lore.kernel.org/r/1631354585-16597-1-git-send-email-selvin.xavier@br... Fixes: 35f5ace5dea4 ("RDMA/bnxt_re: Enable global atomic ops if platform supports") Fixes: 430a23689dea ("PCI: Add pci_enable_atomic_ops_to_root()") Signed-off-by: Selvin Xavier selvin.xavier@broadcom.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Andy Gospodarek gospo@broadcom.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/pci/pci.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 458a52d6fd2f..8ace56c8141b 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -3674,6 +3674,14 @@ int pci_enable_atomic_ops_to_root(struct pci_dev *dev, u32 cap_mask) struct pci_dev *bridge; u32 cap, ctl2;
+ /* + * Per PCIe r5.0, sec 9.3.5.10, the AtomicOp Requester Enable bit + * in Device Control 2 is reserved in VFs and the PF value applies + * to all associated VFs. + */ + if (dev->is_virtfn) + return -EINVAL; + if (!pci_is_pcie(dev)) return -EINVAL;
From: Baisong Zhong zhongbaisong@huawei.com
mainline inclusion from mainline-v6.1-rc6 commit d3fd203f36d46aa29600a72d57a1b61af80e4a25 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60P4J CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We hit a syzkaller-reported problem caused by an aarch64 alignment fault when KFENCE is enabled. When the size passed in from the user BPF program is an odd number, such as 399 or 407, it causes unaligned access to struct skb_shared_info. As seen below:
BUG: KFENCE: use-after-free read in __skb_clone+0x23c/0x2a0 net/core/skbuff.c:1032
Use-after-free read at 0xffff6254fffac077 (in kfence-#213):
 __lse_atomic_add arch/arm64/include/asm/atomic_lse.h:26 [inline]
 arch_atomic_add arch/arm64/include/asm/atomic.h:28 [inline]
 arch_atomic_inc include/linux/atomic-arch-fallback.h:270 [inline]
 atomic_inc include/asm-generic/atomic-instrumented.h:241 [inline]
 __skb_clone+0x23c/0x2a0 net/core/skbuff.c:1032
 skb_clone+0xf4/0x214 net/core/skbuff.c:1481
 ____bpf_clone_redirect net/core/filter.c:2433 [inline]
 bpf_clone_redirect+0x78/0x1c0 net/core/filter.c:2420
 bpf_prog_d3839dd9068ceb51+0x80/0x330
 bpf_dispatcher_nop_func include/linux/bpf.h:728 [inline]
 bpf_test_run+0x3c0/0x6c0 net/bpf/test_run.c:53
 bpf_prog_test_run_skb+0x638/0xa7c net/bpf/test_run.c:594
 bpf_prog_test_run kernel/bpf/syscall.c:3148 [inline]
 __do_sys_bpf kernel/bpf/syscall.c:4441 [inline]
 __se_sys_bpf+0xad0/0x1634 kernel/bpf/syscall.c:4381
kfence-#213: 0xffff6254fffac000-0xffff6254fffac196, size=407, cache=kmalloc-512
allocated by task 15074 on cpu 0 at 1342.585390s:
 kmalloc include/linux/slab.h:568 [inline]
 kzalloc include/linux/slab.h:675 [inline]
 bpf_test_init.isra.0+0xac/0x290 net/bpf/test_run.c:191
 bpf_prog_test_run_skb+0x11c/0xa7c net/bpf/test_run.c:512
 bpf_prog_test_run kernel/bpf/syscall.c:3148 [inline]
 __do_sys_bpf kernel/bpf/syscall.c:4441 [inline]
 __se_sys_bpf+0xad0/0x1634 kernel/bpf/syscall.c:4381
 __arm64_sys_bpf+0x50/0x60 kernel/bpf/syscall.c:4381
To fix the problem, we adjust @size so that (@size + @headroom) is a multiple of SMP_CACHE_BYTES, ensuring that struct skb_shared_info is aligned to a cache line.
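As a standalone illustration of the alignment math (a hypothetical userspace demo; in the kernel, SKB_DATA_ALIGN() is ALIGN(x, SMP_CACHE_BYTES), and this assumes 64-byte cache lines as on the affected arm64 systems):

  #include <stdio.h>

  #define SMP_CACHE_BYTES 64
  #define SKB_DATA_ALIGN(x) \
          (((x) + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1))

  int main(void)
  {
          unsigned int sizes[] = { 399, 407, 512 };

          for (int i = 0; i < 3; i++)
                  printf("size %u -> aligned %u\n",
                         sizes[i], SKB_DATA_ALIGN(sizes[i]));
          /* prints: 399 -> 448, 407 -> 448, 512 -> 512 */
          return 0;
  }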
Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command") Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Cc: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/bpf/20221102081620.1465154-1-zhongbaisong@huawei.com Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/bpf/test_run.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 0dfef59cf3de..df8d9d800ebc 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -229,6 +229,7 @@ static void *bpf_test_init(const union bpf_attr *kattr, u32 size, if (user_size > size) return ERR_PTR(-EMSGSIZE);
+ size = SKB_DATA_ALIGN(size); data = kzalloc(size + headroom + tailroom, GFP_USER); if (!data) return ERR_PTR(-ENOMEM);
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.17-rc1 commit 2d86293c70750e4331e9616aded33ab6b47c299d category: bugfix bugzilla: 186909,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that the VFS will do something with the return values from ->sync_fs, make ours pass on error codes.
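Condensed from the hunk below, the sync path now propagates the log force error instead of discarding it:

  /* in xfs_fs_sync_fs(): */
  error = xfs_log_force(mp, XFS_LOG_SYNC);
  if (error)
          return error;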
Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Christoph Hellwig hch@lst.de Acked-by: Christian Brauner brauner@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_super.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 2561e95fbdd1..9c4ff38e0901 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -729,6 +729,7 @@ xfs_fs_sync_fs( int wait) { struct xfs_mount *mp = XFS_M(sb); + int error;
trace_xfs_fs_sync_fs(mp, __return_address);
@@ -738,7 +739,10 @@ xfs_fs_sync_fs( if (!wait) return 0;
- xfs_log_force(mp, XFS_LOG_SYNC); + error = xfs_log_force(mp, XFS_LOG_SYNC); + if (error) + return error; + if (laptop_mode) { /* * The disk must be active because we're syncing.
From: Brian Foster bfoster@redhat.com
mainline inclusion from mainline-v5.18-rc2 commit f650df7171b882dca737ddbbeb414100b31f16af category: bugfix bugzilla: 187094,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The filestream AG selection loop uses pagf data to aid in AG selection, which depends on pagf initialization. If the in-core structure is not initialized, the caller invokes the AGF read path to do so and carries on. If another task enters the loop and finds a pagf init already in progress, the AGF read returns -EAGAIN and the task continues the loop. This does not increment the current ag index, however, which means the task spins on the current AGF buffer until unlocked.
If the AGF read I/O submitted by the initial task happens to be delayed for whatever reason, this results in soft lockup warnings via the spinning task. This is reproduced by xfs/170. To avoid this problem, fix the AGF trylock failure path to properly iterate to the next AG. If a task iterates all AGs without making progress, the trylock behavior is dropped in favor of blocking locks and thus a soft lockup is no longer possible.
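Condensed from the diff below, the trylock failure path now advances the loop instead of respinning on the same AG:

  /* in xfs_filestream_pick_ag(): */
  if (!pag->pagf_init) {
          err = xfs_alloc_pagf_init(mp, NULL, ag, trylock);
          if (err) {
                  if (err != -EAGAIN) {
                          xfs_perag_put(pag);
                          return err;
                  }
                  /* Couldn't lock the AGF, skip this AG. */
                  goto next_ag;
          }
  }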
Fixes: f48e2df8a877ca1c ("xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers") Signed-off-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Dave Chinner david@fromorbit.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_filestream.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c index db23e455eb91..bc41ec0c483d 100644 --- a/fs/xfs/xfs_filestream.c +++ b/fs/xfs/xfs_filestream.c @@ -128,11 +128,12 @@ xfs_filestream_pick_ag( if (!pag->pagf_init) { err = xfs_alloc_pagf_init(mp, NULL, ag, trylock); if (err) { - xfs_perag_put(pag); - if (err != -EAGAIN) + if (err != -EAGAIN) { + xfs_perag_put(pag); return err; + } /* Couldn't lock the AGF, skip this AG. */ - continue; + goto next_ag; } }
From: Eric Sandeen sandeen@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 207ddc0ef4f413ab1f4e0c1fcab2226425dec293 category: bugfix bugzilla: 187102,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We don't yet support dax on reflinked files, but that is in the works.
Further, having the flag set does not automatically mean that the inode is actually "in the CPU direct access state," which depends on several other conditions in addition to the flag being set.
As such, we should not catch this as corruption in the verifier - simply not actually enabling S_DAX on reflinked files is enough for now.
Fixes: 4f435ebe7d04 ("xfs: don't mix reflink and DAX mode for now") Signed-off-by: Eric Sandeen sandeen@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de [darrick: fix the scrubber too] Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_buf.c | 4 ---- fs/xfs/scrub/inode.c | 4 ---- 2 files changed, 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c index c667c63f2cb0..4d7410e49db4 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.c +++ b/fs/xfs/libxfs/xfs_inode_buf.c @@ -547,10 +547,6 @@ xfs_dinode_verify( if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME)) return __this_address;
- /* don't let reflink and dax mix */ - if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags2 & XFS_DIFLAG2_DAX)) - return __this_address; - /* COW extent size hint validation */ fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize), mode, flags, flags2); diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 147c443a7242..357e6042d1c0 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -185,10 +185,6 @@ xchk_inode_flags2( if ((flags & XFS_DIFLAG_REALTIME) && (flags2 & XFS_DIFLAG2_REFLINK)) goto bad;
- /* dax and reflink make no sense, currently */ - if ((flags2 & XFS_DIFLAG2_DAX) && (flags2 & XFS_DIFLAG2_REFLINK)) - goto bad; - /* no bigtime iflag without the bigtime feature */ if (xfs_dinode_has_bigtime(dip) && !xfs_sb_version_hasbigtime(&mp->m_sb))
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.16-rc2 commit 089558bc7ba785c03815a49c89e28ad9b8de51f9 category: bugfix bugzilla: 186901,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop:
mount <filesystem>                   # (0)
fsstress <run only readonly ops> &   # (1)
while true; do
    fsstress <run all ops>
    mount -o remount,ro              # (2)
    fsstress <run only readonly ops>
    mount -o remount,rw              # (3)
done
When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A).
When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that the incore COW fork of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents is allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B), and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount.
The results are catastrophic -- file (B) and the refcount btree are now corrupt. Solve this race by forcing the xfs_blockgc_free_space to run synchronously, which causes xfs_icwalk to return to inodes that were skipped because the blockgc code couldn't take the IOLOCK. This is safe to do here because the VFS has already prohibited new writer threads.
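Condensed from the diff below, the remount-ro path now requests a synchronous walk:

  /* in xfs_remount_ro(): XFS_ICWALK_FLAG_SYNC makes xfs_icwalk revisit
   * inodes whose IOLOCK could not be taken on the first pass */
  struct xfs_icwalk icw = {
          .icw_flags = XFS_ICWALK_FLAG_SYNC,
  };

  error = xfs_blockgc_free_space(mp, &icw);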
Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: Chandan Babu R chandan.babu@oracle.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_super.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 9c4ff38e0901..fd2cb3393747 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1778,7 +1778,10 @@ static int xfs_remount_ro( struct xfs_mount *mp) { - int error; + struct xfs_icwalk icw = { + .icw_flags = XFS_ICWALK_FLAG_SYNC, + }; + int error;
/* * Cancel background eofb scanning so it cannot race with the final @@ -1786,8 +1789,13 @@ xfs_remount_ro( */ xfs_blockgc_stop(mp);
- /* Get rid of any leftover CoW reservations... */ - error = xfs_blockgc_free_space(mp, NULL); + /* + * Clear out all remaining COW staging extents and speculative post-EOF + * preallocations so that we don't leave inodes requiring inactivation + * cleanups during reclaim on a read-only mount. We must process every + * cached inode, so this requires a synchronous cache scan. + */ + error = xfs_blockgc_free_space(mp, &icw); if (error) { xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); return error;
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.16-rc5 commit 7993f1a431bc5271369d359941485a9340658ac3 category: bugfix bugzilla: 186901,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop:
mount <filesystem>                   # (0)
fsstress <run only readonly ops> &   # (1)
while true; do
    fsstress <run all ops>
    mount -o remount,ro              # (2)
    fsstress <run only readonly ops>
    mount -o remount,rw              # (3)
done
When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A).
When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that the incore COW fork of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents is allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B), and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount.
The results are catastrophic -- file (B) and the refcount btree are now corrupt. In the first patch, we fixed the race condition in (2) so that (A) will always flush the COW fork. In this second patch, we move the _recover_cow call to the initial mount call in (0) for safety.
As mentioned previously, xfs_reflink_recover_cow walks the refcount btree looking for COW staging extents, and frees them. This was intended to be run at mount time (when we know there are no live inodes) to clean up any leftover staging events that may have been left behind during an unclean shutdown. As a time "optimization" for readonly mounts, we deferred this to the ro->rw transition, not realizing that any failure to clean all COW forks during a rw->ro transition would result in catastrophic corruption.
Therefore, remove this optimization and only run the recovery routine when we're guaranteed not to have any COW staging extents anywhere, which means we always run this at mount time. While we're at it, move the callsite to xfs_log_mount_finish because any refcount btree expansion (however unlikely given that we're removing records from the right side of the index) must be fed by a per-AG reservation, which doesn't exist in its current location.
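Condensed from the diff below, recovery now runs from xlog_recover_finish(); a failure shuts down the log but still returns 0 so that already-committed items can be pushed through the CIL and AIL:

  /* in xlog_recover_finish(): */
  error = xfs_reflink_recover_cow(log->l_mp);
  if (error) {
          xfs_alert(log->l_mp,
  "Failed to recover leftover CoW staging extents, err %d.", error);
          xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
  }
  return 0;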
Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Chandan Babu R chandan.babu@oracle.com Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log_recover.c | 23 +++++++++++++++++++++++ fs/xfs/xfs_mount.c | 10 ---------- fs/xfs/xfs_reflink.c | 5 ++++- fs/xfs/xfs_super.c | 9 --------- 4 files changed, 27 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index f3e7016823e8..83afe5bc0872 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -25,6 +25,7 @@ #include "xfs_icache.h" #include "xfs_error.h" #include "xfs_buf_item.h" +#include "xfs_reflink.h"
#define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1)
@@ -3465,6 +3466,28 @@ xlog_recover_finish(
xlog_recover_process_iunlinks(log); xlog_recover_check_summary(log); + + /* + * Recover any CoW staging blocks that are still referenced by the + * ondisk refcount metadata. During mount there cannot be any live + * staging extents as we have not permitted any user modifications. + * Therefore, it is safe to free them all right now, even on a + * read-only mount. + */ + error = xfs_reflink_recover_cow(log->l_mp); + if (error) { + xfs_alert(log->l_mp, + "Failed to recover leftover CoW staging extents, err %d.", + error); + /* + * If we get an error here, make sure the log is shut down + * but return zero so that any log items committed since the + * end of intents processing can be pushed through the CIL + * and AIL. + */ + xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); + } + return 0; }
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 1f8ba6f40654..959425cfb612 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1040,15 +1040,6 @@ xfs_mountfs( xfs_warn(mp, "Unable to allocate reserve blocks. Continuing without reserve pool.");
- /* Recover any CoW blocks that never got remapped. */ - error = xfs_reflink_recover_cow(mp); - if (error) { - xfs_err(mp, - "Error %d recovering leftover CoW allocations.", error); - xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - goto out_quota; - } - /* Reserve AG blocks for future btree expansion. */ error = xfs_fs_reserve_ag_blocks(mp); if (error && error != -ENOSPC) @@ -1059,7 +1050,6 @@ xfs_mountfs(
out_agresv: xfs_fs_unreserve_ag_blocks(mp); - out_quota: xfs_qm_unmount_quotas(mp); out_rtunmount: xfs_rtunmount_inodes(mp); diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 5e3f00f1192a..6c8492b16c7b 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -744,7 +744,10 @@ xfs_reflink_end_cow( }
/* - * Free leftover CoW reservations that didn't get cleaned out. + * Free all CoW staging blocks that are still referenced by the ondisk refcount + * metadata. The ondisk metadata does not track which inode created the + * staging extent, so callers must ensure that there are no cached inodes with + * live CoW staging extents. */ int xfs_reflink_recover_cow( diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index fd2cb3393747..bf65e2e50ab7 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1752,15 +1752,6 @@ xfs_remount_rw( */ xfs_restore_resvblks(mp); xfs_log_work_queue(mp); - - /* Recover any CoW blocks that never got remapped. */ - error = xfs_reflink_recover_cow(mp); - if (error) { - xfs_err(mp, - "Error %d recovering leftover CoW allocations.", error); - xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - return error; - } xfs_blockgc_start(mp);
/* Create the per-AG metadata reservation pool .*/
From: Dave Chinner dchinner@redhat.com
mainline inclusion from mainline-v5.16-rc5 commit 8dc9384b7d75012856b02ff44c37566a55fc2abf category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Oh, let me count the ways that the kvmalloc API sucks dog eggs.
The problem is when we are logging lots of large objects, we hit kvmalloc really damn hard with costly order allocations, and behaviour utterly sucks:
- 49.73% xlog_cil_commit
   - 31.62% kvmalloc_node
      - 29.96% __kmalloc_node
         - 29.38% kmalloc_large_node
            - 29.33% __alloc_pages
               - 24.33% __alloc_pages_slowpath.constprop.0
                  - 18.35% __alloc_pages_direct_compact
                     - 17.39% try_to_compact_pages
                        - compact_zone_order
                           - 15.26% compact_zone
                                5.29% __pageblock_pfn_to_page
                                3.71% PageHuge
                              - 1.44% isolate_migratepages_block
                                   0.71% set_pfnblock_flags_mask
                                1.11% get_pfnblock_flags_mask
                  - 0.81% get_page_from_freelist
                     - 0.59% _raw_spin_lock_irqsave
                        - do_raw_spin_lock
                             __pv_queued_spin_lock_slowpath
                  - 3.24% try_to_free_pages
                     - 3.14% shrink_node
                        - 2.94% shrink_slab.constprop.0
                           - 0.89% super_cache_count
                              - 0.66% xfs_fs_nr_cached_objects
                                 - 0.65% xfs_reclaim_inodes_count
                                      0.55% xfs_perag_get_tag
                             0.58% kfree_rcu_shrink_count
                  - 2.09% get_page_from_freelist
                     - 1.03% _raw_spin_lock_irqsave
                        - do_raw_spin_lock
                             __pv_queued_spin_lock_slowpath
               - 4.88% get_page_from_freelist
                  - 3.66% _raw_spin_lock_irqsave
                     - do_raw_spin_lock
                          __pv_queued_spin_lock_slowpath
      - 1.63% __vmalloc_node
         - __vmalloc_node_range
            - 1.10% __alloc_pages_bulk
               - 0.93% __alloc_pages
                  - 0.92% get_page_from_freelist
                     - 0.89% rmqueue_bulk
                        - 0.69% _raw_spin_lock
                           - do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
     13.73% memcpy_erms
   - 2.22% kvfree
On this workload, that's almost a dozen CPUs all trying to compact and reclaim memory inside kvmalloc_node at the same time. Yet it is regularly falling back to vmalloc despite all that compaction, page and shrinker reclaim that direct reclaim is doing. Copying all the metadata is taking far less CPU time than allocating the storage!
Direct reclaim should be considered extremely harmful.
This is a high frequency, high throughput, CPU usage and latency sensitive allocation. We've got memory there, and we're using kvmalloc to allow memory allocation to avoid doing lots of work to try to do contiguous allocations.
Except it still does *lots of costly work* that is unnecessary.
Worse: the only way to avoid the slowpath page allocation trying to do compaction on costly allocations is to turn off direct reclaim (i.e. remove __GFP_DIRECT_RECLAIM from the gfp flags).
Unfortunately, the stupid kvmalloc API then says "oh, this isn't a GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off the vmalloc fallback, and this leads to almost instant OOM problems, which end up in filesystem deadlocks, shutdowns and/or kernel crashes.
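For reference, the fallback gate being complained about sits at the top of the generic kvmalloc_node(); the following is a paraphrased sketch, not the exact mm/util.c code, which differs between kernel versions:

void *
kvmalloc_node(size_t size, gfp_t flags, int node)
{
	void	*p;

	/*
	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g.
	 * page tables), so only full GFP_KERNEL callers are given the
	 * vmalloc fallback. Clearing __GFP_DIRECT_RECLAIM from the
	 * flags therefore also loses the fallback - exactly the problem
	 * described above.
	 */
	if ((flags & GFP_KERNEL) != GFP_KERNEL)
		return kmalloc_node(size, flags, node);

	/* otherwise try kmalloc first, then fall back to vmalloc */
	p = kmalloc_node(size, flags | __GFP_NOWARN | __GFP_NORETRY, node);
	if (!p)
		p = vmalloc(size);
	return p;
}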
I want some basic kvmalloc behaviour:
- kmalloc for a contiguous range with fail fast semantics - no compaction direct reclaim if the allocation enters the slow path. - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
The really, really stupid part about this is these kvmalloc() calls are run under memalloc_nofs task context, so all the allocations are always reduced to GFP_NOFS regardless of the fact that kvmalloc requires GFP_KERNEL to be passed in. IOWs, we're already telling kvmalloc to behave differently to the gfp flags we pass in, but it still won't allow vmalloc to be run with anything other than GFP_KERNEL.
So, this patch open codes the kvmalloc() in the commit path to get the behaviour described above. The result is that we more than halve the CPU time spent doing kvmalloc() in this path, and the rate of transaction commits with 64kB objects in them more than doubles. I.e. we get a ~5x reduction in CPU usage per costly-sized kvmalloc() invocation and the profile looks like this:
- 37.60% xlog_cil_commit 16.01% memcpy_erms - 8.45% __kmalloc - 8.04% kmalloc_order_trace - 8.03% kmalloc_order - 7.93% alloc_pages - 7.90% __alloc_pages - 4.05% __alloc_pages_slowpath.constprop.0 - 2.18% get_page_from_freelist - 1.77% wake_all_kswapds .... - __wake_up_common_lock - 0.94% _raw_spin_lock_irqsave - 3.72% get_page_from_freelist - 2.43% _raw_spin_lock_irqsave - 5.72% vmalloc - 5.72% __vmalloc_node_range - 4.81% __get_vm_area_node.constprop.0 - 3.26% alloc_vmap_area - 2.52% _raw_spin_lock - 1.46% _raw_spin_lock 0.56% __alloc_pages_bulk - 4.66% kvfree - 3.25% vfree - __vfree - 3.23% __vunmap - 1.95% remove_vm_area - 1.06% free_vmap_area_noflush - 0.82% _raw_spin_lock - 0.68% _raw_spin_lock - 0.92% _raw_spin_lock - 1.40% kfree - 1.36% __free_pages - 1.35% __free_pages_ok - 1.02% _raw_spin_lock_irqsave
It's worth noting that over 50% of the CPU time spent allocating these shadow buffers is now spent on spinlocks. So the shadow buffer allocation overhead is greatly reduced by getting rid of direct reclaim from kmalloc, and could probably be made even less costly if vmalloc() didn't use global spinlocks to protect its structures.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Allison Henderson allison.henderson@oracle.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log_cil.c | 46 +++++++++++++++++++++++++++++++++----------- 1 file changed, 35 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index ec9bef3670f2..c5118801218b 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -102,6 +102,39 @@ xlog_cil_iovec_space( sizeof(uint64_t)); }
+/* + * shadow buffers can be large, so we need to use kvmalloc() here to ensure + * success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts to fall + * back to vmalloc, so we can't actually do anything useful with gfp flags to + * control the kmalloc() behaviour within kvmalloc(). Hence kmalloc() will do + * direct reclaim and compaction in the slow path, both of which are + * horrendously expensive. We just want kmalloc to fail fast and fall back to + * vmalloc if it can't get something straight away from the free lists or buddy + * allocator. Hence we have to open code kvmalloc ourselves here. + * + * Also, we are in memalloc_nofs_save task context here, so despite the use of + * GFP_KERNEL here, we are actually going to be doing GFP_NOFS allocations. This + * is actually the only way to make vmalloc() do GFP_NOFS allocations, so let's + * just all pretend this is a GFP_KERNEL context operation.... + */ +static inline void * +xlog_cil_kvmalloc( + size_t buf_size) +{ + gfp_t flags = GFP_KERNEL; + void *p; + + flags &= ~__GFP_DIRECT_RECLAIM; + flags |= __GFP_NOWARN | __GFP_NORETRY; + do { + p = kmalloc(buf_size, flags); + if (!p) + p = vmalloc(buf_size); + } while (!p); + + return p; +} + /* * Allocate or pin log vector buffers for CIL insertion. * @@ -203,25 +236,16 @@ xlog_cil_alloc_shadow_bufs( */ if (!lip->li_lv_shadow || buf_size > lip->li_lv_shadow->lv_size) { - /* * We free and allocate here as a realloc would copy - * unnecessary data. We don't use kmem_zalloc() for the + * unnecessary data. We don't use kvzalloc() for the * same reason - we don't need to zero the data area in * the buffer, only the log vector header and the iovec * storage. */ kmem_free(lip->li_lv_shadow); + lv = xlog_cil_kvmalloc(buf_size);
- /* - * We are in transaction context, which means this - * allocation will pick up GFP_NOFS from the - * memalloc_nofs_save/restore context the transaction - * holds. This means we can use GFP_KERNEL here so the - * generic kvmalloc() code will run vmalloc on - * contiguous page allocation failure as we require. - */ - lv = kvmalloc(buf_size, GFP_KERNEL); memset(lv, 0, xlog_cil_iovec_space(niovecs));
lv->lv_item = lip;
From: Yang Xu xuyang2018.jy@fujitsu.com
stable inclusion from stable-v5.10.128 commit 1e76bd4c67224a645558314c0097d5b5a338bba9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5PBNO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit a1de97fe296c52eafc6590a3506f4bbd44ecb19a upstream.
When testing xfstests xfs/126 on the latest upstream kernel, it hangs on some machines. By adding a getxattr operation after the xattr has been corrupted, I can reproduce it 100% of the time.
The deadlock as below: [983.923403] task:setfattr state:D stack: 0 pid:17639 ppid: 14687 flags:0x00000080 [ 983.923405] Call Trace: [ 983.923410] __schedule+0x2c4/0x700 [ 983.923412] schedule+0x37/0xa0 [ 983.923414] schedule_timeout+0x274/0x300 [ 983.923416] __down+0x9b/0xf0 [ 983.923451] ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs] [ 983.923453] down+0x3b/0x50 [ 983.923471] xfs_buf_lock+0x33/0xf0 [xfs] [ 983.923490] xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs] [ 983.923508] xfs_buf_get_map+0x4c/0x320 [xfs] [ 983.923525] xfs_buf_read_map+0x53/0x310 [xfs] [ 983.923541] ? xfs_da_read_buf+0xcf/0x120 [xfs] [ 983.923560] xfs_trans_read_buf_map+0x1cf/0x360 [xfs] [ 983.923575] ? xfs_da_read_buf+0xcf/0x120 [xfs] [ 983.923590] xfs_da_read_buf+0xcf/0x120 [xfs] [ 983.923606] xfs_da3_node_read+0x1f/0x40 [xfs] [ 983.923621] xfs_da3_node_lookup_int+0x69/0x4a0 [xfs] [ 983.923624] ? kmem_cache_alloc+0x12e/0x270 [ 983.923637] xfs_attr_node_hasname+0x6e/0xa0 [xfs] [ 983.923651] xfs_has_attr+0x6e/0xd0 [xfs] [ 983.923664] xfs_attr_set+0x273/0x320 [xfs] [ 983.923683] xfs_xattr_set+0x87/0xd0 [xfs] [ 983.923686] __vfs_removexattr+0x4d/0x60 [ 983.923688] __vfs_removexattr_locked+0xac/0x130 [ 983.923689] vfs_removexattr+0x4e/0xf0 [ 983.923690] removexattr+0x4d/0x80 [ 983.923693] ? __check_object_size+0xa8/0x16b [ 983.923695] ? strncpy_from_user+0x47/0x1a0 [ 983.923696] ? getname_flags+0x6a/0x1e0 [ 983.923697] ? _cond_resched+0x15/0x30 [ 983.923699] ? __sb_start_write+0x1e/0x70 [ 983.923700] ? mnt_want_write+0x28/0x50 [ 983.923701] path_removexattr+0x9b/0xb0 [ 983.923702] __x64_sys_removexattr+0x17/0x20 [ 983.923704] do_syscall_64+0x5b/0x1a0 [ 983.923705] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 983.923707] RIP: 0033:0x7f080f10ee1b
When getxattr calls the xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in xfs_attr_node_hasname because we have used blocktrash to randomize the metadata in xfs/126. xfs_attr_node_hasname then frees the state internally, so xfs_attr_node_get never releases the xfs_buf held on the da state path.

A subsequent removexattr then hangs trying to lock that buffer.
This bug was introduced by kernel commit 07120f1abdff ("xfs: Add xfs_has_attr and subroutines"). It added the xfs_attr_node_hasname helper and stated that the caller would be responsible for freeing the state in this case. However, when xfs_da3_node_lookup_int fails, xfs_attr_node_hasname frees the state itself instead of leaving that to the caller.

Fix this bug by moving the freeing of the state into the caller.
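With that contract restored, the caller can release the buffers still held on the da state path before freeing the state. An abbreviated sketch of the caller-side pattern this enables, modelled on the xfs_attr_node_get() path in this backport (details elided):

static int
xfs_attr_node_get(
	struct xfs_da_args	*args)
{
	struct xfs_da_state	*state;
	struct xfs_da_state_blk	*blk;
	int			i;
	int			error;

	error = xfs_attr_node_hasname(args, &state);
	if (error != -EEXIST)
		goto out_release;

	/* get the value, local or remote */
	blk = &state->path.blk[state->path.active - 1];
	error = xfs_attr3_leaf_getvalue(blk->bp, args);

out_release:
	/*
	 * *statep is now valid even when the lookup failed, so the
	 * buffers still held on the da state path can be released
	 * before the state itself is freed.
	 */
	for (i = 0; state != NULL && i < state->path.active; i++) {
		xfs_trans_brelse(args->trans, state->path.blk[i].bp);
		state->path.blk[i].bp = NULL;
	}
	if (state)
		xfs_da_state_free(state);
	return error;
}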
[amir: this text from original commit is not relevant for 5.10 backport: Also, use "goto error/out" instead of returning error directly in xfs_attr_node_addname_find_attr and xfs_attr_node_removename_setup function because we should free state ourselves. ]
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Signed-off-by: Yang Xu xuyang2018.jy@fujitsu.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Chen Jiahao chenjiahao16@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_attr.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 60cf08e0cb5e..909980802f19 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -864,21 +864,18 @@ xfs_attr_node_hasname(
state = xfs_da_state_alloc(args); if (statep != NULL) - *statep = NULL; + *statep = state;
/* * Search to see if name exists, and get back a pointer to it. */ error = xfs_da3_node_lookup_int(state, &retval); - if (error) { - xfs_da_state_free(state); - return error; - } + if (error) + retval = error;
- if (statep != NULL) - *statep = state; - else + if (!statep) xfs_da_state_free(state); + return retval; }
From: Dave Chinner dchinner@redhat.com
stable inclusion from stable-v5.10.128 commit 6b734f7b7071859f582b5acb95abb97e1276a030 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5PBNO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 09654ed8a18cfd45027a67d6cbca45c9ea54feab upstream.
Got a report that a repeated crash test of a container host would eventually fail with a log recovery error preventing the system from mounting the root filesystem. It manifested as a directory leaf node corruption on writeback like so:
XFS (loop0): Mounting V5 Filesystem XFS (loop0): Starting recovery (logdev: internal) XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b ........=....... 00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc .......X...).... 00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23 ..x..~J}.S...G.# 00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00 .........C...... 00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a ................ 00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50 .5y....0.......P 00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4 .@.......A...... 00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c .b.......P!A.... XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed
Tracing indicated that we were recovering changes from a transaction at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57. That is, log recovery was overwriting a buffer with newer changes on disk than was in the transaction. Tracing indicated that we were hitting the "recovery immediately" case in xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the buffer.
The code was extracting the LSN correctly, then ignoring it because the UUID in the buffer did not match the superblock UUID. The problem arises because the UUID check uses the wrong UUID - it should be checking the sb_meta_uuid, not sb_uuid. This filesystem has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the correct matching sb_meta_uuid in it, it's just the code checked it against the wrong superblock uuid.
There is no corruption in the filesystem, and failing to recover the buffer due to a write verifier failure means the recovery bug did not propagate the corruption to disk. Hence there is no corruption before or after this bug has manifested; the impact is limited simply to an unmountable filesystem....
This was missed back in 2015 during an audit of incorrect sb_uuid usage that resulted in commit fcfbe2c4ef42 ("xfs: log recovery needs to validate against sb_meta_uuid") that fixed the magic32 buffers to validate against sb_meta_uuid instead of sb_uuid. It missed the magicda buffers....
Fixes: ce748eaa65f2 ("xfs: create new metadata UUID field and incompat flag") Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Chen Jiahao chenjiahao16@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_buf_item_recover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index 4775485b4062..aa4d45701de5 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -816,7 +816,7 @@ xlog_recover_get_buf_lsn( }
if (lsn != (xfs_lsn_t)-1) { - if (!uuid_equal(&mp->m_sb.sb_uuid, uuid)) + if (!uuid_equal(&mp->m_sb.sb_meta_uuid, uuid)) goto recover_immediately; return lsn; }
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-v5.19-rc2 commit 04a98a036cf8b810dda172a9dcfcbd783bf63655 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In the procedure of recovering the AGI unlinked lists, if something bad happens to one of the unlinked inodes in the bucket list, we call xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list, not just the unlinked inodes after the bad one. If we have already added some inodes to the gc workqueue before the bad inode in the list, we can get the error below when freeing those inodes, and finally fail to complete the log recovery procedure.
XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file fs/xfs/xfs_inode.c. Caller xfs_ifree+0xb0/0x360 [xfs]
The problem is that xlog_recover_clear_agi_bucket() clears the bucket list, so the gc worker fails the agino check in xfs_verify_agino(). Fix this by flushing the inodegc workqueue before clearing the bucket.
Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Dave Chinner david@fromorbit.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log_recover.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 83afe5bc0872..88b48aed446a 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2715,6 +2715,7 @@ xlog_recover_process_one_iunlink( * Call xlog_recover_clear_agi_bucket() to perform a transaction to * clear the inode pointer in the bucket. */ + xfs_inodegc_flush(mp); xlog_recover_clear_agi_bucket(mp, agno, bucket); return NULLAGINO; }
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit b9b7e1dc56c5ca8d6fc37c410b054e9f26737d2e category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
XFS does not check for possible overflow of per-inode extent counter fields when adding extents to either data or attr fork.
For example: 1. Insert 5 million xattrs (each having a value size of 255 bytes) and then delete 50% of them in an alternating manner.
2. On a 4k block sized XFS filesystem instance, the above causes 98511 extents to be created in the attr fork of the inode.
xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
3. The incore inode fork extent counter is a signed 32-bit quantity. However the on-disk extent counter is an unsigned 16-bit quantity and hence cannot hold 98511 extents.
4. The following incorrect value is stored in the attr extent counter, # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 core.naextents = -32561
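The incorrect value follows directly from truncating the incore count to 16 bits. A minimal user-space sketch of the arithmetic (illustration only, not kernel code; the variable names mirror the fields above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int32_t if_nextents = 98511;	/* incore: signed 32-bit counter */
	int16_t naextents = (uint16_t)if_nextents; /* on-disk: 16 bits */

	/*
	 * 98511 mod 65536 = 32975, and reinterpreting 32975 as a signed
	 * 16-bit value gives 32975 - 65536 = -32561, matching the
	 * xfs_db output above.
	 */
	printf("%d\n", naextents);	/* prints -32561 */
	return 0;
}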
This commit adds a new helper function (i.e. xfs_iext_count_may_overflow()) to check for overflow of the per-inode data and xattr extent counters. Future patches will use this function to make sure that an FS operation won't cause the extent counter to overflow.
Suggested-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Allison Henderson allison.henderson@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_fork.c | 23 +++++++++++++++++++++++ fs/xfs/libxfs/xfs_inode_fork.h | 2 ++ 2 files changed, 25 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c index 7575de5cecb1..8d48716547e5 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.c +++ b/fs/xfs/libxfs/xfs_inode_fork.c @@ -23,6 +23,7 @@ #include "xfs_da_btree.h" #include "xfs_dir2_priv.h" #include "xfs_attr_leaf.h" +#include "xfs_types.h"
kmem_zone_t *xfs_ifork_zone;
@@ -728,3 +729,25 @@ xfs_ifork_verify_local_attr(
return 0; } + +int +xfs_iext_count_may_overflow( + struct xfs_inode *ip, + int whichfork, + int nr_to_add) +{ + struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); + uint64_t max_exts; + uint64_t nr_exts; + + if (whichfork == XFS_COW_FORK) + return 0; + + max_exts = (whichfork == XFS_ATTR_FORK) ? MAXAEXTNUM : MAXEXTNUM; + + nr_exts = ifp->if_nextents + nr_to_add; + if (nr_exts < ifp->if_nextents || nr_exts > max_exts) + return -EFBIG; + + return 0; +} diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index a4953e95c4f3..0beb8e2a00be 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -172,5 +172,7 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
int xfs_ifork_verify_local_data(struct xfs_inode *ip); int xfs_ifork_verify_local_attr(struct xfs_inode *ip); +int xfs_iext_count_may_overflow(struct xfs_inode *ip, int whichfork, + int nr_to_add);
#endif /* __XFS_INODE_FORK_H__ */
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 727e1acd297cae15449607d6e2ee39c71216cf1a category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When adding a new data extent (without modifying an inode's existing extents) the extent count increases only by 1. This commit checks for extent count overflow in such cases.
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Conflict: commit 3de4eb106fcc ("xfs: allow reservation of rtblocks with xfs_trans_alloc_inode") is backported already, which introduces some conflicts in the code context. Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 6 ++++++ fs/xfs/libxfs/xfs_inode_fork.h | 6 ++++++ fs/xfs/xfs_bmap_item.c | 7 +++++++ fs/xfs/xfs_bmap_util.c | 4 ++++ fs/xfs/xfs_dquot.c | 8 +++++++- fs/xfs/xfs_iomap.c | 5 +++++ fs/xfs/xfs_rtalloc.c | 5 +++++ 7 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index e6bb7b928b38..07596edbfb38 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -4516,6 +4516,12 @@ xfs_bmapi_convert_delalloc( return error;
xfs_ilock(ip, XFS_ILOCK_EXCL); + + error = xfs_iext_count_may_overflow(ip, whichfork, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + goto out_trans_cancel; + xfs_trans_ijoin(tp, ip, 0);
if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &bma.icur, &bma.got) || diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index 0beb8e2a00be..7fc2b129a2e7 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -34,6 +34,12 @@ struct xfs_ifork { #define XFS_IFEXTENTS 0x02 /* All extent pointers are read in */ #define XFS_IFBROOT 0x04 /* i_broot points to the bmap b-tree root */
+/* + * Worst-case increase in the fork extent count when we're adding a single + * extent to a fork and there's no possibility of splitting an existing mapping. + */ +#define XFS_IEXT_ADD_NOSPLIT_CNT (1) + /* * Fork handling. */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 984bb480f177..adc7507947bf 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -497,6 +497,13 @@ xfs_bui_item_recover( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0);
+ if (bui_type == XFS_BMAP_MAP) { + error = xfs_iext_count_may_overflow(ip, whichfork, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + goto err_cancel; + } + count = bmap->me_len; error = xfs_trans_log_finish_bmap_update(tp, budp, bui_type, ip, whichfork, bmap->me_startoff, bmap->me_startblock, diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index d0113dec5d32..c727ac76d03b 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -804,6 +804,10 @@ xfs_alloc_file_space( if (error) break;
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + goto error;
error = xfs_bmapi_write(tp, ip, startoffset_fsb, allocatesize_fsb, alloc_type, 0, imapp, diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 3c31bd97b590..23366951bf95 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -314,8 +314,14 @@ xfs_dquot_disk_alloc( return -ESRCH; }
- /* Create the block mapping. */ xfs_trans_ijoin(tp, quotip, XFS_ILOCK_EXCL); + + error = xfs_iext_count_may_overflow(quotip, XFS_DATA_FORK, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + return error; + + /* Create the block mapping. */ error = xfs_bmapi_write(tp, quotip, dqp->q_fileoffset, XFS_DQUOT_CLUSTER_SIZE_FSB, XFS_BMAPI_METADATA, 0, &map, &nmaps); diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 3b362416ddb0..ff0b092d0c37 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -241,6 +241,11 @@ xfs_iomap_write_direct( if (error) return error;
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + goto out_trans_cancel; + /* * From this point onwards we overwrite the imap pointer that the * caller gave to us. diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c index 35aa62625bf3..8a150acecba4 100644 --- a/fs/xfs/xfs_rtalloc.c +++ b/fs/xfs/xfs_rtalloc.c @@ -804,6 +804,11 @@ xfs_growfs_rt_alloc( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + goto out_trans_cancel; + /* * Allocate blocks to the bitmap file. */
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 85ef08b5a667615bc7be5058259753dc42a7adcd category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The extent mapping the file offset at which a hole has to be inserted will be split into two extents causing extent count to increase by 1.
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Conflict: commit 3a1af6c317d0 ("xfs: refactor common transaction/inode/quota allocation idiom") is backported, which introduces some conflicts in the code context. Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_fork.h | 7 +++++++ fs/xfs/xfs_bmap_item.c | 15 +++++++++------ fs/xfs/xfs_bmap_util.c | 10 ++++++++++ 3 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index 7fc2b129a2e7..bcac769a7df6 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -40,6 +40,13 @@ struct xfs_ifork { */ #define XFS_IEXT_ADD_NOSPLIT_CNT (1)
+/* + * Punching out an extent from the middle of an existing extent can cause the + * extent count to increase by 1. + * i.e. | Old extent | Hole | Old extent | + */ +#define XFS_IEXT_PUNCH_HOLE_CNT (1) + /* * Fork handling. */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index adc7507947bf..44ec0f2d5253 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -439,6 +439,7 @@ xfs_bui_item_recover( xfs_exntst_t state; unsigned int bui_type; int whichfork; + int iext_delta; int error = 0;
/* Only one mapping operation per BUI... */ @@ -497,12 +498,14 @@ xfs_bui_item_recover( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0);
- if (bui_type == XFS_BMAP_MAP) { - error = xfs_iext_count_may_overflow(ip, whichfork, - XFS_IEXT_ADD_NOSPLIT_CNT); - if (error) - goto err_cancel; - } + if (bui_type == XFS_BMAP_MAP) + iext_delta = XFS_IEXT_ADD_NOSPLIT_CNT; + else + iext_delta = XFS_IEXT_PUNCH_HOLE_CNT; + + error = xfs_iext_count_may_overflow(ip, whichfork, iext_delta); + if (error) + goto err_cancel;
count = bmap->me_len; error = xfs_trans_log_finish_bmap_update(tp, budp, bui_type, ip, diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index c727ac76d03b..202d6af3e503 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -859,6 +859,11 @@ xfs_unmap_extent( if (error) return error;
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_PUNCH_HOLE_CNT); + if (error) + goto out_trans_cancel; + error = xfs_bunmapi(tp, ip, startoffset_fsb, len_fsb, 0, 2, done); if (error) goto out_trans_cancel; @@ -1136,6 +1141,11 @@ xfs_insert_file_space( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0);
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_PUNCH_HOLE_CNT); + if (error) + goto out_trans_cancel; + /* * The extent shifting code works on extent granularity. So, if stop_fsb * is not the starting block of extent, we need to split the extent at
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit f5d92749191402c50e32ac83dd9da3b910f5680f category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Directory entry addition can cause the following, 1. Data block can be added/removed. A new extent can cause extent count to increase by 1. 2. Free disk block can be added/removed. Same behaviour as described above for Data block. 3. Dabtree blocks. XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these can be new extents. Hence extent count can increase by XFS_DA_NODE_MAXDEPTH.
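As a rough worked example (assuming XFS_DA_NODE_MAXDEPTH is 5 and a directory block size equal to one 4k filesystem block, i.e. m_dir_geo->fsbcount == 1), the macro added below reserves

	(XFS_DA_NODE_MAXDEPTH + 1 + 1) * fsbcount = (5 + 1 + 1) * 1 = 7

extents per directory entry addition: one possible new data block extent, one possible new free block extent, and up to XFS_DA_NODE_MAXDEPTH new dabtree block extents.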
Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Conflict: commit 3a1af6c317d0 ("xfs: refactor common transaction/inode/quota allocation idiom") is backported, which introduces some conflicts in the code context. Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_fork.h | 13 +++++++++++++ fs/xfs/xfs_inode.c | 10 ++++++++++ fs/xfs/xfs_symlink.c | 5 +++++ 3 files changed, 28 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index bcac769a7df6..ea1a9dd8a763 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -47,6 +47,19 @@ struct xfs_ifork { */ #define XFS_IEXT_PUNCH_HOLE_CNT (1)
+/* + * Directory entry addition can cause the following, + * 1. Data block can be added/removed. + * A new extent can cause extent count to increase by 1. + * 2. Free disk block can be added/removed. + * Same behaviour as described above for Data block. + * 3. Dabtree blocks. + * XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these can be new + * extents. Hence extent count can increase by XFS_DA_NODE_MAXDEPTH. + */ +#define XFS_IEXT_DIR_MANIP_CNT(mp) \ + ((XFS_DA_NODE_MAXDEPTH + 1 + 1) * (mp)->m_dir_geo->fsbcount) + /* * Fork handling. */ diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index ba4bde9d5fcb..31c1f8f951a0 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1171,6 +1171,11 @@ xfs_create( xfs_ilock(dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT); unlock_dp_on_error = true;
+ error = xfs_iext_count_may_overflow(dp, XFS_DATA_FORK, + XFS_IEXT_DIR_MANIP_CNT(mp)); + if (error) + goto out_trans_cancel; + /* * A newly created regular or special file just has one directory * entry pointing to them, but a directory also the "." entry @@ -1383,6 +1388,11 @@ xfs_link( xfs_trans_ijoin(tp, sip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, tdp, XFS_ILOCK_EXCL);
+ error = xfs_iext_count_may_overflow(tdp, XFS_DATA_FORK, + XFS_IEXT_DIR_MANIP_CNT(mp)); + if (error) + goto error_return; + /* * If we are using project inheritance, we only allow hard link * creation in our tree when the project IDs are the same; else diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index 8cd3ad4a0dfb..7b9cdb1a41ff 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -213,6 +213,11 @@ xfs_symlink( goto out_trans_cancel; }
+ error = xfs_iext_count_may_overflow(dp, XFS_DATA_FORK, + XFS_IEXT_DIR_MANIP_CNT(mp)); + if (error) + goto out_trans_cancel; + /* * Allocate an inode for the symlink. */
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Directory entry removal must always succeed; hence XFS does the following in a low disk space scenario: 1. Data/Free blocks linger until a future remove operation. 2. Dabtree blocks would be swapped with the last block in the leaf space and then the new last block will be unmapped.

This facility is reused in the low inode extent count scenario, i.e. this commit causes xfs_bmap_del_extent_real() to return the -ENOSPC error code so that the above-mentioned behaviour is exercised, causing no change to the directory's extent count.
Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 07596edbfb38..89ccf059d1aa 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -5139,6 +5139,24 @@ xfs_bmap_del_extent_real( /* * Deleting the middle of the extent. */ + + /* + * For directories, -ENOSPC is returned since a directory entry + * remove operation must not fail due to low extent count + * availability. -ENOSPC will be handled by higher layers of XFS + * by letting the corresponding empty Data/Free blocks to linger + * until a future remove operation. Dabtree blocks would be + * swapped with the last block in the leaf space and then the + * new last block will be unmapped. + */ + error = xfs_iext_count_may_overflow(ip, whichfork, 1); + if (error) { + ASSERT(S_ISDIR(VFS_I(ip)->i_mode) && + whichfork == XFS_DATA_FORK); + error = -ENOSPC; + goto done; + } + old = got;
got.br_blockcount = del->br_startoff - got.br_startoff;
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 02092a2f034fdeabab524ae39c2de86ba9ffa15a category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A rename operation is essentially a directory entry remove operation from the perspective of parent directory (i.e. src_dp) of rename's source. Hence the only place where we check for extent count overflow for src_dp is in xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real() returns -ENOSPC when it detects a possible extent count overflow and in response, the higher layers of directory handling code do the following: 1. Data/Free blocks: XFS lets these blocks linger until a future remove operation removes them. 2. Dabtree blocks: XFS swaps the blocks with the last block in the Leaf space and unmaps the last block.
For target_dp, there are two cases depending on whether the destination directory entry exists or not.
When destination directory entry does not exist (i.e. target_ip == NULL), extent count overflow check is performed only when transaction has a non-zero sized space reservation associated with it. With a zero-sized space reservation, XFS allows a rename operation to continue only when the directory has sufficient free space in its data/leaf/free space blocks to hold the new entry.
When destination directory entry exists (i.e. target_ip != NULL), all we need to do is change the inode number associated with the already existing entry. Hence there is no need to perform an extent count overflow check.
Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 3 +++ fs/xfs/xfs_inode.c | 44 +++++++++++++++++++++++++++++++++++++++- 2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 89ccf059d1aa..97dbb8af9fa0 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -5148,6 +5148,9 @@ xfs_bmap_del_extent_real( * until a future remove operation. Dabtree blocks would be * swapped with the last block in the leaf space and then the * new last block will be unmapped. + * + * The above logic also applies to the source directory entry of + * a rename operation. */ error = xfs_iext_count_may_overflow(ip, whichfork, 1); if (error) { diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 31c1f8f951a0..c9cf34a4fee8 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3310,6 +3310,35 @@ xfs_rename( /* * Check for expected errors before we dirty the transaction * so we can return an error without a transaction abort. + * + * Extent count overflow check: + * + * From the perspective of src_dp, a rename operation is essentially a + * directory entry remove operation. Hence the only place where we check + * for extent count overflow for src_dp is in + * xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real() returns + * -ENOSPC when it detects a possible extent count overflow and in + * response, the higher layers of directory handling code do the + * following: + * 1. Data/Free blocks: XFS lets these blocks linger until a + * future remove operation removes them. + * 2. Dabtree blocks: XFS swaps the blocks with the last block in the + * Leaf space and unmaps the last block. + * + * For target_dp, there are two cases depending on whether the + * destination directory entry exists or not. + * + * When destination directory entry does not exist (i.e. target_ip == + * NULL), extent count overflow check is performed only when transaction + * has a non-zero sized space reservation associated with it. With a + * zero-sized space reservation, XFS allows a rename operation to + * continue only when the directory has sufficient free space in its + * data/leaf/free space blocks to hold the new entry. + * + * When destination directory entry exists (i.e. target_ip != NULL), all + * we need to do is change the inode number associated with the already + * existing entry. Hence there is no need to perform an extent count + * overflow check. */ if (target_ip == NULL) { /* @@ -3320,6 +3349,12 @@ xfs_rename( error = xfs_dir_canenter(tp, target_dp, target_name); if (error) goto out_trans_cancel; + } else { + error = xfs_iext_count_may_overflow(target_dp, + XFS_DATA_FORK, + XFS_IEXT_DIR_MANIP_CNT(mp)); + if (error) + goto out_trans_cancel; } } else { /* @@ -3485,9 +3520,16 @@ xfs_rename( if (wip) { error = xfs_dir_replace(tp, src_dp, src_name, wip->i_ino, spaceres); - } else + } else { + /* + * NOTE: We don't need to check for extent count overflow here + * because the dir remove name code will leave the dir block in + * place if the extent count would overflow. + */ error = xfs_dir_removename(tp, src_dp, src_name, src_ip->i_ino, spaceres); + } + if (error) goto out_trans_cancel;
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 3a19bb147c72d2e9b77137bf5130b9cfb50a5eef category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Adding/removing an xattr can cause XFS_DA_NODE_MAXDEPTH extents to be added. One extra extent for dabtree in case a local attr is large enough to cause a double split. It can also cause extent count to increase proportional to the size of a remote xattr's value.
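Under the same assumption that XFS_DA_NODE_MAXDEPTH is 5, the XFS_IEXT_ATTR_MANIP_CNT() macro added below works out to

	XFS_DA_NODE_MAXDEPTH + max(1, rmt_blks) = 5 + max(1, 0) = 6

extents for a local attr (rmt_blks == 0, the extra 1 covering the dabtree double split case), and to 5 + rmt_blks extents when the value is remote and occupies rmt_blks blocks.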
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Conflict: commit 3a1af6c317d0 ("xfs: refactor common transaction/inode/quota allocation idiom") is backported, which introduces some conflicts in the code context. Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_attr.c | 12 ++++++++++++ fs/xfs/libxfs/xfs_inode_fork.h | 10 ++++++++++ 2 files changed, 22 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 909980802f19..5ce192d2a426 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -396,6 +396,7 @@ xfs_attr_set( struct xfs_trans_res tres; bool rsvd = (args->attr_filter & XFS_ATTR_ROOT); int error, local; + int rmt_blks = 0; unsigned int total;
if (XFS_FORCED_SHUTDOWN(dp->i_mount)) @@ -442,11 +443,15 @@ xfs_attr_set( tres.tr_logcount = XFS_ATTRSET_LOG_COUNT; tres.tr_logflags = XFS_TRANS_PERM_LOG_RES; total = args->total; + + if (!local) + rmt_blks = xfs_attr3_rmt_blocks(mp, args->valuelen); } else { XFS_STATS_INC(mp, xs_attr_remove);
tres = M_RES(mp)->tr_attrrm; total = XFS_ATTRRM_SPACE_RES(mp); + rmt_blks = xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX); }
/* @@ -457,6 +462,13 @@ xfs_attr_set( if (error) return error;
+ if (args->value || xfs_inode_hasattr(dp)) { + error = xfs_iext_count_may_overflow(dp, XFS_ATTR_FORK, + XFS_IEXT_ATTR_MANIP_CNT(rmt_blks)); + if (error) + goto out_trans_cancel; + } + if (args->value) { error = xfs_has_attr(args); if (error == -EEXIST && (args->attr_flags & XATTR_CREATE)) diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index ea1a9dd8a763..8d89838e23f8 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -60,6 +60,16 @@ struct xfs_ifork { #define XFS_IEXT_DIR_MANIP_CNT(mp) \ ((XFS_DA_NODE_MAXDEPTH + 1 + 1) * (mp)->m_dir_geo->fsbcount)
+/* + * Adding/removing an xattr can cause XFS_DA_NODE_MAXDEPTH extents to + * be added. One extra extent for dabtree in case a local attr is + * large enough to cause a double split. It can also cause extent + * count to increase proportional to the size of a remote xattr's + * value. + */ +#define XFS_IEXT_ATTR_MANIP_CNT(rmt_blks) \ + (XFS_DA_NODE_MAXDEPTH + max(1, rmt_blks)) + /* * Fork handling. */
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit c442f3086d5a108b7ff086c8ade1923a8f389db5 category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A write to a sub-interval of an existing unwritten extent causes the original extent to be split into 3 extents i.e. | Unwritten | Real | Unwritten | Hence extent count can increase by 2.
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Conflict: commit 3a1af6c317d0 ("xfs: refactor common transaction/inode/quota allocation idiom") is backported, which introduces some conflicts in the code context. Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_fork.h | 9 +++++++++ fs/xfs/xfs_iomap.c | 4 ++++ 2 files changed, 13 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index 8d89838e23f8..917e289ad962 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -70,6 +70,15 @@ struct xfs_ifork { #define XFS_IEXT_ATTR_MANIP_CNT(rmt_blks) \ (XFS_DA_NODE_MAXDEPTH + max(1, rmt_blks))
+/* + * A write to a sub-interval of an existing unwritten extent causes the original + * extent to be split into 3 extents + * i.e. | Unwritten | Real | Unwritten | + * Hence extent count can increase by 2. + */ +#define XFS_IEXT_WRITE_UNWRITTEN_CNT (2) + + /* * Fork handling. */ diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index ff0b092d0c37..cf22dad509e1 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -545,6 +545,10 @@ xfs_iomap_write_unwritten( if (error) return error;
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_WRITE_UNWRITTEN_CNT); + if (error) + goto error_on_bmapi_transaction;
/* * Modify the unwritten extent state of the buffer.
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 5f1d5bbfb2e674052a9fe542f53678978af20770 category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Moving an extent to data fork can cause a sub-interval of an existing extent to be unmapped. This will increase extent count by 1. Mapping in the new extent can increase the extent count by 1 again i.e. | Old extent | New extent | Old extent | Hence number of extents increases by 2.
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_fork.h | 9 +++++++++ fs/xfs/xfs_reflink.c | 5 +++++ 2 files changed, 14 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index 917e289ad962..c8f279edc5c1 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -79,6 +79,15 @@ struct xfs_ifork { #define XFS_IEXT_WRITE_UNWRITTEN_CNT (2)
+/* + * Moving an extent to data fork can cause a sub-interval of an existing extent + * to be unmapped. This will increase extent count by 1. Mapping in the new + * extent can increase the extent count by 1 again i.e. + * | Old extent | New extent | Old extent | + * Hence number of extents increases by 2. + */ +#define XFS_IEXT_REFLINK_END_COW_CNT (2) + /* * Fork handling. */ diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 6c8492b16c7b..b85b249df989 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -615,6 +615,11 @@ xfs_reflink_end_cow_extent( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0);
+ error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, + XFS_IEXT_REFLINK_END_COW_CNT); + if (error) + goto out_cancel; + /* * In case of racing, overlapping AIO writes no COW extents might be * left by the time I/O completes for the loser of the race. In that
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit ee898d78c3540b44270a5fdffe208d7bbb219d93 category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Remapping an extent involves unmapping the existing extent and mapping in the new extent. When unmapping, an extent containing the entire unmap range can be split into two extents, i.e. | Old extent | hole | Old extent | Hence extent count increases by 1.
Mapping in the new extent into the destination file can increase the extent count by 1.
Reviewed-by: Allison Henderson allison.henderson@oracle.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_reflink.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index b85b249df989..5fc128bc7939 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -997,6 +997,7 @@ xfs_reflink_remap_extent( bool quota_reserved = true; bool smap_real; bool dmap_written = xfs_bmap_is_written_extent(dmap); + int iext_delta = 0; int nimaps; int error;
@@ -1107,6 +1108,16 @@ xfs_reflink_remap_extent( goto out_cancel; }
+ if (smap_real) + ++iext_delta; + + if (dmap_written) + ++iext_delta; + + error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, iext_delta); + if (error) + goto out_cancel; + if (smap_real) { /* * If the extent we're unmapping is backed by storage (written
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit bcc561f21f115437a010307420fc43d91be91c66 category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Removing an initial range of source/donor file's extent and adding a new extent (from donor/source file) in its place will cause extent count to increase by 1.
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_inode_fork.h | 7 +++++++ fs/xfs/xfs_bmap_util.c | 16 ++++++++++++++++ 2 files changed, 23 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index c8f279edc5c1..9e2137cd7372 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -88,6 +88,13 @@ struct xfs_ifork { */ #define XFS_IEXT_REFLINK_END_COW_CNT (2)
+/* + * Removing an initial range of source/donor file's extent and adding a new + * extent (from donor/source file) in its place will cause extent count to + * increase by 1. + */ +#define XFS_IEXT_SWAP_RMAP_CNT (1) + /* * Fork handling. */ diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 202d6af3e503..df004890c2a3 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1367,6 +1367,22 @@ xfs_swap_extent_rmap( irec.br_blockcount); trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
+ if (xfs_bmap_is_real_extent(&uirec)) { + error = xfs_iext_count_may_overflow(ip, + XFS_DATA_FORK, + XFS_IEXT_SWAP_RMAP_CNT); + if (error) + goto out; + } + + if (xfs_bmap_is_real_extent(&irec)) { + error = xfs_iext_count_may_overflow(tip, + XFS_DATA_FORK, + XFS_IEXT_SWAP_RMAP_CNT); + if (error) + goto out; + } + /* Remove the mapping from the donor file. */ xfs_bmap_unmap_extent(tp, tip, &uirec);
From: Chandan Babu R chandanrlinux@gmail.com
mainline inclusion from mainline-v5.12-rc1 commit 5147ef30f2cd128c9eedf7a697e8cb2ce2767989 category: bugfix bugzilla: 187510,https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
With dax enabled filesystems, a direct write operation into an existing unwritten extent results in xfs_iomap_write_direct() zero-ing and converting the extent into a normal extent before the actual data is copied from the userspace buffer.
The inode extent count can increase by 2 if the extent range being written to maps to the middle of the existing unwritten extent range. Hence this commit uses XFS_IEXT_WRITE_UNWRITTEN_CNT as the extent count delta when such a write operation is being performed.
Fixes: 727e1acd297c ("xfs: Check for extent overflow when trivally adding a new extent") Reported-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_iomap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index cf22dad509e1..31c553a49241 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -198,6 +198,7 @@ xfs_iomap_write_direct( bool force = false; int error; int bmapi_flags = XFS_BMAPI_PREALLOC; + int nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
ASSERT(count_fsb > 0);
@@ -232,6 +233,7 @@ xfs_iomap_write_direct( bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO; if (imap->br_state == XFS_EXT_UNWRITTEN) { force = true; + nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT; dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1; } } @@ -241,8 +243,7 @@ xfs_iomap_write_direct( if (error) return error;
- error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, - XFS_IEXT_ADD_NOSPLIT_CNT); + error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, nr_exts); if (error) goto out_trans_cancel;
From: Guo Xuenan guoxuenan@huawei.com
Offering: HULK hulk inclusion category: bugfix bugzilla: 186943,https://gitee.com/openeuler/kernel/issues/I4KIAO
--------------------------------
For a leaf dir, in most cases there should be as many bestfree slots as there are dir data blocks that can fit under i_size (except for [1]).

The root cause is that we do not examine the number of bestfree slots. When there are fewer slots than dir data blocks and we need to allocate a new dir data block and update the bestfree array, we use the dir block number as the index into the bestfree array without checking the leaf buffer boundary, which may cause a UAF or other out-of-bounds memory access. This issue can also be triggered with test case xfs/473 from fstests.

Considering the special case [1], only add a boundary check on the bestfree array to avoid the UAF or slab-out-of-bounds access.
[1] https://lore.kernel.org/all/163961697197.3129691.1911552605195534271.stgit@m...
Simplify the testcase xfs/473 with the commands below:

DEV=/dev/sdb
MP=/mnt/sdb
WORKDIR=/mnt/sdb/341
# 1. mkfs: create a new xfs image
mkfs.xfs -f ${DEV}
mount ${DEV} ${MP}
mkdir -p ${WORKDIR}
# 2. create a leaf dir with 341 entries, file name length 8
for i in `seq 1 341`
do
	touch ${WORKDIR}/$(printf "%08d" ${i})
done
inode=$(ls -i ${MP} | cut -d' ' -f1)
umount ${MP}
# 3. use xfs_db to set bestcount to 0
xfs_db -x ${DEV} -c "inode ${inode}" -c "dblock 8388608" \
	-c "write ltail.bestcount 0"
mount ${DEV} ${MP}
# 4. touch new files to reproduce
touch ${WORKDIR}/{1..100}.txt
The error log is shown as follows:
================================================================== BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0 Write of size 2 at addr ffff88810168b000 by task touch/1552 CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4d/0x66 print_report.cold+0xf6/0x691 kasan_report+0xa8/0x120 xfs_dir2_leaf_addname+0x1995/0x1ac0 xfs_dir_createname+0x58c/0x7f0 xfs_create+0x7af/0x1010 xfs_generic_create+0x270/0x5e0 path_openat+0x270b/0x3450 do_filp_open+0x1cf/0x2b0 do_sys_openat2+0x46b/0x7a0 do_sys_open+0xb7/0x130 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe4d9e9312b Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941 R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000 </TASK>
The buggy address belongs to the physical page: page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10168b flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff) raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000 raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^ ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint 00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78 XDD3[S5........x XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file fs/xfs/libxfs/xfs_dir2_data.c. Caller xfs_dir2_data_use_free+0x28a/0xeb0 CPU: 5 PID: 1552 Comm: touch Tainted: G B 6.0.0-rc3+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4d/0x66 xfs_corruption_error+0x132/0x150 xfs_dir2_data_use_free+0x198/0xeb0 xfs_dir2_leaf_addname+0xa59/0x1ac0 xfs_dir_createname+0x58c/0x7f0 xfs_create+0x7af/0x1010 xfs_generic_create+0x270/0x5e0 path_openat+0x270b/0x3450 do_filp_open+0x1cf/0x2b0 do_sys_openat2+0x46b/0x7a0 do_sys_open+0xb7/0x130 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe4d9e9312b Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941 R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000 </TASK> XFS (sdb): Corruption detected. Unmount and run xfs_repair
Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_dir2_leaf.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c index 95d2a3f92d75..bd1b2559e165 100644 --- a/fs/xfs/libxfs/xfs_dir2_leaf.c +++ b/fs/xfs/libxfs/xfs_dir2_leaf.c @@ -815,6 +815,18 @@ xfs_dir2_leaf_addname( */ else xfs_dir3_leaf_log_bests(args, lbp, use_block, use_block); + /* + * An abnormal corner case, bestfree count less than data + * blocks, add a condition to avoid UAF or slab-out-of bound. + */ + if ((char *)(&bestsp[use_block]) >= (char *)ltp) { + xfs_trans_brelse(tp, lbp); + if (tp->t_flags & XFS_TRANS_DIRTY) + xfs_force_shutdown(tp->t_mountp, + SHUTDOWN_CORRUPT_INCORE); + return -EFSCORRUPTED; + } + hdr = dbp->b_addr; bf = xfs_dir2_data_bestfree_p(dp->i_mount, hdr); bestsp[use_block] = bf[0].length;
From: "Darrick J. Wong" darrick.wong@oracle.com
mainline inclusion from mainline-v5.10-rc5 commit da531cc46ef16301b1bc5bc74acbaacc628904f5 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
xfs_iget can return -ENOENT for a file that the inobt thinks is allocated but has zeroed mode. This currently causes scrub to exit with an operational error instead of flagging this as a corruption. The end result is that scrub mistakenly reports the ENOENT to the user instead of "directory parent pointer corrupt" like we do for EINVAL.
Fixes: 5927268f5a04 ("xfs: flag inode corruption if parent ptr doesn't get us a real inode") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/scrub/parent.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index 855aa8bcab64..66c35f6dfc24 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -164,13 +164,13 @@ xchk_parent_validate( * can't use DONTCACHE here because DONTCACHE inodes can trigger * immediate inactive cleanup of the inode. * - * If _iget returns -EINVAL then the parent inode number is garbage - * and the directory is corrupt. If the _iget returns -EFSCORRUPTED - * or -EFSBADCRC then the parent is corrupt which is a cross - * referencing error. Any other error is an operational error. + * If _iget returns -EINVAL or -ENOENT then the parent inode number is + * garbage and the directory is corrupt. If the _iget returns + * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a + * cross referencing error. Any other error is an operational error. */ error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp); - if (error == -EINVAL) { + if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); goto out;
From: Christoph Hellwig hch@lst.de
mainline inclusion from mainline-v5.11-rc4 commit f50b8f475a2c70ae8309c16b6d4ecb305a4aa9d6 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Add a helper to factor out the nowait locking logic for the read/write helpers.
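For context, the IOCB_NOWAIT flag this helper honors is set when userspace issues I/O with RWF_NOWAIT. A minimal userspace sketch of how a caller reaches this nowait path (the file path is hypothetical, error handling trimmed):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/mnt/xfs/file", O_RDONLY);	/* hypothetical path */

	if (fd < 0)
		return 1;
	/* RWF_NOWAIT becomes IOCB_NOWAIT in the kernel: if servicing the
	 * read would have to sleep (e.g. to take the inode iolock), the
	 * call fails with EAGAIN instead of blocking. */
	if (preadv2(fd, &iov, 1, 0, RWF_NOWAIT) < 0 && errno == EAGAIN)
		fprintf(stderr, "would block; retry from a sleepable context\n");
	return 0;
}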
Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Conflicts: fs/xfs/xfs_file.c Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_file.c | 55 +++++++++++++++++++++++++---------------------- 1 file changed, 29 insertions(+), 26 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 80adec66744b..ebc1de5fb2d7 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -224,6 +224,23 @@ xfs_file_fsync( return error; }
+static int +xfs_ilock_iocb( + struct kiocb *iocb, + unsigned int lock_mode) +{ + struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); + + if (iocb->ki_flags & IOCB_NOWAIT) { + if (!xfs_ilock_nowait(ip, lock_mode)) + return -EAGAIN; + } else { + xfs_ilock(ip, lock_mode); + } + + return 0; +} + STATIC ssize_t xfs_file_dio_aio_read( struct kiocb *iocb, @@ -240,12 +257,9 @@ xfs_file_dio_aio_read(
file_accessed(iocb->ki_filp);
- if (iocb->ki_flags & IOCB_NOWAIT) { - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) - return -EAGAIN; - } else { - xfs_ilock(ip, XFS_IOLOCK_SHARED); - } + ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); + if (ret) + return ret; ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, is_sync_kiocb(iocb)); xfs_iunlock(ip, XFS_IOLOCK_SHARED); @@ -267,13 +281,9 @@ xfs_file_dax_read( if (!count) return 0; /* skip atime */
- if (iocb->ki_flags & IOCB_NOWAIT) { - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) - return -EAGAIN; - } else { - xfs_ilock(ip, XFS_IOLOCK_SHARED); - } - + ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); + if (ret) + return ret; ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops); xfs_iunlock(ip, XFS_IOLOCK_SHARED);
@@ -292,12 +302,9 @@ xfs_file_buffered_aio_read( trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos); fs_file_read_do_trace(iocb);
- if (iocb->ki_flags & IOCB_NOWAIT) { - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) - return -EAGAIN; - } else { - xfs_ilock(ip, XFS_IOLOCK_SHARED); - } + ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); + if (ret) + return ret; ret = generic_file_read_iter(iocb, to); xfs_iunlock(ip, XFS_IOLOCK_SHARED);
@@ -650,13 +657,9 @@ xfs_file_dax_write( size_t count; loff_t pos;
- if (iocb->ki_flags & IOCB_NOWAIT) { - if (!xfs_ilock_nowait(ip, iolock)) - return -EAGAIN; - } else { - xfs_ilock(ip, iolock); - } - + ret = xfs_ilock_iocb(iocb, iolock); + if (ret) + return ret; ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out;
From: "Darrick J. Wong" darrick.wong@oracle.com
mainline inclusion from mainline-v5.10-rc5 commit 4b80ac64450f169bae364df631d233d66070a06e category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
It's possible that xfs_iget can return EINVAL for inodes that the inobt thinks are free, or ENOENT for inodes that look free. If this is the case, mark the directory corrupt immediately when we check ftype. Note that we already check the ftype of the '.' and '..' entries, so we can skip the iget part since we already know the inode type for '.' and we have a separate parent pointer scrubber for '..'.
Fixes: a5c46e5e8912 ("xfs: scrub directory metadata") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/scrub/dir.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index b045e95c2ea7..178b3455a170 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -66,8 +66,18 @@ xchk_dir_check_ftype( * eofblocks cleanup (which allocates what would be a nested * transaction), we can't use DONTCACHE here because DONTCACHE * inodes can trigger immediate inactive cleanup of the inode. + * + * If _iget returns -EINVAL or -ENOENT then the child inode number is + * garbage and the directory is corrupt. If the _iget returns + * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a + * cross referencing error. Any other error is an operational error. */ error = xfs_iget(mp, sdc->sc->tp, inum, 0, 0, &ip); + if (error == -EINVAL || error == -ENOENT) { + error = -EFSCORRUPTED; + xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error); + goto out; + } if (!xchk_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset, &error)) goto out; @@ -105,6 +115,7 @@ xchk_dir_actor( struct xfs_name xname; xfs_ino_t lookup_ino; xfs_dablk_t offset; + bool checked_ftype = false; int error = 0;
sdc = container_of(dir_iter, struct xchk_dir_ctx, dir_iter); @@ -133,6 +144,7 @@ xchk_dir_actor( if (xfs_sb_version_hasftype(&mp->m_sb) && type != DT_DIR) xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); + checked_ftype = true; if (ino != ip->i_ino) xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); @@ -144,6 +156,7 @@ xchk_dir_actor( if (xfs_sb_version_hasftype(&mp->m_sb) && type != DT_DIR) xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); + checked_ftype = true; if (ip->i_ino == mp->m_sb.sb_rootino && ino != ip->i_ino) xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); @@ -167,9 +180,11 @@ xchk_dir_actor( }
/* Verify the file type. This function absorbs error codes. */ - error = xchk_dir_check_ftype(sdc, offset, lookup_ino, type); - if (error) - goto out; + if (!checked_ftype) { + error = xchk_dir_check_ftype(sdc, offset, lookup_ino, type); + if (error) + goto out; + } out: /* * A negative error code returned here is supposed to cause the
From: Christoph Hellwig hch@lst.de
mainline inclusion from mainline-v5.11-rc4 commit 354be7e3b2baf32e63c0599cc131d393591ba299 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Ensure we don't block on the iolock, or wait for I/O completion in xfs_file_aio_write_checks, if the caller asked to avoid that.
Fixes: 29a5d29ec181 ("xfs: nowait aio support") Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_file.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index ebc1de5fb2d7..d2451de87006 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -363,7 +363,14 @@ xfs_file_aio_write_checks( if (error <= 0) return error;
- error = xfs_break_layouts(inode, iolock, BREAK_WRITE); + if (iocb->ki_flags & IOCB_NOWAIT) { + error = break_layout(inode, false); + if (error == -EWOULDBLOCK) + error = -EAGAIN; + } else { + error = xfs_break_layouts(inode, iolock, BREAK_WRITE); + } + if (error) return error;
@@ -374,7 +381,11 @@ xfs_file_aio_write_checks( if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) { xfs_iunlock(ip, *iolock); *iolock = XFS_IOLOCK_EXCL; - xfs_ilock(ip, *iolock); + error = xfs_ilock_iocb(iocb, *iolock); + if (error) { + *iolock = 0; + return error; + } goto restart; }
@@ -405,6 +416,10 @@ xfs_file_aio_write_checks( isize = i_size_read(inode); if (iocb->ki_pos > isize) { spin_unlock(&ip->i_flags_lock); + + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + if (!drained_dio) { if (*iolock == XFS_IOLOCK_SHARED) { xfs_iunlock(ip, *iolock); @@ -635,7 +650,8 @@ xfs_file_dio_aio_write( &xfs_dio_write_ops, is_sync_kiocb(iocb) || unaligned_io); out: - xfs_iunlock(ip, iolock); + if (iolock) + xfs_iunlock(ip, iolock);
/* * No fallback to buffered IO after short writes for XFS, direct I/O @@ -674,7 +690,8 @@ xfs_file_dax_write( error = xfs_setfilesize(ip, pos, ret); } out: - xfs_iunlock(ip, iolock); + if (iolock) + xfs_iunlock(ip, iolock); if (error) return error;
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.11-rc4 commit 89e0eb8c13bb842e224b27d7e071262cd84717cb category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In commit 9669f51de5c0 I tried to get rid of the undocumented cow gc lifetime knob. The knob's function was never documented and it now doesn't really have a function since eof and cow gc have been consolidated.
Regrettably, xfs/231 relies on it and regresses on for-next. I did not succeed at getting far enough through fstests patch review for the fixup to land in time.
Restore the sysctl knob, document what it did (does?), put it on the deprecation schedule, and rip out a redundant function.
Fixes: 9669f51de5c0 ("xfs: consolidate the eofblocks and cowblocks workers") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/xfs.rst | 16 ++++++++------ fs/xfs/xfs_sysctl.c | 35 +++++++++++++------------------ 2 files changed, 24 insertions(+), 27 deletions(-)
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index 4b9096a8ee5e..82cc349e3aac 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -284,6 +284,9 @@ The following sysctls are available for the XFS filesystem: removes unused preallocation from clean inodes and releases the unused space back to the free pool.
+ fs.xfs.speculative_cow_prealloc_lifetime + This is an alias for speculative_prealloc_lifetime. + fs.xfs.error_level (Min: 0 Default: 3 Max: 11) A volume knob for error reporting when internal errors occur. This will generate detailed messages & backtraces for filesystem @@ -356,12 +359,13 @@ The following sysctls are available for the XFS filesystem: Deprecated Sysctls ==================
-=========================== ================ - Name Removal Schedule -=========================== ================ -fs.xfs.irix_sgid_inherit September 2025 -fs.xfs.irix_symlink_mode September 2025 -=========================== ================ +=========================================== ================ + Name Removal Schedule +=========================================== ================ +fs.xfs.irix_sgid_inherit September 2025 +fs.xfs.irix_symlink_mode September 2025 +fs.xfs.speculative_cow_prealloc_lifetime September 2025 +=========================================== ================
Removed Sysctls diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c index 145e06c47744..546a6cd96729 100644 --- a/fs/xfs/xfs_sysctl.c +++ b/fs/xfs/xfs_sysctl.c @@ -51,7 +51,7 @@ xfs_panic_mask_proc_handler( #endif /* CONFIG_PROC_FS */
STATIC int -xfs_deprecate_irix_sgid_inherit_proc_handler( +xfs_deprecated_dointvec_minmax( struct ctl_table *ctl, int write, void *buffer, @@ -59,24 +59,8 @@ xfs_deprecate_irix_sgid_inherit_proc_handler( loff_t *ppos) { if (write) { - printk_once(KERN_WARNING - "XFS: " "%s sysctl option is deprecated.\n", - ctl->procname); - } - return proc_dointvec_minmax(ctl, write, buffer, lenp, ppos); -} - -STATIC int -xfs_deprecate_irix_symlink_mode_proc_handler( - struct ctl_table *ctl, - int write, - void *buffer, - size_t *lenp, - loff_t *ppos) -{ - if (write) { - printk_once(KERN_WARNING - "XFS: " "%s sysctl option is deprecated.\n", + printk_ratelimited(KERN_WARNING + "XFS: %s sysctl option is deprecated.\n", ctl->procname); } return proc_dointvec_minmax(ctl, write, buffer, lenp, ppos); @@ -88,7 +72,7 @@ static struct ctl_table xfs_table[] = { .data = &xfs_params.sgid_inherit.val, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = xfs_deprecate_irix_sgid_inherit_proc_handler, + .proc_handler = xfs_deprecated_dointvec_minmax, .extra1 = &xfs_params.sgid_inherit.min, .extra2 = &xfs_params.sgid_inherit.max }, @@ -97,7 +81,7 @@ static struct ctl_table xfs_table[] = { .data = &xfs_params.symlink_mode.val, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = xfs_deprecate_irix_symlink_mode_proc_handler, + .proc_handler = xfs_deprecated_dointvec_minmax, .extra1 = &xfs_params.symlink_mode.min, .extra2 = &xfs_params.symlink_mode.max }, @@ -201,6 +185,15 @@ static struct ctl_table xfs_table[] = { .extra1 = &xfs_params.blockgc_timer.min, .extra2 = &xfs_params.blockgc_timer.max, }, + { + .procname = "speculative_cow_prealloc_lifetime", + .data = &xfs_params.blockgc_timer.val, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = xfs_deprecated_dointvec_minmax, + .extra1 = &xfs_params.blockgc_timer.min, + .extra2 = &xfs_params.blockgc_timer.max, + }, /* please keep this the last entry */ #ifdef CONFIG_PROC_FS {
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.12-rc4 commit 9de4b514494a3b49fa708186c0dc4611f1fe549c category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If scrub observes cross-referencing errors while scanning a data structure, mark the data structure sick. There's /something/ inconsistent, even if we can't really tell what it is.
Fixes: 4860a05d2475 ("xfs: scrub/repair should update filesystem metadata health") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/scrub/health.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c index 83d27cdf579b..3de59b5c2ce6 100644 --- a/fs/xfs/scrub/health.c +++ b/fs/xfs/scrub/health.c @@ -133,7 +133,8 @@ xchk_update_health( if (!sc->sick_mask) return;
- bad = (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT); + bad = (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | + XFS_SCRUB_OFLAG_XCORRUPT)); switch (type_to_health_flag[sc->sm->sm_type].group) { case XHG_AG: pag = xfs_perag_get(sc->mp, sc->sm->sm_agno);
From: Christoph Hellwig hch@lst.de
mainline inclusion from mainline-v5.16-rc2 commit 1090427bf18f9835b3ccbd36edf43f2509444e27 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
With the removal of xfs_dqrele_all_inodes, xfs_inew_wait and all the infrastructure used to wake the XFS_INEW bit waitqueue are unused.
Reported-by: kernel test robot lkp@intel.com Fixes: 777eb1fa857e ("xfs: remove xfs_dqrele_all_inodes") Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Conflicts: fs/xfs/xfs_inode.h Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_icache.c | 21 --------------------- fs/xfs/xfs_inode.h | 4 +--- 2 files changed, 1 insertion(+), 24 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 93e24f85eb8d..2039423df384 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -296,22 +296,6 @@ xfs_perag_clear_inode_tag( trace_xfs_perag_clear_inode_tag(mp, pag->pag_agno, tag, _RET_IP_); }
-static inline void -xfs_inew_wait( - struct xfs_inode *ip) -{ - wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_INEW_BIT); - DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT); - - do { - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); - if (!xfs_iflags_test(ip, XFS_INEW)) - break; - schedule(); - } while (true); - finish_wait(wq, &wait.wq_entry); -} - /* * When we recycle a reclaimable inode, we need to re-initialise the VFS inode * part of the structure. This is made more complex by the fact we store @@ -379,18 +363,13 @@ xfs_iget_recycle( error = xfs_reinit_inode(mp, inode); xfs_iunlock(ip, XFS_ILOCK_EXCL); if (error) { - bool wake; - /* * Re-initializing the inode failed, and we are in deep * trouble. Try to re-add it to the reclaim list. */ rcu_read_lock(); spin_lock(&ip->i_flags_lock); - wake = !!__xfs_iflags_test(ip, XFS_INEW); ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM); - if (wake) - wake_up_bit(&ip->i_flags, __XFS_INEW_BIT); ASSERT(ip->i_flags & XFS_IRECLAIMABLE); spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 57141daff28e..ba0f57dd5392 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -221,8 +221,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip) #define XFS_IRECLAIM (1 << 0) /* started reclaiming this inode */ #define XFS_ISTALE (1 << 1) /* inode has been staled */ #define XFS_IRECLAIMABLE (1 << 2) /* inode can be reclaimed */ -#define __XFS_INEW_BIT 3 /* inode has just been allocated */ -#define XFS_INEW (1 << __XFS_INEW_BIT) +#define XFS_INEW (1 << 3) /* inode has just been allocated */ #define XFS_ITRUNCATED (1 << 5) /* truncated down so flush-on-close */ #define XFS_IDIRTY_RELEASE (1 << 6) /* dirty release already seen */ #define XFS_IFLUSHING (1 << 7) /* inode is being flushed */ @@ -477,7 +476,6 @@ static inline void xfs_finish_inode_setup(struct xfs_inode *ip) xfs_iflags_clear(ip, XFS_INEW); barrier(); unlock_new_inode(VFS_I(ip)); - wake_up_bit(&ip->i_flags, __XFS_INEW_BIT); }
static inline void xfs_setup_existing_inode(struct xfs_inode *ip)
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.16-rc3 commit b97cca3ba9098522e5a1c3388764ead42640c1a5 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In commit 02b9984d6408, we pushed a sync_filesystem() call from the VFS into xfs_fs_remount. The only time that we ever need to push dirty file data or metadata to disk for a remount is if we're remounting the filesystem read only, so this really could be moved to xfs_remount_ro.
Once we've moved the call site, actually check the return value from sync_filesystem.
Fixes: 02b9984d6408 ("fs: push sync_filesystem() down to the file system's remount_fs()") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_super.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index bf65e2e50ab7..bc4ea6f13dad 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1774,6 +1774,11 @@ xfs_remount_ro( }; int error;
+ /* Flush all the dirty data to disk. */ + error = sync_filesystem(mp->m_super); + if (error) + return error; + /* * Cancel background eofb scanning so it cannot race with the final * log force+buftarg wait and deadlock the remount. @@ -1853,8 +1858,6 @@ xfs_fc_reconfigure( if (error) return error;
- sync_filesystem(mp->m_super); - /* inode32 -> inode64 */ if ((mp->m_flags & XFS_MOUNT_SMALL_INUMS) && !(new_mp->m_flags & XFS_MOUNT_SMALL_INUMS)) {
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.16-rc3 commit eba0549bc7d100691c13384b774346b8aa9cf9a9 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are a few places where we test the current process' capability set to decide if we're going to be more or less generous with resource acquisition for a system call. If the process doesn't have the capability, we can continue the call, albeit in a degraded mode.
These are /not/ the actual security decisions, so it's not proper to use capable(), which (in certain selinux setups) causes audit messages to get logged. Switch them to has_capability_noaudit.
Fixes: 7317a03df703f ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom") Fixes: ea9a46e1c4925 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN") Signed-off-by: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner david@fromorbit.com Reviewed-by: Ondrej Mosnacek omosnace@redhat.com Acked-by: Serge Hallyn serge@hallyn.com Reviewed-by: Eric Sandeen sandeen@redhat.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com
Conflicts: fs/xfs/xfs_fsmap.c Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_fsmap.c | 4 ++-- fs/xfs/xfs_ioctl.c | 2 +- fs/xfs/xfs_iops.c | 2 +- kernel/capability.c | 1 + 4 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c index 2d98d8cfae44..bb678a57ec9d 100644 --- a/fs/xfs/xfs_fsmap.c +++ b/fs/xfs/xfs_fsmap.c @@ -848,8 +848,8 @@ xfs_getfsmap( !xfs_getfsmap_is_valid_device(mp, &head->fmh_keys[1])) return -EINVAL;
- use_rmap = capable(CAP_SYS_ADMIN) && - xfs_sb_version_hasrmapbt(&mp->m_sb); + use_rmap = xfs_sb_version_hasrmapbt(&mp->m_sb) && + has_capability_noaudit(current, CAP_SYS_ADMIN); head->fmh_entries = 0;
/* Set up our device handlers. */ diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index bf05525ba88c..87299bab516c 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1290,7 +1290,7 @@ xfs_ioctl_setattr_get_trans( goto out_error;
error = xfs_trans_alloc_ichange(ip, NULL, NULL, pdqp, - capable(CAP_FOWNER), &tp); + has_capability_noaudit(current, CAP_FOWNER), &tp); if (error) goto out_error;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 51f84c0e9417..2bcd5b4c7b73 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -704,7 +704,7 @@ xfs_setattr_nonsize( }
error = xfs_trans_alloc_ichange(ip, udqp, gdqp, NULL, - capable(CAP_FOWNER), &tp); + has_capability_noaudit(current, CAP_FOWNER), &tp); if (error) goto out_dqrele;
diff --git a/kernel/capability.c b/kernel/capability.c index de7eac903a2a..c5e1871a0ea7 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -360,6 +360,7 @@ bool has_capability_noaudit(struct task_struct *t, int cap) { return has_ns_capability_noaudit(t, &init_user_ns, cap); } +EXPORT_SYMBOL(has_capability_noaudit);
static bool ns_capable_common(struct user_namespace *ns, int cap,
From: Dave Chinner dchinner@redhat.com
mainline inclusion from mainline-v5.16-rc3 commit 70447e0ad9781f84e60e0990888bd8c84987f44e category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When the AIL tries to flush the CIL, it relies on the CIL push ending up on stable storage without having to wait for and manipulate iclog state directly. However, if there is already a pending CIL push when the AIL tries to flush the CIL, it won't set the cil->xc_push_commit_stable flag and so the CIL push will not actively flush the commit record iclog.
generic/530 when run on a single CPU test VM can trigger this fairly reliably. This test exercises unlinked inode recovery, and can result in inodes being pinned in memory by ongoing modifications to the inode cluster buffer to record unlinked list modifications. As a result, the first inode unlinked in a buffer can pin the tail of the log whilst the inode cluster buffer is pinned by the current checkpoint that has been pushed but isn't on stable storage because the cil->xc_push_commit_stable flag was not set. This results in the log/AIL effectively deadlocking until something triggers the commit record iclog to be pushed to stable storage (i.e. the periodic log worker calling xfs_log_force()).
The fix is two-fold - first we should always set the cil->xc_push_commit_stable when xlog_cil_flush() is called, regardless of whether there is already a pending push or not.
Second, if the CIL is empty, we should trigger an iclog flush to ensure that the iclogs of the last checkpoint have actually been submitted to disk as that checkpoint may not have been run under stable completion constraints.
Reported-and-tested-by: Matthew Wilcox willy@infradead.org Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing") Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index c5118801218b..c2fa07909ea6 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -1243,18 +1243,27 @@ xlog_cil_push_now( if (!async) flush_workqueue(cil->xc_push_wq);
+ spin_lock(&cil->xc_push_lock); + + /* + * If this is an async flush request, we always need to set the + * xc_push_commit_stable flag even if something else has already queued + * a push. The flush caller is asking for the CIL to be on stable + * storage when the next push completes, so regardless of who has queued + * the push, the flush requires stable semantics from it. + */ + cil->xc_push_commit_stable = async; + /* * If the CIL is empty or we've already pushed the sequence then - * there's no work we need to do. + * there's no more work that we need to do. */ - spin_lock(&cil->xc_push_lock); if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { spin_unlock(&cil->xc_push_lock); return; }
cil->xc_push_seq = push_seq; - cil->xc_push_commit_stable = async; queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work); spin_unlock(&cil->xc_push_lock); } @@ -1352,6 +1361,13 @@ xlog_cil_flush(
trace_xfs_log_force(log->l_mp, seq, _RET_IP_); xlog_cil_push_now(log, seq, true); + + /* + * If the CIL is empty, make sure that any previous checkpoint that may + * still be in an active iclog is pushed to stable storage. + */ + if (list_empty(&log->l_cilp->xc_cil)) + xfs_log_force(log->l_mp, 0); }
/*
From: Dave Chinner dchinner@redhat.com
mainline inclusion from mainline-v5.16-rc3 commit cd6f79d1fb324968a3bae92f82eeb7d28ca1fd22 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Brian reported a null pointer dereference failure during unmount in xfs/006. He tracked the problem down to the AIL being torn down before a log shutdown had completed and removed all the items from the AIL. The failure occurred in this path while unmount was proceeding in another task:
 xfs_trans_ail_delete+0x102/0x130 [xfs]
 xfs_buf_item_done+0x22/0x30 [xfs]
 xfs_buf_ioend+0x73/0x4d0 [xfs]
 xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
 xlog_cil_committed+0x2a9/0x300 [xfs]
 xlog_cil_process_committed+0x69/0x80 [xfs]
 xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
 xlog_force_shutdown+0xdf/0x150 [xfs]
 xfs_do_force_shutdown+0x5f/0x150 [xfs]
 xlog_ioend_work+0x71/0x80 [xfs]
 process_one_work+0x1c5/0x390
 worker_thread+0x30/0x350
 kthread+0xd7/0x100
 ret_from_fork+0x1f/0x30
This is processing an EIO error to a log write, and it's triggering a force shutdown. This causes the log to be shut down, and then it is running attached iclog callbacks from the shutdown context. That means the fs and log has already been marked as xfs_is_shutdown/xlog_is_shutdown and so high level code will abort (e.g. xfs_trans_commit(), xfs_log_force(), etc) with an error because of shutdown.
The umount would have been blocked waiting for a log force completion inside xfs_log_cover() -> xfs_sync_sb(). For this situation to occur, xfs_sync_sb() must first exit without waiting for the iclog buffer to be committed to disk. The above trace is the completion routine for that iclog buffer, and it is shutting down the filesystem.
xlog_state_shutdown_callbacks() does this:
{
	struct xlog_in_core	*iclog;
	LIST_HEAD(cb_list);

	spin_lock(&log->l_icloglock);
	iclog = log->l_iclog;
	do {
		if (atomic_read(&iclog->ic_refcnt)) {
			/* Reference holder will re-run iclog callbacks. */
			continue;
		}
		list_splice_init(&iclog->ic_callbacks, &cb_list);
		wake_up_all(&iclog->ic_write_wait);
		wake_up_all(&iclog->ic_force_wait);
	} while ((iclog = iclog->ic_next) != log->l_iclog);

	wake_up_all(&log->l_flush_wait);
	spin_unlock(&log->l_icloglock);

	xlog_cil_process_committed(&cb_list);
}
This wakes any thread waiting on IO completion of the iclog (in this case the umount log force) before shutdown processes all the pending callbacks. That means the xfs_sync_sb() waiting on a sync transaction in xfs_log_force() on iclog->ic_force_wait will get woken before the callbacks attached to that iclog are run. This results in xfs_sync_sb() returning an error, and so unmount unblocks and continues to run whilst the log shutdown is still in progress.
Normally this is just fine because the force waiter has nothing to do with AIL operations. But in the case of this unmount path, the log force waiter goes on to tear down the AIL because the log is now shut down and so nothing ever blocks it again from the wait point in xfs_log_cover().
Hence it's a race to see who gets to the AIL first - the unmount code or xlog_cil_process_committed() killing the superblock buffer.
To fix this, we just have to change the order of processing in xlog_state_shutdown_callbacks() to run the callbacks before it wakes any task waiting on completion of the iclog.
Reported-by: Brian Foster bfoster@redhat.com Fixes: aad7272a9208 ("xfs: separate out log shutdown callback processing") Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 91e8640a4711..367aa7be2cad 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -483,7 +483,10 @@ xfs_log_reserve( * Run all the pending iclog callbacks and wake log force waiters and iclog * space waiters so they can process the newly set shutdown state. We really * don't care what order we process callbacks here because the log is shut down - * and so state cannot change on disk anymore. + * and so state cannot change on disk anymore. However, we cannot wake waiters + * until the callbacks have been processed because we may be in unmount and + * we must ensure that all AIL operations the callbacks perform have completed + * before we tear down the AIL. * * We avoid processing actively referenced iclogs so that we don't run callbacks * while the iclog owner might still be preparing the iclog for IO submssion. @@ -497,7 +500,6 @@ xlog_state_shutdown_callbacks( struct xlog_in_core *iclog; LIST_HEAD(cb_list);
- spin_lock(&log->l_icloglock); iclog = log->l_iclog; do { if (atomic_read(&iclog->ic_refcnt)) { @@ -505,14 +507,16 @@ xlog_state_shutdown_callbacks( continue; } list_splice_init(&iclog->ic_callbacks, &cb_list); + spin_unlock(&log->l_icloglock); + + xlog_cil_process_committed(&cb_list); + + spin_lock(&log->l_icloglock); wake_up_all(&iclog->ic_write_wait); wake_up_all(&iclog->ic_force_wait); } while ((iclog = iclog->ic_next) != log->l_iclog);
wake_up_all(&log->l_flush_wait); - spin_unlock(&log->l_icloglock); - - xlog_cil_process_committed(&cb_list); }
/* @@ -556,11 +560,8 @@ xlog_state_release_iclog( * pending iclog callbacks that were waiting on the release of * this iclog. */ - if (last_ref) { - spin_unlock(&log->l_icloglock); + if (last_ref) xlog_state_shutdown_callbacks(log); - spin_lock(&log->l_icloglock); - } return -EIO; }
@@ -3769,7 +3770,10 @@ xlog_force_shutdown( wake_up_all(&log->l_cilp->xc_start_wait); wake_up_all(&log->l_cilp->xc_commit_wait); spin_unlock(&log->l_cilp->xc_push_lock); + + spin_lock(&log->l_icloglock); xlog_state_shutdown_callbacks(log); + spin_unlock(&log->l_icloglock);
return log_error; }
From: Eric Sandeen sandeen@redhat.com
mainline inclusion from mainline-v5.16-rc3 commit bc37e4fb5cac2925b2e286b1f1d4fc2b519f7d92 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This reverts commit 4b8628d57b725b32616965e66975fcdebe008fe7.
XFS quota has had the concept of a "quota warning limit" since the earliest Irix implementation, but a mechanism for incrementing the warning counter was never implemented, as documented in the xfs_quota(8) man page. We do know from the historical archive that it was never incremented at runtime during quota reservation operations.
With this commit, the warning counter quickly increments for every allocation attempt after the user has crossed a quota soft limit threshold, and this in turn transitions the user to hard quota failures, rendering soft quota thresholds and timers useless. This was reported as a regression by users.
Because the intended behavior of this warning counter has never been understood or documented, and the result of this change is a regression in soft quota functionality, revert this commit to make soft quota limits and timers operable again.
Fixes: 4b8628d57b72 ("xfs: actually bump warning counts when we send warnings") Signed-off-by: Eric Sandeen sandeen@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Dave Chinner david@fromorbit.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_trans_dquot.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c index 3d7386650cfe..d9b861581833 100644 --- a/fs/xfs/xfs_trans_dquot.c +++ b/fs/xfs/xfs_trans_dquot.c @@ -615,7 +615,6 @@ xfs_dqresv_check( return QUOTA_NL_ISOFTLONGWARN; }
- res->warnings++; return QUOTA_NL_ISOFTWARN; }
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.16-rc3 commit 86d40f1e49e9a909d25c35ba01bea80dbcd758cb category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
xfs/434 and xfs/436 have been reporting occasional memory leaks of xfs_dquot objects. These tests themselves were the messenger, not the culprit, since they unload the xfs module, which trips the slub debugging code while tearing down all the xfs slab caches:
=============================================================================
BUG xfs_dquot (Tainted: G W ): Objects remaining in xfs_dquot on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

Slab 0xffffea000606de00 objects=30 used=5 fp=0xffff888181b78a78 flags=0x17ff80000010200(slab|head|node=0|zone=2|lastcpupid=0xfff)
CPU: 0 PID: 3953166 Comm: modprobe Tainted: G W 5.18.0-rc6-djwx #rc6 d5824be9e46a2393677bda868f9b154d917ca6a7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Since we don't generally rmmod the xfs module between fstests, this means that xfs/434 is really just the canary in the coal mine -- something leaked a dquot, but we don't know who. After days of pounding on fstests with kmemleak enabled, I finally got it to spit this out:
unreferenced object 0xffff8880465654c0 (size 536):
  comm "u10:4", pid 88, jiffies 4294935810 (age 29.512s)
  hex dump (first 32 bytes):
    60 4a 56 46 80 88 ff ff 58 ea e4 5c 80 88 ff ff  `JVF....X......
    00 e0 52 49 80 88 ff ff 01 00 01 00 00 00 00 00  ..RI............
  backtrace:
    [<ffffffffa0740f6c>] xfs_dquot_alloc+0x2c/0x530 [xfs]
    [<ffffffffa07443df>] xfs_qm_dqread+0x6f/0x330 [xfs]
    [<ffffffffa07462a2>] xfs_qm_dqget+0x132/0x4e0 [xfs]
    [<ffffffffa0756bb0>] xfs_qm_quotacheck_dqadjust+0xa0/0x3e0 [xfs]
    [<ffffffffa075724d>] xfs_qm_dqusage_adjust+0x35d/0x4f0 [xfs]
    [<ffffffffa06c9068>] xfs_iwalk_ag_recs+0x348/0x5d0 [xfs]
    [<ffffffffa06c95d3>] xfs_iwalk_run_callbacks+0x273/0x540 [xfs]
    [<ffffffffa06c9e8d>] xfs_iwalk_ag+0x5ed/0x890 [xfs]
    [<ffffffffa06ca22f>] xfs_iwalk_ag_work+0xff/0x170 [xfs]
    [<ffffffffa06d22c9>] xfs_pwork_work+0x79/0x130 [xfs]
    [<ffffffff81170bb2>] process_one_work+0x672/0x1040
    [<ffffffff81171b1b>] worker_thread+0x59b/0xec0
    [<ffffffff8118711e>] kthread+0x29e/0x340
    [<ffffffff810032bf>] ret_from_fork+0x1f/0x30
Now we know that quotacheck is at fault, but even this report was canaryish -- it was triggered by xfs/494, which doesn't actually mount any filesystems. (kmemleak can be a little slow to notice leaks, even with fstests repeatedly whacking it to look for them.) Looking at the *previous* fstest, however, showed that the test run before xfs/494 was xfs/117. The tipoff to the problem is in this excerpt from dmesg:
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0xdb/0x7b0 [xfs], inode 0x119 dinode
XFS (sda4): Unmount and run xfs_repair
XFS (sda4): First 128 bytes of corrupted metadata buffer:
00000000: 49 4e 81 a4 03 02 00 00 00 00 00 00 00 00 00 00  IN..............
00000010: 00 00 00 01 00 00 00 00 00 90 57 54 54 1a 4c 68  ..........WTT.Lh
00000020: 81 f9 7d e1 6d ee 16 00 34 bd 7d e1 6d ee 16 00  ..}.m...4.}.m...
00000030: 34 bd 7d e1 6d ee 16 00 00 00 00 00 00 00 00 00  4.}.m...........
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 02 00 00 00 00 00 00 00 00 96 80 f3 ab  ................
00000060: ff ff ff ff da 57 7b 11 00 00 00 00 00 00 00 03  .....W{.........
00000070: 00 00 00 01 00 00 00 10 00 00 00 00 00 00 00 08  ................
XFS (sda4): Quotacheck: Unsuccessful (Error -117): Disabling quotas.
The dinode verifier decided that the inode was corrupt, which causes iget to return with EFSCORRUPTED. Since this happened during quotacheck, it is obvious that the kernel aborted the inode walk on account of the corruption error and disabled quotas. Unfortunately, we neglect to purge the dquot cache before doing that, which is how the dquots leaked.
The problems started 10 years ago in commit b84a3a, when the dquot lists were converted to a radix tree, but the error handling behavior was not correctly preserved -- in that commit, if the bulkstat failed and usrquota was enabled, the bulkstat failure code would be overwritten by the result of flushing all the dquots to disk. As long as that succeeds, we'd continue the quota mount as if everything were ok, but instead we're now operating with a corrupt inode and incorrect quota usage counts. I didn't notice this bug in 2019 when I wrote commit ebd126a, which changed quotacheck to skip the dqflush when the scan doesn't complete due to inode walk failures.
Introduced-by: b84a3a96751f ("xfs: remove the per-filesystem list of dquots") Fixes: ebd126a651f8 ("xfs: convert quotacheck to use the new iwalk functions") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Dave Chinner david@fromorbit.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_qm.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 926d262c1236..0a6c8cc8e997 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -1312,8 +1312,15 @@ xfs_qm_quotacheck(
error = xfs_iwalk_threaded(mp, 0, 0, xfs_qm_dqusage_adjust, 0, true, NULL); - if (error) + if (error) { + /* + * The inode walk may have partially populated the dquot + * caches. We must purge them before disabling quota and + * tearing down the quotainfo, or else the dquots will leak. + */ + xfs_qm_dqpurge_all(mp); goto error_return; + }
/* * We've made all the changes that we need to make incore. Flush them
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.16-rc3 commit 7561cea5dbb97fecb952548a0fb74fb105bf4664 category: bugfix bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
KASAN reported the following use after free bug when running generic/475:
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
XFS (dm-0): Ending recovery (logdev: internal)
Buffer I/O error on dev dm-0, logical block 20639616, async page read
Buffer I/O error on dev dm-0, logical block 20639617, async page read
XFS (dm-0): log I/O error -5
XFS (dm-0): Filesystem has been shut down due to log error (0x2).
XFS (dm-0): Unmounting Filesystem
XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
==================================================================
BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136

CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 print_report.cold+0x2b8/0x661
 ? do_raw_spin_lock+0x246/0x270
 kasan_report+0xab/0x120
 ? do_raw_spin_lock+0x246/0x270
 do_raw_spin_lock+0x246/0x270
 ? rwlock_bug.part.0+0x90/0x90
 xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
 xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
 process_one_work+0x672/0x1040
 worker_thread+0x59b/0xec0
 ? __kthread_parkme+0xc6/0x1f0
 ? process_one_work+0x1040/0x1040
 ? process_one_work+0x1040/0x1040
 kthread+0x29e/0x340
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>

Allocated by task 154099:
 kasan_save_stack+0x1e/0x40
 __kasan_kmalloc+0x81/0xa0
 kmem_alloc+0x8d/0x2e0 [xfs]
 xlog_cil_init+0x1f/0x540 [xfs]
 xlog_alloc_log+0xd1e/0x1260 [xfs]
 xfs_log_mount+0xba/0x640 [xfs]
 xfs_mountfs+0xf2b/0x1d00 [xfs]
 xfs_fs_fill_super+0x10af/0x1910 [xfs]
 get_tree_bdev+0x383/0x670
 vfs_get_tree+0x7d/0x240
 path_mount+0xdb7/0x1890
 __x64_sys_mount+0x1fa/0x270
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 154151:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_set_free_info+0x20/0x30
 ____kasan_slab_free+0x110/0x190
 slab_free_freelist_hook+0xab/0x180
 kfree+0xbc/0x310
 xlog_dealloc_log+0x1b/0x2b0 [xfs]
 xfs_unmountfs+0x119/0x200 [xfs]
 xfs_fs_put_super+0x6e/0x2e0 [xfs]
 generic_shutdown_super+0x12b/0x3a0
 kill_block_super+0x95/0xd0
 deactivate_locked_super+0x80/0x130
 cleanup_mnt+0x329/0x4d0
 task_work_run+0xc5/0x160
 exit_to_user_mode_prepare+0xd4/0xe0
 syscall_exit_to_user_mode+0x1d/0x40
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
This appears to be a race between the unmount process, which frees the CIL and waits for in-flight iclog IO; and the iclog IO completion. When generic/475 runs, it starts fsstress in the background, waits a few seconds, and substitutes a dm-error device to simulate a disk falling out of a machine. If the fsstress encounters EIO on a pure data write, it will exit but the filesystem will still be online.
The next thing the test does is unmount the filesystem, which tries to clean the log, free the CIL, and wait for iclog IO completion. If an iclog was being written when the dm-error switch occurred, it can race with log unmounting as follows:
Thread 1                                Thread 2

xfs_log_unmount
xfs_log_clean
xfs_log_quiesce
                                        xlog_ioend_work
                                        <observe error>
                                        xlog_force_shutdown
                                        test_and_set_bit(XLOG_IOERROR)
xfs_log_force
<log is shut down, nop>
xfs_log_umount_write
<log is shut down, nop>
xlog_dealloc_log
xlog_cil_destroy
<wait for iclogs>
                                        spin_lock(&log->l_cilp->xc_push_lock)
                                        <KABOOM>
Therefore, free the CIL after waiting for the iclogs to complete. I /think/ this race has existed for quite a few years now, though I don't remember the ~2014 era logging code well enough to know if it was a real threat then or if the actual race was exposed only more recently.
Fixes: ac983517ec59 ("xfs: don't sleep in xlog_cil_force_lsn on shutdown") Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 367aa7be2cad..681bdcbe2265 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1941,8 +1941,6 @@ xlog_dealloc_log( xlog_in_core_t *iclog, *next_iclog; int i;
- xlog_cil_destroy(log); - /* * Cycle all the iclogbuf locks to make sure all log IO completion * is done before we tear down these buffers. @@ -1954,6 +1952,13 @@ xlog_dealloc_log( iclog = iclog->ic_next; }
+ /* + * Destroy the CIL after waiting for iclog IO completion because an + * iclog EIO error will try to shut down the log, which accesses the + * CIL to wake up the waiters. + */ + xlog_cil_destroy(log); + iclog = log->l_iclog; for (i = 0; i < log->l_iclog_bufs; i++) { next_iclog = iclog->ic_next;
From: Dan Carpenter dan.carpenter@oracle.com
stable inclusion from stable-v5.10.140 commit 6a564bad3a6474a5247491d2b48637ec69d429dd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5W3GQ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 6ed6356b07714e0198be3bc3ecccc8b40a212de4 upstream.
The "bufsize" comes from the root user. If "bufsize" is negative then, because of type promotion, neither of the validation checks at the start of the function are able to catch it:
	if (bufsize < sizeof(struct xfs_attrlist) ||
	    bufsize > XFS_XATTR_LIST_MAX)
		return -EINVAL;
This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in kvmalloc_node(). Fix this by changing the type from int to size_t.
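To make the promotion concrete, here is a small standalone sketch (a stand-in struct and an assumed XFS_XATTR_LIST_MAX value, not the kernel code) showing a negative int slipping past both checks while the same bit pattern in a size_t is rejected:

#include <stdio.h>
#include <stddef.h>

#define XFS_XATTR_LIST_MAX 65536		/* value assumed for illustration */

struct xfs_attrlist { int al_count; };		/* stand-in for the real struct */

static const char *check_int(int bufsize)
{
	/* int vs size_t: bufsize converts to size_t, so -1 becomes
	 * SIZE_MAX and "< sizeof" is false; "> max" is int vs int,
	 * and -1 > 65536 is also false. */
	if (bufsize < sizeof(struct xfs_attrlist) ||
	    bufsize > XFS_XATTR_LIST_MAX)
		return "rejected";
	return "accepted";
}

static const char *check_size_t(size_t bufsize)
{
	/* now -1 arrives as SIZE_MAX and the "> max" test catches it */
	if (bufsize < sizeof(struct xfs_attrlist) ||
	    bufsize > XFS_XATTR_LIST_MAX)
		return "rejected";
	return "accepted";
}

int main(void)
{
	printf("int    bufsize = -1: %s\n", check_int(-1));		/* accepted: the bug */
	printf("size_t bufsize = -1: %s\n", check_size_t((size_t)-1));	/* rejected */
	return 0;
}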
Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Wang Hai wanghai38@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_ioctl.c | 2 +- fs/xfs/xfs_ioctl.h | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 87299bab516c..814345b6a245 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -371,7 +371,7 @@ int xfs_ioc_attr_list( struct xfs_inode *dp, void __user *ubuf, - int bufsize, + size_t bufsize, int flags, struct xfs_attrlist_cursor __user *ucursor) { diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h index bab6a5a92407..416e20de66e7 100644 --- a/fs/xfs/xfs_ioctl.h +++ b/fs/xfs/xfs_ioctl.h @@ -38,8 +38,9 @@ xfs_readlink_by_handle( int xfs_ioc_attrmulti_one(struct file *parfilp, struct inode *inode, uint32_t opcode, void __user *uname, void __user *value, uint32_t *len, uint32_t flags); -int xfs_ioc_attr_list(struct xfs_inode *dp, void __user *ubuf, int bufsize, - int flags, struct xfs_attrlist_cursor __user *ucursor); +int xfs_ioc_attr_list(struct xfs_inode *dp, void __user *ubuf, + size_t bufsize, int flags, + struct xfs_attrlist_cursor __user *ucursor);
extern struct dentry * xfs_handle_to_dentry(
From: "Darrick J. Wong" djwong@kernel.org
stable inclusion from stable-v5.10.140 commit 1b9b4139d794cf0ae51ba3dd91f009c77fab16d0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5W3GQ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 29d650f7e3ab55283b89c9f5883d0c256ce478b5 upstream.
Syzbot tripped over the following complaint from the kernel:
WARNING: CPU: 2 PID: 15402 at mm/util.c:597 kvmalloc_node+0x11e/0x125 mm/util.c:597
While trying to run XFS_IOC_GETBMAP against the following structure:
struct getbmap fubar = {
	.bmv_count = 0x22dae649,
};
Obviously, this is a crazy huge value since the next thing that the ioctl would do is allocate 37GB of memory. This is enough to make kvmalloc mad, but isn't large enough to trip the validation functions. In other words, I'm fussing with checks that were **already sufficient** because that's easier than dealing with 644 internal bug reports. Yes, that's right, six hundred and forty-four.
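The replacement check divides the limit instead of multiplying the count, so the bound itself cannot overflow. A standalone sketch (the 64-byte record size is assumed for illustration) of why the old ULONG_MAX bound never fired on 64-bit while the INT_MAX bound does:

#include <stdio.h>
#include <limits.h>
#include <stddef.h>

int main(void)
{
	unsigned int bmv_count = 0x22dae649;	/* value from the report above */
	size_t recsize = 64;			/* assumed record size */

	/* Old check: ULONG_MAX / recsize is ~2.9e17 on 64-bit, far above
	 * any 32-bit count, so this never rejects. */
	printf("old check rejects: %d\n", bmv_count > ULONG_MAX / recsize);

	/* New check: comparing against INT_MAX / recsize keeps the total
	 * allocation below INT_MAX, which is the most kvmalloc accepts
	 * without warning; dividing the limit avoids computing
	 * count * recsize, which could itself wrap. */
	printf("new check rejects: %d\n", (size_t)bmv_count >= INT_MAX / recsize);

	printf("bytes requested: %zu\n", (size_t)bmv_count * recsize);
	return 0;
}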
Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Allison Henderson allison.henderson@oracle.com Reviewed-by: Catherine Hoang catherine.hoang@oracle.com Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Wang Hai wanghai38@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 814345b6a245..f1de12df880e 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1680,7 +1680,7 @@ xfs_ioc_getbmap(
if (bmx.bmv_count < 2) return -EINVAL; - if (bmx.bmv_count > ULONG_MAX / recsize) + if (bmx.bmv_count >= INT_MAX / recsize) return -ENOMEM;
buf = kvzalloc(bmx.bmv_count * sizeof(*buf), GFP_KERNEL);
From: Long Li leo.lilong@huawei.com
Offering: HULK hulk inclusion category: bugfix bugzilla: 186982, https://gitee.com/openeuler/kernel/issues/I4KIAO
--------------------------------
When lazysbcount is enabled, fsstress combined with a loop mount/unmount test reports the following problem:
XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed
This corruption will shut down the file system, and the file system will no longer be mountable. The following script can reproduce the problem, but it may take a long time.
#!/bin/bash
device=/dev/sda
testdir=/mnt/test
round=0

function fail()
{
	echo "$*"
	exit 1
}

mkdir -p $testdir
while [ $round -lt 10000 ]
do
	echo "******* round $round ********"
	mkfs.xfs -f $device
	mount $device $testdir || fail "mount failed!"
	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
	sleep 4
	killall -w fsstress
	umount $testdir
	xfs_repair -e $device > /dev/null
	if [ $? -eq 2 ];then
		echo "ERR CODE 2: Dirty log exception during repair."
		exit 1
	fi
	round=$(($round+1))
done
With lazysbcount enabled, there is no additional lock protection for reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies m_ifree at the same time, m_ifree may become greater than m_icount. For example, consider the following sequence, where ifreedelta is positive:
 CPU0                                 CPU1
 xfs_log_sb                           xfs_trans_unreserve_and_mod_sb
 ----------                           ------------------------------
 percpu_counter_sum(&mp->m_icount)
                                      percpu_counter_add_batch(&mp->m_icount,
                                                      idelta, XFS_ICOUNT_BATCH)
                                      percpu_counter_add(&mp->m_ifree, ifreedelta);
 percpu_counter_sum(&mp->m_ifree)
After this, an incorrect inode count (sb_ifree > sb_icount) will be written to the log. When the sb is subsequently written, that incorrect inode count fails the boundary check in xfs_validate_sb_write(), which shuts down the file system.
When lazysbcount is enabled, we don't need to guarantee that Lazy sb counters are completely correct, but we do need to guarantee that sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount must be satisfied any time that there /cannot/ be other threads allocating or freeing inode chunks. If the constraint is violated under these circumstances, sb_i{count,free} (the ondisk superblock inode counters) maybe incorrect and need to be marked sick at unmount, the count will be rebuilt on the next mount.
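For reference, the unsynchronized pair of reads in xfs_log_sb() before this fix is condensed below, with the race window marked in comments; this is a sketch of the existing code path, not new logic:

  /* Condensed from xfs_log_sb(); each sum is consistent on its own. */
  mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
  /*
   * Window: another CPU can run xfs_trans_unreserve_and_mod_sb() here
   * and add a positive ifreedelta to m_ifree.
   */
  mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
  /* sb_ifree may now exceed sb_icount and fail the write verifier. */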
Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li leo.lilong@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_sb.c | 4 +++- fs/xfs/xfs_mount.c | 16 +++++++++++++++- 2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index 13ef340044b3..1296cff23a3b 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -966,7 +966,9 @@ xfs_log_sb( */ if (xfs_sb_version_haslazysbcount(&mp->m_sb)) { mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount); - mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree); + mp->m_sb.sb_ifree = min_t(uint64_t, + percpu_counter_sum(&mp->m_ifree), + mp->m_sb.sb_icount); mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks); }
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 959425cfb612..6504503f07b3 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -637,6 +637,20 @@ xfs_check_summary_counts( return xfs_initialize_perag_data(mp, mp->m_sb.sb_agcount); }
+static void +xfs_ifree_unmount( + struct xfs_mount *mp) +{ + if (XFS_FORCED_SHUTDOWN(mp)) + return; + + if (percpu_counter_sum(&mp->m_ifree) > + percpu_counter_sum(&mp->m_icount)) { + xfs_alert(mp, "ifree/icount mismatch at unmount"); + xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS); + } +} + /* * Flush and reclaim dirty inodes in preparation for unmount. Inodes and * internal inode structures can be sitting in the CIL and AIL at this point, @@ -1160,7 +1174,7 @@ xfs_unmountfs( xfs_warn(mp, "Unable to update superblock counters. " "Freespace may not be correct on next mount.");
- + xfs_ifree_unmount(mp); xfs_log_unmount(mp); xfs_da_unmount(mp); xfs_uuid_unmount(mp);
From: Zeng Heng zengheng4@huawei.com
mainline inclusion from mainline-v6.1-rc1 commit cf4f4c12dea7a977a143c8fe5af1740b7f9876f8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When `xfs_sysfs_init` fails, `mp->m_errortag` needs to be freed. Otherwise, kmemleak reports a memory leak after mounting an xfs image:
unreferenced object 0xffff888101364900 (size 192): comm "mount", pid 13099, jiffies 4294915218 (age 335.207s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f08ad25c>] __kmalloc+0x41/0x1b0 [<00000000dca9aeb6>] kmem_alloc+0xfd/0x430 [<0000000040361882>] xfs_errortag_init+0x20/0x110 [<00000000b384a0f6>] xfs_mountfs+0x6ea/0x1a30 [<000000003774395d>] xfs_fs_fill_super+0xe10/0x1a80 [<000000009cf07b6c>] get_tree_bdev+0x3e7/0x700 [<00000000046b5426>] vfs_get_tree+0x8e/0x2e0 [<00000000952ec082>] path_mount+0xf8c/0x1990 [<00000000beb1f838>] do_mount+0xee/0x110 [<000000000e9c41bb>] __x64_sys_mount+0x14b/0x1f0 [<00000000f7bb938e>] do_syscall_64+0x3b/0x90 [<000000003fcd67a9>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
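Condensed from the diff below, the fixed xfs_errortag_init() frees the array when sysfs registration fails instead of returning with it still allocated:

  mp->m_errortag = kmem_zalloc(sizeof(unsigned int) * XFS_ERRTAG_MAX,
                               KM_MAYFAIL);
  if (!mp->m_errortag)
          return -ENOMEM;

  ret = xfs_sysfs_init(&mp->m_errortag_kobj, &xfs_errortag_ktype,
                       &mp->m_kobj, "errortag");
  if (ret)
          kmem_free(mp->m_errortag);      /* don't leak on failure */
  return ret;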
Fixes: c68401011522 ("xfs: expose errortag knobs via sysfs") Signed-off-by: Zeng Heng zengheng4@huawei.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_error.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c index f9e2f606b5b8..aed72fee71cb 100644 --- a/fs/xfs/xfs_error.c +++ b/fs/xfs/xfs_error.c @@ -215,13 +215,18 @@ int xfs_errortag_init( struct xfs_mount *mp) { + int ret; + mp->m_errortag = kmem_zalloc(sizeof(unsigned int) * XFS_ERRTAG_MAX, KM_MAYFAIL); if (!mp->m_errortag) return -ENOMEM;
- return xfs_sysfs_init(&mp->m_errortag_kobj, &xfs_errortag_ktype, - &mp->m_kobj, "errortag"); + ret = xfs_sysfs_init(&mp->m_errortag_kobj, &xfs_errortag_ktype, + &mp->m_kobj, "errortag"); + if (ret) + kmem_free(mp->m_errortag); + return ret; }
void
From: Li Zetao lizetao1@huawei.com
mainline inclusion from mainline-v6.1-rc1 commit d08af40340cad0e025d643c3982781a8f99d5032 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
kmemleak reported a sequence of memory leaks, and one of them indicated we failed to free a pointer: comm "mount", pid 19610, jiffies 4297086464 (age 60.635s) hex dump (first 8 bytes): 73 64 61 00 81 88 ff ff sda..... backtrace: [<00000000d77f3e04>] kstrdup_const+0x46/0x70 [<00000000e51fa804>] kobject_set_name_vargs+0x2f/0xb0 [<00000000247cd595>] kobject_init_and_add+0xb0/0x120 [<00000000f9139aaf>] xfs_mountfs+0x367/0xfc0 [<00000000250d3caf>] xfs_fs_fill_super+0xa16/0xdc0 [<000000008d873d38>] get_tree_bdev+0x256/0x390 [<000000004881f3fa>] vfs_get_tree+0x41/0xf0 [<000000008291ab52>] path_mount+0x9b3/0xdd0 [<0000000022ba8f2d>] __x64_sys_mount+0x190/0x1d0
As mentioned in the kobject_init_and_add() comment, if this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Apparently, xfs_sysfs_init() does not follow this requirement: when kobject_init_and_add() returns an error, the kobj->kobject.name string allocated by kstrdup_const() is never freed, which produces the backtrace above.

Fix it by calling kobject_put() when kobject_init_and_add() returns an error.
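Condensed from the diff below, the fixed xfs_sysfs_init() becomes the standard pattern for this API:

  err = kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name);
  if (err)
          /*
           * kobject_init_and_add() has already initialized the kobject
           * and duplicated the name; kobject_put() drops the reference
           * so the ktype's release callback can free both.
           */
          kobject_put(&kobj->kobject);

  return err;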
Fixes: a31b1d3d89e4 ("xfs: add xfs_mount sysfs kobject") Signed-off-by: Li Zetao lizetao1@huawei.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_sysfs.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_sysfs.h b/fs/xfs/xfs_sysfs.h index 43585850f154..513095e353a5 100644 --- a/fs/xfs/xfs_sysfs.h +++ b/fs/xfs/xfs_sysfs.h @@ -33,10 +33,15 @@ xfs_sysfs_init( const char *name) { struct kobject *parent; + int err;
parent = parent_kobj ? &parent_kobj->kobject : NULL; init_completion(&kobj->complete); - return kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name); + err = kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name); + if (err) + kobject_put(&kobj->kobject); + + return err; }
static inline void
From: Jing-Ting Wu Jing-Ting.Wu@mediatek.com
mainline inclusion from mainline-v6.0-rc3 commit 763f4fb76e24959c370cdaa889b2492ba6175580 category: bugfix bugzilla: 187794, https://gitee.com/openeuler/kernel/issues/I5ZXVK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------
Root cause: rebind_subsystems() holds no lock while it moves a css object from list A to list B, so a concurrent list_for_each_entry_rcu() walker can end up treating B's list head as a css node.

Solution: Add a grace period before the removed rstat_css_node is invalidated by inserting it into the destination list.
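The fixed sequence in rebind_subsystems() is the usual pattern for moving an element between RCU-protected lists; condensed from the diff below, with the reasoning spelled out in comments:

  list_del_rcu(&css->rstat_css_node);
  /*
   * Wait for pre-existing list_for_each_entry_rcu() walkers of the old
   * list to finish.  Without this, a reader still holding the node can
   * follow its ->next pointer into the destination list and mistake
   * that list's head for a css.
   */
  synchronize_rcu();
  list_add_rcu(&css->rstat_css_node, &dcgrp->rstat_css_list);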
Reported-by: Jing-Ting Wu jing-ting.wu@mediatek.com Suggested-by: Michal Koutný mkoutny@suse.com Signed-off-by: Jing-Ting Wu jing-ting.wu@mediatek.com Tested-by: Jing-Ting Wu jing-ting.wu@mediatek.com Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c0112... Acked-by: Mukesh Ojha quic_mojha@quicinc.com Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1") Cc: stable@vger.kernel.org # v5.13+ Signed-off-by: Tejun Heo tj@kernel.org Signed-off-by: Cai Xinchen caixinchen1@huawei.com Reviewed-by: GONG Ruiqi gongruiqi1@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/cgroup/cgroup.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 57f4e19df8c6..46d5c120c626 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1781,6 +1781,7 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
if (ss->css_rstat_flush) { list_del_rcu(&css->rstat_css_node); + synchronize_rcu(); list_add_rcu(&css->rstat_css_node, &dcgrp->rstat_css_list); }
From: Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com
mainline inclusion from mainline-v5.15-rc1 commit ed623dffdeebcc0acac7be6af4a301ee7169cd21 category: bugfix bugzilla: 187303, https://gitee.com/openeuler/kernel/issues/I5ZXV2 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------
In case the UART registration fails, the clocks are left enabled. Disable the clock in the error path.
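The rule being applied: a clock enabled with clk_prepare_enable() during probe must be balanced by clk_disable_unprepare() on every later failure path. A minimal sketch of the pattern, where foo_probe() and do_register() are hypothetical stand-ins rather than the actual uartlite code:

  /* needs <linux/clk.h>, <linux/err.h>, <linux/platform_device.h> */
  static int foo_probe(struct platform_device *pdev)
  {
          struct clk *clk;
          int ret;

          clk = devm_clk_get(&pdev->dev, NULL);
          if (IS_ERR(clk))
                  return PTR_ERR(clk);

          ret = clk_prepare_enable(clk);
          if (ret)
                  return ret;

          ret = do_register();            /* any step that can still fail */
          if (ret)
                  clk_disable_unprepare(clk);     /* balance the enable */
          return ret;
  }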
Signed-off-by: Shubhrajyoti Datta shubhrajyoti.datta@xilinx.com Link: https://lore.kernel.org/r/20210713064835.27978-2-shubhrajyoti.datta@xilinx.c... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Cai Xinchen caixinchen1@huawei.com Reviewed-by: GONG Ruiqi gongruiqi1@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/tty/serial/uartlite.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/tty/serial/uartlite.c b/drivers/tty/serial/uartlite.c index 48923cd8c07d..dadac5cc3d5d 100644 --- a/drivers/tty/serial/uartlite.c +++ b/drivers/tty/serial/uartlite.c @@ -784,6 +784,7 @@ static int ulite_probe(struct platform_device *pdev) ret = uart_register_driver(&ulite_uart_driver); if (ret < 0) { dev_err(&pdev->dev, "Failed to register driver\n"); + clk_disable_unprepare(pdata->clk); return ret; } }
From: Yang Jihong yangjihong1@huawei.com
mainline inclusion from mainline-v6.0 commit 6b959ba22d34ca793ffdb15b5715457c78e38b1a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60M4W CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
--------------------------------
While perf_output_read_group() is running, the CPU may service an IPI request from another core and invoke __perf_install_in_context(). As a result, the hwc configuration is modified mid-read, causing inconsistency and unexpected consequences.

Interrupts are not disabled when perf_output_read_group() reads the PMU counters, so an IPI request may be received from another core. As a result, the PMU configuration is modified and an error occurs when reading the PMU counter:
CPU0                                                CPU1
                                                    __se_sys_perf_event_open
                                                      perf_install_in_context
perf_output_read_group                                  smp_call_function_single
  for_each_sibling_event(sub, leader) {                   generic_exec_single
    if ((sub != event) &&                                   remote_function
        (sub->state == PERF_EVENT_STATE_ACTIVE))                  |
<enter IPI handler: __perf_install_in_context>      <----RAISE IPI-----+
  __perf_install_in_context
    ctx_resched
      event_sched_out
        armpmu_del
          ...
          hwc->idx = -1;  // event->hwc.idx is set to -1
  ...
<exit IPI>
    sub->pmu->read(sub);
      armpmu_read
        armv8pmu_read_counter
          armv8pmu_read_hw_counter
            int idx = event->hw.idx;  // idx = -1
            u64 val = armv8pmu_read_evcntr(idx);
              u32 counter = ARMV8_IDX_TO_COUNTER(idx);  // invalid counter = 30
              read_pmevcntrn(counter)  // undefined instruction
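The fix brackets the whole group read with local_irq_save()/local_irq_restore(); a condensed view of the change in the diff below, with the rationale as comments:

  unsigned long flags;

  /*
   * With interrupts disabled on the local CPU, the IPI that would run
   * __perf_install_in_context() cannot be serviced, so event->hw.idx
   * cannot flip to -1 between the state check and sub->pmu->read(sub).
   */
  local_irq_save(flags);

  /* ... read the leader and sibling counter values into values[] ... */

  local_irq_restore(flags);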
Signed-off-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/events/core.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c index 6fcd4177ade6..21fe33ca6327 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6779,6 +6779,13 @@ static void perf_output_read_group(struct perf_output_handle *handle, u64 read_format = event->attr.read_format; u64 values[5]; int n = 0; + unsigned long flags; + + /* + * Disabling interrupts avoids all counter scheduling + * (context switches, timer based rotation and IPIs). + */ + local_irq_save(flags);
values[n++] = 1 + leader->nr_siblings;
@@ -6811,6 +6818,8 @@ static void perf_output_read_group(struct perf_output_handle *handle,
__output_copy(handle, values, n * sizeof(u64)); } + + local_irq_restore(flags); }
#define PERF_FORMAT_TOTAL_TIMES (PERF_FORMAT_TOTAL_TIME_ENABLED|\
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit b22a06ea6ff96075d4a443fb4f318f41a9823e08.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 53602868a90e..e02bf2578a64 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3534,7 +3534,7 @@ static int arm_smmu_switch_dirty_log(struct iommu_domain *domain, bool enable,
if (!(smmu->features & ARM_SMMU_FEAT_HD)) return -ENODEV; - if (smmu_domain->stage == ARM_SMMU_DOMAIN_BYPASS) + if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1) return -EINVAL;
if (enable) { @@ -3575,7 +3575,7 @@ static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
if (!(smmu->features & ARM_SMMU_FEAT_HD)) return -ENODEV; - if (smmu_domain->stage == ARM_SMMU_DOMAIN_BYPASS) + if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1) return -EINVAL;
if (!ops || !ops->sync_dirty_log) { @@ -3604,7 +3604,7 @@ static int arm_smmu_clear_dirty_log(struct iommu_domain *domain,
if (!(smmu->features & ARM_SMMU_FEAT_HD)) return -ENODEV; - if (smmu_domain->stage == ARM_SMMU_DOMAIN_BYPASS) + if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1) return -EINVAL;
if (!ops || !ops->clear_dirty_log) {
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 22f7a4bf1186b3f50b6716b714927e602fa32392.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index ee73b1b2e200..2ff6f3ba9f39 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -557,12 +557,8 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) return 0;
ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested); - if (ret || !nested) { - if (ret) - pr_warn("%s: Get DOMAIN_ATTR_NESTING failed: %d.\n", - __func__, ret); - return 0; - } + if (ret || !nested) + return ret;
mutex_init(&vdev->fault_queue_lock);
@@ -651,12 +647,8 @@ static int vfio_pci_dma_fault_response_init(struct vfio_pci_device *vdev) return 0;
ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested); - if (ret || !nested) { - if (ret) - pr_warn("%s: Get DOMAIN_ATTR_NESTING failed: %d.\n", - __func__, ret); - return 0; - } + if (ret || !nested) + return ret;
mutex_init(&vdev->fault_response_queue_lock);
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 3afa66c6a1ca51433487ba116455af878ac17227.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/iommu.h | 1 - 1 file changed, 1 deletion(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index d993036c94c2..95320164dcf3 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -1221,7 +1221,6 @@ iommu_sva_bind_group(struct iommu_group *group, struct mm_struct *mm, return NULL; }
-static inline int iommu_bind_guest_msi(struct iommu_domain *domain, dma_addr_t giova, phys_addr_t gpa, size_t size) {
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit c046c2a2c57587243b1fc53e65061f3e842848e3.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index e02bf2578a64..d3575e05b1be 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3832,6 +3832,7 @@ arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev, !(granule_size & smmu_domain->domain.pgsize_bitmap)) { tg = __ffs(smmu_domain->domain.pgsize_bitmap); granule_size = 1 << tg; + size = size >> tg; }
arm_smmu_tlb_inv_range_domain(info->addr, size,
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit f7cdf6923af762bfec3d8d7e919cca4de79de73a.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 ----- 1 file changed, 5 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index d3575e05b1be..9ac38f9140eb 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2297,11 +2297,6 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, cmd->tlbi.tg = (tg - 10) / 2;
/* Determine what level the granule is at */ - if (!(granule & smmu_domain->domain.pgsize_bitmap) || - (granule & (granule - 1))) { - granule = leaf_pgsize; - iova = ALIGN_DOWN(iova, leaf_pgsize); - } cmd->tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
/* Align size with the leaf page size upwards */
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 8f1d8ede3a5b0aa8d73b3932332a2ca39d9a2d2b.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 ----- 1 file changed, 5 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 9ac38f9140eb..1dfdc06ed60b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2288,10 +2288,7 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { /* Get the leaf page size */ - size_t leaf_pgsize; - tg = __ffs(smmu_domain->domain.pgsize_bitmap); - leaf_pgsize = 1 << tg;
/* Convert page size of 12,14,16 (log2) to 1,2,3 */ cmd->tlbi.tg = (tg - 10) / 2; @@ -2299,8 +2296,6 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, /* Determine what level the granule is at */ cmd->tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
- /* Align size with the leaf page size upwards */ - size = ALIGN(size, leaf_pgsize); num_pages = size >> tg; }
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 9ed8587a5c4bc58f7136343c4c7930eb35187ea0.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +-- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 1dfdc06ed60b..1f33580e0006 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -338,7 +338,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent) case CMDQ_OP_TLBI_NH_ASID: cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid); fallthrough; - case CMDQ_OP_TLBI_NH_ALL: case CMDQ_OP_TLBI_S12_VMALL: cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid); break; @@ -3759,7 +3758,7 @@ static int arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev, struct iommu_cache_invalidate_info *inv_info) { - struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NH_ALL}; + struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NSNH_ALL}; struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); struct arm_smmu_device *smmu = smmu_domain->smmu;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 1dd49bed58df..9abce4732456 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -509,7 +509,6 @@ struct arm_smmu_cmdq_ent { }; } cfgi;
- #define CMDQ_OP_TLBI_NH_ALL 0x10 #define CMDQ_OP_TLBI_NH_ASID 0x11 #define CMDQ_OP_TLBI_NH_VA 0x12 #define CMDQ_OP_TLBI_EL2_ALL 0x20
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 21d56f9c91a0a1b55cc7e5933974d6afcef7b001.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/io-pgtable-arm.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 0969224aff7b..34f6366dcc6c 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -980,6 +980,10 @@ static int arm_lpae_sync_dirty_log(struct io_pgtable_ops *ops, if (WARN_ON(iaext)) return -EINVAL;
+ if (data->iop.fmt != ARM_64_LPAE_S1 && + data->iop.fmt != ARM_32_LPAE_S1) + return -EINVAL; + return __arm_lpae_sync_dirty_log(data, iova, size, lvl, ptep, bitmap, base_iova, bitmap_pgshift); } @@ -1072,6 +1076,10 @@ static int arm_lpae_clear_dirty_log(struct io_pgtable_ops *ops, if (WARN_ON(iaext)) return -EINVAL;
+ if (data->iop.fmt != ARM_64_LPAE_S1 && + data->iop.fmt != ARM_32_LPAE_S1) + return -EINVAL; + return __arm_lpae_clear_dirty_log(data, iova, size, lvl, ptep, bitmap, base_iova, bitmap_pgshift); }
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 524f1339c8a5eba5cc59161679f643f01edd89ec.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/io-pgtable-arm.c | 33 ++++++--------------------------- 1 file changed, 6 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 34f6366dcc6c..2740a35fe714 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -159,23 +159,6 @@ static inline bool iopte_leaf(arm_lpae_iopte pte, int lvl, return iopte_type(pte, lvl) == ARM_LPAE_PTE_TYPE_BLOCK; }
-static inline bool arm_lpae_pte_writable(struct arm_lpae_io_pgtable *data, - arm_lpae_iopte pte, int lvl) -{ - if (iopte_leaf(pte, lvl, data->iop.fmt)) { - if (data->iop.fmt == ARM_64_LPAE_S1 || - data->iop.fmt == ARM_32_LPAE_S1) { - if (!(pte & ARM_LPAE_PTE_AP_RDONLY)) - return true; - } else { - if (pte & ARM_LPAE_PTE_HAP_WRITE) - return true; - } - } - - return false; -} - static arm_lpae_iopte paddr_to_iopte(phys_addr_t paddr, struct arm_lpae_io_pgtable *data) { @@ -769,7 +752,7 @@ static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data, if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) { if (iopte_leaf(pte, lvl, iop->fmt)) { if (lvl == (ARM_LPAE_MAX_LEVELS - 1) || - !arm_lpae_pte_writable(data, pte, lvl)) + (pte & ARM_LPAE_PTE_AP_RDONLY)) return size;
/* We find a writable block, split it. */ @@ -923,7 +906,7 @@ static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) { if (iopte_leaf(pte, lvl, iop->fmt)) { - if (!arm_lpae_pte_writable(data, pte, lvl)) + if (pte & ARM_LPAE_PTE_AP_RDONLY) return 0;
/* It is writable, set the bitmap */ @@ -944,7 +927,7 @@ static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data, } return 0; } else if (iopte_leaf(pte, lvl, iop->fmt)) { - if (!arm_lpae_pte_writable(data, pte, lvl)) + if (pte & ARM_LPAE_PTE_AP_RDONLY) return 0;
/* Though the size is too small, also set bitmap */ @@ -1011,7 +994,7 @@ static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data,
if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) { if (iopte_leaf(pte, lvl, iop->fmt)) { - if (!arm_lpae_pte_writable(data, pte, lvl)) + if (pte & ARM_LPAE_PTE_AP_RDONLY) return 0;
/* Ensure all corresponding bits are set */ @@ -1023,11 +1006,7 @@ static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data, }
/* Race does not exist */ - if ((data->iop.fmt == ARM_64_LPAE_S1) || - (data->iop.fmt == ARM_32_LPAE_S1)) - pte |= ARM_LPAE_PTE_AP_RDONLY; - else - pte &= ~ARM_LPAE_PTE_HAP_WRITE; + pte |= ARM_LPAE_PTE_AP_RDONLY; __arm_lpae_set_pte(ptep, pte, &iop->cfg); return 0; } @@ -1044,7 +1023,7 @@ static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data, return 0; } else if (iopte_leaf(pte, lvl, iop->fmt)) { /* Though the size is too small, it is already clean */ - if (!arm_lpae_pte_writable(data, pte, lvl)) + if (pte & ARM_LPAE_PTE_AP_RDONLY) return 0;
return -EINVAL;
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 97e11307edcc8734359ec7bdcdbffc37633ae716.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 ----- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 -- drivers/iommu/io-pgtable-arm.c | 6 +----- 3 files changed, 1 insertion(+), 12 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 1f33580e0006..74eb98bd3e6e 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1610,11 +1610,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2R);
- if (smmu->features & ARM_SMMU_FEAT_HA) - dst[2] |= cpu_to_le64(STRTAB_STE_2_S2HA); - if (smmu->features & ARM_SMMU_FEAT_HD) - dst[2] |= cpu_to_le64(STRTAB_STE_2_S2HD); - dst[3] = cpu_to_le64(vttbr);
val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS); diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 9abce4732456..d0f3181a22c5 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -294,8 +294,6 @@ #define STRTAB_STE_2_S2AA64 (1UL << 51) #define STRTAB_STE_2_S2ENDI (1UL << 52) #define STRTAB_STE_2_S2PTW (1UL << 54) -#define STRTAB_STE_2_S2HD (1UL << 55) -#define STRTAB_STE_2_S2HA (1UL << 56) #define STRTAB_STE_2_S2R (1UL << 58)
#define STRTAB_STE_3_S2TTB_MASK GENMASK_ULL(51, 4) diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 2740a35fe714..3fc6ae00dc96 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -401,12 +401,8 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data, pte = ARM_LPAE_PTE_HAP_FAULT; if (prot & IOMMU_READ) pte |= ARM_LPAE_PTE_HAP_READ; - if (prot & IOMMU_WRITE) { + if (prot & IOMMU_WRITE) pte |= ARM_LPAE_PTE_HAP_WRITE; - if (data->iop.fmt == ARM_64_LPAE_S2 && - cfg->quirks & IO_PGTABLE_QUIRK_ARM_HD) - pte |= ARM_LPAE_PTE_DBM; - } }
/*
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 9b4742a6dd67e4a9309c325376682bde5da60fdf.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 40 ----------------------------- drivers/vfio/pci/vfio_pci_private.h | 7 ----- drivers/vfio/pci/vfio_pci_rdwr.c | 1 - 3 files changed, 48 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 2ff6f3ba9f39..36bdcc0a4fc9 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -607,32 +607,6 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) return ret; }
-static void dma_response_inject(struct work_struct *work) -{ - struct vfio_pci_dma_fault_response_work *rwork = - container_of(work, struct vfio_pci_dma_fault_response_work, inject); - struct vfio_region_dma_fault_response *header = rwork->header; - struct vfio_pci_device *vdev = rwork->vdev; - struct iommu_page_response *resp; - u32 tail, head, size; - - mutex_lock(&vdev->fault_response_queue_lock); - - tail = header->tail; - head = header->head; - size = header->nb_entries; - - while (CIRC_CNT(head, tail, size) >= 1) { - resp = (struct iommu_page_response *)(vdev->fault_response_pages + header->offset + - tail * header->entry_size); - - /* TODO: properly handle the return value */ - iommu_page_response(&vdev->pdev->dev, resp); - header->tail = tail = (tail + 1) % size; - } - mutex_unlock(&vdev->fault_response_queue_lock); -} - #define DMA_FAULT_RESPONSE_RING_LENGTH 512
static int vfio_pci_dma_fault_response_init(struct vfio_pci_device *vdev) @@ -678,22 +652,8 @@ static int vfio_pci_dma_fault_response_init(struct vfio_pci_device *vdev) header->nb_entries = DMA_FAULT_RESPONSE_RING_LENGTH; header->offset = PAGE_SIZE;
- vdev->response_work = kzalloc(sizeof(*vdev->response_work), GFP_KERNEL); - if (!vdev->response_work) - goto out; - vdev->response_work->header = header; - vdev->response_work->vdev = vdev; - - /* launch the thread that will extract the response */ - INIT_WORK(&vdev->response_work->inject, dma_response_inject); - vdev->dma_fault_response_wq = - create_singlethread_workqueue("vfio-dma-fault-response"); - if (!vdev->dma_fault_response_wq) - return -ENOMEM; - return 0; out: - kfree(vdev->fault_response_pages); vdev->fault_response_pages = NULL; return ret; } diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index 318328602874..70abd68a2ed9 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -52,12 +52,6 @@ struct vfio_pci_irq_ctx { struct irq_bypass_producer producer; };
-struct vfio_pci_dma_fault_response_work { - struct work_struct inject; - struct vfio_region_dma_fault_response *header; - struct vfio_pci_device *vdev; -}; - struct vfio_pci_device; struct vfio_pci_region;
@@ -159,7 +153,6 @@ struct vfio_pci_device { u8 *fault_pages; u8 *fault_response_pages; struct workqueue_struct *dma_fault_response_wq; - struct vfio_pci_dma_fault_response_work *response_work; struct mutex fault_queue_lock; struct mutex fault_response_queue_lock; struct list_head dummy_resources_list; diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 43c11b5f5486..04828d0b752f 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -440,7 +440,6 @@ size_t vfio_pci_dma_fault_response_rw(struct vfio_pci_device *vdev, char __user mutex_lock(&vdev->fault_response_queue_lock); header->head = new_head; mutex_unlock(&vdev->fault_response_queue_lock); - queue_work(vdev->dma_fault_response_wq, &vdev->response_work->inject); } else { if (copy_to_user(buf, base + pos, count)) return -EFAULT;
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit cbbf4b3a64870f66d1c43b3900225adcf2d3fb48.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 125 ++-------------------------- drivers/vfio/pci/vfio_pci_private.h | 6 -- drivers/vfio/pci/vfio_pci_rdwr.c | 39 --------- include/uapi/linux/vfio.h | 32 ------- 4 files changed, 9 insertions(+), 193 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 36bdcc0a4fc9..352abea42649 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -371,20 +371,9 @@ static void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev, kfree(vdev->fault_pages); }
-static void -vfio_pci_dma_fault_response_release(struct vfio_pci_device *vdev, - struct vfio_pci_region *region) -{ - if (vdev->dma_fault_response_wq) - destroy_workqueue(vdev->dma_fault_response_wq); - kfree(vdev->fault_response_pages); - vdev->fault_response_pages = NULL; -} - -static int __vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vm_area_struct *vma, - u8 *pages) +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, + struct vm_area_struct *vma) { u64 phys_len, req_len, pgoff, req_start; unsigned long long addr; @@ -397,14 +386,14 @@ static int __vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev, ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); req_start = pgoff << PAGE_SHIFT;
- /* only the second page of the fault region is mmappable */ + /* only the second page of the producer fault region is mmappable */ if (req_start < PAGE_SIZE) return -EINVAL;
if (req_start + req_len > phys_len) return -EINVAL;
- addr = virt_to_phys(pages); + addr = virt_to_phys(vdev->fault_pages); vma->vm_private_data = vdev; vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
@@ -413,29 +402,13 @@ static int __vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev, return ret; }
-static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vm_area_struct *vma) -{ - return __vfio_pci_dma_fault_mmap(vdev, region, vma, vdev->fault_pages); -} - -static int -vfio_pci_dma_fault_response_mmap(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vm_area_struct *vma) -{ - return __vfio_pci_dma_fault_mmap(vdev, region, vma, vdev->fault_response_pages); -} - -static int __vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vfio_info_cap *caps, - u32 cap_id) +static int vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, + struct vfio_info_cap *caps) { struct vfio_region_info_cap_sparse_mmap *sparse = NULL; struct vfio_region_info_cap_fault cap = { - .header.id = cap_id, + .header.id = VFIO_REGION_INFO_CAP_DMA_FAULT, .header.version = 1, .version = 1, }; @@ -463,23 +436,6 @@ static int __vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev, return ret; }
-static int vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vfio_info_cap *caps) -{ - return __vfio_pci_dma_fault_add_capability(vdev, region, caps, - VFIO_REGION_INFO_CAP_DMA_FAULT); -} - -static int -vfio_pci_dma_fault_response_add_capability(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vfio_info_cap *caps) -{ - return __vfio_pci_dma_fault_add_capability(vdev, region, caps, - VFIO_REGION_INFO_CAP_DMA_FAULT_RESPONSE); -} - static const struct vfio_pci_regops vfio_pci_dma_fault_regops = { .rw = vfio_pci_dma_fault_rw, .release = vfio_pci_dma_fault_release, @@ -487,13 +443,6 @@ static const struct vfio_pci_regops vfio_pci_dma_fault_regops = { .add_capability = vfio_pci_dma_fault_add_capability, };
-static const struct vfio_pci_regops vfio_pci_dma_fault_response_regops = { - .rw = vfio_pci_dma_fault_response_rw, - .release = vfio_pci_dma_fault_response_release, - .mmap = vfio_pci_dma_fault_response_mmap, - .add_capability = vfio_pci_dma_fault_response_add_capability, -}; - static int vfio_pci_iommu_dev_fault_handler(struct iommu_fault *fault, void *data) { @@ -607,57 +556,6 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) return ret; }
-#define DMA_FAULT_RESPONSE_RING_LENGTH 512 - -static int vfio_pci_dma_fault_response_init(struct vfio_pci_device *vdev) -{ - struct vfio_region_dma_fault_response *header; - struct iommu_domain *domain; - int nested, ret; - size_t size; - - domain = iommu_get_domain_for_dev(&vdev->pdev->dev); - if (!domain) - return 0; - - ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested); - if (ret || !nested) - return ret; - - mutex_init(&vdev->fault_response_queue_lock); - - /* - * We provision 1 page for the header and space for - * DMA_FAULT_RING_LENGTH fault records in the ring buffer. - */ - size = ALIGN(sizeof(struct iommu_page_response) * - DMA_FAULT_RESPONSE_RING_LENGTH, PAGE_SIZE) + PAGE_SIZE; - - vdev->fault_response_pages = kzalloc(size, GFP_KERNEL); - if (!vdev->fault_response_pages) - return -ENOMEM; - - ret = vfio_pci_register_dev_region(vdev, - VFIO_REGION_TYPE_NESTED, - VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT_RESPONSE, - &vfio_pci_dma_fault_response_regops, size, - VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE | - VFIO_REGION_INFO_FLAG_MMAP, - vdev->fault_response_pages); - if (ret) - goto out; - - header = (struct vfio_region_dma_fault_response *)vdev->fault_response_pages; - header->entry_size = sizeof(struct iommu_page_response); - header->nb_entries = DMA_FAULT_RESPONSE_RING_LENGTH; - header->offset = PAGE_SIZE; - - return 0; -out: - vdev->fault_response_pages = NULL; - return ret; -} - static int vfio_pci_enable(struct vfio_pci_device *vdev) { struct pci_dev *pdev = vdev->pdev; @@ -760,10 +658,6 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev) if (ret) goto disable_exit;
- ret = vfio_pci_dma_fault_response_init(vdev); - if (ret) - goto disable_exit; - vfio_pci_probe_mmaps(vdev);
return 0; @@ -2507,7 +2401,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) INIT_LIST_HEAD(&vdev->ioeventfds_list); mutex_init(&vdev->vma_lock); INIT_LIST_HEAD(&vdev->vma_list); - INIT_LIST_HEAD(&vdev->dummy_resources_list); init_rwsem(&vdev->memory_lock);
ret = vfio_pci_reflck_attach(vdev); diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index 70abd68a2ed9..a578723a34a5 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -151,10 +151,7 @@ struct vfio_pci_device { struct eventfd_ctx *err_trigger; struct eventfd_ctx *req_trigger; u8 *fault_pages; - u8 *fault_response_pages; - struct workqueue_struct *dma_fault_response_wq; struct mutex fault_queue_lock; - struct mutex fault_response_queue_lock; struct list_head dummy_resources_list; struct mutex ioeventfds_lock; struct list_head ioeventfds_list; @@ -201,9 +198,6 @@ extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset, extern size_t vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf, size_t count, loff_t *ppos, bool iswrite); -extern size_t vfio_pci_dma_fault_response_rw(struct vfio_pci_device *vdev, - char __user *buf, size_t count, - loff_t *ppos, bool iswrite);
extern int vfio_pci_init_perm_bits(void); extern void vfio_pci_uninit_perm_bits(void); diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 04828d0b752f..7f4d377ac9be 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -410,45 +410,6 @@ size_t vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf, return ret; }
-size_t vfio_pci_dma_fault_response_rw(struct vfio_pci_device *vdev, char __user *buf, - size_t count, loff_t *ppos, bool iswrite) -{ - unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; - loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; - void *base = vdev->region[i].data; - int ret = -EFAULT; - - if (pos >= vdev->region[i].size) - return -EINVAL; - - count = min(count, (size_t)(vdev->region[i].size - pos)); - - if (iswrite) { - struct vfio_region_dma_fault_response *header = - (struct vfio_region_dma_fault_response *)base; - uint32_t new_head; - - if (pos != 0 || count != 4) - return -EINVAL; - - if (copy_from_user((void *)&new_head, buf, count)) - return -EFAULT; - - if (new_head >= header->nb_entries) - return -EINVAL; - - mutex_lock(&vdev->fault_response_queue_lock); - header->head = new_head; - mutex_unlock(&vdev->fault_response_queue_lock); - } else { - if (copy_to_user(buf, base + pos, count)) - return -EFAULT; - } - *ppos += count; - ret = count; - return ret; -} - static void vfio_pci_ioeventfd_do_write(struct vfio_pci_ioeventfd *ioeventfd, bool test_mem) { diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9ae6c31796ed..6574032973a3 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -356,7 +356,6 @@ struct vfio_region_info_cap_type {
/* sub-types for VFIO_REGION_TYPE_NESTED */ #define VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT (1) -#define VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT_RESPONSE (2)
/** * struct vfio_region_gfx_edid - EDID region layout. @@ -1034,17 +1033,6 @@ struct vfio_region_info_cap_fault { __u32 version; };
-/* - * Capability exposed by the DMA fault response region - * @version: ABI version - */ -#define VFIO_REGION_INFO_CAP_DMA_FAULT_RESPONSE 7 - -struct vfio_region_info_cap_fault_response { - struct vfio_info_cap_header header; - __u32 version; -}; - /* * DMA Fault Region Layout * @tail: index relative to the start of the ring buffer at which the @@ -1065,26 +1053,6 @@ struct vfio_region_dma_fault { __u32 head; };
-/* - * DMA Fault Response Region Layout - * @head: index relative to the start of the ring buffer at which the - * producer (userspace) insert responses into the buffer - * @entry_size: fault ring buffer entry size in bytes - * @nb_entries: max capacity of the fault ring buffer - * @offset: ring buffer offset relative to the start of the region - * @tail: index relative to the start of the ring buffer at which the - * consumer (kernel) finds the next item in the buffer - */ -struct vfio_region_dma_fault_response { - /* Write-Only */ - __u32 head; - /* Read-Only */ - __u32 entry_size; - __u32 nb_entries; - __u32 offset; - __u32 tail; -}; - /* -------- API for Type1 VFIO IOMMU -------- */
/**
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit aa2addedeae2756de0265c56c4e8d96aac737a23.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/driver-api/vfio.rst | 77 ------------------------------- 1 file changed, 77 deletions(-)
diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst index b57a96d20d3b..d3a02300913a 100644 --- a/Documentation/driver-api/vfio.rst +++ b/Documentation/driver-api/vfio.rst @@ -239,83 +239,6 @@ group and can access them as follows:: /* Gratuitous device reset and go... */ ioctl(device, VFIO_DEVICE_RESET);
-IOMMU Dual Stage Control ------------------------- - -Some IOMMUs support 2 stages/levels of translation. "Stage" corresponds to -the ARM terminology while "level" corresponds to Intel's VTD terminology. In -the following text we use either without distinction. - -This is useful when the guest is exposed with a virtual IOMMU and some -devices are assigned to the guest through VFIO. Then the guest OS can use -stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation -(GPA -> HPA). - -The guest gets ownership of the stage 1 page tables and also owns stage 1 -configuration structures. The hypervisor owns the root configuration structure -(for security reason), including stage 2 configuration. This works as long -configuration structures and page table format are compatible between the -virtual IOMMU and the physical IOMMU. - -Assuming the HW supports it, this nested mode is selected by choosing the -VFIO_TYPE1_NESTING_IOMMU type through: - -ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU); - -This forces the hypervisor to use the stage 2, leaving stage 1 available for -guest usage. - -Once groups are attached to the container, the guest stage 1 translation -configuration data can be passed to VFIO by using - -ioctl(container, VFIO_IOMMU_SET_PASID_TABLE, &pasid_table_info); - -This allows to combine the guest stage 1 configuration structure along with -the hypervisor stage 2 configuration structure. Stage 1 configuration -structures are dependent on the IOMMU type. - -As the stage 1 translation is fully delegated to the HW, translation faults -encountered during the translation process need to be propagated up to -the virtualizer and re-injected into the guest. - -The userspace must be prepared to receive faults. The VFIO-PCI device -exposes one dedicated DMA FAULT region: it contains a ring buffer and -its header that allows to manage the head/tail indices. The region is -identified by the following index/subindex: -- VFIO_REGION_TYPE_NESTED/VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT - -The DMA FAULT region exposes a VFIO_REGION_INFO_CAP_DMA_FAULT -region capability that allows the userspace to retrieve the ABI version -of the fault records filled by the host. - -On top of that region, the userspace can be notified whenever a fault -occurs at the physical level. It can use the VFIO_IRQ_TYPE_NESTED/ -VFIO_IRQ_SUBTYPE_DMA_FAULT specific IRQ to attach the eventfd to be -signalled. - -The ring buffer containing the fault records can be mmapped. When -the userspace consumes a fault in the queue, it should increment -the consumer index to allow new fault records to replace the used ones. - -The queue size and the entry size can be retrieved in the header. -The tail index should never overshoot the producer index as in any -other circular buffer scheme. Also it must be less than the queue size -otherwise the change fails. - -When the guest invalidates stage 1 related caches, invalidations must be -forwarded to the host through -ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data); -Those invalidations can happen at various granularity levels, page, context, ... - -The ARM SMMU specification introduces another challenge: MSIs are translated by -both the virtual SMMU and the physical SMMU. To build a nested mapping for the -IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI -doorbell GPA binding to the host. Then the hypervisor can build a nested stage 2 -binding eventually translating into the physical MSI doorbell. 
- -This is achieved by calling -ioctl(container, VFIO_IOMMU_SET_MSI_BINDING, &guest_binding); - VFIO User API -------------------------------------------------------------------------------
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit b6f29e4d0dc417e7eec27d84a7913b80f1b760e1.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 21 +-------------------- 1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 352abea42649..514b004c2cc6 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -451,7 +451,6 @@ vfio_pci_iommu_dev_fault_handler(struct iommu_fault *fault, void *data) (struct vfio_region_dma_fault *)vdev->fault_pages; struct iommu_fault *new; u32 head, tail, size; - int ext_irq_index; int ret = -EINVAL;
if (WARN_ON(!reg)) @@ -476,19 +475,7 @@ vfio_pci_iommu_dev_fault_handler(struct iommu_fault *fault, void *data) ret = 0; unlock: mutex_unlock(&vdev->fault_queue_lock); - if (ret) - return ret; - - ext_irq_index = vfio_pci_get_ext_irq_index(vdev, VFIO_IRQ_TYPE_NESTED, - VFIO_IRQ_SUBTYPE_DMA_FAULT); - if (ext_irq_index < 0) - return -EINVAL; - - mutex_lock(&vdev->igate); - if (vdev->ext_irqs[ext_irq_index].trigger) - eventfd_signal(vdev->ext_irqs[ext_irq_index].trigger, 1); - mutex_unlock(&vdev->igate); - return 0; + return ret; }
#define DMA_FAULT_RING_LENGTH 512 @@ -543,12 +530,6 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) if (ret) /* the dma fault region is freed in vfio_pci_disable() */ goto out;
- ret = vfio_pci_register_irq(vdev, VFIO_IRQ_TYPE_NESTED, - VFIO_IRQ_SUBTYPE_DMA_FAULT, - VFIO_IRQ_INFO_EVENTFD); - if (ret) /* the fault handler is also freed in vfio_pci_disable() */ - goto out; - return 0; out: kfree(vdev->fault_pages);
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 31ed6dc2484b533447c26163dcdecdfd93063b25.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/uapi/linux/vfio.h | 3 --- 1 file changed, 3 deletions(-)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 6574032973a3..fa3ac73c47be 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -733,9 +733,6 @@ struct vfio_irq_info_cap_type { __u32 subtype; /* type specific */ };
-#define VFIO_IRQ_TYPE_NESTED (1) -#define VFIO_IRQ_SUBTYPE_DMA_FAULT (1) - /** * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set) *
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit e3489f77845cdf900002271db11bd7bdc10c7696.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 98 +++++------------------------ drivers/vfio/pci/vfio_pci_intrs.c | 62 ------------------ drivers/vfio/pci/vfio_pci_private.h | 14 ----- 3 files changed, 17 insertions(+), 157 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 514b004c2cc6..9a3d0a54ee08 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -665,14 +665,6 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev) ret = iommu_unregister_device_fault_handler(&vdev->pdev->dev); WARN_ON(ret == -EBUSY);
- for (i = 0; i < vdev->num_ext_irqs; i++) - vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE | - VFIO_IRQ_SET_ACTION_TRIGGER, - VFIO_PCI_NUM_IRQS + i, 0, 0, NULL); - vdev->num_ext_irqs = 0; - kfree(vdev->ext_irqs); - vdev->ext_irqs = NULL; - /* Device closed, don't need mutex here */ list_for_each_entry_safe(ioeventfd, ioeventfd_tmp, &vdev->ioeventfds_list, next) { @@ -890,9 +882,6 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type) return 1; } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) { return 1; - } else if (irq_type >= VFIO_PCI_NUM_IRQS && - irq_type < VFIO_PCI_NUM_IRQS + vdev->num_ext_irqs) { - return 1; }
return 0; @@ -1077,10 +1066,9 @@ long vfio_pci_ioctl(void *device_data, if (vdev->reset_works) info.flags |= VFIO_DEVICE_FLAGS_RESET;
- info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions + - vdev->num_vendor_regions; - info.num_irqs = VFIO_PCI_NUM_IRQS + vdev->num_ext_irqs + - vdev->num_vendor_irqs; + info.num_regions = VFIO_PCI_NUM_REGIONS + + vdev->num_vendor_regions; + info.num_irqs = VFIO_PCI_NUM_IRQS + vdev->num_vendor_irqs;
if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) { int ret = vfio_pci_info_zdev_add_caps(vdev, &caps); @@ -1259,87 +1247,36 @@ long vfio_pci_ioctl(void *device_data,
} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) { struct vfio_irq_info info; - struct vfio_info_cap caps = { .buf = NULL, .size = 0 }; - unsigned long capsz;
minsz = offsetofend(struct vfio_irq_info, count);
- /* For backward compatibility, cannot require this */ - capsz = offsetofend(struct vfio_irq_info, cap_offset); - if (copy_from_user(&info, (void __user *)arg, minsz)) return -EFAULT;
- if (info.argsz < minsz || - info.index >= VFIO_PCI_NUM_IRQS + vdev->num_ext_irqs) + if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS) return -EINVAL;
- if (info.argsz >= capsz) - minsz = capsz; - - info.flags = VFIO_IRQ_INFO_EVENTFD; - switch (info.index) { - case VFIO_PCI_INTX_IRQ_INDEX: - info.flags |= (VFIO_IRQ_INFO_MASKABLE | - VFIO_IRQ_INFO_AUTOMASKED); - break; - case VFIO_PCI_MSI_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX: + case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX: case VFIO_PCI_REQ_IRQ_INDEX: - info.flags |= VFIO_IRQ_INFO_NORESIZE; break; case VFIO_PCI_ERR_IRQ_INDEX: - info.flags |= VFIO_IRQ_INFO_NORESIZE; - if (!pci_is_pcie(vdev->pdev)) - return -EINVAL; - break; + if (pci_is_pcie(vdev->pdev)) + break; + fallthrough; default: - { - struct vfio_irq_info_cap_type cap_type = { - .header.id = VFIO_IRQ_INFO_CAP_TYPE, - .header.version = 1 }; - int ret, i; - - if (info.index >= VFIO_PCI_NUM_IRQS + - vdev->num_ext_irqs) - return -EINVAL; - info.index = array_index_nospec(info.index, - VFIO_PCI_NUM_IRQS + - vdev->num_ext_irqs); - i = info.index - VFIO_PCI_NUM_IRQS; - - info.flags = vdev->ext_irqs[i].flags; - cap_type.type = vdev->ext_irqs[i].type; - cap_type.subtype = vdev->ext_irqs[i].subtype; - - ret = vfio_info_add_capability(&caps, - &cap_type.header, - sizeof(cap_type)); - if (ret) - return ret; - } + return -EINVAL; }
- info.count = vfio_pci_get_irq_count(vdev, info.index); + info.flags = VFIO_IRQ_INFO_EVENTFD;
- if (caps.size) { - info.flags |= VFIO_IRQ_INFO_FLAG_CAPS; - if (info.argsz < sizeof(info) + caps.size) { - info.argsz = sizeof(info) + caps.size; - info.cap_offset = 0; - } else { - vfio_info_cap_shift(&caps, sizeof(info)); - if (copy_to_user((void __user *)arg + - sizeof(info), caps.buf, - caps.size)) { - kfree(caps.buf); - return -EFAULT; - } - info.cap_offset = sizeof(info); - } + info.count = vfio_pci_get_irq_count(vdev, info.index);
- kfree(caps.buf); - } + if (info.index == VFIO_PCI_INTX_IRQ_INDEX) + info.flags |= (VFIO_IRQ_INFO_MASKABLE | + VFIO_IRQ_INFO_AUTOMASKED); + else + info.flags |= VFIO_IRQ_INFO_NORESIZE;
return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0; @@ -1358,8 +1295,7 @@ long vfio_pci_ioctl(void *device_data, max = vfio_pci_get_irq_count(vdev, hdr.index);
ret = vfio_set_irqs_validate_and_prepare(&hdr, max, - VFIO_PCI_NUM_IRQS + vdev->num_ext_irqs, - &data_size); + VFIO_PCI_NUM_IRQS, &data_size); if (ret) return ret;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index d67995fe872f..869dce5f134d 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -19,7 +19,6 @@ #include <linux/vfio.h> #include <linux/wait.h> #include <linux/slab.h> -#include <linux/nospec.h>
#include "vfio_pci_private.h"
@@ -636,24 +635,6 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev, count, flags, data); }
-static int vfio_pci_set_ext_irq_trigger(struct vfio_pci_device *vdev, - unsigned int index, unsigned int start, - unsigned int count, uint32_t flags, - void *data) -{ - int i; - - if (start != 0 || count > 1 || !vdev->num_ext_irqs) - return -EINVAL; - - index = array_index_nospec(index, - VFIO_PCI_NUM_IRQS + vdev->num_ext_irqs); - i = index - VFIO_PCI_NUM_IRQS; - - return vfio_pci_set_ctx_trigger_single(&vdev->ext_irqs[i].trigger, - count, flags, data); -} - int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags, unsigned index, unsigned start, unsigned count, void *data) @@ -703,13 +684,6 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags, break; } break; - default: - switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) { - case VFIO_IRQ_SET_ACTION_TRIGGER: - func = vfio_pci_set_ext_irq_trigger; - break; - } - break; }
if (!func) @@ -717,39 +691,3 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
return func(vdev, index, start, count, flags, data); } - -int vfio_pci_get_ext_irq_index(struct vfio_pci_device *vdev, - unsigned int type, unsigned int subtype) -{ - int i; - - for (i = 0; i < vdev->num_ext_irqs; i++) { - if (vdev->ext_irqs[i].type == type && - vdev->ext_irqs[i].subtype == subtype) { - return i; - } - } - return -EINVAL; -} - -int vfio_pci_register_irq(struct vfio_pci_device *vdev, - unsigned int type, unsigned int subtype, - u32 flags) -{ - struct vfio_ext_irq *ext_irqs; - - ext_irqs = krealloc(vdev->ext_irqs, - (vdev->num_ext_irqs + 1) * sizeof(*ext_irqs), - GFP_KERNEL); - if (!ext_irqs) - return -ENOMEM; - - vdev->ext_irqs = ext_irqs; - - vdev->ext_irqs[vdev->num_ext_irqs].type = type; - vdev->ext_irqs[vdev->num_ext_irqs].subtype = subtype; - vdev->ext_irqs[vdev->num_ext_irqs].flags = flags; - vdev->ext_irqs[vdev->num_ext_irqs].trigger = NULL; - vdev->num_ext_irqs++; - return 0; -} diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index a578723a34a5..ab488f11b2db 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -77,13 +77,6 @@ struct vfio_pci_region { u32 flags; };
-struct vfio_ext_irq { - u32 type; - u32 subtype; - u32 flags; - struct eventfd_ctx *trigger; -}; - struct vfio_pci_dummy_resource { struct resource resource; int index; @@ -123,8 +116,6 @@ struct vfio_pci_device { struct vfio_pci_irq_ctx *ctx; int num_ctx; int irq_type; - struct vfio_ext_irq *ext_irqs; - int num_ext_irqs; int num_regions; int num_vendor_regions; int num_vendor_irqs; @@ -172,11 +163,6 @@ struct vfio_pci_device {
extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev); extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev); -extern int vfio_pci_register_irq(struct vfio_pci_device *vdev, - unsigned int type, unsigned int subtype, - u32 flags); -extern int vfio_pci_get_ext_irq_index(struct vfio_pci_device *vdev, - unsigned int type, unsigned int subtype);
extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags, unsigned index,
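
For context, a minimal userspace sketch of VFIO_DEVICE_GET_IRQ_INFO after this revert: only the fixed PCI indexes remain, and indexes past the fixed set fail with -EINVAL. This helper is illustrative, not part of the series; "fd" is assumed to be a VFIO device file descriptor obtained via VFIO_GROUP_GET_DEVICE_FD.

/*
 * Illustrative sketch only. Queries one of the fixed IRQ indexes
 * (e.g. VFIO_PCI_MSIX_IRQ_INDEX) that remain after the revert.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int query_irq_index(int fd, unsigned int index)
{
	struct vfio_irq_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);
	info.index = index;

	if (ioctl(fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
		return -errno;	/* extended indexes now fail this way */

	printf("index %u: count %u, flags %#x\n",
	       info.index, info.count, info.flags);
	return 0;
}
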
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 41e3175a4cabca590ba3be605fb2ec63cc87f7c9.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/uapi/linux/vfio.h | 19 +------------------ 1 file changed, 1 insertion(+), 18 deletions(-)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index fa3ac73c47be..6fe74f7b362c 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -712,27 +712,11 @@ struct vfio_irq_info { #define VFIO_IRQ_INFO_MASKABLE (1 << 1) #define VFIO_IRQ_INFO_AUTOMASKED (1 << 2) #define VFIO_IRQ_INFO_NORESIZE (1 << 3) -#define VFIO_IRQ_INFO_FLAG_CAPS (1 << 4) /* Info supports caps */ __u32 index; /* IRQ index */ __u32 count; /* Number of IRQs within this index */ - __u32 cap_offset; /* Offset within info struct of first cap */ }; #define VFIO_DEVICE_GET_IRQ_INFO _IO(VFIO_TYPE, VFIO_BASE + 9)
-/* - * The irq type capability allows IRQs unique to a specific device or - * class of devices to be exposed. - * - * The structures below define version 1 of this capability. - */ -#define VFIO_IRQ_INFO_CAP_TYPE 3 - -struct vfio_irq_info_cap_type { - struct vfio_info_cap_header header; - __u32 type; /* global per bus driver */ - __u32 subtype; /* type specific */ -}; - /** * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set) * @@ -834,8 +818,7 @@ enum { VFIO_PCI_MSIX_IRQ_INDEX, VFIO_PCI_ERR_IRQ_INDEX, VFIO_PCI_REQ_IRQ_INDEX, - VFIO_PCI_NUM_IRQS = 5 /* Fixed user ABI, IRQ indexes >=5 use */ - /* device specific cap to define content */ + VFIO_PCI_NUM_IRQS };
/*
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit e57dd79bca166644d630103e3e96b9345368c753.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 61 ++----------------------------------- 1 file changed, 3 insertions(+), 58 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 9a3d0a54ee08..9493bfe98dd4 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -371,75 +371,21 @@ static void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev, kfree(vdev->fault_pages); }
-static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vm_area_struct *vma) -{ - u64 phys_len, req_len, pgoff, req_start; - unsigned long long addr; - unsigned int ret; - - phys_len = region->size; - - req_len = vma->vm_end - vma->vm_start; - pgoff = vma->vm_pgoff & - ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); - req_start = pgoff << PAGE_SHIFT; - - /* only the second page of the producer fault region is mmappable */ - if (req_start < PAGE_SIZE) - return -EINVAL; - - if (req_start + req_len > phys_len) - return -EINVAL; - - addr = virt_to_phys(vdev->fault_pages); - vma->vm_private_data = vdev; - vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff; - - ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, - req_len, vma->vm_page_prot); - return ret; -} - static int vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev, struct vfio_pci_region *region, struct vfio_info_cap *caps) { - struct vfio_region_info_cap_sparse_mmap *sparse = NULL; struct vfio_region_info_cap_fault cap = { .header.id = VFIO_REGION_INFO_CAP_DMA_FAULT, .header.version = 1, .version = 1, }; - size_t size = sizeof(*sparse) + sizeof(*sparse->areas); - int ret; - - ret = vfio_info_add_capability(caps, &cap.header, sizeof(cap)); - if (ret) - return ret; - - sparse = kzalloc(size, GFP_KERNEL); - if (!sparse) - return -ENOMEM; - - sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP; - sparse->header.version = 1; - sparse->nr_areas = 1; - sparse->areas[0].offset = PAGE_SIZE; - sparse->areas[0].size = region->size - PAGE_SIZE; - - ret = vfio_info_add_capability(caps, &sparse->header, size); - if (ret) - kfree(sparse); - - return ret; + return vfio_info_add_capability(caps, &cap.header, sizeof(cap)); }
static const struct vfio_pci_regops vfio_pci_dma_fault_regops = { .rw = vfio_pci_dma_fault_rw, .release = vfio_pci_dma_fault_release, - .mmap = vfio_pci_dma_fault_mmap, .add_capability = vfio_pci_dma_fault_add_capability, };
@@ -513,8 +459,7 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) VFIO_REGION_TYPE_NESTED, VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT, &vfio_pci_dma_fault_regops, size, - VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE | - VFIO_REGION_INFO_FLAG_MMAP, + VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE, vdev->fault_pages); if (ret) goto out; @@ -522,7 +467,7 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) header = (struct vfio_region_dma_fault *)vdev->fault_pages; header->entry_size = sizeof(struct iommu_fault); header->nb_entries = DMA_FAULT_RING_LENGTH; - header->offset = PAGE_SIZE; + header->offset = sizeof(struct vfio_region_dma_fault);
ret = iommu_register_device_fault_handler(&vdev->pdev->dev, vfio_pci_iommu_dev_fault_handler,
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit f7c0c57bf2addf067bc27a82389bd50c25334458.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 48 +------------------------------------ 1 file changed, 1 insertion(+), 47 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 9493bfe98dd4..b68832bcc3e4 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -27,7 +27,6 @@ #include <linux/vgaarb.h> #include <linux/nospec.h> #include <linux/sched/mm.h> -#include <linux/circ_buf.h>
#include "vfio_pci_private.h"
@@ -389,41 +388,6 @@ static const struct vfio_pci_regops vfio_pci_dma_fault_regops = { .add_capability = vfio_pci_dma_fault_add_capability, };
-static int -vfio_pci_iommu_dev_fault_handler(struct iommu_fault *fault, void *data) -{ - struct vfio_pci_device *vdev = (struct vfio_pci_device *)data; - struct vfio_region_dma_fault *reg = - (struct vfio_region_dma_fault *)vdev->fault_pages; - struct iommu_fault *new; - u32 head, tail, size; - int ret = -EINVAL; - - if (WARN_ON(!reg)) - return ret; - - mutex_lock(&vdev->fault_queue_lock); - - head = reg->head; - tail = reg->tail; - size = reg->nb_entries; - - new = (struct iommu_fault *)(vdev->fault_pages + reg->offset + - head * reg->entry_size); - - if (CIRC_SPACE(head, tail, size) < 1) { - ret = -ENOSPC; - goto unlock; - } - - *new = *fault; - reg->head = (head + 1) % size; - ret = 0; -unlock: - mutex_unlock(&vdev->fault_queue_lock); - return ret; -} - #define DMA_FAULT_RING_LENGTH 512
static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) @@ -468,13 +432,6 @@ static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) header->entry_size = sizeof(struct iommu_fault); header->nb_entries = DMA_FAULT_RING_LENGTH; header->offset = sizeof(struct vfio_region_dma_fault); - - ret = iommu_register_device_fault_handler(&vdev->pdev->dev, - vfio_pci_iommu_dev_fault_handler, - vdev); - if (ret) /* the dma fault region is freed in vfio_pci_disable() */ - goto out; - return 0; out: kfree(vdev->fault_pages); @@ -598,7 +555,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev) struct pci_dev *pdev = vdev->pdev; struct vfio_pci_dummy_resource *dummy_res, *tmp; struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp; - int i, bar, ret; + int i, bar;
/* Stop the device from further DMA */ pci_clear_master(pdev); @@ -607,9 +564,6 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev) VFIO_IRQ_SET_ACTION_TRIGGER, vdev->irq_type, 0, 0, NULL);
- ret = iommu_unregister_device_fault_handler(&vdev->pdev->dev); - WARN_ON(ret == -EBUSY); - /* Device closed, don't need mutex here */ list_for_each_entry_safe(ioeventfd, ioeventfd_tmp, &vdev->ioeventfds_list, next) {
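
For reference, the ring arithmetic the removed producer (vfio_pci_iommu_dev_fault_handler) relied on, as a standalone sketch. The CIRC_CNT/CIRC_SPACE macros mirror include/linux/circ_buf.h and assume a power-of-two ring size (the fault ring used 512 entries); struct fault_ring and ring_push are illustrative stand-ins for the fault page layout.

/*
 * Standalone sketch of the producer-side ring check. CIRC_SPACE()
 * reports how many entries the producer may still write without
 * overtaking the consumer.
 */
#define CIRC_CNT(head, tail, size)	(((head) - (tail)) & ((size) - 1))
#define CIRC_SPACE(head, tail, size)	CIRC_CNT((tail), ((head) + 1), (size))

struct fault_ring {
	unsigned int head;		/* producer index */
	unsigned int tail;		/* consumer index */
	unsigned int size;		/* entries, power of two */
	unsigned long long entries[];	/* placeholder for fault records */
};

/* Returns 0 on success, -1 when the ring is full (-ENOSPC originally). */
static int ring_push(struct fault_ring *ring, unsigned long long rec)
{
	if (CIRC_SPACE(ring->head, ring->tail, ring->size) < 1)
		return -1;

	ring->entries[ring->head] = rec;
	ring->head = (ring->head + 1) % ring->size;
	return 0;
}
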
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 20b23b137402e2c4fd197feacf03b0bd30629b76.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/pci/vfio_pci.c | 79 ----------------------------- drivers/vfio/pci/vfio_pci_private.h | 6 --- drivers/vfio/pci/vfio_pci_rdwr.c | 44 ---------------- include/uapi/linux/vfio.h | 35 ------------- 4 files changed, 164 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index b68832bcc3e4..af18415942ff 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -364,81 +364,6 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state) return ret; }
-static void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev, - struct vfio_pci_region *region) -{ - kfree(vdev->fault_pages); -} - -static int vfio_pci_dma_fault_add_capability(struct vfio_pci_device *vdev, - struct vfio_pci_region *region, - struct vfio_info_cap *caps) -{ - struct vfio_region_info_cap_fault cap = { - .header.id = VFIO_REGION_INFO_CAP_DMA_FAULT, - .header.version = 1, - .version = 1, - }; - return vfio_info_add_capability(caps, &cap.header, sizeof(cap)); -} - -static const struct vfio_pci_regops vfio_pci_dma_fault_regops = { - .rw = vfio_pci_dma_fault_rw, - .release = vfio_pci_dma_fault_release, - .add_capability = vfio_pci_dma_fault_add_capability, -}; - -#define DMA_FAULT_RING_LENGTH 512 - -static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev) -{ - struct vfio_region_dma_fault *header; - struct iommu_domain *domain; - size_t size; - int nested; - int ret; - - domain = iommu_get_domain_for_dev(&vdev->pdev->dev); - if (!domain) - return 0; - - ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested); - if (ret || !nested) - return ret; - - mutex_init(&vdev->fault_queue_lock); - - /* - * We provision 1 page for the header and space for - * DMA_FAULT_RING_LENGTH fault records in the ring buffer. - */ - size = ALIGN(sizeof(struct iommu_fault) * - DMA_FAULT_RING_LENGTH, PAGE_SIZE) + PAGE_SIZE; - - vdev->fault_pages = kzalloc(size, GFP_KERNEL); - if (!vdev->fault_pages) - return -ENOMEM; - - ret = vfio_pci_register_dev_region(vdev, - VFIO_REGION_TYPE_NESTED, - VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT, - &vfio_pci_dma_fault_regops, size, - VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE, - vdev->fault_pages); - if (ret) - goto out; - - header = (struct vfio_region_dma_fault *)vdev->fault_pages; - header->entry_size = sizeof(struct iommu_fault); - header->nb_entries = DMA_FAULT_RING_LENGTH; - header->offset = sizeof(struct vfio_region_dma_fault); - return 0; -out: - kfree(vdev->fault_pages); - vdev->fault_pages = NULL; - return ret; -} - static int vfio_pci_enable(struct vfio_pci_device *vdev) { struct pci_dev *pdev = vdev->pdev; @@ -537,10 +462,6 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev) } }
- ret = vfio_pci_dma_fault_init(vdev); - if (ret) - goto disable_exit; - vfio_pci_probe_mmaps(vdev);
return 0; diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index ab488f11b2db..861068ec9cf7 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -141,8 +141,6 @@ struct vfio_pci_device { int ioeventfds_nr; struct eventfd_ctx *err_trigger; struct eventfd_ctx *req_trigger; - u8 *fault_pages; - struct mutex fault_queue_lock; struct list_head dummy_resources_list; struct mutex ioeventfds_lock; struct list_head ioeventfds_list; @@ -181,10 +179,6 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf, extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset, uint64_t data, int count, int fd);
-extern size_t vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, - char __user *buf, size_t count, - loff_t *ppos, bool iswrite); - extern int vfio_pci_init_perm_bits(void); extern void vfio_pci_uninit_perm_bits(void);
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 7f4d377ac9be..4bced6e43afe 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -366,50 +366,6 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf, return done; }
-size_t vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf, - size_t count, loff_t *ppos, bool iswrite) -{ - unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; - loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; - void *base = vdev->region[i].data; - int ret = -EFAULT; - - if (pos >= vdev->region[i].size) - return -EINVAL; - - count = min(count, (size_t)(vdev->region[i].size - pos)); - - mutex_lock(&vdev->fault_queue_lock); - - if (iswrite) { - struct vfio_region_dma_fault *header = - (struct vfio_region_dma_fault *)base; - u32 new_tail; - - if (pos != 0 || count != 4) { - ret = -EINVAL; - goto unlock; - } - - if (copy_from_user((void *)&new_tail, buf, count)) - goto unlock; - - if (new_tail >= header->nb_entries) { - ret = -EINVAL; - goto unlock; - } - header->tail = new_tail; - } else { - if (copy_to_user(buf, base + pos, count)) - goto unlock; - } - *ppos += count; - ret = count; -unlock: - mutex_unlock(&vdev->fault_queue_lock); - return ret; -} - static void vfio_pci_ioeventfd_do_write(struct vfio_pci_ioeventfd *ioeventfd, bool test_mem) { diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 6fe74f7b362c..8d75f2f0aebc 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -329,7 +329,6 @@ struct vfio_region_info_cap_type { #define VFIO_REGION_TYPE_GFX (1) #define VFIO_REGION_TYPE_CCW (2) #define VFIO_REGION_TYPE_MIGRATION (3) -#define VFIO_REGION_TYPE_NESTED (4)
/* sub-types for VFIO_REGION_TYPE_PCI_* */
@@ -354,9 +353,6 @@ struct vfio_region_info_cap_type { /* sub-types for VFIO_REGION_TYPE_GFX */ #define VFIO_REGION_SUBTYPE_GFX_EDID (1)
-/* sub-types for VFIO_REGION_TYPE_NESTED */ -#define VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT (1) - /** * struct vfio_region_gfx_edid - EDID region layout. * @@ -1002,37 +998,6 @@ struct vfio_device_feature { */ #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN (0)
-/* - * Capability exposed by the DMA fault region - * @version: ABI version - */ -#define VFIO_REGION_INFO_CAP_DMA_FAULT 6 - -struct vfio_region_info_cap_fault { - struct vfio_info_cap_header header; - __u32 version; -}; - -/* - * DMA Fault Region Layout - * @tail: index relative to the start of the ring buffer at which the - * consumer finds the next item in the buffer - * @entry_size: fault ring buffer entry size in bytes - * @nb_entries: max capacity of the fault ring buffer - * @offset: ring buffer offset relative to the start of the region - * @head: index relative to the start of the ring buffer at which the - * producer (kernel) inserts items into the buffers - */ -struct vfio_region_dma_fault { - /* Write-Only */ - __u32 tail; - /* Read-Only */ - __u32 entry_size; - __u32 nb_entries; - __u32 offset; - __u32 head; -}; - /* -------- API for Type1 VFIO IOMMU -------- */
/**
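
Since this patch removes the DMA fault region ABI outright, here is a sketch of the userspace consumer protocol it deletes: read the header, walk records from tail to head, then publish the new tail with a 4-byte write at region offset 0 (the only write the removed vfio_pci_dma_fault_rw accepted). The struct mirrors the deleted UAPI; "region_off" and drain_faults() are illustrative. After the earlier mmap revert, this read()-based path was the only way to reach the ring.

/*
 * Illustrative consumer of the removed fault region. "region_off" is
 * the file offset of the (now hypothetical) fault region within the
 * VFIO device fd.
 */
#include <stdint.h>
#include <unistd.h>

struct vfio_region_dma_fault {
	uint32_t tail;		/* write-only: consumer index */
	uint32_t entry_size;	/* read-only: bytes per fault record */
	uint32_t nb_entries;	/* read-only: ring capacity */
	uint32_t offset;	/* read-only: ring start within the region */
	uint32_t head;		/* read-only: producer index */
};

static int drain_faults(int fd, off_t region_off)
{
	struct vfio_region_dma_fault hdr;
	uint32_t tail;

	if (pread(fd, &hdr, sizeof(hdr), region_off) != sizeof(hdr))
		return -1;

	for (tail = hdr.tail; tail != hdr.head;
	     tail = (tail + 1) % hdr.nb_entries) {
		char rec[256];	/* assumes entry_size <= 256 */

		if (hdr.entry_size > sizeof(rec) ||
		    pread(fd, rec, hdr.entry_size,
			  region_off + hdr.offset +
			  (off_t)tail * hdr.entry_size) != hdr.entry_size)
			return -1;
		/* ...decode the iommu_fault record in "rec"... */
	}

	/* publish the new tail: a 4-byte write at region offset 0 */
	if (pwrite(fd, &tail, sizeof(tail), region_off) != sizeof(tail))
		return -1;
	return 0;
}
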
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit ac16d334b1ac8664932e725a6a6255692f4e11f6.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/vfio_iommu_type1.c | 62 --------------------------------- include/uapi/linux/vfio.h | 20 ----------- 2 files changed, 82 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 1422cbb37013..06e03a1bd6ce 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3088,41 +3088,6 @@ static int vfio_cache_inv_fn(struct device *dev, void *data) return iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg); }
-static int -vfio_bind_msi(struct vfio_iommu *iommu, - dma_addr_t giova, phys_addr_t gpa, size_t size) -{ - struct vfio_domain *d; - int ret = 0; - - mutex_lock(&iommu->lock); - - list_for_each_entry(d, &iommu->domain_list, next) { - ret = iommu_bind_guest_msi(d->domain, giova, gpa, size); - if (ret) - goto unwind; - } - goto unlock; -unwind: - list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) { - iommu_unbind_guest_msi(d->domain, giova); - } -unlock: - mutex_unlock(&iommu->lock); - return ret; -} - -static void -vfio_unbind_msi(struct vfio_iommu *iommu, dma_addr_t giova) -{ - struct vfio_domain *d; - - mutex_lock(&iommu->lock); - list_for_each_entry(d, &iommu->domain_list, next) - iommu_unbind_guest_msi(d->domain, giova); - mutex_unlock(&iommu->lock); -} - static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps) { @@ -3320,31 +3285,6 @@ static int vfio_iommu_type1_cache_invalidate(struct vfio_iommu *iommu, return ret; }
-static int vfio_iommu_type1_set_msi_binding(struct vfio_iommu *iommu, - unsigned long arg) -{ - struct vfio_iommu_type1_set_msi_binding msi_binding; - unsigned long minsz; - - minsz = offsetofend(struct vfio_iommu_type1_set_msi_binding, - size); - - if (copy_from_user(&msi_binding, (void __user *)arg, minsz)) - return -EFAULT; - - if (msi_binding.argsz < minsz) - return -EINVAL; - - if (msi_binding.flags == VFIO_IOMMU_UNBIND_MSI) { - vfio_unbind_msi(iommu, msi_binding.iova); - return 0; - } else if (msi_binding.flags == VFIO_IOMMU_BIND_MSI) { - return vfio_bind_msi(iommu, msi_binding.iova, - msi_binding.gpa, msi_binding.size); - } - return -EINVAL; -} - static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, unsigned long arg) { @@ -3654,8 +3594,6 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return vfio_iommu_type1_set_pasid_table(iommu, arg); case VFIO_IOMMU_CACHE_INVALIDATE: return vfio_iommu_type1_cache_invalidate(iommu, arg); - case VFIO_IOMMU_SET_MSI_BINDING: - return vfio_iommu_type1_set_msi_binding(iommu, arg); default: return -ENOTTY; } diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 8d75f2f0aebc..7ea68500b508 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1323,26 +1323,6 @@ struct vfio_iommu_type1_cache_invalidate { }; #define VFIO_IOMMU_CACHE_INVALIDATE _IO(VFIO_TYPE, VFIO_BASE + 19)
-/** - * VFIO_IOMMU_SET_MSI_BINDING - _IOWR(VFIO_TYPE, VFIO_BASE + 20, - * struct vfio_iommu_type1_set_msi_binding) - * - * Pass a stage 1 MSI doorbell mapping to the host so that this - * latter can build a nested stage2 mapping. Or conversely tear - * down a previously bound stage 1 MSI binding. - */ -struct vfio_iommu_type1_set_msi_binding { - __u32 argsz; - __u32 flags; -#define VFIO_IOMMU_BIND_MSI (1 << 0) -#define VFIO_IOMMU_UNBIND_MSI (1 << 1) - __u64 iova; /* MSI guest IOVA */ - /* Fields below are used on BIND */ - __u64 gpa; /* MSI guest physical address */ - __u64 size; /* size of stage1 mapping (bytes) */ -}; -#define VFIO_IOMMU_SET_MSI_BINDING _IO(VFIO_TYPE, VFIO_BASE + 20) - /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
/*
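
A sketch of the MSI binding ioctl this revert deletes, with the UAPI mirrored locally since it no longer exists in <linux/vfio.h>. "container_fd" and bind_guest_msi() are illustrative; the struct layout is copied from the removed hunk.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct vfio_iommu_type1_set_msi_binding {
	uint32_t argsz;
	uint32_t flags;
#define VFIO_IOMMU_BIND_MSI	(1 << 0)
#define VFIO_IOMMU_UNBIND_MSI	(1 << 1)
	uint64_t iova;	/* MSI guest IOVA */
	uint64_t gpa;	/* MSI guest physical address (BIND only) */
	uint64_t size;	/* size of the stage-1 mapping, in bytes */
};
#define VFIO_IOMMU_SET_MSI_BINDING _IO(VFIO_TYPE, VFIO_BASE + 20)

/* Pass one guest gIOVA -> gDB doorbell mapping to the host. */
static int bind_guest_msi(int container_fd, uint64_t giova,
			  uint64_t gpa, uint64_t size)
{
	struct vfio_iommu_type1_set_msi_binding bind = {
		.argsz = sizeof(bind),
		.flags = VFIO_IOMMU_BIND_MSI,
		.iova  = giova,
		.gpa   = gpa,
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_SET_MSI_BINDING, &bind);
}
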
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 04ba12c4366f5369157419e73e37e444e2faa232.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/vfio_iommu_type1.c | 60 --------------------------------- include/uapi/linux/vfio.h | 13 ------- 2 files changed, 73 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 06e03a1bd6ce..b2fad085697c 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -159,36 +159,6 @@ struct vfio_regions { #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) #define DIRTY_BITMAP_SIZE_MAX DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
-#define WAITED 1 - -struct domain_capsule { - struct iommu_domain *domain; - void *data; -}; - -/* iommu->lock must be held */ -static int -vfio_iommu_lookup_dev(struct vfio_iommu *iommu, - int (*fn)(struct device *dev, void *data), - unsigned long arg) -{ - struct domain_capsule dc = {.data = &arg}; - struct vfio_domain *d; - struct vfio_group *g; - int ret = 0; - - list_for_each_entry(d, &iommu->domain_list, next) { - dc.domain = d->domain; - list_for_each_entry(g, &d->group_list, next) { - ret = iommu_group_for_each_dev(g->iommu_group, - &dc, fn); - if (ret) - break; - } - } - return ret; -} - static int put_pfn(unsigned long pfn, int prot);
static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu, @@ -3080,13 +3050,6 @@ vfio_attach_pasid_table(struct vfio_iommu *iommu, unsigned long arg) mutex_unlock(&iommu->lock); return ret; } -static int vfio_cache_inv_fn(struct device *dev, void *data) -{ - struct domain_capsule *dc = (struct domain_capsule *)data; - unsigned long arg = *(unsigned long *)dc->data; - - return iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg); -}
static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps) @@ -3264,27 +3227,6 @@ static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool enable) } }
-static int vfio_iommu_type1_cache_invalidate(struct vfio_iommu *iommu, - unsigned long arg) -{ - struct vfio_iommu_type1_cache_invalidate cache_inv; - unsigned long minsz; - int ret; - - minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate, flags); - - if (copy_from_user(&cache_inv, (void __user *)arg, minsz)) - return -EFAULT; - - if (cache_inv.argsz < minsz || cache_inv.flags) - return -EINVAL; - - mutex_lock(&iommu->lock); - ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn, arg + minsz); - mutex_unlock(&iommu->lock); - return ret; -} - static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, unsigned long arg) { @@ -3592,8 +3534,6 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return vfio_iommu_type1_unbind(iommu, arg); case VFIO_IOMMU_SET_PASID_TABLE: return vfio_iommu_type1_set_pasid_table(iommu, arg); - case VFIO_IOMMU_CACHE_INVALIDATE: - return vfio_iommu_type1_cache_invalidate(iommu, arg); default: return -ENOTTY; } diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 7ea68500b508..38ab2b0d35e0 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1310,19 +1310,6 @@ struct vfio_iommu_type1_set_pasid_table {
#define VFIO_IOMMU_SET_PASID_TABLE _IO(VFIO_TYPE, VFIO_BASE + 18)
-/** - * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, - * struct vfio_iommu_type1_cache_invalidate) - * - * Propagate guest IOMMU cache invalidation to the host. - */ -struct vfio_iommu_type1_cache_invalidate { - __u32 argsz; - __u32 flags; - struct iommu_cache_invalidate_info info; -}; -#define VFIO_IOMMU_CACHE_INVALIDATE _IO(VFIO_TYPE, VFIO_BASE + 19) - /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
/*
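
A sketch of the cache invalidation ioctl this revert deletes. The vfio wrapper struct is mirrored locally; struct iommu_cache_invalidate_info is assumed to come from the 5.10-era <linux/iommu.h> UAPI. "container_fd" is illustrative. Note the removed handler required flags to be zero and read the iommu info immediately after the wrapper header.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

struct vfio_iommu_type1_cache_invalidate {
	uint32_t argsz;
	uint32_t flags;
	struct iommu_cache_invalidate_info info;
};
#define VFIO_IOMMU_CACHE_INVALIDATE _IO(VFIO_TYPE, VFIO_BASE + 19)

/* Domain-wide IOTLB invalidation, the simplest granularity. */
static int invalidate_domain_tlb(int container_fd)
{
	struct vfio_iommu_type1_cache_invalidate inv;

	memset(&inv, 0, sizeof(inv));
	inv.argsz = sizeof(inv);
	inv.flags = 0;	/* must be zero per the removed code */
	inv.info.argsz = sizeof(inv.info);
	inv.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
	inv.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
	inv.info.granularity = IOMMU_INV_GRANU_DOMAIN;

	return ioctl(container_fd, VFIO_IOMMU_CACHE_INVALIDATE, &inv);
}
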
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 4b0423579002261f8ea84ec82ce1039ec174025a.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/vfio/vfio_iommu_type1.c | 58 --------------------------------- include/uapi/linux/vfio.h | 20 ------------ 2 files changed, 78 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index b2fad085697c..8c52f40504c8 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3018,39 +3018,6 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu, return ret; }
-static void -vfio_detach_pasid_table(struct vfio_iommu *iommu) -{ - struct vfio_domain *d; - - mutex_lock(&iommu->lock); - list_for_each_entry(d, &iommu->domain_list, next) - iommu_detach_pasid_table(d->domain); - - mutex_unlock(&iommu->lock); -} - -static int -vfio_attach_pasid_table(struct vfio_iommu *iommu, unsigned long arg) -{ - struct vfio_domain *d; - int ret = 0; - - mutex_lock(&iommu->lock); - - list_for_each_entry(d, &iommu->domain_list, next) { - ret = iommu_uapi_attach_pasid_table(d->domain, (void __user *)arg); - if (ret) { - list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) - iommu_detach_pasid_table(d->domain); - break; - } - } - - mutex_unlock(&iommu->lock); - return ret; -} - static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps) { @@ -3489,29 +3456,6 @@ static long vfio_iommu_type1_unbind(struct vfio_iommu *iommu, unsigned long arg) return 0; }
-static int vfio_iommu_type1_set_pasid_table(struct vfio_iommu *iommu, - unsigned long arg) -{ - struct vfio_iommu_type1_set_pasid_table spt; - unsigned long minsz; - - minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table, flags); - - if (copy_from_user(&spt, (void __user *)arg, minsz)) - return -EFAULT; - - if (spt.argsz < minsz) - return -EINVAL; - - if (spt.flags == VFIO_PASID_TABLE_FLAG_SET) { - return vfio_attach_pasid_table(iommu, arg + minsz); - } else if (spt.flags == VFIO_PASID_TABLE_FLAG_UNSET) { - vfio_detach_pasid_table(iommu); - return 0; - } - return -EINVAL; -} - static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) { @@ -3532,8 +3476,6 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, return vfio_iommu_type1_bind(iommu, arg); case VFIO_IOMMU_UNBIND: return vfio_iommu_type1_unbind(iommu, arg); - case VFIO_IOMMU_SET_PASID_TABLE: - return vfio_iommu_type1_set_pasid_table(iommu, arg); default: return -ENOTTY; } diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 38ab2b0d35e0..52658db9aaf7 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -14,7 +14,6 @@
#include <linux/types.h> #include <linux/ioctl.h> -#include <linux/iommu.h>
#define VFIO_API_VERSION 0
@@ -1291,25 +1290,6 @@ struct vfio_iommu_type1_bind { */ #define VFIO_IOMMU_UNBIND _IO(VFIO_TYPE, VFIO_BASE + 23)
-/* - * VFIO_IOMMU_SET_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 18, - * struct vfio_iommu_type1_set_pasid_table) - * - * The SET operation passes a PASID table to the host while the - * UNSET operation detaches the one currently programmed. It is - * allowed to "SET" the table several times without unsetting as - * long as the table config does not stay IOMMU_PASID_CONFIG_TRANSLATE. - */ -struct vfio_iommu_type1_set_pasid_table { - __u32 argsz; - __u32 flags; -#define VFIO_PASID_TABLE_FLAG_SET (1 << 0) -#define VFIO_PASID_TABLE_FLAG_UNSET (1 << 1) - struct iommu_pasid_table_config config; /* used on SET */ -}; - -#define VFIO_IOMMU_SET_PASID_TABLE _IO(VFIO_TYPE, VFIO_BASE + 18) - /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
/*
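
A sketch of the PASID table ioctl this revert deletes, showing the UNSET path, which ignores the config payload. The wrapper struct is mirrored locally from the removed hunk; struct iommu_pasid_table_config exists only in this tree's out-of-tree <linux/iommu.h>, so treat the whole snippet as a sketch against that UAPI. "container_fd" is illustrative.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

struct vfio_iommu_type1_set_pasid_table {
	uint32_t argsz;
	uint32_t flags;
#define VFIO_PASID_TABLE_FLAG_SET	(1 << 0)
#define VFIO_PASID_TABLE_FLAG_UNSET	(1 << 1)
	struct iommu_pasid_table_config config;	/* used on SET only */
};
#define VFIO_IOMMU_SET_PASID_TABLE _IO(VFIO_TYPE, VFIO_BASE + 18)

/* Detach the currently programmed guest PASID table. */
static int detach_pasid_table(int container_fd)
{
	struct vfio_iommu_type1_set_pasid_table spt;

	memset(&spt, 0, sizeof(spt));
	spt.argsz = sizeof(spt);
	spt.flags = VFIO_PASID_TABLE_FLAG_UNSET;

	return ioctl(container_fd, VFIO_IOMMU_SET_PASID_TABLE, &spt);
}
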
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 7733f2e7a689598588f6074acca8b9424a76ea4a.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 40 ++------------------- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 4 --- 2 files changed, 2 insertions(+), 42 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 74eb98bd3e6e..b061ed78c202 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1700,7 +1700,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) u32 perm = 0; struct arm_smmu_master *master; bool ssid_valid = evt[0] & EVTQ_0_SSV; - u8 type = FIELD_GET(EVTQ_0_ID, evt[0]); u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]); struct iommu_fault_event fault_evt = { }; struct iommu_fault *flt = &fault_evt.fault; @@ -1753,6 +1752,8 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) } else { flt->type = IOMMU_FAULT_DMA_UNRECOV; flt->event = (struct iommu_fault_unrecoverable) { + .reason = reason, + .flags = IOMMU_FAULT_UNRECOV_ADDR_VALID, .perm = perm, .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), }; @@ -1761,43 +1762,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID; flt->event.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]); } - - switch (type) { - case EVT_ID_TRANSLATION_FAULT: - flt->event.reason = IOMMU_FAULT_REASON_PTE_FETCH; - flt->event.flags |= IOMMU_FAULT_UNRECOV_ADDR_VALID; - break; - case EVT_ID_ADDR_SIZE_FAULT: - flt->event.reason = IOMMU_FAULT_REASON_OOR_ADDRESS; - flt->event.flags |= IOMMU_FAULT_UNRECOV_ADDR_VALID; - break; - case EVT_ID_ACCESS_FAULT: - flt->event.reason = IOMMU_FAULT_REASON_ACCESS; - flt->event.flags |= IOMMU_FAULT_UNRECOV_ADDR_VALID; - break; - case EVT_ID_PERMISSION_FAULT: - flt->event.reason = IOMMU_FAULT_REASON_PERMISSION; - flt->event.flags |= IOMMU_FAULT_UNRECOV_ADDR_VALID; - break; - case EVT_ID_BAD_SUBSTREAMID: - flt->event.reason = IOMMU_FAULT_REASON_PASID_INVALID; - break; - case EVT_ID_CD_FETCH: - flt->event.reason = IOMMU_FAULT_REASON_PASID_FETCH; - flt->event.flags |= IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID; - break; - case EVT_ID_BAD_CD: - flt->event.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY; - break; - case EVT_ID_WALK_EABT: - flt->event.reason = IOMMU_FAULT_REASON_WALK_EABT; - flt->event.flags |= IOMMU_FAULT_UNRECOV_ADDR_VALID | - IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID; - break; - default: - /* TODO: report other unrecoverable faults. */ - return -EFAULT; - } }
mutex_lock(&smmu->streams_mutex); diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index d0f3181a22c5..c744d812fc8d 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -433,10 +433,6 @@
#define EVTQ_0_ID GENMASK_ULL(7, 0)
-#define EVT_ID_BAD_SUBSTREAMID 0x08 -#define EVT_ID_CD_FETCH 0x09 -#define EVT_ID_BAD_CD 0x0a -#define EVT_ID_WALK_EABT 0x0b #define EVT_ID_TRANSLATION_FAULT 0x10 #define EVT_ID_ADDR_SIZE_FAULT 0x11 #define EVT_ID_ACCESS_FAULT 0x12
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 74eeb1a933fe92b75c7140063dd3ee2d7ec5872f.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 43 --------------------- 1 file changed, 43 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index b061ed78c202..de07858271d1 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3585,47 +3585,6 @@ static void arm_smmu_get_resv_regions(struct device *dev, iommu_dma_get_resv_regions(dev, head); }
-static int -arm_smmu_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova, phys_addr_t gpa, size_t size) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - struct arm_smmu_device *smmu; - int ret = -EINVAL; - - mutex_lock(&smmu_domain->init_mutex); - smmu = smmu_domain->smmu; - if (!smmu) - goto out; - - if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) - goto out; - - ret = iommu_dma_bind_guest_msi(domain, giova, gpa, size); -out: - mutex_unlock(&smmu_domain->init_mutex); - return ret; -} - -static void -arm_smmu_unbind_guest_msi(struct iommu_domain *domain, dma_addr_t giova) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - struct arm_smmu_device *smmu; - - mutex_lock(&smmu_domain->init_mutex); - smmu = smmu_domain->smmu; - if (!smmu) - goto unlock; - - if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) - goto unlock; - - iommu_dma_unbind_guest_msi(domain, giova); -unlock: - mutex_unlock(&smmu_domain->init_mutex); -} - static int arm_smmu_attach_pasid_table(struct iommu_domain *domain, struct iommu_pasid_table_config *cfg) { @@ -4309,8 +4268,6 @@ static struct iommu_ops arm_smmu_ops = { .attach_pasid_table = arm_smmu_attach_pasid_table, .detach_pasid_table = arm_smmu_detach_pasid_table, .cache_invalidate = arm_smmu_cache_invalidate, - .bind_guest_msi = arm_smmu_bind_guest_msi, - .unbind_guest_msi = arm_smmu_unbind_guest_msi, .dev_has_feat = arm_smmu_dev_has_feature, .dev_feat_enabled = arm_smmu_dev_feature_enabled, .dev_enable_feat = arm_smmu_dev_enable_feature,
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 85412c048741ef1c9d3ae4f1f6218ed15ceac587.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 23 ++------------------- drivers/iommu/iommu.c | 2 -- 2 files changed, 2 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index de07858271d1..c81c6c59b80b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2914,23 +2914,6 @@ static bool arm_smmu_share_msi_domain(struct iommu_domain *domain, return share; }
-static bool arm_smmu_has_hw_msi_resv_region(struct device *dev) -{ - struct iommu_resv_region *region; - bool has_msi_resv_region = false; - LIST_HEAD(resv_regions); - - iommu_get_resv_regions(dev, &resv_regions); - list_for_each_entry(region, &resv_regions, list) { - if (region->type == IOMMU_RESV_MSI) { - has_msi_resv_region = true; - break; - } - } - iommu_put_resv_regions(dev, &resv_regions); - return has_msi_resv_region; -} - static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) { int ret = 0; @@ -2995,12 +2978,10 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) /* * In nested mode we must check all devices belonging to the * domain share the same physical MSI doorbell. Otherwise nested - * stage MSI binding is not supported. Also nested mode is not - * compatible with MSI HW reserved regions. + * stage MSI binding is not supported. */ if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED && - (!arm_smmu_share_msi_domain(domain, dev) || - arm_smmu_has_hw_msi_resv_region(dev))) { + !arm_smmu_share_msi_domain(domain, dev)) { ret = -EINVAL; goto out_unlock; } diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 97953fa27630..d2fbebee719b 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3127,7 +3127,6 @@ void iommu_get_resv_regions(struct device *dev, struct list_head *list) if (ops && ops->get_resv_regions) ops->get_resv_regions(dev, list); } -EXPORT_SYMBOL_GPL(iommu_get_resv_regions);
void iommu_put_resv_regions(struct device *dev, struct list_head *list) { @@ -3136,7 +3135,6 @@ void iommu_put_resv_regions(struct device *dev, struct list_head *list) if (ops && ops->put_resv_regions) ops->put_resv_regions(dev, list); } -EXPORT_SYMBOL_GPL(iommu_put_resv_regions);
/** * generic_iommu_put_resv_regions - Reserved region driver helper
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit b4ddfa737eca0043559055325e6d8c0483425065.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 --------------------- 1 file changed, 41 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index c81c6c59b80b..52df1e26976b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2883,37 +2883,6 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master) arm_smmu_install_ste_for_dev(master); }
-static bool arm_smmu_share_msi_domain(struct iommu_domain *domain, - struct device *dev) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - struct irq_domain *irqd = dev_get_msi_domain(dev); - struct arm_smmu_master *master; - unsigned long flags; - bool share = false; - - if (!irqd) - return true; - - spin_lock_irqsave(&smmu_domain->devices_lock, flags); - list_for_each_entry(master, &smmu_domain->devices, domain_head) { - struct irq_domain *d = dev_get_msi_domain(master->dev); - - if (!d) - continue; - if (irqd != d) { - dev_info(dev, "Nested mode forbids to attach devices " - "using different physical MSI doorbells " - "to the same iommu_domain"); - goto unlock; - } - } - share = true; -unlock: - spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); - return share; -} - static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) { int ret = 0; @@ -2975,16 +2944,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) ret = -EINVAL; goto out_unlock; } - /* - * In nested mode we must check all devices belonging to the - * domain share the same physical MSI doorbell. Otherwise nested - * stage MSI binding is not supported. - */ - if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED && - !arm_smmu_share_msi_domain(domain, dev)) { - ret = -EINVAL; - goto out_unlock; - }
master->domain = smmu_domain;
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 15700dc0010f823a62c8d77f693ce9ad121f75c6.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/dma-iommu.c | 180 +------------------------------------- include/linux/dma-iommu.h | 16 ---- 2 files changed, 4 insertions(+), 192 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 50b3e3a72a00..d1539b7399a9 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -27,15 +27,12 @@ struct iommu_dma_msi_page { struct list_head list; dma_addr_t iova; - dma_addr_t gpa; phys_addr_t phys; - size_t s1_granule; };
enum iommu_dma_cookie_type { IOMMU_DMA_IOVA_COOKIE, IOMMU_DMA_MSI_COOKIE, - IOMMU_DMA_NESTED_MSI_COOKIE, };
struct iommu_dma_cookie { @@ -47,8 +44,6 @@ struct iommu_dma_cookie { dma_addr_t msi_iova; }; struct list_head msi_page_list; - /* used in nested mode only */ - spinlock_t msi_lock;
/* Domain for flush queue callback; NULL if flush queue not in use */ struct iommu_domain *fq_domain; @@ -67,7 +62,6 @@ static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
cookie = kzalloc(sizeof(*cookie), GFP_KERNEL); if (cookie) { - spin_lock_init(&cookie->msi_lock); INIT_LIST_HEAD(&cookie->msi_page_list); cookie->type = type; } @@ -101,17 +95,14 @@ EXPORT_SYMBOL(iommu_get_dma_cookie); * * Users who manage their own IOVA allocation and do not want DMA API support, * but would still like to take advantage of automatic MSI remapping, can use - * this to initialise their own domain appropriately. Users may reserve a + * this to initialise their own domain appropriately. Users should reserve a * contiguous IOVA region, starting at @base, large enough to accommodate the * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address - * used by the devices attached to @domain. The other way round is to provide - * usable iova pages through the iommu_dma_bind_guest_msi API (nested stages - * use case) + * used by the devices attached to @domain. */ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base) { struct iommu_dma_cookie *cookie; - int nesting, ret;
if (domain->type != IOMMU_DOMAIN_UNMANAGED) return -EINVAL; @@ -119,17 +110,11 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base) if (domain->iova_cookie) return -EEXIST;
- ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting); - if (!ret && nesting) - cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE); - else - cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE); - + cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE); if (!cookie) return -ENOMEM;
- if (!nesting) - cookie->msi_iova = base; + cookie->msi_iova = base; domain->iova_cookie = cookie; return 0; } @@ -153,116 +138,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule) put_iova_domain(&cookie->iovad);
- spin_lock(&cookie->msi_lock); list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) { - if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE && msi->phys) { - size_t size = cookie_msi_granule(cookie); - - WARN_ON(iommu_unmap(domain, msi->gpa, size) != size); - } list_del(&msi->list); kfree(msi); } - spin_unlock(&cookie->msi_lock); kfree(cookie); domain->iova_cookie = NULL; } EXPORT_SYMBOL(iommu_put_dma_cookie);
-/** - * iommu_dma_bind_guest_msi - Allows to pass the stage 1 - * binding of a virtual MSI doorbell used by @dev. - * - * @domain: domain handle - * @giova: guest iova - * @gpa: gpa of the virtual doorbell - * @size: size of the granule used for the stage1 mapping - * - * In nested stage use case, the user can provide IOVA/IPA bindings - * corresponding to a guest MSI stage 1 mapping. When the host needs - * to map its own MSI doorbells, it can use @gpa as stage 2 input - * and map it onto the physical MSI doorbell. - */ -int iommu_dma_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova, phys_addr_t gpa, size_t size) -{ - struct iommu_dma_cookie *cookie = domain->iova_cookie; - struct iommu_dma_msi_page *msi; - int ret = 0; - - if (!cookie) - return -EINVAL; - - if (cookie->type != IOMMU_DMA_NESTED_MSI_COOKIE) - return -EINVAL; - - /* - * we currently do not support S1 granule larger than S2 one - * as this would oblige to have multiple S2 mappings for a - * single S1 one - */ - if (size > cookie_msi_granule(cookie)) - return -EINVAL; - - giova = giova & ~(dma_addr_t)(size - 1); - gpa = gpa & ~(phys_addr_t)(size - 1); - - spin_lock(&cookie->msi_lock); - - list_for_each_entry(msi, &cookie->msi_page_list, list) { - if (msi->iova == giova) - goto unlock; /* this page is already registered */ - } - - msi = kzalloc(sizeof(*msi), GFP_ATOMIC); - if (!msi) { - ret = -ENOMEM; - goto unlock; - } - - msi->iova = giova; - msi->gpa = gpa; - msi->s1_granule = size; - list_add(&msi->list, &cookie->msi_page_list); -unlock: - spin_unlock(&cookie->msi_lock); - return ret; -} -EXPORT_SYMBOL(iommu_dma_bind_guest_msi); - -void iommu_dma_unbind_guest_msi(struct iommu_domain *domain, dma_addr_t giova) -{ - struct iommu_dma_cookie *cookie = domain->iova_cookie; - struct iommu_dma_msi_page *msi; - - if (!cookie) - return; - - if (cookie->type != IOMMU_DMA_NESTED_MSI_COOKIE) - return; - - spin_lock(&cookie->msi_lock); - - list_for_each_entry(msi, &cookie->msi_page_list, list) { - dma_addr_t aligned_giova = - giova & ~(dma_addr_t)(msi->s1_granule - 1); - - if (msi->iova == aligned_giova) { - if (msi->phys) { - /* unmap the stage 2 */ - size_t size = cookie_msi_granule(cookie); - - WARN_ON(iommu_unmap(domain, msi->gpa, size) != size); - } - list_del(&msi->list); - kfree(msi); - break; - } - } - spin_unlock(&cookie->msi_lock); -} -EXPORT_SYMBOL(iommu_dma_unbind_guest_msi); - /** * iommu_dma_get_resv_regions - Reserved region driver helper * @dev: Device from iommu_get_resv_regions() @@ -1314,58 +1198,6 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) dev_name(dev)); }
-/* - * iommu_dma_get_nested_msi_page - Returns a nested stage MSI page - * mapping translating into the physical doorbell address @msi_addr - * - * In nested mode, the userspace provides the guest - * gIOVA - gDB stage 1 mappings. When we need to build a stage 2 - * mapping for a physical doorbell (@msi_addr), we look up - * for an unused S1 mapping and map the gDB onto @msi_addr - */ -static struct iommu_dma_msi_page * -iommu_dma_get_nested_msi_page(struct iommu_domain *domain, - phys_addr_t msi_addr) -{ - struct iommu_dma_cookie *cookie = domain->iova_cookie; - struct iommu_dma_msi_page *iter, *msi_page = NULL; - size_t size = cookie_msi_granule(cookie); - int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO; - - spin_lock(&cookie->msi_lock); - list_for_each_entry(iter, &cookie->msi_page_list, list) - if (iter->phys == msi_addr) { - msi_page = iter; - goto unlock; - } - - /* - * No nested mapping exists for the physical doorbell, - * look for an unused S1 mapping - */ - list_for_each_entry(iter, &cookie->msi_page_list, list) { - int ret; - - if (iter->phys) - continue; - - /* do the stage 2 mapping */ - ret = iommu_map_atomic(domain, iter->gpa, msi_addr, size, prot); - if (ret) { - pr_warn_once("MSI S2 mapping 0x%llx -> 0x%llx failed (%d)\n", - iter->gpa, msi_addr, ret); - goto unlock; - } - iter->phys = msi_addr; - msi_page = iter; - goto unlock; - } - pr_warn_once("No usable S1 MSI mapping found\n"); -unlock: - spin_unlock(&cookie->msi_lock); - return msi_page; -} - static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev, phys_addr_t msi_addr, struct iommu_domain *domain) { @@ -1376,10 +1208,6 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev, size_t size = cookie_msi_granule(cookie);
msi_addr &= ~(phys_addr_t)(size - 1); - - if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE) - return iommu_dma_get_nested_msi_page(domain, msi_addr); - list_for_each_entry(msi_page, &cookie->msi_page_list, list) if (msi_page->phys == msi_addr) return msi_page; diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index f112ecdb4af6..2112f21f73d8 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -12,7 +12,6 @@ #include <linux/dma-mapping.h> #include <linux/iommu.h> #include <linux/msi.h> -#include <uapi/linux/iommu.h>
/* Domain management interface for IOMMU drivers */ int iommu_get_dma_cookie(struct iommu_domain *domain); @@ -37,9 +36,6 @@ void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg);
void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list); -int iommu_dma_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t iova, phys_addr_t gpa, size_t size); -void iommu_dma_unbind_guest_msi(struct iommu_domain *domain, dma_addr_t giova);
#else /* CONFIG_IOMMU_DMA */
@@ -78,18 +74,6 @@ static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc, { }
-static inline int -iommu_dma_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t iova, phys_addr_t gpa, size_t size) -{ - return -ENODEV; -} - -static inline void -iommu_dma_unbind_guest_msi(struct iommu_domain *domain, dma_addr_t giova) -{ -} - static inline void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list) { }
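
A simplified, standalone sketch of the lookup this revert removes (iommu_dma_get_nested_msi_page): reuse the slot already mapped to the physical doorbell, otherwise claim an unused gIOVA/gDB pair registered by the guest and record the stage-2 target. struct msi_slot and the array are illustrative; the real code kept a spinlock-protected list and established the stage-2 mapping with iommu_map_atomic().

#include <stddef.h>
#include <stdint.h>

struct msi_slot {
	uint64_t giova;	/* guest IOVA of the virtual doorbell (stage 1) */
	uint64_t gpa;	/* guest PA of the doorbell (stage-2 input) */
	uint64_t phys;	/* physical doorbell, 0 while the slot is unused */
};

static struct msi_slot *get_nested_msi_slot(struct msi_slot *slots,
					    size_t n, uint64_t msi_addr)
{
	size_t i;

	/* a stage-2 mapping for this physical doorbell already exists */
	for (i = 0; i < n; i++)
		if (slots[i].phys == msi_addr)
			return &slots[i];

	/* otherwise claim an unused slot: map slots[i].gpa -> msi_addr */
	for (i = 0; i < n; i++)
		if (!slots[i].phys) {
			slots[i].phys = msi_addr;
			return &slots[i];
		}

	return NULL;	/* no usable stage-1 binding was registered */
}
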
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 04039cc97a8839f000fb8cfaa71d84ea0bae7850.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 88 --------------------- 1 file changed, 88 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 52df1e26976b..f4d2e31b2afb 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3612,93 +3612,6 @@ static void arm_smmu_detach_pasid_table(struct iommu_domain *domain) mutex_unlock(&smmu_domain->init_mutex); }
-static int -arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev, - struct iommu_cache_invalidate_info *inv_info) -{ - struct arm_smmu_cmdq_ent cmd = {.opcode = CMDQ_OP_TLBI_NSNH_ALL}; - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - struct arm_smmu_device *smmu = smmu_domain->smmu; - - if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) - return -EINVAL; - - if (!smmu) - return -EINVAL; - - if (inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1) - return -EINVAL; - - if (inv_info->cache & IOMMU_CACHE_INV_TYPE_PASID || - inv_info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB) { - return -ENOENT; - } - - if (!(inv_info->cache & IOMMU_CACHE_INV_TYPE_IOTLB)) - return -EINVAL; - - /* IOTLB invalidation */ - - switch (inv_info->granularity) { - case IOMMU_INV_GRANU_PASID: - { - struct iommu_inv_pasid_info *info = - &inv_info->granu.pasid_info; - - if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID) - return -ENOENT; - if (!(info->flags & IOMMU_INV_PASID_FLAGS_ARCHID)) - return -EINVAL; - - __arm_smmu_tlb_inv_context(smmu_domain, info->archid); - return 0; - } - case IOMMU_INV_GRANU_ADDR: - { - struct iommu_inv_addr_info *info = &inv_info->granu.addr_info; - size_t granule_size = info->granule_size; - size_t size = info->nb_granules * info->granule_size; - bool leaf = info->flags & IOMMU_INV_ADDR_FLAGS_LEAF; - int tg; - - if (info->flags & IOMMU_INV_ADDR_FLAGS_PASID) - return -ENOENT; - - if (!(info->flags & IOMMU_INV_ADDR_FLAGS_ARCHID)) - break; - - tg = __ffs(granule_size); - if (granule_size & ~(1 << tg)) - return -EINVAL; - /* - * When RIL is not supported, make sure the granule size that is - * passed is supported. In RIL mode, this is enforced in - * __arm_smmu_tlb_inv_range() - */ - if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV) && - !(granule_size & smmu_domain->domain.pgsize_bitmap)) { - tg = __ffs(smmu_domain->domain.pgsize_bitmap); - granule_size = 1 << tg; - size = size >> tg; - } - - arm_smmu_tlb_inv_range_domain(info->addr, size, - granule_size, leaf, - info->archid, smmu_domain); - return 0; - } - case IOMMU_INV_GRANU_DOMAIN: - break; - default: - return -EINVAL; - } - - /* Global S1 invalidation */ - cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; - arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd); - return 0; -} - static bool arm_smmu_dev_has_feature(struct device *dev, enum iommu_dev_features feat) { @@ -4207,7 +4120,6 @@ static struct iommu_ops arm_smmu_ops = { .put_resv_regions = generic_iommu_put_resv_regions, .attach_pasid_table = arm_smmu_attach_pasid_table, .detach_pasid_table = arm_smmu_detach_pasid_table, - .cache_invalidate = arm_smmu_cache_invalidate, .dev_has_feat = arm_smmu_dev_has_feature, .dev_feat_enabled = arm_smmu_dev_feature_enabled, .dev_enable_feat = arm_smmu_dev_enable_feature,
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit eeb79c56db25e457d0e0bf14db747cbd7d456a93.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 44 +++++---------------- 1 file changed, 10 insertions(+), 34 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index f4d2e31b2afb..35005a5edbe0 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2191,9 +2191,9 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid, }
/* IO_PGTABLE API */ -static void __arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain, - int ext_asid) +static void arm_smmu_tlb_inv_context(void *cookie) { + struct arm_smmu_domain *smmu_domain = cookie; struct arm_smmu_device *smmu = smmu_domain->smmu; struct arm_smmu_cmdq_ent cmd;
@@ -2204,12 +2204,7 @@ static void __arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain, * insertion to guarantee those are observed before the TLBI. Do be * careful, 007. */ - if (ext_asid >= 0) { /* guest stage 1 invalidation */ - cmd.opcode = CMDQ_OP_TLBI_NH_ASID; - cmd.tlbi.asid = ext_asid; - cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; - arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd); - } else if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid); } else { cmd.opcode = CMDQ_OP_TLBI_S12_VMALL; @@ -2224,13 +2219,6 @@ static void __arm_smmu_tlb_inv_context(struct arm_smmu_domain *smmu_domain,
}
-static void arm_smmu_tlb_inv_context(void *cookie) -{ - struct arm_smmu_domain *smmu_domain = cookie; - - __arm_smmu_tlb_inv_context(smmu_domain, -1); -} - static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, unsigned long iova, size_t size, size_t granule, @@ -2292,10 +2280,9 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, arm_smmu_preempt_enable(smmu); }
-static void -arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size, - size_t granule, bool leaf, int ext_asid, - struct arm_smmu_domain *smmu_domain) +static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size, + size_t granule, bool leaf, + struct arm_smmu_domain *smmu_domain) { struct arm_smmu_cmdq_ent cmd = { .tlbi = { @@ -2303,16 +2290,7 @@ arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size, }, };
- if (ext_asid >= 0) { /* guest stage 1 invalidation */ - /* - * At the moment the guest only uses NS-EL1, to be - * revisited when nested virt gets supported with E2H - * exposed. - */ - cmd.opcode = CMDQ_OP_TLBI_NH_VA; - cmd.tlbi.asid = ext_asid; - cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; - } else if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { cmd.opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ? CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA; cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid; @@ -2320,7 +2298,6 @@ arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size, cmd.opcode = CMDQ_OP_TLBI_S2_IPA; cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; } - __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
/* @@ -2363,7 +2340,7 @@ static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather, static void arm_smmu_tlb_inv_walk(unsigned long iova, size_t size, size_t granule, void *cookie) { - arm_smmu_tlb_inv_range_domain(iova, size, granule, false, -1, cookie); + arm_smmu_tlb_inv_range_domain(iova, size, granule, false, cookie); }
static const struct iommu_flush_ops arm_smmu_flush_ops = { @@ -2999,9 +2976,8 @@ static void arm_smmu_iotlb_sync(struct iommu_domain *domain, { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
- arm_smmu_tlb_inv_range_domain(gather->start, - gather->end - gather->start + 1, - gather->pgsize, true, -1, smmu_domain); + arm_smmu_tlb_inv_range_domain(gather->start, gather->end - gather->start + 1, + gather->pgsize, true, smmu_domain); }
static phys_addr_t
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit c47df7b65d78d7589df949d74e6242a77c6bc2be.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 93 --------------------- 1 file changed, 93 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 35005a5edbe0..a5e2942afe38 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1202,10 +1202,6 @@ static void arm_smmu_write_cd_l1_desc(__le64 *dst, WRITE_ONCE(*dst, cpu_to_le64(val)); }
-/* - * Must not be used in case of nested mode where the CD table is owned - * by the guest - */ static __le64 *arm_smmu_get_cd_ptr(struct arm_smmu_domain *smmu_domain, u32 ssid) { @@ -3501,93 +3497,6 @@ static void arm_smmu_get_resv_regions(struct device *dev, iommu_dma_get_resv_regions(dev, head); }
-static int arm_smmu_attach_pasid_table(struct iommu_domain *domain, - struct iommu_pasid_table_config *cfg) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - struct arm_smmu_master *master; - struct arm_smmu_device *smmu; - unsigned long flags; - int ret = -EINVAL; - - if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3) - return -EINVAL; - - if (cfg->version != PASID_TABLE_CFG_VERSION_1 || - cfg->vendor_data.smmuv3.version != PASID_TABLE_SMMUV3_CFG_VERSION_1) - return -EINVAL; - - mutex_lock(&smmu_domain->init_mutex); - - smmu = smmu_domain->smmu; - - if (!smmu) - goto out; - - if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) - goto out; - - switch (cfg->config) { - case IOMMU_PASID_CONFIG_ABORT: - smmu_domain->s1_cfg.set = false; - smmu_domain->abort = true; - break; - case IOMMU_PASID_CONFIG_BYPASS: - smmu_domain->s1_cfg.set = false; - smmu_domain->abort = false; - break; - case IOMMU_PASID_CONFIG_TRANSLATE: - /* we do not support S1 <-> S1 transitions */ - if (smmu_domain->s1_cfg.set) - goto out; - - /* - * we currently support a single CD so s1fmt and s1dss - * fields are also ignored - */ - if (cfg->pasid_bits) - goto out; - - smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr; - smmu_domain->s1_cfg.set = true; - smmu_domain->abort = false; - break; - default: - goto out; - } - spin_lock_irqsave(&smmu_domain->devices_lock, flags); - list_for_each_entry(master, &smmu_domain->devices, domain_head) - arm_smmu_install_ste_for_dev(master); - spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); - ret = 0; -out: - mutex_unlock(&smmu_domain->init_mutex); - return ret; -} - -static void arm_smmu_detach_pasid_table(struct iommu_domain *domain) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - struct arm_smmu_master *master; - unsigned long flags; - - mutex_lock(&smmu_domain->init_mutex); - - if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED) - goto unlock; - - smmu_domain->s1_cfg.set = false; - smmu_domain->abort = false; - - spin_lock_irqsave(&smmu_domain->devices_lock, flags); - list_for_each_entry(master, &smmu_domain->devices, domain_head) - arm_smmu_install_ste_for_dev(master); - spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); - -unlock: - mutex_unlock(&smmu_domain->init_mutex); -} - static bool arm_smmu_dev_has_feature(struct device *dev, enum iommu_dev_features feat) { @@ -4094,8 +4003,6 @@ static struct iommu_ops arm_smmu_ops = { .of_xlate = arm_smmu_of_xlate, .get_resv_regions = arm_smmu_get_resv_regions, .put_resv_regions = generic_iommu_put_resv_regions, - .attach_pasid_table = arm_smmu_attach_pasid_table, - .detach_pasid_table = arm_smmu_detach_pasid_table, .dev_has_feat = arm_smmu_dev_has_feature, .dev_feat_enabled = arm_smmu_dev_feature_enabled, .dev_enable_feat = arm_smmu_dev_enable_feature,
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 54cb4594c8396fb3dcb846d13c64ebcd72f0aabc.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 55 +++------------------ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 - 2 files changed, 8 insertions(+), 49 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index a5e2942afe38..6f23c5ca7abd 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1461,8 +1461,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, * 3. Update Config, sync */ u64 val = le64_to_cpu(dst[0]); - bool s1_live = false, s2_live = false, ste_live; - bool abort, translate = false; + bool ste_live = false; struct arm_smmu_device *smmu = NULL; struct arm_smmu_s1_cfg *s1_cfg; struct arm_smmu_s2_cfg *s2_cfg; @@ -1502,7 +1501,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, default: break; } - translate = s1_cfg->set || s2_cfg->set; }
if (val & STRTAB_STE_0_V) { @@ -1510,36 +1508,23 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, case STRTAB_STE_0_CFG_BYPASS: break; case STRTAB_STE_0_CFG_S1_TRANS: - s1_live = true; - break; case STRTAB_STE_0_CFG_S2_TRANS: - s2_live = true; - break; - case STRTAB_STE_0_CFG_NESTED: - s1_live = true; - s2_live = true; + ste_live = true; break; case STRTAB_STE_0_CFG_ABORT: + BUG_ON(!disable_bypass); break; default: BUG(); /* STE corruption */ } }
- ste_live = s1_live || s2_live; - /* Nuke the existing STE_0 value, as we're going to rewrite it */ val = STRTAB_STE_0_V;
/* Bypass/fault */ - - if (!smmu_domain) - abort = disable_bypass; - else - abort = smmu_domain->abort; - - if (abort || !translate) { - if (abort) + if (!smmu_domain || !(s1_cfg->set || s2_cfg->set)) { + if (!smmu_domain && disable_bypass) val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT); else val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS); @@ -1557,17 +1542,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, return; }
- if (ste_live) { - /* First invalidate the live STE */ - dst[0] = cpu_to_le64(STRTAB_STE_0_CFG_ABORT); - arm_smmu_sync_ste_for_sid(smmu, sid); - } - if (s1_cfg->set) { u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ? STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
- BUG_ON(s1_live); + BUG_ON(ste_live); dst[1] = cpu_to_le64( FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) | FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) | @@ -1589,14 +1568,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, }
if (s2_cfg->set) { - u64 vttbr = s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK; - - if (s2_live) { - u64 s2ttb = le64_to_cpu(dst[3]) & STRTAB_STE_3_S2TTB_MASK; - - BUG_ON(s2ttb != vttbr); - } - + BUG_ON(ste_live); dst[2] = cpu_to_le64( FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) | FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) | @@ -1606,12 +1578,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2R);
- dst[3] = cpu_to_le64(vttbr); + dst[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS); - } else { - dst[2] = 0; - dst[3] = 0; }
if (master->ats_enabled) @@ -2555,14 +2524,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain, return 0; }
- if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED && - (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) || - !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))) { - dev_info(smmu_domain->smmu->dev, - "does not implement two stages\n"); - return -EINVAL; - } - /* Restrict the stage to what we can actually support */ if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1)) smmu_domain->stage = ARM_SMMU_DOMAIN_S2; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index c744d812fc8d..fdf80cf1184c 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -245,7 +245,6 @@ #define STRTAB_STE_0_CFG_BYPASS 4 #define STRTAB_STE_0_CFG_S1_TRANS 5 #define STRTAB_STE_0_CFG_S2_TRANS 6 -#define STRTAB_STE_0_CFG_NESTED 7
#define STRTAB_STE_0_S1FMT GENMASK_ULL(5, 4) #define STRTAB_STE_0_S1FMT_LINEAR 0 @@ -803,7 +802,6 @@ struct arm_smmu_domain { enum arm_smmu_domain_stage stage; struct arm_smmu_s1_cfg s1_cfg; struct arm_smmu_s2_cfg s2_cfg; - bool abort;
struct iommu_domain domain;
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit a07fcc1fc081da9990da18818bafa276ddc227c0.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 47 ++++++++------------- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 ++-- 2 files changed, 22 insertions(+), 33 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6f23c5ca7abd..98b5bce5bfc4 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1463,8 +1463,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, u64 val = le64_to_cpu(dst[0]); bool ste_live = false; struct arm_smmu_device *smmu = NULL; - struct arm_smmu_s1_cfg *s1_cfg; - struct arm_smmu_s2_cfg *s2_cfg; + struct arm_smmu_s1_cfg *s1_cfg = NULL; + struct arm_smmu_s2_cfg *s2_cfg = NULL; struct arm_smmu_domain *smmu_domain = NULL; struct arm_smmu_cmdq_ent prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG, @@ -1479,24 +1479,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, }
if (smmu_domain) { - s1_cfg = &smmu_domain->s1_cfg; - s2_cfg = &smmu_domain->s2_cfg; - switch (smmu_domain->stage) { case ARM_SMMU_DOMAIN_S1: - s1_cfg->set = true; - s2_cfg->set = false; + s1_cfg = &smmu_domain->s1_cfg; break; case ARM_SMMU_DOMAIN_S2: - s1_cfg->set = false; - s2_cfg->set = true; - break; case ARM_SMMU_DOMAIN_NESTED: - /* - * Actual usage of stage 1 depends on nested mode: - * legacy (2d stage only) or true nested mode - */ - s2_cfg->set = true; + s2_cfg = &smmu_domain->s2_cfg; break; default: break; @@ -1523,7 +1512,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, val = STRTAB_STE_0_V;
/* Bypass/fault */ - if (!smmu_domain || !(s1_cfg->set || s2_cfg->set)) { + if (!smmu_domain || !(s1_cfg || s2_cfg)) { if (!smmu_domain && disable_bypass) val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT); else @@ -1542,7 +1531,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, return; }
- if (s1_cfg->set) { + if (s1_cfg) { u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ? STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
@@ -1567,7 +1556,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, FIELD_PREP(STRTAB_STE_0_S1FMT, s1_cfg->s1fmt); }
- if (s2_cfg->set) { + if (s2_cfg) { BUG_ON(ste_live); dst[2] = cpu_to_le64( FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) | @@ -2381,26 +2370,26 @@ static void arm_smmu_domain_free(struct iommu_domain *domain) { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); struct arm_smmu_device *smmu = smmu_domain->smmu; - struct arm_smmu_s1_cfg *s1_cfg = &smmu_domain->s1_cfg; - struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
iommu_put_dma_cookie(domain); free_io_pgtable_ops(smmu_domain->pgtbl_ops);
/* Free the CD and ASID, if we allocated them */ - if (s1_cfg->set) { + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { + struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg; + /* Prevent SVA from touching the CD while we're freeing it */ mutex_lock(&arm_smmu_asid_lock); - if (s1_cfg->cdcfg.cdtab) + if (cfg->cdcfg.cdtab) arm_smmu_free_cd_tables(smmu_domain); - arm_smmu_free_asid(&s1_cfg->cd); + arm_smmu_free_asid(&cfg->cd); mutex_unlock(&arm_smmu_asid_lock); if (smmu_domain->ssid) ioasid_put(smmu_domain->ssid); - } - if (s2_cfg->set) { - if (s2_cfg->vmid) - arm_smmu_bitmap_free(smmu->vmid_map, s2_cfg->vmid); + } else { + struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg; + if (cfg->vmid) + arm_smmu_bitmap_free(smmu->vmid_map, cfg->vmid); }
kfree(smmu_domain); @@ -3699,7 +3688,7 @@ static int arm_smmu_set_mpam(struct arm_smmu_device *smmu,
if (WARN_ON(!domain)) return -EINVAL; - if (WARN_ON(!domain->s1_cfg.set)) + if (WARN_ON(domain->stage != ARM_SMMU_DOMAIN_S1)) return -EINVAL; if (WARN_ON(ssid >= (1 << domain->s1_cfg.s1cdmax))) return -E2BIG; @@ -3822,7 +3811,7 @@ static int arm_smmu_get_mpam(struct arm_smmu_device *smmu,
if (WARN_ON(!domain)) return -EINVAL; - if (WARN_ON(!domain->s1_cfg.set)) + if (WARN_ON(domain->stage != ARM_SMMU_DOMAIN_S1)) return -EINVAL; if (WARN_ON(ssid >= (1 << domain->s1_cfg.s1cdmax))) return -E2BIG; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index fdf80cf1184c..f680cd6dd3bd 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -658,14 +658,12 @@ struct arm_smmu_s1_cfg { struct arm_smmu_ctx_desc cd; u8 s1fmt; u8 s1cdmax; - bool set; };
struct arm_smmu_s2_cfg { u16 vmid; u64 vttbr; u64 vtcr; - bool set; };
struct arm_smmu_strtab_cfg { @@ -800,8 +798,10 @@ struct arm_smmu_domain { atomic_t nr_ats_masters;
enum arm_smmu_domain_stage stage; - struct arm_smmu_s1_cfg s1_cfg; - struct arm_smmu_s2_cfg s2_cfg; + union { + struct arm_smmu_s1_cfg s1_cfg; + struct arm_smmu_s2_cfg s2_cfg; + };
struct iommu_domain domain;
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit 9db83ab7c297b4f0d4d31a22fe389ce31c1ee662.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/iommu.c | 37 ------------------------------------- include/linux/iommu.h | 18 ------------------ 2 files changed, 55 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d2fbebee719b..d53c88c647ae 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2383,43 +2383,6 @@ static void __iommu_detach_device(struct iommu_domain *domain, trace_detach_device_from_domain(dev); }
-/** - * iommu_bind_guest_msi - Passes the stage1 GIOVA/GPA mapping of a - * virtual doorbell - * - * @domain: iommu domain the stage 1 mapping will be attached to - * @iova: iova allocated by the guest - * @gpa: guest physical address of the virtual doorbell - * @size: granule size used for the mapping - * - * The associated IOVA can be reused by the host to create a nested - * stage2 binding mapping translating into the physical doorbell used - * by the devices attached to the domain. - * - * All devices within the domain must share the same physical doorbell. - * A single MSI GIOVA/GPA mapping can be attached to an iommu_domain. - */ - -int iommu_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova, phys_addr_t gpa, size_t size) -{ - if (unlikely(!domain->ops->bind_guest_msi)) - return -ENODEV; - - return domain->ops->bind_guest_msi(domain, giova, gpa, size); -} -EXPORT_SYMBOL_GPL(iommu_bind_guest_msi); - -void iommu_unbind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova) -{ - if (unlikely(!domain->ops->unbind_guest_msi)) - return; - - domain->ops->unbind_guest_msi(domain, giova); -} -EXPORT_SYMBOL_GPL(iommu_unbind_guest_msi); - void iommu_detach_device(struct iommu_domain *domain, struct device *dev) { struct iommu_group *group; diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 95320164dcf3..0e696aec98a5 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -248,8 +248,6 @@ struct iommu_iotlb_gather { * @sva_unbind_gpasid: unbind guest pasid and mm * @attach_pasid_table: attach a pasid table * @detach_pasid_table: detach the pasid table - * @bind_guest_msi: provides a stage1 giova/gpa MSI doorbell mapping - * @unbind_guest_msi: withdraw a stage1 giova/gpa MSI doorbell mapping * @def_domain_type: device default domain type, return value: * - IOMMU_DOMAIN_IDENTITY: must use an identity domain * - IOMMU_DOMAIN_DMA: must use a dma domain @@ -347,10 +345,6 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
- int (*bind_guest_msi)(struct iommu_domain *domain, - dma_addr_t giova, phys_addr_t gpa, size_t size); - void (*unbind_guest_msi)(struct iommu_domain *domain, dma_addr_t giova); - int (*dev_get_config)(struct device *dev, int type, void *data); int (*dev_set_config)(struct device *dev, int type, void *data);
@@ -507,10 +501,6 @@ extern int iommu_attach_pasid_table(struct iommu_domain *domain, extern int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, void __user *udata); extern void iommu_detach_pasid_table(struct iommu_domain *domain); -extern int iommu_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova, phys_addr_t gpa, size_t size); -extern void iommu_unbind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova); extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev); extern struct iommu_domain *iommu_get_dma_domain(struct device *dev); extern size_t iommu_pgsize(struct iommu_domain *domain, @@ -1221,14 +1211,6 @@ iommu_sva_bind_group(struct iommu_group *group, struct mm_struct *mm, return NULL; }
-int iommu_bind_guest_msi(struct iommu_domain *domain, - dma_addr_t giova, phys_addr_t gpa, size_t size) -{ - return -ENODEV; -} -static inline -void iommu_unbind_guest_msi(struct iommu_domain *domain, dma_addr_t giova) {} - static inline int iommu_dev_set_config(struct device *dev, int type, void *data) {
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
This reverts commit dbb4844d2af73302bbcf96669a59d031ba69ca85.
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/iommu/iommu.c | 50 -------------------------------------- include/linux/iommu.h | 13 ++-------- include/uapi/linux/iommu.h | 13 ++++------ 3 files changed, 7 insertions(+), 69 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d53c88c647ae..b888efd65e92 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2311,56 +2311,6 @@ int iommu_attach_pasid_table(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_attach_pasid_table);
-int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, - void __user *uinfo) -{ - struct iommu_pasid_table_config pasid_table_data = { 0 }; - u32 minsz; - - if (unlikely(!domain->ops->attach_pasid_table)) - return -ENODEV; - - /* - * No new spaces can be added before the variable sized union, the - * minimum size is the offset to the union. - */ - minsz = offsetof(struct iommu_pasid_table_config, vendor_data); - - /* Copy minsz from user to get flags and argsz */ - if (copy_from_user(&pasid_table_data, uinfo, minsz)) - return -EFAULT; - - /* Fields before the variable size union are mandatory */ - if (pasid_table_data.argsz < minsz) - return -EINVAL; - - /* PASID and address granu require additional info beyond minsz */ - if (pasid_table_data.version != PASID_TABLE_CFG_VERSION_1) - return -EINVAL; - if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 && - pasid_table_data.argsz < - offsetofend(struct iommu_pasid_table_config, vendor_data.smmuv3)) - return -EINVAL; - - /* - * User might be using a newer UAPI header which has a larger data - * size, we shall support the existing flags within the current - * size. Copy the remaining user data _after_ minsz but not more - * than the current kernel supported size. - */ - if (copy_from_user((void *)&pasid_table_data + minsz, uinfo + minsz, - min_t(u32, pasid_table_data.argsz, sizeof(pasid_table_data)) - minsz)) - return -EFAULT; - - /* Now the argsz is validated, check the content */ - if (pasid_table_data.config < IOMMU_PASID_CONFIG_TRANSLATE || - pasid_table_data.config > IOMMU_PASID_CONFIG_ABORT) - return -EINVAL; - - return domain->ops->attach_pasid_table(domain, &pasid_table_data); -} -EXPORT_SYMBOL_GPL(iommu_uapi_attach_pasid_table); - void iommu_detach_pasid_table(struct iommu_domain *domain) { if (unlikely(!domain->ops->detach_pasid_table)) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 0e696aec98a5..6671e45d3c3b 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -246,12 +246,12 @@ struct iommu_iotlb_gather { * @cache_invalidate: invalidate translation caches * @sva_bind_gpasid: bind guest pasid and mm * @sva_unbind_gpasid: unbind guest pasid and mm - * @attach_pasid_table: attach a pasid table - * @detach_pasid_table: detach the pasid table * @def_domain_type: device default domain type, return value: * - IOMMU_DOMAIN_IDENTITY: must use an identity domain * - IOMMU_DOMAIN_DMA: must use a dma domain * - 0: use the default setting + * @attach_pasid_table: attach a pasid table + * @detach_pasid_table: detach the pasid table * @pgsize_bitmap: bitmap of all possible supported page sizes * @owner: Driver module providing these ops */ @@ -498,8 +498,6 @@ extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev, ioasid_t pasid); extern int iommu_attach_pasid_table(struct iommu_domain *domain, struct iommu_pasid_table_config *cfg); -extern int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, - void __user *udata); extern void iommu_detach_pasid_table(struct iommu_domain *domain); extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev); extern struct iommu_domain *iommu_get_dma_domain(struct device *dev); @@ -1194,13 +1192,6 @@ int iommu_attach_pasid_table(struct iommu_domain *domain, return -ENODEV; }
-static inline -int iommu_uapi_attach_pasid_table(struct iommu_domain *domain, - void __user *uinfo) -{ - return -ENODEV; -} - static inline void iommu_detach_pasid_table(struct iommu_domain *domain) {}
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index 40c28bb0e1bf..bed34a8c9430 100644 --- a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -363,33 +363,30 @@ struct iommu_pasid_smmuv3 { /** * struct iommu_pasid_table_config - PASID table data used to bind guest PASID * table to the host IOMMU - * @argsz: User filled size of this data * @version: API version to prepare for future extensions - * @base_ptr: guest physical address of the PASID table * @format: format of the PASID table + * @base_ptr: guest physical address of the PASID table * @pasid_bits: number of PASID bits used in the PASID table * @config: indicates whether the guest translation stage must * be translated, bypassed or aborted. * @padding: reserved for future use (should be zero) - * @vendor_data.smmuv3: table information when @format is - * %IOMMU_PASID_FORMAT_SMMUV3 + * @smmuv3: table information when @format is %IOMMU_PASID_FORMAT_SMMUV3 */ struct iommu_pasid_table_config { - __u32 argsz; #define PASID_TABLE_CFG_VERSION_1 1 __u32 version; - __u64 base_ptr; #define IOMMU_PASID_FORMAT_SMMUV3 1 __u32 format; + __u64 base_ptr; __u8 pasid_bits; #define IOMMU_PASID_CONFIG_TRANSLATE 1 #define IOMMU_PASID_CONFIG_BYPASS 2 #define IOMMU_PASID_CONFIG_ABORT 3 __u8 config; - __u8 padding[2]; + __u8 padding[6]; union { struct iommu_pasid_smmuv3 smmuv3; - } vendor_data; + }; };
#endif /* _UAPI_IOMMU_H */
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I61SPO CVE: NA
--------------------------------
To stay consistent with the vSVA technical direction of the upstream community, the related patches and bugfixes need to be reverted. At the same time, some steps must be taken to avoid a KABI change (see the sketch below).
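As a rough illustration of how the KABI slot is preserved (a minimal sketch; the actual KABI_DEPRECATE_FN definition in the tree's kabi headers may differ in detail), the reverted callback is replaced by a renamed function-pointer member of identical size and alignment, so the offsets of the members that follow do not move:

	/* Hypothetical sketch, not the tree's real macro. */
	#define KABI_DEPRECATE_FN(_type, _orig, _args...)	\
		_type (*kabi_deprecated_##_orig)(_args);

	struct example_ops {
		void (*live_op)(void);
		/* was: int (*old_op)(int); slot kept, call sites removed */
		KABI_DEPRECATE_FN(int, old_op, int arg)
		void (*other_live_op)(void);
	};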
Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/iommu.h | 5 +++++ include/uapi/linux/iommu.h | 7 ++++--- 2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 6671e45d3c3b..47294a3a398e 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -345,6 +345,11 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
+ KABI_DEPRECATE_FN(int, bind_guest_msi, struct iommu_domain *domain, + dma_addr_t giova, phys_addr_t gpa, size_t size) + KABI_DEPRECATE_FN(void, unbind_guest_msi, struct iommu_domain *domain, + dma_addr_t giova) + int (*dev_get_config)(struct device *dev, int type, void *data); int (*dev_set_config)(struct device *dev, int type, void *data);
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index bed34a8c9430..9ddaf2d22d9a 100644 --- a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -373,20 +373,21 @@ struct iommu_pasid_smmuv3 { * @smmuv3: table information when @format is %IOMMU_PASID_FORMAT_SMMUV3 */ struct iommu_pasid_table_config { + __u32 argsz; #define PASID_TABLE_CFG_VERSION_1 1 __u32 version; + __u64 base_ptr; #define IOMMU_PASID_FORMAT_SMMUV3 1 __u32 format; - __u64 base_ptr; __u8 pasid_bits; #define IOMMU_PASID_CONFIG_TRANSLATE 1 #define IOMMU_PASID_CONFIG_BYPASS 2 #define IOMMU_PASID_CONFIG_ABORT 3 __u8 config; - __u8 padding[6]; + __u8 padding[2]; union { struct iommu_pasid_smmuv3 smmuv3; - }; + } vendor_data; };
#endif /* _UAPI_IOMMU_H */
From: Qi Zheng zhengqi.arch@bytedance.com
Offering: HULK mainline inclusion from mainline-v5.19-rc1 commit 3f913fc5f9745613088d3c569778c9813ab9c129 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I610B5 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
We expect no warnings to be issued when we specify __GFP_NOWARN, but currently paths such as alloc_pages() and kmalloc() still print some warnings. Fix this.

Warnings that report usage problems, however, are deliberately left alone: if such a warning is printed, the usage problem itself should be fixed. Such as the following case:
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
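By contrast, warnings that merely report an allocation failure are routed through the new WARN_ON_ONCE_GFP() helper, which keeps the WARN_ON_ONCE() return value but stays silent when the caller passed __GFP_NOWARN. For example, after this patch the MAX_ORDER sanity check in __alloc_pages() reads:

	/* Warn at most once, and only if the caller did not ask for
	 * silence with __GFP_NOWARN; the allocation still fails. */
	if (WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp))
		return NULL;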
[zhengqi.arch@bytedance.com: v2] Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng zhengqi.arch@bytedance.com Cc: Akinobu Mita akinobu.mita@gmail.com Cc: Vlastimil Babka vbabka@suse.cz Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Jiri Slaby jirislaby@kernel.org Cc: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Andrew Morton akpm@linux-foundation.org
Conflict: mm/internal.h mm/page_alloc.c
Signed-off-by: Ye Weihua yeweihua4@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/fault-inject.h | 2 ++ lib/fault-inject.c | 3 +++ mm/failslab.c | 3 +++ mm/internal.h | 15 +++++++++++++++ mm/page_alloc.c | 16 +++++++++------- 5 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h index e525f6957c49..d506ee960ffd 100644 --- a/include/linux/fault-inject.h +++ b/include/linux/fault-inject.h @@ -20,6 +20,7 @@ struct fault_attr { atomic_t space; unsigned long verbose; bool task_filter; + bool no_warn; unsigned long stacktrace_depth; unsigned long require_start; unsigned long require_end; @@ -39,6 +40,7 @@ struct fault_attr { .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \ .verbose = 2, \ .dname = NULL, \ + .no_warn = false, \ }
#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER diff --git a/lib/fault-inject.c b/lib/fault-inject.c index ce12621b4275..423784d9c058 100644 --- a/lib/fault-inject.c +++ b/lib/fault-inject.c @@ -41,6 +41,9 @@ EXPORT_SYMBOL_GPL(setup_fault_attr);
static void fail_dump(struct fault_attr *attr) { + if (attr->no_warn) + return; + if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) { printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n" "name %pd, interval %lu, probability %lu, " diff --git a/mm/failslab.c b/mm/failslab.c index f92fed91ac23..58df9789f1d2 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -30,6 +30,9 @@ bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB)) return false;
+ if (gfpflags & __GFP_NOWARN) + failslab.attr.no_warn = true; + return should_fail(&failslab.attr, s->object_size); }
diff --git a/mm/internal.h b/mm/internal.h index 917b86b2870c..0c6b1ade7438 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -32,6 +32,21 @@ /* Do not use these with a slab allocator */ #define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
+/* + * Different from WARN_ON_ONCE(), no warning will be issued + * when we specify __GFP_NOWARN. + */ +#define WARN_ON_ONCE_GFP(cond, gfp) ({ \ + static bool __section(".data.once") __warned; \ + int __ret_warn_once = !!(cond); \ + \ + if (unlikely(!(gfp & __GFP_NOWARN) && __ret_warn_once && !__warned)) { \ + __warned = true; \ + WARN_ON(1); \ + } \ + unlikely(__ret_warn_once); \ +}) + void page_writeback_init(void);
vm_fault_t do_swap_page(struct vm_fault *vmf); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 34a4673b909e..179a6d4948af 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3555,6 +3555,9 @@ static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) (gfp_mask & __GFP_DIRECT_RECLAIM)) return false;
+ if (gfp_mask & __GFP_NOWARN) + fail_page_alloc.attr.no_warn = true; + return should_fail(&fail_page_alloc.attr, 1 << order); }
@@ -4130,7 +4133,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, */
/* Exhausted what can be done so it's blame time */ - if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { + if (out_of_memory(&oc) || + WARN_ON_ONCE_GFP(gfp_mask & __GFP_NOFAIL, gfp_mask)) { *did_some_progress = 1;
/* @@ -4890,7 +4894,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * All existing users of the __GFP_NOFAIL are blockable, so warn * of any new users that actually require GFP_NOWAIT */ - if (WARN_ON_ONCE(!can_direct_reclaim)) + if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) goto fail;
/* @@ -4898,7 +4902,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * because we cannot reclaim anything and only can loop waiting * for somebody to do a work for us */ - WARN_ON_ONCE(current->flags & PF_MEMALLOC); + WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
/* * non failing costly orders are a hard requirement which we @@ -4906,7 +4910,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * so that we can identify them and convert them to something * else. */ - WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); + WARN_ON_ONCE_GFP(order > PAGE_ALLOC_COSTLY_ORDER, gfp_mask);
/* * Help non-failing allocations by giving them access to memory @@ -5170,10 +5174,8 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, * There are several places where we assume that the order value is sane * so bail out early if the request is out of bound. */ - if (unlikely(order >= MAX_ORDER)) { - WARN_ON_ONCE(!(gfp & __GFP_NOWARN)); + if (WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp)) return NULL; - }
gfp &= gfp_allowed_mask;
From: Ye Weihua yeweihua4@huawei.com
Offering: HULK hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I610B5 CVE: NA
--------------------------------
__send_signal() invokes __sigqueue_alloc(), which may call a normal printk() to print a failure message. This can cause a deadlock in the scenario reported by syzbot below (tested on 5.10):
       CPU0                    CPU1
       ----                    ----
  lock(&sighand->siglock);
                               lock(&tty->read_wait);
                               lock(&sighand->siglock);
  lock(console_owner);
This patch passes __GFP_NOWARN to __sigqueue_alloc() so that printk() will not be called, avoiding this deadlock.
Syzbot reported the following lockdep error:
====================================================== WARNING: possible circular locking dependency detected 5.10.0-04424-ga472e3c833d3 #1 Not tainted ------------------------------------------------------ syz-executor.2/31970 is trying to acquire lock: ffffa00014066a60 (console_owner){-.-.}-{0:0}, at: console_trylock_spinning+0xf0/0x2e0 kernel/printk/printk.c:1854
but task is already holding lock: ffff0000ddb38a98 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x60/0x260 kernel/signal.c:1322
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&sighand->siglock){-.-.}-{2:2}: validate_chain+0x6dc/0xb0c kernel/locking/lockdep.c:3728 __lock_acquire+0x498/0x940 kernel/locking/lockdep.c:4954 lock_acquire+0x228/0x580 kernel/locking/lockdep.c:5564 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xc0/0x15c kernel/locking/spinlock.c:159 __lock_task_sighand+0xf0/0x370 kernel/signal.c:1396 lock_task_sighand include/linux/sched/signal.h:699 [inline] task_work_add+0x1f8/0x2a0 kernel/task_work.c:58 io_req_task_work_add+0x98/0x10c fs/io_uring.c:2115 __io_async_wake+0x338/0x780 fs/io_uring.c:4984 io_poll_wake+0x40/0x50 fs/io_uring.c:5461 __wake_up_common+0xcc/0x2a0 kernel/sched/wait.c:93 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123 __wake_up+0x1c/0x24 kernel/sched/wait.c:142 pty_set_termios+0x1ac/0x2d0 drivers/tty/pty.c:286 tty_set_termios+0x310/0x46c drivers/tty/tty_ioctl.c:334 set_termios.part.0+0x2dc/0xa50 drivers/tty/tty_ioctl.c:414 set_termios drivers/tty/tty_ioctl.c:368 [inline] tty_mode_ioctl+0x4f4/0xbec drivers/tty/tty_ioctl.c:736 n_tty_ioctl_helper+0x74/0x260 drivers/tty/tty_ioctl.c:883 n_tty_ioctl+0x80/0x3d0 drivers/tty/n_tty.c:2516 tty_ioctl+0x508/0x1100 drivers/tty/tty_io.c:2751 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __arm64_sys_ioctl+0x12c/0x18c fs/ioctl.c:739 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline] invoke_syscall arch/arm64/kernel/syscall.c:48 [inline] el0_svc_common.constprop.0+0xf8/0x420 arch/arm64/kernel/syscall.c:155 do_el0_svc+0x50/0x120 arch/arm64/kernel/syscall.c:217 el0_svc+0x20/0x30 arch/arm64/kernel/entry-common.c:353 el0_sync_handler+0xe4/0x1e0 arch/arm64/kernel/entry-common.c:369 el0_sync+0x148/0x180 arch/arm64/kernel/entry.S:683
-> #3 (&tty->read_wait){....}-{2:2}: validate_chain+0x6dc/0xb0c kernel/locking/lockdep.c:3728 __lock_acquire+0x498/0x940 kernel/locking/lockdep.c:4954 lock_acquire+0x228/0x580 kernel/locking/lockdep.c:5564 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0xa0/0x120 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_poll_double_wake+0x158/0x30c fs/io_uring.c:5093 __wake_up_common+0xcc/0x2a0 kernel/sched/wait.c:93 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123 __wake_up+0x1c/0x24 kernel/sched/wait.c:142 pty_close+0x1bc/0x330 drivers/tty/pty.c:68 tty_release+0x1e0/0x88c drivers/tty/tty_io.c:1761 __fput+0x1dc/0x500 fs/file_table.c:281 ____fput+0x24/0x30 fs/file_table.c:314 task_work_run+0xf4/0x1ec kernel/task_work.c:151 tracehook_notify_resume include/linux/tracehook.h:188 [inline] do_notify_resume+0x378/0x410 arch/arm64/kernel/signal.c:718 work_pending+0xc/0x198
-> #2 (&tty->write_wait){....}-{2:2}: validate_chain+0x6dc/0xb0c kernel/locking/lockdep.c:3728 __lock_acquire+0x498/0x940 kernel/locking/lockdep.c:4954 lock_acquire+0x228/0x580 kernel/locking/lockdep.c:5564 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xc0/0x15c kernel/locking/spinlock.c:159 __wake_up_common_lock+0xb0/0x130 kernel/sched/wait.c:122 __wake_up+0x1c/0x24 kernel/sched/wait.c:142 tty_wakeup+0x54/0xbc drivers/tty/tty_io.c:539 tty_port_default_wakeup+0x38/0x50 drivers/tty/tty_port.c:50 tty_port_tty_wakeup+0x3c/0x50 drivers/tty/tty_port.c:388 uart_write_wakeup+0x38/0x60 drivers/tty/serial/serial_core.c:106 pl011_tx_chars+0x530/0x5c0 drivers/tty/serial/amba-pl011.c:1418 pl011_start_tx_pio drivers/tty/serial/amba-pl011.c:1303 [inline] pl011_start_tx+0x1b4/0x430 drivers/tty/serial/amba-pl011.c:1315 __uart_start.isra.0+0xb4/0xcc drivers/tty/serial/serial_core.c:127 uart_write+0x21c/0x460 drivers/tty/serial/serial_core.c:613 process_output_block+0x120/0x3ac drivers/tty/n_tty.c:590 n_tty_write+0x2c8/0x650 drivers/tty/n_tty.c:2383 do_tty_write drivers/tty/tty_io.c:1028 [inline] file_tty_write.constprop.0+0x2d0/0x520 drivers/tty/tty_io.c:1118 tty_write drivers/tty/tty_io.c:1125 [inline] redirected_tty_write+0xe4/0x104 drivers/tty/tty_io.c:1147 call_write_iter include/linux/fs.h:1960 [inline] new_sync_write+0x264/0x37c fs/read_write.c:515 vfs_write+0x694/0x9d0 fs/read_write.c:602 ksys_write+0xfc/0x200 fs/read_write.c:655 __do_sys_write fs/read_write.c:667 [inline] __se_sys_write fs/read_write.c:664 [inline] __arm64_sys_write+0x50/0x60 fs/read_write.c:664 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline] invoke_syscall arch/arm64/kernel/syscall.c:48 [inline] el0_svc_common.constprop.0+0xf8/0x420 arch/arm64/kernel/syscall.c:155 do_el0_svc+0x50/0x120 arch/arm64/kernel/syscall.c:217 el0_svc+0x20/0x30 arch/arm64/kernel/entry-common.c:353 el0_sync_handler+0xe4/0x1e0 arch/arm64/kernel/entry-common.c:369 el0_sync+0x148/0x180 arch/arm64/kernel/entry.S:683
-> #1 (&port_lock_key){-.-.}-{2:2}: validate_chain+0x6dc/0xb0c kernel/locking/lockdep.c:3728 __lock_acquire+0x498/0x940 kernel/locking/lockdep.c:4954 lock_acquire+0x228/0x580 kernel/locking/lockdep.c:5564 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0xa0/0x120 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] pl011_console_write+0x2f0/0x410 drivers/tty/serial/amba-pl011.c:2263 call_console_drivers.constprop.0+0x1f8/0x3b0 kernel/printk/printk.c:1932 console_unlock+0x36c/0x9ec kernel/printk/printk.c:2553 vprintk_emit+0x40c/0x4b0 kernel/printk/printk.c:2075 vprintk_default+0x48/0x54 kernel/printk/printk.c:2092 vprintk_func+0x1f0/0x40c kernel/printk/printk_safe.c:404 printk+0xbc/0xf0 kernel/printk/printk.c:2123 register_console+0x580/0x790 kernel/printk/printk.c:2905 uart_configure_port.constprop.0+0x4a0/0x4e0 drivers/tty/serial/serial_core.c:2431 uart_add_one_port+0x378/0x550 drivers/tty/serial/serial_core.c:2944 pl011_register_port+0xb4/0x210 drivers/tty/serial/amba-pl011.c:2686 pl011_probe+0x334/0x3ec drivers/tty/serial/amba-pl011.c:2736 amba_probe+0x14c/0x2f0 drivers/amba/bus.c:283 really_probe+0x210/0xa5c drivers/base/dd.c:562 driver_probe_device+0x1c8/0x280 drivers/base/dd.c:747 __device_attach_driver+0x18c/0x260 drivers/base/dd.c:853 bus_for_each_drv+0x120/0x1a0 drivers/base/bus.c:431 __device_attach+0x16c/0x3b4 drivers/base/dd.c:922 device_initial_probe+0x28/0x34 drivers/base/dd.c:971 bus_probe_device+0x124/0x13c drivers/base/bus.c:491 fw_devlink_resume+0x164/0x270 drivers/base/core.c:1601 of_platform_default_populate_init+0xf4/0x114 drivers/of/platform.c:543 do_one_initcall+0x11c/0x770 init/main.c:1217 do_initcall_level+0x364/0x388 init/main.c:1290 do_initcalls+0x90/0xc0 init/main.c:1306 do_basic_setup init/main.c:1326 [inline] kernel_init_freeable+0x57c/0x63c init/main.c:1529 kernel_init+0x1c/0x20c init/main.c:1417 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1034
-> #0 (console_owner){-.-.}-{0:0}: check_prev_add+0xe0/0x105c kernel/locking/lockdep.c:2988 check_prevs_add+0x1c8/0x3d4 kernel/locking/lockdep.c:3113 validate_chain+0x6dc/0xb0c kernel/locking/lockdep.c:3728 __lock_acquire+0x498/0x940 kernel/locking/lockdep.c:4954 lock_acquire+0x228/0x580 kernel/locking/lockdep.c:5564 console_trylock_spinning+0x130/0x2e0 kernel/printk/printk.c:1875 vprintk_emit+0x268/0x4b0 kernel/printk/printk.c:2074 vprintk_default+0x48/0x54 kernel/printk/printk.c:2092 vprintk_func+0x1f0/0x40c kernel/printk/printk_safe.c:404 printk+0xbc/0xf0 kernel/printk/printk.c:2123 fail_dump lib/fault-inject.c:45 [inline] should_fail+0x2a0/0x370 lib/fault-inject.c:146 __should_failslab+0x8c/0xe0 mm/failslab.c:33 should_failslab+0x14/0x2c mm/slab_common.c:1181 slab_pre_alloc_hook mm/slab.h:495 [inline] slab_alloc_node mm/slub.c:2842 [inline] slab_alloc mm/slub.c:2931 [inline] kmem_cache_alloc+0x8c/0xe64 mm/slub.c:2936 __sigqueue_alloc+0x224/0x5a4 kernel/signal.c:437 __send_signal+0x700/0xeac kernel/signal.c:1121 send_signal+0x348/0x6a0 kernel/signal.c:1247 force_sig_info_to_task+0x184/0x260 kernel/signal.c:1339 force_sig_fault_to_task kernel/signal.c:1678 [inline] force_sig_fault+0xb0/0xf0 kernel/signal.c:1685 arm64_force_sig_fault arch/arm64/kernel/traps.c:182 [inline] arm64_notify_die arch/arm64/kernel/traps.c:208 [inline] arm64_notify_die+0xdc/0x160 arch/arm64/kernel/traps.c:199 do_sp_pc_abort+0x4c/0x60 arch/arm64/mm/fault.c:794 el0_pc+0xd8/0x19c arch/arm64/kernel/entry-common.c:309 el0_sync_handler+0x12c/0x1e0 arch/arm64/kernel/entry-common.c:394 el0_sync+0x148/0x180 arch/arm64/kernel/entry.S:683
other info that might help us debug this:
Chain exists of: console_owner --> &tty->read_wait --> &sighand->siglock
Signed-off-by: Ye Weihua yeweihua4@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/signal.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/signal.c b/kernel/signal.c index 6d374d02a2cb..9aec5fdcc8d0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1118,7 +1118,8 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc else override_rlimit = 0;
- q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit); + q = __sigqueue_alloc(sig, t, GFP_ATOMIC | __GFP_NOWARN, + override_rlimit); if (q) { list_add_tail(&q->list, &pending->list); switch ((unsigned long) info) {
From: Joe Thornber ejt@redhat.com
mainline inclusion from mainline-v5.13-rc1 commit f73e2e70ec48c9a9d45494c4866230a5059062ad category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5JCAH CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Remove this extra BUG_ON() call to node_check(), which avoids extra CRC checking.
Signed-off-by: Joe Thornber ejt@redhat.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Zhang Changzhong zhangchangzhong@huawei.com Reviewed-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/persistent-data/dm-btree-spine.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/drivers/md/persistent-data/dm-btree-spine.c b/drivers/md/persistent-data/dm-btree-spine.c index e03cb9e48773..c4b386b2be97 100644 --- a/drivers/md/persistent-data/dm-btree-spine.c +++ b/drivers/md/persistent-data/dm-btree-spine.c @@ -30,8 +30,6 @@ static void node_prepare_for_write(struct dm_block_validator *v, h->csum = cpu_to_le32(dm_bm_checksum(&h->flags, block_size - sizeof(__le32), BTREE_CSUM_XOR)); - - BUG_ON(node_check(v, b, 4096)); }
static int node_check(struct dm_block_validator *v,
From: Zhang Xiaoxu zhangxiaoxu5@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5JCAH CVE: NA
--------------------------------
The BUG_ON() is unneeded since f73e2e70ec48 ("dm btree spine: remove paranoid node_check call in node_prep_for_write()"), merged in v5.13.

For debugging, we still want to know whether on-disk data was corrupted by the write path or by a disk fault, so add the check back and print some information when corruption is detected.
Signed-off-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com Reviewed-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/persistent-data/dm-btree-spine.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/persistent-data/dm-btree-spine.c b/drivers/md/persistent-data/dm-btree-spine.c index c4b386b2be97..859aae0a8683 100644 --- a/drivers/md/persistent-data/dm-btree-spine.c +++ b/drivers/md/persistent-data/dm-btree-spine.c @@ -30,6 +30,8 @@ static void node_prepare_for_write(struct dm_block_validator *v, h->csum = cpu_to_le32(dm_bm_checksum(&h->flags, block_size - sizeof(__le32), BTREE_CSUM_XOR)); + if (node_check(v, b, 4096)) + DMWARN_LIMIT("%s node_check failed", __func__); }
static int node_check(struct dm_block_validator *v,
From: GUO Zihua guozihua@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I62DVN CVE: NA
--------------------------------
Syzkaller reported a UAF in mpi_key_length().
BUG: KASAN: use-after-free in mpi_key_length+0x34/0xb0 Read of size 2 at addr ffff888005737e14 by task syz-executor.15/6236
CPU: 1 PID: 6236 Comm: syz-executor.15 Kdump: loaded Tainted: GF OE 5.10.0.kasan.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220525_182517-szxrtosci10000 04/01/2014 Call Trace: dump_stack+0x9c/0xd3 print_address_description.constprop.0+0x19/0x170 __kasan_report.cold+0x6c/0x84 kasan_report+0x3a/0x50 check_memory_region+0xfd/0x1f0 mpi_key_length+0x34/0xb0 pgp_calc_pkey_keyid.isra.0+0x100/0x5a0 pgp_generate_fingerprint+0x159/0x330 pgp_process_public_key+0x1c5/0x330 pgp_parse_packets+0xf4/0x200 pgp_key_parse+0xb6/0x340 asymmetric_key_preparse+0x8a/0x120 key_create_or_update+0x31f/0x8c0 __se_sys_add_key+0x23e/0x400 do_syscall_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6
The root cause of the issue is that pgp_calc_pkey_keyid() calls mpi_key_length() to get the length of the public key. That length is then deducted from keylen, which is an unsigned value. However, the returned byte count is not checked for validity in mpi_key_length(), so keylen can underflow to a huge value, hence the read overflow.

It turns out that the byte count check was mistakenly left in mpi_read_from_buffer() when commit 94479061ec5b ("mpi: introduce mpi_key_length()") extracted mpi_key_length() out of mpi_read_from_buffer(). This patch moves the check into mpi_key_length().
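A minimal user-space sketch of the underflow (illustrative values, not the kernel code): a 2-byte MPI header can claim far more bits than the buffer actually holds, and the unsigned subtraction then wraps:

	#include <stdio.h>

	int main(void)
	{
		unsigned int keylen = 10;      /* bytes left in the packet */
		unsigned int nbits = 0xffff;   /* from the 2-byte MPI header */
		unsigned int nbytes = (nbits + 7) / 8; /* DIV_ROUND_UP -> 8192 */

		keylen -= nbytes + 2;          /* 10 - 8194 wraps around */
		printf("keylen = %u\n", keylen); /* huge value -> OOB read */
		return 0;
	}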
Fixes: commit 94479061ec5b ("mpi: introduce mpi_key_length()") Signed-off-by: GUO Zihua guozihua@huawei.com Reviewed-by: Roberto Sassu roberto.sassu@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- lib/mpi/mpicoder.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/lib/mpi/mpicoder.c b/lib/mpi/mpicoder.c index 51a8fc758021..19b8ce9aa5e3 100644 --- a/lib/mpi/mpicoder.c +++ b/lib/mpi/mpicoder.c @@ -83,7 +83,7 @@ int mpi_key_length(const void *xbuffer, unsigned int ret_nread, unsigned int *nbits_arg, unsigned int *nbytes_arg) { const uint8_t *buffer = xbuffer; - unsigned int nbits; + unsigned int nbits, nbytes;
if (ret_nread < 2) return -EINVAL; @@ -94,10 +94,17 @@ int mpi_key_length(const void *xbuffer, unsigned int ret_nread, return -EINVAL; }
+ nbytes = DIV_ROUND_UP(nbits, 8); + if (nbytes + 2 > ret_nread) { + pr_info("MPI: mpi larger than buffer nbytes=%u ret_nread=%u\n", + nbytes, ret_nread); + return -EINVAL; + } + if (nbits_arg) *nbits_arg = nbits; if (nbytes_arg) - *nbytes_arg = DIV_ROUND_UP(nbits, 8); + *nbytes_arg = nbytes;
return 0; } @@ -114,12 +121,6 @@ MPI mpi_read_from_buffer(const void *xbuffer, unsigned *ret_nread) if (ret < 0) return ERR_PTR(ret);
- if (nbytes + 2 > *ret_nread) { - pr_info("MPI: mpi larger than buffer nbytes=%u ret_nread=%u\n", - nbytes, *ret_nread); - return ERR_PTR(-EINVAL); - } - val = mpi_read_raw_data(buffer + 2, nbytes); if (!val) return ERR_PTR(-ENOMEM);
From: Li Lingfeng lilingfeng3@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60QE9 CVE: NA
--------------------------------
As explained in 32c39e8a7613 ("block: fix use after free for bd_holder_dir"), we should make sure the "disk" is still live before grabbing a reference to 'bd_holder_dir'. However, the "disk" that must be checked is the claimed slave bdev's disk, not the holding disk.
Fixes: 32c39e8a7613 ("block: fix use after free for bd_holder_dir") Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/block_dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 07cbe6190463..ef20ee346ec7 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1271,7 +1271,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) * the holder directory. Hold on to it. */ down_read(&bdev->bd_disk->lookup_sem); - if (!(disk->flags & GENHD_FL_UP)) { + if (!(bdev->bd_disk->flags & GENHD_FL_UP)) { up_read(&bdev->bd_disk->lookup_sem); return -ENODEV; }
From: Ard Biesheuvel ardb@kernel.org
mainline inclusion from mainline-v5.13-rc1 commit f9e7a99fb6b86aa6a00e53b34ee6973840e005aa category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The cache invalidation code in v7_invalidate_l1 can be tweaked to re-read the associativity from CCSIDR, and to keep the way identifier component in a single register that is assigned in the outer loop. This way, we need two fewer registers.

Given that the number of sets is typically much larger than the associativity, rearrange the code so that the outer loop has the smaller number of iterations, ensuring that the re-read of CCSIDR only occurs a handful of times in practice (as sketched below).
Fix the whitespace while at it, and update the comment to indicate that this code is no longer a clone of anything else.
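In C-like pseudocode, the rearranged loop is roughly the following (a sketch, where dcisw() stands for the DCISW set/way invalidate, i.e. the 'mcr p15, 0, rX, c7, c6, 2' instruction in the assembly):

	/* Outer loop over the (few) ways, inner loop over the (many) sets. */
	for (way = num_ways - 1; way >= 0; way--) {
		for (set = num_sets - 1; set >= 0; set--)
			dcisw((way << way_shift) | (set << set_shift));
		/* CCSIDR is re-read here, i.e. once per way, not per set. */
	}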
Acked-by: Nicolas Pitre nico@fluxnic.net Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Russell King rmk+kernel@armlinux.org.uk Signed-off-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: Liao Chang liaochang1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm/mm/cache-v7.S | 51 +++++++++++++++++++++--------------------- 1 file changed, 25 insertions(+), 26 deletions(-)
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S index 307f381eee71..76201ee9ee59 100644 --- a/arch/arm/mm/cache-v7.S +++ b/arch/arm/mm/cache-v7.S @@ -33,9 +33,8 @@ icache_size: * processor. We fix this by performing an invalidate, rather than a * clean + invalidate, before jumping into the kernel. * - * This function is cloned from arch/arm/mach-tegra/headsmp.S, and needs - * to be called for both secondary cores startup and primary core resume - * procedures. + * This function needs to be called for both secondary cores startup and + * primary core resume procedures. */ ENTRY(v7_invalidate_l1) mov r0, #0 @@ -43,32 +42,32 @@ ENTRY(v7_invalidate_l1) isb mrc p15, 1, r0, c0, c0, 0 @ read cache geometry from CCSIDR
- movw r1, #0x7fff - and r2, r1, r0, lsr #13 + movw r3, #0x3ff + and r3, r3, r0, lsr #3 @ 'Associativity' in CCSIDR[12:3] + clz r1, r3 @ WayShift + mov r2, #1 + mov r3, r3, lsl r1 @ NumWays-1 shifted into bits [31:...] + movs r1, r2, lsl r1 @ #1 shifted left by same amount + moveq r1, #1 @ r1 needs value > 0 even if only 1 way
- movw r1, #0x3ff + and r2, r0, #0x7 + add r2, r2, #4 @ SetShift
- and r3, r1, r0, lsr #3 @ NumWays - 1 - add r2, r2, #1 @ NumSets +1: movw r4, #0x7fff + and r0, r4, r0, lsr #13 @ 'NumSets' in CCSIDR[27:13]
- and r0, r0, #0x7 - add r0, r0, #4 @ SetShift - - clz r1, r3 @ WayShift - add r4, r3, #1 @ NumWays -1: sub r2, r2, #1 @ NumSets-- - mov r3, r4 @ Temp = NumWays -2: subs r3, r3, #1 @ Temp-- - mov r5, r3, lsl r1 - mov r6, r2, lsl r0 - orr r5, r5, r6 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift) - mcr p15, 0, r5, c7, c6, 2 - bgt 2b - cmp r2, #0 - bgt 1b - dsb st - isb - ret lr +2: mov r4, r0, lsl r2 @ NumSet << SetShift + orr r4, r4, r3 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift) + mcr p15, 0, r4, c7, c6, 2 + subs r0, r0, #1 @ Set-- + bpl 2b + subs r3, r3, r1 @ Way-- + bcc 3f + mrc p15, 1, r0, c0, c0, 0 @ re-read cache geometry from CCSIDR + b 1b +3: dsb st + isb + ret lr ENDPROC(v7_invalidate_l1)
/*
From: Ard Biesheuvel ardb@kernel.org
mainline inclusion from mainline-v5.13-rc1 commit 95731b8ee63ec9419822a51cd9878fa32582fdd2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that we have reduced the number of registers that we need to preserve when calling v7_invalidate_l1 from the boot code, we can use scratch registers to preserve the remaining ones and get rid of the mini stack entirely. This works around any issues regarding cache behavior of the uncached accesses to this memory, which is hard to get right in the general case (i.e., both on bare metal and under virtualization).
While at it, switch v7_invalidate_l1 to using ip as a scratch register instead of r4. This makes the function AAPCS compliant, and removes the need to stash r4 in ip across the call.
conflict: arch/arm/include/asm/memory.h
Acked-by: Nicolas Pitre nico@fluxnic.net Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Russell King rmk+kernel@armlinux.org.uk Signed-off-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: Liao Chang liaochang1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm/include/asm/memory.h | 15 -------------- arch/arm/mm/cache-v7.S | 10 ++++----- arch/arm/mm/proc-v7.S | 39 ++++++++++++++++------------------- 3 files changed, 23 insertions(+), 41 deletions(-)
diff --git a/arch/arm/include/asm/memory.h b/arch/arm/include/asm/memory.h index a7a22bf5ca7e..05d692d50fe3 100644 --- a/arch/arm/include/asm/memory.h +++ b/arch/arm/include/asm/memory.h @@ -150,21 +150,6 @@ extern unsigned long vectors_base; */ #define PLAT_PHYS_OFFSET UL(CONFIG_PHYS_OFFSET)
-#ifdef CONFIG_XIP_KERNEL -/* - * When referencing data in RAM from the XIP region in a relative manner - * with the MMU off, we need the relative offset between the two physical - * addresses. The macro below achieves this, which is: - * __pa(v_data) - __xip_pa(v_text) - */ -#define PHYS_RELATIVE(v_data, v_text) \ - (((v_data) - PAGE_OFFSET + PLAT_PHYS_OFFSET) - \ - ((v_text) - XIP_VIRT_ADDR(CONFIG_XIP_PHYS_ADDR) + \ - CONFIG_XIP_PHYS_ADDR)) -#else -#define PHYS_RELATIVE(v_data, v_text) ((v_data) - (v_text)) -#endif - #ifndef __ASSEMBLY__
#ifdef CONFIG_RANDOMIZE_BASE diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S index 76201ee9ee59..830bbfb26ca5 100644 --- a/arch/arm/mm/cache-v7.S +++ b/arch/arm/mm/cache-v7.S @@ -53,12 +53,12 @@ ENTRY(v7_invalidate_l1) and r2, r0, #0x7 add r2, r2, #4 @ SetShift
-1: movw r4, #0x7fff - and r0, r4, r0, lsr #13 @ 'NumSets' in CCSIDR[27:13] +1: movw ip, #0x7fff + and r0, ip, r0, lsr #13 @ 'NumSets' in CCSIDR[27:13]
-2: mov r4, r0, lsl r2 @ NumSet << SetShift - orr r4, r4, r3 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift) - mcr p15, 0, r4, c7, c6, 2 +2: mov ip, r0, lsl r2 @ NumSet << SetShift + orr ip, ip, r3 @ Reg = (Temp<<WayShift)|(NumSets<<SetShift) + mcr p15, 0, ip, c7, c6, 2 subs r0, r0, #1 @ Set-- bpl 2b subs r3, r3, r1 @ Way-- diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S index 2fcffcc60cc6..7bee6f68c74c 100644 --- a/arch/arm/mm/proc-v7.S +++ b/arch/arm/mm/proc-v7.S @@ -265,6 +265,20 @@ ENDPROC(cpu_pj4b_do_resume)
#endif
+ @ + @ Invoke the v7_invalidate_l1() function, which adheres to the AAPCS + @ rules, and so it may corrupt registers that we need to preserve. + @ + .macro do_invalidate_l1 + mov r6, r1 + mov r7, r2 + mov r10, lr + bl v7_invalidate_l1 @ corrupts {r0-r3, ip, lr} + mov r1, r6 + mov r2, r7 + mov lr, r10 + .endm + /* * __v7_setup * @@ -286,6 +300,7 @@ __v7_ca5mp_setup: __v7_ca9mp_setup: __v7_cr7mp_setup: __v7_cr8mp_setup: + do_invalidate_l1 mov r10, #(1 << 0) @ Cache/TLB ops broadcasting b 1f __v7_ca7mp_setup: @@ -293,13 +308,9 @@ __v7_ca12mp_setup: __v7_ca15mp_setup: __v7_b15mp_setup: __v7_ca17mp_setup: + do_invalidate_l1 mov r10, #0 -1: adr r0, __v7_setup_stack_ptr - ldr r12, [r0] - add r12, r12, r0 @ the local stack - stmia r12, {r1-r6, lr} @ v7_invalidate_l1 touches r0-r6 - bl v7_invalidate_l1 - ldmia r12, {r1-r6, lr} +1: #ifdef CONFIG_SMP orr r10, r10, #(1 << 6) @ Enable SMP/nAMP mode ALT_SMP(mrc p15, 0, r0, c1, c0, 1) @@ -480,12 +491,7 @@ __v7_pj4b_setup: #endif /* CONFIG_CPU_PJ4B */
__v7_setup: - adr r0, __v7_setup_stack_ptr - ldr r12, [r0] - add r12, r12, r0 @ the local stack - stmia r12, {r1-r6, lr} @ v7_invalidate_l1 touches r0-r6 - bl v7_invalidate_l1 - ldmia r12, {r1-r6, lr} + do_invalidate_l1
__v7_setup_cont: and r0, r9, #0xff000000 @ ARM? @@ -557,17 +563,8 @@ __errata_finish: orr r0, r0, r6 @ set them THUMB( orr r0, r0, #1 << 30 ) @ Thumb exceptions ret lr @ return to head.S:__ret - - .align 2 -__v7_setup_stack_ptr: - .word PHYS_RELATIVE(__v7_setup_stack, .) ENDPROC(__v7_setup)
- .bss - .align 2 -__v7_setup_stack: - .space 4 * 7 @ 7 registers - __INITDATA
.weak cpu_v7_bugs_init
From: Vladimir Murzin vladimir.murzin@arm.com
mainline inclusion from mainline-v5.16-rc7 commit 7202216a6f34d571a22274e729f841256bf8b1ef category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
__secondary_data used to reside in r7 around the call to PROCINFO_INITFUNC. After commit 95731b8ee63e ("ARM: 9059/1: cache-v7: get rid of mini-stack") r7 is used as a scratch register, so we have to reload __secondary_data before we set up the stack pointer.
conflict: arch/arm/kernel/head-nommu.S
Fixes: 95731b8ee63e ("ARM: 9059/1: cache-v7: get rid of mini-stack") Signed-off-by: Vladimir Murzin vladimir.murzin@arm.com Signed-off-by: Russell King (Oracle) rmk+kernel@armlinux.org.uk Signed-off-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: Liao Chang liaochang1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm/kernel/head-nommu.S | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/arm/kernel/head-nommu.S b/arch/arm/kernel/head-nommu.S index 0fc814bbc34b..8796a69c78e0 100644 --- a/arch/arm/kernel/head-nommu.S +++ b/arch/arm/kernel/head-nommu.S @@ -114,6 +114,7 @@ ENTRY(secondary_startup) add r12, r12, r10 ret r12 1: bl __after_proc_init + ldr r7, __secondary_data @ reload r7 ldr sp, [r7, #12] @ set up the stack pointer mov fp, #0 b secondary_start_kernel
From: Liu Shixin liushixin2@huawei.com
stable inclusion from stable-v5.10.150 commit 45c33966759ea1b4040c08dacda99ef623c0ca29 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I62WRY CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 958f32ce832ba781ac20e11bb2d12a9352ea28fc upstream.
The vma_lock and hugetlb_fault_mutex are dropped before handling the userfault and reacquired after handle_userfault(), but reacquiring the vma_lock can lead to a UAF [1,2] due to the following race:
hugetlb_fault
  hugetlb_no_page
    /* unlock vma_lock */
    hugetlb_handle_userfault
      handle_userfault
        /* unlock mm->mmap_lock */
                                  vm_mmap_pgoff
                                    do_mmap
                                      mmap_region
                                        munmap_vma_range
                                          /* clean old vma */
        /* lock vma_lock again  <--- UAF */
    /* unlock vma_lock */
Since the vma_lock is unlocked immediately after hugetlb_handle_userfault() returns anyway, drop the unneeded lock and unlock in hugetlb_handle_userfault() to fix the issue.
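To make the new convention concrete, here is a minimal sketch of the "drop the locks and return" pattern this patch adopts. All names below are illustrative stand-ins, not the actual mm/hugetlb.c symbols, and the lock is reduced to a single mutex for brevity:

#include <linux/mutex.h>

struct example_fault_ctx {
	struct mutex fault_mutex;	/* stand-in for hugetlb_fault_mutex */
};

/* Stand-in for handle_userfault(); expects all locks already dropped. */
static int example_handle_userfault(struct example_fault_ctx *ctx)
{
	return 0;
}

/* Caller enters with fault_mutex held; we release it on every path. */
static int example_no_page(struct example_fault_ctx *ctx, bool uffd_missing)
{
	int ret = 0;

	if (uffd_missing) {
		/* Drop the lock before handing off to userfaultfd ... */
		mutex_unlock(&ctx->fault_mutex);
		/* ... and return directly: never relock a VMA that may be gone. */
		return example_handle_userfault(ctx);
	}

	/* Every other exit path also leaves with the lock released. */
	mutex_unlock(&ctx->fault_mutex);
	return ret;
}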
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/ [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com... Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook") Signed-off-by: Liu Shixin liushixin2@huawei.com Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com Reported-by: Liu Zixian liuzixian4@huawei.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: David Hildenbrand david@redhat.com Cc: John Hubbard jhubbard@nvidia.com Cc: Muchun Song songmuchun@bytedance.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: stable@vger.kernel.org [4.14+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: mm/hugetlb.c Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Chen Wandun chenwandun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/hugetlb.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 312ecc15a4e4..5b0d2264b99b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4683,6 +4683,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, spinlock_t *ptl; unsigned long haddr = address & huge_page_mask(h); bool new_page = false; + u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/* * Currently, we are forced to kill the process in the event the @@ -4692,7 +4693,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) { pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n", current->pid); - return ret; + goto out; }
/* @@ -4711,7 +4712,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * Check for page in userfault range */ if (userfaultfd_missing(vma)) { - u32 hash; struct vm_fault vmf = { .vma = vma, .address = haddr, @@ -4726,17 +4726,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, };
/* - * hugetlb_fault_mutex and i_mmap_rwsem must be - * dropped before handling userfault. Reacquire - * after handling fault to make calling code simpler. + * vma_lock and hugetlb_fault_mutex must be dropped + * before handling userfault. Also mmap_lock will + * be dropped during handling userfault, any vma + * operation should be careful from here. */ - hash = hugetlb_fault_mutex_hash(mapping, idx); mutex_unlock(&hugetlb_fault_mutex_table[hash]); i_mmap_unlock_read(mapping); - ret = handle_userfault(&vmf, VM_UFFD_MISSING); - i_mmap_lock_read(mapping); - mutex_lock(&hugetlb_fault_mutex_table[hash]); - goto out; + return handle_userfault(&vmf, VM_UFFD_MISSING); }
page = alloc_huge_page(vma, haddr, 0); @@ -4843,6 +4840,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page); out: + mutex_unlock(&hugetlb_fault_mutex_table[hash]); + i_mmap_unlock_read(mapping); return ret;
backout: @@ -4942,7 +4941,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (sp_check_vm_share_pool(vma->vm_flags)) ret = sharepool_no_page(mm, vma, mapping, idx, address, ptep, flags); else - ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags); + /* + * hugetlb_no_page will drop vma lock and hugetlb fault + * mutex internally, which make us return immediately. + */ + return hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags); goto out_mutex; }
From: Luo Meng luomeng12@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5P05D CVE: NA
--------------------------------
When the thin-pool is suspended and fail_io is set, resume reports an error as below:

device-mapper: resume ioctl on vg-thinpool failed: Invalid argument

The thin-pool also can't be removed while a bio sits on the deferred list.
This can be easily reproduced using:
echo "offline" > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1 dmsetup suspend /dev/mapper/pool mkfs.ext4 /dev/mapper/thin dmsetup resume /dev/mapper/pool
The root cause is that maybe_resize_data_dev() checks fail_io and returns an error before dm_resume() is called.
Fix this by adding a FAIL-mode check at the end of pool_preresume().
Fixes: da105ed5fd7e (dm thin metadata: introduce dm_pool_abort_metadata) Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/dm-thin.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index a196d7cb51bd..e837839e4def 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -3566,20 +3566,29 @@ static int pool_preresume(struct dm_target *ti) */ r = bind_control_target(pool, ti); if (r) - return r; + goto out;
r = maybe_resize_data_dev(ti, &need_commit1); if (r) - return r; + goto out;
r = maybe_resize_metadata_dev(ti, &need_commit2); if (r) - return r; + goto out;
if (need_commit1 || need_commit2) (void) commit(pool);
- return 0; +out: + /* + * When thinpool is PM_FAIL, it cannot be rebuilt if + * bio is in deferred list. Therefore need to return 0 and + * call pool_resume() to flush IO. + */ + if (r && get_pool_mode(pool) == PM_FAIL) + r = 0; + + return r; }
static void pool_suspend_active_thins(struct pool *pool)
From: Luo Meng luomeng12@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5WBID CVE: NA
--------------------------------
When dm_resume() and dm_destroy() run concurrently, a use-after-free (UAF) can occur.

One such race is shown below:
use                                   free
do_resume                        |
  __find_device_hash_cell        |
    dm_get                       |
      atomic_inc(&md->holders)   |
                                 | dm_destroy
                                 |   __dm_destroy
                                 |     if (!dm_suspended_md(md))
                                 |       atomic_read(&md->holders)
                                 |       msleep(1)
  dm_resume                      |
    __dm_resume                  |
      dm_table_resume_targets    |
        pool_resume              |
          do_waker  # add delay work
                                 |   dm_table_destroy
                                 |     pool_dtr
                                 |       __pool_dec
                                 |         __pool_destroy
                                 |           destroy_workqueue
                                 |           kfree(pool)  # free pool
time out
__do_softirq
  run_timer_softirq  # pool has already been freed
This can be easily reproduced using:
1. create thin-pool
2. dmsetup suspend pool
3. dmsetup resume pool
4. dmsetup remove_all  # concurrent with step 3
The root cause of the UAF is that dm_resume() adds a timer after dm_destroy() has already skipped canceling it because of the suspend status. When the timer expires, run_timer_softirq() runs against a pool that has already been freed, and the concurrent UAF occurs.

Therefore, canceling the timer is moved to after md->holders has dropped to zero.
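As a rough sketch of the reordering (illustrative types and names, not the actual drivers/md/dm.c code), the destroy path now waits for all holders before it starts suspending targets, so a concurrent resume can no longer re-arm a timer behind its back:

#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/printk.h>

struct example_md {
	atomic_t holders;	/* stand-in for md->holders */
};

static void example_suspend_and_free(struct example_md *md)
{
	/* presuspend/postsuspend (which cancels timers), table destroy, free */
}

static void example_destroy(struct example_md *md, bool wait)
{
	/*
	 * Wait for concurrent users (e.g. dm_resume()) to drop their
	 * references *before* tearing anything down, so nothing can
	 * re-arm work or timers once the teardown starts.
	 */
	if (wait)
		while (atomic_read(&md->holders))
			msleep(1);
	else if (atomic_read(&md->holders))
		pr_warn("forcibly removing device still in use (%d users)\n",
			atomic_read(&md->holders));

	example_suspend_and_free(md);
}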
Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/dm.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 23e79dadafc6..b97defbe21bb 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2248,6 +2248,19 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
blk_set_queue_dying(md->queue);
+ /* + * Rare, but there may be I/O requests still going to complete, + * for example. Wait for all references to disappear. + * No one should increment the reference count of the mapped_device, + * after the mapped_device state becomes DMF_FREEING. + */ + if (wait) + while (atomic_read(&md->holders)) + msleep(1); + else if (atomic_read(&md->holders)) + DMWARN("%s: Forcibly removing mapped_device still in use! (%d users)", + dm_device_name(md), atomic_read(&md->holders)); + /* * Take suspend_lock so that presuspend and postsuspend methods * do not race with internal suspend. @@ -2264,19 +2277,6 @@ static void __dm_destroy(struct mapped_device *md, bool wait) dm_put_live_table(md, srcu_idx); mutex_unlock(&md->suspend_lock);
- /* - * Rare, but there may be I/O requests still going to complete, - * for example. Wait for all references to disappear. - * No one should increment the reference count of the mapped_device, - * after the mapped_device state becomes DMF_FREEING. - */ - if (wait) - while (atomic_read(&md->holders)) - msleep(1); - else if (atomic_read(&md->holders)) - DMWARN("%s: Forcibly removing mapped_device still in use! (%d users)", - dm_device_name(md), atomic_read(&md->holders)); - dm_sysfs_exit(md); dm_table_destroy(__unbind(md)); free_dev(md);
From: Sungwoo Kim iam@sung-woo.kim
maillist inclusion category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I63D3E CVE: CVE-2022-45934
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
--------------------------------
By repeatedly sending L2CAP_CONF_REQ packets, chan->num_conf_rsp is incremented multiple times and eventually wraps around its maximum value (i.e., 255). This patch prevents that by adding a boundary check against L2CAP_MAX_CONF_RSP.
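The guard itself is a simple saturating increment. A minimal sketch of the pattern follows; the macro name and limit here are illustrative, while the patch uses the existing L2CAP limit:

#include <linux/types.h>

#define EXAMPLE_MAX_CONF_RSP	2	/* illustrative limit */

/* Stop counting at the limit so a peer cannot wrap the 8-bit counter. */
static inline void example_count_conf_rsp(u8 *num_conf_rsp)
{
	if (*num_conf_rsp < EXAMPLE_MAX_CONF_RSP)
		(*num_conf_rsp)++;
}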
Btmon log:
Bluetooth monitor ver 5.64
= Note: Linux version 6.1.0-rc2 (x86_64)                        0.264594
= Note: Bluetooth subsystem version 2.22                        0.264636
@ MGMT Open: btmon (privileged) version 1.22          {0x0001}  0.272191
= New Index: 00:00:00:00:00:00 (Primary,Virtual,hci0)  [hci0]  13.877604
@ RAW Open: 9496 (privileged) version 2.22            {0x0002} 13.890741
= Open Index: 00:00:00:00:00:00                        [hci0]  13.900426
(...)
ACL Data RX: Handle 200 flags 0x00 dlen 1033 #32 [hci0] 14.273106
invalid packet size (12 != 1033) 08 00 01 00 02 01 04 00 01 10 ff ff ............
ACL Data RX: Handle 200 flags 0x00 dlen 1547 #33 [hci0] 14.273561
invalid packet size (14 != 1547) 0a 00 01 00 04 01 06 00 40 00 00 00 00 00 ........@.....
ACL Data RX: Handle 200 flags 0x00 dlen 2061 #34 [hci0] 14.274390
invalid packet size (16 != 2061) 0c 00 01 00 04 01 08 00 40 00 00 00 00 00 00 04 ........@.......
ACL Data RX: Handle 200 flags 0x00 dlen 2061 #35 [hci0] 14.274932
invalid packet size (16 != 2061) 0c 00 01 00 04 01 08 00 40 00 00 00 07 00 03 00 ........@....... = bluetoothd: Bluetooth daemon 5.43 14.401828
ACL Data RX: Handle 200 flags 0x00 dlen 1033 #36 [hci0] 14.275753
invalid packet size (12 != 1033) 08 00 01 00 04 01 04 00 40 00 00 00 ........@...
Signed-off-by: Sungwoo Kim iam@sung-woo.kim Signed-off-by: Luiz Augusto von Dentz luiz.von.dentz@intel.com Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/bluetooth/l2cap_core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c index bbba3beffcd3..70bfd9e8913e 100644 --- a/net/bluetooth/l2cap_core.c +++ b/net/bluetooth/l2cap_core.c @@ -4440,7 +4440,8 @@ static inline int l2cap_config_req(struct l2cap_conn *conn,
chan->ident = cmd->ident; l2cap_send_cmd(conn, cmd->ident, L2CAP_CONF_RSP, len, rsp); - chan->num_conf_rsp++; + if (chan->num_conf_rsp < L2CAP_CONF_MAX_CONF_RSP) + chan->num_conf_rsp++;
/* Reset config buffer. */ chan->conf_len = 0;
From: Jakub Sitnicki jakub@cloudflare.com
mainline inclusion from mainline-v6.1-rc6 commit b68777d54fac21fc833ec26ea1a2a84f975ab035 category: bugfix bugzilla: 188056, https://gitee.com/src-openeuler/kernel/issues/I62RNU CVE: CVE-2022-4129
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
sk->sk_user_data has multiple users, which are not compatible with each other. Writers must synchronize by grabbing the sk->sk_callback_lock.
l2tp currently fails to grab the lock when modifying the underlying tunnel socket fields. Fix it by adding appropriate locking.
We err on the side of safety and grab the sk_callback_lock also inside the sk_destruct callback overridden by l2tp, even though there should be no refs allowing access to the sock at the time when sk_destruct gets called.
v4: - serialize write to sk_user_data in l2tp sk_destruct
v3: - switch from sock lock to sk_callback_lock - document write-protection for sk_user_data
v2: - update Fixes to point to origin of the bug - use real names in Reported/Tested-by tags
Cc: Tom Parkin tparkin@katalix.com Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core") Reported-by: Haowei Yan g1042620637@gmail.com Signed-off-by: Jakub Sitnicki jakub@cloudflare.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Lu Wei luwei32@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/net/sock.h | 2 +- net/l2tp/l2tp_core.c | 19 +++++++++++++------ 2 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h index bd0d77c96d73..3335b0bd7c9d 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -317,7 +317,7 @@ struct bpf_local_storage; * @sk_tskey: counter to disambiguate concurrent tstamp requests * @sk_zckey: counter to order MSG_ZEROCOPY notifications * @sk_socket: Identd and reporting IO signals - * @sk_user_data: RPC layer private data + * @sk_user_data: RPC layer private data. Write-protected by @sk_callback_lock. * @sk_frag: cached page frag * @sk_peek_off: current peek_offset value * @sk_send_head: front of stuff to transmit diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index dc8987ed08ad..e89852bc5309 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -1150,8 +1150,10 @@ static void l2tp_tunnel_destruct(struct sock *sk) }
/* Remove hooks into tunnel socket */ + write_lock_bh(&sk->sk_callback_lock); sk->sk_destruct = tunnel->old_sk_destruct; sk->sk_user_data = NULL; + write_unlock_bh(&sk->sk_callback_lock);
/* Call the original destructor */ if (sk->sk_destruct) @@ -1471,16 +1473,18 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, sock = sockfd_lookup(tunnel->fd, &ret); if (!sock) goto err; - - ret = l2tp_validate_socket(sock->sk, net, tunnel->encap); - if (ret < 0) - goto err_sock; }
+ sk = sock->sk; + write_lock(&sk->sk_callback_lock); + + ret = l2tp_validate_socket(sk, net, tunnel->encap); + if (ret < 0) + goto err_sock; + tunnel->l2tp_net = net; pn = l2tp_pernet(net);
- sk = sock->sk; sock_hold(sk); tunnel->sock = sk;
@@ -1506,7 +1510,7 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
setup_udp_tunnel_sock(net, sock, &udp_cfg); } else { - sk->sk_user_data = tunnel; + rcu_assign_sk_user_data(sk, tunnel); }
tunnel->old_sk_destruct = sk->sk_destruct; @@ -1520,6 +1524,7 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, if (tunnel->fd >= 0) sockfd_put(sock);
+ write_unlock(&sk->sk_callback_lock); return 0;
err_sock: @@ -1527,6 +1532,8 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, sock_release(sock); else sockfd_put(sock); + + write_unlock(&sk->sk_callback_lock); err: return ret; }
From: Jakub Sitnicki jakub@cloudflare.com
mainline inclusion from mainline-v6.1-rc7 commit af295e854a4e3813ffbdef26dbb6a4d6226c3ea1 category: bugfix bugzilla: 188056, https://gitee.com/src-openeuler/kernel/issues/I62RNU CVE: CVE-2022-4129
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When holding a reader-writer spin lock we cannot sleep. Calling setup_udp_tunnel_sock() with write lock held violates this rule, because we end up calling percpu_down_read(), which might sleep, as syzbot reports [1]:
__might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
Trim the writer-side critical section for sk_callback_lock down to the minimum, so that it covers only operations on sk_user_data.
Also, when grabbing the sk_callback_lock, we always need to disable BH, as Eric points out. Failing to do so leads to deadlocks because we acquire sk_callback_lock in softirq context, which can get stuck waiting on us if:
1) it runs on the same CPU, or
       CPU0
       ----
  lock(clock-AF_INET6);
  <Interrupt>
    lock(clock-AF_INET6);
2) lock ordering leads to priority inversion
       CPU0                    CPU1
       ----                    ----
  lock(clock-AF_INET6);
                               local_irq_disable();
                               lock(&tcp_hashinfo.bhash[i].lock);
                               lock(clock-AF_INET6);
  <Interrupt>
    lock(&tcp_hashinfo.bhash[i].lock);
... as syzbot reports [2,3]. Use the _bh variants for write_(un)lock.
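Putting the two fixes together, the resulting convention can be sketched as follows. This is a minimal illustration using the real sock APIs (write_lock_bh() on sk_callback_lock, rcu_assign_sk_user_data()); the helper names are hypothetical:

#include <net/sock.h>

/*
 * Writers of sk->sk_user_data must hold sk_callback_lock, and must use
 * the _bh variants because the lock is also taken in softirq context.
 */
static void example_attach_tunnel(struct sock *sk, void *tunnel)
{
	write_lock_bh(&sk->sk_callback_lock);
	rcu_assign_sk_user_data(sk, tunnel);
	write_unlock_bh(&sk->sk_callback_lock);
	/*
	 * Anything that might sleep (e.g. setup_udp_tunnel_sock())
	 * must happen outside the critical section.
	 */
}

static void example_detach_tunnel(struct sock *sk)
{
	write_lock_bh(&sk->sk_callback_lock);
	rcu_assign_sk_user_data(sk, NULL);
	write_unlock_bh(&sk->sk_callback_lock);
}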
[1] https://lore.kernel.org/netdev/0000000000004e78ec05eda79749@google.com/ [2] https://lore.kernel.org/netdev/000000000000e38b6605eda76f98@google.com/ [3] https://lore.kernel.org/netdev/000000000000dfa31e05eda76f75@google.com/
v2: - Check and set sk_user_data while holding sk_callback_lock for both L2TP encapsulation types (IP and UDP) (Tetsuo)
Cc: Tom Parkin tparkin@katalix.com Cc: Tetsuo Handa penguin-kernel@i-love.sakura.ne.jp Fixes: b68777d54fac ("l2tp: Serialize access to sk_user_data with sk_callback_lock") Reported-by: Eric Dumazet edumazet@google.com Reported-by: syzbot+703d9e154b3b58277261@syzkaller.appspotmail.com Reported-by: syzbot+50680ced9e98a61f7698@syzkaller.appspotmail.com Reported-by: syzbot+de987172bb74a381879b@syzkaller.appspotmail.com Signed-off-by: Jakub Sitnicki jakub@cloudflare.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Lu Wei luwei32@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/l2tp/l2tp_core.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c index e89852bc5309..d6bb1795329a 100644 --- a/net/l2tp/l2tp_core.c +++ b/net/l2tp/l2tp_core.c @@ -1476,11 +1476,12 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, }
sk = sock->sk; - write_lock(&sk->sk_callback_lock); - + write_lock_bh(&sk->sk_callback_lock); ret = l2tp_validate_socket(sk, net, tunnel->encap); if (ret < 0) - goto err_sock; + goto err_inval_sock; + rcu_assign_sk_user_data(sk, tunnel); + write_unlock_bh(&sk->sk_callback_lock);
tunnel->l2tp_net = net; pn = l2tp_pernet(net); @@ -1509,8 +1510,6 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, };
setup_udp_tunnel_sock(net, sock, &udp_cfg); - } else { - rcu_assign_sk_user_data(sk, tunnel); }
tunnel->old_sk_destruct = sk->sk_destruct; @@ -1524,16 +1523,18 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net, if (tunnel->fd >= 0) sockfd_put(sock);
- write_unlock(&sk->sk_callback_lock); return 0;
err_sock: + write_lock_bh(&sk->sk_callback_lock); + rcu_assign_sk_user_data(sk, NULL); +err_inval_sock: + write_unlock_bh(&sk->sk_callback_lock); + if (tunnel->fd < 0) sock_release(sock); else sockfd_put(sock); - - write_unlock(&sk->sk_callback_lock); err: return ret; }
From: Andrzej Hajda andrzej.hajda@intel.com
stable inclusion from stable-v5.10.157 commit 86f0082fb9470904b15546726417f28077088fee category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I640L3 CVE: CVE-2022-4139
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
--------------------------------
commit 04aa64375f48a5d430b5550d9271f8428883e550 upstream.
In case of Gen12 video and compute engines, TLB_INV registers are masked: to modify one bit, the corresponding bit in the upper half of the register must be set, otherwise nothing happens.
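For context, a "masked" register carries a write-enable mask in its upper 16 bits; a bit in the lower half only changes when the matching mask bit is also written. In simplified form, i915's _MASKED_BIT_ENABLE() is functionally equivalent to the following (the name below is illustrative):

#include <linux/types.h>

/*
 * Replicate the bit into the upper 16-bit write-enable mask so the
 * write to the lower half actually takes effect.
 */
static inline u32 example_masked_bit_enable(u32 bit)
{
	return (bit << 16) | bit;
}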
CVE: CVE-2022-4139 Suggested-by: Chris Wilson chris.p.wilson@intel.com Signed-off-by: Andrzej Hajda andrzej.hajda@intel.com Acked-by: Daniel Vetter daniel.vetter@ffwll.ch Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Ren Zhijie renzhijie2@huawei.com Reviewed-by: Zhang Qiao zhangqiao22@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/i915/gt/intel_gt.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index a33887f2464f..5f86d9aacb8a 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -745,6 +745,10 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) if (!i915_mmio_reg_offset(rb.reg)) continue;
+ if (INTEL_GEN(i915) == 12 && (engine->class == VIDEO_DECODE_CLASS || + engine->class == VIDEO_ENHANCEMENT_CLASS)) + rb.bit = _MASKED_BIT_ENABLE(rb.bit); + intel_uncore_write_fw(uncore, rb.reg, rb.bit); }
From: Junhao He hejunhao3@huawei.com
driver inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5KAX7
--------------------------------------------------------------------------
Fix the KABI breakage caused by the HiSilicon PMU drivers adding new enumerators to enum cpuhp_state {}.

Switch the hisi_pcie_pmu and hisi_cpa_pmu drivers from explicitly specified hotplug states to dynamically allocated ones (CPUHP_AP_ONLINE_DYN). The states between CPUHP_AP_ONLINE_DYN and CPUHP_AP_ONLINE_DYN_END are reserved for dynamic allocation.
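The dynamic-allocation pattern both drivers now follow looks roughly like this (a sketch with illustrative names; the real drivers additionally register per-device instances and unwind on failure):

#include <linux/cpuhotplug.h>

static enum cpuhp_state example_pmu_online;	/* illustrative name */

static int example_online_cpu(unsigned int cpu, struct hlist_node *node)
{
	return 0;	/* bind the PMU context to this CPU as needed */
}

static int example_offline_cpu(unsigned int cpu, struct hlist_node *node)
{
	return 0;	/* migrate the PMU context away from this CPU */
}

static int __init example_pmu_init(void)
{
	int ret;

	/*
	 * With CPUHP_AP_ONLINE_DYN a state number from the dynamic range
	 * is allocated and returned on success, so check "< 0" rather
	 * than "!= 0" and remember the returned state for teardown.
	 */
	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
				      "perf/example:online",
				      example_online_cpu,
				      example_offline_cpu);
	if (ret < 0)
		return ret;

	example_pmu_online = ret;
	return 0;
}

static void __exit example_pmu_exit(void)
{
	cpuhp_remove_multi_state(example_pmu_online);
}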
Signed-off-by: Junhao He hejunhao3@huawei.com Reviewed-by: Yicong Yang yangyicong@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/perf/hisilicon/hisi_pcie_pmu.c | 22 ++++++++++--------- drivers/perf/hisilicon/hisi_uncore_cpa_pmu.c | 23 ++++++++++---------- include/linux/cpuhotplug.h | 6 ----- 3 files changed, 24 insertions(+), 27 deletions(-)
diff --git a/drivers/perf/hisilicon/hisi_pcie_pmu.c b/drivers/perf/hisilicon/hisi_pcie_pmu.c index 2f18838754ec..c7972d631d2f 100644 --- a/drivers/perf/hisilicon/hisi_pcie_pmu.c +++ b/drivers/perf/hisilicon/hisi_pcie_pmu.c @@ -19,6 +19,9 @@ #include <linux/pci.h> #include <linux/perf_event.h>
+/* Dynamic CPU hotplug state used by PCIe PMU */ +static enum cpuhp_state hisi_pcie_pmu_online; + #define DRV_NAME "hisi_pcie_pmu" /* Define registers */ #define HISI_PCIE_GLOBAL_CTRL 0x00 @@ -818,7 +821,7 @@ static int hisi_pcie_init_pmu(struct pci_dev *pdev, struct hisi_pcie_pmu *pcie_p if (ret) goto err_iounmap;
- ret = cpuhp_state_add_instance(CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE, &pcie_pmu->node); + ret = cpuhp_state_add_instance(hisi_pcie_pmu_online, &pcie_pmu->node); if (ret) { pci_err(pdev, "Failed to register hotplug: %d\n", ret); goto err_irq_unregister; @@ -833,8 +836,7 @@ static int hisi_pcie_init_pmu(struct pci_dev *pdev, struct hisi_pcie_pmu *pcie_p return ret;
err_hotplug_unregister: - cpuhp_state_remove_instance_nocalls( - CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE, &pcie_pmu->node); + cpuhp_state_remove_instance_nocalls(hisi_pcie_pmu_online, &pcie_pmu->node);
err_irq_unregister: hisi_pcie_pmu_irq_unregister(pdev, pcie_pmu); @@ -850,8 +852,7 @@ static void hisi_pcie_uninit_pmu(struct pci_dev *pdev) struct hisi_pcie_pmu *pcie_pmu = pci_get_drvdata(pdev);
perf_pmu_unregister(&pcie_pmu->pmu); - cpuhp_state_remove_instance_nocalls( - CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE, &pcie_pmu->node); + cpuhp_state_remove_instance_nocalls(hisi_pcie_pmu_online, &pcie_pmu->node); hisi_pcie_pmu_irq_unregister(pdev, pcie_pmu); iounmap(pcie_pmu->base); } @@ -922,18 +923,19 @@ static int __init hisi_pcie_module_init(void) { int ret;
- ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE, - "AP_PERF_ARM_HISI_PCIE_PMU_ONLINE", + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, + "perf/hisi/pcie:online", hisi_pcie_pmu_online_cpu, hisi_pcie_pmu_offline_cpu); - if (ret) { + if (ret < 0) { pr_err("Failed to setup PCIe PMU hotplug: %d\n", ret); return ret; } + hisi_pcie_pmu_online = ret;
ret = pci_register_driver(&hisi_pcie_pmu_driver); if (ret) - cpuhp_remove_multi_state(CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE); + cpuhp_remove_multi_state(hisi_pcie_pmu_online);
return ret; } @@ -942,7 +944,7 @@ module_init(hisi_pcie_module_init); static void __exit hisi_pcie_module_exit(void) { pci_unregister_driver(&hisi_pcie_pmu_driver); - cpuhp_remove_multi_state(CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE); + cpuhp_remove_multi_state(hisi_pcie_pmu_online); } module_exit(hisi_pcie_module_exit);
diff --git a/drivers/perf/hisilicon/hisi_uncore_cpa_pmu.c b/drivers/perf/hisilicon/hisi_uncore_cpa_pmu.c index a9bb73f76be4..09839dae9b7c 100644 --- a/drivers/perf/hisilicon/hisi_uncore_cpa_pmu.c +++ b/drivers/perf/hisilicon/hisi_uncore_cpa_pmu.c @@ -19,6 +19,9 @@
#include "hisi_uncore_pmu.h"
+/* Dynamic CPU hotplug state used by CPA PMU */ +static enum cpuhp_state hisi_cpa_pmu_online; + /* CPA register definition */ #define CPA_PERF_CTRL 0x1c00 #define CPA_EVENT_CTRL 0x1c04 @@ -334,8 +337,7 @@ static int hisi_cpa_pmu_probe(struct platform_device *pdev)
/* Power Management should be disabled before using CPA PMU. */ hisi_cpa_pmu_disable_pm(cpa_pmu); - ret = cpuhp_state_add_instance(CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE, - &cpa_pmu->node); + ret = cpuhp_state_add_instance(hisi_cpa_pmu_online, &cpa_pmu->node); if (ret) { dev_err(&pdev->dev, "Error %d registering hotplug\n", ret); hisi_cpa_pmu_enable_pm(cpa_pmu); @@ -345,8 +347,7 @@ static int hisi_cpa_pmu_probe(struct platform_device *pdev) ret = perf_pmu_register(&cpa_pmu->pmu, name, -1); if (ret) { dev_err(cpa_pmu->dev, "PMU register failed\n"); - cpuhp_state_remove_instance_nocalls( - CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE, &cpa_pmu->node); + cpuhp_state_remove_instance_nocalls(hisi_cpa_pmu_online, &cpa_pmu->node); hisi_cpa_pmu_enable_pm(cpa_pmu); return ret; } @@ -360,8 +361,7 @@ static int hisi_cpa_pmu_remove(struct platform_device *pdev) struct hisi_pmu *cpa_pmu = platform_get_drvdata(pdev);
perf_pmu_unregister(&cpa_pmu->pmu); - cpuhp_state_remove_instance_nocalls(CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE, - &cpa_pmu->node); + cpuhp_state_remove_instance_nocalls(hisi_cpa_pmu_online, &cpa_pmu->node); hisi_cpa_pmu_enable_pm(cpa_pmu); return 0; } @@ -380,18 +380,19 @@ static int __init hisi_cpa_pmu_module_init(void) { int ret;
- ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE, - "AP_PERF_ARM_HISI_CPA_ONLINE", + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, + "pmu/hisi/cpa:online", hisi_uncore_pmu_online_cpu, hisi_uncore_pmu_offline_cpu); - if (ret) { + if (ret < 0) { pr_err("setup hotplug failed: %d\n", ret); return ret; } + hisi_cpa_pmu_online = ret;
ret = platform_driver_register(&hisi_cpa_pmu_driver); if (ret) - cpuhp_remove_multi_state(CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE); + cpuhp_remove_multi_state(hisi_cpa_pmu_online);
return ret; } @@ -400,7 +401,7 @@ module_init(hisi_cpa_pmu_module_init); static void __exit hisi_cpa_pmu_module_exit(void) { platform_driver_unregister(&hisi_cpa_pmu_driver); - cpuhp_remove_multi_state(CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE); + cpuhp_remove_multi_state(hisi_cpa_pmu_online); } module_exit(hisi_cpa_pmu_module_exit);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index a9d6652d417c..4604c8820313 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -176,17 +176,11 @@ enum cpuhp_state { CPUHP_AP_PERF_S390_SF_ONLINE, CPUHP_AP_PERF_ARM_CCI_ONLINE, CPUHP_AP_PERF_ARM_CCN_ONLINE, - #ifndef __GENKSYMS__ - CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE, - #endif CPUHP_AP_PERF_ARM_HISI_DDRC_ONLINE, CPUHP_AP_PERF_ARM_HISI_HHA_ONLINE, CPUHP_AP_PERF_ARM_HISI_L3_ONLINE, CPUHP_AP_PERF_ARM_HISI_PA_ONLINE, CPUHP_AP_PERF_ARM_HISI_SLLC_ONLINE, - #ifndef __GENKSYMS__ - CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE, - #endif CPUHP_AP_PERF_ARM_L2X0_ONLINE, CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE, CPUHP_AP_PERF_ARM_QCOM_L3_ONLINE,
From: Dan Carpenter dan.carpenter@oracle.com
stable inclusion from stable-v5.10.142 commit 19e3f69d19801940abc2ac37c169882769ed9770 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OIO CVE: CVE-2022-4095
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
_Read/Write_MACREG callbacks are NULL so the read/write_macreg_hdl() functions don't do anything except free the "pcmd" pointer. It results in a use after free. Delete them.
Fixes: 2865d42c78a9 ("staging: r8712u: Add the new driver to the mainline kernel") Cc: stable stable@kernel.org Reported-by: Zheng Wang hackerzheng666@gmail.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Link: https://lore.kernel.org/r/Yw4ASqkYcUhUfoY2@kili Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Guan Jing guanjing6@huawei.com Reviewed-by: Zhang Qiao zhangqiao22@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/staging/rtl8712/rtl8712_cmd.c | 36 --------------------------- 1 file changed, 36 deletions(-)
diff --git a/drivers/staging/rtl8712/rtl8712_cmd.c b/drivers/staging/rtl8712/rtl8712_cmd.c index ff3cb09c57a6..30e965c410ff 100644 --- a/drivers/staging/rtl8712/rtl8712_cmd.c +++ b/drivers/staging/rtl8712/rtl8712_cmd.c @@ -117,34 +117,6 @@ static void r871x_internal_cmd_hdl(struct _adapter *padapter, u8 *pbuf) kfree(pdrvcmd->pbuf); }
-static u8 read_macreg_hdl(struct _adapter *padapter, u8 *pbuf) -{ - void (*pcmd_callback)(struct _adapter *dev, struct cmd_obj *pcmd); - struct cmd_obj *pcmd = (struct cmd_obj *)pbuf; - - /* invoke cmd->callback function */ - pcmd_callback = cmd_callback[pcmd->cmdcode].callback; - if (!pcmd_callback) - r8712_free_cmd_obj(pcmd); - else - pcmd_callback(padapter, pcmd); - return H2C_SUCCESS; -} - -static u8 write_macreg_hdl(struct _adapter *padapter, u8 *pbuf) -{ - void (*pcmd_callback)(struct _adapter *dev, struct cmd_obj *pcmd); - struct cmd_obj *pcmd = (struct cmd_obj *)pbuf; - - /* invoke cmd->callback function */ - pcmd_callback = cmd_callback[pcmd->cmdcode].callback; - if (!pcmd_callback) - r8712_free_cmd_obj(pcmd); - else - pcmd_callback(padapter, pcmd); - return H2C_SUCCESS; -} - static u8 read_bbreg_hdl(struct _adapter *padapter, u8 *pbuf) { struct cmd_obj *pcmd = (struct cmd_obj *)pbuf; @@ -213,14 +185,6 @@ static struct cmd_obj *cmd_hdl_filter(struct _adapter *padapter, pcmd_r = NULL;
switch (pcmd->cmdcode) { - case GEN_CMD_CODE(_Read_MACREG): - read_macreg_hdl(padapter, (u8 *)pcmd); - pcmd_r = pcmd; - break; - case GEN_CMD_CODE(_Write_MACREG): - write_macreg_hdl(padapter, (u8 *)pcmd); - pcmd_r = pcmd; - break; case GEN_CMD_CODE(_Read_BBREG): read_bbreg_hdl(padapter, (u8 *)pcmd); break;
On 12/7/22 6:37 PM, Zheng Zengkai wrote:
From: Li Nan linan122@huawei.com
hulk inclusion category: bugfix bugzilla: 187584, https://gitee.com/openeuler/kernel/issues/I5QW2R CVE: NA
This reverts commit 36f5d7662495aa5ad4ec197443e69e01384eda3c.
There are two wbt_enable_default() in bfq_exit_queue(). Although it will not lead to no fault, revert one.
Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
block/bfq-iosched.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 4bfea5e5354e..1aec01c0a707 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -6418,8 +6418,6 @@ static void bfq_exit_queue(struct elevator_queue *e) spin_unlock_irq(&bfqd->lock); #endif
wbt_enable_default(bfqd->queue);
kfree(bfqd);
/* Re-enable throttling in case elevator disabled it */
I suspect the wbt_enable_default() below kfree() should be removed, not the one above it. BTW, the current code looks like this:
static void bfq_exit_queue(struct elevator_queue *e)
{
	struct bfq_data *bfqd = e->elevator_data;
	struct bfq_queue *bfqq, *n;
	struct request_queue *q = bfqd->queue;
...
kfree(bfqd);
	/* Re-enable throttling in case elevator disabled it */
	wbt_enable_default(q);
}
Just FYI,
Thanks, Guoqing
On 2022/12/9 19:57, Guoqing Jiang wrote:
On 12/7/22 6:37 PM, Zheng Zengkai wrote:
From: Li Nan linan122@huawei.com
hulk inclusion category: bugfix bugzilla: 187584, https://gitee.com/openeuler/kernel/issues/I5QW2R CVE: NA
This reverts commit 36f5d7662495aa5ad4ec197443e69e01384eda3c.
There are two wbt_enable_default() in bfq_exit_queue(). Although it will not lead to no fault, revert one.
Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
block/bfq-iosched.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 4bfea5e5354e..1aec01c0a707 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -6418,8 +6418,6 @@ static void bfq_exit_queue(struct elevator_queue *e) spin_unlock_irq(&bfqd->lock); #endif - wbt_enable_default(bfqd->queue);
kfree(bfqd); /* Re-enable throttling in case elevator disabled it */
I suspect the wbt_enable_default() below kfree() should be removed, not the one above it. BTW, the current code looks like this:
static void bfq_exit_queue(struct elevator_queue *e)
{
	struct bfq_data *bfqd = e->elevator_data;
	struct bfq_queue *bfqq, *n;
	struct request_queue *q = bfqd->queue;
...
kfree(bfqd);
	/* Re-enable throttling in case elevator disabled it */
	wbt_enable_default(q);
}
Just FYI,
Thanks, Guoqing
The two wbt_enable_default() calls were introduced by two patches:
35328115880f ("block/wbt: fix negative inflight counter when remove scsi device")
36f5d7662495 ("block/wbt: fix negative inflight counter when remove scsi device")
One of them needs to be reverted. 35328115880f removed the wbt_enable_default() in elv_unregister_queue() and defined the local variable q; if we reverted that one, 36f5d7662495 would become ineffective as well. Besides, the relative position of wbt_enable_default() and kfree() is not critical, so we chose to revert 36f5d7662495.
Thanks, Nan