From: "Paul E. McKenney" paulmck@kernel.org
mainline inclusion from mainline-v5.17-rc1 commit 147f04b14adde831eb4a0a1e378667429732f9e8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I53ILL CVE: NA
-------------------------------------------------------------------------
If an RCU expedited grace period starts just when a CPU is in the process of going offline, so that the outgoing CPU has completed its pass through stop-machine but has not yet completed its final dive into the idle loop, RCU will attempt to enable that CPU's scheduling-clock tick via a call to tick_dep_set_cpu(). For this to happen, that CPU has to have been online when the expedited grace period completed its CPU-selection phase.
This is pointless: The outgoing CPU has interrupts disabled, so it cannot take a scheduling-clock tick anyway. In addition, the tick_dep_set_cpu() function's eventual call to irq_work_queue_on() will splat as follows:
smpboot: CPU 1 is now offline
WARNING: CPU: 6 PID: 124 at kernel/irq_work.c:95 irq_work_queue_on+0x57/0x60
Modules linked in:
CPU: 6 PID: 124 Comm: kworker/6:2 Not tainted 5.15.0-rc1+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Workqueue: rcu_gp wait_rcu_exp_gp
RIP: 0010:irq_work_queue_on+0x57/0x60
Code: 8b 05 1d c7 ea 62 a9 00 00 f0 00 75 21 4c 89 ce 44 89 c7 e8 9b 37 fa ff ba 01 00 00 00 89 d0 c3 4c 89 cf e8 3b ff ff ff eb ee <0f> 0b eb b7 0f 0b eb db 90 48 c7 c0 98 2a 02 00 65 48 03 05 91 6f
RSP: 0000:ffffb12cc038fe48 EFLAGS: 00010282
RAX: 0000000000000001 RBX: 0000000000005208 RCX: 0000000000000020
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9ad01f45a680
RBP: 000000000004c990 R08: 0000000000000001 R09: ffff9ad01f45a680
R10: ffffb12cc0317db0 R11: 0000000000000001 R12: 00000000fffecee8
R13: 0000000000000001 R14: 0000000000026980 R15: ffffffff9e53ae00
FS:  0000000000000000(0000) GS:ffff9ad01f580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000de0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tick_nohz_dep_set_cpu+0x59/0x70
 rcu_exp_wait_wake+0x54e/0x870
 ? sync_rcu_exp_select_cpus+0x1fc/0x390
 process_one_work+0x1ef/0x3c0
 ? process_one_work+0x3c0/0x3c0
 worker_thread+0x28/0x3c0
 ? process_one_work+0x3c0/0x3c0
 kthread+0x115/0x140
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
---[ end trace c5bf75eb6aa80bc6 ]---
This commit therefore avoids invoking tick_dep_set_cpu() on offlined CPUs to limit both futility and false-positive splats.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 kernel/rcu/tree_exp.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 0dc16345e668..2bc4538e8a61 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -507,7 +507,10 @@ static void synchronize_rcu_expedited_wait(void)
 			if (rdp->rcu_forced_tick_exp)
 				continue;
 			rdp->rcu_forced_tick_exp = true;
-			tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
+			preempt_disable();
+			if (cpu_online(cpu))
+				tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
+			preempt_enable();
 		}
 	}
 	j = READ_ONCE(jiffies_till_first_fqs);
From: Zhang Jian <zhangjian210@huawei.com>

ascend inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53VVE
CVE: NA
-------------------------------------------------
Collect the processes that have the page mapped via collect_procs().

@page: if the page is part of a hugepage/compound page, compound_head() must be used to find its head page (otherwise the kernel may panic), and the page must be locked.

@to_kill: the function returns a linked list; once the caller is done with the list, it must kfree() every entry.

@force_early: to find all processes mapping the page, this must be true; if it is false, the function only returns processes that have the PF_MCE_PROCESS or PF_MCE_EARLY flag set.

Limits: if force_early is true, sysctl_memory_failure_early_kill has no effect. If force_early is false, no process has the PF_MCE_PROCESS or PF_MCE_EARLY flag, and sysctl_memory_failure_early_kill is enabled, the function returns all tasks regardless of those flags.
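Below is a minimal caller sketch of the exported interface. It is illustrative only: it assumes the caller can see the 'struct to_kill' layout (list node 'nd' plus task pointer 'tsk') that mm/memory-failure.c uses internally, and 'example_collect' is a made-up function name.

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/printk.h>
	#include <linux/slab.h>

	static void example_collect(struct page *page)
	{
		LIST_HEAD(tokill);
		struct to_kill *tk, *next;

		/* For hugepages/compound pages, operate on the head page. */
		page = compound_head(page);
		lock_page(page);
		/* force_early = 1: collect every process mapping the page. */
		collect_procs(page, &tokill, 1);
		unlock_page(page);

		list_for_each_entry_safe(tk, next, &tokill, nd) {
			pr_info("pid %d maps the poisoned page\n", tk->tsk->pid);
			kfree(tk);	/* the caller must free each entry */
		}
	}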
Signed-off-by: Zhang Jian <zhangjian210@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/mm.h  | 3 ++-
 mm/memory-failure.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 859d5200c57b..a886f48b6a0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3109,7 +3109,8 @@ extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
-
+extern void collect_procs(struct page *page, struct list_head *tokill,
+		int force_early);
 
 /*
  * Error handlers for various types of pages.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fb74e61e5aa4..509fe34a0421 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -541,7 +541,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
-static void collect_procs(struct page *page, struct list_head *tokill,
+void collect_procs(struct page *page, struct list_head *tokill,
 				int force_early)
 {
 	if (!page->mapping)
@@ -552,6 +552,7 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 	else
 		collect_procs_file(page, tokill, force_early);
 }
+EXPORT_SYMBOL_GPL(collect_procs);
 
 static const char *action_name[] = {
 	[MF_IGNORED] = "Ignored",
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
Code that patches text in 'arch_klp_patch_func' and 'arch_klp_unpatch_func' is duplicated; factor it out into a common helper to reduce it.

There is also an issue on arm and arm64: the case where the 'offset' between the pc and the new function address is out of the valid branch range is NOT handled when MODULE_PLTS is not enabled (CONFIG_ARM_MODULE_PLTS on arm, CONFIG_ARM64_MODULE_PLTS on arm64). Fix it by always checking that 'offset'.
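The check in question is the usual signed-offset test for a direct branch. A minimal sketch, consistent with the offset_in_range() helper visible in the arm64 hunks below (arm uses a +/-SZ_32M branch range, arm64 +/-SZ_128M):

	static inline bool offset_in_range(unsigned long pc, unsigned long addr,
					   long range)
	{
		long offset = (long)addr - (long)pc;

		return (offset >= -range && offset < range);
	}

When the offset is out of range and MODULE_PLTS is not enabled, patching now fails with -EFAULT instead of emitting a branch that cannot encode the target.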
Fixes: 2fa9f353c118 ("livepatch/arm: Support livepatch without ftrace")
Fixes: e429c61d12bf ("livepatch/arm64: Support livepatch without ftrace")
Suggested-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm/kernel/livepatch.c        | 73 ++++++++++++---------
 arch/arm64/kernel/livepatch.c      | 79 ++++++++++++------------------
 arch/powerpc/kernel/livepatch_32.c | 58 +++++++---------
 arch/powerpc/kernel/livepatch_64.c | 58 +++++++-------------
 4 files changed, 103 insertions(+), 165 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index 4b07e73ad37b..d9eae1dd9744 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -370,22 +370,15 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) return ret; }
-int arch_klp_patch_func(struct klp_func *func) +static int do_patch(unsigned long pc, unsigned long new_addr) { - struct klp_func_node *func_node; - unsigned long pc, new_addr; u32 insn; -#ifdef CONFIG_ARM_MODULE_PLTS - int i; - u32 insns[LJMP_INSN_SIZE]; -#endif
- func_node = func->func_node; - list_add_rcu(&func->stack_node, &func_node->func_stack); - pc = (unsigned long)func->old_func; - new_addr = (unsigned long)func->new_func; -#ifdef CONFIG_ARM_MODULE_PLTS if (!offset_in_range(pc, new_addr, SZ_32M)) { +#ifdef CONFIG_ARM_MODULE_PLTS + int i; + u32 insns[LJMP_INSN_SIZE]; + /* * [0] LDR PC, [PC+8] * [4] nop @@ -397,28 +390,44 @@ int arch_klp_patch_func(struct klp_func *func)
for (i = 0; i < LJMP_INSN_SIZE; i++) __patch_text(((u32 *)pc) + i, insns[i]); - +#else + /* + * When offset from 'new_addr' to 'pc' is out of SZ_32M range but + * CONFIG_ARM_MODULE_PLTS not enabled, we should stop patching. + */ + pr_err("new address out of range\n"); + return -EFAULT; +#endif } else { insn = arm_gen_branch(pc, new_addr); __patch_text((void *)pc, insn); } -#else - insn = arm_gen_branch(pc, new_addr); - __patch_text((void *)pc, insn); -#endif - return 0; }
+int arch_klp_patch_func(struct klp_func *func) +{ + struct klp_func_node *func_node; + int ret; + + func_node = func->func_node; + list_add_rcu(&func->stack_node, &func_node->func_stack); + ret = do_patch((unsigned long)func->old_func, (unsigned long)func->new_func); + if (ret) + list_del_rcu(&func->stack_node); + return ret; +} + void arch_klp_unpatch_func(struct klp_func *func) { struct klp_func_node *func_node; struct klp_func *next_func; - unsigned long pc, new_addr; - u32 insn; + unsigned long pc; #ifdef CONFIG_ARM_MODULE_PLTS int i; u32 insns[LJMP_INSN_SIZE]; +#else + u32 insn; #endif
func_node = func->func_node; @@ -439,29 +448,7 @@ void arch_klp_unpatch_func(struct klp_func *func) next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node);
- new_addr = (unsigned long)next_func->new_func; -#ifdef CONFIG_ARM_MODULE_PLTS - if (!offset_in_range(pc, new_addr, SZ_32M)) { - /* - * [0] LDR PC, [PC+8] - * [4] nop - * [8] new_addr_to_jump - */ - insns[0] = __opcode_to_mem_arm(0xe59ff000); - insns[1] = __opcode_to_mem_arm(0xe320f000); - insns[2] = new_addr; - - for (i = 0; i < LJMP_INSN_SIZE; i++) - __patch_text(((u32 *)pc) + i, insns[i]); - - } else { - insn = arm_gen_branch(pc, new_addr); - __patch_text((void *)pc, insn); - } -#else - insn = arm_gen_branch(pc, new_addr); - __patch_text((void *)pc, insn); -#endif + do_patch(pc, (unsigned long)next_func->new_func); } }
diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 2c292008440c..4e4ed4a65244 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -349,60 +349,63 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) return ret; }
-int arch_klp_patch_func(struct klp_func *func) +static int do_patch(unsigned long pc, unsigned long new_addr) { - struct klp_func_node *func_node; - unsigned long pc, new_addr; u32 insn; -#ifdef CONFIG_ARM64_MODULE_PLTS - int i; - u32 insns[LJMP_INSN_SIZE]; -#endif
- func_node = func->func_node; - list_add_rcu(&func->stack_node, &func_node->func_stack); - pc = (unsigned long)func->old_func; - new_addr = (unsigned long)func->new_func; -#ifdef CONFIG_ARM64_MODULE_PLTS if (offset_in_range(pc, new_addr, SZ_128M)) { insn = aarch64_insn_gen_branch_imm(pc, new_addr, - AARCH64_INSN_BRANCH_NOLINK); + AARCH64_INSN_BRANCH_NOLINK); if (aarch64_insn_patch_text_nosync((void *)pc, insn)) - goto ERR_OUT; + return -EPERM; } else { +#ifdef CONFIG_ARM64_MODULE_PLTS + int i; + u32 insns[LJMP_INSN_SIZE]; + insns[0] = cpu_to_le32(0x92800010 | (((~new_addr) & 0xffff)) << 5); insns[1] = cpu_to_le32(0xf2a00010 | (((new_addr >> 16) & 0xffff)) << 5); insns[2] = cpu_to_le32(0xf2c00010 | (((new_addr >> 32) & 0xffff)) << 5); insns[3] = cpu_to_le32(0xd61f0200); for (i = 0; i < LJMP_INSN_SIZE; i++) { if (aarch64_insn_patch_text_nosync(((u32 *)pc) + i, insns[i])) - goto ERR_OUT; + return -EPERM; } - } #else - insn = aarch64_insn_gen_branch_imm(pc, new_addr, - AARCH64_INSN_BRANCH_NOLINK); - - if (aarch64_insn_patch_text_nosync((void *)pc, insn)) - goto ERR_OUT; + /* + * When offset from 'new_addr' to 'pc' is out of SZ_128M range but + * CONFIG_ARM64_MODULE_PLTS not enabled, we should stop patching. + */ + pr_err("new address out of range\n"); + return -EFAULT; #endif + } return 0; +}
-ERR_OUT: - list_del_rcu(&func->stack_node); +int arch_klp_patch_func(struct klp_func *func) +{ + struct klp_func_node *func_node; + int ret;
- return -EPERM; + func_node = func->func_node; + list_add_rcu(&func->stack_node, &func_node->func_stack); + ret = do_patch((unsigned long)func->old_func, (unsigned long)func->new_func); + if (ret) + list_del_rcu(&func->stack_node); + return ret; }
void arch_klp_unpatch_func(struct klp_func *func) { struct klp_func_node *func_node; struct klp_func *next_func; - unsigned long pc, new_addr; - u32 insn; + unsigned long pc; #ifdef CONFIG_ARM64_MODULE_PLTS int i; u32 insns[LJMP_INSN_SIZE]; +#else + u32 insn; #endif
func_node = func->func_node; @@ -430,29 +433,7 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func, stack_node); if (WARN_ON(!next_func)) return; - - new_addr = (unsigned long)next_func->new_func; -#ifdef CONFIG_ARM64_MODULE_PLTS - if (offset_in_range(pc, new_addr, SZ_128M)) { - insn = aarch64_insn_gen_branch_imm(pc, new_addr, - AARCH64_INSN_BRANCH_NOLINK); - - aarch64_insn_patch_text_nosync((void *)pc, insn); - } else { - insns[0] = cpu_to_le32(0x92800010 | (((~new_addr) & 0xffff)) << 5); - insns[1] = cpu_to_le32(0xf2a00010 | (((new_addr >> 16) & 0xffff)) << 5); - insns[2] = cpu_to_le32(0xf2c00010 | (((new_addr >> 32) & 0xffff)) << 5); - insns[3] = cpu_to_le32(0xd61f0200); - for (i = 0; i < LJMP_INSN_SIZE; i++) - aarch64_insn_patch_text_nosync(((u32 *)pc) + i, - insns[i]); - } -#else - insn = aarch64_insn_gen_branch_imm(pc, new_addr, - AARCH64_INSN_BRANCH_NOLINK); - - aarch64_insn_patch_text_nosync((void *)pc, insn); -#endif + do_patch(pc, (unsigned long)next_func->new_func); } }
diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index 99acabd730e0..3b5c9b121c6f 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -392,24 +392,19 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) return ret; }
-int arch_klp_patch_func(struct klp_func *func) +static int do_patch(unsigned long pc, unsigned long new_addr) { - struct klp_func_node *func_node; - unsigned long pc, new_addr; - long ret; + int ret; int i; u32 insns[LJMP_INSN_SIZE];
- func_node = func->func_node; - list_add_rcu(&func->stack_node, &func_node->func_stack); - pc = (unsigned long)func->old_func; - new_addr = (unsigned long)func->new_func; if (offset_in_range(pc, new_addr, SZ_32M)) { struct ppc_inst instr;
create_branch(&instr, (struct ppc_inst *)pc, new_addr, 0); - if (patch_instruction((struct ppc_inst *)pc, instr)) - goto ERR_OUT; + ret = patch_instruction((struct ppc_inst *)pc, instr); + if (ret) + return -EPERM; } else { /* * lis r12,sym@ha @@ -426,23 +421,30 @@ int arch_klp_patch_func(struct klp_func *func) ret = patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), ppc_inst(insns[i])); if (ret) - goto ERR_OUT; + return -EPERM; } } - return 0; +}
-ERR_OUT: - list_del_rcu(&func->stack_node); +int arch_klp_patch_func(struct klp_func *func) +{ + struct klp_func_node *func_node; + int ret;
- return -EPERM; + func_node = func->func_node; + list_add_rcu(&func->stack_node, &func_node->func_stack); + ret = do_patch((unsigned long)func->old_func, (unsigned long)func->new_func); + if (ret) + list_del_rcu(&func->stack_node); + return ret; }
void arch_klp_unpatch_func(struct klp_func *func) { struct klp_func_node *func_node; struct klp_func *next_func; - unsigned long pc, new_addr; + unsigned long pc; u32 insns[LJMP_INSN_SIZE]; int i;
@@ -461,29 +463,7 @@ void arch_klp_unpatch_func(struct klp_func *func) list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node); - - new_addr = (unsigned long)next_func->new_func; - if (offset_in_range(pc, new_addr, SZ_32M)) { - struct ppc_inst instr; - - create_branch(&instr, (struct ppc_inst *)pc, new_addr, 0); - patch_instruction((struct ppc_inst *)pc, instr); - } else { - /* - * lis r12,sym@ha - * addi r12,r12,sym@l - * mtctr r12 - * bctr - */ - insns[0] = 0x3d800000 + ((new_addr + 0x8000) >> 16); - insns[1] = 0x398c0000 + (new_addr & 0xffff); - insns[2] = 0x7d8903a6; - insns[3] = 0x4e800420; - - for (i = 0; i < LJMP_INSN_SIZE; i++) - patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), - ppc_inst(insns[i])); - } + do_patch(pc, (unsigned long)next_func->new_func); } }
diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index b319675afd4c..f3cd2ee66efa 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -439,43 +439,44 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) return ret; }
-int arch_klp_patch_func(struct klp_func *func) +static int do_patch(unsigned long pc, unsigned long new_addr, + struct arch_klp_data *arch_data, struct module *old_mod) { - struct klp_func_node *func_node; - unsigned long pc, new_addr; - long ret; - - func_node = func->func_node; - list_add_rcu(&func->stack_node, &func_node->func_stack); + int ret;
- pc = (unsigned long)func->old_func; - new_addr = (unsigned long)func->new_func; - ret = livepatch_create_branch(pc, (unsigned long)&func_node->arch_data.trampoline, - new_addr, func->old_mod); + ret = livepatch_create_branch(pc, (unsigned long)&arch_data->trampoline, + new_addr, old_mod); if (ret) - goto ERR_OUT; - flush_icache_range((unsigned long)pc, - (unsigned long)pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE); - + return -EPERM; + flush_icache_range(pc, pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE); pr_debug("[%s %d] old = 0x%lx/0x%lx/%pS, new = 0x%lx/0x%lx/%pS\n", __func__, __LINE__, pc, ppc_function_entry((void *)pc), (void *)pc, new_addr, ppc_function_entry((void *)new_addr), (void *)ppc_function_entry((void *)new_addr)); - return 0; +}
-ERR_OUT: - list_del_rcu(&func->stack_node); +int arch_klp_patch_func(struct klp_func *func) +{ + struct klp_func_node *func_node; + int ret;
- return -EPERM; + func_node = func->func_node; + list_add_rcu(&func->stack_node, &func_node->func_stack); + ret = do_patch((unsigned long)func->old_func, + (unsigned long)func->new_func, + &func_node->arch_data, func->old_mod); + if (ret) + list_del_rcu(&func->stack_node); + return ret; }
void arch_klp_unpatch_func(struct klp_func *func) { struct klp_func_node *func_node; struct klp_func *next_func; - unsigned long pc, new_addr; + unsigned long pc; u32 insns[LJMP_INSN_SIZE]; int i;
@@ -492,25 +493,14 @@ void arch_klp_unpatch_func(struct klp_func *func) ppc_inst(insns[i]));
pr_debug("[%s %d] restore insns at 0x%lx\n", __func__, __LINE__, pc); + flush_icache_range(pc, pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE); } else { list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node); - new_addr = (unsigned long)next_func->new_func; - - livepatch_create_branch(pc, (unsigned long)&func_node->arch_data.trampoline, - new_addr, func->old_mod); - - pr_debug("[%s %d] old = 0x%lx/0x%lx/%pS, new = 0x%lx/0x%lx/%pS\n", - __func__, __LINE__, - pc, ppc_function_entry((void *)pc), (void *)pc, - new_addr, ppc_function_entry((void *)new_addr), - (void *)ppc_function_entry((void *)new_addr)); - + do_patch(pc, (unsigned long)next_func->new_func, + &func_node->arch_data, func->old_mod); } - - flush_icache_range((unsigned long)pc, - (unsigned long)pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE); }
/* return 0 if the func can be patched */
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
Before commit ec7ce700674f ("[Huawei] livepatch: put memory alloc and free out stop machine"), the procedure that restores the old function's code in 'arch_klp_unpatch_func' was:
1. copy the old code saved in func_node into the array 'old_insns';
2. free the memory of func_node;
3. patch the text with the old code in the array 'old_insns'.

But after the above commit, step 2 (freeing func_node's memory) is only done after 'arch_klp_unpatch_func' succeeds, so func_node remains valid for the whole unpatch operation. The copy in step 1 is therefore redundant, and we can simply remove it.
Suggested-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm/kernel/livepatch.c        | 14 ++++----------
 arch/arm64/kernel/livepatch.c      | 14 ++------------
 arch/powerpc/kernel/livepatch_32.c |  7 +------
 arch/powerpc/kernel/livepatch_64.c |  7 +------
 4 files changed, 8 insertions(+), 34 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index d9eae1dd9744..21efd265149a 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -423,24 +423,18 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func_node *func_node; struct klp_func *next_func; unsigned long pc; -#ifdef CONFIG_ARM_MODULE_PLTS - int i; - u32 insns[LJMP_INSN_SIZE]; -#else - u32 insn; -#endif
func_node = func->func_node; pc = (unsigned long)func_node->old_func; if (list_is_singular(&func_node->func_stack)) { #ifdef CONFIG_ARM_MODULE_PLTS + int i; + for (i = 0; i < LJMP_INSN_SIZE; i++) { - insns[i] = func_node->arch_data.old_insns[i]; - __patch_text(((u32 *)pc) + i, insns[i]); + __patch_text(((u32 *)pc) + i, func_node->arch_data.old_insns[i]); } #else - insn = func_node->arch_data.old_insn; - __patch_text((void *)pc, insn); + __patch_text((void *)pc, func_node->arch_data.old_insn); #endif list_del_rcu(&func->stack_node); } else { diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 4e4ed4a65244..74405b77e40e 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -403,29 +403,19 @@ void arch_klp_unpatch_func(struct klp_func *func) unsigned long pc; #ifdef CONFIG_ARM64_MODULE_PLTS int i; - u32 insns[LJMP_INSN_SIZE]; -#else - u32 insn; #endif
func_node = func->func_node; pc = (unsigned long)func_node->old_func; if (list_is_singular(&func_node->func_stack)) { -#ifdef CONFIG_ARM64_MODULE_PLTS - for (i = 0; i < LJMP_INSN_SIZE; i++) - insns[i] = func_node->arch_data.old_insns[i]; -#else - insn = func_node->arch_data.old_insn; -#endif list_del_rcu(&func->stack_node); - #ifdef CONFIG_ARM64_MODULE_PLTS for (i = 0; i < LJMP_INSN_SIZE; i++) { aarch64_insn_patch_text_nosync(((u32 *)pc) + i, - insns[i]); + func_node->arch_data.old_insns[i]); } #else - aarch64_insn_patch_text_nosync((void *)pc, insn); + aarch64_insn_patch_text_nosync((void *)pc, func_node->arch_data.old_insn); #endif } else { list_del_rcu(&func->stack_node); diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index 3b5c9b121c6f..ece36990699e 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -445,20 +445,15 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func_node *func_node; struct klp_func *next_func; unsigned long pc; - u32 insns[LJMP_INSN_SIZE]; int i;
func_node = func->func_node; pc = (unsigned long)func_node->old_func; if (list_is_singular(&func_node->func_stack)) { - for (i = 0; i < LJMP_INSN_SIZE; i++) - insns[i] = func_node->arch_data.old_insns[i]; - list_del_rcu(&func->stack_node); - for (i = 0; i < LJMP_INSN_SIZE; i++) patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), - ppc_inst(insns[i])); + ppc_inst(func_node->arch_data.old_insns[i])); } else { list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index f3cd2ee66efa..9de727a7b455 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -477,20 +477,15 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func_node *func_node; struct klp_func *next_func; unsigned long pc; - u32 insns[LJMP_INSN_SIZE]; int i;
func_node = func->func_node; pc = (unsigned long)func_node->old_func; if (list_is_singular(&func_node->func_stack)) { - for (i = 0; i < LJMP_INSN_SIZE; i++) - insns[i] = func_node->arch_data.old_insns[i]; - list_del_rcu(&func->stack_node); - for (i = 0; i < LJMP_INSN_SIZE; i++) patch_instruction((struct ppc_inst *)((u32 *)pc + i), - ppc_inst(insns[i])); + ppc_inst(func_node->arch_data.old_insns[i]));
pr_debug("[%s %d] restore insns at 0x%lx\n", __func__, __LINE__, pc); flush_icache_range(pc, pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE);
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
The structure 'arch_klp_data' contains fields that are used to save a function's code before patching. On arm they are 'old_insns' and 'old_insn' (depending on whether CONFIG_ARM_MODULE_PLTS is enabled):

	struct arch_klp_data {
	#ifdef CONFIG_ARM_MODULE_PLTS
		u32 old_insns[LJMP_INSN_SIZE];
	#else
		u32 old_insn;
	#endif
	};

We can use the array 'old_insns' in both cases to replace 'old_insn', so there is no need to depend on CONFIG_ARM_MODULE_PLTS.

The same situation exists on arm64, so do the same optimization there.
Suggested-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm/include/asm/livepatch.h   |  8 +++-----
 arch/arm/kernel/livepatch.c        | 21 +++------------------
 arch/arm64/include/asm/livepatch.h |  9 ++++-----
 arch/arm64/kernel/livepatch.c      | 25 ++++---------------------
 4 files changed, 14 insertions(+), 49 deletions(-)
diff --git a/arch/arm/include/asm/livepatch.h b/arch/arm/include/asm/livepatch.h index 4f1cf4c72097..befa1efbbcd1 100644 --- a/arch/arm/include/asm/livepatch.h +++ b/arch/arm/include/asm/livepatch.h @@ -41,14 +41,12 @@ int klp_check_calltrace(struct klp_patch *patch, int enable);
#ifdef CONFIG_ARM_MODULE_PLTS #define LJMP_INSN_SIZE 3 -#endif +#else +#define LJMP_INSN_SIZE 1 +#endif /* CONFIG_ARM_MODULE_PLTS */
struct arch_klp_data { -#ifdef CONFIG_ARM_MODULE_PLTS u32 old_insns[LJMP_INSN_SIZE]; -#else - u32 old_insn; -#endif };
long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func); diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index 21efd265149a..6c6f268c8d3d 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -37,15 +37,9 @@ #define ARM_INSN_SIZE 4 #endif
-#ifdef CONFIG_ARM_MODULE_PLTS #define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * ARM_INSN_SIZE) #define CHECK_JUMP_RANGE LJMP_INSN_SIZE
-#else -#define MAX_SIZE_TO_CHECK ARM_INSN_SIZE -#define CHECK_JUMP_RANGE 1 -#endif - #ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY /* * The instruction set on arm is A32. @@ -356,7 +350,6 @@ long arm_insn_read(void *addr, u32 *insnp) long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) { long ret; -#ifdef CONFIG_ARM_MODULE_PLTS int i;
for (i = 0; i < LJMP_INSN_SIZE; i++) { @@ -364,20 +357,16 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) if (ret) break; } -#else - ret = arm_insn_read(old_func, &arch_data->old_insn); -#endif return ret; }
static int do_patch(unsigned long pc, unsigned long new_addr) { - u32 insn; + u32 insns[LJMP_INSN_SIZE];
if (!offset_in_range(pc, new_addr, SZ_32M)) { #ifdef CONFIG_ARM_MODULE_PLTS int i; - u32 insns[LJMP_INSN_SIZE];
/* * [0] LDR PC, [PC+8] @@ -399,8 +388,8 @@ static int do_patch(unsigned long pc, unsigned long new_addr) return -EFAULT; #endif } else { - insn = arm_gen_branch(pc, new_addr); - __patch_text((void *)pc, insn); + insns[0] = arm_gen_branch(pc, new_addr); + __patch_text((void *)pc, insns[0]); } return 0; } @@ -427,15 +416,11 @@ void arch_klp_unpatch_func(struct klp_func *func) func_node = func->func_node; pc = (unsigned long)func_node->old_func; if (list_is_singular(&func_node->func_stack)) { -#ifdef CONFIG_ARM_MODULE_PLTS int i;
for (i = 0; i < LJMP_INSN_SIZE; i++) { __patch_text(((u32 *)pc) + i, func_node->arch_data.old_insns[i]); } -#else - __patch_text((void *)pc, func_node->arch_data.old_insn); -#endif list_del_rcu(&func->stack_node); } else { list_del_rcu(&func->stack_node); diff --git a/arch/arm64/include/asm/livepatch.h b/arch/arm64/include/asm/livepatch.h index a9bc7ce4cc6e..7b9ea5dcea4d 100644 --- a/arch/arm64/include/asm/livepatch.h +++ b/arch/arm64/include/asm/livepatch.h @@ -48,17 +48,16 @@ int klp_check_calltrace(struct klp_patch *patch, int enable); #error Live patching support is disabled; check CONFIG_LIVEPATCH #endif
- #if defined(CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY)
+#ifdef CONFIG_ARM64_MODULE_PLTS #define LJMP_INSN_SIZE 4 +#else +#define LJMP_INSN_SIZE 1 +#endif /* CONFIG_ARM64_MODULE_PLTS */
struct arch_klp_data { -#ifdef CONFIG_ARM64_MODULE_PLTS u32 old_insns[LJMP_INSN_SIZE]; -#else - u32 old_insn; -#endif };
long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func); diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 74405b77e40e..4ced7d3d824c 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -34,7 +34,6 @@ #include <linux/sched/debug.h> #include <linux/kallsyms.h>
-#ifdef CONFIG_ARM64_MODULE_PLTS #define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32)) #define CHECK_JUMP_RANGE LJMP_INSN_SIZE
@@ -46,11 +45,6 @@ static inline bool offset_in_range(unsigned long pc, unsigned long addr, return (offset >= -range && offset < range); }
-#else -#define MAX_SIZE_TO_CHECK sizeof(u32) -#define CHECK_JUMP_RANGE 1 -#endif - #ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY /* * The instruction set on arm64 is A64. @@ -334,7 +328,6 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) { long ret; -#ifdef CONFIG_ARM64_MODULE_PLTS int i;
for (i = 0; i < LJMP_INSN_SIZE; i++) { @@ -343,25 +336,21 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) if (ret) break; } -#else - ret = aarch64_insn_read(old_func, &arch_data->old_insn); -#endif return ret; }
static int do_patch(unsigned long pc, unsigned long new_addr) { - u32 insn; + u32 insns[LJMP_INSN_SIZE];
if (offset_in_range(pc, new_addr, SZ_128M)) { - insn = aarch64_insn_gen_branch_imm(pc, new_addr, - AARCH64_INSN_BRANCH_NOLINK); - if (aarch64_insn_patch_text_nosync((void *)pc, insn)) + insns[0] = aarch64_insn_gen_branch_imm(pc, new_addr, + AARCH64_INSN_BRANCH_NOLINK); + if (aarch64_insn_patch_text_nosync((void *)pc, insns[0])) return -EPERM; } else { #ifdef CONFIG_ARM64_MODULE_PLTS int i; - u32 insns[LJMP_INSN_SIZE];
insns[0] = cpu_to_le32(0x92800010 | (((~new_addr) & 0xffff)) << 5); insns[1] = cpu_to_le32(0xf2a00010 | (((new_addr >> 16) & 0xffff)) << 5); @@ -401,22 +390,16 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func_node *func_node; struct klp_func *next_func; unsigned long pc; -#ifdef CONFIG_ARM64_MODULE_PLTS int i; -#endif
func_node = func->func_node; pc = (unsigned long)func_node->old_func; if (list_is_singular(&func_node->func_stack)) { list_del_rcu(&func->stack_node); -#ifdef CONFIG_ARM64_MODULE_PLTS for (i = 0; i < LJMP_INSN_SIZE; i++) { aarch64_insn_patch_text_nosync(((u32 *)pc) + i, func_node->arch_data.old_insns[i]); } -#else - aarch64_insn_patch_text_nosync((void *)pc, func_node->arch_data.old_insn); -#endif } else { list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack,
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
Currently, when unpatching a function, we first check whether 'func_stack' has only one item and then delete it:

	if (list_is_singular(&func_node->func_stack)) {
		list_del_rcu(&func->stack_node);
		......
	} else {
		list_del_rcu(&func->stack_node);
		next_func = list_first_or_null_rcu(&func_node->func_stack);
		......
	}

We can simplify this by deleting the node first and then checking whether 'func_stack' is empty, as sketched below.
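A sketch of the reworked flow, using the identifiers from the hunks below (the '......' elisions stand for the per-arch restore/re-patch code):

	list_del_rcu(&func->stack_node);
	if (list_empty(&func_node->func_stack)) {
		/* Last patch removed: restore the saved original code. */
		......
	} else {
		/* Re-route to the previous (now first) patched function. */
		next_func = list_first_or_null_rcu(&func_node->func_stack,
						   struct klp_func, stack_node);
		......
	}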
Suggested-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm/kernel/livepatch.c        | 5 ++---
 arch/arm64/kernel/livepatch.c      | 5 ++---
 arch/powerpc/kernel/livepatch_32.c | 5 ++---
 arch/powerpc/kernel/livepatch_64.c | 5 ++---
 arch/x86/kernel/livepatch.c        | 5 ++---
 5 files changed, 10 insertions(+), 15 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index 6c6f268c8d3d..d5223046cc66 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -415,15 +415,14 @@ void arch_klp_unpatch_func(struct klp_func *func)
func_node = func->func_node; pc = (unsigned long)func_node->old_func; - if (list_is_singular(&func_node->func_stack)) { + list_del_rcu(&func->stack_node); + if (list_empty(&func_node->func_stack)) { int i;
for (i = 0; i < LJMP_INSN_SIZE; i++) { __patch_text(((u32 *)pc) + i, func_node->arch_data.old_insns[i]); } - list_del_rcu(&func->stack_node); } else { - list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node);
diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 4ced7d3d824c..c7110c7c291c 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -394,14 +394,13 @@ void arch_klp_unpatch_func(struct klp_func *func)
func_node = func->func_node; pc = (unsigned long)func_node->old_func; - if (list_is_singular(&func_node->func_stack)) { - list_del_rcu(&func->stack_node); + list_del_rcu(&func->stack_node); + if (list_empty(&func_node->func_stack)) { for (i = 0; i < LJMP_INSN_SIZE; i++) { aarch64_insn_patch_text_nosync(((u32 *)pc) + i, func_node->arch_data.old_insns[i]); } } else { - list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node); if (WARN_ON(!next_func)) diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index ece36990699e..063546851c0a 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -449,13 +449,12 @@ void arch_klp_unpatch_func(struct klp_func *func)
func_node = func->func_node; pc = (unsigned long)func_node->old_func; - if (list_is_singular(&func_node->func_stack)) { - list_del_rcu(&func->stack_node); + list_del_rcu(&func->stack_node); + if (list_empty(&func_node->func_stack)) { for (i = 0; i < LJMP_INSN_SIZE; i++) patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), ppc_inst(func_node->arch_data.old_insns[i])); } else { - list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node); do_patch(pc, (unsigned long)next_func->new_func); diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index 9de727a7b455..770e68fae6c8 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -481,8 +481,8 @@ void arch_klp_unpatch_func(struct klp_func *func)
func_node = func->func_node; pc = (unsigned long)func_node->old_func; - if (list_is_singular(&func_node->func_stack)) { - list_del_rcu(&func->stack_node); + list_del_rcu(&func->stack_node); + if (list_empty(&func_node->func_stack)) { for (i = 0; i < LJMP_INSN_SIZE; i++) patch_instruction((struct ppc_inst *)((u32 *)pc + i), ppc_inst(func_node->arch_data.old_insns[i])); @@ -490,7 +490,6 @@ void arch_klp_unpatch_func(struct klp_func *func) pr_debug("[%s %d] restore insns at 0x%lx\n", __func__, __LINE__, pc); flush_icache_range(pc, pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE); } else { - list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node); do_patch(pc, (unsigned long)next_func->new_func, diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c index 2a541c7de167..385b8428da91 100644 --- a/arch/x86/kernel/livepatch.c +++ b/arch/x86/kernel/livepatch.c @@ -412,11 +412,10 @@ void arch_klp_unpatch_func(struct klp_func *func)
func_node = func->func_node; ip = (unsigned long)func_node->old_func; - if (list_is_singular(&func_node->func_stack)) { - list_del_rcu(&func->stack_node); + list_del_rcu(&func->stack_node); + if (list_empty(&func_node->func_stack)) { new = func_node->arch_data.old_code; } else { - list_del_rcu(&func->stack_node); next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node);
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/kernel/livepatch.c      | 21 +++++++++++++++++----
 arch/powerpc/kernel/livepatch_32.c | 21 ++++++++++++++++-----
 arch/powerpc/kernel/livepatch_64.c | 16 ++++++++++++----
 3 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index c7110c7c291c..ad4c8337f7f3 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -342,12 +342,16 @@ long arch_klp_save_old_code(struct arch_klp_data *arch_data, void *old_func) static int do_patch(unsigned long pc, unsigned long new_addr) { u32 insns[LJMP_INSN_SIZE]; + int ret;
if (offset_in_range(pc, new_addr, SZ_128M)) { insns[0] = aarch64_insn_gen_branch_imm(pc, new_addr, AARCH64_INSN_BRANCH_NOLINK); - if (aarch64_insn_patch_text_nosync((void *)pc, insns[0])) + ret = aarch64_insn_patch_text_nosync((void *)pc, insns[0]); + if (ret) { + pr_err("patch instruction small range failed, ret=%d\n", ret); return -EPERM; + } } else { #ifdef CONFIG_ARM64_MODULE_PLTS int i; @@ -357,8 +361,12 @@ static int do_patch(unsigned long pc, unsigned long new_addr) insns[2] = cpu_to_le32(0xf2c00010 | (((new_addr >> 32) & 0xffff)) << 5); insns[3] = cpu_to_le32(0xd61f0200); for (i = 0; i < LJMP_INSN_SIZE; i++) { - if (aarch64_insn_patch_text_nosync(((u32 *)pc) + i, insns[i])) + ret = aarch64_insn_patch_text_nosync(((u32 *)pc) + i, insns[i]); + if (ret) { + pr_err("patch instruction(%d) large range failed, ret=%d\n", + i, ret); return -EPERM; + } } #else /* @@ -391,14 +399,19 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func *next_func; unsigned long pc; int i; + int ret;
func_node = func->func_node; pc = (unsigned long)func_node->old_func; list_del_rcu(&func->stack_node); if (list_empty(&func_node->func_stack)) { for (i = 0; i < LJMP_INSN_SIZE; i++) { - aarch64_insn_patch_text_nosync(((u32 *)pc) + i, - func_node->arch_data.old_insns[i]); + ret = aarch64_insn_patch_text_nosync(((u32 *)pc) + i, + func_node->arch_data.old_insns[i]); + if (ret) { + pr_err("restore instruction(%d) failed, ret=%d\n", i, ret); + return; + } } } else { next_func = list_first_or_null_rcu(&func_node->func_stack, diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index 063546851c0a..8fe9ebe43b25 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -403,8 +403,10 @@ static int do_patch(unsigned long pc, unsigned long new_addr)
create_branch(&instr, (struct ppc_inst *)pc, new_addr, 0); ret = patch_instruction((struct ppc_inst *)pc, instr); - if (ret) + if (ret) { + pr_err("patch instruction small range failed, ret=%d\n", ret); return -EPERM; + } } else { /* * lis r12,sym@ha @@ -420,8 +422,11 @@ static int do_patch(unsigned long pc, unsigned long new_addr) for (i = 0; i < LJMP_INSN_SIZE; i++) { ret = patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), ppc_inst(insns[i])); - if (ret) + if (ret) { + pr_err("patch instruction(%d) large range failed, ret=%d\n", + i, ret); return -EPERM; + } } } return 0; @@ -446,14 +451,20 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func *next_func; unsigned long pc; int i; + int ret;
func_node = func->func_node; pc = (unsigned long)func_node->old_func; list_del_rcu(&func->stack_node); if (list_empty(&func_node->func_stack)) { - for (i = 0; i < LJMP_INSN_SIZE; i++) - patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), - ppc_inst(func_node->arch_data.old_insns[i])); + for (i = 0; i < LJMP_INSN_SIZE; i++) { + ret = patch_instruction((struct ppc_inst *)(((u32 *)pc) + i), + ppc_inst(func_node->arch_data.old_insns[i])); + if (ret) { + pr_err("restore instruction(%d) failed, ret=%d\n", i, ret); + return; + } + } } else { next_func = list_first_or_null_rcu(&func_node->func_stack, struct klp_func, stack_node); diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index 770e68fae6c8..90d3e37a0bfe 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -446,8 +446,10 @@ static int do_patch(unsigned long pc, unsigned long new_addr,
ret = livepatch_create_branch(pc, (unsigned long)&arch_data->trampoline, new_addr, old_mod); - if (ret) + if (ret) { + pr_err("create branch failed, ret=%d\n", ret); return -EPERM; + } flush_icache_range(pc, pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE); pr_debug("[%s %d] old = 0x%lx/0x%lx/%pS, new = 0x%lx/0x%lx/%pS\n", __func__, __LINE__, @@ -478,14 +480,20 @@ void arch_klp_unpatch_func(struct klp_func *func) struct klp_func *next_func; unsigned long pc; int i; + int ret;
func_node = func->func_node; pc = (unsigned long)func_node->old_func; list_del_rcu(&func->stack_node); if (list_empty(&func_node->func_stack)) { - for (i = 0; i < LJMP_INSN_SIZE; i++) - patch_instruction((struct ppc_inst *)((u32 *)pc + i), - ppc_inst(func_node->arch_data.old_insns[i])); + for (i = 0; i < LJMP_INSN_SIZE; i++) { + ret = patch_instruction((struct ppc_inst *)((u32 *)pc + i), + ppc_inst(func_node->arch_data.old_insns[i])); + if (ret) { + pr_err("restore instruction(%d) failed, ret=%d\n", i, ret); + break; + } + }
pr_debug("[%s %d] restore insns at 0x%lx\n", __func__, __LINE__, pc); flush_icache_range(pc, pc + LJMP_INSN_SIZE * PPC64_INSN_SIZE);
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm/kernel/livepatch.c        |  4 +++-
 arch/arm64/kernel/livepatch.c      |  4 +++-
 arch/powerpc/kernel/livepatch_32.c |  4 +++-
 arch/powerpc/kernel/livepatch_64.c |  4 +++-
 arch/x86/kernel/livepatch.c        | 24 ++++++++++++++----------
 5 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index d5223046cc66..da88113d14e9 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -283,8 +283,10 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) };
ret = klp_check_activeness_func(patch, enable, &check_funcs); - if (ret) + if (ret) { + pr_err("collect active functions failed, ret=%d\n", ret); goto out; + } args.check_funcs = check_funcs;
for_each_process_thread(g, t) { diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index ad4c8337f7f3..e83e4ce94887 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -276,8 +276,10 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) };
ret = klp_check_activeness_func(patch, enable, &check_funcs); - if (ret) + if (ret) { + pr_err("collect active functions failed, ret=%d\n", ret); goto out; + } args.check_funcs = check_funcs;
for_each_process_thread(g, t) { diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index 8fe9ebe43b25..a3cf41af073e 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -311,8 +311,10 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) };
ret = klp_check_activeness_func(patch, enable, &check_funcs); - if (ret) + if (ret) { + pr_err("collect active functions failed, ret=%d\n", ret); goto out; + } args.check_funcs = check_funcs;
for_each_process_thread(g, t) { diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index 90d3e37a0bfe..0098ad48f918 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -359,8 +359,10 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) struct walk_stackframe_args args;
ret = klp_check_activeness_func(patch, enable, &check_funcs); - if (ret) + if (ret) { + pr_err("collect active functions failed, ret=%d\n", ret); goto out; + } args.check_funcs = check_funcs; args.ret = 0;
diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c index 385b8428da91..fe34183826d3 100644 --- a/arch/x86/kernel/livepatch.c +++ b/arch/x86/kernel/livepatch.c @@ -321,38 +321,42 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) #endif
ret = klp_check_activeness_func(patch, enable, &check_funcs); - if (ret) + if (ret) { + pr_err("collect active functions failed, ret=%d\n", ret); goto out; + } for_each_process_thread(g, t) { if (!strncmp(t->comm, "migration/", 10)) continue;
#ifdef CONFIG_ARCH_STACKWALK ret = stack_trace_save_tsk_reliable(t, trace_entries, MAX_STACK_ENTRIES); - if (ret < 0) + if (ret < 0) { + pr_err("%s:%d has an unreliable stack, ret=%d\n", + t->comm, t->pid, ret); goto out; + } trace_len = ret; - ret = 0; + ret = klp_check_stack(trace_entries, trace_len, check_funcs); #else trace.skip = 0; trace.nr_entries = 0; trace.max_entries = MAX_STACK_ENTRIES; trace.entries = trace_entries; ret = save_stack_trace_tsk_reliable(t, &trace); -#endif WARN_ON_ONCE(ret == -ENOSYS); if (ret) { - pr_info("%s: %s:%d has an unreliable stack\n", - __func__, t->comm, t->pid); + pr_err("%s: %s:%d has an unreliable stack, ret=%d\n", + __func__, t->comm, t->pid, ret); goto out; } -#ifdef CONFIG_ARCH_STACKWALK - ret = klp_check_stack(trace_entries, trace_len, check_funcs); -#else ret = klp_check_stack(&trace, 0, check_funcs); #endif - if (ret) + if (ret) { + pr_err("%s:%d check stack failed, ret=%d\n", + t->comm, t->pid, ret); goto out; + } }
out:
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53WZ9
--------------------------------
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 kernel/livepatch/core.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index d34b68614f2c..87ed93df7a98 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -1026,6 +1026,7 @@ static int klp_init_object_loaded(struct klp_patch *patch, ret = klp_apply_object_relocs(patch, obj); if (ret) { module_enable_ro(patch->mod, true); + pr_err("apply object relocations failed, ret=%d\n", ret); return ret; } } @@ -1046,6 +1047,19 @@ static int klp_init_object_loaded(struct klp_patch *patch, return -ENOENT; }
+#ifdef PPC64_ELF_ABI_v1 + /* + * PPC64 big endian binary format is 'elfv1' defaultly, actual + * symbol name of old function need a prefix '.' (related + * feature 'function descriptor'), otherwise size found by + * 'kallsyms_lookup_size_offset' may be abnormal. + */ + if (func->old_name[0] != '.') { + pr_warn("old_name '%s' may miss the prefix '.', old_size=%lu\n", + func->old_name, func->old_size); + } +#endif + if (func->nop) func->new_func = func->old_func;
@@ -1067,8 +1081,10 @@ static int klp_init_object(struct klp_patch *patch, struct klp_object *obj) int ret; const char *name;
- if (klp_is_module(obj) && strlen(obj->name) >= MODULE_NAME_LEN) + if (klp_is_module(obj) && strnlen(obj->name, MODULE_NAME_LEN) >= MODULE_NAME_LEN) { + pr_err("obj name is too long\n"); return -EINVAL; + } klp_for_each_func(obj, func) { if (!func->old_name) { pr_err("old name is invalid\n"); @@ -1202,6 +1218,7 @@ static int klp_init_patch(struct klp_patch *patch) ret = jump_label_register(patch->mod); if (ret) { module_enable_ro(patch->mod, true); + pr_err("register jump label failed, ret=%d\n", ret); return ret; } module_enable_ro(patch->mod, true); @@ -1711,12 +1728,24 @@ int klp_register_patch(struct klp_patch *patch) int ret; struct klp_object *obj;
- if (!patch || !patch->mod || !patch->objs) + if (!patch) { + pr_err("patch invalid\n"); + return -EINVAL; + } + if (!patch->mod) { + pr_err("patch->mod invalid\n"); + return -EINVAL; + } + if (!patch->objs) { + pr_err("patch->objs invalid\n"); return -EINVAL; + }
klp_for_each_object_static(patch, obj) { - if (!obj->funcs) + if (!obj->funcs) { + pr_err("obj->funcs invalid\n"); return -EINVAL; + } }
if (!is_livepatch_module(patch->mod)) { @@ -1725,8 +1754,10 @@ int klp_register_patch(struct klp_patch *patch) return -EINVAL; }
- if (!klp_initialized()) + if (!klp_initialized()) { + pr_err("kernel live patch not available\n"); return -ENODEV; + }
mutex_lock(&klp_mutex);
From: Zheng Yejian <zhengyejian1@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4ZII6
--------------------------------
A kernel panic happened on an 'arm64 big endian' board after calling a function that had been live-patched. It can be reproduced as follows:
1. Insert 'livepatch-sample.ko' to patch the function 'cmdline_proc_show';
2. Enable the patch by executing:
   echo 1 > /sys/kernel/livepatch/livepatch-sample/enabled
3. Call 'cmdline_proc_show' by executing: cat /proc/cmdline
4. Then we get the following panic logs:
> kernel BUG at arch/arm64/kernel/traps.c:408!
> Internal error: Oops - BUG: 0 [#1] SMP
> Modules linked in: dump_mem(OE) livepatch_cmdline1(OEK)
> [last unloaded: dump_mem]
> CPU: 3 PID: 1752 Comm: cat Session: 0 Tainted: G OE K 5.10.0+ #2
> Hardware name: Hisilicon PhosphorHi1382 (DT)
> pstate: 00000005 (nzcv daif -PAN -UAO -TCO BTYPE=--)
> pc : do_undefinstr+0x23c/0x2b4
> lr : do_undefinstr+0x5c/0x2b4
> sp : ffffffc010ac3a80
> x29: ffffffc010ac3a80 x28: ffffff82eb0a8000
> x27: 0000000000000000 x26: 0000000000000001
> x25: 0000000000000000 x24: 0000000000001000
> x23: 0000000000000000 x22: ffffffd0e0f16000
> x21: ffffffd0e0ae7000 x20: ffffffc010ac3b00
> x19: 0000000000021fd6 x18: ffffffd0e04aad94
> x17: 0000000000000000 x16: 0000000000000000
> x15: ffffffd0e04b519c x14: 0000000000000000
> x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000
> x9 : 0000000000000000 x8 : 0000000000000000
> x7 : 0000000000000000 x6 : ffffffd0e0f16100
> x5 : 0000000000000000 x4 : 00000000d5300000
> x3 : 0000000000000000 x2 : ffffffd0e0f160f0
> x1 : ffffffd0e0f16103 x0 : 0000000000000005
> Call trace:
>  do_undefinstr+0x23c/0x2b4
>  el1_undef+0x2c/0x44
>  el1_sync_handler+0xa4/0xb0
>  el1_sync+0x74/0x100
>  cmdline_proc_show+0xc/0x44
>  proc_reg_read_iter+0xb0/0xc4
>  new_sync_read+0x10c/0x15c
>  vfs_read+0x144/0x18c
>  ksys_read+0x78/0xe8
>  __arm64_sys_read+0x24/0x30
Comparing the first 6 instruction words of 'cmdline_proc_show' before and after patching (see below), 4 of them were modified, so this is the case where the offset between the old and the new function is beyond the 128M branch range. We also found that the word at 'cmdline_proc_show+0xc' looks incorrect (it is expected to be '00021fd6'):

origin:     patched:
--------    --------
fd7bbea9    929ff7f0
21d500f0    f2a91b30
fd030091    f2d00010
211040f9    d61f0200   <-- cmdline_proc_show+0xc (expected '00021fd6')
f30b00f9    f30b00f9
f30300aa    f30300aa
It is caused by an incorrect big-to-little endian conversion: the long-jump instruction words are passed through cpu_to_le32() before being handed to aarch64_insn_patch_text_nosync(), but that helper (via aarch64_insn_write()) already converts the CPU-endian value to little-endian when writing the text. On a big-endian kernel the word is therefore byte-swapped twice and a corrupted instruction is written. Drop the extra conversion.
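An illustrative sketch of the double conversion (the values mirror the dump above; this is not code from the patch):

	u32 br_x16 = 0xd61f0200;           /* BR X16, CPU-endian value */
	u32 wrong  = cpu_to_le32(br_x16);  /* on a big-endian kernel: 0x00021fd6 */

	/*
	 * aarch64_insn_patch_text_nosync(pc, wrong) converts the value to
	 * little-endian again on its way to memory, so the bytes of the
	 * BR X16 encoding end up reversed and the CPU takes an undefined
	 * instruction trap.  Passing the plain 0xd61f0200 lets the helper
	 * perform the single, correct conversion.
	 */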
Fixes: e429c61d12bf ("livepatch/arm64: Support livepatch without ftrace")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/kernel/livepatch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c
index e83e4ce94887..4bc35725af36 100644
--- a/arch/arm64/kernel/livepatch.c
+++ b/arch/arm64/kernel/livepatch.c
@@ -358,10 +358,10 @@ static int do_patch(unsigned long pc, unsigned long new_addr)
 #ifdef CONFIG_ARM64_MODULE_PLTS
 		int i;
 
-		insns[0] = cpu_to_le32(0x92800010 | (((~new_addr) & 0xffff)) << 5);
-		insns[1] = cpu_to_le32(0xf2a00010 | (((new_addr >> 16) & 0xffff)) << 5);
-		insns[2] = cpu_to_le32(0xf2c00010 | (((new_addr >> 32) & 0xffff)) << 5);
-		insns[3] = cpu_to_le32(0xd61f0200);
+		insns[0] = 0x92800010 | (((~new_addr) & 0xffff)) << 5;
+		insns[1] = 0xf2a00010 | (((new_addr >> 16) & 0xffff)) << 5;
+		insns[2] = 0xf2c00010 | (((new_addr >> 32) & 0xffff)) << 5;
+		insns[3] = 0xd61f0200;
 		for (i = 0; i < LJMP_INSN_SIZE; i++) {
 			ret = aarch64_insn_patch_text_nosync(((u32 *)pc) + i, insns[i]);
 			if (ret) {
From: Thomas Gleixner <tglx@linutronix.de>

mainline inclusion
from mainline-v5.17-rc1
commit 24ee940d89277602147ce1b8b4fd87b01b9a6660
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I53K0E
CVE: NA
-------------------------------------------------------------------------
While reporting a quiescent state for a given CPU, rcu_core() takes advantage of the freshly loaded grace period sequence number and the locked rnp to accelerate the callbacks whose sequence number have been assigned a stale value.
This action is only necessary when the rdp isn't offloaded, otherwise the NOCB kthreads already take care of the callbacks progression.
However the check for the offloaded state is volatile because it is performed outside the IRQs disabled section. It's possible for the offloading process to preempt rcu_core() at that point on PREEMPT_RT.
This is dangerous because rcu_core() may end up accelerating callbacks concurrently with NOCB kthreads without appropriate locking.
Fix this by moving the offloaded check inside the rnp locking section.
Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Conflicts:
	kernel/rcu/tree.c

Move "const bool offloaded = ..." down, so that it is within the irq disabled protection range, and with minimal changes.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 kernel/rcu/tree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b9c45b2d7690..fc3e2180e9d5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2276,8 +2276,6 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
 	unsigned long flags;
 	unsigned long mask;
 	bool needwake = false;
-	const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-			       rcu_segcblist_is_offloaded(&rdp->cblist);
 	struct rcu_node *rnp;
 
 	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
@@ -2301,9 +2299,13 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
 	if ((rnp->qsmask & mask) == 0) {
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	} else {
+		const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
+				       rcu_segcblist_is_offloaded(&rdp->cblist);
 		/*
 		 * This GP can't end until cpu checks in, so all of our
 		 * callbacks can be processed during the next GP.
+		 *
+		 * NOCB kthreads have their own way to deal with that.
 		 */
 		if (!offloaded)
 			needwake = rcu_accelerate_cbs(rnp, rdp);
From: Yang Jihong <yangjihong1@huawei.com>

maillist inclusion
category: Feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I53L83
CVE: NA

Reference: https://lore.kernel.org/all/20210104020930.GA4897@leoy-ThinkPad-X240s/
-------------------
Arm SPE trace data doesn't support HITM, but we still want to use the "perf c2c" tool to analyze cache false sharing. Without the HITM tag, the tool cannot give an accurate result for cache false sharing; a candidate solution is to sort on the total load operations and connect them with the thread info, e.g. if multiple threads hit the same cache line many times, this gives a hint that the line is likely to cause a cache false sharing issue.

Unlike having the HITM tag, the proposed solution is not accurate and might introduce false positives, but it's a pragmatic approach for detecting false sharing if the memory event doesn't support HITM.

To allow sorting on cache line hits, this patch adds dimensions for the total load hits and the associated percentage calculation.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 tools/perf/builtin-c2c.c | 112 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c index d247f9878948..570f523b6e1d 100644 --- a/tools/perf/builtin-c2c.c +++ b/tools/perf/builtin-c2c.c @@ -615,6 +615,47 @@ tot_hitm_cmp(struct perf_hpp_fmt *fmt __maybe_unused, return tot_hitm_left - tot_hitm_right; }
+#define TOT_LD_HIT(stats) \ + ((stats)->ld_fbhit + \ + (stats)->ld_l1hit + \ + (stats)->ld_l2hit + \ + (stats)->ld_llchit + \ + (stats)->lcl_hitm + \ + (stats)->rmt_hitm + \ + (stats)->rmt_hit) + +static int tot_ld_hit_entry(struct perf_hpp_fmt *fmt, + struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct c2c_hist_entry *c2c_he; + int width = c2c_width(fmt, hpp, he->hists); + unsigned int tot_hit; + + c2c_he = container_of(he, struct c2c_hist_entry, he); + tot_hit = TOT_LD_HIT(&c2c_he->stats); + + return scnprintf(hpp->buf, hpp->size, "%*u", width, tot_hit); +} + +static int64_t tot_ld_hit_cmp(struct perf_hpp_fmt *fmt __maybe_unused, + struct hist_entry *left, + struct hist_entry *right) +{ + struct c2c_hist_entry *c2c_left; + struct c2c_hist_entry *c2c_right; + uint64_t tot_hit_left; + uint64_t tot_hit_right; + + c2c_left = container_of(left, struct c2c_hist_entry, he); + c2c_right = container_of(right, struct c2c_hist_entry, he); + + tot_hit_left = TOT_LD_HIT(&c2c_left->stats); + tot_hit_right = TOT_LD_HIT(&c2c_right->stats); + + return tot_hit_left - tot_hit_right; +} + #define STAT_FN_ENTRY(__f) \ static int \ __f ## _entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, \ @@ -860,6 +901,58 @@ percent_hitm_cmp(struct perf_hpp_fmt *fmt __maybe_unused, return per_left - per_right; }
+static double percent_tot_ld_hit(struct c2c_hist_entry *c2c_he) +{ + struct c2c_hists *hists; + int tot = 0, st = 0; + + hists = container_of(c2c_he->he.hists, struct c2c_hists, hists); + + st = TOT_LD_HIT(&c2c_he->stats); + tot = TOT_LD_HIT(&hists->stats); + + return tot ? (double) st * 100 / tot : 0; +} + +static int +percent_tot_ld_hit_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct c2c_hist_entry *c2c_he; + int width = c2c_width(fmt, hpp, he->hists); + char buf[10]; + double per; + + c2c_he = container_of(he, struct c2c_hist_entry, he); + per = percent_tot_ld_hit(c2c_he); + return scnprintf(hpp->buf, hpp->size, "%*s", width, PERC_STR(buf, per)); +} + +static int +percent_tot_ld_hit_color(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + return percent_color(fmt, hpp, he, percent_tot_ld_hit); +} + +static int64_t +percent_tot_ld_hit_cmp(struct perf_hpp_fmt *fmt __maybe_unused, + struct hist_entry *left, struct hist_entry *right) +{ + struct c2c_hist_entry *c2c_left; + struct c2c_hist_entry *c2c_right; + double per_left; + double per_right; + + c2c_left = container_of(left, struct c2c_hist_entry, he); + c2c_right = container_of(right, struct c2c_hist_entry, he); + + per_left = percent_tot_ld_hit(c2c_left); + per_right = percent_tot_ld_hit(c2c_right); + + return per_left - per_right; +} + static struct c2c_stats *he_stats(struct hist_entry *he) { struct c2c_hist_entry *c2c_he; @@ -1419,6 +1512,14 @@ static struct c2c_dimension dim_ld_rmthit = { .width = 8, };
+static struct c2c_dimension dim_tot_ld_hit = { + .header = HEADER_BOTH("Load Hit", "Total"), + .name = "tot_ld_hit", + .cmp = tot_ld_hit_cmp, + .entry = tot_ld_hit_entry, + .width = 8, +}; + static struct c2c_dimension dim_tot_recs = { .header = HEADER_BOTH("Total", "records"), .name = "tot_recs", @@ -1467,6 +1568,15 @@ static struct c2c_dimension dim_percent_lcl_hitm = { .width = 7, };
+static struct c2c_dimension dim_percent_tot_ld_hit = { + .header = HEADER_BOTH("Load Hit", "Pct"), + .name = "percent_tot_ld_hit", + .cmp = percent_tot_ld_hit_cmp, + .entry = percent_tot_ld_hit_entry, + .color = percent_tot_ld_hit_color, + .width = 8, +}; + static struct c2c_dimension dim_percent_stores_l1hit = { .header = HEADER_SPAN("-- Store Refs --", "L1 Hit", 1), .name = "percent_stores_l1hit", @@ -1622,11 +1732,13 @@ static struct c2c_dimension *dimensions[] = { &dim_ld_l2hit, &dim_ld_llchit, &dim_ld_rmthit, + &dim_tot_ld_hit, &dim_tot_recs, &dim_tot_loads, &dim_percent_hitm, &dim_percent_rmt_hitm, &dim_percent_lcl_hitm, + &dim_percent_tot_ld_hit, &dim_percent_stores_l1hit, &dim_percent_stores_l1miss, &dim_dram_lcl,
From: Yang Jihong yangjihong1@huawei.com
maillist inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53L83 CVE: NA
Reference: https://lore.kernel.org/all/20210104020930.GA4897@leoy-ThinkPad-X240s/
-------------------
Add dimensions for the load hit count and its percentage calculation, to be displayed in the single cache line output.
Signed-off-by: Leo Yan leo.yan@linaro.org Signed-off-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/perf/builtin-c2c.c | 71 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c index 570f523b6e1d..399cc080bd32 100644 --- a/tools/perf/builtin-c2c.c +++ b/tools/perf/builtin-c2c.c @@ -1052,6 +1052,58 @@ percent_lcl_hitm_cmp(struct perf_hpp_fmt *fmt __maybe_unused, return per_left - per_right; }
+static double percent_ld_hit(struct c2c_hist_entry *c2c_he) +{ + struct c2c_hists *hists; + int tot, st; + + hists = container_of(c2c_he->he.hists, struct c2c_hists, hists); + + st = TOT_LD_HIT(&c2c_he->stats); + tot = TOT_LD_HIT(&hists->stats); + + return percent(st, tot); +} + +static int +percent_ld_hit_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct c2c_hist_entry *c2c_he; + int width = c2c_width(fmt, hpp, he->hists); + char buf[10]; + double per; + + c2c_he = container_of(he, struct c2c_hist_entry, he); + per = percent_ld_hit(c2c_he); + return scnprintf(hpp->buf, hpp->size, "%*s", width, PERC_STR(buf, per)); +} + +static int +percent_ld_hit_color(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + return percent_color(fmt, hpp, he, percent_ld_hit); +} + +static int64_t +percent_ld_hit_cmp(struct perf_hpp_fmt *fmt __maybe_unused, + struct hist_entry *left, struct hist_entry *right) +{ + struct c2c_hist_entry *c2c_left; + struct c2c_hist_entry *c2c_right; + double per_left; + double per_right; + + c2c_left = container_of(left, struct c2c_hist_entry, he); + c2c_right = container_of(right, struct c2c_hist_entry, he); + + per_left = percent_ld_hit(c2c_left); + per_right = percent_ld_hit(c2c_right); + + return per_left - per_right; +} + static int percent_stores_l1hit_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, struct hist_entry *he) @@ -1424,6 +1476,14 @@ static struct c2c_dimension dim_cl_rmt_hitm = { .width = 7, };
+static struct c2c_dimension dim_cl_tot_ld_hit = { + .header = HEADER_SPAN("--- Load ---", "Hit", 1), + .name = "cl_tot_ld_hit", + .cmp = tot_ld_hit_cmp, + .entry = tot_ld_hit_entry, + .width = 7, +}; + static struct c2c_dimension dim_cl_lcl_hitm = { .header = HEADER_SPAN_LOW("Lcl"), .name = "cl_lcl_hitm", @@ -1577,6 +1637,15 @@ static struct c2c_dimension dim_percent_tot_ld_hit = { .width = 8, };
+static struct c2c_dimension dim_percent_ld_hit = { + .header = HEADER_SPAN("-- Load Refs --", "Hit", 1), + .name = "percent_ld_hit", + .cmp = percent_ld_hit_cmp, + .entry = percent_ld_hit_entry, + .color = percent_ld_hit_color, + .width = 7, +}; + static struct c2c_dimension dim_percent_stores_l1hit = { .header = HEADER_SPAN("-- Store Refs --", "L1 Hit", 1), .name = "percent_stores_l1hit", @@ -1722,6 +1791,7 @@ static struct c2c_dimension *dimensions[] = { &dim_rmt_hitm, &dim_cl_lcl_hitm, &dim_cl_rmt_hitm, + &dim_cl_tot_ld_hit, &dim_tot_stores, &dim_stores_l1hit, &dim_stores_l1miss, @@ -1738,6 +1808,7 @@ static struct c2c_dimension *dimensions[] = { &dim_percent_hitm, &dim_percent_rmt_hitm, &dim_percent_lcl_hitm, + &dim_percent_ld_hit, &dim_percent_tot_ld_hit, &dim_percent_stores_l1hit, &dim_percent_stores_l1miss,
From: Yang Jihong yangjihong1@huawei.com
maillist inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53L83 CVE: NA
Reference: https://lore.kernel.org/all/20210104020930.GA4897@leoy-ThinkPad-X240s/
-------------------
Add dimensions for the load miss count and its percentage calculation, to be displayed in the single cache line output.
Signed-off-by: Leo Yan leo.yan@linaro.org Signed-off-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/perf/builtin-c2c.c | 107 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 107 insertions(+)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c index 399cc080bd32..91eefd9591f8 100644 --- a/tools/perf/builtin-c2c.c +++ b/tools/perf/builtin-c2c.c @@ -624,6 +624,10 @@ tot_hitm_cmp(struct perf_hpp_fmt *fmt __maybe_unused, (stats)->rmt_hitm + \ (stats)->rmt_hit)
+#define TOT_LD_MISS(stats) \ + ((stats)->lcl_dram + \ + (stats)->rmt_dram) + static int tot_ld_hit_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, struct hist_entry *he) @@ -656,6 +660,38 @@ static int64_t tot_ld_hit_cmp(struct perf_hpp_fmt *fmt __maybe_unused, return tot_hit_left - tot_hit_right; }
+static int tot_ld_miss_entry(struct perf_hpp_fmt *fmt, + struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct c2c_hist_entry *c2c_he; + int width = c2c_width(fmt, hpp, he->hists); + unsigned int tot_miss; + + c2c_he = container_of(he, struct c2c_hist_entry, he); + tot_miss = TOT_LD_MISS(&c2c_he->stats); + + return scnprintf(hpp->buf, hpp->size, "%*u", width, tot_miss); +} + +static int64_t tot_ld_miss_cmp(struct perf_hpp_fmt *fmt __maybe_unused, + struct hist_entry *left, + struct hist_entry *right) +{ + struct c2c_hist_entry *c2c_left; + struct c2c_hist_entry *c2c_right; + uint64_t tot_miss_left; + uint64_t tot_miss_right; + + c2c_left = container_of(left, struct c2c_hist_entry, he); + c2c_right = container_of(right, struct c2c_hist_entry, he); + + tot_miss_left = TOT_LD_MISS(&c2c_left->stats); + tot_miss_right = TOT_LD_MISS(&c2c_right->stats); + + return tot_miss_left - tot_miss_right; +} + #define STAT_FN_ENTRY(__f) \ static int \ __f ## _entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, \ @@ -1104,6 +1140,58 @@ percent_ld_hit_cmp(struct perf_hpp_fmt *fmt __maybe_unused, return per_left - per_right; }
+static double percent_ld_miss(struct c2c_hist_entry *c2c_he) +{ + struct c2c_hists *hists; + int tot, st; + + hists = container_of(c2c_he->he.hists, struct c2c_hists, hists); + + st = TOT_LD_MISS(&c2c_he->stats); + tot = TOT_LD_MISS(&hists->stats); + + return percent(st, tot); +} + +static int +percent_ld_miss_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct c2c_hist_entry *c2c_he; + int width = c2c_width(fmt, hpp, he->hists); + char buf[10]; + double per; + + c2c_he = container_of(he, struct c2c_hist_entry, he); + per = percent_ld_miss(c2c_he); + return scnprintf(hpp->buf, hpp->size, "%*s", width, PERC_STR(buf, per)); +} + +static int +percent_ld_miss_color(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + return percent_color(fmt, hpp, he, percent_ld_miss); +} + +static int64_t +percent_ld_miss_cmp(struct perf_hpp_fmt *fmt __maybe_unused, + struct hist_entry *left, struct hist_entry *right) +{ + struct c2c_hist_entry *c2c_left; + struct c2c_hist_entry *c2c_right; + double per_left; + double per_right; + + c2c_left = container_of(left, struct c2c_hist_entry, he); + c2c_right = container_of(right, struct c2c_hist_entry, he); + + per_left = percent_ld_miss(c2c_left); + per_right = percent_ld_miss(c2c_right); + + return per_left - per_right; +} + static int percent_stores_l1hit_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, struct hist_entry *he) @@ -1484,6 +1572,14 @@ static struct c2c_dimension dim_cl_tot_ld_hit = { .width = 7, };
+static struct c2c_dimension dim_cl_tot_ld_miss = { + .header = HEADER_SPAN_LOW("Miss"), + .name = "cl_tot_ld_miss", + .cmp = tot_ld_miss_cmp, + .entry = tot_ld_miss_entry, + .width = 7, +}; + static struct c2c_dimension dim_cl_lcl_hitm = { .header = HEADER_SPAN_LOW("Lcl"), .name = "cl_lcl_hitm", @@ -1646,6 +1742,15 @@ static struct c2c_dimension dim_percent_ld_hit = { .width = 7, };
+static struct c2c_dimension dim_percent_ld_miss = { + .header = HEADER_SPAN_LOW("Miss"), + .name = "percent_ld_miss", + .cmp = percent_ld_miss_cmp, + .entry = percent_ld_miss_entry, + .color = percent_ld_miss_color, + .width = 7, +}; + static struct c2c_dimension dim_percent_stores_l1hit = { .header = HEADER_SPAN("-- Store Refs --", "L1 Hit", 1), .name = "percent_stores_l1hit", @@ -1792,6 +1897,7 @@ static struct c2c_dimension *dimensions[] = { &dim_cl_lcl_hitm, &dim_cl_rmt_hitm, &dim_cl_tot_ld_hit, + &dim_cl_tot_ld_miss, &dim_tot_stores, &dim_stores_l1hit, &dim_stores_l1miss, @@ -1809,6 +1915,7 @@ static struct c2c_dimension *dimensions[] = { &dim_percent_rmt_hitm, &dim_percent_lcl_hitm, &dim_percent_ld_hit, + &dim_percent_ld_miss, &dim_percent_tot_ld_hit, &dim_percent_stores_l1hit, &dim_percent_stores_l1miss,
From: Yang Jihong yangjihong1@huawei.com
maillist inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53L83 CVE: NA
Reference: https://lore.kernel.org/all/20210104020930.GA4897@leoy-ThinkPad-X240s/
-------------------
The node header array contains 3 items, each used for one of the 3 flavors of node accessing info. To extend sorting to all load references rather than always sticking to HITMs, the second header string "Node{cpus %hitms %stores}" needs to be adjusted (e.g. changed to "Node{cpus %loads %stores}").
For this reason, this patch changes the node header array to three flat variables and uses a switch-case in setup_nodes_header(), which makes it easier to alter the header string.
Signed-off-by: Leo Yan leo.yan@linaro.org Signed-off-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/perf/builtin-c2c.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c index 91eefd9591f8..c3c1ded7b9f7 100644 --- a/tools/perf/builtin-c2c.c +++ b/tools/perf/builtin-c2c.c @@ -1810,12 +1810,6 @@ static struct c2c_dimension dim_dso = { .se = &sort_dso, };
-static struct c2c_header header_node[3] = { - HEADER_LOW("Node"), - HEADER_LOW("Node{cpus %hitms %stores}"), - HEADER_LOW("Node{cpu list}"), -}; - static struct c2c_dimension dim_node = { .name = "node", .cmp = empty_cmp, @@ -2294,9 +2288,27 @@ static int resort_cl_cb(struct hist_entry *he, void *arg __maybe_unused) return 0; }
+static struct c2c_header header_node_0 = HEADER_LOW("Node"); +static struct c2c_header header_node_1 = HEADER_LOW("Node{cpus %hitms %stores}"); +static struct c2c_header header_node_2 = HEADER_LOW("Node{cpu list}"); + static void setup_nodes_header(void) { - dim_node.header = header_node[c2c.node_info]; + switch (c2c.node_info) { + case 0: + dim_node.header = header_node_0; + break; + case 1: + dim_node.header = header_node_1; + break; + case 2: + dim_node.header = header_node_2; + break; + default: + break; + } + + return; }
static int setup_nodes(struct perf_session *session)
From: Yang Jihong yangjihong1@huawei.com
maillist inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53L83 CVE: NA
Reference: https://lore.kernel.org/all/20210104020930.GA4897@leoy-ThinkPad-X240s/
-------------------
In addition to the existing three display options 'tot', 'rmt' and 'lcl', this patch adds the option 'all' so we can sort on all cache hits for load operations. This newly introduced option can be a choice for profiling cache false sharing if the memory event doesn't contain HITM tags.
When displaying with option 'all', the "Shared Data Cache Line Table" and the "Shared Cache Line Distribution Pareto" both differ from the other three display options.
For the "Shared Data Cache Line Table", instead of sorting on HITM metrics, it sorts on the metrics "tot_ld_hit" and "percent_tot_ld_hit". Without HITM metrics, users can still analyze the load hit statistics for all cache levels, so the total load hit dimensions are used to replace the HITM dimensions.
For the Pareto, every single cache line shows the metrics "cl_tot_ld_hit" and "cl_tot_ld_miss" instead of "cl_rmt_hitm" and "cl_lcl_hitm", and the single cache line view is sorted by the metric "tot_ld_hit".
As a result, we can get the 'all' display as follows:
# perf c2c report -d all --coalesce tid,pid,iaddr,dso --stdio
[...]
=================================================
            Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ----------  Load Hit  Load Hit    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt       Pct     Total  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2   LclHit  LclHitm   RmtHit  RmtHitm      Lcl      Rmt
# .....  ..................  ....  ......  ........  ........  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......
#
      0      0x556f25dff100     0    1895    75.73%      4591     7840     4591     3249     2633      616      849     2734       67       58      883        0        0        0        0
      1      0x556f25dff080     0       1    13.10%       794      794      794        0        0        0      164      486       28       20       96        0        0        0        0
      2      0x556f25dff0c0     0       1    10.01%       607      607      607        0        0        0      107        5        5      488        2        0        0        0        0
=================================================
      Shared Cache Line Distribution Pareto
=================================================
#
#        -- Load Refs --  -- Store Refs --  --------- Data address ---------                                         ---------- cycles ----------    Total  cpu                 Shared
#   Num      Hit     Miss   L1 Hit  L1 Miss              Offset  Node  PA cnt      Pid                Tid        Code address  rmt hitm  lcl hitm      load  records  cnt               Symbol             Object                  Source:Line  Node
# .....  .......  .......  .......  .......  ..................  ....  ......  .......  .................  ..................  ........  ........  ........  .......  ...  ...................  .................  ...........................  ....
#
  -------------------------------------------------------------
      0     4591        0     2633      616      0x556f25dff100
  -------------------------------------------------------------
   20.52%    0.00%    0.00%    0.00%                 0x0     0       1    28079      28082:lock_th      0x556f25bfdc1d         0      2200      1276      942    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:146     0
   19.82%    0.00%   38.06%    0.00%                 0x0     0       1    28079      28082:lock_th      0x556f25bfdc16         0      2190      1130     1912    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:145     0
   18.25%    0.00%   56.63%    0.00%                 0x0     0       1    28079      28081:lock_th      0x556f25bfdc16         0      2173      1074     2329    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:145     0
   18.23%    0.00%    0.00%    0.00%                 0x0     0       1    28079      28081:lock_th      0x556f25bfdc1d         0      2013      1220      837    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:146     0
    0.00%    0.00%    3.11%   59.90%                 0x0     0       1    28079      28081:lock_th      0x556f25bfdc28         0         0         0      451    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:146     0
    0.00%    0.00%    2.20%   40.10%                 0x0     0       1    28079      28082:lock_th      0x556f25bfdc28         0         0         0      305    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:146     0
   12.00%    0.00%    0.00%    0.00%                0x20     0       1    28079  28083:reader_thd      0x556f25bfdc73         0       159       107      551    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:155     0
   11.17%    0.00%    0.00%    0.00%                0x20     0       1    28079  28084:reader_thd      0x556f25bfdc73         0       148       108      513    1  [.] read_write_func  false_sharing.exe  false_sharing_example.c:155     0
[...]
Signed-off-by: Leo Yan leo.yan@linaro.org
conflict: tools/perf/builtin-c2c.c Signed-off-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/perf/builtin-c2c.c | 140 ++++++++++++++++++++++++++++----------- 1 file changed, 102 insertions(+), 38 deletions(-)
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c index c3c1ded7b9f7..a1d66b0bda84 100644 --- a/tools/perf/builtin-c2c.c +++ b/tools/perf/builtin-c2c.c @@ -113,13 +113,15 @@ enum { DISPLAY_LCL, DISPLAY_RMT, DISPLAY_TOT, + DISPLAY_ALL, DISPLAY_MAX, };
static const char *display_str[DISPLAY_MAX] = { - [DISPLAY_LCL] = "Local", - [DISPLAY_RMT] = "Remote", - [DISPLAY_TOT] = "Total", + [DISPLAY_LCL] = "Local HITMs", + [DISPLAY_RMT] = "Remote HITMs", + [DISPLAY_TOT] = "Total HITMs", + [DISPLAY_ALL] = "All Load Access", };
static const struct option c2c_options[] = { @@ -883,6 +885,11 @@ static double percent_hitm(struct c2c_hist_entry *c2c_he) case DISPLAY_TOT: st = stats->tot_hitm; tot = total->tot_hitm; + break; + case DISPLAY_ALL: + ui__warning("Calculate hitm percent for display 'all';\n" + "should never happen!\n"); + break; default: break; } @@ -1350,6 +1357,10 @@ node_entry(struct perf_hpp_fmt *fmt __maybe_unused, struct perf_hpp *hpp, ret = display_metrics(hpp, stats->tot_hitm, c2c_he->stats.tot_hitm); break; + case DISPLAY_ALL: + ret = display_metrics(hpp, TOT_LD_HIT(stats), + TOT_LD_HIT(&c2c_he->stats)); + break; default: break; } @@ -1696,6 +1707,7 @@ static struct c2c_header percent_hitm_header[] = { [DISPLAY_LCL] = HEADER_BOTH("Lcl", "Hitm"), [DISPLAY_RMT] = HEADER_BOTH("Rmt", "Hitm"), [DISPLAY_TOT] = HEADER_BOTH("Tot", "Hitm"), + [DISPLAY_ALL] = HEADER_BOTH("LLC", "Hit"), };
static struct c2c_dimension dim_percent_hitm = { @@ -2172,6 +2184,10 @@ static bool he__display(struct hist_entry *he, struct c2c_stats *stats) he->filtered = filter_display(c2c_he->stats.tot_hitm, stats->tot_hitm); break; + case DISPLAY_ALL: + he->filtered = filter_display(TOT_LD_HIT(&c2c_he->stats), + TOT_LD_HIT(stats)); + break; default: break; } @@ -2200,6 +2216,9 @@ static inline bool is_valid_hist_entry(struct hist_entry *he) case DISPLAY_TOT: has_record = !!c2c_he->stats.tot_hitm; break; + case DISPLAY_ALL: + has_record = !!TOT_LD_HIT(&c2c_he->stats); + break; default: break; } @@ -2289,7 +2308,10 @@ static int resort_cl_cb(struct hist_entry *he, void *arg __maybe_unused) }
static struct c2c_header header_node_0 = HEADER_LOW("Node"); -static struct c2c_header header_node_1 = HEADER_LOW("Node{cpus %hitms %stores}"); +static struct c2c_header header_node_1_hitms_stores = + HEADER_LOW("Node{cpus %hitms %stores}"); +static struct c2c_header header_node_1_loads_stores = + HEADER_LOW("Node{cpus %loads %stores}"); static struct c2c_header header_node_2 = HEADER_LOW("Node{cpu list}");
static void setup_nodes_header(void) @@ -2299,7 +2321,10 @@ static void setup_nodes_header(void) dim_node.header = header_node_0; break; case 1: - dim_node.header = header_node_1; + if (c2c.display == DISPLAY_ALL) + dim_node.header = header_node_1_loads_stores; + else + dim_node.header = header_node_1_hitms_stores; break; case 2: dim_node.header = header_node_2; @@ -2378,11 +2403,13 @@ static int resort_shared_cl_cb(struct hist_entry *he, void *arg __maybe_unused) struct c2c_hist_entry *c2c_he; c2c_he = container_of(he, struct c2c_hist_entry, he);
- if (HAS_HITMS(c2c_he)) { + if (c2c.display == DISPLAY_ALL && TOT_LD_HIT(&c2c_he->stats)) { + c2c.shared_clines++; + c2c_add_stats(&c2c.shared_clines_stats, &c2c_he->stats); + } else if (HAS_HITMS(c2c_he)) { c2c.shared_clines++; c2c_add_stats(&c2c.shared_clines_stats, &c2c_he->stats); } - return 0; }
@@ -2503,12 +2530,21 @@ static void print_pareto(FILE *out) int ret; const char *cl_output;
- cl_output = "cl_num," - "cl_rmt_hitm," - "cl_lcl_hitm," - "cl_stores_l1hit," - "cl_stores_l1miss," - "dcacheline"; + if (c2c.display == DISPLAY_TOT || c2c.display == DISPLAY_LCL || + c2c.display == DISPLAY_RMT) + cl_output = "cl_num," + "cl_rmt_hitm," + "cl_lcl_hitm," + "cl_stores_l1hit," + "cl_stores_l1miss," + "dcacheline"; + else /* c2c.display == DISPLAY_ALL */ + cl_output = "cl_num," + "cl_tot_ld_hit," + "cl_tot_ld_miss," + "cl_stores_l1hit," + "cl_stores_l1miss," + "dcacheline";
perf_hpp_list__init(&hpp_list); ret = hpp_list__parse(&hpp_list, cl_output, NULL); @@ -2544,7 +2580,7 @@ static void print_c2c_info(FILE *out, struct perf_session *session) fprintf(out, "%-36s: %s\n", first ? " Events" : "", evsel__name(evsel)); first = false; } - fprintf(out, " Cachelines sort on : %s HITMs\n", + fprintf(out, " Cachelines sort on : %s\n", display_str[c2c.display]); fprintf(out, " Cacheline data grouping : %s\n", c2c.cl_sort); } @@ -2701,7 +2737,7 @@ static int perf_c2c_browser__title(struct hist_browser *browser, { scnprintf(bf, size, "Shared Data Cache Line Table " - "(%lu entries, sorted on %s HITMs)", + "(%lu entries, sorted on %s)", browser->nr_non_filtered_entries, display_str[c2c.display]); return 0; @@ -2907,6 +2943,8 @@ static int setup_display(const char *str) c2c.display = DISPLAY_RMT; else if (!strcmp(display, "lcl")) c2c.display = DISPLAY_LCL; + else if (!strcmp(display, "all")) + c2c.display = DISPLAY_ALL; else { pr_err("failed: unknown display type: %s\n", str); return -1; @@ -2953,10 +2991,12 @@ static int build_cl_output(char *cl_sort, bool no_source) }
if (asprintf(&c2c.cl_output, - "%s%s%s%s%s%s%s%s%s%s", + "%s%s%s%s%s%s%s%s%s%s%s", c2c.use_stdio ? "cl_num_empty," : "", - "percent_rmt_hitm," - "percent_lcl_hitm," + c2c.display == DISPLAY_ALL ? "percent_ld_hit," + "percent_ld_miss," : + "percent_rmt_hitm," + "percent_lcl_hitm,", "percent_stores_l1hit," "percent_stores_l1miss," "offset,offset_node,dcacheline_count,", @@ -2985,6 +3025,7 @@ static int build_cl_output(char *cl_sort, bool no_source) static int setup_coalesce(const char *coalesce, bool no_source) { const char *c = coalesce ?: coalesce_default; + const char *sort_str = NULL;
if (asprintf(&c2c.cl_sort, "offset,%s", c) < 0) return -ENOMEM; @@ -2992,12 +3033,16 @@ static int setup_coalesce(const char *coalesce, bool no_source) if (build_cl_output(c2c.cl_sort, no_source)) return -1;
- if (asprintf(&c2c.cl_resort, "offset,%s", - c2c.display == DISPLAY_TOT ? - "tot_hitm" : - c2c.display == DISPLAY_RMT ? - "rmt_hitm,lcl_hitm" : - "lcl_hitm,rmt_hitm") < 0) + if (c2c.display == DISPLAY_TOT) + sort_str = "tot_hitm"; + else if (c2c.display == DISPLAY_RMT) + sort_str = "rmt_hitm,lcl_hitm"; + else if (c2c.display == DISPLAY_LCL) + sort_str = "lcl_hitm,rmt_hitm"; + else if (c2c.display == DISPLAY_ALL) + sort_str = "tot_ld_hit"; + + if (asprintf(&c2c.cl_resort, "offset,%s", sort_str) < 0) return -ENOMEM;
pr_debug("coalesce sort fields: %s\n", c2c.cl_sort); @@ -3132,20 +3177,37 @@ static int perf_c2c__report(int argc, const char **argv) goto out_mem2node; }
- output_str = "cl_idx," - "dcacheline," - "dcacheline_node," - "dcacheline_count," - "percent_hitm," - "tot_hitm,lcl_hitm,rmt_hitm," - "tot_recs," - "tot_loads," - "tot_stores," - "stores_l1hit,stores_l1miss," - "ld_fbhit,ld_l1hit,ld_l2hit," - "ld_lclhit,lcl_hitm," - "ld_rmthit,rmt_hitm," - "dram_lcl,dram_rmt"; + if (c2c.display == DISPLAY_TOT || c2c.display == DISPLAY_LCL || + c2c.display == DISPLAY_RMT) + output_str = "cl_idx," + "dcacheline," + "dcacheline_node," + "dcacheline_count," + "percent_hitm," + "tot_hitm,lcl_hitm,rmt_hitm," + "tot_recs," + "tot_loads," + "tot_stores," + "stores_l1hit,stores_l1miss," + "ld_fbhit,ld_l1hit,ld_l2hit," + "ld_lclhit,lcl_hitm," + "ld_rmthit,rmt_hitm," + "dram_lcl,dram_rmt"; + else /* c2c.display == DISPLAY_ALL */ + output_str = "cl_idx," + "dcacheline," + "dcacheline_node," + "dcacheline_count," + "percent_tot_ld_hit," + "tot_ld_hit," + "tot_recs," + "tot_loads," + "tot_stores," + "stores_l1hit,stores_l1miss," + "ld_fbhit,ld_l1hit,ld_l2hit," + "ld_lclhit,lcl_hitm," + "ld_rmthit,rmt_hitm," + "dram_lcl,dram_rmt";
if (c2c.display == DISPLAY_TOT) sort_str = "tot_hitm"; @@ -3153,6 +3215,8 @@ static int perf_c2c__report(int argc, const char **argv) sort_str = "rmt_hitm"; else if (c2c.display == DISPLAY_LCL) sort_str = "lcl_hitm"; + else if (c2c.display == DISPLAY_ALL) + sort_str = "tot_ld_hit";
c2c_hists__reinit(&c2c.hists, output_str, sort_str);
From: Yang Jihong yangjihong1@huawei.com
maillist inclusion category: Feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53L83 CVE: NA
Reference: https://lore.kernel.org/all/20210104020930.GA4897@leoy-ThinkPad-X240s/
-------------------
Since the new display option 'all' is introduced, this patch updates the documentation to reflect it.
Signed-off-by: Leo Yan leo.yan@linaro.org Signed-off-by: Yang Jihong yangjihong1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/perf/Documentation/perf-c2c.txt | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt index c81d72e3eecf..da49c3d26316 100644 --- a/tools/perf/Documentation/perf-c2c.txt +++ b/tools/perf/Documentation/perf-c2c.txt @@ -109,7 +109,8 @@ REPORT OPTIONS
-d:: --display:: - Switch to HITM type (rmt, lcl) to display and sort on. Total HITMs as default. + Switch to HITM type (rmt, lcl) or all load cache hit (all) to display + and sort on. Total HITMs as default.
--stitch-lbr:: Show callgraph with stitched LBRs, which may have more complete @@ -174,12 +175,18 @@ For each cacheline in the 1) list we display following data: Cacheline - cacheline address (hex number)
- Rmt/Lcl Hitm + Rmt/Lcl Hitm (For display with HITM types) - cacheline percentage of all Remote/Local HITM accesses
- LLC Load Hitm - Total, LclHitm, RmtHitm + LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types) - count of Total/Local/Remote load HITMs
+ LD Hit Pct (For display 'all') + - cacheline percentage of all load hit accesses + + LD Hit Total (For display 'all') + - sum of all load hit accesses + Total records - sum of all cachelines accesses
@@ -207,9 +214,12 @@ For each cacheline in the 1) list we display following data:
For each offset in the 2) list we display following data:
- HITM - Rmt, Lcl + HITM - Rmt, Lcl (For display with HITM types) - % of Remote/Local HITM accesses for given offset within cacheline
+ Load Refs - Hit, Miss (For display 'all') + - % of load accesses that hit/missed cache for given offset within cacheline + Store Refs - L1 Hit, L1 Miss - % of store accesses that hit/missed L1 for given offset within cacheline
@@ -249,7 +259,8 @@ The 'Node' field displays nodes that accesses given cacheline offset. Its output comes in 3 flavors: - node IDs separated by ',' - node IDs with stats for each ID, in following format: - Node{cpus %hitms %stores} + Node{cpus %hitms %stores} (For display with HITM types) + Node{cpus %loads %stores} (For display with "all") - node IDs with list of affected CPUs in following format: Node{cpu list}
From: Zhihao Cheng chengzhihao1@huawei.com
hulk inclusion category: bugfix bugzilla: 185955, https://gitee.com/openeuler/kernel/issues/I50DVI?from=project-issue
--------------------------------
There are at least 6 PEBs reserved on a UBI device:
 1. EBA_RESERVED_PEBS[1]
 2. WL_RESERVED_PEBS[1]
 3. UBI_LAYOUT_VOLUME_EBS[2]
 4. MIN_FASTMAP_RESERVED_PEBS[2]

When all ubi volumes take all their PEBs, there are 3 free PEBs left (EBA_RESERVED_PEBS + WL_RESERVED_PEBS + MIN_FASTMAP_RESERVED_PEBS - MIN_FASTMAP_TAKEN_PEBS[1]). Since f9c34bb529975fe ("ubi: Fix producing anchor PEBs") and 4b68bf9a69d22dd ("ubi: Select fastmap anchor PEBs considering wear level rules") were applied, there is only 1 free PEB (3 - FASTMAP_ANCHOR_PEBS[1] - FASTMAP_NEXT_ANCHOR_PEBS[1]) left to fill pool and wl_pool; after filling pool, wl_pool is always empty. So UBI could be stuck in an infinite loop:
ubi_thread                               system_wq
wear_leveling_worker <-------------------------------------------------
  get_peb_for_wl                                                       |
  // fm_wl_pool, used = size = 0                                       |
  schedule_work(&ubi->fm_work)                                         |
                                                                       |
                                update_fastmap_work_fn                 |
                                ubi_update_fastmap                     |
                                  ubi_refill_pools                     |
                                  // ubi->free_count - ubi->beb_rsvd_pebs < 5
                                  // wl_pool is not filled with any PEBs
                                  schedule_erase(old_fm_anchor)        |
                                  ubi_ensure_anchor_pebs               |
                                    __schedule_ubi_work(wear_leveling_worker)
                                                                       |
                                __erase_worker                         |
                                  ensure_wear_leveling                 |
                                    __schedule_ubi_work(wear_leveling_worker)
                                    -----------------------------------
, which causes high cpu usage of ubi_bgt:

top - 12:10:42 up 5 min,  2 users,  load average: 1.76, 0.68, 0.27
Tasks: 123 total,   3 running,  54 sleeping,   0 stopped,   0 zombie

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1589 root      20   0       0      0      0 R  45.0   0.0   0:38.86 ubi_bgt0d
  319 root      20   0       0      0      0 I  15.2   0.0   0:15.29 kworker/0:3-eve
  371 root      20   0       0      0      0 I  14.9   0.0   0:12.85 kworker/3:3-eve
   20 root      20   0       0      0      0 I  11.3   0.0   0:05.33 kworker/1:0-eve
  202 root      20   0       0      0      0 I  11.3   0.0   0:04.93 kworker/2:3-eve
In 4b68bf9a69d22dd ("ubi: Select fastmap anchor PEBs considering wear level rules"), there are three key changes:
 1) Choose the fastmap anchor when the most free PEBs are available.
 2) Enable anchor move within the anchor area again as it is useful for distributing wear.
 3) Import a candidate fm anchor and check this PEB's erase count during wear leveling. If the wear leveling limit is exceeded, use the used anchor area PEB with the lowest erase count to replace it.
The anchor candidate can be removed; we can check the fm_anchor PEB's erase count during wear leveling instead. Fix it by:
 1) Removing 'fm_next_anchor' and checking 'fm_anchor' during wear leveling.
 2) Preferentially filling one free PEB into fm_wl_pool on the condition that ubi->free_count > ubi->beb_rsvd_pebs, then trying to reserve enough free count for the fastmap non-anchor PEBs after the above prerequisite is met.
Then there is at least 1 PEB in pool and 1 PEB in wl_pool after calling ubi_refill_pools() with all erase works done.
Fetch a reproducer in [Link].
Fixes: 4b68bf9a69d22dd ("ubi: Select fastmap anchor PEBs ... rules") Link: https://bugzilla.kernel.org/show_bug.cgi?id=215407 Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com v1->v2: Update fm pool filling strategy, consider reserve enough free count for fastmap non anchor pebs while filling fm_wl_pool. v2->v3: Remove 'fm_next_anchor' and check 'fm_anchor' during wear leveling. v3->v4: Reserve 'fm_next_anchor' member in 'ubi_device' to keep kabi no changes. Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/mtd/ubi/fastmap-wl.c | 63 +++++++++++++++++++++++------------- drivers/mtd/ubi/fastmap.c | 11 ------- drivers/mtd/ubi/wl.c | 19 +++++------ 3 files changed, 50 insertions(+), 43 deletions(-)
diff --git a/drivers/mtd/ubi/fastmap-wl.c b/drivers/mtd/ubi/fastmap-wl.c index 28f55f9cf715..21ea5ca8270b 100644 --- a/drivers/mtd/ubi/fastmap-wl.c +++ b/drivers/mtd/ubi/fastmap-wl.c @@ -97,6 +97,27 @@ struct ubi_wl_entry *ubi_wl_get_fm_peb(struct ubi_device *ubi, int anchor) return e; }
+/* + * has_enough_free_count - whether ubi has enough free pebs to fill fm pools + * @ubi: UBI device description object + * + * This helper function checks whether there are enough free pebs (deducted + * by fastmap pebs) to fill fm_pool and fm_wl_pool, above rule works after + * there is at least one of free pebs is filled into fm_wl_pool. + */ +static bool has_enough_free_count(struct ubi_device *ubi) +{ + int fm_used = 0; // fastmap non anchor pebs. + + if (!ubi->free.rb_node) + return false; + + if (ubi->fm_wl_pool.size > 0 && !(ubi->ro_mode || ubi->fm_disabled)) + fm_used = ubi->fm_size / ubi->leb_size - 1; + + return ubi->free_count - ubi->beb_rsvd_pebs > fm_used; +} + /** * ubi_refill_pools - refills all fastmap PEB pools. * @ubi: UBI device description object @@ -120,21 +141,17 @@ void ubi_refill_pools(struct ubi_device *ubi) wl_tree_add(ubi->fm_anchor, &ubi->free); ubi->free_count++; } - if (ubi->fm_next_anchor) { - wl_tree_add(ubi->fm_next_anchor, &ubi->free); - ubi->free_count++; - }
- /* All available PEBs are in ubi->free, now is the time to get + /* + * All available PEBs are in ubi->free, now is the time to get * the best anchor PEBs. */ ubi->fm_anchor = ubi_wl_get_fm_peb(ubi, 1); - ubi->fm_next_anchor = ubi_wl_get_fm_peb(ubi, 1);
for (;;) { enough = 0; if (pool->size < pool->max_size) { - if (!ubi->free.rb_node) + if (!has_enough_free_count(ubi)) break;
e = wl_get_wle(ubi); @@ -147,8 +164,7 @@ void ubi_refill_pools(struct ubi_device *ubi) enough++;
if (wl_pool->size < wl_pool->max_size) { - if (!ubi->free.rb_node || - (ubi->free_count - ubi->beb_rsvd_pebs < 5)) + if (!has_enough_free_count(ubi)) break;
e = find_wl_entry(ubi, &ubi->free, WL_FREE_MAX_DIFF); @@ -286,20 +302,26 @@ static struct ubi_wl_entry *get_peb_for_wl(struct ubi_device *ubi) int ubi_ensure_anchor_pebs(struct ubi_device *ubi) { struct ubi_work *wrk; + struct ubi_wl_entry *anchor;
spin_lock(&ubi->wl_lock);
- /* Do we have a next anchor? */ - if (!ubi->fm_next_anchor) { - ubi->fm_next_anchor = ubi_wl_get_fm_peb(ubi, 1); - if (!ubi->fm_next_anchor) - /* Tell wear leveling to produce a new anchor PEB */ - ubi->fm_do_produce_anchor = 1; + /* Do we already have an anchor? */ + if (ubi->fm_anchor) { + spin_unlock(&ubi->wl_lock); + return 0; }
- /* Do wear leveling to get a new anchor PEB or check the - * existing next anchor candidate. - */ + /* See if we can find an anchor PEB on the list of free PEBs */ + anchor = ubi_wl_get_fm_peb(ubi, 1); + if (anchor) { + ubi->fm_anchor = anchor; + spin_unlock(&ubi->wl_lock); + return 0; + } + + ubi->fm_do_produce_anchor = 1; + /* No luck, trigger wear leveling to produce a new anchor PEB. */ if (ubi->wl_scheduled) { spin_unlock(&ubi->wl_lock); return 0; @@ -381,11 +403,6 @@ static void ubi_fastmap_close(struct ubi_device *ubi) ubi->fm_anchor = NULL; }
- if (ubi->fm_next_anchor) { - return_unused_peb(ubi, ubi->fm_next_anchor); - ubi->fm_next_anchor = NULL; - } - if (ubi->fm) { for (i = 0; i < ubi->fm->used_blocks; i++) kfree(ubi->fm->e[i]); diff --git a/drivers/mtd/ubi/fastmap.c b/drivers/mtd/ubi/fastmap.c index 88fdf8f5709f..cdc2d713d3eb 100644 --- a/drivers/mtd/ubi/fastmap.c +++ b/drivers/mtd/ubi/fastmap.c @@ -1219,17 +1219,6 @@ static int ubi_write_fastmap(struct ubi_device *ubi, fm_pos += sizeof(*fec); ubi_assert(fm_pos <= ubi->fm_size); } - if (ubi->fm_next_anchor) { - fec = (struct ubi_fm_ec *)(fm_raw + fm_pos); - - fec->pnum = cpu_to_be32(ubi->fm_next_anchor->pnum); - set_seen(ubi, ubi->fm_next_anchor->pnum, seen_pebs); - fec->ec = cpu_to_be32(ubi->fm_next_anchor->ec); - - free_peb_count++; - fm_pos += sizeof(*fec); - ubi_assert(fm_pos <= ubi->fm_size); - } fmh->free_peb_count = cpu_to_be32(free_peb_count);
ubi_for_each_used_peb(ubi, wl_e, tmp_rb) { diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c index 7847de75a74c..820b5c1c8e8e 100644 --- a/drivers/mtd/ubi/wl.c +++ b/drivers/mtd/ubi/wl.c @@ -688,16 +688,16 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
#ifdef CONFIG_MTD_UBI_FASTMAP e1 = find_anchor_wl_entry(&ubi->used); - if (e1 && ubi->fm_next_anchor && - (ubi->fm_next_anchor->ec - e1->ec >= UBI_WL_THRESHOLD)) { + if (e1 && ubi->fm_anchor && + (ubi->fm_anchor->ec - e1->ec >= UBI_WL_THRESHOLD)) { ubi->fm_do_produce_anchor = 1; - /* fm_next_anchor is no longer considered a good anchor - * candidate. + /* + * fm_anchor is no longer considered a good anchor. * NULL assignment also prevents multiple wear level checks * of this PEB. */ - wl_tree_add(ubi->fm_next_anchor, &ubi->free); - ubi->fm_next_anchor = NULL; + wl_tree_add(ubi->fm_anchor, &ubi->free); + ubi->fm_anchor = NULL; ubi->free_count++; }
@@ -1086,12 +1086,13 @@ static int __erase_worker(struct ubi_device *ubi, struct ubi_work *wl_wrk) if (!err) { spin_lock(&ubi->wl_lock);
- if (!ubi->fm_disabled && !ubi->fm_next_anchor && + if (!ubi->fm_disabled && !ubi->fm_anchor && e->pnum < UBI_FM_MAX_START) { - /* Abort anchor production, if needed it will be + /* + * Abort anchor production, if needed it will be * enabled again in the wear leveling started below. */ - ubi->fm_next_anchor = e; + ubi->fm_anchor = e; ubi->fm_do_produce_anchor = 0; } else { wl_tree_add(e, &ubi->free);
From: Zhihao Cheng chengzhihao1@huawei.com
hulk inclusion category: bugfix bugzilla: 185955, https://gitee.com/openeuler/kernel/issues/I55AKK CVE: NA backport: openEuler-22.03-LTS
--------------------------------
Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and writeback_inodes_wb()") has us holding a plug during wb_writeback, which may cause a potential ABBA deadlock:
wb_writeback                                  fat_file_fsync
blk_start_plug(&plug)
for (;;) {
    iter i-1: some reqs have been added into plug->mq_list    // LOCK A
    iter i:
        progress = __writeback_inodes_wb(wb, work)
        . writeback_sb_inodes        // fat's bdev
        . __writeback_single_inode
        . . generic_writepages
        . . __block_write_full_page
        . . . .                      __generic_file_fsync
        . . . .                      sync_inode_metadata
        . . . .                      writeback_single_inode
        . . . .                      __writeback_single_inode
        . . . .                      fat_write_inode
        . . . .                      __fat_write_inode
        . . . .                      sync_dirty_buffer        // fat's bdev
        . . . .                      lock_buffer(bh)          // LOCK B
        . . . .                      submit_bh
        . . . .                      blk_mq_get_tag           // LOCK A
        . . . trylock_buffer(bh)     // LOCK B
        . . . redirty_page_for_writepage
        . . . wbc->pages_skipped++
        . . --wbc->nr_to_write
        . wrote += write_chunk - wbc.nr_to_write    // wrote > 0
        . requeue_inode
        . redirty_tail_locked
        if (progress)    // progress > 0
            continue;
    iter i+1:
        queue_io
        // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)    // flush plug won't be called
Above process triggers a hungtask like:
[  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[  399.051539] task:bb    state:D stack:    0 pid: 2607 ppid:  2426 flags:0x00004000
[  399.051556] Call Trace:
[  399.051570]  __schedule+0x480/0x1050
[  399.051592]  schedule+0x92/0x1a0
[  399.051602]  io_schedule+0x22/0x50
[  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
[  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
[  399.051657]  blk_mq_submit_bio+0x68d/0xca0
[  399.051674]  __submit_bio+0x1b5/0x2d0
[  399.051708]  submit_bio_noacct+0x34e/0x720
[  399.051718]  submit_bio+0x3b/0x150
[  399.051725]  submit_bh_wbc+0x161/0x230
[  399.051734]  __sync_dirty_buffer+0xd1/0x420
[  399.051744]  sync_dirty_buffer+0x17/0x20
[  399.051750]  __fat_write_inode+0x289/0x310
[  399.051766]  fat_write_inode+0x2a/0xa0
[  399.051783]  __writeback_single_inode+0x53c/0x6f0
[  399.051795]  writeback_single_inode+0x145/0x200
[  399.051803]  sync_inode_metadata+0x45/0x70
[  399.051856]  __generic_file_fsync+0xa3/0x150
[  399.051880]  fat_file_fsync+0x1d/0x80
[  399.051895]  vfs_fsync_range+0x40/0xb0
[  399.051929]  __x64_sys_fsync+0x18/0x30
In my test, the 'need_resched()' check in writeback_sb_inodes() (introduced by commit 590dca3a71 "fs-writeback: unplug before cond_resched in writeback_sb_inodes") seldom comes true, unless cond_resched() is deleted from write_cache_pages().
Fix it by correcting the wrote number according to the number of skipped pages in writeback_sb_inodes().
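As a rough numeric illustration of the accounting change (the numbers are made up for this example, not taken from the trace above): suppose write_chunk is 1024 and the inode's 16 dirty pages are all redirtied because the buffer lock cannot be taken, so wbc.nr_to_write drops to 1008 and wbc.pages_skipped becomes 16. The old code computed wrote += 1024 - 1008 = 16, so progress stayed positive and wb_writeback() kept looping without ever flushing the plug. With the fix, wrote = 1024 - 1008 - 16 = 0 (negative values are clamped to 0), progress becomes 0, and wb_writeback() stops busy-looping, so the plugged requests can finally be issued.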
Go to the Link below to find a reproducer.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837 Cc: stable@vger.kernel.org # v4.3 Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fs-writeback.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 050d40c465bc..2011199476ea 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1650,11 +1650,12 @@ static long writeback_sb_inodes(struct super_block *sb, }; unsigned long start_time = jiffies; long write_chunk; - long wrote = 0; /* count both pages and inodes */ + long total_wrote = 0; /* count both pages and inodes */
while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); struct bdi_writeback *tmp_wb; + long wrote;
if (inode->i_sb != sb) { if (work->sb) { @@ -1730,7 +1731,9 @@ static long writeback_sb_inodes(struct super_block *sb,
wbc_detach_inode(&wbc); work->nr_pages -= write_chunk - wbc.nr_to_write; - wrote += write_chunk - wbc.nr_to_write; + wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped; + wrote = wrote < 0 ? 0 : wrote; + total_wrote += wrote;
if (need_resched()) { /* @@ -1752,7 +1755,7 @@ static long writeback_sb_inodes(struct super_block *sb, tmp_wb = inode_to_wb_and_lock_list(inode); spin_lock(&inode->i_lock); if (!(inode->i_state & I_DIRTY_ALL)) - wrote++; + total_wrote++; requeue_inode(inode, tmp_wb, &wbc); inode_sync_complete(inode); spin_unlock(&inode->i_lock); @@ -1766,14 +1769,14 @@ static long writeback_sb_inodes(struct super_block *sb, * bail out to wb_writeback() often enough to check * background threshold and other termination conditions. */ - if (wrote) { + if (total_wrote) { if (time_is_before_jiffies(start_time + HZ / 10UL)) break; if (work->nr_pages <= 0) break; } } - return wrote; + return total_wrote; }
static long __writeback_inodes_wb(struct bdi_writeback *wb,
From: Janis Schoetterl-Glausch scgl@linux.ibm.com
stable inclusion from stable-v5.10.100 commit b62267b8b06e9b8bb429ae8f962ee431e6535d60 bugzilla: https://gitee.com/src-openeuler/kernel/issues/I4U746 CVE: CVE-2022-0516
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 2c212e1baedcd782b2535a3f86bc491977677c0e upstream.
Refuse SIDA memops on guests which are not protected. For normal guests, the secure instruction data address designation, which determines the location we access, is not under control of KVM.
Fixes: 19e122776886 (KVM: S390: protvirt: Introduce instruction data area bounce buffer) Signed-off-by: Janis Schoetterl-Glausch scgl@linux.ibm.com Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger borntraeger@linux.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Chen Jun chenjun102@huawei.com Reviewed-by: Weilong Chen chenweilong@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/s390/kvm/kvm-s390.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 07a04f392600..d8e9239c24ff 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -4654,6 +4654,8 @@ static long kvm_s390_guest_sida_op(struct kvm_vcpu *vcpu, return -EINVAL; if (mop->size + mop->sida_offset > sida_size(vcpu->arch.sie_block)) return -E2BIG; + if (!kvm_s390_pv_cpu_is_protected(vcpu)) + return -EINVAL;
switch (mop->op) { case KVM_S390_MEMOP_SIDA_READ:
From: Ye Bin yebin10@huawei.com
mainline inclusion from mainline-v5.18-rc4 commit c186f0887fe7061a35cebef024550ec33ef8fbd8 category: bugfix bugzilla: 186477, https://gitee.com/openeuler/kernel/issues/I55UHT CVE: NA
-------------------------------------------------
We got an issue as follows:
EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
==================================================================
BUG: KASAN: use-after-free in ext4_search_dir fs/ext4/namei.c:1394 [inline]
BUG: KASAN: use-after-free in search_dirblock fs/ext4/namei.c:1199 [inline]
BUG: KASAN: use-after-free in __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
Read of size 1 at addr ffff8881317c3005 by task syz-executor117/2331
CPU: 1 PID: 2331 Comm: syz-executor117 Not tainted 5.10.0+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:83 [inline]
 dump_stack+0x144/0x187 lib/dump_stack.c:124
 print_address_description+0x7d/0x630 mm/kasan/report.c:387
 __kasan_report+0x132/0x190 mm/kasan/report.c:547
 kasan_report+0x47/0x60 mm/kasan/report.c:564
 ext4_search_dir fs/ext4/namei.c:1394 [inline]
 search_dirblock fs/ext4/namei.c:1199 [inline]
 __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
 ext4_lookup_entry fs/ext4/namei.c:1622 [inline]
 ext4_lookup+0xb8/0x3a0 fs/ext4/namei.c:1690
 __lookup_hash+0xc5/0x190 fs/namei.c:1451
 do_rmdir+0x19e/0x310 fs/namei.c:3760
 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x445e59
Code: 4d c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fff2277fac8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
RAX: ffffffffffffffda RBX: 0000000000400280 RCX: 0000000000445e59
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200000c0
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
R10: 00007fff2277f990 R11: 0000000000000246 R12: 0000000000000000
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
The buggy address belongs to the page:
page:0000000048cd3304 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1317c3
flags: 0x200000000000000()
raw: 0200000000000000 ffffea0004526588 ffffea0004528088 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
 ffff8881317c2f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8881317c2f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881317c3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffff8881317c3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8881317c3100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
ext4_search_dir:
	...
	de = (struct ext4_dir_entry_2 *)search_buf;
	dlimit = search_buf + buf_size;
	while ((char *) de < dlimit) {
		...
		if ((char *) de + de->name_len <= dlimit &&
		    ext4_match(dir, fname, de)) {
			...
		}
		...
		de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize);
		if (de_len <= 0)
			return -1;
		offset += de_len;
		de = (struct ext4_dir_entry_2 *) ((char *) de + de_len);
	}
Assume:
  de     = 0xffff8881317c2fff
  dlimit = 0xffff8881317c3000
Reading 'de->name_len', whose address is 0xffff8881317c3005, is obviously out of range and triggers the use-after-free. To solve this issue, 'dlimit' must reserve 8 bytes, as we read 'de->name_len' to judge whether '(char *) de + de->name_len' is out of range.
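As a quick sanity check on where the 8 bytes come from (a sketch based on the struct ext4_dir_entry_2 layout in fs/ext4/ext4.h: __le32 inode, __le16 rec_len, __u8 name_len, __u8 file_type, char name[EXT4_NAME_LEN]):

	EXT4_BASE_DIR_LEN = sizeof(struct ext4_dir_entry_2) - EXT4_NAME_LEN
	                  = (4 + 2 + 1 + 1 + 255) - 255
	                  = 8

So stopping the walk once 'de' is within 8 bytes of 'dlimit' guarantees that 'de->name_len' (and the rest of the fixed entry header) can be read safely.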
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20220324064816.1209985-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: ChenXiaoSong chenxiaosong2@huawei.com Reviewed-by: yebin yebin10@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/ext4.h | 4 ++++ fs/ext4/namei.c | 4 ++-- 2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index e75c130d1f8d..77541b83ca93 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2189,6 +2189,10 @@ static inline int ext4_forced_shutdown(struct ext4_sb_info *sbi) * Structure of a directory entry */ #define EXT4_NAME_LEN 255 +/* + * Base length of the ext4 directory entry excluding the name length + */ +#define EXT4_BASE_DIR_LEN (sizeof(struct ext4_dir_entry_2) - EXT4_NAME_LEN)
struct ext4_dir_entry { __le32 inode; /* Inode number */ diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 526960e34386..a2193f21f418 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1388,10 +1388,10 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
de = (struct ext4_dir_entry_2 *)search_buf; dlimit = search_buf + buf_size; - while ((char *) de < dlimit) { + while ((char *) de < dlimit - EXT4_BASE_DIR_LEN) { /* this code is executed quadratically often */ /* do minimal checking `by hand' */ - if ((char *) de + de->name_len <= dlimit && + if (de->name + de->name_len <= dlimit && ext4_match(dir, fname, de)) { /* found a match - just to be sure, do * a full check */
From: Ye Bin yebin10@huawei.com
mainline inclusion from mainline-v5.18-rc4 commit b98535d091795a79336f520b0708457aacf55c67 category: bugfix bugzilla: 186675, https://gitee.com/openeuler/kernel/issues/I55TUC CVE: NA
-------------------------------------------------
We got an issue as follows:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:389!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 9 PID: 131 Comm: kworker/9:1 Not tainted 5.17.0-862.14.0.6.x86_64-00001-g23f87daf7d74-dirty #197
Workqueue: events flush_stashed_error_work
RIP: 0010:start_this_handle+0x41c/0x1160
RSP: 0018:ffff888106b47c20 EFLAGS: 00010202
RAX: ffffed10251b8400 RBX: ffff888128dc204c RCX: ffffffffb52972ac
RDX: 0000000000000200 RSI: 0000000000000004 RDI: ffff888128dc2050
RBP: 0000000000000039 R08: 0000000000000001 R09: ffffed10251b840a
R10: ffff888128dc204f R11: ffffed10251b8409 R12: ffff888116d78000
R13: 0000000000000000 R14: dffffc0000000000 R15: ffff888128dc2000
FS:  0000000000000000(0000) GS:ffff88839d680000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001620068 CR3: 0000000376c0e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 jbd2__journal_start+0x38a/0x790
 jbd2_journal_start+0x19/0x20
 flush_stashed_error_work+0x110/0x2b3
 process_one_work+0x688/0x1080
 worker_thread+0x8b/0xc50
 kthread+0x26f/0x310
 ret_from_fork+0x22/0x30
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
Above issue may happen as follows:
        umount                    read procfs                              error_work
ext4_put_super
  flush_work(&sbi->s_error_work);

                          ext4_mb_seq_groups_show
                            ext4_mb_load_buddy_gfp
                              ext4_mb_init_group
                                ext4_mb_init_cache
                                  ext4_read_block_bitmap_nowait
                                    ext4_validate_block_bitmap
                                      ext4_error
                                        ext4_handle_error
                                          schedule_work(&EXT4_SB(sb)->s_error_work);

  ext4_unregister_sysfs(sb);
  jbd2_journal_destroy(sbi->s_journal);
    journal_kill_thread
      journal->j_flags |= JBD2_UNMOUNT;

                                                                   flush_stashed_error_work
                                                                     jbd2_journal_start
                                                                       start_this_handle
                                                                         BUG_ON(journal->j_flags & JBD2_UNMOUNT);
To solve this issue, we call ext4_unregister_sysfs() before flushing s_error_work in ext4_put_super().
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Ritesh Harjani riteshh@linux.ibm.com Link: https://lore.kernel.org/r/20220322012419.725457-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: ChenXiaoSong chenxiaosong2@huawei.com Reviewed-by: yebin yebin10@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/super.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 47832b989b2d..a6419161f856 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1216,19 +1216,24 @@ static void ext4_put_super(struct super_block *sb) int aborted = 0; int i, err;
- ext4_unregister_li_request(sb); - ext4_quota_off_umount(sb); - - flush_work(&sbi->s_error_work); - destroy_workqueue(sbi->rsv_conversion_wq); - /* * Unregister sysfs before destroying jbd2 journal. * Since we could still access attr_journal_task attribute via sysfs * path which could have sbi->s_journal->j_task as NULL + * Unregister sysfs before flush sbi->s_error_work. + * Since user may read /proc/fs/ext4/xx/mb_groups during umount, If + * read metadata verify failed then will queue error work. + * flush_stashed_error_work will call start_this_handle may trigger + * BUG_ON. */ ext4_unregister_sysfs(sb);
+ ext4_unregister_li_request(sb); + ext4_quota_off_umount(sb); + + flush_work(&sbi->s_error_work); + destroy_workqueue(sbi->rsv_conversion_wq); + if (sbi->s_journal) { aborted = is_journal_aborted(sbi->s_journal); err = jbd2_journal_destroy(sbi->s_journal);
From: Ye Bin yebin10@huawei.com
mainline inclusion from mainline-v5.18-rc4 commit a2b0b205d125f27cddfb4f7280e39affdaf46686 category: bugfix bugzilla: 186450, https://gitee.com/openeuler/kernel/issues/I4YSJ7 CVE: NA
-----------------------------------------------
We got an issue as follows:
[home]# fsck.ext4  -fn  ram0yb
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Symlink /p3/d14/d1a/l3d (inode #3494) is invalid.
Clear? no
Entry 'l3d' in /p3/d14/d1a (3383) has an incorrect filetype (was 7, should be 0).
Fix? no
This is because the symlink file size does not match the file content. If the writeback of the symlink data block failed, ext4_finish_bio() handles the end of IO. However this function fails to mark the buffer with BH_write_io_error, so when unmount does the journal checkpoint it cannot detect the writeback error and will clean up the journal. Thus we've lost the correct data in the journal area. To solve this issue, mark the buffer as BH_write_io_error in ext4_finish_bio().
Cc: stable@kernel.org Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20220321144438.201685-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: ChenXiaoSong chenxiaosong2@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/page-io.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index defd2e10dfd1..4569075a7da0 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -137,8 +137,10 @@ static void ext4_finish_bio(struct bio *bio) continue; } clear_buffer_async_write(bh); - if (bio->bi_status) + if (bio->bi_status) { + set_buffer_write_io_error(bh); buffer_io_error(bh); + } } while ((bh = bh->b_this_page) != head); spin_unlock_irqrestore(&head->b_uptodate_lock, flags); if (!under_io) {
From: Guan Jing guanjing6@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA
--------------------------------
We introduce the qos smt expeller, which lets online tasks expel offline tasks on the smt sibling cpus and exclusively occupy CPU resources. In this way we can improve the QoS of online tasks in co-location.
Signed-off-by: Guan Jing guanjing6@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- init/Kconfig | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/init/Kconfig b/init/Kconfig index 895e0ef85f73..27c5ed16fef1 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -966,6 +966,15 @@ config QOS_SCHED
default n
+config QOS_SCHED_SMT_EXPELLER + bool "Qos smt expeller" + depends on SCHED_SMT + depends on QOS_SCHED + default n + help + This feature enable online tasks to expel offline tasks + on the smt sibling cpus, and exclusively occupy CPU resources. + config FAIR_GROUP_SCHED bool "Group scheduling for SCHED_OTHER" depends on CGROUP_SCHED
From: Guan Jing guanjing6@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA
--------------------------------
We implement the qos smt expeller with the following two points: a) when online tasks and offline tasks are running on the same physical cpu, the online tasks send an IPI to expel offline tasks on the smt sibling cpus; b) while an online task is running, the smt sibling cpus do not allow offline tasks to be picked.
Signed-off-by: Guan Jing guanjing6@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/sched.h | 7 ++ kernel/sched/fair.c | 185 +++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 5 ++ 3 files changed, 195 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index edd236f98f0c..06215f01f68f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1830,9 +1830,16 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk); __get_task_comm(buf, sizeof(buf), tsk); \ })
+#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER +void qos_smt_check_need_resched(void); +#endif + #ifdef CONFIG_SMP static __always_inline void scheduler_ipi(void) { +#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + qos_smt_check_need_resched(); +#endif /* * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting * TIF_NEED_RESCHED remotely (for the first time) will also send diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2e26e1b98589..5cfdf40b974c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -130,6 +130,10 @@ unsigned int sysctl_offline_wait_interval = 100; /* in ms */ static int unthrottle_qos_cfs_rqs(int cpu); #endif
+#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER +static DEFINE_PER_CPU(int, qos_smt_status); +#endif + #ifdef CONFIG_CFS_BANDWIDTH /* * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool @@ -7371,6 +7375,131 @@ void init_qos_hrtimer(int cpu) } #endif
+#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER +static bool qos_smt_check_siblings_status(int this_cpu) +{ + int cpu; + + if (!sched_smt_active()) + return false; + + for_each_cpu(cpu, cpu_smt_mask(this_cpu)) { + if (cpu == this_cpu) + continue; + + if (per_cpu(qos_smt_status, cpu) == QOS_LEVEL_ONLINE) + return true; + } + + return false; +} + +static bool qos_smt_expelled(int this_cpu) +{ + /* + * The qos_smt_status of siblings cpu is online, and current cpu only has + * offline tasks enqueued, there is not suitable task, + * so pick_next_task_fair return null. + */ + if (qos_smt_check_siblings_status(this_cpu) && sched_idle_cpu(this_cpu)) + return true; + + return false; +} + +static bool qos_smt_update_status(struct task_struct *p) +{ + int status = QOS_LEVEL_OFFLINE; + + if (p != NULL && task_group(p)->qos_level >= QOS_LEVEL_ONLINE) + status = QOS_LEVEL_ONLINE; + + if (__this_cpu_read(qos_smt_status) == status) + return false; + + __this_cpu_write(qos_smt_status, status); + + return true; +} + +static void qos_smt_send_ipi(int this_cpu) +{ + int cpu; + struct rq *rq = NULL; + + if (!sched_smt_active()) + return; + + for_each_cpu(cpu, cpu_smt_mask(this_cpu)) { + if (cpu == this_cpu) + continue; + + rq = cpu_rq(cpu); + + /* + * There are two cases where current don't need to send scheduler_ipi: + * a) The qos_smt_status of siblings cpu is online; + * b) The cfs.h_nr_running of siblings cpu is 0. + */ + if (per_cpu(qos_smt_status, cpu) == QOS_LEVEL_ONLINE || + rq->cfs.h_nr_running == 0) + continue; + + smp_send_reschedule(cpu); + } +} + +static void qos_smt_expel(int this_cpu, struct task_struct *p) +{ + if (qos_smt_update_status(p)) + qos_smt_send_ipi(this_cpu); +} + +static bool _qos_smt_check_need_resched(int this_cpu, struct rq *rq) +{ + int cpu; + + if (!sched_smt_active()) + return false; + + for_each_cpu(cpu, cpu_smt_mask(this_cpu)) { + if (cpu == this_cpu) + continue; + + /* + * There are two cases rely on the set need_resched to drive away + * offline task: + * a) The qos_smt_status of siblings cpu is online, the task of current cpu is offline; + * b) The qos_smt_status of siblings cpu is offline, the task of current cpu is idle, + * and current cpu only has SCHED_IDLE tasks enqueued. + */ + if (per_cpu(qos_smt_status, cpu) == QOS_LEVEL_ONLINE && + task_group(current)->qos_level < QOS_LEVEL_ONLINE) + return true; + + if (per_cpu(qos_smt_status, cpu) == QOS_LEVEL_OFFLINE && + rq->curr == rq->idle && sched_idle_cpu(this_cpu)) + return true; + } + + return false; +} + +void qos_smt_check_need_resched(void) +{ + struct rq *rq = this_rq(); + int this_cpu = rq->cpu; + + if (test_tsk_need_resched(current)) + return; + + if (_qos_smt_check_need_resched(this_cpu, rq)) { + set_tsk_need_resched(current); + set_preempt_need_resched(); + } +} +#endif + struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { @@ -7379,14 +7508,32 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf struct task_struct *p; int new_tasks; unsigned long time; +#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + int this_cpu = rq->cpu; +#endif
again: +#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + if (qos_smt_expelled(this_cpu)) { + __this_cpu_write(qos_smt_status, QOS_LEVEL_OFFLINE); + return NULL; + } +#endif + if (!sched_fair_runnable(rq)) goto idle;
#ifdef CONFIG_FAIR_GROUP_SCHED - if (!prev || prev->sched_class != &fair_sched_class) - goto simple; + if (!prev || prev->sched_class != &fair_sched_class) { +#ifdef CONFIG_QOS_SCHED + if (cfs_rq->idle_h_nr_running != 0 && rq->online) + goto qos_simple; + else +#endif + goto simple; + } + +
/* * Because of the set_next_buddy() in dequeue_task_fair() it is rather @@ -7470,6 +7617,34 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf }
goto done; + +#ifdef CONFIG_QOS_SCHED +qos_simple: + if (prev) + put_prev_task(rq, prev); + + do { + se = pick_next_entity(cfs_rq, NULL); + if (check_qos_cfs_rq(group_cfs_rq(se))) { + cfs_rq = &rq->cfs; + if (!cfs_rq->nr_running) + goto idle; + continue; + } + + cfs_rq = group_cfs_rq(se); + } while (cfs_rq); + + p = task_of(se); + + while (se) { + set_next_entity(cfs_rq_of(se), se); + se = parent_entity(se); + } + + goto done; +#endif + simple: #endif if (prev) @@ -7498,6 +7673,9 @@ done: __maybe_unused;
update_misfit_status(p, rq);
+#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + qos_smt_expel(this_cpu, p); +#endif return p;
idle: @@ -7546,6 +7724,9 @@ done: __maybe_unused; */ update_idle_rq_clock_pelt(rq);
+#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + qos_smt_expel(this_cpu, NULL); +#endif return NULL; }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index fadd38187c2a..0d40bb700f3c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1136,6 +1136,11 @@ static inline int cpu_of(struct rq *rq) }
#ifdef CONFIG_QOS_SCHED +enum task_qos_level { + QOS_LEVEL_OFFLINE = -1, + QOS_LEVEL_ONLINE = 0, + QOS_LEVEL_MAX +}; void init_qos_hrtimer(int cpu); #endif
From: Guan Jing guanjing6@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA
--------------------------------
We add two statistics for the qos smt expeller: a) nr_qos_smt_send_ipi: the number of IPIs sent by online tasks to expel offline tasks; b) nr_qos_smt_expelled: the number of times an offline task was not picked because of the expeller.
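With schedstats enabled, the two counters are reported through /proc/<pid>/sched by the debug.c hunk below. The output here is only an illustration (field widths and the counts are made up):

    # grep nr_qos_smt /proc/<pid>/sched
    se.statistics.nr_qos_smt_send_ipi        :                    3
    se.statistics.nr_qos_smt_expelled        :                   12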
Signed-off-by: Guan Jing guanjing6@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/sched.h | 6 +++++- kernel/sched/debug.c | 4 ++++ kernel/sched/fair.c | 2 ++ 3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 06215f01f68f..7928b8d9c7da 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -461,8 +461,13 @@ struct sched_statistics { u64 nr_wakeups_passive; u64 nr_wakeups_idle;
+#if defined(CONFIG_QOS_SCHED_SMT_EXPELLER) && !defined(__GENKSYMS__) + u64 nr_qos_smt_send_ipi; + u64 nr_qos_smt_expelled; +#else KABI_RESERVE(1) KABI_RESERVE(2) +#endif KABI_RESERVE(3) KABI_RESERVE(4) #endif @@ -2172,5 +2177,4 @@ static inline int sched_qos_cpu_overload(void) return 0; } #endif - #endif diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 70a578272436..12fbaf1302ac 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -982,6 +982,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P_SCHEDSTAT(se.statistics.nr_wakeups_affine_attempts); P_SCHEDSTAT(se.statistics.nr_wakeups_passive); P_SCHEDSTAT(se.statistics.nr_wakeups_idle); +#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + P_SCHEDSTAT(se.statistics.nr_qos_smt_send_ipi); + P_SCHEDSTAT(se.statistics.nr_qos_smt_expelled); +#endif
avg_atom = p->se.sum_exec_runtime; if (nr_switches) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5cfdf40b974c..a06c5c173619 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7445,6 +7445,7 @@ static void qos_smt_send_ipi(int this_cpu) rq->cfs.h_nr_running == 0) continue;
+ schedstat_inc(current->se.statistics.nr_qos_smt_send_ipi); smp_send_reschedule(cpu); } } @@ -7516,6 +7517,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER if (qos_smt_expelled(this_cpu)) { __this_cpu_write(qos_smt_status, QOS_LEVEL_OFFLINE); + schedstat_inc(rq->curr->se.statistics.nr_qos_smt_expelled); return NULL; } #endif
From: Guan Jing guanjing6@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA
--------------------------------
We add tracepoints for two cases: a) while an online task is running on the sibling cpu, the offline task running on the local cpu has TIF_NEED_RESCHED set; b) while an online task is running on the sibling cpu, the offline task that would be picked next on the local cpu is expelled.
Signed-off-by: Guan Jing guanjing6@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/trace/events/sched.h | 55 ++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 9 ++++-- 2 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index c96a4337afe6..028f49662ac3 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -183,6 +183,61 @@ TRACE_EVENT(sched_switch, __entry->next_comm, __entry->next_pid, __entry->next_prio) );
+#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER +/* + * Tracepoint for a offline task being resched: + */ +TRACE_EVENT(sched_qos_smt_expel, + + TP_PROTO(struct task_struct *sibling_p, int qos_smt_status), + + TP_ARGS(sibling_p, qos_smt_status), + + TP_STRUCT__entry( + __array( char, sibling_comm, TASK_COMM_LEN ) + __field( pid_t, sibling_pid ) + __field( int, sibling_qos_status ) + __field( int, sibling_cpu ) + ), + + TP_fast_assign( + memcpy(__entry->sibling_comm, sibling_p->comm, TASK_COMM_LEN); + __entry->sibling_pid = sibling_p->pid; + __entry->sibling_qos_status = qos_smt_status; + __entry->sibling_cpu = task_cpu(sibling_p); + ), + + TP_printk("sibling_comm=%s sibling_pid=%d sibling_qos_status=%d sibling_cpu=%d", + __entry->sibling_comm, __entry->sibling_pid, __entry->sibling_qos_status, + __entry->sibling_cpu) +); + +/* + * Tracepoint for a offline task being expelled: + */ +TRACE_EVENT(sched_qos_smt_expelled, + + TP_PROTO(struct task_struct *p, int qos_smt_status), + + TP_ARGS(p, qos_smt_status), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( int, qos_status ) + ), + + TP_fast_assign( + memcpy(__entry->comm, p->comm, TASK_COMM_LEN); + __entry->pid = p->pid; + __entry->qos_status = qos_smt_status; + ), + + TP_printk("comm=%s pid=%d qos_status=%d", + __entry->comm, __entry->pid, __entry->qos_status) +); +#endif + /* * Tracepoint for a task being migrated: */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a06c5c173619..dfcc341fe179 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7475,12 +7475,16 @@ static bool _qos_smt_check_need_resched(int this_cpu, struct rq *rq) * and current cpu only has SCHED_IDLE tasks enqueued. */ if (per_cpu(qos_smt_status, cpu) == QOS_LEVEL_ONLINE && - task_group(current)->qos_level < QOS_LEVEL_ONLINE) + task_group(current)->qos_level < QOS_LEVEL_ONLINE) { + trace_sched_qos_smt_expel(cpu_curr(cpu), per_cpu(qos_smt_status, cpu)); return true; + }
if (per_cpu(qos_smt_status, cpu) == QOS_LEVEL_OFFLINE && - rq->curr == rq->idle && sched_idle_cpu(this_cpu)) + rq->curr == rq->idle && sched_idle_cpu(this_cpu)) { + trace_sched_qos_smt_expel(cpu_curr(cpu), per_cpu(qos_smt_status, cpu)); return true; + } }
return false; @@ -7518,6 +7522,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf if (qos_smt_expelled(this_cpu)) { __this_cpu_write(qos_smt_status, QOS_LEVEL_OFFLINE); schedstat_inc(rq->curr->se.statistics.nr_qos_smt_expelled); + trace_sched_qos_smt_expelled(rq->curr, per_cpu(qos_smt_status, this_cpu)); return NULL; } #endif
From: Guan Jing guanjing6@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA
Signed-off-by: Guan Jing guanjing6@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/configs/openeuler_defconfig | 1 + arch/x86/configs/openeuler_defconfig | 1 + 2 files changed, 2 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 770222a597e4..78a63cbc3db6 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -138,6 +138,7 @@ CONFIG_BLK_CGROUP=y CONFIG_CGROUP_WRITEBACK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y +CONFIG_QOS_SCHED_SMT_EXPELLER=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y CONFIG_CGROUP_PIDS=y diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index 926dfe0628dc..61c4be815462 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -158,6 +158,7 @@ CONFIG_CGROUP_WRITEBACK=y CONFIG_CGROUP_SCHED=y CONFIG_QOS_SCHED=y CONFIG_FAIR_GROUP_SCHED=y +CONFIG_QOS_SCHED_SMT_EXPELLER=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y CONFIG_CGROUP_PIDS=y
From: Congyu Liu liu3101@purdue.edu
stable inclusion from stable-v5.10.96 commit db044d97460ea792110eb8b971e82569ded536c6 bugzilla: https://gitee.com/openeuler/kernel/issues/I55VNA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 47934e06b65637c88a762d9c98329ae6e3238888 upstream.
In one net namespace, after creating a packet socket without binding it to a device, users in other net namespaces can observe the new `packet_type` added by this packet socket by reading the `/proc/net/ptype` file. This is a minor information leak, since the packet socket is namespace aware.
Add a net pointer in `packet_type` to keep the net namespace of the corresponding packet socket. In `ptype_seq_show`, this net pointer must be checked when it is not NULL.
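An illustrative reproducer sketch, not taken from the original patch: leave an unbound packet socket open in one namespace, then read /proc/net/ptype from another namespace; before this fix the socket's entry was visible there as well.

    #include <sys/socket.h>
    #include <arpa/inet.h>          /* htons() */
    #include <linux/if_ether.h>     /* ETH_P_ALL */
    #include <unistd.h>

    int main(void)
    {
            /* needs CAP_NET_RAW; never bound to a device, so pt->dev stays NULL */
            int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

            if (fd < 0)
                    return 1;
            pause();                /* keep the packet_type registered */
            return 0;
    }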
Fixes: 2feb27dbe00c ("[NETNS]: Minor information leak via /proc/net/ptype file.") Signed-off-by: Congyu Liu liu3101@purdue.edu Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Lu Wei luwei32@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/netdevice.h | 2 +- net/core/net-procfs.c | 3 ++- net/packet/af_packet.c | 2 ++ 3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 051a8b2edf13..4a9ad88439d4 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2595,7 +2595,7 @@ struct packet_type { void *af_packet_priv; struct list_head list;
- KABI_RESERVE(1) + KABI_USE(1, struct net *af_packet_net) KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c index c714e6a9dad4..e12c67f9492b 100644 --- a/net/core/net-procfs.c +++ b/net/core/net-procfs.c @@ -263,7 +263,8 @@ static int ptype_seq_show(struct seq_file *seq, void *v)
if (v == SEQ_START_TOKEN) seq_puts(seq, "Type Device Function\n"); - else if (pt->dev == NULL || dev_net(pt->dev) == seq_file_net(seq)) { + else if ((!pt->af_packet_net || net_eq(pt->af_packet_net, seq_file_net(seq))) && + (!pt->dev || net_eq(dev_net(pt->dev), seq_file_net(seq)))) { if (pt->type == htons(ETH_P_ALL)) seq_puts(seq, "ALL "); else diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index f78097aa403a..6ef035494f30 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1735,6 +1735,7 @@ static int fanout_add(struct sock *sk, struct fanout_args *args) match->prot_hook.dev = po->prot_hook.dev; match->prot_hook.func = packet_rcv_fanout; match->prot_hook.af_packet_priv = match; + match->prot_hook.af_packet_net = read_pnet(&match->net); match->prot_hook.id_match = match_fanout_group; match->max_num_members = args->max_num_members; list_add(&match->list, &fanout_list); @@ -3323,6 +3324,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, po->prot_hook.func = packet_rcv_spkt;
po->prot_hook.af_packet_priv = sk; + po->prot_hook.af_packet_net = sock_net(sk);
if (proto) { po->prot_hook.type = proto;
From: GUO Zihua guozihua@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I569U8 CVE: NA
Reference: https://lore.kernel.org/lkml/20190609164147.971147667@linuxfoundation.org/
--------------------------------
The digest() hook relies on a crc value from the shash_desc context. However, this context is not initialized when the digest() hook is called, so an arbitrary value is read and the algorithm generates a wrong result.
This patch fixes the issue by passing 0 as the initial crc value in the digest() hook.
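For context, and not something the patch itself changes: the shash API expects the ->digest() hook to behave like ->init() + ->update() + ->final() on a fresh descriptor, and chksum_init() seeds the CRC-T10DIF state with 0, which is why a literal 0 is the correct starting value here:

    /*
     * Required semantics of the hook being fixed:
     *   digest(desc, data, len, out) == init(desc);
     *                                   update(desc, data, len);
     *                                   final(desc, out);
     * init() sets the crc seed to 0 for crc-t10dif, so digest() can
     * simply start from 0 instead of reading the uninitialized context.
     */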
Signed-off-by: GUO Zihua guozihua@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/crypto/crct10dif-neon_glue.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/crypto/crct10dif-neon_glue.c b/arch/arm64/crypto/crct10dif-neon_glue.c index af731b3ec30e..6fad09de212d 100644 --- a/arch/arm64/crypto/crct10dif-neon_glue.c +++ b/arch/arm64/crypto/crct10dif-neon_glue.c @@ -55,10 +55,10 @@ static int chksum_final(struct shash_desc *desc, u8 *out) return 0; }
-static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len, +static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out) { - *(__u16 *)out = crc_t10dif_neon(*crcp, data, len); + *(__u16 *)out = crc_t10dif_neon(crc, data, len); return 0; }
@@ -67,15 +67,13 @@ static int chksum_finup(struct shash_desc *desc, const u8 *data, { struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
- return __chksum_finup(&ctx->crc, data, len, out); + return __chksum_finup(ctx->crc, data, len, out); }
static int chksum_digest(struct shash_desc *desc, const u8 *data, unsigned int length, u8 *out) { - struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); - - return __chksum_finup(&ctx->crc, data, length, out); + return __chksum_finup(0, data, length, out); }
static struct shash_alg alg = {
From: Lu Wei luwei32@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I545NW CVE: NA
--------------------------------
UID and GID are requested as filters for socketmap, but only the UID is currently available from the sock structure. This patch adds a GID field to struct sock, maintained in the same way as the UID.
Signed-off-by: Lu Wei luwei32@huawei.com Signed-off-by: Liu Jian liujian56@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/net/sock.h | 14 ++++++++++++++ net/core/sock.c | 2 ++ net/socket.c | 6 ++++-- 3 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h index c958be11d172..af73dda0285b 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -303,6 +303,7 @@ struct bpf_local_storage; * @sk_ack_backlog: current listen backlog * @sk_max_ack_backlog: listen backlog set in listen() * @sk_uid: user id of owner + * @sk_gid: group id of owner * @sk_priority: %SO_PRIORITY setting * @sk_type: socket type (%SOCK_STREAM, etc) * @sk_protocol: which protocol this socket belongs in this network family @@ -527,7 +528,14 @@ struct sock { #endif struct rcu_head sk_rcu;
+#ifndef __GENKSYMS__ + union { + kgid_t sk_gid; + u64 sk_gid_padding; + }; +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) @@ -1904,6 +1912,7 @@ static inline void sock_graft(struct sock *sk, struct socket *parent) parent->sk = sk; sk_set_socket(sk, parent); sk->sk_uid = SOCK_INODE(parent)->i_uid; + sk->sk_gid = SOCK_INODE(parent)->i_gid; security_sock_graft(sk, parent); write_unlock_bh(&sk->sk_callback_lock); } @@ -1916,6 +1925,11 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk) return sk ? sk->sk_uid : make_kuid(net->user_ns, 0); }
+static inline kgid_t sock_net_gid(const struct net *net, const struct sock *sk) +{ + return sk ? sk->sk_gid : make_kgid(net->user_ns, 0); +} + static inline u32 net_tx_rndhash(void) { u32 v = prandom_u32(); diff --git a/net/core/sock.c b/net/core/sock.c index bee3c320dbfe..2fa8863caee0 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2985,9 +2985,11 @@ void sock_init_data(struct socket *sock, struct sock *sk) RCU_INIT_POINTER(sk->sk_wq, &sock->wq); sock->sk = sk; sk->sk_uid = SOCK_INODE(sock)->i_uid; + sk->sk_gid = SOCK_INODE(sock)->i_gid; } else { RCU_INIT_POINTER(sk->sk_wq, NULL); sk->sk_uid = make_kuid(sock_net(sk)->user_ns, 0); + sk->sk_gid = make_kgid(sock_net(sk)->user_ns, 0); }
rwlock_init(&sk->sk_callback_lock); diff --git a/net/socket.c b/net/socket.c index d52c265ad449..7d84c289e5ae 100644 --- a/net/socket.c +++ b/net/socket.c @@ -543,10 +543,12 @@ static int sockfs_setattr(struct dentry *dentry, struct iattr *iattr) if (!err && (iattr->ia_valid & ATTR_UID)) { struct socket *sock = SOCKET_I(d_inode(dentry));
- if (sock->sk) + if (sock->sk) { sock->sk->sk_uid = iattr->ia_uid; - else + sock->sk->sk_gid = iattr->ia_gid; + } else { err = -ENOENT; + } }
return err;
From: Liu Jian liujian56@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I545NW CVE: NA
--------------------------------
Add a helper for the bpf sock_ops hook to get the sock's uid and gid.
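As an illustration (not part of the patch), a sockops program could consume the helper roughly as follows. The function-pointer declaration is written out by hand here and would normally come from the regenerated bpf_helper_defs.h, and the uid value 1000 is just an example policy:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    static __u64 (*bpf_get_sockops_uid_gid)(void *sockops) =
            (void *)BPF_FUNC_get_sockops_uid_gid;

    SEC("sockops")
    int tag_by_owner(struct bpf_sock_ops *skops)
    {
            __u64 uid_gid = bpf_get_sockops_uid_gid(skops);
            __u32 uid = (__u32)uid_gid;             /* low 32 bits  */
            __u32 gid = (__u32)(uid_gid >> 32);     /* high 32 bits */

            if (uid != 1000)                        /* example filter */
                    return 1;

            bpf_printk("sock owner uid=%u gid=%u\n", uid, gid);
            return 1;
    }

    char _license[] SEC("license") = "GPL";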
Signed-off-by: Liu Jian liujian56@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/uapi/linux/bpf.h | 8 ++++++++ net/core/filter.c | 25 +++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 8 ++++++++ 3 files changed, 41 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 00afbbc130ee..4829a28ddcae 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3742,6 +3742,13 @@ union bpf_attr { * Return * The helper returns **TC_ACT_REDIRECT** on success or * **TC_ACT_SHOT** on error. + * + * u64 bpf_get_sockops_uid_gid(void *sockops) + * Description + * Get sock's uid and gid + * Return + * A 64-bit integer containing the current GID and UID, and + * created as such: *current_gid* **<< 32 |** *current_uid*. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3900,6 +3907,7 @@ union bpf_attr { FN(per_cpu_ptr), \ FN(this_cpu_ptr), \ FN(redirect_peer), \ + FN(get_sockops_uid_gid), \ /* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/net/core/filter.c b/net/core/filter.c index ca45a97ef2fe..59ed0724442b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5006,6 +5006,29 @@ static const struct bpf_func_proto bpf_sock_addr_setsockopt_proto = { .arg5_type = ARG_CONST_SIZE, };
+BPF_CALL_1(bpf_get_sockops_uid_gid, struct bpf_sock_ops_kern *, bpf_sock) +{ + struct sock *sk = bpf_sock->sk; + kuid_t uid; + kgid_t gid; + + if (!sk || !sk_fullsock(sk)) + return -EINVAL; + + uid = sock_net_uid(sock_net(sk), sk); + gid = sock_net_gid(sock_net(sk), sk); + + return ((u64)from_kgid_munged(sock_net(sk)->user_ns, gid)) << 32 | + from_kuid_munged(sock_net(sk)->user_ns, uid); +} + +static const struct bpf_func_proto bpf_get_sockops_uid_gid_proto = { + .func = bpf_get_sockops_uid_gid, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + BPF_CALL_5(bpf_sock_addr_getsockopt, struct bpf_sock_addr_kern *, ctx, int, level, int, optname, char *, optval, int, optlen) { @@ -7276,6 +7299,8 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_storage_get_proto; case BPF_FUNC_sk_storage_delete: return &bpf_sk_storage_delete_proto; + case BPF_FUNC_get_sockops_uid_gid: + return &bpf_get_sockops_uid_gid_proto; #ifdef CONFIG_INET case BPF_FUNC_load_hdr_opt: return &bpf_sock_ops_load_hdr_opt_proto; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 00afbbc130ee..4829a28ddcae 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3742,6 +3742,13 @@ union bpf_attr { * Return * The helper returns **TC_ACT_REDIRECT** on success or * **TC_ACT_SHOT** on error. + * + * u64 bpf_get_sockops_uid_gid(void *sockops) + * Description + * Get sock's uid and gid + * Return + * A 64-bit integer containing the current GID and UID, and + * created as such: *current_gid* **<< 32 |** *current_uid*. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3900,6 +3907,7 @@ union bpf_attr { FN(per_cpu_ptr), \ FN(this_cpu_ptr), \ FN(redirect_peer), \ + FN(get_sockops_uid_gid), \ /* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
From: Liu Jian liujian56@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I545NW CVE: NA
--------------------------------
Add new optnames (BPF_SO_ORIGINAL_DST 800, BPF_SO_REPLY_SRC 801) so that bpf programs can get the original destination / reply source address of a connection. Only IPv4 is supported for now.
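For illustration only (not part of the patch), a sockops program could query the original destination like this. BPF_SO_ORIGINAL_DST is redefined locally because BPF programs rarely include the netfilter uapi header, and the helper declaration is hand-written rather than taken from a regenerated bpf_helper_defs.h:

    #include <linux/bpf.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>

    #define BPF_SO_ORIGINAL_DST 800

    static int (*bpf_sk_original_addr)(void *ctx, int optname, char *optval,
                                       int optlen) =
            (void *)BPF_FUNC_sk_original_addr;

    SEC("sockops")
    int show_orig_dst(struct bpf_sock_ops *skops)
    {
            struct sockaddr_in odst = {};

            if (!bpf_sk_original_addr(skops, BPF_SO_ORIGINAL_DST,
                                      (char *)&odst, sizeof(odst))) {
                    /* values are in network byte order */
                    bpf_printk("orig dst %x:%u\n", odst.sin_addr.s_addr,
                               odst.sin_port);
            }
            return 1;
    }

    char _license[] SEC("license") = "GPL";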
Signed-off-by: Wang Yufen wangyufen@huawei.com Signed-off-by: Liu Jian liujian56@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/net/netfilter/nf_conntrack.h | 4 ++ include/uapi/linux/bpf.h | 7 +++ include/uapi/linux/netfilter_ipv4.h | 2 + net/core/filter.c | 49 +++++++++++++++++++++ net/netfilter/nf_conntrack_proto.c | 65 ++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 7 +++ 6 files changed, 134 insertions(+)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 0acbd9c40a5f..2b2d9deed907 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -342,4 +342,8 @@ nf_ct_set(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info info) #define MODULE_ALIAS_NFCT_HELPER(helper) \ MODULE_ALIAS("nfct-helper-" helper)
+typedef int (*bpf_getorigdst_opt_func)(struct sock *sk, int optname, + void *optval, int *optlen, int dir); +extern bpf_getorigdst_opt_func bpf_getorigdst_opt; + #endif /* _NF_CONNTRACK_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 4829a28ddcae..75617c529efd 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3749,6 +3749,12 @@ union bpf_attr { * Return * A 64-bit integer containing the current GID and UID, and * created as such: *current_gid* **<< 32 |** *current_uid*. + * + * int bpf_sk_original_addr(void *bpf_socket, int optname, char *optval, int optlen) + * Description + * Get Ipv4 origdst or replysrc. Works with IPv4. + * Return + * 0 on success, or a negative error in case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3908,6 +3914,7 @@ union bpf_attr { FN(this_cpu_ptr), \ FN(redirect_peer), \ FN(get_sockops_uid_gid), \ + FN(sk_original_addr), \ /* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/include/uapi/linux/netfilter_ipv4.h b/include/uapi/linux/netfilter_ipv4.h index 155e77d6a42d..00e78cc2782b 100644 --- a/include/uapi/linux/netfilter_ipv4.h +++ b/include/uapi/linux/netfilter_ipv4.h @@ -50,6 +50,8 @@ enum nf_ip_hook_priorities { /* 2.2 firewalling (+ masq) went from 64 through 76 */ /* 2.4 firewalling went 64 through 67. */ #define SO_ORIGINAL_DST 80 +#define BPF_SO_ORIGINAL_DST 800 +#define BPF_SO_REPLY_SRC 801
#endif /* _UAPI__LINUX_IP_NETFILTER_H */ diff --git a/net/core/filter.c b/net/core/filter.c index 59ed0724442b..61cb3f94bd03 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5029,6 +5029,53 @@ static const struct bpf_func_proto bpf_get_sockops_uid_gid_proto = { .arg1_type = ARG_PTR_TO_CTX, };
+#include <net/netfilter/nf_conntrack.h> +#include <linux/netfilter_ipv4.h> + +bpf_getorigdst_opt_func bpf_getorigdst_opt; +EXPORT_SYMBOL(bpf_getorigdst_opt); + +BPF_CALL_4(bpf_sk_original_addr, struct bpf_sock_ops_kern *, bpf_sock, + int, optname, char *, optval, int, optlen) +{ + struct sock *sk = bpf_sock->sk; + int ret = -EINVAL; + + if (!sk_fullsock(sk)) + goto err_clear; + + if (optname != BPF_SO_ORIGINAL_DST && optname != BPF_SO_REPLY_SRC) + goto err_clear; + + if (!bpf_getorigdst_opt) + goto err_clear; +#if IS_ENABLED(CONFIG_NF_CONNTRACK) + if (optname == BPF_SO_ORIGINAL_DST) + ret = bpf_getorigdst_opt(sk, optname, optval, &optlen, + IP_CT_DIR_ORIGINAL); + else if (optname == BPF_SO_REPLY_SRC) + ret = bpf_getorigdst_opt(sk, optname, optval, &optlen, + IP_CT_DIR_REPLY); + if (ret < 0) + goto err_clear; + + return 0; +#endif +err_clear: + memset(optval, 0, optlen); + return ret; +} + +static const struct bpf_func_proto bpf_sk_original_addr_proto = { + .func = bpf_sk_original_addr, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, +}; + BPF_CALL_5(bpf_sock_addr_getsockopt, struct bpf_sock_addr_kern *, ctx, int, level, int, optname, char *, optval, int, optlen) { @@ -7301,6 +7348,8 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_storage_delete_proto; case BPF_FUNC_get_sockops_uid_gid: return &bpf_get_sockops_uid_gid_proto; + case BPF_FUNC_sk_original_addr: + return &bpf_sk_original_addr_proto; #ifdef CONFIG_INET case BPF_FUNC_load_hdr_opt: return &bpf_sock_ops_load_hdr_opt_proto; diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c index 71892822bbf5..dd1fff72c736 100644 --- a/net/netfilter/nf_conntrack_proto.c +++ b/net/netfilter/nf_conntrack_proto.c @@ -292,6 +292,67 @@ getorigdst(struct sock *sk, int optval, void __user *user, int *len) return -ENOENT; }
+static int +bpf_getorigdst_impl(struct sock *sk, int optval, void *user, int *len, int dir) +{ + const struct inet_sock *inet = inet_sk(sk); + const struct nf_conntrack_tuple_hash *h; + struct nf_conntrack_tuple tuple; + + memset(&tuple, 0, sizeof(tuple)); + + tuple.src.u3.ip = inet->inet_rcv_saddr; + tuple.src.u.tcp.port = inet->inet_sport; + tuple.dst.u3.ip = inet->inet_daddr; + tuple.dst.u.tcp.port = inet->inet_dport; + tuple.src.l3num = PF_INET; + tuple.dst.protonum = sk->sk_protocol; + + /* We only do TCP and SCTP at the moment: is there a better way? */ + if (tuple.dst.protonum != IPPROTO_TCP && + tuple.dst.protonum != IPPROTO_SCTP) { + pr_debug("SO_ORIGINAL_DST: Not a TCP/SCTP socket\n"); + return -ENOPROTOOPT; + } + + if ((unsigned int)*len < sizeof(struct sockaddr_in)) { + pr_debug("SO_ORIGINAL_DST: len %d not %zu\n", + *len, sizeof(struct sockaddr_in)); + return -EINVAL; + } + + h = nf_conntrack_find_get(sock_net(sk), &nf_ct_zone_dflt, &tuple); + if (h) { + struct sockaddr_in sin; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); + + sin.sin_family = AF_INET; + if (dir == IP_CT_DIR_REPLY) { + sin.sin_port = ct->tuplehash[IP_CT_DIR_REPLY] + .tuple.src.u.tcp.port; + sin.sin_addr.s_addr = ct->tuplehash[IP_CT_DIR_REPLY] + .tuple.src.u3.ip; + } else { + sin.sin_port = ct->tuplehash[IP_CT_DIR_ORIGINAL] + .tuple.dst.u.tcp.port; + sin.sin_addr.s_addr = ct->tuplehash[IP_CT_DIR_ORIGINAL] + .tuple.dst.u3.ip; + } + memset(sin.sin_zero, 0, sizeof(sin.sin_zero)); + + pr_debug("SO_ORIGINAL_DST: %pI4 %u\n", + &sin.sin_addr.s_addr, ntohs(sin.sin_port)); + nf_ct_put(ct); + + memcpy(user, &sin, sizeof(sin)); + return 0; + } + pr_debug("SO_ORIGINAL_DST: Can't find %pI4/%u-%pI4/%u.\n", + &tuple.src.u3.ip, ntohs(tuple.src.u.tcp.port), + &tuple.dst.u3.ip, ntohs(tuple.dst.u.tcp.port)); + return -ENOENT; +} + static struct nf_sockopt_ops so_getorigdst = { .pf = PF_INET, .get_optmin = SO_ORIGINAL_DST, @@ -656,6 +717,8 @@ int nf_conntrack_proto_init(void) goto cleanup_sockopt; #endif
+ bpf_getorigdst_opt = bpf_getorigdst_impl; + return ret;
#if IS_ENABLED(CONFIG_IPV6) @@ -667,6 +730,8 @@ int nf_conntrack_proto_init(void)
void nf_conntrack_proto_fini(void) { + bpf_getorigdst_opt = NULL; + nf_unregister_sockopt(&so_getorigdst); #if IS_ENABLED(CONFIG_IPV6) nf_unregister_sockopt(&so_getorigdst6); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 4829a28ddcae..75617c529efd 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3749,6 +3749,12 @@ union bpf_attr { * Return * A 64-bit integer containing the current GID and UID, and * created as such: *current_gid* **<< 32 |** *current_uid*. + * + * int bpf_sk_original_addr(void *bpf_socket, int optname, char *optval, int optlen) + * Description + * Get Ipv4 origdst or replysrc. Works with IPv4. + * Return + * 0 on success, or a negative error in case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3908,6 +3914,7 @@ union bpf_attr { FN(this_cpu_ptr), \ FN(redirect_peer), \ FN(get_sockops_uid_gid), \ + FN(sk_original_addr), \ /* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
From: Liu Jian liujian56@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I545NW CVE: NA
--------------------------------
Allow access to bpf_sock's src_ip4 and src_port in the BPF_CGROUP_INET_SOCK_RELEASE hook.
Signed-off-by: Liu Jian liujian56@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/core/filter.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c index 61cb3f94bd03..fa473a58d1be 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -7742,6 +7742,7 @@ static bool __sock_filter_check_attach_type(int off, case bpf_ctx_range(struct bpf_sock, src_ip4): switch (attach_type) { case BPF_CGROUP_INET4_POST_BIND: + case BPF_CGROUP_INET_SOCK_RELEASE: goto read_only; default: return false; @@ -7757,6 +7758,7 @@ static bool __sock_filter_check_attach_type(int off, switch (attach_type) { case BPF_CGROUP_INET4_POST_BIND: case BPF_CGROUP_INET6_POST_BIND: + case BPF_CGROUP_INET_SOCK_RELEASE: goto read_only; default: return false;
From: He Ying heying24@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I56HL6
--------------------------------
The arm64 pseudo-NMI feature code leaves some unnecessary nops in the entry path when CONFIG_ARM64_PSEUDO_NMI is not set. Add the necessary ifdeffery to avoid them.
Signed-off-by: He Ying heying24@huawei.com Reviewed-by: Zhang Jianhua chris.zjh@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/kernel/entry.S | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index ad6649006704..64145bfab48f 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -256,6 +256,7 @@ alternative_else_nop_endif str w21, [sp, #S_SYSCALLNO] .endif
+#ifdef CONFIG_ARM64_PSEUDO_NMI /* Save pmr */ alternative_if ARM64_HAS_IRQ_PRIO_MASKING mrs_s x20, SYS_ICC_PMR_EL1 @@ -263,6 +264,7 @@ alternative_if ARM64_HAS_IRQ_PRIO_MASKING mov x20, #GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET msr_s SYS_ICC_PMR_EL1, x20 alternative_else_nop_endif +#endif
/* Re-enable tag checking (TCO set on exception entry) */ #ifdef CONFIG_ARM64_MTE @@ -286,6 +288,7 @@ alternative_else_nop_endif disable_daif .endif
+#ifdef CONFIG_ARM64_PSEUDO_NMI /* Restore pmr */ alternative_if ARM64_HAS_IRQ_PRIO_MASKING ldr x20, [sp, #S_PMR_SAVE] @@ -295,6 +298,7 @@ alternative_if ARM64_HAS_IRQ_PRIO_MASKING dsb sy // Ensure priority change is seen by redistributor .L__skip_pmr_sync@: alternative_else_nop_endif +#endif
ldp x21, x22, [sp, #S_PC] // load ELR, SPSR
@@ -507,6 +511,7 @@ alternative_endif
#ifdef CONFIG_PREEMPTION ldr x24, [tsk, #TSK_TI_PREEMPT] // get preempt count +#ifdef CONFIG_ARM64_PSEUDO_NMI alternative_if ARM64_HAS_IRQ_PRIO_MASKING /* * DA_F were cleared at start of handling. If anything is set in DAIF, @@ -515,6 +520,7 @@ alternative_if ARM64_HAS_IRQ_PRIO_MASKING mrs x0, daif orr x24, x24, x0 alternative_else_nop_endif +#endif cbnz x24, 1f // preempt count != 0 || NMI return path bl arm64_preempt_schedule_irq // irq en/disable is done inside 1:
From: Li Lingfeng lilingfeng3@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53Q6M CVE: NA
---------------------------
Currently, we don't have an easy way to figure out that a file system was corrupted by data written through the raw block device. It is risky to open a block device exclusively while it has already been opened for write by some other process, since this may lead to data corruption. This patch records the write openers and gives a hint when such a conflicting exclusive open happens.
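For reference, the hint added below is a ratelimited pr_info line built from the format string in blkdev_dump_conflict_opener(); an example instance (pids and comms are made up) would look like:

    VFS: Open an write opened block device exclusively sda1. current [2310 mkfs.ext4]. parent [2205 bash]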
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/block_dev.c | 65 +++++++++++++++++++++++++++++++++++++-- include/linux/blk_types.h | 2 ++ 2 files changed, 64 insertions(+), 3 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 46801789f2dc..915d3b5bdee7 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1172,7 +1172,6 @@ static void bd_clear_claiming(struct block_device *whole, void *holder) static void bd_finish_claiming(struct block_device *bdev, struct block_device *whole, void *holder) { - spin_lock(&bdev_lock); BUG_ON(!bd_may_claim(bdev, whole, holder)); /* * Note that for a whole device bd_holders will be incremented twice, @@ -1183,7 +1182,6 @@ static void bd_finish_claiming(struct block_device *bdev, bdev->bd_holders++; bdev->bd_holder = holder; bd_clear_claiming(whole, holder); - spin_unlock(&bdev_lock); }
/** @@ -1481,6 +1479,39 @@ int bdev_disk_changed(struct block_device *bdev, bool invalidate) */ EXPORT_SYMBOL_GPL(bdev_disk_changed);
+static void blkdev_dump_conflict_opener(struct block_device *bdev, char *msg) +{ + char name[BDEVNAME_SIZE]; + struct task_struct *p = NULL; + char comm_buf[TASK_COMM_LEN]; + pid_t p_pid; + + rcu_read_lock(); + p = rcu_dereference(current->real_parent); + task_lock(p); + strncpy(comm_buf, p->comm, TASK_COMM_LEN); + p_pid = p->pid; + task_unlock(p); + rcu_read_unlock(); + + pr_info_ratelimited("%s %s. current [%d %s]. parent [%d %s]\n", + msg, bdevname(bdev, name), + current->pid, current->comm, p_pid, comm_buf); +} + +static bool is_conflict_excl_open(struct block_device *bdev, struct block_device *whole, fmode_t mode) +{ + if (bdev->bd_holders) + return false; + + if (bdev->bd_write_openers > ((mode & FMODE_WRITE) ? 1 : 0)) + return true; + + if (bdev == whole) + return !!bdev->bd_part_write_openers; + + return !!whole->bd_write_openers; +} /* * bd_mutex locking: * @@ -1599,8 +1630,28 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, void *holder, bdev->bd_openers++; if (for_part) bdev->bd_part_count++; - if (claiming) + + if (!for_part && (mode & FMODE_WRITE)) { + spin_lock(&bdev_lock); + bdev->bd_write_openers++; + if (bdev->bd_contains != bdev) + bdev->bd_contains->bd_part_write_openers++; + spin_unlock(&bdev_lock); + } + + if (claiming) { + spin_lock(&bdev_lock); + /* + * Open an write opened block device exclusively, the + * writing process may probability corrupt the device, + * such as a mounted file system, give a hint here. + */ + if (is_conflict_excl_open(bdev, claiming, mode)) + blkdev_dump_conflict_opener(bdev, "VFS: Open an write opened " + "block device exclusively"); bd_finish_claiming(bdev, claiming, holder); + spin_unlock(&bdev_lock); + }
/* * Block event polling for write claims if requested. Any write holder @@ -1818,6 +1869,14 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) if (for_part) bdev->bd_part_count--;
+ if (!for_part && (mode & FMODE_WRITE)) { + spin_lock(&bdev_lock); + bdev->bd_write_openers--; + if (bdev->bd_contains != bdev) + bdev->bd_contains->bd_part_write_openers--; + spin_unlock(&bdev_lock); + } + if (!--bdev->bd_openers) { WARN_ON_ONCE(bdev->bd_holders); sync_blockdev(bdev); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 11b9505b14c6..5410050d5017 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -48,6 +48,8 @@ struct block_device { int bd_fsfreeze_count; /* Mutex for freeze */ struct mutex bd_fsfreeze_mutex; + int bd_write_openers; + int bd_part_write_openers; KABI_RESERVE(1) KABI_RESERVE(2) KABI_RESERVE(3)
From: Li Lingfeng lilingfeng3@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53Q6M CVE: NA
---------------------------
Just like opening a write-opened block device exclusively, opening an exclusively-opened block device for write may also lead to data corruption. This patch adds an info message when a block device that is already open exclusively is opened for write, to hint at the potential data corruption.
Note that there are legitimate cases such as file system or device mapper online resize, so this message is only a hint and does not always mean that a risky write is happening.
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/block_dev.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 915d3b5bdee7..c8a3c93cc256 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1651,6 +1651,18 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, void *holder, "block device exclusively"); bd_finish_claiming(bdev, claiming, holder); spin_unlock(&bdev_lock); + } else if (!for_part && (mode & FMODE_WRITE)) { + spin_lock(&bdev_lock); + /* + * Open an exclusive opened device for write may + * probability corrupt the device, such as a + * mounted file system, give a hint here. + */ + if (bdev->bd_holders || + (whole && (whole->bd_holder != NULL) && (whole->bd_holder != bd_may_claim))) + blkdev_dump_conflict_opener(bdev, "VFS: Open an exclusive opened " + "block device for write"); + spin_unlock(&bdev_lock); }
/*
From: Li Lingfeng lilingfeng3@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53Q6M CVE: NA Reference:https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/commit/...
---------------------------
We don't really need the field names to be globally unique; it is enough for them to be unique within the given struct. Since structs do not generally span multiple files, using the line number is enough to ensure a unique identifier. It means that we can't use two KABI_RENAME macros on the same line, but that's not happening anyway.
This allows pahole to deduplicate the type info of structs using KABI macros, lowering the size of vmlinuz from 26M to 8.5M.
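As an illustrative sketch, assuming (as in this header) that the KABI_USE()/KABI_REPLACE() family funnels into _KABI_REPLACE(), and using a made-up line number 48 for the macro use:

    /*
     * Before: the struct that hides the replaced reserved slot was named via
     * __UNIQUE_ID(kabi_hide), which embeds __COUNTER__ and therefore differs
     * between translation units, so pahole saw many distinct copies of the
     * same struct.
     * After: the same use expands to a member named kabi_hidden_48 (from
     * __LINE__), identical in every translation unit, letting pahole
     * deduplicate the type information.
     */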
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/kabi.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/kabi.h b/include/linux/kabi.h index a52d9fa72cfa..fe3213c0f576 100644 --- a/include/linux/kabi.h +++ b/include/linux/kabi.h @@ -393,6 +393,8 @@ # define __KABI_CHECK_SIZE(_item, _size) #endif
+#define KABI_UNIQUE_ID __PASTE(kabi_hidden_, __LINE__) + # define _KABI_DEPRECATE(_type, _orig) _type kabi_reserved_##_orig # define _KABI_DEPRECATE_FN(_type, _orig, _args...) \ _type (* kabi_reserved_##_orig)(_args) @@ -402,7 +404,7 @@ _new; \ struct { \ _orig; \ - } __UNIQUE_ID(kabi_hide); \ + } KABI_UNIQUE_ID; \ __KABI_CHECK_SIZE_ALIGN(_orig, _new); \ } #else
From: Li Lingfeng lilingfeng3@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I53Q6M CVE: NA
---------------------------
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/blk_types.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5410050d5017..bbb62ff84601 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -48,9 +48,11 @@ struct block_device { int bd_fsfreeze_count; /* Mutex for freeze */ struct mutex bd_fsfreeze_mutex; - int bd_write_openers; - int bd_part_write_openers; +#ifndef __GENKSYMS__ + KABI_USE2(1, int bd_write_openers, int bd_part_write_openers); +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4)
From: Guoqing Jiang jiangguoqing@kylinos.cn
mainline inclusion from mainline-v5.15-rc1 commit 6607cd319b6b91bff94e90f798a61c031650b514 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I566T6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
We can't split a write behind bio with more than BIO_MAX_VECS sectors, otherwise the call trace below is triggered because we could allocate an oversized write behind bio later.
[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae
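For scale, the cap introduced in the hunk below works out as follows, assuming the usual 4 KiB page size and BIO_MAX_PAGES == 256 on this kernel:

    BIO_MAX_PAGES * (PAGE_SIZE >> 9) = 256 * 8 = 2048 sectors = 1 MiB

so a write-behind bio is never asked to cover more payload than alloc_behind_master_bio() can fit into a single bio.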
Thanks for the comment from Christoph.
[1]. https://bugs.archlinux.org/task/70992
Cc: stable@vger.kernel.org # v5.12+ Reported-by: Jens Stutte jens@chianterastutte.eu Tested-by: Jens Stutte jens@chianterastutte.eu Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Song Liu songliubraving@fb.com Signed-off-by: Luo Meng luomeng12@huawei.com
Conflicts: drivers/md/raid1.c Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/raid1.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index da6772f49f07..5e17699cf47c 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1326,6 +1326,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, struct raid1_plug_cb *plug = NULL; int first_clone; int max_sectors; + bool write_behind = false;
if (mddev_is_clustered(mddev) && md_cluster_ops->area_resyncing(mddev, WRITE, @@ -1378,6 +1379,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, max_sectors = r1_bio->sectors; for (i = 0; i < disks; i++) { struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + + /* + * The write-behind io is only attempted on drives marked as + * write-mostly, which means we could allocate write behind + * bio later. + */ + if (rdev && test_bit(WriteMostly, &rdev->flags)) + write_behind = true; + if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) { atomic_inc(&rdev->nr_pending); blocked_rdev = rdev; @@ -1452,6 +1462,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, goto retry_write; }
+ /* + * When using a bitmap, we may call alloc_behind_master_bio below. + * alloc_behind_master_bio allocates a copy of the data payload a page + * at a time and thus needs a new bio that can fit the whole payload + * this bio in page sized chunks. + */ + if (write_behind && bitmap) + max_sectors = min_t(int, max_sectors, + BIO_MAX_PAGES * (PAGE_SIZE >> 9)); if (max_sectors < bio_sectors(bio)) { struct bio *split = bio_split(bio, max_sectors, GFP_NOIO, &conf->bio_split);
From: Guoqing Jiang guoqing.jiang@linux.dev
mainline inclusion from mainline-v5.16-rc1 commit fd3b6975e9c11c4fa00965f82a0bfbb3b7b44101 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I566T6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Commit 6607cd319b6b91bff94e90f798a61c031650b514 ("raid1: ensure write behind bio has less than BIO_MAX_VECS sectors") tried to guarantee the size of behind bio is not bigger than BIO_MAX_VECS sectors.
Unfortunately the same call trace could still happen since an array can enable write-behind without any write-mostly device.
To match the mdadm manpage (which says "write-behind is only attempted on drives marked as write-mostly"), we need to check the WriteMostly flag to avoid such unexpected behavior.
[1]. https://bugzilla.kernel.org/show_bug.cgi?id=213181#c25
Cc: stable@vger.kernel.org # v5.12+ Cc: Jens Stutte jens@chianterastutte.eu Reported-by: Jens Stutte jens@chianterastutte.eu Signed-off-by: Guoqing Jiang guoqing.jiang@linux.dev Signed-off-by: Song Liu songliubraving@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/raid1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 5e17699cf47c..7c7f03f07f03 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1492,7 +1492,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, if (!r1_bio->bios[i]) continue;
- if (first_clone) { + if (first_clone && test_bit(WriteMostly, &rdev->flags)) { /* do behind I/O ? * Not if there are too many, or cannot * allocate memory, or a reader on WriteMostly
From: Song Liu song@kernel.org
mainline inclusion from mainline-v5.16 commit 46669e8616c649c71c4cfcd712fd3d107e771380 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I566T6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Commit [1] causes missing bitmap updates when there aren't any WriteMostly devices.
Detailed steps to reproduce by Norbert (which somehow didn't make it to lore):
# setup md10 (raid1) with two drives (1 GByte sparse files)
dd if=/dev/zero of=disk1 bs=1024k seek=1024 count=0
dd if=/dev/zero of=disk2 bs=1024k seek=1024 count=0

losetup /dev/loop11 disk1
losetup /dev/loop12 disk2

mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/loop11 /dev/loop12

# add bitmap (aka write-intent log)
mdadm /dev/md10 --grow --bitmap=internal

echo check > /sys/block/md10/md/sync_action

root:# cat /sys/block/md10/md/mismatch_cnt
0
root:#

# remove member drive disk2 (loop12)
mdadm /dev/md10 -f loop12 ; mdadm /dev/md10 -r loop12

# modify degraded md device
dd if=/dev/urandom of=/dev/md10 bs=512 count=1

# no blocks recorded as out of sync on the remaining member disk1/loop11
root:# mdadm -X /dev/loop11 | grep Bitmap
Bitmap : 16 bits (chunks), 0 dirty (0.0%)
root:#

# re-add disk2, nothing synced because of empty bitmap
mdadm /dev/md10 --re-add /dev/loop12

# check integrity again
echo check > /sys/block/md10/md/sync_action

# disk1 and disk2 are no longer in sync, reads return different data
root:# cat /sys/block/md10/md/mismatch_cnt
128
root:#

# clean up
mdadm -S /dev/md10
losetup -d /dev/loop11
losetup -d /dev/loop12
rm disk1 disk2
Fix this by moving the WriteMostly check to the if condition for alloc_behind_master_bio().
[1] commit fd3b6975e9c1 ("md/raid1: only allocate write behind bio for WriteMostly device") Fixes: fd3b6975e9c1 ("md/raid1: only allocate write behind bio for WriteMostly device") Cc: stable@vger.kernel.org # v5.12+ Cc: Guoqing Jiang guoqing.jiang@linux.dev Cc: Jens Axboe axboe@kernel.dk Reported-by: Norbert Warmuth nwarmuth@t-online.de Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Song Liu song@kernel.org Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/raid1.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 7c7f03f07f03..9fccbf916015 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1492,12 +1492,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, if (!r1_bio->bios[i]) continue;
- if (first_clone && test_bit(WriteMostly, &rdev->flags)) { + if (first_clone) { /* do behind I/O ? * Not if there are too many, or cannot * allocate memory, or a reader on WriteMostly * is waiting for behind writes to flush */ if (bitmap && + test_bit(WriteMostly, &rdev->flags) && (atomic_read(&bitmap->behind_writes) < mddev->bitmap_info.max_write_behind) && !waitqueue_active(&bitmap->behind_wait)) {
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I56R9J CVE: N/A
-------------------------------------------------
The current mbigen driver uses module_platform_driver() to register its init function, but the pl011 driver uses arch_initcall(). So the pl011 driver initializes earlier than the mbigen driver and fails to get its irq. This happens on Hi1616.
Fix this problem by using arch_initcall() in the mbigen driver.
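For reference, the built-in initcall ordering from include/linux/init.h (levels in parentheses) is:

    core(1) -> postcore(2) -> arch(3) -> subsys(4) -> fs(5) -> device(6) -> late(7)

module_platform_driver() resolves to device_initcall() (level 6) for a built-in driver, so moving mbigen to arch_initcall() (level 3) makes it register no later than the pl011 driver.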
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yi Yang yiyang13@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/irqchip/irq-mbigen.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-mbigen.c b/drivers/irqchip/irq-mbigen.c index 8729b8a6b54d..fc05e23938cd 100644 --- a/drivers/irqchip/irq-mbigen.c +++ b/drivers/irqchip/irq-mbigen.c @@ -402,7 +402,18 @@ static struct platform_driver mbigen_platform_driver = { .probe = mbigen_device_probe, };
-module_platform_driver(mbigen_platform_driver); +static int __init mbigen_init(void) +{ + return platform_driver_register(&mbigen_platform_driver); +} + +static void __exit mbigen_exit(void) +{ + platform_driver_unregister(&mbigen_platform_driver); +} + +arch_initcall(mbigen_init); +module_exit(mbigen_exit);
MODULE_AUTHOR("Jun Ma majun258@huawei.com"); MODULE_AUTHOR("Yun Wu wuyun.wu@huawei.com");