Backport bugfix and enhancement patches for mm/fs/livepatch/sched.
Al Viro (2):
  switch file_open_root() to struct path
  take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space

Chen Jun (1):
  mm: Fix the uninitialized use in overcommit_policy_handler

Guoqing Jiang (1):
  md: revert io stats accounting

Kefeng Wang (1):
  once: Fix panic when module unload

Leah Rumancik (1):
  ext4: wipe ext4_dir_entry2 upon file deletion

Li Hua (2):
  sched/idle: Optimize the loop time algorithm to reduce multicore disturb
  sched/idle: Reported an error when an illegal negative value is passed

Vasily Averin (7):
  memcg: enable accounting for pids in nested pid namespaces
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for ldt_struct objects

Vignesh Raghavendra (1):
  serial: 8250: 8250_omap: Fix possible array out of bounds access

Yang Jihong (1):
  perf annotate: Add itrace options support

Yang Yang (1):
  kyber: introduce kyber_depth_updated()

Ye Bin (1):
  ext4: fix potential uninitialized access to retval in kmmpd

Ye Weihua (9):
  livepatch: Add state describe for force
  livepatch: checks only if the replaced instruction is on the stack
  livepatch/arm64: only check stack top
  livepatch/arm: only check stack top
  livepatch/ppc32: only check stack top
  livepatch/ppc64: only check stack top
  livepatch/x86: only check stack top
  livepatch: move arch_klp_mem_recycle after the return value judgment
  livepatch: Fix compile warnning

Yu Jiahua (1):
  sched: Aware multi-core system for optimize loadtracking

Yu Kuai (2):
  blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
  blk-mq: fix divide by zero crash in tg_may_dispatch()

Yutian Yang (1):
  memcg: charge fs_context and legacy_fs_context

Zhang Yi (5):
  ext4: move inode eio simulation behind io completeion
  ext4: make the updating inode data procedure atomic
  ext4: factor out ext4_fill_raw_inode()
  ext4: move ext4_fill_raw_inode() related functions
  ext4: prevent getting empty inode buffer

Zheng Zucheng (1):
  sysctl: Refactor IAS framework
 Documentation/filesystems/path-lookup.rst  |   6 +-
 Documentation/filesystems/porting.rst      |   9 +
 arch/arm/kernel/livepatch.c                | 221 +++++++++++--
 arch/arm64/kernel/livepatch.c              | 209 +++++++++++--
 arch/powerpc/kernel/livepatch_32.c         | 209 +++++++++++--
 arch/powerpc/kernel/livepatch_64.c         | 227 ++++++++++----
 arch/um/drivers/mconsole_kern.c            |   2 +-
 arch/x86/kernel/ldt.c                      |   6 +-
 arch/x86/kernel/livepatch.c                | 347 +++++++++++++++------
 block/blk-mq.c                             |   6 +-
 block/blk-sysfs.c                          |   7 +
 block/blk-throttle.c                       |  37 ++-
 block/kyber-iosched.c                      |  29 +-
 drivers/md/md.c                            |  45 ---
 drivers/md/md.h                            |   1 -
 drivers/tty/serial/8250/8250_omap.c        |   1 +
 fs/coredump.c                              |   4 +-
 fs/ext4/inode.c                            | 332 +++++++++++---------
 fs/ext4/mmp.c                              |   2 +-
 fs/ext4/namei.c                            |  24 +-
 fs/fcntl.c                                 |   3 +-
 fs/fhandle.c                               |   2 +-
 fs/fs_context.c                            |   4 +-
 fs/internal.h                              |   2 +-
 fs/kernel_read_file.c                      |   2 +-
 fs/namei.c                                 |  60 ++--
 fs/namespace.c                             |   7 +-
 fs/nfs/nfstrace.h                          |   4 -
 fs/open.c                                  |   4 +-
 fs/proc/proc_sysctl.c                      |   2 +-
 include/linux/blkdev.h                     |   1 +
 include/linux/fs.h                         |   9 +-
 include/linux/kernel.h                     |   4 +-
 include/linux/livepatch.h                  |   4 +
 include/linux/namei.h                      |   3 -
 include/linux/once.h                       |   4 +-
 include/linux/sched/sysctl.h               |   8 +-
 init/Kconfig                               |  36 ++-
 ipc/namespace.c                            |   2 +-
 kernel/cgroup/namespace.c                  |   2 +-
 kernel/livepatch/core.c                    |   2 +-
 kernel/nsproxy.c                           |   2 +-
 kernel/pid_namespace.c                     |   5 +-
 kernel/sched/fair.c                        |  86 ++---
 kernel/sched/idle.c                        |  48 ++-
 kernel/signal.c                            |   2 +-
 kernel/sysctl.c                            |  84 ++---
 kernel/time/namespace.c                    |   4 +-
 kernel/time/posix-timers.c                 |   4 +-
 kernel/user_namespace.c                    |   2 +-
 kernel/usermode_driver.c                   |   2 +-
 lib/once.c                                 |  11 +-
 mm/util.c                                  |   4 +-
 security/integrity/ima/ima_digest_list.c   |   2 +-
 tools/perf/Documentation/perf-annotate.txt |   7 +
 tools/perf/builtin-annotate.c              |  11 +
 56 files changed, 1494 insertions(+), 669 deletions(-)
From: Zheng Zucheng <zhengzucheng@huawei.com>

hulk inclusion
category: feature
bugzilla: 177206 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Refactor the intelligent aware scheduler (IAS) framework: rename
CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING to CONFIG_IAS_SMART_LOAD_TRACKING and
CONFIG_IAS_SMART_HALT_POLL to CONFIG_IAS_SMART_IDLE, group both options
under a new "Intelligent aware scheduler" Kconfig menu, and move the
related sysctls into a dedicated kernel/ias/ sysctl directory.
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 include/linux/kernel.h       |  2 +-
 include/linux/sched/sysctl.h |  2 +-
 init/Kconfig                 | 36 +++++++++-------
 kernel/sched/fair.c          | 14 +++---
 kernel/sched/idle.c          | 12 +++---
 kernel/sysctl.c              | 82 ++++++++++++++++++++----------------
 6 files changed, 81 insertions(+), 67 deletions(-)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h index b8cce99fd8eb..eb88683890c9 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -555,7 +555,7 @@ extern int sysctl_panic_on_rcu_stall; extern int sysctl_panic_on_stackoverflow;
extern bool crash_kexec_post_notifiers; -#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE extern unsigned long poll_threshold_ns; #endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index b9ad7db37614..378bcb58c509 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -98,7 +98,7 @@ int sched_energy_aware_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos); #endif
-#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING extern int sysctl_blocked_averages(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); extern int sysctl_tick_update_load(struct ctl_table *table, int write, diff --git a/init/Kconfig b/init/Kconfig index 29682cbb327c..04bc46ca0b9e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -774,22 +774,6 @@ config GENERIC_SCHED_CLOCK
menu "Scheduler features"
-config SCHED_OPTIMIZE_LOAD_TRACKING - bool "Optimize scheduler load tracking" - default n - help - Optimize scheduler load tracking, when load balance is not important - in system, we close some load tracking in tick and enqueue or dequeue - task, in this way, we can save some unnecessary cpu overhead. - -config IAS_SMART_HALT_POLL - bool "Enable smart halt poll" - default n - help - Before entering the real idle, polling for a while. if the current - task is set TIF_NEED_RESCHED during the polling process, it will - immediately break from the polling loop. - config UCLAMP_TASK bool "Enable utilization clamping for RT/FAIR tasks" depends on CPU_FREQ_GOV_SCHEDUTIL @@ -839,6 +823,26 @@ config UCLAMP_BUCKETS_COUNT
If in doubt, use the default value.
+menu "Intelligent aware scheduler" + +config IAS_SMART_IDLE + bool "Enable smart idle" + default n + help + Before entering the real idle, polling for a while. if the current + task is set TIF_NEED_RESCHED during the polling process, it will + immediately break from the polling loop. + +config IAS_SMART_LOAD_TRACKING + bool "Enable smart load tracking" + default n + help + Optimize scheduler load tracking, when load balance is not important + in system, we close some load tracking in tick and enqueue or dequeue + task, in this way, we can save some unnecessary cpu overhead. + +endmenu + endmenu
# diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 32fea109e604..1417af3dd427 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -38,7 +38,7 @@ unsigned int sysctl_sched_latency = 6000000ULL; static unsigned int normalized_sysctl_sched_latency = 6000000ULL;
-#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING #define LANTENCY_MIN 10 #define LANTENCY_MAX 30 unsigned int sysctl_load_tracking_latency = LANTENCY_MIN; @@ -3837,7 +3837,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s { u64 now = cfs_rq_clock_pelt(cfs_rq); int decayed; -#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING u64 delta; #endif
@@ -3845,7 +3845,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s * Track task load average for carrying it to new CPU after migrated, and * track group sched_entity load average for task_h_load calc in migration */ -#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING delta = now - se->avg.last_update_time; delta >>= sysctl_load_tracking_latency;
@@ -4601,7 +4601,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) cfs_rq->curr = NULL; }
-#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING DEFINE_STATIC_KEY_TRUE(sched_tick_update_load); static void set_tick_update_load(bool enabled) { @@ -4644,7 +4644,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) /* * Ensure that runnable average is periodically updated. */ -#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING if (static_branch_likely(&sched_tick_update_load)) { update_load_avg(cfs_rq, curr, UPDATE_TG); update_cfs_group(curr); @@ -8090,7 +8090,7 @@ static void attach_tasks(struct lb_env *env) rq_unlock(env->dst_rq, &rf); }
-#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING DEFINE_STATIC_KEY_TRUE(sched_blocked_averages);
static void set_blocked_averages(bool enabled) @@ -8326,7 +8326,7 @@ static void update_blocked_averages(int cpu) rq_lock_irqsave(rq, &rf); update_rq_clock(rq);
-#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING if (!static_branch_likely(&sched_blocked_averages)) { rq_unlock_irqrestore(rq, &rf); return; diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 4f7b0ee06144..a503e7d4c170 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -13,7 +13,7 @@ /* Linker adds these: start and end of __cpuidle functions */ extern char __cpuidle_text_start[], __cpuidle_text_end[];
-#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE /* * Poll_threshold_ns indicates the maximum polling time before * entering real idle. @@ -60,7 +60,7 @@ static int __init cpu_idle_nopoll_setup(char *__unused) __setup("hlt", cpu_idle_nopoll_setup); #endif
-#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE static void smart_idle_poll(void) { unsigned long poll_duration = poll_threshold_ns; @@ -86,7 +86,7 @@ static noinline int __cpuidle cpu_idle_poll(void) stop_critical_timings(); rcu_idle_enter(); local_irq_enable(); -#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE smart_idle_poll(); #endif
@@ -292,7 +292,7 @@ static void cpuidle_idle_call(void) static void do_idle(void) { int cpu = smp_processor_id(); -#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE unsigned long idle_poll_flag = poll_threshold_ns; #endif /* @@ -327,7 +327,7 @@ static void do_idle(void) * broadcast device expired for us, we don't want to go deep * idle as we know that the IPI is going to arrive right away. */ -#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE if (cpu_idle_force_poll || tick_check_broadcast_expired() || idle_poll_flag) { #else @@ -335,7 +335,7 @@ static void do_idle(void) #endif tick_nohz_idle_restart_tick(); cpu_idle_poll(); -#ifdef CONFIG_IAS_SMART_HALT_POLL +#ifdef CONFIG_IAS_SMART_IDLE idle_poll_flag = 0; #endif } else { diff --git a/kernel/sysctl.c b/kernel/sysctl.c index a573817a6fe0..c8d3a20007c6 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1650,6 +1650,46 @@ int proc_do_static_key(struct ctl_table *table, int write, mutex_unlock(&static_key_mutex); return ret; } +static struct ctl_table ias_table[] = { +#ifdef CONFIG_IAS_SMART_IDLE + { + .procname = "smart_idle_threshold", + .data = &poll_threshold_ns, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = proc_doulongvec_minmax, + }, +#endif + +#ifdef CONFIG_IAS_SMART_LOAD_TRACKING + { + .procname = "sched_blocked_averages", + .data = NULL, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sysctl_blocked_averages, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, + { + .procname = "sched_tick_update_load", + .data = NULL, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sysctl_tick_update_load, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, + { + .procname = "sched_load_tracking_latency", + .data = NULL, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sysctl_update_load_latency, + }, +#endif + { } +};
static struct ctl_table kern_table[] = { { @@ -1764,33 +1804,6 @@ static struct ctl_table kern_table[] = { }, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_SCHED_DEBUG */ -#ifdef CONFIG_SCHED_OPTIMIZE_LOAD_TRACKING - { - .procname = "sched_blocked_averages", - .data = NULL, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = sysctl_blocked_averages, - .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, - }, - { - .procname = "sched_tick_update_load", - .data = NULL, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = sysctl_tick_update_load, - .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, - }, - { - .procname = "sched_load_tracking_latency", - .data = NULL, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = sysctl_update_load_latency, - }, -#endif { .procname = "sched_rt_period_us", .data = &sysctl_sched_rt_period, @@ -1849,15 +1862,7 @@ static struct ctl_table kern_table[] = { .proc_handler = sysctl_sched_uclamp_handler, }, #endif -#ifdef CONFIG_IAS_SMART_HALT_POLL - { - .procname = "halt_poll_threshold", - .data = &poll_threshold_ns, - .maxlen = sizeof(unsigned long), - .mode = 0644, - .proc_handler = proc_doulongvec_minmax, - }, -#endif + #ifdef CONFIG_SCHED_AUTOGROUP { .procname = "sched_autogroup_enabled", @@ -2697,6 +2702,11 @@ static struct ctl_table kern_table[] = { .extra2 = SYSCTL_ONE, }, #endif + { + .procname = "ias", + .mode = 0555, + .child = ias_table, + }, { } };
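With ias_table chained into kern_table through the .child pointer, the IAS
knobs show up under a dedicated directory (the paths below follow directly
from the table names; which entries exist depends on the two Kconfig
options):

    /proc/sys/kernel/ias/smart_idle_threshold        (CONFIG_IAS_SMART_IDLE)
    /proc/sys/kernel/ias/sched_blocked_averages      (CONFIG_IAS_SMART_LOAD_TRACKING)
    /proc/sys/kernel/ias/sched_tick_update_load      (CONFIG_IAS_SMART_LOAD_TRACKING)
    /proc/sys/kernel/ias/sched_load_tracking_latency (CONFIG_IAS_SMART_LOAD_TRACKING)

In particular, the old kernel.halt_poll_threshold knob is removed; its
backing variable (poll_threshold_ns) is now exposed as
kernel.ias.smart_idle_threshold.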
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-5.14-rc6
commit 454bb6775202d94f0f489c4632efecdb62d3c904
category: bugfix
bugzilla: 175277 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
We ran a test that deletes and recovers devices frequently (two devices on
the same host), and found that 'active_queues' becomes abnormally large
after a period of time.

If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there is
only one queue that is using the tag set. However, if b is still active,
the active_queues of b might never be cleared even if b is deleted.
Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.
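The ordering matters because idling is gated on the flag itself.
Paraphrasing the helper from block/blk-mq-tag.h in this tree (shown for
context only, not part of the change):

    static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
    {
            /*
             * Once BLK_MQ_F_TAG_QUEUE_SHARED is cleared, this becomes a
             * no-op, so 'active_queues' can never be decremented again.
             */
            if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
                    return;
            __blk_mq_tag_idle(hctx);
    }

Hence the fix below calls blk_mq_tag_idle() while the flag is still set.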
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 block/blk-mq.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 501599124fc4..2840cc4897c9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2930,10 +2930,12 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	int i;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		if (shared)
+		if (shared) {
 			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
-		else
+		} else {
+			blk_mq_tag_idle(hctx);
 			hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
+		}
 	}
 }
From: Ye Weihua <yeweihua4@huawei.com>

hulk inclusion
category: feature
bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
The force field is divided into three states. KLP_NORMAL_FORCE indicates
that a hot patch is installed according to the default rule. KLP_ENFORCEMENT
indicates that the hot patch of the function must be installed, so the
stack check is skipped for it. KLP_STACK_OPTIMIZE is reserved for the stack
optimization policy introduced by later patches.
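As an illustration, a patch module might tag its functions like this (a
minimal sketch; the demo_* names are hypothetical replacement functions,
and only the fields relevant here are shown):

    static struct klp_func demo_funcs[] = {
            {
                    .old_name = "demo_default",
                    .new_func = demo_default_fix,
                    /* .force defaults to KLP_NORMAL_FORCE (0) */
            },
            {
                    .old_name = "demo_hot_path",
                    .new_func = demo_hot_path_fix,
                    .force    = KLP_STACK_OPTIMIZE, /* opt in to relaxed stack checks */
            },
            {
                    .old_name = "demo_must_patch",
                    .new_func = demo_must_patch_fix,
                    .force    = KLP_ENFORCEMENT,    /* never blocked by the stack check */
            },
            { }
    };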
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/arm/kernel/livepatch.c        | 2 +-
 arch/arm64/kernel/livepatch.c      | 2 +-
 arch/powerpc/kernel/livepatch_32.c | 2 +-
 arch/powerpc/kernel/livepatch_64.c | 2 +-
 arch/x86/kernel/livepatch.c        | 2 +-
 include/linux/livepatch.h          | 4 ++++
 6 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index b5fcaf3c4ca7..1dc074e2a0d4 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -99,7 +99,7 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { if (args->enable) { - if (func->force) + if (func->force == KLP_ENFORCEMENT) continue; /* * When enable, checking the currently diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index a1cd8ee026d7..10b7d9b99f62 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -104,7 +104,7 @@ static bool klp_check_activeness_func(void *data, unsigned long pc) for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { if (args->enable) { - if (func->force) + if (func->force == KLP_ENFORCEMENT) continue; /* * When enable, checking the currently diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index 35d1885796d4..1d41b8939799 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -93,7 +93,7 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { if (args->enable) { - if (func->force) + if (func->force == KLP_ENFORCEMENT) continue; /* * When enable, checking the currently diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index 6285635e63fd..55cbb65ca708 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -132,7 +132,7 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data)
/* Check func address in stack */ if (args->enable) { - if (func->force) + if (func->force == KLP_ENFORCEMENT) continue; /* * When enable, checking the currently diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c index bcfda2490916..52bc0fc2bd6b 100644 --- a/arch/x86/kernel/livepatch.c +++ b/arch/x86/kernel/livepatch.c @@ -90,7 +90,7 @@ static int klp_check_stack_func(struct klp_func *func, #endif
if (enable) { - if (func->force) + if (func->force == KLP_ENFORCEMENT) continue; /* * When enable, checking the currently active diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index ce32a8da1517..5b4a0864a011 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -23,6 +23,10 @@ #define KLP_UNPATCHED 0 #define KLP_PATCHED 1
+#define KLP_NORMAL_FORCE 0 +#define KLP_ENFORCEMENT 1 +#define KLP_STACK_OPTIMIZE 2 + /** * struct klp_func - function structure for live patching * @old_name: name of the function to be patched
From: Ye Weihua <yeweihua4@huawei.com>

hulk inclusion
category: feature
bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
When the CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY option is enabled, the
system checks, under stop_machine, whether each function to be patched is
on any task's stack. If a function is on a stack, it cannot be patched and
-EBUSY is returned.

Hot functions are likely to be on some stack while stop_machine runs. As a
result, the livepatch success rate is low when the patch includes a hot
function.

For a replaced function, only the first several instructions are rewritten;
the remaining instructions are identical to the original ones. Therefore,
if the force flag is KLP_STACK_OPTIMIZE, it is sufficient to check whether
only the replaced instructions are on the stack.
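The effect can be modeled in plain userspace C (a standalone sketch
mirroring klp_size_to_check()/klp_compare_address() from the diff below;
the constants assume arm64 with module PLTs, where LJMP_INSN_SIZE = 4
instructions of 4 bytes each are rewritten, and the addresses are made up):

    #include <stdbool.h>
    #include <stdio.h>

    #define KLP_STACK_OPTIMIZE 2
    #define MAX_SIZE_TO_CHECK  (4 * 4)   /* bytes actually rewritten */

    static unsigned long size_to_check(unsigned long func_size, int force)
    {
            if (force == KLP_STACK_OPTIMIZE && func_size > MAX_SIZE_TO_CHECK)
                    return MAX_SIZE_TO_CHECK;
            return func_size;
    }

    static bool pc_is_busy(unsigned long pc, unsigned long func_addr,
                           unsigned long func_size, int force)
    {
            return pc >= func_addr &&
                   pc < func_addr + size_to_check(func_size, force);
    }

    int main(void)
    {
            unsigned long func = 0x10000;   /* hypothetical address */

            /* A PC 0x40 into a 0x200-byte function: busy under the old
             * full-size check, patchable under KLP_STACK_OPTIMIZE. */
            printf("old: %d, optimized: %d\n",
                   pc_is_busy(func + 0x40, func, 0x200, 0),
                   pc_is_busy(func + 0x40, func, 0x200, KLP_STACK_OPTIMIZE));
            return 0;
    }

This prints "old: 1, optimized: 0": the same PC that used to block the
patch no longer does.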
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/arm/kernel/livepatch.c        | 29 ++++++++++++++++------
 arch/arm64/kernel/livepatch.c      | 27 ++++++++++++++++-----
 arch/powerpc/kernel/livepatch_32.c | 19 ++++++++++++---
 arch/powerpc/kernel/livepatch_64.c | 39 ++++++++++++++++++++----------
 arch/x86/kernel/livepatch.c        | 20 +++++++++++----
 5 files changed, 98 insertions(+), 36 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index 1dc074e2a0d4..ce981b48fedb 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -31,16 +31,19 @@ #include <asm/insn.h> #include <asm/patch.h>
-#ifdef CONFIG_ARM_MODULE_PLTS -#define LJMP_INSN_SIZE 3 -#endif - #ifdef ARM_INSN_SIZE #error "ARM_INSN_SIZE have been redefined, please check" #else #define ARM_INSN_SIZE 4 #endif
+#ifdef CONFIG_ARM_MODULE_PLTS +#define LJMP_INSN_SIZE 3 +#define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * ARM_INSN_SIZE) +#else +#define MAX_SIZE_TO_CHECK ARM_INSN_SIZE +#endif + struct klp_func_node { struct list_head node; struct list_head func_stack; @@ -73,10 +76,20 @@ struct walk_stackframe_args { int ret; };
+static inline unsigned long klp_size_to_check(unsigned long func_size, + int force) +{ + unsigned long size = func_size; + + if (force == KLP_STACK_OPTIMIZE && size > MAX_SIZE_TO_CHECK) + size = MAX_SIZE_TO_CHECK; + return size; +} + static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, - unsigned long func_size, const char *func_name) + const char *func_name, unsigned long check_size) { - if (pc >= func_addr && pc < func_addr + func_size) { + if (pc >= func_addr && pc < func_addr + check_size) { pr_err("func %s is in use!\n", func_name); return -EBUSY; } @@ -136,8 +149,8 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_size = func->new_size; } func_name = func->old_name; - args->ret = klp_compare_address(frame->pc, func_addr, - func_size, func_name); + args->ret = klp_compare_address(frame->pc, func_addr, func_name, + klp_size_to_check(func_size, func->force)); if (args->ret) return args->ret; } diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 10b7d9b99f62..4c4ff0620c4c 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -34,7 +34,11 @@ #include <linux/sched/debug.h> #include <linux/kallsyms.h>
+#define LJMP_INSN_SIZE 4 + #ifdef CONFIG_ARM64_MODULE_PLTS +#define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32)) + static inline bool offset_in_range(unsigned long pc, unsigned long addr, long range) { @@ -42,9 +46,10 @@ static inline bool offset_in_range(unsigned long pc, unsigned long addr,
return (offset >= -range && offset < range); } -#endif
-#define LJMP_INSN_SIZE 4 +#else +#define MAX_SIZE_TO_CHECK sizeof(u32) +#endif
struct klp_func_node { struct list_head node; @@ -78,10 +83,20 @@ struct walk_stackframe_args { int ret; };
+static inline unsigned long klp_size_to_check(unsigned long func_size, + int force) +{ + unsigned long size = func_size; + + if (force == KLP_STACK_OPTIMIZE && size > MAX_SIZE_TO_CHECK) + size = MAX_SIZE_TO_CHECK; + return size; +} + static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, - unsigned long func_size, const char *func_name) + const char *func_name, unsigned long check_size) { - if (pc >= func_addr && pc < func_addr + func_size) { + if (pc >= func_addr && pc < func_addr + check_size) { pr_err("func %s is in use!\n", func_name); return -EBUSY; } @@ -137,8 +152,8 @@ static bool klp_check_activeness_func(void *data, unsigned long pc) func_size = func->new_size; } func_name = func->old_name; - args->ret = klp_compare_address(pc, func_addr, - func_size, func_name); + args->ret = klp_compare_address(pc, func_addr, func_name, + klp_size_to_check(func_size, func->force)); if (args->ret) return false; } diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index 1d41b8939799..db6dbe091281 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -32,6 +32,7 @@ #if defined (CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY) || \ defined (CONFIG_LIVEPATCH_WO_FTRACE) #define LJMP_INSN_SIZE 4 +#define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32))
struct klp_func_node { struct list_head node; @@ -67,10 +68,20 @@ struct walk_stackframe_args { int ret; };
+static inline unsigned long klp_size_to_check(unsigned long func_size, + int force) +{ + unsigned long size = func_size; + + if (force == KLP_STACK_OPTIMIZE && size > MAX_SIZE_TO_CHECK) + size = MAX_SIZE_TO_CHECK; + return size; +} + static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, - unsigned long func_size, const char *func_name) + const char *func_name, unsigned long check_size) { - if (pc >= func_addr && pc < func_addr + func_size) { + if (pc >= func_addr && pc < func_addr + check_size) { pr_err("func %s is in use!\n", func_name); return -EBUSY; } @@ -130,8 +141,8 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_size = func->new_size; } func_name = func->old_name; - args->ret = klp_compare_address(frame->pc, func_addr, - func_size, func_name); + args->ret = klp_compare_address(frame->pc, func_addr, func_name, + klp_size_to_check(func_size, func->force)); if (args->ret) return args->ret; } diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index 55cbb65ca708..f98f4ffc78f3 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -36,6 +36,8 @@
#if defined(CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY) || \ defined(CONFIG_LIVEPATCH_WO_FTRACE) +#define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32)) + struct klp_func_node { struct list_head node; struct list_head func_stack; @@ -76,12 +78,20 @@ struct walk_stackframe_args { int ret; };
-static inline int klp_compare_address(unsigned long pc, - unsigned long func_addr, - unsigned long func_size, - const char *func_name) +static inline unsigned long klp_size_to_check(unsigned long func_size, + int force) +{ + unsigned long size = func_size; + + if (force == KLP_STACK_OPTIMIZE && size > MAX_SIZE_TO_CHECK) + size = MAX_SIZE_TO_CHECK; + return size; +} + +static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, + const char *func_name, unsigned long check_size) { - if (pc >= func_addr && pc < func_addr + func_size) { + if (pc >= func_addr && pc < func_addr + check_size) { pr_err("func %s is in use!\n", func_name); return -EBUSY; } @@ -92,20 +102,21 @@ static inline int klp_check_activeness_func_addr( struct stackframe *frame, unsigned long func_addr, unsigned long func_size, - const char *func_name) + const char *func_name, + int force) { int ret;
/* Check PC first */ - ret = klp_compare_address(frame->pc, func_addr, - func_size, func_name); + ret = klp_compare_address(frame->pc, func_addr, func_name, + klp_size_to_check(func_size, force)); if (ret) return ret;
/* Check NIP when the exception stack switching */ if (frame->nip != 0) { - ret = klp_compare_address(frame->nip, func_addr, - func_size, func_name); + ret = klp_compare_address(frame->nip, func_addr, func_name, + klp_size_to_check(func_size, force)); if (ret) return ret; } @@ -171,7 +182,8 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) } func_name = func->old_name; args->ret = klp_check_activeness_func_addr(frame, - func_addr, func_size, func_name); + func_addr, func_size, func_name, + func->force); if (args->ret) return args->ret;
@@ -188,7 +200,7 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_addr = (unsigned long)func->old_func; func_size = func->old_size; args->ret = klp_check_activeness_func_addr(frame, - func_addr, func_size, "OLD_FUNC"); + func_addr, func_size, "OLD_FUNC", func->force); if (args->ret) return args->ret;
@@ -199,7 +211,8 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_addr = (unsigned long)&func_node->trampoline; func_size = sizeof(struct ppc64_klp_btramp_entry); args->ret = klp_check_activeness_func_addr(frame, - func_addr, func_size, "trampoline"); + func_addr, func_size, "trampoline", + func->force); if (args->ret) return args->ret; } diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c index 52bc0fc2bd6b..5be8b601f0c7 100644 --- a/arch/x86/kernel/livepatch.c +++ b/arch/x86/kernel/livepatch.c @@ -57,11 +57,21 @@ static struct klp_func_node *klp_find_func_node(void *old_func) #endif
#ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY +static inline unsigned long klp_size_to_check(unsigned long func_size, + int force) +{ + unsigned long size = func_size; + + if (force == KLP_STACK_OPTIMIZE && size > JMP_E9_INSN_SIZE) + size = JMP_E9_INSN_SIZE; + return size; +} + static inline int klp_compare_address(unsigned long stack_addr, - unsigned long func_addr, unsigned long func_size, - const char *func_name) + unsigned long func_addr, const char *func_name, + unsigned long check_size) { - if (stack_addr >= func_addr && stack_addr < func_addr + func_size) { + if (stack_addr >= func_addr && stack_addr < func_addr + check_size) { pr_err("func %s is in use!\n", func_name); return -EBUSY; } @@ -124,8 +134,8 @@ static int klp_check_stack_func(struct klp_func *func, } func_name = func->old_name;
- if (klp_compare_address(address, func_addr, - func_size, func_name)) + if (klp_compare_address(address, func_addr, func_name, + klp_size_to_check(func_size, func->force))) return -EAGAIN; }
From: Ye Weihua <yeweihua4@huawei.com>

hulk inclusion
category: feature
bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Based on the commit 'livepatch: checks only if the replaced instruction is
on the stack', the livepatch only needs to check the replaced instructions
during the stack check.

If the instructions to be replaced contain no jump (branch-and-link)
instruction, the replaced region never calls anything, so no saved return
address can point back into it; the PC can land in that region only in the
topmost frame. Thus, after confirming that the instructions to be replaced
contain no jump instruction, only the top of the stack, instead of the
entire stack, needs to be checked.

Each function in a livepatch has a force tag. When the value is
KLP_STACK_OPTIMIZE, checking only the top of the stack is enabled to speed
up the check.
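The resulting two-phase check can be modeled in plain userspace C (a
standalone sketch with made-up addresses; the real code builds the two
lists in klp_check_activeness_func() and walks genuine stack frames
instead of an array):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct func_range { unsigned long addr, size; };

    static bool list_hits_pc(const struct func_range *f, size_t n,
                             unsigned long pc)
    {
            for (size_t i = 0; i < n; i++)
                    if (pc >= f[i].addr && pc < f[i].addr + f[i].size)
                            return true;
            return false;
    }

    int main(void)
    {
            /* funcs whose rewritten bytes contain no jump insn */
            struct func_range nojump[] = { { 0x1000, 16 } };
            /* funcs that still need the full stack walked */
            struct func_range other[]  = { { 0x2000, 16 } };
            unsigned long stack[] = { 0x3008, 0x2004, 0x9000 }; /* PCs, top first */

            /* phase 1: no-jump funcs are compared against the top PC only */
            bool busy = list_hits_pc(nojump, 1, stack[0]);
            /* phase 2: remaining funcs require scanning every frame */
            for (size_t i = 0; i < 3 && !busy; i++)
                    busy = list_hits_pc(other, 1, stack[i]);

            printf("busy: %d\n", busy); /* 1: 0x2004 lies inside 'other' */
            return 0;
    }

When every function in a patch qualifies for the no-jump list, the
expensive stack walk is skipped entirely, which is the point of this
series.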
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Reviewed-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/arm64/kernel/livepatch.c | 183 +++++++++++++++++++++++++++++-----
 1 file changed, 160 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 4c4ff0620c4c..650f457ab656 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -38,6 +38,7 @@
#ifdef CONFIG_ARM64_MODULE_PLTS #define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32)) +#define CHECK_JUMP_RANGE LJMP_INSN_SIZE
static inline bool offset_in_range(unsigned long pc, unsigned long addr, long range) @@ -49,6 +50,7 @@ static inline bool offset_in_range(unsigned long pc, unsigned long addr,
#else #define MAX_SIZE_TO_CHECK sizeof(u32) +#define CHECK_JUMP_RANGE 1 #endif
struct klp_func_node { @@ -56,9 +58,9 @@ struct klp_func_node { struct list_head func_stack; unsigned long old_addr; #ifdef CONFIG_ARM64_MODULE_PLTS - u32 old_insns[LJMP_INSN_SIZE]; + u32 old_insns[LJMP_INSN_SIZE]; #else - u32 old_insn; + u32 old_insn; #endif };
@@ -77,9 +79,27 @@ static struct klp_func_node *klp_find_func_node(unsigned long old_addr) }
#ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY +/* + * The instruction set on arm64 is A64. + * The instruction of BLR is 1101011000111111000000xxxxx00000. + * The instruction of BL is 100101xxxxxxxxxxxxxxxxxxxxxxxxxx. + * The instruction of BLRAX is 1101011x0011111100001xxxxxxxxxxx. + */ +#define is_jump_insn(insn) (((le32_to_cpu(insn) & 0xfffffc1f) == 0xd63f0000) || \ + ((le32_to_cpu(insn) & 0xfc000000) == 0x94000000) || \ + ((le32_to_cpu(insn) & 0xfefff800) == 0xd63f0800)) + +struct klp_func_list { + struct klp_func_list *next; + unsigned long func_addr; + unsigned long func_size; + const char *func_name; + int force; +}; + struct walk_stackframe_args { - struct klp_patch *patch; int enable; + struct klp_func_list *other_funcs; int ret; };
@@ -103,22 +123,59 @@ static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, return 0; }
-static bool klp_check_activeness_func(void *data, unsigned long pc) +static bool check_jump_insn(unsigned long func_addr) { - struct walk_stackframe_args *args = data; - struct klp_patch *patch = args->patch; + unsigned long i; + u32 *insn = (u32*)func_addr; + + for (i = 0; i < CHECK_JUMP_RANGE; i++) { + if (is_jump_insn(*insn)) { + return true; + } + insn++; + } + return false; +} + +static int add_func_to_list(struct klp_func_list **funcs, struct klp_func_list **func, + unsigned long func_addr, unsigned long func_size, const char *func_name, + int force) +{ + if (*func == NULL) { + *funcs = (struct klp_func_list *)kzalloc(sizeof(**funcs), GFP_ATOMIC); + if (!(*funcs)) + return -ENOMEM; + *func = *funcs; + } else { + (*func)->next = (struct klp_func_list *)kzalloc(sizeof(**funcs), + GFP_ATOMIC); + if (!(*func)->next) + return -ENOMEM; + *func = (*func)->next; + } + (*func)->func_addr = func_addr; + (*func)->func_size = func_size; + (*func)->func_name = func_name; + (*func)->force = force; + (*func)->next = NULL; + return 0; +} + +static int klp_check_activeness_func(struct klp_patch *patch, int enable, + struct klp_func_list **nojump_funcs, + struct klp_func_list **other_funcs) +{ + int ret; struct klp_object *obj; struct klp_func *func; unsigned long func_addr, func_size; - const char *func_name; struct klp_func_node *func_node; - - if (args->ret) - return false; + struct klp_func_list *pnjump = NULL; + struct klp_func_list *pother = NULL;
for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { - if (args->enable) { + if (enable) { if (func->force == KLP_ENFORCEMENT) continue; /* @@ -143,34 +200,105 @@ static bool klp_check_activeness_func(void *data, unsigned long pc) func_addr = (unsigned long)prev->new_func; func_size = prev->new_size; } + if ((func->force == KLP_STACK_OPTIMIZE) && + !check_jump_insn(func_addr)) + ret = add_func_to_list(nojump_funcs, &pnjump, + func_addr, func_size, + func->old_name, func->force); + else + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, func->force); + if (ret) + return ret; } else { /* - * When disable, check for the function - * itself which to be unpatched. + * When disable, check for the previously + * patched function and the function itself + * which to be unpatched. */ + func_node = klp_find_func_node((unsigned long)func->old_func); + if (!func_node) { + return -EINVAL; + } + if (list_is_singular(&func_node->func_stack)) { + func_addr = (unsigned long)func->old_func; + func_size = func->old_size; + } else { + struct klp_func *prev; + + prev = list_first_or_null_rcu( + &func_node->func_stack, + struct klp_func, stack_node); + func_addr = (unsigned long)prev->new_func; + func_size = prev->new_size; + } + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, 0); + if (ret) + return ret; + func_addr = (unsigned long)func->new_func; func_size = func->new_size; + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, 0); + if (ret) + return ret; } - func_name = func->old_name; - args->ret = klp_compare_address(pc, func_addr, func_name, - klp_size_to_check(func_size, func->force)); - if (args->ret) - return false; } } + return 0; +}
+static bool check_func_list(struct klp_func_list *funcs, int *ret, unsigned long pc) +{ + while (funcs != NULL) { + *ret = klp_compare_address(pc, funcs->func_addr, funcs->func_name, + klp_size_to_check(funcs->func_size, funcs->force)); + if (*ret) { + return false; + } + funcs = funcs->next; + } return true; }
+static bool klp_check_jump_func(void *data, unsigned long pc) +{ + struct walk_stackframe_args *args = data; + struct klp_func_list *other_funcs = args->other_funcs; + + return check_func_list(other_funcs, &args->ret, pc); +} + +static void free_list(struct klp_func_list **funcs) +{ + struct klp_func_list *p; + + while (*funcs != NULL) { + p = *funcs; + *funcs = (*funcs)->next; + kfree(p); + } +} + int klp_check_calltrace(struct klp_patch *patch, int enable) { struct task_struct *g, *t; struct stackframe frame; int ret = 0; + struct klp_func_list *nojump_funcs = NULL; + struct klp_func_list *other_funcs = NULL; + + ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); + if (ret) + goto out;
struct walk_stackframe_args args = { - .patch = patch, .enable = enable, + .other_funcs = other_funcs, .ret = 0 };
@@ -201,17 +329,26 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) frame.fp = thread_saved_fp(t); frame.pc = thread_saved_pc(t); } - start_backtrace(&frame, frame.fp, frame.pc); - walk_stackframe(t, &frame, klp_check_activeness_func, &args); - if (args.ret) { - ret = args.ret; + if (!check_func_list(nojump_funcs, &ret, frame.pc)) { pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); show_stack(t, NULL, KERN_INFO); goto out; } + if (other_funcs != NULL) { + start_backtrace(&frame, frame.fp, frame.pc); + walk_stackframe(t, &frame, klp_check_jump_func, &args); + if (args.ret) { + ret = args.ret; + pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); + show_stack(t, NULL, KERN_INFO); + goto out; + } + } }
out: + free_list(&nojump_funcs); + free_list(&other_funcs); return ret; } #endif
From: Ye Weihua <yeweihua4@huawei.com>

hulk inclusion
category: feature
bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Enable the stack-check optimization on arm.
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/arm/kernel/livepatch.c | 193 +++++++++++++++++++++++++++++-----
 1 file changed, 169 insertions(+), 24 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index ce981b48fedb..f0bb09aa14b7 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -40,8 +40,11 @@ #ifdef CONFIG_ARM_MODULE_PLTS #define LJMP_INSN_SIZE 3 #define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * ARM_INSN_SIZE) +#define CHECK_JUMP_RANGE LJMP_INSN_SIZE + #else #define MAX_SIZE_TO_CHECK ARM_INSN_SIZE +#define CHECK_JUMP_RANGE 1 #endif
struct klp_func_node { @@ -49,9 +52,9 @@ struct klp_func_node { struct list_head func_stack; void *old_func; #ifdef CONFIG_ARM_MODULE_PLTS - u32 old_insns[LJMP_INSN_SIZE]; + u32 old_insns[LJMP_INSN_SIZE]; #else - u32 old_insn; + u32 old_insn; #endif };
@@ -70,9 +73,38 @@ static struct klp_func_node *klp_find_func_node(void *old_func) }
#ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY +/* + * The instruction set on arm is A32. + * The instruction of BL is xxxx1011xxxxxxxxxxxxxxxxxxxxxxxx, and first four + * bits could not be 1111. + * The instruction of BLX(immediate) is 1111101xxxxxxxxxxxxxxxxxxxxxxxxx. + * The instruction of BLX(register) is xxxx00010010xxxxxxxxxxxx0011xxxx, and + * first four bits could not be 1111. + */ +static bool is_jump_insn(u32 insn) +{ + if (((insn & 0x0f000000) == 0x0b000000) && + ((insn & 0xf0000000) != 0xf0000000)) + return true; + if ((insn & 0xfe000000) == 0xfa000000) + return true; + if (((insn & 0x0ff000f0) == 0x01200030) && + ((insn & 0xf0000000) != 0xf0000000)) + return true; + return false; +} + +struct klp_func_list { + struct klp_func_list *next; + unsigned long func_addr; + unsigned long func_size; + const char *func_name; + int force; +}; + struct walk_stackframe_args { - struct klp_patch *patch; int enable; + struct klp_func_list *other_funcs; int ret; };
@@ -96,22 +128,59 @@ static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, return 0; }
-static int klp_check_activeness_func(struct stackframe *frame, void *data) +static bool check_jump_insn(unsigned long func_addr) { - struct walk_stackframe_args *args = data; - struct klp_patch *patch = args->patch; + unsigned long i; + u32 *insn = (u32*)func_addr; + + for (i = 0; i < CHECK_JUMP_RANGE; i++) { + if (is_jump_insn(*insn)) { + return true; + } + insn++; + } + return false; +} + +static int add_func_to_list(struct klp_func_list **funcs, struct klp_func_list **func, + unsigned long func_addr, unsigned long func_size, const char *func_name, + int force) +{ + if (*func == NULL) { + *funcs = (struct klp_func_list*)kzalloc(sizeof(**funcs), GFP_ATOMIC); + if (!(*funcs)) + return -ENOMEM; + *func = *funcs; + } else { + (*func)->next = (struct klp_func_list*)kzalloc(sizeof(**funcs), + GFP_ATOMIC); + if (!(*func)->next) + return -ENOMEM; + *func = (*func)->next; + } + (*func)->func_addr = func_addr; + (*func)->func_size = func_size; + (*func)->func_name = func_name; + (*func)->force = force; + (*func)->next = NULL; + return 0; +} + +static int klp_check_activeness_func(struct klp_patch *patch, int enable, + struct klp_func_list **nojump_funcs, + struct klp_func_list **other_funcs) +{ + int ret; struct klp_object *obj; struct klp_func_node *func_node; struct klp_func *func; unsigned long func_addr, func_size; - const char *func_name; - - if (args->ret) - return args->ret; + struct klp_func_list *pnjump = NULL; + struct klp_func_list *pother = NULL;
for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { - if (args->enable) { + if (enable) { if (func->force == KLP_ENFORCEMENT) continue; /* @@ -140,23 +209,86 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_addr = (unsigned long)prev->new_func; func_size = prev->new_size; } + if ((func->force == KLP_STACK_OPTIMIZE) && + !check_jump_insn(func_addr)) + ret = add_func_to_list(nojump_funcs, &pnjump, + func_addr, func_size, + func->old_name, func->force); + else + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, func->force); + if (ret) + return ret; } else { /* - * When disable, check for the function itself + * When disable, check for the previously + * patched function and the function itself * which to be unpatched. */ + func_node = klp_find_func_node(func->old_func); + if (!func_node) + return -EINVAL; + if (list_is_singular(&func_node->func_stack)) { + func_addr = (unsigned long)func->old_func; + func_size = func->old_size; + } else { + struct klp_func *prev; + + prev = list_first_or_null_rcu( + &func_node->func_stack, + struct klp_func, stack_node); + func_addr = (unsigned long)prev->new_func; + func_size = prev->new_size; + } + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, 0); + if (ret) + return ret; func_addr = (unsigned long)func->new_func; func_size = func->new_size; + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, 0); + if (ret) + return ret; } - func_name = func->old_name; - args->ret = klp_compare_address(frame->pc, func_addr, func_name, - klp_size_to_check(func_size, func->force)); - if (args->ret) - return args->ret; } } + return 0; +}
- return args->ret; +static bool check_func_list(struct klp_func_list *funcs, int *ret, unsigned long pc) +{ + while (funcs != NULL) { + *ret = klp_compare_address(pc, funcs->func_addr, funcs->func_name, + klp_size_to_check(funcs->func_size, funcs->force)); + if (*ret) { + return false; + } + funcs = funcs->next; + } + return true; +} + +static int klp_check_jump_func(struct stackframe *frame, void *data) +{ + struct walk_stackframe_args *args = data; + struct klp_func_list *other_funcs = args->other_funcs; + + return check_func_list(other_funcs, &args->ret, frame->pc); +} + +static void free_list(struct klp_func_list **funcs) +{ + struct klp_func_list *p; + + while (*funcs != NULL) { + p = *funcs; + *funcs = (*funcs)->next; + kfree(p); + } }
int klp_check_calltrace(struct klp_patch *patch, int enable) @@ -164,10 +296,15 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) struct task_struct *g, *t; struct stackframe frame; int ret = 0; + struct klp_func_list *nojump_funcs = NULL; + struct klp_func_list *other_funcs = NULL; + + ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); + if (ret) + goto out;
struct walk_stackframe_args args = { - .patch = patch, - .enable = enable, + .other_funcs = other_funcs, .ret = 0 };
@@ -194,17 +331,25 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) frame.lr = 0; /* recovered from the stack */ frame.pc = thread_saved_pc(t); } - - walk_stackframe(&frame, klp_check_activeness_func, &args); - if (args.ret) { - ret = args.ret; + if (!check_func_list(nojump_funcs, &ret, frame.pc)) { pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); show_stack(t, NULL, KERN_INFO); goto out; } + if (other_funcs != NULL) { + walk_stackframe(&frame, klp_check_jump_func, &args); + if (args.ret) { + ret = args.ret; + pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); + show_stack(t, NULL, KERN_INFO); + goto out; + } + } }
out: + free_list(&nojump_funcs); + free_list(&other_funcs); return ret; } #endif
From: Ye Weihua <yeweihua4@huawei.com>

hulk inclusion
category: feature
bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Enable the stack-check optimization on ppc32.
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Reviewed-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/powerpc/kernel/livepatch_32.c | 191 +++++++++++++++++++++++++----
 1 file changed, 166 insertions(+), 25 deletions(-)
diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index db6dbe091281..d22c44edc7c7 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -31,14 +31,15 @@
#if defined (CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY) || \ defined (CONFIG_LIVEPATCH_WO_FTRACE) -#define LJMP_INSN_SIZE 4 +#define LJMP_INSN_SIZE 4 #define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32)) +#define CHECK_JUMP_RANGE LJMP_INSN_SIZE
struct klp_func_node { struct list_head node; struct list_head func_stack; void *old_func; - u32 old_insns[LJMP_INSN_SIZE]; + u32 old_insns[LJMP_INSN_SIZE]; };
static LIST_HEAD(klp_func_list); @@ -57,14 +58,40 @@ static struct klp_func_node *klp_find_func_node(void *old_func) #endif
#ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY +/* + * The instruction set on ppc32 is RISC. + * The instructions of BL and BLA are 010010xxxxxxxxxxxxxxxxxxxxxxxxx1. + * The instructions of BCL and BCLA are 010000xxxxxxxxxxxxxxxxxxxxxxxxx1. + * The instruction of BCCTRL is 010011xxxxxxxxxx0000010000100001. + * The instruction of BCLRL is 010011xxxxxxxxxx0000000000100001. + */ +static bool is_jump_insn(u32 insn) +{ + u32 tmp1 = (insn & 0xfc000001); + u32 tmp2 = (insn & 0xfc00ffff); + + if ((tmp1 == 0x48000001) || (tmp1 == 0x40000001) || + (tmp2 == 0x4c000421) || (tmp2 == 0x4c000021)) + return true; + return false; +} + +struct klp_func_list { + struct klp_func_list *next; + unsigned long func_addr; + unsigned long func_size; + const char *func_name; + int force; +}; + struct stackframe { unsigned long sp; unsigned long pc; };
struct walk_stackframe_args { - struct klp_patch *patch; int enable; + struct klp_func_list *other_funcs; int ret; };
@@ -88,22 +115,59 @@ static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, return 0; }
-static int klp_check_activeness_func(struct stackframe *frame, void *data) +static bool check_jump_insn(unsigned long func_addr) { - struct walk_stackframe_args *args = data; - struct klp_patch *patch = args->patch; + unsigned long i; + u32 *insn = (u32*)func_addr; + + for (i = 0; i < CHECK_JUMP_RANGE; i++) { + if (is_jump_insn(*insn)) { + return true; + } + insn++; + } + return false; +} + +static int add_func_to_list(struct klp_func_list **funcs, struct klp_func_list **func, + unsigned long func_addr, unsigned long func_size, const char *func_name, + int force) +{ + if (*func == NULL) { + *funcs = (struct klp_func_list*)kzalloc(sizeof(**funcs), GFP_ATOMIC); + if (!(*funcs)) + return -ENOMEM; + *func = *funcs; + } else { + (*func)->next = (struct klp_func_list*)kzalloc(sizeof(**funcs), + GFP_ATOMIC); + if (!(*func)->next) + return -ENOMEM; + *func = (*func)->next; + } + (*func)->func_addr = func_addr; + (*func)->func_size = func_size; + (*func)->func_name = func_name; + (*func)->force = force; + (*func)->next = NULL; + return 0; +} + +static int klp_check_activeness_func(struct klp_patch *patch, int enable, + struct klp_func_list **nojump_funcs, + struct klp_func_list **other_funcs) +{ + int ret; struct klp_object *obj; struct klp_func *func; unsigned long func_addr, func_size; - const char *func_name; struct klp_func_node *func_node; - - if (args->ret) - return args->ret; + struct klp_func_list *pnjump = NULL; + struct klp_func_list *pother = NULL;
for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { - if (args->enable) { + if (enable) { if (func->force == KLP_ENFORCEMENT) continue; /* @@ -132,23 +196,52 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_addr = (unsigned long)prev->new_func; func_size = prev->new_size; } + if ((func->force == KLP_STACK_OPTIMIZE) && + !check_jump_insn(func_addr)) + ret = add_func_to_list(nojump_funcs, &pnjump, + func_addr, func_size, + func->old_name, func->force); + else + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, func->force); + if (ret) + return ret; } else { /* - * When disable, check for the function itself + * When disable, check for the previously + * patched function and the function itself * which to be unpatched. */ + func_node = klp_find_func_node(func->old_func); + if (!func_node) + return -EINVAL; + if (list_is_singular(&func_node->func_stack)) { + func_addr = (unsigned long)func->old_func; + func_size = func->old_size; + } else { + struct klp_func *prev; + + prev = list_first_or_null_rcu( + &func_node->func_stack, + struct klp_func, stack_node); + func_addr = (unsigned long)prev->new_func; + func_size = prev->new_size; + } + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, func->old_name, 0); + if (ret) + return ret; func_addr = (unsigned long)func->new_func; func_size = func->new_size; + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, func->old_name, 0); + if (ret) + return ret; } - func_name = func->old_name; - args->ret = klp_compare_address(frame->pc, func_addr, func_name, - klp_size_to_check(func_size, func->force)); - if (args->ret) - return args->ret; } } - - return args->ret; + return 0; }
static int unwind_frame(struct task_struct *tsk, struct stackframe *frame) @@ -180,16 +273,56 @@ void notrace klp_walk_stackframe(struct stackframe *frame, } }
+static bool check_func_list(struct klp_func_list *funcs, int *ret, unsigned long pc) +{ + while (funcs != NULL) { + *ret = klp_compare_address(pc, funcs->func_addr, funcs->func_name, + klp_size_to_check(funcs->func_size, funcs->force)); + if (*ret) { + return false; + } + funcs = funcs->next; + } + return true; +} + +static int klp_check_jump_func(struct stackframe *frame, void *data) +{ + struct walk_stackframe_args *args = data; + struct klp_func_list *other_funcs = args->other_funcs; + + if (!check_func_list(other_funcs, &args->ret, frame->pc)) { + return args->ret; + } + return 0; +} + +static void free_list(struct klp_func_list **funcs) +{ + struct klp_func_list *p; + + while (*funcs != NULL) { + p = *funcs; + *funcs = (*funcs)->next; + kfree(p); + } +} + int klp_check_calltrace(struct klp_patch *patch, int enable) { struct task_struct *g, *t; struct stackframe frame; unsigned long *stack; int ret = 0; + struct klp_func_list *nojump_funcs = NULL; + struct klp_func_list *other_funcs = NULL; + + ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); + if (ret) + goto out;
struct walk_stackframe_args args = { - .patch = patch, - .enable = enable, + .other_funcs = other_funcs, .ret = 0 };
@@ -230,17 +363,25 @@ int klp_check_calltrace(struct klp_patch *patch, int enable)
frame.sp = (unsigned long)stack; frame.pc = stack[STACK_FRAME_LR_SAVE]; - klp_walk_stackframe(&frame, klp_check_activeness_func, - t, &args); - if (args.ret) { - ret = args.ret; + if (!check_func_list(nojump_funcs, &ret, frame.pc)) { pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); show_stack(t, NULL, KERN_INFO); goto out; } + if (other_funcs != NULL) { + klp_walk_stackframe(&frame, klp_check_jump_func, t, &args); + if (args.ret) { + ret = args.ret; + pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); + show_stack(t, NULL, KERN_INFO); + goto out; + } + } }
out: + free_list(&nojump_funcs); + free_list(&other_funcs); return ret; } #endif
From: Ye Weihua <yeweihua4@huawei.com>

hulk inclusion
category: feature
bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Enable the stack-check optimization on ppc64.
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/powerpc/kernel/livepatch_64.c | 208 +++++++++++++++++++++--------
 1 file changed, 150 insertions(+), 58 deletions(-)
diff --git a/arch/powerpc/kernel/livepatch_64.c b/arch/powerpc/kernel/livepatch_64.c index f98f4ffc78f3..09e8bb330606 100644 --- a/arch/powerpc/kernel/livepatch_64.c +++ b/arch/powerpc/kernel/livepatch_64.c @@ -37,16 +37,17 @@ #if defined(CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY) || \ defined(CONFIG_LIVEPATCH_WO_FTRACE) #define MAX_SIZE_TO_CHECK (LJMP_INSN_SIZE * sizeof(u32)) +#define CHECK_JUMP_RANGE LJMP_INSN_SIZE
struct klp_func_node { struct list_head node; struct list_head func_stack; void *old_func; - u32 old_insns[LJMP_INSN_SIZE]; + u32 old_insns[LJMP_INSN_SIZE]; #ifdef PPC64_ELF_ABI_v1 struct ppc64_klp_btramp_entry trampoline; #else - unsigned long trampoline; + unsigned long trampoline; #endif };
@@ -66,6 +67,32 @@ static struct klp_func_node *klp_find_func_node(void *old_func) #endif
#ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY +/* + * The instruction set on ppc64 is RISC. + * The instructions of BL and BLA are 010010xxxxxxxxxxxxxxxxxxxxxxxxx1. + * The instructions of BCL and BCLA are 010000xxxxxxxxxxxxxxxxxxxxxxxxx1. + * The instruction of BCCTRL is 010011xxxxxxxxxx0000010000100001. + * The instruction of BCLRL is 010011xxxxxxxxxx0000000000100001. + */ +static bool is_jump_insn(u32 insn) +{ + u32 tmp1 = (insn & 0xfc000001); + u32 tmp2 = (insn & 0xfc00ffff); + + if (tmp1 == 0x48000001 || tmp1 == 0x40000001 || + tmp2 == 0x4c000421 || tmp2 == 0x4c000021) + return true; + return false; +} + +struct klp_func_list { + struct klp_func_list *next; + unsigned long func_addr; + unsigned long func_size; + const char *func_name; + int force; +}; + struct stackframe { unsigned long sp; unsigned long pc; @@ -73,8 +100,8 @@ struct stackframe { };
struct walk_stackframe_args { - struct klp_patch *patch; int enable; + struct klp_func_list *other_funcs; int ret; };
@@ -98,51 +125,62 @@ static inline int klp_compare_address(unsigned long pc, unsigned long func_addr, return 0; }
-static inline int klp_check_activeness_func_addr( - struct stackframe *frame, - unsigned long func_addr, - unsigned long func_size, - const char *func_name, - int force) +static bool check_jump_insn(unsigned long func_addr) { - int ret; + unsigned long i; + u32 *insn = (u32*)func_addr;
- /* Check PC first */ - ret = klp_compare_address(frame->pc, func_addr, func_name, - klp_size_to_check(func_size, force)); - if (ret) - return ret; - - /* Check NIP when the exception stack switching */ - if (frame->nip != 0) { - ret = klp_compare_address(frame->nip, func_addr, func_name, - klp_size_to_check(func_size, force)); - if (ret) - return ret; + for (i = 0; i < CHECK_JUMP_RANGE; i++) { + if (is_jump_insn(*insn)) { + return true; + } + insn++; } + return false; +}
- return ret; +static int add_func_to_list(struct klp_func_list **funcs, struct klp_func_list **func, + unsigned long func_addr, unsigned long func_size, const char *func_name, + int force) +{ + if (*func == NULL) { + *funcs = (struct klp_func_list*)kzalloc(sizeof(**funcs), GFP_ATOMIC); + if (!(*funcs)) + return -ENOMEM; + *func = *funcs; + } else { + (*func)->next = (struct klp_func_list*)kzalloc(sizeof(**funcs), + GFP_ATOMIC); + if (!(*func)->next) + return -ENOMEM; + *func = (*func)->next; + } + (*func)->func_addr = func_addr; + (*func)->func_size = func_size; + (*func)->func_name = func_name; + (*func)->force = force; + (*func)->next = NULL; + return 0; }
-static int klp_check_activeness_func(struct stackframe *frame, void *data) +static int klp_check_activeness_func(struct klp_patch *patch, int enable, + struct klp_func_list **nojump_funcs, + struct klp_func_list **other_funcs) { - struct walk_stackframe_args *args = data; - struct klp_patch *patch = args->patch; + int ret; struct klp_object *obj; struct klp_func *func; unsigned long func_addr, func_size; - const char *func_name; struct klp_func_node *func_node = NULL; - - if (args->ret) - return args->ret; + struct klp_func_list *pnjump = NULL; + struct klp_func_list *pother = NULL;
for (obj = patch->objs; obj->funcs; obj++) { for (func = obj->funcs; func->old_name; func++) { func_node = klp_find_func_node(func->old_func);
/* Check func address in stack */ - if (args->enable) { + if (enable) { if (func->force == KLP_ENFORCEMENT) continue; /* @@ -171,6 +209,17 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) (void *)prev->new_func); func_size = prev->new_size; } + if ((func->force == KLP_STACK_OPTIMIZE) && + !check_jump_insn(func_addr)) + ret = add_func_to_list(nojump_funcs, &pnjump, + func_addr, func_size, + func->old_name, func->force); + else + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, func->force); + if (ret) + return ret; } else { /* * When disable, check for the function itself @@ -179,13 +228,11 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) func_addr = ppc_function_entry( (void *)func->new_func); func_size = func->new_size; + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, func->old_name, 0); + if (ret) + return ret; } - func_name = func->old_name; - args->ret = klp_check_activeness_func_addr(frame, - func_addr, func_size, func_name, - func->force); - if (args->ret) - return args->ret;
#ifdef PPC64_ELF_ABI_v1 /* @@ -199,10 +246,10 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data) if (func_addr != (unsigned long)func->old_func) { func_addr = (unsigned long)func->old_func; func_size = func->old_size; - args->ret = klp_check_activeness_func_addr(frame, - func_addr, func_size, "OLD_FUNC", func->force); - if (args->ret) - return args->ret; + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, "OLD_FUNC", 0); + if (ret) + return ret;
if (func_node == NULL || func_node->trampoline.magic != BRANCH_TRAMPOLINE_MAGIC) @@ -210,17 +257,15 @@ static int klp_check_activeness_func(struct stackframe *frame, void *data)
func_addr = (unsigned long)&func_node->trampoline; func_size = sizeof(struct ppc64_klp_btramp_entry); - args->ret = klp_check_activeness_func_addr(frame, - func_addr, func_size, "trampoline", - func->force); - if (args->ret) - return args->ret; + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, "trampoline", 0); + if (ret) + return ret; } #endif } } - - return args->ret; + return 0; }
static int unwind_frame(struct task_struct *tsk, struct stackframe *frame) @@ -282,18 +327,56 @@ static void notrace klp_walk_stackframe(struct stackframe *frame, } }
+static bool check_func_list(struct klp_func_list *funcs, int *ret, unsigned long pc) +{ + while (funcs != NULL) { + *ret = klp_compare_address(pc, funcs->func_addr, funcs->func_name, + klp_size_to_check(funcs->func_size, funcs->force)); + if (*ret) { + return false; + } + funcs = funcs->next; + } + return true; +} + +static int klp_check_jump_func(struct stackframe *frame, void *data) +{ + struct walk_stackframe_args *args = data; + struct klp_func_list *other_funcs = args->other_funcs; + + if (!check_func_list(other_funcs, &args->ret, frame->pc)) { + return args->ret; + } + return 0; +} + +static void free_list(struct klp_func_list **funcs) +{ + struct klp_func_list *p; + + while (*funcs != NULL) { + p = *funcs; + *funcs = (*funcs)->next; + kfree(p); + } +} + int klp_check_calltrace(struct klp_patch *patch, int enable) { struct task_struct *g, *t; struct stackframe frame; unsigned long *stack; int ret = 0; + struct klp_func_list *nojump_funcs = NULL; + struct klp_func_list *other_funcs = NULL; + struct walk_stackframe_args args;
- struct walk_stackframe_args args = { - .patch = patch, - .enable = enable, - .ret = 0 - }; + ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); + if (ret) + goto out; + args.other_funcs = other_funcs; + args.ret = 0;
for_each_process_thread(g, t) { if (t == current) { @@ -335,20 +418,29 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) frame.sp = (unsigned long)stack; frame.pc = stack[STACK_FRAME_LR_SAVE]; frame.nip = 0; - klp_walk_stackframe(&frame, klp_check_activeness_func, - t, &args); - if (args.ret) { - ret = args.ret; + if (!check_func_list(nojump_funcs, &ret, frame.pc)) { pr_debug("%s FAILED when %s\n", __func__, enable ? "enabling" : "disabling"); pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); show_stack(t, NULL, KERN_INFO); - goto out; } + if (other_funcs != NULL) { + klp_walk_stackframe(&frame, klp_check_jump_func, t, &args); + if (args.ret) { + ret = args.ret; + pr_debug("%s FAILED when %s\n", __func__, + enable ? "enabling" : "disabling"); + pr_info("PID: %d Comm: %.20s\n", t->pid, t->comm); + show_stack(t, NULL, KERN_INFO); + goto out; + } + } }
out: + free_list(&nojump_funcs); + free_list(&other_funcs); return ret; } #endif
From: Ye Weihua yeweihua4@huawei.com
hulk inclusion category: feature bugzilla: 119440 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Enable stack optimization on x86: a function marked KLP_STACK_OPTIMIZE whose first JMP_E9_INSN_SIZE bytes contain no call/jump instruction only needs to be checked against the stack top (the first trace entry); all other functions are still checked against every entry of the call trace.
Signed-off-by: Ye Weihua yeweihua4@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- arch/x86/kernel/livepatch.c | 331 +++++++++++++++++++++++++----------- 1 file changed, 230 insertions(+), 101 deletions(-)
diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c index 5be8b601f0c7..bca152b67818 100644 --- a/arch/x86/kernel/livepatch.c +++ b/arch/x86/kernel/livepatch.c @@ -25,6 +25,7 @@ #include <asm/text-patching.h> #include <asm/stacktrace.h> #include <asm/set_memory.h> +#include <asm/insn.h>
#include <linux/slab.h> #include <asm/nops.h> @@ -32,7 +33,7 @@
#if defined (CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY) || \ defined (CONFIG_LIVEPATCH_WO_FTRACE) -#define JMP_E9_INSN_SIZE 5 +#define JMP_E9_INSN_SIZE 5
struct klp_func_node { struct list_head node; @@ -57,6 +58,30 @@ static struct klp_func_node *klp_find_func_node(void *old_func) #endif
#ifdef CONFIG_LIVEPATCH_STOP_MACHINE_CONSISTENCY +/* + * The instruction set on x86 is CISC. + * The instructions of call in same segment are 11101000(direct), + * 11111111(register indirect) and 11111111(memory indirect). + * The instructions of call in other segment are 10011010(direct), + * 11111111(indirect). + */ +static bool is_jump_insn(u8 *insn) +{ + if ((insn[0] == 0xE8) || (insn[0] == 0x9a)) + return true; + else if ((insn[0] == 0xFF) && ((insn[1] & 0x30) == 0x10)) + return true; + return false; +} + +struct klp_func_list { + struct klp_func_list *next; + unsigned long func_addr; + unsigned long func_size; + const char *func_name; + int force; +}; + static inline unsigned long klp_size_to_check(unsigned long func_size, int force) { @@ -78,67 +103,136 @@ static inline int klp_compare_address(unsigned long stack_addr, return 0; }
-static int klp_check_stack_func(struct klp_func *func, - void *trace_ptr, int trace_len, int enable) +static bool check_jump_insn(unsigned long func_addr) { -#ifdef CONFIG_ARCH_STACKWALK - unsigned long *trace = trace_ptr; -#else - struct stack_trace *trace = trace_ptr; -#endif - unsigned long func_addr, func_size, address; - const char *func_name; - struct klp_func_node *func_node; - int i; + int len = JMP_E9_INSN_SIZE; + struct insn insn; + u8 *addr = (u8*)func_addr; + + do { + if (is_jump_insn(addr)) + return true; + insn_init(&insn, addr, MAX_INSN_SIZE, 1); + insn_get_length(&insn); + if (!insn.length || !insn_complete(&insn)) + return true; + len -= insn.length; + addr += insn.length; + } while (len > 0); + + return false; +}
-#ifdef CONFIG_ARCH_STACKWALK - for (i = 0; i < trace_len; i++) { - address = trace[i]; -#else - for (i = 0; i < trace->nr_entries; i++) { - address = trace->entries[i]; -#endif +static int add_func_to_list(struct klp_func_list **funcs, struct klp_func_list **func, + unsigned long func_addr, unsigned long func_size, const char *func_name, + int force) +{ + if (*func == NULL) { + *funcs = (struct klp_func_list*)kzalloc(sizeof(**funcs), GFP_ATOMIC); + if (!(*funcs)) + return -ENOMEM; + *func = *funcs; + } else { + (*func)->next = (struct klp_func_list*)kzalloc(sizeof(**funcs), + GFP_ATOMIC); + if (!(*func)->next) + return -ENOMEM; + *func = (*func)->next; + } + (*func)->func_addr = func_addr; + (*func)->func_size = func_size; + (*func)->func_name = func_name; + (*func)->force = force; + (*func)->next = NULL; + return 0; +}
- if (enable) { - if (func->force == KLP_ENFORCEMENT) - continue; - /* - * When enable, checking the currently active - * functions. - */ +static int klp_check_activeness_func(struct klp_patch *patch, int enable, + struct klp_func_list **nojump_funcs, + struct klp_func_list **other_funcs) +{ + int ret; + struct klp_object *obj; + struct klp_func *func; + unsigned long func_addr, func_size; + struct klp_func_node *func_node = NULL; + struct klp_func_list *pnojump = NULL; + struct klp_func_list *pother = NULL; + + + for (obj = patch->objs; obj->funcs; obj++) { + for (func = obj->funcs; func->old_name; func++) { func_node = klp_find_func_node(func->old_func); - if (!func_node || - list_empty(&func_node->func_stack)) { - func_addr = (unsigned long)func->old_func; - func_size = func->old_size; + + /* Check func address in stack */ + if (enable) { + if (func->force == KLP_ENFORCEMENT) + continue; + /* + * When enable, checking the currently + * active functions. + */ + if (!func_node || + list_empty(&func_node->func_stack)) { + func_addr = (unsigned long)func->old_func; + func_size = func->old_size; + } else { + /* + * Previously patched function + * [the active one] + */ + struct klp_func *prev; + + prev = list_first_or_null_rcu( + &func_node->func_stack, + struct klp_func, stack_node); + func_addr = (unsigned long)prev->new_func; + func_size = prev->new_size; + } + if ((func->force == KLP_STACK_OPTIMIZE) && + !check_jump_insn(func_addr)) + ret = add_func_to_list(nojump_funcs, &pnojump, + func_addr, func_size, + func->old_name, func->force); + else + ret = add_func_to_list(other_funcs, &pother, + func_addr, func_size, + func->old_name, func->force); + if (ret) + return ret; } else { /* - * Previously patched function - * [the active one] + * When disable, check for the function + * itself which to be unpatched. */ - struct klp_func *prev; - - prev = list_first_or_null_rcu( + if (!func_node) + return -EINVAL; + if (list_is_singular(&func_node->func_stack)) { + func_addr = (unsigned long)func->old_func; + func_size = func->old_size; + } else { + struct klp_func *prev; + + prev = list_first_or_null_rcu( &func_node->func_stack, struct klp_func, stack_node); - func_addr = (unsigned long)prev->new_func; - func_size = prev->new_size; + func_addr = (unsigned long)prev->new_func; + func_size = prev->new_size; + } + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, func->old_name, 0); + if (ret) + return ret; + + func_addr = (unsigned long)func->new_func; + func_size = func->new_size; + ret = add_func_to_list(other_funcs, &pother, func_addr, + func_size, func->old_name, 0); + if (ret) + return ret; } - } else { - /* - * When disable, check for the function itself - * which to be unpatched. - */ - func_addr = (unsigned long)func->new_func; - func_size = func->new_size; } - func_name = func->old_name; - - if (klp_compare_address(address, func_addr, func_name, - klp_size_to_check(func_size, func->force))) - return -EAGAIN; } - return 0; }
@@ -173,86 +267,121 @@ static void klp_print_stack_trace(void *trace_ptr, int trace_len) #endif #define MAX_STACK_ENTRIES 100
-/* - * Determine whether it's safe to transition the task to the target patch state - * by looking for any to-be-patched or to-be-unpatched functions on its stack. - */ -static int klp_check_stack(struct task_struct *task, - struct klp_patch *patch, int enable) - +static bool check_func_list(struct klp_func_list *funcs, int *ret, unsigned long pc) { - static unsigned long trace_entries[MAX_STACK_ENTRIES]; -#ifdef CONFIG_ARCH_STACKWALK - int trace_len; -#else - struct stack_trace trace; -#endif - struct klp_object *obj; - struct klp_func *func; - int ret; - - if (!strncmp(task->comm, "migration/", 10)) - return 0; + while (funcs != NULL) { + *ret = klp_compare_address(pc, funcs->func_addr, funcs->func_name, + klp_size_to_check(funcs->func_size, funcs->force)); + if (*ret) { + return false; + } + funcs = funcs->next; + } + return true; +}
+static int klp_check_stack(void *trace_ptr, int trace_len, + struct klp_func_list *other_funcs) +{ #ifdef CONFIG_ARCH_STACKWALK - ret = stack_trace_save_tsk_reliable(task, trace_entries, MAX_STACK_ENTRIES); - if (ret < 0) - return ret; - trace_len = ret; - ret = 0; + unsigned long *trace = trace_ptr; #else - trace.skip = 0; - trace.nr_entries = 0; - trace.max_entries = MAX_STACK_ENTRIES; - trace.entries = trace_entries; - ret = save_stack_trace_tsk_reliable(task, &trace); + struct stack_trace *trace = trace_ptr; #endif - WARN_ON_ONCE(ret == -ENOSYS); - if (ret) { - pr_info("%s: %s:%d has an unreliable stack\n", - __func__, task->comm, task->pid); - return ret; - } + unsigned long address; + int i, ret;
- klp_for_each_object(patch, obj) { - klp_for_each_func(obj, func) { #ifdef CONFIG_ARCH_STACKWALK - ret = klp_check_stack_func(func, &trace_entries, MAX_STACK_ENTRIES, enable); + for (i = 0; i < trace_len; i++) { + address = trace[i]; #else - ret = klp_check_stack_func(func, &trace, 0, enable); + for (i = 0; i < trace->nr_entries; i++) { + address = trace->entries[i]; #endif - if (ret) { - pr_info("%s: %s:%d is sleeping on function %s\n", - __func__, task->comm, task->pid, - func->old_name); - + if (!check_func_list(other_funcs, &ret, address)) { #ifdef CONFIG_ARCH_STACKWALK - klp_print_stack_trace(&trace_entries, trace_len); + klp_print_stack_trace(trace_ptr, trace_len); #else - klp_print_stack_trace(&trace, 0); + klp_print_stack_trace(trace_ptr, 0); #endif - - return ret; - - } + return ret; } }
return 0; }
+static void free_list(struct klp_func_list **funcs) +{ + struct klp_func_list *p; + + while (*funcs != NULL) { + p = *funcs; + *funcs = (*funcs)->next; + kfree(p); + } +} + int klp_check_calltrace(struct klp_patch *patch, int enable) { struct task_struct *g, *t; int ret = 0; + struct klp_func_list *nojump_funcs = NULL; + struct klp_func_list *other_funcs = NULL; + static unsigned long trace_entries[MAX_STACK_ENTRIES]; +#ifdef CONFIG_ARCH_STACKWALK + int trace_len; +#else + struct stack_trace trace; +#endif
+	ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); + if (ret) + goto out; for_each_process_thread(g, t) { - ret = klp_check_stack(t, patch, enable); + if (!strncmp(t->comm, "migration/", 10)) + continue; + +#ifdef CONFIG_ARCH_STACKWALK + ret = stack_trace_save_tsk_reliable(t, trace_entries, MAX_STACK_ENTRIES); + if (ret < 0) + goto out; + trace_len = ret; + ret = 0; +#else + trace.skip = 0; + trace.nr_entries = 0; + trace.max_entries = MAX_STACK_ENTRIES; + trace.entries = trace_entries; + ret = save_stack_trace_tsk_reliable(t, &trace); +#endif + WARN_ON_ONCE(ret == -ENOSYS); + if (ret) { + pr_info("%s: %s:%d has an unreliable stack\n", + __func__, t->comm, t->pid); + goto out; + } +#ifdef CONFIG_ARCH_STACKWALK + if (!check_func_list(nojump_funcs, &ret, trace_entries[0])) { + klp_print_stack_trace(&trace_entries, trace_len); +#else + if (!check_func_list(nojump_funcs, &ret, trace.entries[0])) { + klp_print_stack_trace(&trace, 0); +#endif + goto out; + } +#ifdef CONFIG_ARCH_STACKWALK + ret = klp_check_stack(trace_entries, trace_len, other_funcs); +#else + ret = klp_check_stack(&trace, 0, other_funcs); +#endif if (ret) goto out; }
out: + free_list(&nojump_funcs); + free_list(&other_funcs); return ret; } #endif
From: Ye Weihua yeweihua4@huawei.com
hulk inclusion category: bugfix bugzilla: 176976 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
Before enabling a livepatch, we allocate a piece of memory for func_node to store function information, and we release it after the livepatch is disabled.
However, in some special cases, for example when the patched code is still running, disabling fails. In these cases, the allocated memory must not be released; otherwise, the livepatch can no longer be disabled.
So, move arch_klp_mem_recycle() after the return value check to solve this problem.
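In outline, the resulting __klp_disable_patch() flow is (simplified from the diff below, error handling elided):

	arch_klp_code_modify_prepare();
	ret = stop_machine(klp_try_disable_patch, &patch_data, cpu_online_mask);
	arch_klp_code_modify_post_process();
	if (ret)
		return ret;		/* disable failed: func_node memory must survive */

	arch_klp_mem_recycle(patch);	/* recycle only after a successful disable */
	klp_free_patch_async(patch);
	return 0;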
Fixes: ec7ce700674f ("livepatch: put memory alloc and free out stop machine") Signed-off-by: Ye Weihua yeweihua4@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- kernel/livepatch/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 58cdfeea46d2..7cea023e88f4 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -1323,11 +1323,11 @@ static int __klp_disable_patch(struct klp_patch *patch)
arch_klp_code_modify_prepare(); ret = stop_machine(klp_try_disable_patch, &patch_data, cpu_online_mask); - arch_klp_mem_recycle(patch); arch_klp_code_modify_post_process(); if (ret) return ret;
+ arch_klp_mem_recycle(patch); klp_free_patch_async(patch); return 0; }
From: Leah Rumancik leah.rumancik@gmail.com
mainline inclusion from mainline-5.13-rc1 commit 6c0912739699d8e4b6a87086401bf3ad3c59502d category: bugfix bugzilla: 176545 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Upon file deletion, zero out all fields in ext4_dir_entry2 besides rec_len. In case sensitive data is stored in filenames, this ensures no potentially sensitive data is left in the directory entry upon deletion. Also, wipe these fields upon moving a directory entry during the conversion to an htree and when splitting htree nodes.
The data wiped may still exist in the journal, but there are future commits planned to address this.
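The wipe pattern, sketched with a hypothetical helper (field types simplified from fs/ext4/ext4.h; in the delete path, when the entry is merged into the previous one, the whole record including rec_len is zeroed instead):

	struct ext4_dir_entry_2 {
		__le32	inode;		/* 0 marks the entry unused */
		__le16	rec_len;	/* record length, kept so the block
					 * can still be walked and coalesced */
		__u8	name_len;
		__u8	file_type;
		char	name[255];	/* the potentially sensitive bytes */
	};

	static void wipe_dir_entry(struct ext4_dir_entry_2 *de, unsigned int rec_len)
	{
		de->inode = 0;
		memset(&de->name_len, 0,
		       rec_len - offsetof(struct ext4_dir_entry_2, name_len));
	}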
Signed-off-by: Leah Rumancik leah.rumancik@gmail.com Link: https://lore.kernel.org/r/20210422180834.2242353-1-leah.rumancik@gmail.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/namei.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 4b97cca90c67..526960e34386 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1776,7 +1776,14 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count, memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)->rec_len = ext4_rec_len_to_disk(rec_len, blocksize); + + /* wipe dir_entry excluding the rec_len field */ de->inode = 0; + memset(&de->name_len, 0, ext4_rec_len_from_disk(de->rec_len, + blocksize) - + offsetof(struct ext4_dir_entry_2, + name_len)); + map++; to += rec_len; } @@ -2101,6 +2108,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname, data2 = bh2->b_data;
memcpy(data2, de, len); + memset(de, 0, len); /* wipe old data */ de = (struct ext4_dir_entry_2 *) data2; top = data2 + len; while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top) @@ -2481,15 +2489,27 @@ int ext4_generic_delete_entry(struct inode *dir, entry_buf, buf_size, i)) return -EFSCORRUPTED; if (de == de_del) { - if (pde) + if (pde) { pde->rec_len = ext4_rec_len_to_disk( ext4_rec_len_from_disk(pde->rec_len, blocksize) + ext4_rec_len_from_disk(de->rec_len, blocksize), blocksize); - else + + /* wipe entire dir_entry */ + memset(de, 0, ext4_rec_len_from_disk(de->rec_len, + blocksize)); + } else { + /* wipe dir_entry excluding the rec_len field */ de->inode = 0; + memset(&de->name_len, 0, + ext4_rec_len_from_disk(de->rec_len, + blocksize) - + offsetof(struct ext4_dir_entry_2, + name_len)); + } + inode_inc_iversion(dir); return 0; }
From: Kefeng Wang wangkefeng.wang@huawei.com
mainline inclusion from mainline-5.14-rc6 commit 1027b96ec9d34f9abab69bc1a4dc5b1ad8ab1349 category: bugfix bugzilla: 176712 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------

DO_ONCE
 DEFINE_STATIC_KEY_TRUE(___once_key);
 __do_once_done
  once_disable_jump(once_key);
   INIT_WORK(&w->work, once_deferred);
   struct once_work *w;
   w->key = key;
   schedule_work(&w->work);            module unload
                                       // *the key is destroyed*
 process_one_work
  once_deferred
   BUG_ON(!static_key_enabled(work->key));
    static_key_count((struct static_key *)x) // *access key, crash*
When a module uses the DO_ONCE mechanism, it can crash due to the above concurrency problem; it can be reproduced with the link in [1].
Fix it by taking a reference on the module when the once work is queued and putting it after the work has run.
[1] https://lore.kernel.org/netdev/eaa6c371-465e-57eb-6be9-f4b16b9d7cbf@huawei.c...
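A hypothetical module that hits the window (DO_ONCE()'s backing static key lives in the module's own memory, while the key is disabled from deferred work):

	#include <linux/module.h>
	#include <linux/once.h>

	static void seed_once(void)
	{
		pr_info("runs at most once\n");
	}

	static int __init demo_init(void)
	{
		DO_ONCE(seed_once);	/* queues once_deferred() on a workqueue */
		return 0;
	}
	module_init(demo_init);
	MODULE_LICENSE("GPL");

If the module is unloaded before the queued work runs, once_deferred() dereferences the freed key and crashes; the added __module_get()/module_put() pair keeps the module alive until the work has finished.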
Cc: Hannes Frederic Sowa hannes@stressinduktion.org Cc: Daniel Borkmann daniel@iogearbox.net Cc: David S. Miller davem@davemloft.net Cc: Eric Dumazet edumazet@google.com Reported-by: Minmin chen chenmingmin@huawei.com Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Acked-by: Hannes Frederic Sowa hannes@stressinduktion.org Signed-off-by: David S. Miller davem@davemloft.net (cherry picked from commit 1027b96ec9d34f9abab69bc1a4dc5b1ad8ab1349) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- include/linux/once.h | 4 ++-- lib/once.c | 11 ++++++++--- 2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/include/linux/once.h b/include/linux/once.h index 9225ee6d96c7..ae6f4eb41cbe 100644 --- a/include/linux/once.h +++ b/include/linux/once.h @@ -7,7 +7,7 @@
bool __do_once_start(bool *done, unsigned long *flags); void __do_once_done(bool *done, struct static_key_true *once_key, - unsigned long *flags); + unsigned long *flags, struct module *mod);
/* Call a function exactly once. The idea of DO_ONCE() is to perform * a function call such as initialization of random seeds, etc, only @@ -46,7 +46,7 @@ void __do_once_done(bool *done, struct static_key_true *once_key, if (unlikely(___ret)) { \ func(__VA_ARGS__); \ __do_once_done(&___done, &___once_key, \ - &___flags); \ + &___flags, THIS_MODULE); \ } \ } \ ___ret; \ diff --git a/lib/once.c b/lib/once.c index 8b7d6235217e..59149bf3bfb4 100644 --- a/lib/once.c +++ b/lib/once.c @@ -3,10 +3,12 @@ #include <linux/spinlock.h> #include <linux/once.h> #include <linux/random.h> +#include <linux/module.h>
struct once_work { struct work_struct work; struct static_key_true *key; + struct module *module; };
static void once_deferred(struct work_struct *w) @@ -16,10 +18,11 @@ static void once_deferred(struct work_struct *w) work = container_of(w, struct once_work, work); BUG_ON(!static_key_enabled(work->key)); static_branch_disable(work->key); + module_put(work->module); kfree(work); }
-static void once_disable_jump(struct static_key_true *key) +static void once_disable_jump(struct static_key_true *key, struct module *mod) { struct once_work *w;
@@ -29,6 +32,8 @@ static void once_disable_jump(struct static_key_true *key)
INIT_WORK(&w->work, once_deferred); w->key = key; + w->module = mod; + __module_get(mod); schedule_work(&w->work); }
@@ -53,11 +58,11 @@ bool __do_once_start(bool *done, unsigned long *flags) EXPORT_SYMBOL(__do_once_start);
void __do_once_done(bool *done, struct static_key_true *once_key, - unsigned long *flags) + unsigned long *flags, struct module *mod) __releases(once_lock) { *done = true; spin_unlock_irqrestore(&once_lock, *flags); - once_disable_jump(once_key); + once_disable_jump(once_key, mod); } EXPORT_SYMBOL(__do_once_done);
From: Vignesh Raghavendra vigneshr@ti.com
mainline inclusion from mainline v5.11-rc1 commit d4548b14dd7e5c698f81ce23ce7b69a896373b45 category: bugfix bugzilla: 176665 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The k3_soc_devices array is missing a sentinel entry, which may result in an out-of-bounds access, as reported by kernel KASAN.
Fix this by adding a sentinel entry.
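For reference, the consumer walks the array until it reaches an all-empty entry, roughly (sketch of soc_device_match() in drivers/base/soc.c):

	const struct soc_device_attribute *m = k3_soc_devices;

	while (m->machine || m->family || m->revision || m->soc_id) {
		/* compare m against the running SoC ... */
		m++;
	}

Without the terminating { } entry, this walk reads past the end of the array.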
Fixes: 439c7183e5b9 ("serial: 8250: 8250_omap: Disable RX interrupt after DMA enable") Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Signed-off-by: Vignesh Raghavendra vigneshr@ti.com Link: https://lore.kernel.org/r/20201111112653.2710-1-vigneshr@ti.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yi Yang yiyang13@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- drivers/tty/serial/8250/8250_omap.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c index 95e2d6de4f21..ad0549dac7d7 100644 --- a/drivers/tty/serial/8250/8250_omap.c +++ b/drivers/tty/serial/8250/8250_omap.c @@ -1211,6 +1211,7 @@ static int omap8250_no_handle_irq(struct uart_port *port) static const struct soc_device_attribute k3_soc_devices[] = { { .family = "AM65X", }, { .family = "J721E", .revision = "SR1.0" }, + { /* sentinel */ } };
static struct omap8250_dma_params am654_dma = {
From: Li Hua hucool.lihua@huawei.com
hulk inclusion category: bugfix bugzilla: 180841 https://gitee.com/openeuler/kernel/issues/I4DDEL
An idle CPU calls ktime_get() frequently while polling, and the smp_rmb() inside it disturbs the shared cache, especially the shared L2 cache on Arm Cortex-A15. The performance drop grows when the polling duration reaches the millisecond level. So spin on cheap nops in batches of up to 2000 iterations (roughly a microsecond on a 2GHz core) between ktime_get() calls.
Signed-off-by: Li Hua hucool.lihua@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- kernel/sched/idle.c | 30 +++++++++++++++++++++++------- 1 file changed, 23 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index a503e7d4c170..0aa35c0958e3 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -61,22 +61,38 @@ __setup("hlt", cpu_idle_nopoll_setup); #endif
#ifdef CONFIG_IAS_SMART_IDLE -static void smart_idle_poll(void) +/* looping 2000 times is probably microsecond level for 2GHZ CPU*/ +#define MICRO_LEVEL_COUNT 2000 +static inline void delay_relax(unsigned long delay_max) +{ + unsigned long delay_count = 0; + + delay_max = (delay_max < MICRO_LEVEL_COUNT) ? delay_max : MICRO_LEVEL_COUNT; + while (unlikely(!tif_need_resched()) && delay_count < delay_max) { + barrier(); + __asm__ __volatile__("nop;"); + delay_count++; + } +} + +static inline void smart_idle_poll(void) { unsigned long poll_duration = poll_threshold_ns; ktime_t cur, stop;
- if (!poll_duration) + if (likely(!poll_duration)) return;
stop = ktime_add_ns(ktime_get(), poll_duration); - - do { - cpu_relax(); - if (tif_need_resched()) + while (true) { + delay_relax(poll_duration); + if (likely(tif_need_resched())) break; cur = ktime_get(); - } while (ktime_before(cur, stop)); + if (likely(!ktime_before(cur, stop))) + break; + poll_duration = ktime_sub_ns(stop, cur); + } } #endif
From: Li Hua hucool.lihua@huawei.com
hulk inclusion category: bugfix bugzilla: 180842 https://gitee.com/openeuler/kernel/issues/I4DDEL
Report an error when an illegal negative value is passed. The allowed range is 0 to INT_MAX.
Signed-off-by: Li Hua hucool.lihua@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- include/linux/kernel.h | 2 +- kernel/sched/idle.c | 10 +++++----- kernel/sysctl.c | 6 +++--- 3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h index eb88683890c9..189c51afc877 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -556,7 +556,7 @@ extern int sysctl_panic_on_stackoverflow;
extern bool crash_kexec_post_notifiers; #ifdef CONFIG_IAS_SMART_IDLE -extern unsigned long poll_threshold_ns; +extern int poll_threshold_ns; #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 0aa35c0958e3..ae823c329fd7 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -18,7 +18,7 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[]; * Poll_threshold_ns indicates the maximum polling time before * entering real idle. */ -unsigned long poll_threshold_ns; +int poll_threshold_ns; #endif
/** @@ -63,9 +63,9 @@ __setup("hlt", cpu_idle_nopoll_setup); #ifdef CONFIG_IAS_SMART_IDLE /* looping 2000 times is probably microsecond level for 2GHZ CPU*/ #define MICRO_LEVEL_COUNT 2000 -static inline void delay_relax(unsigned long delay_max) +static inline void delay_relax(int delay_max) { - unsigned long delay_count = 0; + int delay_count = 0;
delay_max = (delay_max < MICRO_LEVEL_COUNT) ? delay_max : MICRO_LEVEL_COUNT; while (unlikely(!tif_need_resched()) && delay_count < delay_max) { @@ -77,7 +77,7 @@ static inline void delay_relax(unsigned long delay_max)
static inline void smart_idle_poll(void) { - unsigned long poll_duration = poll_threshold_ns; + int poll_duration = poll_threshold_ns; ktime_t cur, stop;
if (likely(!poll_duration)) @@ -309,7 +309,7 @@ static void do_idle(void) { int cpu = smp_processor_id(); #ifdef CONFIG_IAS_SMART_IDLE - unsigned long idle_poll_flag = poll_threshold_ns; + int idle_poll_flag = poll_threshold_ns; #endif /* * If the arch has a polling bit, we maintain an invariant: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c8d3a20007c6..626530cf1342 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1655,9 +1655,10 @@ static struct ctl_table ias_table[] = { { .procname = "smart_idle_threshold", .data = &poll_threshold_ns, - .maxlen = sizeof(unsigned long), + .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_doulongvec_minmax, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, }, #endif
@@ -1862,7 +1863,6 @@ static struct ctl_table kern_table[] = { .proc_handler = sysctl_sched_uclamp_handler, }, #endif - #ifdef CONFIG_SCHED_AUTOGROUP { .procname = "sched_autogroup_enabled",
From: Guoqing Jiang jgq516@gmail.com
mainline inclusion from mainline-v5.14-rc1 commit ad3fc798800fb7ca04c1dfc439dba946818048d8 category: bugfix bugzilla: 169402 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause a double-fault problem per the report [1], and it is also not correct to change ->bi_end_io if md doesn't own the bio, so let's revert it.
And io stats accounting will be reimplemented in later commits.
[1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmai...
Fixes: 41d2d848e5c0 ("md: improve io stats accounting") Signed-off-by: Guoqing Jiang jiangguoqing@kylinos.cn Signed-off-by: Song Liu song@kernel.org Signed-off-by: Luo Meng luomeng12@huawei.com
Conflicts: drivers/md/md.c Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- drivers/md/md.c | 45 --------------------------------------------- drivers/md/md.h | 1 - 2 files changed, 46 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 288d26013de2..2cc9d0db9428 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -459,30 +459,6 @@ void md_handle_request(struct mddev *mddev, struct bio *bio) } EXPORT_SYMBOL(md_handle_request);
-struct md_io { - struct mddev *mddev; - bio_end_io_t *orig_bi_end_io; - void *orig_bi_private; - unsigned long start_time; - struct hd_struct *part; -}; - -static void md_end_io(struct bio *bio) -{ - struct md_io *md_io = bio->bi_private; - struct mddev *mddev = md_io->mddev; - - part_end_io_acct(md_io->part, bio, md_io->start_time); - - bio->bi_end_io = md_io->orig_bi_end_io; - bio->bi_private = md_io->orig_bi_private; - - mempool_free(md_io, &mddev->md_io_pool); - - if (bio->bi_end_io) - bio->bi_end_io(bio); -} - static blk_qc_t md_submit_bio(struct bio *bio) { const int rw = bio_data_dir(bio); @@ -507,21 +483,6 @@ static blk_qc_t md_submit_bio(struct bio *bio) return BLK_QC_T_NONE; }
- if (bio->bi_end_io != md_end_io) { - struct md_io *md_io; - - md_io = mempool_alloc(&mddev->md_io_pool, GFP_NOIO); - md_io->mddev = mddev; - md_io->orig_bi_end_io = bio->bi_end_io; - md_io->orig_bi_private = bio->bi_private; - - bio->bi_end_io = md_end_io; - bio->bi_private = md_io; - - md_io->start_time = part_start_io_acct(mddev->gendisk, - &md_io->part, bio); - } - /* bio could be mergeable after passing to underlayer */ bio->bi_opf &= ~REQ_NOMERGE;
@@ -5626,7 +5587,6 @@ static void md_free(struct kobject *ko)
bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); - mempool_exit(&mddev->md_io_pool); kfree(mddev); }
@@ -5722,11 +5682,6 @@ static int md_alloc(dev_t dev, char *name) */ mddev->hold_active = UNTIL_STOP;
- error = mempool_init_kmalloc_pool(&mddev->md_io_pool, BIO_POOL_SIZE, - sizeof(struct md_io)); - if (error) - goto abort; - error = -ENOMEM; mddev->queue = blk_alloc_queue(NUMA_NO_NODE); if (!mddev->queue) diff --git a/drivers/md/md.h b/drivers/md/md.h index 2175a5ac4f7c..c94811cf2600 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -487,7 +487,6 @@ struct mddev { struct bio_set sync_set; /* for sync operations like * metadata and bitmap writes */ - mempool_t md_io_pool;
/* Generic flush handling. * The last to finish preflush schedules a worker to submit
From: Ye Weihua yeweihua4@huawei.com
hulk inclusion category: bugfix bugzilla: 181325 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------------
An error is reported during the build: error: ISO C90 forbids mixed declarations and code.
Fix it by moving the variable definitions to the top of the function.
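A minimal (hypothetical) reproduction of what the compiler rejects under -Wdeclaration-after-statement:

	struct args { int ret; };
	int prepare(void);

	void f(void)
	{
		int ret;

		ret = prepare();		/* a statement ... */
		struct args a = { ret };	/* ... then a declaration:
						 * ISO C90 forbids this */
		(void)a;
	}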
Signed-off-by: Ye Weihua yeweihua4@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- arch/arm/kernel/livepatch.c | 9 ++++----- arch/arm64/kernel/livepatch.c | 11 +++++------ arch/powerpc/kernel/livepatch_32.c | 9 ++++----- 3 files changed, 13 insertions(+), 16 deletions(-)
diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c index f0bb09aa14b7..1ec326706a7b 100644 --- a/arch/arm/kernel/livepatch.c +++ b/arch/arm/kernel/livepatch.c @@ -298,15 +298,14 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) int ret = 0; struct klp_func_list *nojump_funcs = NULL; struct klp_func_list *other_funcs = NULL; + struct walk_stackframe_args args = { + .ret = 0 + };
ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); if (ret) goto out; - - struct walk_stackframe_args args = { - .other_funcs = other_funcs, - .ret = 0 - }; + args.other_funcs = other_funcs;
for_each_process_thread(g, t) { if (t == current) { diff --git a/arch/arm64/kernel/livepatch.c b/arch/arm64/kernel/livepatch.c index 650f457ab656..2ffbdfbe87de 100644 --- a/arch/arm64/kernel/livepatch.c +++ b/arch/arm64/kernel/livepatch.c @@ -291,17 +291,16 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) int ret = 0; struct klp_func_list *nojump_funcs = NULL; struct klp_func_list *other_funcs = NULL; - - ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); - if (ret) - goto out; - struct walk_stackframe_args args = { .enable = enable, - .other_funcs = other_funcs, .ret = 0 };
+ ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); + if (ret) + goto out; + args.other_funcs = other_funcs; + for_each_process_thread(g, t) { /* * Handle the current carefully on each CPUs, we shouldn't diff --git a/arch/powerpc/kernel/livepatch_32.c b/arch/powerpc/kernel/livepatch_32.c index d22c44edc7c7..ea153f52e9ad 100644 --- a/arch/powerpc/kernel/livepatch_32.c +++ b/arch/powerpc/kernel/livepatch_32.c @@ -316,15 +316,14 @@ int klp_check_calltrace(struct klp_patch *patch, int enable) int ret = 0; struct klp_func_list *nojump_funcs = NULL; struct klp_func_list *other_funcs = NULL; + struct walk_stackframe_args args = { + .ret = 0 + };
ret = klp_check_activeness_func(patch, enable, &nojump_funcs, &other_funcs); if (ret) goto out; - - struct walk_stackframe_args args = { - .other_funcs = other_funcs, - .ret = 0 - }; + args.other_funcs = other_funcs;
for_each_process_thread(g, t) { if (t == current) {
From: Yu Jiahua yujiahua1@huawei.com
hulk inclusion category: feature bugzilla: 181656 https://gitee.com/openeuler/kernel/issues/I4DDEL
-------------------------------------------------
The optimized load tracking feature has an uncertain impact on the scheduler in multi-core systems. Therefore, an awareness switch is needed that senses the number of CPUs in the system: if more than one CPU is detected, the optimized load tracking feature is disabled.
Signed-off-by: Yu Jiahua yujiahua1@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- include/linux/sched/sysctl.h | 6 +-- kernel/sched/fair.c | 74 +++++++++++++++++++----------------- kernel/sysctl.c | 6 ++- 3 files changed, 46 insertions(+), 40 deletions(-)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 378bcb58c509..525d73dd8ef9 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -103,12 +103,12 @@ extern int sysctl_blocked_averages(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); extern int sysctl_tick_update_load(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); -extern int sysctl_update_load_latency(struct ctl_table *table, int write, - void __user *buffer, size_t *lenp, loff_t *ppos); +extern int sysctl_update_load_tracking_aware(struct ctl_table *table, + int write, void __user *buffer, size_t *lenp, loff_t *ppos);
-extern unsigned int sysctl_load_tracking_latency; extern struct static_key_true sched_tick_update_load; extern struct static_key_true sched_blocked_averages; +extern struct static_key_false sched_load_tracking_aware_enable; #endif
#endif /* _LINUX_SCHED_SYSCTL_H */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1417af3dd427..8c830dce4481 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -39,34 +39,35 @@ unsigned int sysctl_sched_latency = 6000000ULL; static unsigned int normalized_sysctl_sched_latency = 6000000ULL;
#ifdef CONFIG_IAS_SMART_LOAD_TRACKING -#define LANTENCY_MIN 10 -#define LANTENCY_MAX 30 -unsigned int sysctl_load_tracking_latency = LANTENCY_MIN; +DEFINE_STATIC_KEY_FALSE(sched_load_tracking_aware_enable); +static void set_load_tracking_aware(bool enabled) +{ + if (enabled) + static_branch_enable(&sched_load_tracking_aware_enable); + else + static_branch_disable(&sched_load_tracking_aware_enable); +}
-int sysctl_update_load_latency(struct ctl_table *table, int write, - void __user *buffer, size_t *lenp, loff_t *ppos) +int sysctl_update_load_tracking_aware(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) { - int ret; - int min = LANTENCY_MIN; - int max = LANTENCY_MAX; - int latency = sysctl_load_tracking_latency; struct ctl_table t; + int err; + int state = static_branch_likely(&sched_load_tracking_aware_enable);
if (write && !capable(CAP_SYS_ADMIN)) return -EPERM;
t = *table; - t.data = &latency; - t.extra1 = &min; - t.extra2 = &max; - - ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); - if (ret || !write) - return ret; + t.data = &state; + err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); + if (err < 0) + return err;
- sysctl_load_tracking_latency = latency; + if (write) + set_load_tracking_aware(state);
- return 0; + return err; } #endif
@@ -3832,39 +3833,42 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s #define SKIP_AGE_LOAD 0x2 #define DO_ATTACH 0x4
+#ifdef CONFIG_IAS_SMART_LOAD_TRACKING +/* + * Check the load tracking scenario. In a single-core system without cpu frequency updates, + * precise load tracking is unnecessary. So here we just shut down load tracking, + * to decrease cpu usage. + */ +static inline int check_load_switch(void) +{ + if (static_branch_unlikely(&sched_load_tracking_aware_enable)) + if (num_online_cpus() == 1) + /* no need to update load average in a single-core scenario */ + return 1; + + return 0; +} +#endif + /* Update task and its cfs_rq load average */ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { u64 now = cfs_rq_clock_pelt(cfs_rq); int decayed; + #ifdef CONFIG_IAS_SMART_LOAD_TRACKING - u64 delta; + if (check_load_switch()) + return; #endif - /* * Track task load average for carrying it to new CPU after migrated, and * track group sched_entity load average for task_h_load calc in migration */ -#ifdef CONFIG_IAS_SMART_LOAD_TRACKING - delta = now - se->avg.last_update_time; - delta >>= sysctl_load_tracking_latency; - - if (!delta) - return; - - if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) - __update_load_avg_se(now, cfs_rq, se); - - decayed = update_cfs_rq_load_avg(now, cfs_rq); - decayed |= propagate_entity_load_avg(se); -#else if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) __update_load_avg_se(now, cfs_rq, se);
decayed = update_cfs_rq_load_avg(now, cfs_rq); decayed |= propagate_entity_load_avg(se); -#endif -
if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 626530cf1342..c7ca58de3b1b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1682,11 +1682,13 @@ static struct ctl_table ias_table[] = { .extra2 = SYSCTL_ONE, }, { - .procname = "sched_load_tracking_latency", + .procname = "sched_load_tracking_aware_enable", .data = NULL, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = sysctl_update_load_latency, + .proc_handler = sysctl_update_load_tracking_aware, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, }, #endif { }
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-5.15-rc1 commit 0904c9ae3465c7acc066a564a76b75c0af83e6c7 category: bugfix bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
No EIO simulation is required if the buffer is uptodate, so move the simulation behind read bio completion, just like the inode/block bitmap simulations do.
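For reference, the helper used here (simplified from fs/ext4/ext4.h): after the read has completed, a simulated failure merely clears the uptodate bit so the normal error path runs:

	static inline void ext4_simulate_fail_bh(struct super_block *sb,
						 struct buffer_head *bh,
						 unsigned long code)
	{
		if (!IS_ERR(bh) && ext4_simulate_fail(sb, code))
			clear_buffer_uptodate(bh);
	}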
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20210826130412.3921207-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/inode.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0207579d7e04..a78cb4ba63bd 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4334,8 +4334,6 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino, bh = sb_getblk(sb, block); if (unlikely(!bh)) return -ENOMEM; - if (ext4_simulate_fail(sb, EXT4_SIM_INODE_EIO)) - goto simulate_eio; if (!buffer_uptodate(bh)) { lock_buffer(bh);
@@ -4422,8 +4420,8 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino, ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL); blk_finish_plug(&plug); wait_on_buffer(bh); + ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO); if (!buffer_uptodate(bh)) { - simulate_eio: if (ret_block) *ret_block = block; brelse(bh);
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-5.15-rc1 commit baaae979b112642a41b71c71c599d875c067d257 category: bugfix bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Now ext4_do_update_inode() returns an error before filling in the whole inode data if we fail to set the inode blocks in ext4_inode_blocks_set(). In theory this error should never happen, since sb->s_maxbytes should not have allowed it; we already initialize sb->s_maxbytes according to the huge_file feature in ext4_fill_super(). So even though it could only happen due to filesystem corruption, we had better return only after we finish updating the inode, because bailing out early may leave an uninitialized buffer that we could read later in "errors=continue" mode.
This patch makes the inode-updating procedure atomic: it calls EXT4_ERROR_INODE() after dropping i_raw_lock when something bad has happened, makes sure the on-disk inode is fully filled in, and also drops a BUG_ON and does some small cleanups.
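In outline, the update path after this change (simplified from the diff below):

	spin_lock(&ei->i_raw_lock);
	/* fill every field of the raw inode, remembering the first error */
	err = ext4_inode_blocks_set(handle, raw_inode, ei);
	/* ... keep filling, accumulating errors via err = err ?: -E... */
	ext4_inode_csum_set(inode, raw_inode, ei);
	spin_unlock(&ei->i_raw_lock);

	/* report and bail out only once the buffer is fully initialized */
	if (err) {
		EXT4_ERROR_INODE(inode, "corrupted inode contents");
		goto out_brelse;
	}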
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20210826130412.3921207-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/inode.c | 44 ++++++++++++++++++++++++++++---------------- 1 file changed, 28 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a78cb4ba63bd..8194e847087e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4930,8 +4930,14 @@ static int ext4_inode_blocks_set(handle_t *handle, ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE); return 0; } + + /* + * This should never happen since sb->s_maxbytes should not have + * allowed this, sb->s_maxbytes was set according to the huge_file + * feature in ext4_fill_super(). + */ if (!ext4_has_feature_huge_file(sb)) - return -EFBIG; + return -EFSCORRUPTED;
if (i_blocks <= 0xffffffffffffULL) { /* @@ -5038,16 +5044,14 @@ static int ext4_do_update_inode(handle_t *handle,
spin_lock(&ei->i_raw_lock);
- /* For fields not tracked in the in-memory inode, - * initialise them to zero for new inodes. */ + /* + * For fields not tracked in the in-memory inode, initialise them + * to zero for new inodes. + */ if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);
err = ext4_inode_blocks_set(handle, raw_inode, ei); - if (err) { - spin_unlock(&ei->i_raw_lock); - goto out_brelse; - }
raw_inode->i_mode = cpu_to_le16(inode->i_mode); i_uid = i_uid_read(inode); @@ -5056,10 +5060,11 @@ static int ext4_do_update_inode(handle_t *handle, if (!(test_opt(inode->i_sb, NO_UID32))) { raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid)); raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid)); -/* - * Fix up interoperability with old kernels. Otherwise, old inodes get - * re-used with the upper 16 bits of the uid/gid intact - */ + /* + * Fix up interoperability with old kernels. Otherwise, + * old inodes get re-used with the upper 16 bits of the + * uid/gid intact. + */ if (ei->i_dtime && list_empty(&ei->i_orphan)) { raw_inode->i_uid_high = 0; raw_inode->i_gid_high = 0; @@ -5128,8 +5133,9 @@ static int ext4_do_update_inode(handle_t *handle, } }
- BUG_ON(!ext4_has_feature_project(inode->i_sb) && - i_projid != EXT4_DEF_PROJID); + if (i_projid != EXT4_DEF_PROJID && + !ext4_has_feature_project(inode->i_sb)) + err = err ?: -EFSCORRUPTED;
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE && EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)) @@ -5137,6 +5143,11 @@ static int ext4_do_update_inode(handle_t *handle,
ext4_inode_csum_set(inode, raw_inode, ei); spin_unlock(&ei->i_raw_lock); + if (err) { + EXT4_ERROR_INODE(inode, "corrupted inode contents"); + goto out_brelse; + } + if (inode->i_sb->s_flags & SB_LAZYTIME) ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data); @@ -5144,13 +5155,13 @@ static int ext4_do_update_inode(handle_t *handle, BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata"); err = ext4_handle_dirty_metadata(handle, NULL, bh); if (err) - goto out_brelse; + goto out_error; ext4_clear_inode_state(inode, EXT4_STATE_NEW); if (set_large_file) { BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get write access"); err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh); if (err) - goto out_brelse; + goto out_error; lock_buffer(EXT4_SB(sb)->s_sbh); ext4_set_feature_large_file(sb); ext4_superblock_csum_set(sb); @@ -5160,9 +5171,10 @@ static int ext4_do_update_inode(handle_t *handle, EXT4_SB(sb)->s_sbh); } ext4_update_inode_fsync_trans(handle, inode, need_datasync); +out_error: + ext4_std_error(inode->i_sb, err); out_brelse: brelse(bh); - ext4_std_error(inode->i_sb, err); return err; }
From: Zhang Yi yi.zhang@huawei.com
hulk inclusion category: bugfix bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL ---------------------------
Factor out ext4_fill_raw_inode() from ext4_do_update_inode(); it is used to fill the in-memory inode contents into the inode table buffer, in preparation for initializing an exclusive inode buffer without reading the block in __ext4_get_inode_loc().
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/inode.c | 85 +++++++++++++++++++++++++++---------------------- 1 file changed, 47 insertions(+), 38 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 8194e847087e..1b7733750ac2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4912,9 +4912,8 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino, return ERR_PTR(ret); }
-static int ext4_inode_blocks_set(handle_t *handle, - struct ext4_inode *raw_inode, - struct ext4_inode_info *ei) +static int ext4_inode_blocks_set(struct ext4_inode *raw_inode, + struct ext4_inode_info *ei) { struct inode *inode = &(ei->vfs_inode); u64 i_blocks = READ_ONCE(inode->i_blocks); @@ -5021,37 +5020,16 @@ static void ext4_update_other_inodes_time(struct super_block *sb, rcu_read_unlock(); }
-/* - * Post the struct inode info into an on-disk inode location in the - * buffer-cache. This gobbles the caller's reference to the - * buffer_head in the inode location struct. - * - * The caller must have write access to iloc->bh. - */ -static int ext4_do_update_inode(handle_t *handle, - struct inode *inode, - struct ext4_iloc *iloc) +static int ext4_fill_raw_inode(struct inode *inode, struct ext4_inode *raw_inode) { - struct ext4_inode *raw_inode = ext4_raw_inode(iloc); struct ext4_inode_info *ei = EXT4_I(inode); - struct buffer_head *bh = iloc->bh; - struct super_block *sb = inode->i_sb; - int err = 0, block; - int need_datasync = 0, set_large_file = 0; uid_t i_uid; gid_t i_gid; projid_t i_projid; + int block; + int err;
- spin_lock(&ei->i_raw_lock); - - /* - * For fields not tracked in the in-memory inode, initialise them - * to zero for new inodes. - */ - if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) - memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size); - - err = ext4_inode_blocks_set(handle, raw_inode, ei); + err = ext4_inode_blocks_set(raw_inode, ei);
raw_inode->i_mode = cpu_to_le16(inode->i_mode); i_uid = i_uid_read(inode); @@ -5093,16 +5071,8 @@ static int ext4_do_update_inode(handle_t *handle, raw_inode->i_file_acl_high = cpu_to_le16(ei->i_file_acl >> 32); raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl); - if (READ_ONCE(ei->i_disksize) != ext4_isize(inode->i_sb, raw_inode)) { - ext4_isize_set(raw_inode, ei->i_disksize); - need_datasync = 1; - } - if (ei->i_disksize > 0x7fffffffULL) { - if (!ext4_has_feature_large_file(sb) || - EXT4_SB(sb)->s_es->s_rev_level == - cpu_to_le32(EXT4_GOOD_OLD_REV)) - set_large_file = 1; - } + ext4_isize_set(raw_inode, ei->i_disksize); + raw_inode->i_generation = cpu_to_le32(inode->i_generation); if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) { if (old_valid_dev(inode->i_rdev)) { @@ -5142,6 +5112,45 @@ static int ext4_do_update_inode(handle_t *handle, raw_inode->i_projid = cpu_to_le32(i_projid);
ext4_inode_csum_set(inode, raw_inode, ei); + return err; +} + +/* + * Post the struct inode info into an on-disk inode location in the + * buffer-cache. This gobbles the caller's reference to the + * buffer_head in the inode location struct. + * + * The caller must have write access to iloc->bh. + */ +static int ext4_do_update_inode(handle_t *handle, + struct inode *inode, + struct ext4_iloc *iloc) +{ + struct ext4_inode *raw_inode = ext4_raw_inode(iloc); + struct ext4_inode_info *ei = EXT4_I(inode); + struct buffer_head *bh = iloc->bh; + struct super_block *sb = inode->i_sb; + int err; + int need_datasync = 0, set_large_file = 0; + + spin_lock(&ei->i_raw_lock); + + /* + * For fields not tracked in the in-memory inode, initialise them + * to zero for new inodes. + */ + if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) + memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size); + + if (READ_ONCE(ei->i_disksize) != ext4_isize(inode->i_sb, raw_inode)) + need_datasync = 1; + if (ei->i_disksize > 0x7fffffffULL) { + if (!ext4_has_feature_large_file(sb) || + EXT4_SB(sb)->s_es->s_rev_level == cpu_to_le32(EXT4_GOOD_OLD_REV)) + set_large_file = 1; + } + + err = ext4_fill_raw_inode(inode, raw_inode); spin_unlock(&ei->i_raw_lock); if (err) { EXT4_ERROR_INODE(inode, "corrupted inode contents");
From: Zhang Yi yi.zhang@huawei.com
hulk inclusion category: bugfix bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL ---------------------------
In preparation for calling ext4_fill_raw_inode() in __ext4_get_inode_loc(), move the three related functions ahead of __ext4_get_inode_loc(); no logical change.
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/inode.c | 293 ++++++++++++++++++++++++------------------------ 1 file changed, 147 insertions(+), 146 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 1b7733750ac2..d76ebbebd455 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4296,6 +4296,153 @@ int ext4_truncate(struct inode *inode) return err; }
+static inline u64 ext4_inode_peek_iversion(const struct inode *inode) +{ + if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)) + return inode_peek_iversion_raw(inode); + else + return inode_peek_iversion(inode); +} + +static int ext4_inode_blocks_set(struct ext4_inode *raw_inode, + struct ext4_inode_info *ei) +{ + struct inode *inode = &(ei->vfs_inode); + u64 i_blocks = READ_ONCE(inode->i_blocks); + struct super_block *sb = inode->i_sb; + + if (i_blocks <= ~0U) { + /* + * i_blocks can be represented in a 32 bit variable + * as multiple of 512 bytes + */ + raw_inode->i_blocks_lo = cpu_to_le32(i_blocks); + raw_inode->i_blocks_high = 0; + ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE); + return 0; + } + + /* + * This should never happen since sb->s_maxbytes should not have + * allowed this, sb->s_maxbytes was set according to the huge_file + * feature in ext4_fill_super(). + */ + if (!ext4_has_feature_huge_file(sb)) + return -EFSCORRUPTED; + + if (i_blocks <= 0xffffffffffffULL) { + /* + * i_blocks can be represented in a 48 bit variable + * as multiple of 512 bytes + */ + raw_inode->i_blocks_lo = cpu_to_le32(i_blocks); + raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32); + ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE); + } else { + ext4_set_inode_flag(inode, EXT4_INODE_HUGE_FILE); + /* i_block is stored in file system block size */ + i_blocks = i_blocks >> (inode->i_blkbits - 9); + raw_inode->i_blocks_lo = cpu_to_le32(i_blocks); + raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32); + } + return 0; +} + +static int ext4_fill_raw_inode(struct inode *inode, struct ext4_inode *raw_inode) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + uid_t i_uid; + gid_t i_gid; + projid_t i_projid; + int block; + int err; + + err = ext4_inode_blocks_set(raw_inode, ei); + + raw_inode->i_mode = cpu_to_le16(inode->i_mode); + i_uid = i_uid_read(inode); + i_gid = i_gid_read(inode); + i_projid = from_kprojid(&init_user_ns, ei->i_projid); + if (!(test_opt(inode->i_sb, NO_UID32))) { + raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid)); + raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid)); + /* + * Fix up interoperability with old kernels. Otherwise, + * old inodes get re-used with the upper 16 bits of the + * uid/gid intact.
+ */ + if (ei->i_dtime && list_empty(&ei->i_orphan)) { + raw_inode->i_uid_high = 0; + raw_inode->i_gid_high = 0; + } else { + raw_inode->i_uid_high = + cpu_to_le16(high_16_bits(i_uid)); + raw_inode->i_gid_high = + cpu_to_le16(high_16_bits(i_gid)); + } + } else { + raw_inode->i_uid_low = cpu_to_le16(fs_high2lowuid(i_uid)); + raw_inode->i_gid_low = cpu_to_le16(fs_high2lowgid(i_gid)); + raw_inode->i_uid_high = 0; + raw_inode->i_gid_high = 0; + } + raw_inode->i_links_count = cpu_to_le16(inode->i_nlink); + + EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); + EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); + EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode); + + raw_inode->i_dtime = cpu_to_le32(ei->i_dtime); + raw_inode->i_flags = cpu_to_le32(ei->i_flags & 0xFFFFFFFF); + if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) + raw_inode->i_file_acl_high = + cpu_to_le16(ei->i_file_acl >> 32); + raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl); + ext4_isize_set(raw_inode, ei->i_disksize); + + raw_inode->i_generation = cpu_to_le32(inode->i_generation); + if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) { + if (old_valid_dev(inode->i_rdev)) { + raw_inode->i_block[0] = + cpu_to_le32(old_encode_dev(inode->i_rdev)); + raw_inode->i_block[1] = 0; + } else { + raw_inode->i_block[0] = 0; + raw_inode->i_block[1] = + cpu_to_le32(new_encode_dev(inode->i_rdev)); + raw_inode->i_block[2] = 0; + } + } else if (!ext4_has_inline_data(inode)) { + for (block = 0; block < EXT4_N_BLOCKS; block++) + raw_inode->i_block[block] = ei->i_data[block]; + } + + if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { + u64 ivers = ext4_inode_peek_iversion(inode); + + raw_inode->i_disk_version = cpu_to_le32(ivers); + if (ei->i_extra_isize) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) + raw_inode->i_version_hi = + cpu_to_le32(ivers >> 32); + raw_inode->i_extra_isize = + cpu_to_le16(ei->i_extra_isize); + } + } + + if (i_projid != EXT4_DEF_PROJID && + !ext4_has_feature_project(inode->i_sb)) + err = err ?: -EFSCORRUPTED; + + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE && + EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)) + raw_inode->i_projid = cpu_to_le32(i_projid); + + ext4_inode_csum_set(inode, raw_inode, ei); + return err; +} + /* * ext4_get_inode_loc returns with an extra refcount against the inode's * underlying buffer_head on success. If 'in_mem' is true, we have all @@ -4590,13 +4737,6 @@ static inline void ext4_inode_set_iversion_queried(struct inode *inode, u64 val) else inode_set_iversion_queried(inode, val); } -static inline u64 ext4_inode_peek_iversion(const struct inode *inode) -{ - if (unlikely(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)) - return inode_peek_iversion_raw(inode); - else - return inode_peek_iversion(inode); -}
struct inode *__ext4_iget(struct super_block *sb, unsigned long ino, ext4_iget_flags flags, const char *function, @@ -4912,50 +5052,6 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino, return ERR_PTR(ret); }
-static int ext4_inode_blocks_set(struct ext4_inode *raw_inode, - struct ext4_inode_info *ei) -{ - struct inode *inode = &(ei->vfs_inode); - u64 i_blocks = READ_ONCE(inode->i_blocks); - struct super_block *sb = inode->i_sb; - - if (i_blocks <= ~0U) { - /* - * i_blocks can be represented in a 32 bit variable - * as multiple of 512 bytes - */ - raw_inode->i_blocks_lo = cpu_to_le32(i_blocks); - raw_inode->i_blocks_high = 0; - ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE); - return 0; - } - - /* - * This should never happen since sb->s_maxbytes should not have - * allowed this, sb->s_maxbytes was set according to the huge_file - * feature in ext4_fill_super(). - */ - if (!ext4_has_feature_huge_file(sb)) - return -EFSCORRUPTED; - - if (i_blocks <= 0xffffffffffffULL) { - /* - * i_blocks can be represented in a 48 bit variable - * as multiple of 512 bytes - */ - raw_inode->i_blocks_lo = cpu_to_le32(i_blocks); - raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32); - ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE); - } else { - ext4_set_inode_flag(inode, EXT4_INODE_HUGE_FILE); - /* i_block is stored in file system block size */ - i_blocks = i_blocks >> (inode->i_blkbits - 9); - raw_inode->i_blocks_lo = cpu_to_le32(i_blocks); - raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32); - } - return 0; -} - static void __ext4_update_other_inode_time(struct super_block *sb, unsigned long orig_ino, unsigned long ino, @@ -5020,101 +5116,6 @@ static void ext4_update_other_inodes_time(struct super_block *sb, rcu_read_unlock(); }
-static int ext4_fill_raw_inode(struct inode *inode, struct ext4_inode *raw_inode) -{ - struct ext4_inode_info *ei = EXT4_I(inode); - uid_t i_uid; - gid_t i_gid; - projid_t i_projid; - int block; - int err; - - err = ext4_inode_blocks_set(raw_inode, ei); - - raw_inode->i_mode = cpu_to_le16(inode->i_mode); - i_uid = i_uid_read(inode); - i_gid = i_gid_read(inode); - i_projid = from_kprojid(&init_user_ns, ei->i_projid); - if (!(test_opt(inode->i_sb, NO_UID32))) { - raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid)); - raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid)); - /* - * Fix up interoperability with old kernels. Otherwise, - * old inodes get re-used with the upper 16 bits of the - * uid/gid intact. - */ - if (ei->i_dtime && list_empty(&ei->i_orphan)) { - raw_inode->i_uid_high = 0; - raw_inode->i_gid_high = 0; - } else { - raw_inode->i_uid_high = - cpu_to_le16(high_16_bits(i_uid)); - raw_inode->i_gid_high = - cpu_to_le16(high_16_bits(i_gid)); - } - } else { - raw_inode->i_uid_low = cpu_to_le16(fs_high2lowuid(i_uid)); - raw_inode->i_gid_low = cpu_to_le16(fs_high2lowgid(i_gid)); - raw_inode->i_uid_high = 0; - raw_inode->i_gid_high = 0; - } - raw_inode->i_links_count = cpu_to_le16(inode->i_nlink); - - EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode); - EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode); - EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); - EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode); - - raw_inode->i_dtime = cpu_to_le32(ei->i_dtime); - raw_inode->i_flags = cpu_to_le32(ei->i_flags & 0xFFFFFFFF); - if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) - raw_inode->i_file_acl_high = - cpu_to_le16(ei->i_file_acl >> 32); - raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl); - ext4_isize_set(raw_inode, ei->i_disksize); - - raw_inode->i_generation = cpu_to_le32(inode->i_generation); - if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) { - if (old_valid_dev(inode->i_rdev)) { - raw_inode->i_block[0] = - cpu_to_le32(old_encode_dev(inode->i_rdev)); - raw_inode->i_block[1] = 0; - } else { - raw_inode->i_block[0] = 0; - raw_inode->i_block[1] = - cpu_to_le32(new_encode_dev(inode->i_rdev)); - raw_inode->i_block[2] = 0; - } - } else if (!ext4_has_inline_data(inode)) { - for (block = 0; block < EXT4_N_BLOCKS; block++) - raw_inode->i_block[block] = ei->i_data[block]; - } - - if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { - u64 ivers = ext4_inode_peek_iversion(inode); - - raw_inode->i_disk_version = cpu_to_le32(ivers); - if (ei->i_extra_isize) { - if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) - raw_inode->i_version_hi = - cpu_to_le32(ivers >> 32); - raw_inode->i_extra_isize = - cpu_to_le16(ei->i_extra_isize); - } - } - - if (i_projid != EXT4_DEF_PROJID && - !ext4_has_feature_project(inode->i_sb)) - err = err ?: -EFSCORRUPTED; - - if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE && - EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)) - raw_inode->i_projid = cpu_to_le32(i_projid); - - ext4_inode_csum_set(inode, raw_inode, ei); - return err; -} - /* * Post the struct inode info into an on-disk inode location in the * buffer-cache. This gobbles the caller's reference to the
From: Zhang Yi yi.zhang@huawei.com
hulk inclusion category: bugfix bugzilla: 174653 https://gitee.com/openeuler/kernel/issues/I4DDEL ---------------------------
In ext4_get_inode_loc(), we may skip the read I/O and get a zeroed yet uptodate inode buffer when the inode monopolizes an inode block, for performance reasons. In most cases, ext4_mark_iloc_dirty() will fill the inode buffer afterwards and make it fine, but we could miss this call if something bad happens. Then __ext4_get_inode_loc_noinmem() may get an empty inode buffer and trigger an ext4 error.
For example, if we remove a nonexistent xattr on inode A, ext4_xattr_set_handle() will return ENODATA before invoking ext4_mark_iloc_dirty(), leaving behind an uptodate but zeroed inode buffer. We will then get a checksum error message in ext4_iget() when getting the inode again.
EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid
Even worse, if we allocate another inode B in the same inode block, writing back inode B will corrupt inode A on disk.
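For concreteness, a hedged userspace sketch of the trigger (the mount point and file name are assumptions, not taken from the patch):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
	/* inode A on an ext4 mount; path is hypothetical */
	int fd = open("/mnt/ext4/a", O_CREAT | O_RDWR, 0600);

	if (fd == -1)
		perror("open");
	/* removing a nonexistent xattr makes ext4_xattr_set_handle()
	 * bail out with ENODATA before ext4_mark_iloc_dirty() runs */
	else if (removexattr("/mnt/ext4/a", "user.missing") == -1)
		perror("removexattr");	/* expected: ENODATA */
	/* a later re-read of the inode can then hit the checksum error
	 * shown above */
	return 0;
}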
So this patch initializes the inode buffer by filling in the in-memory inode contents if we skip the read I/O, ensuring that the buffer is really uptodate.
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/inode.c | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d76ebbebd455..a054f07a63a6 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4445,12 +4445,12 @@ static int ext4_fill_raw_inode(struct inode *inode, struct ext4_inode *raw_inode
/* * ext4_get_inode_loc returns with an extra refcount against the inode's - * underlying buffer_head on success. If 'in_mem' is true, we have all - * data in memory that is needed to recreate the on-disk version of this - * inode. + * underlying buffer_head on success. If we pass 'inode' and it does not + * have in-inode xattr, we have all inode data in memory that is needed + * to recreate the on-disk version of this inode. */ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino, - struct ext4_iloc *iloc, int in_mem, + struct inode *inode, struct ext4_iloc *iloc, ext4_fsblk_t *ret_block) { struct ext4_group_desc *gdp; @@ -4495,7 +4495,7 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino, * is the only valid inode in the block, we need not read the * block. */ - if (in_mem) { + if (inode && !ext4_test_inode_state(inode, EXT4_STATE_XATTR)) { struct buffer_head *bitmap_bh; int i, start;
@@ -4523,8 +4523,13 @@ static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino, } brelse(bitmap_bh); if (i == start + inodes_per_block) { + struct ext4_inode *raw_inode = + (struct ext4_inode *) (bh->b_data + iloc->offset); + /* all other inodes are free, so skip I/O */ memset(bh->b_data, 0, bh->b_size); + if (!ext4_test_inode_state(inode, EXT4_STATE_NEW)) + ext4_fill_raw_inode(inode, raw_inode); set_buffer_uptodate(bh); unlock_buffer(bh); goto has_buffer; @@ -4586,7 +4591,7 @@ static int __ext4_get_inode_loc_noinmem(struct inode *inode, ext4_fsblk_t err_blk; int ret;
- ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 0, + ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, NULL, iloc, &err_blk);
if (ret == -EIO) @@ -4601,9 +4606,8 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc) ext4_fsblk_t err_blk; int ret;
- /* We have all inode data except xattrs in memory here. */ - ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, - !ext4_test_inode_state(inode, EXT4_STATE_XATTR), &err_blk); + ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, inode, iloc, + &err_blk);
if (ret == -EIO) ext4_error_inode_block(inode, err_blk, EIO, @@ -4616,7 +4620,7 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc) int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino, struct ext4_iloc *iloc) { - return __ext4_get_inode_loc(sb, ino, iloc, 0, NULL); + return __ext4_get_inode_loc(sb, ino, NULL, iloc, NULL); }
static bool ext4_should_enable_dax(struct inode *inode)
From: Yu Kuai yukuai3@huawei.com
hulk inclusion category: bugfix bugzilla: 177149 https://gitee.com/openeuler/kernel/issues/I4DDEL
-----------------------------------------------
If blk-throttle is enabled and I/O is issued before blk_throtl_register_queue() is done, a divide-by-zero crash will be triggered in tg_may_dispatch() because 'throtl_slice' is still uninitialized.
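For illustration only, a conceptual sketch (not the kernel source) of why an uninitialized 'throtl_slice' faults: the slice bookkeeping rounds elapsed jiffies up to a multiple of throtl_slice, which divides by it:

static unsigned long round_to_slice(unsigned long jiffy_elapsed,
				    unsigned long throtl_slice)
{
	/* throtl_slice == 0 before blk_throtl_register_queue() runs,
	 * so this division is the crash described above */
	return ((jiffy_elapsed + throtl_slice - 1) / throtl_slice) *
	       throtl_slice;
}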
Thus, introduce a new flag, QUEUE_FLAG_THROTL_INIT_DONE. It is set after blk_throtl_register_queue() is done, and is checked before applying any config.
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- block/blk-sysfs.c | 7 +++++++ block/blk-throttle.c | 37 ++++++++++++++++++++++++++++++++++++- include/linux/blkdev.h | 1 + 3 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index b513f1683af0..66765740902b 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -910,6 +910,9 @@ int blk_register_queue(struct gendisk *disk) blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); wbt_enable_default(q); blk_throtl_register_queue(q); + spin_lock_irq(&q->queue_lock); + blk_queue_flag_set(QUEUE_FLAG_THROTL_INIT_DONE, q); + spin_unlock_irq(&q->queue_lock);
/* Now everything is ready and send out KOBJ_ADD uevent */ kobject_uevent(&q->kobj, KOBJ_ADD); @@ -942,6 +945,10 @@ void blk_unregister_queue(struct gendisk *disk) if (!blk_queue_registered(q)) return;
+ spin_lock_irq(&q->queue_lock); + blk_queue_flag_clear(QUEUE_FLAG_THROTL_INIT_DONE, q); + spin_unlock_irq(&q->queue_lock); + /* * Since sysfs_remove_dir() prevents adding new directory entries * before removal of existing entries starts, protect against diff --git a/block/blk-throttle.c b/block/blk-throttle.c index b771c4299982..6c327893314e 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -11,6 +11,7 @@ #include <linux/bio.h> #include <linux/blktrace_api.h> #include <linux/blk-cgroup.h> +#include <linux/delay.h> #include "blk.h" #include "blk-cgroup-rwstat.h"
@@ -1445,6 +1446,31 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global) } }
+static inline int throtl_check_init_done(struct request_queue *q) +{ + if (test_bit(QUEUE_FLAG_THROTL_INIT_DONE, &q->queue_flags)) + return 0; + + return blk_queue_dying(q) ? -ENODEV : -EBUSY; +} + +/* + * If throtl_check_init_done() return -EBUSY, we should retry after a short + * msleep(), since that throttle init will be completed in blk_register_queue() + * soon. + */ +static inline int throtl_restart_syscall_when_busy(int errno) +{ + int ret = errno; + + if (ret == -EBUSY) { + msleep(10); + ret = restart_syscall(); + } + + return ret; +} + static ssize_t tg_set_conf(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off, bool is_u64) { @@ -1458,6 +1484,10 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of, if (ret) return ret;
+ ret = throtl_check_init_done(ctx.disk->queue); + if (ret) + goto out_finish; + ret = -EINVAL; if (sscanf(ctx.body, "%llu", &v) != 1) goto out_finish; @@ -1475,6 +1505,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of, ret = 0; out_finish: blkg_conf_finish(&ctx); + ret = throtl_restart_syscall_when_busy(ret); return ret ?: nbytes; }
@@ -1650,8 +1681,11 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of, if (ret) return ret;
- tg = blkg_to_tg(ctx.blkg); + ret = throtl_check_init_done(ctx.disk->queue); + if (ret) + goto out_finish;
+ tg = blkg_to_tg(ctx.blkg); v[0] = tg->bps_conf[READ][index]; v[1] = tg->bps_conf[WRITE][index]; v[2] = tg->iops_conf[READ][index]; @@ -1747,6 +1781,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of, ret = 0; out_finish: blkg_conf_finish(&ctx); + ret = throtl_restart_syscall_when_busy(ret); return ret ?: nbytes; }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 00e71019f4f6..e8e2ab8a6742 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -593,6 +593,7 @@ struct request_queue { /* Keep blk_queue_flag_name[] in sync with the definitions below */ #define QUEUE_FLAG_STOPPED 0 /* queue is stopped */ #define QUEUE_FLAG_DYING 1 /* queue being torn down */ +#define QUEUE_FLAG_THROTL_INIT_DONE 2 /* io throttle can be online */ #define QUEUE_FLAG_NOMERGES 3 /* disable merge attempts */ #define QUEUE_FLAG_SAME_COMP 4 /* complete on same CPU-group */ #define QUEUE_FLAG_FAIL_IO 5 /* fake timeout */
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit fab827dbee8c2e06ca4ba000fa6c48bcf9054aba bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep, but forgot to adjust the setting for nested pid namespaces. As a result, pid memory is not accounted exactly where it is really needed, inside memcg-limited containers with their own pid namespaces.
Pid was one of the first kernel objects enabled for memcg accounting. init_pid_ns.pid_cachep is marked with SLAB_ACCOUNT, so we can expect that any new pids in the system are memcg-accounted.
However, I recently noticed that this is wrong. Nested pid namespaces create their own slab caches for pid objects; nested pids have an increased size because they contain ids for all parent pid namespaces as well as their own. The problem is that these slab caches are _NOT_ marked with SLAB_ACCOUNT; as a result, any pids allocated in nested pid namespaces are not memcg-accounted.
A pid struct in a nested pid namespace consumes up to 500 bytes of memory; 100000 such objects give us up to ~50MB of unaccounted memory, which allows a container to exceed its assigned memcg limits.
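A hedged reproducer sketch (needs privileges to unshare pid namespaces; the nesting count is illustrative): each unshare(CLONE_NEWPID) + fork() descends one namespace level, so the leaf process holds pid objects from nested, previously unaccounted slab caches:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	for (int level = 0; level < 4; level++) {
		if (unshare(CLONE_NEWPID) == -1) {
			perror("unshare");
			exit(1);
		}
		pid_t pid = fork();	/* child is pid 1 of the new ns */
		if (pid > 0) {
			waitpid(pid, NULL, 0);
			exit(0);
		}
	}
	pause();	/* leaf pins a pid object at every nesting level */
	return 0;
}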
Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") Cc: stable@vger.kernel.org Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Michal Koutný mkoutny@suse.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Christian Brauner christian.brauner@ubuntu.com Acked-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- kernel/pid_namespace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 9de21803a8ae..ef8733e2a476 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -51,7 +51,8 @@ static struct kmem_cache *create_pid_cachep(unsigned int level) mutex_lock(&pid_caches_mutex); /* Name collision forces to do allocation under mutex. */ if (!*pkc) - *pkc = kmem_cache_create(name, len, 0, SLAB_HWCACHE_ALIGN, 0); + *pkc = kmem_cache_create(name, len, 0, + SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, 0); mutex_unlock(&pid_caches_mutex); /* current can fail, but someone else can succeed. */ return READ_ONCE(*pkc);
From: Yutian Yang nglaive@gmail.com
mainline inclusion from mainline-v5.15-rc1 commit bb902cb47cf93b33cd92b3b7a4019330a03ef57f bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This patch adds accounting flags to the fs_context and legacy_fs_context allocation sites so that the kernel can correctly charge these objects.
We have written a PoC to demonstrate the effect of the missing-charging bugs. The PoC takes around 1,200MB of unaccounted memory, while it is charged for only 362MB of memory usage. We evaluated the PoC on QEMU x86_64 v5.2.90 + Linux kernel v5.10.19 + Debian buster. All the limits, including ulimits and sysctl variables, are left at their defaults. Specifically, the hard NOFILE limit and nr_open in sysctl are both 1,048,576.
/*------------------------- POC code ----------------------------*/

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

#define errExit(msg)	do { perror(msg); exit(EXIT_FAILURE); \
			} while (0)

#define STACK_SIZE (8 * 1024)

static inline int fsopen(const char *fs_name, unsigned int flags)
{
	return syscall(__NR_fsopen, fs_name, flags);
}

static char thread_stack[512][STACK_SIZE];

int thread_fn(void* arg)
{
	for (int i = 0; i < 800000; ++i) {
		int fsfd = fsopen("nfs", FSOPEN_CLOEXEC);
		if (fsfd == -1) {
			errExit("fsopen");
		}
	}
	while(1);
	return 0;
}

int main(int argc, char *argv[])
{
	int thread_pid;
	for (int i = 0; i < 1; ++i) {
		thread_pid = clone(thread_fn, thread_stack[i] + STACK_SIZE, \
				SIGCHLD, NULL);
	}
	while(1);
	return 0;
}

/*-------------------------- end --------------------------------*/
Link: https://lkml.kernel.org/r/1626517201-24086-1-git-send-email-nglaive@gmail.co... Signed-off-by: Yutian Yang nglaive@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Michal Hocko mhocko@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: shenwenbo@zju.edu.cn Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/fs_context.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/fs_context.c b/fs/fs_context.c index 2834d1afa6e8..4858645ca620 100644 --- a/fs/fs_context.c +++ b/fs/fs_context.c @@ -231,7 +231,7 @@ static struct fs_context *alloc_fs_context(struct file_system_type *fs_type, struct fs_context *fc; int ret = -ENOMEM;
- fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL); + fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL_ACCOUNT); if (!fc) return ERR_PTR(-ENOMEM);
@@ -631,7 +631,7 @@ const struct fs_context_operations legacy_fs_context_ops = { */ static int legacy_init_fs_context(struct fs_context *fc) { - fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL); + fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL_ACCOUNT); if (!fc->fs_private) return -ENOMEM; fc->ops = &legacy_fs_context_ops;
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 79f6540ba88dfb383ecf057a3425e668105ca774 bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "memcg accounting from OpenVZ", v7.
OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels. Initially we used our own accounting subsystem, then partially committed it upstream, and a few years ago switched to cgroups v1. Now we're rebasing again, revising our old patches and trying to push them upstream.
We try to protect the host system from any misuse of kernel memory allocation triggered by untrusted users inside the containers.
The patch set is addressed mostly to the cgroups maintainers and the cgroups@ mailing list, though I would be very grateful for any comments from maintainers of the affected subsystems or other people added in cc:
Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT
We have incorrect/incomplete/obsolete accounting for a few other kernel objects: sk_filter, af_packets, netlink, and xt_counters for iptables. They require rework and will probably be dropped altogether.
Also, we're going to add accounting for nft; however, it is not ready yet.
We have not measured performance on upstream; however, our performance team compared our current RHEL7-based production kernel with the corresponding original RHEL7 kernel and reports that it is at least no worse.
This patch (of 10):
The kernel allocates ~400 bytes of 'struct mount' for any new mount. Creating a new mount namespace clones most of the parent mounts, and this can be repeated many times. Additionally, each mount allocates up to PATH_MAX=4096 bytes for mnt->mnt_devname.
It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container.
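A hedged sketch of the allocation pattern (needs CAP_SYS_ADMIN; the process count is arbitrary): every child that unshares its mount namespace receives a cloned copy of the parent's mount tree, one 'struct mount' plus mnt_devname per mount:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	for (int i = 0; i < 256; i++) {
		if (fork() == 0) {
			if (unshare(CLONE_NEWNS) == -1)
				perror("unshare");
			pause();	/* hold the cloned mount tree */
			exit(0);
		}
	}
	pause();
	return 0;
}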
Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Christian Brauner christian.brauner@ubuntu.com Cc: Tejun Heo tj@kernel.org Cc: Michal Hocko mhocko@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Roman Gushchin guro@fb.com Cc: Yutian Yang nglaive@gmail.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Oleg Nesterov oleg@redhat.com Cc: Serge Hallyn serge@hallyn.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Zefan Li lizefan.x@bytedance.com Cc: Borislav Petkov bp@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/namespace.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c index 046b084136c5..7f1f89db511f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -183,7 +183,8 @@ static struct mount *alloc_vfsmnt(const char *name) goto out_free_cache;
if (name) { - mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL); + mnt->mnt_devname = kstrdup_const(name, + GFP_KERNEL_ACCOUNT); if (!mnt->mnt_devname) goto out_free_id; } @@ -3840,7 +3841,7 @@ void __init mnt_init(void) int err;
mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount), - 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); + 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
mount_hashtable = alloc_large_system_hash("Mount-cache", sizeof(struct hlist_head),
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 839d68206de869b8cb4272c5ea10da2ef7ec34cb bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
fasync_struct is used by almost all character device drivers to set up the fasync queue, and for regular files by the file lease code. This structure is quite small but long-lived, and it can be assigned to any open file.
It makes sense to account for its allocations to restrict the host's memory consumption from inside the memcg-limited container.
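A hedged sketch of how such an allocation is reached from userspace (pipes are one of many file types with fasync support):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int pfd[2];

	if (pipe(pfd) == -1)
		perror("pipe");
	if (fcntl(pfd[0], F_SETOWN, getpid()) == -1)
		perror("F_SETOWN");	/* pick the SIGIO recipient */
	if (fcntl(pfd[0], F_SETFL, O_ASYNC) == -1)
		perror("F_SETFL");	/* attaches a fasync_struct */
	pause();
	return 0;
}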
Link: https://lkml.kernel.org/r/1b408625-d71c-0b26-b0b6-9baf00f93e69@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/fcntl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index 05b36b28f2e8..4364ae99e8a8 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1041,7 +1041,8 @@ static int __init fcntl_init(void) __FMODE_EXEC | __FMODE_NONOTIFY));
fasync_cache = kmem_cache_create("fasync_cache", - sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL); + sizeof(struct fasync_struct), 0, + SLAB_PANIC | SLAB_ACCOUNT, NULL); return 0; }
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 30acd0bdfb86548172168a0cc71d455944de0683 bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A container admin can create new namespaces and force the kernel to allocate up to several pages of memory for the namespaces and their associated structures.
Net and uts namespaces already have accounting enabled for such allocations. It makes sense to account for the rest of them, to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Acked-by: Serge Hallyn serge@hallyn.com Acked-by: Christian Brauner christian.brauner@ubuntu.com Acked-by: Kirill Tkhai ktkhai@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: kernel/time/namespace.c Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/namespace.c | 2 +- ipc/namespace.c | 2 +- kernel/cgroup/namespace.c | 2 +- kernel/nsproxy.c | 2 +- kernel/pid_namespace.c | 2 +- kernel/time/namespace.c | 4 ++-- kernel/user_namespace.c | 2 +- 7 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c index 7f1f89db511f..6e76f2a72cfc 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3283,7 +3283,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a if (!ucounts) return ERR_PTR(-ENOSPC);
- new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL); + new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT); if (!new_ns) { dec_mnt_namespaces(ucounts); return ERR_PTR(-ENOMEM); diff --git a/ipc/namespace.c b/ipc/namespace.c index 24e7b45320f7..c94c05846141 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns, goto fail;
err = -ENOMEM; - ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL); + ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT); if (ns == NULL) goto fail_dec;
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c index 812a61afd538..12c5110466bc 100644 --- a/kernel/cgroup/namespace.c +++ b/kernel/cgroup/namespace.c @@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void) struct cgroup_namespace *new_ns; int ret;
- new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL); + new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT); if (!new_ns) return ERR_PTR(-ENOMEM); ret = ns_alloc_inum(&new_ns->ns); diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index ff138b24b25d..277ae1fadafe 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -609,6 +609,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
int __init nsproxy_cache_init(void) { - nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC); + nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT); return 0; } diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index ef8733e2a476..52c017feabcb 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -457,7 +457,7 @@ const struct proc_ns_operations pidns_for_children_operations = {
static __init int pid_namespaces_init(void) { - pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); + pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
#ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c index afc65e6be33e..00c20f7fdc02 100644 --- a/kernel/time/namespace.c +++ b/kernel/time/namespace.c @@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns, goto fail;
err = -ENOMEM; - ns = kmalloc(sizeof(*ns), GFP_KERNEL); + ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT); if (!ns) goto fail_dec;
kref_init(&ns->kref);
- ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO); + ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO); if (!ns->vvar_page) goto fail_free;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index b3edcfbbfecb..a9700fe9c722 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -1392,7 +1392,7 @@ const struct proc_ns_operations userns_operations = {
static __init int user_namespaces_init(void) { - user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); + user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT); return 0; } subsys_initcall(user_namespaces_init);
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 5f58c39819ff78ca5ddbba2b3cd8ff4779b19bb5 bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When a user sends a signal to any other process, it forces the kernel to allocate memory for 'struct sigqueue' objects. The number of pending signals is limited by the RLIMIT_SIGPENDING resource limit, but even the default settings allow each user to consume up to several megabytes of memory.
It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container.
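A hedged sketch of how a user pends many sigqueue objects (self-targeted for simplicity): blocking SIGRTMIN keeps each queued signal, and its kernel allocation, alive until RLIMIT_SIGPENDING is hit:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	sigset_t set;
	union sigval val = { .sival_int = 0 };

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	sigprocmask(SIG_BLOCK, &set, NULL);	/* signals stay pending */

	while (sigqueue(getpid(), SIGRTMIN, val) == 0)
		;	/* fails with EAGAIN at RLIMIT_SIGPENDING */
	perror("sigqueue");
	return 0;
}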
Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- kernel/signal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/signal.c b/kernel/signal.c index ec83b1fbb0d3..30e1b37a73e1 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -4598,7 +4598,7 @@ void __init signals_init(void) { siginfo_buildtime_checks();
- sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC); + sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT); }
#ifdef CONFIG_KGDB_KDB
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit c509723ec27e925bb91a20682c448e95d4bc8c9f bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A program may create multiple interval timers using timer_create(). For each timer, the kernel preallocates a "queued real-time signal"; consequently, the number of timers is limited by the RLIMIT_SIGPENDING resource limit. The allocated object is quite small, ~250 bytes, but even the default signal limits allow consuming up to ~100 megabytes per user.
It makes sense to account for them to limit the host's memory consumption from inside the memcg-limited container.
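A hedged sketch of the allocation path (link with -lrt on older glibc): every successful timer_create() keeps a k_itimer object, with its preallocated signal, in the kernel:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
				.sigev_signo = SIGRTMIN };
	timer_t id;
	long n = 0;

	while (timer_create(CLOCK_MONOTONIC, &sev, &id) == 0)
		n++;	/* fails with EAGAIN at RLIMIT_SIGPENDING */
	fprintf(stderr, "created %ld timers\n", n);
	pause();	/* keep the k_itimer objects allocated */
	return 0;
}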
Link: https://lkml.kernel.org/r/57795560-025c-267c-6b1a-dea852d95530@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- kernel/time/posix-timers.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index dd5697d7347b..7363f81dc31a 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp) static __init int init_posix_timers(void) { posix_timers_cache = kmem_cache_create("posix_timers_cache", - sizeof (struct k_itimer), 0, SLAB_PANIC, - NULL); + sizeof(struct k_itimer), 0, + SLAB_PANIC | SLAB_ACCOUNT, NULL); return 0; } __initcall(init_posix_timers);
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit ec403e2ae0dfc85996aad6e944a98a16e6dfcc6d bugzilla: 181858 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Each task can request its own LDT and force the kernel to allocate up to 64KB of memory per mm.
There are legitimate workloads with hundreds of processes and there can be hundreds of workloads running on large machines. The unaccounted memory can cause isolation issues between the workloads particularly on highly utilized machines.
It makes sense to account for these objects to restrict the host's memory consumption from inside the memcg-limited container.
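A hedged, x86-only sketch: installing a single LDT entry via modify_ldt() makes the kernel allocate the per-mm ldt_struct backing store:

#define _GNU_SOURCE
#include <asm/ldt.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct user_desc desc;

	memset(&desc, 0, sizeof(desc));
	desc.entry_number = 0;
	desc.limit = 0xfffff;
	desc.seg_32bit = 1;
	desc.limit_in_pages = 1;
	desc.useable = 1;

	if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
		perror("modify_ldt");
	pause();	/* keeps the ldt_struct alive for this mm */
	return 0;
}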
Link: https://lkml.kernel.org/r/38010594-50fe-c06d-7cb0-d1f77ca422f3@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Acked-by: Borislav Petkov bp@suse.de Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- arch/x86/kernel/ldt.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index b8aee71840ae..7694c541e3d8 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries) if (num_entries > LDT_ENTRIES) return NULL;
- new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL); + new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT); if (!new_ldt) return NULL;
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries) * than PAGE_SIZE. */ if (alloc_size > PAGE_SIZE) - new_ldt->entries = vzalloc(alloc_size); + new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO); else - new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL); + new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
if (!new_ldt->entries) { kfree(new_ldt);
From: Chen Jun chenjun102@huawei.com
maillist inclusion category: bugfix bugzilla: 182215 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://lore.kernel.org/linux-mm/20210922014122.47219-1-chenjun102@huawei.co...
-------------------------------------------------
We will get an unexpected value of /proc/sys/vm/overcommit_memory after running the following program.
int main()
{
	int fd = open("/proc/sys/vm/overcommit_memory", O_RDWR);

	write(fd, "1", 1);
	write(fd, "2", 1);
	close(fd);
}
write(fd, "2", 1) will pass *ppos = 1 to proc_dointvec_minmax. proc_dointvec_minmax will return 0 without setting new_policy.
t.data = &new_policy;
ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos)
  --> do_proc_dointvec
    --> __do_proc_dointvec
          if (write) {
                  if (proc_first_pos_non_zero_ignore(ppos, table))
                          goto out;
sysctl_overcommit_memory = new_policy;
so sysctl_overcommit_memory will be set to an uninitialized value.
Check whether new_policy has been changed by proc_dointvec_minmax.
Fixes: 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy") Signed-off-by: Chen Jun chenjun102@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- mm/util.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/util.c b/mm/util.c index 4ddb6e186dd5..d5be67771850 100644 --- a/mm/util.c +++ b/mm/util.c @@ -756,7 +756,7 @@ int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct ctl_table t; - int new_policy; + int new_policy = -1; int ret;
/* @@ -774,7 +774,7 @@ int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer, t = *table; t.data = &new_policy; ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); - if (ret) + if (ret || new_policy == -1) return ret;
mm_compute_batch(new_policy);
From: Yang Jihong yangjihong1@huawei.com
mainline inclusion from mainline-v5.14-rc1 commit 4bcbe438b3baaeb532dd50a5f002aed56c197e2a category: feature bugzilla: 182229 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
The "auxtrace_info" and "auxtrace" functions are not set in "tool" member of "annotate". As a result, perf annotate does not support parsing itrace data.
Before:
  # perf record -e arm_spe_0/branch_filter=1/ -a sleep 1
  [ perf record: Woken up 9 times to write data ]
  [ perf record: Captured and wrote 20.874 MB perf.data ]
  # perf annotate --stdio
  Error: The perf.data data has no samples!
Solution:
1. Add itrace options in help.
2. Set hook functions of "id_index", "auxtrace_info" and "auxtrace" in perf_tool.
After:
  # perf record --all-user -e arm_spe_0/branch_filter=1/ ls
  Couldn't synthesize bpf events.
  perf.data
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.010 MB perf.data ]
  # perf annotate --stdio
  Percent | Source code & Disassembly of libc-2.28.so for branch-miss (1 samples, percent: local period)
  ------------------------------------------------------------------------------------------------------------
           :
           :
           :
           : Disassembly of section .text:
           :
           : 0000000000066180 <__getdelim@@GLIBC_2.17>:
      0.00 : 66180: stp x29, x30, [sp, #-96]!
      0.00 : 66184: cmp x0, #0x0
      0.00 : 66188: ccmp x1, #0x0, #0x4, ne // ne = any
      0.00 : 6618c: mov x29, sp
      0.00 : 66190: stp x24, x25, [sp, #56]
      0.00 : 66194: stp x26, x27, [sp, #72]
      0.00 : 66198: str x28, [sp, #88]
      0.00 : 6619c: b.eq 66450 <__getdelim@@GLIBC_2.17+0x2d0> // b.none
      0.00 : 661a0: stp x22, x23, [x29, #40]
      0.00 : 661a4: mov x22, x1
      0.00 : 661a8: ldr w1, [x3]
      0.00 : 661ac: mov w23, w2
      0.00 : 661b0: stp x20, x21, [x29, #24]
      0.00 : 661b4: mov x20, x3
      0.00 : 661b8: mov x21, x0
      0.00 : 661bc: tbnz w1, #15, 66360 <__getdelim@@GLIBC_2.17+0x1e0>
      0.00 : 661c0: ldr x0, [x3, #136]
      0.00 : 661c4: ldr x2, [x0, #8]
      0.00 : 661c8: str x19, [x29, #16]
      0.00 : 661cc: mrs x19, tpidr_el0
      0.00 : 661d0: sub x19, x19, #0x700
      0.00 : 661d4: cmp x2, x19
      0.00 : 661d8: b.eq 663f0 <__getdelim@@GLIBC_2.17+0x270> // b.none
      0.00 : 661dc: mov w1, #0x1 // #1
      0.00 : 661e0: ldaxr w2, [x0]
      0.00 : 661e4: cmp w2, #0x0
      0.00 : 661e8: b.ne 661f4 <__getdelim@@GLIBC_2.17+0x74> // b.any
      0.00 : 661ec: stxr w3, w1, [x0]
      0.00 : 661f0: cbnz w3, 661e0 <__getdelim@@GLIBC_2.17+0x60>
      0.00 : 661f4: b.ne 66448 <__getdelim@@GLIBC_2.17+0x2c8> // b.any
      0.00 : 661f8: ldr x0, [x20, #136]
      0.00 : 661fc: ldr w1, [x20]
      0.00 : 66200: ldr w2, [x0, #4]
      0.00 : 66204: str x19, [x0, #8]
      0.00 : 66208: add w2, w2, #0x1
      0.00 : 6620c: str w2, [x0, #4]
      0.00 : 66210: tbnz w1, #5, 66388 <__getdelim@@GLIBC_2.17+0x208>
      0.00 : 66214: ldr x19, [x29, #16]
      0.00 : 66218: ldr x0, [x21]
      0.00 : 6621c: cbz x0, 66228 <__getdelim@@GLIBC_2.17+0xa8>
      0.00 : 66220: ldr x0, [x22]
      0.00 : 66224: cbnz x0, 6623c <__getdelim@@GLIBC_2.17+0xbc>
      0.00 : 66228: mov x0, #0x78 // #120
      0.00 : 6622c: str x0, [x22]
      0.00 : 66230: bl 20710 <malloc@plt>
      0.00 : 66234: str x0, [x21]
      0.00 : 66238: cbz x0, 66428 <__getdelim@@GLIBC_2.17+0x2a8>
      0.00 : 6623c: ldr x27, [x20, #8]
      0.00 : 66240: str x19, [x29, #16]
      0.00 : 66244: ldr x19, [x20, #16]
      0.00 : 66248: sub x19, x19, x27
      0.00 : 6624c: cmp x19, #0x0
      0.00 : 66250: b.le 66398 <__getdelim@@GLIBC_2.17+0x218>
      0.00 : 66254: mov x25, #0x0 // #0
      0.00 : 66258: b 662d8 <__getdelim@@GLIBC_2.17+0x158>
      0.00 : 6625c: nop
      0.00 : 66260: add x24, x19, x25
      0.00 : 66264: ldr x3, [x22]
      0.00 : 66268: add x26, x24, #0x1
      0.00 : 6626c: ldr x0, [x21]
      0.00 : 66270: cmp x3, x26
      0.00 : 66274: b.cs 6629c <__getdelim@@GLIBC_2.17+0x11c> // b.hs, b.nlast
      0.00 : 66278: lsl x3, x3, #1
      0.00 : 6627c: cmp x3, x26
      0.00 : 66280: csel x26, x3, x26, cs // cs = hs, nlast
      0.00 : 66284: mov x1, x26
      0.00 : 66288: bl 206f0 <realloc@plt>
      0.00 : 6628c: cbz x0, 66438 <__getdelim@@GLIBC_2.17+0x2b8>
      0.00 : 66290: str x0, [x21]
      0.00 : 66294: ldr x27, [x20, #8]
      0.00 : 66298: str x26, [x22]
      0.00 : 6629c: mov x2, x19
      0.00 : 662a0: mov x1, x27
      0.00 : 662a4: add x0, x0, x25
      0.00 : 662a8: bl 87390 <explicit_bzero@@GLIBC_2.25+0x50>
      0.00 : 662ac: ldr x0, [x20, #8]
      0.00 : 662b0: add x19, x0, x19
      0.00 : 662b4: str x19, [x20, #8]
      0.00 : 662b8: cbnz x28, 66410 <__getdelim@@GLIBC_2.17+0x290>
      0.00 : 662bc: mov x0, x20
      0.00 : 662c0: bl 73b80 <__underflow@@GLIBC_2.17>
      0.00 : 662c4: cmn w0, #0x1
      0.00 : 662c8: b.eq 66410 <__getdelim@@GLIBC_2.17+0x290> // b.none
      0.00 : 662cc: ldp x27, x19, [x20, #8]
      0.00 : 662d0: mov x25, x24
      0.00 : 662d4: sub x19, x19, x27
      0.00 : 662d8: mov x2, x19
      0.00 : 662dc: mov w1, w23
      0.00 : 662e0: mov x0, x27
      0.00 : 662e4: bl 807b0 <memchr@@GLIBC_2.17>
      0.00 : 662e8: cmp x0, #0x0
      0.00 : 662ec: mov x28, x0
      0.00 : 662f0: sub x0, x0, x27
      0.00 : 662f4: csinc x19, x19, x0, eq // eq = none
      0.00 : 662f8: mov x0, #0x7fffffffffffffff // #9223372036854775807
      0.00 : 662fc: sub x0, x0, x25
      0.00 : 66300: cmp x19, x0
      0.00 : 66304: b.lt 66260 <__getdelim@@GLIBC_2.17+0xe0> // b.tstop
      0.00 : 66308: adrp x0, 17f000 <sys_sigabbrev@@GLIBC_2.17+0x320>
      0.00 : 6630c: ldr x0, [x0, #3624]
      0.00 : 66310: mrs x2, tpidr_el0
      0.00 : 66314: ldr x19, [x29, #16]
      0.00 : 66318: mov w3, #0x4b // #75
      0.00 : 6631c: ldr w1, [x20]
      0.00 : 66320: mov x24, #0xffffffffffffffff // #-1
      0.00 : 66324: str w3, [x2, x0]
      0.00 : 66328: tbnz w1, #15, 66340 <__getdelim@@GLIBC_2.17+0x1c0>
      0.00 : 6632c: ldr x0, [x20, #136]
      0.00 : 66330: ldr w1, [x0, #4]
      0.00 : 66334: sub w1, w1, #0x1
      0.00 : 66338: str w1, [x0, #4]
      0.00 : 6633c: cbz w1, 663b8 <__getdelim@@GLIBC_2.17+0x238>
      0.00 : 66340: mov x0, x24
      0.00 : 66344: ldr x28, [sp, #88]
      0.00 : 66348: ldp x20, x21, [x29, #24]
      0.00 : 6634c: ldp x22, x23, [x29, #40]
      0.00 : 66350: ldp x24, x25, [sp, #56]
      0.00 : 66354: ldp x26, x27, [sp, #72]
      0.00 : 66358: ldp x29, x30, [sp], #96
      0.00 : 6635c: ret
    100.00 : 66360: tbz w1, #5, 66218 <__getdelim@@GLIBC_2.17+0x98>
      0.00 : 66364: ldp x20, x21, [x29, #24]
      0.00 : 66368: mov x24, #0xffffffffffffffff // #-1
      0.00 : 6636c: ldp x22, x23, [x29, #40]
      0.00 : 66370: mov x0, x24
      0.00 : 66374: ldp x24, x25, [sp, #56]
      0.00 : 66378: ldp x26, x27, [sp, #72]
      0.00 : 6637c: ldr x28, [sp, #88]
      0.00 : 66380: ldp x29, x30, [sp], #96
      0.00 : 66384: ret
      0.00 : 66388: mov x24, #0xffffffffffffffff // #-1
      0.00 : 6638c: ldr x19, [x29, #16]
      0.00 : 66390: b 66328 <__getdelim@@GLIBC_2.17+0x1a8>
      0.00 : 66394: nop
      0.00 : 66398: mov x0, x20
      0.00 : 6639c: bl 73b80 <__underflow@@GLIBC_2.17>
      0.00 : 663a0: cmn w0, #0x1
      0.00 : 663a4: b.eq 66438 <__getdelim@@GLIBC_2.17+0x2b8> // b.none
      0.00 : 663a8: ldp x27, x19, [x20, #8]
      0.00 : 663ac: sub x19, x19, x27
      0.00 : 663b0: b 66254 <__getdelim@@GLIBC_2.17+0xd4>
      0.00 : 663b4: nop
      0.00 : 663b8: str xzr, [x0, #8]
      0.00 : 663bc: ldxr w2, [x0]
      0.00 : 663c0: stlxr w3, w1, [x0]
      0.00 : 663c4: cbnz w3, 663bc <__getdelim@@GLIBC_2.17+0x23c>
      0.00 : 663c8: cmp w2, #0x1
      0.00 : 663cc: b.le 66340 <__getdelim@@GLIBC_2.17+0x1c0>
      0.00 : 663d0: mov x1, #0x81 // #129
      0.00 : 663d4: mov x2, #0x1 // #1
      0.00 : 663d8: mov x3, #0x0 // #0
      0.00 : 663dc: mov x8, #0x62 // #98
      0.00 : 663e0: svc #0x0
      0.00 : 663e4: ldp x20, x21, [x29, #24]
      0.00 : 663e8: ldp x22, x23, [x29, #40]
      0.00 : 663ec: b 66370 <__getdelim@@GLIBC_2.17+0x1f0>
      0.00 : 663f0: ldr w2, [x0, #4]
      0.00 : 663f4: add w2, w2, #0x1
      0.00 : 663f8: str w2, [x0, #4]
      0.00 : 663fc: tbz w1, #5, 66214 <__getdelim@@GLIBC_2.17+0x94>
      0.00 : 66400: mov x24, #0xffffffffffffffff // #-1
      0.00 : 66404: ldr x19, [x29, #16]
      0.00 : 66408: b 66330 <__getdelim@@GLIBC_2.17+0x1b0>
      0.00 : 6640c: nop
      0.00 : 66410: ldr x0, [x21]
      0.00 : 66414: strb wzr, [x0, x24]
      0.00 : 66418: ldr w1, [x20]
      0.00 : 6641c: ldr x19, [x29, #16]
      0.00 : 66420: b 66328 <__getdelim@@GLIBC_2.17+0x1a8>
      0.00 : 66424: nop
      0.00 : 66428: mov x24, #0xffffffffffffffff // #-1
      0.00 : 6642c: ldr w1, [x20]
      0.00 : 66430: b 66328 <__getdelim@@GLIBC_2.17+0x1a8>
      0.00 : 66434: nop
      0.00 : 66438: mov x24, #0xffffffffffffffff // #-1
      0.00 : 6643c: ldr w1, [x20]
      0.00 : 66440: ldr x19, [x29, #16]
      0.00 : 66444: b 66328 <__getdelim@@GLIBC_2.17+0x1a8>
      0.00 : 66448: bl e3ba0 <pthread_setcanceltype@@GLIBC_2.17+0x30>
      0.00 : 6644c: b 661f8 <__getdelim@@GLIBC_2.17+0x78>
      0.00 : 66450: adrp x0, 17f000 <sys_sigabbrev@@GLIBC_2.17+0x320>
      0.00 : 66454: ldr x0, [x0, #3624]
      0.00 : 66458: mrs x1, tpidr_el0
      0.00 : 6645c: mov w2, #0x16 // #22
      0.00 : 66460: mov x24, #0xffffffffffffffff // #-1
      0.00 : 66464: str w2, [x1, x0]
      0.00 : 66468: b 66370 <__getdelim@@GLIBC_2.17+0x1f0>
      0.00 : 6646c: ldr w1, [x20]
      0.00 : 66470: mov x4, x0
      0.00 : 66474: tbnz w1, #15, 6648c <__getdelim@@GLIBC_2.17+0x30c>
      0.00 : 66478: ldr x0, [x20, #136]
      0.00 : 6647c: ldr w1, [x0, #4]
      0.00 : 66480: sub w1, w1, #0x1
      0.00 : 66484: str w1, [x0, #4]
      0.00 : 66488: cbz w1, 66494 <__getdelim@@GLIBC_2.17+0x314>
      0.00 : 6648c: mov x0, x4
      0.00 : 66490: bl 20e40 <gnu_get_libc_version@@GLIBC_2.17+0x130>
      0.00 : 66494: str xzr, [x0, #8]
      0.00 : 66498: ldxr w2, [x0]
      0.00 : 6649c: stlxr w3, w1, [x0]
      0.00 : 664a0: cbnz w3, 66498 <__getdelim@@GLIBC_2.17+0x318>
      0.00 : 664a4: cmp w2, #0x1
      0.00 : 664a8: b.le 6648c <__getdelim@@GLIBC_2.17+0x30c>
      0.00 : 664ac: mov x1, #0x81 // #129
      0.00 : 664b0: mov x2, #0x1 // #1
      0.00 : 664b4: mov x3, #0x0 // #0
      0.00 : 664b8: mov x8, #0x62 // #98
      0.00 : 664bc: svc #0x0
      0.00 : 664c0: b 6648c <__getdelim@@GLIBC_2.17+0x30c>
Signed-off-by: Yang Jihong yangjihong1@huawei.com Tested-by: Leo Yan leo.yan@linaro.org Acked-by: Adrian Hunter adrian.hunter@intel.com Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Jiri Olsa jolsa@redhat.com Cc: Mark Rutland mark.rutland@arm.com Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Link: http://lore.kernel.org/lkml/20210615091704.259202-1-yangjihong1@huawei.com Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Li Huafei lihuafei1@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- tools/perf/Documentation/perf-annotate.txt | 7 +++++++ tools/perf/builtin-annotate.c | 11 +++++++++++ 2 files changed, 18 insertions(+)
diff --git a/tools/perf/Documentation/perf-annotate.txt b/tools/perf/Documentation/perf-annotate.txt index 1b5042f134a8..7b594283c21c 100644 --- a/tools/perf/Documentation/perf-annotate.txt +++ b/tools/perf/Documentation/perf-annotate.txt @@ -58,6 +58,13 @@ OPTIONS --ignore-vmlinux:: Ignore vmlinux files.
+--itrace:: + Options for decoding instruction tracing data. The options are: + +include::itrace.txt[] + + To disable decoding entirely, use --no-itrace. + -m:: --modules:: Load module symbols. WARNING: use only with -k and LIVE kernel. diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c index 4940d10074c3..25d0d3d52a05 100644 --- a/tools/perf/builtin-annotate.c +++ b/tools/perf/builtin-annotate.c @@ -481,6 +481,9 @@ int cmd_annotate(int argc, const char **argv) .attr = perf_event__process_attr, .build_id = perf_event__process_build_id, .tracing_data = perf_event__process_tracing_data, + .id_index = perf_event__process_id_index, + .auxtrace_info = perf_event__process_auxtrace_info, + .auxtrace = perf_event__process_auxtrace, .feature = process_feature_event, .ordered_events = true, .ordering_requires_timestamps = true, @@ -490,6 +493,9 @@ int cmd_annotate(int argc, const char **argv) struct perf_data data = { .mode = PERF_DATA_MODE_READ, }; + struct itrace_synth_opts itrace_synth_opts = { + .set = 0, + }; struct option options[] = { OPT_STRING('i', "input", &input_name, "file", "input file name"), @@ -550,6 +556,9 @@ int cmd_annotate(int argc, const char **argv) OPT_CALLBACK(0, "percent-type", &annotate.opts, "local-period", "Set percent type local/global-period/hits", annotate_parse_percent_type), + OPT_CALLBACK_OPTARG(0, "itrace", &itrace_synth_opts, NULL, "opts", + "Instruction Tracing options\n" ITRACE_HELP, + itrace_parse_synth_opts),
OPT_END() }; @@ -594,6 +603,8 @@ int cmd_annotate(int argc, const char **argv) if (IS_ERR(annotate.session)) return PTR_ERR(annotate.session);
+ annotate.session->itrace_synth_opts = &itrace_synth_opts; + annotate.has_br_stack = perf_header__has_feat(&annotate.session->header, HEADER_BRANCH_STACK);
From: Yang Yang yang.yang@vivo.com
mainline inclusion from mainline-5.12-rc1 commit ffa772cfe9356ce94d3061335c2681f60e7c1c5b category: bugfix bugzilla: 182133 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
A hang occurs when the user changes the scheduler queue depth by writing to the 'nr_requests' sysfs file of that device.
The details of the environment in which we found the problem are as follows:
  an eMMC block device
  total driver tags: 16
  default queue_depth: 32
  kqd->async_depth initialized in kyber_init_sched() with queue_depth=32
Then we change queue_depth to 256 by writing to the 'nr_requests' sysfs file. But kqd->async_depth is not updated after queue_depth changes. The value of async_depth is now too small for queue_depth=256, which may cause a hang.
This patch introduces kyber_depth_updated(), so that kyber can update async depth when queue depth changes.
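As a worked illustration of the recomputation (KYBER_ASYNC_PERCENT is 75; the sbitmap word shift of 6 is an assumed value for 64-bit words, not taken from the patch):

/* mirrors: kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U */
#include <stdio.h>

int main(void)
{
	unsigned int shift = 6;		/* assumed sb.shift */
	unsigned int async_depth = (1U << shift) * 75 / 100U;

	printf("async_depth = %u\n", async_depth);	/* prints 48 */
	return 0;
}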
Signed-off-by: Yang Yang yang.yang@vivo.com Reviewed-by: Omar Sandoval osandov@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- block/kyber-iosched.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c index 7f9ef773bf44..d2648356e430 100644 --- a/block/kyber-iosched.c +++ b/block/kyber-iosched.c @@ -353,19 +353,9 @@ static void kyber_timer_fn(struct timer_list *t) } }
-static unsigned int kyber_sched_tags_shift(struct request_queue *q) -{ - /* - * All of the hardware queues have the same depth, so we can just grab - * the shift of the first one. - */ - return q->queue_hw_ctx[0]->sched_tags->bitmap_tags->sb.shift; -} - static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q) { struct kyber_queue_data *kqd; - unsigned int shift; int ret = -ENOMEM; int i;
@@ -400,9 +390,6 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q) kqd->latency_targets[i] = kyber_latency_targets[i]; }
- shift = kyber_sched_tags_shift(q); - kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U; - return kqd;
err_buckets: @@ -458,9 +445,19 @@ static void kyber_ctx_queue_init(struct kyber_ctx_queue *kcq) INIT_LIST_HEAD(&kcq->rq_list[i]); }
-static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) +static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx) { struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data; + struct blk_mq_tags *tags = hctx->sched_tags; + unsigned int shift = tags->bitmap_tags->sb.shift; + + kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U; + + sbitmap_queue_min_shallow_depth(tags->bitmap_tags, kqd->async_depth); +} + +static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) +{ struct kyber_hctx_data *khd; int i;
@@ -502,8 +499,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) khd->batching = 0;
hctx->sched_data = khd; - sbitmap_queue_min_shallow_depth(hctx->sched_tags->bitmap_tags, - kqd->async_depth); + kyber_depth_updated(hctx);
return 0;
@@ -1023,6 +1019,7 @@ static struct elevator_type kyber_sched = { .completed_request = kyber_completed_request, .dispatch_request = kyber_dispatch_request, .has_work = kyber_has_work, + .depth_updated = kyber_depth_updated, }, #ifdef CONFIG_BLK_DEBUG_FS .queue_debugfs_attrs = kyber_queue_debugfs_attrs,
From: Al Viro viro@zeniv.linux.org.uk
mainline inclusion from mainline-5.14-rc1 commit ffb37ca3bd16ce6ea2df2f87fde9a31e94ebb54b category: bugfix bugzilla: 181657 https://gitee.com/openeuler/kernel/issues/I4DDEL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
... and provide file_open_root_mnt(), using the root of the given mount.
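A minimal before/after usage sketch (variable names are illustrative; both helpers appear in the diff below):

    /* old calling convention: dentry and vfsmount passed separately */
    file = file_open_root(mnt->mnt_root, mnt, name, O_RDONLY, 0);

    /* new: callers hand in a struct path ... */
    struct path root = { .mnt = mnt, .dentry = mnt->mnt_root };
    file = file_open_root(&root, name, O_RDONLY, 0);

    /* ... or use the new helper when opening at the root of a mount */
    file = file_open_root_mnt(mnt, name, O_RDONLY, 0);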
Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Conflicts: Documentation/filesystems/porting.rst [ Non-bugfix 14e43bf4356126("vfs: don't unnecessarily clone write access for writable fd") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com [Roberto Sassu: Adjust file_open_root() called by load_digest_list()] Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/filesystems/porting.rst | 9 +++++++++ arch/um/drivers/mconsole_kern.c | 2 +- fs/coredump.c | 4 ++-- fs/fhandle.c | 2 +- fs/internal.h | 2 +- fs/kernel_read_file.c | 2 +- fs/namei.c | 8 +++----- fs/open.c | 4 ++-- fs/proc/proc_sysctl.c | 2 +- include/linux/fs.h | 9 ++++++++- kernel/usermode_driver.c | 2 +- security/integrity/ima/ima_digest_list.c | 2 +- 12 files changed, 31 insertions(+), 17 deletions(-)
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 867036aa90b8..b00fb6313b58 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -865,3 +865,12 @@ no matter what. Everything is handled by the caller.
clone_private_mount() returns a longterm mount now, so the proper destructor of its result is kern_unmount() or kern_unmount_array(). + +--- + +**mandatory** + +Calling conventions for file_open_root() changed; now it takes struct path * +instead of passing mount and dentry separately. For callers that used to +pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper +is provided - file_open_root_mnt(). In-tree users adjusted. diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c index a2e680f7d39f..6a22ead31c5b 100644 --- a/arch/um/drivers/mconsole_kern.c +++ b/arch/um/drivers/mconsole_kern.c @@ -140,7 +140,7 @@ void mconsole_proc(struct mc_request *req) mconsole_reply(req, "Proc not available", 1, 0); goto out; } - file = file_open_root(mnt->mnt_root, mnt, ptr, O_RDONLY, 0); + file = file_open_root_mnt(mnt, ptr, O_RDONLY, 0); if (IS_ERR(file)) { mconsole_reply(req, "Failed to open file", 1, 0); printk(KERN_ERR "open /proc/%s: %ld\n", ptr, PTR_ERR(file)); diff --git a/fs/coredump.c b/fs/coredump.c index c6acfc694f65..bdd83e404113 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -755,8 +755,8 @@ void do_coredump(const kernel_siginfo_t *siginfo) task_lock(&init_task); get_fs_root(init_task.fs, &root); task_unlock(&init_task); - cprm.file = file_open_root(root.dentry, root.mnt, - cn.corename, open_flags, 0600); + cprm.file = file_open_root(&root, cn.corename, + open_flags, 0600); path_put(&root); } else { cprm.file = filp_open(cn.corename, open_flags, 0600); diff --git a/fs/fhandle.c b/fs/fhandle.c index 01263ffbc4c0..718defdf1e0e 100644 --- a/fs/fhandle.c +++ b/fs/fhandle.c @@ -229,7 +229,7 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh, path_put(&path); return fd; } - file = file_open_root(path.dentry, path.mnt, "", open_flag, 0); + file = file_open_root(&path, "", open_flag, 0); if (IS_ERR(file)) { put_unused_fd(fd); retval = PTR_ERR(file); diff --git a/fs/internal.h b/fs/internal.h index 5155f6ce95c7..0d4f7e4e2f3a 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -128,7 +128,7 @@ struct open_flags { }; extern struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op); -extern struct file *do_file_open_root(struct dentry *, struct vfsmount *, +extern struct file *do_file_open_root(const struct path *, const char *, const struct open_flags *); extern struct open_how build_open_how(int flags, umode_t mode); extern int build_open_flags(const struct open_how *how, struct open_flags *op); diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c index 90d255fbdd9b..87aac4c72c37 100644 --- a/fs/kernel_read_file.c +++ b/fs/kernel_read_file.c @@ -160,7 +160,7 @@ int kernel_read_file_from_path_initns(const char *path, loff_t offset, get_fs_root(init_task.fs, &root); task_unlock(&init_task);
- file = file_open_root(root.dentry, root.mnt, path, O_RDONLY, 0); + file = file_open_root(&root, path, O_RDONLY, 0); path_put(&root); if (IS_ERR(file)) return PTR_ERR(file); diff --git a/fs/namei.c b/fs/namei.c index 4c9d0c36545d..130aa5694f48 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3393,7 +3393,7 @@ struct file *do_filp_open(int dfd, struct filename *pathname, return filp; }
-struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt, +struct file *do_file_open_root(const struct path *root, const char *name, const struct open_flags *op) { struct nameidata nd; @@ -3401,16 +3401,14 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt, struct filename *filename; int flags = op->lookup_flags | LOOKUP_ROOT;
- nd.root.mnt = mnt; - nd.root.dentry = dentry; - - if (d_is_symlink(dentry) && op->intent & LOOKUP_OPEN) + if (d_is_symlink(root->dentry) && op->intent & LOOKUP_OPEN) return ERR_PTR(-ELOOP);
filename = getname_kernel(name); if (IS_ERR(filename)) return ERR_CAST(filename);
+ nd.root = *root; set_nameidata(&nd, -1, filename); file = path_openat(&nd, op, flags | LOOKUP_RCU); if (unlikely(file == ERR_PTR(-ECHILD))) diff --git a/fs/open.c b/fs/open.c index 3aaaad47d9ca..7ce64c08bb40 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1149,7 +1149,7 @@ struct file *filp_open(const char *filename, int flags, umode_t mode) } EXPORT_SYMBOL(filp_open);
-struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt, +struct file *file_open_root(const struct path *root, const char *filename, int flags, umode_t mode) { struct open_flags op; @@ -1157,7 +1157,7 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt, int err = build_open_flags(&how, &op); if (err) return ERR_PTR(err); - return do_file_open_root(dentry, mnt, filename, &op); + return do_file_open_root(root, filename, &op); } EXPORT_SYMBOL(file_open_root);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 070d2df8ab9c..ffed75f833b7 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -1803,7 +1803,7 @@ static int process_sysctl_arg(char *param, char *val, panic("%s: Failed to allocate path for %s\n", __func__, param); strreplace(path, '.', '/');
- file = file_open_root((*proc_mnt)->mnt_root, *proc_mnt, path, O_WRONLY, 0); + file = file_open_root_mnt(*proc_mnt, path, O_WRONLY, 0); if (IS_ERR(file)) { err = PTR_ERR(file); if (err == -ENOENT) diff --git a/include/linux/fs.h b/include/linux/fs.h index 0624c28350c1..6b53f6322225 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -39,6 +39,7 @@ #include <linux/fs_types.h> #include <linux/build_bug.h> #include <linux/stddef.h> +#include <linux/mount.h>
#include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -2528,8 +2529,14 @@ extern long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode); extern struct file *file_open_name(struct filename *, int, umode_t); extern struct file *filp_open(const char *, int, umode_t); -extern struct file *file_open_root(struct dentry *, struct vfsmount *, +extern struct file *file_open_root(const struct path *, const char *, int, umode_t); +static inline struct file *file_open_root_mnt(struct vfsmount *mnt, + const char *name, int flags, umode_t mode) +{ + return file_open_root(&(struct path){.mnt = mnt, .dentry = mnt->mnt_root}, + name, flags, mode); +} extern struct file * dentry_open(const struct path *, int, const struct cred *); extern struct file * open_with_fake_path(const struct path *, int, struct inode*, const struct cred *); diff --git a/kernel/usermode_driver.c b/kernel/usermode_driver.c index bb7bb3b478ab..9dae1f648713 100644 --- a/kernel/usermode_driver.c +++ b/kernel/usermode_driver.c @@ -26,7 +26,7 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na if (IS_ERR(mnt)) return mnt;
- file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY, 0700); + file = file_open_root_mnt(mnt, name, O_CREAT | O_WRONLY, 0700); if (IS_ERR(file)) { mntput(mnt); return ERR_CAST(file); diff --git a/security/integrity/ima/ima_digest_list.c b/security/integrity/ima/ima_digest_list.c index 5ed0c0768958..9384affe8b30 100644 --- a/security/integrity/ima/ima_digest_list.c +++ b/security/integrity/ima/ima_digest_list.c @@ -352,7 +352,7 @@ static int __init load_digest_list(struct dir_context *__ctx, const char *name, goto out; }
- file = file_open_root(dir->dentry, dir->mnt, name, O_RDONLY, 0); + file = file_open_root(dir, name, O_RDONLY, 0); if (IS_ERR(file)) { pr_err("Unable to open file: %s (%ld)", name, PTR_ERR(file)); goto out;
From: Al Viro viro@zeniv.linux.org.uk
mainline inclusion from mainline-5.14-rc1 commit bcba1e7d0d520adba895d9e0800a056f734b0a6a category: bugfix bugzilla: 181657 https://gitee.com/openeuler/kernel/issues/I4DDEL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Add a separate field in nameidata (nd->state) holding the flags that should be internal-only. That way we both get some spare bits in the LOOKUP_... space and get simpler rules for nd->root lifetime, since we can set the replacement of LOOKUP_ROOT (ND_ROOT_PRESET) at the same time we set nd->root.
Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Conflicts: fs/namei.c [ Bugfix 7d01ef7585c0("Make sure nd->path.mnt and nd->path.dentry are always valid pointers") is not applied; the problem it fixes does not exist here. Feature 6c6ec2b0a3e0("fs: add support for LOOKUP_CACHED") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- Documentation/filesystems/path-lookup.rst | 6 +-- fs/namei.c | 54 +++++++++++++---------- fs/nfs/nfstrace.h | 4 -- include/linux/namei.h | 3 -- 4 files changed, 34 insertions(+), 33 deletions(-)
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index c482e1619e77..ede67f705787 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1321,18 +1321,18 @@ to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation. yet. This is primarily used to tell the audit subsystem the full context of a particular access being audited.
-``LOOKUP_ROOT`` indicates that the ``root`` field in the ``nameidata`` was +``ND_ROOT_PRESET`` indicates that the ``root`` field in the ``nameidata`` was provided by the caller, so it shouldn't be released when it is no longer needed.
-``LOOKUP_JUMPED`` means that the current dentry was chosen not because +``ND_JUMPED`` means that the current dentry was chosen not because it had the right name but for some other reason. This happens when following "``..``", following a symlink to ``/``, crossing a mount point or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic link"). In this case the filesystem has not been asked to revalidate the name (with ``d_revalidate()``). In such cases the inode may still need to be revalidated, so ``d_op->d_weak_revalidate()`` is called if -``LOOKUP_JUMPED`` is set when the look completes - which may be at the +``ND_JUMPED`` is set when the look completes - which may be at the final component or, when creating, unlinking, or renaming, at the penultimate component.
Resolution-restriction flags diff --git a/fs/namei.c b/fs/namei.c index 130aa5694f48..c94a814e86b2 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -504,7 +504,7 @@ struct nameidata { struct qstr last; struct path root; struct inode *inode; /* path.dentry.d_inode */ - unsigned int flags; + unsigned int flags, state; unsigned seq, m_seq, r_seq; int last_type; unsigned depth; @@ -523,6 +523,10 @@ struct nameidata { umode_t dir_mode; } __randomize_layout;
+#define ND_ROOT_PRESET 1 +#define ND_ROOT_GRABBED 2 +#define ND_JUMPED 4 + static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) { struct nameidata *old = current->nameidata; @@ -531,6 +535,7 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) p->name = name; p->total_link_count = old ? old->total_link_count : 0; p->saved = old; + p->state = 0; current->nameidata = p; }
@@ -593,9 +598,9 @@ static void terminate_walk(struct nameidata *nd) path_put(&nd->path); for (i = 0; i < nd->depth; i++) path_put(&nd->stack[i].link); - if (nd->flags & LOOKUP_ROOT_GRABBED) { + if (nd->state & ND_ROOT_GRABBED) { path_put(&nd->root); - nd->flags &= ~LOOKUP_ROOT_GRABBED; + nd->state &= ~ND_ROOT_GRABBED; } } else { nd->flags &= ~LOOKUP_RCU; @@ -651,9 +656,9 @@ static bool legitimize_root(struct nameidata *nd) if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)) return false; /* Nothing to do if nd->root is zero or is managed by the VFS user. */ - if (!nd->root.mnt || (nd->flags & LOOKUP_ROOT)) + if (!nd->root.mnt || (nd->state & ND_ROOT_PRESET)) return true; - nd->flags |= LOOKUP_ROOT_GRABBED; + nd->state |= ND_ROOT_GRABBED; return legitimize_path(nd, &nd->root, nd->root_seq); }
@@ -790,8 +795,9 @@ static int complete_walk(struct nameidata *nd) * We don't want to zero nd->root for scoped-lookups or * externally-managed nd->root. */ - if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED))) - nd->root.mnt = NULL; + if (!(nd->state & ND_ROOT_PRESET)) + if (!(nd->flags & LOOKUP_IS_SCOPED)) + nd->root.mnt = NULL; if (!try_to_unlazy(nd)) return -ECHILD; } @@ -817,7 +823,7 @@ static int complete_walk(struct nameidata *nd) return -EXDEV; }
- if (likely(!(nd->flags & LOOKUP_JUMPED))) + if (likely(!(nd->state & ND_JUMPED))) return 0;
if (likely(!(dentry->d_flags & DCACHE_OP_WEAK_REVALIDATE))) @@ -855,7 +861,7 @@ static int set_root(struct nameidata *nd) } while (read_seqcount_retry(&fs->seq, seq)); } else { get_fs_root(fs, &nd->root); - nd->flags |= LOOKUP_ROOT_GRABBED; + nd->state |= ND_ROOT_GRABBED; } return 0; } @@ -888,7 +894,7 @@ static int nd_jump_root(struct nameidata *nd) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } - nd->flags |= LOOKUP_JUMPED; + nd->state |= ND_JUMPED; return 0; }
@@ -916,7 +922,7 @@ int nd_jump_link(struct path *path) path_put(&nd->path); nd->path = *path; nd->inode = nd->path.dentry->d_inode; - nd->flags |= LOOKUP_JUMPED; + nd->state |= ND_JUMPED; return 0;
err: @@ -1338,7 +1344,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, if (mounted) { path->mnt = &mounted->mnt; dentry = path->dentry = mounted->mnt.mnt_root; - nd->flags |= LOOKUP_JUMPED; + nd->state |= ND_JUMPED; *seqp = read_seqcount_begin(&dentry->d_seq); *inode = dentry->d_inode; /* @@ -1383,7 +1389,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, if (unlikely(nd->flags & LOOKUP_NO_XDEV)) ret = -EXDEV; else - nd->flags |= LOOKUP_JUMPED; + nd->state |= ND_JUMPED; } if (unlikely(ret)) { dput(path->dentry); @@ -2129,7 +2135,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) case 2: if (name[1] == '.') { type = LAST_DOTDOT; - nd->flags |= LOOKUP_JUMPED; + nd->state |= ND_JUMPED; } break; case 1: @@ -2137,7 +2143,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) } if (likely(type == LAST_NORM)) { struct dentry *parent = nd->path.dentry; - nd->flags &= ~LOOKUP_JUMPED; + nd->state &= ~ND_JUMPED; if (unlikely(parent->d_flags & DCACHE_OP_HASH)) { struct qstr this = { { .hash_len = hash_len }, .name = name }; err = parent->d_op->d_hash(parent, &this); @@ -2207,14 +2213,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags) if (flags & LOOKUP_RCU) rcu_read_lock();
- nd->flags = flags | LOOKUP_JUMPED; + nd->flags = flags; + nd->state |= ND_JUMPED; nd->depth = 0;
nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount); nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount); smp_rmb();
- if (flags & LOOKUP_ROOT) { + if (nd->state & ND_ROOT_PRESET) { struct dentry *root = nd->root.dentry; struct inode *inode = root->d_inode; if (*s && unlikely(!d_can_lookup(root))) @@ -2291,7 +2298,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->root_seq = nd->seq; } else { path_get(&nd->root); - nd->flags |= LOOKUP_ROOT_GRABBED; + nd->state |= ND_ROOT_GRABBED; } } return s; @@ -2330,7 +2337,7 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path ; if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) { err = handle_lookup_down(nd); - nd->flags &= ~LOOKUP_JUMPED; // no d_weak_revalidate(), please... + nd->state &= ~ND_JUMPED; // no d_weak_revalidate(), please... } if (!err) err = complete_walk(nd); @@ -2354,11 +2361,11 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, struct nameidata nd; if (IS_ERR(name)) return PTR_ERR(name); + set_nameidata(&nd, dfd, name); if (unlikely(root)) { nd.root = *root; - flags |= LOOKUP_ROOT; + nd.state = ND_ROOT_PRESET; } - set_nameidata(&nd, dfd, name); retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); if (unlikely(retval == -ECHILD)) retval = path_lookupat(&nd, flags, path); @@ -3399,7 +3406,7 @@ struct file *do_file_open_root(const struct path *root, struct nameidata nd; struct file *file; struct filename *filename; - int flags = op->lookup_flags | LOOKUP_ROOT; + int flags = op->lookup_flags;
if (d_is_symlink(root->dentry) && op->intent & LOOKUP_OPEN) return ERR_PTR(-ELOOP); @@ -3408,8 +3415,9 @@ struct file *do_file_open_root(const struct path *root, if (IS_ERR(filename)) return ERR_CAST(filename);
- nd.root = *root; set_nameidata(&nd, -1, filename); + nd.root = *root; + nd.state = ND_ROOT_PRESET; file = path_openat(&nd, op, flags | LOOKUP_RCU); if (unlikely(file == ERR_PTR(-ECHILD))) file = path_openat(&nd, op, flags); diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h index 5a59dcdce0b2..cb7f49723bbf 100644 --- a/fs/nfs/nfstrace.h +++ b/fs/nfs/nfstrace.h @@ -271,8 +271,6 @@ TRACE_DEFINE_ENUM(LOOKUP_OPEN); TRACE_DEFINE_ENUM(LOOKUP_CREATE); TRACE_DEFINE_ENUM(LOOKUP_EXCL); TRACE_DEFINE_ENUM(LOOKUP_RENAME_TARGET); -TRACE_DEFINE_ENUM(LOOKUP_JUMPED); -TRACE_DEFINE_ENUM(LOOKUP_ROOT); TRACE_DEFINE_ENUM(LOOKUP_EMPTY); TRACE_DEFINE_ENUM(LOOKUP_DOWN);
@@ -288,8 +286,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN); { LOOKUP_CREATE, "CREATE" }, \ { LOOKUP_EXCL, "EXCL" }, \ { LOOKUP_RENAME_TARGET, "RENAME_TARGET" }, \ - { LOOKUP_JUMPED, "JUMPED" }, \ - { LOOKUP_ROOT, "ROOT" }, \ { LOOKUP_EMPTY, "EMPTY" }, \ { LOOKUP_DOWN, "DOWN" })
diff --git a/include/linux/namei.h b/include/linux/namei.h index a4bb992623c4..ca94eb5d2b16 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -36,9 +36,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
/* internal use only */ #define LOOKUP_PARENT 0x0010 -#define LOOKUP_JUMPED 0x1000 -#define LOOKUP_ROOT 0x2000 -#define LOOKUP_ROOT_GRABBED 0x0008
/* Scoping flags for lookup. */ #define LOOKUP_NO_SYMLINKS 0x010000 /* No symlink crossing. */
From: Ye Bin yebin10@huawei.com
mainline inclusion from mainline-5.14-rc5 commit b66541422824cf6cf20e9a35112e9cb5d82cdf62 category: bugfix bugzilla: 176138 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
If !ext4_has_feature_mmp(sb), then retval can be uninitialized before we jump to the wait_to_exit label.
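A condensed sketch of the problematic control flow (simplified from fs/ext4/mmp.c after commit 61bb4a1c417e; unrelated logic elided):

    static int kmmpd(void *data)
    {
            int retval;                     /* previously left uninitialized */
            ...
            while (!kthread_should_stop() && !sb_rdonly(sb)) {
                    if (!ext4_has_feature_mmp(sb)) {
                            ext4_warning(sb, "kmmpd being stopped since MMP "
                                         "feature has been disabled.");
                            goto wait_to_exit;  /* retval never assigned */
                    }
                    ...
            }
    wait_to_exit:
            ...
            return retval;                  /* could return stack garbage */
    }

Initializing retval to 0, as the one-line fix below does, makes the early-exit path well defined.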
Fixes: 61bb4a1c417e ("ext4: fix possible UAF when remounting r/o a mmp-protected file system") Signed-off-by: Ye Bin yebin10@huawei.com Link: https://lore.kernel.org/r/20210713022728.2533770-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/ext4/mmp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c index bc364c119af6..cebea4270817 100644 --- a/fs/ext4/mmp.c +++ b/fs/ext4/mmp.c @@ -138,7 +138,7 @@ static int kmmpd(void *data) unsigned mmp_check_interval; unsigned long last_update_time; unsigned long diff; - int retval; + int retval = 0;
mmp_block = le64_to_cpu(es->s_mmp_block); mmp = (struct mmp_struct *)(bh->b_data);