mailweb.openeuler.org
Kernel

kernel@openeuler.org

  • 17 participants
  • 18896 discussions
[PATCH OLK-5.10] dm: switch to precise io accounting
by Li Nan 07 Sep '23

hulk inclusion
category: bugfix
bugzilla: 188421, https://gitee.com/openeuler/kernel/issues/I7WMMI
CVE: NA

--------------------------------

'ios' and 'sectors' are counted in bio_start_io_acct(), i.e. when the io is
started instead of when it is done. Hence switch to precise io accounting so
that they are counted on io completion.

Signed-off-by: Li Nan <linan122(a)huawei.com>
---
 drivers/md/dm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 06318a0efc67..c925ff1bf900 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -587,7 +587,7 @@ static void start_io_acct(struct dm_io *io)
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->orig_bio;
 
-	io->start_time = bio_start_io_acct(bio);
+	io->start_time = bio_start_precise_io_acct(bio);
 	if (unlikely(dm_stats_used(&md->stats)))
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
 				    bio->bi_iter.bi_sector, bio_sectors(bio),
@@ -606,7 +606,7 @@ static void end_io_acct(struct mapped_device *md, struct bio *bio,
 
 	smp_wmb();
 
-	bio_end_io_acct(bio, start_time);
+	bio_end_precise_io_acct(bio, start_time);
 
 	/* nudge anyone waiting on suspend queue */
 	if (unlikely(wq_has_sleeper(&md->wait)))
--
2.39.2
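The difference between the two accounting styles is easy to see in a tiny
self-contained sketch. The code below is illustrative only (the names are
invented and this is not the kernel's bio accounting API): with start-time
accounting an io that is still in flight already shows up in the ios/sectors
counters, whereas with completion-time ("precise") accounting it is only
counted once it finishes.

#include <stdio.h>

struct io_stats {
	unsigned long ios;	/* number of requests counted so far */
	unsigned long sectors;	/* sectors counted so far */
};

/* "imprecise": counters bumped when the io is submitted */
static void start_io_acct_imprecise(struct io_stats *s, unsigned long sectors)
{
	s->ios++;
	s->sectors += sectors;
}

/* "precise": nothing counted at submission time ... */
static void start_io_acct_precise(struct io_stats *s, unsigned long sectors)
{
	(void)s;
	(void)sectors;	/* only a start timestamp would be taken here */
}

/* ... counters bumped only once the io has actually completed */
static void end_io_acct_precise(struct io_stats *s, unsigned long sectors)
{
	s->ios++;
	s->sectors += sectors;
}

int main(void)
{
	struct io_stats imprecise = { 0 }, precise = { 0 };

	/* one io submitted but not yet completed (still in flight) */
	start_io_acct_imprecise(&imprecise, 8);
	start_io_acct_precise(&precise, 8);

	printf("imprecise: ios=%lu sectors=%lu\n", imprecise.ios, imprecise.sectors);
	printf("precise:   ios=%lu sectors=%lu\n", precise.ios, precise.sectors);
	(void)end_io_acct_precise;	/* would run at completion time */
	return 0;
}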
[openEuler-1.0-LTS 0/3] scsi: scsi_device_gets returns failure when the module is NULL.
by Zhong Jinghua 07 Sep '23

Fix scsi mod UAF problem.

Li Lingfeng (2):
  scsi: don't fail if hostt->module is NULL
  scsi: fix kabi broken in struct Scsi_Host

Zhong Jinghua (1):
  scsi: scsi_device_gets returns failure when the module is NULL.

 drivers/scsi/hosts.c     | 3 +++
 drivers/scsi/scsi.c      | 6 +++++-
 include/scsi/scsi_host.h | 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

--
2.31.1
[PATCH openEuler-1.0-LTS 0/7] x86/speculation: Add Gather Data Sampling mitigation
by Zeng Heng 07 Sep '23

Arnd Bergmann (2):
  x86/speculation: Add cpu_show_gds() prototype
  x86: Move gds_ucode_mitigated() declaration to header

Daniel Sneddon (4):
  x86/speculation: Add Gather Data Sampling mitigation
  x86/speculation: Add force option to GDS mitigation
  x86/speculation: Add Kconfig option for GDS
  KVM: Add GDS_NO support to KVM

Dave Hansen (1):
  Documentation/x86: Fix backwards on/off logic about YMM support

 .../ABI/testing/sysfs-devices-system-cpu      |  11 +-
 .../hw-vuln/gather_data_sampling.rst          | 109 ++++++++++++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 .../admin-guide/kernel-parameters.txt         |  39 ++++-
 arch/x86/Kconfig                              |  19 +++
 arch/x86/include/asm/cpufeatures.h            |   3 +-
 arch/x86/include/asm/msr-index.h              |  11 ++
 arch/x86/include/asm/processor.h              |   2 +
 arch/x86/kernel/cpu/bugs.c                    | 158 ++++++++++++++++++
 arch/x86/kernel/cpu/common.c                  |  36 ++--
 arch/x86/kernel/cpu/cpu.h                     |   1 +
 arch/x86/kvm/x86.c                            |   3 +
 drivers/base/cpu.c                            |   8 +
 include/linux/cpu.h                           |   8 +-
 14 files changed, 381 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/admin-guide/hw-vuln/gather_data_sampling.rst

--
2.25.1
[PATCH openEuler-1.0-LTS] cpu/hotplug: Prevent self deadlock on CPU hot-unplug
by Yu Liao 07 Sep '23

From: Thomas Gleixner <tglx(a)linutronix.de>

mainline inclusion
from mainline-v6.6-rc1
commit 2b8272ff4a70b866106ae13c36be7ecbef5d5da2
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7Y6AQ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

Xiongfeng reported and debugged a self deadlock of the task which initiates
and controls a CPU hot-unplug operation vs. the CFS bandwidth timer.

    CPU1                                  CPU2

    T1 sets cfs_quota
      starts hrtimer cfs_bandwidth 'period_timer'
    T1 is migrated to CPU2
                                          T1 initiates offlining of CPU1
    Hotplug operation starts
      ...
    'period_timer' expires and is
    re-enqueued on CPU1
      ...
    take_cpu_down()
    CPU1 shuts down and does not handle
    timers anymore. They have to be
    migrated in the post dead hotplug
    steps by the control task.
                                          T1 runs the post dead offline operation
                                          T1 is scheduled out
                                          T1 waits for 'period_timer' to expire

T1 waits there forever if it is scheduled out before it can execute the
hrtimer offline callback hrtimers_dead_cpu().

Cure this by delegating the hotplug control operation to a worker thread on
an online CPU. This takes the initiating user space task, which might be
affected by the bandwidth timer, completely out of the picture.

Reported-by: Xiongfeng Wang <wangxiongfeng2(a)huawei.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Yu Liao <liaoyu15(a)huawei.com>
Acked-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com
Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx
Conflict: kernel/cpu.c
Signed-off-by: Yu Liao <liaoyu15(a)huawei.com>
---
 kernel/cpu.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index b3b04e7c17d8..c943454b748e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1018,11 +1018,33 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
 	return ret;
 }
 
+struct cpu_down_work {
+	unsigned int		cpu;
+	enum cpuhp_state	target;
+};
+
+static long __cpu_down_maps_locked(void *arg)
+{
+	struct cpu_down_work *work = arg;
+
+	return _cpu_down(work->cpu, 0, work->target);
+}
+
 static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 {
+	struct cpu_down_work work = { .cpu = cpu, .target = target, };
+
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-	return _cpu_down(cpu, 0, target);
+
+	/*
+	 * Ensure that the control task does not run on the to be offlined
+	 * CPU to prevent a deadlock against cfs_b->period_timer.
+	 */
+	cpu = cpumask_any_but(cpu_online_mask, cpu);
+	if (cpu >= nr_cpu_ids)
+		return -EBUSY;
+	return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
 }
 
 static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
--
2.25.1
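The fix above leans on work_on_cpu(), which synchronously runs a function on a
workqueue bound to a chosen CPU. The snippet below is a hedged illustration of
that delegation idiom, not part of the patch: the helper names and the
placeholder work are made up, while work_on_cpu(), cpumask_any_but(),
cpu_online_mask and nr_cpu_ids are the real kernel symbols the patch uses.

#include <linux/cpumask.h>
#include <linux/workqueue.h>

/* Placeholder callback; work_on_cpu() expects long (*fn)(void *). */
static long do_something(void *arg)
{
	int *value = arg;

	return *value * 2;	/* stand-in for the real work */
}

/* Hypothetical helper: run do_something() on any online CPU other than
 * 'busy_cpu', mirroring how the fix keeps the control task off the CPU
 * that is about to go down. */
static long run_elsewhere(unsigned int busy_cpu, int *value)
{
	unsigned int cpu = cpumask_any_but(cpu_online_mask, busy_cpu);

	if (cpu >= nr_cpu_ids)
		return -EBUSY;	/* no other online CPU available */

	/* queues the function on 'cpu' and waits for it to finish */
	return work_on_cpu(cpu, do_something, value);
}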
[PATCH openEuler-23.09] mm: gmem: Release gm lock once during huge pmd fault
by Wupeng Ma 07 Sep '23

From: Ma Wupeng <mawupeng1(a)huawei.com>

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7WLVX

---------------------------------------------

The mutex lock is already released at the end of the huge pmd fault path, so
drop the redundant unlock here.

Fixes: 848492f233ce ("mm: gmem: Introduce vm_object for gmem")
Signed-off-by: Ma Wupeng <mawupeng1(a)huawei.com>
---
 mm/huge_memory.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aac116da2552..b5ddee157fa6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -834,10 +834,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		}
 		xa_unlock(vma->vm_obj->logical_page_table);
 		mutex_lock(&gm_mapping->lock);
-		if (unlikely(!pmd_none(*vmf->pmd))) {
-			mutex_unlock(&gm_mapping->lock);
+		if (unlikely(!pmd_none(*vmf->pmd)))
 			goto gm_mapping_release;
-		}
 	}
 #endif
 
--
2.25.1
[PATCH v6 openEuler-23.09 0/2] Introduce multiple qos level
by Zhao Wenhui 06 Sep '23

We introduce multiple qos levels, expanding qos_level from {-1, 0} to [-2, 2]
to distinguish tasks expected to run with extremely high or low priority.

Zhao Wenhui (2):
  sched/fair: Introduce multiple qos level
  config: Enable CONFIG_QOS_SCHED_MULTILEVEL

 arch/arm64/configs/openeuler_defconfig |  1 +
 arch/x86/configs/openeuler_defconfig   |  1 +
 include/linux/sched/sysctl.h           |  4 ++
 init/Kconfig                           |  9 ++++
 kernel/sched/core.c                    | 24 ++++++----
 kernel/sched/fair.c                    | 64 ++++++++++++++++++++++++--
 kernel/sched/sched.h                   | 26 ++++++++++-
 kernel/sysctl.c                        |  9 ++++
 8 files changed, 125 insertions(+), 13 deletions(-)

--
2.34.1
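To get a feel for how the per-level weights rescale a group's shares, here is a
small stand-alone sketch of the arithmetic performed by the qos_reweight()
helper added in patch 1/2. This is illustrative userspace C, not the kernel
code: the weight table mirrors the patch's defaults, ONLINE (weight 100) is the
baseline, and the MIN_SHARES/MAX_SHARES clamp is omitted.

#include <limits.h>
#include <stdio.h>

/* Default per-level weights from the patch:
 * index 0..4 maps to qos_level -2..2
 * (OFFLINE_EX, OFFLINE, ONLINE, HIGH, HIGH_EX). */
static const long qos_level_weights[5] = { 1, 10, 100, 1000, 10000 };

static long qos_reweight(long shares, int qos_level)
{
	long weight = qos_level_weights[qos_level + 2];
	long div = 100;			/* ONLINE is the 100% baseline */

	if (weight > LONG_MAX / shares)	/* guard against overflow */
		return LONG_MAX / div;
	return shares * weight / div;	/* MIN/MAX_SHARES clamp omitted */
}

int main(void)
{
	for (int level = -2; level <= 2; level++)
		printf("qos_level %2d: 1024 shares -> %ld\n",
		       level, qos_reweight(1024, level));
	return 0;
}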
[PATCH v6 openEuler-23.09 1/2] sched/fair: Introduce multiple qos level
by Zhao Wenhui 06 Sep '23

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7YS6M ------------------------------- Expand qos_level from {-1,0} to [-2, 2], to distinguish the tasks expected to be with extremely high or low priority level. Using qos_level_weight to reweight the shares when calculating group's weight. Meanwhile, set offline task's schedule policy to SCHED_IDLE so that it can be preempted at check_preempt_wakeup. Signed-off-by: Zhao Wenhui <zhaowenhui8(a)huawei.com> --- include/linux/sched/sysctl.h | 4 +++ init/Kconfig | 9 +++++ kernel/sched/core.c | 24 +++++++++----- kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 26 ++++++++++++++- kernel/sysctl.c | 9 +++++ 6 files changed, 123 insertions(+), 13 deletions(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 28d9be8e4614..3a02a76b08ca 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -37,4 +37,8 @@ extern unsigned int sysctl_overload_detect_period; extern unsigned int sysctl_offline_wait_interval; #endif +#ifdef CONFIG_QOS_SCHED_MULTILEVEL +extern unsigned int sysctl_qos_level_weights[]; +#endif + #endif /* _LINUX_SCHED_SYSCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index a12109fe4385..12a5ffbb5252 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1015,6 +1015,15 @@ config QOS_SCHED_SMT_EXPELLER This feature enable online tasks to expel offline tasks on the smt sibling cpus, and exclusively occupy CPU resources. +config QOS_SCHED_MULTILEVEL + bool "Multiple qos level task scheduling" + depends on QOS_SCHED + default n + help + This feature enable multiple qos level on task scheduling. + Expand the qos_level to [-2,2] to distinguish the tasks expected + to be with extremely high or low priority level. + config FAIR_GROUP_SCHED bool "Group scheduling for SCHED_OTHER" depends on CGROUP_SCHED diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 652c06bd546d..238b5b55c38a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7689,7 +7689,7 @@ static int __sched_setscheduler(struct task_struct *p, * other than SCHED_IDLE, the online task preemption and cpu resource * isolation will be invalid, so return -EINVAL in this case. 
*/ - if (unlikely(task_group(p)->qos_level == -1 && !idle_policy(policy))) { + if (unlikely(is_offline_level(task_group(p)->qos_level) && !idle_policy(policy))) { retval = -EINVAL; goto unlock; } @@ -10356,7 +10356,7 @@ static void sched_change_qos_group(struct task_struct *tsk, struct task_group *t */ if (!(tsk->flags & PF_EXITING) && !task_group_is_autogroup(tg) && - (tg->qos_level == -1)) { + (is_offline_level(tg->qos_level))) { attr.sched_priority = 0; attr.sched_policy = SCHED_IDLE; attr.sched_nice = PRIO_TO_NICE(tsk->static_prio); @@ -10385,7 +10385,7 @@ void sched_move_offline_task(struct task_struct *p) { struct offline_args *args; - if (unlikely(task_group(p)->qos_level != -1)) + if (unlikely(!is_offline_level(task_group(p)->qos_level))) return; args = kmalloc(sizeof(struct offline_args), GFP_ATOMIC); @@ -11275,7 +11275,7 @@ static int tg_change_scheduler(struct task_group *tg, void *data) struct cgroup_subsys_state *css = &tg->css; tg->qos_level = qos_level; - if (qos_level == -1) + if (is_offline_level(qos_level)) policy = SCHED_IDLE; else policy = SCHED_NORMAL; @@ -11297,19 +11297,27 @@ static int cpu_qos_write(struct cgroup_subsys_state *css, if (!tg->se[0]) return -EINVAL; - if (qos_level != -1 && qos_level != 0) +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + if (qos_level > QOS_LEVEL_HIGH_EX || qos_level < QOS_LEVEL_OFFLINE_EX) +#else + if (qos_level != QOS_LEVEL_OFFLINE && qos_level != QOS_LEVEL_ONLINE) +#endif return -EINVAL; if (tg->qos_level == qos_level) goto done; - if (tg->qos_level == -1 && qos_level == 0) +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + if (!is_normal_level(tg->qos_level)) +#else + if (tg->qos_level == QOS_LEVEL_OFFLINE && qos_level == QOS_LEVEL_ONLINE) +#endif return -EINVAL; cpus_read_lock(); - if (qos_level == -1) + if (is_offline_level(qos_level)) cfs_bandwidth_usage_inc(); - else + else if (is_offline_level(tg->qos_level) && !is_offline_level(qos_level)) cfs_bandwidth_usage_dec(); cpus_read_unlock(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ec2be284d185..bd833504f741 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -199,6 +199,23 @@ static bool qos_smt_expelled(int this_cpu); static DEFINE_PER_CPU(int, qos_smt_status); #endif +#ifdef CONFIG_QOS_SCHED_MULTILEVEL +#define QOS_LEVEL_WEIGHT_OFFLINE_EX 1 +#define QOS_LEVEL_WEIGHT_OFFLINE 10 +#define QOS_LEVEL_WEIGHT_ONLINE 100 +#define QOS_LEVEL_WEIGHT_HIGH 1000 +#define QOS_LEVEL_WEIGHT_HIGH_EX 10000 + +unsigned int sysctl_qos_level_weights[5] = { + QOS_LEVEL_WEIGHT_OFFLINE_EX, + QOS_LEVEL_WEIGHT_OFFLINE, + QOS_LEVEL_WEIGHT_ONLINE, + QOS_LEVEL_WEIGHT_HIGH, + QOS_LEVEL_WEIGHT_HIGH_EX, +}; +static long qos_reweight(long shares, struct task_group *tg); +#endif + #ifdef CONFIG_CFS_BANDWIDTH /* * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool @@ -3537,6 +3554,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq) struct task_group *tg = cfs_rq->tg; tg_shares = READ_ONCE(tg->shares); +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + tg_shares = qos_reweight(tg_shares, tg); +#endif load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); @@ -3583,6 +3603,9 @@ static void update_cfs_group(struct sched_entity *se) #ifndef CONFIG_SMP shares = READ_ONCE(gcfs_rq->tg->shares); +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + shares = qos_reweight(shares, gcfs_rq->tg); +#endif if (likely(se->load.weight == shares)) return; @@ -8317,7 +8340,7 @@ static inline void cancel_qos_timer(int cpu) static inline bool is_offline_task(struct task_struct *p) { - return 
task_group(p)->qos_level == -1; + return task_group(p)->qos_level < QOS_LEVEL_ONLINE; } static void start_qos_hrtimer(int cpu); @@ -8510,7 +8533,7 @@ static bool check_qos_cfs_rq(struct cfs_rq *cfs_rq) if (unlikely(__this_cpu_read(qos_cpu_overload))) return false; - if (unlikely(cfs_rq && cfs_rq->tg->qos_level < 0 && + if (unlikely(cfs_rq && is_offline_level(cfs_rq->tg->qos_level) && !sched_idle_cpu(smp_processor_id()) && cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)) { throttle_qos_cfs_rq(cfs_rq); @@ -8526,7 +8549,7 @@ static inline void unthrottle_qos_sched_group(struct cfs_rq *cfs_rq) struct rq_flags rf; rq_lock_irqsave(rq, &rf); - if (cfs_rq->tg->qos_level == -1 && cfs_rq_throttled(cfs_rq)) + if (is_offline_level(cfs_rq->tg->qos_level) && cfs_rq_throttled(cfs_rq)) unthrottle_qos_cfs_rq(cfs_rq); rq_unlock_irqrestore(rq, &rf); } @@ -8539,7 +8562,7 @@ void sched_qos_offline_wait(void) rcu_read_lock(); qos_level = task_group(current)->qos_level; rcu_read_unlock(); - if (qos_level != -1 || fatal_signal_pending(current)) + if (!is_offline_level(qos_level) || fatal_signal_pending(current)) break; schedule_timeout_killable(msecs_to_jiffies(sysctl_offline_wait_interval)); @@ -8569,6 +8592,39 @@ static enum hrtimer_restart qos_overload_timer_handler(struct hrtimer *timer) return HRTIMER_NORESTART; } +#ifdef CONFIG_QOS_SCHED_MULTILEVEL +static long qos_reweight(long shares, struct task_group *tg) +{ + long qos_weight = 100; + long div = 100; + long scale_shares; + + switch (tg->qos_level) { + case QOS_LEVEL_OFFLINE_EX: + qos_weight = sysctl_qos_level_weights[0]; + break; + case QOS_LEVEL_OFFLINE: + qos_weight = sysctl_qos_level_weights[1]; + break; + case QOS_LEVEL_ONLINE: + qos_weight = sysctl_qos_level_weights[2]; + break; + case QOS_LEVEL_HIGH: + qos_weight = sysctl_qos_level_weights[3]; + break; + case QOS_LEVEL_HIGH_EX: + qos_weight = sysctl_qos_level_weights[4]; + break; + } + if (qos_weight > LONG_MAX / shares) + scale_shares = LONG_MAX / div; + else + scale_shares = shares * qos_weight / div; + scale_shares = clamp_t(long, scale_shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES)); + return scale_shares; +} +#endif + static void start_qos_hrtimer(int cpu) { ktime_t time; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0d981063bf48..5782b770e120 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1420,11 +1420,20 @@ do { \ } while (0) #ifdef CONFIG_QOS_SCHED +#ifdef CONFIG_QOS_SCHED_MULTILEVEL enum task_qos_level { + QOS_LEVEL_OFFLINE_EX = -2, QOS_LEVEL_OFFLINE = -1, QOS_LEVEL_ONLINE = 0, - QOS_LEVEL_MAX + QOS_LEVEL_HIGH = 1, + QOS_LEVEL_HIGH_EX = 2 }; +#else +enum task_qos_level { + QOS_LEVEL_OFFLINE = -1, + QOS_LEVEL_ONLINE = 0, +}; +#endif void init_qos_hrtimer(int cpu); #endif @@ -3269,6 +3278,21 @@ static inline int qos_idle_policy(int policy) { return policy == QOS_LEVEL_OFFLINE; } + +static inline int is_high_level(long qos_level) +{ + return qos_level > QOS_LEVEL_ONLINE; +} + +static inline int is_normal_level(long qos_level) +{ + return qos_level == QOS_LEVEL_ONLINE; +} + +static inline int is_offline_level(long qos_level) +{ + return qos_level < QOS_LEVEL_ONLINE; +} #endif #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e9af234bf882..1714abd73f23 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2068,6 +2068,15 @@ static struct ctl_table kern_table[] = { .extra1 = SYSCTL_ONE_HUNDRED, .extra2 = &one_thousand, }, +#endif +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + { + .procname = "qos_level_weights", + 
.data = &sysctl_qos_level_weights, + .maxlen = 5*sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #endif { .procname = "max_rcu_stall_to_panic", -- 2.34.1
[PATCH OLK-5.10 0/3] scsi: scsi_device_gets returns failure when the module is NULL
by Zhong Jinghua 06 Sep '23

scsi: scsi_device_gets returns failure when the module is NULL

Li Lingfeng (2):
  scsi: don't fail if hostt->module is NULL
  scsi: fix kabi broken in struct Scsi_Host

Zhong Jinghua (1):
  scsi: scsi_device_gets returns failure when the module is NULL.

 drivers/scsi/hosts.c     | 3 +++
 drivers/scsi/scsi.c      | 6 +++++-
 include/scsi/scsi_host.h | 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

--
2.31.1
[PATCH v5 openEuler-23.09] sched/fair: Introduce multiple qos level
by Zhao Wenhui 06 Sep '23

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7YS6M ------------------------------- Expand qos_level from {-1,0} to [-2, 2], to distinguish the tasks expected to be with extremely high or low priority level. Using qos_level_weight to reweight the shares when calculating group's weight. Meanwhile, set offline task's schedule policy to SCHED_IDLE so that it can be preempted at check_preempt_wakeup. Signed-off-by: Zhao Wenhui <zhaowenhui8(a)huawei.com> --- arch/arm64/configs/openeuler_defconfig | 1 + arch/x86/configs/openeuler_defconfig | 1 + include/linux/sched/sysctl.h | 4 ++ init/Kconfig | 9 ++++ kernel/sched/core.c | 24 ++++++---- kernel/sched/fair.c | 64 ++++++++++++++++++++++++-- kernel/sched/sched.h | 26 ++++++++++- kernel/sysctl.c | 9 ++++ 8 files changed, 125 insertions(+), 13 deletions(-) diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index a64923c8f1c9..1b591206471b 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -181,6 +181,7 @@ CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y # CONFIG_CGROUP_MISC is not set CONFIG_QOS_SCHED=y +CONFIG_QOS_SCHED_MULTILEVEL=y # CONFIG_CGROUP_DEBUG is not set CONFIG_SOCK_CGROUP_DATA=y CONFIG_CGROUP_FILES=y diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index a0669731cef4..73d87040d650 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -190,6 +190,7 @@ CONFIG_FAIR_GROUP_SCHED=y CONFIG_QOS_SCHED_SMT_EXPELLER=y CONFIG_CFS_BANDWIDTH=y CONFIG_QOS_SCHED=y +CONFIG_QOS_SCHED_MULTILEVEL=y CONFIG_RT_GROUP_SCHED=y CONFIG_SCHED_MM_CID=y CONFIG_QOS_SCHED_DYNAMIC_AFFINITY=y diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 28d9be8e4614..3a02a76b08ca 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -37,4 +37,8 @@ extern unsigned int sysctl_overload_detect_period; extern unsigned int sysctl_offline_wait_interval; #endif +#ifdef CONFIG_QOS_SCHED_MULTILEVEL +extern unsigned int sysctl_qos_level_weights[]; +#endif + #endif /* _LINUX_SCHED_SYSCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index a12109fe4385..12a5ffbb5252 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1015,6 +1015,15 @@ config QOS_SCHED_SMT_EXPELLER This feature enable online tasks to expel offline tasks on the smt sibling cpus, and exclusively occupy CPU resources. +config QOS_SCHED_MULTILEVEL + bool "Multiple qos level task scheduling" + depends on QOS_SCHED + default n + help + This feature enable multiple qos level on task scheduling. + Expand the qos_level to [-2,2] to distinguish the tasks expected + to be with extremely high or low priority level. + config FAIR_GROUP_SCHED bool "Group scheduling for SCHED_OTHER" depends on CGROUP_SCHED diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 652c06bd546d..238b5b55c38a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7689,7 +7689,7 @@ static int __sched_setscheduler(struct task_struct *p, * other than SCHED_IDLE, the online task preemption and cpu resource * isolation will be invalid, so return -EINVAL in this case. 
*/ - if (unlikely(task_group(p)->qos_level == -1 && !idle_policy(policy))) { + if (unlikely(is_offline_level(task_group(p)->qos_level) && !idle_policy(policy))) { retval = -EINVAL; goto unlock; } @@ -10356,7 +10356,7 @@ static void sched_change_qos_group(struct task_struct *tsk, struct task_group *t */ if (!(tsk->flags & PF_EXITING) && !task_group_is_autogroup(tg) && - (tg->qos_level == -1)) { + (is_offline_level(tg->qos_level))) { attr.sched_priority = 0; attr.sched_policy = SCHED_IDLE; attr.sched_nice = PRIO_TO_NICE(tsk->static_prio); @@ -10385,7 +10385,7 @@ void sched_move_offline_task(struct task_struct *p) { struct offline_args *args; - if (unlikely(task_group(p)->qos_level != -1)) + if (unlikely(!is_offline_level(task_group(p)->qos_level))) return; args = kmalloc(sizeof(struct offline_args), GFP_ATOMIC); @@ -11275,7 +11275,7 @@ static int tg_change_scheduler(struct task_group *tg, void *data) struct cgroup_subsys_state *css = &tg->css; tg->qos_level = qos_level; - if (qos_level == -1) + if (is_offline_level(qos_level)) policy = SCHED_IDLE; else policy = SCHED_NORMAL; @@ -11297,19 +11297,27 @@ static int cpu_qos_write(struct cgroup_subsys_state *css, if (!tg->se[0]) return -EINVAL; - if (qos_level != -1 && qos_level != 0) +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + if (qos_level > QOS_LEVEL_HIGH_EX || qos_level < QOS_LEVEL_OFFLINE_EX) +#else + if (qos_level != QOS_LEVEL_OFFLINE && qos_level != QOS_LEVEL_ONLINE) +#endif return -EINVAL; if (tg->qos_level == qos_level) goto done; - if (tg->qos_level == -1 && qos_level == 0) +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + if (!is_normal_level(tg->qos_level)) +#else + if (tg->qos_level == QOS_LEVEL_OFFLINE && qos_level == QOS_LEVEL_ONLINE) +#endif return -EINVAL; cpus_read_lock(); - if (qos_level == -1) + if (is_offline_level(qos_level)) cfs_bandwidth_usage_inc(); - else + else if (is_offline_level(tg->qos_level) && !is_offline_level(qos_level)) cfs_bandwidth_usage_dec(); cpus_read_unlock(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ec2be284d185..bd833504f741 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -199,6 +199,23 @@ static bool qos_smt_expelled(int this_cpu); static DEFINE_PER_CPU(int, qos_smt_status); #endif +#ifdef CONFIG_QOS_SCHED_MULTILEVEL +#define QOS_LEVEL_WEIGHT_OFFLINE_EX 1 +#define QOS_LEVEL_WEIGHT_OFFLINE 10 +#define QOS_LEVEL_WEIGHT_ONLINE 100 +#define QOS_LEVEL_WEIGHT_HIGH 1000 +#define QOS_LEVEL_WEIGHT_HIGH_EX 10000 + +unsigned int sysctl_qos_level_weights[5] = { + QOS_LEVEL_WEIGHT_OFFLINE_EX, + QOS_LEVEL_WEIGHT_OFFLINE, + QOS_LEVEL_WEIGHT_ONLINE, + QOS_LEVEL_WEIGHT_HIGH, + QOS_LEVEL_WEIGHT_HIGH_EX, +}; +static long qos_reweight(long shares, struct task_group *tg); +#endif + #ifdef CONFIG_CFS_BANDWIDTH /* * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool @@ -3537,6 +3554,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq) struct task_group *tg = cfs_rq->tg; tg_shares = READ_ONCE(tg->shares); +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + tg_shares = qos_reweight(tg_shares, tg); +#endif load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); @@ -3583,6 +3603,9 @@ static void update_cfs_group(struct sched_entity *se) #ifndef CONFIG_SMP shares = READ_ONCE(gcfs_rq->tg->shares); +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + shares = qos_reweight(shares, gcfs_rq->tg); +#endif if (likely(se->load.weight == shares)) return; @@ -8317,7 +8340,7 @@ static inline void cancel_qos_timer(int cpu) static inline bool is_offline_task(struct task_struct *p) { - return 
task_group(p)->qos_level == -1; + return task_group(p)->qos_level < QOS_LEVEL_ONLINE; } static void start_qos_hrtimer(int cpu); @@ -8510,7 +8533,7 @@ static bool check_qos_cfs_rq(struct cfs_rq *cfs_rq) if (unlikely(__this_cpu_read(qos_cpu_overload))) return false; - if (unlikely(cfs_rq && cfs_rq->tg->qos_level < 0 && + if (unlikely(cfs_rq && is_offline_level(cfs_rq->tg->qos_level) && !sched_idle_cpu(smp_processor_id()) && cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)) { throttle_qos_cfs_rq(cfs_rq); @@ -8526,7 +8549,7 @@ static inline void unthrottle_qos_sched_group(struct cfs_rq *cfs_rq) struct rq_flags rf; rq_lock_irqsave(rq, &rf); - if (cfs_rq->tg->qos_level == -1 && cfs_rq_throttled(cfs_rq)) + if (is_offline_level(cfs_rq->tg->qos_level) && cfs_rq_throttled(cfs_rq)) unthrottle_qos_cfs_rq(cfs_rq); rq_unlock_irqrestore(rq, &rf); } @@ -8539,7 +8562,7 @@ void sched_qos_offline_wait(void) rcu_read_lock(); qos_level = task_group(current)->qos_level; rcu_read_unlock(); - if (qos_level != -1 || fatal_signal_pending(current)) + if (!is_offline_level(qos_level) || fatal_signal_pending(current)) break; schedule_timeout_killable(msecs_to_jiffies(sysctl_offline_wait_interval)); @@ -8569,6 +8592,39 @@ static enum hrtimer_restart qos_overload_timer_handler(struct hrtimer *timer) return HRTIMER_NORESTART; } +#ifdef CONFIG_QOS_SCHED_MULTILEVEL +static long qos_reweight(long shares, struct task_group *tg) +{ + long qos_weight = 100; + long div = 100; + long scale_shares; + + switch (tg->qos_level) { + case QOS_LEVEL_OFFLINE_EX: + qos_weight = sysctl_qos_level_weights[0]; + break; + case QOS_LEVEL_OFFLINE: + qos_weight = sysctl_qos_level_weights[1]; + break; + case QOS_LEVEL_ONLINE: + qos_weight = sysctl_qos_level_weights[2]; + break; + case QOS_LEVEL_HIGH: + qos_weight = sysctl_qos_level_weights[3]; + break; + case QOS_LEVEL_HIGH_EX: + qos_weight = sysctl_qos_level_weights[4]; + break; + } + if (qos_weight > LONG_MAX / shares) + scale_shares = LONG_MAX / div; + else + scale_shares = shares * qos_weight / div; + scale_shares = clamp_t(long, scale_shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES)); + return scale_shares; +} +#endif + static void start_qos_hrtimer(int cpu) { ktime_t time; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0d981063bf48..5782b770e120 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1420,11 +1420,20 @@ do { \ } while (0) #ifdef CONFIG_QOS_SCHED +#ifdef CONFIG_QOS_SCHED_MULTILEVEL enum task_qos_level { + QOS_LEVEL_OFFLINE_EX = -2, QOS_LEVEL_OFFLINE = -1, QOS_LEVEL_ONLINE = 0, - QOS_LEVEL_MAX + QOS_LEVEL_HIGH = 1, + QOS_LEVEL_HIGH_EX = 2 }; +#else +enum task_qos_level { + QOS_LEVEL_OFFLINE = -1, + QOS_LEVEL_ONLINE = 0, +}; +#endif void init_qos_hrtimer(int cpu); #endif @@ -3269,6 +3278,21 @@ static inline int qos_idle_policy(int policy) { return policy == QOS_LEVEL_OFFLINE; } + +static inline int is_high_level(long qos_level) +{ + return qos_level > QOS_LEVEL_ONLINE; +} + +static inline int is_normal_level(long qos_level) +{ + return qos_level == QOS_LEVEL_ONLINE; +} + +static inline int is_offline_level(long qos_level) +{ + return qos_level < QOS_LEVEL_ONLINE; +} #endif #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e9af234bf882..1714abd73f23 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2068,6 +2068,15 @@ static struct ctl_table kern_table[] = { .extra1 = SYSCTL_ONE_HUNDRED, .extra2 = &one_thousand, }, +#endif +#ifdef CONFIG_QOS_SCHED_MULTILEVEL + { + .procname = "qos_level_weights", + 
.data = &sysctl_qos_level_weights, + .maxlen = 5*sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #endif { .procname = "max_rcu_stall_to_panic", -- 2.34.1
[PATCH openEuler-1.0-LTS] memcg: fix a UAF problem in drain_all_stock()
by GONG, Ruiqi 06 Sep '23

hulk inclusion
category: bugfix
bugzilla: 189183, https://gitee.com/openeuler/kernel/issues/I7Z1ZU
CVE: NA

----------------------------------------

The following panic with the RedHat 7.5 kernel was reported by UVP:

 CPU: 28 PID: 56610 Comm: kworker/u160:6 Kdump: loaded Tainted: G OE K----V------- 3.10.0-862.14.1.6_152.x86_64 #1
 Hardware name: ZTE R5300 G4/R5300G4, BIOS 03.20.0200_8717837 06/07/2021
 Workqueue: events_aync_free recharge_parent
 task: ffff97fc84cc0fe0 ti: ffff97d3840c8000 task.ti: ffff97d3840c8000
 RIP: 0010:[<ffffffffa0d1ea7d>] [<ffffffffa0d1ea7d>] cgroup_is_descendant+0x1d/0x40
 RSP: 0018:ffff97d3840cbd10 EFLAGS: 00010296
 RAX: 0000000000000000 RBX: ffffffffa1943ba0 RCX: 0000000000000007
 RDX: ffff97fd12043800 RSI: ffff982d3faa5c00 RDI: 3930343331356364
 RBP: ffff97d3840cbd10 R08: ffffffffa1943ba0 R09: 0000000000000007
 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000016100
 R13: ffff97fe809d6100 R14: 0000000000000007 R15: 0000000000000007
 FS: 0000000000000000(0000) GS:ffff982e7be00000(0000) knlGS:0000000000000000
 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00005633394b1c10 CR3: 00000059d900e000 CR4: 00000000003627e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  [<ffffffffa0e1840c>] __mem_cgroup_same_or_subtree+0x2c/0x40
  [<ffffffffc0937291>] drain_all_stock+0xc1/0xe30 [klp_HP0038]
  [<ffffffffa0e185f1>] mem_cgroup_reparent_charges+0x51/0x3b0
  [<ffffffffa0cce718>] ? finish_task_switch+0xf8/0x170
  [<ffffffffa0e18b14>] recharge_parent+0x54/0x80
  [<ffffffffa0cb7a32>] process_one_work+0x182/0x450
  [<ffffffffa0cb8996>] worker_thread+0x126/0x3c0
  [<ffffffffa0cb8870>] ? manage_workers.isra.24+0x2a0/0x2a0
  [<ffffffffa0cbfab1>] kthread+0xd1/0xe0
  [<ffffffffa0cbf9e0>] ? insert_kthread_work+0x40/0x40
  [<ffffffffa133b5dd>] ret_from_fork_nospec_begin+0x7/0x21
  [<ffffffffa0cbf9e0>] ? insert_kthread_work+0x40/0x40

It is found that in case stock->nr_pages is decreased to 0, a memcg that is
going to be freed would skip the drain_local_stock() process and therefore be
left on stock->cached after being freed, which could cause UAF problems in
drain_all_stock().

Now it is believed that the same problem exists on 4.19 as well, confirmed by
successful reproduction. The problem causes a panic with a similar call trace,
and its triggering process is demonstrated as follows:

 stock->cached = mB

 CPU2                          CPU3                          CPU4

 consume_stock
   local_irq_save
   stock->nr_pages -= xxx -> 0
                               drain_all_stock
                                 rcu_read_lock()
                                 memcg = cpu2's stock->cached
                                 cpu2's stock->nr_page == 0
                                 rcu_read_unlock()  (skip)
 ============================ (mB freed) ============================
                                                             drain_all_stock(mD)
                                                               rcu_read_lock()
                                                               memcg = cpu2's stock->cached
                                                               (interrupted)
 refill_stock(mC)
   local_irq_save
   drain_stock(mB)
   stock->cached = mC
   stock->nr_pages += xxx (> 0)
                                                               stock->nr_pages > 0
                                                               mem_cgroup_is_descendant(memcg,
                                                                 root_memcg)  [UAF]
                                                               rcu_read_unlock()

Fix this problem by removing `stock->nr_pages` from the preconditions of
`flush = true` in drain_all_stock(), so as to drain the stock even if its
nr_pages is 0.

Signed-off-by: GONG, Ruiqi <gongruiqi1(a)huawei.com>
---
 mm/memcontrol.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1b11bc13e1aa..032bb52cd2ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2226,8 +2226,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 
 		rcu_read_lock();
 		memcg = stock->cached;
-		if (memcg && stock->nr_pages &&
-		    mem_cgroup_is_descendant(memcg, root_memcg))
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
 		rcu_read_unlock();
 
--
2.25.1
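The per-CPU stock cache at the heart of this race can be modelled in a few
lines of plain C. The sketch below is illustrative only (all names are
invented, and the real kernel checks mem_cgroup_is_descendant() rather than
pointer equality): it shows why a stock whose nr_pages has dropped to 0 but
whose cached pointer is still set must still be drained, which is exactly what
the one-line change above guarantees.

#include <stdbool.h>
#include <stdio.h>

struct memcg { const char *name; };

/* Toy model of the per-CPU cache of pre-charged pages for one memcg. */
struct memcg_stock {
	struct memcg *cached;	/* memcg the pages were charged to */
	unsigned long nr_pages;	/* pages still held by this CPU    */
};

/* Old predicate: skip the drain when nr_pages == 0, leaving a stale
 * 'cached' pointer behind once the memcg itself is freed. */
static bool should_flush_old(struct memcg_stock *s, struct memcg *root)
{
	return s->cached && s->nr_pages && s->cached == root;
}

/* Fixed predicate: drain whenever a memcg is still cached, so the
 * pointer is cleared before the memcg can go away. */
static bool should_flush_new(struct memcg_stock *s, struct memcg *root)
{
	return s->cached && s->cached == root;
}

int main(void)
{
	struct memcg mB = { "mB" };
	struct memcg_stock stock = { .cached = &mB, .nr_pages = 0 };

	printf("old predicate flushes: %d\n", should_flush_old(&stock, &mB));
	printf("new predicate flushes: %d\n", should_flush_new(&stock, &mB));
	return 0;
}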
