[PATCH OLK-6.6 0/6] add min_slice and sched_runtime

older
[PATCH OLK-6.6] sched/cputime: Fix...

Chen Jinghuang

1 Jun 2026 1 Jun '26

5:09 p.m.

*** BLURB HERE *** Chen Jinghuang (1): sched: Fix kabi breakage in struct sched_entity Omar Sandoval (1): sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash Peter Zijlstra (3): sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion sched/fair: Re-organize dequeue_task_fair() sched/eevdf: Propagate min_slice up the cgroup hierarchy Tianchen Ding (1): sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks include/linux/sched.h | 5 +- kernel/sched/core.c | 4 +- kernel/sched/debug.c | 3 +- kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++-------- kernel/sched/syscalls.c | 29 +++++++-- 5 files changed, 137 insertions(+), 35 deletions(-) -- 2.34.1

Show replies by date

Chen Jinghuang

1 Jun 1 Jun

5:09 p.m.

New subject: [PATCH OLK-6.6 1/6] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 857b158dc5e81c6de795ef6be006eed146098fc6 category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+*------+-------+--- ---+--*----+-------+--- t=2, V=5.5 t=3, V=7.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+----*--+-------+--- ---+------*+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=4, V=8.25 t=6, V=12.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+-------*-------+--- ---+-------+---*---+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length. Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |--------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.5 t=5, V=11.5 A |---------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.5 A |---------------< B |------< C |--------------< D |--------------< ---+-------+----*--+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.25 t=5, V=11.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.25 A |--------------< B |------< C |------< ---+-------+----*--+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... +---+---+---+--- t=0, V=1 t=2, V=5 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=3, V=7 t=5, V=11 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=7, V=15 A |--------------< B |------< C |------< ---+-------+------*+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org Conflicts: include/linux/sched.h kernel/sched/core.c kernel/sched/syscalls.c [a trival context conflicts] Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> --- include/linux/sched.h | 3 ++- kernel/sched/core.c | 4 +++- kernel/sched/debug.c | 3 ++- kernel/sched/fair.c | 6 ++++-- kernel/sched/syscalls.c | 29 +++++++++++++++++++++++------ 5 files changed, 34 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 59a8ae4b150c..063c6acc3580 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -588,7 +588,8 @@ struct sched_entity { struct list_head group_node; unsigned int on_rq; KABI_FILL_HOLE(unsigned char rel_deadline) - /* 3 holes left here */ + unsigned char custom_slice; + /* 2 holes left here */ u64 exec_start; u64 sum_exec_runtime; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fedd93efd849..82fcfeb96c40 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4389,7 +4389,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->se.nr_migrations = 0; p->se.vruntime = 0; p->se.vlag = 0; - p->se.slice = sysctl_sched_base_slice; p->se.rel_deadline = 0; INIT_LIST_HEAD(&p->se.group_node); @@ -4683,6 +4682,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) p->prio = p->normal_prio = p->static_prio; set_load_weight(p, false); + p->se.custom_slice = 0; + p->se.slice = sysctl_sched_base_slice; /* * We don't need the reset flag anymore after the fork. It has @@ -8546,6 +8547,7 @@ void __init sched_init(void) } set_load_weight(&init_task, false); + init_task.se.slice = sysctl_sched_base_slice, /* * The boot idle thread does lazy MMU switching as well: diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index bcdf2e5a6b13..35a9c5188168 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -622,11 +622,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p) else SEQ_printf(m, " %c", task_state_to_char(p)); - SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ", + SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld %5d ", p->comm, task_pid_nr(p), SPLIT_NS(p->se.vruntime), entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N', SPLIT_NS(p->se.deadline), + p->se.custom_slice ? 'S' : ' ', SPLIT_NS(p->se.slice), SPLIT_NS(p->se.sum_exec_runtime), (long long)(p->nvcsw + p->nivcsw), diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 80e87acc9f55..4ec7e0b198a5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1304,7 +1304,8 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) * nice) while the request time r_i is determined by * sysctl_sched_base_slice. */ - se->slice = sysctl_sched_base_slice; + if (!se->custom_slice) + se->slice = sysctl_sched_base_slice; /* * EEVDF: vd_i = ve_i + r_i / w_i @@ -5437,7 +5438,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) u64 vslice, vruntime = avg_vruntime(cfs_rq); s64 lag = 0; - se->slice = sysctl_sched_base_slice; + if (!se->custom_slice) + se->slice = sysctl_sched_base_slice; vslice = calc_delta_fair(se->slice, se); /* diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index a36a4958e617..39a333ea5105 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -393,10 +393,20 @@ void __setscheduler_params(struct task_struct *p, p->policy = policy; - if (dl_policy(policy)) + if (dl_policy(policy)) { __setparam_dl(p, attr); - else if (fair_policy(policy)) + } else if (fair_policy(policy)) { p->static_prio = NICE_TO_PRIO(attr->sched_nice); + if (attr->sched_runtime) { + p->se.custom_slice = 1; + p->se.slice = clamp_t(u64, attr->sched_runtime, + NSEC_PER_MSEC/10, /* HZ=1000 * 10 */ + NSEC_PER_MSEC*100); /* HZ=100 / 10 */ + } else { + p->se.custom_slice = 0; + p->se.slice = sysctl_sched_base_slice; + } + } /* rt-policy tasks do not have a timerslack */ if (task_is_realtime(p)) { @@ -733,7 +743,9 @@ int __sched_setscheduler(struct task_struct *p, * but store a possible modification of reset_on_fork. */ if (unlikely(policy == p->policy)) { - if (fair_policy(policy) && attr->sched_nice != task_nice(p)) + if (fair_policy(policy) && + (attr->sched_nice != task_nice(p) || + (attr->sched_runtime != p->se.slice))) goto change; if (rt_policy(policy) && attr->sched_priority != p->rt_priority) goto change; @@ -893,6 +905,9 @@ static int _sched_setscheduler(struct task_struct *p, int policy, .sched_nice = PRIO_TO_NICE(p->static_prio), }; + if (p->se.custom_slice) + attr.sched_runtime = p->se.slice; + /* Fixup the legacy SCHED_RESET_ON_FORK hack. */ if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) { attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; @@ -1055,12 +1070,14 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a static void get_params(struct task_struct *p, struct sched_attr *attr) { - if (task_has_dl_policy(p)) + if (task_has_dl_policy(p)) { __getparam_dl(p, attr); - else if (task_has_rt_policy(p)) + } else if (task_has_rt_policy(p)) { attr->sched_priority = p->rt_priority; - else + } else { attr->sched_nice = task_nice(p); + attr->sched_runtime = p->se.slice; + } } /** -- 2.34.1

Chen Jinghuang

5:09 p.m.

New subject: [PATCH OLK-6.6 2/6] sched/fair: Re-organize dequeue_task_fair()

From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit fab4a808ba9fb59b691d7096eed9b1494812ffd6 category: cleanup bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Working towards delaying dequeue, notably also inside the hierachy, rework dequeue_task_fair() such that it can 'resume' an interrupted hierarchy walk. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.977256873@infradead.org Conflicts: kernel/sched/fair.c [a trival context conflict] Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> --- kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++--------------- 1 file changed, 45 insertions(+), 22 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4ec7e0b198a5..04020753c190 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7717,40 +7717,52 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) static void set_next_buddy(struct sched_entity *se); /* - * The dequeue_task method is called before nr_running is - * decreased. We remove the task from the rbtree and - * update the fair scheduling stats: + * Basically dequeue_task_fair(), except it can deal with dequeue_entity() + * failing half-way through and resume the dequeue later. + * + * Returns: + * -1 - dequeue delayed + * 0 - dequeue throttled + * 1 - dequeue complete */ -static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) +static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) { - struct cfs_rq *cfs_rq; - struct sched_entity *se = &p->se; - int task_sleep = flags & DEQUEUE_SLEEP; - int idle_h_nr_running = task_has_idle_policy(p); #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER - int qos_idle_h_nr_running = task_has_qos_idle_policy(p); + int qos_idle_h_nr_running = 0; #endif unsigned int prev_nr = rq->cfs.h_nr_running; bool was_sched_idle = sched_idle_rq(rq); + bool task_sleep = flags & DEQUEUE_SLEEP; + struct task_struct *p = NULL; + int idle_h_nr_running = 0; + int h_nr_running = 0; + struct cfs_rq *cfs_rq; - util_est_dequeue(&rq->cfs, p); + if (entity_is_task(se)) { + p = task_of(se); + h_nr_running = 1; + idle_h_nr_running = task_has_idle_policy(p); +#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + qos_idle_h_nr_running = task_has_qos_idle_policy(p); +#endif + } for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); dequeue_entity(cfs_rq, se, flags); - cfs_rq->h_nr_running--; + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running; #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER cfs_rq->qos_idle_h_nr_running -= qos_idle_h_nr_running; #endif if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running = 1; + idle_h_nr_running = h_nr_running; /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; + return 0; /* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { @@ -7774,22 +7786,20 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) se_update_runnable(se); update_cfs_group(se); - cfs_rq->h_nr_running--; + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running; #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER cfs_rq->qos_idle_h_nr_running -= qos_idle_h_nr_running; #endif if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running = 1; + idle_h_nr_running = h_nr_running; /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; - + return 0; } - /* At this point se is NULL and we are at root level*/ - sub_nr_running(rq, 1); + sub_nr_running(rq, h_nr_running); if (prev_nr == 2) overload_clear(rq); @@ -7797,10 +7807,23 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies; -dequeue_throttle: - util_est_update(&rq->cfs, p, task_sleep); - hrtick_update(rq); + return 1; +} +/* + * The dequeue_task method is called before nr_running is + * decreased. We remove the task from the rbtree and + * update the fair scheduling stats: + */ +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) +{ + util_est_dequeue(&rq->cfs, p); + + if (dequeue_entities(rq, &p->se, flags) < 0) + return false; + + util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); + hrtick_update(rq); return true; } -- 2.34.1

Chen Jinghuang

5:09 p.m.

New subject: [PATCH OLK-6.6 3/6] sched/eevdf: Propagate min_slice up the cgroup hierarchy

From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit aef6987d89544d63a47753cf3741cabff0b5574c category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ------------------------------- In the absence of an explicit cgroup slice configureation, make mixed slice length work with cgroups by propagating the min_slice up the hierarchy. This ensures the cgroup entity gets timely service to service its entities that have this timing constraint set on them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org Conflicts: kernel/sched/fair.c include/linux/sched.h [a trival context conflict] Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> --- include/linux/sched.h | 1 + kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 57 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 063c6acc3580..172f7c59cedd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -584,6 +584,7 @@ struct sched_entity { struct rb_node run_node; u64 deadline; u64 min_vruntime; + u64 min_slice; struct list_head group_node; unsigned int on_rq; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 04020753c190..db18de51b93c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1060,6 +1060,21 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se) return vruntime_eligible(cfs_rq, se->vruntime); } +static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq) +{ + struct sched_entity *root = __pick_root_entity(cfs_rq); + struct sched_entity *curr = cfs_rq->curr; + u64 min_slice = ~0ULL; + + if (curr && curr->on_rq) + min_slice = curr->slice; + + if (root) + min_slice = min(min_slice, root->min_slice); + + return min_slice; +} + static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) { return entity_before(__node_2_se(a), __node_2_se(b)); @@ -1075,19 +1090,34 @@ static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node } } +static inline void __min_slice_update(struct sched_entity *se, struct rb_node *node) +{ + if (node) { + struct sched_entity *rse = __node_2_se(node); + if (rse->min_slice < se->min_slice) + se->min_slice = rse->min_slice; + } +} + /* * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime) */ static inline bool min_vruntime_update(struct sched_entity *se, bool exit) { u64 old_min_vruntime = se->min_vruntime; + u64 old_min_slice = se->min_slice; struct rb_node *node = &se->run_node; se->min_vruntime = se->vruntime; __min_vruntime_update(se, node->rb_right); __min_vruntime_update(se, node->rb_left); - return se->min_vruntime == old_min_vruntime; + se->min_slice = se->slice; + __min_slice_update(se, node->rb_right); + __min_slice_update(se, node->rb_left); + + return se->min_vruntime == old_min_vruntime && + se->min_slice == old_min_slice; } RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity, @@ -1100,6 +1130,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { sum_w_vruntime_add(cfs_rq, se); se->min_vruntime = se->vruntime; + se->min_slice = se->slice; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less, &min_vruntime_cb); } @@ -7627,6 +7658,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) #endif int task_new = !(flags & ENQUEUE_WAKEUP); unsigned int prev_nr = rq->cfs.h_nr_running; + u64 slice = 0; /* * The code below (indirectly) updates schedutil which looks at @@ -7648,7 +7680,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (se->on_rq) break; cfs_rq = cfs_rq_of(se); + + /* + * Basically set the slice of group entries to the min_slice of + * their respective cfs_rq. This ensures the group can service + * its entities in the desired time-frame. + */ + if (slice) { + se->slice = slice; + se->custom_slice = 1; + } enqueue_entity(cfs_rq, se, flags); + slice = cfs_rq_min_slice(cfs_rq); cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; @@ -7673,6 +7716,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) se_update_runnable(se); update_cfs_group(se); + se->slice = slice; + slice = cfs_rq_min_slice(cfs_rq); + cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER @@ -7737,6 +7783,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) int idle_h_nr_running = 0; int h_nr_running = 0; struct cfs_rq *cfs_rq; + u64 slice = 0; if (entity_is_task(se)) { p = task_of(se); @@ -7745,6 +7792,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER qos_idle_h_nr_running = task_has_qos_idle_policy(p); #endif + } else { + cfs_rq = group_cfs_rq(se); + slice = cfs_rq_min_slice(cfs_rq); } for_each_sched_entity(se) { @@ -7766,6 +7816,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) /* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { + slice = cfs_rq_min_slice(cfs_rq); + /* Avoid re-evaluating load for this entity: */ se = parent_entity(se); /* @@ -7786,6 +7838,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) se_update_runnable(se); update_cfs_group(se); + se->slice = slice; + slice = cfs_rq_min_slice(cfs_rq); + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running; #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER -- 2.34.1

Chen Jinghuang

5:09 p.m.

New subject: [PATCH OLK-6.6 4/6] sched: Fix kabi breakage in struct sched_entity

hulk inclusion category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498 CVE: NA -------------------------------- Fix kabi breakage in struct sched_entity Fixes: a01260d66615 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion") Fixes: e9054f60ffb0 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> --- include/linux/sched.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 172f7c59cedd..629c7d42110c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -584,12 +584,11 @@ struct sched_entity { struct rb_node run_node; u64 deadline; u64 min_vruntime; - u64 min_slice; struct list_head group_node; unsigned int on_rq; KABI_FILL_HOLE(unsigned char rel_deadline) - unsigned char custom_slice; + KABI_FILL_HOLE(unsigned char custom_slice) /* 2 holes left here */ u64 exec_start; @@ -632,7 +631,7 @@ struct sched_entity { */ struct sched_avg avg; #endif - KABI_RESERVE(1) + KABI_USE(1, u64 min_slice) KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) -- 2.34.1

Chen Jinghuang

5:09 p.m.

New subject: [PATCH OLK-6.6 5/6] sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash

From: Omar Sandoval <osandov@fb.com> mainline inclusion from mainline-v6.15-rc4 commit bbce3de72be56e4b5f68924b7da9630cc89aa1a8 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash. The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case: 1. In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX. 2. The first for_each_sched_entity() loop dequeues the entity. 3. If the entity was its parent's only child, then the next iteration tries to dequeue the parent. 4. If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_. 5. The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX. This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was: 6. In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number. 7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number. 8. In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number. 9. The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number. 10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calulation to overflow. 11. Nothing appears to be eligible, so pick_eevdf() returns NULL. 12. pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes. Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash). Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.174557099... Conflicts: kernel/sched/fair.c [a trival context conflict, not merged implement delay dequeue, so can't adapt slice = cfs_rq_min_slice(cfs_rq);] Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> --- kernel/sched/fair.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index db18de51b93c..ed097a7a5a63 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7792,9 +7792,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER qos_idle_h_nr_running = task_has_qos_idle_policy(p); #endif - } else { - cfs_rq = group_cfs_rq(se); - slice = cfs_rq_min_slice(cfs_rq); } for_each_sched_entity(se) { -- 2.34.1

Chen Jinghuang

5:09 p.m.

New subject: [PATCH OLK-6.6 6/6] sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks

From: Tianchen Ding <dtcccc@linux.alibaba.com> mainline inclusion from mainline-v6.15-rc1 commit 563bc2161b94571ea425bbe2cf69fd38e24cdedf category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- When a task is enqueued and its parent cgroup se is already on_rq, this parent cgroup se will not be enqueued again, and hence the root->min_slice leaves unchanged. The same issue happens when a task is dequeued and its parent cgroup se has other runnable entities, and the parent cgroup se will not be dequeued. Force propagating min_slice when se doesn't need to be enqueued or dequeued. Ensure the se hierarchy always get the latest min_slice. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250211063659.7180-1-dtcccc@linux.alibaba.com Conflicts: kernel/sched/fair.c [a trival context conflicts, not merge sched/fair: Add new cfs_rq.h_nr_runnable]] Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> --- kernel/sched/fair.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ed097a7a5a63..c1fee185e3cb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7717,6 +7717,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_group(se); se->slice = slice; + if (se != cfs_rq->curr) + min_vruntime_cb_propagate(&se->run_node, NULL); slice = cfs_rq_min_slice(cfs_rq); cfs_rq->h_nr_running++; @@ -7836,6 +7838,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) update_cfs_group(se); se->slice = slice; + if (se != cfs_rq->curr) + min_vruntime_cb_propagate(&se->run_node, NULL); slice = cfs_rq_min_slice(cfs_rq); cfs_rq->h_nr_running -= h_nr_running; -- 2.34.1

patchwork bot

6:13 p.m.

反馈：您发送到kernel@openeuler.org的补丁/补丁集，已成功转换为PR！ PR链接地址： https://atomgit.com/openeuler/kernel/merge_requests/23317 邮件列表地址：https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/TOH... FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/23317 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/TOH...

Age (days ago)

Last active (days ago)

List overview

7 comments

2 participants

participants (2)

Chen Jinghuang
patchwork bot

[PATCH OLK-6.6 0/6] add min_slice and sched_runtime

tags

participants (2)