[PATCH openEuler-24.03-LTS_SP3 0/5] add min_slice and sched_attr::sched_runtime


Chen Jinghuang

1 Jun 2026

5 p.m.

*** BLURB HERE ***

Chen Jinghuang (1):
  sched: Fix kabi breakage in struct sched_entity and sched_class

Peter Zijlstra (4):
  sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
  sched: Allow sched_class::dequeue_task() to fail
  sched/fair: Re-organize dequeue_task_fair()
  sched/eevdf: Propagate min_slice up the cgroup hierarchy

 include/linux/sched.h    |   5 +-
 kernel/sched/core.c      |  11 +++-
 kernel/sched/deadline.c  |   4 +-
 kernel/sched/debug.c     |   3 +-
 kernel/sched/fair.c      | 128 +++++++++++++++++++++++++++++++--------
 kernel/sched/idle.c      |   3 +-
 kernel/sched/rt.c        |   4 +-
 kernel/sched/sched.h     |   3 +-
 kernel/sched/stop_task.c |   3 +-
 9 files changed, 129 insertions(+), 35 deletions(-)

-- 
2.34.1


Chen Jinghuang

1 Jun

5 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 1/7] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.12-rc1
commit 857b158dc5e81c6de795ef6be006eed146098fc6
category: feature
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.

The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.

Applications should strive to use their periodic runtime at a high
confidence interval (95%+) as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.

For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:

  {A,B,C,D}(w=1,r=8):

  ABCD...
  +---+---+---+---

  t=0, V=1.5                  t=1, V=3.5
  A |------<                  A |------<
  B |------<                  B |------<
  C |------<                  C |------<
  D |------<                  D |------<
  ---+*------+-------+---     ---+--*----+-------+---

  t=2, V=5.5                  t=3, V=7.5
  A |------<                  A |------<
  B |------<                  B |------<
  C |------<                  C |------<
  D |------<                  D |------<
  ---+----*--+-------+---     ---+------*+-------+---

  Note: 4 identical tasks in FIFO order

~~~

  {A,B}(w=1,r=16) C(w=2,r=16)

  AACCBBCC...
  +---+---+---+---

  t=0, V=1.25                 t=2, V=5.25
  A |--------------<          A |--------------<
  B |--------------<          B |--------------<
  C |------<                  C |------<
  ---+*------+-------+---     ---+----*--+-------+---

  t=4, V=8.25                 t=6, V=12.25
  A |--------------<          A |--------------<
  B |--------------<          B |--------------<
  C |------<                  C |------<
  ---+-------*-------+---     ---+-------+---*---+---

  Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
        task doesn't go below q.

  Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.

  Note: the period of the heavy task is half the full period at:
        W*(r_i/w_i) = 4*(2q/2) = 4q

~~~

  {A,C,D}(w=1,r=16) B(w=1,r=8):

  BAACCBDD...
  +---+---+---+---

  t=0, V=1.5                  t=1, V=3.5
  A |--------------<          A |---------------<
  B |------<                  B |------<
  C |--------------<          C |--------------<
  D |--------------<          D |--------------<
  ---+*------+-------+---     ---+--*----+-------+---

  t=3, V=7.5                  t=5, V=11.5
  A |---------------<         A |---------------<
  B |------<                  B |------<
  C |--------------<          C |--------------<
  D |--------------<          D |--------------<
  ---+------*+-------+---     ---+-------+--*----+---

  t=6, V=13.5
  A |---------------<
  B |------<
  C |--------------<
  D |--------------<
  ---+-------+----*--+---

  Note: 1 short task -- again double r so that the deadline of the short task
        won't be below q. Made B short because it's not the leftmost task, but is
        eligible with the 0,1,2,3 spread.

  Note: like with the heavy task, the period of the short task observes:
        W*(r_i/w_i) = 4*(1q/1) = 4q

~~~

  A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)

  BCCAABCC...
  +---+---+---+---

  t=0, V=1.25                 t=1, V=3.25
  A |--------------<          A |--------------<
  B |------<                  B |------<
  C |------<                  C |------<
  ---+*------+-------+---     ---+--*----+-------+---

  t=3, V=7.25                 t=5, V=11.25
  A |--------------<          A |--------------<
  B |------<                  B |------<
  C |------<                  C |------<
  ---+------*+-------+---     ---+-------+--*----+---

  t=6, V=13.25
  A |--------------<
  B |------<
  C |------<
  ---+-------+----*--+---

  Note: 1 heavy and 1 short task -- combine them all.

  Note: both the short and heavy task end up with a period of 4q

~~~

  A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)

  BBCAABBC...
  +---+---+---+---

  t=0, V=1                    t=2, V=5
  A |--------------<          A |--------------<
  B |------<                  B |------<
  C |------<                  C |------<
  ---+*------+-------+---     ---+----*--+-------+---

  t=3, V=7                    t=5, V=11
  A |--------------<          A |--------------<
  B |------<                  B |------<
  C |------<                  C |------<
  ---+------*+-------+---     ---+-------+--*----+---

  t=7, V=15
  A |--------------<
  B |------<
  C |------<
  ---+-------+------*+---

  Note: as before but permuted

~~~

From all this it can be deduced that, for the steady state:

 - the total period (P) of a schedule is:	W*max(r_i/w_i)
 - the average period of a task is:		W*(r_i/w_i)
 - each task obtains the fair share:		w_i/W of each full period P

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org

Conflicts:
	include/linux/sched.h
	kernel/sched/core.c
[a trivial context conflict]
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 4 +++-
 kernel/sched/debug.c  | 3 ++-
 kernel/sched/fair.c   | 6 ++++--
 4 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae8300ef77f2..66a3d77a29e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -586,6 +586,7 @@ struct sched_entity {
 	struct list_head group_node;
 	unsigned int on_rq;
 	KABI_FILL_HOLE(unsigned char rel_deadline)
+	unsigned char custom_slice;
 	/* 3 holes left here */
 
 	u64 exec_start;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ccd629f99060..29b40c8c7c57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4542,7 +4542,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations = 0;
 	p->se.vruntime = 0;
 	p->se.vlag = 0;
-	p->se.slice = sysctl_sched_base_slice;
 	p->se.rel_deadline = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
@@ -4832,6 +4831,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
+		p->se.custom_slice = 0;
+		p->se.slice = sysctl_sched_base_slice;
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -10239,6 +10240,7 @@ void __init sched_init(void)
 	}
 
 	set_load_weight(&init_task, false);
+	init_task.se.slice = sysctl_sched_base_slice,
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 7ebf32a4344c..d849534978c3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -622,11 +622,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
+	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
 		p->comm, task_pid_nr(p),
 		SPLIT_NS(p->se.vruntime),
 		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
 		SPLIT_NS(p->se.deadline),
+		p->se.custom_slice ? 'S' : ' ',
 		SPLIT_NS(p->se.slice),
 		SPLIT_NS(p->se.sum_exec_runtime),
 		(long long)(p->nvcsw + p->nivcsw),
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e77256af344f..30557fc4cd43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1304,7 +1304,8 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * nice) while the request time r_i is determined by
 	 * sysctl_sched_base_slice.
 	 */
-	se->slice = sysctl_sched_base_slice;
+	if (!se->custom_slice)
+		se->slice = sysctl_sched_base_slice;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
@@ -5436,7 +5437,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
 	s64 lag = 0;
 
-	se->slice = sysctl_sched_base_slice;
+	if (!se->custom_slice)
+		se->slice = sysctl_sched_base_slice;
 	vslice = calc_delta_fair(se->slice, se);
 
 	/*
-- 
2.34.1
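The steady-state formulas at the end of the commit message can be checked with a few lines of arithmetic. The following is only a sketch for illustration: `struct demo_task` and the helper names are hypothetical and do not exist in the kernel, and integer division is assumed to be exact for the inputs used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration type, not a kernel structure. */
struct demo_task {
	unsigned int weight;	/* w_i */
	unsigned int request;	/* r_i, in the same units as the quantum q */
};

/* W: the sum of all task weights. */
unsigned int demo_total_weight(const struct demo_task *t, size_t n)
{
	unsigned int w = 0;

	for (size_t i = 0; i < n; i++)
		w += t[i].weight;
	return w;
}

/* Average period of task i: W * (r_i / w_i). */
unsigned int demo_task_period(const struct demo_task *t, size_t n, size_t i)
{
	assert(i < n && t[i].weight > 0);
	return demo_total_weight(t, n) * t[i].request / t[i].weight;
}

/* Total period of the schedule: P = W * max(r_i / w_i). */
unsigned int demo_schedule_period(const struct demo_task *t, size_t n)
{
	unsigned int p = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int pi = demo_task_period(t, n, i);

		if (pi > p)
			p = pi;
	}
	return p;
}
```

For the second example above, {A,B}(w=1,r=16) C(w=2,r=16) with q=8, this gives a total period of 64 (8q) and a period of 32 (4q) for the heavy task C, which agrees with the two notes under that example.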

Chen Jinghuang

5 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 2/7] sched: Allow sched_class::dequeue_task() to fail

From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.12-rc1
commit 863ccdbb918a77e3f011571f943020bf7f0b114b
category: cleanup
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Change the function signature of sched_class::dequeue_task() to return a
boolean, allowing future patches to 'fail' dequeue.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org

Conflicts:
	kernel/sched/deadline.c
[a trivial context conflict]
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 kernel/sched/core.c      | 7 +++++--
 kernel/sched/deadline.c  | 4 +++-
 kernel/sched/fair.c      | 4 +++-
 kernel/sched/idle.c      | 3 ++-
 kernel/sched/rt.c        | 4 +++-
 kernel/sched/sched.h     | 2 +-
 kernel/sched/stop_task.c | 3 ++-
 7 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 29b40c8c7c57..2bb9b8e0ac2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2116,7 +2116,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		sched_core_enqueue(rq, p);
 }
 
-static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+/*
+ * Must only return false when DEQUEUE_SLEEP.
+ */
+static inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (sched_core_enabled(rq))
 		sched_core_dequeue(rq, p, flags);
@@ -2130,7 +2133,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	}
 
 	uclamp_rq_dec(rq, p);
-	p->sched_class->dequeue_task(rq, p, flags);
+	return p->sched_class->dequeue_task(rq, p, flags);
 }
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 80131df38d82..1e7e0d070c62 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1745,7 +1745,7 @@ static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		dequeue_pushable_dl_task(rq, p);
 }
 
-static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
@@ -1766,6 +1766,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	if (flags & DEQUEUE_SLEEP)
 		task_non_contending(p);
+
+	return true;
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30557fc4cd43..8105db4052e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7740,7 +7740,7 @@ static void set_next_buddy(struct sched_entity *se);
  * decreased. We remove the task from the rbtree and
  * update the fair scheduling stats:
  */
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
@@ -7819,6 +7819,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+
+	return true;
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 86618f71918f..a644f3ea5711 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -483,13 +483,14 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
  * It is not legal to sleep in the idle task - print a warning
  * message if some code attempts to do it:
  */
-static void
+static bool
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
 	raw_spin_rq_unlock_irq(rq);
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
 	raw_spin_rq_lock_irq(rq);
+	return true;
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bdd4a3081c41..a05f30642f35 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1543,7 +1543,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 		enqueue_pushable_task(rq, p);
 }
 
-static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
 
@@ -1551,6 +1551,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se, flags);
 
 	dequeue_pushable_task(rq, p);
+
+	return true;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb15154fd01c..4c6858ad64db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2450,7 +2450,7 @@ struct sched_class {
 #endif
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
-	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
+	bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
 	bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 7595494ceb6d..c9b0e3ec675b 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -57,10 +57,11 @@ enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 	add_nr_running(rq, 1);
 }
 
-static void
+static bool
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
 	sub_nr_running(rq, 1);
+	return true;
 }
 
 static void yield_task_stop(struct rq *rq)
-- 
2.34.1

Chen Jinghuang

5 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 3/7] sched/fair: Re-organize dequeue_task_fair()

From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.12-rc1
commit fab4a808ba9fb59b691d7096eed9b1494812ffd6
category: cleanup
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Working towards delaying dequeue, notably also inside the hierarchy,
rework dequeue_task_fair() such that it can 'resume' an interrupted
hierarchy walk.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.977256873@infradead.org

Conflicts:
	kernel/sched/fair.c
[a trivial context conflict]
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8105db4052e1..3381780841c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7736,40 +7736,49 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 static void set_next_buddy(struct sched_entity *se);
 
 /*
- * The dequeue_task method is called before nr_running is
- * decreased. We remove the task from the rbtree and
- * update the fair scheduling stats:
+ * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
+ * failing half-way through and resume the dequeue later.
+ *
+ * Returns:
+ * -1 - dequeue delayed
+ *  0 - dequeue throttled
+ *  1 - dequeue complete
  */
-static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 {
-	struct cfs_rq *cfs_rq;
-	struct sched_entity *se = &p->se;
-	int task_sleep = flags & DEQUEUE_SLEEP;
-	int idle_h_nr_running = task_has_idle_policy(p);
 #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER
-	int qos_idle_h_nr_running = task_has_qos_idle_policy(p);
+	int qos_idle_h_nr_running = 0;
 #endif
 	unsigned int prev_nr = rq->cfs.h_nr_running;
 	bool was_sched_idle = sched_idle_rq(rq);
+	bool task_sleep = flags & DEQUEUE_SLEEP;
+	struct task_struct *p = NULL;
+	int idle_h_nr_running = 0;
+	int h_nr_running = 0;
+	struct cfs_rq *cfs_rq;
 
-	util_est_dequeue(&rq->cfs, p);
+	if (entity_is_task(se)) {
+		p = task_of(se);
+		h_nr_running = 1;
+		idle_h_nr_running = task_has_idle_policy(p);
+	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		cfs_rq->h_nr_running--;
+		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER
 		cfs_rq->qos_idle_h_nr_running -= qos_idle_h_nr_running;
 #endif
 
 		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+			idle_h_nr_running = h_nr_running;
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			goto dequeue_throttle;
+			return 0;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7793,22 +7802,20 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);
 
-		cfs_rq->h_nr_running--;
+		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER
 		cfs_rq->qos_idle_h_nr_running -= qos_idle_h_nr_running;
 #endif
 		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+			idle_h_nr_running = h_nr_running;
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			goto dequeue_throttle;
-
+			return 0;
 	}
 
-	/* At this point se is NULL and we are at root level*/
-	sub_nr_running(rq, 1);
+	sub_nr_running(rq, h_nr_running);
 
 	if (prev_nr == 2)
 		overload_clear(rq);
@@ -7816,10 +7823,23 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
-dequeue_throttle:
-	util_est_update(&rq->cfs, p, task_sleep);
-	hrtick_update(rq);
+	return 1;
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+	util_est_dequeue(&rq->cfs, p);
+
+	if (dequeue_entities(rq, &p->se, flags) < 0)
+		return false;
 
+	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
+	hrtick_update(rq);
 	return true;
 }
 
-- 
2.34.1
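The relationship between the tri-state result of dequeue_entities() and the boolean returned by dequeue_task_fair() can be sketched in isolation. This is a simplified userspace model, not kernel code: the `demo_` names and the counter standing in for util_est_update()/hrtick_update() are hypothetical. Note that in the diff as posted, dequeue_entities() only ever returns 0 or 1; the -1 case is reserved for later delayed-dequeue work.

```c
#include <assert.h>
#include <stdbool.h>

/* Return codes documented in the dequeue_entities() comment above. */
enum demo_dequeue_result {
	DEMO_DEQUEUE_DELAYED   = -1,
	DEMO_DEQUEUE_THROTTLED =  0,
	DEMO_DEQUEUE_COMPLETE  =  1,
};

/* Hypothetical stand-in for the post-dequeue bookkeeping. */
struct demo_stats {
	int post_dequeue_updates;
};

/*
 * Same shape as the new dequeue_task_fair(): only a delayed dequeue is
 * reported as failure; throttled and complete both run the bookkeeping
 * and report success.
 */
bool demo_dequeue_task(int entities_result, struct demo_stats *stats)
{
	assert(entities_result >= -1 && entities_result <= 1);

	if (entities_result < 0)
		return false;

	stats->post_dequeue_updates++;
	return true;
}
```

One consequence of this mapping is that a throttled walk (0) is indistinguishable from a completed one (1) for the caller of the sched_class hook; the distinction only matters inside fair.c.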

Chen Jinghuang

5 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 4/7] sched/eevdf: Propagate min_slice up the cgroup hierarchy

From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.12-rc1
commit aef6987d89544d63a47753cf3741cabff0b5574c
category: feature
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

In the absence of an explicit cgroup slice configuration, make mixed
slice length work with cgroups by propagating the min_slice up the
hierarchy.

This ensures the cgroup entity gets timely service to service its
entities that have this timing constraint set on them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org

Conflicts:
	kernel/sched/fair.c
	include/linux/sched.h
[a trivial context conflict]
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 58 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66a3d77a29e7..b23143162c01 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -582,6 +582,7 @@ struct sched_entity {
 	struct rb_node run_node;
 	u64 deadline;
 	u64 min_vruntime;
+	u64 min_slice;
 
 	struct list_head group_node;
 	unsigned int on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3381780841c8..34142bc8fa37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1060,6 +1060,21 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
+static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *root = __pick_root_entity(cfs_rq);
+	struct sched_entity *curr = cfs_rq->curr;
+	u64 min_slice = ~0ULL;
+
+	if (curr && curr->on_rq)
+		min_slice = curr->slice;
+
+	if (root)
+		min_slice = min(min_slice, root->min_slice);
+
+	return min_slice;
+}
+
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
 {
 	return entity_before(__node_2_se(a), __node_2_se(b));
@@ -1075,19 +1090,35 @@ static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node
 	}
 }
 
+static inline void __min_slice_update(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+
+		if (rse->min_slice < se->min_slice)
+			se->min_slice = rse->min_slice;
+	}
+}
+
 /*
  * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
  */
 static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
 {
 	u64 old_min_vruntime = se->min_vruntime;
+	u64 old_min_slice = se->min_slice;
 	struct rb_node *node = &se->run_node;
 
 	se->min_vruntime = se->vruntime;
 	__min_vruntime_update(se, node->rb_right);
 	__min_vruntime_update(se, node->rb_left);
 
-	return se->min_vruntime == old_min_vruntime;
+	se->min_slice = se->slice;
+	__min_slice_update(se, node->rb_right);
+	__min_slice_update(se, node->rb_left);
+
+	return se->min_vruntime == old_min_vruntime &&
+	       se->min_slice == old_min_slice;
 }
 
 RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
@@ -1100,6 +1131,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	sum_w_vruntime_add(cfs_rq, se);
 	se->min_vruntime = se->vruntime;
+	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				__entity_less, &min_vruntime_cb);
 }
@@ -7646,6 +7678,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 #endif
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	unsigned int prev_nr = rq->cfs.h_nr_running;
+	u64 slice = 0;
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -7667,7 +7700,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
+
+		/*
+		 * Basically set the slice of group entries to the min_slice of
+		 * their respective cfs_rq. This ensures the group can service
+		 * its entities in the desired time-frame.
+		 */
+		if (slice) {
+			se->slice = slice;
+			se->custom_slice = 1;
+		}
 		enqueue_entity(cfs_rq, se, flags);
+		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7692,6 +7736,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);
 
+		se->slice = slice;
+		slice = cfs_rq_min_slice(cfs_rq);
+
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER
@@ -7756,11 +7803,15 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int idle_h_nr_running = 0;
 	int h_nr_running = 0;
 	struct cfs_rq *cfs_rq;
+	u64 slice = 0;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
 		h_nr_running = 1;
 		idle_h_nr_running = task_has_idle_policy(p);
+	} else {
+		cfs_rq = group_cfs_rq(se);
+		slice = cfs_rq_min_slice(cfs_rq);
 	}
 
 	for_each_sched_entity(se) {
@@ -7782,6 +7833,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
+			slice = cfs_rq_min_slice(cfs_rq);
+
 			/* Avoid re-evaluating load for this entity: */
 			se = parent_entity(se);
 			/*
@@ -7802,6 +7855,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);
 
+		se->slice = slice;
+		slice = cfs_rq_min_slice(cfs_rq);
+
 		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 #ifdef CONFIG_QOS_SCHED_SMT_EXPELLER
-- 
2.34.1
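The min_slice bookkeeping in this patch follows the usual augmented-tree pattern: each node caches the minimum slice of its subtree, and the runqueue-wide minimum combines the tree root with the currently running entity, which is not in the tree. Below is a minimal userspace sketch of that idea using a plain binary tree; `struct demo_node` and the `demo_` helpers are hypothetical and only mirror the shape of __min_slice_update() and cfs_rq_min_slice().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched_entity linked into the timeline tree. */
struct demo_node {
	uint64_t slice;		/* this entity's own slice */
	uint64_t min_slice;	/* cached minimum over the whole subtree */
	struct demo_node *left, *right;
};

/*
 * Recompute n->min_slice from n->slice and the children's cached values.
 * Returns 1 when the cached value did not change, 0 otherwise.
 */
int demo_min_slice_update(struct demo_node *n)
{
	uint64_t old = n->min_slice;

	n->min_slice = n->slice;
	if (n->left && n->left->min_slice < n->min_slice)
		n->min_slice = n->left->min_slice;
	if (n->right && n->right->min_slice < n->min_slice)
		n->min_slice = n->right->min_slice;

	return n->min_slice == old;
}

/*
 * Minimum slice over the running entity (may be NULL) and the tree
 * (root may be NULL). An empty runqueue yields UINT64_MAX.
 */
uint64_t demo_rq_min_slice(const struct demo_node *root,
			   const struct demo_node *curr)
{
	uint64_t min_slice = UINT64_MAX;

	if (curr)
		min_slice = curr->slice;
	if (root && root->min_slice < min_slice)
		min_slice = root->min_slice;

	return min_slice;
}
```

The UINT64_MAX result for an empty runqueue is worth noting, since a caller that stores it as an entity's slice without checking ends up with an unusable value.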

Chen Jinghuang

5 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 5/7] sched: Fix kabi breakage in struct sched_entity and sched_class

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498
CVE: NA

--------------------------------

Fix kabi breakage in struct sched_entity and sched_class

Fixes: f6a953217565 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")
Fixes: 5239ee04ed75 ("sched: Allow sched_class::dequeue_task() to fail")
Fixes: 442364ee8cbf ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 include/linux/sched.h | 7 +++----
 kernel/sched/sched.h  | 3 ++-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b23143162c01..6bea3514b5dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -582,13 +582,12 @@ struct sched_entity {
 	struct rb_node run_node;
 	u64 deadline;
 	u64 min_vruntime;
-	u64 min_slice;
 
 	struct list_head group_node;
 	unsigned int on_rq;
 	KABI_FILL_HOLE(unsigned char rel_deadline)
-	unsigned char custom_slice;
-	/* 3 holes left here */
+	KABI_FILL_HOLE(unsigned char custom_slice)
+	/* 2 holes left here */
 
 	u64 exec_start;
 	u64 sum_exec_runtime;
@@ -630,7 +629,7 @@ struct sched_entity {
 	 */
 	struct sched_avg avg;
 #endif
-	KABI_RESERVE(1)
+	KABI_USE(1, u64 min_slice)
 	KABI_RESERVE(2)
 	KABI_RESERVE(3)
 	KABI_RESERVE(4)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c6858ad64db..d008b102dece 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2450,7 +2450,8 @@ struct sched_class {
 #endif
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
-	bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
+	KABI_REPLACE(void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags),
+		     bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags))
 	void (*yield_task) (struct rq *rq);
 	bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
 
-- 
2.34.1
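The layout argument behind this fix can be illustrated without the kABI macros themselves: a one-byte field placed in existing padding, and a u64 that takes over a reserved u64 slot, leave the structure size and the offsets of all other members unchanged. The structures below are hypothetical miniatures, not the real sched_entity, and the check assumes a common ABI where uint64_t is aligned to at least 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of the layout before the new fields exist. */
struct demo_entity_old {
	unsigned int on_rq;
	unsigned char rel_deadline;
	/* compiler-inserted padding up to the alignment of exec_start */
	uint64_t exec_start;
	uint64_t reserved1;		/* stands in for KABI_RESERVE(1) */
};

/* Same layout with the padding and the reserved slot put to use. */
struct demo_entity_new {
	unsigned int on_rq;
	unsigned char rel_deadline;
	unsigned char custom_slice;	/* lives in the former padding */
	uint64_t exec_start;
	uint64_t min_slice;		/* takes over the reserved slot */
};

/* Returns 1 when size and the offsets of the shared members match. */
int demo_layout_compatible(void)
{
	return sizeof(struct demo_entity_old) ==
	       sizeof(struct demo_entity_new) &&
	       offsetof(struct demo_entity_old, exec_start) ==
	       offsetof(struct demo_entity_new, exec_start) &&
	       offsetof(struct demo_entity_old, reserved1) ==
	       offsetof(struct demo_entity_new, min_slice);
}
```

By contrast, adding `u64 min_slice` in the middle of the structure, as the earlier patch in this series did, shifts every later member, which is the kind of change the fix moves into the reserved slot.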

Chen Jinghuang

5:01 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 6/7] sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash

From: Omar Sandoval <osandov@fb.com>

mainline inclusion
from mainline-v6.15-rc4
commit bbce3de72be56e4b5f68924b7da9630cc89aa1a8
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.

The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:

1. In the if (entity_is_task(se)) else block at the beginning of
   dequeue_entities(), slice is set to
   cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
   it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
   tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
   first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
   the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:

6. In update_entity_lag(), se->slice is used to calculate limit, which
   ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
   is negative, vlag > limit, so se->vlag is set to the same huge
   negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
   another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
   decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
    incorrectly returns false because the vruntime is so far from the
    other vruntimes on the queue, causing the
    (vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
    pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).

Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.174557099...

Conflicts:
	kernel/sched/fair.c
[a trivial context conflict; the delayed-dequeue implementation is not
merged, so the slice = cfs_rq_min_slice(cfs_rq); change cannot be adapted]
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 kernel/sched/fair.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34142bc8fa37..3bc78ffc27e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7809,9 +7809,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		p = task_of(se);
 		h_nr_running = 1;
 		idle_h_nr_running = task_has_idle_policy(p);
-	} else {
-		cfs_rq = group_cfs_rq(se);
-		slice = cfs_rq_min_slice(cfs_rq);
 	}
 
 	for_each_sched_entity(se) {
-- 
2.34.1
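Steps 6 and 7 of the description can be reproduced with plain integer arithmetic. The sketch below is a simplified model under stated assumptions: it doubles the slice and reinterprets it as a signed value, and it ignores the weight scaling and lower bound that the real update_entity_lag() applies, so the negative value here is small rather than huge. The `demo_` names are hypothetical, and the unsigned-to-signed conversion relies on the two's-complement wrap that mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Same evaluation order as a min(max(val, lo), hi) style clamp. */
int64_t demo_clamp(int64_t val, int64_t lo, int64_t hi)
{
	int64_t v = val > lo ? val : lo;

	return v < hi ? v : hi;
}

/*
 * Simplified lag limit: twice the slice, stored in a signed 64-bit value.
 * Assumption: weight scaling and the tick-based lower bound are left out.
 */
int64_t demo_lag_limit(uint64_t slice)
{
	uint64_t doubled = 2 * slice;	/* wraps modulo 2^64 */

	/* Out-of-range conversion; wraps on mainstream compilers. */
	return (int64_t)doubled;
}
```

With a sane slice the clamp leaves a small lag untouched. With slice equal to UINT64_MAX the limit becomes negative, the bounds are inverted (lo > hi), and the clamp returns the negative limit for any vlag above it, which is the behaviour step 7 describes.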

Chen Jinghuang

5:01 p.m.

New subject: [PATCH openEuler-24.03-LTS_SP3 7/7] sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks

From: Tianchen Ding <dtcccc@linux.alibaba.com>

mainline inclusion
from mainline-v6.15-rc1
commit 563bc2161b94571ea425bbe2cf69fd38e24cdedf
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15498

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When a task is enqueued and its parent cgroup se is already on_rq, this
parent cgroup se will not be enqueued again, and hence the
root->min_slice is left unchanged. The same issue happens when a task is
dequeued and its parent cgroup se has other runnable entities, and the
parent cgroup se will not be dequeued.

Force propagating min_slice when se doesn't need to be enqueued or
dequeued. Ensure the se hierarchy always gets the latest min_slice.

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250211063659.7180-1-dtcccc@linux.alibaba.com

Conflicts:
	kernel/sched/fair.c
[a trivial context conflict; "sched/fair: Add new cfs_rq.h_nr_runnable"
is not merged]
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bc78ffc27e6..a6a3ee9f6f71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7737,6 +7737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 
 		se->slice = slice;
+		if (se != cfs_rq->curr)
+			min_vruntime_cb_propagate(&se->run_node, NULL);
 		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running++;
@@ -7853,6 +7855,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		update_cfs_group(se);
 
 		se->slice = slice;
+		if (se != cfs_rq->curr)
+			min_vruntime_cb_propagate(&se->run_node, NULL);
 		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running -= h_nr_running;
-- 
2.34.1
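The problem this patch addresses is that changing an entity's slice in place, without removing and re-inserting it, leaves the cached subtree minima above it stale. A propagate step walks from the changed node towards the root, recomputing each cached minimum and stopping as soon as one is unchanged. The following userspace sketch models that walk; `struct demo_pnode` and `demo_propagate()` are hypothetical names and only approximate what the generated min_vruntime_cb_propagate() callback does for min_slice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node with a parent link, for illustration only. */
struct demo_pnode {
	uint64_t slice;		/* this entity's own slice */
	uint64_t min_slice;	/* cached minimum over the subtree */
	struct demo_pnode *parent, *left, *right;
};

/*
 * Recompute cached minima from n up to the root. Stops at the first
 * node whose cached value is already correct. Returns the number of
 * nodes whose cached value changed.
 */
int demo_propagate(struct demo_pnode *n)
{
	int updated = 0;

	while (n) {
		uint64_t old = n->min_slice;
		uint64_t m = n->slice;

		if (n->left && n->left->min_slice < m)
			m = n->left->min_slice;
		if (n->right && n->right->min_slice < m)
			m = n->right->min_slice;

		n->min_slice = m;
		if (m == old)
			break;	/* ancestors cannot be affected */

		updated++;
		n = n->parent;
	}
	return updated;
}
```

After assigning a new slice to a node that stays in the tree, calling demo_propagate() on it brings the root's cached minimum up to date, so a subsequent minimum lookup at the root sees the new value.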
