Fernand Sieber (1):
  sched/fair: Forfeit vruntime on yield

Ingo Molnar (1):
  sched/balancing: Change comment formatting to not overlap Git conflict marker lines

Peter Zijlstra (4):
  sched/fair: Avoid re-setting virtual deadline on 'migrations'
  sched/eevdf: Remove min_vruntime_copy
  sched/core: Add comment explaining force-idle vruntime snapshots
  sched/eevdf: Fix min_vruntime vs avg_vruntime

Zicheng Qu (4):
  sched: Fix kabi breakage of struct sched_entity for on_rq and rel_deadline
  sched: Fix kabi breakage of struct cfs_rq for min_vruntime_copy
  Revert "sched/fair: Only increment deadline once on yield"
  sched: Fix kabi breakage of struct cfs_rq for zero_vruntime

 include/linux/sched.h   |   4 +-
 kernel/sched/debug.c    |   8 +-
 kernel/sched/fair.c     | 277 ++++++++++++++++++++++++++++++----------
 kernel/sched/features.h |   4 +
 kernel/sched/sched.h    |   6 +-
 5 files changed, 223 insertions(+), 76 deletions(-)

-- 
2.34.1
From: Ingo Molnar <mingo@kernel.org>

mainline inclusion
from mainline-v6.10-rc1
commit be8858dba9a2c3aec454a6b382671101fd0dc3b7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

So the scheduler has two such comment blocks, with '=' used as a double
underline:

  /*
   * VRUNTIME
   * ========
   *

'========' also happens to be a Git conflict marker, throwing off a simple
search in an editor for this pattern.

Change them to '-------' type of underline instead - it looks just as good.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240308105901.1096078-7-mingo@kernel.org
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b30b3811c88..49b883b7b94b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3861,7 +3861,7 @@ static void reweight_eevdf(struct sched_entity *se, u64 avruntime,
 
 	/*
 	 * VRUNTIME
-	 * ========
+	 * --------
 	 *
 	 * COROLLARY #1: The virtual runtime of the entity needs to be
 	 * adjusted if re-weight at !0-lag point.
@@ -3944,7 +3944,7 @@ static void reweight_eevdf(struct sched_entity *se, u64 avruntime,
 
 	/*
 	 * DEADLINE
-	 * ========
+	 * --------
 	 *
 	 * When the weight changes, the virtual time slope changes and
 	 * we should adjust the relative virtual deadline accordingly.
-- 
2.34.1
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.12-rc1
commit 82e9d0456e06cebe2c89f3c73cdbc9e3805e9437
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

During OSPM24 Youssef noted that migrations are re-setting the virtual
deadline. Notably everything that does a dequeue-enqueue, like setting
nice, changing preferred numa-node, and a myriad of other random crap,
will cause this to happen.

This shouldn't be. Preserve the relative virtual deadline across such
dequeue/enqueue cycles.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org

Conflicts:
	include/linux/sched.h
	kernel/sched/fair.c
	kernel/sched/features.h
[features.h: context differs due to a comment, so ignore the comment.
sched.h: there is no sched_delayed feature, so ignore that attribute.
fair.c: there is no sched_delayed feature, so ignore that logic and only
pick the necessary 'rel_deadline' code.]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 include/linux/sched.h   |  4 +++-
 kernel/sched/fair.c     | 13 +++++++++++++
 kernel/sched/features.h |  4 ++++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb23790fd2d3..17647b1c35d3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -584,7 +584,9 @@ struct sched_entity {
 	u64			min_vruntime;
 
 	struct list_head	group_node;
-	unsigned int		on_rq;
+	unsigned char		on_rq;
+	unsigned char		rel_deadline;
+	/* hole */
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49b883b7b94b..daa08c92d345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5449,6 +5449,12 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	se->vruntime = vruntime - lag;
 
+	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+		se->deadline += se->vruntime;
+		se->rel_deadline = 0;
+		return;
+	}
+
 	/*
 	 * When joining the competition; the exisiting tasks will be,
 	 * on average, halfway through their slice, as such start tasks
@@ -5564,6 +5570,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
+	bool sleep = flags & DEQUEUE_SLEEP;
 	int action = UPDATE_TG;
 
 	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
@@ -5591,6 +5598,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	clear_buddies(cfs_rq, se);
 
 	update_entity_lag(cfs_rq, se);
+
+	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
+		se->deadline -= se->vruntime;
+		se->rel_deadline = 1;
+	}
+
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index b95797360dd6..1f665fbf0137 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -6,6 +6,10 @@
  */
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+/*
+ * Preserve relative virtual deadline on 'migration'.
+ */
+SCHED_FEAT(PLACE_REL_DEADLINE, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
 SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)
-- 
2.34.1
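To make the effect of PLACE_REL_DEADLINE concrete, here is a small,
self-contained C sketch of the idea (not kernel code; the toy struct,
function names and numbers are illustrative only): on a non-sleep dequeue
the deadline is converted into an offset relative to vruntime, and on
re-placement that offset is added back on top of the new vruntime, so a
dequeue/enqueue cycle no longer hands the task a fresh full slice.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model of a schedulable entity; not the kernel's struct sched_entity. */
    struct toy_entity {
        uint64_t vruntime;
        uint64_t deadline;      /* absolute while queued, relative while off the queue */
        bool     rel_deadline;  /* deadline currently stored relative to vruntime */
    };

    /* Dequeue that is not a sleep: remember the unconsumed part of the request. */
    static void toy_dequeue(struct toy_entity *se, bool sleep)
    {
        if (!sleep) {
            se->deadline -= se->vruntime;   /* keep only the relative offset */
            se->rel_deadline = true;
        }
    }

    /* Re-placement: restore the old offset instead of granting a new full slice. */
    static void toy_place(struct toy_entity *se, uint64_t new_vruntime, uint64_t slice)
    {
        se->vruntime = new_vruntime;
        if (se->rel_deadline) {
            se->deadline += se->vruntime;   /* re-anchor the preserved offset */
            se->rel_deadline = false;
            return;
        }
        se->deadline = se->vruntime + slice;    /* fresh request, full slice */
    }

    int main(void)
    {
        struct toy_entity se = { .vruntime = 1000, .deadline = 1300 };

        toy_dequeue(&se, false);        /* e.g. renice / cpuset move, not a sleep */
        toy_place(&se, 2000, 1000);     /* re-enqueued with a new base vruntime */

        /* Prints 2300, not 3000: the 300 units of remaining request survived. */
        printf("deadline = %llu\n", (unsigned long long)se.deadline);
        return 0;
    }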
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44

--------------------------------

Fix kabi breakage in struct sched_entity for on_rq and rel_deadline.

Fixes: 1d37fd174fff ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 include/linux/sched.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 17647b1c35d3..3d161758c9bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -584,9 +584,9 @@ struct sched_entity {
 	u64			min_vruntime;
 
 	struct list_head	group_node;
-	unsigned char		on_rq;
-	unsigned char		rel_deadline;
-	/* hole */
+	unsigned int		on_rq;
+	KABI_FILL_HOLE(unsigned char rel_deadline)
+	/* 3 holes left here */
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
-- 
2.34.1
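The kABI concern here is purely about struct layout: on a 64-bit build,
`unsigned int on_rq` is followed by padding before the 8-byte-aligned
`u64 exec_start`, so a one-byte `rel_deadline` can live in that padding
without moving or resizing any pre-existing field. A hypothetical
userspace demo of the same effect (plain C, not the kernel headers; the
real struct sched_entity has many more members):

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Layout before: padding follows 'on_rq' to align 'exec_start' to 8 bytes. */
    struct before {
        unsigned int on_rq;
        uint64_t     exec_start;
    };

    /* Layout after: 'rel_deadline' occupies one byte of that padding. */
    struct after {
        unsigned int  on_rq;
        unsigned char rel_deadline;  /* fills part of the existing hole */
        uint64_t      exec_start;
    };

    int main(void)
    {
        /* Same size and same offset for the pre-existing member. */
        printf("sizeof: %zu vs %zu\n", sizeof(struct before), sizeof(struct after));
        printf("exec_start offset: %zu vs %zu\n",
               offsetof(struct before, exec_start),
               offsetof(struct after, exec_start));
        return 0;
    }

On a typical 64-bit ABI both structs are 16 bytes with exec_start at
offset 8, which is why filling the hole preserves binary compatibility
while restoring on_rq to its original unsigned int type.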
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.12-rc1
commit 949090eaf0a3e39aa0f4a675407e16d0e975da11
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Since commit e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue
placement") the min_vruntime_copy is no longer used.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.395297941@infradead.org
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c  | 5 ++---
 kernel/sched/sched.h | 4 ----
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index daa08c92d345..0a5fe4e03fd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -997,8 +997,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	u64_u32_store(cfs_rq->min_vruntime,
-		      __update_min_vruntime(cfs_rq, vruntime));
+	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -14958,7 +14957,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
+	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e21ad151ec9..9e244c090679 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -677,10 +677,6 @@ struct cfs_rq {
 	u64			min_vruntime_fi;
 #endif
 
-#ifndef CONFIG_64BIT
-	u64			min_vruntime_copy;
-#endif
-
 	struct rb_root_cached	tasks_timeline;
 
 	/*
-- 
2.34.1
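For context on why a min_vruntime_copy shadow field existed at all: on a
32-bit kernel a 64-bit store can be observed half-written, so the
`u64_u32_store()` helper removed above publishes the value and then
mirrors it into a copy behind a write barrier, letting readers detect and
retry a torn load. A rough, hypothetical illustration of that pattern in
plain C11 (a conceptual sketch, not the kernel's actual macros, and it
glosses over the formal data-race rules):

    #include <stdatomic.h>
    #include <stdint.h>

    /*
     * Writer publishes 'val' into *var, then mirrors it into *copy after a
     * release fence; readers re-read until both values agree, so a torn
     * 64-bit store on a 32-bit machine is never consumed.
     */
    static void store_with_copy(volatile uint64_t *var, volatile uint64_t *copy,
                                uint64_t val)
    {
        *var = val;
        atomic_thread_fence(memory_order_release);
        *copy = val;
    }

    static uint64_t load_with_copy(const volatile uint64_t *var,
                                   const volatile uint64_t *copy)
    {
        uint64_t v, c;

        do {
            c = *copy;
            atomic_thread_fence(memory_order_acquire);
            v = *var;
        } while (v != c);       /* mismatch => we raced with the writer, retry */

        return v;
    }

As the commit message says, once cross-runqueue placement switched to
lag-based normalisation (e8f331bcc270) nothing reads min_vruntime from a
remote CPU any more, so both the helper and its copy field can go.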
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44

--------------------------------

Fix kabi breakage of struct cfs_rq for min_vruntime_copy.

Fixes: 5e477c8cc709 ("sched/eevdf: Remove min_vruntime_copy")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/sched.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9e244c090679..142dd65d06a2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -677,6 +677,10 @@ struct cfs_rq {
 	u64			min_vruntime_fi;
 #endif
 
+#ifndef CONFIG_64BIT
+	KABI_BROKEN_REMOVE(u64 min_vruntime_copy)
+#endif
+
 	struct rb_root_cached	tasks_timeline;
 
 	/*
-- 
2.34.1
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44

--------------------------------

This reverts commit 53cc953d51ca8f2bf623dc65b563486b07e2b6ae.

The mainline patch 79104becf42b ("sched/fair: Forfeit vruntime on yield")
supersedes it, so revert this maillist-inclusion patch.

Fixes: 1dc0da3e962b ("sched/fair: Only increment deadline once on yield")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a5fe4e03fd5..eaf21d0b1885 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10469,7 +10469,7 @@ static void yield_task_fair(struct rq *rq)
 	 */
 	rq_clock_skip_update(rq);
 
-	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	se->deadline += calc_delta_fair(se->slice, se);
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
-- 
2.34.1
From: Fernand Sieber <sieberf@amazon.com>

mainline inclusion
from mainline-v6.19-rc1
commit 79104becf42baeeb4a3f2b106f954b9fc7c10a3c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

If a task yields, the scheduler may decide to pick it again. The task in
turn may decide to yield immediately or shortly after, leading to a tight
loop of yields.

If there's another runnable task at this point, the deadline will be
increased by the slice at each loop. This can cause the deadline to run
away pretty quickly, and subsequent elevated run delays later on as the
task doesn't get picked again. The reason the scheduler can pick the same
task again and again despite its deadline increasing is that it may be
the only eligible task at that point.

Fix this by making the task forfeit its remaining vruntime and pushing
the deadline one slice ahead. This implements yield behavior more
authentically.

We limit the forfeiting to eligible tasks. This is because core scheduling
prefers running ineligible tasks rather than force idling. As such,
without the condition, we can end up in a yield loop which makes the
vruntime increase rapidly, leading to anomalous run delays later down
the line.

Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eaf21d0b1885..d0e0e66da95f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10469,7 +10469,19 @@ static void yield_task_fair(struct rq *rq)
 	 */
 	rq_clock_skip_update(rq);
 
-	se->deadline += calc_delta_fair(se->slice, se);
+	/*
+	 * Forfeit the remaining vruntime, only if the entity is eligible. This
+	 * condition is necessary because in core scheduling we prefer to run
+	 * ineligible tasks rather than force idling. If this happens we may
+	 * end up in a loop where the core scheduler picks the yielding task,
+	 * which yields immediately again; without the condition the vruntime
+	 * ends up quickly running away.
+	 */
+	if (entity_eligible(cfs_rq, se)) {
+		se->vruntime = se->deadline;
+		se->deadline += calc_delta_fair(se->slice, se);
+		update_min_vruntime(cfs_rq);
+	}
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
-- 
2.34.1
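A tiny standalone C simulation (illustrative values only, not kernel code)
of the failure mode described above: with the old behaviour every yield in
a tight loop pushes the deadline a full slice further out, while the
forfeit behaviour first burns the remaining vruntime up to the old
deadline, so the deadline stays exactly one slice ahead of the service
actually charged.

    #include <stdint.h>
    #include <stdio.h>

    #define SLICE 1000ULL   /* arbitrary virtual-time slice for the demo */

    struct toy_se {
        uint64_t vruntime;
        uint64_t deadline;
    };

    /* Old behaviour: only push the deadline out. */
    static void yield_old(struct toy_se *se)
    {
        se->deadline += SLICE;
    }

    /* New behaviour: forfeit the remaining vruntime, then push one slice ahead. */
    static void yield_new(struct toy_se *se)
    {
        se->vruntime = se->deadline;
        se->deadline += SLICE;
    }

    int main(void)
    {
        struct toy_se a = { 0, SLICE }, b = { 0, SLICE };

        /* The task is re-picked and yields 100 times in a row. */
        for (int i = 0; i < 100; i++) {
            yield_old(&a);
            yield_new(&b);
        }

        /* Old: deadline ran ~100 slices ahead of vruntime; new: exactly one. */
        printf("old: vruntime=%llu deadline=%llu\n",
               (unsigned long long)a.vruntime, (unsigned long long)a.deadline);
        printf("new: vruntime=%llu deadline=%llu\n",
               (unsigned long long)b.vruntime, (unsigned long long)b.deadline);
        return 0;
    }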
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.19-rc1
commit 9359d9785d85bb53f1ff1738a59aeeec4b878906
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

I always end up having to re-read these emails every time I look at this
code. And a future patch is going to change this story a little. This
means it is past time to stick them in a comment so it can be modified
and stay current.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass....
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass....
Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-as...

Conflicts:
	kernel/sched/fair.c
[fair.c: conflict between nr_queued and nr_running in `task_tick_core()`,
so keep nr_running.]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 181 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0e0e66da95f..70b49b96c117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14505,6 +14505,187 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 		resched_curr(rq);
 }
 
+/*
+ * Consider any infeasible weight scenario. Take for instance two tasks,
+ * each bound to their respective sibling, one with weight 1 and one with
+ * weight 2. Then the lower weight task will run ahead of the higher weight
+ * task without bound.
+ *
+ * This utterly destroys the concept of a shared time base.
+ *
+ * Remember; all this is about a proportionally fair scheduling, where each
+ * tasks receives:
+ *
+ *              w_i
+ *   dt_i = ---------- dt                                          (1)
+ *          \Sum_j w_j
+ *
+ * which we do by tracking a virtual time, s_i:
+ *
+ *          1
+ *   s_i = --- d[t]_i                                              (2)
+ *         w_i
+ *
+ * Where d[t] is a delta of discrete time, while dt is an infinitesimal.
+ * The immediate corollary is that the ideal schedule S, where (2) to use
+ * an infinitesimal delta, is:
+ *
+ *            1
+ *   S = ---------- dt                                             (3)
+ *       \Sum_i w_i
+ *
+ * From which we can define the lag, or deviation from the ideal, as:
+ *
+ *   lag(i) = S - s_i                                              (4)
+ *
+ * And since the one and only purpose is to approximate S, we get that:
+ *
+ *   \Sum_i w_i lag(i) := 0                                        (5)
+ *
+ * If this were not so, we no longer converge to S, and we can no longer
+ * claim our scheduler has any of the properties we derive from S. This is
+ * exactly what you did above, you broke it!
+ *
+ *
+ * Let's continue for a while though; to see if there is anything useful
+ * to be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:
+ *
+ *       \Sum_i w_i s_i
+ *   S = --------------                                            (6)
+ *         \Sum_i w_i
+ *
+ * Which gives us a way to compute S, given our s_i. Now, if you've read
+ * our code, you know that we do not in fact do this, the reason for this
+ * is two-fold. Firstly, computing S in that way requires a 64bit division
+ * for every time we'd use it (see 12), and secondly, this only describes
+ * the steady-state, it doesn't handle dynamics.
+ *
+ * Anyway, in (6): s_i -> x + (s_i - x), to get:
+ *
+ *           \Sum_i w_i (s_i - x)
+ *   S - x = --------------------                                  (7)
+ *               \Sum_i w_i
+ *
+ * Which shows that S and s_i transform alike (which makes perfect sense
+ * given that S is basically the (weighted) average of s_i).
+ *
+ * Then:
+ *
+ *   x -> s_min := min{s_i}                                        (8)
+ *
+ * to obtain:
+ *
+ *               \Sum_i w_i (s_i - s_min)
+ *   S = s_min + ------------------------                          (9)
+ *                     \Sum_i w_i
+ *
+ * Which already looks familiar, and is the basis for our current
+ * approximation:
+ *
+ *   S ~= s_min                                                    (10)
+ *
+ * Now, obviously, (10) is absolute crap :-), but it sorta works.
+ *
+ * So the thing to remember is that the above is strictly UP. It is
+ * possible to generalize to multiple runqueues -- however it gets really
+ * yuck when you have to add affinity support, as illustrated by our very
+ * first counter-example.
+ *
+ * Luckily I think we can avoid needing a full multi-queue variant for
+ * core-scheduling (or load-balancing). The crucial observation is that we
+ * only actually need this comparison in the presence of forced-idle; only
+ * then do we need to tell if the stalled rq has higher priority over the
+ * other.
+ *
+ * [XXX assumes SMT2; better consider the more general case, I suspect
+ * it'll work out because our comparison is always between 2 rqs and the
+ * answer is only interesting if one of them is forced-idle]
+ *
+ * And (under assumption of SMT2) when there is forced-idle, there is only
+ * a single queue, so everything works like normal.
+ *
+ * Let, for our runqueue 'k':
+ *
+ *   T_k = \Sum_i w_i s_i
+ *   W_k = \Sum_i w_i ; for all i of k                             (11)
+ *
+ * Then we can write (6) like:
+ *
+ *         T_k
+ *   S_k = ---                                                     (12)
+ *         W_k
+ *
+ * From which immediately follows that:
+ *
+ *           T_k + T_l
+ *   S_k+l = ---------                                             (13)
+ *           W_k + W_l
+ *
+ * On which we can define a combined lag:
+ *
+ *   lag_k+l(i) := S_k+l - s_i                                     (14)
+ *
+ * And that gives us the tools to compare tasks across a combined runqueue.
+ *
+ *
+ * Combined this gives the following:
+ *
+ * a) when a runqueue enters force-idle, sync it against it's sibling rq(s)
+ *    using (7); this only requires storing single 'time'-stamps.
+ *
+ * b) when comparing tasks between 2 runqueues of which one is forced-idle,
+ *    compare the combined lag, per (14).
+ *
+ * Now, of course cgroups (I so hate them) make this more interesting in
+ * that a) seems to suggest we need to iterate all cgroup on a CPU at such
+ * boundaries, but I think we can avoid that. The force-idle is for the
+ * whole CPU, all it's rqs. So we can mark it in the root and lazily
+ * propagate downward on demand.
+ */
+
+/*
+ * So this sync is basically a relative reset of S to 0.
+ *
+ * So with 2 queues, when one goes idle, we drop them both to 0 and one
+ * then increases due to not being idle, and the idle one builds up lag to
+ * get re-elected. So far so simple, right?
+ *
+ * When there's 3, we can have the situation where 2 run and one is idle,
+ * we sync to 0 and let the idle one build up lag to get re-election. Now
+ * suppose another one also drops idle. At this point dropping all to 0
+ * again would destroy the built-up lag from the queue that was already
+ * idle, not good.
+ *
+ * So instead of syncing everything, we can:
+ *
+ *   less := !((s64)(s_a - s_b) <= 0)
+ *
+ *   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
+ *                             == v_a - (v_b - S_a + S_b)
+ *
+ * IOW, we can recast the (lag) comparison to a one-sided difference.
+ * So if then, instead of syncing the whole queue, sync the idle queue
+ * against the active queue with S_a + S_b at the point where we sync.
+ *
+ * (XXX consider the implication of living in a cyclic group: N / 2^n N)
+ *
+ * This gives us means of syncing single queues against the active queue,
+ * and for already idle queues to preserve their build-up lag.
+ *
+ * Of course, then we get the situation where there's 2 active and one
+ * going idle, who do we pick to sync against? Theory would have us sync
+ * against the combined S, but as we've already demonstrated, there is no
+ * such thing in infeasible weight scenarios.
+ *
+ * One thing I've considered; and this is where that core_active rudiment
+ * came from, is having active queues sync up between themselves after
+ * every tick. This limits the observed divergence due to the work
+ * conservancy.
+ *
+ * On top of that, we can improve upon things by moving away from our
+ * horrible (10) hack and moving to (9) and employing (13) here.
+ */
+
 /*
  * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
  */
-- 
2.34.1
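For readers who prefer the comment's ASCII math in conventional notation,
the central identities it relies on can be restated as follows (same
content as equations (1)-(14) above, just in LaTeX form):

    % Ideal proportional share and per-entity virtual time
    \mathrm{d}t_i = \frac{w_i}{\sum_j w_j}\,\mathrm{d}t, \qquad
    s_i = \frac{1}{w_i}\, d[t]_i, \qquad
    S = \frac{1}{\sum_i w_i}\,\mathrm{d}t

    % Lag and its zero-sum constraint, which pins S to the weighted average
    \mathrm{lag}(i) = S - s_i, \qquad
    \sum_i w_i\,\mathrm{lag}(i) = 0
    \;\Longrightarrow\;
    S = \frac{\sum_i w_i\, s_i}{\sum_i w_i}

    % Per-runqueue totals and the combined zero-lag point used when one
    % sibling is forced idle
    T_k = \sum_{i \in k} w_i s_i, \quad
    W_k = \sum_{i \in k} w_i, \quad
    S_k = \frac{T_k}{W_k}, \quad
    S_{k+l} = \frac{T_k + T_l}{W_k + W_l}, \quad
    \mathrm{lag}_{k+l}(i) = S_{k+l} - s_i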
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v6.19-rc1
commit 79f3f9bedd149ea438aaeb0fb6a083637affe205
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Basically, from the constraint that the sum of lag is zero, you can infer
that the 0-lag point is the weighted average of the individual vruntime,
which is what we're trying to compute:

          \Sum w_i * v_i
    avg = --------------
             \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:

    v_i = (v_i - v0) + v0

This does two things:

 - it keeps the key: (v_i - v0) 'small';
 - it creates a relative 0-point in the modular space.

If you do that substitution and work it all out, you end up with:

          \Sum w_i * (v_i - v0)
    avg = --------------------- + v0
                \Sum w_i

Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simply track the numerator and denominator
individually and only perform the division when strictly needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway, as mentioned above, we currently use the CFS era min_vruntime for
this purpose. However, this thing can only move forward, while the above
avg can in fact move backward (when a non-eligible task leaves, the
average becomes smaller); this can cause trouble when through happenstance
(or construction) these values drift far enough apart to wreck the game.

Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept
near/at avg_vruntime, following its motion.

The down-side is that this requires computing the avg more often.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-as...
Cc: stable@vger.kernel.org

Conflicts:
	kernel/sched/sched.h
	kernel/sched/fair.c
[sched.h: conflict with exec_clock, so keep it.
fair.c: in `init_cfs_rq()`, conflict with CONFIG_SMP, keep it. In
`reweight_entity()`, conflict with cfs_rq->nr_queued, but we do not have
6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing
scheduling lag") and c70fc32f4443 ("sched/fair: Adhere to place_entity()
constraints"), so discard it. In `dequeue_entity()`, conflict with
DEQUEUE_DELAYED, so ignore it.]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/debug.c |   8 +--
 kernel/sched/fair.c  | 114 ++++++++++---------------------------------
 kernel/sched/sched.h |   4 +-
 3 files changed, 31 insertions(+), 95 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 10a19e51e9b8..31e71294a949 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -671,7 +671,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -696,15 +696,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	last = __pick_last_entity(cfs_rq);
 	if (last)
 		right_vruntime = last->vruntime;
-	min_vruntime = cfs_rq->min_vruntime;
+	zero_vruntime = cfs_rq->zero_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
 			SPLIT_NS(left_deadline));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
 			SPLIT_NS(left_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
-			SPLIT_NS(min_vruntime));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
+			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
 			SPLIT_NS(avg_vruntime(cfs_rq)));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70b49b96c117..da8b7c25cc6a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -764,7 +764,7 @@ static inline bool entity_before(const struct sched_entity *a,
 
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+	return (s64)(se->vruntime - cfs_rq->zero_vruntime);
 }
 
 #define __node_2_se(node) \
@@ -816,13 +816,13 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * Which we track using:
  *
- *                    v0 := cfs_rq->min_vruntime
+ *                    v0 := cfs_rq->zero_vruntime
  * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
  *              \Sum w_i := cfs_rq->avg_load
  *
- * Since min_vruntime is a monotonic increasing variable that closely tracks
- * the per-task service, these deltas: (v_i - v), will be in the order of the
- * maximal (virtual) lag induced in the system due to quantisation.
+ * Since zero_vruntime closely tracks the per-task service, these
+ * deltas: (v_i - v), will be in the order of the maximal (virtual) lag
+ * induced in the system due to quantisation.
  *
  * Also, we use scale_load_down() to reduce the size.
 *
@@ -881,7 +881,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		avg = div_s64(avg, load);
 	}
 
-	return cfs_rq->min_vruntime + avg;
+	return cfs_rq->zero_vruntime + avg;
 }
 
 /*
@@ -947,7 +947,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+	return avg >= (s64)(vruntime - cfs_rq->zero_vruntime) * load;
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -962,42 +962,14 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
-static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+static void update_zero_vruntime(struct cfs_rq *cfs_rq)
 {
-	u64 min_vruntime = cfs_rq->min_vruntime;
-	/*
-	 * open coded max_vruntime() to allow updating avg_vruntime
-	 */
-	s64 delta = (s64)(vruntime - min_vruntime);
-	if (delta > 0) {
-		avg_vruntime_update(cfs_rq, delta);
-		min_vruntime = vruntime;
-	}
-	return min_vruntime;
-}
+	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime);
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = __pick_root_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 vruntime = cfs_rq->min_vruntime;
-
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
-	}
+	avg_vruntime_update(cfs_rq, delta);
 
-	if (se) {
-		if (!curr)
-			vruntime = se->min_vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->min_vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
+	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -1040,6 +1012,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				__entity_less, &min_vruntime_cb);
@@ -1050,6 +1023,7 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	avg_vruntime_sub(cfs_rq, se);
+	update_zero_vruntime(cfs_rq);
 }
 
 struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
@@ -1366,7 +1340,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_deadline(cfs_rq, curr);
-	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
 		struct task_struct *curtask = task_of(curr);
@@ -3999,15 +3972,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-
-		/*
-		 * The entity's vruntime has been adjusted, so let's check
-		 * whether the rq-wide min_vruntime needs updated too. Since
-		 * the calculations above require stable min_vruntime rather
-		 * than up-to-date one, we do the update at the end of the
-		 * reweight process.
-		 */
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -5613,15 +5577,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_cfs_group(se);
 
-	/*
-	 * Now advance min_vruntime if @se was the entity holding it back,
-	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-	 * put back on, and if we advance min_vruntime, we'll be placed back
-	 * further than we started -- ie. we'll be penalized.
-	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
-		update_min_vruntime(cfs_rq);
-
 	if (cfs_rq->nr_running == 0)
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
 }
@@ -10480,7 +10435,6 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
-		update_min_vruntime(cfs_rq);
 	}
 }
 
@@ -14569,23 +14523,6 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  * Which shows that S and s_i transform alike (which makes perfect sense
  * given that S is basically the (weighted) average of s_i).
  *
- * Then:
- *
- *   x -> s_min := min{s_i}                                        (8)
- *
- * to obtain:
- *
- *               \Sum_i w_i (s_i - s_min)
- *   S = s_min + ------------------------                          (9)
- *                     \Sum_i w_i
- *
- * Which already looks familiar, and is the basis for our current
- * approximation:
- *
- *   S ~= s_min                                                    (10)
- *
- * Now, obviously, (10) is absolute crap :-), but it sorta works.
- *
  * So the thing to remember is that the above is strictly UP. It is
  * possible to generalize to multiple runqueues -- however it gets really
  * yuck when you have to add affinity support, as illustrated by our very
@@ -14607,23 +14544,23 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  * Let, for our runqueue 'k':
  *
  *   T_k = \Sum_i w_i s_i
- *   W_k = \Sum_i w_i ; for all i of k                             (11)
+ *   W_k = \Sum_i w_i ; for all i of k                             (8)
  *
  * Then we can write (6) like:
 *
  *         T_k
- *   S_k = ---                                                     (12)
+ *   S_k = ---                                                     (9)
  *         W_k
 *
  * From which immediately follows that:
 *
  *           T_k + T_l
- *   S_k+l = ---------                                             (13)
+ *   S_k+l = ---------                                             (10)
  *           W_k + W_l
 *
  * On which we can define a combined lag:
 *
- *   lag_k+l(i) := S_k+l - s_i                                     (14)
+ *   lag_k+l(i) := S_k+l - s_i                                     (11)
 *
  * And that gives us the tools to compare tasks across a combined runqueue.
 *
@@ -14634,7 +14571,7 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  *    using (7); this only requires storing single 'time'-stamps.
 *
  * b) when comparing tasks between 2 runqueues of which one is forced-idle,
- *    compare the combined lag, per (14).
+ *    compare the combined lag, per (11).
 *
  * Now, of course cgroups (I so hate them) make this more interesting in
  * that a) seems to suggest we need to iterate all cgroup on a CPU at such
@@ -14682,12 +14619,11 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
  * every tick. This limits the observed divergence due to the work
  * conservancy.
 *
- * On top of that, we can improve upon things by moving away from our
- * horrible (10) hack and moving to (9) and employing (13) here.
+ * On top of that, we can improve upon things by employing (10) here.
  */
 
 /*
- * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
+ * se_fi_update - Update the cfs_rq->zero_vruntime_fi in a CFS hierarchy if needed.
  */
 static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,
			 bool forceidle)
@@ -14701,7 +14637,7 @@ static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,
 			cfs_rq->forceidle_seq = fi_seq;
 		}
 
-		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+		cfs_rq->zero_vruntime_fi = cfs_rq->zero_vruntime;
 	}
 }
 
@@ -14754,11 +14690,11 @@ bool cfs_prio_less(const struct task_struct *a, const struct task_struct *b,
 
 	/*
 	 * Find delta after normalizing se's vruntime with its cfs_rq's
-	 * min_vruntime_fi, which would have been updated in prior calls
+	 * zero_vruntime_fi, which would have been updated in prior calls
 	 * to se_fi_update().
 	 */
 	delta = (s64)(sea->vruntime - seb->vruntime) +
-		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+		(s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
 
 	return delta > 0;
 }
@@ -15150,7 +15086,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+	cfs_rq->zero_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 142dd65d06a2..2f0cd6e9c2d6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -671,10 +671,10 @@ struct cfs_rq {
 	u64			avg_load;
 
 	u64			exec_clock;
-	u64			min_vruntime;
+	u64			zero_vruntime;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
-	u64			min_vruntime_fi;
+	u64			zero_vruntime_fi;
 #endif
 
 #ifndef CONFIG_64BIT
-- 
2.34.1
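The bookkeeping described in the commit message above (numerator in
avg_vruntime, denominator in avg_load, both relative to a reference point
that now follows the weighted average) can be sketched in a few lines of
plain C. This is an illustration of the arithmetic only; the field and
function names mirror the kernel's but the code below is not the kernel
implementation (it omits scale_load_down(), the 'current outside the
tree' adjustment, and other details):

    #include <stdint.h>

    /* Toy runqueue: tracks \Sum w_i*(v_i - v0) and \Sum w_i around v0. */
    struct toy_cfs_rq {
        int64_t  avg_vruntime;   /* numerator, relative to zero_vruntime */
        uint64_t avg_load;       /* denominator */
        uint64_t zero_vruntime;  /* reference point v0 */
    };

    static int64_t key(struct toy_cfs_rq *rq, uint64_t vruntime)
    {
        return (int64_t)(vruntime - rq->zero_vruntime);  /* signed, wrap-safe delta */
    }

    static void add_entity(struct toy_cfs_rq *rq, uint64_t vruntime, uint64_t weight)
    {
        rq->avg_vruntime += key(rq, vruntime) * (int64_t)weight;
        rq->avg_load     += weight;
    }

    static void sub_entity(struct toy_cfs_rq *rq, uint64_t vruntime, uint64_t weight)
    {
        rq->avg_vruntime -= key(rq, vruntime) * (int64_t)weight;
        rq->avg_load     -= weight;
    }

    /* The weighted average: v0 + numerator / denominator. */
    static uint64_t avg_vruntime(struct toy_cfs_rq *rq)
    {
        int64_t avg = rq->avg_vruntime;

        if (rq->avg_load)
            avg /= (int64_t)rq->avg_load;  /* the kernel uses div_s64() */

        return rq->zero_vruntime + avg;
    }

    /* Re-anchor v0 at the current average so the relative keys stay small. */
    static void update_zero_vruntime(struct toy_cfs_rq *rq)
    {
        uint64_t v = avg_vruntime(rq);
        int64_t delta = (int64_t)(v - rq->zero_vruntime);

        /* Moving v0 by delta shrinks every key by delta, i.e. by delta*avg_load. */
        rq->avg_vruntime -= delta * (int64_t)rq->avg_load;
        rq->zero_vruntime = v;
    }

The key point the patch exploits is visible in update_zero_vruntime():
unlike the old min_vruntime, this reference point may move backwards when
a non-eligible entity leaves, so it can track the average exactly instead
of drifting away from it.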
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICDF44

--------------------------------

Fix kabi breakage of struct cfs_rq for zero_vruntime.

Fixes: f05a130a9f37 ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f0cd6e9c2d6..bb581cbdae8d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -671,7 +671,7 @@ struct cfs_rq {
 	u64			avg_load;
 
 	u64			exec_clock;
-	u64			zero_vruntime;
+	KABI_REPLACE(u64 min_vruntime, u64 zero_vruntime)
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			zero_vruntime_fi;
 #endif
-- 
2.34.1
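How a rename like this can avoid breaking the kernel ABI: one common trick
is to keep the old member and the new member in an anonymous union, so the
offset and size seen by out-of-tree consumers of the old name are
unchanged while new code uses the new name. The macro below is a
simplified, hypothetical stand-in written for this illustration; it is not
the real KABI_REPLACE definition from openEuler's kabi helpers, which also
carries additional layout checks.

    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>

    /*
     * Simplified stand-in for a kABI "replace" helper: both names alias the
     * same storage, so layout checks against the old field still pass.
     */
    #define TOY_KABI_REPLACE(_orig, _new)  \
        union {                            \
            _orig;                         \
            _new;                          \
        };

    struct toy_cfs_rq {
        uint64_t avg_load;
        TOY_KABI_REPLACE(uint64_t min_vruntime, uint64_t zero_vruntime)
        uint64_t tasks_timeline_stub;   /* placeholder for the fields that follow */
    };

    int main(void)
    {
        /* Old and new names share one slot; nothing after them moves. */
        static_assert(offsetof(struct toy_cfs_rq, min_vruntime) ==
                      offsetof(struct toy_cfs_rq, zero_vruntime), "aliased");
        return 0;
    }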
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing
list has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/19746
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/6IS...