[PATCH OLK-6.6 00/23] OLK-6.6 mainline or LTS patches sync
Chen Yu (1):
  sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running

Ingo Molnar (2):
  sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weight
  sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper functions

Peter Zijlstra (10):
  sched/fair: Fix zero_vruntime tracking
  sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  sched/fair: Adhere to place_entity() constraints
  sched: Unify runtime accounting across classes
  sched: Remove vruntime from trace_sched_stat_runtime()
  sched: Unify more update_curr*()
  sched/eevdf: Allow shorter slices to wakeup-preempt
  sched/fair: Only set slice protection at pick time
  sched/fair: Fix zero_vruntime tracking fix
  sched/debug: Fix avg_vruntime() usage

Vincent Guittot (2):
  sched/fair: Use protect_slice() instead of direct comparison
  sched/fair: Fix NO_RUN_TO_PARITY case

Wang Tao (1):
  sched/eevdf: Update se->vprot in reweight_entity()

Zhang Qiao (1):
  sched: Fix struct sched_entity kabi broken

Zicheng Qu (5):
  sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
  sched: Fix kabi breakage of struct cfs_rq for sum_weight
  sched: Fix kabi breakage of struct cfs_rq for sum_w_vruntime
  sched/eevdf: Disable shorter slices to wakeup-preempt
  sched/fair: Fix vruntime drift by preventing double lag scaling during reweight

zihan zhou (1):
  sched: Cancel the slice protection of the idle entity

 include/linux/sched.h        |  15 +-
 include/trace/events/sched.h |  15 +-
 kernel/sched/core.c          |   4 +-
 kernel/sched/deadline.c      |  15 +-
 kernel/sched/debug.c         |   4 +-
 kernel/sched/fair.c          | 477 +++++++++++++++++++----------------
 kernel/sched/features.h      |   5 +
 kernel/sched/rt.c            |  15 +-
 kernel/sched/sched.h         |  18 +-
 kernel/sched/stop_task.c     |  13 +-
 10 files changed, 304 insertions(+), 277 deletions(-)
--
2.34.1
mainline inclusion from mainline-v7.0-rc1 commit e34881c84c255bc300f24d9fe685324be20da3d1 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/13564 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Consider the following sequence on a CPU configured with nohz_full: 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS bandwidth control. The group sched_entity (gse) of cgroup A, to which task P is attached, is dequeued and the CPU switches to idle. 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to another cgroup B (not throttled). During sched_move_task(), the task P is observed as queued but not running, and therefore no resched_curr() is triggered. 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an explicit scheduling event, i.e., resched_curr(). 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task P has already been migrated out of cgroup A, so unthrottle_cfs_rq() may observe load_weight == 0 and return early without resched_curr() called. For kernel >= 6.6: The unthrottling path normally triggers `resched_curr()` in almost all cases, even when no runnable tasks remain in the unthrottled cgroup, preventing the idle stall described above. However, if cgroup A is removed before it gets unthrottled, the unthrottling path for cgroup A is never executed. As a result, no `resched_curr()` is called. 5) At this point, the task P is runnable in cgroup B (not throttled), but the CPU remains in do_idle() with no pending reschedule point. The system stays in this state until an unrelated event (e.g. a new task wakeup or similar) that can trigger a resched_curr() breaks the nohz_full idle state, and then the task P finally gets scheduled. The root cause is that sched_move_task() may classify the task as only queued, not running, and therefore fails to trigger a resched_curr(), while the later unthrottling path no longer has visibility of the migrated task. Preserve the existing behavior for running tasks by issuing resched_curr(), and explicitly invoke check_preempt_curr() for tasks that were queued at the time of migration. This ensures that runnable tasks are reconsidered for scheduling even when nohz_full suppresses periodic ticks. Fixes: 29f59db3a74b ("sched: group-scheduler core") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com Conflicts: kernel/sched/core.c [Context difference.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9dd56facb6f0..bf0845ece316 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10793,8 +10793,10 @@ void sched_move_task(struct task_struct *tsk) sched_change_group(tsk, group); - if (queued) + if (queued) { enqueue_task(rq, tsk, queue_flags); + check_preempt_curr(rq, tsk, 0); + } if (running) { set_next_task(rq, tsk); /* -- 2.34.1
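To make the queued-but-not-running case concrete, here is a minimal userspace model of the fixed flow. It is illustrative only: resched_curr() and check_preempt_curr() are stubs standing in for the kernel functions, and the locking, group change and flag plumbing are omitted.

#include <stdbool.h>
#include <stdio.h>

/* A nohz_full CPU only leaves do_idle() once need_resched is set. */
static bool need_resched;

static void resched_curr(void)       { need_resched = true; }
static void check_preempt_curr(void) { resched_curr(); /* task is runnable, kick the CPU */ }

static void sched_move_task(bool queued, bool running, bool fixed)
{
	/* ... sched_change_group() would run here ... */
	if (queued && fixed)
		check_preempt_curr();	/* the fix: re-check preemption for queued tasks */
	if (running)
		resched_curr();		/* pre-existing behavior for running tasks */
}

int main(void)
{
	/* Step 2 of the changelog: P is queued but not running. */
	sched_move_task(true, false, false);
	printf("before fix: need_resched=%d (CPU stays in do_idle)\n", need_resched);

	sched_move_task(true, false, true);
	printf("after fix:  need_resched=%d (idle loop reschedules)\n", need_resched);
	return 0;
}

Before the fix, the queued path leaves need_resched clear, which is exactly the state in which a nohz_full CPU stays parked in do_idle().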
From: Ingo Molnar <mingo@kernel.org> mainline inclusion from mainline-v7.0-rc1 commit 4ff674fa986c27ec8a0542479258c92d361a2566 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The ::avg_load field is a long-standing misnomer: it says it's an 'average load', but in reality it's the momentary sum of the load of all currently runnable tasks. We'd have to also perform a division by nr_running (or use time-decay) to arrive at any sort of average value. This is clear from comments about the math of fair scheduling: * \Sum w_i := cfs_rq->avg_load The sum of all weights is ... the sum of all weights, not the average of all weights. To make it doubly confusing, there's also an ::avg_load in the load-balancing struct sg_lb_stats, which *is* a true average. The second part of the field's name is a minor misnomer as well: it says 'load', and it is indeed a load_weight structure as it shares code with the load-balancer - but it's only in an SMP load-balancing context where load = weight, in the fair scheduling context the primary purpose is the weighting of different nice levels. So rename the field to ::sum_weight instead, which makes the terminology of the EEVDF math match up with our implementation of it: * \Sum w_i := cfs_rq->sum_weight Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 16 ++++++++-------- kernel/sched/sched.h | 2 +- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ad30bb800961..7706902e904d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -818,7 +818,7 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) * * v0 := cfs_rq->zero_vruntime * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime - * \Sum w_i := cfs_rq->avg_load + * \Sum w_i := cfs_rq->sum_weight * * Since zero_vruntime closely tracks the per-task service, these * deltas: (v_i - v), will be in the order of the maximal (virtual) lag @@ -835,7 +835,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) s64 key = entity_key(cfs_rq, se); cfs_rq->avg_vruntime += key * weight; - cfs_rq->avg_load += weight; + cfs_rq->sum_weight += weight; } static void @@ -845,16 +845,16 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) s64 key = entity_key(cfs_rq, se); cfs_rq->avg_vruntime -= key * weight; - cfs_rq->avg_load -= weight; + cfs_rq->sum_weight -= weight; } static inline void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) { /* - * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load + * v' = v + d ==> avg_vruntime' = avg_runtime - d*sum_weight */ - cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta; + cfs_rq->avg_vruntime -= cfs_rq->sum_weight * delta; } /* @@ -865,7 +865,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; s64 avg = cfs_rq->avg_vruntime; - long load = cfs_rq->avg_load; + long load = cfs_rq->sum_weight; if (curr && curr->on_rq) { unsigned long weight = scale_load_down(curr->load.weight); @@ -938,7 +938,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime) { struct sched_entity *curr = cfs_rq->curr; s64 avg = cfs_rq->avg_vruntime; - long load = cfs_rq->avg_load; + long load = cfs_rq->sum_weight; if (curr && curr->on_rq) { unsigned long weight 
= scale_load_down(curr->load.weight); @@ -5419,7 +5419,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * * vl_i = (W + w_i)*vl'_i / W */ - load = cfs_rq->avg_load; + load = cfs_rq->sum_weight; if (curr && curr->on_rq) load += scale_load_down(curr->load.weight); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2cbbf75a3e92..718dbdd18434 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -669,7 +669,7 @@ struct cfs_rq { unsigned int idle_h_nr_running; /* SCHED_IDLE */ s64 avg_vruntime; - u64 avg_load; + u64 sum_weight; u64 exec_clock; KABI_REPLACE(u64 min_vruntime, u64 zero_vruntime) -- 2.34.1
hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 -------------------------------- Fix kabi breakage of struct cfs_rq for sum_weight. Fixes: fa56d9bd5b28 ("sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weight") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/sched.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 718dbdd18434..bb5a36cac77c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -669,7 +669,7 @@ struct cfs_rq { unsigned int idle_h_nr_running; /* SCHED_IDLE */ s64 avg_vruntime; - u64 sum_weight; + KABI_REPLACE(u64 avg_load, u64 sum_weight) u64 exec_clock; KABI_REPLACE(u64 min_vruntime, u64 zero_vruntime) -- 2.34.1
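For readers unfamiliar with the kabi helpers, the following is a simplified sketch of the general KABI_REPLACE pattern, not openEuler's exact macro; struct cfs_rq_like and the values are made up. The point is that the replaced and replacing members overlay each other in an anonymous union, so the structure's size and member offsets, and therefore the kernel ABI, stay unchanged.

#include <stdio.h>
#include <stdint.h>

/* Simplified sketch of the KABI_REPLACE idea -- NOT the exact macro:
 * old and new members share storage in an anonymous union. */
#define KABI_REPLACE(_orig, _new)	\
	union {				\
		_new;			\
		struct { _orig; };	\
	}

struct cfs_rq_like {			/* stand-in, not the real struct */
	KABI_REPLACE(uint64_t avg_load, uint64_t sum_weight);
	uint64_t exec_clock;
};

int main(void)
{
	struct cfs_rq_like rq = { 0 };

	rq.sum_weight = 4145;		/* new code writes the new name...   */
	printf("avg_load=%llu size=%zu\n",	/* ...old name maps to same storage */
	       (unsigned long long)rq.avg_load, sizeof(struct cfs_rq_like));
	return 0;
}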
From: Ingo Molnar <mingo@kernel.org> mainline inclusion from mainline-v7.0-rc1 commit dcbc9d3f0e594223275a18f7016001889ad35eff category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The ::avg_vruntime field is a misnomer: it says it's an 'average vruntime', but in reality it's the momentary sum of the weighted vruntimes of all queued tasks, which is at least a division away from being an average. This is clear from comments about the math of fair scheduling: * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime This confusion is increased by the cfs_avg_vruntime() function, which does perform the division and returns a true average. The sum of all weighted vruntimes should be named thusly, so rename the field to ::sum_w_vruntime. (As arguably ::sum_weighted_vruntime would be a bit of a mouthful.) Understanding the scheduler is hard enough already, without extra layers of obfuscated naming. ;-) Also rename the related helper functions: avg_vruntime_add() => sum_w_vruntime_add() avg_vruntime_sub() => sum_w_vruntime_sub() avg_vruntime_update() => sum_w_vruntime_update() With the notable exception of cfs_avg_vruntime(), which was named accurately. Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org Conflicts: kernel/sched/sched.h [Conflicts with the kabi commit 731fcfd1da82 ("[Huawei] sched: Fix kabi breakage of struct cfs_rq for sum_weight"), so ignore it.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 26 +++++++++++++------------- kernel/sched/sched.h | 2 +- 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7706902e904d..036589906f4e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -817,7 +817,7 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) * Which we track using: * * v0 := cfs_rq->zero_vruntime - * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime + * \Sum (v_i - v0) * w_i := cfs_rq->sum_w_vruntime * \Sum w_i := cfs_rq->sum_weight * * Since zero_vruntime closely tracks the per-task service, these @@ -829,32 +829,32 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) * As measured, the max (key * weight) value was ~44 bits for a kernel build.
*/ static void -avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) +sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) { unsigned long weight = scale_load_down(se->load.weight); s64 key = entity_key(cfs_rq, se); - cfs_rq->avg_vruntime += key * weight; + cfs_rq->sum_w_vruntime += key * weight; cfs_rq->sum_weight += weight; } static void -avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) +sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) { unsigned long weight = scale_load_down(se->load.weight); s64 key = entity_key(cfs_rq, se); - cfs_rq->avg_vruntime -= key * weight; + cfs_rq->sum_w_vruntime -= key * weight; cfs_rq->sum_weight -= weight; } static inline -void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) +void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) { /* - * v' = v + d ==> avg_vruntime' = avg_runtime - d*sum_weight + * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight */ - cfs_rq->avg_vruntime -= cfs_rq->sum_weight * delta; + cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta; } /* @@ -864,7 +864,7 @@ void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) u64 avg_vruntime(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; - s64 avg = cfs_rq->avg_vruntime; + s64 avg = cfs_rq->sum_w_vruntime; long load = cfs_rq->sum_weight; if (curr && curr->on_rq) { @@ -937,7 +937,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime) { struct sched_entity *curr = cfs_rq->curr; - s64 avg = cfs_rq->avg_vruntime; + s64 avg = cfs_rq->sum_w_vruntime; long load = cfs_rq->sum_weight; if (curr && curr->on_rq) { @@ -967,7 +967,7 @@ static void update_zero_vruntime(struct cfs_rq *cfs_rq) u64 vruntime = avg_vruntime(cfs_rq); s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime); - avg_vruntime_update(cfs_rq, delta); + sum_w_vruntime_update(cfs_rq, delta); cfs_rq->zero_vruntime = vruntime; } @@ -1011,7 +1011,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity, */ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { - avg_vruntime_add(cfs_rq, se); + sum_w_vruntime_add(cfs_rq, se); update_zero_vruntime(cfs_rq); se->min_vruntime = se->vruntime; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, @@ -1022,7 +1022,7 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, &min_vruntime_cb); - avg_vruntime_sub(cfs_rq, se); + sum_w_vruntime_sub(cfs_rq, se); update_zero_vruntime(cfs_rq); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bb5a36cac77c..6c9efa7e479f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -668,7 +668,7 @@ struct cfs_rq { unsigned int idle_nr_running; /* SCHED_IDLE */ unsigned int idle_h_nr_running; /* SCHED_IDLE */ - s64 avg_vruntime; + s64 sum_w_vruntime; KABI_REPLACE(u64 avg_load, u64 sum_weight) u64 exec_clock; -- 2.34.1
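The renamed fields make the EEVDF bookkeeping read exactly like the math. Below is a small userspace sketch (toy values, not kernel code) of how the true weighted average V = v0 + \Sum (v_i - v0)*w_i / \Sum w_i falls out of sum_w_vruntime and sum_weight; the kernel additionally applies a sign-dependent floor so that the result is left-biased.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t zero_vruntime  = 1000;			/* v0 */
	int64_t v[]            = { 1100, 900 };		/* task vruntimes */
	int64_t w[]            = { 1024, 2048 };	/* task weights   */
	int64_t sum_w_vruntime = 0, sum_weight = 0;

	for (int i = 0; i < 2; i++) {
		sum_w_vruntime += (v[i] - zero_vruntime) * w[i];	/* sum_w_vruntime_add() */
		sum_weight     += w[i];
	}
	/* V = v0 + sum_w_vruntime / sum_weight */
	printf("V = %lld\n",
	       (long long)(zero_vruntime + sum_w_vruntime / sum_weight));
	return 0;
}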
hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 -------------------------------- Fix kabi breakage of struct cfs_rq for sum_w_vruntime. Fixes: 53543affdd3b ("sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper functions") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/sched.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6c9efa7e479f..e9eb9d47c5db 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -668,7 +668,7 @@ struct cfs_rq { unsigned int idle_nr_running; /* SCHED_IDLE */ unsigned int idle_h_nr_running; /* SCHED_IDLE */ - s64 sum_w_vruntime; + KABI_REPLACE(s64 avg_vruntime, s64 sum_w_vruntime) KABI_REPLACE(u64 avg_load, u64 sum_weight) u64 exec_clock; -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v7.0-rc2 commit b3d99f43c72b56cf7a104a364e7fb34b0702828b category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- It turns out that zero_vruntime tracking is broken when there is but a single task running. Current update paths are through __{en,de}queue_entity(), and when there is but a single task, pick_next_task() will always return that one task, and put_prev_set_next_task() will end up in neither function. This can cause entity_key() to grow indefinitely large and cause overflows, leading to much pain and suffering. Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which are called from {set_next,put_prev}_entity(), has problems because: - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se. This means the avg_vruntime() will see the removal but not current, missing the entity for accounting. - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr = NULL. This means the avg_vruntime() will see the addition *and* current, leading to double accounting. Both cases are incorrect/inconsistent. Noting that avg_vruntime is already called on each {en,de}queue, remove the explicit avg_vruntime() calls (which removes an extra 64bit division for each {en,de}queue) and have avg_vruntime() update zero_vruntime itself. Additionally, have the tick call avg_vruntime() -- discarding the result, but for the side-effect of updating zero_vruntime. While there, optimize avg_vruntime() by noting that the average of one value is rather trivial to compute. Test case: # taskset -c -p 1 $$ # taskset -c 2 bash -c 'while :; do :; done&' # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>" PRE: .zero_vruntime : 31316.407903
R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 / .zero_vruntime : 382548.253179 R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 /
POST: .zero_vruntime : 17259.709467
R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 / .zero_vruntime : 18702.723356 R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 /
Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080624.438854780%40infradead.org Conflicts: kernel/sched/fair.c [In avg_vruntime(), `ceil` and `ceiling` in comment are different, ignore the difference and accept the mainline comment, because hulk-6.6 does not have the commit b9e6e2866392 ("sched/fair: Fix typos in comments"). In cfs_rq_min_slice(), only context is different due to the lack of the commit aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy").] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 86 ++++++++++++++++++++++++++++++--------------- 1 file changed, 58 insertions(+), 28 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 036589906f4e..92381ec2ac38 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -762,6 +762,21 @@ static inline bool entity_before(const struct sched_entity *a, return (s64)(a->deadline - b->deadline) < 0; } +/* + * Per avg_vruntime() below, cfs_rq::zero_vruntime is only slightly stale + * and this value should be no more than two lag bounds. Which puts it in the + * general order of: + * + * (slice + TICK_NSEC) << NICE_0_LOAD_SHIFT + * + * which is around 44 bits in size (on 64bit); that is 20 for + * NICE_0_LOAD_SHIFT, another 20 for NSEC_PER_MSEC and then a handful for + * however many msec the actual slice+tick ends up being. + * + * (disregarding the actual divide-by-weight part makes for the worst case + * weight of 2, which nicely cancels vs the fuzz in zero_vruntime not actually + * being the zero-lag point). + */ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) { return (s64)(se->vruntime - cfs_rq->zero_vruntime); @@ -849,39 +864,61 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) } static inline -void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) +void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta) { /* - * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight + * v' = v + d ==> sum_w_vruntime' = sum_w_vruntime - d*sum_weight */ cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta; + cfs_rq->zero_vruntime += delta; } /* - * Specifically: avg_runtime() + 0 must result in entity_eligible() := true + * Specifically: avg_vruntime() + 0 must result in entity_eligible() := true * For this to be so, the result of this function must have a left bias. + * + * Called in: + * - place_entity() -- before enqueue + * - update_entity_lag() -- before dequeue + * - entity_tick() + * + * This means it is one entry 'behind' but that puts it close enough to where + * the bound on entity_key() is at most two lag bounds.
*/ u64 avg_vruntime(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; - s64 avg = cfs_rq->sum_w_vruntime; - long load = cfs_rq->sum_weight; + long weight = cfs_rq->sum_weight; + s64 delta = 0; - if (curr && curr->on_rq) { - unsigned long weight = scale_load_down(curr->load.weight); + if (curr && !curr->on_rq) + curr = NULL; - avg += entity_key(cfs_rq, curr) * weight; - load += weight; - } + if (weight) { + s64 runtime = cfs_rq->sum_w_vruntime; - if (load) { - /* sign flips effective floor / ceil */ - if (avg < 0) - avg -= (load - 1); - avg = div_s64(avg, load); + if (curr) { + unsigned long w = scale_load_down(curr->load.weight); + + runtime += entity_key(cfs_rq, curr) * w; + weight += w; + } + + /* sign flips effective floor / ceiling */ + if (runtime < 0) + runtime -= (weight - 1); + + delta = div_s64(runtime, weight); + } else if (curr) { + /* + * When there is but one element, it is the average. + */ + delta = curr->vruntime - cfs_rq->zero_vruntime; } - return cfs_rq->zero_vruntime + avg; + update_zero_vruntime(cfs_rq, delta); + + return cfs_rq->zero_vruntime; } /* @@ -962,16 +999,6 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se) return vruntime_eligible(cfs_rq, se->vruntime); } -static void update_zero_vruntime(struct cfs_rq *cfs_rq) -{ - u64 vruntime = avg_vruntime(cfs_rq); - s64 delta = (s64)(vruntime - cfs_rq->zero_vruntime); - - sum_w_vruntime_update(cfs_rq, delta); - - cfs_rq->zero_vruntime = vruntime; -} - static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) { return entity_before(__node_2_se(a), __node_2_se(b)); @@ -1012,7 +1039,6 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity, static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { sum_w_vruntime_add(cfs_rq, se); - update_zero_vruntime(cfs_rq); se->min_vruntime = se->vruntime; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less, &min_vruntime_cb); @@ -1023,7 +1049,6 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, &min_vruntime_cb); sum_w_vruntime_sub(cfs_rq, se); - update_zero_vruntime(cfs_rq); } struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq) @@ -5701,6 +5726,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) update_load_avg(cfs_rq, curr, UPDATE_TG); update_cfs_group(curr); + /* + * Pulls along cfs_rq::zero_vruntime. + */ + avg_vruntime(cfs_rq); + #ifdef CONFIG_SCHED_HRTICK /* * queued ticks are scheduled to match the slice, so don't bother -- 2.34.1
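A userspace model of the reworked avg_vruntime() may help. It omits folding the current entity's weight into the average and the sign-biased division, but shows the two essential changes: the divide-free single-task path, and zero_vruntime being pulled along as a side effect so entity_key() stays bounded. The toy_* names are invented for the sketch.

#include <stdio.h>
#include <stdint.h>

struct toy_rq { int64_t zero_vruntime, sum_w_vruntime; long sum_weight; };

static int64_t toy_avg_vruntime(struct toy_rq *rq, int64_t curr_vruntime)
{
	int64_t delta;

	if (rq->sum_weight)		/* queued entities: weighted average */
		delta = rq->sum_w_vruntime / rq->sum_weight;
	else				/* lone running task IS the average  */
		delta = curr_vruntime - rq->zero_vruntime;

	/* update_zero_vruntime(): shift the reference point by delta */
	rq->sum_w_vruntime -= rq->sum_weight * delta;
	rq->zero_vruntime  += delta;
	return rq->zero_vruntime;
}

int main(void)
{
	struct toy_rq rq = { .zero_vruntime = 17259 };

	/* a lone task keeps running; zero_vruntime now tracks it each tick */
	for (int64_t v = 17259; v <= 18702; v += 481)
		printf("zero_vruntime = %lld\n",
		       (long long)toy_avg_vruntime(&rq, v));
	return 0;
}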
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.13 commit 6d71a9c6160479899ee744d2c6d6602a191deb1f category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- I noticed this in my traces today: turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -> { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -> { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 } Which is a weight transition: 1048576 -> 2 -> 1048576. One would expect the lag to shoot out *AND* come back, notably: -1123*1048576/2 = -588775424 -588775424*2/1048576 = -1123 Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further. This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE. And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity(). With the below patch, I now see things like: migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } Which isn't perfect yet, but much closer. Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.... Conflicts: kernel/sched/fair.c [Context difference; git am reports a warning, but cherry-pick works fine.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 145 ++++++-------------------------------------- 1 file changed, 18 insertions(+), 127 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 92381ec2ac38..4d66ebd63f98 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -937,21 +937,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) * * XXX could add max_slice to the augmented data to track this.
*/ -static s64 entity_lag(u64 avruntime, struct sched_entity *se) +static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) { s64 vlag, limit; - vlag = avruntime - se->vruntime; - limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); - - return clamp(vlag, -limit, limit); -} - -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ SCHED_WARN_ON(!se->on_rq); - se->vlag = entity_lag(avg_vruntime(cfs_rq), se); + vlag = avg_vruntime(cfs_rq) - se->vruntime; + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); + + se->vlag = clamp(vlag, -limit, limit); } /* @@ -3850,137 +3845,32 @@ static inline void dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } #endif -static void reweight_eevdf(struct sched_entity *se, u64 avruntime, - unsigned long weight) -{ - unsigned long old_weight = se->load.weight; - s64 vlag, vslice; - - /* - * VRUNTIME - * -------- - * - * COROLLARY #1: The virtual runtime of the entity needs to be - * adjusted if re-weight at !0-lag point. - * - * Proof: For contradiction assume this is not true, so we can - * re-weight without changing vruntime at !0-lag point. - * - * Weight VRuntime Avg-VRuntime - * before w v V - * after w' v' V' - * - * Since lag needs to be preserved through re-weight: - * - * lag = (V - v)*w = (V'- v')*w', where v = v' - * ==> V' = (V - v)*w/w' + v (1) - * - * Let W be the total weight of the entities before reweight, - * since V' is the new weighted average of entities: - * - * V' = (WV + w'v - wv) / (W + w' - w) (2) - * - * by using (1) & (2) we obtain: - * - * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v - * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v - * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v - * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3) - * - * Since we are doing at !0-lag point which means V != v, we - * can simplify (3): - * - * ==> W / (W + w' - w) = w / w' - * ==> Ww' = Ww + ww' - ww - * ==> W * (w' - w) = w * (w' - w) - * ==> W = w (re-weight indicates w' != w) - * - * So the cfs_rq contains only one entity, hence vruntime of - * the entity @v should always equal to the cfs_rq's weighted - * average vruntime @V, which means we will always re-weight - * at 0-lag point, thus breach assumption. Proof completed. - * - * - * COROLLARY #2: Re-weight does NOT affect weighted average - * vruntime of all the entities. - * - * Proof: According to corollary #1, Eq. (1) should be: - * - * (V - v)*w = (V' - v')*w' - * ==> v' = V' - (V - v)*w/w' (4) - * - * According to the weighted average formula, we have: - * - * V' = (WV - wv + w'v') / (W - w + w') - * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w') - * = (WV - wv + w'V' - Vw + wv) / (W - w + w') - * = (WV + w'V' - Vw) / (W - w + w') - * - * ==> V'*(W - w + w') = WV + w'V' - Vw - * ==> V' * (W - w) = (W - w) * V (5) - * - * If the entity is the only one in the cfs_rq, then reweight - * always occurs at 0-lag point, so V won't change. Or else - * there are other entities, hence W != w, then Eq. (5) turns - * into V' = V. So V won't change in either case, proof done. 
- * - * - * So according to corollary #1 & #2, the effect of re-weight - * on vruntime should be: - * - * v' = V' - (V - v) * w / w' (4) - * = V - (V - v) * w / w' - * = V - vl * w / w' - * = V - vl' - */ - if (avruntime != se->vruntime) { - vlag = entity_lag(avruntime, se); - vlag = div_s64(vlag * old_weight, weight); - se->vruntime = avruntime - vlag; - } - - /* - * DEADLINE - * -------- - * - * When the weight changes, the virtual time slope changes and - * we should adjust the relative virtual deadline accordingly. - * - * d' = v' + (d - v)*w/w' - * = V' - (V - v)*w/w' + (d - v)*w/w' - * = V - (V - v)*w/w' + (d - v)*w/w' - * = V + (d - V)*w/w' - */ - vslice = (s64)(se->deadline - avruntime); - vslice = div_s64(vslice * old_weight, weight); - se->deadline = avruntime + vslice; -} +static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags); static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight) { bool curr = cfs_rq->curr == se; - u64 avruntime; if (se->on_rq) { /* commit outstanding execution time */ update_curr(cfs_rq); - avruntime = avg_vruntime(cfs_rq); + update_entity_lag(cfs_rq, se); + se->deadline -= se->vruntime; + se->rel_deadline = 1; if (!curr) __dequeue_entity(cfs_rq, se); update_load_sub(&cfs_rq->load, se->load.weight); } dequeue_load_avg(cfs_rq, se); - if (se->on_rq) { - reweight_eevdf(se, avruntime, weight); - } else { - /* - * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), - * we need to scale se->vlag when w_i changes. - */ - se->vlag = div_s64(se->vlag * se->load.weight, weight); - } + /* + * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), + * we need to scale se->vlag when w_i changes. + */ + se->vlag = div_s64(se->vlag * se->load.weight, weight); + if (se->rel_deadline) + se->deadline = div_s64(se->deadline * se->load.weight, weight); update_load_set(&se->load, weight); @@ -3995,6 +3885,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, enqueue_load_avg(cfs_rq, se); if (se->on_rq) { update_load_add(&cfs_rq->load, se->load.weight); + place_entity(cfs_rq, se, 0); if (!curr) __enqueue_entity(cfs_rq, se); } @@ -5456,7 +5347,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) se->vruntime = vruntime - lag; - if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) { + if (se->rel_deadline) { se->deadline += se->vruntime; se->rel_deadline = 0; return; -- 2.34.1
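Since se->vlag stores V - v_i while the lag proper is w_i*(V - v_i), the patch rescales vlag by old_weight/new_weight on every reweight. A standalone check of the round-trip arithmetic quoted in the changelog:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t vlag  = -1123;		/* V - v before reweight      */
	int64_t w_old = 1048576;	/* nice-0-ish weight          */
	int64_t w_new = 2;		/* deep SCHED_IDLE-ish weight */

	int64_t out  = vlag * w_old / w_new;	/* expect -588775424 */
	int64_t back = out  * w_new / w_old;	/* expect -1123 again */

	printf("out=%lld back=%lld\n", (long long)out, (long long)back);
	return 0;
}

The weighted lag w_i*(V - v_i) is identical before and after the round trip, which is exactly the invariant the broken reweight_eevdf() path failed to maintain.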
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.16-rc1 commit c70fc32f44431bb30f9025ce753ba8be25acbba3 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ------------------------------- Mike reports that commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") relies on commit 4423af84b297 ("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not trip a WARN in place_entity(). What happens is that the lag of the very last entity is 0 per definition -- the average of one element matches the value of that element. Therefore place_entity() will match the condition skipping the lag adjustment: if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) { Without the 'se->vlag' condition -- it will attempt to adjust the zero lag even though we're inserting into an empty tree. Notably, we should have failed the 'cfs_rq->nr_queued' condition, but don't because they didn't get updated. Additionally, move update_load_add() after placement() as is consistent with other place_entity() users -- this change is non-functional, place_entity() does not use cfs_rq->load. Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gmx... Conflicts: kernel/sched/fair.c [The field 'nr_running' has been renamed to 'nr_queued' in mainline commit 736c55a02c47 ("sched/fair: Rename cfs_rq.nr_running into nr_queued"), but we do not have that patch in the current version and taking it would bring plenty of conflicts, so just restore the field name to 'nr_running'.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4d66ebd63f98..0bdd25bf02f5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3858,6 +3858,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, update_entity_lag(cfs_rq, se); se->deadline -= se->vruntime; se->rel_deadline = 1; + cfs_rq->nr_running--; if (!curr) __dequeue_entity(cfs_rq, se); update_load_sub(&cfs_rq->load, se->load.weight); @@ -3884,10 +3885,11 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, enqueue_load_avg(cfs_rq, se); if (se->on_rq) { - update_load_add(&cfs_rq->load, se->load.weight); place_entity(cfs_rq, se, 0); + update_load_add(&cfs_rq->load, se->load.weight); if (!curr) __enqueue_entity(cfs_rq, se); + cfs_rq->nr_running++; } } -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> stable inclusion from stable-v6.6.66 commit 4db5988bb0996126895df56784f59076bc7b370a category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=l... -------------------------------- [ Upstream commit 5d69eca542ee17c618f9a55da52191d5e28b435f ] All classes use sched_entity::exec_start to track runtime and have copies of the exact same code around to compute runtime. Collapse all that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.169909515... Stable-dep-of: 0664e2c311b9 ("sched/deadline: Fix warning in migrate_enable for boosted tasks") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- include/linux/sched.h | 2 +- kernel/sched/deadline.c | 15 +++-------- kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++---------- kernel/sched/rt.c | 15 +++-------- kernel/sched/sched.h | 12 ++------- kernel/sched/stop_task.c | 13 +-------- 6 files changed, 53 insertions(+), 61 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index f5c80c372f74..ccf9b58cf4b8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -533,7 +533,7 @@ struct sched_statistics { u64 block_max; s64 sum_block_runtime; - u64 exec_max; + s64 exec_max; u64 slice_max; u64 nr_migrations_cold; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9a2327abbd99..80131df38d82 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1299,9 +1299,8 @@ static void update_curr_dl(struct rq *rq) { struct task_struct *curr = rq->curr; struct sched_dl_entity *dl_se = &curr->dl; - u64 delta_exec, scaled_delta_exec; + s64 delta_exec, scaled_delta_exec; int cpu = cpu_of(rq); - u64 now; if (!dl_task(curr) || !on_dl_rq(dl_se)) return; @@ -1314,21 +1313,13 @@ static void update_curr_dl(struct rq *rq) * natural solution, but the full ramifications of this * approach need further study. */ - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) { + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec <= 0)) { if (unlikely(dl_se->dl_yielded)) goto throttle; return; } - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - trace_sched_stat_runtime(curr, delta_exec, 0); - - update_current_exec_runtime(curr, now, delta_exec); - if (dl_entity_is_special(dl_se)) return; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0bdd25bf02f5..8a775ab66386 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1329,23 +1329,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq) } #endif /* CONFIG_SMP */ -/* - * Update the current task's runtime statistics. 
- */ -static void update_curr(struct cfs_rq *cfs_rq) +static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) { - struct sched_entity *curr = cfs_rq->curr; - u64 now = rq_clock_task(rq_of(cfs_rq)); - u64 delta_exec; - - if (unlikely(!curr)) - return; + u64 now = rq_clock_task(rq); + s64 delta_exec; delta_exec = now - curr->exec_start; - if (unlikely((s64)delta_exec <= 0)) - return; + if (unlikely(delta_exec <= 0)) + return delta_exec; curr->exec_start = now; + curr->sum_exec_runtime += delta_exec; if (schedstat_enabled()) { struct sched_statistics *stats; @@ -1355,8 +1349,43 @@ static void update_curr(struct cfs_rq *cfs_rq) max(delta_exec, stats->exec_max)); } - curr->sum_exec_runtime += delta_exec; - schedstat_add(cfs_rq->exec_clock, delta_exec); + return delta_exec; +} + +/* + * Used by other classes to account runtime. + */ +s64 update_curr_common(struct rq *rq) +{ + struct task_struct *curr = rq->curr; + s64 delta_exec; + + delta_exec = update_curr_se(rq, &curr->se); + if (unlikely(delta_exec <= 0)) + return delta_exec; + + trace_sched_stat_runtime(curr, delta_exec, 0); + + account_group_exec_runtime(curr, delta_exec); + cgroup_account_cputime(curr, delta_exec); + + return delta_exec; +} + +/* + * Update the current task's runtime statistics. + */ +static void update_curr(struct cfs_rq *cfs_rq) +{ + struct sched_entity *curr = cfs_rq->curr; + s64 delta_exec; + + if (unlikely(!curr)) + return; + + delta_exec = update_curr_se(rq_of(cfs_rq), curr); + if (unlikely(delta_exec <= 0)) + return; curr->vruntime += calc_delta_fair(delta_exec, curr); update_deadline(cfs_rq, curr); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 48d53bf0161b..bdd4a3081c41 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1050,24 +1050,15 @@ static void update_curr_rt(struct rq *rq) { struct task_struct *curr = rq->curr; struct sched_rt_entity *rt_se = &curr->rt; - u64 delta_exec; - u64 now; + s64 delta_exec; if (curr->sched_class != &rt_sched_class) return; - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec <= 0)) return; - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - trace_sched_stat_runtime(curr, delta_exec, 0); - - update_current_exec_runtime(curr, now, delta_exec); - if (!rt_bandwidth_enabled()) return; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e9eb9d47c5db..214659d7dc9c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2439,6 +2439,8 @@ struct affinity_context { unsigned int flags; }; +extern s64 update_curr_common(struct rq *rq); + struct sched_class { #ifdef CONFIG_UCLAMP_TASK @@ -3523,16 +3525,6 @@ extern int sched_dynamic_mode(const char *str); extern void sched_dynamic_update(int mode); #endif -static inline void update_current_exec_runtime(struct task_struct *curr, - u64 now, u64 delta_exec) -{ - curr->se.sum_exec_runtime += delta_exec; - account_group_exec_runtime(curr, delta_exec); - - curr->se.exec_start = now; - cgroup_account_cputime(curr, delta_exec); -} - #ifdef CONFIG_SCHED_MM_CID #define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */ diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 85590599b4d6..7595494ceb6d 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq) static void put_prev_task_stop(struct rq *rq, struct task_struct *prev) { - struct task_struct *curr = 
rq->curr; - u64 now, delta_exec; - - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec < 0)) - delta_exec = 0; - - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - update_current_exec_runtime(curr, now, delta_exec); + update_curr_common(rq); } /* -- 2.34.1
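The shape of the unified helper is easy to model in userspace. This sketch (fake clock, invented toy_* names) mirrors the delta pattern that update_curr_se() now provides to every class: compute the delta against exec_start, bail out on non-positive deltas, and advance exec_start so the same interval is never accounted twice.

#include <stdio.h>
#include <stdint.h>

static uint64_t fake_clock;	/* stands in for rq_clock_task(rq) */

struct toy_se { uint64_t exec_start, sum_exec_runtime; };

static int64_t toy_update_curr_se(struct toy_se *se)
{
	int64_t delta_exec = fake_clock - se->exec_start;

	if (delta_exec <= 0)
		return delta_exec;	/* no forward progress: nothing to account */
	se->exec_start = fake_clock;	/* consume the interval exactly once */
	se->sum_exec_runtime += delta_exec;
	return delta_exec;
}

int main(void)
{
	struct toy_se se = { .exec_start = 100 };

	fake_clock = 250;	/* fair, rt, dl and stop would all funnel here */
	printf("delta=%lld\n", (long long)toy_update_curr_se(&se));
	fake_clock = 250;	/* second call at the same clock: no double count */
	printf("delta=%lld\n", (long long)toy_update_curr_se(&se));
	return 0;
}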
From: Peter Zijlstra <peterz@infradead.org> stable inclusion from stable-v6.6.66 commit 7f509457773e2d358f451c3057e065e7289f3eb7 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- [ Upstream commit 5fe6ec8f6ab549b6422e41551abb51802bd48bc7 ] Tracing the runtime delta makes sense, observer can sum over time. Tracing the absolute vruntime makes less sense, inconsistent: absolute-vs-delta, but also vruntime delta can be computed from runtime delta. Removing the vruntime thing also makes the two tracepoint sites identical, allowing to unify the code in a later patch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Stable-dep-of: 0664e2c311b9 ("sched/deadline: Fix warning in migrate_enable for boosted tasks") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- include/trace/events/sched.h | 15 ++++++--------- kernel/sched/fair.c | 5 ++--- 2 files changed, 8 insertions(+), 12 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index ee7a37e72257..09cde74b0a58 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -548,33 +548,30 @@ DEFINE_EVENT_SCHEDSTAT(sched_stat_template, sched_stat_blocked, */ DECLARE_EVENT_CLASS(sched_stat_runtime, - TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime), + TP_PROTO(struct task_struct *tsk, u64 runtime), - TP_ARGS(tsk, __perf_count(runtime), vruntime), + TP_ARGS(tsk, __perf_count(runtime)), TP_STRUCT__entry( __array( char, comm, TASK_COMM_LEN ) __field( pid_t, pid ) __field( u64, runtime ) - __field( u64, vruntime ) ), TP_fast_assign( memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); __entry->pid = tsk->pid; __entry->runtime = runtime; - __entry->vruntime = vruntime; ), - TP_printk("comm=%s pid=%d runtime=%Lu [ns] vruntime=%Lu [ns]", + TP_printk("comm=%s pid=%d runtime=%Lu [ns]", __entry->comm, __entry->pid, - (unsigned long long)__entry->runtime, - (unsigned long long)__entry->vruntime) + (unsigned long long)__entry->runtime) ); DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime, - TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime), - TP_ARGS(tsk, runtime, vruntime)); + TP_PROTO(struct task_struct *tsk, u64 runtime), + TP_ARGS(tsk, runtime)); /* * Tracepoint for showing priority inheritance modifying a tasks diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8a775ab66386..43226f4c98fa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1364,8 +1364,7 @@ s64 update_curr_common(struct rq *rq) if (unlikely(delta_exec <= 0)) return delta_exec; - trace_sched_stat_runtime(curr, delta_exec, 0); - + trace_sched_stat_runtime(curr, delta_exec); account_group_exec_runtime(curr, delta_exec); cgroup_account_cputime(curr, delta_exec); @@ -1393,7 +1392,7 @@ static void update_curr(struct cfs_rq *cfs_rq) if (entity_is_task(curr)) { struct task_struct *curtask = task_of(curr); - trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime); + trace_sched_stat_runtime(curtask, delta_exec); cgroup_account_cputime(curtask, delta_exec); account_group_exec_runtime(curtask, delta_exec); } -- 2.34.1
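Why the dropped field is redundant, in code form: to first order the vruntime delta is a pure function of the runtime delta and the task weight, as in this simplified model of calc_delta_fair() (the kernel uses precomputed fixed-point inverse weights rather than a plain divide):

#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL

static uint64_t toy_calc_delta_fair(uint64_t delta_exec, uint64_t weight)
{
	if (weight == NICE_0_LOAD)	/* common case short-circuits */
		return delta_exec;
	return delta_exec * NICE_0_LOAD / weight;
}

int main(void)
{
	/* 1ms of runtime for a nice-0 task and a double-weight task */
	printf("nice0: vdelta=%llu\n",
	       (unsigned long long)toy_calc_delta_fair(1000000, 1024));
	printf("heavy: vdelta=%llu\n",
	       (unsigned long long)toy_calc_delta_fair(1000000, 2048));
	return 0;
}

So an observer who sums the traced runtime deltas can reconstruct the vruntime deltas, which is the consistency argument the changelog makes.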
From: Peter Zijlstra <peterz@infradead.org> stable inclusion from stable-v6.6.66 commit 24617f9ca8c82a9a0b89169a909a26b9751a31e2 category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- [ Upstream commit c708a4dc5ab547edc3d6537233ca9e79ea30ce47 ] Now that trace_sched_stat_runtime() no longer takes a vruntime argument, the task specific bits are identical between update_curr_common() and update_curr(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Stable-dep-of: 0664e2c311b9 ("sched/deadline: Fix warning in migrate_enable for boosted tasks") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 43226f4c98fa..8aba01b949a9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1352,6 +1352,13 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) return delta_exec; } +static inline void update_curr_task(struct task_struct *p, s64 delta_exec) +{ + trace_sched_stat_runtime(p, delta_exec); + account_group_exec_runtime(p, delta_exec); + cgroup_account_cputime(p, delta_exec); +} + /* * Used by other classes to account runtime. */ @@ -1361,12 +1368,8 @@ s64 update_curr_common(struct rq *rq) s64 delta_exec; delta_exec = update_curr_se(rq, &curr->se); - if (unlikely(delta_exec <= 0)) - return delta_exec; - - trace_sched_stat_runtime(curr, delta_exec); - account_group_exec_runtime(curr, delta_exec); - cgroup_account_cputime(curr, delta_exec); + if (likely(delta_exec > 0)) + update_curr_task(curr, delta_exec); return delta_exec; } @@ -1389,13 +1392,8 @@ static void update_curr(struct cfs_rq *cfs_rq) curr->vruntime += calc_delta_fair(delta_exec, curr); update_deadline(cfs_rq, curr); - if (entity_is_task(curr)) { - struct task_struct *curtask = task_of(curr); - - trace_sched_stat_runtime(curtask, delta_exec); - cgroup_account_cputime(curtask, delta_exec); - account_group_exec_runtime(curtask, delta_exec); - } + if (entity_is_task(curr)) + update_curr_task(task_of(curr), delta_exec); account_cfs_rq_runtime(cfs_rq, delta_exec); } -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 85e511df3cec46021024176672a748008ed135bf category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Part of the reason to have shorter slices is to improve responsiveness. Allow shorter slices to preempt longer slices on wakeup.

Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |

100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |

100ms massive_intr 500us cyclictest PREEMPT_SHORT
1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |

As per the numbers, this makes cyclictest's (short slice) max-delay more consistent, and that consistency drops the sum-delay. The trade-off is that massive_intr (long slice) gets more context switches and a slight increase in sum-delay. Chunxin contributed did_preempt_short(), where a task that lost slice protection from PREEMPT_SHORT gets rescheduled once it becomes ineligible.
[mike: numbers] Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org Conflicts: kernel/sched/fair.c kernel/sched/features.h [a trivial context conflict] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++----- kernel/sched/features.h | 5 ++++ 2 files changed, 62 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8aba01b949a9..72f325df1137 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1201,10 +1201,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se); * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i * this is probably good enough. */ -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) +static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) { if ((s64)(se->vruntime - se->deadline) < 0) - return; + return false; /* * For EEVDF the virtual time slope is determined by w_i (iow. @@ -1221,10 +1221,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) /* * The task has consumed its request, reschedule. */ - if (cfs_rq->nr_running > 1) { - resched_curr(rq_of(cfs_rq)); - clear_buddies(cfs_rq, se); - } + return true; } #include "pelt.h" @@ -1359,6 +1356,38 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec) cgroup_account_cputime(p, delta_exec); } +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr) +{ + if (!sched_feat(PREEMPT_SHORT)) + return false; + + if (curr->vlag == curr->deadline) + return false; + + return !entity_eligible(cfs_rq, curr); +} + +static inline bool do_preempt_short(struct cfs_rq *cfs_rq, + struct sched_entity *pse, struct sched_entity *se) +{ + if (!sched_feat(PREEMPT_SHORT)) + return false; + + if (pse->slice >= se->slice) + return false; + + if (!entity_eligible(cfs_rq, pse)) + return false; + + if (entity_before(pse, se)) + return true; + + if (!entity_eligible(cfs_rq, se)) + return true; + + return false; +} + /* * Used by other classes to account runtime. */ @@ -1380,7 +1409,9 @@ s64 update_curr_common(struct rq *rq) static void update_curr(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; + struct rq *rq = rq_of(cfs_rq); s64 delta_exec; + bool resched; if (unlikely(!curr)) return; @@ -1390,12 +1421,20 @@ static void update_curr(struct cfs_rq *cfs_rq) return; curr->vruntime += calc_delta_fair(delta_exec, curr); - update_deadline(cfs_rq, curr); + resched = update_deadline(cfs_rq, curr); if (entity_is_task(curr)) update_curr_task(task_of(curr), delta_exec); account_cfs_rq_runtime(cfs_rq, delta_exec); + + if (rq->nr_running == 1) + return; + + if (resched || did_preempt_short(cfs_rq, curr)) { + resched_curr(rq); + clear_buddies(cfs_rq, curr); + } } static void update_curr_fair(struct rq *rq) @@ -9541,7 +9580,17 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ cfs_rq = cfs_rq_of(se); update_curr(cfs_rq); /* - * XXX pick_eevdf(cfs_rq) != se ? + * If @p has a shorter slice than current and @p is eligible, override + * current's slice protection in order to allow preemption.
+ * + * Note that even if @p does not turn out to be the most eligible + * task at this moment, current's slice protection will be lost. + */ + if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline) + se->vlag = se->deadline + 1; + + /* + * If @p has become the most eligible task, force preemption. */ if (pick_eevdf(cfs_rq, true) == pse) goto preempt; diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 24a0c853a8a0..b2261a21e77f 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -12,6 +12,11 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true) SCHED_FEAT(PLACE_REL_DEADLINE, true) SCHED_FEAT(RUN_TO_PARITY, true) SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true) +/* + * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for + * current. + */ +SCHED_FEAT(PREEMPT_SHORT, true) /* * Prefer to schedule the task we woke last (assuming it failed -- 2.34.1
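The wakeup decision added above can be modelled standalone. In this sketch the eligibility checks are reduced to a precomputed flag and entity_before() to a deadline comparison, so it only mirrors the structure of do_preempt_short(), not the real EEVDF bookkeeping; the toy_* names are invented.

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

struct toy_se { uint64_t slice; int64_t deadline; bool eligible; };

static bool toy_do_preempt_short(const struct toy_se *pse, const struct toy_se *se)
{
	if (pse->slice >= se->slice)
		return false;			/* wakee's slice is not shorter */
	if (!pse->eligible)
		return false;			/* wakee not eligible yet */
	if (pse->deadline < se->deadline)
		return true;			/* entity_before(pse, se) */
	return !se->eligible;			/* current no longer eligible */
}

int main(void)
{
	struct toy_se curr  = { .slice = 100000000ULL, .deadline = 200, .eligible = true };
	struct toy_se wakee = { .slice =     500000ULL, .deadline = 150, .eligible = true };

	/* a 500us cyclictest-like wakee vs a 100ms massive_intr-like current */
	printf("cancel slice protection: %d\n", toy_do_preempt_short(&wakee, &curr));
	return 0;
}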
hulk inclusion category: other bugzilla: https://atomgit.com/openeuler/kernel/issues/8903 ---------------------------------------- Turn the PREEMPT_SHORT feature default from true to false. Signed-off-by: Zicheng Qu <quzicheng@huawei.com> --- kernel/sched/features.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index b2261a21e77f..e093935a8819 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -16,7 +16,7 @@ SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true) * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for * current. */ -SCHED_FEAT(PREEMPT_SHORT, true) +SCHED_FEAT(PREEMPT_SHORT, false) /* * Prefer to schedule the task we woke last (assuming it failed -- 2.34.1
From: Chen Yu <yu.c.chen@intel.com>

mainline inclusion
from mainline-v6.12-rc4
commit d4ac164bde7a12ec0a238a7ead5aa26819bbb1c1
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to
wakeup-preempt") introduced a mechanism by which a wakee with a shorter
slice can preempt the currently running task. It also lowered the bar
for preempting the current task, by checking rq->nr_running instead of
cfs_rq->nr_running once the current task has run out of its time slice.

But one scenario is problematic: say there are 1 CFS task and 1 RT
task. Before 85e511df3cec, update_deadline() would not trigger a
reschedule; after 85e511df3cec, since rq->nr_running is 2 and resched
is true, resched_curr() is called.

Some workloads (like the hackbench reported by lkp) do not like
over-scheduling. We can see that the preemption rate has been increased
by 2.2%:

1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches

Restore the previous check criterion.

Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Honglei Wang <jameshongleiwang@126.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240925085440.358138-1-yu.c.chen@intel.com
Conflicts:
	kernel/sched/fair.c
[a trivial context conflict]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 72f325df1137..134aa2eac4bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1428,7 +1428,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 
-	if (rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1)
 		return;
 
 	if (resched || did_preempt_short(cfs_rq, curr)) {
-- 
2.34.1
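The distinction matters because rq->nr_running counts runnable tasks of all scheduling classes, while cfs_rq->nr_running counts only the CFS entities at that level. A toy model of the two check criteria (illustrative counts only, not kernel structures):

/* With one CFS task and one RT task, rq->nr_running == 2 but
 * cfs_rq->nr_running == 1, so a check based on rq->nr_running fires a
 * reschedule even though there is no other CFS entity to switch to. */
#include <stdio.h>

struct rq     { unsigned int nr_running; };	/* all classes */
struct cfs_rq { unsigned int nr_running; };	/* CFS entities only */

static int should_resched(unsigned int nr_running, int slice_expired)
{
	return slice_expired && nr_running > 1;
}

int main(void)
{
	struct rq rq = { .nr_running = 2 };		/* 1 CFS + 1 RT */
	struct cfs_rq cfs_rq = { .nr_running = 1 };

	printf("rq-based:     %d\n", should_resched(rq.nr_running, 1));		/* 1: spurious */
	printf("cfs_rq-based: %d\n", should_resched(cfs_rq.nr_running, 1));	/* 0: correct */
	return 0;
}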
From: zihan zhou <15645113830zzh@gmail.com>

mainline inclusion
from mainline-v6.15-rc1
commit f553741ac8c0e467a3b873e305f34b902e50b86d
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

A woken non-idle entity should be able to preempt an idle entity at any
time, but because of the idle entity's slice protection the non-idle
entity has to wait; so just cancel that protection.

This patch aims to minimize the impact of SCHED_IDLE on SCHED_NORMAL.
For example, with a SCHED_IDLE task that sleeps for 1s and then runs
for 3ms, cyclictest running on the same CPU sees a maximum latency of
3ms, caused by the idle entity's slice protection. This is
unreasonable.

With this patch, cyclictest latency under the same conditions is
basically the same on a CPU with idle processes as on an empty CPU.

[peterz: add helpers]
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208080850.16300-1-15645113830zzh@gmail.com
Conflicts:
	kernel/sched/fair.c
[a trivial context conflict]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 134aa2eac4bf..895d9633a608 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1066,6 +1066,26 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 	return __node_2_se(left);
 }
 
+/*
+ * HACK, stash a copy of deadline at the point of pick in vlag,
+ * which isn't used until dequeue.
+ */
+static inline void set_protect_slice(struct sched_entity *se)
+{
+	se->vlag = se->deadline;
+}
+
+static inline bool protect_slice(struct sched_entity *se)
+{
+	return se->vlag == se->deadline;
+}
+
+static inline void cancel_protect_slice(struct sched_entity *se)
+{
+	if (protect_slice(se))
+		se->vlag = se->deadline + 1;
+}
+
 /*
  * Earliest Eligible Virtual Deadline First
  *
@@ -1118,11 +1138,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preemp
 	    !entity_eligible(cfs_rq, curr))
 		curr = NULL;
 
-	/*
-	 * Once selected, run a task until it either becomes non-eligible or
-	 * until it gets a new slice. See the HACK in set_next_entity().
-	 */
-	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+	if (sched_feat(RUN_TO_PARITY) && curr && protect_slice(curr))
 		return curr;
 
 	/* Pick the leftmost entity if it's eligible */
@@ -5598,11 +5614,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		update_stats_wait_end_fair(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-		/*
-		 * HACK, stash a copy of deadline at the point of pick in vlag,
-		 * which isn't used until dequeue.
-		 */
-		se->vlag = se->deadline;
+
+		set_protect_slice(se);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -9561,8 +9574,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * Preempt an idle entity in favor of a non-idle entity (and don't preempt
 	 * in the inverse case).
	 */
-	if (cse_is_idle && !pse_is_idle)
+	if (cse_is_idle && !pse_is_idle) {
+		/*
+		 * When a non-idle entity preempts an idle entity,
+		 * don't give the idle entity slice protection.
+		 */
+		cancel_protect_slice(se);
 		goto preempt;
+	}
+
 	if (cse_is_idle != pse_is_idle)
 		return;
@@ -9586,8 +9606,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * Note that even if @p does not turn out to be the most eligible
 	 * task at this moment, current's slice protection will be lost.
 	 */
-	if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline)
-		se->vlag = se->deadline + 1;
+	if (do_preempt_short(cfs_rq, pse, se))
+		cancel_protect_slice(se);
 
 	/*
 	 * If @p has become the most eligible task, force preemption.
-- 
2.34.1
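The three helpers are small enough to model in isolation. This standalone sketch (simplified struct, illustrative values) shows how stashing the deadline in vlag arms the protection and how storing any other value disarms it:

/* Standalone model of the slice-protection helpers. While an entity
 * runs, vlag doubles as storage for a copy of its deadline; equality
 * means protection is still armed. Not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct entity { long long vlag, deadline; };

static void set_protect_slice(struct entity *se)   { se->vlag = se->deadline; }
static bool protect_slice(const struct entity *se) { return se->vlag == se->deadline; }
static void cancel_protect_slice(struct entity *se)
{
	if (protect_slice(se))
		se->vlag = se->deadline + 1;	/* any value != deadline disarms it */
}

int main(void)
{
	struct entity idle_se = { .deadline = 500 };

	set_protect_slice(&idle_se);
	printf("protected: %d\n", protect_slice(&idle_se));	/* 1 */
	cancel_protect_slice(&idle_se);				/* non-idle wakeup */
	printf("protected: %d\n", protect_slice(&idle_se));	/* 0 */
	return 0;
}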
From: Vincent Guittot <vincent.guittot@linaro.org>

mainline inclusion
from mainline-v6.17-rc1
commit 9cdb4fe20cd239c848b5c3f5753d83a9443ba329
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Replace the open-coded test with the equivalent protect_slice() helper.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dhaval Giani (AMD) <dhaval@gianis.ca>
Link: https://lkml.kernel.org/r/20250708165630.1948751-2-vincent.guittot@linaro.or...
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 895d9633a608..b8ed9382c422 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1377,7 +1377,7 @@ static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity
 	if (!sched_feat(PREEMPT_SHORT))
 		return false;
 
-	if (curr->vlag == curr->deadline)
+	if (protect_slice(curr))
 		return false;
 
 	return !entity_eligible(cfs_rq, curr);
-- 
2.34.1
From: Vincent Guittot <vincent.guittot@linaro.org>

mainline inclusion
from mainline-v6.17-rc1
commit 74eec63661d46a7153d04c2e0249eeb76cc76d44
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

EEVDF expects the scheduler to allocate a time quantum to the selected
entity and then pick a new entity for the next quantum.

Although this notion of a time quantum is not strictly doable in our
case, we can ensure a minimum runtime for each task most of the time
and pick a new entity after a minimum time has elapsed.

Reuse the run-to-parity slice protection to provide such a runtime
quantum.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.or...
Conflicts:
	kernel/sched/fair.c
[a trivial context conflict]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 include/linux/sched.h | 10 +++++++++-
 kernel/sched/fair.c   | 31 ++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ccf9b58cf4b8..479c206307d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -592,7 +592,15 @@ struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
-	s64				vlag;
+	union {
+		/*
+		 * When !@on_rq this field is vlag.
+		 * When cfs_rq->curr == se (which implies @on_rq)
+		 * this field is vprot. See protect_slice().
+		 */
+		s64			vlag;
+		u64			vprot;
+	};
 	u64				slice;
 
 	u64				nr_migrations;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8ed9382c422..9c2fc59a8c58 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1067,23 +1067,35 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 }
 
 /*
- * HACK, stash a copy of deadline at the point of pick in vlag,
- * which isn't used until dequeue.
+ * Set the vruntime up to which an entity can run before looking
+ * for another entity to pick.
+ * In case of run to parity, we protect the entity up to its deadline.
+ * When run to parity is disabled, we give a minimum quantum to the running
+ * entity to ensure progress.
  */
 static inline void set_protect_slice(struct sched_entity *se)
 {
-	se->vlag = se->deadline;
+	u64 slice = se->slice;
+	u64 vprot = se->deadline;
+
+	if (!sched_feat(RUN_TO_PARITY))
+		slice = min(slice, normalized_sysctl_sched_base_slice);
+
+	if (slice != se->slice)
+		vprot = min_vruntime(vprot, se->vruntime + calc_delta_fair(slice, se));
+
+	se->vprot = vprot;
 }
 
 static inline bool protect_slice(struct sched_entity *se)
 {
-	return se->vlag == se->deadline;
+	return ((s64)(se->vprot - se->vruntime) > 0);
 }
 
 static inline void cancel_protect_slice(struct sched_entity *se)
 {
 	if (protect_slice(se))
-		se->vlag = se->deadline + 1;
+		se->vprot = se->vruntime;
 }
 
 /*
@@ -1138,7 +1150,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preemp
 	    !entity_eligible(cfs_rq, curr))
 		curr = NULL;
 
-	if (sched_feat(RUN_TO_PARITY) && curr && protect_slice(curr))
+	if (curr && protect_slice(curr))
 		return curr;
 
 	/* Pick the leftmost entity if it's eligible */
@@ -1372,11 +1384,8 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
 		cgroup_account_cputime(p, delta_exec);
 }
 
-static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+static inline bool resched_next_slice(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	if (!sched_feat(PREEMPT_SHORT))
-		return false;
-
 	if (protect_slice(curr))
 		return false;
 
@@ -1447,7 +1456,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (cfs_rq->nr_running == 1)
 		return;
 
-	if (resched || did_preempt_short(cfs_rq, curr)) {
+	if (resched || resched_next_slice(cfs_rq, curr)) {
 		resched_curr(rq);
 		clear_buddies(cfs_rq, curr);
 	}
-- 
2.34.1
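The new protection window can be illustrated numerically. In this sketch, calc_delta_fair() is taken as the identity (a nice-0 task) and the base-slice value is made up; only the shape of the computation mirrors the patch:

/* Sketch of the vprot computation for a nice-0 task; all values are
 * illustrative, not real kernel units. */
#include <stdio.h>

typedef unsigned long long u64;

static u64 min_vruntime(u64 a, u64 b) { return ((long long)(b - a) > 0) ? a : b; }

struct entity { u64 vruntime, deadline, slice, vprot; };

static void set_protect_slice(struct entity *se, int run_to_parity, u64 base_slice)
{
	u64 slice = se->slice;
	u64 vprot = se->deadline;

	if (!run_to_parity)
		slice = slice < base_slice ? slice : base_slice;
	if (slice != se->slice)	/* shrunk: protect one quantum, not the full deadline */
		vprot = min_vruntime(vprot, se->vruntime + slice);
	se->vprot = vprot;
}

int main(void)
{
	struct entity se = { .vruntime = 1000, .deadline = 1030, .slice = 30 };

	set_protect_slice(&se, 0, 3);		/* NO_RUN_TO_PARITY, base slice of 3 */
	printf("vprot = %llu\n", se.vprot);	/* 1003: one quantum of protection */
	return 0;
}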
From: Zhang Qiao <zhangqiao22@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903

--------------------------------

Wrap the new vlag/vprot union in KABI_BROKEN_REPLACE() so that the
layout of struct sched_entity is preserved for KABI consumers.

Fixes: 4d234eb931ba ("sched/fair: Fix NO_RUN_TO_PARITY case")
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 include/linux/sched.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 479c206307d1..847d5502f113 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -592,6 +592,8 @@ struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
+	KABI_BROKEN_REPLACE(
+	s64				vlag,
 	union {
 		/*
 		 * When !@on_rq this field is vlag.
@@ -600,7 +602,8 @@ struct sched_entity {
 		 */
 		s64			vlag;
 		u64			vprot;
-	};
+	}
+	)
 	u64				slice;
 
 	u64				nr_migrations;
-- 
2.34.1
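The intent of the macro is layout preservation: the union must occupy exactly the bytes the old s64 vlag field did, so that the offsets of later members are unchanged. A conceptual illustration of that invariant follows; this is not the openEuler macro definition, which differs in detail:

/* Conceptual check only: replacing an s64 with a same-sized union must
 * not change struct size or member offsets. */
#include <assert.h>

struct sched_entity_old { long long vlag; unsigned long long slice; };
struct sched_entity_new {
	union {
		long long vlag;			/* when !on_rq */
		unsigned long long vprot;	/* when cfs_rq->curr == se */
	};
	unsigned long long slice;
};

_Static_assert(sizeof(struct sched_entity_new) == sizeof(struct sched_entity_old),
	       "union replacement must not change struct size/offsets");

int main(void) { return 0; }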
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v7.0-rc2
commit bcd74b2ffdd0a2233adbf26b65c62fc69a809c8e
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

We should not (re)set slice protection in the sched_change pattern
which calls put_prev_task() / set_next_task().

Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260219080624.561421378%40infradead.org
Conflicts:
	kernel/sched/fair.c
[set_protect_slice() only takes one parameter here, so keep it as is.]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2fc59a8c58..337478725843 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5609,7 +5609,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 }
 
 static void
-set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
 {
 	clear_buddies(cfs_rq, se);
 
@@ -5624,7 +5624,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		set_protect_slice(se);
+		if (first)
+			set_protect_slice(se);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -10296,13 +10297,13 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 			pse = parent_entity(pse);
 		}
 		if (se_depth >= pse_depth) {
-			set_next_entity(cfs_rq_of(se), se);
+			set_next_entity(cfs_rq_of(se), se, true);
 			se = parent_entity(se);
 		}
 	}
 
 	put_prev_entity(cfs_rq, pse);
-	set_next_entity(cfs_rq, se);
+	set_next_entity(cfs_rq, se, true);
 	}
 
 	goto done;
@@ -10327,7 +10328,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	p = task_of(se);
 
 	while (se) {
-		set_next_entity(cfs_rq_of(se), se);
+		set_next_entity(cfs_rq_of(se), se, true);
 		se = parent_entity(se);
 	}
 
@@ -10341,7 +10342,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
-		set_next_entity(cfs_rq, se);
+		set_next_entity(cfs_rq, se, true);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
@@ -15185,7 +15186,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		set_next_entity(cfs_rq, se);
+		set_next_entity(cfs_rq, se, first);
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
 	}
-- 
2.34.1
From: Wang Tao <wangtao554@huawei.com>

mainline inclusion
from mainline-v7.0-rc2
commit ff38424030f98976150e42ca35f4b00e6ab8fa23
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
independent variable defining the virtual protection timestamp.

When `reweight_entity()` is called (e.g., via nice/renice), it performs
the following actions to preserve lag consistency:

1. Scales `se->vlag` based on the new weight.
2. Calls `place_entity()`, which recalculates `se->vruntime` based on
   the new weight and scaled lag.

However, the current implementation fails to update `se->vprot`, so the
protection window no longer matches the rescaled vruntime, leading to
mismatches between the task's actual protected runtime and its expected
duration.

Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Wang Tao <wangtao554@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260120123113.3518950-1-wangtao554@huawei.com
Conflicts:
	kernel/sched/fair.c
[Naming conflict: this tree uses cfs_rq->nr_running where mainline uses
cfs_rq->nr_queued.]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 337478725843..e7bb8a144c43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3941,6 +3941,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
 	bool curr = cfs_rq->curr == se;
+	bool rel_vprot = false;
+	u64 vprot;
 
 	if (se->on_rq) {
 		/* commit outstanding execution time */
@@ -3948,6 +3950,11 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		update_entity_lag(cfs_rq, se);
 		se->deadline -= se->vruntime;
 		se->rel_deadline = 1;
+		if (curr && protect_slice(se)) {
+			vprot = se->vprot - se->vruntime;
+			rel_vprot = true;
+		}
+
 		cfs_rq->nr_running--;
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
@@ -3963,6 +3970,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	if (se->rel_deadline)
 		se->deadline = div_s64(se->deadline * se->load.weight, weight);
 
+	if (rel_vprot)
+		vprot = div_s64(vprot * se->load.weight, weight);
+
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -3976,6 +3986,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		place_entity(cfs_rq, se, 0);
+		if (rel_vprot)
+			se->vprot = se->vruntime + vprot;
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-- 
2.34.1
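The arithmetic carried across the reweight can be checked with made-up numbers: the protection span is converted to vruntime-relative form, scaled by old_weight/new_weight exactly like the relative deadline, and re-anchored after place_entity():

/* Numeric sketch of the vprot preservation; weights and values are
 * illustrative, not real kernel units. */
#include <stdio.h>

int main(void)
{
	long long vruntime = 1000, vprot = 1060;	/* 60 "ns" of protection left */
	long long old_weight = 1024, new_weight = 2048;	/* task reniced heavier */

	long long rel = vprot - vruntime;	/* commit to relative form */
	rel = rel * old_weight / new_weight;	/* heavier => vtime advances slower */
	/* ... place_entity() would recompute vruntime here ... */
	vprot = vruntime + rel;			/* re-anchor the window */

	printf("new vprot = %lld\n", vprot);	/* 1030 */
	return 0;
}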
hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903

----------------------------------------

In reweight_entity(), when reweighting a currently running entity
(se == cfs_rq->curr), the entity remains in its runqueue context
without undergoing a full dequeue/enqueue cycle. This means
avg_vruntime() remains constant throughout the reweight operation.

However, the current implementation calls place_entity(..., 0) at the
end of reweight_entity(). Under EEVDF, place_entity() is designed to
handle entities entering the runqueue and calculates the virtual lag
(vlag) to account for the change in the weighted average vruntime (V)
using the formula:

	vlag' = vlag * (W + w_i) / W

where 'W' is the current aggregate weight (including
cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
enqueued (in this case, se is exactly cfs_rq->curr).

This leads to double scaling for running entities:

1. reweight_entity() already rescales se->vlag based on the new weight
   ratio.
2. place_entity() then mistakenly applies the (W + w_i)/W scaling
   again, treating the reweight as a fresh enqueue into a new total
   weight pool.

This can cause the entity's vlag to be incorrectly amplified (if
positive) or suppressed (if negative) during the reweight.

In environments with frequent cgroup throttle/unthrottle operations,
this math error manifests as vruntime drift. A hung task was observed,
as shown below:

crash> runq -c 0 -g
CPU 0
  CURRENT: PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng"
  ROOT_TASK_GROUP: ffff8001025fa4c0  RT_RQ: ffff0000fff42500
     [no tasks queued]
  ROOT_TASK_GROUP: ffff8001025fa4c0  CFS_RQ: ffff0000fff422c0
     TASK_GROUP: ffff0000c130fc00  CFS_RQ: ffff00009125a400  <test_cg>
       cfs_bandwidth: period=100000000, quota=18446744073709551615,
       gse: 0xffff000091258c00, vruntime=127285708384434,
         deadline=127285714880550, vlag=11721467, weight=338965,
       my_q=ffff00009125a400, cfs_rq: avg_vruntime=0,
         zero_vruntime=2029704519792, avg_load=0, nr_running=1
     TASK_GROUP: ffff0000d7cc8800  CFS_RQ: ffff0000c8f86800  <test_test329274_1>
       cfs_bandwidth: period=14000000, quota=14000000,
       gse: 0xffff0000c8f86400, vruntime=2034894470719,
         deadline=2034898697770, vlag=0, weight=215291,
       my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991,
         zero_vruntime=8444226681954, avg_load=54, nr_running=19
     [110] PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng" [CURRENT]
       vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072,
       last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
     ...
     [110] PID: 330291  TASK: ffff0000c02c9540  COMMAND: "stress-ng"
       vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072,
       last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
     [100] PID: 97  TASK: ffff0000c2432a00  COMMAND: "kworker/0:1H"
       vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264,
       last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
     [120] PID: 15  TASK: ffff0000c0368080  COMMAND: "ksoftirqd/0"
       vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576,
       last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
     [120] PID: 50173  TASK: ffff0000741d8080  COMMAND: "kworker/0:0"
       vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576,
       last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
     [120] PID: 58662  TASK: ffff000091180080  COMMAND: "kworker/0:2"
       vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576,
       last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000

TASK 1 (systemd) is waiting for cgroup_mutex. TASK 329296 (sh) holds
cgroup_mutex and is waiting for cpus_read_lock. TASK 50173
(kworker/0:0) holds the cpus_read_lock, but fails to be scheduled.
test_cg and TASK 97 may have suppressed TASK 50173, causing it not to
be scheduled for a long time, thus failing to release its locks in a
timely manner and ultimately causing the hung-task issue.

Fix this by adding an ENQUEUE_REWEIGHT_CURR flag and skipping the vlag
recalculation in place_entity() when reweighting the currently running
entity. For non-current entities, the existing logic remains, since
their dequeue/enqueue does change avg_vruntime().

Fixes: e328fcb84584 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c  | 11 ++++++++++-
 kernel/sched/sched.h |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7bb8a144c43..ccd2c852b2b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3985,7 +3985,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
 		if (rel_vprot)
 			se->vprot = se->vruntime + vprot;
 		update_load_add(&cfs_rq->load, se->load.weight);
@@ -5387,6 +5387,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		lag = se->vlag;
 
+		/*
+		 * ENQUEUE_REWEIGHT_CURR:
+		 * The currently running se (cfs_rq->curr) must skip the vlag
+		 * recalculation, because avg_vruntime() hasn't changed.
+		 */
+		if (flags & ENQUEUE_REWEIGHT_CURR)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5449,6 +5457,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 			lag = div_s64(lag, load);
 		}
 
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 
 	if (se->rel_deadline) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 214659d7dc9c..61f2ad5d1cfb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2431,6 +2431,8 @@ extern const u32 sched_prio_to_wmult[40];
 #endif
 
 #define ENQUEUE_INITIAL		0x80
+#define ENQUEUE_REWEIGHT_CURR	0x200
+
 #define RETRY_TASK		((void *)-1UL)
 
 struct affinity_context {
-- 
2.34.1
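The drift is easy to reproduce with pencil-and-paper numbers. This sketch (illustrative weights only) applies the correct reweight rescale and then the spurious enqueue correction on top, as the pre-fix code effectively did for a running entity:

/* Demonstration of the double scaling this patch removes. */
#include <stdio.h>

int main(void)
{
	long long vlag = 4000;
	long long W = 10240;			/* cfs_rq aggregate weight */
	long long w_old = 1024, w_new = 2048;

	long long once  = vlag * w_old / w_new;		/* correct reweight: 2000 */
	long long twice = once * (W + w_new) / W;	/* buggy extra (W+w_i)/W step */

	printf("correct vlag:  %lld\n", once);			/* 2000 */
	printf("double-scaled: %lld (drift %lld)\n",
	       twice, twice - once);				/* 2400, drift 400 */
	return 0;
}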
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v7.0-rc7
commit 1319ea57529e131822bab56bf417c8edc2db9ae8
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
zero_vruntime tracking").

The combination of yield and that commit was specific enough to
hypothesize the following scenario:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must lie
between these two entities. Therefore the eligible task will run, and
since all it does is yield, it is advanced a full slice at a time. This
causes it to jump over the other task; now the other task is eligible
and current no longer is. So we schedule.

Since both tasks stay runnable, there is no {de,en}queue. All we have
is the __{en,de}queue_entity() from {put_prev,set_next}_task(). But per
the fingered commit, those two no longer move zero_vruntime. All that
moves zero_vruntime is the tick and a full {de,en}queue.

This means that if the two tasks playing leapfrog can reach the
critical speed to hit the overflow point within one tick's worth of
time, we're up a creek.

Additionally, when multiple cgroups are involved, there is no guarantee
the tick will in fact hit every cgroup in a timely manner.
Statistically speaking it will, but those same statistics do not rule
out the possibility of one cgroup not getting a tick for a significant
amount of time -- however unlikely.

Therefore, just like with the yield() case, force an update at the end
of every slice. This ensures the update is never more than a single
slice behind, and the whole thing stays within 2 lag bounds as per the
comment on entity_key().

Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.081530332@infradead.org
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ccd2c852b2b3..58673c790f31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -880,7 +880,7 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
  * Called in:
  * - place_entity() -- before enqueue
  * - update_entity_lag() -- before dequeue
- * - entity_tick()
+ * - update_deadline() -- slice expiration
  *
  * This means it is one entry 'behind' but that puts it close enough to where
  * the bound on entity_key() is at most two lag bounds.
@@ -1245,6 +1245,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	avg_vruntime(cfs_rq);
 
 	/*
 	 * The task has consumed its request, reschedule.
@@ -5728,11 +5729,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
 
-	/*
-	 * Pulls along cfs_rq::zero_vruntime.
-	 */
-	avg_vruntime(cfs_rq);
-
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
@@ -10517,7 +10513,7 @@ static void yield_task_fair(struct rq *rq)
 	 */
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
-		se->deadline += calc_delta_fair(se->slice, se);
+		update_deadline(cfs_rq, se);
 	}
 }
-- 
2.34.1
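A toy loop shows the invariant the patch restores: if zero_vruntime is pulled along at every slice expiry, it can never trail by more than one slice, even with no tick or {de,en}queue in between. Values and the update rule here are heavily simplified for illustration:

/* Toy model of the yield leapfrog: every slice expiry both advances the
 * deadline and refreshes zero_vruntime, bounding the trailing distance. */
#include <stdio.h>

int main(void)
{
	long long vruntime = 0, deadline = 0, slice = 100;
	long long zero_vruntime = 0;

	for (int i = 0; i < 5; i++) {
		vruntime = deadline;		/* run out the request (yield case) */
		deadline = vruntime + slice;	/* update_deadline() */
		zero_vruntime = vruntime;	/* pulled along at every expiry */
	}
	printf("zero_vruntime lag: %lld (bounded by one slice)\n",
	       deadline - zero_vruntime);	/* 100 */
	return 0;
}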
From: Peter Zijlstra <peterz@infradead.org>

mainline inclusion
from mainline-v7.0-rc7
commit e08d007f9d813616ce7093600bc4fdb9c9d81d89
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
zero_vruntime tracking").

The commit in question changes avg_vruntime() from a function that is a
pure reader, to a function that updates variables. This turns an
unlocked sched/debug usage of this function from a minor mistake into a
data corruptor.

Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime")
Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.196370805@infradead.org
Conflicts:
	kernel/sched/debug.c
[Context difference. Only git am reports conflicts; git cherry-pick
works fine.]
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/debug.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 31e71294a949..f0e46b94fb18 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -672,6 +672,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
 	s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+	u64 avruntime;
 	struct sched_entity *last, *first, *root;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
@@ -697,6 +698,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	if (last)
 		right_vruntime = last->vruntime;
 	zero_vruntime = cfs_rq->zero_vruntime;
+	avruntime = avg_vruntime(cfs_rq);
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_deadline",
@@ -706,7 +708,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "zero_vruntime",
 			SPLIT_NS(zero_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
-			SPLIT_NS(avg_vruntime(cfs_rq)));
+			SPLIT_NS(avruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
 			SPLIT_NS(right_vruntime));
 	spread = right_vruntime - left_vruntime;
-- 
2.34.1
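The underlying pattern is general: once a reader function gains side effects, debug paths must call it under the owner's lock and print a snapshot afterwards. A minimal sketch with a pthread mutex standing in for the runqueue lock:

/* Snapshot-under-lock pattern: take the value inside the critical
 * section, format and print it outside. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long long state;		/* shared state the "reader" now mutates */

static long long read_and_update(void)	/* like the post-fix avg_vruntime() */
{
	return ++state;			/* side effect: must hold the lock */
}

int main(void)
{
	long long snapshot;

	pthread_mutex_lock(&lock);
	snapshot = read_and_update();	/* take the value under the lock */
	pthread_mutex_unlock(&lock);

	printf("avg: %lld\n", snapshot);	/* print outside the critical section */
	return 0;
}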
FeedBack: The patch(es) which you have sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/21702
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/7N2...