hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8903

----------------------------------------

In reweight_entity(), when reweighting a currently running entity
(se == cfs_rq->curr), the entity remains on the runqueue without
undergoing a full dequeue/enqueue cycle, so avg_vruntime() stays
constant throughout the reweight operation.

However, the current implementation calls place_entity(..., 0) at the
end of reweight_entity(). Under EEVDF, place_entity() is designed to
handle entities entering the runqueue and rescales the virtual lag
(vlag) to account for the change in the weighted average vruntime (V)
using the formula:

    vlag' = vlag * (W + w_i) / W

where 'W' is the current aggregate weight (including
cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
enqueued (in this case, se is exactly cfs_rq->curr).

This leads to double scaling for running entities:

1. reweight_entity() already rescales se->vlag based on the new weight
   ratio.
2. place_entity() then mistakenly applies the (W + w_i)/W scaling
   again, treating the reweight as a fresh enqueue into a new total
   weight pool.

As a result, the entity's vlag is incorrectly amplified (if positive)
or suppressed (if negative) during the reweight. In environments with
frequent cgroup throttle/unthrottle operations, this math error
manifests as vruntime drift.
A hung task was observed as below:

crash> runq -c 0 -g
CPU 0
  CURRENT: PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng"

  ROOT_TASK_GROUP: ffff8001025fa4c0  RT_RQ: ffff0000fff42500
     [no tasks queued]

  ROOT_TASK_GROUP: ffff8001025fa4c0  CFS_RQ: ffff0000fff422c0
     TASK_GROUP: ffff0000c130fc00  CFS_RQ: ffff00009125a400  <test_cg>
        cfs_bandwidth: period=100000000, quota=18446744073709551615,
        gse: 0xffff000091258c00, vruntime=127285708384434,
             deadline=127285714880550, vlag=11721467, weight=338965,
             my_q=ffff00009125a400,
        cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792,
                avg_load=0, nr_running=1
     TASK_GROUP: ffff0000d7cc8800  CFS_RQ: ffff0000c8f86800  <test_test329274_1>
        cfs_bandwidth: period=14000000, quota=14000000,
        gse: 0xffff0000c8f86400, vruntime=2034894470719,
             deadline=2034898697770, vlag=0, weight=215291,
             my_q=ffff0000c8f86800,
        cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954,
                avg_load=54, nr_running=19
        [110] PID: 330440  TASK: ffff00004cd61540  COMMAND: "stress-ng" [CURRENT]
              vruntime=8444367524951, deadline=8444932411139,
              vlag=8444932411139, weight=3072,
              last_arrival=4002964107010, last_queued=0,
              exec_start=3872860294100, sum_exec_runtime=22252021900
        ...
        [110] PID: 330291  TASK: ffff0000c02c9540  COMMAND: "stress-ng"
              vruntime=8444229273009, deadline=8444946073008,
              vlag=-2701415, weight=3072,
              last_arrival=4002964076840, last_queued=4002964550990,
              exec_start=3872859839290, sum_exec_runtime=22310951770
     [100] PID: 97  TASK: ffff0000c2432a00  COMMAND: "kworker/0:1H"
           vruntime=127285720095197, deadline=127285720119423,
           vlag=48453, weight=90891264,
           last_arrival=3846600432710, last_queued=3846600721010,
           exec_start=3743307237970, sum_exec_runtime=413405210
     [120] PID: 15  TASK: ffff0000c0368080  COMMAND: "ksoftirqd/0"
           vruntime=127285722433404, deadline=127285724533404,
           vlag=0, weight=1048576,
           last_arrival=3506755665780, last_queued=3506852159390,
           exec_start=3461615726670, sum_exec_runtime=16341041340
     [120] PID: 50173  TASK: ffff0000741d8080  COMMAND: "kworker/0:0"
           vruntime=127285722960040, deadline=127285725060040,
           vlag=-414755, weight=1048576,
           last_arrival=3506828139580, last_queued=3506972354700,
           exec_start=3461676584440, sum_exec_runtime=84414080
     [120] PID: 58662  TASK: ffff000091180080  COMMAND: "kworker/0:2"
           vruntime=127285723428168, deadline=127285725528168,
           vlag=3049158, weight=1048576,
           last_arrival=3505689085070, last_queued=3506848131990,
           exec_start=3460592328510, sum_exec_runtime=89193000

TASK 1 (systemd) is waiting for cgroup_mutex. TASK 329296 (sh) holds
cgroup_mutex and is waiting for cpus_read_lock. TASK 50173
(kworker/0:0) holds cpus_read_lock but fails to be scheduled. test_cg
and TASK 97 may have suppressed TASK 50173, preventing it from being
scheduled for a long time, so it could not release its locks in a
timely manner, ultimately causing the hung task.

Fix this by adding an ENQUEUE_REWEIGHT_CURR flag and skipping the vlag
recalculation in place_entity() when reweighting the currently running
entity. For non-current entities, the existing logic remains correct,
since dequeue/enqueue does change avg_vruntime().
Fixes: 26ffc93c86c6 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/fair.c  | 11 ++++++++++-
 kernel/sched/sched.h |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7bb8a144c43..ccd2c852b2b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3985,7 +3985,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
 		if (rel_vprot)
 			se->vprot = se->vruntime + vprot;
 		update_load_add(&cfs_rq->load, se->load.weight);
@@ -5387,6 +5387,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = se->vlag;
 
+		/*
+		 * ENQUEUE_REWEIGHT_CURR:
+		 * current running se (cfs_rq->curr) should skip vlag
+		 * recalculation, because avg_vruntime(...) hasn't changed.
+		 */
+		if (flags & ENQUEUE_REWEIGHT_CURR)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5449,6 +5457,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 
 	if (se->rel_deadline) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 214659d7dc9c..61f2ad5d1cfb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2431,6 +2431,8 @@ extern const u32 sched_prio_to_wmult[40];
 #endif
 
 #define ENQUEUE_INITIAL		0x80
+#define ENQUEUE_REWEIGHT_CURR	0x200
+
 #define RETRY_TASK		((void *)-1UL)
 
 struct affinity_context {
-- 
2.34.1