From: Chengming Zhou zhouchengming@bytedance.com
mainline inclusion from mainline-v6.1-rc1 commit 65176f59a18d888684525658a1d0b8bf749d24f3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Way back when PSI_MEM_FULL was accounted from the timer tick, task switching could simply iterate next and prev to the common ancestor to update TSK_ONCPU and be done.
Then memstall ticks were replaced with checking curr->in_memstall directly in psi_group_change(). That meant that now if the task switch was between a memstall and a !memstall task, we had to iterate through the common ancestors at least ONCE to fix up their state_masks.
We added the identical_state filter to make sure the common ancestor elimination was skipped in that case. It seems that was always a little too eager, because it caused us to walk the common ancestors *twice* instead of the required once: the iteration for next could have stopped at the common ancestor; prev could have updated TSK_ONCPU up to the common ancestor, then finish to the root without changing any flags, just to get the new curr->in_memstall into the state_masks.
This patch recognizes this and makes it so that we walk to the root exactly once if state_mask needs updating, which is simply catching up on a missed optimization that could have been done in commit 7fae6c8171d2 ("psi: Use ONCPU state tracking machinery to detect reclaim") directly.
Apart from this, it's also necessary for the next patch "sched/psi: remove NR_ONCPU task accounting". Suppose we walk the common ancestors twice:
(1) psi_group_change(.clear = 0, .set = TSK_ONCPU) (2) psi_group_change(.clear = TSK_ONCPU, .set = 0)
We previously used tasks[NR_ONCPU] to record TSK_ONCPU, tasks[NR_ONCPU]++ in (1) then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] still be correct.
The next patch change to use one bit in state mask to record TSK_ONCPU, PSI_ONCPU bit will be set in (1), but then be cleared in (2), which cause the psi_group_cpu has task running on CPU but without PSI_ONCPU bit set!
With this patch, we will never walk the common ancestors twice, so won't have above problem.
Suggested-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Chengming Zhou zhouchengming@bytedance.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Johannes Weiner hannes@cmpxchg.org Link: https://lore.kernel.org/r/20220825164111.29534-6-zhouchengming@bytedance.com Conflict: kernel/sched/psi.c Signed-off-by: Lu Jialin lujialin4@huawei.com --- kernel/sched/psi.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 0fa8c3fdec33..17898826fe24 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -841,21 +841,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, u64 now = cpu_clock(cpu);
if (next->pid) { - bool identical_state; - psi_flags_change(next, 0, TSK_ONCPU); /* - * When switching between tasks that have an identical - * runtime state, the cgroup that contains both tasks - * runtime state, the cgroup that contains both tasks - * we reach the first common ancestor. Iterate @next's - * ancestors only until we encounter @prev's ONCPU. + * Set TSK_ONCPU on @next's cgroups. If @next shares any + * ancestors with @prev, those will already have @prev's + * TSK_ONCPU bit set, and we can stop the iteration there. */ - identical_state = prev->psi_flags == next->psi_flags; iter = NULL; while ((group = iterate_groups(next, &iter))) { - if (identical_state && - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; } @@ -899,10 +893,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, psi_group_change(group, cpu, clear, set, now, wake_clock);
/* - * TSK_ONCPU is handled up to the common ancestor. If we're tasked - * with dequeuing too, finish that for the rest of the hierarchy. + * TSK_ONCPU is handled up to the common ancestor. If there are + * any other differences between the two tasks (e.g. prev goes + * to sleep, or only one task is memstall), finish propagating + * those differences all the way up to the root. */ - if (sleep) { + if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) psi_group_change(group, cpu, clear, set, now, wake_clock);