Chen Wandun (3):
  mm: disable psi cgroup v1 by default
  mm: add config isolation for psi under cgroup v1
  psi: dont alloc memory for psi by default

Chengming Zhou (3):
  sched/psi: Fix periodic aggregation shut off
  sched/psi: Optimize task switch inside shared cgroups again
  sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure

Haifeng Xu (1):
  sched/psi: Bail out early from irq time accounting

Hao Jia (1):
  sched/psi: Zero the memory of struct psi_group

Johannes Weiner (1):
  sched/psi: Remove NR_ONCPU task accounting

Lu Jialin (11):
  psi: support irq.pressure under cgroup v1
  psi: enable CONFIG_PSI_CGROUP_V1 in openeuler_defconfig
  psi: update psi irqtime when the irq delta is nozero
  memcg: Modify memcg async reclaim
  psi: add struct psi_group_ext
  PSI: Introduce fine grained stall time collect for cgroup reclaim
  PSI: Introduce avgs and total calculation for cgroup reclaim
  PSI: Introduce pressure.stat in psi
  PSI: add more memory fine grained stall tracking in pressure.stat
  add cpu fine grained stall tracking in pressure.stat
  PSI: enable CONFIG_PSI_FINE_GRAINED in openeuler_defconfig

Suren Baghdasaryan (1):
  psi: Fix "defined but not used" warnings when CONFIG_PROC_FS=n
 Documentation/admin-guide/cgroup-v2.rst |   6 +
 arch/arm64/configs/openeuler_defconfig  |   2 +
 arch/x86/configs/openeuler_defconfig    |   2 +
 block/blk-cgroup.c                      |   2 +-
 block/blk-core.c                        |   2 +-
 include/linux/cgroup-defs.h             |   6 +-
 include/linux/cgroup.h                  |   2 +-
 include/linux/psi.h                     |   9 +-
 include/linux/psi_types.h               | 101 +++-
 include/linux/sched.h                   |   4 +
 init/Kconfig                            |  20 +
 kernel/cgroup/cgroup.c                  |  64 ++-
 kernel/sched/core.c                     |   2 +
 kernel/sched/cpuacct.c                  |   6 +-
 kernel/sched/fair.c                     |   6 -
 kernel/sched/psi.c                      | 694 ++++++++++++++++++++----
 kernel/sched/stats.h                    |   9 +
 mm/compaction.c                         |   2 +-
 mm/filemap.c                            |   4 +-
 mm/memcontrol.c                         | 118 ++--
 mm/page_alloc.c                         |   6 +
 mm/page_io.c                            |   3 +
 mm/vmscan.c                             |   5 +-
 23 files changed, 887 insertions(+), 188 deletions(-)
From: Suren Baghdasaryan <surenb@google.com>

mainline inclusion
from mainline-v5.18-rc1
commit 5102bb1c9f82857a3164af9d7ab7ad628cb783ed
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When CONFIG_PROC_FS is disabled, the psi code generates the following warnings:
kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
 1364 | static const struct proc_ops psi_cpu_proc_ops = {
      |                              ^~~~~~~~~~~~~~~~
kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
 1355 | static const struct proc_ops psi_memory_proc_ops = {
      |                              ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
 1346 | static const struct proc_ops psi_io_proc_ops = {
      |                              ^~~~~~~~~~~~~~~
Make the definitions of these structures and the related functions conditional on CONFIG_PROC_FS.
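The resulting shape of kernel/sched/psi.c, as a sketch: the trigger/poll machinery stays outside the new guard (the cgroup pressure files use it regardless of procfs), while everything backing /proc/pressure/* moves inside:

  #ifdef CONFIG_PROC_FS
  /* psi_{io,memory,cpu}_show()/open(), psi_write(), the three
   * proc_ops tables and psi_proc_init() now all live in here. */
  #endif /* CONFIG_PROC_FS */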
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
Conflict: kernel/sched/psi.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/sched/psi.c | 63 ++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 11a43dccb7fc..fd4c9847219c 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1106,36 +1106,6 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) return 0; }
-static int psi_io_show(struct seq_file *m, void *v) -{ - return psi_show(m, &psi_system, PSI_IO); -} - -static int psi_memory_show(struct seq_file *m, void *v) -{ - return psi_show(m, &psi_system, PSI_MEM); -} - -static int psi_cpu_show(struct seq_file *m, void *v) -{ - return psi_show(m, &psi_system, PSI_CPU); -} - -static int psi_io_open(struct inode *inode, struct file *file) -{ - return single_open(file, psi_io_show, NULL); -} - -static int psi_memory_open(struct inode *inode, struct file *file) -{ - return single_open(file, psi_memory_show, NULL); -} - -static int psi_cpu_open(struct inode *inode, struct file *file) -{ - return single_open(file, psi_cpu_show, NULL); -} - struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, size_t nbytes, enum psi_res res, struct kernfs_open_file *of) @@ -1304,6 +1274,37 @@ __poll_t psi_trigger_poll(void **trigger_ptr, return ret; }
+#ifdef CONFIG_PROC_FS +static int psi_io_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_IO); +} + +static int psi_memory_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_MEM); +} + +static int psi_cpu_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_CPU); +} + +static int psi_io_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_io_show, NULL); +} + +static int psi_memory_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_memory_show, NULL); +} + +static int psi_cpu_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_cpu_show, NULL); +} + static ssize_t psi_write(struct file *file, const char __user *user_buf, size_t nbytes, enum psi_res res) { @@ -1418,3 +1419,5 @@ static int __init psi_proc_init(void) return 0; } module_init(psi_proc_init); + +#endif /* CONFIG_PROC_FS */
From: Chengming Zhou <zhouchengming@bytedance.com>

mainline inclusion
from mainline-v6.1-rc1
commit c530a3c716b963625e43aa915e0de6b4d1ce8ad9
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We don't want to wake periodic aggregation work back up if the task change is the aggregation worker itself going to sleep, or we'll ping-pong forever.
Previously, we would use psi_task_change() in psi_dequeue() when a task was going to sleep, so this check was put in psi_task_change().

But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") deferred task sleep handling to psi_task_switch(), so the sleep path no longer goes through psi_task_change().

So move this check to psi_task_switch().
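For illustration, the ping-pong the check prevents (a hypothetical sequence, not part of the patch):

  /*
   * 1. psi_avgs_work() finds a period of no task changes and lets
   *    the periodic clock shut off.
   * 2. Its kworker goes to sleep; that dequeue is itself a task
   *    change, now handled in psi_task_switch().
   * 3. Without the check, that change would wake the clock and
   *    re-arm psi_avgs_work() -> back to step 1, forever.
   */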
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-2-zhouchengming@bytedance.com
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/sched/psi.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index fd4c9847219c..0fa8c3fdec33 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -818,7 +818,6 @@ void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; - bool wake_clock = true; void *iter = NULL; u64 now;
@@ -828,19 +827,9 @@ void psi_task_change(struct task_struct *task, int clear, int set) psi_flags_change(task, clear, set);
now = cpu_clock(cpu); - /* - * Periodic aggregation shuts off if there is a period of no - * task changes, so we wake it back up if necessary. However, - * don't do this if the task change is the aggregation worker - * itself going to sleep, or we'll ping-pong forever. - */ - if (unlikely((clear & TSK_RUNNING) && - (task->flags & PF_WQ_WORKER) && - wq_worker_last_func(task) == psi_avgs_work)) - wake_clock = false;
while ((group = iterate_groups(task, &iter))) - psi_group_change(group, cpu, clear, set, now, wake_clock); + psi_group_change(group, cpu, clear, set, now, true); }
void psi_task_switch(struct task_struct *prev, struct task_struct *next, @@ -877,6 +866,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (prev->pid) { int clear = TSK_ONCPU, set = 0; + bool wake_clock = true;
/* * When we're going to sleep, psi_dequeue() lets us @@ -890,13 +880,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, clear |= TSK_MEMSTALL_RUNNING; if (prev->in_iowait) set |= TSK_IOWAIT; + + /* + * Periodic aggregation shuts off if there is a period of no + * task changes, so we wake it back up if necessary. However, + * don't do this if the task change is the aggregation worker + * itself going to sleep, or we'll ping-pong forever. + */ + if (unlikely((prev->flags & PF_WQ_WORKER) && + wq_worker_last_func(prev) == psi_avgs_work)) + wake_clock = false; }
psi_flags_change(prev, clear, set);
iter = NULL; while ((group = iterate_groups(prev, &iter)) && group != common) - psi_group_change(group, cpu, clear, set, now, true); + psi_group_change(group, cpu, clear, set, now, wake_clock);
/* * TSK_ONCPU is handled up to the common ancestor. If we're tasked @@ -905,7 +905,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, if (sleep) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) - psi_group_change(group, cpu, clear, set, now, true); + psi_group_change(group, cpu, clear, set, now, wake_clock); } } }
From: Chengming Zhou <zhouchengming@bytedance.com>

mainline inclusion
from mainline-v6.1-rc1
commit 65176f59a18d888684525658a1d0b8bf749d24f3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Way back when PSI_MEM_FULL was accounted from the timer tick, task switching could simply iterate next and prev to the common ancestor to update TSK_ONCPU and be done.
Then memstall ticks were replaced with checking curr->in_memstall directly in psi_group_change(). That meant that now if the task switch was between a memstall and a !memstall task, we had to iterate through the common ancestors at least ONCE to fix up their state_masks.
We added the identical_state filter to make sure the common ancestor elimination was skipped in that case. It seems that was always a little too eager, because it caused us to walk the common ancestors *twice* instead of the required once: the iteration for next could have stopped at the common ancestor; prev could have updated TSK_ONCPU up to the common ancestor, then finished to the root without changing any flags, just to get the new curr->in_memstall into the state_masks.
This patch recognizes this and makes it so that we walk to the root exactly once if state_mask needs updating, which is simply catching up on a missed optimization that could have been done in commit 7fae6c8171d2 ("psi: Use ONCPU state tracking machinery to detect reclaim") directly.
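To make the walk counts concrete, a hypothetical hierarchy (cgroup names invented for illustration):

  /*
   *    root
   *     `- a          <- first common ancestor
   *        |- b       <- prev's cgroup
   *        `- c       <- next's cgroup
   *
   * @next's iteration: c, then stops at a on seeing @prev's ONCPU
   * state. @prev's iteration: b, and only when the two tasks differ
   * in more than TSK_ONCPU does it continue through a and root, so
   * a and root are each visited at most once instead of twice.
   */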
Apart from this, it's also necessary for the next patch "sched/psi: remove NR_ONCPU task accounting". Suppose we walk the common ancestors twice:
(1) psi_group_change(.clear = 0, .set = TSK_ONCPU) (2) psi_group_change(.clear = TSK_ONCPU, .set = 0)
We previously used tasks[NR_ONCPU] to record TSK_ONCPU: tasks[NR_ONCPU]++ in (1), then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] remained correct.

The next patch changes this to one bit in the state mask: the PSI_ONCPU bit would be set in (1) but then cleared in (2), leaving the psi_group_cpu with a task running on the CPU but without the PSI_ONCPU bit set!

With this patch, we never walk the common ancestors twice, so the problem above cannot occur.
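Spelling the failure mode out against one common-ancestor psi_group_cpu, using the bit scheme of the next patch (a sketch, not patch code):

  /* (1) next switched in:   .set = TSK_ONCPU   */
  state_mask |= PSI_ONCPU;    /* correct: next runs here */
  /* (2) prev switched out:  .clear = TSK_ONCPU */
  state_mask &= ~PSI_ONCPU;   /* wrong: bit lost although next
                               * is still on this CPU */

A single walk performs (1) only up to the common ancestor, so (2) never undoes it there.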
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-6-zhouchengming@bytedance.com
Conflict: kernel/sched/psi.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/sched/psi.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 0fa8c3fdec33..17898826fe24 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -841,21 +841,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, u64 now = cpu_clock(cpu);
if (next->pid) { - bool identical_state; - psi_flags_change(next, 0, TSK_ONCPU); /* - * When switching between tasks that have an identical - * runtime state, the cgroup that contains both tasks - * runtime state, the cgroup that contains both tasks - * we reach the first common ancestor. Iterate @next's - * ancestors only until we encounter @prev's ONCPU. + * Set TSK_ONCPU on @next's cgroups. If @next shares any + * ancestors with @prev, those will already have @prev's + * TSK_ONCPU bit set, and we can stop the iteration there. */ - identical_state = prev->psi_flags == next->psi_flags; iter = NULL; while ((group = iterate_groups(next, &iter))) { - if (identical_state && - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; } @@ -899,10 +893,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, psi_group_change(group, cpu, clear, set, now, wake_clock);
/* - * TSK_ONCPU is handled up to the common ancestor. If we're tasked - * with dequeuing too, finish that for the rest of the hierarchy. + * TSK_ONCPU is handled up to the common ancestor. If there are + * any other differences between the two tasks (e.g. prev goes + * to sleep, or only one task is memstall), finish propagating + * those differences all the way up to the root. */ - if (sleep) { + if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) psi_group_change(group, cpu, clear, set, now, wake_clock);
From: Johannes Weiner <hannes@cmpxchg.org>

mainline inclusion
from mainline-v6.1-rc1
commit 71dbdde7914d32e86f01ac1f6e54e964c9dfdbd9
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We put all fields updated by the scheduler in the first cacheline of struct psi_group_cpu for performance.
Since we want to add another state, PSI_IRQ_FULL, to track IRQ/SOFTIRQ pressure, we need to reclaim space first. This patch removes the NR_ONCPU task accounting in struct psi_group_cpu and uses one bit in state_mask to track it instead.
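A sketch of the resulting layout and checks (constants and comparisons taken from the hunks below; the helper names are invented):

  #define NR_PSI_STATES 7                     /* PSI_IO_SOME ... PSI_NONIDLE */
  #define PSI_ONCPU     (1 << NR_PSI_STATES)  /* bit 7, no tasks[] slot */

  /* CPU pressure then tests the flag instead of a counter: */
  static bool cpu_some(unsigned int nr_running, bool oncpu)
  {
          return nr_running > oncpu;          /* PSI_CPU_SOME */
  }

  static bool cpu_full(unsigned int nr_running, bool oncpu)
  {
          return nr_running && !oncpu;        /* PSI_CPU_FULL */
  }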
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20220825164111.29534-7-zhouchengming@bytedance.com
Conflict: include/linux/psi_types.h
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 include/linux/psi_types.h | 14 +++++--------
 kernel/sched/psi.c        | 41 ++++++++++++++++++++++++++-----------
 2 files changed, 35 insertions(+), 20 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 0b6e17e7f84f..9b3988ee2ce8 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -36,13 +36,6 @@ enum psi_task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, - /* - * This can't have values other than 0 or 1 and could be - * implemented as a bit flag. But for now we still have room - * in the first cacheline of psi_group_cpu, and this way we - * don't have to special case any state tracking for it. - */ - NR_ONCPU, /* * For IO and CPU stalls the presence of running/oncpu tasks * in the domain means a partial rather than a full stall. @@ -53,7 +46,7 @@ enum psi_task_count { * threads and memstall ones. */ NR_MEMSTALL_RUNNING, - NR_PSI_TASK_COUNTS = 5, + NR_PSI_TASK_COUNTS = 4, }; #endif
@@ -61,8 +54,9 @@ enum psi_task_count { #define TSK_IOWAIT (1 << NR_IOWAIT) #define TSK_MEMSTALL (1 << NR_MEMSTALL) #define TSK_RUNNING (1 << NR_RUNNING) -#define TSK_ONCPU (1 << NR_ONCPU) #define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING) +/* Only one task can be scheduled, no corresponding task count */ +#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS)
/* Resources that workloads could be stalled on */ enum psi_res { @@ -110,6 +104,8 @@ enum psi_states { }; #endif
+/* Use one bit in the state mask to track TSK_ONCPU */ +#define PSI_ONCPU (1 << NR_PSI_STATES)
enum psi_aggregators { PSI_AVGS = 0, diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 17898826fe24..800d11f9898c 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -228,7 +228,7 @@ void __init psi_init(void) group_init(&psi_system); }
-static bool test_state(unsigned int *tasks, enum psi_states state) +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu) { switch (state) { case PSI_IO_SOME: @@ -241,9 +241,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state) return unlikely(tasks[NR_MEMSTALL] && tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); case PSI_CPU_SOME: - return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]); + return unlikely(tasks[NR_RUNNING] > oncpu); case PSI_CPU_FULL: - return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]); + return unlikely(tasks[NR_RUNNING] && !oncpu); case PSI_NONIDLE: return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || tasks[NR_RUNNING]; @@ -696,9 +696,9 @@ static void psi_group_change(struct psi_group *group, int cpu, bool wake_clock) { struct psi_group_cpu *groupc; - u32 state_mask = 0; unsigned int t, m; enum psi_states s; + u32 state_mask;
groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -714,17 +714,36 @@ static void psi_group_change(struct psi_group *group, int cpu,
record_times(groupc, now);
+ /* + * Start with TSK_ONCPU, which doesn't have a corresponding + * task count - it's just a boolean flag directly encoded in + * the state mask. Clear, set, or carry the current state if + * no changes are requested. + */ + if (unlikely(clear & TSK_ONCPU)) { + state_mask = 0; + clear &= ~TSK_ONCPU; + } else if (unlikely(set & TSK_ONCPU)) { + state_mask = PSI_ONCPU; + set &= ~TSK_ONCPU; + } else { + state_mask = groupc->state_mask & PSI_ONCPU; + } + + /* + * The rest of the state mask is calculated based on the task + * counts. Update those first, then construct the mask. + */ for (t = 0, m = clear; m; m &= ~(1 << t), t++) { if (!(m & (1 << t))) continue; if (groupc->tasks[t]) { groupc->tasks[t]--; } else if (!psi_bug) { - printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u %u] clear=%x set=%x\n", + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n", cpu, t, groupc->tasks[0], groupc->tasks[1], groupc->tasks[2], - groupc->tasks[3], groupc->tasks[4], - clear, set); + groupc->tasks[3], clear, set); psi_bug = 1; } } @@ -733,9 +752,8 @@ static void psi_group_change(struct psi_group *group, int cpu, if (set & (1 << t)) groupc->tasks[t]++;
- /* Calculate state mask representing active states */ for (s = 0; s < NR_PSI_STATES; s++) { - if (test_state(groupc->tasks, s)) + if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU)) state_mask |= (1 << s); }
@@ -747,7 +765,7 @@ static void psi_group_change(struct psi_group *group, int cpu, * task in a cgroup is in_memstall, the corresponding groupc * on that cpu is in PSI_MEM_FULL state. */ - if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)) + if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall)) state_mask |= (1 << PSI_MEM_FULL);
groupc->state_mask = state_mask; @@ -849,7 +867,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, */ iter = NULL; while ((group = iterate_groups(next, &iter))) { - if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + if (per_cpu_ptr(group->pcpu, cpu)->state_mask & + PSI_ONCPU) { common = group; break; }
From: Chengming Zhou <zhouchengming@bytedance.com>

mainline inclusion
from mainline-v6.1-rc1
commit 52b1364ba0b105122d6de0e719b36db705011ac1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
PSI already tracks workload pressure stall information for CPU, memory, and IO. Apart from these, IRQ/SOFTIRQ can have an obvious impact on the productivity of some workloads, such as web services.

With CONFIG_IRQ_TIME_ACCOUNTING we can get the IRQ/SOFTIRQ delta time from update_rq_clock_task(), where we can record that delta against the cgroups of the CPU's current task as PSI_IRQ_FULL time.

Note that we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ time is always taken from the current task on the CPU, so nothing productive can run even if it is runnable; hence only PSI_IRQ_FULL is tracked.
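Since only FULL is tracked, the new files expose a single line. A hypothetical read of the system-level file (values invented):

  cat /proc/pressure/irq
  full avg10=0.00 avg60=0.00 avg300=0.00 total=0

This is what the only_full handling added to psi_show() below produces.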
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
Conflict:
	Documentation/admin-guide/cgroup-v2.rst
	include/linux/psi_types.h
	kernel/sched/psi.c
	kernel/sched/stats.h
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++
 include/linux/psi.h                     |  1 +
 include/linux/psi_types.h               | 10 +++-
 kernel/cgroup/cgroup.c                  | 26 +++++++++
 kernel/sched/core.c                     |  1 +
 kernel/sched/psi.c                      | 74 ++++++++++++++++++++++++-
 kernel/sched/stats.h                    |  1 +
 7 files changed, 115 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 394fb18c9e3d..20c35c289253 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -951,6 +951,12 @@ All cgroup core files are prefixed with "cgroup." it's possible to delete a frozen (and empty) cgroup, as well as create new sub-cgroups.
+irq.pressure + A read-write nested-keyed file. + + Shows pressure stall information for IRQ/SOFTIRQ. See + :ref:`Documentation/accounting/psi.rst <psi>` for details. + Controllers ===========
diff --git a/include/linux/psi.h b/include/linux/psi.h index 86635a5630ba..9e15596968d0 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -21,6 +21,7 @@ void psi_init(void); void psi_task_change(struct task_struct *task, int clear, int set); void psi_task_switch(struct task_struct *prev, struct task_struct *next, bool sleep); +void psi_account_irqtime(struct task_struct *task, u32 delta);
void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags); diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 9b3988ee2ce8..668f67b56464 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -63,7 +63,10 @@ enum psi_res { PSI_IO, PSI_MEM, PSI_CPU, - NR_PSI_RESOURCES = 3, +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + PSI_IRQ, +#endif + NR_PSI_RESOURCES, };
/* @@ -98,9 +101,12 @@ enum psi_states { PSI_MEM_FULL, PSI_CPU_SOME, PSI_CPU_FULL, +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + PSI_IRQ_FULL, +#endif /* Only per-CPU, to weigh the CPU in the global average: */ PSI_NONIDLE, - NR_PSI_STATES = 7, + NR_PSI_STATES, }; #endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 3d778636f2e8..45c422808340 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3751,6 +3751,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of, return cgroup_pressure_write(of, buf, nbytes, PSI_CPU); }
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static int cgroup_irq_pressure_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp = seq_css(seq)->cgroup; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + + return psi_show(seq, psi, PSI_IRQ); +} + +static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ); +} +#endif + static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, poll_table *pt) { @@ -5155,6 +5172,15 @@ static struct cftype cgroup_base_files[] = { .poll = cgroup_pressure_poll, .release = cgroup_pressure_release, }, +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + { + .name = "irq.pressure", + .seq_show = cgroup_irq_pressure_show, + .write = cgroup_irq_pressure_write, + .poll = cgroup_pressure_poll, + .release = cgroup_pressure_release, + }, +#endif #endif /* CONFIG_PSI */ { } /* terminate */ }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 201d75a786e0..eb3353b90b8d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -629,6 +629,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta; delta -= irq_delta; + psi_account_irqtime(rq->curr, irq_delta); #endif #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 800d11f9898c..7c255aabc8ac 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -925,6 +925,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, } }
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING +void psi_account_irqtime(struct task_struct *task, u32 delta) +{ + int cpu = task_cpu(task); + void *iter = NULL; + struct psi_group *group; + struct psi_group_cpu *groupc; + u64 now; + + if (!task->pid) + return; + + now = cpu_clock(cpu); + + while ((group = iterate_groups(task, &iter))) { + groupc = per_cpu_ptr(group->pcpu, cpu); + + write_seqcount_begin(&groupc->seq); + + record_times(groupc, now); + groupc->times[PSI_IRQ_FULL] += delta; + + write_seqcount_end(&groupc->seq); + + if (group->poll_states & (1 << PSI_IRQ_FULL)) + psi_schedule_poll_work(group, 1); + } +} +#endif + /** * psi_memstall_enter - mark the beginning of a memory stall section * @flags: flags to handle nested sections @@ -1083,6 +1113,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { + bool only_full = false; int full; u64 now;
@@ -1097,7 +1128,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) group->avg_next_update = update_averages(group, now); mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) { +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + only_full = res == PSI_IRQ; +#endif + + for (full = 0; full < 2 - only_full; full++) { unsigned long avg[3] = { 0, }; u64 total = 0; int w; @@ -1111,7 +1146,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) }
seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n", - full ? "full" : "some", + full || only_full ? "full" : "some", LOAD_INT(avg[0]), LOAD_FRAC(avg[0]), LOAD_INT(avg[1]), LOAD_FRAC(avg[1]), LOAD_INT(avg[2]), LOAD_FRAC(avg[2]), @@ -1140,6 +1175,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, else return ERR_PTR(-EINVAL);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING + if (res == PSI_IRQ && --state != PSI_IRQ_FULL) + return ERR_PTR(-EINVAL); +#endif + if (state >= PSI_NONIDLE) return ERR_PTR(-EINVAL);
@@ -1423,6 +1463,33 @@ static const struct proc_ops psi_cpu_proc_ops = { .proc_release = psi_fop_release, };
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static int psi_irq_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_IRQ); +} + +static int psi_irq_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_irq_show, NULL); +} + +static ssize_t psi_irq_write(struct file *file, const char __user *user_buf, + size_t nbytes, loff_t *ppos) +{ + return psi_write(file, user_buf, nbytes, PSI_IRQ); +} + +static const struct proc_ops psi_irq_proc_ops = { + .proc_open = psi_irq_open, + .proc_read = seq_read, + .proc_lseek = seq_lseek, + .proc_write = psi_irq_write, + .proc_poll = psi_fop_poll, + .proc_release = psi_fop_release, +}; +#endif + static int __init psi_proc_init(void) { if (psi_enable) { @@ -1430,7 +1497,10 @@ static int __init psi_proc_init(void) proc_create("pressure/io", 0, NULL, &psi_io_proc_ops); proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops); proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops); +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + proc_create("pressure/irq", 0, NULL, &psi_irq_proc_ops); } +#endif return 0; } module_init(psi_proc_init); diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index b8b4e5b2694e..874d8c6e6750 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -170,6 +170,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {} static inline void psi_sched_switch(struct task_struct *prev, struct task_struct *next, bool sleep) {} +static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {} #endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
From: Haifeng Xu <haifeng.xu@shopee.com>

mainline inclusion
from mainline-v6.7-rc1
commit 0c2924079f5a83ed715630680e338b3685a0bf7d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Bail out early from irq time accounting when psi is disabled.
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/sched/psi.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7c255aabc8ac..99d5051c48ae 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -934,6 +934,9 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 	struct psi_group_cpu *groupc;
 	u64 now;
 
+	if (static_branch_likely(&psi_disabled))
+		return;
+
 	if (!task->pid)
 		return;
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Export "irq.pressure" to cgroup v1 "cpuacct" subsystem.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/cgroup/cgroup.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 45c422808340..55b336656fc9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3808,6 +3808,16 @@ struct cftype cgroup_v1_psi_files[] = {
 		.poll = cgroup_pressure_poll,
 		.release = cgroup_pressure_release,
 	},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	{
+		.name = "irq.pressure",
+		.flags = CFTYPE_NO_PREFIX,
+		.seq_show = cgroup_irq_pressure_show,
+		.write = cgroup_irq_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
+	},
+#endif
 	{ }	/* terminate */
 };
 #endif /* CONFIG_PSI */
From: Chen Wandun <chenwandun@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
The psi cgroup v1 feature should only be enabled when CONFIG_PSI_CGROUP_V1 is set.
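With the key defaulting to true (v1 psi off), the boot parameter becomes strictly opt-in. Assuming the handler below is registered as __setup("psi_v1=", setup_psi_v1) (the registration is outside the shown hunks), booting with psi_v1=1 disables the static branch and enables psi accounting on the cpuacct v1 hierarchy; otherwise the kernel stays on the v2-only fast path.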
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 include/linux/psi.h    | 2 +-
 kernel/sched/cpuacct.c | 4 ++--
 kernel/sched/psi.c     | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 9e15596968d0..c4bb670225ee 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -14,7 +14,7 @@ struct css_set;
 extern struct static_key_false psi_disabled;
 extern struct psi_group psi_system;
 
-extern struct static_key_false psi_v1_disabled;
+extern struct static_key_true psi_v1_disabled;
void psi_init(void);
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 28ed182b6801..1fa89f338e51 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -383,8 +383,8 @@ static int __init setup_psi_v1(char *str)
 	int ret;
 
 	ret = kstrtobool(str, &psi_v1_enable);
-	if (!psi_v1_enable)
-		static_branch_enable(&psi_v1_disabled);
+	if (psi_v1_enable)
+		static_branch_disable(&psi_v1_disabled);
 
 	return ret == 0;
 }
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 99d5051c48ae..3d600aff3d5b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -156,7 +156,7 @@ static int psi_bug __read_mostly;
 
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
-DEFINE_STATIC_KEY_FALSE(psi_v1_disabled);
+DEFINE_STATIC_KEY_TRUE(psi_v1_disabled);
 
 #ifdef CONFIG_PSI_DEFAULT_DISABLED
 static bool psi_enable;
From: Chen Wandun <chenwandun@huawei.com>

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Add CONFIG_PSI_CGROUP_V1 to separate the psi-under-cgroup-v1 feature from the baseline.
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 include/linux/psi.h    |  2 ++
 init/Kconfig           | 10 ++++++++++
 kernel/cgroup/cgroup.c |  3 +++
 kernel/sched/cpuacct.c |  2 +-
 kernel/sched/psi.c     | 22 ++++++++++++++--------
 5 files changed, 30 insertions(+), 9 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h index c4bb670225ee..40cfbf0bf831 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -14,7 +14,9 @@ struct css_set; extern struct static_key_false psi_disabled; extern struct psi_group psi_system;
+#ifdef CONFIG_PSI_CGROUP_V1 extern struct static_key_true psi_v1_disabled; +#endif
void psi_init(void);
diff --git a/init/Kconfig b/init/Kconfig index bb9063807556..c668fb4933f6 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -653,6 +653,16 @@ config PSI_DEFAULT_DISABLED
Say N if unsure.
+config PSI_CGROUP_V1 + bool "Support PSI under cgroup v1" + default n + depends on PSI + help + If set, pressure stall information tracking will be used + for cgroup v1 other than v2. + + Say N if unsure. + endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 55b336656fc9..5588c5a3804d 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3783,6 +3783,7 @@ static void cgroup_pressure_release(struct kernfs_open_file *of) psi_trigger_destroy(ctx->psi.trigger); }
+#ifdef CONFIG_PSI_CGROUP_V1 struct cftype cgroup_v1_psi_files[] = { { .name = "io.pressure", @@ -3820,6 +3821,8 @@ struct cftype cgroup_v1_psi_files[] = { #endif { } /* terminate */ }; +#endif + #endif /* CONFIG_PSI */
static int cgroup_freeze_show(struct seq_file *seq, void *v) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 1fa89f338e51..7a7a0dec8c4e 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -375,7 +375,7 @@ struct cgroup_subsys cpuacct_cgrp_subsys = { .early_init = true, };
-#ifdef CONFIG_PSI +#ifdef CONFIG_PSI_CGROUP_V1
static bool psi_v1_enable; static int __init setup_psi_v1(char *str) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 3d600aff3d5b..0ea2745c1800 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -156,7 +156,10 @@ static int psi_bug __read_mostly;
DEFINE_STATIC_KEY_FALSE(psi_disabled); + +#ifdef CONFIG_PSI_CGROUP_V1 DEFINE_STATIC_KEY_TRUE(psi_v1_disabled); +#endif
#ifdef CONFIG_PSI_DEFAULT_DISABLED static bool psi_enable; @@ -785,21 +788,23 @@ static struct psi_group *iterate_groups(struct task_struct *task, void **iter) struct cgroup *cgroup = NULL;
if (!*iter) { - if (static_branch_likely(&psi_v1_disabled)) - cgroup = task->cgroups->dfl_cgrp; - else { +#ifndef CONFIG_PSI_CGROUP_V1 + cgroup = task->cgroups->dfl_cgrp; +#else #ifdef CONFIG_CGROUP_CPUACCT - if (!cgroup_subsys_on_dfl(cpuacct_cgrp_subsys)) { + if (!cgroup_subsys_on_dfl(cpuacct_cgrp_subsys)) { + if (!static_branch_likely(&psi_v1_disabled)) { rcu_read_lock(); cgroup = task_cgroup(task, cpuacct_cgrp_id); rcu_read_unlock(); - } else { - cgroup = task->cgroups->dfl_cgrp; } + } else { + cgroup = task->cgroups->dfl_cgrp; + } #else - cgroup = NULL; + cgroup = NULL; +#endif #endif - } } else if (*iter == &psi_system) return NULL; else @@ -1009,6 +1014,7 @@ void psi_memstall_leave(unsigned long *flags) return;
trace_psi_memstall_leave(_RET_IP_); + /* * in_memstall clearing & accounting needs to be atomic wrt * changes to the task's scheduling state, otherwise we could
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Enable CONFIG_PSI_CGROUP_V1 in openeuler_defconfig.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 arch/arm64/configs/openeuler_defconfig | 2 ++
 arch/x86/configs/openeuler_defconfig   | 1 +
 2 files changed, 3 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 2c6aa6404190..68b1e3f3330d 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -104,6 +104,8 @@ CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_PSI=y
 CONFIG_PSI_DEFAULT_DISABLED=y
+CONFIG_PSI_CGROUP_V1=y
+
 # end of CPU/Task time and stats accounting
 
 CONFIG_CPU_ISOLATION=y
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index 9ca52bbee1f1..e75127469a93 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -108,6 +108,7 @@ CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
 CONFIG_PSI=y
 CONFIG_PSI_DEFAULT_DISABLED=y
+CONFIG_PSI_CGROUP_V1=y
 # end of CPU/Task time and stats accounting
CONFIG_CPU_ISOLATION=y
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Updating psi irqtime even when the irq delta is zero degrades performance, because update_rq_clock_task() runs very frequently. Therefore, only update psi irqtime when the irq delta is nonzero.

Performance test with times:

1) without psi_account_irqtime in update_rq_clock_task
[root@arm64_perf bin]# ./times -E -C 200 -L -S -W -N "times" -I 200
Running: times# ./../bin-arm64/times -E -C 200 -L -S -W -N times -I 200
       prc  thr  usecs/call  samples  errors  cnt/samp
times    1    1     0.45210      188       0       500

2) psi_account_irqtime in update_rq_clock_task
[root@arm64_perf bin]# ./times -E -C 200 -L -S -W -N "times" -I 200
Running: times# ./../bin-arm64/times -E -C 200 -L -S -W -N times -I 200
       prc  thr  usecs/call  samples  errors  cnt/samp
times    1    1     0.49408      196       0       500

3) psi_account_irqtime in update_rq_clock_task only when the irq delta is nonzero
[root@arm64_perf bin]# ./times -E -C 200 -L -S -W -N "times" -I 200
Running: times# ./../bin-arm64/times -E -C 200 -L -S -W -N times -I 200
       prc  thr  usecs/call  samples  errors  cnt/samp
times    1    1     0.45158      195       0       500

The nonzero check recovers the regression (0.45210 -> 0.49408 -> 0.45158 usecs/call).
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb3353b90b8d..15c24ae17766 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -629,7 +629,8 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
-	psi_account_irqtime(rq->curr, irq_delta);
+	if (irq_delta)
+		psi_account_irqtime(rq->curr, irq_delta);
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
From: Chen Wandun <chenwandun@huawei.com>

mainline inclusion
from mainline-v6.0-rc1
commit 5f69a6577bc33d8f6d6bbe02bccdeb357b287f56
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Memory for struct psi_group is allocated by default for each cgroup even when psi_disabled is true; in that case the allocation is simply wasted. So allocate memory for struct psi_group only when psi_disabled is false.
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Conflict:
	include/linux/cgroup-defs.h
	include/linux/cgroup.h
	kernel/cgroup/cgroup.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 include/linux/cgroup-defs.h |  6 +++---
 include/linux/cgroup.h      |  2 +-
 kernel/cgroup/cgroup.c      | 10 +++++-----
 kernel/sched/psi.c          | 19 +++++++++++++------
 4 files changed, 22 insertions(+), 15 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 47263cecb12f..09f2d58d119b 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -492,9 +492,9 @@ struct cgroup { /* * It is accessed only the cgroup core code and so changes made to * the cgroup structure should not affect third-party kernel modules. + * The member is unused now. */ - struct psi_group psi; - + KABI_DEPRECATE(struct psi_group, psi) /* used to store eBPF programs */ struct cgroup_bpf bpf;
@@ -504,7 +504,7 @@ struct cgroup { /* Used to store internal freezer state */ struct cgroup_freezer_state freezer;
- KABI_RESERVE(1) + KABI_USE(1, struct psi_group *psi) KABI_RESERVE(2) KABI_RESERVE(3)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index e706ff15ec88..5b8089c6b320 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -675,7 +675,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) { - return &cgrp->psi; + return cgrp->psi; }
static inline void cgroup_init_kthreadd(void) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 5588c5a3804d..3f340fc30abc 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3677,21 +3677,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v) static int cgroup_io_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgrp = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
return psi_show(seq, psi, PSI_IO); } static int cgroup_memory_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgrp = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
return psi_show(seq, psi, PSI_MEM); } static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgrp = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
return psi_show(seq, psi, PSI_CPU); } @@ -3717,7 +3717,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf, return -EBUSY; }
- psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi; new = psi_trigger_create(psi, buf, nbytes, res, of); if (IS_ERR(new)) { cgroup_put(cgrp); @@ -3755,7 +3755,7 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of, static int cgroup_irq_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgrp = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
return psi_show(seq, psi, PSI_IRQ); } diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 0ea2745c1800..d41c6b4ab324 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1034,10 +1034,16 @@ int psi_cgroup_alloc(struct cgroup *cgroup) if (static_branch_likely(&psi_disabled)) return 0;
- cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); - if (!cgroup->psi.pcpu) + cgroup->psi = kmalloc(sizeof(struct psi_group), GFP_KERNEL); + if (!cgroup->psi) return -ENOMEM; - group_init(&cgroup->psi); + + cgroup->psi->pcpu = alloc_percpu(struct psi_group_cpu); + if (!cgroup->psi->pcpu) { + kfree(cgroup->psi); + return -ENOMEM; + } + group_init(cgroup->psi); return 0; }
@@ -1046,10 +1052,11 @@ void psi_cgroup_free(struct cgroup *cgroup) if (static_branch_likely(&psi_disabled)) return;
- cancel_delayed_work_sync(&cgroup->psi.avgs_work); - free_percpu(cgroup->psi.pcpu); + cancel_delayed_work_sync(&cgroup->psi->avgs_work); + free_percpu(cgroup->psi->pcpu); /* All triggers must be removed by now */ - WARN_ONCE(cgroup->psi.poll_states, "psi: trigger leak\n"); + WARN_ONCE(cgroup->psi->poll_states, "psi: trigger leak\n"); + kfree(cgroup->psi); }
/**
From: Hao Jia <jiahao.os@bytedance.com>

mainline inclusion
from mainline-v6.0-rc3
commit 2b97cf76289a4fcae66d7959b0d74a87207d7068
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
After commit 5f69a6577bc3 ("psi: dont alloc memory for psi by default"), the memory used by struct psi_group is no longer allocated and zeroed in cgroup_create().
Since the memory of struct psi_group is not zeroed, the data in it is random, which leads to inaccurate psi statistics when a new cgroup is created.

So use kzalloc() to allocate and zero struct psi_group, and remove the now-redundant zeroing in group_init().
Steps to reproduce:
1. Use cgroup v2 and enable CONFIG_PSI
2. Create a new cgroup, and query psi statistics

mkdir /sys/fs/cgroup/test
cat /sys/fs/cgroup/test/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=47927752200.00 total=12884901
full avg10=561815124.00 avg60=125835394188.00 avg300=1077090462000.00 total=10273561772

cat /sys/fs/cgroup/test/io.pressure
some avg10=1040093132823.95 avg60=1203770351379.21 avg300=3862252669559.46 total=4294967296
full avg10=921884564601.39 avg60=0.00 avg300=1984507298.35 total=442381631

cat /sys/fs/cgroup/test/memory.pressure
some avg10=232476085778.11 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=2585658472280.57 total=12884901
Fixes: 5f69a6577bc3 ("psi: dont alloc memory for psi by default")
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 kernel/sched/psi.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d41c6b4ab324..f691702b5fc6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -209,12 +209,8 @@ static void group_init(struct psi_group *group)
 	/* Init trigger-related members */
 	mutex_init(&group->trigger_lock);
 	INIT_LIST_HEAD(&group->triggers);
-	memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
-	group->poll_states = 0;
 	group->poll_min_period = U32_MAX;
-	memset(group->polling_total, 0, sizeof(group->polling_total));
 	group->polling_next_update = ULLONG_MAX;
-	group->polling_until = 0;
 	init_waitqueue_head(&group->poll_wait);
 	timer_setup(&group->poll_timer, poll_timer_fn, 0);
 	rcu_assign_pointer(group->poll_task, NULL);
@@ -1034,7 +1030,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi = kmalloc(sizeof(struct psi_group), GFP_KERNEL);
+	cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
 	if (!cgroup->psi)
 		return -ENOMEM;
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
--------------------------------
1) Split async memcg reclaim out of reclaim_high() into async_reclaim_high(), so that the pressure stall caused by async memcg reclaim can be observed through psi.
2) Calling async reclaim when writing memory.high is useless, so remove it.
3) Set the default memcg async reclaim ratio to 100, which means plain memory.high behavior without async reclaim. The valid ratio range is (10, 100] (the watermark math is illustrated below).
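A worked example of the watermark math (constants from the patch, sample numbers invented; with memory.high = 1000 pages and high_async_ratio = 90):

  #define HIGH_ASYNC_RATIO_BASE 100
  #define HIGH_ASYNC_RATIO_GAP   10

  /* async reclaim starts once usage exceeds the warning watermark:
   * 1000 * 90 / 100 = 900 pages */
  static unsigned long warning_watermark(unsigned long high, int ratio)
  {
          return high * ratio / HIGH_ASYNC_RATIO_BASE;
  }

  /* and aims back down to the safe level:
   * 1000 * (90 - 10) / 100 = 800 pages */
  static unsigned long safe_pages(unsigned long high, int ratio)
  {
          return high * (ratio - HIGH_ASYNC_RATIO_GAP) / HIGH_ASYNC_RATIO_BASE;
  }

At usage = 950 pages the reclaim target is 950 - 800 = 150 pages; if usage has already fallen to or below the safe level, MEMCG_CHARGE_BATCH is used instead.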
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 mm/memcontrol.c | 105 +++++++++++++++++++++---------------------------
 1 file changed, 45 insertions(+), 60 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2da152a09ea3..5009ada24d4c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -110,18 +110,14 @@ static bool do_memsw_account(void) #define SOFTLIMIT_EVENTS_TARGET 1024
/* - * when memcg->high_async_ratio is HIGH_ASYNC_RATIO_DEFAULT, memcg async + * memcg warning watermark = memory.high * memcg->high_async_ratio / + * HIGH_ASYNC_RATIO_BASE. + * when memcg usage is larger than warning watermark, but smaller than + * memory.high, start memcg async reclaim; + * when memcg->high_async_ratio is HIGH_ASYNC_RATIO_BASE, memcg async * relcaim is disabled; - * when mem_usage is larger than memory.high * memcg->high_async_ratio/ - * HIGH_ASYNC_RATIO_BASE, start async reclaim; - * if mem_usage is larger than memory.high * (memcg->high_async_ratio - - * HIGH_ASYNC_RATIO_GAP) / HIGH_ASYNC_RATIO_BASE, the aim reclaim page is - * the diff of mem_usage and memory.high * (memcg->high_async_ratio - - * HIGH_ASYNC_RATIO_GAP) / HIGH_ASYNC_RATIO_BASE else the aim reclaim - * page is MEMCG_CHARGE_BATCH; - */ + * */
-#define HIGH_ASYNC_RATIO_DEFAULT 0 #define HIGH_ASYNC_RATIO_BASE 100 #define HIGH_ASYNC_RATIO_GAP 10
@@ -2367,41 +2363,18 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) return 0; }
-static bool is_high_async_reclaim(struct mem_cgroup *memcg) -{ - int ratio = READ_ONCE(memcg->high_async_ratio); - - if (ratio == HIGH_ASYNC_RATIO_DEFAULT) - return false; - - if (READ_ONCE(memcg->memory.high) == PAGE_COUNTER_MAX) - return false; - - return page_counter_read(&memcg->memory) > - (READ_ONCE(memcg->memory.high) * ratio / HIGH_ASYNC_RATIO_BASE); -} - static unsigned long reclaim_high(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask) { unsigned long nr_reclaimed = 0; - bool high_async_reclaim = READ_ONCE(memcg->high_async_reclaim); - - if (high_async_reclaim) - WRITE_ONCE(memcg->high_async_reclaim, false);
do { unsigned long pflags;
- if (high_async_reclaim) { - if (!is_high_async_reclaim(memcg)) - continue; - } else { - if (page_counter_read(&memcg->memory) <= - READ_ONCE(memcg->memory.high)) - continue; - } + if (page_counter_read(&memcg->memory) <= + READ_ONCE(memcg->memory.high)) + continue;
memcg_memory_event(memcg, MEMCG_HIGH);
@@ -2416,27 +2389,46 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, return nr_reclaimed; }
-static unsigned long get_reclaim_pages(struct mem_cgroup *memcg) +static bool is_high_async_reclaim(struct mem_cgroup *memcg) { - unsigned long nr_pages = page_counter_read(&memcg->memory); int ratio = READ_ONCE(memcg->high_async_ratio); - unsigned long safe_pages; + unsigned long memcg_high = READ_ONCE(memcg->memory.high);
- ratio = ratio < HIGH_ASYNC_RATIO_GAP ? 0 : ratio - HIGH_ASYNC_RATIO_GAP; - safe_pages = READ_ONCE(memcg->memory.high) * ratio / - HIGH_ASYNC_RATIO_BASE; + if (ratio == HIGH_ASYNC_RATIO_BASE || memcg_high == PAGE_COUNTER_MAX) + return false;
- return (nr_pages > safe_pages) ? (nr_pages - safe_pages) : - MEMCG_CHARGE_BATCH; + return page_counter_read(&memcg->memory) > + memcg_high * ratio / HIGH_ASYNC_RATIO_BASE; +} + +static void async_reclaim_high(struct mem_cgroup *memcg) +{ + unsigned long nr_pages, pflags; + unsigned long memcg_high = READ_ONCE(memcg->memory.high); + unsigned long memcg_usage = page_counter_read(&memcg->memory); + int ratio = READ_ONCE(memcg->high_async_ratio) - HIGH_ASYNC_RATIO_GAP; + unsigned long safe_pages = memcg_high * ratio / HIGH_ASYNC_RATIO_BASE; + + if (!is_high_async_reclaim(memcg)) { + WRITE_ONCE(memcg->high_async_reclaim, false); + return; + } + + psi_memstall_enter(&pflags); + nr_pages = memcg_usage > safe_pages ? memcg_usage - safe_pages : + MEMCG_CHARGE_BATCH; + try_to_free_mem_cgroup_pages(memcg, nr_pages, GFP_KERNEL, true); + psi_memstall_leave(&pflags); + WRITE_ONCE(memcg->high_async_reclaim, false); }
static void high_work_func(struct work_struct *work) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg = container_of(work, struct mem_cgroup, + high_work);
- memcg = container_of(work, struct mem_cgroup, high_work); - if (memcg->high_async_reclaim) - reclaim_high(memcg, get_reclaim_pages(memcg), GFP_KERNEL); + if (READ_ONCE(memcg->high_async_reclaim)) + async_reclaim_high(memcg); else reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL); } @@ -2825,9 +2817,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, continue; }
- if (is_high_async_reclaim(memcg)) { + if (is_high_async_reclaim(memcg) && !mem_high) { WRITE_ONCE(memcg->high_async_reclaim, true); schedule_work(&memcg->high_work); + break; }
if (mem_high || swap_high) { @@ -5579,11 +5572,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
page_counter_set_high(&memcg->memory, high);
- if (is_high_async_reclaim(memcg)) { - WRITE_ONCE(memcg->high_async_reclaim, true); - schedule_work(&memcg->high_work); - } - for (;;) { unsigned long nr_pages = page_counter_read(&memcg->memory); unsigned long reclaimed; @@ -5736,15 +5724,12 @@ static ssize_t memcg_high_async_ratio_write(struct kernfs_open_file *of, if (ret) return ret;
- if (high_async_ratio >= HIGH_ASYNC_RATIO_BASE || - high_async_ratio < HIGH_ASYNC_RATIO_DEFAULT) + if (high_async_ratio > HIGH_ASYNC_RATIO_BASE || + high_async_ratio <= HIGH_ASYNC_RATIO_GAP) return -EINVAL;
WRITE_ONCE(memcg->high_async_ratio, high_async_ratio);
- if (is_high_async_reclaim(memcg)) - schedule_work(&memcg->high_work); - return nbytes; }
@@ -6359,7 +6344,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX); memcg->soft_limit = PAGE_COUNTER_MAX; - memcg->high_async_ratio = HIGH_ASYNC_RATIO_DEFAULT; + memcg->high_async_ratio = HIGH_ASYNC_RATIO_BASE; page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); if (parent) { memcg->swappiness = mem_cgroup_swappiness(parent);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Changing struct psi_group directly would break kabi. Therefore, add a new struct psi_group_ext to hold the new fields, which will be added by the upcoming pressure.stat patches.
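The extension uses the usual kabi-safe container pattern: struct psi_group stays the first member, so the pointer a cgroup stores can be mapped back with container_of() (a minimal sketch of what the hunks below add):

  struct psi_group_ext {
          struct psi_group psi;   /* must stay the first member */
          /* pressure.stat state is added here by later patches */
  };

The system-wide &psi_system is not embedded in an ext struct, which is why to_psi_group_ext() special-cases it to the static psi_stat_system instead of using container_of().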
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
---
 include/linux/psi_types.h |  8 ++++++++
 init/Kconfig              | 10 ++++++++++
 kernel/sched/psi.c        | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 668f67b56464..d9b70144a7be 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -231,6 +231,14 @@ struct psi_group { u64 polling_until; };
+#ifdef CONFIG_PSI_FINE_GRAINED +struct psi_group_ext { + struct psi_group psi; +}; +#else +struct psi_group_ext { }; +#endif /* CONFIG_PSI_FINE_GRAINED */ + #else /* CONFIG_PSI */
struct psi_group { }; diff --git a/init/Kconfig b/init/Kconfig index c668fb4933f6..69bd400daeb3 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -663,6 +663,16 @@ config PSI_CGROUP_V1
Say N if unsure.
+config PSI_FINE_GRAINED + bool "Support fine grained psi under cgroup v1 and system" + default n + depends on PSI + help + If set, fine grained pressure stall information tracking will + be used for cgroup v1 and system, such as memory reclaim, + memory compact and so on. + Say N if unsure. + endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index f691702b5fc6..a1960e4cc7fd 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -192,6 +192,24 @@ struct psi_group psi_system = { .pcpu = &system_group_pcpu, };
+#ifdef CONFIG_PSI_FINE_GRAINED +/* System-level fine grained pressure and stall tracking */ +struct psi_group_ext psi_stat_system = { }; + +struct psi_group_ext *to_psi_group_ext(struct psi_group *psi) +{ + if (psi == &psi_system) + return &psi_stat_system; + else + return container_of(psi, struct psi_group_ext, psi); +} +#else +static inline struct psi_group_ext *to_psi_group_ext(struct psi_group *psi) +{ + return NULL; +} +#endif + static void psi_avgs_work(struct work_struct *work);
static void poll_timer_fn(struct timer_list *t); @@ -1027,16 +1045,31 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { +#ifdef CONFIG_PSI_FINE_GRAINED + struct psi_group_ext *psi_ext; +#endif + if (static_branch_likely(&psi_disabled)) return 0;
+#ifdef CONFIG_PSI_FINE_GRAINED + psi_ext = kzalloc(sizeof(struct psi_group_ext), GFP_KERNEL); + if (!psi_ext) + return -ENOMEM; + cgroup->psi = &psi_ext->psi; +#else cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL); if (!cgroup->psi) return -ENOMEM;
+#endif cgroup->psi->pcpu = alloc_percpu(struct psi_group_cpu); if (!cgroup->psi->pcpu) { +#ifdef CONFIG_PSI_FINE_GRAINED + kfree(psi_ext); +#else kfree(cgroup->psi); +#endif return -ENOMEM; } group_init(cgroup->psi); @@ -1052,7 +1085,11 @@ void psi_cgroup_free(struct cgroup *cgroup) free_percpu(cgroup->psi->pcpu); /* All triggers must be removed by now */ WARN_ONCE(cgroup->psi->poll_states, "psi: trigger leak\n"); +#ifdef CONFIG_PSI_FINE_GRAINED + kfree(to_psi_group_ext(cgroup->psi)); +#else kfree(cgroup->psi); +#endif }
/**
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
PSI tracks pressure stalls for memory, cpu, io and irq. However, several different causes can produce memory pressure, and memory.pressure cannot show which cause is responsible; the same holds for cpu.pressure. Introduce pressure.stat in psi to report the specific sources behind memory.pressure and cpu.pressure, such as global/cgroup memory reclaim, memory compaction, cpu cfs bandwidth and so on. Userland can then pick the right remedy based on the specific cause of the pressure. This patch introduces fine grained stall time collection for cgroup memory reclaim.
Signed-off-by: Lu Jialin lujialin4@huawei.com --- include/linux/psi_types.h | 34 +++++++++ include/linux/sched.h | 4 + kernel/sched/psi.c | 157 ++++++++++++++++++++++++++++++++++++-- mm/memcontrol.c | 10 ++- 4 files changed, 198 insertions(+), 7 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index d9b70144a7be..18c6ad011d93 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -232,8 +232,29 @@ struct psi_group { };
#ifdef CONFIG_PSI_FINE_GRAINED + +enum psi_stat_states { + PSI_MEMCG_RECLAIM_SOME, + PSI_MEMCG_RECLAIM_FULL, + NR_PSI_STAT_STATES, +}; + +enum psi_stat_task_count { + NR_MEMCG_RECLAIM, + NR_MEMCG_RECLAIM_RUNNING, + NR_PSI_STAT_TASK_COUNTS, +}; + +struct psi_group_stat_cpu { + u32 state_mask; + u32 times[NR_PSI_STAT_STATES]; + u32 psi_delta; + unsigned int tasks[NR_PSI_STAT_TASK_COUNTS]; +}; + struct psi_group_ext { struct psi_group psi; + struct psi_group_stat_cpu __percpu *pcpu; }; #else struct psi_group_ext { }; @@ -245,4 +266,17 @@ struct psi_group { };
#endif /* CONFIG_PSI */
+/* + * Each memstall type needs two task counters: one for memstall threads + * and one for regular running threads, for the same reason as + * NR_MEMSTALL_RUNNING. Because psi_memstall_type starts at 1, the + * correspondence between psi_memstall_type and psi_stat_task_count is: + * + * memstall : psi_memstall_type * 2 - 2; + * running : psi_memstall_type * 2 - 1; + */ +enum psi_memstall_type { + PSI_MEMCG_RECLAIM = 1, +}; + #endif /* _LINUX_PSI_TYPES_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index a70603f9d742..579e47c22980 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1459,7 +1459,11 @@ struct task_struct { #else KABI_RESERVE(12) #endif +#ifdef CONFIG_PSI_FINE_GRAINED + KABI_USE(13, int memstall_type) +#else KABI_RESERVE(13) +#endif KABI_RESERVE(14) KABI_RESERVE(15) KABI_RESERVE(16) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index a1960e4cc7fd..33e7b12475e8 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -194,7 +194,10 @@ struct psi_group psi_system = {
#ifdef CONFIG_PSI_FINE_GRAINED /* System-level fine grained pressure and stall tracking */ -struct psi_group_ext psi_stat_system = { }; +static DEFINE_PER_CPU(struct psi_group_stat_cpu, system_stat_group_pcpu); +struct psi_group_ext psi_stat_system = { + .pcpu = &system_stat_group_pcpu, +};
struct psi_group_ext *to_psi_group_ext(struct psi_group *psi) { @@ -334,6 +337,109 @@ static void calc_avgs(unsigned long avg[3], int missed_periods, avg[2] = calc_load(avg[2], EXP_300s, pct); }
+#ifdef CONFIG_PSI_FINE_GRAINED + +static void record_stat_times(struct psi_group_ext *psi_ext, int cpu) +{ + struct psi_group_stat_cpu *ext_grpc = per_cpu_ptr(psi_ext->pcpu, cpu); + + u32 delta = ext_grpc->psi_delta; + + if (ext_grpc->state_mask & (1 << PSI_MEMCG_RECLAIM_SOME)) { + ext_grpc->times[PSI_MEMCG_RECLAIM_SOME] += delta; + if (ext_grpc->state_mask & (1 << PSI_MEMCG_RECLAIM_FULL)) + ext_grpc->times[PSI_MEMCG_RECLAIM_FULL] += delta; + } +} + +static bool test_fine_grained_stat(unsigned int *stat_tasks, + unsigned int nr_running, + enum psi_stat_states state) +{ + switch (state) { + case PSI_MEMCG_RECLAIM_SOME: + return unlikely(stat_tasks[NR_MEMCG_RECLAIM]); + case PSI_MEMCG_RECLAIM_FULL: + return unlikely(stat_tasks[NR_MEMCG_RECLAIM] && + nr_running == stat_tasks[NR_MEMCG_RECLAIM_RUNNING]); + default: + return false; + } +} + +static void psi_group_stat_change(struct psi_group *group, int cpu, + int clear, int set) +{ + int t; + u32 state_mask = 0; + enum psi_stat_states s; + struct psi_group_ext *psi_ext = to_psi_group_ext(group); + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); + struct psi_group_stat_cpu *ext_groupc = per_cpu_ptr(psi_ext->pcpu, cpu); + + write_seqcount_begin(&groupc->seq); + record_stat_times(psi_ext, cpu); + + for (t = 0; clear; clear &= ~(1 << t), t++) + if (clear & (1 << t)) + ext_groupc->tasks[t]--; + for (t = 0; set; set &= ~(1 << t), t++) + if (set & (1 << t)) + ext_groupc->tasks[t]++; + for (s = 0; s < NR_PSI_STAT_STATES; s++) + if (test_fine_grained_stat(ext_groupc->tasks, + groupc->tasks[NR_RUNNING], s)) + state_mask |= (1 << s); + if (unlikely(groupc->state_mask & PSI_ONCPU) && + cpu_curr(cpu)->memstall_type) + state_mask |= (1 << (cpu_curr(cpu)->memstall_type * 2 - 1)); + + ext_groupc->state_mask = state_mask; + write_seqcount_end(&groupc->seq); +} + +static void update_psi_stat_delta(struct psi_group *group, int cpu, u64 now) +{ + struct psi_group_ext *psi_ext = to_psi_group_ext(group); + struct psi_group_stat_cpu *ext_groupc = per_cpu_ptr(psi_ext->pcpu, cpu); + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); + + ext_groupc->psi_delta = now - groupc->state_start; +} + +static void psi_stat_flags_change(struct task_struct *task, int *stat_set, + int *stat_clear, int set, int clear) +{ + if (!task->memstall_type) + return; + + if (clear) { + if (clear & TSK_MEMSTALL) + *stat_clear |= 1 << (2 * task->memstall_type - 2); + if (clear & TSK_MEMSTALL_RUNNING) + *stat_clear |= 1 << (2 * task->memstall_type - 1); + } + if (set) { + if (set & TSK_MEMSTALL) + *stat_set |= 1 << (2 * task->memstall_type - 2); + if (set & TSK_MEMSTALL_RUNNING) + *stat_set |= 1 << (2 * task->memstall_type - 1); + } + if (!task->in_memstall) + task->memstall_type = 0; +} + +#else +static inline void psi_group_stat_change(struct psi_group *group, int cpu, + int clear, int set) {} +static inline void update_psi_stat_delta(struct psi_group *group, int cpu, + u64 now) {} +static inline void psi_stat_flags_change(struct task_struct *task, + int *stat_set, int *stat_clear, + int set, int clear) {} +static inline void record_stat_times(struct psi_group_ext *psi_ext, int cpu) {} +#endif + static void collect_percpu_times(struct psi_group *group, enum psi_aggregators aggregator, u32 *pchanged_states) @@ -857,16 +963,22 @@ void psi_task_change(struct task_struct *task, int clear, int set) struct psi_group *group; void *iter = NULL; u64 now; + int stat_set = 0; + int stat_clear = 0;
if (!task->pid) return;
psi_flags_change(task, clear, set); + psi_stat_flags_change(task, &stat_set, &stat_clear, set, clear);
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter))) + while ((group = iterate_groups(task, &iter))) { + update_psi_stat_delta(group, cpu, now); psi_group_change(group, cpu, clear, set, now, true); + psi_group_stat_change(group, cpu, stat_clear, stat_set); + } }
void psi_task_switch(struct task_struct *prev, struct task_struct *next, @@ -892,13 +1004,18 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, break; }
+ update_psi_stat_delta(group, cpu, now); psi_group_change(group, cpu, 0, TSK_ONCPU, now, true); + psi_group_stat_change(group, cpu, 0, 0); } }
if (prev->pid) { int clear = TSK_ONCPU, set = 0; bool wake_clock = true; + int stat_set = 0; + int stat_clear = 0; + bool memstall_type_change = false;
/* * When we're going to sleep, psi_dequeue() lets us @@ -925,21 +1042,33 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, }
psi_flags_change(prev, clear, set); + psi_stat_flags_change(prev, &stat_set, &stat_clear, set, clear);
iter = NULL; - while ((group = iterate_groups(prev, &iter)) && group != common) + while ((group = iterate_groups(prev, &iter)) && group != common) { + update_psi_stat_delta(group, cpu, now); psi_group_change(group, cpu, clear, set, now, wake_clock); - + psi_group_stat_change(group, cpu, stat_clear, stat_set); + } +#ifdef CONFIG_PSI_FINE_GRAINED + if (next->memstall_type != prev->memstall_type) + memstall_type_change = true; +#endif /* * TSK_ONCPU is handled up to the common ancestor. If there are * any other differences between the two tasks (e.g. prev goes * to sleep, or only one task is memstall), finish propagating * those differences all the way up to the root. */ - if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { + if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU || + memstall_type_change) { clear &= ~TSK_ONCPU; - for (; group; group = iterate_groups(prev, &iter)) + for (; group; group = iterate_groups(prev, &iter)) { + update_psi_stat_delta(group, cpu, now); psi_group_change(group, cpu, clear, set, now, wake_clock); + psi_group_stat_change(group, cpu, stat_clear, + stat_set); + } } } } @@ -966,6 +1095,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
write_seqcount_begin(&groupc->seq);
+ update_psi_stat_delta(group, cpu, now); + record_stat_times(to_psi_group_ext(group), cpu); record_times(groupc, now); groupc->times[PSI_IRQ_FULL] += delta;
@@ -988,6 +1119,9 @@ void psi_memstall_enter(unsigned long *flags) { struct rq_flags rf; struct rq *rq; +#ifdef CONFIG_PSI_FINE_GRAINED + unsigned long stat_flags = *flags; +#endif
if (static_branch_likely(&psi_disabled)) return; @@ -1005,6 +1139,10 @@ void psi_memstall_enter(unsigned long *flags) rq = this_rq_lock_irq(&rf);
current->in_memstall = 1; +#ifdef CONFIG_PSI_FINE_GRAINED + if (stat_flags) + current->memstall_type = stat_flags; +#endif psi_task_change(current, 0, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
rq_unlock_irq(rq, &rf); @@ -1056,6 +1194,11 @@ int psi_cgroup_alloc(struct cgroup *cgroup) psi_ext = kzalloc(sizeof(struct psi_group_ext), GFP_KERNEL); if (!psi_ext) return -ENOMEM; + psi_ext->pcpu = alloc_percpu(struct psi_group_stat_cpu); + if (!psi_ext->pcpu) { + kfree(psi_ext); + return -ENOMEM; + } cgroup->psi = &psi_ext->psi; #else cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL); @@ -1066,6 +1209,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup) cgroup->psi->pcpu = alloc_percpu(struct psi_group_cpu); if (!cgroup->psi->pcpu) { #ifdef CONFIG_PSI_FINE_GRAINED + free_percpu(psi_ext->pcpu); kfree(psi_ext); #else kfree(cgroup->psi); @@ -1086,6 +1230,7 @@ void psi_cgroup_free(struct cgroup *cgroup) /* All triggers must be removed by now */ WARN_ONCE(cgroup->psi->poll_states, "psi: trigger leak\n"); #ifdef CONFIG_PSI_FINE_GRAINED + free_percpu(to_psi_group_ext(cgroup->psi)->pcpu); kfree(to_psi_group_ext(cgroup->psi)); #else kfree(cgroup->psi); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5009ada24d4c..dfa7f5032ae8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2378,6 +2378,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
memcg_memory_event(memcg, MEMCG_HIGH);
+#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_MEMCG_RECLAIM; +#endif psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, @@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(void) * schedule_timeout_killable sets TASK_KILLABLE). This means we don't * need to account for any ill-begotten jiffies to pay them off later. */ +#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_MEMCG_RECLAIM; +#endif psi_memstall_enter(&pflags); schedule_timeout_killable(penalty_jiffies); psi_memstall_leave(&pflags); @@ -2715,7 +2721,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, goto nomem;
memcg_memory_event(mem_over_limit, MEMCG_MAX); - +#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_MEMCG_RECLAIM; +#endif psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, gfp_mask, reclaim_options);
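The mm/memcontrol.c hunks above all follow the same caller-side convention, sketched below as a kernel-context fragment; try_to_free_mem_cgroup_pages() stands in for whatever stalling work a given caller performs:

static void example_memcg_reclaim_path(void)
{
	unsigned long pflags = 0;

#ifdef CONFIG_PSI_FINE_GRAINED
	/* Tag the stall reason; psi_memstall_enter() reads *flags first. */
	pflags = PSI_MEMCG_RECLAIM;
#endif
	psi_memstall_enter(&pflags);
	/* ... stalling work, e.g. try_to_free_mem_cgroup_pages() ... */
	psi_memstall_leave(&pflags);
}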
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Introduce avgs and total calculation for cgroup reclaim in psi_avgs_work(), based on the fine grained stall times collected by the previous patch. The results will be shown in pressure.stat, which is added in the next patch.
Signed-off-by: Lu Jialin lujialin4@huawei.com --- include/linux/psi_types.h | 7 +++++ kernel/sched/psi.c | 61 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 68 insertions(+)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 18c6ad011d93..92212cbe2f25 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -250,11 +250,18 @@ struct psi_group_stat_cpu { u32 times[NR_PSI_STAT_STATES]; u32 psi_delta; unsigned int tasks[NR_PSI_STAT_TASK_COUNTS]; + u32 times_delta; + u32 times_prev[NR_PSI_AGGREGATORS][NR_PSI_STAT_STATES]; };
struct psi_group_ext { struct psi_group psi; struct psi_group_stat_cpu __percpu *pcpu; + /* Running fine grained pressure averages */ + u64 avg_total[NR_PSI_STAT_STATES]; + /* Total fine grained stall times and sampled pressure averages */ + u64 total[NR_PSI_AGGREGATORS][NR_PSI_STAT_STATES]; + unsigned long avg[NR_PSI_STAT_STATES][3]; }; #else struct psi_group_ext { }; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 33e7b12475e8..75cb2bf68765 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -276,6 +276,10 @@ static void get_recent_times(struct psi_group *group, int cpu, enum psi_aggregators aggregator, u32 *times, u32 *pchanged_states) { +#ifdef CONFIG_PSI_FINE_GRAINED + struct psi_group_ext *psi_ext = to_psi_group_ext(group); + struct psi_group_stat_cpu *ext_groupc = per_cpu_ptr(psi_ext->pcpu, cpu); +#endif struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); u64 now, state_start; enum psi_states s; @@ -315,6 +319,9 @@ static void get_recent_times(struct psi_group *group, int cpu, if (delta) *pchanged_states |= (1 << s); } +#ifdef CONFIG_PSI_FINE_GRAINED + ext_groupc->times_delta = now - state_start; +#endif }
static void calc_avgs(unsigned long avg[3], int missed_periods, @@ -429,6 +436,39 @@ static void psi_stat_flags_change(struct task_struct *task, int *stat_set, task->memstall_type = 0; }
+static void get_recent_stat_times(struct psi_group *group, int cpu, + enum psi_aggregators aggregator, u32 *times) +{ + struct psi_group_ext *psi_ext = to_psi_group_ext(group); + struct psi_group_stat_cpu *ext_groupc = per_cpu_ptr(psi_ext->pcpu, cpu); + enum psi_stat_states s; + u32 delta; + + memcpy(times, ext_groupc->times, sizeof(ext_groupc->times)); + for (s = 0; s < NR_PSI_STAT_STATES; s++) { + if (ext_groupc->state_mask & (1 << s)) + times[s] += ext_groupc->times_delta; + delta = times[s] - ext_groupc->times_prev[aggregator][s]; + ext_groupc->times_prev[aggregator][s] = times[s]; + times[s] = delta; + } +} + +static void update_stat_averages(struct psi_group_ext *psi_ext, + unsigned long missed_periods, u64 period) +{ + int s; + + for (s = 0; s < NR_PSI_STAT_STATES; s++) { + u32 sample; + + sample = psi_ext->total[PSI_AVGS][s] - psi_ext->avg_total[s]; + if (sample > period) + sample = period; + psi_ext->avg_total[s] += sample; + calc_avgs(psi_ext->avg[s], missed_periods, sample, period); + } +} #else static inline void psi_group_stat_change(struct psi_group *group, int cpu, int clear, int set) {} @@ -438,12 +478,20 @@ static inline void psi_stat_flags_change(struct task_struct *task, int *stat_set, int *stat_clear, int set, int clear) {} static inline void record_stat_times(struct psi_group_ext *psi_ext, int cpu) {} +static inline void update_stat_averages(struct psi_group_ext *psi_ext, + unsigned long missed_periods, + u64 period) {} #endif
static void collect_percpu_times(struct psi_group *group, enum psi_aggregators aggregator, u32 *pchanged_states) { +#ifdef CONFIG_PSI_FINE_GRAINED + u64 stat_delta[NR_PSI_STAT_STATES] = { 0 }; + u32 stat_times[NR_PSI_STAT_STATES] = { 0 }; + struct psi_group_ext *psi_ext = to_psi_group_ext(group); +#endif u64 deltas[NR_PSI_STATES - 1] = { 0, }; unsigned long nonidle_total = 0; u32 changed_states = 0; @@ -472,6 +520,11 @@ static void collect_percpu_times(struct psi_group *group,
for (s = 0; s < PSI_NONIDLE; s++) deltas[s] += (u64)times[s] * nonidle; +#ifdef CONFIG_PSI_FINE_GRAINED + get_recent_stat_times(group, cpu, aggregator, stat_times); + for (s = 0; s < NR_PSI_STAT_STATES; s++) + stat_delta[s] += (u64)stat_times[s] * nonidle; +#endif }
/* @@ -491,12 +544,19 @@ static void collect_percpu_times(struct psi_group *group, group->total[aggregator][s] += div_u64(deltas[s], max(nonidle_total, 1UL));
+#ifdef CONFIG_PSI_FINE_GRAINED + for (s = 0; s < NR_PSI_STAT_STATES; s++) + psi_ext->total[aggregator][s] += + div_u64(stat_delta[s], max(nonidle_total, 1UL)); +#endif + if (pchanged_states) *pchanged_states = changed_states; }
static u64 update_averages(struct psi_group *group, u64 now) { + struct psi_group_ext *psi_ext = to_psi_group_ext(group); unsigned long missed_periods = 0; u64 expires, period; u64 avg_next_update; @@ -545,6 +605,7 @@ static u64 update_averages(struct psi_group *group, u64 now) calc_avgs(group->avg[s], missed_periods, sample, period); }
+ update_stat_averages(psi_ext, missed_periods, period); return avg_next_update; }
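As a rough model of one aggregation step: update_stat_averages() clamps the per-period stall sample to the period length before feeding it to calc_avgs(), which converts it to a percentage and smooths it into the 10s/60s/300s averages. A simplified sketch of the clamp-and-percentage part, with the fixed-point load-average smoothing omitted (illustrative only):

typedef unsigned long long u64;

static unsigned long stat_sample_pct(u64 sample_ns, u64 period_ns)
{
	if (sample_ns > period_ns)	/* same clamp as update_stat_averages() */
		sample_ns = period_ns;
	return (unsigned long)(sample_ns * 100 / period_ns);
}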
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Introduce pressure.stat in psi for cgroup v1 and the system level, which shows the fine grained stall time tracking for cgroup memory reclaim.
for example:
/test # cat /tmp/cpuacct/test/pressure.stat cgroup_memory_reclaim some avg10=45.78 avg60=10.40 avg300=2.26 total=13491160 full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Signed-off-by: Lu Jialin lujialin4@huawei.com --- include/linux/psi.h | 4 +++ kernel/cgroup/cgroup.c | 17 ++++++++++++ kernel/sched/psi.c | 62 +++++++++++++++++++++++++++++++++++++++++- 3 files changed, 82 insertions(+), 1 deletion(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h index 40cfbf0bf831..55bb63a4fd65 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -37,6 +37,10 @@ void psi_trigger_destroy(struct psi_trigger *t); __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, poll_table *wait);
+#ifdef CONFIG_PSI_FINE_GRAINED +int psi_stat_show(struct seq_file *s, struct psi_group *group); +#endif + #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgrp); void psi_cgroup_free(struct cgroup *cgrp); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 3f340fc30abc..ebd4384a2697 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3784,6 +3784,16 @@ static void cgroup_pressure_release(struct kernfs_open_file *of) }
#ifdef CONFIG_PSI_CGROUP_V1 +#ifdef CONFIG_PSI_FINE_GRAINED +static int cgroup_psi_stat_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp = seq_css(seq)->cgroup; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi; + + return psi_stat_show(seq, psi); +} +#endif + struct cftype cgroup_v1_psi_files[] = { { .name = "io.pressure", @@ -3818,6 +3828,13 @@ struct cftype cgroup_v1_psi_files[] = { .poll = cgroup_pressure_poll, .release = cgroup_pressure_release, }, +#endif +#ifdef CONFIG_PSI_FINE_GRAINED + { + .name = "pressure.stat", + .flags = CFTYPE_NO_PREFIX, + .seq_show = cgroup_psi_stat_show, + }, #endif { } /* terminate */ }; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 75cb2bf68765..64b81a1a361f 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1718,6 +1718,63 @@ static const struct proc_ops psi_cpu_proc_ops = { .proc_release = psi_fop_release, };
+#ifdef CONFIG_PSI_FINE_GRAINED +static const char *const psi_stat_names[] = { + "cgroup_memory_reclaim", +}; + +int psi_stat_show(struct seq_file *m, struct psi_group *group) +{ + struct psi_group_ext *psi_ext; + unsigned long avg[3] = {0, }; + int i, w; + bool is_full; + u64 now, total; + + if (static_branch_likely(&psi_disabled)) + return -EOPNOTSUPP; + + psi_ext = to_psi_group_ext(group); + mutex_lock(&group->avgs_lock); + now = sched_clock(); + collect_percpu_times(group, PSI_AVGS, NULL); + if (now >= group->avg_next_update) + group->avg_next_update = update_averages(group, now); + mutex_unlock(&group->avgs_lock); + for (i = 0; i < NR_PSI_STAT_STATES; i++) { + is_full = i % 2; + for (w = 0; w < 3; w++) + avg[w] = psi_ext->avg[i][w]; + total = div_u64(psi_ext->total[PSI_AVGS][i], NSEC_PER_USEC); + if (!is_full) + seq_printf(m, "%s\n", psi_stat_names[i / 2]); + seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n", + is_full ? "full" : "some", + LOAD_INT(avg[0]), LOAD_FRAC(avg[0]), + LOAD_INT(avg[1]), LOAD_FRAC(avg[1]), + LOAD_INT(avg[2]), LOAD_FRAC(avg[2]), + total); + } + return 0; +} +static int system_psi_stat_show(struct seq_file *m, void *v) +{ + return psi_stat_show(m, &psi_system); +} + +static int psi_stat_open(struct inode *inode, struct file *file) +{ + return single_open(file, system_psi_stat_show, NULL); +} + +static const struct proc_ops psi_stat_proc_ops = { + .proc_open = psi_stat_open, + .proc_read = seq_read, + .proc_lseek = seq_lseek, + .proc_release = psi_fop_release, +}; +#endif + #ifdef CONFIG_IRQ_TIME_ACCOUNTING static int psi_irq_show(struct seq_file *m, void *v) { @@ -1754,8 +1811,11 @@ static int __init psi_proc_init(void) proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops); #ifdef CONFIG_IRQ_TIME_ACCOUNTING proc_create("pressure/irq", 0, NULL, &psi_irq_proc_ops); - } #endif +#ifdef CONFIG_PSI_FINE_GRAINED + proc_create("pressure/stat", 0, NULL, &psi_stat_proc_ops); +#endif + } return 0; } module_init(psi_proc_init);
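A minimal userspace consumer of the new system-level file could look like the following (hypothetical example; assumes CONFIG_PSI_FINE_GRAINED=y so that the /proc/pressure/stat file created above exists):

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/pressure/stat", "r");

	if (!f) {
		perror("fopen /proc/pressure/stat");
		return 1;
	}
	/* Dump the "some"/"full" lines exactly as the kernel formats them. */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}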
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Introduce more memory fine grained stall tracking in pressure.stat: global memory reclaim, memory compaction, async cgroup memory reclaim and swap.
Signed-off-by: Lu Jialin lujialin4@huawei.com --- block/blk-cgroup.c | 2 +- block/blk-core.c | 2 +- include/linux/psi_types.h | 20 ++++++++++++++++++ kernel/sched/psi.c | 44 +++++++++++++++++++++++++++++++++++++++ mm/compaction.c | 2 +- mm/filemap.c | 4 ++-- mm/memcontrol.c | 3 +++ mm/page_alloc.c | 6 ++++++ mm/page_io.c | 3 +++ mm/vmscan.c | 5 ++++- 10 files changed, 85 insertions(+), 6 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 92ce202bd8e5..1f2c93e9daa1 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1689,7 +1689,7 @@ static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 now) */ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay) { - unsigned long pflags; + unsigned long pflags = 0; bool clamp; u64 now = ktime_to_ns(ktime_get()); u64 exp; diff --git a/block/blk-core.c b/block/blk-core.c index 01f0782668ce..71d60ec24a8a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1116,7 +1116,7 @@ blk_qc_t submit_bio(struct bio *bio) */ if (unlikely(bio_op(bio) == REQ_OP_READ && bio_flagged(bio, BIO_WORKINGSET))) { - unsigned long pflags; + unsigned long pflags = 0; blk_qc_t ret;
psi_memstall_enter(&pflags); diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 92212cbe2f25..83be3fada1e7 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -236,12 +236,28 @@ struct psi_group { enum psi_stat_states { PSI_MEMCG_RECLAIM_SOME, PSI_MEMCG_RECLAIM_FULL, + PSI_GLOBAL_RECLAIM_SOME, + PSI_GLOBAL_RECLAIM_FULL, + PSI_COMPACT_SOME, + PSI_COMPACT_FULL, + PSI_ASYNC_MEMCG_RECLAIM_SOME, + PSI_ASYNC_MEMCG_RECLAIM_FULL, + PSI_SWAP_SOME, + PSI_SWAP_FULL, NR_PSI_STAT_STATES, };
enum psi_stat_task_count { NR_MEMCG_RECLAIM, NR_MEMCG_RECLAIM_RUNNING, + NR_GLOBAL_RECLAIM, + NR_GLOBAL_RECLAIM_RUNNING, + NR_COMPACT, + NR_COMPACT_RUNNING, + NR_ASYNC_MEMCG_RECLAIM, + NR_ASYNC_MEMCG_RECLAIM_RUNNING, + NR_SWAP, + NR_SWAP_RUNNING, NR_PSI_STAT_TASK_COUNTS, };
@@ -284,6 +300,10 @@ struct psi_group { }; */ enum psi_memstall_type { PSI_MEMCG_RECLAIM = 1, + PSI_GLOBAL_RECLAIM, + PSI_COMPACT, + PSI_ASYNC_MEMCG_RECLAIM, + PSI_SWAP, };
#endif /* _LINUX_PSI_TYPES_H */ diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 64b81a1a361f..227ea1676cd1 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -357,6 +357,26 @@ static void record_stat_times(struct psi_group_ext *psi_ext, int cpu) if (ext_grpc->state_mask & (1 << PSI_MEMCG_RECLAIM_FULL)) ext_grpc->times[PSI_MEMCG_RECLAIM_FULL] += delta; } + if (ext_grpc->state_mask & (1 << PSI_GLOBAL_RECLAIM_SOME)) { + ext_grpc->times[PSI_GLOBAL_RECLAIM_SOME] += delta; + if (ext_grpc->state_mask & (1 << PSI_GLOBAL_RECLAIM_FULL)) + ext_grpc->times[PSI_GLOBAL_RECLAIM_FULL] += delta; + } + if (ext_grpc->state_mask & (1 << PSI_COMPACT_SOME)) { + ext_grpc->times[PSI_COMPACT_SOME] += delta; + if (ext_grpc->state_mask & (1 << PSI_COMPACT_FULL)) + ext_grpc->times[PSI_COMPACT_FULL] += delta; + } + if (ext_grpc->state_mask & (1 << PSI_ASYNC_MEMCG_RECLAIM_SOME)) { + ext_grpc->times[PSI_ASYNC_MEMCG_RECLAIM_SOME] += delta; + if (ext_grpc->state_mask & (1 << PSI_ASYNC_MEMCG_RECLAIM_FULL)) + ext_grpc->times[PSI_ASYNC_MEMCG_RECLAIM_FULL] += delta; + } + if (ext_grpc->state_mask & (1 << PSI_SWAP_SOME)) { + ext_grpc->times[PSI_SWAP_SOME] += delta; + if (ext_grpc->state_mask & (1 << PSI_SWAP_FULL)) + ext_grpc->times[PSI_SWAP_FULL] += delta; + } }
static bool test_fine_grained_stat(unsigned int *stat_tasks, @@ -369,6 +389,26 @@ static bool test_fine_grained_stat(unsigned int *stat_tasks, case PSI_MEMCG_RECLAIM_FULL: return unlikely(stat_tasks[NR_MEMCG_RECLAIM] && nr_running == stat_tasks[NR_MEMCG_RECLAIM_RUNNING]); + case PSI_GLOBAL_RECLAIM_SOME: + return unlikely(stat_tasks[NR_GLOBAL_RECLAIM]); + case PSI_GLOBAL_RECLAIM_FULL: + return unlikely(stat_tasks[NR_GLOBAL_RECLAIM] && + nr_running == stat_tasks[NR_GLOBAL_RECLAIM_RUNNING]); + case PSI_COMPACT_SOME: + return unlikely(stat_tasks[NR_COMPACT]); + case PSI_COMPACT_FULL: + return unlikely(stat_tasks[NR_COMPACT] && + nr_running == stat_tasks[NR_COMPACT_RUNNING]); + case PSI_ASYNC_MEMCG_RECLAIM_SOME: + return unlikely(stat_tasks[NR_ASYNC_MEMCG_RECLAIM]); + case PSI_ASYNC_MEMCG_RECLAIM_FULL: + return unlikely(stat_tasks[NR_ASYNC_MEMCG_RECLAIM] && + nr_running == stat_tasks[NR_ASYNC_MEMCG_RECLAIM_RUNNING]); + case PSI_SWAP_SOME: + return unlikely(stat_tasks[NR_SWAP]); + case PSI_SWAP_FULL: + return unlikely(stat_tasks[NR_SWAP] && + nr_running == stat_tasks[NR_SWAP_RUNNING]); default: return false; } @@ -1721,6 +1761,10 @@ static const struct proc_ops psi_cpu_proc_ops = { #ifdef CONFIG_PSI_FINE_GRAINED static const char *const psi_stat_names[] = { "cgroup_memory_reclaim", + "global_memory_reclaim", + "compact", + "cgroup_async_memory_reclaim", + "swap", };
int psi_stat_show(struct seq_file *m, struct psi_group *group) diff --git a/mm/compaction.c b/mm/compaction.c index a193af836ee6..bdcde6ea7f97 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2852,7 +2852,7 @@ static int kcompactd(void *p) pgdat->kcompactd_highest_zoneidx = pgdat->nr_zones - 1;
while (!kthread_should_stop()) { - unsigned long pflags; + unsigned long pflags = 0;
trace_mm_compaction_kcompactd_sleep(pgdat->node_id); if (wait_event_freezable_timeout(pgdat->kcompactd_wait, diff --git a/mm/filemap.c b/mm/filemap.c index fd4aae06ff15..04e4aad7ed67 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1194,7 +1194,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, wait_queue_entry_t *wait = &wait_page.wait; bool thrashing = false; bool delayacct = false; - unsigned long pflags; + unsigned long pflags = 0;
if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) { @@ -1351,7 +1351,7 @@ void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep, wait_queue_entry_t *wait = &wait_page.wait; bool thrashing = false; bool delayacct = false; - unsigned long pflags; + unsigned long pflags = 0; wait_queue_head_t *q; struct page *page = compound_head(migration_entry_to_page(entry));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dfa7f5032ae8..09cd8c8535fd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2417,6 +2417,9 @@ static void async_reclaim_high(struct mem_cgroup *memcg) return; }
+#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_ASYNC_MEMCG_RECLAIM; +#endif psi_memstall_enter(&pflags); nr_pages = memcg_usage > safe_pages ? memcg_usage - safe_pages : MEMCG_CHARGE_BATCH; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f21365c92a98..d2a8ec193151 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4178,6 +4178,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, if (!order) return NULL;
+#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_COMPACT; +#endif psi_memstall_enter(&pflags); noreclaim_flag = memalloc_noreclaim_save();
@@ -4447,6 +4450,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, unsigned long pflags; bool drained = false;
+#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_GLOBAL_RECLAIM; +#endif psi_memstall_enter(&pflags); *did_some_progress = __perform_reclaim(gfp_mask, order, ac); if (unlikely(!(*did_some_progress))) diff --git a/mm/page_io.c b/mm/page_io.c index ee28c39e566e..78de95b9ef5a 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -341,6 +341,9 @@ int swap_readpage(struct page *page, bool synchronous) * or the submitting cgroup IO-throttled, submission can be a * significant part of overall IO time. */ +#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_SWAP; +#endif psi_memstall_enter(&pflags);
if (frontswap_load(page) == 0) { diff --git a/mm/vmscan.c b/mm/vmscan.c index dbd0757dd5a1..3d383c7126e3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3802,7 +3802,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) int i; unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; - unsigned long pflags; + unsigned long pflags = 0; unsigned long nr_boost_reclaim; unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; bool boosted; @@ -4448,6 +4448,9 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in sc.gfp_mask);
cond_resched(); +#ifdef CONFIG_PSI_FINE_GRAINED + pflags = PSI_GLOBAL_RECLAIM; +#endif psi_memstall_enter(&pflags); fs_reclaim_acquire(sc.gfp_mask);
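Note that the scattered `unsigned long pflags = 0;` initializations above are required, not cosmetic: with CONFIG_PSI_FINE_GRAINED, psi_memstall_enter() reads *flags as the stall reason before reusing it for its usual nesting bookkeeping, so untagged callers must pass 0 rather than uninitialized stack contents. The resulting calling convention, as a kernel-context sketch:

static void example_untagged_stall(void)
{
	unsigned long pflags = 0;	/* 0 = untagged; otherwise a psi_memstall_type */

	psi_memstall_enter(&pflags);	/* consumes the tag, then reuses pflags */
	/* ... stalling section ... */
	psi_memstall_leave(&pflags);
}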
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Introduce cpu fine grained stall tracking (cpu cfs bandwidth and cpu qos) in pressure.stat. For cpu fine grained stall tracking, only the "full" state is reported in pressure.stat.
for example:
/test # cat /tmp/cpuacct/test/pressure.stat cgroup_memory_reclaim some avg10=0.00 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=0.00 total=0 global_memory_reclaim some avg10=0.00 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=0.00 total=0 compact some avg10=0.00 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=0.00 total=0 cgroup_async_memory_reclaim some avg10=0.00 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=0.00 total=0 swap some avg10=0.00 avg60=0.00 avg300=0.00 total=0 full avg10=0.00 avg60=0.00 avg300=0.00 total=0 cpu_cfs_bandwidth full avg10=21.76 avg60=4.58 avg300=0.98 total=3893827 cpu_qos full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Signed-off-by: Lu Jialin lujialin4@huawei.com --- include/linux/psi_types.h | 8 +++++ kernel/sched/fair.c | 6 ---- kernel/sched/psi.c | 71 ++++++++++++++++++++++++++++++++++++--- kernel/sched/stats.h | 8 +++++ 4 files changed, 83 insertions(+), 10 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 83be3fada1e7..249bdde9694c 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -244,6 +244,10 @@ enum psi_stat_states { PSI_ASYNC_MEMCG_RECLAIM_FULL, PSI_SWAP_SOME, PSI_SWAP_FULL, + PSI_CPU_CFS_BANDWIDTH_FULL, +#ifdef CONFIG_QOS_SCHED + PSI_CPU_QOS_FULL, +#endif NR_PSI_STAT_STATES, };
@@ -261,6 +265,8 @@ enum psi_stat_task_count { NR_PSI_STAT_TASK_COUNTS, };
+#define CPU_CFS_BANDWIDTH 1 + struct psi_group_stat_cpu { u32 state_mask; u32 times[NR_PSI_STAT_STATES]; @@ -268,6 +274,8 @@ struct psi_group_stat_cpu { unsigned int tasks[NR_PSI_STAT_TASK_COUNTS]; u32 times_delta; u32 times_prev[NR_PSI_AGGREGATORS][NR_PSI_STAT_STATES]; + int prev_throttle; + int cur_throttle; };
struct psi_group_ext { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5b6e577acd17..6618da7f8b2c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -126,12 +126,6 @@ int __weak arch_asym_cpu_priority(int cpu)
#ifdef CONFIG_QOS_SCHED
-/* - * To distinguish cfs bw, use QOS_THROTTLED mark cfs_rq->throttled - * when qos throttled(and cfs bw throttle mark cfs_rq->throttled as 1). - */ -#define QOS_THROTTLED 2 - static DEFINE_PER_CPU_SHARED_ALIGNED(struct list_head, qos_throttled_cfs_rq); static DEFINE_PER_CPU_SHARED_ALIGNED(struct hrtimer, qos_overload_timer); static DEFINE_PER_CPU(int, qos_cpu_overload); diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 227ea1676cd1..977c973d1e70 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -433,7 +433,7 @@ static void psi_group_stat_change(struct psi_group *group, int cpu, for (t = 0; set; set &= ~(1 << t), t++) if (set & (1 << t)) ext_groupc->tasks[t]++; - for (s = 0; s < NR_PSI_STAT_STATES; s++) + for (s = 0; s < PSI_CPU_CFS_BANDWIDTH_FULL; s++) if (test_fine_grained_stat(ext_groupc->tasks, groupc->tasks[NR_RUNNING], s)) state_mask |= (1 << s); @@ -523,6 +523,52 @@ static inline void update_stat_averages(struct psi_group_ext *psi_ext, u64 period) {} #endif
+#if defined(CONFIG_CFS_BANDWIDTH) && defined(CONFIG_CGROUP_CPUACCT) && \ + defined(CONFIG_PSI_FINE_GRAINED) +static void record_cpu_stat_times(struct psi_group *group, int cpu) +{ + struct psi_group_ext *psi_ext = to_psi_group_ext(group); + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); + struct psi_group_stat_cpu *ext_groupc = per_cpu_ptr(psi_ext->pcpu, cpu); + u32 delta = ext_groupc->psi_delta; + + if (groupc->state_mask & (1 << PSI_CPU_FULL)) { + if (ext_groupc->prev_throttle == CPU_CFS_BANDWIDTH) + ext_groupc->times[PSI_CPU_CFS_BANDWIDTH_FULL] += delta; +#ifdef CONFIG_QOS_SCHED + else if (ext_groupc->prev_throttle == QOS_THROTTLED) + ext_groupc->times[PSI_CPU_QOS_FULL] += delta; +#endif + } +} + +static void update_throttle_type(struct task_struct *task, int cpu, bool next) +{ + struct cgroup *cpuacct_cgrp; + struct psi_group_ext *psi_ext; + struct psi_group_stat_cpu *groupc; + struct task_group *tsk_grp; + + if (!cgroup_subsys_on_dfl(cpuacct_cgrp_subsys)) { + rcu_read_lock(); + cpuacct_cgrp = task_cgroup(task, cpuacct_cgrp_id); + if (cgroup_parent(cpuacct_cgrp)) { + psi_ext = to_psi_group_ext(cgroup_psi(cpuacct_cgrp)); + groupc = per_cpu_ptr(psi_ext->pcpu, cpu); + tsk_grp = task_group(task); + if (next) + groupc->prev_throttle = groupc->cur_throttle; + groupc->cur_throttle = tsk_grp->cfs_rq[cpu]->throttled; + } + rcu_read_unlock(); + } +} +#else +static inline void record_cpu_stat_times(struct psi_group *group, int cpu) {} +static inline void update_throttle_type(struct task_struct *task, int cpu, + bool next) {} +#endif + static void collect_percpu_times(struct psi_group *group, enum psi_aggregators aggregator, u32 *pchanged_states) @@ -937,6 +983,7 @@ static void psi_group_change(struct psi_group *group, int cpu, write_seqcount_begin(&groupc->seq);
record_times(groupc, now); + record_cpu_stat_times(group, cpu);
/* * Start with TSK_ONCPU, which doesn't have a corresponding @@ -1091,6 +1138,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, u64 now = cpu_clock(cpu);
if (next->pid) { + update_throttle_type(next, cpu, true); psi_flags_change(next, 0, TSK_ONCPU); /* * Set TSK_ONCPU on @next's cgroups. If @next shares any @@ -1118,6 +1166,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, int stat_clear = 0; bool memstall_type_change = false;
+ update_throttle_type(prev, cpu, false); /* * When we're going to sleep, psi_dequeue() lets us * handle TSK_RUNNING, TSK_MEMSTALL_RUNNING and @@ -1199,6 +1248,7 @@ void psi_account_irqtime(struct task_struct *task, u32 delta) update_psi_stat_delta(group, cpu, now); record_stat_times(to_psi_group_ext(group), cpu); record_times(groupc, now); + record_cpu_stat_times(group, cpu); groupc->times[PSI_IRQ_FULL] += delta;
write_seqcount_end(&groupc->seq); @@ -1765,8 +1815,22 @@ static const char *const psi_stat_names[] = { "compact", "cgroup_async_memory_reclaim", "swap", + "cpu_cfs_bandwidth", + "cpu_qos", };
+static void get_stat_names(struct seq_file *m, int i, bool is_full) +{ + if (i <= PSI_SWAP_FULL && !is_full) + return seq_printf(m, "%s\n", psi_stat_names[i / 2]); + else if (i == PSI_CPU_CFS_BANDWIDTH_FULL) + return seq_printf(m, "%s\n", "cpu_cfs_bandwidth"); +#ifdef CONFIG_QOS_SCHED + else if (i == PSI_CPU_QOS_FULL) + return seq_printf(m, "%s\n", "cpu_qos"); +#endif +} + int psi_stat_show(struct seq_file *m, struct psi_group *group) { struct psi_group_ext *psi_ext; @@ -1786,12 +1850,11 @@ int psi_stat_show(struct seq_file *m, struct psi_group *group) group->avg_next_update = update_averages(group, now); mutex_unlock(&group->avgs_lock); for (i = 0; i < NR_PSI_STAT_STATES; i++) { - is_full = i % 2; + is_full = i % 2 || i > PSI_SWAP_FULL; for (w = 0; w < 3; w++) avg[w] = psi_ext->avg[i][w]; total = div_u64(psi_ext->total[PSI_AVGS][i], NSEC_PER_USEC); - if (!is_full) - seq_printf(m, "%s\n", psi_stat_names[i / 2]); + get_stat_names(m, i, is_full); seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n", is_full ? "full" : "some", LOAD_INT(avg[0]), LOAD_FRAC(avg[0]), diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 874d8c6e6750..4fc84b0e2945 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -75,6 +75,14 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt # define schedstat_end_time(rq, t) do { } while (0) #endif /* CONFIG_SCHEDSTATS */
+#ifdef CONFIG_QOS_SCHED +/* + * To distinguish cfs bw, use QOS_THROTTLED mark cfs_rq->throttled + * when qos throttled(and cfs bw throttle mark cfs_rq->throttled as 1). + */ +#define QOS_THROTTLED 2 +#endif + #ifdef CONFIG_PSI /* * PSI tracks state that persists across sleeps, such as iowaits and
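Why update_throttle_type() keeps both cur_throttle and prev_throttle: the stall delta folded in during a context switch covers the window that just ended, so record_cpu_stat_times() must attribute it to the throttle state that was in effect during that window, not the one starting now. A minimal model of the double-buffering (illustrative names; mirrors the next == true path of the patch):

struct throttle_track {
	int prev_throttle;	/* state during the window that just closed */
	int cur_throttle;	/* state for the window now opening */
};

static void throttle_switch_in(struct throttle_track *t, int throttled_now)
{
	t->prev_throttle = t->cur_throttle;	/* close out the old window */
	t->cur_throttle = throttled_now;	/* e.g. CPU_CFS_BANDWIDTH or QOS_THROTTLED */
}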
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8BCV4
-------------------------------
Enable CONFIG_PSI_FINE_GRAINED in openeuler_defconfig for x86 and arm64.
Signed-off-by: Lu Jialin lujialin4@huawei.com --- arch/arm64/configs/openeuler_defconfig | 2 +- arch/x86/configs/openeuler_defconfig | 1 + 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 68b1e3f3330d..45d8d3f70618 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -105,7 +105,7 @@ CONFIG_TASK_IO_ACCOUNTING=y CONFIG_PSI=y CONFIG_PSI_DEFAULT_DISABLED=y CONFIG_PSI_CGROUP_V1=y - +CONFIG_PSI_FINE_GRAINED=y # end of CPU/Task time and stats accounting
CONFIG_CPU_ISOLATION=y diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index e75127469a93..277c71ccbb67 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -109,6 +109,7 @@ CONFIG_TASK_IO_ACCOUNTING=y CONFIG_PSI=y CONFIG_PSI_DEFAULT_DISABLED=y CONFIG_PSI_CGROUP_V1=y +CONFIG_PSI_FINE_GRAINED=y # end of CPU/Task time and stats accounting
CONFIG_CPU_ISOLATION=y
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/3087 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/C...