Memory cgroup enhancement: convert memcg statistics tracking to the cgroup rstat infrastructure.
Johannes Weiner (9):
      mm: memcontrol: fix cpuhotplug statistics flushing
      mm: memcontrol: kill mem_cgroup_nodeinfo()
      mm: memcontrol: privatize memcg_page_state query functions
      cgroup: rstat: support cgroup1
      cgroup: rstat: punt root-level optimization to individual controllers
      mm: memcontrol: switch to rstat
      mm: memcontrol: consolidate lruvec stat flushing
      kselftests: cgroup: update kmem test for new vmstat implementation
      mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
Miaohe Lin (1):
      mm, memcg: remove unused functions
Shakeel Butt (5):
      memcg: switch lruvec stats to rstat
      memcg: infrastructure to flush memcg stats
      memcg: flush lruvec stats in the refault
      memcg: flush stats only if updated
      memcg: unify memcg stat flushing
Tejun Heo (2):
      cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
      blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
 block/blk-cgroup.c                         |  36 +-
 include/linux/memcontrol.h                 | 179 ++++------
 kernel/cgroup/cgroup.c                     |  34 +-
 kernel/cgroup/rstat.c                      |  82 +++--
 mm/memcontrol.c                            | 391 ++++++++++-----------
 mm/vmscan.c                                |   6 +
 mm/workingset.c                            |   1 +
 tools/testing/selftests/cgroup/test_kmem.c |  22 +-
 8 files changed, 372 insertions(+), 379 deletions(-)
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-5.13-rc1
commit a3d4c05a447486b90298a8c964916c8f4fcb903f
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------------------------------
Patch series "mm: memcontrol: switch to rstat", v3.
This series converts memcg stats tracking to the streamlined rstat infrastructure provided by the cgroup core code. rstat is already used by the CPU controller and the IO controller. This change is motivated by recent accuracy problems in memcg's custom stats code, as well as the benefits of sharing common infra with other controllers.
The current memcg implementation does batched tree aggregation on the write side: local stat changes are cached in per-cpu counters, which are then propagated upward in batches when a threshold (32 pages) is exceeded. This is cheap, but the error introduced by the lazy upward propagation adds up: 32 pages times CPUs times cgroups in the subtree. We've had complaints from service owners that the stats do not reliably track and react to allocation behavior as expected, sometimes swallowing the results of entire test applications.
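To make the error source concrete, here is a condensed sketch of the current write path, taken from the __mod_memcg_state() code that is removed later in this series ("threshold" is MEMCG_CHARGE_BATCH, shifted to bytes for byte-counted items); the per-cpu delta is only propagated once it crosses the threshold, so up to threshold * nr_cpus * nr_cgroups pages can be unreported at any given time:

    x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
    if (unlikely(abs(x) > threshold)) {
            struct mem_cgroup *mi;

            /* batch full: push the delta up to every ancestor */
            for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                    atomic_long_add(x, &mi->vmstats[idx]);
            x = 0;
    }
    __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);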
The original memcg stat implementation used to do tree aggregation exclusively on the read side: local stats would only ever be tracked in per-cpu counters, and a memory.stat read would iterate the entire subtree and sum those counters up. This didn't keep up with the times:
- Cgroup trees are much bigger now. We switched to lazily-freed cgroups, where deleted groups would hang around until their remaining page cache has been reclaimed. This can result in large subtrees that are expensive to walk, while most of the groups are idle and their statistics don't change much anymore.
- Automated monitoring increased. With the proliferation of userspace oom killing, proactive reclaim, and higher-resolution logging of workload trends in general, top-level stat files are polled at least once a second in many deployments.
- The lifetime of cgroups got shorter. Where most cgroup setups in the past would have a few large policy-oriented cgroups for everything running on the system, newer cgroup deployments tend to create one group per application - which gets deleted again as the processes exit. An aggregation scheme that doesn't retain child data inside the parents loses event history of the subtree.
Rstat addresses all three of those concerns through intelligent, persistent read-side aggregation. As statistics change at the local level, rstat tracks - on a per-cpu basis - only those parts of a subtree that have changes pending and require aggregation. The actual aggregation occurs on the colder read side - which can now skip over (potentially large) numbers of recently idle cgroups.
===
The test_kmem cgroup selftest is currently failing due to excessive cumulative vmstat drift from 100 subgroups:
ok 1 test_kmem_basic
memory.current = 8810496
slab + anon + file + kernel_stack = 17074568
slab = 6101384
anon = 946176
file = 0
kernel_stack = 10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
As you can see, memory.stat items far exceed memory.current. The kernel stack alone is bigger than all of charged memory. That's because the memory of the test has been uncharged from memory.current, but the negative vmstat deltas are still sitting in the percpu caches.
The test at this time isn't even counting percpu, pagetables etc. yet, which would further contribute to the error. The last patch in the series updates the test to include them, and also tightens the vmstat tolerances in general to only expect page_counter batching.
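For scale: the gap above is 17074568 - 8810496 = 8264072 bytes, i.e. roughly 7.9M of deltas still parked in per-cpu caches, whereas the test's current tolerance, MAX_VMSTAT_ERROR = 4096 * 32 * 2 * nr_cpus, amounts to only about 4M on a (hypothetical) 16-CPU machine.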
With all patches applied, the (now more stringent) test succeeds:
ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
===
A kernel build test confirms that overhead is comparable. Two kernels are built simultaneously in a nested tree with several idle siblings:
root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                              `- build-b (defconfig, make -j16)
                                              `- idle-1
                                              `- ...
                                              `- idle-9
During the builds, kernelbuild/memory.stat is read once a second.
A perf diff shows that the change in cycle distribution is minimal. Top 10 kernel symbols:
     0.09%     +0.08%  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     0.00%     +0.06%  [kernel.kallsyms]  [k] cgroup_rstat_updated
     0.08%     -0.05%  [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     0.16%     -0.04%  [kernel.kallsyms]  [k] release_pages
     0.00%     +0.03%  [kernel.kallsyms]  [k] __count_memcg_events
     0.01%     +0.03%  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
     0.10%     -0.02%  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
     0.05%     -0.02%  [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.57%     +0.01%  [kernel.kallsyms]  [k] asm_exc_page_fault
===
The on-demand aggregated stats are now fully accurate:
$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
  grep -e inactive_file /sys/fs/cgroup/memory.stat
vanilla:                                patched:
nr_inactive_file 1574105088             nr_inactive_file 1027801088
   inactive_file 1577410560                inactive_file 1027801088
===
This patch (of 8):
The memcg hotunplug callback erroneously flushes counts on the local CPU, not the counts of the CPU going away; those counts will be lost.
Flush the CPU that is actually going away.
Also simplify the code a bit by using mod_memcg_state() and count_memcg_events() instead of open-coding the upward flush - this is comparable to how vmstat.c handles hotunplug flushing.
Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Shakeel Butt shakeelb@google.com
Reviewed-by: Roman Gushchin guro@fb.com
Reviewed-by: Michal Koutný mkoutny@suse.com
Acked-by: Michal Hocko mhocko@suse.com
Cc: Tejun Heo tj@kernel.org
Cc: Roman Gushchin guro@fb.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 mm/memcontrol.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a2a04a990e09..cb6e43b7ce1f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2301,45 +2301,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { struct memcg_stock_pcp *stock; - struct mem_cgroup *memcg, *mi; + struct mem_cgroup *memcg;
stock = &per_cpu(memcg_stock, cpu); drain_stock(stock);
for_each_mem_cgroup(memcg) { + struct memcg_vmstats_percpu *statc; int i;
+ statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); + for (i = 0; i < MEMCG_NR_STAT; i++) { int nid; - long x;
- x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0); - if (x) - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - atomic_long_add(x, &memcg->vmstats[i]); + if (statc->stat[i]) { + mod_memcg_state(memcg, i, statc->stat[i]); + statc->stat[i] = 0; + }
if (i >= NR_VM_NODE_STAT_ITEMS) continue;
for_each_node(nid) { + struct batched_lruvec_stat *lstatc; struct mem_cgroup_per_node *pn; + long x;
pn = mem_cgroup_nodeinfo(memcg, nid); - x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0); - if (x) + lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); + + x = lstatc->count[i]; + lstatc->count[i] = 0; + + if (x) { do { atomic_long_add(x, &pn->lruvec_stat[i]); } while ((pn = parent_nodeinfo(pn, nid))); + } } }
for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { - long x; - - x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0); - if (x) - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - atomic_long_add(x, &memcg->vmevents[i]); + if (statc->events[i]) { + count_memcg_events(memcg, i, statc->events[i]); + statc->events[i] = 0; + } } }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit a3747b53b1771a787fea71d86a2fc39aea337685
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------
No need to encapsulate a simple struct member access.
Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Shakeel Butt shakeelb@google.com
Reviewed-by: Roman Gushchin guro@fb.com
Acked-by: Michal Hocko mhocko@suse.com
Reviewed-by: Michal Koutný mkoutny@suse.com
Cc: Tejun Heo tj@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflict: mm/memcontrol.c
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 include/linux/memcontrol.h |  8 +-------
 mm/memcontrol.c            | 12 ++++++------
 2 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e6e70b3bbcee..8414ee349e24 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -732,12 +732,6 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
-static struct mem_cgroup_per_node * -mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid) -{ - return memcg->nodeinfo[nid]; -} - /** * mem_cgroup_lruvec - get the lru list vector for a memcg & node * @memcg: memcg of the wanted lruvec @@ -760,7 +754,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, if (!memcg) memcg = root_mem_cgroup;
- mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); + mz = memcg->nodeinfo[pgdat->node_id]; lruvec = &mz->lruvec; out: /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb6e43b7ce1f..55390b26a1d4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -593,7 +593,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) int nid;
for_each_node(nid) { - mz = mem_cgroup_nodeinfo(memcg, nid); + mz = memcg->nodeinfo[nid]; mctz = soft_limit_tree_node(nid); if (mctz) mem_cgroup_remove_exceeded(mz, mctz); @@ -676,7 +676,7 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) parent = parent_mem_cgroup(pn->memcg); if (!parent) return NULL; - return mem_cgroup_nodeinfo(parent, nid); + return parent->nodeinfo[nid]; }
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, @@ -1009,7 +1009,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (reclaim) { struct mem_cgroup_per_node *mz;
- mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id); + mz = root->nodeinfo[reclaim->pgdat->node_id]; iter = &mz->iter;
if (prev && reclaim->generation != iter->generation) @@ -1112,7 +1112,7 @@ static void __invalidate_reclaim_iterators(struct mem_cgroup *from, int nid;
for_each_node(nid) { - mz = mem_cgroup_nodeinfo(from, nid); + mz = from->nodeinfo[nid]; iter = &mz->iter; cmpxchg(&iter->position, dead_memcg, NULL); } @@ -2328,7 +2328,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) struct mem_cgroup_per_node *pn; long x;
- pn = mem_cgroup_nodeinfo(memcg, nid); + pn = memcg->nodeinfo[nid]; lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
x = lstatc->count[i]; @@ -4263,7 +4263,7 @@ static int memcg_stat_show(struct seq_file *m, void *v) unsigned long file_cost = 0;
for_each_online_pgdat(pgdat) { - mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); + mz = memcg->nodeinfo[pgdat->node_id];
anon_cost += mz->lruvec.anon_cost; file_cost += mz->lruvec.file_cost;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit a18e6e6e150a98b9ce3e9acabeff407e7b6ba0c0
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------------------------
There are no users outside of the memory controller itself. The rest of the kernel cares either about node or lruvec stats.
Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Shakeel Butt shakeelb@google.com
Reviewed-by: Roman Gushchin guro@fb.com
Acked-by: Michal Hocko mhocko@suse.com
Reviewed-by: Michal Koutný mkoutny@suse.com
Cc: Tejun Heo tj@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflict: include/linux/memcontrol.h
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 include/linux/memcontrol.h | 44 --------------------------------------
 mm/memcontrol.c            | 32 +++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 44 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 8414ee349e24..d70d4f3ee3df 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -978,39 +978,6 @@ struct mem_cgroup *lock_page_memcg(struct page *page); void __unlock_page_memcg(struct mem_cgroup *memcg); void unlock_page_memcg(struct page *page);
-/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page_state(). - */ -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) -{ - long x = atomic_long_read(&memcg->vmstats[idx]); -#ifdef CONFIG_SMP - if (x < 0) - x = 0; -#endif - return x; -} - -/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page_state(). - */ -static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, - int idx) -{ - long x = 0; - int cpu; - - for_each_possible_cpu(cpu) - x += per_cpu(memcg->vmstats_local->stat[idx], cpu); -#ifdef CONFIG_SMP - if (x < 0) - x = 0; -#endif - return x; -} - void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
/* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1528,17 +1495,6 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) { }
-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) -{ - return 0; -} - -static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, - int idx) -{ - return 0; -} - static inline void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int nr) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 55390b26a1d4..9a72cdb6f5b6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -668,6 +668,38 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); }
+/* + * idx can be of type enum memcg_stat_item or node_stat_item. + * Keep in sync with memcg_exact_page_state(). + */ +static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) +{ + long x = atomic_long_read(&memcg->vmstats[idx]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + +/* + * idx can be of type enum memcg_stat_item or node_stat_item. + * Keep in sync with memcg_exact_page_state(). + */ +static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) +{ + long x = 0; + int cpu; + + for_each_possible_cpu(cpu) + x += per_cpu(memcg->vmstats_local->stat[idx], cpu); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + static struct mem_cgroup_per_node * parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) {
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit a7df69b81aac5bdeb5c5aef9addd680ce22feebf
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------------------------------------------
Rstat currently only supports the default hierarchy in cgroup2. In order to replace memcg's private stats infrastructure - used in both cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
The initialization and destruction callbacks for regular cgroups are already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.
The initialization of the root cgroup is currently hardcoded to only handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() and cgroup_destroy_root() to handle the default root as well as the various cgroup1 roots we may set up during mounting.
The linking of css to cgroups happens in code shared between cgroup1 and cgroup2 as well. Simply remove the cgroup_on_dfl() guard.
Linkage of the root css to the root cgroup is a bit trickier: per default, the root css of a subsystem controller belongs to the default hierarchy (i.e. the cgroup2 root). When a controller is mounted in its cgroup1 version, the root css is stolen and moved to the cgroup1 root; on unmount, the css moves back to the default hierarchy. Annotate rebind_subsystems() to move the root css linkage along between roots.
Link: https://lkml.kernel.org/r/20210209163304.77088-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Roman Gushchin guro@fb.com
Reviewed-by: Shakeel Butt shakeelb@google.com
Acked-by: Tejun Heo tj@kernel.org
Reviewed-by: Michal Koutný mkoutny@suse.com
Cc: Michal Hocko mhocko@suse.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 kernel/cgroup/cgroup.c | 34 +++++++++++++++++++++-------------
 kernel/cgroup/rstat.c  |  2 --
 2 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fafd2332457b..4158857414cc 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1342,6 +1342,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
mutex_unlock(&cgroup_mutex);
+ cgroup_rstat_exit(cgrp); kernfs_destroy_root(root->kf_root); cgroup_free_root(root); } @@ -1754,6 +1755,12 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask) &dcgrp->e_csets[ss->id]); spin_unlock_irq(&css_set_lock);
+ if (ss->css_rstat_flush) { + list_del_rcu(&css->rstat_css_node); + list_add_rcu(&css->rstat_css_node, + &dcgrp->rstat_css_list); + } + /* default hierarchy doesn't enable controllers by default */ dst_root->subsys_mask |= 1 << ssid; if (dst_root == &cgrp_dfl_root) { @@ -1975,10 +1982,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) if (ret) goto destroy_root;
- ret = rebind_subsystems(root, ss_mask); + ret = cgroup_rstat_init(root_cgrp); if (ret) goto destroy_root;
+ ret = rebind_subsystems(root, ss_mask); + if (ret) + goto exit_stats; + ret = cgroup_bpf_inherit(root_cgrp); WARN_ON_ONCE(ret);
@@ -2010,6 +2021,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) ret = 0; goto out;
+exit_stats: + cgroup_rstat_exit(root_cgrp); destroy_root: kernfs_destroy_root(root->kf_root); root->kf_root = NULL; @@ -5032,8 +5045,7 @@ static void css_free_rwork_fn(struct work_struct *work) cgroup_put(cgroup_parent(cgrp)); kernfs_put(cgrp->kn); psi_cgroup_free(cgrp); - if (cgroup_on_dfl(cgrp)) - cgroup_rstat_exit(cgrp); + cgroup_rstat_exit(cgrp); kfree(cgrp); } else { /* @@ -5074,8 +5086,7 @@ static void css_release_work_fn(struct work_struct *work) /* cgroup release path */ TRACE_CGROUP_PATH(release, cgrp);
- if (cgroup_on_dfl(cgrp)) - cgroup_rstat_flush(cgrp); + cgroup_rstat_flush(cgrp);
spin_lock_irq(&css_set_lock); for (tcgrp = cgroup_parent(cgrp); tcgrp; @@ -5134,7 +5145,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css, css_get(css->parent); }
- if (cgroup_on_dfl(cgrp) && ss->css_rstat_flush) + if (ss->css_rstat_flush) list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
BUG_ON(cgroup_css(cgrp, ss)); @@ -5268,11 +5279,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, if (ret) goto out_free_cgrp;
- if (cgroup_on_dfl(parent)) { - ret = cgroup_rstat_init(cgrp); - if (ret) - goto out_cancel_ref; - } + ret = cgroup_rstat_init(cgrp); + if (ret) + goto out_cancel_ref;
/* create the directory */ kn = kernfs_create_dir(parent->kn, name, mode, cgrp); @@ -5359,8 +5368,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, out_kernfs_remove: kernfs_remove(cgrp->kn); out_stat_exit: - if (cgroup_on_dfl(parent)) - cgroup_rstat_exit(cgrp); + cgroup_rstat_exit(cgrp); out_cancel_ref: percpu_ref_exit(&cgrp->self.refcnt); out_free_cgrp: diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index d51175cedfca..faa767a870ba 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
for_each_possible_cpu(cpu) raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)); - - BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp)); }
/*
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit dc26532aed0ab25c0801a34640d1f3b9b9098a48
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------------------------------------------------
Current users of the rstat code can source root-level statistics from the native counters of their respective subsystem, allowing them to forego aggregation at the root level. This optimization is currently implemented inside the generic rstat code, which doesn't track the root cgroup and doesn't invoke the subsystem flush callbacks on it.
However, the memory controller cannot do this optimization, because cgroup1 breaks out memory specifically for the local level, including at the root level. In preparation for the memory controller switching to rstat, move the optimization from rstat core to the controllers.
Afterwards, rstat will always track the root cgroup for changes and invoke the subsystem callbacks on it; and it's up to the subsystem to special-case and skip aggregation of the root cgroup if it can source this information through other, cheaper means.
This is the case for the io controller and the cgroup base stats. In their respective flush callbacks, check whether the parent is the root cgroup, and if so, skip the unnecessary upward propagation.
The extra cost of tracking the root cgroup is negligible: on stat changes, we actually remove a branch that checks for the root. The queueing for a flush touches only per-cpu data, and only the first stat change since a flush requires a (per-cpu) lock.
Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Acked-by: Tejun Heo tj@kernel.org
Cc: Michal Hocko mhocko@suse.com
Cc: Michal Koutný mkoutny@suse.com
Cc: Roman Gushchin guro@fb.com
Cc: Shakeel Butt shakeelb@google.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 block/blk-cgroup.c    | 17 +++++++-----
 kernel/cgroup/rstat.c | 61 +++++++++++++++++++++++++------------------
 2 files changed, 47 insertions(+), 31 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 40f15807efec..1defd7d94d9f 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -777,6 +777,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) struct blkcg *blkcg = css_to_blkcg(css); struct blkcg_gq *blkg;
+ /* Root-level stats are sourced from system-wide IO stats */ + if (!cgroup_parent(css->cgroup)) + return; + rcu_read_lock();
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) { @@ -799,8 +803,8 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) blkg_iostat_add(&bisc->last, &delta); u64_stats_update_end(&blkg->iostat.sync);
- /* propagate global delta to parent */ - if (parent) { + /* propagate global delta to parent (unless that's root) */ + if (parent && parent->parent) { u64_stats_update_begin(&parent->iostat.sync); blkg_iostat_set(&delta, &blkg->iostat.cur); blkg_iostat_sub(&delta, &blkg->iostat.last); @@ -814,10 +818,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) }
/* - * The rstat algorithms intentionally don't handle the root cgroup to avoid - * incurring overhead when no cgroups are defined. For that reason, - * cgroup_rstat_flush in blkcg_print_stat does not actually fill out the - * iostat in the root cgroup's blkcg_gq. + * We source root cgroup stats from the system-wide stats to avoid + * tracking the same information twice and incurring overhead when no + * cgroups are defined. For that reason, cgroup_rstat_flush in + * blkcg_print_stat does not actually fill out the iostat in the root + * cgroup's blkcg_gq. * * However, we would like to re-use the printing code between the root and * non-root cgroups to the extent possible. For that reason, we simulate diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index faa767a870ba..3a3fd2993a65 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -25,13 +25,8 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu) void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) { raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu); - struct cgroup *parent; unsigned long flags;
- /* nothing to do for root */ - if (!cgroup_parent(cgrp)) - return; - /* * Speculative already-on-list test. This may race leading to * temporary inaccuracies, which is fine. @@ -46,10 +41,10 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) raw_spin_lock_irqsave(cpu_lock, flags);
/* put @cgrp and all ancestors on the corresponding updated lists */ - for (parent = cgroup_parent(cgrp); parent; - cgrp = parent, parent = cgroup_parent(cgrp)) { + while (true) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); - struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); + struct cgroup *parent = cgroup_parent(cgrp); + struct cgroup_rstat_cpu *prstatc;
/* * Both additions and removals are bottom-up. If a cgroup @@ -58,8 +53,17 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) if (rstatc->updated_next) break;
+ /* Root has no parent to link it to, but mark it busy */ + if (!parent) { + rstatc->updated_next = cgrp; + break; + } + + prstatc = cgroup_rstat_cpu(parent, cpu); rstatc->updated_next = prstatc->updated_children; prstatc->updated_children = cgrp; + + cgrp = parent; }
raw_spin_unlock_irqrestore(cpu_lock, flags); @@ -113,23 +117,26 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, */ if (rstatc->updated_next) { struct cgroup *parent = cgroup_parent(pos); - struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); - struct cgroup_rstat_cpu *nrstatc; - struct cgroup **nextp; - - nextp = &prstatc->updated_children; - while (true) { - nrstatc = cgroup_rstat_cpu(*nextp, cpu); - if (*nextp == pos) - break; - - WARN_ON_ONCE(*nextp == parent); - nextp = &nrstatc->updated_next; + + if (parent) { + struct cgroup_rstat_cpu *prstatc; + struct cgroup **nextp; + + prstatc = cgroup_rstat_cpu(parent, cpu); + nextp = &prstatc->updated_children; + while (true) { + struct cgroup_rstat_cpu *nrstatc; + + nrstatc = cgroup_rstat_cpu(*nextp, cpu); + if (*nextp == pos) + break; + WARN_ON_ONCE(*nextp == parent); + nextp = &nrstatc->updated_next; + } + *nextp = rstatc->updated_next; }
- *nextp = rstatc->updated_next; rstatc->updated_next = NULL; - return pos; }
@@ -309,11 +316,15 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) { - struct cgroup *parent = cgroup_parent(cgrp); struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); + struct cgroup *parent = cgroup_parent(cgrp); struct cgroup_base_stat cur, delta; unsigned seq;
+ /* Root-level stats are sourced from system-wide CPU stats */ + if (!parent) + return; + /* fetch the current per-cpu values */ do { seq = __u64_stats_fetch_begin(&rstatc->bsync); @@ -326,8 +337,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) cgroup_base_stat_add(&cgrp->bstat, &delta); cgroup_base_stat_add(&rstatc->last_bstat, &delta);
- /* propagate global delta to parent */ - if (parent) { + /* propagate global delta to parent (unless that's root) */ + if (cgroup_parent(parent)) { delta = cgrp->bstat; cgroup_base_stat_sub(&delta, &cgrp->last_bstat); cgroup_base_stat_add(&parent->bstat, &delta);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit 2d146aa3aa842d7f5065802556b4f9a2c6e8ef12
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Replace the memory controller's custom hierarchical stats code with the generic rstat infrastructure provided by the cgroup core.
The current implementation does batched upward propagation from the write side (i.e. as stats change). The per-cpu batches introduce an error, which is multiplied by the number of subgroups in a tree. In systems with many CPUs and sizable cgroup trees, the error can be large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups results in an error of up to 128M per stat item). This can entirely swallow allocation bursts inside a workload that the user is expecting to see reflected in the statistics.
In the past, we've done read-side aggregation, where a memory.stat read would have to walk the entire subtree and add up per-cpu counts. This became problematic with lazily-freed cgroups: we could have large subtrees where most cgroups were entirely idle. Hence the switch to change-driven upward propagation. Unfortunately, it needed to trade accuracy for speed due to the write side being so hot.
Rstat combines the best of both worlds: from the write side, it cheaply maintains a queue of cgroups that have pending changes, so that the read side can do selective tree aggregation. This way the reported stats will always be precise and recent as can be, while the aggregation can skip over potentially large numbers of idle cgroups.
The way rstat works is that it implements a tree for tracking cgroups with pending local changes, as well as a flush function that walks the tree upwards. The controller then drives this by 1) telling rstat when a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush is required to get uptodate hierarchy stats for a given subtree (e.g. when memory.stat is read). The controller also provides a flush callback that is called during the rstat flush walk for each cgroup and aggregates its local per-cpu counters and propagates them upwards.
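In code terms, the contract reduces to three hooks; the lines below are condensed from the memcontrol.c hunks in this patch:

    /* 1) write side: record the change locally, mark this cgroup/cpu dirty */
    __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
    cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

    /* 2) read side: pull pending per-cpu deltas up the tree before reporting */
    cgroup_rstat_flush(memcg->css.cgroup);

    /* 3) per-cpu aggregation callback, wired up as
     *    .css_rstat_flush = mem_cgroup_css_rstat_flush, which folds this
     *    CPU's deltas into the cgroup's totals and forwards them to the
     *    parent's pending counters. */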
This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward aggregation. It removes 3 words from the per-cpu data. It eliminates memcg_exact_page_state(), since memcg_page_state() is now exact.
[akpm@linux-foundation.org: merge fix]
[hannes@cmpxchg.org: fix a sleep in atomic section problem]
Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Roman Gushchin guro@fb.com
Acked-by: Michal Hocko mhocko@suse.com
Reviewed-by: Shakeel Butt shakeelb@google.com
Reviewed-by: Michal Koutný mkoutny@suse.com
Acked-by: Balbir Singh bsingharora@gmail.com
Cc: Tejun Heo tj@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflict: include/linux/memcontrol.h
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 include/linux/memcontrol.h |  67 +++++++-----
 mm/memcontrol.c            | 218 +++++++++++++++----------------------
 2 files changed, 127 insertions(+), 158 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d70d4f3ee3df..483dce4d7753 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -76,10 +76,27 @@ enum mem_cgroup_events_target { };
struct memcg_vmstats_percpu { - long stat[MEMCG_NR_STAT]; - unsigned long events[NR_VM_EVENT_ITEMS]; - unsigned long nr_page_events; - unsigned long targets[MEM_CGROUP_NTARGETS]; + /* Local (CPU and cgroup) page state & events */ + long state[MEMCG_NR_STAT]; + unsigned long events[NR_VM_EVENT_ITEMS]; + + /* Delta calculation for lockless upward propagation */ + long state_prev[MEMCG_NR_STAT]; + unsigned long events_prev[NR_VM_EVENT_ITEMS]; + + /* Cgroup1: threshold notifications & softlimit tree updates */ + unsigned long nr_page_events; + unsigned long targets[MEM_CGROUP_NTARGETS]; +}; + +struct memcg_vmstats { + /* Aggregated (CPU and subtree) page state & events */ + long state[MEMCG_NR_STAT]; + unsigned long events[NR_VM_EVENT_ITEMS]; + + /* Pending child counts during tree propagation */ + long state_pending[MEMCG_NR_STAT]; + unsigned long events_pending[NR_VM_EVENT_ITEMS]; };
struct mem_cgroup_reclaim_iter { @@ -293,8 +310,8 @@ struct mem_cgroup {
MEMCG_PADDING(_pad1_);
- atomic_long_t vmstats[MEMCG_NR_STAT]; - atomic_long_t vmevents[NR_VM_EVENT_ITEMS]; + /* memory.stat */ + struct memcg_vmstats vmstats;
/* memory.events */ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; @@ -328,10 +345,6 @@ struct mem_cgroup { atomic_t moving_account; struct task_struct *move_lock_task;
- /* Legacy local VM stats and events */ - struct memcg_vmstats_percpu __percpu *vmstats_local; - - /* Subtree VM stats and events (batched updates) */ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
#ifdef CONFIG_CGROUP_WRITEBACK @@ -1134,10 +1147,6 @@ static inline void mod_lruvec_page_state(struct page *page, local_irq_restore(flags); }
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); - void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count);
@@ -1218,6 +1227,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
void split_page_memcg(struct page *head, unsigned int nr);
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned); + #else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0 @@ -1327,6 +1340,10 @@ static inline bool lruvec_holds_page_lru_lock(struct page *page, return lruvec == &pgdat->__lruvec; }
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +{ +} + static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { return NULL; @@ -1580,18 +1597,6 @@ static inline void mod_memcg_obj_state(void *p, int idx, int val) { }
-static inline -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - return 0; -} - -static inline void split_page_memcg(struct page *head, unsigned int nr) -{ -} - static inline void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) @@ -1614,8 +1619,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { }
-static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +static inline void split_page_memcg(struct page *head, unsigned int nr) +{ +} + +static inline +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned) { + return 0; } #endif /* CONFIG_MEMCG */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9a72cdb6f5b6..c0285b9de5f5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -644,37 +644,17 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) */ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) { - long x, threshold = MEMCG_CHARGE_BATCH; - if (mem_cgroup_disabled()) return;
- if (memcg_stat_item_in_bytes(idx)) - threshold <<= PAGE_SHIFT; - - x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); - if (unlikely(abs(x) > threshold)) { - struct mem_cgroup *mi; - - /* - * Batch local counters to keep them in sync with - * the hierarchical ones. - */ - __this_cpu_add(memcg->vmstats_local->stat[idx], x); - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - atomic_long_add(x, &mi->vmstats[idx]); - x = 0; - } - __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); + __this_cpu_add(memcg->vmstats_percpu->state[idx], val); + cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); }
-/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page_state(). - */ +/* idx can be of type enum memcg_stat_item or node_stat_item. */ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) { - long x = atomic_long_read(&memcg->vmstats[idx]); + long x = READ_ONCE(memcg->vmstats.state[idx]); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -682,17 +662,14 @@ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) return x; }
-/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page_state(). - */ +/* idx can be of type enum memcg_stat_item or node_stat_item. */ static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) { long x = 0; int cpu;
for_each_possible_cpu(cpu) - x += per_cpu(memcg->vmstats_local->stat[idx], cpu); + x += per_cpu(memcg->vmstats_percpu->state[idx], cpu); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -807,30 +784,16 @@ void mod_memcg_obj_state(void *p, int idx, int val) void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) { - unsigned long x; - if (mem_cgroup_disabled()) return;
- x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); - if (unlikely(x > MEMCG_CHARGE_BATCH)) { - struct mem_cgroup *mi; - - /* - * Batch local counters to keep them in sync with - * the hierarchical ones. - */ - __this_cpu_add(memcg->vmstats_local->events[idx], x); - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - atomic_long_add(x, &mi->vmevents[idx]); - x = 0; - } - __this_cpu_write(memcg->vmstats_percpu->events[idx], x); + __this_cpu_add(memcg->vmstats_percpu->events[idx], count); + cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); }
static unsigned long memcg_events(struct mem_cgroup *memcg, int event) { - return atomic_long_read(&memcg->vmevents[event]); + return READ_ONCE(memcg->vmstats.events[event]); }
static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) @@ -839,7 +802,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) int cpu;
for_each_possible_cpu(cpu) - x += per_cpu(memcg->vmstats_local->events[event], cpu); + x += per_cpu(memcg->vmstats_percpu->events[event], cpu); return x; }
@@ -1519,6 +1482,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg) * * Current memory state: */ + cgroup_rstat_flush(memcg->css.cgroup);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { u64 size; @@ -2339,22 +2303,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) drain_stock(stock);
for_each_mem_cgroup(memcg) { - struct memcg_vmstats_percpu *statc; int i;
- statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); - - for (i = 0; i < MEMCG_NR_STAT; i++) { + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { int nid;
- if (statc->stat[i]) { - mod_memcg_state(memcg, i, statc->stat[i]); - statc->stat[i] = 0; - } - - if (i >= NR_VM_NODE_STAT_ITEMS) - continue; - for_each_node(nid) { struct batched_lruvec_stat *lstatc; struct mem_cgroup_per_node *pn; @@ -2373,13 +2326,6 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) } } } - - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { - if (statc->events[i]) { - count_memcg_events(memcg, i, statc->events[i]); - statc->events[i] = 0; - } - } }
return 0; @@ -3546,6 +3492,7 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) unsigned long val;
if (mem_cgroup_is_root(memcg)) { + cgroup_rstat_flush(memcg->css.cgroup); val = memcg_page_state(memcg, NR_FILE_PAGES) + memcg_page_state(memcg, NR_ANON_MAPPED); if (swap) @@ -3610,26 +3557,15 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, } }
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg) { - unsigned long stat[MEMCG_NR_STAT] = {0}; - struct mem_cgroup *mi; - int node, cpu, i; - - for_each_online_cpu(cpu) - for (i = 0; i < MEMCG_NR_STAT; i++) - stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu); - - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - for (i = 0; i < MEMCG_NR_STAT; i++) - atomic_long_add(stat[i], &mi->vmstats[i]); + int node;
for_each_node(node) { struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; + unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 }; struct mem_cgroup_per_node *pi; - - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) - stat[i] = 0; + int cpu, i;
for_each_online_cpu(cpu) for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) @@ -3642,25 +3578,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) } }
-static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg) -{ - unsigned long events[NR_VM_EVENT_ITEMS]; - struct mem_cgroup *mi; - int cpu, i; - - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) - events[i] = 0; - - for_each_online_cpu(cpu) - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) - events[i] += per_cpu(memcg->vmstats_percpu->events[i], - cpu); - - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) - atomic_long_add(events[i], &mi->vmevents[i]); -} - #ifdef CONFIG_MEMCG_KMEM static int memcg_online_kmem(struct mem_cgroup *memcg) { @@ -4159,6 +4076,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) int nid; struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+ cgroup_rstat_flush(memcg->css.cgroup); + for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { seq_printf(m, "%s=%lu", stat->name, mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, @@ -4229,6 +4148,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
+ cgroup_rstat_flush(memcg->css.cgroup); + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { unsigned long nr;
@@ -4713,22 +4634,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) return &memcg->cgwb_domain; }
-/* - * idx can be of type enum memcg_stat_item or node_stat_item. - * Keep in sync with memcg_exact_page(). - */ -static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx) -{ - long x = atomic_long_read(&memcg->vmstats[idx]); - int cpu; - - for_each_online_cpu(cpu) - x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx]; - if (x < 0) - x = 0; - return x; -} - /** * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg * @wb: bdi_writeback in question @@ -4754,13 +4659,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); struct mem_cgroup *parent;
- *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); + cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
- *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); - *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + - memcg_exact_page_state(memcg, NR_ACTIVE_FILE); - *pheadroom = PAGE_COUNTER_MAX; + *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); + *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); + *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) + + memcg_page_state(memcg, NR_ACTIVE_FILE);
+ *pheadroom = PAGE_COUNTER_MAX; while ((parent = parent_mem_cgroup(memcg))) { unsigned long ceiling = min(READ_ONCE(memcg->memory.max), READ_ONCE(memcg->memory.high)); @@ -5399,7 +5305,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu); - free_percpu(memcg->vmstats_local); kfree(memcg); }
@@ -5407,11 +5312,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) { memcg_wb_domain_exit(memcg); /* - * Flush percpu vmstats and vmevents to guarantee the value correctness - * on parent's and all ancestor levels. + * Flush percpu lruvec stats to guarantee the value + * correctness on parent's and all ancestor levels. */ - memcg_flush_percpu_vmstats(memcg); - memcg_flush_percpu_vmevents(memcg); + memcg_flush_lruvec_page_state(memcg); __mem_cgroup_free(memcg); }
@@ -5438,11 +5342,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void) goto fail; }
- memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, - GFP_KERNEL_ACCOUNT); - if (!memcg->vmstats_local) - goto fail; - memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_percpu) @@ -5663,6 +5562,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) memcg_wb_domain_size_changed(memcg); }
+static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct mem_cgroup *parent = parent_mem_cgroup(memcg); + struct memcg_vmstats_percpu *statc; + long delta, v; + int i; + + statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); + + for (i = 0; i < MEMCG_NR_STAT; i++) { + /* + * Collect the aggregated propagation counts of groups + * below us. We're in a per-cpu loop here and this is + * a global counter, so the first cycle will get them. + */ + delta = memcg->vmstats.state_pending[i]; + if (delta) + memcg->vmstats.state_pending[i] = 0; + + /* Add CPU changes on this level since the last flush */ + v = READ_ONCE(statc->state[i]); + if (v != statc->state_prev[i]) { + delta += v - statc->state_prev[i]; + statc->state_prev[i] = v; + } + + if (!delta) + continue; + + /* Aggregate counts on this level and propagate upwards */ + memcg->vmstats.state[i] += delta; + if (parent) + parent->vmstats.state_pending[i] += delta; + } + + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { + delta = memcg->vmstats.events_pending[i]; + if (delta) + memcg->vmstats.events_pending[i] = 0; + + v = READ_ONCE(statc->events[i]); + if (v != statc->events_prev[i]) { + delta += v - statc->events_prev[i]; + statc->events_prev[i] = v; + } + + if (!delta) + continue; + + memcg->vmstats.events[i] += delta; + if (parent) + parent->vmstats.events_pending[i] += delta; + } +} + #ifdef CONFIG_MMU /* Handlers for move charge at task migration. */ static int mem_cgroup_do_precharge(unsigned long count) @@ -6727,6 +6682,7 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_released = mem_cgroup_css_released, .css_free = mem_cgroup_css_free, .css_reset = mem_cgroup_css_reset, + .css_rstat_flush = mem_cgroup_css_rstat_flush, .can_attach = mem_cgroup_can_attach, .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task,
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit 2cd21c89800c2203331e5564df2155757ded2e86
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------------------------------------------------
There are two functions that flush the per-cpu data of an lruvec into the rest of the cgroup tree: one runs when the cgroup is being freed, the other when a CPU disappears during hotplug. The difference is whether all CPUs or just one is being collected, but the rest of the flushing code is the same. Merge them into one function and share the common code.
Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Shakeel Butt shakeelb@google.com
Acked-by: Michal Hocko mhocko@suse.com
Acked-by: Roman Gushchin guro@fb.com
Cc: Michal Koutný mkoutny@suse.com
Cc: Tejun Heo tj@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 mm/memcontrol.c | 74 +++++++++++++++++++------------------------------
 1 file changed, 28 insertions(+), 46 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c0285b9de5f5..b1ac32fb5819 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2294,39 +2294,39 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) mutex_unlock(&percpu_charge_mutex); }
-static int memcg_hotplug_cpu_dead(unsigned int cpu) +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) { - struct memcg_stock_pcp *stock; - struct mem_cgroup *memcg; - - stock = &per_cpu(memcg_stock, cpu); - drain_stock(stock); + int nid;
- for_each_mem_cgroup(memcg) { + for_each_node(nid) { + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; + unsigned long stat[NR_VM_NODE_STAT_ITEMS]; + struct batched_lruvec_stat *lstatc; int i;
+ lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { - int nid; + stat[i] = lstatc->count[i]; + lstatc->count[i] = 0; + }
- for_each_node(nid) { - struct batched_lruvec_stat *lstatc; - struct mem_cgroup_per_node *pn; - long x; + do { + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + atomic_long_add(stat[i], &pn->lruvec_stat[i]); + } while ((pn = parent_nodeinfo(pn, nid))); + } +}
- pn = memcg->nodeinfo[nid]; - lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); +static int memcg_hotplug_cpu_dead(unsigned int cpu) +{ + struct memcg_stock_pcp *stock; + struct mem_cgroup *memcg;
- x = lstatc->count[i]; - lstatc->count[i] = 0; + stock = &per_cpu(memcg_stock, cpu); + drain_stock(stock);
- if (x) { - do { - atomic_long_add(x, &pn->lruvec_stat[i]); - } while ((pn = parent_nodeinfo(pn, nid))); - } - } - } - } + for_each_mem_cgroup(memcg) + memcg_flush_lruvec_page_state(memcg, cpu);
return 0; } @@ -3557,27 +3557,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, } }
-static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg) -{ - int node; - - for_each_node(node) { - struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; - unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 }; - struct mem_cgroup_per_node *pi; - int cpu, i; - - for_each_online_cpu(cpu) - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) - stat[i] += per_cpu( - pn->lruvec_stat_cpu->count[i], cpu); - - for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) - atomic_long_add(stat[i], &pi->lruvec_stat[i]); - } -} - #ifdef CONFIG_MEMCG_KMEM static int memcg_online_kmem(struct mem_cgroup *memcg) { @@ -5310,12 +5289,15 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
static void mem_cgroup_free(struct mem_cgroup *memcg) { + int cpu; + memcg_wb_domain_exit(memcg); /* * Flush percpu lruvec stats to guarantee the value * correctness on parent's and all ancestor levels. */ - memcg_flush_lruvec_page_state(memcg); + for_each_online_cpu(cpu) + memcg_flush_lruvec_page_state(memcg, cpu); __mem_cgroup_free(memcg); }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion
from mainline-v5.13-rc1
commit 4bbcc5a41c5449f6a67edb3fbc2dccae9c6724db
category: feature
bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
------------------------------------------------------------------
With memcg having switched to rstat, memory.stat output is precise. Update the cgroup selftest to reflect the expectations and error tolerances of the new implementation.
Also add newly tracked types of memory to the memory.stat side of the equation, since they're included in memory.current and could throw false positives.
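For reference, the tightened tolerance in the diff below works out to MAX_VMSTAT_ERROR = 4096 * 32 * nr_cpus - e.g. 2M on a hypothetical 16-CPU machine, half of the previous 4096 * 32 * 2 * nr_cpus allowance - since page_counter batching is now the only remaining error source.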
Link: https://lkml.kernel.org/r/20210209163304.77088-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Reviewed-by: Shakeel Butt shakeelb@google.com
Reviewed-by: Michal Koutný mkoutny@suse.com
Acked-by: Roman Gushchin guro@fb.com
Cc: Michal Hocko mhocko@suse.com
Cc: Tejun Heo tj@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 tools/testing/selftests/cgroup/test_kmem.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c index 0941aa16157e..22b31ebb3513 100644 --- a/tools/testing/selftests/cgroup/test_kmem.c +++ b/tools/testing/selftests/cgroup/test_kmem.c @@ -19,12 +19,12 @@
/* - * Memory cgroup charging and vmstat data aggregation is performed using - * percpu batches 32 pages big (look at MEMCG_CHARGE_BATCH). So the maximum - * discrepancy between charge and vmstat entries is number of cpus multiplied - * by 32 pages multiplied by 2. + * Memory cgroup charging is performed using percpu batches 32 pages + * big (look at MEMCG_CHARGE_BATCH), whereas memory.stat is exact. So + * the maximum discrepancy between charge and vmstat entries is number + * of cpus multiplied by 32 pages. */ -#define MAX_VMSTAT_ERROR (4096 * 32 * 2 * get_nprocs()) +#define MAX_VMSTAT_ERROR (4096 * 32 * get_nprocs())
static int alloc_dcache(const char *cgroup, void *arg) @@ -162,7 +162,7 @@ static int cg_run_in_subcgroups(const char *parent, */ static int test_kmem_memcg_deletion(const char *root) { - long current, slab, anon, file, kernel_stack, sum; + long current, slab, anon, file, kernel_stack, pagetables, percpu, sock, sum; int ret = KSFT_FAIL; char *parent;
@@ -184,11 +184,14 @@ static int test_kmem_memcg_deletion(const char *root) anon = cg_read_key_long(parent, "memory.stat", "anon "); file = cg_read_key_long(parent, "memory.stat", "file "); kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack "); + pagetables = cg_read_key_long(parent, "memory.stat", "pagetables "); + percpu = cg_read_key_long(parent, "memory.stat", "percpu "); + sock = cg_read_key_long(parent, "memory.stat", "sock "); if (current < 0 || slab < 0 || anon < 0 || file < 0 || - kernel_stack < 0) + kernel_stack < 0 || pagetables < 0 || percpu < 0 || sock < 0) goto cleanup;
- sum = slab + anon + file + kernel_stack; + sum = slab + anon + file + kernel_stack + pagetables + percpu + sock; if (abs(sum - current) < MAX_VMSTAT_ERROR) { ret = KSFT_PASS; } else { @@ -198,6 +201,9 @@ static int test_kmem_memcg_deletion(const char *root) printf("anon = %ld\n", anon); printf("file = %ld\n", file); printf("kernel_stack = %ld\n", kernel_stack); + printf("pagetables = %ld\n", pagetables); + printf("percpu = %ld\n", percpu); + printf("sock = %ld\n", sock); }
cleanup:
From: Tejun Heo tj@kernel.org
mainline inclusion from mainline-v5.14-rc6 commit c3df5fb57fe8756d67fd56ed29da65cdfde839f9 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------------------------------
0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added cgroup_rstat_flush_irqsafe(), allowing flushing to happen from irq context. However, rstat paths use u64_stats_sync to synchronize access to 64bit stat counters on 32bit machines. u64_stats_sync is implemented with a seqcount, and trying to read from irq context can lead to an A-A deadlock if the irq happens to interrupt the stat update.
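To illustrate (a simplified trace, not code from the patch; "stat" and its "bytes" field are hypothetical, standing in for any counter protected by a u64_stats_sync):

        /* 32-bit SMP, writer and irq handler on the same CPU */
        u64_stats_update_begin(&stat->sync);    /* seqcount becomes odd */
        stat->bytes += delta;
            <irq>
            cgroup_rstat_flush_irqsafe(cgrp)
              ...
                u64_stats_fetch_begin(&stat->sync);
                /* spins until the seqcount is even again - but the writer
                 * that would make it even is the very context this irq
                 * interrupted: A-A deadlock */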
Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and u64_stats_update_end_irqrestore() - in the update paths. Note that none of this matters on 64bit machines. All these are just for 32bit SMP setups.
Note that the interface was introduced way back; its first and currently only use was recently added by 2d146aa3aa84 ("mm: memcontrol: switch to rstat"). Stable tagging targets this commit.
Signed-off-by: Tejun Heo tj@kernel.org Reported-by: Rik van Riel riel@surriel.com Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Cc: stable@vger.kernel.org # v5.13+ Conflict: block/blk-cgroup.c Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-cgroup.c | 14 ++++++++------ kernel/cgroup/rstat.c | 19 +++++++++++-------- 2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 1defd7d94d9f..7f7d41236838 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -787,6 +787,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) struct blkcg_gq *parent = blkg->parent; struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu); struct blkg_iostat cur, delta; + unsigned long flags; unsigned int seq;
/* fetch the current per-cpu values */ @@ -796,21 +797,21 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu) } while (u64_stats_fetch_retry(&bisc->sync, seq));
/* propagate percpu delta to global */ - u64_stats_update_begin(&blkg->iostat.sync); + flags = u64_stats_update_begin_irqsave(&blkg->iostat.sync); blkg_iostat_set(&delta, &cur); blkg_iostat_sub(&delta, &bisc->last); blkg_iostat_add(&blkg->iostat.cur, &delta); blkg_iostat_add(&bisc->last, &delta); - u64_stats_update_end(&blkg->iostat.sync); + u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
/* propagate global delta to parent (unless that's root) */ if (parent && parent->parent) { - u64_stats_update_begin(&parent->iostat.sync); + flags = u64_stats_update_begin_irqsave(&parent->iostat.sync); blkg_iostat_set(&delta, &blkg->iostat.cur); blkg_iostat_sub(&delta, &blkg->iostat.last); blkg_iostat_add(&parent->iostat.cur, &delta); blkg_iostat_add(&blkg->iostat.last, &delta); - u64_stats_update_end(&parent->iostat.sync); + u64_stats_update_end_irqrestore(&parent->iostat.sync, flags); } }
@@ -845,6 +846,7 @@ static void blkcg_fill_root_iostats(void) memset(&tmp, 0, sizeof(tmp)); for_each_possible_cpu(cpu) { struct disk_stats *cpu_dkstats; + unsigned long flags;
cpu_dkstats = per_cpu_ptr(part->dkstats, cpu); tmp.ios[BLKG_IOSTAT_READ] += @@ -861,9 +863,9 @@ static void blkcg_fill_root_iostats(void) tmp.bytes[BLKG_IOSTAT_DISCARD] += cpu_dkstats->sectors[STAT_DISCARD] << 9;
- u64_stats_update_begin(&blkg->iostat.sync); + flags = u64_stats_update_begin_irqsave(&blkg->iostat.sync); blkg_iostat_set(&blkg->iostat.cur, &tmp); - u64_stats_update_end(&blkg->iostat.sync); + u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags); } disk_put_part(part); } diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 3a3fd2993a65..a1a30f05ed94 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -347,19 +347,20 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) }
static struct cgroup_rstat_cpu * -cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp) +cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags) { struct cgroup_rstat_cpu *rstatc;
rstatc = get_cpu_ptr(cgrp->rstat_cpu); - u64_stats_update_begin(&rstatc->bsync); + *flags = u64_stats_update_begin_irqsave(&rstatc->bsync); return rstatc; }
static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp, - struct cgroup_rstat_cpu *rstatc) + struct cgroup_rstat_cpu *rstatc, + unsigned long flags) { - u64_stats_update_end(&rstatc->bsync); + u64_stats_update_end_irqrestore(&rstatc->bsync, flags); cgroup_rstat_updated(cgrp, smp_processor_id()); put_cpu_ptr(rstatc); } @@ -367,18 +368,20 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp, void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec) { struct cgroup_rstat_cpu *rstatc; + unsigned long flags;
- rstatc = cgroup_base_stat_cputime_account_begin(cgrp); + rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags); rstatc->bstat.cputime.sum_exec_runtime += delta_exec; - cgroup_base_stat_cputime_account_end(cgrp, rstatc); + cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags); }
void __cgroup_account_cputime_field(struct cgroup *cgrp, enum cpu_usage_stat index, u64 delta_exec) { struct cgroup_rstat_cpu *rstatc; + unsigned long flags;
- rstatc = cgroup_base_stat_cputime_account_begin(cgrp); + rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
switch (index) { case CPUTIME_USER: @@ -394,7 +397,7 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp, break; }
- cgroup_base_stat_cputime_account_end(cgrp, rstatc); + cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags); }
/*
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.14-rc4 commit 30def93565e5ba08676aa2b9083f253fc586dbed category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
------------------------------------------------------------------------
Dan Carpenter reports:
The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr 29, 2021, leads to the following static checker warning:
kernel/cgroup/rstat.c:200 cgroup_rstat_flush() warn: sleeping in atomic context
mm/memcontrol.c
  3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
  3573  {
  3574          unsigned long val;
  3575
  3576          if (mem_cgroup_is_root(memcg)) {
  3577                  cgroup_rstat_flush(memcg->css.cgroup);
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is from static analysis and potentially a false positive. The problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold() which holds an rcu_read_lock(). And the cgroup_rstat_flush() function can sleep.
  3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
  3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
  3580                  if (swap)
  3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
  3582          } else {
  3583                  if (!swap)
  3584                          val = page_counter_read(&memcg->memory);
  3585                  else
  3586                          val = page_counter_read(&memcg->memsw);
  3587          }
  3588          return val;
  3589  }
__mem_cgroup_threshold() indeed holds the rcu lock. In addition, the thresholding code is invoked during stat changes, and those contexts have irqs disabled as well. If the lock breaking occurs inside the flush function, it will result in a sleep from an atomic context.
Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.
Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reported-by: Dan Carpenter dan.carpenter@oracle.com Acked-by: Chris Down chris@chrisdown.name Reviewed-by: Rik van Riel riel@surriel.com Acked-by: Michal Hocko mhocko@suse.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b1ac32fb5819..1fef9b7a3f41 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3492,7 +3492,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) unsigned long val;
if (mem_cgroup_is_root(memcg)) { - cgroup_rstat_flush(memcg->css.cgroup); + /* mem_cgroup_threshold() calls here from irqsafe context */ + cgroup_rstat_flush_irqsafe(memcg->css.cgroup); val = memcg_page_state(memcg, NR_FILE_PAGES) + memcg_page_state(memcg, NR_ANON_MAPPED); if (swap)
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.15-rc1 commit 7e1c0d6f58207e7e60674647d3935f446f05613c category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------------------------------
The commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") switched the memcg stats to the rstat infrastructure but skipped the conversion of the lruvec stats, as those stats are read in performance-critical code paths and flushing them could have hurt application performance. This patch converts the lruvec stats to rstat, and later patches add mechanisms to keep the performance impact to a minimum.
The rstat conversion comes at a price, namely memory cost: effectively, this patch reverts the savings done by the commit f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However, the cost is justified by the negative impact inaccurate lruvec stats have on many heuristics. One such case is reported in [1].
The memory reclaim code is filled with a plethora of heuristics, and many of those heuristics read the lruvec stats, so inaccurate stats can render them ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic; the deactivation and anon aging heuristics can be affected as well.
[1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.c...
Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.com Signed-off-by: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Muchun Song songmuchun@bytedance.com Cc: Michal Hocko mhocko@kernel.org Cc: Roman Gushchin guro@fb.com Cc: Huang Ying ying.huang@intel.com Cc: Hillf Danton hdanton@sina.com Cc: Michal Koutný mkoutny@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/memcontrol.h | 42 +++++++------- mm/memcontrol.c | 114 +++++++++++++------------------------ 2 files changed, 58 insertions(+), 98 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 483dce4d7753..fdff0c8f624f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -105,14 +105,6 @@ struct mem_cgroup_reclaim_iter { unsigned int generation; };
-struct lruvec_stat { - long count[NR_VM_NODE_STAT_ITEMS]; -}; - -struct batched_lruvec_stat { - s32 count[NR_VM_NODE_STAT_ITEMS]; -}; - /* * Bitmap and deferred work of shrinker::id corresponding to memcg-aware * shrinkers, which have elements charged to this memcg. @@ -123,24 +115,30 @@ struct shrinker_info { unsigned long *map; };
+struct lruvec_stats_percpu { + /* Local (CPU and cgroup) state */ + long state[NR_VM_NODE_STAT_ITEMS]; + + /* Delta calculation for lockless upward propagation */ + long state_prev[NR_VM_NODE_STAT_ITEMS]; +}; + +struct lruvec_stats { + /* Aggregated (CPU and subtree) state */ + long state[NR_VM_NODE_STAT_ITEMS]; + + /* Pending child counts during tree propagation */ + long state_pending[NR_VM_NODE_STAT_ITEMS]; +}; + /* * per-node information in memory controller. */ struct mem_cgroup_per_node { struct lruvec lruvec;
- /* - * Legacy local VM stats. This should be struct lruvec_stat and - * cannot be optimized to struct batched_lruvec_stat. Because - * the threshold of the lruvec_stat_cpu can be as big as - * MEMCG_CHARGE_BATCH * PAGE_SIZE. It can fit into s32. But this - * filed has no upper limit. - */ - struct lruvec_stat __percpu *lruvec_stat_local; - - /* Subtree VM stats (batched updates) */ - struct batched_lruvec_stat __percpu *lruvec_stat_cpu; - atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS]; + struct lruvec_stats_percpu __percpu *lruvec_stats_percpu; + struct lruvec_stats lruvec_stats;
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
@@ -1049,7 +1047,7 @@ static inline unsigned long lruvec_page_state(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - x = atomic_long_read(&pn->lruvec_stat[idx]); + x = READ_ONCE(pn->lruvec_stats.state[idx]); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -1069,7 +1067,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); for_each_possible_cpu(cpu) - x += per_cpu(pn->lruvec_stat_local->count[idx], cpu); + x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu); #ifdef CONFIG_SMP if (x < 0) x = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1fef9b7a3f41..08e6887fca0d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -677,23 +677,11 @@ static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) return x; }
-static struct mem_cgroup_per_node * -parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) -{ - struct mem_cgroup *parent; - - parent = parent_mem_cgroup(pn->memcg); - if (!parent) - return NULL; - return parent->nodeinfo[nid]; -} - void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val) { struct mem_cgroup_per_node *pn; struct mem_cgroup *memcg; - long x, threshold = MEMCG_CHARGE_BATCH;
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); memcg = pn->memcg; @@ -702,21 +690,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, __mod_memcg_state(memcg, idx, val);
/* Update lruvec */ - __this_cpu_add(pn->lruvec_stat_local->count[idx], val); - - if (vmstat_item_in_bytes(idx)) - threshold <<= PAGE_SHIFT; - - x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); - if (unlikely(abs(x) > threshold)) { - pg_data_t *pgdat = lruvec_pgdat(lruvec); - struct mem_cgroup_per_node *pi; - - for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) - atomic_long_add(x, &pi->lruvec_stat[idx]); - x = 0; - } - __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); + __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); }
/** @@ -2294,40 +2268,13 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) mutex_unlock(&percpu_charge_mutex); }
-static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) -{ - int nid; - - for_each_node(nid) { - struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; - unsigned long stat[NR_VM_NODE_STAT_ITEMS]; - struct batched_lruvec_stat *lstatc; - int i; - - lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { - stat[i] = lstatc->count[i]; - lstatc->count[i] = 0; - } - - do { - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) - atomic_long_add(stat[i], &pn->lruvec_stat[i]); - } while ((pn = parent_nodeinfo(pn, nid))); - } -} - static int memcg_hotplug_cpu_dead(unsigned int cpu) { struct memcg_stock_pcp *stock; - struct mem_cgroup *memcg;
stock = &per_cpu(memcg_stock, cpu); drain_stock(stock);
- for_each_mem_cgroup(memcg) - memcg_flush_lruvec_page_state(memcg, cpu); - return 0; }
@@ -5242,17 +5189,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return 1;
- pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, - GFP_KERNEL_ACCOUNT); - if (!pn->lruvec_stat_local) { - kfree(pn); - return 1; - } - - pn->lruvec_stat_cpu = alloc_percpu_gfp(struct batched_lruvec_stat, - GFP_KERNEL_ACCOUNT); - if (!pn->lruvec_stat_cpu) { - free_percpu(pn->lruvec_stat_local); + pn->lruvec_stats_percpu = alloc_percpu_gfp(struct lruvec_stats_percpu, + GFP_KERNEL_ACCOUNT); + if (!pn->lruvec_stats_percpu) { kfree(pn); return 1; } @@ -5273,8 +5212,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return;
- free_percpu(pn->lruvec_stat_cpu); - free_percpu(pn->lruvec_stat_local); + free_percpu(pn->lruvec_stats_percpu); kfree(pn); }
@@ -5290,15 +5228,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
static void mem_cgroup_free(struct mem_cgroup *memcg) { - int cpu; - memcg_wb_domain_exit(memcg); - /* - * Flush percpu lruvec stats to guarantee the value - * correctness on parent's and all ancestor levels. - */ - for_each_online_cpu(cpu) - memcg_flush_lruvec_page_state(memcg, cpu); __mem_cgroup_free(memcg); }
@@ -5551,7 +5481,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) struct mem_cgroup *parent = parent_mem_cgroup(memcg); struct memcg_vmstats_percpu *statc; long delta, v; - int i; + int i, nid;
statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
@@ -5599,6 +5529,36 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) if (parent) parent->vmstats.events_pending[i] += delta; } + + for_each_node_state(nid, N_MEMORY) { + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; + struct mem_cgroup_per_node *ppn = NULL; + struct lruvec_stats_percpu *lstatc; + + if (parent) + ppn = parent->nodeinfo[nid]; + + lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu); + + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { + delta = pn->lruvec_stats.state_pending[i]; + if (delta) + pn->lruvec_stats.state_pending[i] = 0; + + v = READ_ONCE(lstatc->state[i]); + if (v != lstatc->state_prev[i]) { + delta += v - lstatc->state_prev[i]; + lstatc->state_prev[i] = v; + } + + if (!delta) + continue; + + pn->lruvec_stats.state[i] += delta; + if (ppn) + ppn->lruvec_stats.state_pending[i] += delta; + } + } }
#ifdef CONFIG_MMU @@ -6543,6 +6503,8 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) int i; struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+ cgroup_rstat_flush(memcg->css.cgroup); + for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { int nid;
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.15-rc1 commit aa48e47e3906c332eaf1e5d7b58be11d3509ad9f category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
------------------------------------------------------
At the moment memcg stats are read in four contexts:
1. memcg stat user interfaces
2. dirty throttling
3. page fault
4. memory reclaim
Currently the kernel flushes the stats for the first two cases. Flushing the stats for the remaining two may have a performance impact: always flushing the memcg stats on the page fault code path would negatively impact application performance, and flushing in the memory reclaim code path, though treated as a slowpath, can become a source of contention on the global lock taken for stat flushing, because when the system or a memcg is under memory pressure many tasks may enter the reclaim path.
This patch uses the following mechanisms to solve these challenges (a condensed sketch of the resulting code follows the list):
1. Periodically flush the stats from the root memcg every 2 seconds. This puts a time limit on how long the stats can stay out of sync.
2. Asynchronously flush the stats after a fixed number of stat updates. In the worst case, a stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds.
3. To avoid a thundering herd of flushers, particularly from the memory reclaim context, introduce a memcg-local spinlock and let only one flusher be active at a time. This could have been done with the cgroup_rstat_lock, but that lock is used by other subsystems and for userspace reads of memcg stats, so it is better to keep the flushers introduced by this patch decoupled from it. This does require the irqsafe version of the rstat flush, but that is fine: this code path flushes the whole tree and does the work for everyone, so no one has to wait on that worker.
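Condensed, the scheme added by the hunks below looks roughly like this (a simplified sketch, not the exact patch):

        /* memcg and lruvec stats flushing */
        static void flush_memcg_stats_dwork(struct work_struct *w);
        static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
        static void flush_memcg_stats_work(struct work_struct *w);
        static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work);
        static DEFINE_PER_CPU(unsigned int, stats_flush_threshold);
        static DEFINE_SPINLOCK(stats_flush_lock);

        /* (3) only one flusher at a time; everyone else simply returns */
        void mem_cgroup_flush_stats(void)
        {
                if (!spin_trylock(&stats_flush_lock))
                        return;

                cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
                spin_unlock(&stats_flush_lock);
        }

        /* (1) periodic flush, re-armed every 2 seconds */
        static void flush_memcg_stats_dwork(struct work_struct *w)
        {
                mem_cgroup_flush_stats();
                queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
        }

        /* (2) in __mod_memcg_lruvec_state(): async flush after a batch of updates */
        if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
                queue_work(system_unbound_wq, &stats_flush_work);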
[shakeelb@google.com: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com
Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.com Signed-off-by: Shakeel Butt shakeelb@google.com Tested-by: Marek Szyprowski m.szyprowski@samsung.com Cc: Hillf Danton hdanton@sina.com Cc: Huang Ying ying.huang@intel.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Michal Koutný mkoutny@suse.com Cc: Muchun Song songmuchun@bytedance.com Cc: Roman Gushchin guro@fb.com Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Conflict: include/linux/memcontrol.h Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/memcontrol.h | 6 ++++++ mm/memcontrol.c | 34 ++++++++++++++++++++++++++++++++++ mm/vmscan.c | 6 ++++++ 3 files changed, 46 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index fdff0c8f624f..34e4e8de93ce 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1075,6 +1075,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; }
+void mem_cgroup_flush_stats(void); + void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val); void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, @@ -1546,6 +1548,10 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx); }
+static inline void mem_cgroup_flush_stats(void) +{ +} + static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 08e6887fca0d..440ca9225aaa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -102,6 +102,14 @@ static bool do_memsw_account(void) return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap; }
+/* memcg and lruvec stats flushing */ +static void flush_memcg_stats_dwork(struct work_struct *w); +static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); +static void flush_memcg_stats_work(struct work_struct *w); +static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work); +static DEFINE_PER_CPU(unsigned int, stats_flush_threshold); +static DEFINE_SPINLOCK(stats_flush_lock); + #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024
@@ -691,6 +699,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
/* Update lruvec */ __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); + if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH)) + queue_work(system_unbound_wq, &stats_flush_work); }
/** @@ -5384,6 +5394,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) /* Online state pins memcg ID, memcg ID pins CSS */ refcount_set(&memcg->id.ref, 1); css_get(css); + + if (unlikely(mem_cgroup_is_root(memcg))) + queue_delayed_work(system_unbound_wq, &stats_flush_dwork, + 2UL*HZ); return 0; }
@@ -5475,6 +5489,26 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) memcg_wb_domain_size_changed(memcg); }
+void mem_cgroup_flush_stats(void) +{ + if (!spin_trylock(&stats_flush_lock)) + return; + + cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup); + spin_unlock(&stats_flush_lock); +} + +static void flush_memcg_stats_dwork(struct work_struct *w) +{ + mem_cgroup_flush_stats(); + queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); +} + +static void flush_memcg_stats_work(struct work_struct *w) +{ + mem_cgroup_flush_stats(); +} + static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); diff --git a/mm/vmscan.c b/mm/vmscan.c index 732356256b26..c851e5f91842 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2903,6 +2903,12 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
again: + /* + * Flush the memory cgroup stats, so that we read accurate per-memcg + * lruvec stats for heuristics. + */ + mem_cgroup_flush_stats(); + memset(&sc->nr, 0, sizeof(sc->nr));
nr_reclaimed = sc->nr_reclaimed;
From: Miaohe Lin linmiaohe@huawei.com
mainline inclusion from mainline-v5.15-rc1 commit bec49c067c679e9b7ca7c1aac50b56618c12d879 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------------------------------------------------
Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), the last user of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the declaration of mem_cgroup_select_victim_node() remains here. Remove them both.
Link: https://lkml.kernel.org/r/20210807082835.61281-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Shakeel Butt shakeelb@google.com Reviewed-by: Muchun Song songmuchun@bytedance.com Acked-by: Roman Gushchin guro@fb.com Acked-by: Michal Hocko mhocko@suse.com Cc: Alex Shi alexs@kernel.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Wei Yang richard.weiyang@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/memcontrol.h | 12 ------------ 1 file changed, 12 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 34e4e8de93ce..8bd428741de8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -638,13 +638,6 @@ static inline bool set_page_objcgs(struct page *page, } #endif
-static __always_inline bool memcg_stat_item_in_bytes(int idx) -{ - if (idx == MEMCG_PERCPU_B) - return true; - return vmstat_item_in_bytes(idx); -} - static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) { return (memcg == root_mem_cgroup); @@ -930,11 +923,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return !!(memcg->css.flags & CSS_ONLINE); }
-/* - * For memory reclaim. - */ -int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); - void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages);
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.15-rc3 commit 1f828223b7991a228bc2aef837b78737946d44b2 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------------------------------------------------
Prior to the commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat") and the commit aa48e47e3906 ("memcg: infrastructure to flush memcg stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus * 32) at worst, and for an unbounded amount of time. The commit 7e1c0d6f5820 moved the lruvec stats to the rstat infrastructure, and the commit aa48e47e3906 bounded the error for all the lruvec stats to (nr_cpus * 32) at worst for at most 2 seconds. More specifically, it decoupled the number of stats and the number of cgroups from the error rate.
However this reduction in error comes at the cost of triggering the slowpath of stats updates more frequently. Previously, the slowpath added the stats up the memcg tree; after aa48e47e3906, the kernel triggers an async lruvec stats flush through queue_work(). This caused regression reports from the 0day kernel bot [1] as well as from the phoronix test suite [2].
We tried two options to fix the regression:
1) Increase the threshold that triggers the slowpath in the lruvec stats update codepath from 32 to 512.
2) Remove the slowpath from the lruvec stats update codepath and instead flush the stats in the page refault codepath. The assumption is that the kernel flushes the stats in a timely manner, so the update tree seen by the refault codepath stays small enough not to cause a performance impact.
Following are the results of the will-it-scale/page_fault[1|2|3] benchmark on four settings: (1) 5.15-rc1 as baseline, (2) 5.15-rc1 with aa48e47e3906 and 7e1c0d6f5820 reverted, (3) 5.15-rc1 with option-1, and (4) 5.15-rc1 with option-2.
test     (1)      (2)               (3)               (4)
pg_f1    368563   406277 (10.23%)   399693 (8.44%)    416398 (12.97%)
pg_f2    338399   372133 (9.96%)    369180 (9.09%)    381024 (12.59%)
pg_f3    500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
From the above results, option-2 not only resolves the regression but also improves performance, at least for these benchmarks.
Feng Tang (Intel) ran the aim7 benchmark with these two options and confirms that option-1 reduces the regression while option-2 removes it.
Michael Larabel (Phoronix) ran multiple benchmarks with these options and reported the results at [3]; they show that for most benchmarks option-2 removes the regression introduced by the commit aa48e47e3906 ("memcg: infrastructure to flush memcg stats").
Based on these experimental results, this patch adopts option-2 as the solution to resolve the regression.
Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1] Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-reg... [2] Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3] Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats") Signed-off-by: Shakeel Butt shakeelb@google.com Tested-by: Michael Larabel Michael@phoronix.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Roman Gushchin guro@fb.com Cc: Feng Tang feng.tang@intel.com Cc: Michal Hocko mhocko@kernel.org Cc: Hillf Danton hdanton@sina.com, Cc: Michal Koutný mkoutny@suse.com Cc: Andrew Morton akpm@linux-foundation.org, Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/memcontrol.c | 10 ---------- mm/workingset.c | 1 + 2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 440ca9225aaa..bd269746fa98 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -105,9 +105,6 @@ static bool do_memsw_account(void) /* memcg and lruvec stats flushing */ static void flush_memcg_stats_dwork(struct work_struct *w); static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); -static void flush_memcg_stats_work(struct work_struct *w); -static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work); -static DEFINE_PER_CPU(unsigned int, stats_flush_threshold); static DEFINE_SPINLOCK(stats_flush_lock);
#define THRESHOLDS_EVENTS_TARGET 128 @@ -699,8 +696,6 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
/* Update lruvec */ __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); - if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH)) - queue_work(system_unbound_wq, &stats_flush_work); }
/** @@ -5504,11 +5499,6 @@ static void flush_memcg_stats_dwork(struct work_struct *w) queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); }
-static void flush_memcg_stats_work(struct work_struct *w) -{ - mem_cgroup_flush_stats(); -} - static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); diff --git a/mm/workingset.c b/mm/workingset.c index 3c0bae62e8bf..4a30e4a813a5 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -350,6 +350,7 @@ void workingset_refault(struct page *page, void *shadow)
inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
+ mem_cgroup_flush_stats(); /* * Compare the distance to the existing workingset size. We * don't activate pages that couldn't stay resident even if
From: Tejun Heo tj@kernel.org
mainline inclusion from mainline-v5.16-rc1 commit 3c08b0931eedd04c530040499fadeccab50ed646 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------------------------------------
c3df5fb57fe8 ("cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks. Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it.
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Cc: stable@vger.kernel.org # v5.13+ Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com Signed-off-by: Tejun Heo tj@kernel.org Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-cgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 7f7d41236838..bd0c2bec05a8 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1932,10 +1932,11 @@ void blk_cgroup_bio_start(struct bio *bio) { int rwd = blk_cgroup_io_type(bio), cpu; struct blkg_iostat_set *bis; + unsigned long flags;
cpu = get_cpu(); bis = per_cpu_ptr(bio->bi_blkg->iostat_cpu, cpu); - u64_stats_update_begin(&bis->sync); + flags = u64_stats_update_begin_irqsave(&bis->sync);
/* * If the bio is flagged with BIO_CGROUP_ACCT it means this is a split @@ -1947,7 +1948,7 @@ void blk_cgroup_bio_start(struct bio *bio) } bis->cur.ios[rwd]++;
- u64_stats_update_end(&bis->sync); + u64_stats_update_end_irqrestore(&bis->sync, flags); if (cgroup_subsys_on_dfl(io_cgrp_subsys)) cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu); put_cpu();
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.16-rc1 commit 11192d9c124d58d66449b163ed0d2cdff03761a1 category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
At the moment, the kernel flushes the memcg stats on every refault and also on every reclaim iteration. Although rstat maintains a per-cpu update tree, on flush the kernel still has to walk every cpu's rstat update tree to check whether there is anything to flush. This patch adds tracking on the stats update side to make the flush side smarter, skipping the flush when there has been no update.
The stats update codepath is very performance sensitive for many workloads and benchmarks, so we cannot follow what the commit aa48e47e3906 ("memcg: infrastructure to flush memcg stats") did, which was triggering an async flush through queue_work() and caused a lot of performance regression reports. That got reverted by the commit 1f828223b799 ("memcg: flush lruvec stats in the refault").
In this patch we keep the stats update codepath very minimal and let the stats reader side flush the stats only when the number of updates exceeds a specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).
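Condensed from the hunk below, the update side only bumps a cheap per-cpu counter, and the reader decides whether a flush is worthwhile (a sketch, not the full patch):

        static DEFINE_PER_CPU(unsigned int, stats_updates);
        static atomic_t stats_flush_threshold = ATOMIC_INIT(0);

        /* called from every memcg stat and event update */
        static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
        {
                cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
                if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
                        atomic_inc(&stats_flush_threshold);
        }

        /* readers skip the flush entirely when there is too little to flush */
        void mem_cgroup_flush_stats(void)
        {
                if (atomic_read(&stats_flush_threshold) > num_online_cpus())
                        __mem_cgroup_flush_stats();
        }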
To evaluate the impact of this patch, an 8 GiB tmpfs file was created on a system with swap-on-zram, and the file was pushed to swap through the memory.force_empty interface. Reading the whole file back triggers the memcg stat flush in the refault code path. With this patch, we observed a 63% reduction in the read time of the 8 GiB file.
Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.com Signed-off-by: Shakeel Butt shakeelb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Reviewed-by: "Michal Koutný" mkoutny@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/memcontrol.c | 78 ++++++++++++++++++++++++++++++++++--------------- 1 file changed, 55 insertions(+), 23 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bd269746fa98..7d08d38bee6c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -102,11 +102,6 @@ static bool do_memsw_account(void) return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap; }
-/* memcg and lruvec stats flushing */ -static void flush_memcg_stats_dwork(struct work_struct *w); -static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); -static DEFINE_SPINLOCK(stats_flush_lock); - #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024
@@ -641,6 +636,56 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) return mz; }
+/* + * memcg and lruvec stats flushing + * + * Many codepaths leading to stats update or read are performance sensitive and + * adding stats flushing in such codepaths is not desirable. So, to optimize the + * flushing the kernel does: + * + * 1) Periodically and asynchronously flush the stats every 2 seconds to not let + * rstat update tree grow unbounded. + * + * 2) Flush the stats synchronously on reader side only when there are more than + * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization + * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but + * only for 2 seconds due to (1). + */ +static void flush_memcg_stats_dwork(struct work_struct *w); +static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); +static DEFINE_SPINLOCK(stats_flush_lock); +static DEFINE_PER_CPU(unsigned int, stats_updates); +static atomic_t stats_flush_threshold = ATOMIC_INIT(0); + +static inline void memcg_rstat_updated(struct mem_cgroup *memcg) +{ + cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); + if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH)) + atomic_inc(&stats_flush_threshold); +} + +static void __mem_cgroup_flush_stats(void) +{ + if (!spin_trylock(&stats_flush_lock)) + return; + + cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup); + atomic_set(&stats_flush_threshold, 0); + spin_unlock(&stats_flush_lock); +} + +void mem_cgroup_flush_stats(void) +{ + if (atomic_read(&stats_flush_threshold) > num_online_cpus()) + __mem_cgroup_flush_stats(); +} + +static void flush_memcg_stats_dwork(struct work_struct *w) +{ + mem_cgroup_flush_stats(); + queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); +} + /** * __mod_memcg_state - update cgroup memory statistics * @memcg: the memory cgroup @@ -653,7 +698,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) return;
__this_cpu_add(memcg->vmstats_percpu->state[idx], val); - cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); + memcg_rstat_updated(memcg); }
/* idx can be of type enum memcg_stat_item or node_stat_item. */ @@ -692,10 +737,12 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, memcg = pn->memcg;
/* Update memcg */ - __mod_memcg_state(memcg, idx, val); + __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
/* Update lruvec */ __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); + + memcg_rstat_updated(memcg); }
/** @@ -767,7 +814,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, return;
__this_cpu_add(memcg->vmstats_percpu->events[idx], count); - cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); + memcg_rstat_updated(memcg); }
static unsigned long memcg_events(struct mem_cgroup *memcg, int event) @@ -5484,21 +5531,6 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) memcg_wb_domain_size_changed(memcg); }
-void mem_cgroup_flush_stats(void) -{ - if (!spin_trylock(&stats_flush_lock)) - return; - - cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup); - spin_unlock(&stats_flush_lock); -} - -static void flush_memcg_stats_dwork(struct work_struct *w) -{ - mem_cgroup_flush_stats(); - queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); -} - static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) { struct mem_cgroup *memcg = mem_cgroup_from_css(css);
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.16-rc1 commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf category: feature bugzilla: 185803 https://gitee.com/openeuler/kernel/issues/I4JOG9?from=project-issue CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------------------------------------------------
The memcg stats can be flushed in multiple contexts, and potentially in parallel too. For example, multiple parallel userspace readers of memcg stats will contend with each other on the rstat locks. There is no need for that: one flusher is enough, and everyone else can benefit from its work.
In addition, after aa48e47e3906 ("memcg: infrastructure to flush memcg stats") the kernel periodically flushes the memcg stats from the root, so the other flushers will potentially have much less work to do.
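The resulting single-flusher path, condensed from the hunk below (a sketch of the end state; the irqsave variant is used because some callers flush from contexts with interrupts disabled):

        static void __mem_cgroup_flush_stats(void)
        {
                unsigned long flag;

                /* somebody else is already flushing on everyone's behalf */
                if (!spin_trylock_irqsave(&stats_flush_lock, flag))
                        return;

                cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
                atomic_set(&stats_flush_threshold, 0);
                spin_unlock_irqrestore(&stats_flush_lock, flag);
        }

All read-side call sites (memory_stat_format(), memcg_stat_show(), memcg_numa_stat_show(), mem_cgroup_wb_stats(), mem_cgroup_usage(), memory_numa_stat_show()) then funnel through mem_cgroup_flush_stats().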
Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com Signed-off-by: Shakeel Butt shakeelb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: "Michal Koutný" mkoutny@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/memcontrol.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7d08d38bee6c..f05321e7caa9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -666,12 +666,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
static void __mem_cgroup_flush_stats(void) { - if (!spin_trylock(&stats_flush_lock)) + unsigned long flag; + + if (!spin_trylock_irqsave(&stats_flush_lock, flag)) return;
cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup); atomic_set(&stats_flush_threshold, 0); - spin_unlock(&stats_flush_lock); + spin_unlock_irqrestore(&stats_flush_lock, flag); }
void mem_cgroup_flush_stats(void) @@ -1508,7 +1510,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg) * * Current memory state: */ - cgroup_rstat_flush(memcg->css.cgroup); + mem_cgroup_flush_stats();
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { u64 size; @@ -3491,8 +3493,7 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) unsigned long val;
if (mem_cgroup_is_root(memcg)) { - /* mem_cgroup_threshold() calls here from irqsafe context */ - cgroup_rstat_flush_irqsafe(memcg->css.cgroup); + mem_cgroup_flush_stats(); val = memcg_page_state(memcg, NR_FILE_PAGES) + memcg_page_state(memcg, NR_ANON_MAPPED); if (swap) @@ -4055,7 +4056,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) int nid; struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- cgroup_rstat_flush(memcg->css.cgroup); + mem_cgroup_flush_stats();
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { seq_printf(m, "%s=%lu", stat->name, @@ -4127,7 +4128,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
- cgroup_rstat_flush(memcg->css.cgroup); + mem_cgroup_flush_stats();
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { unsigned long nr; @@ -4638,7 +4639,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); struct mem_cgroup *parent;
- cgroup_rstat_flush_irqsafe(memcg->css.cgroup); + mem_cgroup_flush_stats();
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); @@ -6559,7 +6560,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) int i; struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- cgroup_rstat_flush(memcg->css.cgroup); + mem_cgroup_flush_stats();
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { int nid;