From: Roman Gushchin <guro@fb.com>
mainline inclusion
from mainline-5.3-rc1
commit c03914b7aa319fb2b6701a6427c13752c7418b9b
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

Patch series "mm: reparent slab memory on cgroup removal", v7.
We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], an issue in the accounting of kernel stacks [2], and also the main reason: slab objects.
The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
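To make the lifetime issue concrete, here is a deliberately simplified standalone model; the toy_* names are made up for this illustration and do not correspond to the real kernel code paths. Every charged page takes a reference on its memcg, so a single long-lived shared object is enough to keep an already-removed cgroup pinned:

#include <stdio.h>

/* Toy model only: one plain refcount stands in for the real css reference. */
struct toy_memcg {
	const char *name;
	int refcount;		/* held by the hierarchy and by every charged page */
};

struct toy_page {
	struct toy_memcg *memcg;	/* set at charge time, like page->mem_cgroup */
};

static void toy_put(struct toy_memcg *memcg)
{
	if (--memcg->refcount == 0)
		printf("%s: fully released\n", memcg->name);
	else
		printf("%s: still pinned by %d reference(s)\n",
		       memcg->name, memcg->refcount);
}

/* Charging a page takes a reference: the page now keeps the cgroup alive. */
static void toy_charge(struct toy_page *page, struct toy_memcg *memcg)
{
	page->memcg = memcg;
	memcg->refcount++;
}

int main(void)
{
	struct toy_memcg job = { .name = "job-1", .refcount = 1 };
	struct toy_page slab_page;

	toy_charge(&slab_page, &job);	/* e.g. a dentry/inode slab page */

	toy_put(&job);			/* cgroup removed: the hierarchy drops its ref */
	toy_put(slab_page.memcg);	/* only when the page is freed does "job-1" go away */
	return 0;
}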
Slab objects, and first of all the vfs cache, are shared between cgroups that use the same underlying filesystem and, what's even more important, between multiple generations of the same workload. So if something runs periodically, each time in a new cgroup (which is how systemd works), we do accumulate multiple dying cgroups.
Strictly speaking, the pagecache isn't different here, but there is one key distinction: we disable protection and apply some extra pressure on the LRUs of dying cgroups, and these LRUs contain all charged pages. My experiments show that with kernel memory accounting disabled the number of dying cgroups stabilizes at a relatively small number (~100, depending on memory pressure and the cgroup creation rate), while with kernel memory accounting enabled it grows pretty steadily, up to several thousands.
Memory cgroups are quite complex and big objects (mostly due to percpu stats), so this leads to noticeable memory losses. The memory occupied by dying cgroups is measured in hundreds of megabytes; I've even seen a host with more than 100Gb of memory wasted on dying cgroups. This causes performance to degrade with uptime and generally limits the usage of cgroups.
My previous attempt [3] to fix the problem by applying extra pressure on slab shrinker lists caused regressions with xfs and ext4, and has been reverted [4]. The following attempts to find the right balance [5, 6] were not successful.
So instead of trying to find a possibly non-existing balance, let's reparent accounted slab caches to the parent cgroup on cgroup removal.
There is, however, a significant problem with reparenting of slab memory: there is no list of charged pages. Some of them are on shrinker lists, but not all. Introducing a new list is really not an option.
But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages.
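In code, the indirection boils down to a lookup along these lines (an illustrative sketch only, not something added by this particular patch; page->slab_cache and memcg_params.memcg are existing fields, while the helper name is made up):

/* Sketch: resolve a slab page's memcg through its kmem_cache. */
static inline struct mem_cgroup *slab_page_memcg(struct page *page)
{
	struct kmem_cache *s = READ_ONCE(page->slab_cache);

	/* Root caches are not accounted to any memory cgroup. */
	if (!s || is_root_cache(s))
		return NULL;

	return s->memcg_params.memcg;
}

Because the cache, not the page, owns the memcg pointer, reparenting means switching a handful of kmem_caches over to the parent cgroup instead of walking every charged page.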
It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches hold a single reference to the memory cgroup, instead of
   a separate reference per slab page.
2) Stop setting the page->mem_cgroup pointer for memcg slab pages and use the
   page->kmem_cache->memcg indirection instead. It's used only on slab page
   release, so the performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches (sketched below). It's
   required to be able to destroy kmem_caches when they become empty and
   release the associated memory cgroup.
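Point 3 can be pictured by extending the toy model from the earlier sketch (again purely illustrative; the plain counter below only stands in for whatever refcounting the later patches actually implement): a non-root cache holds a single reference to its memcg and drops it when the last object goes away, so the cache can be destroyed independently of the cgroup.

/* Toy sketch of point 3, reusing struct toy_memcg and toy_put() from above. */
struct toy_cache {
	struct toy_memcg *memcg;	/* single reference, taken at creation */
	long nr_objects;		/* stands in for the cache refcounter */
};

static void toy_cache_free_object(struct toy_cache *cache)
{
	if (--cache->nr_objects == 0) {
		/* Cache is empty: it can be destroyed now, unpinning the memcg. */
		toy_put(cache->memcg);
	}
}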
There is a bonus: currently we release all memcg kmem_caches together with the memory cgroup itself. This patchset allows individual kmem_caches to be released as soon as they become inactive and free.
Some additional implementation details are provided in corresponding commit messages.
Below is the average number of dying cgroups on two groups of our production hosts. They run some sort of web frontend workload, and the memory pressure is moderate. As we can see, with kernel memory reparenting the number stabilizes in the 60s range; with the original version it grows almost linearly and doesn't show any signs of plateauing. The difference in slab and percpu usage between the patched and unpatched versions also grows linearly; in 7 days it exceeded 200Mb.
day           0    1    2    3    4    5    6    7
original     56  362  628  752 1070 1250 1490 1560
patched      23   46   51   55   60   57   67   69
mem diff(Mb) 22   74  123  152  164  182  214  241
[1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects"")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
This patch (of 10):
Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params().
Once kmem_cache starts holding a reference to the memory cgroup, this will simplify the refcounting.
For non-root kmem_caches, memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe.
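For reference, memcg_link_cache() after this patch reads roughly as follows. This is stitched together from the hunks below; the target of the final list_add() is not visible in the hunk and is taken from the surrounding mm/slab_common.c, so treat it as a reading aid rather than part of the diff:

void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
{
	if (is_root_cache(s)) {
		list_add(&s->root_caches_node, &slab_root_caches);
	} else {
		/* New in this patch: record the memcg at link time. */
		s->memcg_params.memcg = memcg;
		list_add(&s->memcg_params.children_node,
			 &s->memcg_params.root_cache->memcg_params.children);
		list_add(&s->memcg_params.kmem_caches_node,
			 &s->memcg_params.memcg->kmem_caches);
	}
}

Root caches, including the boot-time callers in kmem_cache_init(), create_kmalloc_cache() and the slub bootstrap path, simply pass NULL, as they aren't bound to any memcg.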
Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c03914b7aa319fb2b6701a6427c13752c7418b9b)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/slab.c        |  2 +-
 mm/slab.h        |  5 +++--
 mm/slab_common.c | 14 +++++++-------
 mm/slub.c        |  2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 34f59802551d..d876c379d966 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1284,7 +1284,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
+	memcg_link_cache(kmem_cache, NULL);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index 68ad9d348771..ab218e0d1f0b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -289,7 +289,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
+extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 				void (*deact_fn)(struct kmem_cache *));
 
@@ -344,7 +344,8 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
+static inline void memcg_link_cache(struct kmem_cache *s,
+				    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e102f27ea011..8673c2e4ac54 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -141,13 +141,12 @@ void slab_init_memcg_params(struct kmem_cache *s)
 }
 
 static int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	struct memcg_cache_array *arr;
 
 	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
-		s->memcg_params.memcg = memcg;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
@@ -222,11 +221,12 @@ int memcg_update_all_caches(int num_memcgs)
 	return ret;
 }
 
-void memcg_link_cache(struct kmem_cache *s)
+void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 {
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
 		list_add(&s->memcg_params.kmem_caches_node,
@@ -245,7 +245,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	return 0;
 }
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, memcg, root_cache);
+	err = init_memcg_params(s, root_cache);
 	if (err)
 		goto out_free_cache;
 
@@ -403,7 +403,7 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, memcg);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -999,7 +999,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	s->refcount = 1;
 	return s;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 32bec9c3485e..8c3de980ee07 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4216,7 +4216,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	return s;
 }