From: Roman Gushchin <guro@fb.com>
mainline inclusion
from mainline-5.7-rc1
commit 48fe267c503ec22014ba4e83d002b07caad034d0
category: bugfix
bugzilla: 33351
CVE: NA
-------------------------------------------------

If a task is moved out of the OOMing cgroup, it can result in
unexpected OOM killings if memory.oom.group is used anywhere in the
cgroup tree.
Imagine the following example:
          A (oom.group = 1)
         / \
  (OOM) B   C
Let's say B's memory.max is exceeded and it's OOMing. The OOM killer
selects a task in B as a victim, but someone asynchronously moves the
task into C. mem_cgroup_get_oom_group() will iterate over all
ancestors of C up to the root cgroup. In theory it should stop at the
oom_domain level - the memory cgroup which is OOMing. But because B is
not an ancestor of C, that never happens. Instead it chooses A
(because its oom.group is set), and kills all tasks in A. This
behavior is wrong because the OOM happened in B, so there is no reason
to kill anything outside of it.

Fix this by checking whether the memory cgroup to which the task
belongs is a descendant of the oom_domain. If it is not,
memory.oom.group should be ignored, and the OOM killer should kill
only the victim task.
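In code terms, the fix amounts to the guard sketched here (a
simplified excerpt of the hunk below, shown only for readability):

/* memcg: the victim task's memory cgroup; oom_domain: the OOMing memcg */
if (unlikely(!mem_cgroup_is_descendant(memcg, oom_domain)))
	goto out;	/* ignore memory.oom.group, kill only the victim */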
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/20200316223510.3176148-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 48fe267c503ec22014ba4e83d002b07caad034d0)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memcontrol.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1342b9540476..8611f3301686 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1856,6 +1856,14 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
 	if (memcg == root_mem_cgroup)
 		goto out;
 
+	/*
+	 * If the victim task has been asynchronously moved to a different
+	 * memory cgroup, we might end up killing tasks outside oom_domain.
+	 * In this case it's better to ignore memory.group.oom.
+	 */
+	if (unlikely(!mem_cgroup_is_descendant(memcg, oom_domain)))
+		goto out;
+
 	/*
 	 * Traverse the memory cgroup hierarchy from the victim task's
 	 * cgroup up to the OOMing cgroup (or root) to find the
From: Yafang Shao <laoar.shao@gmail.com>
mainline inclusion
from mainline-5.7-rc1
commit a87425a36fb2a98f560543703b3ce92270cdedd8
category: bugfix
bugzilla: 33359
CVE: NA
-------------------------------------------------

When I manually set MEMCG_KMEM to default n in init/Kconfig, the error
below occurs:
mm/slab_common.c: In function 'memcg_slab_start':
mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named 'kmem_caches'
  return seq_list_start(&memcg->kmem_caches, *pos);
                              ^
mm/slab_common.c: In function 'memcg_slab_next':
mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named 'kmem_caches'
  return seq_list_next(p, &memcg->kmem_caches, pos);
                                ^
mm/slab_common.c: In function 'memcg_slab_show':
mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named 'kmem_caches'
  if (p == memcg->kmem_caches.next)
                ^
  CC      arch/x86/xen/smp.o
mm/slab_common.c: In function 'memcg_slab_start':
mm/slab_common.c:1531:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
mm/slab_common.c: In function 'memcg_slab_next':
mm/slab_common.c:1538:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is
set, while memcg_slab_start() uses it whether or not CONFIG_MEMCG_KMEM
is defined.

By the way, the reason I manually disabled CONFIG_MEMCG_KMEM was to
verify whether another code change of mine is still stable when
CONFIG_MEMCG_KMEM is not set. Unfortunately, the existing code has
already been unstable since v4.11.
Fixes: bc2791f857e1 ("slab: link memcg kmem_caches on their associated memory cgroup")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.c...
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a87425a36fb2a98f560543703b3ce92270cdedd8)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memcontrol.c  | 3 ++-
 mm/slab_common.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8611f3301686..66f87c8cfd7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4364,7 +4364,8 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.write = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read_u64,
 	},
-#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
+#if defined(CONFIG_MEMCG_KMEM) && \
+	(defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG))
 	{
 		.name = "kmem.slabinfo",
 		.seq_start = memcg_slab_start,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a94b9981eb17..a3105880e86d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1406,7 +1406,7 @@ void dump_unreclaimable_slab(void)
 	mutex_unlock(&slab_mutex);
 }
 
-#if defined(CONFIG_MEMCG)
+#if defined(CONFIG_MEMCG_KMEM)
 void *memcg_slab_start(struct seq_file *m, loff_t *pos)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
From: "Paul E. McKenney" paulmck@linux.ibm.com
mainline inclusion
from mainline-5.0-rc1
commit 6564a25e6c185e65ca3148ed6e18f80882f6798f
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu().  This commit therefore makes this
change.
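For context, the pattern being relied on in setup_kmem_cache_node()
looks roughly like this (a simplified sketch, not the exact kernel
code): readers access n->shared locklessly with IRQs disabled, and
since synchronize_rcu() now also waits for preempt/IRQ-disabled
regions, a single primitive is enough before the old array is freed:

old_shared = n->shared;
n->shared = new_shared;			/* publish the replacement */

/* wait for lockless readers running with IRQs disabled */
if (old_shared && force_change)
	synchronize_rcu();

kfree(old_shared);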
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
(cherry picked from commit 6564a25e6c185e65ca3148ed6e18f80882f6798f)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/slab.c        | 4 ++--
 mm/slab_common.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index c96657204508..9a80511afb47 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -965,10 +965,10 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
 	 * To protect lockless access to n->shared during irq disabled context.
 	 * If n->shared isn't NULL in irq disabled context, accessing to it is
 	 * guaranteed to be valid until irq is re-enabled, because it will be
-	 * freed after synchronize_sched().
+	 * freed after synchronize_rcu().
 	 */
 	if (old_shared && force_change)
-		synchronize_sched();
+		synchronize_rcu();
 
 fail:
 	kfree(old_shared);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a3105880e86d..5bb1afcce3d2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -739,7 +739,7 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 	css_get(&s->memcg_params.memcg->css);
 
 	s->memcg_params.deact_fn = deact_fn;
-	call_rcu_sched(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+	call_rcu(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
 unlock:
 	spin_unlock_irq(&memcg_kmem_wq_lock);
 }
@@ -859,11 +859,11 @@ static void memcg_set_kmem_cache_dying(struct kmem_cache *s)
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
 	/*
-	 * SLUB deactivates the kmem_caches through call_rcu_sched. Make
+	 * SLUB deactivates the kmem_caches through call_rcu. Make
	 * sure all registered rcu callbacks have been invoked.
	 */
 	if (IS_ENABLED(CONFIG_SLUB))
-		rcu_barrier_sched();
+		rcu_barrier();
 
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
From: Vlastimil Babka <vbabka@suse.cz>
mainline inclusion
from mainline-4.20-rc1
commit cc252eae85e09552f9c1e7ac0c3227f835efdf2d
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly,
it also ensures the reclaimable kmalloc allocations are grouped in
pages separate from the regular kmalloc allocations. The need for
proper accounting of dcache external names has shown that it's easy
for a misbehaving process to allocate lots of them, causing premature
OOMs. Without the added grouping, it's likely that a similar workload
can interleave the dcache external name allocations with regular
kmalloc allocations (note: I haven't searched myself for an example of
such a regular kmalloc allocation, but I would be very surprised if
there wasn't one). A pathological case would be e.g. a single 64-byte
regular allocation sharing a page with 63 external dcache names
(64*64 = 4096), which means the page is not freed even after all the
dcache names are reclaimed, and the process can thus "steal" the whole
page with a single 64-byte allocation.
If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could also be merged separately)
include a removed branch for detecting __GFP_DMA in kmalloc(), and
shortened kmalloc cache names in /proc/slabinfo output. The latter is
potentially an ABI break if there are tools parsing the names and
expecting the values to be in bytes.

This is how /proc/slabinfo looks after booting in virtme:
...
kmalloc-rcl-4M         0      0 4194304    1 1024 : tunables  1  1  0 : slabdata  0  0  0
...
kmalloc-rcl-96         7     32     128   32    1 : tunables 120 60  8 : slabdata  1  1  0
kmalloc-rcl-64        25    128      64   64    1 : tunables 120 60  8 : slabdata  2  2  0
kmalloc-rcl-32         0      0      32  124    1 : tunables 120 60  8 : slabdata  0  0  0
kmalloc-4M             0      0 4194304    1 1024 : tunables  1  1  0 : slabdata  0  0  0
kmalloc-2M             0      0 2097152    1  512 : tunables  1  1  0 : slabdata  0  0  0
kmalloc-1M             0      0 1048576    1  256 : tunables  1  1  0 : slabdata  0  0  0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem:               564 kB
KReclaimable:      11260 kB
Slab:              18368 kB
SReclaimable:      11260 kB
SUnreclaim:         7108 kB
KernelStack:        1248 kB
...
This patch (of 6):
The kmalloc caches currently maintain a separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is the cache "type".
This will also allow adding kmalloc-reclaimable caches as a third
type.
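For readability, the new lookup scheme in isolation (a simplified
excerpt mirroring the hunks below; the reclaimable type is only added
in the next patch of the series):

enum kmalloc_cache_type {
	KMALLOC_NORMAL = 0,
#ifdef CONFIG_ZONE_DMA
	KMALLOC_DMA,
#endif
	NR_KMALLOC_TYPES
};

/* one row per type, one column per size index */
extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

/* map gfp flags to a row without a branch in the hot path */
static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
	int is_dma = 0;

#ifdef CONFIG_ZONE_DMA
	is_dma = !!(flags & __GFP_DMA);
#endif
	return is_dma;	/* 0 == KMALLOC_NORMAL, 1 == KMALLOC_DMA */
}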
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc252eae85e09552f9c1e7ac0c3227f835efdf2d)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/slab.h | 42 +++++++++++++++++++++++++++++++-----------
 mm/slab.c            |  4 ++--
 mm/slab_common.c     | 31 ++++++++++++-------------------
 mm/slub.c            | 13 +++++++------
 4 files changed, 52 insertions(+), 38 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d6393413ef09..566f9b6c4e34 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -297,12 +297,29 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
 #define SLAB_OBJ_MIN_SIZE	(KMALLOC_MIN_SIZE < 16 ? \
 				(KMALLOC_MIN_SIZE) : 16)
 
+enum kmalloc_cache_type {
+	KMALLOC_NORMAL = 0,
+#ifdef CONFIG_ZONE_DMA
+	KMALLOC_DMA,
+#endif
+	NR_KMALLOC_TYPES
+};
+
 #ifndef CONFIG_SLOB
-extern struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
+extern struct kmem_cache *
+kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
+
+static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
+{
+	int is_dma = 0;
+
 #ifdef CONFIG_ZONE_DMA
-extern struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1];
+	is_dma = !!(flags & __GFP_DMA);
 #endif
 
+	return is_dma;
+}
+
 /*
  * Figure out which kmalloc slab an allocation of a certain size
  * belongs to.
@@ -503,18 +520,20 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
+#ifndef CONFIG_SLOB
+		unsigned int index;
+#endif
 		if (size > KMALLOC_MAX_CACHE_SIZE)
 			return kmalloc_large(size, flags);
 #ifndef CONFIG_SLOB
-		if (!(flags & GFP_DMA)) {
-			unsigned int index = kmalloc_index(size);
+		index = kmalloc_index(size);
 
-			if (!index)
-				return ZERO_SIZE_PTR;
+		if (!index)
+			return ZERO_SIZE_PTR;
 
-			return kmem_cache_alloc_trace(kmalloc_caches[index],
-					flags, size);
-		}
+		return kmem_cache_alloc_trace(
+				kmalloc_caches[kmalloc_type(flags)][index],
+				flags, size);
 #endif
 	}
 	return __kmalloc(size, flags);
@@ -544,13 +563,14 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 #ifndef CONFIG_SLOB
 	if (__builtin_constant_p(size) &&
-		size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
+		size <= KMALLOC_MAX_CACHE_SIZE) {
 		unsigned int i = kmalloc_index(size);
 
 		if (!i)
 			return ZERO_SIZE_PTR;
 
-		return kmem_cache_alloc_node_trace(kmalloc_caches[i],
+		return kmem_cache_alloc_node_trace(
+				kmalloc_caches[kmalloc_type(flags)][i],
 						flags, node, size);
 	}
 #endif
diff --git a/mm/slab.c b/mm/slab.c
index 9a80511afb47..34f59802551d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1291,7 +1291,7 @@ void __init kmem_cache_init(void)
 	 * Initialize the caches that provide memory for the kmem_cache_node
 	 * structures first.  Without this, further allocations will bug.
 	 */
-	kmalloc_caches[INDEX_NODE] = create_kmalloc_cache(
+	kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] = create_kmalloc_cache(
 				kmalloc_info[INDEX_NODE].name,
 				kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS,
 				0, kmalloc_size(INDEX_NODE));
@@ -1307,7 +1307,7 @@ void __init kmem_cache_init(void)
 	for_each_online_node(nid) {
 		init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid);
 
-		init_list(kmalloc_caches[INDEX_NODE],
+		init_list(kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE],
 			  &init_kmem_cache_node[SIZE_NODE + nid], nid);
 	}
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 5bb1afcce3d2..2a3ab59c6b35 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1004,14 +1004,10 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 	return s;
 }
 
-struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
+struct kmem_cache *
+kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
 EXPORT_SYMBOL(kmalloc_caches);
 
-#ifdef CONFIG_ZONE_DMA
-struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
-EXPORT_SYMBOL(kmalloc_dma_caches);
-#endif
-
 /*
  * Conversion table for small slabs sizes / 8 to the index in the
  * kmalloc array. This is necessary for slabs < 192 since we have non power
@@ -1071,12 +1067,7 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 		index = fls(size - 1);
 	}
 
-#ifdef CONFIG_ZONE_DMA
-	if (unlikely((flags & GFP_DMA)))
-		return kmalloc_dma_caches[index];
-
-#endif
-	return kmalloc_caches[index];
+	return kmalloc_caches[kmalloc_type(flags)][index];
 }
 
 /*
@@ -1150,7 +1141,8 @@ void __init setup_kmalloc_cache_index_table(void)
 
 static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
 {
-	kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
+	kmalloc_caches[KMALLOC_NORMAL][idx] = create_kmalloc_cache(
+					kmalloc_info[idx].name,
 					kmalloc_info[idx].size, flags, 0,
 					kmalloc_info[idx].size);
 }
@@ -1163,9 +1155,10 @@ static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
 void __init create_kmalloc_caches(slab_flags_t flags)
 {
 	int i;
+	int type = KMALLOC_NORMAL;
 
 	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
-		if (!kmalloc_caches[i])
+		if (!kmalloc_caches[type][i])
 			new_kmalloc_cache(i, flags);
 
 		/*
@@ -1173,9 +1166,9 @@ void __init create_kmalloc_caches(slab_flags_t flags)
 		 * These have to be created immediately after the
 		 * earlier power of two caches
 		 */
-		if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
+		if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[type][1] && i == 6)
 			new_kmalloc_cache(1, flags);
-		if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
+		if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[type][2] && i == 7)
 			new_kmalloc_cache(2, flags);
 	}
 
@@ -1184,7 +1177,7 @@ void __init create_kmalloc_caches(slab_flags_t flags)
 
 #ifdef CONFIG_ZONE_DMA
 	for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) {
-		struct kmem_cache *s = kmalloc_caches[i];
+		struct kmem_cache *s = kmalloc_caches[KMALLOC_NORMAL][i];
 
 		if (s) {
 			unsigned int size = kmalloc_size(i);
@@ -1192,8 +1185,8 @@ void __init create_kmalloc_caches(slab_flags_t flags)
 				 "dma-kmalloc-%u", size);
 
 			BUG_ON(!n);
-			kmalloc_dma_caches[i] = create_kmalloc_cache(n,
-				size, SLAB_CACHE_DMA | flags, 0, 0);
+			kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache(
+				n, size, SLAB_CACHE_DMA | flags, 0, 0);
 		}
 	}
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 887e3d67a4d7..9fedcaf69ebf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4701,6 +4701,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
 static void __init resiliency_test(void)
 {
 	u8 *p;
+	int type = KMALLOC_NORMAL;
 
 	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 16 || KMALLOC_SHIFT_HIGH < 10);
 
@@ -4713,7 +4714,7 @@ static void __init resiliency_test(void)
 	pr_err("\n1. kmalloc-16: Clobber Redzone/next pointer 0x12->0x%p\n\n",
 	       p + 16);
 
-	validate_slab_cache(kmalloc_caches[4]);
+	validate_slab_cache(kmalloc_caches[type][4]);
 
 	/* Hmmm... The next two are dangerous */
 	p = kzalloc(32, GFP_KERNEL);
@@ -4722,33 +4723,33 @@ static void __init resiliency_test(void)
 	       p);
 	pr_err("If allocated object is overwritten then not detectable\n\n");
 
-	validate_slab_cache(kmalloc_caches[5]);
+	validate_slab_cache(kmalloc_caches[type][5]);
 
 	p = kzalloc(64, GFP_KERNEL);
 	p += 64 + (get_cycles() & 0xff) * sizeof(void *);
 	*p = 0x56;
 	pr_err("\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n", p);
 	pr_err("If allocated object is overwritten then not detectable\n\n");
-	validate_slab_cache(kmalloc_caches[6]);
+	validate_slab_cache(kmalloc_caches[type][6]);
 
 	pr_err("\nB. Corruption after free\n");
 	p = kzalloc(128, GFP_KERNEL);
 	kfree(p);
 	*p = 0x78;
 	pr_err("1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p);
-	validate_slab_cache(kmalloc_caches[7]);
+	validate_slab_cache(kmalloc_caches[type][7]);
 
 	p = kzalloc(256, GFP_KERNEL);
 	kfree(p);
 	p[50] = 0x9a;
 	pr_err("\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n", p);
-	validate_slab_cache(kmalloc_caches[8]);
+	validate_slab_cache(kmalloc_caches[type][8]);
 
 	p = kzalloc(512, GFP_KERNEL);
 	kfree(p);
 	p[512] = 0xab;
 	pr_err("\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p);
-	validate_slab_cache(kmalloc_caches[9]);
+	validate_slab_cache(kmalloc_caches[type][9]);
 }
 #else
 #ifdef CONFIG_SYSFS
From: Vlastimil Babka <vbabka@suse.cz>
mainline inclusion
from mainline-4.20-rc1
commit 1291523f2c1d631fea34102fd241fb54a4e8f7a0
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in
the MemAvailable meminfo counter and in overcommit decisions. The slab
pages are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.

The generic kmalloc-X caches are created without this flag, but
sometimes are used also for objects that can be reclaimed, which due
to varying size cannot have a dedicated kmem cache with the
SLAB_RECLAIM_ACCOUNT flag. A prominent example is dcache external
names, which prompted the creation of a new, manually managed vmstat
counter NR_INDIRECTLY_RECLAIMABLE_BYTES in commit f1782c9bc547
("dcache: account external names as indirectly reclaimable memory").

To better handle this and any other similar cases, this patch
introduces SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named
kmalloc-rcl-X. They are used whenever the kmalloc() call passes
__GFP_RECLAIMABLE among the gfp flags. They are added to the
kmalloc_caches array as a new type. Allocations with both __GFP_DMA
and __GFP_RECLAIMABLE will use the DMA type cache.

This change only applies to SLAB and SLUB, not SLOB. This is fine,
since SLOB targets tiny systems and this patch would add some overhead
in the form of extra kmem management objects.
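As a usage illustration (a sketch only, not part of the patch; struct
foo and foo_cache are made-up names): fixed-size reclaimable objects
can keep using a dedicated SLAB_RECLAIM_ACCOUNT cache, while variably
sized reclaimable objects can now simply pass __GFP_RECLAIMABLE to
kmalloc() and end up in a kmalloc-rcl-<size> cache:

/* fixed-size objects: a dedicated reclaimable cache, as before */
struct kmem_cache *foo_cache =
	kmem_cache_create("foo", sizeof(struct foo), 0,
			  SLAB_RECLAIM_ACCOUNT, NULL);

/* variably sized objects: routed to kmalloc-rcl-<size> by this patch */
void *name = kmalloc(len, GFP_KERNEL | __GFP_RECLAIMABLE);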
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1291523f2c1d631fea34102fd241fb54a4e8f7a0)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/slab.h | 16 ++++++++++++++-
 mm/slab_common.c     | 48 ++++++++++++++++++++++++++++----------------
 2 files changed, 46 insertions(+), 18 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 566f9b6c4e34..aeddbe22fb51 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -297,8 +297,13 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
 #define SLAB_OBJ_MIN_SIZE	(KMALLOC_MIN_SIZE < 16 ? \
 				(KMALLOC_MIN_SIZE) : 16)
 
+/*
+ * Whenever changing this, take care of that kmalloc_type() and
+ * create_kmalloc_caches() still work as intended.
+ */
 enum kmalloc_cache_type {
 	KMALLOC_NORMAL = 0,
+	KMALLOC_RECLAIM,
 #ifdef CONFIG_ZONE_DMA
 	KMALLOC_DMA,
 #endif
@@ -312,12 +317,21 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
 static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
 {
 	int is_dma = 0;
+	int type_dma = 0;
+	int is_reclaimable;
 
 #ifdef CONFIG_ZONE_DMA
 	is_dma = !!(flags & __GFP_DMA);
+	type_dma = is_dma * KMALLOC_DMA;
 #endif
 
-	return is_dma;
+	is_reclaimable = !!(flags & __GFP_RECLAIMABLE);
+
+	/*
+	 * If an allocation is both __GFP_DMA and __GFP_RECLAIMABLE, return
+	 * KMALLOC_DMA and effectively ignore __GFP_RECLAIMABLE
+	 */
+	return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM;
 }
 
 /*
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 2a3ab59c6b35..e102f27ea011 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1139,10 +1139,21 @@ void __init setup_kmalloc_cache_index_table(void)
 	}
 }
 
-static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
+static void __init
+new_kmalloc_cache(int idx, int type, slab_flags_t flags)
 {
-	kmalloc_caches[KMALLOC_NORMAL][idx] = create_kmalloc_cache(
-					kmalloc_info[idx].name,
+	const char *name;
+
+	if (type == KMALLOC_RECLAIM) {
+		flags |= SLAB_RECLAIM_ACCOUNT;
+		name = kasprintf(GFP_NOWAIT, "kmalloc-rcl-%u",
+						kmalloc_info[idx].size);
+		BUG_ON(!name);
+	} else {
+		name = kmalloc_info[idx].name;
+	}
+
+	kmalloc_caches[type][idx] = create_kmalloc_cache(name,
 					kmalloc_info[idx].size, flags, 0,
 					kmalloc_info[idx].size);
 }
@@ -1154,22 +1165,25 @@ static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
  */
 void __init create_kmalloc_caches(slab_flags_t flags)
 {
-	int i;
-	int type = KMALLOC_NORMAL;
+	int i, type;
 
-	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
-		if (!kmalloc_caches[type][i])
-			new_kmalloc_cache(i, flags);
+	for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) {
+		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
+			if (!kmalloc_caches[type][i])
+				new_kmalloc_cache(i, type, flags);
 
-		/*
-		 * Caches that are not of the two-to-the-power-of size.
-		 * These have to be created immediately after the
-		 * earlier power of two caches
-		 */
-		if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[type][1] && i == 6)
-			new_kmalloc_cache(1, flags);
-		if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[type][2] && i == 7)
-			new_kmalloc_cache(2, flags);
+			/*
+			 * Caches that are not of the two-to-the-power-of size.
+			 * These have to be created immediately after the
+			 * earlier power of two caches
+			 */
+			if (KMALLOC_MIN_SIZE <= 32 && i == 6 &&
+					!kmalloc_caches[type][1])
+				new_kmalloc_cache(1, type, flags);
+			if (KMALLOC_MIN_SIZE <= 64 && i == 7 &&
+					!kmalloc_caches[type][2])
+				new_kmalloc_cache(2, type, flags);
+		}
 	}
 
 	/* Kmalloc array is now usable */
From: Vlastimil Babka <vbabka@suse.cz>
mainline inclusion
from mainline-4.20-rc1
commit 2e03b4bc4ae84fcc0eee00e5ba5d228901d38809
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

We can use the newly introduced kmalloc-reclaimable-X caches to
allocate external names in dcache, which takes care of the proper
accounting automatically and also improves anti-fragmentation page
grouping.

This effectively reverts commit f1782c9bc547 ("dcache: account
external names as indirectly reclaimable memory") and instead passes
__GFP_RECLAIMABLE to kmalloc(). The accounting thus moves from
NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also
considered in the MemAvailable calculation and in overcommit
decisions.
Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2e03b4bc4ae84fcc0eee00e5ba5d228901d38809)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/dcache.c | 38 +++++++++----------------------------
 1 file changed, 9 insertions(+), 29 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 997c1efd5906..8b9841837274 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -257,24 +257,10 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry);
 }
 
-static void __d_free_external_name(struct rcu_head *head)
-{
-	struct external_name *name = container_of(head, struct external_name,
-						  u.head);
-
-	mod_node_page_state(page_pgdat(virt_to_page(name)),
-			    NR_INDIRECTLY_RECLAIMABLE_BYTES,
-			    -ksize(name));
-
-	kfree(name);
-}
-
 static void __d_free_external(struct rcu_head *head)
 {
 	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
-
-	__d_free_external_name(&external_name(dentry)->u.head);
-
+	kfree(external_name(dentry));
 	kmem_cache_free(dentry_cache, dentry);
 }
 
@@ -306,7 +292,7 @@ void release_dentry_name_snapshot(struct name_snapshot *name)
 		struct external_name *p;
 		p = container_of(name->name, struct external_name, name[0]);
 		if (unlikely(atomic_dec_and_test(&p->u.count)))
-			call_rcu(&p->u.head, __d_free_external_name);
+			kfree_rcu(p, u.head);
 	}
 }
 EXPORT_SYMBOL(release_dentry_name_snapshot);
@@ -1643,7 +1629,6 @@ EXPORT_SYMBOL(d_invalidate);
 
 struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 {
-	struct external_name *ext = NULL;
 	struct dentry *dentry;
 	char *dname;
 	int err;
@@ -1664,14 +1649,15 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 		dname = dentry->d_iname;
 	} else if (name->len > DNAME_INLINE_LEN-1) {
 		size_t size = offsetof(struct external_name, name[1]);
-
-		ext = kmalloc(size + name->len, GFP_KERNEL_ACCOUNT);
-		if (!ext) {
+		struct external_name *p = kmalloc(size + name->len,
+						  GFP_KERNEL_ACCOUNT |
+						  __GFP_RECLAIMABLE);
+		if (!p) {
 			kmem_cache_free(dentry_cache, dentry);
 			return NULL;
 		}
-		atomic_set(&ext->u.count, 1);
-		dname = ext->name;
+		atomic_set(&p->u.count, 1);
+		dname = p->name;
 	} else {
 		dname = dentry->d_iname;
 	}
@@ -1711,12 +1697,6 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 		}
 	}
 
-	if (unlikely(ext)) {
-		pg_data_t *pgdat = page_pgdat(virt_to_page(ext));
-		mod_node_page_state(pgdat, NR_INDIRECTLY_RECLAIMABLE_BYTES,
-				    ksize(ext));
-	}
-
 	this_cpu_inc(nr_dentry);
 
 	return dentry;
@@ -2773,7 +2753,7 @@ static void copy_name(struct dentry *dentry, struct dentry *target)
 		dentry->d_name.hash_len = target->d_name.hash_len;
 	}
 	if (old_name && likely(atomic_dec_and_test(&old_name->u.count)))
-		call_rcu(&old_name->u.head, __d_free_external_name);
+		kfree_rcu(old_name, u.head);
 }
 
 /*
From: Vlastimil Babka <vbabka@suse.cz>
mainline inclusion
from mainline-4.20-rc1
commit b29940c1abd7a4c3abeb926df0a5ec84d6902d47
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES")
with the goal of accounting objects that can be reclaimed, but cannot
be allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible
via kmalloc() with the __GFP_RECLAIMABLE flag, and the dcache external
names user is converted.
The counter is however still useful for accounting direct page allocations (i.e. not slab) with a shrinker, such as the ION page pool. So keep it, and:
 - change granularity to pages to be more like other counters; sub-page
   allocations should be able to use kmalloc
 - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
 - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
   again remove the check for not printing "hidden" counters
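A sketch of how a shrinker-backed page pool would account its pages
with the renamed, page-granular counter (illustrative only; my_pool is
a made-up structure modelled on the ION hunks below):

static void pool_add_page(struct my_pool *pool, struct page *page)
{
	list_add(&page->lru, &pool->items);
	/* the counter is now in pages, not bytes */
	mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
			    1 << pool->order);
}

static void pool_remove_page(struct my_pool *pool, struct page *page)
{
	list_del(&page->lru);
	mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
			    -(1 << pool->order));
}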
Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b29940c1abd7a4c3abeb926df0a5ec84d6902d47)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 drivers/staging/android/ion/ion_page_pool.c |  8 ++++----
 include/linux/mmzone.h                      |  2 +-
 mm/page_alloc.c                             | 19 +++++++------------
 mm/util.c                                   |  3 +--
 mm/vmstat.c                                 |  6 +-----
 5 files changed, 14 insertions(+), 24 deletions(-)
diff --git a/drivers/staging/android/ion/ion_page_pool.c b/drivers/staging/android/ion/ion_page_pool.c
index 890d264ac687..fd1f501da669 100644
--- a/drivers/staging/android/ion/ion_page_pool.c
+++ b/drivers/staging/android/ion/ion_page_pool.c
@@ -36,8 +36,8 @@ static void ion_page_pool_add(struct ion_page_pool *pool, struct page *page)
 		pool->low_count++;
 	}
 
-	mod_node_page_state(page_pgdat(page), NR_INDIRECTLY_RECLAIMABLE_BYTES,
-			    (1 << (PAGE_SHIFT + pool->order)));
+	mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+			    1 << pool->order);
 	mutex_unlock(&pool->mutex);
 }
 
@@ -56,8 +56,8 @@ static struct page *ion_page_pool_remove(struct ion_page_pool *pool, bool high)
 	}
 
 	list_del(&page->lru);
-	mod_node_page_state(page_pgdat(page), NR_INDIRECTLY_RECLAIMABLE_BYTES,
-			    -(1 << (PAGE_SHIFT + pool->order)));
+	mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+			    -(1 << pool->order));
 	return page;
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 51f74519b0e8..4133cd6d6a34 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -181,7 +181,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
-	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
+	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e394ae5ba2a7..1e0aeebfb293 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4830,6 +4830,7 @@ long si_mem_available(void)
 	unsigned long pagecache;
 	unsigned long wmark_low = 0;
 	unsigned long pages[NR_LRU_LISTS];
+	unsigned long reclaimable;
 	struct zone *zone;
 	int lru;
 
@@ -4855,19 +4856,13 @@ long si_mem_available(void)
 	available += pagecache;
 
 	/*
-	 * Part of the reclaimable slab consists of items that are in use,
-	 * and cannot be freed. Cap this estimate at the low watermark.
+	 * Part of the reclaimable slab and other kernel memory consists of
+	 * items that are in use, and cannot be freed. Cap this estimate at the
+	 * low watermark.
	 */
-	available += global_node_page_state(NR_SLAB_RECLAIMABLE) -
-		     min(global_node_page_state(NR_SLAB_RECLAIMABLE) / 2,
-			 wmark_low);
-
-	/*
-	 * Part of the kernel memory, which can be released under memory
-	 * pressure.
-	 */
-	available += global_node_page_state(NR_INDIRECTLY_RECLAIMABLE_BYTES) >>
-		PAGE_SHIFT;
+	reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
+			global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+	available += reclaimable - min(reclaimable / 2, wmark_low);
 
 	if (available < 0)
 		available = 0;
diff --git a/mm/util.c b/mm/util.c
index ed64ef1f8387..5dc02e916a19 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -706,8 +706,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 * Part of the kernel memory, which can be released
 		 * under memory pressure.
 		 */
-		free += global_node_page_state(
-			NR_INDIRECTLY_RECLAIMABLE_BYTES) >> PAGE_SHIFT;
+		free += global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
 
 		/*
 		 * Leave reserved pages. The pages are not for anonymous pages.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 96028cc96f2f..acc9f1c6b43a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1161,7 +1161,7 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
-	"", /* nr_indirectly_reclaimable */
+	"nr_kernel_misc_reclaimable",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
@@ -1722,10 +1722,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
 	unsigned long *l = arg;
 	unsigned long off = l - (unsigned long *)m->private;
 
-	/* Skip hidden vmstat items. */
-	if (*vmstat_text[off] == '\0')
-		return 0;
-
 	seq_puts(m, vmstat_text[off]);
 	seq_put_decimal_ull(m, " ", *l);
 	seq_putc(m, '\n');
From: Vlastimil Babka <vbabka@suse.cz>
mainline inclusion
from mainline-4.20-rc1
commit 61f94e18de94f79abaad3bb83549ff78923ac785
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab
allocations that can be reclaimed via a shrinker. In /proc/meminfo, we
can show the sum of all reclaimable kernel allocations (including
slab) as "KReclaimable". Add the same counter also to the per-node
meminfo under /sys.

With this counter, users will have more complete information about
kernel memory usage. Non-slab reclaimable pages (currently just the
ION allocator) will no longer be missing from /proc/meminfo, which
left users wondering where part of their memory went. More precisely,
they already appear in MemAvailable, but without the new counter it is
not obvious why the value in MemAvailable doesn't fully correspond
with the sum of the other counters participating in it.
Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 61f94e18de94f79abaad3bb83549ff78923ac785)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 Documentation/filesystems/proc.txt |  4 ++++
 drivers/base/node.c                | 19 ++++++++++++-------
 fs/proc/meminfo.c                  | 16 ++++++++--------
 3 files changed, 24 insertions(+), 15 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d0ecc7df260..cd465304bec4 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -863,6 +863,7 @@ Writeback:             0 kB
 AnonPages:        861800 kB
 Mapped:           280372 kB
 Shmem:               644 kB
+KReclaimable:     168048 kB
 Slab:             284364 kB
 SReclaimable:     159856 kB
 SUnreclaim:       124508 kB
@@ -930,6 +931,9 @@ AnonHugePages: Non-file backed huge pages mapped into userspace page tables
 ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
               with huge pages
 ShmemPmdMapped: Shared memory mapped into userspace with huge pages
+KReclaimable: Kernel allocations that the kernel will attempt to reclaim
+              under memory pressure. Includes SReclaimable (below), and other
+              direct allocations with a shrinker.
         Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
   SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6fb55a58a1bb..44710399f21f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -67,8 +67,11 @@ static ssize_t node_read_meminfo(struct device *dev,
 	int nid = dev->id;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct sysinfo i;
+	unsigned long sreclaimable, sunreclaimable;
 
 	si_meminfo_node(&i, nid);
+	sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
+	sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
 	n = sprintf(buf,
 		       "Node %d MemTotal:       %8lu kB\n"
 		       "Node %d MemFree:        %8lu kB\n"
@@ -118,6 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d NFS_Unstable:   %8lu kB\n"
 		       "Node %d Bounce:         %8lu kB\n"
 		       "Node %d WritebackTmp:   %8lu kB\n"
+		       "Node %d KReclaimable:   %8lu kB\n"
 		       "Node %d Slab:           %8lu kB\n"
 		       "Node %d SReclaimable:   %8lu kB\n"
 		       "Node %d SUnreclaim:     %8lu kB\n"
@@ -138,20 +142,21 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
-		       nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE) +
-			      node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE)),
+		       nid, K(sreclaimable +
+			      node_page_state(pgdat, NR_KERNEL_MISC_RECLAIMABLE)),
+		       nid, K(sreclaimable + sunreclaimable),
+		       nid, K(sreclaimable),
+		       nid, K(sunreclaimable)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)),
+		       ,
 		       nid, K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
-				       HPAGE_PMD_NR));
-#else
-		       nid, K(node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)));
+				       HPAGE_PMD_NR)
 #endif
+		       );
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 
 	return n;
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index edda898714eb..568d90e17c17 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -38,6 +38,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	long cached;
 	long available;
 	unsigned long pages[NR_LRU_LISTS];
+	unsigned long sreclaimable, sunreclaim;
 	int lru;
 
 	si_meminfo(&i);
@@ -53,6 +54,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	available = si_mem_available();
+	sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
+	sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
 
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
@@ -94,14 +97,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Mapped:         ",
 		    global_node_page_state(NR_FILE_MAPPED));
 	show_val_kb(m, "Shmem:          ", i.sharedram);
-	show_val_kb(m, "Slab:           ",
-		    global_node_page_state(NR_SLAB_RECLAIMABLE) +
-		    global_node_page_state(NR_SLAB_UNRECLAIMABLE));
-
-	show_val_kb(m, "SReclaimable:   ",
-		    global_node_page_state(NR_SLAB_RECLAIMABLE));
-	show_val_kb(m, "SUnreclaim:     ",
-		    global_node_page_state(NR_SLAB_UNRECLAIMABLE));
+	show_val_kb(m, "KReclaimable:   ", sreclaimable +
+		    global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
+	show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
+	show_val_kb(m, "SReclaimable:   ", sreclaimable);
+	show_val_kb(m, "SUnreclaim:     ", sunreclaim);
 	seq_printf(m, "KernelStack:    %8lu kB\n",
 		   global_zone_page_state(NR_KERNEL_STACK_KB));
 	show_val_kb(m, "PageTables:     ",
From: "Tobin C. Harding" tobin@kernel.org
mainline inclusion
from mainline-5.2-rc1
commit 6dfd1b653c49df2dad1dcfe063a196e940e02dbd
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

The SLUB allocator makes heavy use of ifdef/endif pre-processor
macros. The pairing of these statements is at times hard to follow,
e.g. if the pair are more than a screen apart or if there are nested
pairs. We can reduce cognitive load by adding a comment to the endif
statement of the form

	#ifdef CONFIG_FOO
	...
	#endif /* CONFIG_FOO */

Add comments to endif pre-processor macros if the ifdef/endif pair is
not immediately apparent.
Link: http://lkml.kernel.org/r/20190402230545.2929-5-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6dfd1b653c49df2dad1dcfe063a196e940e02dbd)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/slub.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 9fedcaf69ebf..32bec9c3485e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1909,7 +1909,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			}
 		}
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
-#endif
+#endif	/* CONFIG_NUMA */
 	return NULL;
 }
 
@@ -2220,7 +2220,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 		discard_slab(s, page);
 		stat(s, FREE_SLAB);
 	}
-#endif
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 /*
@@ -2279,7 +2279,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		local_irq_restore(flags);
 	}
 	preempt_enable();
-#endif
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
@@ -2797,7 +2797,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
 #endif
-#endif
+#endif	/* CONFIG_NUMA */
 
 /*
  * Slow path handling. This may still be called frequently since objects
@@ -3842,7 +3842,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 	return ret;
 }
 EXPORT_SYMBOL(__kmalloc_node);
-#endif
+#endif	/* CONFIG_NUMA */
 
 #ifdef CONFIG_HARDENED_USERCOPY
 /*
@@ -4058,7 +4058,7 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 	 */
 	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
-#endif
+#endif	/* CONFIG_MEMCG */
 
 static int slab_mem_going_offline_callback(void *arg)
 {
@@ -4695,7 +4695,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
 		len += sprintf(buf, "No data\n");
 	return len;
 }
-#endif
+#endif	/* CONFIG_SLUB_DEBUG */
 
 #ifdef SLUB_RESILIENCY_TEST
 static void __init resiliency_test(void)
@@ -4755,7 +4755,7 @@ static void __init resiliency_test(void)
 #ifdef CONFIG_SYSFS
 static void resiliency_test(void) {};
 #endif
-#endif
+#endif	/* SLUB_RESILIENCY_TEST */
 
 #ifdef CONFIG_SYSFS
 enum slab_stat_type {
@@ -5421,7 +5421,7 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
-#endif
+#endif	/* CONFIG_SLUB_STATS */
 
 static struct attribute *slab_attrs[] = {
 	&slab_size_attr.attr,
@@ -5623,7 +5623,7 @@ static void memcg_propagate_slab_attrs(struct kmem_cache *s)
 
 	if (buffer)
 		free_page((unsigned long)buffer);
-#endif
+#endif	/* CONFIG_MEMCG */
 }
 
 static void kmem_cache_release(struct kobject *k)
From: Roman Gushchin <guro@fb.com>
mainline inclusion
from mainline-5.3-rc1
commit c03914b7aa319fb2b6701a6427c13752c7418b9b
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------

Patch series "mm: reparent slab memory on cgroup removal", v7.
We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], accounting of kernel stacks [2], and also the main reason: slab objects.
The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
Slab objects, and first of all the vfs cache, are shared between
cgroups that use the same underlying filesystem and, what's even more
important, they are shared between multiple generations of the same
workload. So if something runs periodically, each time in a new cgroup
(as systemd does), we accumulate many dying cgroups.

Strictly speaking pagecache isn't different here, but there is a key
difference: we disable protection and apply some extra pressure on
LRUs of dying cgroups, and these LRUs contain all charged pages. My
experiments show that with kernel memory accounting disabled the
number of dying cgroups stabilizes at a relatively small number (~100,
depending on memory pressure and cgroup creation rate), while with
kernel memory accounting it grows pretty steadily up to several
thousand.

Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so it leads to noticeable memory losses. Memory occupied by
dying cgroups is measured in hundreds of megabytes. I've even seen a
host with more than 100Gb of memory wasted on dying cgroups. It leads
to a degradation of performance as uptime grows, and generally limits
the usage of cgroups.

My previous attempt [3] to fix the problem by applying extra pressure
on slab shrinker lists caused regressions with xfs and ext4, and has
been reverted [4]. The following attempts to find the right balance
[5, 6] were not successful.

So instead of trying to find a maybe non-existing balance, let's
reparent accounted slab caches to the parent cgroup on cgroup removal.

There is however a significant problem with reparenting of slab
memory: there is no list of charged pages. Some of them are in
shrinker lists, but not all. Introducing a new list is really not an
option.
But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages.
It's actually simpler and cheaper, but requires some underlying
changes (a sketch follows the list):
1) Make kmem_caches hold a single reference to the memory cgroup,
   instead of a separate reference per every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
   page->kmem_cache->memcg indirection instead. It's used only on slab
   page release, so performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
   be able to destroy kmem_caches when they become empty and release
   the associated memory cgroup.
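A hedged sketch of the release-path indirection from point 2 (field
names follow the series; uncharge_to_memcg() is a placeholder, not a
real kernel function):

static void release_slab_page_sketch(struct page *page, int order)
{
	struct kmem_cache *s = page->slab_cache;
	struct mem_cgroup *memcg = s->memcg_params.memcg;

	/* charge is dropped against the cache's memcg, not page->mem_cgroup */
	uncharge_to_memcg(memcg, 1 << order);	/* placeholder */
}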
There is a bonus: currently we release all memcg kmem_caches together
with the memory cgroup itself. This patchset allows individual
kmem_caches to be released as soon as they become inactive and free.
Some additional implementation details are provided in corresponding commit messages.
Below is the average number of dying cgroups on two groups of our
production hosts. They run some sort of web frontend workload, and the
memory pressure is moderate. As we can see, with kernel memory
reparenting the number stabilizes in the 60s range; however with the
original version it grows almost linearly and doesn't show any signs
of plateauing. The difference in slab and percpu usage between patched
and unpatched versions also grows linearly. In 7 days it exceeded
200Mb.
day           0    1    2    3    4    5    6    7
original     56  362  628  752 1070 1250 1490 1560
patched      23   46   51   55   60   57   67   69
mem diff(Mb) 22   74  123  152  164  182  214  241

[1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
This patch (of 10):
Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params().
Once kmem_cache holds a reference to the memory cgroup, it will
simplify the refcounting.
For non-root kmem_caches memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe.
Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c03914b7aa319fb2b6701a6427c13752c7418b9b)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/slab.c        |  2 +-
 mm/slab.h        |  5 +++--
 mm/slab_common.c | 14 +++++++-------
 mm/slub.c        |  2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 34f59802551d..d876c379d966 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1284,7 +1284,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
+	memcg_link_cache(kmem_cache, NULL);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index 68ad9d348771..ab218e0d1f0b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -289,7 +289,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
+extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 				void (*deact_fn)(struct kmem_cache *));
 
@@ -344,7 +344,8 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
+static inline void memcg_link_cache(struct kmem_cache *s,
+				    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e102f27ea011..8673c2e4ac54 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -141,13 +141,12 @@ void slab_init_memcg_params(struct kmem_cache *s)
 }
 
 static int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+			     struct kmem_cache *root_cache)
 {
 	struct memcg_cache_array *arr;
 
 	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
-		s->memcg_params.memcg = memcg;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
@@ -222,11 +221,12 @@ int memcg_update_all_caches(int num_memcgs)
 	return ret;
 }
 
-void memcg_link_cache(struct kmem_cache *s)
+void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 {
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
 		list_add(&s->memcg_params.kmem_caches_node,
@@ -245,7 +245,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+				    struct kmem_cache *root_cache)
 {
 	return 0;
 }
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, memcg, root_cache);
+	err = init_memcg_params(s, root_cache);
 	if (err)
 		goto out_free_cache;
 
@@ -403,7 +403,7 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, memcg);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -999,7 +999,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	s->refcount = 1;
 	return s;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 32bec9c3485e..8c3de980ee07 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4216,7 +4216,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	return s;
 }
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit 0b14e8aa68223c2c124d408aa4b110b364d13c53 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
The delayed work/RCU deactivation infrastructure of non-root kmem_caches can also be used for asynchronous release of these objects. Let's get rid of the word "deactivation" in the corresponding names so that the code reads well after the generalization.
It's easier to make the renaming first, so that the generalized code will look consistent from scratch.
Let's rename struct memcg_cache_params fields:
  deact_fn -> work_fn
  deact_rcu_head -> rcu_head
  deact_work -> work

And RCU/delayed work callbacks in slab common code:
  kmemcg_deactivate_rcufn -> kmemcg_rcufn
  kmemcg_deactivate_workfn -> kmemcg_workfn
This patch contains no functional changes, only renamings.
Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 0b14e8aa68223c2c124d408aa4b110b364d13c53) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/slab.h | 6 +++--- mm/slab.h | 2 +- mm/slab_common.c | 30 +++++++++++++++--------------- 3 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h index aeddbe22fb51..1fdce55b77a7 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -643,10 +643,10 @@ struct memcg_cache_params { struct list_head children_node; struct list_head kmem_caches_node;
- void (*deact_fn)(struct kmem_cache *); + void (*work_fn)(struct kmem_cache *); union { - struct rcu_head deact_rcu_head; - struct work_struct deact_work; + struct rcu_head rcu_head; + struct work_struct work; }; }; }; diff --git a/mm/slab.h b/mm/slab.h index ab218e0d1f0b..a1b72f757c6c 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -291,7 +291,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order, extern void slab_init_memcg_params(struct kmem_cache *); extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg); extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, - void (*deact_fn)(struct kmem_cache *)); + void (*work_fn)(struct kmem_cache *));
#else /* CONFIG_MEMCG_KMEM */
diff --git a/mm/slab_common.c b/mm/slab_common.c index 8673c2e4ac54..97d63c6ab2be 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -673,17 +673,17 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, put_online_cpus(); }
-static void kmemcg_deactivate_workfn(struct work_struct *work) +static void kmemcg_workfn(struct work_struct *work) { struct kmem_cache *s = container_of(work, struct kmem_cache, - memcg_params.deact_work); + memcg_params.work);
get_online_cpus(); get_online_mems();
mutex_lock(&slab_mutex);
- s->memcg_params.deact_fn(s); + s->memcg_params.work_fn(s);
mutex_unlock(&slab_mutex);
@@ -694,36 +694,36 @@ static void kmemcg_deactivate_workfn(struct work_struct *work) css_put(&s->memcg_params.memcg->css); }
-static void kmemcg_deactivate_rcufn(struct rcu_head *head) +static void kmemcg_rcufn(struct rcu_head *head) { struct kmem_cache *s = container_of(head, struct kmem_cache, - memcg_params.deact_rcu_head); + memcg_params.rcu_head);
/* - * We need to grab blocking locks. Bounce to ->deact_work. The + * We need to grab blocking locks. Bounce to ->work. The * work item shares the space with the RCU head and can't be * initialized eariler. */ - INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn); - queue_work(memcg_kmem_cache_wq, &s->memcg_params.deact_work); + INIT_WORK(&s->memcg_params.work, kmemcg_workfn); + queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); }
/** * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a * sched RCU grace period * @s: target kmem_cache - * @deact_fn: deactivation function to call + * @work_fn: deactivation function to call * - * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex + * Schedule @work_fn to be invoked with online cpus, mems and slab_mutex * held after a sched RCU grace period. The slab is guaranteed to stay - * alive until @deact_fn is finished. This is to be used from + * alive until @work_fn is finished. This is to be used from * __kmemcg_cache_deactivate(). */ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, - void (*deact_fn)(struct kmem_cache *)) + void (*work_fn)(struct kmem_cache *)) { if (WARN_ON_ONCE(is_root_cache(s)) || - WARN_ON_ONCE(s->memcg_params.deact_fn)) + WARN_ON_ONCE(s->memcg_params.work_fn)) return;
/* @@ -738,8 +738,8 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, /* pin memcg so that @s doesn't get destroyed in the middle */ css_get(&s->memcg_params.memcg->css);
- s->memcg_params.deact_fn = deact_fn; - call_rcu(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn); + s->memcg_params.work_fn = work_fn; + call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn); unlock: spin_unlock_irq(&memcg_kmem_wq_lock); }
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit 434866947564b954409c2fe561605e22f7b49f64 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Currently SLUB uses a work item scheduled after an RCU grace period to deactivate a non-root kmem_cache. This mechanism can be reused for releasing kmem_caches, but it requires generalization to cover the SLAB case.

Introduce a kmemcg_cache_deactivate() function, which calls the allocator-specific __kmemcg_cache_deactivate() and schedules execution of __kmemcg_cache_deactivate_after_rcu(), with all necessary locks held, in a worker context after an RCU grace period.
Here is the new calling scheme:
  kmemcg_cache_deactivate()
    __kmemcg_cache_deactivate()                 SLAB/SLUB-specific
    kmemcg_rcufn()                              rcu
      kmemcg_workfn()                           work
        __kmemcg_cache_deactivate_after_rcu()   SLAB/SLUB-specific

instead of:
  __kmemcg_cache_deactivate()                   SLAB/SLUB-specific
    slab_deactivate_memcg_cache_rcu_sched()     SLUB-only
      kmemcg_rcufn()                            rcu
        kmemcg_workfn()                         work
          kmemcg_cache_deact_after_rcu()        SLUB-only
For consistency, all allocator-specific functions start with "__".
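Condensed into code, the common-code half of the new scheme looks roughly like the sketch below. This is a hand-written simplification, not the literal hunk: the real kmemcg_cache_deactivate() in the diff additionally takes memcg_kmem_wq_lock, checks the dying flag and pins the memcg css.

  /*
   * Simplified sketch of the generalized deactivation path; locking,
   * the dying-flag check and the css pinning are omitted for brevity.
   */
  static void kmemcg_cache_deactivate(struct kmem_cache *s)
  {
  	/* first step: allocator-specific, runs synchronously */
  	__kmemcg_cache_deactivate(s);

  	/* second step: deferred to a worker after an RCU grace period */
  	s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
  	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
  }
  /* kmemcg_rcufn() bounces to kmemcg_workfn(), which calls ->work_fn() */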
Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 434866947564b954409c2fe561605e22f7b49f64) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/slab.c | 4 ++++ mm/slab.h | 3 +-- mm/slab_common.c | 27 ++++++++------------------- mm/slub.c | 8 +------- 4 files changed, 14 insertions(+), 28 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c index d876c379d966..a04e81dbbcdb 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2329,6 +2329,10 @@ void __kmemcg_cache_deactivate(struct kmem_cache *cachep) { __kmem_cache_shrink(cachep); } + +void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s) +{ +} #endif
int __kmem_cache_shutdown(struct kmem_cache *cachep) diff --git a/mm/slab.h b/mm/slab.h index a1b72f757c6c..0875314b1210 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -172,6 +172,7 @@ int __kmem_cache_shutdown(struct kmem_cache *); void __kmem_cache_release(struct kmem_cache *); int __kmem_cache_shrink(struct kmem_cache *); void __kmemcg_cache_deactivate(struct kmem_cache *s); +void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s); void slab_kmem_cache_release(struct kmem_cache *);
struct seq_file; @@ -290,8 +291,6 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
extern void slab_init_memcg_params(struct kmem_cache *); extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg); -extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, - void (*work_fn)(struct kmem_cache *));
#else /* CONFIG_MEMCG_KMEM */
diff --git a/mm/slab_common.c b/mm/slab_common.c index 97d63c6ab2be..318d2527bc0b 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -690,7 +690,7 @@ static void kmemcg_workfn(struct work_struct *work) put_online_mems(); put_online_cpus();
- /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */ + /* done, put the ref from kmemcg_cache_deactivate() */ css_put(&s->memcg_params.memcg->css); }
@@ -708,24 +708,14 @@ static void kmemcg_rcufn(struct rcu_head *head) queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); }
-/** - * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a - * sched RCU grace period - * @s: target kmem_cache - * @work_fn: deactivation function to call - * - * Schedule @work_fn to be invoked with online cpus, mems and slab_mutex - * held after a sched RCU grace period. The slab is guaranteed to stay - * alive until @work_fn is finished. This is to be used from - * __kmemcg_cache_deactivate(). - */ -void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, - void (*work_fn)(struct kmem_cache *)) +static void kmemcg_cache_deactivate(struct kmem_cache *s) { if (WARN_ON_ONCE(is_root_cache(s)) || WARN_ON_ONCE(s->memcg_params.work_fn)) return;
+ __kmemcg_cache_deactivate(s); + /* * memcg_kmem_wq_lock is used to synchronize memcg_params.dying * flag and make sure that no new kmem_cache deactivation tasks @@ -738,7 +728,7 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, /* pin memcg so that @s doesn't get destroyed in the middle */ css_get(&s->memcg_params.memcg->css);
- s->memcg_params.work_fn = work_fn; + s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu; call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn); unlock: spin_unlock_irq(&memcg_kmem_wq_lock); @@ -763,7 +753,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) if (!c) continue;
- __kmemcg_cache_deactivate(c); + kmemcg_cache_deactivate(c); arr->entries[idx] = NULL; } mutex_unlock(&slab_mutex); @@ -859,11 +849,10 @@ static void memcg_set_kmem_cache_dying(struct kmem_cache *s) static void flush_memcg_workqueue(struct kmem_cache *s) { /* - * SLUB deactivates the kmem_caches through call_rcu. Make + * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make * sure all registered rcu callbacks have been invoked. */ - if (IS_ENABLED(CONFIG_SLUB)) - rcu_barrier(); + rcu_barrier();
/* * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB diff --git a/mm/slub.c b/mm/slub.c index 8c3de980ee07..4104c266580a 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4025,7 +4025,7 @@ int __kmem_cache_shrink(struct kmem_cache *s) }
#ifdef CONFIG_MEMCG -static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s) +void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s) { /* * Called with all the locks held after a sched RCU grace period. @@ -4051,12 +4051,6 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s) */ slub_set_cpu_partial(s, 0); s->min_partial = 0; - - /* - * s->cpu_partial is checked locklessly (see put_cpu_partial), so - * we have to make sure the change is visible before shrinking. - */ - slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu); } #endif /* CONFIG_MEMCG */
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit 49a18eae2e98a794477b5af5d85938e430c0be72 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Let's separate the page counter modification code out of __memcg_kmem_uncharge(), mirroring the way __memcg_kmem_charge() and __memcg_kmem_charge_memcg() are split.

This will allow reusing this code later through a new memcg_kmem_uncharge_memcg() wrapper, which calls __memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled() check passes.
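For context, the first user of the new wrapper arrives later in this series, when slab uncharging is switched from page->mem_cgroup to the memcg taken from the kmem_cache. A rough sketch of that call pattern (the helper name below is illustrative only, simplified from the later mm/slab.h changes):

  /*
   * Illustrative only: uncharge a slab page against a memcg known from
   * the kmem_cache rather than from page->mem_cgroup.  The wrapper is a
   * no-op unless kmem accounting is enabled.
   */
  static inline void uncharge_slab_to_memcg(struct page *page, int order,
  					    struct kmem_cache *s)
  {
  	struct mem_cgroup *memcg = s->memcg_params.memcg;

  	memcg_kmem_uncharge_memcg(page, order, memcg);
  }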
Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 49a18eae2e98a794477b5af5d85938e430c0be72) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 10 ++++++++++ mm/memcontrol.c | 25 +++++++++++++++++-------- 2 files changed, 27 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 24068566ada2..eafc13eb9441 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1275,6 +1275,8 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order); void __memcg_kmem_uncharge(struct page *page, int order); int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, struct mem_cgroup *memcg); +void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg, + unsigned int nr_pages);
extern struct static_key_false memcg_kmem_enabled_key; extern struct workqueue_struct *memcg_kmem_cache_wq; @@ -1316,6 +1318,14 @@ static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, return __memcg_kmem_charge_memcg(page, gfp, order, memcg); return 0; } + +static inline void memcg_kmem_uncharge_memcg(struct page *page, int order, + struct mem_cgroup *memcg) +{ + if (memcg_kmem_enabled()) + __memcg_kmem_uncharge_memcg(memcg, 1 << order); +} + /* * helper for accessing a memcg's index. It will be used as an index in the * child cache array in kmem_cache, and also to derive its name. This function diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 66f87c8cfd7f..4f07c6db5c68 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2716,6 +2716,22 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order) css_put(&memcg->css); return ret; } + +/** + * __memcg_kmem_uncharge_memcg: uncharge a kmem page + * @memcg: memcg to uncharge + * @nr_pages: number of pages to uncharge + */ +void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg, + unsigned int nr_pages) +{ + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + page_counter_uncharge(&memcg->kmem, nr_pages); + + page_counter_uncharge(&memcg->memory, nr_pages); + if (do_memsw_account()) + page_counter_uncharge(&memcg->memsw, nr_pages); +} /** * __memcg_kmem_uncharge: uncharge a kmem page * @page: page to uncharge @@ -2730,14 +2746,7 @@ void __memcg_kmem_uncharge(struct page *page, int order) return;
VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); - - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) - page_counter_uncharge(&memcg->kmem, nr_pages); - - page_counter_uncharge(&memcg->memory, nr_pages); - if (do_memsw_account()) - page_counter_uncharge(&memcg->memsw, nr_pages); - + __memcg_kmem_uncharge_memcg(memcg, nr_pages); page->mem_cgroup = NULL;
/* slab pages do not have PageKmemcg flag set */
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit 6cea1d569d24af6f9e95f70cb301807440ae2981 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Currently the page accounting code is duplicated in the SLAB and SLUB internals. Let's move it into new charge_slab_page() and uncharge_slab_page() helpers in mm/slab.h. These helpers are responsible for statistics (global and memcg-aware) and for memcg charging, so they replace the direct memcg_(un)charge_slab() calls.
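From the allocators' point of view the change collapses statistics and memcg charging into a single call. A minimal sketch of the new allocation path, assuming the helpers from the mm/slab.h hunk below (the wrapper function name here is illustrative only):

  /*
   * Illustrative only: allocate a slab page and account it through the
   * consolidated helper, mirroring kmem_getpages()/alloc_slab_page()
   * after this patch.  Error handling is reduced to the essentials.
   */
  static struct page *alloc_accounted_slab_page(struct kmem_cache *s,
  						gfp_t flags, int order, int node)
  {
  	struct page *page = __alloc_pages_node(node, flags, order);

  	if (page && charge_slab_page(page, flags, order, s)) {
  		/* memcg charge failed: give the page back */
  		__free_pages(page, order);
  		page = NULL;
  	}
  	return page;
  }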
Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Christoph Lameter cl@linux.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 6cea1d569d24af6f9e95f70cb301807440ae2981) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/slab.h
Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/slab.c | 19 +++---------------- mm/slab.h | 25 +++++++++++++++++++++++++ mm/slub.c | 14 ++------------ 3 files changed, 30 insertions(+), 28 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c index a04e81dbbcdb..a818297ef524 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1405,7 +1405,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid) { struct page *page; - int nr_pages;
flags |= cachep->allocflags;
@@ -1415,17 +1414,11 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, return NULL; }
- if (memcg_charge_slab(page, flags, cachep->gfporder, cachep)) { + if (charge_slab_page(page, flags, cachep->gfporder, cachep)) { __free_pages(page, cachep->gfporder); return NULL; }
- nr_pages = (1 << cachep->gfporder); - if (cachep->flags & SLAB_RECLAIM_ACCOUNT) - mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages); - else - mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages); - __SetPageSlab(page); /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */ if (sk_memalloc_socks() && page_is_pfmemalloc(page)) @@ -1440,12 +1433,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, static void kmem_freepages(struct kmem_cache *cachep, struct page *page) { int order = cachep->gfporder; - unsigned long nr_freed = (1 << order); - - if (cachep->flags & SLAB_RECLAIM_ACCOUNT) - mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed); - else - mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
BUG_ON(!PageSlab(page)); __ClearPageSlabPfmemalloc(page); @@ -1454,8 +1441,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page) page->mapping = NULL;
if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += nr_freed; - memcg_uncharge_slab(page, order, cachep); + current->reclaim_state->reclaimed_slab += 1 << order; + uncharge_slab_page(page, order, cachep); __free_pages(page, order); }
diff --git a/mm/slab.h b/mm/slab.h index 0875314b1210..47dc964e9f9d 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -205,6 +205,12 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer, void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **); int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
+static inline int cache_vmstat_idx(struct kmem_cache *s) +{ + return (s->flags & SLAB_RECLAIM_ACCOUNT) ? + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE; +} + #ifdef CONFIG_MEMCG_KMEM
/* List of all root caches. */ @@ -350,6 +356,25 @@ static inline void memcg_link_cache(struct kmem_cache *s,
#endif /* CONFIG_MEMCG_KMEM */
+static __always_inline int charge_slab_page(struct page *page, + gfp_t gfp, int order, + struct kmem_cache *s) +{ + int ret = memcg_charge_slab(page, gfp, order, s); + + if (!ret) + mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order); + + return ret; +} + +static __always_inline void uncharge_slab_page(struct page *page, int order, + struct kmem_cache *s) +{ + mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order)); + memcg_uncharge_slab(page, order, s); +} + static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x) { struct kmem_cache *cachep; diff --git a/mm/slub.c b/mm/slub.c index 4104c266580a..1e378b4a6942 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1457,7 +1457,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s, else page = __alloc_pages_node(node, flags, order);
- if (page && memcg_charge_slab(page, flags, order, s)) { + if (page && charge_slab_page(page, flags, order, s)) { __free_pages(page, order); page = NULL; } @@ -1649,11 +1649,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) if (!page) return NULL;
- mod_lruvec_page_state(page, - (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - 1 << oo_order(oo)); - inc_slabs_node(s, page_to_nid(page), page->objects);
return page; @@ -1687,18 +1682,13 @@ static void __free_slab(struct kmem_cache *s, struct page *page) check_object(s, page, p, SLUB_RED_INACTIVE); }
- mod_lruvec_page_state(page, - (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - -pages); - __ClearPageSlabPfmemalloc(page); __ClearPageSlab(page);
page->mapping = NULL; if (current->reclaim_state) current->reclaim_state->reclaimed_slab += pages; - memcg_uncharge_slab(page, order, s); + uncharge_slab_page(page, order, s); __free_pages(page, order); }
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit 570332978ea7fdbec86a07086a584d796a87da2c category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
There is no point in checking the root_cache->memcg_params.dying flag on the kmem_cache creation path. New allocations shouldn't be performed using a dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled after the flag is set. And if one was scheduled before, flush_memcg_workqueue() will wait for it anyway.
So let's drop this check to simplify the code.
Link: http://lkml.kernel.org/r/20190611231813.3148843-7-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 570332978ea7fdbec86a07086a584d796a87da2c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c index 318d2527bc0b..9e2111ae8e16 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -622,7 +622,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, * The memory cgroup could have been offlined while the cache * creation work was pending. */ - if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying) + if (memcg->kmem_state != KMEM_ONLINE) goto out_unlock;
idx = memcg_cache_id(memcg);
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit f0a3a24b532d9a7e56a33c5112b2a212ed6ec580 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Currently each charged slab page holds a reference to the cgroup to which it's charged. Kmem_caches are held by the memcg and are released together with the memory cgroup. It means that none of the kmem_caches can be released as long as at least one reference to the memcg exists, which is far from optimal.
Let's rework it in a way that allows releasing individual kmem_caches as soon as the cgroup is offline, the kmem_cache is empty and there are no pending allocations.
To make it possible, let's introduce a new percpu refcounter for non-root kmem caches. The counter is initialized in percpu mode and is switched to atomic mode during kmem_cache deactivation. The counter is bumped for every charged page and also for every running allocation, so the kmem_cache can't be released until all allocations complete.

To shut down inactive empty kmem_caches, let's reuse the workqueue previously used for kmem_cache deactivation. Once the reference counter reaches 0, an asynchronous kmem_cache release is scheduled.
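The resulting lifecycle of the new refcounter, condensed into one place (a hand-written sketch assembled from the hunks below; the helper names here are illustrative, while the percpu_ref API, the refcnt field and kmemcg_cache_shutdown() are the real ones):

  /* starts in percpu mode; kmemcg_cache_shutdown() runs when refs hit 0 */
  static int memcg_cache_refcnt_init(struct kmem_cache *s)
  {
  	return percpu_ref_init(&s->memcg_params.refcnt,
  			       kmemcg_cache_shutdown, 0, GFP_KERNEL);
  }

  /* every charged slab page and every running allocation hold references */
  static void memcg_cache_pin(struct kmem_cache *s, int order)
  {
  	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
  }

  static void memcg_cache_unpin(struct kmem_cache *s, int order)
  {
  	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
  }

  /*
   * On deactivation: switch to atomic mode and drop the base reference,
   * so the last put triggers kmemcg_cache_shutdown(), which schedules
   * the asynchronous release.
   */
  static void memcg_cache_refcnt_kill(struct kmem_cache *s)
  {
  	percpu_ref_kill(&s->memcg_params.refcnt);
  }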
* I used the following simple approach to test the performance (stolen from another patchset by T. Harding):
  time find / -name fname-no-exist
  echo 2 > /proc/sys/vm/drop_caches
  repeat 10 times

Results:

       orig                patched

  real 0m1.455s       real 0m1.355s
  user 0m0.206s       user 0m0.219s
  sys  0m0.855s       sys  0m0.807s

  real 0m1.487s       real 0m1.699s
  user 0m0.221s       user 0m0.256s
  sys  0m0.806s       sys  0m0.948s

  real 0m1.515s       real 0m1.505s
  user 0m0.183s       user 0m0.215s
  sys  0m0.876s       sys  0m0.858s

  real 0m1.291s       real 0m1.380s
  user 0m0.193s       user 0m0.198s
  sys  0m0.843s       sys  0m0.786s

  real 0m1.364s       real 0m1.374s
  user 0m0.180s       user 0m0.182s
  sys  0m0.868s       sys  0m0.806s

  real 0m1.352s       real 0m1.312s
  user 0m0.201s       user 0m0.212s
  sys  0m0.820s       sys  0m0.761s

  real 0m1.302s       real 0m1.349s
  user 0m0.205s       user 0m0.203s
  sys  0m0.803s       sys  0m0.792s

  real 0m1.334s       real 0m1.301s
  user 0m0.194s       user 0m0.201s
  sys  0m0.806s       sys  0m0.779s

  real 0m1.426s       real 0m1.434s
  user 0m0.216s       user 0m0.181s
  sys  0m0.824s       sys  0m0.864s

  real 0m1.350s       real 0m1.295s
  user 0m0.200s       user 0m0.190s
  sys  0m0.842s       sys  0m0.811s
So it looks like the difference is not noticeable in this test.
[cai@lca.pw: fix an use-after-free in kmemcg_workfn()] Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Signed-off-by: Qian Cai cai@lca.pw Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Shakeel Butt shakeelb@google.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit f0a3a24b532d9a7e56a33c5112b2a212ed6ec580) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memcontrol.c
Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/slab.h | 3 +- mm/memcontrol.c | 50 +++++++++++++++++++++------- mm/slab.h | 44 +++++++------------------ mm/slab_common.c | 78 +++++++++++++++++++++++++------------------- 4 files changed, 96 insertions(+), 79 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h index 1fdce55b77a7..b3812235d59d 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -16,6 +16,7 @@ #include <linux/overflow.h> #include <linux/types.h> #include <linux/workqueue.h> +#include <linux/percpu-refcount.h>
/* @@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); void memcg_deactivate_kmem_caches(struct mem_cgroup *); -void memcg_destroy_kmem_caches(struct mem_cgroup *);
/* * Please use this macro to create slab caches. Simply specify the @@ -642,6 +642,7 @@ struct memcg_cache_params { struct mem_cgroup *memcg; struct list_head children_node; struct list_head kmem_caches_node; + struct percpu_ref refcnt;
void (*work_fn)(struct kmem_cache *); union { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4f07c6db5c68..94cba70d98d1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2534,12 +2534,13 @@ static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg, { struct memcg_kmem_cache_create_work *cw;
+ if (!css_tryget_online(&memcg->css)) + return; + cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN); if (!cw) return;
- css_get(&memcg->css); - cw->memcg = memcg; cw->cachep = cachep; INIT_WORK(&cw->work, memcg_kmem_cache_create_func); @@ -2598,6 +2599,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep) { struct mem_cgroup *memcg; struct kmem_cache *memcg_cachep; + struct memcg_cache_array *arr; int kmemcg_id;
VM_BUG_ON(!is_root_cache(cachep)); @@ -2608,14 +2610,28 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep) if (current->memcg_kmem_skip_account) return cachep;
- memcg = get_mem_cgroup_from_current(); + rcu_read_lock(); + + if (unlikely(current->active_memcg)) + memcg = current->active_memcg; + else + memcg = mem_cgroup_from_task(current); + + if (!memcg || memcg == root_mem_cgroup) + goto out_unlock; + kmemcg_id = READ_ONCE(memcg->kmemcg_id); if (kmemcg_id < 0) - goto out; + goto out_unlock; + + arr = rcu_dereference(cachep->memcg_params.memcg_caches);
- memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id); - if (likely(memcg_cachep)) - return memcg_cachep; + /* + * Make sure we will access the up-to-date value. The code updating + * memcg_caches issues a write barrier to match the data dependency + * barrier inside READ_ONCE() (see memcg_create_kmem_cache()). + */ + memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
/* * If we are in a safe context (can wait, and not in interrupt @@ -2628,10 +2644,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep) * memcg_create_kmem_cache, this means no further allocation * could happen with the slab_mutex held. So it's better to * defer everything. + * + * If the memcg is dying or memcg_cache is about to be released, + * don't bother creating new kmem_caches. Because memcg_cachep + * is ZEROed as the fist step of kmem offlining, we don't need + * percpu_ref_tryget_live() here. css_tryget_online() check in + * memcg_schedule_kmem_cache_create() will prevent us from + * creation of a new kmem_cache. */ - memcg_schedule_kmem_cache_create(memcg, cachep); -out: - css_put(&memcg->css); + if (unlikely(!memcg_cachep)) + memcg_schedule_kmem_cache_create(memcg, cachep); + else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) + cachep = memcg_cachep; +out_unlock: + rcu_read_unlock(); return cachep; }
@@ -2642,7 +2668,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep) void memcg_kmem_put_cache(struct kmem_cache *cachep) { if (!is_root_cache(cachep)) - css_put(&cachep->memcg_params.memcg->css); + percpu_ref_put(&cachep->memcg_params.refcnt); }
/** @@ -3239,7 +3265,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg) memcg_offline_kmem(memcg);
if (memcg->kmem_state == KMEM_ALLOCATED) { - memcg_destroy_kmem_caches(memcg); + WARN_ON(!list_empty(&memcg->kmem_caches)); static_branch_dec(&memcg_kmem_enabled_key); WARN_ON(page_counter_read(&memcg->kmem)); } diff --git a/mm/slab.h b/mm/slab.h index 47dc964e9f9d..16049a33dffa 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -248,31 +248,6 @@ static inline const char *cache_name(struct kmem_cache *s) return s->name; }
-/* - * Note, we protect with RCU only the memcg_caches array, not per-memcg caches. - * That said the caller must assure the memcg's cache won't go away by either - * taking a css reference to the owner cgroup, or holding the slab_mutex. - */ -static inline struct kmem_cache * -cache_from_memcg_idx(struct kmem_cache *s, int idx) -{ - struct kmem_cache *cachep; - struct memcg_cache_array *arr; - - rcu_read_lock(); - arr = rcu_dereference(s->memcg_params.memcg_caches); - - /* - * Make sure we will access the up-to-date value. The code updating - * memcg_caches issues a write barrier to match this (see - * memcg_create_kmem_cache()). - */ - cachep = READ_ONCE(arr->entries[idx]); - rcu_read_unlock(); - - return cachep; -} - static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) { if (is_root_cache(s)) @@ -284,14 +259,25 @@ static __always_inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order, struct kmem_cache *s) { + int ret; + if (is_root_cache(s)) return 0; - return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg); + + ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg); + if (ret) + return ret; + + percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order); + + return 0; }
static __always_inline void memcg_uncharge_slab(struct page *page, int order, struct kmem_cache *s) { + if (!is_root_cache(s)) + percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order); memcg_kmem_uncharge(page, order); }
@@ -323,12 +309,6 @@ static inline const char *cache_name(struct kmem_cache *s) return s->name; }
-static inline struct kmem_cache * -cache_from_memcg_idx(struct kmem_cache *s, int idx) -{ - return NULL; -} - static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) { return s; diff --git a/mm/slab_common.c b/mm/slab_common.c index 9e2111ae8e16..0867f6656858 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -132,6 +132,8 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr, LIST_HEAD(slab_root_caches); static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
+static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref); + void slab_init_memcg_params(struct kmem_cache *s) { s->memcg_params.root_cache = NULL; @@ -146,6 +148,12 @@ static int init_memcg_params(struct kmem_cache *s, struct memcg_cache_array *arr;
if (root_cache) { + int ret = percpu_ref_init(&s->memcg_params.refcnt, + kmemcg_cache_shutdown, + 0, GFP_KERNEL); + if (ret) + return ret; + s->memcg_params.root_cache = root_cache; INIT_LIST_HEAD(&s->memcg_params.children_node); INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node); @@ -171,6 +179,8 @@ static void destroy_memcg_params(struct kmem_cache *s) { if (is_root_cache(s)) kvfree(rcu_access_pointer(s->memcg_params.memcg_caches)); + else + percpu_ref_exit(&s->memcg_params.refcnt); }
static void free_memcg_params(struct rcu_head *rcu) @@ -226,6 +236,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg) if (is_root_cache(s)) { list_add(&s->root_caches_node, &slab_root_caches); } else { + css_get(&memcg->css); s->memcg_params.memcg = memcg; list_add(&s->memcg_params.children_node, &s->memcg_params.root_cache->memcg_params.children); @@ -241,6 +252,7 @@ static void memcg_unlink_cache(struct kmem_cache *s) } else { list_del(&s->memcg_params.children_node); list_del(&s->memcg_params.kmem_caches_node); + css_put(&s->memcg_params.memcg->css); } } #else @@ -659,7 +671,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, }
/* - * Since readers won't lock (see cache_from_memcg_idx()), we need a + * Since readers won't lock (see memcg_kmem_get_cache()), we need a * barrier here to ensure nobody will see the kmem_cache partially * initialized. */ @@ -682,16 +694,11 @@ static void kmemcg_workfn(struct work_struct *work) get_online_mems();
mutex_lock(&slab_mutex); - s->memcg_params.work_fn(s); - mutex_unlock(&slab_mutex);
put_online_mems(); put_online_cpus(); - - /* done, put the ref from kmemcg_cache_deactivate() */ - css_put(&s->memcg_params.memcg->css); }
static void kmemcg_rcufn(struct rcu_head *head) @@ -708,10 +715,38 @@ static void kmemcg_rcufn(struct rcu_head *head) queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); }
+static void kmemcg_cache_shutdown_fn(struct kmem_cache *s) +{ + WARN_ON(shutdown_cache(s)); +} + +static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref) +{ + struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache, + memcg_params.refcnt); + unsigned long flags; + + spin_lock_irqsave(&memcg_kmem_wq_lock, flags); + if (s->memcg_params.root_cache->memcg_params.dying) + goto unlock; + + s->memcg_params.work_fn = kmemcg_cache_shutdown_fn; + INIT_WORK(&s->memcg_params.work, kmemcg_workfn); + queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); + +unlock: + spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags); +} + +static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s) +{ + __kmemcg_cache_deactivate_after_rcu(s); + percpu_ref_kill(&s->memcg_params.refcnt); +} + static void kmemcg_cache_deactivate(struct kmem_cache *s) { - if (WARN_ON_ONCE(is_root_cache(s)) || - WARN_ON_ONCE(s->memcg_params.work_fn)) + if (WARN_ON_ONCE(is_root_cache(s))) return;
__kmemcg_cache_deactivate(s); @@ -725,10 +760,7 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s) if (s->memcg_params.root_cache->memcg_params.dying) goto unlock;
- /* pin memcg so that @s doesn't get destroyed in the middle */ - css_get(&s->memcg_params.memcg->css); - - s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu; + s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu; call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn); unlock: spin_unlock_irq(&memcg_kmem_wq_lock); @@ -762,28 +794,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) put_online_cpus(); }
-void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) -{ - struct kmem_cache *s, *s2; - - get_online_cpus(); - get_online_mems(); - - mutex_lock(&slab_mutex); - list_for_each_entry_safe(s, s2, &memcg->kmem_caches, - memcg_params.kmem_caches_node) { - /* - * The cgroup is about to be freed and therefore has no charges - * left. Hence, all its caches must be empty by now. - */ - BUG_ON(shutdown_cache(s)); - } - mutex_unlock(&slab_mutex); - - put_online_mems(); - put_online_cpus(); -} - static int shutdown_memcg_caches(struct kmem_cache *s) { struct memcg_cache_array *arr;
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit 4d96ba3530750fae3f3f01150adfecde96157815 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Every slab page charged to a non-root memory cgroup has a pointer to the memory cgroup and holds a reference to it, which protects a non-empty memory cgroup from being released. At the same time the page has a pointer to the corresponding kmem_cache and also holds a reference to the kmem_cache. And the kmem_cache itself holds a reference to the cgroup.

So there is clearly some redundancy, which allows us to stop setting the page->mem_cgroup pointer and to rely on getting the memcg pointer indirectly via the kmem_cache. Further, it will allow changing this pointer more easily, without any need to walk over all charged pages.

So let's stop setting the page->mem_cgroup pointer for slab pages, and stop using the css refcounter directly to protect the memory cgroup from going away. Instead, rely on the kmem_cache as an intermediate object.

Make sure that vmstats, shrinker lists and the /proc/kpagecgroup interface keep working as before.
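Put together, resolving the memcg of a kmem object now goes through the kmem_cache instead of page->mem_cgroup. Roughly (a condensed sketch combining the mm/list_lru.c and mm/slab.h hunks below into one illustrative helper):

  /*
   * Illustrative only: from a kmem object pointer to its memcg, via the
   * head page and the kmem_cache.  Root caches aren't charged to any
   * memcg, so NULL is returned for them.
   */
  static inline struct mem_cgroup *memcg_of_kmem_object(void *ptr)
  {
  	struct page *page = virt_to_head_page(ptr);
  	struct kmem_cache *s = READ_ONCE(page->slab_cache);

  	if (s && !is_root_cache(s))
  		return s->memcg_params.memcg;

  	return NULL;
  }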
Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 4d96ba3530750fae3f3f01150adfecde96157815) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/list_lru.c | 3 +- mm/memcontrol.c | 12 ++++---- mm/slab.h | 74 ++++++++++++++++++++++++++++++++++++++++--------- 3 files changed, 70 insertions(+), 19 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c index 758653dd1443..19828ba88c48 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -11,6 +11,7 @@ #include <linux/slab.h> #include <linux/mutex.h> #include <linux/memcontrol.h> +#include "slab.h"
#ifdef CONFIG_MEMCG_KMEM static LIST_HEAD(list_lrus); @@ -62,7 +63,7 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) if (!memcg_kmem_enabled()) return NULL; page = virt_to_head_page(ptr); - return page->mem_cgroup; + return memcg_from_slab_page(page); }
static inline struct list_lru_one * diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 94cba70d98d1..2a1e51175e91 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -495,7 +495,10 @@ ino_t page_cgroup_ino(struct page *page) unsigned long ino = 0;
rcu_read_lock(); - memcg = READ_ONCE(page->mem_cgroup); + if (PageHead(page) && PageSlab(page)) + memcg = memcg_from_slab_page(page); + else + memcg = READ_ONCE(page->mem_cgroup); while (memcg && !(memcg->css.flags & CSS_ONLINE)) memcg = parent_mem_cgroup(memcg); if (memcg) @@ -2706,9 +2709,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, cancel_charge(memcg, nr_pages); return -ENOMEM; } - - page->mem_cgroup = memcg; - return 0; }
@@ -2734,8 +2734,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order) goto out;
ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg); - if (!ret) + if (!ret) { + page->mem_cgroup = memcg; __SetPageKmemcg(page); + } }
out: diff --git a/mm/slab.h b/mm/slab.h index 16049a33dffa..b26e30b88f08 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -255,30 +255,67 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) return s->memcg_params.root_cache; }
+/* + * Expects a pointer to a slab page. Please note, that PageSlab() check + * isn't sufficient, as it returns true also for tail compound slab pages, + * which do not have slab_cache pointer set. + * So this function assumes that the page can pass PageHead() and PageSlab() + * checks. + */ +static inline struct mem_cgroup *memcg_from_slab_page(struct page *page) +{ + struct kmem_cache *s; + + s = READ_ONCE(page->slab_cache); + if (s && !is_root_cache(s)) + return s->memcg_params.memcg; + + return NULL; +} + +/* + * Charge the slab page belonging to the non-root kmem_cache. + * Can be called for non-root kmem_caches only. + */ static __always_inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order, struct kmem_cache *s) { + struct mem_cgroup *memcg; + struct lruvec *lruvec; int ret;
- if (is_root_cache(s)) - return 0; - - ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg); + memcg = s->memcg_params.memcg; + ret = memcg_kmem_charge_memcg(page, gfp, order, memcg); if (ret) return ret;
+ lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); + mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order); + + /* transer try_charge() page references to kmem_cache */ percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order); + css_put_many(&memcg->css, 1 << order);
return 0; }
+/* + * Uncharge a slab page belonging to a non-root kmem_cache. + * Can be called for non-root kmem_caches only. + */ static __always_inline void memcg_uncharge_slab(struct page *page, int order, struct kmem_cache *s) { - if (!is_root_cache(s)) - percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order); - memcg_kmem_uncharge(page, order); + struct mem_cgroup *memcg; + struct lruvec *lruvec; + + memcg = s->memcg_params.memcg; + lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); + mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order)); + memcg_kmem_uncharge_memcg(page, order, memcg); + + percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order); }
extern void slab_init_memcg_params(struct kmem_cache *); @@ -314,6 +351,11 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) return s; }
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page) +{ + return NULL; +} + static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order, struct kmem_cache *s) { @@ -340,18 +382,24 @@ static __always_inline int charge_slab_page(struct page *page, gfp_t gfp, int order, struct kmem_cache *s) { - int ret = memcg_charge_slab(page, gfp, order, s); - - if (!ret) - mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order); + if (is_root_cache(s)) { + mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + 1 << order); + return 0; + }
- return ret; + return memcg_charge_slab(page, gfp, order, s); }
static __always_inline void uncharge_slab_page(struct page *page, int order, struct kmem_cache *s) { - mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order)); + if (is_root_cache(s)) { + mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + -(1 << order)); + return; + } + memcg_uncharge_slab(page, order, s); }
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.3-rc1 commit fb2f2b0adb98bbbbbb51c5a5327f3f90f5dc417e category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Let's reparent non-root kmem_caches on memcg offlining. This allows us to release the memory cgroup without waiting for the last outstanding kernel object (e.g. a dentry used by another application).

Since the parent cgroup is already charged, all we need to do is splice the list of kmem_caches onto the parent's kmem_caches list, swap the memcg pointer, drop the css refcount for each kmem_cache and adjust the parent's css refcount accordingly.

Please note that kmem_cache->memcg_params.memcg isn't a stable pointer anymore. It's safe to read it under rcu_read_lock(), with cgroup_mutex held, or in any other way that protects the memory cgroup from being released.

We can race with the slab allocation and deallocation paths. It's not a big problem: the parent's charge and the global slab stats are always correct, and we don't care anymore about the child's usage and global stats. The child cgroup is already offline, so we don't use or show it anywhere.

Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't used anywhere except count_shadow_nodes(). But even there it won't break anything: after reparenting, "nodes" will be 0 at the child level (because we're already reparenting shrinker lists), and at the parent level the page stats were always 0, so this patch won't change anything.
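The mechanical part of the reparenting is compact; the sketch below condenses the memcg_deactivate_kmem_caches() hunk from the diff (deactivation of the per-memcg caches and the slab_mutex locking are left out, and the function name is illustrative):

  /*
   * Sketch of the reparenting step: move every child kmem_cache from the
   * offlined memcg to its parent and transfer the css references.
   */
  static void reparent_memcg_kmem_caches(struct mem_cgroup *memcg,
  					 struct mem_cgroup *parent)
  {
  	struct kmem_cache *s;
  	unsigned int nr_reparented = 0;

  	list_for_each_entry(s, &memcg->kmem_caches,
  			    memcg_params.kmem_caches_node) {
  		WRITE_ONCE(s->memcg_params.memcg, parent);
  		css_put(&memcg->css);	/* ref previously held by the cache */
  		nr_reparented++;
  	}
  	if (nr_reparented) {
  		list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
  		css_get_many(&parent->css, nr_reparented);
  	}
  }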
[guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup] Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: David Rientjes rientjes@google.com Cc: Christoph Lameter cl@linux.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Waiman Long longman@redhat.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Pekka Enberg penberg@kernel.org Cc: Andrei Vagin avagin@gmail.com Cc: Qian Cai cai@lca.pw Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit fb2f2b0adb98bbbbbb51c5a5327f3f90f5dc417e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/slab.h | 2 +- mm/memcontrol.c | 14 ++++++++------ mm/slab.h | 41 ++++++++++++++++++++++++++++++++--------- mm/slab_common.c | 19 +++++++++++++++++-- 4 files changed, 58 insertions(+), 18 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h index b3812235d59d..d79bd66b4aaa 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -152,7 +152,7 @@ void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *);
void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); -void memcg_deactivate_kmem_caches(struct mem_cgroup *); +void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
/* * Please use this macro to create slab caches. Simply specify the diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2a1e51175e91..ab33121bf706 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3228,15 +3228,15 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) */ memcg->kmem_state = KMEM_ALLOCATED;
- memcg_deactivate_kmem_caches(memcg); - - kmemcg_id = memcg->kmemcg_id; - BUG_ON(kmemcg_id < 0); - parent = parent_mem_cgroup(memcg); if (!parent) parent = root_mem_cgroup;
+ memcg_deactivate_kmem_caches(memcg, parent); + + kmemcg_id = memcg->kmemcg_id; + BUG_ON(kmemcg_id < 0); + /* * Change kmemcg_id of this cgroup and all its descendants to the * parent's id, and then move all entries from this cgroup's list_lrus @@ -3269,7 +3269,6 @@ static void memcg_free_kmem(struct mem_cgroup *memcg) if (memcg->kmem_state == KMEM_ALLOCATED) { WARN_ON(!list_empty(&memcg->kmem_caches)); static_branch_dec(&memcg_kmem_enabled_key); - WARN_ON(page_counter_read(&memcg->kmem)); } } #else @@ -4664,6 +4663,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* The following stuff does not apply to the root */ if (!parent) { +#ifdef CONFIG_MEMCG_KMEM + INIT_LIST_HEAD(&memcg->kmem_caches); +#endif root_mem_cgroup = memcg; return &memcg->css; } diff --git a/mm/slab.h b/mm/slab.h index b26e30b88f08..77fd31ea6877 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -261,6 +261,9 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) * which do not have slab_cache pointer set. * So this function assumes that the page can pass PageHead() and PageSlab() * checks. + * + * The kmem_cache can be reparented asynchronously. The caller must ensure + * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex. */ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page) { @@ -268,7 +271,7 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
s = READ_ONCE(page->slab_cache); if (s && !is_root_cache(s)) - return s->memcg_params.memcg; + return READ_ONCE(s->memcg_params.memcg);
return NULL; } @@ -285,10 +288,22 @@ static __always_inline int memcg_charge_slab(struct page *page, struct lruvec *lruvec; int ret;
- memcg = s->memcg_params.memcg; + rcu_read_lock(); + memcg = READ_ONCE(s->memcg_params.memcg); + while (memcg && !css_tryget_online(&memcg->css)) + memcg = parent_mem_cgroup(memcg); + rcu_read_unlock(); + + if (unlikely(!memcg || mem_cgroup_is_root(memcg))) { + mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + (1 << order)); + percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order); + return 0; + } + ret = memcg_kmem_charge_memcg(page, gfp, order, memcg); if (ret) - return ret; + goto out;
lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order); @@ -296,8 +311,9 @@ static __always_inline int memcg_charge_slab(struct page *page, /* transer try_charge() page references to kmem_cache */ percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order); css_put_many(&memcg->css, 1 << order); - - return 0; +out: + css_put(&memcg->css); + return ret; }
/* @@ -310,10 +326,17 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order, struct mem_cgroup *memcg; struct lruvec *lruvec;
- memcg = s->memcg_params.memcg; - lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); - mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order)); - memcg_kmem_uncharge_memcg(page, order, memcg); + rcu_read_lock(); + memcg = READ_ONCE(s->memcg_params.memcg); + if (likely(!mem_cgroup_is_root(memcg))) { + lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); + mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order)); + memcg_kmem_uncharge_memcg(page, order, memcg); + } else { + mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), + -(1 << order)); + } + rcu_read_unlock();
percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order); } diff --git a/mm/slab_common.c b/mm/slab_common.c index 0867f6656858..a7d804122140 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -252,7 +252,8 @@ static void memcg_unlink_cache(struct kmem_cache *s) } else { list_del(&s->memcg_params.children_node); list_del(&s->memcg_params.kmem_caches_node); - css_put(&s->memcg_params.memcg->css); + mem_cgroup_put(s->memcg_params.memcg); + WRITE_ONCE(s->memcg_params.memcg, NULL); } } #else @@ -766,11 +767,13 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s) spin_unlock_irq(&memcg_kmem_wq_lock); }
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) +void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg, + struct mem_cgroup *parent) { int idx; struct memcg_cache_array *arr; struct kmem_cache *s, *c; + unsigned int nr_reparented;
idx = memcg_cache_id(memcg);
@@ -788,6 +791,18 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) kmemcg_cache_deactivate(c); arr->entries[idx] = NULL; } + nr_reparented = 0; + list_for_each_entry(s, &memcg->kmem_caches, + memcg_params.kmem_caches_node) { + WRITE_ONCE(s->memcg_params.memcg, parent); + css_put(&memcg->css); + nr_reparented++; + } + if (nr_reparented) { + list_splice_init(&memcg->kmem_caches, + &parent->kmem_caches); + css_get_many(&parent->css, nr_reparented); + } mutex_unlock(&slab_mutex);
put_online_mems();
From: Waiman Long longman@redhat.com
mainline inclusion from mainline-5.3-rc1 commit fcf8a1e483490cd249df4e02d5425636c3f43c86 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
There are concerns about memory leaks from extensive use of memory cgroups, as each memory cgroup creates its own set of kmem caches. There is a possibility that the memcg kmem caches may remain even after the memory cgroups have been offlined. Therefore, it will be useful to show the status of each of the memcg kmem caches.

This patch introduces a new <debugfs>/memcg_slabinfo file which is somewhat similar to /proc/slabinfo in format, but lists only information about kmem caches that have child memcg kmem caches. Information available in /proc/slabinfo is not repeated in memcg_slabinfo.
A portion of a sample output of the file was:
# <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
rpc_inode_cache      root          13     51      1      1
rpc_inode_cache        48           0      0      0      0
fat_inode_cache      root           1     45      1      1
fat_inode_cache        41           2     45      1      1
xfs_inode            root         770    816     24     24
xfs_inode              92          22     34      1      1
xfs_inode         88:dead           1     34      1      1
xfs_inode         89:dead          23     34      1      1
xfs_inode              85           4     34      1      1
xfs_inode              84           9     34      1      1
The css id of the memcg is also listed. If a memcg is not online, the tag ":dead" will be attached as shown above.
[longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo] Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com [longman@redhat.com: set the flag in the common code as suggested by Roman] Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com Signed-off-by: Waiman Long longman@redhat.com Suggested-by: Shakeel Butt shakeelb@google.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Roman Gushchin guro@fb.com Acked-by: David Rientjes rientjes@google.com Cc: Christoph Lameter cl@linux.com Cc: Pekka Enberg penberg@kernel.org Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Michal Hocko mhocko@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit fcf8a1e483490cd249df4e02d5425636c3f43c86) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/slab.h | 4 +++ mm/slab_common.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+)
diff --git a/include/linux/slab.h b/include/linux/slab.h index d79bd66b4aaa..b6c34531fe1b 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -116,6 +116,10 @@ /* Objects are reclaimable */ #define SLAB_RECLAIM_ACCOUNT ((slab_flags_t __force)0x00020000U) #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */ + +/* Slab deactivation flag */ +#define SLAB_DEACTIVATED ((slab_flags_t __force)0x10000000U) + /* * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests. * diff --git a/mm/slab_common.c b/mm/slab_common.c index a7d804122140..65334ce90ee6 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/seq_file.h> #include <linux/proc_fs.h> +#include <linux/debugfs.h> #include <asm/cacheflush.h> #include <asm/tlbflush.h> #include <asm/page.h> @@ -751,6 +752,7 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s) return;
__kmemcg_cache_deactivate(s); + s->flags |= SLAB_DEACTIVATED;
/* * memcg_kmem_wq_lock is used to synchronize memcg_params.dying @@ -1501,6 +1503,64 @@ static int __init slab_proc_init(void) return 0; } module_init(slab_proc_init); + +#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM) +/* + * Display information about kmem caches that have child memcg caches. + */ +static int memcg_slabinfo_show(struct seq_file *m, void *unused) +{ + struct kmem_cache *s, *c; + struct slabinfo sinfo; + + mutex_lock(&slab_mutex); + seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>"); + seq_puts(m, " <active_slabs> <num_slabs>\n"); + list_for_each_entry(s, &slab_root_caches, root_caches_node) { + /* + * Skip kmem caches that don't have any memcg children. + */ + if (list_empty(&s->memcg_params.children)) + continue; + + memset(&sinfo, 0, sizeof(sinfo)); + get_slabinfo(s, &sinfo); + seq_printf(m, "%-17s root %6lu %6lu %6lu %6lu\n", + cache_name(s), sinfo.active_objs, sinfo.num_objs, + sinfo.active_slabs, sinfo.num_slabs); + + for_each_memcg_cache(c, s) { + struct cgroup_subsys_state *css; + char *status = ""; + + css = &c->memcg_params.memcg->css; + if (!(css->flags & CSS_ONLINE)) + status = ":dead"; + else if (c->flags & SLAB_DEACTIVATED) + status = ":deact"; + + memset(&sinfo, 0, sizeof(sinfo)); + get_slabinfo(c, &sinfo); + seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n", + cache_name(c), css->id, status, + sinfo.active_objs, sinfo.num_objs, + sinfo.active_slabs, sinfo.num_slabs); + } + } + mutex_unlock(&slab_mutex); + return 0; +} +DEFINE_SHOW_ATTRIBUTE(memcg_slabinfo); + +static int __init memcg_slabinfo_init(void) +{ + debugfs_create_file("memcg_slabinfo", S_IFREG | S_IRUGO, + NULL, NULL, &memcg_slabinfo_fops); + return 0; +} + +late_initcall(memcg_slabinfo_init); +#endif /* CONFIG_DEBUG_FS && CONFIG_MEMCG_KMEM */ #endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */
static __always_inline void *__do_krealloc(const void *p, size_t new_size,
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-4.20-rc1 commit 591edfb10a949d635ed770c6e85ec5286206c07e category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Memcg charging is batched using per-cpu stocks, so an offline memcg can be pinned by a cached charge up to the moment when a process belonging to some other cgroup charges some memory on the same cpu. In other words, cached charges can prevent a memory cgroup from being reclaimed for some time, without any clear need.
Let's optimize it by explicit draining of all stocks on css offlining. As draining is performed asynchronously, and is skipped if any parallel draining is happening, it's cheap.
Link: http://lkml.kernel.org/r/20180827162621.30187-2-guro@fb.com Signed-off-by: Roman Gushchin guro@fb.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Michal Hocko mhocko@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Konstantin Khlebnikov koct9i@gmail.com Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 591edfb10a949d635ed770c6e85ec5286206c07e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ab33121bf706..0abfd13c4015 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4727,6 +4727,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) memcg_offline_kmem(memcg); wb_memcg_offline(memcg);
+ drain_all_stock(memcg); + mem_cgroup_id_put(memcg); }
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.5-rc1 commit a264df74df38855096393447f1b8f386069a94b9 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Christian reported a warning like the following, obtained while running some KVM-related tests on s390:
WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
Hardware name: IBM 2964 NC9 712 (LPAR)
Workqueue: events sysfs_slab_remove_workfn
Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
           0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
           0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
           0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
Krnl Code: 0000001529746842: f0a0000407fe	srp	4(11,%r0),2046,0
           0000001529746848: 47000700		bc	0,1792
          #000000152974684c: a7f40001		brc	15,152974684e
          >0000001529746850: a7f4fff2		brc	15,1529746834
           0000001529746854: 0707		bcr	0,%r7
           0000001529746856: 0707		bcr	0,%r7
           0000001529746858: eb8ff0580024	stmg	%r8,%r15,88(%r15)
           000000152974685e: a738ffff		lhi	%r3,-1
Call Trace:
([<000003e009263d00>] 0x3e009263d00)
 [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
 [<0000001529b04882>] kobject_put+0xaa/0xe8
 [<000000152918cf28>] process_one_work+0x1e8/0x428
 [<000000152918d1b0>] worker_thread+0x48/0x460
 [<00000015291942c6>] kthread+0x126/0x160
 [<0000001529b22344>] ret_from_fork+0x28/0x30
 [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
Last Breaking-Event-Address:
 [<000000152974684c>] percpu_ref_exit+0x4c/0x58
---[ end trace b035e7da5788eb09 ]---
The problem occurs because kmem_cache_destroy() is called immediately after deleting of a memcg, so it races with the memcg kmem_cache deactivation.
flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is supposed to guarantee that all deactivation processes are finished, but failed to do so. It waits for an rcu grace period, after which all children kmem_caches should be deactivated. During the deactivation percpu_ref_kill() is called for non root kmem_cache refcounters, but it requires yet another rcu grace period to finish the transition to the atomic (dead) state.
So in a rare case when not all children kmem_caches are destroyed at the moment when the root kmem_cache is about to be gone, we need to wait another rcu grace period before destroying the root kmem_cache.
This issue can be triggered only with dynamically created kmem_caches which are used with memcg accounting. In this case per-memcg child kmem_caches are created. They are deactivated from the cgroup removing path. If the destruction of the root kmem_cache is racing with the removal of the cgroup (both are quite complicated multi-stage processes), the described issue can occur. The only known way to trigger it in the real life, is to unload some kernel module which creates a dedicated kmem_cache, used from different memory cgroups with GFP_ACCOUNT flag. If the unloading happens immediately after calling rmdir on the corresponding cgroup, there is some chance to trigger the issue.
Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management") Signed-off-by: Roman Gushchin guro@fb.com Reported-by: Christian Borntraeger borntraeger@de.ibm.com Tested-by: Christian Borntraeger borntraeger@de.ibm.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit a264df74df38855096393447f1b8f386069a94b9) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/slab_common.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/mm/slab_common.c b/mm/slab_common.c index 65334ce90ee6..91b250aea129 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -888,6 +888,18 @@ static void flush_memcg_workqueue(struct kmem_cache *s) */ if (likely(memcg_kmem_cache_wq)) flush_workqueue(memcg_kmem_cache_wq); + + /* + * If we're racing with children kmem_cache deactivation, it might + * take another rcu grace period to complete their destruction. + * At this moment the corresponding percpu_ref_kill() call should be + * done, but it might take another rcu grace period to complete + * switching to the atomic mode. + * Please, note that we check without grabbing the slab_mutex. It's safe + * because at this moment the children list can't grow. + */ + if (!list_empty(&s->memcg_params.children)) + rcu_barrier(); } #else static inline int shutdown_memcg_caches(struct kmem_cache *s)
From: Vlastimil Babka vbabka@suse.cz
mainline inclusion from mainline-5.0-rc1 commit 4e45f712d82c6b7a37e02faf388173ad12ab464d category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Multiple people have reported the following sparse warning:
./include/linux/slab.h:332:43: warning: dubious: x & !y
The minimal fix would be to change the logical & to boolean &&, which emits the same code, but Andrew has suggested that the branch-avoiding tricks are maybe not worthwhile. David Laight provided a nice comparison of disassembly of multiple variants, which shows that the current version produces a 4-deep dependency chain, and fixing the sparse warning by changing logical and to multiplication emits an IMUL, making it even more expensive.
The code as rewritten by this patch yielded the best disassembly, with a single predictable branch for the most common case, and a ternary operator for the rest, which gcc seems to compile without a branch or cmov by itself.
The result should be more readable, without a sparse warning and probably also faster for the common case.
Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Reviewed-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Vlastimil Babka vbabka@suse.cz Reported-by: Bart Van Assche bvanassche@acm.org Reported-by: Darryl T. Agostinelli dagostinelli@gmail.com Reported-by: Masahiro Yamada yamada.masahiro@socionext.com Suggested-by: Andrew Morton akpm@linux-foundation.org Suggested-by: David Laight David.Laight@ACULAB.COM Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 4e45f712d82c6b7a37e02faf388173ad12ab464d) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/slab.h | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h index b6c34531fe1b..788f04a7ca76 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -320,22 +320,22 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) { - int is_dma = 0; - int type_dma = 0; - int is_reclaimable; - #ifdef CONFIG_ZONE_DMA - is_dma = !!(flags & __GFP_DMA); - type_dma = is_dma * KMALLOC_DMA; -#endif - - is_reclaimable = !!(flags & __GFP_RECLAIMABLE); + /* + * The most common case is KMALLOC_NORMAL, so test for it + * with a single branch for both flags. + */ + if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0)) + return KMALLOC_NORMAL;
/* - * If an allocation is both __GFP_DMA and __GFP_RECLAIMABLE, return - * KMALLOC_DMA and effectively ignore __GFP_RECLAIMABLE + * At least one of the flags has to be set. If both are, __GFP_DMA + * is more important. */ - return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM; + return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM; +#else + return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL; +#endif }
/*
From: Johannes Weiner jweiner@fb.com
mainline inclusion from mainline-4.20-rc1 commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
Overview
PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources.
This helps users understand the resource pressure their workloads are under, which allows them to root-cause and fix throughput and latency problems caused by overcommitting, underprovisioning, or suboptimal job placement in a grid, as well as anticipate major disruptions like OOM.
Real-world applications
We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories.
One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets.
We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in.
It is available here: https://github.com/facebookincubator/oomd
Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel.
We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines.
Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each other's toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard or impossible to root-cause them post-mortem.
We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling.
PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging.
How do you use this feature?
A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide.
The cpu file contains one line:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average.
The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware).
What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action.
The memory file contains two lines:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim.
The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway.
The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition.
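As a concrete illustration of consuming this interface, here is a minimal userspace sketch that parses /proc/pressure/memory. It assumes a kernel with CONFIG_PSI=y and relies only on the line format shown in the samples above; it is not part of this patch.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char kind[8];			/* "some" or "full" */
	float avg10, avg60, avg300;	/* recent stall percentages */
	unsigned long long total;	/* cumulative stall time in us */

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fscanf(f, "%7s avg10=%f avg60=%f avg300=%f total=%llu",
		      kind, &avg10, &avg60, &avg300, &total) == 5)
		printf("%s: avg10=%.2f%% avg60=%.2f%% avg300=%.2f%% total=%llu us\n",
		       kind, avg10, avg60, avg300, total);
	fclose(f);
	return 0;
}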
FAQ
Q: How is PSI's CPU component different from the load average?
A: There are several quirks in the load average that make it hard or impossible to tell how overcommitted the CPU really is.
1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run.
PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup.
2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization.
PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events.
3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler.
PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall.
Q: What's the cost / performance impact of this feature?
A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps.
I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions.
In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n.
For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test.
For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines.
During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25% in the second half, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%.
UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control.
UPDATE: v4 is on par with v3.
As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either.
The profile for the pipe benchmark shows:
0.87%  sched-pipe  [kernel.vmlinux]  [k] psi_group_change
0.83%  perf.real   [kernel.vmlinux]  [k] psi_group_change
0.82%  perf.real   [kernel.vmlinux]  [k] psi_task_change
0.58%  sched-pipe  [kernel.vmlinux]  [k] psi_task_change
The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%.
For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test.
Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout.
Daniel Drake said:
: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.
Suren said:
: Backported to 4.9 and retested on ARMv8 8 code system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate an overflow/underflow issues. Nicely
: done!
This patch (of 9):
If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure.
[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner jweiner@fb.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Rik van Riel riel@surriel.com Tested-by: Daniel Drake drake@endlessm.com Tested-by: Suren Baghdasaryan surenb@google.com Cc: Ingo Molnar mingo@redhat.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Cc: Christopher Lameter cl@linux.com Cc: Peter Enderborg peter.enderborg@sony.com Cc: Shakeel Butt shakeelb@google.com Cc: Mike Galbraith efault@gmx.de Cc: Randy Dunlap rdunlap@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/workingset.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c index 4516dd790129..7d5fa0dd2b38 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, { unsigned long max_nodes; unsigned long nodes; - unsigned long cache; + unsigned long pages;
nodes = list_lru_shrink_count(&shadow_nodes, sc);
@@ -390,14 +390,20 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, * * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE */ +#ifdef CONFIG_MEMCG if (sc->memcg) { - cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid, - LRU_ALL_FILE); - } else { - cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) + - node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE); - } - max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3); + struct lruvec *lruvec; + + pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid, + LRU_ALL); + lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg); + pages += lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE); + pages += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE); + } else +#endif + pages = node_present_pages(sc->nid); + + max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
if (!nodes) return SHRINK_EMPTY;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-4.20-rc1 commit 1899ad18c6072d689896badafb81267b0a1092a4 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing.
Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits.
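To make the shadow-entry bookkeeping above more tangible, here is a small standalone C sketch of the same idea: pack a memcg id, a node id, the workingset bit and an eviction counter into one word, and compute a wraparound-safe refault distance. The field widths are invented for the example; the kernel derives them from MEM_CGROUP_ID_SHIFT, NODES_SHIFT and the radix tree exceptional-entry bits, as the diff below shows.

#include <stdio.h>

/* Invented field widths, for illustration only. */
#define WS_BITS		1
#define NODE_BITS	4
#define MEMCG_BITS	16
#define EVICTION_SHIFT	(WS_BITS + NODE_BITS + MEMCG_BITS)
#define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)

static unsigned long pack_shadow(int memcgid, int nid,
				 unsigned long eviction, int workingset)
{
	eviction = (eviction << MEMCG_BITS) | memcgid;
	eviction = (eviction << NODE_BITS) | nid;
	eviction = (eviction << WS_BITS) | workingset;
	return eviction;
}

static void unpack_shadow(unsigned long entry, int *memcgid, int *nid,
			  unsigned long *eviction, int *workingset)
{
	*workingset = entry & ((1UL << WS_BITS) - 1);
	entry >>= WS_BITS;
	*nid = entry & ((1UL << NODE_BITS) - 1);
	entry >>= NODE_BITS;
	*memcgid = entry & ((1UL << MEMCG_BITS) - 1);
	entry >>= MEMCG_BITS;
	*eviction = entry;
}

int main(void)
{
	/* Page was active (workingset=1) when it was evicted at "time" 1000. */
	unsigned long shadow = pack_shadow(42, 1, 1000, 1);
	unsigned long eviction, refault = 1543, distance;
	int memcgid, nid, workingset;

	unpack_shadow(shadow, &memcgid, &nid, &eviction, &workingset);
	/* Unsigned subtraction keeps the distance correct across wraparound. */
	distance = (refault - eviction) & EVICTION_MASK;
	printf("memcg=%d nid=%d workingset=%d refault_distance=%lu\n",
	       memcgid, nid, workingset, distance);
	return 0;
}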
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Daniel Drake drake@endlessm.com Tested-by: Suren Baghdasaryan surenb@google.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 1899ad18c6072d689896badafb81267b0a1092a4) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/filemap.c
Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/mmzone.h | 1 + include/linux/page-flags.h | 5 +- include/linux/swap.h | 2 +- include/trace/events/mmflags.h | 1 + mm/filemap.c | 10 ++-- mm/huge_memory.c | 1 + mm/migrate.c | 2 + mm/swap_state.c | 1 + mm/vmscan.c | 1 + mm/vmstat.c | 1 + mm/workingset.c | 95 ++++++++++++++++++++++------------ 11 files changed, 78 insertions(+), 42 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4133cd6d6a34..0c9e870fb3fe 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -164,6 +164,7 @@ enum node_stat_item { NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_REFAULT, WORKINGSET_ACTIVATE, + WORKINGSET_RESTORE, WORKINGSET_NODERECLAIM, NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 23af05f8122d..17eeaf76259b 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -69,13 +69,14 @@ */ enum pageflags { PG_locked, /* Page is locked. Don't touch. */ - PG_error, PG_referenced, PG_uptodate, PG_dirty, PG_lru, PG_active, + PG_workingset, PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ + PG_error, PG_slab, PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, @@ -281,6 +282,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) TESTCLEARFLAG(Active, active, PF_HEAD) +PAGEFLAG(Workingset, workingset, PF_HEAD) + TESTCLEARFLAG(Workingset, workingset, PF_HEAD) __PAGEFLAG(Slab, slab, PF_NO_TAIL) __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL) PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */ diff --git a/include/linux/swap.h b/include/linux/swap.h index ca7b98e4ce74..aecda6766417 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -312,7 +312,7 @@ struct vma_swap_readahead {
/* linux/mm/workingset.c */ void *workingset_eviction(struct address_space *mapping, struct page *page); -bool workingset_refault(void *shadow); +void workingset_refault(struct page *page, void *shadow); void workingset_activation(struct page *page);
/* Do not use directly, use workingset_lookup_update */ diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 2994f1c86a46..894305dad41e 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -88,6 +88,7 @@ {1UL << PG_dirty, "dirty" }, \ {1UL << PG_lru, "lru" }, \ {1UL << PG_active, "active" }, \ + {1UL << PG_workingset, "workingset" }, \ {1UL << PG_slab, "slab" }, \ {1UL << PG_owner_priv_1, "owner_priv_1" }, \ {1UL << PG_arch_1, "arch_1" }, \ diff --git a/mm/filemap.c b/mm/filemap.c index 8a8bf78d3ac0..55bb93ca0076 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1023,12 +1023,10 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, * data from the working set, only to cache data that will * get overwritten with something else, is a waste of memory. */ - if (!(gfp_mask & __GFP_WRITE) && - shadow && workingset_refault(shadow)) { - SetPageActive(page); - workingset_activation(page); - } else - ClearPageActive(page); + WARN_ON_ONCE(PageActive(page)); + if (!(gfp_mask & __GFP_WRITE) && shadow) + workingset_refault(page, shadow); + if (!PagePercpuRef(page)) lru_cache_add(page); } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8b137248b146..6aef386c1726 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2455,6 +2455,7 @@ static void __split_huge_page_tail(struct page *head, int tail, (1L << PG_mlocked) | (1L << PG_uptodate) | (1L << PG_active) | + (1L << PG_workingset) | (1L << PG_locked) | (1L << PG_unevictable) | (1L << PG_dirty))); diff --git a/mm/migrate.c b/mm/migrate.c index a69b842f95da..f1200de72999 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -683,6 +683,8 @@ void migrate_page_states(struct page *newpage, struct page *page) SetPageActive(newpage); } else if (TestClearPageUnevictable(page)) SetPageUnevictable(newpage); + if (PageWorkingset(page)) + SetPageWorkingset(newpage); if (PageChecked(page)) SetPageChecked(newpage); if (PageMappedToDisk(page)) diff --git a/mm/swap_state.c b/mm/swap_state.c index 2137e2d57196..3b538efdb3e1 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -460,6 +460,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, /* * Initiate read into locked page and return. */ + SetPageWorkingset(new_page); lru_cache_add_anon(new_page); *new_page_allocated = true; return new_page; diff --git a/mm/vmscan.c b/mm/vmscan.c index d8b6ba5bd29e..214445a9c0e5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2159,6 +2159,7 @@ static void shrink_active_list(unsigned long nr_to_scan, }
ClearPageActive(page); /* we are de-activating */ + SetPageWorkingset(page); list_add(&page->lru, &l_inactive); }
diff --git a/mm/vmstat.c b/mm/vmstat.c index acc9f1c6b43a..d1c02a4fe9f8 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1145,6 +1145,7 @@ const char * const vmstat_text[] = { "nr_isolated_file", "workingset_refault", "workingset_activate", + "workingset_restore", "workingset_nodereclaim", "nr_anon_pages", "nr_mapped", diff --git a/mm/workingset.c b/mm/workingset.c index 7d5fa0dd2b38..99b7f7c09b13 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -121,7 +121,7 @@ * the only thing eating into inactive list space is active pages. * * - * Activating refaulting pages + * Refaulting inactive pages * * All that is known about the active list is that the pages have been * accessed more than once in the past. This means that at any given @@ -134,6 +134,10 @@ * used less frequently than the refaulting page - or even not used at * all anymore. * + * That means if inactive cache is refaulting with a suitable refault + * distance, we assume the cache workingset is transitioning and put + * pressure on the current active list. + * * If this is wrong and demotion kicks in, the pages which are truly * used more frequently will be reactivated while the less frequently * used once will be evicted from memory. @@ -141,6 +145,14 @@ * But if this is right, the stale pages will be pushed out of memory * and the used pages get to stay in cache. * + * Refaulting active pages + * + * If on the other hand the refaulting pages have recently been + * deactivated, it means that the active list is no longer protecting + * actively used cache from reclaim. The cache is NOT transitioning to + * a different workingset; the existing workingset is thrashing in the + * space allocated to the page cache. + * * * Implementation * @@ -156,8 +168,7 @@ */
#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \ - NODES_SHIFT + \ - MEM_CGROUP_ID_SHIFT) + 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT) #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
/* @@ -170,23 +181,28 @@ */ static unsigned int bucket_order __read_mostly;
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction) +static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, + bool workingset) { eviction >>= bucket_order; eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction = (eviction << NODES_SHIFT) | pgdat->node_id; + eviction = (eviction << 1) | workingset; eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY); }
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, - unsigned long *evictionp) + unsigned long *evictionp, bool *workingsetp) { unsigned long entry = (unsigned long)shadow; int memcgid, nid; + bool workingset;
entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT; + workingset = entry & 1; + entry >>= 1; nid = entry & ((1UL << NODES_SHIFT) - 1); entry >>= NODES_SHIFT; memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1); @@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, *memcgidp = memcgid; *pgdat = NODE_DATA(nid); *evictionp = entry << bucket_order; + *workingsetp = workingset; }
/** @@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, */ void *workingset_eviction(struct address_space *mapping, struct page *page) { - struct mem_cgroup *memcg = page_memcg(page); struct pglist_data *pgdat = page_pgdat(page); + struct mem_cgroup *memcg = page_memcg(page); int memcgid = mem_cgroup_id(memcg); unsigned long eviction; struct lruvec *lruvec; @@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
lruvec = mem_cgroup_lruvec(pgdat, memcg); eviction = atomic_long_inc_return(&lruvec->inactive_age); - return pack_shadow(memcgid, pgdat, eviction); + return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); }
/** * workingset_refault - evaluate the refault of a previously evicted page + * @page: the freshly allocated replacement page * @shadow: shadow entry of the evicted page * * Calculates and evaluates the refault distance of the previously * evicted page in the context of the node it was allocated in. - * - * Returns %true if the page should be activated, %false otherwise. */ -bool workingset_refault(void *shadow) +void workingset_refault(struct page *page, void *shadow) { unsigned long refault_distance; + struct pglist_data *pgdat; unsigned long active_file; struct mem_cgroup *memcg; unsigned long eviction; struct lruvec *lruvec; unsigned long refault; - struct pglist_data *pgdat; + bool workingset; int memcgid;
- unpack_shadow(shadow, &memcgid, &pgdat, &eviction); + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
rcu_read_lock(); /* @@ -263,41 +280,51 @@ bool workingset_refault(void *shadow) * configurations instead. */ memcg = mem_cgroup_from_id(memcgid); - if (!mem_cgroup_disabled() && !memcg) { - rcu_read_unlock(); - return false; - } + if (!mem_cgroup_disabled() && !memcg) + goto out; lruvec = mem_cgroup_lruvec(pgdat, memcg); refault = atomic_long_read(&lruvec->inactive_age); active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
/* - * The unsigned subtraction here gives an accurate distance - * across inactive_age overflows in most cases. + * Calculate the refault distance * - * There is a special case: usually, shadow entries have a - * short lifetime and are either refaulted or reclaimed along - * with the inode before they get too old. But it is not - * impossible for the inactive_age to lap a shadow entry in - * the field, which can then can result in a false small - * refault distance, leading to a false activation should this - * old entry actually refault again. However, earlier kernels - * used to deactivate unconditionally with *every* reclaim - * invocation for the longest time, so the occasional - * inappropriate activation leading to pressure on the active - * list is not a problem. + * The unsigned subtraction here gives an accurate distance + * across inactive_age overflows in most cases. There is a + * special case: usually, shadow entries have a short lifetime + * and are either refaulted or reclaimed along with the inode + * before they get too old. But it is not impossible for the + * inactive_age to lap a shadow entry in the field, which can + * then result in a false small refault distance, leading to a + * false activation should this old entry actually refault + * again. However, earlier kernels used to deactivate + * unconditionally with *every* reclaim invocation for the + * longest time, so the occasional inappropriate activation + * leading to pressure on the active list is not a problem. */ refault_distance = (refault - eviction) & EVICTION_MASK;
inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
- if (refault_distance <= active_file) { - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); - rcu_read_unlock(); - return true; + /* + * Compare the distance to the existing workingset size. We + * don't act on pages that couldn't stay resident even if all + * the memory was available to the page cache. + */ + if (refault_distance > active_file) + goto out; + + SetPageActive(page); + atomic_long_inc(&lruvec->inactive_age); + inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); + + /* Page was active prior to eviction */ + if (workingset) { + SetPageWorkingset(page); + inc_lruvec_state(lruvec, WORKINGSET_RESTORE); } +out: rcu_read_unlock(); - return false; }
/**
From: yuzhoujian yuzhoujian@didichuxing.com
mainline inclusion from mainline-5.0-rc1 commit ef8444ea01d7442652f8e1b8a8b94278cb57eafd category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- The OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have the cpuset context followed by the stack trace of the OOM path. The third one is the OOM memory information, followed by the current memory state of all system tasks. At last, we show oom eligible tasks and the information about the chosen oom victim.
One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to
1) who invoked oom and what was the allocation request
[ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
2) OOM stack trace
[ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
[ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
[ 515.906821] Call Trace:
[ 515.908062]  dump_stack+0x5a/0x73
[ 515.909311]  dump_header+0x55/0x28c
[ 515.914260]  oom_kill_process+0x2d8/0x300
[ 515.916708]  out_of_memory+0x145/0x4a0
[ 515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
[ 515.919157]  __alloc_pages_nodemask+0x277/0x290
[ 515.920367]  filemap_fault+0x3d0/0x6c0
[ 515.921529]  ? filemap_map_pages+0x2b8/0x420
[ 515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
[ 515.923884]  __do_fault+0x20/0x80
[ 515.925032]  __handle_mm_fault+0xbc0/0xe80
[ 515.926195]  handle_mm_fault+0xfa/0x210
[ 515.927357]  __do_page_fault+0x233/0x4c0
[ 515.928506]  do_page_fault+0x32/0x140
[ 515.929646]  ? page_fault+0x8/0x30
[ 515.930770]  page_fault+0x1e/0x30
3) OOM memory information
[ 515.958093] Mem-Info:
[ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
 active_file:4402672 inactive_file:483963 isolated_file:1344
 unevictable:0 dirty:4886753 writeback:0 unstable:0
 slab_reclaimable:148442 slab_unreclaimable:18741
 mapped:1347 shmem:1347 pagetables:58669 bounce:0
 free:88663 free_pcp:0 free_cma:0
...
4) current memory state of all system tasks
[ 516.079544] [   744]     0   744     9211     1345   114688       82             0 systemd-journal
[ 516.082034] [   787]     0   787    31764        0   143360       92             0 lvmetad
[ 516.084465] [   792]     0   792    10930        1   110592      208         -1000 systemd-udevd
[ 516.086865] [  1199]     0  1199    13866        0   131072      112         -1000 auditd
[ 516.089190] [  1222]     0  1222    31990        1   110592      157             0 smartd
[ 516.091477] [  1225]     0  1225     4864       85    81920       43             0 irqbalance
[ 516.093712] [  1226]     0  1226    52612        0   258048      426             0 abrtd
[ 516.112128] [  1280]     0  1280   109774       55   299008      400             0 NetworkManager
[ 516.113998] [  1295]     0  1295    28817       37    69632       24             0 ksmtuned
[ 516.144596] [ 10718]     0 10718  2622484  1721372 15998976   267219             0 panic
[ 516.145792] [ 10719]     0 10719  2622484  1164767  9818112    53576             0 panic
[ 516.146977] [ 10720]     0 10720  2622484  1174361  9904128    53709             0 panic
[ 516.148163] [ 10721]     0 10721  2622484  1209070 10194944    54824             0 panic
[ 516.149329] [ 10722]     0 10722  2622484  1745799 14774272    91138             0 panic
5) oom context (constraints and the chosen victim).
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
An admin can easily get the full oom context on a single line, which makes parsing much easier.
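As a toy illustration of that parsing, the sketch below pulls the constraint, task, pid and uid out of the one-line summary. The input string is the sample from this patch; a real tool would read the line from the kernel log instead.

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Sample summary line from this patch; a real tool would read dmesg. */
	char line[] =
		"oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),"
		"cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0";
	char constraint[32] = "", task[32] = "";
	int pid = -1, uid = -1;
	char *tok;

	/* The summary is a flat list of key=value pairs separated by commas. */
	for (tok = strtok(line, ","); tok; tok = strtok(NULL, ",")) {
		sscanf(tok, "oom-kill:constraint=%31s", constraint);
		sscanf(tok, "task=%31[^,]", task);
		sscanf(tok, "pid=%d", &pid);
		sscanf(tok, "uid=%d", &uid);
	}
	printf("constraint=%s task=%s pid=%d uid=%d\n",
	       constraint, task, pid, uid);
	return 0;
}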
Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail... Signed-off-by: yuzhoujian yuzhoujian@didichuxing.com Acked-by: Michal Hocko mhocko@suse.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: David Rientjes rientjes@google.com Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Roman Gushchin guro@fb.com Cc: Tetsuo Handa penguin-kernel@i-love.sakura.ne.jp Cc: Yang Shi yang.s@alibaba-inc.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit ef8444ea01d7442652f8e1b8a8b94278cb57eafd) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/oom.h | 10 ++++++++++ kernel/cgroup/cpuset.c | 4 ++-- mm/oom_kill.c | 29 ++++++++++++++++++++--------- mm/page_alloc.c | 4 ++-- 4 files changed, 34 insertions(+), 13 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h index 689d32ab694b..339d1c940500 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -15,6 +15,13 @@ struct notifier_block; struct mem_cgroup; struct task_struct;
+enum oom_constraint { + CONSTRAINT_NONE, + CONSTRAINT_CPUSET, + CONSTRAINT_MEMORY_POLICY, + CONSTRAINT_MEMCG, +}; + /* * Details of the page allocation that triggered the oom killer that are used to * determine what should be killed. @@ -42,6 +49,9 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + + /* Used to print the constraint info. */ + enum oom_constraint constraint; };
extern struct mutex oom_lock; diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index a4ce9474a078..feb91177247c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2681,9 +2681,9 @@ void cpuset_print_current_mems_allowed(void) rcu_read_lock();
cgrp = task_cs(current)->css.cgroup; - pr_info("%s cpuset=", current->comm); + pr_cont(",cpuset="); pr_cont_cgroup_name(cgrp); - pr_cont(" mems_allowed=%*pbl\n", + pr_cont(",mems_allowed=%*pbl", nodemask_pr_args(¤t->mems_allowed));
rcu_read_unlock(); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 46cd317e6743..b710928cc4d2 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -248,11 +248,11 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, return points > 0 ? points : 1; }
-enum oom_constraint { - CONSTRAINT_NONE, - CONSTRAINT_CPUSET, - CONSTRAINT_MEMORY_POLICY, - CONSTRAINT_MEMCG, +static const char * const oom_constraint_text[] = { + [CONSTRAINT_NONE] = "CONSTRAINT_NONE", + [CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET", + [CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY", + [CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG", };
/* @@ -428,16 +428,25 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) rcu_read_unlock(); }
+static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim) +{ + /* one line summary of the oom killer context. */ + pr_info("oom-kill:constraint=%s,nodemask=%*pbl", + oom_constraint_text[oc->constraint], + nodemask_pr_args(oc->nodemask)); + cpuset_print_current_mems_allowed(); + pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid, + from_kuid(&init_user_ns, task_uid(victim))); +} + static void dump_header(struct oom_control *oc, struct task_struct *p) { - pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n", - current->comm, oc->gfp_mask, &oc->gfp_mask, - nodemask_pr_args(oc->nodemask), oc->order, + pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n", + current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, current->signal->oom_score_adj); if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order) pr_warn("COMPACTION is disabled!!!\n");
- cpuset_print_current_mems_allowed(); dump_stack(); if (is_memcg_oom(oc)) mem_cgroup_print_oom_info(oc->memcg, p); @@ -448,6 +457,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p) } if (sysctl_oom_dump_tasks) dump_tasks(oc->memcg, oc->nodemask); + if (p) + dump_oom_summary(oc, p); }
/* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1e0aeebfb293..39d530e64098 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3526,13 +3526,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; - pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl\n", + pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl", current->comm, &vaf, gfp_mask, &gfp_mask, nodemask_pr_args(nodemask)); va_end(args);
cpuset_print_current_mems_allowed(); - + pr_cont("\n"); dump_stack(); warn_alloc_show_mem(gfp_mask, nodemask); }
From: yuzhoujian yuzhoujian@didichuxing.com
mainline inclusion from mainline-5.0-rc1 commit f0c867d9588d9efc10d6a55009c9560336673369 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg.
Below is the single line output in the oom report after this patch.
- global oom context information:
oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
- memcg oom context information:
oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
[penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail... Signed-off-by: yuzhoujian yuzhoujian@didichuxing.com Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Acked-by: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Tetsuo Handa penguin-kernel@i-love.sakura.ne.jp Cc: Roman Gushchin guro@fb.com Cc: Yang Shi yang.s@alibaba-inc.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit f0c867d9588d9efc10d6a55009c9560336673369) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 11 +++++++++-- mm/memcontrol.c | 33 ++++++++++++++++++++------------- mm/oom_kill.c | 3 ++- 3 files changed, 31 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index eafc13eb9441..3b3a2f21e3e6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -526,9 +526,11 @@ void mem_cgroup_handle_over_high(void);
unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
-void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, +void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p);
+void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg); + static inline void mem_cgroup_enter_user_fault(void) { WARN_ON(current->in_user_fault); @@ -973,7 +975,12 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) }
static inline void -mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) +mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) +{ +} + +static inline void +mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0abfd13c4015..d4a18ddb6e51 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1323,32 +1323,39 @@ static const char *const memcg1_stat_names[] = {
#define K(x) ((x) << (PAGE_SHIFT-10)) /** - * mem_cgroup_print_oom_info: Print OOM information relevant to memory controller. + * mem_cgroup_print_oom_context: Print OOM information relevant to + * memory controller. * @memcg: The memory cgroup that went over limit * @p: Task that is going to be killed * * NOTE: @memcg and @p's mem_cgroup can be different when hierarchy is * enabled */ -void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) +void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) { - struct mem_cgroup *iter; - unsigned int i; - rcu_read_lock();
+ if (memcg) { + pr_cont(",oom_memcg="); + pr_cont_cgroup_path(memcg->css.cgroup); + } else + pr_cont(",global_oom"); if (p) { - pr_info("Task in "); + pr_cont(",task_memcg="); pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id)); - pr_cont(" killed as a result of limit of "); - } else { - pr_info("Memory limit reached of cgroup "); } - - pr_cont_cgroup_path(memcg->css.cgroup); - pr_cont("\n"); - rcu_read_unlock(); +} + +/** + * mem_cgroup_print_oom_meminfo: Print OOM memory information relevant to + * memory controller. + * @memcg: The memory cgroup that went over limit + */ +void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter; + unsigned int i;
pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n", K((u64)page_counter_read(&memcg->memory)), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index b710928cc4d2..a2010620a49d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -435,6 +435,7 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim) oom_constraint_text[oc->constraint], nodemask_pr_args(oc->nodemask)); cpuset_print_current_mems_allowed(); + mem_cgroup_print_oom_context(oc->memcg, victim); pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid, from_kuid(&init_user_ns, task_uid(victim))); } @@ -449,7 +450,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
dump_stack(); if (is_memcg_oom(oc)) - mem_cgroup_print_oom_info(oc->memcg, p); + mem_cgroup_print_oom_meminfo(oc->memcg); else { show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask); if (is_dump_unreclaim_slabs())
From: Yafang Shao laoar.shao@gmail.com
mainline inclusion from mainline-5.2-rc7 commit 432b1de0de02a83f64695e69a2d83cbee10c236f category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set by that point, so its value is always the default 0. We should initialize it beforehand.
Below is the output when a memcg oom occurs,
before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0
after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0
Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.... Fixes: ef8444ea01d7 ("mm, oom: reorganize the oom report in dump_header") Signed-off-by: Yafang Shao laoar.shao@gmail.com Acked-by: Michal Hocko mhocko@suse.com Cc: Wind Yu yuzhoujian@didichuxing.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 432b1de0de02a83f64695e69a2d83cbee10c236f) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/oom_kill.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c index a2010620a49d..05e4ec14b588 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1026,8 +1026,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message) /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ -static void check_panic_on_oom(struct oom_control *oc, - enum oom_constraint constraint) +static void check_panic_on_oom(struct oom_control *oc) { if (likely(!sysctl_panic_on_oom)) return; @@ -1037,7 +1036,7 @@ static void check_panic_on_oom(struct oom_control *oc, * does not panic for cpuset, mempolicy, or memcg allocation * failures. */ - if (constraint != CONSTRAINT_NONE) + if (oc->constraint != CONSTRAINT_NONE) return; } /* Do not panic for oom kills triggered by sysrq */ @@ -1114,7 +1113,6 @@ EXPORT_SYMBOL_GPL(unregister_hisi_oom_notifier); bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; - enum oom_constraint constraint = CONSTRAINT_NONE; #ifdef CONFIG_ASCEND_OOM unsigned long oom_type; #endif @@ -1168,10 +1166,10 @@ bool out_of_memory(struct oom_control *oc) * Check if there were limitations on the allocation (only relevant for * NUMA and memcg) that may require different handling. */ - constraint = constrained_alloc(oc); - if (constraint != CONSTRAINT_MEMORY_POLICY) + oc->constraint = constrained_alloc(oc); + if (oc->constraint != CONSTRAINT_MEMORY_POLICY) oc->nodemask = NULL; - check_panic_on_oom(oc, constraint); + check_panic_on_oom(oc);
if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
From: Chris Down chris@chrisdown.name
mainline inclusion from mainline-5.1-rc1 commit aa9694bb78bf6eb03810108d5f6064fafa4ae1e1 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- This is the start of a series of patches similar to my earlier DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).
There are a bunch of places we go from seq_file to mem_cgroup, which currently requires manually getting the css, then getting the mem_cgroup from the css. It's in enough places now that having mem_cgroup_from_seq makes sense (and also makes the next patch a bit nicer).
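As a sketch of the conversion pattern (a hypothetical show handler in kernel style, assuming the usual cgroup/memcg headers; it is not part of this patch), the two-step css lookup collapses into one call:

static int memory_example_show(struct seq_file *m, void *v)
{
	/* before: fetch the css, then convert it to a memcg by hand */
	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));

	/* after: the same lookup through the new helper */
	memcg = mem_cgroup_from_seq(m);

	seq_printf(m, "%d\n", memcg->oom_group);
	return 0;
}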
Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name Signed-off-by: Chris Down chris@chrisdown.name Acked-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Michal Hocko mhocko@suse.com Cc: Tejun Heo tj@kernel.org Cc: Roman Gushchin guro@fb.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit aa9694bb78bf6eb03810108d5f6064fafa4ae1e1) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 10 ++++++++++ mm/memcontrol.c | 24 ++++++++++++------------ mm/slab_common.c | 6 +++--- 3 files changed, 25 insertions(+), 15 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3b3a2f21e3e6..a71419c869d9 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -429,6 +429,11 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
+static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) +{ + return mem_cgroup_from_css(seq_css(m)); +} + static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec) { struct mem_cgroup_per_node *mz; @@ -940,6 +945,11 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) return NULL; }
+static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) +{ + return NULL; +} + static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec) { return NULL; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d4a18ddb6e51..265e47a7acfe 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3471,7 +3471,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) const struct numa_stat *stat; int nid; unsigned long nr; - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask); @@ -3522,7 +3522,7 @@ static const char *const memcg1_event_names[] = {
static int memcg_stat_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); unsigned long memory, memsw; struct mem_cgroup *mi; unsigned int i; @@ -3961,7 +3961,7 @@ static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
seq_printf(sf, "oom_kill_disable %d\n", memcg->oom_kill_disable); seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); @@ -5532,7 +5532,7 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
static int memory_min_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); unsigned long min = READ_ONCE(memcg->memory.min);
if (min == PAGE_COUNTER_MAX) @@ -5562,7 +5562,7 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
static int memory_low_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); unsigned long low = READ_ONCE(memcg->memory.low);
if (low == PAGE_COUNTER_MAX) @@ -5592,7 +5592,7 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
static int memory_high_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); unsigned long high = READ_ONCE(memcg->high);
if (high == PAGE_COUNTER_MAX) @@ -5629,7 +5629,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
static int memory_max_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); unsigned long max = READ_ONCE(memcg->memory.max);
if (max == PAGE_COUNTER_MAX) @@ -5691,7 +5691,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
static int memory_events_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
seq_printf(m, "low %lu\n", atomic_long_read(&memcg->memory_events[MEMCG_LOW])); @@ -5709,7 +5709,7 @@ static int memory_events_show(struct seq_file *m, void *v)
static int memory_stat_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); struct accumulated_stats acc; int i;
@@ -5786,7 +5786,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
static int memory_oom_group_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
seq_printf(m, "%d\n", memcg->oom_group);
@@ -6767,7 +6767,7 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
static int swap_max_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); unsigned long max = READ_ONCE(memcg->swap.max);
if (max == PAGE_COUNTER_MAX) @@ -6797,7 +6797,7 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
static int swap_events_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
seq_printf(m, "max %lu\n", atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX])); diff --git a/mm/slab_common.c b/mm/slab_common.c index 91b250aea129..bbfecacee448 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1444,7 +1444,7 @@ void dump_unreclaimable_slab(void) #if defined(CONFIG_MEMCG_KMEM) void *memcg_slab_start(struct seq_file *m, loff_t *pos) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
mutex_lock(&slab_mutex); return seq_list_start(&memcg->kmem_caches, *pos); @@ -1452,7 +1452,7 @@ void *memcg_slab_start(struct seq_file *m, loff_t *pos)
void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos) { - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
return seq_list_next(p, &memcg->kmem_caches, pos); } @@ -1466,7 +1466,7 @@ int memcg_slab_show(struct seq_file *m, void *p) { struct kmem_cache *s = list_entry(p, struct kmem_cache, memcg_params.kmem_caches_node); - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
if (p == memcg->kmem_caches.next) print_slabinfo_header(m);
From: Chris Down chris@chrisdown.name
mainline inclusion from mainline-5.1-rc1 commit 677dc9731b54dccaaadbdcea18f8eecc95cee832 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- memcg has a significant number of files exposed to kernfs whose value is either shown directly or is "max" in the case of PAGE_COUNTER_MAX.
This patch makes this generic by providing a single function to do this work. In combination with the previous patch adding mem_cgroup_from_seq, this makes all of the seq_show feeder functions significantly simpler.
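For illustration, a hypothetical "max or value" file reduces to a one-line handler once both helpers are in place (kernel-style sketch, not part of the diff below; memory_example_limit_show is a made-up name):

static int memory_example_limit_show(struct seq_file *m, void *v)
{
	return seq_puts_memcg_tunable(m,
			READ_ONCE(mem_cgroup_from_seq(m)->memory.max));
}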
Link: http://lkml.kernel.org/r/20190124194100.GA31425@chrisdown.name Signed-off-by: Chris Down chris@chrisdown.name Acked-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Michal Hocko mhocko@suse.com Cc: Tejun Heo tj@kernel.org Cc: Roman Gushchin guro@fb.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 677dc9731b54dccaaadbdcea18f8eecc95cee832) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 64 +++++++++++++++---------------------------------- 1 file changed, 19 insertions(+), 45 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 265e47a7acfe..fb473b4738b1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5522,6 +5522,16 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) root_mem_cgroup->use_hierarchy = false; }
+static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) +{ + if (value == PAGE_COUNTER_MAX) + seq_puts(m, "max\n"); + else + seq_printf(m, "%llu\n", (u64)value * PAGE_SIZE); + + return 0; +} + static u64 memory_current_read(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -5532,15 +5542,8 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
static int memory_min_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - unsigned long min = READ_ONCE(memcg->memory.min); - - if (min == PAGE_COUNTER_MAX) - seq_puts(m, "max\n"); - else - seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE); - - return 0; + return seq_puts_memcg_tunable(m, + READ_ONCE(mem_cgroup_from_seq(m)->memory.min)); }
static ssize_t memory_min_write(struct kernfs_open_file *of, @@ -5562,15 +5565,8 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
static int memory_low_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - unsigned long low = READ_ONCE(memcg->memory.low); - - if (low == PAGE_COUNTER_MAX) - seq_puts(m, "max\n"); - else - seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE); - - return 0; + return seq_puts_memcg_tunable(m, + READ_ONCE(mem_cgroup_from_seq(m)->memory.low)); }
static ssize_t memory_low_write(struct kernfs_open_file *of, @@ -5592,15 +5588,7 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
static int memory_high_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - unsigned long high = READ_ONCE(memcg->high); - - if (high == PAGE_COUNTER_MAX) - seq_puts(m, "max\n"); - else - seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE); - - return 0; + return seq_puts_memcg_tunable(m, READ_ONCE(mem_cgroup_from_seq(m)->high)); }
static ssize_t memory_high_write(struct kernfs_open_file *of, @@ -5629,15 +5617,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
static int memory_max_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - unsigned long max = READ_ONCE(memcg->memory.max); - - if (max == PAGE_COUNTER_MAX) - seq_puts(m, "max\n"); - else - seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE); - - return 0; + return seq_puts_memcg_tunable(m, + READ_ONCE(mem_cgroup_from_seq(m)->memory.max)); }
static ssize_t memory_max_write(struct kernfs_open_file *of, @@ -6767,15 +6748,8 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
static int swap_max_show(struct seq_file *m, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - unsigned long max = READ_ONCE(memcg->swap.max); - - if (max == PAGE_COUNTER_MAX) - seq_puts(m, "max\n"); - else - seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE); - - return 0; + return seq_puts_memcg_tunable(m, + READ_ONCE(mem_cgroup_from_seq(m)->swap.max)); }
static ssize_t swap_max_write(struct kernfs_open_file *of,
From: Chris Down chris@chrisdown.name
mainline inclusion from mainline-5.1-rc1 commit 1ff9e6e1798c7670ea6a7680a1ad5582df2fa914 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Currently THP allocation events data is fairly opaque, since you can only get it system-wide. This patch makes it easier to reason about transparent hugepage behaviour on a per-memcg basis.
For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1, which is used for v1's rss_huge [sic]. This is reused here, as it's fairly involved to untangle NR_ANON_THPS and make it per-memcg right now, since some of that accounting is delegated to rmap before we have any memcg actually assigned to the page. It's a good idea to rework that, but let's leave that untangling for a future patch.
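Once the counters land in memory.stat, they can be read like any other cgroup v2 statistic. The small userspace sketch below is an illustration only; the cgroup path is an arbitrary example, not taken from this patch.

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/foo/memory.stat";	/* example path */
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* print only the per-memcg THP fields added by this patch */
		if (!strncmp(line, "anon_thp ", 9) ||
		    !strncmp(line, "thp_fault_alloc ", 16) ||
		    !strncmp(line, "thp_collapse_alloc ", 19))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}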
[akpm@linux-foundation.org: fix build] [chris@chrisdown.name: fix memcontrol build when THP is disabled] Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name Signed-off-by: Chris Down chris@chrisdown.name Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Tejun Heo tj@kernel.org Cc: Roman Gushchin guro@fb.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 1ff9e6e1798c7670ea6a7680a1ad5582df2fa914) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memcontrol.c
Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++++++++++ mm/huge_memory.c | 2 ++ mm/khugepaged.c | 2 ++ mm/memcontrol.c | 16 ++++++++++++++++ 4 files changed, 36 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 4194ea1d756f..8305db50fa04 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1183,6 +1183,10 @@ PAGE_SIZE multiple when read back. Amount of cached filesystem data that was modified and is currently being written back to disk
+ anon_thp + Amount of memory used in anonymous mappings backed by + transparent hugepages + inactive_anon, active_anon, inactive_file, active_file, unevictable Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the @@ -1242,6 +1246,18 @@ PAGE_SIZE multiple when read back.
Amount of reclaimed lazyfree pages
+ thp_fault_alloc + + Number of transparent hugepages which were allocated to satisfy + a page fault, including COW faults. This counter is not present + when CONFIG_TRANSPARENT_HUGEPAGE is not set. + + thp_collapse_alloc + + Number of transparent hugepages which were allocated to allow + collapsing an existing range of pages. This counter is not + present when CONFIG_TRANSPARENT_HUGEPAGE is not set. + memory.swap.current A read-only single value file which exists on non-root cgroups. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6aef386c1726..d6ad1132ee70 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -619,6 +619,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, mm_inc_nr_ptes(vma->vm_mm); spin_unlock(vmf->ptl); count_vm_event(THP_FAULT_ALLOC); + count_memcg_events(memcg, THP_FAULT_ALLOC, 1); }
return 0; @@ -1378,6 +1379,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) }
count_vm_event(THP_FAULT_ALLOC); + count_memcg_events(memcg, THP_FAULT_ALLOC, 1);
if (!page) clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index f37be43f8cae..b75afe6f6063 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1076,6 +1076,7 @@ static void collapse_huge_page(struct mm_struct *mm, BUG_ON(!pmd_none(*pmd)); page_add_new_anon_rmap(new_page, vma, address, true); mem_cgroup_commit_charge(new_page, memcg, false, true); + count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1); lru_cache_add_active_or_unevictable(new_page, vma); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); @@ -1531,6 +1532,7 @@ static void collapse_shmem(struct mm_struct *mm, page_ref_add(new_page, HPAGE_PMD_NR - 1); set_page_dirty(new_page); mem_cgroup_commit_charge(new_page, memcg, false, true); + count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1); lru_cache_add_anon(new_page);
/* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fb473b4738b1..9ad251d3f13f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -39,6 +39,7 @@ #include <linux/shmem_fs.h> #include <linux/hugetlb.h> #include <linux/pagemap.h> +#include <linux/vm_event_item.h> #include <linux/smp.h> #include <linux/page-flags.h> #include <linux/backing-dev.h> @@ -5731,6 +5732,15 @@ static int memory_stat_show(struct seq_file *m, void *v) seq_printf(m, "file_writeback %llu\n", (u64)acc.stat[NR_WRITEBACK] * PAGE_SIZE);
+ /* + * TODO: We should eventually replace our own MEMCG_RSS_HUGE counter + * with the NR_ANON_THP vm counter, but right now it's a pain in the + * arse because it requires migrating the work out of rmap to a place + * where the page->mem_cgroup is set up and stable. + */ + seq_printf(m, "anon_thp %llu\n", + (u64)acc.stat[MEMCG_RSS_HUGE] * PAGE_SIZE); + for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %llu\n", mem_cgroup_lru_names[i], (u64)acc.lru_pages[i] * PAGE_SIZE); @@ -5762,6 +5772,12 @@ static int memory_stat_show(struct seq_file *m, void *v) seq_printf(m, "workingset_nodereclaim %lu\n", acc.stat[WORKINGSET_NODERECLAIM]);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE + seq_printf(m, "thp_fault_alloc %lu\n", acc.events[THP_FAULT_ALLOC]); + seq_printf(m, "thp_collapse_alloc %lu\n", + acc.events[THP_COLLAPSE_ALLOC]); +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + return 0; }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-5.2-rc1 commit e0ee0e71078abbcadd4cbc38fb8570551fccc103 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Patch series "mm: memcontrol: clean up the LRU counts tracking".
The memcg LRU stats usage is currently a bit messy. Memcg has private per-zone counters because reclaim needs zone granularity sometimes, but we also have plenty of users that need to awkwardly sum them up to node or memcg granularity. Meanwhile the canonical per-memcg vmstats do not track the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
This series enables LRU count tracking in the per-memcg vmstats array such that lruvec_page_state() and memcg_page_state() work on the enum node_stat_item items for the LRU counters. Then it converts all the callers that don't specifically need per-zone numbers over to that.
This patch (of 6):
The memcg code currently maintains private per-zone breakdowns of the LRU counters. This is necessary for reclaim decisions which are still zone-based, but there are a variety of users of these counters that only want the aggregate per-lruvec or per-memcg LRU counts, and they need to painfully sum up the zone counters on each request for that.
These would be better served using the memcg vmstats arrays, which track VM statistics at the desired scope already. They just don't have the LRU counts right now.
So to kick off the conversion, begin tracking LRU counts in those.
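The payoff, sketched below, is that aggregate readers can then sum the lruvec state directly instead of walking the private per-zone breakdown. This is a kernel-style sketch with a made-up function name; the follow-up patches in this series apply the same pattern to their real callers.

static unsigned long example_lruvec_lru_pages(struct lruvec *lruvec)
{
	unsigned long nr = 0;
	enum lru_list lru;

	/* per-lruvec LRU counts, now available without per-zone summing */
	for_each_lru(lru)
		nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru);
	return nr;
}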
Link: http://lkml.kernel.org/r/20190228163020.24100-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Roman Gushchin guro@fb.com Cc: Tejun Heo tj@kernel.org Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit e0ee0e71078abbcadd4cbc38fb8570551fccc103) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/mm_inline.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 10191c28fc04..b0e3b4473ff2 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -29,7 +29,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, { struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); + __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-5.2-rc1 commit 1a61ab8038e724a6d8aa59e7d4931a119483294d category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Instead of adding up the zone counters, use lruvec_page_state() to get the node state directly. This is a bit cheaper and more streamlined.
Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 1a61ab8038e724a6d8aa59e7d4931a119483294d) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 18 ------------------ mm/memcontrol.c | 2 +- mm/vmscan.c | 2 +- 3 files changed, 2 insertions(+), 20 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index a71419c869d9..8ddca0cb5a66 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -504,19 +504,6 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, int nid, unsigned int lru_mask);
-static inline -unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) -{ - struct mem_cgroup_per_node *mz; - unsigned long nr_pages = 0; - int zid; - - mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - for (zid = 0; zid < MAX_NR_ZONES; zid++) - nr_pages += mz->lru_zone_size[zid][lru]; - return nr_pages; -} - static inline unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) @@ -960,11 +947,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return true; }
-static inline unsigned long -mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) -{ - return 0; -} static inline unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9ad251d3f13f..df93a5a17699 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -742,7 +742,7 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, for_each_lru(lru) { if (!(BIT(lru) & lru_mask)) continue; - nr += mem_cgroup_get_lru_size(lruvec, lru); + nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru); } return nr; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 214445a9c0e5..cd0fa90d6765 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -358,7 +358,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone int zid;
if (!mem_cgroup_disabled()) - lru_size = mem_cgroup_get_lru_size(lruvec, lru); + lru_size = lruvec_page_state(lruvec, NR_LRU_BASE + lru); else lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-5.2-rc1 commit 22796c844fcb85f3b289c0e698713b7fa4d9c178 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Instead of adding up the node counters, use memcg_page_state() to get the memcg state directly. This is a bit cheaper and more streamlined.
Link: http://lkml.kernel.org/r/20190228163020.24100-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 22796c844fcb85f3b289c0e698713b7fa4d9c178) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df93a5a17699..a22f23a21adc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -751,10 +751,13 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg, unsigned int lru_mask) { unsigned long nr = 0; - int nid; + enum lru_list lru;
- for_each_node_state(nid, N_MEMORY) - nr += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask); + for_each_lru(lru) { + if (!(BIT(lru) & lru_mask)) + continue; + nr += memcg_page_state(memcg, NR_LRU_BASE + lru); + } return nr; }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-5.2-rc1 commit 2b487e59f00aaa885ebf9c47d44d09f3ef4df80e category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around lruvec_page_state() that takes bitmasks of lru indexes and aggregates the counts for those.
Replace callsites where the bitmask is simple enough with direct lruvec_page_state() calls.
This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so make that function private again, too.
Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 2b487e59f00aaa885ebf9c47d44d09f3ef4df80e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 10 ---------- mm/memcontrol.c | 10 +++++++--- mm/workingset.c | 5 +++-- 3 files changed, 10 insertions(+), 15 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 8ddca0cb5a66..4fda5a2203a5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -501,9 +501,6 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages);
-unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, - int nid, unsigned int lru_mask); - static inline unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) @@ -954,13 +951,6 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, return 0; }
-static inline unsigned long -mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, - int nid, unsigned int lru_mask) -{ - return 0; -} - static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) { return 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a22f23a21adc..aded88884a26 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -730,7 +730,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, __this_cpu_add(memcg->stat_cpu->nr_page_events, nr_pages); }
-unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, +static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, int nid, unsigned int lru_mask) { struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg); @@ -1449,11 +1449,15 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, int nid, bool noswap) { - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE)) + struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg); + + if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) || + lruvec_page_state(lruvec, NR_ACTIVE_FILE)) return true; if (noswap || !total_swap_pages) return false; - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON)) + if (lruvec_page_state(lruvec, NR_INACTIVE_ANON) || + lruvec_page_state(lruvec, NR_ACTIVE_ANON)) return true; return false;
diff --git a/mm/workingset.c b/mm/workingset.c index 99b7f7c09b13..5de6ac71062c 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -420,10 +420,11 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, #ifdef CONFIG_MEMCG if (sc->memcg) { struct lruvec *lruvec; + int i;
- pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid, - LRU_ALL); lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg); + for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) + pages += lruvec_page_state(lruvec, NR_LRU_BASE + i); pages += lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE); pages += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE); } else
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-5.2-rc1 commit 21d89d151bb42bea1bcf0343f724ef62509d6161 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- mem_cgroup_nr_lru_pages() is just a convenience wrapper around memcg_page_state() that takes bitmasks of lru indexes and aggregates the counts for those.
Replace callsites where the bitmask is simple enough with direct memcg_page_state() call(s).
Link: http://lkml.kernel.org/r/20190228163020.24100-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 21d89d151bb42bea1bcf0343f724ef62509d6161) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aded88884a26..a71ec4110adf 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1385,7 +1385,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
for (i = 0; i < NR_LRU_LISTS; i++) pr_cont(" %s:%luKB", mem_cgroup_lru_names[i], - K(mem_cgroup_nr_lru_pages(iter, BIT(i)))); + K(memcg_page_state(iter, NR_LRU_BASE + i)));
pr_cont("\n"); } @@ -3120,8 +3120,8 @@ static void accumulate_memcg_tree(struct mem_cgroup *memcg, acc->events_array ? acc->events_array[i] : i);
for (i = 0; i < NR_LRU_LISTS; i++) - acc->lru_pages[i] += - mem_cgroup_nr_lru_pages(mi, BIT(i)); + acc->lru_pages[i] += memcg_page_state(mi, + NR_LRU_BASE + i); } }
@@ -3553,7 +3553,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i], - mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE); + memcg_page_state(memcg, NR_LRU_BASE + i) * + PAGE_SIZE);
/* Hierarchical information */ memory = memsw = PAGE_COUNTER_MAX; @@ -4066,8 +4067,8 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
/* this should eventually include NR_UNSTABLE_NFS */ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); - *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) | - (1 << LRU_ACTIVE_FILE)); + *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + + memcg_exact_page_state(memcg, NR_ACTIVE_FILE); *pheadroom = PAGE_COUNTER_MAX;
while ((parent = parent_mem_cgroup(memcg))) {
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-5.2-rc1 commit 113b7dfd827175977ea71cc4a29c1ac24acb9fce category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks, so group them together.
Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 113b7dfd827175977ea71cc4a29c1ac24acb9fce) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/mmzone.h | 5 ---- mm/memcontrol.c | 67 +++++++++++++++++++++++------------------- 2 files changed, 36 insertions(+), 36 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 0c9e870fb3fe..8c546c5b0f21 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -247,11 +247,6 @@ struct lruvec { #endif };
-/* Mask used at gathering information at once (see memcontrol.c) */ -#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)) -#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)) -#define LRU_ALL ((1 << NR_LRU_LISTS) - 1) - /* Isolate unmapped file */ #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x2) /* Isolate for asynchronous migration */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a71ec4110adf..cf4648416644 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -730,37 +730,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, __this_cpu_add(memcg->stat_cpu->nr_page_events, nr_pages); }
-static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, - int nid, unsigned int lru_mask) -{ - struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg); - unsigned long nr = 0; - enum lru_list lru; - - VM_BUG_ON((unsigned)nid >= nr_node_ids); - - for_each_lru(lru) { - if (!(BIT(lru) & lru_mask)) - continue; - nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru); - } - return nr; -} - -static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg, - unsigned int lru_mask) -{ - unsigned long nr = 0; - enum lru_list lru; - - for_each_lru(lru) { - if (!(BIT(lru) & lru_mask)) - continue; - nr += memcg_page_state(memcg, NR_LRU_BASE + lru); - } - return nr; -} - static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, enum mem_cgroup_events_target target) { @@ -3463,6 +3432,42 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css, #endif
#ifdef CONFIG_NUMA + +#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)) +#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)) +#define LRU_ALL ((1 << NR_LRU_LISTS) - 1) + +static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, + int nid, unsigned int lru_mask) +{ + struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg); + unsigned long nr = 0; + enum lru_list lru; + + VM_BUG_ON((unsigned)nid >= nr_node_ids); + + for_each_lru(lru) { + if (!(BIT(lru) & lru_mask)) + continue; + nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru); + } + return nr; +} + +static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg, + unsigned int lru_mask) +{ + unsigned long nr = 0; + enum lru_list lru; + + for_each_lru(lru) { + if (!(BIT(lru) & lru_mask)) + continue; + nr += memcg_page_state(memcg, NR_LRU_BASE + lru); + } + return nr; +} + static int memcg_numa_stat_show(struct seq_file *m, void *v) { struct numa_stat {
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.4-rc7 commit 221ec5c0a46c1a1740f34fb36fc661a5284d01b0 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- page_cgroup_ino() doesn't return a valid memcg pointer for non-compound slab pages, because it depends on PgHead AND PgSlab flags to be set to determine the memory cgroup from the kmem_cache. It's correct for compound pages, but not for generic small pages. Those don't have PgHead set, so it ends up returning zero.
Fix this by replacing the condition with PageSlab() && !PageTail().
Before this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 38 0 _______S___________________________________ slab
After this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 147 0 _______S___________________________________ slab
Also, hwpoison_filter_task() uses the output of page_cgroup_ino() to filter error-injection events based on memcg, so if page_cgroup_ino() fails to return a memcg pointer, we simply fail to inject the memory error. Considering that the hwpoison filter is meant for testing, the affected users are limited and the impact should be marginal.
[n-horiguchi@ah.jp.nec.com: changelog additions] Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: Roman Gushchin guro@fb.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: David Rientjes rientjes@google.com Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Daniel Jordan daniel.m.jordan@oracle.com Cc: Naoya Horiguchi n-horiguchi@ah.jp.nec.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 221ec5c0a46c1a1740f34fb36fc661a5284d01b0) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 2 +- mm/slab.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cf4648416644..30b08fde6481 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -496,7 +496,7 @@ ino_t page_cgroup_ino(struct page *page) unsigned long ino = 0;
rcu_read_lock(); - if (PageHead(page) && PageSlab(page)) + if (PageSlab(page) && !PageTail(page)) memcg = memcg_from_slab_page(page); else memcg = READ_ONCE(page->mem_cgroup); diff --git a/mm/slab.h b/mm/slab.h index 77fd31ea6877..d8e8b15bcbd3 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -259,8 +259,8 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) * Expects a pointer to a slab page. Please note, that PageSlab() check * isn't sufficient, as it returns true also for tail compound slab pages, * which do not have slab_cache pointer set. - * So this function assumes that the page can pass PageHead() and PageSlab() - * checks. + * So this function assumes that the page can pass PageSlab() && !PageTail() + * check. * * The kmem_cache can be reparented asynchronously. The caller must ensure * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-5.4-rc4 commit b749ecfaf6c53ce79d6ab66afd2fc34189a073b1 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Karsten reported the following panic in __free_slab() happening on a s390x machine:
Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000483 Fault in home space mode while using kernel ASCE. AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14 Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8 0000000000000000 000000009e3291c1 0000000000000000 0000000000000000 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600 000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78 Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10) 00000000003cadac: c0e50046c286 brasl %r14,ca32b8 #00000000003cadb2: a7f4fe36 brc 15,3caa1e >00000000003cadb6: e32060800024 stg %r2,128(%r6) 00000000003cadbc: a7f4fd9e brc 15,3ca8f8 00000000003cadc0: c0e50046790c brasl %r14,c99fd8 00000000003cadc6: a7f4fe2c brc 15,3caa 00000000003cadc6: a7f4fe2c brc 15,3caa1e 00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1 Call Trace: (<00000000003cabcc> __free_slab+0x49c/0x6b0) <00000000001f5886> rcu_core+0x5a6/0x7e0 <0000000000ca2dea> __do_softirq+0xf2/0x5c0 <0000000000152644> irq_exit+0x104/0x130 <000000000010d222> do_IRQ+0x9a/0xf0 <0000000000ca2344> ext_int_handler+0x130/0x134 <0000000000103648> enabled_wait+0x58/0x128 (<0000000000103634> enabled_wait+0x44/0x128) <0000000000103b00> arch_cpu_idle+0x40/0x58 <0000000000ca0544> default_idle_call+0x3c/0x68 <000000000018eaa4> do_idle+0xec/0x1c0 <000000000018ee0e> cpu_startup_entry+0x36/0x40 <000000000122df34> arch_call_rest_init+0x5c/0x88 <0000000000000000> 0x0 INFO: lockdep is turned off. Last Breaking-Event-Address: <00000000003ca8f4> __free_slab+0x1c4/0x6b0 Kernel panic - not syncing: Fatal exception in interrupt
The kernel panics on an attempt to dereference the NULL memcg pointer. When shutdown_cache() is called from the kmem_cache_destroy() context, a memcg kmem_cache might have empty slab pages in a partial list, which are still charged to the memory cgroup.
These pages are released by free_partial() at the beginning of shutdown_cache(): either directly or by scheduling a RCU-delayed work (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag). The latter case is when the reported panic can happen: memcg_unlink_cache() is called immediately after shrinking partial lists, without waiting for scheduled RCU works. It sets the kmem_cache->memcg_params.memcg pointer to NULL, and the following attempt to dereference it by __free_slab() from the RCU work context causes the panic.
To fix the issue, let's postpone the release of the memcg pointer to destroy_memcg_params(). It's called from a separate work context by slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier. This guarantees that all scheduled page release RCU works will complete before the memcg pointer will be zeroed.
Big thanks to Karsten for the perfect report containing all the necessary information, for his help with the analysis of the problem, and for testing the fix.
Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: Roman Gushchin guro@fb.com Reported-by: Karsten Graul kgraul@linux.ibm.com Tested-by: Karsten Graul kgraul@linux.ibm.com Acked-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Karsten Graul kgraul@linux.ibm.com Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: David Rientjes rientjes@google.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit b749ecfaf6c53ce79d6ab66afd2fc34189a073b1) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/slab_common.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c index bbfecacee448..d208b47e01a8 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -178,10 +178,13 @@ static int init_memcg_params(struct kmem_cache *s,
static void destroy_memcg_params(struct kmem_cache *s) { - if (is_root_cache(s)) + if (is_root_cache(s)) { kvfree(rcu_access_pointer(s->memcg_params.memcg_caches)); - else + } else { + mem_cgroup_put(s->memcg_params.memcg); + WRITE_ONCE(s->memcg_params.memcg, NULL); percpu_ref_exit(&s->memcg_params.refcnt); + } }
static void free_memcg_params(struct rcu_head *rcu) @@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s) } else { list_del(&s->memcg_params.children_node); list_del(&s->memcg_params.kmem_caches_node); - mem_cgroup_put(s->memcg_params.memcg); - WRITE_ONCE(s->memcg_params.memcg, NULL); } } #else
From: Jann Horn jannh@google.com
mainline inclusion from mainline-4.20-rc1 commit f0ecf25a093fc0589f0a6bc4c1ea068bbb67d220 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Having two gigantic arrays that must manually be kept in sync, including ifdefs, isn't exactly robust. To make it easier to catch such issues in the future, add a BUILD_BUG_ON().
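The pattern, in a generic form (hypothetical names, not the vmstat arrays themselves): pin the array length to the enum it mirrors, so a missed update breaks the build instead of silently skewing the output.

enum example_stat_item {
	EXAMPLE_STAT_A,
	EXAMPLE_STAT_B,
	NR_EXAMPLE_STAT_ITEMS,
};

static const char * const example_stat_text[] = {
	"stat_a",
	"stat_b",
};

static inline void example_stat_check(void)
{
	/* fails to compile if the text table and the enum drift apart */
	BUILD_BUG_ON(ARRAY_SIZE(example_stat_text) != NR_EXAMPLE_STAT_ITEMS);
}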
Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com Signed-off-by: Jann Horn jannh@google.com Reviewed-by: Kees Cook keescook@chromium.org Reviewed-by: Andrew Morton akpm@linux-foundation.org Acked-by: Roman Gushchin guro@fb.com Acked-by: Michal Hocko mhocko@suse.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Oleg Nesterov oleg@redhat.com Cc: Christoph Lameter clameter@sgi.com Cc: Kemi Wang kemi.wang@intel.com Cc: Andy Lutomirski luto@kernel.org Cc: Ingo Molnar mingo@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit f0ecf25a093fc0589f0a6bc4c1ea068bbb67d220) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/vmstat.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/vmstat.c b/mm/vmstat.c index d1c02a4fe9f8..439a73d53c9e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1680,6 +1680,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) stat_items_size += sizeof(struct vm_event_state); #endif
+ BUILD_BUG_ON(stat_items_size != + ARRAY_SIZE(vmstat_text) * sizeof(unsigned long)); v = kmalloc(stat_items_size, GFP_KERNEL); m->private = v; if (!v)
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-5.1-rc1 commit bbbe48029720d2c6b6733f78d02571a281511adb category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- Since the start of Linux's git history, the kernel, after selecting the worst process to be oom-killed, has preferred to kill its child (if the child does not share an mm with the parent). Later this was changed to prefer killing the worst child; if the parent is still the worst, the parent is killed.
This heuristic assumes that the children have done less work than their parent, so killing one of them loses less work. However, this is very workload dependent. A workload that benefits from this heuristic can use oom_score_adj to prefer that children be killed before the parent.
select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children in the hope of finding a worse candidate; that's a lot of unneeded, racy work. The heuristic is also dangerous because it makes fork-bomb-like workloads recover much later, since we constantly pick and kill processes that are not memory hogs. So, let's remove this whole heuristic.
[akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt shakeelb@google.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Roman Gushchin guro@fb.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: David Rientjes rientjes@google.com Cc: Tetsuo Handa penguin-kernel@i-love.sakura.ne.jp Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit bbbe48029720d2c6b6733f78d02571a281511adb) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/oom_kill.c | 78 ++++++++++++--------------------------------------- 1 file changed, 18 insertions(+), 60 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 05e4ec14b588..55410ef265d6 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -842,7 +842,7 @@ static bool task_will_free_mem(struct task_struct *task) return ret; }
-static void __oom_kill_process(struct task_struct *victim) +static void __oom_kill_process(struct task_struct *victim, const char *message) { struct task_struct *p; struct mm_struct *mm; @@ -873,8 +873,9 @@ static void __oom_kill_process(struct task_struct *victim) */ do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, PIDTYPE_TGID); mark_oom_victim(victim); - pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", - task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), + pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", + message, task_pid_nr(victim), victim->comm, + K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), K(get_mm_counter(victim->mm, MM_FILEPAGES)), K(get_mm_counter(victim->mm, MM_SHMEMPAGES))); @@ -925,25 +926,20 @@ static void __oom_kill_process(struct task_struct *victim) * Kill provided task unless it's secured by setting * oom_score_adj to OOM_SCORE_ADJ_MIN. */ -static int oom_kill_memcg_member(struct task_struct *task, void *unused) +static int oom_kill_memcg_member(struct task_struct *task, void *message) { if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN && !is_global_init(task)) { get_task_struct(task); - __oom_kill_process(task); + __oom_kill_process(task, message); } return 0; }
static void oom_kill_process(struct oom_control *oc, const char *message) { - struct task_struct *p = oc->chosen; - unsigned int points = oc->chosen_points; - struct task_struct *victim = p; - struct task_struct *child; - struct task_struct *t; + struct task_struct *victim = oc->chosen; struct mem_cgroup *oom_group; - unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
@@ -952,57 +948,18 @@ static void oom_kill_process(struct oom_control *oc, const char *message) * its children or threads, just give it access to memory reserves * so it can die quickly */ - task_lock(p); - if (task_will_free_mem(p)) { - mark_oom_victim(p); - wake_oom_reaper(p); - task_unlock(p); - put_task_struct(p); + task_lock(victim); + if (task_will_free_mem(victim)) { + mark_oom_victim(victim); + wake_oom_reaper(victim); + task_unlock(victim); + put_task_struct(victim); return; } - task_unlock(p); + task_unlock(victim);
if (__ratelimit(&oom_rs)) - dump_header(oc, p); - - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - - /* - * If any of p's children has a different mm and is eligible for kill, - * the one with the highest oom_badness() score is sacrificed for its - * parent. This attempts to lose the minimal amount of work done while - * still freeing memory. - */ - read_lock(&tasklist_lock); - - /* - * The task 'p' might have already exited before reaching here. The - * put_task_struct() will free task_struct 'p' while the loop still try - * to access the field of 'p', so, get an extra reference. - */ - get_task_struct(p); - for_each_thread(p, t) { - list_for_each_entry(child, &t->children, sibling) { - unsigned int child_points; - - if (process_shares_mm(child, p->mm)) - continue; - /* - * oom_badness() returns 0 if the thread is unkillable - */ - child_points = oom_badness(child, - oc->memcg, oc->nodemask, oc->totalpages); - if (child_points > victim_points) { - put_task_struct(victim); - victim = child; - victim_points = child_points; - get_task_struct(victim); - } - } - } - put_task_struct(p); - read_unlock(&tasklist_lock); + dump_header(oc, victim);
/* * Do we need to kill the entire memory cgroup? @@ -1011,14 +968,15 @@ static void oom_kill_process(struct oom_control *oc, const char *message) */ oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
- __oom_kill_process(victim); + __oom_kill_process(victim, message);
/* * If necessary, kill all tasks in the selected memory cgroup. */ if (oom_group) { mem_cgroup_print_oom_group(oom_group); - mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL); + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, + (void*)message); mem_cgroup_put(oom_group); } }
From: Thomas Hellstrom thellstrom@vmware.com
mainline inclusion from mainline-v5.5-rc1 commit bf1a12a8095615c9486f5463ca473d2d69ff6952 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
The asm-generic/pgtable.h include file appears to be the correct place for the backup x_devmap() inline functions. Moving them here is also necessary if we want to include x_devmap() in the [pmd|pud]_unstable functions. So move the x_devmap() functions to asm-generic/pgtable.h.
Link: http://lkml.kernel.org/r/20191115115808.21181-1-thomas_os@shipmail.org Signed-off-by: Thomas Hellstrom thellstrom@vmware.com Cc: Arnd Bergmann arnd@arndb.de Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Cc: Matthew Wilcox willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/asm-generic/pgtable.h | 15 +++++++++++++++ include/linux/mm.h | 15 --------------- 2 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 15fd0277ffa6..3c0438e8c3ed 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -865,6 +865,21 @@ static inline int pud_write(pud_t pud) } #endif /* pud_write */
+#if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) +static inline int pmd_devmap(pmd_t pmd) +{ + return 0; +} +static inline int pud_devmap(pud_t pud) +{ + return 0; +} +static inline int pgd_devmap(pgd_t pgd) +{ + return 0; +} +#endif + #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \ (defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ !defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) diff --git a/include/linux/mm.h b/include/linux/mm.h index 369dc740963a..76e7add166da 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -497,21 +497,6 @@ struct inode; #define page_private(page) ((page)->private) #define set_page_private(page, v) ((page)->private = (v))
-#if !defined(__HAVE_ARCH_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) -static inline int pmd_devmap(pmd_t pmd) -{ - return 0; -} -static inline int pud_devmap(pud_t pud) -{ - return 0; -} -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} -#endif - /* * FIXME: take this include out, include page-flags.h in * files which need it (119 of them)
From: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com
mainline inclusion from mainline-v5.1-rc1 commit d7fefcc8de9147cc37d0c00df12e7ea4f77999b5 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region", v8.
ppc64 uses the CMA area for the allocation of the guest page table (hash page table). We won't be able to start the guest if we fail to allocate the hash page table. We have observed hash table allocation failures because we failed to migrate pages out of the CMA region because they were pinned. This happens when we are using VFIO, which on ppc64 pins the entire guest RAM. If guest RAM pages get allocated out of the CMA region, we won't be able to migrate those pages, and they also stay pinned for the lifetime of the guest.
Currently we support migration of non-compound pages. With THP, and with the addition of hugetlb migration, we can end up allocating compound pages from the CMA region. This patch series adds support for migrating compound pages.
This patch (of 4):
Add PF_MEMALLOC_NOCMA, which makes sure any allocation in that context is marked non-movable and hence cannot be satisfied by the CMA region.
This is useful with get_user_pages_longterm, where we want to take a page pin after migrating pages out of the CMA region. Marking the section with PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration later.
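A minimal usage sketch of the new helpers (the actual caller is only added in the next patch of this series, which converts get_user_pages_longterm()):

	unsigned int flags;

	flags = memalloc_nocma_save();	/* set PF_MEMALLOC_NOCMA on current */
	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
	memalloc_nocma_restore(flags);	/* put the flag back as it was */

Any page allocated on behalf of the pinned range inside that section has __GFP_MOVABLE cleared by current_gfp_context() and therefore cannot come from a CMA pageblock.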
Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Suggested-by: Andrea Arcangeli aarcange@redhat.com Reviewed-by: Andrea Arcangeli aarcange@redhat.com Cc: Michal Hocko mhocko@kernel.org Cc: Alexey Kardashevskiy aik@ozlabs.ru Cc: David Gibson david@gibson.dropbear.id.au Cc: Michael Ellerman mpe@ellerman.id.au Cc: Mel Gorman mgorman@techsingularity.net Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/sched.h | 1 + include/linux/sched/mm.h | 48 +++++++++++++++++++++++++++++++++------- 2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 9c1810252dd0..ceb584190bb8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1414,6 +1414,7 @@ extern struct pid *cad_pid; #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ +#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */ diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 290a613687b5..c04db712c5d2 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -175,17 +175,25 @@ static inline bool in_vfork(struct task_struct *tsk) * Applies per-task gfp context to the given allocation flags. * PF_MEMALLOC_NOIO implies GFP_NOIO * PF_MEMALLOC_NOFS implies GFP_NOFS + * PF_MEMALLOC_NOCMA implies no allocation from CMA region. */ static inline gfp_t current_gfp_context(gfp_t flags) { - /* - * NOIO implies both NOIO and NOFS and it is a weaker context - * so always make sure it makes precendence - */ - if (unlikely(current->flags & PF_MEMALLOC_NOIO)) - flags &= ~(__GFP_IO | __GFP_FS); - else if (unlikely(current->flags & PF_MEMALLOC_NOFS)) - flags &= ~__GFP_FS; + if (unlikely(current->flags & + (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) { + /* + * NOIO implies both NOIO and NOFS and it is a weaker context + * so always make sure it makes precedence + */ + if (current->flags & PF_MEMALLOC_NOIO) + flags &= ~(__GFP_IO | __GFP_FS); + else if (current->flags & PF_MEMALLOC_NOFS) + flags &= ~__GFP_FS; +#ifdef CONFIG_CMA + if (current->flags & PF_MEMALLOC_NOCMA) + flags &= ~__GFP_MOVABLE; +#endif + } return flags; }
@@ -275,6 +283,30 @@ static inline void memalloc_noreclaim_restore(unsigned int flags) current->flags = (current->flags & ~PF_MEMALLOC) | flags; }
+#ifdef CONFIG_CMA +static inline unsigned int memalloc_nocma_save(void) +{ + unsigned int flags = current->flags & PF_MEMALLOC_NOCMA; + + current->flags |= PF_MEMALLOC_NOCMA; + return flags; +} + +static inline void memalloc_nocma_restore(unsigned int flags) +{ + current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags; +} +#else +static inline unsigned int memalloc_nocma_save(void) +{ + return 0; +} + +static inline void memalloc_nocma_restore(unsigned int flags) +{ +} +#endif + #ifdef CONFIG_MEMCG /** * memalloc_use_memcg - Starts the remote memcg charging scope.
From: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com
mainline inclusion from mainline-5.1-rc1 commit 9a4e9f3b2d7393d50256762c21e7466b4b6b1c9c category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- This patch updates get_user_pages_longterm to migrate pages allocated out of CMA region. This makes sure that we don't keep non-movable pages (due to page reference count) in the CMA area.
This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses the CMA region for allocation of the hardware page table (hash page table), and not being able to migrate pages out of the CMA region results in page table allocation failures.
One case where we hit this easily is when a guest is using a VFIO passthrough device. VFIO locks all the guest's memory, and if the guest memory is backed by the CMA region, it becomes unmovable, fragmenting the CMA area and possibly preventing other guests from allocating a large enough hash page table.
NOTE: We allocate the new page without using __GFP_THISNODE.
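A condensed sketch of the allocation policy used by the new new_non_cma_page() helper in the diff below (base-page case only):

	int nid = page_to_nid(page);			/* prefer the source node */
	gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;	/* but no __GFP_THISNODE */

	if (PageHighMem(page))
		gfp_mask |= __GFP_HIGHMEM;

	new_page = __alloc_pages_node(nid, gfp_mask, 0);

Not forcing __GFP_THISNODE matters because the node holding the CMA reservation may have very little non-movable memory left, and we would rather fall back to another node than fail the migration.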
Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Alexey Kardashevskiy aik@ozlabs.ru Cc: Andrea Arcangeli aarcange@redhat.com Cc: David Gibson david@gibson.dropbear.id.au Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@kernel.org Cc: Mel Gorman mgorman@techsingularity.net Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 9a4e9f3b2d7393d50256762c21e7466b4b6b1c9c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/hugetlb.h | 2 + include/linux/mm.h | 3 +- mm/gup.c | 200 +++++++++++++++++++++++++++++++++++----- mm/hugetlb.c | 4 +- 4 files changed, 182 insertions(+), 27 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 32f2837a6075..870c29d8b8e4 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -373,6 +373,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask); struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address); +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, + int nid, nodemask_t *nmask); int huge_add_to_page_cache(struct page *page, struct address_space *mapping, pgoff_t idx);
diff --git a/include/linux/mm.h b/include/linux/mm.h index 76e7add166da..16eff0426e4a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1507,7 +1507,8 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, int *locked); long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); -#ifdef CONFIG_FS_DAX + +#if defined(CONFIG_FS_DAX) || defined(CONFIG_CMA) long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas); diff --git a/mm/gup.c b/mm/gup.c index cc01f860ce71..3fc585282f24 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -13,6 +13,9 @@ #include <linux/sched/signal.h> #include <linux/rwsem.h> #include <linux/hugetlb.h> +#include <linux/migrate.h> +#include <linux/mm_inline.h> +#include <linux/sched/mm.h>
#include <asm/mmu_context.h> #include <asm/pgtable.h> @@ -1114,7 +1117,167 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, } EXPORT_SYMBOL(get_user_pages);
+#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA) + #ifdef CONFIG_FS_DAX +static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) +{ + long i; + struct vm_area_struct *vma_prev = NULL; + + for (i = 0; i < nr_pages; i++) { + struct vm_area_struct *vma = vmas[i]; + + if (vma == vma_prev) + continue; + + vma_prev = vma; + + if (vma_is_fsdax(vma)) + return true; + } + return false; +} +#else +static inline bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) +{ + return false; +} +#endif + +#ifdef CONFIG_CMA +static struct page *new_non_cma_page(struct page *page, unsigned long private) +{ + /* + * We want to make sure we allocate the new page from the same node + * as the source page. + */ + int nid = page_to_nid(page); + /* + * Trying to allocate a page for migration. Ignore allocation + * failure warnings. We don't force __GFP_THISNODE here because + * this node here is the node where we have CMA reservation and + * in some case these nodes will have really less non movable + * allocation memory. + */ + gfp_t gfp_mask = GFP_USER | __GFP_NOWARN; + + if (PageHighMem(page)) + gfp_mask |= __GFP_HIGHMEM; + +#ifdef CONFIG_HUGETLB_PAGE + if (PageHuge(page)) { + struct hstate *h = page_hstate(page); + /* + * We don't want to dequeue from the pool because pool pages will + * mostly be from the CMA region. + */ + return alloc_migrate_huge_page(h, gfp_mask, nid, NULL); + } +#endif + if (PageTransHuge(page)) { + struct page *thp; + /* + * ignore allocation failure warnings + */ + gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN; + + /* + * Remove the movable mask so that we don't allocate from + * CMA area again. + */ + thp_gfpmask &= ~__GFP_MOVABLE; + thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER); + if (!thp) + return NULL; + prep_transhuge_page(thp); + return thp; + } + + return __alloc_pages_node(nid, gfp_mask, 0); +} + +static long check_and_migrate_cma_pages(unsigned long start, long nr_pages, + unsigned int gup_flags, + struct page **pages, + struct vm_area_struct **vmas) +{ + long i; + bool drain_allow = true; + bool migrate_allow = true; + LIST_HEAD(cma_page_list); + +check_again: + for (i = 0; i < nr_pages; i++) { + /* + * If we get a page from the CMA zone, since we are going to + * be pinning these entries, we might as well move them out + * of the CMA zone if possible. + */ + if (is_migrate_cma_page(pages[i])) { + + struct page *head = compound_head(pages[i]); + + if (PageHuge(head)) { + isolate_huge_page(head, &cma_page_list); + } else { + if (!PageLRU(head) && drain_allow) { + lru_add_drain_all(); + drain_allow = false; + } + + if (!isolate_lru_page(head)) { + list_add_tail(&head->lru, &cma_page_list); + mod_node_page_state(page_pgdat(head), + NR_ISOLATED_ANON + + page_is_file_cache(head), + hpage_nr_pages(head)); + } + } + } + } + + if (!list_empty(&cma_page_list)) { + /* + * drop the above get_user_pages reference. + */ + for (i = 0; i < nr_pages; i++) + put_page(pages[i]); + + if (migrate_pages(&cma_page_list, new_non_cma_page, + NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) { + /* + * some of the pages failed migration. Do get_user_pages + * without migration. + */ + migrate_allow = false; + + if (!list_empty(&cma_page_list)) + putback_movable_pages(&cma_page_list); + } + /* + * We did migrate all the pages, Try to get the page references again + * migrating any new CMA pages which we failed to isolate earlier. 
+ */ + nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, vmas); + if ((nr_pages > 0) && migrate_allow) { + drain_allow = true; + goto check_again; + } + } + + return nr_pages; +} +#else +static inline long check_and_migrate_cma_pages(unsigned long start, long nr_pages, + unsigned int gup_flags, + struct page **pages, + struct vm_area_struct **vmas) +{ + return nr_pages; +} +#endif + /* * This is the same as get_user_pages() in that it assumes we are * operating on the current task's mm, but it goes further to validate @@ -1128,11 +1291,11 @@ EXPORT_SYMBOL(get_user_pages); * Contrast this to iov_iter_get_pages() usages which are transient. */ long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, - unsigned int gup_flags, struct page **pages, - struct vm_area_struct **vmas_arg) + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas_arg) { struct vm_area_struct **vmas = vmas_arg; - struct vm_area_struct *vma_prev = NULL; + unsigned long flags; long rc, i;
if (!pages) @@ -1145,31 +1308,20 @@ long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, return -ENOMEM; }
+ flags = memalloc_nocma_save(); rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas); + memalloc_nocma_restore(flags); + if (rc < 0) + goto out;
- for (i = 0; i < rc; i++) { - struct vm_area_struct *vma = vmas[i]; - - if (vma == vma_prev) - continue; - - vma_prev = vma; - - if (vma_is_fsdax(vma)) - break; - } - - /* - * Either get_user_pages() failed, or the vma validation - * succeeded, in either case we don't need to put_page() before - * returning. - */ - if (i >= rc) + if (check_dax_vmas(vmas, rc)) { + for (i = 0; i < rc; i++) + put_page(pages[i]); + rc = -EOPNOTSUPP; goto out; + }
- for (i = 0; i < rc; i++) - put_page(pages[i]); - rc = -EOPNOTSUPP; + rc = check_and_migrate_cma_pages(start, rc, gup_flags, pages, vmas); out: if (vmas != vmas_arg) kfree(vmas); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index efb68abe9f35..be8417aee700 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1665,8 +1665,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, return page; }
-static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, - int nid, nodemask_t *nmask) +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, + int nid, nodemask_t *nmask) { struct page *page;
From: Peng Fan peng.fan@nxp.com
mainline inclusion from mainline-5.1-rc1 commit 2de7852fe9096e92c5f10faa472550a2a7121cea category: bugfix bugzilla: 34611 CVE: NA
The group_cnt array is defined with NR_CPUS entries, but normally nr_groups will not reach NR_CPUS, so there is no issue with the current code as it stands.
Other parts of pcpu_build_alloc_info() already use nr_groups as the loop condition, so make this loop consistent by checking 'group < nr_groups' as well. In the case where nr_groups does equal NR_CPUS, this also avoids an out-of-bounds read of group_cnt[], as illustrated below.
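An illustrative worst case (hypothetical, not kernel code):

	unsigned int group_cnt[NR_CPUS];	/* populated for nr_groups groups */
	int group;

	/* old: if all NR_CPUS groups are in use, the terminating check
	 * reads group_cnt[NR_CPUS], one element past the end of the array */
	for (group = 0; group_cnt[group]; group++)
		;

	/* new: explicitly bounded, matching the rest of the function */
	for (group = 0; group < nr_groups; group++)
		;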
Signed-off-by: Peng Fan peng.fan@nxp.com Signed-off-by: Dennis Zhou dennis@kernel.org (cherry picked from commit 2de7852fe9096e92c5f10faa472550a2a7121cea) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c index 0151f276ae68..fef316942ea9 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -2387,7 +2387,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info( ai->atom_size = atom_size; ai->alloc_size = alloc_size;
- for (group = 0, unit = 0; group_cnt[group]; group++) { + for (group = 0, unit = 0; group < nr_groups; group++) { struct pcpu_group_info *gi = &ai->groups[group];
/*
From: Dennis Zhou dennis@kernel.org
mainline inclusion from mainline-5.2-rc1 commit 8e5a2b9893f36457582596fdade10f6feb2797ee category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- When updating the chunk's contig_hint on the free path of a hint that does not touch the page boundaries, it was incorrectly using the starting offset of the free region and the block's contig_hint. This could lead to incorrect assumptions about fit given a size and better alignment of the start. Fix this by using (end - start) as this is only called when updating a hint within a block.
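A hypothetical example: suppose the block's contig_hint is 64 bits but the region being freed is bits [start = 10, end = 20) within the block. The chunk must be offered a 10-bit candidate at that offset:

	pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
			  end - start);		/* 10 bits, not the block's 64 */

Passing s_block->contig_hint instead claims a larger, potentially better-aligned region at 'start' than actually exists there.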
Signed-off-by: Dennis Zhou dennis@kernel.org Reviewed-by: Peng Fan peng.fan@nxp.com (cherry picked from commit 8e5a2b9893f36457582596fdade10f6feb2797ee) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c index fef316942ea9..092f777422d6 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -871,7 +871,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, pcpu_chunk_refresh_hint(chunk); else pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start), - s_block->contig_hint); + end - start); }
/**
From: Thomas Hellstrom thellstrom@vmware.com
mainline inclusion from mainline-5.5-rc1 commit 625110b5e9dae9074d8a7e67dd07f821a053eed7 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- A huge pud page can theoretically be faulted in racing with pmd_alloc() in __handle_mm_fault(). That will lead to pmd_alloc() returning an invalid pmd pointer.
Fix this by adding a pud_trans_unstable() function similar to pmd_trans_unstable() and check whether the pud is really stable before using the pmd pointer.
Race:

	Thread 1:		Thread 2:		Comment
	create_huge_pud()				Fallback - not taken.
				create_huge_pud()	Taken.
	pmd_alloc()					Returns an invalid pointer.
This will result in user-visible huge page data corruption.
Note that this was caught during a code audit rather than a real experienced problem. It looks to me like the only implementation that currently creates huge pud pagetable entries is dev_dax_huge_fault() which doesn't appear to care much about private (COW) mappings or write-tracking which is, I believe, a prerequisite for create_huge_pud() falling back on thread 1, but not in thread 2.
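The core of the fix, condensed from the __handle_mm_fault() hunk in the diff below: after allocating the pmd, re-check that the pud did not turn into a huge entry under us before dereferencing the returned pmd pointer.

	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
	if (!vmf.pmd)
		return VM_FAULT_OOM;

	/* Huge pud page fault raced with pmd_alloc? */
	if (pud_trans_unstable(vmf.pud))
		goto retry_pud;		/* redo the pud level with a stable entry */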
Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Signed-off-by: Thomas Hellstrom thellstrom@vmware.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Arnd Bergmann arnd@arndb.de Cc: Matthew Wilcox willy@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 625110b5e9dae9074d8a7e67dd07f821a053eed7) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++ mm/memory.c | 6 ++++++ 2 files changed, 31 insertions(+)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 3c0438e8c3ed..b96231975552 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -889,6 +889,31 @@ static inline int pud_trans_huge(pud_t pud) } #endif
+/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */ +static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud) +{ + pud_t pudval = READ_ONCE(*pud); + + if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval)) + return 1; + if (unlikely(pud_bad(pudval))) { + pud_clear_bad(pud); + return 1; + } + return 0; +} + +/* See pmd_trans_unstable for discussion. */ +static inline int pud_trans_unstable(pud_t *pud) +{ +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ + defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) + return pud_none_or_trans_huge_or_dev_or_clear_bad(pud); +#else + return 0; +#endif +} + #ifndef pmd_read_atomic static inline pmd_t pmd_read_atomic(pmd_t *pmdp) { diff --git a/mm/memory.c b/mm/memory.c index c65d3993ac63..852d2ffcc08d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3860,6 +3860,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, vmf.pud = pud_alloc(mm, p4d, address); if (!vmf.pud) return VM_FAULT_OOM; +retry_pud: if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) @@ -3886,6 +3887,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, vmf.pmd = pmd_alloc(mm, vmf.pud, address); if (!vmf.pmd) return VM_FAULT_OOM; + + /* Huge pud page fault raced with pmd_alloc? */ + if (pud_trans_unstable(vmf.pud)) + goto retry_pud; + if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK))
From: Andrea Arcangeli aarcange@redhat.com
mainline inclusion from mainline-4.20-rc1 commit 7066f0f933a1fd707bb38781866657769cff7efc category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling migrate_misplaced_transhuge_page(). By the time do_huge_pmd_numa_page() runs some CPU could still access the page through the TLB.
change_huge_pmd(), before arming the numa/protnone transhuge pmd, calls mmu_notifier_invalidate_range_start(). So there is no need for the mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end() sequence in migrate_misplaced_transhuge_page() either, because by the time migrate_misplaced_transhuge_page() runs, the pmd mapping has already been invalidated in the secondary MMUs. It has to be; if a secondary MMU could still write to the page, migrate_page_copy() would lose data.
However an explicit mmu_notifier_invalidate_range() is needed before migrate_misplaced_transhuge_page() starts copying the data of the transhuge page or the below can happen for MMU notifier users sharing the primary MMU pagetables and only implementing ->invalidate_range:
CPU0                            CPU1                    GPU sharing linux pagetables
                                                        using only ->invalidate_range
-----------                     ------------            ---------
                                                        GPU secondary MMU writes to the
                                                        page mapped by the transhuge pmd
change_pmd_range()
  mmu..._range_start()
  ->invalidate_range_start() noop
  change_huge_pmd()
    set_pmd_at(numa/protnone)
    pmd_unlock()
                                do_huge_pmd_numa_page()
                                  CPU TLB flush globally (1)
                                  CPU cannot write to page
                                  migrate_misplaced_transhuge_page()
                                                        GPU writes to the page...
                                  migrate_page_copy()
                                                        ...GPU stops writing to the page
CPU TLB flush (2)
  mmu..._range_end() (3)
    ->invalidate_range_stop() noop
    ->invalidate_range()
                                                        GPU secondary MMU is invalidated and
                                                        cannot write to the page anymore
                                                        (too late)
Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives too late, we also need a mmu_notifier_invalidate_range() before calling migrate_misplaced_transhuge_page(), because the ->invalidate_range() in (3) also arrives too late.
This requirement is the result of the lazy optimization in change_huge_pmd() that releases the pmd_lock without first flushing the TLB and without first calling mmu_notifier_invalidate_range().
Even converting the removed mmu_notifier_invalidate_range_only_end() into a mmu_notifier_invalidate_range_end() would not have been enough to fix this, because it runs after migrate_page_copy().
After the hugepage data copy is done, migrate_misplaced_transhuge_page() can proceed and call set_pmd_at() without having to flush the TLB or invalidate any secondary MMUs, because the secondary MMU invalidation, just like the CPU TLB flush, has to happen before migrate_page_copy() is called or it would be a bug in the first place (and it was a bug for drivers using ->invalidate_range()).
KVM is unaffected because it doesn't implement ->invalidate_range().
The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and uses the generic migrate_pages which transitions the pte from numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs and all mmu notifiers there before copying the page.
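Condensed, the ordering now enforced at the top of do_huge_pmd_numa_page() (see the hunk in the diff below) is:

	if (mm_tlb_flush_pending(vma->vm_mm)) {
		/* (1) no CPU may still write through a stale TLB entry ... */
		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
		/* ... and no secondary MMU sharing the pagetables may either,
		 * since the ->invalidate_range() in (3) only runs after
		 * migrate_page_copy() */
		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
					      haddr + HPAGE_PMD_SIZE);
	}

Both invalidations therefore happen before migrate_misplaced_transhuge_page() starts copying the data.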
Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli aarcange@redhat.com Acked-by: Mel Gorman mgorman@suse.de Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Aaron Tomlin atomlin@redhat.com Cc: Jerome Glisse jglisse@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 7066f0f933a1fd707bb38781866657769cff7efc) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/huge_memory.c | 14 +++++++++++++- mm/migrate.c | 19 ++++++------------- 2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d6ad1132ee70..2cc736a6ab67 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1617,8 +1617,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) * We are not sure a pending tlb flush here is for a huge page * mapping or not. Hence use the tlb range variant */ - if (mm_tlb_flush_pending(vma->vm_mm)) + if (mm_tlb_flush_pending(vma->vm_mm)) { flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + /* + * change_huge_pmd() released the pmd lock before + * invalidating the secondary MMUs sharing the primary + * MMU pagetables (with ->invalidate_range()). The + * mmu_notifier_invalidate_range_end() (which + * internally calls ->invalidate_range()) in + * change_pmd_range() will run after us, so we can't + * rely on it here and we need an explicit invalidate. + */ + mmu_notifier_invalidate_range(vma->vm_mm, haddr, + haddr + HPAGE_PMD_SIZE); + }
/* * Migrate the THP to the requested node, returns with page unlocked diff --git a/mm/migrate.c b/mm/migrate.c index f1200de72999..1c5e3ae0329d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2026,8 +2026,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, int isolated = 0; struct page *new_page = NULL; int page_lru = page_is_file_cache(page); - unsigned long mmun_start = address & HPAGE_PMD_MASK; - unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE; + unsigned long start = address & HPAGE_PMD_MASK; + unsigned long end = start + HPAGE_PMD_SIZE;
new_page = alloc_pages_node(node, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), @@ -2054,11 +2054,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, WARN_ON(PageLRU(new_page));
/* Recheck the target PMD */ - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ptl = pmd_lock(mm, pmd); if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) { spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Reverse changes made by migrate_page_copy() */ if (TestClearPageActive(new_page)) @@ -2089,8 +2087,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, * new page and page_add_new_anon_rmap guarantee the copy is * visible before the pagetable update. */ - flush_cache_range(vma, mmun_start, mmun_end); - page_add_anon_rmap(new_page, vma, mmun_start, true); + flush_cache_range(vma, start, end); + page_add_anon_rmap(new_page, vma, start, true); /* * At this point the pmd is numa/protnone (i.e. non present) and the TLB * has already been flushed globally. So no TLB can be currently @@ -2102,7 +2100,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this * pmd. */ - set_pmd_at(mm, mmun_start, pmd, entry); + set_pmd_at(mm, start, pmd, entry); update_mmu_cache_pmd(vma, address, &entry);
page_ref_unfreeze(page, 2); @@ -2111,11 +2109,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
spin_unlock(ptl); - /* - * No need to double call mmu_notifier->invalidate_range() callback as - * the above pmdp_huge_clear_flush_notify() did already call it. - */ - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Take an "isolate" reference and put new page on the LRU. */ get_page(new_page); @@ -2139,7 +2132,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, ptl = pmd_lock(mm, pmd); if (pmd_same(*pmd, entry)) { entry = pmd_modify(entry, vma->vm_page_prot); - set_pmd_at(mm, mmun_start, pmd, entry); + set_pmd_at(mm, start, pmd, entry); update_mmu_cache_pmd(vma, address, &entry); } spin_unlock(ptl);
From: Andrea Arcangeli aarcange@redhat.com
mainline inclusion from mainline-4.20-rc1 commit 7eef5f97c1f94c7b72520b42d372037e97a81b95 category: bugfix bugzilla: 34611 CVE: NA
------------------------------------------------- There should be no cache left by the time we overwrite the old transhuge pmd with the new one. It's already too late to flush through the virtual address because we already copied the page data to the new physical address.
So flush the cache before the data copy.
Also delete the "end" variable to shut off an "unused variable" warning on x86, where flush_cache_range() is a noop.
Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli aarcange@redhat.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Aaron Tomlin atomlin@redhat.com Cc: Jerome Glisse jglisse@redhat.com Cc: Mel Gorman mgorman@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 7eef5f97c1f94c7b72520b42d372037e97a81b95) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/migrate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 1c5e3ae0329d..069a76fe5bf1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2027,7 +2027,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, struct page *new_page = NULL; int page_lru = page_is_file_cache(page); unsigned long start = address & HPAGE_PMD_MASK; - unsigned long end = start + HPAGE_PMD_SIZE;
new_page = alloc_pages_node(node, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), @@ -2050,6 +2049,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, /* anon mapping, we can simply copy page->mapping to the new page: */ new_page->mapping = page->mapping; new_page->index = page->index; + /* flush the cache before copying using the kernel virtual address */ + flush_cache_range(vma, start, start + HPAGE_PMD_SIZE); migrate_page_copy(new_page, page); WARN_ON(PageLRU(new_page));
@@ -2087,7 +2088,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, * new page and page_add_new_anon_rmap guarantee the copy is * visible before the pagetable update. */ - flush_cache_range(vma, start, end); page_add_anon_rmap(new_page, vma, start, true); /* * At this point the pmd is numa/protnone (i.e. non present) and the TLB
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v4.20-rc1 commit e9b257ed150c1f43912bd66031185598451f68a9 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
The refault stats go better with the page fault stats, and are of higher interest than the stats on LRU operations. In fact they used to be grouped together; when the LRU operation stats were added later on, they were wedged in between.
Move them back together. Documentation/admin-guide/cgroup-v2.rst already lists them in the right order.
Link: http://lkml.kernel.org/r/20181010140239.GA2527@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Cc: Rik van Riel riel@redhat.com Cc: Michal Hocko mhocko@suse.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/memcontrol.c
Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 30b08fde6481..c5fbf0b29348 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5768,6 +5768,13 @@ static int memory_stat_show(struct seq_file *m, void *v) seq_printf(m, "pgfault %lu\n", acc.events[PGFAULT]); seq_printf(m, "pgmajfault %lu\n", acc.events[PGMAJFAULT]);
+ seq_printf(m, "workingset_refault %lu\n", + acc.stat[WORKINGSET_REFAULT]); + seq_printf(m, "workingset_activate %lu\n", + acc.stat[WORKINGSET_ACTIVATE]); + seq_printf(m, "workingset_nodereclaim %lu\n", + acc.stat[WORKINGSET_NODERECLAIM]); + seq_printf(m, "pgrefill %lu\n", acc.events[PGREFILL]); seq_printf(m, "pgscan %lu\n", acc.events[PGSCAN_KSWAPD] + acc.events[PGSCAN_DIRECT]); @@ -5778,13 +5785,6 @@ static int memory_stat_show(struct seq_file *m, void *v) seq_printf(m, "pglazyfree %lu\n", acc.events[PGLAZYFREE]); seq_printf(m, "pglazyfreed %lu\n", acc.events[PGLAZYFREED]);
- seq_printf(m, "workingset_refault %lu\n", - acc.stat[WORKINGSET_REFAULT]); - seq_printf(m, "workingset_activate %lu\n", - acc.stat[WORKINGSET_ACTIVATE]); - seq_printf(m, "workingset_nodereclaim %lu\n", - acc.stat[WORKINGSET_NODERECLAIM]); - #ifdef CONFIG_TRANSPARENT_HUGEPAGE seq_printf(m, "thp_fault_alloc %lu\n", acc.events[THP_FAULT_ALLOC]); seq_printf(m, "thp_collapse_alloc %lu\n",
From: Chris Down chris@chrisdown.name
mainline inclusion from mainline-v5.2-rc1 commit 871789d4af807d1e91a6299f12a67e06177ed420 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
I spent literally an hour trying to work out why an earlier version of my memory.events aggregation code doesn't work properly, only to find out I was calling memcg->events instead of memcg->memory_events, which is fairly confusing.
This naming seems in need of reworking, so make it harder to do the wrong thing by using vmevents instead of events, which makes it more clear that these are vm counters rather than memcg-specific counters.
There are also a few other inconsistent names in both the percpu and aggregated structs, so these are all cleaned up to be more coherent and easy to understand.
This commit contains code cleanup only: there are no logic changes.
[akpm@linux-foundation.org: fix it for preceding changes] Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name Signed-off-by: Chris Down chris@chrisdown.name Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Tejun Heo tj@kernel.org Cc: Roman Gushchin guro@fb.com Cc: Dennis Zhou dennis@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 24 +++--- mm/memcontrol.c | 148 +++++++++++++++++++------------------ 2 files changed, 88 insertions(+), 84 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 4fda5a2203a5..b1051824b0a3 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -94,8 +94,8 @@ enum mem_cgroup_events_target { MEM_CGROUP_NTARGETS, };
-struct mem_cgroup_stat_cpu { - long count[MEMCG_NR_STAT]; +struct memcg_vmstats_percpu { + long stat[MEMCG_NR_STAT]; unsigned long events[NR_VM_EVENT_ITEMS]; unsigned long nr_page_events; unsigned long targets[MEM_CGROUP_NTARGETS]; @@ -274,12 +274,12 @@ struct mem_cgroup { struct task_struct *move_lock_task;
/* memory.stat */ - struct mem_cgroup_stat_cpu __percpu *stat_cpu; + struct memcg_vmstats_percpu __percpu *vmstats_percpu;
MEMCG_PADDING(_pad2_);
- atomic_long_t stat[MEMCG_NR_STAT]; - atomic_long_t events[NR_VM_EVENT_ITEMS]; + atomic_long_t vmstats[MEMCG_NR_STAT]; + atomic_long_t vmevents[NR_VM_EVENT_ITEMS]; atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
unsigned long socket_pressure; @@ -557,7 +557,7 @@ void unlock_page_memcg(struct page *page); static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) { - long x = atomic_long_read(&memcg->stat[idx]); + long x = atomic_long_read(&memcg->vmstats[idx]); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -574,12 +574,12 @@ static inline void __mod_memcg_state(struct mem_cgroup *memcg, if (mem_cgroup_disabled()) return;
- x = val + __this_cpu_read(memcg->stat_cpu->count[idx]); + x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &memcg->stat[idx]); + atomic_long_add(x, &memcg->vmstats[idx]); x = 0; } - __this_cpu_write(memcg->stat_cpu->count[idx], x); + __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); }
/* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -717,12 +717,12 @@ static inline void __count_memcg_events(struct mem_cgroup *memcg, if (mem_cgroup_disabled()) return;
- x = count + __this_cpu_read(memcg->stat_cpu->events[idx]); + x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); if (unlikely(x > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &memcg->events[idx]); + atomic_long_add(x, &memcg->vmevents[idx]); x = 0; } - __this_cpu_write(memcg->stat_cpu->events[idx], x); + __this_cpu_write(memcg->vmstats_percpu->events[idx], x); }
static inline void count_memcg_events(struct mem_cgroup *memcg, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c5fbf0b29348..9c9162da4f5e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -695,7 +695,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) static unsigned long memcg_sum_events(struct mem_cgroup *memcg, int event) { - return atomic_long_read(&memcg->events[event]); + return atomic_long_read(&memcg->vmevents[event]); }
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, @@ -727,7 +727,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, nr_pages = -nr_pages; /* for event */ }
- __this_cpu_add(memcg->stat_cpu->nr_page_events, nr_pages); + __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages); }
static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, @@ -735,8 +735,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, { unsigned long val, next;
- val = __this_cpu_read(memcg->stat_cpu->nr_page_events); - next = __this_cpu_read(memcg->stat_cpu->targets[target]); + val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events); + next = __this_cpu_read(memcg->vmstats_percpu->targets[target]); /* from time_after() in jiffies.h */ if ((long)(next - val) < 0) { switch (target) { @@ -752,7 +752,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, default: break; } - __this_cpu_write(memcg->stat_cpu->targets[target], next); + __this_cpu_write(memcg->vmstats_percpu->targets[target], next); return true; } return false; @@ -2124,9 +2124,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) int nid; long x;
- x = this_cpu_xchg(memcg->stat_cpu->count[i], 0); + x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0); if (x) - atomic_long_add(x, &memcg->stat[i]); + atomic_long_add(x, &memcg->vmstats[i]);
if (i >= NR_VM_NODE_STAT_ITEMS) continue; @@ -2144,9 +2144,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { long x;
- x = this_cpu_xchg(memcg->stat_cpu->events[i], 0); + x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0); if (x) - atomic_long_add(x, &memcg->events[i]); + atomic_long_add(x, &memcg->vmevents[i]); } }
@@ -3063,30 +3063,34 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, return retval; }
-struct accumulated_stats { - unsigned long stat[MEMCG_NR_STAT]; - unsigned long events[NR_VM_EVENT_ITEMS]; +struct accumulated_vmstats { + unsigned long vmstats[MEMCG_NR_STAT]; + unsigned long vmevents[NR_VM_EVENT_ITEMS]; unsigned long lru_pages[NR_LRU_LISTS]; - const unsigned int *stats_array; - const unsigned int *events_array; - int stats_size; - int events_size; + + /* overrides for v1 */ + const unsigned int *vmstats_array; + const unsigned int *vmevents_array; + + int vmstats_size; + int vmevents_size; };
-static void accumulate_memcg_tree(struct mem_cgroup *memcg, - struct accumulated_stats *acc) +static void accumulate_vmstats(struct mem_cgroup *memcg, + struct accumulated_vmstats *acc) { struct mem_cgroup *mi; int i;
for_each_mem_cgroup_tree(mi, memcg) { - for (i = 0; i < acc->stats_size; i++) - acc->stat[i] += memcg_page_state(mi, - acc->stats_array ? acc->stats_array[i] : i); + for (i = 0; i < acc->vmstats_size; i++) + acc->vmstats[i] += memcg_page_state(mi, + acc->vmstats_array ? acc->vmstats_array[i] : i);
- for (i = 0; i < acc->events_size; i++) - acc->events[i] += memcg_sum_events(mi, - acc->events_array ? acc->events_array[i] : i); + for (i = 0; i < acc->vmevents_size; i++) + acc->vmevents[i] += memcg_sum_events(mi, + acc->vmevents_array + ? acc->vmevents_array[i] : i);
for (i = 0; i < NR_LRU_LISTS; i++) acc->lru_pages[i] += memcg_page_state(mi, @@ -3539,7 +3543,7 @@ static int memcg_stat_show(struct seq_file *m, void *v) unsigned long memory, memsw; struct mem_cgroup *mi; unsigned int i; - struct accumulated_stats acc; + struct accumulated_vmstats acc;
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); @@ -3574,22 +3578,22 @@ static int memcg_stat_show(struct seq_file *m, void *v) (u64)memsw * PAGE_SIZE);
memset(&acc, 0, sizeof(acc)); - acc.stats_size = ARRAY_SIZE(memcg1_stats); - acc.stats_array = memcg1_stats; - acc.events_size = ARRAY_SIZE(memcg1_events); - acc.events_array = memcg1_events; - accumulate_memcg_tree(memcg, &acc); + acc.vmstats_size = ARRAY_SIZE(memcg1_stats); + acc.vmstats_array = memcg1_stats; + acc.vmevents_size = ARRAY_SIZE(memcg1_events); + acc.vmevents_array = memcg1_events; + accumulate_vmstats(memcg, &acc);
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account()) continue; seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i], - (u64)acc.stat[i] * PAGE_SIZE); + (u64)acc.vmstats[i] * PAGE_SIZE); }
for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) seq_printf(m, "total_%s %llu\n", memcg1_event_names[i], - (u64)acc.events[i]); + (u64)acc.vmevents[i]);
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "total_%s %llu\n", mem_cgroup_lru_names[i], @@ -4033,11 +4037,11 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) */ static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx) { - long x = atomic_long_read(&memcg->stat[idx]); + long x = atomic_long_read(&memcg->vmstats[idx]); int cpu;
for_each_online_cpu(cpu) - x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx]; + x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx]; if (x < 0) x = 0; return x; @@ -4580,7 +4584,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); - free_percpu(memcg->stat_cpu); + free_percpu(memcg->vmstats_percpu); kfree(memcg); }
@@ -4609,8 +4613,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (memcg->id.id < 0) goto fail;
- memcg->stat_cpu = alloc_percpu(struct mem_cgroup_stat_cpu); - if (!memcg->stat_cpu) + memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu); + if (!memcg->vmstats_percpu) goto fail;
for_each_node(node) @@ -5705,7 +5709,7 @@ static int memory_events_show(struct seq_file *m, void *v) static int memory_stat_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - struct accumulated_stats acc; + struct accumulated_vmstats acc; int i;
/* @@ -5720,30 +5724,30 @@ static int memory_stat_show(struct seq_file *m, void *v) */
memset(&acc, 0, sizeof(acc)); - acc.stats_size = MEMCG_NR_STAT; - acc.events_size = NR_VM_EVENT_ITEMS; - accumulate_memcg_tree(memcg, &acc); + acc.vmstats_size = MEMCG_NR_STAT; + acc.vmevents_size = NR_VM_EVENT_ITEMS; + accumulate_vmstats(memcg, &acc);
seq_printf(m, "anon %llu\n", - (u64)acc.stat[MEMCG_RSS] * PAGE_SIZE); + (u64)acc.vmstats[MEMCG_RSS] * PAGE_SIZE); seq_printf(m, "file %llu\n", - (u64)acc.stat[MEMCG_CACHE] * PAGE_SIZE); + (u64)acc.vmstats[MEMCG_CACHE] * PAGE_SIZE); seq_printf(m, "kernel_stack %llu\n", - (u64)acc.stat[MEMCG_KERNEL_STACK_KB] * 1024); + (u64)acc.vmstats[MEMCG_KERNEL_STACK_KB] * 1024); seq_printf(m, "slab %llu\n", - (u64)(acc.stat[NR_SLAB_RECLAIMABLE] + - acc.stat[NR_SLAB_UNRECLAIMABLE]) * PAGE_SIZE); + (u64)(acc.vmstats[NR_SLAB_RECLAIMABLE] + + acc.vmstats[NR_SLAB_UNRECLAIMABLE]) * PAGE_SIZE); seq_printf(m, "sock %llu\n", - (u64)acc.stat[MEMCG_SOCK] * PAGE_SIZE); + (u64)acc.vmstats[MEMCG_SOCK] * PAGE_SIZE);
seq_printf(m, "shmem %llu\n", - (u64)acc.stat[NR_SHMEM] * PAGE_SIZE); + (u64)acc.vmstats[NR_SHMEM] * PAGE_SIZE); seq_printf(m, "file_mapped %llu\n", - (u64)acc.stat[NR_FILE_MAPPED] * PAGE_SIZE); + (u64)acc.vmstats[NR_FILE_MAPPED] * PAGE_SIZE); seq_printf(m, "file_dirty %llu\n", - (u64)acc.stat[NR_FILE_DIRTY] * PAGE_SIZE); + (u64)acc.vmstats[NR_FILE_DIRTY] * PAGE_SIZE); seq_printf(m, "file_writeback %llu\n", - (u64)acc.stat[NR_WRITEBACK] * PAGE_SIZE); + (u64)acc.vmstats[NR_WRITEBACK] * PAGE_SIZE);
/* * TODO: We should eventually replace our own MEMCG_RSS_HUGE counter @@ -5752,43 +5756,43 @@ static int memory_stat_show(struct seq_file *m, void *v) * where the page->mem_cgroup is set up and stable. */ seq_printf(m, "anon_thp %llu\n", - (u64)acc.stat[MEMCG_RSS_HUGE] * PAGE_SIZE); + (u64)acc.vmstats[MEMCG_RSS_HUGE] * PAGE_SIZE);
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %llu\n", mem_cgroup_lru_names[i], (u64)acc.lru_pages[i] * PAGE_SIZE);
seq_printf(m, "slab_reclaimable %llu\n", - (u64)acc.stat[NR_SLAB_RECLAIMABLE] * PAGE_SIZE); + (u64)acc.vmstats[NR_SLAB_RECLAIMABLE] * PAGE_SIZE); seq_printf(m, "slab_unreclaimable %llu\n", - (u64)acc.stat[NR_SLAB_UNRECLAIMABLE] * PAGE_SIZE); + (u64)acc.vmstats[NR_SLAB_UNRECLAIMABLE] * PAGE_SIZE);
/* Accumulated memory events */
- seq_printf(m, "pgfault %lu\n", acc.events[PGFAULT]); - seq_printf(m, "pgmajfault %lu\n", acc.events[PGMAJFAULT]); + seq_printf(m, "pgfault %lu\n", acc.vmevents[PGFAULT]); + seq_printf(m, "pgmajfault %lu\n", acc.vmevents[PGMAJFAULT]);
seq_printf(m, "workingset_refault %lu\n", - acc.stat[WORKINGSET_REFAULT]); + acc.vmstats[WORKINGSET_REFAULT]); seq_printf(m, "workingset_activate %lu\n", - acc.stat[WORKINGSET_ACTIVATE]); + acc.vmstats[WORKINGSET_ACTIVATE]); seq_printf(m, "workingset_nodereclaim %lu\n", - acc.stat[WORKINGSET_NODERECLAIM]); - - seq_printf(m, "pgrefill %lu\n", acc.events[PGREFILL]); - seq_printf(m, "pgscan %lu\n", acc.events[PGSCAN_KSWAPD] + - acc.events[PGSCAN_DIRECT]); - seq_printf(m, "pgsteal %lu\n", acc.events[PGSTEAL_KSWAPD] + - acc.events[PGSTEAL_DIRECT]); - seq_printf(m, "pgactivate %lu\n", acc.events[PGACTIVATE]); - seq_printf(m, "pgdeactivate %lu\n", acc.events[PGDEACTIVATE]); - seq_printf(m, "pglazyfree %lu\n", acc.events[PGLAZYFREE]); - seq_printf(m, "pglazyfreed %lu\n", acc.events[PGLAZYFREED]); + acc.vmstats[WORKINGSET_NODERECLAIM]); + + seq_printf(m, "pgrefill %lu\n", acc.vmevents[PGREFILL]); + seq_printf(m, "pgscan %lu\n", acc.vmevents[PGSCAN_KSWAPD] + + acc.vmevents[PGSCAN_DIRECT]); + seq_printf(m, "pgsteal %lu\n", acc.vmevents[PGSTEAL_KSWAPD] + + acc.vmevents[PGSTEAL_DIRECT]); + seq_printf(m, "pgactivate %lu\n", acc.vmevents[PGACTIVATE]); + seq_printf(m, "pgdeactivate %lu\n", acc.vmevents[PGDEACTIVATE]); + seq_printf(m, "pglazyfree %lu\n", acc.vmevents[PGLAZYFREE]); + seq_printf(m, "pglazyfreed %lu\n", acc.vmevents[PGLAZYFREED]);
#ifdef CONFIG_TRANSPARENT_HUGEPAGE - seq_printf(m, "thp_fault_alloc %lu\n", acc.events[THP_FAULT_ALLOC]); + seq_printf(m, "thp_fault_alloc %lu\n", acc.vmevents[THP_FAULT_ALLOC]); seq_printf(m, "thp_collapse_alloc %lu\n", - acc.events[THP_COLLAPSE_ALLOC]); + acc.vmevents[THP_COLLAPSE_ALLOC]); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
return 0; @@ -6230,7 +6234,7 @@ static void uncharge_batch(const struct uncharge_gather *ug) __mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge); __mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem); __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); - __this_cpu_add(ug->memcg->stat_cpu->nr_page_events, nr_pages); + __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages); memcg_check_events(ug->memcg, ug->dummy_page); local_irq_restore(flags);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.2-rc1 commit 205b20cc5a99cdf197c32f4dbee2b09c699477f0 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Patch series "mm: memcontrol: memory.stat cost & correctness".
The cgroup memory.stat file holds recursive statistics for the entire subtree. The current implementation does this tree walk on-demand whenever the file is read. This is giving us problems in production.
1. The cost of aggregating the statistics on-demand is high. A lot of system service cgroups are mostly idle and their stats don't change between reads, yet we always have to check them. There are also always some lazily-dying cgroups sitting around that are pinned by a handful of remaining page cache; the same applies to them.
In an application that periodically monitors memory.stat in our fleet, we have seen the aggregation consume up to 5% CPU time.
2. When cgroups die and disappear from the cgroup tree, so do their accumulated vm events. The result is that the event counters at higher-level cgroups can go backwards and confuse some of our automation, let alone people looking at the graphs over time.
To address both issues, this patch series changes the stat implementation to spill counts upwards when the counters change.
The upward spilling is batched using the existing per-cpu cache. In a sparse file stress test with 5 level cgroup nesting, the additional cost of the flushing was negligible (a little under 1% of CPU at 100% CPU utilization, compared to the 5% of reading memory.stat during regular operation).
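For reference, this is the existing per-cpu batching that the spilling piggybacks on (as found today in __mod_memcg_state(); the rest of the series builds on this when propagating counts upward):

	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
		atomic_long_add(x, &memcg->vmstats[idx]);	/* spill upwards */
		x = 0;
	}
	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);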
This patch (of 4):
memcg_page_state(), lruvec_page_state(), memcg_sum_events() are currently returning the state of the local memcg or lruvec, not the recursive state.
In practice there is a demand for both versions, although the callers that want the recursive counts currently sum them up by hand.
By default, cgroups are considered recursive entities and generally we expect more users of the recursive counters, with the local counts being special cases. To reflect that in the naming, add a _local suffix to the current implementations.
The following patch will re-incarnate these functions with recursive semantics, but with an O(1) implementation.
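For example, a caller that explicitly wants the non-recursive count now spells that out, as in the memcg1 stat code touched below:

	seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i],
		   memcg_page_state_local(memcg, NR_LRU_BASE + i) * PAGE_SIZE);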
[hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Shakeel Butt shakeelb@google.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/vmscan.c
Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 16 +++++++-------- mm/memcontrol.c | 40 ++++++++++++++++++++------------------ mm/vmscan.c | 6 +++--- mm/workingset.c | 7 ++++--- 4 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b1051824b0a3..e87518b03d00 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -554,8 +554,8 @@ void unlock_page_memcg(struct page *page); * idx can be of type enum memcg_stat_item or node_stat_item. * Keep in sync with memcg_exact_page_state(). */ -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, - int idx) +static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, + int idx) { long x = atomic_long_read(&memcg->vmstats[idx]); #ifdef CONFIG_SMP @@ -624,8 +624,8 @@ static inline void mod_memcg_page_state(struct page *page, mod_memcg_state(page->mem_cgroup, idx, val); }
-static inline unsigned long lruvec_page_state(struct lruvec *lruvec, - enum node_stat_item idx) +static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, + enum node_stat_item idx) { struct mem_cgroup_per_node *pn; long x; @@ -1011,8 +1011,8 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) { }
-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, - int idx) +static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, + int idx) { return 0; } @@ -1041,8 +1041,8 @@ static inline void mod_memcg_page_state(struct page *page, { }
-static inline unsigned long lruvec_page_state(struct lruvec *lruvec, - enum node_stat_item idx) +static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, + enum node_stat_item idx) { return node_page_state(lruvec_pgdat(lruvec), idx); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9c9162da4f5e..b654b01c851a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -692,8 +692,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) return mz; }
-static unsigned long memcg_sum_events(struct mem_cgroup *memcg, - int event) +static unsigned long memcg_events_local(struct mem_cgroup *memcg, + int event) { return atomic_long_read(&memcg->vmevents[event]); } @@ -1349,12 +1349,14 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) if (memcg1_stats[i] == MEMCG_SWAP && !do_swap_account) continue; pr_cont(" %s:%luKB", memcg1_stat_names[i], - K(memcg_page_state(iter, memcg1_stats[i]))); + K(memcg_page_state_local(iter, + memcg1_stats[i]))); }
for (i = 0; i < NR_LRU_LISTS; i++) pr_cont(" %s:%luKB", mem_cgroup_lru_names[i], - K(memcg_page_state(iter, NR_LRU_BASE + i))); + K(memcg_page_state_local(iter, + NR_LRU_BASE + i)));
pr_cont("\n"); } @@ -1420,13 +1422,13 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, { struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
- if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) || - lruvec_page_state(lruvec, NR_ACTIVE_FILE)) + if (lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) || + lruvec_page_state_local(lruvec, NR_ACTIVE_FILE)) return true; if (noswap || !total_swap_pages) return false; - if (lruvec_page_state(lruvec, NR_INACTIVE_ANON) || - lruvec_page_state(lruvec, NR_ACTIVE_ANON)) + if (lruvec_page_state_local(lruvec, NR_INACTIVE_ANON) || + lruvec_page_state_local(lruvec, NR_ACTIVE_ANON)) return true; return false;
@@ -3084,16 +3086,16 @@ static void accumulate_vmstats(struct mem_cgroup *memcg,
for_each_mem_cgroup_tree(mi, memcg) { for (i = 0; i < acc->vmstats_size; i++) - acc->vmstats[i] += memcg_page_state(mi, + acc->vmstats[i] += memcg_page_state_local(mi, acc->vmstats_array ? acc->vmstats_array[i] : i);
for (i = 0; i < acc->vmevents_size; i++) - acc->vmevents[i] += memcg_sum_events(mi, + acc->vmevents[i] += memcg_events_local(mi, acc->vmevents_array ? acc->vmevents_array[i] : i);
for (i = 0; i < NR_LRU_LISTS; i++) - acc->lru_pages[i] += memcg_page_state(mi, + acc->lru_pages[i] += memcg_page_state_local(mi, NR_LRU_BASE + i); } } @@ -3106,10 +3108,10 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) struct mem_cgroup *iter;
for_each_mem_cgroup_tree(iter, memcg) { - val += memcg_page_state(iter, MEMCG_CACHE); - val += memcg_page_state(iter, MEMCG_RSS); + val += memcg_page_state_local(iter, MEMCG_CACHE); + val += memcg_page_state_local(iter, MEMCG_RSS); if (swap) - val += memcg_page_state(iter, MEMCG_SWAP); + val += memcg_page_state_local(iter, MEMCG_SWAP); } } else { if (!swap) @@ -3453,7 +3455,7 @@ static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, for_each_lru(lru) { if (!(BIT(lru) & lru_mask)) continue; - nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru); + nr += lruvec_page_state_local(lruvec, NR_LRU_BASE + lru); } return nr; } @@ -3467,7 +3469,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg, for_each_lru(lru) { if (!(BIT(lru) & lru_mask)) continue; - nr += memcg_page_state(memcg, NR_LRU_BASE + lru); + nr += memcg_page_state_local(memcg, NR_LRU_BASE + lru); } return nr; } @@ -3552,17 +3554,17 @@ static int memcg_stat_show(struct seq_file *m, void *v) if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account()) continue; seq_printf(m, "%s %lu\n", memcg1_stat_names[i], - memcg_page_state(memcg, memcg1_stats[i]) * + memcg_page_state_local(memcg, memcg1_stats[i]) * PAGE_SIZE); }
for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) seq_printf(m, "%s %lu\n", memcg1_event_names[i], - memcg_sum_events(memcg, memcg1_events[i])); + memcg_events_local(memcg, memcg1_events[i]));
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i], - memcg_page_state(memcg, NR_LRU_BASE + i) * + memcg_page_state_local(memcg, NR_LRU_BASE + i) * PAGE_SIZE);
/* Hierarchical information */ diff --git a/mm/vmscan.c b/mm/vmscan.c index cd0fa90d6765..c29c9bf1aa5f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -358,7 +358,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone int zid;
if (!mem_cgroup_disabled()) - lru_size = lruvec_page_state(lruvec, NR_LRU_BASE + lru); + lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru); else lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
@@ -2240,7 +2240,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, * is being established. Disable active list protection to get * rid of the stale workingset quickly. */ - refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE); + refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); if (file && lruvec->refaults != refaults) { inactive_ratio = 0; } else { @@ -2999,7 +2999,7 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat) struct lruvec *lruvec;
lruvec = mem_cgroup_lruvec(pgdat, memcg); - refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE); + refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); lruvec->refaults = refaults; } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL))); } diff --git a/mm/workingset.c b/mm/workingset.c index 5de6ac71062c..15b361bf6da6 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -424,9 +424,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg); for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) - pages += lruvec_page_state(lruvec, NR_LRU_BASE + i); - pages += lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE); - pages += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE); + pages += lruvec_page_state_local(lruvec, + NR_LRU_BASE + i); + pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE); + pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE); } else #endif pages = node_present_pages(sc->nid);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.2-rc1 commit db9adbcbe740e0986b575dd56aad834ce9e9b5d3 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
These are getting too big to be inlined in every callsite. They were stolen from vmstat.c, which already out-of-lines them, and they have only been growing since. The callsites aren't that hot, either.
Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events() out of line and add kerneldoc comments.
Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Shakeel Butt shakeelb@google.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 62 +++--------------------------- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+), 57 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e87518b03d00..0786a290414e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -565,22 +565,7 @@ static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, return x; }
-/* idx can be of type enum memcg_stat_item or node_stat_item */ -static inline void __mod_memcg_state(struct mem_cgroup *memcg, - int idx, int val) -{ - long x; - - if (mem_cgroup_disabled()) - return; - - x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); - if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &memcg->vmstats[idx]); - x = 0; - } - __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); -} +void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
/* idx can be of type enum memcg_stat_item or node_stat_item */ static inline void mod_memcg_state(struct mem_cgroup *memcg, @@ -642,31 +627,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; }
-static inline void __mod_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx, int val) -{ - struct mem_cgroup_per_node *pn; - long x; - - /* Update node */ - __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); - - if (mem_cgroup_disabled()) - return; - - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - - /* Update memcg */ - __mod_memcg_state(pn->memcg, idx, val); - - /* Update lruvec */ - x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); - if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &pn->lruvec_stat[idx]); - x = 0; - } - __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); -} +void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, + int val);
static inline void mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val) @@ -708,22 +670,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned);
-static inline void __count_memcg_events(struct mem_cgroup *memcg, - enum vm_event_item idx, - unsigned long count) -{ - unsigned long x; - - if (mem_cgroup_disabled()) - return; - - x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); - if (unlikely(x > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &memcg->vmevents[idx]); - x = 0; - } - __this_cpu_write(memcg->vmstats_percpu->events[idx], x); -} +void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, + unsigned long count);
static inline void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b654b01c851a..5f84c6b07312 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -692,6 +692,85 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) return mz; }
+/** + * __mod_memcg_state - update cgroup memory statistics + * @memcg: the memory cgroup + * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item + * @val: delta to add to the counter, can be negative + */ +void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) +{ + long x; + + if (mem_cgroup_disabled()) + return; + + x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); + if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { + atomic_long_add(x, &memcg->vmstats[idx]); + x = 0; + } + __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); +} + +/** + * __mod_lruvec_state - update lruvec memory statistics + * @lruvec: the lruvec + * @idx: the stat item + * @val: delta to add to the counter, can be negative + * + * The lruvec is the intersection of the NUMA node and a cgroup. This + * function updates the all three counters that are affected by a + * change of state at this level: per-node, per-cgroup, per-lruvec. + */ +void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, + int val) +{ + struct mem_cgroup_per_node *pn; + long x; + + /* Update node */ + __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); + + if (mem_cgroup_disabled()) + return; + + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + + /* Update memcg */ + __mod_memcg_state(pn->memcg, idx, val); + + /* Update lruvec */ + x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); + if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { + atomic_long_add(x, &pn->lruvec_stat[idx]); + x = 0; + } + __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); +} + +/** + * __count_memcg_events - account VM events in a cgroup + * @memcg: the memory cgroup + * @idx: the event item + * @count: the number of events that occured + */ +void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, + unsigned long count) +{ + unsigned long x; + + if (mem_cgroup_disabled()) + return; + + x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); + if (unlikely(x > MEMCG_CHARGE_BATCH)) { + atomic_long_add(x, &memcg->vmevents[idx]); + x = 0; + } + __this_cpu_write(memcg->vmstats_percpu->events[idx], x); +} + static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) {
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.2-rc1 commit 42a300353577ccc17ecc627b8570a89fa1678bec category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Right now, when somebody needs to know the recursive memory statistics and events of a cgroup subtree, they need to walk the entire subtree and sum up the counters manually.
There are two issues with this:
1. When a cgroup gets deleted, its stats are lost. The state counters should all be 0 at that point, of course, but the events are not. When this happens, the event counters, which are supposed to be monotonic, can go backwards in the parent cgroups.
2. During regular operation, we always have a certain number of lazily freed cgroups sitting around that have been deleted, have no tasks, but have a few cache pages remaining. These groups' statistics do not change until we eventually hit memory pressure, but somebody watching, say, memory.stat on an ancestor has to iterate those every time.
This patch addresses both issues by introducing recursive counters at each level that are propagated from the write side when stats change.
Upward propagation happens when the per-cpu caches spill over into the local atomic counter. This is the same thing we do during charge and uncharge, except that the latter uses atomic RMWs, which are more expensive; stat changes happen at around the same rate. In a sparse file test (page faults and reclaim at maximum CPU speed) with 5 cgroup nesting levels, perf shows __mod_memcg_page_state at ~1%.
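As a rough illustration of this spill-and-propagate write path, here is a minimal userspace C sketch (not the kernel code: the struct, the single "per-cpu" slot and the 32-entry batch are hypothetical stand-ins mirroring MEMCG_CHARGE_BATCH):

/*
 * Minimal userspace sketch of batched upward propagation. All names and
 * the single "per-cpu" slot are illustrative, not the kernel implementation.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH 32	/* plays the role of MEMCG_CHARGE_BATCH */

struct cgroup {
	struct cgroup *parent;
	long percpu_delta;	/* per-cpu cache (a single CPU shown) */
	atomic_long hier;	/* hierarchical atomic counter */
};

static void mod_state(struct cgroup *cg, long val)
{
	long x = cg->percpu_delta + val;

	if (labs(x) > BATCH) {
		/* spill the whole batch into this group and all ancestors */
		for (struct cgroup *mi = cg; mi; mi = mi->parent)
			atomic_fetch_add(&mi->hier, x);
		x = 0;
	}
	cg->percpu_delta = x;
}

int main(void)
{
	struct cgroup A = { .parent = NULL };
	struct cgroup B = { .parent = &A };	/* A/B */
	int i;

	for (i = 0; i < 100; i++)	/* 100 single-page updates in A/B */
		mod_state(&B, 1);

	printf("A/B hier=%ld A hier=%ld cached in A/B=%ld\n",
	       atomic_load(&B.hier), atomic_load(&A.hier), B.percpu_delta);
	return 0;
}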
Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Shakeel Butt shakeelb@google.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 54 +++++++++- mm/memcontrol.c | 205 ++++++++++++++++++------------------- 2 files changed, 150 insertions(+), 109 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0786a290414e..df0c3c4bb3ba 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -128,6 +128,7 @@ struct mem_cgroup_per_node {
struct lruvec_stat __percpu *lruvec_stat_cpu; atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS]; + atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
@@ -279,8 +280,12 @@ struct mem_cgroup { MEMCG_PADDING(_pad2_);
atomic_long_t vmstats[MEMCG_NR_STAT]; + atomic_long_t vmstats_local[MEMCG_NR_STAT]; + atomic_long_t vmevents[NR_VM_EVENT_ITEMS]; - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; + atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS]; + + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
unsigned long socket_pressure;
@@ -550,6 +555,20 @@ struct mem_cgroup *lock_page_memcg(struct page *page); void __unlock_page_memcg(struct mem_cgroup *memcg); void unlock_page_memcg(struct page *page);
+/* + * idx can be of type enum memcg_stat_item or node_stat_item. + * Keep in sync with memcg_exact_page_state(). + */ +static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) +{ + long x = atomic_long_read(&memcg->vmstats[idx]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + /* * idx can be of type enum memcg_stat_item or node_stat_item. * Keep in sync with memcg_exact_page_state(). @@ -557,7 +576,7 @@ void unlock_page_memcg(struct page *page); static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) { - long x = atomic_long_read(&memcg->vmstats[idx]); + long x = atomic_long_read(&memcg->vmstats_local[idx]); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -609,6 +628,24 @@ static inline void mod_memcg_page_state(struct page *page, mod_memcg_state(page->mem_cgroup, idx, val); }
+static inline unsigned long lruvec_page_state(struct lruvec *lruvec, + enum node_stat_item idx) +{ + struct mem_cgroup_per_node *pn; + long x; + + if (mem_cgroup_disabled()) + return node_page_state(lruvec_pgdat(lruvec), idx); + + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + x = atomic_long_read(&pn->lruvec_stat[idx]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; +} + static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, enum node_stat_item idx) { @@ -619,7 +656,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - x = atomic_long_read(&pn->lruvec_stat[idx]); + x = atomic_long_read(&pn->lruvec_stat_local[idx]); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -959,6 +996,11 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) { }
+static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) +{ + return 0; +} + static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) { @@ -989,6 +1031,12 @@ static inline void mod_memcg_page_state(struct page *page, { }
+static inline unsigned long lruvec_page_state(struct lruvec *lruvec, + enum node_stat_item idx) +{ + return node_page_state(lruvec_pgdat(lruvec), idx); +} + static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, enum node_stat_item idx) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5f84c6b07312..9715bde5b1df 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -707,12 +707,27 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &memcg->vmstats[idx]); + struct mem_cgroup *mi; + + atomic_long_add(x, &memcg->vmstats_local[idx]); + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) + atomic_long_add(x, &mi->vmstats[idx]); x = 0; } __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); }
+static struct mem_cgroup_per_node * +parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) +{ + struct mem_cgroup *parent; + + parent = parent_mem_cgroup(pn->memcg); + if (!parent) + return NULL; + return mem_cgroup_nodeinfo(parent, nid); +} + /** * __mod_lruvec_state - update lruvec memory statistics * @lruvec: the lruvec @@ -726,24 +741,31 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val) { + pg_data_t *pgdat = lruvec_pgdat(lruvec); struct mem_cgroup_per_node *pn; + struct mem_cgroup *memcg; long x;
/* Update node */ - __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); + __mod_node_page_state(pgdat, idx, val);
if (mem_cgroup_disabled()) return;
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + memcg = pn->memcg;
/* Update memcg */ - __mod_memcg_state(pn->memcg, idx, val); + __mod_memcg_state(memcg, idx, val);
/* Update lruvec */ x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &pn->lruvec_stat[idx]); + struct mem_cgroup_per_node *pi; + + atomic_long_add(x, &pn->lruvec_stat_local[idx]); + for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) + atomic_long_add(x, &pi->lruvec_stat[idx]); x = 0; } __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); @@ -765,18 +787,26 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); if (unlikely(x > MEMCG_CHARGE_BATCH)) { - atomic_long_add(x, &memcg->vmevents[idx]); + struct mem_cgroup *mi; + + atomic_long_add(x, &memcg->vmevents_local[idx]); + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) + atomic_long_add(x, &mi->vmevents[idx]); x = 0; } __this_cpu_write(memcg->vmstats_percpu->events[idx], x); }
-static unsigned long memcg_events_local(struct mem_cgroup *memcg, - int event) +static unsigned long memcg_events(struct mem_cgroup *memcg, int event) { return atomic_long_read(&memcg->vmevents[event]); }
+static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) +{ + return atomic_long_read(&memcg->vmevents_local[event]); +} + static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, struct page *page, bool compound, int nr_pages) @@ -2193,7 +2223,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { struct memcg_stock_pcp *stock; - struct mem_cgroup *memcg; + struct mem_cgroup *memcg, *mi;
stock = &per_cpu(memcg_stock, cpu); drain_stock(stock); @@ -2206,8 +2236,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) long x;
x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0); - if (x) - atomic_long_add(x, &memcg->vmstats[i]); + if (x) { + atomic_long_add(x, &memcg->vmstats_local[i]); + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) + atomic_long_add(x, &memcg->vmstats[i]); + }
if (i >= NR_VM_NODE_STAT_ITEMS) continue; @@ -2217,8 +2250,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
pn = mem_cgroup_nodeinfo(memcg, nid); x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0); - if (x) - atomic_long_add(x, &pn->lruvec_stat[i]); + if (x) { + atomic_long_add(x, &pn->lruvec_stat_local[i]); + do { + atomic_long_add(x, &pn->lruvec_stat[i]); + } while ((pn = parent_nodeinfo(pn, nid))); + } } }
@@ -2226,8 +2263,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) long x;
x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0); - if (x) - atomic_long_add(x, &memcg->vmevents[i]); + if (x) { + atomic_long_add(x, &memcg->vmevents_local[i]); + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) + atomic_long_add(x, &memcg->vmevents[i]); + } } }
@@ -3144,54 +3184,15 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, return retval; }
-struct accumulated_vmstats { - unsigned long vmstats[MEMCG_NR_STAT]; - unsigned long vmevents[NR_VM_EVENT_ITEMS]; - unsigned long lru_pages[NR_LRU_LISTS]; - - /* overrides for v1 */ - const unsigned int *vmstats_array; - const unsigned int *vmevents_array; - - int vmstats_size; - int vmevents_size; -}; - -static void accumulate_vmstats(struct mem_cgroup *memcg, - struct accumulated_vmstats *acc) -{ - struct mem_cgroup *mi; - int i; - - for_each_mem_cgroup_tree(mi, memcg) { - for (i = 0; i < acc->vmstats_size; i++) - acc->vmstats[i] += memcg_page_state_local(mi, - acc->vmstats_array ? acc->vmstats_array[i] : i); - - for (i = 0; i < acc->vmevents_size; i++) - acc->vmevents[i] += memcg_events_local(mi, - acc->vmevents_array - ? acc->vmevents_array[i] : i); - - for (i = 0; i < NR_LRU_LISTS; i++) - acc->lru_pages[i] += memcg_page_state_local(mi, - NR_LRU_BASE + i); - } -} - static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) { - unsigned long val = 0; + unsigned long val;
if (mem_cgroup_is_root(memcg)) { - struct mem_cgroup *iter; - - for_each_mem_cgroup_tree(iter, memcg) { - val += memcg_page_state_local(iter, MEMCG_CACHE); - val += memcg_page_state_local(iter, MEMCG_RSS); - if (swap) - val += memcg_page_state_local(iter, MEMCG_SWAP); - } + val = memcg_page_state(memcg, MEMCG_CACHE) + + memcg_page_state(memcg, MEMCG_RSS); + if (swap) + val += memcg_page_state(memcg, MEMCG_SWAP); } else { if (!swap) val = page_counter_read(&memcg->memory); @@ -3624,7 +3625,6 @@ static int memcg_stat_show(struct seq_file *m, void *v) unsigned long memory, memsw; struct mem_cgroup *mi; unsigned int i; - struct accumulated_vmstats acc;
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); @@ -3658,27 +3658,21 @@ static int memcg_stat_show(struct seq_file *m, void *v) seq_printf(m, "hierarchical_memsw_limit %llu\n", (u64)memsw * PAGE_SIZE);
- memset(&acc, 0, sizeof(acc)); - acc.vmstats_size = ARRAY_SIZE(memcg1_stats); - acc.vmstats_array = memcg1_stats; - acc.vmevents_size = ARRAY_SIZE(memcg1_events); - acc.vmevents_array = memcg1_events; - accumulate_vmstats(memcg, &acc); - for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account()) continue; seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i], - (u64)acc.vmstats[i] * PAGE_SIZE); + (u64)memcg_page_state(memcg, i) * PAGE_SIZE); }
for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) seq_printf(m, "total_%s %llu\n", memcg1_event_names[i], - (u64)acc.vmevents[i]); + (u64)memcg_events(memcg, i));
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "total_%s %llu\n", mem_cgroup_lru_names[i], - (u64)acc.lru_pages[i] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_LRU_BASE + i) * + PAGE_SIZE);
#ifdef CONFIG_DEBUG_VM { @@ -5790,7 +5784,6 @@ static int memory_events_show(struct seq_file *m, void *v) static int memory_stat_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - struct accumulated_vmstats acc; int i;
/* @@ -5804,31 +5797,27 @@ static int memory_stat_show(struct seq_file *m, void *v) * Current memory state: */
- memset(&acc, 0, sizeof(acc)); - acc.vmstats_size = MEMCG_NR_STAT; - acc.vmevents_size = NR_VM_EVENT_ITEMS; - accumulate_vmstats(memcg, &acc); - seq_printf(m, "anon %llu\n", - (u64)acc.vmstats[MEMCG_RSS] * PAGE_SIZE); + (u64)memcg_page_state(memcg, MEMCG_RSS) * PAGE_SIZE); seq_printf(m, "file %llu\n", - (u64)acc.vmstats[MEMCG_CACHE] * PAGE_SIZE); + (u64)memcg_page_state(memcg, MEMCG_CACHE) * PAGE_SIZE); seq_printf(m, "kernel_stack %llu\n", - (u64)acc.vmstats[MEMCG_KERNEL_STACK_KB] * 1024); + (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) * 1024); seq_printf(m, "slab %llu\n", - (u64)(acc.vmstats[NR_SLAB_RECLAIMABLE] + - acc.vmstats[NR_SLAB_UNRECLAIMABLE]) * PAGE_SIZE); + (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) + + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) * + PAGE_SIZE); seq_printf(m, "sock %llu\n", - (u64)acc.vmstats[MEMCG_SOCK] * PAGE_SIZE); + (u64)memcg_page_state(memcg, MEMCG_SOCK) * PAGE_SIZE);
seq_printf(m, "shmem %llu\n", - (u64)acc.vmstats[NR_SHMEM] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_SHMEM) * PAGE_SIZE); seq_printf(m, "file_mapped %llu\n", - (u64)acc.vmstats[NR_FILE_MAPPED] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_FILE_MAPPED) * PAGE_SIZE); seq_printf(m, "file_dirty %llu\n", - (u64)acc.vmstats[NR_FILE_DIRTY] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_FILE_DIRTY) * PAGE_SIZE); seq_printf(m, "file_writeback %llu\n", - (u64)acc.vmstats[NR_WRITEBACK] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_WRITEBACK) * PAGE_SIZE);
/* * TODO: We should eventually replace our own MEMCG_RSS_HUGE counter @@ -5837,43 +5826,47 @@ static int memory_stat_show(struct seq_file *m, void *v) * where the page->mem_cgroup is set up and stable. */ seq_printf(m, "anon_thp %llu\n", - (u64)acc.vmstats[MEMCG_RSS_HUGE] * PAGE_SIZE); + (u64)memcg_page_state(memcg, MEMCG_RSS_HUGE) * PAGE_SIZE);
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %llu\n", mem_cgroup_lru_names[i], - (u64)acc.lru_pages[i] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_LRU_BASE + i) * + PAGE_SIZE);
seq_printf(m, "slab_reclaimable %llu\n", - (u64)acc.vmstats[NR_SLAB_RECLAIMABLE] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) * + PAGE_SIZE); seq_printf(m, "slab_unreclaimable %llu\n", - (u64)acc.vmstats[NR_SLAB_UNRECLAIMABLE] * PAGE_SIZE); + (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) * + PAGE_SIZE);
/* Accumulated memory events */
- seq_printf(m, "pgfault %lu\n", acc.vmevents[PGFAULT]); - seq_printf(m, "pgmajfault %lu\n", acc.vmevents[PGMAJFAULT]); + seq_printf(m, "pgfault %lu\n", memcg_events(memcg, PGFAULT)); + seq_printf(m, "pgmajfault %lu\n", memcg_events(memcg, PGMAJFAULT));
seq_printf(m, "workingset_refault %lu\n", - acc.vmstats[WORKINGSET_REFAULT]); + memcg_page_state(memcg, WORKINGSET_REFAULT)); seq_printf(m, "workingset_activate %lu\n", - acc.vmstats[WORKINGSET_ACTIVATE]); + memcg_page_state(memcg, WORKINGSET_ACTIVATE)); seq_printf(m, "workingset_nodereclaim %lu\n", - acc.vmstats[WORKINGSET_NODERECLAIM]); - - seq_printf(m, "pgrefill %lu\n", acc.vmevents[PGREFILL]); - seq_printf(m, "pgscan %lu\n", acc.vmevents[PGSCAN_KSWAPD] + - acc.vmevents[PGSCAN_DIRECT]); - seq_printf(m, "pgsteal %lu\n", acc.vmevents[PGSTEAL_KSWAPD] + - acc.vmevents[PGSTEAL_DIRECT]); - seq_printf(m, "pgactivate %lu\n", acc.vmevents[PGACTIVATE]); - seq_printf(m, "pgdeactivate %lu\n", acc.vmevents[PGDEACTIVATE]); - seq_printf(m, "pglazyfree %lu\n", acc.vmevents[PGLAZYFREE]); - seq_printf(m, "pglazyfreed %lu\n", acc.vmevents[PGLAZYFREED]); + memcg_page_state(memcg, WORKINGSET_NODERECLAIM)); + + seq_printf(m, "pgrefill %lu\n", memcg_events(memcg, PGREFILL)); + seq_printf(m, "pgscan %lu\n", memcg_events(memcg, PGSCAN_KSWAPD) + + memcg_events(memcg, PGSCAN_DIRECT)); + seq_printf(m, "pgsteal %lu\n", memcg_events(memcg, PGSTEAL_KSWAPD) + + memcg_events(memcg, PGSTEAL_DIRECT)); + seq_printf(m, "pgactivate %lu\n", memcg_events(memcg, PGACTIVATE)); + seq_printf(m, "pgdeactivate %lu\n", memcg_events(memcg, PGDEACTIVATE)); + seq_printf(m, "pglazyfree %lu\n", memcg_events(memcg, PGLAZYFREE)); + seq_printf(m, "pglazyfreed %lu\n", memcg_events(memcg, PGLAZYFREED));
#ifdef CONFIG_TRANSPARENT_HUGEPAGE - seq_printf(m, "thp_fault_alloc %lu\n", acc.vmevents[THP_FAULT_ALLOC]); + seq_printf(m, "thp_fault_alloc %lu\n", + memcg_events(memcg, THP_FAULT_ALLOC)); seq_printf(m, "thp_collapse_alloc %lu\n", - acc.vmevents[THP_COLLAPSE_ALLOC]); + memcg_events(memcg, THP_COLLAPSE_ALLOC)); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
return 0;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.2-rc1 commit def0fdae813dbbbbb588bfc5f52856be2e842b35 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
When a cgroup is reclaimed on behalf of a configured limit, reclaim needs to round-robin through all NUMA nodes that hold pages of the memcg in question. However, when assembling the mask of candidate NUMA nodes, the code only consults the *local* cgroup LRU counters, not the recursive counters for the entire subtree. Cgroup limits are frequently configured against intermediate cgroups that do not have memory on their own LRUs. In this case, the node mask will always come up empty and reclaim falls back to scanning only the current node.
If a cgroup subtree has some memory on one node but the processes are bound to another node afterwards, the limit reclaim will never age or reclaim that memory anymore.
To fix this, use the recursive LRU counts for a cgroup subtree to determine which nodes hold memory of that cgroup.
The code has been broken like this forever, so it doesn't seem to be a problem in practice. I just noticed it while reviewing the way the LRU counters are used in general.
Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Shakeel Butt shakeelb@google.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9715bde5b1df..c54d9098cc5d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1531,13 +1531,13 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, { struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
- if (lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) || - lruvec_page_state_local(lruvec, NR_ACTIVE_FILE)) + if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) || + lruvec_page_state(lruvec, NR_ACTIVE_FILE)) return true; if (noswap || !total_swap_pages) return false; - if (lruvec_page_state_local(lruvec, NR_INACTIVE_ANON) || - lruvec_page_state_local(lruvec, NR_ACTIVE_ANON)) + if (lruvec_page_state(lruvec, NR_INACTIVE_ANON) || + lruvec_page_state(lruvec, NR_ACTIVE_ANON)) return true; return false;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.2-rc5 commit 815744d75152078cde5391fc1e3c2d4424323fb6 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
The kernel test robot noticed a 26% will-it-scale pagefault regression from commit 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty"). This appears to be caused by bouncing the additional cachelines from the new hierarchical statistics counters.
We can fix this by getting rid of the batched local counters instead.
Originally, there were *only* group-local counters, and they were fully maintained per cpu. A reader of a stats file high up in the cgroup tree would have to walk the entire subtree and collect each level's per-cpu counters to get the recursive view. This was prohibitively expensive, and so we switched to per-cpu batched updates of the local counters during a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting"), reducing the complexity from nr_subgroups * nr_cpus to nr_subgroups.
With growing machines and cgroup trees, the tree walk itself became too expensive for monitoring top-level groups, and this is when the culprit patch added hierarchy counters on each cgroup level. When the per-cpu batch size would be reached, both the local and the hierarchy counters would get batch-updated from the per-cpu delta simultaneously.
This makes local and hierarchical counter reads blazingly fast, but it unfortunately makes the write-side too cache line intense.
Since local counter reads were never a problem - we only centralized them to accelerate the hierarchy walk - and use of the local counters is becoming rarer due to replacement with hierarchical views (ongoing rework in the page reclaim and workingset code), we can make those local counters unbatched per-cpu counters again.
The scheme will then be as such:
when a memcg statistic changes, the writer will: - update the local counter (per-cpu) - update the batch counter (per-cpu). If the batch is full: - spill the batch into the group's atomic_t - spill the batch into all ancestors' atomic_ts - empty out the batch counter (per-cpu)
when a local memcg counter is read, the reader will: - collect the local counter from all cpus
when a hierarchy memcg counter is read, the reader will: - read the atomic_t
We might be able to simplify this further and make the recursive counters unbatched per-cpu counters as well (batch upward propagation, but leave per-cpu collection to the readers), but that will require a more in-depth analysis and testing of all the callsites. Deal with the immediate regression for now.
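A hedged userspace sketch of the two read paths in this scheme (the names and the fixed CPU count are illustrative assumptions, not the kernel API):

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4	/* illustrative; the kernel iterates possible CPUs */

struct counter {
	long local[NR_CPUS];	/* unbatched per-cpu local counter */
	atomic_long hier;	/* batched hierarchical counter */
};

/* local view: sum all per-cpu slots, like a reader of a local stat */
static long read_local(struct counter *c)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/* hierarchical view: one atomic read of the pre-aggregated counter */
static long read_hier(struct counter *c)
{
	return atomic_load(&c->hier);
}

int main(void)
{
	struct counter c = { .local = { 5, 7, 0, 3 } };

	atomic_store(&c.hier, 15);	/* pretend 15 was spilled here earlier */
	printf("local=%ld hierarchical=%ld\n", read_local(&c), read_hier(&c));
	return 0;
}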
Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reported-by: kernel test robot rong.a.chen@intel.com Tested-by: kernel test robot rong.a.chen@intel.com Cc: Michal Hocko mhocko@kernel.org Cc: Shakeel Butt shakeelb@google.com Cc: Roman Gushchin guro@fb.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/memcontrol.h | 26 ++++++++++++++++-------- mm/memcontrol.c | 41 ++++++++++++++++++++++++++------------ 2 files changed, 46 insertions(+), 21 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index df0c3c4bb3ba..8c7b83cee2d7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -126,9 +126,12 @@ struct memcg_shrinker_map { struct mem_cgroup_per_node { struct lruvec lruvec;
+ /* Legacy local VM stats */ + struct lruvec_stat __percpu *lruvec_stat_local; + + /* Subtree VM stats (batched updates) */ struct lruvec_stat __percpu *lruvec_stat_cpu; atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS]; - atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
@@ -274,17 +277,18 @@ struct mem_cgroup { atomic_t moving_account; struct task_struct *move_lock_task;
- /* memory.stat */ + /* Legacy local VM stats and events */ + struct memcg_vmstats_percpu __percpu *vmstats_local; + + /* Subtree VM stats and events (batched updates) */ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
MEMCG_PADDING(_pad2_);
atomic_long_t vmstats[MEMCG_NR_STAT]; - atomic_long_t vmstats_local[MEMCG_NR_STAT]; - atomic_long_t vmevents[NR_VM_EVENT_ITEMS]; - atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];
+ /* memory.events */ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
unsigned long socket_pressure; @@ -576,7 +580,11 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) { - long x = atomic_long_read(&memcg->vmstats_local[idx]); + long x = 0; + int cpu; + + for_each_possible_cpu(cpu) + x += per_cpu(memcg->vmstats_local->stat[idx], cpu); #ifdef CONFIG_SMP if (x < 0) x = 0; @@ -650,13 +658,15 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, enum node_stat_item idx) { struct mem_cgroup_per_node *pn; - long x; + long x = 0; + int cpu;
if (mem_cgroup_disabled()) return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - x = atomic_long_read(&pn->lruvec_stat_local[idx]); + for_each_possible_cpu(cpu) + x += per_cpu(pn->lruvec_stat_local->count[idx], cpu); #ifdef CONFIG_SMP if (x < 0) x = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c54d9098cc5d..b8db61ea124e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -705,11 +705,12 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) if (mem_cgroup_disabled()) return;
+ __this_cpu_add(memcg->vmstats_local->stat[idx], val); + x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { struct mem_cgroup *mi;
- atomic_long_add(x, &memcg->vmstats_local[idx]); for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) atomic_long_add(x, &mi->vmstats[idx]); x = 0; @@ -759,11 +760,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, __mod_memcg_state(memcg, idx, val);
/* Update lruvec */ + __this_cpu_add(pn->lruvec_stat_local->count[idx], val); + x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { struct mem_cgroup_per_node *pi;
- atomic_long_add(x, &pn->lruvec_stat_local[idx]); for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) atomic_long_add(x, &pi->lruvec_stat[idx]); x = 0; @@ -785,11 +787,12 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, if (mem_cgroup_disabled()) return;
+ __this_cpu_add(memcg->vmstats_local->events[idx], count); + x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); if (unlikely(x > MEMCG_CHARGE_BATCH)) { struct mem_cgroup *mi;
- atomic_long_add(x, &memcg->vmevents_local[idx]); for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) atomic_long_add(x, &mi->vmevents[idx]); x = 0; @@ -804,7 +807,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) { - return atomic_long_read(&memcg->vmevents_local[event]); + long x = 0; + int cpu; + + for_each_possible_cpu(cpu) + x += per_cpu(memcg->vmstats_local->events[event], cpu); + return x; }
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, @@ -2236,11 +2244,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) long x;
x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0); - if (x) { - atomic_long_add(x, &memcg->vmstats_local[i]); + if (x) for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) atomic_long_add(x, &memcg->vmstats[i]); - }
if (i >= NR_VM_NODE_STAT_ITEMS) continue; @@ -2250,12 +2256,10 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
pn = mem_cgroup_nodeinfo(memcg, nid); x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0); - if (x) { - atomic_long_add(x, &pn->lruvec_stat_local[i]); + if (x) do { atomic_long_add(x, &pn->lruvec_stat[i]); } while ((pn = parent_nodeinfo(pn, nid))); - } } }
@@ -2263,11 +2267,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) long x;
x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0); - if (x) { - atomic_long_add(x, &memcg->vmevents_local[i]); + if (x) for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) atomic_long_add(x, &memcg->vmevents[i]); - } } }
@@ -4627,8 +4629,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return 1;
+ pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat); + if (!pn->lruvec_stat_local) { + kfree(pn); + return 1; + } + pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat); if (!pn->lruvec_stat_cpu) { + free_percpu(pn->lruvec_stat_local); kfree(pn); return 1; } @@ -4650,6 +4659,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) return;
free_percpu(pn->lruvec_stat_cpu); + free_percpu(pn->lruvec_stat_local); kfree(pn); }
@@ -4660,6 +4670,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu); + free_percpu(memcg->vmstats_local); kfree(memcg); }
@@ -4688,6 +4699,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (memcg->id.id < 0) goto fail;
+ memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); + if (!memcg->vmstats_local) + goto fail; + memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu); if (!memcg->vmstats_percpu) goto fail;
From: Yafang Shao laoar.shao@gmail.com
mainline inclusion from mainline-v5.3-rc1 commit dd9239900e12db84c198855b262ae7796db1123b category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
When we calculate total statistics for memcg1_stats and memcg1_events, we use the index 'i' in the for loop as the events index. Actually we should use memcg1_stats[i] and memcg1_events[i] as the events index.
Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.... Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Yafang Shao laoar.shao@gmail.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Yafang Shao shaoyafang@didiglobal.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8db61ea124e..ea8520f2f7fa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3664,12 +3664,13 @@ static int memcg_stat_show(struct seq_file *m, void *v) if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account()) continue; seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i], - (u64)memcg_page_state(memcg, i) * PAGE_SIZE); + (u64)memcg_page_state(memcg, memcg1_stats[i]) * + PAGE_SIZE); }
for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) seq_printf(m, "total_%s %llu\n", memcg1_event_names[i], - (u64)memcg_events(memcg, i)); + (u64)memcg_events(memcg, memcg1_events[i]));
for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "total_%s %llu\n", mem_cgroup_lru_names[i],
From: Yafang Shao laoar.shao@gmail.com
mainline inclusion from mainline-v5.3-rc1 commit 766a4c19d880887c457811b86f1f68525e416965 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
After commit 815744d75152 ("mm: memcontrol: don't batch updates of local VM stats and events"), the local VM counters are not in sync with the hierarchical ones.
Below is one example in a leaf memcg on my server (with 8 CPUs):
inactive_file 3567570944 total_inactive_file 3568029696
We find that the deviation is very great because the 'val' in __mod_memcg_state() is in pages while the effective value in memcg_stat_show() is in bytes.
So the maximum of this deviation between local VM stats and total VM stats can be (32 * number_of_cpu * PAGE_SIZE), that may be an unacceptably great value.
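To put numbers on the example above (assuming 4 KiB pages): the worst-case skew on that 8-CPU server is 32 * 8 * 4096 = 1048576 bytes (1 MiB), and the observed gap of 3568029696 - 3567570944 = 458752 bytes (112 pages) fits within that bound.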
We should keep the local VM stats in sync with the total stats. In order to keep this behavior the same across counters, this patch updates __mod_lruvec_state() and __count_memcg_events() as well.
Link: http://lkml.kernel.org/r/1562851979-10610-1-git-send-email-laoar.shao@gmail.... Signed-off-by: Yafang Shao laoar.shao@gmail.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yafang Shao shaoyafang@didiglobal.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ea8520f2f7fa..d68e8f11f1dc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -705,12 +705,15 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) if (mem_cgroup_disabled()) return;
- __this_cpu_add(memcg->vmstats_local->stat[idx], val); - x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { struct mem_cgroup *mi;
+ /* + * Batch local counters to keep them in sync with + * the hierarchical ones. + */ + __this_cpu_add(memcg->vmstats_local->stat[idx], x); for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) atomic_long_add(x, &mi->vmstats[idx]); x = 0; @@ -759,13 +762,15 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, /* Update memcg */ __mod_memcg_state(memcg, idx, val);
- /* Update lruvec */ - __this_cpu_add(pn->lruvec_stat_local->count[idx], val); - x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { struct mem_cgroup_per_node *pi;
+ /* + * Batch local counters to keep them in sync with + * the hierarchical ones. + */ + __this_cpu_add(pn->lruvec_stat_local->count[idx], x); for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) atomic_long_add(x, &pi->lruvec_stat[idx]); x = 0; @@ -787,12 +792,15 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, if (mem_cgroup_disabled()) return;
- __this_cpu_add(memcg->vmstats_local->events[idx], count); - x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); if (unlikely(x > MEMCG_CHARGE_BATCH)) { struct mem_cgroup *mi;
+ /* + * Batch local counters to keep them in sync with + * the hierarchical ones. + */ + __this_cpu_add(memcg->vmstats_local->events[idx], x); for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) atomic_long_add(x, &mi->vmevents[idx]); x = 0;
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-v5.3-rc6 commit c350a99ea2b1b666c28948d74ab46c16913c28a7 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Percpu caching of local vmstats with the conditional propagation by the cgroup tree leads to an accumulation of errors on non-leaf levels.
Let's imagine two nested memory cgroups A and A/B. Say, a process belonging to A/B allocates 100 pagecache pages on CPU 0. The percpu cache will spill 3 times, so that 32*3=96 pages will be accounted to the A/B and A atomic vmstat counters, and 4 pages will remain in the percpu cache.
Imagine A/B is close to memory.max, so that every following allocation triggers a direct reclaim on the local CPU. Say, each such attempt will free 16 pages on a new cpu. That means every percpu cache will have -16 pages, except the first one, which will have 4 - 16 = -12. The A/B and A atomic counters will not be touched at all.
Now a user removes A/B. All percpu caches are freed and corresponding vmstat numbers are forgotten. A has 96 pages more than expected.
As memory cgroups are created and destroyed, errors do accumulate. Even 1-2 page differences can accumulate into large numbers.
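Spelling out why nothing corrects itself in the example above: with the 32-page batch, each -16 (or -12) page reclaim delta stays below the spill threshold and is never subtracted from A's atomic counter; when A/B is destroyed, those per-cpu deltas are simply discarded, so A permanently keeps the extra ~96 pages.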
To fix this issue let's accumulate and propagate percpu vmstat values before releasing the memory cgroup. At this point these numbers are stable and cannot be changed.
Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate only over online cpus.
Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d68e8f11f1dc..06514cfbb9d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3263,6 +3263,41 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, } }
+static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) +{ + unsigned long stat[MEMCG_NR_STAT]; + struct mem_cgroup *mi; + int node, cpu, i; + + for (i = 0; i < MEMCG_NR_STAT; i++) + stat[i] = 0; + + for_each_online_cpu(cpu) + for (i = 0; i < MEMCG_NR_STAT; i++) + stat[i] += raw_cpu_read(memcg->vmstats_percpu->stat[i]); + + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) + for (i = 0; i < MEMCG_NR_STAT; i++) + atomic_long_add(stat[i], &mi->vmstats[i]); + + for_each_node(node) { + struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; + struct mem_cgroup_per_node *pi; + + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + stat[i] = 0; + + for_each_online_cpu(cpu) + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + stat[i] += raw_cpu_read( + pn->lruvec_stat_cpu->count[i]); + + for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + atomic_long_add(stat[i], &pi->lruvec_stat[i]); + } +} + #ifdef CONFIG_MEMCG_KMEM static int memcg_online_kmem(struct mem_cgroup *memcg) { @@ -4676,6 +4711,11 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node;
+ /* + * Flush percpu vmstats to guarantee the value correctness + * on parent's and all ancestor levels. + */ + memcg_flush_percpu_vmstats(memcg); for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu);
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-v5.3-rc6 commit bb65f89b7d3d305c14951f49860711fbcae70692 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Similar to vmstats, percpu caching of local vmevents leads to an accumulation of errors on non-leaf levels. This happens because some leftovers may remain in percpu caches, so that they are never propagated up by the cgroup tree and simply disappear when the memory cgroup is released.
To fix this issue let's accumulate and propagate percpu vmevents values before releasing the memory cgroup similar to what we're doing with vmstats.
Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate only over online cpus.
Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 06514cfbb9d6..5a2da8787f36 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3298,6 +3298,25 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) } }
+static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg) +{ + unsigned long events[NR_VM_EVENT_ITEMS]; + struct mem_cgroup *mi; + int cpu, i; + + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) + events[i] = 0; + + for_each_online_cpu(cpu) + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) + events[i] += raw_cpu_read( + memcg->vmstats_percpu->events[i]); + + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) + atomic_long_add(events[i], &mi->vmevents[i]); +} + #ifdef CONFIG_MEMCG_KMEM static int memcg_online_kmem(struct mem_cgroup *memcg) { @@ -4712,10 +4731,11 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) int node;
/* - * Flush percpu vmstats to guarantee the value correctness + * Flush percpu vmstats and vmevents to guarantee the value correctness * on parent's and all ancestor levels. */ memcg_flush_percpu_vmstats(memcg); + memcg_flush_percpu_vmevents(memcg); for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu);
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-v5.3-rc7 commit bee07b33db78d4ee7ed6a2fe810b9473d5471fe4 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
I've noticed that the "slab" value in memory.stat is sometimes 0, even if some children memory cgroups have a non-zero "slab" value. The following investigation showed that this is the result of the kmem_cache reparenting in combination with the per-cpu batching of slab vmstats.
At offlining time, some vmstat values may remain in the percpu cache without ever being propagated upwards by the cgroup hierarchy. This means that stats on ancestor levels are lower than the actual numbers. Later, when slab pages are released, the precise number of pages is subtracted on the parent level, making the value negative. We don't show negative values; 0 is printed instead.
To fix this issue, let's flush percpu slab memcg and lruvec stats on memcg offlining. This guarantees that numbers on all ancestor levels are accurate and match the actual number of outstanding slab pages.
Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: Roman Gushchin guro@fb.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/mmzone.h | 5 +++-- mm/memcontrol.c | 35 +++++++++++++++++++++++++++-------- 2 files changed, 30 insertions(+), 10 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8c546c5b0f21..c9a826979e8b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -158,8 +158,9 @@ enum node_stat_item { NR_INACTIVE_FILE, /* " " " " " */ NR_ACTIVE_FILE, /* " " " " " */ NR_UNEVICTABLE, /* " " " " " */ - NR_SLAB_RECLAIMABLE, - NR_SLAB_UNRECLAIMABLE, + NR_SLAB_RECLAIMABLE, /* Please do not reorder this item */ + NR_SLAB_UNRECLAIMABLE, /* and this one without looking at + * memcg_flush_percpu_vmstats() first. */ NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_REFAULT, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5a2da8787f36..42dd131581b8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3263,37 +3263,49 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, } }
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) +static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only) { unsigned long stat[MEMCG_NR_STAT]; struct mem_cgroup *mi; int node, cpu, i; + int min_idx, max_idx;
- for (i = 0; i < MEMCG_NR_STAT; i++) + if (slab_only) { + min_idx = NR_SLAB_RECLAIMABLE; + max_idx = NR_SLAB_UNRECLAIMABLE; + } else { + min_idx = 0; + max_idx = MEMCG_NR_STAT; + } + + for (i = min_idx; i < max_idx; i++) stat[i] = 0;
for_each_online_cpu(cpu) - for (i = 0; i < MEMCG_NR_STAT; i++) + for (i = min_idx; i < max_idx; i++) stat[i] += raw_cpu_read(memcg->vmstats_percpu->stat[i]);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - for (i = 0; i < MEMCG_NR_STAT; i++) + for (i = min_idx; i < max_idx; i++) atomic_long_add(stat[i], &mi->vmstats[i]);
+ if (!slab_only) + max_idx = NR_VM_NODE_STAT_ITEMS; + for_each_node(node) { struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; struct mem_cgroup_per_node *pi;
- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + for (i = min_idx; i < max_idx; i++) stat[i] = 0;
for_each_online_cpu(cpu) - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + for (i = min_idx; i < max_idx; i++) stat[i] += raw_cpu_read( pn->lruvec_stat_cpu->count[i]);
for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) + for (i = min_idx; i < max_idx; i++) atomic_long_add(stat[i], &pi->lruvec_stat[i]); } } @@ -3366,7 +3378,14 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) if (!parent) parent = root_mem_cgroup;
+ /* + * Deactivate and reparent kmem_caches. Then flush percpu + * slab statistics to have precise values at the parent and + * all ancestor levels. It's required to keep slab stats + * accurate after the reparenting of kmem_caches. + */ memcg_deactivate_kmem_caches(memcg, parent); + memcg_flush_percpu_vmstats(memcg, true);
kmemcg_id = memcg->kmemcg_id; BUG_ON(kmemcg_id < 0); @@ -4734,7 +4753,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) * Flush percpu vmstats and vmevents to guarantee the value correctness * on parent's and all ancestor levels. */ - memcg_flush_percpu_vmstats(memcg); + memcg_flush_percpu_vmstats(memcg, false); memcg_flush_percpu_vmevents(memcg); for_each_node(node) free_mem_cgroup_per_node_info(memcg, node);
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-v5.3-rc7 commit b4c46484dc3fa3721d68fdfae85c1d7b1f6b5472 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") effectively decreased the precision of per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
That's good for displaying in memory.stat, but brings a serious regression into the reclaim process.
One issue I've discovered and debugged is the following: lruvec_lru_size() can return 0 instead of the actual number of pages in the lru list, preventing the kernel from reclaiming the last remaining pages. The result is yet another flood of dying memory cgroups. The opposite also happens: scanning an empty lru list is a waste of cpu time.
Also, inactive_list_is_low() can return incorrect values, preventing the active lru from being scanned and freed. It can fail both because the sizes of the active and inactive lists are inaccurate and because the number of workingset refaults isn't precise. In other words, the result is pretty random.
I'm not sure if using the approximate number of slab pages in count_shadow_nodes() is acceptable, but the issues described above are enough to partially revert the patch.
Let's keep per-memcg vmstat_local batched (they are only used for displaying stats to the userspace), but keep lruvec stats precise. This change fixes the dead memcg flooding on my setup.
Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Yafang Shao laoar.shao@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 42dd131581b8..7152b0f8f383 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -762,15 +762,13 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, /* Update memcg */ __mod_memcg_state(memcg, idx, val);
+ /* Update lruvec */ + __this_cpu_add(pn->lruvec_stat_local->count[idx], val); + x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { struct mem_cgroup_per_node *pi;
- /* - * Batch local counters to keep them in sync with - * the hierarchical ones. - */ - __this_cpu_add(pn->lruvec_stat_local->count[idx], x); for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) atomic_long_add(x, &pi->lruvec_stat[idx]); x = 0;
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.3-rc7 commit 6c1c280805ded72eceb2afc1a0d431b256608554 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Instead of using raw_cpu_read(), use per_cpu() to read the actual data of the corresponding cpu; otherwise we end up reading the current cpu's data once for every online cpu.
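The effect is easy to see in a tiny userspace model (the array and current_cpu below are stand-ins, not kernel API): looping over all cpus but always reading the current cpu's slot, which is what raw_cpu_read() did inside for_each_online_cpu(), sums one value repeatedly instead of summing every cpu's contribution.

/* Illustrative model of the bug and the fix. Not kernel code. */
#include <stdio.h>

#define NCPU 4

static long percpu_stat[NCPU] = { 10, 20, 30, 40 };
static int current_cpu;          /* pretend we're running on cpu 0 */

static long sum_buggy(void)      /* analogue of raw_cpu_read() in the loop */
{
	long sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += percpu_stat[current_cpu];   /* always cpu 0: 4 * 10 = 40 */
	return sum;
}

static long sum_fixed(void)      /* analogue of per_cpu(..., cpu) */
{
	long sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += percpu_stat[cpu];           /* 10 + 20 + 30 + 40 = 100 */
	return sum;
}

int main(void)
{
	printf("buggy=%ld fixed=%ld\n", sum_buggy(), sum_fixed());
	return 0;
}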
Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg") Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg") Signed-off-by: Shakeel Butt shakeelb@google.com Acked-by: Roman Gushchin guro@fb.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7152b0f8f383..702733a73430 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3281,7 +3281,7 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
for_each_online_cpu(cpu) for (i = min_idx; i < max_idx; i++) - stat[i] += raw_cpu_read(memcg->vmstats_percpu->stat[i]); + stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) for (i = min_idx; i < max_idx; i++) @@ -3299,8 +3299,8 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
for_each_online_cpu(cpu) for (i = min_idx; i < max_idx; i++) - stat[i] += raw_cpu_read( - pn->lruvec_stat_cpu->count[i]); + stat[i] += per_cpu( + pn->lruvec_stat_cpu->count[i], cpu);
for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) for (i = min_idx; i < max_idx; i++) @@ -3319,8 +3319,8 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
for_each_online_cpu(cpu) for (i = 0; i < NR_VM_EVENT_ITEMS; i++) - events[i] += raw_cpu_read( - memcg->vmstats_percpu->events[i]); + events[i] += per_cpu(memcg->vmstats_percpu->events[i], + cpu);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
From: Honglei Wang honglei.wang@oracle.com
mainline inclusion from mainline-v5.4-rc4 commit b11edebbc967ebf5c55b8f9e1d5bb6d68ec3a7fd category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") made lruvec_page_state() use per-cpu counters instead of calculating the value directly from lru_zone_size, with the idea that this would be more effective.
Tim has reported that this is not really the case for their database benchmark, which shows the opposite result: lruvec_page_state takes up a huge chunk of CPU cycles (about 25% of the system time, which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload runs on a large machine (96 cpus), it has many cgroups (500), and it is heavily direct-reclaim bound.
Tim Chen said:
: The problem can also be reproduced by running simple multi-threaded : pmbench benchmark with a fast Optane SSD swap (see profile below). : : : 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size : | : |--3.07%--lruvec_lru_size : | | : | |--2.11%--cpumask_next : | | | : | | --1.66%--find_next_bit : | | : | --0.57%--call_function_interrupt : | | : | --0.55%--smp_call_function_interrupt : | : |--1.59%--0x441f0fc3d009 : | _ops_rdtsc_init_base_freq : | access_histogram : | page_fault : | __do_page_fault : | handle_mm_fault : | __handle_mm_fault : | | : | --1.54%--do_swap_page : | swapin_readahead : | swap_cluster_readahead : | | : | --1.53%--read_swap_cache_async : | __read_swap_cache_async : | alloc_pages_vma : | __alloc_pages_nodemask : | __alloc_pages_slowpath : | try_to_free_pages : | do_try_to_free_pages : | shrink_node : | shrink_node_memcg : | | : | |--0.77%--lruvec_lru_size : | | : | --0.76%--inactive_list_is_low : | | : | --0.76%--lruvec_lru_size : | : --1.50%--measure_read : page_fault : __do_page_fault : handle_mm_fault : __handle_mm_fault : do_swap_page : swapin_readahead : swap_cluster_readahead : | : --1.48%--read_swap_cache_async : __read_swap_cache_async : alloc_pages_vma : __alloc_pages_nodemask : __alloc_pages_slowpath : try_to_free_pages : do_try_to_free_pages : shrink_node : shrink_node_memcg : | : |--0.75%--inactive_list_is_low : | | : | --0.75%--lruvec_lru_size : | : --0.73%--lruvec_lru_size
The likely culprit is the cache traffic that lruvec_page_state_local generates. Dave Hansen says:
: I was thinking purely of the cache footprint. If it's reading : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192 : bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would : be 18k of data for the whole system and the caching would be pretty : efficient and all 18k would probably survive a tight page fault loop in : the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't : fit in the L1 and probably wouldn't survive a tight page fault loop if : both logical threads were banging on different cgroups. : : It's just a theory, but it's why I noted the number of cgroups when I : initially saw this show up in profiles
Fix the regression by partially reverting the said commit and calculating the lru size explicitly.
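A minimal sketch of the idea, with illustrative arrays standing in for MAX_NR_ZONES and the memcg's per-zone lru bookkeeping (mem_cgroup_get_zone_lru_size() in the actual patch): summing the exact per-zone sizes cannot return a stale zero, unlike a batched per-cpu counter that may lag behind reality.

/* Userspace model only, not the kernel implementation. */
#include <stdio.h>

#define MAX_ZONES 4

static unsigned long lru_zone_size[MAX_ZONES] = { 100, 250, 0, 5 };
static unsigned long batched_counter = 330;    /* may lag behind reality */

static unsigned long lruvec_lru_size_exact(void)
{
	unsigned long size = 0;

	for (int zid = 0; zid < MAX_ZONES; zid++)
		size += lru_zone_size[zid];        /* exact: 355 */
	return size;
}

int main(void)
{
	printf("batched=%lu exact=%lu\n", batched_counter,
	       lruvec_lru_size_exact());
	return 0;
}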
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") Signed-off-by: Honglei Wang honglei.wang@oracle.com Reported-by: Tim Chen tim.c.chen@linux.intel.com Acked-by: Tim Chen tim.c.chen@linux.intel.com Tested-by: Tim Chen tim.c.chen@linux.intel.com Acked-by: Michal Hocko mhocko@suse.com Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Roman Gushchin guro@fb.com Cc: Tejun Heo tj@kernel.org Cc: Dave Hansen dave.hansen@intel.com Cc: stable@vger.kernel.org [5.2+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/vmscan.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c index c29c9bf1aa5f..26eab7de6708 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -354,12 +354,13 @@ unsigned long zone_reclaimable_pages(struct zone *zone) */ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) { - unsigned long lru_size; + unsigned long lru_size = 0; int zid;
- if (!mem_cgroup_disabled()) - lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru); - else + if (!mem_cgroup_disabled()) { + for (zid = 0; zid < MAX_NR_ZONES; zid++) + lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid); + } else lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.4-rc7 commit 7961eee3978475fd9e8626137f88595b1ca05856 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
__mem_cgroup_free() can be called on the failure path in mem_cgroup_alloc(). However, memcg_flush_percpu_vmstats() and memcg_flush_percpu_vmevents(), which are called from __mem_cgroup_free(), access fields of the memcg that can be NULL when called from the failure path of mem_cgroup_alloc(). Indeed, syzbot has reported the following crash:
kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436 Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90 RSP: 0018:ffff888095c27980 EFLAGS: 00010206 RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000 RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007 RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33 R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760 R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007 FS: 00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021 mem_cgroup_free mm/memcontrol.c:5033 [inline] mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160 css_create kernel/cgroup/cgroup.c:5156 [inline] cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119 cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401 kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124 vfs_mkdir+0x42e/0x670 fs/namei.c:3807 do_mkdirat+0x234/0x2a0 fs/namei.c:3830 __do_sys_mkdir fs/namei.c:3846 [inline] __se_sys_mkdir fs/namei.c:3844 [inline] __x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this by moving the flush to mem_cgroup_free(), as there is no need to flush anything if mem_cgroup_alloc() fails.
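A rough userspace sketch of the call ordering after the fix; the stubs and names below are hypothetical stand-ins for __mem_cgroup_free()/mem_cgroup_free(), and only the ordering is the point: the failure path frees a possibly half-initialized structure without flushing, while the normal free path flushes first, when all fields are guaranteed to be valid.

/* Sketch of the two teardown paths. Not kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct memcg {
	long *vmstats_percpu;    /* NULL if allocation failed half-way */
};

static void flush_percpu_stats(struct memcg *m)
{
	/* Dereferences vmstats_percpu; safe only for fully set up memcgs. */
	printf("flushing %ld\n", m->vmstats_percpu[0]);
}

static void __memcg_free(struct memcg *m)    /* no flushing here any more */
{
	free(m->vmstats_percpu);
	free(m);
}

static void memcg_free(struct memcg *m)      /* normal teardown path */
{
	flush_percpu_stats(m);                   /* fields are valid here */
	__memcg_free(m);
}

int main(void)
{
	/* Failure path: vmstats_percpu was never allocated; calling
	 * __memcg_free() directly is safe because it no longer flushes. */
	struct memcg *half = calloc(1, sizeof(*half));
	__memcg_free(half);

	/* Success path: a fully initialized memcg goes through memcg_free(). */
	struct memcg *full = calloc(1, sizeof(*full));
	full->vmstats_percpu = calloc(1, sizeof(long));
	memcg_free(full);
	return 0;
}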
Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg") Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg") Signed-off-by: Shakeel Butt shakeelb@google.com Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/memcontrol.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 702733a73430..f74e0a904e97 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4747,12 +4747,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node;
- /* - * Flush percpu vmstats and vmevents to guarantee the value correctness - * on parent's and all ancestor levels. - */ - memcg_flush_percpu_vmstats(memcg, false); - memcg_flush_percpu_vmevents(memcg); for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu); @@ -4763,6 +4757,12 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) static void mem_cgroup_free(struct mem_cgroup *memcg) { memcg_wb_domain_exit(memcg); + /* + * Flush percpu vmstats and vmevents to guarantee the value correctness + * on parent's and all ancestor levels. + */ + memcg_flush_percpu_vmstats(memcg, false); + memcg_flush_percpu_vmevents(memcg); __mem_cgroup_free(memcg); }
From: Roman Gushchin guro@fb.com
mainline inclusion from mainline-v5.5-rc7 commit 4a87e2a25dc27131c3cce5e94421622193305638 category: bugfix bugzilla: 34611 CVE: NA
-------------------------------------------------
Currently slab percpu vmstats are flushed twice: during the memcg offlining and just before freeing the memcg structure. Each time the percpu counters are summed, added to the atomic counterparts and propagated up the cgroup tree.
The second flushing is required by how recursive vmstats are implemented: counters are batched in percpu variables at the local level, and once a percpu value crosses a predefined threshold, it spills over to the atomic values at the local level and at each ancestor level. It means that without flushing, some numbers cached in percpu variables would be dropped on the floor each time a cgroup is destroyed, and over time the error at the upper levels might become noticeable.
The first flushing aims to make counters at the ancestor levels more precise. Dying cgroups may remain in the dying state for a long time. After the kmem_cache reparenting performed during offlining, the slab counters of the dying cgroup have no chance to be updated, because any slab operations are performed at the parent level. It means that the inaccuracy caused by percpu batching will not decrease up to the final destruction of the cgroup. The original idea was that flushing slab counters during offlining should minimize the visible inaccuracy of slab counters at the parent level.
The problem is that the percpu counters are not zeroed after the first flushing, so every cached percpu value is summed twice. This creates a small error (up to 32 pages per cpu, but usually less) which accumulates at the parent cgroup level. After creating and destroying thousands of child cgroups, the slab counter at the parent level can be way off the real value.
For now, let's just stop flushing slab counters on memcg offlining. It can't be done correctly without scheduling a work item on each cpu: reading and zeroing the counters during css offlining can race with an asynchronous update, which doesn't expect the values to change underneath it.
With this change, slab counters at the parent level become eventually consistent. Once all dying children are gone, the values are correct; until then, the error is capped at 32 * NR_CPUS pages per dying cgroup.
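For a rough sense of scale (assuming the usual MEMCG_CHARGE_BATCH of 32 and 4 KiB pages): on a 96-cpu machine the residual error per dying cgroup is bounded by 32 * 96 = 3072 pages, i.e. about 12 MiB, and unlike before it no longer grows as further children are created and destroyed.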
It's not perfect: because slabs are reparented, any updates after the reparenting happen at the parent level. It means that if a slab page was allocated and a counter at the child level was bumped, and then the page was reparented and freed, the positive and negative counter values will not cancel out until the child cgroup is released. This makes slab counters different from the others, and it might push us to implement flushing in a correct form again. But it's also a question of performance: scheduling a work item on each cpu isn't free, and it's an open question whether the benefit of having more accurate counters is worth it.
We might also consider flushing all counters on offlining, not only slab counters.
So let's fix the main problem now: make the slab counters eventually consistent, so at least the error won't grow with uptime (or, more precisely, with the number of created and destroyed cgroups), and think about the accuracy of the counters separately.
Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining") Signed-off-by: Roman Gushchin guro@fb.com Acked-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Michal Hocko mhocko@suse.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/mmzone.h | 5 ++--- mm/memcontrol.c | 37 +++++++++---------------------------- 2 files changed, 11 insertions(+), 31 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c9a826979e8b..8c546c5b0f21 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -158,9 +158,8 @@ enum node_stat_item { NR_INACTIVE_FILE, /* " " " " " */ NR_ACTIVE_FILE, /* " " " " " */ NR_UNEVICTABLE, /* " " " " " */ - NR_SLAB_RECLAIMABLE, /* Please do not reorder this item */ - NR_SLAB_UNRECLAIMABLE, /* and this one without looking at - * memcg_flush_percpu_vmstats() first. */ + NR_SLAB_RECLAIMABLE, + NR_SLAB_UNRECLAIMABLE, NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_REFAULT, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f74e0a904e97..6e5c798c7c77 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3261,49 +3261,34 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, } }
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only) +static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) { - unsigned long stat[MEMCG_NR_STAT]; + unsigned long stat[MEMCG_NR_STAT] = {0}; struct mem_cgroup *mi; int node, cpu, i; - int min_idx, max_idx; - - if (slab_only) { - min_idx = NR_SLAB_RECLAIMABLE; - max_idx = NR_SLAB_UNRECLAIMABLE; - } else { - min_idx = 0; - max_idx = MEMCG_NR_STAT; - } - - for (i = min_idx; i < max_idx; i++) - stat[i] = 0;
for_each_online_cpu(cpu) - for (i = min_idx; i < max_idx; i++) + for (i = 0; i < MEMCG_NR_STAT; i++) stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - for (i = min_idx; i < max_idx; i++) + for (i = 0; i < MEMCG_NR_STAT; i++) atomic_long_add(stat[i], &mi->vmstats[i]);
- if (!slab_only) - max_idx = NR_VM_NODE_STAT_ITEMS; - for_each_node(node) { struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; struct mem_cgroup_per_node *pi;
- for (i = min_idx; i < max_idx; i++) + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) stat[i] = 0;
for_each_online_cpu(cpu) - for (i = min_idx; i < max_idx; i++) + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) stat[i] += per_cpu( pn->lruvec_stat_cpu->count[i], cpu);
for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) - for (i = min_idx; i < max_idx; i++) + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) atomic_long_add(stat[i], &pi->lruvec_stat[i]); } } @@ -3377,13 +3362,9 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) parent = root_mem_cgroup;
/* - * Deactivate and reparent kmem_caches. Then flush percpu - * slab statistics to have precise values at the parent and - * all ancestor levels. It's required to keep slab stats - * accurate after the reparenting of kmem_caches. + * Deactivate and reparent kmem_caches. */ memcg_deactivate_kmem_caches(memcg, parent); - memcg_flush_percpu_vmstats(memcg, true);
kmemcg_id = memcg->kmemcg_id; BUG_ON(kmemcg_id < 0); @@ -4761,7 +4742,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) * Flush percpu vmstats and vmevents to guarantee the value correctness * on parent's and all ancestor levels. */ - memcg_flush_percpu_vmstats(memcg, false); + memcg_flush_percpu_vmstats(memcg); memcg_flush_percpu_vmevents(memcg); __mem_cgroup_free(memcg); }