This patchset improves the performance of accounted kernel memory allocations by ~30% as measured by a micro-benchmark [1]. The benchmark is very straightforward: 1M of 64 bytes-large kmalloc() allocations.
Below are results with the disabled kernel memory accounting, the original state and with this patchset applied.
| | Kmem disabled | Original | Patched | Delta | |-------------+---------------+----------+---------+--------| | User cgroup | 29764 | 84548 | 59078 | -30.0% | | Root cgroup | 29742 | 48342 | 31501 | -34.8% |
As we can see, the patchset removes the majority of the overhead when there is no actual accounting (a task belongs to the root memory cgroup) and almost halves the accounting overhead otherwise.
The main idea is to get rid of unnecessary memcg to objcg conversions and switch to a scope-based protection of objcgs, which eliminates extra operations with objcg reference counters under a rcu read lock. More details are provided in individual commit descriptions.
Roman Gushchin (7): mm: kmem: optimize get_obj_cgroup_from_current() mm: kmem: add direct objcg pointer to task_struct mm: kmem: make memcg keep a reference to the original objcg mm: kmem: scoped objcg protection percpu: scoped objcg protection mm: kmem: reimplement get_obj_cgroup_from_current() mm: kmem: properly initialize local objcg variable in current_obj_cgroup()
include/linux/memcontrol.h | 28 +++++- include/linux/sched.h | 4 + include/linux/sched/mm.h | 4 + mm/memcontrol.c | 187 +++++++++++++++++++++++++++++++------ mm/percpu.c | 8 +- mm/slab.h | 15 +-- 6 files changed, 204 insertions(+), 42 deletions(-)
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/4405 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/T...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/4405 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/T...
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc1 commit 7d0715d0d6b28a831b6fdfefb29c5a7a4929fa49 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: improve performance of accounted kernel memory allocations", v5.
This patchset improves the performance of accounted kernel memory allocations by ~30% as measured by a micro-benchmark [1]. The benchmark is very straightforward: 1M of 64 bytes-large kmalloc() allocations.
Below are results with the disabled kernel memory accounting, the original state and with this patchset applied.
| | Kmem disabled | Original | Patched | Delta | |-------------+---------------+----------+---------+--------| | User cgroup | 29764 | 84548 | 59078 | -30.0% | | Root cgroup | 29742 | 48342 | 31501 | -34.8% |
As we can see, the patchset removes the majority of the overhead when there is no actual accounting (a task belongs to the root memory cgroup) and almost halves the accounting overhead otherwise.
The main idea is to get rid of unnecessary memcg to objcg conversions and switch to a scope-based protection of objcgs, which eliminates extra operations with objcg reference counters under a rcu read lock. More details are provided in individual commit descriptions.
This patch (of 5):
Manually inline memcg_kmem_bypass() and active_memcg() to speed up get_obj_cgroup_from_current() by avoiding duplicate in_task() checks and active_memcg() readings.
Also add a likely() macro to __get_obj_cgroup_from_memcg(): obj_cgroup_tryget() should succeed at almost all times except a very unlikely race with the memcg deletion path.
Link: https://lkml.kernel.org/r/20231019225346.1822282-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20231019225346.1822282-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Tested-by: Naresh Kamboju naresh.kamboju@linaro.org Acked-by: Shakeel Butt shakeelb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memcontrol.c | 34 ++++++++++++++-------------------- 1 file changed, 14 insertions(+), 20 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c96525697ba4..4f3361222ac3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1097,19 +1097,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) } EXPORT_SYMBOL(get_mem_cgroup_from_mm);
-static __always_inline bool memcg_kmem_bypass(void) -{ - /* Allow remote memcg charging from any context. */ - if (unlikely(active_memcg())) - return false; - - /* Memcg to charge can't be determined. */ - if (!in_task() || !current->mm || (current->flags & PF_KTHREAD)) - return true; - - return false; -} - /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -3112,7 +3099,7 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { objcg = rcu_dereference(memcg->objcg); - if (objcg && obj_cgroup_tryget(objcg)) + if (likely(objcg && obj_cgroup_tryget(objcg))) break; objcg = NULL; } @@ -3121,16 +3108,23 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) { - struct obj_cgroup *objcg = NULL; struct mem_cgroup *memcg; + struct obj_cgroup *objcg;
- if (memcg_kmem_bypass()) - return NULL; + if (in_task()) { + memcg = current->active_memcg; + + /* Memcg to charge can't be determined. */ + if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD))) + return NULL; + } else { + memcg = this_cpu_read(int_active_memcg); + if (likely(!memcg)) + return NULL; + }
rcu_read_lock(); - if (unlikely(active_memcg())) - memcg = active_memcg(); - else + if (!memcg) memcg = mem_cgroup_from_task(current); objcg = __get_obj_cgroup_from_memcg(memcg); rcu_read_unlock();
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc1 commit 1aacbd354313f25c855e662e41c04e2abf71444a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
To charge a freshly allocated kernel object to a memory cgroup, the kernel needs to obtain an objcg pointer. Currently it does it indirectly by obtaining the memcg pointer first and then calling to __get_obj_cgroup_from_memcg().
Usually tasks spend their entire life belonging to the same object cgroup. So it makes sense to save the objcg pointer on task_struct directly, so it can be obtained faster. It requires some work on fork, exit and cgroup migrate paths, but these paths are way colder.
To avoid any costly synchronization the following rules are applied: 1) A task sets it's objcg pointer itself.
2) If a task is being migrated to another cgroup, the least significant bit of the objcg pointer is set atomically.
3) On the allocation path the objcg pointer is obtained locklessly using the READ_ONCE() macro and the least significant bit is checked. If it's set, the following procedure is used to update it locklessly: - task->objcg is zeroed using cmpxcg - new objcg pointer is obtained - task->objcg is updated using try_cmpxchg - operation is repeated if try_cmpxcg fails It guarantees that no updates will be lost if task migration is racing against objcg pointer update. It also allows to keep both read and write paths fully lockless.
Because the task is keeping a reference to the objcg, it can't go away while the task is alive.
This commit doesn't change the way the remote memcg charging works.
Link: https://lkml.kernel.org/r/20231019225346.1822282-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Tested-by: Naresh Kamboju naresh.kamboju@linaro.org Acked-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Shakeel Butt shakeelb@google.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/sched.h | 4 ++ mm/memcontrol.c | 139 +++++++++++++++++++++++++++++++++++++++--- 2 files changed, 134 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 4f18d4505618..ce6741061708 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1454,6 +1454,10 @@ struct task_struct { struct mem_cgroup *active_memcg; #endif
+#ifdef CONFIG_MEMCG_KMEM + struct obj_cgroup *objcg; +#endif + #ifdef CONFIG_BLK_CGROUP struct gendisk *throttle_disk; #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4f3361222ac3..19a9bc9f2f11 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -275,6 +275,9 @@ struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr) return container_of(vmpr, struct mem_cgroup, vmpressure); }
+#define CURRENT_OBJCG_UPDATE_BIT 0 +#define CURRENT_OBJCG_UPDATE_FLAG (1UL << CURRENT_OBJCG_UPDATE_BIT) + #ifdef CONFIG_MEMCG_KMEM static DEFINE_SPINLOCK(objcg_lock);
@@ -3106,6 +3109,58 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) return objcg; }
+static struct obj_cgroup *current_objcg_update(void) +{ + struct mem_cgroup *memcg; + struct obj_cgroup *old, *objcg = NULL; + + do { + /* Atomically drop the update bit. */ + old = xchg(¤t->objcg, NULL); + if (old) { + old = (struct obj_cgroup *) + ((unsigned long)old & ~CURRENT_OBJCG_UPDATE_FLAG); + if (old) + obj_cgroup_put(old); + + old = NULL; + } + + /* If new objcg is NULL, no reason for the second atomic update. */ + if (!current->mm || (current->flags & PF_KTHREAD)) + return NULL; + + /* + * Release the objcg pointer from the previous iteration, + * if try_cmpxcg() below fails. + */ + if (unlikely(objcg)) { + obj_cgroup_put(objcg); + objcg = NULL; + } + + /* + * Obtain the new objcg pointer. The current task can be + * asynchronously moved to another memcg and the previous + * memcg can be offlined. So let's get the memcg pointer + * and try get a reference to objcg under a rcu read lock. + */ + + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + objcg = __get_obj_cgroup_from_memcg(memcg); + rcu_read_unlock(); + + /* + * Try set up a new objcg pointer atomically. If it + * fails, it means the update flag was set concurrently, so + * the whole procedure should be repeated. + */ + } while (!try_cmpxchg(¤t->objcg, &old, objcg)); + + return objcg; +} + __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) { struct mem_cgroup *memcg; @@ -3113,19 +3168,26 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
if (in_task()) { memcg = current->active_memcg; + if (unlikely(memcg)) + goto from_memcg;
- /* Memcg to charge can't be determined. */ - if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD))) - return NULL; + objcg = READ_ONCE(current->objcg); + if (unlikely((unsigned long)objcg & CURRENT_OBJCG_UPDATE_FLAG)) + objcg = current_objcg_update(); + + if (objcg) { + obj_cgroup_get(objcg); + return objcg; + } } else { memcg = this_cpu_read(int_active_memcg); - if (likely(!memcg)) - return NULL; + if (unlikely(memcg)) + goto from_memcg; } + return NULL;
+from_memcg: rcu_read_lock(); - if (!memcg) - memcg = mem_cgroup_from_task(current); objcg = __get_obj_cgroup_from_memcg(memcg); rcu_read_unlock(); return objcg; @@ -7447,6 +7509,7 @@ static void mem_cgroup_move_task(void) mem_cgroup_clear_mc(); } } + #else /* !CONFIG_MMU */ static int mem_cgroup_can_attach(struct cgroup_taskset *tset) { @@ -7460,8 +7523,39 @@ static void mem_cgroup_move_task(void) } #endif
+#ifdef CONFIG_MEMCG_KMEM +static void mem_cgroup_fork(struct task_struct *task) +{ + /* + * Set the update flag to cause task->objcg to be initialized lazily + * on the first allocation. It can be done without any synchronization + * because it's always performed on the current task, so does + * current_objcg_update(). + */ + task->objcg = (struct obj_cgroup *)CURRENT_OBJCG_UPDATE_FLAG; +} + +static void mem_cgroup_exit(struct task_struct *task) +{ + struct obj_cgroup *objcg = task->objcg; + + objcg = (struct obj_cgroup *) + ((unsigned long)objcg & ~CURRENT_OBJCG_UPDATE_FLAG); + if (objcg) + obj_cgroup_put(objcg); + + /* + * Some kernel allocations can happen after this point, + * but let's ignore them. It can be done without any synchronization + * because it's always performed on the current task, so does + * current_objcg_update(). + */ + task->objcg = NULL; +} +#endif + #ifdef CONFIG_LRU_GEN -static void mem_cgroup_attach(struct cgroup_taskset *tset) +static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) { struct task_struct *task; struct cgroup_subsys_state *css; @@ -7479,10 +7573,31 @@ static void mem_cgroup_attach(struct cgroup_taskset *tset) task_unlock(task); } #else +static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) {} +#endif /* CONFIG_LRU_GEN */ + +#ifdef CONFIG_MEMCG_KMEM +static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) +{ + struct task_struct *task; + struct cgroup_subsys_state *css; + + cgroup_taskset_for_each(task, css, tset) { + /* atomically set the update bit */ + set_bit(CURRENT_OBJCG_UPDATE_BIT, (unsigned long *)&task->objcg); + } +} +#else +static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {} +#endif /* CONFIG_MEMCG_KMEM */ + +#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM) static void mem_cgroup_attach(struct cgroup_taskset *tset) { + mem_cgroup_lru_gen_attach(tset); + mem_cgroup_kmem_attach(tset); } -#endif /* CONFIG_LRU_GEN */ +#endif
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { @@ -7951,9 +8066,15 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_reset = mem_cgroup_css_reset, .css_rstat_flush = mem_cgroup_css_rstat_flush, .can_attach = mem_cgroup_can_attach, +#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM) .attach = mem_cgroup_attach, +#endif .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, +#ifdef CONFIG_MEMCG_KMEM + .fork = mem_cgroup_fork, + .exit = mem_cgroup_exit, +#endif .dfl_cftypes = memory_files, .legacy_cftypes = mem_cgroup_legacy_files, .early_init = 0,
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc1 commit 675d6c9b59e313ca2573c93e8fd87011a99bb8ce category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Keep a reference to the original objcg object for the entire life of a memcg structure.
This allows to simplify the synchronization on the kernel memory allocation paths: pinning a (live) memcg will also pin the corresponding objcg.
The memory overhead of this change is minimal because object cgroups usually outlive their corresponding memory cgroups even without this change, so it's only an additional pointer per memcg.
Link: https://lkml.kernel.org/r/20231019225346.1822282-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Tested-by: Naresh Kamboju naresh.kamboju@linaro.org Acked-by: Shakeel Butt shakeelb@google.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/memcontrol.h | 8 +++++++- mm/memcontrol.c | 5 +++++ 2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 649fbb5c1adc..370d02ebc3cc 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -317,7 +317,13 @@ struct mem_cgroup { #endif #ifdef CONFIG_MEMCG_KMEM int kmemcg_id; - struct obj_cgroup __rcu *objcg; + /* + * memcg->objcg is wiped out as a part of the objcg repaprenting + * process. memcg->orig_objcg preserves a pointer (and a reference) + * to the original objcg until the end of live of memcg. + */ + struct obj_cgroup __rcu *objcg; + struct obj_cgroup *orig_objcg; /* list of inherited objcgs, protected by objcg_lock */ struct list_head objcg_list; #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 19a9bc9f2f11..abeb80a192dc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3939,6 +3939,8 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
objcg->memcg = memcg; rcu_assign_pointer(memcg->objcg, objcg); + obj_cgroup_get(objcg); + memcg->orig_objcg = objcg;
static_branch_enable(&memcg_kmem_online_key);
@@ -6393,6 +6395,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node;
+ if (memcg->orig_objcg) + obj_cgroup_put(memcg->orig_objcg); + for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); kfree(memcg->vmstats);
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc1 commit e86828e5446d95676835679837d995dec188d2be category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Switch to a scope-based protection of the objcg pointer on slab/kmem allocation paths. Instead of using the get_() semantics in the pre-allocation hook and put the reference afterwards, let's rely on the fact that objcg is pinned by the scope.
It's possible because: 1) if the objcg is received from the current task struct, the task is keeping a reference to the objcg. 2) if the objcg is received from an active memcg (remote charging), the memcg is pinned by the scope and has a reference to the corresponding objcg.
Link: https://lkml.kernel.org/r/20231019225346.1822282-5-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Tested-by: Naresh Kamboju naresh.kamboju@linaro.org Acked-by: Shakeel Butt shakeelb@google.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/memcontrol.h | 9 ++++++++ include/linux/sched/mm.h | 4 ++++ mm/memcontrol.c | 47 ++++++++++++++++++++++++++++++++++++-- mm/slab.h | 15 ++++++------ 4 files changed, 66 insertions(+), 9 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 370d02ebc3cc..fc6b7fd78209 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1861,6 +1861,15 @@ bool mem_cgroup_kmem_disabled(void); int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order); void __memcg_kmem_uncharge_page(struct page *page, int order);
+/* + * The returned objcg pointer is safe to use without additional + * protection within a scope. The scope is defined either by + * the current task (similar to the "current" global variable) + * or by set_active_memcg() pair. + * Please, use obj_cgroup_get() to get a reference if the pointer + * needs to be used outside of the local scope. + */ +struct obj_cgroup *current_obj_cgroup(void); struct obj_cgroup *get_obj_cgroup_from_current(void); struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 8d89c8c4fac1..9a19f1b42f64 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -403,6 +403,10 @@ DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg); * __GFP_ACCOUNT allocations till the end of the scope will be charged to the * given memcg. * + * Please, make sure that caller has a reference to the passed memcg structure, + * so its lifetime is guaranteed to exceed the scope between two + * set_active_memcg() calls. + * * NOTE: This function can nest. Users must save the return value and * reset the previous value after their own charging scope is over. */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index abeb80a192dc..4b377611260f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3193,6 +3193,49 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) return objcg; }
+__always_inline struct obj_cgroup *current_obj_cgroup(void) +{ + struct mem_cgroup *memcg; + struct obj_cgroup *objcg; + + if (in_task()) { + memcg = current->active_memcg; + if (unlikely(memcg)) + goto from_memcg; + + objcg = READ_ONCE(current->objcg); + if (unlikely((unsigned long)objcg & CURRENT_OBJCG_UPDATE_FLAG)) + objcg = current_objcg_update(); + /* + * Objcg reference is kept by the task, so it's safe + * to use the objcg by the current task. + */ + return objcg; + } + + memcg = this_cpu_read(int_active_memcg); + if (unlikely(memcg)) + goto from_memcg; + + return NULL; + +from_memcg: + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { + /* + * Memcg pointer is protected by scope (see set_active_memcg()) + * and is pinning the corresponding objcg, so objcg can't go + * away and can be used within the scope without any additional + * protection. + */ + objcg = rcu_dereference_check(memcg->objcg, 1); + if (likely(objcg)) + break; + objcg = NULL; + } + + return objcg; +} + struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) { struct obj_cgroup *objcg; @@ -3287,15 +3330,15 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order) struct obj_cgroup *objcg; int ret = 0;
- objcg = get_obj_cgroup_from_current(); + objcg = current_obj_cgroup(); if (objcg) { ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order); if (!ret) { + obj_cgroup_get(objcg); page->memcg_data = (unsigned long)objcg | MEMCG_DATA_KMEM; return 0; } - obj_cgroup_put(objcg); } return ret; } diff --git a/mm/slab.h b/mm/slab.h index 799a315695c6..3d07fb428393 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -484,7 +484,12 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT)) return true;
- objcg = get_obj_cgroup_from_current(); + /* + * The obtained objcg pointer is safe to use within the current scope, + * defined by current task or set_active_memcg() pair. + * obj_cgroup_get() is used to get a permanent reference. + */ + objcg = current_obj_cgroup(); if (!objcg) return true;
@@ -497,17 +502,14 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, css_put(&memcg->css);
if (ret) - goto out; + return false; }
if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) - goto out; + return false;
*objcgp = objcg; return true; -out: - obj_cgroup_put(objcg); - return false; }
static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, @@ -542,7 +544,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, obj_cgroup_uncharge(objcg, obj_full_size(s)); } } - obj_cgroup_put(objcg); }
static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc1 commit c63b835d0eafc956c43b8c6605708240ac52b8cd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Similar to slab and kmem, switch to a scope-based protection of the objcg pointer to avoid.
Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Tested-by: Naresh Kamboju naresh.kamboju@linaro.org Acked-by: Shakeel Butt shakeelb@google.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/percpu.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c index a7665de8485f..f53ba692d67a 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1628,14 +1628,12 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, if (!memcg_kmem_online() || !(gfp & __GFP_ACCOUNT)) return true;
- objcg = get_obj_cgroup_from_current(); + objcg = current_obj_cgroup(); if (!objcg) return true;
- if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size))) { - obj_cgroup_put(objcg); + if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size))) return false; - }
*objcgp = objcg; return true; @@ -1649,6 +1647,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, return;
if (likely(chunk && chunk->obj_cgroups)) { + obj_cgroup_get(objcg); chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
rcu_read_lock(); @@ -1657,7 +1656,6 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, rcu_read_unlock(); } else { obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size)); - obj_cgroup_put(objcg); } }
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc1 commit e56808fef8f71a192b2740c0b6ea8be7ab865d54 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Reimplement get_obj_cgroup_from_current() using current_obj_cgroup(). get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the code, so the new implementation is almost trivial.
get_obj_cgroup_from_current() is a convenient function used by the bpf subsystem, so there is no reason to get rid of it completely.
Link: https://lkml.kernel.org/r/20231019225346.1822282-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Reviewed-by: Vlastimil Babka vbabka@suse.cz Acked-by: Shakeel Butt shakeelb@google.com Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Cc: Naresh Kamboju naresh.kamboju@linaro.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/memcontrol.h | 11 ++++++++++- mm/memcontrol.c | 32 -------------------------------- 2 files changed, 10 insertions(+), 33 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index fc6b7fd78209..53e66ca6fd5a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1870,9 +1870,18 @@ void __memcg_kmem_uncharge_page(struct page *page, int order); * needs to be used outside of the local scope. */ struct obj_cgroup *current_obj_cgroup(void); -struct obj_cgroup *get_obj_cgroup_from_current(void); struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio);
+static inline struct obj_cgroup *get_obj_cgroup_from_current(void) +{ + struct obj_cgroup *objcg = current_obj_cgroup(); + + if (objcg) + obj_cgroup_get(objcg); + + return objcg; +} + int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size); void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4b377611260f..ed55ef9f3c4b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3161,38 +3161,6 @@ static struct obj_cgroup *current_objcg_update(void) return objcg; }
-__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) -{ - struct mem_cgroup *memcg; - struct obj_cgroup *objcg; - - if (in_task()) { - memcg = current->active_memcg; - if (unlikely(memcg)) - goto from_memcg; - - objcg = READ_ONCE(current->objcg); - if (unlikely((unsigned long)objcg & CURRENT_OBJCG_UPDATE_FLAG)) - objcg = current_objcg_update(); - - if (objcg) { - obj_cgroup_get(objcg); - return objcg; - } - } else { - memcg = this_cpu_read(int_active_memcg); - if (unlikely(memcg)) - goto from_memcg; - } - return NULL; - -from_memcg: - rcu_read_lock(); - objcg = __get_obj_cgroup_from_memcg(memcg); - rcu_read_unlock(); - return objcg; -} - __always_inline struct obj_cgroup *current_obj_cgroup(void) { struct mem_cgroup *memcg;
From: Roman Gushchin roman.gushchin@linux.dev
mainline inclusion from mainline-v6.7-rc5 commit 5f79489a73d77419d18952e0258efbd5ecb74770 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8YU7J
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Erhard reported that the 6.7-rc1 kernel panics on boot if being built with clang-16. The problem was not reproducible with gcc.
[ 5.975049] general protection fault, probably for non-canonical address 0xf555515555555557: 0000 [#1] SMP KASAN PTI [ 5.976422] KASAN: maybe wild-memory-access in range [0xaaaaaaaaaaaaaab8-0xaaaaaaaaaaaaaabf] [ 5.977475] CPU: 3 PID: 1 Comm: systemd Not tainted 6.7.0-rc1-Zen3 #77 [ 5.977860] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 5.977860] RIP: 0010:obj_cgroup_charge_pages+0x27/0x2d5 [ 5.977860] Code: 90 90 90 55 41 57 41 56 41 55 41 54 53 89 d5 41 89 f6 49 89 ff 48 b8 00 00 00 00 00 fc ff df 49 83 c7 10 4d3 [ 5.977860] RSP: 0018:ffffc9000001fb18 EFLAGS: 00010a02 [ 5.977860] RAX: dffffc0000000000 RBX: aaaaaaaaaaaaaaaa RCX: ffff8883eb9a8b08 [ 5.977860] RDX: 0000000000000005 RSI: 0000000000400cc0 RDI: aaaaaaaaaaaaaaaa [ 5.977860] RBP: 0000000000000005 R08: 3333333333333333 R09: 0000000000000000 [ 5.977860] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8883eb9a8b18 [ 5.977860] R13: 1555555555555557 R14: 0000000000400cc0 R15: aaaaaaaaaaaaaaba [ 5.977860] FS: 00007f2976438b40(0000) GS:ffff8883eb980000(0000) knlGS:0000000000000000 [ 5.977860] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.977860] CR2: 00007f29769e0060 CR3: 0000000107222003 CR4: 0000000000370eb0 [ 5.977860] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5.977860] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5.977860] Call Trace: [ 5.977860] <TASK> [ 5.977860] ? __die_body+0x16/0x75 [ 5.977860] ? die_addr+0x4a/0x70 [ 5.977860] ? exc_general_protection+0x1c9/0x2d0 [ 5.977860] ? cgroup_mkdir+0x455/0x9fb [ 5.977860] ? __x64_sys_mkdir+0x69/0x80 [ 5.977860] ? asm_exc_general_protection+0x26/0x30 [ 5.977860] ? obj_cgroup_charge_pages+0x27/0x2d5 [ 5.977860] obj_cgroup_charge+0x114/0x1ab [ 5.977860] pcpu_alloc+0x1a6/0xa65 [ 5.977860] ? mem_cgroup_css_alloc+0x1eb/0x1140 [ 5.977860] ? cgroup_apply_control_enable+0x26b/0x7c0 [ 5.977860] mem_cgroup_css_alloc+0x23f/0x1140 [ 5.977860] cgroup_apply_control_enable+0x26b/0x7c0 [ 5.977860] ? cgroup_kn_set_ugid+0x2d/0x1a0 [ 5.977860] cgroup_mkdir+0x455/0x9fb [ 5.977860] ? __cfi_cgroup_mkdir+0x10/0x10 [ 5.977860] kernfs_iop_mkdir+0x130/0x170 [ 5.977860] vfs_mkdir+0x405/0x530 [ 5.977860] do_mkdirat+0x188/0x1f0 [ 5.977860] __x64_sys_mkdir+0x69/0x80 [ 5.977860] do_syscall_64+0x7d/0x100 [ 5.977860] ? do_syscall_64+0x89/0x100 [ 5.977860] ? do_syscall_64+0x89/0x100 [ 5.977860] ? do_syscall_64+0x89/0x100 [ 5.977860] ? do_syscall_64+0x89/0x100 [ 5.977860] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 5.977860] RIP: 0033:0x7f297671defb [ 5.977860] Code: 8b 05 39 7f 0d 00 bb ff ff ff ff 64 c7 00 16 00 00 00 e9 61 ff ff ff e8 23 0c 02 00 0f 1f 00 f3 0f 1e fa b88 [ 5.977860] RSP: 002b:00007ffee6242bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053 [ 5.977860] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f297671defb [ 5.977860] RDX: 0000000000000000 RSI: 00000000000001ed RDI: 000055c6b449f0e0 [ 5.977860] RBP: 00007ffee6242bf0 R08: 000000000000000e R09: 0000000000000000 [ 5.977860] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c6b445db80 [ 5.977860] R13: 00000000000003a0 R14: 00007f2976a68651 R15: 00000000000003a0 [ 5.977860] </TASK> [ 5.977860] Modules linked in: [ 6.014095] ---[ end trace 0000000000000000 ]--- [ 6.014701] RIP: 0010:obj_cgroup_charge_pages+0x27/0x2d5 [ 6.015348] Code: 90 90 90 55 41 57 41 56 41 55 41 54 53 89 d5 41 89 f6 49 89 ff 48 b8 00 00 00 00 00 fc ff df 49 83 c7 10 4d3 [ 6.017575] RSP: 0018:ffffc9000001fb18 EFLAGS: 00010a02 [ 6.018255] RAX: dffffc0000000000 RBX: aaaaaaaaaaaaaaaa RCX: ffff8883eb9a8b08 [ 6.019120] RDX: 0000000000000005 RSI: 0000000000400cc0 RDI: aaaaaaaaaaaaaaaa [ 6.019983] RBP: 0000000000000005 R08: 3333333333333333 R09: 0000000000000000 [ 6.020849] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8883eb9a8b18 [ 6.021747] R13: 1555555555555557 R14: 0000000000400cc0 R15: aaaaaaaaaaaaaaba [ 6.022609] FS: 00007f2976438b40(0000) GS:ffff8883eb980000(0000) knlGS:0000000000000000 [ 6.023593] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.024296] CR2: 00007f29769e0060 CR3: 0000000107222003 CR4: 0000000000370eb0 [ 6.025279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6.026139] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6.027000] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Actually the problem is caused by uninitialized local variable in current_obj_cgroup(). If the root memory cgroup is set as an active memory cgroup for a charging scope (as in the trace, where systemd tries to create the first non-root cgroup, so the parent cgroup is the root cgroup), the "for" loop is skipped and uninitialized objcg is returned, causing a panic down the accounting stack.
The fix is trivial: initialize the objcg variable to NULL unconditionally before the "for" loop.
[vbabka@suse.cz: remove redundant assignment] Link: https://lkml.kernel.org/r/4bd106d5-c3e3-6731-9a74-cff81e2392de@suse.cz Link: https://lkml.kernel.org/r/20231116025109.3775055-1-roman.gushchin@linux.dev Fixes: e86828e5446d ("mm: kmem: scoped objcg protection") Signed-off-by: Roman Gushchin (Cruise) roman.gushchin@linux.dev Signed-off-by: Vlastimil Babka vbabka@suse.cz Reported-by: Erhard Furtner erhard_f@mailbox.org Closes: https://github.com/ClangBuiltLinux/linux/issues/1959 Tested-by: Erhard Furtner erhard_f@mailbox.org Acked-by: Vlastimil Babka vbabka@suse.cz Acked-by: Shakeel Butt shakeelb@google.com Cc: David Rientjes rientjes@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ed55ef9f3c4b..b8fa0a360a99 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3188,6 +3188,7 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void) return NULL;
from_memcg: + objcg = NULL; for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { /* * Memcg pointer is protected by scope (see set_active_memcg()) @@ -3198,7 +3199,6 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void) objcg = rcu_dereference_check(memcg->objcg, 1); if (likely(objcg)) break; - objcg = NULL; }
return objcg;