Baolin Wang (1):
  mm: memcg: fix split queue list crash when large folio migration

Nhat Pham (3):
  memcontrol: add helpers for hugetlb memcg accounting
  memcontrol: only transfer the memcg data for migration
  hugetlb: memcg: account hugetlb-backed memory in memory controller

 Documentation/admin-guide/cgroup-v2.rst |  29 +++++
 include/linux/cgroup-defs.h             |   5 +
 include/linux/memcontrol.h              |  37 ++++++
 kernel/cgroup/cgroup.c                  |  15 ++-
 mm/filemap.c                            |   2 +-
 mm/huge_memory.c                        |   2 +-
 mm/hugetlb.c                            |  36 ++++--
 mm/memcontrol.c                         | 152 +++++++++++++++++++++---
 mm/migrate.c                            |   3 +-
 9 files changed, 252 insertions(+), 29 deletions(-)
From: Nhat Pham <nphamcs@gmail.com>
mainline inclusion
from mainline-v6.7-rc1
commit 4b569387c0d566db288e7c3e1b484b43df797bdb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I91HYH
----------------------------------------------------------------------
Patch series "hugetlb memcg accounting", v4.
Currently, hugetlb memory usage is not accounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our use cases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic pages, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab, etc., fair.
What we have had to resort to is constantly polling hugetlb usage and readjusting memory.max. A similar procedure is applied to other memory limits (memory.low, for example). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in correcting the memory limits (for example, when hugetlb usage changes between consecutive runs of the userspace agent), the system could be left over- or under-protected.
This patch series rectifies this issue by charging the memcg when the hugetlb folio is allocated, and uncharging when the folio is freed. In addition, a new selftest is added to demonstrate and verify this new behavior.
This patch (of 4):
This patch exposes charge committing and cancelling as parts of the memory controller interface. These functionalities are useful when the try_charge() and commit_charge() stages have to be separated by other actions in between, which can themselves fail. One such example is the new hugetlb accounting behavior in the following patch.
The patch also adds a helper function to obtain a reference to the current task's memcg.
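To make the intended calling convention concrete, here is a minimal, illustrative sketch of how the exposed pieces are meant to fit together. It is not code from this series: try_to_charge_somehow() stands in for a try_charge()-style call (such as the hugetlb helper added later in the series, which is expected to reject a NULL memcg), and alloc_backing_folio() stands in for whatever fallible step sits between the try and the commit.

static int charge_commit_example(long nr_pages, gfp_t gfp)
{
	struct mem_cgroup *memcg;
	struct folio *folio;
	int err;

	/* Take a reference on the current task's memcg. */
	memcg = get_mem_cgroup_from_current();

	/* Reserve the pages against the memcg's limits (hypothetical helper). */
	err = try_to_charge_somehow(memcg, gfp, nr_pages);
	if (err)
		goto out_put;

	/* Fallible work between the try and the commit (hypothetical helper). */
	folio = alloc_backing_folio(nr_pages);
	if (!folio) {
		/* Undo the reservation; nothing was committed. */
		mem_cgroup_cancel_charge(memcg, nr_pages);
		err = -ENOMEM;
		goto out_put;
	}

	/* Bind the charge to the folio and update memcg statistics. */
	mem_cgroup_commit_charge(folio, memcg);
	err = 0;
out_put:
	mem_cgroup_put(memcg);
	return err;
}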
Link: https://lkml.kernel.org/r/20231006184629.155543-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231006184629.155543-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/memcontrol.h | 21 ++++++++++++++
 mm/memcontrol.c            | 59 ++++++++++++++++++++++++++++++--------
 2 files changed, 68 insertions(+), 12 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 60cf2cd70e29..d8306e6c2034 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -781,6 +781,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target, page_counter_read(&memcg->memory); }
+void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg); + int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
/** @@ -832,6 +834,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list) __mem_cgroup_uncharge_list(page_list); }
+void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages); + void mem_cgroup_migrate(struct folio *old, struct folio *new);
/** @@ -888,6 +892,8 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+struct mem_cgroup *get_mem_cgroup_from_current(void); + struct lruvec *folio_lruvec_lock(struct folio *folio); struct lruvec *folio_lruvec_lock_irq(struct folio *folio); struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, @@ -1388,6 +1394,11 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target, return false; }
+static inline void mem_cgroup_commit_charge(struct folio *folio, + struct mem_cgroup *memcg) +{ +} + static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) { @@ -1412,6 +1423,11 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list) { }
+static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, + unsigned int nr_pages) +{ +} + static inline void mem_cgroup_migrate(struct folio *old, struct folio *new) { } @@ -1449,6 +1465,11 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) return NULL; }
+static inline struct mem_cgroup *get_mem_cgroup_from_current(void) +{ + return NULL; +} + static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 04aaf2bae49c..56dcedb5d802 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1116,6 +1116,27 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) } EXPORT_SYMBOL(get_mem_cgroup_from_mm);
+/** + * get_mem_cgroup_from_current - Obtain a reference on current task's memcg. + */ +struct mem_cgroup *get_mem_cgroup_from_current(void) +{ + struct mem_cgroup *memcg; + + if (mem_cgroup_disabled()) + return NULL; + +again: + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + if (!css_tryget(&memcg->css)) { + rcu_read_unlock(); + goto again; + } + rcu_read_unlock(); + return memcg; +} + /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -2943,7 +2964,12 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, return try_charge_memcg(memcg, gfp_mask, nr_pages); }
-static inline void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) +/** + * mem_cgroup_cancel_charge() - cancel an uncommitted try_charge() call. + * @memcg: memcg previously charged. + * @nr_pages: number of pages previously charged. + */ +void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) { if (mem_cgroup_is_root(memcg)) return; @@ -2968,6 +2994,22 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) folio->memcg_data = (unsigned long)memcg; }
+/** + * mem_cgroup_commit_charge - commit a previously successful try_charge(). + * @folio: folio to commit the charge to. + * @memcg: memcg previously charged. + */ +void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg) +{ + css_get(&memcg->css); + commit_charge(folio, memcg); + + local_irq_disable(); + mem_cgroup_charge_statistics(memcg, folio_nr_pages(folio)); + memcg_check_events(memcg, folio_nid(folio)); + local_irq_enable(); +} + #ifdef CONFIG_MEMCG_KMEM /* * The allocated objcg pointers array is not accounted directly. @@ -7258,7 +7300,7 @@ static void __mem_cgroup_clear_mc(void)
/* we must uncharge all the leftover precharges from mc.to */ if (mc.precharge) { - cancel_charge(mc.to, mc.precharge); + mem_cgroup_cancel_charge(mc.to, mc.precharge); mc.precharge = 0; } /* @@ -7266,7 +7308,7 @@ static void __mem_cgroup_clear_mc(void) * we must uncharge here. */ if (mc.moved_charge) { - cancel_charge(mc.from, mc.moved_charge); + mem_cgroup_cancel_charge(mc.from, mc.moved_charge); mc.moved_charge = 0; } /* we must fixup refcnts and charges */ @@ -8295,20 +8337,13 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root, static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, gfp_t gfp) { - long nr_pages = folio_nr_pages(folio); int ret;
- ret = try_charge(memcg, gfp, nr_pages); + ret = try_charge(memcg, gfp, folio_nr_pages(folio)); if (ret) goto out;
- css_get(&memcg->css); - commit_charge(folio, memcg); - - local_irq_disable(); - mem_cgroup_charge_statistics(memcg, nr_pages); - memcg_check_events(memcg, folio_nid(folio)); - local_irq_enable(); + mem_cgroup_commit_charge(folio, memcg); out: return ret; }
From: Nhat Pham <nphamcs@gmail.com>
mainline inclusion
from mainline-v6.7-rc1
commit 85ce2c517ade0d51b7ad95f2e88be9bbe294379a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I91HYH
----------------------------------------------------------------------
For most migration use cases, only transfer the memcg data from the old folio to the new folio, and clear the old folio's memcg data. No charging and uncharging will be done.
This shaves off some work on the migration path, and avoids the temporary double charging of a folio during its migration.
The only exception is replace_page_cache_folio(), which will use the old mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that context, the isolation of the old page isn't quite as thorough as with migration, so we cannot use our new implementation directly.
This patch is the result of the following discussion on the new hugetlb memcg accounting behavior:
https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/memcontrol.h |  7 +++++++
 mm/filemap.c               |  2 +-
 mm/memcontrol.c            | 40 +++++++++++++++++++++++++++++++++++---
 3 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d8306e6c2034..862646481bbb 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -836,6 +836,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
+void mem_cgroup_replace_folio(struct folio *old, struct folio *new); + void mem_cgroup_migrate(struct folio *old, struct folio *new);
/** @@ -1428,6 +1430,11 @@ static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, { }
+static inline void mem_cgroup_replace_folio(struct folio *old, + struct folio *new) +{ +} + static inline void mem_cgroup_migrate(struct folio *old, struct folio *new) { } diff --git a/mm/filemap.c b/mm/filemap.c index 415ca59ad50e..3a23350e7125 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -820,7 +820,7 @@ void replace_page_cache_folio(struct folio *old, struct folio *new) new->mapping = mapping; new->index = offset;
- mem_cgroup_migrate(old, new); + mem_cgroup_replace_folio(old, new);
xas_lock_irq(&xas); xas_store(&xas, new); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 56dcedb5d802..9e5523a38dcc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -8556,16 +8556,17 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list) }
/** - * mem_cgroup_migrate - Charge a folio's replacement. + * mem_cgroup_replace_folio - Charge a folio's replacement. * @old: Currently circulating folio. * @new: Replacement folio. * * Charge @new as a replacement folio for @old. @old will - * be uncharged upon free. + * be uncharged upon free. This is only used by the page cache + * (in replace_page_cache_folio()). * * Both folios must be locked, @new->mapping must be set up. */ -void mem_cgroup_migrate(struct folio *old, struct folio *new) +void mem_cgroup_replace_folio(struct folio *old, struct folio *new) { struct mem_cgroup *memcg; long nr_pages = folio_nr_pages(new); @@ -8604,6 +8605,39 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) local_irq_restore(flags); }
+/** + * mem_cgroup_migrate - Transfer the memcg data from the old to the new folio. + * @old: Currently circulating folio. + * @new: Replacement folio. + * + * Transfer the memcg data from the old folio to the new folio for migration. + * The old folio's data info will be cleared. Note that the memory counters + * will remain unchanged throughout the process. + * + * Both folios must be locked, @new->mapping must be set up. + */ +void mem_cgroup_migrate(struct folio *old, struct folio *new) +{ + struct mem_cgroup *memcg; + + VM_BUG_ON_FOLIO(!folio_test_locked(old), old); + VM_BUG_ON_FOLIO(!folio_test_locked(new), new); + VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new); + VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new); + + if (mem_cgroup_disabled()) + return; + + memcg = folio_memcg(old); + VM_WARN_ON_ONCE_FOLIO(!memcg, old); + if (!memcg) + return; + + /* Transfer the charge and the css ref */ + commit_charge(new, memcg); + old->memcg_data = 0; +} + DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); EXPORT_SYMBOL(memcg_sockets_enabled_key);
From: Nhat Pham <nphamcs@gmail.com>
mainline inclusion
from mainline-v6.7-rc1
commit 85ce2c517ade0d51b7ad95f2e88be9bbe294379a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I91HYH
----------------------------------------------------------------------
Currently, hugetlb memory usage is not accounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our use cases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic pages, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab, etc., fair.
What we have had to resort to is constantly polling hugetlb usage and readjusting memory.max. A similar procedure is applied to other memory limits (memory.low, for example). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in correcting the memory limits (for example, when hugetlb usage changes between consecutive runs of the userspace agent), the system could be left over- or under-protected.
This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg.
Some caveats to consider:

* This feature is only available on cgroup v2 (a mount sketch follows
  this list).
* There is no hugetlb pool management involved in the memory
  controller. As stated above, hugetlb folios are only charged towards
  the memory controller when they are used. Host overcommit management
  has to consider this when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
  happen even if the hugetlb pool still has pages available (but the
  cgroup limit is hit and the reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
  reclaim protection. Tuning of the low and min limits must take
  hugetlb memory into account.
* Hugetlb pages utilized while this option is not selected will not be
  tracked by the memory controller (even if cgroup v2 is remounted
  later on).
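As a concrete illustration, here is a minimal sketch of enabling the new behavior from userspace by mounting cgroup v2 with the option this patch adds; the /sys/fs/cgroup mount point is just an example, any empty directory works.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Mount the unified hierarchy with hugetlb memcg accounting
	 * enabled, roughly equivalent to:
	 *   mount -t cgroup2 -o memory_hugetlb_accounting none /sys/fs/cgroup
	 */
	if (mount("none", "/sys/fs/cgroup", "cgroup2", 0,
		  "memory_hugetlb_accounting")) {
		perror("mount");
		return 1;
	}
	return 0;
}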
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
conflict:
	mm/hugetlb.c
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 29 +++++++++++++++++
 include/linux/cgroup-defs.h             |  5 +++
 include/linux/memcontrol.h              |  9 ++++++
 kernel/cgroup/cgroup.c                  | 15 ++++++++-
 mm/hugetlb.c                            | 36 ++++++++++++++++-----
 mm/memcontrol.c                         | 42 ++++++++++++++++++++++++-
 mm/migrate.c                            |  3 +-
 7 files changed, 127 insertions(+), 12 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index dd92ccba20c2..61b31f209a25 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options. relying on the original semantics (e.g. specifying bogusly high 'bypass' protection values at higher tree levels).
+ memory_hugetlb_accounting + Count HugeTLB memory usage towards the cgroup's overall + memory usage for the memory controller (for the purpose of + statistics reporting and memory protetion). This is a new + behavior that could regress existing setups, so it must be + explicitly opted in with this mount option. + + A few caveats to keep in mind: + + * There is no HugeTLB pool management involved in the memory + controller. The pre-allocated pool does not belong to anyone. + Specifically, when a new HugeTLB folio is allocated to + the pool, it is not accounted for from the perspective of the + memory controller. It is only charged to a cgroup when it is + actually used (for e.g at page fault time). Host memory + overcommit management has to consider this when configuring + hard limits. In general, HugeTLB pool management should be + done via other mechanisms (such as the HugeTLB controller). + * Failure to charge a HugeTLB folio to the memory controller + results in SIGBUS. This could happen even if the HugeTLB pool + still has pages available (but the cgroup limit is hit and + reclaim attempt fails). + * Charging HugeTLB memory towards the memory controller affects + memory protection and reclaim dynamics. Any userspace tuning + (of low, min limits for e.g) needs to take this into account. + * HugeTLB pages utilized while this option is not selected + will not be tracked by the memory controller (even if cgroup + v2 is remounted later on). +
Organizing Processes and Threads -------------------------------- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 81c1c2c7366f..6e3227a688de 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -116,6 +116,11 @@ enum { * Enable recursive subtree protection */ CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18), + + /* + * Enable hugetlb accounting for the memory controller. + */ + CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), };
/* cftype->flags */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 862646481bbb..e86ba420159d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -807,6 +807,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, return __mem_cgroup_charge(folio, mm, gfp); }
+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, + long nr_pages); + int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); @@ -1407,6 +1410,12 @@ static inline int mem_cgroup_charge(struct folio *folio, return 0; }
+static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, + gfp_t gfp, long nr_pages) +{ + return 0; +} + static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 2d92c0ea15c0..dd8eed3c6e31 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1906,6 +1906,7 @@ enum cgroup2_param { Opt_favordynmods, Opt_memory_localevents, Opt_memory_recursiveprot, + Opt_memory_hugetlb_accounting, nr__cgroup2_params };
@@ -1914,6 +1915,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = { fsparam_flag("favordynmods", Opt_favordynmods), fsparam_flag("memory_localevents", Opt_memory_localevents), fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), + fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting), {} };
@@ -1940,6 +1942,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_memory_recursiveprot: ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; return 0; + case Opt_memory_hugetlb_accounting: + ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; + return 0; } return -EINVAL; } @@ -1964,6 +1969,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; else cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; + + if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; } }
@@ -1977,6 +1987,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root seq_puts(seq, ",memory_localevents"); if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) seq_puts(seq, ",memory_recursiveprot"); + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) + seq_puts(seq, ",memory_hugetlb_accounting"); return 0; }
@@ -7163,7 +7175,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, "nsdelegate\n" "favordynmods\n" "memory_localevents\n" - "memory_recursiveprot\n"); + "memory_recursiveprot\n" + "memory_hugetlb_accounting\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ec33252edb8d..96696f5e268d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1947,7 +1947,7 @@ void free_huge_folio(struct folio *folio) pages_per_huge_page(h), folio); hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), pages_per_huge_page(h), folio); - + mem_cgroup_uncharge(folio); if (page_from_dynamic_pool(folio_page(folio, 0))) { list_del(&folio->lru); spin_unlock_irqrestore(&hugetlb_lock, flags); @@ -3144,11 +3144,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, struct hugetlbfs_inode_info *info = HUGETLBFS_I(file_inode(vma->vm_file)); struct hstate *h = hstate_vma(vma); struct folio *folio; - long map_chg, map_commit; + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); long gbl_chg; - int ret, idx; + int memcg_charge_ret, ret, idx; struct hugetlb_cgroup *h_cg = NULL; + struct mem_cgroup *memcg; bool deferred_reserve; + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; + + memcg = get_mem_cgroup_from_current(); + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); + if (memcg_charge_ret == -ENOMEM) { + mem_cgroup_put(memcg); + return ERR_PTR(-ENOMEM); + }
idx = hstate_index(h); /* @@ -3157,8 +3166,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, * code of zero indicates a reservation exists (no change). */ map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); - if (map_chg < 0) + if (map_chg < 0) { + if (!memcg_charge_ret) + mem_cgroup_cancel_charge(memcg, nr_pages); + mem_cgroup_put(memcg); return ERR_PTR(-ENOMEM); + }
/* * Processes that did not create the mapping will have no @@ -3169,10 +3182,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, */ if (map_chg || avoid_reserve) { gbl_chg = hugepage_subpool_get_pages(spool, 1, info); - if (gbl_chg < 0) { - vma_end_reservation(h, vma, addr); - return ERR_PTR(-ENOSPC); - } + if (gbl_chg < 0) + goto out_end_reservation;
/* * Even though there was no reservation in the region/reserve @@ -3268,6 +3279,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), pages_per_huge_page(h), folio); } + + if (!memcg_charge_ret) + mem_cgroup_commit_charge(folio, memcg); + mem_cgroup_put(memcg); + return folio;
out_uncharge_cgroup: @@ -3279,7 +3295,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1, info); +out_end_reservation: vma_end_reservation(h, vma, addr); + if (!memcg_charge_ret) + mem_cgroup_cancel_charge(memcg, nr_pages); + mem_cgroup_put(memcg); return ERR_PTR(-ENOSPC); }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9e5523a38dcc..333d17853bb3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -8360,6 +8360,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) return ret; }
+/** + * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio + * @memcg: memcg to charge. + * @gfp: reclaim mode. + * @nr_pages: number of pages to charge. + * + * This function is called when allocating a huge page folio to determine if + * the memcg has the capacity for it. It does not commit the charge yet, + * as the hugetlb folio itself has not been obtained from the hugetlb pool. + * + * Once we have obtained the hugetlb folio, we can call + * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the + * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect + * of try_charge(). + * + * Returns 0 on success. Otherwise, an error code is returned. + */ +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, + long nr_pages) +{ + /* + * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation, + * but do not attempt to commit charge later (or cancel on error) either. + */ + if (mem_cgroup_disabled() || !memcg || + !cgroup_subsys_on_dfl(memory_cgrp_subsys) || + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) + return -EOPNOTSUPP; + + if (try_charge(memcg, gfp, nr_pages)) + return -ENOMEM; + + return 0; +} + /** * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. * @folio: folio to charge. @@ -8585,7 +8620,12 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) return;
memcg = folio_memcg(old); - VM_WARN_ON_ONCE_FOLIO(!memcg, old); + /* + * Note that it is normal to see !memcg for a hugetlb folio. + * For e.g, itt could have been allocated when memory_hugetlb_accounting + * was not selected. + */ + VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old); if (!memcg) return;
diff --git a/mm/migrate.c b/mm/migrate.c index 322c63e6f9be..7dec4f14bfd1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -638,8 +638,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
folio_copy_owner(newfolio, folio);
- if (!folio_test_hugetlb(folio)) - mem_cgroup_migrate(folio, newfolio); + mem_cgroup_migrate(folio, newfolio); } EXPORT_SYMBOL(folio_migrate_flags);
From: Baolin Wang <baolin.wang@linux.alibaba.com>
mainline inclusion
from mainline-v6.7-rc1
commit 9bcef5973e31020e5aa8571eb994d67b77318356
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I91HYH
----------------------------------------------------------------------
When running autonuma with multi-size THP enabled, I encountered the following kernel crash:
[ 134.290216] list_del corruption. prev->next should be fffff9ad42e1c490, but was dead000000000100. (prev=fffff9ad42399890)
[ 134.290877] kernel BUG at lib/list_debug.c:62!
[ 134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted: G E 6.7.0-rc4+ #20
[ 134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0
......
[ 134.294252] Call Trace:
[ 134.294362]  <TASK>
[ 134.294440]  ? die+0x33/0x90
[ 134.294561]  ? do_trap+0xe0/0x110
......
[ 134.295681]  ? __list_del_entry_valid_or_report+0x97/0xb0
[ 134.295842]  folio_undo_large_rmappable+0x99/0x100
[ 134.296003]  destroy_large_folio+0x68/0x70
[ 134.296172]  migrate_folio_move+0x12e/0x260
[ 134.296264]  ? __pfx_remove_migration_pte+0x10/0x10
[ 134.296389]  migrate_pages_batch+0x495/0x6b0
[ 134.296523]  migrate_pages+0x1d0/0x500
[ 134.296646]  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
[ 134.296799]  migrate_misplaced_folio+0x12d/0x2b0
[ 134.296953]  do_numa_page+0x1f4/0x570
[ 134.297121]  __handle_mm_fault+0x2b0/0x6c0
[ 134.297254]  handle_mm_fault+0x107/0x270
[ 134.300897]  do_user_addr_fault+0x167/0x680
[ 134.304561]  exc_page_fault+0x65/0x140
[ 134.307919]  asm_exc_page_fault+0x22/0x30
The reason for the crash is that commit 85ce2c517ade ("memcontrol: only transfer the memcg data for migration") removed the charging and uncharging operations for migration folios and cleared the memcg data of the old folio.

During the subsequent release of the old large folio in destroy_large_folio(), if the large folio needs to be removed from the split queue, the wrong split queue can be obtained (namely pgdat->deferred_split_queue) because the old folio's memcg is now NULL. This leads to list operations being performed under the wrong split queue lock, resulting in the list crash shown above.

After the migration, the old folio is going to be freed anyway, so we can remove it from the split queue in mem_cgroup_migrate() a bit earlier, before clearing the memcg data, to avoid picking the wrong split queue.
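For reference, here is a simplified, illustrative model of the queue-selection logic that goes wrong, based on get_deferred_split_queue() in mm/huge_memory.c (config #ifdefs dropped): once the old folio's memcg_data has been cleared, folio_memcg() returns NULL, so the per-node queue is returned even though the folio was queued on its memcg's list, and its lock does not protect that list.

/* Simplified model of get_deferred_split_queue(); illustrative only. */
static struct deferred_split *split_queue_sketch(struct folio *folio)
{
	struct mem_cgroup *memcg = folio_memcg(folio);
	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));

	/*
	 * With memcg_data already cleared by mem_cgroup_migrate(),
	 * folio_memcg() is NULL and the pgdat queue is (wrongly)
	 * chosen for a folio sitting on its memcg's split queue.
	 */
	if (memcg)
		return &memcg->deferred_split_queue;
	return &pgdat->deferred_split_queue;
}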
[akpm@linux-foundation.org: fix comment, per Zi Yan]
Link: https://lkml.kernel.org/r/61273e5e9b490682388377c20f52d19de4a80460.170305455...
Fixes: 85ce2c517ade ("memcontrol: only transfer the memcg data for migration")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 mm/huge_memory.c |  2 +-
 mm/memcontrol.c  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 27fa3d3a08af..b4910247ddb8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2794,7 +2794,7 @@ void folio_undo_large_rmappable(struct folio *folio) spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; - list_del(&folio->_deferred_list); + list_del_init(&folio->_deferred_list); } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 333d17853bb3..5613bf6c1d7d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -8675,6 +8675,17 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
/* Transfer the charge and the css ref */ commit_charge(new, memcg); + /* + * If the old folio is a large folio and is in the split queue, it needs + * to be removed from the split queue now, in case getting an incorrect + * split queue in destroy_large_folio() after the memcg of the old folio + * is cleared. + * + * In addition, the old folio is about to be freed after migration, so + * removing from the split queue a bit earlier seems reasonable. + */ + if (folio_test_large(old) && folio_test_large_rmappable(old)) + folio_undo_large_rmappable(old); old->memcg_data = 0; }
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been successfully converted to a pull request!
Pull request link: https://gitee.com/openeuler/kernel/pulls/4573
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/Y...