Enhance memcg KSM feature.
Changelog since v1:
* update commit log
Jinjiang Tu (3):
  mm/ksm: fix ksm exec support for prctl
  mm/memcontrol: add ksm state for memcg
  mm/memcontrol: enable KSM for tasks moving to new memcg

Stefan Roesch (1):
  mm/ksm: support fork/exec for prctl
 fs/exec.c                      | 13 ++++++
 include/linux/ksm.h            | 13 ++++++
 include/linux/memcontrol.h     |  4 ++
 include/linux/sched/coredump.h |  6 ++-
 mm/memcontrol.c                | 73 +++++++++++++++++++++++++++++++++-
 5 files changed, 106 insertions(+), 3 deletions(-)
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing
list has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/7983
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/R...
From: Stefan Roesch <shr@devkernel.io>
mainline inclusion
from mainline-v6.7-rc1
commit 3c6f33b7273a7e2f2b2497b62c8400bd957b2fbe
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm/ksm: add fork-exec support for prctl", v4.
A process can enable KSM with the prctl system call. When the process is
forked, the child inherits the KSM flag. However, if the process calls exec
directly after the fork, the KSM setting is cleared. This patch series
addresses this problem.
1) Change the mask in coredump.h for execing a new process
2) Add a new test case in ksm_functional_tests
This patch (of 2):
Today we have two ways to enable KSM:
1) madvise system call

   This allows enabling KSM for a memory region for a long time.

2) prctl system call

   This is a recent addition to enable KSM for the complete process.
   In addition, when a process is forked, the KSM setting is inherited.
This change only affects the second case.
One of the use cases for (2) was to support the ability to enable KSM for
cgroups. This allows systemd to enable KSM for the seed process. By enabling
it in the seed process, all child processes inherit the setting.

This works correctly when the process is forked. However, it doesn't support
the fork/exec workflow.
From the previous cover letter:
....
Use case 3:
With the madvise call, sharing opportunities are only enabled for the current
process: it is a workload-local decision. A considerable number of sharing
opportunities may exist across multiple workloads or jobs (if they are part
of the same security domain). Only a higher-level entity like a job scheduler
or container can know for certain if it is running one or more instances of a
job. That job scheduler however doesn't have the necessary internal workload
knowledge to make targeted madvise calls.
....
In addition, it can be a bit surprising that fork keeps the KSM setting while
fork/exec does not.
Link: https://lkml.kernel.org/r/20230922211141.320789-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230922211141.320789-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process")
Reviewed-by: David Hildenbrand <david@redhat.com>
Reported-by: Carl Klemm <carl@uvos.xyz>
Tested-by: Carl Klemm <carl@uvos.xyz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/sched/coredump.h
[Context conflicts.]
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 include/linux/sched/coredump.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 6a4d85c7a5f3..103ca84e379c 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -70,13 +70,15 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
+#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
 #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS	27	/* mm is shared between processes */
-#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
+#define MMF_VM_MERGE_ANY	29
+#define MMF_VM_MERGE_ANY_MASK	(1 << MMF_VM_MERGE_ANY)

 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_VM_MERGE_ANY_MASK)

-#define MMF_VM_MERGE_ANY	29
 #endif /* _LINUX_SCHED_COREDUMP_H */
mainline inclusion
from mainline
commit 3a9e567ca45fb5280065283d10d9a11f0db61d2b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm/ksm: fix ksm exec support for prctl", v4.
Commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits the
MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create
the mm_slot, so ksmd will not try to scan this task. The first patch fixes
the issue.

The second patch refactors the code to prepare for the third patch. The
third patch extends the KSM selftests to verify that deduplication really
happens after fork/exec inherits the KSM setting.
This patch (of 3):
Commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits the
MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create
the mm_slot, so ksmd will not try to scan this task.
To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init() when the mm has MMF_VM_MERGE_ANY flag.
Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com
Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	fs/exec.c
[Context conflicts, and use __GENKSYMS__ to avoid kabi breakage warning.]
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 fs/exec.c           | 13 +++++++++++++
 include/linux/ksm.h | 13 +++++++++++++
 2 files changed, 26 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 792d62632e92..43378e25abcb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,9 @@
 #include <linux/compat.h>
 #include <linux/vmalloc.h>
 #include <linux/io_uring.h>
+#ifndef __GENKSYMS__
+#include <linux/ksm.h>
+#endif

 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -252,6 +255,14 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		goto err_free;
 	}

+	/*
+	 * Need to be called with mmap write lock
+	 * held, to avoid race with ksmd.
+	 */
+	err = ksm_execve(mm);
+	if (err)
+		goto err_ksm;
+
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -273,6 +284,8 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	bprm->p = vma->vm_end - sizeof(void *);
 	return 0;
 err:
+	ksm_exit(mm);
+err_ksm:
 	mmap_write_unlock(mm);
 err_free:
 	bprm->vma = NULL;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 4e02e8a770a9..debef5446114 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -45,6 +45,14 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 	return 0;
 }

+static inline int ksm_execve(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
+		return __ksm_enter(mm);
+
+	return 0;
+}
+
 static inline void ksm_exit(struct mm_struct *mm)
 {
 	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
@@ -83,6 +91,11 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 	return 0;
 }

+static inline int ksm_execve(struct mm_struct *mm)
+{
+	return 0;
+}
+
 static inline void ksm_exit(struct mm_struct *mm)
 {
 }
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87
----------------------------------------
Add a KSM state for memcg; the valid values are 0 and 1.

When changing auto_ksm_enabled from 0 to 1, enable KSM for the tasks in the
memcg. When changing auto_ksm_enabled from 1 to 0, disable KSM for the tasks
in the memcg. If enabling/disabling fails, return the error code and leave
auto_ksm_enabled unchanged. If the auto_ksm_enabled state of a child memcg
differs, also enable/disable KSM for the tasks in that child memcg. If
enabling/disabling fails for a child memcg, stop traversing the child memcgs
and return the error code.

When the written value equals the current auto_ksm_enabled of the memcg,
i.e. 0 to 0 or 1 to 1, do nothing.
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 include/linux/memcontrol.h |  4 ++++
 mm/memcontrol.c            | 30 +++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 287c54141a90..ef3a6a8e640f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -414,7 +414,11 @@ struct mem_cgroup {
 #else
 	KABI_RESERVE(7)
 #endif
+#ifdef CONFIG_KSM
+	KABI_USE(8, bool auto_ksm_enabled)
+#else
 	KABI_RESERVE(8)
+#endif

 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db44ade93455..52248cfa9140 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5772,7 +5772,7 @@ static ssize_t memcg_high_async_ratio_write(struct kernfs_open_file *of,
 }

 #ifdef CONFIG_KSM
-static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
+static int __memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
 {
 	struct task_struct *task;
 	struct mm_struct *mm;
@@ -5806,6 +5806,27 @@ static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
 	return ret;
 }

+static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
+{
+	struct mem_cgroup *iter;
+	int ret = 0;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (READ_ONCE(iter->auto_ksm_enabled) == enable)
+			continue;
+
+		ret = __memcg_set_ksm_for_tasks(iter, enable);
+		if (ret) {
+			mem_cgroup_iter_break(memcg, iter);
+			break;
+		}
+
+		WRITE_ONCE(iter->auto_ksm_enabled, enable);
+	}
+
+	return ret;
+}
+
 static int memory_ksm_show(struct seq_file *m, void *v)
 {
 	unsigned long ksm_merging_pages = 0;
@@ -5833,6 +5854,7 @@ static int memory_ksm_show(struct seq_file *m, void *v)
 	}
 	css_task_iter_end(&it);

+	seq_printf(m, "auto ksm enabled: %d\n", READ_ONCE(memcg->auto_ksm_enabled));
 	seq_printf(m, "merge any tasks: %u\n", tasks);
 	seq_printf(m, "ksm_rmap_items %lu\n", ksm_rmap_items);
 	seq_printf(m, "ksm_merging_pages %lu\n", ksm_merging_pages);
@@ -5855,6 +5877,9 @@ static ssize_t memory_ksm_write(struct kernfs_open_file *of, char *buf,
 	if (err)
 		return err;

+	if (READ_ONCE(memcg->auto_ksm_enabled) == enable)
+		return nbytes;
+
 	err = memcg_set_ksm_for_tasks(memcg, enable);
 	if (err)
 		return err;
@@ -6430,6 +6455,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	}

 	hugetlb_pool_inherit(memcg, parent);
+#ifdef CONFIG_KSM
+	memcg->auto_ksm_enabled = READ_ONCE(parent->auto_ksm_enabled);
+#endif

 	error = memcg_online_kmem(memcg);
 	if (error)
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87
----------------------------------------
When a task moves to a new memcg, enable KSM for the task if the auto_ksm_enabled of the memcg is 1.
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/memcontrol.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 52248cfa9140..9007c3554771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5886,6 +5886,39 @@ static ssize_t memory_ksm_write(struct kernfs_open_file *of, char *buf,

 	return nbytes;
 }
+
+static void memcg_attach_ksm(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+	struct task_struct *task;
+
+	cgroup_taskset_first(tset, &css);
+	memcg = mem_cgroup_from_css(css);
+	if (!READ_ONCE(memcg->auto_ksm_enabled))
+		return;
+
+	cgroup_taskset_for_each(task, css, tset) {
+		struct mm_struct *mm = get_task_mm(task);
+
+		if (!mm)
+			continue;
+
+		if (mmap_write_lock_killable(mm)) {
+			mmput(mm);
+			continue;
+		}
+
+		ksm_enable_merge_any(mm);
+
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+}
+#else
+static inline void memcg_attach_ksm(struct cgroup_taskset *tset)
+{
+}
 #endif /* CONFIG_KSM */

 #ifdef CONFIG_CGROUP_V1_WRITEBACK
@@ -7373,6 +7406,12 @@ static void mem_cgroup_move_charge(void)
 	atomic_dec(&mc.from->moving_account);
 }

+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		memcg_attach_ksm(tset);
+}
+
 static void mem_cgroup_move_task(void)
 {
 	if (mc.to) {
@@ -7388,6 +7427,9 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
 {
 }
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
 static void mem_cgroup_move_task(void)
 {
 }
@@ -7651,6 +7693,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
+	.attach = mem_cgroup_attach,
 	.post_attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
 	.dfl_cftypes = memory_files,