This patch series adds swap control for the memory cgroup. Patch [2] adds a page type key to the memory.reclaim interface to support reclaiming anon pages only. Patch [4] adds the memory.force_swapin interface to proactively swap pages back in. Patch [5] adds the memory.swap.max interface to limit swap usage of a memory cgroup. Patches [6-7] add the memory.swapfile interface to restrict which swap devices a memory cgroup may use.
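As a rough usage sketch (not part of the patches; the cgroup name "test" and the mount point are illustrative and assume the v1 memory controller is mounted at /sys/fs/cgroup/memory):

  # enable the swap QoS feature (sysctl added in patch 3)
  echo 1 > /proc/sys/vm/memcg_swap_qos_enable
  cd /sys/fs/cgroup/memory/test
  # reclaim only anon pages (patch 2)
  echo "1G type=anon" > memory.reclaim
  # swap the cgroup's anon pages back in (patch 4)
  echo 0 > memory.force_swapin
  # limit swap usage and restrict the allowed swap device (patches 5-7)
  echo 512M > memory.swap.max
  echo /dev/zram0 > memory.swapfile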
Liu Shixin (6):
  memcg: add page type to memory.reclaim interface
  memcg: introduce memcg swap qos feature
  memcg: introduce per-memcg swapin interface
  memcg: add restrict to swap to cgroup1
  mm/swapfile: introduce per-memcg swapfile control
  mm: swap_slots: add per-type slot cache

Yosry Ahmed (1):
  mm: vmpressure: don't count proactive reclaim in vmpressure

 .../admin-guide/cgroup-v1/memory.rst    |   3 +
 Documentation/admin-guide/cgroup-v2.rst |  16 +-
 include/linux/memcontrol.h              |  31 ++
 include/linux/mm.h                      |   1 +
 include/linux/swap.h                    |  17 +-
 include/linux/swap_slots.h              |   2 +-
 mm/Kconfig                              |   9 +
 mm/madvise.c                            |  19 +
 mm/memcontrol.c                         | 425 +++++++++++++++++-
 mm/swap_slots.c                         | 151 ++++++-
 mm/swapfile.c                           | 102 ++++-
 mm/vmscan.c                             |  36 +-
 12 files changed, 767 insertions(+), 45 deletions(-)
From: Yosry Ahmed <yosryahmed@google.com>
mainline inclusion
from mainline-v6.0-rc1
commit 73b73bac90d97400e29e585c678c4d0ebfd2680d
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()).
For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature.
For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows:
1. Reducing send and receive buffers of the current socket.
2. May drop packets on the rx path.
3. May throttle current thread on the tx path.
Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure.
[1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn...
[yosryahmed@google.com: update documentation]
  Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
 include/linux/swap.h                    |  5 ++++-
 mm/memcontrol.c                         | 23 ++++++++++++---------
 mm/vmscan.c                             | 27 ++++++++++++++---------
 4 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5d9b7e552fb0..f120c53d8d16 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1210,6 +1210,13 @@ PAGE_SIZE multiple when read back. the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned.
+ Please note that the proactive reclaim (triggered by this + interface) is not meant to indicate memory pressure on the + memory cgroup. Therefore socket memory balancing triggered by + the memory reclaim normally is not exercised in this case. + This means that the networking layer will not adapt based on + reclaim induced by memory.reclaim. + memory.oom.group A read-write single value file which exists on non-root cgroups. The default value is "0". diff --git a/include/linux/swap.h b/include/linux/swap.h index 7f49964f27d2..fc96ee27eb42 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -376,10 +376,13 @@ extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode); + +#define MEMCG_RECLAIM_MAY_SWAP (1 << 1) +#define MEMCG_RECLAIM_PROACTIVE (1 << 2) extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - bool may_swap); + unsigned int reclaim_options); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 662c7859b7f1..90ffeb1ae6f4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2400,7 +2400,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, - gfp_mask, true); + gfp_mask, + MEMCG_RECLAIM_MAY_SWAP); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2663,7 +2664,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, enum oom_status oom_status; unsigned long nr_reclaimed; bool passed_oom = false; - bool may_swap = true; + unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP; bool drained = false; unsigned long pflags;
@@ -2682,7 +2683,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, mem_over_limit = mem_cgroup_from_counter(counter, memory); } else { mem_over_limit = mem_cgroup_from_counter(counter, memsw); - may_swap = false; + reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP; }
if (batch > nr_pages) { @@ -2718,7 +2719,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, may_swap); + gfp_mask, reclaim_options); psi_memstall_leave(&pflags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3368,8 +3369,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, continue; }
- if (!try_to_free_mem_cgroup_pages(memcg, 1, - GFP_KERNEL, !memsw)) { + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { ret = -EBUSY; break; } @@ -3486,7 +3487,7 @@ int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR;
progress = try_to_free_mem_cgroup_pages(memcg, 1, - GFP_KERNEL, true); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); if (!progress) { nr_retries--; /* maybe some writeback is necessary */ @@ -5228,7 +5229,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, }
reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, true); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
if (!reclaimed && !nr_retries--) break; @@ -5269,6 +5270,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; + unsigned int reclaim_options; int err;
buf = strstrip(buf); @@ -5280,6 +5282,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, mem_cgroup_is_root(memcg)) return -EINVAL;
+ reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { unsigned long reclaimed;
@@ -5295,7 +5298,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim - nr_reclaimed, - GFP_KERNEL, true); + GFP_KERNEL, reclaim_options);
if (!reclaimed && !nr_retries--) return -EAGAIN; @@ -6982,7 +6985,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, true)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) nr_reclaims--; continue; } diff --git a/mm/vmscan.c b/mm/vmscan.c index a8412c5d4eda..d0ea4e251096 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -103,6 +103,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1;
+ /* Proactive reclaim invoked by userspace through memory.reclaim */ + unsigned int proactive:1; + /* * Cgroup memory below memory.low is protected as long as we * don't threaten to OOM. If any cgroup is reclaimed at @@ -2880,9 +2883,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) sc->priority);
/* Record the group's reclaim efficiency */ - vmpressure(sc->gfp_mask, memcg, false, - sc->nr_scanned - scanned, - sc->nr_reclaimed - reclaimed); + if (!sc->proactive) + vmpressure(sc->gfp_mask, memcg, false, + sc->nr_scanned - scanned, + sc->nr_reclaimed - reclaimed);
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); } @@ -3005,9 +3009,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) }
/* Record the subtree's reclaim efficiency */ - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, - sc->nr_scanned - nr_scanned, - sc->nr_reclaimed - nr_reclaimed); + if (!sc->proactive) + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, + sc->nr_scanned - nr_scanned, + sc->nr_reclaimed - nr_reclaimed);
if (sc->nr_reclaimed - nr_reclaimed) reclaimable = true; @@ -3252,8 +3257,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
do { - vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, - sc->priority); + if (!sc->proactive) + vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, + sc->priority); sc->nr_scanned = 0; shrink_zones(zonelist, sc);
@@ -3562,7 +3568,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - bool may_swap) + unsigned int reclaim_options) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -3575,7 +3581,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .priority = DEF_PRIORITY, .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = may_swap, + .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), + .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
--------------------------------
Add an anon/file type key to the memory.reclaim interface so that only one type of pages is reclaimed. The LRU algorithm reclaims cold pages and balances between file and anon pages, but it does not take the speed of the backing device into account. For example, when a zram device is present, reclaiming anon pages may have less impact on performance. So extend the memory.reclaim interface to reclaim only one type of pages.

Usage:
  "echo <size> type=anon > memory.reclaim"
  "echo <size> type=file > memory.reclaim"

The previous format (without a type key) is still accepted; see the examples below.
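For clarity, the accepted forms are (a hedged sketch based on the documentation change below):

  # reclaim up to 1G, anon pages only
  echo "1G type=anon" > memory.reclaim
  # reclaim up to 1G, file pages only
  echo "1G type=file" > memory.reclaim
  # previous format, both anon and file pages are eligible
  echo "1G" > memory.reclaim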
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 +++---
 include/linux/swap.h                    |  1 +
 mm/memcontrol.c                         | 38 +++++++++++++++++++++++--
 mm/vmscan.c                             |  9 ++++++
 4 files changed, 51 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f120c53d8d16..394fb18c9e3d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1196,15 +1196,16 @@ PAGE_SIZE multiple when read back. target cgroup.
This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported.
Example::
echo "1G" > memory.reclaim
- The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). + This file also accepts nested keys, the number of bytes to reclaim + with the type of memory to reclaim. + + Example:: + echo "1G type=file" > memory.reclaim
Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the diff --git a/include/linux/swap.h b/include/linux/swap.h index fc96ee27eb42..65626521ae2b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -379,6 +379,7 @@ extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
#define MEMCG_RECLAIM_MAY_SWAP (1 << 1) #define MEMCG_RECLAIM_PROACTIVE (1 << 2) +#define MEMCG_RECLAIM_NOT_FILE (1 << 3) extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 90ffeb1ae6f4..612acf6466e3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5264,6 +5264,35 @@ static int memcg_events_local_show(struct seq_file *m, void *v) return 0; }
+static int reclaim_param_parse(char *buf, unsigned long *nr_pages, + unsigned int *reclaim_options) +{ + char *endp; + u64 bytes; + + if (!strcmp(buf, "")) { + *nr_pages = PAGE_COUNTER_MAX; + return 0; + } + + bytes = memparse(buf, &endp); + if (*endp == ' ') { + buf = endp + 1; + buf = strim(buf); + if (!strcmp(buf, "type=anon")) + *reclaim_options |= MEMCG_RECLAIM_NOT_FILE; + else if (!strcmp(buf, "type=file")) + *reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP; + else + return -EINVAL; + } else if (*endp != '\0') + return -EINVAL; + + *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX); + + return 0; +} + static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { @@ -5273,8 +5302,9 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, unsigned int reclaim_options; int err;
+ reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; buf = strstrip(buf); - err = page_counter_memparse(buf, "", &nr_to_reclaim); + err = reclaim_param_parse(buf, &nr_to_reclaim, &reclaim_options); if (err) return err;
@@ -5282,13 +5312,17 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, mem_cgroup_is_root(memcg)) return -EINVAL;
- reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { unsigned long reclaimed;
if (signal_pending(current)) return -EINTR;
+ /* If only reclaim swap pages, check swap space at first. */ + if ((reclaim_options & MEMCG_RECLAIM_NOT_FILE) && + (mem_cgroup_get_nr_swap_pages(memcg) <= 0)) + return -EAGAIN; + /* This is the final attempt, drain percpu lru caches in the * hope of introducing more evictable pages for * try_to_free_mem_cgroup_pages(). diff --git a/mm/vmscan.c b/mm/vmscan.c index d0ea4e251096..6416737f5206 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -103,6 +103,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1;
+ /* Should skip file pages? */ + unsigned int not_file:1; + /* Proactive reclaim invoked by userspace through memory.reclaim */ unsigned int proactive:1;
@@ -2464,6 +2467,11 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, unsigned long ap, fp; enum lru_list lru;
+ if (sc->not_file) { + scan_balance = SCAN_ANON; + goto out; + } + /* If we have no swap space, do not bother scanning anon pages. */ if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) { scan_balance = SCAN_FILE; @@ -3583,6 +3591,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_unmap = 1, .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), + .not_file = !!(reclaim_options & MEMCG_RECLAIM_NOT_FILE), }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
--------------------------------
Introduce the memcg swap QoS feature, which acts as the umbrella for the sub-features added by the following patches. Add CONFIG_MEMCG_SWAP_QOS and the static key memcg_swap_qos_key.
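A minimal sketch of how the feature is switched on at runtime, assuming the sysctl added below is reachable at the default procfs location:

  # off by default; enable the static key and reset per-memcg settings
  echo 1 > /proc/sys/vm/memcg_swap_qos_enable
  # disable it again
  echo 0 > /proc/sys/vm/memcg_swap_qos_enable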
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 include/linux/memcontrol.h |  4 +++
 mm/Kconfig                 |  9 +++++++
 mm/memcontrol.c            | 53 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 4395f2e03cb7..17ef01c63102 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -422,6 +422,10 @@ extern int sysctl_memcg_qos_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos); #endif
+#ifdef CONFIG_MEMCG_SWAP_QOS +DECLARE_STATIC_KEY_FALSE(memcg_swap_qos_key); +#endif + /* * size of first charge trial. "32" comes from vmscan.c's magic value. * TODO: maybe necessary to use big numbers in big irons. diff --git a/mm/Kconfig b/mm/Kconfig index f66457168de9..0fe459e79336 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -512,6 +512,15 @@ config MEMCG_QOS
If unsure, say "n".
+config MEMCG_SWAP_QOS + bool "Enable Memory Cgroup Swap Control" + depends on MEMCG_SWAP + depends on X86 || ARM64 + default n + help + memcg swap control include memory force swapin, swapfile control + and swap limit. + config ETMEM_SCAN tristate "module: etmem page scan for etmem support" depends on ETMEM diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 612acf6466e3..1199a582058b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4053,6 +4053,59 @@ int sysctl_memcg_qos_handler(struct ctl_table *table, int write, } #endif
+#ifdef CONFIG_MEMCG_SWAP_QOS +DEFINE_STATIC_KEY_FALSE(memcg_swap_qos_key); + +#ifdef CONFIG_SYSCTL +static int sysctl_memcg_swap_qos_stat; + +static void memcg_swap_qos_reset(void) +{ +} + +static int sysctl_memcg_swap_qos_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int ret; + + ret = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (ret) + return ret; + if (write) { + if (sysctl_memcg_swap_qos_stat) { + memcg_swap_qos_reset(); + static_branch_enable(&memcg_swap_qos_key); + } else { + static_branch_disable(&memcg_swap_qos_key); + } + } + return 0; +} + +static struct ctl_table memcg_swap_qos_sysctls[] = { + { + .procname = "memcg_swap_qos_enable", + .data = &sysctl_memcg_swap_qos_stat, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = sysctl_memcg_swap_qos_handler, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, + { } +}; + +static __init int memcg_swap_qos_sysctls_init(void) +{ + if (mem_cgroup_disabled() || cgroup_memory_noswap) + return 0; + register_sysctl_init("vm", memcg_swap_qos_sysctls); + return 0; +} +late_initcall(memcg_swap_qos_sysctls_init); +#endif +#endif + #ifdef CONFIG_NUMA
#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
--------------------------------
Add a new per-memcg swapin interface to load a cgroup's swapped-out data back into memory in advance, improving access efficiency.

Usage:
  # echo 0 > memory.force_swapin
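A possible workflow combining this interface with the earlier memory.reclaim extension (illustrative only; run inside the target memcg directory):

  # push the cgroup's anon pages out to swap
  echo "1G type=anon" > memory.reclaim
  # later, bring them back in before a latency-sensitive phase
  echo 0 > memory.force_swapin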
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 .../admin-guide/cgroup-v1/memory.rst |  1 +
 include/linux/mm.h                   |  1 +
 mm/madvise.c                         | 19 +++++++++++
 mm/memcontrol.c                      | 33 +++++++++++++++++++
 4 files changed, 54 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 6859a50fbd09..5eee7e3be4b2 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -78,6 +78,7 @@ Brief summary of control files. memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled memory.force_empty trigger forced page reclaim + memory.force_swapin trigger forced swapin anon page memory.pressure_level set memory pressure notifications memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0b5ce84212d7..58d7a59b5b65 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2650,6 +2650,7 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf, bool downgrade); extern int do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf); +extern void force_swapin_vma(struct vm_area_struct *vma); extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
extern unsigned long __do_mmap_mm(struct mm_struct *mm, struct file *file, diff --git a/mm/madvise.c b/mm/madvise.c index 0a1d6f9d75ea..6028383a8147 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -259,6 +259,25 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
lru_add_drain(); /* Push any new pages onto the LRU now */ } + +void force_swapin_vma(struct vm_area_struct *vma) +{ + struct file *file = vma->vm_file; + + if (!can_madv_lru_vma(vma)) + return; + + if (!file) { + walk_page_vma(vma, &swapin_walk_ops, vma); + lru_add_drain(); + } else if (shmem_mapping(file->f_mapping)) + force_shm_swapin_readahead(vma, vma->vm_start, + vma->vm_end, file->f_mapping); +} +#else +void force_swapin_vma(struct vm_area_struct *vma) +{ +} #endif /* CONFIG_SWAP */
/* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1199a582058b..4aad76ada6a7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4104,6 +4104,32 @@ static __init int memcg_swap_qos_sysctls_init(void) } late_initcall(memcg_swap_qos_sysctls_init); #endif + +static int mem_cgroup_task_swapin(struct task_struct *task, void *arg) +{ + struct mm_struct *mm = task->mm; + struct vm_area_struct *vma; + struct blk_plug plug; + + mmap_read_lock(mm); + blk_start_plug(&plug); + for (vma = mm->mmap; vma; vma = vma->vm_next) + force_swapin_vma(vma); + blk_finish_plug(&plug); + mmap_read_unlock(mm); + + return 0; +} + +static ssize_t memory_swapin(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + mem_cgroup_scan_tasks(memcg, mem_cgroup_task_swapin, NULL); + + return nbytes; +} #endif
#ifdef CONFIG_NUMA @@ -5798,6 +5824,13 @@ static struct cftype mem_cgroup_legacy_files[] = { .name = "reclaim", .write = memory_reclaim, }, +#ifdef CONFIG_MEMCG_SWAP_QOS + { + .name = "force_swapin", + .flags = CFTYPE_NOT_ON_ROOT, + .write = memory_swapin, + }, +#endif { .name = "high_async_ratio", .flags = CFTYPE_NOT_ON_ROOT,
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
--------------------------------
memsw alone cannot limit the usage of swap space. Add a memory.swap.max interface to limit the difference between memsw.usage and memory.usage. Since a page may occupy both a swap entry and a swap cache page, this value is not exactly equal to swap.usage.
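As a hedged illustration of the accounting (the file names are the existing cgroup v1 interfaces; the swap QoS feature must be enabled first):

  echo 100M > memory.swap.max
  # the limited quantity is approximately
  #   memory.memsw.usage_in_bytes - memory.usage_in_bytes
  cat memory.memsw.usage_in_bytes memory.usage_in_bytes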
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 .../admin-guide/cgroup-v1/memory.rst |   1 +
 include/linux/memcontrol.h           |   9 ++
 mm/memcontrol.c                      | 137 +++++++++++++++++-
 3 files changed, 146 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 5eee7e3be4b2..962a80d44744 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -79,6 +79,7 @@ Brief summary of control files. memory.use_hierarchy set/show hierarchical account enabled memory.force_empty trigger forced page reclaim memory.force_swapin trigger forced swapin anon page + memory.swap.max set/show limit for swap memory.pressure_level set memory pressure notifications memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 17ef01c63102..58f976014384 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -240,6 +240,10 @@ struct obj_cgroup { }; };
+struct swap_device { + unsigned long max; +}; + /* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. We would eventually like to provide @@ -402,7 +406,12 @@ struct mem_cgroup { #else KABI_RESERVE(6) #endif +#ifdef CONFIG_MEMCG_SWAP_QOS + /* per-memcg swap device control; protected by swap_lock */ + KABI_USE(7, struct swap_device *swap_dev) +#else KABI_RESERVE(7) +#endif KABI_RESERVE(8)
struct mem_cgroup_per_node *nodeinfo[0]; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4aad76ada6a7..abcfb313226d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4061,6 +4061,10 @@ static int sysctl_memcg_swap_qos_stat;
static void memcg_swap_qos_reset(void) { + struct mem_cgroup *memcg; + + for_each_mem_cgroup(memcg) + WRITE_ONCE(memcg->swap_dev->max, PAGE_COUNTER_MAX); }
static int sysctl_memcg_swap_qos_handler(struct ctl_table *table, int write, @@ -4130,6 +4134,124 @@ static ssize_t memory_swapin(struct kernfs_open_file *of, char *buf,
return nbytes; } + +static int memcg_alloc_swap_device(struct mem_cgroup *memcg) +{ + memcg->swap_dev = kmalloc(sizeof(struct swap_device), GFP_KERNEL); + if (!memcg->swap_dev) + return -ENOMEM; + return 0; +} + +static void memcg_free_swap_device(struct mem_cgroup *memcg) +{ + if (!memcg->swap_dev) + return; + + kfree(memcg->swap_dev); + memcg->swap_dev = NULL; +} + +static void memcg_swap_device_init(struct mem_cgroup *memcg, + struct mem_cgroup *parent) +{ + if (!static_branch_likely(&memcg_swap_qos_key) || !parent) + WRITE_ONCE(memcg->swap_dev->max, PAGE_COUNTER_MAX); + else + WRITE_ONCE(memcg->swap_dev->max, + READ_ONCE(parent->swap_dev->max)); +} + +u64 memcg_swapmax_read(struct cgroup_subsys_state *css, struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + if (!static_branch_likely(&memcg_swap_qos_key)) + return PAGE_COUNTER_MAX * PAGE_SIZE; + + return READ_ONCE(memcg->swap_dev->max) * PAGE_SIZE; +} + +static ssize_t memcg_swapmax_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + unsigned long max; + int err; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return -EACCES; + + buf = strstrip(buf); + err = page_counter_memparse(buf, "max", &max); + if (err) + return err; + + WRITE_ONCE(memcg->swap_dev->max, max); + + return nbytes; +} + +static int mem_cgroup_check_swap_for_v1(struct page *page, swp_entry_t entry) +{ + struct mem_cgroup *memcg, *target_memcg; + unsigned long swap_usage; + unsigned long swap_limit; + long nr_swap_pages = PAGE_COUNTER_MAX; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return 0; + + if (!entry.val) + return 0; + + rcu_read_lock(); + target_memcg = page_memcg(page); + if (!target_memcg || mem_cgroup_is_root(target_memcg)) { + rcu_read_unlock(); + return 0; + } + + if (!css_tryget_online(&target_memcg->css)) { + rcu_read_unlock(); + return 0; + } + rcu_read_unlock(); + + for (memcg = target_memcg; memcg != root_mem_cgroup; + memcg = parent_mem_cgroup(memcg)) { + swap_limit = READ_ONCE(memcg->swap_dev->max); + swap_usage = page_counter_read(&memcg->memsw) - + page_counter_read(&memcg->memory); + nr_swap_pages = min_t(long, nr_swap_pages, + swap_limit - swap_usage); + } + css_put(&target_memcg->css); + + if (thp_nr_pages(page) > nr_swap_pages) + return -ENOMEM; + return 0; +} + +#else +static int memcg_alloc_swap_device(struct mem_cgroup *memcg) +{ + return 0; +} + +static void memcg_free_swap_device(struct mem_cgroup *memcg) +{ +} + +static void memcg_swap_device_init(struct mem_cgroup *memcg, + struct mem_cgroup *parent) +{ +} + +static int mem_cgroup_check_swap_for_v1(struct page *page, swp_entry_t entry) +{ + return 0; +} #endif
#ifdef CONFIG_NUMA @@ -5830,6 +5952,12 @@ static struct cftype mem_cgroup_legacy_files[] = { .flags = CFTYPE_NOT_ON_ROOT, .write = memory_swapin, }, + { + .name = "swap.max", + .flags = CFTYPE_NOT_ON_ROOT, + .write = memcg_swapmax_write, + .read_u64 = memcg_swapmax_read, + }, #endif { .name = "high_async_ratio", @@ -5975,6 +6103,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->vmstats_percpu); + memcg_free_swap_device(memcg); kfree(memcg); }
@@ -5999,6 +6128,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (!memcg) return ERR_PTR(error);
+ if (memcg_alloc_swap_device(memcg)) + goto fail; + memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX, GFP_KERNEL); @@ -6076,17 +6208,20 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) page_counter_init(&memcg->swap, NULL); page_counter_init(&memcg->kmem, NULL); page_counter_init(&memcg->tcpmem, NULL); + memcg_swap_device_init(memcg, NULL); } else if (parent->use_hierarchy) { memcg->use_hierarchy = true; page_counter_init(&memcg->memory, &parent->memory); page_counter_init(&memcg->swap, &parent->swap); page_counter_init(&memcg->kmem, &parent->kmem); page_counter_init(&memcg->tcpmem, &parent->tcpmem); + memcg_swap_device_init(memcg, parent); } else { page_counter_init(&memcg->memory, &root_mem_cgroup->memory); page_counter_init(&memcg->swap, &root_mem_cgroup->swap); page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem); page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem); + memcg_swap_device_init(memcg, root_mem_cgroup); /* * Deeper hierachy with use_hierarchy == false doesn't make * much sense so let cgroup subsystem know about this @@ -8020,7 +8155,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) unsigned short oldid;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) - return 0; + return mem_cgroup_check_swap_for_v1(page, entry);
memcg = page_memcg(page);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
--------------------------------
With the memory.swapfile interface, the swap devices available to a memcg can be restricted. The accepted parameters are 'all', 'none' and the path of a valid swap device.

Usage:
  # echo /dev/zram0 > memory.swapfile

If the swap device goes offline (is swapped off), the memcg's swapfile setting falls back to 'none'.
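Example (illustrative; /dev/zram0 must already be an active swap device, and the swap QoS feature must be enabled):

  echo /dev/zram0 > memory.swapfile
  cat memory.swapfile          # shows the path of the allowed device
  echo all > memory.swapfile   # allow all swap devices again
  echo none > memory.swapfile  # prohibit swap for this memcg
  # after "swapoff /dev/zram0" the file reads back as "none"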
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 .../admin-guide/cgroup-v1/memory.rst |   1 +
 include/linux/memcontrol.h           |  18 +++
 include/linux/swap.h                 |  10 +-
 mm/memcontrol.c                      | 150 +++++++++++++++++-
 mm/swap_slots.c                      |  14 +-
 mm/swapfile.c                        | 100 +++++++++++-
 6 files changed, 282 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 962a80d44744..3891916a0671 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -80,6 +80,7 @@ Brief summary of control files. memory.force_empty trigger forced page reclaim memory.force_swapin trigger forced swapin anon page memory.swap.max set/show limit for swap + memory.swapfile set/show available swap file memory.pressure_level set memory pressure notifications memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 58f976014384..66c46b890b8d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -50,6 +50,11 @@ enum memcg_memory_event { MEMCG_NR_MEMORY_EVENTS, };
+enum { + SWAP_TYPE_ALL = -1, /* allowed to use all swap files */ + SWAP_TYPE_NONE = -2, /* prohibited from using any swapfile */ +}; + struct mem_cgroup_reclaim_cookie { pg_data_t *pgdat; unsigned int generation; @@ -242,6 +247,7 @@ struct obj_cgroup {
struct swap_device { unsigned long max; + int type; };
/* @@ -1305,6 +1311,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
int mem_cgroup_force_empty(struct mem_cgroup *memcg);
+int memcg_get_swap_type(struct page *page); +void memcg_remove_swapfile(int type); + #else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0 @@ -1708,6 +1717,15 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, { return 0; } + +static inline int memcg_get_swap_type(struct page *page) +{ + return SWAP_TYPE_ALL; +} + +static inline void memcg_remove_swapfile(int type) +{ +} #endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 65626521ae2b..c0be56d09b0b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -511,11 +511,14 @@ static inline long get_nr_swap_pages(void) return atomic_long_read(&nr_swap_pages); }
+extern long get_nr_swap_pages_type(int type); + extern void si_swapinfo(struct sysinfo *); extern swp_entry_t get_swap_page(struct page *page); extern void put_swap_page(struct page *page, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size); +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size, + int type); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); @@ -547,6 +550,11 @@ static inline void put_swap_device(struct swap_info_struct *si) percpu_ref_put(&si->sei->users); }
+#ifdef CONFIG_MEMCG_SWAP_QOS +extern int write_swapfile_for_memcg(struct address_space *mapping, + int *swap_type); +extern void read_swapfile_for_memcg(struct seq_file *m, int type); +#endif #else /* CONFIG_SWAP */
static inline int swap_readpage(struct page *page, bool do_poll) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index abcfb313226d..9c7f52d3780c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4063,8 +4063,10 @@ static void memcg_swap_qos_reset(void) { struct mem_cgroup *memcg;
- for_each_mem_cgroup(memcg) + for_each_mem_cgroup(memcg) { WRITE_ONCE(memcg->swap_dev->max, PAGE_COUNTER_MAX); + WRITE_ONCE(memcg->swap_dev->type, SWAP_TYPE_ALL); + } }
static int sysctl_memcg_swap_qos_handler(struct ctl_table *table, int write, @@ -4155,11 +4157,15 @@ static void memcg_free_swap_device(struct mem_cgroup *memcg) static void memcg_swap_device_init(struct mem_cgroup *memcg, struct mem_cgroup *parent) { - if (!static_branch_likely(&memcg_swap_qos_key) || !parent) + if (!static_branch_likely(&memcg_swap_qos_key) || !parent) { WRITE_ONCE(memcg->swap_dev->max, PAGE_COUNTER_MAX); - else + WRITE_ONCE(memcg->swap_dev->type, SWAP_TYPE_ALL); + } else { WRITE_ONCE(memcg->swap_dev->max, READ_ONCE(parent->swap_dev->max)); + WRITE_ONCE(memcg->swap_dev->type, + READ_ONCE(parent->swap_dev->type)); + } }
u64 memcg_swapmax_read(struct cgroup_subsys_state *css, struct cftype *cft) @@ -4233,6 +4239,121 @@ static int mem_cgroup_check_swap_for_v1(struct page *page, swp_entry_t entry) return 0; }
+static int memcg_swapfile_read(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + int type; + + if (!static_branch_likely(&memcg_swap_qos_key)) { + seq_printf(m, "all\n"); + return 0; + } + + type = READ_ONCE(memcg->swap_dev->type); + if (type == SWAP_TYPE_NONE) + seq_printf(m, "none\n"); + else if (type == SWAP_TYPE_ALL) + seq_printf(m, "all\n"); + else + read_swapfile_for_memcg(m, type); + return 0; +} + +static ssize_t memcg_swapfile_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + struct filename *pathname; + struct file *swapfile; + int ret; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return -EACCES; + + buf = strstrip(buf); + + if (!strcmp(buf, "none")) { + WRITE_ONCE(memcg->swap_dev->type, SWAP_TYPE_NONE); + return nbytes; + } else if (!strcmp(buf, "all")) { + WRITE_ONCE(memcg->swap_dev->type, SWAP_TYPE_ALL); + return nbytes; + } + + pathname = getname_kernel(buf); + if (IS_ERR(pathname)) + return PTR_ERR(pathname); + + swapfile = file_open_name(pathname, O_RDWR|O_LARGEFILE, 0); + if (IS_ERR(swapfile)) { + putname(pathname); + return PTR_ERR(swapfile); + } + ret = write_swapfile_for_memcg(swapfile->f_mapping, + &memcg->swap_dev->type); + filp_close(swapfile, NULL); + putname(pathname); + + return ret < 0 ? ret : nbytes; +} + +int memcg_get_swap_type(struct page *page) +{ + struct mem_cgroup *memcg; + int type; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return SWAP_TYPE_ALL; + + if (!page) + return SWAP_TYPE_ALL; + + rcu_read_lock(); + memcg = page_memcg(page); + if (!memcg || mem_cgroup_is_root(memcg)) { + rcu_read_unlock(); + return SWAP_TYPE_ALL; + } + + if (!css_tryget_online(&memcg->css)) { + rcu_read_unlock(); + return SWAP_TYPE_ALL; + } + rcu_read_unlock(); + + type = READ_ONCE(memcg->swap_dev->type); + css_put(&memcg->css); + return type; +} + +void memcg_remove_swapfile(int type) +{ + struct mem_cgroup *memcg; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return; + + for_each_mem_cgroup(memcg) + if (READ_ONCE(memcg->swap_dev->type) == type) + WRITE_ONCE(memcg->swap_dev->type, SWAP_TYPE_NONE); +} + +static long mem_cgroup_get_nr_swap_pages_type(struct mem_cgroup *memcg) +{ + int type; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return mem_cgroup_get_nr_swap_pages(memcg); + + type = READ_ONCE(memcg->swap_dev->type); + if (type == SWAP_TYPE_ALL) + return mem_cgroup_get_nr_swap_pages(memcg); + else if (type == SWAP_TYPE_NONE) + return 0; + else + return get_nr_swap_pages_type(type); +} + #else static int memcg_alloc_swap_device(struct mem_cgroup *memcg) { @@ -4252,6 +4373,21 @@ static int mem_cgroup_check_swap_for_v1(struct page *page, swp_entry_t entry) { return 0; } + +int memcg_get_swap_type(struct page *page) +{ + return SWAP_TYPE_ALL; +} + +void memcg_remove_swapfile(int type) +{ +} + +static long mem_cgroup_get_nr_swap_pages_type(struct mem_cgroup *memcg) +{ + return mem_cgroup_get_nr_swap_pages(memcg); +} + #endif
#ifdef CONFIG_NUMA @@ -5521,7 +5657,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
/* If only reclaim swap pages, check swap space at first. */ if ((reclaim_options & MEMCG_RECLAIM_NOT_FILE) && - (mem_cgroup_get_nr_swap_pages(memcg) <= 0)) + (mem_cgroup_get_nr_swap_pages_type(memcg) <= 0)) return -EAGAIN;
/* This is the final attempt, drain percpu lru caches in the @@ -5958,6 +6094,12 @@ static struct cftype mem_cgroup_legacy_files[] = { .write = memcg_swapmax_write, .read_u64 = memcg_swapmax_read, }, + { + .name = "swapfile", + .flags = CFTYPE_NOT_ON_ROOT, + .write = memcg_swapfile_write, + .seq_show = memcg_swapfile_read, + }, #endif { .name = "high_async_ratio", diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 0357fbe70645..3d4e4c230305 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -266,7 +266,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache) cache->cur = 0; if (swap_slot_cache_active) cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, - cache->slots, 1); + cache->slots, 1, SWAP_TYPE_ALL);
return cache->nr; } @@ -307,12 +307,17 @@ swp_entry_t get_swap_page(struct page *page) { swp_entry_t entry; struct swap_slots_cache *cache; + int type;
entry.val = 0;
+ type = memcg_get_swap_type(page); + if (type == SWAP_TYPE_NONE) + goto out; + if (PageTransHuge(page)) { if (IS_ENABLED(CONFIG_THP_SWAP)) - get_swap_pages(1, &entry, HPAGE_PMD_NR); + get_swap_pages(1, &entry, HPAGE_PMD_NR, type); goto out; }
@@ -327,7 +332,8 @@ swp_entry_t get_swap_page(struct page *page) */ cache = raw_cpu_ptr(&swp_slots);
- if (likely(check_cache_active() && cache->slots)) { + if (likely(check_cache_active() && cache->slots) && + type == SWAP_TYPE_ALL) { mutex_lock(&cache->alloc_lock); if (cache->slots) { repeat: @@ -344,7 +350,7 @@ swp_entry_t get_swap_page(struct page *page) goto out; }
- get_swap_pages(1, &entry, 1); + get_swap_pages(1, &entry, 1, type); out: if (mem_cgroup_try_charge_swap(page, entry)) { put_swap_page(page, entry); diff --git a/mm/swapfile.c b/mm/swapfile.c index 14e2396fa8a3..2134e1b83ccb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1056,7 +1056,97 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
}
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) +#ifdef CONFIG_MEMCG_SWAP_QOS +int write_swapfile_for_memcg(struct address_space *mapping, int *swap_type) +{ + struct swap_info_struct *si; + unsigned int type; + int ret = -EINVAL; + + spin_lock(&swap_lock); + for (type = 0; type < nr_swapfiles; type++) { + si = swap_info[type]; + if ((si->flags & SWP_WRITEOK) && + (si->swap_file->f_mapping == mapping)) { + WRITE_ONCE(*swap_type, type); + ret = 0; + break; + } + } + spin_unlock(&swap_lock); + return ret; +} + +void read_swapfile_for_memcg(struct seq_file *m, int type) +{ + struct swap_info_struct *si; + + spin_lock(&swap_lock); + if (type < nr_swapfiles) { + si = swap_info[type]; + if (si->flags & SWP_WRITEOK) { + seq_file_path(m, si->swap_file, "\t\n\"); + seq_printf(m, "\n"); + } + } + spin_unlock(&swap_lock); +} + +long get_nr_swap_pages_type(int type) +{ + struct swap_info_struct *si; + long nr_swap_pages = 0; + + spin_lock(&swap_lock); + if (type < nr_swapfiles) { + si = swap_info[type]; + if (si->flags & SWP_WRITEOK) + nr_swap_pages = si->pages - si->inuse_pages; + } + spin_unlock(&swap_lock); + + return nr_swap_pages; +} + +static long get_avail_pages(unsigned long size, int type) +{ + long avail_pgs = 0; + + if (type == SWAP_TYPE_ALL) + return atomic_long_read(&nr_swap_pages) / size; + + spin_unlock(&swap_avail_lock); + avail_pgs = get_nr_swap_pages_type(type) / size; + spin_lock(&swap_avail_lock); + return avail_pgs; +} + +static inline bool should_skip_swap_type(int swap_type, int type) +{ + if (type == SWAP_TYPE_ALL) + return false; + + return (type != swap_type); +} +#else +long get_nr_swap_pages_type(int type) +{ + return 0; +} + +static inline long get_avail_pages(unsigned long size, int type) +{ + return atomic_long_read(&nr_swap_pages) / size; +} + +static inline bool should_skip_swap_type(int swap_type, int type) +{ + return false; +} +#endif + +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size, + int type) { unsigned long size = swap_entry_size(entry_size); struct swap_info_struct *si, *next; @@ -1069,7 +1159,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
spin_lock(&swap_avail_lock);
- avail_pgs = atomic_long_read(&nr_swap_pages) / size; + avail_pgs = get_avail_pages(size, type); if (avail_pgs <= 0) { spin_unlock(&swap_avail_lock); goto noswap; @@ -1086,6 +1176,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); spin_lock(&si->lock); + if (should_skip_swap_type(si->type, type)) { + spin_unlock(&si->lock); + spin_lock(&swap_avail_lock); + goto nextsi; + } if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) { spin_lock(&swap_avail_lock); if (plist_node_empty(&si->avail_lists[node])) { @@ -2703,6 +2798,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) cluster_info = p->cluster_info; p->cluster_info = NULL; frontswap_map = frontswap_map_get(p); + memcg_remove_swapfile(p->type); spin_unlock(&p->lock); spin_unlock(&swap_lock); arch_swap_invalidate_area(p->type);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA
--------------------------------
Since per-memcg swapfile control is now supported, a per-type slot cache is needed to keep allocation performance. To reduce memory waste, the per-type slot caches are allocated only when the feature is enabled or when the corresponding swap device is brought online.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 include/linux/swap.h       |   1 +
 include/linux/swap_slots.h |   2 +-
 mm/memcontrol.c            |   1 +
 mm/swap_slots.c            | 145 +++++++++++++++++++++++++++++++++----
 mm/swapfile.c              |   2 +-
 5 files changed, 136 insertions(+), 15 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h index c0be56d09b0b..524e3ca9d0f5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -554,6 +554,7 @@ static inline void put_swap_device(struct swap_info_struct *si) extern int write_swapfile_for_memcg(struct address_space *mapping, int *swap_type); extern void read_swapfile_for_memcg(struct seq_file *m, int type); +void enable_swap_slots_cache_max(void); #endif #else /* CONFIG_SWAP */
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 347f1a304190..1ea720390822 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -23,7 +23,7 @@ struct swap_slots_cache {
void disable_swap_slots_cache_lock(void); void reenable_swap_slots_cache_unlock(void); -void enable_swap_slots_cache(void); +void enable_swap_slots_cache(int type); int free_swap_slot(swp_entry_t entry);
extern bool swap_slot_cache_enabled; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9c7f52d3780c..e05368a76e36 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4081,6 +4081,7 @@ static int sysctl_memcg_swap_qos_handler(struct ctl_table *table, int write, if (sysctl_memcg_swap_qos_stat) { memcg_swap_qos_reset(); static_branch_enable(&memcg_swap_qos_key); + enable_swap_slots_cache_max(); } else { static_branch_disable(&memcg_swap_qos_key); } diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 3d4e4c230305..fbb9888a0469 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -35,6 +35,11 @@ #include <linux/mm.h>
static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots); +#ifdef CONFIG_MEMCG_SWAP_QOS +static unsigned int nr_swap_slots; +static unsigned int max_swap_slots; +static DEFINE_PER_CPU(struct swap_slots_cache [MAX_SWAPFILES], swp_type_slots); +#endif static bool swap_slot_cache_active; bool swap_slot_cache_enabled; static bool swap_slot_cache_initialized; @@ -111,7 +116,37 @@ static bool check_cache_active(void) return swap_slot_cache_active; }
-static int alloc_swap_slot_cache(unsigned int cpu) +#ifdef CONFIG_MEMCG_SWAP_QOS +static inline struct swap_slots_cache *get_slots_cache(int swap_type) +{ + if (swap_type == SWAP_TYPE_ALL) + return raw_cpu_ptr(&swp_slots); + else + return raw_cpu_ptr(&swp_type_slots)[swap_type]; +} + +static inline struct swap_slots_cache *get_slots_cache_cpu(unsigned int cpu, + int swap_type) +{ + if (swap_type == SWAP_TYPE_ALL) + return &per_cpu(swp_slots, cpu); + else + return &per_cpu(swp_type_slots, cpu)[swap_type]; +} +#else +static inline struct swap_slots_cache *get_slots_cache(int swap_type) +{ + return raw_cpu_ptr(&swp_slots); +} + +static inline struct swap_slots_cache *get_slots_cache_cpu(unsigned int cpu, + int swap_type) +{ + return &per_cpu(swp_slots, cpu); +} +#endif + +static int alloc_swap_slot_cache_cpu_type(unsigned int cpu, int swap_type) { struct swap_slots_cache *cache; swp_entry_t *slots, *slots_ret; @@ -134,7 +169,7 @@ static int alloc_swap_slot_cache(unsigned int cpu) }
mutex_lock(&swap_slots_cache_mutex); - cache = &per_cpu(swp_slots, cpu); + cache = get_slots_cache_cpu(cpu, swap_type); if (cache->slots || cache->slots_ret) { /* cache already allocated */ mutex_unlock(&swap_slots_cache_mutex); @@ -166,13 +201,74 @@ static int alloc_swap_slot_cache(unsigned int cpu) return 0; }
-static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type, - bool free_slots) +#ifdef CONFIG_MEMCG_SWAP_QOS +static int __alloc_swap_slot_cache_cpu(unsigned int cpu) +{ + int i, ret; + + ret = alloc_swap_slot_cache_cpu_type(cpu, SWAP_TYPE_ALL); + if (ret) + return ret; + + for (i = 0; i < nr_swap_slots; i++) { + ret = alloc_swap_slot_cache_cpu_type(cpu, i); + if (ret) + return ret; + } + + return ret; +} + +static void alloc_swap_slot_cache_type(int type) +{ + unsigned int cpu; + + if (type >= max_swap_slots) + max_swap_slots = type + 1; + + if (!static_branch_likely(&memcg_swap_qos_key)) + return; + + /* serialize with cpu hotplug operations */ + get_online_cpus(); + while (type >= nr_swap_slots) { + for_each_online_cpu(cpu) + alloc_swap_slot_cache_cpu_type(cpu, nr_swap_slots); + nr_swap_slots++; + } + put_online_cpus(); +} + +void enable_swap_slots_cache_max(void) +{ + mutex_lock(&swap_slots_cache_enable_mutex); + if (max_swap_slots) + alloc_swap_slot_cache_type(max_swap_slots - 1); + mutex_unlock(&swap_slots_cache_enable_mutex); +} +#else +static inline int __alloc_swap_slot_cache_cpu(unsigned int cpu) +{ + return alloc_swap_slot_cache_cpu_type(cpu, SWAP_TYPE_ALL); +} + +static void alloc_swap_slot_cache_type(int type) +{ +} +#endif + +static int alloc_swap_slot_cache(unsigned int cpu) +{ + return __alloc_swap_slot_cache_cpu(cpu); +} + +static void drain_slots_cache_cpu_type(unsigned int cpu, unsigned int type, + bool free_slots, int swap_type) { struct swap_slots_cache *cache; swp_entry_t *slots = NULL;
- cache = &per_cpu(swp_slots, cpu); + cache = get_slots_cache_cpu(cpu, swap_type); if ((type & SLOTS_CACHE) && cache->slots) { mutex_lock(&cache->alloc_lock); swapcache_free_entries(cache->slots + cache->cur, cache->nr); @@ -198,6 +294,30 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type, } }
+#ifdef CONFIG_MEMCG_SWAP_QOS +static void __drain_slots_cache_cpu(unsigned int cpu, unsigned int type, + bool free_slots) +{ + int i; + + drain_slots_cache_cpu_type(cpu, type, free_slots, SWAP_TYPE_ALL); + for (i = 0; i < nr_swap_slots; i++) + drain_slots_cache_cpu_type(cpu, type, free_slots, i); +} +#else +static inline void __drain_slots_cache_cpu(unsigned int cpu, + unsigned int type, bool free_slots) +{ + drain_slots_cache_cpu_type(cpu, type, free_slots, SWAP_TYPE_ALL); +} +#endif + +static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type, + bool free_slots) +{ + __drain_slots_cache_cpu(cpu, type, free_slots); +} + static void __drain_swap_slots_cache(unsigned int type) { unsigned int cpu; @@ -237,7 +357,7 @@ static int free_slot_cache(unsigned int cpu) return 0; }
-void enable_swap_slots_cache(void) +void enable_swap_slots_cache(int type) { mutex_lock(&swap_slots_cache_enable_mutex); if (!swap_slot_cache_initialized) { @@ -251,14 +371,14 @@ void enable_swap_slots_cache(void)
swap_slot_cache_initialized = true; } - + alloc_swap_slot_cache_type(type); __reenable_swap_slots_cache(); out_unlock: mutex_unlock(&swap_slots_cache_enable_mutex); }
/* called with swap slot cache's alloc lock held */ -static int refill_swap_slots_cache(struct swap_slots_cache *cache) +static int refill_swap_slots_cache(struct swap_slots_cache *cache, int type) { if (!use_swap_slot_cache || cache->nr) return 0; @@ -266,7 +386,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache) cache->cur = 0; if (swap_slot_cache_active) cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, - cache->slots, 1, SWAP_TYPE_ALL); + cache->slots, 1, type);
return cache->nr; } @@ -330,10 +450,9 @@ swp_entry_t get_swap_page(struct page *page) * The alloc path here does not touch cache->slots_ret * so cache->free_lock is not taken. */ - cache = raw_cpu_ptr(&swp_slots); + cache = get_slots_cache(type);
- if (likely(check_cache_active() && cache->slots) && - type == SWAP_TYPE_ALL) { + if (likely(check_cache_active() && cache->slots)) { mutex_lock(&cache->alloc_lock); if (cache->slots) { repeat: @@ -341,7 +460,7 @@ swp_entry_t get_swap_page(struct page *page) entry = cache->slots[cache->cur]; cache->slots[cache->cur++].val = 0; cache->nr--; - } else if (refill_swap_slots_cache(cache)) { + } else if (refill_swap_slots_cache(cache, type)) { goto repeat; } } diff --git a/mm/swapfile.c b/mm/swapfile.c index 2134e1b83ccb..f5a15511f45a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3553,7 +3553,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) if (inode) inode_unlock(inode); if (!error) - enable_swap_slots_cache(); + enable_swap_slots_cache(p->type); return error; }