From: Chunguang Xu <brookxu@tencent.com>

mainline inclusion
from mainline-v5.14-rc1
commit d80c228d44640f0b47b57a2ca4afa26ef87e16b0
category: bugfix
bugzilla: 187475, https://gitee.com/openeuler/kernel/issues/I5ME0J
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
On the I/O submission path, blk_account_io_start() may be interrupted by a system interrupt. When the interrupt handler returns, the value of part->stamp may have been updated by another core, so the time value collected before the interrupt may be less than part->stamp. When this happens, we should do nothing, so that io_ticks stays accurate. On kernels older than 5.0, this may cause io_ticks to become smaller, which in turn may cause abnormal ioutil values.
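For illustration, a minimal user-space sketch of why the wrap-safe comparison matters; time_after() is modelled on the macro from <linux/jiffies.h>, and the tick values are made up:

  #include <stdio.h>

  /* wrap-safe "a is later than b" comparison, as in the kernel */
  #define time_after(a, b) ((long)((b) - (a)) < 0)

  int main(void)
  {
          unsigned long stamp = 1005;     /* already advanced by another CPU */
          unsigned long now = 1000;       /* sampled before the interrupt hit */

          /* old check: fires even though 'now' is stale */
          if (stamp != now)
                  printf("old: would add now - stamp = %lu\n",
                         now - stamp);    /* huge unsigned underflow */

          /* new check: stale sample, skip the update entirely */
          if (time_after(now, stamp))
                  printf("new: add %lu\n", now - stamp);
          else
                  printf("new: 'now' is not after 'stamp', do nothing\n");

          return 0;
  }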
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail....
Signed-off-by: Jens Axboe <axboe@kernel.dk>
conflict: block/blk-core.c
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 109fb2750453..0b496dabc5ac 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1258,7 +1258,7 @@ void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->stamp);
-	if (unlikely(stamp != now)) {
+	if (unlikely(time_after(now, stamp))) {
 		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
 			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
 	}
From: Zheyu Ma <zheyuma97@gmail.com>

mainline inclusion
from mainline-v5.18-rc5
commit 15cf0b82271b1823fb02ab8c377badba614d95d5
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5OVRU
CVE: CVE-2022-3061
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
The userspace program could pass any value to the driver through the ioctl() interface. If the driver doesn't check the value of 'pixclock', it may cause a divide error.
Fix this by checking whether 'pixclock' is zero in the function i740fb_check_var().
The following log reveals it:
divide error: 0000 [#1] PREEMPT SMP KASAN PTI
RIP: 0010:i740fb_decode_var drivers/video/fbdev/i740fb.c:444 [inline]
RIP: 0010:i740fb_set_par+0x272f/0x3bb0 drivers/video/fbdev/i740fb.c:739
Call Trace:
 fb_set_var+0x604/0xeb0 drivers/video/fbdev/core/fbmem.c:1036
 do_fb_ioctl+0x234/0x670 drivers/video/fbdev/core/fbmem.c:1112
 fb_ioctl+0xdd/0x130 drivers/video/fbdev/core/fbmem.c:1191
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:874 [inline]
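For illustration, a small user-space sketch of the failure mode and the added guard; the division mirrors the PICOS2KHZ() pattern fbdev drivers apply to var->pixclock (the exact expression in i740fb_decode_var() may differ):

  #include <stdio.h>

  #define PICOS2KHZ(a) (1000000000UL / (a))     /* as in <linux/fb.h> */

  static int check_var(unsigned long pixclock)
  {
          if (!pixclock)          /* the added guard: reject before dividing */
                  return -1;
          printf("dot clock = %lu kHz\n", PICOS2KHZ(pixclock));
          return 0;
  }

  int main(void)
  {
          check_var(39722);       /* ~25 MHz, a sane VGA dot clock */
          check_var(0);           /* what a malicious FBIOPUT_VSCREENINFO passes */
          return 0;
  }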
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Xia Longlong <xialonglong1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/video/fbdev/i740fb.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/video/fbdev/i740fb.c b/drivers/video/fbdev/i740fb.c
index 52cce0db8bd3..b595437a5752 100644
--- a/drivers/video/fbdev/i740fb.c
+++ b/drivers/video/fbdev/i740fb.c
@@ -657,6 +657,9 @@ static int i740fb_decode_var(const struct fb_var_screeninfo *var,
 
 static int i740fb_check_var(struct fb_var_screeninfo *var, struct fb_info *info)
 {
+	if (!var->pixclock)
+		return -EINVAL;
+
 	switch (var->bits_per_pixel) {
 	case 8:
 		var->red.offset = var->green.offset = var->blue.offset = 0;
From: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>

stable inclusion
from stable-v5.10.137
commit 1a4b18b1ff11ba26f9a852019d674fde9d1d1cff
category: bugfix
bugzilla: 187457, https://gitee.com/src-openeuler/kernel/issues/I5MEZD
CVE: CVE-2022-2586
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 470ee20e069a6d05ae549f7d0ef2bdbcee6a81b2 upstream.
When doing lookups for sets in the same batch using their ID, a set from a different table can be used.
Then, when the table is removed, a reference to the set may be kept after the set is freed, leading to a potential use-after-free.
When looking for sets by ID, use the table that was used for the lookup by name, and only return sets belonging to that same table.
This fixes CVE-2022-2586, also reported as ZDI-CAN-17470.
Reported-by: Team Orca of Sea Security (@seasecresponse)
Fixes: 958bee14d071 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 net/netfilter/nf_tables_api.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 560a93aad5b6..7b0497da2d94 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3638,6 +3638,7 @@ static struct nft_set *nft_set_lookup_byhandle(const struct nft_table *table,
 }
 
 static struct nft_set *nft_set_lookup_byid(const struct net *net,
+					   const struct nft_table *table,
 					   const struct nlattr *nla, u8 genmask)
 {
 	struct nft_trans *trans;
@@ -3648,6 +3649,7 @@ static struct nft_set *nft_set_lookup_byid(const struct net *net,
 			struct nft_set *set = nft_trans_set(trans);
 
 			if (id == nft_trans_set_id(trans) &&
+			    set->table == table &&
 			    nft_active_genmask(set, genmask))
 				return set;
 		}
@@ -3668,7 +3670,7 @@ struct nft_set *nft_set_lookup_global(const struct net *net,
 		if (!nla_set_id)
 			return set;
 
-		set = nft_set_lookup_byid(net, nla_set_id, genmask);
+		set = nft_set_lookup_byid(net, table, nla_set_id, genmask);
 	}
 	return set;
 }
From: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>

stable inclusion
from stable-v5.10.137
commit 9e7dcb88ec8e85e4a8ad0ea494ea2f90f32d2583
category: bugfix
bugzilla: 187457, https://gitee.com/src-openeuler/kernel/issues/I5MEZD
CVE: CVE-2022-2586
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 95f466d22364a33d183509629d0879885b4f547e upstream.
When doing lookups for chains in the same batch using their ID, a chain from a different table can be used. If a rule is added to one table but refers to a chain in a different table, it will be linked to the chain in table2, while having expressions referring to objects in table1.

Then, when table1 is removed, the rule will not be removed, as it is linked to a chain in table2. When expressions in the rule are processed or removed, that will lead to a use-after-free.
When looking for chains by ID, use the table that was used for the lookup by name, and only return chains belonging to that same table.
Fixes: 837830a4b439 ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 net/netfilter/nf_tables_api.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 7b0497da2d94..373f9bd3fbb1 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2265,6 +2265,7 @@ static int nf_tables_updchain(struct nft_ctx *ctx, u8 genmask, u8 policy,
 }
 
 static struct nft_chain *nft_chain_lookup_byid(const struct net *net,
+					       const struct nft_table *table,
 					       const struct nlattr *nla)
 {
 	u32 id = ntohl(nla_get_be32(nla));
@@ -2274,6 +2275,7 @@ static struct nft_chain *nft_chain_lookup_byid(const struct net *net,
 		struct nft_chain *chain = trans->ctx.chain;
 
 		if (trans->msg_type == NFT_MSG_NEWCHAIN &&
+		    chain->table == table &&
 		    id == nft_trans_chain_id(trans))
 			return chain;
 	}
@@ -3199,7 +3201,7 @@ static int nf_tables_newrule(struct net *net, struct sock *nlsk,
 			return -EOPNOTSUPP;
 		}
 	} else if (nla[NFTA_RULE_CHAIN_ID]) {
-		chain = nft_chain_lookup_byid(net, nla[NFTA_RULE_CHAIN_ID]);
+		chain = nft_chain_lookup_byid(net, table, nla[NFTA_RULE_CHAIN_ID]);
 		if (IS_ERR(chain)) {
 			NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN_ID]);
 			return PTR_ERR(chain);
@@ -8641,7 +8643,7 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
 					 tb[NFTA_VERDICT_CHAIN], genmask);
 	} else if (tb[NFTA_VERDICT_CHAIN_ID]) {
-		chain = nft_chain_lookup_byid(ctx->net,
+		chain = nft_chain_lookup_byid(ctx->net, ctx->table,
 					      tb[NFTA_VERDICT_CHAIN_ID]);
 		if (IS_ERR(chain))
 			return PTR_ERR(chain);
From: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>

stable inclusion
from stable-v5.10.137
commit 0cc5c6b7567dd4c25b0e92f6be55ed631a36f4cd
category: bugfix
bugzilla: 187457, https://gitee.com/src-openeuler/kernel/issues/I5MEZD
CVE: CVE-2022-2586
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 36d5b2913219ac853908b0f1c664345e04313856 upstream.
When doing lookups for rules in the same batch using their ID, a rule from a different chain can be used. If a rule is added to one chain but positioned next to a rule from a different chain, it will be linked to chain2, while the use counter on chain1 would be the one incremented.
When looking for rules by ID, use the chain that was used for the lookup by name. The chain used in the context copied to the transaction needs to match that same chain. That way, struct nft_rule does not need to get enlarged with another member.
Fixes: 1a94e38d254b ("netfilter: nf_tables: add NFTA_RULE_ID attribute")
Fixes: 75dd48e2e420 ("netfilter: nf_tables: Support RULE_ID reference in new rule")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 net/netfilter/nf_tables_api.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 373f9bd3fbb1..6a26a9cfdb58 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3156,6 +3156,7 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
 }
 
 static struct nft_rule *nft_rule_lookup_byid(const struct net *net,
+					     const struct nft_chain *chain,
 					     const struct nlattr *nla);
 
 #define NFT_RULE_MAXEXPRS 128
@@ -3243,7 +3244,7 @@ static int nf_tables_newrule(struct net *net, struct sock *nlsk,
 				return PTR_ERR(old_rule);
 			}
 		} else if (nla[NFTA_RULE_POSITION_ID]) {
-			old_rule = nft_rule_lookup_byid(net, nla[NFTA_RULE_POSITION_ID]);
+			old_rule = nft_rule_lookup_byid(net, chain, nla[NFTA_RULE_POSITION_ID]);
 			if (IS_ERR(old_rule)) {
 				NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_POSITION_ID]);
 				return PTR_ERR(old_rule);
@@ -3382,6 +3383,7 @@ static int nf_tables_newrule(struct net *net, struct sock *nlsk,
 }
 
 static struct nft_rule *nft_rule_lookup_byid(const struct net *net,
+					     const struct nft_chain *chain,
 					     const struct nlattr *nla)
 {
 	u32 id = ntohl(nla_get_be32(nla));
@@ -3391,6 +3393,7 @@ static struct nft_rule *nft_rule_lookup_byid(const struct net *net,
 		struct nft_rule *rule = nft_trans_rule(trans);
 
 		if (trans->msg_type == NFT_MSG_NEWRULE &&
+		    trans->ctx.chain == chain &&
 		    id == nft_trans_rule_id(trans))
 			return rule;
 	}
@@ -3439,7 +3442,7 @@ static int nf_tables_delrule(struct net *net, struct sock *nlsk,
 
 		err = nft_delrule(&ctx, rule);
 	} else if (nla[NFTA_RULE_ID]) {
-		rule = nft_rule_lookup_byid(net, nla[NFTA_RULE_ID]);
+		rule = nft_rule_lookup_byid(net, chain, nla[NFTA_RULE_ID]);
 		if (IS_ERR(rule)) {
 			NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_ID]);
 			return PTR_ERR(rule);
From: David Leadbeater <dgl@dgl.cx>

mainline inclusion
from mainline-v6.0-rc6
commit e8d5dfd1d8747b56077d02664a8838c71ced948e
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5OWZ7
CVE: CVE-2022-2663
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=e8...
---------------------------
CTCP messages should only be at the start of an IRC message, not anywhere within it.
While the helper only decodes packets in the ORIGINAL direction, it's possible to make a client send a CTCP message back by embedding one into a PING request. As is, that's enough to make the helper believe that it saw a CTCP message.
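For illustration, a stand-alone sketch of the stricter parse this patch introduces: a DCC request must appear at the start of the message text (after " :") inside a PRIVMSG, so "PING :\1DCC ..." no longer matches. Simplified from the conntrack helper; buffer handling is illustrative only:

  #include <stdio.h>
  #include <string.h>
  #include <strings.h>

  static int is_dcc_request(const char *line)
  {
          const char *msg;

          while (*line == ' ' || *line == '\r' || *line == '\n')
                  line++;                 /* skip leading whitespace */

          if (strncasecmp(line, "PRIVMSG ", 8))
                  return 0;               /* CTCP only matters inside PRIVMSG */

          msg = strstr(line + 8, " :");   /* " :" starts the message text */
          if (!msg)
                  return 0;

          return !memcmp(msg + 2, "\1DCC ", 5);   /* DCC must come first */
  }

  int main(void)
  {
          printf("%d\n", is_dcc_request("PRIVMSG alice :\1DCC CHAT chat 1 2\1"));  /* 1 */
          printf("%d\n", is_dcc_request("PING :\1DCC CHAT chat 1 2\1"));           /* 0 */
          printf("%d\n", is_dcc_request("PRIVMSG alice :hi \1DCC CHAT c 1 2\1"));  /* 0 */
          return 0;
  }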
Fixes: 869f37d8e48f ("[NETFILTER]: nf_conntrack/nf_nat: add IRC helper port")
Signed-off-by: David Leadbeater <dgl@dgl.cx>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 net/netfilter/nf_conntrack_irc.c | 34 ++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/net/netfilter/nf_conntrack_irc.c b/net/netfilter/nf_conntrack_irc.c
index e40988a2f22f..c745dd19d42f 100644
--- a/net/netfilter/nf_conntrack_irc.c
+++ b/net/netfilter/nf_conntrack_irc.c
@@ -148,15 +148,37 @@ static int help(struct sk_buff *skb, unsigned int protoff,
 	data = ib_ptr;
 	data_limit = ib_ptr + skb->len - dataoff;
 
-	/* strlen("\1DCC SENT t AAAAAAAA P\1\n")=24
-	 * 5+MINMATCHLEN+strlen("t AAAAAAAA P\1\n")=14 */
-	while (data < data_limit - (19 + MINMATCHLEN)) {
-		if (memcmp(data, "\1DCC ", 5)) {
+	/* Skip any whitespace */
+	while (data < data_limit - 10) {
+		if (*data == ' ' || *data == '\r' || *data == '\n')
+			data++;
+		else
+			break;
+	}
+
+	/* strlen("PRIVMSG x ")=10 */
+	if (data < data_limit - 10) {
+		if (strncasecmp("PRIVMSG ", data, 8))
+			goto out;
+		data += 8;
+	}
+
+	/* strlen(" :\1DCC SENT t AAAAAAAA P\1\n")=26
+	 * 7+MINMATCHLEN+strlen("t AAAAAAAA P\1\n")=26
+	 */
+	while (data < data_limit - (21 + MINMATCHLEN)) {
+		/* Find first " :", the start of message */
+		if (memcmp(data, " :", 2)) {
 			data++;
 			continue;
 		}
+		data += 2;
+
+		/* then check that place only for the DCC command */
+		if (memcmp(data, "\1DCC ", 5))
+			goto out;
 		data += 5;
-		/* we have at least (19+MINMATCHLEN)-5 bytes valid data left */
+		/* we have at least (21+MINMATCHLEN)-(2+5) bytes valid data left */
 
 		iph = ip_hdr(skb);
 		pr_debug("DCC found in master %pI4:%u %pI4:%u\n",
@@ -172,7 +194,7 @@ static int help(struct sk_buff *skb, unsigned int protoff,
 			pr_debug("DCC %s detected\n", dccprotos[i]);
 
 			/* we have at least
-			 * (19+MINMATCHLEN)-5-dccprotos[i].matchlen bytes valid
+			 * (21+MINMATCHLEN)-7-dccprotos[i].matchlen bytes valid
 			 * data left (== 14/13 bytes) */
 			if (parse_dcc(data, data_limit, &dcc_ip, &dcc_port,
 				      &addr_beg_p, &addr_end_p)) {
From: Pablo Neira Ayuso <pablo@netfilter.org>

stable inclusion
from stable-v5.10.140
commit c08a104a8bce832f6e7a4e8d9ac091777b9982ea
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5PEDR?from=project-issue
CVE: CVE-2022-39190
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit e02f0d3970404bfea385b6edb86f2d936db0ea2b ]
Update nft_data_init() to report EINVAL if chain is already bound.
Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Reported-by: Gwangun Jung <exsociety@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Conflicts: net/netfilter/nf_tables_api.c
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 net/netfilter/nf_tables_api.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 6a26a9cfdb58..44b6d7de8100 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -8658,6 +8658,8 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
 			return PTR_ERR(chain);
 		if (nft_is_base_chain(chain))
 			return -EOPNOTSUPP;
+		if (nft_chain_is_bound(chain))
+			return -EINVAL;
 
 		chain->use++;
 		data->verdict.chain = chain;
From: liubo <liubo254@huawei.com>

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
-------------------------------------------------
etmem, the memory vertical expansion technology, uses DRAM and new high-performance storage media to form multi-level memory storage.

The etmem feature was introduced in a previous commit (aa7f1d222cdab88f12e6d889437fed6571dec824), but only the config options for the etmem_swap and etmem_scan modules were added; no config option existed for the etmem feature itself. This commit adds the CONFIG_ETMEM option for the etmem feature.
Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/Kconfig | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 27c0b9de6357..5e1175da720e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -514,18 +514,28 @@ config MEMCG_QOS
 
 config ETMEM_SCAN
 	tristate "module: etmem page scan for etmem support"
-	depends on MMU
-	depends on X86 || ARM64
+	depends on ETMEM
 	help
 	  etmem page scan feature
 	  used to scan the virtual address of the target process
 
 config ETMEM_SWAP
 	tristate "module: etmem page swap for etmem support"
+	depends on ETMEM
+	help
+	  etmem page swap feature
+
+config ETMEM
+	bool "Enable etmem feature"
 	depends on MMU
 	depends on X86 || ARM64
+	default n
 	help
-	  etmem page swap feature
+	  etmem is a tiered memory extension technology that uses DRAM and memory
+	  compression/high-performance storage media to form tiered memory storage.
+	  Memory data is tiered, and cold data is migrated from memory media to
+	  high-performance storage media to release memory space and reduce
+	  memory costs.
 
 config USERSWAP
 	bool "Enable User Swap"
From: liubo <liubo254@huawei.com>

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA
--------------------------------
enable CONFIG_ETMEM by default.
Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Chao Liu <liuchao173@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/configs/openeuler_defconfig | 1 +
 arch/x86/configs/openeuler_defconfig   | 1 +
 2 files changed, 2 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 92b33d2da470..bc2b8a825090 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -1077,6 +1077,7 @@ CONFIG_SHRINK_PAGECACHE=y
 CONFIG_MEMCG_QOS=y
 CONFIG_ETMEM_SCAN=m
 CONFIG_ETMEM_SWAP=m
+CONFIG_ETMEM=y
 CONFIG_USERSWAP=y
 CONFIG_CMA=y
 # CONFIG_CMA_DEBUG is not set
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index f013c7b95881..0009d2ab79e6 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -1020,6 +1020,7 @@ CONFIG_SHRINK_PAGECACHE=y
 CONFIG_MEMCG_QOS=y
 CONFIG_ETMEM_SCAN=m
 CONFIG_ETMEM_SWAP=m
+CONFIG_ETMEM=y
 CONFIG_USERSWAP=y
 # CONFIG_CMA is not set
 CONFIG_MEM_SOFT_DIRTY=y
From: liubo <liubo254@huawei.com>

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
-------------------------------------------------
Add CONFIG_ETMEM macro guards around the etmem feature code.
Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 fs/proc/base.c       | 4 ++++
 fs/proc/internal.h   | 2 ++
 fs/proc/task_mmu.c   | 2 ++
 include/linux/swap.h | 2 ++
 mm/vmscan.c          | 2 ++
 5 files changed, 12 insertions(+)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b9052be86e8d..9b4666e757f0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3338,6 +3338,8 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+#endif
+#ifdef CONFIG_ETMEM
 	REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 	REG("swap_pages", S_IWUSR, proc_mm_swap_operations),
 #endif
@@ -3689,6 +3691,8 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+#endif
+#ifdef CONFIG_ETMEM
 	REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 	REG("swap_pages", S_IWUSR, proc_mm_swap_operations),
 #endif
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index d1fdb722f0ca..104945bbeb9f 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -304,8 +304,10 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+#ifdef CONFIG_ETMEM
 extern const struct file_operations proc_mm_idle_operations;
 extern const struct file_operations proc_mm_swap_operations;
+#endif
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7b8a513d9f69..aee74a4074d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1855,6 +1855,7 @@ const struct file_operations proc_pagemap_operations = {
 	.release	= pagemap_release,
 };
 
+#ifdef CONFIG_ETMEM
 static DEFINE_SPINLOCK(scan_lock);
 
 static int page_scan_lock(struct file *file, int is_lock, struct file_lock *flock)
@@ -2037,6 +2038,7 @@ const struct file_operations proc_mm_swap_operations = {
 	.open		= mm_swap_open,
 	.release	= mm_swap_release,
 };
+#endif
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2b68047db2d9..e7b67e9124d0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -384,9 +384,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 
 extern unsigned long reclaim_pages(struct list_head *page_list);
+#ifdef CONFIG_ETMEM
 extern int add_page_for_swap(struct page *page, struct list_head *pagelist);
 extern struct page *get_page_from_vaddr(struct mm_struct *mm,
 					unsigned long vaddr);
+#endif
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d96f52b2fbe0..9a27b585e8ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4539,6 +4539,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
 
+#ifdef CONFIG_ETMEM
 int add_page_for_swap(struct page *page, struct list_head *pagelist)
 {
 	int err = -EBUSY;
@@ -4593,6 +4594,7 @@ struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr)
 	return page;
 }
 EXPORT_SYMBOL_GPL(get_page_from_vaddr);
+#endif
 
 #ifdef CONFIG_PAGE_CACHE_LIMIT
 unsigned long page_cache_shrink_memory(unsigned long nr_to_reclaim,
From: liubo <liubo254@huawei.com>

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA
-------------------------------------------------
etmem, the memory vertical expansion technology, uses DRAM and new high-performance storage media to form multi-level memory storage. By grading the stored data, etmem migrates cold data from the memory medium to the high-performance storage medium, so as to expand memory capacity and reduce memory cost.
When the memory expansion feature etmem is running, the kernel's native swap function needs to be disabled in certain scenarios to avoid interference from kernel swap.

This patch provides a switch for that purpose.
The /sys/kernel/mm/swap/ directory provides the kernel_swap_enable sys interface to enable or disable the native swap function of the kernel.
The default value of /sys/kernel/mm/swap/kernel_swap_enable is true, that is, kernel swap is enabled by default.
Turn on kernel swap: echo true > /sys/kernel/mm/swap/kernel_swap_enable
Turn off kernel swap: echo false > /sys/kernel/mm/swap/kernel_swap_enable
Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/swap.h |  5 ++++-
 mm/swap_state.c      | 37 +++++++++++++++++++++++++++++++++++++
 mm/vmscan.c          | 27 +++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e7b67e9124d0..83a0a210a6f5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -454,7 +454,6 @@ extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 				struct vm_fault *vmf);
 extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 				struct vm_fault *vmf);
-
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -731,5 +730,9 @@ static inline bool mem_cgroup_swap_full(struct page *page)
 }
 #endif
 
+#ifdef CONFIG_ETMEM
+extern bool kernel_swap_enabled(void);
+#endif
+
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 149f46781061..d58cbf4fe27f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -39,6 +39,9 @@ static const struct address_space_operations swap_aops = {
 struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static bool enable_vma_readahead __read_mostly = true;
+#ifdef CONFIG_ETMEM
+static bool enable_kernel_swap __read_mostly = true;
+#endif
 
 #define SWAP_RA_WIN_SHIFT	(PAGE_SHIFT / 2)
 #define SWAP_RA_HITS_MASK	((1UL << SWAP_RA_WIN_SHIFT) - 1)
@@ -349,6 +352,13 @@ static inline bool swap_use_vma_readahead(void)
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
 }
 
+#ifdef CONFIG_ETMEM
+bool kernel_swap_enabled(void)
+{
+	return READ_ONCE(enable_kernel_swap);
+}
+#endif
+
 /*
  * Lookup a swap entry in the swap cache. A found page will be returned
  * unlocked and with its refcount incremented - we rely on the kernel
@@ -909,8 +919,35 @@ static struct kobj_attribute vma_ra_enabled_attr =
 	__ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show,
 	       vma_ra_enabled_store);
 
+#ifdef CONFIG_ETMEM
+static ssize_t kernel_swap_enable_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n", enable_kernel_swap ? "true" : "false");
+}
+static ssize_t kernel_swap_enable_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+		WRITE_ONCE(enable_kernel_swap, true);
+	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+		WRITE_ONCE(enable_kernel_swap, false);
+	else
+		return -EINVAL;
+
+	return count;
+}
+static struct kobj_attribute kernel_swap_enable_attr =
+	__ATTR(kernel_swap_enable, 0644, kernel_swap_enable_show,
+	       kernel_swap_enable_store);
+#endif
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+#ifdef CONFIG_ETMEM
+	&kernel_swap_enable_attr.attr,
+#endif
 	NULL,
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a27b585e8ba..24de4ddc86d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3457,6 +3457,18 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	return false;
 }
 
+#ifdef CONFIG_ETMEM
+/*
+ * Check if original kernel swap is enabled
+ * turn off kernel swap, but leave page cache reclaim on
+ */
+static inline void kernel_swap_check(struct scan_control *sc)
+{
+	if (sc != NULL && !kernel_swap_enabled())
+		sc->may_swap = 0;
+}
+#endif
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -3473,6 +3485,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_swap = 1,
 	};
 
+#ifdef CONFIG_ETMEM
+	kernel_swap_check(&sc);
+#endif
 	/*
 	 * scan_control uses s8 fields for order, priority, and reclaim_idx.
 	 * Confirm they are large enough for max values.
@@ -3869,6 +3884,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
 		sc.may_swap = !nr_boost_reclaim;
 
+#ifdef CONFIG_ETMEM
+		kernel_swap_check(&sc);
+#endif
+
 		/*
 		 * Do some background aging of the anon list, to give
 		 * pages a chance to be referenced before reclaiming. All
@@ -4245,6 +4264,10 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	noreclaim_flag = memalloc_noreclaim_save();
 	set_task_reclaim_state(current, &sc.reclaim_state);
 
+#ifdef CONFIG_ETMEM
+	kernel_swap_check(&sc);
+#endif
+
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	set_task_reclaim_state(current, NULL);
@@ -4409,6 +4432,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	cond_resched();
 	psi_memstall_enter(&pflags);
 	fs_reclaim_acquire(sc.gfp_mask);
+
+#ifdef CONFIG_ETMEM
+	kernel_swap_check(&sc);
+#endif
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
 	 * and we also need to be able to write out pages for RECLAIM_WRITE
From: liubo <liubo254@huawei.com>

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA
-------------------------------------------------
etmem is the memory vertical expansion technology.

In the current etmem process, memory page swapping is implemented by invoking shrink_page_list(). When this interface is invoked for the first time, pages are added to the swap cache and written to disk. The swap cache page is reclaimed only when this interface is invoked a second time and no process has accessed the page. However, in the etmem process, user space scans pages that have been accessed, and no migration is issued for pages that are not accessed by processes. Therefore, the swap cache may stay occupied indefinitely. To solve this problem, add logic for actively reclaiming the swap cache: when the swap cache occupies a large amount of memory, the kernel proactively scans the LRU lists and reclaims swap-cache pages to keep memory usage within the specified range.
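For clarity, a hypothetical user-space sketch of driving the new controls; the ioctl values mirror the definitions this patch adds to fs/proc/etmem_swap.c, and the pid in the path is made up:

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>

  #define RECLAIM_SWAPCACHE_MAGIC 0x77
  #define SET_SWAPCACHE_WMARK   _IOW(RECLAIM_SWAPCACHE_MAGIC, 0x02, unsigned int)
  #define RECLAIM_SWAPCACHE_ON  _IOW(RECLAIM_SWAPCACHE_MAGIC, 0x01, unsigned int)
  #define RECLAIM_SWAPCACHE_OFF _IOW(RECLAIM_SWAPCACHE_MAGIC, 0x00, unsigned int)

  int main(void)
  {
          /* low watermark 20% of RAM in bits 0-7, high watermark 30% in bits 8-15 */
          unsigned int ratio = 20 | (30 << 8);
          int fd = open("/proc/1234/swap_pages", O_RDWR);   /* hypothetical pid */

          if (fd < 0)
                  return 1;
          if (ioctl(fd, SET_SWAPCACHE_WMARK, &ratio) < 0)
                  perror("SET_SWAPCACHE_WMARK");
          if (ioctl(fd, RECLAIM_SWAPCACHE_ON, 0) < 0)       /* wake the reclaim thread */
                  perror("RECLAIM_SWAPCACHE_ON");
          close(fd);
          return 0;
  }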
Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 fs/proc/etmem_swap.c | 175 +++++++++++++++++++++++-
 fs/proc/task_mmu.c   |   8 ++
 include/linux/list.h |  17 +++
 include/linux/swap.h |  35 ++++-
 mm/swap_state.c      |   1 +
 mm/vmscan.c          | 312 ++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 542 insertions(+), 6 deletions(-)
diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c
index f9f796cfaf97..0e0a5225e301 100644
--- a/fs/proc/etmem_swap.c
+++ b/fs/proc/etmem_swap.c
@@ -10,6 +10,24 @@
 #include <linux/mempolicy.h>
 #include <linux/uaccess.h>
 #include <linux/delay.h>
+#include <linux/numa.h>
+#include <linux/freezer.h>
+#include <linux/kthread.h>
+#include <linux/mm_inline.h>
+
+#define RECLAIM_SWAPCACHE_MAGIC	0X77
+#define SET_SWAPCACHE_WMARK	_IOW(RECLAIM_SWAPCACHE_MAGIC, 0x02, unsigned int)
+#define RECLAIM_SWAPCACHE_ON	_IOW(RECLAIM_SWAPCACHE_MAGIC, 0x01, unsigned int)
+#define RECLAIM_SWAPCACHE_OFF	_IOW(RECLAIM_SWAPCACHE_MAGIC, 0x00, unsigned int)
+
+#define WATERMARK_MAX		100
+#define SWAP_SCAN_NUM_MAX	32
+
+static struct task_struct *reclaim_swapcache_tk;
+static bool enable_swapcache_reclaim;
+static unsigned long swapcache_watermark[ETMEM_SWAPCACHE_NR_WMARK];
+
+static DECLARE_WAIT_QUEUE_HEAD(reclaim_queue);
 
 static ssize_t swap_pages_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
@@ -45,7 +63,7 @@ static ssize_t swap_pages_write(struct file *file, const char __user *buf,
 		ret = kstrtoul(p, 16, &vaddr);
 		if (ret != 0)
 			continue;
-		/*If get page struct failed, ignore it, get next page*/
+		/* If get page struct failed, ignore it, get next page */
 		page = get_page_from_vaddr(mm, vaddr);
 		if (!page)
 			continue;
@@ -78,9 +96,153 @@ static int swap_pages_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+/* check if swapcache meet requirements */
+static bool swapcache_balanced(void)
+{
+	return total_swapcache_pages() < swapcache_watermark[ETMEM_SWAPCACHE_WMARK_HIGH];
+}
+
+/* the flag present if swapcache reclaim is started */
+static bool swapcache_reclaim_enabled(void)
+{
+	return READ_ONCE(enable_swapcache_reclaim);
+}
+
+static void start_swapcache_reclaim(void)
+{
+	if (swapcache_balanced())
+		return;
+	/* RECLAIM_SWAPCACHE_ON trigger the thread to start running. */
+	if (!waitqueue_active(&reclaim_queue))
+		return;
+
+	WRITE_ONCE(enable_swapcache_reclaim, true);
+	wake_up_interruptible(&reclaim_queue);
+}
+
+static void stop_swapcache_reclaim(void)
+{
+	WRITE_ONCE(enable_swapcache_reclaim, false);
+}
+
+static bool should_goto_sleep(void)
+{
+	if (swapcache_balanced())
+		stop_swapcache_reclaim();
+
+	if (swapcache_reclaim_enabled())
+		return false;
+
+	return true;
+}
+
+static int get_swapcache_watermark(unsigned int ratio)
+{
+	unsigned int low_watermark;
+	unsigned int high_watermark;
+
+	low_watermark = ratio & 0xFF;
+	high_watermark = (ratio >> 8) & 0xFF;
+	if (low_watermark > WATERMARK_MAX ||
+	    high_watermark > WATERMARK_MAX ||
+	    low_watermark > high_watermark)
+		return -EPERM;
+
+	swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW] = totalram_pages() *
+						low_watermark / WATERMARK_MAX;
+	swapcache_watermark[ETMEM_SWAPCACHE_WMARK_HIGH] = totalram_pages() *
+						high_watermark / WATERMARK_MAX;
+
+	return 0;
+}
 
 extern struct file_operations proc_swap_pages_operations;
 
+static void reclaim_swapcache_try_to_sleep(void)
+{
+	DEFINE_WAIT(wait);
+
+	if (freezing(current) || kthread_should_stop())
+		return;
+
+	prepare_to_wait(&reclaim_queue, &wait, TASK_INTERRUPTIBLE);
+	if (should_goto_sleep()) {
+		if (!kthread_should_stop())
+			schedule();
+	}
+	finish_wait(&reclaim_queue, &wait);
+}
+
+static void etmem_reclaim_swapcache(void)
+{
+	do_swapcache_reclaim(swapcache_watermark,
+			     ARRAY_SIZE(swapcache_watermark));
+	stop_swapcache_reclaim();
+}
+
+static int reclaim_swapcache_proactive(void *para)
+{
+	set_freezable();
+
+	while (1) {
+		bool ret;
+
+		reclaim_swapcache_try_to_sleep();
+		ret = try_to_freeze();
+		if (kthread_should_stop())
+			break;
+
+		if (ret)
+			continue;
+
+		etmem_reclaim_swapcache();
+	}
+
+	return 0;
+}
+
+static int reclaim_swapcache_run(void)
+{
+	int ret = 0;
+
+	reclaim_swapcache_tk = kthread_run(reclaim_swapcache_proactive, NULL,
+					   "etmem_recalim_swapcache");
+	if (IS_ERR(reclaim_swapcache_tk)) {
+		ret = PTR_ERR(reclaim_swapcache_tk);
+		reclaim_swapcache_tk = NULL;
+	}
+	return ret;
+}
+
+static long swap_page_ioctl(struct file *filp, unsigned int cmd,
+			    unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	unsigned int ratio;
+
+	switch (cmd) {
+	case RECLAIM_SWAPCACHE_ON:
+		if (swapcache_reclaim_enabled())
+			return 0;
+		start_swapcache_reclaim();
+		break;
+	case RECLAIM_SWAPCACHE_OFF:
+		stop_swapcache_reclaim();
+		break;
+	case SET_SWAPCACHE_WMARK:
+		if (get_user(ratio, (unsigned int __user *)argp))
+			return -EFAULT;
+
+		if (get_swapcache_watermark(ratio) != 0)
+			return -EFAULT;
+		break;
+	default:
+		return -EPERM;
+	}
+
+	return 0;
+}
+
 static int swap_pages_entry(void)
 {
 	proc_swap_pages_operations.flock(NULL, 1, NULL);
@@ -88,8 +250,12 @@ static int swap_pages_entry(void)
 	proc_swap_pages_operations.write = swap_pages_write;
 	proc_swap_pages_operations.open = swap_pages_open;
 	proc_swap_pages_operations.release = swap_pages_release;
+	proc_swap_pages_operations.unlocked_ioctl = swap_page_ioctl;
 	proc_swap_pages_operations.flock(NULL, 0, NULL);
 
+	enable_swapcache_reclaim = false;
+	reclaim_swapcache_run();
+
 	return 0;
 }
 
@@ -100,7 +266,14 @@ static void swap_pages_exit(void)
 	proc_swap_pages_operations.write = NULL;
 	proc_swap_pages_operations.open = NULL;
 	proc_swap_pages_operations.release = NULL;
+	proc_swap_pages_operations.unlocked_ioctl = NULL;
 	proc_swap_pages_operations.flock(NULL, 0, NULL);
+
+	if (!IS_ERR(reclaim_swapcache_tk)) {
+		kthread_stop(reclaim_swapcache_tk);
+		reclaim_swapcache_tk = NULL;
+	}
+
 	return;
 }
 
 MODULE_LICENSE("GPL");
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index aee74a4074d4..c61a3fbbfd71 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2032,11 +2032,19 @@ static int mm_swap_release(struct inode *inode, struct file *file)
 	return ret;
 }
 
+static long mm_swap_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	if (proc_swap_pages_operations.unlocked_ioctl)
+		return proc_swap_pages_operations.unlocked_ioctl(filp, cmd, arg);
+	return 0;
+}
+
 const struct file_operations proc_mm_swap_operations = {
 	.llseek		= mem_lseek,
 	.write		= mm_swap_write,
 	.open		= mm_swap_open,
 	.release	= mm_swap_release,
+	.unlocked_ioctl = mm_swap_ioctl,
 };
 #endif
 #endif /* CONFIG_PROC_PAGE_MONITOR */
diff --git a/include/linux/list.h b/include/linux/list.h
index a18c87b63376..fa9f691f2553 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -764,6 +764,23 @@ static inline void list_splice_tail_init(struct list_head *list,
 	     !list_entry_is_head(pos, head, member); \
 	     pos = n, n = list_prev_entry(n, member))
 
+/**
+ * list_for_each_entry_safe_reverse_from - iterate backwards over list from
+ * current point safe against removal
+ * @pos:	the type * to use as a loop cursor.
+ * @n:		another type * to use as temporary storage
+ * @head:	the head for your list.
+ * @member:	the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type from current point, safe against
+ * removal of list entry.
+ */
+#define list_for_each_entry_safe_reverse_from(pos, n, head, member)	\
+	for (n = list_prev_entry(pos, member);				\
+	     !list_entry_is_head(pos, head, member);			\
+	     pos = n, n = list_prev_entry(n, member))
+
 /**
  * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
  * @pos:	the loop cursor used in the list_for_each_entry_safe loop
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 83a0a210a6f5..e0f69151d2af 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -385,9 +385,40 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
 
 extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_ETMEM
+enum etmem_swapcache_watermark_en {
+	ETMEM_SWAPCACHE_WMARK_LOW,
+	ETMEM_SWAPCACHE_WMARK_HIGH,
+	ETMEM_SWAPCACHE_NR_WMARK
+};
+
 extern int add_page_for_swap(struct page *page, struct list_head *pagelist);
 extern struct page *get_page_from_vaddr(struct mm_struct *mm,
 					unsigned long vaddr);
+extern int do_swapcache_reclaim(unsigned long *swapcache_watermark,
+				unsigned int watermark_nr);
+extern bool kernel_swap_enabled(void);
+#else
+static inline int add_page_for_swap(struct page *page, struct list_head *pagelist)
+{
+	return 0;
+}
+
+static inline struct page *get_page_from_vaddr(struct mm_struct *mm,
+					       unsigned long vaddr)
+{
+	return NULL;
+}
+
+static inline int do_swapcache_reclaim(unsigned long *swapcache_watermark,
+				       unsigned int watermark_nr)
+{
+	return 0;
+}
+
+static inline bool kernel_swap_enabled(void)
+{
+	return true;
+}
 #endif
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
@@ -730,9 +761,5 @@ static inline bool mem_cgroup_swap_full(struct page *page)
 }
 #endif
 
-#ifdef CONFIG_ETMEM
-extern bool kernel_swap_enabled(void);
-#endif
-
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d58cbf4fe27f..69d71c4be7b8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -96,6 +96,7 @@ unsigned long total_swapcache_pages(void)
 	}
 	return ret;
 }
+EXPORT_SYMBOL_GPL(total_swapcache_pages);
 
 static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 24de4ddc86d2..9e76887a84d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4572,7 +4572,7 @@ int add_page_for_swap(struct page *page, struct list_head *pagelist)
 	int err = -EBUSY;
 	struct page *head;
 
-	/*If the page is mapped by more than one process, do not swap it */
+	/* If the page is mapped by more than one process, do not swap it */
 	if (page_mapcount(page) > 1)
 		return -EACCES;
 
@@ -4621,6 +4621,316 @@ struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr)
 	return page;
 }
 EXPORT_SYMBOL_GPL(get_page_from_vaddr);
+
+static int add_page_for_reclaim_swapcache(struct page *page,
+	struct list_head *pagelist, struct lruvec *lruvec, enum lru_list lru)
+{
+	struct page *head;
+
+	/* If the page is mapped by more than one process, do not swap it */
+	if (page_mapcount(page) > 1)
+		return -EACCES;
+
+	if (PageHuge(page))
+		return -EACCES;
+
+	head = compound_head(page);
+
+	switch (__isolate_lru_page_prepare(head, 0)) {
+	case 0:
+		if (unlikely(!get_page_unless_zero(page)))
+			return -1;
+
+		if (!TestClearPageLRU(page)) {
+			/*
+			 * This page may in other isolation path,
+			 * but we still hold lru_lock.
+			 */
+			put_page(page);
+			return -1;
+		}
+
+		list_move(&head->lru, pagelist);
+		update_lru_size(lruvec, lru, page_zonenum(head), -thp_nr_pages(head));
+		break;
+
+	case -EBUSY:
+		return -1;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static unsigned long reclaim_swapcache_pages_from_list(int nid,
+	struct list_head *page_list, unsigned long reclaim_num, bool putback_flag)
+{
+	struct scan_control sc = {
+		.may_unmap = 1,
+		.may_swap = 1,
+		.may_writepage = 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_moved = 0;
+	struct page *page, *next;
+	LIST_HEAD(swap_pages);
+	struct pglist_data *pgdat = NULL;
+	struct reclaim_stat stat;
+
+	pgdat = NODE_DATA(nid);
+
+	if (putback_flag)
+		goto putback_list;
+
+	if (reclaim_num == 0)
+		return 0;
+
+	list_for_each_entry_safe(page, next, page_list, lru) {
+		if (!page_is_file_lru(page) && !__PageMovable(page)
+				&& PageSwapCache(page)) {
+			ClearPageActive(page);
+			list_move(&page->lru, &swap_pages);
+			nr_moved++;
+		}
+
+		if (nr_moved >= reclaim_num)
+			break;
+	}
+
+	/* swap the pages */
+	if (pgdat)
+		nr_reclaimed = shrink_page_list(&swap_pages,
+						pgdat,
+						&sc,
+						&stat, true);
+
+	while (!list_empty(&swap_pages)) {
+		page = lru_to_page(&swap_pages);
+		list_del(&page->lru);
+		putback_lru_page(page);
+	}
+
+	return nr_reclaimed;
+
+putback_list:
+	while (!list_empty(page_list)) {
+		page = lru_to_page(page_list);
+		list_del(&page->lru);
+		putback_lru_page(page);
+	}
+
+	return nr_reclaimed;
+}
+
+#define SWAP_SCAN_NUM_MAX	32
+
+static bool swapcache_below_watermark(unsigned long *swapcache_watermark)
+{
+	return total_swapcache_pages() < swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW];
+}
+
+static unsigned long get_swapcache_reclaim_num(unsigned long *swapcache_watermark)
+{
+	return total_swapcache_pages() >
+		swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW] ?
+		(total_swapcache_pages() - swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW]) : 0;
+}
+
+/*
+ * The main function to reclaim swapcache, the whole reclaim process is
+ * divided into 3 steps.
+ * 1. get the total_swapcache_pages num to reclaim.
+ * 2. scan the LRU linked list of each memory node to obtain the
+ *    swapcache pages that can be reclaimd.
+ * 3. reclaim the swapcache page until the requirements are meet.
+ */
+int do_swapcache_reclaim(unsigned long *swapcache_watermark,
+			 unsigned int watermark_nr)
+{
+	int err = -EINVAL;
+	unsigned long swapcache_to_reclaim = 0;
+	unsigned long nr_reclaimed = 0;
+	unsigned long swapcache_total_reclaimable = 0;
+	unsigned long reclaim_page_count = 0;
+
+	unsigned long *nr = NULL;
+	unsigned long *nr_to_reclaim = NULL;
+	struct list_head *swapcache_list = NULL;
+
+	int nid = 0;
+	struct lruvec *lruvec = NULL;
+	struct list_head *src = NULL;
+	struct page *page = NULL;
+	struct page *next = NULL;
+	struct page *pos = NULL;
+
+	struct mem_cgroup *memcg = NULL;
+	struct mem_cgroup *target_memcg = NULL;
+
+	pg_data_t *pgdat = NULL;
+	unsigned int scan_count = 0;
+	int nid_num = 0;
+
+	if (swapcache_watermark == NULL ||
+	    watermark_nr < ETMEM_SWAPCACHE_NR_WMARK)
+		return err;
+
+	/* get the total_swapcache_pages num to reclaim. */
+	swapcache_to_reclaim = get_swapcache_reclaim_num(swapcache_watermark);
+	if (swapcache_to_reclaim <= 0)
+		return err;
+
+	nr = kcalloc(MAX_NUMNODES, sizeof(unsigned long), GFP_KERNEL);
+	if (nr == NULL)
+		return -ENOMEM;
+
+	nr_to_reclaim = kcalloc(MAX_NUMNODES, sizeof(unsigned long), GFP_KERNEL);
+	if (nr_to_reclaim == NULL) {
+		kfree(nr);
+		return -ENOMEM;
+	}
+
+	swapcache_list = kcalloc(MAX_NUMNODES, sizeof(struct list_head), GFP_KERNEL);
+	if (swapcache_list == NULL) {
+		kfree(nr);
+		kfree(nr_to_reclaim);
+		return -ENOMEM;
+	}
+
+	/*
+	 * scan the LRU linked list of each memory node to obtain the
+	 * swapcache pages that can be reclaimd.
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		INIT_LIST_HEAD(&swapcache_list[nid_num]);
+		cond_resched();
+
+		pgdat = NODE_DATA(nid);
+
+		memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
+		do {
+			cond_resched();
+			pos = NULL;
+			lruvec = mem_cgroup_lruvec(memcg, pgdat);
+			src = &(lruvec->lists[LRU_INACTIVE_ANON]);
+			spin_lock_irq(&lruvec->lru_lock);
+			scan_count = 0;
+
+			/*
+			 * Scan the swapcache pages that are not mapped from
+			 * the end of the LRU linked list, scan SWAP_SCAN_NUM_MAX
+			 * pages each time, and record the scan end point page.
+			 */
+
+			pos = list_last_entry(src, struct page, lru);
+			spin_unlock_irq(&lruvec->lru_lock);
+do_scan:
+			cond_resched();
+			scan_count = 0;
+			spin_lock_irq(&lruvec->lru_lock);
+
+			/*
+			 * check if pos page is been released or not in LRU list, if true,
+			 * cancel the subsequent page scanning of the current node.
+			 */
+			if (!pos || list_entry_is_head(pos, src, lru)) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				continue;
+			}
+
+			if (!PageLRU(pos) || page_lru(pos) != LRU_INACTIVE_ANON) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				continue;
+			}
+
+			page = pos;
+			pos = NULL;
+			/* Continue to scan down from the last scan breakpoint */
+			list_for_each_entry_safe_reverse_from(page, next, src, lru) {
+				scan_count++;
+				pos = next;
+				if (scan_count >= SWAP_SCAN_NUM_MAX)
+					break;
+
+				if (!PageSwapCache(page))
+					continue;
+
+				if (page_mapped(page))
+					continue;
+
+				if (add_page_for_reclaim_swapcache(page,
+					&swapcache_list[nid_num],
+					lruvec, LRU_INACTIVE_ANON) != 0)
+					continue;
+
+				nr[nid_num]++;
+				swapcache_total_reclaimable++;
+			}
+			spin_unlock_irq(&lruvec->lru_lock);
+
+			/*
+			 * Check whether the scanned pages meet
+			 * the reclaim requirements.
+			 */
+			if (swapcache_total_reclaimable <= swapcache_to_reclaim ||
+			    scan_count >= SWAP_SCAN_NUM_MAX)
+				goto do_scan;
+
+		} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
+
+		/* Start reclaiming the next memory node. */
+		nid_num++;
+	}
+
+	/* reclaim the swapcache page until the requirements are meet. */
+	do {
+		nid_num = 0;
+		reclaim_page_count = 0;
+
+		/* start swapcache page reclaim for each node. */
+		for_each_node_state(nid, N_MEMORY) {
+			cond_resched();
+
+			nr_to_reclaim[nid_num] = (swapcache_to_reclaim /
+						  (swapcache_total_reclaimable / nr[nid_num]));
+			reclaim_page_count += reclaim_swapcache_pages_from_list(nid,
+						&swapcache_list[nid_num],
+						nr_to_reclaim[nid_num], false);
+			nid_num++;
+		}
+
+		nr_reclaimed += reclaim_page_count;
+
+		/*
+		 * Check whether the swapcache page reaches the reclaim requirement or
+		 * the number of the swapcache page reclaimd is 0. Stop reclaim.
+		 */
+		if (nr_reclaimed >= swapcache_to_reclaim || reclaim_page_count == 0)
+			goto exit;
+	} while (!swapcache_below_watermark(swapcache_watermark) ||
+		 nr_reclaimed < swapcache_to_reclaim);
+exit:
+	nid_num = 0;
+	/*
+	 * Repopulate the swapcache pages that are not reclaimd back
+	 * to the LRU linked list.
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		cond_resched();
+		reclaim_swapcache_pages_from_list(nid,
+			&swapcache_list[nid_num], 0, true);
+		nid_num++;
+	}
+
+	kfree(nr);
+	kfree(nr_to_reclaim);
+	kfree(swapcache_list);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(do_swapcache_reclaim);
 #endif
#ifdef CONFIG_PAGE_CACHE_LIMIT
From: liubo <liubo254@huawei.com>

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA
-------------------------------------------------
etmem is the memory vertical expansion technology.

By default, the existing memory expansion tool etmem swaps out all of a process's pages that can be swapped out, unless a page is marked with the lock flag.

This patch adds the ability to swap out only specified pages: the process adds the VM_SWAPFLAG flag to the VMAs whose pages should be swapped out, and etmem adds filters to the scanning module so that only those pages are swapped out, as shown in the sketch below.
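For illustration, a hypothetical user-space sketch that tags one region with the new madvise values; the constants mirror this patch's uapi additions, and the mapping size is arbitrary:

  #include <stdio.h>
  #include <sys/mman.h>

  #ifndef MADV_SWAPFLAG
  #define MADV_SWAPFLAG        203     /* for memory to be swapped out */
  #define MADV_SWAPFLAG_REMOVE 204
  #endif

  int main(void)
  {
          size_t len = 4096 * 16;
          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (buf == MAP_FAILED)
                  return 1;

          /* mark the whole VMA as a candidate for etmem swap-out */
          if (madvise(buf, len, MADV_SWAPFLAG) != 0)
                  perror("madvise(MADV_SWAPFLAG)");

          /* ... the etmem scanner now only visits tagged VMAs ... */

          if (madvise(buf, len, MADV_SWAPFLAG_REMOVE) != 0)
                  perror("madvise(MADV_SWAPFLAG_REMOVE)");

          munmap(buf, len);
          return 0;
  }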
Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 fs/proc/etmem_scan.c                   | 11 +++++++++++
 fs/proc/etmem_scan.h                   |  5 ++++-
 fs/proc/etmem_swap.c                   |  1 +
 include/linux/mm.h                     |  4 ++++
 include/uapi/asm-generic/mman-common.h |  3 +++
 mm/madvise.c                           | 17 ++++++++++++++++-
 6 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/fs/proc/etmem_scan.c b/fs/proc/etmem_scan.c
index 8bcb8d3af7c5..e9cad598189a 100644
--- a/fs/proc/etmem_scan.c
+++ b/fs/proc/etmem_scan.c
@@ -1187,6 +1187,11 @@ static int mm_idle_test_walk(unsigned long start, unsigned long end,
 			     struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
+	struct page_idle_ctrl *pic = walk->private;
+
+	/* If the specified page swapout is set, the untagged vma is skipped. */
+	if ((pic->flags & VMA_SCAN_FLAG) && !(vma->vm_flags & VM_SWAPFLAG))
+		return 1;
 
 	if (vma->vm_file) {
 		if (is_vm_hugetlb_page(vma))
@@ -1325,6 +1330,12 @@ static long page_scan_ioctl(struct file *filp, unsigned int cmd, unsigned long a
 	case IDLE_SCAN_REMOVE_FLAGS:
 		filp->f_flags &= ~flags;
 		break;
+	case VMA_SCAN_ADD_FLAGS:
+		filp->f_flags |= flags;
+		break;
+	case VMA_SCAN_REMOVE_FLAGS:
+		filp->f_flags &= ~flags;
+		break;
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/fs/proc/etmem_scan.h b/fs/proc/etmem_scan.h
index 93a6e33f2025..e109f7f350e1 100644
--- a/fs/proc/etmem_scan.h
+++ b/fs/proc/etmem_scan.h
@@ -12,13 +12,16 @@
 #define SCAN_AS_HUGE		0100000000	/* treat normal page as hugepage in vm */
 #define SCAN_IGN_HOST		0200000000	/* ignore host access when scan vm */
 #define VM_SCAN_HOST		0400000000	/* scan and add host page for vm hole(internal) */
+#define VMA_SCAN_FLAG		0x1000		/* scan the specifics vma with flag */
 
 #define ALL_SCAN_FLAGS		(SCAN_HUGE_PAGE | SCAN_SKIM_IDLE | SCAN_DIRTY_PAGE | \
-				 SCAN_AS_HUGE | SCAN_IGN_HOST | VM_SCAN_HOST)
+				 SCAN_AS_HUGE | SCAN_IGN_HOST | VM_SCAN_HOST | VMA_SCAN_FLAG)
 
 #define IDLE_SCAN_MAGIC		0x66
 #define IDLE_SCAN_ADD_FLAGS	_IOW(IDLE_SCAN_MAGIC, 0x0, unsigned int)
 #define IDLE_SCAN_REMOVE_FLAGS	_IOW(IDLE_SCAN_MAGIC, 0x1, unsigned int)
+#define VMA_SCAN_ADD_FLAGS	_IOW(IDLE_SCAN_MAGIC, 0x2, unsigned int)
+#define VMA_SCAN_REMOVE_FLAGS	_IOW(IDLE_SCAN_MAGIC, 0x3, unsigned int)
 
 enum ProcIdlePageType {
 	PTE_ACCESSED,	/* 4k page */
diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c
index 0e0a5225e301..86f5cf8c90a1 100644
--- a/fs/proc/etmem_swap.c
+++ b/fs/proc/etmem_swap.c
@@ -63,6 +63,7 @@ static ssize_t swap_pages_write(struct file *file, const char __user *buf,
 		ret = kstrtoul(p, 16, &vaddr);
 		if (ret != 0)
 			continue;
+
 		/* If get page struct failed, ignore it, get next page */
 		page = get_page_from_vaddr(mm, vaddr);
 		if (!page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d2e8eb2f50fa..839106c9f708 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -326,6 +326,10 @@ extern unsigned int kobjsize(const void *objp);
 
 #define VM_CHECKNODE	0x200000000
 
+#ifdef CONFIG_ETMEM
+#define VM_SWAPFLAG	0x400000000000000	/* memory swap out flag in vma */
+#endif
+
 #ifdef CONFIG_USERSWAP
 /* bit[32:36] is the protection key of intel, so use a large value for VM_USWAP */
 #define VM_USWAP	0x2000000000000000
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index e75b65364dce..898ea134b2f3 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,6 +74,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_SWAPFLAG	203		/* for memory to be swap out */
+#define MADV_SWAPFLAG_REMOVE	204
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 16b1c2885b63..ba666aa1203b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -126,6 +126,14 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out_convert_errno;
 		break;
+#ifdef CONFIG_ETMEM
+	case MADV_SWAPFLAG:
+		new_flags |= VM_SWAPFLAG;
+		break;
+	case MADV_SWAPFLAG_REMOVE:
+		new_flags &= ~VM_SWAPFLAG;
+		break;
+#endif
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -969,9 +977,12 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_ETMEM
+	case MADV_SWAPFLAG:
+	case MADV_SWAPFLAG_REMOVE:
 #endif
 		return true;
-
 	default:
 		return false;
 	}
@@ -1041,6 +1052,10 @@ process_madvise_behavior_valid(int behavior)
  *		easily if memory pressure hanppens.
  *  MADV_PAGEOUT - the application is not expected to use this memory soon,
  *		page out the pages in this range immediately.
+ *  MADV_SWAPFLAG - Used in the etmem memory extension feature, the process
+ *		specifies the memory swap area by adding a flag to a specific
+ *		vma address.
+ *  MADV_SWAPFLAG_REMOVE - remove the specific vma flag
  *
  * return values:
  *  zero    - success
From: Zhen Lei <thunder.leizhen@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5P8OD
CVE: NA
-------------------------------------------------------------------------
The value of 'end' for both for_each_mem_range() and __map_memblock() is 'start + size', not 'start + size - 1'. So if the end value of a memory block is exactly 4G, then:

	if (eflags && (end >= SZ_4G)) {		//end=SZ_4G
		if (start < SZ_4G) {
			... ...
			start = SZ_4G;
		}
	}

	//Now, start=end=SZ_4G, all [start,...) will be mapped
	__map_memblock(pgdp, start, end, ..xxx..);
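For illustration, a minimal sketch (not kernel code, made-up addresses) of why half-open [start, end) ranges need the '>' test and SZ_4G itself as the split point:

  #include <stdio.h>

  #define SZ_4G (4ULL << 30)

  static void split(unsigned long long start, unsigned long long end)
  {
          if (end > SZ_4G && start < SZ_4G) {   /* truly crosses the boundary */
                  printf("map [%llx, %llx) page-level\n", start, SZ_4G);
                  start = SZ_4G;
          }
          if (start < end)
                  printf("map [%llx, %llx) block-level\n", start, end);
  }

  int main(void)
  {
          split(0x80000000ULL, SZ_4G);               /* ends at 4G: one mapping, no split */
          split(0x80000000ULL, SZ_4G + 0x200000ULL); /* crosses 4G: two mappings */
          return 0;
  }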
Fixes: e26eee769978 ("arm64: kdump: Don't force page-level mappings for memory above 4G")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/mm/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e76765354040..a31f2124705e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -523,13 +523,13 @@ static void __init map_mem(pgd_t *pgdp)
 			break;
 
 #ifdef CONFIG_KEXEC_CORE
-		if (eflags && (end >= SZ_4G)) {
+		if (eflags && (end > SZ_4G)) {
 			/*
 			 * The memory block cross the 4G boundary.
 			 * Forcibly use page-level mappings for memory under 4G.
 			 */
 			if (start < SZ_4G) {
-				__map_memblock(pgdp, start, SZ_4G - 1,
+				__map_memblock(pgdp, start, SZ_4G,
 					       pgprot_tagged(PAGE_KERNEL), flags | eflags);
 				start = SZ_4G;
 			}