6.6 performance improvement patches for mm and fs. UnixBench process create improves by 1~3%; lmbench exec/shell/fork improve by 1.4~1.9%, 0.7~2.2%, and 2~3.8% respectively; libmicro fork10/exec/sys improve by 1.5%, 3.9~5.8%, and 3.7~5.3%.
Christian Brauner (4):
  fs: move audit parent inode
  fs: pull up trailing slashes check for O_CREAT
  fs: remove audit dummy context check
  fs: rearrange general fastpath check now that O_CREAT uses it

David Hildenbrand (1):
  mm/rmap: minimize folio->_nr_pages_mapped updates when batching PTE (un)mapping

Jeff Layton (1):
  fs: try an opportunistic lookup for O_CREAT opens too

Liam R. Howlett (1):
  maple_tree: remove rcu_read_lock() from mt_validate()

Mateusz Guzik (1):
  mm: batch unlink_file_vma calls in free_pgd_range

Yu Ma (2):
  fs/file.c: add fast path in find_next_fd()
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
 fs/file.c        | 43 +++++++++++++++++++---------------
 fs/namei.c       | 61 +++++++++++++++++++++++++++++++++++++-----------
 lib/maple_tree.c |  7 ++----
 mm/internal.h    | 10 ++++++++
 mm/memory.c      | 10 ++++++--
 mm/mmap.c        | 41 ++++++++++++++++++++++++++++++++
 mm/rmap.c        | 27 +++++++++++----------
 7 files changed, 145 insertions(+), 54 deletions(-)
From: Yu Ma <yu.ma@intel.com>
next inclusion
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
--------------------------------
Skip the 2-level search via find_next_zero_bit() when there is a free slot in the word containing next_fd, because:
(1) next_fd indicates the lower bound for the first free fd.
(2) find_next_zero_bit() has an internal fast path for size <= 64 that speeds up the search.
(3) After the fdt is expanded (the bitmap size doubles on each expansion), it is never shrunk. The search size keeps increasing even though only a few fds are actually open.
This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed to by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by 8% and write by 4% on an Intel ICX 160-core configuration with v6.10-rc7.
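[Editorial note: to make the "2-level" structure concrete, here is a minimal userspace sketch of the idea -- a simplified model of struct fdtable, not the kernel code. open_fds holds one bit per fd, and full_fds_bits holds one bit per open_fds word, set once that word is fully occupied.]

#include <stdio.h>

#define BITS_PER_LONG	(8 * (unsigned int)sizeof(unsigned long))
#define NR_WORDS	4

static unsigned long open_fds[NR_WORDS];	/* level 2: one bit per fd */
static unsigned long full_fds_bits;		/* level 1: one bit per word above */

static void mark_fd_used(unsigned int fd)
{
	unsigned int word = fd / BITS_PER_LONG;

	open_fds[word] |= 1UL << (fd % BITS_PER_LONG);
	if (open_fds[word] == ~0UL)		/* word now completely full */
		full_fds_bits |= 1UL << word;
}

static unsigned int find_next_fd_sketch(unsigned int start)
{
	unsigned int word = start / BITS_PER_LONG;
	/* Fast path: scan only the word containing start; bits below
	 * start are masked off as if they were taken. */
	unsigned long w = open_fds[word] | ((1UL << (start % BITS_PER_LONG)) - 1);

	if (~w)
		return word * BITS_PER_LONG + (unsigned int)__builtin_ctzl(~w);

	/* Slow path: use level 1 to skip fully-populated words. */
	for (word++; word < NR_WORDS; word++)
		if (!(full_fds_bits & (1UL << word)))
			return word * BITS_PER_LONG +
			       (unsigned int)__builtin_ctzl(~open_fds[word]);
	return NR_WORDS * BITS_PER_LONG;	/* table full: kernel would expand */
}

int main(void)
{
	mark_fd_used(0);
	mark_fd_used(1);
	mark_fd_used(2);
	printf("next free fd: %u\n", find_next_fd_sketch(0));	/* prints 3 */
	return 0;
}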
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-4-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index fdada5d563c3..8ccc7e01ffb3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -482,6 +482,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
 	unsigned int maxbit = maxfd / BITS_PER_LONG;
 	unsigned int bitbit = start / BITS_PER_LONG;
+	unsigned int bit;
+
+	/*
+	 * Try to avoid looking at the second level bitmap
+	 */
+	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+				 start & (BITS_PER_LONG - 1));
+	if (bit < BITS_PER_LONG)
+		return bit + bitbit * BITS_PER_LONG;
 
 	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
 	if (bitbit >= maxfd)
From: Yu Ma <yu.ma@intel.com>
next inclusion
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
--------------------------------
alloc_fd() has a sanity check inside to make sure the struct file mapping to the allocated fd is NULL. Remove this sanity check since it is already assured by the existing zero initialization and by the NULL set when recycling an fd. Meanwhile, add likely/unlikely annotations and avoid the expand_files() call where possible to reduce the work done under file_lock.
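[Editorial note: for context, the likely()/unlikely() hints compile down to __builtin_expect(), as defined in the kernel's include/linux/compiler.h. A toy stand-in for the alloc_fd() pattern (names and values are illustrative only):]

#include <stdio.h>

/* Branch-prediction hints, defined as in include/linux/compiler.h. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* The in-bounds case is the common one, so it is annotated likely();
 * error and expansion cases are unlikely(), keeping the hot path
 * free of taken branches. */
static int alloc_slot(unsigned int fd, unsigned int max_fds, unsigned int end)
{
	if (unlikely(fd >= end))
		return -1;		/* -EMFILE in the kernel */
	if (likely(fd < max_fds))
		return (int)fd;		/* fast path: no table expansion */
	return -2;			/* would call expand_files() */
}

int main(void)
{
	printf("%d %d %d\n", alloc_slot(3, 64, 1024),
	       alloc_slot(100, 64, 1024), alloc_slot(2048, 64, 1024));
	return 0;
}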
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-2-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
	fs/file.c
[Context conflict]
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/file.c | 34 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index 8ccc7e01ffb3..f8a5af0d06f1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -517,7 +517,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (fd < files->next_fd)
 		fd = files->next_fd;
 
-	if (fd < fdt->max_fds)
+	if (likely(fd < fdt->max_fds))
 		fd = find_next_fd(fdt, fd);
 
 	/*
@@ -525,19 +525,22 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	 * will limit the total number of files that can be opened.
 	 */
 	error = -EMFILE;
-	if (fd >= end)
+	if (unlikely(fd >= end))
 		goto out;
 
-	error = expand_files(files, fd);
-	if (error < 0)
-		goto out;
+	if (unlikely(fd >= fdt->max_fds)) {
+		error = expand_files(files, fd);
+		if (error < 0)
+			goto out;
+
+		/*
+		 * If we needed to expand the fs array we
+		 * might have blocked - try again.
+		 */
+		if (error)
+			goto repeat;
+	}
 
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
-		goto repeat;
 	if (files_cg_alloc_fd(files, 1)) {
 		error = -EMFILE;
 		goto out;
@@ -552,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
 
 out:
 	spin_unlock(&files->file_lock);
@@ -623,7 +619,7 @@ void fd_install(unsigned int fd, struct file *file)
 		rcu_read_unlock_sched();
 		spin_lock(&files->file_lock);
 		fdt = files_fdtable(files);
-		BUG_ON(fdt->fd[fd] != NULL);
+		WARN_ON(fdt->fd[fd] != NULL);
 		rcu_assign_pointer(fdt->fd[fd], file);
 		spin_unlock(&files->file_lock);
 		return;
From: Jeff Layton <jlayton@kernel.org>
mainline inclusion
from mainline-v6.12-rc1
commit e747e15156b79efeea0ad056df8de14b93d318c2
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. I assume this was done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists.
This patch rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether, and if auditing isn't enabled, it can stay in rcuwalk mode for the last step_into.
One notable exception that is hopefully temporary: if we're doing an rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the dentry in that case is more expensive than taking the i_rwsem for now.
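[Editorial note: the case being optimized is easy to reproduce from userspace; a minimal demonstration, where the path is a placeholder assumed to already exist:]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* O_CREAT without O_EXCL on an existing file creates nothing,
	 * yet before this patch the kernel still took the parent
	 * directory's exclusive i_rwsem on every such open(). */
	int fd = open("/tmp/existing-file", O_CREAT | O_WRONLY, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);
	return 0;
}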
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240807-openfast-v3-1-040d132d2559@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Conflicts:
	fs/namei.c
[Context conflict]
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/namei.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 64 insertions(+), 10 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index deb67b07776e..7863f457f2e8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3503,6 +3503,49 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	return ERR_PTR(error);
 }
 
+static inline bool trailing_slashes(struct nameidata *nd)
+{
+	return (bool)nd->last.name[nd->last.len];
+}
+
+static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
+{
+	struct dentry *dentry;
+
+	if (open_flag & O_CREAT) {
+		/* Don't bother on an O_EXCL create */
+		if (open_flag & O_EXCL)
+			return NULL;
+
+		/*
+		 * FIXME: If auditing is enabled, then we'll have to unlazy to
+		 * use the dentry. For now, don't do this, since it shifts
+		 * contention from parent's i_rwsem to its d_lockref spinlock.
+		 * Reconsider this once dentry refcounting handles heavy
+		 * contention better.
+		 */
+		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
+			return NULL;
+	}
+
+	if (trailing_slashes(nd))
+		nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+
+	dentry = lookup_fast(nd);
+	if (IS_ERR_OR_NULL(dentry))
+		return dentry;
+
+	if (open_flag & O_CREAT) {
+		/* Discard negative dentries. Need inode_lock to do the create */
+		if (!dentry->d_inode) {
+			if (!(nd->flags & LOOKUP_RCU))
+				dput(dentry);
+			dentry = NULL;
+		}
+	}
+	return dentry;
+}
+
 static const char *open_last_lookups(struct nameidata *nd,
 		   struct file *file, const struct open_flags *op)
 {
@@ -3520,27 +3563,38 @@ static const char *open_last_lookups(struct nameidata *nd,
 		return handle_dots(nd, nd->last_type);
 	}
 
+	/* We _can_ be in RCU mode here */
+	dentry = lookup_fast_for_open(nd, open_flag);
+	if (IS_ERR(dentry))
+		return ERR_CAST(dentry);
+
 	if (!(open_flag & O_CREAT)) {
-		if (nd->last.name[nd->last.len])
-			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
-		/* we _can_ be in RCU mode here */
-		dentry = lookup_fast(nd);
-		if (IS_ERR(dentry))
-			return ERR_CAST(dentry);
 		if (likely(dentry))
 			goto finish_lookup;
 
 		BUG_ON(nd->flags & LOOKUP_RCU);
 	} else {
-		/* create side of things */
 		if (nd->flags & LOOKUP_RCU) {
-			if (!try_to_unlazy(nd))
+			bool unlazied;
+
+			/* can stay in rcuwalk if not auditing */
+			if (dentry && audit_dummy_context()) {
+				if (trailing_slashes(nd))
+					return ERR_PTR(-EISDIR);
+				goto finish_lookup;
+			}
+			unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
+					    try_to_unlazy(nd);
+			if (!unlazied)
 				return ERR_PTR(-ECHILD);
 		}
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
-		/* trailing slashes? */
-		if (unlikely(nd->last.name[nd->last.len]))
+		if (trailing_slashes(nd)) {
+			dput(dentry);
 			return ERR_PTR(-EISDIR);
+		}
+		if (dentry)
+			goto finish_lookup;
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
From: Christian Brauner <brauner@kernel.org>
mainline inclusion
from mainline-v6.12-rc1
commit c65d41c5a5279738fc07f99c0e912b28a691c46f
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
During O_CREAT we unconditionally audit the parent inode. This makes it difficult to support a fastpath for O_CREAT when the file already exists because we have to drop out of RCU lookup needlessly.
We worked around this by checking whether audit was actually active but that's also suboptimal. Instead, move the audit of the parent inode down into lookup_open() at a point where it's mostly certain that the file needs to be created.
This also reduces an inconsistency that currently exists: the audit on the parent is done independent of whether or not the file already existed, while the audit on the file itself is only performed if it has been created.
By moving the audit down a bit we emit the audit a little later but it will allow us to simplify the fastpath for O_CREAT significantly.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 7863f457f2e8..fbbcec1cbb46 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3433,6 +3433,9 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		return dentry;
 	}
 
+	if (open_flag & O_CREAT)
+		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
+
 	/*
 	 * Checking write permission is tricky, bacuse we don't know if we are
 	 * going to actually need it: O_CREAT opens should work as long as the
@@ -3588,7 +3591,6 @@ static const char *open_last_lookups(struct nameidata *nd,
 			if (!unlazied)
 				return ERR_PTR(-ECHILD);
 		}
-		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
 		if (trailing_slashes(nd)) {
 			dput(dentry);
 			return ERR_PTR(-EISDIR);
From: Christian Brauner <brauner@kernel.org>
mainline inclusion
from mainline-v6.12-rc1
commit 4770d96a6d89c7dd5675056629c0008f7f8106bf
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Perform the check for trailing slashes right in the fastpath check and don't bother with any additional work.
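[Editorial note: for reference, the userspace-visible behavior this check preserves -- O_CREAT on a name with a trailing slash must fail with EISDIR, so it can be rejected before any lookup work. The path below is a placeholder:]

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	/* A trailing slash combined with O_CREAT is rejected outright. */
	int fd = open("/tmp/some-name/", O_CREAT | O_WRONLY, 0644);

	if (fd < 0 && errno == EISDIR)
		puts("EISDIR, as expected");
	return 0;
}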
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/namei.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index fbbcec1cbb46..0108d86b83ee 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3516,6 +3516,9 @@ static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
 	struct dentry *dentry;
 
 	if (open_flag & O_CREAT) {
+		if (trailing_slashes(nd))
+			return ERR_PTR(-EISDIR);
+
 		/* Don't bother on an O_EXCL create */
 		if (open_flag & O_EXCL)
 			return NULL;
@@ -3581,20 +3584,13 @@ static const char *open_last_lookups(struct nameidata *nd,
 			bool unlazied;
 
 			/* can stay in rcuwalk if not auditing */
-			if (dentry && audit_dummy_context()) {
-				if (trailing_slashes(nd))
-					return ERR_PTR(-EISDIR);
+			if (dentry && audit_dummy_context())
 				goto finish_lookup;
-			}
 			unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
 					    try_to_unlazy(nd);
 			if (!unlazied)
 				return ERR_PTR(-ECHILD);
 		}
-		if (trailing_slashes(nd)) {
-			dput(dentry);
-			return ERR_PTR(-EISDIR);
-		}
 		if (dentry)
 			goto finish_lookup;
 	}
From: Christian Brauner <brauner@kernel.org>
mainline inclusion
from mainline-v6.12-rc1
commit d459c52ab378cdcd53c57ddcbfd5af648b20a150
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that we audit later during lookup_open() we can remove the audit dummy context check. This simplifies things a lot.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/namei.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 0108d86b83ee..1abc7f586b77 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3522,16 +3522,6 @@ static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
 		/* Don't bother on an O_EXCL create */
 		if (open_flag & O_EXCL)
 			return NULL;
-
-		/*
-		 * FIXME: If auditing is enabled, then we'll have to unlazy to
-		 * use the dentry. For now, don't do this, since it shifts
-		 * contention from parent's i_rwsem to its d_lockref spinlock.
-		 * Reconsider this once dentry refcounting handles heavy
-		 * contention better.
-		 */
-		if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
-			return NULL;
 	}
 
 	if (trailing_slashes(nd))
@@ -3584,7 +3574,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 			bool unlazied;
 
 			/* can stay in rcuwalk if not auditing */
-			if (dentry && audit_dummy_context())
+			if (dentry)
 				goto finish_lookup;
 			unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
 					    try_to_unlazy(nd);
From: Christian Brauner <brauner@kernel.org>
mainline inclusion
from mainline-v6.12-rc1
commit 0f93bb54a3a502077bca4d7beb1fe2a90d3b59db
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If we find a positive dentry we can now simply try and open it. All preliminary checks are already done, with or without O_CREAT.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Conflicts:
	fs/namei.c
[Context conflict]
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 fs/namei.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 1abc7f586b77..13d1ca4f6842 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3564,25 +3564,16 @@ static const char *open_last_lookups(struct nameidata *nd,
 	if (IS_ERR(dentry))
 		return ERR_CAST(dentry);
 
-	if (!(open_flag & O_CREAT)) {
-		if (likely(dentry))
-			goto finish_lookup;
+	if (likely(dentry))
+		goto finish_lookup;
 
+	if (!(open_flag & O_CREAT)) {
 		BUG_ON(nd->flags & LOOKUP_RCU);
 	} else {
 		if (nd->flags & LOOKUP_RCU) {
-			bool unlazied;
-
-			/* can stay in rcuwalk if not auditing */
-			if (dentry)
-				goto finish_lookup;
-			unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
-					    try_to_unlazy(nd);
-			if (!unlazied)
+			if (!try_to_unlazy(nd))
 				return ERR_PTR(-ECHILD);
 		}
-		if (dentry)
-			goto finish_lookup;
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
From: "Liam R. Howlett" Liam.Howlett@Oracle.com
mainline inclusion
from mainline-v6.11-rc7
commit f806de88d8f7f8191afd0fd9b94db4cd058e7d4f
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The write lock should be held when validating the tree to avoid updates racing with checks. Holding the rcu read lock during a large tree validation may also cause a prolonged rcu read window and "rcu_preempt detected stalls" warnings.
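[Editorial note: with the rcu_read_lock() gone, callers are expected to serialize mt_validate() against writers themselves. A sketch of the assumed calling convention, using the existing mtree_lock()/mtree_unlock() spinlock wrappers and assuming CONFIG_DEBUG_MAPLE_TREE for mt_validate()'s declaration:]

#include <linux/maple_tree.h>

/* Validate a tree while excluding concurrent writers. */
static void validate_tree_locked(struct maple_tree *mt)
{
	mtree_lock(mt);		/* spin_lock(&mt->ma_lock) */
	mt_validate(mt);
	mtree_unlock(mt);
}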
Link: https://lore.kernel.org/all/0000000000001d12d4062005aea1@google.com/
Link: https://lkml.kernel.org/r/20240820175417.2782532-1-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reported-by: syzbot+036af2f0c7338a33b0cd@syzkaller.appspotmail.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 lib/maple_tree.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 5328e08723d7..d59d16cf4399 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -7618,14 +7618,14 @@ static void mt_validate_nulls(struct maple_tree *mt)
  * 2. The gap is correctly set in the parents
  */
 void mt_validate(struct maple_tree *mt)
+	__must_hold(mas->tree->ma_lock)
 {
 	unsigned char end;
 
 	MA_STATE(mas, mt, 0, 0);
-	rcu_read_lock();
 	mas_start(&mas);
 	if (!mas_is_active(&mas))
-		goto done;
+		return;
 
 	while (!mte_is_leaf(mas.node))
 		mas_descend(&mas);
@@ -7646,9 +7646,6 @@ void mt_validate(struct maple_tree *mt)
 		mas_dfs_postorder(&mas, ULONG_MAX);
 	}
 	mt_validate_nulls(mt);
-done:
-	rcu_read_unlock();
-
 }
 EXPORT_SYMBOL_GPL(mt_validate);
From: Mateusz Guzik <mjguzik@gmail.com>
mainline inclusion
from mainline-v6.11-rc1
commit 3577dbb192419e37b6f54aced8777b6c81cd03d4
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on the i_mmap_rwsem semaphore, while the biggest singular contributor is free_pgd_range inducing the lock acquire back-to-back for all consecutive mappings of a given file.
Tracing the count of said acquires while building the kernel shows:

[1, 2)     799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 3)          0 |                                                    |
[3, 4)       3009 |                                                    |
[4, 5)       3009 |                                                    |
[5, 6)     326442 |@@@@@@@@@@@@@@@@@@@@@                               |
So in particular there were 326442 opportunities to coalesce 5 acquires into 1.
Doing so increases execs per second by 4% (~50k to ~52k) when running the benchmark linked below.
The lock remains the main bottleneck, I have not looked at other spots yet.
Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c
$ cc -O2 -o shared-doexec doexec.c
$ ./shared-doexec $(nproc)
Note this particular test makes sure binaries are separate, but the loader is shared.
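[Editorial note: the shape of the optimization in isolation -- instead of one lock/unlock per element, gather consecutive elements that share the same lock and flush up to 8 at a time, as the patch does. A toy model counting acquisitions:]

#include <stdio.h>

#define BATCH 8	/* matches the vmas[8] array in the patch */

int main(void)
{
	int n_vmas = 40, unbatched = 0, batched = 0, in_batch = 0;

	for (int i = 0; i < n_vmas; i++) {
		unbatched++;			/* old: lock + unlock per vma */
		if (++in_batch == BATCH) {	/* new: flush a full batch */
			batched++;
			in_batch = 0;
		}
	}
	if (in_batch)				/* final partial batch */
		batched++;

	printf("lock acquisitions: unbatched=%d batched=%d\n",
	       unbatched, batched);		/* 40 vs 5 */
	return 0;
}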
Stats collected on the patched kernel (+ "noinline") with:

bpftrace -e 'kprobe:unlink_file_vma_batch_process { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }'
Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	mm/internal.h
[Context conflict]
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 mm/internal.h | 10 ++++++++++
 mm/memory.c   | 10 ++++++--
 mm/mmap.c     | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 37c17f921dae..0478e5dab55b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1446,4 +1446,14 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
 #ifdef CONFIG_PAGE_CACHE_LIMIT
 unsigned long shrink_memory(unsigned long nr_to_reclaim, bool may_swap);
 #endif /* CONFIG_PAGE_CACHE_LIMIT */
+
+struct unlink_vma_file_batch {
+	int count;
+	struct vm_area_struct *vmas[8];
+};
+
+void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
+void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
+void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index e248b8338417..a4f7066d1e68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -368,6 +368,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked)
 {
+	struct unlink_vma_file_batch vb;
+
 	do {
 		unsigned long addr = vma->vm_start;
 		struct vm_area_struct *next;
@@ -387,12 +389,15 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		if (mm_wr_locked)
 			vma_start_write(vma);
 		unlink_anon_vmas(vma);
-		unlink_file_vma(vma);
 
 		if (is_vm_hugetlb_page(vma)) {
+			unlink_file_vma(vma);
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		} else {
+			unlink_file_vma_batch_init(&vb);
+			unlink_file_vma_batch_add(&vb, vma);
+
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
@@ -405,8 +410,9 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 				if (mm_wr_locked)
 					vma_start_write(vma);
 				unlink_anon_vmas(vma);
-				unlink_file_vma(vma);
+				unlink_file_vma_batch_add(&vb, vma);
 			}
+			unlink_file_vma_batch_final(&vb);
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		}
diff --git a/mm/mmap.c b/mm/mmap.c
index 07ffb6c37b96..c898103c6d72 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -132,6 +132,47 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb)
+{
+	vb->count = 0;
+}
+
+static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
+{
+	struct address_space *mapping;
+	int i;
+
+	mapping = vb->vmas[0]->vm_file->f_mapping;
+	i_mmap_lock_write(mapping);
+	for (i = 0; i < vb->count; i++) {
+		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
+		__remove_shared_vm_struct(vb->vmas[i], mapping);
+	}
+	i_mmap_unlock_write(mapping);
+
+	unlink_file_vma_batch_init(vb);
+}
+
+void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb,
+			       struct vm_area_struct *vma)
+{
+	if (vma->vm_file == NULL)
+		return;
+
+	if ((vb->count > 0 && vb->vmas[0]->vm_file != vma->vm_file) ||
+	    vb->count == ARRAY_SIZE(vb->vmas))
+		unlink_file_vma_batch_process(vb);
+
+	vb->vmas[vb->count] = vma;
+	vb->count++;
+}
+
+void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
+{
+	if (vb->count > 0)
+		unlink_file_vma_batch_process(vb);
+}
+
 /*
  * Close a vm structure and free it.
  */
From: David Hildenbrand <david@redhat.com>
mainline inclusion
from mainline-v6.12-rc1
commit 43c9074e6f093d304d55c43638732c402be75e2b
category: performance
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB1S01
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
It is not immediately obvious, but we can move the folio->_nr_pages_mapped update out of the loop and reduce the number of atomic ops without affecting the stats.
The important point to realize is that only removing the last PMD mapping will result in _nr_pages_mapped going below ENTIRELY_MAPPED, not the individual atomic_inc_return_relaxed() calls. Concurrent races with removal of PMD mappings should be handled as expected, just like when we would have such races right now on a single mapcount update.
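[Editorial note: the transformation can be checked in isolation. A minimal C11 sketch of the before/after counter updates (toy counters, not the kernel types) showing both orders leave the shared counter at the same value while the batched form issues a single atomic RMW:]

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
	atomic_int mapped_old = 0;
	atomic_int mapped_new = 0;
	int nr_pages = 8, first = 0;

	/* Old scheme: one relaxed RMW on the shared counter per page. */
	for (int i = 0; i < nr_pages; i++)
		atomic_fetch_add_explicit(&mapped_old, 1, memory_order_relaxed);

	/* New scheme: count privately, then a single relaxed RMW. */
	for (int i = 0; i < nr_pages; i++)
		first += 1;	/* stands in for atomic_inc_and_test(&page->_mapcount) */
	atomic_fetch_add_explicit(&mapped_new, first, memory_order_relaxed);

	/* Same final value; only the number of atomic RMWs differs. */
	printf("%d %d\n", atomic_load(&mapped_old), atomic_load(&mapped_new));
	return 0;
}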
In a simple munmap() microbenchmark [1] on 1 GiB of memory backed by the same PTE-mapped folio size (only mapped by a single process such that they will get completely unmapped), this change results in a speedup (positive is good) per folio size on a x86-64 Intel machine of roughly (a bit of noise expected):
* 16 KiB: +10%
* 32 KiB: +15%
* 64 KiB: +17%
* 128 KiB: +21%
* 256 KiB: +22%
* 512 KiB: +22%
* 1024 KiB: +23%
* 2048 KiB: +27%
[1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-foli...
Link: https://lkml.kernel.org/r/20240807115515.1640951-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	mm/rmap.c
[Context conflicts because commit 05c5323b2a34 ("mm: track mapcount of large folios in single value") is not merged.]
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/rmap.c | 27 +++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index de385e29916b..dbcdac9bb7a3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1131,7 +1131,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 		int *nr_pmdmapped)
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int first, nr = 0;
+	int first = 0, nr = 0;
 
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
 
@@ -1143,13 +1143,13 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 		}
 
 		do {
-			first = atomic_inc_and_test(&page->_mapcount);
-			if (first) {
-				first = atomic_inc_return_relaxed(mapped);
-				if (first < ENTIRELY_MAPPED)
-					nr++;
-			}
+			first += atomic_inc_and_test(&page->_mapcount);
 		} while (page++, --nr_pages > 0);
+
+		if (first &&
+		    atomic_add_return_relaxed(first, mapped) < ENTIRELY_MAPPED)
+			nr = first;
+
 		break;
 	case RMAP_LEVEL_PMD:
 		first = atomic_inc_and_test(&folio->_entire_mapcount);
@@ -1489,7 +1489,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		enum rmap_level level)
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int last, nr = 0, nr_pmdmapped = 0;
+	int last = 0, nr = 0, nr_pmdmapped = 0;
 	bool partially_mapped = false;
 
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1502,14 +1502,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		}
 
 		do {
-			last = atomic_add_negative(-1, &page->_mapcount);
-			if (last) {
-				last = atomic_dec_return_relaxed(mapped);
-				if (last < ENTIRELY_MAPPED)
-					nr++;
-			}
+			last += atomic_add_negative(-1, &page->_mapcount);
 		} while (page++, --nr_pages > 0);
 
+		if (last &&
+		    atomic_sub_return_relaxed(last, mapped) < ENTIRELY_MAPPED)
+			nr = last;
+
 		partially_mapped = nr && atomic_read(mapped);
 		break;
 	case RMAP_LEVEL_PMD:
Feedback: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/13122 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/P...