Resend this series after merging the series "mm: lazyfree THP support", since there are merge conflicts.
Bang Li (5):
  mm: add update_mmu_tlb_range()
  mm: implement update_mmu_tlb() using update_mmu_tlb_range()
  mm: use update_mmu_tlb_range() to simplify code
  mm/shmem: fix input and output inconsistencies
  mm: thp: support "THPeligible" semantics for mTHP with anonymous shmem

Baolin Wang (8):
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem
  mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
  mm: shmem: fix incorrect aligned index when checking conflicts

Christoph Hellwig (2):
  shmem: set a_ops earlier in shmem_symlink
  shmem: move the shmem_mapping assert into shmem_get_folio_gfp

Hugh Dickins (8):
  shmem: shrink shmem_inode_info: dir_offsets in a union
  shmem: remove vma arg from shmem_get_folio_gfp()
  shmem: factor shmem_falloc_wait() out of shmem_fault()
  shmem: trivial tidyups, removing extra blank lines, etc
  shmem: shmem_acct_blocks() and shmem_inode_acct_blocks()
  shmem: move memcg charge out of shmem_add_to_page_cache()
  shmem: _add_to_page_cache() before shmem_inode_acct_blocks()
  shmem,percpu_counter: add _limited_add(fbc, limit, amount)

Lance Yang (2):
  mm: add per-order mTHP split counters
  mm: add docs for per-order mTHP split counters

Liu Shixin (1):
  mm: shmem: Merge shmem_alloc_hugefolio() with shmem_alloc_folio()

Ryan Roberts (1):
  mm: shmem: rename mTHP shmem counters
 Documentation/admin-guide/mm/transhuge.rst |  74 +-
 arch/loongarch/include/asm/pgtable.h       |   4 +-
 arch/mips/include/asm/pgtable.h            |   4 +-
 arch/riscv/include/asm/pgtable.h           |   4 +-
 arch/xtensa/include/asm/pgtable.h          |   6 +-
 arch/xtensa/mm/tlb.c                       |   6 +-
 include/linux/huge_mm.h                    |  26 +
 include/linux/percpu_counter.h             |  23 +
 include/linux/pgtable.h                    |  11 +-
 include/linux/shmem_fs.h                   |  25 +-
 lib/percpu_counter.c                       |  53 ++
 mm/huge_memory.c                           |  43 +-
 mm/memory.c                                |  65 +-
 mm/shmem.c                                 | 791 +++++++++++++--------
 14 files changed, 797 insertions(+), 338 deletions(-)
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit ee615d4585cfc305bf6c218a62123c3051f8b4a3
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "shmem,tmpfs: general maintenance".
Mostly just cosmetic mods in mm/shmem.c, but the last two enforcing the "size=" limit better. 8/8 goes into percpu counter territory, and could stand alone.
This patch (of 8):
Shave 32 bytes off (the 64-bit) shmem_inode_info. There was a 4-byte pahole after stop_eviction, better filled by fsflags. And the 24-byte dir_offsets can only be used by directories, whereas shrinklist and swaplist only by shmem_mapping() inodes (regular files or long symlinks): so put those into a union. No change in mm/shmem.c is required for this.
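In short, the union introduced by this patch, condensed from the diff below (comments as in the patch):

	union {
		struct offset_ctx dir_offsets;		/* stable directory offsets */
		struct {
			struct list_head shrinklist;	/* shrinkable hpage inodes */
			struct list_head swaplist;	/* chain of maybes on swap */
		};
	};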
Link: https://lkml.kernel.org/r/c7441dc6-f3bb-dd60-c670-9f5cbd9f266@google.com Link: https://lkml.kernel.org/r/86ebb4b-c571-b9e8-27f5-cb82ec50357e@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Chuck Lever chuck.lever@oracle.com Reviewed-by: Jan Kara jack@suse.cz Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: include/linux/shmem_fs.h [ Context conflicts with commit 9610fbcd3eea. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/shmem_fs.h | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 134c686c8676..f0c6bf982832 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -23,18 +23,22 @@ struct shmem_inode_info { unsigned long flags; unsigned long alloced; /* data pages alloced to file */ unsigned long swapped; /* subtotal assigned to swap */ - pgoff_t fallocend; /* highest fallocate endindex */ - struct list_head shrinklist; /* shrinkable hpage inodes */ - struct list_head swaplist; /* chain of maybes on swap */ + union { + struct offset_ctx dir_offsets; /* stable directory offsets */ + struct { + struct list_head shrinklist; /* shrinkable hpage inodes */ + struct list_head swaplist; /* chain of maybes on swap */ + }; + }; + struct timespec64 i_crtime; /* file creation time */ struct shared_policy policy; /* NUMA memory alloc policy */ struct simple_xattrs xattrs; /* list of xattrs */ + pgoff_t fallocend; /* highest fallocate endindex */ + unsigned int fsflags; /* for FS_IOC_[SG]ETFLAGS */ atomic_t stop_eviction; /* hold when working on inode */ - struct timespec64 i_crtime; /* file creation time */ - unsigned int fsflags; /* flags for FS_IOC_[SG]ETFLAGS */ #ifdef CONFIG_TMPFS_QUOTA struct dquot __rcu *i_dquot[MAXQUOTAS]; #endif - struct offset_ctx dir_offsets; /* stable entry offsets */ struct inode vfs_inode; };
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit e3e1a5067fd2f1b3f4f7c651f5b33082962d1aa1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The vma is already there in vmf->vma, so no need for a separate arg.
Link: https://lkml.kernel.org/r/d9ce6f65-a2ed-48f4-4299-fdb0544875c5@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 0b82806727cf..e9b5bf9e6255 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1952,14 +1952,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, * vm. If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache. * - * vma, vmf, and fault_type are only supplied by shmem_fault: - * otherwise they are NULL. + * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL. */ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, gfp_t gfp, - struct vm_area_struct *vma, struct vm_fault *vmf, - vm_fault_t *fault_type) + struct vm_fault *vmf, vm_fault_t *fault_type) { + struct vm_area_struct *vma = vmf ? vmf->vma : NULL; struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo; @@ -2174,7 +2173,7 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp) { return shmem_get_folio_gfp(inode, index, foliop, sgp, - mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL); + mapping_gfp_mask(inode->i_mapping), NULL, NULL); }
/* @@ -2258,7 +2257,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) }
err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, - gfp, vma, vmf, &ret); + gfp, vmf, &ret); if (err) return vmf_error(err); if (folio) @@ -4933,7 +4932,7 @@ struct folio *shmem_read_folio_gfp(struct address_space *mapping,
BUG_ON(!shmem_mapping(mapping)); error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE, - gfp, NULL, NULL, NULL); + gfp, NULL, NULL); if (error) return ERR_PTR(error);
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit f0a9ad1d4d9ba3c694bca91d8d67be9a4a33b902
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
That Trinity livelock shmem_falloc avoidance block is unlikely, and a distraction from the proper business of shmem_fault(): separate it out. (This used to help compilers save stack on the fault path too, but both gcc and clang nowadays seem to make better choices anyway.)
Link: https://lkml.kernel.org/r/6fe379a4-6176-9225-9263-fe60d2633c0@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 126 +++++++++++++++++++++++++++++------------------------ 1 file changed, 69 insertions(+), 57 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index e9b5bf9e6255..1007336d3896 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2181,87 +2181,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop, * entry unconditionally - even if something else had already woken the * target. */ -static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +static int synchronous_wake_function(wait_queue_entry_t *wait, + unsigned int mode, int sync, void *key) { int ret = default_wake_function(wait, mode, sync, key); list_del_init(&wait->entry); return ret; }
+/* + * Trinity finds that probing a hole which tmpfs is punching can + * prevent the hole-punch from ever completing: which in turn + * locks writers out with its hold on i_rwsem. So refrain from + * faulting pages into the hole while it's being punched. Although + * shmem_undo_range() does remove the additions, it may be unable to + * keep up, as each new page needs its own unmap_mapping_range() call, + * and the i_mmap tree grows ever slower to scan if new vmas are added. + * + * It does not matter if we sometimes reach this check just before the + * hole-punch begins, so that one fault then races with the punch: + * we just need to make racing faults a rare case. + * + * The implementation below would be much simpler if we just used a + * standard mutex or completion: but we cannot take i_rwsem in fault, + * and bloating every shmem inode for this unlikely case would be sad. + */ +static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode) +{ + struct shmem_falloc *shmem_falloc; + struct file *fpin = NULL; + vm_fault_t ret = 0; + + spin_lock(&inode->i_lock); + shmem_falloc = inode->i_private; + if (shmem_falloc && + shmem_falloc->waitq && + vmf->pgoff >= shmem_falloc->start && + vmf->pgoff < shmem_falloc->next) { + wait_queue_head_t *shmem_falloc_waitq; + DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function); + + ret = VM_FAULT_NOPAGE; + fpin = maybe_unlock_mmap_for_io(vmf, NULL); + shmem_falloc_waitq = shmem_falloc->waitq; + prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, + TASK_UNINTERRUPTIBLE); + spin_unlock(&inode->i_lock); + schedule(); + + /* + * shmem_falloc_waitq points into the shmem_fallocate() + * stack of the hole-punching task: shmem_falloc_waitq + * is usually invalid by the time we reach here, but + * finish_wait() does not dereference it in that case; + * though i_lock needed lest racing with wake_up_all(). + */ + spin_lock(&inode->i_lock); + finish_wait(shmem_falloc_waitq, &shmem_fault_wait); + } + spin_unlock(&inode->i_lock); + if (fpin) { + fput(fpin); + ret = VM_FAULT_RETRY; + } + return ret; +} + static vm_fault_t shmem_fault(struct vm_fault *vmf) { - struct vm_area_struct *vma = vmf->vma; - struct inode *inode = file_inode(vma->vm_file); + struct inode *inode = file_inode(vmf->vma->vm_file); gfp_t gfp = mapping_gfp_mask(inode->i_mapping); struct folio *folio = NULL; + vm_fault_t ret = 0; int err; - vm_fault_t ret = VM_FAULT_LOCKED;
/* * Trinity finds that probing a hole which tmpfs is punching can - * prevent the hole-punch from ever completing: which in turn - * locks writers out with its hold on i_rwsem. So refrain from - * faulting pages into the hole while it's being punched. Although - * shmem_undo_range() does remove the additions, it may be unable to - * keep up, as each new page needs its own unmap_mapping_range() call, - * and the i_mmap tree grows ever slower to scan if new vmas are added. - * - * It does not matter if we sometimes reach this check just before the - * hole-punch begins, so that one fault then races with the punch: - * we just need to make racing faults a rare case. - * - * The implementation below would be much simpler if we just used a - * standard mutex or completion: but we cannot take i_rwsem in fault, - * and bloating every shmem inode for this unlikely case would be sad. + * prevent the hole-punch from ever completing: noted in i_private. */ if (unlikely(inode->i_private)) { - struct shmem_falloc *shmem_falloc; - - spin_lock(&inode->i_lock); - shmem_falloc = inode->i_private; - if (shmem_falloc && - shmem_falloc->waitq && - vmf->pgoff >= shmem_falloc->start && - vmf->pgoff < shmem_falloc->next) { - struct file *fpin; - wait_queue_head_t *shmem_falloc_waitq; - DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function); - - ret = VM_FAULT_NOPAGE; - fpin = maybe_unlock_mmap_for_io(vmf, NULL); - if (fpin) - ret = VM_FAULT_RETRY; - - shmem_falloc_waitq = shmem_falloc->waitq; - prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, - TASK_UNINTERRUPTIBLE); - spin_unlock(&inode->i_lock); - schedule(); - - /* - * shmem_falloc_waitq points into the shmem_fallocate() - * stack of the hole-punching task: shmem_falloc_waitq - * is usually invalid by the time we reach here, but - * finish_wait() does not dereference it in that case; - * though i_lock needed lest racing with wake_up_all(). - */ - spin_lock(&inode->i_lock); - finish_wait(shmem_falloc_waitq, &shmem_fault_wait); - spin_unlock(&inode->i_lock); - - if (fpin) - fput(fpin); + ret = shmem_falloc_wait(vmf, inode); + if (ret) return ret; - } - spin_unlock(&inode->i_lock); }
+ WARN_ON_ONCE(vmf->page != NULL); err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, gfp, vmf, &ret); if (err) return vmf_error(err); - if (folio) + if (folio) { vmf->page = folio_file_page(folio, vmf->pgoff); + ret |= VM_FAULT_LOCKED; + } return ret; }
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit 9be7d5b06648b808989e99c5d0bea1be47c5a384
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Mostly removing a few superfluous blank lines, joining short arglines, imposing some 80-column observance, correcting a couple of comments. None of it more interesting than deleting a repeated INIT_LIST_HEAD().
Link: https://lkml.kernel.org/r/b3983d28-5d3f-8649-36af-b819285d7a9e@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 56 ++++++++++++++++++++---------------------------------- 1 file changed, 21 insertions(+), 35 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 1007336d3896..839af00150f7 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -762,7 +762,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/* - * Like filemap_add_folio, but error if expected item has gone. + * Somewhat like filemap_add_folio, but error if expected item has gone. */ static int shmem_add_to_page_cache(struct folio *folio, struct address_space *mapping, @@ -832,7 +832,7 @@ static int shmem_add_to_page_cache(struct folio *folio, }
/* - * Like delete_from_page_cache, but substitutes swap for @folio. + * Somewhat like filemap_remove_folio, but substitutes swap for @folio. */ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap) { @@ -895,7 +895,6 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping, cond_resched_rcu(); } } - rcu_read_unlock();
return swapped << PAGE_SHIFT; @@ -1238,7 +1237,6 @@ static int shmem_setattr(struct mnt_idmap *idmap, if (i_uid_needs_update(idmap, attr, inode) || i_gid_needs_update(idmap, attr, inode)) { error = dquot_transfer(idmap, inode, attr); - if (error) return error; } @@ -2489,7 +2487,6 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap, if (err) return ERR_PTR(err);
- inode = new_inode(sb); if (!inode) { shmem_free_inode(sb, 0); @@ -2514,11 +2511,10 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap, shmem_set_inode_flags(inode, info->fsflags); INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); - INIT_LIST_HEAD(&info->swaplist); - if (sbinfo->noswap) - mapping_set_unevictable(inode->i_mapping); simple_xattrs_init(&info->xattrs); cache_no_acl(inode); + if (sbinfo->noswap) + mapping_set_unevictable(inode->i_mapping); mapping_set_large_folios(inode->i_mapping);
switch (mode & S_IFMT) { @@ -2730,7 +2726,6 @@ shmem_write_begin(struct file *file, struct address_space *mapping, }
ret = shmem_get_folio(inode, index, &folio, SGP_WRITE); - if (ret) return ret;
@@ -3262,8 +3257,7 @@ shmem_mknod(struct mnt_idmap *idmap, struct inode *dir, error = simple_acl_create(dir, inode); if (error) goto out_iput; - error = security_inode_init_security(inode, dir, - &dentry->d_name, + error = security_inode_init_security(inode, dir, &dentry->d_name, shmem_initxattrs, NULL); if (error && error != -EOPNOTSUPP) goto out_iput; @@ -3292,14 +3286,11 @@ shmem_tmpfile(struct mnt_idmap *idmap, struct inode *dir, int error;
inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, 0, VM_NORESERVE); - if (IS_ERR(inode)) { error = PTR_ERR(inode); goto err_out; } - - error = security_inode_init_security(inode, dir, - NULL, + error = security_inode_init_security(inode, dir, NULL, shmem_initxattrs, NULL); if (error && error != -EOPNOTSUPP) goto out_iput; @@ -3336,7 +3327,8 @@ static int shmem_create(struct mnt_idmap *idmap, struct inode *dir, /* * Link a file.. */ -static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) +static int shmem_link(struct dentry *old_dentry, struct inode *dir, + struct dentry *dentry) { struct inode *inode = d_inode(old_dentry); int ret = 0; @@ -3367,7 +3359,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr inode_inc_iversion(dir); inc_nlink(inode); ihold(inode); /* New dentry reference */ - dget(dentry); /* Extra pinning count for the created dentry */ + dget(dentry); /* Extra pinning count for the created dentry */ d_instantiate(dentry, inode); out: return ret; @@ -3387,7 +3379,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry) inode_set_ctime_current(inode)); inode_inc_iversion(dir); drop_nlink(inode); - dput(dentry); /* Undo the count from "create" - this does all the work */ + dput(dentry); /* Undo the count from "create" - does all the work */ return 0; }
@@ -3497,7 +3489,6 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
inode = shmem_get_inode(idmap, dir->i_sb, dir, S_IFLNK | 0777, 0, VM_NORESERVE); - if (IS_ERR(inode)) return PTR_ERR(inode);
@@ -3551,8 +3542,7 @@ static void shmem_put_link(void *arg) folio_put(arg); }
-static const char *shmem_get_link(struct dentry *dentry, - struct inode *inode, +static const char *shmem_get_link(struct dentry *dentry, struct inode *inode, struct delayed_call *done) { struct folio *folio = NULL; @@ -3626,8 +3616,7 @@ static int shmem_fileattr_set(struct mnt_idmap *idmap, * Callback for security_inode_init_security() for acquiring xattrs. */ static int shmem_initxattrs(struct inode *inode, - const struct xattr *xattr_array, - void *fs_info) + const struct xattr *xattr_array, void *fs_info) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); @@ -3811,7 +3800,6 @@ static struct dentry *shmem_find_alias(struct inode *inode) return alias ?: d_find_any_alias(inode); }
- static struct dentry *shmem_fh_to_dentry(struct super_block *sb, struct fid *fid, int fh_len, int fh_type) { @@ -4395,8 +4383,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) } #endif /* CONFIG_TMPFS_QUOTA */
- inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, S_IFDIR | sbinfo->mode, 0, - VM_NORESERVE); + inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, + S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); if (IS_ERR(inode)) { error = PTR_ERR(inode); goto failed; @@ -4702,11 +4690,9 @@ static ssize_t shmem_enabled_show(struct kobject *kobj,
for (i = 0; i < ARRAY_SIZE(values); i++) { len += sysfs_emit_at(buf, len, - shmem_huge == values[i] ? "%s[%s]" : "%s%s", - i ? " " : "", - shmem_format_huge(values[i])); + shmem_huge == values[i] ? "%s[%s]" : "%s%s", + i ? " " : "", shmem_format_huge(values[i])); } - len += sysfs_emit_at(buf, len, "\n");
return len; @@ -4803,8 +4789,9 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range); #define shmem_acct_size(flags, size) 0 #define shmem_unacct_size(flags, size) do {} while (0)
-static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir, - umode_t mode, dev_t dev, unsigned long flags) +static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, + struct super_block *sb, struct inode *dir, + umode_t mode, dev_t dev, unsigned long flags) { struct inode *inode = ramfs_get_inode(sb, dir, mode, dev); return inode ? inode : ERR_PTR(-ENOSPC); @@ -4814,8 +4801,8 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct supe
/* common code */
-static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, loff_t size, - unsigned long flags, unsigned int i_flags) +static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, + loff_t size, unsigned long flags, unsigned int i_flags) { struct inode *inode; struct file *res; @@ -4834,7 +4821,6 @@ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, l
inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); - if (IS_ERR(inode)) { shmem_unacct_size(flags, size); return ERR_CAST(inode);
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit 4199f51a7eb2054d68964efbd8d39c68053a8714
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
By historical accident, shmem_acct_block() and shmem_inode_acct_block() were never pluralized when the pages argument was added, despite their complements being shmem_unacct_blocks() and shmem_inode_unacct_blocks() all along. It has been an irritation: fix their naming at last.
Link: https://lkml.kernel.org/r/9124094-e4ab-8be7-ef80-9a87bdc2e4fc@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 839af00150f7..a73386b1c2bf 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -190,10 +190,10 @@ static inline int shmem_reacct_size(unsigned long flags, /* * ... whereas tmpfs objects are accounted incrementally as * pages are allocated, in order to allow large sparse files. - * shmem_get_folio reports shmem_acct_block failure as -ENOSPC not -ENOMEM, + * shmem_get_folio reports shmem_acct_blocks failure as -ENOSPC not -ENOMEM, * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM. */ -static inline int shmem_acct_block(unsigned long flags, long pages) +static inline int shmem_acct_blocks(unsigned long flags, long pages) { if (!(flags & VM_NORESERVE)) return 0; @@ -208,13 +208,13 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages) vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE)); }
-static int shmem_inode_acct_block(struct inode *inode, long pages) +static int shmem_inode_acct_blocks(struct inode *inode, long pages) { struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); int err = -ENOSPC;
- if (shmem_acct_block(info->flags, pages)) + if (shmem_acct_blocks(info->flags, pages)) return err;
might_sleep(); /* when quotas */ @@ -448,7 +448,7 @@ bool shmem_charge(struct inode *inode, long pages) { struct address_space *mapping = inode->i_mapping;
- if (shmem_inode_acct_block(inode, pages)) + if (shmem_inode_acct_blocks(inode, pages)) return false;
/* nrpages adjustment first, then shmem_recalc_inode() when balanced */ @@ -1696,7 +1696,7 @@ static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode, huge = false; nr = huge ? HPAGE_PMD_NR : 1;
- err = shmem_inode_acct_block(inode, nr); + err = shmem_inode_acct_blocks(inode, nr); if (err) goto failed;
@@ -2605,7 +2605,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, int ret; pgoff_t max_off;
- if (shmem_inode_acct_block(inode, 1)) { + if (shmem_inode_acct_blocks(inode, 1)) { /* * We may have got a page, returned -ENOENT triggering a retry, * and now we find ourselves with -ENOMEM. Release the page, to
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit 054a9f7ccd0a60607fb9bbe1e06ca671494971bf
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Extract shmem's memcg charging out of shmem_add_to_page_cache(): it is misleading to do it there, because many calls are dealing with a swapcache page, whose memcg is nowadays always remembered while swapped out, with the charge re-levied when it's brought back into swapcache.
Temporarily move it back up to the shmem_get_folio_gfp() level, where the memcg was charged before v5.8; but the next commit goes on to move it back down to a new home.
In making this change, it becomes clear that shmem_swapin_folio() does not need to know the vma, just the fault mm (if any): call it fault_mm rather than charge_mm - let mem_cgroup_charge() decide whom to charge.
Link: https://lkml.kernel.org/r/4b2143c5-bf32-64f0-841-81a81158dac@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 68 +++++++++++++++++++++++------------------------------- 1 file changed, 29 insertions(+), 39 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index a73386b1c2bf..8b5e78661443 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -147,9 +147,8 @@ static unsigned long shmem_default_max_inodes(void) #endif
static int shmem_swapin_folio(struct inode *inode, pgoff_t index, - struct folio **foliop, enum sgp_type sgp, - gfp_t gfp, struct vm_area_struct *vma, - vm_fault_t *fault_type); + struct folio **foliop, enum sgp_type sgp, gfp_t gfp, + struct mm_struct *fault_mm, vm_fault_t *fault_type);
static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) { @@ -766,12 +765,10 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, */ static int shmem_add_to_page_cache(struct folio *folio, struct address_space *mapping, - pgoff_t index, void *expected, gfp_t gfp, - struct mm_struct *charge_mm) + pgoff_t index, void *expected, gfp_t gfp) { XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio)); long nr = folio_nr_pages(folio); - int error;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); @@ -782,16 +779,7 @@ static int shmem_add_to_page_cache(struct folio *folio, folio->mapping = mapping; folio->index = index;
- if (!folio_test_swapcache(folio)) { - error = mem_cgroup_charge(folio, charge_mm, gfp); - if (error) { - if (folio_test_pmd_mappable(folio)) { - count_vm_event(THP_FILE_FALLBACK); - count_vm_event(THP_FILE_FALLBACK_CHARGE); - } - goto error; - } - } + gfp &= GFP_RECLAIM_MASK; folio_throttle_swaprate(folio, gfp);
do { @@ -820,15 +808,12 @@ static int shmem_add_to_page_cache(struct folio *folio, } while (xas_nomem(&xas, gfp));
if (xas_error(&xas)) { - error = xas_error(&xas); - goto error; + folio->mapping = NULL; + folio_ref_sub(folio, nr); + return xas_error(&xas); }
return 0; -error: - folio->mapping = NULL; - folio_ref_sub(folio, nr); - return error; }
/* @@ -1349,10 +1334,8 @@ static int shmem_unuse_swap_entries(struct inode *inode,
if (!xa_is_value(folio)) continue; - error = shmem_swapin_folio(inode, indices[i], - &folio, SGP_CACHE, - mapping_gfp_mask(mapping), - NULL, NULL); + error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE, + mapping_gfp_mask(mapping), NULL, NULL); if (error == 0) { folio_unlock(folio); folio_put(folio); @@ -1841,12 +1824,11 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, */ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, - gfp_t gfp, struct vm_area_struct *vma, + gfp_t gfp, struct mm_struct *fault_mm, vm_fault_t *fault_type) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); - struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL; struct swap_info_struct *si; struct folio *folio = NULL; swp_entry_t swap; @@ -1874,7 +1856,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (fault_type) { *fault_type |= VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); - count_memcg_event_mm(charge_mm, PGMAJFAULT); + count_memcg_event_mm(fault_mm, PGMAJFAULT); } /* Here we actually start the io */ folio = shmem_swapin(swap, gfp, info, index); @@ -1911,8 +1893,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, }
error = shmem_add_to_page_cache(folio, mapping, index, - swp_to_radix_entry(swap), gfp, - charge_mm); + swp_to_radix_entry(swap), gfp); if (error) goto failed;
@@ -1960,7 +1941,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo; - struct mm_struct *charge_mm; + struct mm_struct *fault_mm; struct folio *folio; pgoff_t hindex; gfp_t huge_gfp; @@ -1977,7 +1958,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, }
sbinfo = SHMEM_SB(inode->i_sb); - charge_mm = vma ? vma->vm_mm : NULL; + fault_mm = vma ? vma->vm_mm : NULL;
folio = filemap_get_entry(mapping, index); if (folio && vma && userfaultfd_minor(vma)) { @@ -1989,7 +1970,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (xa_is_value(folio)) { error = shmem_swapin_folio(inode, index, &folio, - sgp, gfp, vma, fault_type); + sgp, gfp, fault_mm, fault_type); if (error == -EEXIST) goto repeat;
@@ -2077,9 +2058,16 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, if (sgp == SGP_WRITE) __folio_set_referenced(folio);
- error = shmem_add_to_page_cache(folio, mapping, hindex, - NULL, gfp & GFP_RECLAIM_MASK, - charge_mm); + error = mem_cgroup_charge(folio, fault_mm, gfp); + if (error) { + if (folio_test_pmd_mappable(folio)) { + count_vm_event(THP_FILE_FALLBACK); + count_vm_event(THP_FILE_FALLBACK_CHARGE); + } + goto unacct; + } + + error = shmem_add_to_page_cache(folio, mapping, hindex, NULL, gfp); if (error) goto unacct;
@@ -2677,8 +2665,10 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, if (unlikely(pgoff >= max_off)) goto out_release;
- ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, - gfp & GFP_RECLAIM_MASK, dst_vma->vm_mm); + ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp); + if (ret) + goto out_release; + ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); if (ret) goto out_release;
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit 3022fd7af9604d44ec43da8a4398872989599b18
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There has been a recurring problem, that when a tmpfs volume is being filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or EFAULT) even though all allocations were within the permitted size.
This was a problem since early days, but magnified and complicated by the addition of huge pages. We have often worked around it by adding some slop to the tmpfs size, but it's hard to say how much is needed, and some users prefer not to do that e.g. keeping sparse files in a tightly tailored tmpfs helps to prevent accidental writing to holes.
This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).
Concurrent tasks allocating at the same position could deplete the size allowance and fail. Doing vm_enough_memory and size checks before the folio allocation was intentional (to limit the load on the page allocator from this source) and still has some virtue; but memory cgroup never did that, so I think it's better reordered to favour predictable behaviour.
1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.
The folio lock held from allocation onwards ensures that the !uptodate folio cannot be used by others, and can safely be deleted from the cache if checks 6 or 7 subsequently fail (and those waiting on folio lock already check that the folio was not truncated once they get the lock); and the early addition to page cache ensures that racers find it before they try to duplicate the accounting.
Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying, which can be combined inside the new shmem_alloc_and_add_folio(): doing 2 splits twice (once huge, once nonhuge) is not exactly equivalent to trying 5 splits (and giving up early on huge), but let's keep it simple unless more complication proves necessary.
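A condensed sketch of the resulting shmem_alloc_and_add_folio() ordering (the huge-case index round-down, the conflict check, gfp masking and the reclaim-and-retry path are trimmed here; see the diff below for the authoritative version):

	static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
			struct inode *inode, pgoff_t index,
			struct mm_struct *fault_mm, bool huge)
	{
		struct address_space *mapping = inode->i_mapping;
		struct shmem_inode_info *info = SHMEM_I(inode);
		long pages = huge ? HPAGE_PMD_NR : 1;
		struct folio *folio;
		int error;

		/* 3. allocate the physical folio first, huge or small */
		folio = huge ? shmem_alloc_hugefolio(gfp, info, index) :
			       shmem_alloc_folio(gfp, info, index);
		if (!folio)
			return ERR_PTR(-ENOMEM);
		__folio_set_locked(folio);
		__folio_set_swapbacked(folio);

		/* 4. check and charge from mem cgroup limit */
		error = mem_cgroup_charge(folio, fault_mm, gfp);
		if (error)
			goto unlock;
		/* 5. add to page cache; a racer may already have got in */
		error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
		if (error)
			goto unlock;
		/* 6. + 7. vm_enough_memory and "size=" accounting, done last */
		error = shmem_inode_acct_blocks(inode, pages);
		if (error) {
			/* folio is still locked and !uptodate: safe to delete */
			filemap_remove_folio(folio);
			goto unlock;
		}

		shmem_recalc_inode(inode, pages, 0);
		folio_add_lru(folio);
		return folio;
	unlock:
		folio_unlock(folio);
		folio_put(folio);
		return ERR_PTR(error);
	}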
Userfaultfd is a foreign country: they do things differently there, and for good reason - to avoid mmap_lock deadlock. Leave ordering in shmem_mfill_atomic_pte() untouched for now, but I would rather like to mesh it better with shmem_get_folio_gfp() in the future.
Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Darrick J. Wong djwong@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: Jan Kara jack@suse.cz Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Tim Chen tim.c.chen@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/shmem.c [ Context conflicts with commit 7cce6955521c 5b97d5485d7a 19a9d856a08b ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 235 +++++++++++++++++++++++++++-------------------------- 1 file changed, 121 insertions(+), 114 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 8b5e78661443..3a625b60c8f1 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -795,14 +795,12 @@ static int shmem_add_to_page_cache(struct folio *folio, xas_store(&xas, folio); if (xas_error(&xas)) goto unlock; - if (folio_test_pmd_mappable(folio)) { - count_vm_event(THP_FILE_ALLOC); + if (folio_test_pmd_mappable(folio)) __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr); - } - mapping->nrpages += nr; __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr); __lruvec_stat_mod_folio(folio, NR_SHMEM, nr); shmem_reliable_folio_add(folio, nr); + mapping->nrpages += nr; unlock: xas_unlock_irq(&xas); } while (xas_nomem(&xas, gfp)); @@ -1637,25 +1635,17 @@ static struct folio *shmem_alloc_hugefolio(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { struct vm_area_struct pvma; - struct address_space *mapping = info->vfs_inode.i_mapping; - pgoff_t hindex; struct folio *folio;
- hindex = round_down(index, HPAGE_PMD_NR); - if (xa_find(&mapping->i_pages, &hindex, hindex + HPAGE_PMD_NR - 1, - XA_PRESENT)) - return NULL; - - shmem_pseudo_vma_init(&pvma, info, hindex); + shmem_pseudo_vma_init(&pvma, info, index); folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, &pvma, 0, true); shmem_pseudo_vma_destroy(&pvma); - if (!folio) - count_vm_event(THP_FILE_FALLBACK); + return folio; }
static struct folio *shmem_alloc_folio(gfp_t gfp, - struct shmem_inode_info *info, pgoff_t index) + struct shmem_inode_info *info, pgoff_t index) { struct vm_area_struct pvma; struct folio *folio; @@ -1667,40 +1657,106 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, return folio; }
-static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode, - pgoff_t index, bool huge) +static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, + struct inode *inode, pgoff_t index, + struct mm_struct *fault_mm, bool huge) { + struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); struct folio *folio; - int nr; - int err; + long pages; + int error;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) huge = false; - nr = huge ? HPAGE_PMD_NR : 1; - - err = shmem_inode_acct_blocks(inode, nr); - if (err) - goto failed;
if (!shmem_prepare_alloc(&gfp)) goto no_mem;
- if (huge) + if (huge) { + pages = HPAGE_PMD_NR; + index = round_down(index, HPAGE_PMD_NR); + + /* + * Check for conflict before waiting on a huge allocation. + * Conflict might be that a huge page has just been allocated + * and added to page cache by a racing thread, or that there + * is already at least one small page in the huge extent. + * Be careful to retry when appropriate, but not forever! + * Elsewhere -EEXIST would be the right code, but not here. + */ + if (xa_find(&mapping->i_pages, &index, + index + HPAGE_PMD_NR - 1, XA_PRESENT)) + return ERR_PTR(-E2BIG); + folio = shmem_alloc_hugefolio(gfp, info, index); - else + if (!folio) + count_vm_event(THP_FILE_FALLBACK); + } else { + pages = 1; folio = shmem_alloc_folio(gfp, info, index); - if (folio) { - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - return folio; }
no_mem: - err = -ENOMEM; - shmem_inode_unacct_blocks(inode, nr); -failed: - return ERR_PTR(err); + if (!folio) + return ERR_PTR(-ENOMEM); + + __folio_set_locked(folio); + __folio_set_swapbacked(folio); + + gfp &= GFP_RECLAIM_MASK; + error = mem_cgroup_charge(folio, fault_mm, gfp); + if (error) { + if (xa_find(&mapping->i_pages, &index, + index + pages - 1, XA_PRESENT)) { + error = -EEXIST; + } else if (huge) { + count_vm_event(THP_FILE_FALLBACK); + count_vm_event(THP_FILE_FALLBACK_CHARGE); + } + goto unlock; + } + + error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp); + if (error) + goto unlock; + + error = shmem_inode_acct_blocks(inode, pages); + if (error) { + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + long freed; + /* + * Try to reclaim some space by splitting a few + * large folios beyond i_size on the filesystem. + */ + shmem_unused_huge_shrink(sbinfo, NULL, 2); + /* + * And do a shmem_recalc_inode() to account for freed pages: + * except our folio is there in cache, so not quite balanced. + */ + spin_lock(&info->lock); + freed = pages + info->alloced - info->swapped - + READ_ONCE(mapping->nrpages); + if (freed > 0) + info->alloced -= freed; + spin_unlock(&info->lock); + if (freed > 0) + shmem_inode_unacct_blocks(inode, freed); + error = shmem_inode_acct_blocks(inode, pages); + if (error) { + filemap_remove_folio(folio); + goto unlock; + } + } + + shmem_recalc_inode(inode, pages, 0); + folio_add_lru(folio); + return folio; + +unlock: + folio_unlock(folio); + folio_put(folio); + return ERR_PTR(error); }
/* @@ -1938,29 +1994,22 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, struct vm_fault *vmf, vm_fault_t *fault_type) { struct vm_area_struct *vma = vmf ? vmf->vma : NULL; - struct address_space *mapping = inode->i_mapping; - struct shmem_inode_info *info = SHMEM_I(inode); - struct shmem_sb_info *sbinfo; struct mm_struct *fault_mm; struct folio *folio; - pgoff_t hindex; - gfp_t huge_gfp; int error; - int once = 0; - int alloced = 0; + bool alloced;
if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) return -EFBIG; repeat: if (sgp <= SGP_CACHE && - ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) { + ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) return -EINVAL; - }
- sbinfo = SHMEM_SB(inode->i_sb); + alloced = false; fault_mm = vma ? vma->vm_mm : NULL;
- folio = filemap_get_entry(mapping, index); + folio = filemap_get_entry(inode->i_mapping, index); if (folio && vma && userfaultfd_minor(vma)) { if (!xa_is_value(folio)) folio_put(folio); @@ -1982,7 +2031,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, folio_lock(folio);
/* Has the folio been truncated or swapped out? */ - if (unlikely(folio->mapping != mapping)) { + if (unlikely(folio->mapping != inode->i_mapping)) { folio_unlock(folio); folio_put(folio); goto repeat; @@ -2017,67 +2066,39 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, return 0; }
- if (!shmem_is_huge(inode, index, false, - vma ? vma->vm_mm : NULL, vma ? vma->vm_flags : 0)) - goto alloc_nohuge; - if (mm_in_dynamic_pool(vma ? vma->vm_mm : current->mm)) - goto alloc_nohuge; + if (shmem_is_huge(inode, index, false, fault_mm, + vma ? vma->vm_flags : 0) && + !mm_in_dynamic_pool(vma ? vma->vm_mm : current->mm)) { + gfp_t huge_gfp;
- huge_gfp = vma_thp_gfp_mask(vma); - huge_gfp = limit_gfp_mask(huge_gfp, gfp); - folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true); - if (IS_ERR(folio)) { -alloc_nohuge: - folio = shmem_alloc_and_acct_folio(gfp, inode, index, false); + huge_gfp = vma_thp_gfp_mask(vma); + huge_gfp = limit_gfp_mask(huge_gfp, gfp); + folio = shmem_alloc_and_add_folio(huge_gfp, + inode, index, fault_mm, true); + if (!IS_ERR(folio)) { + count_vm_event(THP_FILE_ALLOC); + goto alloced; + } + if (PTR_ERR(folio) == -EEXIST) + goto repeat; } - if (IS_ERR(folio)) { - int retry = 5;
+ folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false); + if (IS_ERR(folio)) { error = PTR_ERR(folio); + if (error == -EEXIST) + goto repeat; folio = NULL; - if (error != -ENOSPC) - goto unlock; - /* - * Try to reclaim some space by splitting a large folio - * beyond i_size on the filesystem. - */ - while (retry--) { - int ret; - - ret = shmem_unused_huge_shrink(sbinfo, NULL, 1); - if (ret == SHRINK_STOP) - break; - if (ret) - goto alloc_nohuge; - } goto unlock; }
- hindex = round_down(index, folio_nr_pages(folio)); - - if (sgp == SGP_WRITE) - __folio_set_referenced(folio); - - error = mem_cgroup_charge(folio, fault_mm, gfp); - if (error) { - if (folio_test_pmd_mappable(folio)) { - count_vm_event(THP_FILE_FALLBACK); - count_vm_event(THP_FILE_FALLBACK_CHARGE); - } - goto unacct; - } - - error = shmem_add_to_page_cache(folio, mapping, hindex, NULL, gfp); - if (error) - goto unacct; - - folio_add_lru(folio); - shmem_recalc_inode(inode, folio_nr_pages(folio), 0); +alloced: alloced = true; - if (folio_test_pmd_mappable(folio) && DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) < folio_next_index(folio) - 1) { + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + struct shmem_inode_info *info = SHMEM_I(inode); /* * Part of the large folio is beyond i_size: subject * to shrink under memory pressure. @@ -2095,6 +2116,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, spin_unlock(&sbinfo->shrinklist_lock); }
+ if (sgp == SGP_WRITE) + folio_set_referenced(folio); /* * Let SGP_FALLOC use the SGP_WRITE optimization on a new folio. */ @@ -2118,11 +2141,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, /* Perhaps the file has been truncated since we checked */ if (sgp <= SGP_CACHE && ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) { - if (alloced) { - folio_clear_dirty(folio); - filemap_remove_folio(folio); - shmem_recalc_inode(inode, 0, 0); - } error = -EINVAL; goto unlock; } @@ -2133,25 +2151,14 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, /* * Error recovery. */ -unacct: - shmem_inode_unacct_blocks(inode, folio_nr_pages(folio)); - - if (folio_test_large(folio)) { - folio_unlock(folio); - folio_put(folio); - goto alloc_nohuge; - } unlock: + if (alloced) + filemap_remove_folio(folio); + shmem_recalc_inode(inode, 0, 0); if (folio) { folio_unlock(folio); folio_put(folio); } - if (error == -ENOSPC && !once++) { - shmem_recalc_inode(inode, 0, 0); - goto repeat; - } - if (error == -EEXIST) - goto repeat; return error; }
From: Hugh Dickins <hughd@google.com>

mainline inclusion
from mainline-v6.7-rc1
commit beb9868628445306958fd7b2da1cd369a4a381cc
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Percpu counter's compare and add are separate functions: without locking around them (which would defeat their purpose), it has been possible to overflow the intended limit. Imagine all the other CPUs fallocating tmpfs huge pages to the limit, in between this CPU's compare and its add.
I have not seen reports of that happening; but tmpfs's recent addition of dquot_alloc_block_nodirty() in between the compare and the add makes it even more likely, and I'd be uncomfortable to leave it unfixed.
Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
I believe this implementation is correct, and slightly more efficient than the combination of compare and add (taking the lock once rather than twice when nearing full - the last 128MiB of a tmpfs volume on a machine with 128 CPUs and 4KiB pages); but it does beg for a better design - when nearing full, there is no new batching, but the costly percpu counter sum across CPUs still has to be done, while locked.
Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well as cpu_online_mask: but shouldn't __percpu_counter_compare() and __percpu_counter_limited_add() then be adding a num_dying_cpus() to num_online_cpus(), when they calculate the maximum which could be held across CPUs? But the times when it matters would be vanishingly rare.
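As an aside, a minimal sketch of what the new helper buys over the old two-step pattern (the charge_blocks_* wrappers below are hypothetical, written only for illustration; the helper signature is the one introduced by this patch):

	/* Old pattern: another CPU may add between the compare and the add. */
	static bool charge_blocks_racy(struct percpu_counter *used, s64 max, s64 pages)
	{
		if (percpu_counter_compare(used, max - pages) > 0)
			return false;
		percpu_counter_add(used, pages);
		return true;
	}

	/*
	 * New helper: the compare and the add happen under one lock (or one
	 * per-cpu fast path), so the total cannot overshoot max.
	 */
	static bool charge_blocks_limited(struct percpu_counter *used, s64 max, s64 pages)
	{
		return percpu_counter_limited_add(used, max, pages);
	}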
Link: https://lkml.kernel.org/r/bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Tim Chen tim.c.chen@intel.com Cc: Dave Chinner dchinner@redhat.com Cc: Darrick J. Wong djwong@kernel.org Cc: Axel Rasmussen axelrasmussen@google.com Cc: Carlos Maiolino cem@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Chuck Lever chuck.lever@oracle.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: lib/percpu_counter.c [ Context conflicts with commit 69381c36f1ac. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/percpu_counter.h | 23 +++++++++++++++ lib/percpu_counter.c | 53 ++++++++++++++++++++++++++++++++++ mm/shmem.c | 10 +++---- 3 files changed, 81 insertions(+), 5 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h index 1a0f25a27d7b..c50716df9fa3 100644 --- a/include/linux/percpu_counter.h +++ b/include/linux/percpu_counter.h @@ -68,6 +68,8 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch); s64 __percpu_counter_sum(struct percpu_counter *fbc); int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch); +bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, + s64 amount, s32 batch); void percpu_counter_sync(struct percpu_counter *fbc);
static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs) @@ -80,6 +82,13 @@ static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount) percpu_counter_add_batch(fbc, amount, percpu_counter_batch); }
+static inline bool +percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount) +{ + return __percpu_counter_limited_add(fbc, limit, amount, + percpu_counter_batch); +} + /* * With percpu_counter_add_local() and percpu_counter_sub_local(), counts * are accumulated in local per cpu counter and not in fbc->count until @@ -210,6 +219,20 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount) local_irq_restore(flags); }
+static inline bool +percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount) +{ + unsigned long flags; + s64 count; + + local_irq_save(flags); + count = fbc->count + amount; + if (count <= limit) + fbc->count = count; + local_irq_restore(flags); + return count <= limit; +} + /* non-SMP percpu_counter_add_local is the same with percpu_counter_add */ static inline void percpu_counter_add_local(struct percpu_counter *fbc, s64 amount) diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c index 7d2eaba4db1d..fb30d739f4b5 100644 --- a/lib/percpu_counter.c +++ b/lib/percpu_counter.c @@ -279,6 +279,59 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch) } EXPORT_SYMBOL(__percpu_counter_compare);
+/* + * Compare counter, and add amount if the total is within limit. + * Return true if amount was added, false if it would exceed limit. + */ +bool __percpu_counter_limited_add(struct percpu_counter *fbc, + s64 limit, s64 amount, s32 batch) +{ + s64 count; + s64 unknown; + unsigned long flags; + bool good; + + if (amount > limit) + return false; + + local_irq_save(flags); + unknown = batch * num_online_cpus(); + count = __this_cpu_read(*fbc->counters); + + /* Skip taking the lock when safe */ + if (abs(count + amount) <= batch && + fbc->count + unknown <= limit) { + this_cpu_add(*fbc->counters, amount); + local_irq_restore(flags); + return true; + } + + raw_spin_lock(&fbc->lock); + count = fbc->count + amount; + + /* Skip percpu_counter_sum() when safe */ + if (count + unknown > limit) { + s32 *pcount; + int cpu; + + for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) { + pcount = per_cpu_ptr(fbc->counters, cpu); + count += *pcount; + } + } + + good = count <= limit; + if (good) { + count = __this_cpu_read(*fbc->counters); + fbc->count += count + amount; + __this_cpu_sub(*fbc->counters, count); + } + + raw_spin_unlock(&fbc->lock); + local_irq_restore(flags); + return good; +} + /* * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from * atomic mode to percpu mode. diff --git a/mm/shmem.c b/mm/shmem.c index 3a625b60c8f1..4556251d9f64 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -218,15 +218,15 @@ static int shmem_inode_acct_blocks(struct inode *inode, long pages)
might_sleep(); /* when quotas */ if (sbinfo->max_blocks) { - if (percpu_counter_compare(&sbinfo->used_blocks, - sbinfo->max_blocks - pages) > 0) + if (!percpu_counter_limited_add(&sbinfo->used_blocks, + sbinfo->max_blocks, pages)) goto unacct;
err = dquot_alloc_block_nodirty(inode, pages); - if (err) + if (err) { + percpu_counter_sub(&sbinfo->used_blocks, pages); goto unacct; - - percpu_counter_add(&sbinfo->used_blocks, pages); + } } else { err = dquot_alloc_block_nodirty(inode, pages); if (err)
From: Christoph Hellwig <hch@lst.de>

mainline inclusion
from mainline-v6.9-rc1
commit e11381d83d72198565f4545d9988b4720288eb64
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Set the a_ops in shmem_symlink before reading a folio from the mapping to prepare for asserting that shmem_get_folio is only called on shmem mappings.
Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: "Matthew Wilcox (Oracle)" willy@infradead.org Signed-off-by: Chandan Babu R chandanbabu@kernel.org [ Dep-of: 1f63177ea89cf28bb7b2093e03769da9c9bca89a ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 4556251d9f64..f77ca1572fad 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3508,10 +3508,10 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir, inode->i_op = &shmem_short_symlink_operations; } else { inode_nohighmem(inode); + inode->i_mapping->a_ops = &shmem_aops; error = shmem_get_folio(inode, 0, &folio, SGP_WRITE); if (error) goto out_remove_offset; - inode->i_mapping->a_ops = &shmem_aops; inode->i_op = &shmem_symlink_inode_operations; memcpy(folio_address(folio), symname, len); folio_mark_uptodate(folio);
From: Christoph Hellwig <hch@lst.de>

mainline inclusion
from mainline-v6.9-rc1
commit 1cd81faaf61b42307e81f2dd173934005c220a64
category: cleanup
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Move the check that the inode really is a shmemfs one from shmem_read_folio_gfp to shmem_get_folio_gfp given that shmem_get_folio can also be called from outside of shmem.c. Also turn it into a WARN_ON_ONCE and error return instead of BUG_ON to be less severe.
Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: "Matthew Wilcox (Oracle)" willy@infradead.org Reviewed-by: "Darrick J. Wong" djwong@kernel.org Signed-off-by: Chandan Babu R chandanbabu@kernel.org [ Dep-of: e7a2ab7b3bb5 (mm: shmem: add mTHP support for anonymous shmem) ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c index f77ca1572fad..87134ed22968 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1999,6 +1999,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, int error; bool alloced;
+ if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping))) + return -EINVAL; + if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) return -EFBIG; repeat: @@ -4925,7 +4928,6 @@ struct folio *shmem_read_folio_gfp(struct address_space *mapping, struct folio *folio; int error;
- BUG_ON(!shmem_mapping(mapping)); error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE, gfp, NULL, NULL); if (error)
From: Bang Li <libang.li@antgroup.com>

mainline inclusion
from mainline-v6.11-rc1
commit 23b1b44e6c61295084284aa7d87db863a7802b92
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Add update_mmu_tlb_range() to simplify code", v4.
This series of commits mainly adds update_mmu_tlb_range() to batch-update the TLB over an address range, and implements update_mmu_tlb() using update_mmu_tlb_range().
After commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP"), we may need to update the TLB for a range of addresses by calling update_mmu_tlb() in a loop. Using update_mmu_tlb_range(), we can simplify the code and possibly avoid executing some unnecessary code on some architectures.
This patch (of 3):
Add update_mmu_tlb_range() so that the TLB for an address range can be updated in a single call.
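For illustration, here is a minimal caller-side sketch of the pattern this helper enables (not part of the patch; vma, addr, ptep and nr_pages stand in for whatever state the fault handler already holds):

	/*
	 * Sketch only: nr_pages stale PTEs starting at addr need their
	 * TLB entries refreshed.
	 */

	/* Old pattern: one call per page. */
	for (i = 0; i < nr_pages; i++)
		update_mmu_tlb(vma, addr + PAGE_SIZE * i, ptep + i);

	/* New pattern: a single ranged call the architecture can batch. */
	update_mmu_tlb_range(vma, addr, ptep, nr_pages);

A later patch in this series applies exactly this transformation to do_anonymous_page().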
Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com Signed-off-by: Bang Li libang.li@antgroup.com Acked-by: David Hildenbrand david@redhat.com Cc: Chris Zankel chris@zankel.net Cc: Huacai Chen chenhuacai@kernel.org Cc: Lance Yang ioworker0@gmail.com Cc: Max Filippov jcmvbkbc@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Thomas Bogendoerfer tsbogend@alpha.franken.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- arch/loongarch/include/asm/pgtable.h | 2 ++ arch/mips/include/asm/pgtable.h | 2 ++ arch/riscv/include/asm/pgtable.h | 2 ++ arch/xtensa/include/asm/pgtable.h | 3 +++ arch/xtensa/mm/tlb.c | 6 ++++++ include/linux/pgtable.h | 7 +++++++ 6 files changed, 22 insertions(+)
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index 29d9b12298bc..e48efd4a3e3e 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -472,6 +472,8 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define __HAVE_ARCH_UPDATE_MMU_TLB #define update_mmu_tlb update_mmu_cache +#define update_mmu_tlb_range(vma, addr, ptep, nr) \ + update_mmu_cache_range(NULL, vma, addr, ptep, nr)
static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index 430b208c0130..58ada9791e5a 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -596,6 +596,8 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define __HAVE_ARCH_UPDATE_MMU_TLB #define update_mmu_tlb update_mmu_cache +#define update_mmu_tlb_range(vma, address, ptep, nr) \ + update_mmu_cache_range(NULL, vma, address, ptep, nr)
static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index a16fcdf91f39..93ca36c68833 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -493,6 +493,8 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define __HAVE_ARCH_UPDATE_MMU_TLB #define update_mmu_tlb update_mmu_cache +#define update_mmu_tlb_range(vma, addr, ptep, nr) \ + update_mmu_cache_range(NULL, vma, addr, ptep, nr)
static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h index 9a7e5e57ee9a..436158bd9030 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -413,6 +413,9 @@ typedef pte_t *pte_addr_t; void update_mmu_tlb(struct vm_area_struct *vma, unsigned long address, pte_t *ptep); #define __HAVE_ARCH_UPDATE_MMU_TLB +void update_mmu_tlb_range(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, unsigned int nr); +#define update_mmu_tlb_range update_mmu_tlb_range
#endif /* !defined (__ASSEMBLY__) */
diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c index 4f974b74883c..b1e1f63de72b 100644 --- a/arch/xtensa/mm/tlb.c +++ b/arch/xtensa/mm/tlb.c @@ -169,6 +169,12 @@ void update_mmu_tlb(struct vm_area_struct *vma, local_flush_tlb_page(vma, address); }
+void update_mmu_tlb_range(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, unsigned int nr) +{ + local_flush_tlb_range(vma, address, address + PAGE_SIZE * nr); +} + #ifdef CONFIG_DEBUG_TLB_SANITY
static unsigned get_pte_for_vaddr(unsigned vaddr) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2ac8e48031cb..1494ea7629da 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -715,6 +715,13 @@ static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr, * fault. This function updates TLB only, do nothing with cache or others. * It is the difference with function update_mmu_cache. */ +#ifndef update_mmu_tlb_range +static inline void update_mmu_tlb_range(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, unsigned int nr) +{ +} +#endif + #ifndef __HAVE_ARCH_UPDATE_MMU_TLB static inline void update_mmu_tlb(struct vm_area_struct *vma, unsigned long address, pte_t *ptep)
From: Bang Li libang.li@antgroup.com
mainline inclusion from mainline-v6.11-rc1 commit 8f65aa32239f1c3f11b7a25bd5921223bafc5fed category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Let's make update_mmu_tlb() simply a generic wrapper around update_mmu_tlb_range(). Only the latter can now be overridden by the architecture. We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well.
Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com Signed-off-by: Bang Li libang.li@antgroup.com Acked-by: David Hildenbrand david@redhat.com Cc: Chris Zankel chris@zankel.net Cc: Huacai Chen chenhuacai@kernel.org Cc: Lance Yang ioworker0@gmail.com Cc: Max Filippov jcmvbkbc@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Thomas Bogendoerfer tsbogend@alpha.franken.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- arch/loongarch/include/asm/pgtable.h | 2 -- arch/mips/include/asm/pgtable.h | 2 -- arch/riscv/include/asm/pgtable.h | 2 -- arch/xtensa/include/asm/pgtable.h | 3 --- arch/xtensa/mm/tlb.c | 6 ------ include/linux/pgtable.h | 4 +--- 6 files changed, 1 insertion(+), 18 deletions(-)
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index e48efd4a3e3e..f5300b66a39d 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -470,8 +470,6 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, addr, ptep) \ update_mmu_cache_range(NULL, vma, addr, ptep, 1)
-#define __HAVE_ARCH_UPDATE_MMU_TLB -#define update_mmu_tlb update_mmu_cache #define update_mmu_tlb_range(vma, addr, ptep, nr) \ update_mmu_cache_range(NULL, vma, addr, ptep, nr)
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index 58ada9791e5a..daa48f28ce5e 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -594,8 +594,6 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, address, ptep) \ update_mmu_cache_range(NULL, vma, address, ptep, 1)
-#define __HAVE_ARCH_UPDATE_MMU_TLB -#define update_mmu_tlb update_mmu_cache #define update_mmu_tlb_range(vma, address, ptep, nr) \ update_mmu_cache_range(NULL, vma, address, ptep, nr)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 93ca36c68833..5f02effb5b4b 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -491,8 +491,6 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, addr, ptep) \ update_mmu_cache_range(NULL, vma, addr, ptep, 1)
-#define __HAVE_ARCH_UPDATE_MMU_TLB -#define update_mmu_tlb update_mmu_cache #define update_mmu_tlb_range(vma, addr, ptep, nr) \ update_mmu_cache_range(NULL, vma, addr, ptep, nr)
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h index 436158bd9030..1647a7cc3fbf 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -410,9 +410,6 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
typedef pte_t *pte_addr_t;
-void update_mmu_tlb(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep); -#define __HAVE_ARCH_UPDATE_MMU_TLB void update_mmu_tlb_range(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, unsigned int nr); #define update_mmu_tlb_range update_mmu_tlb_range diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c index b1e1f63de72b..f69feee19d59 100644 --- a/arch/xtensa/mm/tlb.c +++ b/arch/xtensa/mm/tlb.c @@ -163,12 +163,6 @@ void local_flush_tlb_kernel_range(unsigned long start, unsigned long end) } }
-void update_mmu_tlb(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep) -{ - local_flush_tlb_page(vma, address); -} - void update_mmu_tlb_range(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, unsigned int nr) { diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 1494ea7629da..db4faa88865d 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -722,13 +722,11 @@ static inline void update_mmu_tlb_range(struct vm_area_struct *vma, } #endif
-#ifndef __HAVE_ARCH_UPDATE_MMU_TLB static inline void update_mmu_tlb(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { + update_mmu_tlb_range(vma, address, ptep, 1); } -#define __HAVE_ARCH_UPDATE_MMU_TLB -#endif
/* * Some architectures may be able to avoid expensive synchronization
From: Bang Li libang.li@antgroup.com
mainline inclusion from mainline-v6.11-rc1 commit 6faa49d1c4404e0b949fd92f1e891c24870d4f86 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Let us simplify the code by using update_mmu_tlb_range().
Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com Signed-off-by: Bang Li libang.li@antgroup.com Reviewed-by: Lance Yang ioworker0@gmail.com Acked-by: David Hildenbrand david@redhat.com Cc: Chris Zankel chris@zankel.net Cc: Huacai Chen chenhuacai@kernel.org Cc: Max Filippov jcmvbkbc@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Thomas Bogendoerfer tsbogend@alpha.franken.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memory.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 7fd1f71cebeb..1597718ed30e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4411,7 +4411,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vm_fault_t ret = 0; int nr_pages = 1; pte_t entry; - int i;
/* File mapping without ->vm_ops ? */ if (vma->vm_flags & VM_SHARED) @@ -4481,8 +4480,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) update_mmu_tlb(vma, addr, vmf->pte); goto release; } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { - for (i = 0; i < nr_pages; i++) - update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i); + update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages); goto release; }
hulk inclusion category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
--------------------------------
Commit 6f775463d002 ("mm: shmem: use folio_alloc_mpol() in shmem_alloc_folio()") merged shmem_alloc_hugefolio() with shmem_alloc_folio(). To avoid context conflicts in the subsequent patches, merge them here as well.
Dep-of: 3d95bc21cea5 ("mm: shmem: add THP validation for PMD-mapped THP related statistics") Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 26 +++++++------------------- 1 file changed, 7 insertions(+), 19 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 87134ed22968..275e2885ee83 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1631,27 +1631,15 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) return result; }
-static struct folio *shmem_alloc_hugefolio(gfp_t gfp, +static struct folio *shmem_alloc_folio(gfp_t gfp, int order, struct shmem_inode_info *info, pgoff_t index) { struct vm_area_struct pvma; struct folio *folio; + bool hugepage = !!order;
shmem_pseudo_vma_init(&pvma, info, index); - folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, &pvma, 0, true); - shmem_pseudo_vma_destroy(&pvma); - - return folio; -} - -static struct folio *shmem_alloc_folio(gfp_t gfp, - struct shmem_inode_info *info, pgoff_t index) -{ - struct vm_area_struct pvma; - struct folio *folio; - - shmem_pseudo_vma_init(&pvma, info, index); - folio = vma_alloc_folio(gfp, 0, &pvma, 0, false); + folio = vma_alloc_folio(gfp, order, &pvma, 0, hugepage); shmem_pseudo_vma_destroy(&pvma);
return folio; @@ -1689,12 +1677,12 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, index + HPAGE_PMD_NR - 1, XA_PRESENT)) return ERR_PTR(-E2BIG);
- folio = shmem_alloc_hugefolio(gfp, info, index); + folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index); if (!folio) count_vm_event(THP_FILE_FALLBACK); } else { pages = 1; - folio = shmem_alloc_folio(gfp, info, index); + folio = shmem_alloc_folio(gfp, 0, info, index); }
no_mem: @@ -1796,7 +1784,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, */ gfp &= ~GFP_CONSTRAINT_MASK; VM_BUG_ON_FOLIO(folio_test_large(old), old); - new = shmem_alloc_folio(gfp, info, index); + new = shmem_alloc_folio(gfp, 0, info, index); if (!new) return -ENOMEM;
@@ -2618,7 +2606,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
if (!*foliop) { ret = -ENOMEM; - folio = shmem_alloc_folio(gfp, info, pgoff); + folio = shmem_alloc_folio(gfp, 0, info, pgoff); if (!folio) goto out_unacct_blocks;
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc1 commit 43e027e414232b1ce4fa6c96a582417e2c027f2d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "add mTHP support for anonymous shmem", v5.
Anonymous pages have supported multi-size THP (mTHP) allocation since commit 19eaf44954df, which allows THP to be configured through the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, anonymous shmem ignores the anonymous mTHP rule configured through the sysfs interface and can only use PMD-mapped THP, which is not reasonable. Many applications implement anonymous page sharing through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios; therefore, users expect a unified mTHP strategy for anonymous pages, including anonymous shared pages, in order to enjoy the benefits of mTHP: for example, lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and contiguous PTEs on the ARM architecture to reduce TLB misses.
As discussed in the bi-weekly MM meeting[1], the mTHP controls should control all of shmem, not only anonymous shmem, but support will be added iteratively. Therefore, this patch set starts with support for anonymous shmem.
The primary strategy is similar to the support for anonymous mTHP. Introduce a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which can take almost the same values as the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new "inherit" option and dropping the testing options 'force' and 'deny'. By default all sizes are set to "never" except the PMD size, which is set to "inherit". This keeps backward compatibility with the top-level anonymous shmem setting, while also allowing independent control of anonymous shmem enablement for each mTHP size.
Use the page fault latency tool to measure the performance of 1G anonymous shmem with 32 threads on my machine environment (ARM64 architecture, 32 cores, 125G memory):

base: mm-unstable
            user-time    sys_time    faults_per_sec_per_cpu    faults_per_sec
            0.04s        3.10s       83516.416                 2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
            user-time    sys_time    faults_per_sec_per_cpu    faults_per_sec
            0.02s        3.14s       82936.359                 2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
            user-time    sys_time    faults_per_sec_per_cpu    faults_per_sec
            0.08s        0.31s       678630.231                17082522.495
From the data above, it is observed that the patchset has minimal impact when mTHP is not enabled (some fluctuations were observed during testing). When 64K mTHP is enabled, there is a significant improvement in page fault latency.
[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
This patch (of 6):
Add support for establishing large folio mappings in finish_fault(), as preparation for supporting multi-size THP allocation of anonymous shmem pages in the following patches.
Keep the same behavior (per-page fault) for non-anon shmem to avoid inflating the RSS unintentionally, and we can discuss what size of mapping to build when extending mTHP to control non-anon shmem in the future.
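A minimal sketch of the fallback decision described above (the names mirror the patch, but this is not the literal kernel code): the whole folio is mapped only when it fits entirely inside both the VMA and the current PMD page table.

	/*
	 * idx:     offset of the faulting page inside its folio
	 * vma_off: page offset of the fault address inside the VMA
	 * pte_off: index of the fault address in the PMD page table
	 */
	static bool can_map_whole_folio(pgoff_t idx, pgoff_t vma_off,
					pgoff_t pte_off, unsigned int nr_pages,
					unsigned long vma_npages)
	{
		if (vma_off < idx || vma_off + (nr_pages - idx) > vma_npages)
			return false;	/* folio would spill outside the VMA */
		if (pte_off < idx || pte_off + (nr_pages - idx) > PTRS_PER_PTE)
			return false;	/* folio would cross the PMD page table */
		return true;		/* safe to map all nr_pages PTEs at once */
	}

If the check fails, the fault falls back to mapping a single page, as before.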
[baolin.wang@linux.alibaba.com: avoid going beyond the PMD pagetable size] Link: https://lkml.kernel.org/r/b0e6a8b1-a32c-459e-ae67-fde5d28773e6@linux.alibaba... [baolin.wang@linux.alibaba.com: use 'PTRS_PER_PTE' instead of 'PTRS_PER_PTE - 1'] Link: https://lkml.kernel.org/r/e1f5767a-2c9b-4e37-afe6-1de26fe54e41@linux.alibaba... Link: https://lkml.kernel.org/r/cover.1718090413.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/3a190892355989d42f59cf9f2f98b94694b0d24d.171809041... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Cc: Daniel Gomez da.gomez@samsung.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Lance Yang ioworker0@gmail.com Cc: Pankaj Raghav p.raghav@samsung.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Barry Song v-songbaohua@oppo.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memory.c | 61 ++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 51 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 1597718ed30e..bfc25fa206a2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4739,9 +4739,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page; + struct folio *folio; vm_fault_t ret; bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED); + int type, nr_pages; + unsigned long addr = vmf->address;
/* Did we COW the page? */ if (is_cow) @@ -4772,24 +4775,62 @@ vm_fault_t finish_fault(struct vm_fault *vmf) return VM_FAULT_OOM; }
+ folio = page_folio(page); + nr_pages = folio_nr_pages(folio); + + /* + * Using per-page fault to maintain the uffd semantics, and same + * approach also applies to non-anonymous-shmem faults to avoid + * inflating the RSS of the process. + */ + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) { + nr_pages = 1; + } else if (nr_pages > 1) { + pgoff_t idx = folio_page_idx(folio, page); + /* The page offset of vmf->address within the VMA. */ + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff; + /* The index of the entry in the pagetable for fault page. */ + pgoff_t pte_off = pte_index(vmf->address); + + /* + * Fallback to per-page fault in case the folio size in page + * cache beyond the VMA limits and PMD pagetable limits. + */ + if (unlikely(vma_off < idx || + vma_off + (nr_pages - idx) > vma_pages(vma) || + pte_off < idx || + pte_off + (nr_pages - idx) > PTRS_PER_PTE)) { + nr_pages = 1; + } else { + /* Now we can set mappings for the whole large folio. */ + addr = vmf->address - idx * PAGE_SIZE; + page = &folio->page; + } + } + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, - vmf->address, &vmf->ptl); + addr, &vmf->ptl); if (!vmf->pte) return VM_FAULT_NOPAGE;
/* Re-check under ptl */ - if (likely(!vmf_pte_changed(vmf))) { - struct folio *folio = page_folio(page); - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio); - - set_pte_range(vmf, folio, page, 1, vmf->address); - add_mm_counter(vma->vm_mm, type, 1); - ret = 0; - } else { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) { + update_mmu_tlb(vma, addr, vmf->pte); + ret = VM_FAULT_NOPAGE; + goto unlock; + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { + update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages); ret = VM_FAULT_NOPAGE; + goto unlock; }
+ folio_ref_add(folio, nr_pages - 1); + set_pte_range(vmf, folio, page, nr_pages, addr); + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio); + add_mm_counter(vma->vm_mm, type, nr_pages); + ret = 0; + +unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; }
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc1 commit 3d95bc21cea558c7cdb2942b4d0223a571e93f27 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In order to extend support for mTHP, add THP validation for PMD-mapped THP related statistics to avoid statistical confusion.
Link: https://lkml.kernel.org/r/c4b04cbd51e6951cc2436a87be8eaa4a1516faec.171809041... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Barry Song v-songbaohua@oppo.com Cc: Daniel Gomez da.gomez@samsung.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Pankaj Raghav p.raghav@samsung.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 275e2885ee83..727c83403bd8 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1678,7 +1678,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, return ERR_PTR(-E2BIG);
folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index); - if (!folio) + if (!folio && pages == HPAGE_PMD_NR) count_vm_event(THP_FILE_FALLBACK); } else { pages = 1; @@ -1698,7 +1698,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, if (xa_find(&mapping->i_pages, &index, index + pages - 1, XA_PRESENT)) { error = -EEXIST; - } else if (huge) { + } else if (pages == HPAGE_PMD_NR) { count_vm_event(THP_FILE_FALLBACK); count_vm_event(THP_FILE_FALLBACK_CHARGE); } @@ -2067,7 +2067,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, folio = shmem_alloc_and_add_folio(huge_gfp, inode, index, fault_mm, true); if (!IS_ERR(folio)) { - count_vm_event(THP_FILE_ALLOC); + if (folio_test_pmd_mappable(folio)) + count_vm_event(THP_FILE_ALLOC); goto alloced; } if (PTR_ERR(folio) == -EEXIST)
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc1 commit 4b98995530b77a97912230d8e1564ba7738db19c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
To support the use of mTHP with anonymous shmem, add a new sysfs interface 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/' directory for each mTHP to control whether shmem is enabled for that mTHP, with values similar to the top-level 'shmem_enabled', which can be set to: "always", "inherit (to inherit the top-level setting)", "within_size", "advise", "never". An 'inherit' option is added to ensure compatibility with these global settings, and the options 'force' and 'deny' are dropped, which are rather testing artifacts from the old ages.
By default, PMD-sized hugepages have enabled="inherit" and all other hugepage sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
In addition, if the top-level value is 'force', then only PMD-sized hugepages can have enabled="inherit"; otherwise the configuration fails, and vice versa. That means we now avoid letting non-PMD-sized THP override the global huge allocation.
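As a usage illustration (not part of the patch), a userspace program can select a per-size policy by writing to the new knob; the hugepages-64kB directory below is only an example size and exists only when the kernel supports that folio size:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char *knob = "/sys/kernel/mm/transparent_hugepage/"
				   "hugepages-64kB/shmem_enabled";
		int fd = open(knob, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Pick the "within_size" policy for 64K anonymous shmem. */
		if (write(fd, "within_size", strlen("within_size")) < 0)
			perror("write");
		close(fd);
		return 0;
	}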
[baolin.wang@linux.alibaba.com: fix transhuge.rst indentation] Link: https://lkml.kernel.org/r/b189d815-998b-4dfd-ba89-218ff51313f8@linux.alibaba... [akpm@linux-foundation.org: reflow transhuge.rst addition to 80 cols] [baolin.wang@linux.alibaba.com: move huge_shmem_orders_lock under CONFIG_SYSFS] Link: https://lkml.kernel.org/r/eb34da66-7f12-44f3-a39e-2bcc90c33354@linux.alibaba... [akpm@linux-foundation.org: huge_memory.c needs mm_types.h] Link: https://lkml.kernel.org/r/ffddfa8b3cb4266ff963099ab78cfd7184c57ac7.171809041... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: Barry Song v-songbaohua@oppo.com Cc: Daniel Gomez da.gomez@samsung.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Pankaj Raghav p.raghav@samsung.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/shmem.c [ Context conflicts in shmem.c with commit 8b52f97ede12 ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 25 ++++++ include/linux/huge_mm.h | 10 +++ mm/huge_memory.c | 12 +-- mm/shmem.c | 96 ++++++++++++++++++++++ 4 files changed, 135 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b6e5ba22176a..b63edde5a6d3 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -360,6 +360,31 @@ deny force Force the huge option on for all - very useful for testing;
+Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to +control mTHP allocation: +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled', +and its value for each mTHP is essentially consistent with the global +setting. An 'inherit' option is added to ensure compatibility with these +global settings. Conversely, the options 'force' and 'deny' are dropped, +which are rather testing artifacts from the old ages. + +always + Attempt to allocate <size> huge pages every time we need a new page; + +inherit + Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages + have enabled="inherit" and all other hugepage sizes have enabled="never"; + +never + Do not allocate <size> huge pages; + +within_size + Only allocate <size> huge page if it will be fully within i_size. + Also respect fadvise()/madvise() hints; + +advise + Only allocate <size> huge pages if requested with fadvise()/madvise(); + Need of application restart ===========================
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ddde47623562..294f19fc513f 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -6,6 +6,7 @@ #include <linux/mm_types.h>
#include <linux/fs.h> /* only for vma_is_dax() */ +#include <linux/kobject.h>
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -68,6 +69,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf, enum transparent_hugepage_flag flag); extern struct kobj_attribute shmem_enabled_attr; +extern struct kobj_attribute thpsize_shmem_enabled_attr;
/* * Mask of all large folio orders supported for anonymous THP; all orders up to @@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); }
+struct thpsize { + struct kobject kobj; + struct list_head node; + int order; +}; + +#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj) + enum mthp_stat_item { MTHP_STAT_ANON_FAULT_ALLOC, MTHP_STAT_ANON_FAULT_FALLBACK, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6bf5708503f1..ddcf1766e3c4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -20,6 +20,7 @@ #include <linux/swapops.h> #include <linux/backing-dev.h> #include <linux/dax.h> +#include <linux/mm_types.h> #include <linux/khugepaged.h> #include <linux/freezer.h> #include <linux/pfn_t.h> @@ -600,14 +601,6 @@ static void thpsize_release(struct kobject *kobj); static DEFINE_SPINLOCK(huge_anon_orders_lock); static LIST_HEAD(thpsize_list);
-struct thpsize { - struct kobject kobj; - struct list_head node; - int order; -}; - -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj) - static ssize_t thpsize_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -675,6 +668,9 @@ static struct kobj_attribute thpsize_enabled_attr =
static struct attribute *thpsize_attrs[] = { &thpsize_enabled_attr.attr, +#ifdef CONFIG_SHMEM + &thpsize_shmem_enabled_attr.attr, +#endif NULL, };
diff --git a/mm/shmem.c b/mm/shmem.c index 727c83403bd8..56697a3f558a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -131,6 +131,13 @@ struct shmem_options { #define SHMEM_SEEN_QUOTA 32 };
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static unsigned long huge_shmem_orders_always __read_mostly; +static unsigned long huge_shmem_orders_madvise __read_mostly; +static unsigned long huge_shmem_orders_inherit __read_mostly; +static unsigned long huge_shmem_orders_within_size __read_mostly; +#endif + #ifdef CONFIG_TMPFS static unsigned long shmem_default_max_blocks(void) { @@ -4645,6 +4652,12 @@ void __init shmem_init(void) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ + + /* + * Default to setting PMD-sized THP to inherit the global setting and + * disable all other multi-size THPs. + */ + huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); #endif
shmem_reliable_init(); @@ -4707,6 +4720,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj, huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY) return -EINVAL;
+ /* Do not override huge allocation policy with non-PMD sized mTHP */ + if (huge == SHMEM_HUGE_FORCE && + huge_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER)) + return -EINVAL; + shmem_huge = huge; if (shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; @@ -4714,6 +4732,84 @@ static ssize_t shmem_enabled_store(struct kobject *kobj, }
struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled); +static DEFINE_SPINLOCK(huge_shmem_orders_lock); + +static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int order = to_thpsize(kobj)->order; + const char *output; + + if (test_bit(order, &huge_shmem_orders_always)) + output = "[always] inherit within_size advise never"; + else if (test_bit(order, &huge_shmem_orders_inherit)) + output = "always [inherit] within_size advise never"; + else if (test_bit(order, &huge_shmem_orders_within_size)) + output = "always inherit [within_size] advise never"; + else if (test_bit(order, &huge_shmem_orders_madvise)) + output = "always inherit within_size [advise] never"; + else + output = "always inherit within_size advise [never]"; + + return sysfs_emit(buf, "%s\n", output); +} + +static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int order = to_thpsize(kobj)->order; + ssize_t ret = count; + + if (sysfs_streq(buf, "always")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_madvise); + clear_bit(order, &huge_shmem_orders_within_size); + set_bit(order, &huge_shmem_orders_always); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "inherit")) { + /* Do not override huge allocation policy with non-PMD sized mTHP */ + if (shmem_huge == SHMEM_HUGE_FORCE && + order != HPAGE_PMD_ORDER) + return -EINVAL; + + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_madvise); + clear_bit(order, &huge_shmem_orders_within_size); + set_bit(order, &huge_shmem_orders_inherit); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "within_size")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_madvise); + set_bit(order, &huge_shmem_orders_within_size); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "madvise")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_within_size); + set_bit(order, &huge_shmem_orders_madvise); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "never")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_within_size); + clear_bit(order, &huge_shmem_orders_madvise); + spin_unlock(&huge_shmem_orders_lock); + } else { + ret = -EINVAL; + } + + return ret; +} + +struct kobj_attribute thpsize_shmem_enabled_attr = + __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store); #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
#else /* !CONFIG_SHMEM */
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc1 commit e7a2ab7b3bb5d87f99f2ea3d4481d52fc5ceb52d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 19eaf44954df added multi-size THP (mTHP) for anonymous pages, which allows THP to be configured through the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, anonymous shmem ignores the anonymous mTHP rule configured through the sysfs interface and can only use PMD-mapped THP, which is not reasonable. Users expect the mTHP rule to apply to all anonymous pages, including anonymous shmem, in order to enjoy the benefits of mTHP: for example, lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and contiguous PTEs on the ARM architecture to reduce TLB misses. In addition, the mTHP interfaces can be extended to support all shmem/tmpfs scenarios in the future, especially the shmem mmap() case.
The primary strategy is similar to the support for anonymous mTHP. Introduce a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which can take almost the same values as the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new "inherit" option and dropping the testing options 'force' and 'deny'. By default all sizes are set to "never" except the PMD size, which is set to "inherit". This keeps backward compatibility with the top-level anonymous shmem setting, while also allowing independent control of anonymous shmem enablement for each mTHP size.
Link: https://lkml.kernel.org/r/65796c1e72e51e15f3410195b5c2d5b6c160d411.171809041... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: Barry Song v-songbaohua@oppo.com Cc: Daniel Gomez da.gomez@samsung.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Pankaj Raghav p.raghav@samsung.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/shmem.c [ Conflicts with commit 5b97d5485d7a and 19a9d856a08b. If mm or task in dynamic pool, force set order to 0. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/huge_mm.h | 10 +++ mm/shmem.c | 191 +++++++++++++++++++++++++++++++++------- 2 files changed, 170 insertions(+), 31 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 294f19fc513f..2cfaa87cb24a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void) { return false; } + +static inline int highest_order(unsigned long orders) +{ + return 0; +} + +static inline int next_order(unsigned long *orders, int prev) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio, diff --git a/mm/shmem.c b/mm/shmem.c index 56697a3f558a..d99e41ceea3c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1638,6 +1638,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) return result; }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge) +{ + unsigned long mask = READ_ONCE(huge_shmem_orders_always); + unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); + unsigned long vm_flags = vma->vm_flags; + /* + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that + * are enabled for this vma. + */ + unsigned long orders = BIT(PMD_ORDER + 1) - 1; + loff_t i_size; + int order; + + if ((vm_flags & VM_NOHUGEPAGE) || + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + return 0; + + /* If the hardware/firmware marked hugepage support disabled. */ + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) + return 0; + + /* + * Following the 'deny' semantics of the top level, force the huge + * option off from all mounts. + */ + if (shmem_huge == SHMEM_HUGE_DENY) + return 0; + + /* + * Only allow inherit orders if the top-level value is 'force', which + * means non-PMD sized THP can not override 'huge' mount option now. + */ + if (shmem_huge == SHMEM_HUGE_FORCE) + return READ_ONCE(huge_shmem_orders_inherit); + + /* Allow mTHP that will be fully within i_size. */ + order = highest_order(within_size_orders); + while (within_size_orders) { + index = round_up(index + 1, order); + i_size = round_up(i_size_read(inode), PAGE_SIZE); + if (i_size >> PAGE_SHIFT >= index) { + mask |= within_size_orders; + break; + } + + order = next_order(&within_size_orders, order); + } + + if (vm_flags & VM_HUGEPAGE) + mask |= READ_ONCE(huge_shmem_orders_madvise); + + if (global_huge) + mask |= READ_ONCE(huge_shmem_orders_inherit); + + return orders & mask; +} + +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, + struct address_space *mapping, pgoff_t index, + unsigned long orders) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long pages; + int order; + + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + if (!orders) + return 0; + + /* Find the highest order that can add into the page cache */ + order = highest_order(orders); + while (orders) { + pages = 1UL << order; + index = round_down(index, pages); + if (!xa_find(&mapping->i_pages, &index, + index + pages - 1, XA_PRESENT)) + break; + order = next_order(&orders, order); + } + + return orders; +} +#else +static unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge) +{ + return 0; +} + +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, + struct address_space *mapping, pgoff_t index, + unsigned long orders) +{ + return 0; +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + static struct folio *shmem_alloc_folio(gfp_t gfp, int order, struct shmem_inode_info *info, pgoff_t index) { @@ -1652,41 +1753,58 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order, return folio; }
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, - struct inode *inode, pgoff_t index, - struct mm_struct *fault_mm, bool huge) +static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, + gfp_t gfp, struct inode *inode, pgoff_t index, + struct mm_struct *fault_mm, unsigned long orders) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); - struct folio *folio; + struct vm_area_struct *vma = vmf ? vmf->vma : NULL; + unsigned long suitable_orders = 0; + struct folio *folio = NULL; long pages; - int error; + int error, order;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) - huge = false; + orders = 0;
if (!shmem_prepare_alloc(&gfp)) goto no_mem;
- if (huge) { - pages = HPAGE_PMD_NR; - index = round_down(index, HPAGE_PMD_NR); + if (orders > 0) { + if (vma && vma_is_anon_shmem(vma)) { + suitable_orders = shmem_suitable_orders(inode, vmf, + mapping, index, orders); + } else if (orders & BIT(HPAGE_PMD_ORDER)) { + pages = HPAGE_PMD_NR; + suitable_orders = BIT(HPAGE_PMD_ORDER); + index = round_down(index, HPAGE_PMD_NR);
- /* - * Check for conflict before waiting on a huge allocation. - * Conflict might be that a huge page has just been allocated - * and added to page cache by a racing thread, or that there - * is already at least one small page in the huge extent. - * Be careful to retry when appropriate, but not forever! - * Elsewhere -EEXIST would be the right code, but not here. - */ - if (xa_find(&mapping->i_pages, &index, - index + HPAGE_PMD_NR - 1, XA_PRESENT)) - return ERR_PTR(-E2BIG); + /* + * Check for conflict before waiting on a huge allocation. + * Conflict might be that a huge page has just been allocated + * and added to page cache by a racing thread, or that there + * is already at least one small page in the huge extent. + * Be careful to retry when appropriate, but not forever! + * Elsewhere -EEXIST would be the right code, but not here. + */ + if (xa_find(&mapping->i_pages, &index, + index + HPAGE_PMD_NR - 1, XA_PRESENT)) + return ERR_PTR(-E2BIG); + }
- folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index); - if (!folio && pages == HPAGE_PMD_NR) - count_vm_event(THP_FILE_FALLBACK); + order = highest_order(suitable_orders); + while (suitable_orders) { + pages = 1UL << order; + index = round_down(index, pages); + folio = shmem_alloc_folio(gfp, order, info, index); + if (folio) + goto allocated; + + if (pages == HPAGE_PMD_NR) + count_vm_event(THP_FILE_FALLBACK); + order = next_order(&suitable_orders, order); + } } else { pages = 1; folio = shmem_alloc_folio(gfp, 0, info, index); @@ -1696,6 +1814,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, if (!folio) return ERR_PTR(-ENOMEM);
+allocated: __folio_set_locked(folio); __folio_set_swapbacked(folio);
@@ -1992,7 +2111,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, struct mm_struct *fault_mm; struct folio *folio; int error; - bool alloced; + bool alloced, huge; + unsigned long orders = 0;
if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping))) return -EINVAL; @@ -2064,15 +2184,24 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, return 0; }
- if (shmem_is_huge(inode, index, false, fault_mm, - vma ? vma->vm_flags : 0) && - !mm_in_dynamic_pool(vma ? vma->vm_mm : current->mm)) { + huge = shmem_is_huge(inode, index, false, fault_mm, + vma ? vma->vm_flags : 0); + + /* Find hugepage orders that are allowed for anonymous shmem. */ + if (mm_in_dynamic_pool(vma ? vma->vm_mm : current->mm)) + orders = 0; + else if (vma && vma_is_anon_shmem(vma)) + orders = shmem_allowable_huge_orders(inode, vma, index, huge); + else if (huge) + orders = BIT(HPAGE_PMD_ORDER); + + if (orders > 0) { gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma); huge_gfp = limit_gfp_mask(huge_gfp, gfp); - folio = shmem_alloc_and_add_folio(huge_gfp, - inode, index, fault_mm, true); + folio = shmem_alloc_and_add_folio(vmf, huge_gfp, + inode, index, fault_mm, orders); if (!IS_ERR(folio)) { if (folio_test_pmd_mappable(folio)) count_vm_event(THP_FILE_ALLOC); @@ -2082,7 +2211,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, goto repeat; }
- folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false); + folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0); if (IS_ERR(folio)) { error = PTR_ERR(folio); if (error == -EEXIST) @@ -2093,7 +2222,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
alloced: alloced = true; - if (folio_test_pmd_mappable(folio) && + if (folio_test_large(folio) && DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) < folio_next_index(folio) - 1) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc1 commit 5a9dd10380a16b343aa87d80d5bcc24409a03f5b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Although the top-level hugepage allocation can be turned off, anonymous shmem can still use mTHP by configuring the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled'. Therefore, add mTHP size alignment so that shmem_get_unmapped_area() returns a suitably aligned address.
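The alignment arithmetic the patch generalizes can be sketched as follows (illustrative only; in the patch, hpage_size is derived from the highest enabled mTHP order rather than being fixed to HPAGE_PMD_SIZE):

	/*
	 * Adjust an address hint so that it shares alignment with the file
	 * offset modulo hpage_size (e.g. 64K for an order-4 mTHP).
	 */
	static unsigned long align_hint(unsigned long addr, unsigned long pgoff,
					unsigned long hpage_size)
	{
		unsigned long offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1);
		unsigned long addr_offset = addr & (hpage_size - 1);

		addr += offset - addr_offset;
		if (addr_offset > offset)
			addr += hpage_size;
		return addr;
	}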
Link: https://lkml.kernel.org/r/0c549b57cf7db07503af692d8546ecfad0fcce52.171809041... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Tested-by: Lance Yang ioworker0@gmail.com Cc: Barry Song v-songbaohua@oppo.com Cc: Daniel Gomez da.gomez@samsung.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Pankaj Raghav p.raghav@samsung.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 40 +++++++++++++++++++++++++++++++--------- 1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index d99e41ceea3c..dff675ccce38 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2409,6 +2409,7 @@ unsigned long shmem_get_unmapped_area(struct file *file, unsigned long inflated_len; unsigned long inflated_addr; unsigned long inflated_offset; + unsigned long hpage_size;
if (len > TASK_SIZE) return -ENOMEM; @@ -2427,8 +2428,6 @@ unsigned long shmem_get_unmapped_area(struct file *file,
if (shmem_huge == SHMEM_HUGE_DENY) return addr; - if (len < HPAGE_PMD_SIZE) - return addr; if (flags & MAP_FIXED) return addr; /* @@ -2440,8 +2439,11 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (uaddr == addr) return addr;
+ hpage_size = HPAGE_PMD_SIZE; if (shmem_huge != SHMEM_HUGE_FORCE) { struct super_block *sb; + unsigned long __maybe_unused hpage_orders; + int order = 0;
if (file) { VM_BUG_ON(file->f_op != &shmem_file_operations); @@ -2454,18 +2456,38 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (IS_ERR(shm_mnt)) return addr; sb = shm_mnt->mnt_sb; + + /* + * Find the highest mTHP order used for anonymous shmem to + * provide a suitable alignment address. + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + hpage_orders = READ_ONCE(huge_shmem_orders_always); + hpage_orders |= READ_ONCE(huge_shmem_orders_within_size); + hpage_orders |= READ_ONCE(huge_shmem_orders_madvise); + if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER) + hpage_orders |= READ_ONCE(huge_shmem_orders_inherit); + + if (hpage_orders > 0) { + order = highest_order(hpage_orders); + hpage_size = PAGE_SIZE << order; + } +#endif } - if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER) + if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order) return addr; }
- offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1); - if (offset && offset + len < 2 * HPAGE_PMD_SIZE) + if (len < hpage_size) + return addr; + + offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1); + if (offset && offset + len < 2 * hpage_size) return addr; - if ((addr & (HPAGE_PMD_SIZE-1)) == offset) + if ((addr & (hpage_size - 1)) == offset) return addr;
- inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE; + inflated_len = len + hpage_size - PAGE_SIZE; if (inflated_len > TASK_SIZE) return addr; if (inflated_len < len) @@ -2477,10 +2499,10 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (inflated_addr & ~PAGE_MASK) return addr;
- inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1); + inflated_offset = inflated_addr & (hpage_size - 1); inflated_addr += offset - inflated_offset; if (inflated_offset > offset) - inflated_addr += HPAGE_PMD_SIZE; + inflated_addr += hpage_size;
if (inflated_addr > TASK_SIZE - len) return addr;
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc1 commit 66f44583f9b617d74ffa2487e75a9c3adf344ddb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Add mTHP counters for anonymous shmem.
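Once the patch is applied, the new counters can be read from sysfs; a minimal example follows (the hugepages-64kB path and the stats directory name are assumptions based on the existing per-order mTHP counters):

	#include <stdio.h>

	int main(void)
	{
		const char *path = "/sys/kernel/mm/transparent_hugepage/"
				   "hugepages-64kB/stats/file_alloc";
		unsigned long long count;
		FILE *f = fopen(path, "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* file_alloc counts successful shmem mTHP allocations of this size. */
		if (fscanf(f, "%llu", &count) == 1)
			printf("64K shmem mTHP allocations: %llu\n", count);
		fclose(f);
		return 0;
	}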
[baolin.wang@linux.alibaba.com: update Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/d86e2e7f-4141-432b-b2ba-c6691f36ef0b@linux.alibaba... Link: https://lkml.kernel.org/r/4fd9e467d49ae4a747e428bcd821c7d13125ae67.171809041... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Lance Yang ioworker0@gmail.com Cc: Barry Song v-songbaohua@oppo.com Cc: Daniel Gomez da.gomez@samsung.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Pankaj Raghav p.raghav@samsung.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 13 +++++++++++++ include/linux/huge_mm.h | 3 +++ mm/huge_memory.c | 6 ++++++ mm/shmem.c | 18 +++++++++++++++--- 4 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b63edde5a6d3..89e124d66ceb 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -529,6 +529,19 @@ swpout_fallback Usually because failed to allocate some continuous swap space for the huge page.
+file_alloc + is incremented every time a file huge page is successfully + allocated. + +file_fallback + is incremented if a file huge page is attempted to be allocated + but fails and instead falls back to using small pages. + +file_fallback_charge + is incremented if a file huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2cfaa87cb24a..0fe1f2ec6a76 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -281,6 +281,9 @@ enum mthp_stat_item { MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, MTHP_STAT_SWPOUT, MTHP_STAT_SWPOUT_FALLBACK, + MTHP_STAT_FILE_ALLOC, + MTHP_STAT_FILE_FALLBACK, + MTHP_STAT_FILE_FALLBACK_CHARGE, __MTHP_STAT_COUNT };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ddcf1766e3c4..ab324eaac644 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -714,6 +714,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT); DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC); +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK); +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
static struct attribute *stats_attrs[] = { &anon_fault_alloc_attr.attr, @@ -721,6 +724,9 @@ static struct attribute *stats_attrs[] = { &anon_fault_fallback_charge_attr.attr, &swpout_attr.attr, &swpout_fallback_attr.attr, + &file_alloc_attr.attr, + &file_fallback_attr.attr, + &file_fallback_charge_attr.attr, NULL, };
diff --git a/mm/shmem.c b/mm/shmem.c index dff675ccce38..3e15fdb2869e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1803,6 +1803,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
if (pages == HPAGE_PMD_NR) count_vm_event(THP_FILE_FALLBACK); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK); +#endif order = next_order(&suitable_orders, order); } } else { @@ -1824,9 +1827,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, if (xa_find(&mapping->i_pages, &index, index + pages - 1, XA_PRESENT)) { error = -EEXIST; - } else if (pages == HPAGE_PMD_NR) { - count_vm_event(THP_FILE_FALLBACK); - count_vm_event(THP_FILE_FALLBACK_CHARGE); + } else if (pages > 1) { + if (pages == HPAGE_PMD_NR) { + count_vm_event(THP_FILE_FALLBACK); + count_vm_event(THP_FILE_FALLBACK_CHARGE); + } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK); + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE); +#endif } goto unlock; } @@ -2205,6 +2214,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, if (!IS_ERR(folio)) { if (folio_test_pmd_mappable(folio)) count_vm_event(THP_FILE_ALLOC); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC); +#endif goto alloced; } if (PTR_ERR(folio) == -EEXIST)
From: Bang Li libang.li@antgroup.com
mainline inclusion from mainline-v6.11-rc1 commit 843a2e24c24c5311831860c6b78ceacdd4627000 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Multi-size THP (mTHP) was introduced by commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP"), and anonymous shmem now supports mTHP as well, so we can configure different policies for anonymous shmem through the multi-size THP sysfs interface.
But when we configure the "advise" policy of /sys/kernel/mm/transparent_hugepage/hugepages-xxxkB/shmem_enabled, we cannot write "advise"; we have to write "madvise" instead, which is unreasonable. We should keep the output and input values consistent, which is more convenient for users.
Link: https://lkml.kernel.org/r/20240628032327.16987-1-libang.li@antgroup.com Fixes: 61a57f1b1da9 ("mm: shmem: add multi-size THP sysfs interface for anonymous shmem") Signed-off-by: Bang Li libang.li@antgroup.com Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: Bang Li libang.li@antgroup.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Ryan Roberts ryan.roberts@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 3e15fdb2869e..c01b022bdb84 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -4950,7 +4950,7 @@ static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj, clear_bit(order, &huge_shmem_orders_madvise); set_bit(order, &huge_shmem_orders_within_size); spin_unlock(&huge_shmem_orders_lock); - } else if (sysfs_streq(buf, "madvise")) { + } else if (sysfs_streq(buf, "advise")) { spin_lock(&huge_shmem_orders_lock); clear_bit(order, &huge_shmem_orders_always); clear_bit(order, &huge_shmem_orders_inherit);
From: Bang Li libang.li@antgroup.com
mainline inclusion from mainline-v6.11-rc1 commit 26c7d8413aaf113a54b54f63e151416a5c5c2a88 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
After commit 7fb1b252afb5 ("mm: shmem: add mTHP support for anonymous shmem"), we can configure different policies through the multi-size THP sysfs interface for anonymous shmem. But currently, for anonymous shmem, "THPeligible" only indicates whether the mapping is eligible for allocating PMD-mappable THP pages; we need to support "THPeligible" semantics for mTHP with anonymous shmem similar to those for mTHP with anonymous memory.
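For reference, the flag can be observed from userspace by scanning smaps; a minimal sketch (illustrative only, reading the calling process's own mappings):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Print every THPeligible line; 1 means the VMA may use THP/mTHP. */
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "THPeligible:", 12))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}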
Link: https://lkml.kernel.org/r/20240705032309.24933-1-libang.li@antgroup.com Signed-off-by: Bang Li libang.li@antgroup.com Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/shmem_fs.h | 9 +++++++++ mm/huge_memory.c | 13 +++++++++---- mm/shmem.c | 9 +-------- 3 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index f0c6bf982832..41aa4e0d6dbc 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -117,12 +117,21 @@ int shmem_unuse(unsigned int type); #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force, struct mm_struct *mm, unsigned long vm_flags); +unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge); #else static __always_inline bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force, struct mm_struct *mm, unsigned long vm_flags) { return false; } +static inline unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge) +{ + return 0; +} #endif
#ifdef CONFIG_SHMEM diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ab324eaac644..293517f04ade 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -155,10 +155,15 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, * Must be done before hugepage flags check since shmem has its * own flags. */ - if (!in_pf && shmem_file(vma->vm_file)) - return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, - !enforce_sysfs, vma->vm_mm, vm_flags) - ? orders : 0; + if (!in_pf && shmem_file(vma->vm_file)) { + bool global_huge = shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, + !enforce_sysfs, vma->vm_mm, vm_flags); + + if (!vma_is_anon_shmem(vma)) + return global_huge ? orders : 0; + return shmem_allowable_huge_orders(file_inode(vma->vm_file), + vma, vma->vm_pgoff, global_huge); + }
if (!vma_is_anonymous(vma)) { /* diff --git a/mm/shmem.c b/mm/shmem.c index c01b022bdb84..691f8efc438c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1639,7 +1639,7 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE -static unsigned long shmem_allowable_huge_orders(struct inode *inode, +unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, bool global_huge) { @@ -1724,13 +1724,6 @@ static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault return orders; } #else -static unsigned long shmem_allowable_huge_orders(struct inode *inode, - struct vm_area_struct *vma, pgoff_t index, - bool global_huge) -{ - return 0; -} - static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, struct address_space *mapping, pgoff_t index, unsigned long orders)
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc2 commit b66b1b71d7ff5464d23a0ac6f73fae461b7264fd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Similar to commit d659b715e94ac ("mm/huge_memory: avoid PMD-size page cache if needed"), ARM64 can support 512MB PMD-sized THP when the base page size is 64KB, which exceeds the maximum folio size supported by the page cache (MAX_PAGECACHE_ORDER).
This is not expected. To fix this issue, use THP_ORDERS_ALL_FILE_DEFAULT for shmem to filter allowable huge orders.
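A back-of-the-envelope check of the numbers above (a standalone sketch; the MAX_PAGECACHE_ORDER value of 11 is assumed from the referenced commit, not taken from this patch):

/* Hedged arithmetic sketch: with 64KB base pages on arm64, a PMD-sized THP
 * is order 13 (512MB), while the page cache caps folio orders at
 * MAX_PAGECACHE_ORDER (assumed 11 here), so shmem must filter out the
 * larger orders. */
#include <stdio.h>

int main(void)
{
        int page_shift = 16;                    /* 64KB base page */
        int pmd_shift = 29;                     /* arm64, 64KB granule */
        int pmd_order = pmd_shift - page_shift; /* 13 */
        int max_pagecache_order = 11;           /* assumed value */

        printf("PMD order %d = %ld MB, page cache max order %d\n",
               pmd_order, (1L << pmd_shift) >> 20, max_pagecache_order);
        printf("PMD-sized shmem THP allowed: %s\n",
               pmd_order <= max_pagecache_order ? "yes" : "no");
        return 0;
}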
[baolin.wang@linux.alibaba.com: remove comment, per Barry] Link: https://lkml.kernel.org/r/c55d7ef7-78aa-4ed6-b897-c3e03a3f3ab7@linux.alibaba... [wangkefeng.wang@huawei.com: remove local `orders'] Link: https://lkml.kernel.org/r/87769ae8-b6c6-4454-925d-1864364af9c8@huawei.com Link: https://lkml.kernel.org/r/117121665254442c3c7f585248296495e5e2b45c.172240407... Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem") Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Barry Song baohua@kernel.org Cc: Barry Song 21cnbao@gmail.com Cc: David Hildenbrand david@redhat.com Cc: Gavin Shan gshan@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox willy@infradead.org Cc: Ryan Roberts ryan.roberts@arm.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index 691f8efc438c..f579367c5968 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1646,11 +1646,6 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode, unsigned long mask = READ_ONCE(huge_shmem_orders_always); unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); unsigned long vm_flags = vma->vm_flags; - /* - * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that - * are enabled for this vma. - */ - unsigned long orders = BIT(PMD_ORDER + 1) - 1; loff_t i_size; int order;
@@ -1695,7 +1690,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode, if (global_huge) mask |= READ_ONCE(huge_shmem_orders_inherit);
- return orders & mask; + return THP_ORDERS_ALL_FILE_DEFAULT & mask; }
static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc2 commit 4cbf320b1500fe64fcef8c96ed74dfc1ae2c9e2c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In the shmem_suitable_orders() function, xa_find() is used to check for conflicts in the page cache when selecting suitable huge orders. However, on each loop iteration the aligned index is computed from the index that was already rounded down for the previous (larger) order, which can cause suitable huge orders to be missed.
To avoid this, recompute the aligned index from the original index on every iteration when checking for conflicts.
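A standalone simulation of the effect (hypothetical index values, not kernel code): with a fault at index 40 and a single existing page cache entry at index 33, the old calculation also rejects order 3, while realigning from the original index finds order 3 suitable.

#include <stdbool.h>
#include <stdio.h>

static unsigned long round_down_ul(unsigned long x, unsigned long align)
{
        return x & ~(align - 1);
}

static bool range_has_conflict(unsigned long start, unsigned long end)
{
        const unsigned long present = 33;       /* an existing page cache entry */

        return present >= start && present <= end;
}

int main(void)
{
        const unsigned long fault_index = 40;
        unsigned long index = fault_index;
        int order;

        for (order = 4; order >= 3; order--) {
                unsigned long pages = 1UL << order;
                unsigned long aligned = round_down_ul(fault_index, pages);

                /* Old behaviour: keeps rounding down the already-rounded index. */
                index = round_down_ul(index, pages);
                printf("order %d: old checks [%lu,%lu] -> %s, fixed checks [%lu,%lu] -> %s\n",
                       order,
                       index, index + pages - 1,
                       range_has_conflict(index, index + pages - 1) ? "conflict" : "free",
                       aligned, aligned + pages - 1,
                       range_has_conflict(aligned, aligned + pages - 1) ? "conflict" : "free");
        }
        return 0;
}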
Link: https://lkml.kernel.org/r/07433b0f16a152bffb8cee34934a5c040e8e2ad6.172240407... Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem") Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Acked-by: David Hildenbrand david@redhat.com Cc: Barry Song 21cnbao@gmail.com Cc: Gavin Shan gshan@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox willy@infradead.org Cc: Ryan Roberts ryan.roberts@arm.com Cc: Zi Yan ziy@nvidia.com Cc: Barry Song baohua@kernel.org Cc: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/shmem.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index f579367c5968..69593d2b1566 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1698,6 +1698,7 @@ static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault unsigned long orders) { struct vm_area_struct *vma = vmf->vma; + pgoff_t aligned_index; unsigned long pages; int order;
@@ -1709,9 +1710,9 @@ static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault order = highest_order(orders); while (orders) { pages = 1UL << order; - index = round_down(index, pages); - if (!xa_find(&mapping->i_pages, &index, - index + pages - 1, XA_PRESENT)) + aligned_index = round_down(index, pages); + if (!xa_find(&mapping->i_pages, &aligned_index, + aligned_index + pages - 1, XA_PRESENT)) break; order = next_order(&orders, order); }
From: Lance Yang ioworker0@gmail.com
mainline inclusion from mainline-v6.11-rc1 commit f216c845f3c772e54d27fe209fd300b10e7bf54a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: introduce per-order mTHP split counters", v3.
At present, the split counters in THP statistics do not include PTE-mapped mTHP. Therefore, we want to introduce per-order mTHP split counters to monitor the frequency of mTHP splits. This will assist developers in better analyzing and optimizing system performance.
/sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
        split
        split_failed
        split_deferred
This patch (of 2):
Currently, the split counters in THP statistics do not include PTE-mapped mTHP. Therefore, we propose introducing per-order mTHP split counters to monitor the frequency of mTHP splits. This will help developers better analyze and optimize system performance.
/sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
        split
        split_failed
        split_deferred
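For reference, a small userspace sketch that reads these counters (the hugepages-64kB directory is only an example size):

/* Hedged sketch: dump the per-order split counters listed above for one
 * example mTHP size.  Adjust hugepages-64kB to a size present on the
 * running system. */
#include <stdio.h>

int main(void)
{
        static const char *names[] = { "split", "split_failed", "split_deferred" };
        char path[128], buf[64];
        unsigned int i;

        for (i = 0; i < 3; i++) {
                snprintf(path, sizeof(path),
                         "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/%s",
                         names[i]);
                FILE *f = fopen(path, "r");

                if (!f)
                        continue;
                if (fgets(buf, sizeof(buf), f))
                        printf("%s: %s", names[i], buf);
                fclose(f);
        }
        return 0;
}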
[ioworker0@gmail.com: make things more readable, per Barry and Baolin] Link: https://lkml.kernel.org/r/20240704012905.42971-2-ioworker0@gmail.com [ioworker0@gmail.com: use == for `order' test, per David] Link: https://lkml.kernel.org/r/20240705113119.82210-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240704012905.42971-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240704012905.42971-2-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240628130750.73097-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240628130750.73097-2-ioworker0@gmail.com Signed-off-by: Mingzhe Yang mingzhe.yang@ly.com Signed-off-by: Lance Yang ioworker0@gmail.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Acked-by: Barry Song baohua@kernel.org Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Acked-by: David Hildenbrand david@redhat.com Cc: Bang Li libang.li@antgroup.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/huge_mm.h | 3 +++ mm/huge_memory.c | 12 ++++++++++-- 2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0fe1f2ec6a76..54391f7a374f 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -284,6 +284,9 @@ enum mthp_stat_item { MTHP_STAT_FILE_ALLOC, MTHP_STAT_FILE_FALLBACK, MTHP_STAT_FILE_FALLBACK_CHARGE, + MTHP_STAT_SPLIT, + MTHP_STAT_SPLIT_FAILED, + MTHP_STAT_SPLIT_DEFERRED, __MTHP_STAT_COUNT };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 293517f04ade..e4f3e3f4b744 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -722,6 +722,9 @@ DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC); DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK); DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT); +DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); +DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
static struct attribute *stats_attrs[] = { &anon_fault_alloc_attr.attr, @@ -732,6 +735,9 @@ static struct attribute *stats_attrs[] = { &file_alloc_attr.attr, &file_fallback_attr.attr, &file_fallback_charge_attr.attr, + &split_attr.attr, + &split_failed_attr.attr, + &split_deferred_attr.attr, NULL, };
@@ -3250,7 +3256,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; - bool is_thp = folio_test_pmd_mappable(folio); + int order = folio_order(folio); int extra_pins, ret; pgoff_t end; bool is_hzp; @@ -3435,8 +3441,9 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, i_mmap_unlock_read(mapping); out: xas_destroy(&xas); - if (is_thp) + if (order == HPAGE_PMD_ORDER) count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED); + count_mthp_stat(order, !ret ? MTHP_STAT_SPLIT : MTHP_STAT_SPLIT_FAILED); return ret; }
@@ -3489,6 +3496,7 @@ void deferred_split_folio(struct folio *folio) if (list_empty(&folio->_deferred_list)) { if (folio_test_pmd_mappable(folio)) count_vm_event(THP_DEFERRED_SPLIT_PAGE); + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED); list_add_tail(&folio->_deferred_list, &ds_queue->split_queue); ds_queue->split_queue_len++; #ifdef CONFIG_MEMCG
From: Lance Yang ioworker0@gmail.com
mainline inclusion from mainline-v6.11-rc1 commit 9b89e018990de47c72ef8b2ca29204f88fda8f05 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This commit introduces documentation for mTHP split counters in transhuge.rst.
[ioworker0@gmail.com: improve the doc as suggested by Ryan] Link: https://lkml.kernel.org/r/20240704012905.42971-3-ioworker0@gmail.com [ioworker0@gmail.com: tweak Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/20240707013659.1151-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240628130750.73097-3-ioworker0@gmail.com Signed-off-by: Mingzhe Yang mingzhe.yang@ly.com Signed-off-by: Lance Yang ioworker0@gmail.com Reviewed-by: Barry Song baohua@kernel.org Reviewed-by: Ryan Roberts ryan.roberts@arm.com Acked-by: David Hildenbrand david@redhat.com Cc: Bang Li libang.li@antgroup.com Cc: Baolin Wang baolin.wang@linux.alibaba.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 89e124d66ceb..13e09977c43a 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -397,10 +397,6 @@ also applies to the regions registered in khugepaged. Monitoring usage ================
-.. note:: - Currently the below counters only record events relating to - PMD-sized THP. Events relating to other THP sizes are not included. - The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. To identify what applications are using PMD-sized anonymous transparent huge @@ -542,6 +538,21 @@ file_fallback_charge falls back to using small pages even though the allocation was successful.
+split + is incremented every time a huge page is successfully split into + smaller orders. This can happen for a variety of reasons but a + common reason is that a huge page is old and is being reclaimed. + +split_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +split_deferred + is incremented when a huge page is put onto split queue. + This happens when a huge page is partially unmapped and splitting + it would free up some memory. Pages on split queue are going to + be split under memory pressure, if splitting is possible. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.11-rc1 commit 63d9866ab01ffd0d0835d5564107283a4afc0a38 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IAIHPC
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The legacy PMD-sized THP counters at /proc/vmstat include thp_file_alloc, thp_file_fallback and thp_file_fallback_charge, which rather confusingly refer to shmem THP and do not include any other types of file pages. This is inconsistent since in most other places in the kernel, THP counters are explicitly separated for anon, shmem and file flavours. However, we are stuck with it since it constitutes a user ABI.
Recently, commit 66f44583f9b6 ("mm: shmem: add mTHP counters for anonymous shmem") added equivalent mTHP stats for shmem, keeping the same "file_" prefix in the names. But in future, we may want to add extra stats to cover actual file pages, at which point, it would all become very confusing.
So let's take the opportunity to rename these new counters "shmem_" before the change makes it upstream and the ABI becomes immutable. While we are at it, let's improve the documentation for the legacy counters to make it clear that they count shmem pages only.
Link: https://lkml.kernel.org/r/20240710095503.3193901-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Lance Yang ioworker0@gmail.com Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Barry Song baohua@kernel.org Acked-by: David Hildenbrand david@redhat.com Cc: Daniel Gomez da.gomez@samsung.com Cc: Hugh Dickins hughd@google.com Cc: Jonathan Corbet corbet@lwn.net Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 29 ++++++++++++---------- include/linux/huge_mm.h | 6 ++--- mm/huge_memory.c | 12 ++++----- mm/shmem.c | 8 +++--- 4 files changed, 29 insertions(+), 26 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 13e09977c43a..f9dc42f4451f 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -441,20 +441,23 @@ thp_collapse_alloc_failed the allocation.
thp_file_alloc - is incremented every time a file huge page is successfully - allocated. + is incremented every time a shmem huge page is successfully + allocated (Note that despite being named after "file", the counter + measures only shmem).
thp_file_fallback - is incremented if a file huge page is attempted to be allocated - but fails and instead falls back to using small pages. + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. (Note that + despite being named after "file", the counter measures only shmem).
thp_file_fallback_charge - is incremented if a file huge page cannot be charged and instead + is incremented if a shmem huge page cannot be charged and instead falls back to using small pages even though the allocation was - successful. + successful. (Note that despite being named after "file", the + counter measures only shmem).
thp_file_mapped - is incremented every time a file huge page is mapped into + is incremented every time a file or shmem huge page is mapped into user address space.
thp_split_page @@ -525,16 +528,16 @@ swpout_fallback Usually because failed to allocate some continuous swap space for the huge page.
-file_alloc - is incremented every time a file huge page is successfully +shmem_alloc + is incremented every time a shmem huge page is successfully allocated.
-file_fallback - is incremented if a file huge page is attempted to be allocated +shmem_fallback + is incremented if a shmem huge page is attempted to be allocated but fails and instead falls back to using small pages.
-file_fallback_charge - is incremented if a file huge page cannot be charged and instead +shmem_fallback_charge + is incremented if a shmem huge page cannot be charged and instead falls back to using small pages even though the allocation was successful.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 54391f7a374f..1474fd9c63ad 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -281,9 +281,9 @@ enum mthp_stat_item { MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, MTHP_STAT_SWPOUT, MTHP_STAT_SWPOUT_FALLBACK, - MTHP_STAT_FILE_ALLOC, - MTHP_STAT_FILE_FALLBACK, - MTHP_STAT_FILE_FALLBACK_CHARGE, + MTHP_STAT_SHMEM_ALLOC, + MTHP_STAT_SHMEM_FALLBACK, + MTHP_STAT_SHMEM_FALLBACK_CHARGE, MTHP_STAT_SPLIT, MTHP_STAT_SPLIT_FAILED, MTHP_STAT_SPLIT_DEFERRED, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e4f3e3f4b744..fec3ee2c020b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -719,9 +719,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT); DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); -DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC); -DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK); -DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(shmem_alloc, MTHP_STAT_SHMEM_ALLOC); +DEFINE_MTHP_STAT_ATTR(shmem_fallback, MTHP_STAT_SHMEM_FALLBACK); +DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE); DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT); DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); @@ -732,9 +732,9 @@ static struct attribute *stats_attrs[] = { &anon_fault_fallback_charge_attr.attr, &swpout_attr.attr, &swpout_fallback_attr.attr, - &file_alloc_attr.attr, - &file_fallback_attr.attr, - &file_fallback_charge_attr.attr, + &shmem_alloc_attr.attr, + &shmem_fallback_attr.attr, + &shmem_fallback_charge_attr.attr, &split_attr.attr, &split_failed_attr.attr, &split_deferred_attr.attr, diff --git a/mm/shmem.c b/mm/shmem.c index 69593d2b1566..3e1d36c98b92 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1793,7 +1793,7 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, if (pages == HPAGE_PMD_NR) count_vm_event(THP_FILE_FALLBACK); #ifdef CONFIG_TRANSPARENT_HUGEPAGE - count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK); + count_mthp_stat(order, MTHP_STAT_SHMEM_FALLBACK); #endif order = next_order(&suitable_orders, order); } @@ -1822,8 +1822,8 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, count_vm_event(THP_FILE_FALLBACK_CHARGE); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE - count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK); - count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE); + count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK); + count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK_CHARGE); #endif } goto unlock; @@ -2204,7 +2204,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, if (folio_test_pmd_mappable(folio)) count_vm_event(THP_FILE_ALLOC); #ifdef CONFIG_TRANSPARENT_HUGEPAGE - count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC); + count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_ALLOC); #endif goto alloced; }
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/11272 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/H...