From: Chris Down chris@chrisdown.name
mainline inclusion from mainline-v5.9-rc1 commit e809d5f0b5c912fe981dce738f3283b2010665f0 category: bugfix bugzilla: NA CVE: NA ---------------------------
Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On affected tiers, in excess of 10% of hosts show multiple files with different content and the same inode number, with some servers even having as many as 150 duplicated inode numbers with differing file content.
This causes actual, tangible problems in production. For example, we have complaints from those working on remote caches that their application is reporting cache corruptions because it uses (device, inodenum) to establish the identity of a particular cache object, but because it's not unique any more, the application refuses to continue and reports cache corruption. Even worse, sometimes applications may not even detect the corruption but may continue anyway, causing phantom and hard to debug behaviour.
In general, userspace applications expect that (device, inodenum) should be enough to be uniquely point to one inode, which seems fair enough. One might also need to check the generation, but in this case:
1. That's not currently exposed to userspace (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs); 2. Even with generation, there shouldn't be two live inodes with the same inode number on one device.
In order to mitigate this, we take a two-pronged approach:
1. Moving inum generation from being global to per-sb for tmpfs. This itself allows some reduction in i_ino churn. This works on both 64- and 32- bit machines. 2. Adding inode{64,32} for tmpfs. This fix is supported on machines with 64-bit ino_t only: we allow users to mount tmpfs with a new inode64 option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
You can see how this compares to previous related patches which didn't implement this per-superblock:
- https://patchwork.kernel.org/patch/11254001/ - https://patchwork.kernel.org/patch/11023915/
This patch (of 2):
get_next_ino has a number of problems:
- It uses and returns a uint, which is susceptible to become overflowed if a lot of volatile inodes that use get_next_ino are created. - It's global, with no specificity per-sb or even per-filesystem. This means it's not that difficult to cause inode number wraparounds on a single device, which can result in having multiple distinct inodes with the same inode number.
This patch adds a per-superblock counter that mitigates the second case. This design also allows us to later have a specific i_ino size per-device, for example, allowing users to choose whether to use 32- or 64-bit inodes for each tmpfs mount. This is implemented in the next commit.
For internal shmem mounts which may be less tolerant to spinlock delays, we implement a percpu batching scheme which only takes the stat_lock at each batch boundary.
Signed-off-by: Chris Down chris@chrisdown.name Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Hugh Dickins hughd@google.com Cc: Amir Goldstein amir73il@gmail.com Cc: Al Viro viro@zeniv.linux.org.uk Cc: Matthew Wilcox willy@infradead.org Cc: Jeff Layton jlayton@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Tejun Heo tj@kernel.org Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Luo Meng luomeng12@huawei.com
Conflicts: mm/shmem.c Reviewed-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/fs.h | 15 +++++++++ include/linux/shmem_fs.h | 2 ++ mm/shmem.c | 66 +++++++++++++++++++++++++++++++++++++--- 3 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h index 8d5ee697727cd..46eb1d540606b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3007,6 +3007,21 @@ extern void discard_new_inode(struct inode *); extern unsigned int get_next_ino(void); extern void evict_inodes(struct super_block *sb);
+/* + * Userspace may rely on the the inode number being non-zero. For example, glibc + * simply ignores files with zero i_ino in unlink() and other places. + * + * As an additional complication, if userspace was compiled with + * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the + * lower 32 bits, so we need to check that those aren't zero explicitly. With + * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but + * better safe than sorry. + */ +static inline bool is_zero_ino(ino_t ino) +{ + return (u32)ino == 0; +} + extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index f155dc607112e..dd67c6e59ceaa 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -34,6 +34,8 @@ struct shmem_sb_info { unsigned char huge; /* Whether to try for hugepages */ kuid_t uid; /* Mount uid for root directory */ kgid_t gid; /* Mount gid for root directory */ + ino_t next_ino; /* The next per-sb inode number to use */ + ino_t __percpu *ino_batch; /* The next per-cpu inode number to use */ struct mempolicy *mpol; /* default memory policy for mappings */ spinlock_t shrinklist_lock; /* Protects shrinklist */ struct list_head shrinklist; /* List of shinkable inodes */ diff --git a/mm/shmem.c b/mm/shmem.c index 98939e57f6cc6..25205b46b053f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -267,18 +267,67 @@ bool vma_is_shmem(struct vm_area_struct *vma) static LIST_HEAD(shmem_swaplist); static DEFINE_MUTEX(shmem_swaplist_mutex);
-static int shmem_reserve_inode(struct super_block *sb) +/* + * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and + * produces a novel ino for the newly allocated inode. + * + * It may also be called when making a hard link to permit the space needed by + * each dentry. However, in that case, no new inode number is needed since that + * internally draws from another pool of inode numbers (currently global + * get_next_ino()). This case is indicated by passing NULL as inop. + */ +#define SHMEM_INO_BATCH 1024 +static int shmem_reserve_inode(struct super_block *sb, ino_t *inop) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); - if (sbinfo->max_inodes) { + ino_t ino; + + if (!(sb->s_flags & SB_KERNMOUNT)) { spin_lock(&sbinfo->stat_lock); if (!sbinfo->free_inodes) { spin_unlock(&sbinfo->stat_lock); return -ENOSPC; } sbinfo->free_inodes--; + if (inop) { + ino = sbinfo->next_ino++; + if (unlikely(is_zero_ino(ino))) + ino = sbinfo->next_ino++; + if (unlikely(ino > UINT_MAX)) { + /* + * Emulate get_next_ino uint wraparound for + * compatibility + */ + ino = 1; + } + *inop = ino; + } spin_unlock(&sbinfo->stat_lock); + } else if (inop) { + /* + * __shmem_file_setup, one of our callers, is lock-free: it + * doesn't hold stat_lock in shmem_reserve_inode since + * max_inodes is always 0, and is called from potentially + * unknown contexts. As such, use a per-cpu batched allocator + * which doesn't require the per-sb stat_lock unless we are at + * the batch boundary. + */ + ino_t *next_ino; + next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); + ino = *next_ino; + if (unlikely(ino % SHMEM_INO_BATCH == 0)) { + spin_lock(&sbinfo->stat_lock); + ino = sbinfo->next_ino; + sbinfo->next_ino += SHMEM_INO_BATCH; + spin_unlock(&sbinfo->stat_lock); + if (unlikely(is_zero_ino(ino))) + ino++; + } + *inop = ino; + *next_ino = ++ino; + put_cpu(); } + return 0; }
@@ -2218,13 +2267,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode struct inode *inode; struct shmem_inode_info *info; struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + ino_t ino;
- if (shmem_reserve_inode(sb)) + if (shmem_reserve_inode(sb, &ino)) return NULL;
inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = ino; inode_init_owner(inode, dir, mode); inode->i_blocks = 0; inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); @@ -2938,7 +2988,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr * first link must skip that, to get the accounting right. */ if (inode->i_nlink) { - ret = shmem_reserve_inode(inode->i_sb); + ret = shmem_reserve_inode(inode->i_sb, NULL); if (ret) goto out; } @@ -3539,6 +3589,7 @@ static void shmem_put_super(struct super_block *sb) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+ free_percpu(sbinfo->ino_batch); percpu_counter_destroy(&sbinfo->used_blocks); mpol_put(sbinfo->mpol); kfree(sbinfo); @@ -3583,6 +3634,11 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) #else sb->s_flags |= SB_NOUSER; #endif + if (sb->s_flags & SB_KERNMOUNT) { + sbinfo->ino_batch = alloc_percpu(ino_t); + if (!sbinfo->ino_batch) + goto failed; + }
spin_lock_init(&sbinfo->stat_lock); if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL))