[PATCH OLK-6.6 V2 0/5] Support open_tree OPEN_TREE_NAMESPACE
Changes since V1:
  The position of unlock_mount_hash() has been adjusted to align with
  the mainline.

Al Viro (3):
  attach_mnt(): expand in attach_recursive_mnt(), then lose the flag argument
  get rid of mnt_set_mountpoint_beneath()
  prevent mount hash conflicts

Christian Brauner (2):
  mount: add OPEN_TREE_NAMESPACE
  mount: hold namespace_sem across copy in create_new_namespace()

 fs/internal.h              |   2 +
 fs/mount.h                 |   1 +
 fs/namespace.c             | 281 ++++++++++++++++++++++++++++---------
 fs/nsfs.c                  |  32 +++++
 include/uapi/linux/mount.h |   3 +-
 5 files changed, 248 insertions(+), 71 deletions(-)

-- 
2.52.0
From: Al Viro <viro@zeniv.linux.org.uk> mainline inclusion from mainline-v6.17-rc1 commit 8c6ce8e86dd75db8e6c6a3e5a870e8d52dbab2d0 category: cleanup bugzilla: https://atomgit.com/openeuler/kernel/issues/9218 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- simpler that way - all but one caller pass false as 'beneath' argument, and that one caller is actually happier with the call expanded - the logics with choice of mountpoint is identical for 'moving' and 'attaching' cases, and now that is no longer hidden. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/namespace.c [attach_recursive_mnt() has no mnt_notify_add() function, which is in commit bf630c401641 ("vfs: add notifications for mount attach and detach"), not affect this patch.] Signed-off-by: Zizhi Wo <wozizhi@huawei.com> --- fs/namespace.c | 37 ++++++++++++------------------------- 1 file changed, 12 insertions(+), 25 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 7c58151a19a1..8cb30a703db2 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -968,39 +968,24 @@ static void __attach_mnt(struct mount *mnt, struct mount *parent) * attach_mnt - mount a mount, attach to @mount_hashtable and parent's * list of child mounts * @parent: the parent * @mnt: the new mount * @mp: the new mountpoint - * @beneath: whether to mount @mnt beneath or on top of @parent * - * If @beneath is false, mount @mnt at @mp on @parent. Then attach @mnt + * Mount @mnt at @mp on @parent. Then attach @mnt * to @parent's child mount list and to @mount_hashtable. * - * If @beneath is true, remove @mnt from its current parent and - * mountpoint and mount it on @mp on @parent, and mount @parent on the - * old parent and old mountpoint of @mnt. Finally, attach @parent to - * @mnt_hashtable and @parent->mnt_parent->mnt_mounts. 
- * * Note, when __attach_mnt() is called @mnt->mnt_parent already points * to the correct parent. * * Context: This function expects namespace_lock() and lock_mount_hash() * to have been acquired in that order. */ static void attach_mnt(struct mount *mnt, struct mount *parent, - struct mountpoint *mp, bool beneath) + struct mountpoint *mp) { - if (beneath) - mnt_set_mountpoint_beneath(mnt, parent, mp); - else - mnt_set_mountpoint(parent, mp, mnt); - /* - * Note, @mnt->mnt_parent has to be used. If @mnt was mounted - * beneath @parent then @mnt will need to be attached to - * @parent's old parent, not @parent. IOW, @mnt->mnt_parent - * isn't the same mount as @parent. - */ + mnt_set_mountpoint(parent, mp, mnt); __attach_mnt(mnt, mnt->mnt_parent); } void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt) { @@ -1009,11 +994,11 @@ void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct m list_del_init(&mnt->mnt_child); hlist_del_init(&mnt->mnt_mp_list); hlist_del_init_rcu(&mnt->mnt_hash); - attach_mnt(mnt, parent, mp, false); + attach_mnt(mnt, parent, mp); put_mountpoint(old_mp); mnt_add_count(old_parent, -1); } @@ -2012,11 +1997,11 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry, q = clone_mnt(p, p->mnt.mnt_root, flag); if (IS_ERR(q)) goto out; lock_mount_hash(); list_add_tail(&q->mnt_list, &res->mnt_list); - attach_mnt(q, parent, p->mnt_mp, false); + attach_mnt(q, parent, p->mnt_mp); unlock_mount_hash(); } } return res; out: @@ -2348,14 +2333,16 @@ static int attach_recursive_mnt(struct mount *source_mnt, for (p = source_mnt; p; p = next_mnt(p, source_mnt)) set_mnt_shared(p); } if (moving) { - if (beneath) - dest_mp = smp; unhash_mnt(source_mnt); - attach_mnt(source_mnt, top_mnt, dest_mp, beneath); + if (beneath) + mnt_set_mountpoint_beneath(source_mnt, top_mnt, smp); + else + mnt_set_mountpoint(top_mnt, dest_mp, source_mnt); + __attach_mnt(source_mnt, source_mnt->mnt_parent); 
touch_mnt_namespace(source_mnt->mnt_ns); } else { if (source_mnt->mnt_ns) { /* move from anon - the caller will destroy */ list_del_init(&source_mnt->mnt_ns->list); @@ -4278,13 +4265,13 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, if (root_mnt->mnt.mnt_flags & MNT_LOCKED) { new_mnt->mnt.mnt_flags |= MNT_LOCKED; root_mnt->mnt.mnt_flags &= ~MNT_LOCKED; } /* mount old root on put_old */ - attach_mnt(root_mnt, old_mnt, old_mp, false); + attach_mnt(root_mnt, old_mnt, old_mp); /* mount new_root on / */ - attach_mnt(new_mnt, root_parent, root_mp, false); + attach_mnt(new_mnt, root_parent, root_mp); mnt_add_count(root_parent, -1); touch_mnt_namespace(current->nsproxy->mnt_ns); /* A moved mount should not expire automatically */ list_del_init(&new_mnt->mnt_expire); put_mountpoint(root_mp); -- 2.52.0
From: Al Viro <viro@zeniv.linux.org.uk>

mainline inclusion
from mainline-v6.17-rc1
commit 431cc1d8e2dab751b2b8f298a6d9caf83d8b49c3
category: cleanup
bugzilla: https://atomgit.com/openeuler/kernel/issues/9218

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

mnt_set_mountpoint_beneath() consists of attaching the new mount
side-by-side with the one we want to mount beneath (by
mnt_set_mountpoint()), followed by shifting the top mount onto the new
one (by mnt_change_mountpoint()).

Both callers of mnt_set_mountpoint_beneath() (both in
attach_recursive_mnt()) have the same form: in the 'beneath' case we
call mnt_set_mountpoint_beneath(), otherwise mnt_set_mountpoint().

The thing is, expressing that as an unconditional mnt_set_mountpoint(),
followed, in the 'beneath' case, by mnt_change_mountpoint() is just as
easy. And these mnt_change_mountpoint() calls are similar to the ones
we do when it comes to attaching propagated copies, which will allow
more cleanups in the next commits.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
	fs/namespace.c
[attach_recursive_mnt() has conflicts in the "move from anon" branch.
Commit 2eea9ce4310d ("mounts: keep list of mounts in an rbtree") has
not been merged; this does not affect this patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com> --- fs/namespace.c | 37 ++++--------------------------------- 1 file changed, 4 insertions(+), 33 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 8cb30a703db2..93e8b2cd3a24 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -928,37 +928,10 @@ void mnt_set_mountpoint(struct mount *mnt, child_mnt->mnt_parent = mnt; child_mnt->mnt_mp = mp; hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list); } -/** - * mnt_set_mountpoint_beneath - mount a mount beneath another one - * - * @new_parent: the source mount - * @top_mnt: the mount beneath which @new_parent is mounted - * @new_mp: the new mountpoint of @top_mnt on @new_parent - * - * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and - * parent @top_mnt->mnt_parent and mount it on top of @new_parent at - * @new_mp. And mount @new_parent on the old parent and old - * mountpoint of @top_mnt. - * - * Context: This function expects namespace_lock() and lock_mount_hash() - * to have been acquired in that order. 
- */ -static void mnt_set_mountpoint_beneath(struct mount *new_parent, - struct mount *top_mnt, - struct mountpoint *new_mp) -{ - struct mount *old_top_parent = top_mnt->mnt_parent; - struct mountpoint *old_top_mp = top_mnt->mnt_mp; - - mnt_set_mountpoint(old_top_parent, old_top_mp, new_parent); - mnt_change_mountpoint(new_parent, new_mp, top_mnt); -} - - static void __attach_mnt(struct mount *mnt, struct mount *parent) { hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mnt->mnt_mountpoint)); list_add_tail(&mnt->mnt_child, &parent->mnt_mounts); @@ -2334,25 +2307,23 @@ static int attach_recursive_mnt(struct mount *source_mnt, set_mnt_shared(p); } if (moving) { unhash_mnt(source_mnt); + mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt); if (beneath) - mnt_set_mountpoint_beneath(source_mnt, top_mnt, smp); - else - mnt_set_mountpoint(top_mnt, dest_mp, source_mnt); + mnt_change_mountpoint(source_mnt, smp, top_mnt); __attach_mnt(source_mnt, source_mnt->mnt_parent); touch_mnt_namespace(source_mnt->mnt_ns); } else { if (source_mnt->mnt_ns) { /* move from anon - the caller will destroy */ list_del_init(&source_mnt->mnt_ns->list); } + mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt); if (beneath) - mnt_set_mountpoint_beneath(source_mnt, top_mnt, smp); - else - mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt); + mnt_change_mountpoint(source_mnt, smp, top_mnt); commit_tree(source_mnt); } hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) { struct mount *q; -- 2.52.0
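For readers following the series, the equivalence this patch relies on can be checked with a small userspace sketch. Everything below is a toy model and not kernel code: struct bmount and the bm_* helpers are hypothetical names invented for this note, mountpoints are plain integers, and hashing, refcounts and locking are ignored. It also assumes that in the 'beneath' case the destination is the current parent and mountpoint of the top mount, which the hunks above do not show.

```c
#include <stddef.h>

/* Toy stand-in for struct mount: only parent and mountpoint are tracked. */
struct bmount {
	struct bmount *parent;
	int mp;			/* id of the mountpoint on ->parent, 0 = none */
};

/* Toy version of mnt_set_mountpoint(): place @child at @mp on @parent. */
static void bm_set_mountpoint(struct bmount *parent, int mp,
			      struct bmount *child)
{
	child->parent = parent;
	child->mp = mp;
}

/* Toy version of mnt_change_mountpoint(): move @mnt to @mp on @parent. */
static void bm_change_mountpoint(struct bmount *parent, int mp,
				 struct bmount *mnt)
{
	mnt->parent = parent;
	mnt->mp = mp;
}

/* Shape of the helper removed by this patch. */
static void bm_set_mountpoint_beneath(struct bmount *new_parent,
				      struct bmount *top, int new_mp)
{
	struct bmount *old_top_parent = top->parent;
	int old_top_mp = top->mp;

	bm_set_mountpoint(old_top_parent, old_top_mp, new_parent);
	bm_change_mountpoint(new_parent, new_mp, top);
}

/* Shape of the open-coded replacement in attach_recursive_mnt(). */
static void bm_attach(struct bmount *dest, int dest_mp, struct bmount *src,
		      struct bmount *top, int smp, int beneath)
{
	bm_set_mountpoint(dest, dest_mp, src);
	if (beneath)
		bm_change_mountpoint(src, smp, top);
}
```

With a top mount sitting at mountpoint 5 on some root, calling bm_set_mountpoint_beneath(&new, &top, 9) and calling bm_attach(top.parent, top.mp, &new, &top, 9, 1) should leave both models in the same state: the new mount at 5 on the root, and the old top mount at 9 on the new one.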
From: Al Viro <viro@zeniv.linux.org.uk>

mainline inclusion
from mainline-v6.17-rc1
commit ffdc52fbbd5835a936ad683c943d6d103a2d4514
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/9218

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Currently it's still possible to run into a pathological situation when
two hashed mounts share both parent and mountpoint. That does not work
well, for obvious reasons.

We are not far from getting rid of that; the only remaining gap is
attach_recursive_mnt() not being careful enough when sliding a tree
under an existing mount (for propagated copies or, in the 'beneath'
case, for the original one).

To deal with that cleanly we need to be able to find overmounts
(i.e. mounts on top of parent's root); we could do hash lookups or scan
the list of children, but either would be costly. Since one of the
results we get from that will be prevention of multiple parallel
overmounts, let's just bite the bullet and store a (non-counting)
reference to the overmount in struct mount.

With that done, closing the hole in attach_recursive_mnt() becomes easy:
we just need to follow the chain of overmounts before we change the
mountpoint of the mount we are sliding things under.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
	fs/mount.h
	fs/namespace.c
[Simple context conflicts that do not affect this patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com> --- fs/mount.h | 1 + fs/namespace.c | 27 ++++++++++++++++++++++----- 2 files changed, 23 insertions(+), 5 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index 3c2fbb78cae2..7533e7d3b26a 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -75,10 +75,11 @@ struct mount { int mnt_id; /* mount identifier */ int mnt_group_id; /* peer group identifier */ int mnt_expiry_mark; /* true if marked for expiry */ struct hlist_head mnt_pins; struct hlist_head mnt_stuck_children; + struct mount *overmount; /* mounted on ->mnt_root */ } __randomize_layout; #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */ static inline struct mount *real_mount(struct vfsmount *mnt) diff --git a/fs/namespace.c b/fs/namespace.c index 93e8b2cd3a24..fbc6dd74ded4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -895,10 +895,13 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns) * vfsmount lock must be held for write */ static struct mountpoint *unhash_mnt(struct mount *mnt) { struct mountpoint *mp; + struct mount *parent = mnt->mnt_parent; + if (unlikely(parent->overmount == mnt)) + parent->overmount = NULL; mnt->mnt_parent = mnt; mnt->mnt_mountpoint = mnt->mnt.mnt_root; list_del_init(&mnt->mnt_child); hlist_del_init_rcu(&mnt->mnt_hash); hlist_del_init(&mnt->mnt_mp_list); @@ -930,10 +933,12 @@ void mnt_set_mountpoint(struct mount *mnt, hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list); } static void __attach_mnt(struct mount *mnt, struct mount *parent) { + if (unlikely(mnt->mnt_mountpoint == parent->mnt.mnt_root)) + parent->overmount = mnt; hlist_add_head_rcu(&mnt->mnt_hash, m_hash(&parent->mnt, mnt->mnt_mountpoint)); list_add_tail(&mnt->mnt_child, &parent->mnt_mounts); } @@ -2265,22 +2270,30 @@ static int attach_recursive_mnt(struct mount *source_mnt, { struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns; HLIST_HEAD(tree_list); struct mnt_namespace *ns = top_mnt->mnt_ns; struct mountpoint *smp; + 
struct mountpoint *secondary = NULL; struct mount *child, *dest_mnt, *p; + struct mount *top; struct hlist_node *n; int err = 0; bool moving = flags & MNT_TREE_MOVE, beneath = flags & MNT_TREE_BENEATH; /* * Preallocate a mountpoint in case the new mounts need to be * mounted beneath mounts on the same mountpoint. */ - smp = get_mountpoint(source_mnt->mnt.mnt_root); + for (top = source_mnt; unlikely(top->overmount); top = top->overmount) { + if (!secondary && is_mnt_ns_file(top->mnt.mnt_root)) + secondary = top->mnt_mp; + } + smp = get_mountpoint(top->mnt.mnt_root); if (IS_ERR(smp)) return PTR_ERR(smp); + if (!secondary) + secondary = smp; /* Is there space to add these mounts to the mount namespace? */ if (!moving) { err = count_mounts(ns, source_mnt); if (err) @@ -2309,21 +2322,21 @@ static int attach_recursive_mnt(struct mount *source_mnt, if (moving) { unhash_mnt(source_mnt); mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt); if (beneath) - mnt_change_mountpoint(source_mnt, smp, top_mnt); + mnt_change_mountpoint(top, smp, top_mnt); __attach_mnt(source_mnt, source_mnt->mnt_parent); touch_mnt_namespace(source_mnt->mnt_ns); } else { if (source_mnt->mnt_ns) { /* move from anon - the caller will destroy */ list_del_init(&source_mnt->mnt_ns->list); } mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt); if (beneath) - mnt_change_mountpoint(source_mnt, smp, top_mnt); + mnt_change_mountpoint(top, smp, top_mnt); commit_tree(source_mnt); } hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) { struct mount *q; @@ -2332,12 +2345,16 @@ static int attach_recursive_mnt(struct mount *source_mnt, if (child->mnt_parent->mnt_ns->user_ns != user_ns) lock_mnt_tree(child); child->mnt.mnt_flags &= ~MNT_LOCKED; q = __lookup_mnt(&child->mnt_parent->mnt, child->mnt_mountpoint); - if (q) - mnt_change_mountpoint(child, smp, q); + if (q) { + struct mount *r = child; + while (unlikely(r->overmount)) + r = r->overmount; + mnt_change_mountpoint(r, secondary, q); + } commit_tree(child); } 
put_mountpoint(smp); unlock_mount_hash(); -- 2.52.0
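The bookkeeping added by this patch can be illustrated with a userspace toy model. This is only a sketch under stated assumptions: struct omount and the om_* helpers are hypothetical names made up for this note, dentries are represented by integer ids, and none of the hashing, list handling or locking of the real code is modeled. It shows how ->overmount is set on attach, cleared on unhash, and followed to find the topmost mount of a stack.

```c
#include <stddef.h>

/* Toy stand-in for struct mount; dentries are plain integer ids. */
struct omount {
	int root;			/* id of this mount's root dentry */
	int mountpoint;			/* id of the dentry it is mounted on */
	struct omount *parent;
	struct omount *overmount;	/* mounted on ->root, or NULL */
};

/* Mirrors the check added to __attach_mnt(): record an overmount. */
static void om_attach(struct omount *mnt, struct omount *parent, int mountpoint)
{
	mnt->parent = parent;
	mnt->mountpoint = mountpoint;
	if (mountpoint == parent->root)
		parent->overmount = mnt;
}

/* Mirrors the check added to unhash_mnt(): forget the overmount. */
static void om_unhash(struct omount *mnt)
{
	struct omount *parent = mnt->parent;

	/* The NULL test exists only because the toy starts with no parent. */
	if (parent && parent->overmount == mnt)
		parent->overmount = NULL;
	mnt->parent = mnt;
	mnt->mountpoint = mnt->root;
}

/* Follow the chain of overmounts, as attach_recursive_mnt() now does. */
static struct omount *om_topmost(struct omount *mnt)
{
	while (mnt->overmount)
		mnt = mnt->overmount;
	return mnt;
}
```

Because each mount stores at most one overmount, the chain walk is a simple pointer chase and needs neither a hash lookup nor a scan of the child list, which matches the cost argument in the commit message.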
From: Christian Brauner <brauner@kernel.org>

mainline inclusion
from mainline-v7.0-rc1
commit 9b8a0ba68246a61d903ce62c35c303b1501df28b
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/9218

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When creating containers the setup usually involves using CLONE_NEWNS
via clone3() or unshare(). This copies the caller's complete mount
namespace. The runtime will also assemble a new rootfs and then use
pivot_root() to switch the old mount tree with the new rootfs. Afterward
it will recursively umount the old mount tree, thereby getting rid of
all mounts.

On a basic system here where the mount table isn't particularly large
this still copies about 30 mounts. Copying all of these mounts only to
get rid of them later is pretty wasteful.

This is exacerbated if intermediary mount namespaces are used that only
exist for a very short amount of time and are immediately destroyed
again, causing a ton of mounts to be copied and destroyed needlessly.

With a large mount table and a system where thousands or tens of
thousands of containers are spawned in parallel this quickly becomes a
bottleneck, increasing contention on the namespace semaphore.

Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
OPEN_TREE_CLONE, only the indicated mount tree is copied. Instead of
returning a file descriptor referring to that mount tree,
OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
to a new mount namespace. In that new mount namespace the copied mount
tree has been mounted on top of a copy of the real rootfs.

The caller can setns() into that mount namespace and perform any
additionally required setup, such as using move_mount() to attach
detached mounts in there.

This allows OPEN_TREE_NAMESPACE to function as a combined
unshare(CLONE_NEWNS) and pivot_root().

A caller may for example choose to create an extremely minimal rootfs:

	fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

This will create a mount namespace where "wootwoot" has become the
rootfs mounted on top of the real rootfs. The caller can now setns()
into this new mount namespace and assemble additional mounts.

This also works with user namespaces:

	unshare(CLONE_NEWUSER);
	fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

which creates a new mount namespace owned by the earlier created user
namespace with "wootwoot" as the rootfs mounted on top of the real
rootfs.

Link: https://patch.msgid.link/20251229-work-empty-namespace-v1-1-bfb24c7b061f@ker...
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Conflicts:
	fs/internal.h
	fs/namespace.c
	fs/nsfs.c
[1. This kernel version does not have mnt_add_to_ns(), as commit
    2eea9ce4310d ("mounts: keep list of mounts in an rbtree") is not
    merged. Implemented mnt_add_tree_to_ns() instead.
 2. This kernel version does not have __ns_tree_add_raw(), as commit
    885fc8ac0a4d ("nstree: make iterator generic") is not merged. This
    does not affect this patch.
 3. This kernel version does not have path_from_stashed(), as commit
    07fd7c329839 ("libfs: add path_from_stashed()") is not merged.
    Implemented similar logic in open_namespace_file().
 4. This kernel version does not have LOCK_MOUNT_EXACT, as commit
    9bf5d488529b ("finish_automount(): take the lock_mount() analogue
    into a helper") is not merged. Implemented lock_mount_exact().
 5. There are a few other minor conflicts that do not affect the patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com> --- fs/internal.h | 2 + fs/namespace.c | 204 ++++++++++++++++++++++++++++++++++--- fs/nsfs.c | 32 ++++++ include/uapi/linux/mount.h | 3 +- 4 files changed, 228 insertions(+), 13 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 273e6fd40d1b..68f51bf7c5b0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -15,10 +15,11 @@ struct mount; struct shrink_control; struct fs_context; struct pipe_inode_info; struct iov_iter; struct mnt_idmap; +struct ns_common; /* * block/bdev.c */ #ifdef CONFIG_BLOCK @@ -228,10 +229,11 @@ extern void mnt_pin_kill(struct mount *m); /* * fs/nsfs.c */ extern const struct dentry_operations ns_dentry_operations; +struct file *open_namespace_file(struct ns_common *ns); /* * fs/stat.c: */ diff --git a/fs/namespace.c b/fs/namespace.c index fbc6dd74ded4..2c9ed65371f1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1029,10 +1029,21 @@ static struct mount *skip_mnt_tree(struct mount *p) prev = p->mnt_mounts.prev; } return p; } +static void mnt_add_tree_to_ns(struct mnt_namespace *ns, struct mount *root) +{ + struct mount *mnt; + + for (mnt = root; mnt; mnt = next_mnt(mnt, root)) { + mnt->mnt_ns = ns; + ns->mounts++; + } + list_add_tail(&ns->list, &root->mnt_list); +} + /** * vfs_create_mount - Create a mount for a configured superblock * @fc: The configuration context with the superblock attached * * Create a mount to an already configured superblock. 
If necessary, the @@ -2569,27 +2580,41 @@ static int do_change_type(struct path *path, int ms_flags) out_unlock: namespace_unlock(); return err; } -static struct mount *__do_loopback(struct path *old_path, int recurse) +static struct mount *__do_loopback(struct path *old_path, + unsigned int flags, unsigned int copy_flags) { struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt); + bool recurse = flags & AT_RECURSIVE; if (IS_MNT_UNBINDABLE(old)) return mnt; if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations) return mnt; if (!recurse && has_locked_children(old, old_path->dentry)) return mnt; + /* + * When creating a new mount namespace we don't want to copy over + * mounts of mount namespaces to avoid the risk of cycles and also to + * minimize the default complex interdependencies between mount + * namespaces. + * + * We could ofc just check whether all mount namespace files aren't + * creating cycles but really let's keep this simple. + */ + if (!(flags & OPEN_TREE_NAMESPACE)) + copy_flags |= CL_COPY_MNT_NS_FILE; + if (recurse) - mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE); + mnt = copy_tree(old, old_path->dentry, copy_flags); else - mnt = clone_mnt(old, old_path->dentry, 0); + mnt = clone_mnt(old, old_path->dentry, copy_flags); if (!IS_ERR(mnt)) mnt->mnt.mnt_flags &= ~MNT_LOCKED; return mnt; @@ -2602,11 +2627,13 @@ static int do_loopback(struct path *path, const char *old_name, int recurse) { struct path old_path; struct mount *mnt = NULL, *parent; struct mountpoint *mp; + unsigned int flags = recurse ? 
AT_RECURSIVE : 0; int err; + if (!old_name || !*old_name) return -EINVAL; err = kern_path(old_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path); if (err) return err; @@ -2623,11 +2650,11 @@ static int do_loopback(struct path *path, const char *old_name, parent = real_mount(path->mnt); if (!check_mnt(parent)) goto out2; - mnt = __do_loopback(&old_path, recurse); + mnt = __do_loopback(&old_path, flags, 0); if (IS_ERR(mnt)) { err = PTR_ERR(mnt); goto out2; } @@ -2642,22 +2669,22 @@ static int do_loopback(struct path *path, const char *old_name, out: path_put(&old_path); return err; } -static struct file *open_detached_copy(struct path *path, bool recursive) +static struct file *open_detached_copy(struct path *path, unsigned int flags) { struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns; struct mnt_namespace *ns = alloc_mnt_ns(user_ns, true); struct mount *mnt, *p; struct file *file; if (IS_ERR(ns)) return ERR_CAST(ns); namespace_lock(); - mnt = __do_loopback(path, recursive); + mnt = __do_loopback(path, flags, 0); if (IS_ERR(mnt)) { namespace_unlock(); free_mnt_ns(ns); return ERR_CAST(mnt); } @@ -2681,49 +2708,177 @@ static struct file *open_detached_copy(struct path *path, bool recursive) else file->f_mode |= FMODE_NEED_UNMOUNT; return file; } +static struct mountpoint *lock_mount_exact(struct path *path); + +static struct mnt_namespace *create_new_namespace(struct path *path, unsigned int flags) +{ + struct mnt_namespace *new_ns; + struct path to_path = {}; + struct mnt_namespace *ns = current->nsproxy->mnt_ns; + struct user_namespace *user_ns = current_user_ns(); + struct mountpoint *mp; + struct mount *new_ns_root; + struct mount *mnt; + unsigned int copy_flags = 0; + bool locked = false; + int err; + + if (user_ns != ns->user_ns) + copy_flags |= CL_SLAVE; + + new_ns = alloc_mnt_ns(user_ns, false); + if (IS_ERR(new_ns)) + return new_ns; + + namespace_lock(); + new_ns_root = clone_mnt(ns->root, ns->root->mnt.mnt_root, copy_flags); + if 
(IS_ERR(new_ns_root)) { + namespace_unlock(); + err = PTR_ERR(new_ns_root); + goto err_free_ns; + } + + /* + * If the real rootfs had a locked mount on top of it somewhere + * in the stack, lock the new mount tree as well so it can't be + * exposed. + */ + mnt = ns->root; + while (mnt->overmount) { + mnt = mnt->overmount; + if (mnt->mnt.mnt_flags & MNT_LOCKED) + locked = true; + } + namespace_unlock(); + + /* + * We dropped the namespace semaphore so we can actually lock + * the copy for mounting. The copied mount isn't attached to any + * mount namespace and it is thus excluded from any propagation. + * So realistically we're isolated and the mount can't be + * overmounted. + */ + + /* Borrow the reference from clone_mnt(). */ + to_path.mnt = &new_ns_root->mnt; + to_path.dentry = dget(new_ns_root->mnt.mnt_root); + + /* Now lock for actual mounting. */ + mp = lock_mount_exact(&to_path); + if (unlikely(IS_ERR(mp))) { + err = PTR_ERR(mp); + goto err_path_put; + } + + /* + * We don't emulate unshare()ing a mount namespace. We stick to the + * restrictions of creating detached bind-mounts. It has a lot + * saner and simpler semantics. + */ + mnt = __do_loopback(path, flags, copy_flags); + if (IS_ERR(mnt)) { + err = PTR_ERR(mnt); + unlock_mount(mp); + goto err_path_put; + } + + lock_mount_hash(); + if (locked) + mnt->mnt.mnt_flags |= MNT_LOCKED; + /* + * Now mount the detached tree on top of the copy of the + * real rootfs we created. + */ + attach_mnt(mnt, new_ns_root, mp); + if (user_ns != ns->user_ns) + lock_mnt_tree(new_ns_root); + unlock_mount_hash(); + + /* Add all mounts to the new namespace. 
*/ + mnt_add_tree_to_ns(new_ns, new_ns_root); + + new_ns->root = new_ns_root; + unlock_mount(mp); + to_path.mnt = NULL; + path_put(&to_path); + + return new_ns; + +err_path_put: + path_put(&to_path); +err_free_ns: + free_mnt_ns(new_ns); + return ERR_PTR(err); +} + +static struct file *open_new_namespace(struct path *path, unsigned int flags) +{ + struct mnt_namespace *new_ns; + + new_ns = create_new_namespace(path, flags); + if (IS_ERR(new_ns)) + return ERR_CAST(new_ns); + + return open_namespace_file(from_mnt_ns(new_ns)); +} + SYSCALL_DEFINE3(open_tree, int, dfd, const char __user *, filename, unsigned, flags) { struct file *file; struct path path; int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW; - bool detached = flags & OPEN_TREE_CLONE; int error; int fd; BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC); if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE | AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE | - OPEN_TREE_CLOEXEC)) + OPEN_TREE_CLOEXEC | OPEN_TREE_NAMESPACE)) + return -EINVAL; + + if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) == + AT_RECURSIVE) return -EINVAL; - if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE) + if (hweight32(flags & (OPEN_TREE_CLONE | OPEN_TREE_NAMESPACE)) > 1) return -EINVAL; if (flags & AT_NO_AUTOMOUNT) lookup_flags &= ~LOOKUP_AUTOMOUNT; if (flags & AT_SYMLINK_NOFOLLOW) lookup_flags &= ~LOOKUP_FOLLOW; if (flags & AT_EMPTY_PATH) lookup_flags |= LOOKUP_EMPTY; - if (detached && !may_mount()) + /* + * If we create a new mount namespace with the cloned mount tree we + * just care about being privileged over our current user namespace. + * The new mount namespace will be owned by it. 
+ */ + if ((flags & OPEN_TREE_NAMESPACE) && + !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) + return -EPERM; + + if ((flags & OPEN_TREE_CLONE) && !may_mount()) return -EPERM; fd = get_unused_fd_flags(flags & O_CLOEXEC); if (fd < 0) return fd; error = user_path_at(dfd, filename, lookup_flags, &path); if (unlikely(error)) { file = ERR_PTR(error); } else { - if (detached) - file = open_detached_copy(&path, flags & AT_RECURSIVE); + if (flags & OPEN_TREE_NAMESPACE) + file = open_new_namespace(&path, flags); + else if (flags & OPEN_TREE_CLONE) + file = open_detached_copy(&path, flags); else file = dentry_open(&path, O_PATH, current_cred()); path_put(&path); } if (IS_ERR(file)) { @@ -3356,10 +3511,35 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, put_fs_context(fc); return err; } +static struct mountpoint *lock_mount_exact(struct path *path) +{ + struct dentry *dentry = path->dentry; + struct mountpoint *mp; + int err = 0; + + inode_lock(dentry->d_inode); + namespace_lock(); + if (unlikely(cant_mount(dentry))) { + err = -ENOENT; + } else if (path_overmounted(path)) { + err = -EBUSY; + } else { + mp = get_mountpoint(dentry); + if (IS_ERR(mp)) + err = PTR_ERR(mp); + } + if (unlikely(err)) { + namespace_unlock(); + inode_unlock(dentry->d_inode); + return ERR_PTR(err); + } + return mp; +} + int finish_automount(struct vfsmount *m, const struct path *path) { struct dentry *dentry = path->dentry; struct mountpoint *mp; struct mount *mnt; diff --git a/fs/nsfs.c b/fs/nsfs.c index 647a22433bd8..8f8c3c7c37da 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -143,10 +143,42 @@ int ns_get_path(struct path *path, struct task_struct *task, }; return ns_get_path_cb(path, ns_get_path_task, &args); } +static struct ns_common *ns_get_from_common(void *private_data) +{ + struct ns_common *ns = private_data; + + refcount_inc(&ns->count); + return ns; +} + +/** + * open_namespace_file - open a file for an existing namespace + * @ns: namespace to open + * + * The 
caller must pass a live namespace reference. This helper consumes that + * reference independent of success or failure. Temporary references are + * acquired through ns_get_path_cb() so stashed nsfs dentry lookup can retry. + */ +struct file *open_namespace_file(struct ns_common *ns) +{ + struct path path = {}; + struct file *file; + int err; + + err = ns_get_path_cb(&path, ns_get_from_common, ns); + ns->ops->put(ns); + if (err) + return ERR_PTR(err); + + file = dentry_open(&path, O_RDONLY, current_cred()); + path_put(&path); + return file; +} + int open_related_ns(struct ns_common *ns, struct ns_common *(*get_ns)(struct ns_common *ns)) { struct path path = {}; struct file *f; diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h index bb242fdcfe6b..9e1fbb17d305 100644 --- a/include/uapi/linux/mount.h +++ b/include/uapi/linux/mount.h @@ -59,11 +59,12 @@ #define MS_MGC_MSK 0xffff0000 /* * open_tree() flags. */ -#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */ +#define OPEN_TREE_CLONE (1 << 0) /* Clone the target tree and attach the clone */ +#define OPEN_TREE_NAMESPACE (1 << 1) /* Clone the target tree into a new mount namespace */ #define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */ /* * move_mount() flags. */ -- 2.52.0
From: Christian Brauner <brauner@kernel.org>

mainline inclusion
from mainline-v7.0-rc2
commit a41dbf5e004edbe1260883c43a8bd134d9cb0c1c
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9218

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Fix an oversight when creating a new mount namespace. If someone had the
bright idea to make the real rootfs a shared or dependent mount and it
is later copied, the copy will become a peer of the old real rootfs
mount or a dependent mount of it. The namespace semaphore is dropped and
we use mount lock exact to lock the new real root mount. If that fails
or the subsequent do_loopback() fails, we rely on the copy of the real
root mount to be cleaned up by path_put(). The problem is that this
doesn't deal with mount propagation and will leave the mounts linked in
the propagation lists.

When creating a new mount namespace, create_new_namespace() first
acquires namespace_sem to clone the nullfs root, drops it, then
reacquires it via LOCK_MOUNT_EXACT, which takes inode_lock first to
respect the inode_lock -> namespace_sem lock ordering. This
drop-and-reacquire pattern is fragile and was the source of the
propagation cleanup bug fixed in the preceding commit.

Extend lock_mount_exact() with a copy_mount mode that clones the mount
under the locks atomically. When copy_mount is true, path_overmounted()
is skipped since we're copying the mount, not mounting on top of it:
the nullfs root always has rootfs mounted on top, so the check would
always fail. If clone_mnt() fails after get_mountpoint() has pinned the
mountpoint, __unlock_mount() is used to properly unpin the mountpoint
and release both locks.

This allows create_new_namespace() to use LOCK_MOUNT_EXACT_COPY, which
takes inode_lock and namespace_sem once and holds them throughout the
clone and subsequent mount operations, eliminating the
drop-and-reacquire pattern entirely.
Reported-by: syzbot+a89f9434fb5a001ccd58@syzkaller.appspotmail.com
Fixes: 9b8a0ba68246 ("mount: add OPEN_TREE_NAMESPACE") # mainline only
Link: https://lore.kernel.org/699047f6.050a0220.2757fb.0024.GAE@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>

Conflicts:
	fs/namespace.c
[Simple context conflicts; they do not affect this patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 fs/namespace.c | 68 ++++++++++++++++++++------------------------------
 1 file changed, 27 insertions(+), 41 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2c9ed65371f1..593482e703b3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2712,16 +2712,16 @@ static struct file *open_detached_copy(struct path *path, unsigned int flags)

 static struct mountpoint *lock_mount_exact(struct path *path);

 static struct mnt_namespace *create_new_namespace(struct path *path, unsigned int flags)
 {
-	struct mnt_namespace *new_ns;
-	struct path to_path = {};
 	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
 	struct user_namespace *user_ns = current_user_ns();
+	struct mnt_namespace *new_ns;
+	struct mount *new_ns_root, *old_ns_root;
+	struct path to_path;
 	struct mountpoint *mp;
-	struct mount *new_ns_root;
 	struct mount *mnt;
 	unsigned int copy_flags = 0;
 	bool locked = false;
 	int err;

@@ -2730,68 +2730,58 @@ static struct mnt_namespace *create_new_namespace(struct path *path, unsigned in
 	new_ns = alloc_mnt_ns(user_ns, false);
 	if (IS_ERR(new_ns))
 		return new_ns;

-	namespace_lock();
+	old_ns_root = ns->root;
+	to_path.mnt = &old_ns_root->mnt;
+	to_path.dentry = old_ns_root->mnt.mnt_root;
+
+	mp = lock_mount_exact(&to_path);
+	if (IS_ERR(mp)) {
+		err = PTR_ERR(mp);
+		goto err_free_ns;
+	}
+
 	new_ns_root = clone_mnt(ns->root, ns->root->mnt.mnt_root, copy_flags);
 	if (IS_ERR(new_ns_root)) {
-		namespace_unlock();
 		err = PTR_ERR(new_ns_root);
-		goto err_free_ns;
+		goto err_unlock_mp;
 	}

 	/*
 	 * If the real rootfs had a locked mount on top of it somewhere
 	 * in the stack, lock the new mount tree as well so it can't be
 	 * exposed.
 	 */
-	mnt = ns->root;
+	mnt = old_ns_root;
 	while (mnt->overmount) {
 		mnt = mnt->overmount;
 		if (mnt->mnt.mnt_flags & MNT_LOCKED)
 			locked = true;
 	}
-	namespace_unlock();

 	/*
-	 * We dropped the namespace semaphore so we can actually lock
-	 * the copy for mounting. The copied mount isn't attached to any
-	 * mount namespace and it is thus excluded from any propagation.
-	 * So realistically we're isolated and the mount can't be
-	 * overmounted.
-	 */
-
-	/* Borrow the reference from clone_mnt(). */
-	to_path.mnt = &new_ns_root->mnt;
-	to_path.dentry = dget(new_ns_root->mnt.mnt_root);
-
-	/* Now lock for actual mounting. */
-	mp = lock_mount_exact(&to_path);
-	if (unlikely(IS_ERR(mp))) {
-		err = PTR_ERR(mp);
-		goto err_path_put;
-	}
-
-	/*
-	 * We don't emulate unshare()ing a mount namespace. We stick to the
-	 * restrictions of creating detached bind-mounts. It has a lot
-	 * saner and simpler semantics.
+	 * We don't emulate unshare()ing a mount namespace. We stick
+	 * to the restrictions of creating detached bind-mounts. It
+	 * has a lot saner and simpler semantics.
 	 */
 	mnt = __do_loopback(path, flags, copy_flags);
+
+	lock_mount_hash();
 	if (IS_ERR(mnt)) {
 		err = PTR_ERR(mnt);
-		unlock_mount(mp);
-		goto err_path_put;
+		umount_tree(new_ns_root, 0);
+		unlock_mount_hash();
+		goto err_unlock_mp;
 	}

-	lock_mount_hash();
 	if (locked)
 		mnt->mnt.mnt_flags |= MNT_LOCKED;
+
 	/*
-	 * Now mount the detached tree on top of the copy of the
-	 * real rootfs we created.
+	 * now mount the detached tree on top of the copy
+	 * of the real rootfs we created.
 	 */
 	attach_mnt(mnt, new_ns_root, mp);
 	if (user_ns != ns->user_ns)
 		lock_mnt_tree(new_ns_root);
 	unlock_mount_hash();
@@ -2799,17 +2789,15 @@ static struct mnt_namespace *create_new_namespace(struct path *path, unsigned in
 	/* Add all mounts to the new namespace. */
 	mnt_add_tree_to_ns(new_ns, new_ns_root);
 	new_ns->root = new_ns_root;
 	unlock_mount(mp);
-	to_path.mnt = NULL;
-	path_put(&to_path);
 	return new_ns;

-err_path_put:
-	path_put(&to_path);
+err_unlock_mp:
+	unlock_mount(mp);
 err_free_ns:
 	free_mnt_ns(new_ns);
 	return ERR_PTR(err);
 }

@@ -3521,12 +3509,10 @@ static struct mountpoint *lock_mount_exact(struct path *path)

 	inode_lock(dentry->d_inode);
 	namespace_lock();
 	if (unlikely(cant_mount(dentry))) {
 		err = -ENOENT;
-	} else if (path_overmounted(path)) {
-		err = -EBUSY;
 	} else {
 		mp = get_mountpoint(dentry);
 		if (IS_ERR(mp))
 			err = PTR_ERR(mp);
 	}
-- 
2.52.0
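The locking change in the patch above can be modelled in userspace. The following is only a sketch under stated assumptions, not kernel code: two pthread mutexes stand in for inode_lock and namespace_sem, and every name in it (ns_model, create_namespace_model(), locks_are_free(), the fail_* switches) is hypothetical. It shows the shape the patch moves to: take both locks once in the inode_lock -> namespace_sem order, do the copy and the attach while they are held, and undo a partial copy before a single unlock on the error path.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-ins for inode_lock and namespace_sem (model only). */
static pthread_mutex_t inode_lock_model = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t namespace_sem_model = PTHREAD_MUTEX_INITIALIZER;

struct ns_model {
	int nr_mounts;		/* mounts attached to the namespace */
	int nr_detached_copies;	/* copies made but not yet attached */
};

/* Take both locks once, inode lock first, matching the lock ordering. */
static void lock_both(void)
{
	pthread_mutex_lock(&inode_lock_model);
	pthread_mutex_lock(&namespace_sem_model);
}

static void unlock_both(void)
{
	pthread_mutex_unlock(&namespace_sem_model);
	pthread_mutex_unlock(&inode_lock_model);
}

/*
 * Copy the root and attach the tree without dropping the locks in
 * between. @fail_copy and @fail_loopback inject errors at the two steps.
 */
static int create_namespace_model(struct ns_model *ns, bool fail_copy,
				  bool fail_loopback)
{
	int err = 0;

	lock_both();

	if (fail_copy) {
		err = -ENOMEM;
		goto out_unlock;
	}
	ns->nr_detached_copies++;	/* stands in for clone_mnt() */

	if (fail_loopback) {
		/* Undo the copy while still locked, like umount_tree(). */
		ns->nr_detached_copies--;
		err = -EINVAL;
		goto out_unlock;
	}

	/* Attach: the root copy plus the tree mounted on top of it. */
	ns->nr_detached_copies--;
	ns->nr_mounts += 2;

out_unlock:
	unlock_both();
	return err;
}

/* Returns true when neither lock is still held. */
static bool locks_are_free(void)
{
	if (pthread_mutex_trylock(&inode_lock_model))
		return false;
	if (pthread_mutex_trylock(&namespace_sem_model)) {
		pthread_mutex_unlock(&inode_lock_model);
		return false;
	}
	unlock_both();
	return true;
}
```

In this model every exit goes through one unlock label, so a failure at either step leaves no detached copy behind and no lock held. That is the property the drop-and-reacquire version made hard to guarantee.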
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully.
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/23012
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/RMP...