MPAM: Fix use-after-free and deadlock issues
Yu Liao (4): mpam/resctrl: Fix a deadlock due to inaccurate reference mpam/resctrl: Clean up unused function parameter in mkdir path mpam/resctrl: Clean up resctrl_group_rmdir_[ctrl/mon]() mpam/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
fs/resctrlfs.c | 57 +++++++++++++++++++------------------------------- 1 file changed, 21 insertions(+), 36 deletions(-)
Offering: HULK hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9PVOZ
--------------------------------
Found a deadlock issue in mpam which is already fixed in x86 resctrl by commit 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference").
There is a race condition which results in a deadlock when rmdir and mkdir concurrently:
Thread 1: rmdir /sys/fs/resctrl/p1 Thread 2: mkdir /sys/fs/resctrl/p1/mon_groups/m1
Thread 1 is deleting control group "p1". Holding rdtgroup_mutex, kernfs_remove() removes all kernfs nodes under directory "p1" recursively, then waits for sub kernfs node "mon_groups" to drop active reference.
Thread 2 is trying to create a subdirectory "m1" in the "mon_groups" directory. The wrapper kernfs_iop_mkdir() takes an active reference to the "mon_groups" directory but the code drops the active reference to the parent directory "p1" instead.
As a result, Thread 1 is blocked on waiting for active reference to drop and never release rdtgroup_mutex, while Thread 2 is also blocked on trying to get rdtgroup_mutex.
Thread 1 (rdtgroup_rmdir) Thread 2 (rdtgroup_mkdir) (rmdir /sys/fs/resctrl/p1) (mkdir /sys/fs/resctrl/p1/mon_groups/m1) ------------------------- ------------------------- kernfs_iop_mkdir /* * kn: "m1", parent_kn: "mon_groups", * prgrp_kn: parent_kn->parent: "p1", * * "mon_groups", parent_kn->active++: 1 */ kernfs_get_active(parent_kn) kernfs_iop_rmdir /* "p1", kn->active++ */ kernfs_get_active(kn)
rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) /* "p1", kn->active-- */ kernfs_break_active_protection(kn) mutex_lock
rdtgroup_rmdir_ctrl free_all_child_rdtgrp sentry->flags = RDT_DELETED
rdtgroup_ctrl_remove rdtgrp->flags = RDT_DELETED kernfs_get(kn) kernfs_remove(rdtgrp->kn) __kernfs_remove /* "mon_groups", sub_kn */ atomic_add(KN_DEACTIVATED_BIAS, &sub_kn->active) kernfs_drain(sub_kn) /* * sub_kn->active == KN_DEACTIVATED_BIAS + 1, * waiting on sub_kn->active to drop, but it * never drops in Thread 2 which is blocked * on getting rdtgroup_mutex. */ Thread 1 hangs here ----> wait_event(sub_kn->active == KN_DEACTIVATED_BIAS) ... rdtgroup_mkdir rdtgroup_mkdir_mon(parent_kn, prgrp_kn) mkdir_rdt_prepare(parent_kn, prgrp_kn) rdtgroup_kn_lock_live(prgrp_kn) atomic_inc(&rdtgrp->waitcount) /* * "p1", prgrp_kn->active-- * * The active reference on "p1" is * dropped, but not matching the * actual active reference taken * on "mon_groups", thus causing * Thread 1 to wait forever while * holding rdtgroup_mutex. */ kernfs_break_active_protection( prgrp_kn) /* * Trying to get rdtgroup_mutex * which is held by Thread 1. */ Thread 2 hangs here ----> mutex_lock ...
The problem is that the creation of a subdirectory in the "mon_groups" directory incorrectly releases the active protection of its parent directory instead of itself before it starts waiting for rdtgroup_mutex. This is triggered by the rdtgroup_mkdir() flow calling rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() with kernfs node of the parent control group ("p1") as argument. It should be called with kernfs node "mon_groups" instead. What is currently missing is that the kn->priv of "mon_groups" is NULL instead of pointing to the rdtgrp.
Fix it by pointing kn->priv to rdtgrp when "mon_groups" is created. Then it could be passed to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() instead. And then it operates on the same rdtgroup structure but handles the active reference of kernfs node "mon_groups" to prevent deadlock. The same changes are also made to the "mon_data" directories.
This patch involved associating the private data of the "mon_groups" and "mon_data" directories to the resource group to which they belong instead of NULL as before.
Improving the directory removal safeguard to ensure that subdirectories of the resctrl root directory can only be removed if they are a child of the resctrl filesystem's root _and_ not associated with the default resource group.
This patch results in some unused function parameters that will be cleaned up in follow-up patch, this patch focuses on the fix only in support of backport efforts.
Fixes: 6065f5513f78 ("resctrlfs: init support resctrlfs") Signed-off-by: Yu Liao liaoyu15@huawei.com --- fs/resctrlfs.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/resctrlfs.c b/fs/resctrlfs.c index 1bb1485d4440..4cf0a4aad386 100644 --- a/fs/resctrlfs.c +++ b/fs/resctrlfs.c @@ -387,7 +387,7 @@ static int resctrl_get_tree(struct fs_context *fc)
if (resctrl_mon_capable) { ret = mongroup_create_dir(resctrl_group_default.kn, - NULL, "mon_groups", + &resctrl_group_default, "mon_groups", &kn_mongrp); if (ret) goto out_info; @@ -718,7 +718,7 @@ static int mkdir_resctrl_prepare(struct kernfs_node *parent_kn, uint files = 0; int ret;
- prdtgrp = resctrl_group_kn_lock_live(prgrp_kn); + prdtgrp = resctrl_group_kn_lock_live(parent_kn); rdt_last_cmd_clear(); if (!prdtgrp) { ret = -ENODEV; @@ -807,7 +807,7 @@ static int mkdir_resctrl_prepare(struct kernfs_node *parent_kn, kernfs_activate(kn);
/* - * The caller unlocks the prgrp_kn upon success. + * The caller unlocks the parent_kn upon success. */ return 0;
@@ -824,7 +824,7 @@ static int mkdir_resctrl_prepare(struct kernfs_node *parent_kn, out_free_rdtgrp: kfree(rdtgrp); out_unlock: - resctrl_group_kn_unlock(prgrp_kn); + resctrl_group_kn_unlock(parent_kn); return ret; }
@@ -865,7 +865,7 @@ static int resctrl_group_mkdir_mon(struct kernfs_node *parent_kn, */ ret = resctrl_update_groups_config(prgrp);
- resctrl_group_kn_unlock(prgrp_kn); + resctrl_group_kn_unlock(parent_kn); return ret; }
@@ -899,7 +899,7 @@ static int resctrl_group_mkdir_ctrl_mon(struct kernfs_node *parent_kn, * Create an empty mon_groups directory to hold the subset * of tasks and cpus to monitor. */ - ret = mongroup_create_dir(kn, NULL, "mon_groups", NULL); + ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL); if (ret) { rdt_last_cmd_puts("kernfs subdir error\n"); goto out_list_del; @@ -913,7 +913,7 @@ static int resctrl_group_mkdir_ctrl_mon(struct kernfs_node *parent_kn, out_common_fail: mkdir_resctrl_prepare_clean(rdtgrp); out_unlock: - resctrl_group_kn_unlock(prgrp_kn); + resctrl_group_kn_unlock(parent_kn); return ret; }
@@ -1072,7 +1072,8 @@ static int resctrl_group_rmdir(struct kernfs_node *kn) * If the resctrl_group is a mon group and parent directory * is a valid "mon_groups" directory, remove the mon group. */ - if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == resctrl_group_default.kn) + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == resctrl_group_default.kn && + rdtgrp != &resctrl_group_default) ret = resctrl_group_rmdir_ctrl(kn, rdtgrp, tmpmask); else if (rdtgrp->type == RDTMON_GROUP && is_mon_groups(parent_kn, kn->name))
Offering: HULK hulk inclusion category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/I9PVOZ
--------------------------------
Previous commit ("mpam/resctrl: Fix a deadlock due to inaccurate reference") changed the argument to resctrl_group_kn_lock_live()/ resctrl_group_kn_unlock() within mkdir_resctrl_prepare(). That change resulted in an unused function parameter to mkdir_resctrl_prepare().
Clean up the unused function parameter in mkdir_resctrl_prepare() and its callers resctrl_group_mkdir_mon() and resctrl_group_mkdir_ctrl_mon().
Signed-off-by: Yu Liao liaoyu15@huawei.com --- fs/resctrlfs.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/fs/resctrlfs.c b/fs/resctrlfs.c index 4cf0a4aad386..a3354cd0e54a 100644 --- a/fs/resctrlfs.c +++ b/fs/resctrlfs.c @@ -709,7 +709,6 @@ static int find_rdtgrp_allocable_rmid(struct resctrl_group *rdtgrp) }
static int mkdir_resctrl_prepare(struct kernfs_node *parent_kn, - struct kernfs_node *prgrp_kn, const char *name, umode_t mode, enum rdt_group_type rtype, struct resctrl_group **r) { @@ -840,14 +839,12 @@ static void mkdir_resctrl_prepare_clean(struct resctrl_group *rgrp) * to monitor a subset of tasks and cpus in its parent ctrl_mon group. */ static int resctrl_group_mkdir_mon(struct kernfs_node *parent_kn, - struct kernfs_node *prgrp_kn, - const char *name, - umode_t mode) + const char *name, umode_t mode) { struct resctrl_group *rdtgrp, *prgrp; int ret;
- ret = mkdir_resctrl_prepare(parent_kn, prgrp_kn, name, mode, RDTMON_GROUP, + ret = mkdir_resctrl_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); if (ret) return ret; @@ -874,14 +871,13 @@ static int resctrl_group_mkdir_mon(struct kernfs_node *parent_kn, * to allocate and monitor resources. */ static int resctrl_group_mkdir_ctrl_mon(struct kernfs_node *parent_kn, - struct kernfs_node *prgrp_kn, const char *name, umode_t mode) { struct resctrl_group *rdtgrp; struct kernfs_node *kn; int ret;
- ret = mkdir_resctrl_prepare(parent_kn, prgrp_kn, name, mode, RDTCTRL_GROUP, + ret = mkdir_resctrl_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); if (ret) return ret; @@ -946,14 +942,14 @@ static int resctrl_group_mkdir(struct kernfs_node *parent_kn, const char *name, * subdirectory */ if (resctrl_alloc_capable && parent_kn == resctrl_group_default.kn) - return resctrl_group_mkdir_ctrl_mon(parent_kn, parent_kn, name, mode); + return resctrl_group_mkdir_ctrl_mon(parent_kn, name, mode);
/* * If RDT monitoring is supported and the parent directory is a valid * "mon_groups" directory, add a monitoring subdirectory. */ if (resctrl_mon_capable && is_mon_groups(parent_kn, name)) - return resctrl_group_mkdir_mon(parent_kn, parent_kn->parent, name, mode); + return resctrl_group_mkdir_mon(parent_kn, name, mode);
return -EPERM; }
Offering: HULK hulk inclusion category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/I9PVOZ
--------------------------------
resctrl_group_rm_ctrl() is only called in resctrl_group_rmdir_ctrl(), so merge resctrl_group_rm_ctrl() and resctrl_group_rmdir_ctrl() for the following fix patch. Also do the same on resctrl_group_rm_mon() and resctrl_group_rmdir_mon().
Signed-off-by: Yu Liao liaoyu15@huawei.com --- fs/resctrlfs.c | 17 +++-------------- 1 file changed, 3 insertions(+), 14 deletions(-)
diff --git a/fs/resctrlfs.c b/fs/resctrlfs.c index a3354cd0e54a..dc084a03c1b2 100644 --- a/fs/resctrlfs.c +++ b/fs/resctrlfs.c @@ -954,7 +954,7 @@ static int resctrl_group_mkdir(struct kernfs_node *parent_kn, const char *name, return -EPERM; }
-static void resctrl_group_rm_mon(struct resctrl_group *rdtgrp, +static int resctrl_group_rmdir_mon(struct kernfs_node *kn, struct resctrl_group *rdtgrp, cpumask_var_t tmpmask) { struct resctrl_group *prdtgrp = rdtgrp->mon.parent; @@ -985,19 +985,14 @@ static void resctrl_group_rm_mon(struct resctrl_group *rdtgrp, */ WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); list_del(&rdtgrp->mon.crdtgrp_list); -} - -static int resctrl_group_rmdir_mon(struct kernfs_node *kn, struct resctrl_group *rdtgrp, - cpumask_var_t tmpmask) -{ - resctrl_group_rm_mon(rdtgrp, tmpmask);
kernfs_remove(rdtgrp->kn);
return 0; }
-static void resctrl_group_rm_ctrl(struct resctrl_group *rdtgrp, cpumask_var_t tmpmask) +static int resctrl_group_rmdir_ctrl(struct kernfs_node *kn, struct resctrl_group *rdtgrp, + cpumask_var_t tmpmask) { int cpu;
@@ -1033,12 +1028,6 @@ static void resctrl_group_rm_ctrl(struct resctrl_group *rdtgrp, cpumask_var_t tm free_all_child_rdtgrp(rdtgrp);
list_del(&rdtgrp->resctrl_group_list); -} - -static int resctrl_group_rmdir_ctrl(struct kernfs_node *kn, struct resctrl_group *rdtgrp, - cpumask_var_t tmpmask) -{ - resctrl_group_rm_ctrl(rdtgrp, tmpmask);
kernfs_remove(rdtgrp->kn);
Offering: HULK hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9PVOZ
--------------------------------
Backport the following x86 commit to MPAM to eliminate potential issues:
commit 074fadee59ee ("x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup")
There is a race condition in the following scenario which results in an use-after-free issue when reading a monitoring file and deleting the parent ctrl_mon group concurrently:
Thread 1 calls atomic_inc() to take refcount of rdtgrp and then calls kernfs_break_active_protection() to drop the active reference of kernfs node in rdtgroup_kn_lock_live().
In Thread 2, kernfs_remove() is a blocking routine. It waits on all sub kernfs nodes to drop the active reference when removing all subtree kernfs nodes recursively. Thread 2 could block on kernfs_remove() until Thread 1 calls kernfs_break_active_protection(). Only after kernfs_remove() completes the refcount of rdtgrp could be trusted.
Before Thread 1 calls atomic_inc() and kernfs_break_active_protection(), Thread 2 could call kfree() when the refcount of rdtgrp (sentry) is 0 instead of 1 due to the race.
In Thread 1, in rdtgroup_kn_unlock(), referring to earlier rdtgrp memory (rdtgrp->waitcount) which was already freed in Thread 2 results in use-after-free issue.
Thread 1 (rdtgroup_mondata_show) Thread 2 (rdtgroup_rmdir) -------------------------------- ------------------------- rdtgroup_kn_lock_live /* * kn active protection until * kernfs_break_active_protection(kn) */ rdtgrp = kernfs_to_rdtgroup(kn) rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) mutex_lock rdtgroup_rmdir_ctrl free_all_child_rdtgrp /* * sentry->waitcount should be 1 * but is 0 now due to the race. */ kfree(sentry)*[1] /* * Only after kernfs_remove() * completes, the refcount of * rdtgrp could be trusted. */ atomic_inc(&rdtgrp->waitcount) /* kn->active-- */ kernfs_break_active_protection(kn) rdtgroup_ctrl_remove rdtgrp->flags = RDT_DELETED /* * Blocking routine, wait for * all sub kernfs nodes to drop * active reference in * kernfs_break_active_protection. */ kernfs_remove(rdtgrp->kn) rdtgroup_kn_unlock mutex_unlock atomic_dec_and_test( &rdtgrp->waitcount) && (flags & RDT_DELETED) kernfs_unbreak_active_protection(kn) kfree(rdtgrp) mutex_lock mon_event_read rdtgroup_kn_unlock mutex_unlock /* * Use-after-free: refer to earlier rdtgrp * memory which was freed in [1]. */ atomic_dec_and_test(&rdtgrp->waitcount) && (flags & RDT_DELETED) /* kn->active++ */ kernfs_unbreak_active_protection(kn) kfree(rdtgrp)
Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in rdtgroup_rmdir_ctrl() to ensure it has the accurate refcount of rdtgrp.
Fixes: 6065f5513f78 ("resctrlfs: init support resctrlfs") Signed-off-by: Yu Liao liaoyu15@huawei.com --- fs/resctrlfs.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/resctrlfs.c b/fs/resctrlfs.c index dc084a03c1b2..acd0cbce5027 100644 --- a/fs/resctrlfs.c +++ b/fs/resctrlfs.c @@ -1018,19 +1018,18 @@ static int resctrl_group_rmdir_ctrl(struct kernfs_node *kn, struct resctrl_group cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); update_closid_rmid(tmpmask, NULL);
- rdtgrp->flags |= RDT_DELETED; closid_free(rdtgrp->closid.intpartid); rmid_free(rdtgrp->mon.rmid);
+ rdtgrp->flags |= RDT_DELETED; + list_del(&rdtgrp->resctrl_group_list); + + kernfs_remove(rdtgrp->kn); /* * Free all the child monitor group rmids. */ free_all_child_rdtgrp(rdtgrp);
- list_del(&rdtgrp->resctrl_group_list); - - kernfs_remove(rdtgrp->kn); - return 0; }
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/7488 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/K...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7488 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/K...