[PATCH OLK-5.10 0/5] arm64/mpam: Use an IPI instead of task_work_add() to update MPAM sysreg

Zeng Heng (5):
  arm64/mpam: Use an IPI instead of task_work_add() to update MPAM sysreg
  arm64/mpam: Apply READ_ONCE/WRITE_ONCE to task_struct.{rmid,closid}
  arm64/mpam: Fix task CLOSID/RMID update race
  arm64/mpam: Fix the potential deadlock when updating the rmid
  arm64/mpam: Only inform the CPUs running tasks of the group

 arch/arm64/kernel/mpam/mpam_resctrl.c | 163 ++++++++++++++------------
 fs/resctrlfs.c                        |  12 +-
 2 files changed, 98 insertions(+), 77 deletions(-)

--
2.25.1

[PATCH OLK-5.10 1/5] arm64/mpam: Use an IPI instead of task_work_add() to update MPAM sysreg

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBN6HC

--------------------------------

Currently, when moving a task to a resource group the PQR_ASSOC MSR is
updated with the new closid and rmid in an added task callback. If the
task is running, the work is run as soon as possible. If the task is
not running, the work is executed later in the kernel exit path when
the kernel returns to the task again.

Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running on is the right thing to do. Queueing work for a task that
is not running is unnecessary (the PQR_ASSOC MSR is already updated
when the task is scheduled in) and causes system resource waste in the
way it is implemented: work to update the PQR_ASSOC register is queued
every time the user writes a task id to the "tasks" file, even if the
task already belongs to the resource group. This could result in
multiple pending work items associated with a single task, even though
they are all identical and only a single update with the most recent
values is needed. Specifically, if a task is moved between different
resource groups while it is sleeping, only the last move is relevant,
yet a work item is queued during each move. This unnecessary queueing
of work items could result in significant system resource waste,
especially for tasks sleeping for a long time. For example, as
demonstrated by Shakeel Butt in [1], writing the same task id to the
"tasks" file can quickly consume significant memory. The same problem
(wasted system resources) occurs when moving a task between different
resource groups.

As pointed out by Valentin Schneider in [2], there is an additional
issue with the way in which the queueing of work is done: the
task_struct update is currently done after the work is queued,
resulting in a race where the register update may be done before the
data needed by the update is available.

To solve these issues, update the MPAM sysreg synchronously, right
after the new closid and rmid are ready during the task movement, and
only if the task is running. If a moved task is not running, nothing
is done, since the MPAM sysreg will be updated the next time the task
is scheduled in. This is the same way the register is updated when
tasks are moved as part of resource group removal.

Link: https://lore.kernel.org/all/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243...
Fixes: e37caef15c18 ("resctrlfs: mpam: Build basic framework for mpam")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam/mpam_resctrl.c | 109 ++++++++++----------
 1 file changed, 43 insertions(+), 66 deletions(-)

diff --git a/arch/arm64/kernel/mpam/mpam_resctrl.c b/arch/arm64/kernel/mpam/mpam_resctrl.c
index 39ee9f6e65f2..65ef64d2bec7 100644
--- a/arch/arm64/kernel/mpam/mpam_resctrl.c
+++ b/arch/arm64/kernel/mpam/mpam_resctrl.c
@@ -1311,7 +1311,7 @@ void update_cpu_closid_rmid(void *info)
 }
 
 /*
- * Update the PGR_ASSOC MSR on all cpus in @cpu_mask,
+ * Update the MPAM sysreg on all cpus in @cpu_mask,
  *
  * Per task closids/rmids must have been set up before calling this function.
  */
@@ -1326,86 +1326,63 @@ update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r)
         put_cpu();
 }
 
-struct task_move_callback {
-        struct callback_head work;
-        struct rdtgroup *rdtgrp;
-};
-
-static void move_myself(struct callback_head *head)
+static void _update_task_closid_rmid(void *task)
 {
-        struct task_move_callback *callback;
-        struct rdtgroup *rdtgrp;
-
-        callback = container_of(head, struct task_move_callback, work);
-        rdtgrp = callback->rdtgrp;
-
         /*
-         * If resource group was deleted before this task work callback
-         * was invoked, then assign the task to root group and free the
-         * resource group.
+         * If the task is still current on this CPU, update MPAM sysreg.
+         * Otherwise, the MSR is updated when the task is scheduled in.
          */
-        if (atomic_dec_and_test(&rdtgrp->waitcount) &&
-            (rdtgrp->flags & RDT_DELETED)) {
-                current->closid = 0;
-                current->rmid = 0;
-                rdtgroup_remove(rdtgrp);
-        }
-
-        preempt_disable();
-        /* update PQR_ASSOC MSR to make resource group go into effect */
-        mpam_sched_in();
-        preempt_enable();
+        if (task == current)
+                mpam_sched_in();
+}
 
-        kfree(callback);
+static void update_task_closid_rmid(struct task_struct *t)
+{
+        if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
+                smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
+        else
+                _update_task_closid_rmid(t);
 }
 
 int __resctrl_group_move_task(struct task_struct *tsk,
                               struct rdtgroup *rdtgrp)
 {
-        struct task_move_callback *callback;
-        int ret;
-
-        callback = kzalloc(sizeof(*callback), GFP_NOWAIT);
-        if (!callback)
-                return -ENOMEM;
-        callback->work.func = move_myself;
-        callback->rdtgrp = rdtgrp;
-
         /*
-         * Take a refcount, so rdtgrp cannot be freed before the
-         * callback has been invoked.
+         * Set the task's closid/rmid before the MPAM sysreg can be
+         * updated by them.
+         *
+         * For ctrl_mon groups, move both closid and rmid.
+         * For monitor groups, can move the tasks only from
+         * their parent CTRL group.
          */
-        atomic_inc(&rdtgrp->waitcount);
-        ret = task_work_add(tsk, &callback->work, true);
-        if (ret) {
-                /*
-                 * Task is exiting. Drop the refcount and free the callback.
-                 * No need to check the refcount as the group cannot be
-                 * deleted before the write function unlocks resctrl_group_mutex.
-                 */
-                atomic_dec(&rdtgrp->waitcount);
-                kfree(callback);
-                rdt_last_cmd_puts("task exited\n");
-        } else {
-                /*
-                 * For ctrl_mon groups move both closid and rmid.
-                 * For monitor groups, can move the tasks only from
-                 * their parent CTRL group.
-                 */
-                if (rdtgrp->type == RDTCTRL_GROUP) {
+        if (rdtgrp->type == RDTCTRL_GROUP) {
+                tsk->closid = resctrl_navie_closid(rdtgrp->closid);
+                tsk->rmid = resctrl_navie_rmid(rdtgrp->mon.rmid);
+        } else if (rdtgrp->type == RDTMON_GROUP) {
+                if (rdtgrp->mon.parent->closid.intpartid == tsk->closid) {
                         tsk->closid = resctrl_navie_closid(rdtgrp->closid);
                         tsk->rmid = resctrl_navie_rmid(rdtgrp->mon.rmid);
-                } else if (rdtgrp->type == RDTMON_GROUP) {
-                        if (rdtgrp->mon.parent->closid.intpartid == tsk->closid) {
-                                tsk->closid = resctrl_navie_closid(rdtgrp->closid);
-                                tsk->rmid = resctrl_navie_rmid(rdtgrp->mon.rmid);
-                        } else {
-                                rdt_last_cmd_puts("Can't move task to different control group\n");
-                                ret = -EINVAL;
-                        }
+                } else {
+                        rdt_last_cmd_puts("Can't move task to different control group\n");
+                        return -EINVAL;
                 }
         }
-        return ret;
+
+        /*
+         * Ensure the task's closid and rmid are written before determining if
+         * the task is current that will decide if it will be interrupted.
+         */
+        barrier();
+
+        /*
+         * By now, the task's closid and rmid are set. If the task is current
+         * on a CPU, the MPAM sysreg needs to be updated to make the resource
+         * group go into effect. If the task is not current, the sysreg will be
+         * updated when the task is scheduled in.
+         */
+        update_task_closid_rmid(tsk);
+
+        return 0;
 }
 
 static int resctrl_group_seqfile_show(struct seq_file *m, void *arg)
-- 
2.25.1

[PATCH OLK-5.10 2/5] arm64/mpam: Apply READ_ONCE/WRITE_ONCE to task_struct.{rmid,closid}

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBN6HC

--------------------------------

A CPU's current task can have its {closid, rmid} fields read locally
while they are being concurrently written to from another CPU. This
can happen anytime __resctrl_sched_in() races with either
__rdtgroup_move_task() or rdt_move_group_tasks().

Prevent load/store tearing for those accesses by giving them the
READ_ONCE()/WRITE_ONCE() treatment.

Link: https://lkml.kernel.org/r/9921fda88ad81afb9885b517fbe864a2bc7c35a9.160824314...
Fixes: e37caef15c18 ("resctrlfs: mpam: Build basic framework for mpam")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam/mpam_resctrl.c | 19 +++++++++++--------
 fs/resctrlfs.c                        |  4 ++--
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kernel/mpam/mpam_resctrl.c b/arch/arm64/kernel/mpam/mpam_resctrl.c
index 65ef64d2bec7..b04c0d2b8312 100644
--- a/arch/arm64/kernel/mpam/mpam_resctrl.c
+++ b/arch/arm64/kernel/mpam/mpam_resctrl.c
@@ -1356,12 +1356,12 @@ int __resctrl_group_move_task(struct task_struct *tsk,
          * their parent CTRL group.
          */
         if (rdtgrp->type == RDTCTRL_GROUP) {
-                tsk->closid = resctrl_navie_closid(rdtgrp->closid);
-                tsk->rmid = resctrl_navie_rmid(rdtgrp->mon.rmid);
+                WRITE_ONCE(tsk->closid, resctrl_navie_closid(rdtgrp->closid));
+                WRITE_ONCE(tsk->rmid, resctrl_navie_rmid(rdtgrp->mon.rmid));
         } else if (rdtgrp->type == RDTMON_GROUP) {
                 if (rdtgrp->mon.parent->closid.intpartid == tsk->closid) {
-                        tsk->closid = resctrl_navie_closid(rdtgrp->closid);
-                        tsk->rmid = resctrl_navie_rmid(rdtgrp->mon.rmid);
+                        WRITE_ONCE(tsk->closid, resctrl_navie_closid(rdtgrp->closid));
+                        WRITE_ONCE(tsk->rmid, resctrl_navie_rmid(rdtgrp->mon.rmid));
                 } else {
                         rdt_last_cmd_puts("Can't move task to different control group\n");
                         return -EINVAL;
@@ -2198,19 +2198,22 @@ void __mpam_sched_in(void)
         u64 closid = state->default_closid;
         u64 reqpartid = 0;
         u64 pmg = 0;
+        u32 tmp;
 
         /*
          * If this task has a closid/rmid assigned, use it.
          * Else use the closid/rmid assigned to this cpu.
          */
         if (static_branch_likely(&resctrl_alloc_enable_key)) {
-                if (current->closid)
-                        closid = current->closid;
+                tmp = READ_ONCE(current->closid);
+                if (tmp)
+                        closid = tmp;
         }
 
         if (static_branch_likely(&resctrl_mon_enable_key)) {
-                if (current->rmid)
-                        rmid = current->rmid;
+                tmp = READ_ONCE(current->rmid);
+                if (tmp)
+                        rmid = tmp;
         }
 
         if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/fs/resctrlfs.c b/fs/resctrlfs.c
index acd0cbce5027..29fe73571870 100644
--- a/fs/resctrlfs.c
+++ b/fs/resctrlfs.c
@@ -454,8 +454,8 @@ static void resctrl_move_group_tasks(struct resctrl_group *from, struct resctrl_
         read_lock(&tasklist_lock);
         for_each_process_thread(p, t) {
                 if (!from || is_task_match_resctrl_group(t, from)) {
-                        t->closid = resctrl_navie_closid(to->closid);
-                        t->rmid = resctrl_navie_rmid(to->mon.rmid);
+                        WRITE_ONCE(t->closid, resctrl_navie_closid(to->closid));
+                        WRITE_ONCE(t->rmid, resctrl_navie_rmid(to->mon.rmid));
 
 #ifdef CONFIG_SMP
                         /*
-- 
2.25.1
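The pattern being applied is the standard one for a word-sized field
that one CPU writes while another reads it locklessly. A minimal sketch
follows; the demo_* names are illustrative, not tree APIs, it assumes
the MPAM tree's closid/rmid fields in task_struct, and the u32 local
mirrors the hunk above.

#include <linux/compiler.h>
#include <linux/sched.h>

/* Writer side: publish the new ids with single, non-torn stores. */
static void demo_set_ids(struct task_struct *tsk, u32 closid, u32 rmid)
{
        WRITE_ONCE(tsk->closid, closid);
        WRITE_ONCE(tsk->rmid, rmid);
}

/* Reader side: load the field once, then test and use that same value,
 * so the check and the use cannot see two different results. */
static u32 demo_effective_closid(u32 cpu_default)
{
        u32 tmp = READ_ONCE(current->closid);

        return tmp ? tmp : cpu_default;
}

Without the _ONCE accessors the compiler is free to tear, refetch, or
fuse these accesses, which is exactly the race described above.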

[PATCH OLK-5.10 3/5] arm64/mpam: Fix task CLOSID/RMID update race

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBN6HC

--------------------------------

When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.

x86 allows reordering loads with prior stores (and arm64's weaker
memory model allows the same), so if the task starts running between a
task_curr() check that the CPU hoisted before the stores in the
CLOSID/RMID update, then it can start running with the old CLOSID/RMID
until it is switched again, because __rdtgroup_move_task() failed to
determine that it needs to be interrupted to obtain the new
CLOSID/RMID.

Refer to the diagram below:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
  t1->{closid,rmid} <- {2,2}
  if (curr == t1) // false
    IPI(t1->cpu)

A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.

In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.

It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, the benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.

Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.

CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)

 # mkdir /sys/fs/resctrl/test
 # echo $$ > /sys/fs/resctrl/test/tasks
 # perf bench sched messaging -g 40 -l 100000

task-clock time ranges collected using:

 # perf stat rmdir /sys/fs/resctrl/test

Baseline:                     1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms

Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
Fixes: 6065f5513f78 ("resctrlfs: init support resctrlfs")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam/mpam_resctrl.c | 4 +++-
 fs/resctrlfs.c                        | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/mpam/mpam_resctrl.c b/arch/arm64/kernel/mpam/mpam_resctrl.c
index b04c0d2b8312..ab822fd6828d 100644
--- a/arch/arm64/kernel/mpam/mpam_resctrl.c
+++ b/arch/arm64/kernel/mpam/mpam_resctrl.c
@@ -1371,8 +1371,10 @@ int __resctrl_group_move_task(struct task_struct *tsk,
         /*
          * Ensure the task's closid and rmid are written before determining if
          * the task is current that will decide if it will be interrupted.
+         * This pairs with the full barrier between the rq->curr update and
+         * resctrl_sched_in() during context switch.
          */
-        barrier();
+        smp_mb();
 
         /*
          * By now, the task's closid and rmid are set. If the task is current
diff --git a/fs/resctrlfs.c b/fs/resctrlfs.c
index 29fe73571870..7d816b33e209 100644
--- a/fs/resctrlfs.c
+++ b/fs/resctrlfs.c
@@ -457,6 +457,14 @@ static void resctrl_move_group_tasks(struct resctrl_group *from, struct resctrl_
                         WRITE_ONCE(t->closid, resctrl_navie_closid(to->closid));
                         WRITE_ONCE(t->rmid, resctrl_navie_rmid(to->mon.rmid));
 
+                        /*
+                         * Order the closid/rmid stores above before the loads
+                         * in task_curr(). This pairs with the full barrier
+                         * between the rq->curr update and mpam_sched_in()
+                         * during context switch.
+                         */
+                        smp_mb();
+
 #ifdef CONFIG_SMP
                         /*
                          * This is safe on x86 w/o barriers as the ordering
-- 
2.25.1

[PATCH OLK-5.10 4/5] arm64/mpam: Fix the potential deadlock when updating the rmid

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBN6HC

--------------------------------

Since commit c8903ef0769e switched from task_work to the IPI method for
updating a task's partid, a deadlock can occur on tasklist_lock when
updating the rmid of a control group.

The concurrent scenario is: on one side, a read lock on tasklist_lock
is taken and the rmid of a control group is updated via IPI; on the
other side, a task in that control group, while waiting for a child
thread to exit, disables interrupts and tries to take a write lock on
tasklist_lock. The result is a deadlock, because the waiting task can
no longer respond to the IPI:

CPU_A                                   CPU_B
resctrl_group_rmid_write()
  read_lock(&tasklist_lock); <-- read locked
                                        do_wait()
                                          wait_consider_task()
                                            wait_task_zombie()
                                              release_task()
                                                write_lock_irq(&tasklist_lock);
                                                  __raw_write_lock_irq(lock);
                                                    local_irq_disable(); <-- disable interrupt
                                                    do_raw_write_lock(); <-- wait write lock
  __resctrl_group_move_task(t, rdtgrp);
    update_task_closid_rmid(tsk);
      smp_call_function_single(); <-- never returns

The IPI operation needs to be moved outside the critical section of the
tasklist_lock. We choose to merge it into the subsequent
update_closid_rmid() call, which now targets all online CPUs instead of
only the CPUs bound to the control group.

Fixes: c8903ef0769e ("[Huawei] arm64/mpam: Use an IPI instead of task_work_add() to update MPAM sysreg")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam/mpam_resctrl.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/mpam/mpam_resctrl.c b/arch/arm64/kernel/mpam/mpam_resctrl.c
index ab822fd6828d..220dc48e3614 100644
--- a/arch/arm64/kernel/mpam/mpam_resctrl.c
+++ b/arch/arm64/kernel/mpam/mpam_resctrl.c
@@ -1344,8 +1344,8 @@ static void update_task_closid_rmid(struct task_struct *t)
                 _update_task_closid_rmid(t);
 }
 
-int __resctrl_group_move_task(struct task_struct *tsk,
-                              struct rdtgroup *rdtgrp)
+static int resctrl_group_update_task(struct task_struct *tsk,
+                                     struct rdtgroup *rdtgrp)
 {
         /*
          * Set the task's closid/rmid before the MPAM sysreg can be
@@ -1376,6 +1376,18 @@ int __resctrl_group_move_task(struct task_struct *tsk,
          */
         smp_mb();
 
+        return 0;
+}
+
+int __resctrl_group_move_task(struct task_struct *tsk,
+                              struct rdtgroup *rdtgrp)
+{
+        int err;
+
+        err = resctrl_group_update_task(tsk, rdtgrp);
+        if (err)
+                return err;
+
         /*
          * By now, the task's closid and rmid are set. If the task is current
          * on a CPU, the MPAM sysreg needs to be updated to make the resource
@@ -1972,7 +1984,7 @@ static ssize_t resctrl_group_rmid_write(struct kernfs_open_file *of,
         read_lock(&tasklist_lock);
         for_each_process_thread(p, t) {
                 if (t->closid == rdtgrp->closid.intpartid) {
-                        ret = __resctrl_group_move_task(t, rdtgrp);
+                        ret = resctrl_group_update_task(t, rdtgrp);
                         if (ret) {
                                 read_unlock(&tasklist_lock);
                                 goto rollback;
@@ -1981,7 +1993,7 @@ static ssize_t resctrl_group_rmid_write(struct kernfs_open_file *of,
         }
         read_unlock(&tasklist_lock);
 
-        update_closid_rmid(&rdtgrp->cpu_mask, rdtgrp);
+        update_closid_rmid(cpu_online_mask, rdtgrp);
         rmid_free(old_rmid);
 
 unlock:
@@ -2004,7 +2016,7 @@ static ssize_t resctrl_group_rmid_write(struct kernfs_open_file *of,
         read_lock(&tasklist_lock);
         for_each_process_thread(p, t) {
                 if (t->closid == rdtgrp->closid.intpartid)
-                        WARN_ON_ONCE(__resctrl_group_move_task(t, rdtgrp));
+                        WARN_ON_ONCE(resctrl_group_update_task(t, rdtgrp));
         }
         read_unlock(&tasklist_lock);
 
-- 
2.25.1
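The resulting locking rule is easiest to see in condensed form: nothing
that synchronously waits on another CPU may run under tasklist_lock. A
sketch of the fixed flow, assuming the file-local helpers of
mpam_resctrl.c named in the patch, with error handling omitted:

#include <linux/cpumask.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>

static void demo_rmid_write(struct rdtgroup *rdtgrp)
{
        struct task_struct *p, *t;

        read_lock(&tasklist_lock);
        for_each_process_thread(p, t) {
                if (t->closid == rdtgrp->closid.intpartid)
                        /* Updates task fields only; sends no IPI. */
                        resctrl_group_update_task(t, rdtgrp);
        }
        read_unlock(&tasklist_lock);

        /*
         * IPIs are issued only after the read lock is dropped, so a
         * task spinning in write_lock_irq(&tasklist_lock) with
         * interrupts disabled can take the lock, re-enable interrupts,
         * and service the IPI.
         */
        update_closid_rmid(cpu_online_mask, rdtgrp);
}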

[PATCH OLK-5.10 5/5] arm64/mpam: Only inform the CPUs running tasks of the group

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBN6HC

--------------------------------

Instead of broadcasting to all online CPUs, only inform the CPUs that
are running tasks of the group whose rmid is changing. The detection is
inaccurate, as tasks might move or schedule before the smp function
call takes place; in such a case the function call is pointless, but
there is no other side effect.

The IPI partid update is done in two calls because update_closid_rmid()
also writes the rdtgroup's partid into the pqr_state of each CPU in the
cpumask. Since the partids of the moved tasks have already been updated
successfully, the CPUs running those tasks only need to call
mpam_sched_in().

Fixes: ed282cdb7305 ("[Huawei] arm64/mpam: Fix the potential deadlock when updating the rmid")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam/mpam_resctrl.c | 29 ++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/mpam/mpam_resctrl.c b/arch/arm64/kernel/mpam/mpam_resctrl.c
index 220dc48e3614..8364b3f8b7b3 100644
--- a/arch/arm64/kernel/mpam/mpam_resctrl.c
+++ b/arch/arm64/kernel/mpam/mpam_resctrl.c
@@ -1914,10 +1914,14 @@ static ssize_t resctrl_group_rmid_write(struct kernfs_open_file *of,
         int old_rmid;
         int old_reqpartid;
         struct task_struct *p, *t;
+        cpumask_var_t tmpmask;
 
         if (kstrtoint(strstrip(buf), 0, &rmid) || rmid < 0)
                 return -EINVAL;
 
+        if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+                return -ENOMEM;
+
         rdtgrp = resctrl_group_kn_lock_live(of->kn);
         if (!rdtgrp) {
                 ret = -ENOENT;
@@ -1989,19 +1993,25 @@ static ssize_t resctrl_group_rmid_write(struct kernfs_open_file *of,
                                 read_unlock(&tasklist_lock);
                                 goto rollback;
                         }
+
+                        if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
+                                cpumask_set_cpu(task_cpu(t), tmpmask);
                 }
         }
         read_unlock(&tasklist_lock);
 
-        update_closid_rmid(cpu_online_mask, rdtgrp);
+        /* Update PARTID on CPUs which have moved task running on them */
+        update_closid_rmid(tmpmask, NULL);
+        /* Update PARTID on the cpu_list of the group */
+        update_closid_rmid(&rdtgrp->cpu_mask, rdtgrp);
+
         rmid_free(old_rmid);
 
 unlock:
         resctrl_group_kn_unlock(of->kn);
-        if (ret)
-                return ret;
+        free_cpumask_var(tmpmask);
 
-        return nbytes;
+        return ret ? : nbytes;
 
 rollback:
         rdtgrp->mon.rmid = old_rmid;
@@ -2013,15 +2023,24 @@ static ssize_t resctrl_group_rmid_write(struct kernfs_open_file *of,
         rdtgrp->resync = 1;
         WARN_ON_ONCE(resctrl_update_groups_config(rdtgrp));
 
+        cpumask_clear(tmpmask);
+
         read_lock(&tasklist_lock);
         for_each_process_thread(p, t) {
-                if (t->closid == rdtgrp->closid.intpartid)
+                if (t->closid == rdtgrp->closid.intpartid) {
                         WARN_ON_ONCE(resctrl_group_update_task(t, rdtgrp));
+
+                        if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
+                                cpumask_set_cpu(task_cpu(t), tmpmask);
+                }
         }
         read_unlock(&tasklist_lock);
 
+        update_closid_rmid(tmpmask, NULL);
+
         rmid_free(rmid);
         resctrl_group_kn_unlock(of->kn);
+        free_cpumask_var(tmpmask);
 
         return ret;
 }
-- 
2.25.1
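The end state of the series is the targeted variant sketched below,
condensed from the diff above; the demo_* name is illustrative, the
tree-local helpers are assumed, and the rollback path is omitted.

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>

static int demo_rmid_write(struct rdtgroup *rdtgrp)
{
        struct task_struct *p, *t;
        cpumask_var_t tmpmask;

        if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
                return -ENOMEM;

        read_lock(&tasklist_lock);
        for_each_process_thread(p, t) {
                if (t->closid == rdtgrp->closid.intpartid) {
                        resctrl_group_update_task(t, rdtgrp);
                        /* Collect only CPUs with a moved task running. */
                        if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
                                cpumask_set_cpu(task_cpu(t), tmpmask);
                }
        }
        read_unlock(&tasklist_lock);

        /* Tasks already carry the new partid: just resync those CPUs. */
        update_closid_rmid(tmpmask, NULL);
        /* And push the group's own partid to the CPUs assigned to it. */
        update_closid_rmid(&rdtgrp->cpu_mask, rdtgrp);

        free_cpumask_var(tmpmask);
        return 0;
}

Passing NULL instead of the group in the first call is what keeps the
per-CPU pqr_state untouched on CPUs that merely run moved tasks.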

FeedBack:
The patch(es) you sent to the kernel@openeuler.org mailing list have
been converted to a pull request successfully!
Pull request link:
https://gitee.com/openeuler/kernel/pulls/17157
Mailing list address:
https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/YKA...