From: "Eric W. Biederman" ebiederm@xmission.com
mainline inclusion
from mainline-5.4-rc1
commit 5311a98fef7d0dc2e8040ae0e18f5568d6d1dd5a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3UKOW
CVE: NA
-------------------------------------------------
The current task on the runqueue (rq->curr) is read with rcu_dereference().
To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr, it needs to be paired with rcu_assign_pointer() of rq->curr, which provides the memory barrier necessary to order assignments to the task_struct before the assignment to rq->curr.
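For reference, the ordinary pairing looks like the sketch below (a made-up struct and global pointer for illustration, not the scheduler code):

	struct foo {
		int a;
	};

	struct foo __rcu *gp;

	void publish_foo(struct foo *p)
	{
		p->a = 1;			/* initialize the payload ... */
		rcu_assign_pointer(gp, p);	/* ... then publish; the release
						 * barrier in rcu_assign_pointer()
						 * orders the payload stores before
						 * the pointer store */
	}

	int read_foo(void)
	{
		struct foo *p;
		int a = 0;

		rcu_read_lock();
		p = rcu_dereference(gp);	/* dependency ordering guarantees
						 * the payload read below sees the
						 * publisher's initialization */
		if (p)
			a = p->a;
		rcu_read_unlock();
		return a;
	}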
Unfortunately the assignment of rq->curr in __schedule() is a hot path, and it has already been shown that additional barriers in that code will reduce the performance of the scheduler. So I will attempt to describe below why you can effectively have ordinary RCU semantics without any additional barriers.
The assignment of rq->curr in init_idle() is a slow path, called once per cpu, and can use rcu_assign_pointer() without any concerns.
As I write this there are effectively two users of rcu_dereference() on rq->curr. There is the membarrier code in kernel/sched/membarrier.c that only looks at "->mm" after the rcu_dereference(). Then there is task_numa_compare() in kernel/sched/fair.c. My best reading of the code shows that task_numa_compare() only accesses: "->flags", "->cpus_ptr", "->numa_group", "->numa_faults[]", "->total_numa_faults", and "->se.cfs_rq".
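The membarrier side follows roughly the pattern below (a paraphrased sketch of the loop body in kernel/sched/membarrier.c, not the exact source):

	rcu_read_lock();
	p = rcu_dereference(cpu_rq(cpu)->curr);
	if (p && p->mm == mm)		/* only "->mm" is inspected */
		__cpumask_set_cpu(cpu, tmpmask);
	rcu_read_unlock();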
The code in __schedule() essentially does:

	rq_lock(...);
	smp_mb__after_spinlock();

	next = pick_next_task(...);
	rq->curr = next;

	context_switch(prev, next);
At the start of the function the rq_lock/smp_mb__after_spinlock pair provides a full memory barrier. Further, there is a full memory barrier in context_switch().
This means that any task that has already run and modified itself (the common case) has already seen two memory barriers before __schedule() runs and begins executing. A task that modifies itself then sees a third full memory barrier paired with the rq_lock().
For a brand new task that is enqueued with wake_up_new_task() there are the memory barriers present from taking and releasing the pi_lock and the rq_lock as the process is enqueued, as well as the full memory barrier at the start of __schedule(), assuming __schedule() happens on the same cpu.
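Put differently, the claim is that a full memory barrier that is already present can stand in for the release barrier rcu_assign_pointer() would otherwise provide. Reusing the gp sketch from above, with smp_mb() standing in for the rq_lock()/smp_mb__after_spinlock() pair:

	p->a = 1;			/* payload stores: done earlier by the
					 * task itself, or at enqueue time */
	smp_mb();			/* already present in __schedule() as
					 * rq_lock() + smp_mb__after_spinlock() */
	RCU_INIT_POINTER(gp, p);	/* plain pointer store; readers still
					 * use rcu_dereference() and see the
					 * payload stores above */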
This means that by the time we reach the assignment of rq->curr, except for values on the task struct modified in pick_next_task(), the code has the same guarantees as if it used rcu_assign_pointer().
Reading through all of the implementations of pick_next_task(), it appears pick_next_task() is limited to modifying the task_struct fields "->se", "->rt", and "->dl". These fields are the scheduling entity structures of the various schedulers.
Further "->se.cfs_rq" is only changed in cgroup attach/move operations initialized by userspace.
Unless I have missed something, this means that in practice the users of "rcu_dereference(rq->curr)" get the normal RCU semantics of rcu_dereference() for the fields they care about, despite the assignment of rq->curr in __schedule() not using rcu_assign_pointer().
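That is also why the hot-path hunk below uses RCU_INIT_POINTER() rather than rcu_assign_pointer(). Roughly speaking (simplified semantics; the real macros also carry sparse annotations and type checks), the two differ only in the ordering they provide:

	RCU_INIT_POINTER(p, v);		/* plain store, no barrier:
					 * essentially WRITE_ONCE(p, v) */
	rcu_assign_pointer(p, v);	/* release store: essentially
					 * smp_store_release(&p, v) */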
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass....
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 kernel/sched/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0a0d25602c758..5f4703ddb8ab5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3565,7 +3565,11 @@ static void __sched notrace __schedule(bool preempt)
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
-		rq->curr = next;
+		/*
+		 * RCU users of rcu_dereference(rq->curr) may not see
+		 * changes to task_struct made by pick_next_task().
+		 */
+		RCU_INIT_POINTER(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -5517,7 +5521,8 @@ void init_idle(struct task_struct *idle, int cpu)
 	__set_task_cpu(idle, cpu);
 	rcu_read_unlock();
 
-	rq->curr = rq->idle = idle;
+	rq->idle = idle;
+	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;