-----Original Message----- From: Vincent Guittot [mailto:vincent.guittot@linaro.org] Sent: Tuesday, January 19, 2021 12:14 AM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com Cc: Ingo Molnar mingo@kernel.org; Peter Zijlstra peterz@infradead.org; Dietmar Eggemann dietmar.eggemann@arm.com; Morten Rasmussen morten.rasmussen@arm.com; Valentin Schneider valentin.schneider@arm.com; linux-kernel linux-kernel@vger.kernel.org; Mel Gorman mgorman@suse.de; linuxarm@openeuler.org Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
On Fri, 15 Jan 2021 at 21:42, Barry Song song.bao.hua@hisilicon.com wrote:
This patch is a follow-up of the 3-hops issue reported by Valentin Schneider: [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/ [2]
https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm .com/
Here is a brief summary of the background: For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not a subset of sched_domain. For example, for a system with the below topology(two cpus in each NUMA node): node 0 1 2 3 0: 10 12 20 22 1: 12 10 22 24 2: 20 22 10 12 3: 22 24 12 10
For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7. 4-7 isn't a subset of 0-5.
CPU0 attaching sched-domain(s): domain-0: span=0-1 level=MC groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 } domain-1: span=0-3 level=NUMA groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 } domain-2: span=0-5 level=NUMA groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 } ERROR: groups don't span domain->span domain-3: span=0-7 level=NUMA groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7 cap=3928 }
All other cpus also have the same issue: sched_group could be not a subset of sched_domain.
Here I am trying to figure out the scheduling impact of this issue from two aspects:
- find busiest cpu in load_balance
- find idlest cpu in fork/exec/wake balance
Would be better to fix the error in the sched domain topology instead of hacking the load balance to compensate the topology problem
I think Valentin Schneider has tried to do that before, but failed. This will add some new groups which won't be managed by current update_group_capacity()? @Valentine, would you like to share more details?
On the other hand, another purpose of this RFC is that I also want to dig into more details about how the 3-hops issue could affect the behavior of scheduler. In Valentine's original thread, I think we haven't figured out how the issue will really impact scheduling.
For case 1, load_balance() seems to be handling this issue correctly as it
only
fills cpus in sched_domain to the cpus of lb_env. Also, find_busiest_group() and find_busiest_queue() will result in scanning cpus within env.cpus only:
static int load_balance(int this_cpu, struct rq *this_rq, struct sched_domain *sd, enum cpu_idle_type idle, int *continue_balancing) {` ...
struct lb_env env = { ... .cpus = cpus, .fbq_type = all, .tasks = LIST_HEAD_INIT(env.tasks), }; /* added by barry: only cpus in sched_domain are put in lb_env */ cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask); ... /* * added by barry: the below functions are only scanning cpus * in env.cpus */ group = find_busiest_group(&env); ... busiest = find_busiest_queue(&env, group); ...
}
But one thing which looks wrong is that update_sg_lb_stats() is only counting tasks in sched_domain, but sgs->group_capacity and sgs->group_weight are counting all cpus in the sched_group. Then finally, update_sg_lb_stats() uses the load of cpus which are in the sched_domain to calculate group_type and avg_load which can be seriously underestimated. This is explained in detail as the comments added by me in the code:
static inline void update_sg_lb_stats() { int i, nr_running, local_group;
/* added by barry: here it only counts cpu in the sched_domain */ for_each_cpu_and(i, sched_group_span(group), env->cpus) { ... sgs->group_load += cpu_load(rq); sgs->group_util += cpu_util(i); sgs->group_runnable += cpu_runnable(rq); sgs->sum_h_nr_running += rq->cfs.h_nr_running; nr_running = rq->nr_running; sgs->sum_nr_running += nr_running; ... } ... /* added by barry: here it count all cpus which might not be in the
domain */
sgs->group_capacity = group->sgc->capacity; sgs->group_weight = group->group_weight; /* added by barry: finally the group_type and avg_load could be wrong
*/
sgs->group_type = group_classify(env->sd->imbalance_pct, group,
sgs);
if (sgs->group_type == group_overloaded) sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity; ...
} For example, if we have 2 cpus in sched_domain and 4 cpus in sched_group,
the
code is using the load of 2 cpus to calculate the group_type and avg_load
of 4
cpus, the sched_group is likely to get much lower load than the real case. This patch fixed it by only counting cpus within sched_domain for
group_capacity
and group_weight.
For case 2, find_idlest_group() and find_idlest_group_cpu() don't use
sched_domain
for scanning at all. They are scanning all cpus in the sched_group though
sched_group
isn't a subset of sched_domain. So they can result in picking an idle cpu
outside
the sched_domain but inside the sched_group. This patch moved to only scan cpus within the sched_domain, which would be
similar
with load_balance().
For this moment, this is pretty much PoC code to get feedback.
Signed-off-by: Barry Song song.bao.hua@hisilicon.com
kernel/sched/fair.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 04a3ce20da67..f183dba4961e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5901,7 +5901,7 @@ find_idlest_group(struct sched_domain *sd, struct
task_struct *p, int this_cpu);
- find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
*/ static int -find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int
this_cpu)
+find_idlest_group_cpu(struct sched_domain *sd, struct sched_group *group,
struct task_struct *p, int this_cpu)
{ unsigned long load, min_load = ULONG_MAX; unsigned int min_exit_latency = UINT_MAX; @@ -5916,6 +5916,10 @@ find_idlest_group_cpu(struct sched_group *group,
struct task_struct *p, int this
/* Traverse only the allowed CPUs */ for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
/* when sched_group isn't a subset of sched_domain */
if (!cpumask_test_cpu(i, sched_domain_span(sd)))
continue;
if (sched_idle_cpu(i)) return i;
@@ -5984,7 +5988,7 @@ static inline int find_idlest_cpu(struct sched_domain
*sd, struct task_struct *p
continue; }
new_cpu = find_idlest_group_cpu(group, p, cpu);
new_cpu = find_idlest_group_cpu(sd, group, p, cpu); if (new_cpu == cpu) { /* Now try balancing at a lower domain level of 'cpu':
*/
sd = sd->child;
@@ -8416,6 +8420,8 @@ static inline void update_sg_lb_stats(struct lb_env
*env,
if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq,
false))
env->flags |= LBF_NOHZ_AGAIN;
sgs->group_capacity += capacity_of(i);
sgs->group_weight++; sgs->group_load += cpu_load(rq); sgs->group_util += cpu_util(i); sgs->group_runnable += cpu_runnable(rq);
@@ -8462,10 +8468,6 @@ static inline void update_sg_lb_stats(struct lb_env
*env,
sgs->group_asym_packing = 1; }
sgs->group_capacity = group->sgc->capacity;
sgs->group_weight = group->group_weight;
sgs->group_type = group_classify(env->sd->imbalance_pct, group,
sgs);
/* Computing avg_load makes sense only when group is overloaded */
@@ -8688,10 +8690,12 @@ static inline void update_sg_wakeup_stats(struct
sched_domain *sd,
memset(sgs, 0, sizeof(*sgs));
for_each_cpu(i, sched_group_span(group)) {
for_each_cpu_and(i, sched_group_span(group), sched_domain_span(sd))
{
struct rq *rq = cpu_rq(i); unsigned int local;
sgs->group_capacity += capacity_of(i);
sgs->group_weight++; sgs->group_load += cpu_load_without(rq, p); sgs->group_util += cpu_util_without(i, p); sgs->group_runnable += cpu_runnable_without(rq, p);
@@ -8715,10 +8719,6 @@ static inline void update_sg_wakeup_stats(struct
sched_domain *sd,
sgs->group_misfit_task_load = 1; }
sgs->group_capacity = group->sgc->capacity;
sgs->group_weight = group->group_weight;
sgs->group_type = group_classify(sd->imbalance_pct, group, sgs); /*
-- 2.25.1
Thanks Barry