[Linuxarm] Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

3 Feb 2021

...
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Wednesday, February 3, 2021 4:17 AM
To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org;
peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com;
linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Liguozhu (Kenneth)
<liguozhu@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>; wanghuiqiang
<wanghuiqiang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan
Cameron <jonathan.cameron@huawei.com>; guodong.xu@linaro.org; Song Bao Hua
(Barry Song) <song.bao.hua@hisilicon.com>; Meelis Roos <mroos@linux.ee>
Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
domain->span for NUMA diameter > 2
On 01/02/21 16:38, Barry Song wrote:
...
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain
*sd,
build_balance_mask(sd, sg, mask);
      cpu = cpumask_first_and(sched_group_span(sg), mask);
+	/*
+	 * for the group generated by grandchild, use the sgc of 2nd cpu
+	 * because the 1st cpu might be used by another sched_group
+	 */
+	if (from_grandchild && cpumask_weight(mask) > 1)
+		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
So you are getting a (hopefully) unique ID for this group span at this
given topology level (i.e. sd->private) but as I had stated in that list of
issues, this creates an sgc that isn't attached to the local group of any
sched_domain, and thus won't get its capacity values updated.
This can actually be seen via the capacity values you're getting at build
time:
...
[    0.868907] CPU0 attaching sched-domain(s):
...
[    0.869542]    domain-2: span=0-5 level=NUMA
[    0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
                                                          ^^^^^^^^^^^^^^^^
[    0.871177] CPU4 attaching sched-domain(s):
...
[    0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
[    0.871243]   domain-1: span=4-7 level=NUMA
[    0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
                                ^^^^^^^^^^^^^^^^
Yes. I could see this issue.  We could hack update_group_capacity to let
it scan both local_group  and sched_group generated by grandchild, but it
seems your edit is much better.
...
IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
of CPU4-domain1-group4. I've done that in the below diff - this gives us
groups with sgc's owned at lower topology levels, but this will only ever
be true for non-local groups. This has the added benefit of working with
single-CPU nodes. Briefly tested on your topology and the sunfire's (via
QEMU), and I didn't get screamed at.
Before the fun police comes and impounds my keyboard, I'd like to point out
that we could leverage this cross-level sgc referencing hack to further
change the NUMA domains and pretty much get rid of overlapping groups
(that's what I was fumbling with in [1]).
[1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@arm.com
That is, rather than building overlapping groups and fixing them whenever
that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc
  cross-referencing.
Thoughts?
I guess the original purpose of overlapping groups is creating as few groups
as possible. If we totally remove overlapping groups, it seems we will create
much more groups?
For example, while node0 begins to build sched_domain for distance 20, it will
add node2, since the distance between node2 and node3 is 15, so while node2 is
added, node3 is also added as node2's lower domain has covered node3. So we need
two groups only for node0's sched_domain of distance level 20.
+-------+                  +--------+
 |       |      15          |        |
 |  node0+----------------+ | node1  |
 |       |                  |        |
 +----+--+                XXX--------+
      |                 XXX
      |                XX
20    |         15   XX
      |            XXX
      |       X XXX
 +----+----XXX               +-------+
 |         |     15          |  node3|
 | node2   +-----------------+       |
 |         |                 +-------+
 +---------+

If we remove overlapping group, we will add a group for node2, another
group for node3. Then we get three groups.

I am not sure if it is always positive for performance.
...
--->8---

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain
*sd, int cpu)
static void init_overlap_sched_group(struct sched_domain *sd,
      		     struct sched_group *sg,
-				     int from_grandchild)
+				     struct sched_domain *grandchild)
 {
  struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
  struct cpumask *sg_span;
  int cpu;
build_balance_mask(sd, sg, mask);
  cpu = cpumask_first_and(sched_group_span(sg), mask);
-	/*
-	 * for the group generated by grandchild, use the sgc of 2nd cpu
-	 * because the 1st cpu might be used by another sched_group
-	 */
-	if (from_grandchild && cpumask_weight(mask) > 1)
-		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
  if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int
cpu)
for_each_cpu_wrap(i, span, cpu) {
      struct cpumask *sg_span;
-		int from_grandchild = 0;
+		bool from_grandchild = false;
if (cpumask_test_cpu(i, covered))
      	continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int
cpu)
             !cpumask_subset(sched_domain_span(sibling->child),
      		       span)) {
      	sibling = sibling->child;
-			from_grandchild = 1;
+			from_grandchild = true;
      }
sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int
cpu)
      sg_span = sched_group_span(sg);
      cpumask_or(covered, covered, sg_span);
-		init_overlap_sched_group(sd, sg, from_grandchild);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
Nice edit!
Will merge your edit into v1 and send v2.
...
if (!first)
      	first = sg;
Thanks
Barry