 
-----Original Message-----
From: Peter Zijlstra [mailto:peterz@infradead.org]
Sent: Thursday, February 11, 2021 12:22 AM
To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
Cc: valentin.schneider@arm.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org; linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Liguozhu (Kenneth) <liguozhu@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>; wanghuiqiang <wanghuiqiang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; guodong.xu@linaro.org; Meelis Roos <mroos@linux.ee>
Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
I've finally had a moment to think about this, would it make sense to also break up group: node0+1, such that we then end up with 3 groups of equal size?
Since the sched_domain[n-1] of a part of node[m]'s siblings is able to cover the whole span of sched_domain[n] of node[m], there is no need to scan over all siblings of node[m]; once sched_domain[n] of node[m] has been covered, we can stop making more sched_groups. So the number of sched_groups is small.
So historically, the code has never tried to make sched_groups result in equal size. And it permits the overlapping of local group and remote groups.
Historically groups have (typically) always been the same size though.
This is probably true for other platforms. But unfortunately it has never been true in my platform :-)

node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

In case we have only two cpus in one numa.

CPU0's domain-3 has no overflowed sched_group, but its first group covers 0-5 (node0-node2), the second group covers 4-7 (node2-node3):

[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193]  domain-0: span=0-1 level=MC
[ 0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693]   domain-1: span=0-3 level=NUMA
[ 0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811]    domain-2: span=0-5 level=NUMA
[ 0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058]     domain-3: span=0-7 level=NUMA
[ 0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
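
The group construction behind the log above can be replayed at node granularity. This is only a rough sketch, not the kernel code: span() and build_groups() are hypothetical helpers, each group is assumed to be the child-level span of the first sibling not yet covered (as described earlier in this thread), and the final comparison mimics the condition reported as "groups don't span domain->span".

```python
# Node-level sketch; with two CPUs per node, node n stands for CPUs 2n..2n+1.
# Distance table copied from this mail.
DIST = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]


def span(node, max_dist):
    """Nodes reachable from `node` within `max_dist` (hypothetical helper)."""
    return frozenset(j for j, d in enumerate(DIST[node]) if d <= max_dist)


def build_groups(node, level_dist, child_dist):
    """Build groups for `node`'s domain at `level_dist` (assumed procedure).

    Each group is the child-level span of the next uncovered sibling;
    construction stops as soon as the domain span is covered.
    """
    domain_span = span(node, level_dist)
    covered = set()
    groups = []
    n = len(DIST)
    for k in range(n):
        sibling = (node + k) % n  # walk siblings starting from `node`
        if sibling not in domain_span or sibling in covered:
            continue
        group = span(sibling, child_dist)
        groups.append(group)
        covered |= group
        if domain_span <= covered:
            break
    return domain_span, groups


def report(node, level_dist, child_dist):
    """Summarise span, groups, span match and overlapping nodes."""
    domain_span, groups = build_groups(node, level_dist, child_dist)
    union = frozenset().union(*groups)
    overlap = sorted(
        x for x in union if sum(x in g for g in groups) > 1
    )
    return {
        "span": sorted(domain_span),
        "groups": [sorted(g) for g in groups],
        "spans_match": union == domain_span,
        "overlap": overlap,
    }


domain2 = report(0, level_dist=20, child_dist=12)  # CPU0's domain-2
domain3 = report(0, level_dist=22, child_dist=20)  # CPU0's domain-3
print(domain2)
print(domain3)
```

Under these assumptions, domain-2 gets groups {node0-1} and {node2-3} while its span is only node0-2, and domain-3 gets {node0-2} and {node2-3}, which cover the span but overlap on node2 and differ in size.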
The reason I did ask is because when you get one large and a bunch of smaller groups, the load-balancing 'pull' is relatively smaller to the large groups.
That is, IIRC should_we_balance() ensures only 1 CPU out of the group continues the load-balancing pass. So if, for example, we have one group of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull 1/2 times, while the group of 4 CPUs will pull 1/4 times.
By making sure all groups are of the same level, and thus of equal size, this doesn't happen.
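
As a small illustration of the arithmetic above, assuming exactly one CPU per group gets through should_we_balance() on each balancing pass (per_cpu_pull_share() is a name made up for this sketch, and the group sizes are example values):

```python
from fractions import Fraction


def per_cpu_pull_share(group_sizes):
    """Share of balancing passes each CPU in a group gets to run.

    Assumption: one balancing CPU per group per pass, so a CPU in a
    group of `size` CPUs pulls on 1/size of the passes.
    """
    return [Fraction(1, size) for size in group_sizes]


unequal = per_cpu_pull_share([4, 2])   # one group of 4 CPUs, one of 2
equal = per_cpu_pull_share([2, 2, 2])  # three groups of equal size
print(unequal, equal)
```

With equal-sized groups every CPU ends up with the same share, which is the property that same-level groups would give.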
As you can see, even if we give all groups of domain-2 equal size by breaking up both local_group and remote_groups, we will get to the same problem in domain-3. And what's more tricky is that domain-3 has no problem of "groups don't span domain->span". So it seems we would need to change both domain-2 and domain-3, even though domain-3 has no issue of "groups don't span domain->span".

Thanks
Barry