-----Original Message----- From: Valentin Schneider [mailto:valentin.schneider@arm.com] Sent: Wednesday, February 3, 2021 4:17 AM To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Meelis Roos mroos@linux.ee Subject: Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
On 01/02/21 16:38, Barry Song wrote:
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain
*sd,
build_balance_mask(sd, sg, mask); cpu = cpumask_first_and(sched_group_span(sg), mask);
/*
* for the group generated by grandchild, use the sgc of 2nd cpu
* because the 1st cpu might be used by another sched_group
*/
if (from_grandchild && cpumask_weight(mask) > 1)
cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
So you are getting a (hopefully) unique ID for this group span at this given topology level (i.e. sd->private) but as I had stated in that list of issues, this creates an sgc that isn't attached to the local group of any sched_domain, and thus won't get its capacity values updated.
This can actually be seen via the capacity values you're getting at build time:
[ 0.868907] CPU0 attaching sched-domain(s):
...
[ 0.869542] domain-2: span=0-5 level=NUMA [ 0.869559] groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
^^^^^^^^^^^^^^^^
[ 0.871177] CPU4 attaching sched-domain(s):
...
[ 0.871200] groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 } [ 0.871243] domain-1: span=4-7 level=NUMA [ 0.871257] groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
^^^^^^^^^^^^^^^^
Yes. I could see this issue. We could hack update_group_capacity to let it scan both local_group and sched_group generated by grandchild, but it seems your edit is much better.
IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc of CPU4-domain1-group4. I've done that in the below diff - this gives us groups with sgc's owned at lower topology levels, but this will only ever be true for non-local groups. This has the added benefit of working with single-CPU nodes. Briefly tested on your topology and the sunfire's (via QEMU), and I didn't get screamed at.
Before the fun police comes and impounds my keyboard, I'd like to point out that we could leverage this cross-level sgc referencing hack to further change the NUMA domains and pretty much get rid of overlapping groups (that's what I was fumbling with in [1]).
That is, rather than building overlapping groups and fixing them whenever that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc cross-referencing.
Thoughts?
I guess the original purpose of overlapping groups is creating as few groups as possible. If we totally remove overlapping groups, it seems we will create much more groups? For example, while node0 begins to build sched_domain for distance 20, it will add node2, since the distance between node2 and node3 is 15, so while node2 is added, node3 is also added as node2's lower domain has covered node3. So we need two groups only for node0's sched_domain of distance level 20. +-------+ +--------+ | | 15 | | | node0+----------------+ | node1 | | | | | +----+--+ XXX--------+ | XXX | XX 20 | 15 XX | XXX | X XXX +----+----XXX +-------+ | | 15 | node3| | node2 +-----------------+ | | | +-------+ +---------+
If we remove overlapping group, we will add a group for node2, another group for node3. Then we get three groups.
I am not sure if it is always positive for performance.
--->8--- diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index b748999c9e11..ef43abb6b1fb 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
static void init_overlap_sched_group(struct sched_domain *sd, struct sched_group *sg,
int from_grandchild)
struct sched_domain *grandchild)
{ struct cpumask *mask = sched_domains_tmpmask2;
- struct sd_data *sdd = sd->private;
struct sd_data *sdd = grandchild ? grandchild->private : sd->private; struct cpumask *sg_span; int cpu;
build_balance_mask(sd, sg, mask); cpu = cpumask_first_and(sched_group_span(sg), mask);
/*
* for the group generated by grandchild, use the sgc of 2nd cpu
* because the 1st cpu might be used by another sched_group
*/
if (from_grandchild && cpumask_weight(mask) > 1)
cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu); if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
for_each_cpu_wrap(i, span, cpu) { struct cpumask *sg_span;
int from_grandchild = 0;
bool from_grandchild = false;
if (cpumask_test_cpu(i, covered)) continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) !cpumask_subset(sched_domain_span(sibling->child), span)) { sibling = sibling->child;
from_grandchild = 1;
from_grandchild = true;
}
sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) sg_span = sched_group_span(sg); cpumask_or(covered, covered, sg_span);
init_overlap_sched_group(sd, sg, from_grandchild);
init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
Nice edit! Will merge your edit into v1 and send v2.
if (!first) first = sg;
Thanks Barry