Hi folks,
My keyboard still hasn't been taken away from me, so I went down and did what I was rambling about in [1].
The preemptive degeneration check is the worst wart, but I think I'll have to sleep on it before I can figure out a better way.
Something tells me the increase in the number of groups isn't going to make me any friends, so I've included 3/2, which is Barry's approach (with the subset check condensed into a single spot).
Cheers, Valentin
[1]: http://lore.kernel.org/r/jhj4kiu4hz8.mognet@arm.com
Valentin Schneider (2):
  sched/topology: Get rid of NUMA overlapping groups
  Revert "sched/topology: Warn when NUMA diameter > 2"
 kernel/sched/topology.c | 77 +++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 34 deletions(-)
-- 2.27.0
As pointed out in commit
b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")
overlapping groups result in broken topology data structures whenever the underlying system has a NUMA diameter greater than 2. This stems from overlapping groups being built from sibling domains' spans, yielding bogus transitivity relations such as:
distance(A, B) <= 30 && distance(B, C) <= 20 => distance(A, C) <= 30
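To make the diameter condition concrete, here is a small illustrative Python sketch (plain toy code, not anything from the kernel; the function name is made up). It counts how many distance levels separate the two furthest nodes of a distance matrix, which is the "NUMA diameter" the commit message refers to:

```python
def numa_diameter(dist):
    """NUMA diameter: number of distance levels between the two furthest nodes."""
    n = len(dist)
    # Distinct distance values, smallest (local) first.
    levels = sorted({dist[i][j] for i in range(n) for j in range(n)})
    # The worst-case pair sits at the highest distance level.
    return max(levels.index(dist[i][j]) for i in range(n) for j in range(n))

# Diameter-2 topology (two fully-connected clusters, Epyc-like): fine today.
epyc = [[10, 16, 32],
        [16, 10, 16],
        [32, 16, 10]]

# Diameter-3 chain 0 -- 1 -- 2 -- 3: where the transitivity assumption breaks.
chain = [[10, 20, 30, 40],
         [20, 10, 20, 30],
         [30, 20, 10, 20],
         [40, 30, 20, 10]]

print(numa_diameter(epyc))   # 2
print(numa_diameter(chain))  # 3
```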
As discussed with Barry, a feasible approach is to catch bogus overlapping groups and fix them after the fact [1].
A more proactive approach would be to prevent the aforementioned bogus relations from being built altogether, which implies departing from the "group span is sibling domain child's span" strategy. Said strategy only works for diameter <= 2, which, fortunately or unfortunately, is currently the most common case.
The chosen approach is, for NUMA domains:
a) have the local group be the child domain's span, as before
b) have all remote groups span only their respective node
This boils down to getting rid of overlapping groups.
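Rules a) and b) can be sketched in a few lines. This is an illustrative Python model only (hypothetical helper name, not the kernel implementation): for one CPU's NUMA domain, the local group keeps the child domain's span, and every remote node gets its own single-node group:

```python
def build_numa_groups(domain_span, child_span):
    """Group spans for one CPU's NUMA domain under rules a) and b) above."""
    groups = [sorted(child_span)]      # a) local group: child domain's span
    for node in sorted(set(domain_span) - set(child_span)):
        groups.append([node])          # b) remote groups: one node each
    return groups

# Node 0's top-level NUMA domain on an 8-node Epyc-like machine:
# the child domain spans nodes 0-3, the domain itself spans 0-7.
print(build_numa_groups(range(8), range(4)))
# [[0, 1, 2, 3], [4], [5], [6], [7]]
```

This matches the domain-1 group lists in the "with the patch" dmesg output further down.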
Note that b) requires introducing cross sched_domain_topology_level references for sched_group_capacity. This is a somewhat prickly matter as we need to ensure whichever group we hook into won't see its domain degenerated (which was never an issue when such references were bounded within a single topology level).
This lifts the NUMA diameter restriction, although it yields more groups in the NUMA domains. As an example, here is the distance matrix for an AMD Epyc:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10
Emulating this on QEMU yields, before the patch:

[ 0.386745] CPU0 attaching sched-domain(s):
[ 0.386969] domain-0: span=0-3 level=NUMA
[ 0.387708] groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }
[ 0.388505] domain-1: span=0-7 level=NUMA
[ 0.388700] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.389861] CPU1 attaching sched-domain(s):
[ 0.390020] domain-0: span=0-3 level=NUMA
[ 0.390200] groups: 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }
[ 0.390701] domain-1: span=0-7 level=NUMA
[ 0.390874] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.391460] CPU2 attaching sched-domain(s):
[ 0.391664] domain-0: span=0-3 level=NUMA
[ 0.392750] groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }
[ 0.393672] domain-1: span=0-7 level=NUMA
[ 0.393961] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.394645] CPU3 attaching sched-domain(s):
[ 0.394792] domain-0: span=0-3 level=NUMA
[ 0.394961] groups: 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }
[ 0.395749] domain-1: span=0-7 level=NUMA
[ 0.396098] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.396455] CPU4 attaching sched-domain(s):
[ 0.396603] domain-0: span=4-7 level=NUMA
[ 0.396771] groups: 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }
[ 0.397274] domain-1: span=0-7 level=NUMA
[ 0.397454] groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 }
[ 0.397801] CPU5 attaching sched-domain(s):
[ 0.397945] domain-0: span=4-7 level=NUMA
[ 0.398110] groups: 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }
[ 0.398605] domain-1: span=0-7 level=NUMA
[ 0.398773] groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 }
[ 0.399109] CPU6 attaching sched-domain(s):
[ 0.399253] domain-0: span=4-7 level=NUMA
[ 0.399418] groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }
[ 0.400562] domain-1: span=0-7 level=NUMA
[ 0.400741] groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 }
[ 0.401083] CPU7 attaching sched-domain(s):
[ 0.401231] domain-0: span=4-7 level=NUMA
[ 0.401395] groups: 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1004 }, 5:{ span=5 cap=1007 }, 6:{ span=6 cap=1003 }
[ 0.401906] domain-1: span=0-7 level=NUMA
[ 0.402076] groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 }
[ 0.402437] root domain span: 0-7 (max cpu_capacity = 1024)
with the patch:

[ 0.367436] CPU0 attaching sched-domain(s):
[ 0.368064] domain-0: span=0-3 level=NUMA
[ 0.368614] groups: 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }
[ 0.369490] domain-1: span=0-7 level=NUMA
[ 0.369682] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.371132] CPU1 attaching sched-domain(s):
[ 0.371290] domain-0: span=0-3 level=NUMA
[ 0.371462] groups: 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }
[ 0.372720] domain-1: span=0-7 level=NUMA
[ 0.372906] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.373678] CPU2 attaching sched-domain(s):
[ 0.373833] domain-0: span=0-3 level=NUMA
[ 0.374006] groups: 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }
[ 0.374516] domain-1: span=0-7 level=NUMA
[ 0.374689] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.375337] CPU3 attaching sched-domain(s):
[ 0.375491] domain-0: span=0-3 level=NUMA
[ 0.375666] groups: 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }
[ 0.376639] domain-1: span=0-7 level=NUMA
[ 0.376818] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.377465] CPU4 attaching sched-domain(s):
[ 0.377616] domain-0: span=4-7 level=NUMA
[ 0.377844] groups: 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.378445] domain-1: span=0-7 level=NUMA
[ 0.378622] groups: 4:{ span=4-7 cap=3995 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }
[ 0.379296] CPU5 attaching sched-domain(s):
[ 0.379453] domain-0: span=4-7 level=NUMA
[ 0.379629] groups: 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=991 }
[ 0.380499] domain-1: span=0-7 level=NUMA
[ 0.380800] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=998 }
[ 0.381475] CPU6 attaching sched-domain(s):
[ 0.381641] domain-0: span=4-7 level=NUMA
[ 0.381882] groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }
[ 0.382419] domain-1: span=0-7 level=NUMA
[ 0.382594] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 }
[ 0.383253] CPU7 attaching sched-domain(s):
[ 0.383407] domain-0: span=4-7 level=NUMA
[ 0.383584] groups: 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }
[ 0.384089] domain-1: span=0-7 level=NUMA
[ 0.384516] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 }
[ 0.385503] root domain span: 0-7 (max cpu_capacity = 1024)
IOW, this more than doubles the number of groups at higher domains; the impact of that has yet to be measured.
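For the 8-node Epyc above, the group-count growth is simple arithmetic (nothing kernel-specific, just a back-of-the-envelope check):

```python
# 8 nodes in 2 clusters of 4; groups in each CPU's top-level NUMA domain.
nodes, cluster = 8, 4
before = nodes // cluster       # old scheme: one group per cluster
after = 1 + (nodes - cluster)   # new scheme: local group + one per remote node
print(before, after)            # 2 5
```

i.e. 2 groups become 5, consistent with the dmesg output above.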
XXX: this only changes the group generation; actually removing SD_OVERLAP and all the related fluff is yet to be done.
[1]: http://lore.kernel.org/r/20210201033830.15040-1-song.bao.hua@hisilicon.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 44 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09d35044bd88..a8f69f234258 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -982,6 +982,41 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
+static struct sched_domain *find_node_domain(struct sched_domain *sd)
+{
+	struct sched_domain *parent;
+
+	BUG_ON(!(sd->flags & SD_NUMA));
+
+	/* Get to the level above NODE */
+	while (sd && sd->child) {
+		parent = sd;
+		sd = sd->child;
+
+		if (!(sd->flags & SD_NUMA))
+			break;
+	}
+	/*
+	 * We're going to create cross topology level sched_group_capacity
+	 * references. This can only work if the domains resulting from said
+	 * levels won't be degenerated, as we need said sgc to be periodically
+	 * updated: it needs to be attached to the local group of a domain
+	 * that didn't get degenerated.
+	 *
+	 * Of course, groups aren't available yet, so we can't call the usual
+	 * sd_degenerate(). Checking domain spans is the closest we get.
+	 * Start from NODE's parent, and keep going up until we get a domain
+	 * we're sure won't be degenerated.
+	 */
+	while (sd->parent &&
+	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
+		sd = parent;
+		parent = sd->parent;
+	}
+
+	return parent;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -1015,6 +1050,13 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
+		/*
+		 * Local group is child domain's span, as is tradition.
+		 * Non-local groups will only span remote nodes.
+		 */
+		if (first)
+			sibling = find_node_domain(sibling);
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1064,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sibling, sg);
 		if (!first)
 			first = sg;
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Thursday, February 4, 2021 4:55 AM
To: linux-kernel@vger.kernel.org
Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Meelis Roos mroos@linux.ee
Subject: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
As pointed out in commit
b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")
overlapping groups result in broken topology data structures whenever the underlying system has a NUMA diameter greater than 2. This stems from overlapping groups being built from sibling domains' spans, yielding bogus transitivity relations such as:
distance(A, B) <= 30 && distance(B, C) <= 20 => distance(A, C) <= 30
As discussed with Barry, a feasible approach is to catch bogus overlapping groups and fix them after the fact [1].
A more proactive approach would be to prevent the aforementioned bogus relations from being built altogether, which implies departing from the "group span is sibling domain child's span" strategy. Said strategy only works for diameter <= 2, which, fortunately or unfortunately, is currently the most common case.
The chosen approach is, for NUMA domains:
a) have the local group be the child domain's span, as before
b) have all remote groups span only their respective node
This boils down to getting rid of overlapping groups.
Hi Valentin,
While I like your approach, it will require more time to evaluate its possible influence, as it also affects machines without the 3-hop issue. So x86 platforms need to be tested and benchmarks are required.
How about we first finish the review of the "grandchild" approach v2 and get a solution for kunpeng920 and Sun Fire X4600-M2, while not impacting machines that don't have the 3-hop issue?
I would appreciate it very much if you could comment on v2: https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon.c...
Note that b) requires introducing cross sched_domain_topology_level references for sched_group_capacity. This is a somewhat prickly matter as we need to ensure whichever group we hook into won't see its domain degenerated (which was never an issue when such references were bounded within a single topology level).
This lifts the NUMA diameter restriction, although yields more groups in the NUMA domains. As an example, here is the distance matrix for an AMD Epyc:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10
Emulating this on QEMU yields, before the patch:

[ 0.386745] CPU0 attaching sched-domain(s):
[ 0.386969] domain-0: span=0-3 level=NUMA
[ 0.387708] groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }
[ 0.388505] domain-1: span=0-7 level=NUMA
[ 0.388700] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.389861] CPU1 attaching sched-domain(s):
[ 0.390020] domain-0: span=0-3 level=NUMA
[ 0.390200] groups: 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }
[ 0.390701] domain-1: span=0-7 level=NUMA
[ 0.390874] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.391460] CPU2 attaching sched-domain(s):
[ 0.391664] domain-0: span=0-3 level=NUMA
[ 0.392750] groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }
[ 0.393672] domain-1: span=0-7 level=NUMA
[ 0.393961] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.394645] CPU3 attaching sched-domain(s):
[ 0.394792] domain-0: span=0-3 level=NUMA
[ 0.394961] groups: 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }
[ 0.395749] domain-1: span=0-7 level=NUMA
[ 0.396098] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
[ 0.396455] CPU4 attaching sched-domain(s):
[ 0.396603] domain-0: span=4-7 level=NUMA
[ 0.396771] groups: 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }
[ 0.397274] domain-1: span=0-7 level=NUMA
[ 0.397454] groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 }
[ 0.397801] CPU5 attaching sched-domain(s):
[ 0.397945] domain-0: span=4-7 level=NUMA
[ 0.398110] groups: 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }
[ 0.398605] domain-1: span=0-7 level=NUMA
[ 0.398773] groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 }
[ 0.399109] CPU6 attaching sched-domain(s):
[ 0.399253] domain-0: span=4-7 level=NUMA
[ 0.399418] groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }
[ 0.400562] domain-1: span=0-7 level=NUMA
[ 0.400741] groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 }
[ 0.401083] CPU7 attaching sched-domain(s):
[ 0.401231] domain-0: span=4-7 level=NUMA
[ 0.401395] groups: 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1004 }, 5:{ span=5 cap=1007 }, 6:{ span=6 cap=1003 }
[ 0.401906] domain-1: span=0-7 level=NUMA
[ 0.402076] groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 }
[ 0.402437] root domain span: 0-7 (max cpu_capacity = 1024)
with the patch:

[ 0.367436] CPU0 attaching sched-domain(s):
[ 0.368064] domain-0: span=0-3 level=NUMA
[ 0.368614] groups: 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }
[ 0.369490] domain-1: span=0-7 level=NUMA
[ 0.369682] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.371132] CPU1 attaching sched-domain(s):
[ 0.371290] domain-0: span=0-3 level=NUMA
[ 0.371462] groups: 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }
[ 0.372720] domain-1: span=0-7 level=NUMA
[ 0.372906] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.373678] CPU2 attaching sched-domain(s):
[ 0.373833] domain-0: span=0-3 level=NUMA
[ 0.374006] groups: 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }
[ 0.374516] domain-1: span=0-7 level=NUMA
[ 0.374689] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.375337] CPU3 attaching sched-domain(s):
[ 0.375491] domain-0: span=0-3 level=NUMA
[ 0.375666] groups: 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }
[ 0.376639] domain-1: span=0-7 level=NUMA
[ 0.376818] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.377465] CPU4 attaching sched-domain(s):
[ 0.377616] domain-0: span=4-7 level=NUMA
[ 0.377844] groups: 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }
[ 0.378445] domain-1: span=0-7 level=NUMA
[ 0.378622] groups: 4:{ span=4-7 cap=3995 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }
[ 0.379296] CPU5 attaching sched-domain(s):
[ 0.379453] domain-0: span=4-7 level=NUMA
[ 0.379629] groups: 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=991 }
[ 0.380499] domain-1: span=0-7 level=NUMA
[ 0.380800] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=998 }
[ 0.381475] CPU6 attaching sched-domain(s):
[ 0.381641] domain-0: span=4-7 level=NUMA
[ 0.381882] groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }
[ 0.382419] domain-1: span=0-7 level=NUMA
[ 0.382594] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 }
[ 0.383253] CPU7 attaching sched-domain(s):
[ 0.383407] domain-0: span=4-7 level=NUMA
[ 0.383584] groups: 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }
[ 0.384089] domain-1: span=0-7 level=NUMA
[ 0.384516] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 }
[ 0.385503] root domain span: 0-7 (max cpu_capacity = 1024)
IOW, this more than doubles the number of groups at higher domains; the impact of that has yet to be measured.
XXX: this only changes the group generation; actually removing SD_OVERLAP and all the related fluff is yet to be done.
Signed-off-by: Valentin Schneider valentin.schneider@arm.com
 kernel/sched/topology.c | 44 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09d35044bd88..a8f69f234258 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -982,6 +982,41 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
+static struct sched_domain *find_node_domain(struct sched_domain *sd)
+{
+	struct sched_domain *parent;
+
+	BUG_ON(!(sd->flags & SD_NUMA));
+
+	/* Get to the level above NODE */
+	while (sd && sd->child) {
+		parent = sd;
+		sd = sd->child;
+
+		if (!(sd->flags & SD_NUMA))
+			break;
+	}
+	/*
+	 * We're going to create cross topology level sched_group_capacity
+	 * references. This can only work if the domains resulting from said
+	 * levels won't be degenerated, as we need said sgc to be periodically
+	 * updated: it needs to be attached to the local group of a domain
+	 * that didn't get degenerated.
+	 *
+	 * Of course, groups aren't available yet, so we can't call the usual
+	 * sd_degenerate(). Checking domain spans is the closest we get.
+	 * Start from NODE's parent, and keep going up until we get a domain
+	 * we're sure won't be degenerated.
+	 */
+	while (sd->parent &&
+	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
+		sd = parent;
+		parent = sd->parent;
+	}
So this is because a sched_domain which doesn't contribute to scheduling will be destroyed during cpu_attach_domain(), since sd and parent span the same mask?
+
+	return parent;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -1015,6 +1050,13 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
+		/*
+		 * Local group is child domain's span, as is tradition.
+		 * Non-local groups will only span remote nodes.
+		 */
+		if (first)
+			sibling = find_node_domain(sibling);
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1064,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sibling, sg);
 		if (!first)
 			first = sg;
-- 2.27.0
Thanks Barry
Hi Barry,
On 08/02/21 10:04, Song Bao Hua (Barry Song) wrote:
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Hi Valentin,
While I like your approach, it will require more time to evaluate its possible influence, as it also affects machines without the 3-hop issue. So x86 platforms need to be tested and benchmarks are required.
How about we first finish the review of the "grandchild" approach v2 and get a solution for kunpeng920 and Sun Fire X4600-M2, while not impacting machines that don't have the 3-hop issue?
I figured I'd toss this out while the iron was hot (and I had the topology crud paged in), but I ultimately agree that it's better to first go with something that fixes the diameter > 2 topologies and leaves the other ones untouched, which is exactly what you have.
I would appreciate it very much if you could comment on v2: https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon.c...
See my comment below on domain degeneration; with that taken care of I would say it's good to go. Have a look at what patch1+patch3 squashed together looks like, passing the right sd to init_overlap_sched_group() looks a bit neater IMO.
+static struct sched_domain *find_node_domain(struct sched_domain *sd)
+{
+	struct sched_domain *parent;
+
+	BUG_ON(!(sd->flags & SD_NUMA));
+
+	/* Get to the level above NODE */
+	while (sd && sd->child) {
+		parent = sd;
+		sd = sd->child;
+
+		if (!(sd->flags & SD_NUMA))
+			break;
+	}
+	/*
+	 * We're going to create cross topology level sched_group_capacity
+	 * references. This can only work if the domains resulting from said
+	 * levels won't be degenerated, as we need said sgc to be periodically
+	 * updated: it needs to be attached to the local group of a domain
+	 * that didn't get degenerated.
+	 *
+	 * Of course, groups aren't available yet, so we can't call the usual
+	 * sd_degenerate(). Checking domain spans is the closest we get.
+	 * Start from NODE's parent, and keep going up until we get a domain
+	 * we're sure won't be degenerated.
+	 */
+	while (sd->parent &&
+	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
+		sd = parent;
+		parent = sd->parent;
+	}
So this is because a sched_domain which doesn't contribute to scheduling will be destroyed during cpu_attach_domain(), since sd and parent span the same mask?
Yes; let's take your topology for instance:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
   2      10      2
0 <---> 1 <---> 2 <---> 3
Domains for node1 will look like (before any fixes are applied):
NUMA<=10: span=1   groups=(1)
NUMA<=12: span=0-1 groups=(1)->(0)
NUMA<=20: span=0-1 groups=(0,1)
NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)
As you can see, the domain representing distance <= 20 will be degenerated (it has a single group). If we were to e.g. add some more nodes to the left of node0, then we would trigger the "grandchildren logic" for node1 and would end up creating a reference to node1 NUMA<=20's sgc, which is a mistake: that domain will be degenerated, and that sgc will never be updated. The right thing to do here would be reference node1 NUMA<=12's sgc, which the above snippet does.
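The span-equality walk can be modelled in a few lines of Python (hypothetical data structures, not kernel code): a NUMA level whose span equals its child's span has only the local group and will be degenerated, so the sgc reference has to step down to the first level whose span actually differs:

```python
# Node1's NUMA levels from the example above, innermost first: (name, span).
levels = [
    ("NUMA<=10", {1}),
    ("NUMA<=12", {0, 1}),
    ("NUMA<=20", {0, 1}),      # same span as its child -> will be degenerated
    ("NUMA<=22", {0, 1, 2}),
    ("NUMA<=24", {0, 1, 2, 3}),
]

def pick_sgc_level(levels, i):
    """Walk down from candidate level i until the span differs from the child's."""
    while i > 0 and levels[i][1] == levels[i - 1][1]:
        i -= 1  # level i will be degenerated: its sgc would never be updated
    return levels[i][0]

print(pick_sgc_level(levels, 2))  # NUMA<=12
```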
+
+	return parent;
+}
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Tuesday, February 9, 2021 12:48 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; linux-kernel@vger.kernel.org
Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Meelis Roos mroos@linux.ee
Subject: RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
Hi Barry,
On 08/02/21 10:04, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Hi Valentin,
While I like your approach, it will require more time to evaluate its possible influence, as it also affects machines without the 3-hop issue. So x86 platforms need to be tested and benchmarks are required.
How about we first finish the review of the "grandchild" approach v2 and get a solution for kunpeng920 and Sun Fire X4600-M2, while not impacting machines that don't have the 3-hop issue?
I figured I'd toss this out while the iron was hot (and I had the topology crud paged in), but I ultimately agree that it's better to first go with something that fixes the diameter > 2 topologies and leaves the other ones untouched, which is exactly what you have.
I would appreciate it very much if you could comment on v2:
https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon.com/
See my comment below on domain degeneration; with that taken care of I would say it's good to go. Have a look at what patch1+patch3 squashed together looks like, passing the right sd to init_overlap_sched_group() looks a bit neater IMO.
+static struct sched_domain *find_node_domain(struct sched_domain *sd)
+{
+	struct sched_domain *parent;
+
+	BUG_ON(!(sd->flags & SD_NUMA));
+
+	/* Get to the level above NODE */
+	while (sd && sd->child) {
+		parent = sd;
+		sd = sd->child;
+
+		if (!(sd->flags & SD_NUMA))
+			break;
+	}
+	/*
+	 * We're going to create cross topology level sched_group_capacity
+	 * references. This can only work if the domains resulting from said
+	 * levels won't be degenerated, as we need said sgc to be periodically
+	 * updated: it needs to be attached to the local group of a domain
+	 * that didn't get degenerated.
+	 *
+	 * Of course, groups aren't available yet, so we can't call the usual
+	 * sd_degenerate(). Checking domain spans is the closest we get.
+	 * Start from NODE's parent, and keep going up until we get a domain
+	 * we're sure won't be degenerated.
+	 */
+	while (sd->parent &&
+	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
+		sd = parent;
+		parent = sd->parent;
+	}
So this is because a sched_domain which doesn't contribute to scheduling will be destroyed during cpu_attach_domain(), since sd and parent span the same mask?
Yes; let's take your topology for instance:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
2 10 2
0 <---> 1 <---> 2 <---> 3
Guess you actually mean
   2      10      2
1 <---> 0 <---> 2 <---> 3
Domains for node1 will look like (before any fixes are applied):
NUMA<=10: span=1   groups=(1)
NUMA<=12: span=0-1 groups=(1)->(0)
NUMA<=20: span=0-1 groups=(0,1)
NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)
As you can see, the domain representing distance <= 20 will be degenerated (it has a single group). If we were to e.g. add some more nodes to the left of node0, then we would trigger the "grandchildren logic" for node1 and would end up creating a reference to node1 NUMA<=20's sgc, which is a mistake: that domain will be degenerated, and that sgc will never be updated. The right thing to do here would be reference node1 NUMA<=12's sgc, which the above snippet does.
Guess I got your point even though the diagram is not correct :-)
If the topology is as below(add a node left to node1 rather than node0):
   9       2      10      2
A <---> 1 <---> 0 <---> 2 <---> 3
For nodeA,
NUMA<=10: span=A     groups=(A)
NUMA<=12: span=A     groups=(A)
NUMA<=19: span=A-1   groups=(A),(1)
NUMA<=20: span=A-1   groups=(A,1)                   *1
NUMA<=21: span=A-1-0 groups=(A,1), node1's numa<=20
For node0,
NUMA<=10: span=0     groups=(0)         #3
NUMA<=12: span=0-1   groups=(0)->(1)    #2
NUMA<=19: span=0-1   groups=(0,1)       #1
NUMA<=20: span=0-1-2 groups=(0,1),....
*1 will first try #1, and finding that 2 is outside A-1-0, it will then try #2. But #2 will be degenerated, so we should actually use #3. Amazing!
+
+	return parent;
+}
Thanks Barry
On 09/02/21 00:12, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Yes; let's take your topology for instance:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
2 10 2
0 <---> 1 <---> 2 <---> 3
Guess you actually mean
   2      10      2
1 <---> 0 <---> 2 <---> 3
Yeah, you're right, sorry about that!
Domains for node1 will look like (before any fixes are applied):
NUMA<=10: span=1   groups=(1)
NUMA<=12: span=0-1 groups=(1)->(0)
NUMA<=20: span=0-1 groups=(0,1)
NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)
As you can see, the domain representing distance <= 20 will be degenerated (it has a single group). If we were to e.g. add some more nodes to the left of node0, then we would trigger the "grandchildren logic" for node1 and would end up creating a reference to node1 NUMA<=20's sgc, which is a mistake: that domain will be degenerated, and that sgc will never be updated. The right thing to do here would be reference node1 NUMA<=12's sgc, which the above snippet does.
Guess I got your point even though the diagram is not correct :-)
Good!
If the topology is as below(add a node left to node1 rather than node0):
9 2 10 2
A <---> 1 <---> 0 <---> 2 <---> 3
For nodeA,
NUMA<=10: span=A     groups=(A)
NUMA<=12: span=A     groups=(A)
NUMA<=19: span=A-1   groups=(A),(1)
NUMA<=20: span=A-1   groups=(A,1)                   *1
NUMA<=21: span=A-1-0 groups=(A,1), node1's numa<=20
For node0,
NUMA<=10: span=0     groups=(0)         #3
NUMA<=12: span=0-1   groups=(0)->(1)    #2
NUMA<=19: span=0-1   groups=(0,1)       #1
NUMA<=20: span=0-1-2 groups=(0,1),....
*1 will first try #1, and finding that 2 is outside A-1-0, it will then try #2. But #2 will be degenerated, so we should actually use #3. Amazing!
Bingo!
+
+	return parent;
+}
Thanks Barry
The scheduler topology code can now figure out what to do with such topologies.
This reverts commit b5b217346de85ed1b03fdecd5c5076b34fbb2f0b.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 33 ---------------------------------
 1 file changed, 33 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a8f69f234258..0fa41aab74e0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -688,7 +688,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_domain *tmp;
-	int numa_distance = 0;
 	/* Remove the sched domains which do not contribute to scheduling. */
 	for (tmp = sd; tmp; ) {
@@ -720,38 +719,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 		sd->child = NULL;
 	}
-	for (tmp = sd; tmp; tmp = tmp->parent)
-		numa_distance += !!(tmp->flags & SD_NUMA);
-
-	/*
-	 * FIXME: Diameter >=3 is misrepresented.
-	 *
-	 * Smallest diameter=3 topology is:
-	 *
-	 *   node   0   1   2   3
-	 *     0:  10  20  30  40
-	 *     1:  20  10  20  30
-	 *     2:  30  20  10  20
-	 *     3:  40  30  20  10
-	 *
-	 *   0 --- 1 --- 2 --- 3
-	 *
-	 * NUMA-3       0-3             N/A             N/A             0-3
-	 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
-	 *
-	 * NUMA-2       0-2             0-3             0-3             1-3
-	 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
-	 *
-	 * NUMA-1       0-1             0-2             1-3             2-3
-	 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
-	 *
-	 * NUMA-0       0               1               2               3
-	 *
-	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
-	 * group span isn't a subset of the domain span.
-	 */
-	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
-
 	sched_domain_debug(sd, cpu);
rq_attach_root(rq, rd);
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Thursday, February 4, 2021 4:55 AM
To: linux-kernel@vger.kernel.org
Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Liguozhu (Kenneth) <liguozhu@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>; wanghuiqiang <wanghuiqiang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; guodong.xu@linaro.org; Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Meelis Roos <mroos@linux.ee>
Subject: [RFC PATCH 2/2] Revert "sched/topology: Warn when NUMA diameter > 2"
The scheduler topology code can now figure out what to do with such topologies.
This reverts commit b5b217346de85ed1b03fdecd5c5076b34fbb2f0b.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Yes, this is fine. I actually have seen some other problems we need to consider.
The current code is probably well consolidated for machines with 2 hops or less. Thus, even after we fix the 3-hops span issue, I can still see some other issues.
For example, if we change the sd flags and remove the SD_BALANCE flags for the last hop in sd_init(), we see a large score increase in unixbench:
	if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance ||
	    is_3rd_hops_domain(...)) {
		sd->flags &= ~(SD_BALANCE_EXEC |
			       SD_BALANCE_FORK |
			       SD_WAKE_AFFINE);
	}
So I guess something needs to be tuned for machines with 3 hops or more.
But we need a kernel with the 3-hops fix in place before we can do more work.
Thanks Barry
This follows Barry's approach, which yields fewer groups in the higher domains.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 42 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 0fa41aab74e0..e1bb97c06f53 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -949,20 +949,21 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
-static struct sched_domain *find_node_domain(struct sched_domain *sd)
+static struct sched_domain *
+find_sibling_domain(struct sched_domain *sd, struct sched_domain *sibling)
 {
-	struct sched_domain *parent;
-
 	BUG_ON(!(sd->flags & SD_NUMA));
-	/* Get to the level above NODE */
-	while (sd && sd->child) {
-		parent = sd;
-		sd = sd->child;
+	/* Find a subset sibling */
+	while (sibling->child &&
+	       !cpumask_subset(sched_domain_span(sibling->child),
+			       sched_domain_span(sd)))
+		sibling = sibling->child;
+
+	/* If the above loop was a no-op, we're done */
+	if (sd->private == sibling->private)
+		return sibling;
-		if (!(sd->flags & SD_NUMA))
-			break;
-	}
 	/*
 	 * We're going to create cross topology level sched_group_capacity
 	 * references. This can only work if the domains resulting from said
@@ -972,16 +973,14 @@ static struct sched_domain *find_node_domain(struct sched_domain *sd)
 	 *
 	 * Of course, groups aren't available yet, so we can't call the usual
 	 * sd_degenerate(). Checking domain spans is the closest we get.
-	 * Start from NODE's parent, and keep going up until we get a domain
-	 * we're sure won't be degenerated.
+	 * We can't go up as per the above subset check, so keep going down
+	 * until we get a domain we're sure won't be degenerated.
 	 */
-	while (sd->parent &&
-	       cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) {
-		sd = parent;
-		parent = sd->parent;
-	}
+	while (sibling->child &&
+	       cpumask_equal(sched_domain_span(sibling->child), sched_domain_span(sibling)))
+		sibling = sibling->child;
-	return parent;
+	return sibling;
 }
 static int
@@ -1017,12 +1016,9 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
-		/*
-		 * Local group is child domain's span, as is tradition.
-		 * Non-local groups will only span remote nodes.
-		 */
+		/* Ensure sibling domain's span is a subset of this domain */
 		if (first)
-			sibling = find_node_domain(sibling);
+			sibling = find_sibling_domain(sd, sibling);
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)