As pointed out in commit
b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")
overlapping groups result in broken topology data structures whenever the underlying system has a NUMA diameter greater than 2. This stems from overlapping groups being built from sibling domain's spans, yielding bogus transitivity relations the like of:
distance(A, B) <= 30 && distance(B, C) <= 20 => distance(A, C) <= 30
As discussed with Barry, a feasible approach is to catch bogus overlapping groups and fix them after the fact [1].
A more proactive approach would be to prevent aforementioned bogus relations from being built altogether, implies departing from the "group span is sibling domain child's span" strategy. Said strategy only works for diameter <= 2, which fortunately or unfortunately is currently the most common case.
The chosen approach is, for NUMA domains: a) have the local group be the child domain's span, as before b) have all remote groups span only their respective node
This boils down to getting rid of overlapping groups.
Note that b) requires introducing cross sched_domain_topology_level references for sched_group_capacity. This is a somewhat prickly matter as we need to ensure whichever group we hook into won't see its domain degenerated (which was never an issue when such references were bounded within a single topology level).
This lifts the NUMA diameter restriction, although yields more groups in the NUMA domains. As an example, here is the distance matrix for an AMD Epyc:
node 0 1 2 3 4 5 6 7 0: 10 16 16 16 32 32 32 32 1: 16 10 16 16 32 32 32 32 2: 16 16 10 16 32 32 32 32 3: 16 16 16 10 32 32 32 32 4: 32 32 32 32 10 16 16 16 5: 32 32 32 32 16 10 16 16 6: 32 32 32 32 16 16 10 16 7: 32 32 32 32 16 16 16 10
Emulating this on QEMU yields, before the patch: [ 0.386745] CPU0 attaching sched-domain(s): [ 0.386969] domain-0: span=0-3 level=NUMA [ 0.387708] groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 } [ 0.388505] domain-1: span=0-7 level=NUMA [ 0.388700] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 } [ 0.389861] CPU1 attaching sched-domain(s): [ 0.390020] domain-0: span=0-3 level=NUMA [ 0.390200] groups: 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 } [ 0.390701] domain-1: span=0-7 level=NUMA [ 0.390874] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 } [ 0.391460] CPU2 attaching sched-domain(s): [ 0.391664] domain-0: span=0-3 level=NUMA [ 0.392750] groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 } [ 0.393672] domain-1: span=0-7 level=NUMA [ 0.393961] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 } [ 0.394645] CPU3 attaching sched-domain(s): [ 0.394792] domain-0: span=0-3 level=NUMA [ 0.394961] groups: 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 } [ 0.395749] domain-1: span=0-7 level=NUMA [ 0.396098] groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 } [ 0.396455] CPU4 attaching sched-domain(s): [ 0.396603] domain-0: span=4-7 level=NUMA [ 0.396771] groups: 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 } [ 0.397274] domain-1: span=0-7 level=NUMA [ 0.397454] groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 } [ 0.397801] CPU5 attaching sched-domain(s): [ 0.397945] domain-0: span=4-7 level=NUMA [ 0.398110] groups: 5:{ span=5 cap=1004 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 } [ 0.398605] domain-1: span=0-7 level=NUMA [ 0.398773] groups: 4:{ span=4-7 cap=4014 }, 0:{ span=0-3 cap=4020 } [ 0.399109] CPU6 attaching sched-domain(s): [ 0.399253] domain-0: span=4-7 level=NUMA [ 0.399418] groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1001 }, 5:{ span=5 cap=1004 } [ 0.400562] domain-1: span=0-7 level=NUMA [ 0.400741] groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 } [ 0.401083] CPU7 attaching sched-domain(s): [ 0.401231] domain-0: span=4-7 level=NUMA [ 0.401395] groups: 7:{ span=7 cap=1006 }, 4:{ span=4 cap=1004 }, 5:{ span=5 cap=1007 }, 6:{ span=6 cap=1003 } [ 0.401906] domain-1: span=0-7 level=NUMA [ 0.402076] groups: 4:{ span=4-7 cap=4020 }, 0:{ span=0-3 cap=4020 } [ 0.402437] root domain span: 0-7 (max cpu_capacity = 1024)
with the patch: [ 0.367436] CPU0 attaching sched-domain(s): [ 0.368064] domain-0: span=0-3 level=NUMA [ 0.368614] groups: 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 } [ 0.369490] domain-1: span=0-7 level=NUMA [ 0.369682] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 } [ 0.371132] CPU1 attaching sched-domain(s): [ 0.371290] domain-0: span=0-3 level=NUMA [ 0.371462] groups: 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 } [ 0.372720] domain-1: span=0-7 level=NUMA [ 0.372906] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 } [ 0.373678] CPU2 attaching sched-domain(s): [ 0.373833] domain-0: span=0-3 level=NUMA [ 0.374006] groups: 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 } [ 0.374516] domain-1: span=0-7 level=NUMA [ 0.374689] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 } [ 0.375337] CPU3 attaching sched-domain(s): [ 0.375491] domain-0: span=0-3 level=NUMA [ 0.375666] groups: 3:{ span=3 cap=993 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1002 }, 2:{ span=2 cap=1000 } [ 0.376639] domain-1: span=0-7 level=NUMA [ 0.376818] groups: 0:{ span=0-3 cap=4007 }, 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 } [ 0.377465] CPU4 attaching sched-domain(s): [ 0.377616] domain-0: span=4-7 level=NUMA [ 0.377844] groups: 4:{ span=4 cap=991 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 } [ 0.378445] domain-1: span=0-7 level=NUMA [ 0.378622] groups: 4:{ span=4-7 cap=3995 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=993 } [ 0.379296] CPU5 attaching sched-domain(s): [ 0.379453] domain-0: span=4-7 level=NUMA [ 0.379629] groups: 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=991 } [ 0.380499] domain-1: span=0-7 level=NUMA [ 0.380800] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1000 }, 3:{ span=3 cap=998 } [ 0.381475] CPU6 attaching sched-domain(s): [ 0.381641] domain-0: span=4-7 level=NUMA [ 0.381882] groups: 6:{ span=6 cap=1003 }, 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 } [ 0.382419] domain-1: span=0-7 level=NUMA [ 0.382594] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 } [ 0.383253] CPU7 attaching sched-domain(s): [ 0.383407] domain-0: span=4-7 level=NUMA [ 0.383584] groups: 7:{ span=7 cap=998 }, 4:{ span=4 cap=997 }, 5:{ span=5 cap=1003 }, 6:{ span=6 cap=1003 } [ 0.384089] domain-1: span=0-7 level=NUMA [ 0.384516] groups: 4:{ span=4-7 cap=4001 }, 0:{ span=0 cap=1012 }, 1:{ span=1 cap=1001 }, 2:{ span=2 cap=1004 }, 3:{ span=3 cap=998 } [ 0.385503] root domain span: 0-7 (max cpu_capacity = 1024)
IOW, this does more than double the number of groups at higher domains, the impact of which has yet to be measured.
XXX: this only changes the group generation; actually removing SD_OVERLAP and all the related fluff is yet to be done.
[1]: http://lore.kernel.org/r/20210201033830.15040-1-song.bao.hua@hisilicon.com
Signed-off-by: Valentin Schneider valentin.schneider@arm.com --- kernel/sched/topology.c | 44 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 09d35044bd88..a8f69f234258 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -982,6 +982,41 @@ static void init_overlap_sched_group(struct sched_domain *sd, sg->sgc->max_capacity = SCHED_CAPACITY_SCALE; }
+static struct sched_domain *find_node_domain(struct sched_domain *sd) +{ + struct sched_domain *parent; + + BUG_ON(!(sd->flags & SD_NUMA)); + + /* Get to the level above NODE */ + while (sd && sd->child) { + parent = sd; + sd = sd->child; + + if (!(sd->flags & SD_NUMA)) + break; + } + /* + * We're going to create cross topology level sched_group_capacity + * references. This can only work if the domains resulting from said + * levels won't be degenerated, as we need said sgc to be periodically + * updated: it needs to be attached to the local group of a domain + * that didn't get degenerated. + * + * Of course, groups aren't available yet, so we can't call the usual + * sd_degenerate(). Checking domain spans is the closest we get. + * Start from NODE's parent, and keep going up until we get a domain + * we're sure won't be degenerated. + */ + while (sd->parent && + cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) { + sd = parent; + parent = sd->parent; + } + + return parent; +} + static int build_overlap_sched_groups(struct sched_domain *sd, int cpu) { @@ -1015,6 +1050,13 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) if (!cpumask_test_cpu(i, sched_domain_span(sibling))) continue;
+ /* + * Local group is child domain's span, as is tradition. + * Non-local groups will only span remote nodes. + */ + if (first) + sibling = find_node_domain(sibling); + sg = build_group_from_child_sched_domain(sibling, cpu); if (!sg) goto fail; @@ -1022,7 +1064,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) sg_span = sched_group_span(sg); cpumask_or(covered, covered, sg_span);
- init_overlap_sched_group(sd, sg); + init_overlap_sched_group(sibling, sg);
if (!first) first = sg;