On 09/02/21 21:21, Barry Song wrote:
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Tested on a bunch of NUMA topologies via QEMU (AMD Epyc, D06, Sunfire) and the results look sane.
Small comment nits below, but regardless:
Reviewed-by: Valentin Schneider valentin.schneider@arm.com
-v3: Mainly updated according to Valentin's comments. While the approach was started by me, Valentin contributed the most useful edits and comments. Thanks, Valentin!
Happy to help, thanks for fixing this!
- fixed a potential issue where the re-used sgc might be located in
  a sched_domain which will be degenerated;
- code cleanup to make it more readable
While Valentin started another approach which completely removed the overlapped sched_groups, we both agree that it is better to first have a solution which won't touch machines without the 3-hops issue: https://lore.kernel.org/lkml/jhjpn1a232z.mognet@arm.com/
I think this doesn't scale properly so I doubt it's ever going to see the light of day, but I had to write it to convince myself of it :/
 kernel/sched/topology.c | 91 +++++++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..ab5ebf17f30a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -982,6 +953,31 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
+static struct sched_domain *
+find_descended_sibling(struct sched_domain *sd, struct sched_domain *sibling)
+{
+	/*
+	 * The proper descendant would be the one whose child won't span out
+	 * of sd
+	 */
+	while (sibling->child &&
+	       !cpumask_subset(sched_domain_span(sibling->child),
+			       sched_domain_span(sd)))
+		sibling = sibling->child;
+
+	/*
+	 * As we are referencing sgc across different topology levels, we need
+	 * to go down to skip those sched_domains which don't contribute to
+	 * scheduling because they will be degenerated in cpu_attach_domain.
+	 * This is important because we must make sure we point to an sgc that
+	 * will be updated via update_group_capacity().
+	 */
+	while (sibling->child &&
+	       cpumask_equal(sched_domain_span(sibling->child),
+			     sched_domain_span(sibling)))
+		sibling = sibling->child;
+
+	return sibling;
+}
+
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
@@ -1015,6 +1011,41 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
+		/*
+		 * Usually we build sched_group by sibling's child sched_domain
+		 * But for machines whose NUMA diameter are 3 or above, we move
+		 * to build sched_group by sibling's proper descendant's child
+		 * domain because sibling's child sched_domain will span out of
+		 * the sched_domain being built as below.
+		 *
+		 * Smallest diameter=3 topology is:
+		 *
+		 *   node  0   1   2   3
+		 *     0:  10  20  30  40
+		 *     1:  20  10  20  30
+		 *     2:  30  20  10  20
+		 *     3:  40  30  20  10
+		 *
+		 *   0 --- 1 --- 2 --- 3
+		 *
+		 * NUMA-3       0-3            N/A            N/A            0-3
+		 *  groups:     {0-2},{1-3}                                  {1-3},{0-2}
+		 *
+		 * NUMA-2       0-2            0-3            0-3            1-3
+		 *  groups:     {0-1},{1-3}    {0-2},{2-3}    {1-3},{0-1}    {2-3},{0-2}
+		 *
+		 * NUMA-1       0-1            0-2            1-3            2-3
+		 *  groups:     {0},{1}        {1},{2},{0}    {2},{3},{1}    {3},{2}
+		 *
+		 * NUMA-0       0              1              2              3
+		 *
+		 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
		                                     ^^^ s/are/would be/
+		 * group span isn't a subset of the domain span.
		              ^^^^^ s/isn't/wouldn't be/
+		 */
+		if (sibling->child &&
+		    !cpumask_subset(sched_domain_span(sibling->child), span))
+			sibling = find_descended_sibling(sd, sibling);
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1053,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sibling, sg);
 
 		if (!first)
 			first = sg;
-- 
2.25.1