[Linuxarm] Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2

26 Jan 2021

...
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Tuesday, January 26, 2021 1:11 AM
To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Vincent Guittot
<vincent.guittot@linaro.org>; Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
<morten.rasmussen@arm.com>; linux-kernel <linux-kernel@vger.kernel.org>;
linuxarm@openeuler.org
Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact
of NUMA diameter > 2

On 25/01/21 03:13, Song Bao Hua (Barry Song) wrote:
...
As long as the NUMA diameter is > 2, building a sched_domain from the
sibling's child domain will definitely create a sched_domain with a
sched_group that spans outside the sched_domain:
               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+
domain0        node0            node1            node2          node3
domain1        node0+1          node0+1          node2+3        node2+3
                                                 +
domain2        node0+1+2                         |
             group: node0+1                      |
               group:node2+3 <-------------------+
When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node 3 is outside
the span of node0+1+2.
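To make the overflow concrete, here is a minimal userspace sketch (not kernel code) that models node sets as bitmasks. The helper names `node_bit`, `nodes_subset` and `nodes_outside` are made up for illustration; only the spans are taken from the diagram above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit for NUMA node n; this toy model holds at most 32 nodes. */
static uint32_t node_bit(unsigned int n)
{
	assert(n < 32);
	return UINT32_C(1) << n;
}

/* Non-zero when every node in 'group' is also in 'span'. */
static int nodes_subset(uint32_t group, uint32_t span)
{
	return (group & ~span) == 0;
}

/* Nodes that are in 'group' but fall outside 'span'. */
static uint32_t nodes_outside(uint32_t group, uint32_t span)
{
	return group & ~span;
}
```

With span = node0+1+2 (node0's domain2) and group = node2+3 (node2's domain1), `nodes_subset(group, span)` is false and `nodes_outside(group, span)` leaves only node 3's bit, which is the node that leaks out of domain2.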
Could we instead use the *child* domain of the *child* domain of node2's
domain2 to build the sched_group?
I mean:
               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+
domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        | node2+3        node2+3
                                               |
domain2        node0+1+2                       |
             group: node0+1                    |
               group:node2 <-------------------+
In this way, it seems we don't have to create a new group as we are just
reusing the existing group?
One thing I've been musing over is pretty much this; that is to say we
would make all non-local NUMA sched_groups span a single node. This would
let us reuse an existing span+sched_group_capacity: the local group of that
node at its first NUMA topology level.
Essentially this means getting rid of the overlapping groups, and the
balance mask is handled the same way as for !NUMA, i.e. it's the local
group span. I've not gone far enough through the thought experiment to see
where it miserably falls apart... It is at the very least violating the
expectation that a group span is a child domain's span - here it can be a
grand^x child domain's span.
If we take your topology, we currently have:
| tl\node | 0            | 1             | 2             | 3            |
|---------+--------------+---------------+---------------+--------------|
| NUMA0   | (0)->(1)     | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)     |
| NUMA1   | (0-1)->(1-3) | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(0-2) |
| NUMA2   | (0-2)->(1-3) | N/A           | N/A           | (1-3)->(0-2) |
With the current overlapping group scheme, we would need to make it look
like so:
| tl\node | 0             | 1             | 2             | 3             |
|---------+---------------+---------------+---------------+---------------|
| NUMA0   | (0)->(1)      | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)      |
| NUMA1   | (0-1)->(1-2)* | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(1-2)* |
| NUMA2   | (0-2)->(1-3)  | N/A           | N/A           | (1-3)->(0-2)  |
But as already discussed, that's tricky to make work. With the node-span
groups thing, we would turn this into:
| tl\node | 0          | 1             | 2             | 3          |
|---------+------------+---------------+---------------+------------|
| NUMA0   | (0)->(1)   | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)   |
| NUMA1   | (0-1)->(2) | (0-2)->(3)    | (1-3)->(0)    | (2-3)->(1) |
| NUMA2   | (0-2)->(3) | N/A           | N/A           | (1-3)->(0) |
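The node-span layout in the last table can be reproduced with a small userspace sketch. This is only a toy model under stated assumptions: node sets are 32-bit masks, `toy_node_span_groups` is a hypothetical name, and remote nodes are emitted in ascending order rather than in the order the kernel would walk them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill 'groups' with the local group (the child domain's span) followed by
 * one single-node group per remaining node of the domain.
 * Returns the number of groups written (at most max_groups).
 */
static size_t toy_node_span_groups(uint32_t domain_span, uint32_t local_span,
				   uint32_t *groups, size_t max_groups)
{
	uint32_t remote = domain_span & ~local_span;
	size_t n = 0;

	assert(groups != NULL && max_groups > 0);
	/* The local group must itself lie inside the domain. */
	assert((local_span & ~domain_span) == 0);

	groups[n++] = local_span;
	for (unsigned int node = 0; node < 32 && n < max_groups; node++) {
		uint32_t bit = UINT32_C(1) << node;

		if (remote & bit)
			groups[n++] = bit;	/* group spanning one node */
	}
	return n;
}
```

For node 0 at NUMA1 (domain span 0-2, local span 0-1) this yields the two groups (0-1) and (2); for node 2 at NUMA1 (domain span 0-3, local span 1-3) it yields (1-3) and (0), as in the table.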
Actually I didn't mean going that far. What I was thinking is that
we only fix a sched_domain when its sched_group isn't a subset of the
sched_domain. Those sched_domains which don't have the group span
issue, we just don't touch. For NUMA1, we change it as in your diagram,
but NUMA2 won't be changed. The concept is like:

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,6 +1040,19 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
                 }
                 sg_span = sched_group_span(sg);
+#if 1
+               if (sibling->child && !cpumask_subset(sg_span, span)) {
+                       sg = build_group_from_child_sched_domain(sibling->child, cpu);
+                       ...
+                       sg_span = sched_group_span(sg);
+               }
+#endif
                cpumask_or(covered, covered, sg_span);
Thanks
Barry
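
The fallback in the hunk above can be modelled in userspace as follows. This is a sketch, not the kernel implementation: `struct toy_domain` and the `toy_*` helpers are hypothetical stand-ins for `struct sched_domain`, `cpumask_subset()` and `build_group_from_child_sched_domain()`, node sets are plain bitmasks, and the part elided by `...` in the hunk is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a sched_domain: a node mask plus a link to its child. */
struct toy_domain {
	uint32_t span;
	const struct toy_domain *child;
};

static int toy_mask_subset(uint32_t a, uint32_t b)
{
	return (a & ~b) == 0;
}

/* A group built from a domain covers the child's span, or its own span
 * when there is no child. */
static uint32_t toy_group_span_from_child(const struct toy_domain *sd)
{
	assert(sd != NULL);
	return sd->child ? sd->child->span : sd->span;
}

/* Pick the group span for 'sibling' inside a domain covering 'span':
 * try the usual child-based group first, and step one level further down
 * only when that group would leak out of the domain. */
static uint32_t toy_pick_group_span(const struct toy_domain *sibling,
				    uint32_t span)
{
	uint32_t sg_span = toy_group_span_from_child(sibling);

	if (sibling->child && !toy_mask_subset(sg_span, span))
		sg_span = toy_group_span_from_child(sibling->child);
	return sg_span;
}
```

Modelling node2 with domain0 = {2}, domain1 = {2,3} and a domain2 on top, and asking for a group inside node0's domain2 span {0,1,2}, the first candidate {2,3} fails the subset test and the fallback yields {2}, matching the second diagram. Domains whose child-based group already fits are left untouched.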

    
