[Linuxarm] Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

25 Jan 2021

...
-----Original Message-----
From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
Sent: Wednesday, January 13, 2021 1:53 AM
To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Morten Rasmussen
<morten.rasmussen@arm.com>; Tim Chen <tim.c.chen@linux.intel.com>
Cc: valentin.schneider@arm.com; catalin.marinas@arm.com; will@kernel.org;
rjw@rjwysocki.net; vincent.guittot@linaro.org; lenb@kernel.org;
gregkh@linuxfoundation.org; Jonathan Cameron <jonathan.cameron@huawei.com>;
mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
rostedt@goodmis.org; bsegall@google.com; mgorman@suse.de;
mark.rutland@arm.com; sudeep.holla@arm.com; aubrey.li@linux.intel.com;
linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org;
linux-acpi@vger.kernel.org; linuxarm@openeuler.org; xuwei (O)
<xuwei5@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; tiantao (H)
<tiantao6@hisilicon.com>
Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
add cluster scheduler
On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
...
...
-----Original Message-----
From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
Sent: Saturday, January 9, 2021 4:13 AM
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
valentin.schneider@arm.com; catalin.marinas@arm.com; will@kernel.org;
rjw@rjwysocki.net; vincent.guittot@linaro.org; lenb@kernel.org;
gregkh@linuxfoundation.org; Jonathan Cameron
<jonathan.cameron@huawei.com>;
...
...
mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com;
mgorman@suse.de; mark.rutland@arm.com; sudeep.holla@arm.com;
aubrey.li@linux.intel.com; linux-arm-kernel@lists.infradead.org;
linux-kernel@vger.kernel.org; linux-acpi@vger.kernel.org;
linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Zengtao (B)
<prime.zeng@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>
Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
and
add cluster scheduler
On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
...
On 1/6/21 12:30 AM, Barry Song wrote:
...
ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data while each cluster
has local L3 tag. On the other hand, each cluster will share some
internal system bus. This means cache is much more affine inside one cluster
than across clusters.
There is a similar need for clustering in x86.  Some x86 cores could share
L2 caches that
is similar to the cluster in Kupeng 920 (e.g. on Jacobsville there are 6
clusters
of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing
L3).
Having a sched domain at the L2 cluster helps spread load among
L2 domains.  This will reduce L2 cache contention and help with
performance for low to moderate load scenarios.
IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
between L2 caches while Barry is after consolidating tasks within the
boundaries of a L3 tag cache. One helps cache utilization, the other
communication latency between tasks. Am I missing something?
Morten, this is not true.
we are both actually looking for the same behavior. My patch also
has done the exact same behavior of spreading with Tim's patch.
That's the case for the load-balance path because of the extra Sched
Domain (SD) (CLS/MC_L2) below MC.
But in wakeup you add code which leads to a different packing strategy.
Yes, but I put a note for the 1st case:
"Case 1. we have two tasks *without* any relationship running in a system
with 2 clusters and 8 cpus"

so for tasks without wake-up relationship, the current patch will only
result in spreading.

Anyway, I will also test Tim's benchmark in kunpeng920 with the SCHED_CLUTER
to see what will happen. Till now, benchmark has only covered the case to
figure out the benefit of changing wake-up path.
I would also be interested in figuring out what we have got from the change
of load_balance().
...
It looks like that Tim's workload (SPECrate mcf) shows a performance
boost solely because of the changes the additional MC_L2 SD introduces
in load balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we
have to carefully distinguish between packing vs. spreading in wakeup
and load-balance here.
...
Considering the below two cases:
Case 1. we have two tasks without any relationship running in a system with
2 clusters and 8 cpus.
Without the sched_domain of cluster, these two tasks might be put as below:
+-------------------+            +-----------------+
| +----+   +----+   |            |                 |
| |task|   |task|   |            |                 |
| |1   |   |2   |   |            |                 |
| +----+   +----+   |            |                 |
|                   |            |                 |
|       cluster1    |            |     cluster2    |
+-------------------+            +-----------------+
With the sched_domain of cluster, load balance will spread them as below:
+-------------------+            +-----------------+
| +----+            |            | +----+          |
| |task|            |            | |task|          |
| |1   |            |            | |2   |          |
| +----+            |            | +----+          |
|                   |            |                 |
|       cluster1    |            |     cluster2    |
+-------------------+            +-----------------+
Then task1 and tasks2 get more cache and decrease cache contention.
They will get better performance.
That is what my original patch also can make. And tim's patch
is also doing. Once we add a sched_domain, load balance will
get involved.
Case 2. we have 8 tasks, running in a system with 2 clusters and 8 cpus.
But they are working in 4 groups:
Task1 wakes up task4
Task2 wakes up task5
Task3 wakes up task6
Task4 wakes up task7
With my changing in select_idle_sibling, the WAKE_AFFINE mechanism will
try to put task1 and 4, task2 and 5, task3 and 6, task4 and 7 in same clusters
rather
than putting all of them in the random one of the 8 cpus. However, the 8 tasks
are still spreading among the 8 cpus with my change in select_idle_sibling
as load balance is still working.
+---------------------------+    +----------------------+
| +----+        +-----+     |    | +----+      +-----+  |
| |task|        |task |     |    | |task|      |task |  |
| |1   |        | 4   |     |    | |2   |      |5    |  |
| +----+        +-----+     |    | +----+      +-----+  |
|                           |    |                      |
|       cluster1            |    |     cluster2         |
|                           |    |                      |
|                           |    |                      |
| +-----+       +------+    |    | +-----+     +------+ |
| |task |       | task |    |    | |task |     |task  | |
| |3    |       |  6   |    |    | |4    |     |8     | |
| +-----+       +------+    |    | +-----+     +------+ |
+---------------------------+    +----------------------+
Your use-case (#tasks, runtime/period) seems to be perfectly crafted to
show the benefit of your patch on your specific system (cluster-size =
4). IMHO, this extra infrastructure especially in the wakeup path should
show benefits over a range of different benchmarks.
...
Let's consider the 3rd case, that one would be more tricky:
task1 and task2 have close relationship and they are waker-wakee pair.
With my current patch, select_idle_sidling() wants to put them in one
cluster, load balance wants to put them in two clusters. Load balance will
win.
Then maybe we need some same mechanism like adjusting numa imbalance:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
kernel/sched/fair.c?id=b396f52326de20
...
if we permit a light imbalance between clusters, select_idle_sidling()
will win. And task1 and task2 get better cache affinity.
This would look weird to allow this kind of imbalance on CLS (MC_L2) and
NUMA domains but not on the MC domain for example.
Yes. I guess I actually meant permitting imbalance between sched_group
made by the child sched_cluster domain of the parent sched_mc domain.

sched_mc domain

+----------------------------------+
|   +--------+     +----------+    |
|   |sched_  |     |sched_    |    |
|   |group   |     |group     |    |
|   +--+-----+     +----+-----+    |
|      |  allow small   |          |
|      |  imbalance     |          |
+----------------------------------+
       |                |
       |                |
       |                |
       |                |
       |                |
       +                +
   child domain:     child domain:
   sched_cluster     sched_cluster

For sched_group within one sched_cluster domain, we don't allow this
kind of imbalance.

Anyway, I would be happier to see this kind of imbalance is
only allowed when we exactly know two tasks in the cluster
have wake-up relationship. Right now, SD_NUMA seems to be
simply allowing this imbalance without the knowledge of the
relationships of tasks causing imbalance.

Thanks
Barry