[Linuxarm] [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

6 Jan 2021

The ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
cluster has 4 CPUs. All clusters share the L3 cache data, while each cluster
has its own local L3 tag. In addition, the CPUs in each cluster share some
internal system bus. This means cache affinity is much stronger inside one
cluster than across clusters.

    +-----------------------------------+                          +---------+
    |  +------+    +------+            +---------------------------+         |
    |  | CPU0 |    | CPU1 |             |    +-----------+         |         |
    |  +------+    +------+             |    |           |         |         |
    |                                   +----+    L3     |         |         |
    |  +------+    +------+   cluster   |    |    tag    |         |         |
    |  | CPU2 |    | CPU3 |             |    |           |         |         |
    |  +------+    +------+             |    +-----------+         |         |
    |                                   |                          |         |
    +-----------------------------------+                          |         |
    +-----------------------------------+                          |         |
    |  +------+    +------+             +--------------------------+         |
    |  |      |    |      |             |    +-----------+         |         |
    |  +------+    +------+             |    |           |         |         |
    |                                   |    |    L3     |         |         |
    |  +------+    +------+             +----+    tag    |         |         |
    |  |      |    |      |             |    |           |         |         |
    |  +------+    +------+             |    +-----------+         |         |
    |                                   |                          |         |
    +-----------------------------------+                          |   L3    |
                                                                   |   data  |
    +-----------------------------------+                          |         |
    |  +------+    +------+             |    +-----------+         |         |
    |  |      |    |      |             |    |           |         |         |
    |  +------+    +------+             +----+    L3     |         |         |
    |                                   |    |    tag    |         |         |
    |  +------+    +------+             |    |           |         |         |
    |  |      |    |      |            ++    +-----------+         |         |
    |  +------+    +------+            |---------------------------+         |
    +-----------------------------------|                          |         |
    +-----------------------------------|                          |         |
    |  +------+    +------+            +---------------------------+         |
    |  |      |    |      |             |    +-----------+         |         |
    |  +------+    +------+             |    |           |         |         |
    |                                   +----+    L3     |         |         |
    |  +------+    +------+             |    |    tag    |         |         |
    |  |      |    |      |             |    |           |         |         |
    |  +------+    +------+             |    +-----------+         |         |
    |                                   |                          |         |
    +-----------------------------------+                          |         |
    +-----------------------------------+                          |         |
    |  +------+    +------+             +--------------------------+         |
    |  |      |    |      |             |   +-----------+          |         |
    |  +------+    +------+             |   |           |          |         |

The following small program shows the performance difference between
running within one cluster and running across two clusters:

#include <pthread.h>

/* x and y are adjacent in memory; one thread reads x, the other writes y. */
struct foo {
        int x;
        int y;
} f;

/* Reader: repeatedly loads f.x. */
void *thread1_fun(void *param)
{
        int s = 0;
        for (int i = 0; i < 0xfffffff; i++)
                s += f.x;
        return NULL;
}

/* Writer: repeatedly increments f.y. */
void *thread2_fun(void *param)
{
        for (int i = 0; i < 0xfffffff; i++)
                f.y++;
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid1, tid2;

        pthread_create(&tid1, NULL, thread1_fun, NULL);
        pthread_create(&tid2, NULL, thread2_fun, NULL);
        pthread_join(tid1, NULL);
        pthread_join(tid2, NULL);
        return 0;
}

Running this program within one cluster takes:
$ time taskset -c 0,1 ./a.out 
real	0m0.832s
user	0m1.649s
sys	0m0.004s

By contrast, the same program takes much longer when run across
two clusters:
$ time taskset -c 0,4 ./a.out 
real	0m1.133s
user	0m1.960s
sys	0m0.000s

0.832/1.133 = 73%: the single-cluster run needs only about 73% of the
cross-cluster time, which is a huge difference.

hackbench running on 4 CPUs within a single cluster versus 4 CPUs in
different clusters also shows a large contrast:
* inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

* across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

The times are 4.285 vs 5.524; a shorter time means better performance.

All of these tests imply that we should let the Linux scheduler use
this topology to make better load-balancing and WAKE_AFFINE decisions.
However, the current scheduler has no knowledge of clusters at all.

This patchset first exposes the cluster topology, then adds a sched
domain for clusters. Although it is named "cluster", architectures and
machines can define the exact meaning of a cluster, as long as some
resources are shared below the LLC and the affinity of these resources
can be leveraged to achieve better scheduling performance.

-v3:
 - rebased against 5.11-rc2
 - addressed the comments of Valentin Schneider, Peter Zijlstra,
   Vincent Guittot, Mel Gorman and others:
  * moved the scheduler changes from arm64 to the common place for all
    architectures;
  * added the SD_SHARE_CLS_RESOURCES sd_flag, which specifies the
    sched_domain where select_idle_cpu() should begin scanning;
  * removed the redundant select_idle_cluster() function, since all of
    its code is in select_idle_cpu() now. This also avoids scanning
    cluster CPUs twice, as the v2 code did;
  * reran hackbench within one NUMA node after the above changes.

Valentin suggested that select_idle_cpu() could begin scanning from the
domain with SD_SHARE_PKG_RESOURCES. Such a change might be too
aggressive and limit the spreading of tasks. Thus, this patch lets
architectures and machines decide where to start, by adding a new
SD_SHARE_CLS_RESOURCES flag.

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 ++
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++++
 drivers/base/topology.c                   | 10 ++++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/sd_flags.h            |  9 +++++
 include/linux/sched/topology.h            |  7 ++++
 include/linux/topology.h                  | 13 +++++++
 kernel/sched/fair.c                       | 27 ++++++++++----
 kernel/sched/topology.c                   |  6 ++++
 13 files changed, 181 insertions(+), 10 deletions(-)

-- 
2.7.4