On 1/6/21 12:30 AM, Barry Song wrote:
> The ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each cluster has 4 CPUs. All clusters share the L3 cache data, while each cluster has its own local L3 tag. In addition, the CPUs within a cluster share some internal system bus. This means the cache is much more affine inside one cluster than across clusters.
> [ASCII diagram: each of the 6 clusters contains 4 CPUs (CPU0-CPU3, ...) and its own local L3 tag block; all clusters connect to a single shared L3 data block.]
There is a similar need for clustering in x86. Some x86 cores share an L2 cache, which is similar to a cluster in Kunpeng 920 (e.g. on Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a separate L2, and all 24 cores sharing the L3). Having a sched domain at the L2 cluster level helps spread load among the L2 domains. This will reduce L2 cache contention and help performance in low to moderate load scenarios.
The cluster detection mechanism will need to be based on L2 cache sharing in this case. I suggest making the cluster detection CPU-architecture dependent so that both the ARM64 and x86 use cases can be accommodated.
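One possible shape for that is sketched below: a generic weak hook that each architecture can override with its own notion of a cluster. This is a rough sketch only; the hook name cpu_clustergroup_mask and the __weak default are illustrative assumptions, not code from either patchset.

/*
 * Hypothetical arch hook for cluster detection (name and placement
 * are illustrative assumptions, not taken from either patchset).
 *
 * Generic fallback: no cluster knowledge, so use the SMT sibling mask.
 */
#include <linux/cpumask.h>
#include <linux/topology.h>

const struct cpumask * __weak cpu_clustergroup_mask(int cpu)
{
	return topology_sibling_cpumask(cpu);
}

arm64 could then override this to return the CPUs sharing an L3 tag, while x86 returns the CPUs sharing an L2, as in the patch below.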
Attached below are two RFC patches for creating an x86 L2 cache sched domain, sans the idle CPU selection on wakeup code. They are similar enough in concept to Barry's patch that we should have a single patchset that accommodates both use cases.
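To make the intent concrete, a mask like cpu_l2group_mask() would be consumed through a sched_domain_topology_level table, roughly as sketched below. This is illustrative only and is not the actual 2/2 patch; the level name and the choice of flags are assumptions.

/*
 * Rough sketch of how an L2 level could slot into the x86 topology
 * table in arch/x86/kernel/smpboot.c (illustrative only).
 */
static struct sched_domain_topology_level x86_l2_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ cpu_l2group_mask, SD_INIT_NAME(L2) },		/* new L2 level */
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

A call to set_sched_topology(x86_l2_topology) at boot would then install the table.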
Thanks.
Tim
From e0e7e42e1a033c9634723ff1dc80b426deeec1e9 Mon Sep 17 00:00:00 2001
Message-Id: <e0e7e42e1a033c9634723ff1dc80b426deeec1e9.1609970726.git.tim.c.chen@linux.intel.com>
In-Reply-To: <cover.1609970726.git.tim.c.chen@linux.intel.com>
References: <cover.1609970726.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 19 Aug 2020 16:22:35 -0700
Subject: [RFC PATCH 1/2] sched: Add L2 cache cpu mask
There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache is shared among a group of cores instead of being exclusive to a single core.

To prevent oversubscription of the L2 cache, load can be balanced between such L2 domains.

Add CPU masks of the CPUs sharing the L2 cache so that we can build such an L2 scheduler domain for load balancing at the L2 level.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index f4234575f3fd..e35f5f55cb15 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_l2group_mask(int cpu);
 
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 27aa04a95702..8ba0b505f020 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -56,6 +56,7 @@
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
+#include <linux/cacheinfo.h>
 
 #include <asm/acpi.h>
 #include <asm/desc.h>
@@ -643,6 +644,17 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return cpu_llc_shared_mask(cpu);
 }
 
+const struct cpumask *cpu_l2group_mask(int cpu)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+
+	/* Sanity check for presence of L2, leaf index 2 */
+	if (ci->num_leaves < 3)
+		return topology_sibling_cpumask(cpu);
+
+	return &ci->info_list[2].shared_cpu_map;
+}
+
 static void impress_friends(void)
 {
 	int cpu;
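One caveat with the fixed info_list[2] lookup above: the leaf order (L1d, L1i, L2, ...) is the x86 convention but not an architectural guarantee. A more defensive variant would search the leaves by cache level instead; the sketch below is illustrative only, and the helper name is made up.

/*
 * Hypothetical alternative to the fixed-index lookup in the patch:
 * walk the cacheinfo leaves and match on level == 2.
 */
#include <linux/cacheinfo.h>
#include <linux/topology.h>

const struct cpumask *cpu_l2group_mask_by_level(int cpu)
{
	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
	unsigned int i;

	for (i = 0; i < ci->num_leaves; i++) {
		if (ci->info_list[i].level == 2)
			return &ci->info_list[i].shared_cpu_map;
	}

	/* No L2 leaf found: fall back to the SMT sibling mask. */
	return topology_sibling_cpumask(cpu);
}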