As long as NUMA diameter > 2, building the sched_domain from the sibling's child domain will definitely create a sched_domain whose sched_group spans out of the sched_domain:
+------+         +------+        +-------+        +------+
| node |   12    | node |   20   | node  |   12   | node |
|  0   +---------+  1   +--------+  2    +--------+  3   |
+------+         +------+        +-------+        +------+
domain0        node0            node1            node2            node3

domain1        node0+1          node0+1          node2+3          node2+3
                                                      +
domain2        node0+1+2                              |
               group: node0+1                         |
               group: node2+3  <----------------------+
When node2 is added into node0's domain2, the kernel uses the child domain of node2's domain2, which is domain1 (node2+3), as the sched_group. But node3 is outside the span of node0's domain2, which only includes node0+1+2.
This makes load_balance() run on a skewed avg_load and group_type for the sched_group that spans out of the sched_domain, and it also lets select_task_rq_fair() pick an idle CPU outside the sched_domain.
Real servers which suffer from this problem include Kunpeng920 and 8-node Sun Fire X4600-M2, at least.
Here we move to use the *child* domain of the *child* domain of node2's domain2 as the newly added sched_group. At the same time, we reuse the lower-level sgc directly.
+------+         +------+        +-------+        +------+
| node |   12    | node |   20   | node  |   12   | node |
|  0   +---------+  1   +--------+  2    +--------+  3   |
+------+         +------+        +-------+        +------+
domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        |  node2+3        node2+3
                                               |
domain2        node0+1+2                       |
               group: node0+1                  |
               group: node2   <----------------+
Tested by the below topology:
qemu-system-aarch64 -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
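For reference, the node distance matrix these options describe (the diagonal value 10 is qemu's default local distance) is:

node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10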
w/o patch, we get lots of "groups don't span domain->span":
[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193]  domain-0: span=0-1 level=MC
[ 0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693]  domain-1: span=0-3 level=NUMA
[ 0.802731]   groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811]  domain-2: span=0-5 level=NUMA
[ 0.802829]   groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058]  domain-3: span=0-7 level=NUMA
[ 0.803080]   groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804055] CPU1 attaching sched-domain(s):
[ 0.804072]  domain-0: span=0-1 level=MC
[ 0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[ 0.804152]  domain-1: span=0-3 level=NUMA
[ 0.804170]   groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.804219]  domain-2: span=0-5 level=NUMA
[ 0.804236]   groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.804302] ERROR: groups don't span domain->span
[ 0.804520]  domain-3: span=0-7 level=NUMA
[ 0.804546]   groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804677] CPU2 attaching sched-domain(s):
[ 0.804687]  domain-0: span=2-3 level=MC
[ 0.804705]   groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
[ 0.804754]  domain-1: span=0-3 level=NUMA
[ 0.804772]   groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.804820]  domain-2: span=0-5 level=NUMA
[ 0.804836]   groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.804944] ERROR: groups don't span domain->span
[ 0.805108]  domain-3: span=0-7 level=NUMA
[ 0.805134]   groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805223] CPU3 attaching sched-domain(s):
[ 0.805232]  domain-0: span=2-3 level=MC
[ 0.805249]   groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
[ 0.805319]  domain-1: span=0-3 level=NUMA
[ 0.805336]   groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.805383]  domain-2: span=0-5 level=NUMA
[ 0.805399]   groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.805458] ERROR: groups don't span domain->span
[ 0.805605]  domain-3: span=0-7 level=NUMA
[ 0.805626]   groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805712] CPU4 attaching sched-domain(s):
[ 0.805721]  domain-0: span=4-5 level=MC
[ 0.805738]   groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
[ 0.805787]  domain-1: span=4-7 level=NUMA
[ 0.805803]   groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.805851]  domain-2: span=0-1,4-7 level=NUMA
[ 0.805867]   groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.805915] ERROR: groups don't span domain->span
[ 0.806108]  domain-3: span=0-7 level=NUMA
[ 0.806130]   groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.806214] CPU5 attaching sched-domain(s):
[ 0.806222]  domain-0: span=4-5 level=MC
[ 0.806240]   groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
[ 0.806841]  domain-1: span=4-7 level=NUMA
[ 0.806866]   groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.806934]  domain-2: span=0-1,4-7 level=NUMA
[ 0.806953]   groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.807004] ERROR: groups don't span domain->span
[ 0.807312]  domain-3: span=0-7 level=NUMA
[ 0.807386]   groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.807686] CPU6 attaching sched-domain(s):
[ 0.807710]  domain-0: span=6-7 level=MC
[ 0.807750]   groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
[ 0.807840]  domain-1: span=4-7 level=NUMA
[ 0.807870]   groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.807952]  domain-2: span=0-1,4-7 level=NUMA
[ 0.807985]   groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.808045] ERROR: groups don't span domain->span
[ 0.808257]  domain-3: span=0-7 level=NUMA
[ 0.808571]   groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
[ 0.808848] CPU7 attaching sched-domain(s):
[ 0.808860]  domain-0: span=6-7 level=MC
[ 0.808880]   groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
[ 0.808953]  domain-1: span=4-7 level=NUMA
[ 0.808974]   groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.809034]  domain-2: span=0-1,4-7 level=NUMA
[ 0.809055]   groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.809128] ERROR: groups don't span domain->span
[ 0.810361]  domain-3: span=0-7 level=NUMA
[ 0.810400]   groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }
w/ patch, we don't get "groups don't span domain->span" any more:
[ 1.486271] CPU0 attaching sched-domain(s):
[ 1.486820]  domain-0: span=0-1 level=MC
[ 1.500924]   groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
[ 1.515717]  domain-1: span=0-3 level=NUMA
[ 1.515903]   groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.516989]  domain-2: span=0-5 level=NUMA
[ 1.517124]   groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
[ 1.517369]  domain-3: span=0-7 level=NUMA
[ 1.517423]   groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
[ 1.520027] CPU1 attaching sched-domain(s):
[ 1.520097]  domain-0: span=0-1 level=MC
[ 1.520184]   groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 }
[ 1.520429]  domain-1: span=0-3 level=NUMA
[ 1.520487]   groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.520687]  domain-2: span=0-5 level=NUMA
[ 1.520744]   groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
[ 1.520948]  domain-3: span=0-7 level=NUMA
[ 1.521038]   groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
[ 1.522068] CPU2 attaching sched-domain(s):
[ 1.522348]  domain-0: span=2-3 level=MC
[ 1.522606]   groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 }
[ 1.522832]  domain-1: span=0-3 level=NUMA
[ 1.522885]   groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[ 1.523043]  domain-2: span=0-5 level=NUMA
[ 1.523092]   groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
[ 1.523302]  domain-3: span=0-7 level=NUMA
[ 1.523352]   groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
[ 1.523748] CPU3 attaching sched-domain(s):
[ 1.523774]  domain-0: span=2-3 level=MC
[ 1.523825]   groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 }
[ 1.524009]  domain-1: span=0-3 level=NUMA
[ 1.524086]   groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[ 1.524281]  domain-2: span=0-5 level=NUMA
[ 1.524331]   groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
[ 1.524534]  domain-3: span=0-7 level=NUMA
[ 1.524586]   groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
[ 1.524847] CPU4 attaching sched-domain(s):
[ 1.524873]  domain-0: span=4-5 level=MC
[ 1.524954]   groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 }
[ 1.525105]  domain-1: span=4-7 level=NUMA
[ 1.525153]   groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[ 1.525368]  domain-2: span=0-1,4-7 level=NUMA
[ 1.525428]   groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[ 1.532726]  domain-3: span=0-7 level=NUMA
[ 1.532811]   groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 }
[ 1.534125] CPU5 attaching sched-domain(s):
[ 1.534159]  domain-0: span=4-5 level=MC
[ 1.534303]   groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 }
[ 1.534490]  domain-1: span=4-7 level=NUMA
[ 1.534572]   groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[ 1.534734]  domain-2: span=0-1,4-7 level=NUMA
[ 1.534783]   groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[ 1.536057]  domain-3: span=0-7 level=NUMA
[ 1.536430]   groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 }
[ 1.536815] CPU6 attaching sched-domain(s):
[ 1.536846]  domain-0: span=6-7 level=MC
[ 1.536934]   groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 }
[ 1.537144]  domain-1: span=4-7 level=NUMA
[ 1.537262]   groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[ 1.537553]  domain-2: span=0-1,4-7 level=NUMA
[ 1.537613]   groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
[ 1.537872]  domain-3: span=0-7 level=NUMA
[ 1.537998]   groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
[ 1.538448] CPU7 attaching sched-domain(s):
[ 1.538505]  domain-0: span=6-7 level=MC
[ 1.538586]   groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 }
[ 1.538746]  domain-1: span=4-7 level=NUMA
[ 1.538798]   groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[ 1.539048]  domain-2: span=0-1,4-7 level=NUMA
[ 1.539111]   groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
[ 1.539571]  domain-3: span=0-7 level=NUMA
[ 1.539610]   groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
-v2: merged Valentin's edit on v1, thus fixed two issues:
 * "one cpu in one numa" was not supported
 * the capacity of groups from the grandchild couldn't be updated
 Thanks for Valentin's nice edit!
 And also added some comments in build_balance_mask.
 kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 32 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..964ed89001fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -723,35 +723,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	for (tmp = sd; tmp; tmp = tmp->parent)
 		numa_distance += !!(tmp->flags & SD_NUMA);
 
-	/*
-	 * FIXME: Diameter >=3 is misrepresented.
-	 *
-	 * Smallest diameter=3 topology is:
-	 *
-	 *   node   0  1  2  3
-	 *     0:  10 20 30 40
-	 *     1:  20 10 20 30
-	 *     2:  30 20 10 20
-	 *     3:  40 30 20 10
-	 *
-	 *   0 --- 1 --- 2 --- 3
-	 *
-	 * NUMA-3       0-3             N/A             N/A             0-3
-	 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
-	 *
-	 * NUMA-2       0-2             0-3             0-3             1-3
-	 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
-	 *
-	 * NUMA-1       0-1             0-2             1-3             2-3
-	 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
-	 *
-	 * NUMA-0       0               1               2               3
-	 *
-	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
-	 * group span isn't a subset of the domain span.
-	 */
-	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
-
 	sched_domain_debug(sd, cpu);
 
 	rq_attach_root(rq, rd);
@@ -908,6 +879,17 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
 
+		/*
+		 * for NUMA diameter >= 3, while sibling->child spans out of
+		 * the domain being built, we are using the grandchild;
+		 * build_overlap_sched_groups includes words on how this could
+		 * happen
+		 */
+		while (sibling->child &&
+		       !cpumask_subset(sched_domain_span(sibling->child),
+				       sched_domain_span(sd)))
+			sibling = sibling->child;
+
 		/*
 		 * Can happen in the asymmetric case, where these siblings are
 		 * unused. The mask will not be empty because those CPUs that
@@ -955,10 +937,11 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 }
 
 static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg)
+				     struct sched_group *sg,
+				     struct sched_domain *grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
 	struct cpumask *sg_span;
 	int cpu;
 
@@ -996,6 +979,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
+		bool from_grandchild = false;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1015,6 +999,43 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
+		/*
+		 * for NUMA diameter >= 3, building sched_domain by sibling's
+		 * child's child domain to prevent sched_group from spanning
+		 * out of sched_domain
+		 * if we don't do this, Diameter >=3 is misrepresented:
+		 *
+		 * Smallest diameter=3 topology is:
+		 *
+		 *   node   0  1  2  3
+		 *     0:  10 20 30 40
+		 *     1:  20 10 20 30
+		 *     2:  30 20 10 20
+		 *     3:  40 30 20 10
+		 *
+		 *   0 --- 1 --- 2 --- 3
+		 *
+		 * NUMA-3       0-3             N/A             N/A             0-3
+		 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
+		 *
+		 * NUMA-2       0-2             0-3             0-3             1-3
+		 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
+		 *
+		 * NUMA-1       0-1             0-2             1-3             2-3
+		 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
+		 *
+		 * NUMA-0       0               1               2               3
+		 *
+		 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
+		 * group span isn't a subset of the domain span.
+		 */
+		while (sibling->child &&
+		       !cpumask_subset(sched_domain_span(sibling->child),
+				       span)) {
+			sibling = sibling->child;
+			from_grandchild = true;
+		}
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1043,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
 
 		if (!first)
 			first = sg;
03.02.21 13:12 Barry Song wrote:
 kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..964ed89001fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
This one still works on the Sun X4600-M2, on top of v5.11-rc6-55-g3aaf0a27ffc2.
Performance-wise - is there some simple benchmark to run to measure the impact? Compared to what - 5.10.0 or the kernel with the warning?
drop caches and time the build time of linux kernel with make -j64?
-----Original Message-----
From: Meelis Roos [mailto:mroos@linux.ee]
Sent: Thursday, February 4, 2021 12:58 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; valentin.schneider@arm.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org
Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
03.02.21 13:12 Barry Song wrote:
 kernel/sched/topology.c | 85 +++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..964ed89001fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
This one still works on the Sun X4600-M2, on top of v5.11-rc6-55-g3aaf0a27ffc2.
Performance-wise - is there some simple benchmark to run to measure the impact? Compared to what - 5.10.0 or the kernel with the warning?
Hi Meelis,

Thanks for retesting.
Comparing to the kernel with the warning is enough. As I mentioned here: https://lore.kernel.org/lkml/20210115203632.34396-1-song.bao.hua@hisilicon.c...
I have seen two major issues caused by the broken sched_group:
* In load_balance() and find_busiest_group(), the kernel calculates avg_load and group_type by:
      sum(load of cpus within sched_domain)
      -------------------------------------
       capacity of the whole sched_group
Since the sched_group isn't a subset of the sched_domain, the load of the problematic group is severely underestimated.
             sched_domain
   +----------------------------------+
   |                                  |
   |    +-------------------------------------------+
   |    |   +-------+   +------+                    |
   |    |   | cpu0  |   | cpu1 |                    |
   |    |   +-------+   +------+                    |
   +----------------------------------+             |
        |                                           |
        |   +-------+   +-------+                   |
        |   | cpu2  |   | cpu3  |                   |
        |   +-------+   +-------+                   |
        |                                           |
        +-------------------------------------------+
                 problematic sched_group
For the above example, the kernel will divide "the sum of the load of cpu0 and cpu1" by "the capacity of the whole group including cpu0, cpu1, cpu2 and cpu3" (a small numeric sketch of this follows after the second point below).
* In select_task_rq_fair() and find_idlest_group(), the kernel could push a forked/exec'd task outside the sched_domain but still inside the sched_group. For the above diagram, while the kernel wants to find the idlest CPU in the sched_domain, it can end up picking cpu2 or cpu3.
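To make the first point concrete, here is a small stand-alone sketch of the arithmetic; the capacities, loads and the 1024 scale factor are made-up illustrative numbers, not values read from the kernel:

#include <stdio.h>

int main(void)
{
	/* assume each CPU has capacity 1024 and cpu0/cpu1 are fully loaded */
	unsigned long load_cpu0 = 1024, load_cpu1 = 1024;
	unsigned long cap_per_cpu = 1024;

	/*
	 * Broken case: the numerator only sums the CPUs inside the
	 * sched_domain (cpu0, cpu1), but the denominator is the capacity
	 * of the whole 4-CPU sched_group (cpu0-cpu3).
	 */
	unsigned long skewed  = (load_cpu0 + load_cpu1) * 1024 / (4 * cap_per_cpu);

	/* Expected case: the group only spans cpu0 and cpu1 */
	unsigned long correct = (load_cpu0 + load_cpu1) * 1024 / (2 * cap_per_cpu);

	printf("avg_load: skewed=%lu correct=%lu\n", skewed, correct);
	return 0;
}

With these toy numbers the group looks only half as loaded as it really is, which matches the "severely underestimated" description above.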
I guess these two issues can potentially affect many benchmarks. Our team has seen a 5% unixbench score increase with the fix on some machines, though the real impact might be case-by-case.
drop caches and time the build time of linux kernel with make -j64?
-- Meelis Roos
Thanks Barry
On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
As long as NUMA diameter > 2, building sched_domain by sibling's child domain will definitely create a sched_domain with sched_group which will span out of the sched_domain:
+------+         +------+        +-------+        +------+
| node |   12    | node |   20   | node  |   12   | node |
|  0   +---------+  1   +--------+  2    +--------+  3   |
+------+         +------+        +-------+        +------+
domain0        node0            node1            node2            node3

domain1        node0+1          node0+1          node2+3          node2+3
                                                      +
domain2        node0+1+2                              |
               group: node0+1                         |
               group: node2+3  <----------------------+
when node2 is added into the domain2 of node0, kernel is using the child domain of node2's domain2, which is domain1(node2+3). Node 3 is outside the span of the domain including node0+1+2.
This will make load_balance() run based on screwed avg_load and group_type in the sched_group spanning out of the sched_domain, and it also makes select_task_rq_fair() pick an idle CPU out of the sched_domain.
Real servers which suffer from this problem include Kunpeng920 and 8-node Sun Fire X4600-M2, at least.
Here we move to use the *child* domain of the *child* domain of node2's domain2 as the new added sched_group. At the same time, we re-use the lower level sgc directly.
+------+         +------+        +-------+        +------+
| node |   12    | node |   20   | node  |   12   | node |
|  0   +---------+  1   +--------+  2    +--------+  3   |
+------+         +------+        +-------+        +------+
domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        |  node2+3        node2+3
                                               |
domain2        node0+1+2                       |
               group: node0+1                  |
               group: node2   <----------------+
I've finally had a moment to think about this, would it make sense to also break up group: node0+1, such that we then end up with 3 groups of equal size?
w/ patch, we don't get "groups don't span domain->span" any more:
[ 1.486271] CPU0 attaching sched-domain(s):
[ 1.486820]  domain-0: span=0-1 level=MC
[ 1.500924]   groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
[ 1.515717]  domain-1: span=0-3 level=NUMA
[ 1.515903]   groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.516989]  domain-2: span=0-5 level=NUMA
[ 1.517124]   groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }

        groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3, cap=1989 },
                4:{ span=4-5, cap=1949 }

[ 1.517369]  domain-3: span=0-7 level=NUMA
[ 1.517423]   groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
Let me continue to think about this... it's been a while :/
-----Original Message-----
From: Peter Zijlstra [mailto:peterz@infradead.org]
Sent: Wednesday, February 10, 2021 1:56 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com
Cc: valentin.schneider@arm.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org; linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Meelis Roos mroos@linux.ee
Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
As long as NUMA diameter > 2, building sched_domain by sibling's child domain will definitely create a sched_domain with sched_group which will span out of the sched_domain:
+------+         +------+        +-------+        +------+
| node |   12    | node |   20   | node  |   12   | node |
|  0   +---------+  1   +--------+  2    +--------+  3   |
+------+         +------+        +-------+        +------+
domain0        node0            node1            node2            node3

domain1        node0+1          node0+1          node2+3          node2+3
                                                      +
domain2        node0+1+2                              |
               group: node0+1                         |
               group: node2+3  <----------------------+
when node2 is added into the domain2 of node0, kernel is using the child domain of node2's domain2, which is domain1(node2+3). Node 3 is outside the span of the domain including node0+1+2.
This will make load_balance() run based on screwed avg_load and group_type in the sched_group spanning out of the sched_domain, and it also makes select_task_rq_fair() pick an idle CPU out of the sched_domain.
Real servers which suffer from this problem include Kunpeng920 and 8-node Sun Fire X4600-M2, at least.
Here we move to use the *child* domain of the *child* domain of node2's domain2 as the new added sched_group. At the same time, we re-use the lower level sgc directly.
+------+         +------+        +-------+        +------+
| node |   12    | node |   20   | node  |   12   | node |
|  0   +---------+  1   +--------+  2    +--------+  3   |
+------+         +------+        +-------+        +------+
domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        |  node2+3        node2+3
                                               |
domain2        node0+1+2                       |
               group: node0+1                  |
               group: node2   <----------------+
I've finally had a moment to think about this, would it make sense to also break up group: node0+1, such that we then end up with 3 groups of equal size?
We used to create the sched_groups of sched_domain[n] of node[m] from:
1. local group: sched_domain[n-1] of node[m]
2. remote groups: sched_domain[n-1] of node[m]'s siblings in the same level.
Since the sched_domain[n-1] of a part of node[m]'s siblings is enough to cover the whole span of sched_domain[n] of node[m], there is no need to scan over all siblings of node[m]; once sched_domain[n] of node[m] has been covered, we can stop making more sched_groups. So the number of sched_groups is small.
So historically, the code has never tried to make sched_groups result in equal size. And it permits the overlapping of local group and remote groups.
One issue we are facing in the original code is that once the topology gets to 3-hop NUMA, the sched_domain[n-1] of node[m]'s siblings might span out of the range of sched_domain[n] of node[m]. Here my approach is to find a descendant sibling domain to build the remote groups, fixing the issue only for machines with this problem, so machines without the 3-hop issue are kept untouched.
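As an illustration of that idea, here is a tiny self-contained model of the "walk down to a descendant sibling" step. The struct, the bitmask encoding of node spans and the example spans are invented for this sketch; the kernel of course operates on struct sched_domain and cpumasks instead:

#include <stdbool.h>
#include <stdio.h>

struct dom {
	unsigned int span;	/* bitmask of nodes this domain covers */
	struct dom *child;	/* next lower level, NULL at the bottom */
};

static bool span_subset(unsigned int a, unsigned int b)
{
	return (a & ~b) == 0;	/* every node of a is also in b */
}

/* Walk down the sibling's hierarchy while its child would still leak
 * outside @span; the domain we stop at is the "descendant" whose child
 * the remote group is then built from. */
static struct dom *find_descendant_sibling(struct dom *sibling, unsigned int span)
{
	while (sibling->child && !span_subset(sibling->child->span, span))
		sibling = sibling->child;
	return sibling;
}

int main(void)
{
	/* an invented sibling hierarchy with spans {2}, {2,3}, {0,1,2,3} */
	struct dom d0 = { .span = 0x4, .child = NULL };
	struct dom d1 = { .span = 0xc, .child = &d0 };
	struct dom d2 = { .span = 0xf, .child = &d1 };

	unsigned int building = 0x7;	/* the domain being built spans nodes {0,1,2} */
	struct dom *s = find_descendant_sibling(&d2, building);

	/* the group is then built from s->child: just node2, not node2+3 */
	printf("group span: 0x%x\n", s->child ? s->child->span : s->span);
	return 0;
}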
Valentin sent another RFC to break up all remote groups so that they include the remote node only, instead of using the sched_domain[n-1] of siblings; this would eliminate the problem from the very beginning. One side effect is that it changes all machines, including those w/o the 3-hop issue, by creating many more remote sched_groups. So we both agreed we can get started with the descendant sibling (grandchild) approach first.
What you are advising seems to be breaking up the local sched_group, which would create many more local groups. It sounds like a huge change, even beyond the scope of the original issue we are trying to fix :-)
w/ patch, we don't get "groups don't span domain->span" any more:
[ 1.486271] CPU0 attaching sched-domain(s):
[ 1.486820]  domain-0: span=0-1 level=MC
[ 1.500924]   groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
[ 1.515717]  domain-1: span=0-3 level=NUMA
[ 1.515903]   groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
[ 1.516989]  domain-2: span=0-5 level=NUMA
[ 1.517124]   groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }

        groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3, cap=1989 },
                4:{ span=4-5, cap=1949 }

[ 1.517369]  domain-3: span=0-7 level=NUMA
[ 1.517423]   groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
Let me continue to think about this... it's been a while :/
Sure, thanks!
Barry
On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
I've finally had a moment to think about this, would it make sense to also break up group: node0+1, such that we then end up with 3 groups of equal size?
Since the sched_domain[n-1] of a part of node[m]'s siblings are able to cover the whole span of sched_domain[n] of node[m], there is no necessity to scan over all siblings of node[m], once sched_domain[n] of node[m] has been covered, we can stop making more sched_groups. So the number of sched_groups is small.
So historically, the code has never tried to make sched_groups result in equal size. And it permits the overlapping of local group and remote groups.
Historically groups have (typically) always been the same size though.
The reason I did ask is because when you get one large and a bunch of smaller groups, the load-balancing 'pull' is relatively smaller to the large groups.
That is, IIRC should_we_balance() ensures only 1 CPU out of the group continues the load-balancing pass. So if, for example, we have one group of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull 1/2 times, while the group of 4 CPUs will pull 1/4 times.
By making sure all groups are of the same level, and thus of equal size, this doesn't happen.
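A rough sketch of that gating, under a deliberately simplified model (this is not the kernel's actual should_we_balance(), which also distinguishes idle types and works on cpumasks), shows why a CPU in a group of N CPUs only carries on with the pass about 1/N of the time:

#include <stdbool.h>
#include <stdio.h>

static bool should_we_balance(int dst_cpu, const int *group_cpus, int nr,
			      const bool *cpu_is_idle)
{
	/* prefer the first idle CPU of the group ... */
	for (int i = 0; i < nr; i++) {
		if (cpu_is_idle[group_cpus[i]])
			return group_cpus[i] == dst_cpu;
	}
	/* ... otherwise only the group's first CPU continues */
	return group_cpus[0] == dst_cpu;
}

int main(void)
{
	int big_group[] = { 0, 1, 2, 3 };	/* 4 CPUs */
	int small_group[] = { 4, 5 };		/* 2 CPUs */
	bool idle[6] = { false };		/* everybody busy */

	/* only one CPU per group gets a "yes": 1-in-4 vs 1-in-2 */
	printf("cpu0=%d cpu1=%d cpu4=%d cpu5=%d\n",
	       should_we_balance(0, big_group, 4, idle),
	       should_we_balance(1, big_group, 4, idle),
	       should_we_balance(4, small_group, 2, idle),
	       should_we_balance(5, small_group, 2, idle));
	return 0;
}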
-----Original Message-----
From: Peter Zijlstra [mailto:peterz@infradead.org]
Sent: Thursday, February 11, 2021 12:22 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com
Cc: valentin.schneider@arm.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org; linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Meelis Roos mroos@linux.ee
Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
I've finally had a moment to think about this, would it make sense to also break up group: node0+1, such that we then end up with 3 groups of equal size?
Since the sched_domain[n-1] of a part of node[m]'s siblings are able to cover the whole span of sched_domain[n] of node[m], there is no necessity to scan over all siblings of node[m], once sched_domain[n] of node[m] has been covered, we can stop making more sched_groups. So the number of sched_groups is small.
So historically, the code has never tried to make sched_groups result in equal size. And it permits the overlapping of local group and remote groups.
Historically groups have (typically) always been the same size though.
This is probably true for other platforms. But unfortunately it has never been true on my platform :-)
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
And we have only two CPUs in each NUMA node.
CPU0's domain-3 has no sched_group that overflows the domain, but its first group covers CPUs 0-5 (node0-node2) while the second group covers CPUs 4-7 (node2-node3):
[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193]  domain-0: span=0-1 level=MC
[ 0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693]  domain-1: span=0-3 level=NUMA
[ 0.802731]   groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811]  domain-2: span=0-5 level=NUMA
[ 0.802829]   groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058]  domain-3: span=0-7 level=NUMA
[ 0.803080]   groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
The reason I did ask is because when you get one large and a bunch of smaller groups, the load-balancing 'pull' is relatively smaller to the large groups.
That is, IIRC should_we_balance() ensures only 1 CPU out of the group continues the load-balancing pass. So if, for example, we have one group of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull 1/2 times, while the group of 4 CPUs will pull 1/4 times.
By making sure all groups are of the same level, and thus of equal size, this doesn't happen.
As you can see, even if we give all groups of domain-2 equal size by breaking up both the local group and the remote groups, we still run into the same unequal-size problem in domain-3. And what's more tricky is that domain-3 has no "groups don't span domain->span" problem.
It seems we would then need to change both domain-2 and domain-3, even though domain-3 has no "groups don't span domain->span" issue.
Thanks Barry
On 10/02/21 12:21, Peter Zijlstra wrote:
On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
So historically, the code has never tried to make sched_groups result in equal size. And it permits the overlapping of local group and remote groups.
Historically groups have (typically) always been the same size though.
The reason I did ask is because when you get one large and a bunch of smaller groups, the load-balancing 'pull' is relatively smaller to the large groups.
That is, IIRC should_we_balance() ensures only 1 CPU out of the group continues the load-balancing pass. So if, for example, we have one group of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull 1/2 times, while the group of 4 CPUs will pull 1/4 times.
By making sure all groups are of the same level, and thus of equal size, this doesn't happen.
So I hacked something that tries to do this, with the notable exception that it doesn't change the way the local group is generated. Breaking the assumption that the local group always spans the child domain doesn't sound like the best thing to do.
Anywho, the below makes it so all !local NUMA groups are built using the same sched_domain_topology_level. Some of it is absolutely disgusting (esp. the sched_domains_curr_level stuff), I didn't bother with handling domain degeneration yet, and I trigger the WARN_ON in get_group()... But at least the topology gets built!
AFAICT Barry's topology is handled the same. On that sunfire topology, it pretty much turns all remote groups into groups spanning a single node. That does almost double the number of groups for some domains, compared to Barry's current v3.
If that is really a route we want to go down, I'll try to clean the below.
(deposit your drinks before going any further)

--->8---
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..e74f48bdd226 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -187,7 +187,10 @@ struct sched_domain_topology_level {
 	sched_domain_mask_f mask;
 	sched_domain_flags_f sd_flags;
 	int		    flags;
+#ifdef CONFIG_NUMA
 	int		    numa_level;
+	int		    numa_sibling_level;
+#endif
 	struct sd_data      data;
 #ifdef CONFIG_SCHED_DEBUG
 	char                *name;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3c50cc7285c9..5a9e6a4d5d89 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -742,6 +742,34 @@ enum s_alloc {
 	sa_none,
 };
 
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static struct sched_domain_topology_level *sched_domain_topology =
+	default_topology;
+
+#define for_each_sd_topology(tl)			\
+	for (tl = sched_domain_topology; tl->mask; tl++)
+
+void set_sched_topology(struct sched_domain_topology_level *tl)
+{
+	if (WARN_ON_ONCE(sched_smp_initialized))
+		return;
+
+	sched_domain_topology = tl;
+}
+
 /*
  * Return the canonical balance CPU for this group, this is the first CPU
  * of this group that's also in the balance mask.
@@ -955,10 +983,12 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct sched_group *first = NULL, *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
+	struct sched_domain_topology_level *tl;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
+	tl = container_of(sd->private, struct sched_domain_topology_level, data);
 	cpumask_clear(covered);
 
 	for_each_cpu_wrap(i, span, cpu) {
@@ -982,6 +1012,10 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
+		/* Don't mess up local group being child sched domain */
+		if (tl->numa_sibling_level != sd->level && i != cpu)
+			sibling = *per_cpu_ptr(sched_domain_topology[tl->numa_sibling_level].data.sd, i);
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -989,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sibling, sg);
 
 		if (!first)
 			first = sg;
@@ -1440,34 +1474,6 @@ sd_init(struct sched_domain_topology_level *tl,
 	return sd;
 }
 
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-#ifdef CONFIG_SCHED_MC
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
-#endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology =
-	default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->mask; tl++)
-
-void set_sched_topology(struct sched_domain_topology_level *tl)
-{
-	if (WARN_ON_ONCE(sched_smp_initialized))
-		return;
-
-	sched_domain_topology = tl;
-}
-
 #ifdef CONFIG_NUMA
 
 static const struct cpumask *sd_numa_mask(int cpu)
@@ -1566,6 +1572,52 @@ static void init_numa_topology_type(void)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
+void sched_init_numa_siblings(void)
+{
+	struct sched_domain_topology_level *tl;
+	int tl_id = 0, sibling_tl_id = 0;
+	const struct cpumask *mask;
+	int last_written_tl = 0;
+	int n, i, j;
+
+	for_each_sd_topology(tl) {
+		if (!tl->numa_level)
+			goto next;
+
+		for_each_node(n) {
+			/* XXX: this *thing* needs to die */
+			sched_domains_curr_level = tl->numa_level;
+			i = cpumask_first(cpumask_of_node(n));
+			mask = tl->mask(i);
+
+			for_each_cpu_wrap(j, mask, i) {
+				sibling_tl_id = tl_id;
+				sched_domains_curr_level = tl->numa_level;
+
+				/* Not subset? Down we go! */
+				/* XXX bother about degeneration here? */
+				do {
+					sibling_tl_id--;
+					sched_domains_curr_level--;
+				} while (sibling_tl_id > 0 &&
+					 !cpumask_subset(sched_domain_topology[sibling_tl_id].mask(j), mask));
+
+				/* Only write if first write or it lowers level */
+				if (last_written_tl < tl_id ||
+				    tl->numa_sibling_level > sibling_tl_id) {
+					tl->numa_sibling_level = sibling_tl_id;
+					last_written_tl = tl_id;
+				}
+			}
+		}
+next:
+		if (last_written_tl < tl_id)
+			tl->numa_sibling_level = tl_id;
+
+		tl_id++;
+	}
+}
+
 void sched_init_numa(void)
 {
 	struct sched_domain_topology_level *tl;
@@ -1708,6 +1760,8 @@ void sched_init_numa(void)
 	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
 	init_numa_topology_type();
+
+	sched_init_numa_siblings();
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Friday, February 12, 2021 8:55 AM
To: Peter Zijlstra peterz@infradead.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com
Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org; linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Meelis Roos mroos@linux.ee
Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
On 10/02/21 12:21, Peter Zijlstra wrote:
On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
So historically, the code has never tried to make sched_groups result in equal size. And it permits the overlapping of local group and remote groups.
Historically groups have (typically) always been the same size though.
The reason I did ask is because when you get one large and a bunch of smaller groups, the load-balancing 'pull' is relatively smaller to the large groups.
That is, IIRC should_we_balance() ensures only 1 CPU out of the group continues the load-balancing pass. So if, for example, we have one group of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull 1/2 times, while the group of 4 CPUs will pull 1/4 times.
By making sure all groups are of the same level, and thus of equal size, this doesn't happen.
So I hacked something that tries to do this, with the notable exception that it doesn't change the way the local group is generated. Breaking the assumption that the local group always spans the child domain doesn't sound like the best thing to do.
Anywho, the below makes it so all !local NUMA groups are built using the same sched_domain_topology_level. Some of it is absolutely disgusting (esp. the sched_domains_curr_level stuff), I didn't bother with handling domain degeneration yet, and I trigger the WARN_ON in get_group()... But at least the topology gets built!
AFAICT Barry's topology is handled the same. On that sunfire topology, it pretty much turns all remote groups into groups spanning a single node. That does almost double the number of groups for some domains, compared to Barry's current v3.
If that is really a route we want to go down, I'll try to clean the below.
Hi Valentin,
I understand Peter's concern is that the local group has a different size from the remote groups. Is this patch resolving Peter's concern? To me, it seems not :-)
Though I don't understand why different group sizes would be harmful, since all groups calculate avg_load and group_type based on their own capacities. Thus, for a smaller group, its capacity would be smaller.
Is it because a bigger group has relatively less chance to pull, so load balancing will be completed more slowly while small groups have high load?
Thanks Barry
Hi Barry,
On 18/02/21 09:17, Song Bao Hua (Barry Song) wrote:
Hi Valentin,
I understand Peter's concern is that the local group has different size with remote groups. Is this patch resolving Peter's concern? To me, it seems not :-)
If you remove the '&& i != cpu' in build_overlap_sched_groups() you get that, but then you also get some extra warnings :-)
Now yes, should_we_balance() only matters for the local group. However I'm somewhat wary of messing with the local groups; for one it means you would have more than one tl now accessing the same sgc->next_update, sgc->{min, max}capacity, sgc->group_imbalance (as Vincent had pointed out).
By ensuring only remote (i.e. !local) groups are modified (which is what your patch does), we absolve ourselves of this issue, which is why I prefer this approach ATM.
Though I don’t understand why different group sizes will be harmful since all groups are calculating avg_load and group_type based on their own capacities. Thus, for a smaller group, its capacity would be smaller.
Is it because a bigger group has relatively less chance to pull, so load balancing will be completed more slowly while small groups have high load?
Peter's point is that, if at a given tl you have groups that look like
g0: 0-4, g1: 5-6, g2: 7-8
Then g0 is half as likely to pull tasks with load_balance() than g1 or g2 (due to the group size vs should_we_balance())
However, I suppose one "trick" to be aware of here is that since your patch *doesn't* change the local group, we do have e.g. on CPU0:
[ 0.374840]  domain-2: span=0-5 level=NUMA
[ 0.375054]   groups: 0:{ span=0-3 cap=4003 }, 4:{ span=4-5 cap=1988 }
*but* on CPU4 we get:
[ 0.387019]  domain-2: span=0-1,4-7 level=NUMA
[ 0.387211]   groups: 4:{ span=4-7 cap=3984 }, 0:{ span=0-1 cap=2013 }
IOW, at a given tl, all *local* groups have /roughly/ the same size and thus similar pull probability (it took me writing this mail to see it that way). So perhaps this is all fine already?
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Friday, February 19, 2021 1:41 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Peter Zijlstra peterz@infradead.org
Cc: vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org; linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Meelis Roos mroos@linux.ee
Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
Hi Barry,
On 18/02/21 09:17, Song Bao Hua (Barry Song) wrote:
Hi Valentin,
I understand Peter's concern is that the local group has different size with remote groups. Is this patch resolving Peter's concern? To me, it seems not :-)
If you remove the '&& i != cpu' in build_overlap_sched_groups() you get that, but then you also get some extra warnings :-)
Now yes, should_we_balance() only matters for the local group. However I'm somewhat wary of messing with the local groups; for one it means you would have more than one tl now accessing the same sgc->next_update, sgc->{min, max}capacity, sgc->group_imbalance (as Vincent had pointed out).
By ensuring only remote (i.e. !local) groups are modified (which is what your patch does), we absolve ourselves of this issue, which is why I prefer this approach ATM.
Yep. The grandchild approach still seems to be the feasible way for the moment.
Though I don’t understand why different group sizes will be harmful since all groups are calculating avg_load and group_type based on their own capacities. Thus, for a smaller group, its capacity would be smaller.
Is it because a bigger group has relatively less chance to pull, so load balancing will be completed more slowly while small groups have high load?
Peter's point is that, if at a given tl you have groups that look like
g0: 0-4, g1: 5-6, g2: 7-8
Then g0 is half as likely to pull tasks with load_balance() than g1 or g2 (due to the group size vs should_we_balance())
Yep. The difference is that g1 and g2 won't be local groups of any CPU in this tl. The smaller groups g1 and g2 are only remote groups, so should_we_balance() doesn't matter for them here.
However, I suppose one "trick" to be aware of here is that since your patch *doesn't* change the local group, we do have e.g. on CPU0:
[ 0.374840]  domain-2: span=0-5 level=NUMA
[ 0.375054]   groups: 0:{ span=0-3 cap=4003 }, 4:{ span=4-5 cap=1988 }
*but* on CPU4 we get:
[ 0.387019]  domain-2: span=0-1,4-7 level=NUMA
[ 0.387211]   groups: 4:{ span=4-7 cap=3984 }, 0:{ span=0-1 cap=2013 }
IOW, at a given tl, all *local* groups have /roughly/ the same size and thus similar pull probability (it took me writing this mail to see it that way). So perhaps this is all fine already?
Yep. Since should_we_balance() only matters for local groups, and we haven't changed the size of local groups in the grandchild approach, all local groups still get a similar pull probability at this topology level.
Since we still prefer the grandchild approach ATM, and if Peter has no more concerns about the unequal size of local and remote groups, I would be glad to send a v4 of the grandchild approach, rewriting the changelog to explain the update issue of sgc->next_update, sgc->{min,max}_capacity and sgc->group_imbalance that Vincent pointed out, and to describe that the local groups are not touched and thus remain equal in size.
Thanks Barry