As long as NUMA diameter > 2, building the sched_domain from the sibling's child domain will definitely create a sched_domain containing a sched_group which spans out of the sched_domain:
               +------+         +------+        +-------+       +------+
               | node |  12     | node |  20    | node  |  12   | node |
               |  0   +---------+  1   +--------+  2    +-------+  3   |
               +------+         +------+        +-------+       +------+

domain0        node0            node1           node2           node3

domain1        node0+1          node0+1         node2+3         node2+3
                                                   +
domain2        node0+1+2                           |
         group: node0+1                            |
         group: node2+3  <-------------------------+
When node2 is added into the domain2 of node0, the kernel uses the child domain of node2's domain2, which is domain1 (node2+3). Node3 is thus outside the span of the domain covering node0+1+2.
This makes load_balance() run based on the avg_load of a sched_group spanning outside the sched_domain, and it also makes select_task_rq_fair() pick an idle CPU outside the sched_domain.
Real servers which suffer from this problem include at least Kunpeng920 and the 8-node Sun Fire X4600-M2.
Here we move to use the *child* domain of the *child* domain of node2's domain2 to build the sched_group.
               +------+         +------+        +-------+       +------+
               | node |  12     | node |  20    | node  |  12   | node |
               |  0   +---------+  1   +--------+  2    +-------+  3   |
               +------+         +------+        +-------+       +------+

domain0        node0            node1        +- node2           node3
                                             |
domain1        node0+1          node0+1      |  node2+3         node2+3
                                             |
domain2        node0+1+2                     |
         group: node0+1                      |
         group: node2  <---------------------+
A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2 for the sched_group generated by grandchild, otherwise, when this CPU becomes the balance_cpu of another sched_group of CPUs outside node0, our sched_group generated by grandchild will share the same sgc as the sched_group generated by child of another CPU.

So in init_overlap_sched_group(), the sgc's capacity would be overwritten:

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first_and(sched_group_span(sg), mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

And the WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) in it will also be triggered:

static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg)
{
	...
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
	...
}

So here we move to use the sgc of the 2nd CPU. For the corner case where a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE. But it is really unlikely for a real machine to have a NUMA node with only one CPU.
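To make the collision concrete, here is a minimal user-space sketch (not kernel code) of why two groups that resolve to the same 1st CPU alias one sgc. It assumes sdd->sgc can be modelled as a per-CPU slot keyed only by (topology level, cpu); the names domain2_sgc and pick_sgc are made up for the illustration:

#include <stdio.h>

/* toy stand-in for struct sched_group_capacity */
struct sgc {
	int ref;
	unsigned long capacity;
};

/* one slot per CPU at one topology level, like *per_cpu_ptr(sdd->sgc, cpu) */
static struct sgc domain2_sgc[8];

static struct sgc *pick_sgc(int first_cpu)
{
	return &domain2_sgc[first_cpu];
}

int main(void)
{
	/* group built from node2's grandchild domain (span 4-5): 1st cpu is 4 */
	struct sgc *from_grandchild = pick_sgc(4);
	/* group built from node2's child domain (span 4-7): 1st cpu is also 4 */
	struct sgc *from_child = pick_sgc(4);

	/* same (level, cpu) key, so the two groups share one sgc */
	printf("aliased: %s\n", from_grandchild == from_child ? "yes" : "no");
	return 0;
}

Shifting the grandchild-generated group to the 2nd CPU changes the key and breaks the aliasing - except when the span contains a single CPU, which is exactly the corner case above.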
Tested with the below topology:

qemu-system-aarch64 -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
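(A note on reproducing the dumps below: the "attaching sched-domain(s)" lines come from sched_domain_debug(), which - if I remember the 5.11 behaviour correctly - requires CONFIG_SCHED_DEBUG=y and the sched_debug boot parameter, e.g. adding -append "sched_debug" to the qemu command above.)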
w/o patch, we get lots of "groups don't span domain->span":

[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193]  domain-0: span=0-1 level=MC
[ 0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693]   domain-1: span=0-3 level=NUMA
[ 0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811]    domain-2: span=0-5 level=NUMA
[ 0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058]     domain-3: span=0-7 level=NUMA
[ 0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804055] CPU1 attaching sched-domain(s):
[ 0.804072]  domain-0: span=0-1 level=MC
[ 0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[ 0.804152]   domain-1: span=0-3 level=NUMA
[ 0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.804219]    domain-2: span=0-5 level=NUMA
[ 0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.804302] ERROR: groups don't span domain->span
[ 0.804520]     domain-3: span=0-7 level=NUMA
[ 0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[ 0.804677] CPU2 attaching sched-domain(s):
[ 0.804687]  domain-0: span=2-3 level=MC
[ 0.804705]   groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
[ 0.804754]   domain-1: span=0-3 level=NUMA
[ 0.804772]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.804820]    domain-2: span=0-5 level=NUMA
[ 0.804836]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.804944] ERROR: groups don't span domain->span
[ 0.805108]     domain-3: span=0-7 level=NUMA
[ 0.805134]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805223] CPU3 attaching sched-domain(s):
[ 0.805232]  domain-0: span=2-3 level=MC
[ 0.805249]   groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
[ 0.805319]   domain-1: span=0-3 level=NUMA
[ 0.805336]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
[ 0.805383]    domain-2: span=0-5 level=NUMA
[ 0.805399]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
[ 0.805458] ERROR: groups don't span domain->span
[ 0.805605]     domain-3: span=0-7 level=NUMA
[ 0.805626]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
[ 0.805712] CPU4 attaching sched-domain(s):
[ 0.805721]  domain-0: span=4-5 level=MC
[ 0.805738]   groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
[ 0.805787]   domain-1: span=4-7 level=NUMA
[ 0.805803]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.805851]    domain-2: span=0-1,4-7 level=NUMA
[ 0.805867]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.805915] ERROR: groups don't span domain->span
[ 0.806108]     domain-3: span=0-7 level=NUMA
[ 0.806130]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.806214] CPU5 attaching sched-domain(s):
[ 0.806222]  domain-0: span=4-5 level=MC
[ 0.806240]   groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
[ 0.806841]   domain-1: span=4-7 level=NUMA
[ 0.806866]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
[ 0.806934]    domain-2: span=0-1,4-7 level=NUMA
[ 0.806953]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
[ 0.807004] ERROR: groups don't span domain->span
[ 0.807312]     domain-3: span=0-7 level=NUMA
[ 0.807386]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
[ 0.807686] CPU6 attaching sched-domain(s):
[ 0.807710]  domain-0: span=6-7 level=MC
[ 0.807750]   groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
[ 0.807840]   domain-1: span=4-7 level=NUMA
[ 0.807870]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.807952]    domain-2: span=0-1,4-7 level=NUMA
[ 0.807985]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.808045] ERROR: groups don't span domain->span
[ 0.808257]     domain-3: span=0-7 level=NUMA
[ 0.808571]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
[ 0.808848] CPU7 attaching sched-domain(s):
[ 0.808860]  domain-0: span=6-7 level=MC
[ 0.808880]   groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
[ 0.808953]   domain-1: span=4-7 level=NUMA
[ 0.808974]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
[ 0.809034]    domain-2: span=0-1,4-7 level=NUMA
[ 0.809055]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
[ 0.809128] ERROR: groups don't span domain->span
[ 0.810361]     domain-3: span=0-7 level=NUMA
[ 0.810400]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }
w/ patch, we don't get "groups don't span domain->span" any more:

[ 0.868907] CPU0 attaching sched-domain(s):
[ 0.868962]  domain-0: span=0-1 level=MC
[ 0.869179]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=983 }
[ 0.869405]   domain-1: span=0-3 level=NUMA
[ 0.869438]    groups: 0:{ span=0-1 cap=1996 }, 2:{ span=2-3 cap=2006 }
[ 0.869542]    domain-2: span=0-5 level=NUMA
[ 0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
[ 0.869603]     domain-3: span=0-7 level=NUMA
[ 0.869618]      groups: 0:{ span=0-5 mask=0-1 cap=5980 }, 6:{ span=4-7 mask=6-7 cap=4016 }
[ 0.870303] CPU1 attaching sched-domain(s):
[ 0.870314]  domain-0: span=0-1 level=MC
[ 0.870334]   groups: 1:{ span=1 cap=983 }, 0:{ span=0 cap=1013 }
[ 0.870381]   domain-1: span=0-3 level=NUMA
[ 0.870396]    groups: 0:{ span=0-1 cap=1996 }, 2:{ span=2-3 cap=2006 }
[ 0.870440]    domain-2: span=0-5 level=NUMA
[ 0.870454]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
[ 0.870507]     domain-3: span=0-7 level=NUMA
[ 0.870530]      groups: 0:{ span=0-5 mask=0-1 cap=5980 }, 6:{ span=4-7 mask=6-7 cap=4016 }
[ 0.870611] CPU2 attaching sched-domain(s):
[ 0.870619]  domain-0: span=2-3 level=MC
[ 0.870634]   groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=999 }
[ 0.870677]   domain-1: span=0-3 level=NUMA
[ 0.870691]    groups: 2:{ span=2-3 cap=2006 }, 0:{ span=0-1 cap=1996 }
[ 0.870734]    domain-2: span=0-5 level=NUMA
[ 0.870748]     groups: 2:{ span=0-3 mask=2-3 cap=4054 }, 5:{ span=4-5 cap=2048 }
[ 0.870795]     domain-3: span=0-7 level=NUMA
[ 0.870809]      groups: 2:{ span=0-5 mask=2-3 cap=6032 }, 6:{ span=0-1,4-7 mask=6-7 cap=6064 }
[ 0.870913] CPU3 attaching sched-domain(s):
[ 0.870921]  domain-0: span=2-3 level=MC
[ 0.870936]   groups: 3:{ span=3 cap=999 }, 2:{ span=2 cap=1007 }
[ 0.870979]   domain-1: span=0-3 level=NUMA
[ 0.870993]    groups: 2:{ span=2-3 cap=2006 }, 0:{ span=0-1 cap=1996 }
[ 0.871035]    domain-2: span=0-5 level=NUMA
[ 0.871049]     groups: 2:{ span=0-3 mask=2-3 cap=4054 }, 5:{ span=4-5 cap=2048 }
[ 0.871096]     domain-3: span=0-7 level=NUMA
[ 0.871110]      groups: 2:{ span=0-5 mask=2-3 cap=6032 }, 6:{ span=0-1,4-7 mask=6-7 cap=6064 }
[ 0.871177] CPU4 attaching sched-domain(s):
[ 0.871185]  domain-0: span=4-5 level=MC
[ 0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
[ 0.871243]   domain-1: span=4-7 level=NUMA
[ 0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
[ 0.871300]    domain-2: span=0-1,4-7 level=NUMA
[ 0.871314]     groups: 4:{ span=4-7 cap=3946 }, 1:{ span=0-1 cap=2048 }
[ 0.871356]     domain-3: span=0-7 level=NUMA
[ 0.871370]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5994 }, 2:{ span=0-3 mask=2-3 cap=4054 }
[ 0.871436] CPU5 attaching sched-domain(s):
[ 0.871443]  domain-0: span=4-5 level=MC
[ 0.871457]   groups: 5:{ span=5 cap=1001 }, 4:{ span=4 cap=977 }
[ 0.871512]   domain-1: span=4-7 level=NUMA
[ 0.871893]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
[ 0.871949]    domain-2: span=0-1,4-7 level=NUMA
[ 0.871966]     groups: 4:{ span=4-7 cap=3946 }, 1:{ span=0-1 cap=2048 }
[ 0.872010]     domain-3: span=0-7 level=NUMA
[ 0.872025]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5994 }, 2:{ span=0-3 mask=2-3 cap=4054 }
[ 0.872115] CPU6 attaching sched-domain(s):
[ 0.872123]  domain-0: span=6-7 level=MC
[ 0.872139]   groups: 6:{ span=6 cap=993 }, 7:{ span=7 cap=975 }
[ 0.872186]   domain-1: span=4-7 level=NUMA
[ 0.872202]    groups: 6:{ span=6-7 cap=1968 }, 4:{ span=4-5 cap=1978 }
[ 0.872246]    domain-2: span=0-1,4-7 level=NUMA
[ 0.872260]     groups: 6:{ span=4-7 mask=6-7 cap=4016 }, 1:{ span=0-1 cap=2048 }
[ 0.872309]     domain-3: span=0-7 level=NUMA
[ 0.872323]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6064 }, 2:{ span=0-5 mask=2-3 cap=6032 }
[ 0.872392] CPU7 attaching sched-domain(s):
[ 0.872399]  domain-0: span=6-7 level=MC
[ 0.872414]   groups: 7:{ span=7 cap=975 }, 6:{ span=6 cap=993 }
[ 0.872458]   domain-1: span=4-7 level=NUMA
[ 0.872472]    groups: 6:{ span=6-7 cap=1968 }, 4:{ span=4-5 cap=1978 }
[ 0.872662]    domain-2: span=0-1,4-7 level=NUMA
[ 0.872685]     groups: 6:{ span=4-7 mask=6-7 cap=4016 }, 1:{ span=0-1 cap=2048 }
[ 0.872737]     domain-3: span=0-7 level=NUMA
[ 0.872752]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6064 }, 2:{ span=0-5 mask=2-3 cap=6032 }
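Comparing the two logs, the visible effect of the patch is on domain-2: CPU0 now gets 5:{ span=4-5 } - node2 alone, built from the grandchild domain, with the group ID taken from the 2nd CPU of the span as explained above - instead of the offending 4:{ span=4-7 } which leaked node3 outside the domain. Note also that the new group's cap=2048 appears to be just twice the default SCHED_CAPACITY_SCALE of 1024; that is the stale-capacity issue raised later in the thread.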
Reported-by: Valentin Schneider valentin.schneider@arm.com
Tested-by: Meelis Roos mroos@linux.ee
Signed-off-by: Barry Song song.bao.hua@hisilicon.com
---
Differences with RFC v2:
 * added Tested-by of Meelis Roos for the fixed "8-node Sun Fire X4600-M2"
 * removed the hacking code in balance_mask and should_we_balance()
 * removed the redundant "from_grandchild" field from sched_group

The patch is based on 5.11-rc6.
 kernel/sched/topology.c | 83 ++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5d3675c7a76b..100feb2fd8a0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -723,35 +723,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	for (tmp = sd; tmp; tmp = tmp->parent)
 		numa_distance += !!(tmp->flags & SD_NUMA);
 
-	/*
-	 * FIXME: Diameter >=3 is misrepresented.
-	 *
-	 * Smallest diameter=3 topology is:
-	 *
-	 *   node   0  1  2  3
-	 *     0:  10 20 30 40
-	 *     1:  20 10 20 30
-	 *     2:  30 20 10 20
-	 *     3:  40 30 20 10
-	 *
-	 *   0 --- 1 --- 2 --- 3
-	 *
-	 * NUMA-3       0-3             N/A             N/A             0-3
-	 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
-	 *
-	 * NUMA-2       0-2             0-3             0-3             1-3
-	 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
-	 *
-	 * NUMA-1       0-1             0-2             1-3             2-3
-	 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
-	 *
-	 * NUMA-0       0               1               2               3
-	 *
-	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
-	 * group span isn't a subset of the domain span.
-	 */
-	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
-
 	sched_domain_debug(sd, cpu);
 
 	rq_attach_root(rq, rd);
@@ -916,6 +887,11 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
 		if (!sibling->child)
 			continue;
 
+		while (sibling->child &&
+		       !cpumask_subset(sched_domain_span(sibling->child),
+				       sched_domain_span(sd)))
+			sibling = sibling->child;
+
 		/* If we would not end up here, we can't continue from here */
 		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
 			continue;
@@ -955,7 +931,8 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 }
 
 static void init_overlap_sched_group(struct sched_domain *sd,
-				     struct sched_group *sg)
+				     struct sched_group *sg,
+				     int from_grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
 	struct sd_data *sdd = sd->private;
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
+	/*
+	 * for the group generated by grandchild, use the sgc of 2nd cpu
+	 * because the 1st cpu might be used by another sched_group
+	 */
+	if (from_grandchild && cpumask_weight(mask) > 1)
+		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -996,6 +979,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
+		int from_grandchild = 0;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1015,6 +999,43 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
 			continue;
 
+		/*
+		 * for NUMA diameter >= 3, build sched_domain by sibling's
+		 * child's child domain to prevent sched_group from spanning
+		 * out of the sched_domain.
+		 * if we don't do this, diameter >= 3 is misrepresented:
+		 *
+		 * Smallest diameter=3 topology is:
+		 *
+		 *   node   0  1  2  3
+		 *     0:  10 20 30 40
+		 *     1:  20 10 20 30
+		 *     2:  30 20 10 20
+		 *     3:  40 30 20 10
+		 *
+		 *   0 --- 1 --- 2 --- 3
+		 *
+		 * NUMA-3       0-3             N/A             N/A             0-3
+		 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
+		 *
+		 * NUMA-2       0-2             0-3             0-3             1-3
+		 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
+		 *
+		 * NUMA-1       0-1             0-2             1-3             2-3
+		 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
+		 *
+		 * NUMA-0       0               1               2               3
+		 *
+		 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
+		 * group span isn't a subset of the domain span.
+		 */
+		while (sibling->child &&
+		       !cpumask_subset(sched_domain_span(sibling->child),
+				       span)) {
+			sibling = sibling->child;
+			from_grandchild = 1;
+		}
+
 		sg = build_group_from_child_sched_domain(sibling, cpu);
 		if (!sg)
 			goto fail;
@@ -1022,7 +1043,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg);
+		init_overlap_sched_group(sd, sg, from_grandchild);
 
 		if (!first)
 			first = sg;
Hi Barry,
I love your patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20210125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Barry-Song/sched-topology-fix-the-i...
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 7a976f77bb962ce9486e09eb839aa135619b54f3
config: i386-randconfig-s001-20210201 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-215-g0fb77bb6-dirty
        # https://github.com/0day-ci/linux/commit/a8f7eae6bc5ac37efd8cbe15e0a388127619...
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Barry-Song/sched-topology-fix-the-issue-groups-don-t-span-domain-span-for-NUMA-diameter-2/20210201-114807
        git checkout a8f7eae6bc5ac37efd8cbe15e0a388127619f1b0
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot lkp@intel.com
"sparse warnings: (new ones prefixed by >>)" kernel/sched/topology.c:106:56: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:106:56: sparse: expected struct sched_domain *sd kernel/sched/topology.c:106:56: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:125:60: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:125:60: sparse: expected struct sched_domain *sd kernel/sched/topology.c:125:60: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:148:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:148:20: sparse: expected struct sched_domain *sd kernel/sched/topology.c:148:20: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:431:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct perf_domain *[assigned] tmp @@ got struct perf_domain [noderef] __rcu *pd @@ kernel/sched/topology.c:431:13: sparse: expected struct perf_domain *[assigned] tmp kernel/sched/topology.c:431:13: sparse: got struct perf_domain [noderef] __rcu *pd kernel/sched/topology.c:440:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct perf_domain *[assigned] tmp @@ got struct perf_domain [noderef] __rcu *pd @@ kernel/sched/topology.c:440:13: sparse: expected struct perf_domain *[assigned] tmp kernel/sched/topology.c:440:13: sparse: got struct perf_domain [noderef] __rcu *pd kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct perf_domain *[assigned] pd @@ got struct perf_domain [noderef] __rcu *pd @@ kernel/sched/topology.c:461:19: sparse: expected struct perf_domain *[assigned] pd kernel/sched/topology.c:461:19: sparse: got struct perf_domain [noderef] __rcu *pd kernel/sched/topology.c:623:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:623:49: sparse: expected struct sched_domain *parent kernel/sched/topology.c:623:49: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:695:50: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:695:50: sparse: expected struct sched_domain *parent kernel/sched/topology.c:695:50: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:702:55: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@ got struct sched_domain *[assigned] tmp @@ kernel/sched/topology.c:702:55: sparse: expected struct sched_domain [noderef] __rcu *[noderef] __rcu child kernel/sched/topology.c:702:55: sparse: got struct sched_domain *[assigned] tmp kernel/sched/topology.c:712:29: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@ 
kernel/sched/topology.c:712:29: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:712:29: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:717:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:717:20: sparse: expected struct sched_domain *sd kernel/sched/topology.c:717:20: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:723:33: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:723:33: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:723:33: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:729:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *sd @@ kernel/sched/topology.c:729:13: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:729:13: sparse: got struct sched_domain [noderef] __rcu *sd kernel/sched/topology.c:891:66: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:891:66: sparse: expected struct sched_domain *sd kernel/sched/topology.c:891:66: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:893:33: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sibling @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:893:33: sparse: expected struct sched_domain *[assigned] sibling kernel/sched/topology.c:893:33: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:896:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:896:70: sparse: expected struct sched_domain *sd kernel/sched/topology.c:896:70: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:925:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:925:59: sparse: expected struct sched_domain *sd kernel/sched/topology.c:925:59: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1033:65: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1033:65: sparse: expected struct sched_domain *sd kernel/sched/topology.c:1033:65: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1035:33: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sibling @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1035:33: sparse: expected struct sched_domain *[assigned] sibling kernel/sched/topology.c:1035:33: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1140:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1140:40: sparse: expected struct sched_domain *child kernel/sched/topology.c:1140:40: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1441:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain [noderef] __rcu *child @@ got struct sched_domain *child @@ kernel/sched/topology.c:1441:43: sparse: expected struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1441:43: sparse: got struct sched_domain *child kernel/sched/topology.c:1926:31: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *parent @@ got struct sched_domain *sd @@ kernel/sched/topology.c:1926:31: sparse: expected struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:1926:31: sparse: got struct sched_domain *sd kernel/sched/topology.c:2094:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2094:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2094:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2111:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2111:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2111:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:59:25: sparse: sparse: dereference of noderef expression 
kernel/sched/topology.c:64:25: sparse: sparse: dereference of noderef expression kernel/sched/topology.c: note: in included file: kernel/sched/sched.h:1455:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1455:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1455:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1468:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1468:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1468:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1455:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1455:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1455:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1468:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1468:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1468:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:1456:19: sparse: sparse: dereference of noderef expression
vim +893 kernel/sched/topology.c
   762	
   763	
   764	/*
   765	 * NUMA topology (first read the regular topology blurb below)
   766	 *
   767	 * Given a node-distance table, for example:
   768	 *
   769	 *   node   0   1   2   3
   770	 *     0:  10  20  30  20
   771	 *     1:  20  10  20  30
   772	 *     2:  30  20  10  20
   773	 *     3:  20  30  20  10
   774	 *
   775	 * which represents a 4 node ring topology like:
   776	 *
   777	 *   0 ----- 1
   778	 *   |       |
   779	 *   |       |
   780	 *   |       |
   781	 *   3 ----- 2
   782	 *
   783	 * We want to construct domains and groups to represent this. The way we go
   784	 * about doing this is to build the domains on 'hops'. For each NUMA level we
   785	 * construct the mask of all nodes reachable in @level hops.
   786	 *
   787	 * For the above NUMA topology that gives 3 levels:
   788	 *
   789	 * NUMA-2	0-3		0-3		0-3		0-3
   790	 *  groups:	{0-1,3},{1-3}	{0-2},{0,2-3}	{1-3},{0-1,3}	{0,2-3},{0-2}
   791	 *
   792	 * NUMA-1	0-1,3		0-2		1-3		0,2-3
   793	 *  groups:	{0},{1},{3}	{0},{1},{2}	{1},{2},{3}	{0},{2},{3}
   794	 *
   795	 * NUMA-0	0		1		2		3
   796	 *
   797	 *
   798	 * As can be seen; things don't nicely line up as with the regular topology.
   799	 * When we iterate a domain in child domain chunks some nodes can be
   800	 * represented multiple times -- hence the "overlap" naming for this part of
   801	 * the topology.
   802	 *
   803	 * In order to minimize this overlap, we only build enough groups to cover the
   804	 * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
   805	 *
   806	 * Because:
   807	 *
   808	 *  - the first group of each domain is its child domain; this
   809	 *    gets us the first 0-1,3
   810	 *  - the only uncovered node is 2, who's child domain is 1-3.
   811	 *
   812	 * However, because of the overlap, computing a unique CPU for each group is
   813	 * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both
   814	 * groups include the CPUs of Node-0, while those CPUs would not in fact ever
   815	 * end up at those groups (they would end up in group: 0-1,3).
   816	 *
   817	 * To correct this we have to introduce the group balance mask. This mask
   818	 * will contain those CPUs in the group that can reach this group given the
   819	 * (child) domain tree.
   820	 *
   821	 * With this we can once again compute balance_cpu and sched_group_capacity
   822	 * relations.
   823	 *
   824	 * XXX include words on how balance_cpu is unique and therefore can be
   825	 * used for sched_group_capacity links.
   826	 *
   827	 *
   828	 * Another 'interesting' topology is:
   829	 *
   830	 *   node   0   1   2   3
   831	 *     0:  10  20  20  30
   832	 *     1:  20  10  20  20
   833	 *     2:  20  20  10  20
   834	 *     3:  30  20  20  10
   835	 *
   836	 * Which looks a little like:
   837	 *
   838	 *   0 ----- 1
   839	 *   |     / |
   840	 *   |   /   |
   841	 *   | /     |
   842	 *   2 ----- 3
   843	 *
   844	 * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3
   845	 * are not.
   846	 *
   847	 * This leads to a few particularly weird cases where the sched_domain's are
   848	 * not of the same number for each CPU. Consider:
   849	 *
   850	 * NUMA-2	0-3						0-3
   851	 *  groups:	{0-2},{1-3}					{1-3},{0-2}
   852	 *
   853	 * NUMA-1	0-2		0-3		0-3		1-3
   854	 *
   855	 * NUMA-0	0		1		2		3
   856	 *
   857	 */
   858	
   859	
   860	/*
   861	 * Build the balance mask; it contains only those CPUs that can arrive at this
   862	 * group and should be considered to continue balancing.
   863	 *
   864	 * We do this during the group creation pass, therefore the group information
   865	 * isn't complete yet, however since each group represents a (child) domain we
   866	 * can fully construct this using the sched_domain bits (which are already
   867	 * complete).
   868	 */
   869	static void
   870	build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
   871	{
   872		const struct cpumask *sg_span = sched_group_span(sg);
   873		struct sd_data *sdd = sd->private;
   874		struct sched_domain *sibling;
   875		int i;
   876	
   877		cpumask_clear(mask);
   878	
   879		for_each_cpu(i, sg_span) {
   880			sibling = *per_cpu_ptr(sdd->sd, i);
   881	
   882			/*
   883			 * Can happen in the asymmetric case, where these siblings are
   884			 * unused. The mask will not be empty because those CPUs that
   885			 * do have the top domain _should_ span the domain.
   886			 */
   887			if (!sibling->child)
   888				continue;
   889	
   890			while (sibling->child &&
   891			       !cpumask_subset(sched_domain_span(sibling->child),
   892					       sched_domain_span(sd)))
 > 893			sibling = sibling->child;
   894	
   895			/* If we would not end up here, we can't continue from here */
   896			if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
   897				continue;
   898	
   899			cpumask_set_cpu(i, mask);
   900		}
   901	
   902		/* We must not have empty masks here */
   903		WARN_ON_ONCE(cpumask_empty(mask));
   904	}
   905	
--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi,
On 01/02/21 16:38, Barry Song wrote:
A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2 for the sched_group generated by grandchild, otherwise, when this CPU becomes the balance_cpu of another sched_group of CPUs outside node0, our sched_group generated by grandchild will share the same sgc as the sched_group generated by child of another CPU.

So in init_overlap_sched_group(), the sgc's capacity would be overwritten:

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first_and(sched_group_span(sg), mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

And the WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) in it will also be triggered:

static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg)
{
	...
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
	...
}

So here we move to use the sgc of the 2nd CPU. For the corner case where a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE. But it is really unlikely for a real machine to have a NUMA node with only one CPU.
Well, it's trivial to boot this with QEMU, and it's actually the example the comment atop that WARN_ONCE() is based on. Also, you could end up with a single CPU on a node during hotplug operations...
I am not entirely sure whether having more than one CPU per node is a sufficient condition. I'm starting to *think* it is, but I'm not entirely convinced yet - and now I need a new notebook.
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Tuesday, February 2, 2021 7:11 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Meelis Roos mroos@linux.ee
Subject: Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
Hi,
On 01/02/21 16:38, Barry Song wrote:
A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2 for the sched_group generated by grandchild, otherwise, when this CPU becomes the balance_cpu of another sched_group of CPUs outside node0, our sched_group generated by grandchild will share the same sgc as the sched_group generated by child of another CPU.

So in init_overlap_sched_group(), the sgc's capacity would be overwritten:

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first_and(sched_group_span(sg), mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

And the WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) in it will also be triggered:

static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg)
{
	...
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
	...
}

So here we move to use the sgc of the 2nd CPU. For the corner case where a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE. But it is really unlikely for a real machine to have a NUMA node with only one CPU.
Well, it's trivial to boot this with QEMU, and it's actually the example the comment atop that WARN_ONCE() is based on. Also, you could end up with a single CPU on a node during hotplug operations...
Hi Valentin,
The qemu topology is just a reflection of the real Kunpeng920 case, and pls also note Meelis has also tested on another real machine, the 8-node Sun Fire X4600-M2, and gave the Tested-by.
It might not be a perfect fix, but it is the simplest way to fix this for the moment and for real cases. A "perfect" fix would require major refactoring of topology.c.
I don't think hotplug is much relevant here, as even if some CPUs are unplugged and only one CPU is left in the sched_group of the sched_domain, the related domain and group still get the right settings.
On the other hand, the corner case could literally be fixed, but some very ugly code would get involved. I mean, two sched_groups can end up using the same sgc:
1. the sched_group generated by grandchild with only one NUMA node
2. the sched_group generated by child with more than one NUMA node
Right now, I'm moving to the 2nd CPU for sched_group 1; if we moved to the 2nd CPU for sched_group 2 as well, then having only one CPU in a NUMA node wouldn't be a problem anymore. But the code would be very ugly. So I would prefer to keep this assumption and just ignore the unrealistic corner case.
I am not entirely sure whether having more than one CPU per node is a sufficient condition. I'm starting to *think* it is, but I'm not entirely convinced yet - and now I need a new notebook.
Me too. Some extremely complicated topology might break the assumption. I'd really need a new notebook to draw the kind of complicated topology that would break it :-)
But it is sufficient for the existing real cases which need fixing. When someday a real case in which each NUMA node has more than one CPU wakes up the below warning: WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)), it might be the right time to consider a major refactoring of topology.c.
Thanks Barry
On 01/02/21 16:38, Barry Song wrote:
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
+	/*
+	 * for the group generated by grandchild, use the sgc of 2nd cpu
+	 * because the 1st cpu might be used by another sched_group
+	 */
+	if (from_grandchild && cpumask_weight(mask) > 1)
+		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
So you are getting a (hopefully) unique ID for this group span at this given topology level (i.e. sd->private) but as I had stated in that list of issues, this creates an sgc that isn't attached to the local group of any sched_domain, and thus won't get its capacity values updated.
This can actually be seen via the capacity values you're getting at build time:
[ 0.868907] CPU0 attaching sched-domain(s):
...
[ 0.869542]    domain-2: span=0-5 level=NUMA
[ 0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
                                                      ^^^^^^^^^^^^^^^^
[ 0.871177] CPU4 attaching sched-domain(s):
...
[ 0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
[ 0.871243]   domain-1: span=4-7 level=NUMA
[ 0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
                           ^^^^^^^^^^^^^^^^
IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc of CPU4-domain1-group4. I've done that in the below diff - this gives us groups with sgc's owned at lower topology levels, but this will only ever be true for non-local groups. This has the added benefit of working with single-CPU nodes. Briefly tested on your topology and the sunfire's (via QEMU), and I didn't get screamed at.
Before the fun police comes and impounds my keyboard, I'd like to point out that we could leverage this cross-level sgc referencing hack to further change the NUMA domains and pretty much get rid of overlapping groups (that's what I was fumbling with in [1]).
[1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@arm.com
That is, rather than building overlapping groups and fixing them whenever that breaks (distance > 2), we could have: - the local group being the child domain's span (as always) - all non-local NUMA groups spanning a single node each, with the right sgc cross-referencing.
Thoughts?
--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 
 static void init_overlap_sched_group(struct sched_domain *sd,
 				     struct sched_group *sg,
-				     int from_grandchild)
+				     struct sched_domain *grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
 	struct cpumask *sg_span;
 	int cpu;
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
-	/*
-	 * for the group generated by grandchild, use the sgc of 2nd cpu
-	 * because the 1st cpu might be used by another sched_group
-	 */
-	if (from_grandchild && cpumask_weight(mask) > 1)
-		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
-		int from_grandchild = 0;
+		bool from_grandchild = false;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 			  !cpumask_subset(sched_domain_span(sibling->child),
 					  span)) {
 			sibling = sibling->child;
-			from_grandchild = 1;
+			from_grandchild = true;
 		}
 
 		sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, from_grandchild);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
 
 		if (!first)
 			first = sg;
On 02/02/21 15:17, Valentin Schneider wrote:
That is, rather than building overlapping groups and fixing them whenever that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc cross-referencing.
Thoughts?
Hmph I'm thinking this can be broken by domain degeneration, as nothing will come fix up the ->sgc link and we'd be back to having a reference to a group that doesn't get updated...
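(Background for the term: "degeneration" here refers to the pass in cpu_attach_domain() that uses sd_degenerate()/sd_parent_degenerate() to delete domain levels which add nothing - e.g. a parent with only a single group - and re-link the remaining levels. If the level whose local group owns the cross-referenced sgc is deleted that way, no local group is left to refresh that sgc's capacity, which is the concern above.)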
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Wednesday, February 3, 2021 4:17 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Meelis Roos mroos@linux.ee
Subject: Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
On 01/02/21 16:38, Barry Song wrote:
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
+	/*
+	 * for the group generated by grandchild, use the sgc of 2nd cpu
+	 * because the 1st cpu might be used by another sched_group
+	 */
+	if (from_grandchild && cpumask_weight(mask) > 1)
+		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
So you are getting a (hopefully) unique ID for this group span at this given topology level (i.e. sd->private) but as I had stated in that list of issues, this creates an sgc that isn't attached to the local group of any sched_domain, and thus won't get its capacity values updated.
This can actually be seen via the capacity values you're getting at build time:
[ 0.868907] CPU0 attaching sched-domain(s):
...
[ 0.869542]    domain-2: span=0-5 level=NUMA
[ 0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
                                                      ^^^^^^^^^^^^^^^^
[ 0.871177] CPU4 attaching sched-domain(s):
...
[ 0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
[ 0.871243]   domain-1: span=4-7 level=NUMA
[ 0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
                           ^^^^^^^^^^^^^^^^
Yes, I could see this issue. We could hack update_group_capacity() to let it scan both the local_group and the sched_group generated by grandchild, but your edit seems much better.
IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc of CPU4-domain1-group4. I've done that in the below diff - this gives us groups with sgc's owned at lower topology levels, but this will only ever be true for non-local groups. This has the added benefit of working with single-CPU nodes. Briefly tested on your topology and the sunfire's (via QEMU), and I didn't get screamed at.
Before the fun police comes and impounds my keyboard, I'd like to point out that we could leverage this cross-level sgc referencing hack to further change the NUMA domains and pretty much get rid of overlapping groups (that's what I was fumbling with in [1]).
That is, rather than building overlapping groups and fixing them whenever that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc cross-referencing.
Thoughts?
I guess the original purpose of overlapping groups is creating as few groups as possible. If we totally remove overlapping groups, it seems we will create many more groups. For example, when node0 begins to build a sched_domain for distance 20, it will add node2; since the distance between node2 and node3 is 15, while node2 is added, node3 is also added, as node2's lower domain has covered node3. So we need only two groups for node0's sched_domain of distance level 20.

+-------+                 +--------+
|       |       15        |        |
| node0 +-----------------+ node1  |
|       |                 |        |
+----+--+              XXX---------+
     |               XXX
     |              XX
  20 |       15    XX
     |           XXX
     |          X
+----+----XXX             +--------+
|         |       15      | node3  |
| node2   +---------------+        |
|         |               |        |
+---------+               +--------+
If we remove overlapping groups, we will add a group for node2 and another group for node3. Then we get three groups.
I am not sure if it is always positive for performance.
--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 
 static void init_overlap_sched_group(struct sched_domain *sd,
 				     struct sched_group *sg,
-				     int from_grandchild)
+				     struct sched_domain *grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
 	struct cpumask *sg_span;
 	int cpu;
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
-	/*
-	 * for the group generated by grandchild, use the sgc of 2nd cpu
-	 * because the 1st cpu might be used by another sched_group
-	 */
-	if (from_grandchild && cpumask_weight(mask) > 1)
-		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
-		int from_grandchild = 0;
+		bool from_grandchild = false;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 			  !cpumask_subset(sched_domain_span(sibling->child),
 					  span)) {
 			sibling = sibling->child;
-			from_grandchild = 1;
+			from_grandchild = true;
 		}
 
 		sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, from_grandchild);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
Nice edit! Will merge your edit into v1 and send v2.
 		if (!first)
 			first = sg;
Thanks Barry
On 03/02/21 10:23, Song Bao Hua (Barry Song) wrote:
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
...
Thoughts?
I guess the original purpose of overlapping groups is creating as few groups as possible. If we totally remove overlapping groups, it seems we will create many more groups. For example, when node0 begins to build a sched_domain for distance 20, it will add node2; since the distance between node2 and node3 is 15, while node2 is added, node3 is also added, as node2's lower domain has covered node3. So we need only two groups for node0's sched_domain of distance level 20.

+-------+                 +--------+
|       |       15        |        |
| node0 +-----------------+ node1  |
|       |                 |        |
+----+--+              XXX---------+
     |               XXX
     |              XX
  20 |       15    XX
     |           XXX
     |          X
+----+----XXX             +--------+
|         |       15      | node3  |
| node2   +---------------+        |
|         |               |        |
+---------+               +--------+
If we remove overlapping groups, we will add a group for node2 and another group for node3. Then we get three groups.
I am not sure if it is always positive for performance.
Neither am I! At the same time our strategy for generating groups is pretty much flawed for anything with distance > 2, so I'd like to have a saner setup that doesn't involve fixing groups "after the fact".
I have a sort-of-working hack, I'll make this into a patch and toss it out for discussion.
-----Original Message-----
From: Song Bao Hua (Barry Song)
Sent: Wednesday, February 3, 2021 11:18 PM
To: 'Valentin Schneider' valentin.schneider@arm.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Meelis Roos mroos@linux.ee
Subject: RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Wednesday, February 3, 2021 4:17 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Meelis Roos mroos@linux.ee
Subject: Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
On 01/02/21 16:38, Barry Song wrote:
@@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
+	/*
+	 * for the group generated by grandchild, use the sgc of 2nd cpu
+	 * because the 1st cpu might be used by another sched_group
+	 */
+	if (from_grandchild && cpumask_weight(mask) > 1)
+		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
So you are getting a (hopefully) unique ID for this group span at this given topology level (i.e. sd->private) but as I had stated in that list of issues, this creates an sgc that isn't attached to the local group of any sched_domain, and thus won't get its capacity values updated.
This can actually be seen via the capacity values you're getting at build time:
[ 0.868907] CPU0 attaching sched-domain(s):
...
[ 0.869542]    domain-2: span=0-5 level=NUMA
[ 0.869559]     groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
                                                      ^^^^^^^^^^^^^^^^
[ 0.871177] CPU4 attaching sched-domain(s):
...
[ 0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
[ 0.871243]   domain-1: span=4-7 level=NUMA
[ 0.871257]    groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
                           ^^^^^^^^^^^^^^^^
Yes, I could see this issue. We could hack update_group_capacity() to let it scan both the local_group and the sched_group generated by grandchild, but your edit seems much better.
IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc of CPU4-domain1-group4. I've done that in the below diff - this gives us groups with sgc's owned at lower topology levels, but this will only ever be true for non-local groups. This has the added benefit of working with single-CPU nodes. Briefly tested on your topology and the sunfire's (via QEMU), and I didn't get screamed at.
Before the fun police comes and impounds my keyboard, I'd like to point out that we could leverage this cross-level sgc referencing hack to further change the NUMA domains and pretty much get rid of overlapping groups (that's what I was fumbling with in [1]).
That is, rather than building overlapping groups and fixing them whenever that breaks (distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc cross-referencing.
Thoughts?
I guess the original purpose of overlapping groups is creating as few groups as possible. If we totally remove overlapping groups, it seems we will create many more groups. For example, when node0 begins to build a sched_domain for distance 20, it will add node2; since the distance between node2 and node3 is 15, while node2 is added, node3 is also added, as node2's lower domain has covered node3. So we need only two groups for node0's sched_domain of distance level 20.

+-------+                 +--------+
|       |       15        |        |
| node0 +-----------------+ node1  |
|       |                 |        |
+----+--+              XXX---------+
     |               XXX
     |              XX
  20 |       15    XX
     |           XXX
     |          X
+----+----XXX             +--------+
|         |       15      | node3  |
| node2   +---------------+        |
|         |               |        |
+---------+               +--------+
Sorry for missing a line:
node0-node3: 20
              20
      XXXXXXXXXXXXXXXXXXXXXXXXXX
     XX                         XX
+----X--+                 +--------+ XX
|       |       15        |        |  X
| node0 +-----------------+ node1  |  X
|       |                 |        |  X
+----+--+              XXX---------+  X
     |               XXX              X
     |              XX               XX
  20 |       15    XX               XX
     |           XXX              XXXX
     |          X               XXXX
+----+----XXX             +--------+XX
|         |       15      | node3  |
| node2   +---------------+        |
|         |               |        |
+---------+               +--------+
If we remove overlapping groups, we will add a group for node2 and another group for node3. Then we get three groups.
I am not sure if it is always positive for performance.
--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 
 static void init_overlap_sched_group(struct sched_domain *sd,
 				     struct sched_group *sg,
-				     int from_grandchild)
+				     struct sched_domain *grandchild)
 {
 	struct cpumask *mask = sched_domains_tmpmask2;
-	struct sd_data *sdd = sd->private;
+	struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
 	struct cpumask *sg_span;
 	int cpu;
 
 	build_balance_mask(sd, sg, mask);
 	cpu = cpumask_first_and(sched_group_span(sg), mask);
-	/*
-	 * for the group generated by grandchild, use the sgc of 2nd cpu
-	 * because the 1st cpu might be used by another sched_group
-	 */
-	if (from_grandchild && cpumask_weight(mask) > 1)
-		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 
 	for_each_cpu_wrap(i, span, cpu) {
 		struct cpumask *sg_span;
-		int from_grandchild = 0;
+		bool from_grandchild = false;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 			  !cpumask_subset(sched_domain_span(sibling->child),
 					  span)) {
 			sibling = sibling->child;
-			from_grandchild = 1;
+			from_grandchild = true;
 		}
 
 		sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
-		init_overlap_sched_group(sd, sg, from_grandchild);
+		init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
Nice edit! Will merge your edit into v1 and send v2.
 		if (!first)
 			first = sg;
Thanks Barry