-----Original Message-----
From: Valentin Schneider [mailto:valentin.schneider@arm.com]
Sent: Tuesday, February 2, 2021 7:11 AM
To: Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; vincent.guittot@linaro.org; mgorman@suse.de; mingo@kernel.org; peterz@infradead.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com; linux-kernel@vger.kernel.org
Cc: linuxarm@openeuler.org; xuwei (O) xuwei5@huawei.com; Liguozhu (Kenneth) liguozhu@hisilicon.com; tiantao (H) tiantao6@hisilicon.com; wanghuiqiang wanghuiqiang@huawei.com; Zengtao (B) prime.zeng@hisilicon.com; Jonathan Cameron jonathan.cameron@huawei.com; guodong.xu@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Meelis Roos mroos@linux.ee
Subject: Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
Hi,
On 01/02/21 16:38, Barry Song wrote:
A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2 for the sched_group generated by the grandchild; otherwise, when this CPU becomes the balance_cpu of another sched_group of CPUs other than node0, our sched_group generated by the grandchild will access the same sgc as the sched_group generated by the child of another CPU.
So in init_overlap_sched_group(), sgc's capacity will be overwritten:

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first_and(sched_group_span(sg), mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

And WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) will also be triggered:

	static void init_overlap_sched_group(struct sched_domain *sd,
					     struct sched_group *sg)
	{
		...
		if (atomic_inc_return(&sg->sgc->ref) == 1)
			cpumask_copy(group_balance_mask(sg), mask);
		else
			WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
		...
	}
So here we move to use the sgc of the 2nd CPU. For the corner case where a NUMA node has only one CPU, we will still trigger this WARN_ON_ONCE. But it is really unlikely to be a real case for a NUMA node to have only one CPU.
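For illustration only, a rough sketch of that "2nd CPU" idea inside init_overlap_sched_group() might look like the fragment below. It is not necessarily the patch as posted; group_is_from_grandchild is a hypothetical placeholder for whatever check the real code would use. It also shows why a node with a single CPU falls back to the 1st CPU and still hits the warning:

	build_balance_mask(sd, sg, mask);

	cpu = cpumask_first_and(sched_group_span(sg), mask);
	if (group_is_from_grandchild) {		/* hypothetical predicate, for illustration */
		int next = cpumask_next_and(cpu, sched_group_span(sg), mask);

		/* Only a node with a 2nd CPU can avoid sharing the 1st CPU's sgc. */
		if (next < nr_cpu_ids)
			cpu = next;
	}

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);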
Well, it's trivial to boot this with QEMU, and it's actually the example the comment atop that WARN_ONCE() is based on. Also, you could end up with a single CPU on a node during hotplug operations...
Hi Valentin,
The qemu topology is just a reflection of the real kunpeng920 case, and please also note that Meelis has also tested this on other real hardware, an "8-node Sun Fire X4600-M2", and gave his Tested-by.
It might not be a perfect fix, but it is the simplest way to fix this for the moment and for real cases. A "perfect" fix would require major refactoring of topology.c.
I don't think hotplug is very relevant here: even if some CPUs are unplugged and only one CPU is left in the sched_group of the sched_domain, the related domain and group still get the right settings.
On the other hand, the corner case could literally be fixed, but it would involve some very ugly code. I mean, two sched_groups can end up using the same sgc (see the sketch below):
1. the sched_group generated by the grandchild, spanning only one NUMA node
2. the sched_group generated by the child, spanning more than one NUMA node
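To illustrate the sharing, assume a hypothetical 4-node line topology (purely for the sake of the diagram, not a description of the kunpeng920 machine):

	/*
	 * Hypothetical line topology, for illustration only:
	 *
	 *	node0 -- node1 -- node2 -- node3
	 *
	 * In node0's hierarchy, sched_group 1 (built from the grandchild)
	 * may pick the 1st CPU of node2 as its balance CPU.  In another
	 * CPU's hierarchy at the same level, sched_group 2 (built from the
	 * child) picks that very same CPU.  Both then run
	 *
	 *	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	 *
	 * with the same 'cpu' and the same 'sdd', so two groups with
	 * different spans share one sgc: the capacity gets overwritten and
	 * the balance-mask WARN_ON_ONCE() fires.
	 */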
Right now, I'm moving to the 2nd CPU for sched_group 1; if we also moved to the 2nd CPU for sched_group 2, then having only one CPU in a NUMA node wouldn't be a problem anymore. But the code would be very ugly, so I would prefer to keep this assumption and just ignore the unrealistic corner case.
I am not entirely sure whether having more than one CPU per node is a sufficient condition. I'm starting to *think* it is, but I'm not entirely convinced yet - and now I need a new notebook.
Me too. Some extremely complicated topology might break the assumption. I really need a new notebook to draw that kind of complicated topology and break the assumption :-)
But it is sufficient for the existing real cases which need fixing. If someday a real case in which each NUMA node has more than one CPU still wakes up the warning below:

	WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));

it might be the right time to consider a major refactoring of topology.c.
Thanks
Barry