From: Ming Lei ming.lei@redhat.com
mainline inclusion from mainline-v5.6-rc1 commit 11ea68f553e244851d15793a7fa33a97c46d8271 category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
The affinity of managed interrupts is completely handled in the kernel and cannot be changed via the /proc/irq/* interfaces from user space. As the kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent vector exhaustion, it can happen that a managed interrupt whose affinity mask contains both isolated and housekeeping CPUs is routed to an isolated CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts on the isolated CPU.
Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding logic in the interrupt affinity selection code.
The subparameter indicates to the interrupt affinity selection logic that it should try to avoid the above scenario.
This isolation is best effort and only effective if the automatically assigned interrupt mask of a device queue contains isolated and housekeeping CPUs. If housekeeping CPUs are online then such interrupts are directed to the housekeeping CPU so that IO submitted on the housekeeping CPU cannot disturb the isolated CPU.
If a queue's affinity mask contains only isolated CPUs then this parameter has no effect on the interrupt routing decision, though interrupts are only happening when tasks running on those isolated CPUs submit IO. IO submitted on housekeeping CPUs has no influence on those queues.
If the affinity mask contains both housekeeping and isolated CPUs, but none of the contained housekeeping CPUs is online, then the interrupt is also routed to an isolated CPU. Interrupts are only delivered when one of the isolated CPUs in the affinity mask submits IO. If one of the contained housekeeping CPUs comes online, the CPU hotplug logic migrates the interrupt automatically back to the upcoming housekeeping CPU. Depending on the type of interrupt controller, this can require that at least one interrupt is delivered to the isolated CPU in order to complete the migration.
[ tglx: Removed unused parameter, added and edited comments/documentation and rephrased the changelog so it contains more details. ]
Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .../admin-guide/kernel-parameters.txt | 26 +++++++++++- include/linux/sched/isolation.h | 7 ++++ kernel/irq/cpuhotplug.c | 21 +++++++++- kernel/irq/manage.c | 41 ++++++++++++++++++- kernel/sched/isolation.c | 12 ++++++ 5 files changed, 102 insertions(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 60c06c7d243d..a0e84095a1a5 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1911,9 +1911,31 @@ <cpu number> begins at 0 and the maximum value is "number of CPUs in system - 1".
- The format of <cpu-list> is described above. - + managed_irq + + Isolate from being targeted by managed interrupts + which have an interrupt mask containing isolated + CPUs. The affinity of managed interrupts is + handled by the kernel and cannot be changed via + the /proc/irq/* interfaces. + + This isolation is best effort and only effective + if the automatically assigned interrupt mask of a + device queue contains isolated and housekeeping + CPUs. If housekeeping CPUs are online then such + interrupts are directed to the housekeeping CPU + so that IO submitted on the housekeeping CPU + cannot disturb the isolated CPU. + + If a queue's affinity mask contains only isolated + CPUs then this parameter has no effect on the + interrupt routing decision, though interrupts are + only delivered when tasks running on those + isolated CPUs submit IO. IO submitted on + housekeeping CPUs has no influence on those + queues.
+ The format of <cpu-list> is described above.
iucv= [HW,NET]
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h index 4a6582c27dea..461e1f6ab3eb 100644 --- a/include/linux/sched/isolation.h +++ b/include/linux/sched/isolation.h @@ -13,12 +13,14 @@ enum hk_flags { HK_FLAG_TICK = (1 << 4), HK_FLAG_DOMAIN = (1 << 5), HK_FLAG_WQ = (1 << 6), + HK_FLAG_MANAGED_IRQ = (1 << 7), };
#ifdef CONFIG_CPU_ISOLATION DECLARE_STATIC_KEY_FALSE(housekeeping_overriden); extern int housekeeping_any_cpu(enum hk_flags flags); extern const struct cpumask *housekeeping_cpumask(enum hk_flags flags); +extern bool housekeeping_enabled(enum hk_flags flags); extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags); extern bool housekeeping_test_cpu(int cpu, enum hk_flags flags); extern void __init housekeeping_init(void); @@ -35,6 +37,11 @@ static inline const struct cpumask *housekeeping_cpumask(enum hk_flags flags) return cpu_possible_mask; }
+static inline bool housekeeping_enabled(enum hk_flags flags) +{ + return false; +} + static inline void housekeeping_affine(struct task_struct *t, enum hk_flags flags) { } static inline void housekeeping_init(void) { } diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c index 6c7ca2e983a5..02236b13b359 100644 --- a/kernel/irq/cpuhotplug.c +++ b/kernel/irq/cpuhotplug.c @@ -12,6 +12,7 @@ #include <linux/interrupt.h> #include <linux/ratelimit.h> #include <linux/irq.h> +#include <linux/sched/isolation.h>
#include "internals.h"
@@ -171,6 +172,20 @@ void irq_migrate_all_off_this_cpu(void) } }
+static bool hk_should_isolate(struct irq_data *data, unsigned int cpu) +{ + const struct cpumask *hk_mask; + + if (!housekeeping_enabled(HK_FLAG_MANAGED_IRQ)) + return false; + + hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ); + if (cpumask_subset(irq_data_get_effective_affinity_mask(data), hk_mask)) + return false; + + return cpumask_test_cpu(cpu, hk_mask); +} + static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu) { struct irq_data *data = irq_desc_get_irq_data(desc); @@ -188,9 +203,11 @@ static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu) /* * If the interrupt can only be directed to a single target * CPU then it is already assigned to a CPU in the affinity - * mask. No point in trying to move it around. + * mask. No point in trying to move it around unless the + * isolation mechanism requests to move it to an upcoming + * housekeeping CPU. */ - if (!irqd_is_single_target(data)) + if (!irqd_is_single_target(data) || hk_should_isolate(data, cpu)) irq_set_affinity_locked(data, affinity, false); }
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index a8c66acee82a..34eda1e772fa 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -18,6 +18,7 @@ #include <linux/sched.h> #include <linux/sched/rt.h> #include <linux/sched/task.h> +#include <linux/sched/isolation.h> #include <uapi/linux/sched/types.h> #include <linux/task_work.h>
@@ -227,7 +228,45 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, if (!chip || !chip->irq_set_affinity) return -EINVAL;
- ret = chip->irq_set_affinity(data, mask, force); + /* + * If this is a managed interrupt and housekeeping is enabled on + * it check whether the requested affinity mask intersects with + * a housekeeping CPU. If so, then remove the isolated CPUs from + * the mask and just keep the housekeeping CPU(s). This prevents + * the affinity setter from routing the interrupt to an isolated + * CPU to avoid that I/O submitted from a housekeeping CPU causes + * interrupts on an isolated one. + * + * If the masks do not intersect or include online CPU(s) then + * keep the requested mask. The isolated target CPUs are only + * receiving interrupts when the I/O operation was submitted + * directly from them. + * + * If all housekeeping CPUs in the affinity mask are offline, the + * interrupt will be migrated by the CPU hotplug code once a + * housekeeping CPU which belongs to the affinity mask comes + * online. + */ + if (irqd_affinity_is_managed(data) && + housekeeping_enabled(HK_FLAG_MANAGED_IRQ)) { + const struct cpumask *hk_mask, *prog_mask; + + static DEFINE_RAW_SPINLOCK(tmp_mask_lock); + static struct cpumask tmp_mask; + + hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ); + + raw_spin_lock(&tmp_mask_lock); + cpumask_and(&tmp_mask, mask, hk_mask); + if (!cpumask_intersects(&tmp_mask, cpu_online_mask)) + prog_mask = mask; + else + prog_mask = &tmp_mask; + ret = chip->irq_set_affinity(data, prog_mask, force); + raw_spin_unlock(&tmp_mask_lock); + } else { + ret = chip->irq_set_affinity(data, mask, force); + } switch (ret) { case IRQ_SET_MASK_OK: case IRQ_SET_MASK_OK_DONE: diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index e6802181900f..6e85049d68cf 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -13,6 +13,12 @@ EXPORT_SYMBOL_GPL(housekeeping_overriden); static cpumask_var_t housekeeping_mask; static unsigned int housekeeping_flags;
+bool housekeeping_enabled(enum hk_flags flags) +{ + return !!(housekeeping_flags & flags); +} +EXPORT_SYMBOL_GPL(housekeeping_enabled); + int housekeeping_any_cpu(enum hk_flags flags) { if (static_branch_unlikely(&housekeeping_overriden)) @@ -140,6 +146,12 @@ static int __init housekeeping_isolcpus_setup(char *str) continue; }
+ if (!strncmp(str, "managed_irq,", 12)) { + str += 12; + flags |= HK_FLAG_MANAGED_IRQ; + continue; + } + pr_warn("isolcpus: Error, unknown flag\n"); return 0; }
From: Peter Xu peterx@redhat.com
mainline inclusion from mainline-v5.6-rc6 commit 469ff207b4c4033540b50bc59587dc915faa1367 category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
The vector management code assumes that managed interrupts cannot be migrated away from an online CPU. free_moved_vector() has a WARN_ON_ONCE() which triggers when a managed interrupt vector association on a online CPU is cleared. The CPU offline code uses a different mechanism which cannot trigger this.
This assumption is not longer correct because the new CPU isolation feature which affects the placement of managed interrupts must be able to move a managed interrupt away from an online CPU.
There are two reasons why this can happen:
1) When the interrupt is activated the affinity mask which was established in irq_create_affinity_masks() is handed in to the vector allocation code. This mask contains all CPUs to which the interrupt can be made affine to, but this does not take the CPU isolation 'managed_irq' mask into account.
When the interrupt is finally requested by the device driver then the affinity is checked again and the CPU isolation 'managed_irq' mask is taken into account, which moves the interrupt to a non-isolated CPU if possible.
2) The interrupt can be affine to an isolated CPU because the non-isolated CPUs in the calculated affinity mask are not online.
Once a non-isolated CPU which is in the mask comes online the interrupt is migrated to this non-isolated CPU
In both cases the regular online migration mechanism is used which triggers the WARN_ON_ONCE() in free_moved_vector().
Case #1 could have been addressed by taking the isolation mask into account, but that would require a massive code change in the activation logic and the eventual migration event was accepted as a reasonable tradeoff when the isolation feature was developed. But even if #1 would be addressed, #2 would still trigger it.
Of course the warning in free_moved_vector() was overlooked at that time and the above two cases which have been discussed during patch review have obviously never been tested before the final submission.
So keep it simple and remove the warning.
[ tglx: Rewrote changelog and added a comment to free_moved_vector() ]
Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts") Signed-off-by: Peter Xu peterx@redhat.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ming Lei ming.lei@redhat.com Link: https://lkml.kernel.org/r/20200312205830.81796-1-peterx@redhat.com Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/kernel/apic/vector.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 8b7e0b46e86e..701e46f31791 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -831,13 +831,15 @@ static void free_moved_vector(struct apic_chip_data *apicd) bool managed = apicd->is_managed;
/* - * This should never happen. Managed interrupts are not - * migrated except on CPU down, which does not involve the - * cleanup vector. But try to keep the accounting correct - * nevertheless. + * Managed interrupts are usually not migrated away + * from an online CPU, but CPU isolation 'managed_irq' + * can make that happen. + * 1) Activation does not take the isolation into account + * to keep the code simple + * 2) Migration away from an isolated CPU can happen when + * a non-isolated CPU which is in the calculated + * affinity mask comes online. */ - WARN_ON_ONCE(managed); - trace_vector_free_moved(apicd->irq, cpu, vector, managed); irq_matrix_free(vector_matrix, cpu, vector, managed); per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
From: Marcelo Tosatti mtosatti@redhat.com
mainline inclusion from mainline-v5.9-rc1 commit 043eb8e1051143a24811e6f35c276e35ae8247b6 category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
Next patch will switch unbound kernel threads mask to housekeeping_cpumask(), a subset of cpu_possible_mask. So in order to ease bisection, lets first switch kthreads default affinity from cpu_all_mask to cpu_possible_mask.
It looks safe to do so as cpu_possible_mask seem to be initialized at setup_arch() time, way before kthreadd is created.
Suggested-by: Frederic Weisbecker frederic@kernel.org Signed-off-by: Frederic Weisbecker frederic@kernel.org Signed-off-by: Marcelo Tosatti mtosatti@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200527142909.23372-2-frederic@kernel.org Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/kthread.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c index 598e16c76147..e006e4536a38 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -360,7 +360,7 @@ struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data), * The kernel thread should not inherit these properties. */ sched_setscheduler_nocheck(task, SCHED_NORMAL, ¶m); - set_cpus_allowed_ptr(task, cpu_all_mask); + set_cpus_allowed_ptr(task, cpu_possible_mask); } kfree(create); return task; @@ -585,7 +585,7 @@ int kthreadd(void *unused) /* Setup a clean context for our children to inherit. */ set_task_comm(tsk, "kthreadd"); ignore_signals(tsk); - set_cpus_allowed_ptr(tsk, cpu_all_mask); + set_cpus_allowed_ptr(tsk, cpu_possible_mask); set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE;
From: Marcelo Tosatti mtosatti@redhat.com
mainline inclusion from mainline-v5.9-rc1 commit 9cc5b8656892a72438ee7deb5e80f5be47643b8b category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
This is a kernel enhancement that configures the cpu affinity of kernel threads via kernel boot option nohz_full=.
When this option is specified, the cpumask is immediately applied upon kthread launch. This does not affect kernel threads that specify cpu and node.
This allows CPU isolation (that is not allowing certain threads to execute on certain CPUs) without using the isolcpus=domain parameter, making it possible to enable load balancing on such CPUs during runtime (see kernel-parameters.txt).
Note-1: this is based off on Wind River's patch at https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std...
Difference being that this patch is limited to modifying kernel thread cpumask. Behaviour of other threads can be controlled via cgroups or sched_setaffinity.
Note-2: Wind River's patch was based off Christoph Lameter's patch at https://lwn.net/Articles/565932/ with the only difference being the kernel parameter changed from kthread to kthread_cpus.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Signed-off-by: Marcelo Tosatti mtosatti@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/sched/isolation.h | 1 + kernel/kthread.c | 6 ++++-- kernel/sched/isolation.c | 3 ++- 3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h index 461e1f6ab3eb..dfffbc98a63f 100644 --- a/include/linux/sched/isolation.h +++ b/include/linux/sched/isolation.h @@ -14,6 +14,7 @@ enum hk_flags { HK_FLAG_DOMAIN = (1 << 5), HK_FLAG_WQ = (1 << 6), HK_FLAG_MANAGED_IRQ = (1 << 7), + HK_FLAG_KTHREAD = (1 << 8), };
#ifdef CONFIG_CPU_ISOLATION diff --git a/kernel/kthread.c b/kernel/kthread.c index e006e4536a38..3f7daba4a550 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -20,6 +20,7 @@ #include <linux/freezer.h> #include <linux/ptrace.h> #include <linux/uaccess.h> +#include <linux/sched/isolation.h> #include <trace/events/sched.h>
static DEFINE_SPINLOCK(kthread_create_lock); @@ -360,7 +361,8 @@ struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data), * The kernel thread should not inherit these properties. */ sched_setscheduler_nocheck(task, SCHED_NORMAL, ¶m); - set_cpus_allowed_ptr(task, cpu_possible_mask); + set_cpus_allowed_ptr(task, + housekeeping_cpumask(HK_FLAG_KTHREAD)); } kfree(create); return task; @@ -585,7 +587,7 @@ int kthreadd(void *unused) /* Setup a clean context for our children to inherit. */ set_task_comm(tsk, "kthreadd"); ignore_signals(tsk); - set_cpus_allowed_ptr(tsk, cpu_possible_mask); + set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_FLAG_KTHREAD)); set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE; diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index 6e85049d68cf..2dc9f899daf8 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -123,7 +123,8 @@ static int __init housekeeping_nohz_full_setup(char *str) { unsigned int flags;
- flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | HK_FLAG_MISC; + flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | + HK_FLAG_MISC | HK_FLAG_KTHREAD;
return housekeeping_setup(str, flags); }
From: Alex Belits abelits@marvell.com
mainline inclusion from mainline-v5.9-rc1 commit 1abdfe706a579a702799fce465bceb9fb01d407c category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
The current implementation of cpumask_local_spread() does not respect the isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task, it will return it to the caller for pinning of its IRQ threads. Having these unwanted IRQ threads on an isolated CPU adds up to a latency overhead.
Restrict the CPUs that are returned for spreading IRQs only to the available housekeeping CPUs.
Signed-off-by: Alex Belits abelits@marvell.com Signed-off-by: Nitesh Narayan Lal nitesh@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200625223443.2684-2-nitesh@redhat.com Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/cpumask.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/lib/cpumask.c b/lib/cpumask.c index beca6244671a..7dde20a9faa7 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -5,6 +5,7 @@ #include <linux/cpumask.h> #include <linux/export.h> #include <linux/bootmem.h> +#include <linux/sched/isolation.h>
/** * cpumask_next - get the next cpu in a cpumask @@ -201,22 +202,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) */ unsigned int cpumask_local_spread(unsigned int i, int node) { - int cpu; + int cpu, hk_flags; + const struct cpumask *mask;
+ hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ; + mask = housekeeping_cpumask(hk_flags); /* Wrap: we always want a cpu. */ - i %= num_online_cpus(); + i %= cpumask_weight(mask);
if (node == -1) { - for_each_cpu(cpu, cpu_online_mask) + for_each_cpu(cpu, mask) { if (i-- == 0) return cpu; + } } else { /* NUMA first. */ - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask) + for_each_cpu_and(cpu, cpumask_of_node(node), mask) { if (i-- == 0) return cpu; + }
- for_each_cpu(cpu, cpu_online_mask) { + for_each_cpu(cpu, mask) { /* Skip NUMA nodes, done above. */ if (cpumask_test_cpu(cpu, cpumask_of_node(node))) continue;
From: Alex Belits abelits@marvell.com
mainline inclusion from mainline-v5.9-rc1 commit 69a18b18699b59654333651d95f8ca09d01048f8 category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
pci_call_probe() prevents the nesting of work_on_cpu() for a scenario where a VF device is probed from work_on_cpu() of the PF.
Replace the cpumask used in pci_call_probe() from all online CPUs to only housekeeping CPUs. This is to ensure that there are no additional latency overheads caused due to the pinning of jobs on isolated CPUs.
Signed-off-by: Alex Belits abelits@marvell.com Signed-off-by: Nitesh Narayan Lal nitesh@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Frederic Weisbecker frederic@kernel.org Acked-by: Bjorn Helgaas bhelgaas@google.com Link: https://lkml.kernel.org/r/20200625223443.2684-3-nitesh@redhat.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci-driver.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 45c3a78d6004..d6b2ac4bf266 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -12,6 +12,7 @@ #include <linux/string.h> #include <linux/slab.h> #include <linux/sched.h> +#include <linux/sched/isolation.h> #include <linux/cpu.h> #include <linux/pm_runtime.h> #include <linux/suspend.h> @@ -332,6 +333,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, const struct pci_device_id *id) { int error, node, cpu; + int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; struct drv_dev_and_id ddi = { drv, dev, id };
/* @@ -352,7 +354,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, pci_physfn_is_probed(dev)) cpu = nr_cpu_ids; else - cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask); + cpu = cpumask_any_and(cpumask_of_node(node), + housekeeping_cpumask(hk_flags));
if (cpu < nr_cpu_ids) error = work_on_cpu(cpu, local_pci_probe, &ddi);
From: Alex Belits abelits@marvell.com
mainline inclusion from mainline-v5.9-rc1 commit 07bbecb3410617816a99e76a2df7576507a0c8ad category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
With the existing implementation of store_rps_map(), packets are queued in the receive path on the backlog queues of other CPUs irrespective of whether they are isolated or not. This could add a latency overhead to any RT workload that is running on the same CPU.
Ensure that store_rps_map() only uses available housekeeping CPUs for storing the rps_map.
Signed-off-by: Alex Belits abelits@marvell.com Signed-off-by: Nitesh Narayan Lal nitesh@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200625223443.2684-4-nitesh@redhat.com Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/core/net-sysfs.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 001d7f07e780..5dd2da0ea7c6 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -16,6 +16,7 @@ #include <linux/if_arp.h> #include <linux/slab.h> #include <linux/sched/signal.h> +#include <linux/sched/isolation.h> #include <linux/nsproxy.h> #include <net/sock.h> #include <net/net_namespace.h> @@ -720,7 +721,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, { struct rps_map *old_map, *map; cpumask_var_t mask; - int err, cpu, i; + int err, cpu, i, hk_flags; static DEFINE_MUTEX(rps_map_mutex);
if (!capable(CAP_NET_ADMIN)) @@ -735,6 +736,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, return err; }
+ hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; + cpumask_and(mask, mask, housekeeping_cpumask(hk_flags)); + if (cpumask_empty(mask)) { + free_cpumask_var(mask); + return -EINVAL; + } + map = kzalloc(max_t(unsigned int, RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES), GFP_KERNEL);
From: Eric Dumazet edumazet@google.com
mainline inclusion from mainline-v5.9-rc1 commit 2e0d8fef5f76bce0887c73b824d9e625a08e7406 category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
We must accept an empty mask in store_rps_map(), or we are not able to disable RPS on a queue.
Fixes: 07bbecb34106 ("net: Restrict receive packets queuing to housekeeping CPUs") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: Maciej ĺ‘•enczykowski maze@google.com Cc: Alex Belits abelits@marvell.com Cc: Nitesh Narayan Lal nitesh@redhat.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Maciej ĺ‘•enczykowski maze@google.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Nitesh Narayan Lal nitesh@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/core/net-sysfs.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 5dd2da0ea7c6..905d9ea76660 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -736,11 +736,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, return err; }
- hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; - cpumask_and(mask, mask, housekeeping_cpumask(hk_flags)); - if (cpumask_empty(mask)) { - free_cpumask_var(mask); - return -EINVAL; + if (!cpumask_empty(mask)) { + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; + cpumask_and(mask, mask, housekeeping_cpumask(hk_flags)); + if (cpumask_empty(mask)) { + free_cpumask_var(mask); + return -EINVAL; + } }
map = kzalloc(max_t(unsigned int,
From: Xunlei Pang xlpang@linux.alibaba.com
mainline inclusion from mainline-v5.10-rc1 commit df3cb4ea1fb63ff326488efd671ba3c39034255e category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
We've met problems that occasionally tasks with full cpumask (e.g. by putting it into a cpuset or setting to full affinity) were migrated to our isolated cpus in production environment.
After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask.
Steps to reproduce on my 31-CPU hyperthreads machine: 1. with boot parameter: "isolcpus=domain,2-31" (thread lists: 0,16 and 1,17) 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads" 3. some threads will be migrated to the isolated cpu16~17.
Fix it by checking the valid domain mask in select_idle_smt().
Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) Reported-by: Wetp Zhang wetp.zy@linux.alibaba.com Signed-off-by: Xunlei Pang xlpang@linux.alibaba.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Jiang Biao benbjiang@tencent.com Reviewed-by: Vincent Guittot vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.ali... Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2aca6f86eb34..97e956012a60 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6237,7 +6237,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t return -1;
for_each_cpu(cpu, cpu_smt_mask(target)) { - if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) + if (!cpumask_test_cpu(cpu, &p->cpus_allowed) || + !cpumask_test_cpu(cpu, sched_domain_span(sd))) continue; if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) return cpu;
From: Frederic Weisbecker frederic@kernel.org
mainline inclusion from mainline-v5.9-rc1 commit 3c8920e2dbd1a55f72dc14d656df9d0097cf5c72 category: bugfix bugzilla: 45956 CVE: NA
--------------------------------
Setting a tick dependency on any task, including the case where a task sets that dependency on itself, triggers an IPI to all CPUs. That is of course suboptimal but it had previously not been an issue because it was only used by POSIX CPU timers on nohz_full, which apparently never occurs in latency-sensitive workloads in production. (Or users of such systems are suffering in silence on the one hand or venting their ire on the wrong people on the other.)
But RCU now sets a task tick dependency on the current task in order to fix stall issues that can occur during RCU callback processing. Thus, RCU callback processing triggers frequent system-wide IPIs from nohz_full CPUs. This is quite counter-productive, after all, avoiding IPIs is what nohz_full is supposed to be all about.
This commit therefore optimizes tasks' self-setting of a task tick dependency by using tick_nohz_full_kick() to avoid the system-wide IPI. Instead, only the execution of the one task is disturbed, which is acceptable given that this disturbance is well down into the noise compared to the degree to which the RCU callback processing itself disturbs execution.
Fixes: 6a949b7af82d (rcu: Force on tick when invoking lots of callbacks) Reported-by: Matt Fleming matt@codeblueprint.co.uk Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: stable@kernel.org Cc: Paul E. McKenney paulmck@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Ingo Molnar mingo@kernel.org Signed-off-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Liu Chao liuchao173@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/time/tick-sched.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 48403fb653c2..00d2418412f1 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -334,16 +334,24 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) }
/* - * Set a per-task tick dependency. Posix CPU timers need this in order to elapse - * per task timers. + * Set a per-task tick dependency. RCU need this. Also posix CPU timers + * in order to elapse per task timers. */ void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit) { - /* - * We could optimize this with just kicking the target running the task - * if that noise matters for nohz full users. - */ - tick_nohz_dep_set_all(&tsk->tick_dep_mask, bit); + if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) { + if (tsk == current) { + preempt_disable(); + tick_nohz_full_kick(); + preempt_enable(); + } else { + /* + * Some future tick_nohz_full_kick_task() + * should optimize this. + */ + tick_nohz_full_kick_all(); + } + } }
void tick_nohz_dep_clear_task(struct task_struct *tsk, enum tick_dep_bits bit)