Fix the following issue:
```
CPU1                            CPU2                        CPU3

T1 sets cfs_quota
   starts hrtimer cfs_bandwidth
   'period_timer'
T1 is migrated to CPU3
                                T2(worker thread) initiates
                                offlining of CPU1
Hotplug operation starts
...
'period_timer' expires and is
re-enqueued on CPU1
...
take_cpu_down()
CPU1 shuts down and does not
handle timers anymore. They
have to be migrated in the post
dead hotplug steps by the
control task.

                                T2(worker thread) runs the
                                post dead offline operation
                                                            T1 holds lockA
                                                            T1 is scheduled out
                                                            //throttled by CFS bandwidth control
                                                            T1 waits for 'period_timer' to expire
                                T2(worker thread) waits for lockA
```
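If T1 is throttled (scheduled out) before the hrtimer offline callback hrtimers_dead_cpu() has migrated 'period_timer' away from CPU1, T1 waits there forever, because that callback only runs later in the post-dead sequence driven by T2. Thus T2 waits for lockA forever.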
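The dependency chain in the diagram can be checked mechanically as a wait-for graph. The sketch below is a user-space model, not kernel code: the node names, `struct graph`, `build_graph()` and `has_deadlock()` are hypothetical, and an edge `a -> b` means "a cannot make progress until b does". In the old scheme 'period_timer' is only migrated by T2's post-dead step, which closes the cycle; pushing the timer away during the dying step removes that edge.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical nodes of the scenario above; not kernel objects. */
enum node { T1, T2, LOCK_A, PERIOD_TIMER, NR_NODES };

struct graph {
	bool edge[NR_NODES][NR_NODES];	/* edge[a][b]: a waits until b progresses */
};

/* Depth-first search; state: 0 = unvisited, 1 = on current path, 2 = done. */
static bool dfs(const struct graph *g, int n, int *state)
{
	state[n] = 1;
	for (int m = 0; m < NR_NODES; m++) {
		if (!g->edge[n][m])
			continue;
		if (state[m] == 1)
			return true;			/* back edge: cycle found */
		if (state[m] == 0 && dfs(g, m, state))
			return true;
	}
	state[n] = 2;
	return false;
}

static bool has_deadlock(const struct graph *g)
{
	int state[NR_NODES] = { 0 };

	for (int n = 0; n < NR_NODES; n++)
		if (state[n] == 0 && dfs(g, n, state))
			return true;
	return false;
}

/* timer_pushed_early models the fix: the timer no longer depends on T2. */
static struct graph build_graph(bool timer_pushed_early)
{
	struct graph g = { 0 };

	g.edge[T2][LOCK_A] = true;		/* T2 waits for lockA */
	g.edge[LOCK_A][T1] = true;		/* only T1 can release lockA */
	g.edge[T1][PERIOD_TIMER] = true;	/* throttled T1 waits for the timer */
	if (!timer_pushed_early)
		g.edge[PERIOD_TIMER][T2] = true;	/* migrated only by T2's post-dead step */
	return g;
}
```

With `timer_pushed_early = false` the search finds the cycle T2 -> lockA -> T1 -> period_timer -> T2; with `true` the graph is acyclic.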
```
Thomas Gleixner (1):
  hrtimers: Push pending hrtimers away from outgoing CPU earlier

Yu Liao (1):
  cpu/hotplug: fix kabi breakage in enum cpuhp_state

 include/linux/hrtimer.h |  4 ++--
 include/linux/smp.h     |  1 +
 kernel/cpu.c            | 17 +++++++++++++++--
 kernel/smp.c            |  8 ++++++++
 kernel/time/hrtimer.c   | 33 ++++++++++++---------------------
 5 files changed, 38 insertions(+), 25 deletions(-)
```
From: Thomas Gleixner <tglx@linutronix.de>
mainline inclusion
from mainline-v6.7-rc2
commit 5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8JEVI
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 2b8272ff4a70 ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug") solved the straightforward CPU hotplug deadlock vs. the scheduler bandwidth timer. Yu discovered a more involved variant where a task that has a bandwidth timer started on the outgoing CPU holds a lock and then gets throttled. If the lock is required by one of the CPU hotplug callbacks, the hotplug operation deadlocks: the unthrottling timer event is not handled on the dying CPU and can only be recovered once the control CPU reaches the hotplug state that pulls the pending hrtimers from the dead CPU.
Solve this by pushing the hrtimers away from the dying CPU in the dying callbacks. The timers are moved to the first active CPU, which is then told via an IPI to retrigger its next event. Nothing can queue an hrtimer on the dying CPU at that point, because all other CPUs spin in stop_machine() with interrupts disabled, and once the operation is finished, the CPU is marked offline.
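Why moving the migration into a dying step helps: hotplug teardown walks the states from the highest to the lowest, so an AP dying step (executed on the outgoing CPU) runs before any post-dead step in the PREPARE section (executed by the control task). The model below assumes a heavily abbreviated excerpt of the state table; `example:dead` is a hypothetical post-dead step that needs lockA, and `struct step` and `timers_moved_before_lock()` are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry of a (hypothetical, abbreviated) hotplug state table. */
struct step {
	const char *name;
	bool migrates_timers;	/* this step's teardown moves the pending hrtimers */
	bool takes_lock_a;	/* this step's teardown needs lockA (hypothetical) */
};

/* Before the fix; entries listed from the lowest state to the highest. */
static const struct step old_layout[] = {
	{ "hrtimers:prepare", true,  false },	/* teardown: hrtimers_dead_cpu() */
	{ "example:dead",     false, true  },	/* hypothetical post-dead step */
	{ "smpcfd:dying",     false, false },	/* runs on the outgoing CPU */
};

/* After the fix: the migration moves up into the AP dying section. */
static const struct step new_layout[] = {
	{ "hrtimers:prepare", false, false },	/* teardown: NULL */
	{ "example:dead",     false, true  },
	{ "smpcfd:dying",     false, false },
	{ "hrtimers:dying",   true,  false },	/* teardown: hrtimers_cpu_dying() */
};

/*
 * Teardown visits the states from the highest to the lowest. Return true
 * if the timers are moved before any step that needs lockA is reached.
 */
static bool timers_moved_before_lock(const struct step *s, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		if (s[i].migrates_timers)
			return true;
		if (s[i].takes_lock_a)
			return false;
	}
	return false;
}
```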
Reported-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liu Tie <liutie4@huawei.com>
Link: https://lore.kernel.org/r/87a5rphara.ffs@tglx
Signed-off-by: Yu Liao <liaoyu15@huawei.com>

```
---
 include/linux/cpuhotplug.h |  1 +
 include/linux/hrtimer.h    |  4 ++--
 kernel/cpu.c               |  8 +++++++-
 kernel/time/hrtimer.c      | 33 ++++++++++++---------------------
 4 files changed, 22 insertions(+), 24 deletions(-)
```
```diff
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index b540e5a60ea9..b0b4efe227f8 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -154,6 +154,7 @@ enum cpuhp_state {
 	CPUHP_AP_ARM_CORESIGHT_CTI_STARTING,
 	CPUHP_AP_ARM64_ISNDEP_STARTING,
 	CPUHP_AP_SMPCFD_DYING,
+	CPUHP_AP_HRTIMERS_DYING,
 	CPUHP_AP_X86_TBOOT_DYING,
 	CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
 	CPUHP_AP_ONLINE,
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index b1f2e4692e66..c1459b7974c5 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -543,9 +543,9 @@ extern void sysrq_timer_list_show(void);
 
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
-int hrtimers_dead_cpu(unsigned int cpu);
+int hrtimers_cpu_dying(unsigned int cpu);
 #else
-#define hrtimers_dead_cpu	NULL
+#define hrtimers_cpu_dying	NULL
 #endif
 
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9eedba9acb9f..f38ef9be6da1 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1614,7 +1614,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
 	[CPUHP_HRTIMERS_PREPARE] = {
 		.name			= "hrtimers:prepare",
 		.startup.single		= hrtimers_prepare_cpu,
-		.teardown.single	= hrtimers_dead_cpu,
+		.teardown.single	= NULL,
 	},
 	[CPUHP_SMPCFD_PREPARE] = {
 		.name			= "smpcfd:prepare",
@@ -1681,6 +1681,12 @@ static struct cpuhp_step cpuhp_hp_states[] = {
 		.startup.single		= NULL,
 		.teardown.single	= smpcfd_dying_cpu,
 	},
+	[CPUHP_AP_HRTIMERS_DYING] = {
+		.name			= "hrtimers:dying",
+		.startup.single		= NULL,
+		.teardown.single	= hrtimers_cpu_dying,
+	},
+
 	/* Entry state on starting. Interrupts enabled from here on. Transient
 	 * state for synchronsization */
 	[CPUHP_AP_ONLINE] = {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 70deb2f01e97..ede09dda36e9 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2114,29 +2114,22 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 	}
 }
 
-int hrtimers_dead_cpu(unsigned int scpu)
+int hrtimers_cpu_dying(unsigned int dying_cpu)
 {
 	struct hrtimer_cpu_base *old_base, *new_base;
-	int i;
+	int i, ncpu = cpumask_first(cpu_active_mask);
 
-	BUG_ON(cpu_online(scpu));
-	tick_cancel_sched_timer(scpu);
+	tick_cancel_sched_timer(dying_cpu);
+
+	old_base = this_cpu_ptr(&hrtimer_bases);
+	new_base = &per_cpu(hrtimer_bases, ncpu);
 
-	/*
-	 * this BH disable ensures that raise_softirq_irqoff() does
-	 * not wakeup ksoftirqd (and acquire the pi-lock) while
-	 * holding the cpu_base lock
-	 */
-	local_bh_disable();
-	local_irq_disable();
-	old_base = &per_cpu(hrtimer_bases, scpu);
-	new_base = this_cpu_ptr(&hrtimer_bases);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
 	 */
-	raw_spin_lock(&new_base->lock);
-	raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
+	raw_spin_lock(&old_base->lock);
+	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		migrate_hrtimer_list(&old_base->clock_base[i],
@@ -2147,15 +2140,13 @@ int hrtimers_dead_cpu(unsigned int scpu)
 	 * The migration might have changed the first expiring softirq
 	 * timer on this CPU. Update it.
 	 */
-	hrtimer_update_softirq_timer(new_base, false);
+	__hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
+	/* Tell the other CPU to retrigger the next event */
+	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
 
-	raw_spin_unlock(&old_base->lock);
 	raw_spin_unlock(&new_base->lock);
+	raw_spin_unlock(&old_base->lock);
 
-	/* Check, if we got expired work to do */
-	__hrtimer_peek_ahead_timers();
-	local_irq_enable();
-	local_bh_enable();
 	return 0;
 }
```
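A user-space sketch of the push that the new hrtimers_cpu_dying() performs: lock the outgoing CPU's base, then the target's; move every pending timer; recompute the earliest expiry on the target and flag it for reprogramming; then unlock in reverse order. `struct cpu_base`, `struct timer`, `push_timers()`, `earliest_expiry()` and the `retrigger` flag are hypothetical stand-ins for `struct hrtimer_cpu_base`, migrate_hrtimer_list(), __hrtimer_get_next_event() and the `smp_call_function_single(ncpu, retrigger_next_event, NULL, 0)` IPI. A C11 `atomic_flag` plays the role of the raw spinlock, and the caller passes the target CPU that the patch obtains via `cpumask_first(cpu_active_mask)`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical pending timer; the kernel keeps these in a timerqueue. */
struct timer {
	int64_t expires;
	struct timer *next;
};

/* Hypothetical stand-in for struct hrtimer_cpu_base. */
struct cpu_base {
	atomic_flag lock;	/* plays the role of the raw spinlock */
	struct timer *head;	/* unsorted list of pending timers */
	int64_t next_event;	/* earliest expiry, INT64_MAX if none */
	bool retrigger;		/* stands in for the retrigger_next_event IPI */
};

static struct cpu_base bases[NR_CPUS] = {
	{ .lock = ATOMIC_FLAG_INIT },
	{ .lock = ATOMIC_FLAG_INIT },
};

static void base_lock(struct cpu_base *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;	/* spin until the previous holder clears the flag */
}

static void base_unlock(struct cpu_base *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static int64_t earliest_expiry(const struct cpu_base *b)
{
	int64_t next = INT64_MAX;

	for (const struct timer *t = b->head; t; t = t->next)
		if (t->expires < next)
			next = t->expires;
	return next;
}

/* Push every pending timer from dying_cpu to ncpu (the first active CPU). */
static void push_timers(int dying_cpu, int ncpu)
{
	struct cpu_base *old_base = &bases[dying_cpu];
	struct cpu_base *new_base = &bases[ncpu];

	/* Same order as the patch: outgoing base first, then the target. */
	base_lock(old_base);
	base_lock(new_base);

	while (old_base->head) {
		struct timer *t = old_base->head;

		old_base->head = t->next;
		t->next = new_base->head;
		new_base->head = t;
	}

	/* The earliest expiry on the target may have changed. */
	new_base->next_event = earliest_expiry(new_base);
	new_base->retrigger = true;

	/* Release in reverse order of acquisition. */
	base_unlock(new_base);
	base_unlock(old_base);
}

static int count_timers(const struct cpu_base *b)
{
	int n = 0;

	for (const struct timer *t = b->head; t; t = t->next)
		n++;
	return n;
}
```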
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8JEVI
--------------------------------
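Commit baecdf2dbe73 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") adds a new step to enum cpuhp_state, which breaks the kABI: inserting an enumerator shifts the value of every later state, so modules built against the old header would use stale values.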
To fix the kABI breakage, move the hrtimers:dying step into smpcfd:dying and create a new function, smpcfd_and_hrtimer_dying_cpu(), which calls hrtimers_cpu_dying() and then smpcfd_dying_cpu().
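The kABI problem in miniature: enumerators are compile-time constants, so inserting CPUHP_AP_HRTIMERS_DYING shifts every later value, while a module built against the old header keeps the old numbers. The two enums below are a hypothetical, abbreviated excerpt (prefixed OLD_/NEW_ so both fit in one file, and starting at 0 rather than at the real values); only the relative order is taken from the cpuhotplug.h diff.

```c
#include <assert.h>

/* Hypothetical excerpt of enum cpuhp_state before the upstream patch. */
enum cpuhp_state_old {
	OLD_AP_SMPCFD_DYING,
	OLD_AP_X86_TBOOT_DYING,
	OLD_AP_ARM_CACHE_B15_RAC_DYING,
	OLD_AP_ONLINE,
};

/* The same excerpt after inserting the hrtimers dying state. */
enum cpuhp_state_new {
	NEW_AP_SMPCFD_DYING,
	NEW_AP_HRTIMERS_DYING,
	NEW_AP_X86_TBOOT_DYING,
	NEW_AP_ARM_CACHE_B15_RAC_DYING,
	NEW_AP_ONLINE,
};

/* Entries before the insertion point keep their values ... */
_Static_assert((int)OLD_AP_SMPCFD_DYING == (int)NEW_AP_SMPCFD_DYING, "unchanged");
/* ... and every entry after it moves up by one. */
_Static_assert((int)NEW_AP_ONLINE == (int)OLD_AP_ONLINE + 1, "shifted");
```

Folding the new teardown into the existing smpcfd:dying slot avoids the shift entirely, because no enumerator is added.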
Signed-off-by: Yu Liao <liaoyu15@huawei.com>

```
---
 include/linux/cpuhotplug.h |  1 -
 include/linux/hrtimer.h    |  2 +-
 include/linux/smp.h        |  1 +
 kernel/cpu.c               | 19 +++++++++++++------
 kernel/smp.c               |  8 ++++++++
 5 files changed, 23 insertions(+), 8 deletions(-)
```
```diff
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index b0b4efe227f8..b540e5a60ea9 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -154,7 +154,6 @@ enum cpuhp_state {
 	CPUHP_AP_ARM_CORESIGHT_CTI_STARTING,
 	CPUHP_AP_ARM64_ISNDEP_STARTING,
 	CPUHP_AP_SMPCFD_DYING,
-	CPUHP_AP_HRTIMERS_DYING,
 	CPUHP_AP_X86_TBOOT_DYING,
 	CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
 	CPUHP_AP_ONLINE,
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index c1459b7974c5..cfc62d447fa0 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -545,7 +545,7 @@ int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_cpu_dying(unsigned int cpu);
 #else
-#define hrtimers_cpu_dying	NULL
+static inline int hrtimers_cpu_dying(unsigned int cpu) { return 0; }
 #endif
 
 #endif
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 84a0b4828f66..812c26f61300 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -278,5 +278,6 @@ int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par,
 int smpcfd_prepare_cpu(unsigned int cpu);
 int smpcfd_dead_cpu(unsigned int cpu);
 int smpcfd_dying_cpu(unsigned int cpu);
+int smpcfd_and_hrtimer_dying_cpu(unsigned int cpu);
 
 #endif /* __LINUX_SMP_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f38ef9be6da1..89a8e7b9fdac 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1676,17 +1676,24 @@ static struct cpuhp_step cpuhp_hp_states[] = {
 		.startup.single		= NULL,
 		.teardown.single	= rcutree_dying_cpu,
 	},
+	/*
+	 * In order to fix the kabi breakage, we had to move the hrtimers:dying
+	 * step into smpcfd:dying and create a new function smpcfd_and_hrtimer_dying_cpu().
+	 * Please ensure that there are no other steps with teardown handler
+	 * between smpcfd:dying and cpu:teardown.
+	 */
 	[CPUHP_AP_SMPCFD_DYING] = {
 		.name			= "smpcfd:dying",
 		.startup.single		= NULL,
-		.teardown.single	= smpcfd_dying_cpu,
-	},
-	[CPUHP_AP_HRTIMERS_DYING] = {
-		.name			= "hrtimers:dying",
-		.startup.single		= NULL,
-		.teardown.single	= hrtimers_cpu_dying,
+		.teardown.single	= smpcfd_and_hrtimer_dying_cpu,
 	},
 
+	/*
+	 * Attention: Please do not add steps between smpcfd:dying
+	 * and ap:online. Please refer to the above for specific
+	 * reasons.
+	 */
+
 	/* Entry state on starting. Interrupts enabled from here on. Transient
 	 * state for synchronsization */
 	[CPUHP_AP_ONLINE] = {
diff --git a/kernel/smp.c b/kernel/smp.c
index 114776d0d11e..863cf7e2dbdc 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -75,6 +75,14 @@ int smpcfd_dead_cpu(unsigned int cpu)
 	return 0;
 }
 
+int smpcfd_and_hrtimer_dying_cpu(unsigned int cpu)
+{
+	hrtimers_cpu_dying(cpu);
+	smpcfd_dying_cpu(cpu);
+
+	return 0;
+}
+
 int smpcfd_dying_cpu(unsigned int cpu)
 {
 	/*
```
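A user-space sketch of how one table slot now runs both teardowns. The wrapper calls hrtimers_cpu_dying() first, which preserves the order of the upstream layout: there CPUHP_AP_HRTIMERS_DYING sits above CPUHP_AP_SMPCFD_DYING and is therefore torn down first. The callbacks below are hypothetical stand-ins that only append their name to a trace buffer, and `struct step` is a simplified model of `struct cpuhp_step`. Because the wrapper calls hrtimers_cpu_dying() directly, the `!CONFIG_HOTPLUG_CPU` fallback can no longer be `#define hrtimers_cpu_dying NULL` (the call would expand to `NULL(cpu)` and fail to compile), which is presumably why the hrtimer.h hunk turns it into a `static inline` returning 0.

```c
#include <assert.h>
#include <string.h>

/* Records the order in which the stand-in callbacks ran. */
static char trace[64];

/* Hypothetical stand-ins for the real teardown callbacks. */
static int hrtimers_cpu_dying(unsigned int cpu)
{
	(void)cpu;
	strcat(trace, "hrtimers ");
	return 0;
}

static int smpcfd_dying_cpu(unsigned int cpu)
{
	(void)cpu;
	strcat(trace, "smpcfd ");
	return 0;
}

/* Same shape as the wrapper in kernel/smp.c: hrtimers first, then smpcfd. */
static int smpcfd_and_hrtimer_dying_cpu(unsigned int cpu)
{
	hrtimers_cpu_dying(cpu);
	smpcfd_dying_cpu(cpu);
	return 0;
}

/* Simplified model of one cpuhp_hp_states[] slot. */
struct step {
	const char *name;
	int (*teardown)(unsigned int cpu);
};

static const struct step smpcfd_dying = {
	.name = "smpcfd:dying",
	.teardown = smpcfd_and_hrtimer_dying_cpu,
};
```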
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/2987
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/5...