
Fix watchdog false positive problem

Luo Gengkun (1):
  watchdog: fix watchdog may detect false positive of softlockup

Nysal Jan K.A (1):
  watchdog: fix the SOFTLOCKUP_DETECTOR=n case

 include/linux/nmi.h |  1 +
 kernel/sysctl.c     |  2 +-
 kernel/watchdog.c   | 39 ++++++++++++++++++++++++++-------------
 3 files changed, 28 insertions(+), 14 deletions(-)

-- 
2.34.1

Feedback: The patch(es) you sent to kernel@openeuler.org could not be converted to a PR.
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/Q2R...
Failure reason: the call to the gitee API to create the PR failed with the following error: title: title cannot be empty
Suggested solution: please wait; the bot will retry at the next interval.


mm-unstable inclusion
category: bugfix
bugzilla: 190597

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches...

--------------------------------

When updating `watchdog_thresh`, there is a race condition between writing
the new `watchdog_thresh` value and stopping the old watchdog timer. If the
old timer fires during this window, it may falsely detect a softlockup
because the old interval is used together with the new `watchdog_thresh`
value. The problem can be described as follows:

 # We assume the previous watchdog_thresh is 60, so the watchdog timer
 # fires every 24s.

echo 10 > /proc/sys/kernel/watchdog_thresh (User space)
|
+------>+ update watchdog_thresh (We are in kernel now)
        |
        |         # using old interval and new `watchdog_thresh`
        +------>+ watchdog hrtimer (irq context: detect softlockup)
                |
                |
        +-------+
        |
        |
        + softlockup_stop_all

To fix this problem, introduce a shadow variable for `watchdog_thresh`. The
update to the actual `watchdog_thresh` is delayed until after the old timer
is stopped, preventing false positives.

The following test case may help to understand this problem.

---------------------------------------------
echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched/features
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /sys/kernel/debug/sched/fair_server/cpu3/runtime
echo 60 > /proc/sys/kernel/watchdog_thresh
taskset -c 3 chrt -r 99 /bin/bash -c "while true;do true; done" &
echo 10 > /proc/sys/kernel/watchdog_thresh &
---------------------------------------------

The test case above first removes the throttling restrictions for real-time
tasks. It then sets watchdog_thresh to 60 and executes a real-time task, a
simple while(1) loop, on cpu3. Consequently, the final command gets blocked
because the presence of this real-time thread prevents kworker:3 from being
selected by the scheduler.
This eventually triggers a softlockup detection on cpu3 because
watchdog_timer_fn operates with inconsistent variables: it uses the old
interval and the updated watchdog_thresh at the same time.

Link: https://lkml.kernel.org/r/20250421035021.3507649-1-luogengkun@huaweicloud.co...
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: "Nysal Jan K.A." <nysal@linux.ibm.com>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
	include/linux/nmi.h
	kernel/sysctl.c
	kernel/watchdog.c
[Fix conflicts because the original patch is based on mainline]
Signed-off-by: Luo Gengkun <luogengkun2@huawei.com>
---
 include/linux/nmi.h |  1 +
 kernel/sysctl.c     |  2 +-
 kernel/watchdog.c   | 35 +++++++++++++++++++++++------------
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9e7767353f7e..ae9dbedb9849 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -21,6 +21,7 @@ extern int watchdog_user_enabled;
 extern int nmi_watchdog_user_enabled;
 extern int soft_watchdog_user_enabled;
 extern int watchdog_thresh;
+extern int watchdog_thresh_next;
 extern unsigned long watchdog_enabled;
 
 extern struct cpumask watchdog_cpumask;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b4b36f8a3149..0b1c13a05332 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2384,7 +2384,7 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "watchdog_thresh",
-		.data		= &watchdog_thresh,
+		.data		= &watchdog_thresh_next,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_watchdog_thresh,
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 36f458111205..5a286f6bd169 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -43,6 +43,7 @@ int __read_mostly watchdog_user_enabled = 1;
 int __read_mostly nmi_watchdog_user_enabled = NMI_WATCHDOG_DEFAULT;
 int __read_mostly soft_watchdog_user_enabled = 1;
 int __read_mostly watchdog_thresh = 10;
+int __read_mostly watchdog_thresh_next;
 static int __read_mostly nmi_watchdog_available;
 
 struct cpumask watchdog_cpumask __read_mostly;
@@ -558,12 +559,20 @@ int lockup_detector_offline_cpu(unsigned int cpu)
 	return 0;
 }
 
-static void __lockup_detector_reconfigure(void)
+static void __lockup_detector_reconfigure(bool thresh_changed)
 {
 	cpus_read_lock();
 	nmi_watchdog_ops.watchdog_nmi_stop();
 
 	softlockup_stop_all();
+	/*
+	 * To prevent watchdog_timer_fn from using the old interval and
+	 * the new watchdog_thresh at the same time, which could lead to
+	 * false softlockup reports, it is necessary to update the
+	 * watchdog_thresh after the softlockup is completed.
+	 */
+	if (thresh_changed)
+		watchdog_thresh = READ_ONCE(watchdog_thresh_next);
 	set_sample_period();
 	lockup_detector_update_enable();
 	if (watchdog_enabled && watchdog_thresh)
@@ -581,7 +590,7 @@ static void __lockup_detector_reconfigure(void)
 void lockup_detector_reconfigure(void)
 {
 	mutex_lock(&watchdog_mutex);
-	__lockup_detector_reconfigure();
+	__lockup_detector_reconfigure(false);
 	mutex_unlock(&watchdog_mutex);
 }
 
@@ -605,7 +614,7 @@ static __init void lockup_detector_setup(void)
 		return;
 
 	mutex_lock(&watchdog_mutex);
-	__lockup_detector_reconfigure();
+	__lockup_detector_reconfigure(false);
 	softlockup_initialized = true;
 	mutex_unlock(&watchdog_mutex);
 }
@@ -621,11 +630,11 @@ static void __lockup_detector_reconfigure(void)
 }
 void lockup_detector_reconfigure(void)
 {
-	__lockup_detector_reconfigure();
+	__lockup_detector_reconfigure(false);
 }
 static inline void lockup_detector_setup(void)
 {
-	__lockup_detector_reconfigure();
+	__lockup_detector_reconfigure(false);
 }
 #endif /* !CONFIG_SOFTLOCKUP_DETECTOR */
 
@@ -661,11 +670,11 @@ void lockup_detector_soft_poweroff(void)
 #ifdef CONFIG_SYSCTL
 
 /* Propagate any changes to the watchdog threads */
-static void proc_watchdog_update(void)
+static void proc_watchdog_update(bool thresh_changed)
 {
 	/* Remove impossible cpus to keep sysctl output clean. */
 	cpumask_and(&watchdog_cpumask, &watchdog_cpumask, cpu_possible_mask);
-	__lockup_detector_reconfigure();
+	__lockup_detector_reconfigure(thresh_changed);
 }
 
 /*
@@ -698,7 +707,7 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
 		old = READ_ONCE(*param);
 		err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 		if (!err && old != READ_ONCE(*param))
-			proc_watchdog_update();
+			proc_watchdog_update(false);
 	}
 	mutex_unlock(&watchdog_mutex);
 	return err;
@@ -746,11 +755,13 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 
 	mutex_lock(&watchdog_mutex);
 
-	old = READ_ONCE(watchdog_thresh);
+	watchdog_thresh_next = READ_ONCE(watchdog_thresh);
+
+	old = watchdog_thresh_next;
 	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 
-	if (!err && write && old != READ_ONCE(watchdog_thresh))
-		proc_watchdog_update();
+	if (!err && write && old != READ_ONCE(watchdog_thresh_next))
+		proc_watchdog_update(true);
 
 	mutex_unlock(&watchdog_mutex);
 	return err;
@@ -771,7 +782,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int write,
 
 	err = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
 	if (!err && write)
-		proc_watchdog_update();
+		proc_watchdog_update(false);
 
 	mutex_unlock(&watchdog_mutex);
 	return err;
-- 
2.34.1

From: "Nysal Jan K.A" <nysal@linux.ibm.com>

mm-unstable inclusion
category: bugfix
bugzilla: 190597

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches...

--------------------------------

Update watchdog_thresh when SOFTLOCKUP_DETECTOR=n. Also fix a build failure
in this case.

Link: https://lkml.kernel.org/r/20250502111120.282690-1-nysal@linux.ibm.com
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Closes: https://lore.kernel.org/all/339e2b3e-c7ee-418f-a84c-9c6360dc570b@linux.ibm.c...
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Song Liu <song@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
	kernel/watchdog.c
[Fix conflicts because the original patch is based on mainline]
Signed-off-by: Luo Gengkun <luogengkun2@huawei.com>
---
 kernel/watchdog.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 5a286f6bd169..88be068e9922 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -620,10 +620,12 @@ static __init void lockup_detector_setup(void)
 }
 
 #else /* CONFIG_SOFTLOCKUP_DETECTOR */
-static void __lockup_detector_reconfigure(void)
+static void __lockup_detector_reconfigure(bool thresh_changed)
 {
 	cpus_read_lock();
 	nmi_watchdog_ops.watchdog_nmi_stop();
+	if (thresh_changed)
+		watchdog_thresh = READ_ONCE(watchdog_thresh_next);
 	lockup_detector_update_enable();
 	nmi_watchdog_ops.watchdog_nmi_start();
 	cpus_read_unlock();
-- 
2.34.1
