mailweb.openeuler.org
Manage this list

Keyboard Shortcuts

Thread View

  • j: Next unread message
  • k: Previous unread message
  • j a: Jump to all threads
  • j l: Jump to MailingList overview

Kernel

Threads by month
  • ----- 2026 -----
  • March
  • February
  • January
  • ----- 2025 -----
  • December
  • November
  • October
  • September
  • August
  • July
  • June
  • May
  • April
  • March
  • February
  • January
  • ----- 2024 -----
  • December
  • November
  • October
  • September
  • August
  • July
  • June
  • May
  • April
  • March
  • February
  • January
  • ----- 2023 -----
  • December
  • November
  • October
  • September
  • August
  • July
  • June
  • May
  • April
  • March
  • February
  • January
  • ----- 2022 -----
  • December
  • November
  • October
  • September
  • August
  • July
  • June
  • May
  • April
  • March
  • February
  • January
  • ----- 2021 -----
  • December
  • November
  • October
  • September
  • August
  • July
  • June
  • May
  • April
  • March
  • February
  • January
  • ----- 2020 -----
  • December
  • November
  • October
  • September
  • August
  • July
  • June
  • May
  • April
  • March
  • February
  • January
  • ----- 2019 -----
  • December
kernel@openeuler.org

  • 50 participants
  • 23104 discussions
[PATCH OLK-6.6 1/2] [Backport] sched/fair: Proportional newidle balance
by Chen Jinghuang 10 Mar '26

10 Mar '26
From: "Peter Zijlstra (Intel)" <peterz(a)infradead.org> stable inclusion from stable-v6.6.120 commit 51445190c10a36d292e70db085d0fb6cc3bec94f category: perf bugzilla: https://atomgit.com/openeuler/kernel/issues/8555 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream. Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Chris Mason <clm(a)meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org [ Ajay: Modified to apply on v6.6 ] Signed-off-by: Ajay Kaher <ajay.kaher(a)broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: kernel/sched/core.c kernel/sched/fair.c [context conflicts] Signed-off-by: Chen Jinghuang <chenjinghuang2(a)huawei.com> --- include/linux/sched/topology.h | 3 +++ kernel/sched/core.c | 3 +++ kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++---- kernel/sched/features.h | 5 ++++ kernel/sched/sched.h | 7 ++++++ kernel/sched/topology.c | 6 +++++ 6 files changed, 64 insertions(+), 4 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 7eee852aa384..95e5d7772800 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -130,6 +130,9 @@ struct sched_domain { unsigned int nr_balance_failed; /* initialise to 0 */ /* idle_balance() stats */ + unsigned int newidle_call; + unsigned int newidle_success; + unsigned int newidle_ratio; u64 max_newidle_lb_cost; unsigned long last_decay_max_lb_cost; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 69d10fdb84d8..cf45ea94bb05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,6 +119,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU(struct rnd_state, sched_rnd_state); #ifdef CONFIG_QOS_SCHED static void sched_change_qos_group(struct task_struct *tsk, struct task_group *tg); @@ -9976,6 +9977,8 @@ void __init sched_init_smp(void) sched_init_numa(NUMA_NO_NODE); set_sched_cluster(); + prandom_init_once(&sched_rnd_state); + /* * There's no userspace yet to cause hotplug operations; hence all the * CPU masks are stable and all blatant races in the below code cannot diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a4592d35da8b..d16c65d5fa34 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -13651,8 +13651,24 @@ void update_max_interval(void) max_load_balance_interval = HZ*num_online_cpus()/10; } -static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost) +static inline void update_newidle_stats(struct sched_domain *sd, unsigned int success) { + sd->newidle_call++; + sd->newidle_success += success; + + if (sd->newidle_call >= 1024) { + sd->newidle_ratio = sd->newidle_success; + sd->newidle_call /= 2; + sd->newidle_success /= 2; + } +} + +static inline bool +update_newidle_cost(struct sched_domain *sd, u64 cost, unsigned int success) +{ + if (cost) + update_newidle_stats(sd, success); + if (cost > sd->max_newidle_lb_cost) { /* * Track max cost of a domain to make sure to not delay the @@ -13700,7 +13716,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) * Decay the newidle max times here because this is a regular * visit to all the domains. */ - need_decay = update_newidle_cost(sd, 0); + need_decay = update_newidle_cost(sd, 0, 0); max_cost += sd->max_newidle_lb_cost; /* @@ -14336,6 +14352,22 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) break; if (sd->flags & SD_BALANCE_NEWIDLE) { + unsigned int weight = 1; + + if (sched_feat(NI_RANDOM)) { + /* + * Throw a 1k sided dice; and only run + * newidle_balance according to the success + * rate. + */ + u32 d1k = sched_rng() % 1024; + weight = 1 + sd->newidle_ratio; + if (d1k > weight) { + update_newidle_stats(sd, 0); + continue; + } + weight = (1024 + weight/2) / weight; + } pulled_task = load_balance(this_cpu, this_rq, sd, CPU_NEWLY_IDLE, @@ -14343,10 +14375,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) t1 = sched_clock_cpu(this_cpu); domain_cost = t1 - t0; - update_newidle_cost(sd, domain_cost); - curr_cost += domain_cost; t0 = t1; + + /* + * Track max cost of a domain to make sure to not delay the + * next wakeup on the CPU. + */ + update_newidle_cost(sd, domain_cost, weight * !!pulled_task); } /* diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 52ea0097c513..24a0c853a8a0 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -106,6 +106,11 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) +/* + * Do newidle balancing proportional to its success rate using randomization. + */ +SCHED_FEAT(NI_RANDOM, false) + SCHED_FEAT(HZ_BW, true) SCHED_FEAT(IRQ_AVG, false) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bb581cbdae8d..0c3abb44f365 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -5,6 +5,7 @@ #ifndef _KERNEL_SCHED_SCHED_H #define _KERNEL_SCHED_SCHED_H +#include <linux/prandom.h> #include <linux/sched/affinity.h> #include <linux/sched/autogroup.h> #include <linux/sched/cpufreq.h> @@ -1377,6 +1378,12 @@ static inline bool is_migration_disabled(struct task_struct *p) } DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU(struct rnd_state, sched_rnd_state); + +static inline u32 sched_rng(void) +{ + return prandom_u32_state(this_cpu_ptr(&sched_rnd_state)); +} #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 3023e67da0fd..cf847fdf1063 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1644,6 +1644,12 @@ sd_init(struct sched_domain_topology_level *tl, .last_balance = jiffies, .balance_interval = sd_weight, + + /* 50% success rate */ + .newidle_call = 512, + .newidle_success = 256, + .newidle_ratio = 512, + .max_newidle_lb_cost = 0, .last_decay_max_lb_cost = jiffies, .child = child, -- 2.34.1
2 3
0 0
[PATCH 1/2] [Backport] sched/fair: Proportional newidle balance
by Chen Jinghuang 10 Mar '26

10 Mar '26
From: "Peter Zijlstra (Intel)" <peterz(a)infradead.org> stable inclusion from stable-v6.6.120 commit 51445190c10a36d292e70db085d0fb6cc3bec94f category: perf bugzilla: https://atomgit.com/openeuler/kernel/issues/8555 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream. Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Chris Mason <clm(a)meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org [ Ajay: Modified to apply on v6.6 ] Signed-off-by: Ajay Kaher <ajay.kaher(a)broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: kernel/sched/core.c kernel/sched/fair.c [context conflicts] Signed-off-by: Chen Jinghuang <chenjinghuang2(a)huawei.com> --- include/linux/sched/topology.h | 3 +++ kernel/sched/core.c | 3 +++ kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++---- kernel/sched/features.h | 5 ++++ kernel/sched/sched.h | 7 ++++++ kernel/sched/topology.c | 6 +++++ 6 files changed, 64 insertions(+), 4 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 7eee852aa384..95e5d7772800 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -130,6 +130,9 @@ struct sched_domain { unsigned int nr_balance_failed; /* initialise to 0 */ /* idle_balance() stats */ + unsigned int newidle_call; + unsigned int newidle_success; + unsigned int newidle_ratio; u64 max_newidle_lb_cost; unsigned long last_decay_max_lb_cost; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 69d10fdb84d8..cf45ea94bb05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,6 +119,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU(struct rnd_state, sched_rnd_state); #ifdef CONFIG_QOS_SCHED static void sched_change_qos_group(struct task_struct *tsk, struct task_group *tg); @@ -9976,6 +9977,8 @@ void __init sched_init_smp(void) sched_init_numa(NUMA_NO_NODE); set_sched_cluster(); + prandom_init_once(&sched_rnd_state); + /* * There's no userspace yet to cause hotplug operations; hence all the * CPU masks are stable and all blatant races in the below code cannot diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a4592d35da8b..d16c65d5fa34 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -13651,8 +13651,24 @@ void update_max_interval(void) max_load_balance_interval = HZ*num_online_cpus()/10; } -static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost) +static inline void update_newidle_stats(struct sched_domain *sd, unsigned int success) { + sd->newidle_call++; + sd->newidle_success += success; + + if (sd->newidle_call >= 1024) { + sd->newidle_ratio = sd->newidle_success; + sd->newidle_call /= 2; + sd->newidle_success /= 2; + } +} + +static inline bool +update_newidle_cost(struct sched_domain *sd, u64 cost, unsigned int success) +{ + if (cost) + update_newidle_stats(sd, success); + if (cost > sd->max_newidle_lb_cost) { /* * Track max cost of a domain to make sure to not delay the @@ -13700,7 +13716,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) * Decay the newidle max times here because this is a regular * visit to all the domains. */ - need_decay = update_newidle_cost(sd, 0); + need_decay = update_newidle_cost(sd, 0, 0); max_cost += sd->max_newidle_lb_cost; /* @@ -14336,6 +14352,22 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) break; if (sd->flags & SD_BALANCE_NEWIDLE) { + unsigned int weight = 1; + + if (sched_feat(NI_RANDOM)) { + /* + * Throw a 1k sided dice; and only run + * newidle_balance according to the success + * rate. + */ + u32 d1k = sched_rng() % 1024; + weight = 1 + sd->newidle_ratio; + if (d1k > weight) { + update_newidle_stats(sd, 0); + continue; + } + weight = (1024 + weight/2) / weight; + } pulled_task = load_balance(this_cpu, this_rq, sd, CPU_NEWLY_IDLE, @@ -14343,10 +14375,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) t1 = sched_clock_cpu(this_cpu); domain_cost = t1 - t0; - update_newidle_cost(sd, domain_cost); - curr_cost += domain_cost; t0 = t1; + + /* + * Track max cost of a domain to make sure to not delay the + * next wakeup on the CPU. + */ + update_newidle_cost(sd, domain_cost, weight * !!pulled_task); } /* diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 52ea0097c513..24a0c853a8a0 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -106,6 +106,11 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) +/* + * Do newidle balancing proportional to its success rate using randomization. + */ +SCHED_FEAT(NI_RANDOM, false) + SCHED_FEAT(HZ_BW, true) SCHED_FEAT(IRQ_AVG, false) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bb581cbdae8d..0c3abb44f365 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -5,6 +5,7 @@ #ifndef _KERNEL_SCHED_SCHED_H #define _KERNEL_SCHED_SCHED_H +#include <linux/prandom.h> #include <linux/sched/affinity.h> #include <linux/sched/autogroup.h> #include <linux/sched/cpufreq.h> @@ -1377,6 +1378,12 @@ static inline bool is_migration_disabled(struct task_struct *p) } DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU(struct rnd_state, sched_rnd_state); + +static inline u32 sched_rng(void) +{ + return prandom_u32_state(this_cpu_ptr(&sched_rnd_state)); +} #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 3023e67da0fd..cf847fdf1063 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1644,6 +1644,12 @@ sd_init(struct sched_domain_topology_level *tl, .last_balance = jiffies, .balance_interval = sd_weight, + + /* 50% success rate */ + .newidle_call = 512, + .newidle_success = 256, + .newidle_ratio = 512, + .max_newidle_lb_cost = 0, .last_decay_max_lb_cost = jiffies, .child = child, -- 2.34.1
1 1
0 0
[PATCH 1/2] [Backport] sched/fair: Proportional newidle balance
by Chen Jinghuang 10 Mar '26

10 Mar '26
From: "Peter Zijlstra (Intel)" <peterz(a)infradead.org> stable inclusion from stable-v6.6.120 commit 51445190c10a36d292e70db085d0fb6cc3bec94f category: perf bugzilla: https://atomgit.com/openeuler/kernel/issues/8555 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream. Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Chris Mason <clm(a)meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org [ Ajay: Modified to apply on v6.6 ] Signed-off-by: Ajay Kaher <ajay.kaher(a)broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: kernel/sched/core.c kernel/sched/fair.c [context conflicts] Signed-off-by: Chen Jinghuang <chenjinghuang2(a)huawei.com> --- include/linux/sched/topology.h | 3 +++ kernel/sched/core.c | 3 +++ kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++---- kernel/sched/features.h | 5 ++++ kernel/sched/sched.h | 7 ++++++ kernel/sched/topology.c | 6 +++++ 6 files changed, 64 insertions(+), 4 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 7eee852aa384..95e5d7772800 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -130,6 +130,9 @@ struct sched_domain { unsigned int nr_balance_failed; /* initialise to 0 */ /* idle_balance() stats */ + unsigned int newidle_call; + unsigned int newidle_success; + unsigned int newidle_ratio; u64 max_newidle_lb_cost; unsigned long last_decay_max_lb_cost; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 69d10fdb84d8..cf45ea94bb05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,6 +119,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU(struct rnd_state, sched_rnd_state); #ifdef CONFIG_QOS_SCHED static void sched_change_qos_group(struct task_struct *tsk, struct task_group *tg); @@ -9976,6 +9977,8 @@ void __init sched_init_smp(void) sched_init_numa(NUMA_NO_NODE); set_sched_cluster(); + prandom_init_once(&sched_rnd_state); + /* * There's no userspace yet to cause hotplug operations; hence all the * CPU masks are stable and all blatant races in the below code cannot diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a4592d35da8b..d16c65d5fa34 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -13651,8 +13651,24 @@ void update_max_interval(void) max_load_balance_interval = HZ*num_online_cpus()/10; } -static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost) +static inline void update_newidle_stats(struct sched_domain *sd, unsigned int success) { + sd->newidle_call++; + sd->newidle_success += success; + + if (sd->newidle_call >= 1024) { + sd->newidle_ratio = sd->newidle_success; + sd->newidle_call /= 2; + sd->newidle_success /= 2; + } +} + +static inline bool +update_newidle_cost(struct sched_domain *sd, u64 cost, unsigned int success) +{ + if (cost) + update_newidle_stats(sd, success); + if (cost > sd->max_newidle_lb_cost) { /* * Track max cost of a domain to make sure to not delay the @@ -13700,7 +13716,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) * Decay the newidle max times here because this is a regular * visit to all the domains. */ - need_decay = update_newidle_cost(sd, 0); + need_decay = update_newidle_cost(sd, 0, 0); max_cost += sd->max_newidle_lb_cost; /* @@ -14336,6 +14352,22 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) break; if (sd->flags & SD_BALANCE_NEWIDLE) { + unsigned int weight = 1; + + if (sched_feat(NI_RANDOM)) { + /* + * Throw a 1k sided dice; and only run + * newidle_balance according to the success + * rate. + */ + u32 d1k = sched_rng() % 1024; + weight = 1 + sd->newidle_ratio; + if (d1k > weight) { + update_newidle_stats(sd, 0); + continue; + } + weight = (1024 + weight/2) / weight; + } pulled_task = load_balance(this_cpu, this_rq, sd, CPU_NEWLY_IDLE, @@ -14343,10 +14375,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) t1 = sched_clock_cpu(this_cpu); domain_cost = t1 - t0; - update_newidle_cost(sd, domain_cost); - curr_cost += domain_cost; t0 = t1; + + /* + * Track max cost of a domain to make sure to not delay the + * next wakeup on the CPU. + */ + update_newidle_cost(sd, domain_cost, weight * !!pulled_task); } /* diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 52ea0097c513..6cd6b5307430 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -106,6 +106,11 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) +/* + * Do newidle balancing proportional to its success rate using randomization. + */ +SCHED_FEAT(NI_RANDOM, true) + SCHED_FEAT(HZ_BW, true) SCHED_FEAT(IRQ_AVG, false) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bb581cbdae8d..0c3abb44f365 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -5,6 +5,7 @@ #ifndef _KERNEL_SCHED_SCHED_H #define _KERNEL_SCHED_SCHED_H +#include <linux/prandom.h> #include <linux/sched/affinity.h> #include <linux/sched/autogroup.h> #include <linux/sched/cpufreq.h> @@ -1377,6 +1378,12 @@ static inline bool is_migration_disabled(struct task_struct *p) } DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU(struct rnd_state, sched_rnd_state); + +static inline u32 sched_rng(void) +{ + return prandom_u32_state(this_cpu_ptr(&sched_rnd_state)); +} #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 3023e67da0fd..cf847fdf1063 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1644,6 +1644,12 @@ sd_init(struct sched_domain_topology_level *tl, .last_balance = jiffies, .balance_interval = sd_weight, + + /* 50% success rate */ + .newidle_call = 512, + .newidle_success = 256, + .newidle_ratio = 512, + .max_newidle_lb_cost = 0, .last_decay_max_lb_cost = jiffies, .child = child, -- 2.34.1
1 1
0 0
[PATCH 1/2] [Backport] sched/fair: Proportional newidle balance
by Chen Jinghuang 10 Mar '26

10 Mar '26
From: "Peter Zijlstra (Intel)" <peterz(a)infradead.org> stable inclusion from stable-v6.6.120 commit 51445190c10a36d292e70db085d0fb6cc3bec94f category: perf bugzilla: https://atomgit.com/openeuler/kernel/issues/8555 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream. Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com> Tested-by: Chris Mason <clm(a)meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org [ Ajay: Modified to apply on v6.6 ] Signed-off-by: Ajay Kaher <ajay.kaher(a)broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: kernel/sched/core.c kernel/sched/fair.c [context conflicts] Signed-off-by: Chen Jinghuang <chenjinghuang2(a)huawei.com> --- include/linux/sched/topology.h | 3 +++ kernel/sched/core.c | 3 +++ kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++---- kernel/sched/features.h | 5 ++++ kernel/sched/sched.h | 7 ++++++ kernel/sched/topology.c | 6 +++++ 6 files changed, 64 insertions(+), 4 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 7eee852aa384..95e5d7772800 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -130,6 +130,9 @@ struct sched_domain { unsigned int nr_balance_failed; /* initialise to 0 */ /* idle_balance() stats */ + unsigned int newidle_call; + unsigned int newidle_success; + unsigned int newidle_ratio; u64 max_newidle_lb_cost; unsigned long last_decay_max_lb_cost; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 69d10fdb84d8..cf45ea94bb05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,6 +119,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp); EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU(struct rnd_state, sched_rnd_state); #ifdef CONFIG_QOS_SCHED static void sched_change_qos_group(struct task_struct *tsk, struct task_group *tg); @@ -9976,6 +9977,8 @@ void __init sched_init_smp(void) sched_init_numa(NUMA_NO_NODE); set_sched_cluster(); + prandom_init_once(&sched_rnd_state); + /* * There's no userspace yet to cause hotplug operations; hence all the * CPU masks are stable and all blatant races in the below code cannot diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a4592d35da8b..d16c65d5fa34 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -13651,8 +13651,24 @@ void update_max_interval(void) max_load_balance_interval = HZ*num_online_cpus()/10; } -static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost) +static inline void update_newidle_stats(struct sched_domain *sd, unsigned int success) { + sd->newidle_call++; + sd->newidle_success += success; + + if (sd->newidle_call >= 1024) { + sd->newidle_ratio = sd->newidle_success; + sd->newidle_call /= 2; + sd->newidle_success /= 2; + } +} + +static inline bool +update_newidle_cost(struct sched_domain *sd, u64 cost, unsigned int success) +{ + if (cost) + update_newidle_stats(sd, success); + if (cost > sd->max_newidle_lb_cost) { /* * Track max cost of a domain to make sure to not delay the @@ -13700,7 +13716,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) * Decay the newidle max times here because this is a regular * visit to all the domains. */ - need_decay = update_newidle_cost(sd, 0); + need_decay = update_newidle_cost(sd, 0, 0); max_cost += sd->max_newidle_lb_cost; /* @@ -14336,6 +14352,22 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) break; if (sd->flags & SD_BALANCE_NEWIDLE) { + unsigned int weight = 1; + + if (sched_feat(NI_RANDOM)) { + /* + * Throw a 1k sided dice; and only run + * newidle_balance according to the success + * rate. + */ + u32 d1k = sched_rng() % 1024; + weight = 1 + sd->newidle_ratio; + if (d1k > weight) { + update_newidle_stats(sd, 0); + continue; + } + weight = (1024 + weight/2) / weight; + } pulled_task = load_balance(this_cpu, this_rq, sd, CPU_NEWLY_IDLE, @@ -14343,10 +14375,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) t1 = sched_clock_cpu(this_cpu); domain_cost = t1 - t0; - update_newidle_cost(sd, domain_cost); - curr_cost += domain_cost; t0 = t1; + + /* + * Track max cost of a domain to make sure to not delay the + * next wakeup on the CPU. + */ + update_newidle_cost(sd, domain_cost, weight * !!pulled_task); } /* diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 52ea0097c513..6cd6b5307430 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -106,6 +106,11 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) +/* + * Do newidle balancing proportional to its success rate using randomization. + */ +SCHED_FEAT(NI_RANDOM, true) + SCHED_FEAT(HZ_BW, true) SCHED_FEAT(IRQ_AVG, false) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bb581cbdae8d..0c3abb44f365 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -5,6 +5,7 @@ #ifndef _KERNEL_SCHED_SCHED_H #define _KERNEL_SCHED_SCHED_H +#include <linux/prandom.h> #include <linux/sched/affinity.h> #include <linux/sched/autogroup.h> #include <linux/sched/cpufreq.h> @@ -1377,6 +1378,12 @@ static inline bool is_migration_disabled(struct task_struct *p) } DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU(struct rnd_state, sched_rnd_state); + +static inline u32 sched_rng(void) +{ + return prandom_u32_state(this_cpu_ptr(&sched_rnd_state)); +} #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 3023e67da0fd..cf847fdf1063 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1644,6 +1644,12 @@ sd_init(struct sched_domain_topology_level *tl, .last_balance = jiffies, .balance_interval = sd_weight, + + /* 50% success rate */ + .newidle_call = 512, + .newidle_success = 256, + .newidle_ratio = 512, + .max_newidle_lb_cost = 0, .last_decay_max_lb_cost = jiffies, .child = child, -- 2.34.1
1 1
0 0
[PATCH OLK-6.6] tools/interference: add ifstool utility for kernel interference statistics
by Tengda Wu 10 Mar '26

10 Mar '26
hulk inclusion category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/7429 -------------------------------- As real-time and high-performance workloads become more sensitive to execution jitter, observability into kernel-induced "noise" is critical. The CONFIG_CGROUP_IFS infrastructure provides this telemetry, but raw data from cgroups is difficult to parse and visualize manually. Introduce ifstool, a userspace utility to monitor and analyze the interference.stat interface. Key capabilities include: 1. High-Resolution Monitoring: - Samples raw counters and latency distributions at sub-second intervals. - Outputs structured CSV data for long-term profiling and integration with external post-processing tools. 2. Interactive Offline Reporting: - Generates self-contained HTML dashboards using Plotly. - Visualizes "Total Time Delta" trends to identify temporal spikes. - Renders latency heatmaps to expose the magnitude and frequency of interference events. Signed-off-by: Tengda Wu <wutengda2(a)huawei.com> --- tools/Makefile | 11 +- tools/kspect/Makefile | 28 +++ tools/kspect/README | 65 ++++++ tools/kspect/ifstool | 409 ++++++++++++++++++++++++++++++++++ tools/kspect/ifstool.1 | 87 ++++++++ tools/kspect/requirements.txt | 2 + 6 files changed, 598 insertions(+), 4 deletions(-) create mode 100644 tools/kspect/Makefile create mode 100644 tools/kspect/README create mode 100644 tools/kspect/ifstool create mode 100644 tools/kspect/ifstool.1 create mode 100644 tools/kspect/requirements.txt diff --git a/tools/Makefile b/tools/Makefile index 37e9f6804832..153817f0cc17 100644 --- a/tools/Makefile +++ b/tools/Makefile @@ -22,6 +22,7 @@ help: @echo ' hv - tools used when in Hyper-V clients' @echo ' iio - IIO tools' @echo ' intel-speed-select - Intel Speed Select tool' + @echo ' kspect - KSPECT tools' @echo ' kvm_stat - top-like utility for displaying kvm statistics' @echo ' leds - LEDs tools' @echo ' nolibc - nolibc headers testing and installation' @@ -69,7 +70,7 @@ acpi: FORCE cpupower: FORCE $(call descend,power/$@) -cgroup counter firewire hv guest bootconfig spi usb virtio mm bpf iio gpio objtool leds wmi pci firmware debugging tracing: FORCE +cgroup counter firewire hv guest bootconfig spi usb virtio mm bpf iio gpio objtool leds wmi pci firmware debugging tracing kspect: FORCE $(call descend,$@) bpf/%: FORCE @@ -120,7 +121,8 @@ all: acpi cgroup counter cpupower gpio hv firewire \ perf selftests bootconfig spi turbostat usb \ virtio mm bpf x86_energy_perf_policy \ tmon freefall iio objtool kvm_stat wmi \ - pci debugging tracing thermal thermometer thermal-engine + pci debugging tracing thermal thermometer thermal-engine \ + kspect acpi_install: $(call descend,power/$(@:_install=),install) @@ -128,7 +130,7 @@ acpi_install: cpupower_install: $(call descend,power/$(@:_install=),install) -cgroup_install counter_install firewire_install gpio_install hv_install iio_install perf_install bootconfig_install spi_install usb_install virtio_install mm_install bpf_install objtool_install wmi_install pci_install debugging_install tracing_install: +cgroup_install counter_install firewire_install gpio_install hv_install iio_install perf_install bootconfig_install spi_install usb_install virtio_install mm_install bpf_install objtool_install wmi_install pci_install debugging_install tracing_install kspect_install: $(call descend,$(@:_install=),install) selftests_install: @@ -161,7 +163,8 @@ install: acpi_install cgroup_install counter_install cpupower_install gpio_insta virtio_install mm_install bpf_install x86_energy_perf_policy_install \ tmon_install freefall_install objtool_install kvm_stat_install \ wmi_install pci_install debugging_install intel-speed-select_install \ - tracing_install thermometer_install thermal-engine_install + tracing_install thermometer_install thermal-engine_install \ + kspect_install acpi_clean: $(call descend,power/acpi,clean) diff --git a/tools/kspect/Makefile b/tools/kspect/Makefile new file mode 100644 index 000000000000..ffb380ddc885 --- /dev/null +++ b/tools/kspect/Makefile @@ -0,0 +1,28 @@ +# SPDX-License-Identifier: GPL-2.0 + +PREFIX ?= /usr/local +BINDIR = $(PREFIX)/bin +MANDIR = $(PREFIX)/share/man/man1 + +MAN1 = ifstool.1 +TARGET = ifstool + +all: man + +man: $(MAN1) + +install-man: + install -d $(MANDIR) + install -m 0644 $(MAN1) $(MANDIR) + +install-tools: + install -d $(BINDIR) + install -m 0755 $(TARGET) $(BINDIR) + +install: install-tools install-man + +uninstall: + rm -f $(BINDIR)/$(TARGET) + rm -f $(MANDIR)/$(MAN1) + +.PHONY: all install-tools install-man install diff --git a/tools/kspect/README b/tools/kspect/README new file mode 100644 index 000000000000..618e02a202f7 --- /dev/null +++ b/tools/kspect/README @@ -0,0 +1,65 @@ +IFSTOOL - Interference Statistics Analytical Utility + +Overview +======== +IFSTOOL is a specialized userspace utility designed to facilitate the +monitoring and analysis of Interference Statistics (CONFIG_CGROUP_IFS). + +The IFS infrastructure is a kernel-level framework that provides critical +observability into execution jitter (noise) that disrupts task determinism. +It monitors and quantifies the CPU time stolen by kernel-level activities, +such as: interrupt handling, softirqs, and lock contention. The framework +exposes this telemetry via the interference.stat control file within the +cgroup hierarchy. + +IFSTOOL interfaces with the interference.stat file to export raw metrics +into structured CSV data and interactive HTML-based distribution reports. + +Setup +===== +The host environment must meet the following criteria: + +* Kernel: Compiled with `CONFIG_CGROUP_IFS=y`. +* Boot Parameters: `cgroup_ifs=1` added to the kernel command line (optional, + not needed if `CONFIG_CGROUP_IFS_DEFAULT_ENABLED=y`). +* Python Runtime: Python 3.x with `pandas` and `plotly` libraries. +* Cgroup Hierarchy: Either v2 unified or v1 with the cpu subsystem mounted. + +Build +===== +IFSTOOL provides a Makefile to streamline the installation of the +executable and its documentation. + + $ make install + +Run +=== +IFSTOOL operates via two primary functional modes: monitor and report. + +1. **Monitor:** Capture raw interference data from a target cgroup. + + $ ifstool monitor --cgroup docker/<cid> --interval 1 --output capture.csv + +2. **Report:** Transform captured CSV data into an interactive HTML dashboard. + + - Provide only --base for a single-session deep dive: + + $ ifstool report --base capture.csv + + - Provide both --base and --curr to render a differential report, + ideal for validating optimizations: + + $ ifstool report --base baseline.csv --curr current.csv + +HTML Description +================ +The generated HTML report provides a multi-dimensional view of kernel noise: + +- Total Time Delta Trend: A time-series line chart illustrating the incremental + nanoseconds of interference per category (e.g., irq, spinlock). +- Latency Heatmaps: A frequency-domain visualization of the kernel's internal + histogram. + - X-axis: Wall-clock time of the trace. + - Y-axis: Latency magnitude (logarithmic buckets from ns to s). + - Color Intensity: Represents the density (event count) of interference + within that specific latency window. diff --git a/tools/kspect/ifstool b/tools/kspect/ifstool new file mode 100644 index 000000000000..c17de045263a --- /dev/null +++ b/tools/kspect/ifstool @@ -0,0 +1,409 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0-only +# +# ifstool: A tool to monitor and report interference statistics (CONFIG_CGROUP_IFS). +# +# Copyright(c) 2026. Huawei Technologies Co., Ltd +# +# Authors: +# Tengda Wu <wutengda2(a)huawei.com> + +import os +import time +import re +import csv +import argparse +import pandas as pd +import plotly.graph_objects as go +from plotly.subplots import make_subplots +from plotly.offline import get_plotlyjs + + +class InterferenceTool: + """ + A utility class to monitor Linux cgroup interference statistics (CONFIG_CGROUP_IFS) + and generate visual comparison reports between baseline and current data. + """ + + def __init__(self): + # Default path for cgroup v2 interference stats + self.base_path = "/sys/fs/cgroup" + + def parse_stat(self, content): + """ + Parses the raw content of the interference.stat file. + + Args: + content (str): Raw string content from the stat file. + Returns: + tuple: (total_times dict, distributions dict) + """ + + total_times, distributions = {}, {} + + # Split content into sections: Top-level totals and various distributions + # Uses positive lookahead to split before a word followed by ' distribution' + sections = re.split(r"\n(?=[a-z]+ distribution)", content) + + # Parse global total times (first section) + for line in sections[0].strip().split("\n"): + match = re.match(r"^([a-z]+)\s+(\d+)$", line.strip()) + if match: + total_times[match.group(1)] = int(match.group(2)) + + # Parse histogram distributions (subsequent sections) + for section in sections[1:]: + lines = section.strip().split("\n") + # Extract header name (e.g., 'spinlock distribution') + header = lines[0].replace(" distribution", "").strip() + # Parse bucket key-value pairs (e.g., '[64 ns, 128 ns) : 143791') + dist_data = { + l.split(":")[0].strip(): int(l.split(":")[1].strip()) + for l in lines[1:] + if ":" in l + } + distributions[header] = dist_data + + return total_times, distributions + + def monitor(self, cgroup_id, interval, duration, output_csv): + """ + Periodically samples interference stats and saves results to a CSV file. + + Args: + cgroup_id (str): The specific cgroup folder name. For example, + /sys/kernel/cgroup/A/B/C, when want to monitor level B, the + cgroup_id is A/B. + interval (float): Seconds between samples. + duration (int): Total monitoring time in seconds. + output_csv (str): Filename to save sampled data. + """ + + candidates = [ + os.path.join(self.base_path, cgroup_id), + os.path.join(self.base_path, "cpu", cgroup_id), # cgroup v1 + ] + path = next((p for p in candidates if os.path.exists(p)), None) + + if not path: + print( + f"\n[!] No access to cgroup: {os.path.join(self.base_path, cgroup_id)}" + ) + print(" Hint: You can find the correct cid by executing:") + print(" cat /proc/<PID>/cgroup") + return + + path = os.path.join(path, "interference.stat") + if not os.path.exists(path): + print(f"\n[!] Interface file '{path}' not found.") + print(" This usually happens due to one of the following:") + print(" 1. Kernel is not compiled with CONFIG_CGROUP_IFS=y") + print( + " 2. CONFIG_CGROUP_IFS_DEFAULT_ENABLED is not set to y", + "and the boot parameter 'cgroup_ifs=1' is not added", + ) + print(" 3. The current cgroup is not managed by the IFS controller") + return + + print( + f"[*] Starting monitor: {cgroup_id}, interval: {interval}s, duration: {duration}s" + ) + + data_list = [] + start_time = time.time() + + try: + while (time.time() - start_time) < duration: + ts = time.strftime("%H:%M:%S") + # Using fractional seconds in timestamp for sub-second intervals + if interval < 1.0: + ts = time.strftime("%H:%M:%S") + f".{int((time.time()%1)*100):02d}" + + with open(path, "r") as f: + total_times, dists = self.parse_stat(f.read()) + + # Append total time metrics + for cat, val in total_times.items(): + data_list.append([ts, cgroup_id, cat, "total_time_ns", val]) + + # Append bucket distribution metrics + for cat, dist in dists.items(): + for b, c in dist.items(): + data_list.append([ts, cgroup_id, cat, f"bucket_{b}", c]) + + time.sleep(interval) + except KeyboardInterrupt: + print("\n[!] Monitoring interrupted by user.") + + # Write buffered data to CSV + with open(output_csv, "w", newline="") as f: + writer = csv.writer(f) + writer.writerow( + ["timestamp", "cgroup_id", "category", "metric_type", "value"] + ) + writer.writerows(data_list) + print(f"[+] Data exported successfully: {output_csv}") + + def bucket_key(self, bucket_str): + """ + Parsing logic for histogram bucket labels to enable correct numerical sorting. + Example input: "bucket_[67.10 ms, 134.21 ms)" + + Returns: + float: The value converted to nanoseconds (ns). + """ + try: + # Regex to capture the first number and its unit (ns|us|ms|s) + match = re.search(r"(\d+\.?\d*)\s*(ns|us|ms|s)", bucket_str) + if not match: + return 0 + + value = float(match.group(1)) + unit = match.group(2).lower() + + # Conversion factors to nanoseconds + factors = {"ns": 1, "us": 1000, "ms": 1000000, "s": 1000000000} + + return value * factors.get(unit, 1) + except Exception: + return 0 + + def report(self, base_csv, curr_csv, output_html): + """ + Loads two CSV files and generates an interactive HTML dashboard. + + Args: + base_csv (str): Path to baseline data. + curr_csv (str): Path to current data. + output_html (str): Path to generate the HTML report. + """ + + df_b = pd.read_csv(base_csv) + single_mode = curr_csv is None or base_csv == curr_csv + df_c = None if single_mode else pd.read_csv(curr_csv) + + # Get unique union of categories present in both datasets + all_dfs = [df_b] if single_mode else [df_b, df_c] + categories = sorted( + list(set().union(*(df.category.unique() for df in all_dfs))) + ) + + # Color mapping to ensure consistent colors for categories across plots + color_sequence = [ + "#636EFA", + "#EF553B", + "#00CC96", + "#AB63FA", + "#FFA15A", + "#19D3F3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52", + ] + color_map = { + cat: color_sequence[i % len(color_sequence)] + for i, cat in enumerate(categories) + } + + # Build HTML content with CSS for layout + html_parts = [ + "<html><head><title>IFS Analysis Dashboard</title>", + f"<script type='text/javascript'>{get_plotlyjs()}</script>", + ] + col_width = "100%" if single_mode else "49.5%" + html_parts.append( + f"""<style> + body{{font-family:'Segoe UI',sans-serif; background:#f0f2f5; padding:20px; color:#333;}} + .card{{background:white; padding:15px; margin-bottom:20px; border-radius:10px; box-shadow:0 2px 5px rgba(0,0,0,0.05);}} + .row-container{{display:flex; gap:15px; justify-content:space-between;}} + .col-item{{width:{col_width}; min-width:0;}} + .stat-box{{display:flex; gap:10px; margin:5px 0; font-size:0.85em; color:#555;}} + .stat-item{{background:#f1f3f5; padding:4px 8px; border-radius:4px; border-left:3px solid #007bff;}} + .section-header{{margin:30px 0 15px 0; padding-bottom:10px; border-bottom:2px solid #ddd;}} + </style></head><body>""" + ) + + title = "Performance Analysis" if single_mode else "Performance Comparison" + html_parts.append(f"<h2>{title} (CONFIG_CGROUP_IFS)</h2>") + info = ( + f"File: {base_csv}" + if single_mode + else f"Baseline: {base_csv} | Current: {curr_csv}" + ) + html_parts.append(f"<p style='color:#666;'>{info}</p>") + + # --- Part 1: Total Time Trends (Line Charts) --- + html_parts.append("<div class='card'>") + html_parts.append("<h3 style='margin-top:0;'>Total Time Delta (ns) Trend</h3>") + + # Subplots ensure Y-axis can be matched for direct visual comparison + fig_line = make_subplots( + rows=1, + cols=1 if single_mode else 2, + horizontal_spacing=0.05, + subplot_titles=("Latency",) if single_mode else ("Baseline", "Current"), + ) + + plot_configs = ( + [("Data", df_b)] if single_mode else [("Baseline", df_b), ("Current", df_c)] + ) + for i, (name, df) in enumerate(plot_configs, 1): + for cat in categories: + sub = df[ + (df["category"] == cat) & (df["metric_type"] == "total_time_ns") + ].sort_values("timestamp") + if not sub.empty: + y_val = sub["value"].diff().fillna(0) + fig_line.add_trace( + go.Scatter( + x=sub["timestamp"], + y=y_val, + name=f"{cat} ({name})", + legendgroup=cat, + mode="lines+markers", + line=dict(color=color_map[cat], width=2), + marker=dict(color=color_map[cat], size=6), + ), + row=1, + col=i, + ) + + # Sync Y-axes scale for baseline and current plots + if not single_mode: + fig_line.update_yaxes(matches="y", row=1, col=2) + fig_line.update_layout( + height=500, template="plotly_white", margin=dict(t=50, b=20) + ) + html_parts.append(fig_line.to_html(full_html=False, include_plotlyjs=False)) + html_parts.append("</div>") + + html_parts.append( + "<h3 class='section-header'>Detailed Latency Distribution</h3>" + ) + + # --- Part 2: Distribution Heatmaps --- + for cat in categories: + + def get_m_and_stats(df): + """Processes raw metrics into a delta-count matrix for heatmap visualization.""" + sub = df[ + (df["category"] == cat) + & (df["metric_type"].str.startswith("bucket_")) + ] + if sub.empty: + return pd.DataFrame(), {} + + # Transform to matrix (Buckets vs Time) + p = sub.pivot( + index="metric_type", columns="timestamp", values="value" + ).sort_index(axis=1) + # Sort Y-axis buckets based on numerical time value + p = p.reindex(sorted(p.index, key=self.bucket_key)) + # Calculate incremental change (delta) + delta = p.diff(axis=1).fillna(0) + return delta, { + "Total": int(delta.values.sum()), + "Peak": int(delta.values.max()), + } + + m_b, stats_b = get_m_and_stats(df_b) + m_c, stats_c = (None, None) if single_mode else get_m_and_stats(df_c) + + # Determine global max for heatmap color scaling consistency + z_vals = [m_b.values.max() if not m_b.empty else 0] + if not single_mode: + z_vals.append(m_c.values.max() if not m_c.empty else 0) + global_max_z = max(z_vals + [1]) + + html_parts.append(f"<div class='card'><h4>Category: {cat.upper()}</h4>") + html_parts.append("<div class='row-container'>") + + configs = ( + [("Data", m_b, stats_b)] + if single_mode + else [("Baseline", m_b, stats_b), ("Current", m_c, stats_c)] + ) + for name, m, stats in configs: + html_parts.append("<div class='col-item'>") + if not m.empty: + # Display summary statistics + html_parts.append("<div class='stat-box'>") + for k, v in stats.items(): + html_parts.append( + f"<div class='stat-item'><b>{k}:</b> {v}</div>" + ) + html_parts.append("</div>") + + # Generate Heatmap + fig = go.Figure( + data=go.Heatmap( + z=m.values, + x=m.columns, + y=m.index, + colorscale="Viridis", + zmin=0, + zmax=global_max_z, + colorbar=dict(title="Count", thickness=10, len=0.7), + ) + ) + fig.update_layout( + height=400, + margin=dict(l=120, r=0, t=10, b=30), + template="plotly_white", + ) + html_parts.append( + fig.to_html(full_html=False, include_plotlyjs=False) + ) + else: + html_parts.append( + "<div style='height:100px; display:flex; align-items:center; justify-content:center; background:#fafafa; color:#ccc;'>No Data</div>" + ) + html_parts.append("</div>") + + html_parts.append("</div></div>") # end row-container & card + + # Save the final HTML report + with open(output_html, "w") as f: + f.writelines(html_parts + ["</body></html>"]) + print(f"[+] Comparison report generated: {output_html}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + sub = parser.add_subparsers(dest="mode") + + # Monitor Mode Arguments + p_mon = sub.add_parser("monitor") + p_mon.add_argument( + "-G", + "--cgroup", + required=True, + help="Cgroup name (e.g. 'docker/cid'). " + "Tip: Find this via 'cat /proc/<PID>/cgroup'", + ) + p_mon.add_argument( + "-i", "--interval", type=float, default=1.0, help="Sampling interval (sec)" + ) + p_mon.add_argument( + "-d", "--duration", type=int, default=30, help="Sampling duration (sec)" + ) + p_mon.add_argument("-o", "--output", default="capture.csv") + + # Report Mode Arguments + p_comp = sub.add_parser("report") + p_comp.add_argument("-b", "--base", required=True, help="Baseline CSV file") + p_comp.add_argument( + "-c", "--curr", default=None, help="Current CSV file (optional, for comparison)" + ) + p_comp.add_argument("-o", "--output", default="report.html") + + args = parser.parse_args() + tool = InterferenceTool() + + if args.mode == "monitor": + tool.monitor(args.cgroup, args.interval, args.duration, args.output) + elif args.mode == "report": + tool.report(args.base, args.curr, args.output) + else: + parser.print_help() diff --git a/tools/kspect/ifstool.1 b/tools/kspect/ifstool.1 new file mode 100644 index 000000000000..7ff9d297fc08 --- /dev/null +++ b/tools/kspect/ifstool.1 @@ -0,0 +1,87 @@ +.TH IFSTOOL 1 "March 2026" "Linux" "User Commands" +.SH NAME +ifstool \- Analyze kernel interference statistics (CONFIG_CGROUP_IFS) +.SH SYNOPSIS +.B ifstool monitor +[\fB\-G\fR|\fB\-\-cgroup\fR \fICGROUP_ID\fR] [\fB\-d\fR|\fB\-\-duration\fR \fISEC\fR] [\fB\-i\fR|\fB\-\-interval\fR \fISEC\fR] [\fB\-o\fR|\fB\-\-output\fR \fICSV\fR] +.br +.B ifstool report +[\fB\-b\fR|\fB\-\-base\fR \fICSV\fR] [\fB\-c\fR|\fB\-\-curr\fR \fICSV\fR] [\fB\-o\fR|\fB\-\-output\fR \fIHTML\fR] + +.SH DESCRIPTION +.B IFSTOOL +is a specialized userspace utility designed to facilitate the monitoring and analysis of Interference Statistics (\fBCONFIG_CGROUP_IFS\fR). + +The \fBIFS infrastructure\fR is a kernel-level framework providing critical observability into execution jitter (noise) that disrupts task determinism. It quantifies CPU time stolen by kernel activities such as interrupt handling, softirqs, and lock contention. This telemetry is exposed via the \fBinterference.stat\fR control file within the cgroup hierarchy. + +\fBIFSTOOL\fR interfaces with this file to export raw metrics into structured CSV data and interactive HTML-based distribution reports. + +.SH SETUP +The host environment must meet the following criteria: +.IP \[bu] 2 +\fBKernel:\fR Compiled with \fBCONFIG_CGROUP_IFS=y\fR. +.IP \[bu] 2 +\fBBoot Parameters:\fR \fBcgroup_ifs=1\fR added to the kernel command line (optional, not needed if \fBCONFIG_CGROUP_IFS_DEFAULT_ENABLED=y\fR). +.IP \[bu] 2 +\fBPython Runtime:\fR Python 3.x with \fBpandas\fR and \fBplotly\fR libraries. +.IP \[bu] 2 +\fBCgroup Hierarchy:\fR Either v2 unified or v1 (with the \fBcpu\fR subsystem mounted). + +.SH COMMANDS +.SS monitor +Capture raw interference data from a target cgroup. +.TP +\fB\-G, \-\-cgroup\fR \fICGROUP_ID\fR +Specify the cgroup identifier (e.g., \fIdocker/<cid>\fR). For example, /sys/kernel/cgroup/A/B/C, when you want to monitor level B, the cgroup_id is A/B. Tip: Find this via cat \fI/proc/PID/cgroup\fR. +.TP +\fB\-d, \-\-duration\fR \fISEC\fR +Total collection time in seconds (default: 30). +.TP +\fB\-i, \-\-interval\fR \fISEC\fR +Sampling interval in seconds; supports floating point (default: 1.0). +.TP +\fB\-o, \-\-output\fR \fICSV\fR +Path to the output CSV file where raw sampled data will be stored (default: \fBcapture.csv\fR). + +.SS report +Transform captured CSV data into an interactive HTML dashboard. +.TP +\fB\-b, \-\-base\fR \fICSV\fR +Primary capture file for analysis. +.TP +\fB\-c, \-\-curr\fR \fICSV\fR +Optional secondary file for differential (comparison) analysis. +.TP +\fB\-o, \-\-output\fR \fIHTML\fR +Path to the generated interactive HTML dashboard (default: \fBreport.html\fR). + +.SH HTML REPORT STRUCTURE +The generated HTML report provides a multi-dimensional view of kernel noise: +.IP "Total Time Delta Trend" 4 +A time-series line chart illustrating the incremental nanoseconds of interference per category (e.g., irq, spinlock). +.IP "Latency Heatmaps" 4 +A frequency-domain visualization of the kernel's internal histogram: +.RS 8 +.IP "X-axis:" 8 +Wall-clock time of the trace. +.IP "Y-axis:" 8 +Latency magnitude (logarithmic buckets from ns to s). +.IP "Color Intensity:" 8 +Represents the event count (density) of interference within that specific latency window. +.RE + +.SH EXAMPLES +Monitor a Docker container for 10 seconds: +.IP +.B $ ifstool monitor \-\-cgroup docker/<cid> \-\-duration 10 \-\-interval 1 +.PP +Generate a single-session deep dive: +.IP +.B $ ifstool report \-\-base capture.csv +.PP +Generate a differential report between two traces: +.IP +.B $ ifstool report \-\-base baseline.csv \-\-curr current.csv + +.SH AUTHOR +Tengda Wu <wutengda2(a)huawei.com> \ No newline at end of file diff --git a/tools/kspect/requirements.txt b/tools/kspect/requirements.txt new file mode 100644 index 000000000000..8e5e376a0e9a --- /dev/null +++ b/tools/kspect/requirements.txt @@ -0,0 +1,2 @@ +pandas==3.* +plotly==6.* -- 2.34.1
2 1
0 0
[PATCH OLK-5.10] gic: Add hip09 support for HISILICON_ERRATUM_165010801
by Zeng Heng 10 Mar '26

10 Mar '26
From: Yifan Wu <wuyifan50(a)huawei.com> driver inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/7023 ---------------------------------------------------------------------- Add hip09 support for HISILICON_ERRATUM_165010801 Fixes: f0f9be237c22 ("gic: increase the arch_timer priority to avoid hardlockup") Signed-off-by: Yifan Wu <wuyifan50(a)huawei.com> Signed-off-by: Zeng Heng <zengheng4(a)huawei.com> --- arch/arm64/Kconfig | 11 ++++++----- drivers/clocksource/arm_arch_timer.c | 5 +++++ 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 715c2810a43d..48fc9c71ed31 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -878,11 +878,12 @@ config HISILICON_ERRATUM_165010801 depends on ARCH_HISI default y help - On HIP12, when GIC receives multiple interrupts of the same priority and - different types, the interrupts are selected in the following sequence: - SPI > LPI > SGI > PPI. This scheduling rule may cause PPI starvation. - To prevent starvation from triggering system watchdog hardlockup, the - interrupt priority is explicitly increased in the arch_timer driver. + On HIP09/12 hardware, observations indicate that the GIC's handling of + equal-priority interrupts across different interrupt types may not always + provide balanced servicing. In some cases, this could result in prolonged + service delays for particular interrupt types, creating conditions where + system watchdog hardlockup detection might be triggered. The arch_timer + driver addresses this by proactively increasing affected interrupt priorities. config QCOM_FALKOR_ERRATUM_1003 bool "Falkor E1003: Incorrect translation due to ASID change" diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c index 7f29ee39af3a..c6664d435921 100644 --- a/drivers/clocksource/arm_arch_timer.c +++ b/drivers/clocksource/arm_arch_timer.c @@ -340,6 +340,11 @@ static struct ate_acpi_oem_info hisi_161010101_oem_info[] = { #ifdef CONFIG_HISILICON_ERRATUM_165010801 static struct ate_acpi_oem_info hisi_165010801_oem_info[] = { + { + .oem_id = "HISI ", + .oem_table_id = "HIP09 ", + .oem_revision = 0, + }, { .oem_id = "HISI ", .oem_table_id = "HIP12 ", -- 2.25.1
2 1
0 0
[PATCH OLK-6.6] net: yt6801: add link info for yt6801
by Frank_Sae 10 Mar '26

10 Mar '26
yt6801 inclusion category: bugfix bugzilla: https://https://atomgit.com/openeuler/kernel/issues/7229 -------------------------------------------------------------------- add link info for yt6801 to fix that the link info is not update on os installation. Fixes: b9f5c0893d16 ("net: yt6801: add link info for yt6801") Signed-off-by: Frank_Sae <Frank.Sae(a)motor-comm.com> --- .../ethernet/motorcomm/yt6801/yt6801_main.c | 90 +++++++++++++++++++ 1 file changed, 90 insertions(+) diff --git a/drivers/net/ethernet/motorcomm/yt6801/yt6801_main.c b/drivers/net/ethernet/motorcomm/yt6801/yt6801_main.c index 01eed3ace..cb54329ef 100644 --- a/drivers/net/ethernet/motorcomm/yt6801/yt6801_main.c +++ b/drivers/net/ethernet/motorcomm/yt6801/yt6801_main.c @@ -27,6 +27,95 @@ const struct net_device_ops *fxgmac_get_netdev_ops(void); static void fxgmac_napi_enable(struct fxgmac_pdata *priv); +const struct ethtool_ops *fxgmac_get_ethtool_ops(void); + +#define MII_SPEC_STATUS 0x11 /* PHY specific status */ +#define FXGMAC_EPHY_LINK_STATUS BIT(10) +#define PHY_MII_SPEC_DUPLEX BIT(13) + +static int fxgmac_get_link_ksettings(struct net_device *netdev, + struct ethtool_link_ksettings *cmd) +{ + struct fxgmac_pdata *pdata = netdev_priv(netdev); + struct phy_device *phydev = netdev->phydev; + u32 duplex, regval, link_status; + u32 adv = 0xFFFFFFFF; + + ethtool_link_ksettings_zero_link_mode(cmd, supported); + ethtool_link_ksettings_zero_link_mode(cmd, advertising); + + /* set the supported link speeds */ + ethtool_link_ksettings_add_link_mode(cmd, supported, 1000baseT_Full); + ethtool_link_ksettings_add_link_mode(cmd, supported, 100baseT_Full); + ethtool_link_ksettings_add_link_mode(cmd, supported, 100baseT_Half); + ethtool_link_ksettings_add_link_mode(cmd, supported, 10baseT_Full); + ethtool_link_ksettings_add_link_mode(cmd, supported, 10baseT_Half); + + /* Indicate pause support */ + ethtool_link_ksettings_add_link_mode(cmd, supported, Pause); + ethtool_link_ksettings_add_link_mode(cmd, supported, Asym_Pause); + + regval = phy_read(phydev, MII_ADVERTISE); + + if (field_get(ADVERTISE_PAUSE_CAP, regval)) + ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause); + + if (field_get(ADVERTISE_PAUSE_ASYM, regval)) + ethtool_link_ksettings_add_link_mode(cmd, advertising, Asym_Pause); + + ethtool_link_ksettings_add_link_mode(cmd, supported, MII); + cmd->base.port = PORT_MII; + + ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg); + regval = phy_read(phydev, MII_BMCR); + + regval = field_get(BMCR_ANENABLE, regval); + if (regval) { + ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg); + + regval = phy_read(phydev, MII_ADVERTISE); + + if (adv & ADVERTISE_10HALF) + ethtool_link_ksettings_add_link_mode(cmd, advertising, 10baseT_Half); + if (adv & ADVERTISE_10FULL) + ethtool_link_ksettings_add_link_mode(cmd, advertising, 10baseT_Full); + if (adv & ADVERTISE_100HALF) + ethtool_link_ksettings_add_link_mode(cmd, advertising, 100baseT_Half); + if (adv & ADVERTISE_100FULL) + ethtool_link_ksettings_add_link_mode(cmd, advertising, 100baseT_Full); + + adv = phy_read(phydev, MII_CTRL1000); + + if (adv & ADVERTISE_1000FULL) + ethtool_link_ksettings_add_link_mode(cmd, advertising, 1000baseT_Full); + } + + cmd->base.autoneg = 1; + + regval = phy_read(phydev, MII_SPEC_STATUS); + + link_status = field_get(FXGMAC_EPHY_LINK_STATUS, regval); + if (link_status) { + duplex = field_get(PHY_MII_SPEC_DUPLEX, regval); + cmd->base.duplex = duplex; + cmd->base.speed = pdata->mac_speed; + } else { + cmd->base.duplex = DUPLEX_UNKNOWN; + cmd->base.speed = SPEED_UNKNOWN; + } + + return 0; +} + +static const struct ethtool_ops fxgmac_ethtool_ops = { + .get_link = ethtool_op_get_link, + .get_link_ksettings = fxgmac_get_link_ksettings, +}; + +const struct ethtool_ops *fxgmac_get_ethtool_ops(void) +{ + return &fxgmac_ethtool_ops; +} #define PHY_WR_CONFIG(reg_offset) (0x8000205 + ((reg_offset) * 0x10000)) static int fxgmac_phy_write_reg(struct fxgmac_pdata *priv, u32 reg_id, u32 data) @@ -1899,6 +1988,7 @@ static int fxgmac_init(struct fxgmac_pdata *priv, bool save_private_reg) FXGMAC_JUMBO_PACKET_MTU + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN); ndev->netdev_ops = fxgmac_get_netdev_ops();/* Set device operations */ + ndev->ethtool_ops = fxgmac_get_ethtool_ops();/* Set device operations */ /* Set device features */ if (priv->hw_feat.tso) { -- 2.34.1
2 1
0 0
[PATCH OLK-6.6 0/2] Fix TLBI broadcast optimization loss and incorrect dvmbm handling
by Tian Zheng 10 Mar '26

10 Mar '26
Fix 1 In the TLBI broadcast-optimization feature, the SYS_LSUDVM_CTRL_EL2 register controls whether the feature is enabled. When LPI low-power mode is enabled, this register is cleared as the pCPU powers down into LPI mode, causing the TLBI broadcast optimization to stop working. The fix saves and restores this control register in the callbacks for entering and exiting LPI mode. Fix 2 The current TLBI broadcast-optimization logic only supports normal VMs. If the feature is enabled globally, VMs that are not yet adapted for it, such as CCA will use an incorrect TLBI broadcast bitmap, which can cause a panic in CCA VMs. The fix moves the actual enable/disable of the dvmbm functionality into the vcpu load and vcpu put functions. Tian Zheng (2): KVM: arm64: Fix TLBI optimization broken in LPI mode KVM: arm64: Fix CCA guest panic when dvmbm is enabled arch/arm64/kvm/arm.c | 6 ++++++ arch/arm64/kvm/hisilicon/hisi_virt.c | 9 ++++----- arch/arm64/kvm/hisilicon/hisi_virt.h | 19 +++++++++++++++++++ 3 files changed, 29 insertions(+), 5 deletions(-) -- 2.33.0
1 2
0 0
[PATCH openEuler-1.0-LTS] [Huawei] PCI: Fix AB-BA deadlock between aer_isr() and device_shutdown()
by Hongtao Zhang 10 Mar '26

10 Mar '26
From: Ziming Du <duziming2(a)huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/13854 CVE: NA Reference: NA -------------------------------- During system shutdown, a deadlock may occur between AER recovery process and device shutdown as follows: The device_shutdown path holds the device_lock throughout the entire process and waits for the irq handlers to complete when release nodes: device_shutdown device_lock # A hold device_lock pci_device_shutdown pcie_port_device_remove remove_iter device_unregister device_del bus_remove_device device_release_driver devres_release_all release_nodes # B wait for irq handlers The aer_isr path will acquire device_lock in pci_bus_reset(): aer_isr # B execute irq process aer_isr_one_error aer_process_err_devices handle_error_source pcie_do_recovery aer_root_reset pci_bus_error_reset pci_bus_reset # A acquire device_lock The circular dependency causes system hang. Fix it by using pci_bus_trylock() instead of pci_bus_lock() in pci_bus_reset(). When the lock is unavailable, return -EAGAIN, as in similar cases. Fixes: c4eed62a2143 ("PCI/ERR: Use slot reset if available") Signed-off-by: Ziming Du <duziming2(a)huawei.com> Signed-off-by: Zhang Hongtao <zhanghongtao35(a)huawei.com> --- drivers/pci/pci.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index b93605616d4e4..d1a8531df0271 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5309,15 +5309,21 @@ static int pci_bus_reset(struct pci_bus *bus, int probe) if (probe) return 0; - pci_bus_lock(bus); + /* + * Replace blocking lock with trylock to prevent deadlock during bus reset. + * Same as above except return -EAGAIN if the bus cannot be locked. + */ + if (pci_bus_trylock(bus)) { + might_sleep(); - might_sleep(); + ret = pci_bridge_secondary_bus_reset(bus->self); - ret = pci_bridge_secondary_bus_reset(bus->self); + pci_bus_unlock(bus); - pci_bus_unlock(bus); + return ret; + } - return ret; + return -EAGAIN; } /** -- 2.43.0
2 1
0 0
[PATCH openEuler-1.0-LTS] [Huawei] PCI: Fix AB-BA deadlock between aer_isr() and device_shutdown()
by Hongtao Zhang 10 Mar '26

10 Mar '26
From: Ziming Du <duziming2(a)huawei.com> Offering: HULK hulk inclusion category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/13854 -------------------------------- During system shutdown, a deadlock may occur between AER recovery process and device shutdown as follows: The device_shutdown path holds the device_lock throughout the entire process and waits for the irq handlers to complete when release nodes: device_shutdown device_lock # A hold device_lock pci_device_shutdown pcie_port_device_remove remove_iter device_unregister device_del bus_remove_device device_release_driver devres_release_all release_nodes # B wait for irq handlers The aer_isr path will acquire device_lock in pci_bus_reset(): aer_isr # B execute irq process aer_isr_one_error aer_process_err_devices handle_error_source pcie_do_recovery aer_root_reset pci_bus_error_reset pci_bus_reset # A acquire device_lock The circular dependency causes system hang. Fix it by using pci_bus_trylock() instead of pci_bus_lock() in pci_bus_reset(). When the lock is unavailable, return -EAGAIN, as in similar cases. Fixes: c4eed62a2143 ("PCI/ERR: Use slot reset if available") Signed-off-by: Ziming Du <duziming2(a)huawei.com> Signed-off-by: Zhang Hongtao <zhanghongtao35(a)huawei.com> --- drivers/pci/pci.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index b93605616d4e4..d1a8531df0271 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5309,15 +5309,21 @@ static int pci_bus_reset(struct pci_bus *bus, int probe) if (probe) return 0; - pci_bus_lock(bus); + /* + * Replace blocking lock with trylock to prevent deadlock during bus reset. + * Same as above except return -EAGAIN if the bus cannot be locked. + */ + if (pci_bus_trylock(bus)) { + might_sleep(); - might_sleep(); + ret = pci_bridge_secondary_bus_reset(bus->self); - ret = pci_bridge_secondary_bus_reset(bus->self); + pci_bus_unlock(bus); - pci_bus_unlock(bus); + return ret; + } - return ret; + return -EAGAIN; } /** -- 2.43.0
2 1
0 0
  • ← Newer
  • 1
  • ...
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • ...
  • 2311
  • Older →

HyperKitty Powered by HyperKitty