[PATCH OLK-5.10 00/10] rcu: Add RCU stall diagnosis information

newer
[openEuler-1.0-LTS,v2] net/sched:...

older
[PATCH OLK-5.10] psi: fix compile...

Zhen Lei

28 Jul 2023 28 Jul '23

2:42 p.m.

In some extreme cases, such as the I/O pressure test, the CPU usage may be 100%, causing RCU stall. In this case, the printed information about current is not useful. Displays the number and usage of hard interrupts, soft interrupts, and context switches that are generated within half of the CPU stall timeout, can help us make a general judgment. In other cases, we can preliminarily determine whether an infinite loop occurs when local_irq, local_bh or preempt is disabled. Neeraj Upadhyay (1): rcu: Check and report missed fqs timer wakeup on RCU stall Paul E. McKenney (3): rcu: Prevent lockdep-RCU splats on lock acquisition/release rcu: For RCU grace-period kthread starvation, dump last CPU it ran on rcu: Do not NMI offline CPUs Zhen Lei (6): sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task() sched/debug: Show the registers of 'current' in dump_cpu_task() sched: Add helper kstat_cpu_softirqs_sum() sched: Add helper nr_context_switches_cpu() rcu: Add RCU stall diagnosis information rcu: Align the output of RCU CPU stall warning messages Documentation/RCU/stallwarn.rst | 23 ++++- .../admin-guide/kernel-parameters.txt | 6 ++ include/linux/kernel_stat.h | 12 +++ kernel/rcu/Kconfig.debug | 13 +++ kernel/rcu/rcu.h | 1 + kernel/rcu/tree.c | 64 +++++++++++--- kernel/rcu/tree.h | 20 +++++ kernel/rcu/tree_stall.h | 86 +++++++++++++++++-- kernel/rcu/update.c | 2 + kernel/sched/core.c | 19 ++++ kernel/smp.c | 3 +- 11 files changed, 230 insertions(+), 19 deletions(-) -- 2.34.1

Show replies by date

Zhen Lei

28 Jul 28 Jul

2:42 p.m.

New subject: [PATCH OLK-5.10 01/10] rcu: Prevent lockdep-RCU splats on lock acquisition/release

From: "Paul E. McKenney" <paulmck@kernel.org> stable inclusion from stable-v5.10.163 commit 7c15d7ecce0064fd0724422ec84d074a4b4eecef category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6AVM6 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... -------------------------------- commit 4d60b475f858ebdb06c1339f01a890f287b5e587 upstream. The rcu_cpu_starting() and rcu_report_dead() functions transition the current CPU between online and offline state from an RCU perspective. Unfortunately, this means that the rcu_cpu_starting() function's lock acquisition and the rcu_report_dead() function's lock releases happen while the CPU is offline from an RCU perspective, which can result in lockdep-RCU splats about using RCU from an offline CPU. And this situation can also result in too-short grace periods, especially in guest OSes that are subject to vCPU preemption. This commit therefore uses sequence-count-like synchronization to forgive use of RCU while RCU thinks a CPU is offline across the full extent of the rcu_cpu_starting() and rcu_report_dead() function's lock acquisitions and releases. One approach would have been to use the actual sequence-count primitives provided by the Linux kernel. Unfortunately, the resulting code looks completely broken and wrong, and is likely to result in patches that break RCU in an attempt to address this appearance of broken wrongness. Plus there is no net savings in lines of code, given the additional explicit memory barriers required. Therefore, this sequence count is instead implemented by a new ->ofl_seq field in the rcu_node structure. If this counter's value is an odd number, RCU forgives RCU read-side critical sections on other CPUs covered by the same rcu_node structure, even if those CPUs are offline from an RCU perspective. In addition, if a given leaf rcu_node structure's ->ofl_seq counter value is an odd number, rcu_gp_init() delays starting the grace period until that counter value changes. [ paulmck: Apply Peter Zijlstra feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- kernel/rcu/tree.c | 21 ++++++++++++++++++++- kernel/rcu/tree.h | 1 + 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 18b9c74474ec1fe..754c85063f0b901 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1180,7 +1180,7 @@ bool rcu_lockdep_current_cpu_online(void) preempt_disable_notrace(); rdp = this_cpu_ptr(&rcu_data); rnp = rdp->mynode; - if (rdp->grpmask & rcu_rnp_online_cpus(rnp)) + if (rdp->grpmask & rcu_rnp_online_cpus(rnp) || READ_ONCE(rnp->ofl_seq) & 0x1) ret = true; preempt_enable_notrace(); return ret; @@ -1745,6 +1745,7 @@ static void rcu_strict_gp_boundary(void *unused) */ static bool rcu_gp_init(void) { + unsigned long firstseq; unsigned long flags; unsigned long oldmask; unsigned long mask; @@ -1788,6 +1789,12 @@ static bool rcu_gp_init(void) */ rcu_state.gp_state = RCU_GP_ONOFF; rcu_for_each_leaf_node(rnp) { + smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. + firstseq = READ_ONCE(rnp->ofl_seq); + if (firstseq & 0x1) + while (firstseq == READ_ONCE(rnp->ofl_seq)) + schedule_timeout_idle(1); // Can't wake unless RCU is watching. + smp_mb(); // Pair with barriers used when updating ->ofl_seq to even values. raw_spin_lock(&rcu_state.ofl_lock); raw_spin_lock_irq_rcu_node(rnp); if (rnp->qsmaskinit == rnp->qsmaskinitnext && @@ -4131,6 +4138,9 @@ void rcu_cpu_starting(unsigned int cpu) rnp = rdp->mynode; mask = rdp->grpmask; + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). raw_spin_lock_irqsave_rcu_node(rnp, flags); WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask); newcpu = !(rnp->expmaskinitnext & mask); @@ -4148,6 +4158,9 @@ void rcu_cpu_starting(unsigned int cpu) } else { raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(rnp->ofl_seq & 0x1); smp_mb(); /* Ensure RCU read-side usage follows above initialization. */ } @@ -4174,6 +4187,9 @@ void rcu_report_dead(unsigned int cpu) /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ mask = rdp->grpmask; + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). raw_spin_lock(&rcu_state.ofl_lock); raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */ rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq); @@ -4186,6 +4202,9 @@ void rcu_report_dead(unsigned int cpu) WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); raw_spin_unlock(&rcu_state.ofl_lock); + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(rnp->ofl_seq & 0x1); rdp->cpu_started = false; } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 0ec2b1f66b13580..b64696ca9109845 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -56,6 +56,7 @@ struct rcu_node { /* Initialized from ->qsmaskinitnext at the */ /* beginning of each grace period. */ unsigned long qsmaskinitnext; + unsigned long ofl_seq; /* CPU-hotplug operation sequence count. */ /* Online CPUs for next grace period. */ unsigned long expmask; /* CPUs or groups that need to check in */ /* to allow the current expedited GP */ -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 02/10] rcu: For RCU grace-period kthread starvation, dump last CPU it ran on

From: "Paul E. McKenney" <paulmck@kernel.org> mainline inclusion from mainline-v5.12-rc1 commit 243027a3c80564bf96e40437ffac46efb9f5f2b5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- When the RCU CPU stall-warning code detects that the RCU grace-period kthread is being starved, it dumps that kthread's stack. This can sometimes be useful, but it is also useful to know what is running on the CPU that this kthread is attempting to run on. This commit therefore adds a stack trace of this CPU in order to help track down whatever it is that might be preventing RCU's grace-period kthread from running. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- kernel/rcu/tree_stall.h | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 251a9af3709afd6..5df4640254ace16 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -449,20 +449,27 @@ static void print_cpu_stall_info(int cpu) /* Complain about starvation of grace-period kthread. */ static void rcu_check_gp_kthread_starvation(void) { + int cpu; struct task_struct *gpk = rcu_state.gp_kthread; unsigned long j; if (rcu_is_gp_kthread_starving(&j)) { + cpu = gpk ? task_cpu(gpk) : -1; pr_err("%s kthread starved for %ld jiffies! g%ld f%#x %s(%d) ->state=%#lx ->cpu=%d\n", rcu_state.name, j, (long)rcu_seq_current(&rcu_state.gp_seq), data_race(rcu_state.gp_flags), gp_state_getname(rcu_state.gp_state), rcu_state.gp_state, - gpk ? gpk->state : ~0, gpk ? task_cpu(gpk) : -1); + gpk ? gpk->state : ~0, cpu); if (gpk) { pr_err("\tUnless %s kthread gets sufficient CPU time, OOM is now expected behavior.\n", rcu_state.name); pr_err("RCU grace-period kthread stack dump:\n"); sched_show_task(gpk); + if (cpu >= 0) { + pr_err("Stack dump where RCU grace-period kthread last ran:\n"); + if (!trigger_single_cpu_backtrace(cpu)) + dump_cpu_task(cpu); + } wake_up_process(gpk); } } -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 03/10] rcu: Do not NMI offline CPUs

From: "Paul E. McKenney" <paulmck@kernel.org> mainline inclusion from mainline-v5.12-rc1 commit 725969ac11d7fa50aa701321daa600ce421fc21b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Currently, RCU CPU stall warning messages will NMI whatever CPU looks like it is blocking either the current grace period or the grace-period kthread. This can produce confusing output if the target CPU is offline. This commit therefore checks for offline CPUs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- kernel/rcu/tree_stall.h | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 5df4640254ace16..c38622ff72673ef 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -333,9 +333,12 @@ static void rcu_dump_cpu_stacks(void) rcu_for_each_leaf_node(rnp) { raw_spin_lock_irqsave_rcu_node(rnp, flags); for_each_leaf_node_possible_cpu(rnp, cpu) - if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) - if (!trigger_single_cpu_backtrace(cpu)) + if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { + if (cpu_is_offline(cpu)) + pr_err("Offline CPU %d blocking current GP.\n", cpu); + else if (!trigger_single_cpu_backtrace(cpu)) dump_cpu_task(cpu); + } raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } } @@ -466,9 +469,13 @@ static void rcu_check_gp_kthread_starvation(void) pr_err("RCU grace-period kthread stack dump:\n"); sched_show_task(gpk); if (cpu >= 0) { - pr_err("Stack dump where RCU grace-period kthread last ran:\n"); - if (!trigger_single_cpu_backtrace(cpu)) - dump_cpu_task(cpu); + if (cpu_is_offline(cpu)) { + pr_err("RCU GP kthread last ran on offline CPU %d.\n", cpu); + } else { + pr_err("Stack dump where RCU GP kthread last ran:\n"); + if (!trigger_single_cpu_backtrace(cpu)) + dump_cpu_task(cpu); + } } wake_up_process(gpk); } -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 04/10] rcu: Check and report missed fqs timer wakeup on RCU stall

From: Neeraj Upadhyay <neeraju@codeaurora.org> mainline inclusion from mainline-v5.12-rc1 commit 683954e55c981467bfd4688417e914bafc40959f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- For a new grace period request, the RCU GP kthread transitions through following states: a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS] The RCU_GP_WAIT_GPS state is where the GP kthread waits for a request for a new GP. Once it receives a request (for example, when a new RCU callback is queued), the GP kthread transitions to RCU_GP_DONE_GPS. b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF] Grace period initialization starts in rcu_gp_init(), which records the start of new GP in rcu_state.gp_seq and transitions to RCU_GP_ONOFF. c. [RCU_GP_ONOFF] -> [RCU_GP_INIT] The purpose of the RCU_GP_ONOFF state is to apply the online/offline information that was buffered for any CPUs that recently came online or went offline. This state is maintained in per-leaf rcu_node bitmasks, with the buffered state in ->qsmaskinitnext and the state for the upcoming GP in ->qsmaskinit. At the end of this RCU_GP_ONOFF state, each bit in ->qsmaskinit will correspond to a CPU that must pass through a quiescent state before the upcoming grace period is allowed to complete. However, a leaf rcu_node structure with an all-zeroes ->qsmaskinit cannot necessarily be ignored. In preemptible RCU, there might well be tasks still in RCU read-side critical sections that were first preempted while running on one of the CPUs managed by this structure. Such tasks will be queued on this structure's ->blkd_tasks list. Only after this list fully drains can this leaf rcu_node structure be ignored, and even then only if none of its CPUs have come back online in the meantime. Once that happens, the ->qsmaskinit masks further up the tree will be updated to exclude this leaf rcu_node structure. Once the ->qsmaskinitnext and ->qsmaskinit fields have been updated as needed, the GP kthread transitions to RCU_GP_INIT. d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS] The purpose of the RCU_GP_INIT state is to copy each ->qsmaskinit to the ->qsmask field within each rcu_node structure. This copying is done breadth-first from the root to the leaves. Why not just copy directly from ->qsmaskinitnext to ->qsmask? Because the ->qsmaskinitnext masks can change in the meantime as additional CPUs come online or go offline. Such changes would result in inconsistencies in the ->qsmask fields up and down the tree, which could in turn result in too-short grace periods or grace-period hangs. These issues are avoided by snapshotting the leaf rcu_node structures' ->qsmaskinitnext fields into their ->qsmaskinit counterparts, generating a consistent set of ->qsmaskinit fields throughout the tree, and only then copying these consistent ->qsmaskinit fields to their ->qsmask counterparts. Once this initialization step is complete, the GP kthread transitions to RCU_GP_WAIT_FQS, where it waits to do a force-quiescent-state scan on the one hand or for the end of the grace period on the other. e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS] The RCU_GP_WAIT_FQS state waits for one of three things: (1) An explicit request to do a force-quiescent-state scan, (2) The end of the grace period, or (3) A short interval of time, after which it will do a force-quiescent-state (FQS) scan. The explicit request can come from rcutorture or from any CPU that has too many RCU callbacks queued (see the qhimark kernel parameter and the RCU_GP_FLAG_OVLD flag). The aforementioned "short period of time" is specified by the jiffies_till_first_fqs boot parameter for a given grace period's first FQS scan and by the jiffies_till_next_fqs for later FQS scans. Either way, once the wait is over, the GP kthread transitions to RCU_GP_DOING_FQS. f. [RCU_GP_DOING_FQS] -> [RCU_GP_CLEANUP] The RCU_GP_DOING_FQS state performs an FQS scan. Each such scan carries out two functions for any CPU whose bit is still set in its leaf rcu_node structure's ->qsmask field, that is, for any CPU that has not yet reported a quiescent state for the current grace period: i. Report quiescent states on behalf of CPUs that have been observed to be idle (from an RCU perspective) since the beginning of the grace period. ii. If the current grace period is too old, take various actions to encourage holdout CPUs to pass through quiescent states, including enlisting the aid of any calls to cond_resched() and might_sleep(), and even including IPIing the holdout CPUs. These checks are skipped for any leaf rcu_node structure with a all-zero ->qsmask field, however such structures are subject to RCU priority boosting if there are tasks on a given structure blocking the current grace period. The end of the grace period is detected when the root rcu_node structure's ->qsmask is zero and when there are no longer any preempted tasks blocking the current grace period. (No, this last check is not redundant. To see this, consider an rcu_node tree having exactly one structure that serves as both root and leaf.) Once the end of the grace period is detected, the GP kthread transitions to RCU_GP_CLEANUP. g. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED] The RCU_GP_CLEANUP state marks the end of grace period by updating the rcu_state structure's ->gp_seq field and also all rcu_node structures' ->gp_seq field. As before, the rcu_node tree is traversed in breadth first order. Once this update is complete, the GP kthread transitions to the RCU_GP_CLEANED state. i. [RCU_GP_CLEANED] -> [RCU_GP_INIT] Once in the RCU_GP_CLEANED state, the GP kthread immediately transitions into the RCU_GP_INIT state. j. The role of timers. If there is at least one idle CPU, and if timers are not firing, the transition from RCU_GP_DOING_FQS to RCU_GP_CLEANUP will never happen. Timers can fail to fire for a number of reasons, including issues in timer configuration, issues in the timer framework, and failure to handle softirqs (for example, when there is a storm of interrupts). Whatever the reason, if the timers fail to fire, the GP kthread will never be awakened, resulting in RCU CPU stall warnings and eventually in OOM. However, an RCU CPU stall warning has a large number of potential causes, as documented in Documentation/RCU/stallwarn.rst. This commit therefore adds analysis to the RCU CPU stall-warning code to emit an additional message if the cause of the stall is likely to be timer failure. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- Documentation/RCU/stallwarn.rst | 23 ++++++++++++++++++++++- kernel/rcu/tree.c | 25 +++++++++++++++---------- kernel/rcu/tree_stall.h | 32 ++++++++++++++++++++++++++++++++ 3 files changed, 69 insertions(+), 11 deletions(-) diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst index c9ab6af4d3be9a0..d53856cc639089d 100644 --- a/Documentation/RCU/stallwarn.rst +++ b/Documentation/RCU/stallwarn.rst @@ -92,7 +92,9 @@ warnings: buggy timer hardware through bugs in the interrupt or exception path (whether hardware, firmware, or software) through bugs in Linux's timer subsystem through bugs in the scheduler, and, - yes, even including bugs in RCU itself. + yes, even including bugs in RCU itself. It can also result in + the ``rcu_.*timer wakeup didn't happen for`` console-log message, + which will include additional debugging information. - A bug in the RCU implementation. @@ -292,6 +294,25 @@ kthread is waiting for a short timeout, the "state" precedes value of the task_struct ->state field, and the "cpu" indicates that the grace-period kthread last ran on CPU 5. +If the relevant grace-period kthread does not wake from FQS wait in a +reasonable time, then the following additional line is printed:: + + kthread timer wakeup didn't happen for 23804 jiffies! g7076 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 + +The "23804" indicates that kthread's timer expired more than 23 thousand +jiffies ago. The rest of the line has meaning similar to the kthread +starvation case. + +Additionally, the following line is printed:: + + Possible timer handling issue on cpu=4 timer-softirq=11142 + +Here "cpu" indicates that the grace-period kthread last ran on CPU 4, +where it queued the fqs timer. The number following the "timer-softirq" +is the current ``TIMER_SOFTIRQ`` count on cpu 4. If this value does not +change on successive RCU CPU stall warnings, there is further reason to +suspect a timer problem. + Multiple Warnings From One Stall ================================ diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 754c85063f0b901..6a2669d1b6a2638 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1787,7 +1787,7 @@ static bool rcu_gp_init(void) * go offline later. Please also refer to "Hotplug CPU" section * of RCU's Requirements documentation. */ - rcu_state.gp_state = RCU_GP_ONOFF; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF); rcu_for_each_leaf_node(rnp) { smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. firstseq = READ_ONCE(rnp->ofl_seq); @@ -1853,7 +1853,7 @@ static bool rcu_gp_init(void) * The grace period cannot complete until the initialization * process finishes, because this kthread handles both. */ - rcu_state.gp_state = RCU_GP_INIT; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_INIT); rcu_for_each_node_breadth_first(rnp) { rcu_gp_slow(gp_init_delay); raw_spin_lock_irqsave_rcu_node(rnp, flags); @@ -1952,17 +1952,22 @@ static void rcu_gp_fqs_loop(void) ret = 0; for (;;) { if (!ret) { - rcu_state.jiffies_force_qs = jiffies + j; + WRITE_ONCE(rcu_state.jiffies_force_qs, jiffies + j); + /* + * jiffies_force_qs before RCU_GP_WAIT_FQS state + * update; required for stall checks. + */ + smp_wmb(); WRITE_ONCE(rcu_state.jiffies_kick_kthreads, jiffies + (j ? 3 * j : 2)); } trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqswait")); - rcu_state.gp_state = RCU_GP_WAIT_FQS; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS); ret = swait_event_idle_timeout_exclusive( rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j); rcu_gp_torture_wait(); - rcu_state.gp_state = RCU_GP_DOING_FQS; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_DOING_FQS); /* Locking provides needed memory barriers. */ /* If grace period done, leave loop. */ if (!READ_ONCE(rnp->qsmask) && @@ -2076,7 +2081,7 @@ static void rcu_gp_cleanup(void) trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end")); rcu_seq_end(&rcu_state.gp_seq); ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); - rcu_state.gp_state = RCU_GP_IDLE; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_IDLE); /* Check for GP requests since above loop. */ rdp = this_cpu_ptr(&rcu_data); if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) { @@ -2116,12 +2121,12 @@ static int __noreturn rcu_gp_kthread(void *unused) for (;;) { trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("reqwait")); - rcu_state.gp_state = RCU_GP_WAIT_GPS; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_GPS); swait_event_idle_exclusive(rcu_state.gp_wq, READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_INIT); rcu_gp_torture_wait(); - rcu_state.gp_state = RCU_GP_DONE_GPS; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_DONE_GPS); /* Locking provides needed memory barrier. */ if (rcu_gp_init()) break; @@ -2136,9 +2141,9 @@ static int __noreturn rcu_gp_kthread(void *unused) rcu_gp_fqs_loop(); /* Handle grace-period end. */ - rcu_state.gp_state = RCU_GP_CLEANUP; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP); rcu_gp_cleanup(); - rcu_state.gp_state = RCU_GP_CLEANED; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED); } } diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index c38622ff72673ef..f7d031ef559395f 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -482,6 +482,36 @@ static void rcu_check_gp_kthread_starvation(void) } } +/* Complain about missing wakeups from expired fqs wait timer */ +static void rcu_check_gp_kthread_expired_fqs_timer(void) +{ + struct task_struct *gpk = rcu_state.gp_kthread; + short gp_state; + unsigned long jiffies_fqs; + int cpu; + + /* + * Order reads of .gp_state and .jiffies_force_qs. + * Matching smp_wmb() is present in rcu_gp_fqs_loop(). + */ + gp_state = smp_load_acquire(&rcu_state.gp_state); + jiffies_fqs = READ_ONCE(rcu_state.jiffies_force_qs); + + if (gp_state == RCU_GP_WAIT_FQS && + time_after(jiffies, jiffies_fqs + RCU_STALL_MIGHT_MIN) && + gpk && !READ_ONCE(gpk->on_rq)) { + cpu = task_cpu(gpk); + pr_err("%s kthread timer wakeup didn't happen for %ld jiffies! g%ld f%#x %s(%d) ->state=%#lx\n", + rcu_state.name, (jiffies - jiffies_fqs), + (long)rcu_seq_current(&rcu_state.gp_seq), + data_race(rcu_state.gp_flags), + gp_state_getname(RCU_GP_WAIT_FQS), RCU_GP_WAIT_FQS, + gpk->state); + pr_err("\tPossible timer handling issue on cpu=%d timer-softirq=%u\n", + cpu, kstat_softirqs_cpu(TIMER_SOFTIRQ, cpu)); + } +} + static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) { int cpu; @@ -546,6 +576,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) WRITE_ONCE(rcu_state.jiffies_stall, jiffies + 3 * rcu_jiffies_till_stall_check() + 3); + rcu_check_gp_kthread_expired_fqs_timer(); rcu_check_gp_kthread_starvation(); panic_on_rcu_stall(); @@ -583,6 +614,7 @@ static void print_cpu_stall(unsigned long gps) jiffies - gps, (long)rcu_seq_current(&rcu_state.gp_seq), totqlen); + rcu_check_gp_kthread_expired_fqs_timer(); rcu_check_gp_kthread_starvation(); rcu_dump_cpu_stacks(); -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 05/10] sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()

mainline inclusion from mainline-v6.1-rc1 commit e73dfe30930b75c98746152e7a2f6a8ab6067b51 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Conflicts: kernel/smp.c Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- kernel/rcu/tree_stall.h | 5 ++--- kernel/sched/core.c | 3 +++ kernel/smp.c | 3 +-- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index f7d031ef559395f..338d35fc674a941 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -336,7 +336,7 @@ static void rcu_dump_cpu_stacks(void) if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { if (cpu_is_offline(cpu)) pr_err("Offline CPU %d blocking current GP.\n", cpu); - else if (!trigger_single_cpu_backtrace(cpu)) + else dump_cpu_task(cpu); } raw_spin_unlock_irqrestore_rcu_node(rnp, flags); @@ -473,8 +473,7 @@ static void rcu_check_gp_kthread_starvation(void) pr_err("RCU GP kthread last ran on offline CPU %d.\n", cpu); } else { pr_err("Stack dump where RCU GP kthread last ran:\n"); - if (!trigger_single_cpu_backtrace(cpu)) - dump_cpu_task(cpu); + dump_cpu_task(cpu); } } wake_up_process(gpk); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 71c69c8161ffddd..b1af0c97a260849 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9986,6 +9986,9 @@ struct cgroup_subsys cpu_cgrp_subsys = { void dump_cpu_task(int cpu) { + if (trigger_single_cpu_backtrace(cpu)) + return; + pr_info("Task dump for CPU %d:\n", cpu); sched_show_task(cpu_curr(cpu)); } diff --git a/kernel/smp.c b/kernel/smp.c index 114776d0d11eca6..b9e6397a1a72683 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -205,8 +205,7 @@ static __always_inline bool csd_lock_wait_toolong(struct __call_single_data *csd *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request"); } if (cpu >= 0) { - if (!trigger_single_cpu_backtrace(cpu)) - dump_cpu_task(cpu); + dump_cpu_task(cpu); if (!cpu_cur_csd) { pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu); arch_send_call_function_single_ipi(cpu); -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 06/10] sched/debug: Show the registers of 'current' in dump_cpu_task()

mainline inclusion from mainline-v6.1-rc1 commit bc1cca97e6da6c7c34db7c5b864bb354ca5305ac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Conflicts: kernel/sched/core.c Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- kernel/sched/core.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b1af0c97a260849..2e4cdc439e02e37 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -17,6 +17,7 @@ #include <linux/kcov.h> #include <linux/scs.h> +#include <asm/irq_regs.h> #include <asm/switch_to.h> #include <asm/tlb.h> @@ -9986,6 +9987,16 @@ struct cgroup_subsys cpu_cgrp_subsys = { void dump_cpu_task(int cpu) { + if (cpu == smp_processor_id() && in_irq()) { + struct pt_regs *regs; + + regs = get_irq_regs(); + if (regs) { + show_regs(regs); + return; + } + } + if (trigger_single_cpu_backtrace(cpu)) return; -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 07/10] sched: Add helper kstat_cpu_softirqs_sum()

mainline inclusion from mainline-v6.3-rc1 commit 3ca0a6ea8f6de0f21a9331ff3596a9c073fbdab0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a kstat_cpu_softirqs_sum() function that is similar to kstat_cpu_irqs_sum(), but which counts software interrupts since boot on the specified CPU. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Josh Don <joshdon@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- include/linux/kernel_stat.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 8fff3500d50eea7..6f467f0cf0bb768 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -64,6 +64,17 @@ static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu) return kstat_cpu(cpu).softirqs[irq]; } +static inline unsigned int kstat_cpu_softirqs_sum(int cpu) +{ + int i; + unsigned int sum = 0; + + for (i = 0; i < NR_SOFTIRQS; i++) + sum += kstat_softirqs_cpu(i, cpu); + + return sum; +} + /* * Number of interrupts per specific IRQ source, since bootup */ -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 08/10] sched: Add helper nr_context_switches_cpu()

mainline inclusion from mainline-v6.3-rc1 commit 7c182722a0a9447e31f9645de4f311e5bc59b480 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a function nr_context_switches_cpu() that returns number of context switches since boot on the specified CPU. This information will be used to diagnose RCU CPU stalls. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- include/linux/kernel_stat.h | 1 + kernel/sched/core.c | 5 +++++ 2 files changed, 6 insertions(+) diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 6f467f0cf0bb768..1df2319ee0949a9 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -49,6 +49,7 @@ DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat); #define kstat_cpu(cpu) per_cpu(kstat, cpu) #define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu) +extern unsigned long long nr_context_switches_cpu(int cpu); extern unsigned long long nr_context_switches(void); extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2e4cdc439e02e37..b9e312c6f35637e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4264,6 +4264,11 @@ bool single_task_running(void) } EXPORT_SYMBOL(single_task_running); +unsigned long long nr_context_switches_cpu(int cpu) +{ + return cpu_rq(cpu)->nr_switches; +} + unsigned long long nr_context_switches(void) { int i; -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 09/10] rcu: Add RCU stall diagnosis information

mainline inclusion from mainline-v6.3-rc1 commit be42f00b73a0f50710d16eb7cb4efda0cce062dd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) <omitted> rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Conflicts: Documentation/admin-guide/kernel-parameters.txt kernel/rcu/Kconfig.debug [Change RCU_CPU_STALL_CPUTIME to be enabled by default] kernel/rcu/rcu.h kernel/rcu/tree.h kernel/rcu/tree_stall.h kernel/rcu/update.c Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- .../admin-guide/kernel-parameters.txt | 6 ++++ kernel/rcu/Kconfig.debug | 13 ++++++++ kernel/rcu/rcu.h | 1 + kernel/rcu/tree.c | 18 +++++++++++ kernel/rcu/tree.h | 19 ++++++++++++ kernel/rcu/tree_stall.h | 31 +++++++++++++++++++ kernel/rcu/update.c | 2 ++ 7 files changed, 90 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 2addbf715cc08f8..a47daa3af42cd95 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4695,6 +4695,12 @@ rcupdate.rcu_cpu_stall_timeout= [KNL] Set timeout for RCU CPU stall warning messages. + rcupdate.rcu_cpu_stall_cputime= [KNL] + Provide statistics on the cputime and count of + interrupts and tasks during the sampling period. For + multiple continuous RCU stalls, all sampling periods + begin at half of the first RCU stall timeout. + rcupdate.rcu_expedited= [KNL] Use expedited grace-period primitives, for example, synchronize_rcu_expedited() instead diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 1942c1f1bb65d32..65103e9fa6f84d7 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -91,6 +91,19 @@ config RCU_CPU_STALL_TIMEOUT RCU grace period persists, additional CPU stall warnings are printed at more widely spaced intervals. +config RCU_CPU_STALL_CPUTIME + bool "Provide additional RCU stall debug information" + depends on RCU_STALL_COMMON + default n + help + Collect statistics during the sampling period, such as the number of + (hard interrupts, soft interrupts, task switches) and the cputime of + (hard interrupts, soft interrupts, kernel tasks) are added to the + RCU stall report. For multiple continuous RCU stalls, all sampling + periods begin at half of the first RCU stall timeout. + The boot option rcupdate.rcu_cpu_stall_cputime has the same function + as this one, but will override this if it exists. + config RCU_TRACE bool "Enable tracing for RCU" depends on DEBUG_KERNEL diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index fcf95d1eec69ab1..387c973e5c8134e 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -210,6 +210,7 @@ static inline bool rcu_stall_is_suppressed_at_boot(void) extern int rcu_cpu_stall_ftrace_dump; extern int rcu_cpu_stall_suppress; extern int rcu_cpu_stall_timeout; +extern int rcu_cpu_stall_cputime; int rcu_jiffies_till_stall_check(void); static inline bool rcu_stall_is_suppressed(void) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6a2669d1b6a2638..05a6a7bca03e10b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1346,6 +1346,24 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) rdp->rcu_iw_gp_seq = rnp->gp_seq; irq_work_queue_on(&rdp->rcu_iw, rdp->cpu); } + + if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) { + int cpu = rdp->cpu; + struct rcu_snap_record *rsrp; + struct kernel_cpustat *kcsp; + + kcsp = &kcpustat_cpu(cpu); + + rsrp = &rdp->snap_record; + rsrp->cputime_irq = kcpustat_field(kcsp, CPUTIME_IRQ, cpu); + rsrp->cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu); + rsrp->cputime_system = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu); + rsrp->nr_hardirqs = kstat_cpu_irqs_sum(rdp->cpu); + rsrp->nr_softirqs = kstat_cpu_softirqs_sum(rdp->cpu); + rsrp->nr_csw = nr_context_switches_cpu(rdp->cpu); + rsrp->jiffies = jiffies; + rsrp->gp_seq = rdp->gp_seq; + } } return 0; diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index b64696ca9109845..2a68a9e0aea4709 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -147,6 +147,23 @@ union rcu_noqs { u16 s; /* Set of bits, aggregate OR here. */ }; +/* + * Record the snapshot of the core stats at half of the first RCU stall timeout. + * The member gp_seq is used to ensure that all members are updated only once + * during the sampling period. The snapshot is taken only if this gp_seq is not + * equal to rdp->gp_seq. + */ +struct rcu_snap_record { + unsigned long gp_seq; /* Track rdp->gp_seq counter */ + u64 cputime_irq; /* Accumulated cputime of hard irqs */ + u64 cputime_softirq;/* Accumulated cputime of soft irqs */ + u64 cputime_system; /* Accumulated cputime of kernel tasks */ + unsigned long nr_hardirqs; /* Accumulated number of hard irqs */ + unsigned int nr_softirqs; /* Accumulated number of soft irqs */ + unsigned long long nr_csw; /* Accumulated number of task switches */ + unsigned long jiffies; /* Track jiffies value */ +}; + /* Per-CPU data for read-copy update. */ struct rcu_data { /* 1) quiescent-state and grace-period handling : */ @@ -251,6 +268,8 @@ struct rcu_data { unsigned long rcu_onl_gp_seq; /* ->gp_seq at last online. */ short rcu_onl_gp_flags; /* ->gp_flags at last online. */ unsigned long last_fqs_resched; /* Time of last rcu_resched(). */ + struct rcu_snap_record snap_record; /* Snapshot of core stats at half of */ + /* the first RCU stall timeout */ int cpu; }; diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 338d35fc674a941..945b7de36a76d6d 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -395,6 +395,35 @@ static bool rcu_is_gp_kthread_starving(unsigned long *jp) return j > 2 * HZ; } +static void print_cpu_stat_info(int cpu) +{ + struct rcu_snap_record rsr, *rsrp; + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); + struct kernel_cpustat *kcsp = &kcpustat_cpu(cpu); + + if (!rcu_cpu_stall_cputime) + return; + + rsrp = &rdp->snap_record; + if (rsrp->gp_seq != rdp->gp_seq) + return; + + rsr.cputime_irq = kcpustat_field(kcsp, CPUTIME_IRQ, cpu); + rsr.cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu); + rsr.cputime_system = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu); + + pr_err("\t hardirqs softirqs csw/system\n"); + pr_err("\t number: %8ld %10d %12lld\n", + kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs, + kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs, + nr_context_switches_cpu(cpu) - rsrp->nr_csw); + pr_err("\tcputime: %8lld %10lld %12lld ==> %d(ms)\n", + div_u64(rsr.cputime_irq - rsrp->cputime_irq, NSEC_PER_MSEC), + div_u64(rsr.cputime_softirq - rsrp->cputime_softirq, NSEC_PER_MSEC), + div_u64(rsr.cputime_system - rsrp->cputime_system, NSEC_PER_MSEC), + jiffies_to_msecs(jiffies - rsrp->jiffies)); +} + /* * Print out diagnostic information for the specified stalled CPU. * @@ -447,6 +476,8 @@ static void print_cpu_stall_info(int cpu) data_race(rcu_state.n_force_qs) - rcu_state.n_force_qs_gpstart, fast_no_hz, falsepositive ? " (false positive?)" : ""); + + print_cpu_stat_info(cpu); } /* Complain about starvation of grace-period kthread. */ diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 849f0aa99333bd8..0e3821783796522 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -497,6 +497,8 @@ EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress); module_param(rcu_cpu_stall_suppress, int, 0644); int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT; module_param(rcu_cpu_stall_timeout, int, 0644); +int rcu_cpu_stall_cputime __read_mostly = IS_ENABLED(CONFIG_RCU_CPU_STALL_CPUTIME); +module_param(rcu_cpu_stall_cputime, int, 0644); #endif /* #ifdef CONFIG_RCU_STALL_COMMON */ // Suppress boot-time RCU CPU stall warnings and rcutorture writer stall -- 2.34.1

Zhen Lei

2:42 p.m.

New subject: [PATCH OLK-5.10 10/10] rcu: Align the output of RCU CPU stall warning messages

mainline inclusion from mainline-v6.3-rc1 commit 3ab955de929a53245c89deb7ff9992b6eebbe4e8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7OIXK Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Time stamps are added to the output in kernels built with CONFIG_PRINTK_TIME=y, which causes misaligned output. Therefore, replace pr_cont() with pr_err(), which fixes alignment and gets rid of a couple of despised pr_cont() calls. Before: [ 37.567343] rcu: INFO: rcu_preempt self-detected stall on CPU [ 37.567839] rcu: 0-....: (1500 ticks this GP) idle=*** [ 37.568270] (t=1501 jiffies g=4717 q=28 ncpus=4) [ 37.568668] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #8 After: [ 36.762074] rcu: INFO: rcu_preempt self-detected stall on CPU [ 36.762543] rcu: 0-....: (1499 ticks this GP) idle=*** [ 36.763003] rcu: (t=1500 jiffies g=5097 q=27 ncpus=4) [ 36.763522] CPU: 0 PID: 313 Comm: test0 Not tainted 6.1.0-rc4 #9 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Conflicts: kernel/rcu/tree_stall.h Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- kernel/rcu/tree_stall.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 945b7de36a76d6d..e9d8b6dbde7c458 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -580,7 +580,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) for_each_possible_cpu(cpu) totqlen += rcu_get_n_cbs_cpu(cpu); - pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu)\n", + pr_err("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu)\n", smp_processor_id(), (long)(jiffies - gps), (long)rcu_seq_current(&rcu_state.gp_seq), totqlen); if (ndetected) { @@ -640,7 +640,7 @@ static void print_cpu_stall(unsigned long gps) raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags); for_each_possible_cpu(cpu) totqlen += rcu_get_n_cbs_cpu(cpu); - pr_cont("\t(t=%lu jiffies g=%ld q=%lu)\n", + pr_err("\t(t=%lu jiffies g=%ld q=%lu)\n", jiffies - gps, (long)rcu_seq_current(&rcu_state.gp_seq), totqlen); -- 2.34.1

patchwork bot

2:58 p.m.

反馈：您发送到kernel@openeuler.org的补丁/补丁集，已成功转换为PR！ PR链接地址： https://gitee.com/openeuler/kernel/pulls/1567 邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S... FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/1567 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...

946

Age (days ago)

946

Last active (days ago)

List overview

11 comments

2 participants

participants (2)

patchwork bot
Zhen Lei

[PATCH OLK-5.10 00/10] rcu: Add RCU stall diagnosis information

tags

participants (2)