mailweb.openeuler.org
kernel@openeuler.org

  • 62 participants
  • 18848 discussions
[PATCH openEuler-1.0-LTS] net/sched: act_mirred: use the backlog for mirred ingress
by Zhengchao Shao 11 Apr '24

From: Jakub Kicinski <kuba(a)kernel.org>

mainline inclusion
from mainline-v6.8-rc6
commit 52f671db18823089a02f07efc04efdb2272ddc17
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9E2LT
CVE: CVE-2024-26740
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

The test Davide added in commit ca22da2fbd69 ("act_mirred: use the backlog
for nested calls to mirred ingress") hangs our testing VMs every 10 or so
runs, with the familiar tcp_v4_rcv -> tcp_v4_rcv deadlock reported by
lockdep.

The problem as previously described by Davide (see Link) is that if we
reverse flow of traffic with the redirect (egress -> ingress) we may
reach the same socket which generated the packet. And we may still be
holding its socket lock. The common solution to such deadlocks is to put
the packet in the Rx backlog, rather than run the Rx path inline. Do that
for all egress -> ingress reversals, not just once we started to nest
mirred calls.

In the past there was a concern that the backlog indirection will lead
to loss of error reporting / less accurate stats. But the current
workaround does not seem to address the issue.

Fixes: 53592b364001 ("net/sched: act_mirred: Implement ingress actions")
Cc: Marcelo Ricardo Leitner <marcelo.leitner(a)gmail.com>
Suggested-by: Davide Caratti <dcaratti(a)redhat.com>
Link: https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.166…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Jamal Hadi Salim <jhs(a)mojatatu.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>

Conflicts:
	net/sched/act_mirred.c

Signed-off-by: Zhengchao Shao <shaozhengchao(a)huawei.com>
---
 net/sched/act_mirred.c                         | 15 ++++++---------
 .../selftests/net/forwarding/tc_actions.sh     |  3 ---
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index febf06b8bbdf..336db2c938b5 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -197,18 +197,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	return ret;
 }
 
-static bool is_mirred_nested(void)
-{
-	return unlikely(__this_cpu_read(mirred_rec_level) > 1);
-}
-
-static int tcf_mirred_forward(bool want_ingress, struct sk_buff *skb)
+static int
+tcf_mirred_forward(bool at_ingress, bool want_ingress, struct sk_buff *skb)
 {
 	int err;
 
 	if (!want_ingress)
 		err = dev_queue_xmit(skb);
-	else if (is_mirred_nested())
+	else if (!at_ingress)
 		err = netif_rx(skb);
 	else
 		err = netif_receive_skb(skb);
@@ -300,14 +296,15 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 		if (use_reinsert) {
 			res->ingress = want_ingress;
 			res->qstats = this_cpu_ptr(m->common.cpu_qstats);
-			if (tcf_mirred_forward(want_ingress, skb) && res->qstats)
+			if (tcf_mirred_forward(skb_at_tc_ingress(skb), want_ingress, skb)
+			    && res->qstats)
 				qstats_overlimit_inc(res->qstats);
 			__this_cpu_dec(mirred_rec_level);
 			return TC_ACT_CONSUMED;
 		}
 	}
 
-	err = tcf_mirred_forward(want_ingress, skb2);
+	err = tcf_mirred_forward(skb_at_tc_ingress(skb), want_ingress, skb2);
 	if (err) {
out:
 		qstats_overlimit_inc(this_cpu_ptr(m->common.cpu_qstats));
diff --git a/tools/testing/selftests/net/forwarding/tc_actions.sh b/tools/testing/selftests/net/forwarding/tc_actions.sh
index aaa1ea10ac83..221a023ee5d6 100755
--- a/tools/testing/selftests/net/forwarding/tc_actions.sh
+++ b/tools/testing/selftests/net/forwarding/tc_actions.sh
@@ -183,9 +183,6 @@ mirred_egress_to_ingress_tcp_test()
 	check_err $? "didn't mirred redirect ICMP"
 	tc_check_packets "dev $h1 ingress" 102 10
 	check_err $? "didn't drop mirred ICMP"
-	local overlimits=$(tc_rule_stats_get ${h1} 101 egress .overlimits)
-	test ${overlimits} = 10
-	check_err $? "wrong overlimits, expected 10 got ${overlimits}"
 
 	tc filter del dev $h1 egress protocol ip pref 100 handle 100 flower
 	tc filter del dev $h1 egress protocol ip pref 101 handle 101 flower
-- 
2.34.1

[PATCH OLK-5.10 v2] IB/hfi1: Fix sdma.h tx->num_descs off-by-one error
by Liu Jian 11 Apr '24

From: Daniel Vacek <neelx(a)redhat.com>

stable inclusion
from stable-v5.10.211
commit 3f38d22e645e2e994979426ea5a35186102ff3c2
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9E2Y3
CVE: CVE-2024-26766
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…

---------------------------

commit e6f57c6881916df39db7d95981a8ad2b9c3458d6 upstream.

Unfortunately the commit `fd8958efe877` introduced another error causing
the `descs` array to overflow. This results in further crashes easily
reproducible by the `sendmsg` system call.

[ 1080.836473] general protection fault, probably for non-canonical address 0x400300015528b00a: 0000 [#1] PREEMPT SMP PTI
[ 1080.869326] RIP: 0010:hfi1_ipoib_build_ib_tx_headers.constprop.0+0xe1/0x2b0 [hfi1]
...
[ 1080.974535] Call Trace:
[ 1080.976990]  <TASK>
[ 1081.021929]  hfi1_ipoib_send_dma_common+0x7a/0x2e0 [hfi1]
[ 1081.027364]  hfi1_ipoib_send_dma_list+0x62/0x270 [hfi1]
[ 1081.032633]  hfi1_ipoib_send+0x112/0x300 [hfi1]
[ 1081.042001]  ipoib_start_xmit+0x2a9/0x2d0 [ib_ipoib]
[ 1081.046978]  dev_hard_start_xmit+0xc4/0x210
...
[ 1081.148347]  __sys_sendmsg+0x59/0xa0

crash> ipoib_txreq 0xffff9cfeba229f00
struct ipoib_txreq {
  txreq = {
    list = {
      next = 0xffff9cfeba229f00,
      prev = 0xffff9cfeba229f00
    },
    descp = 0xffff9cfeba229f40,
    coalesce_buf = 0x0,
    wait = 0xffff9cfea4e69a48,
    complete = 0xffffffffc0fe0760 <hfi1_ipoib_sdma_complete>,
    packet_len = 0x46d,
    tlen = 0x0,
    num_desc = 0x0,
    desc_limit = 0x6,
    next_descq_idx = 0x45c,
    coalesce_idx = 0x0,
    flags = 0x0,
    descs = {{
        qw = {0x8024000120dffb00, 0x4}  # SDMA_DESC0_FIRST_DESC_FLAG (bit 63)
      }, {
        qw = { 0x3800014231b108, 0x4}
      }, {
        qw = { 0x310000e4ee0fcf0, 0x8}
      }, {
        qw = { 0x3000012e9f8000, 0x8}
      }, {
        qw = { 0x59000dfb9d0000, 0x8}
      }, {
        qw = { 0x78000e02e40000, 0x8}
      }}
  },
  sdma_hdr = 0x400300015528b000,  <<< invalid pointer in the tx request structure
  sdma_status = 0x0,              # SDMA_DESC0_LAST_DESC_FLAG (bit 62)
  complete = 0x0,
  priv = 0x0,
  txq = 0xffff9cfea4e69880,
  skb = 0xffff9d099809f400
}

If an SDMA send consists of exactly 6 descriptors and requires dword
padding (in the 7th descriptor), the sdma_txreq descriptor array is not
properly expanded and the packet will overflow into the container
structure. This results in a panic when the send completion runs. The
exact panic varies depending on what elements of the container structure
get corrupted. The fix is to use the correct expression in
_pad_sdma_tx_descs() to test the need to expand the descriptor array.

With this patch the crashes are no longer reproducible and the machine
is stable.

Fixes: fd8958efe877 ("IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors")
Cc: stable(a)vger.kernel.org
Reported-by: Mats Kronberg <kronberg(a)nsc.liu.se>
Tested-by: Mats Kronberg <kronberg(a)nsc.liu.se>
Signed-off-by: Daniel Vacek <neelx(a)redhat.com>
Link: https://lore.kernel.org/r/20240201081009.1109442-1-neelx@redhat.com
Signed-off-by: Leon Romanovsky <leon(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[Change commit log: "--" to "...". Otherwise openuler mail2pr ci-robot won't work.]
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
---
v1->v2: change commit log.

 drivers/infiniband/hw/hfi1/sdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hfi1/sdma.c b/drivers/infiniband/hw/hfi1/sdma.c
index 2dc97de434a5..68a8557e9a7c 100644
--- a/drivers/infiniband/hw/hfi1/sdma.c
+++ b/drivers/infiniband/hw/hfi1/sdma.c
@@ -3200,7 +3200,7 @@ int _pad_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx)
 {
 	int rval = 0;
 
-	if ((unlikely(tx->num_desc + 1 == tx->desc_limit))) {
+	if ((unlikely(tx->num_desc == tx->desc_limit))) {
 		rval = _extend_sdma_tx_descs(dd, tx);
 		if (rval) {
 			__sdma_txclean(dd, tx);
-- 
2.34.1

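The off-by-one is easiest to see with concrete numbers. Below is a toy
userspace model (plain C, not the driver; DESC_LIMIT and num_desc stand in
for tx->desc_limit and tx->num_desc), matching the crash dump above where
desc_limit = 0x6:

    #include <stdio.h>

    /* Toy model of the _pad_sdma_tx_descs() bounds check, not driver code. */
    #define DESC_LIMIT 6

    int main(void)
    {
            int num_desc = DESC_LIMIT;  /* 6 descriptors used, pad pending */

            /* Old check: 6 + 1 == 6 is false, so the array is not extended
             * and the padding descriptor is written to descs[6] -- one past
             * the end, corrupting the containing structure. */
            if (num_desc + 1 == DESC_LIMIT)
                    printf("old check: extend\n");
            else
                    printf("old check: no extend, descs[%d] overflows\n", num_desc);

            /* Fixed check: extends exactly when the array is full. */
            if (num_desc == DESC_LIMIT)
                    printf("fixed check: extend before writing descs[%d]\n", num_desc);
            return 0;
    }
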
[PATCH OLK-5.10] [Backport] drm/amdgpu: fix use-after-free bug
by Zhenzeng Su 11 Apr '24

From: Vitaly Prosyak <vitaly.prosyak(a)amd.com>

mainline inclusion
from mainline-v6.9-rc1
commit 22207fd5c80177b860279653d017474b2812af5e
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9DO1Z
CVE: CVE-2024-26656
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…

--------------------------------

The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to
the AMDGPU DRM driver on any ASIC with an invalid address and size. The
bug was reported by Joonkyo Jung <joonkyoj(a)yonsei.ac.kr>. For example
the following code:

static void Syzkaller1(int fd)
{
	struct drm_amdgpu_gem_userptr arg;
	int ret;

	arg.addr = 0xffffffffffff0000;
	arg.size = 0x80000000; /*2 Gb*/
	arg.flags = 0x7;
	ret = drmIoctl(fd, 0xc1186451/*amdgpu_gem_userptr_ioctl*/, &arg);
}

Because the address and size are not valid, there is a failure in
amdgpu_mn_register->mmu_interval_notifier_insert->
__mmu_interval_notifier_insert->check_shl_overflow, but even when
amdgpu_mn_register fails we still call amdgpu_mn_unregister from
amdgpu_gem_object_free, which causes access to a bad address.

The following stack trace appears when the issue is reproduced with
KASAN enabled:

[ +0.000014] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[ +0.000009] RIP: 0010:mmu_interval_notifier_remove+0x327/0x340
[ +0.000017] Code: ff ff 49 89 44 24 08 48 b8 00 01 00 00 00 00 ad de 4c 89 f7 49 89 47 40 48 83 c0 22 49 89 47 48 e8 ce d1 2d 01 e9 32 ff ff ff <0f> 0b e9 16 ff ff ff 4c 89 ef e8 fa 14 b3 ff e9 36 ff ff ff e8 80
[ +0.000014] RSP: 0018:ffffc90002657988 EFLAGS: 00010246
[ +0.000013] RAX: 0000000000000000 RBX: 1ffff920004caf35 RCX: ffffffff8160565b
[ +0.000011] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881a9f78260
[ +0.000010] RBP: ffffc90002657a70 R08: 0000000000000001 R09: fffff520004caf25
[ +0.000010] R10: 0000000000000003 R11: ffffffff8161d1d6 R12: ffff88810e988c00
[ +0.000010] R13: ffff888126fb5a00 R14: ffff88810e988c0c R15: ffff8881a9f78260
[ +0.000011] FS:  00007ff9ec848540(0000) GS:ffff8883cc880000(0000) knlGS:0000000000000000
[ +0.000012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000010] CR2: 000055b3f7e14328 CR3: 00000001b5770000 CR4: 0000000000350ef0
[ +0.000010] Call Trace:
[ +0.000006]  <TASK>
[ +0.000007]  ? show_regs+0x6a/0x80
[ +0.000018]  ? __warn+0xa5/0x1b0
[ +0.000019]  ? mmu_interval_notifier_remove+0x327/0x340
[ +0.000018]  ? report_bug+0x24a/0x290
[ +0.000022]  ? handle_bug+0x46/0x90
[ +0.000015]  ? exc_invalid_op+0x19/0x50
[ +0.000016]  ? asm_exc_invalid_op+0x1b/0x20
[ +0.000017]  ? kasan_save_stack+0x26/0x50
[ +0.000017]  ? mmu_interval_notifier_remove+0x23b/0x340
[ +0.000019]  ? mmu_interval_notifier_remove+0x327/0x340
[ +0.000019]  ? mmu_interval_notifier_remove+0x23b/0x340
[ +0.000020]  ? __pfx_mmu_interval_notifier_remove+0x10/0x10
[ +0.000017]  ? kasan_save_alloc_info+0x1e/0x30
[ +0.000018]  ? srso_return_thunk+0x5/0x5f
[ +0.000014]  ? __kasan_kmalloc+0xb1/0xc0
[ +0.000018]  ? srso_return_thunk+0x5/0x5f
[ +0.000013]  ? __kasan_check_read+0x11/0x20
[ +0.000020]  amdgpu_mn_unregister+0x34/0x50 [amdgpu]
[ +0.004695]  amdgpu_gem_object_free+0x66/0xa0 [amdgpu]
[ +0.004534]  ? __pfx_amdgpu_gem_object_free+0x10/0x10 [amdgpu]
[ +0.004291]  ? do_syscall_64+0x5f/0xe0
[ +0.000023]  ? srso_return_thunk+0x5/0x5f
[ +0.000017]  drm_gem_object_free+0x3b/0x50 [drm]
[ +0.000489]  amdgpu_gem_userptr_ioctl+0x306/0x500 [amdgpu]
[ +0.004295]  ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004270]  ? srso_return_thunk+0x5/0x5f
[ +0.000014]  ? __this_cpu_preempt_check+0x13/0x20
[ +0.000015]  ? srso_return_thunk+0x5/0x5f
[ +0.000013]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[ +0.000020]  ? srso_return_thunk+0x5/0x5f
[ +0.000014]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ +0.000022]  ? drm_ioctl_kernel+0x17b/0x1f0 [drm]
[ +0.000496]  ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004272]  ? drm_ioctl_kernel+0x190/0x1f0 [drm]
[ +0.000492]  drm_ioctl_kernel+0x140/0x1f0 [drm]
[ +0.000497]  ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004297]  ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
[ +0.000489]  ? srso_return_thunk+0x5/0x5f
[ +0.000011]  ? __kasan_check_write+0x14/0x20
[ +0.000016]  drm_ioctl+0x3da/0x730 [drm]
[ +0.000475]  ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004293]  ? __pfx_drm_ioctl+0x10/0x10 [drm]
[ +0.000506]  ? __pfx_rpm_resume+0x10/0x10
[ +0.000016]  ? srso_return_thunk+0x5/0x5f
[ +0.000011]  ? __kasan_check_write+0x14/0x20
[ +0.000010]  ? srso_return_thunk+0x5/0x5f
[ +0.000011]  ? _raw_spin_lock_irqsave+0x99/0x100
[ +0.000015]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ +0.000014]  ? srso_return_thunk+0x5/0x5f
[ +0.000013]  ? srso_return_thunk+0x5/0x5f
[ +0.000011]  ? srso_return_thunk+0x5/0x5f
[ +0.000011]  ? preempt_count_sub+0x18/0xc0
[ +0.000013]  ? srso_return_thunk+0x5/0x5f
[ +0.000010]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ +0.000019]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ +0.004272]  __x64_sys_ioctl+0xcd/0x110
[ +0.000020]  do_syscall_64+0x5f/0xe0
[ +0.000021]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ +0.000015] RIP: 0033:0x7ff9ed31a94f
[ +0.000012] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[ +0.000013] RSP: 002b:00007fff25f66790 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000016] RAX: ffffffffffffffda RBX: 000055b3f7e133e0 RCX: 00007ff9ed31a94f
[ +0.000012] RDX: 000055b3f7e133e0 RSI: 00000000c1186451 RDI: 0000000000000003
[ +0.000010] RBP: 00000000c1186451 R08: 0000000000000000 R09: 0000000000000000
[ +0.000009] R10: 0000000000000008 R11: 0000000000000246 R12: 00007fff25f66ca8
[ +0.000009] R13: 0000000000000003 R14: 000055b3f7021ba8 R15: 00007ff9ed7af040
[ +0.000024]  </TASK>
[ +0.000007] ---[ end trace 0000000000000000 ]---

v2: Consolidate any error handling into amdgpu_mn_register which
    applied to kfd_bo also. (Christian)
v3: Improve syntax and comment (Christian)

Cc: Christian Koenig <christian.koenig(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Felix Kuehling <felix.kuehling(a)amd.com>
Cc: Joonkyo Jung <joonkyoj(a)yonsei.ac.kr>
Cc: Dokyung Song <dokyungs(a)yonsei.ac.kr>
Cc: <jisoo.jang(a)yonsei.ac.kr>
Cc: <yw9865(a)yonsei.ac.kr>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak(a)amd.com>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 828b5167ff12..57ee0b7af9d2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -132,13 +132,25 @@ static const struct mmu_interval_notifier_ops amdgpu_mn_hsa_ops = {
  */
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
 {
+	int r;
+
 	if (bo->kfd_bo)
-		return mmu_interval_notifier_insert(&bo->notifier, current->mm,
+		r = mmu_interval_notifier_insert(&bo->notifier, current->mm,
 						    addr, amdgpu_bo_size(bo),
 						    &amdgpu_mn_hsa_ops);
-	return mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
-					    amdgpu_bo_size(bo),
-					    &amdgpu_mn_gfx_ops);
+	else
+		r = mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
+						 amdgpu_bo_size(bo),
+						 &amdgpu_mn_gfx_ops);
+	if (r)
+		/*
+		 * Make sure amdgpu_mn_unregister() doesn't call
+		 * mmu_interval_notifier_remove() when the notifier isn't properly
+		 * initialized.
+		 */
+		bo->notifier.mm = NULL;
+
+	return r;
 }
 
 /**
-- 
2.25.1

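The diff only touches the register side; clearing bo->notifier.mm matters
because the free path can then use it as an "is the notifier initialized?"
marker. The sketch below shows that counterpart; its exact shape is an
assumption reconstructed from the commit description, not quoted from the
tree:

    /* Assumed shape of the unregister side, for illustration only. */
    void amdgpu_mn_unregister(struct amdgpu_bo *bo)
    {
            if (!bo->notifier.mm)   /* insert failed or never registered */
                    return;
            mmu_interval_notifier_remove(&bo->notifier);
            bo->notifier.mm = NULL;
    }

With mm cleared on the failed insert, the unregister call made from
amdgpu_gem_object_free() becomes a harmless no-op instead of removing a
half-initialized notifier.
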
[PATCH OLK-5.10] btrfs: don't drop extent_map for free space inode on write error
by Zizhi Wo 11 Apr '24

From: Josef Bacik <josef(a)toxicpanda.com>

stable inclusion
from stable-v6.1.79
commit 02f2b95b00bf57d20320ee168b30fb7f3db8e555
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9E2F7
CVE: CVE-2024-26726
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…

--------------------------------

commit 5571e41ec6e56e35f34ae9f5b3a335ef510e0ade upstream.

While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.

assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G        W          6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
 <TASK>
 extent_write_cache_pages+0x2ac/0x8f0
 extent_writepages+0x87/0x110
 do_writepages+0xd5/0x1f0
 filemap_fdatawrite_wbc+0x63/0x90
 __filemap_fdatawrite_range+0x5c/0x80
 btrfs_fdatawrite_range+0x1f/0x50
 btrfs_write_out_cache+0x507/0x560
 btrfs_write_dirty_block_groups+0x32a/0x420
 commit_cowonly_roots+0x21b/0x290
 btrfs_commit_transaction+0x813/0x1360
 btrfs_sync_file+0x51a/0x640
 __x64_sys_fdatasync+0x52/0x90
 do_syscall_64+0x9c/0x190
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again. However on the
second pass through we go to call btrfs_get_extent() on the inode to get
the extent mapping. Because this is a new block group, and with the free
space inode we always search the commit root to avoid deadlocking with
the tree, we find nothing and return a EXTENT_MAP_HOLE for the requested
range.

This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping. This is
normal for normal files, but the free space cache inode is special. We
always expect the extent map to be correct. Thus the second time through
we end up with a bogus extent map.

Since we're deprecating this feature, the most straightforward way to fix
this is to simply skip dropping the extent map range for this failed
range.

I shortened the test by using error injection to stress the area to make
it easier to reproduce. With this patch in place we no longer panic with
my error injection test.

CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>

Conflicts:
	fs/btrfs/inode.c

Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b12fc82e34ba..03670d4cd6ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2775,8 +2775,22 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 			unwritten_start += logical_len;
 		clear_extent_uptodate(io_tree, unwritten_start, end, NULL);
 
-		/* Drop the cache for the part of the extent we didn't write. */
-		btrfs_drop_extent_cache(BTRFS_I(inode), unwritten_start, end, 0);
+		/*
+		 * Drop extent maps for the part of the extent we didn't write.
+		 *
+		 * We have an exception here for the free_space_inode, this is
+		 * because when we do btrfs_get_extent() on the free space inode
+		 * we will search the commit root.  If this is a new block group
+		 * we won't find anything, and we will trip over the assert in
+		 * writepage where we do ASSERT(em->block_start !=
+		 * EXTENT_MAP_HOLE).
+		 *
+		 * Theoretically we could also skip this for any NOCOW extent as
+		 * we don't mess with the extent map tree in the NOCOW case, but
+		 * for now simply skip this if we are the free space inode.
+		 */
+		if (!btrfs_is_free_space_inode(BTRFS_I(inode)))
+			btrfs_drop_extent_cache(BTRFS_I(inode), unwritten_start, end, 0);
 
 		/*
 		 * If the ordered extent had an IOERR or something else went
-- 
2.39.2

[PATCH openEuler-1.0-LTS] btrfs: don't drop extent_map for free space inode on write error
by Zizhi Wo 11 Apr '24

From: Josef Bacik <josef(a)toxicpanda.com>

stable inclusion
from stable-v6.1.79
commit 02f2b95b00bf57d20320ee168b30fb7f3db8e555
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9E2F7
CVE: CVE-2024-26726
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…

--------------------------------

commit 5571e41ec6e56e35f34ae9f5b3a335ef510e0ade upstream.

While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.

assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G        W          6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
 <TASK>
 extent_write_cache_pages+0x2ac/0x8f0
 extent_writepages+0x87/0x110
 do_writepages+0xd5/0x1f0
 filemap_fdatawrite_wbc+0x63/0x90
 __filemap_fdatawrite_range+0x5c/0x80
 btrfs_fdatawrite_range+0x1f/0x50
 btrfs_write_out_cache+0x507/0x560
 btrfs_write_dirty_block_groups+0x32a/0x420
 commit_cowonly_roots+0x21b/0x290
 btrfs_commit_transaction+0x813/0x1360
 btrfs_sync_file+0x51a/0x640
 __x64_sys_fdatasync+0x52/0x90
 do_syscall_64+0x9c/0x190
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again. However on the
second pass through we go to call btrfs_get_extent() on the inode to get
the extent mapping. Because this is a new block group, and with the free
space inode we always search the commit root to avoid deadlocking with
the tree, we find nothing and return a EXTENT_MAP_HOLE for the requested
range.

This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping. This is
normal for normal files, but the free space cache inode is special. We
always expect the extent map to be correct. Thus the second time through
we end up with a bogus extent map.

Since we're deprecating this feature, the most straightforward way to fix
this is to simply skip dropping the extent map range for this failed
range.

I shortened the test by using error injection to stress the area to make
it easier to reproduce. With this patch in place we no longer panic with
my error injection test.

CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>

Conflicts:
	fs/btrfs/inode.c

Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 51a119ac91cd..676cce61cad9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3145,8 +3145,22 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		end = ordered_extent->file_offset + ordered_extent->len - 1;
 		clear_extent_uptodate(io_tree, start, end, NULL);
 
-		/* Drop the cache for the part of the extent we didn't write. */
-		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		/*
+		 * Drop extent maps for the part of the extent we didn't write.
+		 *
+		 * We have an exception here for the free_space_inode, this is
+		 * because when we do btrfs_get_extent() on the free space inode
+		 * we will search the commit root.  If this is a new block group
+		 * we won't find anything, and we will trip over the assert in
+		 * writepage where we do ASSERT(em->block_start !=
+		 * EXTENT_MAP_HOLE).
+		 *
+		 * Theoretically we could also skip this for any NOCOW extent as
+		 * we don't mess with the extent map tree in the NOCOW case, but
+		 * for now simply skip this if we are the free space inode.
+		 */
+		if (!btrfs_is_free_space_inode(BTRFS_I(inode)))
+			btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
 
 		/*
 		 * If the ordered extent had an IOERR or something else went
-- 
2.39.2

[PATCH openEuler-1.0-LTS] PM / devfreq: Synchronize devfreq_monitor_[start/stop]
by Yi Yang 11 Apr '24

From: Mukesh Ojha <quic_mojha(a)quicinc.com>

mainline inclusion
from mainline-v6.8-rc1
commit aed5ed595960c6d301dcd4ed31aeaa7a8054c0c6
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9DNGK
CVE: CVE-2023-52635
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

There is a chance that a frequent switch of the governor done in a loop
results in timer list corruption, where the timer cancel is done from
two places: once from cancel_delayed_work_sync() and again from
expire_timers(), as can be seen from the traces[1].

while true
do
	echo "simple_ondemand" > /sys/class/devfreq/1d84000.ufshc/governor
	echo "performance" > /sys/class/devfreq/1d84000.ufshc/governor
done

It looks to be an issue in the devfreq driver, where
devfreq_monitor_[start/stop] need to be synchronized so that the delayed
work does not get corrupted while it is being queued, running, or
cancelled.

Let's use the polling flag and the devfreq lock to synchronize against
the timer instance being queued twice and the work data being corrupted.

[1]
...
..
<idle>-0       [003]  9436.209662: timer_cancel        timer=0xffffff80444f0428
<idle>-0       [003]  9436.209664: timer_expire_entry  timer=0xffffff80444f0428 now=0x10022da1c function=__typeid__ZTSFvP10timer_listE_global_addr baseclk=0x10022da1c
<idle>-0       [003]  9436.209718: timer_expire_exit   timer=0xffffff80444f0428
kworker/u16:6-14217   [003]  9436.209863: timer_start   timer=0xffffff80444f0428 function=__typeid__ZTSFvP10timer_listE_global_addr expires=0x10022da2b now=0x10022da1c flags=182452227
vendor.xxxyyy.ha-1593 [004]  9436.209888: timer_cancel  timer=0xffffff80444f0428
vendor.xxxyyy.ha-1593 [004]  9436.216390: timer_init    timer=0xffffff80444f0428
vendor.xxxyyy.ha-1593 [004]  9436.216392: timer_start   timer=0xffffff80444f0428 function=__typeid__ZTSFvP10timer_listE_global_addr expires=0x10022da2c now=0x10022da1d flags=186646532
vendor.xxxyyy.ha-1593 [005]  9436.220992: timer_cancel  timer=0xffffff80444f0428
xxxyyyTraceManag-7795 [004]  9436.261641: timer_cancel  timer=0xffffff80444f0428

[2]
[ 9436.261653][ C4] Unable to handle kernel paging request at virtual address dead00000000012a
[ 9436.261664][ C4] Mem abort info:
[ 9436.261666][ C4]   ESR = 0x96000044
[ 9436.261669][ C4]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 9436.261671][ C4]   SET = 0, FnV = 0
[ 9436.261673][ C4]   EA = 0, S1PTW = 0
[ 9436.261675][ C4] Data abort info:
[ 9436.261677][ C4]   ISV = 0, ISS = 0x00000044
[ 9436.261680][ C4]   CM = 0, WnR = 1
[ 9436.261682][ C4] [dead00000000012a] address between user and kernel address ranges
[ 9436.261685][ C4] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[ 9436.261701][ C4] Skip md ftrace buffer dump for: 0x3a982d0
...
[ 9436.262138][ C4] CPU: 4 PID: 7795 Comm: TraceManag Tainted: G S      W  O      5.10.149-android12-9-o-g17f915d29d0c #1
[ 9436.262141][ C4] Hardware name: Qualcomm Technologies, Inc. (DT)
[ 9436.262144][ C4] pstate: 22400085 (nzCv daIf +PAN -UAO +TCO BTYPE=--)
[ 9436.262161][ C4] pc : expire_timers+0x9c/0x438
[ 9436.262164][ C4] lr : expire_timers+0x2a4/0x438
[ 9436.262168][ C4] sp : ffffffc010023dd0
[ 9436.262171][ C4] x29: ffffffc010023df0 x28: ffffffd0636fdc18
[ 9436.262178][ C4] x27: ffffffd063569dd0 x26: ffffffd063536008
[ 9436.262182][ C4] x25: 0000000000000001 x24: ffffff88f7c69280
[ 9436.262185][ C4] x23: 00000000000000e0 x22: dead000000000122
[ 9436.262188][ C4] x21: 000000010022da29 x20: ffffff8af72b4e80
[ 9436.262191][ C4] x19: ffffffc010023e50 x18: ffffffc010025038
[ 9436.262195][ C4] x17: 0000000000000240 x16: 0000000000000201
[ 9436.262199][ C4] x15: ffffffffffffffff x14: ffffff889f3c3100
[ 9436.262203][ C4] x13: ffffff889f3c3100 x12: 00000000049f56b8
[ 9436.262207][ C4] x11: 00000000049f56b8 x10: 00000000ffffffff
[ 9436.262212][ C4] x9 : ffffffc010023e50 x8 : dead000000000122
[ 9436.262216][ C4] x7 : ffffffffffffffff x6 : ffffffc0100239d8
[ 9436.262220][ C4] x5 : 0000000000000000 x4 : 0000000000000101
[ 9436.262223][ C4] x3 : 0000000000000080 x2 : ffffff889edc155c
[ 9436.262227][ C4] x1 : ffffff8001005200 x0 : ffffff80444f0428
[ 9436.262232][ C4] Call trace:
[ 9436.262236][ C4]  expire_timers+0x9c/0x438
[ 9436.262240][ C4]  __run_timers+0x1f0/0x330
[ 9436.262245][ C4]  run_timer_softirq+0x28/0x58
[ 9436.262255][ C4]  efi_header_end+0x168/0x5ec
[ 9436.262265][ C4]  __irq_exit_rcu+0x108/0x124
[ 9436.262274][ C4]  __handle_domain_irq+0x118/0x1e4
[ 9436.262282][ C4]  gic_handle_irq.30369+0x6c/0x2bc
[ 9436.262286][ C4]  el0_irq_naked+0x60/0x6c

Link: https://lore.kernel.org/all/1700860318-4025-1-git-send-email-quic_mojha@qui…
Reported-by: Joyyoung Huang <huangzaiyang(a)oppo.com>
Acked-by: MyungJoo Ham <myungjoo.ham(a)samsung.com>
Signed-off-by: Mukesh Ojha <quic_mojha(a)quicinc.com>
Signed-off-by: Chanwoo Choi <cw00.choi(a)samsung.com>

conflicts:
	drivers/devfreq/devfreq.c

Signed-off-by: Yi Yang <yiyang13(a)huawei.com>
---
 drivers/devfreq/devfreq.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 1229bfb3180e..57589022d45e 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -381,8 +381,13 @@ static void devfreq_monitor(struct work_struct *work)
 	if (err)
 		dev_err(&devfreq->dev, "dvfs failed with (%d) error\n", err);
 
+	if (devfreq->stop_polling)
+		goto out;
+
 	queue_delayed_work(devfreq_wq, &devfreq->work,
 				msecs_to_jiffies(devfreq->profile->polling_ms));
+
+out:
 	mutex_unlock(&devfreq->lock);
 }
 
@@ -397,10 +402,18 @@ static void devfreq_monitor(struct work_struct *work)
  */
 void devfreq_monitor_start(struct devfreq *devfreq)
 {
+	mutex_lock(&devfreq->lock);
+	if (delayed_work_pending(&devfreq->work))
+		goto out;
+
 	INIT_DEFERRABLE_WORK(&devfreq->work, devfreq_monitor);
 	if (devfreq->profile->polling_ms)
 		queue_delayed_work(devfreq_wq, &devfreq->work,
 			msecs_to_jiffies(devfreq->profile->polling_ms));
+
+out:
+	devfreq->stop_polling = false;
+	mutex_unlock(&devfreq->lock);
 }
 EXPORT_SYMBOL(devfreq_monitor_start);
 
@@ -414,6 +427,14 @@ EXPORT_SYMBOL(devfreq_monitor_start);
  */
 void devfreq_monitor_stop(struct devfreq *devfreq)
 {
+	mutex_lock(&devfreq->lock);
+	if (devfreq->stop_polling) {
+		mutex_unlock(&devfreq->lock);
+		return;
+	}
+
+	devfreq->stop_polling = true;
+	mutex_unlock(&devfreq->lock);
 	cancel_delayed_work_sync(&devfreq->work);
 }
 EXPORT_SYMBOL(devfreq_monitor_stop);
-- 
2.25.1

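The pattern generalizes beyond devfreq: any self-requeueing delayed work
raced against cancel_delayed_work_sync() needs the requeue decision and
the stop request serialized under one lock. A generic sketch follows;
the names struct monitor, do_sampling() and monitor_wq are inventions
for illustration, not devfreq code:

    #include <linux/mutex.h>
    #include <linux/workqueue.h>

    /* Assumed types/helpers for the sketch. */
    struct monitor {
            struct mutex lock;
            struct delayed_work dwork;
            unsigned long interval;         /* in jiffies */
            bool stop_polling;
    };
    static struct workqueue_struct *monitor_wq;     /* assumed */
    static void do_sampling(struct monitor *m);     /* assumed payload */

    static void monitor_fn(struct work_struct *work)
    {
            struct monitor *m = container_of(to_delayed_work(work),
                                             struct monitor, dwork);

            mutex_lock(&m->lock);
            do_sampling(m);
            if (!m->stop_polling)           /* re-arm only while running */
                    queue_delayed_work(monitor_wq, &m->dwork, m->interval);
            mutex_unlock(&m->lock);
    }

    static void monitor_stop(struct monitor *m)
    {
            mutex_lock(&m->lock);
            m->stop_polling = true;         /* forbid further re-arming... */
            mutex_unlock(&m->lock);
            cancel_delayed_work_sync(&m->dwork); /* ...then flush the last run */
    }

Because the flag is set under the same lock the work takes before
re-arming, the work can never queue a fresh timer after the stopper has
committed to cancelling, which is exactly the double-cancel window the
traces above show.
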
[PATCH openEuler-1.0-LTS] net/sched: act_mirred: use the backlog for mirred ingress
by Zhengchao Shao 11 Apr '24

From: Jakub Kicinski <kuba(a)kernel.org>

mainline inclusion
from mainline-v6.8-rc6
commit 52f671db18823089a02f07efc04efdb2272ddc17
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9E2LT
CVE: CVE-2024-26740
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

The test Davide added in commit ca22da2fbd69 ("act_mirred: use the backlog
for nested calls to mirred ingress") hangs our testing VMs every 10 or so
runs, with the familiar tcp_v4_rcv -> tcp_v4_rcv deadlock reported by
lockdep.

The problem as previously described by Davide (see Link) is that if we
reverse flow of traffic with the redirect (egress -> ingress) we may
reach the same socket which generated the packet. And we may still be
holding its socket lock. The common solution to such deadlocks is to put
the packet in the Rx backlog, rather than run the Rx path inline. Do that
for all egress -> ingress reversals, not just once we started to nest
mirred calls.

In the past there was a concern that the backlog indirection will lead
to loss of error reporting / less accurate stats. But the current
workaround does not seem to address the issue.

Fixes: 53592b364001 ("net/sched: act_mirred: Implement ingress actions")
Cc: Marcelo Ricardo Leitner <marcelo.leitner(a)gmail.com>
Suggested-by: Davide Caratti <dcaratti(a)redhat.com>
Link: https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.166…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Jamal Hadi Salim <jhs(a)mojatatu.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>

Conflicts:
	net/sched/act_mirred.c

Signed-off-by: Zhengchao Shao <shaozhengchao(a)huawei.com>
---
 net/sched/act_mirred.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index febf06b8bbdf..336db2c938b5 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -197,18 +197,14 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	return ret;
 }
 
-static bool is_mirred_nested(void)
-{
-	return unlikely(__this_cpu_read(mirred_rec_level) > 1);
-}
-
-static int tcf_mirred_forward(bool want_ingress, struct sk_buff *skb)
+static int
+tcf_mirred_forward(bool at_ingress, bool want_ingress, struct sk_buff *skb)
 {
 	int err;
 
 	if (!want_ingress)
 		err = dev_queue_xmit(skb);
-	else if (is_mirred_nested())
+	else if (!at_ingress)
 		err = netif_rx(skb);
 	else
 		err = netif_receive_skb(skb);
@@ -300,14 +296,15 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 		if (use_reinsert) {
 			res->ingress = want_ingress;
 			res->qstats = this_cpu_ptr(m->common.cpu_qstats);
-			if (tcf_mirred_forward(want_ingress, skb) && res->qstats)
+			if (tcf_mirred_forward(skb_at_tc_ingress(skb), want_ingress, skb)
+			    && res->qstats)
 				qstats_overlimit_inc(res->qstats);
 			__this_cpu_dec(mirred_rec_level);
 			return TC_ACT_CONSUMED;
 		}
 	}
 
-	err = tcf_mirred_forward(want_ingress, skb2);
+	err = tcf_mirred_forward(skb_at_tc_ingress(skb), want_ingress, skb2);
 	if (err) {
out:
 		qstats_overlimit_inc(this_cpu_ptr(m->common.cpu_qstats));
-- 
2.34.1

[PATCH OLK-6.6] locking/osq_lock: Avoid false sharing in optimistic_spin_node
by liwei 11 Apr '24

From: Zeng Heng <zengheng4(a)huawei.com>

hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I8MV01

--------------------------------

Using the UnixBench test suite, we clearly find with the perf tool that
osq_lock() causes extremely high overhead in the File Copy items:

  Overhead  Shared Object  Symbol
  94.25%    [kernel]       [k] osq_lock
  0.74%     [kernel]       [k] rwsem_spin_on_owner
  0.32%     [kernel]       [k] filemap_get_read_batch

In response to this, we conducted an analysis and made some gains:

In the prologue of osq_lock(), it sets the `cpu` member of the percpu
struct optimistic_spin_node to the local cpu id; after that, the value
in fact never changes. Based on that, we can regard the `cpu` member as
a constant variable.

Meanwhile, other members of the percpu struct, like next, prev and
locked, are frequently modified by osq_lock() and osq_unlock(), which
are called by rwsem, mutex and so on. However, that invalidates the
cached `cpu` member on other CPUs. Therefore, we can place padding here
and split them into different cache lines to avoid cache misses when the
next CPU is spinning to check another node's `cpu` member via
vcpu_is_preempted().

Here we provide the UnixBench full-core test result as below:

Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
Run the command "./Run -c 80 -i 3" 10 times and take the average.

System Benchmarks Index Values           Without Patch   With Patch     Diff
Dhrystone 2 using register variables         185876.43    185945.41    0.04%
Double-Precision Whetstone                    79637.27     79659.29    0.03%
Execl Throughput                               9909.61     10576.06    6.73%
File Copy 1024 bufsize 2000 maxblocks          1723.01      2086.08   21.07%
File Copy 256 bufsize 500 maxblocks            1150.24      1338.21   16.34%
File Copy 4096 bufsize 8000 maxblocks          3719.19      4011.99    7.87%
Pipe Throughput                               66184.84     66025.25   -0.24%
Pipe-based Context Switching                  30606.18     31074.21    1.53%
Process Creation                               9442.48      9450.77    0.09%
Shell Scripts (1 concurrent)                  44526.52     46548.54    4.54%
Shell Scripts (8 concurrent)                  42903.96     45718.56    6.56%
System Call Overhead                           3645.20      3717.42    1.98%
========
System Benchmarks Index Score                 15126.87     15931.29    5.32%

Signed-off-by: Zeng Heng <zengheng4(a)huawei.com>
Signed-off-by: liwei <liwei728(a)huawei.com>
---
 include/linux/osq_lock.h  | 2 +-
 kernel/locking/osq_lock.c | 8 +++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..deb90ad5f560 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -9,7 +9,7 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+	int cpu ____cacheline_aligned; /* encoded CPU # + 1 value */
 };
 
 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..17618d62343f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	node->locked = 0;
 	node->next = NULL;
-	node->cpu = curr;
+	/*
+	 * After this cpu member is initialized for the first time, it
+	 * would no longer change in fact. That could avoid cache misses
+	 * when spin and access the cpu member by other CPUs.
+	 */
+	if (node->cpu != curr)
+		node->cpu = curr;
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
-- 
2.25.1

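The layout effect of ____cacheline_aligned can be checked from userspace.
Below is a small runnable illustration (ordinary C, not kernel code;
64-byte cache lines assumed) showing that the padded variant moves `cpu`
off the line that next/prev/locked keep dirtying:

    #include <stdio.h>
    #include <stddef.h>

    /* Mirrors the packed layout: all four fields share one 64B line. */
    struct node_packed {
            struct node_packed *next, *prev;  /* frequently written */
            int locked;                       /* frequently written */
            int cpu;                          /* written once, read remotely */
    };

    /* Mirrors the patched layout; aligned(64) stands in for
     * ____cacheline_aligned. */
    struct node_padded {
            struct node_padded *next, *prev;
            int locked;
            int cpu __attribute__((aligned(64)));
    };

    int main(void)
    {
            printf("packed: locked at %zu, cpu at %zu (same 64B line)\n",
                   offsetof(struct node_packed, locked),
                   offsetof(struct node_packed, cpu));
            printf("padded: locked at %zu, cpu at %zu (separate lines)\n",
                   offsetof(struct node_padded, locked),
                   offsetof(struct node_padded, cpu));
            return 0;
    }

On a typical LP64 target this prints offsets 16/20 for the packed struct
and 16/64 for the padded one, so remote readers spinning on `cpu` are no
longer invalidated by every next/prev/locked update.
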
[PATCH OLK-5.10] IB/hfi1: Fix sdma.h tx->num_descs off-by-one error
by Liu Jian 11 Apr '24

From: Daniel Vacek <neelx(a)redhat.com>

stable inclusion
from stable-v5.10.211
commit 3f38d22e645e2e994979426ea5a35186102ff3c2
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9E2Y3
CVE: CVE-2024-26766
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…

---------------------------

commit e6f57c6881916df39db7d95981a8ad2b9c3458d6 upstream.

Unfortunately the commit `fd8958efe877` introduced another error causing
the `descs` array to overflow. This results in further crashes easily
reproducible by the `sendmsg` system call.

[ 1080.836473] general protection fault, probably for non-canonical address 0x400300015528b00a: 0000 [#1] PREEMPT SMP PTI
[ 1080.869326] RIP: 0010:hfi1_ipoib_build_ib_tx_headers.constprop.0+0xe1/0x2b0 [hfi1]
--
[ 1080.974535] Call Trace:
[ 1080.976990]  <TASK>
[ 1081.021929]  hfi1_ipoib_send_dma_common+0x7a/0x2e0 [hfi1]
[ 1081.027364]  hfi1_ipoib_send_dma_list+0x62/0x270 [hfi1]
[ 1081.032633]  hfi1_ipoib_send+0x112/0x300 [hfi1]
[ 1081.042001]  ipoib_start_xmit+0x2a9/0x2d0 [ib_ipoib]
[ 1081.046978]  dev_hard_start_xmit+0xc4/0x210
--
[ 1081.148347]  __sys_sendmsg+0x59/0xa0

crash> ipoib_txreq 0xffff9cfeba229f00
struct ipoib_txreq {
  txreq = {
    list = {
      next = 0xffff9cfeba229f00,
      prev = 0xffff9cfeba229f00
    },
    descp = 0xffff9cfeba229f40,
    coalesce_buf = 0x0,
    wait = 0xffff9cfea4e69a48,
    complete = 0xffffffffc0fe0760 <hfi1_ipoib_sdma_complete>,
    packet_len = 0x46d,
    tlen = 0x0,
    num_desc = 0x0,
    desc_limit = 0x6,
    next_descq_idx = 0x45c,
    coalesce_idx = 0x0,
    flags = 0x0,
    descs = {{
        qw = {0x8024000120dffb00, 0x4}  # SDMA_DESC0_FIRST_DESC_FLAG (bit 63)
      }, {
        qw = { 0x3800014231b108, 0x4}
      }, {
        qw = { 0x310000e4ee0fcf0, 0x8}
      }, {
        qw = { 0x3000012e9f8000, 0x8}
      }, {
        qw = { 0x59000dfb9d0000, 0x8}
      }, {
        qw = { 0x78000e02e40000, 0x8}
      }}
  },
  sdma_hdr = 0x400300015528b000,  <<< invalid pointer in the tx request structure
  sdma_status = 0x0,              # SDMA_DESC0_LAST_DESC_FLAG (bit 62)
  complete = 0x0,
  priv = 0x0,
  txq = 0xffff9cfea4e69880,
  skb = 0xffff9d099809f400
}

If an SDMA send consists of exactly 6 descriptors and requires dword
padding (in the 7th descriptor), the sdma_txreq descriptor array is not
properly expanded and the packet will overflow into the container
structure. This results in a panic when the send completion runs. The
exact panic varies depending on what elements of the container structure
get corrupted. The fix is to use the correct expression in
_pad_sdma_tx_descs() to test the need to expand the descriptor array.

With this patch the crashes are no longer reproducible and the machine
is stable.

Fixes: fd8958efe877 ("IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors")
Cc: stable(a)vger.kernel.org
Reported-by: Mats Kronberg <kronberg(a)nsc.liu.se>
Tested-by: Mats Kronberg <kronberg(a)nsc.liu.se>
Signed-off-by: Daniel Vacek <neelx(a)redhat.com>
Link: https://lore.kernel.org/r/20240201081009.1109442-1-neelx@redhat.com
Signed-off-by: Leon Romanovsky <leon(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
---
 drivers/infiniband/hw/hfi1/sdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hfi1/sdma.c b/drivers/infiniband/hw/hfi1/sdma.c
index 2dc97de434a5..68a8557e9a7c 100644
--- a/drivers/infiniband/hw/hfi1/sdma.c
+++ b/drivers/infiniband/hw/hfi1/sdma.c
@@ -3200,7 +3200,7 @@ int _pad_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx)
 {
 	int rval = 0;
 
-	if ((unlikely(tx->num_desc + 1 == tx->desc_limit))) {
+	if ((unlikely(tx->num_desc == tx->desc_limit))) {
 		rval = _extend_sdma_tx_descs(dd, tx);
 		if (rval) {
 			__sdma_txclean(dd, tx);
-- 
2.34.1

[OLK-6.6] locking/osq_lock: Avoid false sharing in optimistic_spin_node
by liwei 11 Apr '24

From: Zeng Heng <zengheng4(a)huawei.com>

hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I8MV01

--------------------------------

Using the UnixBench test suite, we clearly find with the perf tool that
osq_lock() causes extremely high overhead in the File Copy items:

  Overhead  Shared Object  Symbol
  94.25%    [kernel]       [k] osq_lock
  0.74%     [kernel]       [k] rwsem_spin_on_owner
  0.32%     [kernel]       [k] filemap_get_read_batch

In response to this, we conducted an analysis and made some gains:

In the prologue of osq_lock(), it sets the `cpu` member of the percpu
struct optimistic_spin_node to the local cpu id; after that, the value
in fact never changes. Based on that, we can regard the `cpu` member as
a constant variable.

Meanwhile, other members of the percpu struct, like next, prev and
locked, are frequently modified by osq_lock() and osq_unlock(), which
are called by rwsem, mutex and so on. However, that invalidates the
cached `cpu` member on other CPUs. Therefore, we can place padding here
and split them into different cache lines to avoid cache misses when the
next CPU is spinning to check another node's `cpu` member via
vcpu_is_preempted().

Here we provide the UnixBench full-core test result as below:

Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
Run the command "./Run -c 80 -i 3" 10 times and take the average.

System Benchmarks Index Values           Without Patch   With Patch     Diff
Dhrystone 2 using register variables         185876.43    185945.41    0.04%
Double-Precision Whetstone                    79637.27     79659.29    0.03%
Execl Throughput                               9909.61     10576.06    6.73%
File Copy 1024 bufsize 2000 maxblocks          1723.01      2086.08   21.07%
File Copy 256 bufsize 500 maxblocks            1150.24      1338.21   16.34%
File Copy 4096 bufsize 8000 maxblocks          3719.19      4011.99    7.87%
Pipe Throughput                               66184.84     66025.25   -0.24%
Pipe-based Context Switching                  30606.18     31074.21    1.53%
Process Creation                               9442.48      9450.77    0.09%
Shell Scripts (1 concurrent)                  44526.52     46548.54    4.54%
Shell Scripts (8 concurrent)                  42903.96     45718.56    6.56%
System Call Overhead                           3645.20      3717.42    1.98%
========
System Benchmarks Index Score                 15126.87     15931.29    5.32%

Signed-off-by: Zeng Heng <zengheng4(a)huawei.com>
Signed-off-by: liwei <liwei728(a)huawei.com>
---
 include/linux/osq_lock.h  | 2 +-
 kernel/locking/osq_lock.c | 8 +++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..deb90ad5f560 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -9,7 +9,7 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+	int cpu ____cacheline_aligned; /* encoded CPU # + 1 value */
 };
 
 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..17618d62343f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	node->locked = 0;
 	node->next = NULL;
-	node->cpu = curr;
+	/*
+	 * After this cpu member is initialized for the first time, it
+	 * would no longer change in fact. That could avoid cache misses
+	 * when spin and access the cpu member by other CPUs.
+	 */
+	if (node->cpu != curr)
+		node->cpu = curr;
 
 	/*
	 * We need both ACQUIRE (pairs with corresponding RELEASE in
-- 
2.25.1
