- Kernel - mailweb.openeuler.org

[PATCH openEuler-1.0-LTS] ipvlan: Dont Use skb->sk in ipvlan_process_v{4,6}_outbound
by Dong Chenchen 01 Jul '24

01 Jul '24

From: Yue Haibing <yuehaibing(a)huawei.com> mainline inclusion from mainline-v6.10-rc2 commit b3dc6e8003b500861fa307e9a3400c52e78e4d3a category: bugfix bugzilla: 11847, https://gitee.com/openeuler/kernel/issues/IA6GKA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- Raw packet from PF_PACKET socket ontop of an IPv6-backed ipvlan device will hit WARN_ON_ONCE() in sk_mc_loop() through sch_direct_xmit() path. WARNING: CPU: 2 PID: 0 at net/core/sock.c:775 sk_mc_loop+0x2d/0x70 Modules linked in: sch_netem ipvlan rfkill cirrus drm_shmem_helper sg drm_kms_helper CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 6.9.0+ #279 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:sk_mc_loop+0x2d/0x70 Code: fa 0f 1f 44 00 00 65 0f b7 15 f7 96 a3 4f 31 c0 66 85 d2 75 26 48 85 ff 74 1c RSP: 0018:ffffa9584015cd78 EFLAGS: 00010212 RAX: 0000000000000011 RBX: ffff91e585793e00 RCX: 0000000002c6a001 RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff91e589c0f000 RBP: ffff91e5855bd100 R08: 0000000000000000 R09: 3d00545216f43d00 R10: ffff91e584fdcc50 R11: 00000060dd8616f4 R12: ffff91e58132d000 R13: ffff91e584fdcc68 R14: ffff91e5869ce800 R15: ffff91e589c0f000 FS: 0000000000000000(0000) GS:ffff91e898100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f788f7c44c0 CR3: 0000000008e1a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? __warn (kernel/panic.c:693) ? sk_mc_loop (net/core/sock.c:760) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:239) ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? sk_mc_loop (net/core/sock.c:760) ip6_finish_output2 (net/ipv6/ip6_output.c:83 (discriminator 1)) ? nf_hook_slow (net/netfilter/core.c:626) ip6_finish_output (net/ipv6/ip6_output.c:222) ? __pfx_ip6_finish_output (net/ipv6/ip6_output.c:215) ipvlan_xmit_mode_l3 (drivers/net/ipvlan/ipvlan_core.c:602) ipvlan ipvlan_start_xmit (drivers/net/ipvlan/ipvlan_main.c:226) ipvlan dev_hard_start_xmit (net/core/dev.c:3594) sch_direct_xmit (net/sched/sch_generic.c:343) __qdisc_run (net/sched/sch_generic.c:416) net_tx_action (net/core/dev.c:5286) handle_softirqs (kernel/softirq.c:555) __irq_exit_rcu (kernel/softirq.c:589) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1043) The warning triggers as this: packet_sendmsg packet_snd //skb->sk is packet sk __dev_queue_xmit __dev_xmit_skb //q->enqueue is not NULL __qdisc_run sch_direct_xmit dev_hard_start_xmit ipvlan_start_xmit ipvlan_xmit_mode_l3 //l3 mode ipvlan_process_outbound //vepa flag ipvlan_process_v6_outbound ip6_local_out __ip6_finish_output ip6_finish_output2 //multicast packet sk_mc_loop //sk->sk_family is AF_PACKET Call ip{6}_local_out() with NULL sk in ipvlan as other tunnels to fix this. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Suggested-by: Eric Dumazet <edumazet(a)google.com> Signed-off-by: Yue Haibing <yuehaibing(a)huawei.com> Reviewed-by: Eric Dumazet <edumazet(a)google.com> Link: https://lore.kernel.org/r/20240529095633.613103-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni(a)redhat.com> Conflicts: drivers/net/ipvlan/ipvlan_core.c [commit ff672b9ffeb3 replace dev->stats with DEV_STATS_INC(), which not merged lead to conflicts] Signed-off-by: Dong Chenchen <dongchenchen2(a)huawei.com> --- drivers/net/ipvlan/ipvlan_core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c index b25c60af02f6..0a82752b28b5 100644 --- a/drivers/net/ipvlan/ipvlan_core.c +++ b/drivers/net/ipvlan/ipvlan_core.c @@ -446,7 +446,7 @@ static noinline_for_stack int ipvlan_process_v4_outbound(struct sk_buff *skb) memset(IPCB(skb), 0, sizeof(*IPCB(skb))); - err = ip_local_out(net, skb->sk, skb); + err = ip_local_out(net, NULL, skb); if (unlikely(net_xmit_eval(err))) dev->stats.tx_errors++; else @@ -501,7 +501,7 @@ static int ipvlan_process_v6_outbound(struct sk_buff *skb) memset(IP6CB(skb), 0, sizeof(*IP6CB(skb))); - err = ip6_local_out(dev_net(dev), skb->sk, skb); + err = ip6_local_out(dev_net(dev), NULL, skb); if (unlikely(net_xmit_eval(err))) dev->stats.tx_errors++; else -- 2.25.1

2 1

[PATCH openEuler-1.0-LTS] net/mlx5e: Avoid field-overflowing memcpy()
by Yang Yingliang 01 Jul '24

01 Jul '24

From: Kees Cook <keescook(a)chromium.org> mainline inclusion from mainline-v5.17-rc3 commit ad5185735f7dab342fdd0dd41044da4c9ccfef67 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA72I4 CVE: CVE-2022-48744 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… --------------------------- In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use flexible arrays instead of zero-element arrays (which look like they are always overflowing) and split the cross-field memcpy() into two halves that can be appropriately bounds-checked by the compiler. We were doing: #define ETH_HLEN 14 #define VLAN_HLEN 4 ... #define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN) ... struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi); ... struct mlx5_wqe_eth_seg *eseg = &wqe->eth; struct mlx5_wqe_data_seg *dseg = wqe->data; ... memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE); target is wqe->eth.inline_hdr.start (which the compiler sees as being 2 bytes in size), but copying 18, intending to write across start (really vlan_tci, 2 bytes). The remaining 16 bytes get written into wqe->data[0], covering byte_count (4 bytes), lkey (4 bytes), and addr (8 bytes). struct mlx5e_tx_wqe { struct mlx5_wqe_ctrl_seg ctrl; /* 0 16 */ struct mlx5_wqe_eth_seg eth; /* 16 16 */ struct mlx5_wqe_data_seg data[]; /* 32 0 */ /* size: 32, cachelines: 1, members: 3 */ /* last cacheline: 32 bytes */ }; struct mlx5_wqe_eth_seg { u8 swp_outer_l4_offset; /* 0 1 */ u8 swp_outer_l3_offset; /* 1 1 */ u8 swp_inner_l4_offset; /* 2 1 */ u8 swp_inner_l3_offset; /* 3 1 */ u8 cs_flags; /* 4 1 */ u8 swp_flags; /* 5 1 */ __be16 mss; /* 6 2 */ __be32 flow_table_metadata; /* 8 4 */ union { struct { __be16 sz; /* 12 2 */ u8 start[2]; /* 14 2 */ } inline_hdr; /* 12 4 */ struct { __be16 type; /* 12 2 */ __be16 vlan_tci; /* 14 2 */ } insert; /* 12 4 */ __be32 trailer; /* 12 4 */ }; /* 12 4 */ /* size: 16, cachelines: 1, members: 9 */ /* last cacheline: 16 bytes */ }; struct mlx5_wqe_data_seg { __be32 byte_count; /* 0 4 */ __be32 lkey; /* 4 4 */ __be64 addr; /* 8 8 */ /* size: 16, cachelines: 1, members: 3 */ /* last cacheline: 16 bytes */ }; So, split the memcpy() so the compiler can reason about the buffer sizes. "pahole" shows no size nor member offset changes to struct mlx5e_tx_wqe nor struct mlx5e_umr_wqe. "objdump -d" shows no meaningful object code changes (i.e. only source line number induced differences and optimizations). Fixes: b5503b994ed5 ("net/mlx5e: XDP TX forwarding support") Signed-off-by: Kees Cook <keescook(a)chromium.org> Signed-off-by: Saeed Mahameed <saeedm(a)nvidia.com> Conflicts: drivers/net/ethernet/mellanox/mlx5/core/en.h drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c [yyl: adjust context] Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com> --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 6c06b5c3337f..051a05d818e1 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -184,7 +184,7 @@ static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev) struct mlx5e_tx_wqe { struct mlx5_wqe_ctrl_seg ctrl; struct mlx5_wqe_eth_seg eth; - struct mlx5_wqe_data_seg data[0]; + struct mlx5_wqe_data_seg data[]; }; struct mlx5e_rx_wqe_ll { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c index 599114ab7821..12f3787b3048 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c @@ -159,8 +159,10 @@ bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi) /* copy the inline part if required */ if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) { - memcpy(eseg->inline_hdr.start, xdpf->data, MLX5E_XDP_MIN_INLINE); + memcpy(eseg->inline_hdr.start, xdpf->data, sizeof(eseg->inline_hdr.start)); eseg->inline_hdr.sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE); + memcpy(dseg, xdpf->data + sizeof(eseg->inline_hdr.start), + MLX5E_XDP_MIN_INLINE - sizeof(eseg->inline_hdr.start)); dma_len -= MLX5E_XDP_MIN_INLINE; dma_addr += MLX5E_XDP_MIN_INLINE; dseg++; -- 2.25.1

2 1

[PATCH OLK-6.6 v2] eventfs: Fix a possible null pointer dereference in eventfs_find_events()
by Yipeng Zou 01 Jul '24

01 Jul '24

From: Hao Ge <gehao(a)kylinos.cn> stable inclusion from stable-v6.10.123 commit d4e9a968738bf66d3bb852dd5588d4c7afd6d7f4 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA8ADN CVE: CVE-2024-39470 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- In function eventfs_find_events,there is a potential null pointer that may be caused by calling update_events_attr which will perform some operations on the members of the ei struct when ei is NULL. Hence,When ei->is_freed is set,return NULL directly. Link: https://lore.kernel.org/linux-trace-kernel/20240513053338.63017-1-hao.ge@li… Cc: stable(a)vger.kernel.org Fixes: 8186fff7ab64 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Hao Ge <gehao(a)kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com> --- fs/tracefs/event_inode.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 56d1741fe041..4b6f73e201a4 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -295,10 +295,9 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) * If the ei is being freed, the ownership of the children * doesn't matter. */ - if (ei->is_freed) { - ei = NULL; - break; - } + if (ei->is_freed) + return NULL; + // Walk upwards until you find the events inode } while (!ei->is_events); -- 2.34.1

2 1

[PATCH OLK-6.6] eventfs: Fix a possible null pointer dereference in eventfs_find_events()
by Yipeng Zou 01 Jul '24

01 Jul '24

From: Hao Ge <gehao(a)kylinos.cn> stable inclusion from stable-v6.10-rc1 commit d4e9a968738bf66d3bb852dd5588d4c7afd6d7f4 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA8ADN CVE: CVE-2024-39470 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- In function eventfs_find_events,there is a potential null pointer that may be caused by calling update_events_attr which will perform some operations on the members of the ei struct when ei is NULL. Hence,When ei->is_freed is set,return NULL directly. Link: https://lore.kernel.org/linux-trace-kernel/20240513053338.63017-1-hao.ge@li… Cc: stable(a)vger.kernel.org Fixes: 8186fff7ab64 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Hao Ge <gehao(a)kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com> --- fs/tracefs/event_inode.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 56d1741fe041..4b6f73e201a4 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -295,10 +295,9 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) * If the ei is being freed, the ownership of the children * doesn't matter. */ - if (ei->is_freed) { - ei = NULL; - break; - } + if (ei->is_freed) + return NULL; + // Walk upwards until you find the events inode } while (!ei->is_events); -- 2.34.1

2 1

[PATCH OLK-6.6] smart_grid: introducing rebuild_affinity_domain
by Yipeng Zou 01 Jul '24

01 Jul '24

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OJK9 CVE: NA ---------------------------------------- Here are many scenarios we tested with smart_grid, we found that the first domain level is key to the benchmark. The reason is that there are many things such as interrupt affinity, memory affinity factor that can have a big impact on the test. Before this patch, the first domain level is unchangeable after creation. This patch introduce the 'cpu.rebuild_affinity_domain' to dynamically reconfigure all domain levels. Typical use cases: echo $cpu_id > cpu.rebuild_affinity_domain The cpu_id means which cpu we want to set first level. If we set cpu_id = 34, we can see some change like: ---------------- ----------------- | level 0 (0-31) | | level 0 (32-63) | ---------------- ----------------- v v ------------------- ------------------ | level 1 (0-63) | | level 1 (0-63) | ------------------- ------------------ v --> v --------------------- -------------------- | level 2 (0-95) | | level 2 (0-95) | --------------------- -------------------- v v ------------------------ ---------------------- | level 3 (0-127) | | level 3 (0-127) | ------------------------ ---------------------- There are number of constraints on the rebuild feature: 1. Only rebuild domain while auto mode disabled. (cpu.dynamic_affinity_mode == 1) 2. Only rebuild on active and housekeeping cpu. (Offline and isolate CPUs are forbidden) 3. This file is write only. Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com> --- kernel/sched/core.c | 13 +++++++++++++ kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 57 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 700dbd19eb44..b9b3f536959d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -11555,6 +11555,15 @@ static int cpu_affinity_stat_show(struct seq_file *sf, void *v) return 0; } + +static int cpu_rebuild_affinity_domain_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, + u64 cpu) +{ + struct task_group *tg = css_tg(css); + + return tg_rebuild_affinity_domains(cpu, tg->auto_affinity); +} #endif /* CONFIG_QOS_SCHED_SMART_GRID */ #ifdef CONFIG_QOS_SCHED @@ -11735,6 +11744,10 @@ static struct cftype cpu_legacy_files[] = { .name = "affinity_stat", .seq_show = cpu_affinity_stat_show, }, + { + .name = "rebuild_affinity_domain", + .write_u64 = cpu_rebuild_affinity_domain_u64, + }, #endif #ifdef CONFIG_CFS_BANDWIDTH { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d2efd40fb784..07a7c6bf7b1e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7252,6 +7252,49 @@ static void destroy_auto_affinity(struct task_group *tg) kfree(tg->auto_affinity); tg->auto_affinity = NULL; } + +int tg_rebuild_affinity_domains(int cpu, struct auto_affinity *auto_affi) +{ + int ret = 0; + int level = 0; + struct sched_domain *tmp; + + if (unlikely(!auto_affi)) + return -EPERM; + + mutex_lock(&smart_grid_used_mutex); + raw_spin_lock_irq(&auto_affi->lock); + /* Only build domain while auto mode disabled */ + if (auto_affi->mode) { + ret = -EPERM; + goto unlock_all; + } + + /* Only build on active and housekeeping cpu */ + if (!cpu_active(cpu) || !housekeeping_cpu(cpu, HK_TYPE_DOMAIN)) { + ret = -EINVAL; + goto unlock_all; + } + + for_each_domain(cpu, tmp) { + if (!auto_affi->ad.domains[level] || !auto_affi->ad.domains_orig[level]) + continue; + + /* rebuild domain[,_orig] and reset schedstat counter */ + cpumask_copy(auto_affi->ad.domains[level], sched_domain_span(tmp)); + cpumask_copy(auto_affi->ad.domains_orig[level], auto_affi->ad.domains[level]); + __schedstat_set(auto_affi->ad.stay_cnt[level], 0); + level++; + } + + /* trigger to update smart grid zone */ + sched_grid_zone_update(false); + +unlock_all: + raw_spin_unlock_irq(&auto_affi->lock); + mutex_unlock(&smart_grid_used_mutex); + return ret; +} #else static void __maybe_unused destroy_auto_affinity(struct task_group *tg) {} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9de2bac6444f..79b93b0a2ef0 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -551,6 +551,7 @@ extern void start_auto_affinity(struct auto_affinity *auto_affi); extern void stop_auto_affinity(struct auto_affinity *auto_affi); extern int init_auto_affinity(struct task_group *tg); extern void tg_update_affinity_domains(int cpu, int online); +extern int tg_rebuild_affinity_domains(int cpu, struct auto_affinity *auto_affi); #else static inline int init_auto_affinity(struct task_group *tg) -- 2.34.1

2 1

[PATCH OLK-6.6] net/mlx5: Discard command completions in internal error
by Zhengchao Shao 01 Jul '24

01 Jul '24

From: Akiva Goldberger <agoldberger(a)nvidia.com> stable inclusion from stable-v6.6.33 commit 1337ec94bc5a9eed250e33f5f5c89a28a6bfabdb category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA6SA3 CVE: CVE-2024-38555 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… --------------------------- [ Upstream commit db9b31aa9bc56ff0d15b78f7e827d61c4a096e40 ] Fix use after free when FW completion arrives while device is in internal error state. Avoid calling completion handler in this case, since the device will flush the command interface and trigger all completions manually. Kernel log: ------------[ cut here ]------------ refcount_t: underflow; use-after-free. ... RIP: 0010:refcount_warn_saturate+0xd8/0xe0 ... Call Trace: <IRQ> ? __warn+0x79/0x120 ? refcount_warn_saturate+0xd8/0xe0 ? report_bug+0x17c/0x190 ? handle_bug+0x3c/0x60 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0xd8/0xe0 cmd_ent_put+0x13b/0x160 [mlx5_core] mlx5_cmd_comp_handler+0x5f9/0x670 [mlx5_core] cmd_comp_notifier+0x1f/0x30 [mlx5_core] notifier_call_chain+0x35/0xb0 atomic_notifier_call_chain+0x16/0x20 mlx5_eq_async_int+0xf6/0x290 [mlx5_core] notifier_call_chain+0x35/0xb0 atomic_notifier_call_chain+0x16/0x20 irq_int_handler+0x19/0x30 [mlx5_core] __handle_irq_event_percpu+0x4b/0x160 handle_irq_event+0x2e/0x80 handle_edge_irq+0x98/0x230 __common_interrupt+0x3b/0xa0 common_interrupt+0x7b/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 Fixes: 51d138c2610a ("net/mlx5: Fix health error state handling") Signed-off-by: Akiva Goldberger <agoldberger(a)nvidia.com> Reviewed-by: Moshe Shemesh <moshe(a)nvidia.com> Signed-off-by: Tariq Toukan <tariqt(a)nvidia.com> Link: https://lore.kernel.org/r/20240509112951.590184-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba(a)kernel.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Zhengchao Shao <shaozhengchao(a)huawei.com> --- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c index 55efb932ab2c..a54638914dae 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -1609,6 +1609,9 @@ static int cmd_comp_notifier(struct notifier_block *nb, dev = container_of(cmd, struct mlx5_core_dev, cmd); eqe = data; + if (dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR) + return NOTIFY_DONE; + mlx5_cmd_comp_handler(dev, be32_to_cpu(eqe->data.cmd.vector), false); return NOTIFY_OK; -- 2.34.1

2 1

[PATCH OLK-5.10 V2] btrfs: fix crash on racing fsync and size-extending write into prealloc
by Zizhi Wo 01 Jul '24

01 Jul '24

From: Omar Sandoval <osandov(a)fb.com> stable inclusion from stable-v6.1.94 commit 1ff2bd566fbcefcb892be85c493bdb92b911c428 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA8AFW CVE: CVE-2024-37354 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- commit 9d274c19a71b3a276949933859610721a453946b upstream. We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) #7 btrfs_sync_file (fs/btrfs/file.c:1933:8) #8 vfs_fsync_range (fs/sync.c:188:9) #9 vfs_fsync (fs/sync.c:202:9) #10 do_fsync (fs/sync.c:212:9) #11 __do_sys_fdatasync (fs/sync.c:225:9) #12 __se_sys_fdatasync (fs/sync.c:223:1) #13 __x64_sys_fdatasync (fs/sync.c:223:1) #14 do_syscall_x64 (arch/x86/entry/common.c:52:14) #15 do_syscall_64 (arch/x86/entry/common.c:83:7) #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable(a)vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: fs/btrfs/tree-log.c [Simple context adaptation is performed] Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com> --- fs/btrfs/tree-log.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 10a0913ffb49..73e507aa9a5a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4415,22 +4415,27 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans, path->slots[0]++; continue; } - if (!dropped_extents) { - /* - * Avoid logging extent items logged in past fsync calls - * and leading to duplicate keys in the log tree. - */ + /* + * Avoid overlapping items in the log tree. The first time we + * get here, get rid of everything from a past fsync. After + * that, if the current extent starts before the end of the last + * extent we copied, truncate the last one. This can happen if + * an ordered extent completion modifies the subvolume tree + * while btrfs_next_leaf() has the tree unlocked. + */ + if (!dropped_extents || key.offset < truncate_offset) { do { ret = btrfs_truncate_inode_items(trans, root->log_root, &inode->vfs_inode, - truncate_offset, + min(key.offset, truncate_offset), BTRFS_EXTENT_DATA_KEY); } while (ret == -EAGAIN); if (ret) goto out; dropped_extents = true; } + truncate_offset = btrfs_file_extent_end(path); if (ins_nr == 0) start_slot = slot; ins_nr++; -- 2.39.2

2 1

[PATCH openEuler-22.03-LTS-SP1 V2] btrfs: fix crash on racing fsync and size-extending write into prealloc
by Zizhi Wo 01 Jul '24

01 Jul '24

From: Omar Sandoval <osandov(a)fb.com> stable inclusion from stable-v6.1.94 commit 1ff2bd566fbcefcb892be85c493bdb92b911c428 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA8AFW CVE: CVE-2024-37354 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- commit 9d274c19a71b3a276949933859610721a453946b upstream. We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) #7 btrfs_sync_file (fs/btrfs/file.c:1933:8) #8 vfs_fsync_range (fs/sync.c:188:9) #9 vfs_fsync (fs/sync.c:202:9) #10 do_fsync (fs/sync.c:212:9) #11 __do_sys_fdatasync (fs/sync.c:225:9) #12 __se_sys_fdatasync (fs/sync.c:223:1) #13 __x64_sys_fdatasync (fs/sync.c:223:1) #14 do_syscall_x64 (arch/x86/entry/common.c:52:14) #15 do_syscall_64 (arch/x86/entry/common.c:83:7) #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable(a)vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: fs/btrfs/tree-log.c [Simple context adaptation is performed] Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com> --- fs/btrfs/tree-log.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 9a8dc16673b4..f7d1baff579f 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4415,22 +4415,27 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans, path->slots[0]++; continue; } - if (!dropped_extents) { - /* - * Avoid logging extent items logged in past fsync calls - * and leading to duplicate keys in the log tree. - */ + /* + * Avoid overlapping items in the log tree. The first time we + * get here, get rid of everything from a past fsync. After + * that, if the current extent starts before the end of the last + * extent we copied, truncate the last one. This can happen if + * an ordered extent completion modifies the subvolume tree + * while btrfs_next_leaf() has the tree unlocked. + */ + if (!dropped_extents || key.offset < truncate_offset) { do { ret = btrfs_truncate_inode_items(trans, root->log_root, &inode->vfs_inode, - truncate_offset, + min(key.offset, truncate_offset), BTRFS_EXTENT_DATA_KEY); } while (ret == -EAGAIN); if (ret) goto out; dropped_extents = true; } + truncate_offset = btrfs_file_extent_end(path); if (ins_nr == 0) start_slot = slot; ins_nr++; -- 2.39.2

2 1

[PACTH OLK-5.10 V2] btrfs: fix crash on racing fsync and size-extending write into prealloc
by Zizhi Wo 01 Jul '24

01 Jul '24

From: Omar Sandoval <osandov(a)fb.com> stable inclusion from stable-v6.1.94 commit 1ff2bd566fbcefcb892be85c493bdb92b911c428 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IA8AFW CVE: CVE-2024-37354 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- commit 9d274c19a71b3a276949933859610721a453946b upstream. We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) #7 btrfs_sync_file (fs/btrfs/file.c:1933:8) #8 vfs_fsync_range (fs/sync.c:188:9) #9 vfs_fsync (fs/sync.c:202:9) #10 do_fsync (fs/sync.c:212:9) #11 __do_sys_fdatasync (fs/sync.c:225:9) #12 __se_sys_fdatasync (fs/sync.c:223:1) #13 __x64_sys_fdatasync (fs/sync.c:223:1) #14 do_syscall_x64 (arch/x86/entry/common.c:52:14) #15 do_syscall_64 (arch/x86/entry/common.c:83:7) #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable(a)vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana(a)suse.com> Signed-off-by: Omar Sandoval <osandov(a)fb.com> Signed-off-by: David Sterba <dsterba(a)suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Conflicts: fs/btrfs/tree-log.c [Simple context adaptation is performed] Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com> --- fs/btrfs/tree-log.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 10a0913ffb49..73e507aa9a5a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4415,22 +4415,27 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans, path->slots[0]++; continue; } - if (!dropped_extents) { - /* - * Avoid logging extent items logged in past fsync calls - * and leading to duplicate keys in the log tree. - */ + /* + * Avoid overlapping items in the log tree. The first time we + * get here, get rid of everything from a past fsync. After + * that, if the current extent starts before the end of the last + * extent we copied, truncate the last one. This can happen if + * an ordered extent completion modifies the subvolume tree + * while btrfs_next_leaf() has the tree unlocked. + */ + if (!dropped_extents || key.offset < truncate_offset) { do { ret = btrfs_truncate_inode_items(trans, root->log_root, &inode->vfs_inode, - truncate_offset, + min(key.offset, truncate_offset), BTRFS_EXTENT_DATA_KEY); } while (ret == -EAGAIN); if (ret) goto out; dropped_extents = true; } + truncate_offset = btrfs_file_extent_end(path); if (ins_nr == 0) start_slot = slot; ins_nr++; -- 2.39.2

1 0

[openeuler:OLK-6.6 2170/10500] net/ipv4/sysctl_net_ipv4.c:481:50: sparse: sparse: incorrect type in argument 3 (different address spaces)
by kernel test robot 01 Jul '24

01 Jul '24

tree: https://gitee.com/openeuler/kernel.git OLK-6.6 head: 660a8fc17efc471c239b7a5a2de6bd9519fe9db3 commit: c31dcf6c5ab41f07da38d3f270987807735ec93e [2170/10500] tcp_comp: implement tcp compression config: loongarch-randconfig-r113-20240701 compiler: loongarch64-linux-gcc (GCC) 13.2.0 reproduce: If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp(a)intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202407010557.qDFEOeJ2-lkp@intel.com/ sparse warnings: (new ones prefixed by >>) >> net/ipv4/sysctl_net_ipv4.c:481:50: sparse: sparse: incorrect type in argument 3 (different address spaces) @@ expected void * @@ got void [noderef] __user *buffer @@ net/ipv4/sysctl_net_ipv4.c:481:50: sparse: expected void * net/ipv4/sysctl_net_ipv4.c:481:50: sparse: got void [noderef] __user *buffer >> net/ipv4/sysctl_net_ipv4.c:620:35: sparse: sparse: incorrect type in initializer (incompatible argument 3 (different address spaces)) @@ expected int ( [usertype] *proc_handler )( ... ) @@ got int ( * )( ... ) @@ net/ipv4/sysctl_net_ipv4.c:620:35: sparse: expected int ( [usertype] *proc_handler )( ... ) net/ipv4/sysctl_net_ipv4.c:620:35: sparse: got int ( * )( ... ) net/ipv4/sysctl_net_ipv4.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/slab.h): include/linux/page-flags.h:242:46: sparse: sparse: self-comparison always evaluates to false vim +481 net/ipv4/sysctl_net_ipv4.c 471 472 #if IS_ENABLED(CONFIG_TCP_COMP) 473 static int proc_tcp_compression_ports(struct ctl_table *table, int write, 474 void __user *buffer, size_t *lenp, 475 loff_t *ppos) 476 { 477 unsigned long *bitmap = *(unsigned long **)table->data; 478 unsigned long bitmap_len = table->maxlen; 479 int ret; 480 > 481 ret = proc_do_large_bitmap(table, write, buffer, lenp, ppos); 482 if (write && ret == 0) { 483 if (bitmap_empty(bitmap, bitmap_len)) { 484 if (static_key_enabled(&tcp_have_comp)) 485 static_branch_disable(&tcp_have_comp); 486 } else { 487 if (!static_key_enabled(&tcp_have_comp)) 488 static_branch_enable(&tcp_have_comp); 489 } 490 } 491 492 return ret; 493 } 494 #endif 495 496 static struct ctl_table ipv4_table[] = { 497 { 498 .procname = "tcp_max_orphans", 499 .data = &sysctl_tcp_max_orphans, 500 .maxlen = sizeof(int), 501 .mode = 0644, 502 .proc_handler = proc_dointvec 503 }, 504 { 505 .procname = "inet_peer_threshold", 506 .data = &inet_peer_threshold, 507 .maxlen = sizeof(int), 508 .mode = 0644, 509 .proc_handler = proc_dointvec 510 }, 511 { 512 .procname = "inet_peer_minttl", 513 .data = &inet_peer_minttl, 514 .maxlen = sizeof(int), 515 .mode = 0644, 516 .proc_handler = proc_dointvec_jiffies, 517 }, 518 { 519 .procname = "inet_peer_maxttl", 520 .data = &inet_peer_maxttl, 521 .maxlen = sizeof(int), 522 .mode = 0644, 523 .proc_handler = proc_dointvec_jiffies, 524 }, 525 { 526 .procname = "tcp_mem", 527 .maxlen = sizeof(sysctl_tcp_mem), 528 .data = &sysctl_tcp_mem, 529 .mode = 0644, 530 .proc_handler = proc_doulongvec_minmax, 531 }, 532 { 533 .procname = "tcp_low_latency", 534 .data = &sysctl_tcp_low_latency, 535 .maxlen = sizeof(int), 536 .mode = 0644, 537 .proc_handler = proc_dointvec 538 }, 539 #ifdef CONFIG_NETLABEL 540 { 541 .procname = "cipso_cache_enable", 542 .data = &cipso_v4_cache_enabled, 543 .maxlen = sizeof(int), 544 .mode = 0644, 545 .proc_handler = proc_dointvec, 546 }, 547 { 548 .procname = "cipso_cache_bucket_size", 549 .data = &cipso_v4_cache_bucketsize, 550 .maxlen = sizeof(int), 551 .mode = 0644, 552 .proc_handler = proc_dointvec, 553 }, 554 { 555 .procname = "cipso_rbm_optfmt", 556 .data = &cipso_v4_rbm_optfmt, 557 .maxlen = sizeof(int), 558 .mode = 0644, 559 .proc_handler = proc_dointvec, 560 }, 561 { 562 .procname = "cipso_rbm_strictvalid", 563 .data = &cipso_v4_rbm_strictvalid, 564 .maxlen = sizeof(int), 565 .mode = 0644, 566 .proc_handler = proc_dointvec, 567 }, 568 #endif /* CONFIG_NETLABEL */ 569 { 570 .procname = "tcp_available_ulp", 571 .maxlen = TCP_ULP_BUF_MAX, 572 .mode = 0444, 573 .proc_handler = proc_tcp_available_ulp, 574 }, 575 { 576 .procname = "icmp_msgs_per_sec", 577 .data = &sysctl_icmp_msgs_per_sec, 578 .maxlen = sizeof(int), 579 .mode = 0644, 580 .proc_handler = proc_dointvec_minmax, 581 .extra1 = SYSCTL_ZERO, 582 }, 583 { 584 .procname = "icmp_msgs_burst", 585 .data = &sysctl_icmp_msgs_burst, 586 .maxlen = sizeof(int), 587 .mode = 0644, 588 .proc_handler = proc_dointvec_minmax, 589 .extra1 = SYSCTL_ZERO, 590 }, 591 { 592 .procname = "udp_mem", 593 .data = &sysctl_udp_mem, 594 .maxlen = sizeof(sysctl_udp_mem), 595 .mode = 0644, 596 .proc_handler = proc_doulongvec_minmax, 597 }, 598 { 599 .procname = "fib_sync_mem", 600 .data = &sysctl_fib_sync_mem, 601 .maxlen = sizeof(sysctl_fib_sync_mem), 602 .mode = 0644, 603 .proc_handler = proc_douintvec_minmax, 604 .extra1 = &sysctl_fib_sync_mem_min, 605 .extra2 = &sysctl_fib_sync_mem_max, 606 }, 607 { 608 .procname = "local_port_allocation", 609 .data = &sysctl_local_port_allocation, 610 .maxlen = sizeof(int), 611 .mode = 0644, 612 .proc_handler = proc_dointvec, 613 }, 614 #if IS_ENABLED(CONFIG_TCP_COMP) 615 { 616 .procname = "tcp_compression_ports", 617 .data = &sysctl_tcp_compression_ports, 618 .maxlen = 65536, 619 .mode = 0644, > 620 .proc_handler = proc_tcp_compression_ports, 621 }, 622 { 623 .procname = "tcp_compression_local", 624 .data = &sysctl_tcp_compression_local, 625 .maxlen = sizeof(int), 626 .mode = 0644, 627 .proc_handler = proc_dointvec_minmax, 628 .extra1 = SYSCTL_ZERO, 629 .extra2 = SYSCTL_ONE, 630 }, 631 #endif 632 { } 633 }; 634 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki

1 0