- Kernel - mailweb.openeuler.org

[PATCH OLK-6.6 0/2] irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support
by Jinjie Ruan 08 Jun '26

08 Jun '26

Fix ALLINT masking logic by decoupling from GIC NMI support. Jinjie Ruan (2): Revert "irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support" irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support arch/arm64/include/asm/nmi.h | 5 +++++ drivers/irqchip/irq-gic-v3.c | 14 ++++++++------ 2 files changed, 13 insertions(+), 6 deletions(-) -- 2.34.1

2 3

[PATCH OLK-6.6 0/2] irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support
by Jinjie Ruan 08 Jun '26

08 Jun '26

Fix ALLINT masking logic by decoupling from GIC NMI support. Jinjie Ruan (2): Revert "irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support" irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support arch/arm64/include/asm/nmi.h | 5 +++++ drivers/irqchip/irq-gic-v3.c | 14 ++++++++------ 2 files changed, 13 insertions(+), 6 deletions(-) -- 2.34.1

2 3

[PATCH OLK-5.10] irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support
by Jinjie Ruan 08 Jun '26

08 Jun '26

hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/9343 -------------------------------- The ARM64 FEAT_NMI implementation relies on two independent hardware validation paths: 1. CPU Feature Check (`system_uses_nmi()`): Validates if the PE supports FEAT_NMI via ID_AA64PFR1_EL1.NMI. If pseudo-NMI is disabled and this bit is set, enables SCTLR_EL1.NMI (map to SCTLR_EL2.NMI in VHE mode) and clears SCTLR_EL1.SPINTMASK. Once active, the hardware automatically sets PSTATE.ALLINT upon taking an exception to EL1/EL2, effectively masking interrupts. `sysreg_clear_set(sctlr_el1, SCTLR_EL1_SPINTMASK, SCTLR_EL1_NMI)` 2. GICv3 Feature Check (gic_data.has_nmi): Validates if the interrupt controller supports Non-maskable Interrupts via GICD_TYPER.NMI. `gic_data.has_nmi = !!(typer & GICD_TYPER_NMI)` And PSTATE.ALLINT is strictly a PE-level state register. Its existence and behavior are entirely independent of whether the underlying GICv3 hardware implementation has its own FEAT_NMI capabilities enabled. Currently, __gic_handle_irq_from_irqson() uses has_v3_3_nmi() to guard the clearing of PSTATE.ALLINT. This creates a destructive dependency where the GIC's NMI status dictates PE register management. static inline bool has_v3_3_nmi(void) { return gic_data.has_nmi && system_uses_nmi(); } If a platform features a PE that supports FEAT_NMI but has a GICv3 that does not support (or has disabled) the NMI feature, the current implementation fails to clear ALLINT during IRQ handling. Because ALLINT is automatically set by the hardware exception entry, it remains permanently set. As a consequence, local_irq_enable() only clears PSTATE.I and F bits, leaving ALLINT untouched. Since ALLINT masks both standard IRQs and NMIs, even if PSTATE.I cleared, the system becomes completely incapable of responding to subsequent NMIs and standard IRQs during downstream processing (such as softirq execution). handle_softirqs() -> local_irq_enable() -> asm volatile("msr daifclr, #3") -> do pending softirq action -> local_irq_disable() -> asm volatile("msr daifset, #3") Fix this by replacing the has_v3_3_nmi() check with system_uses_nmi() when clearing ALLINT. This ensures that PE state management correctly reflects PE capabilities. Additionally, explicitly log whether the GIC supports hardware NMI based on the GICD_TYPER.NMI bit during initialization. Since FEAT_NMI deployment relies on both subsystems, printing this status provides critical diagnostic information for system verification. Fixes: eefea6156921 ("irqchip/gic-v3: Fix hard LOCKUP caused by NMI being masked") Signed-off-by: Jinjie Ruan <ruanjinjie(a)huawei.com> --- arch/arm64/include/asm/daifflags.h | 5 +++++ drivers/irqchip/irq-gic-v3.c | 8 +++++--- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/daifflags.h b/arch/arm64/include/asm/daifflags.h index e4aec9469dcf..f04c5ccbd176 100644 --- a/arch/arm64/include/asm/daifflags.h +++ b/arch/arm64/include/asm/daifflags.h @@ -17,6 +17,7 @@ #define DAIF_ERRCTX (PSR_I_BIT | PSR_A_BIT) #define DAIF_MASK (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT) +#ifdef CONFIG_ARM64_NMI static __always_inline void _allint_clear(void) { asm volatile(__msr_s(SYS_ALLINT_CLR, "xzr")); @@ -26,6 +27,10 @@ static __always_inline void _allint_set(void) { asm volatile(__msr_s(SYS_ALLINT_SET, "xzr")); } +#else +static __always_inline void _allint_clear(void) { } +static __always_inline void _allint_set(void) { } +#endif /* mask/save/unmask/restore all exceptions, including interrupts. */ static inline void local_daif_mask(void) diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 9870ad4cc766..38a429c29fdf 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -821,11 +821,12 @@ static void __gic_handle_irq_from_irqson(struct pt_regs *regs) if (gic_prio_masking_enabled()) { gic_pmr_mask_irqs(); gic_arch_enable_irqs(); - } else if (has_v3_3_nmi()) { -#ifdef CONFIG_ARM64_NMI + } + +#ifdef CONFIG_ARM64 + if (system_uses_nmi()) _allint_clear(); #endif - } if (!is_nmi) __gic_handle_irq(irqnr, regs); @@ -2273,6 +2274,7 @@ static int __init gic_init_bases(void __iomem *dist_base, gic_data.has_nmi = !!(typer & GICD_TYPER_NMI); pr_info("Distributor has %sRange Selector support\n", gic_data.has_rss ? "" : "no "); + pr_info("GICD_TYPER NMI is%s supported.\n", gic_data.has_nmi ? "" : " not"); if (typer & GICD_TYPER_MBIS) { err = mbi_init(handle, gic_data.domain); -- 2.34.1

2 1

[PATCH OLK-6.6 0/2] arm64: Fix ALLINT masking logic by decoupling from GIC NMI support
by Jinjie Ruan 08 Jun '26

08 Jun '26

Fix ALLINT masking logic by decoupling from GIC NMI support Jinjie Ruan (2): Revert "irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support" irqchip/gic-v3: Fix ALLINT masking logic by decoupling from GIC NMI support drivers/irqchip/irq-gic-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- 2.34.1

2 3

[PATCH OLK-5.10] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
by Zhang Yuwei 08 Jun '26

08 Jun '26

From: Jinhui Guo <guojinhui.liam(a)bytedance.com> stable inclusion from stable-v5.10.252 commit 581ce094d9eafb78ec4f9de77bd24b780c151236 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/14677 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 10e60d87813989e20eac1f3eda30b3bae461e7f9 ] Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system. For example, if a VM fails to connect to the PCIe device, "virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd. Call Trace: qi_submit_sync qi_flush_dev_iotlb intel_pasid_tear_down_entry device_block_translation blocking_domain_attach_dev __iommu_attach_device __iommu_device_set_domain __iommu_group_set_domain_internal iommu_detach_group vfio_iommu_type1_detach_group vfio_group_detach_container vfio_group_fops_release __fput Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase. Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap. 1. mm-struct release 2. {attach,release}_dev 3. set/remove PASID 4. dirty-tracking setup The gain in system stability far outweighs the negligible cost of using pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation, especially under GDR high-load conditions. Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") Cc: stable(a)vger.kernel.org Signed-off-by: Jinhui Guo <guojinhui.liam(a)bytedance.com> Link: https://lore.kernel.org/r/20251211035946.2071-3-guojinhui.liam@bytedance.com Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel(a)amd.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Lin Yujun <linyujun809(a)h-partners.com> --- drivers/iommu/intel/pasid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index cc913f00f296..33c85d9e3a45 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -513,7 +513,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu, if (!info || !info->ats_enabled) return; - if (pci_dev_is_disconnected(to_pci_dev(dev))) + if (!pci_device_is_present(to_pci_dev(dev))) return; sid = info->bus << 8 | info->devfn; -- 2.22.0

2 1

[PATCH OLK-5.10] netfilter: flowtable: strictly check for maximum number of actions
by Zhang Changzhong 06 Jun '26

06 Jun '26

From: Pablo Neira Ayuso <pablo(a)netfilter.org> mainline inclusion from mainline-v7.0-rc7 commit 76522fcdbc3a02b568f5d957f7e66fc194abb893 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/14887 CVE: CVE-2026-43329 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- The maximum number of flowtable hardware offload actions in IPv6 is: * ethernet mangling (4 payload actions, 2 for each ethernet address) * SNAT (4 payload actions) * DNAT (4 payload actions) * Double VLAN (4 vlan actions, 2 for popping vlan, and 2 for pushing) for QinQ. * Redirect (1 action) Which makes 17, while the maximum is 16. But act_ct supports for tunnels actions too. Note that payload action operates at 32-bit word level, so mangling an IPv6 address takes 4 payload actions. Update flow_action_entry_next() calls to check for the maximum number of supported actions. While at it, rise the maximum number of actions per flow from 16 to 24 so this works fine with IPv6 setups. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Reported-by: Hyunwoo Kim <imv4bel(a)gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org> Conflicts: net/netfilter/nf_flow_table_offload.c [context conflict because commits 2ed37183abb7 and eeff3000f240 were not merged] Signed-off-by: Zhang Changzhong <zhangchangzhong(a)huawei.com> --- net/netfilter/nf_flow_table_offload.c | 173 +++++++++++++++++++++++----------- 1 file changed, 116 insertions(+), 57 deletions(-) diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c index f6275d9..dc53bf4 100644 --- a/net/netfilter/nf_flow_table_offload.c +++ b/net/netfilter/nf_flow_table_offload.c @@ -13,6 +13,8 @@ #include <net/netfilter/nf_conntrack_core.h> #include <net/netfilter/nf_conntrack_tuple.h> +#define NF_FLOW_RULE_ACTION_MAX 24 + static struct workqueue_struct *nf_flow_offload_wq; struct flow_offload_work { @@ -165,7 +167,12 @@ static void flow_offload_mangle(struct flow_action_entry *entry, static inline struct flow_action_entry * flow_action_entry_next(struct nf_flow_rule *flow_rule) { - int i = flow_rule->rule->action.num_entries++; + int i; + + if (unlikely(flow_rule->rule->action.num_entries >= NF_FLOW_RULE_ACTION_MAX)) + return NULL; + + i = flow_rule->rule->action.num_entries++; return &flow_rule->rule->action.entries[i]; } @@ -182,6 +189,9 @@ static int flow_offload_eth_src(struct net *net, u32 mask, val; u16 val16; + if (!entry0 || !entry1) + return -E2BIG; + dev = dev_get_by_index(net, tuple->iifidx); if (!dev) return -ENOENT; @@ -216,6 +226,9 @@ static int flow_offload_eth_dst(struct net *net, u8 nud_state; u16 val16; + if (!entry0 || !entry1) + return -E2BIG; + dst_cache = flow->tuplehash[dir].tuple.dst_cache; n = dst_neigh_lookup(dst_cache, daddr); if (!n) @@ -246,16 +259,19 @@ static int flow_offload_eth_dst(struct net *net, return 0; } -static void flow_offload_ipv4_snat(struct net *net, - const struct flow_offload *flow, - enum flow_offload_tuple_dir dir, - struct nf_flow_rule *flow_rule) +static int flow_offload_ipv4_snat(struct net *net, + const struct flow_offload *flow, + enum flow_offload_tuple_dir dir, + struct nf_flow_rule *flow_rule) { struct flow_action_entry *entry = flow_action_entry_next(flow_rule); u32 mask = ~htonl(0xffffffff); __be32 addr; u32 offset; + if (!entry) + return -E2BIG; + switch (dir) { case FLOW_OFFLOAD_DIR_ORIGINAL: addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v4.s_addr; @@ -266,23 +282,27 @@ static void flow_offload_ipv4_snat(struct net *net, offset = offsetof(struct iphdr, daddr); break; default: - return; + return -EOPNOTSUPP; } flow_offload_mangle(entry, FLOW_ACT_MANGLE_HDR_TYPE_IP4, offset, &addr, &mask); + return 0; } -static void flow_offload_ipv4_dnat(struct net *net, - const struct flow_offload *flow, - enum flow_offload_tuple_dir dir, - struct nf_flow_rule *flow_rule) +static int flow_offload_ipv4_dnat(struct net *net, + const struct flow_offload *flow, + enum flow_offload_tuple_dir dir, + struct nf_flow_rule *flow_rule) { struct flow_action_entry *entry = flow_action_entry_next(flow_rule); u32 mask = ~htonl(0xffffffff); __be32 addr; u32 offset; + if (!entry) + return -E2BIG; + switch (dir) { case FLOW_OFFLOAD_DIR_ORIGINAL: addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v4.s_addr; @@ -293,14 +313,15 @@ static void flow_offload_ipv4_dnat(struct net *net, offset = offsetof(struct iphdr, saddr); break; default: - return; + return -EOPNOTSUPP; } flow_offload_mangle(entry, FLOW_ACT_MANGLE_HDR_TYPE_IP4, offset, &addr, &mask); + return 0; } -static void flow_offload_ipv6_mangle(struct nf_flow_rule *flow_rule, +static int flow_offload_ipv6_mangle(struct nf_flow_rule *flow_rule, unsigned int offset, const __be32 *addr, const __be32 *mask) { @@ -309,15 +330,20 @@ static void flow_offload_ipv6_mangle(struct nf_flow_rule *flow_rule, for (i = 0; i < sizeof(struct in6_addr) / sizeof(u32); i++) { entry = flow_action_entry_next(flow_rule); + if (!entry) + return -E2BIG; + flow_offload_mangle(entry, FLOW_ACT_MANGLE_HDR_TYPE_IP6, offset + i * sizeof(u32), &addr[i], mask); } + + return 0; } -static void flow_offload_ipv6_snat(struct net *net, - const struct flow_offload *flow, - enum flow_offload_tuple_dir dir, - struct nf_flow_rule *flow_rule) +static int flow_offload_ipv6_snat(struct net *net, + const struct flow_offload *flow, + enum flow_offload_tuple_dir dir, + struct nf_flow_rule *flow_rule) { u32 mask = ~htonl(0xffffffff); const __be32 *addr; @@ -333,16 +359,16 @@ static void flow_offload_ipv6_snat(struct net *net, offset = offsetof(struct ipv6hdr, daddr); break; default: - return; + return -EOPNOTSUPP; } - flow_offload_ipv6_mangle(flow_rule, offset, addr, &mask); + return flow_offload_ipv6_mangle(flow_rule, offset, addr, &mask); } -static void flow_offload_ipv6_dnat(struct net *net, - const struct flow_offload *flow, - enum flow_offload_tuple_dir dir, - struct nf_flow_rule *flow_rule) +static int flow_offload_ipv6_dnat(struct net *net, + const struct flow_offload *flow, + enum flow_offload_tuple_dir dir, + struct nf_flow_rule *flow_rule) { u32 mask = ~htonl(0xffffffff); const __be32 *addr; @@ -358,10 +384,10 @@ static void flow_offload_ipv6_dnat(struct net *net, offset = offsetof(struct ipv6hdr, saddr); break; default: - return; + return -EOPNOTSUPP; } - flow_offload_ipv6_mangle(flow_rule, offset, addr, &mask); + return flow_offload_ipv6_mangle(flow_rule, offset, addr, &mask); } static int flow_offload_l4proto(const struct flow_offload *flow) @@ -383,15 +409,18 @@ static int flow_offload_l4proto(const struct flow_offload *flow) return type; } -static void flow_offload_port_snat(struct net *net, - const struct flow_offload *flow, - enum flow_offload_tuple_dir dir, - struct nf_flow_rule *flow_rule) +static int flow_offload_port_snat(struct net *net, + const struct flow_offload *flow, + enum flow_offload_tuple_dir dir, + struct nf_flow_rule *flow_rule) { struct flow_action_entry *entry = flow_action_entry_next(flow_rule); u32 mask, port; u32 offset; + if (!entry) + return -E2BIG; + switch (dir) { case FLOW_OFFLOAD_DIR_ORIGINAL: port = ntohs(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_port); @@ -406,22 +435,26 @@ static void flow_offload_port_snat(struct net *net, mask = ~htonl(0xffff); break; default: - return; + return -EOPNOTSUPP; } flow_offload_mangle(entry, flow_offload_l4proto(flow), offset, &port, &mask); + return 0; } -static void flow_offload_port_dnat(struct net *net, - const struct flow_offload *flow, - enum flow_offload_tuple_dir dir, - struct nf_flow_rule *flow_rule) +static int flow_offload_port_dnat(struct net *net, + const struct flow_offload *flow, + enum flow_offload_tuple_dir dir, + struct nf_flow_rule *flow_rule) { struct flow_action_entry *entry = flow_action_entry_next(flow_rule); u32 mask, port; u32 offset; + if (!entry) + return -E2BIG; + switch (dir) { case FLOW_OFFLOAD_DIR_ORIGINAL: port = ntohs(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_port); @@ -436,20 +469,24 @@ static void flow_offload_port_dnat(struct net *net, mask = ~htonl(0xffff0000); break; default: - return; + return -EOPNOTSUPP; } flow_offload_mangle(entry, flow_offload_l4proto(flow), offset, &port, &mask); + return 0; } -static void flow_offload_ipv4_checksum(struct net *net, - const struct flow_offload *flow, - struct nf_flow_rule *flow_rule) +static int flow_offload_ipv4_checksum(struct net *net, + const struct flow_offload *flow, + struct nf_flow_rule *flow_rule) { u8 protonum = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l4proto; struct flow_action_entry *entry = flow_action_entry_next(flow_rule); + if (!entry) + return -E2BIG; + entry->id = FLOW_ACTION_CSUM; entry->csum_flags = TCA_CSUM_UPDATE_FLAG_IPV4HDR; @@ -461,22 +498,29 @@ static void flow_offload_ipv4_checksum(struct net *net, entry->csum_flags |= TCA_CSUM_UPDATE_FLAG_UDP; break; } + + return 0; } -static void flow_offload_redirect(const struct flow_offload *flow, +static int flow_offload_redirect(const struct flow_offload *flow, enum flow_offload_tuple_dir dir, struct nf_flow_rule *flow_rule) { struct flow_action_entry *entry = flow_action_entry_next(flow_rule); struct rtable *rt; + if (!entry) + return -E2BIG; + rt = (struct rtable *)flow->tuplehash[dir].tuple.dst_cache; entry->id = FLOW_ACTION_REDIRECT; entry->dev = rt->dst.dev; dev_hold(rt->dst.dev); + + return 0; } -static void flow_offload_encap_tunnel(const struct flow_offload *flow, +static int flow_offload_encap_tunnel(const struct flow_offload *flow, enum flow_offload_tuple_dir dir, struct nf_flow_rule *flow_rule) { @@ -490,13 +534,17 @@ static void flow_offload_encap_tunnel(const struct flow_offload *flow, tun_info = lwt_tun_info(dst->lwtstate); if (tun_info && (tun_info->mode & IP_TUNNEL_INFO_TX)) { entry = flow_action_entry_next(flow_rule); + if (!entry) + return -E2BIG; entry->id = FLOW_ACTION_TUNNEL_ENCAP; entry->tunnel = tun_info; } } + + return 0; } -static void flow_offload_decap_tunnel(const struct flow_offload *flow, +static int flow_offload_decap_tunnel(const struct flow_offload *flow, enum flow_offload_tuple_dir dir, struct nf_flow_rule *flow_rule) { @@ -510,35 +558,44 @@ static void flow_offload_decap_tunnel(const struct flow_offload *flow, tun_info = lwt_tun_info(dst->lwtstate); if (tun_info && (tun_info->mode & IP_TUNNEL_INFO_TX)) { entry = flow_action_entry_next(flow_rule); + if (!entry) + return -E2BIG; entry->id = FLOW_ACTION_TUNNEL_DECAP; } } + + return 0; } int nf_flow_rule_route_ipv4(struct net *net, const struct flow_offload *flow, enum flow_offload_tuple_dir dir, struct nf_flow_rule *flow_rule) { - flow_offload_decap_tunnel(flow, dir, flow_rule); - flow_offload_encap_tunnel(flow, dir, flow_rule); + if (flow_offload_decap_tunnel(flow, dir, flow_rule) < 0 || + flow_offload_encap_tunnel(flow, dir, flow_rule) < 0) + return -1; if (flow_offload_eth_src(net, flow, dir, flow_rule) < 0 || flow_offload_eth_dst(net, flow, dir, flow_rule) < 0) return -1; if (test_bit(NF_FLOW_SNAT, &flow->flags)) { - flow_offload_ipv4_snat(net, flow, dir, flow_rule); - flow_offload_port_snat(net, flow, dir, flow_rule); + if (flow_offload_ipv4_snat(net, flow, dir, flow_rule) < 0 || + flow_offload_port_snat(net, flow, dir, flow_rule) < 0) + return -1; } if (test_bit(NF_FLOW_DNAT, &flow->flags)) { - flow_offload_ipv4_dnat(net, flow, dir, flow_rule); - flow_offload_port_dnat(net, flow, dir, flow_rule); + if (flow_offload_ipv4_dnat(net, flow, dir, flow_rule) < 0 || + flow_offload_port_dnat(net, flow, dir, flow_rule) < 0) + return -1; } if (test_bit(NF_FLOW_SNAT, &flow->flags) || test_bit(NF_FLOW_DNAT, &flow->flags)) - flow_offload_ipv4_checksum(net, flow, flow_rule); + if (flow_offload_ipv4_checksum(net, flow, flow_rule) < 0) + return -1; - flow_offload_redirect(flow, dir, flow_rule); + if (flow_offload_redirect(flow, dir, flow_rule) < 0) + return -1; return 0; } @@ -548,30 +605,32 @@ int nf_flow_rule_route_ipv6(struct net *net, const struct flow_offload *flow, enum flow_offload_tuple_dir dir, struct nf_flow_rule *flow_rule) { - flow_offload_decap_tunnel(flow, dir, flow_rule); - flow_offload_encap_tunnel(flow, dir, flow_rule); + if (flow_offload_decap_tunnel(flow, dir, flow_rule) < 0 || + flow_offload_encap_tunnel(flow, dir, flow_rule) < 0) + return -1; if (flow_offload_eth_src(net, flow, dir, flow_rule) < 0 || flow_offload_eth_dst(net, flow, dir, flow_rule) < 0) return -1; if (test_bit(NF_FLOW_SNAT, &flow->flags)) { - flow_offload_ipv6_snat(net, flow, dir, flow_rule); - flow_offload_port_snat(net, flow, dir, flow_rule); + if (flow_offload_ipv6_snat(net, flow, dir, flow_rule) < 0 || + flow_offload_port_snat(net, flow, dir, flow_rule) < 0) + return -1; } if (test_bit(NF_FLOW_DNAT, &flow->flags)) { - flow_offload_ipv6_dnat(net, flow, dir, flow_rule); - flow_offload_port_dnat(net, flow, dir, flow_rule); + if (flow_offload_ipv6_dnat(net, flow, dir, flow_rule) < 0 || + flow_offload_port_dnat(net, flow, dir, flow_rule) < 0) + return -1; } - flow_offload_redirect(flow, dir, flow_rule); + if (flow_offload_redirect(flow, dir, flow_rule) < 0) + return -1; return 0; } EXPORT_SYMBOL_GPL(nf_flow_rule_route_ipv6); -#define NF_FLOW_RULE_ACTION_MAX 16 - static struct nf_flow_rule * nf_flow_offload_rule_alloc(struct net *net, const struct flow_offload_work *offload, -- 2.9.5

2 1

[PATCH OLK-5.10] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
by Zhang Yuwei 06 Jun '26

06 Jun '26

From: Jinhui Guo <guojinhui.liam(a)bytedance.com> stable inclusion from stable-v5.10.252 commit 581ce094d9eafb78ec4f9de77bd24b780c151236 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15190 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 10e60d87813989e20eac1f3eda30b3bae461e7f9 ] Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system. For example, if a VM fails to connect to the PCIe device, "virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd. Call Trace: qi_submit_sync qi_flush_dev_iotlb intel_pasid_tear_down_entry device_block_translation blocking_domain_attach_dev __iommu_attach_device __iommu_device_set_domain __iommu_group_set_domain_internal iommu_detach_group vfio_iommu_type1_detach_group vfio_group_detach_container vfio_group_fops_release __fput Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase. Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap. 1. mm-struct release 2. {attach,release}_dev 3. set/remove PASID 4. dirty-tracking setup The gain in system stability far outweighs the negligible cost of using pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation, especially under GDR high-load conditions. Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") Cc: stable(a)vger.kernel.org Signed-off-by: Jinhui Guo <guojinhui.liam(a)bytedance.com> Link: https://lore.kernel.org/r/20251211035946.2071-3-guojinhui.liam@bytedance.com Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel(a)amd.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Lin Yujun <linyujun809(a)h-partners.com> --- drivers/iommu/intel/pasid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index cc913f00f296..33c85d9e3a45 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -513,7 +513,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu, if (!info || !info->ats_enabled) return; - if (pci_dev_is_disconnected(to_pci_dev(dev))) + if (!pci_device_is_present(to_pci_dev(dev))) return; sid = info->bus << 8 | info->devfn; -- 2.22.0

2 1

[PATCH OLK-5.10] iommu/vt-d: Clear Present bit before tearing down context entry
by Zhang Yuwei 06 Jun '26

06 Jun '26

From: Lu Baolu <baolu.lu(a)linux.intel.com> mainline inclusion from mainline-v7.0-rc1 commit c1e4f1dccbe9d7656d1c6872ebeadb5992d0aaa2 category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15190 CVE: CVE-2026-45944 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- When tearing down a context entry, the current implementation zeros the entire 128-bit entry using multiple 64-bit writes. This creates a window where the hardware can fetch a "torn" entry — where some fields are already zeroed while the 'Present' bit is still set — leading to unpredictable behavior or spurious faults. While x86 provides strong write ordering, the compiler may reorder writes to the two 64-bit halves of the context entry. Even without compiler reordering, the hardware fetch is not guaranteed to be atomic with respect to multiple CPU writes. Align with the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake: 1. Clear only the 'Present' (P) bit of the context entry first to signal the transition of ownership from hardware to software. 2. Use dma_wmb() to ensure the cleared bit is visible to the IOMMU. 3. Perform the required cache and context-cache invalidation to ensure hardware no longer has cached references to the entry. 4. Fully zero out the entry only after the invalidation is complete. Also, add a dma_wmb() to context_set_present() to ensure the entry is fully initialized before the 'Present' bit becomes visible. Fixes: ba39592764ed ("Intel IOMMU: Intel IOMMU driver") Reported-by: Dmytro Maluka <dmaluka(a)chromium.org> Closes: https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/ Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com> Reviewed-by: Dmytro Maluka <dmaluka(a)chromium.org> Reviewed-by: Samiullah Khawaja <skhawaja(a)google.com> Reviewed-by: Kevin Tian <kevin.tian(a)intel.com> Link: https://lore.kernel.org/r/20260120061816.2132558-3-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel(a)amd.com> Conflicts: drivers/iommu/intel/pasid.c drivers/iommu/intel/iommu.c drivers/iommu/intel/iommu.h [contxt conflict] Signed-off-by: Zhang Yuwei <zhangyuwei20(a)huawei.com> --- drivers/iommu/intel/iommu.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index a68d1bfc7444..245fef1026c3 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -210,7 +210,26 @@ static phys_addr_t root_entry_uctp(struct root_entry *re) static inline void context_set_present(struct context_entry *context) { - context->lo |= 1; + u64 val; + + dma_wmb(); + val = READ_ONCE(context->lo) | 1; + WRITE_ONCE(context->lo, val); +} + +/* + * Clear the Present (P) bit (bit 0) of a context table entry. This initiates + * the transition of the entry's ownership from hardware to software. The + * caller is responsible for fulfilling the invalidation handshake recommended + * by the VT-d spec, Section 6.5.3.3 (Guidance to Software for Invalidations). + */ +static inline void context_clear_present(struct context_entry *context) +{ + u64 val; + + val = READ_ONCE(context->lo) & GENMASK_ULL(63, 1); + WRITE_ONCE(context->lo, val); + dma_wmb(); } static inline void context_set_fault_enable(struct context_entry *context) @@ -2617,7 +2636,7 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8 did_old = context_domain_id(context); } - context_clear_entry(context); + context_clear_present(context); __iommu_flush_cache(iommu, context, sizeof(*context)); spin_unlock_irqrestore(&iommu->lock, flags); iommu->flush.flush_context(iommu, @@ -2636,6 +2655,8 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8 DMA_TLB_DSI_FLUSH); __iommu_flush_dev_iotlb(info, 0, MAX_AGAW_PFN_WIDTH); + context_clear_entry(context); + __iommu_flush_cache(iommu, context, sizeof(*context)); } static inline void unlink_domain_info(struct device_domain_info *info) -- 2.22.0

2 1

[PATCH OLK-5.10] iommu/vt-d: Clear Present bit before tearing down PASID entry
by Zhang Yuwei 06 Jun '26

06 Jun '26

From: Lu Baolu <baolu.lu(a)linux.intel.com> mainline inclusion from mainline-v7.0-rc1 commit 75ed00055c059dedc47b5daaaa2f8a7a019138ff category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15137 CVE: CVE-2026-45894 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64 bytes). When tearing down an entry, the current implementation zeros the entire 64-byte structure immediately using multiple 64-bit writes. Since the IOMMU hardware may fetch these 64 bytes using multiple internal transactions (e.g., four 128-bit bursts), updating or zeroing the entire entry while it is active (P=1) risks a "torn" read. If a hardware fetch occurs simultaneously with the CPU zeroing the entry, the hardware could observe an inconsistent state, leading to unpredictable behavior or spurious faults. Follow the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake: 1. Clear only the 'Present' (P) bit of the PASID entry. 2. Use a dma_wmb() to ensure the cleared bit is visible to hardware before proceeding. 3. Execute the required invalidation sequence (PASID cache, IOTLB, and Device-TLB flush) to ensure the hardware has released all cached references. 4. Only after the flushes are complete, zero out the remaining fields of the PASID entry. Also, add a dma_wmb() in pasid_set_present() to ensure that all other fields of the PASID entry are visible to the hardware before the Present bit is set. Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables") Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com> Reviewed-by: Dmytro Maluka <dmaluka(a)chromium.org> Reviewed-by: Samiullah Khawaja <skhawaja(a)google.com> Reviewed-by: Kevin Tian <kevin.tian(a)intel.com> Link: https://lore.kernel.org/r/20260120061816.2132558-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel(a)amd.com> Conflicts: drivers/iommu/intel/pasid.c drivers/iommu/intel/pasid.h [commit 1f5e307ca16c0 not merged] Signed-off-by: Zhang Yuwei <zhangyuwei20(a)huawei.com> --- drivers/iommu/intel/pasid.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 9d909e99b7e8..cc913f00f296 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -411,9 +411,23 @@ static inline void pasid_set_sre(struct pasid_entry *pe) */ static inline void pasid_set_present(struct pasid_entry *pe) { + dma_wmb(); pasid_set_bits(&pe->val[0], 1 << 0, 1); } +/* + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry. + * This initiates the transition of the entry's ownership from hardware + * to software. The caller is responsible for fulfilling the invalidation + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to + * Software for Invalidations). + */ +static inline void pasid_clear_present(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[0], 1 << 0, 0); + dma_wmb(); +} + /* * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID * entry. @@ -531,7 +545,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, did = pasid_get_domain_id(pte); pgtt = pasid_pte_get_pgtt(pte); - intel_pasid_clear_entry(dev, pasid, fault_ignore); + pasid_clear_present(pte); if (!ecap_coherent(iommu->ecap)) clflush_cache_range(pte, sizeof(*pte)); @@ -546,6 +560,10 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, /* Device IOTLB doesn't need to be flushed in caching mode. */ if (!cap_caching_mode(iommu->cap)) devtlb_invalidation_with_pasid(iommu, dev, pasid); + + intel_pasid_clear_entry(dev, pasid, fault_ignore); + if (!ecap_coherent(iommu->ecap)) + clflush_cache_range(pte, sizeof(*pte)); } static void pasid_flush_caches(struct intel_iommu *iommu, -- 2.22.0

2 1

[PATCH OLK-6.6] iommu/vt-d: Clear Present bit before tearing down PASID entry
by Zhang Yuwei 06 Jun '26

06 Jun '26

From: Lu Baolu <baolu.lu(a)linux.intel.com> mainline inclusion from mainline-v7.0-rc1 commit 75ed00055c059dedc47b5daaaa2f8a7a019138ff category: bugfix bugzilla: https://atomgit.com/src-openeuler/kernel/issues/15137 CVE: CVE-2026-45894 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64 bytes). When tearing down an entry, the current implementation zeros the entire 64-byte structure immediately using multiple 64-bit writes. Since the IOMMU hardware may fetch these 64 bytes using multiple internal transactions (e.g., four 128-bit bursts), updating or zeroing the entire entry while it is active (P=1) risks a "torn" read. If a hardware fetch occurs simultaneously with the CPU zeroing the entry, the hardware could observe an inconsistent state, leading to unpredictable behavior or spurious faults. Follow the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake: 1. Clear only the 'Present' (P) bit of the PASID entry. 2. Use a dma_wmb() to ensure the cleared bit is visible to hardware before proceeding. 3. Execute the required invalidation sequence (PASID cache, IOTLB, and Device-TLB flush) to ensure the hardware has released all cached references. 4. Only after the flushes are complete, zero out the remaining fields of the PASID entry. Also, add a dma_wmb() in pasid_set_present() to ensure that all other fields of the PASID entry are visible to the hardware before the Present bit is set. Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables") Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com> Reviewed-by: Dmytro Maluka <dmaluka(a)chromium.org> Reviewed-by: Samiullah Khawaja <skhawaja(a)google.com> Reviewed-by: Kevin Tian <kevin.tian(a)intel.com> Link: https://lore.kernel.org/r/20260120061816.2132558-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel(a)amd.com> Conflicts: drivers/iommu/intel/pasid.c [commit 1f5e307ca16c0 not merged] Signed-off-by: Zhang Yuwei <zhangyuwei20(a)huawei.com> --- drivers/iommu/intel/pasid.c | 6 +++++- drivers/iommu/intel/pasid.h | 14 ++++++++++++++ 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 6ef582bfaea5..89a2c7a17489 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -248,7 +248,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, did = pasid_get_domain_id(pte); pgtt = pasid_pte_get_pgtt(pte); - intel_pasid_clear_entry(dev, pasid, fault_ignore); + pasid_clear_present(pte); spin_unlock(&iommu->lock); if (!ecap_coherent(iommu->ecap)) @@ -264,6 +264,10 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, /* Device IOTLB doesn't need to be flushed in caching mode. */ if (!cap_caching_mode(iommu->cap)) devtlb_invalidation_with_pasid(iommu, dev, pasid); + + intel_pasid_clear_entry(dev, pasid, fault_ignore); + if (!ecap_coherent(iommu->ecap)) + clflush_cache_range(pte, sizeof(*pte)); } /* diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index 42fda97fd851..979fcfea6b2c 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -235,9 +235,23 @@ static inline void pasid_set_wpe(struct pasid_entry *pe) */ static inline void pasid_set_present(struct pasid_entry *pe) { + dma_wmb(); pasid_set_bits(&pe->val[0], 1 << 0, 1); } +/* + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry. + * This initiates the transition of the entry's ownership from hardware + * to software. The caller is responsible for fulfilling the invalidation + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to + * Software for Invalidations). + */ +static inline void pasid_clear_present(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[0], 1 << 0, 0); + dma_wmb(); +} + /* * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID * entry. -- 2.22.0

2 1