From: Matthias May <matthias.may@westermo.com>
stable inclusion
from stable-v5.10.138
commit 38b83883ce4e4efe8ff0a727192219cac2668d42
category: bugfix
bugzilla: 188217
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
commit ca2bb69514a8bc7f83914122f0d596371352416c upstream.
According to Guillaume Nault, RT_TOS() should never be used for IPv6.
Quote: RT_TOS() is an old macro used to interpret IPv4 TOS as described in the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4 code, although, given the current state of the code, most of the existing calls have no consequence.
But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS" field to be interpreted the RFC 1349 way. There's no historical compatibility to worry about.
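As a minimal standalone illustration (not part of the patch; the RT_TOS()/IPTOS_TOS_MASK definitions mirror the uapi headers, the traffic-class value is a made-up example), the macro silently drops everything outside the RFC 1349 TOS bits:

	#include <stdio.h>
	#include <stdint.h>

	#define IPTOS_TOS_MASK	0x1e			/* RFC 1349 TOS bits */
	#define RT_TOS(tos)	((tos) & IPTOS_TOS_MASK)

	int main(void)
	{
		uint8_t prio = 0xb8;	/* IPv6 traffic class: DSCP EF (46) << 2 */

		/* 0xb8 -> 0x18: the upper DSCP bits and the ECN bits are lost */
		printf("traffic class 0x%02x -> RT_TOS 0x%02x\n", prio, RT_TOS(prio));
		return 0;
	}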
Fixes: 3a56f86f1be6 ("geneve: handle ipv6 priority like ipv4 tos")
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Matthias May <matthias.may@westermo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 38b83883ce4e4efe8ff0a727192219cac2668d42)
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 drivers/net/geneve.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 5ddb2db..ba9947d 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -850,8 +850,7 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 		use_cache = false;
 	}
 
-	fl6->flowlabel = ip6_make_flowinfo(RT_TOS(prio),
-					   info->key.label);
+	fl6->flowlabel = ip6_make_flowinfo(prio, info->key.label);
 	dst_cache = (struct dst_cache *)&info->dst_cache;
 	if (use_cache) {
 		dst = dst_cache_get_ip6(dst_cache, &fl6->saddr);
From: Matthias May <matthias.may@westermo.com>
stable inclusion
from stable-v5.10.138
commit 2c746ec91de76e095e4f18367dbb5103a6ed09e1
category: bugfix
bugzilla: 188217
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
commit b4ab94d6adaa5cf842b68bd28f4b50bc774496bd upstream.
The current code retrieves the TOS field after the lookup on the IPv4 routing table. The routing process currently only allows routing based on the original 3 TOS bits, not on the full 6 DSCP bits, so the retrieved TOS is cut down to those 3 bits. However, for inheriting purposes the full 6 bits should be used.
Extract the full 6 bits before the route lookup and use that instead of the cut-off 3 TOS bits.
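For context, a simplified model of the inheritance step (assumption: this mirrors INET_ECN_encapsulate() from include/net/inet_ecn.h with its CE special case omitted). The DSCP part of the outer header comes entirely from the tos argument, so handing over the RT_TOS()-masked value truncates the inherited DSCP:

	#include <stdio.h>
	#include <stdint.h>

	#define INET_ECN_MASK	0x03	/* low two bits: ECN */

	/* simplified: outer keeps its DSCP, inner contributes only ECN */
	static uint8_t ecn_encapsulate(uint8_t outer, uint8_t inner)
	{
		return (outer & ~INET_ECN_MASK) | (inner & INET_ECN_MASK);
	}

	int main(void)
	{
		uint8_t inner = 0xb9;	/* inner packet: DSCP EF, ECT(1) */

		printf("full tos 0xb8 -> outer 0x%02x\n", ecn_encapsulate(0xb8, inner));
		printf("RT_TOS   0x18 -> outer 0x%02x\n", ecn_encapsulate(0x18, inner));
		return 0;
	}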
Fixes: e305ac6cf5a1 ("geneve: Add support to collect tunnel metadata.")
Signed-off-by: Matthias May <matthias.may@westermo.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20220805190006.8078-1-matthias.may@westermo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2c746ec91de76e095e4f18367dbb5103a6ed09e1)
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 drivers/net/geneve.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index ba9947d..081939c 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -772,7 +772,8 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
 				       struct geneve_sock *gs4,
 				       struct flowi4 *fl4,
 				       const struct ip_tunnel_info *info,
-				       __be16 dport, __be16 sport)
+				       __be16 dport, __be16 sport,
+				       __u8 *full_tos)
 {
 	bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
 	struct geneve_dev *geneve = netdev_priv(dev);
@@ -797,6 +798,8 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
 		use_cache = false;
 	}
 	fl4->flowi4_tos = RT_TOS(tos);
+	if (full_tos)
+		*full_tos = tos;
 
 	dst_cache = (struct dst_cache *)&info->dst_cache;
 	if (use_cache) {
@@ -884,6 +887,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	const struct ip_tunnel_key *key = &info->key;
 	struct rtable *rt;
 	struct flowi4 fl4;
+	__u8 full_tos;
 	__u8 tos, ttl;
 	__be16 df = 0;
 	__be16 sport;
@@ -894,7 +898,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
 	rt = geneve_get_v4_rt(skb, dev, gs4, &fl4, info,
-			      geneve->cfg.info.key.tp_dst, sport);
+			      geneve->cfg.info.key.tp_dst, sport, &full_tos);
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
@@ -938,7 +942,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 
 		df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 	} else {
-		tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
+		tos = ip_tunnel_ecn_encap(full_tos, ip_hdr(skb), skb);
 		if (geneve->cfg.ttl_inherit)
 			ttl = ip_tunnel_get_ttl(ip_hdr(skb), skb);
 		else
@@ -1120,7 +1124,7 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 					  1, USHRT_MAX, true);
 
 		rt = geneve_get_v4_rt(skb, dev, gs4, &fl4, info,
-				      geneve->cfg.info.key.tp_dst, sport);
+				      geneve->cfg.info.key.tp_dst, sport, NULL);
 		if (IS_ERR(rt))
 			return PTR_ERR(rt);
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
stable inclusion
from stable-v5.10.140
commit d29d7108e19e1c471e398bf85028999c0d0e4eae
category: bugfix
bugzilla: 188217
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
[ Upstream commit 296f13ff3854535009a185aaf8e3603266d39d94 ]
With the upcoming introduction of batching to the XSK data path, it will be best performance-wise to have the ring descriptor count aligned to a power of 2.
Check that the ring sizes of the queues to which the user is going to attach the XSK socket fulfill the condition above. For the Tx side, although the check is done against the Tx queue while the socket will in the end be attached to the XDP queue, this is fine, since XDP queues get their ring->count setting from the Tx queues.
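For reference, the check relies on the kernel's is_power_of_2() helper; a standalone sketch of the same test (the helper's definition matches include/linux/log2.h):

	#include <stdbool.h>
	#include <stdio.h>

	/* a power of 2 is a non-zero value with exactly one bit set */
	static bool is_power_of_2(unsigned long n)
	{
		return n != 0 && (n & (n - 1)) == 0;
	}

	int main(void)
	{
		printf("512 -> %d, 96 -> %d\n", is_power_of_2(512), is_power_of_2(96));
		return 0;
	}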
Suggested-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-3-maciej.fijalkowski@intel....
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit d29d7108e19e1c471e398bf85028999c0d0e4eae)
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 drivers/net/ethernet/intel/ice/ice_xsk.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 5733526..4bb6295 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -371,6 +371,13 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 	bool if_running, pool_present = !!pool;
 	int ret = 0, pool_failure = 0;
 
+	if (!is_power_of_2(vsi->rx_rings[qid]->count) ||
+	    !is_power_of_2(vsi->tx_rings[qid]->count)) {
+		netdev_err(vsi->netdev, "Please align ring sizes to power of 2\n");
+		pool_failure = -EINVAL;
+		goto failure;
+	}
+
 	if_running = netif_running(vsi->netdev) && ice_is_xdp_ena_vsi(vsi);
 
 	if (if_running) {
@@ -393,6 +400,7 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 		netdev_err(vsi->netdev, "ice_qp_ena error = %d\n", ret);
 	}
 
+failure:
 	if (pool_failure) {
 		netdev_err(vsi->netdev, "Could not %sable buffer pool, error = %d\n",
 			   pool_present ? "en" : "dis", pool_failure);
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
stable inclusion
from stable-v5.10.140
commit 1bfdcde723d8ceb2d73291b0415767e7c1cc1d8a
category: bugfix
bugzilla: 188217
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
[ Upstream commit 5a42f112d367bb4700a8a41f5c12724fde6bfbb9 ]
Fix the following scenario:
1. ethtool -L $IFACE rx 8 tx 96
2. xdpsock -q 10 -t -z
The above refers to a case where the user would like to attach an XSK socket in txonly mode at a queue id that does not have a corresponding Rx queue. At this point, ice's XSK logic is tightly bound to act on a "queue pair", i.e. both the Tx and Rx queues at a given queue id are disabled/enabled and both get the XSK pool assigned, which is broken for the presented queue configuration. This results in the splat included at the bottom, which is basically an OOB access to the Rx ring array.
To fix this, allow using only ids that are in scope of the "combined" queues reported by ethtool. The logic should eventually be rewritten to allow such configurations, but that would amount to a complete rewrite of the control path, so let us go with this temporary fix.
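A hypothetical userspace model of the failure mode and of the added guard (illustrative values taken from the reproducer above; this is not driver code):

	#include <stdio.h>

	struct ring { int count; };

	int main(void)
	{
		struct ring *rx_rings[8] = { 0 };	/* "ethtool -L ... rx 8" */
		int num_rxq = 8;
		int qid = 10;				/* "xdpsock -q 10" */

		/* the fix: reject the id before any ring array access */
		if (qid >= num_rxq) {
			fprintf(stderr, "qid %d not in scope of combined queues\n", qid);
			return 1;
		}
		/* without the guard this would read past the array and then
		 * dereference garbage, as in the splat below */
		return rx_rings[qid]->count;
	}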
[420160.558008] BUG: kernel NULL pointer dereference, address: 0000000000000082
[420160.566359] #PF: supervisor read access in kernel mode
[420160.572657] #PF: error_code(0x0000) - not-present page
[420160.579002] PGD 0 P4D 0
[420160.582756] Oops: 0000 [#1] PREEMPT SMP NOPTI
[420160.588396] CPU: 10 PID: 21232 Comm: xdpsock Tainted: G OE 5.19.0-rc7+ #10
[420160.597893] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[420160.609894] RIP: 0010:ice_xsk_pool_setup+0x44/0x7d0 [ice]
[420160.616968] Code: f3 48 83 ec 40 48 8b 4f 20 48 8b 3f 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 48 8d 04 ed 00 00 00 00 48 01 c1 48 8b 11 <0f> b7 92 82 00 00 00 48 85 d2 0f 84 2d 75 00 00 48 8d 72 ff 48 85
[420160.639421] RSP: 0018:ffffc9002d2afd48 EFLAGS: 00010282
[420160.646650] RAX: 0000000000000050 RBX: ffff88811d8bdd00 RCX: ffff888112c14ff8
[420160.655893] RDX: 0000000000000000 RSI: ffff88811d8bdd00 RDI: ffff888109861000
[420160.665166] RBP: 000000000000000a R08: 000000000000000a R09: 0000000000000000
[420160.674493] R10: 000000000000889f R11: 0000000000000000 R12: 000000000000000a
[420160.683833] R13: 000000000000000a R14: 0000000000000000 R15: ffff888117611828
[420160.693211] FS:  00007fa869fc1f80(0000) GS:ffff8897e0880000(0000) knlGS:0000000000000000
[420160.703645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[420160.711783] CR2: 0000000000000082 CR3: 00000001d076c001 CR4: 00000000007706e0
[420160.721399] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[420160.731045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[420160.740707] PKRU: 55555554
[420160.745960] Call Trace:
[420160.750962]  <TASK>
[420160.755597]  ? kmalloc_large_node+0x79/0x90
[420160.762703]  ? __kmalloc_node+0x3f5/0x4b0
[420160.769341]  xp_assign_dev+0xfd/0x210
[420160.775661]  ? shmem_file_read_iter+0x29a/0x420
[420160.782896]  xsk_bind+0x152/0x490
[420160.788943]  __sys_bind+0xd0/0x100
[420160.795097]  ? exit_to_user_mode_prepare+0x20/0x120
[420160.802801]  __x64_sys_bind+0x16/0x20
[420160.809298]  do_syscall_64+0x38/0x90
[420160.815741]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[420160.823731] RIP: 0033:0x7fa86a0dd2fb
[420160.830264] Code: c3 66 0f 1f 44 00 00 48 8b 15 69 8b 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 31 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d 8b 0c 00 f7 d8 64 89 01 48
[420160.855410] RSP: 002b:00007ffc1146f618 EFLAGS: 00000246 ORIG_RAX: 0000000000000031
[420160.866366] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa86a0dd2fb
[420160.876957] RDX: 0000000000000010 RSI: 00007ffc1146f680 RDI: 0000000000000003
[420160.887604] RBP: 000055d7113a0520 R08: 00007fa868fb8000 R09: 0000000080000000
[420160.898293] R10: 0000000000008001 R11: 0000000000000246 R12: 000055d7113a04e0
[420160.909038] R13: 000055d7113a0320 R14: 000000000000000a R15: 0000000000000000
[420160.919817]  </TASK>
[420160.925659] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mei_me coretemp ioatdma mei ipmi_si wmi ipmi_msghandler acpi_pad acpi_power_meter ip_tables x_tables autofs4 ixgbe i40e crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ahci mdio dca libahci lpc_ich [last unloaded: ice]
[420160.977576] CR2: 0000000000000082
[420160.985037] ---[ end trace 0000000000000000 ]---
[420161.097724] RIP: 0010:ice_xsk_pool_setup+0x44/0x7d0 [ice]
[420161.107341] Code: f3 48 83 ec 40 48 8b 4f 20 48 8b 3f 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 48 8d 04 ed 00 00 00 00 48 01 c1 48 8b 11 <0f> b7 92 82 00 00 00 48 85 d2 0f 84 2d 75 00 00 48 8d 72 ff 48 85
[420161.134741] RSP: 0018:ffffc9002d2afd48 EFLAGS: 00010282
[420161.144274] RAX: 0000000000000050 RBX: ffff88811d8bdd00 RCX: ffff888112c14ff8
[420161.155690] RDX: 0000000000000000 RSI: ffff88811d8bdd00 RDI: ffff888109861000
[420161.168088] RBP: 000000000000000a R08: 000000000000000a R09: 0000000000000000
[420161.179295] R10: 000000000000889f R11: 0000000000000000 R12: 000000000000000a
[420161.190420] R13: 000000000000000a R14: 0000000000000000 R15: ffff888117611828
[420161.201505] FS:  00007fa869fc1f80(0000) GS:ffff8897e0880000(0000) knlGS:0000000000000000
[420161.213628] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[420161.223413] CR2: 0000000000000082 CR3: 00000001d076c001 CR4: 00000000007706e0
[420161.234653] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[420161.245893] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[420161.257052] PKRU: 55555554
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 1bfdcde723d8ceb2d73291b0415767e7c1cc1d8a)
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 drivers/net/ethernet/intel/ice/ice_xsk.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 4bb6295..59963b90 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -371,6 +371,12 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 	bool if_running, pool_present = !!pool;
 	int ret = 0, pool_failure = 0;
 
+	if (qid >= vsi->num_rxq || qid >= vsi->num_txq) {
+		netdev_err(vsi->netdev, "Please use queue id in scope of combined queues count\n");
+		pool_failure = -EINVAL;
+		goto failure;
+	}
+
 	if (!is_power_of_2(vsi->rx_rings[qid]->count) ||
 	    !is_power_of_2(vsi->tx_rings[qid]->count)) {
 		netdev_err(vsi->netdev, "Please align ring sizes to power of 2\n");
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
stable inclusion
from stable-v5.10.142
commit d71a1c9fce184718d1b3a51a9e8a6e31cbbb45ce
category: bugfix
bugzilla: 188217
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
commit 278d3ba61563ceed3cb248383ced19e14ec7bc1f upstream.
On 32bit-UP u64_stats_fetch_begin() disables only preemption. If the reader is in preemptible context and the writer side (u64_stats_update_begin*()) runs in an interrupt context (IRQ or softirq) then the writer can update the stats during the read operation. This update remains undetected.
Use u64_stats_fetch_begin_irq() to ensure the stats fetch on 32bit-UP are not interrupted by a writer. 32bit-SMP remains unaffected by this change.
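For illustration, the resulting reader pattern looks like the following sketch (hypothetical foo_stats structure; the same loop shape recurs throughout the diff below):

	#include <linux/types.h>
	#include <linux/u64_stats_sync.h>

	struct foo_stats {
		struct u64_stats_sync syncp;
		u64 packets;
		u64 bytes;
	};

	/* retry until the snapshot was not torn by a concurrent writer;
	 * the _irq variant also masks interrupts on 32bit-UP, so an
	 * IRQ/softirq writer cannot interleave with the fetch */
	static void foo_fetch(const struct foo_stats *s, u64 *packets, u64 *bytes)
	{
		unsigned int start;

		do {
			start = u64_stats_fetch_begin_irq(&s->syncp);
			*packets = s->packets;
			*bytes = s->bytes;
		} while (u64_stats_fetch_retry_irq(&s->syncp, start));
	}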
Cc: "David S. Miller" davem@davemloft.net Cc: Catherine Sullivan csully@google.com Cc: David Awogbemila awogbemila@google.com Cc: Dimitris Michailidis dmichail@fungible.com Cc: Eric Dumazet edumazet@google.com Cc: Hans Ulli Kroll ulli.kroll@googlemail.com Cc: Jakub Kicinski kuba@kernel.org Cc: Jeroen de Borst jeroendb@google.com Cc: Johannes Berg johannes@sipsolutions.net Cc: Linus Walleij linus.walleij@linaro.org Cc: Paolo Abeni pabeni@redhat.com Cc: Simon Horman simon.horman@corigine.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Cc: oss-drivers@corigine.com Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Reviewed-by: Simon Horman simon.horman@corigine.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit d71a1c9fce184718d1b3a51a9e8a6e31cbbb45ce) Signed-off-by: Wang Yufen wangyufen@huawei.com
Conflicts:
	drivers/net/ethernet/huawei/hinic/hinic_rx.c
	drivers/net/ethernet/huawei/hinic/hinic_tx.c
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 drivers/net/ethernet/cortina/gemini.c            | 24 +++++++++++-----------
 drivers/net/ethernet/google/gve/gve_ethtool.c    | 16 +++++++--------
 drivers/net/ethernet/google/gve/gve_main.c       | 12 +++++------
 drivers/net/ethernet/huawei/hinic/hinic_rx.c     |  4 ++--
 drivers/net/ethernet/huawei/hinic/hinic_tx.c     |  4 ++--
 .../net/ethernet/netronome/nfp/nfp_net_common.c  |  8 ++++----
 .../net/ethernet/netronome/nfp/nfp_net_ethtool.c |  8 ++++----
 drivers/net/netdevsim/netdev.c                   |  4 ++--
 net/mac80211/sta_info.c                          |  8 ++++----
 net/mpls/af_mpls.c                               |  4 ++--
 10 files changed, 46 insertions(+), 46 deletions(-)
diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 3685878..b22ea40 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -1920,7 +1920,7 @@ static void gmac_get_stats64(struct net_device *netdev,
 
 	/* Racing with RX NAPI */
 	do {
-		start = u64_stats_fetch_begin(&port->rx_stats_syncp);
+		start = u64_stats_fetch_begin_irq(&port->rx_stats_syncp);
 
 		stats->rx_packets = port->stats.rx_packets;
 		stats->rx_bytes = port->stats.rx_bytes;
@@ -1932,11 +1932,11 @@ static void gmac_get_stats64(struct net_device *netdev,
 		stats->rx_crc_errors = port->stats.rx_crc_errors;
 		stats->rx_frame_errors = port->stats.rx_frame_errors;
 
-	} while (u64_stats_fetch_retry(&port->rx_stats_syncp, start));
+	} while (u64_stats_fetch_retry_irq(&port->rx_stats_syncp, start));
 
 	/* Racing with MIB and TX completion interrupts */
 	do {
-		start = u64_stats_fetch_begin(&port->ir_stats_syncp);
+		start = u64_stats_fetch_begin_irq(&port->ir_stats_syncp);
 
 		stats->tx_errors = port->stats.tx_errors;
 		stats->tx_packets = port->stats.tx_packets;
@@ -1946,15 +1946,15 @@ static void gmac_get_stats64(struct net_device *netdev,
 		stats->rx_missed_errors = port->stats.rx_missed_errors;
 		stats->rx_fifo_errors = port->stats.rx_fifo_errors;
 
-	} while (u64_stats_fetch_retry(&port->ir_stats_syncp, start));
+	} while (u64_stats_fetch_retry_irq(&port->ir_stats_syncp, start));
 
 	/* Racing with hard_start_xmit */
 	do {
-		start = u64_stats_fetch_begin(&port->tx_stats_syncp);
+		start = u64_stats_fetch_begin_irq(&port->tx_stats_syncp);
 
 		stats->tx_dropped = port->stats.tx_dropped;
 
-	} while (u64_stats_fetch_retry(&port->tx_stats_syncp, start));
+	} while (u64_stats_fetch_retry_irq(&port->tx_stats_syncp, start));
 
 	stats->rx_dropped += stats->rx_missed_errors;
 }
@@ -2032,18 +2032,18 @@ static void gmac_get_ethtool_stats(struct net_device *netdev,
 	/* Racing with MIB interrupt */
 	do {
 		p = values;
-		start = u64_stats_fetch_begin(&port->ir_stats_syncp);
+		start = u64_stats_fetch_begin_irq(&port->ir_stats_syncp);
 
 		for (i = 0; i < RX_STATS_NUM; i++)
 			*p++ = port->hw_stats[i];
 
-	} while (u64_stats_fetch_retry(&port->ir_stats_syncp, start));
+	} while (u64_stats_fetch_retry_irq(&port->ir_stats_syncp, start));
 	values = p;
 
 	/* Racing with RX NAPI */
 	do {
 		p = values;
-		start = u64_stats_fetch_begin(&port->rx_stats_syncp);
+		start = u64_stats_fetch_begin_irq(&port->rx_stats_syncp);
 
 		for (i = 0; i < RX_STATUS_NUM; i++)
 			*p++ = port->rx_stats[i];
@@ -2051,13 +2051,13 @@ static void gmac_get_ethtool_stats(struct net_device *netdev,
 			*p++ = port->rx_csum_stats[i];
 		*p++ = port->rx_napi_exits;
 
-	} while (u64_stats_fetch_retry(&port->rx_stats_syncp, start));
+	} while (u64_stats_fetch_retry_irq(&port->rx_stats_syncp, start));
 	values = p;
 
 	/* Racing with TX start_xmit */
 	do {
 		p = values;
-		start = u64_stats_fetch_begin(&port->tx_stats_syncp);
+		start = u64_stats_fetch_begin_irq(&port->tx_stats_syncp);
 
 		for (i = 0; i < TX_MAX_FRAGS; i++) {
 			*values++ = port->tx_frag_stats[i];
@@ -2066,7 +2066,7 @@ static void gmac_get_ethtool_stats(struct net_device *netdev,
 		*values++ = port->tx_frags_linearized;
 		*values++ = port->tx_hw_csummed;
 
-	} while (u64_stats_fetch_retry(&port->tx_stats_syncp, start));
+	} while (u64_stats_fetch_retry_irq(&port->tx_stats_syncp, start));
 }
 
 static int gmac_get_ksettings(struct net_device *netdev,
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index 66f9b37..80a8c0c 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -172,14 +172,14 @@ static int gve_get_sset_count(struct net_device *netdev, int sset)
 			struct gve_rx_ring *rx = &priv->rx[ring];
 
 			start =
-				u64_stats_fetch_begin(&priv->rx[ring].statss);
+				u64_stats_fetch_begin_irq(&priv->rx[ring].statss);
 			tmp_rx_pkts = rx->rpackets;
 			tmp_rx_bytes = rx->rbytes;
 			tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
 			tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
 			tmp_rx_desc_err_dropped_pkt = rx->rx_desc_err_dropped_pkt;
-		} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+		} while (u64_stats_fetch_retry_irq(&priv->rx[ring].statss,
 					       start));
 		rx_pkts += tmp_rx_pkts;
 		rx_bytes += tmp_rx_bytes;
@@ -193,10 +193,10 @@ static int gve_get_sset_count(struct net_device *netdev, int sset)
 	if (priv->tx) {
 		do {
 			start =
-				u64_stats_fetch_begin(&priv->tx[ring].statss);
+				u64_stats_fetch_begin_irq(&priv->tx[ring].statss);
 			tmp_tx_pkts = priv->tx[ring].pkt_done;
 			tmp_tx_bytes = priv->tx[ring].bytes_done;
-		} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+		} while (u64_stats_fetch_retry_irq(&priv->tx[ring].statss,
 					       start));
 		tx_pkts += tmp_tx_pkts;
 		tx_bytes += tmp_tx_bytes;
@@ -254,13 +254,13 @@ static int gve_get_sset_count(struct net_device *netdev, int sset)
 		data[i++] = rx->cnt;
 		do {
 			start =
-				u64_stats_fetch_begin(&priv->rx[ring].statss);
+				u64_stats_fetch_begin_irq(&priv->rx[ring].statss);
 			tmp_rx_bytes = rx->rbytes;
 			tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
 			tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
 			tmp_rx_desc_err_dropped_pkt = rx->rx_desc_err_dropped_pkt;
-		} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+		} while (u64_stats_fetch_retry_irq(&priv->rx[ring].statss,
 					       start));
 		data[i++] = tmp_rx_bytes;
 		/* rx dropped packets */
@@ -313,9 +313,9 @@ static int gve_get_sset_count(struct net_device *netdev, int sset)
 		data[i++] = tx->done;
 		do {
 			start =
-				u64_stats_fetch_begin(&priv->tx[ring].statss);
+				u64_stats_fetch_begin_irq(&priv->tx[ring].statss);
 			tmp_tx_bytes = tx->bytes_done;
-		} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+		} while (u64_stats_fetch_retry_irq(&priv->tx[ring].statss,
 					       start));
 		data[i++] = tmp_tx_bytes;
 		data[i++] = tx->wake_queue;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 6cb75bb..f0c1e6c8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -40,10 +40,10 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
 		for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) {
 			do {
 				start =
-					u64_stats_fetch_begin(&priv->rx[ring].statss);
+					u64_stats_fetch_begin_irq(&priv->rx[ring].statss);
 				packets = priv->rx[ring].rpackets;
 				bytes = priv->rx[ring].rbytes;
-			} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+			} while (u64_stats_fetch_retry_irq(&priv->rx[ring].statss,
 						       start));
 			s->rx_packets += packets;
 			s->rx_bytes += bytes;
@@ -53,10 +53,10 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
 		for (ring = 0; ring < priv->tx_cfg.num_queues; ring++) {
 			do {
 				start =
-					u64_stats_fetch_begin(&priv->tx[ring].statss);
+					u64_stats_fetch_begin_irq(&priv->tx[ring].statss);
 				packets = priv->tx[ring].pkt_done;
 				bytes = priv->tx[ring].bytes_done;
-			} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+			} while (u64_stats_fetch_retry_irq(&priv->tx[ring].statss,
 						       start));
 			s->tx_packets += packets;
 			s->tx_bytes += bytes;
@@ -1041,9 +1041,9 @@ void gve_handle_report_stats(struct gve_priv *priv)
 	if (priv->tx) {
 		for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
 			do {
-				start = u64_stats_fetch_begin(&priv->tx[idx].statss);
+				start = u64_stats_fetch_begin_irq(&priv->tx[idx].statss);
 				tx_bytes = priv->tx[idx].bytes_done;
-			} while (u64_stats_fetch_retry(&priv->tx[idx].statss, start));
+			} while (u64_stats_fetch_retry_irq(&priv->tx[idx].statss, start));
 			stats[stats_idx++] = (struct stats) {
 				.stat_name = cpu_to_be32(TX_WAKE_CNT),
 				.value = cpu_to_be64(priv->tx[idx].wake_queue),
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_rx.c b/drivers/net/ethernet/huawei/hinic/hinic_rx.c
index 57d5d79..1b57b67 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_rx.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_rx.c
@@ -375,7 +375,7 @@ void hinic_rxq_get_stats(struct hinic_rxq *rxq,
 
 	u64_stats_update_begin(&stats->syncp);
 	do {
-		start = u64_stats_fetch_begin(&rxq_stats->syncp);
+		start = u64_stats_fetch_begin_irq(&rxq_stats->syncp);
 		stats->bytes = rxq_stats->bytes;
 		stats->packets = rxq_stats->packets;
 		stats->errors = rxq_stats->csum_errors +
@@ -384,7 +384,7 @@ void hinic_rxq_get_stats(struct hinic_rxq *rxq,
 		stats->other_errors = rxq_stats->other_errors;
 		stats->dropped = rxq_stats->dropped;
 		stats->rx_buf_empty = rxq_stats->rx_buf_empty;
-	} while (u64_stats_fetch_retry(&rxq_stats->syncp, start));
+	} while (u64_stats_fetch_retry_irq(&rxq_stats->syncp, start));
 	u64_stats_update_end(&stats->syncp);
 }
 
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_tx.c b/drivers/net/ethernet/huawei/hinic/hinic_tx.c
index 75fa344..ff37b6f 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_tx.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_tx.c
@@ -61,7 +61,7 @@ void hinic_txq_get_stats(struct hinic_txq *txq,
 
 	u64_stats_update_begin(&stats->syncp);
 	do {
-		start = u64_stats_fetch_begin(&txq_stats->syncp);
+		start = u64_stats_fetch_begin_irq(&txq_stats->syncp);
 		stats->bytes = txq_stats->bytes;
 		stats->packets = txq_stats->packets;
 		stats->busy = txq_stats->busy;
@@ -69,7 +69,7 @@ void hinic_txq_get_stats(struct hinic_txq *txq,
 		stats->dropped = txq_stats->dropped;
 		stats->big_frags_pkts = txq_stats->big_frags_pkts;
 		stats->big_udp_pkts = txq_stats->big_udp_pkts;
-	} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
+	} while (u64_stats_fetch_retry_irq(&txq_stats->syncp, start));
 	u64_stats_update_end(&stats->syncp);
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index dfc1f32..5ab230aa 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3373,21 +3373,21 @@ static void nfp_net_stat64(struct net_device *netdev,
 		unsigned int start;
 
 		do {
-			start = u64_stats_fetch_begin(&r_vec->rx_sync);
+			start = u64_stats_fetch_begin_irq(&r_vec->rx_sync);
 			data[0] = r_vec->rx_pkts;
 			data[1] = r_vec->rx_bytes;
 			data[2] = r_vec->rx_drops;
-		} while (u64_stats_fetch_retry(&r_vec->rx_sync, start));
+		} while (u64_stats_fetch_retry_irq(&r_vec->rx_sync, start));
 		stats->rx_packets += data[0];
 		stats->rx_bytes += data[1];
 		stats->rx_dropped += data[2];
 
 		do {
-			start = u64_stats_fetch_begin(&r_vec->tx_sync);
+			start = u64_stats_fetch_begin_irq(&r_vec->tx_sync);
 			data[0] = r_vec->tx_pkts;
 			data[1] = r_vec->tx_bytes;
 			data[2] = r_vec->tx_errors;
-		} while (u64_stats_fetch_retry(&r_vec->tx_sync, start));
+		} while (u64_stats_fetch_retry_irq(&r_vec->tx_sync, start));
 		stats->tx_packets += data[0];
 		stats->tx_bytes += data[1];
 		stats->tx_errors += data[2];
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
index 1027f1ff..5b38d9c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
@@ -498,7 +498,7 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data)
 		unsigned int start;
 
 		do {
-			start = u64_stats_fetch_begin(&nn->r_vecs[i].rx_sync);
+			start = u64_stats_fetch_begin_irq(&nn->r_vecs[i].rx_sync);
 			data[0] = nn->r_vecs[i].rx_pkts;
 			tmp[0] = nn->r_vecs[i].hw_csum_rx_ok;
 			tmp[1] = nn->r_vecs[i].hw_csum_rx_inner_ok;
@@ -506,10 +506,10 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data)
 			tmp[3] = nn->r_vecs[i].hw_csum_rx_error;
 			tmp[4] = nn->r_vecs[i].rx_replace_buf_alloc_fail;
 			tmp[5] = nn->r_vecs[i].hw_tls_rx;
-		} while (u64_stats_fetch_retry(&nn->r_vecs[i].rx_sync, start));
+		} while (u64_stats_fetch_retry_irq(&nn->r_vecs[i].rx_sync, start));
 
 		do {
-			start = u64_stats_fetch_begin(&nn->r_vecs[i].tx_sync);
+			start = u64_stats_fetch_begin_irq(&nn->r_vecs[i].tx_sync);
 			data[1] = nn->r_vecs[i].tx_pkts;
 			data[2] = nn->r_vecs[i].tx_busy;
 			tmp[6] = nn->r_vecs[i].hw_csum_tx;
@@ -519,7 +519,7 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data)
 			tmp[10] = nn->r_vecs[i].hw_tls_tx;
 			tmp[11] = nn->r_vecs[i].tls_tx_fallback;
 			tmp[12] = nn->r_vecs[i].tls_tx_no_fallback;
-		} while (u64_stats_fetch_retry(&nn->r_vecs[i].tx_sync, start));
+		} while (u64_stats_fetch_retry_irq(&nn->r_vecs[i].tx_sync, start));
 
 		data += NN_RVEC_PER_Q_STATS;
 
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index ad6dbf01..4fb0638 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -67,10 +67,10 @@ static int nsim_change_mtu(struct net_device *dev, int new_mtu)
 	unsigned int start;
 
 	do {
-		start = u64_stats_fetch_begin(&ns->syncp);
+		start = u64_stats_fetch_begin_irq(&ns->syncp);
 		stats->tx_bytes = ns->tx_bytes;
 		stats->tx_packets = ns->tx_packets;
-	} while (u64_stats_fetch_retry(&ns->syncp, start));
+	} while (u64_stats_fetch_retry_irq(&ns->syncp, start));
 }
 
 static int
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index e18c385..0a55058 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -2175,9 +2175,9 @@ static inline u64 sta_get_tidstats_msdu(struct ieee80211_sta_rx_stats *rxstats,
 	u64 value;
 
 	do {
-		start = u64_stats_fetch_begin(&rxstats->syncp);
+		start = u64_stats_fetch_begin_irq(&rxstats->syncp);
 		value = rxstats->msdu[tid];
-	} while (u64_stats_fetch_retry(&rxstats->syncp, start));
+	} while (u64_stats_fetch_retry_irq(&rxstats->syncp, start));
 
 	return value;
 }
@@ -2241,9 +2241,9 @@ static inline u64 sta_get_stats_bytes(struct ieee80211_sta_rx_stats *rxstats)
 	u64 value;
 
 	do {
-		start = u64_stats_fetch_begin(&rxstats->syncp);
+		start = u64_stats_fetch_begin_irq(&rxstats->syncp);
 		value = rxstats->bytes;
-	} while (u64_stats_fetch_retry(&rxstats->syncp, start));
+	} while (u64_stats_fetch_retry_irq(&rxstats->syncp, start));
 
 	return value;
 }
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 9c047c1..7239814 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1078,9 +1078,9 @@ static void mpls_get_stats(struct mpls_dev *mdev,
 
 		p = per_cpu_ptr(mdev->stats, i);
 		do {
-			start = u64_stats_fetch_begin(&p->syncp);
+			start = u64_stats_fetch_begin_irq(&p->syncp);
 			local = p->stats;
-		} while (u64_stats_fetch_retry(&p->syncp, start));
+		} while (u64_stats_fetch_retry_irq(&p->syncp, start));
 
 		stats->rx_packets += local.rx_packets;
 		stats->rx_bytes += local.rx_bytes;
From: Kuniyuki Iwashima <kuniyu@amazon.com>
stable inclusion
from stable-v5.10.154
commit 2bf33b5ea46dbe547de44cdcbee6b1c0b6c167d4
category: bugfix
bugzilla: 188217
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d upstream.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made it possible to enable/disable early_demux on a per-netns basis. Then, we introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp"). However, the .proc_handler() was wrong and actually disabled us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto .early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux, the change itself is saved in each netns variable, but the .early_demux() handler is a global variable, so the handler is switched based on the init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have nothing to do with the logic. Whether we CAN execute proto .early_demux() is always decided by init_net's sysctl knob, and whether we DO it or not is by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users of the .early_demux() handler are TCP and UDP only, and they are called directly to avoid retpoline. So, we can remove the .early_demux() handler from inet6?_protos and need not dereference them in ip6?_rcv_finish_core(). If another proto needs .early_demux(), we can restore it at that time.
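As a hypothetical way to exercise the fix (not part of the patch): a privileged test can enter a fresh netns and flip the knob there; before this change, such a write only ever switched the global handler according to init_net's setting.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sched.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd;

		/* needs CAP_SYS_ADMIN: move into a brand-new net namespace */
		if (unshare(CLONE_NEWNET) != 0) {
			perror("unshare(CLONE_NEWNET)");
			return 1;
		}

		/* /proc/sys/net resolves against the opener's netns, so this
		 * write toggles tcp_early_demux for this namespace only */
		fd = open("/proc/sys/net/ipv4/tcp_early_demux", O_WRONLY);
		if (fd < 0 || write(fd, "0", 1) != 1) {
			perror("tcp_early_demux");
			return 1;
		}
		close(fd);
		return 0;
	}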
Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2bf33b5ea46dbe547de44cdcbee6b1c0b6c167d4)
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 include/net/protocol.h     |  4 ----
 include/net/tcp.h          |  2 +-
 include/net/udp.h          |  1 +
 net/ipv4/af_inet.c         | 14 ++--------
 net/ipv4/ip_input.c        | 37 +++++++++++++++++------------
 net/ipv4/sysctl_net_ipv4.c | 59 ++--------------------------------------------
 net/ipv6/ip6_input.c       | 26 +++++++++++---------
 net/ipv6/tcp_ipv6.c        |  9 ++-----
 net/ipv6/udp.c             |  9 ++-----
 9 files changed, 47 insertions(+), 114 deletions(-)
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 2b778e1..0fd2df8 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -35,8 +35,6 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
-	int			(*early_demux)(struct sk_buff *skb);
-	int			(*early_demux_handler)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 
 	/* This returns an error if we weren't able to handle the error. */
@@ -53,8 +51,6 @@ struct net_protocol {
 
 #if IS_ENABLED(CONFIG_IPV6)
 struct inet6_protocol {
-	void	(*early_demux)(struct sk_buff *skb);
-	void	(*early_demux_handler)(struct sk_buff *skb);
 	int	(*handler)(struct sk_buff *skb);
 
 	/* This returns an error if we weren't able to handle the error. */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 73959a2..a8de6d4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -937,7 +937,7 @@ static inline int tcp_v6_sdif(const struct sk_buff *skb)
 
 INDIRECT_CALLABLE_DECLARE(void tcp_v6_send_check(struct sock *sk, struct sk_buff *skb));
 INDIRECT_CALLABLE_DECLARE(int tcp_v6_rcv(struct sk_buff *skb));
-INDIRECT_CALLABLE_DECLARE(void tcp_v6_early_demux(struct sk_buff *skb));
+void tcp_v6_early_demux(struct sk_buff *skb);
 
 #endif
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 4017f25..c08d6af 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,6 +176,7 @@ static inline void udp_csum_pull_header(struct sk_buff *skb)
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
 				struct udphdr *uh, struct sock *sk);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
+void udp_v6_early_demux(struct sk_buff *skb);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 				  netdev_features_t features, bool is_ipv6);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 911ad59..4d0eff2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1721,12 +1721,7 @@ u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset)
 };
 #endif
 
-/* thinking of making this const? Don't.
- * early_demux can change based on sysctl.
- */
-static struct net_protocol tcp_protocol = {
-	.early_demux	=	tcp_v4_early_demux,
-	.early_demux_handler =	tcp_v4_early_demux,
+static const struct net_protocol tcp_protocol = {
 	.handler	=	tcp_v4_rcv,
 	.err_handler	=	tcp_v4_err,
 	.no_policy	=	1,
@@ -1734,12 +1729,7 @@ u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset)
 	.icmp_strict_tag_validation = 1,
 };
 
-/* thinking of making this const? Don't.
- * early_demux can change based on sysctl.
- */
-static struct net_protocol udp_protocol = {
-	.early_demux =	udp_v4_early_demux,
-	.early_demux_handler =	udp_v4_early_demux,
+static const struct net_protocol udp_protocol = {
 	.handler =	udp_rcv,
 	.err_handler =	udp_err,
 	.no_policy =	1,
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index b0c244a..f6b3237 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -309,14 +309,13 @@ static bool ip_can_use_hint(const struct sk_buff *skb, const struct iphdr *iph,
 	       ip_hdr(hint)->tos == iph->tos;
 }
 
-INDIRECT_CALLABLE_DECLARE(int udp_v4_early_demux(struct sk_buff *));
-INDIRECT_CALLABLE_DECLARE(int tcp_v4_early_demux(struct sk_buff *));
+int tcp_v4_early_demux(struct sk_buff *skb);
+int udp_v4_early_demux(struct sk_buff *skb);
 static int ip_rcv_finish_core(struct net *net, struct sock *sk,
 			      struct sk_buff *skb, struct net_device *dev,
 			      const struct sk_buff *hint)
 {
 	const struct iphdr *iph = ip_hdr(skb);
-	int (*edemux)(struct sk_buff *skb);
 	struct rtable *rt;
 	int err;
 
@@ -327,21 +326,29 @@ static int ip_rcv_finish_core(struct net *net, struct sock *sk,
 			goto drop_error;
 	}
 
-	if (net->ipv4.sysctl_ip_early_demux &&
+	if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) &&
 	    !skb_dst(skb) &&
 	    !skb->sk &&
 	    !ip_is_fragment(iph)) {
-		const struct net_protocol *ipprot;
-		int protocol = iph->protocol;
-
-		ipprot = rcu_dereference(inet_protos[protocol]);
-		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) {
-			err = INDIRECT_CALL_2(edemux, tcp_v4_early_demux,
-					      udp_v4_early_demux, skb);
-			if (unlikely(err))
-				goto drop_error;
-			/* must reload iph, skb->head might have changed */
-			iph = ip_hdr(skb);
+		switch (iph->protocol) {
+		case IPPROTO_TCP:
+			if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux)) {
+				tcp_v4_early_demux(skb);
+
+				/* must reload iph, skb->head might have changed */
+				iph = ip_hdr(skb);
+			}
+			break;
+		case IPPROTO_UDP:
+			if (READ_ONCE(net->ipv4.sysctl_udp_early_demux)) {
+				err = udp_v4_early_demux(skb);
+				if (unlikely(err))
+					goto drop_error;
+
+				/* must reload iph, skb->head might have changed */
+				iph = ip_hdr(skb);
+			}
+			break;
+		}
 	}
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cffacd..e7c1ce7 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -361,61 +361,6 @@ static int proc_tcp_fastopen_key(struct ctl_table *table, int write,
 	return ret;
 }
 
-static void proc_configure_early_demux(int enabled, int protocol)
-{
-	struct net_protocol *ipprot;
-#if IS_ENABLED(CONFIG_IPV6)
-	struct inet6_protocol *ip6prot;
-#endif
-
-	rcu_read_lock();
-
-	ipprot = rcu_dereference(inet_protos[protocol]);
-	if (ipprot)
-		ipprot->early_demux = enabled ? ipprot->early_demux_handler :
-						NULL;
-
-#if IS_ENABLED(CONFIG_IPV6)
-	ip6prot = rcu_dereference(inet6_protos[protocol]);
-	if (ip6prot)
-		ip6prot->early_demux = enabled ? ip6prot->early_demux_handler :
-						 NULL;
-#endif
-	rcu_read_unlock();
-}
-
-static int proc_tcp_early_demux(struct ctl_table *table, int write,
-				void *buffer, size_t *lenp, loff_t *ppos)
-{
-	int ret = 0;
-
-	ret = proc_dointvec(table, write, buffer, lenp, ppos);
-
-	if (write && !ret) {
-		int enabled = init_net.ipv4.sysctl_tcp_early_demux;
-
-		proc_configure_early_demux(enabled, IPPROTO_TCP);
-	}
-
-	return ret;
-}
-
-static int proc_udp_early_demux(struct ctl_table *table, int write,
-				void *buffer, size_t *lenp, loff_t *ppos)
-{
-	int ret = 0;
-
-	ret = proc_dointvec(table, write, buffer, lenp, ppos);
-
-	if (write && !ret) {
-		int enabled = init_net.ipv4.sysctl_udp_early_demux;
-
-		proc_configure_early_demux(enabled, IPPROTO_UDP);
-	}
-
-	return ret;
-}
-
 static int proc_tfo_blackhole_detect_timeout(struct ctl_table *table,
 					     int write,
 					     void *buffer, size_t *lenp, loff_t *ppos)
@@ -727,14 +672,14 @@ static int proc_tcp_compression_ports(struct ctl_table *table, int write,
 		.data		= &init_net.ipv4.sysctl_udp_early_demux,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_udp_early_demux
+		.proc_handler	= proc_douintvec_minmax,
 	},
 	{
 		.procname	= "tcp_early_demux",
 		.data		= &init_net.ipv4.sysctl_tcp_early_demux,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_tcp_early_demux
+		.proc_handler	= proc_douintvec_minmax,
 	},
 	{
 		.procname	= "nexthop_compat_mode",
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 15ea3d0..4eb9fbf 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -44,21 +44,25 @@
 #include <net/inet_ecn.h>
 #include <net/dst_metadata.h>
 
-INDIRECT_CALLABLE_DECLARE(void udp_v6_early_demux(struct sk_buff *));
-INDIRECT_CALLABLE_DECLARE(void tcp_v6_early_demux(struct sk_buff *));
+void udp_v6_early_demux(struct sk_buff *);
+void tcp_v6_early_demux(struct sk_buff *);
 static void ip6_rcv_finish_core(struct net *net, struct sock *sk,
 				struct sk_buff *skb)
 {
-	void (*edemux)(struct sk_buff *skb);
-
-	if (net->ipv4.sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) {
-		const struct inet6_protocol *ipprot;
-
-		ipprot = rcu_dereference(inet6_protos[ipv6_hdr(skb)->nexthdr]);
-		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux)))
-			INDIRECT_CALL_2(edemux, tcp_v6_early_demux,
-					udp_v6_early_demux, skb);
+	if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) &&
+	    !skb_dst(skb) && !skb->sk) {
+		switch (ipv6_hdr(skb)->nexthdr) {
+		case IPPROTO_TCP:
+			if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux))
+				tcp_v6_early_demux(skb);
+			break;
+		case IPPROTO_UDP:
+			if (READ_ONCE(net->ipv4.sysctl_udp_early_demux))
+				udp_v6_early_demux(skb);
+			break;
+		}
 	}
+
 	if (!skb_valid_dst(skb))
 		ip6_route_input(skb);
 }
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ef36c34..5ccd4c8 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1818,7 +1818,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	goto discard_it;
 }
 
-INDIRECT_CALLABLE_SCOPE void tcp_v6_early_demux(struct sk_buff *skb)
+void tcp_v6_early_demux(struct sk_buff *skb)
 {
 	const struct ipv6hdr *hdr;
 	const struct tcphdr *th;
@@ -2169,12 +2169,7 @@ struct proto tcpv6_prot = {
 };
 EXPORT_SYMBOL_GPL(tcpv6_prot);
 
-/* thinking of making this const? Don't.
- * early_demux can change based on sysctl.
- */
-static struct inet6_protocol tcpv6_protocol = {
-	.early_demux	=	tcp_v6_early_demux,
-	.early_demux_handler =	tcp_v6_early_demux,
+static const struct inet6_protocol tcpv6_protocol = {
 	.handler	=	tcp_v6_rcv,
 	.err_handler	=	tcp_v6_err,
 	.flags		=	INET6_PROTO_NOPOLICY|INET6_PROTO_FINAL,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4dfb236..131c97c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1026,7 +1026,7 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net,
 	return NULL;
 }
 
-INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
+void udp_v6_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
 	const struct udphdr *uh;
@@ -1639,12 +1639,7 @@ int udpv6_getsockopt(struct sock *sk, int level, int optname,
 	return ipv6_getsockopt(sk, level, optname, optval, optlen);
 }
 
-/* thinking of making this const? Don't.
- * early_demux can change based on sysctl.
- */
-static struct inet6_protocol udpv6_protocol = {
-	.early_demux =	udp_v6_early_demux,
-	.early_demux_handler =	udp_v6_early_demux,
+static const struct inet6_protocol udpv6_protocol = {
 	.handler =	udpv6_rcv,
 	.err_handler =	udpv6_err,
 	.flags =	INET6_PROTO_NOPOLICY|INET6_PROTO_FINAL,
Offering: HULK
hulk inclusion
category: bugfix
bugzilla: 188217
--------------------------------
Add early_demux and early_demux_handler back to fix the kABI breakage in struct net_protocol and struct inet6_protocol.
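A minimal sketch of why the fields are re-added in front of ->handler (illustrative only; out-of-tree modules built against the old layout depend on the member offsets):

	#include <stddef.h>
	#include <stdio.h>

	struct sk_buff;

	/* kABI view: both pointers restored at their original position */
	struct net_protocol {
		int (*early_demux)(struct sk_buff *skb);
		int (*early_demux_handler)(struct sk_buff *skb);
		int (*handler)(struct sk_buff *skb);
	};

	int main(void)
	{
		/* ->handler keeps the offset existing binaries expect;
		 * dropping the two members above would shift it (and
		 * everything after it) */
		printf("handler offset: %zu\n",
		       offsetof(struct net_protocol, handler));
		return 0;
	}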
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
---
 include/net/protocol.h | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 0fd2df8..2b778e1 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -35,6 +35,8 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
+	int			(*early_demux)(struct sk_buff *skb);
+	int			(*early_demux_handler)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 
 	/* This returns an error if we weren't able to handle the error. */
@@ -51,6 +53,8 @@ struct net_protocol {
 
 #if IS_ENABLED(CONFIG_IPV6)
 struct inet6_protocol {
+	void	(*early_demux)(struct sk_buff *skb);
+	void	(*early_demux_handler)(struct sk_buff *skb);
 	int	(*handler)(struct sk_buff *skb);
 
 	/* This returns an error if we weren't able to handle the error. */
Please ignore, the baseline is wrong.