From: Liu Shixin <liushixin2@huawei.com>

stable inclusion
from stable-v5.10.163
commit f9d19f3a044ca651b0be52a4bf951ffe74259b9f
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6CIF8
CVE: CVE-2023-0615
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 94a7ad9283464b75b12516c5512541d467cefcf8 ]
syzkaller found a bug:
BUG: unable to handle page fault for address: ffffc9000a3b1000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 100000067 P4D 100000067 PUD 10015f067 PMD 1121ca067 PTE 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 0 PID: 23489 Comm: vivid-000-vid-c Not tainted 6.1.0-rc1+ #512
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:memcpy_erms+0x6/0x10
[...]
Call Trace:
 <TASK>
 ? tpg_fill_plane_buffer+0x856/0x15b0
 vivid_fillbuff+0x8ac/0x1110
 vivid_thread_vid_cap_tick+0x361/0xc90
 vivid_thread_vid_cap+0x21a/0x3a0
 kthread+0x143/0x180
 ret_from_fork+0x1f/0x30
 </TASK>
This is because the boundary is not checked after adjusting compose->height in the V4L2_SEL_TGT_CROP case. Add v4l2_rect_map_inside() to fix this problem for this case.
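For context, a simplified sketch of the helper this patch adds a call to (modeled on include/media/v4l2-rect.h; an illustration, not the verbatim source): it first caps the rectangle's size to the boundary, then shifts it so that it lies fully inside, which is exactly the clamping that was missing after compose->height was adjusted:

  /* Clamp r so that it fits completely inside boundary. */
  static inline void v4l2_rect_map_inside(struct v4l2_rect *r,
					  const struct v4l2_rect *boundary)
  {
	v4l2_rect_set_max_size(r, boundary);	/* never larger than boundary */
	if (r->left < boundary->left)
		r->left = boundary->left;
	if (r->top < boundary->top)
		r->top = boundary->top;
	if (r->left + r->width > boundary->left + boundary->width)
		r->left = boundary->left + boundary->width - r->width;
	if (r->top + r->height > boundary->top + boundary->height)
		r->top = boundary->top + boundary->height - r->height;
  }

With the compose rectangle mapped inside the capture format rectangle, tpg_fill_plane_buffer() can no longer write past the allocated buffer height.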
Fixes: ef834f7836ec ("[media] vivid: add the video capture and output parts")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/media/test-drivers/vivid/vivid-vid-cap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/media/test-drivers/vivid/vivid-vid-cap.c b/drivers/media/test-drivers/vivid/vivid-vid-cap.c
index eadf28ab1e39..eeb0aeb62f79 100644
--- a/drivers/media/test-drivers/vivid/vivid-vid-cap.c
+++ b/drivers/media/test-drivers/vivid/vivid-vid-cap.c
@@ -953,6 +953,7 @@ int vivid_vid_cap_s_selection(struct file *file, void *fh, struct v4l2_selection
 		if (dev->has_compose_cap) {
 			v4l2_rect_set_min_size(compose, &min_rect);
 			v4l2_rect_set_max_size(compose, &max_rect);
+			v4l2_rect_map_inside(compose, &fmt);
 		}
 		dev->fmt_cap_rect = fmt;
 		tpg_s_buf_height(&dev->tpg, fmt.height);
From: Nikolay Aleksandrov <nikolay@nvidia.com>

mainline inclusion
from mainline-v5.16-rc8
commit f83a112bd91a494cdee671aec74e777470fb4a07
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DKZJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As reported [1], if the startup query interval is set too low in combination with a large number of startup queries, and we have multiple bridges or even a single bridge with multiple querier vlans configured, we can crash the machine. Add a 1 second minimum which must be enforced by overwriting the value if set lower (i.e. without returning an error) to avoid breaking user-space. If that happens, a log message is emitted to let the admin know that the startup interval has been set to the minimum. It doesn't make sense to make the startup interval lower than the normal query interval, so use the same value of 1 second. The issue has been present since these intervals could be user-controlled.
[1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.co...
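As a usage illustration (hypothetical transcript; the printed numbers assume USER_HZ == 100, so 1 second is reported as 100 clock ticks), a too-low value is now silently clamped and logged instead of being stored verbatim:

  # ip link add name br0 type bridge
  # echo 10 > /sys/class/net/br0/bridge/multicast_startup_query_interval
  # dmesg | tail -n 1
  br0: trying to set multicast startup query interval below minimum, setting to 100 (1000ms)

Note that the write itself still succeeds, so existing user-space tooling keeps working.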
Fixes: d902eee43f19 ("bridge: Add multicast count/interval sysfs entries")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Conflicts:
	net/bridge/br_multicast.c
	net/bridge/br_netlink.c
	net/bridge/br_private.h
	net/bridge/br_sysfs_br.c

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/bridge/br_multicast.c | 16 ++++++++++++++++
 net/bridge/br_netlink.c   |  2 +-
 net/bridge/br_private.h   |  4 ++++
 net/bridge/br_sysfs_br.c  |  2 +-
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index e5328a2777ec..8ea85a4718f9 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -3609,6 +3609,22 @@ int br_multicast_set_mld_version(struct net_bridge *br, unsigned long val)
 }
 #endif

+void br_multicast_set_startup_query_intvl(struct net_bridge *br,
+					  unsigned long val)
+{
+	unsigned long intvl_jiffies = clock_t_to_jiffies(val);
+
+	if (intvl_jiffies < BR_MULTICAST_STARTUP_QUERY_INTVL_MIN) {
+		br_info(br,
+			"trying to set multicast startup query interval below minimum, setting to %lu (%ums)\n",
+			jiffies_to_clock_t(BR_MULTICAST_STARTUP_QUERY_INTVL_MIN),
+			jiffies_to_msecs(BR_MULTICAST_STARTUP_QUERY_INTVL_MIN));
+		intvl_jiffies = BR_MULTICAST_STARTUP_QUERY_INTVL_MIN;
+	}
+
+	br->multicast_startup_query_interval = intvl_jiffies;
+}
+
 /**
  * br_multicast_list_adjacent - Returns snooped multicast addresses
  * @dev:	The bridge port adjacent to which to retrieve addresses
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 31b00ba5dcc8..1f85c1b77c89 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -1298,7 +1298,7 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
 	if (data[IFLA_BR_MCAST_STARTUP_QUERY_INTVL]) {
 		u64 val = nla_get_u64(data[IFLA_BR_MCAST_STARTUP_QUERY_INTVL]);

-		br->multicast_startup_query_interval = clock_t_to_jiffies(val);
+		br_multicast_set_startup_query_intvl(br, val);
 	}

 	if (data[IFLA_BR_MCAST_STATS_ENABLED]) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2b88b17cc8b2..e439d9aae45a 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -28,6 +28,8 @@
 #define BR_MAX_PORTS	(1<<BR_PORT_BITS)

 #define BR_MULTICAST_DEFAULT_HASH_MAX 4096
+#define BR_MULTICAST_QUERY_INTVL_MIN msecs_to_jiffies(1000)
+#define BR_MULTICAST_STARTUP_QUERY_INTVL_MIN BR_MULTICAST_QUERY_INTVL_MIN

 #define BR_VERSION	"2.3"

@@ -1568,4 +1570,6 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
 void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
 		       u16 vid, struct net_bridge_port *p, struct nd_msg *msg);
 struct nd_msg *br_is_nd_neigh_msg(struct sk_buff *skb, struct nd_msg *m);
+void br_multicast_set_startup_query_intvl(struct net_bridge *br,
+					  unsigned long val);
 #endif
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 7db06e3f642a..1fdc491f648a 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -640,7 +640,7 @@ static ssize_t multicast_startup_query_interval_show(

 static int set_startup_query_interval(struct net_bridge *br, unsigned long val)
 {
-	br->multicast_startup_query_interval = clock_t_to_jiffies(val);
+	br_multicast_set_startup_query_intvl(br, val);
 	return 0;
 }
From: Nikolay Aleksandrov <nikolay@nvidia.com>

mainline inclusion
from mainline-v5.16-rc8
commit 99b40610956a8a8755653a67392e2a8b772453be
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DKZJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As reported [1], if the query interval is set too low and we have multiple bridges or even a single bridge with multiple querier vlans configured, we can crash the machine. Add a 1 second minimum which must be enforced by overwriting the value if set lower (i.e. without returning an error) to avoid breaking user-space. If that happens, a log message is emitted to let the administrator know that the interval has been set to the minimum. The issue has been present since these intervals could be user-controlled.
[1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.co...
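The same clamping also covers the netlink path; a hypothetical iproute2 session (mcast_query_intvl takes clock-tick units, so 100 corresponds to 1 second with USER_HZ == 100):

  # ip link add name br0 type bridge
  # ip link set dev br0 type bridge mcast_query_intvl 10
  # dmesg | tail -n 1
  br0: trying to set multicast query interval below minimum, setting to 100 (1000ms)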
Fixes: d902eee43f19 ("bridge: Add multicast count/interval sysfs entries")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Conflicts:
	net/bridge/br_multicast.c
	net/bridge/br_netlink.c
	net/bridge/br_private.h
	net/bridge/br_sysfs_br.c

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/bridge/br_multicast.c | 16 ++++++++++++++++
 net/bridge/br_netlink.c   |  2 +-
 net/bridge/br_private.h   |  2 ++
 net/bridge/br_sysfs_br.c  |  2 +-
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 8ea85a4718f9..de79fbfe1611 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -3625,6 +3625,22 @@ void br_multicast_set_startup_query_intvl(struct net_bridge *br,
 	br->multicast_startup_query_interval = intvl_jiffies;
 }

+void br_multicast_set_query_intvl(struct net_bridge *br,
+				  unsigned long val)
+{
+	unsigned long intvl_jiffies = clock_t_to_jiffies(val);
+
+	if (intvl_jiffies < BR_MULTICAST_QUERY_INTVL_MIN) {
+		br_info(br,
+			"trying to set multicast query interval below minimum, setting to %lu (%ums)\n",
+			jiffies_to_clock_t(BR_MULTICAST_QUERY_INTVL_MIN),
+			jiffies_to_msecs(BR_MULTICAST_QUERY_INTVL_MIN));
+		intvl_jiffies = BR_MULTICAST_QUERY_INTVL_MIN;
+	}
+
+	br->multicast_query_interval = intvl_jiffies;
+}
+
 /**
  * br_multicast_list_adjacent - Returns snooped multicast addresses
  * @dev:	The bridge port adjacent to which to retrieve addresses
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 1f85c1b77c89..bed6c798fea9 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -1286,7 +1286,7 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
 	if (data[IFLA_BR_MCAST_QUERY_INTVL]) {
 		u64 val = nla_get_u64(data[IFLA_BR_MCAST_QUERY_INTVL]);

-		br->multicast_query_interval = clock_t_to_jiffies(val);
+		br_multicast_set_query_intvl(br, val);
 	}

 	if (data[IFLA_BR_MCAST_QUERY_RESPONSE_INTVL]) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e439d9aae45a..bfd0e995b078 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -1572,4 +1572,6 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
 struct nd_msg *br_is_nd_neigh_msg(struct sk_buff *skb, struct nd_msg *m);
 void br_multicast_set_startup_query_intvl(struct net_bridge *br,
 					  unsigned long val);
+void br_multicast_set_query_intvl(struct net_bridge *br,
+				  unsigned long val);
 #endif
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 1fdc491f648a..c8b477cbc4a3 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -594,7 +594,7 @@ static ssize_t multicast_query_interval_show(struct device *d,

 static int set_query_interval(struct net_bridge *br, unsigned long val)
 {
-	br->multicast_query_interval = clock_t_to_jiffies(val);
+	br_multicast_set_query_intvl(br, val);
 	return 0;
 }
From: ZhaoLong Wang <wangzhaolong1@huawei.com>

mainline inclusion
from mainline-v6.2-rc8
commit aa5465aeca3c66fecdf7efcf554aed79b4c4b211
category: bugfix
bugzilla: 188381, https://gitee.com/openeuler/kernel/issues/I644ST
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
------------------------------------------------------
When the network status is unstable, a use-after-free may occur while reading data from the server.
BUG: KASAN: use-after-free in readpages_fill_pages+0x14c/0x7e0
Call Trace:
 <TASK>
 dump_stack_lvl+0x38/0x4c
 print_report+0x16f/0x4a6
 kasan_report+0xb7/0x130
 readpages_fill_pages+0x14c/0x7e0
 cifs_readv_receive+0x46d/0xa40
 cifs_demultiplex_thread+0x121c/0x1490
 kthread+0x16b/0x1a0
 ret_from_fork+0x2c/0x50
 </TASK>

Allocated by task 2535:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 __kasan_kmalloc+0x82/0x90
 cifs_readdata_direct_alloc+0x2c/0x110
 cifs_readdata_alloc+0x2d/0x60
 cifs_readahead+0x393/0xfe0
 read_pages+0x12f/0x470
 page_cache_ra_unbounded+0x1b1/0x240
 filemap_get_pages+0x1c8/0x9a0
 filemap_read+0x1c0/0x540
 cifs_strict_readv+0x21b/0x240
 vfs_read+0x395/0x4b0
 ksys_read+0xb8/0x150
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 79:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 __kasan_slab_free+0x10e/0x1a0
 __kmem_cache_free+0x7a/0x1a0
 cifs_readdata_release+0x49/0x60
 process_one_work+0x46c/0x760
 worker_thread+0x2a4/0x6f0
 kthread+0x16b/0x1a0
 ret_from_fork+0x2c/0x50

Last potentially related work creation:
 kasan_save_stack+0x22/0x50
 __kasan_record_aux_stack+0x95/0xb0
 insert_work+0x2b/0x130
 __queue_work+0x1fe/0x660
 queue_work_on+0x4b/0x60
 smb2_readv_callback+0x396/0x800
 cifs_abort_connection+0x474/0x6a0
 cifs_reconnect+0x5cb/0xa50
 cifs_readv_from_socket.cold+0x22/0x6c
 cifs_read_page_from_socket+0xc1/0x100
 readpages_fill_pages.cold+0x2f/0x46
 cifs_readv_receive+0x46d/0xa40
 cifs_demultiplex_thread+0x121c/0x1490
 kthread+0x16b/0x1a0
 ret_from_fork+0x2c/0x50
The following call chain causes a UAF of the rdata pointer:

readpages_fill_pages
 cifs_read_page_from_socket
  cifs_readv_from_socket
   cifs_reconnect
    __cifs_reconnect
     cifs_abort_connection
      mid->callback() --> smb2_readv_callback
       queue_work(&rdata->work)  # if the worker completes first,
                                 # the rdata is freed:
                                 #  cifs_readv_complete
                                 #   kref_put
                                 #    cifs_readdata_release
                                 #     kfree(rdata)
 return rdata->...               # UAF in readpages_fill_pages()
Similarly, this problem also occurs in uncached_fill_pages().
Fix this by adjusting the order of the conditions in the return statement.
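The entire fix is the operand order in the && expression. Since && evaluates left to right and short-circuits, putting the result check first guarantees rdata is never dereferenced once the connection has been aborted (annotated copy of the hunks below):

  /* before: rdata->got_bytes is read first, even when result == -ECONNABORTED
   * and the work item may already have freed rdata -> use-after-free
   */
  return rdata->got_bytes > 0 && result != -ECONNABORTED ?
	rdata->got_bytes : result;

  /* after: the -ECONNABORTED check short-circuits, so the possibly freed
   * rdata is never touched on the abort path
   */
  return result != -ECONNABORTED && rdata->got_bytes > 0 ?
	rdata->got_bytes : result;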
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/cifs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index fafb69d338c2..f6a1218366d1 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3536,7 +3536,7 @@ uncached_fill_pages(struct TCP_Server_Info *server,
 		rdata->got_bytes += result;
 	}

-	return rdata->got_bytes > 0 && result != -ECONNABORTED ?
+	return result != -ECONNABORTED && rdata->got_bytes > 0 ?
 		rdata->got_bytes : result;
 }

@@ -4290,7 +4290,7 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 		rdata->got_bytes += result;
 	}

-	return rdata->got_bytes > 0 && result != -ECONNABORTED ?
+	return result != -ECONNABORTED && rdata->got_bytes > 0 ?
 		rdata->got_bytes : result;
 }
From: Eric Dumazet <edumazet@google.com>

mainline inclusion
from mainline-v6.0-rc1
commit ba44f8182ec299c5d1c8a72fc0fde4ec127b5a6d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In order to prepare for the following patch, I change the raw v4 & v6 code to use more conventional iterators.
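Schematically, the conversion replaces the hand-rolled cursor, where the lookup helper doubled as the iterator, with a standard list walk plus a boolean match helper (deliver() stands in for the per-socket work and is not a real function):

  /* old: caller advances the cursor through the lookup helper */
  sk = __raw_v4_lookup(net, __sk_head(head), ...);
  while (sk) {
	deliver(sk);
	sk = __raw_v4_lookup(net, sk_next(sk), ...);
  }

  /* new: conventional iterator; the helper only tests one socket */
  sk_for_each(sk, head) {
	if (!raw_v4_match(net, sk, ...))
		continue;
	deliver(sk);
  }

This separation is what later lets the iteration be switched to an RCU-safe variant without touching the matching logic.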
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/net/raw.h   |   5 +--
 include/net/rawv6.h |   6 +--
 net/ipv4/raw.c      |  93 ++++++++++++++-------------------------
 net/ipv4/raw_diag.c |  33 +++++++-------
 net/ipv6/raw.c      | 105 ++++++++++++++++----------------------------
 5 files changed, 92 insertions(+), 150 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index c51a635671a7..6324965779ec 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -20,9 +20,8 @@ extern struct proto raw_prot;

 extern struct raw_hashinfo raw_v4_hashinfo;
-struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
-			     unsigned short num, __be32 raddr,
-			     __be32 laddr, int dif, int sdif);
+bool raw_v4_match(struct net *net, struct sock *sk, unsigned short num,
+		  __be32 raddr, __be32 laddr, int dif, int sdif);

 int raw_abort(struct sock *sk, int err);
 void raw_icmp_error(struct sk_buff *, int, u32);
diff --git a/include/net/rawv6.h b/include/net/rawv6.h
index 53d86b6055e8..c48c1298699a 100644
--- a/include/net/rawv6.h
+++ b/include/net/rawv6.h
@@ -5,9 +5,9 @@
 #include <net/protocol.h>

 extern struct raw_hashinfo raw_v6_hashinfo;
-struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
-			     unsigned short num, const struct in6_addr *loc_addr,
-			     const struct in6_addr *rmt_addr, int dif, int sdif);
+bool raw_v6_match(struct net *net, struct sock *sk, unsigned short num,
+		  const struct in6_addr *loc_addr,
+		  const struct in6_addr *rmt_addr, int dif, int sdif);

 int raw_abort(struct sock *sk, int err);

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 4899ebe569eb..8d1e8fc28cf9 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -117,24 +117,19 @@ void raw_unhash_sk(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(raw_unhash_sk);

-struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
-			     unsigned short num, __be32 raddr, __be32 laddr,
-			     int dif, int sdif)
+bool raw_v4_match(struct net *net, struct sock *sk, unsigned short num,
+		  __be32 raddr, __be32 laddr, int dif, int sdif)
 {
-	sk_for_each_from(sk) {
-		struct inet_sock *inet = inet_sk(sk);
-
-		if (net_eq(sock_net(sk), net) && inet->inet_num == num &&
-		    !(inet->inet_daddr && inet->inet_daddr != raddr) &&
-		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
-			goto found; /* gotcha */
-	}
-	sk = NULL;
-found:
-	return sk;
+	struct inet_sock *inet = inet_sk(sk);
+
+	if (net_eq(sock_net(sk), net) && inet->inet_num == num &&
+	    !(inet->inet_daddr && inet->inet_daddr != raddr) &&
+	    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
+	    raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
+		return true;
+	return false;
 }
-EXPORT_SYMBOL_GPL(__raw_v4_lookup);
+EXPORT_SYMBOL_GPL(raw_v4_match);

 /*
  *	0 - deliver
@@ -168,23 +163,21 @@ static int icmp_filter(const struct sock *sk, const struct sk_buff *skb)
  */
 static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 {
+	struct net *net = dev_net(skb->dev);
 	int sdif = inet_sdif(skb);
 	int dif = inet_iif(skb);
-	struct sock *sk;
 	struct hlist_head *head;
 	int delivered = 0;
-	struct net *net;
+	struct sock *sk;

-	read_lock(&raw_v4_hashinfo.lock);
 	head = &raw_v4_hashinfo.ht[hash];
 	if (hlist_empty(head))
-		goto out;
-
-	net = dev_net(skb->dev);
-	sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol,
-			     iph->saddr, iph->daddr, dif, sdif);
-
-	while (sk) {
+		return 0;
+	read_lock(&raw_v4_hashinfo.lock);
+	sk_for_each(sk, head) {
+		if (!raw_v4_match(net, sk, iph->protocol,
+				  iph->saddr, iph->daddr, dif, sdif))
+			continue;
 		delivered = 1;
 		if ((iph->protocol != IPPROTO_ICMP || !icmp_filter(sk, skb)) &&
 		    ip_mc_sf_allow(sk, iph->daddr, iph->saddr,
@@ -195,31 +188,16 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 			if (clone)
 				raw_rcv(sk, clone);
 		}
-		sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol,
-				     iph->saddr, iph->daddr,
-				     dif, sdif);
 	}
-out:
 	read_unlock(&raw_v4_hashinfo.lock);
 	return delivered;
 }

 int raw_local_deliver(struct sk_buff *skb, int protocol)
 {
-	int hash;
-	struct sock *raw_sk;
-
-	hash = protocol & (RAW_HTABLE_SIZE - 1);
-	raw_sk = sk_head(&raw_v4_hashinfo.ht[hash]);
-
-	/* If there maybe a raw socket we must check - if not we
-	 * don't care less
-	 */
-	if (raw_sk && !raw_v4_input(skb, ip_hdr(skb), hash))
-		raw_sk = NULL;
-
-	return raw_sk != NULL;
+	int hash = protocol & (RAW_HTABLE_SIZE - 1);

+	return raw_v4_input(skb, ip_hdr(skb), hash);
 }

 static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info)
@@ -286,29 +264,24 @@ static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info)

 void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 {
-	int hash;
-	struct sock *raw_sk;
+	struct net *net = dev_net(skb->dev);;
+	int dif = skb->dev->ifindex;
+	int sdif = inet_sdif(skb);
+	struct hlist_head *head;
 	const struct iphdr *iph;
-	struct net *net;
+	struct sock *sk;
+	int hash;

 	hash = protocol & (RAW_HTABLE_SIZE - 1);
+	head = &raw_v4_hashinfo.ht[hash];

 	read_lock(&raw_v4_hashinfo.lock);
-	raw_sk = sk_head(&raw_v4_hashinfo.ht[hash]);
-	if (raw_sk) {
-		int dif = skb->dev->ifindex;
-		int sdif = inet_sdif(skb);
-
+	sk_for_each(sk, head) {
 		iph = (const struct iphdr *)skb->data;
-		net = dev_net(skb->dev);
-
-		while ((raw_sk = __raw_v4_lookup(net, raw_sk, protocol,
-						 iph->daddr, iph->saddr,
-						 dif, sdif)) != NULL) {
-			raw_err(raw_sk, skb, info);
-			raw_sk = sk_next(raw_sk);
-			iph = (const struct iphdr *)skb->data;
-		}
+		if (!raw_v4_match(net, sk, iph->protocol,
+				  iph->saddr, iph->daddr, dif, sdif))
+			continue;
+		raw_err(sk, skb, info);
 	}
 	read_unlock(&raw_v4_hashinfo.lock);
 }
diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index 1b5b8af27aaf..5d811cd02f16 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -34,31 +34,30 @@ raw_get_hashinfo(const struct inet_diag_req_v2 *r)
  * use helper to figure it out.
  */

-static struct sock *raw_lookup(struct net *net, struct sock *from,
-			       const struct inet_diag_req_v2 *req)
+static bool raw_lookup(struct net *net, struct sock *sk,
+		       const struct inet_diag_req_v2 *req)
 {
 	struct inet_diag_req_raw *r = (void *)req;
-	struct sock *sk = NULL;

 	if (r->sdiag_family == AF_INET)
-		sk = __raw_v4_lookup(net, from, r->sdiag_raw_protocol,
-				     r->id.idiag_dst[0],
-				     r->id.idiag_src[0],
-				     r->id.idiag_if, 0);
+		return raw_v4_match(net, sk, r->sdiag_raw_protocol,
+				    r->id.idiag_dst[0],
+				    r->id.idiag_src[0],
+				    r->id.idiag_if, 0);
 #if IS_ENABLED(CONFIG_IPV6)
 	else
-		sk = __raw_v6_lookup(net, from, r->sdiag_raw_protocol,
-				     (const struct in6_addr *)r->id.idiag_src,
-				     (const struct in6_addr *)r->id.idiag_dst,
-				     r->id.idiag_if, 0);
+		return raw_v6_match(net, sk, r->sdiag_raw_protocol,
+				    (const struct in6_addr *)r->id.idiag_src,
+				    (const struct in6_addr *)r->id.idiag_dst,
+				    r->id.idiag_if, 0);
 #endif
-	return sk;
+	return false;
 }

 static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
 {
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
-	struct sock *sk = NULL, *s;
+	struct sock *sk;
 	int slot;

 	if (IS_ERR(hashinfo))
@@ -66,9 +65,8 @@ static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2

 	read_lock(&hashinfo->lock);
 	for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
-		sk_for_each(s, &hashinfo->ht[slot]) {
-			sk = raw_lookup(net, s, r);
-			if (sk) {
+		sk_for_each(sk, &hashinfo->ht[slot]) {
+			if (raw_lookup(net, sk, r)) {
 				/*
 				 * Grab it and keep until we fill
 				 * diag meaage to be reported, so
@@ -81,10 +79,11 @@ static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2
 			}
 		}
 	}
+	sk = ERR_PTR(-ENOENT);
 out_unlock:
 	read_unlock(&hashinfo->lock);

-	return sk ? sk : ERR_PTR(-ENOENT);
+	return sk;
 }

 static int raw_diag_dump_one(struct netlink_callback *cb,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 110254f44a46..88f89bf1d985 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -66,41 +66,27 @@ struct raw_hashinfo raw_v6_hashinfo = {
 };
 EXPORT_SYMBOL_GPL(raw_v6_hashinfo);

-struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
-			     unsigned short num, const struct in6_addr *loc_addr,
-			     const struct in6_addr *rmt_addr, int dif, int sdif)
+bool raw_v6_match(struct net *net, struct sock *sk, unsigned short num,
+		  const struct in6_addr *loc_addr,
+		  const struct in6_addr *rmt_addr, int dif, int sdif)
 {
-	bool is_multicast = ipv6_addr_is_multicast(loc_addr);
-
-	sk_for_each_from(sk)
-		if (inet_sk(sk)->inet_num == num) {
-
-			if (!net_eq(sock_net(sk), net))
-				continue;
-
-			if (!ipv6_addr_any(&sk->sk_v6_daddr) &&
-			    !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
-				continue;
-
-			if (!raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
-						 dif, sdif))
-				continue;
-
-			if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-				if (ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr))
-					goto found;
-				if (is_multicast &&
-				    inet6_mc_check(sk, loc_addr, rmt_addr))
-					goto found;
-				continue;
-			}
-			goto found;
-		}
-	sk = NULL;
-found:
-	return sk;
+	if (inet_sk(sk)->inet_num != num ||
+	    !net_eq(sock_net(sk), net) ||
+	    (!ipv6_addr_any(&sk->sk_v6_daddr) &&
+	     !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr)) ||
+	    !raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+				 dif, sdif))
+		return false;
+
+	if (ipv6_addr_any(&sk->sk_v6_rcv_saddr) ||
+	    ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr) ||
+	    (ipv6_addr_is_multicast(loc_addr) &&
+	     inet6_mc_check(sk, loc_addr, rmt_addr)))
+		return true;
+
+	return false;
 }
-EXPORT_SYMBOL_GPL(__raw_v6_lookup);
+EXPORT_SYMBOL_GPL(raw_v6_match);

 /*
  *	0 - deliver
@@ -156,31 +142,28 @@ EXPORT_SYMBOL(rawv6_mh_filter_unregister);
  */
 static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 {
+	struct net *net = dev_net(skb->dev);
 	const struct in6_addr *saddr;
 	const struct in6_addr *daddr;
+	struct hlist_head *head;
 	struct sock *sk;
 	bool delivered = false;
 	__u8 hash;
-	struct net *net;

 	saddr = &ipv6_hdr(skb)->saddr;
 	daddr = saddr + 1;

 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
-
+	head = &raw_v6_hashinfo.ht[hash];
+	if (hlist_empty(head))
+		return false;
 	read_lock(&raw_v6_hashinfo.lock);
-	sk = sk_head(&raw_v6_hashinfo.ht[hash]);
-
-	if (!sk)
-		goto out;
-
-	net = dev_net(skb->dev);
-	sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr,
-			     inet6_iif(skb), inet6_sdif(skb));
-
-	while (sk) {
+	sk_for_each(sk, head) {
 		int filtered;

+		if (!raw_v6_match(net, sk, nexthdr, daddr, saddr,
+				  inet6_iif(skb), inet6_sdif(skb)))
+			continue;
 		delivered = true;
 		switch (nexthdr) {
 		case IPPROTO_ICMPV6:
@@ -219,23 +202,14 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 				rawv6_rcv(sk, clone);
 			}
 		}
-		sk = __raw_v6_lookup(net, sk_next(sk), nexthdr, daddr, saddr,
-				     inet6_iif(skb), inet6_sdif(skb));
 	}
-out:
 	read_unlock(&raw_v6_hashinfo.lock);
 	return delivered;
 }

 bool raw6_local_deliver(struct sk_buff *skb, int nexthdr)
 {
-	struct sock *raw_sk;
-
-	raw_sk = sk_head(&raw_v6_hashinfo.ht[nexthdr & (RAW_HTABLE_SIZE - 1)]);
-	if (raw_sk && !ipv6_raw_deliver(skb, nexthdr))
-		raw_sk = NULL;
-
-	return raw_sk != NULL;
+	return ipv6_raw_deliver(skb, nexthdr);
 }

 /* This cleans up af_inet6 a bit. -DaveM */
@@ -361,28 +335,25 @@ static void rawv6_err(struct sock *sk, struct sk_buff *skb,
 void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 		u8 type, u8 code, int inner_offset, __be32 info)
 {
+	const struct in6_addr *saddr, *daddr;
+	struct net *net = dev_net(skb->dev);
+	struct hlist_head *head;
 	struct sock *sk;
 	int hash;
-	const struct in6_addr *saddr, *daddr;
-	struct net *net;

 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
-
+	head = &raw_v6_hashinfo.ht[hash];
 	read_lock(&raw_v6_hashinfo.lock);
-	sk = sk_head(&raw_v6_hashinfo.ht[hash]);
-	if (sk) {
+	sk_for_each(sk, head) {
 		/* Note: ipv6_hdr(skb) != skb->data */
 		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data;
 		saddr = &ip6h->saddr;
 		daddr = &ip6h->daddr;
-		net = dev_net(skb->dev);

-		while ((sk = __raw_v6_lookup(net, sk, nexthdr, saddr, daddr,
-					     inet6_iif(skb), inet6_iif(skb)))) {
-			rawv6_err(sk, skb, NULL, type, code,
-				  inner_offset, info);
-			sk = sk_next(sk);
-		}
+		if (!raw_v6_match(net, sk, nexthdr, &ip6h->saddr, &ip6h->daddr,
+				  inet6_iif(skb), inet6_iif(skb)))
+			continue;
+		rawv6_err(sk, skb, NULL, type, code, inner_offset, info);
 	}
 	read_unlock(&raw_v6_hashinfo.lock);
 }
From: Eric Dumazet <edumazet@google.com>

mainline inclusion
from mainline-v6.0-rc1
commit 0daf07e527095e64ee8927ce297ab626643e9f51
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Using rwlock in networking code is extremely risky. Writers can starve if enough readers are constantly grabbing the rwlock.
I thought rwlocks were at fault and sent this patch:
https://lkml.org/lkml/2022/6/17/272
But Peter and Linus essentially told me rwlock had to be unfair.
We need to get rid of rwlock in networking code.
Without this fix, the following script triggers soft lockups:
for i in {1..48}
do
	ping -f -n -q 127.0.0.1 &
	sleep 0.1
done
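After the conversion the reader side takes no lock at all; schematically (fragment of the raw_v4_input() hunk below):

  hlist = &raw_v4_hashinfo.ht[hash];
  rcu_read_lock();
  hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
	if (!raw_v4_match(net, sk, iph->protocol,
			  iph->saddr, iph->daddr, dif, sdif))
		continue;
	/* deliver the skb to this matching socket */
  }
  rcu_read_unlock();

Writers (raw_hash_sk()/raw_unhash_sk()) still serialize against each other with the hashinfo lock, the nulls marker lets a lockless walk detect being moved to another chain, and SOCK_RCU_FREE defers freeing the socket past a grace period so readers never touch freed memory.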
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts:
	net/ipv4/raw.c
	net/ipv6/af_inet6.c

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/net/raw.h   | 11 +++++-
 include/net/rawv6.h |  1 +
 net/ipv4/af_inet.c  |  2 ++
 net/ipv4/raw.c      | 83 +++++++++++++++++++++------------------
 net/ipv4/raw_diag.c | 22 +++++++-----
 net/ipv6/af_inet6.c |  3 ++
 net/ipv6/raw.c      | 28 +++++++--------
 7 files changed, 80 insertions(+), 70 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 6324965779ec..537d9d1df890 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -33,9 +33,18 @@ int raw_rcv(struct sock *, struct sk_buff *);

 struct raw_hashinfo {
 	rwlock_t lock;
-	struct hlist_head ht[RAW_HTABLE_SIZE];
+	struct hlist_nulls_head ht[RAW_HTABLE_SIZE];
 };

+static inline void raw_hashinfo_init(struct raw_hashinfo *hashinfo)
+{
+	int i;
+
+	rwlock_init(&hashinfo->lock);
+	for (i = 0; i < RAW_HTABLE_SIZE; i++)
+		INIT_HLIST_NULLS_HEAD(&hashinfo->ht[i], i);
+}
+
 #ifdef CONFIG_PROC_FS
 int raw_proc_init(void);
 void raw_proc_exit(void);
diff --git a/include/net/rawv6.h b/include/net/rawv6.h
index c48c1298699a..bc70909625f6 100644
--- a/include/net/rawv6.h
+++ b/include/net/rawv6.h
@@ -3,6 +3,7 @@
 #define _NET_RAWV6_H

 #include <net/protocol.h>
+#include <net/raw.h>

 extern struct raw_hashinfo raw_v6_hashinfo;
 bool raw_v6_match(struct net *net, struct sock *sk, unsigned short num,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 0e5b739ae2c1..4ddbfccfd0c8 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1940,6 +1940,8 @@ static int __init inet_init(void)

 	sock_skb_cb_check_size(sizeof(struct inet_skb_parm));

+	raw_hashinfo_init(&raw_v4_hashinfo);
+
 	rc = proto_register(&tcp_prot, 1);
 	if (rc)
 		goto out;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8d1e8fc28cf9..d5740cf30d2c 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -85,20 +85,19 @@ struct raw_frag_vec {
 	int hlen;
 };

-struct raw_hashinfo raw_v4_hashinfo = {
-	.lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock),
-};
+struct raw_hashinfo raw_v4_hashinfo;
 EXPORT_SYMBOL_GPL(raw_v4_hashinfo);

 int raw_hash_sk(struct sock *sk)
 {
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
-	struct hlist_head *head;
+	struct hlist_nulls_head *hlist;

-	head = &h->ht[inet_sk(sk)->inet_num & (RAW_HTABLE_SIZE - 1)];
+	hlist = &h->ht[inet_sk(sk)->inet_num & (RAW_HTABLE_SIZE - 1)];

 	write_lock_bh(&h->lock);
-	sk_add_node(sk, head);
+	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, hlist);
+	sock_set_flag(sk, SOCK_RCU_FREE);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 	write_unlock_bh(&h->lock);

@@ -111,7 +110,7 @@ void raw_unhash_sk(struct sock *sk)
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;

 	write_lock_bh(&h->lock);
-	if (sk_del_node_init(sk))
+	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 	write_unlock_bh(&h->lock);
 }
@@ -164,17 +163,16 @@ static int icmp_filter(const struct sock *sk, const struct sk_buff *skb)
 static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 {
 	struct net *net = dev_net(skb->dev);
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
 	int sdif = inet_sdif(skb);
 	int dif = inet_iif(skb);
-	struct hlist_head *head;
 	int delivered = 0;
 	struct sock *sk;

-	head = &raw_v4_hashinfo.ht[hash];
-	if (hlist_empty(head))
-		return 0;
-	read_lock(&raw_v4_hashinfo.lock);
-	sk_for_each(sk, head) {
+	hlist = &raw_v4_hashinfo.ht[hash];
+	rcu_read_lock();
+	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->saddr, iph->daddr, dif, sdif))
 			continue;
@@ -189,7 +187,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 			raw_rcv(sk, clone);
 		}
 	}
-	read_unlock(&raw_v4_hashinfo.lock);
+	rcu_read_unlock();
 	return delivered;
 }

@@ -265,25 +263,26 @@ static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info)
 void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 {
 	struct net *net = dev_net(skb->dev);;
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
 	int dif = skb->dev->ifindex;
 	int sdif = inet_sdif(skb);
-	struct hlist_head *head;
 	const struct iphdr *iph;
 	struct sock *sk;
 	int hash;

 	hash = protocol & (RAW_HTABLE_SIZE - 1);
-	head = &raw_v4_hashinfo.ht[hash];
+	hlist = &raw_v4_hashinfo.ht[hash];

-	read_lock(&raw_v4_hashinfo.lock);
-	sk_for_each(sk, head) {
+	rcu_read_lock();
+	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 		iph = (const struct iphdr *)skb->data;
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->saddr, iph->daddr, dif, sdif))
 			continue;
 		raw_err(sk, skb, info);
 	}
-	read_unlock(&raw_v4_hashinfo.lock);
+	rcu_read_unlock();
 }

 static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
@@ -943,44 +942,41 @@ struct proto raw_prot = {
 };

 #ifdef CONFIG_PROC_FS
-static struct sock *raw_get_first(struct seq_file *seq)
+static struct sock *raw_get_first(struct seq_file *seq, int bucket)
 {
-	struct sock *sk;
 	struct raw_hashinfo *h = PDE_DATA(file_inode(seq->file));
 	struct raw_iter_state *state = raw_seq_private(seq);
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
+	struct sock *sk;

-	for (state->bucket = 0; state->bucket < RAW_HTABLE_SIZE;
+	for (state->bucket = bucket; state->bucket < RAW_HTABLE_SIZE;
 			++state->bucket) {
-		sk_for_each(sk, &h->ht[state->bucket])
+		hlist = &h->ht[state->bucket];
+		hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 			if (sock_net(sk) == seq_file_net(seq))
-				goto found;
+				return sk;
+		}
 	}
-	sk = NULL;
-found:
-	return sk;
+	return NULL;
 }

 static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk)
 {
-	struct raw_hashinfo *h = PDE_DATA(file_inode(seq->file));
 	struct raw_iter_state *state = raw_seq_private(seq);

 	do {
-		sk = sk_next(sk);
-try_again:
-		;
+		sk = sk_nulls_next(sk);
 	} while (sk && sock_net(sk) != seq_file_net(seq));

-	if (!sk && ++state->bucket < RAW_HTABLE_SIZE) {
-		sk = sk_head(&h->ht[state->bucket]);
-		goto try_again;
-	}
+	if (!sk)
+		return raw_get_first(seq, state->bucket + 1);
 	return sk;
 }

 static struct sock *raw_get_idx(struct seq_file *seq, loff_t pos)
 {
-	struct sock *sk = raw_get_first(seq);
+	struct sock *sk = raw_get_first(seq, 0);

 	if (sk)
 		while (pos && (sk = raw_get_next(seq, sk)) != NULL)
@@ -989,11 +985,9 @@ static struct sock *raw_get_idx(struct seq_file *seq, loff_t pos)
 }

 void *raw_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(&h->lock)
+	__acquires(RCU)
 {
-	struct raw_hashinfo *h = PDE_DATA(file_inode(seq->file));
-
-	read_lock(&h->lock);
+	rcu_read_lock();
 	return *pos ? raw_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
 }
 EXPORT_SYMBOL_GPL(raw_seq_start);
@@ -1003,7 +997,7 @@ void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	struct sock *sk;

 	if (v == SEQ_START_TOKEN)
-		sk = raw_get_first(seq);
+		sk = raw_get_first(seq, 0);
 	else
 		sk = raw_get_next(seq, v);
 	++*pos;
@@ -1012,11 +1006,9 @@ void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 EXPORT_SYMBOL_GPL(raw_seq_next);

 void raw_seq_stop(struct seq_file *seq, void *v)
-	__releases(&h->lock)
+	__releases(RCU)
 {
-	struct raw_hashinfo *h = PDE_DATA(file_inode(seq->file));
-
-	read_unlock(&h->lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);

@@ -1078,6 +1070,7 @@ static __net_initdata struct pernet_operations raw_net_ops = {

 int __init raw_proc_init(void)
 {
+
 	return register_pernet_subsys(&raw_net_ops);
 }

diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index 5d811cd02f16..3ea637ce8962 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -57,31 +57,32 @@ static bool raw_lookup(struct net *net, struct sock *sk,
 static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
 {
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
 	struct sock *sk;
 	int slot;

 	if (IS_ERR(hashinfo))
 		return ERR_CAST(hashinfo);

-	read_lock(&hashinfo->lock);
+	rcu_read_lock();
 	for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
-		sk_for_each(sk, &hashinfo->ht[slot]) {
+		hlist = &hashinfo->ht[slot];
+		hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 			if (raw_lookup(net, sk, r)) {
 				/*
 				 * Grab it and keep until we fill
-				 * diag meaage to be reported, so
+				 * diag message to be reported, so
 				 * caller should call sock_put then.
-				 * We can do that because we're keeping
-				 * hashinfo->lock here.
 				 */
-				sock_hold(sk);
-				goto out_unlock;
+				if (refcount_inc_not_zero(&sk->sk_refcnt))
+					goto out_unlock;
 			}
 		}
 	}
 	sk = ERR_PTR(-ENOENT);
 out_unlock:
-	read_unlock(&hashinfo->lock);
+	rcu_read_unlock();

 	return sk;
 }
@@ -144,6 +145,8 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
 	struct net *net = sock_net(skb->sk);
 	struct inet_diag_dump_data *cb_data;
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
 	int num, s_num, slot, s_slot;
 	struct sock *sk = NULL;
 	struct nlattr *bc;
@@ -160,7 +163,8 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	for (slot = s_slot; slot < RAW_HTABLE_SIZE; s_num = 0, slot++) {
 		num = 0;

-		sk_for_each(sk, &hashinfo->ht[slot]) {
+		hlist = &hashinfo->ht[slot];
+		hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 			struct inet_sock *inet = inet_sk(sk);

 			if (!net_eq(sock_net(sk), net))
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index c4ef6aad924b..ce066d78ad75 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -62,6 +62,7 @@
 #include <net/rpl.h>
 #include <net/compat.h>
 #include <net/xfrm.h>
+#include <net/rawv6.h>

 #include <linux/uaccess.h>
 #include <linux/mroute6.h>
@@ -1064,6 +1065,8 @@ static int __init inet6_init(void)
 		goto out;
 	}

+	raw_hashinfo_init(&raw_v6_hashinfo);
+
 	err = proto_register(&tcpv6_prot, 1);
 	if (err)
 		goto out;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 88f89bf1d985..60b395d7c310 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -61,9 +61,7 @@

 #define ICMPV6_HDRLEN	4	/* ICMPv6 header, RFC 4443 Section 2.1 */

-struct raw_hashinfo raw_v6_hashinfo = {
-	.lock = __RW_LOCK_UNLOCKED(raw_v6_hashinfo.lock),
-};
+struct raw_hashinfo raw_v6_hashinfo;
 EXPORT_SYMBOL_GPL(raw_v6_hashinfo);

 bool raw_v6_match(struct net *net, struct sock *sk, unsigned short num,
@@ -143,9 +141,10 @@ EXPORT_SYMBOL(rawv6_mh_filter_unregister);
 static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 {
 	struct net *net = dev_net(skb->dev);
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
 	const struct in6_addr *saddr;
 	const struct in6_addr *daddr;
-	struct hlist_head *head;
 	struct sock *sk;
 	bool delivered = false;
 	__u8 hash;
@@ -154,11 +153,9 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 	daddr = saddr + 1;

 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
-	head = &raw_v6_hashinfo.ht[hash];
-	if (hlist_empty(head))
-		return false;
-	read_lock(&raw_v6_hashinfo.lock);
-	sk_for_each(sk, head) {
+	hlist = &raw_v6_hashinfo.ht[hash];
+	rcu_read_lock();
+	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 		int filtered;

 		if (!raw_v6_match(net, sk, nexthdr, daddr, saddr,
@@ -203,7 +200,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 			}
 		}
 	}
-	read_unlock(&raw_v6_hashinfo.lock);
+	rcu_read_unlock();
 	return delivered;
 }

@@ -337,14 +334,15 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 {
 	const struct in6_addr *saddr, *daddr;
 	struct net *net = dev_net(skb->dev);
-	struct hlist_head *head;
+	struct hlist_nulls_head *hlist;
+	struct hlist_nulls_node *hnode;
 	struct sock *sk;
 	int hash;

 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
-	head = &raw_v6_hashinfo.ht[hash];
-	read_lock(&raw_v6_hashinfo.lock);
-	sk_for_each(sk, head) {
+	hlist = &raw_v6_hashinfo.ht[hash];
+	rcu_read_lock();
+	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
 		/* Note: ipv6_hdr(skb) != skb->data */
 		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data;
 		saddr = &ip6h->saddr;
@@ -355,7 +353,7 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 			continue;
 		rawv6_err(sk, skb, NULL, type, code, inner_offset, info);
 	}
-	read_unlock(&raw_v6_hashinfo.lock);
+	rcu_read_unlock();
 }

 static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb)
From: Kuniyuki Iwashima <kuniyu@amazon.com>

mainline inclusion
from mainline-v6.0-rc1
commit 5da39e31b1b0eb62b8ed369ad9615da850239e9e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The trailing semicolon causes a compiler error, so let's remove it.
net/ipv4/raw.c: In function ‘raw_icmp_error’:
net/ipv4/raw.c:266:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
  266 |  struct hlist_nulls_head *hlist;
      |  ^~~~~~
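A reduced illustration of the failure mode (hypothetical example, not the kernel code verbatim): the second semicolon is an empty statement, so the declaration that follows it becomes a declaration after a statement, which C90 forbids:

  void f(struct sk_buff *skb)
  {
	struct net *net = dev_net(skb->dev);;  /* ';;' = declaration + empty statement */
	struct hlist_nulls_head *hlist;        /* declaration now follows a statement */
  }

This is why the error points at the hlist declaration on line 266 rather than at the stray semicolon itself.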
Fixes: ba44f8182ec2 ("raw: use more conventional iterators")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/ipv4/raw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index d5740cf30d2c..26569cee206b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -262,7 +262,7 @@ static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info)

 void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 {
-	struct net *net = dev_net(skb->dev);;
+	struct net *net = dev_net(skb->dev);
 	struct hlist_nulls_head *hlist;
 	struct hlist_nulls_node *hnode;
 	int dif = skb->dev->ifindex;
From: Kuniyuki Iwashima <kuniyu@amazon.com>

mainline inclusion
from mainline-v6.0-rc1
commit f289c02bf41b55fbfccf21d72c4ac44cd4a7a107
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry() have dedicated macros for sk.
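For reference, the dedicated helpers are thin wrappers over the generic nulls primitives; a sketch of their definitions from include/net/sock.h (line breaks may differ from the tree):

  static inline void __sk_nulls_add_node_rcu(struct sock *sk,
					     struct hlist_nulls_head *list)
  {
	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
  }

  #define sk_nulls_for_each(__sk, node, list) \
	hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)

So the patch is purely cosmetic: the sk_nulls_node member no longer has to be spelled out at every call site.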
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/ipv4/raw.c      | 8 ++++----
 net/ipv4/raw_diag.c | 4 ++--
 net/ipv6/raw.c      | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 26569cee206b..0056c2df5ef1 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -96,7 +96,7 @@ int raw_hash_sk(struct sock *sk)
 	hlist = &h->ht[inet_sk(sk)->inet_num & (RAW_HTABLE_SIZE - 1)];

 	write_lock_bh(&h->lock);
-	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, hlist);
+	__sk_nulls_add_node_rcu(sk, hlist);
 	sock_set_flag(sk, SOCK_RCU_FREE);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 	write_unlock_bh(&h->lock);
@@ -172,7 +172,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)

 	hlist = &raw_v4_hashinfo.ht[hash];
 	rcu_read_lock();
-	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+	sk_nulls_for_each(sk, hnode, hlist) {
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->saddr, iph->daddr, dif, sdif))
 			continue;
@@ -275,7 +275,7 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	hlist = &raw_v4_hashinfo.ht[hash];

 	rcu_read_lock();
-	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+	sk_nulls_for_each(sk, hnode, hlist) {
 		iph = (const struct iphdr *)skb->data;
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->saddr, iph->daddr, dif, sdif))
@@ -953,7 +953,7 @@ static struct sock *raw_get_first(struct seq_file *seq, int bucket)
 	for (state->bucket = bucket; state->bucket < RAW_HTABLE_SIZE;
 			++state->bucket) {
 		hlist = &h->ht[state->bucket];
-		hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+		sk_nulls_for_each(sk, hnode, hlist) {
 			if (sock_net(sk) == seq_file_net(seq))
 				return sk;
 		}
diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index 3ea637ce8962..e88f4ba807f1 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -68,7 +68,7 @@ static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2
 	rcu_read_lock();
 	for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
 		hlist = &hashinfo->ht[slot];
-		hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+		sk_nulls_for_each(sk, hnode, hlist) {
 			if (raw_lookup(net, sk, r)) {
 				/*
 				 * Grab it and keep until we fill
@@ -164,7 +164,7 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			num = 0;

 			hlist = &hashinfo->ht[slot];
-			hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+			sk_nulls_for_each(sk, hnode, hlist) {
 				struct inet_sock *inet = inet_sk(sk);

 				if (!net_eq(sock_net(sk), net))
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 60b395d7c310..692d771de914 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -155,7 +155,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
 	hlist = &raw_v6_hashinfo.ht[hash];
 	rcu_read_lock();
-	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+	sk_nulls_for_each(sk, hnode, hlist) {
 		int filtered;

 		if (!raw_v6_match(net, sk, nexthdr, daddr, saddr,
@@ -342,7 +342,7 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
 	hlist = &raw_v6_hashinfo.ht[hash];
 	rcu_read_lock();
-	hlist_nulls_for_each_entry(sk, hnode, hlist, sk_nulls_node) {
+	sk_nulls_for_each(sk, hnode, hlist) {
 		/* Note: ipv6_hdr(skb) != skb->data */
 		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data;
 		saddr = &ip6h->saddr;
From: Eric Dumazet <edumazet@google.com>

mainline inclusion
from mainline-v6.0-rc1
commit af185d8c76333daa877678e0166a7b45e63bf3c4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
raw_diag_dump() can use rcu_read_lock() instead of read_lock().

Now that the hashinfo lock is only taken from process context, and in write mode only, we can convert it to a spinlock; we also no longer need to block BH.
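The change itself is mechanical; schematically (mirroring the raw.c hunks below):

  /* before: readers ran in softirq context, so writers had to use the
   * reader/writer lock and block BH while holding it
   */
  write_lock_bh(&h->lock);
  ...
  write_unlock_bh(&h->lock);

  /* after: readers use RCU, so the lock only serializes writers, which
   * all run in process context; a plain spinlock without _bh suffices
   */
  spin_lock(&h->lock);
  ...
  spin_unlock(&h->lock);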
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220620100509.3493504-1-eric.dumazet@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Conflicts:
	net/ipv4/raw.c

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/net/raw.h   | 4 ++--
 net/ipv4/raw.c      | 8 ++++----
 net/ipv4/raw_diag.c | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 537d9d1df890..5e665934ebc7 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -32,7 +32,7 @@ int raw_rcv(struct sock *, struct sk_buff *);
 #define RAW_HTABLE_SIZE	MAX_INET_PROTOS

 struct raw_hashinfo {
-	rwlock_t lock;
+	spinlock_t lock;
 	struct hlist_nulls_head ht[RAW_HTABLE_SIZE];
 };

@@ -40,7 +40,7 @@ static inline void raw_hashinfo_init(struct raw_hashinfo *hashinfo)
 {
 	int i;

-	rwlock_init(&hashinfo->lock);
+	spin_lock_init(&hashinfo->lock);
 	for (i = 0; i < RAW_HTABLE_SIZE; i++)
 		INIT_HLIST_NULLS_HEAD(&hashinfo->ht[i], i);
 }
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 0056c2df5ef1..08b2578a8e93 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -95,11 +95,11 @@ int raw_hash_sk(struct sock *sk)

 	hlist = &h->ht[inet_sk(sk)->inet_num & (RAW_HTABLE_SIZE - 1)];

-	write_lock_bh(&h->lock);
+	spin_lock(&h->lock);
 	__sk_nulls_add_node_rcu(sk, hlist);
 	sock_set_flag(sk, SOCK_RCU_FREE);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
-	write_unlock_bh(&h->lock);
+	spin_unlock(&h->lock);

 	return 0;
 }
@@ -109,10 +109,10 @@ void raw_unhash_sk(struct sock *sk)
 {
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;

-	write_lock_bh(&h->lock);
+	spin_lock(&h->lock);
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-	write_unlock_bh(&h->lock);
+	spin_unlock(&h->lock);
 }
 EXPORT_SYMBOL_GPL(raw_unhash_sk);

diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index e88f4ba807f1..a5fcca24f9eb 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -159,7 +159,7 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	s_slot = cb->args[0];
 	num = s_num = cb->args[1];

-	read_lock(&hashinfo->lock);
+	rcu_read_lock();
 	for (slot = s_slot; slot < RAW_HTABLE_SIZE; s_num = 0, slot++) {
 		num = 0;

@@ -187,7 +187,7 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	}

 out_unlock:
-	read_unlock(&hashinfo->lock);
+	rcu_read_unlock();

 	cb->args[0] = slot;
 	cb->args[1] = num;
From: Eric Dumazet <edumazet@google.com>

mainline inclusion
from mainline-v6.0-rc1
commit 97a4d46b1516250d640c1ae0c9e7129d160d6a1c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
I accidentally broke IPv4 traceroute by swapping iph->saddr and iph->daddr.

Probably because raw_icmp_error() and raw_v4_input() use a different order for iph->saddr and iph->daddr: for an ICMP error, skb->data points at the header of the original outgoing packet, so iph->saddr is the local address and iph->daddr the remote one, while raw_v4_match() takes its address arguments in (raddr, laddr) order.
Fixes: ba44f8182ec2 ("raw: use more conventional iterators")
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220623193540.2851799-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/ipv4/raw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 08b2578a8e93..7963638f9d15 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -278,7 +278,7 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	sk_nulls_for_each(sk, hnode, hlist) {
 		iph = (const struct iphdr *)skb->data;
 		if (!raw_v4_match(net, sk, iph->protocol,
-				  iph->saddr, iph->daddr, dif, sdif))
+				  iph->daddr, iph->saddr, dif, sdif))
 			continue;
 		raw_err(sk, skb, info);
 	}
From: Eric Dumazet <edumazet@google.com>

mainline inclusion
from mainline-v6.0-rc1
commit c4fceb46add65481ef0dfb79cad24c3c269b4cad
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
saddr and daddr are set but not used.
Fixes: ba44f8182ec2 ("raw: use more conventional iterators")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/r/20220622032303.159394-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/ipv6/raw.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 692d771de914..20f2d971f3a5 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -332,7 +332,6 @@ static void rawv6_err(struct sock *sk, struct sk_buff *skb,
 void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 		u8 type, u8 code, int inner_offset, __be32 info)
 {
-	const struct in6_addr *saddr, *daddr;
 	struct net *net = dev_net(skb->dev);
 	struct hlist_nulls_head *hlist;
 	struct hlist_nulls_node *hnode;
@@ -345,8 +344,6 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 	sk_nulls_for_each(sk, hnode, hlist) {
 		/* Note: ipv6_hdr(skb) != skb->data */
 		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data;
-		saddr = &ip6h->saddr;
-		daddr = &ip6h->daddr;

 		if (!raw_v6_match(net, sk, nexthdr, &ip6h->saddr, &ip6h->daddr,
 				  inet6_iif(skb), inet6_iif(skb)))
From: Ido Schimmel <idosch@nvidia.com>

mainline inclusion
from mainline-v6.0-rc7
commit 76dd07281338da6951fdab3432ced843fa87839c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The global 'raw_v6_hashinfo' variable can be accessed even when IPv6 is administratively disabled via the 'ipv6.disable=1' kernel command line option, leading to a crash [1].
Fix by restoring the original behavior and always initializing the variable, regardless of IPv6 support being administratively disabled or not.
[1]
BUG: unable to handle page fault for address: ffffffffffffffc8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 173e18067 P4D 173e18067 PUD 173e1a067 PMD 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 3 PID: 271 Comm: ss Not tainted 6.0.0-rc4-custom-00136-g0727a9a5fbc1 #1396
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
RIP: 0010:raw_diag_dump+0x310/0x7f0
[...]
Call Trace:
 <TASK>
 __inet_diag_dump+0x10f/0x2e0
 netlink_dump+0x575/0xfd0
 __netlink_dump_start+0x67b/0x940
 inet_diag_handler_cmd+0x273/0x2d0
 sock_diag_rcv_msg+0x317/0x440
 netlink_rcv_skb+0x15e/0x430
 sock_diag_rcv+0x2b/0x40
 netlink_unicast+0x53b/0x800
 netlink_sendmsg+0x945/0xe60
 ____sys_sendmsg+0x747/0x960
 ___sys_sendmsg+0x13a/0x1e0
 __sys_sendmsg+0x118/0x1e0
 do_syscall_64+0x34/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
Reported-by: Roberto Ricci <rroberto2r@gmail.com>
Tested-by: Roberto Ricci <rroberto2r@gmail.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220916084821.229287-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 net/ipv6/af_inet6.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index ce066d78ad75..58287e772613 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1060,13 +1060,13 @@ static int __init inet6_init(void)
 	for (r = &inetsw6[0]; r < &inetsw6[SOCK_MAX]; ++r)
 		INIT_LIST_HEAD(r);

+	raw_hashinfo_init(&raw_v6_hashinfo);
+
 	if (disable_ipv6_mod) {
 		pr_info("Loaded, but administratively disabled, reboot required to enable\n");
 		goto out;
 	}

-	raw_hashinfo_init(&raw_v6_hashinfo);
-
 	err = proto_register(&tcpv6_prot, 1);
 	if (err)
 		goto out;
From: Ziyang Xuan <william.xuanziyang@huawei.com>

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ECEK
CVE: NA
--------------------------------
In order to fix the softlockup problem in raw sockets caused by the global rwlock, the "raw: RCU conversion" series was backported. That series changes KABI. This patch fixes up those KABI changes.
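The key ingredient is the KABI_REPLACE() macro used in struct proto below. Its authoritative definition lives in include/linux/kabi.h; roughly (a sketch modeled on RHEL's RH_KABI_REPLACE pattern, with the real macro additionally inserting a compile-time size/alignment check), it overlays the old and new member in an anonymous union:

  #define KABI_REPLACE(_orig, _new)	\
	union {				\
		_new;			\
		struct {		\
			_orig;		\
		};			\
	}

Because both members here are pointers, the size and offset of h.raw_hash are unchanged, so the externally visible layout of struct proto, and with it KABI, is preserved, while in-tree code transparently uses the new raw_hashinfo_new type.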
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/net/raw.h        | 18 ++++--------------
 include/net/raw_common.h | 22 ++++++++++++++++++++++
 include/net/rawv6.h      |  3 +--
 include/net/sock.h       |  4 +++-
 net/ipv4/raw.c           |  8 ++++----
 net/ipv4/raw_diag.c      |  6 +++---
 net/ipv6/af_inet6.c      |  1 +
 net/ipv6/raw.c           |  2 +-
 8 files changed, 39 insertions(+), 25 deletions(-)
 create mode 100644 include/net/raw_common.h

diff --git a/include/net/raw.h b/include/net/raw.h
index 5e665934ebc7..76b3f842258f 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -15,11 +15,12 @@

 #include <net/inet_sock.h>
 #include <net/protocol.h>
+#include <net/raw_common.h>
 #include <linux/icmp.h>

 extern struct proto raw_prot;

-extern struct raw_hashinfo raw_v4_hashinfo;
+extern struct raw_hashinfo_new raw_v4_hashinfo;
 bool raw_v4_match(struct net *net, struct sock *sk, unsigned short num,
 		  __be32 raddr, __be32 laddr, int dif, int sdif);

@@ -29,22 +30,11 @@ int raw_local_deliver(struct sk_buff *, int);

 int raw_rcv(struct sock *, struct sk_buff *);

-#define RAW_HTABLE_SIZE	MAX_INET_PROTOS
-
 struct raw_hashinfo {
-	spinlock_t lock;
-	struct hlist_nulls_head ht[RAW_HTABLE_SIZE];
+	rwlock_t lock;
+	struct hlist_head ht[RAW_HTABLE_SIZE];
 };

-static inline void raw_hashinfo_init(struct raw_hashinfo *hashinfo)
-{
-	int i;
-
-	spin_lock_init(&hashinfo->lock);
-	for (i = 0; i < RAW_HTABLE_SIZE; i++)
-		INIT_HLIST_NULLS_HEAD(&hashinfo->ht[i], i);
-}
-
 #ifdef CONFIG_PROC_FS
 int raw_proc_init(void);
 void raw_proc_exit(void);
diff --git a/include/net/raw_common.h b/include/net/raw_common.h
new file mode 100644
index 000000000000..77e701357928
--- /dev/null
+++ b/include/net/raw_common.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _RAW_COMMON_H
+#define _RAW_COMMON_H
+
+#define RAW_HTABLE_SIZE	MAX_INET_PROTOS
+
+struct raw_hashinfo_new {
+	spinlock_t lock;
+	struct hlist_nulls_head ht[RAW_HTABLE_SIZE];
+};
+
+static inline void raw_hashinfo_init(struct raw_hashinfo_new *hashinfo)
+{
+	int i;
+
+	spin_lock_init(&hashinfo->lock);
+	for (i = 0; i < RAW_HTABLE_SIZE; i++)
+		INIT_HLIST_NULLS_HEAD(&hashinfo->ht[i], i);
+}
+
+#endif /* _RAW_COMMON_H */
diff --git a/include/net/rawv6.h b/include/net/rawv6.h
index bc70909625f6..700de7d3af68 100644
--- a/include/net/rawv6.h
+++ b/include/net/rawv6.h
@@ -3,9 +3,8 @@
 #define _NET_RAWV6_H

 #include <net/protocol.h>
-#include <net/raw.h>

-extern struct raw_hashinfo raw_v6_hashinfo;
+extern struct raw_hashinfo_new raw_v6_hashinfo;
 bool raw_v6_match(struct net *net, struct sock *sk, unsigned short num,
 		  const struct in6_addr *loc_addr,
 		  const struct in6_addr *rmt_addr, int dif, int sdif);
diff --git a/include/net/sock.h b/include/net/sock.h
index a3e2514c5835..74dbbf0e30c0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1155,6 +1155,7 @@ struct request_sock_ops;
 struct timewait_sock_ops;
 struct inet_hashinfo;
 struct raw_hashinfo;
+struct raw_hashinfo_new;
 struct smc_hashinfo;
 struct module;

@@ -1269,7 +1270,8 @@ struct proto {
 	union {
 		struct inet_hashinfo	*hashinfo;
 		struct udp_table	*udp_table;
-		struct raw_hashinfo	*raw_hash;
+		KABI_REPLACE(struct raw_hashinfo *raw_hash,
+			     struct raw_hashinfo_new *raw_hash)
 		struct smc_hashinfo	*smc_hash;
 	} h;

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 7963638f9d15..cc9ee2e5a32b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -85,12 +85,12 @@ struct raw_frag_vec {
 	int hlen;
 };

-struct raw_hashinfo raw_v4_hashinfo;
+struct raw_hashinfo_new raw_v4_hashinfo;
 EXPORT_SYMBOL_GPL(raw_v4_hashinfo);

 int raw_hash_sk(struct sock *sk)
 {
-	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
+	struct raw_hashinfo_new *h = sk->sk_prot->h.raw_hash;
 	struct hlist_nulls_head *hlist;

 	hlist = &h->ht[inet_sk(sk)->inet_num & (RAW_HTABLE_SIZE - 1)];
@@ -107,7 +107,7 @@ EXPORT_SYMBOL_GPL(raw_hash_sk);

 void raw_unhash_sk(struct sock *sk)
 {
-	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
+	struct raw_hashinfo_new *h = sk->sk_prot->h.raw_hash;

 	spin_lock(&h->lock);
 	if (__sk_nulls_del_node_init_rcu(sk))
@@ -944,7 +944,7 @@ struct proto raw_prot = {
 #ifdef CONFIG_PROC_FS
 static struct sock *raw_get_first(struct seq_file *seq, int bucket)
 {
-	struct raw_hashinfo *h = PDE_DATA(file_inode(seq->file));
+	struct raw_hashinfo_new *h = PDE_DATA(file_inode(seq->file));
 	struct raw_iter_state *state = raw_seq_private(seq);
 	struct hlist_nulls_head *hlist;
 	struct hlist_nulls_node *hnode;
diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index a5fcca24f9eb..70981c6d67ef 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -14,7 +14,7 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-static struct raw_hashinfo *
+static struct raw_hashinfo_new *
 raw_get_hashinfo(const struct inet_diag_req_v2 *r)
 {
 	if (r->sdiag_family == AF_INET) {
@@ -56,7 +56,7 @@ static bool raw_lookup(struct net *net, struct sock *sk,

 static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
 {
-	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
+	struct raw_hashinfo_new *hashinfo = raw_get_hashinfo(r);
 	struct hlist_nulls_head *hlist;
 	struct hlist_nulls_node *hnode;
 	struct sock *sk;
@@ -142,7 +142,7 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r)
 {
 	bool net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN);
-	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
+	struct raw_hashinfo_new *hashinfo = raw_get_hashinfo(r);
 	struct net *net = sock_net(skb->sk);
 	struct inet_diag_dump_data *cb_data;
 	struct hlist_nulls_head *hlist;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 58287e772613..2524825e5157 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -62,6 +62,7 @@
 #include <net/rpl.h>
 #include <net/compat.h>
 #include <net/xfrm.h>
+#include <net/raw_common.h>
 #include <net/rawv6.h>

 #include <linux/uaccess.h>
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 20f2d971f3a5..c9cda334fa12 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -61,7 +61,7 @@
#define ICMPV6_HDRLEN 4 /* ICMPv6 header, RFC 4443 Section 2.1 */
-struct raw_hashinfo raw_v6_hashinfo; +struct raw_hashinfo_new raw_v6_hashinfo; EXPORT_SYMBOL_GPL(raw_v6_hashinfo);
bool raw_v6_match(struct net *net, struct sock *sk, unsigned short num,
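The KABI trick above relies on KABI_REPLACE keeping the old member's storage in place while new code uses the replacement type. A rough sketch of the idea, assuming a union-based macro along the lines of openEuler's include/linux/kabi.h (the macro body and names here are illustrative, not the real definition):

#include <stdio.h>

struct raw_hashinfo_old;	/* legacy layout, kept visible for KABI */
struct raw_hashinfo_new;	/* RCU-converted replacement */

/* illustrative stand-in for KABI_REPLACE(orig, new): both members
 * overlay the same storage, so struct size and offsets are unchanged */
#define KABI_REPLACE_SKETCH(_orig, _new) \
	union { \
		_new; \
		_orig; \
	}

struct proto_sketch {
	KABI_REPLACE_SKETCH(struct raw_hashinfo_old *raw_hash_old,
			    struct raw_hashinfo_new *raw_hash);
};

int main(void)
{
	/* one pointer's worth of storage either way */
	printf("sizeof(struct proto_sketch) = %zu\n", sizeof(struct proto_sketch));
	return 0;
}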
From: Stanislav Fomichev sdf@google.com
stable inclusion from stable-v5.10.163 commit 6d935a02658be82585ecb39aab339faa84496650 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6AVM6
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 07ec7b502800ba9f7b8b15cb01dd6556bb41aaca ]
syzkaller managed to trigger another case where skb->len == 0 when we enter __dev_queue_xmit:
WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline] WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295
Call Trace: dev_queue_xmit+0x17/0x20 net/core/dev.c:4406 __bpf_tx_skb net/core/filter.c:2115 [inline] __bpf_redirect_no_mac net/core/filter.c:2140 [inline] __bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163 ____bpf_clone_redirect net/core/filter.c:2447 [inline] bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419 bpf_prog_48159a89cb4a9a16+0x59/0x5e bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline] __bpf_prog_run include/linux/filter.h:596 [inline] bpf_prog_run include/linux/filter.h:603 [inline] bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402 bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170 bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648 __sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005 __do_sys_bpf kernel/bpf/syscall.c:5091 [inline] __se_sys_bpf kernel/bpf/syscall.c:5089 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089 do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48 entry_SYSCALL_64_after_hwframe+0x61/0xc6
The reproducer doesn't really reproduce outside of the syzkaller environment, so I'm taking a guess here. It looks like we do generate a correct ETH_HLEN-sized packet, but we redirect the packet to the tunneling device. Before we do so, we __skb_pull the L2 header and arrive again at skb->len == 0. It doesn't seem like we can do anything better than having an explicit check after __skb_pull.
Cc: Eric Dumazet edumazet@google.com Reported-by: syzbot+f635e86ec3fa0a37e019@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev sdf@google.com Link: https://lore.kernel.org/r/20221027225537.353077-1-sdf@google.com Signed-off-by: Martin KaFai Lau martin.lau@kernel.org Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Liu Jian liujian56@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/core/filter.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c index 4bcf93934241..012a5070a9e5 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2127,6 +2127,10 @@ static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev,
if (mlen) { __skb_pull(skb, mlen); + if (unlikely(!skb->len)) { + kfree_skb(skb); + return -ERANGE; + }
/* At ingress, the mac header has already been pulled once. * At egress, skb_pospull_rcsum has to be done in case that
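As a hedged userspace sketch of the failure mode (struct pkt and all helpers are hypothetical stand-ins for sk_buff and __skb_pull): pulling an L2 header's worth of bytes from a packet that is exactly header-sized leaves zero bytes, and the packet must be dropped instead of queued.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct pkt { unsigned char *head; unsigned char *data; unsigned int len; };

static void pkt_pull(struct pkt *p, unsigned int n)	/* __skb_pull() */
{
	p->data += n;
	p->len -= n;			/* caller guarantees n <= len */
}

static int redirect_no_mac(struct pkt *p, unsigned int mac_len)
{
	if (mac_len) {
		pkt_pull(p, mac_len);
		if (p->len == 0) {	/* the check this patch adds */
			free(p->head);	/* kfree_skb() */
			return -ERANGE;
		}
	}
	/* ... otherwise hand off to the device queue ... */
	return 0;
}

int main(void)
{
	unsigned char *buf = malloc(14);	/* an ETH_HLEN-sized packet */
	struct pkt p = { .head = buf, .data = buf, .len = 14 };

	printf("redirect: %d\n", redirect_no_mac(&p, 14));	/* prints -34 */
	return 0;
}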
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v5.10.163 commit be719496ae6a7fc325e9e5056a52f63ebc84cc0c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6AVM6
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 0a182f8d607464911756b4dbef5d6cad8de22469 ]
sock_map_free() calls release_sock(sk) without owning a reference on the socket. This can cause a use-after-free, as syzbot found [1]
Jakub Sitnicki already took care of a similar issue in sock_hash_free() in commit 75e68e5bf2c7 ("bpf, sockhash: Synchronize delete from bucket list on map free")
[1] refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 3785 at lib/refcount.c:31 refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31 Modules linked in: CPU: 0 PID: 3785 Comm: kworker/u4:6 Not tainted 6.1.0-rc7-syzkaller-00103-gef4d3ea40565 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31 Code: 68 8b 31 c0 e8 75 71 15 fd 0f 0b e9 64 ff ff ff e8 d9 6e 4e fd c6 05 62 9c 3d 0a 01 48 c7 c7 80 bb 68 8b 31 c0 e8 54 71 15 fd <0f> 0b e9 43 ff ff ff 89 d9 80 e1 07 80 c1 03 38 c1 0f 8c a2 fe ff RSP: 0018:ffffc9000456fb60 EFLAGS: 00010246 RAX: eae59bab72dcd700 RBX: 0000000000000004 RCX: ffff8880207057c0 RDX: 0000000000000000 RSI: 0000000000000201 RDI: 0000000000000000 RBP: 0000000000000004 R08: ffffffff816fdabd R09: fffff520008adee5 R10: fffff520008adee5 R11: 1ffff920008adee4 R12: 0000000000000004 R13: dffffc0000000000 R14: ffff88807b1c6c00 R15: 1ffff1100f638dcf FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b30c30000 CR3: 000000000d08e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_dec include/linux/refcount.h:344 [inline] refcount_dec include/linux/refcount.h:359 [inline] __sock_put include/net/sock.h:779 [inline] tcp_release_cb+0x2d0/0x360 net/ipv4/tcp_output.c:1092 release_sock+0xaf/0x1c0 net/core/sock.c:3468 sock_map_free+0x219/0x2c0 net/core/sock_map.c:356 process_one_work+0x81c/0xd10 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK>
Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Cc: Jakub Sitnicki jakub@cloudflare.com Cc: John Fastabend john.fastabend@gmail.com Cc: Alexei Starovoitov ast@kernel.org Cc: Daniel Borkmann daniel@iogearbox.net Cc: Song Liu songliubraving@fb.com Acked-by: John Fastabend john.fastabend@gmail.com Link: https://lore.kernel.org/r/20221202111640.2745533-1-edumazet@google.com Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Liu Jian liujian56@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/core/sock_map.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 50fb43905c87..79528310e6a7 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -358,11 +358,13 @@ static void sock_map_free(struct bpf_map *map)
sk = xchg(psk, NULL); if (sk) { + sock_hold(sk); lock_sock(sk); rcu_read_lock(); sock_map_unref(sk, psk); rcu_read_unlock(); release_sock(sk); + sock_put(sk); } }
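The rule being applied, pin an object before locking it when you do not own a reference, can be sketched in standalone C (hypothetical names; C11 atomics stand in for the sock refcount, and the lock itself is elided):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj { atomic_int refcnt; };

static void obj_hold(struct obj *o)	/* sock_hold() */
{
	atomic_fetch_add(&o->refcnt, 1);
}

static void obj_put(struct obj *o)	/* sock_put() */
{
	if (atomic_fetch_sub(&o->refcnt, 1) == 1) {
		printf("freeing object\n");
		free(o);
	}
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	atomic_init(&o->refcnt, 1);	/* the map's reference */

	obj_hold(o);	/* pin before "lock_sock()" */
	obj_put(o);	/* "sock_map_unref()" drops the map's reference... */
	/* ..."release_sock()" would run here; the pin keeps o valid */
	obj_put(o);	/* drop the pin; only now may o be freed */
	return 0;
}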
From: Cui GaoSheng cuigaosheng1@huawei.com
hulk inclusion category: bugfix bugzilla: 188368 https://gitee.com/openeuler/kernel/issues/I6EEK7 CVE: NA
--------------------------------
To avoid passing conflicting compilation flags, move -fpic from KBUILD_CFLAGS to CFLAGS_KERNEL so that -fpic and -fno-pic are never used together.
Fixes: 6bc05b0a27f0 ("arm32: kaslr: Fix the bug of module install failure") Signed-off-by: Cui GaoSheng cuigaosheng1@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/arm/Makefile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm/Makefile b/arch/arm/Makefile index 8300530757ae..d6817de42f24 100644 --- a/arch/arm/Makefile +++ b/arch/arm/Makefile @@ -49,7 +49,8 @@ KBUILD_LDFLAGS += -EL endif
ifeq ($(CONFIG_RELOCATABLE),y) -KBUILD_CFLAGS += -fpic -include $(srctree)/include/linux/hidden.h +KBUILD_CFLAGS += -include $(srctree)/include/linux/hidden.h +CFLAGS_KERNEL += -fpic CFLAGS_MODULE += -fno-pic LDFLAGS_vmlinux += -pie -shared -Bsymbolic endif
From: Stanislav Fomichev sdf@google.com
stable inclusion from stable-v5.10.163 commit 148dcbd3af039ae39c3af697a3183008c7995805 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6F7AI
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 9f225444467b98579cf28d94f4ad053460dfdb84 ]
Syzkaller triggered flow dissector warning with the following:
r0 = openat$ppp(0xffffffffffffff9c, &(0x7f0000000000), 0xc0802, 0x0) ioctl$PPPIOCNEWUNIT(r0, 0xc004743e, &(0x7f00000000c0)) ioctl$PPPIOCSACTIVE(r0, 0x40107446, &(0x7f0000000240)={0x2, &(0x7f0000000180)=[{0x20, 0x0, 0x0, 0xfffff034}, {0x6}]}) pwritev(r0, &(0x7f0000000040)=[{&(0x7f0000000140)='\x00!', 0x2}], 0x1, 0x0, 0x0)
[ 9.485814] WARNING: CPU: 3 PID: 329 at net/core/flow_dissector.c:1016 __skb_flow_dissect+0x1ee0/0x1fa0 [ 9.485929] skb_get_poff+0x53/0xa0 [ 9.485937] bpf_skb_get_pay_offset+0xe/0x20 [ 9.485944] ? ppp_send_frame+0xc2/0x5b0 [ 9.485949] ? _raw_spin_unlock_irqrestore+0x40/0x60 [ 9.485958] ? __ppp_xmit_process+0x7a/0xe0 [ 9.485968] ? ppp_xmit_process+0x5b/0xb0 [ 9.485974] ? ppp_write+0x12a/0x190 [ 9.485981] ? do_iter_write+0x18e/0x2d0 [ 9.485987] ? __import_iovec+0x30/0x130 [ 9.485997] ? do_pwritev+0x1b6/0x240 [ 9.486016] ? trace_hardirqs_on+0x47/0x50 [ 9.486023] ? __x64_sys_pwritev+0x24/0x30 [ 9.486026] ? do_syscall_64+0x3d/0x80 [ 9.486031] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
The flow dissector tries to find the skb's network namespace either via the device or via the socket. Neither is set in ppp_send_frame, so let's manually use ppp->dev.
Cc: Paul Mackerras paulus@samba.org Cc: linux-ppp@vger.kernel.org Reported-by: syzbot+41cab52ab62ee99ed24a@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev sdf@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/net/ppp/ppp_generic.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c index 2b9815ec4a62..b825c6a9b6dd 100644 --- a/drivers/net/ppp/ppp_generic.c +++ b/drivers/net/ppp/ppp_generic.c @@ -1610,6 +1610,8 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb) int len; unsigned char *cp;
+ skb->dev = ppp->dev; + if (proto < 0x8000) { #ifdef CONFIG_PPP_FILTER /* check if we should pass this packet */
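A hedged sketch of why setting skb->dev matters (minimal stand-ins for the kernel types; the real logic lives in __skb_flow_dissect()): the flow dissector derives the network namespace from the device or the socket, and with neither set it has nothing to work with.

#include <stdio.h>

struct net { int id; };
struct net_device { struct net *nd_net; };
struct sock { struct net *sk_net; };
struct sk_buff { struct net_device *dev; struct sock *sk; };

static struct net *resolve_net(const struct sk_buff *skb)
{
	if (skb->dev)
		return skb->dev->nd_net;	/* dev_net(skb->dev) */
	if (skb->sk)
		return skb->sk->sk_net;		/* sock_net(skb->sk) */
	return NULL;	/* the WARNING above fires in this case */
}

int main(void)
{
	struct net init_net = { .id = 1 };
	struct net_device ppp_dev = { .nd_net = &init_net };
	struct sk_buff skb = { .dev = NULL, .sk = NULL };

	printf("without dev: %p\n", (void *)resolve_net(&skb));
	skb.dev = &ppp_dev;	/* the fix: skb->dev = ppp->dev */
	printf("with dev:    %p\n", (void *)resolve_net(&skb));
	return 0;
}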
From: Jiacheng Shi billsjc@sjtu.edu.cn
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6FHYK CVE: NA
Reference: https://github.com/torvalds/linux/commit/2bed2ced40c97b8540ff38df0149e8ecb2b...
--------------------------------
[ Upstream commit 2bed2ced40c97b8540ff38df0149e8ecb2bf4c65 ]
Memory allocated with kvzalloc should not be freed with kfree, because it may have been allocated by vmalloc. So we replace kfree with kvfree here.
Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages tracking") Signed-off-by: Jiacheng Shi billsjc@sjtu.edu.cn Link: https://lore.kernel.org/r/20211212091600.2560-1-billsjc@sjtu.edu.cn Signed-off-by: Alex Williamson alex.williamson@redhat.com Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/vfio/vfio_iommu_type1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index b39515935ea4..a6fdab250ea0 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -241,7 +241,7 @@ static int vfio_dma_bitmap_alloc(struct vfio_dma *dma, size_t pgsize)
static void vfio_dma_bitmap_free(struct vfio_dma *dma) { - kfree(dma->bitmap); + kvfree(dma->bitmap); dma->bitmap = NULL; }
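The contract being restored can be modeled with a toy allocator (a userspace sketch, not the kernel implementation): an allocator that silently falls back between two backends must be paired with a free routine that can tell the backends apart, which is exactly why kfree() on kvzalloc()ed memory is wrong.

#include <stdlib.h>

#define SMALL_LIMIT 4096

struct hdr { int from_big_pool; };	/* models what is_vmalloc_addr() detects */

static void *kv_zalloc_toy(size_t n)
{
	struct hdr *h = calloc(1, sizeof(*h) + n);

	if (!h)
		return NULL;
	h->from_big_pool = n > SMALL_LIMIT;	/* "vmalloc" fallback taken? */
	return h + 1;
}

static void kv_free_toy(void *p)
{
	if (!p)
		return;
	/* only this routine knows which backend the header records */
	free((struct hdr *)p - 1);
}

int main(void)
{
	void *bitmap = kv_zalloc_toy(1 << 20);	/* large: "vmalloc" path */

	kv_free_toy(bitmap);	/* a plain free()/kfree() here would be the bug */
	return 0;
}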
From: Kunkun Jiang jiangkunkun@huawei.com
virt inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6FHYK CVE: NA
--------------------------------
For security purposes, use kvzalloc to allocate the memory so that it is zero-initialized. Because the memory may be allocated by vmalloc, we also replace kfree with kvfree here.
Reported-by: Zhaolong Wang wangzhaolong1@huawei.com Signed-off-by: Kunkun Jiang jiangkunkun@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/vfio/vfio_iommu_type1.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index a6fdab250ea0..20d13cf4aec7 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1129,7 +1129,7 @@ static int vfio_iova_dirty_log_clear(u64 __user *bitmap, int ret = 0;
bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift); - bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL); + bitmap_buffer = kvzalloc(bitmap_size, GFP_KERNEL); if (!bitmap_buffer) { ret = -ENOMEM; goto out; @@ -1179,7 +1179,7 @@ static int vfio_iova_dirty_log_clear(u64 __user *bitmap, }
out: - kfree(bitmap_buffer); + kvfree(bitmap_buffer); return ret; }
From: Rik van Riel riel@surriel.com
mainline inclusion from mainline-v6.1-rc2 commit 12df140f0bdfae5dcfc81800970dd7f6f632e00c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6EVPO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
The h->*_huge_pages counters are protected by the hugetlb_lock, but alloc_huge_page has a corner case where it can decrement the counter outside of the lock.
This could lead to a corrupted value of h->resv_huge_pages, which we have observed on our systems.
Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a potential race.
Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com Fixes: a88c76954804 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count") Signed-off-by: Rik van Riel riel@surriel.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: Naoya Horiguchi n-horiguchi@ah.jp.nec.com Cc: Glen McCready gkmccready@meta.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Muchun Song songmuchun@bytedance.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Zhang Peng zhangpeng362@huawei.com Reviewed-by: tong tiangen tongtiangen@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9bfb781fafd3..03eca3aec0f6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2641,11 +2641,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, page = alloc_buddy_huge_page_with_mpol(h, vma, addr); if (!page) goto out_uncharge_cgroup; + spin_lock_irq(&hugetlb_lock); if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) { SetHPageRestoreReserve(page); h->resv_huge_pages--; } - spin_lock_irq(&hugetlb_lock); list_add(&page->lru, &h->hugepage_activelist); /* Fall through */ }
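The shape of the race can be sketched with pthreads (illustrative names only): a read-modify-write of a counter that other writers update under a lock must itself happen under that lock, which is all the one-line move above accomplishes.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;
static long resv_huge_pages = 1024;	/* stands in for h->resv_huge_pages */

static void alloc_path_fixed(int use_reserve)
{
	pthread_mutex_lock(&hugetlb_lock);	/* lock taken before ... */
	if (use_reserve)
		resv_huge_pages--;		/* ... the decrement, so it no longer races */
	/* list_add(&page->lru, ...) etc. */
	pthread_mutex_unlock(&hugetlb_lock);
}

int main(void)
{
	alloc_path_fixed(1);
	printf("resv_huge_pages = %ld\n", resv_huge_pages);
	return 0;
}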
From: Kefeng Wang wangkefeng.wang@huawei.com
mainline inclusion from mainline-v6.2-rc7 commit ac86f547ca1002aec2ef66b9e64d03f45bbbfbb9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6BYND
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As with commit 18365225f044 ("hwpoison, memcg: forcibly uncharge LRU pages"), hwpoison forcibly uncharges a hwpoisoned LRU page, so folio_memcg() can be NULL and mem_cgroup_track_foreign_dirty_slowpath() can then hit a NULL pointer dereference. Fix this by not recording foreign writeback in mem_cgroup_track_foreign_dirty() when the folio's memcg is NULL.
Link: https://lkml.kernel.org/r/20230129040945.180629-1-wangkefeng.wang@huawei.com Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reported-by: Ma Wupeng mawupeng1@huawei.com Tested-by: Miko Larsson mikoxyzzz@gmail.com Acked-by: Michal Hocko mhocko@suse.com Cc: Jan Kara jack@suse.cz Cc: Jens Axboe axboe@kernel.dk Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Ma Wupeng mawupeng1@huawei.com Cc: Naoya Horiguchi naoya.horiguchi@nec.com Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: include/linux/memcontrol.h Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/linux/memcontrol.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 89e4f01c5710..600cda4ea1be 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1873,10 +1873,13 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, static inline void mem_cgroup_track_foreign_dirty(struct page *page, struct bdi_writeback *wb) { + struct mem_cgroup *memcg; + if (mem_cgroup_disabled()) return;
- if (unlikely(&page_memcg(page)->css != wb->memcg_css)) + memcg = page_memcg(page); + if (unlikely(memcg && &memcg->css != wb->memcg_css)) mem_cgroup_track_foreign_dirty_slowpath(page, wb); }
From: Huajingjing huajingjing1@huawei.com
driver inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I67J42 CVE: NA
-------------------------------------------------
The BMA software is a system management software package offered by Huawei. It supports status monitoring, performance monitoring, and event monitoring of various components, including server CPUs, memory, hard disks, NICs, IB cards, PCIe cards, RAID controller cards, and optical modules.
This version rectifies a defect in which the system could reset because of the BMA spin lock when memory usage was too high.
Signed-off-by: Huajingjing huajingjing1@huawei.com Reviewed-by: ChenJiesong chenjiesong@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- .../ethernet/huawei/bma/cdev_drv/bma_cdev.c | 2 +- .../huawei/bma/edma_drv/bma_devintf.c | 110 +++++++++++------- .../ethernet/huawei/bma/edma_drv/bma_pci.h | 2 +- .../huawei/bma/kbox_drv/kbox_include.h | 2 +- .../ethernet/huawei/bma/kbox_drv/kbox_panic.c | 4 +- .../huawei/bma/kbox_drv/kbox_printk.c | 4 +- .../huawei/bma/kbox_drv/kbox_ram_op.c | 2 +- .../ethernet/huawei/bma/veth_drv/veth_hb.c | 7 +- .../ethernet/huawei/bma/veth_drv/veth_hb.h | 2 +- 9 files changed, 81 insertions(+), 54 deletions(-)
diff --git a/drivers/net/ethernet/huawei/bma/cdev_drv/bma_cdev.c b/drivers/net/ethernet/huawei/bma/cdev_drv/bma_cdev.c index 0348a83005d6..9468a5a0c768 100644 --- a/drivers/net/ethernet/huawei/bma/cdev_drv/bma_cdev.c +++ b/drivers/net/ethernet/huawei/bma/cdev_drv/bma_cdev.c @@ -28,7 +28,7 @@ #ifdef DRV_VERSION #define CDEV_VERSION MICRO_TO_STR(DRV_VERSION) #else -#define CDEV_VERSION "0.3.4" +#define CDEV_VERSION "0.3.5" #endif
#define CDEV_DEFAULT_NUM 4 diff --git a/drivers/net/ethernet/huawei/bma/edma_drv/bma_devintf.c b/drivers/net/ethernet/huawei/bma/edma_drv/bma_devintf.c index acf6bbfc50ff..149c393e8b0d 100644 --- a/drivers/net/ethernet/huawei/bma/edma_drv/bma_devintf.c +++ b/drivers/net/ethernet/huawei/bma/edma_drv/bma_devintf.c @@ -419,8 +419,10 @@ EXPORT_SYMBOL(bma_intf_int_to_bmc);
int bma_intf_is_link_ok(void) { - return (g_bma_dev->edma_host.statistics.remote_status == - REGISTERED) ? 1 : 0; + if ((&g_bma_dev->edma_host != NULL) && + (g_bma_dev->edma_host.statistics.remote_status == REGISTERED)) + return 1; + return 0; } EXPORT_SYMBOL(bma_intf_is_link_ok);
@@ -460,14 +462,10 @@ int bma_cdev_recv_msg(void *handle, char __user *data, size_t count) } EXPORT_SYMBOL_GPL(bma_cdev_recv_msg);
-int bma_cdev_add_msg(void *handle, const char __user *msg, size_t msg_len) +static int check_cdev_add_msg_param(struct bma_priv_data_s *handle, +const char __user *msg, size_t msg_len) { struct bma_priv_data_s *priv = NULL; - struct edma_msg_hdr_s *hdr = NULL; - unsigned long flags = 0; - int total_len = 0; - int ret = 0; - struct edma_host_s *phost = &g_bma_dev->edma_host;
if (!handle || !msg || msg_len == 0) { BMA_LOG(DLOG_DEBUG, "input NULL point!\n"); @@ -479,54 +477,80 @@ int bma_cdev_add_msg(void *handle, const char __user *msg, size_t msg_len) return -EINVAL; }
- priv = (struct bma_priv_data_s *)handle; + priv = handle;
if (priv->user.type >= TYPE_MAX) { BMA_LOG(DLOG_DEBUG, "error type = %d\n", priv->user.type); return -EFAULT; } - total_len = SIZE_OF_MSG_HDR + msg_len; + + return 0; +} + +static void edma_msg_hdr_init(struct edma_msg_hdr_s *hdr, + struct bma_priv_data_s *private_data, + char *msg_buf, size_t msg_len) +{ + hdr->type = private_data->user.type; + hdr->sub_type = private_data->user.sub_type; + hdr->user_id = private_data->user.user_id; + hdr->datalen = msg_len; + BMA_LOG(DLOG_DEBUG, "msg_len is %zu\n", msg_len); + + memcpy(hdr->data, msg_buf, msg_len); +} + +int bma_cdev_add_msg(void *handle, const char __user *msg, size_t msg_len) +{ + struct bma_priv_data_s *priv = NULL; + struct edma_msg_hdr_s *hdr = NULL; + unsigned long flags = 0; + unsigned int total_len = 0; + int ret = 0; + struct edma_host_s *phost = &g_bma_dev->edma_host; + char *msg_buf = NULL; + + ret = check_cdev_add_msg_param(handle, msg, msg_len); + if (ret != 0) + return ret; + + priv = (struct bma_priv_data_s *)handle; + + total_len = (unsigned int)(SIZE_OF_MSG_HDR + msg_len); + if (phost->msg_send_write + total_len > HOST_MAX_SEND_MBX_LEN - SIZE_OF_MBX_HDR) { + BMA_LOG(DLOG_DEBUG, "msg lost,msg_send_write: %u,msg_len:%u,max_len: %d\n", + phost->msg_send_write, total_len, HOST_MAX_SEND_MBX_LEN); + return -ENOSPC; + } + + msg_buf = (char *)kmalloc(msg_len, GFP_KERNEL); + if (!msg_buf) { + BMA_LOG(DLOG_ERROR, "malloc msg_buf failed\n"); + return -ENOMEM; + } + + if (copy_from_user(msg_buf, msg, msg_len)) { + BMA_LOG(DLOG_ERROR, "copy_from_user error\n"); + kfree(msg_buf); + return -EFAULT; + }
spin_lock_irqsave(&phost->send_msg_lock, flags);
- if (phost->msg_send_write + total_len <= - HOST_MAX_SEND_MBX_LEN - SIZE_OF_MBX_HDR) { - hdr = (struct edma_msg_hdr_s *)(phost->msg_send_buf + - phost->msg_send_write); - hdr->type = priv->user.type; - hdr->sub_type = priv->user.sub_type; - hdr->user_id = priv->user.user_id; - hdr->datalen = msg_len; - BMA_LOG(DLOG_DEBUG, "msg_len is %zu\n", msg_len); - - if (copy_from_user(hdr->data, msg, msg_len)) { - BMA_LOG(DLOG_ERROR, "copy_from_user error\n"); - ret = -EFAULT; - goto end; - } + hdr = (struct edma_msg_hdr_s *)(phost->msg_send_buf + phost->msg_send_write); + edma_msg_hdr_init(hdr, priv, msg_buf, msg_len);
- phost->msg_send_write += total_len; - phost->statistics.send_bytes += total_len; - phost->statistics.send_pkgs++; + phost->msg_send_write += total_len; + phost->statistics.send_bytes += total_len; + phost->statistics.send_pkgs++; #ifdef EDMA_TIMER - (void)mod_timer(&phost->timer, jiffies_64); + (void)mod_timer(&phost->timer, jiffies_64); #endif - BMA_LOG(DLOG_DEBUG, "msg_send_write = %d\n", - phost->msg_send_write); - - ret = msg_len; - goto end; - } else { - BMA_LOG(DLOG_DEBUG, - "msg lost,msg_send_write: %d,msg_len:%d,max_len: %d\n", - phost->msg_send_write, total_len, - HOST_MAX_SEND_MBX_LEN); - ret = -ENOSPC; - goto end; - } + BMA_LOG(DLOG_DEBUG, "msg_send_write = %d\n", phost->msg_send_write);
-end: + ret = msg_len; spin_unlock_irqrestore(&g_bma_dev->edma_host.send_msg_lock, flags); + kfree(msg_buf); return ret; } EXPORT_SYMBOL_GPL(bma_cdev_add_msg); diff --git a/drivers/net/ethernet/huawei/bma/edma_drv/bma_pci.h b/drivers/net/ethernet/huawei/bma/edma_drv/bma_pci.h index 92893b6bfd08..639ed4e58a8b 100644 --- a/drivers/net/ethernet/huawei/bma/edma_drv/bma_pci.h +++ b/drivers/net/ethernet/huawei/bma/edma_drv/bma_pci.h @@ -71,7 +71,7 @@ struct bma_pci_dev_s { #ifdef DRV_VERSION #define BMA_VERSION MICRO_TO_STR(DRV_VERSION) #else -#define BMA_VERSION "0.3.4" +#define BMA_VERSION "0.3.5" #endif
#ifdef CONFIG_ARM64 diff --git a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_include.h b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_include.h index ffadf3727734..b027306e52c1 100644 --- a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_include.h +++ b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_include.h @@ -23,7 +23,7 @@ #ifdef DRV_VERSION #define KBOX_VERSION MICRO_TO_STR(DRV_VERSION) #else -#define KBOX_VERSION "0.3.4" +#define KBOX_VERSION "0.3.5" #endif
#define UNUSED(x) (x = x) diff --git a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_panic.c b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_panic.c index 0c17cd2bae49..2b142ae9bff6 100644 --- a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_panic.c +++ b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_panic.c @@ -135,7 +135,7 @@ int kbox_panic_init(void) int ret = KBOX_TRUE;
g_panic_info_buf = kmalloc(SLOT_LENGTH, GFP_KERNEL); - if (IS_ERR(g_panic_info_buf) || !g_panic_info_buf) { + if (!g_panic_info_buf) { KBOX_MSG("kmalloc g_panic_info_buf fail!\n"); ret = -ENOMEM; goto fail; @@ -144,7 +144,7 @@ int kbox_panic_init(void) memset(g_panic_info_buf, 0, SLOT_LENGTH);
g_panic_info_buf_tmp = kmalloc(SLOT_LENGTH, GFP_KERNEL); - if (IS_ERR(g_panic_info_buf_tmp) || !g_panic_info_buf_tmp) { + if (!g_panic_info_buf_tmp) { KBOX_MSG("kmalloc g_panic_info_buf_tmp fail!\n"); ret = -ENOMEM; goto fail; diff --git a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_printk.c b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_printk.c index 3b04ba206108..630a1e16ea24 100644 --- a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_printk.c +++ b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_printk.c @@ -304,7 +304,7 @@ int kbox_printk_init(int kbox_proc_exist)
g_printk_info_buf = kmalloc(SECTION_PRINTK_LEN, GFP_KERNEL); - if (IS_ERR(g_printk_info_buf) || !g_printk_info_buf) { + if (!g_printk_info_buf) { KBOX_MSG("kmalloc g_printk_info_buf fail!\n"); ret = -ENOMEM; goto fail; @@ -314,7 +314,7 @@ int kbox_printk_init(int kbox_proc_exist)
g_printk_info_buf_tmp = kmalloc(SECTION_PRINTK_LEN, GFP_KERNEL); - if (IS_ERR(g_printk_info_buf_tmp) || !g_printk_info_buf_tmp) { + if (!g_printk_info_buf_tmp) { KBOX_MSG("kmalloc g_printk_info_buf_tmp fail!\n"); ret = -ENOMEM; goto fail; diff --git a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_ram_op.c b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_ram_op.c index 49690bab1cef..9f6dfe55e3fb 100644 --- a/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_ram_op.c +++ b/drivers/net/ethernet/huawei/bma/kbox_drv/kbox_ram_op.c @@ -432,7 +432,7 @@ int kbox_write_op(long long offset, unsigned int count, return KBOX_FALSE;
temp_buf_char = kmalloc(TEMP_BUF_DATA_SIZE, GFP_KERNEL); - if (!temp_buf_char || IS_ERR(temp_buf_char)) { + if (!temp_buf_char) { KBOX_MSG("kmalloc temp_buf_char fail!\n"); up(&user_sem); return -ENOMEM; diff --git a/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.c b/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.c index 240db31d7178..7705bb919ead 100644 --- a/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.c +++ b/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.c @@ -638,6 +638,7 @@ s32 veth_refill_rxskb(struct bspveth_rxtx_q *prx_queue, int queue) next_to_fill = (next_to_fill + 1) & BSPVETH_POINT_MASK; }
+ mb();/* memory barriers. */ prx_queue->next_to_fill = next_to_fill;
tail = prx_queue->tail; @@ -672,6 +673,7 @@ s32 bspveth_setup_rx_skb(struct bspveth_device *pvethdev, if (!idx) /* Can't alloc even one packets */ return -EFAULT;
+ mb();/* memory barriers. */ prx_queue->next_to_fill = idx;
VETH_LOG(DLOG_DEBUG, "prx_queue->next_to_fill=%d\n", @@ -886,8 +888,6 @@ s32 bspveth_setup_all_rx_resources(struct bspveth_device *pvethdev) err = bspveth_setup_rx_resources(pvethdev, pvethdev->prx_queue[qid]); if (err) { - kfree(pvethdev->prx_queue[qid]); - pvethdev->prx_queue[qid] = NULL; VETH_LOG(DLOG_ERROR, "Allocation for Rx Queue %u failed\n", qid);
@@ -1328,6 +1328,7 @@ s32 veth_send_one_pkt(struct sk_buff *skb, int queue) pbd_v->off = off; pbd_v->len = skb->len;
+ mb();/* memory barriers. */ head = (head + 1) & BSPVETH_POINT_MASK; ptx_queue->head = head;
@@ -1424,6 +1425,7 @@ s32 veth_free_txskb(struct bspveth_rxtx_q *ptx_queue, int queue) next_to_free = (next_to_free + 1) & BSPVETH_POINT_MASK; }
+ mb(); /* memory barriers. */ ptx_queue->next_to_free = next_to_free; tail = ptx_queue->tail;
@@ -1522,6 +1524,7 @@ s32 veth_recv_pkt(struct bspveth_rxtx_q *prx_queue, int queue) } }
+ mb();/* memory barriers. */ prx_queue->tail = tail; head = prx_queue->head;
diff --git a/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.h b/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.h index 9a4d699e6421..503636549320 100644 --- a/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.h +++ b/drivers/net/ethernet/huawei/bma/veth_drv/veth_hb.h @@ -31,7 +31,7 @@ extern "C" { #ifdef DRV_VERSION #define VETH_VERSION MICRO_TO_STR(DRV_VERSION) #else -#define VETH_VERSION "0.3.4" +#define VETH_VERSION "0.3.5" #endif
#define MODULE_NAME "veth"
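The central rework in bma_cdev_add_msg() above is an ordering one: copy_from_user() can fault and sleep, so it must not run under a spinlock with interrupts disabled. The data is staged in a kernel buffer first, and the lock is held only for the memcpy into the shared ring. A hedged userspace sketch of that shape (a pthread mutex stands in for the spinlock, memcpy for copy_from_user; bounds checks are elided):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t send_lock = PTHREAD_MUTEX_INITIALIZER;
static char ring[4096];
static size_t write_off;

static int copy_in(char *dst, const char *user_src, size_t n)
{
	memcpy(dst, user_src, n);	/* copy_from_user(): may block/fault */
	return 0;
}

static int add_msg(const char *user_msg, size_t len)
{
	char *staging = malloc(len);	/* kmalloc(msg_len, GFP_KERNEL) */

	if (!staging)
		return -1;
	if (copy_in(staging, user_msg, len)) {	/* done before locking */
		free(staging);
		return -1;
	}

	pthread_mutex_lock(&send_lock);	/* held only for the short memcpy */
	memcpy(ring + write_off, staging, len);
	write_off += len;
	pthread_mutex_unlock(&send_lock);

	free(staging);
	return (int)len;
}

int main(void)
{
	return add_msg("hello", 5) == 5 ? 0 : 1;
}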
From: Peter Zijlstra peterz@infradead.org
mainline inclusion from mainline-v6.2-rc1 commit 97e3d26b5e5f371b3ee223d94dd123e6c442ba80 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Seth found that the CPU entry area (the piece of per-CPU data that is mapped into the userspace page tables for kPTI) is not subject to any randomization, irrespective of kASLR settings.
On x86_64 a whole P4D (512 GB) of virtual address space is reserved for this structure, which is plenty large enough to randomize things a little.
As such, use a straightforward randomization scheme that avoids duplicates to spread the existing CPUs over the available space.
[ bp: Fix le build. ]
Reported-by: Seth Jenkins sethjenkins@google.com Reviewed-by: Kees Cook keescook@chromium.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Borislav Petkov bp@suse.de Conflicts: arch/x86/mm/cpu_entry_area.c
Use get_random_u32() instead of prandom_u32_max() in init_cea_offsets().
With CONFIG_RANDOMIZE_BASE=y, KASLR initializes the prandom seed via prandom_seed_state() before init_cea_offsets() runs. But with CONFIG_RANDOMIZE_BASE=n, the prandom seed is initialized only after init_cea_offsets(), which causes the cea offset to always be 0.
Upstream commit d4150779e60f ("random32: use real rng for non-deterministic randomness") makes prandom_u32_max() use get_random_u32() instead of prandom_u32(), so prandom_u32_max() no longer needs to wait for the prandom seed to be initialized. But that patch has many prerequisite patches that have not been merged,
so we adopt the current solution as a workaround: directly use get_random_u32() in init_cea_offsets(), which also simplifies the code. Signed-off-by: Ke Liu liuke94@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/include/asm/cpu_entry_area.h | 4 --- arch/x86/include/asm/pgtable_areas.h | 8 ++++- arch/x86/kernel/hw_breakpoint.c | 2 +- arch/x86/mm/cpu_entry_area.c | 51 ++++++++++++++++++++++++--- 4 files changed, 55 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h index dd5ea1bdf04c..e2c04a5015b0 100644 --- a/arch/x86/include/asm/cpu_entry_area.h +++ b/arch/x86/include/asm/cpu_entry_area.h @@ -130,10 +130,6 @@ struct cpu_entry_area { };
#define CPU_ENTRY_AREA_SIZE (sizeof(struct cpu_entry_area)) -#define CPU_ENTRY_AREA_ARRAY_SIZE (CPU_ENTRY_AREA_SIZE * NR_CPUS) - -/* Total size includes the readonly IDT mapping page as well: */ -#define CPU_ENTRY_AREA_TOTAL_SIZE (CPU_ENTRY_AREA_ARRAY_SIZE + PAGE_SIZE)
DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area); DECLARE_PER_CPU(struct cea_exception_stacks *, cea_exception_stacks); diff --git a/arch/x86/include/asm/pgtable_areas.h b/arch/x86/include/asm/pgtable_areas.h index d34cce1b995c..4f056fb88174 100644 --- a/arch/x86/include/asm/pgtable_areas.h +++ b/arch/x86/include/asm/pgtable_areas.h @@ -11,6 +11,12 @@
#define CPU_ENTRY_AREA_RO_IDT_VADDR ((void *)CPU_ENTRY_AREA_RO_IDT)
-#define CPU_ENTRY_AREA_MAP_SIZE (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_ARRAY_SIZE - CPU_ENTRY_AREA_BASE) +#ifdef CONFIG_X86_32 +#define CPU_ENTRY_AREA_MAP_SIZE (CPU_ENTRY_AREA_PER_CPU + \ + (CPU_ENTRY_AREA_SIZE * NR_CPUS) - \ + CPU_ENTRY_AREA_BASE) +#else +#define CPU_ENTRY_AREA_MAP_SIZE P4D_SIZE +#endif
#endif /* _ASM_X86_PGTABLE_AREAS_H */ diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c index 668a4a6533d9..bbb0f737aab1 100644 --- a/arch/x86/kernel/hw_breakpoint.c +++ b/arch/x86/kernel/hw_breakpoint.c @@ -266,7 +266,7 @@ static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
/* CPU entry erea is always used for CPU entry */ if (within_area(addr, end, CPU_ENTRY_AREA_BASE, - CPU_ENTRY_AREA_TOTAL_SIZE)) + CPU_ENTRY_AREA_MAP_SIZE)) return true;
/* diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index 6c2f1b76a0b6..98396a67958c 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -5,6 +5,7 @@ #include <linux/kallsyms.h> #include <linux/kcore.h> #include <linux/pgtable.h> +#include <linux/random.h>
#include <asm/cpu_entry_area.h> #include <asm/fixmap.h> @@ -15,16 +16,57 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage) #ifdef CONFIG_X86_64 static DEFINE_PER_CPU_PAGE_ALIGNED(struct exception_stacks, exception_stacks); DEFINE_PER_CPU(struct cea_exception_stacks*, cea_exception_stacks); -#endif
-#ifdef CONFIG_X86_32 +static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, _cea_offset); + +static __always_inline unsigned int cea_offset(unsigned int cpu) +{ + return per_cpu(_cea_offset, cpu); +} + +static __init void init_cea_offsets(void) +{ + unsigned int max_cea; + unsigned int i, j; + + max_cea = (CPU_ENTRY_AREA_MAP_SIZE - PAGE_SIZE) / CPU_ENTRY_AREA_SIZE; + + /* O(sodding terrible) */ + for_each_possible_cpu(i) { + unsigned int cea; + +again: + /* + * Directly use get_random_u32() instead of prandom_u32_max + * to avoid seed can't be generated when CONFIG_RANDOMIZE_BASE=n. + */ + cea = (u32)(((u64) get_random_u32() * max_cea) >> 32); + + for_each_possible_cpu(j) { + if (cea_offset(j) == cea) + goto again; + + if (i == j) + break; + } + + per_cpu(_cea_offset, i) = cea; + } +} +#else /* !X86_64 */ DECLARE_PER_CPU_PAGE_ALIGNED(struct doublefault_stack, doublefault_stack); + +static __always_inline unsigned int cea_offset(unsigned int cpu) +{ + return cpu; +} +static inline void init_cea_offsets(void) { } #endif
/* Is called from entry code, so must be noinstr */ noinstr struct cpu_entry_area *get_cpu_entry_area(int cpu) { - unsigned long va = CPU_ENTRY_AREA_PER_CPU + cpu * CPU_ENTRY_AREA_SIZE; + unsigned long va = CPU_ENTRY_AREA_PER_CPU + cea_offset(cpu) * CPU_ENTRY_AREA_SIZE; BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
return (struct cpu_entry_area *) va; @@ -205,7 +247,6 @@ static __init void setup_cpu_entry_area_ptes(void)
/* The +1 is for the readonly IDT: */ BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); - BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); BUG_ON(CPU_ENTRY_AREA_BASE & ~PMD_MASK);
start = CPU_ENTRY_AREA_BASE; @@ -221,6 +262,8 @@ void __init setup_cpu_entry_areas(void) { unsigned int cpu;
+ init_cea_offsets(); + setup_cpu_entry_area_ptes();
for_each_possible_cpu(cpu)
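The rejection-sampling loop in init_cea_offsets() can be mirrored in a standalone sketch (a seeded xorshift stands in for get_random_u32(); the multiply-shift maps a 32-bit random value into [0, max_cea) without modulo bias):

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 8

static uint32_t rnd_state = 0x2545f491;	/* fixed seed for the demo */

static uint32_t xorshift32(void)	/* stands in for get_random_u32() */
{
	rnd_state ^= rnd_state << 13;
	rnd_state ^= rnd_state >> 17;
	rnd_state ^= rnd_state << 5;
	return rnd_state;
}

int main(void)
{
	uint32_t max_cea = 1024;	/* free slots in the 512 GB P4D */
	uint32_t offset[NR_CPUS];

	for (int i = 0; i < NR_CPUS; i++) {
		uint32_t cea;
again:
		cea = (uint32_t)(((uint64_t)xorshift32() * max_cea) >> 32);

		for (int j = 0; j < i; j++)	/* the O(n^2) duplicate scan */
			if (offset[j] == cea)
				goto again;

		offset[i] = cea;
		printf("cpu %d -> slot %u\n", i, cea);
	}
	return 0;
}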
From: Andrey Ryabinin ryabinin.a.a@gmail.com
mainline inclusion from mainline-v6.2-rc1 commit 3f148f3318140035e87decc1214795ff0755757b category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
KASAN maps shadow for the entire CPU-entry-area: [CPU_ENTRY_AREA_BASE, CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE]
This will explode once the per-cpu entry areas are randomized since it will increase CPU_ENTRY_AREA_MAP_SIZE to 512 GB and KASAN fails to allocate shadow for such big area.
Fix this by allocating the KASAN shadow only for the CPU entry area addresses that are actually used, i.e. those mapped by cea_map_percpu_pages().
Thanks to the 0day folks for finding and reporting this to be an issue.
[ dhansen: tweak changelog since this will get committed before peterz's actual cpu-entry-area randomization ]
Signed-off-by: Andrey Ryabinin ryabinin.a.a@gmail.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Tested-by: Yujie Liu yujie.liu@intel.com Cc: kernel test robot yujie.liu@intel.com Link: https://lore.kernel.org/r/202210241508.2e203c3d-yujie.liu@intel.com Signed-off-by: Tong Tiangen tongtiangen@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/include/asm/kasan.h | 3 +++ arch/x86/mm/cpu_entry_area.c | 8 +++++++- arch/x86/mm/kasan_init_64.c | 15 ++++++++++++--- 3 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h index 13e70da38bed..de75306b932e 100644 --- a/arch/x86/include/asm/kasan.h +++ b/arch/x86/include/asm/kasan.h @@ -28,9 +28,12 @@ #ifdef CONFIG_KASAN void __init kasan_early_init(void); void __init kasan_init(void); +void __init kasan_populate_shadow_for_vaddr(void *va, size_t size, int nid); #else static inline void kasan_early_init(void) { } static inline void kasan_init(void) { } +static inline void kasan_populate_shadow_for_vaddr(void *va, size_t size, + int nid) { } #endif
#endif diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index 98396a67958c..bf5786f3f3c4 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -10,6 +10,7 @@ #include <asm/cpu_entry_area.h> #include <asm/fixmap.h> #include <asm/desc.h> +#include <asm/kasan.h>
static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage);
@@ -95,8 +96,13 @@ void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags) static void __init cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot) { + phys_addr_t pa = per_cpu_ptr_to_phys(ptr); + + kasan_populate_shadow_for_vaddr(cea_vaddr, pages * PAGE_SIZE, + early_pfn_to_nid(PFN_DOWN(pa))); + for ( ; pages; pages--, cea_vaddr+= PAGE_SIZE, ptr += PAGE_SIZE) - cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot); + cea_set_pte(cea_vaddr, pa, prot); }
static void __init percpu_setup_debug_store(unsigned int cpu) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 1a50434c8a4d..18d574af30ef 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -318,6 +318,18 @@ void __init kasan_early_init(void) kasan_map_early_shadow(init_top_pgt); }
+void __init kasan_populate_shadow_for_vaddr(void *va, size_t size, int nid) +{ + unsigned long shadow_start, shadow_end; + + shadow_start = (unsigned long)kasan_mem_to_shadow(va); + shadow_start = round_down(shadow_start, PAGE_SIZE); + shadow_end = (unsigned long)kasan_mem_to_shadow(va + size); + shadow_end = round_up(shadow_end, PAGE_SIZE); + + kasan_populate_shadow(shadow_start, shadow_end, nid); +} + void __init kasan_init(void) { int i; @@ -395,9 +407,6 @@ void __init kasan_init(void) kasan_mem_to_shadow((void *)VMALLOC_END + 1), shadow_cpu_entry_begin);
- kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin, - (unsigned long)shadow_cpu_entry_end, 0); - kasan_populate_early_shadow(shadow_cpu_entry_end, kasan_mem_to_shadow((void *)__START_KERNEL_map));
From: Sean Christopherson seanjc@google.com
mainline inclusion from mainline-v6.2-rc1 commit 80d72a8f76e8f3f0b5a70b8c7022578e17bde8e7 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Recompute the physical address for each per-CPU page in the CPU entry area; a recent commit inadvertently modified cea_map_percpu_pages() such that every PTE was mapped to the physical address of the first page.
Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand") Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Andrey Ryabinin ryabinin.a.a@gmail.com Link: https://lkml.kernel.org/r/20221110203504.1985010-2-seanjc@google.com Signed-off-by: Tong Tiangen tongtiangen@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/mm/cpu_entry_area.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index bf5786f3f3c4..2eaf9ee41495 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -102,7 +102,7 @@ cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot) early_pfn_to_nid(PFN_DOWN(pa)));
for ( ; pages; pages--, cea_vaddr+= PAGE_SIZE, ptr += PAGE_SIZE) - cea_set_pte(cea_vaddr, pa, prot); + cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot); }
static void __init percpu_setup_debug_store(unsigned int cpu)
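The bug class fixed here is worth a tiny standalone illustration (hypothetical names; an identity cast stands in for per_cpu_ptr_to_phys()): a value that depends on the loop variable must be computed inside the loop, otherwise every iteration sees the first page's address.

#include <stdio.h>

static unsigned long to_phys_stub(const void *p)	/* per_cpu_ptr_to_phys() */
{
	return (unsigned long)p;
}

int main(void)
{
	char pages[4][16];
	unsigned long pte[4];
	char *ptr = &pages[0][0];

	/* buggy form: unsigned long pa = to_phys_stub(ptr); hoisted here */
	for (int i = 0; i < 4; i++, ptr += 16)
		pte[i] = to_phys_stub(ptr);	/* recompute per page (the fix) */

	for (int i = 0; i < 4; i++)
		printf("pte[%d] = %#lx\n", i, pte[i]);
	return 0;
}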
From: Sean Christopherson seanjc@google.com
mainline inclusion from mainline-v6.2-rc1 commit 97650148a15e0b30099d6175ffe278b9f55ec66a category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Populate a KASAN shadow for the entire possible per-CPU range of the CPU entry area instead of requiring that each individual chunk map a shadow. Mapping shadows individually is error prone, e.g. the per-CPU GDT mapping was left behind, which can lead to not-present page faults during KASAN validation if the kernel performs a software lookup into the GDT. The DS buffer is also likely affected.
The motivation for mapping the per-CPU areas on-demand was to avoid mapping the entire 512GiB range that's reserved for the CPU entry area, shaving a few bytes by not creating shadows for potentially unused memory was not a goal.
The bug is most easily reproduced by doing a sigreturn with a garbage CS in the sigcontext, e.g.
int main(void) { struct sigcontext regs;
syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul); syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul); syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
memset(®s, 0, sizeof(regs)); regs.cs = 0x1d0; syscall(__NR_rt_sigreturn); return 0; }
to coerce the kernel into doing a GDT lookup to compute CS.base when reading the instruction bytes on the subsequent #GP to determine whether or not the #GP is something the kernel should handle, e.g. to fixup UMIP violations or to emulate CLI/STI for IOPL=3 applications.
BUG: unable to handle page fault for address: fffffbc8379ace00 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 16c03a067 P4D 16c03a067 PUD 15b990067 PMD 15b98f067 PTE 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 3 PID: 851 Comm: r2 Not tainted 6.1.0-rc3-next-20221103+ #432 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kasan_check_range+0xdf/0x190 Call Trace: <TASK> get_desc+0xb0/0x1d0 insn_get_seg_base+0x104/0x270 insn_fetch_from_user+0x66/0x80 fixup_umip_exception+0xb1/0x530 exc_general_protection+0x181/0x210 asm_exc_general_protection+0x22/0x30 RIP: 0003:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 0003:0000000000000000 EFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000001d0 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK>
Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand") Reported-by: syzbot+ffb4f000dc2872c93f62@syzkaller.appspotmail.com Suggested-by: Andrey Ryabinin ryabinin.a.a@gmail.com Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Andrey Ryabinin ryabinin.a.a@gmail.com Link: https://lkml.kernel.org/r/20221110203504.1985010-3-seanjc@google.com Signed-off-by: Tong Tiangen tongtiangen@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/mm/cpu_entry_area.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index 2eaf9ee41495..88e2cc4d4e75 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -96,11 +96,6 @@ void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags) static void __init cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot) { - phys_addr_t pa = per_cpu_ptr_to_phys(ptr); - - kasan_populate_shadow_for_vaddr(cea_vaddr, pages * PAGE_SIZE, - early_pfn_to_nid(PFN_DOWN(pa))); - for ( ; pages; pages--, cea_vaddr+= PAGE_SIZE, ptr += PAGE_SIZE) cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot); } @@ -200,6 +195,9 @@ static void __init setup_cpu_entry_area(unsigned int cpu) pgprot_t tss_prot = PAGE_KERNEL; #endif
+ kasan_populate_shadow_for_vaddr(cea, CPU_ENTRY_AREA_SIZE, + early_cpu_to_node(cpu)); + cea_set_pte(&cea->gdt, get_cpu_gdt_paddr(cpu), gdt_prot);
cea_map_percpu_pages(&cea->entry_stack_page,
From: Sean Christopherson seanjc@google.com
mainline inclusion from mainline-v6.2-rc1 commit 7077d2ccb94dafd00b29cc2d601c9f6891648f5b category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Rename the CPU entry area variables in kasan_init() to shorten their names; a future fix will reference the beginning of the per-CPU portion of the CPU entry area, and shadow_cpu_entry_per_cpu_begin is a bit much.
No functional change intended.
Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Andrey Ryabinin ryabinin.a.a@gmail.com Link: https://lkml.kernel.org/r/20221110203504.1985010-4-seanjc@google.com Signed-off-by: Tong Tiangen tongtiangen@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/mm/kasan_init_64.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 18d574af30ef..75b0eb482487 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -333,7 +333,7 @@ void __init kasan_populate_shadow_for_vaddr(void *va, size_t size, int nid) void __init kasan_init(void) { int i; - void *shadow_cpu_entry_begin, *shadow_cpu_entry_end; + void *shadow_cea_begin, *shadow_cea_end;
memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
@@ -374,16 +374,16 @@ void __init kasan_init(void) map_range(&pfn_mapped[i]); }
- shadow_cpu_entry_begin = (void *)CPU_ENTRY_AREA_BASE; - shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin); - shadow_cpu_entry_begin = (void *)round_down( - (unsigned long)shadow_cpu_entry_begin, PAGE_SIZE); + shadow_cea_begin = (void *)CPU_ENTRY_AREA_BASE; + shadow_cea_begin = kasan_mem_to_shadow(shadow_cea_begin); + shadow_cea_begin = (void *)round_down( + (unsigned long)shadow_cea_begin, PAGE_SIZE);
- shadow_cpu_entry_end = (void *)(CPU_ENTRY_AREA_BASE + + shadow_cea_end = (void *)(CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE); - shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end); - shadow_cpu_entry_end = (void *)round_up( - (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); + shadow_cea_end = kasan_mem_to_shadow(shadow_cea_end); + shadow_cea_end = (void *)round_up( + (unsigned long)shadow_cea_end, PAGE_SIZE);
kasan_populate_early_shadow( kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), @@ -405,9 +405,9 @@ void __init kasan_init(void)
kasan_populate_early_shadow( kasan_mem_to_shadow((void *)VMALLOC_END + 1), - shadow_cpu_entry_begin); + shadow_cea_begin);
- kasan_populate_early_shadow(shadow_cpu_entry_end, + kasan_populate_early_shadow(shadow_cea_end, kasan_mem_to_shadow((void *)__START_KERNEL_map));
kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(_stext),
From: Sean Christopherson seanjc@google.com
mainline inclusion from mainline-v6.2-rc1 commit bde258d97409f2a45243cb393a55ea9ecfc7aba5 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Add helpers to dedup code for aligning shadow address up/down to page boundaries when translating an address to its shadow.
No functional change intended.
Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Andrey Ryabinin ryabinin.a.a@gmail.com Link: https://lkml.kernel.org/r/20221110203504.1985010-5-seanjc@google.com Signed-off-by: Tong Tiangen tongtiangen@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/mm/kasan_init_64.c | 40 ++++++++++++++++++++----------------- 1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 75b0eb482487..c1d99556c024 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -318,22 +318,33 @@ void __init kasan_early_init(void) kasan_map_early_shadow(init_top_pgt); }
+static unsigned long kasan_mem_to_shadow_align_down(unsigned long va) +{ + unsigned long shadow = (unsigned long)kasan_mem_to_shadow((void *)va); + + return round_down(shadow, PAGE_SIZE); +} + +static unsigned long kasan_mem_to_shadow_align_up(unsigned long va) +{ + unsigned long shadow = (unsigned long)kasan_mem_to_shadow((void *)va); + + return round_up(shadow, PAGE_SIZE); +} + void __init kasan_populate_shadow_for_vaddr(void *va, size_t size, int nid) { unsigned long shadow_start, shadow_end;
- shadow_start = (unsigned long)kasan_mem_to_shadow(va); - shadow_start = round_down(shadow_start, PAGE_SIZE); - shadow_end = (unsigned long)kasan_mem_to_shadow(va + size); - shadow_end = round_up(shadow_end, PAGE_SIZE); - + shadow_start = kasan_mem_to_shadow_align_down((unsigned long)va); + shadow_end = kasan_mem_to_shadow_align_up((unsigned long)va + size); kasan_populate_shadow(shadow_start, shadow_end, nid); }
void __init kasan_init(void) { + unsigned long shadow_cea_begin, shadow_cea_end; int i; - void *shadow_cea_begin, *shadow_cea_end;
memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
@@ -374,16 +385,9 @@ void __init kasan_init(void) map_range(&pfn_mapped[i]); }
- shadow_cea_begin = (void *)CPU_ENTRY_AREA_BASE; - shadow_cea_begin = kasan_mem_to_shadow(shadow_cea_begin); - shadow_cea_begin = (void *)round_down( - (unsigned long)shadow_cea_begin, PAGE_SIZE); - - shadow_cea_end = (void *)(CPU_ENTRY_AREA_BASE + - CPU_ENTRY_AREA_MAP_SIZE); - shadow_cea_end = kasan_mem_to_shadow(shadow_cea_end); - shadow_cea_end = (void *)round_up( - (unsigned long)shadow_cea_end, PAGE_SIZE); + shadow_cea_begin = kasan_mem_to_shadow_align_down(CPU_ENTRY_AREA_BASE); + shadow_cea_end = kasan_mem_to_shadow_align_up(CPU_ENTRY_AREA_BASE + + CPU_ENTRY_AREA_MAP_SIZE);
kasan_populate_early_shadow( kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), @@ -405,9 +409,9 @@ void __init kasan_init(void)
kasan_populate_early_shadow( kasan_mem_to_shadow((void *)VMALLOC_END + 1), - shadow_cea_begin); + (void *)shadow_cea_begin);
- kasan_populate_early_shadow(shadow_cea_end, + kasan_populate_early_shadow((void *)shadow_cea_end, kasan_mem_to_shadow((void *)__START_KERNEL_map));
kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(_stext),
From: Sean Christopherson seanjc@google.com
mainline inclusion from mainline-v6.2-rc1 commit 1cfaac2400c73378e78182a706be0f3ac8b93cd7 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C6UC CVE: CVE-2023-0597
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Populate the shadow for the shared portion of the CPU entry area, i.e. the read-only IDT mapping, during KASAN initialization. A recent change modified KASAN to map the per-CPU areas on demand, but forgot to keep a shadow for the common area that is shared amongst all CPUs.
Map the common area in KASAN init instead of letting idt_map_in_cea() do the dirty work so that it Just Works in the unlikely event more shared data is shoved into the CPU entry area.
The bug manifests as a not-present #PF when software attempts to lookup an IDT entry, e.g. when KVM is handling IRQs on Intel CPUs (KVM performs direct CALL to the IRQ handler to avoid the overhead of INTn):
BUG: unable to handle page fault for address: fffffbc0000001d8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 16c03a067 P4D 16c03a067 PUD 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 5 PID: 901 Comm: repro Tainted: G W 6.1.0-rc3+ #410 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kasan_check_range+0xdf/0x190 vmx_handle_exit_irqoff+0x152/0x290 [kvm_intel] vcpu_run+0x1d89/0x2bd0 [kvm] kvm_arch_vcpu_ioctl_run+0x3ce/0xa70 [kvm] kvm_vcpu_ioctl+0x349/0x900 [kvm] __x64_sys_ioctl+0xb8/0xf0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand") Reported-by: syzbot+8cdd16fd5a6c0565e227@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20221110203504.1985010-6-seanjc@google.com Signed-off-by: Tong Tiangen tongtiangen@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- arch/x86/mm/kasan_init_64.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index c1d99556c024..b4b187960dd0 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -343,7 +343,7 @@ void __init kasan_populate_shadow_for_vaddr(void *va, size_t size, int nid)
void __init kasan_init(void) { - unsigned long shadow_cea_begin, shadow_cea_end; + unsigned long shadow_cea_begin, shadow_cea_per_cpu_begin, shadow_cea_end; int i;
memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt)); @@ -386,6 +386,7 @@ void __init kasan_init(void) }
shadow_cea_begin = kasan_mem_to_shadow_align_down(CPU_ENTRY_AREA_BASE); + shadow_cea_per_cpu_begin = kasan_mem_to_shadow_align_up(CPU_ENTRY_AREA_PER_CPU); shadow_cea_end = kasan_mem_to_shadow_align_up(CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE);
@@ -411,6 +412,15 @@ void __init kasan_init(void) kasan_mem_to_shadow((void *)VMALLOC_END + 1), (void *)shadow_cea_begin);
+ /* + * Populate the shadow for the shared portion of the CPU entry area. + * Shadows for the per-CPU areas are mapped on-demand, as each CPU's + * area is randomly placed somewhere in the 512GiB range and mapping + * the entire 512GiB range is prohibitively expensive. + */ + kasan_populate_shadow(shadow_cea_begin, + shadow_cea_per_cpu_begin, 0); + kasan_populate_early_shadow((void *)shadow_cea_end, kasan_mem_to_shadow((void *)__START_KERNEL_map));