From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
stable inclusion
from stable-v5.10.223
commit 2f2fa9cf7c3537958a82fbe8c8595a5eb0861ad7
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAGPRT
CVE: CVE-2024-42104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=…
--------------------------------
commit bb76c6c274683c8570ad788f79d4b875bde0e458 upstream.
Syzbot reported that mounting and unmounting a specific pattern of
corrupted nilfs2 filesystem images causes a use-after-free of metadata
file inodes, which triggers a kernel bug in lru_add_fn().
As Jan Kara pointed out, this is because the link count of a metadata file
gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
tries to delete that inode (ifile inode in this case).
The inconsistency occurs because directories containing the inode numbers
of these metadata files that should not be visible in the namespace are
read without checking.
Fix this issue by treating the inode numbers of these internal files as
errors in the sanity check helper when reading directory folios/pages.
Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
analysis.
Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: syzbot+d79afb004be235636ee8(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
Reported-by: Jan Kara <jack(a)suse.cz>
Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
Tested-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Xia Fukun <xiafukun(a)huawei.com>
---
fs/nilfs2/dir.c | 6 ++++++
fs/nilfs2/nilfs.h | 5 +++++
2 files changed, 11 insertions(+)
diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c
index eb7de9e2a384e..1f9e31d285cc9 100644
--- a/fs/nilfs2/dir.c
+++ b/fs/nilfs2/dir.c
@@ -143,6 +143,9 @@ static bool nilfs_check_page(struct page *page)
goto Enamelen;
if (((offs + rec_len - 1) ^ offs) & ~(chunk_size-1))
goto Espan;
+ if (unlikely(p->inode &&
+ NILFS_PRIVATE_INODE(le64_to_cpu(p->inode))))
+ goto Einumber;
}
if (offs != limit)
goto Eend;
@@ -168,6 +171,9 @@ static bool nilfs_check_page(struct page *page)
goto bad_entry;
Espan:
error = "directory entry across blocks";
+ goto bad_entry;
+Einumber:
+ error = "disallowed inode number";
bad_entry:
nilfs_error(sb,
"bad entry in directory #%lu: %s - offset=%lu, inode=%lu, rec_len=%d, name_len=%d",
diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h
index a2f247b6a209e..986620709b9de 100644
--- a/fs/nilfs2/nilfs.h
+++ b/fs/nilfs2/nilfs.h
@@ -125,6 +125,11 @@ enum {
#define NILFS_VALID_INODE(sb, ino) \
((ino) >= NILFS_FIRST_INO(sb) || (NILFS_SYS_INO_BITS & BIT(ino)))
+#define NILFS_PRIVATE_INODE(ino) ({ \
+ ino_t __ino = (ino); \
+ ((__ino) < NILFS_USER_INO && (__ino) != NILFS_ROOT_INO && \
+ (__ino) != NILFS_SKETCH_INO); })
+
/**
* struct nilfs_transaction_info: context information for synchronization
* @ti_magic: Magic number
--
2.34.1
From: Edward Adam Davis <eadavis(a)qq.com>
mainline inclusion
from mainline-v6.10-rc7
commit 89e856e124f9ae548572c56b1b70c2255705f8fe
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAGEK1
CVE: CVE-2024-41062
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
-------------------------------------------
The problem occurs between the system call to close the sock and hci_rx_work,
where the former releases the sock and the latter accesses it without lock protection.
CPU0 CPU1
---- ----
sock_close hci_rx_work
l2cap_sock_release hci_acldata_packet
l2cap_sock_kill l2cap_recv_frame
sk_free l2cap_conless_channel
l2cap_sock_recv_cb
If hci_rx_work processes the data that needs to be received before the sock is
closed, then everything is normal; Otherwise, the work thread may access the
released sock when receiving data.
Add a chan mutex in the rx callback of the sock to achieve synchronization between
the sock release and recv cb.
Sock is dead, so set chan data to NULL, avoid others use invalid sock pointer.
Reported-and-tested-by: syzbot+b7f6f8c9303466e16c8a(a)syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis(a)qq.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Conflicts:
net/bluetooth/l2cap_sock.c
[The conflict occurs because the commit 89e856e124f9("bluetooth/l2cap:
sync sock recv cb and release") is not merged]
Signed-off-by: Zhengchao Shao <shaozhengchao(a)huawei.com>
---
net/bluetooth/l2cap_sock.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index c91f4c190bbc..c7217d3ecd8b 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1238,6 +1238,10 @@ static void l2cap_sock_kill(struct sock *sk)
BT_DBG("sk %p state %s", sk, state_to_string(sk->sk_state));
+ /* Sock is dead, so set chan data to NULL, avoid other task use invalid
+ * sock pointer.
+ */
+ l2cap_pi(sk)->chan->data = NULL;
/* Kill poor orphan */
l2cap_chan_put(l2cap_pi(sk)->chan);
@@ -1480,9 +1484,21 @@ static struct l2cap_chan *l2cap_sock_new_connection_cb(struct l2cap_chan *chan)
static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb)
{
- struct sock *sk = chan->data;
+ struct sock *sk;
int err;
+ /* To avoid race with sock_release, a chan lock needs to be added here
+ * to synchronize the sock.
+ */
+ l2cap_chan_hold(chan);
+ l2cap_chan_lock(chan);
+ sk = chan->data;
+ if (!sk) {
+ l2cap_chan_unlock(chan);
+ l2cap_chan_put(chan);
+ return -ENXIO;
+ }
+
lock_sock(sk);
if (l2cap_pi(sk)->rx_busy_skb) {
@@ -1519,6 +1535,8 @@ static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb)
done:
release_sock(sk);
+ l2cap_chan_unlock(chan);
+ l2cap_chan_put(chan);
return err;
}
--
2.34.1
From: Edward Adam Davis <eadavis(a)qq.com>
mainline inclusion
from mainline-v6.10-rc7
commit 89e856e124f9ae548572c56b1b70c2255705f8fe
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAGEK1
CVE: CVE-2024-41062
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
-------------------------------------------
The problem occurs between the system call to close the sock and hci_rx_work,
where the former releases the sock and the latter accesses it without lock protection.
CPU0 CPU1
---- ----
sock_close hci_rx_work
l2cap_sock_release hci_acldata_packet
l2cap_sock_kill l2cap_recv_frame
sk_free l2cap_conless_channel
l2cap_sock_recv_cb
If hci_rx_work processes the data that needs to be received before the sock is
closed, then everything is normal; Otherwise, the work thread may access the
released sock when receiving data.
Add a chan mutex in the rx callback of the sock to achieve synchronization between
the sock release and recv cb.
Sock is dead, so set chan data to NULL, avoid others use invalid sock pointer.
Reported-and-tested-by: syzbot+b7f6f8c9303466e16c8a(a)syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis(a)qq.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Conflicts:
net/bluetooth/l2cap_sock.c
[The conflict occurs because the commit 89e856e124f9("bluetooth/l2cap:
sync sock recv cb and release") is not merged]
Signed-off-by: Zhengchao Shao <shaozhengchao(a)huawei.com>
---
net/bluetooth/l2cap_sock.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 13afdc599b58..409475da9283 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1238,6 +1238,10 @@ static void l2cap_sock_kill(struct sock *sk)
BT_DBG("sk %p state %s", sk, state_to_string(sk->sk_state));
+ /* Sock is dead, so set chan data to NULL, avoid other task use invalid
+ * sock pointer.
+ */
+ l2cap_pi(sk)->chan->data = NULL;
/* Kill poor orphan */
l2cap_chan_put(l2cap_pi(sk)->chan);
@@ -1480,9 +1484,21 @@ static struct l2cap_chan *l2cap_sock_new_connection_cb(struct l2cap_chan *chan)
static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb)
{
- struct sock *sk = chan->data;
+ struct sock *sk;
int err;
+ /* To avoid race with sock_release, a chan lock needs to be added here
+ * to synchronize the sock.
+ */
+ l2cap_chan_hold(chan);
+ l2cap_chan_lock(chan);
+ sk = chan->data;
+ if (!sk) {
+ l2cap_chan_unlock(chan);
+ l2cap_chan_put(chan);
+ return -ENXIO;
+ }
+
lock_sock(sk);
if (l2cap_pi(sk)->rx_busy_skb) {
@@ -1519,6 +1535,8 @@ static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb)
done:
release_sock(sk);
+ l2cap_chan_unlock(chan);
+ l2cap_chan_put(chan);
return err;
}
--
2.34.1
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9OJK9
CVE: NA
----------------------------------------
Here are many scenarios we tested with smart_grid, we found that the
first domain level is key to the benchmark.
The reason is that there are many things such as interrupt affinity,
memory affinity factor that can have a big impact on the test.
Before this patch, the first domain level is unchangeable after
creation.
This patch introduce the 'cpu.rebuild_affinity_domain' to dynamically
reconfigure all domain levels.
Typical use cases:
echo $cpu_id > cpu.rebuild_affinity_domain
The cpu_id means which cpu we want to set first level.
If we set cpu_id = 34, we can see some change like:
---------------- -----------------
| level 0 (0-31) | | level 0 (32-63) |
---------------- -----------------
v v
------------------- ------------------
| level 1 (0-63) | | level 1 (0-63) |
------------------- ------------------
v --> v
--------------------- --------------------
| level 2 (0-95) | | level 2 (0-95) |
--------------------- --------------------
v v
------------------------ ----------------------
| level 3 (0-127) | | level 3 (0-127) |
------------------------ ----------------------
There are number of constraints on the rebuild feature:
1. Only rebuild domain while auto mode disabled.
(cpu.dynamic_affinity_mode == 1)
2. Only rebuild on active and housekeeping cpu.
(Offline and isolate CPUs are forbidden)
3. This file is write only.
Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com>
---
kernel/sched/core.c | 13 +++++++++++++
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
3 files changed, 57 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 700dbd19eb44..b9b3f536959d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11555,6 +11555,15 @@ static int cpu_affinity_stat_show(struct seq_file *sf, void *v)
return 0;
}
+
+static int cpu_rebuild_affinity_domain_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype,
+ u64 cpu)
+{
+ struct task_group *tg = css_tg(css);
+
+ return tg_rebuild_affinity_domains(cpu, tg->auto_affinity);
+}
#endif /* CONFIG_QOS_SCHED_SMART_GRID */
#ifdef CONFIG_QOS_SCHED
@@ -11735,6 +11744,10 @@ static struct cftype cpu_legacy_files[] = {
.name = "affinity_stat",
.seq_show = cpu_affinity_stat_show,
},
+ {
+ .name = "rebuild_affinity_domain",
+ .write_u64 = cpu_rebuild_affinity_domain_u64,
+ },
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 240e315d8e02..e27058ab099d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7252,6 +7252,49 @@ static void destroy_auto_affinity(struct task_group *tg)
kfree(tg->auto_affinity);
tg->auto_affinity = NULL;
}
+
+int tg_rebuild_affinity_domains(int cpu, struct auto_affinity *auto_affi)
+{
+ int ret = 0;
+ int level = 0;
+ struct sched_domain *tmp;
+
+ if (unlikely(!auto_affi))
+ return -EPERM;
+
+ mutex_lock(&smart_grid_used_mutex);
+ raw_spin_lock_irq(&auto_affi->lock);
+ /* Only build domain while auto mode disabled */
+ if (auto_affi->mode) {
+ ret = -EPERM;
+ goto unlock_all;
+ }
+
+ /* Only build on active and housekeeping cpu */
+ if (!cpu_active(cpu) || !housekeeping_cpu(cpu, HK_TYPE_DOMAIN)) {
+ ret = -EINVAL;
+ goto unlock_all;
+ }
+
+ for_each_domain(cpu, tmp) {
+ if (!auto_affi->ad.domains[level] || !auto_affi->ad.domains_orig[level])
+ continue;
+
+ /* rebuild domain[,_orig] and reset schedstat counter */
+ cpumask_copy(auto_affi->ad.domains[level], sched_domain_span(tmp));
+ cpumask_copy(auto_affi->ad.domains_orig[level], auto_affi->ad.domains[level]);
+ __schedstat_set(auto_affi->ad.stay_cnt[level], 0);
+ level++;
+ }
+
+ /* trigger to update smart grid zone */
+ sched_grid_zone_update(false);
+
+unlock_all:
+ raw_spin_unlock_irq(&auto_affi->lock);
+ mutex_unlock(&smart_grid_used_mutex);
+ return ret;
+}
#else
static void __maybe_unused destroy_auto_affinity(struct task_group *tg) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9de2bac6444f..79b93b0a2ef0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -551,6 +551,7 @@ extern void start_auto_affinity(struct auto_affinity *auto_affi);
extern void stop_auto_affinity(struct auto_affinity *auto_affi);
extern int init_auto_affinity(struct task_group *tg);
extern void tg_update_affinity_domains(int cpu, int online);
+extern int tg_rebuild_affinity_domains(int cpu, struct auto_affinity *auto_affi);
#else
static inline int init_auto_affinity(struct task_group *tg)
--
2.34.1
From: Geliang Tang <tanggeliang(a)kylinos.cn>
stable inclusion
from stable-v6.6.41
commit b180739b45a38b4caa88fe16bb5273072e6613dc
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAGEM4
CVE: CVE-2024-41048
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…
---------------------------
[ Upstream commit f0c18025693707ec344a70b6887f7450bf4c826b ]
When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch
platform, the following kernel panic occurs:
[...]
Oops[#1]:
CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18
Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018
... ...
ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560
ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0
CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
PRMD: 0000000c (PPLV0 +PIE +PWE)
EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
BADV: 0000000000000040
PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack
Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...)
Stack : ...
Call Trace:
[<9000000004162774>] copy_page_to_iter+0x74/0x1c0
[<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560
[<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0
[<90000000049aae34>] inet_recvmsg+0x54/0x100
[<900000000481ad5c>] sock_recvmsg+0x7c/0xe0
[<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0
[<900000000481e27c>] sys_recvfrom+0x1c/0x40
[<9000000004c076ec>] do_syscall+0x8c/0xc0
[<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception
Kernel relocated by 0x3510000
.text @ 0x9000000003710000
.data @ 0x9000000004d70000
.bss @ 0x9000000006469400
---[ end Kernel panic - not syncing: Fatal exception ]---
[...]
This crash happens every time when running sockmap_skb_verdict_shutdown
subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in the
sk_msg_recvmsg(). Due to the different implementations depending on the
architecture, page_address(NULL) will trigger a panic on Loongarch
platform but not on x86 platform. So this bug was hidden on x86 platform
for a while, but now it is exposed on Loongarch platform. The root cause
is that a zero length skb (skb->len == 0) was put on the queue.
This zero length skb is a TCP FIN packet, which was sent by shutdown(),
invoked in test_sockmap_skb_verdict_shutdown():
shutdown(p1, SHUT_WR);
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no
page is put to this sge (see sg_set_page in sg_set_page), but this empty
sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by
sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it
to kmap_local_page() and to page_address(), then kernel panics.
To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(),
if copy is zero, that means it's a zero length skb, skip invoking
copy_page_to_iter(). We are using the EFAULT return triggered by
copy_page_to_iter to check for is_fin in tcp_bpf.c.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: John Fastabend <john.fastabend(a)gmail.com>
Signed-off-by: Geliang Tang <tanggeliang(a)kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend(a)gmail.com>
Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.171999…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
---
net/core/skmsg.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index f2e7ce81fef0..04a98327f81a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -434,7 +434,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
page = sg_page(sge);
if (copied + copy > len)
copy = len - copied;
- copy = copy_page_to_iter(page, sge->offset, copy, iter);
+ if (copy)
+ copy = copy_page_to_iter(page, sge->offset, copy, iter);
if (!copy) {
copied = copied ? copied : -EFAULT;
goto out;
--
2.34.1
From: Geliang Tang <tanggeliang(a)kylinos.cn>
mainline inclusion
from mainline-v6.10
commit f0c18025693707ec344a70b6887f7450bf4c826b
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAGEM4
CVE: CVE-2024-41048
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
---------------------------
When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch
platform, the following kernel panic occurs:
[...]
Oops[#1]:
CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18
Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018
... ...
ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560
ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0
CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
PRMD: 0000000c (PPLV0 +PIE +PWE)
EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
BADV: 0000000000000040
PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack
Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...)
Stack : ...
Call Trace:
[<9000000004162774>] copy_page_to_iter+0x74/0x1c0
[<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560
[<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0
[<90000000049aae34>] inet_recvmsg+0x54/0x100
[<900000000481ad5c>] sock_recvmsg+0x7c/0xe0
[<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0
[<900000000481e27c>] sys_recvfrom+0x1c/0x40
[<9000000004c076ec>] do_syscall+0x8c/0xc0
[<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception
Kernel relocated by 0x3510000
.text @ 0x9000000003710000
.data @ 0x9000000004d70000
.bss @ 0x9000000006469400
---[ end Kernel panic - not syncing: Fatal exception ]---
[...]
This crash happens every time when running sockmap_skb_verdict_shutdown
subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in the
sk_msg_recvmsg(). Due to the different implementations depending on the
architecture, page_address(NULL) will trigger a panic on Loongarch
platform but not on x86 platform. So this bug was hidden on x86 platform
for a while, but now it is exposed on Loongarch platform. The root cause
is that a zero length skb (skb->len == 0) was put on the queue.
This zero length skb is a TCP FIN packet, which was sent by shutdown(),
invoked in test_sockmap_skb_verdict_shutdown():
shutdown(p1, SHUT_WR);
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no
page is put to this sge (see sg_set_page in sg_set_page), but this empty
sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by
sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it
to kmap_local_page() and to page_address(), then kernel panics.
To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(),
if copy is zero, that means it's a zero length skb, skip invoking
copy_page_to_iter(). We are using the EFAULT return triggered by
copy_page_to_iter to check for is_fin in tcp_bpf.c.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: John Fastabend <john.fastabend(a)gmail.com>
Signed-off-by: Geliang Tang <tanggeliang(a)kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend(a)gmail.com>
Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.171999…
Conflicts:
net/core/skmsg.c
net/ipv4/tcp_bpf.c
[Did not backport 2bc793e3272a ("skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()").]
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
---
net/ipv4/tcp_bpf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 1cff6ae3f6fd..8645017da9df 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -36,7 +36,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
page = sg_page(sge);
if (copied + copy > len)
copy = len - copied;
- copy = copy_page_to_iter(page, sge->offset, copy, iter);
+ if (copy)
+ copy = copy_page_to_iter(page, sge->offset, copy, iter);
if (!copy)
return copied ? copied : -EFAULT;
--
2.34.1
From: Geliang Tang <tanggeliang(a)kylinos.cn>
mainline inclusion
from mainline-v6.10
commit f0c18025693707ec344a70b6887f7450bf4c826b
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAGEM4
CVE: CVE-2024-41048
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
---------------------------
When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch
platform, the following kernel panic occurs:
[...]
Oops[#1]:
CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18
Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018
... ...
ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560
ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0
CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
PRMD: 0000000c (PPLV0 +PIE +PWE)
EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
BADV: 0000000000000040
PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack
Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...)
Stack : ...
Call Trace:
[<9000000004162774>] copy_page_to_iter+0x74/0x1c0
[<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560
[<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0
[<90000000049aae34>] inet_recvmsg+0x54/0x100
[<900000000481ad5c>] sock_recvmsg+0x7c/0xe0
[<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0
[<900000000481e27c>] sys_recvfrom+0x1c/0x40
[<9000000004c076ec>] do_syscall+0x8c/0xc0
[<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception
Kernel relocated by 0x3510000
.text @ 0x9000000003710000
.data @ 0x9000000004d70000
.bss @ 0x9000000006469400
---[ end Kernel panic - not syncing: Fatal exception ]---
[...]
This crash happens every time when running sockmap_skb_verdict_shutdown
subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in the
sk_msg_recvmsg(). Due to the different implementations depending on the
architecture, page_address(NULL) will trigger a panic on Loongarch
platform but not on x86 platform. So this bug was hidden on x86 platform
for a while, but now it is exposed on Loongarch platform. The root cause
is that a zero length skb (skb->len == 0) was put on the queue.
This zero length skb is a TCP FIN packet, which was sent by shutdown(),
invoked in test_sockmap_skb_verdict_shutdown():
shutdown(p1, SHUT_WR);
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no
page is put to this sge (see sg_set_page in sg_set_page), but this empty
sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by
sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it
to kmap_local_page() and to page_address(), then kernel panics.
To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(),
if copy is zero, that means it's a zero length skb, skip invoking
copy_page_to_iter(). We are using the EFAULT return triggered by
copy_page_to_iter to check for is_fin in tcp_bpf.c.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: John Fastabend <john.fastabend(a)gmail.com>
Signed-off-by: Geliang Tang <tanggeliang(a)kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend(a)gmail.com>
Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.171999…
Conflicts:
net/core/skmsg.c
net/ipv4/tcp_bpf.c
[Did not backport 2bc793e3272a ("skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()").]
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
---
net/ipv4/tcp_bpf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cd214291290c..5484a8e677fc 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -36,7 +36,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
page = sg_page(sge);
if (copied + copy > len)
copy = len - copied;
- copy = copy_page_to_iter(page, sge->offset, copy, iter);
+ if (copy)
+ copy = copy_page_to_iter(page, sge->offset, copy, iter);
if (!copy)
return copied ? copied : -EFAULT;
--
2.34.1