From: Josef Bacik <josef(a)toxicpanda.com>
stable inclusion
from stable-v6.6.24
commit ded566b4637f1b6b4c9ba74e7d0b8493e93f19cf
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9Q8ZZ
CVE: CVE-2024-35784
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
--------------------------------
commit b0ad381fa7690244802aed119b478b4bdafc31dd upstream.
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code; we just don't have lockdep
annotations with the extent locking, so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration,
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range, we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace:
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this, simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents,
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable(a)vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Conflicts:
fs/btrfs/extent_io.c
[The key to fixing the patch is the fiemap_process_hole function, which
locks only before querying delalloc. Earlier versions do not have this
function, and adaptation requires a lot of refactoring patches. So do
something similar directly in the get_extent_skip_holes function, which
contains the logic to query delalloc.]
Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com>
---
fs/btrfs/extent_io.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 685a375bb6af..e8ae864a0337 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4532,11 +4532,30 @@ static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
return NULL;
while (1) {
+ struct extent_state *cached_state = NULL;
+ u64 lockstart;
+ u64 lockend;
+
len = last - offset;
if (len == 0)
break;
len = ALIGN(len, sectorsize);
+ lockstart = round_down(offset, sectorsize);
+ lockend = round_up(offset + len, sectorsize) - 1;
+
+ /*
+ * We are only locking for the delalloc range because that's the
+ * only thing that can change here. With fiemap we have a lock
+ * on the inode, so no buffered or direct writes can happen.
+ *
+ * However mmaps and normal page writeback will cause this to
+ * change arbitrarily. We have to lock the extent lock here to
+ * make sure that nobody messes with the tree while we're doing
+ * btrfs_find_delalloc_in_range.
+ */
+ lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
em = btrfs_get_extent_fiemap(inode, offset, len);
+ unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
if (IS_ERR_OR_NULL(em))
return em;
@@ -4679,7 +4698,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
u64 isize = i_size_read(&inode->vfs_inode);
struct btrfs_key found_key;
struct extent_map *em = NULL;
- struct extent_state *cached_state = NULL;
struct btrfs_path *path;
struct btrfs_root *root = inode->root;
struct fiemap_cache cache = { 0 };
@@ -4758,9 +4776,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
last_for_get_extent = isize;
}
- lock_extent_bits(&inode->io_tree, start, start + len - 1,
- &cached_state);
-
em = get_extent_skip_holes(inode, start, last_for_get_extent);
if (!em)
goto out;
@@ -4871,9 +4886,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
ret = emit_last_fiemap_cache(fieinfo, &cache);
free_extent_map(em);
out:
- unlock_extent_cached(&inode->io_tree, start, start + len - 1,
- &cached_state);
-
out_free_ulist:
btrfs_free_path(path);
ulist_free(roots);
--
2.39.2
From: Willem de Bruijn <willemb(a)google.com>
mainline inclusion
from mainline-v6.7-rc1
commit 7b3ba18703a63f6fd487183b9262b08e5632da1b
category: bugfix
bugzilla: 189991, https://gitee.com/src-openeuler/kernel/issues/I9RG0B
CVE: CVE-2023-52843
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
--------------------------------
LLC reads the mac header with eth_hdr without verifying that the skb
has an Ethernet header.
Syzbot was able to enter llc_rcv on a tun device. Tun can insert
packets without mac len and with user configurable skb->protocol
(passing a tun_pi header when not configuring IFF_NO_PI).
BUG: KMSAN: uninit-value in llc_station_ac_send_test_r net/llc/llc_station.c:81 [inline]
BUG: KMSAN: uninit-value in llc_station_rcv+0x6fb/0x1290 net/llc/llc_station.c:111
llc_station_ac_send_test_r net/llc/llc_station.c:81 [inline]
llc_station_rcv+0x6fb/0x1290 net/llc/llc_station.c:111
llc_rcv+0xc5d/0x14a0 net/llc/llc_input.c:218
__netif_receive_skb_one_core net/core/dev.c:5523 [inline]
__netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5637
netif_receive_skb_internal net/core/dev.c:5723 [inline]
netif_receive_skb+0x58/0x660 net/core/dev.c:5782
tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555
tun_get_user+0x54c5/0x69c0 drivers/net/tun.c:2002
Add a mac_len test before all three eth_hdr(skb) calls under net/llc.
There are further uses in include/net/llc_pdu.h. All of these are
protected by a skb->protocol == ETH_P_802_2 test, which does not
protect against this tun scenario. But the mac_len test added in this
patch in llc_fixup_skb will indirectly protect those too: it is called
from llc_rcv before any other LLC code.
It is tempting to just add a blanket mac_len check in llc_rcv, but it
is not clear whether that could break valid LLC paths that do not
assume an Ethernet header: 802.2 LLC may in principle be used on top
of non-802.3 protocols. The commit referenced below shows that it once
was, on top of Token Ring.
At least one of the three eth_hdr uses goes back to before the start
of git history, but the one that syzbot exercises was introduced in
that commit. It is old enough (2008) that effectively all stable
kernels should receive this.
Fixes: f83f1768f833 ("[LLC]: skb allocation size for responses")
Reported-by: syzbot+a8c7be6dee0de1b669cc(a)syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
Link: https://lore.kernel.org/r/20231025234251.3796495-1-willemdebruijn.kernel@gm…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Dong Chenchen <dongchenchen2(a)huawei.com>
---
net/llc/llc_input.c | 10 ++++++++--
net/llc/llc_s_ac.c | 3 +++
net/llc/llc_station.c | 3 +++
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/net/llc/llc_input.c b/net/llc/llc_input.c
index 82cb93f66b9b..de5023b9607c 100644
--- a/net/llc/llc_input.c
+++ b/net/llc/llc_input.c
@@ -127,8 +127,14 @@ static inline int llc_fixup_skb(struct sk_buff *skb)
skb->transport_header += llc_len;
skb_pull(skb, llc_len);
if (skb->protocol == htons(ETH_P_802_2)) {
- __be16 pdulen = eth_hdr(skb)->h_proto;
- s32 data_size = ntohs(pdulen) - llc_len;
+ __be16 pdulen;
+ s32 data_size;
+
+ if (skb->mac_len < ETH_HLEN)
+ return 0;
+
+ pdulen = eth_hdr(skb)->h_proto;
+ data_size = ntohs(pdulen) - llc_len;
if (data_size < 0 ||
!pskb_may_pull(skb, data_size))
diff --git a/net/llc/llc_s_ac.c b/net/llc/llc_s_ac.c
index 7ae4cc684d3a..4cf636bb7850 100644
--- a/net/llc/llc_s_ac.c
+++ b/net/llc/llc_s_ac.c
@@ -153,6 +153,9 @@ int llc_sap_action_send_test_r(struct llc_sap *sap, struct sk_buff *skb)
int rc = 1;
u32 data_size;
+ if (skb->mac_len < ETH_HLEN)
+ return 1;
+
llc_pdu_decode_sa(skb, mac_da);
llc_pdu_decode_da(skb, mac_sa);
llc_pdu_decode_ssap(skb, &dsap);
diff --git a/net/llc/llc_station.c b/net/llc/llc_station.c
index c29170e767a8..64e2c67e16ba 100644
--- a/net/llc/llc_station.c
+++ b/net/llc/llc_station.c
@@ -77,6 +77,9 @@ static int llc_station_ac_send_test_r(struct sk_buff *skb)
u32 data_size;
struct sk_buff *nskb;
+ if (skb->mac_len < ETH_HLEN)
+ goto out;
+
/* The test request command is type U (llc_len = 3) */
data_size = ntohs(eth_hdr(skb)->h_proto) - 3;
nskb = llc_alloc_frame(NULL, skb->dev, LLC_PDU_TYPE_U, data_size);
--
2.25.1
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9VPMT
---------------------------
Remove the warning printk for the 'rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list' check to avoid an rq deadlock.
Deadlock analysis:
cpu 0
distribute_cfs_runtime --- rq_lock_irqsave(rq, &rf);
->__warn_printk
->try_to_wake_up --- rq_lock(rq, &rf), deadlock
Call Trace:
queued_spin_lock_slowpath at ffff000080173358
try_to_wake_up at ffff000080141068
wake_up_process at ffff00008014113c
insert_work at ffff000080123750
__queue_work at ffff0000801257ac
queue_work_on at ffff000080125c54
drm_fb_helper_dirty at ffff0000806dcd44
drm_fb_helper_sys_imageblit at ffff0000806dcf04
virtio_gpu_3d_imageblit at ffff000000c915d0 [virtio_gpu]
soft_cursor at ffff0000805e3e04
bit_cursor at ffff0000805e3654
fbcon_cursor at ffff0000805df404
hide_cursor at ffff000080677d68
vt_console_print at ffff0000806799dc
console_unlock at ffff000080183d78
vprintk_emit at ffff000080185948
vprintk_default at ffff000080185b80
vprintk_func at ffff000080186c44
printk at ffff000080186394
__warn_printk at ffff000080102d60
unthrottle_cfs_rq at ffff000080155e50
distribute_cfs_runtime at ffff00008015617c
sched_cfs_period_timer at ffff00008015654c
__hrtimer_run_queues at ffff0000801b2c58
hrtimer_interrupt at ffff0000801b3c74
arch_timer_handler_virt at ffff00008089dc3c
handle_percpu_devid_irq at ffff00008018fb3c
generic_handle_irq at ffff000080187140
__handle_domain_irq at ffff000080187adc
gic_handle_irq at ffff000080081814
Fixes: 6e9efc5d870d ("sched/fair: Add tmp_alone_branch assertion")
Signed-off-by: Hui Tang <tanghui20(a)huawei.com>
---
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bd5aa6dedb3..aee13d30a7de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -393,9 +393,12 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
}
}
+/*
+ * An rq deadlock is possible when the warning is triggered,
+ * since try_to_wake_up() may be called from __warn_printk().
+ */
static inline void assert_list_leaf_cfs_rq(struct rq *rq)
{
- SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
}
/* Iterate thr' all leaf cfs_rq's on a runqueue */
--
2.34.1
From: Josef Bacik <josef(a)toxicpanda.com>
stable inclusion
from stable-v6.6.24
commit ded566b4637f1b6b4c9ba74e7d0b8493e93f19cf
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9Q8ZZ
CVE: CVE-2024-35784
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
--------------------------------
commit b0ad381fa7690244802aed119b478b4bdafc31dd upstream.
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code; we just don't have lockdep
annotations with the extent locking, so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration,
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range, we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace:
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable(a)vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Conflicts:
fs/btrfs/extent_io.c
[The key to fixing the patch is the fiemap_process_hole function, which
locks only before querying delalloc. Earlier versions do not have this
function, and adaptation requires a lot of refactoring patches. So do
something similar directly in the get_extent_skip_holes function, which
contains the logic to query delalloc.]
Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com>
---
fs/btrfs/extent_io.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b20021c501d7..8aa7738d9aad 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4332,12 +4332,31 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
return NULL;
while (1) {
+ struct extent_state *cached_state = NULL;
+ u64 lockstart;
+ u64 lockend;
+
len = last - offset;
if (len == 0)
break;
len = ALIGN(len, sectorsize);
+ lockstart = round_down(offset, sectorsize);
+ lockend = round_up(offset + len, sectorsize) - 1;
+
+ /*
+ * We are only locking for the delalloc range because that's the
+ * only thing that can change here. With fiemap we have a lock
+ * on the inode, so no buffered or direct writes can happen.
+ *
+ * However mmaps and normal page writeback will cause this to
+ * change arbitrarily. We have to lock the extent lock here to
+ * make sure that nobody messes with the tree while we're doing
+ * btrfs_find_delalloc_in_range.
+ */
+ lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, &cached_state);
em = btrfs_get_extent_fiemap(BTRFS_I(inode), NULL, 0, offset,
len, 0);
+ unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, &cached_state);
if (IS_ERR_OR_NULL(em))
return em;
@@ -4481,7 +4500,6 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 isize = i_size_read(inode);
struct btrfs_key found_key;
struct extent_map *em = NULL;
- struct extent_state *cached_state = NULL;
struct btrfs_path *path;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct fiemap_cache cache = { 0 };
@@ -4547,9 +4565,6 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
last_for_get_extent = isize;
}
- lock_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len - 1,
- &cached_state);
-
em = get_extent_skip_holes(inode, start, last_for_get_extent);
if (!em)
goto out;
@@ -4662,8 +4677,6 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
free_extent_map(em);
out:
btrfs_free_path(path);
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1,
- &cached_state);
return ret;
}
--
2.39.2
From: Josef Bacik <josef(a)toxicpanda.com>
stable inclusion
from stable-v6.6.24
commit ded566b4637f1b6b4c9ba74e7d0b8493e93f19cf
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9Q8ZZ
CVE: CVE-2024-35784
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
--------------------------------
commit b0ad381fa7690244802aed119b478b4bdafc31dd upstream.
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code; we just don't have lockdep
annotations with the extent locking, so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration,
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range, we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace:
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable(a)vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Conflicts:
fs/btrfs/extent_io.c
[The key to fixing the patch is the fiemap_process_hole function, which
locks only before querying delalloc. Earlier versions do not have this
function, and adaptation requires a lot of refactoring patches. So do
something similar directly in the get_extent_skip_holes function, which
contains the logic to query delalloc.]
Signed-off-by: Zizhi Wo <wozizhi(a)huawei.com>
---
fs/btrfs/extent_io.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 685a375bb6af..0d7843323930 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4532,11 +4532,30 @@ static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
return NULL;
while (1) {
+ struct extent_state *cached_state = NULL;
+ u64 lockstart;
+ u64 lockend;
+
len = last - offset;
if (len == 0)
break;
len = ALIGN(len, sectorsize);
+ lockstart = round_down(offset, sectorsize);
+ lockend = round_up(offset + len, sectorsize) - 1;
+
+ /*
+ * We are only locking for the delalloc range because that's the
+ * only thing that can change here. With fiemap we have a lock
+ * on the inode, so no buffered or direct writes can happen.
+ *
+ * However mmaps and normal page writeback will cause this to
+ * change arbitrarily. We have to lock the extent lock here to
+ * make sure that nobody messes with the tree while we're doing
+ * btrfs_find_delalloc_in_range.
+ */
+ lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
em = btrfs_get_extent_fiemap(inode, offset, len);
+ unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
if (IS_ERR_OR_NULL(em))
return em;
@@ -4679,7 +4698,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
u64 isize = i_size_read(&inode->vfs_inode);
struct btrfs_key found_key;
struct extent_map *em = NULL;
- struct extent_state *cached_state = NULL;
struct btrfs_path *path;
struct btrfs_root *root = inode->root;
struct fiemap_cache cache = { 0 };
@@ -4758,9 +4776,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
last_for_get_extent = isize;
}
- lock_extent_bits(&inode->io_tree, start, start + len - 1,
- &cached_state);
-
em = get_extent_skip_holes(inode, start, last_for_get_extent);
if (!em)
goto out;
@@ -4871,9 +4886,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
ret = emit_last_fiemap_cache(fieinfo, &cache);
free_extent_map(em);
out:
- unlock_extent_cached(&inode->io_tree, start, start + len - 1,
- &cached_state);
-
out_free_ulist:
btrfs_free_path(path);
ulist_free(roots);
--
2.39.2