From: Coly Li colyli@suse.de
stable inclusion from stable-v4.19.280 commit 84e13235e08941ce37aa9ee238b6dd007170f0fc category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
commit 9bbf5feecc7eab2c370496c1c161bbfe62084028 upstream.
This is an already known issue that dm-thin volume cannot be used as swap, otherwise a deadlock may happen when dm-thin internal memory demand triggers swap I/O on the dm-thin volume itself.
But thanks to commit a666e5c05e7c ("dm: fix deadlock when swapping to encrypted device"), the limit_swap_bios target flag can also be used for dm-thin to avoid the recursive I/O when it is used as swap.
Fix is to simply set ti->limit_swap_bios to true in both pool_ctr() and thin_ctr().
In my test, I create a dm-thin volume /dev/vg/swap and use it as swap device. Then I run fio on another dm-thin volume /dev/vg/main and use large --blocksize to trigger swap I/O onto /dev/vg/swap.
The following fio command line is used in my test, fio --name recursive-swap-io --lockmem 1 --iodepth 128 \ --ioengine libaio --filename /dev/vg/main --rw randrw \ --blocksize 1M --numjobs 32 --time_based --runtime=12h
Without this fix, the whole system can be locked up within 15 seconds.
With this fix, there is no any deadlock or hung task observed after 2 hours of running fio.
Furthermore, if blocksize is changed from 1M to 128M, after around 30 seconds fio has no visible I/O, and the out-of-memory killer message shows up in kernel message. After around 20 minutes all fio processes are killed and the whole system is back to being alive.
This is exactly what is expected when recursive I/O happens on dm-thin volume when it is used as swap.
Depends-on: a666e5c05e7c ("dm: fix deadlock when swapping to encrypted device") Cc: stable@vger.kernel.org Signed-off-by: Coly Li colyli@suse.de Acked-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- drivers/md/dm-thin.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index fd8c89e3fc54..d884bb9cef94 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -3361,6 +3361,7 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv) pt->low_water_blocks = low_water_blocks; pt->adjusted_pf = pt->requested_pf = pf; ti->num_flush_bios = 1; + ti->limit_swap_bios = true;
/* * Only need to enable discards if the pool should pass @@ -4244,6 +4245,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv) goto bad;
ti->num_flush_bios = 1; + ti->limit_swap_bios = true; ti->flush_supported = true; ti->per_io_data_size = sizeof(struct dm_thin_endio_hook);
From: Jiasheng Jiang jiasheng@iscas.ac.cn
stable inclusion from stable-v4.19.280 commit 0d96bd507ed7e7d565b6d53ebd3874686f123b2e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
commit d3aa3e060c4a80827eb801fc448debc9daa7c46b upstream.
Check alloc_precpu()'s return value and return an error from dm_stats_init() if it fails. Update alloc_dev() to fail if dm_stats_init() does.
Otherwise, a NULL pointer dereference will occur in dm_stats_cleanup() even if dm-stats isn't being actively used.
Fixes: fd2ed4d25270 ("dm: add statistics support") Cc: stable@vger.kernel.org Signed-off-by: Jiasheng Jiang jiasheng@iscas.ac.cn Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- drivers/md/dm-stats.c | 7 ++++++- drivers/md/dm-stats.h | 2 +- drivers/md/dm.c | 4 +++- 3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c index 3d59f3e208c5..0eb48e739f7e 100644 --- a/drivers/md/dm-stats.c +++ b/drivers/md/dm-stats.c @@ -188,7 +188,7 @@ static int dm_stat_in_flight(struct dm_stat_shared *shared) atomic_read(&shared->in_flight[WRITE]); }
-void dm_stats_init(struct dm_stats *stats) +int dm_stats_init(struct dm_stats *stats) { int cpu; struct dm_stats_last_position *last; @@ -196,11 +196,16 @@ void dm_stats_init(struct dm_stats *stats) mutex_init(&stats->mutex); INIT_LIST_HEAD(&stats->list); stats->last = alloc_percpu(struct dm_stats_last_position); + if (!stats->last) + return -ENOMEM; + for_each_possible_cpu(cpu) { last = per_cpu_ptr(stats->last, cpu); last->last_sector = (sector_t)ULLONG_MAX; last->last_rw = UINT_MAX; } + + return 0; }
void dm_stats_cleanup(struct dm_stats *stats) diff --git a/drivers/md/dm-stats.h b/drivers/md/dm-stats.h index 2ddfae678f32..dcac11fce03b 100644 --- a/drivers/md/dm-stats.h +++ b/drivers/md/dm-stats.h @@ -22,7 +22,7 @@ struct dm_stats_aux { unsigned long long duration_ns; };
-void dm_stats_init(struct dm_stats *st); +int dm_stats_init(struct dm_stats *st); void dm_stats_cleanup(struct dm_stats *st);
struct mapped_device; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 288dab0ab226..a1ff61b880d6 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2031,7 +2031,9 @@ static struct mapped_device *alloc_dev(int minor) bio_set_dev(&md->flush_bio, md->bdev); md->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
- dm_stats_init(&md->stats); + r = dm_stats_init(&md->stats); + if (r < 0) + goto bad;
/* Populate the mapping, nobody knows we exist yet */ spin_lock(&_minor_lock);
From: Linus Torvalds torvalds@linux-foundation.org
stable inclusion from stable-v4.19.280 commit 178ff87d2a0c2d3d74081e1c2efbb33b3487267d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit 6015b1aca1a233379625385feb01dd014aca60b5 ]
The getaffinity() system call uses 'cpumask_size()' to decide how big the CPU mask is - so far so good. It is indeed the allocation size of a cpumask.
But the code also assumes that the whole allocation is initialized without actually doing so itself. That's wrong, because we might have fixed-size allocations (making copying and clearing more efficient), but not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
Having checked other users of 'cpumask_size()', they all seem to be ok, either using it purely for the allocation size, or explicitly zeroing the cpumask before using the size in bytes to copy it.
See for example the ublk_ctrl_get_queue_affinity() function that uses the proper 'zalloc_cpumask_var()' to make sure that the whole mask is cleared, whether the storage is on the stack or if it was an external allocation.
Fix this by just zeroing the allocation before using it. Do the same for the compat version of sched_getaffinity(), which had the same logic.
Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to access the bits. For a cpumask_var_t, it ends up being a pointer to the same data either way, but it's just a good idea to treat it like you would a 'cpumask_t'. The compat case already did that.
Reported-by: Ryan Roberts ryan.roberts@arm.com Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/ Cc: Yury Norov yury.norov@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- kernel/compat.c | 2 +- kernel/sched/core.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/compat.c b/kernel/compat.c index 903e44b9d76c..2eb876a67b32 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -307,7 +307,7 @@ COMPAT_SYSCALL_DEFINE3(sched_getaffinity, compat_pid_t, pid, unsigned int, len, if (len & (sizeof(compat_ulong_t)-1)) return -EINVAL;
- if (!alloc_cpumask_var(&mask, GFP_KERNEL)) + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) return -ENOMEM;
ret = sched_getaffinity(pid, mask); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5d2a0a591e18..168e98c6d51a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4998,14 +4998,14 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, if (len & (sizeof(unsigned long)-1)) return -EINVAL;
- if (!alloc_cpumask_var(&mask, GFP_KERNEL)) + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) return -ENOMEM;
ret = sched_getaffinity(pid, mask); if (ret == 0) { unsigned int retlen = min(len, cpumask_size());
- if (copy_to_user(user_mask_ptr, mask, retlen)) + if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen)) ret = -EFAULT; else ret = retlen;
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v4.19.281 commit 824bc1fd2ec3d389c372bc885170f1943642d010 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit 7d63b67125382ff0ffdfca434acbc94a38bd092b ]
syzbot was able to trigger a panic [1] in icmp_glue_bits(), or more exactly in skb_copy_and_csum_bits()
There is no repro yet, but I think the issue is that syzbot manages to lower device mtu to a small value, fooling __icmp_send()
__icmp_send() must make sure there is enough room for the packet to include at least the headers.
We might in the future refactor skb_copy_and_csum_bits() and its callers to no longer crash when something bad happens.
[1] kernel BUG at net/core/skbuff.c:3343 ! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 15766 Comm: syz-executor.0 Not tainted 6.3.0-rc4-syzkaller-00039-gffe78bbd5121 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 RIP: 0010:skb_copy_and_csum_bits+0x798/0x860 net/core/skbuff.c:3343 Code: f0 c1 c8 08 41 89 c6 e9 73 ff ff ff e8 61 48 d4 f9 e9 41 fd ff ff 48 8b 7c 24 48 e8 52 48 d4 f9 e9 c3 fc ff ff e8 c8 27 84 f9 <0f> 0b 48 89 44 24 28 e8 3c 48 d4 f9 48 8b 44 24 28 e9 9d fb ff ff RSP: 0018:ffffc90000007620 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000001e8 RCX: 0000000000000100 RDX: ffff8880276f6280 RSI: ffffffff87fdd138 RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000 R10: 00000000000001e8 R11: 0000000000000001 R12: 000000000000003c R13: 0000000000000000 R14: ffff888028244868 R15: 0000000000000b0e FS: 00007fbc81f1c700(0000) GS:ffff88802ca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2df43000 CR3: 00000000744db000 CR4: 0000000000150ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> icmp_glue_bits+0x7b/0x210 net/ipv4/icmp.c:353 __ip_append_data+0x1d1b/0x39f0 net/ipv4/ip_output.c:1161 ip_append_data net/ipv4/ip_output.c:1343 [inline] ip_append_data+0x115/0x1a0 net/ipv4/ip_output.c:1322 icmp_push_reply+0xa8/0x440 net/ipv4/icmp.c:370 __icmp_send+0xb80/0x1430 net/ipv4/icmp.c:765 ipv4_send_dest_unreach net/ipv4/route.c:1239 [inline] ipv4_link_failure+0x5a9/0x9e0 net/ipv4/route.c:1246 dst_link_failure include/net/dst.h:423 [inline] arp_error_report+0xcb/0x1c0 net/ipv4/arp.c:296 neigh_invalidate+0x20d/0x560 net/core/neighbour.c:1079 neigh_timer_handler+0xc77/0xff0 net/core/neighbour.c:1166 call_timer_fn+0x1a0/0x580 kernel/time/timer.c:1700 expire_timers+0x29b/0x4b0 kernel/time/timer.c:1751 __run_timers kernel/time/timer.c:2022 [inline]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+d373d60fddbdc915e666@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20230330174502.1915328-1-edumazet@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- net/ipv4/icmp.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 10a5c7aa9c50..a4f838ec7c75 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -761,6 +761,11 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info, room = 576; room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen; room -= sizeof(struct icmphdr); + /* Guard against tiny mtu. We need to include at least one + * IP network header for this message to make any sense. + */ + if (room <= (int)sizeof(struct iphdr)) + goto ende;
icmp_param.data_len = skb_in->len - icmp_param.offset; if (icmp_param.data_len > room)
From: Jakub Kicinski kuba@kernel.org
stable inclusion from stable-v4.19.281 commit be71c3c75a488ca1594a98df0754094179ec8146 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit 275b471e3d2daf1472ae8fa70dc1b50c9e0b9e75 ]
Commit 0db3dc73f7a3 ("[NETPOLL]: tx lock deadlock fix") narrowed down the region under netif_tx_trylock() inside netpoll_send_skb(). (At that point in time netif_tx_trylock() would lock all queues of the device.) Taking the tx lock was problematic because driver's cleanup method may take the same lock. So the change made us hold the xmit lock only around xmit, and expected the driver to take care of locking within ->ndo_poll_controller().
Unfortunately this only works if netpoll isn't itself called with the xmit lock already held. Netpoll code is careful and uses trylock(). The drivers, however, may be using plain lock(). Printing while holding the xmit lock is going to result in rare deadlocks.
Luckily we record the xmit lock owners, so we can scan all the queues, the same way we scan NAPI owners. If any of the xmit locks is held by the local CPU we better not attempt any polling.
It would be nice if we could narrow down the check to only the NAPIs and the queue we're trying to use. I don't see a way to do that now.
Reported-by: Roman Gushchin roman.gushchin@linux.dev Fixes: 0db3dc73f7a3 ("[NETPOLL]: tx lock deadlock fix") Signed-off-by: Jakub Kicinski kuba@kernel.org Reviewed-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- net/core/netpoll.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/net/core/netpoll.c b/net/core/netpoll.c index c77cc26dd7e5..b23c32ed6773 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -136,6 +136,20 @@ static void queue_process(struct work_struct *work) } }
+static int netif_local_xmit_active(struct net_device *dev) +{ + int i; + + for (i = 0; i < dev->num_tx_queues; i++) { + struct netdev_queue *txq = netdev_get_tx_queue(dev, i); + + if (READ_ONCE(txq->xmit_lock_owner) == smp_processor_id()) + return 1; + } + + return 0; +} + static void poll_one_napi(struct napi_struct *napi) { int work; @@ -182,7 +196,10 @@ void netpoll_poll_dev(struct net_device *dev) if (!ni || down_trylock(&ni->dev_lock)) return;
- if (!netif_running(dev)) { + /* Some drivers will take the same locks in poll and xmit, + * we can't poll if local CPU is already in xmit. + */ + if (!netif_running(dev) || netif_local_xmit_active(dev)) { up(&ni->dev_lock); return; }
From: Kan Liang kan.liang@linux.intel.com
stable inclusion from stable-v4.19.281 commit c1412fcad345d0c63874068a556cb1c03b87abb5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit 24d3ae2f37d8bc3c14b31d353c5d27baf582b6a6 ]
The same task check in perf_event_set_output has some potential issues for some usages.
For the current perf code, there is a problem if using of perf_event_open() to have multiple samples getting into the same mmap’d memory when they are both attached to the same process. https://lore.kernel.org/all/92645262-D319-4068-9C44-2409EF44888E@gmail.com/ Because the event->ctx is not ready when the perf_event_set_output() is invoked in the perf_event_open().
Besides the above issue, before the commit bd2756811766 ("perf: Rewrite core context handling"), perf record can errors out when sampling with a hardware event and a software event as below. $ perf record -e cycles,dummy --per-thread ls failed to mmap with 22 (Invalid argument) That's because that prior to the commit a hardware event and a software event are from different task context.
The problem should be a long time issue since commit c3f00c70276d ("perk: Separate find_get_context() from event initialization").
The task struct is stored in the event->hw.target for each per-thread event. It is a more reliable way to determine whether two events are attached to the same task.
The event->hw.target was also introduced several years ago by the commit 50f16a8bf9d7 ("perf: Remove type specific target pointers"). It can not only be used to fix the issue with the current code, but also back port to fix the issues with an older kernel.
Note: The event->hw.target was introduced later than commit c3f00c70276d. The patch may cannot be applied between the commit c3f00c70276d and commit 50f16a8bf9d7. Anybody that wants to back-port this at that period may have to find other solutions.
Fixes: c3f00c70276d ("perf: Separate find_get_context() from event initialization") Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Zhengjun Xing zhengjun.xing@linux.intel.com Link: https://lkml.kernel.org/r/20230322202449.512091-1-kan.liang@linux.intel.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- kernel/events/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c index 96a4ffcce2e5..0dd284509aa7 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10502,7 +10502,7 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event) /* * If its not a per-cpu rb, it must be the same task. */ - if (output_event->cpu == -1 && output_event->ctx != event->ctx) + if (output_event->cpu == -1 && output_event->hw.target != event->hw.target) goto out;
/*
From: John Keeping john@metanate.com
stable inclusion from stable-v4.19.281 commit 51ba3ee276a8f3e3b0b588526ebd8e9a9331aa2f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
commit ea65b41807a26495ff2a73dd8b1bab2751940887 upstream.
If the compiler decides not to inline this function then preemption tracing will always show an IP inside the preemption disabling path and never the function actually calling preempt_{enable,disable}.
Link: https://lore.kernel.org/linux-trace-kernel/20230327173647.1690849-1-john@met...
Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Mark Rutland mark.rutland@arm.com Cc: stable@vger.kernel.org Fixes: f904f58263e1d ("sched/debug: Fix preempt_disable_ip recording for preempt_disable()") Signed-off-by: John Keeping john@metanate.com Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- include/linux/ftrace.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index dd16e8218db3..ed35a950ce58 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -689,7 +689,7 @@ static inline void __ftrace_enabled_restore(int enabled) #define CALLER_ADDR5 ((unsigned long)ftrace_return_address(5)) #define CALLER_ADDR6 ((unsigned long)ftrace_return_address(6))
-static inline unsigned long get_lock_parent_ip(void) +static __always_inline unsigned long get_lock_parent_ip(void) { unsigned long addr = CALLER_ADDR0;
From: Rongwei Wang rongwei.wang@linux.alibaba.com
stable inclusion from stable-v4.19.281 commit a55f268abdb74ac5633b75a09fefb58458e9d2a2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
commit 6fe7d6b992113719e96744d974212df3fcddc76c upstream.
The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below:
core 0 core 1 swapoff
del_from_avail_list(si) waiting
try lock si->lock acquire swap_avail_lock and re-add si into swap_avail_head
acquire si->lock but missing si already being added again, and continuing to clear SWP_WRITEOK, etc.
It can be easily found that a massive warning messages can be triggered inside get_swap_pages() by some special cases, for example, we call madvise(MADV_PAGEOUT) on blocks of touched memory concurrently, meanwhile, run much swapon-swapoff operations (e.g. stress-ng-swap).
However, in the worst case, panic can be caused by the above scene. In swapoff(), the memory used by si could be kept in swap_info[] after turning off a swap. This means memory corruption will not be caused immediately until allocated and reset for a new swap in the swapon path. A panic message caused: (with CONFIG_PLIST_DEBUG enabled)
------------[ cut here ]------------ top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70 Modules linked in: rfkill(E) crct10dif_ce(E)... CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+ Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) pc : plist_check_prev_next_node+0x50/0x70 lr : plist_check_prev_next_node+0x50/0x70 sp : ffff0018009d3c30 x29: ffff0018009d3c40 x28: ffff800011b32a98 x27: 0000000000000000 x26: ffff001803908000 x25: ffff8000128ea088 x24: ffff800011b32a48 x23: 0000000000000028 x22: ffff001800875c00 x21: ffff800010f9e520 x20: ffff001800875c00 x19: ffff001800fdc6e0 x18: 0000000000000030 x17: 0000000000000000 x16: 0000000000000000 x15: 0736076307640766 x14: 0730073007380731 x13: 0736076307640766 x12: 0730073007380731 x11: 000000000004058d x10: 0000000085a85b76 x9 : ffff8000101436e4 x8 : ffff800011c8ce08 x7 : 0000000000000000 x6 : 0000000000000001 x5 : ffff0017df9ed338 x4 : 0000000000000001 x3 : ffff8017ce62a000 x2 : ffff0017df9ed340 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: plist_check_prev_next_node+0x50/0x70 plist_check_head+0x80/0xf0 plist_add+0x28/0x140 add_to_avail_list+0x9c/0xf0 _enable_swap_info+0x78/0xb4 __do_sys_swapon+0x918/0xa10 __arm64_sys_swapon+0x20/0x30 el0_svc_common+0x8c/0x220 do_el0_svc+0x2c/0x90 el0_svc+0x1c/0x30 el0_sync_handler+0xa8/0xb0 el0_sync+0x148/0x180 irq event stamp: 2082270
Now, si->lock locked before calling 'del_from_avail_list()' to make sure other thread see the si had been deleted and SWP_WRITEOK cleared together, will not reinsert again.
This problem exists in versions after stable 5.10.y.
Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.... Fixes: a2468cc9bfdff ("swap: choose swap device according to numa node") Tested-by: Yongchen Yin wb-yyc939293@alibaba-inc.com Signed-off-by: Rongwei Wang rongwei.wang@linux.alibaba.com Cc: Bagas Sanjaya bagasdotme@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Aaron Lu aaron.lu@intel.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- mm/swapfile.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c index 1fc42be5efe8..45b86574541f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -620,6 +620,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) { int nid;
+ assert_spin_locked(&p->lock); for_each_node(nid) plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); } @@ -2669,8 +2670,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) spin_unlock(&swap_lock); goto out_dput; } - del_from_avail_list(p); spin_lock(&p->lock); + del_from_avail_list(p); if (p->prio < 0) { struct swap_info_struct *si = p; int nid;
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v4.19.281 commit 0638f2b9b220ac33e4657010dd0a130ed0cde7df category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit 1c5950fc6fe996235f1d18539b9c6b64b597f50f ]
lena wang reported an issue caused by udpv6_sendmsg() mangling msg->msg_name and msg->msg_namelen, which are later read from ____sys_sendmsg() :
/* * If this is sendmmsg() and sending to current destination address was * successful, remember it. */ if (used_address && err >= 0) { used_address->name_len = msg_sys->msg_namelen; if (msg_sys->msg_name) memcpy(&used_address->name, msg_sys->msg_name, used_address->name_len); }
udpv6_sendmsg() wants to pretend the remote address family is AF_INET in order to call udp_sendmsg().
A fix would be to modify the address in-place, instead of using a local variable, but this could have other side effects.
Instead, restore initial values before we return from udpv6_sendmsg().
Fixes: c71d8ebe7a44 ("net: Fix security_socket_sendmsg() bypass problem.") Reported-by: lena wang lena.wang@mediatek.com Signed-off-by: Eric Dumazet edumazet@google.com Reviewed-by: Maciej Żenczykowski maze@google.com Link: https://lore.kernel.org/r/20230412130308.1202254-1-edumazet@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- net/ipv6/udp.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 88aa8bfb46af..cd317a6b73fc 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1219,9 +1219,11 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) msg->msg_name = &sin; msg->msg_namelen = sizeof(sin); do_udp_sendmsg: - if (__ipv6_only_sock(sk)) - return -ENETUNREACH; - return udp_sendmsg(sk, msg, len); + err = __ipv6_only_sock(sk) ? + -ENETUNREACH : udp_sendmsg(sk, msg, len); + msg->msg_name = sin6; + msg->msg_namelen = addr_len; + return err; } }
From: Robbie Harwood rharwood@redhat.com
stable inclusion from stable-v4.19.281 commit 96acda7332d7f7f9b71e440d387ddebfe564e2f7 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit 4fc5c74dde69a7eda172514aaeb5a7df3600adb3 ]
The PE Format Specification (section "The Attribute Certificate Table (Image Only)") states that `dwLength` is to be rounded up to 8-byte alignment when used for traversal. Therefore, the field is not required to be an 8-byte multiple in the first place.
Accordingly, pesign has not performed this alignment since version 0.110. This causes kexec failure on pesign'd binaries with "PEFILE: Signature wrapper len wrong". Update the comment and relax the check.
Signed-off-by: Robbie Harwood rharwood@redhat.com Signed-off-by: David Howells dhowells@redhat.com cc: Jarkko Sakkinen jarkko@kernel.org cc: Eric Biederman ebiederm@xmission.com cc: Herbert Xu herbert@gondor.apana.org.au cc: keyrings@vger.kernel.org cc: linux-crypto@vger.kernel.org cc: kexec@lists.infradead.org Link: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format#the-attribut... Link: https://github.com/rhboot/pesign Link: https://lore.kernel.org/r/20230220171254.592347-2-rharwood@redhat.com/ # v2 Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- crypto/asymmetric_keys/verify_pefile.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/crypto/asymmetric_keys/verify_pefile.c b/crypto/asymmetric_keys/verify_pefile.c index d178650fd524..411977947adb 100644 --- a/crypto/asymmetric_keys/verify_pefile.c +++ b/crypto/asymmetric_keys/verify_pefile.c @@ -139,11 +139,15 @@ static int pefile_strip_sig_wrapper(const void *pebuf, pr_debug("sig wrapper = { %x, %x, %x }\n", wrapper.length, wrapper.revision, wrapper.cert_type);
- /* Both pesign and sbsign round up the length of certificate table - * (in optional header data directories) to 8 byte alignment. + /* sbsign rounds up the length of certificate table (in optional + * header data directories) to 8 byte alignment. However, the PE + * specification states that while entries are 8-byte aligned, this is + * not included in their length, and as a result, pesign has not + * rounded up since 0.110. */ - if (round_up(wrapper.length, 8) != ctx->sig_len) { - pr_debug("Signature wrapper len wrong\n"); + if (wrapper.length > ctx->sig_len) { + pr_debug("Signature wrapper bigger than sig len (%x > %x)\n", + ctx->sig_len, wrapper.length); return -ELIBBAD; } if (wrapper.revision != WIN_CERT_REVISION_2_0) {
From: Waiman Long longman@redhat.com
stable inclusion from stable-v4.19.281 commit 30e138e23f43f566b6df87bba51e4d97930671fd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
commit ba9182a89626d5f83c2ee4594f55cb9c1e60f0e2 upstream.
After a successful cpuset_can_attach() call which increments the attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach() will be called later. In cpuset_attach(), tasks in cpuset_attach_wq, if present, will be woken up at the end. That is not the case in cpuset_cancel_attach(). So missed wakeup is possible if the attach operation is somehow cancelled. Fix that by doing the wakeup in cpuset_cancel_attach() as well.
Fixes: e44193d39e8d ("cpuset: let hotplug propagation work wait for task attaching") Signed-off-by: Waiman Long longman@redhat.com Reviewed-by: Michal Koutný mkoutny@suse.com Cc: stable@vger.kernel.org # v3.11+ Signed-off-by: Tejun Heo tj@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- kernel/cgroup/cpuset.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 1d13d64108a0..d7ec7f98fe70 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1588,7 +1588,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset) cs = css_cs(css);
mutex_lock(&cpuset_mutex); - css_cs(css)->attach_in_progress--; + cs->attach_in_progress--; + if (!cs->attach_in_progress) + wake_up(&cpuset_attach_wq); mutex_unlock(&cpuset_mutex); }
From: Ziyang Xuan william.xuanziyang@huawei.com
stable inclusion from stable-v4.19.281 commit f394f690a30a5ec0413c62777a058eaf3d6e10d5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM CVE: NA
--------------------------------
[ Upstream commit ea30388baebcce37fd594d425a65037ca35e59e8 ]
Syzbot reported a bug as following:
===================================================== BUG: KMSAN: uninit-value in arch_atomic64_inc arch/x86/include/asm/atomic64_64.h:88 [inline] BUG: KMSAN: uninit-value in arch_atomic_long_inc include/linux/atomic/atomic-long.h:161 [inline] BUG: KMSAN: uninit-value in atomic_long_inc include/linux/atomic/atomic-instrumented.h:1429 [inline] BUG: KMSAN: uninit-value in __ip6_make_skb+0x2f37/0x30f0 net/ipv6/ip6_output.c:1956 arch_atomic64_inc arch/x86/include/asm/atomic64_64.h:88 [inline] arch_atomic_long_inc include/linux/atomic/atomic-long.h:161 [inline] atomic_long_inc include/linux/atomic/atomic-instrumented.h:1429 [inline] __ip6_make_skb+0x2f37/0x30f0 net/ipv6/ip6_output.c:1956 ip6_finish_skb include/net/ipv6.h:1122 [inline] ip6_push_pending_frames+0x10e/0x550 net/ipv6/ip6_output.c:1987 rawv6_push_pending_frames+0xb12/0xb90 net/ipv6/raw.c:579 rawv6_sendmsg+0x297e/0x2e60 net/ipv6/raw.c:922 inet_sendmsg+0x101/0x180 net/ipv4/af_inet.c:827 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] ____sys_sendmsg+0xa8e/0xe70 net/socket.c:2476 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2530 __sys_sendmsg net/socket.c:2559 [inline] __do_sys_sendmsg net/socket.c:2568 [inline] __se_sys_sendmsg net/socket.c:2566 [inline] __x64_sys_sendmsg+0x367/0x540 net/socket.c:2566 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Uninit was created at: slab_post_alloc_hook mm/slab.h:766 [inline] slab_alloc_node mm/slub.c:3452 [inline] __kmem_cache_alloc_node+0x71f/0xce0 mm/slub.c:3491 __do_kmalloc_node mm/slab_common.c:967 [inline] __kmalloc_node_track_caller+0x114/0x3b0 mm/slab_common.c:988 kmalloc_reserve net/core/skbuff.c:492 [inline] __alloc_skb+0x3af/0x8f0 net/core/skbuff.c:565 alloc_skb include/linux/skbuff.h:1270 [inline] __ip6_append_data+0x51c1/0x6bb0 net/ipv6/ip6_output.c:1684 ip6_append_data+0x411/0x580 net/ipv6/ip6_output.c:1854 rawv6_sendmsg+0x2882/0x2e60 net/ipv6/raw.c:915 inet_sendmsg+0x101/0x180 net/ipv4/af_inet.c:827 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] ____sys_sendmsg+0xa8e/0xe70 net/socket.c:2476 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2530 __sys_sendmsg net/socket.c:2559 [inline] __do_sys_sendmsg net/socket.c:2568 [inline] __se_sys_sendmsg net/socket.c:2566 [inline] __x64_sys_sendmsg+0x367/0x540 net/socket.c:2566 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
It is because icmp6hdr does not in skb linear region under the scenario of SOCK_RAW socket. Access icmp6_hdr(skb)->icmp6_type directly will trigger the uninit variable access bug.
Use a local variable icmp6_type to carry the correct value in different scenarios.
Fixes: 14878f75abd5 ("[IPV6]: Add ICMPMsgStats MIB (RFC 4293) [rev 2]") Reported-by: syzbot+8257f4dcef79de670baf@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=3d605ec1d0a7f2a269a1a6936ac7f2b85975ee9... Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- net/ipv6/ip6_output.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 70820d049b92..4f31a781ab37 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1730,8 +1730,13 @@ struct sk_buff *__ip6_make_skb(struct sock *sk, IP6_UPD_PO_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len); if (proto == IPPROTO_ICMPV6) { struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb)); + u8 icmp6_type;
- ICMP6MSGOUT_INC_STATS(net, idev, icmp6_hdr(skb)->icmp6_type); + if (sk->sk_socket->type == SOCK_RAW && !inet_sk(sk)->hdrincl) + icmp6_type = fl6->fl6_icmp_type; + else + icmp6_type = icmp6_hdr(skb)->icmp6_type; + ICMP6MSGOUT_INC_STATS(net, idev, icmp6_type); ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTMSGS); }