Send 14 patches to test patchwork->PR function
Baisong Zhong (1): media: dvb-usb: az6027: fix null-ptr-deref in az6027_i2c_xfer()
Chen Zhongjin (1): ftrace: Fix invalid address access in lookup_rec() when index is 0
Darrick J. Wong (1): ext4: fix another off-by-one fsmap error on 1k block filesystems
David Hildenbrand (2): mm: optimize do_wp_page() for exclusive pages in the swapcache mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
Kuniyuki Iwashima (1): seccomp: Move copy_seccomp() to no failure path.
Luke D. Jones (1): HID: asus: Remove check for same LED brightness on set
Nicholas Piggin (1): mm/vmalloc: huge vmalloc backing pages should be split rather than compound
Pietro Borrello (2): HID: asus: use spinlock to protect concurrent accesses HID: asus: use spinlock to safely schedule workers
Xin Long (2): tipc: set con sock in tipc_conn_alloc tipc: add an extra conn_get in tipc_conn_alloc
Zheng Yejian (1): livepatch/core: Fix hungtask against cpu hotplug on x86
Zhihao Cheng (1): jbd2: fix data missing when reusing bh which is ready to be checkpointed
drivers/hid/hid-asus.c | 38 ++++++++++++++++++----- drivers/media/usb/dvb-usb/az6027.c | 4 +++ fs/ext4/fsmap.c | 2 ++ fs/jbd2/transaction.c | 50 +++++++++++++++++------------- kernel/fork.c | 17 ++++++---- kernel/livepatch/core.c | 36 ++++++++++++++------- kernel/trace/ftrace.c | 3 +- mm/memory.c | 28 +++++++++++++---- mm/vmalloc.c | 22 ++++++++++--- net/tipc/topsrv.c | 20 ++++++------ 10 files changed, 154 insertions(+), 66 deletions(-)
From: Zhihao Cheng chengzhihao1@huawei.com
mainline inclusion from mainline-v6.3-rc1 commit e6b9bd7290d334451ce054e98e752abc055e0034 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6C5HV CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Following process will make data lost and could lead to a filesystem corrupted problem:
1. jh(bh) is inserted into T1->t_checkpoint_list, bh is dirty, and jh->b_transaction = NULL 2. T1 is added into journal->j_checkpoint_transactions. 3. Get bh prepare to write while doing checkpoing: PA PB do_get_write_access jbd2_log_do_checkpoint spin_lock(&jh->b_state_lock) if (buffer_dirty(bh)) clear_buffer_dirty(bh) // clear buffer dirty set_buffer_jbddirty(bh) transaction = journal->j_checkpoint_transactions jh = transaction->t_checkpoint_list if (!buffer_dirty(bh)) __jbd2_journal_remove_checkpoint(jh) // bh won't be flushed jbd2_cleanup_journal_tail __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved) 4. Aborting journal/Power-cut before writing latest bh on journal area.
In this way we get a corrupted filesystem with bh's data lost.
Fix it by moving the clearing of buffer_dirty bit just before the call to __jbd2_journal_file_buffer(), both bit clearing and jh->b_transaction assignment are under journal->j_list_lock locked, so that jbd2_log_do_checkpoint() will wait until jh's new transaction fininshed even bh is currently not dirty. And journal_shrink_one_cp_list() won't remove jh from checkpoint list if the buffer head is reused in do_get_write_access().
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216898 Cc: stable@kernel.org Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: zhanchengbin zhanchengbin1@huawei.com Suggested-by: Jan Kara jack@suse.cz Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20230110015327.1181863-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Reviewed-by: Yang Erkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/jbd2/transaction.c | 50 +++++++++++++++++++++++++------------------ 1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index cefee2dead54..8fa88c42fcb4 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -984,36 +984,28 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, * ie. locked but not dirty) or tune2fs (which may actually have * the buffer dirtied, ugh.) */
- if (buffer_dirty(bh)) { + if (buffer_dirty(bh) && jh->b_transaction) { + warn_dirty_buffer(bh); /* - * First question: is this buffer already part of the current - * transaction or the existing committing transaction? - */ - if (jh->b_transaction) { - J_ASSERT_JH(jh, - jh->b_transaction == transaction || - jh->b_transaction == - journal->j_committing_transaction); - if (jh->b_next_transaction) - J_ASSERT_JH(jh, jh->b_next_transaction == - transaction); - warn_dirty_buffer(bh); - } - /* - * In any case we need to clean the dirty flag and we must - * do it under the buffer lock to be sure we don't race - * with running write-out. + * We need to clean the dirty flag and we must do it under the + * buffer lock to be sure we don't race with running write-out. */ JBUFFER_TRACE(jh, "Journalling dirty buffer"); clear_buffer_dirty(bh); + /* + * The buffer is going to be added to BJ_Reserved list now and + * nothing guarantees jbd2_journal_dirty_metadata() will be + * ever called for it. So we need to set jbddirty bit here to + * make sure the buffer is dirtied and written out when the + * journaling machinery is done with it. + */ set_buffer_jbddirty(bh); }
- unlock_buffer(bh); - error = -EROFS; if (is_handle_aborted(handle)) { spin_unlock(&jh->b_state_lock); + unlock_buffer(bh); goto out; } error = 0; @@ -1023,8 +1015,10 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, * b_next_transaction points to it */ if (jh->b_transaction == transaction || - jh->b_next_transaction == transaction) + jh->b_next_transaction == transaction) { + unlock_buffer(bh); goto done; + }
/* * this is the first time this transaction is touching this buffer, @@ -1048,10 +1042,24 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, */ smp_wmb(); spin_lock(&journal->j_list_lock); + if (test_clear_buffer_dirty(bh)) { + /* + * Execute buffer dirty clearing and jh->b_transaction + * assignment under journal->j_list_lock locked to + * prevent bh being removed from checkpoint list if + * the buffer is in an intermediate state (not dirty + * and jh->b_transaction is NULL). + */ + JBUFFER_TRACE(jh, "Journalling dirty buffer"); + set_buffer_jbddirty(bh); + } __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved); spin_unlock(&journal->j_list_lock); + unlock_buffer(bh); goto done; } + unlock_buffer(bh); + /* * If there is already a copy-out version of this buffer, then we don't * need to make another one
From: "Luke D. Jones" luke@ljones.dev
mainline inclusion from mainline-v5.14-rc4 commit 3fdcf7cdfc229346d028242e73562704ad644dd0 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6I7U9 CVE: CVE-2023-1079
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Remove the early return on LED brightness set so that any controller application, daemon, or desktop may set the same brightness at any stage.
This is required because many ASUS ROG keyboards will default to max brightness on laptop resume if the LEDs were set to off before sleep.
Signed-off-by: Luke D Jones luke@ljones.dev Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Yuyao Lin linyuyao1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/hid/hid-asus.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/drivers/hid/hid-asus.c b/drivers/hid/hid-asus.c index f85c6e3309a0..9a6b63828634 100644 --- a/drivers/hid/hid-asus.c +++ b/drivers/hid/hid-asus.c @@ -402,9 +402,6 @@ static void asus_kbd_backlight_set(struct led_classdev *led_cdev, { struct asus_kbd_leds *led = container_of(led_cdev, struct asus_kbd_leds, cdev); - if (led->brightness == brightness) - return; - led->brightness = brightness; schedule_work(&led->work); }
From: Pietro Borrello borrello@diag.uniroma1.it
mainline inclusion from mainline-v6.2 commit 315c537068a13f0b5681d33dd045a912f4bece6f category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6I7U9 CVE: CVE-2023-1079
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
asus driver has a worker that may access data concurrently. Proct the accesses using a spinlock.
Fixes: af22a610bc38 ("HID: asus: support backlight on USB keyboards") Signed-off-by: Pietro Borrello borrello@diag.uniroma1.it Link: https://lore.kernel.org/r/20230125-hid-unregister-leds-v4-4-7860c5763c38@dia... Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Yuyao Lin linyuyao1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/hid/hid-asus.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/hid/hid-asus.c b/drivers/hid/hid-asus.c index 9a6b63828634..112c0c25a77f 100644 --- a/drivers/hid/hid-asus.c +++ b/drivers/hid/hid-asus.c @@ -95,6 +95,7 @@ struct asus_kbd_leds { struct hid_device *hdev; struct work_struct work; unsigned int brightness; + spinlock_t lock; bool removed; };
@@ -402,7 +403,12 @@ static void asus_kbd_backlight_set(struct led_classdev *led_cdev, { struct asus_kbd_leds *led = container_of(led_cdev, struct asus_kbd_leds, cdev); + unsigned long flags; + + spin_lock_irqsave(&led->lock, flags); led->brightness = brightness; + spin_unlock_irqrestore(&led->lock, flags); + schedule_work(&led->work); }
@@ -410,8 +416,14 @@ static enum led_brightness asus_kbd_backlight_get(struct led_classdev *led_cdev) { struct asus_kbd_leds *led = container_of(led_cdev, struct asus_kbd_leds, cdev); + enum led_brightness brightness; + unsigned long flags;
- return led->brightness; + spin_lock_irqsave(&led->lock, flags); + brightness = led->brightness; + spin_unlock_irqrestore(&led->lock, flags); + + return brightness; }
static void asus_kbd_backlight_work(struct work_struct *work) @@ -419,11 +431,14 @@ static void asus_kbd_backlight_work(struct work_struct *work) struct asus_kbd_leds *led = container_of(work, struct asus_kbd_leds, work); u8 buf[] = { FEATURE_KBD_REPORT_ID, 0xba, 0xc5, 0xc4, 0x00 }; int ret; + unsigned long flags;
if (led->removed) return;
+ spin_lock_irqsave(&led->lock, flags); buf[4] = led->brightness; + spin_unlock_irqrestore(&led->lock, flags);
ret = asus_kbd_set_report(led->hdev, buf, sizeof(buf)); if (ret < 0) @@ -485,6 +500,7 @@ static int asus_kbd_register_leds(struct hid_device *hdev) drvdata->kbd_backlight->cdev.brightness_set = asus_kbd_backlight_set; drvdata->kbd_backlight->cdev.brightness_get = asus_kbd_backlight_get; INIT_WORK(&drvdata->kbd_backlight->work, asus_kbd_backlight_work); + spin_lock_init(&drvdata->kbd_backlight->lock);
ret = devm_led_classdev_register(&hdev->dev, &drvdata->kbd_backlight->cdev); if (ret < 0) { @@ -1013,9 +1029,13 @@ static int asus_probe(struct hid_device *hdev, const struct hid_device_id *id) static void asus_remove(struct hid_device *hdev) { struct asus_drvdata *drvdata = hid_get_drvdata(hdev); + unsigned long flags;
if (drvdata->kbd_backlight) { + spin_lock_irqsave(&drvdata->kbd_backlight->lock, flags); drvdata->kbd_backlight->removed = true; + spin_unlock_irqrestore(&drvdata->kbd_backlight->lock, flags); + cancel_work_sync(&drvdata->kbd_backlight->work); }
From: Pietro Borrello borrello@diag.uniroma1.it
mainline inclusion from mainline-v6.2 commit 4ab3a086d10eeec1424f2e8a968827a6336203df category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6I7U9 CVE: CVE-2023-1079
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Use spinlocks to deal with workers introducing a wrapper asus_schedule_work(), and several spinlock checks. Otherwise, asus_kbd_backlight_set() may schedule led->work after the structure has been freed, causing a use-after-free.
Fixes: af22a610bc38 ("HID: asus: support backlight on USB keyboards") Signed-off-by: Pietro Borrello borrello@diag.uniroma1.it Link: https://lore.kernel.org/r/20230125-hid-unregister-leds-v4-5-7860c5763c38@dia... Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Yuyao Lin linyuyao1@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/hid/hid-asus.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/hid-asus.c b/drivers/hid/hid-asus.c index 112c0c25a77f..6865cab33cf8 100644 --- a/drivers/hid/hid-asus.c +++ b/drivers/hid/hid-asus.c @@ -398,6 +398,16 @@ static int asus_kbd_get_functions(struct hid_device *hdev, return ret; }
+static void asus_schedule_work(struct asus_kbd_leds *led) +{ + unsigned long flags; + + spin_lock_irqsave(&led->lock, flags); + if (!led->removed) + schedule_work(&led->work); + spin_unlock_irqrestore(&led->lock, flags); +} + static void asus_kbd_backlight_set(struct led_classdev *led_cdev, enum led_brightness brightness) { @@ -409,7 +419,7 @@ static void asus_kbd_backlight_set(struct led_classdev *led_cdev, led->brightness = brightness; spin_unlock_irqrestore(&led->lock, flags);
- schedule_work(&led->work); + asus_schedule_work(led); }
static enum led_brightness asus_kbd_backlight_get(struct led_classdev *led_cdev) @@ -433,9 +443,6 @@ static void asus_kbd_backlight_work(struct work_struct *work) int ret; unsigned long flags;
- if (led->removed) - return; - spin_lock_irqsave(&led->lock, flags); buf[4] = led->brightness; spin_unlock_irqrestore(&led->lock, flags);
From: Nicholas Piggin npiggin@gmail.com
mainline inclusion from mainline-v5.18-rc4 commit 3b8000ae185cb068adbda5f966a3835053c85fd4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6LD0S CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Huge vmalloc higher-order backing pages were allocated with __GFP_COMP in order to allow the sub-pages to be refcounted by callers such as "remap_vmalloc_page [sic]" (remap_vmalloc_range).
However a similar problem exists for other struct page fields callers use, for example fb_deferred_io_fault() takes a vmalloc'ed page and not only refcounts it but uses ->lru, ->mapping, ->index.
This is not compatible with compound sub-pages, and can cause bad page state issues like
BUG: Bad page state in process swapper/0 pfn:00743 page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743 flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff) raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: corrupted mapping in tail page Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810 Call Trace: dump_stack_lvl+0x74/0xa8 (unreliable) bad_page+0x12c/0x170 free_tail_pages_check+0xe8/0x190 free_pcp_prepare+0x31c/0x4e0 free_unref_page+0x40/0x1b0 __vunmap+0x1d8/0x420 ...
The correct approach is to use split high-order pages for the huge vmalloc backing. These allow callers to treat them in exactly the same way as individually-allocated order-0 pages.
Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.... Signed-off-by: Nicholas Piggin npiggin@gmail.com Cc: Paul Menzel pmenzel@molgen.mpg.de Cc: Song Liu songliubraving@fb.com Cc: Rick Edgecombe rick.p.edgecombe@intel.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
conflicts: mm/vmalloc.c
Signed-off-by: ZhangPeng zhangpeng362@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- mm/vmalloc.c | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e27cd716ca95..2ca2c1bc0db9 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2641,14 +2641,17 @@ static void __vunmap(const void *addr, int deallocate_pages) vm_remove_mappings(area, deallocate_pages);
if (deallocate_pages) { - unsigned int page_order = vm_area_page_order(area); int i;
- for (i = 0; i < area->nr_pages; i += 1U << page_order) { + for (i = 0; i < area->nr_pages; i++) { struct page *page = area->pages[i];
BUG_ON(!page); - __free_pages(page, page_order); + /* + * High-order allocs for huge vmallocs are split, so + * can be freed as an array of order-0 allocations + */ + __free_pages(page, 0); } atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
@@ -2930,8 +2933,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, struct page *page; int p;
- /* Compound pages required for remap_vmalloc_page */ - page = alloc_pages_node(node, gfp_mask | __GFP_COMP, page_order); + page = alloc_pages_node(node, gfp_mask, page_order); if (unlikely(!page)) { /* Successfully allocated i pages, free them in __vfree() */ area->nr_pages = i; @@ -2943,6 +2945,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, goto fail; }
+ /* + * Higher order allocations must be able to be treated as + * indepdenent small pages by callers (as they can with + * small-page vmallocs). Some drivers do their own refcounting + * on vmalloc_to_page() pages, some use page->mapping, + * page->lru, etc. + */ + if (page_order) + split_page(page, page_order); + for (p = 0; p < (1U << page_order); p++) area->pages[i + p] = page + p;
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v6.3-rc2 commit c993799baf9c5861f8df91beb80e1611b12efcbd category: bugfix bugzilla: 188522,https://gitee.com/openeuler/kernel/issues/I6N7ZP CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Apparently syzbot figured out that issuing this FSMAP call:
struct fsmap_head cmd = { .fmh_count = ...; .fmh_keys = { { .fmr_device = /* ext4 dev */, .fmr_physical = 0, }, { .fmr_device = /* ext4 dev */, .fmr_physical = 0, }, }, ... }; ret = ioctl(fd, FS_IOC_GETFSMAP, &cmd);
Produces this crash if the underlying filesystem is a 1k-block ext4 filesystem:
kernel BUG at fs/ext4/ext4.h:3331! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 3 PID: 3227965 Comm: xfs_io Tainted: G W O 6.2.0-rc8-achx Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:ext4_mb_load_buddy_gfp+0x47c/0x570 [ext4] RSP: 0018:ffffc90007c03998 EFLAGS: 00010246 RAX: ffff888004978000 RBX: ffffc90007c03a20 RCX: ffff888041618000 RDX: 0000000000000000 RSI: 00000000000005a4 RDI: ffffffffa0c99b11 RBP: ffff888012330000 R08: ffffffffa0c2b7d0 R09: 0000000000000400 R10: ffffc90007c03950 R11: 0000000000000000 R12: 0000000000000001 R13: 00000000ffffffff R14: 0000000000000c40 R15: ffff88802678c398 FS: 00007fdf2020c880(0000) GS:ffff88807e100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd318a5fe8 CR3: 000000007f80f001 CR4: 00000000001706e0 Call Trace: <TASK> ext4_mballoc_query_range+0x4b/0x210 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] ext4_getfsmap_datadev+0x713/0x890 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] ext4_getfsmap+0x2b7/0x330 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] ext4_ioc_getfsmap+0x153/0x2b0 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] __ext4_ioctl+0x2a7/0x17e0 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fdf20558aff RSP: 002b:00007ffd318a9e30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000000200c0 RCX: 00007fdf20558aff RDX: 00007fdf1feb2010 RSI: 00000000c0c0583b RDI: 0000000000000003 RBP: 00005625c0634be0 R08: 00005625c0634c40 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fdf1feb2010 R13: 00005625be70d994 R14: 0000000000000800 R15: 0000000000000000
For GETFSMAP calls, the caller selects a physical block device by writing its block number into fsmap_head.fmh_keys[01].fmr_device. To query mappings for a subrange of the device, the starting byte of the range is written to fsmap_head.fmh_keys[0].fmr_physical and the last byte of the range goes in fsmap_head.fmh_keys[1].fmr_physical.
IOWs, to query what mappings overlap with bytes 3-14 of /dev/sda, you'd set the inputs as follows:
fmh_keys[0] = { .fmr_device = major(8, 0), .fmr_physical = 3}, fmh_keys[1] = { .fmr_device = major(8, 0), .fmr_physical = 14},
Which would return you whatever is mapped in the 12 bytes starting at physical offset 3.
The crash is due to insufficient range validation of keys[1] in ext4_getfsmap_datadev. On 1k-block filesystems, block 0 is not part of the filesystem, which means that s_first_data_block is nonzero. ext4_get_group_no_and_offset subtracts this quantity from the blocknr argument before cracking it into a group number and a block number within a group. IOWs, block group 0 spans blocks 1-8192 (1-based) instead of 0-8191 (0-based) like what happens with larger blocksizes.
The net result of this encoding is that blocknr < s_first_data_block is not a valid input to this function. The end_fsb variable is set from the keys that are copied from userspace, which means that in the above example, its value is zero. That leads to an underflow here:
blocknr = blocknr - le32_to_cpu(es->s_first_data_block);
The division then operates on -1:
offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb)) >> EXT4_SB(sb)->s_cluster_bits;
Leaving an impossibly large group number (2^32-1) in blocknr. ext4_getfsmap_check_keys checked that keys[0].fmr_physical and keys[1].fmr_physical are in increasing order, but ext4_getfsmap_datadev adjusts keys[0].fmr_physical to be at least s_first_data_block. This implies that we have to check it again after the adjustment, which is the piece that I forgot.
Reported-by: syzbot+6be2b977c89f79b6b153@syzkaller.appspotmail.com Fixes: 4a4956249dac ("ext4: fix off-by-one fsmap error on 1k block filesystems") Link: https://syzkaller.appspot.com/bug?id=79d5768e9bfe362911ac1a5057a36fc6b5c3000... Cc: stable@vger.kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Link: https://lore.kernel.org/r/Y+58NPTH7VNGgzdd@magnolia Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/ext4/fsmap.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/ext4/fsmap.c b/fs/ext4/fsmap.c index 4493ef0c715e..cdf9bfe10137 100644 --- a/fs/ext4/fsmap.c +++ b/fs/ext4/fsmap.c @@ -486,6 +486,8 @@ static int ext4_getfsmap_datadev(struct super_block *sb, keys[0].fmr_physical = bofs; if (keys[1].fmr_physical >= eofs) keys[1].fmr_physical = eofs - 1; + if (keys[1].fmr_physical < keys[0].fmr_physical) + return 0; start_fsb = keys[0].fmr_physical; end_fsb = keys[1].fmr_physical;
From: Xin Long lucien.xin@gmail.com
stable inclusion from stable-v5.10.157 commit e87a077d09c05985a0edac7c6c49bb307f775d12 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NCRH CVE: CVE-2023-1382
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 0e5d56c64afcd6fd2d132ea972605b66f8a7d3c4 ]
A crash was reported by Wei Chen:
BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:tipc_conn_close+0x12/0x100 Call Trace: tipc_topsrv_exit_net+0x139/0x320 ops_exit_list.isra.9+0x49/0x80 cleanup_net+0x31a/0x540 process_one_work+0x3fa/0x9f0 worker_thread+0x42/0x5c0
It was caused by !con->sock in tipc_conn_close(). In tipc_topsrv_accept(), con is allocated in conn_idr then its sock is set:
con = tipc_conn_alloc(); ... <----[1] con->sock = newsock;
If tipc_conn_close() is called in anytime of [1], the null-pointer-def is triggered by con->sock->sk due to con->sock is not yet set.
This patch fixes it by moving the con->sock setting to tipc_conn_alloc() under s->idr_lock. So that con->sock can never be NULL when getting the con from s->conn_idr. It will be also safer to move con->server and flag CF_CONNECTED setting under s->idr_lock, as they should all be set before tipc_conn_alloc() is called.
Fixes: c5fa7b3cf3cb ("tipc: introduce new TIPC server infrastructure") Reported-by: Wei Chen harperchen1110@gmail.com Signed-off-by: Xin Long lucien.xin@gmail.com Acked-by: Jon Maloy jmaloy@redhat.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org
conflict: net/tipc/topsrv.c
Signed-off-by: Lu Wei luwei32@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/tipc/topsrv.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index 13f3143609f9..6fb088948300 100644 --- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -176,7 +176,7 @@ static void tipc_conn_close(struct tipc_conn *con) conn_put(con); }
-static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s) +static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s, struct socket *sock) { struct tipc_conn *con; int ret; @@ -202,10 +202,11 @@ static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s) } con->conid = ret; s->idr_in_use++; - spin_unlock_bh(&s->idr_lock);
set_bit(CF_CONNECTED, &con->flags); con->server = s; + con->sock = sock; + spin_unlock_bh(&s->idr_lock);
return con; } @@ -460,7 +461,7 @@ static void tipc_topsrv_accept(struct work_struct *work) ret = kernel_accept(lsock, &newsock, O_NONBLOCK); if (ret < 0) return; - con = tipc_conn_alloc(srv); + con = tipc_conn_alloc(srv, newsock); if (IS_ERR(con)) { ret = PTR_ERR(con); sock_release(newsock); @@ -472,7 +473,6 @@ static void tipc_topsrv_accept(struct work_struct *work) newsk->sk_data_ready = tipc_conn_data_ready; newsk->sk_write_space = tipc_conn_write_space; newsk->sk_user_data = con; - con->sock = newsock; write_unlock_bh(&newsk->sk_callback_lock);
/* Wake up receive process in case of 'SYN+' message */ @@ -570,12 +570,11 @@ bool tipc_topsrv_kern_subscr(struct net *net, u32 port, u32 type, u32 lower, sub.filter = filter; *(u32 *)&sub.usr_handle = port;
- con = tipc_conn_alloc(tipc_topsrv(net)); + con = tipc_conn_alloc(tipc_topsrv(net), NULL); if (IS_ERR(con)) return false;
*conid = con->conid; - con->sock = NULL; rc = tipc_conn_rcv_sub(tipc_topsrv(net), con, &sub); if (rc >= 0) return true;
From: Xin Long lucien.xin@gmail.com
stable inclusion from stable-v5.10.157 commit 4058e3b74ab3eabe0835cee9a0c6deda79e8a295 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NCRH CVE: CVE-2023-1382
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit a7b42969d63f47320853a802efd879fbdc4e010e ]
One extra conn_get() is needed in tipc_conn_alloc(), as after tipc_conn_alloc() is called, tipc_conn_close() may free this con before deferencing it in tipc_topsrv_accept():
tipc_conn_alloc(); newsk = newsock->sk; <---- tipc_conn_close(); write_lock_bh(&sk->sk_callback_lock); newsk->sk_data_ready = tipc_conn_data_ready;
Then an uaf issue can be triggered:
BUG: KASAN: use-after-free in tipc_topsrv_accept+0x1e7/0x370 [tipc] Call Trace: <TASK> dump_stack_lvl+0x33/0x46 print_report+0x178/0x4b0 kasan_report+0x8c/0x100 kasan_check_range+0x179/0x1e0 tipc_topsrv_accept+0x1e7/0x370 [tipc] process_one_work+0x6a3/0x1030 worker_thread+0x8a/0xdf0
This patch fixes it by holding it in tipc_conn_alloc(), then after all accessing in tipc_topsrv_accept() releasing it. Note when does this in tipc_topsrv_kern_subscr(), as tipc_conn_rcv_sub() returns 0 or -1 only, we don't need to check for "> 0".
Fixes: c5fa7b3cf3cb ("tipc: introduce new TIPC server infrastructure") Signed-off-by: Xin Long lucien.xin@gmail.com Acked-by: Jon Maloy jmaloy@redhat.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Lu Wei luwei32@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/tipc/topsrv.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index 6fb088948300..51cd64c814d6 100644 --- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -206,6 +206,7 @@ static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s, struct socket *s set_bit(CF_CONNECTED, &con->flags); con->server = s; con->sock = sock; + conn_get(con); spin_unlock_bh(&s->idr_lock);
return con; @@ -477,6 +478,7 @@ static void tipc_topsrv_accept(struct work_struct *work)
/* Wake up receive process in case of 'SYN+' message */ newsk->sk_data_ready(newsk); + conn_put(con); } }
@@ -576,10 +578,11 @@ bool tipc_topsrv_kern_subscr(struct net *net, u32 port, u32 type, u32 lower,
*conid = con->conid; rc = tipc_conn_rcv_sub(tipc_topsrv(net), con, &sub); - if (rc >= 0) - return true; + if (rc) + conn_put(con); + conn_put(con); - return false; + return !rc; }
void tipc_topsrv_kern_unsubscr(struct net *net, int conid)
From: Chen Zhongjin chenzhongjin@huawei.com
mainline inclusion from mainline-v6.3-rc2 commit ee92fa443358f4fc0017c1d0d325c27b37802504 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NJNA CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
KASAN reported follow problem:
BUG: KASAN: use-after-free in lookup_rec Read of size 8 at addr ffff000199270ff0 by task modprobe CPU: 2 Comm: modprobe Call trace: kasan_report __asan_load8 lookup_rec ftrace_location arch_check_ftrace_location check_kprobe_address_safe register_kprobe
When checking pg->records[pg->index - 1].ip in lookup_rec(), it can get a pg which is newly added to ftrace_pages_start in ftrace_process_locs(). Before the first pg->index++, index is 0 and accessing pg->records[-1].ip will cause this problem.
Don't check the ip when pg->index is 0.
Link: https://lore.kernel.org/linux-trace-kernel/20230309080230.36064-1-chenzhongj...
Cc: stable@vger.kernel.org Fixes: 9644302e3315 ("ftrace: Speed up search by skipping pages by address") Suggested-by: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Chen Zhongjin chenzhongjin@huawei.com Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org Reviewed-by: Zheng Yejian zhengyejian1@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/trace/ftrace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 16388295696c..b09a9a9b49b4 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1538,7 +1538,8 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end) key.flags = end; /* overload flags, as it is unsigned long */
for (pg = ftrace_pages_start; pg; pg = pg->next) { - if (end < pg->records[0].ip || + if (pg->index == 0 || + end < pg->records[0].ip || start >= (pg->records[pg->index - 1].ip + MCOUNT_INSN_SIZE)) continue; rec = bsearch(&key, pg->records, pg->index,
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v5.18-rc1 commit 53a05ad9f21d858d24f76d12b3e990405f2036d1 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NK0S CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.
This series attempts to optimize and streamline the COW logic for ordinary anon pages and THP anon pages, fixing two remaining instances of CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information can leak from a parent process to a child process via anonymous pages shared during fork().
This issue, including other related COW issues, has been summarized in [2]:
"1. Observing Memory Modifications of Private Pages From A Child Process
Long story short: process-private memory might not be as private as you think once you fork(): successive modifications of private memory regions in the parent process can still be observed by the child process, for example, by smart use of vmsplice()+munmap().
The core problem is that pinning pages readable in a child process, such as done via the vmsplice system call, can result in a child process observing memory modifications done in the parent process the child is not supposed to observe. [1] contains an excellent summary and [2] contains further details. This issue was assigned CVE-2020-29374 [9].
For this to trigger, it's required to use a fork() without subsequent exec(), for example, as used under Android zygote. Without further details about an application that forks less-privileged child processes, one cannot really say what's actually affected and what's not -- see the details section the end of this mail for a short sshd/openssh analysis.
While commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") fixed this issue and resulted in other problems (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page() simplification") re-introduced part of the problem unfortunately.
The original reproducer can be modified quite easily to use THP [3] and make the issue appear again on upstream kernels. I modified it to use hugetlb [4] and it triggers as well. The problem is certainly less severe with hugetlb than with THP; it merely highlights that we still have plenty of open holes we should be closing/fixing.
Regarding vmsplice(), the only known workaround is to disallow the vmsplice() system call ... or disable THP and hugetlb. But who knows what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in the end, it's a more generic issue"
This security issue was first reported by Jann Horn on 27 May 2020 and it currently affects anonymous pages during swapin, anonymous THP and hugetlb. This series tackles anonymous pages during swapin and anonymous THP:
- do_swap_page() for handling COW on PTEs during swapin directly
- do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write faults
With this series, we'll apply the same COW logic we have in do_wp_page() to all swappable anon pages: don't reuse (map writable) the page in case there are additional references (page_count() != 1). All users of reuse_swap_page() are remove, and consequently reuse_swap_page() is removed.
In general, we're struggling with the following COW-related issues:
(1) "missed COW": we miss to copy on write and reuse the page (map it writable) although we must copy because there are pending references from another process to this page. The result is a security issue.
(2) "wrong COW": we copy on write although we wouldn't have to and shouldn't: if there are valid GUP references, they will become out of sync with the pages mapped into the page table. We fail to detect that such a page can be reused safely, especially if never more than a single process mapped the page. The result is an intra process memory corruption.
(3) "unnecessary COW": we copy on write although we wouldn't have to: performance degradation and temporary increases swap+memory consumption can be the result.
While this series fixes (1) for swappable anon pages, it tries to reduce reported cases of (3) first as good and easy as possible to limit the impact when streamlining. The individual patches try to describe in which cases we will run into (3).
This series certainly makes (2) worse for THP, because a THP will now get PTE-mapped on write faults if there are additional references, even if there was only ever a single process involved: once PTE-mapped, we'll copy each and every subpage and won't reuse any subpage as long as the underlying compound page wasn't split.
I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to mark anon pages that are exclusive to a single process, allow GUP pins only on such exclusive pages, and allow turning exclusive pages shared (clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with PageAnonExclusive set never have to be copied during write faults, but eventually during fork() if they cannot be turned shared. The improved reuse logic in this series will essentially also be the logic to reset PageAnonExclusive. This work will certainly take a while, but I'm planning on sharing details before having code fully ready.
cleanups related to reuse_swap_page().
Notes: * For now, I'll leave hugetlb code untouched: "unnecessary COW" might easily break existing setups because hugetlb pages are a scarce resource and we could just end up having to crash the application when we run out of hugetlb pages. We have to be very careful and the security aspect with hugetlb is most certainly less relevant than for unprivileged anon pages. * Instead of lru_add_drain() we might actually just drain the lru_add list or even just remove the single page of interest from the lru_add list. This would require a new helper function, and could be added if the conditional lru_add_drain() turn out to be a problem. * I extended the test case already included in [1] to also test for the newly found do_swap_page() case. I'll send that out separately once/if this part was merged.
[1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
This patch (of 9):
Liang Zhang reported [1] that the current COW logic in do_wp_page() is sub-optimal when it comes to swap+read fault+write fault of anonymous pages that have a single user, visible via a performance degradation in the redis benchmark. Something similar was previously reported [2] by Nadav with a simple reproducer.
After we put an anon page into the swapcache and unmapped it from a single process, that process might read that page again and refault it read-only. If that process then writes to that page, the process is actually the exclusive user of the page, however, the COW logic in do_co_page() won't be able to reuse it due to the additional reference from the swapcache.
Let's optimize for pages that have been added to the swapcache but only have an exclusive user. Try removing the swapcache reference if there is hope that we're the exclusive user.
We will fail removing the swapcache reference in two scenarios: (1) There are additional swap entries referencing the page: copying instead of reusing is the right thing to do. (2) The page is under writeback: theoretically we might be able to reuse in some cases, however, we cannot remove the additional reference and will have to copy.
Note that we'll only try removing the page from the swapcache when it's highly likely that we'll be the exclusive owner after removing the page from the swapache. As we're about to map that page writable and redirty it, that should not affect reclaim but is rather the right thing to do.
Further, we might have additional references from the LRU pagevecs, which will force us to copy instead of being able to reuse. We'll try handling such references for some scenarios next. Concurrent writeback cannot be handled easily and we'll always have to copy.
While at it, remove the superfluous page_mapcount() check: it's implicitly covered by the page_count() for ordinary anon pages.
[1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reported-by: Liang Zhang zhangliang5@huawei.com Reported-by: Nadav Amit nadav.amit@gmail.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Acked-by: Vlastimil Babka vbabka@suse.cz Cc: Hugh Dickins hughd@google.com Cc: David Rientjes rientjes@google.com Cc: Shakeel Butt shakeelb@google.com Cc: John Hubbard jhubbard@nvidia.com Cc: Jason Gunthorpe jgg@nvidia.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Mike Rapoport rppt@linux.ibm.com Cc: Yang Shi shy828301@gmail.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Jann Horn jannh@google.com Cc: Michal Hocko mhocko@kernel.org Cc: Rik van Riel riel@surriel.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Andrea Arcangeli aarcange@redhat.com Cc: Peter Xu peterx@redhat.com Cc: Don Dutile ddutile@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Oleg Nesterov oleg@redhat.com Cc: Jan Kara jack@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: ZhangPeng zhangpeng362@huawei.com Reviewed-by: tong tiangen tongtiangen@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- mm/memory.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 732895edbb67..d632d5ebf10f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3197,19 +3197,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) if (PageAnon(vmf->page)) { struct page *page = vmf->page;
- /* PageKsm() doesn't necessarily raise the page refcount */ - if (PageKsm(page) || page_count(page) != 1) + /* + * We have to verify under page lock: these early checks are + * just an optimization to avoid locking the page and freeing + * the swapcache if there is little hope that we can reuse. + * + * PageKsm() doesn't necessarily raise the page refcount. + */ + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page)) goto copy; if (!trylock_page(page)) goto copy; - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) { + if (PageSwapCache(page)) + try_to_free_swap(page); + if (PageKsm(page) || page_count(page) != 1) { unlock_page(page); goto copy; } /* - * Ok, we've got the only map reference, and the only - * page count reference, and the page is locked, - * it's dark out, and we're wearing sunglasses. Hit it. + * Ok, we've got the only page reference from our mapping + * and the page is locked, it's dark out, and we're wearing + * sunglasses. Hit it. */ unlock_page(page); wp_page_reuse(vmf);
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v5.18-rc1 commit d4c470970d45c863fafc757521a82be2f80b1232 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NK0S CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
For example, if a page just got swapped in via a read fault, the LRU pagevecs might still hold a reference to the page. If we trigger a write fault on such a page, the additional reference from the LRU pagevecs will prohibit reusing the page.
Let's conditionally drain the local LRU pagevecs when we stumble over a !PageLRU() page. We cannot easily drain remote LRU pagevecs and it might not be desirable performance-wise. Consequently, this will only avoid copying in some cases.
Add a simple "page_count(page) > 3" check first but keep the "page_count(page) > 1 + PageSwapCache(page)" check in place, as we want to minimize cases where we remove a page from the swapcache but won't be able to reuse it, for example, because another process has it mapped R/O, to not affect reclaim.
We cannot easily handle the following cases and we will always have to copy:
(1) The page is referenced in the LRU pagevecs of other CPUs. We really would have to drain the LRU pagevecs of all CPUs -- most probably copying is much cheaper.
(2) The page is already PageLRU() but is getting moved between LRU lists, for example, for activation (e.g., mark_page_accessed()), deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to drain mostly unconditionally, which might be bad performance-wise. Most probably this won't happen too often in practice.
Note that there are other reasons why an anon page might temporarily not be PageLRU(): for example, compaction and migration have to isolate LRU pages from the LRU lists first (isolate_lru_page()), moving them to temporary local lists and clearing PageLRU() and holding an additional reference on the page. In that case, we'll always copy.
This change seems to be fairly effective with the reproducer [1] shared by Nadav, as long as writeback is done synchronously, for example, using zram. However, with asynchronous writeback, we'll usually fail to free the swapcache because the page is still under writeback: something we cannot easily optimize for, and maybe it's not really relevant in practice.
[1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
Link: https://lkml.kernel.org/r/20220131162940.210846-3-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Acked-by: Vlastimil Babka vbabka@suse.cz Cc: Andrea Arcangeli aarcange@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: David Rientjes rientjes@google.com Cc: Don Dutile ddutile@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Jan Kara jack@suse.cz Cc: Jann Horn jannh@google.com Cc: Jason Gunthorpe jgg@nvidia.com Cc: John Hubbard jhubbard@nvidia.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Liang Zhang zhangliang5@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@kernel.org Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Mike Rapoport rppt@linux.ibm.com Cc: Nadav Amit nadav.amit@gmail.com Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Xu peterx@redhat.com Cc: Rik van Riel riel@surriel.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Shakeel Butt shakeelb@google.com Cc: Yang Shi shy828301@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: ZhangPeng zhangpeng362@huawei.com Reviewed-by: tong tiangen tongtiangen@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- mm/memory.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c index d632d5ebf10f..8f7d4531c763 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3204,7 +3204,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * * PageKsm() doesn't necessarily raise the page refcount. */ - if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page)) + if (PageKsm(page) || page_count(page) > 3) + goto copy; + if (!PageLRU(page)) + /* + * Note: We cannot easily detect+handle references from + * remote LRU pagevecs or references to PageLRU() pages. + */ + lru_add_drain(); + if (page_count(page) > 1 + PageSwapCache(page)) goto copy; if (!trylock_page(page)) goto copy;
From: Baisong Zhong zhongbaisong@huawei.com
mainline inclusion from mainline-v6.2-rc1 commit 0ed554fd769a19ea8464bb83e9ac201002ef74ad category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NCQH CVE: CVE-2023-28328
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Wei Chen reports a kernel bug as blew:
general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] ... Call Trace: <TASK> __i2c_transfer+0x77e/0x1930 drivers/i2c/i2c-core-base.c:2109 i2c_transfer+0x1d5/0x3d0 drivers/i2c/i2c-core-base.c:2170 i2cdev_ioctl_rdwr+0x393/0x660 drivers/i2c/i2c-dev.c:297 i2cdev_ioctl+0x75d/0x9f0 drivers/i2c/i2c-dev.c:458 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fd834a8bded
In az6027_i2c_xfer(), if msg[i].addr is 0x99, a null-ptr-deref will caused when accessing msg[i].buf. For msg[i].len is 0 and msg[i].buf is null.
Fix this by checking msg[i].len in az6027_i2c_xfer().
Link: https://lore.kernel.org/lkml/CAO4mrfcPHB5aQJO=mpqV+p8mPLNg-Fok0gw8gZ=zemAfMG...
Link: https://lore.kernel.org/linux-media/20221120065918.2160782-1-zhongbaisong@hu... Fixes: 76f9a820c867 ("V4L/DVB: AZ6027: Initial import of the driver") Reported-by: Wei Chen harperchen1110@gmail.com Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: ZhangPeng zhangpeng362@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Nanyong Sun sunnanyong@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/media/usb/dvb-usb/az6027.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/media/usb/dvb-usb/az6027.c b/drivers/media/usb/dvb-usb/az6027.c index 86788771175b..32b4ee65c280 100644 --- a/drivers/media/usb/dvb-usb/az6027.c +++ b/drivers/media/usb/dvb-usb/az6027.c @@ -975,6 +975,10 @@ static int az6027_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg msg[], int n if (msg[i].addr == 0x99) { req = 0xBE; index = 0; + if (msg[i].len < 1) { + i = -EOPNOTSUPP; + break; + } value = msg[i].buf[0] & 0x00ff; length = 1; az6027_usb_out_op(d, req, value, index, data, length);
From: Kuniyuki Iwashima kuniyu@amazon.com
mainline inclusion from mainline-v6.2-rc1 commit a1140cb215fa13dcec06d12ba0c3ee105633b7c4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NVC3 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
Our syzbot instance reported memory leaks in do_seccomp() [0], similar to the report [1]. It shows that we miss freeing struct seccomp_filter and some objects included in it.
We can reproduce the issue with the program below [2] which calls one seccomp() and two clone() syscalls.
The first clone()d child exits earlier than its parent and sends a signal to kill it during the second clone(), more precisely before the fatal_signal_pending() test in copy_process(). When the parent receives the signal, it has to destroy the embryonic process and return -EINTR to user space. In the failure path, we have to call seccomp_filter_release() to decrement the filter's refcount.
Initially, we called it in free_task() called from the failure path, but the commit 3a15fb6ed92c ("seccomp: release filter after task is fully dead") moved it to release_task() to notify user space as early as possible that the filter is no longer used.
To keep the change and current seccomp refcount semantics, let's move copy_seccomp() just after the signal check and add a WARN_ON_ONCE() in free_task() for future debugging.
[0]: unreferenced object 0xffff8880063add00 (size 256): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.914s) hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace: do_seccomp (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/seccomp.c:666 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffffc90000035000 (size 4096): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: __vmalloc_node_range (mm/vmalloc.c:3226) __vmalloc_node (mm/vmalloc.c:3261 (discriminator 4)) bpf_prog_alloc_no_stats (kernel/bpf/core.c:91) bpf_prog_alloc (kernel/bpf/core.c:129) bpf_prog_create_from_user (net/core/filter.c:1414) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff888003fa1000 (size 1024): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: bpf_prog_alloc_no_stats (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/bpf/core.c:95) bpf_prog_alloc (kernel/bpf/core.c:129) bpf_prog_create_from_user (net/core/filter.c:1414) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff888006360240 (size 16): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 16 bytes): 01 00 37 00 76 65 72 6c e0 83 01 06 80 88 ff ff ..7.verl........ backtrace: bpf_prog_store_orig_filter (net/core/filter.c:1137) bpf_prog_create_from_user (net/core/filter.c:1428) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff8880060183e0 (size 8): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 8 bytes): 06 00 00 00 00 00 ff 7f ........ backtrace: kmemdup (mm/util.c:129) bpf_prog_store_orig_filter (net/core/filter.c:1144) bpf_prog_create_from_user (net/core/filter.c:1428) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[1]: https://syzkaller.appspot.com/bug?id=2809bb0ac77ad9aa3f4afe42d6a610aba594a98...
[2]:
void main(void) { struct sock_filter filter[] = { BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog fprog = { .len = sizeof(filter) / sizeof(filter[0]), .filter = filter, }; long i, pid;
syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);
for (i = 0; i < 2; i++) { pid = syscall(__NR_clone, CLONE_NEWNET | SIGKILL, NULL, NULL, 0); if (pid == 0) return; } }
Fixes: 3a15fb6ed92c ("seccomp: release filter after task is fully dead") Reported-by: syzbot+ab17848fe269b573eb71@syzkaller.appspotmail.com Reported-by: Ayushman Dutta ayudutta@amazon.com Suggested-by: Kees Cook keescook@chromium.org Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Reviewed-by: Christian Brauner (Microsoft) brauner@kernel.org Signed-off-by: Kees Cook keescook@chromium.org Link: https://lore.kernel.org/r/20220823154532.82913-1-kuniyu@amazon.com
Conflicts: kernel/fork.c
Signed-off-by: GONG, Ruiqi gongruiqi1@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/fork.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c index 0e141623a95d..f0aa2da990b8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -445,6 +445,9 @@ void put_task_stack(struct task_struct *tsk)
void free_task(struct task_struct *tsk) { +#ifdef CONFIG_SECCOMP + WARN_ON_ONCE(tsk->seccomp.filter); +#endif scs_release(tsk);
#ifndef CONFIG_THREAD_INFO_IN_TASK @@ -2332,12 +2335,6 @@ static __latent_entropy struct task_struct *copy_process(
spin_lock(¤t->sighand->siglock);
- /* - * Copy seccomp details explicitly here, in case they were changed - * before holding sighand lock. - */ - copy_seccomp(p); - rseq_fork(p, clone_flags);
/* Don't start children in a dying pid namespace */ @@ -2352,6 +2349,14 @@ static __latent_entropy struct task_struct *copy_process( goto bad_fork_cancel_cgroup; }
+ /* No more failure paths after this point. */ + + /* + * Copy seccomp details explicitly here, in case they were changed + * before holding sighand lock. + */ + copy_seccomp(p); + init_task_pid_links(p); if (likely(p->pid)) { ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
From: Zheng Yejian zhengyejian1@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NVPT
--------------------------------
Concurrently enable/disable livepatch and online/offline cpu cause deadlock: __klp_{enable,disable}_patch() _cpu_{up,down}() -------- -------- mutex_lock(&text_mutex); cpus_write_lock(); cpus_read_lock(); mutex_lock(&text_mutex);
__klp_{enable,disable}_patch() hold text_mutex then wait cpu_hotplug_lock, while _cpu_{up,down}() hold cpu_hotplug_lock but wait text_mutex, finally result in the deadlock.
Cpu hotplug locking is a "percpu" rw semaphore, however write lock and read lock on it are globally mutual exclusive, that is cpus_write_lock() on one cpu can block all cpus_read_lock() on other cpus, vice versa.
Similar lock issue was solved in commit 2d1e38f56622 ("kprobes: Cure hotplug lock ordering issues") which change lock order to be: kprobe_mutex -> cpus_rwsem -> jump_label_mutex -> text_mutex
Therefore take cpus_read_lock() before text_mutex to avoid deadlock.
Fixes: f5a6746743d8 ("livepatch/x86: support livepatch without ftrace") Signed-off-by: Zheng Yejian zhengyejian1@huawei.com Reviewed-by: Xu Kuohai xukuohai@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/livepatch/core.c | 36 +++++++++++++++++++++++++----------- 1 file changed, 25 insertions(+), 11 deletions(-)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index a00aa45eb065..8768ec1bddf3 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -1669,6 +1669,27 @@ static void klp_breakpoint_post_process(struct klp_patch *patch, bool restore) module_put(patch->mod); }
+static int klp_stop_machine(cpu_stop_fn_t fn, void *data, const struct cpumask *cpus) +{ + int ret; + + /* + * Cpu hotplug locking is a "percpu" rw semaphore, however write + * lock and read lock on it are globally mutual exclusive, that is + * cpus_write_lock() on one cpu can block all cpus_read_lock() + * on other cpus, vice versa. + * + * Since cpu hotplug take the cpus_write_lock() before text_mutex, + * here take cpus_read_lock() before text_mutex to avoid deadlock. + */ + cpus_read_lock(); + arch_klp_code_modify_prepare(); + ret = stop_machine_cpuslocked(fn, data, cpus); + arch_klp_code_modify_post_process(); + cpus_read_unlock(); + return ret; +} + static int __klp_disable_patch(struct klp_patch *patch) { int ret; @@ -1689,9 +1710,7 @@ static int __klp_disable_patch(struct klp_patch *patch) } #endif
- arch_klp_code_modify_prepare(); - ret = stop_machine(klp_try_disable_patch, &patch_data, cpu_online_mask); - arch_klp_code_modify_post_process(); + ret = klp_stop_machine(klp_try_disable_patch, &patch_data, cpu_online_mask); if (ret) return ret;
@@ -1960,10 +1979,8 @@ static int klp_breakpoint_optimize(struct klp_patch *patch)
cnt++;
- arch_klp_code_modify_prepare(); - ret = stop_machine(klp_try_enable_patch, &patch_data, - cpu_online_mask); - arch_klp_code_modify_post_process(); + ret = klp_stop_machine(klp_try_enable_patch, &patch_data, + cpu_online_mask); if (!ret || ret != -EAGAIN) break;
@@ -2010,10 +2027,7 @@ static int __klp_enable_patch(struct klp_patch *patch) if (ret) return ret;
- arch_klp_code_modify_prepare(); - ret = stop_machine(klp_try_enable_patch, &patch_data, - cpu_online_mask); - arch_klp_code_modify_post_process(); + ret = klp_stop_machine(klp_try_enable_patch, &patch_data, cpu_online_mask); if (!ret) goto move_patch_to_tail; if (ret != -EAGAIN)