- Kernel - mailweb.openeuler.org

[PATCH OLK-5.10] btrfs: fix qgroup reserve leaks in cow_file_range
by Yifan Qiao 24 Sep '24

24 Sep '24

From: Boris Burkov <boris(a)bur.io> mainline inclusion from mainline-v6.11-rc3 commit 30479f31d44d47ed00ae0c7453d9b253537005b2 bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARV5C CVE: CVE-2024-46733 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- In the buffered write path, the dirty page owns the qgroup reserve until it creates an ordered_extent. Therefore, any errors that occur before the ordered_extent is created must free that reservation, or else the space is leaked. The fstest generic/475 exercises various IO error paths, and is able to trigger errors in cow_file_range where we fail to get to allocating the ordered extent. Note that because we *do* clear delalloc, we are likely to remove the inode from the delalloc list, so the inodes/pages to not have invalidate/launder called on them in the commit abort path. This results in failures at the unmount stage of the test that look like: BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672 ------------[ cut here ]------------ WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs] Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W 6.10.0-rc7-gab56fde445b8 #21 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:close_ctree+0x222/0x4d0 [btrfs] RSP: 0018:ffffb4465283be00 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8 RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0 Call Trace: <TASK> ? close_ctree+0x222/0x4d0 [btrfs] ? __warn.cold+0x8e/0xea ? close_ctree+0x222/0x4d0 [btrfs] ? report_bug+0xff/0x140 ? handle_bug+0x3b/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? close_ctree+0x222/0x4d0 [btrfs] generic_shutdown_super+0x70/0x160 kill_anon_super+0x11/0x40 btrfs_kill_super+0x11/0x20 [btrfs] deactivate_locked_super+0x2e/0xa0 cleanup_mnt+0xb5/0x150 task_work_run+0x57/0x80 syscall_exit_to_user_mode+0x121/0x130 do_syscall_64+0xab/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f916847a887 ---[ end trace 0000000000000000 ]--- BTRFS error (device dm-8 state EA): qgroup reserved space leaked Cases 2 and 3 in the out_reserve path both pertain to this type of leak and must free the reserved qgroup data. Because it is already an error path, I opted not to handle the possible errors in btrfs_free_qgroup_data. Reviewed-by: Qu Wenruo <wqu(a)suse.com> Signed-off-by: Boris Burkov <boris(a)bur.io> Signed-off-by: David Sterba <dsterba(a)suse.com> Conflicts: fs/btrfs/inode.c [The cleanup logic is very old in 5.10, need to adapt the logic.] Signed-off-by: Yifan Qiao <qiaoyifan4(a)huawei.com> --- fs/btrfs/inode.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b1138ba9642d..9e8e5be6e97f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1161,6 +1161,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, locked_page, clear_bits, page_ops); + btrfs_qgroup_free_data(inode, NULL, start, cur_alloc_size); start += cur_alloc_size; if (start >= end) goto out; @@ -1168,6 +1169,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode, extent_clear_unlock_delalloc(inode, start, end, locked_page, clear_bits | EXTENT_CLEAR_DATA_RESV, page_ops); + btrfs_qgroup_free_data(inode, NULL, start, cur_alloc_size); goto out; } @@ -1777,7 +1779,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, if (nocow) btrfs_dec_nocow_writers(fs_info, disk_bytenr); - if (ret && cur_offset < end) + if (ret && cur_offset < end) { extent_clear_unlock_delalloc(inode, cur_offset, end, locked_page, EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DEFRAG | @@ -1785,6 +1787,8 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK); + btrfs_qgroup_free_data(inode, NULL, cur_offset, end - cur_offset + 1); + } btrfs_free_path(path); return ret; } -- 2.39.2

2 1

[PATCH OLK-5.10] btrfs: replace BUG_ON() with error handling at update_ref_for_cow()
by Yifan Qiao 24 Sep '24

24 Sep '24

From: Filipe Manana <fdmanana(a)suse.com> stable inclusion from stable-v6.6.51 commit 41a0f85e268d72fe04f731b8ceea4748c2d65491 bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARX1T CVE: CVE-2024-46752 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit b56329a782314fde5b61058e2a25097af7ccb675 ] Instead of a BUG_ON() just return an error, log an error message and abort the transaction in case we find an extent buffer belonging to the relocation tree that doesn't have the full backref flag set. This is unexpected and should never happen (save for bugs or a potential bad memory). Reviewed-by: Qu Wenruo <wqu(a)suse.com> Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Reviewed-by: David Sterba <dsterba(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Conflict: fs/btrfs/ctree.c [No btrfs_root_id yet.] Signed-off-by: Yifan Qiao <qiaoyifan4(a)huawei.com> --- fs/btrfs/ctree.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 814f2f07e74c..55896280b2dd 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -896,8 +896,16 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, } owner = btrfs_header_owner(buf); - BUG_ON(owner == BTRFS_TREE_RELOC_OBJECTID && - !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)); + if (unlikely(owner == BTRFS_TREE_RELOC_OBJECTID && + !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))) { + btrfs_crit(fs_info, +"found tree block at bytenr %llu level %d root %llu refs %llu flags %llx without full backref flag set", + buf->start, btrfs_header_level(buf), + root->root_key.objectid, refs, flags); + ret = -EUCLEAN; + btrfs_abort_transaction(trans, ret); + return ret; + } if (refs > 1) { if ((owner == root->root_key.objectid || -- 2.39.2

2 1

[PATCH OLK-5.10] Squashfs: sanity check symbolic link size
by Yifan Qiao 24 Sep '24

24 Sep '24

From: Phillip Lougher <phillip(a)squashfs.org.uk> stable inclusion from stable-v5.10.226 commit 5c8906de98d0d7ad42ff3edf2cb6cd7e0ea658c4 bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARYAF CVE: CVE-2024-46744 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 810ee43d9cd245d138a2733d87a24858a23f577d ] Syzkiller reports a "KMSAN: uninit-value in pick_link" bug. This is caused by an uninitialised page, which is ultimately caused by a corrupted symbolic link size read from disk. The reason why the corrupted symlink size causes an uninitialised page is due to the following sequence of events: 1. squashfs_read_inode() is called to read the symbolic link from disk. This assigns the corrupted value 3875536935 to inode->i_size. 2. Later squashfs_symlink_read_folio() is called, which assigns this corrupted value to the length variable, which being a signed int, overflows producing a negative number. 3. The following loop that fills in the page contents checks that the copied bytes is less than length, which being negative means the loop is skipped, producing an uninitialised page. This patch adds a sanity check which checks that the symbolic link size is not larger than expected. -- Signed-off-by: Phillip Lougher <phillip(a)squashfs.org.uk> Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk Reported-by: Lizhi Xu <lizhi.xu(a)windriver.com> Reported-by: syzbot+24ac24ff58dc5b0d26b9(a)syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/ V2: fix spelling mistake. Signed-off-by: Christian Brauner <brauner(a)kernel.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Yifan Qiao <qiaoyifan4(a)huawei.com> --- fs/squashfs/inode.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/squashfs/inode.c b/fs/squashfs/inode.c index 9fb8e9fd4c00..95a9ff9e2399 100644 --- a/fs/squashfs/inode.c +++ b/fs/squashfs/inode.c @@ -279,8 +279,13 @@ int squashfs_read_inode(struct inode *inode, long long ino) if (err < 0) goto failed_read; - set_nlink(inode, le32_to_cpu(sqsh_ino->nlink)); inode->i_size = le32_to_cpu(sqsh_ino->symlink_size); + if (inode->i_size > PAGE_SIZE) { + ERROR("Corrupted symlink\n"); + return -EINVAL; + } + + set_nlink(inode, le32_to_cpu(sqsh_ino->nlink)); inode->i_op = &squashfs_symlink_inode_ops; inode_nohighmem(inode); inode->i_data.a_ops = &squashfs_symlink_aops; -- 2.39.2

2 1

[PATCH openEuler-1.0-LTS] Squashfs: sanity check symbolic link size
by Yifan Qiao 24 Sep '24

24 Sep '24

From: Phillip Lougher <phillip(a)squashfs.org.uk> stable inclusion from stable-v4.19.322 commit f82cb7f24032ed023fc67d26ea9bf322d8431a90 bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARYAF CVE: CVE-2024-46744 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 810ee43d9cd245d138a2733d87a24858a23f577d ] Syzkiller reports a "KMSAN: uninit-value in pick_link" bug. This is caused by an uninitialised page, which is ultimately caused by a corrupted symbolic link size read from disk. The reason why the corrupted symlink size causes an uninitialised page is due to the following sequence of events: 1. squashfs_read_inode() is called to read the symbolic link from disk. This assigns the corrupted value 3875536935 to inode->i_size. 2. Later squashfs_symlink_read_folio() is called, which assigns this corrupted value to the length variable, which being a signed int, overflows producing a negative number. 3. The following loop that fills in the page contents checks that the copied bytes is less than length, which being negative means the loop is skipped, producing an uninitialised page. This patch adds a sanity check which checks that the symbolic link size is not larger than expected. -- Signed-off-by: Phillip Lougher <phillip(a)squashfs.org.uk> Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk Reported-by: Lizhi Xu <lizhi.xu(a)windriver.com> Reported-by: syzbot+24ac24ff58dc5b0d26b9(a)syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/ V2: fix spelling mistake. Signed-off-by: Christian Brauner <brauner(a)kernel.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Yifan Qiao <qiaoyifan4(a)huawei.com> --- fs/squashfs/inode.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/squashfs/inode.c b/fs/squashfs/inode.c index 98825d8a03eb..aee2ec547f1a 100644 --- a/fs/squashfs/inode.c +++ b/fs/squashfs/inode.c @@ -292,8 +292,13 @@ int squashfs_read_inode(struct inode *inode, long long ino) if (err < 0) goto failed_read; - set_nlink(inode, le32_to_cpu(sqsh_ino->nlink)); inode->i_size = le32_to_cpu(sqsh_ino->symlink_size); + if (inode->i_size > PAGE_SIZE) { + ERROR("Corrupted symlink\n"); + return -EINVAL; + } + + set_nlink(inode, le32_to_cpu(sqsh_ino->nlink)); inode->i_op = &squashfs_symlink_inode_ops; inode_nohighmem(inode); inode->i_data.a_ops = &squashfs_symlink_aops; -- 2.39.2

2 1

[PATCH openEuler-1.0-LTS] binder: fix UAF caused by offsets overwrite
by Wupeng Ma 24 Sep '24

24 Sep '24

From: Carlos Llamas <cmllamas(a)google.com> stable inclusion from stable-v5.10.226 commit 3a8154bb4ab4a01390a3abf1e6afac296e037da4 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARY7V CVE: CVE-2024-46740 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- commit 4df153652cc46545722879415937582028c18af5 upstream. Binder objects are processed and copied individually into the target buffer during transactions. Any raw data in-between these objects is copied as well. However, this raw data copy lacks an out-of-bounds check. If the raw data exceeds the data section size then the copy overwrites the offsets section. This eventually triggers an error that attempts to unwind the processed objects. However, at this point the offsets used to index these objects are now corrupted. Unwinding with corrupted offsets can result in decrements of arbitrary nodes and lead to their premature release. Other users of such nodes are left with a dangling pointer triggering a use-after-free. This issue is made evident by the following KASAN report (trimmed): ================================================================== BUG: KASAN: slab-use-after-free in _raw_spin_lock+0xe4/0x19c Write of size 4 at addr ffff47fc91598f04 by task binder-util/743 CPU: 9 UID: 0 PID: 743 Comm: binder-util Not tainted 6.11.0-rc4 #1 Hardware name: linux,dummy-virt (DT) Call trace: _raw_spin_lock+0xe4/0x19c binder_free_buf+0x128/0x434 binder_thread_write+0x8a4/0x3260 binder_ioctl+0x18f0/0x258c [...] Allocated by task 743: __kmalloc_cache_noprof+0x110/0x270 binder_new_node+0x50/0x700 binder_transaction+0x413c/0x6da8 binder_thread_write+0x978/0x3260 binder_ioctl+0x18f0/0x258c [...] Freed by task 745: kfree+0xbc/0x208 binder_thread_read+0x1c5c/0x37d4 binder_ioctl+0x16d8/0x258c [...] ================================================================== To avoid this issue, let's check that the raw data copy is within the boundaries of the data section. Fixes: 6d98eb95b450 ("binder: avoid potential data leakage when copying txn") Cc: Todd Kjos <tkjos(a)google.com> Cc: stable(a)vger.kernel.org Signed-off-by: Carlos Llamas <cmllamas(a)google.com> Link: https://lore.kernel.org/r/20240822182353.2129600-1-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Signed-off-by: Ma Wupeng <mawupeng1(a)huawei.com> --- drivers/android/binder.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/android/binder.c b/drivers/android/binder.c index 723e5a919c20..266ce581e751 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -3555,6 +3555,7 @@ static void binder_transaction(struct binder_proc *proc, */ copy_size = object_offset - user_offset; if (copy_size && (user_offset > object_offset || + object_offset > tr->data_size || binder_alloc_copy_user_to_buffer( &target_proc->alloc, t->buffer, user_offset, -- 2.25.1

2 1

[openeuler:openEuler-1.0-LTS 21355/23742] drivers/gpio/gpio-phytium-core.c:346:5: warning: "CONFIG_SMP" is not defined, evaluates to 0
by kernel test robot 24 Sep '24

24 Sep '24

Hi Tian, FYI, the error/warning still remains. tree: https://gitee.com/openeuler/kernel.git openEuler-1.0-LTS head: f2ffc6afa458c9c8ddacf5000e60c1be28e095e1 commit: 00711bad7e372a30c4975ba43811ffa666aff0e1 [21355/23742] gpio: add phytium gpio driver config: x86_64-randconfig-r063-20240924 (https://download.01.org/0day-ci/archive/20240924/202409241018.HxSSkSip-lkp@…) compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240924/202409241018.HxSSkSip-lkp@…) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp(a)intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202409241018.HxSSkSip-lkp@intel.com/ All warnings (new ones prefixed by >>): >> drivers/gpio/gpio-phytium-core.c:346:5: warning: "CONFIG_SMP" is not defined, evaluates to 0 [-Wundef] 346 | #if CONFIG_SMP | ^~~~~~~~~~ >> drivers/gpio/gpio-phytium-core.c:346:5: warning: "CONFIG_SMP" is not defined, evaluates to 0 [-Wundef] 346 | #if CONFIG_SMP | ^~~~~~~~~~ vim +/CONFIG_SMP +346 drivers/gpio/gpio-phytium-core.c 345 > 346 #if CONFIG_SMP 347 int 348 phytium_gpio_irq_set_affinity(struct irq_data *d, 349 const struct cpumask *mask_val, bool force) 350 { 351 struct gpio_chip *chip_data = irq_data_get_irq_chip_data(d); 352 struct irq_chip *chip = irq_get_chip(chip_data->irq.num_parents); 353 struct irq_data *data = irq_get_irq_data(chip_data->irq.num_parents); 354 355 if (chip && chip->irq_set_affinity) 356 return chip->irq_set_affinity(data, mask_val, force); 357 358 return -EINVAL; 359 } 360 EXPORT_SYMBOL_GPL(phytium_gpio_irq_set_affinity); 361 #endif 362 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki

1 0

[PATCH openEuler-22.03-LTS-SP1] userfaultfd: fix checks for huge PMDs
by Liu Shixin 24 Sep '24

24 Sep '24

From: Jann Horn <jannh(a)google.com> mainline inclusion from mainline-v6.11-rc7 commit 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARWOX CVE: CVE-2024-46787 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2. The pmd_trans_huge() code in mfill_atomic() is wrong in three different ways depending on kernel version: 1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit the right two race windows) - I've tested this in a kernel build with some extra mdelay() calls. See the commit message for a description of the race scenario. On older kernels (before 6.5), I think the same bug can even theoretically lead to accessing transhuge page contents as a page table if you hit the right 5 narrow race windows (I haven't tested this case). 2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for detecting PMDs that don't point to page tables. On older kernels (before 6.5), you'd just have to win a single fairly wide race to hit this. I've tested this on 6.1 stable by racing migration (with a mdelay() patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86 VM, that causes a kernel oops in ptlock_ptr(). 3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed to yank page tables out from under us (though I haven't tested that), so I think the BUG_ON() checks in mfill_atomic() are just wrong. I decided to write two separate fixes for these (one fix for bugs 1+2, one fix for bug 3), so that the first fix can be backported to kernels affected by bugs 1+2. This patch (of 2): This fixes two issues. I discovered that the following race can occur: mfill_atomic other thread ============ ============ <zap PMD> pmdp_get_lockless() [reads none pmd] <bail if trans_huge> <if none:> <pagefault creates transhuge zeropage> __pte_alloc [no-op] <zap PMD> <bail if pmd_trans_huge(*dst_pmd)> BUG_ON(pmd_none(*dst_pmd)) I have experimentally verified this in a kernel with extra mdelay() calls; the BUG_ON(pmd_none(*dst_pmd)) triggers. On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail"), this can't lead to anything worse than a BUG_ON(), since the page table access helpers are actually designed to deal with page tables concurrently disappearing; but on older kernels (<=6.4), I think we could probably theoretically race past the two BUG_ON() checks and end up treating a hugepage as a page table. The second issue is that, as Qi Zheng pointed out, there are other types of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs (in particular, migration PMDs). On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a PMD that contains a migration entry (which just requires winning a single, fairly wide race), it will pass the PMD to pte_offset_map_lock(), which assumes that the PMD points to a page table. Breakage follows: First, the kernel tries to take the PTE lock (which will crash or maybe worse if there is no "struct page" for the address bits in the migration entry PMD - I think at least on X86 there usually is no corresponding "struct page" thanks to the PTE inversion mitigation, amd64 looks different). If that didn't crash, the kernel would next try to write a PTE into what it wrongly thinks is a page table. As part of fixing these issues, get rid of the check for pmd_trans_huge() before __pte_alloc() - that's redundant, we're going to have to check for that after the __pte_alloc() anyway. Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels. Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog… Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog… Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Jann Horn <jannh(a)google.com> Acked-by: David Hildenbrand <david(a)redhat.com> Cc: Andrea Arcangeli <aarcange(a)redhat.com> Cc: Hugh Dickins <hughd(a)google.com> Cc: Jann Horn <jannh(a)google.com> Cc: Pavel Emelyanov <xemul(a)virtuozzo.com> Cc: Qi Zheng <zhengqi.arch(a)bytedance.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Conflicts: mm/userfaultfd.c [ pmd_read_atomic() is renamed to pmdp_get_lockless() in dab6e717429e5e ] Signed-off-by: Liu Shixin <liushixin2(a)huawei.com> --- mm/userfaultfd.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 15c46208a2ac..41ebcdce430a 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -583,21 +583,23 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, } dst_pmdval = pmd_read_atomic(dst_pmd); - /* - * If the dst_pmd is mapped as THP don't - * override it and just be strict. - */ - if (unlikely(pmd_trans_huge(dst_pmdval))) { - err = -EEXIST; - break; - } if (unlikely(pmd_none(dst_pmdval)) && unlikely(__pte_alloc(dst_mm, dst_pmd))) { err = -ENOMEM; break; } - /* If an huge pmd materialized from under us fail */ - if (unlikely(pmd_trans_huge(*dst_pmd))) { + dst_pmdval = pmd_read_atomic(dst_pmd); + /* + * If the dst_pmd is THP don't override it and just be strict. + * (This includes the case where the PMD used to be THP and + * changed back to none after __pte_alloc().) + */ + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || + pmd_devmap(dst_pmdval))) { + err = -EEXIST; + break; + } + if (unlikely(pmd_bad(dst_pmdval))) { err = -EFAULT; break; } -- 2.34.1

2 1

[PATCH OLK-5.10] userfaultfd: fix checks for huge PMDs
by Liu Shixin 24 Sep '24

24 Sep '24

From: Jann Horn <jannh(a)google.com> mainline inclusion from mainline-v6.11-rc7 commit 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARWOX CVE: CVE-2024-46787 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… -------------------------------- Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2. The pmd_trans_huge() code in mfill_atomic() is wrong in three different ways depending on kernel version: 1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit the right two race windows) - I've tested this in a kernel build with some extra mdelay() calls. See the commit message for a description of the race scenario. On older kernels (before 6.5), I think the same bug can even theoretically lead to accessing transhuge page contents as a page table if you hit the right 5 narrow race windows (I haven't tested this case). 2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for detecting PMDs that don't point to page tables. On older kernels (before 6.5), you'd just have to win a single fairly wide race to hit this. I've tested this on 6.1 stable by racing migration (with a mdelay() patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86 VM, that causes a kernel oops in ptlock_ptr(). 3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed to yank page tables out from under us (though I haven't tested that), so I think the BUG_ON() checks in mfill_atomic() are just wrong. I decided to write two separate fixes for these (one fix for bugs 1+2, one fix for bug 3), so that the first fix can be backported to kernels affected by bugs 1+2. This patch (of 2): This fixes two issues. I discovered that the following race can occur: mfill_atomic other thread ============ ============ <zap PMD> pmdp_get_lockless() [reads none pmd] <bail if trans_huge> <if none:> <pagefault creates transhuge zeropage> __pte_alloc [no-op] <zap PMD> <bail if pmd_trans_huge(*dst_pmd)> BUG_ON(pmd_none(*dst_pmd)) I have experimentally verified this in a kernel with extra mdelay() calls; the BUG_ON(pmd_none(*dst_pmd)) triggers. On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail"), this can't lead to anything worse than a BUG_ON(), since the page table access helpers are actually designed to deal with page tables concurrently disappearing; but on older kernels (<=6.4), I think we could probably theoretically race past the two BUG_ON() checks and end up treating a hugepage as a page table. The second issue is that, as Qi Zheng pointed out, there are other types of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs (in particular, migration PMDs). On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a PMD that contains a migration entry (which just requires winning a single, fairly wide race), it will pass the PMD to pte_offset_map_lock(), which assumes that the PMD points to a page table. Breakage follows: First, the kernel tries to take the PTE lock (which will crash or maybe worse if there is no "struct page" for the address bits in the migration entry PMD - I think at least on X86 there usually is no corresponding "struct page" thanks to the PTE inversion mitigation, amd64 looks different). If that didn't crash, the kernel would next try to write a PTE into what it wrongly thinks is a page table. As part of fixing these issues, get rid of the check for pmd_trans_huge() before __pte_alloc() - that's redundant, we're going to have to check for that after the __pte_alloc() anyway. Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels. Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog… Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog… Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Jann Horn <jannh(a)google.com> Acked-by: David Hildenbrand <david(a)redhat.com> Cc: Andrea Arcangeli <aarcange(a)redhat.com> Cc: Hugh Dickins <hughd(a)google.com> Cc: Jann Horn <jannh(a)google.com> Cc: Pavel Emelyanov <xemul(a)virtuozzo.com> Cc: Qi Zheng <zhengqi.arch(a)bytedance.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Conflicts: mm/userfaultfd.c [ pmd_read_atomic() is renamed to pmdp_get_lockless() in dab6e717429e5e ] Signed-off-by: Liu Shixin <liushixin2(a)huawei.com> --- mm/userfaultfd.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index eb272c4d1e89..9d462ffa0157 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -575,21 +575,23 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, } dst_pmdval = pmd_read_atomic(dst_pmd); - /* - * If the dst_pmd is mapped as THP don't - * override it and just be strict. - */ - if (unlikely(pmd_trans_huge(dst_pmdval))) { - err = -EEXIST; - break; - } if (unlikely(pmd_none(dst_pmdval)) && unlikely(__pte_alloc(dst_mm, dst_pmd))) { err = -ENOMEM; break; } - /* If an huge pmd materialized from under us fail */ - if (unlikely(pmd_trans_huge(*dst_pmd))) { + dst_pmdval = pmd_read_atomic(dst_pmd); + /* + * If the dst_pmd is THP don't override it and just be strict. + * (This includes the case where the PMD used to be THP and + * changed back to none after __pte_alloc().) + */ + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || + pmd_devmap(dst_pmdval))) { + err = -EEXIST; + break; + } + if (unlikely(pmd_bad(dst_pmdval))) { err = -EFAULT; break; } -- 2.34.1

2 1

[PATCH openEuler-1.0-LTS] x86/mm: Fix pti_clone_pgtable() alignment assumption
by Liu Shixin 24 Sep '24

24 Sep '24

From: Peter Zijlstra <peterz(a)infradead.org> stable inclusion from stable-v4.19.320 commit 18da1b27ce16a14a9b636af9232acb4fb24f4c9e category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAOXYY CVE: CVE-2024-44965 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 41e71dbb0e0a0fe214545fe64af031303a08524c ] Guenter reported dodgy crashes on an i386-nosmp build using GCC-11 that had the form of endless traps until entry stack exhaust and then It turned out that pti_clone_pgtable() had alignment assumptions on the start address, notably it hard assumes start is PMD aligned. This is true on x86_64, but very much not true on i386. These assumptions can cause the end condition to malfunction, leading to a 'short' clone. Guess what happens when the user mapping has a short copy of the entry text? Use the correct increment form for addr to avoid alignment assumptions. Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reported-by: Guenter Roeck <linux(a)roeck-us.net> Tested-by: Guenter Roeck <linux(a)roeck-us.net> Suggested-by: Thomas Gleixner <tglx(a)linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org> Link: https://lkml.kernel.org/r/20240731163105.GG33588@noisy.programming.kicks-as… Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Liu Shixin <liushixin2(a)huawei.com> --- arch/x86/mm/pti.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 622d5968c979..21105ae44ca1 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -383,14 +383,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end, */ *target_pmd = *pmd; - addr += PMD_SIZE; + addr = round_up(addr + 1, PMD_SIZE); } else if (level == PTI_CLONE_PTE) { /* Walk the page-table down to the pte level */ pte = pte_offset_kernel(pmd, addr); if (pte_none(*pte)) { - addr += PAGE_SIZE; + addr = round_up(addr + 1, PAGE_SIZE); continue; } @@ -410,7 +410,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, /* Clone the PTE */ *target_pte = *pte; - addr += PAGE_SIZE; + addr = round_up(addr + 1, PAGE_SIZE); } else { BUG(); -- 2.34.1

2 1

[PATCH openEuler-22.03-LTS-SP1] x86/mm: Fix pti_clone_pgtable() alignment assumption
by Liu Shixin 24 Sep '24

24 Sep '24

From: Peter Zijlstra <peterz(a)infradead.org> stable inclusion from stable-v5.10.224 commit d00c9b4bbc442d99e1dafbdfdab848bc1ead73f6 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAOXYY CVE: CVE-2024-44965 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 41e71dbb0e0a0fe214545fe64af031303a08524c ] Guenter reported dodgy crashes on an i386-nosmp build using GCC-11 that had the form of endless traps until entry stack exhaust and then It turned out that pti_clone_pgtable() had alignment assumptions on the start address, notably it hard assumes start is PMD aligned. This is true on x86_64, but very much not true on i386. These assumptions can cause the end condition to malfunction, leading to a 'short' clone. Guess what happens when the user mapping has a short copy of the entry text? Use the correct increment form for addr to avoid alignment assumptions. Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reported-by: Guenter Roeck <linux(a)roeck-us.net> Tested-by: Guenter Roeck <linux(a)roeck-us.net> Suggested-by: Thomas Gleixner <tglx(a)linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org> Link: https://lkml.kernel.org/r/20240731163105.GG33588@noisy.programming.kicks-as… Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Liu Shixin <liushixin2(a)huawei.com> --- arch/x86/mm/pti.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..59c13fdb8da0 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -374,14 +374,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end, */ *target_pmd = *pmd; - addr += PMD_SIZE; + addr = round_up(addr + 1, PMD_SIZE); } else if (level == PTI_CLONE_PTE) { /* Walk the page-table down to the pte level */ pte = pte_offset_kernel(pmd, addr); if (pte_none(*pte)) { - addr += PAGE_SIZE; + addr = round_up(addr + 1, PAGE_SIZE); continue; } @@ -401,7 +401,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, /* Clone the PTE */ *target_pte = *pte; - addr += PAGE_SIZE; + addr = round_up(addr + 1, PAGE_SIZE); } else { BUG(); -- 2.34.1

2 1