Chen Zhongjin (1): perf: Fix possible memleak in pmu_dev_alloc()
Christian Brauner (1): pnode: terminate at peers of source
Chuck Lever (1): SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails
Dan Carpenter (2):
      bonding: uninitialized variable in bond_miimon_inspect()
      ipmi: fix use after free in _ipmi_destroy_user()
Enzo Matsumiya (1): cifs: do not include page data when checking signature
Eric Dumazet (1): net: stream: purge sk_error_queue in sk_stream_kill_queues()
Greg Kroah-Hartman (1): prlimit: do_prlimit needs to have a speculation check
Huaxin Lu (1): ima: Fix a potential NULL pointer access in ima_restore_measurement_list
Isaac J. Manjarres (1): driver core: Fix bus_type.match() error handling in __driver_attach()
Jakub Kicinski (2):
      bpf: pull before calling skb_postpull_rcsum()
      net: stream: don't purge sk_error_queue in sk_stream_kill_queues()
Jan Kara (2):
      mbcache: automatically delete entries from cache on freeing
      ext4: fix deadlock due to mbcache entry corruption
Jann Horn (2):
      mm/khugepaged: fix GUP-fast interaction by sending IPI
      mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
Jiamei Xie (1): serial: amba-pl011: avoid SBSA UART accessing DMACR register
Jiang Li (1): md/raid1: stop mdx_raid1 thread when raid1 array run failed
Li Zetao (1): ACPICA: Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage()
Mark Rutland (1): arm64: cmpxchg_double*: hazard against entire exchange variable
Michael S. Tsirkin (1): PCI: Fix pci_device_is_present() for VFs by checking PF
Mikulas Patocka (1): md: fix a crash in mempool_free
Paolo Abeni (1): net/ulp: prevent ULP without clone op from entering the LISTEN status
Rafael J. Wysocki (1): ACPICA: Fix error code path in acpi_ds_call_control_method()
Sascha Hauer (1): PCI/sysfs: Fix double free in error path
Schspa Shi (1): mrp: introduce active flags to prevent UAF when applicant uninit
Stanislav Fomichev (1): bpf: make sure skb->len != 0 when redirecting to a tunneling device
Stephen Boyd (1): pstore: Avoid kcore oops by vmap()ing with VM_IOREMAP
Subash Abhinov Kasiviswanathan (1): skbuff: Account for tail adjustment during pull operations
Ulf Hansson (1): cpuidle: dt: Return the correct numbers of parsed idle states
Volker Lendecke (1): cifs: Fix uninitialized memory read for smb311 posix symlink create
Wang ShaoBo (1): SUNRPC: Fix missing release socket in rpc_sockname()
Wang Weiyang (1): device_cgroup: Roll back to original exceptions after copy failure
Wang Yufen (2):
      pstore/ram: Fix error return code in ramoops_probe()
      binfmt: Fix error return code in load_elf_fdpic_binary()
Xiu Jianfeng (1): ima: Fix misuse of dereference of pointer in template_desc_init_fields()
Yang Jihong (1): blktrace: Fix output non-blktrace event when blk_classic option enabled
Yang Shi (1): mm: gup: fix the fast GUP race against THP collapse
Yang Yingliang (2):
      class: fix possible memory leak in __class_register()
      chardev: fix error handling in cdev_device_add()
Ye Bin (1): blk-mq: fix possible memleak when register 'hctx' failed
Yuan Can (1): perf: arm_dsu: Fix hotplug callback leak in dsu_pmu_init()
Zhang Tianci (1): ovl: Use ovl mounter's fsuid and fsgid in ovl_link()
Zhang Yiqun (1): crypto: tcrypt - Fix multibuffer skcipher speed test mem leak
Zhang Yuchen (2):
      ipmi: fix memleak when unload ipmi driver
      ipmi: fix long wait in unload when IPMI disconnect
ZhangPeng (1): pinctrl: pinconf-generic: add missing of_node_put()
delisun (1): serial: pl011: Do not clear RX FIFO & RX interrupt in unthrottle.
minoura makoto (1): SUNRPC: ensure the matching upcall is in-flight upon downcall
 arch/arm64/include/asm/atomic_ll_sc.h |   2 +-
 arch/arm64/include/asm/atomic_lse.h   |   2 +-
 block/blk-mq-sysfs.c                  |  11 ++-
 crypto/tcrypt.c                       |   9 --
 drivers/acpi/acpica/dsmethod.c        |  10 ++-
 drivers/acpi/acpica/utcopy.c          |   7 --
 drivers/base/class.c                  |   5 ++
 drivers/base/dd.c                     |   8 +-
 drivers/char/ipmi/ipmi_msghandler.c   |  12 ++-
 drivers/char/ipmi/ipmi_si_intf.c      |  27 ++++--
 drivers/cpuidle/dt_idle_states.c      |   2 +-
 drivers/md/md.c                       |   9 +-
 drivers/md/raid1.c                    |   1 +
 drivers/net/bonding/bond_main.c       |   2 +-
 drivers/pci/pci-sysfs.c               |  13 ++-
 drivers/pci/pci.c                     |   2 +
 drivers/perf/arm_dsu_pmu.c            |   6 +-
 drivers/pinctrl/pinconf-generic.c     |   4 +-
 drivers/tty/serial/amba-pl011.c       |  14 ++-
 fs/binfmt_elf_fdpic.c                 |   5 +-
 fs/char_dev.c                         |   2 +-
 fs/cifs/link.c                        |   1 +
 fs/cifs/smb2pdu.c                     |  15 ++--
 fs/ext4/xattr.c                       |   4 +-
 fs/mbcache.c                          | 118 ++++++++++----------
 fs/overlayfs/dir.c                    |  46 ++++++----
 fs/pnode.c                            |   2 +-
 fs/pstore/ram.c                       |   2 +
 fs/pstore/ram_core.c                  |   6 +-
 include/asm-generic/tlb.h             |   6 ++
 include/linux/mbcache.h               |  33 ++++---
 include/linux/sunrpc/rpc_pipe_fs.h    |   5 ++
 include/net/mrp.h                     |   1 +
 kernel/events/core.c                  |   8 +-
 kernel/sys.c                          |   2 +
 kernel/trace/blktrace.c               |   3 +-
 mm/gup.c                              |  34 ++++++--
 mm/khugepaged.c                       |  34 ++++++--
 mm/memory.c                           |   1 +
 mm/mmu_gather.c                       |   5 ++
 net/802/mrp.c                         |  18 ++--
 net/core/filter.c                     |  11 ++-
 net/core/skbuff.c                     |   3 +
 net/core/stream.c                     |   7 +-
 net/ipv4/inet_connection_sock.c       |  16 +++-
 net/sunrpc/auth_gss/auth_gss.c        |  19 ++++-
 net/sunrpc/auth_gss/svcauth_gss.c     |   4 +-
 net/sunrpc/clnt.c                     |   2 +-
 security/device_cgroup.c              |  33 ++++++-
 security/integrity/ima/ima_template.c |   9 +-
 50 files changed, 406 insertions(+), 195 deletions(-)
From: Yuan Can <yuancan@huawei.com>

stable inclusion
from stable-v4.19.270
commit c2bb8256823c65ef4affc8ebdc690bea761ca462
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit facafab7611f7b872c6b9eeaff53461ef11f482e ]
dsu_pmu_init() won't remove the callback added by cpuhp_setup_state_multi() when platform_driver_register() fails. Remove the callback with cpuhp_remove_multi_state() in the failure path.
Similar to the handling of arm_ccn_init() in commit 26242b330093 ("bus: arm-ccn: Prevent hotplug callback leak")
Fixes: 7520fa99246d ("perf: ARM DynamIQ Shared Unit PMU support")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20221115070207.32634-2-yuancan@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/perf/arm_dsu_pmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/perf/arm_dsu_pmu.c b/drivers/perf/arm_dsu_pmu.c
index 1b347ba15a21..f1cb7a910394 100644
--- a/drivers/perf/arm_dsu_pmu.c
+++ b/drivers/perf/arm_dsu_pmu.c
@@ -824,7 +824,11 @@ static int __init dsu_pmu_init(void)
 	if (ret < 0)
 		return ret;
 	dsu_pmu_cpuhp_state = ret;
-	return platform_driver_register(&dsu_pmu_driver);
+	ret = platform_driver_register(&dsu_pmu_driver);
+	if (ret)
+		cpuhp_remove_multi_state(dsu_pmu_cpuhp_state);
+
+	return ret;
 }
 
 static void __exit dsu_pmu_exit(void)
From: Wang Yufen <wangyufen@huawei.com>

stable inclusion
from stable-v4.19.270
commit 6d8c5fc579eb5ab8de263a05b23bf5dae171e96c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit e1fce564900f8734edf15b87f028c57e14f6e28d ]
In the if (dev_of_node(dev) && !pdata) path, "err" may be assigned a value of 0, so the error return code may incorrectly become 0 instead of -EINVAL. To fix this, set a valid return code before calling goto.
Fixes: 35da60941e44 ("pstore/ram: add Device Tree bindings")
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1669969374-46582-1-git-send-email-wangyufen@huawei...
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/pstore/ram.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index b2ecfa34b2ff..5907d081fa13 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -763,6 +763,7 @@ static int ramoops_probe(struct platform_device *pdev)
 	/* Make sure we didn't get bogus platform data pointer. */
 	if (!pdata) {
 		pr_err("NULL platform data\n");
+		err = -EINVAL;
 		goto fail_out;
 	}
 
@@ -770,6 +771,7 @@ static int ramoops_probe(struct platform_device *pdev)
 	    !pdata->ftrace_size && !pdata->pmsg_size)) {
 		pr_err("The memory size and the record/console size must be "
 			"non-zero\n");
+		err = -EINVAL;
 		goto fail_out;
 	}
From: Stephen Boyd <swboyd@chromium.org>

stable inclusion
from stable-v4.19.270
commit 6d9460214e363e1f3d0756ee5d947e76e3e6f86c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit e6b842741b4f39007215fd7e545cb55aa3d358a2 ]
An oops can be induced by running 'cat /proc/kcore > /dev/null' on devices using pstore with the ram backend because kmap_atomic() assumes lowmem pages are accessible with __va().
Unable to handle kernel paging request at virtual address ffffff807ff2b000
Mem abort info:
  ESR = 0x96000006
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x06: level 2 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000006
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081d87000
[ffffff807ff2b000] pgd=180000017fe18003, p4d=180000017fe18003, pud=180000017fe18003, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Modules linked in: dm_integrity
CPU: 7 PID: 21179 Comm: perf Not tainted 5.15.67-10882-ge4eb2eb988cd #1 baa443fb8e8477896a370b31a821eb2009f9bfba
Hardware name: Google Lazor (rev3 - 8) (DT)
pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __memcpy+0x110/0x260
lr : vread+0x194/0x294
sp : ffffffc013ee39d0
x29: ffffffc013ee39f0 x28: 0000000000001000 x27: ffffff807ff2b000
x26: 0000000000001000 x25: ffffffc0085a2000 x24: ffffff802d4b3000
x23: ffffff80f8a60000 x22: ffffff802d4b3000 x21: ffffffc0085a2000
x20: ffffff8080b7bc68 x19: 0000000000001000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffffffd3073f2e60
x14: ffffffffad588000 x13: 0000000000000000 x12: 0000000000000001
x11: 00000000000001a2 x10: 00680000fff2bf0b x9 : 03fffffff807ff2b
x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffffff802d4b4000 x4 : ffffff807ff2c000 x3 : ffffffc013ee3a78
x2 : 0000000000001000 x1 : ffffff807ff2b000 x0 : ffffff802d4b3000
Call trace:
 __memcpy+0x110/0x260
 read_kcore+0x584/0x778
 proc_reg_read+0xb4/0xe4
During early boot, memblock reserves the pages for the ramoops reserved memory node in DT that would otherwise be part of the direct lowmem mapping. Pstore's ram backend reuses those reserved pages to change the memory type (writeback or non-cached) by passing the pages to vmap() (see pfn_to_page() usage in persistent_ram_vmap() for more details) with specific flags. When read_kcore() starts iterating over the vmalloc region, it runs over the virtual address that vmap() returned for ramoops. In aligned_vread() the virtual address is passed to vmalloc_to_page() which returns the page struct for the reserved lowmem area. That lowmem page is passed to kmap_atomic(), which effectively calls page_to_virt() that assumes a lowmem page struct must be directly accessible with __va() and friends. These pages are mapped via vmap() though, and the lowmem mapping was never made, so accessing them via the lowmem virtual address oopses like above.
Let's side-step this problem by passing VM_IOREMAP to vmap(). This will tell vread() to not include the ramoops region in the kcore. Instead the area will look like a bunch of zeros. The alternative is to teach kmap() about vmalloc areas that intersect with lowmem. Presumably such a change isn't a one-liner, and there isn't much interest in inspecting the ramoops region in kcore files anyway, so the most expedient route is taken for now.
Cc: Brian Geffon <bgeffon@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 404a6043385d ("staging: android: persistent_ram: handle reserving and mapping memory")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221205233136.3420802-1-swboyd@chromium.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/pstore/ram_core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index e37bad21a3c5..dc3570dafaff 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -426,7 +426,11 @@ static void *persistent_ram_vmap(phys_addr_t start, size_t size,
 		phys_addr_t addr = page_start + i * PAGE_SIZE;
 		pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
 	}
-	vaddr = vmap(pages, page_count, VM_MAP, prot);
+	/*
+	 * VM_IOREMAP used here to bypass this region during vread()
+	 * and kmap_atomic() (i.e. kcore) to avoid __va() failures.
+	 */
+	vaddr = vmap(pages, page_count, VM_MAP | VM_IOREMAP, prot);
 	kfree(pages);
 
 	/*
From: Ulf Hansson <ulf.hansson@linaro.org>

stable inclusion
from stable-v4.19.270
commit fff6b31b49d7d33151b401c9df4bbbecbe13cc72
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit ee3c2c8ad6ba6785f14a60e4081d7c82e88162a2 ]
While we correctly skip initializing an idle state from a disabled idle state node in DT, the value returned by dt_init_idle_driver() doesn't get adjusted accordingly. Instead, the number of idle state nodes found is returned, while the callers expect the number of idle states successfully initialized from DT.

This leads to cpuidle drivers unnecessarily continuing to initialize their idle-state-specific data. Moreover, in the case when all idle states have been disabled in DT, we would end up registering a cpuidle driver rather than relying on the default arch-specific idle call.
Fixes: 9f14da345599 ("drivers: cpuidle: implement DT based idle states infrastructure")
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/cpuidle/dt_idle_states.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cpuidle/dt_idle_states.c b/drivers/cpuidle/dt_idle_states.c
index 53342b7f1010..ea3c59d3fdad 100644
--- a/drivers/cpuidle/dt_idle_states.c
+++ b/drivers/cpuidle/dt_idle_states.c
@@ -224,6 +224,6 @@ int dt_init_idle_driver(struct cpuidle_driver *drv,
 	 * also be 0 on platforms with missing DT idle states or legacy DT
 	 * configuration predating the DT idle states bindings.
 	 */
-	return i;
+	return state_idx - start_idx;
 }
 EXPORT_SYMBOL_GPL(dt_init_idle_driver);
From: Chen Zhongjin <chenzhongjin@huawei.com>

stable inclusion
from stable-v4.19.270
commit aa679eeae4a1dba98b657a41589b8d58c14a31cc
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit e8d7a90c08ce963c592fb49845f2ccc606a2ac21 ]
In pmu_dev_alloc(), when dev_set_name() fails, it goes to free_dev and calls put_device(pmu->dev) to release it. However, pmu->dev->release is assigned after this, which causes a warning and a memleak. Call dev_set_name() after pmu->dev->release = pmu_dev_release to fix it.
Device '(null)' does not have a release() function...
WARNING: CPU: 2 PID: 441 at drivers/base/core.c:2332 device_release+0x1b9/0x240
...
Call Trace:
 <TASK>
 kobject_put+0x17f/0x460
 put_device+0x20/0x30
 pmu_dev_alloc+0x152/0x400
 perf_pmu_register+0x96b/0xee0
 ...
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
unreferenced object 0xffff888014759000 (size 2048):
  comm "modprobe", pid 441, jiffies 4294931444 (age 38.332s)
  backtrace:
    [<0000000005aed3b4>] kmalloc_trace+0x27/0x110
    [<000000006b38f9b8>] pmu_dev_alloc+0x50/0x400
    [<00000000735f17be>] perf_pmu_register+0x96b/0xee0
    [<00000000e38477f1>] 0xffffffffc0ad8603
    [<000000004e162216>] do_one_initcall+0xd0/0x4e0
    ...
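As a general illustration of the ordering rule at play here (a minimal sketch with a hypothetical my_dev_release(); not the perf code itself): once device_initialize() has run, every error path must unwind through put_device(), and the release callback has to be in place before the first point where put_device() can be reached:

	device_initialize(dev);
	dev->release = my_dev_release;	/* set before any put_device() */

	ret = dev_set_name(dev, "%s", name);
	if (ret)
		goto err_put;

	ret = device_add(dev);
	if (ret)
		goto err_put;

	return 0;

err_put:
	put_device(dev);	/* ends up calling my_dev_release() */
	return ret;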
Fixes: abe43400579d ("perf: Sysfs enumeration")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221111103653.91058-1-chenzhongjin@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 kernel/events/core.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b1e4beb6931c..96a4ffcce2e5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9728,13 +9728,15 @@ static int pmu_dev_alloc(struct pmu *pmu)
 
 	pmu->dev->groups = pmu->attr_groups;
 	device_initialize(pmu->dev);
-	ret = dev_set_name(pmu->dev, "%s", pmu->name);
-	if (ret)
-		goto free_dev;
 
 	dev_set_drvdata(pmu->dev, pmu);
 	pmu->dev->bus = &pmu_bus;
 	pmu->dev->release = pmu_dev_release;
+
+	ret = dev_set_name(pmu->dev, "%s", pmu->name);
+	if (ret)
+		goto free_dev;
+
 	ret = device_add(pmu->dev);
 	if (ret)
 		goto free_dev;
From: Ye Bin <yebin10@huawei.com>

stable inclusion
from stable-v4.19.270
commit 02bc8bc6eab03c84373281b85cb6e98747172ff7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 4b7a21c57b14fbcd0e1729150189e5933f5088e9 ]
There's an issue as follows when doing fault injection tests:

unreferenced object 0xffff888132a9f400 (size 512):
  comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff  ...........2....
    08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00  ...2............
  backtrace:
    [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0
    [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0
    [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230
    [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910
    [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0
    [<00000000a2a34657>] 0xffffffffa2ad310f
    [<00000000b173f718>] 0xffffffffa2af824a
    [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0
    [<00000000f32fdf93>] do_init_module+0xdf/0x320
    [<00000000cbe8541e>] load_module+0x3006/0x3390
    [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0
    [<00000000a1a29ae8>] do_syscall_64+0x35/0x80
    [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fault injection context as follows:

 kobject_add
 blk_mq_register_hctx
 blk_mq_sysfs_register
 blk_register_queue
 device_add_disk
 null_add_dev.part.0 [null_blk]
As 'blk_mq_register_hctx' may have already added some objects when it fails halfway, and there is no rollback, the caller cannot know which objects failed to be added. To solve the above issue, just roll back the objects already added when adding fails halfway in 'blk_mq_register_hctx'.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221117022940.873959-1-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/blk-mq-sysfs.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4ca4f0b1b619..3209979e105b 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -253,7 +253,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct blk_mq_ctx *ctx;
-	int i, ret;
+	int i, j, ret;
 
 	if (!hctx->nr_ctx)
 		return 0;
@@ -265,9 +265,16 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
 	hctx_for_each_ctx(hctx, ctx, i) {
 		ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu);
 		if (ret)
-			break;
+			goto out;
 	}
 
+	return 0;
+out:
+	hctx_for_each_ctx(hctx, ctx, j) {
+		if (j < i)
+			kobject_del(&ctx->kobj);
+	}
+	kobject_del(&hctx->kobj);
 	return ret;
 }
From: Jiang Li <jiang.li@ugreen.com>

stable inclusion
from stable-v4.19.270
commit 0c7c7468c3ae222e297b7dc74d6ccb69c4d0183c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit b611ad14006e5be2170d9e8e611bf49dff288911 ]
Running the raid1 array fails when we assemble the array with only inactive disks, but the mdx_raid1 thread is not stopped, even though the associated resources have been released. This causes a NULL dereference when we power off.
This causes the following Oops:

[  287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
[  287.594762] #PF: supervisor read access in kernel mode
[  287.599912] #PF: error_code(0x0000) - not-present page
[  287.605061] PGD 0 P4D 0
[  287.607612] Oops: 0000 [#1] SMP NOPTI
[  287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G U 5.10.146 #0
[  287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
[  287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
[  287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
[  287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
[  287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
[  287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
[  287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
[  287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
[  287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
[  287.692052] FS:  0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
[  287.700149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
[  287.713033] Call Trace:
[  287.715498]  raid1d+0x6c/0xbbb [raid1]
[  287.719256]  ? __schedule+0x1ff/0x760
[  287.722930]  ? schedule+0x3b/0xb0
[  287.726260]  ? schedule_timeout+0x1ed/0x290
[  287.730456]  ? __switch_to+0x11f/0x400
[  287.734219]  md_thread+0xe9/0x140 [md_mod]
[  287.738328]  ? md_thread+0xe9/0x140 [md_mod]
[  287.742601]  ? wait_woken+0x80/0x80
[  287.746097]  ? md_register_thread+0xe0/0xe0 [md_mod]
[  287.751064]  kthread+0x11a/0x140
[  287.754300]  ? kthread_park+0x90/0x90
[  287.757974]  ret_from_fork+0x1f/0x30
In fact, when the raid1 array fails to run, we need to call md_unregister_thread() before raid1_free().
Signed-off-by: Jiang Li <jiang.li@ugreen.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0c30a1fdb561..cebfe752d366 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3145,6 +3145,7 @@ static int raid1_run(struct mddev *mddev)
 	 * RAID1 needs at least one disk in active
 	 */
 	if (conf->raid_disks - mddev->degraded < 1) {
+		md_unregister_thread(&conf->thread);
 		ret = -EINVAL;
 		goto abort;
 	}
From: Li Zetao <lizetao1@huawei.com>

stable inclusion
from stable-v4.19.270
commit c9125b643fc51b8e662f2f614096ceb45a0adbc3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 470188b09e92d83c5a997f25f0e8fb8cd2bc3469 ]
There is a use-after-free reported by KASAN:
BUG: KASAN: use-after-free in acpi_ut_remove_reference+0x3b/0x82
Read of size 1 at addr ffff888112afc460 by task modprobe/2111
CPU: 0 PID: 2111 Comm: modprobe Not tainted 6.1.0-rc7-dirty
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
Call Trace:
 <TASK>
 kasan_report+0xae/0xe0
 acpi_ut_remove_reference+0x3b/0x82
 acpi_ut_copy_iobject_to_iobject+0x3be/0x3d5
 acpi_ds_store_object_to_local+0x15d/0x3a0
 acpi_ex_store+0x78d/0x7fd
 acpi_ex_opcode_1A_1T_1R+0xbe4/0xf9b
 acpi_ps_parse_aml+0x217/0x8d5
 ...
 </TASK>
The root cause of the problem is that the acpi_operand_object is freed when acpi_ut_walk_package_tree() fails in acpi_ut_copy_ipackage_to_ipackage(), leading to a repeated release in acpi_ut_copy_iobject_to_iobject(). The problem was introduced by commit 8aa5e56eeb61, which fixed a memory leak in acpi_ut_copy_iobject_to_iobject() but added a redundant remove operation, leading to the acpi_operand_object being used after free.

Fix it by removing acpi_ut_remove_reference() from acpi_ut_copy_ipackage_to_ipackage(). acpi_ut_copy_ipackage_to_ipackage() is called to copy an internal package object into another internal package object; when it fails, the memory of the acpi_operand_object should be freed by the caller.
Fixes: 8aa5e56eeb61 ("ACPICA: Utilities: Fix memory leak in acpi_ut_copy_iobject_to_iobject")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/acpi/acpica/utcopy.c | 7 -------
 1 file changed, 7 deletions(-)
diff --git a/drivers/acpi/acpica/utcopy.c b/drivers/acpi/acpica/utcopy.c
index a872ed7879ca..056c1741c1e3 100644
--- a/drivers/acpi/acpica/utcopy.c
+++ b/drivers/acpi/acpica/utcopy.c
@@ -916,13 +916,6 @@ acpi_ut_copy_ipackage_to_ipackage(union acpi_operand_object *source_obj,
 	status = acpi_ut_walk_package_tree(source_obj, dest_obj,
 					   acpi_ut_copy_ielement_to_ielement,
 					   walk_state);
-	if (ACPI_FAILURE(status)) {
-
-		/* On failure, delete the destination package object */
-
-		acpi_ut_remove_reference(dest_obj);
-	}
-
 	return_ACPI_STATUS(status);
 }
From: Xiu Jianfeng <xiujianfeng@huawei.com>

stable inclusion
from stable-v4.19.270
commit 9ca76b0b46fdb02e46558db5464988b4e18375eb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 25369175ce84813dd99d6604e710dc2491f68523 ]
The input parameter @fields is of type struct ima_template_field ***, so when allocating array memory for @fields, the size of an element should be sizeof(**fields) instead of sizeof(*fields).
Actually the original code would not cause any runtime error, but it's better to make it logically right.
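As a minimal illustration (hypothetical snippet, not from the patch) of why both expressions happen to work at runtime while only one is logically right:

	struct ima_template_field ***fields;

	/* sizeof(*fields)  == sizeof(struct ima_template_field **) -- a pointer */
	/* sizeof(**fields) == sizeof(struct ima_template_field *)  -- also a pointer */

Both are pointer-sized, so the old code allocated the right number of bytes by accident; the array elements are of type struct ima_template_field *, so sizeof(**fields) is the logically correct element size.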
Fixes: adf53a778a0a ("ima: new templates management mechanism")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 security/integrity/ima/ima_template.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
index 4dfdccce497b..13567e555130 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -196,11 +196,11 @@ static int template_desc_init_fields(const char *template_fmt,
 	}
 
 	if (fields && num_fields) {
-		*fields = kmalloc_array(i, sizeof(*fields), GFP_KERNEL);
+		*fields = kmalloc_array(i, sizeof(**fields), GFP_KERNEL);
 		if (*fields == NULL)
 			return -ENOMEM;
 
-		memcpy(*fields, found_fields, i * sizeof(*fields));
+		memcpy(*fields, found_fields, i * sizeof(**fields));
 		*num_fields = i;
 	}
From: ZhangPeng <zhangpeng362@huawei.com>

stable inclusion
from stable-v4.19.270
commit cad3e90013f4d224adbbe735b502fa080fe28e9e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 5ead93289815a075d43c415e35c8beafafb801c9 ]
of_node_put() needs to be called when jumping out of the loop, since for_each_available_child_of_node() will increase the refcount of node.
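As a general illustration of the idiom (hypothetical handle_child() helper; not the pinctrl code itself):

	struct device_node *child;

	for_each_available_child_of_node(parent, child) {
		if (handle_child(child) < 0) {
			/* The iterator took a reference on 'child';
			 * drop it before leaving the loop early.
			 */
			of_node_put(child);
			break;
		}
	}

On normal loop termination the iterator has already dropped the last reference itself, which is why the put is only needed on the early exit.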
Fixes: c7289500e29d ("pinctrl: pinconf-generic: scan also referenced phandle node")
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Link: https://lore.kernel.org/r/20221125070156.3535855-1-zhangpeng362@huawei.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/pinctrl/pinconf-generic.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pinctrl/pinconf-generic.c b/drivers/pinctrl/pinconf-generic.c
index b4f7f8a458ea..78fb9a1dc10a 100644
--- a/drivers/pinctrl/pinconf-generic.c
+++ b/drivers/pinctrl/pinconf-generic.c
@@ -390,8 +390,10 @@ int pinconf_generic_dt_node_to_map(struct pinctrl_dev *pctldev,
 	for_each_available_child_of_node(np_config, np) {
 		ret = pinconf_generic_dt_subnode_to_map(pctldev, np, map,
 				&reserved_maps, num_maps, type);
-		if (ret < 0)
+		if (ret < 0) {
+			of_node_put(np);
 			goto exit;
+		}
 	}
 	return 0;
From: Dan Carpenter <error27@gmail.com>

stable inclusion
from stable-v4.19.270
commit c66566533a16228d2a78e723713a4c0cc1b82a0a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit e5214f363dabca240446272dac54d404501ad5e5 ]
The "ignore_updelay" variable needs to be initialized to false.
Fixes: f8a65ab2f3ff ("bonding: fix link recovery in mode 2 when updelay is nonzero")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/Y4SWJlh3ohJ6EPTL@kili
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/net/bonding/bond_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b1b4b2b42166..5fd9aefdfa13 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2101,10 +2101,10 @@ static int bond_slave_info_query(struct net_device *bond_dev, struct ifslave *in
 /* called with rcu_read_lock() */
 static int bond_miimon_inspect(struct bonding *bond)
 {
+	bool ignore_updelay = false;
 	int link_state, commit = 0;
 	struct list_head *iter;
 	struct slave *slave;
-	bool ignore_updelay;
 
 	ignore_updelay = !rcu_dereference(bond->curr_active_slave);
From: Wang ShaoBo <bobo.shaobowang@huawei.com>

stable inclusion
from stable-v4.19.270
commit 1f356afe521dd4d92b636761df098c068800093d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 50fa355bc0d75911fe9d5072a5ba52cdb803aff7 ]
The dynamically created socket is not released when an unintended address family type is encountered in rpc_sockname(). Go directly to out_release to call sock_release().
Fixes: 2e738fdce22f ("SUNRPC: Add API to acquire source address")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/sunrpc/clnt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index dc58c227f37c..0fc540b0d183 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1277,7 +1277,7 @@ static int rpc_sockname(struct net *net, struct sockaddr *sap, size_t salen,
 		break;
 	default:
 		err = -EAFNOSUPPORT;
-		goto out;
+		goto out_release;
 	}
 	if (err < 0) {
 		dprintk("RPC:       can't bind UDP socket (%d)\n", err);
From: Yang Jihong <yangjihong1@huawei.com>

stable inclusion
from stable-v4.19.270
commit 7349d943eaa189cbc13e02dfd3871c868253cf95
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit f596da3efaf4130ff61cd029558845808df9bf99 ]
When the blk_classic option is enabled, non-blktrace events must be filtered out. Otherwise, events of other types are output in the blktrace classic format, which is unexpected.
The problem can be triggered in the following ways:
 # echo 1 > /sys/kernel/debug/tracing/options/blk_classic
 # echo 1 > /sys/kernel/debug/tracing/events/enable
 # echo blk > /sys/kernel/debug/tracing/current_tracer
 # cat /sys/kernel/debug/tracing/trace_pipe
Fixes: c71a89615411 ("blktrace: add ftrace plugin")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 kernel/trace/blktrace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 671f464f92ae..49a84ee7ec30 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1592,7 +1592,8 @@ blk_trace_event_print_binary(struct trace_iterator *iter, int flags,
 
 static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
 {
-	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
+	if ((iter->ent->type != TRACE_BLK) ||
+	    !(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
 		return TRACE_TYPE_UNHANDLED;
 
 	return print_one_line(iter, true);
From: Zhang Yiqun <zhangyiqun@phytium.com.cn>

stable inclusion
from stable-v4.19.270
commit e4ec2042899536b5a8f714b6eda4443d717f41bf
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 1aa33fc8d4032227253ceb736f47c52b859d9683 ]
In the past, the data for the mb-skcipher test was allocated twice, which means the first allocated memory area is never freed, causing a potential memory leak. So this patch removes one of the allocations to fix this error.
Fixes: e161c5930c15 ("crypto: tcrypt - add multibuf skcipher...")
Signed-off-by: Zhang Yiqun <zhangyiqun@phytium.com.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 crypto/tcrypt.c | 9 ---------
 1 file changed, 9 deletions(-)
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index bf797c613ba2..366f4510acbe 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1285,15 +1285,6 @@ static void test_mb_skcipher_speed(const char *algo, int enc, int secs,
 			goto out_free_tfm;
 		}
 
-
-	for (i = 0; i < num_mb; ++i)
-		if (testmgr_alloc_buf(data[i].xbuf)) {
-			while (i--)
-				testmgr_free_buf(data[i].xbuf);
-			goto out_free_tfm;
-		}
-
-
 	for (i = 0; i < num_mb; ++i) {
 		data[i].req = skcipher_request_alloc(tfm, GFP_KERNEL);
 		if (!data[i].req) {
From: Yang Yingliang <yangyingliang@huawei.com>

stable inclusion
from stable-v4.19.270
commit 3bb9c92c27624ad076419a70f2b1a30cd1f8bbbd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 8c3e8a6bdb5253b97ad532570f8b5db5f7a06407 ]
If class_add_groups() returns an error, 'cp->subsys' needs to be unregistered and 'cp' needs to be freed.

We can not call kset_unregister() here, because 'cls' would be freed in the callback function class_release() and it is also freed in the caller's error path, which would cause a double free.

So fix this by calling kobject_del() and kfree_const(name) to clean up the kobject. Besides, call kfree() to free 'cp'.
Fault injection test can trigger this:
unreferenced object 0xffff888102fa8190 (size 8):
  comm "modprobe", pid 502, jiffies 4294906074 (age 49.296s)
  hex dump (first 8 bytes):
    70 6b 74 63 64 76 64 00                          pktcdvd.
  backtrace:
    [<00000000e7c7703d>] __kmalloc_track_caller+0x1ae/0x320
    [<000000005e4d70bc>] kstrdup+0x3a/0x70
    [<00000000c2e5e85a>] kstrdup_const+0x68/0x80
    [<000000000049a8c7>] kvasprintf_const+0x10b/0x190
    [<0000000029123163>] kobject_set_name_vargs+0x56/0x150
    [<00000000747219c9>] kobject_set_name+0xab/0xe0
    [<0000000005f1ea4e>] __class_register+0x15c/0x49a

unreferenced object 0xffff888037274000 (size 1024):
  comm "modprobe", pid 502, jiffies 4294906074 (age 49.296s)
  hex dump (first 32 bytes):
    00 40 27 37 80 88 ff ff 00 40 27 37 80 88 ff ff  .@'7.....@'7....
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
  backtrace:
    [<00000000151f9600>] kmem_cache_alloc_trace+0x17c/0x2f0
    [<00000000ecf3dd95>] __class_register+0x86/0x49a
Fixes: ced6473e7486 ("driver core: class: add class_groups support")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20221026082803.3458760-1-yangyingliang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/base/class.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/drivers/base/class.c b/drivers/base/class.c
index 4c103bd8d525..1dd058fa9bce 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -185,6 +185,11 @@ int __class_register(struct class *cls, struct lock_class_key *key)
 	}
 	error = class_add_groups(class_get(cls), cls->class_groups);
 	class_put(cls);
+	if (error) {
+		kobject_del(&cp->subsys.kobj);
+		kfree_const(cp->subsys.kobj.name);
+		kfree(cp);
+	}
 	return error;
 }
 EXPORT_SYMBOL_GPL(__class_register);
From: Jiamei Xie <jiamei.xie@arm.com>

stable inclusion
from stable-v4.19.270
commit 78d837ce20517e0c1ff3ebe08ad64636e02c2e48
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 94cdb9f33698478b0e7062586633c42c6158a786 ]
Chapter "B Generic UART" in "ARM Server Base System Architecture" [1] documentation describes a generic UART interface. Such generic UART does not support DMA. In current code, sbsa_uart_pops and amba_pl011_pops share the same stop_rx operation, which will invoke pl011_dma_rx_stop, leading to an access of the DMACR register. This commit adds a using_rx_dma check in pl011_dma_rx_stop to avoid the access to DMACR register for SBSA UARTs which does not support DMA.
When the kernel enables the DMA engine with "CONFIG_DMA_ENGINE=y", the Linux SBSA PL011 driver will access the PL011 DMACR register in some functions. For most real SBSA PL011 hardware implementations, the DMACR write behaviour will be ignored, so these DMACR operations will not cause obvious problems. But for some virtual SBSA PL011 hardware, like the Xen virtual SBSA PL011 (vpl011) device, the behaviour might be different. Xen vpl011 emulation will inject a data abort to the guest when the guest is accessing an unimplemented UART register. As Xen vpl011 is SBSA compatible, it will not implement the DMACR register. So when the Linux SBSA PL011 driver accesses the DMACR register, it will get an unhandled data abort fault and the application will get a segmentation fault:

Unhandled fault at 0xffffffc00944d048
Mem abort info:
  ESR = 0x96000000
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x00: ttbr address size fault
Data abort info:
  ISV = 0, ISS = 0x00000000
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000020e2e000
[ffffffc00944d048] pgd=100000003ffff803, p4d=100000003ffff803, pud=100000003ffff803, pmd=100000003fffa803, pte=006800009c090f13
Internal error: ttbr address size fault: 96000000 [#1] PREEMPT SMP
...
Call trace:
 pl011_stop_rx+0x70/0x80
 tty_port_shutdown+0x7c/0xb4
 tty_port_close+0x60/0xcc
 uart_close+0x34/0x8c
 tty_release+0x144/0x4c0
 __fput+0x78/0x220
 ____fput+0x1c/0x30
 task_work_run+0x88/0xc0
 do_notify_resume+0x8d0/0x123c
 el0_svc+0xa8/0xc0
 el0t_64_sync_handler+0xa4/0x130
 el0t_64_sync+0x1a0/0x1a4
Code: b9000083 b901f001 794038a0 8b000042 (b9000041)
---[ end trace 83dd93df15c3216f ]---
note: bootlogd[132] exited with preempt_count 1
/etc/rcS.d/S07bootlogd: line 47: 132 Segmentation fault start-stop-daemon
This has been discussed in the Xen community, and we think it should be fixed in Linux. See [2] for more information.
[1] https://developer.arm.com/documentation/den0094/c/?lang=en [2] https://lists.xenproject.org/archives/html/xen-devel/2022-11/msg00543.html
Fixes: 0dd1e247fd39 ("drivers: PL011: add support for the ARM SBSA generic UART")
Signed-off-by: Jiamei Xie <jiamei.xie@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20221117103237.86856-1-jiamei.xie@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/tty/serial/amba-pl011.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 2ff45aae120e..0b2576404ace 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -1053,6 +1053,9 @@ static void pl011_dma_rx_callback(void *data)
  */
 static inline void pl011_dma_rx_stop(struct uart_amba_port *uap)
 {
+	if (!uap->using_rx_dma)
+		return;
+
 	/* FIXME.  Just disable the DMA enable */
 	uap->dmacr &= ~UART011_RXDMAE;
 	pl011_write(uap->dmacr, uap, REG_DMACR);
From: delisun <delisun@pateo.com.cn>

stable inclusion
from stable-v4.19.270
commit 3a25c7891d717db137354476e0bb6eb34ad5f2d3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 032d5a71ed378ffc6a2d41a187d8488a4f9fe415 ]
Clearing the RX FIFO will cause data loss. Copy the pl011_enable_interrupts() implementation, and remove the part of the code that clears the interrupt and the FIFO.
Fixes: 211565b10099 ("serial: pl011: UPSTAT_AUTORTS requires .throttle/unthrottle")
Signed-off-by: delisun <delisun@pateo.com.cn>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20221110020108.7700-1-delisun@pateo.com.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/tty/serial/amba-pl011.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 0b2576404ace..c987db50757c 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -1837,8 +1837,17 @@ static void pl011_enable_interrupts(struct uart_amba_port *uap)
 static void pl011_unthrottle_rx(struct uart_port *port)
 {
 	struct uart_amba_port *uap = container_of(port, struct uart_amba_port, port);
+	unsigned long flags;
 
-	pl011_enable_interrupts(uap);
+	spin_lock_irqsave(&uap->port.lock, flags);
+
+	uap->im = UART011_RTIM;
+	if (!pl011_dma_rx_running(uap))
+		uap->im |= UART011_RXIM;
+
+	pl011_write(uap->im, uap, REG_IMSC);
+
+	spin_unlock_irqrestore(&uap->port.lock, flags);
 }
 
 static int pl011_startup(struct uart_port *port)
From: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>

stable inclusion
from stable-v4.19.270
commit 2d59f0ca153e9573ec4f140988c0ccca0eb4181b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 2d7afdcbc9d32423f177ee12b7c93783aea338fb ]
Extending the tail can have some unexpected side effects if a program uses a helper like BPF_FUNC_skb_pull_data to read partial content beyond the head skb headlen when all the skbs in the gso frag_list are linear with no head_frag -
kernel BUG at net/core/skbuff.c:4219!
pc : skb_segment+0xcf4/0xd2c
lr : skb_segment+0x63c/0xd2c
Call trace:
 skb_segment+0xcf4/0xd2c
 __udp_gso_segment+0xa4/0x544
 udp4_ufo_fragment+0x184/0x1c0
 inet_gso_segment+0x16c/0x3a4
 skb_mac_gso_segment+0xd4/0x1b0
 __skb_gso_segment+0xcc/0x12c
 udp_rcv_segment+0x54/0x16c
 udp_queue_rcv_skb+0x78/0x144
 udp_unicast_rcv_skb+0x8c/0xa4
 __udp4_lib_rcv+0x490/0x68c
 udp_rcv+0x20/0x30
 ip_protocol_deliver_rcu+0x1b0/0x33c
 ip_local_deliver+0xd8/0x1f0
 ip_rcv+0x98/0x1a4
 deliver_ptype_list_skb+0x98/0x1ec
 __netif_receive_skb_core+0x978/0xc60
Fix this by marking these skbs as GSO_DODGY so segmentation can handle the tail updates accordingly.
Fixes: 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list")
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/1671084718-24796-1-git-send-email-quic_subashab@qu...
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/core/skbuff.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4178fc28c277..7f501dff4501 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1953,6 +1953,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 				insp = list;
 			} else {
 				/* Eaten partially. */
+				if (skb_is_gso(skb) && !list->head_frag &&
+				    skb_headlen(list))
+					skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
 
 				if (skb_shared(list)) {
 					/* Sucks! We need to fork list. :-( */
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
stable inclusion from stable-v4.19.270 commit 2deb42c4f9776e59bee247c14af9c5e8c05ca9a6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8 CVE: NA
--------------------------------
[ Upstream commit 404ec60438add1afadaffaed34bb5fe4ddcadd40 ]
A use-after-free in acpi_ps_parse_aml() after a failing invocation of acpi_ds_call_control_method() is reported by KASAN [1], and code inspection reveals that next_walk_state pushed to the thread by acpi_ds_create_walk_state() is freed on errors, but it is not popped from the thread beforehand. Thus acpi_ds_get_current_walk_state() called by acpi_ps_parse_aml() subsequently returns it as the new walk state, which is incorrect.
To address this, make acpi_ds_call_control_method() call acpi_ds_pop_walk_state() to pop next_walk_state from the thread before returning an error.
Link: https://lore.kernel.org/linux-acpi/20221019073443.248215-1-chenzhongjin@huaw... # [1]
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/acpi/acpica/dsmethod.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/acpica/dsmethod.c b/drivers/acpi/acpica/dsmethod.c
index dd4deb678d13..a00516d9538c 100644
--- a/drivers/acpi/acpica/dsmethod.c
+++ b/drivers/acpi/acpica/dsmethod.c
@@ -517,7 +517,7 @@ acpi_ds_call_control_method(struct acpi_thread_state *thread,
 	info = ACPI_ALLOCATE_ZEROED(sizeof(struct acpi_evaluate_info));
 	if (!info) {
 		status = AE_NO_MEMORY;
-		goto cleanup;
+		goto pop_walk_state;
 	}
 
 	info->parameters = &this_walk_state->operands[0];
@@ -529,7 +529,7 @@ acpi_ds_call_control_method(struct acpi_thread_state *thread,
 
 	ACPI_FREE(info);
 	if (ACPI_FAILURE(status)) {
-		goto cleanup;
+		goto pop_walk_state;
 	}
 
 	/*
@@ -561,6 +561,12 @@ acpi_ds_call_control_method(struct acpi_thread_state *thread,
 
 	return_ACPI_STATUS(status);
 
+pop_walk_state:
+
+	/* On error, pop the walk state to be deleted from thread */
+
+	acpi_ds_pop_walk_state(thread);
+
 cleanup:
 
 	/* On error, we must terminate the method properly */
From: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>

stable inclusion
from stable-v4.19.270
commit acc6579bea6a20e472eca3264203dd5854ca9b4e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 36992eb6b9b83f7f9cdc8e74fb5799d7b52e83e9 ]
After the IPMI disconnect problem, the memory kept rising and we tried to unload the driver to free the memory. However, only part of the memory was recovered after the driver was uninstalled. Using ebpf to hook the free functions, we found that neither ipmi_user nor ipmi_smi_msg was freed; only ipmi_recv_msg was freed.

We found that the deliver_smi_err_response() call in clean_smi_msgs() does the destroy processing on each message from the xmit_msg queue without checking the return value and freeing the ipmi_smi_msg.

deliver_smi_err_response() is called only at this location, so adding the free handling there has no unintended side effects.
To verify, try using ebpf to trace the free function.
$ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc rcv %p\n",retval);}
    kprobe:free_recv_msg {printf("free recv %p\n", arg0)}
    kretprobe:ipmi_alloc_smi_msg {printf("alloc smi %p\n", retval);}
    kprobe:free_smi_msg {printf("free smi %p\n",arg0)}'
Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>
Message-Id: <20221007092617.87597-4-zhangyuchen.lcr@bytedance.com>
[Fixed the comment above handle_one_recv_msg().]
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/char/ipmi/ipmi_msghandler.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 83eaab66d2f6..6d745cc41611 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -3459,12 +3459,16 @@ static void deliver_smi_err_response(struct ipmi_smi *intf,
 				     struct ipmi_smi_msg *msg,
 				     unsigned char err)
 {
+	int rv;
 	msg->rsp[0] = msg->data[0] | 4;
 	msg->rsp[1] = msg->data[1];
 	msg->rsp[2] = err;
 	msg->rsp_size = 3;
-	/* It's an error, so it will never requeue, no need to check return. */
-	handle_one_recv_msg(intf, msg);
+
+	/* This will never requeue, but it may ask us to free the message. */
+	rv = handle_one_recv_msg(intf, msg);
+	if (rv == 0)
+		ipmi_free_smi_msg(msg);
 }
 
 static void cleanup_smi_msgs(struct ipmi_smi *intf)
From: Stanislav Fomichev <sdf@google.com>

stable inclusion
from stable-v4.19.270
commit e6a63203e5a90a39392fa1a7ffc60f5e9baf642a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 07ec7b502800ba9f7b8b15cb01dd6556bb41aaca ]
syzkaller managed to trigger another case where skb->len == 0 when we enter __dev_queue_xmit:
WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline]
WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295

Call Trace:
 dev_queue_xmit+0x17/0x20 net/core/dev.c:4406
 __bpf_tx_skb net/core/filter.c:2115 [inline]
 __bpf_redirect_no_mac net/core/filter.c:2140 [inline]
 __bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163
 ____bpf_clone_redirect net/core/filter.c:2447 [inline]
 bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419
 bpf_prog_48159a89cb4a9a16+0x59/0x5e
 bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline]
 __bpf_prog_run include/linux/filter.h:596 [inline]
 bpf_prog_run include/linux/filter.h:603 [inline]
 bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402
 bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170
 bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648
 __sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005
 __do_sys_bpf kernel/bpf/syscall.c:5091 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5089 [inline]
 __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089
 do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x61/0xc6
The reproducer doesn't really reproduce outside of syzkaller environment, so I'm taking a guess here. It looks like we do generate correct ETH_HLEN-sized packet, but we redirect the packet to the tunneling device. Before we do so, we __skb_pull l2 header and arrive again at skb->len == 0. Doesn't seem like we can do anything better than having an explicit check after __skb_pull?
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+f635e86ec3fa0a37e019@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221027225537.353077-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/core/filter.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 7a3655e14764..e8111f5ee81e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2025,6 +2025,10 @@ static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev,
 
 	if (mlen) {
 		__skb_pull(skb, mlen);
+		if (unlikely(!skb->len)) {
+			kfree_skb(skb);
+			return -ERANGE;
+		}
 
 		/* At ingress, the mac header has already been pulled once.
 		 * At egress, skb_pospull_rcsum has to be done in case that
From: Schspa Shi <schspa@gmail.com>

stable inclusion
from stable-v4.19.270
commit 78d48bc41f7726113c9f114268d3ab11212814da
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit ab0377803dafc58f1e22296708c1c28e309414d6 ]
The caller of del_timer_sync() must prevent the timer from being restarted. If we don't have this synchronization, there is a small probability that the cancellation will not be successful.
And syzbot reported the following crash:

==================================================================
BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:929 [inline]
BUG: KASAN: use-after-free in enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
Write at addr f9ff000024df6058 by task syz-fuzzer/2256
Pointer tag: [f9], memory tag: [fe]

CPU: 1 PID: 2256 Comm: syz-fuzzer Not tainted 6.1.0-rc5-syzkaller-00008-ge01d50cbd6ee #0
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xe0/0xf0 arch/arm64/kernel/stacktrace.c:156
 dump_backtrace arch/arm64/kernel/stacktrace.c:162 [inline]
 show_stack+0x18/0x40 arch/arm64/kernel/stacktrace.c:163
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x68/0x84 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:284 [inline]
 print_report+0x1a8/0x4a0 mm/kasan/report.c:395
 kasan_report+0x94/0xb4 mm/kasan/report.c:495
 __do_kernel_fault+0x164/0x1e0 arch/arm64/mm/fault.c:320
 do_bad_area arch/arm64/mm/fault.c:473 [inline]
 do_tag_check_fault+0x78/0x8c arch/arm64/mm/fault.c:749
 do_mem_abort+0x44/0x94 arch/arm64/mm/fault.c:825
 el1_abort+0x40/0x60 arch/arm64/kernel/entry-common.c:367
 el1h_64_sync_handler+0xd8/0xe4 arch/arm64/kernel/entry-common.c:427
 el1h_64_sync+0x64/0x68 arch/arm64/kernel/entry.S:576
 hlist_add_head include/linux/list.h:929 [inline]
 enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
 mod_timer+0x14/0x20 kernel/time/timer.c:1161
 mrp_periodic_timer_arm net/802/mrp.c:614 [inline]
 mrp_periodic_timer+0xa0/0xc0 net/802/mrp.c:627
 call_timer_fn.constprop.0+0x24/0x80 kernel/time/timer.c:1474
 expire_timers+0x98/0xc4 kernel/time/timer.c:1519
To fix it, we can introduce a new active flag to make sure the timer will not be restarted.
Reported-by: syzbot+6fd64001c20aa99e34a4@syzkaller.appspotmail.com
Signed-off-by: Schspa Shi <schspa@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 include/net/mrp.h |  1 +
 net/802/mrp.c     | 18 +++++++++++------
 2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/net/mrp.h b/include/net/mrp.h
index ef58b4a07190..c6c53370e390 100644
--- a/include/net/mrp.h
+++ b/include/net/mrp.h
@@ -120,6 +120,7 @@ struct mrp_applicant {
 	struct sk_buff		*pdu;
 	struct rb_root		mad;
 	struct rcu_head		rcu;
+	bool			active;
 };
 
 struct mrp_port {
diff --git a/net/802/mrp.c b/net/802/mrp.c
index 32f87d458f05..ce6e4774d333 100644
--- a/net/802/mrp.c
+++ b/net/802/mrp.c
@@ -609,7 +609,10 @@ static void mrp_join_timer(struct timer_list *t)
 	spin_unlock(&app->lock);
 
 	mrp_queue_xmit(app);
-	mrp_join_timer_arm(app);
+	spin_lock(&app->lock);
+	if (likely(app->active))
+		mrp_join_timer_arm(app);
+	spin_unlock(&app->lock);
 }
 
 static void mrp_periodic_timer_arm(struct mrp_applicant *app)
@@ -623,11 +626,12 @@ static void mrp_periodic_timer(struct timer_list *t)
 	struct mrp_applicant *app = from_timer(app, t, periodic_timer);
 
 	spin_lock(&app->lock);
-	mrp_mad_event(app, MRP_EVENT_PERIODIC);
-	mrp_pdu_queue(app);
+	if (likely(app->active)) {
+		mrp_mad_event(app, MRP_EVENT_PERIODIC);
+		mrp_pdu_queue(app);
+		mrp_periodic_timer_arm(app);
+	}
 	spin_unlock(&app->lock);
-
-	mrp_periodic_timer_arm(app);
 }
 
 static int mrp_pdu_parse_end_mark(struct sk_buff *skb, int *offset)
@@ -875,6 +879,7 @@ int mrp_init_applicant(struct net_device *dev, struct mrp_application *appl)
 	app->dev = dev;
 	app->app = appl;
 	app->mad = RB_ROOT;
+	app->active = true;
 	spin_lock_init(&app->lock);
 	skb_queue_head_init(&app->queue);
 	rcu_assign_pointer(dev->mrp_port->applicants[appl->type], app);
@@ -903,6 +908,9 @@ void mrp_uninit_applicant(struct net_device *dev, struct mrp_application *appl)
 
 	RCU_INIT_POINTER(port->applicants[appl->type], NULL);
 
+	spin_lock_bh(&app->lock);
+	app->active = false;
+	spin_unlock_bh(&app->lock);
 	/* Delete timer and generate a final TX event to flush out
 	 * all pending messages before the applicant is gone.
 	 */
From: Yang Yingliang <yangyingliang@huawei.com>

stable inclusion
from stable-v4.19.270
commit 34d17b39bceef25e4cf9805cd59250ae05d0a139
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 11fa7fefe3d8fac7da56bc9aa3dd5fb3081ca797 ]
While doing fault injection test, I got the following report:
------------[ cut here ]------------
kobject: '(null)' (0000000039956980): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 3 PID: 6306 at kobject_put+0x23d/0x4e0
CPU: 3 PID: 6306 Comm: 283 Tainted: G W 6.1.0-rc2-00005-g307c1086d7c9 #1253
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kobject_put+0x23d/0x4e0
Call Trace:
 <TASK>
 cdev_device_add+0x15e/0x1b0
 __iio_device_register+0x13b4/0x1af0 [industrialio]
 __devm_iio_device_register+0x22/0x90 [industrialio]
 max517_probe+0x3d8/0x6b4 [max517]
 i2c_device_probe+0xa81/0xc00
When device_add() has a fault injected and returns an error while dev->devt is not set, cdev_add() was never called, so cdev_del() is not needed. Fix this by checking dev->devt in the error path.
Fixes: 233ed09d7fda ("chardev: add helper function to register char devs with a struct device")
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Link: https://lore.kernel.org/r/20221202030237.520280-1-yangyingliang@huawei.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/char_dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/char_dev.c b/fs/char_dev.c
index 5fffd5050fb7..2c3d519b21c2 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -553,7 +553,7 @@ int cdev_device_add(struct cdev *cdev, struct device *dev)
 	}
 
 	rc = device_add(dev);
-	if (rc)
+	if (rc && dev->devt)
 		cdev_del(cdev);
 
 	return rc;
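For context, the body of cdev_device_add() looks roughly like this after the fix (paraphrased from upstream fs/char_dev.c and trimmed): cdev_add() only runs when dev->devt is set, so the error path must guard cdev_del() with the same condition.

int cdev_device_add(struct cdev *cdev, struct device *dev)
{
	int rc = 0;

	if (dev->devt) {
		cdev_set_parent(cdev, &dev->kobj);

		rc = cdev_add(cdev, dev->devt, 1);	/* only runs when devt is set */
		if (rc)
			return rc;
	}

	rc = device_add(dev);
	if (rc && dev->devt)	/* undo cdev_add() only if it actually ran */
		cdev_del(cdev);

	return rc;
}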
From: Wang Yufen wangyufen@huawei.com
stable inclusion
from stable-v4.19.270
commit 72bd0b5cdbcbe31d6644960cdbcbc33d1b4b658d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit e7f703ff2507f4e9f496da96cd4b78fd3026120c ]
Fix load_elf_fdpic_binary() to return the negative error code from create_elf_fdpic_tables() on failure instead of 0.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Wang Yufen wangyufen@huawei.com
Signed-off-by: Kees Cook keescook@chromium.org
Link: https://lore.kernel.org/r/1669945261-30271-1-git-send-email-wangyufen@huawei...
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/binfmt_elf_fdpic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 60896c16f103..64d0b838085d 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -439,8 +439,9 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 	current->mm->start_stack = current->mm->start_brk + stack_size;
 #endif
 
-	if (create_elf_fdpic_tables(bprm, current->mm,
-				    &exec_params, &interp_params) < 0)
+	retval = create_elf_fdpic_tables(bprm, current->mm, &exec_params,
+					 &interp_params);
+	if (retval < 0)
 		goto error;
 
 	kdebug("- start_code %lx", current->mm->start_code);
From: Zhang Yuchen zhangyuchen.lcr@bytedance.com
stable inclusion
from stable-v4.19.270
commit f99cb54d8ec6ba564ffc72354d9e1e6103fad887
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit f6f1234d98cce69578bfac79df147a1f6660596c upstream.
When fixing the problem mentioned in PATCH1, we also found the following problem:
If IPMI is disconnected while a message is being sent, unloading the driver gets stuck for a long time.
The main problem is that unloading the driver waits for curr_msg to be sent or to become HOSED. After the tasklet is stopped, the only place left to trigger the timeout mechanism is the polling loop in shutdown_smi().
The poll function delays 10 us and calls smi_event_handler(smi_info, 10). smi_event_handler() then deducts 10 us from kcs->ibf_timeout.
But the poll call is followed by schedule_timeout_uninterruptible(1), and the time spent sleeping there is not counted against kcs->ibf_timeout.
So each time 10 us is deducted from kcs->ibf_timeout, at least one jiffy has actually passed, and the effective wait time is inflated by more than a hundredfold.
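To put rough numbers on that (an illustrative back-of-the-envelope estimate, assuming HZ=1000 so one jiffy is 1000 us): each loop iteration accounts 10 us against kcs->ibf_timeout but really takes at least 1010 us (10 us of polling plus a one-jiffy sleep), so the timeout expires roughly 100x slower than intended; with HZ=100 (a 10 ms jiffy) the inflation factor is about 1000x.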
Now, instead of calling poll(), call smi_event_handler() directly and pass in the actually elapsed time.
For verification, you can use eBPF to inspect kcs->ibf_timeout on each call to kcs_event() while IPMI is disconnected: it decrements at the normal rate before unloading, and the decrement rate becomes very slow during unloading.
$ bpftrace -e 'kprobe:kcs_event {printf("kcs->ibftimeout : %d\n", *(arg0+584));}'
Signed-off-by: Zhang Yuchen zhangyuchen.lcr@bytedance.com
Message-Id: 20221007092617.87597-3-zhangyuchen.lcr@bytedance.com
Signed-off-by: Corey Minyard cminyard@mvista.com
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 drivers/char/ipmi/ipmi_si_intf.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 17d706fdac76..c4a043cc4814 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -2187,6 +2187,20 @@ static int __init init_ipmi_si(void)
 }
 module_init(init_ipmi_si);
 
+static void wait_msg_processed(struct smi_info *smi_info)
+{
+	unsigned long jiffies_now;
+	long time_diff;
+
+	while (smi_info->curr_msg || (smi_info->si_state != SI_NORMAL)) {
+		jiffies_now = jiffies;
+		time_diff = (((long)jiffies_now - (long)smi_info->last_timeout_jiffies)
+			     * SI_USEC_PER_JIFFY);
+		smi_event_handler(smi_info, time_diff);
+		schedule_timeout_uninterruptible(1);
+	}
+}
+
 static void shutdown_smi(void *send_info)
 {
 	struct smi_info *smi_info = send_info;
@@ -2221,16 +2235,13 @@ static void shutdown_smi(void *send_info)
 	 * in the BMC. Note that timers and CPU interrupts are off,
 	 * so no need for locks.
 	 */
-	while (smi_info->curr_msg || (smi_info->si_state != SI_NORMAL)) {
-		poll(smi_info);
-		schedule_timeout_uninterruptible(1);
-	}
+	wait_msg_processed(smi_info);
+
 	if (smi_info->handlers)
 		disable_si_irq(smi_info);
-	while (smi_info->curr_msg || (smi_info->si_state != SI_NORMAL)) {
-		poll(smi_info);
-		schedule_timeout_uninterruptible(1);
-	}
+
+	wait_msg_processed(smi_info);
+
 	if (smi_info->handlers)
 		smi_info->handlers->cleanup(smi_info->si_sm);
From: Huaxin Lu luhuaxin1@huawei.com
stable inclusion
from stable-v4.19.270
commit c3572fb4002fdd36ebb9e707f8c397a0e2830c9e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 11220db412edae8dba58853238f53258268bdb88 upstream.
In restore_template_fmt, when kstrdup fails, a non-NULL value will still be returned, which causes a NULL pointer access in template_desc_init_fields.
Fixes: c7d09367702e ("ima: support restoring multiple template formats")
Cc: stable@kernel.org
Co-developed-by: Jiaming Li lijiaming30@huawei.com
Signed-off-by: Jiaming Li lijiaming30@huawei.com
Signed-off-by: Huaxin Lu luhuaxin1@huawei.com
Reviewed-by: Stefan Berger stefanb@linux.ibm.com
Signed-off-by: Mimi Zohar zohar@linux.ibm.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 security/integrity/ima/ima_template.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
index 13567e555130..ec814cbdae99 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -266,8 +266,11 @@ static struct ima_template_desc *restore_template_fmt(char *template_name)
 
 	template_desc->name = "";
 	template_desc->fmt = kstrdup(template_name, GFP_KERNEL);
-	if (!template_desc->fmt)
+	if (!template_desc->fmt) {
+		kfree(template_desc);
+		template_desc = NULL;
 		goto out;
+	}
 
 	spin_lock(&template_list);
 	list_add_tail_rcu(&template_desc->list, &defined_templates);
From: Dan Carpenter error27@gmail.com
stable inclusion
from stable-v4.19.270
commit 35ad87bfe330f7ef6a19f772223c63296d643172
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit a92ce570c81dc0feaeb12a429b4bc65686d17967 upstream.
The intf_free() function frees the "intf" pointer so we cannot dereference it again on the next line.
Fixes: cbb79863fc31 ("ipmi: Don't allow device module unload when in use")
Signed-off-by: Dan Carpenter error27@gmail.com
Message-Id: Y3M8xa1drZv4CToE@kili
Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Corey Minyard cminyard@mvista.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 drivers/char/ipmi/ipmi_msghandler.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 6d745cc41611..53758597c509 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -1219,6 +1219,7 @@ static void _ipmi_destroy_user(struct ipmi_user *user)
 	unsigned long flags;
 	struct cmd_rcvr *rcvr;
 	struct cmd_rcvr *rcvrs = NULL;
+	struct module *owner;
 
 	if (!acquire_ipmi_user(user, &i)) {
 		/*
@@ -1278,8 +1279,9 @@ static void _ipmi_destroy_user(struct ipmi_user *user)
 		kfree(rcvr);
 	}
 
+	owner = intf->owner;
 	kref_put(&intf->refcount, intf_free);
-	module_put(intf->owner);
+	module_put(owner);
 }
 
 int ipmi_destroy_user(struct ipmi_user *user)
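The underlying rule generalizes beyond IPMI. A minimal sketch with hypothetical types (not the kernel code): read any field you still need out of a refcounted object before the final kref_put(), because the release callback may free the object.

#include <linux/kref.h>
#include <linux/module.h>
#include <linux/slab.h>

struct my_intf {			/* hypothetical stand-in for ipmi's intf */
	struct kref refcount;
	struct module *owner;
};

static void my_intf_free(struct kref *kref)
{
	kfree(container_of(kref, struct my_intf, refcount));
}

static void my_intf_drop(struct my_intf *intf)
{
	struct module *owner = intf->owner;	/* read before the last put */

	kref_put(&intf->refcount, my_intf_free);	/* may free intf */
	module_put(owner);			/* no dereference of intf here */
}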
From: "Michael S. Tsirkin" mst@redhat.com
stable inclusion
from stable-v4.19.270
commit 643d77fda08d06f863af35e80a7e517ea61d9629
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 98b04dd0b4577894520493d96bc4623387767445 upstream.
pci_device_is_present() previously didn't work for VFs because it reads the Vendor and Device IDs, which are 0xffff for VFs, making them look absent. Check the PF instead.
Wei Gong reported that if virtio I/O is in progress when the driver is unbound or "0" is written to /sys/.../sriov_numvfs, the virtio I/O operation hangs, which may result in output like this:
task:bash state:D stack: 0 pid: 1773 ppid: 1241 flags:0x00004002
Call Trace:
 schedule+0x4f/0xc0
 blk_mq_freeze_queue_wait+0x69/0xa0
 blk_mq_freeze_queue+0x1b/0x20
 blk_cleanup_queue+0x3d/0xd0
 virtblk_remove+0x3c/0xb0 [virtio_blk]
 virtio_dev_remove+0x4b/0x80
 ...
 device_unregister+0x1b/0x60
 unregister_virtio_device+0x18/0x30
 virtio_pci_remove+0x41/0x80
 pci_device_remove+0x3e/0xb0
This happened because pci_device_is_present(VF) returned "false" in virtio_pci_remove(), so it called virtio_break_device(). The broken vq meant that vring_interrupt() skipped the vq.callback() that would have completed the virtio I/O operation via virtblk_done().
[bhelgaas: commit log, simplify to always use pci_physfn(), add stable tag]
Link: https://lore.kernel.org/r/20221026060912.173250-1-mst@redhat.com
Reported-by: Wei Gong gongwei833x@gmail.com
Tested-by: Wei Gong gongwei833x@gmail.com
Signed-off-by: Michael S. Tsirkin mst@redhat.com
Signed-off-by: Bjorn Helgaas bhelgaas@google.com
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 drivers/pci/pci.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2020452d1f17..e58fe13c99e1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5898,6 +5898,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
 {
 	u32 v;
 
+	/* Check PF if pdev is a VF, since VF Vendor/Device IDs are 0xffff */
+	pdev = pci_physfn(pdev);
 	if (pci_dev_is_disconnected(pdev))
 		return false;
 	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
From: Sascha Hauer s.hauer@pengutronix.de
stable inclusion
from stable-v4.19.270
commit 17e1b1800ce07d88219e7bff6b23dd35aa751681
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit aa382ffa705bea9931ec92b6f3c70e1fdb372195 upstream.
When pci_create_attr() fails, pci_remove_resource_files() is called, which iterates over the res_attr[_wc] arrays and frees every non-NULL entry. To avoid a double free here, set the array entry only after it's clear we successfully initialized it.
Fixes: b562ec8f74e4 ("PCI: Don't leak memory if sysfs_create_bin_file() fails")
Link: https://lore.kernel.org/r/20221007070735.GX986@pengutronix.de/
Signed-off-by: Sascha Hauer s.hauer@pengutronix.de
Signed-off-by: Bjorn Helgaas bhelgaas@google.com
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 drivers/pci/pci-sysfs.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 6ec8679bafe9..48c56cb08652 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1301,11 +1301,9 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
 
 	sysfs_bin_attr_init(res_attr);
 	if (write_combine) {
-		pdev->res_attr_wc[num] = res_attr;
 		sprintf(res_attr_name, "resource%d_wc", num);
 		res_attr->mmap = pci_mmap_resource_wc;
 	} else {
-		pdev->res_attr[num] = res_attr;
 		sprintf(res_attr_name, "resource%d", num);
 		if (pci_resource_flags(pdev, num) & IORESOURCE_IO) {
 			res_attr->read = pci_read_resource_io;
@@ -1321,10 +1319,17 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
 	res_attr->size = pci_resource_len(pdev, num);
 	res_attr->private = (void *)(unsigned long)num;
 	retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
-	if (retval)
+	if (retval) {
 		kfree(res_attr);
+		return retval;
+	}
 
-	return retval;
+	if (write_combine)
+		pdev->res_attr_wc[num] = res_attr;
+	else
+		pdev->res_attr[num] = res_attr;
+
+	return 0;
 }
 
 /**
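The idiom behind this fix is worth naming: fully initialize and register an object first, and only publish the pointer that cleanup code iterates over once registration has succeeded. A minimal hypothetical sketch (struct foo and register_foo() are illustrative, not a kernel API):

#include <linux/slab.h>

struct foo {
	int id;
};

int register_foo(struct foo *f);	/* illustrative; may fail */

static int create_and_publish(struct foo **slot)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);
	int err;

	if (!f)
		return -ENOMEM;

	err = register_foo(f);
	if (err) {
		kfree(f);	/* safe: *slot was never set, cleanup can't see f */
		return err;
	}

	*slot = f;		/* publish only after full success */
	return 0;
}

Publishing before registering, as the old pci_create_attr() did, lets an error-path walker free the object once while the local error handling frees it again.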
From: Wang Weiyang wangweiyang2@huawei.com
stable inclusion
from stable-v4.19.270
commit 697e55b94162721cfdfa7acd1be09427d2c47c80
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit e68bfbd3b3c3a0ec3cf8c230996ad8cabe90322f upstream.
When adding the 'a *:* rwm' entry to devcgroup A's whitelist, A's exceptions are first cleaned and A's behavior is changed to DEVCG_DEFAULT_ALLOW. Then the parent's exceptions are copied to A's whitelist. If the copy fails, the code just returns, leaving A granting permissions to all devices, possibly more permissions than its parent grants.
Back up A's whitelist and restore the original exceptions if the copy fails.
Cc: stable@vger.kernel.org
Fixes: 4cef7299b478 ("device_cgroup: add proper checking when changing default behavior")
Signed-off-by: Wang Weiyang wangweiyang2@huawei.com
Reviewed-by: Aristeu Rozanski aris@redhat.com
Signed-off-by: Paul Moore paul@paul-moore.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 security/device_cgroup.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index dc28914fa72e..5ff31eeea68c 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -79,6 +79,17 @@ static int dev_exceptions_copy(struct list_head *dest, struct list_head *orig)
 	return -ENOMEM;
 }
 
+static void dev_exceptions_move(struct list_head *dest, struct list_head *orig)
+{
+	struct dev_exception_item *ex, *tmp;
+
+	lockdep_assert_held(&devcgroup_mutex);
+
+	list_for_each_entry_safe(ex, tmp, orig, list) {
+		list_move_tail(&ex->list, dest);
+	}
+}
+
 /*
  * called under devcgroup_mutex
  */
@@ -600,11 +611,13 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 	int count, rc = 0;
 	struct dev_exception_item ex;
 	struct dev_cgroup *parent = css_to_devcgroup(devcgroup->css.parent);
+	struct dev_cgroup tmp_devcgrp;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
 	memset(&ex, 0, sizeof(ex));
+	memset(&tmp_devcgrp, 0, sizeof(tmp_devcgrp));
 	b = buffer;
 
 	switch (*b) {
@@ -616,15 +629,27 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 
 		if (!may_allow_all(parent))
 			return -EPERM;
-		dev_exception_clean(devcgroup);
-		devcgroup->behavior = DEVCG_DEFAULT_ALLOW;
-		if (!parent)
+		if (!parent) {
+			devcgroup->behavior = DEVCG_DEFAULT_ALLOW;
+			dev_exception_clean(devcgroup);
 			break;
+		}
 
+		INIT_LIST_HEAD(&tmp_devcgrp.exceptions);
+		rc = dev_exceptions_copy(&tmp_devcgrp.exceptions,
+					 &devcgroup->exceptions);
+		if (rc)
+			return rc;
+		dev_exception_clean(devcgroup);
 		rc = dev_exceptions_copy(&devcgroup->exceptions,
 					 &parent->exceptions);
-		if (rc)
+		if (rc) {
+			dev_exceptions_move(&devcgroup->exceptions,
+					    &tmp_devcgrp.exceptions);
 			return rc;
+		}
+		devcgroup->behavior = DEVCG_DEFAULT_ALLOW;
+		dev_exception_clean(&tmp_devcgrp);
 		break;
 	case DEVCG_DENY:
 		if (css_has_online_children(&devcgroup->css))
From: Volker Lendecke vl@samba.org
stable inclusion
from stable-v4.19.270
commit 707682dbab5b61d6b7a95b05491b476510aeeb64
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit a152d05ae4a71d802d50cf9177dba34e8bb09f68 upstream.
If smb311 posix is enabled, we send the intended mode for file creation in the posix create context. Instead of using what's there on the stack, create the mfsymlink file with 0644.
Fixes: ce558b0e17f8a ("smb3: Add posix create context for smb3.11 posix mounts")
Cc: stable@vger.kernel.org
Signed-off-by: Volker Lendecke vl@samba.org
Reviewed-by: Tom Talpey tom@talpey.com
Reviewed-by: Paulo Alcantara (SUSE) pc@cjr.nz
Signed-off-by: Steve French stfrench@microsoft.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/cifs/link.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/cifs/link.c b/fs/cifs/link.c
index 5b1c33d9283a..f590149e21ba 100644
--- a/fs/cifs/link.c
+++ b/fs/cifs/link.c
@@ -481,6 +481,7 @@ smb3_create_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
 	oparms.disposition = FILE_CREATE;
 	oparms.fid = &fid;
 	oparms.reconnect = false;
+	oparms.mode = 0644;
 
 	rc = SMB2_open(xid, &oparms, utf16_path, &oplock, NULL, NULL, NULL);
From: Christian Brauner brauner@kernel.org
stable inclusion
from stable-v4.19.270
commit 7f57df69de7f05302fad584eb8e3f34de39e0311
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 11933cf1d91d57da9e5c53822a540bbdc2656c16 upstream.
The propagate_mnt() function handles mount propagation when creating mounts and propagates the source mount tree @source_mnt to all applicable nodes of the destination propagation mount tree headed by @dest_mnt.
Unfortunately it contains a bug where it fails to terminate at peers of @source_mnt when looking up copies of the source mount that become masters for copies of the source mount tree mounted on top of slaves in the destination propagation tree causing a NULL dereference.
Once the mechanics of the bug are understood it's easy to trigger. Because of unprivileged user namespaces it is available to unprivileged users.
While fixing this bug we've gotten confused multiple times due to unclear terminology or missing concepts. So let's start this with some clarifications:
* The terms "master" or "peer" denote a shared mount. A shared mount belongs to a peer group.
* A peer group is a set of shared mounts that propagate to each other. They are identified by a peer group id. The peer group id is available in @shared_mnt->mnt_group_id. Shared mounts within the same peer group have the same peer group id. The peers in a peer group can be reached via @shared_mnt->mnt_share.
* The terms "slave mount" or "dependent mount" denote a mount that receives propagation from a peer in a peer group. IOW, shared mounts may have slave mounts and slave mounts have shared mounts as their master. Slave mounts of a given peer in a peer group are listed on that peers slave list available at @shared_mnt->mnt_slave_list.
* The term "master mount" denotes a mount in a peer group. IOW, it denotes a shared mount or a peer mount in a peer group. The term "master mount" - or "master" for short - is mostly used when talking in the context of slave mounts that receive propagation from a master mount. A master mount of a slave identifies the closest peer group a slave mount receives propagation from. The master mount of a slave can be identified via @slave_mount->mnt_master. Different slaves may point to different masters in the same peer group.
* Multiple peers in a peer group can have non-empty ->mnt_slave_lists. Non-empty ->mnt_slave_lists of peers don't intersect. Consequently, to ensure all slave mounts of a peer group are visited the ->mnt_slave_lists of all peers in a peer group have to be walked.
* Slave mounts point to a peer in the closest peer group they receive propagation from via @slave_mnt->mnt_master (see above). Together with these peers they form a propagation group (see below). The closest peer group can thus be identified through the peer group id @slave_mnt->mnt_master->mnt_group_id of the peer/master that a slave mount receives propagation from.
* A shared-slave mount is a slave mount to a peer group pg1 while also a peer in another peer group pg2. IOW, a peer group may receive propagation from another peer group.
If a peer group pg1 is a slave to another peer group pg2 then all peers in peer group pg1 point to the same peer in peer group pg2 via ->mnt_master. IOW, all peers in peer group pg1 appear on the same ->mnt_slave_list. IOW, they cannot be slaves to different peer groups.
* A pure slave mount is a slave mount that is a slave to a peer group but is not a peer in another peer group.
* A propagation group denotes the set of mounts consisting of a single peer group pg1 and all slave mounts and shared-slave mounts that point to a peer in that peer group via ->mnt_master. IOW, all slave mounts such that @slave_mnt->mnt_master->mnt_group_id is equal to @shared_mnt->mnt_group_id.
The concept of a propagation group makes it easier to talk about a single propagation level in a propagation tree.
For example, in propagate_mnt() the immediate peers of @dest_mnt and all slaves of @dest_mnt's peer group form a propagation group propg1. So a shared-slave mount that is a slave in propg1 and that is a peer in another peer group pg2 forms another propagation group propg2 together with all slaves that point to that shared-slave mount in their ->mnt_master.
* A propagation tree refers to all mounts that receive propagation starting from a specific shared mount.
For example, for propagate_mnt() @dest_mnt is the start of a propagation tree. The propagation tree encompasses all mounts that receive propagation from @dest_mnt's peer group down to the leafs.
With that out of the way let's get to the actual algorithm.
We know that @dest_mnt is guaranteed to be a pure shared mount or a shared-slave mount. This is guaranteed by a check in attach_recursive_mnt(). So propagate_mnt() will first propagate the source mount tree to all peers in @dest_mnt's peer group:
	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
		ret = propagate_one(n);
		if (ret)
			goto out;
	}
Notice, that the peer propagation loop of propagate_mnt() doesn't propagate @dest_mnt itself. @dest_mnt is mounted directly in attach_recursive_mnt() after we propagated to the destination propagation tree.
The mount that will be mounted on top of @dest_mnt is @source_mnt. This copy was created earlier even before we entered attach_recursive_mnt() and doesn't concern us a lot here.
It's just important to notice that when propagate_mnt() is called @source_mnt will not yet have been mounted on top of @dest_mnt. Thus, @source_mnt->mnt_parent will either still point to @source_mnt or - in the case @source_mnt is moved and thus already attached - still to its former parent.
For each peer @m in @dest_mnt's peer group propagate_one() will create a new copy of the source mount tree and mount that copy @child on @m such that @child->mnt_parent points to @m after propagate_one() returns.
propagate_one() will stash the last destination propagation node @m in @last_dest and the last copy it created for the source mount tree in @last_source.
Hence, if we call into propagate_one() again for the next destination propagation node @m, @last_dest will point to the previous destination propagation node and @last_source will point to the previous copy of the source mount tree and mounted on @last_dest.
Each new copy of the source mount tree is created from the previous copy of the source mount tree. This will become important later.
The peer loop in propagate_mnt() is straightforward. We iterate through the peers copying and updating @last_source and @last_dest as we go through them and mount each copy of the source mount tree @child on a peer @m in @dest_mnt's peer group.
After propagate_mnt() handled the peers in @dest_mnt's peer group propagate_mnt() will propagate the source mount tree down the propagation tree that @dest_mnt's peer group propagates to:
	for (m = next_group(dest_mnt, dest_mnt); m;
	     m = next_group(m, dest_mnt)) {
		/* everything in that slave group */
		n = m;
		do {
			ret = propagate_one(n);
			if (ret)
				goto out;
			n = next_peer(n);
		} while (n != m);
	}
The next_group() helper will recursively walk the destination propagation tree, descending into each propagation group of the propagation tree.
The important part is that it takes care to propagate the source mount tree to all peers in the peer group of a propagation group before it propagates to the slaves to those peers in the propagation group. IOW, it creates and mounts copies of the source mount tree that become masters before it creates and mounts copies of the source mount tree that become slaves to these masters.
It is important to remember that propagating the source mount tree to each mount @m in the destination propagation tree simply means that we create and mount new copies @child of the source mount tree on @m such that @child->mnt_parent points to @m.
Since we know that each node @m in the destination propagation tree headed by @dest_mnt's peer group will be overmounted with a copy of the source mount tree, and since we know that the propagation properties of each copy of the source mount tree we create and mount at @m will mostly mirror the propagation properties of @m, we can use that information to create and mount the copies of the source mount tree that become masters before their slaves.
The easy case is always when @m and @last_dest are peers in a peer group of a given propagation group. In that case we know that we can simply copy @last_source without having to figure out what the master for the new copy @child of the source mount tree needs to be as we've done that in a previous call to propagate_one().
The hard case is when we're dealing with a slave mount or a shared-slave mount @m in a destination propagation group that we need to create and mount a copy of the source mount tree on.
For each propagation group in the destination propagation tree we propagate the source mount tree to, we want to make sure that the copies @child of the source mount tree we create and mount on slaves @m pick an earlier copy of the source mount tree that we mounted on a master @m of the destination propagation group as their master. This is a mouthful but as far as we can tell that's the core of it all.
But, if we keep track of the masters in the destination propagation tree @m we can use the information to find the correct master for each copy of the source mount tree we create and mount at the slaves in the destination propagation tree @m.
Let's walk through the base case as that's still fairly easy to grasp.
If we're dealing with the first slave in the propagation group that @dest_mnt is in then we don't yet have marked any masters in the destination propagation tree.
We know the master for the first slave to @dest_mnt's peer group is simply @dest_mnt. So we expect this algorithm to yield a copy of the source mount tree that was mounted on a peer in @dest_mnt's peer group as the master for the copy of the source mount tree we want to mount at the first slave @m:
	for (n = m; ; n = p) {
		p = n->mnt_master;
		if (p == dest_master || IS_MNT_MARKED(p))
			break;
	}
For the first slave we walk the destination propagation tree all the way up to a peer in @dest_mnt's peer group. IOW, the propagation hierarchy can be walked by walking up the @mnt->mnt_master hierarchy of the destination propagation tree @m. We will ultimately find a peer in @dest_mnt's peer group and thus ultimately @dest_mnt->mnt_master.
Btw, here the assumption we listed at the beginning becomes important. Namely, that peers in a peer group pg1 that are slaves in another peer group pg2 appear on the same ->mnt_slave_list. IOW, all slaves who are peers in peer group pg1 point to the same peer in peer group pg2 via their ->mnt_master. Otherwise the termination condition in the code above would be wrong and next_group() would be broken too.
So the first iteration sets:
	n = m;
	p = n->mnt_master;
such that @p now points to a peer or @dest_mnt itself. We walk up one more level since we don't have any marked mounts. So we end up with:
	n = dest_mnt;
	p = dest_mnt->mnt_master;
If @dest_mnt's peer group is not slave to another peer group then @p is now NULL. If @dest_mnt's peer group is a slave to another peer group then @p now points to @dest_mnt->mnt_master, which is a master outside the propagation tree we're dealing with.
Now we need to figure out the master for the copy of the source mount tree we're about to create and mount on the first slave of @dest_mnt's peer group:
	do {
		struct mount *parent = last_source->mnt_parent;
		if (last_source == first_source)
			break;
		done = parent->mnt_master == p;
		if (done && peers(n, parent))
			break;
		last_source = last_source->mnt_master;
	} while (!done);
We know that @last_source->mnt_parent points to @last_dest and @last_dest is the last peer in @dest_mnt's peer group we propagated to in the peer loop in propagate_mnt().
Consequently, @last_source is the last copy we created and mounted on that last peer in @dest_mnt's peer group. So @last_source is the master we want to pick.
We know that @last_source->mnt_parent->mnt_master points to @last_dest->mnt_master. We also know that @last_dest->mnt_master is either NULL or points to a master outside of the destination propagation tree and so does @p. Hence:
done = parent->mnt_master == p;
is trivially true in the base condition.
We also know that for the first slave mount of @dest_mnt's peer group, @last_dest either points to @dest_mnt itself because it was initialized to:
last_dest = dest_mnt;
at the beginning of propagate_mnt() or it will point to a peer of @dest_mnt in its peer group. In both cases it is guaranteed that on the first iteration @n and @parent are peers (Please note the check for peers here as that's important.):
	if (done && peers(n, parent))
		break;
So, as we expected, we select @last_source, which refers to the last copy of the source mount tree we mounted on the last peer in @dest_mnt's peer group, as the master of the first slave in @dest_mnt's peer group. The rest is taken care of by clone_mnt(last_source, ...). We'll skip over that part otherwise this becomes a blogpost.
At the end of propagate_mnt() we now mark @m->mnt_master as the first master in the destination propagation tree that is distinct from @dest_mnt->mnt_master. IOW, we mark @dest_mnt itself as a master.
By marking @dest_mnt or one of its peers we are able to easily find it again when we later look up masters for the copies of the source mount tree we mount on slaves @m to @dest_mnt's peer group. This, in turn, allows us to find the master we selected for the copies of the source mount tree we mounted on masters in the destination propagation tree.
The important part is to realize that the code makes use of the fact that the last copy of the source mount tree stashed in @last_source was mounted on top of the previous destination propagation node @last_dest. What this means is that @last_source allows us to walk the destination propagation hierarchy the same way each destination propagation node @m does.
If we take @last_source, which is the copy of @source_mnt we have mounted on @last_dest in the previous iteration of propagate_one(), then we know @last_source->mnt_parent points to @last_dest but we also know that as we walk through the destination propagation tree that @last_source->mnt_master will point to an earlier copy of the source mount tree we mounted on an earlier destination propagation node @m.
IOW, @last_source->mnt_parent will be our hook into the destination propagation tree and each consecutive @last_source->mnt_master will lead us to an earlier propagation node @m via @last_source->mnt_master->mnt_parent.
Hence, by walking up @last_source->mnt_master, each of which is mounted on a node that is a master @m in the destination propagation tree we can also walk up the destination propagation hierarchy.
So, for each new destination propagation node @m we use the previous copy of @last_source and the fact it's mounted on the previous propagation node @last_dest via @last_source->mnt_master->mnt_parent to determine what the master of the new copy of @last_source needs to be.
The goal is to find the _closest_ master that the new copy of the source mount tree we are about to create and mount on a slave @m in the destination propagation tree needs to pick. IOW, we want to find a suitable master in the propagation group.
As the propagation structure of the source mount propagation tree we create mirrors the propagation structure of the destination propagation tree, we can find @m's closest master - i.e., a marked master - which is a peer in the closest peer group that @m receives propagation from. We store that closest master of @m in @p as before and record the slave to that master in @n.
We then search for this master @p via @last_source by walking up the master hierarchy starting from the last copy of the source mount tree stored in @last_source that we created and mounted on the previous destination propagation node @m.
We will try to find the master by walking @last_source->mnt_master and by comparing @last_source->mnt_master->mnt_parent->mnt_master to @p. If we find @p then we can figure out what earlier copy of the source mount tree needs to be the master for the new copy of the source mount tree we're about to create and mount at the current destination propagation node @m.
If @last_source->mnt_master->mnt_parent and @n are peers then we know that the closest master they receive propagation from is @last_source->mnt_master->mnt_parent->mnt_master. If not then the closest immediate peer group that they receive propagation from must be one level higher up.
This builds on the earlier clarification at the beginning that all peers in a peer group which are slaves of other peer groups all point to the same ->mnt_master, i.e., appear on the same ->mnt_slave_list, of the closest peer group that they receive propagation from.
However, terminating the walk has corner cases.
If the closest marked master for a given destination node @m cannot be found by walking up the master hierarchy via @last_source->mnt_master then we need to terminate the walk when we encounter @source_mnt again.
This isn't an arbitrary termination. It simply means that the new copy of the source mount tree we're about to create has a copy of the source mount tree we created and mounted on a peer in @dest_mnt's peer group as its master. IOW, @source_mnt is the peer in the closest peer group that the new copy of the source mount tree receives propagation from.
We absolutely have to stop at @source_mnt because @last_source->mnt_master either points outside the propagation hierarchy we're dealing with or it is NULL because @source_mnt isn't a shared-slave.
So continuing the walk past @source_mnt would cause a NULL dereference via @last_source->mnt_master->mnt_parent. And so we have to stop the walk when we encounter @source_mnt again.
One scenario where this can happen is when we first handled a series of slaves of @dest_mnt's peer group and then encounter peers in a new peer group that is a slave to @dest_mnt's peer group. We handle them and then we encounter another slave mount to @dest_mnt that is a pure slave to @dest_mnt's peer group. That pure slave will have a peer in @dest_mnt's peer group as its master. Consequently, the new copy of the source mount tree will need to have @source_mnt as its master. So we walk the propagation hierarchy all the way up to @source_mnt based on @last_source->mnt_master.
So terminate on @source_mnt, easy peasy. Except, that the check misses something that the rest of the algorithm already handles.
If @dest_mnt has peers in its peer group, the peer loop in propagate_mnt():
	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
		ret = propagate_one(n);
		if (ret)
			goto out;
	}
will consecutively update @last_source with each previous copy of the source mount tree we created and mounted at the previous peer in @dest_mnt's peer group. So after that loop terminates @last_source will point to whatever copy of the source mount tree was created and mounted on the last peer in @dest_mnt's peer group.
Furthermore, if there is even a single additional peer in @dest_mnt's peer group then @last_source will __not__ point to @source_mnt anymore. Because, as we mentioned above, @dest_mnt isn't even handled in this loop but directly in attach_recursive_mnt(). So it can't even accidentally come last in that peer loop.
So the first time we handle a slave mount @m of @dest_mnt's peer group the copy of the source mount tree we create will make the __last copy of the source mount tree we created and mounted on the last peer in @dest_mnt's peer group the master of the new copy of the source mount tree we create and mount on the first slave of @dest_mnt's peer group__.
But this means that the termination condition that checks for @source_mnt is wrong. The @source_mnt cannot be found anymore by propagate_one(). Instead it will find the last copy of the source mount tree we created and mounted for the last peer of @dest_mnt's peer group again. And that is a peer of @source_mnt not @source_mnt itself.
IOW, we fail to terminate the loop correctly and ultimately dereference @last_source->mnt_master->mnt_parent. When @source_mnt's peer group isn't slave to another peer group then @last_source->mnt_master is NULL causing the splat below.
For example, assume @dest_mnt is a pure shared mount and has three peers in its peer group:
===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
(@dest_mnt) mnt_master[216]               309        297               shared:216
 \
  (@source_mnt) mnt_master[218]:          609        609               shared:218

(1) mnt_master[216]:                      607        605               shared:216
 \
  (P1) mnt_master[218]:                   624        607               shared:218

(2) mnt_master[216]:                      576        574               shared:216
 \
  (P2) mnt_master[218]:                   625        576               shared:218

(3) mnt_master[216]:                      545        543               shared:216
 \
  (P3) mnt_master[218]:                   626        545               shared:218
After this sequence has been processed @last_source will point to (P3), the copy generated for the third peer in @dest_mnt's peer group we handled. So the copy of the source mount tree (P4) we create and mount on the first slave of @dest_mnt's peer group:
===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
 mnt_master[216]                          309        297               shared:216
  /
 /
(S0) mnt_slave                            483        481               master:216
 \
  \
   (P3) mnt_master[218]                   626        545               shared:218
    \
    /
   (P4) mnt_slave                         627        483               master:218
will pick the last copy of the source mount tree (P3) as master, not (S0).
When walking the propagation hierarchy via @last_source's master hierarchy we encounter (P3) but not (S0), i.e., @source_mnt.
We can fix this in multiple ways:
(1) By setting @last_source to @source_mnt after we processed the peers in @dest_mnt's peer group right after the peer loop in propagate_mnt().
(2) By changing the termination condition that relies on finding exactly @source_mnt to finding a peer of @source_mnt.
(3) By only moving @last_source when we actually venture into a new peer group or some clever variant thereof.
The first two options are minimally invasive and what we want as a fix. The third option is more intrusive but something we'd like to explore in the near future.
This passes all LTP tests and specifically the mount propagation testsuite part of it. It also holds up against all known reproducers of this issue.
Final words. First, this is a clever but __worryingly__ underdocumented algorithm. There isn't a single detailed comment to be found in next_group(), propagate_one() or anywhere else in that file for that matter. This has been a giant pain to understand and work through and a bug like this is insanely difficult to fix without a detailed understanding of what's happening. Let's not talk about the amount of time that was sunk into fixing this.
Second, all the cool kids with access to unshare --mount --user --map-root --propagation=unchanged are going to have a lot of fun. IOW, triggerable by unprivileged users while namespace_lock() lock is held.
[ 115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 115.848967] #PF: supervisor read access in kernel mode
[ 115.849386] #PF: error_code(0x0000) - not-present page
[ 115.849803] PGD 0 P4D 0
[ 115.850012] Oops: 0000 [#1] PREEMPT SMP PTI
[ 115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3
[ 115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[ 115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10 49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01 00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37 02 4d
[ 115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[ 115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[ 115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[ 115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[ 115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[ 115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[ 115.856859] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000) knlGS:0000000000000000
[ 115.857531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[ 115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 115.860099] Call Trace:
[ 115.860358]  <TASK>
[ 115.860535]  propagate_mnt+0x14d/0x190
[ 115.860848]  attach_recursive_mnt+0x274/0x3e0
[ 115.861212]  path_mount+0x8c8/0xa60
[ 115.861503]  __x64_sys_mount+0xf6/0x140
[ 115.861819]  do_syscall_64+0x5b/0x80
[ 115.862117]  ? do_faccessat+0x123/0x250
[ 115.862435]  ? syscall_exit_to_user_mode+0x17/0x40
[ 115.862826]  ? do_syscall_64+0x67/0x80
[ 115.863133]  ? syscall_exit_to_user_mode+0x17/0x40
[ 115.863527]  ? do_syscall_64+0x67/0x80
[ 115.863835]  ? do_syscall_64+0x67/0x80
[ 115.864144]  ? do_syscall_64+0x67/0x80
[ 115.864452]  ? exc_page_fault+0x70/0x170
[ 115.864775]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 115.865187] RIP: 0033:0x7f92c92b0ebe
[ 115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89 01 48
[ 115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe
[ 115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620
[ 115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620
[ 115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076
[ 115.870581]  </TASK>
[ 115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0 sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[ 115.875288] CR2: 0000000000000010
[ 115.875641] ---[ end trace 0000000000000000 ]---
[ 115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[ 115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10 49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01 00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37 02 4d
[ 115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[ 115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[ 115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[ 115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[ 115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[ 115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[ 115.881548] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000) knlGS:0000000000000000
[ 115.882234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[ 115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Fixes: f2ebb3a921c1 ("smarter propagate_mnt()")
Fixes: 5ec0811d3037 ("propogate_mnt: Handle the first propogated copy being a slave")
Cc: stable@vger.kernel.org
Reported-by: Ditang Chen ditang.c@gmail.com
Signed-off-by: Seth Forshee (Digital Ocean) sforshee@kernel.org
Signed-off-by: Christian Brauner (Microsoft) brauner@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/pnode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pnode.c b/fs/pnode.c
index 7910ae91f17e..d27b7b97c4c1 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -245,7 +245,7 @@ static int propagate_one(struct mount *m)
 	}
 	do {
 		struct mount *parent = last_source->mnt_parent;
-		if (last_source == first_source)
+		if (peers(last_source, first_source))
 			break;
 		done = parent->mnt_master == p;
 		if (done && peers(n, parent))
From: Zhang Tianci zhangtianci.1997@bytedance.com
stable inclusion
from stable-v4.19.270
commit 936a357a97c710b95fe9164d5e9aca9f156a0dc1
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 5b0db51215e895a361bc63132caa7cca36a53d6a upstream.
There is a wrong case of link() on overlay:

  $ mkdir /lower /fuse /merge
  $ mount -t fuse /fuse
  $ mkdir /fuse/upper /fuse/work
  $ mount -t overlay /merge -o lowerdir=/lower,upperdir=/fuse/upper,\
    workdir=work
  $ touch /merge/file
  $ chown bin.bin /merge/file    // the file's caller becomes "bin"
  $ ln /merge/file /merge/lnkfile
Then we will get an error (EACCES) because the fuse daemon checks that the link()'s caller is "bin" and denies this request.
In the changing history of ovl_link(), there are two key commits:
The first is commit bb0d2b8ad296 ("ovl: fix sgid on directory"), which overrides the cred's fsuid/fsgid using the new inode. The new inode's owner is initialized by inode_init_owner(), and its i_uid is assigned from the current user. So the override fsuid becomes the current user's. We know link() actually modifies the directory, so the caller must have MAY_WRITE permission on the directory, and the current caller should have this permission. Using the caller's fsuid is therefore acceptable here.
The second is commit 51f7e52dc943 ("ovl: share inode for hard link"), which removed the inode creation from ovl_link(). This commit moved inode_init_owner() into ovl_create_object(), so ovl_link() just passes the old inode to ovl_create_or_link(). The override fsuid then becomes the old inode's fsuid, which is neither the caller nor the overlay's mounter! So this is incorrect.
Fix this bug by using the ovl mounter's fsuid and fsgid to do the underlying fs's link().
Link: https://lore.kernel.org/all/20220817102952.xnvesg3a7rbv576x@wittgenstein/T
Link: https://lore.kernel.org/lkml/20220825130552.29587-1-zhangtianci.1997@bytedan...
Signed-off-by: Zhang Tianci zhangtianci.1997@bytedance.com
Signed-off-by: Jiachen Zhang zhangjiachen.jaycee@bytedance.com
Reviewed-by: Christian Brauner (Microsoft) brauner@kernel.org
Fixes: 51f7e52dc943 ("ovl: share inode for hard link")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Miklos Szeredi mszeredi@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/overlayfs/dir.c | 46 ++++++++++++++++++++++++++++++----------------
 1 file changed, 30 insertions(+), 16 deletions(-)
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 8ee7b77effcd..8570e755a392 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -563,28 +563,42 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
 		goto out_revert_creds;
 	}
 
-	err = -ENOMEM;
-	override_cred = prepare_creds();
-	if (override_cred) {
+	if (!attr->hardlink) {
+		err = -ENOMEM;
+		override_cred = prepare_creds();
+		if (!override_cred)
+			goto out_revert_creds;
+		/*
+		 * In the creation cases(create, mkdir, mknod, symlink),
+		 * ovl should transfer current's fs{u,g}id to underlying
+		 * fs. Because underlying fs want to initialize its new
+		 * inode owner using current's fs{u,g}id. And in this
+		 * case, the @inode is a new inode that is initialized
+		 * in inode_init_owner() to current's fs{u,g}id. So use
+		 * the inode's i_{u,g}id to override the cred's fs{u,g}id.
+		 *
+		 * But in the other hardlink case, ovl_link() does not
+		 * create a new inode, so just use the ovl mounter's
+		 * fs{u,g}id.
+		 */
 		override_cred->fsuid = inode->i_uid;
 		override_cred->fsgid = inode->i_gid;
-		if (!attr->hardlink) {
-			err = security_dentry_create_files_as(dentry,
-					attr->mode, &dentry->d_name, old_cred,
-					override_cred);
-			if (err) {
-				put_cred(override_cred);
-				goto out_revert_creds;
-			}
+		err = security_dentry_create_files_as(dentry,
+				attr->mode, &dentry->d_name, old_cred,
+				override_cred);
+		if (err) {
+			put_cred(override_cred);
+			goto out_revert_creds;
 		}
 		put_cred(override_creds(override_cred));
 		put_cred(override_cred);
-
-		if (!ovl_dentry_is_whiteout(dentry))
-			err = ovl_create_upper(dentry, inode, attr);
-		else
-			err = ovl_create_over_whiteout(dentry, inode, attr);
 	}
+
+	if (!ovl_dentry_is_whiteout(dentry))
+		err = ovl_create_upper(dentry, inode, attr);
+	else
+		err = ovl_create_over_whiteout(dentry, inode, attr);
+
 out_revert_creds:
 	revert_creds(old_cred);
 	return err;
From: minoura makoto minoura@valinux.co.jp
stable inclusion
from stable-v4.19.270
commit 4916a52341b7c0ab016c213b11d0104d7f54a2c6
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit b18cba09e374637a0a3759d856a6bca94c133952 ]
Commit 9130b8dbc6ac ("SUNRPC: allow for upcalls for the same uid but different gss service") introduced `auth` argument to __gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL since it (and auth->service) was not (yet) determined.
When multiple upcalls with the same uid and different service are ongoing, it could happen that __gss_find_upcall(), which returns the first match found in the pipe->in_downcall list, could not find the correct gss_msg corresponding to the downcall we are looking for. Moreover, it might return a msg which is not sent to rpc.gssd yet.
We could see a mount.nfs process hung in D state when multiple mount.nfs commands were executed in parallel. The call trace below is from the CentOS 7.9 kernel-3.10.0-1160.24.1.el7.x86_64, but we observed the same hang with the elrepo kernel-ml-6.0.7-1.el7.
PID: 71258 TASK: ffff91ebd4be0000 CPU: 36 COMMAND: "mount.nfs"
 #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
 #1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
 #2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
 #3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc [sunrpc]
 #4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
 #5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
 #6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
 #7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
 #8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
 #9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]
The scenario is like this. Let's say there are two upcalls for services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.
When rpc.gssd reads the pipe to get the upcall msg corresponding to service B from pipe->pipe and then writes the response, gss_pipe_downcall() picks the msg corresponding to service A, because only the uid is used to find the msg and it comes before the one for B in pipe->in_downcall. The process waiting for the msg corresponding to service A is then woken up.
Actual scheduling of that process might happen after rpc.gssd processes the next msg. In rpc_pipe_generic_upcall() it clears msg->errno (for A). The process is then scheduled to see gss_msg->ctx == NULL and gss_msg->msg.errno == 0, therefore it cannot break the loop in gss_create_upcall() and is never woken up after that.
This patch adds a simple check to ensure that a msg which is not sent to rpc.gssd yet is not chosen as the matching upcall upon receiving a downcall.
Signed-off-by: minoura makoto minoura@valinux.co.jp
Signed-off-by: Hiroshi Shimamoto h-shimamoto@nec.com
Tested-by: Hiroshi Shimamoto h-shimamoto@nec.com
Cc: Trond Myklebust trondmy@hammerspace.com
Fixes: 9130b8dbc6ac ("SUNRPC: allow for upcalls for same uid but different gss service")
Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 include/linux/sunrpc/rpc_pipe_fs.h |  5 +++++
 net/sunrpc/auth_gss/auth_gss.c     | 19 +++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/include/linux/sunrpc/rpc_pipe_fs.h b/include/linux/sunrpc/rpc_pipe_fs.h
index e90b9bd99ded..396de2ef8767 100644
--- a/include/linux/sunrpc/rpc_pipe_fs.h
+++ b/include/linux/sunrpc/rpc_pipe_fs.h
@@ -94,6 +94,11 @@ extern ssize_t rpc_pipe_generic_upcall(struct file *, struct rpc_pipe_msg *,
 				       char __user *, size_t);
 extern int rpc_queue_upcall(struct rpc_pipe *, struct rpc_pipe_msg *);
 
+/* returns true if the msg is in-flight, i.e., already eaten by the peer */
+static inline bool rpc_msg_is_inflight(const struct rpc_pipe_msg *msg) {
+	return (msg->copied != 0 && list_empty(&msg->list));
+}
+
 struct rpc_clnt;
 extern struct dentry *rpc_create_client_dir(struct dentry *, const char *, struct rpc_clnt *);
 extern int rpc_remove_client_dir(struct rpc_clnt *);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index cdc8a74dcaf8..ae87b9302e80 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -323,7 +323,7 @@ __gss_find_upcall(struct rpc_pipe *pipe, kuid_t uid, const struct gss_auth *auth
 	list_for_each_entry(pos, &pipe->in_downcall, list) {
 		if (!uid_eq(pos->uid, uid))
 			continue;
-		if (auth && pos->auth->service != auth->service)
+		if (pos->auth->service != auth->service)
 			continue;
 		refcount_inc(&pos->count);
 		dprintk("RPC: %s found msg %p\n", __func__, pos);
@@ -677,6 +677,21 @@ gss_create_upcall(struct gss_auth *gss_auth, struct gss_cred *gss_cred)
 	return err;
 }
 
+static struct gss_upcall_msg *
+gss_find_downcall(struct rpc_pipe *pipe, kuid_t uid)
+{
+	struct gss_upcall_msg *pos;
+	list_for_each_entry(pos, &pipe->in_downcall, list) {
+		if (!uid_eq(pos->uid, uid))
+			continue;
+		if (!rpc_msg_is_inflight(&pos->msg))
+			continue;
+		refcount_inc(&pos->count);
+		return pos;
+	}
+	return NULL;
+}
+
 #define MSG_BUF_MAXSIZE 1024
 
 static ssize_t
@@ -723,7 +738,7 @@ gss_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
 	err = -ENOENT;
 	/* Find a matching upcall */
 	spin_lock(&pipe->lock);
-	gss_msg = __gss_find_upcall(pipe, uid, NULL);
+	gss_msg = gss_find_downcall(pipe, uid);
 	if (gss_msg == NULL) {
 		spin_unlock(&pipe->lock);
 		goto err_put_ctx;
From: Jakub Kicinski kuba@kernel.org
stable inclusion
from stable-v4.19.270
commit 31f7a52168c67e70a521d7acb8b0c8b6c95e7abd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 54c3f1a81421f85e60ae2eaae7be3727a09916ee ]
Anand hit a BUG() when pulling off headers on egress to a SW tunnel. We get to skb_checksum_help() with an invalid checksum offset (commit d7ea0d9df2a6 ("net: remove two BUG() from skb_checksum_help()") converted those BUGs to WARN_ONs()). He points out oddness in how skb_postpull_rcsum() gets used. Indeed it looks like we should pull before "postpull"; otherwise the CHECKSUM_PARTIAL fixup from skb_postpull_rcsum() will not be able to do its job:

	if (skb->ip_summed == CHECKSUM_PARTIAL &&
	    skb_checksum_start_offset(skb) < 0)
		skb->ip_summed = CHECKSUM_NONE;
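Put together, the corrected helper reads as follows; this is a sketch of the end state assembled from the hunks below, shown whole for readability:

	static int bpf_skb_generic_pop(struct sk_buff *skb, u32 off, u32 len)
	{
		void *old_data;

		if (unlikely(!pskb_may_pull(skb, off + len)))
			return -ENOMEM;

		old_data = skb->data;	/* remember the pre-pull base */
		__skb_pull(skb, len);	/* pull first ... */
		skb_postpull_rcsum(skb, old_data + off, len);	/* ... then fix the csum */
		memmove(skb->data, old_data, off);	/* finally move the headers */

		return 0;
	}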
Reported-by: Anand Parthasarathy <anpartha@meta.com>
Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221220004701.402165-1-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/core/filter.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index e8111f5ee81e..c24c7cddeb8e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2565,15 +2565,18 @@ static int bpf_skb_generic_push(struct sk_buff *skb, u32 off, u32 len)
 
 static int bpf_skb_generic_pop(struct sk_buff *skb, u32 off, u32 len)
 {
+	void *old_data;
+
 	/* skb_ensure_writable() is not needed here, as we're
 	 * already working on an uncloned skb.
 	 */
 	if (unlikely(!pskb_may_pull(skb, off + len)))
 		return -ENOMEM;
 
-	skb_postpull_rcsum(skb, skb->data + off, len);
-	memmove(skb->data + len, skb->data, off);
+	old_data = skb->data;
 	__skb_pull(skb, len);
+	skb_postpull_rcsum(skb, old_data + off, len);
+	memmove(skb->data, old_data, off);
 
 	return 0;
 }
From: Mikulas Patocka <mpatocka@redhat.com>

stable inclusion
from stable-v4.19.270
commit b5be563b4356b3089b3245d024cae3f248ba7090
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 341097ee53573e06ab9fc675d96a052385b851fa upstream.
There's a crash in mempool_free when running the lvm test shell/lvchange-rebuild-raid.sh.
The reason for the crash is this:

* super_written calls atomic_dec_and_test(&mddev->pending_writes) and
  wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
  and bio_put(bio).
* so, the process that waited on sb_wait and that is woken up is racing
  with bio_put(bio).
* if the process wins the race, it calls bioset_exit before bio_put(bio)
  is executed.
* bio_put(bio) attempts to free a bio into a destroyed bio set - causing
  a crash in mempool_free.
We fix this bug by moving bio_put before atomic_dec_and_test.
We also move rdev_dec_pending before atomic_dec_and_test as suggested by Neil Brown.
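Concretely, super_written() ends up with the following ordering (a sketch of the end state assembled from the hunks below, error handling elided):

	static void super_written(struct bio *bio)
	{
		struct md_rdev *rdev = bio->bi_private;
		struct mddev *mddev = rdev->mddev;

		/* ... error handling elided ... */

		bio_put(bio);		/* last touch of the bio ... */

		rdev_dec_pending(rdev, mddev);

		/* ... before any waiter on sb_wait can run bioset_exit() */
		if (atomic_dec_and_test(&mddev->pending_writes))
			wake_up(&mddev->sb_wait);
	}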
The function md_end_flush has a similar bug - we must call bio_put before we decrement the number of in-progress bios.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 11557f0067 P4D 11557f0067 PUD 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: kdelayd flush_expired_bios [dm_delay]
RIP: 0010:mempool_free+0x47/0x80
Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
Call Trace:
 <TASK>
 clone_endio+0xf4/0x1c0 [dm_mod]
 clone_endio+0xf4/0x1c0 [dm_mod]
 __submit_bio+0x76/0x120
 submit_bio_noacct_nocheck+0xb6/0x2a0
 flush_expired_bios+0x28/0x2f [dm_delay]
 process_one_work+0x1b4/0x300
 worker_thread+0x45/0x3e0
 ? rescuer_thread+0x380/0x380
 kthread+0xc2/0x100
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/md/md.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a251f658fe5e..629d3f346128 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -418,13 +418,14 @@ static void md_end_flush(struct bio *bio)
 	struct md_rdev *rdev = bio->bi_private;
 	struct mddev *mddev = rdev->mddev;
 
+	bio_put(bio);
+
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
 		/* The pre-request flush has finished */
 		queue_work(md_wq, &mddev->flush_work);
 	}
-	bio_put(bio);
 }
 
 static void md_submit_flush_data(struct work_struct *ws);
@@ -828,10 +829,12 @@ static void super_written(struct bio *bio)
 	} else
 		clear_bit(LastDev, &rdev->flags);
 
+	bio_put(bio);
+
+	rdev_dec_pending(rdev, mddev);
+
 	if (atomic_dec_and_test(&mddev->pending_writes))
 		wake_up(&mddev->sb_wait);
-	rdev_dec_pending(rdev, mddev);
-	bio_put(bio);
 }
 
 void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
From: "Isaac J. Manjarres" isaacmanjarres@google.com
stable inclusion from stable-v4.19.270 commit 728c23ee14f01858632625556e51c2d1db4a414e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8 CVE: NA
--------------------------------
commit 27c0d217340e47ec995557f61423ef415afba987 upstream.
When a driver registers with a bus, it will attempt to match with every device on the bus through the __driver_attach() function. Currently, if the bus_type.match() function encounters an error that is not -EPROBE_DEFER, __driver_attach() will return a negative error code, which causes the driver registration logic to stop trying to match with the remaining devices on the bus.
This behavior is not correct; a failure while matching a driver to a device does not mean that the driver won't be able to match and bind with other devices on the bus. Update the logic in __driver_attach() to reflect this.
Fixes: 656b8035b0ee ("ARM: 8524/1: driver cohandle -EPROBE_DEFER from bus_type.match()")
Cc: stable@vger.kernel.org
Cc: Saravana Kannan <saravanak@google.com>
Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Link: https://lore.kernel.org/r/20220921001414.4046492-1-isaacmanjarres@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/base/dd.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 1e15973f17a1..0377c3c0f2d4 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -956,8 +956,12 @@ static int __driver_attach(struct device *dev, void *data)
 		 */
 		return 0;
 	} else if (ret < 0) {
-		dev_dbg(dev, "Bus failed to match device: %d", ret);
-		return ret;
+		dev_dbg(dev, "Bus failed to match device: %d\n", ret);
+		/*
+		 * Driver could not match with device, but may match with
+		 * another device on the bus.
+		 */
+		return 0;
 	} /* ret > 0 means positive match */
 
 	device_driver_attach(drv, dev);
From: Paolo Abeni <pabeni@redhat.com>

stable inclusion
from stable-v4.19.270
commit 755193f2523ce5157c2f844a4b6d16b95593f830
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 2c02d41d71f90a5168391b6a5f2954112ba2307c upstream.
When an ULP-enabled socket enters the LISTEN status, the listener ULP data pointer is copied inside the child/accepted sockets by sk_clone_lock().
The relevant ULP can take care of de-duplicating the context pointer via the clone() operation, but only MPTCP and SMC implement such an op.

Other ULPs may end up with a double-free at socket disposal time.
We can't simply clear the ULP data at clone time, as TLS replaces the socket ops with custom ones assuming a valid TLS ULP context is available.
Instead completely prevent clone-less ULP sockets from entering the LISTEN status.
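For comparison, the upstream check permits ULPs that do implement a clone operation; the 4.19 backport below rejects any attached ULP, since struct tcp_ulp_ops carries no clone op on this branch. A sketch of the upstream-style predicate:

	static int inet_ulp_can_listen(const struct sock *sk)
	{
		const struct inet_connection_sock *icsk = inet_csk(sk);

		/* A ULP that cannot clone its context must not reach LISTEN:
		 * every accepted child would share, and later double-free,
		 * the listener's ULP data.
		 */
		if (icsk->icsk_ulp_ops && !icsk->icsk_ulp_ops->clone)
			return -EINVAL;

		return 0;
	}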
Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
Reported-by: slipper <slipper.alive@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/4b80c3d1dbe3d0ab072f80450c202d9bc88b4b03.167274060...
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/ipv4/inet_connection_sock.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 56465911bcdd..f1f3dc6a7d63 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -910,11 +910,25 @@ void inet_csk_prepare_forced_close(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_csk_prepare_forced_close);
 
+static int inet_ulp_can_listen(const struct sock *sk)
+{
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ulp_ops)
+		return -EINVAL;
+
+	return 0;
+}
+
 int inet_csk_listen_start(struct sock *sk, int backlog)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet = inet_sk(sk);
-	int err = -EADDRINUSE;
+	int err;
+
+	err = inet_ulp_can_listen(sk);
+	if (unlikely(err))
+		return err;
 
 	reqsk_queue_alloc(&icsk->icsk_accept_queue);
From: Mark Rutland <mark.rutland@arm.com>

stable inclusion
from stable-v4.19.270
commit 6ad3636bd8419b29dc85fadb3e50caa8f91cbc79
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 031af50045ea97ed4386eb3751ca2c134d0fc911 ]
The inline assembly for arm64's cmpxchg_double*() implementations uses a +Q constraint to hazard against other accesses to the memory location being exchanged. However, the pointer passed to the constraint is a pointer to unsigned long, and thus the hazard only applies to the first 8 bytes of the location.
GCC can take advantage of this, assuming that other portions of the location are unchanged, leading to a number of potential problems.
This is similar to what we fixed back in commit:
fee960bed5e857eb ("arm64: xchg: hazard against entire exchange variable")
... but we forgot to adjust cmpxchg_double*() similarly at the same time.
The same problem applies, as demonstrated with the following test:
| struct big {
|         u64 lo, hi;
| } __aligned(128);
|
| unsigned long foo(struct big *b)
| {
|         u64 hi_old, hi_new;
|
|         hi_old = b->hi;
|         cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78);
|         hi_new = b->hi;
|
|         return hi_old ^ hi_new;
| }
... which GCC 12.1.0 compiles as:
| 0000000000000000 <foo>:
|    0:   d503233f        paciasp
|    4:   aa0003e4        mov     x4, x0
|    8:   1400000e        b       40 <foo+0x40>
|    c:   d2800240        mov     x0, #0x12        // #18
|   10:   d2800681        mov     x1, #0x34        // #52
|   14:   aa0003e5        mov     x5, x0
|   18:   aa0103e6        mov     x6, x1
|   1c:   d2800ac2        mov     x2, #0x56        // #86
|   20:   d2800f03        mov     x3, #0x78        // #120
|   24:   48207c82        casp    x0, x1, x2, x3, [x4]
|   28:   ca050000        eor     x0, x0, x5
|   2c:   ca060021        eor     x1, x1, x6
|   30:   aa010000        orr     x0, x0, x1
|   34:   d2800000        mov     x0, #0x0         // #0   <--- BANG
|   38:   d50323bf        autiasp
|   3c:   d65f03c0        ret
|   40:   d2800240        mov     x0, #0x12        // #18
|   44:   d2800681        mov     x1, #0x34        // #52
|   48:   d2800ac2        mov     x2, #0x56        // #86
|   4c:   d2800f03        mov     x3, #0x78        // #120
|   50:   f9800091        prfm    pstl1strm, [x4]
|   54:   c87f1885        ldxp    x5, x6, [x4]
|   58:   ca0000a5        eor     x5, x5, x0
|   5c:   ca0100c6        eor     x6, x6, x1
|   60:   aa0600a6        orr     x6, x5, x6
|   64:   b5000066        cbnz    x6, 70 <foo+0x70>
|   68:   c8250c82        stxp    w5, x2, x3, [x4]
|   6c:   35ffff45        cbnz    w5, 54 <foo+0x54>
|   70:   d2800000        mov     x0, #0x0         // #0   <--- BANG
|   74:   d50323bf        autiasp
|   78:   d65f03c0        ret
Notice that at the lines with "BANG" comments, GCC has assumed that the higher 8 bytes are unchanged by the cmpxchg_double() call, and that `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and LL/SC versions of cmpxchg_double().
This patch fixes the issue by passing a pointer to __uint128_t into the +Q constraint, ensuring that the compiler hazards against the entire 16 bytes being modified.
With this change, GCC 12.1.0 compiles the above test as:
| 0000000000000000 <foo>:
|    0:   f9400407        ldr     x7, [x0, #8]
|    4:   d503233f        paciasp
|    8:   aa0003e4        mov     x4, x0
|    c:   1400000f        b       48 <foo+0x48>
|   10:   d2800240        mov     x0, #0x12        // #18
|   14:   d2800681        mov     x1, #0x34        // #52
|   18:   aa0003e5        mov     x5, x0
|   1c:   aa0103e6        mov     x6, x1
|   20:   d2800ac2        mov     x2, #0x56        // #86
|   24:   d2800f03        mov     x3, #0x78        // #120
|   28:   48207c82        casp    x0, x1, x2, x3, [x4]
|   2c:   ca050000        eor     x0, x0, x5
|   30:   ca060021        eor     x1, x1, x6
|   34:   aa010000        orr     x0, x0, x1
|   38:   f9400480        ldr     x0, [x4, #8]
|   3c:   d50323bf        autiasp
|   40:   ca0000e0        eor     x0, x7, x0
|   44:   d65f03c0        ret
|   48:   d2800240        mov     x0, #0x12        // #18
|   4c:   d2800681        mov     x1, #0x34        // #52
|   50:   d2800ac2        mov     x2, #0x56        // #86
|   54:   d2800f03        mov     x3, #0x78        // #120
|   58:   f9800091        prfm    pstl1strm, [x4]
|   5c:   c87f1885        ldxp    x5, x6, [x4]
|   60:   ca0000a5        eor     x5, x5, x0
|   64:   ca0100c6        eor     x6, x6, x1
|   68:   aa0600a6        orr     x6, x5, x6
|   6c:   b5000066        cbnz    x6, 78 <foo+0x78>
|   70:   c8250c82        stxp    w5, x2, x3, [x4]
|   74:   35ffff45        cbnz    w5, 5c <foo+0x5c>
|   78:   f9400480        ldr     x0, [x4, #8]
|   7c:   d50323bf        autiasp
|   80:   ca0000e0        eor     x0, x7, x0
|   84:   d65f03c0        ret
... sampling the high 8 bytes before and after the cmpxchg, and performing an EOR, as we'd expect.
For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note that linux-4.9.y is the oldest currently supported stable release, and mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run on my machines due to library incompatibilities.
I've also used a standalone test to check that we can use a __uint128_t pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM 3.9.1.
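The essence of the fix is just the cast in the constraint: the compiler must see the full 16-byte object as read and written by the asm. A minimal standalone illustration of the idea (a sketch only, not the kernel macros):

	static inline void store_pair(unsigned long *p,
				      unsigned long v1, unsigned long v2)
	{
		/* "+Q" on a __uint128_t lvalue hazards all 16 bytes; with
		 * "unsigned long" the compiler may assume bytes 8..15 are
		 * untouched and cache them across the asm.
		 */
		asm volatile("stp %1, %2, %0"
			     : "+Q" (*(__uint128_t *)p)
			     : "r" (v1), "r" (v2));
	}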
Fixes: 5284e1b4bc8a ("arm64: xchg: Implement cmpxchg_double")
Fixes: e9a4b795652f ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU")
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/
Reported-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/Y6GXoO4qmH9OIZ5Q@hirez.programming.kicks-ass.ne...
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230104151626.3262137-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 arch/arm64/include/asm/atomic_ll_sc.h | 2 +-
 arch/arm64/include/asm/atomic_lse.h   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index f5a2d09afb38..817a043a85f6 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -314,7 +314,7 @@ __LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1,		\
 	"	cbnz	%w0, 1b\n"					\
 	"	" #mb "\n"						\
 	"2:"								\
-	: "=&r" (tmp), "=&r" (ret), "+Q" (*(unsigned long *)ptr)	\
+	: "=&r" (tmp), "=&r" (ret), "+Q" (*(__uint128_t *)ptr)		\
 	: "r" (old1), "r" (old2), "r" (new1), "r" (new2)		\
 	: cl);								\
 									\
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index eab3de4f2ad2..d1e77f843d88 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -555,7 +555,7 @@ static inline long __cmpxchg_double##name(unsigned long old1,		\
 	"	eor	%[old2], %[old2], %[oldval2]\n"			\
 	"	orr	%[old1], %[old1], %[old2]")			\
 	: [old1] "+&r" (x0), [old2] "+&r" (x1),				\
-	  [v] "+Q" (*(unsigned long *)ptr)				\
+	  [v] "+Q" (*(__uint128_t *)ptr)				\
 	: [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4),		\
 	  [oldval1] "r" (oldval1), [oldval2] "r" (oldval2)		\
 	: __LL_SC_CLOBBERS, ##cl);					\
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

stable inclusion
from stable-v4.19.271
commit d3ee91e50a6b3c5a45398e3dcb912a8a264f575c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 739790605705ddcf18f21782b9c99ad7d53a8c11 upstream.
do_prlimit() adds the user-controlled resource value to a pointer that will subsequently be dereferenced. In order to help prevent this codepath from being used as a spectre "gadget" a barrier needs to be added after checking the range.
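The pattern is the standard array_index_nospec() clamp: after the bounds check, the index is sanitized so a speculated out-of-bounds value cannot feed the subsequent pointer arithmetic. A minimal sketch of the pattern (rlim_for() is an illustrative helper, not the full do_prlimit()):

	#include <linux/nospec.h>

	static struct rlimit *rlim_for(struct task_struct *tsk,
				       unsigned int resource)
	{
		if (resource >= RLIM_NLIMITS)
			return NULL;
		/* clamp the user-controlled index under speculation */
		resource = array_index_nospec(resource, RLIM_NLIMITS);
		return tsk->signal->rlim + resource;
	}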
Reported-by: Jordy Zomer <jordyzomer@google.com>
Tested-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 kernel/sys.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/kernel/sys.c b/kernel/sys.c
index faef6e6b635f..b088b71732b7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1530,6 +1530,8 @@ int do_prlimit(struct task_struct *tsk, unsigned int resource,
 
 	if (resource >= RLIM_NLIMITS)
 		return -EINVAL;
+	resource = array_index_nospec(resource, RLIM_NLIMITS);
+
 	if (new_rlim) {
 		if (new_rlim->rlim_cur > new_rlim->rlim_max)
 			return -EINVAL;
From: Yang Shi <shy828301@gmail.com>

stable inclusion
from stable-v5.10.148
commit 377c60dd32d3289788bdb3d8840382f79d42139b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 70cbc3cc78a997d8247b50389d37c4e1736019da upstream.
Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases; it only handles traditional IPI-based GUP-fast correctly. On architectures that send an IPI broadcast on TLB flush, it works as expected. But on the architectures that do not use IPI to broadcast TLB flush, it may have the below race:
	     CPU A                          CPU B
	THP collapse                       fast GUP
	                                   gup_pmd_range() <-- see valid pmd
	                                   gup_pte_range() <-- work on pte
	pmdp_collapse_flush() <-- clear pmd and flush
	__collapse_huge_page_isolate()
	    check page pinned <-- before GUP bump refcount
	                                   pin the page
	                                   check PTE <-- no change
	__collapse_huge_page_copy()
	    copy data to huge page
	    ptep_clear()
	install huge pmd for the huge page
	                                   return the stale page
	discard the stale page
The race can be fixed by checking whether PMD is changed or not after taking the page pin in fast GUP, just like what it does for PTE. If the PMD is changed it means there may be parallel THP collapse, so GUP should back off.
Also update the stale comment about serializing against fast GUP in khugepaged.
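The heart of the fix is a second check after the page has been pinned; condensed from the gup_pte_range() hunk below:

	/* After pinning, re-read both levels; if either changed, a
	 * parallel THP collapse may have freed or replaced the page
	 * table, so fast GUP must back off.
	 */
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(*ptep))) {
		put_page(head);
		goto pte_unmap;
	}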
Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
	mm/gup.c

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 mm/gup.c        | 34 ++++++++++++++++++++++++++++------
 mm/khugepaged.c | 10 ++++++----
 2 files changed, 34 insertions(+), 10 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 5f367d8211bd..f0eda2d9c152 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1434,8 +1434,28 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
 }
 
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-			 int write, struct page **pages, int *nr)
+/*
+ * Fast-gup relies on pte change detection to avoid concurrent pgtable
+ * operations.
+ *
+ * To pin the page, fast-gup needs to do below in order:
+ * (1) pin the page (by prefetching pte), then (2) check pte not changed.
+ *
+ * For the rest of pgtable operations where pgtable updates can be racy
+ * with fast-gup, we need to do (1) clear pte, then (2) check whether page
+ * is pinned.
+ *
+ * Above will work for all pte-level operations, including THP split.
+ *
+ * For THP collapse, it's a bit more complicated because fast-gup may be
+ * walking a pgtable page that is being freed (pte is still valid but pmd
+ * can be cleared already). To avoid race in such condition, we need to
+ * also check pmd here to make sure pmd doesn't change (corresponds to
+ * pmdp_collapse_flush() in the THP collapse code path).
+ */
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			 unsigned long end, int write,
+			 struct page **pages, int *nr)
 {
 	struct dev_pagemap *pgmap = NULL;
 	int nr_start = *nr, ret = 0;
@@ -1472,7 +1492,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (!head)
 			goto pte_unmap;
 
-		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
+		    unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_page(head);
 			goto pte_unmap;
 		}
@@ -1504,8 +1525,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
  * __get_user_pages_fast implementation that can pin pages. Thus it's still
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-			 int write, struct page **pages, int *nr)
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			 unsigned long end, int write,
+			 struct page **pages, int *nr)
 {
 	return 0;
 }
@@ -1736,7 +1758,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
 					 PMD_SHIFT, next, write, pages, nr))
 				return 0;
-		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+		} else if (!gup_pte_range(pmd, pmdp, addr, next, write, pages, nr))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2975fc124cb6..cbf26683898c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1043,10 +1043,12 @@ static void collapse_huge_page(struct mm_struct *mm,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
-	 * After this gup_fast can't run anymore. This also removes
-	 * any huge TLB entry from the CPU so we won't allow
-	 * huge and small TLB entries for the same virtual address
-	 * to avoid the risk of CPU bugs in that area.
+	 * This removes any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address to
+	 * avoid the risk of CPU bugs in that area.
+	 *
+	 * Parallel fast GUP is fine since fast GUP will back off when
+	 * it detects PMD is changed.
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
From: Jann Horn <jannh@google.com>

stable inclusion
from stable-v4.19.270
commit f0700ae26832550ce497a568789c3fceeb44d753
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit 2ba99c5e08812494bc57f319fb562f527d9bacd8 upstream.
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between.
However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables.
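The synchronization primitive itself is tiny: a no-op IPI broadcast whose only effect is to wait until every other CPU has left its IRQ-disabled GUP-fast section (condensed from the mm/mmu_gather.c hunk below):

	static void tlb_remove_table_smp_sync(void *arg)
	{
		/* Simply deliver the interrupt */
	}

	void tlb_remove_table_sync_one(void)
	{
		/* Returns only after the empty IPI has run on all other
		 * CPUs, i.e. after any concurrent lockless_pages_from_mm()
		 * walker (which runs with IRQs disabled) has finished.
		 */
		smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	}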
Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com
Fixes: ba76149f47d8 ("thp: khugepaged")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[manual backport: two of the three places in khugepaged that can free
ptes were refactored into a common helper between 5.15 and 6.0; TLB
flushing was refactored between 5.4 and 5.10; TLB flushing was
refactored between 4.19 and 5.4; pmd collapse for PTE-mapped THP was
only added in 5.4; ugly hack needed in <=4.19 for s390 and arm]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
	mm/memory.c
	mm/mmu_gather.c

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 include/asm-generic/tlb.h |  6 ++++++
 mm/khugepaged.c           | 15 +++++++++++++++
 mm/memory.c               |  1 +
 mm/mmu_gather.c           |  5 +++++
 4 files changed, 27 insertions(+)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index cfc86e3ba460..139bcd821a3c 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -63,6 +63,12 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+void tlb_remove_table_sync_one(void);
+
+#else
+
+static inline void tlb_remove_table_sync_one(void) { }
+
 #endif
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cbf26683898c..83500fe19572 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -23,6 +23,19 @@
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+/* gross hack for <=4.19 stable */
+#if defined(CONFIG_S390) || defined(CONFIG_ARM)
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_sync_one(void)
+{
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+}
+#endif
+
 enum scan_result {
 	SCAN_FAIL,
 	SCAN_SUCCEED,
@@ -1053,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	tlb_remove_table_sync_one();
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -1318,6 +1332,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
 				spin_unlock(ptl);
 				mm_dec_nr_ptes(mm);
+				tlb_remove_table_sync_one();
 				pte_free(mm, pmd_pgtable(_pmd));
 			}
 			up_write(&mm->mmap_sem);
diff --git a/mm/memory.c b/mm/memory.c
index 5c573b364bdd..407920bf4b97 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -200,6 +200,7 @@ static void check_sync_rss_stat(struct task_struct *task)
 
 #endif /* SPLIT_RSS_COUNTING */
 
+
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index c147a5aacfa9..a44cf211ffee 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -177,6 +177,11 @@ static void tlb_remove_table_smp_sync(void *arg)
 	/* Simply deliver the interrupt */
 }
 
+void tlb_remove_table_sync_one(void)
+{
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+}
+
 static void tlb_remove_table_one(void *table)
 {
 	/*
From: Jann Horn <jannh@google.com>

stable inclusion
from stable-v4.19.270
commit ff2a1a6f869650aec99e9d070b5ab625bfbc5bc3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit f268f6cf875f3220afc77bdd0bf1bb136eb54db9 upstream.
Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables.
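The shape of the fix is the usual invalidate_range_start/end bracket around the page table teardown, using the 4.19 MMU notifier API (condensed from the hunk below):

	unsigned long end = addr + HPAGE_PMD_SIZE;

	mmu_notifier_invalidate_range_start(mm, addr, end);
	ptl = pmd_lock(mm, pmd);
	_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* zap the page table */
	spin_unlock(ptl);
	mm_dec_nr_ptes(mm);
	tlb_remove_table_sync_one();
	pte_free(mm, pmd_pgtable(_pmd));
	mmu_notifier_invalidate_range_end(mm, addr, end);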
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[manual backport: this code was refactored from two copies into a
common helper between 5.15 and 6.0; pmd collapse for PTE-mapped THP
was only added in 5.4; MMU notifier API changed between 4.19 and 5.4]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 mm/khugepaged.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 83500fe19572..04d0a3ee006e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1327,13 +1327,20 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (down_write_trylock(&mm->mmap_sem)) {
 			if (!khugepaged_test_exit(mm)) {
-				spinlock_t *ptl = pmd_lock(mm, pmd);
+				spinlock_t *ptl;
+				unsigned long end = addr + HPAGE_PMD_SIZE;
+
+				mmu_notifier_invalidate_range_start(mm, addr,
+								    end);
+				ptl = pmd_lock(mm, pmd);
 				/* assume page table is clear */
 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
 				spin_unlock(ptl);
 				mm_dec_nr_ptes(mm);
 				tlb_remove_table_sync_one();
 				pte_free(mm, pmd_pgtable(_pmd));
+				mmu_notifier_invalidate_range_end(mm, addr,
+								  end);
 			}
 			up_write(&mm->mmap_sem);
 		}
From: Jan Kara <jack@suse.cz>

stable inclusion
from stable-v4.19.270
commit 61dc6cdfc85000e305a58553d41036716b427a0d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit 307af6c879377c1c63e71cbdd978201f9c7ee8df ]
Use the fact that entries with elevated refcount are not removed from the hash and just move removal of the entry from the hash to the entry freeing time. When doing this we also change the generic code to hold one reference to the cache entry, not two of them, which makes code somewhat more obvious.
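Under the new scheme a single reference belongs to the hash table, and the entry is unhashed and freed exactly when the count reaches zero; the put path ends up as (assembled from the hunks below):

	static inline void mb_cache_entry_put(struct mb_cache *cache,
					      struct mb_cache_entry *entry)
	{
		unsigned int cnt = atomic_dec_return(&entry->e_refcnt);

		if (cnt > 0) {
			if (cnt <= 2)
				wake_up_var(&entry->e_refcnt);
			return;
		}
		/* last reference dropped: unhash and free the entry */
		__mb_cache_entry_free(cache, entry);
	}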
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: a44e84a9b776 ("ext4: fix deadlock due to mbcache entry corruption")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/mbcache.c            | 108 +++++++++++++++-------------------------
 include/linux/mbcache.h |  24 ++++++---
 2 files changed, 55 insertions(+), 77 deletions(-)
diff --git a/fs/mbcache.c b/fs/mbcache.c
index c76030e20d92..aa564e79891d 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -89,7 +89,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&entry->e_list);
-	/* One ref for hash, one ref returned */
+	/* Initial hash reference */
 	atomic_set(&entry->e_refcnt, 1);
 	entry->e_key = key;
 	entry->e_value = value;
@@ -105,21 +105,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 		}
 	}
 	hlist_bl_add_head(&entry->e_hash_list, head);
-	hlist_bl_unlock(head);
-
+	/*
+	 * Add entry to LRU list before it can be found by
+	 * mb_cache_entry_delete() to avoid races
+	 */
 	spin_lock(&cache->c_list_lock);
 	list_add_tail(&entry->e_list, &cache->c_list);
-	/* Grab ref for LRU list */
-	atomic_inc(&entry->e_refcnt);
 	cache->c_entry_count++;
 	spin_unlock(&cache->c_list_lock);
+	hlist_bl_unlock(head);
 
 	return 0;
 }
 EXPORT_SYMBOL(mb_cache_entry_create);
 
-void __mb_cache_entry_free(struct mb_cache_entry *entry)
+void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
 {
+	struct hlist_bl_head *head;
+
+	head = mb_cache_entry_head(cache, entry->e_key);
+	hlist_bl_lock(head);
+	hlist_bl_del(&entry->e_hash_list);
+	hlist_bl_unlock(head);
 	kmem_cache_free(mb_entry_cache, entry);
 }
 EXPORT_SYMBOL(__mb_cache_entry_free);
@@ -133,7 +140,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
  */
 void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
 {
-	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
+	wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
 }
 EXPORT_SYMBOL(mb_cache_entry_wait_unused);
 
@@ -154,10 +161,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
 	while (node) {
 		entry = hlist_bl_entry(node, struct mb_cache_entry,
 				       e_hash_list);
-		if (entry->e_key == key && entry->e_reusable) {
-			atomic_inc(&entry->e_refcnt);
+		if (entry->e_key == key && entry->e_reusable &&
+		    atomic_inc_not_zero(&entry->e_refcnt))
 			goto out;
-		}
 		node = node->next;
 	}
 	entry = NULL;
@@ -217,10 +223,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
 	head = mb_cache_entry_head(cache, key);
 	hlist_bl_lock(head);
 	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
-		if (entry->e_key == key && entry->e_value == value) {
-			atomic_inc(&entry->e_refcnt);
+		if (entry->e_key == key && entry->e_value == value &&
+		    atomic_inc_not_zero(&entry->e_refcnt))
 			goto out;
-		}
 	}
 	entry = NULL;
 out:
@@ -280,37 +285,25 @@ EXPORT_SYMBOL(mb_cache_entry_delete);
 struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
 						    u32 key, u64 value)
 {
-	struct hlist_bl_node *node;
-	struct hlist_bl_head *head;
 	struct mb_cache_entry *entry;
 
-	head = mb_cache_entry_head(cache, key);
-	hlist_bl_lock(head);
-	hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
-		if (entry->e_key == key && entry->e_value == value) {
-			if (atomic_read(&entry->e_refcnt) > 2) {
-				atomic_inc(&entry->e_refcnt);
-				hlist_bl_unlock(head);
-				return entry;
-			}
-			/* We keep hash list reference to keep entry alive */
-			hlist_bl_del_init(&entry->e_hash_list);
-			hlist_bl_unlock(head);
-			spin_lock(&cache->c_list_lock);
-			if (!list_empty(&entry->e_list)) {
-				list_del_init(&entry->e_list);
-				if (!WARN_ONCE(cache->c_entry_count == 0,
-		"mbcache: attempt to decrement c_entry_count past zero"))
-					cache->c_entry_count--;
-				atomic_dec(&entry->e_refcnt);
-			}
-			spin_unlock(&cache->c_list_lock);
-			mb_cache_entry_put(cache, entry);
-			return NULL;
-		}
-	}
-	hlist_bl_unlock(head);
+	entry = mb_cache_entry_get(cache, key, value);
+	if (!entry)
+		return NULL;
 
+	/*
+	 * Drop the ref we got from mb_cache_entry_get() and the initial hash
+	 * ref if we are the last user
+	 */
+	if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
+		return entry;
+
+	spin_lock(&cache->c_list_lock);
+	if (!list_empty(&entry->e_list))
+		list_del_init(&entry->e_list);
+	cache->c_entry_count--;
+	spin_unlock(&cache->c_list_lock);
+	__mb_cache_entry_free(cache, entry);
 	return NULL;
 }
 EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
@@ -342,42 +335,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 				     unsigned long nr_to_scan)
 {
 	struct mb_cache_entry *entry;
-	struct hlist_bl_head *head;
 	unsigned long shrunk = 0;
 
 	spin_lock(&cache->c_list_lock);
 	while (nr_to_scan-- && !list_empty(&cache->c_list)) {
 		entry = list_first_entry(&cache->c_list,
 					 struct mb_cache_entry, e_list);
-		if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
+		/* Drop initial hash reference if there is no user */
+		if (entry->e_referenced ||
+		    atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
 			entry->e_referenced = 0;
 			list_move_tail(&entry->e_list, &cache->c_list);
 			continue;
 		}
 		list_del_init(&entry->e_list);
 		cache->c_entry_count--;
-		/*
-		 * We keep LRU list reference so that entry doesn't go away
-		 * from under us.
-		 */
 		spin_unlock(&cache->c_list_lock);
-		head = mb_cache_entry_head(cache, entry->e_key);
-		hlist_bl_lock(head);
-		/* Now a reliable check if the entry didn't get used... */
-		if (atomic_read(&entry->e_refcnt) > 2) {
-			hlist_bl_unlock(head);
-			spin_lock(&cache->c_list_lock);
-			list_add_tail(&entry->e_list, &cache->c_list);
-			cache->c_entry_count++;
-			continue;
-		}
-		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
-			hlist_bl_del_init(&entry->e_hash_list);
-			atomic_dec(&entry->e_refcnt);
-		}
-		hlist_bl_unlock(head);
-		if (mb_cache_entry_put(cache, entry))
-			shrunk++;
+		__mb_cache_entry_free(cache, entry);
+		shrunk++;
 		cond_resched();
 		spin_lock(&cache->c_list_lock);
 	}
@@ -469,11 +444,6 @@ void mb_cache_destroy(struct mb_cache *cache)
 	 * point.
 	 */
 	list_for_each_entry_safe(entry, next, &cache->c_list, e_list) {
-		if (!hlist_bl_unhashed(&entry->e_hash_list)) {
-			hlist_bl_del_init(&entry->e_hash_list);
-			atomic_dec(&entry->e_refcnt);
-		} else
-			WARN_ON(1);
 		list_del(&entry->e_list);
 		WARN_ON(atomic_read(&entry->e_refcnt) != 1);
 		mb_cache_entry_put(cache, entry);
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 8eca7f25c432..e9d5ece87794 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -13,8 +13,16 @@ struct mb_cache;
 struct mb_cache_entry {
 	/* List of entries in cache - protected by cache->c_list_lock */
 	struct list_head	e_list;
-	/* Hash table list - protected by hash chain bitlock */
+	/*
+	 * Hash table list - protected by hash chain bitlock. The entry is
+	 * guaranteed to be hashed while e_refcnt > 0.
+	 */
 	struct hlist_bl_node	e_hash_list;
+	/*
+	 * Entry refcount. Once it reaches zero, entry is unhashed and freed.
+	 * While refcount > 0, the entry is guaranteed to stay in the hash and
+	 * e.g. mb_cache_entry_try_delete() will fail.
+	 */
 	atomic_t		e_refcnt;
 	/* Key in hash - stable during lifetime of the entry */
 	u32			e_key;
@@ -29,20 +37,20 @@ void mb_cache_destroy(struct mb_cache *cache);
 
 int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 			  u64 value, bool reusable);
-void __mb_cache_entry_free(struct mb_cache_entry *entry);
+void __mb_cache_entry_free(struct mb_cache *cache,
+			   struct mb_cache_entry *entry);
 void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
-static inline int mb_cache_entry_put(struct mb_cache *cache,
-				     struct mb_cache_entry *entry)
+static inline void mb_cache_entry_put(struct mb_cache *cache,
+				      struct mb_cache_entry *entry)
 {
 	unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
 
 	if (cnt > 0) {
-		if (cnt <= 3)
+		if (cnt <= 2)
 			wake_up_var(&entry->e_refcnt);
-		return 0;
+		return;
 	}
-	__mb_cache_entry_free(entry);
-	return 1;
+	__mb_cache_entry_free(cache, entry);
 }
 
 struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
From: Jan Kara <jack@suse.cz>

stable inclusion
from stable-v4.19.270
commit efaa0ca678f56d47316a08030b2515678cebbc50
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
[ Upstream commit a44e84a9b7764c72896f7241a0ec9ac7e7ef38dd ]
When manipulating xattr blocks, we can deadlock, infinitely looping inside ext4_xattr_block_set(), where we constantly keep finding an xattr block for reuse in mbcache but are unable to reuse it because its reference count is too big. This happens because the cache entry for the xattr block is marked as reusable (e_reusable set) although its reference count is too big. When this inconsistency happens, it is kept indefinitely and so ext4_xattr_block_set() keeps retrying indefinitely.

The inconsistent state is caused by a non-atomic update of the e_reusable bit. e_reusable is part of a bitfield, and an e_reusable update can race with an update of the e_referenced bit in the same bitfield, resulting in the loss of one of the updates. Fix the problem by using atomic bitops instead.
This bug has been around for many years, but it became *much* easier to hit after commit 65f8b80053a1 ("ext4: fix race when reusing xattr blocks").
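The underlying hazard is generic C: adjacent bitfields share one memory word, so two CPUs doing a read-modify-write on different bits can silently drop one update. Atomic bitops on an unsigned long avoid that; a minimal sketch with illustrative names:

	#include <linux/bitops.h>

	enum { REFERENCED_B = 0, REUSABLE_B };	/* hypothetical bit names */

	struct entry {
		unsigned long flags;	/* replaces "u32 referenced:1; u32 reusable:1;" */
	};

	static void mark_reusable(struct entry *e)
	{
		set_bit(REUSABLE_B, &e->flags);		/* atomic: cannot clobber ... */
	}

	static void mark_referenced(struct entry *e)
	{
		set_bit(REFERENCED_B, &e->flags);	/* ... a concurrent update */
	}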
Cc: stable@vger.kernel.org
Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
Fixes: 65f8b80053a1 ("ext4: fix race when reusing xattr blocks")
Reported-and-tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Reported-by: Thilo Fromm <t-lo@linux.microsoft.com>
Link: https://lore.kernel.org/r/c77bf00f-4618-7149-56f1-b8d1664b9d07@linux.microso...
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20221123193950.16758-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/ext4/xattr.c         |  4 ++--
 fs/mbcache.c            | 14 ++++++++------
 include/linux/mbcache.h |  9 +++++++--
 3 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 8bc3f5cc11e3..abb644b169cb 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1280,7 +1280,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 		ce = mb_cache_entry_get(ea_block_cache, hash,
 					bh->b_blocknr);
 		if (ce) {
-			ce->e_reusable = 1;
+			set_bit(MBE_REUSABLE_B, &ce->e_flags);
 			mb_cache_entry_put(ea_block_cache, ce);
 		}
 	}
@@ -2039,7 +2039,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				}
 				BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
 				if (ref == EXT4_XATTR_REFCOUNT_MAX)
-					ce->e_reusable = 0;
+					clear_bit(MBE_REUSABLE_B, &ce->e_flags);
 				ea_bdebug(new_bh, "reusing; refcount now=%d",
 					  ref);
 				ext4_xattr_block_csum_set(inode, new_bh);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index aa564e79891d..8e9e1888e448 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -93,8 +93,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 	atomic_set(&entry->e_refcnt, 1);
 	entry->e_key = key;
 	entry->e_value = value;
-	entry->e_reusable = reusable;
-	entry->e_referenced = 0;
+	entry->e_flags = 0;
+	if (reusable)
+		set_bit(MBE_REUSABLE_B, &entry->e_flags);
 	head = mb_cache_entry_head(cache, key);
 	hlist_bl_lock(head);
 	hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
@@ -161,7 +162,8 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
 	while (node) {
 		entry = hlist_bl_entry(node, struct mb_cache_entry,
 				       e_hash_list);
-		if (entry->e_key == key && entry->e_reusable &&
+		if (entry->e_key == key &&
+		    test_bit(MBE_REUSABLE_B, &entry->e_flags) &&
 		    atomic_inc_not_zero(&entry->e_refcnt))
 			goto out;
 		node = node->next;
@@ -317,7 +319,7 @@ EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
 void mb_cache_entry_touch(struct mb_cache *cache,
 			  struct mb_cache_entry *entry)
 {
-	entry->e_referenced = 1;
+	set_bit(MBE_REFERENCED_B, &entry->e_flags);
 }
 EXPORT_SYMBOL(mb_cache_entry_touch);
 
@@ -342,9 +344,9 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 		entry = list_first_entry(&cache->c_list,
 					 struct mb_cache_entry, e_list);
 		/* Drop initial hash reference if there is no user */
-		if (entry->e_referenced ||
+		if (test_bit(MBE_REFERENCED_B, &entry->e_flags) ||
 		    atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
-			entry->e_referenced = 0;
+			clear_bit(MBE_REFERENCED_B, &entry->e_flags);
 			list_move_tail(&entry->e_list, &cache->c_list);
 			continue;
 		}
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index e9d5ece87794..591bc4cefe1d 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -10,6 +10,12 @@
 
 struct mb_cache;
 
+/* Cache entry flags */
+enum {
+	MBE_REFERENCED_B = 0,
+	MBE_REUSABLE_B
+};
+
 struct mb_cache_entry {
 	/* List of entries in cache - protected by cache->c_list_lock */
 	struct list_head	e_list;
@@ -26,8 +32,7 @@ struct mb_cache_entry {
 	atomic_t		e_refcnt;
 	/* Key in hash - stable during lifetime of the entry */
 	u32			e_key;
-	u32			e_referenced:1;
-	u32			e_reusable:1;
+	unsigned long		e_flags;
 	/* User provided value - stable during lifetime of the entry */
 	u64			e_value;
 };
From: Jakub Kicinski <kuba@kernel.org>

stable inclusion
from stable-v4.19.218
commit 8b8b3d738e450d2c2ccdc75f0ab5a951746c2a96
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 24bcbe1cc69fa52dc4f7b5b2456678ed464724d8 ]
sk_stream_kill_queues() can be called on close when there are still outstanding skbs to transmit. Those skbs may try to queue notifications to the error queue (e.g. timestamps). If sk_stream_kill_queues() purges the queue without taking its lock, the queue may get corrupted and skbs leaked.
This shows up as a warning about an rmem leak:
WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x...
The leak is always a multiple of 0x300 bytes (the value is in %rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of an empty sk_buff. Indeed if we dump the socket state at the time of the warning the sk_error_queue is often (but not always) corrupted. The ->next pointer points back at the list head, but not the ->prev pointer. Indeed we can find the leaked skb by scanning the kernel memory for something that looks like an skb with ->sk = socket in question, and ->truesize = 0x300. The contents of ->cb[] of the skb confirms the suspicion that it is indeed a timestamp notification (as generated in __skb_complete_tx_timestamp()).
Removing purging of sk_error_queue should be okay, since inet_sock_destruct() does it again once all socket refs are gone. Eric suggests this may cause sockets that go thru disconnect() to maintain notifications from the previous incarnations of the socket, but that should be okay since the race was there anyway, and disconnect() is not exactly dependable.
Thanks to Jonathan Lemon and Omar Sandoval for help at various stages of tracing the issue.
Fixes: cb9eff097831 ("net: new user space API for time stamping of incoming and outgoing packets")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/core/stream.c | 3 ---
 1 file changed, 3 deletions(-)
diff --git a/net/core/stream.c b/net/core/stream.c
index 71f654ca8da3..7b411a91a81c 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -196,9 +196,6 @@ void sk_stream_kill_queues(struct sock *sk)
 	/* First the read buffer. */
 	__skb_queue_purge(&sk->sk_receive_queue);
 
-	/* Next, the error queue. */
-	__skb_queue_purge(&sk->sk_error_queue);
-
 	/* Next, the write queue. */
 	WARN_ON(!skb_queue_empty(&sk->sk_write_queue));
From: Eric Dumazet <edumazet@google.com>

stable inclusion
from stable-v4.19.270
commit 6f00bd0402a1e3d2d556afba57c045bd7931e4d3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 ]
Changheon Lee reported TCP socket leaks, with a nice repro.
It seems we leak TCP sockets with the following sequence:
1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.
Each ACK will cook an skb put in error queue, from __skb_tstamp_tx(). __skb_tstamp_tx() is using skb_clone(), unless SOF_TIMESTAMPING_OPT_TSONLY was also requested.
2) If the application is also using MSG_ZEROCOPY, then we put in the error queue cloned skbs that had a struct ubuf_info attached to them.
Whenever a struct ubuf_info is allocated, sock_zerocopy_alloc() does a sock_hold().
As long as the cloned skbs are still in sk_error_queue, socket refcount is kept elevated.
3) Application closes the socket, while error queue is not empty.
Since tcp_close() no longer purges the socket error queue, we might end up with a TCP socket with at least one skb in error queue keeping the socket alive forever.
This bug can be (ab)used to consume all kernel memory and freeze the host.
We need to purge the error queue, with proper synchronization against concurrent writers.
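The difference between the two purge helpers is exactly the queue spinlock; simplified from net/core/skbuff.c (a sketch, slightly condensed):

	/* Unlocked: the caller must be the only one touching the queue. */
	void __skb_queue_purge(struct sk_buff_head *list)
	{
		struct sk_buff *skb;

		while ((skb = __skb_dequeue(list)) != NULL)
			kfree_skb(skb);
	}

	/* Locked: skb_dequeue() takes list->lock around each removal, so
	 * this is safe against concurrent writers such as timestamp
	 * notifications queued without the socket lock held.
	 */
	void skb_queue_purge(struct sk_buff_head *list)
	{
		struct sk_buff *skb;

		while ((skb = skb_dequeue(list)) != NULL)
			kfree_skb(skb);
	}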
Fixes: 24bcbe1cc69f ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
Reported-by: Changheon Lee <darklight2357@icloud.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/core/stream.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/net/core/stream.c b/net/core/stream.c
index 7b411a91a81c..58755528d39e 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -196,6 +196,12 @@ void sk_stream_kill_queues(struct sock *sk)
 	/* First the read buffer. */
 	__skb_queue_purge(&sk->sk_receive_queue);
 
+	/* Next, the error queue.
+	 * We need to use queue lock, because other threads might
+	 * add packets to the queue without socket lock being held.
+	 */
+	skb_queue_purge(&sk->sk_error_queue);
+
 	/* Next, the write queue. */
 	WARN_ON(!skb_queue_empty(&sk->sk_write_queue));
From: Chuck Lever <chuck.lever@oracle.com>

stable inclusion
from stable-v4.19.270
commit 76f2497a2faa6a4e91efb94a7f55705b403273fd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
--------------------------------
commit da522b5fe1a5f8b7c20a0023e87b52a150e53bf5 upstream.
Fixes: 030d794bf498 ("SUNRPC: Use gssproxy upcall for server RPCGSS authentication.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
	net/sunrpc/auth_gss/svcauth_gss.c

Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/sunrpc/auth_gss/svcauth_gss.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index b18b23b3a994..047029959d45 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -1092,8 +1092,10 @@ gss_read_proxy_verf(struct svc_rqst *rqstp,
 		return res;
 
 	inlen = svc_getnl(argv);
-	if (inlen > (argv->iov_len + rqstp->rq_arg.page_len))
+	if (inlen > (argv->iov_len + rqstp->rq_arg.page_len)) {
+		kfree(in_handle->data);
 		return SVC_DENIED;
+	}
 
 	in_token->pages = rqstp->rq_pages;
 	in_token->page_base = (ulong)argv->iov_base & ~PAGE_MASK;
From: Enzo Matsumiya <ematsumiya@suse.de>

stable inclusion
from stable-v4.19.271
commit 19f0577dd34b250e1595f8dd577d9c2b6c1dc85d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
--------------------------------
commit 30b2b2196d6e4cc24cbec633535a2404f258ce69 upstream.
On async reads, page data is allocated before sending. When the response is received but it has no data to fill (e.g. STATUS_END_OF_FILE), __calc_signature() will still include the pages in its computation, leading to an invalid signature check.
This patch fixes this by not setting the async read smb_rqst page data (zeroed by default) if its got_bytes is 0.
This can be reproduced/verified with xfstests generic/465.
Cc: stable@vger.kernel.org
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflict:
	fs/cifs/smb2pdu.c

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/cifs/smb2pdu.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 544e31149f07..a115db50be85 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -3154,12 +3154,15 @@ smb2_readv_callback(struct mid_q_entry *mid)
 				(struct smb2_sync_hdr *)rdata->iov[0].iov_base;
 	unsigned int credits_received = 0;
 	struct smb_rqst rqst = { .rq_iov = rdata->iov,
-				 .rq_nvec = 2,
-				 .rq_pages = rdata->pages,
-				 .rq_offset = rdata->page_offset,
-				 .rq_npages = rdata->nr_pages,
-				 .rq_pagesz = rdata->pagesz,
-				 .rq_tailsz = rdata->tailsz };
+				 .rq_nvec = 2, };
+
+	if (rdata->got_bytes) {
+		rqst.rq_pages = rdata->pages;
+		rqst.rq_offset = rdata->page_offset;
+		rqst.rq_npages = rdata->nr_pages;
+		rqst.rq_pagesz = rdata->pagesz;
+		rqst.rq_tailsz = rdata->tailsz;
+	}
 
 	cifs_dbg(FYI, "%s: mid=%llu state=%d result=%d bytes=%u\n",
 		 __func__, mid->mid, mid->mid_state, rdata->result,