From: Cheng Jian cj.chengjian@huawei.com
hulk inclusion
category: bugfix
bugzilla: 38261, https://gitee.com/openeuler/kernel/issues/I49XPZ
CVE: NA
---------------------------
When CONFIG_SCHED_STEAL is disabled, the build produces:

kernel/sched/topology.c:24:74: warning: ‘struct s_data’ declared inside parameter list will not be visible outside of this definition or declaration
   24 | static inline int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d) { return 0; }
      |                                                                           ^~~~~~

kernel/sched/topology.c: In function ‘build_sched_domains’:
kernel/sched/topology.c:2188:32: error: passing argument 2 of ‘sd_llc_alloc_all’ from incompatible pointer type [-Werror=incompatible-pointer-types]
 2188 |         if (sd_llc_alloc_all(cpu_map, &d))
      |                                       ^~
      |                                       |
      |                                       struct s_data *
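The warning arises because a struct tag first named inside a parameter list has prototype scope only. A minimal stand-alone sketch of the same situation and of the fix applied below (hypothetical file, assumed build command gcc -Wall -Werror -c repro.c):

/*
 * Moving this file-scope forward declaration above the stub is what the
 * patch below does for struct s_data in topology.c.
 */
struct s_data;

static inline int sd_llc_alloc_all_stub(const void *cpu_map,
					struct s_data *d)
{
	return 0;
}

struct s_data { int dummy; };

int build_sched_domains_stub(void)
{
	struct s_data d;

	return sd_llc_alloc_all_stub((void *)0, &d);
}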
Signed-off-by: Cheng Jian cj.chengjian@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/sched/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 0564aeabbcb8..fcf6aebb13c4 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -13,8 +13,8 @@ DEFINE_MUTEX(sched_domains_mutex); static cpumask_var_t sched_domains_tmpmask; static cpumask_var_t sched_domains_tmpmask2;
-#ifdef CONFIG_SCHED_STEAL struct s_data; +#ifdef CONFIG_SCHED_STEAL static int sd_llc_alloc(struct sched_domain *sd); static void sd_llc_free(struct sched_domain *sd); static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GUAB
CVE: NA
-------------------------------
Add a generic macro for kABI padding reservation to the openEuler kernel. This macro should be used to reserve kABI padding in base data structs before the kABI freeze.
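A hypothetical usage sketch (struct and field names invented for illustration) of how the KABI_RESERVE() macro added below would be placed in a struct before the kABI freeze:

#include <linux/kabi.h>

struct example_driver_ops {
	int  (*probe)(void *dev);
	void (*remove)(void *dev);

	/*
	 * Each line expands to "unsigned long kabi_reservedN;", padding
	 * that can later be repurposed without changing the struct size.
	 */
	KABI_RESERVE(1)
	KABI_RESERVE(2)
};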
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Reviewed-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/kabi.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 include/linux/kabi.h
diff --git a/include/linux/kabi.h b/include/linux/kabi.h new file mode 100644 index 000000000000..10c3dcfbbe55 --- /dev/null +++ b/include/linux/kabi.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * kabi.h - openEuler kABI abstraction header + * + * Copyright (C) 2021. Huawei Technologies Co., Ltd. All rights reserved. + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 and + * only version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#ifndef _LINUX_KABI_H +#define _LINUX_KABI_H + +/* + * Macro for Reserving KABI padding for base data structs before KABI freeze. + */ + +#define KABI_RESERVE(n) unsigned long kabi_reserved##n; + +#endif /* _LINUX_KABI_H */
From: Ming Lei ming.lei@redhat.com
mainline inclusion
from mainline-v5.15-rc3
commit a647a524a46736786c95cdb553a070322ca096e3
category: bugfix
bugzilla: 182234 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
The rq_qos framework is only applied to request-based drivers, so:

1) rq_qos_done_bio() need not be called for bio-based drivers

2) rq_qos_done_bio() need not be called for bios which aren't tracked, such as bios ended from error handling code.
Especially in bio_endio():
1) the request queue is referenced via bio->bi_bdev->bd_disk->queue, which may already be gone because the request queue's refcount may not be held in the above two cases
2) q->rq_qos may be freed in blk_cleanup_queue() when calling into __rq_qos_done_bio()
Fix the potential kernel panic by not calling rq_qos_ops->done_bio if the bio isn't tracked. This way is safe because both ioc_rqos_done_bio() and blkcg_iolatency_done_bio() are nop if the bio isn't tracked.
Reported-by: Yu Kuai yukuai3@huawei.com Cc: tj@kernel.org Signed-off-by: Ming Lei ming.lei@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Acked-by: Tejun Heo tj@kernel.org Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflict: block/bio.c Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/bio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c index 0703a208ca24..b2e0f0c94c5f 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1433,7 +1433,7 @@ void bio_endio(struct bio *bio) if (!bio_integrity_endio(bio)) return;
- if (bio->bi_disk) + if (bio->bi_disk && bio_flagged(bio, BIO_TRACKED)) rq_qos_done_bio(bio->bi_disk->queue, bio);
/*
From: Li Jinlin lijinlin3@huawei.com
mainline inclusion
from mainline-5.14
commit 884f0e84f1e3195b801319c8ec3d5774e9bf2710
category: bugfix
bugzilla: 181877 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
The pending timer has been set up in blk_throtl_init(). However, the timer is not deleted in blk_throtl_exit(). This means that the timer handler may still be running after freeing the timer, which would result in a use-after-free.
Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit().
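The fix follows the usual teardown rule: make sure the handler cannot still be running before the object embedding the timer is freed. A minimal sketch with made-up names (illustrative only, not the blk-throttle code):

#include <linux/slab.h>
#include <linux/timer.h>

struct my_td {
	struct timer_list pending_timer;
};

static void my_td_exit(struct my_td *td)
{
	/*
	 * Waits for a concurrently running handler to finish, so the
	 * handler can no longer touch *td once this returns.
	 */
	del_timer_sync(&td->pending_timer);
	kfree(td);
}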
Signed-off-by: Li Jinlin lijinlin3@huawei.com Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Laibin Qiu qiulaibin@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-throttle.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c index f05e932cf816..4087dcc13f7d 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -2487,6 +2487,7 @@ int blk_throtl_init(struct request_queue *q) void blk_throtl_exit(struct request_queue *q) { BUG_ON(!q->td); + del_timer_sync(&q->td->service_queue.pending_timer); throtl_shutdown_wq(q); blkcg_deactivate_policy(q, &blkcg_policy_throtl); free_percpu(q->td->latency_buckets[READ]);
From: Laibin Qiu qiulaibin@huawei.com
hulk inclusion
category: bugfix
bugzilla: 182666 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------
KASAN reports a use-after-free report when doing block test:
[293762.535116]
================================================================== [293762.535129] BUG: KASAN: use-after-free in queued_spin_lock_slowpath+0x78/0x4c8 [293762.535133] Write of size 2 at addr ffff8000d5f12bc8 by task sh/9148 [293762.535135] [293762.535145] CPU: 1 PID: 9148 Comm: sh Kdump: loaded Tainted: G W 4.19.90-vhulk2108.6.0.h824.kasan.eulerosv2r10.aarch64 #1 [293762.535148] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [293762.535150] Call trace: [293762.535154] dump_backtrace+0x0/0x310 [293762.535158] show_stack+0x28/0x38 [293762.535165] dump_stack+0xec/0x15c [293762.535172] print_address_description+0x68/0x2d0 [293762.535177] kasan_report+0x130/0x2f0 [293762.535182] __asan_store2+0x80/0xa8 [293762.535189] queued_spin_lock_slowpath+0x78/0x4c8 [293762.535194] __ioc_clear_queue+0x158/0x160 [293762.535198] ioc_clear_queue+0x194/0x258 [293762.535202] elevator_switch_mq+0x64/0x170 [293762.535206] elevator_switch+0x140/0x270 [293762.535211] elv_iosched_store+0x1a4/0x2a0 [293762.535215] queue_attr_store+0x90/0xe0 [293762.535219] sysfs_kf_write+0xa8/0xe8 [293762.535222] kernfs_fop_write+0x1f8/0x378 [293762.535227] __vfs_write+0xe0/0x360 [293762.535233] vfs_write+0xf0/0x270 [293762.535237] ksys_write+0xdc/0x1b8 [293762.535241] __arm64_sys_write+0x50/0x60 [293762.535245] el0_svc_common+0xc8/0x320 [293762.535250] el0_svc_handler+0xf8/0x160 [293762.535253] el0_svc+0x10/0x218 [293762.535254] [293762.535258] Allocated by task 3466763: [293762.535264] kasan_kmalloc+0xe0/0x190 [293762.535269] kasan_slab_alloc+0x14/0x20 [293762.535276] kmem_cache_alloc_node+0x1b4/0x420 [293762.535280] create_task_io_context+0x40/0x210 [293762.535284] generic_make_request_checks+0xc78/0xe38 [293762.535288] generic_make_request+0xf8/0x640 [293762.535394] generic_file_direct_write+0x100/0x268 [293762.535401] __generic_file_write_iter+0x128/0x370 [293762.535467] vfs_iter_write+0x64/0x90 [293762.535489] ovl_write_iter+0x2f8/0x458 [overlay] [293762.535493] __vfs_write+0x264/0x360 [293762.535497] vfs_write+0xf0/0x270 [293762.535501] ksys_write+0xdc/0x1b8 [293762.535505] __arm64_sys_write+0x50/0x60 [293762.535509] el0_svc_common+0xc8/0x320 [293762.535387] ext4_direct_IO+0x3c8/0xe80 [ext4] [293762.535394] generic_file_direct_write+0x100/0x268 [293762.535401] __generic_file_write_iter+0x128/0x370 [293762.535452] ext4_file_write_iter+0x610/0xa80 [ext4] [293762.535457] do_iter_readv_writev+0x28c/0x390 [293762.535463] do_iter_write+0xfc/0x360 [293762.535467] vfs_iter_write+0x64/0x90 [293762.535489] ovl_write_iter+0x2f8/0x458 [overlay] [293762.535493] __vfs_write+0x264/0x360 [293762.535497] vfs_write+0xf0/0x270 [293762.535501] ksys_write+0xdc/0x1b8 [293762.535505] __arm64_sys_write+0x50/0x60 [293762.535509] el0_svc_common+0xc8/0x320 [293762.535513] el0_svc_handler+0xf8/0x160 [293762.535517] el0_svc+0x10/0x218 [293762.535521] [293762.535523] Freed by task 3466763: [293762.535528] __kasan_slab_free+0x120/0x228 [293762.535532] kasan_slab_free+0x10/0x18 [293762.535536] kmem_cache_free+0x68/0x248 [293762.535540] put_io_context+0x104/0x190 [293762.535545] put_io_context_active+0x204/0x2c8 [293762.535549] exit_io_context+0x74/0xa0 [293762.535553] do_exit+0x658/0xae0 [293762.535557] do_group_exit+0x74/0x1a8 [293762.535561] get_signal+0x21c/0xf38 [293762.535564] do_signal+0x10c/0x450 [293762.535568] do_notify_resume+0x224/0x4b0 [293762.535573] work_pending+0x8/0x10 [293762.535574] [293762.535578] The buggy address belongs to the object at ffff8000d5f12bb8 which belongs to the cache blkdev_ioc of size 136 [293762.535582] The 
buggy address is located 16 bytes inside of 136-byte region [ffff8000d5f12bb8, ffff8000d5f12c40) [293762.535583] The buggy address belongs to the page: [293762.535588] page:ffff7e000357c480 count:1 mapcount:0 mapping:ffff8000d8563c00 index:0x0 [293762.536201] flags: 0x7ffff0000000100(slab) [293762.536540] raw: 07ffff0000000100 ffff7e0003118588 ffff8000d8adb530 ffff8000d8563c00 [293762.536546] raw: 0000000000000000 0000000000140014 00000001ffffffff 0000000000000000 [293762.536551] page dumped because: kasan: bad access detected [293762.536552] [293762.536554] Memory state around the buggy address: [293762.536558] ffff8000d5f12a80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fb fb [293762.536562] ffff8000d5f12b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [293762.536566] >ffff8000d5f12b80: fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb fb [293762.536568] ^ [293762.536572] ffff8000d5f12c00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [293762.536576] ffff8000d5f12c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[293762.536577] ==================================================================
ioc_release_fn() destroys icqs from ioc->icq_list and __ioc_clear_queue() destroys icqs from request_queue->icq_list. However, ioc_release_fn() takes ioc->lock first and frees the ioc at the end. __ioc_clear_queue() then gets the ioc from an icq and tries to take ioc->lock, but the ioc has already been freed, which results in a use-after-free.
CPU0                                 CPU1
put_io_context                       elevator_switch_mq
  queue_work(&ioc->release_work)       ioc_clear_queue
                                         ^^^ splice q->icq_list
                                       __ioc_clear_queue
                                         ^^^ get icq from icq_list
                                             get ioc from icq->ioc
ioc_release_fn
  spin_lock(ioc->lock)
  ioc_destroy_icq(icq)
  spin_unlock(ioc->lock)
  free(ioc)
                                       spin_lock(ioc->lock) <= UAF
Fix by holding the request_queue->queue_lock across __ioc_clear_queue() in ioc_clear_queue() to avoid this race.
Signed-off-by: Laibin Qiu qiulaibin@huawei.com Link: https://lore.kernel.org/lkml/1c9ad9f2-c487-c793-1ffc-5c3ec0fcc0ae@kernel.dk/ Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-ioc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-ioc.c b/block/blk-ioc.c index 57299f860d41..1d6ba8ff5a66 100644 --- a/block/blk-ioc.c +++ b/block/blk-ioc.c @@ -242,9 +242,9 @@ void ioc_clear_queue(struct request_queue *q)
spin_lock_irq(&q->queue_lock); list_splice_init(&q->icq_list, &icq_list); - spin_unlock_irq(&q->queue_lock);
__ioc_clear_queue(&icq_list); + spin_unlock_irq(&q->queue_lock); }
int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
From: Baokun Li libaokun1@huawei.com
mainline inclusion
from mainline-5.14
commit fad7cd3310db3099f95dd34312c77740fbc455e5
category: bugfix
bugzilla: 175174 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
If the user specifies a large enough value for the NBD blocks option, it may trigger a signed integer overflow, which may cause nbd->config->bytesize to become an unexpectedly large or small value, zero in particular.
UBSAN: Undefined behaviour in drivers/block/nbd.c:325:31
signed integer overflow:
1024 * 4611686155866341414 cannot be represented in type 'long long int'
[...]
Call trace:
 [...]
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
 nbd_size_set drivers/block/nbd.c:325 [inline]
 __nbd_ioctl drivers/block/nbd.c:1342 [inline]
 nbd_ioctl+0x998/0xa10 drivers/block/nbd.c:1395
 __blkdev_driver_ioctl block/ioctl.c:311 [inline]
 [...]
Although it is not a big deal, silence the UBSAN warning by limiting the input value.
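The fix below rejects block counts whose product with the block size would overflow. As a small stand-alone sketch (hypothetical helper name; check_mul_overflow() is the existing kernel primitive, which returns true when the product does not fit the destination type):

#include <linux/errno.h>
#include <linux/overflow.h>
#include <linux/types.h>

/* Returns 0 and fills *bytesize on success, -EINVAL on overflow. */
static int nbd_blocks_to_bytes(loff_t blksize, unsigned long nr_blocks,
			       loff_t *bytesize)
{
	if (check_mul_overflow((loff_t)nr_blocks, blksize, bytesize))
		return -EINVAL;
	return 0;
}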
Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Josef Bacik josef@toxicpanda.com Link: https://lore.kernel.org/r/20210804021212.990223-1-libaokun1@huawei.com [axboe: dropped unlikely()] Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: drivers/block/nbd.c
Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/block/nbd.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 59c452fff835..8a841d5f422d 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -1385,6 +1385,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, unsigned int cmd, unsigned long arg) { struct nbd_config *config = nbd->config; + loff_t bytesize;
switch (cmd) { case NBD_DISCONNECT: @@ -1407,6 +1408,8 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd, div_s64(arg, config->blksize)); return 0; case NBD_SET_SIZE_BLOCKS: + if (check_mul_overflow((loff_t)arg, config->blksize, &bytesize)) + return -EINVAL; nbd_size_set(nbd, config->blksize, arg); return 0; case NBD_SET_TIMEOUT:
From: Valentin Schneider valentin.schneider@arm.com
mainline inclusion
from mainline-v5.11-rc1
commit b5b217346de85ed1b03fdecd5c5076b34fbb2f0b
category: bugfix
bugzilla: 182847 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------
NUMA topologies where the shortest path between some two nodes requires three or more hops (i.e. diameter > 2) end up being misrepresented in the scheduler topology structures.
This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y + sched_debug on the cmdline, although this will only yield a warning about sched_group spans not matching sched_domain spans:
ERROR: groups don't span domain->span
Add an explicit warning for that case, triggered regardless of CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.
The topology described in the comment can be booted up on QEMU by appending the following to your usual QEMU incantation:
  -smp cores=4 \
  -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
  -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
  -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
  -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
  -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
A somewhat more realistic topology (6-node mesh) with the same affliction can be conjured with:
  -smp cores=6 \
  -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
  -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
  -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
  -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
  -numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
  -numa dist,src=0,dst=5,val=20, \
  -numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
  -numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
  -numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
  -numa dist,src=2,dst=5,val=40, \
  -numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
  -numa dist,src=4,dst=5,val=20
Signed-off-by: Valentin Schneider valentin.schneider@arm.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Mel Gorman mgorman@techsingularity.net Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com Reviewed-by: Cheng Jian cj.chengjian@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/sched/topology.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index fcf6aebb13c4..23c20aa5391f 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -701,6 +701,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) { struct rq *rq = cpu_rq(cpu); struct sched_domain *tmp; + int numa_distance = 0;
/* Remove the sched domains which do not contribute to scheduling. */ for (tmp = sd; tmp; ) { @@ -732,6 +733,38 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) sd->child = NULL; }
+ for (tmp = sd; tmp; tmp = tmp->parent) + numa_distance += !!(tmp->flags & SD_NUMA); + + /* + * FIXME: Diameter >=3 is misrepresented. + * + * Smallest diameter=3 topology is: + * + * node 0 1 2 3 + * 0: 10 20 30 40 + * 1: 20 10 20 30 + * 2: 30 20 10 20 + * 3: 40 30 20 10 + * + * 0 --- 1 --- 2 --- 3 + * + * NUMA-3 0-3 N/A N/A 0-3 + * groups: {0-2},{1-3} {1-3},{0-2} + * + * NUMA-2 0-2 0-3 0-3 1-3 + * groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} {2-3},{0-2} + * + * NUMA-1 0-1 0-2 1-3 2-3 + * groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2} + * + * NUMA-0 0 1 2 3 + * + * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the + * group span isn't a subset of the domain span. + */ + WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n"); + sched_domain_debug(sd, cpu);
rq_attach_root(rq, rd);
From: Barry Song song.bao.hua@hisilicon.com
mainline inclusion
from mainline-v5.13-rc1
commit 585b6d2723dc927ebc4ad884c4e879e4da8bc21f
category: bugfix
bugzilla: 182847 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------
As long as NUMA diameter > 2, building sched_domain by sibling's child domain will definitely create a sched_domain with sched_group which will span out of the sched_domain:
               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+

domain0        node0    node1    node2    node3

domain1        node0+1  node0+1  node2+3  node2+3
                                                 +
domain2        node0+1+2                         |
                  group: node0+1                 |
                        group:node2+3 <----------+
When node2 is added into the domain2 of node0, the kernel uses the child domain of node2's domain2, which is domain1 (node2+3). Node 3 is therefore outside the span of the domain being built, which includes node0+1+2.
This will make load_balance() run based on screwed avg_load and group_type in the sched_group spanning out of the sched_domain, and it also makes select_task_rq_fair() pick an idle CPU outside the sched_domain.
Real servers which suffer from this problem include Kunpeng920 and 8-node Sun Fire X4600-M2, at least.
Here we move to use the *child* domain of the *child* domain of node2's domain2 as the newly added sched_group. At the same time, we re-use the lower-level sgc directly.

               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+

domain0        node0    node1   +- node2   node3
                                |
domain1        node0+1  node0+1 |  node2+3  node2+3
                                |
domain2        node0+1+2        |
                  group: node0+1|
                        group:node2 <-----------+
While the lower level sgc is re-used, this patch only changes the remote sched_groups for those sched_domains playing grandchild trick, therefore, sgc->next_update is still safe since it's only touched by CPUs that have the group span as local group. And sgc->imbalance is also safe because sd_parent remains the same in load_balance and LB only tries other CPUs from the local group. Moreover, since local groups are not touched, they are still getting roughly equal size in a TL. And should_we_balance() only matters with local groups, so the pull probability of those groups are still roughly equal.
Tested by the below topology:

qemu-system-aarch64 -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
w/o patch, we get lots of "groups don't span domain->span": [ 0.802139] CPU0 attaching sched-domain(s): [ 0.802193] domain-0: span=0-1 level=MC [ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 } [ 0.802693] domain-1: span=0-3 level=NUMA [ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 } [ 0.802811] domain-2: span=0-5 level=NUMA [ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 } [ 0.802881] ERROR: groups don't span domain->span [ 0.803058] domain-3: span=0-7 level=NUMA [ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 } [ 0.804055] CPU1 attaching sched-domain(s): [ 0.804072] domain-0: span=0-1 level=MC [ 0.804096] groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 } [ 0.804152] domain-1: span=0-3 level=NUMA [ 0.804170] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 } [ 0.804219] domain-2: span=0-5 level=NUMA [ 0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 } [ 0.804302] ERROR: groups don't span domain->span [ 0.804520] domain-3: span=0-7 level=NUMA [ 0.804546] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 } [ 0.804677] CPU2 attaching sched-domain(s): [ 0.804687] domain-0: span=2-3 level=MC [ 0.804705] groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 } [ 0.804754] domain-1: span=0-3 level=NUMA [ 0.804772] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 } [ 0.804820] domain-2: span=0-5 level=NUMA [ 0.804836] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 } [ 0.804944] ERROR: groups don't span domain->span [ 0.805108] domain-3: span=0-7 level=NUMA [ 0.805134] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 } [ 0.805223] CPU3 attaching sched-domain(s): [ 0.805232] domain-0: span=2-3 level=MC [ 0.805249] groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 } [ 0.805319] domain-1: span=0-3 level=NUMA [ 0.805336] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 } [ 0.805383] domain-2: span=0-5 level=NUMA [ 0.805399] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 } [ 0.805458] ERROR: groups don't span domain->span [ 0.805605] domain-3: span=0-7 level=NUMA [ 0.805626] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 } [ 0.805712] CPU4 attaching sched-domain(s): [ 0.805721] domain-0: span=4-5 level=MC [ 0.805738] groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 } [ 0.805787] domain-1: span=4-7 level=NUMA [ 0.805803] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 } [ 0.805851] domain-2: span=0-1,4-7 level=NUMA [ 0.805867] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 } [ 0.805915] ERROR: groups don't span domain->span [ 0.806108] domain-3: span=0-7 level=NUMA [ 0.806130] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 } [ 0.806214] CPU5 attaching sched-domain(s): [ 0.806222] domain-0: span=4-5 level=MC [ 0.806240] groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 } [ 0.806841] domain-1: span=4-7 level=NUMA [ 0.806866] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 } [ 0.806934] domain-2: span=0-1,4-7 level=NUMA [ 0.806953] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 } [ 0.807004] ERROR: groups don't span domain->span [ 0.807312] domain-3: span=0-7 level=NUMA [ 0.807386] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 } [ 0.807686] CPU6 attaching sched-domain(s): [ 0.807710] domain-0: span=6-7 level=MC [ 0.807750] groups: 6:{ span=6 cap=1017 }, 7:{ 
span=7 cap=1012 } [ 0.807840] domain-1: span=4-7 level=NUMA [ 0.807870] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 } [ 0.807952] domain-2: span=0-1,4-7 level=NUMA [ 0.807985] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 } [ 0.808045] ERROR: groups don't span domain->span [ 0.808257] domain-3: span=0-7 level=NUMA [ 0.808571] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 } [ 0.808848] CPU7 attaching sched-domain(s): [ 0.808860] domain-0: span=6-7 level=MC [ 0.808880] groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 } [ 0.808953] domain-1: span=4-7 level=NUMA [ 0.808974] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 } [ 0.809034] domain-2: span=0-1,4-7 level=NUMA [ 0.809055] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 } [ 0.809128] ERROR: groups don't span domain->span [ 0.810361] domain-3: span=0-7 level=NUMA [ 0.810400] groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }
w/ patch, we don't get "groups don't span domain->span" any more: [ 1.486271] CPU0 attaching sched-domain(s): [ 1.486820] domain-0: span=0-1 level=MC [ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 } [ 1.515717] domain-1: span=0-3 level=NUMA [ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 } [ 1.516989] domain-2: span=0-5 level=NUMA [ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 } [ 1.517369] domain-3: span=0-7 level=NUMA [ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 } [ 1.520027] CPU1 attaching sched-domain(s): [ 1.520097] domain-0: span=0-1 level=MC [ 1.520184] groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 } [ 1.520429] domain-1: span=0-3 level=NUMA [ 1.520487] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 } [ 1.520687] domain-2: span=0-5 level=NUMA [ 1.520744] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 } [ 1.520948] domain-3: span=0-7 level=NUMA [ 1.521038] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 } [ 1.522068] CPU2 attaching sched-domain(s): [ 1.522348] domain-0: span=2-3 level=MC [ 1.522606] groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 } [ 1.522832] domain-1: span=0-3 level=NUMA [ 1.522885] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 } [ 1.523043] domain-2: span=0-5 level=NUMA [ 1.523092] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 } [ 1.523302] domain-3: span=0-7 level=NUMA [ 1.523352] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 } [ 1.523748] CPU3 attaching sched-domain(s): [ 1.523774] domain-0: span=2-3 level=MC [ 1.523825] groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 } [ 1.524009] domain-1: span=0-3 level=NUMA [ 1.524086] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 } [ 1.524281] domain-2: span=0-5 level=NUMA [ 1.524331] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 } [ 1.524534] domain-3: span=0-7 level=NUMA [ 1.524586] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 } [ 1.524847] CPU4 attaching sched-domain(s): [ 1.524873] domain-0: span=4-5 level=MC [ 1.524954] groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 } [ 1.525105] domain-1: span=4-7 level=NUMA [ 1.525153] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 } [ 1.525368] domain-2: span=0-1,4-7 level=NUMA [ 1.525428] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 } [ 1.532726] domain-3: span=0-7 level=NUMA [ 1.532811] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 } [ 1.534125] CPU5 attaching sched-domain(s): [ 1.534159] domain-0: span=4-5 level=MC [ 1.534303] groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 } [ 1.534490] domain-1: span=4-7 level=NUMA [ 1.534572] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 } [ 1.534734] domain-2: span=0-1,4-7 level=NUMA [ 1.534783] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 } [ 1.536057] domain-3: span=0-7 level=NUMA [ 1.536430] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 } [ 1.536815] CPU6 attaching sched-domain(s): [ 1.536846] domain-0: span=6-7 level=MC [ 1.536934] groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 } [ 1.537144] domain-1: span=4-7 level=NUMA [ 1.537262] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 } [ 1.537553] domain-2: span=0-1,4-7 level=NUMA [ 1.537613] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 } [ 1.537872] domain-3: span=0-7 level=NUMA [ 1.537998] groups: 6:{ 
span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 } [ 1.538448] CPU7 attaching sched-domain(s): [ 1.538505] domain-0: span=6-7 level=MC [ 1.538586] groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 } [ 1.538746] domain-1: span=4-7 level=NUMA [ 1.538798] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 } [ 1.539048] domain-2: span=0-1,4-7 level=NUMA [ 1.539111] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 } [ 1.539571] domain-3: span=0-7 level=NUMA [ 1.539610] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
Signed-off-by: Barry Song song.bao.hua@hisilicon.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Ingo Molnar mingo@kernel.org Reviewed-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Meelis Roos mroos@linux.ee Link: https://lkml.kernel.org/r/20210224030944.15232-1-song.bao.hua@hisilicon.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com Reviewed-by: Cheng Jian cj.chengjian@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/sched/topology.c | 91 +++++++++++++++++++++++++++-------------- 1 file changed, 61 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 23c20aa5391f..287f763656fc 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -736,35 +736,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) for (tmp = sd; tmp; tmp = tmp->parent) numa_distance += !!(tmp->flags & SD_NUMA);
- /* - * FIXME: Diameter >=3 is misrepresented. - * - * Smallest diameter=3 topology is: - * - * node 0 1 2 3 - * 0: 10 20 30 40 - * 1: 20 10 20 30 - * 2: 30 20 10 20 - * 3: 40 30 20 10 - * - * 0 --- 1 --- 2 --- 3 - * - * NUMA-3 0-3 N/A N/A 0-3 - * groups: {0-2},{1-3} {1-3},{0-2} - * - * NUMA-2 0-2 0-3 0-3 1-3 - * groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} {2-3},{0-2} - * - * NUMA-1 0-1 0-2 1-3 2-3 - * groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2} - * - * NUMA-0 0 1 2 3 - * - * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the - * group span isn't a subset of the domain span. - */ - WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n"); - sched_domain_debug(sd, cpu);
rq_attach_root(rq, rd); @@ -995,6 +966,31 @@ static void init_overlap_sched_group(struct sched_domain *sd, sg->sgc->max_capacity = SCHED_CAPACITY_SCALE; }
+static struct sched_domain * +find_descended_sibling(struct sched_domain *sd, struct sched_domain *sibling) +{ + /* + * The proper descendant would be the one whose child won't span out + * of sd + */ + while (sibling->child && + !cpumask_subset(sched_domain_span(sibling->child), + sched_domain_span(sd))) + sibling = sibling->child; + + /* + * As we are referencing sgc across different topology level, we need + * to go down to skip those sched_domains which don't contribute to + * scheduling because they will be degenerated in cpu_attach_domain + */ + while (sibling->child && + cpumask_equal(sched_domain_span(sibling->child), + sched_domain_span(sibling))) + sibling = sibling->child; + + return sibling; +} + static int build_overlap_sched_groups(struct sched_domain *sd, int cpu) { @@ -1028,6 +1024,41 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) if (!cpumask_test_cpu(i, sched_domain_span(sibling))) continue;
+ /* + * Usually we build sched_group by sibling's child sched_domain + * But for machines whose NUMA diameter are 3 or above, we move + * to build sched_group by sibling's proper descendant's child + * domain because sibling's child sched_domain will span out of + * the sched_domain being built as below. + * + * Smallest diameter=3 topology is: + * + * node 0 1 2 3 + * 0: 10 20 30 40 + * 1: 20 10 20 30 + * 2: 30 20 10 20 + * 3: 40 30 20 10 + * + * 0 --- 1 --- 2 --- 3 + * + * NUMA-3 0-3 N/A N/A 0-3 + * groups: {0-2},{1-3} {1-3},{0-2} + * + * NUMA-2 0-2 0-3 0-3 1-3 + * groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} {2-3},{0-2} + * + * NUMA-1 0-1 0-2 1-3 2-3 + * groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2} + * + * NUMA-0 0 1 2 3 + * + * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the + * group span isn't a subset of the domain span. + */ + if (sibling->child && + !cpumask_subset(sched_domain_span(sibling->child), span)) + sibling = find_descended_sibling(sd, sibling); + sg = build_group_from_child_sched_domain(sibling, cpu); if (!sg) goto fail; @@ -1035,7 +1066,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) sg_span = sched_group_span(sg); cpumask_or(covered, covered, sg_span);
- init_overlap_sched_group(sd, sg); + init_overlap_sched_group(sibling, sg);
if (!first) first = sg;
From: Valentin Schneider valentin.schneider@arm.com
mainline inclusion
from mainline-v5.12-rc1
commit 620a6dc40754dc218f5b6389b5d335e9a107fd29
category: bugfix
bugzilla: 182847 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------
The deduplicating sort in sched_init_numa() assumes that the first line in the distance table contains all unique values in the entire table. I've been trying to pen what this exactly means for the topology, but it's not straightforward. For instance, topology.c uses this example:
 node   0   1   2   3
   0:  10  20  20  30
   1:  20  10  20  20
   2:  20  20  10  20
   3:  30  20  20  10

  0 ----- 1
  |     / |
  |    /  |
  |   /   |
  2 ----- 3
Which works out just fine. However, if we swap nodes 0 and 1:
  1 ----- 0
  |     / |
  |    /  |
  |   /   |
  2 ----- 3
we get this distance table:
 node   0   1   2   3
   0:  10  20  20  20
   1:  20  10  20  30
   2:  20  20  10  20
   3:  20  30  20  10
Which breaks the deduplicating sort (non-representative first line). In this case this would just be a renumbering exercise, but it so happens that we can have a deduplicating sort that goes through the whole table in O(n²) at the extra cost of a temporary memory allocation (i.e. any form of set).
The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following this, implement the set as a 256-bits bitmap. Should this not be satisfactory (i.e. we want to support 32-bit values), then we'll have to go for some other sparse set implementation.
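To make the idea concrete, here is a stand-alone sketch of the same dedup-via-set approach (hypothetical helper, plain array instead of the kernel bitmap API): mark every distance seen anywhere in the table, then count the marks.

#include <stdbool.h>
#include <stdint.h>

#define NR_DISTANCE_VALUES 256	/* SLIT distances fit in 8 bits */

static int count_unique_distances(const uint8_t *dist, int nr_nodes)
{
	bool seen[NR_DISTANCE_VALUES] = { false };
	int i, j, nr_levels = 0;

	/* O(nr_nodes^2): walk the whole table, not just row 0. */
	for (i = 0; i < nr_nodes; i++)
		for (j = 0; j < nr_nodes; j++)
			seen[dist[i * nr_nodes + j]] = true;

	for (i = 0; i < NR_DISTANCE_VALUES; i++)
		nr_levels += seen[i];

	return nr_levels;
}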
This has the added benefit of letting us allocate just the right amount of memory for sched_domains_numa_distance[], rather than an arbitrary (nr_node_ids + 1).
Note: DT binding equivalent (distance-map) decodes distances as 32-bit values.
Signed-off-by: Valentin Schneider valentin.schneider@arm.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com Reviewed-by: Cheng Jian cj.chengjian@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/topology.h | 1 + kernel/sched/topology.c | 99 +++++++++++++++++++--------------------- 2 files changed, 49 insertions(+), 51 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h index ad03df1cc266..7634cd737061 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -48,6 +48,7 @@ int arch_update_cpu_topology(void); /* Conform to ACPI 2.0 SLIT distance definitions */ #define LOCAL_DISTANCE 10 #define REMOTE_DISTANCE 20 +#define DISTANCE_BITS 8 #ifndef node_distance #define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE) #endif diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 287f763656fc..9212d1d432db 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1669,66 +1669,58 @@ static void check_node_limit(void) static inline void check_node_limit(void) { } #endif /* CONFIG_SCHED_STEAL */
+ +#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS) + void sched_init_numa(void) { - int next_distance, curr_distance = node_distance(0, 0); struct sched_domain_topology_level *tl; - int level = 0; - int i, j, k; - - sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL); - if (!sched_domains_numa_distance) - return; - - /* Includes NUMA identity node at level 0. */ - sched_domains_numa_distance[level++] = curr_distance; - sched_domains_numa_levels = level; + unsigned long *distance_map; + int nr_levels = 0; + int i, j;
/* * O(nr_nodes^2) deduplicating selection sort -- in order to find the * unique distances in the node_distance() table. - * - * Assumes node_distance(0,j) includes all distances in - * node_distance(i,j) in order to avoid cubic time. */ - next_distance = curr_distance; + distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL); + if (!distance_map) + return; + + bitmap_zero(distance_map, NR_DISTANCE_VALUES); for (i = 0; i < nr_node_ids; i++) { for (j = 0; j < nr_node_ids; j++) { - for (k = 0; k < nr_node_ids; k++) { - int distance = node_distance(i, k); - - if (distance > curr_distance && - (distance < next_distance || - next_distance == curr_distance)) - next_distance = distance; - - /* - * While not a strong assumption it would be nice to know - * about cases where if node A is connected to B, B is not - * equally connected to A. - */ - if (sched_debug() && node_distance(k, i) != distance) - sched_numa_warn("Node-distance not symmetric"); + int distance = node_distance(i, j);
- if (sched_debug() && i && !find_numa_distance(distance)) - sched_numa_warn("Node-0 not representative"); + if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) { + sched_numa_warn("Invalid distance value range"); + return; } - if (next_distance != curr_distance) { - sched_domains_numa_distance[level++] = next_distance; - sched_domains_numa_levels = level; - curr_distance = next_distance; - } else break; + + bitmap_set(distance_map, distance, 1); } + } + /* + * We can now figure out how many unique distance values there are and + * allocate memory accordingly. + */ + nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
- /* - * In case of sched_debug() we verify the above assumption. - */ - if (!sched_debug()) - break; + sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int), GFP_KERNEL); + if (!sched_domains_numa_distance) { + bitmap_free(distance_map); + return; + } + + for (i = 0, j = 0; i < nr_levels; i++, j++) { + j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j); + sched_domains_numa_distance[i] = j; }
+ bitmap_free(distance_map); + /* - * 'level' contains the number of unique distances + * 'nr_levels' contains the number of unique distances * * The sched_domains_numa_distance[] array includes the actual distance * numbers. @@ -1737,15 +1729,15 @@ void sched_init_numa(void) /* * Here, we should temporarily reset sched_domains_numa_levels to 0. * If it fails to allocate memory for array sched_domains_numa_masks[][], - * the array will contain less then 'level' members. This could be + * the array will contain less then 'nr_levels' members. This could be * dangerous when we use it to iterate array sched_domains_numa_masks[][] * in other functions. * - * We reset it to 'level' at the end of this function. + * We reset it to 'nr_levels' at the end of this function. */ sched_domains_numa_levels = 0;
- sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); + sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL); if (!sched_domains_numa_masks) return;
@@ -1753,7 +1745,7 @@ void sched_init_numa(void) * Now for each level, construct a mask per node which contains all * CPUs of nodes that are that many hops away from us. */ - for (i = 0; i < level; i++) { + for (i = 0; i < nr_levels; i++) { sched_domains_numa_masks[i] = kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL); if (!sched_domains_numa_masks[i]) @@ -1761,12 +1753,17 @@ void sched_init_numa(void)
for (j = 0; j < nr_node_ids; j++) { struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL); + int k; + if (!mask) return;
sched_domains_numa_masks[i][j] = mask;
for_each_node(k) { + if (sched_debug() && (node_distance(j, k) != node_distance(k, j))) + sched_numa_warn("Node-distance not symmetric"); + if (node_distance(j, k) > sched_domains_numa_distance[i]) continue;
@@ -1778,7 +1775,7 @@ void sched_init_numa(void) /* Compute default topology size */ for (i = 0; sched_domain_topology[i].mask; i++);
- tl = kzalloc((i + level + 1) * + tl = kzalloc((i + nr_levels) * sizeof(struct sched_domain_topology_level), GFP_KERNEL); if (!tl) return; @@ -1801,7 +1798,7 @@ void sched_init_numa(void) /* * .. and append 'j' levels of NUMA goodness. */ - for (j = 1; j < level; i++, j++) { + for (j = 1; j < nr_levels; i++, j++) { tl[i] = (struct sched_domain_topology_level){ .mask = sd_numa_mask, .sd_flags = cpu_numa_flags, @@ -1813,8 +1810,8 @@ void sched_init_numa(void)
sched_domain_topology = tl;
- sched_domains_numa_levels = level; - sched_max_numa_distance = sched_domains_numa_distance[level - 1]; + sched_domains_numa_levels = nr_levels; + sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
init_numa_topology_type(); check_node_limit();
From: Valentin Schneider valentin.schneider@arm.com
mainline inclusion
from mainline-v5.13-rc1
commit b22a8f7b4bde4e4ab73b64908ffd5d90ecdcdbfd
category: bugfix
bugzilla: 182847 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------
John Paul reported a warning about bogus NUMA distance values spurred by commit:
620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for the deduplicating sort")
In this case, the afflicted machine comes up with a reported 256 possible nodes, all of which are 0 distance away from one another. This was previously silently ignored, but is now caught by the aforementioned commit.
The culprit is ia64's node_possible_map which remains unchanged from its initialization value of NODE_MASK_ALL. In John's case, the machine doesn't have any SRAT nor SLIT table, but AIUI the possible map remains untouched regardless of what ACPI tables end up being parsed. Thus, !online && possible nodes remain with a bogus distance of 0 (distances \in [0, 9] are "reserved and have no meaning" as per the ACPI spec).
Follow x86 / drivers/base/arch_numa's example and set the possible map to the parsed map, which in this case seems to be the online map.
Link: http://lore.kernel.org/r/255d6b5d-194e-eb0e-ecdd-97477a534441@physik.fu-berl... Link: https://lkml.kernel.org/r/20210318130617.896309-1-valentin.schneider@arm.com Fixes: 620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for the deduplicating sort") Signed-off-by: Valentin Schneider valentin.schneider@arm.com Reported-by: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de Tested-by: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de Tested-by: Sergei Trofimovich slyfox@gentoo.org Cc: "Peter Zijlstra (Intel)" peterz@infradead.org Cc: Ingo Molnar mingo@kernel.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Dietmar Eggemann dietmar.eggemann@arm.com Cc: Anatoly Pugachev matorola@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Jialin Zhang zhangjialin11@huawei.com Reviewed-by: Cheng Jian cj.chengjian@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/ia64/kernel/acpi.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c index a5636524af76..e2af6b172200 100644 --- a/arch/ia64/kernel/acpi.c +++ b/arch/ia64/kernel/acpi.c @@ -446,7 +446,8 @@ void __init acpi_numa_fixup(void) if (srat_num_cpus == 0) { node_set_online(0); node_cpuid[0].phys_id = hard_smp_processor_id(); - return; + slit_distance(0, 0) = LOCAL_DISTANCE; + goto out; }
/* @@ -489,7 +490,7 @@ void __init acpi_numa_fixup(void) for (j = 0; j < MAX_NUMNODES; j++) slit_distance(i, j) = i == j ? LOCAL_DISTANCE : REMOTE_DISTANCE; - return; + goto out; }
memset(numa_slit, -1, sizeof(numa_slit)); @@ -514,6 +515,8 @@ void __init acpi_numa_fixup(void) printk("\n"); } #endif +out: + node_possible_map = node_online_map; } #endif /* CONFIG_ACPI_NUMA */
From: Laibin Qiu qiulaibin@huawei.com
hulk inclusion
category: bugfix
bugzilla: 182674 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, we call blk_mq_free_rq_map(). But if BLK_MQ_F_TAG_HCTX_SHARED has been set in hctx->flags, tags->bitmap_tags and tags->breserved_tags will not be released, causing a memory leak.
Fixes: 31044c2e5b8de ("blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling") Signed-off-by: Laibin Qiu qiulaibin@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-mq-sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index d495c5fe5e6d..090aa0416443 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -521,7 +521,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests); if (ret) { - blk_mq_free_rq_map(hctx->sched_tags, set->flags); + blk_mq_free_rq_map(hctx->sched_tags, flags); hctx->sched_tags = NULL; }
From: Laibin Qiu qiulaibin@huawei.com
hulk inclusion
category: bugfix
bugzilla: 181879 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------------------------
KASAN reports a use-after-free report when doing fuzz test:
[693354.104835] ================================================================== [693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160 [693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338
[693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147 [693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018 [693354.105612] Call Trace: [693354.105621] dump_stack+0xf1/0x19b [693354.105626] ? show_regs_print_info+0x5/0x5 [693354.105634] ? printk+0x9c/0xc3 [693354.105638] ? cpumask_weight+0x1f/0x1f [693354.105648] print_address_description+0x70/0x360 [693354.105654] kasan_report+0x1b2/0x330 [693354.105659] ? bfq_io_set_weight_legacy+0xd3/0x160 [693354.105665] ? bfq_io_set_weight_legacy+0xd3/0x160 [693354.105670] bfq_io_set_weight_legacy+0xd3/0x160 [693354.105675] ? bfq_cpd_init+0x20/0x20 [693354.105683] cgroup_file_write+0x3aa/0x510 [693354.105693] ? ___slab_alloc+0x507/0x540 [693354.105698] ? cgroup_file_poll+0x60/0x60 [693354.105702] ? 0xffffffff89600000 [693354.105708] ? usercopy_abort+0x90/0x90 [693354.105716] ? mutex_lock+0xef/0x180 [693354.105726] kernfs_fop_write+0x1ab/0x280 [693354.105732] ? cgroup_file_poll+0x60/0x60 [693354.105738] vfs_write+0xe7/0x230 [693354.105744] ksys_write+0xb0/0x140 [693354.105749] ? __ia32_sys_read+0x50/0x50 [693354.105760] do_syscall_64+0x112/0x370 [693354.105766] ? syscall_return_slowpath+0x260/0x260 [693354.105772] ? do_page_fault+0x9b/0x270 [693354.105779] ? prepare_exit_to_usermode+0xf9/0x1a0 [693354.105784] ? enter_from_user_mode+0x30/0x30 [693354.105793] entry_SYSCALL_64_after_hwframe+0x65/0xca
[693354.105875] Allocated by task 1453337: [693354.106001] kasan_kmalloc+0xa0/0xd0 [693354.106006] kmem_cache_alloc_node_trace+0x108/0x220 [693354.106010] bfq_pd_alloc+0x96/0x120 [693354.106015] blkcg_activate_policy+0x1b7/0x2b0 [693354.106020] bfq_create_group_hierarchy+0x1e/0x80 [693354.106026] bfq_init_queue+0x678/0x8c0 [693354.106031] blk_mq_init_sched+0x1f8/0x460 [693354.106037] elevator_switch_mq+0xe1/0x240 [693354.106041] elevator_switch+0x25/0x40 [693354.106045] elv_iosched_store+0x1a1/0x230 [693354.106049] queue_attr_store+0x78/0xb0 [693354.106053] kernfs_fop_write+0x1ab/0x280 [693354.106056] vfs_write+0xe7/0x230 [693354.106060] ksys_write+0xb0/0x140 [693354.106064] do_syscall_64+0x112/0x370 [693354.106069] entry_SYSCALL_64_after_hwframe+0x65/0xca
[693354.106114] Freed by task 1453336: [693354.106225] __kasan_slab_free+0x130/0x180 [693354.106229] kfree+0x90/0x1b0 [693354.106233] blkcg_deactivate_policy+0x12c/0x220 [693354.106238] bfq_exit_queue+0xf5/0x110 [693354.106241] blk_mq_exit_sched+0x104/0x130 [693354.106245] __elevator_exit+0x45/0x60 [693354.106249] elevator_switch_mq+0xd6/0x240 [693354.106253] elevator_switch+0x25/0x40 [693354.106257] elv_iosched_store+0x1a1/0x230 [693354.106261] queue_attr_store+0x78/0xb0 [693354.106264] kernfs_fop_write+0x1ab/0x280 [693354.106268] vfs_write+0xe7/0x230 [693354.106271] ksys_write+0xb0/0x140 [693354.106275] do_syscall_64+0x112/0x370 [693354.106280] entry_SYSCALL_64_after_hwframe+0x65/0xca
[693354.106329] The buggy address belongs to the object at ffff888be0a35580 which belongs to the cache kmalloc-1k of size 1024 [693354.106736] The buggy address is located 228 bytes inside of 1024-byte region [ffff888be0a35580, ffff888be0a35980) [693354.107114] The buggy address belongs to the page: [693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0 [693354.107606] flags: 0x17ffffc0008100(slab|head) [693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080 [693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000 [693354.108278] page dumped because: kasan: bad access detected
[693354.108511] Memory state around the buggy address: [693354.108671] ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [693354.116396] ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.132421] ^ [693354.140284] ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.147912] ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.155281] ==================================================================
blkgs are protected by both queue and blkcg locks and holding either should stabilize them. However, the path of destroying blkg policy data is only protected by queue lock in blkcg_activate_policy()/blkcg_deactivate_policy(). Other tasks can get the blkg policy data before the blkg policy data is destroyed, and use it after destroyed, which will result in a use-after-free.
CPU0                                   CPU1
blkcg_deactivate_policy
  spin_lock_irq(&q->queue_lock)
                                       bfq_io_set_weight_legacy
                                         spin_lock_irq(&blkcg->lock)
                                         blkg_to_bfqg(blkg)
                                           pd_to_bfqg(blkg->pd[pol->plid])
                                           ^^^^^^ blkg->pd[pol->plid] != NULL
                                                  bfqg != NULL
  pol->pd_free_fn(blkg->pd[pol->plid])
    pd_to_bfqg(blkg->pd[pol->plid])
      bfqg_put(bfqg)
        kfree(bfqg)
  blkg->pd[pol->plid] = NULL
  spin_unlock_irq(q->queue_lock);
                                         bfq_group_set_weight(bfqg, val, 0)
                                           bfqg->entity.new_weight
                                           ^^^^^^ trigger uaf here
                                         spin_unlock_irq(&blkcg->lock);
Fix by grabbing the matching blkcg lock before trying to destroy blkg policy data.
Signed-off-by: Li Jinlin lijinlin3@huawei.com Link: https://lore.kernel.org/linux-block/YUDLt9uBNLhWL6Gt@slm.duckdns.org/T/#r07c...
Conflicts: block/blk-cgroup.c
Signed-off-by: Laibin Qiu qiulaibin@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- block/blk-cgroup.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index f13688c4b931..5b19665bc486 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1387,10 +1387,14 @@ int blkcg_activate_policy(struct request_queue *q, /* alloc failed, nothing's initialized yet, free everything */ spin_lock_irq(&q->queue_lock); list_for_each_entry(blkg, &q->blkg_list, q_node) { + struct blkcg *blkcg = blkg->blkcg; + + spin_lock(&blkcg->lock); if (blkg->pd[pol->plid]) { pol->pd_free_fn(blkg->pd[pol->plid]); blkg->pd[pol->plid] = NULL; } + spin_unlock(&blkcg->lock); } spin_unlock_irq(&q->queue_lock); ret = -ENOMEM; @@ -1422,12 +1426,16 @@ void blkcg_deactivate_policy(struct request_queue *q, __clear_bit(pol->plid, q->blkcg_pols);
list_for_each_entry(blkg, &q->blkg_list, q_node) { + struct blkcg *blkcg = blkg->blkcg; + + spin_lock(&blkcg->lock); if (blkg->pd[pol->plid]) { if (pol->pd_offline_fn) pol->pd_offline_fn(blkg->pd[pol->plid]); pol->pd_free_fn(blkg->pd[pol->plid]); blkg->pd[pol->plid] = NULL; } + spin_unlock(&blkcg->lock); }
spin_unlock_irq(&q->queue_lock);
From: Christian Brauner christian.brauner@ubuntu.com
mainline inclusion
from mainline-5.14-rc2
commit d1d488d813703618f0dd93f0e4c4a05928114aa8
category: bugfix
bugzilla: 172156 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Add a simple helper that filesystems can use in their parameter parser to parse the "source" parameter. A few places open-coded this function and that already caused a bug in the cgroup v1 parser that we fixed. Let's make it harder to get this wrong by introducing a helper which performs all necessary checks.
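As a usage illustration (hypothetical filesystem name; the pattern mirrors the callers converted below), a parameter parser tries the helper first and only continues when it returns -ENOPARAM:

static int examplefs_parse_param(struct fs_context *fc,
				 struct fs_parameter *param)
{
	int ret;

	/*
	 * Handles (or rejects) the "source" key; -ENOPARAM means "not the
	 * source parameter, keep parsing".
	 */
	ret = vfs_parse_fs_param_source(fc, param);
	if (ret != -ENOPARAM)
		return ret;

	/* ... filesystem-specific keys would be handled here ... */
	return -ENOPARAM;
}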
Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b1... Cc: Christoph Hellwig hch@lst.de Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Dmitry Vyukov dvyukov@google.com Signed-off-by: Christian Brauner christian.brauner@ubuntu.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: include/linux/fs_context.h
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fs_context.c | 54 +++++++++++++++++++++++++------------- include/linux/fs_context.h | 2 ++ kernel/cgroup/cgroup-v1.c | 14 ++++------ 3 files changed, 43 insertions(+), 27 deletions(-)
diff --git a/fs/fs_context.c b/fs/fs_context.c index 4858645ca620..b7e43a780a62 100644 --- a/fs/fs_context.c +++ b/fs/fs_context.c @@ -79,6 +79,35 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key) return -ENOPARAM; }
+/** + * vfs_parse_fs_param_source - Handle setting "source" via parameter + * @fc: The filesystem context to modify + * @param: The parameter + * + * This is a simple helper for filesystems to verify that the "source" they + * accept is sane. + * + * Returns 0 on success, -ENOPARAM if this is not "source" parameter, and + * -EINVAL otherwise. In the event of failure, supplementary error information + * is logged. + */ +int vfs_parse_fs_param_source(struct fs_context *fc, struct fs_parameter *param) +{ + if (strcmp(param->key, "source") != 0) + return -ENOPARAM; + + if (param->type != fs_value_is_string) + return invalf(fc, "Non-string source"); + + if (fc->source) + return invalf(fc, "Multiple sources"); + + fc->source = param->string; + param->string = NULL; + return 0; +} +EXPORT_SYMBOL(vfs_parse_fs_param_source); + /** * vfs_parse_fs_param - Add a single parameter to a superblock config * @fc: The filesystem context to modify @@ -122,15 +151,9 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param) /* If the filesystem doesn't take any arguments, give it the * default handling of source. */ - if (strcmp(param->key, "source") == 0) { - if (param->type != fs_value_is_string) - return invalf(fc, "VFS: Non-string source"); - if (fc->source) - return invalf(fc, "VFS: Multiple sources"); - fc->source = param->string; - param->string = NULL; - return 0; - } + ret = vfs_parse_fs_param_source(fc, param); + if (ret != -ENOPARAM) + return ret;
return invalf(fc, "%s: Unknown parameter '%s'", fc->fs_type->name, param->key); @@ -504,16 +527,11 @@ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param) struct legacy_fs_context *ctx = fc->fs_private; unsigned int size = ctx->data_size; size_t len = 0; + int ret;
- if (strcmp(param->key, "source") == 0) { - if (param->type != fs_value_is_string) - return invalf(fc, "VFS: Legacy: Non-string source"); - if (fc->source) - return invalf(fc, "VFS: Legacy: Multiple sources"); - fc->source = param->string; - param->string = NULL; - return 0; - } + ret = vfs_parse_fs_param_source(fc, param); + if (ret != -ENOPARAM) + return ret;
if (ctx->param_type == LEGACY_FS_MONOLITHIC_PARAMS) return invalf(fc, "VFS: Legacy: Can't mix monolithic and individual options"); diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h index 5b44b0195a28..6b54982fc5f3 100644 --- a/include/linux/fs_context.h +++ b/include/linux/fs_context.h @@ -139,6 +139,8 @@ extern int vfs_parse_fs_string(struct fs_context *fc, const char *key, extern int generic_parse_monolithic(struct fs_context *fc, void *data); extern int vfs_get_tree(struct fs_context *fc); extern void put_fs_context(struct fs_context *fc); +extern int vfs_parse_fs_param_source(struct fs_context *fc, + struct fs_parameter *param); extern void fc_drop_locked(struct fs_context *fc);
/* diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 812dd4ed0129..88b72815aa9d 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -915,15 +915,11 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
opt = fs_parse(fc, cgroup1_fs_parameters, param, &result); if (opt == -ENOPARAM) { - if (strcmp(param->key, "source") == 0) { - if (param->type != fs_value_is_string) - return invalf(fc, "Non-string source"); - if (fc->source) - return invalf(fc, "Multiple sources not supported"); - fc->source = param->string; - param->string = NULL; - return 0; - } + int ret; + + ret = vfs_parse_fs_param_source(fc, param); + if (ret != -ENOPARAM) + return ret; for_each_subsys(ss, i) { if (strcmp(param->key, ss->legacy_name)) continue;
From: yangerkun yangerkun@huawei.com
hulk inclusion category: bugfix bugzilla: 172156 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
ramfs_parse_param() does not parse the "source" key and converts -ENOPARAM to 0. This skips vfs_parse_fs_param_source() in vfs_parse_fs_param(), which means the mount source for ramfs is always "none". Fix it by parsing "source" in ramfs_parse_param(), as cgroup1_parse_param() does.
Link: https://lkml.kernel.org/r/20210924091756.1906118-1-yangerkun@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Cc: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Stephen Rothwell sfr@canb.auug.org.au Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ramfs/inode.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index ee179a81b3da..0163dcff9318 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -193,17 +193,20 @@ static int ramfs_parse_param(struct fs_context *fc, struct fs_parameter *param) int opt;
opt = fs_parse(fc, ramfs_fs_parameters, param, &result); - if (opt < 0) { + if (opt == -ENOPARAM) { + opt = vfs_parse_fs_param_source(fc, param); + if (opt != -ENOPARAM) + return opt; /* * We might like to report bad mount options here; * but traditionally ramfs has ignored all mount options, * and as it is used as a !CONFIG_SHMEM simple substitute * for tmpfs, better continue to ignore other mount options. */ - if (opt == -ENOPARAM) - opt = 0; - return opt; + return 0; } + if (opt < 0) + return opt;
switch (opt) { case Opt_mode:
From: Zhang Jianhua chris.zjh@huawei.com
hulk inclusion category: bugfix bugzilla: 182892 https://gitee.com/openeuler/kernel/issues/I4DDEL
--------
When rebooting the system via sysrq, the following bug occurs.
BUG: sleeping function called from invalid context at kernel/locking/semaphore.c:90
in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 10052, name: rc.shutdown
CPU: 3 PID: 10052 Comm: rc.shutdown Tainted: G W O 5.10.0 #1
Call trace:
 dump_backtrace+0x0/0x1c8
 show_stack+0x18/0x28
 dump_stack+0xd0/0x110
 ___might_sleep+0x14c/0x160
 __might_sleep+0x74/0x88
 down_interruptible+0x40/0x118
 virt_efi_reset_system+0x3c/0xd0
 efi_reboot+0xd4/0x11c
 machine_restart+0x60/0x9c
 emergency_restart+0x1c/0x2c
 sysrq_handle_reboot+0x1c/0x2c
 __handle_sysrq+0xd0/0x194
 write_sysrq_trigger+0xbc/0xe4
 proc_reg_write+0xd4/0xf0
 vfs_write+0xa8/0x148
 ksys_write+0x6c/0xd8
 __arm64_sys_write+0x18/0x28
 el0_svc_common.constprop.3+0xe4/0x16c
 do_el0_svc+0x1c/0x2c
 el0_svc+0x20/0x30
 el0_sync_handler+0x80/0x17c
 el0_sync+0x158/0x180
The reason for this problem is that interrupts have already been disabled in machine_restart() before virt_efi_reset_system() calls down_interruptible(), which may sleep with interrupts disabled, and that is dangerous. Commit 99409b935c9a ("locking/semaphore: Add might_sleep() to down_*() family") added might_sleep() to down_interruptible(), which is why the warning is triggered here. down_trylock() solves this problem because it never sleeps and therefore contains no might_sleep().
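A minimal sketch of the rule the fix relies on (the semaphore and function below are hypothetical, only the locking pattern matters): once interrupts are disabled, only the non-sleeping down_trylock() variant may be used.

#include <linux/semaphore.h>
#include <linux/printk.h>

static DEFINE_SEMAPHORE(example_lock);	/* hypothetical lock */

/* Called with interrupts already disabled, as in machine_restart(). */
static void example_atomic_section(void)
{
	/*
	 * down() / down_interruptible() may sleep (and now contain
	 * might_sleep()), so they must not be called here.
	 * down_trylock() never sleeps: it returns 0 on success and
	 * non-zero if the semaphore is already held.
	 */
	if (down_trylock(&example_lock)) {
		pr_warn("example: lock busy, giving up\n");
		return;
	}

	/* ... critical section ... */

	up(&example_lock);
}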
--------
Signed-off-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: Chen Lifu chenlifu@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Reviewed-by: Liao Chang liaochang1@huawei.com Reviewed-by: He Ying heying24@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/firmware/efi/runtime-wrappers.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c index 1410beaef5c3..f3e54f6616f0 100644 --- a/drivers/firmware/efi/runtime-wrappers.c +++ b/drivers/firmware/efi/runtime-wrappers.c @@ -414,7 +414,7 @@ static void virt_efi_reset_system(int reset_type, unsigned long data_size, efi_char16_t *data) { - if (down_interruptible(&efi_runtime_lock)) { + if (down_trylock(&efi_runtime_lock)) { pr_warn("failed to invoke the reset_system() runtime service:\n" "could not get exclusive access to the firmware\n"); return;
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-5.15-rc4 commit 4df031ff5876d94b48dd9ee486ba5522382a06b2 category: perf bugzilla: 182881 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <= isize"), i_disksize is always updated to i_size in ext4_setattr(), and since we hold the inode lock we can be sure that i_disksize <= i_size; if i_disksize < i_size, there are delalloc writes pending in the range up to i_size. If the end of the current write is <= i_size, there is no need to touch i_disksize since writeback will push i_disksize up to i_size eventually. So we can switch to checking i_size instead of i_disksize in ext4_da_write_end() when writing to the end of the file. We can also remove ext4_mark_inode_dirty() because inode dirtying is deferred to generic_write_end() or ext4_da_write_inline_data_end().
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/inode.c | 34 ++++++++++++++++++---------------- 1 file changed, 18 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a054f07a63a6..558e0d8dd526 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3088,35 +3088,37 @@ static int ext4_da_write_end(struct file *file, end = start + copied - 1;
/* - * generic_write_end() will run mark_inode_dirty() if i_size - * changes. So let's piggyback the i_disksize mark_inode_dirty - * into that. + * Since we are holding inode lock, we are sure i_disksize <= + * i_size. We also know that if i_disksize < i_size, there are + * delalloc writes pending in the range upto i_size. If the end of + * the current write is <= i_size, there's no need to touch + * i_disksize since writeback will push i_disksize upto i_size + * eventually. If the end of the current write is > i_size and + * inside an allocated block (ext4_da_should_update_i_disksize() + * check), we need to update i_disksize here as neither + * ext4_writepage() nor certain ext4_writepages() paths not + * allocating blocks update i_disksize. + * + * Note that we defer inode dirtying to generic_write_end() / + * ext4_da_write_inline_data_end(). */ new_i_size = pos + copied; - if (copied && new_i_size > EXT4_I(inode)->i_disksize) { + if (copied && new_i_size > inode->i_size) { if (ext4_has_inline_data(inode) || - ext4_da_should_update_i_disksize(page, end)) { + ext4_da_should_update_i_disksize(page, end)) ext4_update_i_disksize(inode, new_i_size); - /* We need to mark inode dirty even if - * new_i_size is less that inode->i_size - * bu greater than i_disksize.(hint delalloc) - */ - ret = ext4_mark_inode_dirty(handle, inode); - } }
if (write_mode != CONVERT_INLINE_DATA && ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) && ext4_has_inline_data(inode)) - ret2 = ext4_da_write_inline_data_end(inode, pos, len, copied, + ret = ext4_da_write_inline_data_end(inode, pos, len, copied, page); else - ret2 = generic_write_end(file, mapping, pos, len, copied, + ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
- copied = ret2; - if (ret2 < 0) - ret = ret2; + copied = ret; ret2 = ext4_journal_stop(handle); if (unlikely(ret2 && !ret)) ret = ret2;
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-5.15-rc4 commit 55ce2f649b9e88111270333a8127e23f4f8f42d7 category: perf bugzilla: 182881 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
The current error path of ext4_write_inline_data_end() is not correct.
Firstly, it should pass the error value out if ext4_get_inode_loc() fails, otherwise it could trigger an infinite loop if we inject an error here. Secondly, it is better to add the inode to the orphan list if ext4_journal_stop() fails, otherwise we could not restore the inline xattr entry after a power failure. Finally, we need to reset the 'ret' value if ext4_write_inline_data_end() succeeds in ext4_write_end() and ext4_journalled_write_end(), otherwise we could not get the error return value of ext4_journal_stop().
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/inline.c | 15 +++++---------- fs/ext4/inode.c | 7 +++++-- 2 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 0f7b53d5edea..a96b688a0410 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -733,18 +733,13 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len, void *kaddr; struct ext4_iloc iloc;
- if (unlikely(copied < len)) { - if (!PageUptodate(page)) { - copied = 0; - goto out; - } - } + if (unlikely(copied < len) && !PageUptodate(page)) + return 0;
ret = ext4_get_inode_loc(inode, &iloc); if (ret) { ext4_std_error(inode->i_sb, ret); - copied = 0; - goto out; + return ret; }
ext4_write_lock_xattr(inode, &no_expand); @@ -757,7 +752,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len, (void) ext4_find_inline_data_nolock(inode);
kaddr = kmap_atomic(page); - ext4_write_inline_data(inode, &iloc, kaddr, pos, len); + ext4_write_inline_data(inode, &iloc, kaddr, pos, copied); kunmap_atomic(kaddr); SetPageUptodate(page); /* clear page dirty so that writepages wouldn't work for us. */ @@ -766,7 +761,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len, ext4_write_unlock_xattr(inode, &no_expand); brelse(iloc.bh); mark_inode_dirty(inode); -out: + return copied; }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 558e0d8dd526..d9b20649dc23 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1299,6 +1299,7 @@ static int ext4_write_end(struct file *file, goto errout; } copied = ret; + ret = 0; } else copied = block_write_end(file, mapping, pos, len, copied, page, fsdata); @@ -1325,13 +1326,14 @@ static int ext4_write_end(struct file *file, if (i_size_changed || inline_data) ret = ext4_mark_inode_dirty(handle, inode);
+errout: if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode)) /* if we have allocated more blocks and copied * less. We will have blocks allocated outside * inode->i_size. So truncate them */ ext4_orphan_add(handle, inode); -errout: + ret2 = ext4_journal_stop(handle); if (!ret) ret = ret2; @@ -1414,6 +1416,7 @@ static int ext4_journalled_write_end(struct file *file, goto errout; } copied = ret; + ret = 0; } else if (unlikely(copied < len) && !PageUptodate(page)) { copied = 0; ext4_journalled_zero_new_buffers(handle, page, from, to); @@ -1443,6 +1446,7 @@ static int ext4_journalled_write_end(struct file *file, ret = ret2; }
+errout: if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode)) /* if we have allocated more blocks and copied * less. We will have blocks allocated outside @@ -1450,7 +1454,6 @@ static int ext4_journalled_write_end(struct file *file, */ ext4_orphan_add(handle, inode);
-errout: ret2 = ext4_journal_stop(handle); if (!ret) ret = ret2;
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-5.15-rc4 commit 6984aef59814fb5c47b0e30c56e101186b5ebf8c category: perf bugzilla: 182881 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
The inline_data write-end procedure is currently folded into the common write-end functions, which makes it hard to follow. Factor it out and do some cleanup. This patch also drops ext4_da_write_inline_data_end() and switches to ext4_write_inline_data_end() instead, because the same error processing is needed if writing data into the inline entry fails.
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
Conflicts: fs/ext4/inline.c Reviewed-by: Yang Erkun yangerkun@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/ext4.h | 3 -- fs/ext4/inline.c | 128 +++++++++++++++++++++++++---------------------- fs/ext4/inode.c | 73 +++++++++------------------ 3 files changed, 90 insertions(+), 114 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c859ec4544bc..bb3adca53d93 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3437,9 +3437,6 @@ extern int ext4_da_write_inline_data_begin(struct address_space *mapping, unsigned flags, struct page **pagep, void **fsdata); -extern int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos, - unsigned len, unsigned copied, - struct page *page); extern int ext4_try_add_inline_entry(handle_t *handle, struct ext4_filename *fname, struct inode *dir, struct inode *inode); diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index a96b688a0410..cb42b2245c21 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -729,40 +729,83 @@ int ext4_try_to_write_inline_data(struct address_space *mapping, int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied, struct page *page) { - int ret, no_expand; + handle_t *handle = ext4_journal_current_handle(); + int no_expand; void *kaddr; struct ext4_iloc iloc; + int ret = 0, ret2;
if (unlikely(copied < len) && !PageUptodate(page)) - return 0; + copied = 0;
- ret = ext4_get_inode_loc(inode, &iloc); - if (ret) { - ext4_std_error(inode->i_sb, ret); - return ret; - } + if (likely(copied)) { + ret = ext4_get_inode_loc(inode, &iloc); + if (ret) { + unlock_page(page); + put_page(page); + ext4_std_error(inode->i_sb, ret); + goto out; + } + ext4_write_lock_xattr(inode, &no_expand); + BUG_ON(!ext4_has_inline_data(inode));
- ext4_write_lock_xattr(inode, &no_expand); - BUG_ON(!ext4_has_inline_data(inode)); + /* + * ei->i_inline_off may have changed since + * ext4_write_begin() called + * ext4_try_to_write_inline_data() + */ + (void) ext4_find_inline_data_nolock(inode);
- /* - * ei->i_inline_off may have changed since ext4_write_begin() - * called ext4_try_to_write_inline_data() - */ - (void) ext4_find_inline_data_nolock(inode); + kaddr = kmap_atomic(page); + ext4_write_inline_data(inode, &iloc, kaddr, pos, copied); + kunmap_atomic(kaddr); + SetPageUptodate(page); + /* clear page dirty so that writepages wouldn't work for us. */ + ClearPageDirty(page);
- kaddr = kmap_atomic(page); - ext4_write_inline_data(inode, &iloc, kaddr, pos, copied); - kunmap_atomic(kaddr); - SetPageUptodate(page); - /* clear page dirty so that writepages wouldn't work for us. */ - ClearPageDirty(page); + ext4_write_unlock_xattr(inode, &no_expand); + brelse(iloc.bh);
- ext4_write_unlock_xattr(inode, &no_expand); - brelse(iloc.bh); - mark_inode_dirty(inode); + /* + * It's important to update i_size while still holding page + * lock: page writeout could otherwise come in and zero + * beyond i_size. + */ + ext4_update_inode_size(inode, pos + copied); + } + unlock_page(page); + put_page(page); + + /* + * Don't mark the inode dirty under page lock. First, it unnecessarily + * makes the holding time of page lock longer. Second, it forces lock + * ordering of page lock and transaction start for journaling + * filesystems. + */ + if (likely(copied)) + mark_inode_dirty(inode); +out: + /* + * If we didn't copy as much data as expected, we need to trim back + * size of xattr containing inline data. + */ + if (pos + len > inode->i_size && ext4_can_truncate(inode)) + ext4_orphan_add(handle, inode);
- return copied; + ret2 = ext4_journal_stop(handle); + if (!ret) + ret = ret2; + if (pos + len > inode->i_size) { + ext4_truncate_failed_write(inode); + /* + * If truncate failed early the inode might still be + * on the orphan list; we need to make sure the inode + * is removed from the orphan list in that case. + */ + if (inode->i_nlink) + ext4_orphan_del(NULL, inode); + } + return ret ? ret : copied; }
struct buffer_head * @@ -943,43 +986,6 @@ int ext4_da_write_inline_data_begin(struct address_space *mapping, return ret; }
-int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos, - unsigned len, unsigned copied, - struct page *page) -{ - int ret; - - ret = ext4_write_inline_data_end(inode, pos, len, copied, page); - if (ret < 0) { - unlock_page(page); - put_page(page); - return ret; - } - copied = ret; - - /* - * No need to use i_size_read() here, the i_size - * cannot change under us because we hold i_mutex. - * - * But it's important to update i_size while still holding page lock: - * page writeout could otherwise come in and zero beyond i_size. - */ - if (pos+copied > inode->i_size) - i_size_write(inode, pos+copied); - unlock_page(page); - put_page(page); - - /* - * Don't mark the inode dirty under page lock. First, it unnecessarily - * makes the holding time of page lock longer. Second, it forces lock - * ordering of page lock and transaction start for journaling - * filesystems. - */ - mark_inode_dirty(inode); - - return copied; -} - #ifdef INLINE_DIR_DEBUG void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh, void *inline_start, int inline_size) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d9b20649dc23..49926b9c52cb 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1286,23 +1286,14 @@ static int ext4_write_end(struct file *file, loff_t old_size = inode->i_size; int ret = 0, ret2; int i_size_changed = 0; - int inline_data = ext4_has_inline_data(inode); bool verity = ext4_verity_in_progress(inode);
trace_ext4_write_end(inode, pos, len, copied); - if (inline_data) { - ret = ext4_write_inline_data_end(inode, pos, len, - copied, page); - if (ret < 0) { - unlock_page(page); - put_page(page); - goto errout; - } - copied = ret; - ret = 0; - } else - copied = block_write_end(file, mapping, pos, - len, copied, page, fsdata); + + if (ext4_has_inline_data(inode)) + return ext4_write_inline_data_end(inode, pos, len, copied, page); + + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata); /* * it's important to update i_size while still holding page lock: * page writeout could otherwise come in and zero beyond i_size. @@ -1323,10 +1314,9 @@ static int ext4_write_end(struct file *file, * ordering of page lock and transaction start for journaling * filesystems. */ - if (i_size_changed || inline_data) + if (i_size_changed) ret = ext4_mark_inode_dirty(handle, inode);
-errout: if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode)) /* if we have allocated more blocks and copied * less. We will have blocks allocated outside @@ -1398,7 +1388,6 @@ static int ext4_journalled_write_end(struct file *file, int partial = 0; unsigned from, to; int size_changed = 0; - int inline_data = ext4_has_inline_data(inode); bool verity = ext4_verity_in_progress(inode);
trace_ext4_journalled_write_end(inode, pos, len, copied); @@ -1407,17 +1396,10 @@ static int ext4_journalled_write_end(struct file *file,
BUG_ON(!ext4_handle_valid(handle));
- if (inline_data) { - ret = ext4_write_inline_data_end(inode, pos, len, - copied, page); - if (ret < 0) { - unlock_page(page); - put_page(page); - goto errout; - } - copied = ret; - ret = 0; - } else if (unlikely(copied < len) && !PageUptodate(page)) { + if (ext4_has_inline_data(inode)) + return ext4_write_inline_data_end(inode, pos, len, copied, page); + + if (unlikely(copied < len) && !PageUptodate(page)) { copied = 0; ext4_journalled_zero_new_buffers(handle, page, from, to); } else { @@ -1440,13 +1422,12 @@ static int ext4_journalled_write_end(struct file *file, if (old_size < pos && !verity) pagecache_isize_extended(inode, old_size, pos);
- if (size_changed || inline_data) { + if (size_changed) { ret2 = ext4_mark_inode_dirty(handle, inode); if (!ret) ret = ret2; }
-errout: if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode)) /* if we have allocated more blocks and copied * less. We will have blocks allocated outside @@ -3076,7 +3057,7 @@ static int ext4_da_write_end(struct file *file, struct page *page, void *fsdata) { struct inode *inode = mapping->host; - int ret = 0, ret2; + int ret; handle_t *handle = ext4_journal_current_handle(); loff_t new_i_size; unsigned long start, end; @@ -3087,6 +3068,12 @@ static int ext4_da_write_end(struct file *file, len, copied, page, fsdata);
trace_ext4_da_write_end(inode, pos, len, copied); + + if (write_mode != CONVERT_INLINE_DATA && + ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) && + ext4_has_inline_data(inode)) + return ext4_write_inline_data_end(inode, pos, len, copied, page); + start = pos & (PAGE_SIZE - 1); end = start + copied - 1;
@@ -3106,26 +3093,12 @@ static int ext4_da_write_end(struct file *file, * ext4_da_write_inline_data_end(). */ new_i_size = pos + copied; - if (copied && new_i_size > inode->i_size) { - if (ext4_has_inline_data(inode) || - ext4_da_should_update_i_disksize(page, end)) - ext4_update_i_disksize(inode, new_i_size); - } - - if (write_mode != CONVERT_INLINE_DATA && - ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) && - ext4_has_inline_data(inode)) - ret = ext4_da_write_inline_data_end(inode, pos, len, copied, - page); - else - ret = generic_write_end(file, mapping, pos, len, copied, - page, fsdata); - - copied = ret; - ret2 = ext4_journal_stop(handle); - if (unlikely(ret2 && !ret)) - ret = ret2; + if (copied && new_i_size > inode->i_size && + ext4_da_should_update_i_disksize(page, end)) + ext4_update_i_disksize(inode, new_i_size);
+ copied = generic_write_end(file, mapping, pos, len, copied, page, fsdata); + ret = ext4_journal_stop(handle); return ret ? ret : copied; }
From: Zhang Yi yi.zhang@huawei.com
mainline inclusion from mainline-5.15-rc4 commit cc883236b79297f6266ca6f4e7f24f3fd3c736c1 category: perf bugzilla: 182881 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
After factoring out the inline data write procedure from ext4_da_write_end(), we no longer need to start a journal handle for either the buffer overwrite or the append-write case. If i_disksize needs to be updated, mark_inode_dirty() starts a handle and updates the inode buffer. So we can simply remove all the journal handle code from the delalloc write procedure.
This patch gives a significant performance improvement. Below is the UnixBench comparison, tested on my machine with an 'Intel Xeon Gold 5120' CPU and an NVMe SSD backend.
Test cmd:
./Run -c 56 -i 3 fstime fsbuffer fsdisk
Before this patch:
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks          3960.0     422965.0   1068.1
File Copy 256 bufsize 500 maxblocks            1655.0     105077.0    634.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    1429092.0   2464.0
                                                                     ======
System Benchmarks Index Score (Partial Only)                         1186.6
After this patch:
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks          3960.0     732716.0   1850.3
File Copy 256 bufsize 500 maxblocks            1655.0     184940.0   1117.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    2427152.0   4184.7
                                                                     ======
System Benchmarks Index Score (Partial Only)                         2053.0
Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Link: https://lore.kernel.org/r/20210716122024.1105856-5-yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/inode.c | 60 +++++-------------------------------------------- 1 file changed, 5 insertions(+), 55 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 49926b9c52cb..94c9a4d01102 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2914,19 +2914,6 @@ static int ext4_nonda_switch(struct super_block *sb) return 0; }
-/* We always reserve for an inode update; the superblock could be there too */ -static int ext4_da_write_credits(struct inode *inode, loff_t pos, unsigned len) -{ - if (likely(ext4_has_feature_large_file(inode->i_sb))) - return 1; - - if (pos + len <= 0x7fffffffULL) - return 1; - - /* We might need to update the superblock to set LARGE_FILE */ - return 2; -} - static int ext4_da_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) @@ -2935,7 +2922,6 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping, struct page *page; pgoff_t index; struct inode *inode = mapping->host; - handle_t *handle;
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb)))) return -EIO; @@ -2961,41 +2947,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping, return 0; }
- /* - * grab_cache_page_write_begin() can take a long time if the - * system is thrashing due to memory pressure, or if the page - * is being written back. So grab it first before we start - * the transaction handle. This also allows us to allocate - * the page (if needed) without using GFP_NOFS. - */ -retry_grab: +retry: page = grab_cache_page_write_begin(mapping, index, flags); if (!page) return -ENOMEM; - unlock_page(page); - - /* - * With delayed allocation, we don't log the i_disksize update - * if there is delayed block allocation. But we still need - * to journalling the i_disksize update if writes to the end - * of file which has an already mapped buffer. - */ -retry_journal: - handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, - ext4_da_write_credits(inode, pos, len)); - if (IS_ERR(handle)) { - put_page(page); - return PTR_ERR(handle); - }
- lock_page(page); - if (page->mapping != mapping) { - /* The page got truncated from under us */ - unlock_page(page); - put_page(page); - ext4_journal_stop(handle); - goto retry_grab; - } /* In case writeback began while the page was unlocked */ wait_for_stable_page(page);
@@ -3007,20 +2963,18 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping, #endif if (ret < 0) { unlock_page(page); - ext4_journal_stop(handle); + put_page(page); /* * block_write_begin may have instantiated a few blocks * outside i_size. Trim these off again. Don't need - * i_size_read because we hold i_mutex. + * i_size_read because we hold inode lock. */ if (pos + len > inode->i_size) ext4_truncate_failed_write(inode);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) - goto retry_journal; - - put_page(page); + goto retry; return ret; }
@@ -3057,8 +3011,6 @@ static int ext4_da_write_end(struct file *file, struct page *page, void *fsdata) { struct inode *inode = mapping->host; - int ret; - handle_t *handle = ext4_journal_current_handle(); loff_t new_i_size; unsigned long start, end; int write_mode = (int)(unsigned long)fsdata; @@ -3097,9 +3049,7 @@ static int ext4_da_write_end(struct file *file, ext4_da_should_update_i_disksize(page, end)) ext4_update_i_disksize(inode, new_i_size);
- copied = generic_write_end(file, mapping, pos, len, copied, page, fsdata); - ret = ext4_journal_stop(handle); - return ret ? ret : copied; + return generic_write_end(file, mapping, pos, len, copied, page, fsdata); }
/*
From: Zhihao Cheng chengzhihao1@huawei.com
mainline inclusion from mainline-5.15-rc3 commit 5afedf670caf30a2b5a52da96eb7eac7dee6a9c9 category: bugfix bugzilla: 181454 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
There is a use-after-free problem triggered by the following process:
P1(sda)                          P2(sdb)
                                 echo 0 > /sys/block/sdb/trace/enable
                                   blk_trace_remove_queue
                                     synchronize_rcu
                                     blk_trace_free
                                       relay_close
rcu_read_lock
__blk_add_trace
  trace_note_tsk
  (Iterate running_trace_list)
                                         relay_close_buf
                                           relay_destroy_buf
                                             kfree(buf)
  trace_note(sdb's bt)
    relay_reserve
      buf->offset <- nullptr deference (use-after-free) !!!
rcu_read_unlock
[ 502.714379] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 502.715260] #PF: supervisor read access in kernel mode
[ 502.715903] #PF: error_code(0x0000) - not-present page
[ 502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0
[ 502.717252] Oops: 0000 [#1] SMP
[ 502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360
[ 502.732872] Call Trace:
[ 502.733193]  __blk_add_trace.cold+0x137/0x1a3
[ 502.733734]  blk_add_trace_rq+0x7b/0xd0
[ 502.734207]  blk_add_trace_rq_issue+0x54/0xa0
[ 502.734755]  blk_mq_start_request+0xde/0x1b0
[ 502.735287]  scsi_queue_rq+0x528/0x1140
...
[ 502.742704]  sg_new_write.isra.0+0x16e/0x3e0
[ 502.747501]  sg_ioctl+0x466/0x1100
Reproduce method:
 ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
 ioctl(/dev/sda, BLKTRACESTART)
 ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
 ioctl(/dev/sdb, BLKTRACESTART)

 echo 0 > /sys/block/sdb/trace/enable &
 // Add delay(mdelay/msleep) before kernel enters blk_trace_free()

 ioctl$SG_IO(/dev/sda, SG_IO, ...)
 // Enters trace_note_tsk() after blk_trace_free() returned
 // Use mdelay in rcu region rather than msleep(which may schedule out)
Fix this by removing the blk_trace from the running list before calling blk_trace_free() in the sysfs path, if the blk_trace is in the Blktrace_running state.
Fixes: c71a896154119f ("blktrace: add ftrace plugin") Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/trace/blktrace.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index f1022945e346..b89ff188a618 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -1670,6 +1670,14 @@ static int blk_trace_remove_queue(struct request_queue *q) if (bt == NULL) return -EINVAL;
+ if (bt->trace_state == Blktrace_running) { + bt->trace_state = Blktrace_stopped; + spin_lock_irq(&running_trace_lock); + list_del_init(&bt->running_list); + spin_unlock_irq(&running_trace_lock); + relay_flush(bt->rchan); + } + put_probe_ref(); synchronize_rcu(); blk_trace_free(bt);
From: Qian Cai cai@lca.pw
mainline inclusion from mainline-v5.12-rc1 commit c9790fb5df461c91d3fff1d864c1acb8baf5ad5c category: bugfix bugzilla: 107715 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
It is unsafe to traverse tbl->it_group_list without the RCU read lock.
WARNING: suspicious RCU usage
5.7.0-rc4-next-20200508 #1 Not tainted
-----------------------------
arch/powerpc/platforms/powernv/pci-ioda-tce.c:355 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by qemu-kvm/4305:
 #0: c000000bc3fe6988 (&container->group_lock){++++}-{3:3}, at: vfio_fops_unl_ioctl+0x108/0x410 [vfio]
 #1: c00800000fcc7400 (&vfio.iommu_drivers_lock){+.+.}-{3:3}, at: vfio_fops_unl_ioctl+0x148/0x410 [vfio]
 #2: c000000bc3fe4d68 (&container->lock){+.+.}-{3:3}, at: tce_iommu_attach_group+0x3c/0x4f0 [vfio_iommu_spapr_tce]

stack backtrace:
CPU: 4 PID: 4305 Comm: qemu-kvm Not tainted 5.7.0-rc4-next-20200508 #1
Call Trace:
[c0000010f29afa60] [c0000000007154c8] dump_stack+0xfc/0x174 (unreliable)
[c0000010f29afab0] [c0000000001d8ff0] lockdep_rcu_suspicious+0x140/0x164
[c0000010f29afb30] [c0000000000dae2c] pnv_pci_unlink_table_and_group+0x11c/0x200
[c0000010f29afb70] [c0000000000d4a34] pnv_pci_ioda2_unset_window+0xc4/0x190
[c0000010f29afbf0] [c0000000000d4b4c] pnv_ioda2_take_ownership+0x4c/0xd0
[c0000010f29afc30] [c00800000fd60ee0] tce_iommu_attach_group+0x2c8/0x4f0 [vfio_iommu_spapr_tce]
[c0000010f29afcd0] [c00800000fcc11a0] vfio_fops_unl_ioctl+0x238/0x410 [vfio]
[c0000010f29afd50] [c0000000005430a8] ksys_ioctl+0xd8/0x130
[c0000010f29afda0] [c000000000543128] sys_ioctl+0x28/0x40
[c0000010f29afdc0] [c000000000038af4] system_call_exception+0x114/0x1e0
[c0000010f29afe20] [c00000000000c8f0] system_call_common+0xf0/0x278
Signed-off-by: Qian Cai cai@lca.pw Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20200510051347.1906-1-cai@lca.pw Signed-off-by: Chen Lifu chenlifu@huawei.com Reviewed-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: He Ying heying24@huawei.com Reviewed-by: Yihang Xu xuyihang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c index 5218f5da2737..30551bbd7988 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c @@ -380,6 +380,8 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
/* Remove link to a group from table's list of attached groups */ found = false; + + rcu_read_lock(); list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) { if (tgl->table_group == table_group) { list_del_rcu(&tgl->next); @@ -388,6 +390,8 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl, break; } } + rcu_read_unlock(); + if (WARN_ON(!found)) return;
From: Srikar Dronamraju srikar@linux.vnet.ibm.com
mainline inclusion from mainline-v5.15-rc1 commit 9a245d0e1f006bc7ccf0285d0d520ed304d00c4a category: bugfix bugzilla: 180732 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC this mask does not get updated when CPUs are onlined or offlined; it is only updated when CPUs are added or removed. So when onlining/offlining of CPUs and adding/removing of CPUs happen simultaneously, the cpumasks end up broken.
WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP: c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700 Not tainted (5.13.0-rc6+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48828222 XER: 00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000
Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU online/offline.
Signed-off-by: Srikar Dronamraju srikar@linux.vnet.ibm.com Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20210826100521.412639-5-srikar@linux.vnet.ibm.com Signed-off-by: Chen Lifu chenlifu@huawei.com Reviewed-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: Linruizhe linruizhe@huawei.com Reviewed-by: He Ying heying24@huawei.com Reviewed-by: Yihang Xu xuyihang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/powerpc/include/asm/topology.h | 13 +++++++++++++ arch/powerpc/kernel/smp.c | 3 +++ arch/powerpc/mm/numa.c | 7 ++----- 3 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 3beeb030cd78..424635b24b52 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
int of_drconf_to_nid_single(struct drmem_lmb *lmb);
+extern void map_cpu_to_node(int cpu, int node); +#ifdef CONFIG_HOTPLUG_CPU +extern void unmap_cpu_from_node(unsigned long cpu); +#endif /* CONFIG_HOTPLUG_CPU */ + #else
static inline int early_cpu_to_node(int cpu) { return 0; } @@ -93,6 +98,14 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb) return first_online_node; }
+ +#ifdef CONFIG_SMP +static inline void map_cpu_to_node(int cpu, int node) {} +#ifdef CONFIG_HOTPLUG_CPU +static inline void unmap_cpu_from_node(unsigned long cpu) {} +#endif /* CONFIG_HOTPLUG_CPU */ +#endif /* CONFIG_SMP */ + #endif /* CONFIG_NUMA */
#if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 91f274134884..ea6b03f19789 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1300,6 +1300,8 @@ static void remove_cpu_from_masks(int cpu) struct cpumask *(*mask_fn)(int) = cpu_sibling_mask; int i;
+ unmap_cpu_from_node(cpu); + if (shared_caches) mask_fn = cpu_l2_cache_mask;
@@ -1384,6 +1386,7 @@ static void add_cpu_to_masks(int cpu) * This CPU will not be in the online mask yet so we need to manually * add it to it's own thread sibling mask. */ + map_cpu_to_node(cpu, cpu_to_node(cpu)); cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); cpumask_set_cpu(cpu, cpu_core_mask(cpu));
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 094a1076fd1f..23bfe00811ae 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -137,7 +137,7 @@ static void reset_numa_cpu_lookup_table(void) numa_cpu_lookup_table[cpu] = -1; }
-static void map_cpu_to_node(int cpu, int node) +void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node);
@@ -148,7 +148,7 @@ static void map_cpu_to_node(int cpu, int node) }
#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) -static void unmap_cpu_from_node(unsigned long cpu) +void unmap_cpu_from_node(unsigned long cpu) { int node = numa_cpu_lookup_table[cpu];
@@ -598,9 +598,6 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
static int ppc_numa_cpu_dead(unsigned int cpu) { -#ifdef CONFIG_HOTPLUG_CPU - unmap_cpu_from_node(cpu); -#endif return 0; }
From: Vlastimil Babka vbabka@suse.cz
mainline inclusion from mainline-v5.15-rc1 commit 1399af7e54896c774d67f1c1acc491b07149421d category: bugfix bugzilla: 180677 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------------------------
drop_slab_node() is called as part of the echo 2 > /proc/sys/vm/drop_caches operation. It iterates over all memcgs and calls shrink_slab(), which in turn iterates over all slab shrinkers. Freed objects are counted, and as long as the total number of freed objects from all memcgs and shrinkers is higher than 10, drop_slab_node() loops for another full memcgs*shrinkers iteration.
This arbitrary constant threshold of 10 can result in effectively an infinite loop on a system with large number of memcgs and/or parallel activity that allocates new objects. This has been reported previously by Chunxin Zang [1] and recently by our customer.
The previous report [1] has resulted in commit 069c411de40a ("mm/vmscan: fix infinite loop in drop_slab_node") which added a check for signals allowing the user to terminate the command writing to drop_caches. At the time it was also considered to make the threshold grow with each iteration to guarantee termination, but such patch hasn't been formally proposed yet.
This patch implements the dynamically growing threshold. At first iteration it's enough to free one object to continue, and this threshold effectively doubles with each iteration. Our customer's feedback was positive.
There is always a risk that on some system this change will cause a previously terminating drop_caches operation to terminate sooner and free fewer objects. Ideally the semantics would guarantee freeing all freeable objects that existed at the moment the operation started, while not looping forever over newly allocated objects, but that is not feasible to track. In the less ideal solution based on thresholds, the termination guarantee is arguably more important than the exhaustiveness guarantee. If there are reports of a large regression with respect to being exhaustive, we can tune how fast the threshold grows.
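As a rough standalone illustration of the new termination condition (userspace sketch, not kernel code; the constant number of objects freed per pass is hypothetical), the loop now requires roughly 2, 4, 8, ... freed objects on successive passes, so a bounded free rate cannot keep it running forever:

#include <stdio.h>

int main(void)
{
	unsigned long freed_per_pass = 1000;	/* hypothetical constant free rate */
	unsigned long freed;
	int shift = 0, pass = 0;

	do {
		freed = freed_per_pass;
		pass++;
		/* to continue, (freed >> shift) > 1, i.e. freed >= 2 << shift */
		printf("pass %d: freed %lu, need at least %lu to continue\n",
		       pass, freed, 2UL << shift);
	} while ((freed >> shift++) > 1);

	printf("terminated after %d passes\n", pass);
	return 0;
}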
[1] https://lore.kernel.org/lkml/20200909152047.27905-1-zangchunxin@bytedance.co...
[vbabka@suse.cz: avoid undefined shift behaviour] Link: https://lkml.kernel.org/r/2f034e6f-a753-550a-f374-e4e23899d3d5@suse.cz
Link: https://lkml.kernel.org/r/20210818152239.25502-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka vbabka@suse.cz Reported-by: Chunxin Zang zangchunxin@bytedance.com Cc: Muchun Song songmuchun@bytedance.com Cc: Chris Down chris@chrisdown.name Cc: Michal Hocko mhocko@kernel.org Cc: Matthew Wilcox willy@infradead.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c index 23f8a5242de7..718840df14e1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -895,6 +895,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, void drop_slab_node(int nid) { unsigned long freed; + int shift = 0;
do { struct mem_cgroup *memcg = NULL; @@ -907,7 +908,7 @@ void drop_slab_node(int nid) do { freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); - } while (freed > 10); + } while ((freed >> shift++) > 1); }
void drop_slab(void)
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-5.15-rc1 commit 37bc3cb9bbef86d1ddbbc789e55b588c8a2cac26 category: bugfix bugzilla: 180690 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset") expanded the allowed swappiness range so that swap can be preferred on some systems. We should also change the memcg swappiness restriction to allow a memcg to be swap-preferred.
Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.162762625... Fixes: c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset") Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Conflicts: mm/memcontrol.c Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 314f3dcba6a6..a2a04a990e09 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4282,7 +4282,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css, { struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- if (val > 100) + if (val > 200) return -EINVAL;
if (css->parent)
From: yangerkun yangerkun@huawei.com
mainline inclusion from mainline-5.15-rc4 commit 42cb447410d024e9d54139ae9c21ea132a8c384c category: bugfix bugzilla: 182902 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
When ext4_htree_fill_tree() fails, ext4_dx_readdir() can run into an infinite loop: if info->last_pos != ctx->pos, the directory scan is reset and the failing entry is reread. For example:
1. A dx_dir has 3 blocks: block 0 is the dx_root block, and blocks 1/2 are leaf blocks holding the ext4_dir_entry_2 entries.
2. Block 1 is read successfully and call_filldir() fills the dirents and updates ctx->pos.
3. Block 2 fails to read, but since some dirents have already been filled, we return to userspace with a positive return value (see ksys_getdents64()).
4. The second ext4_dx_readdir() resets the scan since info->last_pos != ctx->pos, and also initializes curr_hash, which points to block 1.
5. So block 1 is read again, and as long as block 2 still fails to read, only one dirent can be filled, because the hashes of the entries in block 1 (besides the last one) are not greater than curr_hash.
6. This time we also forget to update last_pos since the read of block 2 fails, and since we got that one entry, ksys_getdents64() returns success.
7. We are then trapped in a loop of steps 4~6.
Cc: stable@kernel.org Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Link: https://lore.kernel.org/r/20210914111415.3921954-1-yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/dir.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index ca50c90adc4c..70a0f5e56f4d 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -534,7 +534,7 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) struct dir_private_info *info = file->private_data; struct inode *inode = file_inode(file); struct fname *fname; - int ret; + int ret = 0;
if (!info) { info = ext4_htree_create_dir_info(file, ctx->pos); @@ -582,7 +582,7 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) info->curr_minor_hash, &info->next_hash); if (ret < 0) - return ret; + goto finished; if (ret == 0) { ctx->pos = ext4_get_htree_eof(file); break; @@ -613,7 +613,7 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) } finished: info->last_pos = ctx->pos; - return 0; + return ret < 0 ? ret : 0; }
static int ext4_dir_open(struct inode * inode, struct file * filp)
From: Dan Carpenter dan.carpenter@oracle.com
maillist inclusion category: bugfix bugzilla: 182946 https://gitee.com/openeuler/kernel/issues/I4DDEL CVE: CVE-2021-42739
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
The bounds checking in avc_ca_pmt() is not strict enough. It should be checking "read_pos + 4" because it's reading 5 bytes. If the "es_info_length" is non-zero then it reads a 6th byte so there needs to be an additional check for that.
I also added checks for the "write_pos". I don't think these are required because "read_pos" and "write_pos" are tied together so checking one ought to be enough. But they make the code easier to understand for me. The check on write_pos is:
if (write_pos + 4 >= sizeof(c->operand) - 4) {
The first "+ 4" is because we're writing 5 bytes and the last " - 4" is to leave space for the CRC.
The other problem is that "length" can be invalid. It comes from "data_length" in fdtv_ca_pmt().
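For reference, a sketch of the five PMT bytes each loop iteration consumes, assuming the standard PMT ES-loop layout the function copies (field names are from the DVB/MPEG-TS PMT and are shown for illustration; offsets are relative to read_pos at the top of the iteration):

	msg[read_pos + 0]   stream_type
	msg[read_pos + 1]   elementary_PID (high byte, with reserved bits)
	msg[read_pos + 2]   elementary_PID (low byte)
	msg[read_pos + 3]   ES_info_length (high byte, with reserved bits)
	msg[read_pos + 4]   ES_info_length (low byte)

So the last byte read unconditionally in an iteration is msg[read_pos + 4], which is why the loop condition checks "read_pos + 4 < length", and why a non-zero ES_info_length (which makes the code read the pmt_cmd_id byte next) needs its own "read_pos < length" check.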
Cc: stable@vger.kernel.org Reported-by: Luo Likang luolikang@nsfocus.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Hans Verkuil hverkuil-cisco@xs4all.nl Signed-off-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Weiyang Wang wangweiyang2@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/media/firewire/firedtv-avc.c | 14 +++++++++++--- drivers/media/firewire/firedtv-ci.c | 2 ++ 2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/media/firewire/firedtv-avc.c b/drivers/media/firewire/firedtv-avc.c index 2bf9467b917d..71991f8638e6 100644 --- a/drivers/media/firewire/firedtv-avc.c +++ b/drivers/media/firewire/firedtv-avc.c @@ -1165,7 +1165,11 @@ int avc_ca_pmt(struct firedtv *fdtv, char *msg, int length) read_pos += program_info_length; write_pos += program_info_length; } - while (read_pos < length) { + while (read_pos + 4 < length) { + if (write_pos + 4 >= sizeof(c->operand) - 4) { + ret = -EINVAL; + goto out; + } c->operand[write_pos++] = msg[read_pos++]; c->operand[write_pos++] = msg[read_pos++]; c->operand[write_pos++] = msg[read_pos++]; @@ -1177,13 +1181,17 @@ int avc_ca_pmt(struct firedtv *fdtv, char *msg, int length) c->operand[write_pos++] = es_info_length >> 8; c->operand[write_pos++] = es_info_length & 0xff; if (es_info_length > 0) { + if (read_pos >= length) { + ret = -EINVAL; + goto out; + } pmt_cmd_id = msg[read_pos++]; if (pmt_cmd_id != 1 && pmt_cmd_id != 4) dev_err(fdtv->device, "invalid pmt_cmd_id %d at stream level\n", pmt_cmd_id);
- if (es_info_length > sizeof(c->operand) - 4 - - write_pos) { + if (es_info_length > sizeof(c->operand) - 4 - write_pos || + es_info_length > length - read_pos) { ret = -EINVAL; goto out; } diff --git a/drivers/media/firewire/firedtv-ci.c b/drivers/media/firewire/firedtv-ci.c index 9363d005e2b6..e0d57e09dab0 100644 --- a/drivers/media/firewire/firedtv-ci.c +++ b/drivers/media/firewire/firedtv-ci.c @@ -134,6 +134,8 @@ static int fdtv_ca_pmt(struct firedtv *fdtv, void *arg) } else { data_length = msg->msg[3]; } + if (data_length > sizeof(msg->msg) - data_pos) + return -EINVAL;
return avc_ca_pmt(fdtv, &msg->msg[data_pos], data_length); }