From: Yang Shi <shy828301@gmail.com>
stable inclusion
from stable-v5.10.102
commit db3f3636e4aed2cba3e4e7897a053323f7a62249
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6EVUJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 24d7275ce2791829953ed4e72f68277ceb2571c6 upstream.
syzbot reported the BUG below:
kernel BUG at include/linux/page-flags.h:785!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
Call Trace:
 page_mapcount include/linux/mm.h:837 [inline]
 smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
 smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
 smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
 walk_pmd_range mm/pagewalk.c:128 [inline]
 walk_pud_range mm/pagewalk.c:205 [inline]
 walk_p4d_range mm/pagewalk.c:240 [inline]
 walk_pgd_range mm/pagewalk.c:277 [inline]
 __walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
 walk_page_vma+0x277/0x350 mm/pagewalk.c:530
 smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
 smap_gather_stats fs/proc/task_mmu.c:741 [inline]
 show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
 seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
 seq_read+0x3e0/0x5b0 fs/seq_file.c:162
 vfs_read+0x1b5/0x600 fs/read_write.c:479
 ksys_read+0x12d/0x250 fs/read_write.c:619
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
The reproducer was trying to read /proc/$PID/smaps while calling MADV_FREE at the same time. MADV_FREE may split a THP if it is called on part of the THP, which can trigger the race below:
        CPU A                        CPU B
        -----                        -----
smaps walk:                  MADV_FREE:
page_mapcount()
  PageCompound()
                             split_huge_page()
  page = compound_head(page)
  PageDoubleMap(page)
By the time PageDoubleMap() is called, the page is no longer a tail page of a THP, so the BUG is triggered.
This could be fixed by elevating the refcount of the page before calling page_mapcount(), but that would prevent it from counting migration entries, and it seems like overkill because the race can only happen once the PMD is split, at which point all PTE entries of the tail pages are actually migration entries, and smaps_account() does treat migration entries as mapcount == 1, as Kirill pointed out.
Add a new parameter to smaps_account() to indicate that the entry is a migration entry, and then skip calling page_mapcount(). Don't skip getting the mapcount for device private entries, since they do track references with the mapcount.
Pagemap has a similar issue, although it has not been reported. Fix it as well.
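For intuition, a minimal userspace sketch of the accounting decision this patch makes; the struct and function names here are illustrative stand-ins, not the kernel's actual API:

#include <stdbool.h>
#include <stdio.h>

struct mss { unsigned long private_clean, shared_clean; };

/*
 * Sketch of the fixed logic: for a migration entry, page_mapcount()
 * must not be called; the entry is simply treated as mapcount == 1,
 * i.e. accounted as exclusively mapped.
 */
static void account(struct mss *m, int page_count, bool migration,
                    unsigned long size)
{
        if (page_count == 1 || migration)
                m->private_clean += size;  /* exclusive: charge the full size */
        else
                m->shared_clean += size;   /* shared: kernel divides by mapcount */
}

int main(void)
{
        struct mss m = { 0, 0 };
        account(&m, 3, true, 4096);        /* migration entry: counted as private */
        printf("private=%lu shared=%lu\n", m.private_clean, m.shared_clean);
        return 0;
}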
[shy828301@gmail.com: v4]
  Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
[nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
  Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
  fs/proc/task_mmu.c

Signed-off-by: Zhang Peng <zhangpeng362@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 93815b2b8440..0175fd7b3598 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -429,7 +429,8 @@ struct mem_size_stats { };
static void smaps_account(struct mem_size_stats *mss, struct page *page, - bool compound, bool young, bool dirty, bool locked) + bool compound, bool young, bool dirty, bool locked, + bool migration) { int i, nr = compound ? 1 << compound_order(page) : 1; unsigned long size = nr * PAGE_SIZE; @@ -449,8 +450,15 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page, * page_count(page) == 1 guarantees the page is mapped exactly once. * If any subpage of the compound page mapped with PTE it would elevate * page_count(). + * + * The page_mapcount() is called to get a snapshot of the mapcount. + * Without holding the page lock this snapshot can be slightly wrong as + * we cannot always read the mapcount atomically. It is not safe to + * call page_mapcount() even with PTL held if the page is not mapped, + * especially for migration entries. Treat regular migration entries + * as mapcount == 1. */ - if (page_count(page) == 1) { + if ((page_count(page) == 1) || migration) { if (dirty || PageDirty(page)) mss->private_dirty += size; else @@ -505,6 +513,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, struct vm_area_struct *vma = walk->vma; bool locked = !!(vma->vm_flags & VM_LOCKED); struct page *page = NULL; + bool migration = false;
if (pte_present(*pte)) { page = vm_normal_page(vma, addr, *pte); @@ -524,9 +533,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, } else { mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; } - } else if (is_migration_entry(swpent)) + } else if (is_migration_entry(swpent)) { + migration = true; page = migration_entry_to_page(swpent); - else if (is_device_private_entry(swpent)) + } else if (is_device_private_entry(swpent)) page = device_private_entry_to_page(swpent); } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { @@ -546,7 +556,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, if (!page) return;
- smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte), locked); + smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte), + locked, migration); }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -570,7 +581,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, /* pass */; else VM_BUG_ON_PAGE(1, page); - smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd), locked); + smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd), + locked, false); } #else static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, @@ -1285,6 +1297,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, { u64 frame = 0, flags = 0; struct page *page = NULL; + bool migration = false;
if (pte_present(pte)) { if (pm->show_pfn) @@ -1302,8 +1315,10 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, frame = swp_type(entry) | (swp_offset(entry) << MAX_SWAPFILES_SHIFT); flags |= PM_SWAP; - if (is_migration_entry(entry)) + if (is_migration_entry(entry)) { + migration = true; page = migration_entry_to_page(entry); + }
if (is_device_private_entry(entry)) page = device_private_entry_to_page(entry); @@ -1311,7 +1326,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
if (page && !PageAnon(page)) flags |= PM_FILE; - if (page && page_mapcount(page) == 1) + if (page && !migration && page_mapcount(page) == 1) flags |= PM_MMAP_EXCLUSIVE; if (vma->vm_flags & VM_SOFTDIRTY) flags |= PM_SOFT_DIRTY; @@ -1327,8 +1342,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, spinlock_t *ptl; pte_t *pte, *orig_pte; int err = 0; - #ifdef CONFIG_TRANSPARENT_HUGEPAGE + bool migration = false; + ptl = pmd_trans_huge_lock(pmdp, vma); if (ptl) { u64 flags = 0, frame = 0; @@ -1363,11 +1379,12 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pmd_swp_soft_dirty(pmd)) flags |= PM_SOFT_DIRTY; VM_BUG_ON(!is_pmd_migration_entry(pmd)); + migration = is_migration_entry(entry); page = migration_entry_to_page(entry); } #endif
- if (page && page_mapcount(page) == 1) + if (page && !migration && page_mapcount(page) == 1) flags |= PM_MMAP_EXCLUSIVE;
for (; addr != end; addr += PAGE_SIZE) {
From: Eric Biggers <ebiggers@google.com>
mainline inclusion
from mainline-v5.18-rc1
commit d3481accd974541e6a5d6a1fb588924a3519c36e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6ETWH
CVE: NA
--------------------------------
RSA PKCS#1 v1.5 signatures are required to be the same length as the RSA key size. RFC 8017 specifically requires the verifier to check this (https://datatracker.ietf.org/doc/html/rfc8017#section-8.2.2).
Commit a49de377e051 ("crypto: Add hash param to pkcs1pad") changed the kernel to allow longer signatures, but didn't explain this part of the change; it seems to be unrelated to the rest of the commit.
Revert this change, since it doesn't appear to be correct.
We can be pretty sure that no one is relying on overly-long signatures (which would have to be front-padded with zeroes) being supported, given that they would have been broken since commit c7381b012872 ("crypto: akcipher - new verify API for public key algorithms").
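As a standalone illustration of the check being restored (the helper name is hypothetical), RFC 8017 §8.2.2 requires rejecting any signature whose length differs from the modulus length; for a 2048-bit key that means exactly 256 bytes:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * A PKCS#1 v1.5 signature must be exactly as long as the RSA modulus;
 * both shorter and longer inputs are invalid.
 */
static int check_sig_len(size_t src_len, size_t key_size)
{
        if (key_size == 0 || src_len != key_size)
                return -EINVAL;
        return 0;
}

int main(void)
{
        /* 2048-bit key: only 256-byte signatures pass */
        printf("%d %d %d\n",
               check_sig_len(255, 256),   /* rejected */
               check_sig_len(257, 256),   /* rejected; accepted before the fix */
               check_sig_len(256, 256));  /* accepted */
        return 0;
}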
Fixes: a49de377e051 ("crypto: Add hash param to pkcs1pad")
Cc: <stable@vger.kernel.org> # v4.6+
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Suggested-by: Vitaly Chikunov <vt@altlinux.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Conflicts:
  crypto/rsa-pkcs1pad.c
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Reviewed-by: yiyang <yiyang13@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 crypto/rsa-pkcs1pad.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/crypto/rsa-pkcs1pad.c b/crypto/rsa-pkcs1pad.c index ab2e74e23a7d..48bd07e22a1d 100644 --- a/crypto/rsa-pkcs1pad.c +++ b/crypto/rsa-pkcs1pad.c @@ -528,7 +528,7 @@ static int pkcs1pad_verify(struct akcipher_request *req) struct pkcs1pad_request *req_ctx = akcipher_request_ctx(req); int err;
- if (!ctx->key_size || req->src_len < ctx->key_size) + if (!ctx->key_size || req->src_len != ctx->key_size) return -EINVAL;
req_ctx->out_buf = kmalloc(ctx->key_size, GFP_KERNEL);
From: Ming Lei <ming.lei@redhat.com>
mainline inclusion
from mainline-5.4-rc1
commit cecf5d87ff2035127bb5a9ee054d0023a4a7cad3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
---------------------------
The kernfs built-in lock 'kn->count' is held on the sysfs .show/.store path. Meanwhile, q->sysfs_lock is required inside the block layer's .show/.store callbacks.
However, when the mq & iosched kobjects are removed via blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held too. This causes an AB-BA deadlock, because the kernfs built-in lock 'kn->count' is also required inside kobject_del(); see the lockdep warning [1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for either blk_mq_unregister_dev() or elv_unregister_queue(), because clearing the REGISTERED flag prevents stores to 'queue/scheduler' from happening. Also, sysfs writes (stores) are exclusive, so there is no need to hold the lock for elv_unregister_queue() when it is called on the elevator-switching path.
So split .sysfs_lock into two locks: one keeps the name .sysfs_lock and covers synchronous .store; the other is named .sysfs_dir_lock and covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and showing/storing attributes under those kobjects. For switching the scheduler via a store to 'queue/scheduler', we use the QUEUE_FLAG_REGISTERED queue flag together with .sysfs_lock to avoid that race; then we can avoid holding .sysfs_lock while removing/adding kobjects.
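To make the resulting scheme concrete, here is a compressed userspace sketch; pthread mutexes stand in for the kernel's mutexes, and all names are illustrative, not the actual block-layer code:

#include <pthread.h>
#include <stdbool.h>

struct queue {
        pthread_mutex_t sysfs_lock;     /* still covers synchronous .store state */
        pthread_mutex_t sysfs_dir_lock; /* now covers kobject add/remove */
        bool registered;
};

/*
 * Teardown deletes kobjects under sysfs_dir_lock, never sysfs_lock, so it
 * can no longer form an AB-BA cycle with a .show/.store that already holds
 * the kernfs 'kn->count' reference and then wants sysfs_lock.
 */
static void unregister_queue(struct queue *q)
{
        pthread_mutex_lock(&q->sysfs_lock);
        q->registered = false;          /* blocks further 'queue/scheduler' stores */
        pthread_mutex_unlock(&q->sysfs_lock);

        pthread_mutex_lock(&q->sysfs_dir_lock);
        /* kobject_del()/sysfs_remove_group() would run here */
        pthread_mutex_unlock(&q->sysfs_dir_lock);
}

int main(void)
{
        struct queue q = { PTHREAD_MUTEX_INITIALIZER,
                           PTHREAD_MUTEX_INITIALIZER, true };
        unregister_queue(&q);
        return 0;
}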
[1] lockdep warning

======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
       __lock_acquire+0x95f/0xa2f
       lock_acquire+0x1b4/0x1e8
       __mutex_lock+0x14a/0xa9b
       blk_mq_hw_sysfs_show+0x63/0xb6
       sysfs_kf_seq_show+0x11f/0x196
       seq_read+0x2cd/0x5f2
       vfs_read+0xc7/0x18c
       ksys_read+0xc4/0x13e
       do_syscall_64+0xa7/0x295
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
       check_prev_add+0x5d2/0xc45
       validate_chain+0xed3/0xf94
       __lock_acquire+0x95f/0xa2f
       lock_acquire+0x1b4/0x1e8
       __kernfs_remove+0x237/0x40b
       kernfs_remove_by_name_ns+0x59/0x72
       remove_files+0x61/0x96
       sysfs_remove_group+0x81/0xa4
       sysfs_remove_groups+0x3b/0x44
       kobject_del+0x44/0x94
       blk_mq_unregister_dev+0x83/0xdd
       blk_unregister_queue+0xa0/0x10b
       del_gendisk+0x259/0x3fa
       null_del_dev+0x8b/0x1c3 [null_blk]
       null_exit+0x5c/0x95 [null_blk]
       __se_sys_delete_module+0x204/0x337
       do_syscall_64+0xa7/0x295
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->sysfs_lock);
                               lock(kn->count#202);
                               lock(&q->sysfs_lock);
  lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
 #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
 #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
 dump_stack+0x9a/0xe6
 check_noncircular+0x207/0x251
 ? print_circular_bug+0x32a/0x32a
 ? find_usage_backwards+0x84/0xb0
 check_prev_add+0x5d2/0xc45
 validate_chain+0xed3/0xf94
 ? check_prev_add+0xc45/0xc45
 ? mark_lock+0x11b/0x804
 ? check_usage_forwards+0x1ca/0x1ca
 __lock_acquire+0x95f/0xa2f
 lock_acquire+0x1b4/0x1e8
 ? kernfs_remove_by_name_ns+0x59/0x72
 __kernfs_remove+0x237/0x40b
 ? kernfs_remove_by_name_ns+0x59/0x72
 ? kernfs_next_descendant_post+0x7d/0x7d
 ? strlen+0x10/0x23
 ? strcmp+0x22/0x44
 kernfs_remove_by_name_ns+0x59/0x72
 remove_files+0x61/0x96
 sysfs_remove_group+0x81/0xa4
 sysfs_remove_groups+0x3b/0x44
 kobject_del+0x44/0x94
 blk_mq_unregister_dev+0x83/0xdd
 blk_unregister_queue+0xa0/0x10b
 del_gendisk+0x259/0x3fa
 ? disk_events_poll_msecs_store+0x12b/0x12b
 ? check_flags+0x1ea/0x204
 ? mark_held_locks+0x1f/0x7a
 null_del_dev+0x8b/0x1c3 [null_blk]
 null_exit+0x5c/0x95 [null_blk]
 __se_sys_delete_module+0x204/0x337
 ? free_module+0x39f/0x39f
 ? blkcg_maybe_throttle_current+0x8a/0x718
 ? rwlock_bug+0x62/0x62
 ? __blkcg_punt_bio_submit+0xd0/0xd0
 ? trace_hardirqs_on_thunk+0x1a/0x20
 ? mark_held_locks+0x1f/0x7a
 ? do_syscall_64+0x4c/0x295
 do_syscall_64+0xa7/0x295
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
  block/blk.h
  block/elevator.c
  block/blk-sysfs.c
  block/blk-core.c
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/blk-core.c       |  1 +
 block/blk-mq-sysfs.c   | 16 ++++++------
 block/blk-sysfs.c      | 46 +++++++++++++++++++-------------
 block/blk.h            |  2 +-
 block/elevator.c       | 59 +++++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h |  1 +
 6 files changed, 88 insertions(+), 37 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 2ce4ad4619c1..dbfcf2dc735d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1337,6 +1337,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id, mutex_init(&q->blk_trace_mutex); #endif mutex_init(&q->sysfs_lock); + mutex_init(&q->sysfs_dir_lock); spin_lock_init(&q->__queue_lock);
q->queue_lock = lock ? : &q->__queue_lock; diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3209979e105b..87fe2708d188 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -283,7 +283,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i;
- lockdep_assert_held(&q->sysfs_lock); + lockdep_assert_held(&q->sysfs_dir_lock);
queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); @@ -333,7 +333,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) int ret, i;
WARN_ON_ONCE(!q->kobj.parent); - lockdep_assert_held(&q->sysfs_lock); + lockdep_assert_held(&q->sysfs_dir_lock);
ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq"); if (ret < 0) @@ -366,9 +366,9 @@ int blk_mq_register_dev(struct device *dev, struct request_queue *q) { int ret;
- mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); ret = __blk_mq_register_dev(dev, q); - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock);
return ret; } @@ -379,7 +379,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i;
- mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); if (!q->mq_sysfs_init_done) goto unlock;
@@ -387,7 +387,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q) blk_mq_unregister_hctx(hctx);
unlock: - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); }
int blk_mq_sysfs_register(struct request_queue *q) @@ -395,7 +395,7 @@ int blk_mq_sysfs_register(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i, ret = 0;
- mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock); if (!q->mq_sysfs_init_done) goto unlock;
@@ -406,7 +406,7 @@ int blk_mq_sysfs_register(struct request_queue *q) }
unlock: - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock);
return ret; } diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 5bdae7c8fa27..d925b75702e3 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -933,6 +933,7 @@ int blk_register_queue(struct gendisk *disk) int ret; struct device *dev = disk_to_dev(disk); struct request_queue *q = disk->queue; + bool has_elevator = false;
if (WARN_ON(!q)) return -ENXIO; @@ -940,14 +941,12 @@ int blk_register_queue(struct gendisk *disk) WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags), "%s is registering an already registered queue\n", kobject_name(&dev->kobj)); - queue_flag_set_unlocked(QUEUE_FLAG_REGISTERED, q);
ret = blk_trace_init_sysfs(dev); if (ret) return ret;
- /* Prevent changes through sysfs until registration is completed. */ - mutex_lock(&q->sysfs_lock); + mutex_lock(&q->sysfs_dir_lock);
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue"); if (ret < 0) { @@ -960,28 +959,37 @@ int blk_register_queue(struct gendisk *disk) blk_mq_debugfs_register(q); }
- kobject_uevent(&q->kobj, KOBJ_ADD); - - wbt_enable_default(q); - - blk_throtl_register_queue(q); - blk_queue_flag_set(QUEUE_FLAG_THROTL_INIT_DONE, q); - + /* + * The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator + * switch won't happen at all. + */ if (q->request_fn || (q->mq_ops && q->elevator)) { - ret = elv_register_queue(q); + ret = elv_register_queue(q, false); if (ret) { - mutex_unlock(&q->sysfs_lock); - kobject_uevent(&q->kobj, KOBJ_REMOVE); + mutex_unlock(&q->sysfs_dir_lock); kobject_del(&q->kobj); blk_trace_remove_sysfs(dev); kobject_put(&dev->kobj); return ret; } + has_elevator = true; }
+ mutex_lock(&q->sysfs_lock); + blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); + wbt_enable_default(q); + blk_throtl_register_queue(q); + blk_queue_flag_set(QUEUE_FLAG_THROTL_INIT_DONE, q); + + /* Now everything is ready and send out KOBJ_ADD uevent */ + kobject_uevent(&q->kobj, KOBJ_ADD); + if (has_elevator) + kobject_uevent(&q->elevator->kobj, KOBJ_ADD); + mutex_unlock(&q->sysfs_lock); + ret = 0; unlock: - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock);
/* * SCSI probing may synchronously create and destroy a lot of @@ -1012,6 +1020,7 @@ EXPORT_SYMBOL_GPL(blk_register_queue); void blk_unregister_queue(struct gendisk *disk) { struct request_queue *q = disk->queue; + bool has_elevator;
if (WARN_ON(!q)) return; @@ -1029,21 +1038,22 @@ void blk_unregister_queue(struct gendisk *disk) mutex_lock(&q->sysfs_lock);
blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q); + has_elevator = !!q->elevator; + mutex_unlock(&q->sysfs_lock);
+ mutex_lock(&q->sysfs_dir_lock); /* * Remove the sysfs attributes before unregistering the queue data * structures that can be modified through sysfs. */ if (q->mq_ops) blk_mq_unregister_dev(disk_to_dev(disk), q); - mutex_unlock(&q->sysfs_lock);
blk_trace_remove_sysfs(disk_to_dev(disk));
- mutex_lock(&q->sysfs_lock); - if (q->request_fn || (q->mq_ops && q->elevator)) + if (q->request_fn || (q->mq_ops && has_elevator)) elv_unregister_queue(q); - mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock);
/* Now that we've deleted all child objects, we can delete the queue. */ kobject_uevent(&q->kobj, KOBJ_REMOVE); diff --git a/block/blk.h b/block/blk.h index 985e20980aff..2c73160dff6c 100644 --- a/block/blk.h +++ b/block/blk.h @@ -262,7 +262,7 @@ int elevator_init_mq(struct request_queue *q); int elevator_switch_mq(struct request_queue *q, struct elevator_type *new_e); void elevator_exit(struct request_queue *, struct elevator_queue *); -int elv_register_queue(struct request_queue *q); +int elv_register_queue(struct request_queue *q, bool uevent); void elv_unregister_queue(struct request_queue *q); void __blk_rq_init(struct request_queue *q, struct request *rq);
diff --git a/block/elevator.c b/block/elevator.c index 9621a812319e..a1e62af8bf33 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -841,13 +841,16 @@ static struct kobj_type elv_ktype = { .release = elevator_release, };
-int elv_register_queue(struct request_queue *q) +/* + * elv_register_queue is called from either blk_register_queue or + * elevator_switch, elevator switch is prevented from being happen + * in the two paths, so it is safe to not hold q->sysfs_lock. + */ +int elv_register_queue(struct request_queue *q, bool uevent) { struct elevator_queue *e = q->elevator; int error;
- lockdep_assert_held(&q->sysfs_lock); - error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched"); if (!error) { struct elv_fs_entry *attr = e->type->elevator_attrs; @@ -858,24 +861,34 @@ int elv_register_queue(struct request_queue *q) attr++; } } - kobject_uevent(&e->kobj, KOBJ_ADD); + if (uevent) + kobject_uevent(&e->kobj, KOBJ_ADD); + + mutex_lock(&q->sysfs_lock); e->registered = 1; if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn) e->type->ops.sq.elevator_registered_fn(q); + mutex_unlock(&q->sysfs_lock); } return error; }
+/* + * elv_unregister_queue is called from either blk_unregister_queue or + * elevator_switch, elevator switch is prevented from being happen + * in the two paths, so it is safe to not hold q->sysfs_lock. + */ void elv_unregister_queue(struct request_queue *q) { - lockdep_assert_held(&q->sysfs_lock); - if (q) { struct elevator_queue *e = q->elevator;
kobject_uevent(&e->kobj, KOBJ_REMOVE); kobject_del(&e->kobj); + + mutex_lock(&q->sysfs_lock); e->registered = 0; + mutex_unlock(&q->sysfs_lock); } }
@@ -946,10 +959,32 @@ int elevator_switch_mq(struct request_queue *q, lockdep_assert_held(&q->sysfs_lock);
if (q->elevator) { - if (q->elevator->registered) + if (q->elevator->registered) { + mutex_unlock(&q->sysfs_lock); + + /* + * Concurrent elevator switch can't happen becasue + * sysfs write is always exclusively on same file. + * + * Also the elevator queue won't be freed after + * sysfs_lock is released becasue kobject_del() in + * blk_unregister_queue() waits for completion of + * .store & .show on its attributes. + */ elv_unregister_queue(q); + + mutex_lock(&q->sysfs_lock); + } ioc_clear_queue(q); elevator_exit(q, q->elevator); + + /* + * sysfs_lock may be dropped, so re-check if queue is + * unregistered. If yes, don't switch to new elevator + * any more + */ + if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + return 0; }
ret = blk_mq_init_sched(q, new_e); @@ -957,7 +992,11 @@ int elevator_switch_mq(struct request_queue *q, goto out;
if (new_e) { - ret = elv_register_queue(q); + mutex_unlock(&q->sysfs_lock); + + ret = elv_register_queue(q, true); + + mutex_lock(&q->sysfs_lock); if (ret) { elevator_exit(q, q->elevator); goto out; @@ -1053,7 +1092,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e) if (err) goto fail_init;
- err = elv_register_queue(q); + err = elv_register_queue(q, true); if (err) goto fail_register;
@@ -1073,7 +1112,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e) /* switch failed, restore and re-register old elevator */ if (old) { q->elevator = old; - elv_register_queue(q); + elv_register_queue(q, true); blk_queue_bypass_end(q); }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 41d235ee579a..0c8f2a28f163 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -655,6 +655,7 @@ struct request_queue { struct delayed_work requeue_work;
struct mutex sysfs_lock; + struct mutex sysfs_dir_lock;
/* * for reusing dead hctx instance in case of updating
From: Ming Lei <ming.lei@redhat.com>
mainline inclusion
from mainline-5.4-rc1
commit 0a67b5a926e63ff5492c3c675eab5900580d056d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
---------------------------
Commit cecf5d87ff20 ("block: split .sysfs_lock into two locks") started releasing & re-acquiring sysfs_lock while switching elevators. As a result, simply clearing QUEUE_FLAG_REGISTERED while holding sysfs_lock is no longer enough to prevent an elevator switch: an in-progress switch can still move on after re-acquiring the lock, and QUEUE_FLAG_REGISTERED is not checked again at that point.
Fix this issue by checking 'q->elevator' directly & locklessly after q->kobj is removed in blk_unregister_queue(). This is safe because q->elevator can't be changed at that time.
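A rough standalone sketch of why the lockless read is sound (the types and helper here are stubs, not the real code): once the queue's kobject is deleted, kobject_del() has already waited for in-flight .show/.store on its attributes, so no sysfs write can still be switching elevators.

#include <stddef.h>

struct elevator_queue;                          /* opaque in this sketch */
struct request_queue { struct elevator_queue *elevator; };

static void elv_unregister_queue_stub(struct request_queue *q) { (void)q; }

/*
 * By this point q->kobj has been deleted, so q->elevator is stable and
 * may be read without q->sysfs_lock.
 */
static void unregister_tail(struct request_queue *q)
{
        if (q->elevator)                        /* lockless, but race-free here */
                elv_unregister_queue_stub(q);
}

int main(void)
{
        struct request_queue q = { NULL };
        unregister_tail(&q);
        return 0;
}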
Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflict:
  block/blk-sysfs.c
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/blk-sysfs.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index d925b75702e3..a663dbcbdd19 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -1020,7 +1020,6 @@ EXPORT_SYMBOL_GPL(blk_register_queue); void blk_unregister_queue(struct gendisk *disk) { struct request_queue *q = disk->queue; - bool has_elevator;
if (WARN_ON(!q)) return; @@ -1038,7 +1037,6 @@ void blk_unregister_queue(struct gendisk *disk) mutex_lock(&q->sysfs_lock);
blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q); - has_elevator = !!q->elevator; mutex_unlock(&q->sysfs_lock);
mutex_lock(&q->sysfs_dir_lock); @@ -1051,7 +1049,11 @@ void blk_unregister_queue(struct gendisk *disk)
blk_trace_remove_sysfs(disk_to_dev(disk));
- if (q->request_fn || (q->mq_ops && has_elevator)) + /* + * q->kobj has been removed, so it is safe to check if elevator + * exists without holding q->sysfs_lock. + */ + if (q->request_fn || (q->mq_ops && q->elevator)) elv_unregister_queue(q); mutex_unlock(&q->sysfs_dir_lock);
From: Ming Lei <ming.lei@redhat.com>
mainline inclusion
from mainline-5.4-rc1
commit b89f625e28d44552083f43752f62d8621ded0a04
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
---------------------------
Commit cecf5d87ff20 ("block: split .sysfs_lock into two locks") started releasing & re-acquiring sysfs_lock around registering/unregistering the elevator queue while switching elevators, to avoid a potential deadlock between showing & storing the 'queue/iosched' attributes and removing the elevator's kobject.
It turns out there is no such deadlock, because 'q->sysfs_lock' isn't required in the .show & .store of the queue/iosched attributes; only the elevator's own sysfs lock is acquired in elv_iosched_store() and elv_iosched_show(). So it is safe to hold the queue's sysfs lock while registering/unregistering the elevator queue.
The biggest issue is that commit cecf5d87ff20 assumes concurrent writes to 'queue/scheduler' can't happen. That assumption isn't true: kernfs_fop_write() only guarantees that concurrent writes aren't issued through the same open file, but writes can come from different opens of the file. So we can't release & re-acquire the queue's sysfs lock during an elevator switch; otherwise a use-after-free on the elevator could be triggered.
Fix the issue by not releasing the queue's sysfs lock while switching elevators.
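A compressed userspace sketch of the resulting shape of the switch path; the helpers are invented for illustration, and in the real code the caller already holds sysfs_lock across elevator_switch_mq():

#include <pthread.h>
#include <stddef.h>

struct elv { int registered; };
struct queue { pthread_mutex_t sysfs_lock; struct elv *elevator; };

static void unregister_elv(struct queue *q) { q->elevator->registered = 0; }
static int register_elv(struct queue *q) { q->elevator->registered = 1; return 0; }

/*
 * The old elevator's teardown and the new one's registration now sit in a
 * single critical section. A write from a *different* open of
 * 'queue/scheduler' (which kernfs does not serialize against this one)
 * must wait here instead of racing in and freeing q->elevator mid-switch.
 */
static int switch_elevator(struct queue *q, struct elv *new_e)
{
        int ret;

        pthread_mutex_lock(&q->sysfs_lock);
        if (q->elevator && q->elevator->registered)
                unregister_elv(q);
        q->elevator = new_e;
        ret = register_elv(q);
        pthread_mutex_unlock(&q->sysfs_lock);
        return ret;
}

int main(void)
{
        struct elv old = { 1 }, new_e = { 0 };
        struct queue q = { PTHREAD_MUTEX_INITIALIZER, &old };
        return switch_elevator(&q, &new_e);
}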
Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflict:
  block/blk-sysfs.c
  block/elevator.c
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/blk-sysfs.c | 13 ++++---------
 block/elevator.c  | 31 +------------------------------
 2 files changed, 5 insertions(+), 39 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index a663dbcbdd19..711724a41fc7 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -959,13 +959,11 @@ int blk_register_queue(struct gendisk *disk) blk_mq_debugfs_register(q); }
- /* - * The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator - * switch won't happen at all. - */ + mutex_lock(&q->sysfs_lock); if (q->request_fn || (q->mq_ops && q->elevator)) { ret = elv_register_queue(q, false); if (ret) { + mutex_unlock(&q->sysfs_lock); mutex_unlock(&q->sysfs_dir_lock); kobject_del(&q->kobj); blk_trace_remove_sysfs(dev); @@ -975,7 +973,6 @@ int blk_register_queue(struct gendisk *disk) has_elevator = true; }
- mutex_lock(&q->sysfs_lock); blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); wbt_enable_default(q); blk_throtl_register_queue(q); @@ -1049,12 +1046,10 @@ void blk_unregister_queue(struct gendisk *disk)
blk_trace_remove_sysfs(disk_to_dev(disk));
- /* - * q->kobj has been removed, so it is safe to check if elevator - * exists without holding q->sysfs_lock. - */ + mutex_lock(&q->sysfs_lock); if (q->request_fn || (q->mq_ops && q->elevator)) elv_unregister_queue(q); + mutex_unlock(&q->sysfs_lock); mutex_unlock(&q->sysfs_dir_lock);
/* Now that we've deleted all child objects, we can delete the queue. */ diff --git a/block/elevator.c b/block/elevator.c index a1e62af8bf33..77fde865bf7a 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -864,11 +864,9 @@ int elv_register_queue(struct request_queue *q, bool uevent) if (uevent) kobject_uevent(&e->kobj, KOBJ_ADD);
- mutex_lock(&q->sysfs_lock); e->registered = 1; if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn) e->type->ops.sq.elevator_registered_fn(q); - mutex_unlock(&q->sysfs_lock); } return error; } @@ -886,9 +884,7 @@ void elv_unregister_queue(struct request_queue *q) kobject_uevent(&e->kobj, KOBJ_REMOVE); kobject_del(&e->kobj);
- mutex_lock(&q->sysfs_lock); e->registered = 0; - mutex_unlock(&q->sysfs_lock); } }
@@ -959,32 +955,11 @@ int elevator_switch_mq(struct request_queue *q, lockdep_assert_held(&q->sysfs_lock);
if (q->elevator) { - if (q->elevator->registered) { - mutex_unlock(&q->sysfs_lock); - - /* - * Concurrent elevator switch can't happen becasue - * sysfs write is always exclusively on same file. - * - * Also the elevator queue won't be freed after - * sysfs_lock is released becasue kobject_del() in - * blk_unregister_queue() waits for completion of - * .store & .show on its attributes. - */ + if (q->elevator->registered) elv_unregister_queue(q);
- mutex_lock(&q->sysfs_lock); - } ioc_clear_queue(q); elevator_exit(q, q->elevator); - - /* - * sysfs_lock may be dropped, so re-check if queue is - * unregistered. If yes, don't switch to new elevator - * any more - */ - if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) - return 0; }
ret = blk_mq_init_sched(q, new_e); @@ -992,11 +967,7 @@ int elevator_switch_mq(struct request_queue *q, goto out;
if (new_e) { - mutex_unlock(&q->sysfs_lock); - ret = elv_register_queue(q, true); - - mutex_lock(&q->sysfs_lock); if (ret) { elevator_exit(q, q->elevator); goto out;
From: Yufen Yu <yuyufen@huawei.com>
mainline inclusion
from mainline-5.10-rc1
commit f0c6ae09db1d39cbcf35776f889e0f7f861493d2
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
---------------------------
After commit b89f625e28d4 ("block: don't release queue's sysfs lock during switching elevator"), the whole elevator register and unregister functions are covered by sysfs_lock. So remove the now-incorrect comments and add lockdep asserts instead.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/elevator.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c index 77fde865bf7a..91c8769882d5 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -841,16 +841,13 @@ static struct kobj_type elv_ktype = { .release = elevator_release, };
-/* - * elv_register_queue is called from either blk_register_queue or - * elevator_switch, elevator switch is prevented from being happen - * in the two paths, so it is safe to not hold q->sysfs_lock. - */ int elv_register_queue(struct request_queue *q, bool uevent) { struct elevator_queue *e = q->elevator; int error;
+ lockdep_assert_held(&q->sysfs_lock); + error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched"); if (!error) { struct elv_fs_entry *attr = e->type->elevator_attrs; @@ -871,13 +868,10 @@ int elv_register_queue(struct request_queue *q, bool uevent) return error; }
-/* - * elv_unregister_queue is called from either blk_unregister_queue or - * elevator_switch, elevator switch is prevented from being happen - * in the two paths, so it is safe to not hold q->sysfs_lock. - */ void elv_unregister_queue(struct request_queue *q) { + lockdep_assert_held(&q->sysfs_lock); + if (q) { struct elevator_queue *e = q->elevator;
From: Li Lingfeng <lilingfeng3@huawei.com>
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY
CVE: NA
--------------------------------
Commit 92567b1d6f5a ("block: split .sysfs_lock into two locks") introduced the new member 'sysfs_dir_lock' in struct request_queue; move it to struct request_queue_wrapper to avoid breaking KABI.
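The pattern, sketched in standalone C under the assumption that the wrapper embeds the queue as its first member (as queue_to_wrapper() in block/blk.h implies); a pthread mutex stands in for the kernel mutex:

#include <stddef.h>
#include <pthread.h>

struct request_queue {              /* layout frozen by the KABI contract */
        pthread_mutex_t sysfs_lock;
        /* ... existing members; size and offsets must not change ... */
};

struct request_queue_wrapper {      /* kernel-internal, free to grow */
        struct request_queue q;     /* assumed to be the first member */
        pthread_mutex_t sysfs_dir_lock; /* the relocated new member */
};

/* container_of-style conversion, as queue_to_wrapper() does in block/blk.h */
#define queue_to_wrapper(qp) \
        ((struct request_queue_wrapper *)((char *)(qp) - \
                offsetof(struct request_queue_wrapper, q)))

int main(void)
{
        struct request_queue_wrapper w;
        return queue_to_wrapper(&w.q) == &w ? 0 : 1;
}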
Fixes: 92567b1d6f5a ("block: split .sysfs_lock into two locks")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/blk-core.c       |  2 +-
 block/blk-mq-sysfs.c   | 16 ++++++++--------
 block/blk-sysfs.c      | 10 +++++-----
 block/blk.h            |  1 +
 include/linux/blkdev.h |  1 -
 5 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index dbfcf2dc735d..acf5585b0557 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1337,7 +1337,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id, mutex_init(&q->blk_trace_mutex); #endif mutex_init(&q->sysfs_lock); - mutex_init(&q->sysfs_dir_lock); + mutex_init(&q_wrapper->sysfs_dir_lock); spin_lock_init(&q->__queue_lock);
q->queue_lock = lock ? : &q->__queue_lock; diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 87fe2708d188..cace619d7421 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -283,7 +283,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i;
- lockdep_assert_held(&q->sysfs_dir_lock); + lockdep_assert_held(&queue_to_wrapper(q)->sysfs_dir_lock);
queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); @@ -333,7 +333,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) int ret, i;
WARN_ON_ONCE(!q->kobj.parent); - lockdep_assert_held(&q->sysfs_dir_lock); + lockdep_assert_held(&queue_to_wrapper(q)->sysfs_dir_lock);
ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq"); if (ret < 0) @@ -366,9 +366,9 @@ int blk_mq_register_dev(struct device *dev, struct request_queue *q) { int ret;
- mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&queue_to_wrapper(q)->sysfs_dir_lock); ret = __blk_mq_register_dev(dev, q); - mutex_unlock(&q->sysfs_dir_lock); + mutex_unlock(&queue_to_wrapper(q)->sysfs_dir_lock);
return ret; } @@ -379,7 +379,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i;
- mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&queue_to_wrapper(q)->sysfs_dir_lock); if (!q->mq_sysfs_init_done) goto unlock;
@@ -387,7 +387,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q) blk_mq_unregister_hctx(hctx);
unlock: - mutex_unlock(&q->sysfs_dir_lock); + mutex_unlock(&queue_to_wrapper(q)->sysfs_dir_lock); }
int blk_mq_sysfs_register(struct request_queue *q) @@ -395,7 +395,7 @@ int blk_mq_sysfs_register(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i, ret = 0;
- mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&queue_to_wrapper(q)->sysfs_dir_lock); if (!q->mq_sysfs_init_done) goto unlock;
@@ -406,7 +406,7 @@ int blk_mq_sysfs_register(struct request_queue *q) }
unlock: - mutex_unlock(&q->sysfs_dir_lock); + mutex_unlock(&queue_to_wrapper(q)->sysfs_dir_lock);
return ret; } diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 711724a41fc7..1b00c8f5c9d6 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -946,7 +946,7 @@ int blk_register_queue(struct gendisk *disk) if (ret) return ret;
- mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&queue_to_wrapper(q)->sysfs_dir_lock);
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue"); if (ret < 0) { @@ -964,7 +964,7 @@ int blk_register_queue(struct gendisk *disk) ret = elv_register_queue(q, false); if (ret) { mutex_unlock(&q->sysfs_lock); - mutex_unlock(&q->sysfs_dir_lock); + mutex_unlock(&queue_to_wrapper(q)->sysfs_dir_lock); kobject_del(&q->kobj); blk_trace_remove_sysfs(dev); kobject_put(&dev->kobj); @@ -986,7 +986,7 @@ int blk_register_queue(struct gendisk *disk)
ret = 0; unlock: - mutex_unlock(&q->sysfs_dir_lock); + mutex_unlock(&queue_to_wrapper(q)->sysfs_dir_lock);
/* * SCSI probing may synchronously create and destroy a lot of @@ -1036,7 +1036,7 @@ void blk_unregister_queue(struct gendisk *disk) blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q); mutex_unlock(&q->sysfs_lock);
- mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&queue_to_wrapper(q)->sysfs_dir_lock); /* * Remove the sysfs attributes before unregistering the queue data * structures that can be modified through sysfs. @@ -1050,7 +1050,7 @@ void blk_unregister_queue(struct gendisk *disk) if (q->request_fn || (q->mq_ops && q->elevator)) elv_unregister_queue(q); mutex_unlock(&q->sysfs_lock); - mutex_unlock(&q->sysfs_dir_lock); + mutex_unlock(&queue_to_wrapper(q)->sysfs_dir_lock);
/* Now that we've deleted all child objects, we can delete the queue. */ kobject_uevent(&q->kobj, KOBJ_REMOVE); diff --git a/block/blk.h b/block/blk.h index 2c73160dff6c..9269bb6b14f8 100644 --- a/block/blk.h +++ b/block/blk.h @@ -54,6 +54,7 @@ struct request_queue_wrapper { struct cpumask dispatch_async_cpus; int __percpu *last_dispatch_cpu; #endif + struct mutex sysfs_dir_lock; };
#define queue_to_wrapper(q) \ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 0c8f2a28f163..41d235ee579a 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -655,7 +655,6 @@ struct request_queue { struct delayed_work requeue_work;
struct mutex sysfs_lock; - struct mutex sysfs_dir_lock;
/* * for reusing dead hctx instance in case of updating
From: Yu Kuai <yukuai3@huawei.com>
mainline inclusion
from mainline-v6.2-rc1
commit 671fae5e51297fc76b3758ca2edd514858734a6a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
--------------------------------
Commit b5dc5d4d1f4f ("block,bfq: Disable writeback throttling") tries to disable wbt for bfq by calling wbt_disable_default() in bfq_init_queue(). However, wbt is still enabled if the default elevator is bfq:
device_add_disk
 elevator_init_mq
  bfq_init_queue
   wbt_disable_default -> does nothing

blk_register_queue
 wbt_enable_default -> wbt is enabled
Fix the problem by adding a new flag, ELEVATOR_FLAG_DISABLE_WBT: bfq sets the flag in bfq_init_queue(), and a subsequent wbt_enable_default() won't enable wbt while the flag is set.
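Note that this 4.19-era backport has no elevator flags field, so (as the diff below shows) it keys the check off the current elevator's name instead of mainline's ELEVATOR_FLAG_DISABLE_WBT. A minimal sketch of that gating check; the helper name is invented:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Gating check grafted into wbt_enable_default() by this backport: wbt
 * is neither (re-)initialized nor flipped back to WBT_STATE_ON_DEFAULT
 * when the current elevator is the one that wants wbt kept off.
 */
static bool wbt_disabled_by_elevator(const char *elevator_name)
{
        return elevator_name && strcmp(elevator_name, "cfq") == 0;
}

int main(void)
{
        printf("%d %d\n",
               wbt_disabled_by_elevator("cfq"),          /* 1: skip wbt_init */
               wbt_disabled_by_elevator("mq-deadline")); /* 0: wbt enabled */
        return 0;
}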
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflict:
  block/bfq-iosched.c
  block/blk-wbt.c
  block/elevator.h
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 block/blk-wbt.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c index 366d294a11ef..1a5ea2e0c055 100644 --- a/block/blk-wbt.c +++ b/block/blk-wbt.c @@ -711,11 +711,15 @@ void wbt_set_write_cache(struct request_queue *q, bool write_cache_on) */ void wbt_enable_default(struct request_queue *q) { - struct rq_qos *rqos = wbt_rq_qos(q); + struct rq_qos *rqos; + bool disable_flag = q->elevator && + !strcmp(q->elevator->type->elevator_name, "cfq");
/* Throttling already enabled? */ + rqos = wbt_rq_qos(q); if (rqos) { - if (RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT) + if (!disable_flag && + RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT) RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT; return; } @@ -724,8 +728,9 @@ void wbt_enable_default(struct request_queue *q) if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) return;
- if ((q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ)) || - (q->request_fn && IS_ENABLED(CONFIG_BLK_WBT_SQ))) + if (((q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ)) || + (q->request_fn && IS_ENABLED(CONFIG_BLK_WBT_SQ))) && + !disable_flag) wbt_init(q); } EXPORT_SYMBOL_GPL(wbt_enable_default);
Hi, Yongqiang,

It looks like the bugzilla link in this patch is wrong:
bugzilla: https://gitee.com/openeuler/kernel/issues/I6HOKY

It should apparently be:
https://gitee.com/openeuler/kernel/issues/I6LH5K
-Sang Lipeng