Hugh Dickins (1): sbitmap: fix lockup while swapping
Jan Kara (1): sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()
Kemeng Shi (3): sbitmap: correct wake_batch recalculation to avoid potential IO hung blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait blk-mq: Fix potential io hung for shared sbitmap per tagset
Laibin Qiu (2): blk-mq: fix tag_get wait task can't be awakened blk-mq: Fix wrong wakeup batch configuration which will cause hang
Li Lingfeng (1): block: Fix lockdep warning in blk_mq_mark_tag_wait
Yu Kuai (2): blk-mq: fix potential io hang by wrong 'wake_batch' sbitmap: fix possible io hung due to lost wakeup
block/blk-mq-tag.c | 46 +++++++++++++++---- block/blk-mq.c | 12 +++-- include/linux/sbitmap.h | 11 +++++ lib/sbitmap.c | 97 +++++++++++++++++++++++++++++------------ 4 files changed, 128 insertions(+), 38 deletions(-)
From: Laibin Qiu qiulaibin@huawei.com
mainline inclusion from mainline-v5.17-rc1 commit 180dccb0dba4f5e84a4a70c1be1d34cbb6528b32 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In case of shared tags, there might be more than one hctx which allocates from the same tags, and each hctx is limited to allocate at most: hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);
tag idle detection is lazy, and may be delayed for 30sec, so there could be just one real active hctx(queue) but all others are actually idle and still accounted as active because of the lazy idle detection. Then if wake_batch is > hctx_max_depth, driver tag allocation may wait forever on this real active hctx.
Fix this by recalculating wake_batch when inc or dec active_queues.
Fixes: 0d2602ca30e41 ("blk-mq: improve support for shared tags maps") Suggested-by: Ming Lei ming.lei@redhat.com Suggested-by: John Garry john.garry@huawei.com Signed-off-by: Laibin Qiu qiulaibin@huawei.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: block/blk-mq-tag.c [Context conflicts] Signed-off-by: Yu Kuai yukuai3@huawei.com --- block/blk-mq-tag.c | 42 +++++++++++++++++++++++++++++++++-------- include/linux/sbitmap.h | 11 +++++++++++ lib/sbitmap.c | 25 +++++++++++++++++++++--- 3 files changed, 67 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 24b48a2f7fba..676d56a04094 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -18,6 +18,21 @@
#define BLK_MQ_DTAG_WAIT_EXPIRE (5 * HZ)
+/* + * Recalculate wakeup batch when tag is shared by hctx. + */ +static void blk_mq_update_wake_batch(struct blk_mq_tags *tags, + unsigned int users) +{ + if (!users) + return; + + sbitmap_queue_recalculate_wake_batch(&tags->bitmap_tags, + users); + sbitmap_queue_recalculate_wake_batch(&tags->breserved_tags, + users); +} + /* * If a previously inactive queue goes active, bump the active user count. * We need to do this before try to allocate driver tag, then even if fail @@ -26,18 +41,26 @@ */ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) { + unsigned int users; + if (blk_mq_is_sbitmap_shared(hctx->flags)) { struct request_queue *q = hctx->queue;
- if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) && - !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) - atomic_inc(&hctx->tags->active_queues); + if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) || + test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) { + return true; + } } else { - if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) && - !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) - atomic_inc(&hctx->tags->active_queues); + if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) || + test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) { + return true; + } }
+ users = atomic_inc_return(&hctx->tags->active_queues); + + blk_mq_update_wake_batch(hctx->tags, users); + return true; }
@@ -58,6 +81,7 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve) void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) { struct blk_mq_tags *tags = hctx->tags; + unsigned int users;
if (blk_mq_is_sbitmap_shared(hctx->flags)) { struct request_queue *q = hctx->queue; @@ -65,13 +89,15 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) return; - atomic_dec(&tags->active_queues); } else { if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) return; - atomic_dec(&tags->active_queues); }
+ users = atomic_dec_return(&tags->active_queues); + + blk_mq_update_wake_batch(tags, users); + blk_mq_tag_wakeup_all(tags, false); }
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h index 86004ddc60bd..a0c8bd1c83ad 100644 --- a/include/linux/sbitmap.h +++ b/include/linux/sbitmap.h @@ -381,6 +381,17 @@ static inline void sbitmap_queue_free(struct sbitmap_queue *sbq) sbitmap_free(&sbq->sb); }
+/** + * sbitmap_queue_recalculate_wake_batch() - Recalculate wake batch + * @sbq: Bitmap queue to recalculate wake batch. + * @users: Number of shares. + * + * Like sbitmap_queue_update_wake_batch(), this will calculate wake batch + * by depth. This interface is for HCTX shared tags or queue shared tags. + */ +void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, + unsigned int users); + /** * sbitmap_queue_resize() - Resize a &struct sbitmap_queue. * @sbq: Bitmap queue to resize. diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 267aa7709416..168d8ddb73e0 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -395,10 +395,9 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth, } EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
-static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, - unsigned int depth) +static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, + unsigned int wake_batch) { - unsigned int wake_batch = sbq_calc_wake_batch(sbq, depth); int i;
if (sbq->wake_batch != wake_batch) { @@ -414,6 +413,26 @@ static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, } }
+static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq, + unsigned int depth) +{ + unsigned int wake_batch; + + wake_batch = sbq_calc_wake_batch(sbq, depth); + __sbitmap_queue_update_wake_batch(sbq, wake_batch); +} + +void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, + unsigned int users) +{ + unsigned int wake_batch; + + wake_batch = clamp_val((sbq->sb.depth + users - 1) / + users, 4, SBQ_WAKE_BATCH); + __sbitmap_queue_update_wake_batch(sbq, wake_batch); +} +EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch); + void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth) { sbitmap_queue_update_wake_batch(sbq, depth);
From: Laibin Qiu qiulaibin@huawei.com
mainline inclusion from mainline-v5.17-rc2 commit 10825410b956dc1ed8c5fbc8bbedaffdadde7f20 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") will recalculate wake_batch when incrementing or decrementing active_queues to avoid wake_batch > hctx_max_depth. At the same time, in order to not affect performance as much as possible, the minimum wakeup batch is set to 4. But when the QD is small (such as QD=1), if inc or dec active_queues increases wakeup batch, that can lead to a hang:
Fix this problem with the following strategies: QD : >= 32 | < 32 --------------------------------- wakeup batch: 8~4 | 3~1
Fixes: 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") Link: https://lore.kernel.org/linux-block/78cafe94-a787-e006-8851-69906f0c2128@hua... Reported-by: Alex Xu (Hello71) alex_y_xu@yahoo.ca Signed-off-by: Laibin Qiu qiulaibin@huawei.com Tested-by: Alex Xu (Hello71) alex_y_xu@yahoo.ca Link: https://lore.kernel.org/r/20220127100047.1763746-1-qiulaibin@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com --- lib/sbitmap.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 168d8ddb73e0..bc287e52c75d 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -426,9 +426,13 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, unsigned int users) { unsigned int wake_batch; + unsigned int min_batch; + unsigned int depth = (sbq->sb.depth + users - 1) / users;
- wake_batch = clamp_val((sbq->sb.depth + users - 1) / - users, 4, SBQ_WAKE_BATCH); + min_batch = sbq->sb.depth >= (4 * SBQ_WAIT_QUEUES) ? 4 : 1; + + wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, + min_batch, SBQ_WAKE_BATCH); __sbitmap_queue_update_wake_batch(sbq, wake_batch); } EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
From: Kemeng Shi shikemeng@huaweicloud.com
mainline inclusion from mainline-v6.3-rc1 commit b5fcf7871acb7f9a3a8ed341a68bd86aba3e254a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") mentioned that in case of shared tags, there could be just one real active hctx(queue) because of lazy detection of tag idle. Then driver tag allocation may wait forever on this real active hctx(queue) if wake_batch is > hctx_max_depth where hctx_max_depth is available tags depth for the actve hctx(queue). However, the condition wake_batch > hctx_max_depth is not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only wake up one wait queue for each wake_batch even though there is only one waiter in the woken wait queue. After this, there is only one tag to free and wake_batch may not be reached anymore. Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") methioned that driver tag allocation may wait forever. Actually, the inactive hctx(queue) will be truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all to wake one waiter per wait queue to break the hung. But IO hung for 30 seconds is also not acceptable. Set batch size to small enough that depth of the shared hctx(queue) is enough to wake up all of the queues like sbq_calc_wake_batch do to fix this potential IO hung.
Although hctx_max_depth will be clamped to at least 4 while wake_batch recalculation does not do the clamp, the wake_batch will be always recalculated to 1 when hctx_max_depth <= 4.
Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Kemeng Shi shikemeng@huaweicloud.com Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: lib/sbitmap.c [commit 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") is not backported] Signed-off-by: Yu Kuai yukuai3@huawei.com --- lib/sbitmap.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c index bc287e52c75d..4ada36936941 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -426,13 +426,10 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq, unsigned int users) { unsigned int wake_batch; - unsigned int min_batch; unsigned int depth = (sbq->sb.depth + users - 1) / users;
- min_batch = sbq->sb.depth >= (4 * SBQ_WAIT_QUEUES) ? 4 : 1; - wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, - min_batch, SBQ_WAKE_BATCH); + 1, SBQ_WAKE_BATCH); __sbitmap_queue_update_wake_batch(sbq, wake_batch); } EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
mainline inclusion from mainline-v6.5-rc1 commit 4f1731df60f9033669f024d06ae26a6301260b55 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating 'wake_batch' is not atomic:
t1: t2: _blk_mq_tag_busy blk_mq_tag_busy inc active_queues // assume 1->2 inc active_queues // 2 -> 3 blk_mq_update_wake_batch // calculate based on 3 blk_mq_update_wake_batch /* calculate based on 2, while active_queues is actually 3. */
Fix this problem by protecting them wih 'tags->lock', this is not a hot path, so performance should not be concerned. And now that all writers are inside the lock, switch 'actives_queues' from atomic to unsigned int.
Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: block/blk-mq-debugfs.c block/blk-mq-tag.c block/blk-mq-tag.h block/blk-mq.h include/linux/blk-mq.h [Still use atomic for 'active_queues' to prevent kabi broken] Signed-off-by: Yu Kuai yukuai3@huawei.com --- block/blk-mq-tag.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 676d56a04094..7a509ecc47dc 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -42,6 +42,7 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags, bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) { unsigned int users; + struct blk_mq_tags *tags = hctx->tags;
if (blk_mq_is_sbitmap_shared(hctx->flags)) { struct request_queue *q = hctx->queue; @@ -57,9 +58,10 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) } }
- users = atomic_inc_return(&hctx->tags->active_queues); - - blk_mq_update_wake_batch(hctx->tags, users); + spin_lock_irq(&tags->lock); + users = atomic_inc_return(&tags->active_queues); + blk_mq_update_wake_batch(tags, users); + spin_unlock_irq(&tags->lock);
return true; } @@ -94,9 +96,10 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx) return; }
+ spin_lock_irq(&tags->lock); users = atomic_dec_return(&tags->active_queues); - blk_mq_update_wake_batch(tags, users); + spin_unlock_irq(&tags->lock);
blk_mq_tag_wakeup_all(tags, false); }
From: Li Lingfeng lilingfeng3@huawei.com
mainline inclusion from mainline-v6.11-rc4 commit b313a8c835516bdda85025500be866ac8a74e022 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Lockdep reported a warning in Linux version 6.6:
[ 414.344659] ================================ [ 414.345155] WARNING: inconsistent lock state [ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted [ 414.346221] -------------------------------- [ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.351204] {IN-SOFTIRQ-W} state was registered at: [ 414.351751] lock_acquire+0x18d/0x460 [ 414.352218] _raw_spin_lock_irqsave+0x39/0x60 [ 414.352769] __wake_up_common_lock+0x22/0x60 [ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0 [ 414.353829] sbitmap_queue_clear+0xdd/0x270 [ 414.354338] blk_mq_put_tag+0xdf/0x170 [ 414.354807] __blk_mq_free_request+0x381/0x4d0 [ 414.355335] blk_mq_free_request+0x28b/0x3e0 [ 414.355847] __blk_mq_end_request+0x242/0xc30 [ 414.356367] scsi_end_request+0x2c1/0x830 [ 414.345155] WARNING: inconsistent lock state [ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted [ 414.346221] -------------------------------- [ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.351204] {IN-SOFTIRQ-W} state was registered at: [ 414.351751] lock_acquire+0x18d/0x460 [ 414.352218] _raw_spin_lock_irqsave+0x39/0x60 [ 414.352769] __wake_up_common_lock+0x22/0x60 [ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0 [ 414.353829] sbitmap_queue_clear+0xdd/0x270 [ 414.354338] blk_mq_put_tag+0xdf/0x170 [ 414.354807] __blk_mq_free_request+0x381/0x4d0 [ 414.355335] blk_mq_free_request+0x28b/0x3e0 [ 414.355847] __blk_mq_end_request+0x242/0xc30 [ 414.356367] scsi_end_request+0x2c1/0x830 [ 414.356863] scsi_io_completion+0x177/0x1610 [ 414.357379] scsi_complete+0x12f/0x260 [ 414.357856] blk_complete_reqs+0xba/0xf0 [ 414.358338] __do_softirq+0x1b0/0x7a2 [ 414.358796] irq_exit_rcu+0x14b/0x1a0 [ 414.359262] sysvec_call_function_single+0xaf/0xc0 [ 414.359828] asm_sysvec_call_function_single+0x1a/0x20 [ 414.360426] default_idle+0x1e/0x30 [ 414.360873] default_idle_call+0x9b/0x1f0 [ 414.361390] do_idle+0x2d2/0x3e0 [ 414.361819] cpu_startup_entry+0x55/0x60 [ 414.362314] start_secondary+0x235/0x2b0 [ 414.362809] secondary_startup_64_no_verify+0x18f/0x19b [ 414.363413] irq event stamp: 428794 [ 414.363825] hardirqs last enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200 [ 414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50 [ 414.365629] softirqs last enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2 [ 414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0 [ 414.367425] other info that might help us debug this: [ 414.368194] Possible unsafe locking scenario: [ 414.368900] CPU0 [ 414.369225] ---- [ 414.369548] lock(&sbq->ws[i].wait); [ 414.370000] <Interrupt> [ 414.370342] lock(&sbq->ws[i].wait); [ 414.370802] *** DEADLOCK *** [ 414.371569] 5 locks held by kworker/u10:3/1152: [ 414.372088] #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0 [ 414.373180] #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0 [ 414.374384] #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00 [ 414.375342] #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.376377] #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0 [ 414.378607] stack backtrace: [ 414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6 [ 414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 414.381177] Workqueue: writeback wb_workfn (flush-253:0) [ 414.381805] Call Trace: [ 414.382136] <TASK> [ 414.382429] dump_stack_lvl+0x91/0xf0 [ 414.382884] mark_lock_irq+0xb3b/0x1260 [ 414.383367] ? __pfx_mark_lock_irq+0x10/0x10 [ 414.383889] ? stack_trace_save+0x8e/0xc0 [ 414.384373] ? __pfx_stack_trace_save+0x10/0x10 [ 414.384903] ? graph_lock+0xcf/0x410 [ 414.385350] ? save_trace+0x3d/0xc70 [ 414.385808] mark_lock.part.20+0x56d/0xa90 [ 414.386317] mark_held_locks+0xb0/0x110 [ 414.386791] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 414.387320] lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.387901] ? _raw_spin_unlock_irq+0x28/0x50 [ 414.388422] trace_hardirqs_on+0x58/0x100 [ 414.388917] _raw_spin_unlock_irq+0x28/0x50 [ 414.389422] __blk_mq_tag_busy+0x1d6/0x2a0 [ 414.389920] __blk_mq_get_driver_tag+0x761/0x9f0 [ 414.390899] blk_mq_dispatch_rq_list+0x1780/0x1ee0 [ 414.391473] ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10 [ 414.392070] ? sbitmap_get+0x2b8/0x450 [ 414.392533] ? __blk_mq_get_driver_tag+0x210/0x9f0 [ 414.393095] __blk_mq_sched_dispatch_requests+0xd99/0x1690 [ 414.393730] ? elv_attempt_insert_merge+0x1b1/0x420 [ 414.394302] ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10 [ 414.394970] ? lock_acquire+0x18d/0x460 [ 414.395456] ? blk_mq_run_hw_queue+0x637/0xa00 [ 414.395986] ? __pfx_lock_acquire+0x10/0x10 [ 414.396499] blk_mq_sched_dispatch_requests+0x109/0x190 [ 414.397100] blk_mq_run_hw_queue+0x66e/0xa00 [ 414.397616] blk_mq_flush_plug_list.part.17+0x614/0x2030 [ 414.398244] ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10 [ 414.398897] ? writeback_sb_inodes+0x241/0xcc0 [ 414.399429] blk_mq_flush_plug_list+0x65/0x80 [ 414.399957] __blk_flush_plug+0x2f1/0x530 [ 414.400458] ? __pfx___blk_flush_plug+0x10/0x10 [ 414.400999] blk_finish_plug+0x59/0xa0 [ 414.401467] wb_writeback+0x7cc/0x920 [ 414.401935] ? __pfx_wb_writeback+0x10/0x10 [ 414.402442] ? mark_held_locks+0xb0/0x110 [ 414.402931] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 414.403462] ? lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.404062] wb_workfn+0x2b3/0xcf0 [ 414.404500] ? __pfx_wb_workfn+0x10/0x10 [ 414.404989] process_scheduled_works+0x432/0x13f0 [ 414.405546] ? __pfx_process_scheduled_works+0x10/0x10 [ 414.406139] ? do_raw_spin_lock+0x101/0x2a0 [ 414.406641] ? assign_work+0x19b/0x240 [ 414.407106] ? lock_is_held_type+0x9d/0x110 [ 414.407604] worker_thread+0x6f2/0x1160 [ 414.408075] ? __kthread_parkme+0x62/0x210 [ 414.408572] ? lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.409168] ? __kthread_parkme+0x13c/0x210 [ 414.409678] ? __pfx_worker_thread+0x10/0x10 [ 414.410191] kthread+0x33c/0x440 [ 414.410602] ? __pfx_kthread+0x10/0x10 [ 414.411068] ret_from_fork+0x4d/0x80 [ 414.411526] ? __pfx_kthread+0x10/0x10 [ 414.411993] ret_from_fork_asm+0x1b/0x30 [ 414.412489] </TASK>
When interrupt is turned on while a lock holding by spin_lock_irq it throws a warning because of potential deadlock.
blk_mq_prep_dispatch_rq blk_mq_get_driver_tag __blk_mq_get_driver_tag __blk_mq_alloc_driver_tag blk_mq_tag_busy -> tag is already busy // failed to get driver tag blk_mq_mark_tag_wait spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait) __add_wait_queue(wq, wait) -> wait queue active blk_mq_get_driver_tag __blk_mq_tag_busy -> 1) tag must be idle, which means there can't be inflight IO spin_lock_irq(&tags->lock) -> lock B (hctx->tags) spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally -> 2) context must be preempt by IO interrupt to trigger deadlock.
As shown above, the deadlock is not possible in theory, but the warning still need to be fixed.
Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.
Fixes: 4f1731df60f9 ("blk-mq: fix potential io hang by wrong 'wake_batch'") Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Ming Lei ming.lei@redhat.com Reviewed-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Bart Van Assche bvanassche@acm.org Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingfeng@huaweicloud.co... Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: block/blk-mq-tag.c [Context conflicts] Signed-off-by: Yu Kuai yukuai3@huawei.com --- block/blk-mq-tag.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 7a509ecc47dc..73c046bc3d90 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -42,6 +42,7 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags, bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) { unsigned int users; + unsigned long flags; struct blk_mq_tags *tags = hctx->tags;
if (blk_mq_is_sbitmap_shared(hctx->flags)) { @@ -58,10 +59,10 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) } }
- spin_lock_irq(&tags->lock); + spin_lock_irqsave(&tags->lock, flags); users = atomic_inc_return(&tags->active_queues); blk_mq_update_wake_batch(tags, users); - spin_unlock_irq(&tags->lock); + spin_unlock_irqrestore(&tags->lock, flags);
return true; }
mainline inclusion from mainline-v6.1-rc1 commit 040b83fcecfb86f3225d3a5de7fd9b3fbccf83b4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are two problems can lead to lost wakeup:
1) invalid wakeup on the wrong waitqueue:
For example, 2 * wake_batch tags are put, while only wake_batch threads are woken:
__sbq_wake_up atomic_cmpxchg -> reset wait_cnt __sbq_wake_up -> decrease wait_cnt ... __sbq_wake_up -> wait_cnt is decreased to 0 again atomic_cmpxchg sbq_index_atomic_inc -> increase wake_index wake_up_nr -> wake up and waitqueue might be empty sbq_index_atomic_inc -> increase again, one waitqueue is skipped wake_up_nr -> invalid wake up because old wakequeue might be empty
To fix the problem, increasing 'wake_index' before resetting 'wait_cnt'.
2) 'wait_cnt' can be decreased while waitqueue is empty
As pointed out by Jan Kara, following race is possible:
CPU1 CPU2 __sbq_wake_up __sbq_wake_up sbq_wake_ptr() sbq_wake_ptr() -> the same wait_cnt = atomic_dec_return() /* decreased to 0 */ sbq_index_atomic_inc() /* move to next waitqueue */ atomic_set() /* reset wait_cnt */ wake_up_nr() /* wake up on the old waitqueue */ wait_cnt = atomic_dec_return() /* * decrease wait_cnt in the old * waitqueue, while it can be * empty. */
Fix the problem by waking up before updating 'wake_index' and 'wait_cnt'.
With this patch, noted that 'wait_cnt' is still decreased in the old empty waitqueue, however, the wakeup is redirected to a active waitqueue, and the extra decrement on the old empty waitqueue is not handled.
Fixes: 88459642cba4 ("blk-mq: abstract tag allocation out into sbitmap library") Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe axboe@kernel.dk --- lib/sbitmap.c | 55 ++++++++++++++++++++++++++++++--------------------- 1 file changed, 33 insertions(+), 22 deletions(-)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 4ada36936941..d3c63dd8276f 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -542,32 +542,43 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq) return false;
wait_cnt = atomic_dec_return(&ws->wait_cnt); - if (wait_cnt <= 0) { - int ret; + /* + * For concurrent callers of this, callers should call this function + * again to wakeup a new batch on a different 'ws'. + */ + if (wait_cnt < 0 || !waitqueue_active(&ws->wait)) + return true;
- wake_batch = READ_ONCE(sbq->wake_batch); + if (wait_cnt > 0) + return false;
- /* - * Pairs with the memory barrier in sbitmap_queue_resize() to - * ensure that we see the batch size update before the wait - * count is reset. - */ - smp_mb__before_atomic(); + wake_batch = READ_ONCE(sbq->wake_batch);
- /* - * For concurrent callers of this, the one that failed the - * atomic_cmpxhcg() race should call this function again - * to wakeup a new batch on a different 'ws'. - */ - ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch); - if (ret == wait_cnt) { - sbq_index_atomic_inc(&sbq->wake_index); - wake_up_nr(&ws->wait, wake_batch); - return false; - } + /* + * Wake up first in case that concurrent callers decrease wait_cnt + * while waitqueue is empty. + */ + wake_up_nr(&ws->wait, wake_batch);
- return true; - } + /* + * Pairs with the memory barrier in sbitmap_queue_resize() to + * ensure that we see the batch size update before the wait + * count is reset. + * + * Also pairs with the implicit barrier between decrementing wait_cnt + * and checking for waitqueue_active() to make sure waitqueue_active() + * sees result of the wakeup if atomic_dec_return() has seen the result + * of atomic_set(). + */ + smp_mb__before_atomic(); + + /* + * Increase wake_index before updating wait_cnt, otherwise concurrent + * callers can see valid wait_cnt in old waitqueue, which can cause + * invalid wakeup on the old waitqueue. + */ + sbq_index_atomic_inc(&sbq->wake_index); + atomic_set(&ws->wait_cnt, wake_batch);
return false; }
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v6.1-rc1 commit 48c033314f372478548203c583529f53080fd078 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone else waking the waiter on the waitqueue (so the waitqueue becomes empty), it exits without reseting wait_cnt to wake_batch number. Once wait_cnt is 0, nobody will ever reset the wait_cnt or wake the new waiters resulting in possible deadlocks or busyloops. Fix the problem by making sure we reset wait_cnt even if we didn't wake up anybody in the end.
Fixes: 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup") Reported-by: Keith Busch kbusch@kernel.org Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20220908130937.2795-1-jack@suse.cz Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com --- lib/sbitmap.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c index d3c63dd8276f..074140d05a85 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -536,6 +536,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq) struct sbq_wait_state *ws; unsigned int wake_batch; int wait_cnt; + bool ret;
ws = sbq_wake_ptr(sbq); if (!ws) @@ -546,12 +547,23 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq) * For concurrent callers of this, callers should call this function * again to wakeup a new batch on a different 'ws'. */ - if (wait_cnt < 0 || !waitqueue_active(&ws->wait)) + if (wait_cnt < 0) return true;
+ /* + * If we decremented queue without waiters, retry to avoid lost + * wakeups. + */ if (wait_cnt > 0) - return false; + return !waitqueue_active(&ws->wait);
+ /* + * When wait_cnt == 0, we have to be particularly careful as we are + * responsible to reset wait_cnt regardless whether we've actually + * woken up anybody. But in case we didn't wakeup anybody, we still + * need to retry. + */ + ret = !waitqueue_active(&ws->wait); wake_batch = READ_ONCE(sbq->wake_batch);
/* @@ -580,7 +592,7 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq) sbq_index_atomic_inc(&sbq->wake_index); atomic_set(&ws->wait_cnt, wake_batch);
- return false; + return ret; }
void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
From: Hugh Dickins hughd@google.com
mainline inclusion from mainline-v6.1-rc1 commit 30514bd2dd4e86a3ecfd6a93a3eadf7b9ea164a0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") is a big improvement: without it, I had to revert to before commit 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup") to avoid the high system time and freezes which that had introduced.
Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy swapping (kernel builds in low memory) on another: soon locking up in sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling around with waitqueue_active() but wait_cnt 0 . Here is a backtrace, showing the common pattern of outer sbitmap_queue_wake_up() interrupted before setting wait_cnt 0 back to wake_batch (in some cases other CPUs are idle, in other cases they're spinning for a lock in dd_bio_merge()):
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request < scsi_end_request < scsi_io_completion < scsi_finish_command < scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq < __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt < _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up < sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < dd_bio_merge < blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio < __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct < submit_bio < __swap_writepage < swap_writepage < pageout < shrink_folio_list < evict_folios < lru_gen_shrink_lruvec < shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages < __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio < do_anonymous_page < __handle_mm_fault < handle_mm_fault < do_user_addr_fault < exc_page_fault < asm_exc_page_fault
See how the process-context sbitmap_queue_wake_up() has been interrupted, after bringing wait_cnt down to 0 (and in this example, after doing its wakeups), before advancing wake_index and refilling wake_cnt: an interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
I have almost no grasp of all the possible sbitmap races, and their consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0, so it is better if sbq_wake_ptr() skips on to the next ws in that case: which fixes the lockup and shows no adverse consequence for me.
The check for wait_cnt being 0 is obviously racy, and ultimately can lead to lost wakeups: for example, when there is only a single waitqueue with waiters. However, lost wakeups are unlikely to matter in these cases, and a proper fix requires redesign (and benchmarking) of the batched wakeup code: so let's plug the hole with this bandaid for now.
Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Keith Busch kbusch@kernel.org Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com --- lib/sbitmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 074140d05a85..904697c64961 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -519,7 +519,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq) for (i = 0; i < SBQ_WAIT_QUEUES; i++) { struct sbq_wait_state *ws = &sbq->ws[wake_index];
- if (waitqueue_active(&ws->wait)) { + if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { if (wake_index != atomic_read(&sbq->wake_index)) atomic_set(&sbq->wake_index, wake_index); return ws;
From: Kemeng Shi shikemeng@huaweicloud.com
mainline inclusion from mainline-v6.3-rc1 commit 98b99e9412d0cde8c7b442bf5efb09528a2ede8b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
For shared queues case, we will only wait on bitmap_tags if we fail to get driver tag. However, rq could be from breserved_tags, then two problems will occur: 1. io hung if no tag is currently allocated from bitmap_tags. 2. unnecessary wakeup when tag is freed to bitmap_tags while no tag is freed to breserved_tags. Wait on the bitmap which rq from to fix this.
Fixes: f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags") Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Kemeng Shi shikemeng@huaweicloud.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com --- block/blk-mq.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index 81d0fc1a8bf9..22926a9a5651 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1236,7 +1236,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx, struct request *rq) { - struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags; + struct sbitmap_queue *sbq; struct wait_queue_head *wq; wait_queue_entry_t *wait; bool ret; @@ -1259,6 +1259,10 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx, if (!list_empty_careful(&wait->entry)) return false;
+ if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) + sbq = &hctx->tags->breserved_tags; + else + sbq = &hctx->tags->bitmap_tags; wq = &bt_wait_ptr(sbq, hctx)->wait;
spin_lock_irq(&wq->lock);
From: Kemeng Shi shikemeng@huaweicloud.com
mainline inclusion from mainline-v6.3-rc1 commit 47df9ce95cd568d3f84218c4f65e9fbd4dfeda55 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAQPKU CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit f906a6a0f4268 ("blk-mq: improve tag waiting setup for non-shared tags") mark restart for unshared tags for improvement. At that time, tags is only shared betweens queues and we can check if tags is shared by test BLK_MQ_F_TAG_SHARED. Afterwards, commit 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset") enabled tags share betweens hctxs inside a queue. We only mark restart for shared hctxs inside a queue and may cause io hung if there is no tag currently allocated by hctxs going to be marked restart. Wait on sbitmap_queue instead of mark restart for shared hctxs case to fix this.
Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset") Signed-off-by: Kemeng Shi shikemeng@huaweicloud.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: block/blk-mq.c [commit 2a5a24aa8338 ("scsi: blk-mq: Return budget token from .get_budget callback") and commit 079a2e3e8625 ("blk-mq: Change shared sbitmap naming to shared tags") are not backported] Signed-off-by: Yu Kuai yukuai3@huawei.com --- block/blk-mq.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index 22926a9a5651..d6694b65e02f 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1241,7 +1241,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx, wait_queue_entry_t *wait; bool ret;
- if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) { + if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) && + !(blk_mq_is_sbitmap_shared(hctx->flags))) { blk_mq_sched_mark_restart_hctx(hctx);
/* @@ -1520,7 +1521,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list, bool needs_restart; /* For non-shared tags, the RESTART check will suffice */ bool no_tag = prep == PREP_DISPATCH_NO_TAG && - (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED); + ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) || + blk_mq_is_sbitmap_shared(hctx->flags));
blk_mq_release_budgets(q, nr_budgets);
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/11942 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/V...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/11942 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/V...