Too many plugged bios may hurt IO performance and consume too much memory, so limit their number.
Li Lingfeng (1):
  Revert "md/raid10: fix softlockup in raid10_unplug"

Mariusz Tkaczyk (2):
  md: drop queue limitation for RAID1 and RAID10
  md: raid1/raid10: drop pending_cnt

Yu Kuai (8):
  md/raid10: prevent soft lockup while flush writes
  md/raid1-10: factor out a helper to add bio to plug
  md/raid1-10: factor out a helper to submit normal write
  md/raid1-10: submit write io directly if bitmap is not enabled
  md/md-bitmap: add a new helper to unplug bitmap asynchrously
  md/raid1-10: don't handle pluged bio by daemon thread
  md/raid1-10: limit the number of plugged bio
  md/raid1-10: fix casting from randomized structure in
    raid1_submit_write()

 drivers/md/md-bitmap.c | 33 +++++++++++++++++--
 drivers/md/md-bitmap.h |  8 +++++
 drivers/md/md.c        |  9 +++++
 drivers/md/md.h        |  1 +
 drivers/md/raid1-10.c  | 74 ++++++++++++++++++++++++++++++++++++++----
 drivers/md/raid1.c     | 46 +++-----------------------
 drivers/md/raid1.h     |  1 -
 drivers/md/raid10.c    | 65 +++++--------------------------------
 drivers/md/raid10.h    |  1 -
 9 files changed, 130 insertions(+), 108 deletions(-)
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
--------------------------------
This reverts commit 7efc0915ce2386d7dbdbac12e7b4ba7dfe898f87.
The problem can be solved by commit 010444623e7f ("md/raid10: prevent soft lockup while flush writes") from mainline, so revert this patch and apply the mainline one.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid10.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f1ff63c0444c..7ef2360e78ff 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -910,7 +910,6 @@ static void flush_pending_writes(struct r10conf *conf)
 			else
 				submit_bio_noacct(bio);
 			bio = next;
-			cond_resched();
 		}
 		blk_finish_plug(&plug);
 	} else
@@ -1116,7 +1115,6 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 		else
 			submit_bio_noacct(bio);
 		bio = next;
-		cond_resched();
 	}
 	kfree(plug);
 }
From: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

mainline inclusion
from mainline-v5.17-rc1
commit a92ce0feffeed8b91f02dac85246d1205e4a64b6
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
As suggested by Neil Brown[1], this limitation seems to be deprecated.
With plugging in use, writes are processed behind the raid thread and conf->pending_count is not increased. The limitation applies only if the caller doesn't use plugs.
It can be avoided, and often is (with plugging). There are no reports of the queue growing to an enormous size, so remove the queue limitation for non-plugged IOs too.
[1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@no...
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid1-10.c | 6 ------
 drivers/md/raid1.c    | 7 -------
 drivers/md/raid10.c   | 7 -------
 3 files changed, 20 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 54db34163968..83f9a4f3d82e 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -22,12 +22,6 @@
 
 #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
 
-/* When there are this many requests queue to be written by
- * the raid thread, we become 'congested' to provide back-pressure
- * for writeback.
- */
-static int max_queued_requests = 1024;
-
 /* for managing resync I/O pages */
 struct resync_pages {
 	void		*raid_bio;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e511354c6e8c..6b076249dd88 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1360,12 +1360,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	r1_bio = alloc_r1bio(mddev, bio);
 	r1_bio->sectors = max_write_sectors;
 
-	if (conf->pending_count >= max_queued_requests) {
-		md_wakeup_thread(mddev->thread);
-		raid1_log(mddev, "wait queued");
-		wait_event(conf->wait_barrier,
-			   conf->pending_count < max_queued_requests);
-	}
 	/* first select target devices under rcu_lock and
 	 * inc refcount on their rdev. Record them by setting
 	 * bios[x] to bio
@@ -3431,4 +3425,3 @@ MODULE_ALIAS("md-personality-3"); /* RAID1 */
 MODULE_ALIAS("md-raid1");
 MODULE_ALIAS("md-level-1");
 
-module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7ef2360e78ff..3e1483bade73 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1343,12 +1343,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 		conf->reshape_safe = mddev->reshape_position;
 	}
 
-	if (conf->pending_count >= max_queued_requests) {
-		md_wakeup_thread(mddev->thread);
-		raid10_log(mddev, "wait queued");
-		wait_event(conf->wait_barrier,
-			   conf->pending_count < max_queued_requests);
-	}
 	/* first select target devices under rcu_lock and
 	 * inc refcount on their rdev. Record them by setting
 	 * bios[x] to bio
@@ -4974,4 +4968,3 @@ MODULE_ALIAS("md-personality-9"); /* RAID10 */
 MODULE_ALIAS("md-raid10");
 MODULE_ALIAS("md-level-10");
 
-module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
From: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

mainline inclusion
from mainline-v5.18-rc1
commit daae161fd2e568b4f481b177b8be34374df98b68
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
Those counters are not necessary after commit 11bb45e8aaf6 ("md: drop queue limitation for RAID1 and RAID10"). Remove them from all code (conf and plug structs). raid1_plug_cb and raid10_plug_cb are identical, so move definition of raid1_plug_cb to common raid1-10 definitions and use it for RAID10 too.
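For illustration only, here is a userspace sketch (not kernel code) of why one shared struct can serve both personalities: each unplug callback receives only the embedded blk_plug_cb and recovers the enclosing struct with container_of(). The stand-in types, the local container_of() macro and plug_from_cb() are assumptions made for this example and are simplified from the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified stand-ins for the block layer types. */
struct blk_plug_cb {
	void *data;
};

struct bio_list {
	void *head;
	void *tail;
};

/* Same layout idea as the shared struct moved into raid1-10.c. */
struct raid1_plug_cb {
	struct blk_plug_cb cb;
	struct bio_list pending;
};

/* Recover the enclosing plug from the callback pointer (hypothetical helper). */
struct raid1_plug_cb *plug_from_cb(struct blk_plug_cb *cb)
{
	assert(cb != NULL);
	return container_of(cb, struct raid1_plug_cb, cb);
}
```

Because neither personality keeps private fields in the plug struct any more, raid10_unplug() can use the same container_of() conversion as raid1_unplug().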
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid1-10.c |  5 +++++
 drivers/md/raid1.c    | 11 -----------
 drivers/md/raid1.h    |  1 -
 drivers/md/raid10.c   | 17 +++--------------
 drivers/md/raid10.h   |  1 -
 5 files changed, 8 insertions(+), 27 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 83f9a4f3d82e..e61f6cad4e08 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -28,6 +28,11 @@ struct resync_pages {
 	struct page	*pages[RESYNC_PAGES];
 };
 
+struct raid1_plug_cb {
+	struct blk_plug_cb	cb;
+	struct bio_list		pending;
+};
+
 static void rbio_pool_free(void *rbio, void *data)
 {
 	kfree(rbio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6b076249dd88..d5c278994d31 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -826,7 +826,6 @@ static void flush_pending_writes(struct r1conf *conf)
 		struct bio *bio;
 
 		bio = bio_list_get(&conf->pending_bio_list);
-		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
 
 		/*
@@ -1148,12 +1147,6 @@ static void alloc_behind_master_bio(struct r1bio *r1_bio,
 	bio_put(behind_bio);
 }
 
-struct raid1_plug_cb {
-	struct blk_plug_cb	cb;
-	struct bio_list		pending;
-	int			pending_cnt;
-};
-
 static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 {
 	struct raid1_plug_cb *plug = container_of(cb, struct raid1_plug_cb,
@@ -1165,7 +1158,6 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	if (from_schedule || current->bio_list) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
-		conf->pending_count += plug->pending_cnt;
 		spin_unlock_irq(&conf->device_lock);
 		wake_up(&conf->wait_barrier);
 		md_wakeup_thread(mddev->thread);
@@ -1554,11 +1546,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 			plug = NULL;
 		if (plug) {
 			bio_list_add(&plug->pending, mbio);
-			plug->pending_cnt++;
 		} else {
 			spin_lock_irqsave(&conf->device_lock, flags);
 			bio_list_add(&conf->pending_bio_list, mbio);
-			conf->pending_count++;
 			spin_unlock_irqrestore(&conf->device_lock, flags);
 			md_wakeup_thread(mddev->thread);
 		}
@@ -3039,7 +3029,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	init_waitqueue_head(&conf->wait_barrier);
 
 	bio_list_init(&conf->pending_bio_list);
-	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
 	err = -EIO;
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index ff30681d753c..468f189da7a0 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -87,7 +87,6 @@ struct r1conf {
 
 	/* queue pending writes to be submitted on unplug */
 	struct bio_list		pending_bio_list;
-	int			pending_count;
 
 	/* for use when syncing mirrors:
 	 * We don't allow both normal IO and resync/recovery IO at
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3e1483bade73..968e30e5e2f6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -876,7 +876,6 @@ static void flush_pending_writes(struct r10conf *conf)
 		struct bio *bio;
 
 		bio = bio_list_get(&conf->pending_bio_list);
-		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
 
 		/*
@@ -1071,16 +1070,9 @@ static sector_t choose_data_offset(struct r10bio *r10_bio,
 		return rdev->new_data_offset;
 }
 
-struct raid10_plug_cb {
-	struct blk_plug_cb	cb;
-	struct bio_list		pending;
-	int			pending_cnt;
-};
-
 static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 {
-	struct raid10_plug_cb *plug = container_of(cb, struct raid10_plug_cb,
-						   cb);
+	struct raid1_plug_cb *plug = container_of(cb, struct raid1_plug_cb, cb);
 	struct mddev *mddev = plug->cb.data;
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
@@ -1088,7 +1080,6 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	if (from_schedule || current->bio_list) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
-		conf->pending_count += plug->pending_cnt;
 		spin_unlock_irq(&conf->device_lock);
 		wake_up(&conf->wait_barrier);
 		md_wakeup_thread(mddev->thread);
@@ -1247,7 +1238,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
 	unsigned long flags;
 	struct blk_plug_cb *cb;
-	struct raid10_plug_cb *plug = NULL;
+	struct raid1_plug_cb *plug = NULL;
 	struct r10conf *conf = mddev->private;
 	struct md_rdev *rdev;
 	int devnum = r10_bio->devs[n_copy].devnum;
@@ -1283,16 +1274,14 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 
 	cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
 	if (cb)
-		plug = container_of(cb, struct raid10_plug_cb, cb);
+		plug = container_of(cb, struct raid1_plug_cb, cb);
 	else
 		plug = NULL;
 	if (plug) {
 		bio_list_add(&plug->pending, mbio);
-		plug->pending_cnt++;
 	} else {
 		spin_lock_irqsave(&conf->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		md_wakeup_thread(mddev->thread);
 	}
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 4f627ad16bec..6570f96e8f47 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -75,7 +75,6 @@ struct r10conf {
 
 	/* queue pending writes and submit them on unplug */
 	struct bio_list		pending_bio_list;
-	int			pending_count;
 
 	spinlock_t		resync_lock;
 	atomic_t		nr_pending;
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.5-rc1
commit 010444623e7f4da6b4a4dd603a7da7469981e293
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
Currently, there is no limit for raid1/raid10 plugged bio. While flushing writes, raid1 has cond_resched() while raid10 doesn't, and too many writes can cause soft lockup.
The following soft lockup can be triggered easily with a writeback test for raid10 with ramdisks:
watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293]
Call Trace:
 <TASK>
 call_rcu+0x16/0x20
 put_object+0x41/0x80
 __delete_object+0x50/0x90
 delete_object_full+0x2b/0x40
 kmemleak_free+0x46/0xa0
 slab_free_freelist_hook.constprop.0+0xed/0x1a0
 kmem_cache_free+0xfd/0x300
 mempool_free_slab+0x1f/0x30
 mempool_free+0x3a/0x100
 bio_free+0x59/0x80
 bio_put+0xcf/0x2c0
 free_r10bio+0xbf/0xf0
 raid_end_bio_io+0x78/0xb0
 one_write_done+0x8a/0xa0
 raid10_end_write_request+0x1b4/0x430
 bio_endio+0x175/0x320
 brd_submit_bio+0x3b9/0x9b7 [brd]
 __submit_bio+0x69/0xe0
 submit_bio_noacct_nocheck+0x1e6/0x5a0
 submit_bio_noacct+0x38c/0x7e0
 flush_pending_writes+0xf0/0x240
 raid10d+0xac/0x1ed0
Fix the problem by adding cond_resched() to raid10, as raid1 already does.
Note that unlimited plugged bios still need to be optimized. For example, when lots of dirty pages are written back, the plugged bios take a lot of memory and io spends a long time in the plug, so io latency is bad.
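As a rough userspace sketch of the loop shape this fix produces (not the kernel code itself): walk the singly linked chain of pending writes, submit each one, and offer the CPU to other tasks after every submission. Here struct fake_bio, fake_submit() and flush_bio_chain() are hypothetical names, and sched_yield() only stands in for cond_resched(), which has different semantics in the kernel.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio: only the chain link matters here. */
struct fake_bio {
	struct fake_bio *bi_next;
	int submitted;
};

/* Stand-in for submit_bio_noacct(): just record that the bio was issued. */
void fake_submit(struct fake_bio *bio)
{
	bio->submitted = 1;
}

/* Submit every bio on the chain, yielding after each one. */
unsigned int flush_bio_chain(struct fake_bio *bio)
{
	unsigned int count = 0;

	while (bio) { /* submit pending writes */
		struct fake_bio *next = bio->bi_next;

		bio->bi_next = NULL;
		fake_submit(bio);
		bio = next;
		count++;
		sched_yield(); /* stand-in for cond_resched() */
	}
	return count;
}

/* Small usage example: a chain of three bios is fully flushed. */
void flush_demo(void)
{
	struct fake_bio c = { NULL, 0 };
	struct fake_bio b = { &c, 0 };
	struct fake_bio a = { &b, 0 };

	assert(flush_bio_chain(&a) == 3);
	assert(a.submitted && b.submitted && c.submitted);
}
```

The yield point does not bound how many bios are queued; it only keeps a very long chain from monopolizing the CPU, which is why the note above still calls for limiting plugged bios.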
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid10.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 968e30e5e2f6..4bf556d2ad8e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -909,6 +909,7 @@ static void flush_pending_writes(struct r10conf *conf)
 			else
 				submit_bio_noacct(bio);
 			bio = next;
+			cond_resched();
 		}
 		blk_finish_plug(&plug);
 	} else
@@ -1106,6 +1107,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 		else
 			submit_bio_noacct(bio);
 		bio = next;
+		cond_resched();
 	}
 	kfree(plug);
 }
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.5-rc1
commit 5ec6ca140a034682e421e2e808ef5ddfdfd65242
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
The code in raid1 and raid10 is identical. Factor it out into a helper, in preparation for limiting the number of plugged bios.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com

Conflicts:
  Commit 309dca309fc3 ("block: store a block_device pointer in struct bio")
  changed "mbio->bi_disk" to "mbio->bi_bdev";
  Commit 2e94275ed582 ("md/raid1: use rdev in raid1_write_request directly")
  changed "conf->mirrors[i].rdev" to "rdev";
  Commit cb1802ff82e1 ("md/raid10: Use the new blk_opf_t type")
  changed the type of "do_sync" and "do_fua".
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid1-10.c | 16 ++++++++++++++++
 drivers/md/raid1.c    | 11 +----------
 drivers/md/raid10.c   | 11 +----------
 3 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index e61f6cad4e08..9bf19a3409ce 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -109,3 +109,19 @@ static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
 		size -= len;
 	} while (idx++ < RESYNC_PAGES && size > 0);
 }
+
+static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
+					 blk_plug_cb_fn unplug)
+{
+	struct raid1_plug_cb *plug = NULL;
+	struct blk_plug_cb *cb = blk_check_plugged(unplug, mddev,
+						   sizeof(*plug));
+
+	if (!cb)
+		return false;
+
+	plug = container_of(cb, struct raid1_plug_cb, cb);
+	bio_list_add(&plug->pending, bio);
+
+	return true;
+}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d5c278994d31..07408ad991a5 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1319,8 +1319,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	struct bitmap *bitmap = mddev->bitmap;
 	unsigned long flags;
 	struct md_rdev *blocked_rdev;
-	struct blk_plug_cb *cb;
-	struct raid1_plug_cb *plug = NULL;
 	int first_clone;
 	int max_sectors;
 	bool write_behind = false;
@@ -1539,14 +1537,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		/* flush_pending_writes() needs access to the rdev so...*/
 		mbio->bi_disk = (void *)conf->mirrors[i].rdev;
 
-		cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
-		if (cb)
-			plug = container_of(cb, struct raid1_plug_cb, cb);
-		else
-			plug = NULL;
-		if (plug) {
-			bio_list_add(&plug->pending, mbio);
-		} else {
+		if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
 			spin_lock_irqsave(&conf->device_lock, flags);
 			bio_list_add(&conf->pending_bio_list, mbio);
 			spin_unlock_irqrestore(&conf->device_lock, flags);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 4bf556d2ad8e..bfed5ffc9ce0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1239,8 +1239,6 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
 	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
 	unsigned long flags;
-	struct blk_plug_cb *cb;
-	struct raid1_plug_cb *plug = NULL;
 	struct r10conf *conf = mddev->private;
 	struct md_rdev *rdev;
 	int devnum = r10_bio->devs[n_copy].devnum;
@@ -1274,14 +1272,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 
 	atomic_inc(&r10_bio->remaining);
 
-	cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
-	if (cb)
-		plug = container_of(cb, struct raid1_plug_cb, cb);
-	else
-		plug = NULL;
-	if (plug) {
-		bio_list_add(&plug->pending, mbio);
-	} else {
+	if (!raid1_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
 		spin_lock_irqsave(&conf->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, mbio);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.5-rc1
commit 8295efbe68c080047e98d9c0eb5cb933b238a8cb
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
The same code is duplicated in multiple places. Factor out a helper to remove the redundancy; the helper will also be used in a following patch.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com

Conflicts:
  Commit 309dca309fc3 ("block: store a block_device pointer in struct bio")
  changed "mbio->bi_disk" to "mbio->bi_bdev";
  Commit 70200574cc22 ("block: remove QUEUE_FLAG_DISCARD")
  use a non-zero max_discard_sectors as an indicator for discard support.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid1-10.c | 17 +++++++++++++++++
 drivers/md/raid1.c    | 13 ++-----------
 drivers/md/raid10.c   | 26 ++++----------------------
 3 files changed, 23 insertions(+), 33 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 9bf19a3409ce..471e398b43b4 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -110,6 +110,23 @@ static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
 	} while (idx++ < RESYNC_PAGES && size > 0);
 }
 
+
+static inline void raid1_submit_write(struct bio *bio)
+{
+	struct md_rdev *rdev = (struct md_rdev *)bio->bi_disk;
+
+	bio->bi_next = NULL;
+	bio_set_dev(bio, rdev->bdev);
+	if (test_bit(Faulty, &rdev->flags))
+		bio_io_error(bio);
+	else if (unlikely(bio_op(bio) == REQ_OP_DISCARD &&
+			  !blk_queue_discard(bio->bi_disk->queue)))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		submit_bio_noacct(bio);
+}
+
 static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
 					 blk_plug_cb_fn unplug)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 07408ad991a5..5a0beadacfc0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -798,17 +798,8 @@ static void flush_bio_list(struct r1conf *conf, struct bio *bio)
 
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
-		struct md_rdev *rdev = (void *)bio->bi_disk;
-		bio->bi_next = NULL;
-		bio_set_dev(bio, rdev->bdev);
-		if (test_bit(Faulty, &rdev->flags)) {
-			bio_io_error(bio);
-		} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-				    !blk_queue_discard(bio->bi_disk->queue)))
-			/* Just ignore it */
-			bio_endio(bio);
-		else
-			submit_bio_noacct(bio);
+
+		raid1_submit_write(bio);
 		bio = next;
 		cond_resched();
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index bfed5ffc9ce0..fbc7705b0bdf 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -897,17 +897,8 @@ static void flush_pending_writes(struct r10conf *conf)
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
-			struct md_rdev *rdev = (void*)bio->bi_disk;
-			bio->bi_next = NULL;
-			bio_set_dev(bio, rdev->bdev);
-			if (test_bit(Faulty, &rdev->flags)) {
-				bio_io_error(bio);
-			} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-					    !blk_queue_discard(bio->bi_disk->queue)))
-				/* Just ignore it */
-				bio_endio(bio);
-			else
-				submit_bio_noacct(bio);
+
+			raid1_submit_write(bio);
 			bio = next;
 			cond_resched();
 		}
@@ -1095,17 +1086,8 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
-		struct md_rdev *rdev = (void*)bio->bi_disk;
-		bio->bi_next = NULL;
-		bio_set_dev(bio, rdev->bdev);
-		if (test_bit(Faulty, &rdev->flags)) {
-			bio_io_error(bio);
-		} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-				    !blk_queue_discard(bio->bi_disk->queue)))
-			/* Just ignore it */
-			bio_endio(bio);
-		else
-			submit_bio_noacct(bio);
+
+		raid1_submit_write(bio);
 		bio = next;
 		cond_resched();
 	}
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.5-rc1
commit 7db922bae3abdf0a1db81ef7228cc0b996a0c1e3
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
Commit 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10") added bitmap support and changed write io to be submitted through the daemon thread, because the bitmap needs to be updated before write io. Later, plugging was used to fix the resulting performance regression: with all write io going to the daemon thread, io could not be issued concurrently.
However, if the bitmap is not enabled, write io should not go to the daemon thread in the first place, and the plug is not needed either.
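The resulting routing decision can be modeled in userspace as follows. This is only a sketch under simplifying assumptions: struct fake_bitmap, fake_bitmap_enabled(), enum write_route and route_write() are hypothetical names, the boolean task_plugged stands in for whether blk_check_plugged() returns a callback, and no io is actually issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields md_bitmap_enabled() looks at. */
struct fake_bitmap {
	void *filemap;	/* backing pages of the bitmap, NULL if none */
	bool stale;	/* models the BITMAP_STALE flag */
};

/* A bitmap counts as enabled only if it exists, has storage and is not stale. */
bool fake_bitmap_enabled(const struct fake_bitmap *bitmap)
{
	return bitmap && bitmap->filemap && !bitmap->stale;
}

enum write_route {
	ROUTE_DIRECT,	/* submit the write immediately */
	ROUTE_PLUG,	/* park the write on the task's plug list */
	ROUTE_DAEMON,	/* queue on pending_bio_list and wake the daemon */
};

/* Decide where a write goes, in the same order the helper checks. */
enum write_route route_write(const struct fake_bitmap *bitmap,
			     bool task_plugged)
{
	if (!fake_bitmap_enabled(bitmap))
		return ROUTE_DIRECT;
	if (!task_plugged)
		return ROUTE_DAEMON;
	return ROUTE_PLUG;
}

/* Small usage example covering the no-bitmap fast path. */
void route_demo(void)
{
	assert(route_write(NULL, false) == ROUTE_DIRECT);
	assert(route_write(NULL, true) == ROUTE_DIRECT);
}
```

Checking the bitmap first means arrays without a bitmap never pay for plugging or for a trip through the daemon thread.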
Fixes: 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/md-bitmap.c |  4 +---
 drivers/md/md-bitmap.h |  7 +++++++
 drivers/md/raid1-10.c  | 13 +++++++++++--
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 895bbb512135..8865d427d7e7 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1001,7 +1001,6 @@ static int md_bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
 	return set;
 }
 
-
 /* this gets called when the md device is ready to unplug its underlying
  * (slave) device queues -- before we let any writes go down, we need to
  * sync the dirty pages of the bitmap file to disk */
@@ -1011,8 +1010,7 @@ void md_bitmap_unplug(struct bitmap *bitmap)
 	int dirty, need_write;
 	int writing = 0;
 
-	if (!bitmap || !bitmap->storage.filemap ||
-	    test_bit(BITMAP_STALE, &bitmap->flags))
+	if (!md_bitmap_enabled(bitmap))
 		return;
 
 	/* look at each page to see if there are any set bits that need to be
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index cfd7395de8fd..3a4750952b3a 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -273,6 +273,13 @@ int md_bitmap_copy_from_slot(struct mddev *mddev, int slot,
 			     sector_t *lo, sector_t *hi, bool clear_bits);
 void md_bitmap_free(struct bitmap *bitmap);
 void md_bitmap_wait_behind_writes(struct mddev *mddev);
+
+static inline bool md_bitmap_enabled(struct bitmap *bitmap)
+{
+	return bitmap && bitmap->storage.filemap &&
+			!test_bit(BITMAP_STALE, &bitmap->flags);
+}
+
 #endif
 
 #endif
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 471e398b43b4..b5871259b832 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -131,9 +131,18 @@ static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
 					 blk_plug_cb_fn unplug)
 {
 	struct raid1_plug_cb *plug = NULL;
-	struct blk_plug_cb *cb = blk_check_plugged(unplug, mddev,
-						   sizeof(*plug));
+	struct blk_plug_cb *cb;
+
+	/*
+	 * If bitmap is not enabled, it's safe to submit the io directly, and
+	 * this can get optimal performance.
+	 */
+	if (!md_bitmap_enabled(mddev->bitmap)) {
+		raid1_submit_write(bio);
+		return true;
+	}
 
+	cb = blk_check_plugged(unplug, mddev, sizeof(*plug));
 	if (!cb)
 		return false;
 
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.5-rc1
commit a022325ab970cf04b66ca128a87345714aa44b99
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
If the bitmap is enabled, it must be updated before write io is submitted. This is why the unplug callback must move these io to 'conf->pending_bio_list' if 'current->bio_list' is not empty, which degrades performance.
A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submitting bitmap io in raid10_unplug() doesn't require 'current->bio_list' to be empty.
This patch prepares to limit the number of plugged bios.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com

Conflicts:
  Commit 3ce94ce5d05a ("md: fix duplicate filename for rdev")
  remove the md_rdev_misc_wq;
  Commit 28144f9998e0 ("md: use __register_blkdev to allocate devices on demand")
  changed to use __register_blkdev to allocate devices.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/md-bitmap.c | 29 +++++++++++++++++++++++++++++
 drivers/md/md-bitmap.h |  1 +
 drivers/md/md.c        |  9 +++++++++
 drivers/md/md.h        |  1 +
 4 files changed, 40 insertions(+)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 8865d427d7e7..107e656e507a 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1039,6 +1039,35 @@ void md_bitmap_unplug(struct bitmap *bitmap)
 }
 EXPORT_SYMBOL(md_bitmap_unplug);
 
+struct bitmap_unplug_work {
+	struct work_struct work;
+	struct bitmap *bitmap;
+	struct completion *done;
+};
+
+static void md_bitmap_unplug_fn(struct work_struct *work)
+{
+	struct bitmap_unplug_work *unplug_work =
+		container_of(work, struct bitmap_unplug_work, work);
+
+	md_bitmap_unplug(unplug_work->bitmap);
+	complete(unplug_work->done);
+}
+
+void md_bitmap_unplug_async(struct bitmap *bitmap)
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct bitmap_unplug_work unplug_work;
+
+	INIT_WORK_ONSTACK(&unplug_work.work, md_bitmap_unplug_fn);
+	unplug_work.bitmap = bitmap;
+	unplug_work.done = &done;
+
+	queue_work(md_bitmap_wq, &unplug_work.work);
+	wait_for_completion(&done);
+}
+EXPORT_SYMBOL(md_bitmap_unplug_async);
+
 static void md_bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int needed);
 /* * bitmap_init_from_disk -- called at bitmap_create time to initialize
  * the in-memory bitmap from the on-disk bitmap -- also, sets up the
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 3a4750952b3a..8a3788c9bfef 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -264,6 +264,7 @@ void md_bitmap_sync_with_cluster(struct mddev *mddev,
 				 sector_t new_lo, sector_t new_hi);
 
 void md_bitmap_unplug(struct bitmap *bitmap);
+void md_bitmap_unplug_async(struct bitmap *bitmap);
 void md_bitmap_daemon_work(struct mddev *mddev);
 
 int md_bitmap_resize(struct bitmap *bitmap, sector_t blocks,
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 02dcf64103ca..4726c1ec3e34 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -82,6 +82,7 @@ static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
 static struct workqueue_struct *md_rdev_misc_wq;
+struct workqueue_struct *md_bitmap_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
@@ -9707,6 +9708,11 @@ static int __init md_init(void)
 	if (!md_rdev_misc_wq)
 		goto err_rdev_misc_wq;
 
+	md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND,
+				       0);
+	if (!md_bitmap_wq)
+		goto err_bitmap_wq;
+
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
 		goto err_md;
 
@@ -9728,6 +9734,8 @@ static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
+	destroy_workqueue(md_bitmap_wq);
+err_bitmap_wq:
 	destroy_workqueue(md_rdev_misc_wq);
 err_rdev_misc_wq:
 	destroy_workqueue(md_misc_wq);
@@ -10023,6 +10031,7 @@ static __exit void md_exit(void)
 	}
 	destroy_workqueue(md_rdev_misc_wq);
 	destroy_workqueue(md_misc_wq);
+	destroy_workqueue(md_bitmap_wq);
 	destroy_workqueue(md_wq);
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 72e9f31c3ef9..df78553b4e5a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -839,6 +839,7 @@ struct mdu_array_info_s;
 struct mdu_disk_info_s;
 
 extern int mdp_major;
+extern struct workqueue_struct *md_bitmap_wq;
 void md_autostart_arrays(int part);
 int md_set_array_info(struct mddev *mddev, struct mdu_array_info_s *info);
 int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info);
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.5-rc1
commit 9efcc2c3df7612eea02daa159ae7c6ac44420513
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
current->bio_list is set in submit_bio() context. In this case bitmap io is added to the list and waits for the current io submission to finish, while the current io submission must wait for the bitmap io to be done. Commit 874807a83139 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fixed the deadlock by handling plugged bios in the daemon thread.
On the one hand, the deadlock no longer exists after commit a214b949d8e3 ("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On the other hand, the current solution makes it impossible to flush plugged bios in raid1/10_make_request(), because doing so would send all writes to the daemon thread.
In order to limit the number of plugged bios, commit 874807a83139 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the deadlock is fixed by handling bitmap io asynchronously.
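To make the behavioural change easier to compare, the unplug decision before and after this patch can be modeled as two small pure functions. This is a userspace sketch only: the enum and function names are hypothetical, from_schedule mirrors the callback argument, and in_submit_bio stands in for "current->bio_list is set".

```c
#include <assert.h>
#include <stdbool.h>

enum unplug_path {
	UNPLUG_TO_DAEMON,		/* merge into pending_bio_list, wake daemon */
	UNPLUG_DIRECT_SYNC_BITMAP,	/* flush bitmap synchronously, then submit */
	UNPLUG_DIRECT_ASYNC_BITMAP,	/* flush bitmap via kworker, then submit */
};

/* Behaviour with commit 874807a83139 still in place. */
enum unplug_path unplug_path_before(bool from_schedule, bool in_submit_bio)
{
	if (from_schedule || in_submit_bio)
		return UNPLUG_TO_DAEMON;
	return UNPLUG_DIRECT_SYNC_BITMAP;
}

/* Behaviour after the revert: only from_schedule defers to the daemon. */
enum unplug_path unplug_path_after(bool from_schedule, bool in_submit_bio)
{
	if (from_schedule)
		return UNPLUG_TO_DAEMON;
	return in_submit_bio ? UNPLUG_DIRECT_ASYNC_BITMAP
			     : UNPLUG_DIRECT_SYNC_BITMAP;
}

/* Small usage example: the submit_bio() context is the case that changes. */
void unplug_demo(void)
{
	assert(unplug_path_before(false, true) == UNPLUG_TO_DAEMON);
	assert(unplug_path_after(false, true) == UNPLUG_DIRECT_ASYNC_BITMAP);
}
```

Only the case "not from schedule, but inside submit_bio()" differs between the two functions, which is the case a later flush from the make_request path relies on.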
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 drivers/md/raid1-10.c | 14 ++++++++++++++
 drivers/md/raid1.c    |  4 ++--
 drivers/md/raid10.c   |  8 +++-----
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index b5871259b832..fef13945aab1 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -151,3 +151,17 @@ static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
 
 	return true;
 }
+
+/*
+ * current->bio_list will be set under submit_bio() context, in this case bitmap
+ * io will be added to the list and wait for current io submission to finish,
+ * while current io submission must wait for bitmap io to be done. In order to
+ * avoid such deadlock, submit bitmap io asynchronously.
+ */
+static inline void raid1_prepare_flush_writes(struct bitmap *bitmap)
+{
+	if (current->bio_list)
+		md_bitmap_unplug_async(bitmap);
+	else
+		md_bitmap_unplug(bitmap);
+}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5a0beadacfc0..fbfc5dcf1a12 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -793,7 +793,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 static void flush_bio_list(struct r1conf *conf, struct bio *bio)
 {
 	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
-	md_bitmap_unplug(conf->mddev->bitmap);
+	raid1_prepare_flush_writes(conf->mddev->bitmap);
 	wake_up(&conf->wait_barrier);
 
 	while (bio) { /* submit pending writes */
@@ -1146,7 +1146,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		spin_unlock_irq(&conf->device_lock);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index fbc7705b0bdf..cbc199ea7c52 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -890,9 +890,7 @@ static void flush_pending_writes(struct r10conf *conf)
 		__set_current_state(TASK_RUNNING);
 
 		blk_start_plug(&plug);
-		/* flush any pending bitmap writes to disk
-		 * before proceeding w/ I/O */
-		md_bitmap_unplug(conf->mddev->bitmap);
+		raid1_prepare_flush_writes(conf->mddev->bitmap);
 		wake_up(&conf->wait_barrier);
 
 		while (bio) { /* submit pending writes */
@@ -1069,7 +1067,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		spin_unlock_irq(&conf->device_lock);
@@ -1081,7 +1079,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 	/* we aren't scheduling, so we can do the write-out directly. */
 	bio = bio_list_get(&plug->pending);
-	md_bitmap_unplug(mddev->bitmap);
+	raid1_prepare_flush_writes(mddev->bitmap);
 	wake_up(&conf->wait_barrier);
 
 	while (bio) { /* submit pending writes */
From: Yu Kuai yukuai3@huawei.com
mainline inclusion
from mainline-v6.5-rc1
commit 460af1f9d9e62acce4a21f9bd00b5bcd5963bcd4
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
Bios can be added to the plug without limit, and the following writeback test can trigger a huge number of plugged bios:
Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
Test result: Monitoring /sys/block/md0/inflight shows that the inflight count keeps increasing until fio finishes writing. After running for about 2 minutes:
[root@fedora ~]# cat /sys/block/md0/inflight
0 4474191
Fix the problem by limiting the number of plugged bios based on the number of copies of the original bio.
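As a rough illustration of the limit described above, the following user-space sketch models only the counter check that this patch adds to raid1_add_bio_to_plug(). The names plug_model, plug_model_add() and bios_until_flush() are hypothetical and not part of the patch; the real code also removes the callback from the plug list and invokes it, which is not modeled here.

```c
#include <stdbool.h>

#define MAX_PLUG_BIO 32 /* same value as in the patch */

/* Hypothetical stand-in for struct raid1_plug_cb: only the counter is modeled. */
struct plug_model {
	unsigned int count;
};

/*
 * Mirrors the check added to raid1_add_bio_to_plug(): account for one more
 * plugged bio and return true when the plug should be flushed right away.
 */
static bool plug_model_add(struct plug_model *plug, int copies)
{
	return ++plug->count / MAX_PLUG_BIO >= (unsigned int)copies;
}

/* Count how many bios can be added to a fresh plug before the first flush. */
static unsigned int bios_until_flush(int copies)
{
	struct plug_model plug = { .count = 0 };
	unsigned int added = 0;

	do {
		added++;
	} while (!plug_model_add(&plug, copies));

	return added;
}
```

Under this model the flush fires once count reaches copies * MAX_PLUG_BIO, so a caller passing copies = 2 would flush after 64 plugged bios and a caller passing copies = 4 after 128.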
Signed-off-by: Yu Kuai yukuai3@huawei.com
Signed-off-by: Song Liu song@kernel.org
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
Conflicts:
  Commit 309dca309fc3 ("block: store a block_device pointer in struct bio") changed "mbio->bi_disk" to "mbio->bi_bdev";
  Commit 2e94275ed582 ("md/raid1: use rdev in raid1_write_request directly") changed "conf->mirrors[i].rdev" to "rdev";
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com
---
 drivers/md/raid1-10.c | 9 ++++++++-
 drivers/md/raid1.c    | 2 +-
 drivers/md/raid10.c   | 2 +-
 3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index fef13945aab1..c2f0fbbf92c4 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -21,6 +21,7 @@
 #define IO_MADE_GOOD ((struct bio *)2)

 #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
+#define MAX_PLUG_BIO 32

 /* for managing resync I/O pages */
 struct resync_pages {
@@ -31,6 +32,7 @@ struct resync_pages {
 struct raid1_plug_cb {
 	struct blk_plug_cb cb;
 	struct bio_list pending;
+	unsigned int count;
 };

 static void rbio_pool_free(void *rbio, void *data)
@@ -128,7 +130,7 @@ static inline void raid1_submit_write(struct bio *bio)
 }

 static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
-				      blk_plug_cb_fn unplug)
+				      blk_plug_cb_fn unplug, int copies)
 {
 	struct raid1_plug_cb *plug = NULL;
 	struct blk_plug_cb *cb;
@@ -148,6 +150,11 @@ static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,

 	plug = container_of(cb, struct raid1_plug_cb, cb);
 	bio_list_add(&plug->pending, bio);
+	if (++plug->count / MAX_PLUG_BIO >= copies) {
+		list_del(&cb->list);
+		cb->callback(cb, false);
+	}
+

 	return true;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fbfc5dcf1a12..ebb7adba44d6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1528,7 +1528,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		/* flush_pending_writes() needs access to the rdev so...*/
 		mbio->bi_disk = (void *)conf->mirrors[i].rdev;

-		if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
+		if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
 			spin_lock_irqsave(&conf->device_lock, flags);
 			bio_list_add(&conf->pending_bio_list, mbio);
 			spin_unlock_irqrestore(&conf->device_lock, flags);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index cbc199ea7c52..0162dbe17cf3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1252,7 +1252,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,

 	atomic_inc(&r10_bio->remaining);

-	if (!raid1_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
+	if (!raid1_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
 		spin_lock_irqsave(&conf->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, mbio);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion
from mainline-v6.5-rc1
commit b5a99602b74bbfa655be509c615181dd95b0719e
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I8UKFJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------
The following build error is triggered when building with clang version 17.0.0 with W=1 (this can't be reproduced with gcc 13.1.0):
drivers/md/raid1-10.c:117:25: error: casting from randomized structure pointer type 'struct block_device *' to 'struct md_rdev *'
  117 |         struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
      |                                ^
Fix this by casting 'bio->bi_bdev' to 'void *', as it used to be.
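The pattern behind this fix can be sketched in plain C as follows. This is only an illustration with hypothetical stand-in types (bio_model, block_device_model, rdev_model) and helpers (stash_rdev(), fetch_rdev()); it is not kernel code. It shows a pointer of one struct type being parked in a field of another pointer type and recovered through 'void *' rather than through a direct struct-to-struct cast.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct block_device and struct md_rdev. */
struct block_device_model { int id; };
struct rdev_model { int raid_disk; };

/* Hypothetical stand-in for struct bio: only the borrowed field is modeled. */
struct bio_model {
	struct block_device_model *bi_bdev;
};

/* Park an rdev pointer in the field, going through void * on the way in. */
static void stash_rdev(struct bio_model *bio, struct rdev_model *rdev)
{
	bio->bi_bdev = (void *)rdev;
}

/* Recover the rdev pointer, again via void * instead of a direct cast. */
static struct rdev_model *fetch_rdev(const struct bio_model *bio)
{
	return (void *)bio->bi_bdev;
}
```

Because 'void *' converts implicitly to any object pointer type in C, neither helper needs a cast that names both struct types, which is the kind of cast the clang diagnostic above complains about.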
Reported-by: kernel test robot lkp@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202306142042.fmjfmTF8-lkp@intel.com/
Fixes: 8295efbe68c0 ("md/raid1-10: factor out a helper to submit normal write")
Signed-off-by: Yu Kuai yukuai3@huawei.com
Signed-off-by: Song Liu song@kernel.org
Link: https://lore.kernel.org/r/20230616012136.3047071-1-yukuai1@huaweicloud.com
Conflict:
  Commit 309dca309fc3 ("block: store a block_device pointer in struct bio") changed "bi_disk" to "bi_bdev", and in commit 8295efbe68c0 ("md/raid1-10: factor out a helper to submit normal write"), "bi_bdev" was adapted to "bi_disk".
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com
---
 drivers/md/raid1-10.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index c2f0fbbf92c4..5cada8fa4af7 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -115,7 +115,7 @@ static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,

 static inline void raid1_submit_write(struct bio *bio)
 {
-	struct md_rdev *rdev = (struct md_rdev *)bio->bi_disk;
+	struct md_rdev *rdev = (void *)bio->bi_disk;

 	bio->bi_next = NULL;
 	bio_set_dev(bio, rdev->bdev);