[PATCH OLK-6.6 00/64] md: backport from mainline
From: Li Nan <linan122@huawei.com>

Benjamin Marzinski (2):
  dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume
  md/raid5: recheck if reshape has finished with device_lock held

Christoph Hellwig (4):
  md: add a mddev_trace_remap helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_is_dm helper
  md/raid0: don't free conf on raid0_run failure

Christophe JAILLET (1):
  md-cluster: Constify struct md_cluster_operations

Florian-Ewald Mueller (1):
  md: add check for sleepers in md_wakeup_thread()

Heming Zhao (2):
  md-cluster: fix hanging issue while a new disk adding
  md-cluster: fix no recovery job when adding/re-adding a disk

Joel Granados (1):
  raid: Remove now superfluous sentinel element from ctl_table array

Li Lingfeng (1):
  md: get rdev->mddev with READ_ONCE()

Li Nan (12):
  md: merge the check of capabilities into md_ioctl_valid()
  md: changed the switch of RAID_VERSION to if
  md: return directly before setting did_set_md_closing
  md: factor out a helper to sync mddev
  md: sync blockdev before stopping raid or setting readonly
  md: clean up openers check in do_md_stop() and md_set_readonly()
  md: check mddev->pers before calling md_set_readonly()
  md: Fix overflow in is_mddev_idle
  md: don't account sync_io if iostats of the disk is disabled
  md: Revert "md: Fix overflow in is_mddev_idle"
  md: change the return value type of md_write_start to void
  md: make md_flush_request() more readable

Mateusz Jończyk (1):
  md/raid1: set max_sectors during early return from choose_slow_rdev()

Mikulas Patocka (1):
  md: fix a suspicious RCU usage warning

Yang Li (1):
  md: Remove unneeded semicolon

Yu Kuai (37):
  md: remove redundant check of 'mddev->sync_thread'
  md: remove redundant md_wakeup_thread()
  md: Don't ignore suspended array in md_check_recovery()
  md: Don't ignore read-only array in md_check_recovery()
  md: Make sure md_do_sync() will set MD_RECOVERY_DONE
  md: Don't register sync_thread for reshape directly
  md: Don't suspend the array for interrupted reshape
  md: add a new helper rdev_has_badblock()
  md/raid1: factor out helpers to add rdev to conf
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: fix choose next idle in read_balance()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out helpers to choose the best rdev from read_balance()
  md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
  md: export helpers to stop sync_thread
  md: export helper md_is_rdwr()
  md: add a new helper reshape_interrupted()
  dm-raid: really frozen sync_thread during suspend
  dm-raid: add a new helper prepare_suspend() in md_personality
  dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
  md: rearrange recovery_flags
  md: add a new enum type sync_action
  md: add new helpers for sync_action
  md: factor out helper to start reshape from action_store()
  md: replace sysfs api sync_action with new helpers
  md: remove parameter check_seq for stop_sync_thread()
  md: don't fail action_store() if sync_thread is not registered
  md: use new helpers in md_do_sync()
  md: replace last_sync_action with new enum type
  md: factor out helpers for different sync_action in md_do_sync()
  md: pass in max_sectors for pers->sync_request()
  md/raid5: fix spares errors about rcu usage

 drivers/md/md-cluster.h |   2 +
 drivers/md/md.h         | 204 ++++--
 drivers/md/raid1.h      |   1 +
 drivers/md/dm-raid.c    |  66 +++-
 drivers/md/md-bitmap.c  |   9 +-
 drivers/md/md-cluster.c |  51 ++-
 drivers/md/md.c         | 858 +++++++++++++++++++++++-----------
 drivers/md/raid0.c      |  24 +-
 drivers/md/raid1-10.c   |  69 ++++
 drivers/md/raid1.c      | 595 ++++++++++++++++------------
 drivers/md/raid10.c     | 120 ++----
 drivers/md/raid5.c      | 215 +++++-----
 12 files changed, 1336 insertions(+), 878 deletions(-)

--
2.39.2
From: Joel Granados <j.granados@samsung.com>

mainline inclusion
from mainline-v6.7-rc1
commit dd6291c506490c195620b394dc96763675e7e5f4
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information
Link: https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel from raid_table

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 715f6be52ea7..f9883d7aa4d4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -312,7 +312,6 @@ static struct ctl_table raid_table[] = {
 		.mode		= S_IRUGO|S_IWUSR,
 		.proc_handler	= proc_dointvec,
 	},
-	{ }
 };

 static int start_readonly;
--
2.39.2
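[ Editor's illustration, not part of the patch: the sentinel convention
  being removed. The assumption here is that the sysctl registration
  path in this tree already derives the table size with ARRAY_SIZE(),
  so the trailing empty entry is dead weight (~64 bytes of data per
  table). All names below are hypothetical. ]

	static int example_value;

	static struct ctl_table example_table[] = {
		{
			.procname	= "example_knob",
			.data		= &example_value,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
		{ }	/* sentinel: only marked the end of the array */
	};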
From: Mikulas Patocka <mpatocka@redhat.com>

mainline inclusion
from mainline-v6.8-rc2
commit 9f3fe29d77ef4e7f7cb5c4c8c59f6dc373e57e78
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

RCU protection was removed in the commit 2d32777d60de ("raid1: remove
rcu protection to access rdev from conf"). However, the code in
fix_read_error() does rcu_dereference() outside rcu_read_lock() - this
triggers the following warning. The warning is triggered by an LVM2
test, shell/integrity-caching.sh.

This commit removes the rcu_dereference().

=============================
WARNING: suspicious RCU usage
6.7.0 #2 Not tainted
-----------------------------
drivers/md/raid1.c:2265 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
no locks held by mdX_raid1/1859.

stack backtrace:
CPU: 2 PID: 1859 Comm: mdX_raid1 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x60/0x70
 lockdep_rcu_suspicious+0x153/0x1b0
 raid1d+0x1732/0x1750 [raid1]
 ? lock_acquire+0x9f/0x270
 ? finish_wait+0x3d/0x80
 ? md_thread+0xf7/0x130 [md_mod]
 ? lock_release+0xaa/0x230
 ? md_register_thread+0xd0/0xd0 [md_mod]
 md_thread+0xa0/0x130 [md_mod]
 ? housekeeping_test_cpu+0x30/0x30
 kthread+0xdc/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x28/0x40
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork_asm+0x11/0x20
 </TASK>

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: ca294b34aaf3 ("md/raid1: support read error check")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/51539879-e1ca-fde3-b8b4-8934ddedcbc@redhat.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 119ef331473d..8820ad4ae0a7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2247,7 +2247,7 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	int sectors = r1_bio->sectors;
 	int read_disk = r1_bio->read_disk;
 	struct mddev *mddev = conf->mddev;
-	struct md_rdev *rdev = rcu_dereference(conf->mirrors[read_disk].rdev);
+	struct md_rdev *rdev = conf->mirrors[read_disk].rdev;

 	if (exceed_read_errors(mddev, rdev)) {
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
--
2.39.2
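[ Editor's illustration, not part of the patch: the pattern lockdep
  complains about. rcu_dereference() outside a read-side critical
  section trips CONFIG_PROVE_RCU; once rdev is no longer RCU-protected,
  a plain load is the correct form. Sketch only. ]

	rcu_read_lock();
	rdev = rcu_dereference(conf->mirrors[disk].rdev); /* ok: inside lock */
	rcu_read_unlock();

	rdev = rcu_dereference(conf->mirrors[disk].rdev); /* suspicious usage */
	rdev = conf->mirrors[disk].rdev;                  /* the fix: plain load */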
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 61c90765e131e63ead773b9b99167415e246a945
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

The lifetime of sync_thread:

1) Set MD_RECOVERY_NEEDED and wake up the daemon thread (by ioctl/sysfs
   or other events);
2) Daemon thread wakes up, md_check_recovery() finds that
   MD_RECOVERY_NEEDED is set:
   a) try to grab reconfig_mutex;
   b) set MD_RECOVERY_RUNNING;
   c) clear MD_RECOVERY_NEEDED, and then queue sync_work;
3) md_start_sync() chooses sync_action, then registers sync_thread;
4) md_do_sync() is done, set MD_RECOVERY_DONE and wake up the daemon
   thread;
5) Daemon thread wakes up, md_check_recovery() finds that
   MD_RECOVERY_DONE is set:
   a) try to grab reconfig_mutex;
   b) unregister sync_thread;
   c) clear MD_RECOVERY_RUNNING and MD_RECOVERY_DONE;

Hence there is no case where sync_thread is registered while
MD_RECOVERY_RUNNING is not set, as shown in the sketch after this
patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231228125553.2697765-2-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c    | 14 ++++----------
 drivers/md/raid5.c |  6 ++----
 2 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f9883d7aa4d4..0002758b98df 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3373,8 +3373,7 @@ static ssize_t new_offset_store(struct md_rdev *rdev,
 	if (kstrtoull(buf, 10, &new_offset) < 0)
 		return -EINVAL;
-	if (mddev->sync_thread ||
-	    test_bit(MD_RECOVERY_RUNNING,&mddev->recovery))
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return -EBUSY;
 	if (new_offset == rdev->data_offset)
 		/* reset is always permitted */
@@ -4052,8 +4051,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	 */

 	rv = -EBUSY;
-	if (mddev->sync_thread ||
-	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector ||
 	    mddev->sysfs_active)
 		goto out_unlock;
@@ -6455,7 +6453,6 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 	mutex_lock(&mddev->open_mutex);
 	if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
-	    mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
 		pr_warn("md: %s still in use.\n",mdname(mddev));
 		err = -EBUSY;
@@ -6508,7 +6505,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	mutex_lock(&mddev->open_mutex);
 	if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
 	    mddev->sysfs_active ||
-	    mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
 		pr_warn("md: %s still in use.\n",mdname(mddev));
 		mutex_unlock(&mddev->open_mutex);
@@ -7354,8 +7350,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 	 * of each device. If num_sectors is zero, we find the largest size
 	 * that fits.
 	 */
-	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
-	    mddev->sync_thread)
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return -EBUSY;
 	if (!md_is_rdwr(mddev))
 		return -EROFS;
@@ -7392,8 +7387,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 	if (raid_disks <= 0 ||
 	    (mddev->max_disks && raid_disks >= mddev->max_disks))
 		return -EINVAL;
-	if (mddev->sync_thread ||
-	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
 	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bfcdbc0cbcd5..910e2e3a1136 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6927,10 +6927,8 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
 	pr_debug("md/raid: change stripe_size from %lu to %lu\n",
 		 conf->stripe_size, new);

-	if (mddev->sync_thread ||
-	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
-	    mddev->reshape_position != MaxSector ||
-	    mddev->sysfs_active) {
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    mddev->reshape_position != MaxSector || mddev->sysfs_active) {
 		err = -EBUSY;
 		goto out_unlock;
 	}
--
2.39.2
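[ Editor's illustration, not part of the patch: the lifetime above
  implies the following invariant, which is why testing
  'mddev->sync_thread' on top of MD_RECOVERY_RUNNING was redundant.
  Sketch of an assertion, not code from the series. ]

	/* sync_thread is only ever registered while RUNNING is set */
	WARN_ON(mddev->sync_thread &&
		!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));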
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit faeaf210a559eb05bc1a294082d100d01c49a1e9
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

On the one hand, mddev_unlock() will call md_wakeup_thread()
unconditionally; on the other hand, md_check_recovery() can't make
progress if 'reconfig_mutex' can't be grabbed. Hence, it really doesn't
make sense to wake up the daemon thread while 'reconfig_mutex' is still
grabbed.

Remove all the md_wakeup_thread() calls for 'mddev->thread' that are
issued while 'reconfig_mutex' is still grabbed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231228125553.2697765-3-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0002758b98df..891bb64d96c3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2963,7 +2963,6 @@ static int add_bound_rdev(struct md_rdev *rdev)
 	set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	md_new_event();
-	md_wakeup_thread(mddev->thread);
 	return 0;
 }
@@ -3078,10 +3077,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		if (err == 0) {
 			md_kick_rdev_from_array(rdev);
-			if (mddev->pers) {
+			if (mddev->pers)
 				set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
-				md_wakeup_thread(mddev->thread);
-			}
 			md_new_event();
 		}
 	}
@@ -3111,7 +3108,6 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		clear_bit(BlockedBadBlocks, &rdev->flags);
 		wake_up(&rdev->blocked_wait);
 		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
-		md_wakeup_thread(rdev->mddev->thread);

 		err = 0;
 	} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
@@ -3149,7 +3145,6 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		    !test_bit(Replacement, &rdev->flags))
 			set_bit(WantReplacement, &rdev->flags);
 		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
-		md_wakeup_thread(rdev->mddev->thread);
 		err = 0;
 	} else if (cmd_match(buf, "-want_replacement")) {
 		/* Clearing 'want_replacement' is always allowed.
@@ -3279,7 +3274,6 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
 		if (rdev->raid_disk >= 0)
 			return -EBUSY;
 		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
-		md_wakeup_thread(rdev->mddev->thread);
 	} else if (rdev->mddev->pers) {
 		/* Activating a spare .. or possibly reactivating
 		 * if we ever get bitmaps working here.
@@ -6225,7 +6219,6 @@ int do_md_run(struct mddev *mddev)
 	/* run start up tasks that require md_thread */
 	md_start(mddev);

-	md_wakeup_thread(mddev->thread);
 	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */

 	set_capacity_and_notify(mddev->gendisk, mddev->array_sectors);
@@ -6246,7 +6239,6 @@ int md_start(struct mddev *mddev)

 	if (mddev->pers->start) {
 		set_bit(MD_RECOVERY_WAIT, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
 		ret = mddev->pers->start(mddev);
 		clear_bit(MD_RECOVERY_WAIT, &mddev->recovery);
 		md_wakeup_thread(mddev->sync_thread);
@@ -6291,7 +6283,6 @@ static int restart_array(struct mddev *mddev)
 	pr_debug("md: %s switched to read-write mode.\n", mdname(mddev));
 	/* Kick recovery or resync if necessary */
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-	md_wakeup_thread(mddev->thread);
 	md_wakeup_thread(mddev->sync_thread);
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
 	return 0;
@@ -6443,7 +6434,6 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 	if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
 		did_freeze = 1;
 		set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
 	}

 	stop_sync_thread(mddev, false, false);
@@ -6475,7 +6465,6 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 	if ((mddev->pers && !err) || did_freeze) {
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	}
@@ -6497,7 +6486,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
 		did_freeze = 1;
 		set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
 	}

 	stop_sync_thread(mddev, true, false);
@@ -6511,7 +6499,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
 		if (did_freeze) {
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-			md_wakeup_thread(mddev->thread);
 		}
 		return -EBUSY;
 	}
@@ -7052,9 +7039,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)

 	md_kick_rdev_from_array(rdev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
-	if (mddev->thread)
-		md_wakeup_thread(mddev->thread);
-	else
+	if (!mddev->thread)
 		md_update_sb(mddev, 1);
 	md_new_event();
@@ -7136,7 +7121,6 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
 	 * array immediately.
 	 */
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-	md_wakeup_thread(mddev->thread);
 	md_new_event();
 	return 0;
--
2.39.2
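[ Editor's illustration, not part of the patch: why the wakeups were
  redundant. As the commit message says, mddev_unlock() wakes the
  daemon unconditionally, and md_check_recovery() cannot take
  'reconfig_mutex' before it is released. Simplified sketch of the
  relevant tail of mddev_unlock(); the real function does more. ]

	static void mddev_unlock(struct mddev *mddev)
	{
		...
		mutex_unlock(&mddev->reconfig_mutex);
		/* wake up the daemon once the mutex is actually free */
		md_wakeup_thread(mddev->thread);
		wake_up(&mddev->sb_wait);
	}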
From: Li Lingfeng <lilingfeng3@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 9cfcf99e7ed613e6b3697e1c1034a24487ec3154
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Users may get rdev->mddev by sysfs while rdev is releasing.
So use both READ_ONCE() and WRITE_ONCE() to prevent load/store tearing
and to read/write mddev atomically.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231229070500.3602712-1-lilingfeng@huaweicloud.co...
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 891bb64d96c3..beea5bf9f084 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2621,7 +2621,7 @@ static void md_kick_rdev_from_array(struct md_rdev *rdev)
 	list_del_rcu(&rdev->same_set);
 	pr_debug("md: unbind<%pg>\n", rdev->bdev);
 	mddev_destroy_serial_pool(rdev->mddev, rdev);
-	rdev->mddev = NULL;
+	WRITE_ONCE(rdev->mddev, NULL);
 	sysfs_remove_link(&rdev->kobj, "block");
 	sysfs_put(rdev->sysfs_state);
 	sysfs_put(rdev->sysfs_unack_badblocks);
@@ -3698,7 +3698,7 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
 	struct kernfs_node *kn = NULL;
 	bool suspend = false;
 	ssize_t rv;
-	struct mddev *mddev = rdev->mddev;
+	struct mddev *mddev = READ_ONCE(rdev->mddev);

 	if (!entry->store)
 		return -EIO;
--
2.39.2
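[ Editor's illustration, not part of the patch: the race being
  hardened. Without the annotations, the compiler is free to split
  either access (load/store tearing), so the sysfs side could observe
  a half-written pointer rather than either NULL or a valid mddev.
  Sketch only; the error handling shown is illustrative. ]

	/* releasing side, md_kick_rdev_from_array() */
	WRITE_ONCE(rdev->mddev, NULL);

	/* sysfs side, rdev_attr_store() */
	struct mddev *mddev = READ_ONCE(rdev->mddev);

	if (!mddev)
		return -ENODEV;		/* rdev is being released */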
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.8-rc6
commit 1baae052cccd08daf9a9d64c3f959d8cdb689757
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

mddev_suspend() never stops sync_thread, hence it doesn't make sense to
ignore a suspended array in md_check_recovery(), which might mean that
sync_thread can never be unregistered.

After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the
following hang can be triggered by the test shell/integrity-caching.sh:

1) suspend the array:
 raid_postsuspend
  mddev_suspend

2) stop the array:
 raid_dtr
  md_stop
   __md_stop_writes
    stop_sync_thread
     set_bit(MD_RECOVERY_INTR, &mddev->recovery);
     md_wakeup_thread_directly(mddev->sync_thread);
     wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

3) sync thread done:
 md_do_sync
  set_bit(MD_RECOVERY_DONE, &mddev->recovery);
  md_wakeup_thread(mddev->thread);

4) daemon thread can't unregister sync thread:
 md_check_recovery
  if (mddev->suspended)
   return; -> return directly
 md_reap_sync_thread
 clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hangs;

This problem is not just related to dm-raid; fix it by not ignoring a
suspended array in md_check_recovery(). Follow-up patches will improve
dm-raid to properly freeze the sync thread during suspend.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@redhat.com/
Fixes: 68866e425be2 ("MD: no sync IO while suspended")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-2-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index beea5bf9f084..2de87f4cf94d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9548,9 +9548,6 @@ static void md_start_sync(struct work_struct *ws)
  */
 void md_check_recovery(struct mddev *mddev)
 {
-	if (READ_ONCE(mddev->suspended))
-		return;
-
 	if (mddev->bitmap)
 		md_bitmap_daemon_work(mddev);
--
2.39.2
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.8-rc6
commit 55a48ad2db64737f7ffc0407634218cc6e4c513b
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Usually if the array is not read-write, md_check_recovery() won't
register a new sync_thread in the first place. And if the array is
read-write and sync_thread is registered, md_set_readonly() will
unregister sync_thread before setting the array read-only. md/raid
follows this behavior, hence there is no problem.

After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the
following hang can be triggered by the test shell/integrity-caching.sh:

1) array is read-only. dm-raid updates the super block:
 rs_update_sbs
  ro = mddev->ro
  mddev->ro = 0
   -> set array read-write
  md_update_sb

2) register new sync thread concurrently.

3) dm-raid sets the array back to read-only:
 rs_update_sbs
  mddev->ro = ro

4) stop the array:
 raid_dtr
  md_stop
   stop_sync_thread
    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
    md_wakeup_thread_directly(mddev->sync_thread);
    wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

5) sync thread done:
 md_do_sync
  set_bit(MD_RECOVERY_DONE, &mddev->recovery);
  md_wakeup_thread(mddev->thread);

6) daemon thread can't unregister sync thread:
 md_check_recovery
  if (!md_is_rdwr(mddev) &&
      !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
   return;
 -> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hangs;

The root cause is that dm-raid manipulates 'mddev->ro' by itself;
however, dm-raid really should stop the sync thread before setting the
array read-only. Unfortunately, I need to read more code before I can
refactor the handling of 'mddev->ro' in dm-raid, hence let's fix the
problem the easy way for now to prevent a dm-raid regression.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/9801e40-8ac7-e225-6a71-309dcf9dc9aa@redhat.com/
Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-3-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2de87f4cf94d..cf8b2e1d5e53 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9524,6 +9524,20 @@ static void md_start_sync(struct work_struct *ws)
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
 }

+static void unregister_sync_thread(struct mddev *mddev)
+{
+	if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+		/* resync/recovery still happening */
+		clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		return;
+	}
+
+	if (WARN_ON_ONCE(!mddev->sync_thread))
+		return;
+
+	md_reap_sync_thread(mddev);
+}
+
 /*
  * This routine is regularly called by all per-raid-array threads to
  * deal with generic issues like resync and super-block update.
@@ -9561,7 +9575,8 @@ void md_check_recovery(struct mddev *mddev)
 	}

 	if (!md_is_rdwr(mddev) &&
-	    !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
+	    !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) &&
+	    !test_bit(MD_RECOVERY_DONE, &mddev->recovery))
 		return;
 	if ( ! (
 		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
@@ -9583,8 +9598,7 @@ void md_check_recovery(struct mddev *mddev)
 		struct md_rdev *rdev;

 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
-			/* sync_work already queued. */
-			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			unregister_sync_thread(mddev);
 			goto unlock;
 		}

@@ -9647,16 +9661,7 @@ void md_check_recovery(struct mddev *mddev)
 	 * still set.
 	 */
 	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
-		if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
-			/* resync/recovery still happening */
-			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-			goto unlock;
-		}
-
-		if (WARN_ON_ONCE(!mddev->sync_thread))
-			goto unlock;
-
-		md_reap_sync_thread(mddev);
+		unregister_sync_thread(mddev);
 		goto unlock;
 	}
--
2.39.2
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.8-rc6
commit 82ec0ae59d02e89164b24c0cc8e4e50de78b5fd6
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must
set MD_RECOVERY_DONE, so that the following md_check_recovery() will
unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up
stop_sync_thread().

If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will
return without setting MD_RECOVERY_DONE, and after commit f52f5c71f3d4
("md: fix stopping sync thread"), dm-raid switched from
md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread
from md_stop() and md_stop_writes(), causing the test
shell/lvconvert-raid-reshape.sh to hang.

We shouldn't switch back to md_reap_sync_thread() because it's
problematic in the first place. Fix the problem by making sure
md_do_sync() will set MD_RECOVERY_DONE.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/ece2b06f-d647-6613-a534-ff4c9bec1142@redhat.com/
Fixes: d5d885fd514f ("md: introduce new personality funciton start()")
Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-4-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index cf8b2e1d5e53..d87b43f8f375 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8856,12 +8856,16 @@ void md_do_sync(struct md_thread *thread)
 	int ret;

 	/* just incase thread restarts... */
-	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
-	    test_bit(MD_RECOVERY_WAIT, &mddev->recovery))
+	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
 		return;
-	if (!md_is_rdwr(mddev)) {/* never try to sync a read-only array */
+
+	if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+		goto skip;
+
+	if (test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
+	    !md_is_rdwr(mddev)) {/* never try to sync a read-only array */
 		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-		return;
+		goto skip;
 	}

 	if (mddev_is_clustered(mddev)) {
--
2.39.2
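[ Editor's illustration, not part of the patch: the contract this
  change enforces. Every early exit from md_do_sync() now funnels
  through the 'skip' path, so the completion flag that
  stop_sync_thread() waits on is always raised. Simplified sketch of
  the tail of md_do_sync(); the exact ordering in md.c differs. ]

 skip:
	...
	set_bit(MD_RECOVERY_DONE, &mddev->recovery);
	/* let the daemon thread reap the sync_thread */
	md_wakeup_thread(mddev->thread);
	wake_up(&resync_wait);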
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.8-rc6
commit ad39c08186f8a0f221337985036ba86731d6aafe
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Currently, if reshape is interrupted, then reassembling the array will
register sync_thread directly from pers->run(). In this case
'MD_RECOVERY_RUNNING' is set directly; however, there is no guarantee
that md_do_sync() will be executed, hence stop_sync_thread() will hang
because 'MD_RECOVERY_RUNNING' can't be cleared.

The last patch makes sure that md_do_sync() will set MD_RECOVERY_DONE;
however, the following hang can still be triggered by the dm-raid test
shell/lvconvert-raid-reshape.sh occasionally:

[root@fedora ~]# cat /proc/1982/stack
[<0>] stop_sync_thread+0x1ab/0x270 [md_mod]
[<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod]
[<0>] raid_presuspend+0x1e/0x70 [dm_raid]
[<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod]
[<0>] __dm_destroy+0x2a5/0x310 [dm_mod]
[<0>] dm_destroy+0x16/0x30 [dm_mod]
[<0>] dev_remove+0x165/0x290 [dm_mod]
[<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod]
[<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod]
[<0>] vfs_ioctl+0x21/0x60
[<0>] __x64_sys_ioctl+0xb9/0xe0
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

Meanwhile mddev->recovery is:
MD_RECOVERY_RUNNING | MD_RECOVERY_INTR | MD_RECOVERY_RESHAPE |
MD_RECOVERY_FROZEN

Fix this problem by removing the code that registers sync_thread
directly from raid10 and raid5, and let md_check_recovery() register
sync_thread instead.

Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-5-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c     |  5 ++++-
 drivers/md/raid10.c | 16 ++--------------
 drivers/md/raid5.c  | 29 ++---------------------------
 3 files changed, 8 insertions(+), 42 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d87b43f8f375..bb93b8f2760a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9451,6 +9451,7 @@ static void md_start_sync(struct work_struct *ws)
 	struct mddev *mddev = container_of(ws, struct mddev, sync_work);
 	int spares = 0;
 	bool suspend = false;
+	char *name;

 	if (md_spares_need_change(mddev))
 		suspend = true;
@@ -9483,8 +9484,10 @@ static void md_start_sync(struct work_struct *ws)
 	if (spares)
 		md_bitmap_write_all(mddev->bitmap);

+	name = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ?
+			"reshape" : "resync";
 	rcu_assign_pointer(mddev->sync_thread,
-			   md_register_thread(md_do_sync, mddev, "resync"));
+			   md_register_thread(md_do_sync, mddev, name));
 	if (!mddev->sync_thread) {
 		pr_warn("%s: could not start resync thread...\n",
 			mdname(mddev));
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 200ae4182932..e303f6336306 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4152,11 +4152,7 @@ static int raid10_run(struct mddev *mddev)
 		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
-		set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-		rcu_assign_pointer(mddev->sync_thread,
-			md_register_thread(md_do_sync, mddev, "reshape"));
-		if (!mddev->sync_thread)
-			goto out_free_conf;
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	}

 	return 0;
@@ -4550,16 +4546,8 @@ static int raid10_start_reshape(struct mddev *mddev)
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
 	set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
-	set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-
-	rcu_assign_pointer(mddev->sync_thread,
-			   md_register_thread(md_do_sync, mddev, "reshape"));
-	if (!mddev->sync_thread) {
-		ret = -EAGAIN;
-		goto abort;
-	}
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	conf->reshape_checkpoint = jiffies;
-	md_wakeup_thread(mddev->sync_thread);
 	md_new_event();
 	return 0;

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 910e2e3a1136..bb2dbb7f6a61 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7896,11 +7896,7 @@ static int raid5_run(struct mddev *mddev)
 		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
-		set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-		rcu_assign_pointer(mddev->sync_thread,
-			md_register_thread(md_do_sync, mddev, "reshape"));
-		if (!mddev->sync_thread)
-			goto abort;
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	}

 	/* Ok, everything is just fine now */
@@ -8466,29 +8462,8 @@ static int raid5_start_reshape(struct mddev *mddev)
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
 	set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
-	set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-	rcu_assign_pointer(mddev->sync_thread,
-			   md_register_thread(md_do_sync, mddev, "reshape"));
-	if (!mddev->sync_thread) {
-		mddev->recovery = 0;
-		spin_lock_irq(&conf->device_lock);
-		write_seqcount_begin(&conf->gen_lock);
-		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
-		mddev->new_chunk_sectors =
-			conf->chunk_sectors = conf->prev_chunk_sectors;
-		mddev->new_layout = conf->algorithm = conf->prev_algo;
-		rdev_for_each(rdev, mddev)
-			rdev->new_data_offset = rdev->data_offset;
-		smp_wmb();
-		conf->generation --;
-		conf->reshape_progress = MaxSector;
-		mddev->reshape_position = MaxSector;
-		write_seqcount_end(&conf->gen_lock);
-		spin_unlock_irq(&conf->device_lock);
-		return -EAGAIN;
-	}
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	conf->reshape_checkpoint = jiffies;
-	md_wakeup_thread(mddev->sync_thread);
 	md_new_event();
 	return 0;
 }
--
2.39.2
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.8-rc6
commit 9e46c70e829bddc24e04f963471e9983a11598b7
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

md_start_sync() will suspend the array if there are spares that can be
added or removed from conf; however, if reshape is still in progress,
this won't happen at all, or data will be corrupted
(remove_and_add_spares won't be called from md_choose_sync_action for
reshape), hence there is no need to suspend the array if reshape is not
done yet.

Meanwhile, there is a potential deadlock for raid456:

1) reshape is interrupted;

2) set one of the disks WantReplacement, and add a new disk to the
   array; however, recovery won't start until the reshape is finished;

3) then issue an IO across the reshape position; this IO will wait for
   reshape to make progress;

4) continue to reshape, then md_start_sync() finds there is a spare
   disk that can be added to conf, and mddev_suspend() is called;

Step 4 and step 3 are waiting for each other, and the deadlock is
triggered. Note that this problem was found by code review and has not
been reproduced yet.

Fix this problem by not suspending the array for an interrupted
reshape; this is safe because conf won't be changed until reshape is
done.

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240201092559.910982-6-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bb93b8f2760a..93ccf1be9f64 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9453,12 +9453,17 @@ static void md_start_sync(struct work_struct *ws)
 	bool suspend = false;
 	char *name;

-	if (md_spares_need_change(mddev))
+	/*
+	 * If reshape is still in progress, spares won't be added or removed
+	 * from conf until reshape is done.
+	 */
+	if (mddev->reshape_position == MaxSector &&
+	    md_spares_need_change(mddev)) {
 		suspend = true;
+		mddev_suspend(mddev, false);
+	}

-	suspend ? mddev_suspend_and_lock_nointr(mddev) :
-		  mddev_lock_nointr(mddev);
-
+	mddev_lock_nointr(mddev);
 	if (!md_is_rdwr(mddev)) {
 		/*
 		 * On a read-only array we can:
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 2fe4ffc3ecdcb69d0e5aded5abf5367e3519bd04
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

There is no functional change. Just to make the code cleaner.

Conflicts:
	drivers/md/md.c
[ The apply order of "md: Don't clear MD_CLOSING when the raid is
  about to stop" is different from mainline. ]

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-2-linan666@huaweicloud.com
---
 drivers/md/md.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 93ccf1be9f64..9666aa0bb51e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7567,16 +7567,17 @@ static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 	return 0;
 }

-static inline bool md_ioctl_valid(unsigned int cmd)
+static inline int md_ioctl_valid(unsigned int cmd)
 {
 	switch (cmd) {
-	case ADD_NEW_DISK:
 	case GET_ARRAY_INFO:
-	case GET_BITMAP_FILE:
 	case GET_DISK_INFO:
+	case RAID_VERSION:
+		return 0;
+	case ADD_NEW_DISK:
+	case GET_BITMAP_FILE:
 	case HOT_ADD_DISK:
 	case HOT_REMOVE_DISK:
-	case RAID_VERSION:
 	case RESTART_ARRAY_RW:
 	case RUN_ARRAY:
 	case SET_ARRAY_INFO:
@@ -7585,9 +7586,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
 	case STOP_ARRAY:
 	case STOP_ARRAY_RO:
 	case CLUSTERED_DISK_NACK:
-		return true;
+		if (!capable(CAP_SYS_ADMIN))
+			return -EACCES;
+		return 0;
 	default:
-		return false;
+		return -ENOTTY;
 	}
 }

@@ -7646,19 +7649,9 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	void __user *argp = (void __user *)arg;
 	struct mddev *mddev = NULL;

-	if (!md_ioctl_valid(cmd))
-		return -ENOTTY;
-
-	switch (cmd) {
-	case RAID_VERSION:
-	case GET_ARRAY_INFO:
-	case GET_DISK_INFO:
-		break;
-	default:
-		if (!capable(CAP_SYS_ADMIN))
-			return -EACCES;
-	}
-
+	err = md_ioctl_valid(cmd);
+	if (err)
+		return err;
 	/*
 	 * Commands dealing with the RAID driver but not any
 	 * particular array:
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 4e26593944e02446a75d911e11b759a9320c8273
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

There is only one case of this 'switch'. Change it to 'if'.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-3-linan666@huaweicloud.com
---
 drivers/md/md.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9666aa0bb51e..c465d78233a0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7656,12 +7656,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	 * Commands dealing with the RAID driver but not any
 	 * particular array:
 	 */
-	switch (cmd) {
-	case RAID_VERSION:
-		err = get_version(argp);
-		goto out;
-	default:;
-	}
+	if (cmd == RAID_VERSION)
+		return get_version(argp);

 	/*
 	 * Commands creating/starting a new array:
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 91b26a39fb83cf370d94980609bf649c3c46993c
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

There is nothing to do at 'out' before setting 'did_set_md_closing' in
md_ioctl(). Return directly, and it will help us to remove
'did_set_md_closing' later.

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-5-linan666@huaweicloud.com

Conflicts:
	drivers/md/md.c
[ The apply order of "md: Don't clear MD_CLOSING when the raid is
  about to stop" is different from mainline. ]

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c465d78233a0..3b390eeca5f1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7669,26 +7669,19 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	switch (cmd) {
 	case GET_ARRAY_INFO:
 		if (!mddev->raid_disks && !mddev->external)
-			err = -ENODEV;
-		else
-			err = get_array_info(mddev, argp);
-		goto out;
+			return -ENODEV;
+		return get_array_info(mddev, argp);

 	case GET_DISK_INFO:
 		if (!mddev->raid_disks && !mddev->external)
-			err = -ENODEV;
-		else
-			err = get_disk_info(mddev, argp);
-		goto out;
+			return -ENODEV;
+		return get_disk_info(mddev, argp);

 	case SET_DISK_FAULTY:
-		err = set_disk_faulty(mddev, new_decode_dev(arg));
-		goto out;
+		return set_disk_faulty(mddev, new_decode_dev(arg));

 	case GET_BITMAP_FILE:
-		err = get_bitmap_file(mddev, argp);
-		goto out;
-
+		return get_bitmap_file(mddev, argp);
 	}

 	if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
@@ -7698,13 +7691,11 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		mutex_lock(&mddev->open_mutex);
 		if (mddev->pers && atomic_read(&mddev->openers) > 1) {
 			mutex_unlock(&mddev->open_mutex);
-			err = -EBUSY;
-			goto out;
+			return -EBUSY;
 		}
 		if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
 			mutex_unlock(&mddev->open_mutex);
-			err = -EBUSY;
-			goto out;
+			return -EBUSY;
 		}
 		mutex_unlock(&mddev->open_mutex);
 		sync_blockdev(bdev);
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit f74aaf614e84d8e5767a062e1172b4907c7b775e
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

There are no functional changes; this prepares for syncing mddev in
array_state_store().

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-7-linan666@huaweicloud.com
---
 drivers/md/md.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3b390eeca5f1..16ef263deabd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -528,6 +528,24 @@ void mddev_resume(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(mddev_resume);

+/* sync bdev before setting device to readonly or stopping raid*/
+static int mddev_set_closing_and_sync_blockdev(struct mddev *mddev, int opener_num)
+{
+	mutex_lock(&mddev->open_mutex);
+	if (mddev->pers && atomic_read(&mddev->openers) > opener_num) {
+		mutex_unlock(&mddev->open_mutex);
+		return -EBUSY;
+	}
+	if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
+		mutex_unlock(&mddev->open_mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&mddev->open_mutex);
+
+	sync_blockdev(mddev->gendisk->part0);
+	return 0;
+}
+
 /*
  * Generic flush handling for md
  */
@@ -7688,17 +7706,9 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		/* Need to flush page cache, and ensure no-one else opens
 		 * and writes
 		 */
-		mutex_lock(&mddev->open_mutex);
-		if (mddev->pers && atomic_read(&mddev->openers) > 1) {
-			mutex_unlock(&mddev->open_mutex);
-			return -EBUSY;
-		}
-		if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
-			mutex_unlock(&mddev->open_mutex);
-			return -EBUSY;
-		}
-		mutex_unlock(&mddev->open_mutex);
-		sync_blockdev(bdev);
+		err = mddev_set_closing_and_sync_blockdev(mddev, 1);
+		if (err)
+			return err;
 	}

 	if (!md_is_rdwr(mddev))
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 99b902ac17253ee65d23012e2be390c026b77fa4
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Commit a05b7ea03d72 ("md: avoid crash when stopping md array races with
closing other open fds.") added sync_blockdev() before stopping raid
and setting readonly. Later, in commit 260fa034ef7a ("md: avoid
deadlock when dirty buffers during md_stop."), it was moved to the
ioctl path, and array_state_store() was missed. Add the blockdev sync
to array_state_store() now.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-8-linan666@huaweicloud.com
---
 drivers/md/md.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 16ef263deabd..03a0bb4a686e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4538,6 +4538,17 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 	case broken:		/* cannot be set */
 	case bad_word:
 		return -EINVAL;
+	case clear:
+	case readonly:
+	case inactive:
+	case read_auto:
+		if (!mddev->pers || !md_is_rdwr(mddev))
+			break;
+		/* write sysfs will not open mddev and opener should be 0 */
+		err = mddev_set_closing_and_sync_blockdev(mddev, 0);
+		if (err)
+			return err;
+		break;
 	default:
 		break;
 	}
@@ -4637,6 +4648,11 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	}
 	mddev_unlock(mddev);
+
+	if (st == readonly || st == read_auto || st == inactive ||
+	    (err && st == clear))
+		clear_bit(MD_CLOSING, &mddev->flags);
+
 	return err ?: len;
 }
 static struct md_sysfs_entry md_array_state =
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 650b2e69ff6ab4de8d895e933f2b6fbacb1f8411
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Before stopping or setting readonly,
mddev_set_closing_and_sync_blockdev() is always called to check the
openers. So there is no longer a need to check them again in
do_md_stop() and md_set_readonly(). Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-9-linan666@huaweicloud.com
---
 drivers/md/md.c | 37 ++++++++++++++-----------------------
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 03a0bb4a686e..b6e4e8076f5a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4520,8 +4520,8 @@ array_state_show(struct mddev *mddev, char *page)
 	return sprintf(page, "%s\n", array_states[st]);
 }

-static int do_md_stop(struct mddev *mddev, int ro, struct block_device *bdev);
-static int md_set_readonly(struct mddev *mddev, struct block_device *bdev);
+static int do_md_stop(struct mddev *mddev, int ro);
+static int md_set_readonly(struct mddev *mddev);
 static int restart_array(struct mddev *mddev);

 static ssize_t
@@ -4582,14 +4582,14 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 	case inactive:
 		/* stop an active array, return 0 otherwise */
 		if (mddev->pers)
-			err = do_md_stop(mddev, 2, NULL);
+			err = do_md_stop(mddev, 2);
 		break;
 	case clear:
-		err = do_md_stop(mddev, 0, NULL);
+		err = do_md_stop(mddev, 0);
 		break;
 	case readonly:
 		if (mddev->pers)
-			err = md_set_readonly(mddev, NULL);
+			err = md_set_readonly(mddev);
 		else {
 			mddev->ro = MD_RDONLY;
 			set_disk_ro(mddev->gendisk, 1);
@@ -4599,7 +4599,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 	case read_auto:
 		if (mddev->pers) {
 			if (md_is_rdwr(mddev))
-				err = md_set_readonly(mddev, NULL);
+				err = md_set_readonly(mddev);
 			else if (mddev->ro == MD_RDONLY)
 				err = restart_array(mddev);
 			if (err == 0) {
@@ -6457,7 +6457,7 @@ void md_stop(struct mddev *mddev)

 EXPORT_SYMBOL_GPL(md_stop);

-static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
+static int md_set_readonly(struct mddev *mddev)
 {
 	int err = 0;
 	int did_freeze = 0;
@@ -6475,9 +6475,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 			  !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
 	mddev_lock_nointr(mddev);

-	mutex_lock(&mddev->open_mutex);
-	if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
-	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
 		pr_warn("md: %s still in use.\n",mdname(mddev));
 		err = -EBUSY;
 		goto out;
@@ -6502,7 +6500,6 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	}

-	mutex_unlock(&mddev->open_mutex);
 	return err;
 }

@@ -6510,8 +6507,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 * 0 - completely stop and dis-assemble array
 * 2 - stop but do not disassemble array
 */
-static int do_md_stop(struct mddev *mddev, int mode,
-		      struct block_device *bdev)
+static int do_md_stop(struct mddev *mddev, int mode)
 {
 	struct gendisk *disk = mddev->gendisk;
 	struct md_rdev *rdev;
@@ -6524,12 +6520,9 @@ static int do_md_stop(struct mddev *mddev, int mode,

 	stop_sync_thread(mddev, true, false);

-	mutex_lock(&mddev->open_mutex);
-	if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
-	    mddev->sysfs_active ||
+	if (mddev->sysfs_active ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
 		pr_warn("md: %s still in use.\n",mdname(mddev));
-		mutex_unlock(&mddev->open_mutex);
 		if (did_freeze) {
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -6551,13 +6544,11 @@ static int do_md_stop(struct mddev *mddev, int mode,
 			sysfs_unlink_rdev(mddev, rdev);

 		set_capacity_and_notify(disk, 0);
-		mutex_unlock(&mddev->open_mutex);
 		mddev->changed = 1;

 		if (!md_is_rdwr(mddev))
 			mddev->ro = MD_RDWR;
-	} else
-		mutex_unlock(&mddev->open_mutex);
+	}
 	/*
 	 * Free resources if final stop
 	 */
@@ -6603,7 +6594,7 @@ static void autorun_array(struct mddev *mddev)
 	err = do_md_run(mddev);
 	if (err) {
 		pr_warn("md: do_md_run() returned %d\n", err);
-		do_md_stop(mddev, 0, NULL);
+		do_md_stop(mddev, 0);
 	}
 }

@@ -7765,11 +7756,11 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		goto unlock;

 	case STOP_ARRAY:
-		err = do_md_stop(mddev, 0, bdev);
+		err = do_md_stop(mddev, 0);
 		goto unlock;

 	case STOP_ARRAY_RO:
-		err = md_set_readonly(mddev, bdev);
+		err = md_set_readonly(mddev);
 		goto unlock;

 	case HOT_REMOVE_DISK:
--
2.39.2
From: Li Nan <linan122@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit e9b0a1556ca2e18d67fa6452b2b99aa66b60ba6e
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly().
Except for md_ioctl(), the other two callers of md_set_readonly() have
already checked 'mddev->pers'. To simplify the code, move the check of
'mddev->pers' to the caller.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-10-linan666@huaweicloud.com
---
 drivers/md/md.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b6e4e8076f5a..ce36462e77a1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6457,6 +6457,7 @@ void md_stop(struct mddev *mddev)

 EXPORT_SYMBOL_GPL(md_stop);

+/* ensure 'mddev->pers' exist before calling md_set_readonly() */
 static int md_set_readonly(struct mddev *mddev)
 {
 	int err = 0;
@@ -6481,20 +6482,18 @@ static int md_set_readonly(struct mddev *mddev)
 		goto out;
 	}

-	if (mddev->pers) {
-		__md_stop_writes(mddev);
-
-		if (mddev->ro == MD_RDONLY) {
-			err = -ENXIO;
-			goto out;
-		}
+	__md_stop_writes(mddev);

-		mddev->ro = MD_RDONLY;
-		set_disk_ro(mddev->gendisk, 1);
+	if (mddev->ro == MD_RDONLY) {
+		err = -ENXIO;
+		goto out;
 	}

+	mddev->ro = MD_RDONLY;
+	set_disk_ro(mddev->gendisk, 1);
+
 out:
-	if ((mddev->pers && !err) || did_freeze) {
+	if (!err || did_freeze) {
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -7760,7 +7759,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		goto unlock;

 	case STOP_ARRAY_RO:
-		err = md_set_readonly(mddev);
+		if (mddev->pers)
+			err = md_set_readonly(mddev);
 		goto unlock;

 	case HOT_REMOVE_DISK:
--
2.39.2
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.9-rc1
commit 3a0f007b6979db6b5e0022d9edf4b61002be3e10
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

The current api is_badblock() must pass in 'first_bad' and
'bad_sectors'; however, many callers just want to know whether there
are badblocks or not, and these callers must define two local variables
that will never be used.

Add a new helper rdev_has_badblock() that only returns whether there
are badblocks or not, remove unnecessary local variables, and replace
is_badblock() with the new helper in many places.

There are no functional changes, and the new helper will also be used
later to refactor read_balance().

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.h     | 10 ++++++++++
 drivers/md/raid1.c  | 26 +++++++------------------
 drivers/md/raid10.c | 45 ++++++++++++++-------------------------------
 drivers/md/raid5.c  | 35 +++++++++++++----------------------
 4 files changed, 44 insertions(+), 72 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index c91da1abef42..ba61d7b83b40 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -222,6 +222,16 @@ static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
 	}
 	return 0;
 }
+
+static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
+				    int sectors)
+{
+	sector_t first_bad;
+	int bad_sectors;
+
+	return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
+}
+
 extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 			      int is_new);
 extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8820ad4ae0a7..5ce5ba3642c4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -494,9 +494,6 @@ static void raid1_end_write_request(struct bio *bio)
 		 * to user-side. So if something waits for IO, then it
 		 * will wait for the 'master' bio.
 		 */
-		sector_t first_bad;
-		int bad_sectors;
-
 		r1_bio->bios[mirror] = NULL;
 		to_put = bio;
 		/*
@@ -512,8 +509,8 @@ static void raid1_end_write_request(struct bio *bio)
 			set_bit(R1BIO_Uptodate, &r1_bio->state);

 		/* Maybe we can clear some bad blocks. */
-		if (is_badblock(rdev, r1_bio->sector, r1_bio->sectors,
-				&first_bad, &bad_sectors) && !discard_error) {
+		if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
+		    !discard_error) {
 			r1_bio->bios[mirror] = IO_MADE_GOOD;
 			set_bit(R1BIO_MadeGood, &r1_bio->state);
 		}
@@ -1923,8 +1920,6 @@ static void end_sync_write(struct bio *bio)
 	struct r1bio *r1_bio = get_resync_r1bio(bio);
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
-	sector_t first_bad;
-	int bad_sectors;
 	struct md_rdev *rdev = conf->mirrors[find_bio_disk(r1_bio, bio)].rdev;

 	if (!uptodate) {
@@ -1934,14 +1929,11 @@ static void end_sync_write(struct bio *bio)
 		set_bit(MD_RECOVERY_NEEDED, &
 			mddev->recovery);
 		set_bit(R1BIO_WriteError, &r1_bio->state);
-	} else if (is_badblock(rdev, r1_bio->sector, r1_bio->sectors,
-			       &first_bad, &bad_sectors) &&
-		   !is_badblock(conf->mirrors[r1_bio->read_disk].rdev,
-				r1_bio->sector,
-				r1_bio->sectors,
-				&first_bad, &bad_sectors)
-		)
+	} else if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
+		   !rdev_has_badblock(conf->mirrors[r1_bio->read_disk].rdev,
+				      r1_bio->sector, r1_bio->sectors)) {
 		set_bit(R1BIO_MadeGood, &r1_bio->state);
+	}

 	put_sync_write_buf(r1_bio, uptodate);
 }
@@ -2264,16 +2256,12 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 		s = PAGE_SIZE >> 9;

 		do {
-			sector_t first_bad;
-			int bad_sectors;
-
 			rdev = conf->mirrors[d].rdev;
 			if (rdev &&
 			    (test_bit(In_sync, &rdev->flags) ||
 			     (!test_bit(Faulty, &rdev->flags) &&
 			      rdev->recovery_offset >= sect + s)) &&
-			    is_badblock(rdev, sect, s,
-					&first_bad, &bad_sectors) == 0) {
+			    rdev_has_badblock(rdev, sect, s) == 0) {
 				atomic_inc(&rdev->nr_pending);
 				if (sync_page_io(rdev, sect, s<<9,
 					 conf->tmppage, REQ_OP_READ, false))
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e303f6336306..0d3e63c17508 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -512,11 +512,7 @@ static void raid10_end_write_request(struct bio *bio)
 		 * The 'master' represents the composite IO operation to
 		 * user-side. So if something waits for IO, then it will
 		 * wait for the 'master' bio.
-		 */
-		sector_t first_bad;
-		int bad_sectors;
-
-		/*
+		 *
 		 * Do not set R10BIO_Uptodate if the current device is
 		 * rebuilding or Faulty. This is because we cannot use
 		 * such device for properly reading the data back (we could
@@ -529,10 +525,9 @@ static void raid10_end_write_request(struct bio *bio)
 			set_bit(R10BIO_Uptodate, &r10_bio->state);

 		/* Maybe we can clear some bad blocks. */
-		if (is_badblock(rdev,
-				r10_bio->devs[slot].addr,
-				r10_bio->sectors,
-				&first_bad, &bad_sectors) && !discard_error) {
+		if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
+				      r10_bio->sectors) &&
+		    !discard_error) {
 			bio_put(bio);
 			if (repl)
 				r10_bio->devs[slot].repl_bio = IO_MADE_GOOD;
@@ -1320,10 +1315,7 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
 		}

 		if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
-			sector_t first_bad;
 			sector_t dev_sector = r10_bio->devs[i].addr;
-			int bad_sectors;
-			int is_bad;

 			/*
 			 * Discard request doesn't care the write result
@@ -1332,9 +1324,8 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
 			if (!r10_bio->sectors)
 				continue;

-			is_bad = is_badblock(rdev, dev_sector, r10_bio->sectors,
-					     &first_bad, &bad_sectors);
-			if (is_bad < 0) {
+			if (rdev_has_badblock(rdev, dev_sector,
+					      r10_bio->sectors) < 0) {
 				/*
 				 * Mustn't write here until the bad block
 				 * is acknowledged
@@ -2272,8 +2263,6 @@ static void end_sync_write(struct bio *bio)
 	struct mddev *mddev = r10_bio->mddev;
 	struct r10conf *conf = mddev->private;
 	int d;
-	sector_t first_bad;
-	int bad_sectors;
 	int slot;
 	int repl;
 	struct md_rdev *rdev = NULL;
@@ -2294,11 +2283,10 @@ static void end_sync_write(struct bio *bio)
 				&rdev->mddev->recovery);
 			set_bit(R10BIO_WriteError, &r10_bio->state);
 		}
-	} else if (is_badblock(rdev,
-			       r10_bio->devs[slot].addr,
-			       r10_bio->sectors,
-			       &first_bad, &bad_sectors))
+	} else if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
+				     r10_bio->sectors)) {
 		set_bit(R10BIO_MadeGood, &r10_bio->state);
+	}

 	rdev_dec_pending(rdev, mddev);

@@ -2579,11 +2567,8 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 static int r10_sync_page_io(struct md_rdev *rdev, sector_t sector,
 			    int sectors, struct page *page, enum req_op op)
 {
-	sector_t first_bad;
-	int bad_sectors;
-
-	if (is_badblock(rdev, sector, sectors, &first_bad, &bad_sectors)
-	    && (op == REQ_OP_READ || test_bit(WriteErrorSeen, &rdev->flags)))
+	if (rdev_has_badblock(rdev, sector, sectors) &&
+	    (op == REQ_OP_READ || test_bit(WriteErrorSeen, &rdev->flags)))
 		return -1;
 	if (sync_page_io(rdev, sector, sectors << 9, page, op, false))
 		/* success */
@@ -2640,16 +2625,14 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 		s = PAGE_SIZE >> 9;

 		do {
-			sector_t first_bad;
-			int bad_sectors;
-
 			d = r10_bio->devs[sl].devnum;
 			rdev = conf->mirrors[d].rdev;
 			if (rdev &&
 			    test_bit(In_sync, &rdev->flags) &&
 			    !test_bit(Faulty, &rdev->flags) &&
-			    is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
-					&first_bad, &bad_sectors) == 0) {
+			    rdev_has_badblock(rdev,
+					      r10_bio->devs[sl].addr + sect,
+					      s) == 0) {
 				atomic_inc(&rdev->nr_pending);
 				success = sync_page_io(rdev,
 						       r10_bio->devs[sl].addr +
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bb2dbb7f6a61..0651dfd08b8e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1209,10 +1209,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		 */
 		while (op_is_write(op) && rdev &&
 		       test_bit(WriteErrorSeen, &rdev->flags)) {
-			sector_t first_bad;
-			int bad_sectors;
-			int bad = is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
-					      &first_bad, &bad_sectors);
+			int bad = rdev_has_badblock(rdev, sh->sector,
+						    RAID5_STRIPE_SECTORS(conf));
+
 			if (!bad)
 				break;
@@ -2852,8 +2850,6 @@ static void raid5_end_write_request(struct bio *bi)
 	struct r5conf *conf = sh->raid_conf;
 	int disks = sh->disks, i;
 	struct md_rdev *rdev;
-	sector_t first_bad;
-	int bad_sectors;
 	int replacement =
0; for (i = 0 ; i < disks; i++) { @@ -2885,9 +2881,8 @@ static void raid5_end_write_request(struct bio *bi) if (replacement) { if (bi->bi_status) md_error(conf->mddev, rdev); - else if (is_badblock(rdev, sh->sector, - RAID5_STRIPE_SECTORS(conf), - &first_bad, &bad_sectors)) + else if (rdev_has_badblock(rdev, sh->sector, + RAID5_STRIPE_SECTORS(conf))) set_bit(R5_MadeGoodRepl, &sh->dev[i].flags); } else { if (bi->bi_status) { @@ -2896,9 +2891,8 @@ static void raid5_end_write_request(struct bio *bi) if (!test_and_set_bit(WantReplacement, &rdev->flags)) set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery); - } else if (is_badblock(rdev, sh->sector, - RAID5_STRIPE_SECTORS(conf), - &first_bad, &bad_sectors)) { + } else if (rdev_has_badblock(rdev, sh->sector, + RAID5_STRIPE_SECTORS(conf))) { set_bit(R5_MadeGood, &sh->dev[i].flags); if (test_bit(R5_ReadError, &sh->dev[i].flags)) /* That was a successful write so make @@ -4634,8 +4628,6 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) /* Now to look around and see what can be done */ for (i=disks; i--; ) { struct md_rdev *rdev; - sector_t first_bad; - int bad_sectors; int is_bad = 0; dev = &sh->dev[i]; @@ -4679,8 +4671,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) rdev = conf->disks[i].replacement; if (rdev && !test_bit(Faulty, &rdev->flags) && rdev->recovery_offset >= sh->sector + RAID5_STRIPE_SECTORS(conf) && - !is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), - &first_bad, &bad_sectors)) + !rdev_has_badblock(rdev, sh->sector, + RAID5_STRIPE_SECTORS(conf))) set_bit(R5_ReadRepl, &dev->flags); else { if (rdev && !test_bit(Faulty, &rdev->flags)) @@ -4693,8 +4685,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) if (rdev && test_bit(Faulty, &rdev->flags)) rdev = NULL; if (rdev) { - is_bad = is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), - &first_bad, &bad_sectors); + is_bad = rdev_has_badblock(rdev, sh->sector, + RAID5_STRIPE_SECTORS(conf)); if (s->blocked_rdev == NULL && (test_bit(Blocked, &rdev->flags) || is_bad < 0)) { @@ -5421,8 +5413,8 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) struct r5conf *conf = mddev->private; struct bio *align_bio; struct md_rdev *rdev; - sector_t sector, end_sector, first_bad; - int bad_sectors, dd_idx; + sector_t sector, end_sector; + int dd_idx; bool did_inc; if (!in_chunk_boundary(mddev, raid_bio)) { @@ -5451,8 +5443,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) atomic_inc(&rdev->nr_pending); - if (is_badblock(rdev, sector, bio_sectors(raid_bio), &first_bad, - &bad_sectors)) { + if (rdev_has_badblock(rdev, sector, bio_sectors(raid_bio))) { rdev_dec_pending(rdev, mddev); return 0; } -- 2.39.2
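A minimal before/after sketch of the call-site simplification the new helper enables; the caller and handle_badblocks() below are hypothetical, only the two md.h functions are from the patch:

	/* before: two throwaway outputs are forced on every caller */
	sector_t first_bad;
	int bad_sectors;

	if (is_badblock(rdev, sector, nr_sectors, &first_bad, &bad_sectors))
		handle_badblocks();

	/* after: the unused locals live inside rdev_has_badblock() */
	if (rdev_has_badblock(rdev, sector, nr_sectors))
		handle_badblocks();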
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 969d6589abcb369d53d84ec7c9c37f4b23ec1ad9 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There are no functional changes; this just makes the code cleaner and prepares to record disk non-rotational information while adding and removing rdevs to conf. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 85 +++++++++++++++++++++++++++++----------------- 1 file changed, 53 insertions(+), 32 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 5ce5ba3642c4..1a116d707c48 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1736,6 +1736,44 @@ static int raid1_spare_active(struct mddev *mddev) return count; } +static bool raid1_add_conf(struct r1conf *conf, struct md_rdev *rdev, int disk, + bool replacement) +{ + struct raid1_info *info = conf->mirrors + disk; + + if (replacement) + info += conf->raid_disks; + + if (info->rdev) + return false; + + rdev->raid_disk = disk; + info->head_position = 0; + info->seq_start = MaxSector; + WRITE_ONCE(info->rdev, rdev); + + return true; +} + +static bool raid1_remove_conf(struct r1conf *conf, int disk) +{ + struct raid1_info *info = conf->mirrors + disk; + struct md_rdev *rdev = info->rdev; + + if (!rdev || test_bit(In_sync, &rdev->flags) || + atomic_read(&rdev->nr_pending)) + return false; + + /* Only remove non-faulty devices if recovery is not possible. */ + if (!test_bit(Faulty, &rdev->flags) && + rdev->mddev->recovery_disabled != conf->recovery_disabled && + rdev->mddev->degraded < conf->raid_disks) + return false; + + WRITE_ONCE(info->rdev, NULL); + return true; +} + static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev) { struct r1conf *conf = mddev->private; @@ -1771,15 +1809,13 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev) disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9); - p->head_position = 0; - rdev->raid_disk = mirror; + raid1_add_conf(conf, rdev, mirror, false); err = 0; /* As all devices are equivalent, we don't need a full recovery * if this was recently any drive of the array */ if (rdev->saved_raid_disk < 0) conf->fullsync = 1; - WRITE_ONCE(p->rdev, rdev); break; } if (test_bit(WantReplacement, &p->rdev->flags) && @@ -1789,13 +1825,11 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev) if (err && repl_slot >= 0) { /* Add this device as a replacement */ - p = conf->mirrors + repl_slot; clear_bit(In_sync, &rdev->flags); set_bit(Replacement, &rdev->flags); - rdev->raid_disk = repl_slot; + raid1_add_conf(conf, rdev, repl_slot, true); err = 0; conf->fullsync = 1; - WRITE_ONCE(p[conf->raid_disks].rdev, rdev); } print_conf(conf); @@ -1812,27 +1846,20 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev) if (unlikely(number >= conf->raid_disks)) goto abort; - if (rdev != p->rdev) - p = conf->mirrors + conf->raid_disks + number; + if (rdev != p->rdev) { + number += conf->raid_disks; + p = conf->mirrors + number; + } print_conf(conf); if (rdev == p->rdev) { - if (test_bit(In_sync, &rdev->flags) || - atomic_read(&rdev->nr_pending)) { - err = -EBUSY; - goto abort; - } - /* Only remove non-faulty devices if recovery - * is not possible.
- */ - if (!test_bit(Faulty, &rdev->flags) && - mddev->recovery_disabled != conf->recovery_disabled && - mddev->degraded < conf->raid_disks) { + if (!raid1_remove_conf(conf, number)) { err = -EBUSY; goto abort; } - WRITE_ONCE(p->rdev, NULL); - if (conf->mirrors[conf->raid_disks + number].rdev) { + + if (number < conf->raid_disks && + conf->mirrors[conf->raid_disks + number].rdev) { /* We just removed a device that is being replaced. * Move down the replacement. We drain all IO before * doing this to avoid confusion. @@ -2974,23 +3001,17 @@ static struct r1conf *setup_conf(struct mddev *mddev) err = -EINVAL; spin_lock_init(&conf->device_lock); + conf->raid_disks = mddev->raid_disks; rdev_for_each(rdev, mddev) { int disk_idx = rdev->raid_disk; - if (disk_idx >= mddev->raid_disks - || disk_idx < 0) + + if (disk_idx >= conf->raid_disks || disk_idx < 0) continue; - if (test_bit(Replacement, &rdev->flags)) - disk = conf->mirrors + mddev->raid_disks + disk_idx; - else - disk = conf->mirrors + disk_idx; - if (disk->rdev) + if (!raid1_add_conf(conf, rdev, disk_idx, + test_bit(Replacement, &rdev->flags))) goto abort; - disk->rdev = rdev; - disk->head_position = 0; - disk->seq_start = MaxSector; } - conf->raid_disks = mddev->raid_disks; conf->mddev = mddev; INIT_LIST_HEAD(&conf->retry_list); INIT_LIST_HEAD(&conf->bio_end_io_list); -- 2.39.2
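For reference, the mirrors array that raid1_add_conf() and raid1_remove_conf() index into keeps regular rdevs in slots [0, raid_disks) and their replacements in slots [raid_disks, 2 * raid_disks). A short sketch of the indexing, with an assumed raid_disks = 4:

	struct raid1_info *info = conf->mirrors + disk;	/* disk = 2 -> slot 2 */

	if (replacement)
		info += conf->raid_disks;		/* 2 + 4 -> slot 6 */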
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 2c27d09d3a76b33629d2e681bf8b774f776ade7f category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- For raid1, each read iterates over all the rdevs from conf and checks if any rdev is non-rotational, then chooses the rdev with minimal inflight IO if so, or the rdev with the closest distance otherwise. Disk nonrot info can be changed through the sysfs entry: /sys/block/[disk_name]/queue/rotational However, this should only be used for testing, and users really shouldn't do this in real life. Record the number of non-rotational disks in conf, to avoid checking each rdev in the IO fast path and to simplify read_balance() a little bit. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 1 + drivers/md/raid1.h | 1 + drivers/md/raid1.c | 17 ++++++++++------- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index ba61d7b83b40..49335c949ae3 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -207,6 +207,7 @@ enum flag_bits { * check if there is collision between raid1 * serial bios. */ + Nonrot, /* non-rotational device (SSD) */ }; static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors, diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h index 44f2390a8866..33f318fcc268 100644 --- a/drivers/md/raid1.h +++ b/drivers/md/raid1.h @@ -71,6 +71,7 @@ struct r1conf { * allow for replacements. */ int raid_disks; + int nonrot_disks; spinlock_t device_lock; diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 1a116d707c48..545232745f47 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -595,7 +595,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect int sectors; int best_good_sectors; int best_disk, best_dist_disk, best_pending_disk; - int has_nonrot_disk; int disk; sector_t best_dist; unsigned int min_pending; @@ -616,7 +615,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect best_pending_disk = -1; min_pending = UINT_MAX; best_good_sectors = 0; - has_nonrot_disk = 0; choose_next_idle = 0; clear_bit(R1BIO_FailFast, &r1_bio->state); @@ -633,7 +631,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect sector_t first_bad; int bad_sectors; unsigned int pending; - bool nonrot; rdev = conf->mirrors[disk].rdev; if (r1_bio->bios[disk] == IO_BLOCKED @@ -699,8 +696,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect /* At least two disks to choose from so failfast is OK */ set_bit(R1BIO_FailFast, &r1_bio->state); - nonrot = bdev_nonrot(rdev->bdev); - has_nonrot_disk |= nonrot; pending = atomic_read(&rdev->nr_pending); dist = abs(this_sector - conf->mirrors[disk].head_position); if (choose_first) { @@ -727,7 +722,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect * small, but not a big deal since when the second disk * starts IO, the first disk is likely still busy.
*/ - if (nonrot && opt_iosize > 0 && + if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 && mirror->seq_start != MaxSector && mirror->next_seq_sect > opt_iosize && mirror->next_seq_sect - opt_iosize >= @@ -759,7 +754,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect * mixed ratation/non-rotational disks depending on workload. */ if (best_disk == -1) { - if (has_nonrot_disk || min_pending == 0) + if (READ_ONCE(conf->nonrot_disks) || min_pending == 0) best_disk = best_pending_disk; else best_disk = best_dist_disk; @@ -1747,6 +1742,11 @@ static bool raid1_add_conf(struct r1conf *conf, struct md_rdev *rdev, int disk, if (info->rdev) return false; + if (bdev_nonrot(rdev->bdev)) { + set_bit(Nonrot, &rdev->flags); + WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks + 1); + } + rdev->raid_disk = disk; info->head_position = 0; info->seq_start = MaxSector; @@ -1770,6 +1770,9 @@ static bool raid1_remove_conf(struct r1conf *conf, int disk) rdev->mddev->degraded < conf->raid_disks) return false; + if (test_and_clear_bit(Nonrot, &rdev->flags)) + WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks - 1); + WRITE_ONCE(info->rdev, NULL); return true; } -- 2.39.2
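The counter is maintained with a WRITE_ONCE()/READ_ONCE() pairing: updates happen only while rdevs are added to or removed from conf, so the IO fast path reads one cached value instead of calling bdev_nonrot() per rdev per read. The fragments below are lifted from the patch to show the two sides together:

	/* updater side, raid1_add_conf() */
	if (bdev_nonrot(rdev->bdev)) {
		set_bit(Nonrot, &rdev->flags);
		WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks + 1);
	}

	/* reader side, read_balance() fast path */
	if (READ_ONCE(conf->nonrot_disks) || min_pending == 0)
		best_disk = best_pending_disk;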
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 257ac239ffcfd097a9a0732bf5095fb00164f334 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added the 'choose next idle' case in read_balance(): read_balance: for_each_rdev if(next_seq_sect == this_sector || dist == 0) -> sequential reads best_disk = disk; if (...) choose_next_idle = 1 continue; for_each_rdev -> iterate next rdev if (pending == 0) best_disk = disk; -> choose the next idle disk break; if (choose_next_idle) -> keep using this rdev if there is no other idle disk continue However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") removed the code: - /* If device is idle, use it */ - if (pending == 0) { - best_disk = disk; - break; - } Hence 'choose next idle' will never work now; fix this problem as follows: 1) don't set best_disk in this case; read_balance() will choose the best disk after iterating over all the disks; 2) add 'pending' so that another idle disk will be chosen; 3) add a new local variable 'sequential_disk' to record the disk, and if there is no other idle disk, 'sequential_disk' will be chosen; Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 32 ++++++++++++++++++++++---------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 545232745f47..e90f93b09c82 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -594,13 +594,12 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect const sector_t this_sector = r1_bio->sector; int sectors; int best_good_sectors; - int best_disk, best_dist_disk, best_pending_disk; + int best_disk, best_dist_disk, best_pending_disk, sequential_disk; int disk; sector_t best_dist; unsigned int min_pending; struct md_rdev *rdev; int choose_first; - int choose_next_idle; /* * Check if we can balance. We can balance on the whole @@ -611,11 +610,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect sectors = r1_bio->sectors; best_disk = -1; best_dist_disk = -1; + sequential_disk = -1; best_dist = MaxSector; best_pending_disk = -1; min_pending = UINT_MAX; best_good_sectors = 0; - choose_next_idle = 0; clear_bit(R1BIO_FailFast, &r1_bio->state); if ((conf->mddev->recovery_cp < this_sector + sectors) || @@ -708,7 +707,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect int opt_iosize = bdev_io_opt(rdev->bdev) >> 9; struct raid1_info *mirror = &conf->mirrors[disk]; - best_disk = disk; /* * If buffered sequential IO size exceeds optimal * iosize, check if there is idle disk. If yes, choose @@ -727,15 +725,22 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect mirror->next_seq_sect > opt_iosize && mirror->next_seq_sect - opt_iosize >= mirror->seq_start) { - choose_next_idle = 1; - continue; + /* + * Add 'pending' to avoid choosing this disk if + * there is other idle disk.
+ */ + pending++; + /* + * If there is no other idle disk, this disk + * will be chosen. + */ + sequential_disk = disk; + } else { + best_disk = disk; + break; } - break; } - if (choose_next_idle) - continue; - if (min_pending > pending) { min_pending = pending; best_pending_disk = disk; @@ -747,6 +752,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect } } + /* + * sequential IO size exceeds optimal iosize, however, there is no other + * idle disk, so choose the sequential disk. + */ + if (best_disk == -1 && min_pending != 0) + best_disk = sequential_disk; + /* * If all disks are rotational, choose the closest disk. If any disk is * non-rotational, choose the disk with less pending request even the -- 2.39.2
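A hypothetical two-disk trace of the fixed logic, assuming disk 0 is serving a sequential read that has already exceeded the optimal IO size:

	/* disk 0: sequential, past opt_iosize; pending = 3, counted as 4 */
	/* disk 1: idle; pending = 0 */
	/* -> min_pending = 0, best_pending_disk = 1: the idle disk wins */

	/* same trace but disk 1 busy with pending = 5: */
	/* -> min_pending = 4 != 0 and best_disk is still -1, so the new
	 * fallback picks sequential_disk = 0 and sequential IO continues */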
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit f29841ff3b272e1703454f93b96baf0fe0d9f31a category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The checking and handling of bad blocks appear many times during read_balance() in raid1 and raid10. This helper will be used in later patches to simplify read_balance() a lot. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1-10.c | 49 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c index 512746551f36..9bc0f0022a6c 100644 --- a/drivers/md/raid1-10.c +++ b/drivers/md/raid1-10.c @@ -227,3 +227,52 @@ static inline bool exceed_read_errors(struct mddev *mddev, struct md_rdev *rdev) return false; } + +/** + * raid1_check_read_range() - check a given read range for bad blocks, + * available read length is returned; + * @rdev: the rdev to read; + * @this_sector: read position; + * @len: read length; + * + * helper function for read_balance() + * + * 1) If there are no bad blocks in the range, @len is returned; + * 2) If the range are all bad blocks, 0 is returned; + * 3) If there are partial bad blocks: + * - If the bad block range starts after @this_sector, the length of first + * good region is returned; + * - If the bad block range starts before @this_sector, 0 is returned and + * the @len is updated to the offset into the region before we get to the + * good blocks; + */ +static inline int raid1_check_read_range(struct md_rdev *rdev, + sector_t this_sector, int *len) +{ + sector_t first_bad; + int bad_sectors; + + /* no bad block overlap */ + if (!is_badblock(rdev, this_sector, *len, &first_bad, &bad_sectors)) + return *len; + + /* + * bad block range starts offset into our range so we can return the + * number of sectors before the bad blocks start. + */ + if (first_bad > this_sector) + return first_bad - this_sector; + + /* read range is fully consumed by bad blocks. */ + if (this_sector + *len <= first_bad + bad_sectors) + return 0; + + /* + * final case, bad block range starts before or at the start of our + * range but does not cover our entire range so we still return 0 but + * update the length with the number of sectors before we get to the + * good ones. + */ + *len = first_bad + bad_sectors - this_sector; + return 0; +} -- 2.39.2
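A worked example of the helper's cases, assuming a hypothetical rdev whose only bad range is 8 sectors starting at sector 100:

	int ret, len;

	len = 32; ret = raid1_check_read_range(rdev, 0, &len);
	/* no overlap: ret = 32, len untouched */

	len = 32; ret = raid1_check_read_range(rdev, 80, &len);
	/* bad range starts inside the read: ret = 100 - 80 = 20 */

	len = 8; ret = raid1_check_read_range(rdev, 100, &len);
	/* read fully consumed by bad blocks: ret = 0 */

	len = 32; ret = raid1_check_read_range(rdev, 104, &len);
	/* read starts inside the bad range: ret = 0 and
	 * len = 100 + 8 - 104 = 4, the bad sectors before good data */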
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit f109207629552cb04c2a48e90abe7c481e363984 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- If resync is in progress, read_balance() should find the first usable disk; otherwise, data could be inconsistent after resync is done. raid1 and raid10 implement the same checking, hence factor out the checking to make the code cleaner. Note that raid1 is using 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 is using 'conf->next_resync', which is inaccurate because raid10 updates it before submitting resync IO. Fortunately, raid10 read IO can't run concurrently with resync IO, hence there is no problem. This patch also switches raid10 to use 'mddev->recovery_cp'. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1-10.c | 20 ++++++++++++++++++++ drivers/md/raid1.c | 15 ++------------- drivers/md/raid10.c | 13 ++----------- 3 files changed, 24 insertions(+), 24 deletions(-) diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c index 9bc0f0022a6c..2ea1710a3b70 100644 --- a/drivers/md/raid1-10.c +++ b/drivers/md/raid1-10.c @@ -276,3 +276,23 @@ static inline int raid1_check_read_range(struct md_rdev *rdev, *len = first_bad + bad_sectors - this_sector; return 0; } + +/* + * Check if read should choose the first rdev. + * + * Balance on the whole device if no resync is going on (recovery is ok) or + * below the resync window. Otherwise, take the first readable disk. + */ +static inline bool raid1_should_read_first(struct mddev *mddev, + sector_t this_sector, int len) +{ + if ((mddev->recovery_cp < this_sector + len)) + return true; + + if (mddev_is_clustered(mddev) && + md_cluster_ops->area_resyncing(mddev, READ, this_sector, + this_sector + len)) + return true; + + return false; +} diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index e90f93b09c82..05e318fec302 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -601,11 +601,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect struct md_rdev *rdev; int choose_first; - /* - * Check if we can balance. We can balance on the whole - * device if no resync is going on, or below the resync window. - * We take the first readable disk when above the resync window.
- */ retry: sectors = r1_bio->sectors; best_disk = -1; @@ -615,16 +610,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect best_pending_disk = -1; min_pending = UINT_MAX; best_good_sectors = 0; + choose_first = raid1_should_read_first(conf->mddev, this_sector, + sectors); clear_bit(R1BIO_FailFast, &r1_bio->state); - if ((conf->mddev->recovery_cp < this_sector + sectors) || - (mddev_is_clustered(conf->mddev) && - md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector, - this_sector + sectors))) - choose_first = 1; - else - choose_first = 0; - for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { sector_t dist; sector_t first_bad; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 0d3e63c17508..e47df90e705a 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -742,17 +742,8 @@ static struct md_rdev *read_balance(struct r10conf *conf, best_good_sectors = 0; do_balance = 1; clear_bit(R10BIO_FailFast, &r10_bio->state); - /* - * Check if we can balance. We can balance on the whole - * device if no resync is going on (recovery is ok), or below - * the resync window. We take the first readable disk when - * above the resync window. - */ - if ((conf->mddev->recovery_cp < MaxSector - && (this_sector + sectors >= conf->next_resync)) || - (mddev_is_clustered(conf->mddev) && - md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector, - this_sector + sectors))) + + if (raid1_should_read_first(conf->mddev, this_sector, sectors)) do_balance = 0; for (slot = 0; slot < conf->copies ; slot++) { -- 2.39.2
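A compact view of what the unified check decides, with assumed numbers (a non-clustered array with mddev->recovery_cp = 5000):

	raid1_should_read_first(mddev, 1000, 100);	/* 1100 <= 5000 -> false, balance reads */
	raid1_should_read_first(mddev, 4950, 100);	/* 5050 > 5000 -> true, take first disk */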
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 31a73331752d3c7d28c8fc089b21d3ae8c15e664 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- read_balance() is hard to understand because there are too many states and branches, and it's overlong. This patch factors out the case of reading the first rdev from read_balance(); there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-8-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++------------- 1 file changed, 46 insertions(+), 17 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 05e318fec302..c548dbc80636 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -575,6 +575,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector, return len; } +static void update_read_sectors(struct r1conf *conf, int disk, + sector_t this_sector, int len) +{ + struct raid1_info *info = &conf->mirrors[disk]; + + atomic_inc(&info->rdev->nr_pending); + if (info->next_seq_sect != this_sector) + info->seq_start = this_sector; + info->next_seq_sect = this_sector + len; +} + +static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio, + int *max_sectors) +{ + sector_t this_sector = r1_bio->sector; + int len = r1_bio->sectors; + int disk; + + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { + struct md_rdev *rdev; + int read_len; + + if (r1_bio->bios[disk] == IO_BLOCKED) + continue; + + rdev = conf->mirrors[disk].rdev; + if (!rdev || test_bit(Faulty, &rdev->flags)) + continue; + + /* choose the first disk even if it has some bad blocks. */ + read_len = raid1_check_read_range(rdev, this_sector, &len); + if (read_len > 0) { + update_read_sectors(conf, disk, this_sector, read_len); + *max_sectors = read_len; + return disk; + } + } + + return -1; +} + /* * This routine returns the disk from which the requested read should * be done. There is a per-array 'next expected sequential IO' sector @@ -599,7 +640,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect sector_t best_dist; unsigned int min_pending; struct md_rdev *rdev; - int choose_first; retry: sectors = r1_bio->sectors; @@ -610,10 +650,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect best_pending_disk = -1; min_pending = UINT_MAX; best_good_sectors = 0; - choose_first = raid1_should_read_first(conf->mddev, this_sector, - sectors); clear_bit(R1BIO_FailFast, &r1_bio->state); + if (raid1_should_read_first(conf->mddev, this_sector, sectors)) + return choose_first_rdev(conf, r1_bio, max_sectors); + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { sector_t dist; sector_t first_bad; @@ -659,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect * bad_sectors from another device..
*/ bad_sectors -= (this_sector - first_bad); - if (choose_first && sectors > bad_sectors) - sectors = bad_sectors; if (best_good_sectors > sectors) best_good_sectors = sectors; @@ -670,8 +709,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect best_good_sectors = good_sectors; best_disk = disk; } - if (choose_first) - break; } continue; } else { @@ -686,10 +723,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect pending = atomic_read(&rdev->nr_pending); dist = abs(this_sector - conf->mirrors[disk].head_position); - if (choose_first) { - best_disk = disk; - break; - } /* Don't change to another disk for sequential reads */ if (conf->mirrors[disk].next_seq_sect == this_sector || dist == 0) { @@ -765,13 +798,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect rdev = conf->mirrors[best_disk].rdev; if (!rdev) goto retry; - atomic_inc(&rdev->nr_pending); - sectors = best_good_sectors; - - if (conf->mirrors[best_disk].next_seq_sect != this_sector) - conf->mirrors[best_disk].seq_start = this_sector; - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors; + sectors = best_good_sectors; + update_read_sectors(conf, disk, this_sector, sectors); } *max_sectors = sectors; -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit dfa8ecd167c1753d4fc24a517e1d79c603183c94 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- read_balance() is hard to understand because there are too many states and branches, and it's overlong. This patch factors out the case of reading the slow rdev from read_balance(); there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-9-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 69 ++++++++++++++++++++++++++++++++++------------ 1 file changed, 52 insertions(+), 17 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index c548dbc80636..42879db48fd5 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -616,6 +616,53 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio, return -1; } +static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio, + int *max_sectors) +{ + sector_t this_sector = r1_bio->sector; + int bb_disk = -1; + int bb_read_len = 0; + int disk; + + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { + struct md_rdev *rdev; + int len; + int read_len; + + if (r1_bio->bios[disk] == IO_BLOCKED) + continue; + + rdev = conf->mirrors[disk].rdev; + if (!rdev || test_bit(Faulty, &rdev->flags) || + !test_bit(WriteMostly, &rdev->flags)) + continue; + + /* there are no bad blocks, we can use this disk */ + len = r1_bio->sectors; + read_len = raid1_check_read_range(rdev, this_sector, &len); + if (read_len == r1_bio->sectors) { + update_read_sectors(conf, disk, this_sector, read_len); + return disk; + } + + /* + * there are partial bad blocks, choose the rdev with largest + * read length. + */ + if (read_len > bb_read_len) { + bb_disk = disk; + bb_read_len = read_len; + } + } + + if (bb_disk != -1) { + *max_sectors = bb_read_len; + update_read_sectors(conf, bb_disk, this_sector, bb_read_len); + } + + return bb_disk; +} + /* * This routine returns the disk from which the requested read should * be done. There is a per-array 'next expected sequential IO' sector @@ -669,23 +716,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect if (!test_bit(In_sync, &rdev->flags) && rdev->recovery_offset < this_sector + sectors) continue; - if (test_bit(WriteMostly, &rdev->flags)) { - /* Don't balance among write-mostly, just - * use the first as a last resort */ - if (best_dist_disk < 0) { - if (is_badblock(rdev, this_sector, sectors, - &first_bad, &bad_sectors)) { - if (first_bad <= this_sector) - /* Cannot use this */ - continue; - best_good_sectors = first_bad - this_sector; - } else - best_good_sectors = sectors; - best_dist_disk = disk; - best_pending_disk = disk; - } + if (test_bit(WriteMostly, &rdev->flags)) continue; - } /* This is a reasonable device to use. It might * even be best. */ @@ -804,7 +836,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect } *max_sectors = sectors; - return best_disk; + if (best_disk >= 0) + return best_disk; + + return choose_slow_rdev(conf, r1_bio, max_sectors); } static void wake_up_barrier(struct r1conf *conf) -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 9f3ced792203891b8fa39afa37908eba843fcfac category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- read_balance() is hard to understand because there are too many states and branches, and it's overlong. This patch factors out the case of reading the rdev with bad blocks from read_balance(); there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 79 ++++++++++++++++++++++++++++------------------ 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 42879db48fd5..78173321d099 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -616,6 +616,44 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio, return -1; } +static int choose_bb_rdev(struct r1conf *conf, struct r1bio *r1_bio, + int *max_sectors) +{ + sector_t this_sector = r1_bio->sector; + int best_disk = -1; + int best_len = 0; + int disk; + + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { + struct md_rdev *rdev; + int len; + int read_len; + + if (r1_bio->bios[disk] == IO_BLOCKED) + continue; + + rdev = conf->mirrors[disk].rdev; + if (!rdev || test_bit(Faulty, &rdev->flags) || + test_bit(WriteMostly, &rdev->flags)) + continue; + + /* keep track of the disk with the most readable sectors. */ + len = r1_bio->sectors; + read_len = raid1_check_read_range(rdev, this_sector, &len); + if (read_len > best_len) { + best_disk = disk; + best_len = read_len; + } + } + + if (best_disk != -1) { + *max_sectors = best_len; + update_read_sectors(conf, best_disk, this_sector, best_len); + } + + return best_disk; +} + static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors) { @@ -704,8 +742,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { sector_t dist; - sector_t first_bad; - int bad_sectors; unsigned int pending; rdev = conf->mirrors[disk].rdev; @@ -718,36 +754,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect continue; if (test_bit(WriteMostly, &rdev->flags)) continue; - /* This is a reasonable device to use. It might - * even be best. - */ - if (is_badblock(rdev, this_sector, sectors, - &first_bad, &bad_sectors)) { - if (best_dist < MaxSector) - /* already have a better device */ - continue; - if (first_bad <= this_sector) { - /* cannot read here. If this is the 'primary' - * device, then we must not read beyond - * bad_sectors from another device..
- */ - bad_sectors -= (this_sector - first_bad); - if (best_good_sectors > sectors) - best_good_sectors = sectors; - - } else { - sector_t good_sectors = first_bad - this_sector; - if (good_sectors > best_good_sectors) { - best_good_sectors = good_sectors; - best_disk = disk; - } - } + if (rdev_has_badblock(rdev, this_sector, sectors)) continue; - } else { - if ((sectors > best_good_sectors) && (best_disk >= 0)) - best_disk = -1; - best_good_sectors = sectors; - } if (best_disk >= 0) /* At least two disks to choose from so failfast is OK */ @@ -839,6 +847,15 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect if (best_disk >= 0) return best_disk; + /* + * If we are here it means we didn't find a perfectly good disk so + * now spend a bit more time trying to find one with the most good + * sectors. + */ + disk = choose_bb_rdev(conf, r1_bio, max_sectors); + if (disk >= 0) + return disk; + return choose_slow_rdev(conf, r1_bio, max_sectors); } -- 2.39.2
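With this patch the tail of read_balance() becomes a three-tier fallback; the snippet below just restates the resulting flow from the diff with comments:

	if (best_disk >= 0)		/* 1) a clean, fast rdev */
		return best_disk;

	disk = choose_bb_rdev(conf, r1_bio, max_sectors);
	if (disk >= 0)			/* 2) most readable sectors despite bad blocks */
		return disk;

	return choose_slow_rdev(conf, r1_bio, max_sectors);	/* 3) WriteMostly last */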
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit ba58f57fdf98af642c57654599823640ffe8334c category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There is no functional change for now; this makes read_balance() cleaner and prepares to fix problems and refactor the handling of sequential IO. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 71 ++++++++++++++++++++++++---------------------- 1 file changed, 37 insertions(+), 34 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 78173321d099..f89968d5d0d0 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -701,6 +701,31 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio, return bb_disk; } +static bool is_sequential(struct r1conf *conf, int disk, struct r1bio *r1_bio) +{ + /* TODO: address issues with this check and concurrency. */ + return conf->mirrors[disk].next_seq_sect == r1_bio->sector || + conf->mirrors[disk].head_position == r1_bio->sector; +} + +/* + * If buffered sequential IO size exceeds optimal iosize, check if there is idle + * disk. If yes, choose the idle disk. + */ +static bool should_choose_next(struct r1conf *conf, int disk) +{ + struct raid1_info *mirror = &conf->mirrors[disk]; + int opt_iosize; + + if (!test_bit(Nonrot, &mirror->rdev->flags)) + return false; + + opt_iosize = bdev_io_opt(mirror->rdev->bdev) >> 9; + return opt_iosize > 0 && mirror->seq_start != MaxSector && + mirror->next_seq_sect > opt_iosize && + mirror->next_seq_sect - opt_iosize >= mirror->seq_start; +} + /* * This routine returns the disk from which the requested read should * be done. There is a per-array 'next expected sequential IO' sector @@ -764,43 +789,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect pending = atomic_read(&rdev->nr_pending); dist = abs(this_sector - conf->mirrors[disk].head_position); /* Don't change to another disk for sequential reads */ - if (conf->mirrors[disk].next_seq_sect == this_sector - || dist == 0) { - int opt_iosize = bdev_io_opt(rdev->bdev) >> 9; - struct raid1_info *mirror = &conf->mirrors[disk]; - - /* - * If buffered sequential IO size exceeds optimal - * iosize, check if there is idle disk. If yes, choose - * the idle disk. read_balance could already choose an - * idle disk before noticing it's a sequential IO in - * this disk. This doesn't matter because this disk - * will idle, next time it will be utilized after the - * first disk has IO size exceeds optimal iosize. In - * this way, iosize of the first disk will be optimal - * iosize at least. iosize of the second disk might be - * small, but not a big deal since when the second disk - * starts IO, the first disk is likely still busy. - */ - if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 && - mirror->seq_start != MaxSector && - mirror->next_seq_sect > opt_iosize && - mirror->next_seq_sect - opt_iosize >= - mirror->seq_start) { - /* - * Add 'pending' to avoid choosing this disk if - * there is other idle disk.
- */ - sequential_disk = disk; - } else { + if (is_sequential(conf, disk, r1_bio)) { + if (!should_choose_next(conf, disk)) { best_disk = disk; break; } + /* + * Add 'pending' to avoid choosing this disk if + * there is other idle disk. + */ + pending++; + /* + * If there is no other idle disk, this disk + * will be chosen. + */ + sequential_disk = disk; } if (min_pending > pending) { -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 0091c5a269eca7ab0e057c3083804daed9997f08 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The way that the best rdev is chosen: 1) If the read is sequential from one rdev: - if the rdev is rotational, use this rdev; - if the rdev is non-rotational, use this rdev until the total read length exceeds the disk's optimal IO size; 2) If the read is not sequential: - if there is an idle disk, use it; otherwise: - if the array has a non-rotational disk, choose the rdev with minimal inflight IO; - if all the underlying disks are rotational, choose the rdev with the closest head position; There are no functional changes, just to make the code cleaner and prepare for the following refactor. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid1.c | 175 +++++++++++++++++++++++++-------------------- 1 file changed, 98 insertions(+), 77 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index f89968d5d0d0..ba4fed333664 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -726,74 +726,71 @@ static bool should_choose_next(struct r1conf *conf, int disk) mirror->next_seq_sect - opt_iosize >= mirror->seq_start; } -/* - * This routine returns the disk from which the requested read should - * be done. There is a per-array 'next expected sequential IO' sector - * number - if this matches on the next IO then we use the last disk. - * There is also a per-disk 'last know head position' sector that is - * maintained from IRQ contexts, both the normal and the resync IO - * completion handlers update this position correctly. If there is no - * perfect sequential match then we pick the disk whose head is closest. - * - * If there are 2 mirrors in the same 2 devices, performance degrades - * because position is mirror, not device based. - * - * The rdev for the device selected will have nr_pending incremented.
- */ -static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors) +static bool rdev_readable(struct md_rdev *rdev, struct r1bio *r1_bio) { - const sector_t this_sector = r1_bio->sector; - int sectors; - int best_good_sectors; - int best_disk, best_dist_disk, best_pending_disk, sequential_disk; - int disk; - sector_t best_dist; - unsigned int min_pending; - struct md_rdev *rdev; + if (!rdev || test_bit(Faulty, &rdev->flags)) + return false; - retry: - sectors = r1_bio->sectors; - best_disk = -1; - best_dist_disk = -1; - sequential_disk = -1; - best_dist = MaxSector; - best_pending_disk = -1; - min_pending = UINT_MAX; - best_good_sectors = 0; - clear_bit(R1BIO_FailFast, &r1_bio->state); + /* still in recovery */ + if (!test_bit(In_sync, &rdev->flags) && + rdev->recovery_offset < r1_bio->sector + r1_bio->sectors) + return false; - if (raid1_should_read_first(conf->mddev, this_sector, sectors)) - return choose_first_rdev(conf, r1_bio, max_sectors); + /* don't read from slow disk unless have to */ + if (test_bit(WriteMostly, &rdev->flags)) + return false; + + /* don't split IO for bad blocks unless have to */ + if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors)) + return false; + + return true; +} + +struct read_balance_ctl { + sector_t closest_dist; + int closest_dist_disk; + int min_pending; + int min_pending_disk; + int sequential_disk; + int readable_disks; +}; + +static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio) +{ + int disk; + struct read_balance_ctl ctl = { + .closest_dist_disk = -1, + .closest_dist = MaxSector, + .min_pending_disk = -1, + .min_pending = UINT_MAX, + .sequential_disk = -1, + }; for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { + struct md_rdev *rdev; sector_t dist; unsigned int pending; - rdev = conf->mirrors[disk].rdev; - if (r1_bio->bios[disk] == IO_BLOCKED - || rdev == NULL - || test_bit(Faulty, &rdev->flags)) - continue; - if (!test_bit(In_sync, &rdev->flags) && - rdev->recovery_offset < this_sector + sectors) - continue; - if (test_bit(WriteMostly, &rdev->flags)) + if (r1_bio->bios[disk] == IO_BLOCKED) continue; - if (rdev_has_badblock(rdev, this_sector, sectors)) + + rdev = conf->mirrors[disk].rdev; + if (!rdev_readable(rdev, r1_bio)) continue; - if (best_disk >= 0) - /* At least two disks to choose from so failfast is OK */ + /* At least two disks to choose from so failfast is OK */ + if (ctl.readable_disks++ == 1) set_bit(R1BIO_FailFast, &r1_bio->state); pending = atomic_read(&rdev->nr_pending); - dist = abs(this_sector - conf->mirrors[disk].head_position); + dist = abs(r1_bio->sector - conf->mirrors[disk].head_position); + /* Don't change to another disk for sequential reads */ if (is_sequential(conf, disk, r1_bio)) { - if (!should_choose_next(conf, disk)) { - best_disk = disk; - break; - } + if (!should_choose_next(conf, disk)) + return disk; + /* * Add 'pending' to avoid choosing this disk if * there is other idle disk. @@ -803,17 +800,17 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect * If there is no other idle disk, this disk * will be chosen. 
*/ - sequential_disk = disk; + ctl.sequential_disk = disk; } - if (min_pending > pending) { - min_pending = pending; - best_pending_disk = disk; + if (ctl.min_pending > pending) { + ctl.min_pending = pending; + ctl.min_pending_disk = disk; } - if (dist < best_dist) { - best_dist = dist; - best_dist_disk = disk; + if (ctl.closest_dist > dist) { + ctl.closest_dist = dist; + ctl.closest_dist_disk = disk; } } @@ -821,8 +818,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect * sequential IO size exceeds optimal iosize, however, there is no other * idle disk, so choose the sequential disk. */ - if (best_disk == -1 && min_pending != 0) - best_disk = sequential_disk; + if (ctl.sequential_disk != -1 && ctl.min_pending != 0) + return ctl.sequential_disk; /* * If all disks are rotational, choose the closest disk. If any disk is @@ -830,25 +827,49 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect * disk is rotational, which might/might not be optimal for raids with * mixed ratation/non-rotational disks depending on workload. */ - if (best_disk == -1) { - if (READ_ONCE(conf->nonrot_disks) || min_pending == 0) - best_disk = best_pending_disk; - else - best_disk = best_dist_disk; - } + if (ctl.min_pending_disk != -1 && + (READ_ONCE(conf->nonrot_disks) || ctl.min_pending == 0)) + return ctl.min_pending_disk; + else + return ctl.closest_dist_disk; +} - if (best_disk >= 0) { - rdev = conf->mirrors[best_disk].rdev; - if (!rdev) - goto retry; +/* + * This routine returns the disk from which the requested read should be done. + * + * 1) If resync is in progress, find the first usable disk and use it even if it + * has some bad blocks. + * + * 2) Now that there is no resync, loop through all disks and skipping slow + * disks and disks with bad blocks for now. Only pay attention to key disk + * choice. + * + * 3) If we've made it this far, now look for disks with bad blocks and choose + * the one with most number of sectors. + * + * 4) If we are all the way at the end, we have no choice but to use a disk even + * if it is write mostly. + * + * The rdev for the device selected will have nr_pending incremented. + */ +static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, + int *max_sectors) +{ + int disk; - sectors = best_good_sectors; - update_read_sectors(conf, disk, this_sector, sectors); - } - *max_sectors = sectors; + clear_bit(R1BIO_FailFast, &r1_bio->state); + + if (raid1_should_read_first(conf->mddev, r1_bio->sector, + r1_bio->sectors)) + return choose_first_rdev(conf, r1_bio, max_sectors); - if (best_disk >= 0) - return best_disk; + disk = choose_best_rdev(conf, r1_bio); + if (disk >= 0) { + *max_sectors = r1_bio->sectors; + update_read_sectors(conf, disk, r1_bio->sector, + r1_bio->sectors); + return disk; + } /* * If we are here it means we didn't find a perfectly good disk so -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 2f03d0c2cd451c7ac2f317079d4ec518f0986b55 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- After commit 9dbd1aa3a81c ("dm raid: add reshaping support to the target") raid_ctr() sets MD_RECOVERY_FROZEN before md_run() and expects to keep the array frozen until resume. However, md_run() clears the flag by setting mddev->recovery to 0. Before commit 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()"), dm-raid actually relied on suspending to prevent starting a new sync_thread. Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in md_run(). Fixes: 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()") Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index ce36462e77a1..e33ebefdc04c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6099,7 +6099,10 @@ int md_run(struct mddev *mddev) pr_warn("True protection against single-disk failure might be compromised.\n"); } - mddev->recovery = 0; + /* dm-raid expect sync_thread to be frozen until resume */ + if (mddev->gendisk) + mddev->recovery = 0; + /* may be over-ridden by personality */ mddev->resync_max_sectors = mddev->dev_sectors; -- 2.39.2
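The mddev->gendisk test works as a dm-raid discriminator because a dm-raid array has no md-owned gendisk (the same property behind the mddev_is_dm() helper added earlier in this series); an annotated form of the hunk above:

	if (mddev->gendisk)
		mddev->recovery = 0;	/* plain md: reset recovery state */
	/* else dm-raid: keep MD_RECOVERY_FROZEN set by raid_ctr() */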
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 7a2347e284d7ec2f0759be4db60fa7ca937284fc category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add new helpers: void md_idle_sync_thread(struct mddev *mddev); void md_frozen_sync_thread(struct mddev *mddev); void md_unfrozen_sync_thread(struct mddev *mddev); The helpers will be used in dm-raid in later patches to fix regressions and prevent calling md_reap_sync_thread() directly. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 3 +++ drivers/md/md.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+) diff --git a/drivers/md/md.h b/drivers/md/md.h index 49335c949ae3..14f0d92d1f8f 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -797,6 +797,9 @@ extern void md_rdev_clear(struct md_rdev *rdev); extern void md_handle_request(struct mddev *mddev, struct bio *bio); extern int mddev_suspend(struct mddev *mddev, bool interruptible); extern void mddev_resume(struct mddev *mddev); +extern void md_idle_sync_thread(struct mddev *mddev); +extern void md_frozen_sync_thread(struct mddev *mddev); +extern void md_unfrozen_sync_thread(struct mddev *mddev); extern void md_reload_sb(struct mddev *mddev, int raid_disk); extern void md_update_sb(struct mddev *mddev, int force); diff --git a/drivers/md/md.c b/drivers/md/md.c index e33ebefdc04c..4c21c1d3f73a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4980,6 +4980,35 @@ static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq) mddev_lock_nointr(mddev); } +void md_idle_sync_thread(struct mddev *mddev) +{ + lockdep_assert_held(&mddev->reconfig_mutex); + + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + stop_sync_thread(mddev, true, true); +} +EXPORT_SYMBOL_GPL(md_idle_sync_thread); + +void md_frozen_sync_thread(struct mddev *mddev) +{ + lockdep_assert_held(&mddev->reconfig_mutex); + + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + stop_sync_thread(mddev, true, false); +} +EXPORT_SYMBOL_GPL(md_frozen_sync_thread); + +void md_unfrozen_sync_thread(struct mddev *mddev) +{ + lockdep_assert_held(&mddev->reconfig_mutex); + + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + md_wakeup_thread(mddev->thread); + sysfs_notify_dirent_safe(mddev->sysfs_action); +} +EXPORT_SYMBOL_GPL(md_unfrozen_sync_thread); + static void idle_sync_thread(struct mddev *mddev) { mutex_lock(&mddev->sync_mutex); -- 2.39.2
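A minimal sketch of how a caller such as dm-raid is expected to pair the new helpers across suspend/resume; the two wrapper functions are hypothetical, only the md_*_sync_thread() helpers come from the patch (both assert that reconfig_mutex is held):

	static void hypothetical_postsuspend(struct mddev *mddev)
	{
		mddev_lock_nointr(mddev);
		md_frozen_sync_thread(mddev);	/* park sync_thread, keep FROZEN set */
		mddev_unlock(mddev);
	}

	static void hypothetical_resume(struct mddev *mddev)
	{
		mddev_lock_nointr(mddev);
		md_unfrozen_sync_thread(mddev);	/* clear FROZEN, kick recovery */
		mddev_unlock(mddev);
	}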
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 314e9af065513ff86ec9e32eaa96b9bd275cf51d category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There are no functional changes for now; this prepares to fix a deadlock for dm-raid456. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 12 ++++++++++++ drivers/md/md.c | 12 ------------ 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 14f0d92d1f8f..e6d1c4c9c2c7 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -569,6 +569,18 @@ enum recovery_flags { MD_RESYNCING_REMOTE, /* remote node is running resync thread */ }; +enum md_ro_state { + MD_RDWR, + MD_RDONLY, + MD_AUTO_READ, + MD_MAX_STATE +}; + +static inline bool md_is_rdwr(struct mddev *mddev) +{ + return (mddev->ro == MD_RDWR); +} + static inline int __must_check mddev_lock(struct mddev *mddev) { return mutex_lock_interruptible(&mddev->reconfig_mutex); diff --git a/drivers/md/md.c b/drivers/md/md.c index 4c21c1d3f73a..9730f3e2ecf9 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -99,18 +99,6 @@ static void mddev_detach(struct mddev *mddev); static void export_rdev(struct md_rdev *rdev, struct mddev *mddev); static void md_wakeup_thread_directly(struct md_thread __rcu *thread); -enum md_ro_state { - MD_RDWR, - MD_RDONLY, - MD_AUTO_READ, - MD_MAX_STATE -}; - -static bool md_is_rdwr(struct mddev *mddev) -{ - return (mddev->ro == MD_RDWR); -} - /* * Default number of read corrections we'll attempt on an rdev * before ejecting it from the array. We divide the read error -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 503f9d43790fdd0c6e6ae2f4dd3f70b146ac4159 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The helper will be used for dm-raid456 later to detect the case that reshape can't make progress. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/drivers/md/md.h b/drivers/md/md.h index e6d1c4c9c2c7..3012bbc09aaa 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -581,6 +581,25 @@ static inline bool md_is_rdwr(struct mddev *mddev) return (mddev->ro == MD_RDWR); } +static inline bool reshape_interrupted(struct mddev *mddev) +{ + /* reshape never start */ + if (mddev->reshape_position == MaxSector) + return false; + + /* interrupted */ + if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) + return true; + + /* running reshape will be interrupted soon. */ + if (test_bit(MD_RECOVERY_WAIT, &mddev->recovery) || + test_bit(MD_RECOVERY_INTR, &mddev->recovery) || + test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) + return true; + + return false; +} + static inline int __must_check mddev_lock(struct mddev *mddev) { return mutex_lock_interruptible(&mddev->reconfig_mutex); -- 2.39.2
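The helper is a pure predicate over reshape_position and the recovery flags, so its decision table can be checked in isolation. A standalone sketch (the flag values below are made up for the demo; the kernel uses bit numbers from enum recovery_flags):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define MaxSector UINT64_MAX
enum {
    MD_RECOVERY_RUNNING = 1 << 0,
    MD_RECOVERY_WAIT    = 1 << 1,
    MD_RECOVERY_INTR    = 1 << 2,
    MD_RECOVERY_FROZEN  = 1 << 3,
};

static bool reshape_interrupted(uint64_t reshape_position, unsigned int recovery)
{
    if (reshape_position == MaxSector)
        return false;                /* reshape never started */
    if (!(recovery & MD_RECOVERY_RUNNING))
        return true;                 /* already interrupted */
    /* still running, but will be interrupted soon */
    return recovery & (MD_RECOVERY_WAIT | MD_RECOVERY_INTR | MD_RECOVERY_FROZEN);
}

int main(void)
{
    printf("never started:      %d\n", reshape_interrupted(MaxSector, 0));
    printf("stopped mid-way:    %d\n", reshape_interrupted(1024, 0));
    printf("running, healthy:   %d\n",
           reshape_interrupted(1024, MD_RECOVERY_RUNNING));
    printf("running but frozen: %d\n",
           reshape_interrupted(1024, MD_RECOVERY_RUNNING | MD_RECOVERY_FROZEN));
    return 0;
}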
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 16c4770c75b1223998adbeb7286f9a15c65fba73 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- 1) Commit f52f5c71f3d4 ("md: fix stopping sync thread") removed MD_RECOVERY_FROZEN from __md_stop_writes() without realizing that dm-raid relies on __md_stop_writes() to freeze the sync_thread indirectly. Fix this problem by setting MD_RECOVERY_FROZEN in md_stop_writes(), and since stop_sync_thread() is only used for dm-raid in this case, also move stop_sync_thread() into md_stop_writes(). 2) The flag MD_RECOVERY_FROZEN doesn't mean that the sync thread is frozen; it only prevents a new sync_thread from starting and can't stop a running sync thread. To freeze the sync_thread, stop_sync_thread() must be called after setting the flag. 3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, so using it as the condition for md_stop_writes() in raid_postsuspend() isn't correct. Since a reentrant stop_sync_thread() does nothing, always call md_stop_writes() in raid_postsuspend(). 4) raid_message() can set/clear the flag MD_RECOVERY_FROZEN at any time, and if MD_RECOVERY_FROZEN is cleared while the array is suspended, a new sync_thread can start unexpectedly. Fix this by disallowing raid_message() from changing the sync_thread status during suspend. Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the test shell/lvconvert-raid-reshape.sh started to hang in stop_sync_thread(); with the previous fixes, the test no longer hangs there, but it still fails and complains that ext4 is corrupted. With this patch, the test neither hangs in stop_sync_thread() nor fails due to ext4 corruption. However, there is still a deadlock related to dm-raid456 that will be fixed in the following patches. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/ Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()") Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/dm-raid.c | 25 +++++++++++++++---------- drivers/md/md.c | 3 ++- 2 files changed, 17 insertions(+), 11 deletions(-) diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index e95e6d48c69d..0fa6d7b55030 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3240,11 +3240,12 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) rs->md.ro = 1; rs->md.in_sync = 1; - /* Keep array frozen until resume. */ - set_bit(MD_RECOVERY_FROZEN, &rs->md.recovery); - /* Has to be held on running the array */ mddev_suspend_and_lock_nointr(&rs->md); + + /* Keep array frozen until resume.
*/ + md_frozen_sync_thread(&rs->md); + r = md_run(&rs->md); rs->md.in_sync = 0; /* Assume already marked dirty */ if (r) { @@ -3722,6 +3723,9 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv, if (!mddev->pers || !mddev->pers->sync_request) return -EINVAL; + if (test_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) + return -EBUSY; + if (!strcasecmp(argv[0], "frozen")) set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); else @@ -3796,10 +3800,11 @@ static void raid_postsuspend(struct dm_target *ti) struct raid_set *rs = ti->private; if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { - /* Writes have to be stopped before suspending to avoid deadlocks. */ - if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery)) - md_stop_writes(&rs->md); - + /* + * sync_thread must be stopped during suspend, and writes have + * to be stopped before suspending to avoid deadlocks. + */ + md_stop_writes(&rs->md); mddev_suspend(&rs->md, false); } } @@ -4012,8 +4017,6 @@ static int raid_preresume(struct dm_target *ti) } /* Check for any resize/reshape on @rs and adjust/initiate */ - /* Be prepared for mddev_resume() in raid_resume() */ - set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); if (mddev->recovery_cp && mddev->recovery_cp < MaxSector) { set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); mddev->resync_min = mddev->recovery_cp; @@ -4057,10 +4060,12 @@ static void raid_resume(struct dm_target *ti) if (mddev->delta_disks < 0) rs_set_capacity(rs); + WARN_ON_ONCE(!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)); + WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); mddev_lock_nointr(mddev); - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); mddev->ro = 0; mddev->in_sync = 0; + md_unfrozen_sync_thread(mddev); mddev_unlock_and_resume(mddev); } } diff --git a/drivers/md/md.c b/drivers/md/md.c index 9730f3e2ecf9..cb4a8b302a6c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6401,7 +6401,6 @@ static void md_clean(struct mddev *mddev) static void __md_stop_writes(struct mddev *mddev) { - stop_sync_thread(mddev, true, false); del_timer_sync(&mddev->safemode_timer); if (mddev->pers && mddev->pers->quiesce) { @@ -6426,6 +6425,8 @@ static void __md_stop_writes(struct mddev *mddev) void md_stop_writes(struct mddev *mddev) { mddev_lock_nointr(mddev); + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + stop_sync_thread(mddev, true, false); __md_stop_writes(mddev); mddev_unlock(mddev); } -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 5625ff8b72b0e5c13b0fc1fc1f198155af45f729 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There are no functional changes for now; this prepares to fix a deadlock for dm-raid456. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 1 + drivers/md/dm-raid.c | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+) diff --git a/drivers/md/md.h b/drivers/md/md.h index 3012bbc09aaa..05b3286e5ae2 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -659,6 +659,7 @@ struct md_personality int (*start_reshape) (struct mddev *mddev); void (*finish_reshape) (struct mddev *mddev); void (*update_reshape_pos) (struct mddev *mddev); + void (*prepare_suspend) (struct mddev *mddev); /* quiesce suspends or resumes internal processing. * 1 - stop new actions and wait for action io to complete * 0 - return to normal behaviour diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 0fa6d7b55030..4d89a4947217 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3795,6 +3795,23 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits) blk_limits_io_opt(limits, chunk_size_bytes * mddev_data_stripes(rs)); } +static void raid_presuspend(struct dm_target *ti) +{ + struct raid_set *rs = ti->private; + struct mddev *mddev = &rs->md; + + if (!reshape_interrupted(mddev)) + return; + + /* + * For raid456, if reshape is interrupted, IO across reshape position + * will never make progress, while caller will wait for IO to be done. + * Inform raid456 to handle those IO to prevent deadlock. + */ + if (mddev->pers && mddev->pers->prepare_suspend) + mddev->pers->prepare_suspend(mddev); +} + static void raid_postsuspend(struct dm_target *ti) { struct raid_set *rs = ti->private; @@ -4081,6 +4098,7 @@ static struct target_type raid_target = { .message = raid_message, .iterate_devices = raid_iterate_devices, .io_hints = raid_io_hints, + .presuspend = raid_presuspend, .postsuspend = raid_postsuspend, .preresume = raid_preresume, .resume = raid_resume, -- 2.39.2
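The new hook follows the usual optional-ops-table pattern: the core only calls it when the personality fills it in, so raid0/raid1/raid10 need no change. A minimal standalone illustration of that guard (hypothetical types, not the real md_personality struct):

#include <stdio.h>

struct personality {
    const char *name;
    void (*prepare_suspend)(void);   /* optional, may be NULL */
};

static void raid5_prepare_suspend_model(void)
{
    printf("raid456: waking IO stuck behind the reshape\n");
}

static struct personality raid456 = { "raid456", raid5_prepare_suspend_model };
static struct personality raid0   = { "raid0",   NULL };

static void presuspend(struct personality *pers)
{
    if (pers->prepare_suspend)       /* same guard as raid_presuspend() */
        pers->prepare_suspend();
    else
        printf("%s: no prepare_suspend hook, nothing to do\n", pers->name);
}

int main(void)
{
    presuspend(&raid456);
    presuspend(&raid0);
    return 0;
}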
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.9-rc1 commit 41425f96d7aa59bc865f60f5dda3d7697b555677 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- For raid456, if reshape is still in progress, then IO across the reshape position will wait for reshape to make progress. However, for dm-raid, in the following cases reshape will never make progress, hence IO will hang: 1) the array is read-only; 2) MD_RECOVERY_WAIT is set; 3) MD_RECOVERY_FROZEN is set; After commit c467e97f079f ("md/raid6: use valid sector values to determine if an I/O should wait on the reshape") fixed the problem that IO across the reshape position didn't wait for reshape, the dm-raid test shell/lvconvert-raid-reshape.sh started to hang: [root@fedora ~]# cat /proc/979/stack [<0>] wait_woken+0x7d/0x90 [<0>] raid5_make_request+0x929/0x1d70 [raid456] [<0>] md_handle_request+0xc2/0x3b0 [md_mod] [<0>] raid_map+0x2c/0x50 [dm_raid] [<0>] __map_bio+0x251/0x380 [dm_mod] [<0>] dm_submit_bio+0x1f0/0x760 [dm_mod] [<0>] __submit_bio+0xc2/0x1c0 [<0>] submit_bio_noacct_nocheck+0x17f/0x450 [<0>] submit_bio_noacct+0x2bc/0x780 [<0>] submit_bio+0x70/0xc0 [<0>] mpage_readahead+0x169/0x1f0 [<0>] blkdev_readahead+0x18/0x30 [<0>] read_pages+0x7c/0x3b0 [<0>] page_cache_ra_unbounded+0x1ab/0x280 [<0>] force_page_cache_ra+0x9e/0x130 [<0>] page_cache_sync_ra+0x3b/0x110 [<0>] filemap_get_pages+0x143/0xa30 [<0>] filemap_read+0xdc/0x4b0 [<0>] blkdev_read_iter+0x75/0x200 [<0>] vfs_read+0x272/0x460 [<0>] ksys_read+0x7a/0x170 [<0>] __x64_sys_read+0x1c/0x30 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 This is because reshape can't make progress. For md/raid, the problem doesn't exist because registering a new sync_thread no longer relies on the IO being done: 1) If the array is read-only, it can be switched to read-write by ioctl/sysfs; 2) md/raid never sets MD_RECOVERY_WAIT; 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold 'reconfig_mutex', hence the flag can be cleared and reshape can continue via the sysfs API 'sync_action'. However, I'm not sure yet how to avoid the problem in dm-raid. This patch, on the one hand, makes sure raid_message() can't change the sync_thread after presuspend(); on the other hand, it detects the above 3 cases before waiting for IO to be done in dm_suspend(), and lets dm-raid requeue those IOs.
Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com Conflicts: drivers/md/dm-raid.c [ context conflict ] Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 3 ++- drivers/md/dm-raid.c | 22 ++++++++++++++++++++-- drivers/md/md.c | 24 ++++++++++++++++++++++-- drivers/md/raid5.c | 32 ++++++++++++++++++++++++++++++-- 4 files changed, 74 insertions(+), 7 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 05b3286e5ae2..1cb05e2324f6 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -798,6 +798,7 @@ extern void md_finish_reshape(struct mddev *mddev); void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio, sector_t start, sector_t size); void md_account_bio(struct mddev *mddev, struct bio **bio); +void md_free_cloned_bio(struct bio *bio); extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio); extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev, @@ -826,7 +827,7 @@ extern void md_stop_writes(struct mddev *mddev); extern int md_rdev_init(struct md_rdev *rdev); extern void md_rdev_clear(struct md_rdev *rdev); -extern void md_handle_request(struct mddev *mddev, struct bio *bio); +extern bool md_handle_request(struct mddev *mddev, struct bio *bio); extern int mddev_suspend(struct mddev *mddev, bool interruptible); extern void mddev_resume(struct mddev *mddev); extern void md_idle_sync_thread(struct mddev *mddev); diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 4d89a4947217..a6262348e09d 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -213,6 +213,7 @@ struct raid_dev { #define RT_FLAG_RS_IN_SYNC 6 #define RT_FLAG_RS_RESYNCING 7 #define RT_FLAG_RS_GROW 8 +#define RT_FLAG_RS_FROZEN 9 /* Array elements of 64 bit needed for rebuild/failed disk bits */ #define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8) @@ -3340,7 +3341,8 @@ static int raid_map(struct dm_target *ti, struct bio *bio) if (unlikely(bio_has_data(bio) && bio_end_sector(bio) > mddev->array_sectors)) return DM_MAPIO_REQUEUE; - md_handle_request(mddev, bio); + if (unlikely(!md_handle_request(mddev, bio))) + return DM_MAPIO_REQUEUE; return DM_MAPIO_SUBMITTED; } @@ -3723,7 +3725,8 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv, if (!mddev->pers || !mddev->pers->sync_request) return -EINVAL; - if (test_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) + if (test_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags) || + test_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags)) return -EBUSY; if (!strcasecmp(argv[0], "frozen")) @@ -3800,6 +3803,12 @@ static void raid_presuspend(struct dm_target *ti) struct raid_set *rs = ti->private; struct mddev *mddev = &rs->md; + /* + * From now on, disallow raid_message() to change sync_thread until + * resume, raid_postsuspend() is too late. 
+ */ + set_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags); + if (!reshape_interrupted(mddev)) return; @@ -3812,6 +3821,13 @@ static void raid_presuspend(struct dm_target *ti) mddev->pers->prepare_suspend(mddev); } +static void raid_presuspend_undo(struct dm_target *ti) +{ + struct raid_set *rs = ti->private; + + clear_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags); +} + static void raid_postsuspend(struct dm_target *ti) { struct raid_set *rs = ti->private; @@ -4079,6 +4095,7 @@ static void raid_resume(struct dm_target *ti) WARN_ON_ONCE(!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)); WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); + clear_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags); mddev_lock_nointr(mddev); mddev->ro = 0; mddev->in_sync = 0; @@ -4099,6 +4116,7 @@ static struct target_type raid_target = { .iterate_devices = raid_iterate_devices, .io_hints = raid_io_hints, .presuspend = raid_presuspend, + .presuspend_undo = raid_presuspend_undo, .postsuspend = raid_postsuspend, .preresume = raid_preresume, .resume = raid_resume, diff --git a/drivers/md/md.c b/drivers/md/md.c index cb4a8b302a6c..c06759fbb57e 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -366,7 +366,7 @@ static bool is_suspended(struct mddev *mddev, struct bio *bio) return true; } -void md_handle_request(struct mddev *mddev, struct bio *bio) +bool md_handle_request(struct mddev *mddev, struct bio *bio) { check_suspended: if (is_suspended(mddev, bio)) { @@ -374,7 +374,7 @@ void md_handle_request(struct mddev *mddev, struct bio *bio) /* Bail out if REQ_NOWAIT is set for the bio */ if (bio->bi_opf & REQ_NOWAIT) { bio_wouldblock_error(bio); - return; + return true; } for (;;) { prepare_to_wait(&mddev->sb_wait, &__wait, @@ -390,10 +390,13 @@ void md_handle_request(struct mddev *mddev, struct bio *bio) if (!mddev->pers->make_request(mddev, bio)) { percpu_ref_put(&mddev->active_io); + if (!mddev->gendisk && mddev->pers->prepare_suspend) + return false; goto check_suspended; } percpu_ref_put(&mddev->active_io); + return true; } EXPORT_SYMBOL(md_handle_request); @@ -8819,6 +8822,23 @@ void md_account_bio(struct mddev *mddev, struct bio **bio) } EXPORT_SYMBOL_GPL(md_account_bio); +void md_free_cloned_bio(struct bio *bio) +{ + struct md_io_clone *md_io_clone = bio->bi_private; + struct bio *orig_bio = md_io_clone->orig_bio; + struct mddev *mddev = md_io_clone->mddev; + + if (bio->bi_status && !orig_bio->bi_status) + orig_bio->bi_status = bio->bi_status; + + if (md_io_clone->start_time) + bio_end_io_acct(orig_bio, md_io_clone->start_time); + + bio_put(bio); + percpu_ref_put(&mddev->active_io); +} +EXPORT_SYMBOL_GPL(md_free_cloned_bio); + /* md_allow_write(mddev) * Calling this ensures that the array is marked 'active' so that writes * may proceed without blocking. 
It is important to call this before diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 0651dfd08b8e..7a0f57e12132 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -760,6 +760,7 @@ enum stripe_result { STRIPE_RETRY, STRIPE_SCHEDULE_AND_RETRY, STRIPE_FAIL, + STRIPE_WAIT_RESHAPE, }; struct stripe_request_ctx { @@ -5888,7 +5889,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev, if (ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) { spin_unlock_irq(&conf->device_lock); - return STRIPE_SCHEDULE_AND_RETRY; + ret = STRIPE_SCHEDULE_AND_RETRY; + goto out; } } spin_unlock_irq(&conf->device_lock); @@ -5967,6 +5969,12 @@ static enum stripe_result make_stripe_request(struct mddev *mddev, out_release: raid5_release_stripe(sh); +out: + if (ret == STRIPE_SCHEDULE_AND_RETRY && reshape_interrupted(mddev)) { + bi->bi_status = BLK_STS_RESOURCE; + ret = STRIPE_WAIT_RESHAPE; + pr_err_ratelimited("dm-raid456: io across reshape position while reshape can't make progress"); + } return ret; } @@ -6088,7 +6096,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) while (1) { res = make_stripe_request(mddev, conf, &ctx, logical_sector, bi); - if (res == STRIPE_FAIL) + if (res == STRIPE_FAIL || res == STRIPE_WAIT_RESHAPE) break; if (res == STRIPE_RETRY) @@ -6126,6 +6134,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) if (rw == WRITE) md_write_end(mddev); + if (res == STRIPE_WAIT_RESHAPE) { + md_free_cloned_bio(bi); + return false; + } + bio_endio(bi); return true; } @@ -8860,6 +8873,18 @@ static int raid5_start(struct mddev *mddev) return r5l_start(conf->log); } +/* + * This is only used for dm-raid456; the caller has already frozen the + * sync_thread, hence if reshape is still in progress, IO that is waiting for + * reshape can never be done now, so wake up and handle those IO. + */ +static void raid5_prepare_suspend(struct mddev *mddev) +{ + struct r5conf *conf = mddev->private; + + wake_up(&conf->wait_for_overlap); +} + static struct md_personality raid6_personality = { .name = "raid6", @@ -8883,6 +8908,7 @@ static struct md_personality raid6_personality = .quiesce = raid5_quiesce, .takeover = raid6_takeover, .change_consistency_policy = raid5_change_consistency_policy, + .prepare_suspend = raid5_prepare_suspend, }; static struct md_personality raid5_personality = { @@ -8907,6 +8933,7 @@ static struct md_personality raid5_personality = .quiesce = raid5_quiesce, .takeover = raid5_takeover, .change_consistency_policy = raid5_change_consistency_policy, + .prepare_suspend = raid5_prepare_suspend, }; static struct md_personality raid4_personality = @@ -8932,6 +8959,7 @@ static struct md_personality raid4_personality = .quiesce = raid5_quiesce, .takeover = raid4_takeover, .change_consistency_policy = raid5_change_consistency_policy, + .prepare_suspend = raid5_prepare_suspend, }; static int __init raid5_init(void) -- 2.39.2
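The fix threads a new status from make_stripe_request() up through raid5_make_request() to md_handle_request() and finally to dm's raid_map(), which requeues instead of blocking. A compressed standalone model of just that decision chain (booleans stand in for the real reshape-position and flag checks; nothing here is kernel code):

#include <stdio.h>
#include <stdbool.h>

enum stripe_result { STRIPE_SUCCESS, STRIPE_SCHEDULE_AND_RETRY, STRIPE_WAIT_RESHAPE };

static enum stripe_result make_stripe_request(bool crosses_reshape,
                                              bool reshape_interrupted)
{
    if (!crosses_reshape)
        return STRIPE_SUCCESS;
    /* would normally wait for reshape to make progress ... */
    if (reshape_interrupted)
        return STRIPE_WAIT_RESHAPE;  /* ... but it never will */
    return STRIPE_SCHEDULE_AND_RETRY;
}

/* returns false so the caller can requeue (DM_MAPIO_REQUEUE) instead of
 * sleeping forever on an interrupted reshape */
static bool raid5_make_request(bool crosses_reshape, bool interrupted)
{
    return make_stripe_request(crosses_reshape, interrupted) != STRIPE_WAIT_RESHAPE;
}

int main(void)
{
    printf("normal io:               handled=%d\n", raid5_make_request(false, false));
    printf("io across live reshape:  handled=%d\n", raid5_make_request(true, false));
    printf("io across stuck reshape: handled=%d (requeued)\n",
           raid5_make_request(true, true));
    return 0;
}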
From: Christoph Hellwig <hch@lst.de> mainline inclusion from mainline-v6.9-rc1 commit c396b90e502691fc6ff7b43984cfd9d1b15aaa80 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a helper to trace bio remapping that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 8 ++++++++ drivers/md/md.c | 6 +----- drivers/md/raid0.c | 5 +---- drivers/md/raid1.c | 11 ++--------- drivers/md/raid10.c | 10 ++-------- drivers/md/raid5.c | 14 +++----------- 6 files changed, 17 insertions(+), 37 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 1cb05e2324f6..1b6a1fb12522 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -18,6 +18,7 @@ #include <linux/timer.h> #include <linux/wait.h> #include <linux/workqueue.h> +#include <trace/events/block.h> #include "md-cluster.h" #define MaxSector (~(sector_t)0) @@ -915,4 +916,11 @@ int do_md_run(struct mddev *mddev); extern const struct block_device_operations md_fops; +static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio, + sector_t sector) +{ + if (mddev->gendisk) + trace_block_bio_remap(bio, disk_devt(mddev->gendisk), sector); +} + #endif /* _MD_MD_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index c06759fbb57e..62482808daf6 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -65,7 +65,6 @@ #include <linux/percpu-refcount.h> #include <linux/part_stat.h> -#include <trace/events/block.h> #include "md.h" #include "md-bitmap.h" #include "md-cluster.h" @@ -8746,10 +8745,7 @@ void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev, bio_chain(discard_bio, bio); bio_clone_blkg_association(discard_bio, bio); - if (mddev->gendisk) - trace_block_bio_remap(discard_bio, - disk_devt(mddev->gendisk), - bio->bi_iter.bi_sector); + mddev_trace_remap(mddev, discard_bio, bio->bi_iter.bi_sector); submit_bio_noacct(discard_bio); } EXPORT_SYMBOL_GPL(md_submit_discard_bio); diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index c50a7abda744..aff094de9743 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -578,10 +578,7 @@ static void raid0_map_submit_bio(struct mddev *mddev, struct bio *bio) bio_set_dev(bio, tmp_dev->bdev); bio->bi_iter.bi_sector = sector + zone->dev_start + tmp_dev->data_offset; - - if (mddev->gendisk) - trace_block_bio_remap(bio, disk_devt(mddev->gendisk), - bio_sector); + mddev_trace_remap(mddev, bio, bio_sector); mddev_check_write_zeroes(mddev, bio); submit_bio_noacct(bio); } diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index ba4fed333664..fcb56797ed53 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1411,11 +1411,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, test_bit(R1BIO_FailFast, &r1_bio->state)) read_bio->bi_opf |= MD_FAILFAST; read_bio->bi_private = r1_bio; - - if (mddev->gendisk) - trace_block_bio_remap(read_bio, disk_devt(mddev->gendisk), - r1_bio->sector); - + mddev_trace_remap(mddev, read_bio, r1_bio->sector); submit_bio_noacct(read_bio); } @@ -1634,10 +1630,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, mbio->bi_private = r1_bio; atomic_inc(&r1_bio->remaining); - - if (mddev->gendisk) -
trace_block_bio_remap(mbio, disk_devt(mddev->gendisk), - r1_bio->sector); + mddev_trace_remap(mddev, mbio, r1_bio->sector); /* flush_pending_writes() needs access to the rdev so...*/ mbio->bi_bdev = (void *)rdev; if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) { diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index e47df90e705a..c227cdb789e3 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -1229,10 +1229,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio, test_bit(R10BIO_FailFast, &r10_bio->state)) read_bio->bi_opf |= MD_FAILFAST; read_bio->bi_private = r10_bio; - - if (mddev->gendisk) - trace_block_bio_remap(read_bio, disk_devt(mddev->gendisk), - r10_bio->sector); + mddev_trace_remap(mddev, read_bio, r10_bio->sector); submit_bio_noacct(read_bio); return; } @@ -1264,10 +1261,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio, && enough(conf, devnum)) mbio->bi_opf |= MD_FAILFAST; mbio->bi_private = r10_bio; - - if (conf->mddev->gendisk) - trace_block_bio_remap(mbio, disk_devt(conf->mddev->gendisk), - r10_bio->sector); + mddev_trace_remap(mddev, mbio, r10_bio->sector); /* flush_pending_writes() needs access to the rdev so...*/ mbio->bi_bdev = (void *)rdev; diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 7a0f57e12132..0014b28222c4 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1293,10 +1293,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) if (rrdev) set_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags); - if (conf->mddev->gendisk) - trace_block_bio_remap(bi, - disk_devt(conf->mddev->gendisk), - sh->dev[i].sector); + mddev_trace_remap(conf->mddev, bi, sh->dev[i].sector); if (should_defer && op_is_write(op)) bio_list_add(&pending_bios, bi); else @@ -1340,10 +1337,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) */ if (op == REQ_OP_DISCARD) rbi->bi_vcnt = 0; - if (conf->mddev->gendisk) - trace_block_bio_remap(rbi, - disk_devt(conf->mddev->gendisk), - sh->dev[i].sector); + mddev_trace_remap(conf->mddev, rbi, sh->dev[i].sector); if (should_defer && op_is_write(op)) bio_list_add(&pending_bios, rbi); else @@ -5480,9 +5474,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) spin_unlock_irq(&conf->device_lock); } - if (mddev->gendisk) - trace_block_bio_remap(align_bio, disk_devt(mddev->gendisk), - raid_bio->bi_iter.bi_sector); + mddev_trace_remap(mddev, align_bio, raid_bio->bi_iter.bi_sector); submit_bio_noacct(align_bio); return 1; } -- 2.39.2
From: Christoph Hellwig <hch@lst.de> mainline inclusion from mainline-v6.9-rc1 commit 28be4fd310d146e9a43d7b1bb55cb7e9f5e06e88 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a small wrapper around blk_add_trace_msg that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 6 ++++++ drivers/md/md-bitmap.c | 9 +++------ drivers/md/md.c | 3 +-- drivers/md/raid1.c | 10 ++++------ drivers/md/raid10.c | 15 +++++++-------- drivers/md/raid5.c | 14 +++++++------- 6 files changed, 28 insertions(+), 29 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 1b6a1fb12522..8156c3e70f9d 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -923,4 +923,10 @@ static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio, trace_block_bio_remap(bio, disk_devt(mddev->gendisk), sector); } +#define mddev_add_trace_msg(mddev, fmt, args...) \ +do { \ + if ((mddev)->gendisk) \ + blk_add_trace_msg((mddev)->queue, fmt, ##args); \ +} while (0) + #endif /* _MD_MD_H */ diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index a62a82f77d6f..bd3cd035a3b6 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -1046,9 +1046,8 @@ void md_bitmap_unplug(struct bitmap *bitmap) if (dirty || need_write) { if (!writing) { md_bitmap_wait_writes(bitmap); - if (bitmap->mddev->queue) - blk_add_trace_msg(bitmap->mddev->queue, - "md bitmap_unplug"); + mddev_add_trace_msg(bitmap->mddev, + "md bitmap_unplug"); } clear_page_attr(bitmap, i, BITMAP_PAGE_PENDING); filemap_write_page(bitmap, i, false); @@ -1320,9 +1319,7 @@ void md_bitmap_daemon_work(struct mddev *mddev) } bitmap->allclean = 1; - if (bitmap->mddev->queue) - blk_add_trace_msg(bitmap->mddev->queue, - "md bitmap_daemon_work"); + mddev_add_trace_msg(bitmap->mddev, "md bitmap_daemon_work"); /* Any file-page which is PENDING now needs to be written. * So set NEEDWRITE now, then after we make any last-minute changes diff --git a/drivers/md/md.c b/drivers/md/md.c index 62482808daf6..ec813ea577f2 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2885,8 +2885,7 @@ void md_update_sb(struct mddev *mddev, int force_change) pr_debug("md: updating %s RAID superblock on device (in sync %d)\n", mdname(mddev), mddev->in_sync); - if (mddev->queue) - blk_add_trace_msg(mddev->queue, "md md_update_sb"); + mddev_add_trace_msg(mddev, "md md_update_sb"); rewrite: md_bitmap_update_sb(mddev->bitmap); rdev_for_each(rdev, mddev) { diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index fcb56797ed53..2af4dd6e3bae 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -47,9 +47,6 @@ static void allow_barrier(struct r1conf *conf, sector_t sector_nr); static void lower_barrier(struct r1conf *conf, sector_t sector_nr); static void raid1_free(struct mddev *mddev, void *priv); -#define raid1_log(md, fmt, args...)
\ - do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0) - #define RAID_1_10_NAME "raid1" #include "raid1-10.c" @@ -1192,7 +1189,7 @@ static void freeze_array(struct r1conf *conf, int extra) */ spin_lock_irq(&conf->resync_lock); conf->array_frozen = 1; - raid1_log(conf->mddev, "wait freeze"); + mddev_add_trace_msg(conf->mddev, "raid1 wait freeze"); wait_event_lock_irq_cmd( conf->wait_barrier, get_unqueued_pending(conf) == extra, @@ -1379,7 +1376,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, * Reading from a write-mostly device must take care not to * over-take any writes that are 'behind' */ - raid1_log(mddev, "wait behind writes"); + mddev_add_trace_msg(mddev, "raid1 wait behind writes"); wait_event(bitmap->behind_wait, atomic_read(&bitmap->behind_writes) == 0); } @@ -1548,7 +1545,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, bio_wouldblock_error(bio); return; } - raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk); + mddev_add_trace_msg(mddev, "raid1 wait rdev %d blocked", + blocked_rdev->raid_disk); md_wait_for_blocked_rdev(blocked_rdev, mddev); wait_barrier(conf, bio->bi_iter.bi_sector, false); goto retry_write; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index c227cdb789e3..71a19d3cfa07 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -76,9 +76,6 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio); static void end_reshape_write(struct bio *bio); static void end_reshape(struct r10conf *conf); -#define raid10_log(md, fmt, args...) \ - do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid10 " fmt, ##args); } while (0) - #include "raid1-10.c" #define NULL_CMD @@ -1013,7 +1010,7 @@ static bool wait_barrier(struct r10conf *conf, bool nowait) ret = false; } else { conf->nr_waiting++; - raid10_log(conf->mddev, "wait barrier"); + mddev_add_trace_msg(conf->mddev, "raid10 wait barrier"); wait_event_barrier(conf, stop_waiting_barrier(conf)); conf->nr_waiting--; } @@ -1132,7 +1129,7 @@ static bool regular_request_wait(struct mddev *mddev, struct r10conf *conf, bio_wouldblock_error(bio); return false; } - raid10_log(conf->mddev, "wait reshape"); + mddev_add_trace_msg(conf->mddev, "raid10 wait reshape"); wait_event(conf->wait_barrier, conf->reshape_progress <= bio->bi_iter.bi_sector || conf->reshape_progress >= bio->bi_iter.bi_sector + @@ -1326,8 +1323,9 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio) if (unlikely(blocked_rdev)) { /* Have to wait for this device to get unblocked, then retry */ allow_barrier(conf); - raid10_log(conf->mddev, "%s wait rdev %d blocked", - __func__, blocked_rdev->raid_disk); + mddev_add_trace_msg(conf->mddev, + "raid10 %s wait rdev %d blocked", + __func__, blocked_rdev->raid_disk); md_wait_for_blocked_rdev(blocked_rdev, mddev); wait_barrier(conf, false); goto retry_wait; @@ -1385,7 +1383,8 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio, bio_wouldblock_error(bio); return; } - raid10_log(conf->mddev, "wait reshape metadata"); + mddev_add_trace_msg(conf->mddev, + "raid10 wait reshape metadata"); wait_event(mddev->sb_wait, !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 0014b28222c4..e59d1fb65790 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4156,10 +4156,9 @@ static int handle_stripe_dirtying(struct r5conf *conf, set_bit(STRIPE_HANDLE, &sh->state); if ((rmw < rcw || (rmw == rcw && 
conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) { /* prefer read-modify-write, but need to get some data */ - if (conf->mddev->queue) - blk_add_trace_msg(conf->mddev->queue, - "raid5 rmw %llu %d", - (unsigned long long)sh->sector, rmw); + mddev_add_trace_msg(conf->mddev, "raid5 rmw %llu %d", + sh->sector, rmw); + for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; if (test_bit(R5_InJournal, &dev->flags) && @@ -4237,9 +4236,10 @@ static int handle_stripe_dirtying(struct r5conf *conf, } } if (rcw && conf->mddev->queue) - blk_add_trace_msg(conf->mddev->queue, "raid5 rcw %llu %d %d %d", - (unsigned long long)sh->sector, - rcw, qread, test_bit(STRIPE_DELAYED, &sh->state)); + mddev_add_trace_msg(conf->mddev, + "raid5 rcw %llu %d %d %d", + sh->sector, rcw, qread, + test_bit(STRIPE_DELAYED, &sh->state)); } if (rcw > disks && rmw > disks && -- 2.39.2
From: Christoph Hellwig <hch@lst.de> mainline inclusion from mainline-v6.9-rc1 commit 176df894d7974166c65d0cce3b3b019678f9e698 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a helper to check for a DM-mapped MD device instead of using the obfuscated ->gendisk or ->queue NULL checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 12 ++++++++++-- drivers/md/md.c | 15 +++++++-------- drivers/md/raid0.c | 2 +- drivers/md/raid1.c | 13 +++++-------- drivers/md/raid10.c | 10 +++++----- drivers/md/raid5.c | 21 ++++++++++----------- 6 files changed, 38 insertions(+), 35 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 8156c3e70f9d..3a067c04b93a 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -916,16 +916,24 @@ int do_md_run(struct mddev *mddev); extern const struct block_device_operations md_fops; +/* + * MD devices can be used underneath by DM, in which case ->gendisk is NULL. + */ +static inline bool mddev_is_dm(struct mddev *mddev) +{ + return !mddev->gendisk; +} + static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio, sector_t sector) { - if (mddev->gendisk) + if (!mddev_is_dm(mddev)) trace_block_bio_remap(bio, disk_devt(mddev->gendisk), sector); } #define mddev_add_trace_msg(mddev, fmt, args...) \ do { \ - if ((mddev)->gendisk) \ + if (!mddev_is_dm(mddev)) \ blk_add_trace_msg((mddev)->queue, fmt, ##args); \ } while (0) diff --git a/drivers/md/md.c b/drivers/md/md.c index ec813ea577f2..d229214393b3 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2439,7 +2439,7 @@ int md_integrity_register(struct mddev *mddev) if (list_empty(&mddev->disks)) return 0; /* nothing to do */ - if (!mddev->gendisk || blk_get_integrity(mddev->gendisk)) + if (mddev_is_dm(mddev) || blk_get_integrity(mddev->gendisk)) return 0; /* shouldn't register, or already is */ rdev_for_each(rdev, mddev) { /* skip spares and non-functional disks */ @@ -2492,7 +2492,7 @@ int md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev) { struct blk_integrity *bi_mddev; - if (!mddev->gendisk) + if (mddev_is_dm(mddev)) return 0; bi_mddev = blk_get_integrity(mddev->gendisk); @@ -6014,7 +6014,7 @@ int md_run(struct mddev *mddev) invalidate_bdev(rdev->bdev); if (mddev->ro != MD_RDONLY && rdev_read_only(rdev)) { mddev->ro = MD_RDONLY; - if (mddev->gendisk) + if (!mddev_is_dm(mddev)) set_disk_ro(mddev->gendisk, 1); } @@ -6176,7 +6176,7 @@ int md_run(struct mddev *mddev) } } - if (mddev->queue) { + if (!mddev_is_dm(mddev)) { bool nonrot = true; rdev_for_each(rdev, mddev) { @@ -6441,7 +6441,7 @@ static void mddev_detach(struct mddev *mddev) mddev->pers->quiesce(mddev, 0); } md_unregister_thread(mddev, &mddev->thread); - if (mddev->queue) + if (!mddev_is_dm(mddev)) blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/ } @@ -7397,10 +7397,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors) if (!rv) { if (mddev_is_clustered(mddev)) md_cluster_ops->update_size(mddev, old_dev_sectors); - else if (mddev->queue) { + else if (!mddev_is_dm(mddev)) set_capacity_and_notify(mddev->gendisk, mddev->array_sectors); - } } return rv; } @@ -9267,7 +9266,7 @@ void md_do_sync(struct md_thread *thread) mddev->delta_disks >
0 && mddev->pers->finish_reshape && mddev->pers->size && - mddev->queue) { + !mddev_is_dm(mddev)) { mddev_lock_nointr(mddev); md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0)); mddev_unlock(mddev); diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index aff094de9743..9f787ae77ede 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -399,7 +399,7 @@ static int raid0_run(struct mddev *mddev) mddev->private = conf; } conf = mddev->private; - if (mddev->queue) { + if (!mddev_is_dm(mddev)) { struct md_rdev *rdev; blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 2af4dd6e3bae..d5557ffda88b 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1905,7 +1905,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev) for (mirror = first; mirror <= last; mirror++) { p = conf->mirrors + mirror; if (!p->rdev) { - if (mddev->gendisk) + if (!mddev_is_dm(mddev)) disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9); @@ -3206,14 +3206,11 @@ static int raid1_run(struct mddev *mddev) if (IS_ERR(conf)) return PTR_ERR(conf); - if (mddev->queue) + if (!mddev_is_dm(mddev)) { blk_queue_max_write_zeroes_sectors(mddev->queue, 0); - - rdev_for_each(rdev, mddev) { - if (!mddev->gendisk) - continue; - disk_stack_limits(mddev->gendisk, rdev->bdev, - rdev->data_offset << 9); + rdev_for_each(rdev, mddev) + disk_stack_limits(mddev->gendisk, rdev->bdev, + rdev->data_offset << 9); } mddev->degraded = 0; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 71a19d3cfa07..0613ed828ecd 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -2088,7 +2088,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev) continue; } - if (mddev->gendisk) + if (!mddev_is_dm(mddev)) disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9); @@ -2108,7 +2108,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev) set_bit(Replacement, &rdev->flags); rdev->raid_disk = repl_slot; err = 0; - if (mddev->gendisk) + if (!mddev_is_dm(mddev)) disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9); conf->fullsync = 1; @@ -3991,7 +3991,7 @@ static int raid10_run(struct mddev *mddev) } } - if (mddev->queue) { + if (!mddev_is_dm(conf->mddev)) { blk_queue_max_write_zeroes_sectors(mddev->queue, 0); blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); raid10_set_io_opt(conf); @@ -4025,7 +4025,7 @@ static int raid10_run(struct mddev *mddev) if (first || diff < min_offset_diff) min_offset_diff = diff; - if (mddev->gendisk) + if (!mddev_is_dm(mddev)) disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9); @@ -4898,7 +4898,7 @@ static void end_reshape(struct r10conf *conf) conf->reshape_safe = MaxSector; spin_unlock_irq(&conf->device_lock); - if (conf->mddev->queue) + if (!mddev_is_dm(conf->mddev)) raid10_set_io_opt(conf); conf->fullsync = 0; } diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e59d1fb65790..0e0530b433fb 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2412,12 +2412,12 @@ static int grow_stripes(struct r5conf *conf, int num) size_t namelen = sizeof(conf->cache_name[0]); int devs = max(conf->raid_disks, conf->previous_raid_disks); - if (conf->mddev->gendisk) + if (mddev_is_dm(conf->mddev)) snprintf(conf->cache_name[0], namelen, - "raid%d-%s", conf->level, mdname(conf->mddev)); + "raid%d-%p", conf->level, conf->mddev); else snprintf(conf->cache_name[0], namelen, - "raid%d-%p", conf->level, conf->mddev); 
+ "raid%d-%s", conf->level, mdname(conf->mddev)); snprintf(conf->cache_name[1], namelen, "%.27s-alt", conf->cache_name[0]); conf->active_name = 0; @@ -4235,11 +4235,10 @@ static int handle_stripe_dirtying(struct r5conf *conf, set_bit(STRIPE_DELAYED, &sh->state); } } - if (rcw && conf->mddev->queue) - mddev_add_trace_msg(conf->mddev, - "raid5 rcw %llu %d %d %d", - sh->sector, rcw, qread, - test_bit(STRIPE_DELAYED, &sh->state)); + if (rcw && !mddev_is_dm(conf->mddev)) + blk_add_trace_msg(conf->mddev->queue, "raid5 rcw %llu %d %d %d", + (unsigned long long)sh->sector, + rcw, qread, test_bit(STRIPE_DELAYED, &sh->state)); } if (rcw > disks && rmw > disks && @@ -5643,7 +5642,7 @@ static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule) } release_inactive_stripe_list(conf, cb->temp_inactive_list, NR_STRIPE_HASH_LOCKS); - if (mddev->queue) + if (!mddev_is_dm(mddev)) trace_block_unplug(mddev->queue, cnt, !from_schedule); kfree(cb); } @@ -7904,7 +7903,7 @@ static int raid5_run(struct mddev *mddev) mdname(mddev)); md_set_array_sectors(mddev, raid5_size(mddev, 0, 0)); - if (mddev->queue) { + if (!mddev_is_dm(mddev)) { int chunk_size; /* read-ahead size must cover two whole stripes, which * is 2 * (datadisks) * chunksize where 'n' is the @@ -8487,7 +8486,7 @@ static void end_reshape(struct r5conf *conf) spin_unlock_irq(&conf->device_lock); wake_up(&conf->wait_for_overlap); - if (conf->mddev->queue) + if (!mddev_is_dm(conf->mddev)) raid5_set_io_opt(conf); } } -- 2.39.2
From: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> mainline inclusion from mainline-v6.11-rc1 commit 3821bbad0d0fbb6c9d77987bd54c89348752056d category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Check for a sleeping thread before attempting its wake_up in md_wakeup_thread() to avoid unnecessary spinlock contention. With a 6.1 kernel, fio random read/write tests on many (>= 100) virtual volumes of 100 GiB each, on 3 md-raid5s on 8 SSDs each (building a raid50), show a 3 to 4% IOPS improvement. Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20240327114022.74634-1-jinpu.wang@ionos.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index d229214393b3..501ff290e013 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8068,7 +8068,8 @@ void md_wakeup_thread(struct md_thread __rcu *thread) if (t) { pr_debug("md: waking up MD thread %s.\n", t->tsk->comm); set_bit(THREAD_WAKEUP, &t->flags); - wake_up(&t->wqueue); + if (wq_has_sleeper(&t->wqueue)) + wake_up(&t->wqueue); } rcu_read_unlock(); } -- 2.39.2
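For readers unfamiliar with wq_has_sleeper(): the win is skipping the wake-up machinery entirely when the waitqueue is empty, which is the common case here. A rough userspace analogy with pthreads (a mutex-protected waiter count stands in for the kernel's barrier-based emptiness check; this is an illustration of the idea, not the kernel mechanism):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int waiters;
static int flag;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    waiters++;
    while (!flag)
        pthread_cond_wait(&cond, &lock);
    waiters--;
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void wake(void)
{
    pthread_mutex_lock(&lock);
    flag = 1;
    if (waiters)                     /* the wq_has_sleeper()-style shortcut */
        pthread_cond_broadcast(&cond);
    else
        printf("no sleeper, skipped the broadcast\n");
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    wake();                          /* nobody waiting: cheap path */
    flag = 0;

    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    /* crude: give the waiter time to go to sleep */
    struct timespec ts = { 0, 50 * 1000 * 1000 };
    nanosleep(&ts, NULL);
    wake();                          /* one sleeper: broadcast */
    pthread_join(t, NULL);
    return 0;
}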
From: Li Nan <linan122@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 3f9f231236ce7e48780d8a4f1f8cb9fae2df1e4e category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- UBSAN reports this problem: UBSAN: Undefined behaviour in drivers/md/md.c:8175:15 signed integer overflow: -2147483291 - 2072033152 cannot be represented in type 'int' Call trace: dump_backtrace+0x0/0x310 show_stack+0x28/0x38 dump_stack+0xec/0x15c ubsan_epilogue+0x18/0x84 handle_overflow+0x14c/0x19c __ubsan_handle_sub_overflow+0x34/0x44 is_mddev_idle+0x338/0x3d8 md_do_sync+0x1bb8/0x1cf8 md_thread+0x220/0x288 kthread+0x1d8/0x1e0 ret_from_fork+0x10/0x18 'curr_events' will overflow when stat accum or 'sync_io' is greater than INT_MAX. Fix it by changing sync_io, last_events and curr_events to 64bit. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240117031946.2324519-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> --- drivers/md/md.h | 4 ++-- include/linux/blkdev.h | 2 +- drivers/md/md.c | 7 ++++--- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 3a067c04b93a..20fc443b7969 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -51,7 +51,7 @@ struct md_rdev { sector_t sectors; /* Device size (in 512bytes sectors) */ struct mddev *mddev; /* RAID array if running */ - int last_events; /* IO event timestamp */ + long long last_events; /* IO event timestamp */ /* * If meta_bdev is non-NULL, it means that a separate device is @@ -622,7 +622,7 @@ extern void mddev_unlock(struct mddev *mddev); static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors) { - atomic_add(nr_sectors, &bdev->bd_disk->sync_io); + atomic64_add(nr_sectors, &bdev->bd_disk->sync_io); } static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 32f9f1cc7d3c..10b6651ec194 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -177,7 +177,7 @@ struct gendisk { struct list_head slave_bdevs; #endif struct timer_rand_state *random; - atomic_t sync_io; /* RAID */ + atomic64_t sync_io; /* RAID */ struct disk_events *ev; #ifdef CONFIG_BLK_DEV_ZONED diff --git a/drivers/md/md.c b/drivers/md/md.c index 501ff290e013..cc008df968cd 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8588,14 +8588,15 @@ static int is_mddev_idle(struct mddev *mddev, int init) { struct md_rdev *rdev; int idle; - int curr_events; + long long curr_events; idle = 1; rcu_read_lock(); rdev_for_each_rcu(rdev, mddev) { struct gendisk *disk = rdev->bdev->bd_disk; - curr_events = (int)part_stat_read_accum(disk->part0, sectors) - - atomic_read(&disk->sync_io); + curr_events = + (long long)part_stat_read_accum(disk->part0, sectors) - + atomic64_read(&disk->sync_io); /* sync IO will cause sync_io to increase before the disk_stats * as sync_io is counted when a request starts, and * disk_stats is counted when it completes. -- 2.39.2
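To see the arithmetic go wrong, plug the two operand values printed in the UBSAN report into a standalone program. The true difference does not fit in an int, which is exactly the reported overflow; computing in 64 bits (as the patch does) avoids it:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* the two operands from the UBSAN report above */
    long long sectors = -2147483291LL;   /* part_stat_read_accum() side */
    long long sync_io = 2072033152LL;    /* atomic_read(&disk->sync_io) side */

    long long diff = sectors - sync_io;  /* exact 64-bit result */
    printf("true difference: %lld\n", diff);
    printf("fits in int?     %s\n",
           (diff >= INT_MIN && diff <= INT_MAX) ? "yes" : "no (overflow as int)");
    return 0;
}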
From: Li Nan <linan122@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 9d1110f99c253ccef82e480bfe9f38a12eb797a7 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- If iostats is disabled, disk_stats will not be updated and part_stat_read_accum() only returns a constant value. In this case, continuing to count sync_io and to check is_mddev_idle() is no longer meaningful. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240117031946.2324519-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> --- drivers/md/md.h | 3 ++- drivers/md/md.c | 4 ++++ 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 20fc443b7969..2edc0a343146 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -622,7 +622,8 @@ extern void mddev_unlock(struct mddev *mddev); static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors) { - atomic64_add(nr_sectors, &bdev->bd_disk->sync_io); + if (blk_queue_io_stat(bdev->bd_disk->queue)) + atomic64_add(nr_sectors, &bdev->bd_disk->sync_io); } static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors) diff --git a/drivers/md/md.c b/drivers/md/md.c index cc008df968cd..3d381e78245a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8594,6 +8594,10 @@ static int is_mddev_idle(struct mddev *mddev, int init) rcu_read_lock(); rdev_for_each_rcu(rdev, mddev) { struct gendisk *disk = rdev->bdev->bd_disk; + + if (!init && !blk_queue_io_stat(disk->queue)) + continue; + curr_events = (long long)part_stat_read_accum(disk->part0, sectors) - atomic64_read(&disk->sync_io); -- 2.39.2
From: Li Nan <linan122@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 504fbcffea649cad69111e7597081dd8adc3b395 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- This reverts commit 3f9f231236ce7e48780d8a4f1f8cb9fae2df1e4e. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240507023103.781816-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> --- drivers/md/md.h | 4 ++-- include/linux/blkdev.h | 2 +- drivers/md/md.c | 7 +++---- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 2edc0a343146..fee2942aba12 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -51,7 +51,7 @@ struct md_rdev { sector_t sectors; /* Device size (in 512bytes sectors) */ struct mddev *mddev; /* RAID array if running */ - long long last_events; /* IO event timestamp */ + int last_events; /* IO event timestamp */ /* * If meta_bdev is non-NULL, it means that a separate device is @@ -623,7 +623,7 @@ extern void mddev_unlock(struct mddev *mddev); static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors) { if (blk_queue_io_stat(bdev->bd_disk->queue)) - atomic64_add(nr_sectors, &bdev->bd_disk->sync_io); + atomic_add(nr_sectors, &bdev->bd_disk->sync_io); } static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 10b6651ec194..32f9f1cc7d3c 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -177,7 +177,7 @@ struct gendisk { struct list_head slave_bdevs; #endif struct timer_rand_state *random; - atomic64_t sync_io; /* RAID */ + atomic_t sync_io; /* RAID */ struct disk_events *ev; #ifdef CONFIG_BLK_DEV_ZONED diff --git a/drivers/md/md.c b/drivers/md/md.c index 3d381e78245a..84abb621e4e9 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8588,7 +8588,7 @@ static int is_mddev_idle(struct mddev *mddev, int init) { struct md_rdev *rdev; int idle; - long long curr_events; + int curr_events; idle = 1; rcu_read_lock(); @@ -8598,9 +8598,8 @@ static int is_mddev_idle(struct mddev *mddev, int init) if (!init && !blk_queue_io_stat(disk->queue)) continue; - curr_events = - (long long)part_stat_read_accum(disk->part0, sectors) - - atomic64_read(&disk->sync_io); + curr_events = (int)part_stat_read_accum(disk->part0, sectors) - + atomic_read(&disk->sync_io); /* sync IO will cause sync_io to increase before the disk_stats * as sync_io is counted when a request starts, and * disk_stats is counted when it completes. -- 2.39.2
From: Li Nan <linan122@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 03e792eaf18ec2e93e2c623f9f1a4bdb97fe4126 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Commit cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()") aborted md_write_start() with false when the mddev is suspended, which fixed a deadlock when mddev_suspend() was called while holding the reconfig_mutex. Since mddev_suspend() now includes lockdep_assert_not_held(), it can no longer be called with the reconfig_mutex held. This makes the previous abort unnecessary. Now remove the unnecessary abort and change the function's return value to void. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-2-linan666@huaweicloud.com --- drivers/md/md.h | 2 +- drivers/md/md.c | 14 ++++---------- drivers/md/raid1.c | 3 +-- drivers/md/raid10.c | 3 +-- drivers/md/raid5.c | 3 +-- 5 files changed, 8 insertions(+), 17 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index fee2942aba12..0fd446d46dc9 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -791,7 +791,7 @@ extern void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **t extern void md_wakeup_thread(struct md_thread __rcu *thread); extern void md_check_recovery(struct mddev *mddev); extern void md_reap_sync_thread(struct mddev *mddev); -extern bool md_write_start(struct mddev *mddev, struct bio *bi); +extern void md_write_start(struct mddev *mddev, struct bio *bi); extern void md_write_inc(struct mddev *mddev, struct bio *bi); extern void md_write_end(struct mddev *mddev); extern void md_done_sync(struct mddev *mddev, int blocks, int ok); diff --git a/drivers/md/md.c b/drivers/md/md.c index 84abb621e4e9..b688d3ad6cae 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8652,12 +8652,12 @@ EXPORT_SYMBOL(md_done_sync); * A return value of 'false' means that the write wasn't recorded * and cannot proceed as the array is being suspend.
*/ -bool md_write_start(struct mddev *mddev, struct bio *bi) +void md_write_start(struct mddev *mddev, struct bio *bi) { int did_change = 0; if (bio_data_dir(bi) != WRITE) - return true; + return; BUG_ON(mddev->ro == MD_RDONLY); if (mddev->ro == MD_AUTO_READ) { @@ -8690,15 +8690,9 @@ bool md_write_start(struct mddev *mddev, struct bio *bi) if (did_change) sysfs_notify_dirent_safe(mddev->sysfs_state); if (!mddev->has_superblocks) - return true; + return; wait_event(mddev->sb_wait, - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) || - is_md_suspended(mddev)); - if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { - percpu_ref_put(&mddev->writes_pending); - return false; - } - return true; + !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); } EXPORT_SYMBOL(md_write_start); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index d5557ffda88b..c463a1873832 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1666,8 +1666,7 @@ static bool raid1_make_request(struct mddev *mddev, struct bio *bio) if (bio_data_dir(bio) == READ) raid1_read_request(mddev, bio, sectors, NULL); else { - if (!md_write_start(mddev,bio)) - return false; + md_write_start(mddev,bio); raid1_write_request(mddev, bio, sectors); } return true; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 0613ed828ecd..23cc3c8a345f 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -1818,8 +1818,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio) && md_flush_request(mddev, bio)) return true; - if (!md_write_start(mddev, bio)) - return false; + md_write_start(mddev, bio); if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) if (!raid10_handle_discard(mddev, bio)) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 0e0530b433fb..c832a1528103 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6028,8 +6028,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) ctx.do_flush = bi->bi_opf & REQ_PREFLUSH; } - if (!md_write_start(mddev, bi)) - return false; + md_write_start(mddev, bi); /* * If array is degraded, better not do chunk aligned read because * later we might have to read it again in order to reconstruct -- 2.39.2
From: Li Nan <linan122@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit acc6680af28696a037ede62867e731841d4454c2 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Setting bio to NULL and checking 'if(!bio)' is redundant and looks strange, just consolidate them into one condition. There are no functional changes. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240528203149.2383260-1-linan666@huaweicloud.com --- drivers/md/md.c | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index b688d3ad6cae..a4e824ea57e2 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -657,24 +657,22 @@ bool md_flush_request(struct mddev *mddev, struct bio *bio) WARN_ON(percpu_ref_is_zero(&mddev->active_io)); percpu_ref_get(&mddev->active_io); mddev->flush_bio = bio; - bio = NULL; - } - spin_unlock_irq(&mddev->lock); - - if (!bio) { + spin_unlock_irq(&mddev->lock); INIT_WORK(&mddev->flush_work, submit_flushes); queue_work(md_wq, &mddev->flush_work); - } else { - /* flush was performed for some other bio while we waited. */ - if (bio->bi_iter.bi_size == 0) - /* an empty barrier - all done */ - bio_endio(bio); - else { - bio->bi_opf &= ~REQ_PREFLUSH; - return false; - } + return true; } - return true; + + /* flush was performed for some other bio while we waited. */ + spin_unlock_irq(&mddev->lock); + if (bio->bi_iter.bi_size == 0) { + /* pure flush without data - all done */ + bio_endio(bio); + return true; + } + + bio->bi_opf &= ~REQ_PREFLUSH; + return false; } EXPORT_SYMBOL(md_flush_request); -- 2.39.2
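To make the consolidated control flow easier to follow, here is a minimal userspace sketch of the same two-branch shape. This is a hypothetical model with illustrative names, not the kernel API; the real code also waits for an in-flight flush before reaching this point.

```c
/*
 * Simplified userspace model of the restructured md_flush_request()
 * flow. All names and the mutex stand-in are illustrative.
 */
#include <stdbool.h>
#include <pthread.h>

struct flush_state {
	pthread_mutex_t lock;       /* models mddev->lock */
	bool flush_in_flight;       /* models mddev->flush_bio being set */
};

/*
 * Returns true if the request was fully handled here, false if the
 * caller must resubmit the bio with the flush flag stripped (the
 * kernel clears REQ_PREFLUSH and returns false in that case).
 */
static bool flush_request(struct flush_state *st, bool bio_has_data)
{
	pthread_mutex_lock(&st->lock);
	if (!st->flush_in_flight) {
		/* First branch: become the flush owner, queue the work. */
		st->flush_in_flight = true;
		pthread_mutex_unlock(&st->lock);
		/* queue_work(md_wq, &mddev->flush_work) happens here */
		return true;
	}

	/* Second branch: a flush was performed for some other bio. */
	pthread_mutex_unlock(&st->lock);
	if (!bio_has_data)
		return true;    /* pure flush without data - all done */
	return false;           /* strip the flush flag and resubmit */
}

int main(void)
{
	struct flush_state st = { PTHREAD_MUTEX_INITIALIZER, false };

	return flush_request(&st, true) ? 0 : 1;   /* owner path: 0 */
}
```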
From: Christoph Hellwig <hch@lst.de> mainline inclusion from mainline-v6.11-rc1 commit 35f20acaa3585f25f8356da0ee6bc143e0256522 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The core md code calls the ->free method which already frees conf. Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-2-hch@lst.de Conflicts: drivers/md/raid0.c [ conflict because of mainline commit 56cf22d6f672 ("md/raid0: use the atomic queue limit update APIs") ] Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid0.c | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 9f787ae77ede..c97daac92439 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -365,18 +365,13 @@ static sector_t raid0_size(struct mddev *mddev, sector_t sectors, int raid_disks return array_sectors; } -static void free_conf(struct mddev *mddev, struct r0conf *conf) -{ - kfree(conf->strip_zone); - kfree(conf->devlist); - kfree(conf); -} - static void raid0_free(struct mddev *mddev, void *priv) { struct r0conf *conf = priv; - free_conf(mddev, conf); + kfree(conf->strip_zone); + kfree(conf->devlist); + kfree(conf); } static int raid0_run(struct mddev *mddev) @@ -424,11 +419,7 @@ static int raid0_run(struct mddev *mddev) dump_zones(mddev); - ret = md_integrity_register(mddev); - if (ret) - free_conf(mddev, conf); - - return ret; + return md_integrity_register(mddev); } /* -- 2.39.2
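The ownership rule the fix restores is simple: the ->free callback is the only place that releases conf, so a failing setup path must return the error and let the core call ->free. Below is a tiny sketch of that single-owner pattern with purely illustrative names, not the md personality API.

```c
/*
 * Minimal model of the single-owner teardown pattern behind this fix:
 * only the ->free callback releases conf, so a failing run/setup path
 * must not free it as well (that was the double free).
 */
#include <stdlib.h>

struct conf { int *strip_zone; };

static void personality_free(struct conf *c)
{
	/* the one and only place conf is released */
	free(c->strip_zone);
	free(c);
}

static int personality_run(struct conf *c, int register_ok)
{
	(void)c;
	if (!register_ok)
		return -1;      /* do NOT free c: the core calls ->free */
	return 0;
}

int main(void)
{
	struct conf *c = calloc(1, sizeof(*c));

	if (!c)
		return 1;
	if (personality_run(c, 0) != 0)
		personality_free(c);    /* core-equivalent cleanup, once */
	return 0;
}
```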
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 0476d09c36a845757dd9fd1c80fbbf45b0faeb3c category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Currently there are lots of flags with the same confusing prefix "MD_RECOVERY_", and there are two main types of flags: sync thread running status, for which I prefer the prefix "SYNC_THREAD_", and sync thread action, for which I prefer the prefix "SYNC_ACTION_". For now, rearrange the flags and update the comments to improve code readability; there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 52 ++++++++++++++++++++++++++++++++++++------------- 1 file changed, 38 insertions(+), 14 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 0fd446d46dc9..ca8fc696f114 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -552,22 +552,46 @@ struct mddev { }; enum recovery_flags { + /* flags for sync thread running status */ + + /* + * set when one of sync action is set and new sync thread need to be + * registered, or just add/remove spares from conf. + */ + MD_RECOVERY_NEEDED, + /* sync thread is running, or about to be started */ + MD_RECOVERY_RUNNING, + /* sync thread needs to be aborted for some reason */ + MD_RECOVERY_INTR, + /* sync thread is done and is waiting to be unregistered */ + MD_RECOVERY_DONE, + /* running sync thread must abort immediately, and not restart */ + MD_RECOVERY_FROZEN, + /* waiting for pers->start() to finish */ + MD_RECOVERY_WAIT, + /* interrupted because io-error */ + MD_RECOVERY_ERROR, + + /* flags determines sync action */ + + /* if just this flag is set, action is resync. */ + MD_RECOVERY_SYNC, + /* + * paired with MD_RECOVERY_SYNC, if MD_RECOVERY_CHECK is not set, + * action is repair, means user requested resync. + */ + MD_RECOVERY_REQUESTED, /* - * If neither SYNC or RESHAPE are set, then it is a recovery. + * paired with MD_RECOVERY_SYNC and MD_RECOVERY_REQUESTED, action is + * check. */ - MD_RECOVERY_RUNNING, /* a thread is running, or about to be started */ - MD_RECOVERY_SYNC, /* actually doing a resync, not a recovery */ - MD_RECOVERY_RECOVER, /* doing recovery, or need to try it. */ - MD_RECOVERY_INTR, /* resync needs to be aborted for some reason */ - MD_RECOVERY_DONE, /* thread is done and is waiting to be reaped */ - MD_RECOVERY_NEEDED, /* we might need to start a resync/recover */ - MD_RECOVERY_REQUESTED, /* user-space has requested a sync (used with SYNC) */ - MD_RECOVERY_CHECK, /* user-space request for check-only, no repair */ - MD_RECOVERY_RESHAPE, /* A reshape is happening */ - MD_RECOVERY_FROZEN, /* User request to abort, and not restart, any action */ - MD_RECOVERY_ERROR, /* sync-action interrupted because io-error */ - MD_RECOVERY_WAIT, /* waiting for pers->start() to finish */ - MD_RESYNCING_REMOTE, /* remote node is running resync thread */ + MD_RECOVERY_CHECK, + /* recovery, or need to try it */ + MD_RECOVERY_RECOVER, + /* reshape */ + MD_RECOVERY_RESHAPE, + /* remote node is running resync thread */ + MD_RESYNCING_REMOTE, }; enum md_ro_state { -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit a85aa09da2f2773c685310666ef09f935ff68a45 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- In order to make code related to sync_thread cleaner in following patches, add detailed comments about each sync action, and also prepare to remove the related recovery_flags in the future. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index ca8fc696f114..058e3ca5f457 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -34,6 +34,61 @@ */ #define MD_FAILFAST (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT) +/* Status of sync thread. */ +enum sync_action { + /* + * Represent by MD_RECOVERY_SYNC, start when: + * 1) after assemble, sync data from first rdev to other copies, this + * must be done first before other sync actions and will only execute + * once; + * 2) resize the array(notice that this is not reshape), sync data for + * the new range; + */ + ACTION_RESYNC, + /* + * Represent by MD_RECOVERY_RECOVER, start when: + * 1) for new replacement, sync data based on the replace rdev or + * available copies from other rdev; + * 2) for new member disk while the array is degraded, sync data from + * other rdev; + * 3) reassemble after power failure or re-add a hot removed rdev, sync + * data from first rdev to other copies based on bitmap; + */ + ACTION_RECOVER, + /* + * Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED | + * MD_RECOVERY_CHECK, start when user echo "check" to sysfs api + * sync_action, used to check if data copies from differenct rdev are + * the same. The number of mismatch sectors will be exported to user + * by sysfs api mismatch_cnt; + */ + ACTION_CHECK, + /* + * Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED, start when + * user echo "repair" to sysfs api sync_action, usually paired with + * ACTION_CHECK, used to force syncing data once user found that there + * are inconsistent data, + */ + ACTION_REPAIR, + /* + * Represent by MD_RECOVERY_RESHAPE, start when new member disk is added + * to the conf, notice that this is different from spares or + * replacement; + */ + ACTION_RESHAPE, + /* + * Represent by MD_RECOVERY_FROZEN, can be set by sysfs api sync_action + * or internal usage like setting the array read-only, will forbid above + * actions. + */ + ACTION_FROZEN, + /* + * All above actions don't match. + */ + ACTION_IDLE, + NR_SYNC_ACTIONS, +}; + /* * The struct embedded in rdev is used to serialize IO. */ @@ -572,7 +627,7 @@ enum recovery_flags { /* interrupted because io-error */ MD_RECOVERY_ERROR, - /* flags determines sync action */ + /* flags determines sync action, see details in enum sync_action */ /* if just this flag is set, action is resync. */ MD_RECOVERY_SYNC, -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit e792a4c2156a392d8126bf0496f74407a21a8824 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The new helpers will get the current sync_action of the array; they will be used in later patches to make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 3 ++ drivers/md/md.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 82 insertions(+) diff --git a/drivers/md/md.h b/drivers/md/md.h index 058e3ca5f457..d5652f2b2691 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -870,6 +870,9 @@ extern void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **t extern void md_wakeup_thread(struct md_thread __rcu *thread); extern void md_check_recovery(struct mddev *mddev); extern void md_reap_sync_thread(struct mddev *mddev); +extern enum sync_action md_sync_action(struct mddev *mddev); +extern enum sync_action md_sync_action_by_name(const char *page); +extern const char *md_sync_action_name(enum sync_action action); extern void md_write_start(struct mddev *mddev, struct bio *bi); extern void md_write_inc(struct mddev *mddev, struct bio *bi); extern void md_write_end(struct mddev *mddev); diff --git a/drivers/md/md.c b/drivers/md/md.c index a4e824ea57e2..be7a16817d3c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -69,6 +69,16 @@ #include "md-bitmap.h" #include "md-cluster.h" +static const char *action_name[NR_SYNC_ACTIONS] = { + [ACTION_RESYNC] = "resync", + [ACTION_RECOVER] = "recover", + [ACTION_CHECK] = "check", + [ACTION_REPAIR] = "repair", + [ACTION_RESHAPE] = "reshape", + [ACTION_FROZEN] = "frozen", + [ACTION_IDLE] = "idle", +}; + /* pers_list is a list of registered personalities protected by pers_lock. */ static LIST_HEAD(pers_list); static DEFINE_SPINLOCK(pers_lock); @@ -4900,6 +4910,75 @@ metadata_store(struct mddev *mddev, const char *buf, size_t len) static struct md_sysfs_entry md_metadata = __ATTR_PREALLOC(metadata_version, S_IRUGO|S_IWUSR, metadata_show, metadata_store); +enum sync_action md_sync_action(struct mddev *mddev) +{ + unsigned long recovery = mddev->recovery; + + /* + * frozen has the highest priority, means running sync_thread will be + * stopped immediately, and no new sync_thread can start. + */ + if (test_bit(MD_RECOVERY_FROZEN, &recovery)) + return ACTION_FROZEN; + + /* + * read-only array can't register sync_thread, and it can only + * add/remove spares. + */ + if (!md_is_rdwr(mddev)) + return ACTION_IDLE; + + /* + * idle means no sync_thread is running, and no new sync_thread is + * requested. + */ + if (!test_bit(MD_RECOVERY_RUNNING, &recovery) && + !test_bit(MD_RECOVERY_NEEDED, &recovery)) + return ACTION_IDLE; + + if (test_bit(MD_RECOVERY_RESHAPE, &recovery) || + mddev->reshape_position != MaxSector) + return ACTION_RESHAPE; + + if (test_bit(MD_RECOVERY_RECOVER, &recovery)) + return ACTION_RECOVER; + + if (test_bit(MD_RECOVERY_SYNC, &recovery)) { + /* + * MD_RECOVERY_CHECK must be paired with + * MD_RECOVERY_REQUESTED.
+ */ + if (test_bit(MD_RECOVERY_CHECK, &recovery)) + return ACTION_CHECK; + if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) + return ACTION_REPAIR; + return ACTION_RESYNC; + } + + /* + * MD_RECOVERY_NEEDED or MD_RECOVERY_RUNNING is set, however, no + * sync_action is specified. + */ + return ACTION_IDLE; +} + +enum sync_action md_sync_action_by_name(const char *page) +{ + enum sync_action action; + + for (action = 0; action < NR_SYNC_ACTIONS; ++action) { + if (cmd_match(page, action_name[action])) + return action; + } + + return NR_SYNC_ACTIONS; +} + +const char *md_sync_action_name(enum sync_action action) +{ + return action_name[action]; +} + static ssize_t action_show(struct mddev *mddev, char *page) { -- 2.39.2
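A compact userspace model of the flag-to-action priority that md_sync_action() implements may help. It is a simplification (the real helper also consults reshape_position and the read-write state of the array), and all names below are illustrative rather than kernel APIs.

```c
/*
 * Userspace sketch of deriving a sync_action from recovery flag bits,
 * mirroring the priority order in md_sync_action(): frozen first, then
 * reshape, recover, and the sync/check/repair family.
 */
#include <stdio.h>

enum action { IDLE, RESYNC, RECOVER, CHECK, REPAIR, RESHAPE, FROZEN };

#define F_FROZEN    (1u << 0)
#define F_RUNNING   (1u << 1)
#define F_NEEDED    (1u << 2)
#define F_RESHAPE   (1u << 3)
#define F_RECOVER   (1u << 4)
#define F_SYNC      (1u << 5)
#define F_REQUESTED (1u << 6)
#define F_CHECK     (1u << 7)

static enum action sync_action(unsigned recovery, int rdwr)
{
	if (recovery & F_FROZEN)
		return FROZEN;
	if (!rdwr)
		return IDLE;     /* read-only arrays only add/remove spares */
	if (!(recovery & (F_RUNNING | F_NEEDED)))
		return IDLE;
	if (recovery & F_RESHAPE)
		return RESHAPE;
	if (recovery & F_RECOVER)
		return RECOVER;
	if (recovery & F_SYNC) {
		if (recovery & F_CHECK)
			return CHECK;    /* CHECK is paired with REQUESTED */
		if (recovery & F_REQUESTED)
			return REPAIR;
		return RESYNC;
	}
	return IDLE;
}

int main(void)
{
	static const char *names[] = { "idle", "resync", "recover",
				       "check", "repair", "reshape", "frozen" };

	printf("%s\n", names[sync_action(F_RUNNING | F_SYNC | F_REQUESTED, 1)]);
	return 0;   /* prints "repair" */
}
```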
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 207c5656c33d56b3759d0876b68fa56cb56e5c51 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There are no functional changes, just to make code cleaner and prepare for following refactor. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 65 +++++++++++++++++++++++++++++++------------------ 1 file changed, 41 insertions(+), 24 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index be7a16817d3c..08087504905b 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5103,6 +5103,45 @@ static void frozen_sync_thread(struct mddev *mddev) mutex_unlock(&mddev->sync_mutex); } +static int mddev_start_reshape(struct mddev *mddev) +{ + int ret; + + if (mddev->pers->start_reshape == NULL) + return -EINVAL; + + ret = mddev_lock(mddev); + if (ret) + return ret; + + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { + mddev_unlock(mddev); + return -EBUSY; + } + + if (mddev->reshape_position == MaxSector || + mddev->pers->check_reshape == NULL || + mddev->pers->check_reshape(mddev)) { + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + ret = mddev->pers->start_reshape(mddev); + if (ret) { + mddev_unlock(mddev); + return ret; + } + } else { + /* + * If reshape is still in progress, and md_check_recovery() can + * continue to reshape, don't restart reshape because data can + * be corrupted for raid456. + */ + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + } + + mddev_unlock(mddev); + sysfs_notify_dirent_safe(mddev->sysfs_degraded); + return 0; +} + static ssize_t action_store(struct mddev *mddev, const char *page, size_t len) { @@ -5122,32 +5161,10 @@ action_store(struct mddev *mddev, const char *page, size_t len) clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); } else if (cmd_match(page, "reshape")) { - int err; - if (mddev->pers->start_reshape == NULL) - return -EINVAL; - err = mddev_lock(mddev); - if (!err) { - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { - err = -EBUSY; - } else if (mddev->reshape_position == MaxSector || - mddev->pers->check_reshape == NULL || - mddev->pers->check_reshape(mddev)) { - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - err = mddev->pers->start_reshape(mddev); - } else { - /* - * If reshape is still in progress, and - * md_check_recovery() can continue to reshape, - * don't restart reshape because data can be - * corrupted for raid456. - */ - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - } - mddev_unlock(mddev); - } + int err = mddev_start_reshape(mddev); + if (err) return err; - sysfs_notify_dirent_safe(mddev->sysfs_degraded); } else { if (cmd_match(page, "check")) set_bit(MD_RECOVERY_CHECK, &mddev->recovery); -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit c8ecfe680c371db2a6d125de5d6bc2398950e9cf category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Get rid of the extremely long if-else-if chain and make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 94 +++++++++++++++++++++++++++---------------------- 1 file changed, 52 insertions(+), 42 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 08087504905b..7c2f7d1320d6 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4982,27 +4982,9 @@ const char *md_sync_action_name(enum sync_action action) static ssize_t action_show(struct mddev *mddev, char *page) { - char *type = "idle"; - unsigned long recovery = mddev->recovery; - if (test_bit(MD_RECOVERY_FROZEN, &recovery)) - type = "frozen"; - else if (test_bit(MD_RECOVERY_RUNNING, &recovery) || - (md_is_rdwr(mddev) && test_bit(MD_RECOVERY_NEEDED, &recovery))) { - if (test_bit(MD_RECOVERY_RESHAPE, &recovery)) - type = "reshape"; - else if (test_bit(MD_RECOVERY_SYNC, &recovery)) { - if (!test_bit(MD_RECOVERY_REQUESTED, &recovery)) - type = "resync"; - else if (test_bit(MD_RECOVERY_CHECK, &recovery)) - type = "check"; - else - type = "repair"; - } else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) - type = "recover"; - else if (mddev->reshape_position != MaxSector) - type = "reshape"; - } - return sprintf(page, "%s\n", type); + enum sync_action action = md_sync_action(mddev); + + return sprintf(page, "%s\n", md_sync_action_name(action)); } /** @@ -5145,35 +5127,63 @@ static int mddev_start_reshape(struct mddev *mddev) static ssize_t action_store(struct mddev *mddev, const char *page, size_t len) { + int ret; + enum sync_action action; + if (!mddev->pers || !mddev->pers->sync_request) return -EINVAL; + action = md_sync_action_by_name(page); - if (cmd_match(page, "idle")) - idle_sync_thread(mddev); - else if (cmd_match(page, "frozen")) - frozen_sync_thread(mddev); - else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) - return -EBUSY; - else if (cmd_match(page, "resync")) - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - else if (cmd_match(page, "recover")) { - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); - } else if (cmd_match(page, "reshape")) { - int err = mddev_start_reshape(mddev); - - if (err) - return err; + /* TODO: mdadm rely on "idle" to start sync_thread.
*/ + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { + switch (action) { + case ACTION_FROZEN: + frozen_sync_thread(mddev); + return len; + case ACTION_IDLE: + idle_sync_thread(mddev); + break; + case ACTION_RESHAPE: + case ACTION_RECOVER: + case ACTION_CHECK: + case ACTION_REPAIR: + case ACTION_RESYNC: + return -EBUSY; + default: + return -EINVAL; + } } else { - if (cmd_match(page, "check")) + switch (action) { + case ACTION_FROZEN: + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + return len; + case ACTION_RESHAPE: + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + ret = mddev_start_reshape(mddev); + if (ret) + return ret; + break; + case ACTION_RECOVER: + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); + break; + case ACTION_CHECK: set_bit(MD_RECOVERY_CHECK, &mddev->recovery); - else if (!cmd_match(page, "repair")) + fallthrough; + case ACTION_REPAIR: + set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); + set_bit(MD_RECOVERY_SYNC, &mddev->recovery); + fallthrough; + case ACTION_RESYNC: + case ACTION_IDLE: + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); + break; + default: return -EINVAL; - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); - set_bit(MD_RECOVERY_SYNC, &mddev->recovery); + } } + if (mddev->ro == MD_AUTO_READ) { /* A write to sync_action is enough to justify * canceling read-auto mode -- 2.39.2
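The reworked action_store() is essentially two dispatch tables keyed on whether a sync thread is currently running. A hedged sketch of that shape follows, with illustrative names and the flag manipulation reduced to comments; it is a model of the structure, not the kernel function.

```c
/*
 * Sketch of the two-way dispatch in the reworked action_store():
 * while a sync thread is running only "frozen" and "idle" are honored
 * and everything else is -EBUSY; otherwise the request programs the
 * recovery flags for the next sync thread.
 */
#include <errno.h>

enum action { IDLE, RESYNC, RECOVER, CHECK, REPAIR, RESHAPE, FROZEN,
	      NR_ACTIONS };

static int store_action(int sync_running, enum action a)
{
	if (sync_running) {
		switch (a) {
		case FROZEN:    /* stop the running thread, keep FROZEN */
		case IDLE:      /* stop the running thread, clear FROZEN */
			return 0;
		case RESHAPE: case RECOVER: case CHECK:
		case REPAIR: case RESYNC:
			return -EBUSY;
		default:
			return -EINVAL;
		}
	}
	switch (a) {
	case FROZEN:    /* just set the FROZEN bit */
	case RESHAPE:   /* clear FROZEN, try to start a reshape */
	case RECOVER:   /* clear FROZEN, set RECOVER */
	case CHECK:     /* set CHECK, then the REPAIR flags below */
	case REPAIR:    /* set REQUESTED | SYNC */
	case RESYNC:
	case IDLE:      /* clear FROZEN */
		return 0;
	default:
		return -EINVAL;
	}
}

int main(void)
{
	return store_action(1, RESYNC) == -EBUSY ? 0 : 1;
}
```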
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit df79234bdc3f441bec99dfc8199b6f2c673203ed category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Caller will always set MD_RECOVERY_FROZEN if check_seq is true, and always clear MD_RECOVERY_FROZEN if check_seq is false, hence replace the parameter with test_bit() to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-7-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 7c2f7d1320d6..18648c89d785 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4993,15 +4993,10 @@ action_show(struct mddev *mddev, char *page) * @locked: if set, reconfig_mutex will still be held after this function * return; if not set, reconfig_mutex will be released after this * function return. - * @check_seq: if set, only wait for curent running sync_thread to stop, noted - * that new sync_thread can still start. */ -static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq) +static void stop_sync_thread(struct mddev *mddev, bool locked) { - int sync_seq; - - if (check_seq) - sync_seq = atomic_read(&mddev->sync_seq); + int sync_seq = atomic_read(&mddev->sync_seq); if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { if (!locked) @@ -5022,7 +5017,8 @@ static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq) wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || - (check_seq && sync_seq != atomic_read(&mddev->sync_seq))); + (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery) && + sync_seq != atomic_read(&mddev->sync_seq))); if (locked) mddev_lock_nointr(mddev); @@ -5033,7 +5029,7 @@ void md_idle_sync_thread(struct mddev *mddev) lockdep_assert_held(&mddev->reconfig_mutex); clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - stop_sync_thread(mddev, true, true); + stop_sync_thread(mddev, true); } EXPORT_SYMBOL_GPL(md_idle_sync_thread); @@ -5042,7 +5038,7 @@ void md_frozen_sync_thread(struct mddev *mddev) lockdep_assert_held(&mddev->reconfig_mutex); set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - stop_sync_thread(mddev, true, false); + stop_sync_thread(mddev, true); } EXPORT_SYMBOL_GPL(md_frozen_sync_thread); @@ -5067,7 +5063,7 @@ static void idle_sync_thread(struct mddev *mddev) return; } - stop_sync_thread(mddev, false, true); + stop_sync_thread(mddev, false); mutex_unlock(&mddev->sync_mutex); } @@ -5081,7 +5077,7 @@ static void frozen_sync_thread(struct mddev *mddev) return; } - stop_sync_thread(mddev, false, false); + stop_sync_thread(mddev, false); mutex_unlock(&mddev->sync_mutex); } @@ -6531,7 +6527,7 @@ void md_stop_writes(struct mddev *mddev) { mddev_lock_nointr(mddev); set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - stop_sync_thread(mddev, true, false); + stop_sync_thread(mddev, true); __md_stop_writes(mddev); mddev_unlock(mddev); } @@ -6597,7 +6593,7 @@ static int md_set_readonly(struct mddev *mddev) set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); } - stop_sync_thread(mddev, false, false); + stop_sync_thread(mddev, false); wait_event(mddev->sb_wait, !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); mddev_lock_nointr(mddev); @@ -6643,7 +6639,7 @@ static int do_md_stop(struct mddev *mddev, 
int mode) set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); } - stop_sync_thread(mddev, true, false); + stop_sync_thread(mddev, true); if (mddev->sysfs_active || test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { -- 2.39.2
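The dropped parameter reduces to a wait predicate. A minimal sketch (illustrative names, not kernel code) shows why testing MD_RECOVERY_FROZEN carries the same information check_seq did: when frozen, only a fully stopped thread satisfies the wait; otherwise an advancing sync_seq is enough, since a brand-new sync thread may legitimately start.

```c
/*
 * Sketch of the wake-up condition in stop_sync_thread() after the
 * check_seq parameter is dropped.
 */
#include <stdbool.h>

static bool stop_wait_done(bool recovery_running, bool frozen,
			   int sync_seq_at_entry, int sync_seq_now)
{
	return !recovery_running ||
	       (!frozen && sync_seq_now != sync_seq_at_entry);
}

int main(void)
{
	/* not frozen: a new thread started (seq advanced), wait is over */
	return stop_wait_done(true, false, 1, 2) ? 0 : 1;
}
```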
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 5ce10a38590c77f20d0dc706944f79e7d56a7400 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- MD_RECOVERY_RUNNING will always be set when trying to register a new sync_thread; however, if md_start_sync() turns out to do nothing, MD_RECOVERY_RUNNING will be cleared. During the race window, action_store() will return -EBUSY, which will cause some mdadm tests to fail. For example: the test 07reshape5intr will add a new disk to the array, then start reshape: mdadm /dev/md0 --add /dev/xxx mdadm --grow /dev/md0 -n 3 add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED, and during the race window, mdadm --grow will fail. Fix the problem by waiting in action_store() during the race window, and fail only if a sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-8-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 2 -- drivers/md/md.c | 85 +++++++++++++++++++------------------------------ 2 files changed, 33 insertions(+), 54 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index d5652f2b2691..260bab71fdeb 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -596,8 +596,6 @@ struct mddev { */ struct list_head deleting; - /* Used to synchronize idle and frozen for action_store() */ - struct mutex sync_mutex; /* The sequence number for sync thread */ atomic_t sync_seq; diff --git a/drivers/md/md.c b/drivers/md/md.c index 18648c89d785..276c8d552663 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -759,7 +759,6 @@ int mddev_init(struct mddev *mddev) mutex_init(&mddev->open_mutex); mutex_init(&mddev->reconfig_mutex); - mutex_init(&mddev->sync_mutex); mutex_init(&mddev->suspend_mutex); mutex_init(&mddev->bitmap_info.mutex); INIT_LIST_HEAD(&mddev->disks); @@ -5053,34 +5052,6 @@ void md_unfrozen_sync_thread(struct mddev *mddev) } EXPORT_SYMBOL_GPL(md_unfrozen_sync_thread); -static void idle_sync_thread(struct mddev *mddev) -{ - mutex_lock(&mddev->sync_mutex); - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - - if (mddev_lock(mddev)) { - mutex_unlock(&mddev->sync_mutex); - return; - } - - stop_sync_thread(mddev, false); - mutex_unlock(&mddev->sync_mutex); -} - -static void frozen_sync_thread(struct mddev *mddev) -{ - mutex_lock(&mddev->sync_mutex); - set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - - if (mddev_lock(mddev)) { - mutex_unlock(&mddev->sync_mutex); - return; - } - - stop_sync_thread(mddev, false); - mutex_unlock(&mddev->sync_mutex); -} - static int mddev_start_reshape(struct mddev *mddev) { int ret; @@ -5088,24 +5059,13 @@ static int mddev_start_reshape(struct mddev *mddev) if (mddev->pers->start_reshape == NULL) return -EINVAL; - ret = mddev_lock(mddev); - if (ret) - return ret; - - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { - mddev_unlock(mddev); - return -EBUSY; - } - if (mddev->reshape_position == MaxSector || mddev->pers->check_reshape == NULL || mddev->pers->check_reshape(mddev)) { clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); ret = mddev->pers->start_reshape(mddev); - if (ret) { - mddev_unlock(mddev); + if (ret) return ret; - } } else { /* * If reshape is still in progress, and md_check_recovery() can @@ -5115,7 +5075,6 @@ static int mddev_start_reshape(struct mddev *mddev)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); } - mddev_unlock(mddev); sysfs_notify_dirent_safe(mddev->sysfs_degraded); return 0; } @@ -5129,36 +5088,53 @@ action_store(struct mddev *mddev, const char *page, size_t len) if (!mddev->pers || !mddev->pers->sync_request) return -EINVAL; +retry: + if (work_busy(&mddev->sync_work)) + flush_work(&mddev->sync_work); + + ret = mddev_lock(mddev); + if (ret) + return ret; + + if (work_busy(&mddev->sync_work)) { + mddev_unlock(mddev); + goto retry; + } + action = md_sync_action_by_name(page); /* TODO: mdadm rely on "idle" to start sync_thread. */ if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { switch (action) { case ACTION_FROZEN: - frozen_sync_thread(mddev); - return len; + md_frozen_sync_thread(mddev); + ret = len; + goto out; case ACTION_IDLE: - idle_sync_thread(mddev); + md_idle_sync_thread(mddev); break; case ACTION_RESHAPE: case ACTION_RECOVER: case ACTION_CHECK: case ACTION_REPAIR: case ACTION_RESYNC: - return -EBUSY; + ret = -EBUSY; + goto out; default: - return -EINVAL; + ret = -EINVAL; + goto out; } } else { switch (action) { case ACTION_FROZEN: set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - return len; + ret = len; + goto out; case ACTION_RESHAPE: clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); ret = mddev_start_reshape(mddev); if (ret) - return ret; + goto out; break; case ACTION_RECOVER: clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); @@ -5176,7 +5152,8 @@ action_store(struct mddev *mddev, const char *page, size_t len) clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); break; default: - return -EINVAL; + ret = -EINVAL; + goto out; } } @@ -5184,14 +5161,18 @@ action_store(struct mddev *mddev, const char *page, size_t len) /* A write to sync_action is enough to justify * canceling read-auto mode */ - flush_work(&mddev->sync_work); mddev->ro = MD_RDWR; md_wakeup_thread(mddev->sync_thread); } + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); sysfs_notify_dirent_safe(mddev->sysfs_action); - return len; + ret = len; + +out: + mddev_unlock(mddev); + return ret; } static struct md_sysfs_entry md_scan_mode = -- 2.39.2
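The added retry loop is a flush-then-recheck pattern: flush pending deferred work, take the lock, and if more work was queued in between, drop the lock and try again. Here is a userspace sketch of the same shape (pthread-based, illustrative names; the real code uses work_busy()/flush_work() on mddev->sync_work and mddev_lock()).

```c
/*
 * Userspace model of the retry loop added to action_store(): flush any
 * pending deferred work, take the lock, and if more work was queued in
 * between, drop the lock and try again.
 */
#include <pthread.h>
#include <stdbool.h>

struct dev {
	pthread_mutex_t reconfig;   /* models mddev->reconfig_mutex */
	bool sync_work_busy;        /* models work_busy(&mddev->sync_work) */
};

static void flush_sync_work(struct dev *d)
{
	d->sync_work_busy = false;  /* models flush_work() completing it */
}

static void lock_with_no_pending_work(struct dev *d)
{
	for (;;) {
		if (d->sync_work_busy)
			flush_sync_work(d);

		pthread_mutex_lock(&d->reconfig);
		if (!d->sync_work_busy)
			return;     /* locked, and no work can sneak in */

		/* work was queued between flush and lock: retry */
		pthread_mutex_unlock(&d->reconfig);
	}
}

int main(void)
{
	struct dev d = { PTHREAD_MUTEX_INITIALIZER, true };

	lock_with_no_pending_work(&d);
	pthread_mutex_unlock(&d.reconfig);
	return 0;
}
```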
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 7d9f107a4e946bb52b7502eed9ed8f316700397e category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Make the code cleaner, and also use the action_name directly in the kernel log: - "check" instead of "data-check" - "repair" instead of "requested-resync" Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-9-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 2 +- drivers/md/md.c | 21 +++++---------------- 2 files changed, 6 insertions(+), 17 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 260bab71fdeb..fce474df49de 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -432,7 +432,7 @@ struct mddev { * when the sync thread is "frozen" (interrupted) or "idle" (stopped * or finished). It is overwritten when a new sync operation is begun. */ - char *last_sync_action; + const char *last_sync_action; sector_t curr_resync; /* last block scheduled */ /* As resync requests can complete out of order, we cannot easily track * how much resync has been completed. So we occasionally pause until diff --git a/drivers/md/md.c b/drivers/md/md.c index 276c8d552663..1390993d00f3 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8963,7 +8963,8 @@ void md_do_sync(struct md_thread *thread) sector_t last_check; int skipped = 0; struct md_rdev *rdev; - char *desc, *action = NULL; + enum sync_action action; + const char *desc; struct blk_plug plug; int ret; @@ -8994,21 +8995,9 @@ void md_do_sync(struct md_thread *thread) goto skip; } - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { - if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) { - desc = "data-check"; - action = "check"; - } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { - desc = "requested-resync"; - action = "repair"; - } else - desc = "resync"; - } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) - desc = "reshape"; - else - desc = "recovery"; - - mddev->last_sync_action = action ?: desc; + action = md_sync_action(mddev); + desc = md_sync_action_name(action); + mddev->last_sync_action = desc; /* * Before starting a resync we must have set curr_resync to -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit d249e541887a966df37544f7c4d301cdee0f0e27 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The only difference is that "none" is removed and the initial last_sync_action will be idle. On the one hand, this value was introduced by commit c4a395514516 ("MD: Remember the last sync operation that was performed"), and the usage described in that commit message is not affected. On the other hand, last_sync_action is not used in mdadm or mdmon, nor in any of the tests that I can find. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-10-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 9 ++++----- drivers/md/dm-raid.c | 2 +- drivers/md/md.c | 7 ++++--- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index fce474df49de..b998a12f5ed2 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -426,13 +426,12 @@ struct mddev { struct md_thread __rcu *thread; /* management thread */ struct md_thread __rcu *sync_thread; /* doing resync or reconstruct */ - /* 'last_sync_action' is initialized to "none". It is set when a - * sync operation (i.e "data-check", "requested-resync", "resync", - * "recovery", or "reshape") is started. It holds this value even + /* + * Set when a sync operation is started. It holds this value even * when the sync thread is "frozen" (interrupted) or "idle" (stopped - * or finished). It is overwritten when a new sync operation is begun. + * or finished). It is overwritten when a new sync operation is begun. */ - const char *last_sync_action; + enum sync_action last_sync_action; sector_t curr_resync; /* last block scheduled */ /* As resync requests can complete out of order, we cannot easily track * how much resync has been completed. So we occasionally pause until diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index a6262348e09d..e01f1f18b817 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3542,7 +3542,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, recovery = rs->md.recovery; state = decipher_sync_action(mddev, recovery); progress = rs_get_progress(rs, recovery, state, resync_max_sectors); - resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? + resync_mismatches = mddev->last_sync_action == ACTION_CHECK ? atomic64_read(&mddev->resync_mismatches) : 0; /* HM FIXME: do we want another state char for raid0?
It shows 'D'/'A'/'-' now */ diff --git a/drivers/md/md.c b/drivers/md/md.c index 1390993d00f3..1a754c20aab4 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -774,7 +774,7 @@ int mddev_init(struct mddev *mddev) init_waitqueue_head(&mddev->recovery_wait); mddev->reshape_position = MaxSector; mddev->reshape_backwards = 0; - mddev->last_sync_action = "none"; + mddev->last_sync_action = ACTION_IDLE; mddev->resync_min = 0; mddev->resync_max = MaxSector; mddev->level = LEVEL_NONE; @@ -5181,7 +5181,8 @@ __ATTR_PREALLOC(sync_action, S_IRUGO|S_IWUSR, action_show, action_store); static ssize_t last_sync_action_show(struct mddev *mddev, char *page) { - return sprintf(page, "%s\n", mddev->last_sync_action); + return sprintf(page, "%s\n", + md_sync_action_name(mddev->last_sync_action)); } static struct md_sysfs_entry md_last_scan_mode = __ATTR_RO(last_sync_action); @@ -8997,7 +8998,7 @@ void md_do_sync(struct md_thread *thread) action = md_sync_action(mddev); desc = md_sync_action_name(action); - mddev->last_sync_action = desc; + mddev->last_sync_action = action; /* * Before starting a resync we must have set curr_resync to -- 2.39.2
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit bbf2076277b137f03624259da0a0369af88f3a68 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Make the code cleaner by replacing if-else-if with switch; it's more obvious now what is done for each sync_action. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-11-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 123 ++++++++++++++++++++++++++++-------------------- 1 file changed, 73 insertions(+), 50 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 1a754c20aab4..e53aaa66bc3e 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8948,6 +8948,77 @@ void md_allow_write(struct mddev *mddev) } EXPORT_SYMBOL_GPL(md_allow_write); +static sector_t md_sync_max_sectors(struct mddev *mddev, + enum sync_action action) +{ + switch (action) { + case ACTION_RESYNC: + case ACTION_CHECK: + case ACTION_REPAIR: + atomic64_set(&mddev->resync_mismatches, 0); + fallthrough; + case ACTION_RESHAPE: + return mddev->resync_max_sectors; + case ACTION_RECOVER: + return mddev->dev_sectors; + default: + return 0; + } +} + +static sector_t md_sync_position(struct mddev *mddev, enum sync_action action) +{ + sector_t start = 0; + struct md_rdev *rdev; + + switch (action) { + case ACTION_CHECK: + case ACTION_REPAIR: + return mddev->resync_min; + case ACTION_RESYNC: + if (!mddev->bitmap) + return mddev->recovery_cp; + return 0; + case ACTION_RESHAPE: + /* + * If the original node aborts reshaping then we continue the + * reshaping, so set again to avoid restart reshape from the + * first beginning + */ + if (mddev_is_clustered(mddev) && + mddev->reshape_position != MaxSector) + return mddev->reshape_position; + return 0; + case ACTION_RECOVER: + start = MaxSector; + rcu_read_lock(); + rdev_for_each_rcu(rdev, mddev) + if (rdev->raid_disk >= 0 && + !test_bit(Journal, &rdev->flags) && + !test_bit(Faulty, &rdev->flags) && + !test_bit(In_sync, &rdev->flags) && + rdev->recovery_offset < start) + start = rdev->recovery_offset; + rcu_read_unlock(); + + /* If there is a bitmap, we need to make sure all + * writes that started before we added a spare + * complete before we start doing a recovery. + * Otherwise the write might complete and (via + * bitmap_endwrite) set a bit in the bitmap after the + * recovery has checked that bit and skipped that + * region.
+ */ + if (mddev->bitmap) { + mddev->pers->quiesce(mddev, 1); + mddev->pers->quiesce(mddev, 0); + } + return start; + default: + return MaxSector; + } +} + #define SYNC_MARKS 10 #define SYNC_MARK_STEP (3*HZ) #define UPDATE_FREQUENCY (5*60*HZ) @@ -9066,56 +9137,8 @@ void md_do_sync(struct md_thread *thread) spin_unlock(&all_mddevs_lock); } while (mddev->curr_resync < MD_RESYNC_DELAYED); - j = 0; - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { - /* resync follows the size requested by the personality, - * which defaults to physical size, but can be virtual size - */ - max_sectors = mddev->resync_max_sectors; - atomic64_set(&mddev->resync_mismatches, 0); - /* we don't use the checkpoint if there's a bitmap */ - if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) - j = mddev->resync_min; - else if (!mddev->bitmap) - j = mddev->recovery_cp; - - } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) { - max_sectors = mddev->resync_max_sectors; - /* - * If the original node aborts reshaping then we continue the - * reshaping, so set j again to avoid restart reshape from the - * first beginning - */ - if (mddev_is_clustered(mddev) && - mddev->reshape_position != MaxSector) - j = mddev->reshape_position; - } else { - /* recovery follows the physical size of devices */ - max_sectors = mddev->dev_sectors; - j = MaxSector; - rcu_read_lock(); - rdev_for_each_rcu(rdev, mddev) - if (rdev->raid_disk >= 0 && - !test_bit(Journal, &rdev->flags) && - !test_bit(Faulty, &rdev->flags) && - !test_bit(In_sync, &rdev->flags) && - rdev->recovery_offset < j) - j = rdev->recovery_offset; - rcu_read_unlock(); - - /* If there is a bitmap, we need to make sure all - * writes that started before we added a spare - * complete before we start doing a recovery. - * Otherwise the write might complete and (via - * bitmap_endwrite) set a bit in the bitmap after the - * recovery has checked that bit and skipped that - * region. - */ - if (mddev->bitmap) { - mddev->pers->quiesce(mddev, 1); - mddev->pers->quiesce(mddev, 0); - } - } + max_sectors = md_sync_max_sectors(mddev, action); + j = md_sync_position(mddev, action); pr_info("md: %s of RAID array %s\n", desc, mdname(mddev)); pr_debug("md: minimum _guaranteed_ speed: %d KB/sec/disk.\n", speed_min(mddev)); -- 2.39.2
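A simplified model of the per-action start position that md_sync_position() computes may help: check/repair honor resync_min, resync uses the checkpoint unless a bitmap exists, reshape resumes from the saved position, and recover starts at the smallest unrecovered offset across member disks. The sketch below omits the bitmap quiesce and cluster handling, and all names are illustrative.

```c
/*
 * Simplified userspace model of choosing the starting sector for each
 * sync action, following the shape of md_sync_position().
 */
#include <stdint.h>

#define MAX_SECTOR UINT64_MAX

enum action { RESYNC, RECOVER, CHECK, REPAIR, RESHAPE };

static uint64_t sync_position(enum action a, uint64_t resync_min,
			      uint64_t recovery_cp, int have_bitmap,
			      uint64_t reshape_pos,
			      const uint64_t *recovery_offsets, int nr)
{
	uint64_t start;
	int i;

	switch (a) {
	case CHECK:
	case REPAIR:
		return resync_min;
	case RESYNC:
		/* the checkpoint is not used when a bitmap exists */
		return have_bitmap ? 0 : recovery_cp;
	case RESHAPE:
		/* resume an interrupted reshape instead of restarting */
		return reshape_pos != MAX_SECTOR ? reshape_pos : 0;
	case RECOVER:
		/* smallest unrecovered offset across member disks */
		start = MAX_SECTOR;
		for (i = 0; i < nr; i++)
			if (recovery_offsets[i] < start)
				start = recovery_offsets[i];
		return start;
	}
	return MAX_SECTOR;
}

int main(void)
{
	const uint64_t offs[2] = { 10, 5 };

	return sync_position(RECOVER, 0, 0, 0, MAX_SECTOR, offs, 2) == 5
		? 0 : 1;
}
```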
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit bc49694a9e8fd1b36bca47d9a54ec8da8e39012f category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- For different sync_actions, the sync_thread will use different max_sectors; see details in md_sync_max_sectors(). Currently both md_do_sync() and pers->sync_request() have to compute the same max_sectors in each iteration. Hence pass max_sectors in to pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 3 ++- drivers/md/md.c | 5 +++-- drivers/md/raid1.c | 5 ++--- drivers/md/raid10.c | 8 ++------ drivers/md/raid5.c | 3 +-- 5 files changed, 10 insertions(+), 14 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index b998a12f5ed2..7f22e961db67 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -730,7 +730,8 @@ struct md_personality int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev); int (*spare_active) (struct mddev *mddev); - sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped); + sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, + sector_t max_sector, int *skipped); int (*resize) (struct mddev *mddev, sector_t sectors); sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks); int (*check_reshape) (struct mddev *mddev); diff --git a/drivers/md/md.c b/drivers/md/md.c index e53aaa66bc3e..abef4bab3375 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9220,7 +9220,8 @@ void md_do_sync(struct md_thread *thread) if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) break; - sectors = mddev->pers->sync_request(mddev, j, &skipped); + sectors = mddev->pers->sync_request(mddev, j, max_sectors, + &skipped); if (sectors == 0) { set_bit(MD_RECOVERY_INTR, &mddev->recovery); break; @@ -9310,7 +9311,7 @@ void md_do_sync(struct md_thread *thread) mddev->curr_resync_completed = mddev->curr_resync; sysfs_notify_dirent_safe(mddev->sysfs_completed); } - mddev->pers->sync_request(mddev, max_sectors, &skipped); + mddev->pers->sync_request(mddev, max_sectors, max_sectors, &skipped); if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && mddev->curr_resync > MD_RESYNC_ACTIVE) { diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index c463a1873832..1372efc0d15b 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2737,12 +2737,12 @@ static struct r1bio *raid1_alloc_init_r1buf(struct r1conf *conf) */ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr, - int *skipped) + sector_t max_sector, int *skipped) { struct r1conf *conf = mddev->private; struct r1bio *r1_bio; struct bio *bio; - sector_t max_sector, nr_sectors; + sector_t nr_sectors; int disk = -1; int i; int wonly = -1; @@ -2758,7 +2758,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr, if (init_resync(conf)) return 0; - max_sector = mddev->dev_sectors; if (sector_nr >= max_sector) { /* If we aborted, we need to abort the * sync on the 'current' bitmap chunk (there will diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 23cc3c8a345f..db24ef1fdbe7 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3118,12 +3118,12 @@ static void
raid10_set_cluster_sync_high(struct r10conf *conf) */ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, - int *skipped) + sector_t max_sector, int *skipped) { struct r10conf *conf = mddev->private; struct r10bio *r10_bio; struct bio *biolist = NULL, *bio; - sector_t max_sector, nr_sectors; + sector_t nr_sectors; int i; int max_sync; sector_t sync_blocks; @@ -3153,10 +3153,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, return 0; skipped: - max_sector = mddev->dev_sectors; - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) || - test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) - max_sector = mddev->resync_max_sectors; if (sector_nr >= max_sector) { conf->cluster_sync_low = 0; conf->cluster_sync_high = 0; diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c832a1528103..51821e6eb27e 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6413,11 +6413,10 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk } static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_nr, - int *skipped) + sector_t max_sector, int *skipped) { struct r5conf *conf = mddev->private; struct stripe_head *sh; - sector_t max_sector = mddev->dev_sectors; sector_t sync_blocks; int still_degraded = 0; int i; -- 2.39.2
From: Benjamin Marzinski <bmarzins@redhat.com> mainline inclusion from mainline-v6.11-rc1 commit 3199a34bfaf7561410e0be1e33a61eba870768fc category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- dm-raid devices will occasionally trigger the following warning when being resumed after a table load because MD_RECOVERY_RUNNING is set: WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid] The failing check is: WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); This check is designed to make sure that the sync thread isn't registered, but md_check_recovery can set MD_RECOVERY_RUNNING without the sync_thread ever getting registered. Instead of checking if MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL. Fixes: 16c4770c75b1 ("dm-raid: really frozen sync_thread during suspend") Suggested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/dm-raid.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index e01f1f18b817..5b1aa2373e8d 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -4093,10 +4093,11 @@ static void raid_resume(struct dm_target *ti) if (mddev->delta_disks < 0) rs_set_capacity(rs); + mddev_lock_nointr(mddev); WARN_ON_ONCE(!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)); - WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); + WARN_ON_ONCE(rcu_dereference_protected(mddev->sync_thread, + lockdep_is_held(&mddev->reconfig_mutex))); clear_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags); - mddev_lock_nointr(mddev); mddev->ro = 0; mddev->in_sync = 0; md_unfrozen_sync_thread(mddev); -- 2.39.2
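The essence of the fix in miniature: the running flag can be set before any thread exists, so the assertion must test the thread pointer itself, read under the mutex that protects it. A hedged userspace sketch with illustrative names, not the dm-raid code:

```c
/*
 * Sketch of asserting "no sync thread is registered" by checking the
 * pointer under the lock that protects it, rather than a running flag
 * that may be set speculatively before any thread exists.
 */
#include <assert.h>
#include <pthread.h>

struct dev {
	pthread_mutex_t reconfig;   /* models mddev->reconfig_mutex */
	void *sync_thread;          /* NULL when no thread is registered */
	unsigned long recovery;     /* flags; may claim "running" early */
};

static void assert_no_sync_thread(struct dev *d)
{
	pthread_mutex_lock(&d->reconfig);
	/* not a flag test on d->recovery: the flag can be a false alarm */
	assert(d->sync_thread == NULL);
	pthread_mutex_unlock(&d->reconfig);
}

int main(void)
{
	/* flag claims running, but no thread exists: no warning fires */
	struct dev d = { PTHREAD_MUTEX_INITIALIZER, NULL, 1UL };

	assert_no_sync_thread(&d);
	return 0;
}
```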
From: Yu Kuai <yukuai3@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 2314c2e3a70521f055dd011245dccf6fd97c7ee0 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- As commit ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") explains, rcu protection can be removed; however, three places were left unconverted. There won't be any real problems, but the following warnings are reported: drivers/md/raid5.c:8071:24: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:8071:24: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:8071:24: struct md_rdev * drivers/md/raid5.c:7569:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7569:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7569:25: struct md_rdev * drivers/md/raid5.c:7573:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7573:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7573:25: struct md_rdev * Fixes: ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240615085143.1648223-1-yukuai1@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/raid5.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 51821e6eb27e..c752291f195b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -155,7 +155,7 @@ static int raid6_idx_to_slot(int idx, struct stripe_head *sh, return slot; } -static void print_raid5_conf (struct r5conf *conf); +static void print_raid5_conf(struct r5conf *conf); static int stripe_operations_active(struct stripe_head *sh) { @@ -7516,11 +7516,11 @@ static struct r5conf *setup_conf(struct mddev *mddev) if (test_bit(Replacement, &rdev->flags)) { if (disk->replacement) goto abort; - RCU_INIT_POINTER(disk->replacement, rdev); + disk->replacement = rdev; } else { if (disk->rdev) goto abort; - RCU_INIT_POINTER(disk->rdev, rdev); + disk->rdev = rdev; } if (test_bit(In_sync, &rdev->flags)) { @@ -8002,7 +8002,7 @@ static void raid5_status(struct seq_file *seq, struct mddev *mddev) seq_printf (seq, "]"); } -static void print_raid5_conf (struct r5conf *conf) +static void print_raid5_conf(struct r5conf *conf) { struct md_rdev *rdev; int i; @@ -8016,15 +8016,13 @@ static void print_raid5_conf (struct r5conf *conf) conf->raid_disks, conf->raid_disks - conf->mddev->degraded); - rcu_read_lock(); for (i = 0; i < conf->raid_disks; i++) { - rdev = rcu_dereference(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; if (rdev) pr_debug(" disk %d, o:%d, dev:%pg\n", i, !test_bit(Faulty, &rdev->flags), rdev->bdev); } - rcu_read_unlock(); } static int raid5_spare_active(struct mddev *mddev) -- 2.39.2
From: Yang Li <yang.lee@linux.alibaba.com> mainline inclusion from mainline-v6.11-rc1 commit ae720670b9fc5ef3588efd5b95e6a0f59a36dec0 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- ./drivers/md/md.c:630:21-22: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9344 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240618010759.85416-1-yang.lee@linux.alibaba.com Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index abef4bab3375..11bdb3c588fa 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -627,7 +627,7 @@ static void md_submit_flush_data(struct work_struct *ws) * always is 0, make_request() will not be called here. */ if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio))) - bio_io_error(bio);; + bio_io_error(bio); } /* The pair is percpu_ref_get() from md_flush_request() */ -- 2.39.2
From: Christophe JAILLET <christophe.jaillet@wanadoo.fr> mainline inclusion from mainline-v6.11-rc1 commit 1f4a72ff00cafa74b43b0c8a37573c78f86ed1a8 category: bugfix bugzilla: 190197 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- 'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so it increases overall security. On x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.171917385... Signed-off-by: Li Nan <linan122@huawei.com> --- drivers/md/md.h | 4 ++-- drivers/md/md-cluster.c | 2 +- drivers/md/md.c | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/md/md.h b/drivers/md/md.h index 7f22e961db67..219ca9e0ae63 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -855,7 +855,7 @@ static inline void safe_put_page(struct page *p) extern int register_md_personality(struct md_personality *p); extern int unregister_md_personality(struct md_personality *p); -extern int register_md_cluster_operations(struct md_cluster_operations *ops, +extern int register_md_cluster_operations(const struct md_cluster_operations *ops, struct module *module); extern int unregister_md_cluster_operations(void); extern int md_setup_cluster(struct mddev *mddev, int nodes); @@ -938,7 +938,7 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev) } } -extern struct md_cluster_operations *md_cluster_ops; +extern const struct md_cluster_operations *md_cluster_ops; static inline int mddev_is_clustered(struct mddev *mddev) { return mddev->cluster_info && mddev->bitmap_info.nodes > 1; } diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 69b29112377e..1249fcc18950 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -1576,7 +1576,7 @@ static int gather_bitmaps(struct md_rdev *rdev) return err; } -static struct md_cluster_operations cluster_ops = { +static const struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, .slot_number = slot_number, diff --git a/drivers/md/md.c b/drivers/md/md.c index 11bdb3c588fa..2330d1631707 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -85,7 +85,7 @@ static DEFINE_SPINLOCK(pers_lock); static const struct kobj_type md_ktype; -struct md_cluster_operations *md_cluster_ops; +const struct md_cluster_operations *md_cluster_ops; EXPORT_SYMBOL(md_cluster_ops); static struct module *md_cluster_mod; @@ -8613,7 +8613,7 @@ int unregister_md_personality(struct md_personality *p) } EXPORT_SYMBOL(unregister_md_personality); -int register_md_cluster_operations(struct md_cluster_operations *ops, +int register_md_cluster_operations(const struct md_cluster_operations *ops, struct module *module) { int ret = 0; -- 2.39.2
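The effect of constifying an ops table is easy to demonstrate in miniature: a const-qualified table of function pointers is placed in a read-only section, so it cannot be overwritten at runtime. An illustrative sketch (not the md-cluster code):

```c
/*
 * Minimal illustration of the constified operations-table pattern:
 * declaring the table 'const' lets the compiler place it in a
 * read-only section (.rodata), so the function pointers cannot be
 * overwritten at runtime.
 */
#include <stdio.h>

struct cluster_ops {
	int (*join)(int nodes);
	int (*leave)(void);
};

static int my_join(int nodes) { return nodes > 0 ? 0 : -1; }
static int my_leave(void)     { return 0; }

/* goes to .rodata; writing to it would be a compile error through a
 * const pointer, or a runtime fault through a cast */
static const struct cluster_ops ops = {
	.join  = my_join,
	.leave = my_leave,
};

int main(void)
{
	printf("join: %d\n", ops.join(2));
	return 0;
}
```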
From: Benjamin Marzinski <bmarzins@redhat.com>

mainline inclusion
from mainline-v6.11-rc1
commit 25b3a8237a03ec0b67b965b52d74862e77ef7115
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When handling an IO request, MD checks if a reshape is currently
happening, and if so, where the IO sector is in relation to the reshape
progress. MD uses conf->reshape_progress for both of these tasks. When
the reshape finishes, conf->reshape_progress is set to MaxSector. If
this occurs after MD checks if the reshape is currently happening but
before it calls ahead_of_reshape(), then ahead_of_reshape() will end up
comparing the IO sector against MaxSector. During a backwards reshape,
this will make MD think the IO sector is in the area not yet reshaped,
causing it to use the previous configuration, and map the IO to the
sector where that data was before the reshape.

This bug can be triggered by running the lvm2
lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop,
although it's very hard to reproduce.

Fix this by factoring the code that checks where the IO sector is in
relation to the reshape out to a helper called get_reshape_loc(),
which reads reshape_progress and reshape_safe while holding the
device_lock, and then rechecks if the reshape has finished before
calling ahead_of_reshape() with the saved values.

Also use the helper during the REQ_NOWAIT check to see if the location
is inside of the reshape region.

Fixes: fef9c61fdfabf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid5.c | 64 +++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c752291f195b..fd31dfa67ce7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5849,6 +5849,39 @@ static int add_all_stripe_bios(struct r5conf *conf,
 	return ret;
 }

+enum reshape_loc {
+	LOC_NO_RESHAPE,
+	LOC_AHEAD_OF_RESHAPE,
+	LOC_INSIDE_RESHAPE,
+	LOC_BEHIND_RESHAPE,
+};
+
+static enum reshape_loc get_reshape_loc(struct mddev *mddev,
+		struct r5conf *conf, sector_t logical_sector)
+{
+	sector_t reshape_progress, reshape_safe;
+	/*
+	 * Spinlock is needed as reshape_progress may be
+	 * 64bit on a 32bit platform, and so it might be
+	 * possible to see a half-updated value
+	 * Of course reshape_progress could change after
+	 * the lock is dropped, so once we get a reference
+	 * to the stripe that we think it is, we will have
+	 * to check again.
+	 */
+	spin_lock_irq(&conf->device_lock);
+	reshape_progress = conf->reshape_progress;
+	reshape_safe = conf->reshape_safe;
+	spin_unlock_irq(&conf->device_lock);
+	if (reshape_progress == MaxSector)
+		return LOC_NO_RESHAPE;
+	if (ahead_of_reshape(mddev, logical_sector, reshape_progress))
+		return LOC_AHEAD_OF_RESHAPE;
+	if (ahead_of_reshape(mddev, logical_sector, reshape_safe))
+		return LOC_INSIDE_RESHAPE;
+	return LOC_BEHIND_RESHAPE;
+}
+
 static enum stripe_result make_stripe_request(struct mddev *mddev,
 		struct r5conf *conf, struct stripe_request_ctx *ctx,
 		sector_t logical_sector, struct bio *bi)
@@ -5863,28 +5896,14 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 	seq = read_seqcount_begin(&conf->gen_lock);

 	if (unlikely(conf->reshape_progress != MaxSector)) {
-		/*
-		 * Spinlock is needed as reshape_progress may be
-		 * 64bit on a 32bit platform, and so it might be
-		 * possible to see a half-updated value
-		 * Of course reshape_progress could change after
-		 * the lock is dropped, so once we get a reference
-		 * to the stripe that we think it is, we will have
-		 * to check again.
-		 */
-		spin_lock_irq(&conf->device_lock);
-		if (ahead_of_reshape(mddev, logical_sector,
-				     conf->reshape_progress)) {
-			previous = 1;
-		} else {
-			if (ahead_of_reshape(mddev, logical_sector,
-					     conf->reshape_safe)) {
-				spin_unlock_irq(&conf->device_lock);
-				ret = STRIPE_SCHEDULE_AND_RETRY;
-				goto out;
-			}
+		enum reshape_loc loc = get_reshape_loc(mddev, conf,
+						       logical_sector);
+		if (loc == LOC_INSIDE_RESHAPE) {
+			ret = STRIPE_SCHEDULE_AND_RETRY;
+			goto out;
 		}
-		spin_unlock_irq(&conf->device_lock);
+		if (loc == LOC_AHEAD_OF_RESHAPE)
+			previous = 1;
 	}

 	new_sector = raid5_compute_sector(conf, logical_sector, previous,
@@ -6062,8 +6081,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
 	if ((bi->bi_opf & REQ_NOWAIT) &&
 	    (conf->reshape_progress != MaxSector) &&
-	    !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
-	    ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {
+	    get_reshape_loc(mddev, conf, logical_sector) == LOC_INSIDE_RESHAPE) {
 		bio_wouldblock_error(bi);
 		if (rw == WRITE)
 			md_write_end(mddev);
--
2.39.2
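The essence of the fix is to snapshot both shared 64-bit values under the lock and only then classify the sector, so a concurrent "reshape finished" transition (progress becoming MaxSector) is caught by the recheck instead of being compared against. A condensed user-space sketch of that pattern, assuming a forward reshape (the kernel helper also handles backwards reshape via ahead_of_reshape()) and using a pthread mutex as a stand-in for device_lock:

```
#include <pthread.h>
#include <stdint.h>

#define MAX_SECTOR UINT64_MAX	/* MaxSector analogue: no reshape running */

enum reshape_loc {
	LOC_NO_RESHAPE,
	LOC_AHEAD_OF_RESHAPE,
	LOC_INSIDE_RESHAPE,
	LOC_BEHIND_RESHAPE,
};

static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t reshape_progress = MAX_SECTOR;
static uint64_t reshape_safe;

static enum reshape_loc get_reshape_loc(uint64_t sector)
{
	uint64_t progress, safe;

	/* Snapshot both 64-bit values consistently w.r.t. the updater. */
	pthread_mutex_lock(&device_lock);
	progress = reshape_progress;
	safe = reshape_safe;
	pthread_mutex_unlock(&device_lock);

	/* Recheck after the snapshot: the reshape may have just finished. */
	if (progress == MAX_SECTOR)
		return LOC_NO_RESHAPE;
	if (sector >= progress)		/* not yet reshaped: old layout */
		return LOC_AHEAD_OF_RESHAPE;
	if (sector >= safe)		/* reshaped but not yet committed */
		return LOC_INSIDE_RESHAPE;
	return LOC_BEHIND_RESHAPE;	/* safely in the new layout */
}

int main(void)
{
	/* With no reshape running, every sector reports LOC_NO_RESHAPE. */
	return get_reshape_loc(128) == LOC_NO_RESHAPE ? 0 : 1;
}
```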
From: Heming Zhao <heming.zhao@suse.com>

mainline inclusion
from mainline-v6.11-rc1
commit fff42f213824fa434a4b6cf906b4331fe6e9302b
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

The commit 1bbe254e4336 ("md-cluster: check for timeout while a new
disk adding") is syntactically correct, but does not suit the real
clustered code logic. When a timeout occurs while adding a new disk,
if recv_daemon() bypasses the unlock for ack_lockres:CR, another node
will wait forever to grab the EX lock. This causes the cluster to hang
indefinitely.

How to fix:

1. In dlm_lock_sync(), change the wait behaviour from waiting forever
   to waiting with a timeout. This avoids the hang when another node
   fails to handle a cluster msg. Another result of this change is
   that if another node receives an unknown msg (e.g. a new msg_type),
   the old code would hang, whereas the new code times out and fails.
   This helps cluster_md handle a new msg_type coming from nodes with
   different kernel/module versions (e.g. the user only updates one
   leg's kernel and monitors the stability of the new kernel).

2. By design, the old __sendmsg() always returned 0 (success), since
   it had to successfully unlock ->message_lockres. This commit makes
   the function return an error number when an error occurs.

Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md-cluster.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 1249fcc18950..25bca948601c 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -15,6 +15,7 @@

 #define LVB_SIZE	64
 #define NEW_DEV_TIMEOUT 5000
+#define WAIT_DLM_LOCK_TIMEOUT (30 * HZ)

 struct dlm_lock_resource {
 	dlm_lockspace_t *ls;
@@ -130,8 +131,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
 			0, sync_ast, res, res->bast);
 	if (ret)
 		return ret;
-	wait_event(res->sync_locking, res->sync_locking_done);
+	ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
+				WAIT_DLM_LOCK_TIMEOUT);
 	res->sync_locking_done = false;
+	if (!ret) {
+		pr_err("locking DLM '%s' timeout!\n", res->name);
+		return -EBUSY;
+	}
 	if (res->lksb.sb_status == 0)
 		res->mode = mode;
 	return res->lksb.sb_status;
@@ -743,7 +749,7 @@ static void unlock_comm(struct md_cluster_info *cinfo)
  */
 static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 {
-	int error;
+	int error, unlock_error;
 	int slot = cinfo->slot_number - 1;

 	cmsg->slot = cpu_to_le32(slot);
@@ -751,7 +757,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
 	if (error) {
 		pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
-		goto failed_message;
+		return error;
 	}

 	memcpy(cinfo->message_lockres->lksb.sb_lvbptr, (void *)cmsg,
@@ -781,14 +787,10 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	}

 failed_ack:
-	error = dlm_unlock_sync(cinfo->message_lockres);
-	if (unlikely(error != 0)) {
+	while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres)))
 		pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n",
-			error);
-		/* in case the message can't be released due to some reason */
-		goto failed_ack;
-	}
-failed_message:
+			unlock_error);
+
 	return error;
 }
--
2.39.2
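The first change replaces an unbounded wait with a bounded one: if the peer never answers, the waiter fails with -EBUSY instead of hanging the cluster. A user-space analogue of the dlm_lock_sync() change, using pthread_cond_timedwait() in place of wait_event_timeout(); the names and the timeout value are stand-ins:

```
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool sync_locking_done;	/* set by the lock-granted callback */

/*
 * Bounded wait for the grant notification. Returns 0 on success,
 * -EBUSY if the peer never answers within timeout_sec seconds.
 */
static int wait_lock_granted(int timeout_sec)
{
	struct timespec deadline;
	int err = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_sec;

	pthread_mutex_lock(&lock);
	while (!sync_locking_done && err != ETIMEDOUT)
		err = pthread_cond_timedwait(&cond, &lock, &deadline);
	sync_locking_done = false;	/* consume the event either way */
	pthread_mutex_unlock(&lock);

	return err == ETIMEDOUT ? -EBUSY : 0;
}

int main(void)
{
	/* No peer ever signals `cond`, so this returns -EBUSY after 2s. */
	return wait_lock_granted(2) == -EBUSY ? 0 : 1;
}
```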
From: Heming Zhao <heming.zhao@suse.com>

mainline inclusion
from mainline-v6.11-rc1
commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1
category: bugfix
bugzilla: 190197
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

The commit db5e653d7c9f ("md: delay choosing sync action to
md_start_sync()") delays the start of the sync action. In a clustered
environment, this causes another node to activate the spare disk
first and skip recovery. As a result, no node performs recovery when
a disk is added or re-added.

Before db5e653d7c9f:

```
   node1                            node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + md_choose_sync_action          process_metadata_update
 |  remove_and_add_spares          //node1 has not finished adding
 + call mddev->sync_work           //the spare disk:do nothing

md_start_sync
 starts md_do_sync

md_do_sync
 + grabbed resync_lockres:DLM_LOCK_EX
 + do syncing job

md_check_recovery                 sendmsg: METADATA_UPDATED
 process_metadata_update
  //activate spare disk

... ...

                                  md_do_sync
                                   waiting to grab resync_lockres:EX
```

After db5e653d7c9f:

(note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
starts md_start_sync, setting the INTR action will exacerbate the
delay in node1 calling the md_do_sync function.)

```
   node1                            node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + calls mddev->sync_work         process_metadata_update
                                   //node1 has not finished adding
                                   //the spare disk:do nothing

md_start_sync
 + md_choose_sync_action
 |  remove_and_add_spares
 + calls md_do_sync
                                  md_check_recovery
                                   md_update_sb
                                    sendmsg: METADATA_UPDATED
 process_metadata_update
  //activate spare disk

... ...                           ... ...

                                  md_do_sync
                                   + grabbed resync_lockres:EX
                                   + raid1_sync_request skip sync under
                                     conf->fullsync:0

md_do_sync
 1. waiting to grab resync_lockres:EX
 2. when node1 could grab EX lock, node1 will
    skip resync under recovery_offset:MaxSector
```

How to trigger:

```(commands @node1)
 # to easily watch the recovery status
echo 2000 > /proc/sys/dev/raid/speed_limit_max
ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"

mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm --manage /dev/md0 --add /dev/sdc

=== "cat /proc/mdstat" on both nodes shows no recovery action. ===
```

How to fix: because the md layer code flow is hard to restructure to
speed up the sync job on the local node, add a new cluster msg that
makes the other node wait before activating the disk.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md-cluster.h |  2 ++
 drivers/md/md-cluster.c | 27 +++++++++++++++++++++++++++
 drivers/md/md.c         | 17 ++++++++++++++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
index a78e3021775d..470bf18ffde5 100644
--- a/drivers/md/md-cluster.h
+++ b/drivers/md/md-cluster.h
@@ -14,6 +14,8 @@ struct md_cluster_operations {
 	int (*leave)(struct mddev *mddev);
 	int (*slot_number)(struct mddev *mddev);
 	int (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi);
+	int (*resync_start_notify)(struct mddev *mddev);
+	int (*resync_status_get)(struct mddev *mddev);
 	void (*resync_info_get)(struct mddev *mddev, sector_t *lo, sector_t *hi);
 	int (*metadata_update_start)(struct mddev *mddev);
 	int (*metadata_update_finish)(struct mddev *mddev);
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 25bca948601c..30da7d71a590 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -57,6 +57,7 @@ struct resync_info {
 #define	MD_CLUSTER_ALREADY_IN_CLUSTER		6
 #define	MD_CLUSTER_PENDING_RECV_EVENT		7
 #define	MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD	8
+#define	MD_CLUSTER_WAITING_FOR_SYNC		9

 struct md_cluster_info {
 	struct mddev *mddev; /* the md device which md_cluster_info belongs to */
@@ -92,6 +93,7 @@ struct md_cluster_info {
 	sector_t sync_hi;
 };

+/* For compatibility, add the new msg_type at the end. */
 enum msg_type {
 	METADATA_UPDATED = 0,
 	RESYNCING,
@@ -101,6 +103,7 @@ enum msg_type {
 	BITMAP_NEEDS_SYNC,
 	CHANGE_CAPACITY,
 	BITMAP_RESIZE,
+	RESYNCING_START,
 };

 struct cluster_msg {
@@ -461,6 +464,7 @@ static void process_suspend_info(struct mddev *mddev,
 		clear_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
 		remove_suspend_info(mddev, slot);
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		clear_bit(MD_CLUSTER_WAITING_FOR_SYNC, &cinfo->state);
 		md_wakeup_thread(mddev->thread);
 		return;
 	}
@@ -531,6 +535,7 @@ static int process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 		res = -1;
 	}
 	clear_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state);
+	set_bit(MD_CLUSTER_WAITING_FOR_SYNC, &cinfo->state);
 	return res;
 }

@@ -599,6 +604,9 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
 	case CHANGE_CAPACITY:
 		set_capacity_and_notify(mddev->gendisk, mddev->array_sectors);
 		break;
+	case RESYNCING_START:
+		clear_bit(MD_CLUSTER_WAITING_FOR_SYNC, &mddev->cluster_info->state);
+		break;
 	case RESYNCING:
 		set_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
 		process_suspend_info(mddev, le32_to_cpu(msg->slot),
@@ -1351,6 +1359,23 @@ static void resync_info_get(struct mddev *mddev, sector_t *lo, sector_t *hi)
 	spin_unlock_irq(&cinfo->suspend_lock);
 }

+static int resync_status_get(struct mddev *mddev)
+{
+	struct md_cluster_info *cinfo = mddev->cluster_info;
+
+	return test_bit(MD_CLUSTER_WAITING_FOR_SYNC, &cinfo->state);
+}
+
+static int resync_start_notify(struct mddev *mddev)
+{
+	struct md_cluster_info *cinfo = mddev->cluster_info;
+	struct cluster_msg cmsg = {0};
+
+	cmsg.type = cpu_to_le32(RESYNCING_START);
+
+	return sendmsg(cinfo, &cmsg, 0);
+}
+
 static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
@@ -1585,6 +1610,8 @@ static const struct md_cluster_operations cluster_ops = {
 	.resync_start = resync_start,
 	.resync_finish = resync_finish,
 	.resync_info_update = resync_info_update,
+	.resync_start_notify = resync_start_notify,
+	.resync_status_get = resync_status_get,
 	.resync_info_get = resync_info_get,
 	.metadata_update_start = metadata_update_start,
 	.metadata_update_finish = metadata_update_finish,
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2330d1631707..80f0b84b3582 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9080,7 +9080,8 @@ void md_do_sync(struct md_thread *thread)
 	 * This will mean we have to start checking from the beginning again.
 	 *
 	 */
-
+	if (mddev_is_clustered(mddev))
+		md_cluster_ops->resync_start_notify(mddev);
 	do {
 		int mddev2_minor = -1;
 		mddev->curr_resync = MD_RESYNC_DELAYED;
@@ -10100,8 +10101,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		 */
 		if (rdev2->raid_disk == -1 && role != MD_DISK_ROLE_SPARE &&
 		    !(le32_to_cpu(sb->feature_map) &
-		      MD_FEATURE_RESHAPE_ACTIVE)) {
-			rdev2->saved_raid_disk = role;
+		      MD_FEATURE_RESHAPE_ACTIVE) &&
+		    !md_cluster_ops->resync_status_get(mddev)) {
+			/*
+			 * -1 to make raid1_add_disk() set conf->fullsync
+			 * to 1. This could avoid skipping sync when the
+			 * remote node is down during resyncing.
+			 */
+			if ((le32_to_cpu(sb->feature_map)
+			     & MD_FEATURE_RECOVERY_OFFSET))
+				rdev2->saved_raid_disk = -1;
+			else
+				rdev2->saved_raid_disk = role;
 			ret = remove_and_add_spares(mddev, rdev2);
 			pr_info("Activated spare: %pg\n",
 				rdev2->bdev);
--
2.39.2
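The fix boils down to a cross-node gate bit: when a node learns of the new disk it sets a "waiting for sync" flag, and only the new RESYNCING_START message (sent by the node that owns the sync job just before md_do_sync() runs) clears it; while the flag is set, the node refuses to activate the spare itself. A condensed sketch of that gating logic, with the flag and message names mirroring the patch but the surrounding state machine invented for illustration:

```
#include <stdbool.h>
#include <stdio.h>

/* New msg_type appended last so older peers still parse earlier values. */
enum msg_type { METADATA_UPDATED, RESYNCING, RESYNCING_START };

static bool waiting_for_sync;	/* MD_CLUSTER_WAITING_FOR_SYNC analogue */

/* Peer finished adding the disk: start gating spare activation. */
static void process_add_new_disk(void)
{
	waiting_for_sync = true;
}

/* Peer announced it is about to run md_do_sync(): the gate lifts. */
static void process_recvd_msg(enum msg_type type)
{
	if (type == RESYNCING_START)
		waiting_for_sync = false;
}

/* check_sb_changes() analogue: activate a spare only once the gate is open. */
static void maybe_activate_spare(void)
{
	if (waiting_for_sync) {
		printf("deferred: sync owner has not started recovery yet\n");
		return;
	}
	printf("activating spare\n");
}

int main(void)
{
	process_add_new_disk();
	maybe_activate_spare();			/* deferred */
	process_recvd_msg(RESYNCING_START);
	maybe_activate_spare();			/* proceeds */
	return 0;
}
```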
From: Mateusz Jończyk <mat.jonczyk@o2.pl>

mainline inclusion
from mainline-v6.11-rc1
commit 36a5c03f232719eb4e2d925f4d584e09cfaf372c
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8374
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:

	BUG_ON(sectors <= 0);

Call Trace:
	? bio_split+0x96/0xb0
	? exc_invalid_op+0x53/0x70
	? bio_split+0x96/0xb0
	? asm_exc_invalid_op+0x1b/0x20
	? bio_split+0x96/0xb0
	? raid1_read_request+0x890/0xd20
	? __call_rcu_common.constprop.0+0x97/0x260
	raid1_make_request+0x81/0xce0
	? __get_random_u32_below+0x17/0x70
	? new_slab+0x2b3/0x580
	md_handle_request+0x77/0x210
	md_submit_bio+0x62/0xa0
	__submit_bio+0x17b/0x230
	submit_bio_noacct_nocheck+0x18e/0x3c0
	submit_bio_noacct+0x244/0x670

After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases and because of it,
raid1_read_request() calls bio_split() with sectors == 0.

Fix it by filling in this variable.

This bug was introduced in
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.

Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/

--

Tested on both Linux 6.10 and 6.9.8.

Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any
problems:
	./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).

Notes:
- I was reliably getting deadlocks when adding / removing devices on
  such an array - while the array was loaded with fsstress with 20
  concurrent processes. When the array was idle or loaded with fsstress
  with 8 processes, no such deadlocks happened in my tests. This
  occurred also on unpatched Linux 6.8.0 though, but not on 6.1.97-rc1,
  so this is likely an independent regression (to be investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
  array in similar conditions - this happened on Linux 6.1.97-rc1 also
  though. fsstress with 8 concurrent processes did cause it only once
  during many tests.
- in my testing, there was once a problem with hot adding an internal
  bitmap to the array:
	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
	mdadm: failed to set internal bitmap.
  even though no such reshaping was happening according to
  /proc/mdstat. This seems unrelated, though.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1372efc0d15b..95c1e1507d4e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -676,6 +676,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}
--
2.39.2
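The bug class here is an out-parameter left unset on one success path: choose_slow_rdev() returned a disk without writing *max_sectors, so the caller split the bio with a stale zero length and tripped BUG_ON(sectors <= 0). A reduced user-space illustration of the failure mode and the one-line fix; all names are hypothetical stand-ins for the raid1 helpers:

```
#include <assert.h>
#include <stdio.h>

/*
 * Chooser analogue: returns a disk index and reports the usable read
 * length through an out-parameter, like choose_slow_rdev() does via
 * max_sectors.
 */
static int choose_rdev(int request_sectors, int *max_sectors)
{
	/* Stand-in for raid1_check_read_range(): whole range readable. */
	int read_len = request_sectors;

	if (read_len == request_sectors) {
		/* The one-line fix: without this store, the caller keeps
		 * whatever stale value (often 0) it initialized. */
		*max_sectors = read_len;
		return 0;	/* disk 0 */
	}
	return -1;		/* no suitable disk */
}

int main(void)
{
	int max_sectors = 0;	/* stale until the chooser fills it in */
	int disk = choose_rdev(8, &max_sectors);

	/* bio_split() analogue of BUG_ON(sectors <= 0). */
	assert(max_sectors > 0);
	printf("disk %d, split at %d sectors\n", disk, max_sectors);
	return 0;
}
```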
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully!
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/20143
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/P3Y...