Alex Lyakas (1): md: Whenassemble the array, consult the superblock of the freshest device
David Jeffery (1): md/raid6: use valid sector values to determine if an I/O should wait on the reshape
Denis Plotnikov (1): md-cluster: check for timeout while a new disk adding
Gou Hao (1): md/raid1: remove unnecessary null checking
Junxiao Bi (2): md: bypass block throttle for superblock update Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"
Justin Stitt (1): md: replace deprecated strncpy with memcpy
Li Nan (2): md: factor out a helper exceed_read_errors() to check read_errors md/raid1: support read error check
Mariusz Tkaczyk (1): md: do not require mddev_lock() for all options in array_state_store()
Yu Kuai (47): md: use separate work_struct for md_start_sync() md: factor out a helper to choose sync action from md_check_recovery() md: delay choosing sync action to md_start_sync() md: factor out a helper rdev_removeable() from remove_and_add_spares() md: factor out a helper rdev_is_spare() from remove_and_add_spares() md: factor out a helper rdev_addable() from remove_and_add_spares() md: delay remove_and_add_spares() for read only array to md_start_sync() md: initialize 'active_io' while allocating mddev md: initialize 'writes_pending' while allocating mddev md-bitmap: remove the checking of 'pers->quiesce' from location_store() md-bitmap: suspend array earlier in location_store() md: don't check 'mddev->pers' from suspend_hi_store() md: don't check 'mddev->pers' and 'pers->quiesce' from suspend_lo_store() md: factor out a helper from mddev_put() md: simplify md_seq_ops md/raid1: don't split discard io for write behind md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi' md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log' md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery() md: add new helpers to suspend/resume array md: add new helpers to suspend/resume and lock/unlock array md/dm-raid: use new apis to suspend array md/md-bitmap: use new apis to suspend array for location_store() md/raid5-cache: use new apis to suspend array md/raid5: use new apis to suspend array md: use new apis to suspend array for sysfs apis md: use new apis to suspend array for adding/removing rdev from state_store() md: use new apis to suspend array for ioctls involed array reconfiguration md: use new apis to suspend array before mddev_create/destroy_serial_pool md: cleanup mddev_create/destroy_serial_pool() md/md-linear: cleanup linear_add() md/raid5: replace suspend with quiesce() callback md: suspend array in md_start_sync() if array need reconfiguration md: remove old apis to suspend the array md: rename __mddev_suspend/resume() back to mddev_suspend/resume() md: cleanup pers->prepare_suspend() md: remove flag RemoveSynchronized md/raid10: remove rcu protection to access rdev from conf md/raid1: remove rcu protection to access rdev from conf md/raid5: remove rcu protection to access rdev from conf md/md-multipath: remove rcu protection to access rdev from conf md: synchronize flush io with array reconfiguration md: fix missing flush of sync_work md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly() md: fix stopping sync thread md: split MD_RECOVERY_NEEDED out of mddev_resume dm-raid: delay flushing event_work() after reconfig_mutex is released
drivers/md/md.h | 75 +-- drivers/md/raid5.h | 4 +- drivers/md/dm-raid.c | 20 +- drivers/md/md-autodetect.c | 4 +- drivers/md/md-bitmap.c | 61 +- drivers/md/md-cluster.c | 15 +- drivers/md/md-linear.c | 2 - drivers/md/md-multipath.c | 32 +- drivers/md/md.c | 1089 ++++++++++++++++++++---------------- drivers/md/raid1-10.c | 54 ++ drivers/md/raid1.c | 97 ++-- drivers/md/raid10.c | 274 ++------- drivers/md/raid5-cache.c | 75 ++- drivers/md/raid5-ppl.c | 16 +- drivers/md/raid5.c | 310 +++------- 15 files changed, 995 insertions(+), 1133 deletions(-)
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit ac619781967bd5663c29606246b50dbebd8b3473 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
It's a little weird to borrow 'del_work' for md_start_sync(), declare a new work_struct 'sync_work' for md_start_sync().
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 5 ++++- drivers/md/md.c | 10 ++++++---- 2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index 274e7d61d19f..c4a585d89866 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -451,7 +451,10 @@ struct mddev { struct kernfs_node *sysfs_degraded; /*handle for 'degraded' */ struct kernfs_node *sysfs_level; /*handle for 'level' */
- struct work_struct del_work; /* used for delayed sysfs removal */ + /* used for delayed sysfs removal */ + struct work_struct del_work; + /* used for register new sync thread */ + struct work_struct sync_work;
/* "lock" protects: * flush_bio transition from NULL to !NULL diff --git a/drivers/md/md.c b/drivers/md/md.c index c44d97904c02..65698b957f79 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -631,13 +631,13 @@ void mddev_put(struct mddev *mddev) * flush_workqueue() after mddev_find will succeed in waiting * for the work to be done. */ - INIT_WORK(&mddev->del_work, mddev_delayed_delete); queue_work(md_misc_wq, &mddev->del_work); } spin_unlock(&all_mddevs_lock); }
static void md_safemode_timeout(struct timer_list *t); +static void md_start_sync(struct work_struct *ws);
void mddev_init(struct mddev *mddev) { @@ -662,6 +662,9 @@ void mddev_init(struct mddev *mddev) mddev->resync_min = 0; mddev->resync_max = MaxSector; mddev->level = LEVEL_NONE; + + INIT_WORK(&mddev->sync_work, md_start_sync); + INIT_WORK(&mddev->del_work, mddev_delayed_delete); } EXPORT_SYMBOL_GPL(mddev_init);
@@ -9250,7 +9253,7 @@ static int remove_and_add_spares(struct mddev *mddev,
static void md_start_sync(struct work_struct *ws) { - struct mddev *mddev = container_of(ws, struct mddev, del_work); + struct mddev *mddev = container_of(ws, struct mddev, sync_work);
rcu_assign_pointer(mddev->sync_thread, md_register_thread(md_do_sync, mddev, "resync")); @@ -9463,8 +9466,7 @@ void md_check_recovery(struct mddev *mddev) */ md_bitmap_write_all(mddev->bitmap); } - INIT_WORK(&mddev->del_work, md_start_sync); - queue_work(md_misc_wq, &mddev->del_work); + queue_work(md_misc_wq, &mddev->sync_work); goto unlock; } not_running:
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 897c62a1cae6485870a45934b22c180016a0570f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are no functional changes, on the one hand make the code cleaner, on the other hand prevent following checkpatch error in the next patch to delay choosing sync action to md_start_sync().
ERROR: do not use assignment in if condition + } else if ((spares = remove_and_add_spares(mddev, NULL))) {
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 70 +++++++++++++++++++++++++++++++------------------ 1 file changed, 45 insertions(+), 25 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 65698b957f79..daf5b6d6653c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9251,6 +9251,50 @@ static int remove_and_add_spares(struct mddev *mddev, return spares; }
+static bool md_choose_sync_action(struct mddev *mddev, int *spares) +{ + /* Check if reshape is in progress first. */ + if (mddev->reshape_position != MaxSector) { + if (mddev->pers->check_reshape == NULL || + mddev->pers->check_reshape(mddev) != 0) + return false; + + set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); + clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); + return true; + } + + /* + * Remove any failed drives, then add spares if possible. Spares are + * also removed and re-added, to allow the personality to fail the + * re-add. + */ + *spares = remove_and_add_spares(mddev, NULL); + if (*spares) { + clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); + clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); + clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); + + /* Start new recovery. */ + set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); + return true; + } + + /* Check if recovery is in progress. */ + if (mddev->recovery_cp < MaxSector) { + set_bit(MD_RECOVERY_SYNC, &mddev->recovery); + clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); + return true; + } + + /* Delay to choose resync/check/repair in md_do_sync(). */ + if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) + return true; + + /* Nothing to be done */ + return false; +} + static void md_start_sync(struct work_struct *ws) { struct mddev *mddev = container_of(ws, struct mddev, sync_work); @@ -9432,32 +9476,8 @@ void md_check_recovery(struct mddev *mddev) if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) goto not_running; - /* no recovery is running. - * remove any failed drives, then - * add spares if possible. - * Spares are also removed and re-added, to allow - * the personality to fail the re-add. - */ - - if (mddev->reshape_position != MaxSector) { - if (mddev->pers->check_reshape == NULL || - mddev->pers->check_reshape(mddev) != 0) - /* Cannot proceed */ - goto not_running; - set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); - clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); - } else if ((spares = remove_and_add_spares(mddev, NULL))) { - clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); - clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); - clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); - set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); - } else if (mddev->recovery_cp < MaxSector) { - set_bit(MD_RECOVERY_SYNC, &mddev->recovery); - clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); - } else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) - /* nothing to be done ... */ + if (!md_choose_sync_action(mddev, &spares)) goto not_running; - if (mddev->pers->sync_request) { if (spares) { /* We are adding a device or devices to an array
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit db5e653d7c9fae8ed61da58ab5e9a8db2cd61a2b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Before this patch, for read-write array:
1) md_check_recover() found that something need to be done, and it'll try to grab 'reconfig_mutex'. The case that md_check_recover() need to do something: - array is not suspend; - super_block need to be updated; - 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set; - unusual case related to safemode;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set, md_check_recover() will try to choose a sync action, and then queue a work md_start_sync().
3) md_start_sync() register sync_thread;
After this patch,
1) is the same; 2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set, queue a work md_start_sync() directly; 3) md_start_sync() will try to choose a sync action, and then register sync_thread();
Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2) and 3) and md_do_sync() is always ran in serial and they can never concurrent, this change should not introduce any behavior change for now.
Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING' without protection in error path, which might affect the logical in md_check_recovery().
The advantage to change this is that array reconfiguration is independent from daemon now, and it'll be much easier to synchronize it with io, consider that io may rely on daemon thread to be done.
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 73 ++++++++++++++++++++++++++----------------------- 1 file changed, 39 insertions(+), 34 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index daf5b6d6653c..fe6cec08e7ab 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9298,6 +9298,22 @@ static bool md_choose_sync_action(struct mddev *mddev, int *spares) static void md_start_sync(struct work_struct *ws) { struct mddev *mddev = container_of(ws, struct mddev, sync_work); + int spares = 0; + + mddev_lock_nointr(mddev); + + if (!md_choose_sync_action(mddev, &spares)) + goto not_running; + + if (!mddev->pers->sync_request) + goto not_running; + + /* + * We are adding a device or devices to an array which has the bitmap + * stored on all devices. So make sure all bitmap pages get written. + */ + if (spares) + md_bitmap_write_all(mddev->bitmap);
rcu_assign_pointer(mddev->sync_thread, md_register_thread(md_do_sync, mddev, "resync")); @@ -9305,20 +9321,27 @@ static void md_start_sync(struct work_struct *ws) pr_warn("%s: could not start resync thread...\n", mdname(mddev)); /* leave the spares where they are, it shouldn't hurt */ - clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); - clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); - clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); - clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); - clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); - wake_up(&resync_wait); - if (test_and_clear_bit(MD_RECOVERY_RECOVER, - &mddev->recovery)) - if (mddev->sysfs_action) - sysfs_notify_dirent_safe(mddev->sysfs_action); - } else - md_wakeup_thread(mddev->sync_thread); + goto not_running; + } + + mddev_unlock(mddev); + md_wakeup_thread(mddev->sync_thread); sysfs_notify_dirent_safe(mddev->sysfs_action); md_new_event(); + return; + +not_running: + clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); + clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); + clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); + clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); + clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); + mddev_unlock(mddev); + + wake_up(&resync_wait); + if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) && + mddev->sysfs_action) + sysfs_notify_dirent_safe(mddev->sysfs_action); }
/* @@ -9386,7 +9409,6 @@ void md_check_recovery(struct mddev *mddev) return;
if (mddev_trylock(mddev)) { - int spares = 0; bool try_set_sync = mddev->safemode != 0;
if (!mddev->external && mddev->safemode == 1) @@ -9473,31 +9495,14 @@ void md_check_recovery(struct mddev *mddev) clear_bit(MD_RECOVERY_INTR, &mddev->recovery); clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
- if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || - test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) - goto not_running; - if (!md_choose_sync_action(mddev, &spares)) - goto not_running; - if (mddev->pers->sync_request) { - if (spares) { - /* We are adding a device or devices to an array - * which has the bitmap stored on all devices. - * So make sure all bitmap pages get written - */ - md_bitmap_write_all(mddev->bitmap); - } + if (test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) && + !test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) { queue_work(md_misc_wq, &mddev->sync_work); - goto unlock; - } - not_running: - if (!mddev->sync_thread) { + } else { clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); wake_up(&resync_wait); - if (test_and_clear_bit(MD_RECOVERY_RECOVER, - &mddev->recovery)) - if (mddev->sysfs_action) - sysfs_notify_dirent_safe(mddev->sysfs_action); } + unlock: wake_up(&mddev->sb_wait); mddev_unlock(mddev);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 3389d57f97531efe9e285a0740c9b767d6f29f6c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync().
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 44 ++++++++++++++++++++++++++++++++++++++------ 1 file changed, 38 insertions(+), 6 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index fe6cec08e7ab..22a19b74cd99 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9158,6 +9158,42 @@ void md_do_sync(struct md_thread *thread) } EXPORT_SYMBOL_GPL(md_do_sync);
+static bool rdev_removeable(struct md_rdev *rdev) +{ + /* rdev is not used. */ + if (rdev->raid_disk < 0) + return false; + + /* There are still inflight io, don't remove this rdev. */ + if (atomic_read(&rdev->nr_pending)) + return false; + + /* + * An error occurred but has not yet been acknowledged by the metadata + * handler, don't remove this rdev. + */ + if (test_bit(Blocked, &rdev->flags)) + return false; + + /* Fautly rdev is not used, it's safe to remove it. */ + if (test_bit(Faulty, &rdev->flags)) + return true; + + /* Journal disk can only be removed if it's faulty. */ + if (test_bit(Journal, &rdev->flags)) + return false; + + /* + * 'In_sync' is cleared while 'raid_disk' is valid, which means + * replacement has just become active from pers->spare_active(), and + * then pers->hot_remove_disk() will replace this rdev with replacement. + */ + if (!test_bit(In_sync, &rdev->flags)) + return true; + + return false; +} + static int remove_and_add_spares(struct mddev *mddev, struct md_rdev *this) { @@ -9190,12 +9226,8 @@ static int remove_and_add_spares(struct mddev *mddev, synchronize_rcu(); rdev_for_each(rdev, mddev) { if ((this == NULL || rdev == this) && - rdev->raid_disk >= 0 && - !test_bit(Blocked, &rdev->flags) && - ((test_bit(RemoveSynchronized, &rdev->flags) || - (!test_bit(In_sync, &rdev->flags) && - !test_bit(Journal, &rdev->flags))) && - atomic_read(&rdev->nr_pending)==0)) { + (test_bit(RemoveSynchronized, &rdev->flags) || + rdev_removeable(rdev))) { if (mddev->pers->hot_remove_disk( mddev, rdev) == 0) { sysfs_unlink_rdev(mddev, rdev);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit b172a0704d0ddea3a55f3633f8249c3ce7d960bc category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync().
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 22a19b74cd99..225b6d3a46b5 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9194,6 +9194,14 @@ static bool rdev_removeable(struct md_rdev *rdev) return false; }
+static bool rdev_is_spare(struct md_rdev *rdev) +{ + return !test_bit(Candidate, &rdev->flags) && rdev->raid_disk >= 0 && + !test_bit(In_sync, &rdev->flags) && + !test_bit(Journal, &rdev->flags) && + !test_bit(Faulty, &rdev->flags); +} + static int remove_and_add_spares(struct mddev *mddev, struct md_rdev *this) { @@ -9249,13 +9257,10 @@ static int remove_and_add_spares(struct mddev *mddev, rdev_for_each(rdev, mddev) { if (this && this != rdev) continue; + if (rdev_is_spare(rdev)) + spares++; if (test_bit(Candidate, &rdev->flags)) continue; - if (rdev->raid_disk >= 0 && - !test_bit(In_sync, &rdev->flags) && - !test_bit(Journal, &rdev->flags) && - !test_bit(Faulty, &rdev->flags)) - spares++; if (rdev->raid_disk >= 0) continue; if (test_bit(Faulty, &rdev->flags))
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit a0ae7e4e0bc00ae178e42391157e8ddb88b2838a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync().
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-7-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 39 +++++++++++++++++++++++++++------------ 1 file changed, 27 insertions(+), 12 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 225b6d3a46b5..bf474798c228 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9202,6 +9202,31 @@ static bool rdev_is_spare(struct md_rdev *rdev) !test_bit(Faulty, &rdev->flags); }
+static bool rdev_addable(struct md_rdev *rdev) +{ + /* rdev is already used, don't add it again. */ + if (test_bit(Candidate, &rdev->flags) || rdev->raid_disk >= 0 || + test_bit(Faulty, &rdev->flags)) + return false; + + /* Allow to add journal disk. */ + if (test_bit(Journal, &rdev->flags)) + return true; + + /* Allow to add if array is read-write. */ + if (md_is_rdwr(rdev->mddev)) + return true; + + /* + * For read-only array, only allow to readd a rdev. And if bitmap is + * used, don't allow to readd a rdev that is too old. + */ + if (rdev->saved_raid_disk >= 0 && !test_bit(Bitmap_sync, &rdev->flags)) + return true; + + return false; +} + static int remove_and_add_spares(struct mddev *mddev, struct md_rdev *this) { @@ -9259,20 +9284,10 @@ static int remove_and_add_spares(struct mddev *mddev, continue; if (rdev_is_spare(rdev)) spares++; - if (test_bit(Candidate, &rdev->flags)) - continue; - if (rdev->raid_disk >= 0) + if (!rdev_addable(rdev)) continue; - if (test_bit(Faulty, &rdev->flags)) - continue; - if (!test_bit(Journal, &rdev->flags)) { - if (!md_is_rdwr(mddev) && - !(rdev->saved_raid_disk >= 0 && - !test_bit(Bitmap_sync, &rdev->flags))) - continue; - + if (!test_bit(Journal, &rdev->flags)) rdev->recovery_offset = 0; - } if (mddev->pers->hot_add_disk(mddev, rdev) == 0) { /* failure here is OK */ sysfs_link_rdev(mddev, rdev);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 81e2ce1b3d5a896e24fe5af83896fec057860a25 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Before this patch, for read-only array:
md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, then it will call remove_and_add_spares() directly to try to remove and add rdevs from array.
After this patch:
1) md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, and the worker 'sync_work' is not pending, and there are rdevs can be added or removed, then it will queue new work md_start_sync(); 2) md_start_sync() will call remove_and_add_spares() and exist;
This change make sure that array reconfiguration is independent from daemon, and it'll be much easier to synchronize it with io, consier that io may rely on daemon thread to be done.
Also fix a problem that 'pers->spars_active' is called after remove_and_add_spares(), which order is wrong, because spares must active first, and then remove_and_add_spares() can add spares to the array, like what read-write case does:
1) daemon set 'MD_RECOVERY_RUNNING', register new sync thread to do recovery; 2) recovery is done, md_do_sync() set 'MD_RECOVERY_DONE' before return; 3) daemon call 'pers->spars_active', and clear 'MD_RECOVERY_RUNNING'; 4) in the next round of daemon, call remove_and_add_spares() to add spares to the array.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 61 +++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 51 insertions(+), 10 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index bf474798c228..85962fac704d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4868,6 +4868,7 @@ action_store(struct mddev *mddev, const char *page, size_t len) /* A write to sync_action is enough to justify * canceling read-auto mode */ + flush_work(&mddev->sync_work); mddev->ro = MD_RDWR; md_wakeup_thread(mddev->sync_thread); } @@ -7642,6 +7643,10 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode, mutex_unlock(&mddev->open_mutex); sync_blockdev(bdev); } + + if (!md_is_rdwr(mddev)) + flush_work(&mddev->sync_work); + err = mddev_lock(mddev); if (err) { pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n", @@ -8566,6 +8571,7 @@ bool md_write_start(struct mddev *mddev, struct bio *bi) BUG_ON(mddev->ro == MD_RDONLY); if (mddev->ro == MD_AUTO_READ) { /* need to switch to read/write */ + flush_work(&mddev->sync_work); mddev->ro = MD_RDWR; set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); @@ -9227,6 +9233,16 @@ static bool rdev_addable(struct md_rdev *rdev) return false; }
+static bool md_spares_need_change(struct mddev *mddev) +{ + struct md_rdev *rdev; + + rdev_for_each(rdev, mddev) + if (rdev_removeable(rdev) || rdev_addable(rdev)) + return true; + return false; +} + static int remove_and_add_spares(struct mddev *mddev, struct md_rdev *this) { @@ -9354,6 +9370,18 @@ static void md_start_sync(struct work_struct *ws)
mddev_lock_nointr(mddev);
+ if (!md_is_rdwr(mddev)) { + /* + * On a read-only array we can: + * - remove failed devices + * - add already-in_sync devices if the array itself is in-sync. + * As we only add devices that are already in-sync, we can + * activate the spares immediately. + */ + remove_and_add_spares(mddev, NULL); + goto not_running; + } + if (!md_choose_sync_action(mddev, &spares)) goto not_running;
@@ -9468,30 +9496,43 @@ void md_check_recovery(struct mddev *mddev)
if (!md_is_rdwr(mddev)) { struct md_rdev *rdev; + + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { + /* sync_work already queued. */ + clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + goto unlock; + } + if (!mddev->external && mddev->in_sync) - /* 'Blocked' flag not needed as failed devices + /* + * 'Blocked' flag not needed as failed devices * will be recorded if array switched to read/write. * Leaving it set will prevent the device * from being removed. */ rdev_for_each(rdev, mddev) clear_bit(Blocked, &rdev->flags); - /* On a read-only array we can: - * - remove failed devices - * - add already-in_sync devices if the array itself - * is in-sync. - * As we only add devices that are already in-sync, - * we can activate the spares immediately. - */ - remove_and_add_spares(mddev, NULL); - /* There is no thread, but we need to call + + /* + * There is no thread, but we need to call * ->spare_active and clear saved_raid_disk */ set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_reap_sync_thread(mddev); + + /* + * Let md_start_sync() to remove and add rdevs to the + * array. + */ + if (md_spares_need_change(mddev)) { + set_bit(MD_RECOVERY_RUNNING, &mddev->recovery); + queue_work(md_misc_wq, &mddev->sync_work); + } + clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags); + goto unlock; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit d58eff83bd3c6166944f6b159544438385d48549 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
'active_io' is used for mddev_suspend() and it's initialized in md_run(), this restrict that 'reconfig_mutex' must be held and "mddev->pers" must be set before calling mddev_suspend().
Initialize 'active_io' early so that mddev_suspend() is safe to call once mddev is allocated, this will be helpful to refactor mddev_suspend() in following patches.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 3 ++- drivers/md/dm-raid.c | 7 +++++- drivers/md/md.c | 53 +++++++++++++++++++++++++++----------------- 3 files changed, 41 insertions(+), 22 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index c4a585d89866..bd5b5db1e579 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -796,7 +796,8 @@ extern int md_integrity_register(struct mddev *mddev); extern int md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev); extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
-extern void mddev_init(struct mddev *mddev); +extern int mddev_init(struct mddev *mddev); +extern void mddev_destroy(struct mddev *mddev); struct mddev *md_alloc(dev_t dev, char *name); void mddev_put(struct mddev *mddev); extern int md_run(struct mddev *mddev); diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 5f9991765f27..69805d37e113 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -749,7 +749,11 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r return ERR_PTR(-ENOMEM); }
- mddev_init(&rs->md); + if (mddev_init(&rs->md)) { + kfree(rs); + ti->error = "Cannot initialize raid context"; + return ERR_PTR(-ENOMEM); + }
rs->raid_disks = raid_devs; rs->delta_disks = 0; @@ -798,6 +802,7 @@ static void raid_set_free(struct raid_set *rs) dm_put_device(rs->ti, rs->dev[i].data_dev); }
+ mddev_destroy(&rs->md); kfree(rs); }
diff --git a/drivers/md/md.c b/drivers/md/md.c index 85962fac704d..3d3887870cb0 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -639,8 +639,20 @@ void mddev_put(struct mddev *mddev) static void md_safemode_timeout(struct timer_list *t); static void md_start_sync(struct work_struct *ws);
-void mddev_init(struct mddev *mddev) +static void active_io_release(struct percpu_ref *ref) { + struct mddev *mddev = container_of(ref, struct mddev, active_io); + + wake_up(&mddev->sb_wait); +} + +int mddev_init(struct mddev *mddev) +{ + + if (percpu_ref_init(&mddev->active_io, active_io_release, + PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) + return -ENOMEM; + mutex_init(&mddev->open_mutex); mutex_init(&mddev->reconfig_mutex); mutex_init(&mddev->sync_mutex); @@ -665,9 +677,17 @@ void mddev_init(struct mddev *mddev)
INIT_WORK(&mddev->sync_work, md_start_sync); INIT_WORK(&mddev->del_work, mddev_delayed_delete); + + return 0; } EXPORT_SYMBOL_GPL(mddev_init);
+void mddev_destroy(struct mddev *mddev) +{ + percpu_ref_exit(&mddev->active_io); +} +EXPORT_SYMBOL_GPL(mddev_destroy); + static struct mddev *mddev_find_locked(dev_t unit) { struct mddev *mddev; @@ -711,13 +731,16 @@ static struct mddev *mddev_alloc(dev_t unit) new = kzalloc(sizeof(*new), GFP_KERNEL); if (!new) return ERR_PTR(-ENOMEM); - mddev_init(new); + + error = mddev_init(new); + if (error) + goto out_free_new;
spin_lock(&all_mddevs_lock); if (unit) { error = -EEXIST; if (mddev_find_locked(unit)) - goto out_free_new; + goto out_destroy_new; new->unit = unit; if (MAJOR(unit) == MD_MAJOR) new->md_minor = MINOR(unit); @@ -728,7 +751,7 @@ static struct mddev *mddev_alloc(dev_t unit) error = -ENODEV; new->unit = mddev_alloc_unit(); if (!new->unit) - goto out_free_new; + goto out_destroy_new; new->md_minor = MINOR(new->unit); new->hold_active = UNTIL_STOP; } @@ -736,8 +759,11 @@ static struct mddev *mddev_alloc(dev_t unit) list_add(&new->all_mddevs, &all_mddevs); spin_unlock(&all_mddevs_lock); return new; -out_free_new: + +out_destroy_new: spin_unlock(&all_mddevs_lock); + mddev_destroy(new); +out_free_new: kfree(new); return ERR_PTR(error); } @@ -748,6 +774,7 @@ static void mddev_free(struct mddev *mddev) list_del(&mddev->all_mddevs); spin_unlock(&all_mddevs_lock);
+ mddev_destroy(mddev); kfree(mddev); }
@@ -5780,12 +5807,6 @@ static void md_safemode_timeout(struct timer_list *t) }
static int start_dirty_degraded; -static void active_io_release(struct percpu_ref *ref) -{ - struct mddev *mddev = container_of(ref, struct mddev, active_io); - - wake_up(&mddev->sb_wait); -}
int md_run(struct mddev *mddev) { @@ -5866,15 +5887,10 @@ int md_run(struct mddev *mddev) nowait = nowait && bdev_nowait(rdev->bdev); }
- err = percpu_ref_init(&mddev->active_io, active_io_release, - PERCPU_REF_ALLOW_REINIT, GFP_KERNEL); - if (err) - return err; - if (!bioset_initialized(&mddev->bio_set)) { err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS); if (err) - goto exit_active_io; + return err; } if (!bioset_initialized(&mddev->sync_set)) { err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS); @@ -6071,8 +6087,6 @@ int md_run(struct mddev *mddev) bioset_exit(&mddev->sync_set); exit_bio_set: bioset_exit(&mddev->bio_set); -exit_active_io: - percpu_ref_exit(&mddev->active_io); return err; } EXPORT_SYMBOL_GPL(md_run); @@ -6288,7 +6302,6 @@ static void __md_stop(struct mddev *mddev) module_put(pers->owner); clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- percpu_ref_exit(&mddev->active_io); bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); bioset_exit(&mddev->io_clone_set);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit b8494823e236326500aa1004155e83f748dd10da category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently 'writes_pending' is initialized in pers->run for raid1/5/10, and it's freed while deleing mddev, instead of pers->free. pers->run can be called multiple times before mddev is deleted, and a helper mddev_init_writes_pending() is used to prevent 'writes_pending' to be initialized multiple times, this usage is safe but a litter weird.
On the other hand, 'writes_pending' is only initialized for raid1/5/10, however, it's used in common layer, for example:
array_state_store set_in_sync if (!mddev->in_sync) -> in_sync is used for all levels // access writes_pending
There might be some implicit dependency that I don't recognized to make sure 'writes_pending' can only be accessed for raid1/5/10, but there are no comments about that.
By the way, it make sense to initialize 'writes_pending' in common layer because there are already three levels use it.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 1 - drivers/md/md.c | 29 ++++++++++++----------------- drivers/md/raid1.c | 3 +-- drivers/md/raid10.c | 3 --- drivers/md/raid5.c | 3 --- 5 files changed, 13 insertions(+), 26 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index bd5b5db1e579..eca6dd41ad73 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -769,7 +769,6 @@ extern void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **t extern void md_wakeup_thread(struct md_thread __rcu *thread); extern void md_check_recovery(struct mddev *mddev); extern void md_reap_sync_thread(struct mddev *mddev); -extern int mddev_init_writes_pending(struct mddev *mddev); extern bool md_write_start(struct mddev *mddev, struct bio *bi); extern void md_write_inc(struct mddev *mddev, struct bio *bi); extern void md_write_end(struct mddev *mddev); diff --git a/drivers/md/md.c b/drivers/md/md.c index 3d3887870cb0..f5d93e90277c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -646,6 +646,8 @@ static void active_io_release(struct percpu_ref *ref) wake_up(&mddev->sb_wait); }
+static void no_op(struct percpu_ref *r) {} + int mddev_init(struct mddev *mddev) {
@@ -653,6 +655,15 @@ int mddev_init(struct mddev *mddev) PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) return -ENOMEM;
+ if (percpu_ref_init(&mddev->writes_pending, no_op, + PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) { + percpu_ref_exit(&mddev->active_io); + return -ENOMEM; + } + + /* We want to start with the refcount at zero */ + percpu_ref_put(&mddev->writes_pending); + mutex_init(&mddev->open_mutex); mutex_init(&mddev->reconfig_mutex); mutex_init(&mddev->sync_mutex); @@ -685,6 +696,7 @@ EXPORT_SYMBOL_GPL(mddev_init); void mddev_destroy(struct mddev *mddev) { percpu_ref_exit(&mddev->active_io); + percpu_ref_exit(&mddev->writes_pending); } EXPORT_SYMBOL_GPL(mddev_destroy);
@@ -5621,21 +5633,6 @@ static void mddev_delayed_delete(struct work_struct *ws) kobject_put(&mddev->kobj); }
-static void no_op(struct percpu_ref *r) {} - -int mddev_init_writes_pending(struct mddev *mddev) -{ - if (mddev->writes_pending.percpu_count_ptr) - return 0; - if (percpu_ref_init(&mddev->writes_pending, no_op, - PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0) - return -ENOMEM; - /* We want to start with the refcount at zero */ - percpu_ref_put(&mddev->writes_pending); - return 0; -} -EXPORT_SYMBOL_GPL(mddev_init_writes_pending); - struct mddev *md_alloc(dev_t dev, char *name) { /* @@ -6316,7 +6313,6 @@ void md_stop(struct mddev *mddev) */ __md_stop_writes(mddev); __md_stop(mddev); - percpu_ref_exit(&mddev->writes_pending); }
EXPORT_SYMBOL_GPL(md_stop); @@ -7900,7 +7896,6 @@ static void md_free_disk(struct gendisk *disk) { struct mddev *mddev = disk->private_data;
- percpu_ref_exit(&mddev->writes_pending); mddev_free(mddev); }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 2aabac773fe7..3a78f79ee6d5 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -3122,8 +3122,7 @@ static int raid1_run(struct mddev *mddev) mdname(mddev)); return -EIO; } - if (mddev_init_writes_pending(mddev) < 0) - return -ENOMEM; + /* * copy the already verified devices into our private RAID1 * bookkeeping area. [whatever we allocate in run(), diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 023413120851..a5927e98dc67 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -4154,9 +4154,6 @@ static int raid10_run(struct mddev *mddev) sector_t min_offset_diff = 0; int first = 1;
- if (mddev_init_writes_pending(mddev) < 0) - return -ENOMEM; - if (mddev->private == NULL) { conf = setup_conf(mddev); if (IS_ERR(conf)) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 284cd71bcc68..0644b83fd3f4 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -7785,9 +7785,6 @@ static int raid5_run(struct mddev *mddev) long long min_offset_diff = 0; int first = 1;
- if (mddev_init_writes_pending(mddev) < 0) - return -ENOMEM; - if (mddev->recovery_cp != MaxSector) pr_notice("md/raid:%s: not clean -- starting background reconstruction\n", mdname(mddev));
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit b71fe4ac7531d67e6fc8c287cbcb2b176aa93833 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
After commit 4d27e927344a ("md: don't quiesce in mddev_suspend()"), there is no need to check 'pers->quiesce' before calling mddev_suspend().
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825030956.1527023-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-bitmap.c | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 6f9ff14971f9..f38c7f3156cb 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -2352,10 +2352,6 @@ location_store(struct mddev *mddev, const char *buf, size_t len) if (rv) return rv; if (mddev->pers) { - if (!mddev->pers->quiesce) { - rv = -EBUSY; - goto out; - } if (mddev->recovery || mddev->sync_thread) { rv = -EBUSY; goto out;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 158d32af8710777b146f5303a5ca066b493d18e8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to call mddev_suspend() earlier.
This will also be helper to refactor mddev_suspend() later.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825030956.1527023-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-bitmap.c | 43 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 23 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index f38c7f3156cb..0c661e5036bb 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -2351,6 +2351,8 @@ location_store(struct mddev *mddev, const char *buf, size_t len) rv = mddev_lock(mddev); if (rv) return rv; + + mddev_suspend(mddev); if (mddev->pers) { if (mddev->recovery || mddev->sync_thread) { rv = -EBUSY; @@ -2365,11 +2367,8 @@ location_store(struct mddev *mddev, const char *buf, size_t len) rv = -EBUSY; goto out; } - if (mddev->pers) { - mddev_suspend(mddev); - md_bitmap_destroy(mddev); - mddev_resume(mddev); - } + + md_bitmap_destroy(mddev); mddev->bitmap_info.offset = 0; if (mddev->bitmap_info.file) { struct file *f = mddev->bitmap_info.file; @@ -2379,6 +2378,8 @@ location_store(struct mddev *mddev, const char *buf, size_t len) } else { /* No bitmap, OK to set a location */ long long offset; + struct bitmap *bitmap; + if (strncmp(buf, "none", 4) == 0) /* nothing to be done */; else if (strncmp(buf, "file:", 5) == 0) { @@ -2402,25 +2403,20 @@ location_store(struct mddev *mddev, const char *buf, size_t len) rv = -EINVAL; goto out; } + mddev->bitmap_info.offset = offset; - if (mddev->pers) { - struct bitmap *bitmap; - bitmap = md_bitmap_create(mddev, -1); - mddev_suspend(mddev); - if (IS_ERR(bitmap)) - rv = PTR_ERR(bitmap); - else { - mddev->bitmap = bitmap; - rv = md_bitmap_load(mddev); - if (rv) - mddev->bitmap_info.offset = 0; - } - if (rv) { - md_bitmap_destroy(mddev); - mddev_resume(mddev); - goto out; - } - mddev_resume(mddev); + bitmap = md_bitmap_create(mddev, -1); + if (IS_ERR(bitmap)) { + rv = PTR_ERR(bitmap); + goto out; + } + + mddev->bitmap = bitmap; + rv = md_bitmap_load(mddev); + if (rv) { + mddev->bitmap_info.offset = 0; + md_bitmap_destroy(mddev); + goto out; } } } @@ -2433,6 +2429,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len) } rv = 0; out: + mddev_resume(mddev); mddev_unlock(mddev); if (rv) return rv;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit a2a9f16838509475ea6801f7794a89e03d55e3ed category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to remove such checking.
This will also allow the array to be suspended even before the array is ran.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825030956.1527023-7-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index f5d93e90277c..aac65c18f545 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5219,18 +5219,13 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len) err = mddev_lock(mddev); if (err) return err; - err = -EINVAL; - if (mddev->pers == NULL) - goto unlock;
mddev_suspend(mddev); mddev->suspend_hi = new; mddev_resume(mddev);
- err = 0; -unlock: mddev_unlock(mddev); - return err ?: len; + return len; } static struct md_sysfs_entry md_suspend_hi = __ATTR(suspend_hi, S_IRUGO|S_IWUSR, suspend_hi_show, suspend_hi_store);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 54d21eb6ad5e57e70157590397ba01b9faed6b59 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to remove such checking.
This will also allow the array to be suspended even before the array is ran.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230825030956.1527023-8-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index aac65c18f545..65e919201602 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5182,18 +5182,13 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len) err = mddev_lock(mddev); if (err) return err; - err = -EINVAL; - if (mddev->pers == NULL || - mddev->pers->quiesce == NULL) - goto unlock; + mddev_suspend(mddev); mddev->suspend_lo = new; mddev_resume(mddev);
- err = 0; -unlock: mddev_unlock(mddev); - return err ?: len; + return len; } static struct md_sysfs_entry md_suspend_lo = __ATTR(suspend_lo, S_IRUGO|S_IWUSR, suspend_lo_show, suspend_lo_store);
From: Justin Stitt justinstitt@google.com
mainline inclusion from mainline-v6.7-rc1 commit ceb0416383988dbd5decd6a70141a3507732c160 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
`strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces.
There are three such strncpy uses that this patch addresses:
The respective destination buffers are: 1) mddev->clevel 2) clevel 3) mddev->metadata_type
We expect mddev->clevel to be NUL-terminated due to its use with format strings: | ret = sprintf(page, "%s\n", mddev->clevel);
Furthermore, we can see that mddev->clevel is not expected to be NUL-padded as `md_clean()` merely set's its first byte to NULL -- not the entire buffer: | static void md_clean(struct mddev *mddev) | { | mddev->array_sectors = 0; | mddev->external_size = 0; | ... | mddev->level = LEVEL_NONE; | mddev->clevel[0] = 0; | ...
A suitable replacement for this instance is `memcpy` as we know the number of bytes to copy and perform manual NUL-termination at a specified offset. This really decays to just a byte copy from one buffer to another. `strscpy` is also a considerable replacement but using `slen` as the length argument would result in truncation of the last byte unless something like `slen + 1` was provided which isn't the most idiomatic strscpy usage.
For the next case, the destination buffer `clevel` is expected to be NUL-terminated based on its usage within kstrtol() which expects NUL-terminated strings. Note that, in context, this code removes a trailing newline which is seemingly not required as kstrtol() can handle trailing newlines implicitly. However, there exists further usage of clevel (or buf) that would also like to have the newline removed. All in all, with similar reasoning to the first case, let's just use memcpy as this is just a byte copy and NUL-termination is handled manually.
The third and final case concerning `mddev->metadata_type` is more or less the same as the other two. We expect that it be NUL-terminated based on its usage with seq_printf: | seq_printf(seq, " super external:%s", | mddev->metadata_type); ... and we can surmise that NUL-padding isn't required either due to how it is handled in md_clean(): | static void md_clean(struct mddev *mddev) | { | ... | mddev->metadata_type[0] = 0; | ...
So really, all these instances have precisely calculated lengths and purposeful NUL-termination so we can just use memcpy to remove ambiguity surrounding strncpy.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nu... [1] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt justinstitt@google.com Reviewed-by: Kees Cook keescook@chromium.org Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230925-strncpy-drivers-md-md-c-v1-1-2b0093b89c2b... Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 65e919201602..66819917c93a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -3914,7 +3914,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len) return rv;
if (mddev->pers == NULL) { - strncpy(mddev->clevel, buf, slen); + memcpy(mddev->clevel, buf, slen); if (mddev->clevel[slen-1] == '\n') slen--; mddev->clevel[slen] = 0; @@ -3947,7 +3947,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len) }
/* Now find the new personality */ - strncpy(clevel, buf, slen); + memcpy(clevel, buf, slen); if (clevel[slen-1] == '\n') slen--; clevel[slen] = 0; @@ -4733,7 +4733,7 @@ metadata_store(struct mddev *mddev, const char *buf, size_t len) size_t namelen = len-9; if (namelen >= sizeof(mddev->metadata_type)) namelen = sizeof(mddev->metadata_type)-1; - strncpy(mddev->metadata_type, buf+9, namelen); + memcpy(mddev->metadata_type, buf+9, namelen); mddev->metadata_type[namelen] = 0; if (namelen && mddev->metadata_type[namelen-1] == '\n') mddev->metadata_type[--namelen] = 0;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 3d8d32873c7b6d9cec5b40c2ddb8c7c55961694f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are no functional changes, prepare to simplify md_seq_ops in next patch.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230927061241.1552837-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 66819917c93a..d2295b146937 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -616,23 +616,28 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
static void mddev_delayed_delete(struct work_struct *ws);
+static void __mddev_put(struct mddev *mddev) +{ + if (mddev->raid_disks || !list_empty(&mddev->disks) || + mddev->ctime || mddev->hold_active) + return; + + /* Array is not configured at all, and not held active, so destroy it */ + set_bit(MD_DELETED, &mddev->flags); + + /* + * Call queue_work inside the spinlock so that flush_workqueue() after + * mddev_find will succeed in waiting for the work to be done. + */ + queue_work(md_misc_wq, &mddev->del_work); +} + void mddev_put(struct mddev *mddev) { if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock)) return; - if (!mddev->raid_disks && list_empty(&mddev->disks) && - mddev->ctime == 0 && !mddev->hold_active) { - /* Array is not configured at all, and not held active, - * so destroy it */ - set_bit(MD_DELETED, &mddev->flags);
- /* - * Call queue_work inside the spinlock so that - * flush_workqueue() after mddev_find will succeed in waiting - * for the work to be done. - */ - queue_work(md_misc_wq, &mddev->del_work); - } + __mddev_put(mddev); spin_unlock(&all_mddevs_lock); }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit cf1b6d4441fffd0ba8ae4ced6a12f578c95ca049 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Before this patch, the implementation is hacky and hard to understand:
1) md_seq_start set pos to 1; 2) md_seq_show found pos is 1, then print Personalities; 3) md_seq_next found pos is 1, then it update pos to the first mddev; 4) md_seq_show found pos is not 1 or 2, show mddev; 5) md_seq_next found pos is not 1 or 2, update pos to next mddev; 6) loop 4-5 until the last mddev, then md_seq_next update pos to 2; 7) md_seq_show found pos is 2, then print unused devices; 8) md_seq_next found pos is 2, stop;
This patch remove the magic value and use seq_list_start/next/stop() directly, and move printing "Personalities" to md_seq_start(), "unsed devices" to md_seq_stop():
1) md_seq_start print Personalities, and then set pos to first mddev; 2) md_seq_show show mddev; 3) md_seq_next update pos to next mddev; 4) loop 2-3 until the last mddev; 5) md_seq_stop print unsed devices;
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 100 +++++++++++------------------------------------- 1 file changed, 22 insertions(+), 78 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index d2295b146937..d9242d2882b0 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8206,105 +8206,46 @@ static int status_resync(struct seq_file *seq, struct mddev *mddev) }
static void *md_seq_start(struct seq_file *seq, loff_t *pos) + __acquires(&all_mddevs_lock) { - struct list_head *tmp; - loff_t l = *pos; - struct mddev *mddev; + struct md_personality *pers;
- if (l == 0x10000) { - ++*pos; - return (void *)2; - } - if (l > 0x10000) - return NULL; - if (!l--) - /* header */ - return (void*)1; + seq_puts(seq, "Personalities : "); + spin_lock(&pers_lock); + list_for_each_entry(pers, &pers_list, list) + seq_printf(seq, "[%s] ", pers->name); + + spin_unlock(&pers_lock); + seq_puts(seq, "\n"); + seq->poll_event = atomic_read(&md_event_count);
spin_lock(&all_mddevs_lock); - list_for_each(tmp,&all_mddevs) - if (!l--) { - mddev = list_entry(tmp, struct mddev, all_mddevs); - if (!mddev_get(mddev)) - continue; - spin_unlock(&all_mddevs_lock); - return mddev; - } - spin_unlock(&all_mddevs_lock); - if (!l--) - return (void*)2;/* tail */ - return NULL; + + return seq_list_start(&all_mddevs, *pos); }
static void *md_seq_next(struct seq_file *seq, void *v, loff_t *pos) { - struct list_head *tmp; - struct mddev *next_mddev, *mddev = v; - struct mddev *to_put = NULL; - - ++*pos; - if (v == (void*)2) - return NULL; - - spin_lock(&all_mddevs_lock); - if (v == (void*)1) { - tmp = all_mddevs.next; - } else { - to_put = mddev; - tmp = mddev->all_mddevs.next; - } - - for (;;) { - if (tmp == &all_mddevs) { - next_mddev = (void*)2; - *pos = 0x10000; - break; - } - next_mddev = list_entry(tmp, struct mddev, all_mddevs); - if (mddev_get(next_mddev)) - break; - mddev = next_mddev; - tmp = mddev->all_mddevs.next; - } - spin_unlock(&all_mddevs_lock); - - if (to_put) - mddev_put(to_put); - return next_mddev; - + return seq_list_next(v, &all_mddevs, pos); }
static void md_seq_stop(struct seq_file *seq, void *v) + __releases(&all_mddevs_lock) { - struct mddev *mddev = v; - - if (mddev && v != (void*)1 && v != (void*)2) - mddev_put(mddev); + status_unused(seq); + spin_unlock(&all_mddevs_lock); }
static int md_seq_show(struct seq_file *seq, void *v) { - struct mddev *mddev = v; + struct mddev *mddev = list_entry(v, struct mddev, all_mddevs); sector_t sectors; struct md_rdev *rdev;
- if (v == (void*)1) { - struct md_personality *pers; - seq_printf(seq, "Personalities : "); - spin_lock(&pers_lock); - list_for_each_entry(pers, &pers_list, list) - seq_printf(seq, "[%s] ", pers->name); - - spin_unlock(&pers_lock); - seq_printf(seq, "\n"); - seq->poll_event = atomic_read(&md_event_count); + if (!mddev_get(mddev)) return 0; - } - if (v == (void*)2) { - status_unused(seq); - return 0; - }
+ spin_unlock(&all_mddevs_lock); spin_lock(&mddev->lock); if (mddev->pers || mddev->raid_disks || !list_empty(&mddev->disks)) { seq_printf(seq, "%s : %sactive", mdname(mddev), @@ -8375,6 +8316,9 @@ static int md_seq_show(struct seq_file *seq, void *v) seq_printf(seq, "\n"); } spin_unlock(&mddev->lock); + spin_lock(&all_mddevs_lock); + if (atomic_dec_and_test(&mddev->active)) + __mddev_put(mddev);
return 0; }
From: Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com
mainline inclusion from mainline-v6.7-rc1 commit 09f894affcf2daac5aa841ffff43d1242215fd80 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We don't need to lock device to reject not supported request in array_state_store(). No functional changes intended.
There are differences between ioctl and sysfs handling during stopping. With this change, it will be easier to add additional steps which needs to be completed before mddev_lock() is taken.
Reviewed-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230928125517.12356-1-mariusz.tkaczyk@linux.intel... Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 37 ++++++++++++++++++++----------------- 1 file changed, 20 insertions(+), 17 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index d9242d2882b0..2813047ac580 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4408,6 +4408,18 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len) int err = 0; enum array_state st = match_word(buf, array_states);
+ /* No lock dependent actions */ + switch (st) { + case suspended: /* not supported yet */ + case write_pending: /* cannot be set */ + case active_idle: /* cannot be set */ + case broken: /* cannot be set */ + case bad_word: + return -EINVAL; + default: + break; + } + if (mddev->pers && (st == active || st == clean) && mddev->ro != MD_RDONLY) { /* don't take reconfig_mutex when toggling between @@ -4432,23 +4444,16 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len) err = mddev_lock(mddev); if (err) return err; - err = -EINVAL; - switch(st) { - case bad_word: - break; - case clear: - /* stopping an active array */ - err = do_md_stop(mddev, 0, NULL); - break; + + switch (st) { case inactive: - /* stopping an active array */ + /* stop an active array, return 0 otherwise */ if (mddev->pers) err = do_md_stop(mddev, 2, NULL); - else - err = 0; /* already inactive */ break; - case suspended: - break; /* not supported yet */ + case clear: + err = do_md_stop(mddev, 0, NULL); + break; case readonly: if (mddev->pers) err = md_set_readonly(mddev, NULL); @@ -4499,10 +4504,8 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len) err = do_md_run(mddev); } break; - case write_pending: - case active_idle: - case broken: - /* these cannot be set */ + default: + err = -EINVAL; break; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 9e55a22fce1384837c213274d1a3b93be16ed9d7 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently, discad io is treated the same as normal write io, and for write behind case, io size is limited to:
BIO_MAX_VECS * (PAGE_SIZE >> 9)
For 0.5KB sector size and 4KB PAGE_SIZE, this is just 1MB. For consequence, if 'WriteMostly' is set to one of the underlying disks, then diskcard io will be splited into 1MB and it will take a long time for the diskcard to finish.
Fix this problem by disable write behind for discard io.
Reported-by: Roman Mamedov rm@romanrm.net Closes: https://lore.kernel.org/all/6a1165f7-c792-c054-b8f0-1ad4f7b8ae01@ultracoder.... Reported-and-tested-by: Kirill Kirilenko kirill@ultracoder.org Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231007112105.407449-1-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid1.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 3a78f79ee6d5..35d12948e0a9 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1345,6 +1345,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, int first_clone; int max_sectors; bool write_behind = false; + bool is_discard = (bio_op(bio) == REQ_OP_DISCARD);
if (mddev_is_clustered(mddev) && md_cluster_ops->area_resyncing(mddev, WRITE, @@ -1405,7 +1406,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, * write-mostly, which means we could allocate write behind * bio later. */ - if (rdev && test_bit(WriteMostly, &rdev->flags)) + if (!is_discard && rdev && test_bit(WriteMostly, &rdev->flags)) write_behind = true;
if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 617787f1386da5e851dad68926a31522134d352b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Protect 'suspend_lo' and 'suspend_hi' with READ_ONCE/WRITE_ONCE to prevent reading abnormal values.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 2813047ac580..fe2dd4b4c9f2 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -359,11 +359,11 @@ static bool is_suspended(struct mddev *mddev, struct bio *bio) return true; if (bio_data_dir(bio) != WRITE) return false; - if (mddev->suspend_lo >= mddev->suspend_hi) + if (READ_ONCE(mddev->suspend_lo) >= READ_ONCE(mddev->suspend_hi)) return false; - if (bio->bi_iter.bi_sector >= mddev->suspend_hi) + if (bio->bi_iter.bi_sector >= READ_ONCE(mddev->suspend_hi)) return false; - if (bio_end_sector(bio) < mddev->suspend_lo) + if (bio_end_sector(bio) < READ_ONCE(mddev->suspend_lo)) return false; return true; } @@ -5172,7 +5172,8 @@ __ATTR(sync_max, S_IRUGO|S_IWUSR, max_sync_show, max_sync_store); static ssize_t suspend_lo_show(struct mddev *mddev, char *page) { - return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_lo); + return sprintf(page, "%llu\n", + (unsigned long long)READ_ONCE(mddev->suspend_lo)); }
static ssize_t @@ -5192,7 +5193,7 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len) return err;
mddev_suspend(mddev); - mddev->suspend_lo = new; + WRITE_ONCE(mddev->suspend_lo, new); mddev_resume(mddev);
mddev_unlock(mddev); @@ -5204,7 +5205,8 @@ __ATTR(suspend_lo, S_IRUGO|S_IWUSR, suspend_lo_show, suspend_lo_store); static ssize_t suspend_hi_show(struct mddev *mddev, char *page) { - return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_hi); + return sprintf(page, "%llu\n", + (unsigned long long)READ_ONCE(mddev->suspend_hi)); }
static ssize_t @@ -5224,7 +5226,7 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len) return err;
mddev_suspend(mddev); - mddev->suspend_hi = new; + WRITE_ONCE(mddev->suspend_hi, new); mddev_resume(mddev);
mddev_unlock(mddev);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 06a4d0d8c642b5ea654e832b74dca12965356da0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
'conf->log' is set with 'reconfig_mutex' grabbed, however, readers are not procted, hence protect it with READ_ONCE/WRITE_ONCE to prevent reading abnormal values.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5-cache.c | 47 +++++++++++++++++++++------------------- 1 file changed, 25 insertions(+), 22 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c index 518b7cfa78b9..889bba60d6ff 100644 --- a/drivers/md/raid5-cache.c +++ b/drivers/md/raid5-cache.c @@ -327,8 +327,9 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space); void r5c_check_stripe_cache_usage(struct r5conf *conf) { int total_cached; + struct r5l_log *log = READ_ONCE(conf->log);
- if (!r5c_is_writeback(conf->log)) + if (!r5c_is_writeback(log)) return;
total_cached = atomic_read(&conf->r5c_cached_partial_stripes) + @@ -344,7 +345,7 @@ void r5c_check_stripe_cache_usage(struct r5conf *conf) */ if (total_cached > conf->min_nr_stripes * 1 / 2 || atomic_read(&conf->empty_inactive_list_nr) > 0) - r5l_wake_reclaim(conf->log, 0); + r5l_wake_reclaim(log, 0); }
/* @@ -353,7 +354,9 @@ void r5c_check_stripe_cache_usage(struct r5conf *conf) */ void r5c_check_cached_full_stripe(struct r5conf *conf) { - if (!r5c_is_writeback(conf->log)) + struct r5l_log *log = READ_ONCE(conf->log); + + if (!r5c_is_writeback(log)) return;
/* @@ -363,7 +366,7 @@ void r5c_check_cached_full_stripe(struct r5conf *conf) if (atomic_read(&conf->r5c_cached_full_stripes) >= min(R5C_FULL_STRIPE_FLUSH_BATCH(conf), conf->chunk_sectors >> RAID5_STRIPE_SHIFT(conf))) - r5l_wake_reclaim(conf->log, 0); + r5l_wake_reclaim(log, 0); }
/* @@ -396,7 +399,7 @@ void r5c_check_cached_full_stripe(struct r5conf *conf) */ static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf) { - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log);
if (!r5c_is_writeback(log)) return 0; @@ -449,7 +452,7 @@ static inline void r5c_update_log_state(struct r5l_log *log) void r5c_make_stripe_write_out(struct stripe_head *sh) { struct r5conf *conf = sh->raid_conf; - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log);
BUG_ON(!r5c_is_writeback(log));
@@ -491,7 +494,7 @@ static void r5c_handle_parity_cached(struct stripe_head *sh) */ static void r5c_finish_cache_stripe(struct stripe_head *sh) { - struct r5l_log *log = sh->raid_conf->log; + struct r5l_log *log = READ_ONCE(sh->raid_conf->log);
if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) { BUG_ON(test_bit(STRIPE_R5C_CACHING, &sh->state)); @@ -692,7 +695,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
/* wait superblock change before suspend */ wait_event(mddev->sb_wait, - conf->log == NULL || + !READ_ONCE(conf->log) || (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && (locked = mddev_trylock(mddev)))); if (locked) { @@ -1151,7 +1154,7 @@ static void r5l_run_no_space_stripes(struct r5l_log *log) static sector_t r5c_calculate_new_cp(struct r5conf *conf) { struct stripe_head *sh; - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log); sector_t new_cp; unsigned long flags;
@@ -1159,12 +1162,12 @@ static sector_t r5c_calculate_new_cp(struct r5conf *conf) return log->next_checkpoint;
spin_lock_irqsave(&log->stripe_in_journal_lock, flags); - if (list_empty(&conf->log->stripe_in_journal_list)) { + if (list_empty(&log->stripe_in_journal_list)) { /* all stripes flushed */ spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags); return log->next_checkpoint; } - sh = list_first_entry(&conf->log->stripe_in_journal_list, + sh = list_first_entry(&log->stripe_in_journal_list, struct stripe_head, r5c); new_cp = sh->log_start; spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags); @@ -1399,7 +1402,7 @@ void r5c_flush_cache(struct r5conf *conf, int num) struct stripe_head *sh, *next;
lockdep_assert_held(&conf->device_lock); - if (!conf->log) + if (!READ_ONCE(conf->log)) return;
count = 0; @@ -1420,7 +1423,7 @@ void r5c_flush_cache(struct r5conf *conf, int num)
static void r5c_do_reclaim(struct r5conf *conf) { - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log); struct stripe_head *sh; int count = 0; unsigned long flags; @@ -1549,7 +1552,7 @@ static void r5l_reclaim_thread(struct md_thread *thread) { struct mddev *mddev = thread->mddev; struct r5conf *conf = mddev->private; - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log);
if (!log) return; @@ -1591,7 +1594,7 @@ void r5l_quiesce(struct r5l_log *log, int quiesce)
bool r5l_log_disk_error(struct r5conf *conf) { - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log);
/* don't allow write if journal disk is missing */ if (!log) @@ -2635,7 +2638,7 @@ int r5c_try_caching_write(struct r5conf *conf, struct stripe_head_state *s, int disks) { - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log); int i; struct r5dev *dev; int to_cache = 0; @@ -2802,7 +2805,7 @@ void r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh, struct stripe_head_state *s) { - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log); int i; int do_wakeup = 0; sector_t tree_index; @@ -2941,7 +2944,7 @@ int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh) /* check whether this big stripe is in write back cache. */ bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect) { - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log); sector_t tree_index; void *slot;
@@ -3049,14 +3052,14 @@ int r5l_start(struct r5l_log *log) void r5c_update_on_rdev_error(struct mddev *mddev, struct md_rdev *rdev) { struct r5conf *conf = mddev->private; - struct r5l_log *log = conf->log; + struct r5l_log *log = READ_ONCE(conf->log);
if (!log) return;
if ((raid5_calc_degraded(conf) > 0 || test_bit(Journal, &rdev->flags)) && - conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK) + log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK) schedule_work(&log->disable_writeback_work); }
@@ -3145,7 +3148,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev) spin_lock_init(&log->stripe_in_journal_lock); atomic_set(&log->stripe_in_journal_count, 0);
- conf->log = log; + WRITE_ONCE(conf->log, log);
set_bit(MD_HAS_JOURNAL, &conf->mddev->flags); return 0; @@ -3173,7 +3176,7 @@ void r5l_exit_log(struct r5conf *conf) * 'reconfig_mutex' is held by caller, set 'confg->log' to NULL to * ensure disable_writeback_work wakes up and exits. */ - conf->log = NULL; + WRITE_ONCE(conf->log, NULL); wake_up(&conf->mddev->sb_wait); flush_work(&log->disable_writeback_work);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 2e82248b70f48a98f3720cdaa8ea7a4e7f8bd760 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Prepare to cleanup pers->prepare_suspend(), which is used to fix a deadlock in raid456 by returning error for io that is waiting for reshape to make progress in mddev_suspend().
This change will allow reshape to make progress while waiting for io to be done in mddev_suspend() in following patches.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index fe2dd4b4c9f2..6b640fa7dfca 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9412,7 +9412,7 @@ void md_check_recovery(struct mddev *mddev) wake_up(&mddev->sb_wait); }
- if (is_md_suspended(mddev)) + if (READ_ONCE(mddev->suspended)) return;
if (mddev->bitmap)
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 714d20150ed85811193ae07a494d91f9927c590f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Advantages for new apis: - reconfig_mutex is not required; - the weird logical that suspend array hold 'reconfig_mutex' for mddev_check_recovery() to update superblock is not needed; - the specail handling, 'pers->prepare_suspend', for raid456 is not needed; - It's safe to be called at any time once mddev is allocated, and it's designed to be used from slow path where array configuration is changed; - the new helpers is designed to be called before mddev_lock(), hence it support to be interrupted by user as well.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 3 ++ drivers/md/md.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 103 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index eca6dd41ad73..2bfd569e9e67 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -314,6 +314,7 @@ struct mddev { unsigned long sb_flags;
int suspended; + struct mutex suspend_mutex; struct percpu_ref active_io; int ro; int sysfs_active; /* set when sysfs deletes @@ -809,6 +810,8 @@ extern void md_rdev_clear(struct md_rdev *rdev); extern void md_handle_request(struct mddev *mddev, struct bio *bio); extern void mddev_suspend(struct mddev *mddev); extern void mddev_resume(struct mddev *mddev); +extern int __mddev_suspend(struct mddev *mddev, bool interruptible); +extern void __mddev_resume(struct mddev *mddev);
extern void md_reload_sb(struct mddev *mddev, int raid_disk); extern void md_update_sb(struct mddev *mddev, int force); diff --git a/drivers/md/md.c b/drivers/md/md.c index 6b640fa7dfca..4f9658a6b000 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -443,12 +443,22 @@ void mddev_suspend(struct mddev *mddev) lockdep_is_held(&mddev->reconfig_mutex));
WARN_ON_ONCE(thread && current == thread->tsk); - if (mddev->suspended++) + + /* can't concurrent with __mddev_suspend() and __mddev_resume() */ + mutex_lock(&mddev->suspend_mutex); + if (mddev->suspended++) { + mutex_unlock(&mddev->suspend_mutex); return; + } + wake_up(&mddev->sb_wait); set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); percpu_ref_kill(&mddev->active_io);
+ /* + * TODO: cleanup 'pers->prepare_suspend after all callers are replaced + * by __mddev_suspend(). + */ if (mddev->pers && mddev->pers->prepare_suspend) mddev->pers->prepare_suspend(mddev);
@@ -459,14 +469,21 @@ void mddev_suspend(struct mddev *mddev) del_timer_sync(&mddev->safemode_timer); /* restrict memory reclaim I/O during raid array is suspend */ mddev->noio_flag = memalloc_noio_save(); + + mutex_unlock(&mddev->suspend_mutex); } EXPORT_SYMBOL_GPL(mddev_suspend);
void mddev_resume(struct mddev *mddev) { lockdep_assert_held(&mddev->reconfig_mutex); - if (--mddev->suspended) + + /* can't concurrent with __mddev_suspend() and __mddev_resume() */ + mutex_lock(&mddev->suspend_mutex); + if (--mddev->suspended) { + mutex_unlock(&mddev->suspend_mutex); return; + }
/* entred the memalloc scope from mddev_suspend() */ memalloc_noio_restore(mddev->noio_flag); @@ -477,9 +494,89 @@ void mddev_resume(struct mddev *mddev) set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ + + mutex_unlock(&mddev->suspend_mutex); } EXPORT_SYMBOL_GPL(mddev_resume);
+int __mddev_suspend(struct mddev *mddev, bool interruptible) +{ + int err = 0; + + /* + * hold reconfig_mutex to wait for normal io will deadlock, because + * other context can't update super_block, and normal io can rely on + * updating super_block. + */ + lockdep_assert_not_held(&mddev->reconfig_mutex); + + if (interruptible) + err = mutex_lock_interruptible(&mddev->suspend_mutex); + else + mutex_lock(&mddev->suspend_mutex); + if (err) + return err; + + if (mddev->suspended) { + WRITE_ONCE(mddev->suspended, mddev->suspended + 1); + mutex_unlock(&mddev->suspend_mutex); + return 0; + } + + percpu_ref_kill(&mddev->active_io); + if (interruptible) + err = wait_event_interruptible(mddev->sb_wait, + percpu_ref_is_zero(&mddev->active_io)); + else + wait_event(mddev->sb_wait, + percpu_ref_is_zero(&mddev->active_io)); + if (err) { + percpu_ref_resurrect(&mddev->active_io); + mutex_unlock(&mddev->suspend_mutex); + return err; + } + + /* + * For raid456, io might be waiting for reshape to make progress, + * allow new reshape to start while waiting for io to be done to + * prevent deadlock. + */ + WRITE_ONCE(mddev->suspended, mddev->suspended + 1); + + del_timer_sync(&mddev->safemode_timer); + /* restrict memory reclaim I/O during raid array is suspend */ + mddev->noio_flag = memalloc_noio_save(); + + mutex_unlock(&mddev->suspend_mutex); + return 0; +} +EXPORT_SYMBOL_GPL(__mddev_suspend); + +void __mddev_resume(struct mddev *mddev) +{ + lockdep_assert_not_held(&mddev->reconfig_mutex); + + mutex_lock(&mddev->suspend_mutex); + WRITE_ONCE(mddev->suspended, mddev->suspended - 1); + if (mddev->suspended) { + mutex_unlock(&mddev->suspend_mutex); + return; + } + + /* entred the memalloc scope from __mddev_suspend() */ + memalloc_noio_restore(mddev->noio_flag); + + percpu_ref_resurrect(&mddev->active_io); + wake_up(&mddev->sb_wait); + + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + md_wakeup_thread(mddev->thread); + md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ + + mutex_unlock(&mddev->suspend_mutex); +} +EXPORT_SYMBOL_GPL(__mddev_resume); + /* * Generic flush handling for md */ @@ -672,6 +769,7 @@ int mddev_init(struct mddev *mddev) mutex_init(&mddev->open_mutex); mutex_init(&mddev->reconfig_mutex); mutex_init(&mddev->sync_mutex); + mutex_init(&mddev->suspend_mutex); mutex_init(&mddev->bitmap_info.mutex); INIT_LIST_HEAD(&mddev->disks); INIT_LIST_HEAD(&mddev->all_mddevs);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit f45461e24febab5c2ba8f331e9eb109bb59cbdc1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The new helpers suspend the array first and then lock the array,
Prepare to refactor from:
mddev_lock/lock_nointr mddev_suspend ... mddev_resuem mddev_lock
With:
mddev_suspend_and_lock/lock_nointr ... mddev_unlock_and_resume
After all the use cases is refactored, mddev_suspend/resume() will be removed.
And mddev_suspend_and_lock() will also replace mddev_lock() for the case that the array will be reconfigured, in order to synchronize with io to prevent problems in many corner cases.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+)
diff --git a/drivers/md/md.h b/drivers/md/md.h index 2bfd569e9e67..2e80d30bcfbf 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -856,6 +856,33 @@ static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio mddev->queue->limits.max_write_zeroes_sectors = 0; }
+static inline int mddev_suspend_and_lock(struct mddev *mddev) +{ + int ret; + + ret = __mddev_suspend(mddev, true); + if (ret) + return ret; + + ret = mddev_lock(mddev); + if (ret) + __mddev_resume(mddev); + + return ret; +} + +static inline void mddev_suspend_and_lock_nointr(struct mddev *mddev) +{ + __mddev_suspend(mddev, false); + mutex_lock(&mddev->reconfig_mutex); +} + +static inline void mddev_unlock_and_resume(struct mddev *mddev) +{ + mddev_unlock(mddev); + __mddev_resume(mddev); +} + struct mdu_array_info_s; struct mdu_disk_info_s;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 4eb3327aa28f3a737c2d3f7e35e83575f1d52283 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Convert to use new apis, the old apis will be removed eventually.
These are not hot path, so performance is not concerned.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-7-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/dm-raid.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 69805d37e113..05dd6ccf6f48 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3244,7 +3244,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) set_bit(MD_RECOVERY_FROZEN, &rs->md.recovery);
/* Has to be held on running the array */ - mddev_lock_nointr(&rs->md); + mddev_suspend_and_lock_nointr(&rs->md); r = md_run(&rs->md); rs->md.in_sync = 0; /* Assume already marked dirty */ if (r) { @@ -3268,7 +3268,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) } }
- mddev_suspend(&rs->md); set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags);
/* Try to adjust the raid4/5/6 stripe cache size to the stripe size */ @@ -3798,9 +3797,7 @@ static void raid_postsuspend(struct dm_target *ti) if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery)) md_stop_writes(&rs->md);
- mddev_lock_nointr(&rs->md); - mddev_suspend(&rs->md); - mddev_unlock(&rs->md); + __mddev_suspend(&rs->md, false); } }
@@ -4012,7 +4009,7 @@ static int raid_preresume(struct dm_target *ti) }
/* Check for any resize/reshape on @rs and adjust/initiate */ - /* Be prepared for mddev_resume() in raid_resume() */ + /* Be prepared for __mddev_resume() in raid_resume() */ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); if (mddev->recovery_cp && mddev->recovery_cp < MaxSector) { set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); @@ -4059,8 +4056,7 @@ static void raid_resume(struct dm_target *ti) clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); mddev->ro = 0; mddev->in_sync = 0; - mddev_resume(mddev); - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); } }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 3cddf86a57a58441eb019d21acea147f043dc13c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Convert to use new apis, the old apis will be removed eventually.
This is not hot path, so performance is not concerned.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-8-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-bitmap.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 0c661e5036bb..7d21e2a5b06e 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -2348,11 +2348,10 @@ location_store(struct mddev *mddev, const char *buf, size_t len) { int rv;
- rv = mddev_lock(mddev); + rv = mddev_suspend_and_lock(mddev); if (rv) return rv;
- mddev_suspend(mddev); if (mddev->pers) { if (mddev->recovery || mddev->sync_thread) { rv = -EBUSY; @@ -2429,8 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len) } rv = 0; out: - mddev_resume(mddev); - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); if (rv) return rv; return len;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 1b172e0b11c00e89c5df72c2761b3d4d279fbb4d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Convert to use new apis, the old apis will be removed eventually.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-9-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5-cache.c | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c index 889bba60d6ff..9909110262ee 100644 --- a/drivers/md/raid5-cache.c +++ b/drivers/md/raid5-cache.c @@ -686,7 +686,6 @@ static void r5c_disable_writeback_async(struct work_struct *work) disable_writeback_work); struct mddev *mddev = log->rdev->mddev; struct r5conf *conf = mddev->private; - int locked = 0;
if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) return; @@ -696,13 +695,13 @@ static void r5c_disable_writeback_async(struct work_struct *work) /* wait superblock change before suspend */ wait_event(mddev->sb_wait, !READ_ONCE(conf->log) || - (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && - (locked = mddev_trylock(mddev)))); - if (locked) { - mddev_suspend(mddev); + !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); + + log = READ_ONCE(conf->log); + if (log) { + __mddev_suspend(mddev, false); log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH; - mddev_resume(mddev); - mddev_unlock(mddev); + __mddev_resume(mddev); } }
@@ -2586,9 +2585,7 @@ int r5c_journal_mode_set(struct mddev *mddev, int mode) mode == R5C_JOURNAL_MODE_WRITE_BACK) return -EINVAL;
- mddev_suspend(mddev); conf->log->r5c_journal_mode = mode; - mddev_resume(mddev);
pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n", mdname(mddev), mode, r5c_journal_mode_str[mode]); @@ -2613,11 +2610,11 @@ static ssize_t r5c_journal_mode_store(struct mddev *mddev, if (strlen(r5c_journal_mode_str[mode]) == len && !strncmp(page, r5c_journal_mode_str[mode], len)) break; - ret = mddev_lock(mddev); + ret = mddev_suspend_and_lock(mddev); if (ret) return ret; ret = r5c_journal_mode_set(mddev, mode); - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return ret ?: length; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit e28ca92fbb5c6ed8fac5db65e02947bf020c5b60 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Convert to use new apis, the old apis will be removed eventually.
These are not hot path, so performance is not concerned.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-10-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5.c | 38 ++++++++++++-------------------------- 1 file changed, 12 insertions(+), 26 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 0644b83fd3f4..6e62d76dd552 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -7032,7 +7032,7 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len) new != roundup_pow_of_two(new)) return -EINVAL;
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) return err;
@@ -7056,7 +7056,6 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len) goto out_unlock; }
- mddev_suspend(mddev); mutex_lock(&conf->cache_size_mutex); size = conf->max_nr_stripes;
@@ -7071,10 +7070,9 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len) err = -ENOMEM; } mutex_unlock(&conf->cache_size_mutex); - mddev_resume(mddev);
out_unlock: - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return err ?: len; }
@@ -7160,7 +7158,7 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len) return -EINVAL; new = !!new;
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) return err; conf = mddev->private; @@ -7169,15 +7167,13 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len) else if (new != conf->skip_copy) { struct request_queue *q = mddev->queue;
- mddev_suspend(mddev); conf->skip_copy = new; if (new) blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q); else blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q); - mddev_resume(mddev); } - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return err ?: len; }
@@ -7232,15 +7228,13 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len) if (new > 8192) return -EINVAL;
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) return err; conf = mddev->private; if (!conf) err = -ENODEV; else if (new != conf->worker_cnt_per_group) { - mddev_suspend(mddev); - old_groups = conf->worker_groups; if (old_groups) flush_workqueue(raid5_wq); @@ -7257,9 +7251,8 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len) kfree(old_groups[0].workers); kfree(old_groups); } - mddev_resume(mddev); } - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev);
return err ?: len; } @@ -8981,12 +8974,12 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf) struct r5conf *conf; int err;
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) return err; conf = mddev->private; if (!conf) { - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return -ENODEV; }
@@ -8996,19 +8989,14 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf) err = log_init(conf, NULL, true); if (!err) { err = resize_stripes(conf, conf->pool_size); - if (err) { - mddev_suspend(mddev); + if (err) log_exit(conf); - mddev_resume(mddev); - } } } else err = -EINVAL; } else if (strncmp(buf, "resync", 6) == 0) { if (raid5_has_ppl(conf)) { - mddev_suspend(mddev); log_exit(conf); - mddev_resume(mddev); err = resize_stripes(conf, conf->pool_size); } else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) && r5l_log_disk_error(conf)) { @@ -9021,11 +9009,9 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf) break; }
- if (!journal_dev_exists) { - mddev_suspend(mddev); + if (!journal_dev_exists) clear_bit(MD_HAS_JOURNAL, &mddev->flags); - mddev_resume(mddev); - } else /* need remove journal device first */ + else /* need remove journal device first */ err = -EBUSY; } else err = -EINVAL; @@ -9036,7 +9022,7 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf) if (!err) md_update_sb(mddev, 1);
- mddev_unlock(mddev); + mddev_unlock_and_resume(mddev);
return err; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 205669f37770772c1ae8c2dac1eba6da521a8b77 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Convert to use new apis in following sysfs apis: - level_store - suspend_lo_store - suspend_hi_store - serialize_policy_store
These are not hot path, so performance is not concerned.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-11-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 24 ++++++++---------------- 1 file changed, 8 insertions(+), 16 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 4f9658a6b000..4e0bab861ad5 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4012,7 +4012,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len) if (slen == 0 || slen >= sizeof(clevel)) return -EINVAL;
- rv = mddev_lock(mddev); + rv = mddev_suspend_and_lock(mddev); if (rv) return rv;
@@ -4105,7 +4105,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len) }
/* Looks like we have a winner */ - mddev_suspend(mddev); mddev_detach(mddev);
spin_lock(&mddev->lock); @@ -4191,14 +4190,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len) blk_set_stacking_limits(&mddev->queue->limits); pers->run(mddev); set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); - mddev_resume(mddev); if (!mddev->thread) md_update_sb(mddev, 1); sysfs_notify_dirent_safe(mddev->sysfs_level); md_new_event(); rv = len; out_unlock: - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return rv; }
@@ -5286,15 +5284,13 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len) if (new != (sector_t)new) return -EINVAL;
- err = mddev_lock(mddev); + err = __mddev_suspend(mddev, true); if (err) return err;
- mddev_suspend(mddev); WRITE_ONCE(mddev->suspend_lo, new); - mddev_resume(mddev); + __mddev_resume(mddev);
- mddev_unlock(mddev); return len; } static struct md_sysfs_entry md_suspend_lo = @@ -5319,15 +5315,13 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len) if (new != (sector_t)new) return -EINVAL;
- err = mddev_lock(mddev); + err = __mddev_suspend(mddev, true); if (err) return err;
- mddev_suspend(mddev); WRITE_ONCE(mddev->suspend_hi, new); - mddev_resume(mddev); + __mddev_resume(mddev);
- mddev_unlock(mddev); return len; } static struct md_sysfs_entry md_suspend_hi = @@ -5575,7 +5569,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len) if (value == mddev->serialize_policy) return len;
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) return err; if (mddev->pers == NULL || (mddev->pers->level != 1)) { @@ -5584,15 +5578,13 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len) goto unlock; }
- mddev_suspend(mddev); if (value) mddev_create_serial_pool(mddev, NULL, true); else mddev_destroy_serial_pool(mddev, NULL, true); mddev->serialize_policy = value; - mddev_resume(mddev); unlock: - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return err ?: len; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit cfa078c8b80d0daf8f2fd4a2ab8e26fa8c33bca1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
User can write 'remove' and 're-add' to trigger array reconfiguration through sysfs, suspend array in this case so that io won't concurrent with array reconfiguration.
And now that all the caller of add_bound_rdev() alread suspend the array, remove mddev_suspend/resume() from add_bound_rdev() as well.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-12-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 4e0bab861ad5..05eeadc12c34 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2939,11 +2939,7 @@ static int add_bound_rdev(struct md_rdev *rdev) */ super_types[mddev->major_version]. validate_super(mddev, rdev); - if (add_journal) - mddev_suspend(mddev); err = mddev->pers->hot_add_disk(mddev, rdev); - if (add_journal) - mddev_resume(mddev); if (err) { md_kick_rdev_from_array(rdev); return err; @@ -3696,6 +3692,7 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr, struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr); struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj); struct kernfs_node *kn = NULL; + bool suspend = false; ssize_t rv; struct mddev *mddev = rdev->mddev;
@@ -3703,17 +3700,23 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr, return -EIO; if (!capable(CAP_SYS_ADMIN)) return -EACCES; + if (!mddev) + return -ENODEV;
- if (entry->store == state_store && cmd_match(page, "remove")) - kn = sysfs_break_active_protection(kobj, attr); + if (entry->store == state_store) { + if (cmd_match(page, "remove")) + kn = sysfs_break_active_protection(kobj, attr); + if (cmd_match(page, "remove") || cmd_match(page, "re-add")) + suspend = true; + }
- rv = mddev ? mddev_lock(mddev) : -ENODEV; + rv = suspend ? mddev_suspend_and_lock(mddev) : mddev_lock(mddev); if (!rv) { if (rdev->mddev == NULL) rv = -ENODEV; else rv = entry->store(rdev, page, length); - mddev_unlock(mddev); + suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); }
if (kn)
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 1b0a2d950ee2a54aa04fb31ead32144be0bbf690 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
'reconfig_mutex' will be grabbed before these ioctls, suspend array before holding the lock, so that io won't concurrent with array reconfiguration through ioctls.
This is not hot path, so performance is not concerned.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-13-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 30 ++++++++++++++++++++---------- 1 file changed, 20 insertions(+), 10 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 05eeadc12c34..54d75c2ae2b3 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7208,7 +7208,6 @@ static int set_bitmap_file(struct mddev *mddev, int fd) struct bitmap *bitmap;
bitmap = md_bitmap_create(mddev, -1); - mddev_suspend(mddev); if (!IS_ERR(bitmap)) { mddev->bitmap = bitmap; err = md_bitmap_load(mddev); @@ -7218,11 +7217,8 @@ static int set_bitmap_file(struct mddev *mddev, int fd) md_bitmap_destroy(mddev); fd = -1; } - mddev_resume(mddev); } else if (fd < 0) { - mddev_suspend(mddev); md_bitmap_destroy(mddev); - mddev_resume(mddev); } } if (fd < 0) { @@ -7511,7 +7507,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) mddev->bitmap_info.space = mddev->bitmap_info.default_space; bitmap = md_bitmap_create(mddev, -1); - mddev_suspend(mddev); if (!IS_ERR(bitmap)) { mddev->bitmap = bitmap; rv = md_bitmap_load(mddev); @@ -7519,7 +7514,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) rv = PTR_ERR(bitmap); if (rv) md_bitmap_destroy(mddev); - mddev_resume(mddev); } else { /* remove the bitmap */ if (!mddev->bitmap) { @@ -7544,9 +7538,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) module_put(md_cluster_mod); mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY; } - mddev_suspend(mddev); md_bitmap_destroy(mddev); - mddev_resume(mddev); mddev->bitmap_info.offset = 0; } } @@ -7617,6 +7609,20 @@ static inline bool md_ioctl_valid(unsigned int cmd) } }
+static bool md_ioctl_need_suspend(unsigned int cmd) +{ + switch (cmd) { + case ADD_NEW_DISK: + case HOT_ADD_DISK: + case HOT_REMOVE_DISK: + case SET_BITMAP_FILE: + case SET_ARRAY_INFO: + return true; + default: + return false; + } +} + static int __md_set_array_info(struct mddev *mddev, void __user *argp) { mdu_array_info_t info; @@ -7749,7 +7755,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode, if (!md_is_rdwr(mddev)) flush_work(&mddev->sync_work);
- err = mddev_lock(mddev); + err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) : + mddev_lock(mddev); if (err) { pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n", err, cmd); @@ -7877,7 +7884,10 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode, if (mddev->hold_active == UNTIL_IOCTL && err != -EINVAL) mddev->hold_active = 0; - mddev_unlock(mddev); + + md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) : + mddev_unlock(mddev); + out: if(did_set_md_closing) clear_bit(MD_CLOSING, &mddev->flags);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 58226942ad3d1423785f815ede5bcc832574f2ea category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
mddev_create/destroy_serial_pool() will be called from several places where mddev_suspend() will be called later.
Prepare to remove the mddev_suspend() from mddev_create/destroy_serial_pool().
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-14-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-autodetect.c | 4 ++-- drivers/md/md-bitmap.c | 8 ++++---- drivers/md/md.c | 22 ++++++++++++---------- 3 files changed, 18 insertions(+), 16 deletions(-)
diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c index 6eaa0eab40f9..4b80165afd23 100644 --- a/drivers/md/md-autodetect.c +++ b/drivers/md/md-autodetect.c @@ -175,7 +175,7 @@ static void __init md_setup_drive(struct md_setup_args *args) return; }
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) { pr_err("md: failed to lock array %s\n", name); goto out_mddev_put; @@ -221,7 +221,7 @@ static void __init md_setup_drive(struct md_setup_args *args) if (err) pr_warn("md: starting %s failed\n", name); out_unlock: - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); out_mddev_put: mddev_put(mddev); } diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 7d21e2a5b06e..b3d701c5c461 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -2537,7 +2537,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len) if (backlog > COUNTER_MAX) return -EINVAL;
- rv = mddev_lock(mddev); + rv = mddev_suspend_and_lock(mddev); if (rv) return rv;
@@ -2562,16 +2562,16 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len) if (!backlog && mddev->serial_info_pool) { /* serial_info_pool is not needed if backlog is zero */ if (!mddev->serialize_policy) - mddev_destroy_serial_pool(mddev, NULL, false); + mddev_destroy_serial_pool(mddev, NULL, true); } else if (backlog && !mddev->serial_info_pool) { /* serial_info_pool is needed since backlog is not zero */ rdev_for_each(rdev, mddev) - mddev_create_serial_pool(mddev, rdev, false); + mddev_create_serial_pool(mddev, rdev, true); } if (old_mwb != backlog) md_bitmap_update_sb(mddev->bitmap);
- mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return len; }
diff --git a/drivers/md/md.c b/drivers/md/md.c index 54d75c2ae2b3..98c8efddfb0e 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2557,7 +2557,7 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev) pr_debug("md: bind<%s>\n", b);
if (mddev->raid_disks) - mddev_create_serial_pool(mddev, rdev, false); + mddev_create_serial_pool(mddev, rdev, true);
if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b))) goto fail; @@ -3076,11 +3076,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) } } else if (cmd_match(buf, "writemostly")) { set_bit(WriteMostly, &rdev->flags); - mddev_create_serial_pool(rdev->mddev, rdev, false); + mddev_create_serial_pool(rdev->mddev, rdev, true); need_update_sb = true; err = 0; } else if (cmd_match(buf, "-writemostly")) { - mddev_destroy_serial_pool(rdev->mddev, rdev, false); + mddev_destroy_serial_pool(rdev->mddev, rdev, true); clear_bit(WriteMostly, &rdev->flags); need_update_sb = true; err = 0; @@ -3706,7 +3706,9 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr, if (entry->store == state_store) { if (cmd_match(page, "remove")) kn = sysfs_break_active_protection(kobj, attr); - if (cmd_match(page, "remove") || cmd_match(page, "re-add")) + if (cmd_match(page, "remove") || cmd_match(page, "re-add") || + cmd_match(page, "writemostly") || + cmd_match(page, "-writemostly")) suspend = true; }
@@ -4677,7 +4679,7 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len) minor != MINOR(dev)) return -EOVERFLOW;
- err = mddev_lock(mddev); + err = mddev_suspend_and_lock(mddev); if (err) return err; if (mddev->persistent) { @@ -4698,14 +4700,14 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len) rdev = md_import_device(dev, -1, -1);
if (IS_ERR(rdev)) { - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); return PTR_ERR(rdev); } err = bind_rdev_to_array(rdev, mddev); out: if (err) export_rdev(rdev, mddev); - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); if (!err) md_new_event(); return err ? err : len; @@ -6642,13 +6644,13 @@ static void autorun_devices(int part) if (IS_ERR(mddev)) break;
- if (mddev_lock(mddev)) + if (mddev_suspend_and_lock(mddev)) pr_warn("md: %s locked, cannot run\n", mdname(mddev)); else if (mddev->raid_disks || mddev->major_version || !list_empty(&mddev->disks)) { pr_warn("md: %s already running, cannot run %pg\n", mdname(mddev), rdev0->bdev); - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); } else { pr_debug("md: created %s\n", mdname(mddev)); mddev->persistent = 1; @@ -6658,7 +6660,7 @@ static void autorun_devices(int part) export_rdev(rdev, mddev); } autorun_array(mddev); - mddev_unlock(mddev); + mddev_unlock_and_resume(mddev); } /* on success, candidates will be empty, on error * it won't...
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit b4128c00a653dfd08fbe3d26fcf4c8b4970a69ba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that except for stopping the array, all the callers already suspend the array, there is no need to suspend anymore, hence remove the second parameter.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 7 +++---- drivers/md/md-bitmap.c | 8 ++++---- drivers/md/md.c | 33 ++++++++++----------------------- 3 files changed, 17 insertions(+), 31 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index 2e80d30bcfbf..b1a805966bfb 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -815,10 +815,9 @@ extern void __mddev_resume(struct mddev *mddev);
extern void md_reload_sb(struct mddev *mddev, int raid_disk); extern void md_update_sb(struct mddev *mddev, int force); -extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev, - bool is_suspend); -extern void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev, - bool is_suspend); +extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev); +extern void mddev_destroy_serial_pool(struct mddev *mddev, + struct md_rdev *rdev); struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr); struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev);
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index b3d701c5c461..9672f75c3050 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -1861,7 +1861,7 @@ void md_bitmap_destroy(struct mddev *mddev)
md_bitmap_wait_behind_writes(mddev); if (!mddev->serialize_policy) - mddev_destroy_serial_pool(mddev, NULL, true); + mddev_destroy_serial_pool(mddev, NULL);
mutex_lock(&mddev->bitmap_info.mutex); spin_lock(&mddev->lock); @@ -1977,7 +1977,7 @@ int md_bitmap_load(struct mddev *mddev) goto out;
rdev_for_each(rdev, mddev) - mddev_create_serial_pool(mddev, rdev, true); + mddev_create_serial_pool(mddev, rdev);
if (mddev_is_clustered(mddev)) md_cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes); @@ -2562,11 +2562,11 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len) if (!backlog && mddev->serial_info_pool) { /* serial_info_pool is not needed if backlog is zero */ if (!mddev->serialize_policy) - mddev_destroy_serial_pool(mddev, NULL, true); + mddev_destroy_serial_pool(mddev, NULL); } else if (backlog && !mddev->serial_info_pool) { /* serial_info_pool is needed since backlog is not zero */ rdev_for_each(rdev, mddev) - mddev_create_serial_pool(mddev, rdev, true); + mddev_create_serial_pool(mddev, rdev); } if (old_mwb != backlog) md_bitmap_update_sb(mddev->bitmap); diff --git a/drivers/md/md.c b/drivers/md/md.c index 98c8efddfb0e..372b42d88e08 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -206,8 +206,7 @@ static int rdev_need_serial(struct md_rdev *rdev) * 1. rdev is the first device which return true from rdev_enable_serial. * 2. rdev is NULL, means we want to enable serialization for all rdevs. */ -void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev, - bool is_suspend) +void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev) { int ret = 0;
@@ -215,15 +214,12 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev, !test_bit(CollisionCheck, &rdev->flags)) return;
- if (!is_suspend) - mddev_suspend(mddev); - if (!rdev) ret = rdevs_init_serial(mddev); else ret = rdev_init_serial(rdev); if (ret) - goto abort; + return;
if (mddev->serial_info_pool == NULL) { /* @@ -238,10 +234,6 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev, pr_err("can't alloc memory pool for serialization\n"); } } - -abort: - if (!is_suspend) - mddev_resume(mddev); }
/* @@ -250,8 +242,7 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev, * 2. when bitmap is destroyed while policy is not enabled. * 3. for disable policy, the pool is destroyed only when no rdev needs it. */ -void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev, - bool is_suspend) +void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev) { if (rdev && !test_bit(CollisionCheck, &rdev->flags)) return; @@ -260,8 +251,6 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev, struct md_rdev *temp; int num = 0; /* used to track if other rdevs need the pool */
- if (!is_suspend) - mddev_suspend(mddev); rdev_for_each(temp, mddev) { if (!rdev) { if (!mddev->serialize_policy || @@ -283,8 +272,6 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev, mempool_destroy(mddev->serial_info_pool); mddev->serial_info_pool = NULL; } - if (!is_suspend) - mddev_resume(mddev); } }
@@ -2557,7 +2544,7 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev) pr_debug("md: bind<%s>\n", b);
if (mddev->raid_disks) - mddev_create_serial_pool(mddev, rdev, true); + mddev_create_serial_pool(mddev, rdev);
if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b))) goto fail; @@ -2609,7 +2596,7 @@ static void md_kick_rdev_from_array(struct md_rdev *rdev) bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk); list_del_rcu(&rdev->same_set); pr_debug("md: unbind<%pg>\n", rdev->bdev); - mddev_destroy_serial_pool(rdev->mddev, rdev, false); + mddev_destroy_serial_pool(rdev->mddev, rdev); rdev->mddev = NULL; sysfs_remove_link(&rdev->kobj, "block"); sysfs_put(rdev->sysfs_state); @@ -3076,11 +3063,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) } } else if (cmd_match(buf, "writemostly")) { set_bit(WriteMostly, &rdev->flags); - mddev_create_serial_pool(rdev->mddev, rdev, true); + mddev_create_serial_pool(rdev->mddev, rdev); need_update_sb = true; err = 0; } else if (cmd_match(buf, "-writemostly")) { - mddev_destroy_serial_pool(rdev->mddev, rdev, true); + mddev_destroy_serial_pool(rdev->mddev, rdev); clear_bit(WriteMostly, &rdev->flags); need_update_sb = true; err = 0; @@ -5584,9 +5571,9 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len) }
if (value) - mddev_create_serial_pool(mddev, NULL, true); + mddev_create_serial_pool(mddev, NULL); else - mddev_destroy_serial_pool(mddev, NULL, true); + mddev_destroy_serial_pool(mddev, NULL); mddev->serialize_policy = value; unlock: mddev_unlock_and_resume(mddev); @@ -6352,7 +6339,7 @@ static void __md_stop_writes(struct mddev *mddev) } /* disable policy to guarantee rdevs free resources for serialization */ mddev->serialize_policy = 0; - mddev_destroy_serial_pool(mddev, NULL, true); + mddev_destroy_serial_pool(mddev, NULL); }
void md_stop_writes(struct mddev *mddev)
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 1978c742f3e7ef359b83cb712f8559fcd32032d0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that caller already suspend the array, there is no need to suspend array in liner_add().
Note that mddev_suspend/resume() is not used anymore.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-16-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-linear.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c index 71ac99646827..66412397cef0 100644 --- a/drivers/md/md-linear.c +++ b/drivers/md/md-linear.c @@ -183,7 +183,6 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev) * in linear_congested(), therefore kfree_rcu() is used to free * oldconf until no one uses it anymore. */ - mddev_suspend(mddev); oldconf = rcu_dereference_protected(mddev->private, lockdep_is_held(&mddev->reconfig_mutex)); mddev->raid_disks++; @@ -192,7 +191,6 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev) rcu_assign_pointer(mddev->private, newconf); md_set_array_sectors(mddev, linear_size(mddev, 0, 0)); set_capacity_and_notify(mddev->gendisk, mddev->array_sectors); - mddev_resume(mddev); kfree_rcu(oldconf, rcu); return 0; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit b42cd7b3a20db692a8d24c2c756ad5008729976b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
raid5 is the only personality to suspend array in check_reshape() and start_reshape() callback, suspend and quiesce() callback can both wait for all normal io to be done, and prevent new io to be dispatched, the difference is that suspend is implemented in common layer, and quiesce() callback is implemented in raid5.
In order to cleanup all the usage of mddev_suspend(), the new apis __mddev_suspend() need to be called before 'reconfig_mutex' is held, and it's not good to affect all the personalities in common layer just for raid5. Hence replace suspend with quiesce() callaback, prepare to reomove all the users of mddev_suspend().
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-17-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 6e62d76dd552..7d7cf3d88fba 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -70,6 +70,8 @@ MODULE_PARM_DESC(devices_handle_discard_safely, "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions"); static struct workqueue_struct *raid5_wq;
+static void raid5_quiesce(struct mddev *mddev, int quiesce); + static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect) { int hash = (sect >> RAID5_STRIPE_SHIFT(conf)) & HASH_MASK; @@ -2499,15 +2501,12 @@ static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors) unsigned long cpu; int err = 0;
- /* - * Never shrink. And mddev_suspend() could deadlock if this is called - * from raid5d. In that case, scribble_disks and scribble_sectors - * should equal to new_disks and new_sectors - */ + /* Never shrink. */ if (conf->scribble_disks >= new_disks && conf->scribble_sectors >= new_sectors) return 0; - mddev_suspend(conf->mddev); + + raid5_quiesce(conf->mddev, true); cpus_read_lock();
for_each_present_cpu(cpu) { @@ -2521,7 +2520,8 @@ static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors) }
cpus_read_unlock(); - mddev_resume(conf->mddev); + raid5_quiesce(conf->mddev, false); + if (!err) { conf->scribble_disks = new_disks; conf->scribble_sectors = new_sectors; @@ -8558,8 +8558,8 @@ static int raid5_start_reshape(struct mddev *mddev) * the reshape wasn't running - like Discard or Read - have * completed. */ - mddev_suspend(mddev); - mddev_resume(mddev); + raid5_quiesce(mddev, true); + raid5_quiesce(mddev, false);
/* Add some new drives, as many as will fit. * We know there are enough to make the newly sized array work.
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit bc08041b32abe6c9824f78735bac22018eabfc06 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
So that io won't concurrent with array reconfiguration, and it's safe to suspend the array directly because normal io won't rely on md_start_sync().
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-18-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 372b42d88e08..501ca7bfedfc 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9411,8 +9411,13 @@ static void md_start_sync(struct work_struct *ws) { struct mddev *mddev = container_of(ws, struct mddev, sync_work); int spares = 0; + bool suspend = false;
- mddev_lock_nointr(mddev); + if (md_spares_need_change(mddev)) + suspend = true; + + suspend ? mddev_suspend_and_lock_nointr(mddev) : + mddev_lock_nointr(mddev);
if (!md_is_rdwr(mddev)) { /* @@ -9448,7 +9453,7 @@ static void md_start_sync(struct work_struct *ws) goto not_running; }
- mddev_unlock(mddev); + suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); md_wakeup_thread(mddev->sync_thread); sysfs_notify_dirent_safe(mddev->sysfs_action); md_new_event(); @@ -9460,7 +9465,7 @@ static void md_start_sync(struct work_struct *ws) clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); - mddev_unlock(mddev); + suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
wake_up(&resync_wait); if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 4717c02875221d43d73ffed07f960d881bc00fc8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that mddev_suspend() and mddev_resume() is not used anywhere, remove them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-19-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 8 ----- drivers/md/md.c | 82 ++----------------------------------------------- 2 files changed, 3 insertions(+), 87 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index b1a805966bfb..c63e2e8965fd 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -246,10 +246,6 @@ struct md_cluster_info; * become failed. * @MD_HAS_PPL: The raid array has PPL feature set. * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set. - * @MD_ALLOW_SB_UPDATE: md_check_recovery is allowed to update the metadata - * without taking reconfig_mutex. - * @MD_UPDATING_SB: md_check_recovery is updating the metadata without - * explicitly holding reconfig_mutex. * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that * array is ready yet. * @MD_BROKEN: This is used to stop writes and mark array as failed. @@ -266,8 +262,6 @@ enum mddev_flags { MD_FAILFAST_SUPPORTED, MD_HAS_PPL, MD_HAS_MULTIPLE_PPLS, - MD_ALLOW_SB_UPDATE, - MD_UPDATING_SB, MD_NOT_READY, MD_BROKEN, MD_DELETED, @@ -808,8 +802,6 @@ extern int md_rdev_init(struct md_rdev *rdev); extern void md_rdev_clear(struct md_rdev *rdev);
extern void md_handle_request(struct mddev *mddev, struct bio *bio); -extern void mddev_suspend(struct mddev *mddev); -extern void mddev_resume(struct mddev *mddev); extern int __mddev_suspend(struct mddev *mddev, bool interruptible); extern void __mddev_resume(struct mddev *mddev);
diff --git a/drivers/md/md.c b/drivers/md/md.c index 501ca7bfedfc..59b4055ca3b1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -418,74 +418,10 @@ static void md_submit_bio(struct bio *bio) md_handle_request(mddev, bio); }
-/* mddev_suspend makes sure no new requests are submitted - * to the device, and that any requests that have been submitted - * are completely handled. - * Once mddev_detach() is called and completes, the module will be - * completely unused. +/* + * Make sure no new requests are submitted to the device, and any requests that + * have been submitted are completely handled. */ -void mddev_suspend(struct mddev *mddev) -{ - struct md_thread *thread = rcu_dereference_protected(mddev->thread, - lockdep_is_held(&mddev->reconfig_mutex)); - - WARN_ON_ONCE(thread && current == thread->tsk); - - /* can't concurrent with __mddev_suspend() and __mddev_resume() */ - mutex_lock(&mddev->suspend_mutex); - if (mddev->suspended++) { - mutex_unlock(&mddev->suspend_mutex); - return; - } - - wake_up(&mddev->sb_wait); - set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); - percpu_ref_kill(&mddev->active_io); - - /* - * TODO: cleanup 'pers->prepare_suspend after all callers are replaced - * by __mddev_suspend(). - */ - if (mddev->pers && mddev->pers->prepare_suspend) - mddev->pers->prepare_suspend(mddev); - - wait_event(mddev->sb_wait, percpu_ref_is_zero(&mddev->active_io)); - clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags); - wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags)); - - del_timer_sync(&mddev->safemode_timer); - /* restrict memory reclaim I/O during raid array is suspend */ - mddev->noio_flag = memalloc_noio_save(); - - mutex_unlock(&mddev->suspend_mutex); -} -EXPORT_SYMBOL_GPL(mddev_suspend); - -void mddev_resume(struct mddev *mddev) -{ - lockdep_assert_held(&mddev->reconfig_mutex); - - /* can't concurrent with __mddev_suspend() and __mddev_resume() */ - mutex_lock(&mddev->suspend_mutex); - if (--mddev->suspended) { - mutex_unlock(&mddev->suspend_mutex); - return; - } - - /* entred the memalloc scope from mddev_suspend() */ - memalloc_noio_restore(mddev->noio_flag); - - percpu_ref_resurrect(&mddev->active_io); - wake_up(&mddev->sb_wait); - - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); - md_wakeup_thread(mddev->thread); - md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */ - - mutex_unlock(&mddev->suspend_mutex); -} -EXPORT_SYMBOL_GPL(mddev_resume); - int __mddev_suspend(struct mddev *mddev, bool interruptible) { int err = 0; @@ -9497,18 +9433,6 @@ static void md_start_sync(struct work_struct *ws) */ void md_check_recovery(struct mddev *mddev) { - if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) { - /* Write superblock - thread that called mddev_suspend() - * holds reconfig_mutex for us. - */ - set_bit(MD_UPDATING_SB, &mddev->flags); - smp_mb__after_atomic(); - if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags)) - md_update_sb(mddev, 0); - clear_bit_unlock(MD_UPDATING_SB, &mddev->flags); - wake_up(&mddev->sb_wait); - } - if (READ_ONCE(mddev->suspended)) return;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 2b16a52549d51937a98d82b07b4d83dce6c43683 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that the old apis are removed, __mddev_suspend/resume() can be renamed to their original names.
This is done by:
sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch] sed -i "s/__mddev_resume/mddev_resume/g" *.[ch]
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 12 ++++++------ drivers/md/dm-raid.c | 4 ++-- drivers/md/md.c | 18 +++++++++--------- drivers/md/raid5-cache.c | 4 ++-- 4 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index c63e2e8965fd..fb3fc236b397 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -802,8 +802,8 @@ extern int md_rdev_init(struct md_rdev *rdev); extern void md_rdev_clear(struct md_rdev *rdev);
extern void md_handle_request(struct mddev *mddev, struct bio *bio); -extern int __mddev_suspend(struct mddev *mddev, bool interruptible); -extern void __mddev_resume(struct mddev *mddev); +extern int mddev_suspend(struct mddev *mddev, bool interruptible); +extern void mddev_resume(struct mddev *mddev);
extern void md_reload_sb(struct mddev *mddev, int raid_disk); extern void md_update_sb(struct mddev *mddev, int force); @@ -851,27 +851,27 @@ static inline int mddev_suspend_and_lock(struct mddev *mddev) { int ret;
- ret = __mddev_suspend(mddev, true); + ret = mddev_suspend(mddev, true); if (ret) return ret;
ret = mddev_lock(mddev); if (ret) - __mddev_resume(mddev); + mddev_resume(mddev);
return ret; }
static inline void mddev_suspend_and_lock_nointr(struct mddev *mddev) { - __mddev_suspend(mddev, false); + mddev_suspend(mddev, false); mutex_lock(&mddev->reconfig_mutex); }
static inline void mddev_unlock_and_resume(struct mddev *mddev) { mddev_unlock(mddev); - __mddev_resume(mddev); + mddev_resume(mddev); }
struct mdu_array_info_s; diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 05dd6ccf6f48..a4692f8f98ee 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3797,7 +3797,7 @@ static void raid_postsuspend(struct dm_target *ti) if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery)) md_stop_writes(&rs->md);
- __mddev_suspend(&rs->md, false); + mddev_suspend(&rs->md, false); } }
@@ -4009,7 +4009,7 @@ static int raid_preresume(struct dm_target *ti) }
/* Check for any resize/reshape on @rs and adjust/initiate */ - /* Be prepared for __mddev_resume() in raid_resume() */ + /* Be prepared for mddev_resume() in raid_resume() */ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); if (mddev->recovery_cp && mddev->recovery_cp < MaxSector) { set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); diff --git a/drivers/md/md.c b/drivers/md/md.c index 59b4055ca3b1..0ceec28761e9 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -422,7 +422,7 @@ static void md_submit_bio(struct bio *bio) * Make sure no new requests are submitted to the device, and any requests that * have been submitted are completely handled. */ -int __mddev_suspend(struct mddev *mddev, bool interruptible) +int mddev_suspend(struct mddev *mddev, bool interruptible) { int err = 0;
@@ -473,9 +473,9 @@ int __mddev_suspend(struct mddev *mddev, bool interruptible) mutex_unlock(&mddev->suspend_mutex); return 0; } -EXPORT_SYMBOL_GPL(__mddev_suspend); +EXPORT_SYMBOL_GPL(mddev_suspend);
-void __mddev_resume(struct mddev *mddev) +void mddev_resume(struct mddev *mddev) { lockdep_assert_not_held(&mddev->reconfig_mutex);
@@ -486,7 +486,7 @@ void __mddev_resume(struct mddev *mddev) return; }
- /* entred the memalloc scope from __mddev_suspend() */ + /* entred the memalloc scope from mddev_suspend() */ memalloc_noio_restore(mddev->noio_flag);
percpu_ref_resurrect(&mddev->active_io); @@ -498,7 +498,7 @@ void __mddev_resume(struct mddev *mddev)
mutex_unlock(&mddev->suspend_mutex); } -EXPORT_SYMBOL_GPL(__mddev_resume); +EXPORT_SYMBOL_GPL(mddev_resume);
/* * Generic flush handling for md @@ -5212,12 +5212,12 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len) if (new != (sector_t)new) return -EINVAL;
- err = __mddev_suspend(mddev, true); + err = mddev_suspend(mddev, true); if (err) return err;
WRITE_ONCE(mddev->suspend_lo, new); - __mddev_resume(mddev); + mddev_resume(mddev);
return len; } @@ -5243,12 +5243,12 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len) if (new != (sector_t)new) return -EINVAL;
- err = __mddev_suspend(mddev, true); + err = mddev_suspend(mddev, true); if (err) return err;
WRITE_ONCE(mddev->suspend_hi, new); - __mddev_resume(mddev); + mddev_resume(mddev);
return len; } diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c index 9909110262ee..6157f5beb9fe 100644 --- a/drivers/md/raid5-cache.c +++ b/drivers/md/raid5-cache.c @@ -699,9 +699,9 @@ static void r5c_disable_writeback_async(struct work_struct *work)
log = READ_ONCE(conf->log); if (log) { - __mddev_suspend(mddev, false); + mddev_suspend(mddev, false); log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH; - __mddev_resume(mddev); + mddev_resume(mddev); } }
From: Denis Plotnikov den-plotnikov@yandex-team.ru
mainline inclusion from mainline-v6.7-rc1 commit 1bbe254e4336c0944dd4fb6f0b8c9665b81de50f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A new disk adding may end up with timeout and a new disk won't be added. Add returning the error in that case.
Found by Linux Verification Center (linuxtesting.org) with SVACE
Signed-off-by: Denis Plotnikov den-plotnikov@yandex-team.ru Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team... Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-cluster.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 1e26eb223349..8e36a0feec09 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -501,7 +501,7 @@ static void process_suspend_info(struct mddev *mddev, mddev->pers->quiesce(mddev, 0); }
-static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) +static int process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) { char disk_uuid[64]; struct md_cluster_info *cinfo = mddev->cluster_info; @@ -509,6 +509,7 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) char raid_slot[16]; char *envp[] = {event_name, disk_uuid, raid_slot, NULL}; int len; + int res = 0;
len = snprintf(disk_uuid, 64, "DEVICE_UUID="); sprintf(disk_uuid + len, "%pU", cmsg->uuid); @@ -517,9 +518,14 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) init_completion(&cinfo->newdisk_completion); set_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); kobject_uevent_env(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE, envp); - wait_for_completion_timeout(&cinfo->newdisk_completion, - NEW_DEV_TIMEOUT); + if (!wait_for_completion_timeout(&cinfo->newdisk_completion, + NEW_DEV_TIMEOUT)) { + pr_err("md-cluster(%s:%d): timeout on a new disk adding\n", + __func__, __LINE__); + res = -1; + } clear_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); + return res; }
@@ -594,7 +600,8 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) le64_to_cpu(msg->high)); break; case NEWDISK: - process_add_new_disk(mddev, msg); + if (process_add_new_disk(mddev, msg)) + ret = -1; break; case REMOVE: process_remove_disk(mddev, msg);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc1 commit 78b7b13f07a3ca16c03aa8bf63f51d6780e8e9e1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
pers->prepare_suspend() is not used anymore and can be removed.
Reverts following three commit:
- commit 431e61257d63 ("md: export md_is_rdwr() and is_md_suspended()") - commit 3e00777d5157 ("md: add a new api prepare_suspend() in md_personality") - commit 868bba54a3bc ("md/raid5: fix a deadlock in the case that reshape is interrupted")
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231016100240.540474-1-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 18 ------------------ drivers/md/md.c | 17 ++++++++++++++++- drivers/md/raid5.c | 44 +------------------------------------------- 3 files changed, 17 insertions(+), 62 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index fb3fc236b397..ade83af123a2 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -563,23 +563,6 @@ enum recovery_flags { MD_RESYNCING_REMOTE, /* remote node is running resync thread */ };
-enum md_ro_state { - MD_RDWR, - MD_RDONLY, - MD_AUTO_READ, - MD_MAX_STATE -}; - -static inline bool md_is_rdwr(struct mddev *mddev) -{ - return (mddev->ro == MD_RDWR); -} - -static inline bool is_md_suspended(struct mddev *mddev) -{ - return percpu_ref_is_dying(&mddev->active_io); -} - static inline int __must_check mddev_lock(struct mddev *mddev) { return mutex_lock_interruptible(&mddev->reconfig_mutex); @@ -639,7 +622,6 @@ struct md_personality int (*start_reshape) (struct mddev *mddev); void (*finish_reshape) (struct mddev *mddev); void (*update_reshape_pos) (struct mddev *mddev); - void (*prepare_suspend) (struct mddev *mddev); /* quiesce suspends or resumes internal processing. * 1 - stop new actions and wait for action io to complete * 0 - return to normal behaviour diff --git a/drivers/md/md.c b/drivers/md/md.c index 0ceec28761e9..7e46fba5d000 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -91,6 +91,18 @@ static void mddev_detach(struct mddev *mddev); static void export_rdev(struct md_rdev *rdev, struct mddev *mddev); static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
+enum md_ro_state { + MD_RDWR, + MD_RDONLY, + MD_AUTO_READ, + MD_MAX_STATE +}; + +static bool md_is_rdwr(struct mddev *mddev) +{ + return (mddev->ro == MD_RDWR); +} + /* * Default number of read corrections we'll attempt on an rdev * before ejecting it from the array. We divide the read error @@ -333,6 +345,10 @@ EXPORT_SYMBOL_GPL(md_new_event); static LIST_HEAD(all_mddevs); static DEFINE_SPINLOCK(all_mddevs_lock);
+static bool is_md_suspended(struct mddev *mddev) +{ + return percpu_ref_is_dying(&mddev->active_io); +} /* Rather than calling directly into the personality make_request function, * IO requests come here first so that we can check if the device is * being suspended pending a reconfiguration. @@ -9138,7 +9154,6 @@ void md_do_sync(struct md_thread *thread) spin_unlock(&mddev->lock);
wake_up(&resync_wait); - wake_up(&mddev->sb_wait); md_wakeup_thread(mddev->thread); return; } diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 7d7cf3d88fba..c84ccc97329b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5960,19 +5960,6 @@ static int add_all_stripe_bios(struct r5conf *conf, return ret; }
-static bool reshape_inprogress(struct mddev *mddev) -{ - return test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && - test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && - !test_bit(MD_RECOVERY_DONE, &mddev->recovery) && - !test_bit(MD_RECOVERY_INTR, &mddev->recovery); -} - -static bool reshape_disabled(struct mddev *mddev) -{ - return is_md_suspended(mddev) || !md_is_rdwr(mddev); -} - static enum stripe_result make_stripe_request(struct mddev *mddev, struct r5conf *conf, struct stripe_request_ctx *ctx, sector_t logical_sector, struct bio *bi) @@ -6004,8 +5991,7 @@ static enum stripe_result make_stripe_request(struct mddev *mddev, if (ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) { spin_unlock_irq(&conf->device_lock); - ret = STRIPE_SCHEDULE_AND_RETRY; - goto out; + return STRIPE_SCHEDULE_AND_RETRY; } } spin_unlock_irq(&conf->device_lock); @@ -6084,15 +6070,6 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
out_release: raid5_release_stripe(sh); -out: - if (ret == STRIPE_SCHEDULE_AND_RETRY && !reshape_inprogress(mddev) && - reshape_disabled(mddev)) { - bi->bi_status = BLK_STS_IOERR; - ret = STRIPE_FAIL; - pr_err("md/raid456:%s: io failed across reshape position while reshape can't make progress.\n", - mdname(mddev)); - } - return ret; }
@@ -9034,22 +9011,6 @@ static int raid5_start(struct mddev *mddev) return r5l_start(conf->log); }
-static void raid5_prepare_suspend(struct mddev *mddev) -{ - struct r5conf *conf = mddev->private; - - wait_event(mddev->sb_wait, !reshape_inprogress(mddev) || - percpu_ref_is_zero(&mddev->active_io)); - if (percpu_ref_is_zero(&mddev->active_io)) - return; - - /* - * Reshape is not in progress, and array is suspended, io that is - * waiting for reshpape can never be done. - */ - wake_up(&conf->wait_for_overlap); -} - static struct md_personality raid6_personality = { .name = "raid6", @@ -9070,7 +9031,6 @@ static struct md_personality raid6_personality = .check_reshape = raid6_check_reshape, .start_reshape = raid5_start_reshape, .finish_reshape = raid5_finish_reshape, - .prepare_suspend = raid5_prepare_suspend, .quiesce = raid5_quiesce, .takeover = raid6_takeover, .change_consistency_policy = raid5_change_consistency_policy, @@ -9095,7 +9055,6 @@ static struct md_personality raid5_personality = .check_reshape = raid5_check_reshape, .start_reshape = raid5_start_reshape, .finish_reshape = raid5_finish_reshape, - .prepare_suspend = raid5_prepare_suspend, .quiesce = raid5_quiesce, .takeover = raid5_takeover, .change_consistency_policy = raid5_change_consistency_policy, @@ -9121,7 +9080,6 @@ static struct md_personality raid4_personality = .check_reshape = raid5_check_reshape, .start_reshape = raid5_start_reshape, .finish_reshape = raid5_finish_reshape, - .prepare_suspend = raid5_prepare_suspend, .quiesce = raid5_quiesce, .takeover = raid4_takeover, .change_consistency_policy = raid5_change_consistency_policy,
From: Junxiao Bi junxiao.bi@oracle.com
mainline inclusion from mainline-next-20231220 commit d6e035aad6c09991da1c667fb83419329a3baed8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") introduced a hung bug and will be reverted in next patch, since the issue that commit is fixing is due to md superblock write is throttled by wbt, to fix it, we can have superblock write bypass block layer throttle.
Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Suggested-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Junxiao Bi junxiao.bi@oracle.com Reviewed-by: Logan Gunthorpe logang@deltatee.com Reviewed-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231108182216.73611-1-junxiao.bi@oracle.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 7e46fba5d000..2860a43e56be 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1014,9 +1014,10 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev, return;
bio = bio_alloc_bioset(rdev->meta_bdev ? rdev->meta_bdev : rdev->bdev, - 1, - REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA, - GFP_NOIO, &mddev->sync_set); + 1, + REQ_OP_WRITE | REQ_SYNC | REQ_IDLE | REQ_META + | REQ_PREFLUSH | REQ_FUA, + GFP_NOIO, &mddev->sync_set);
atomic_inc(&rdev->nr_pending);
From: Junxiao Bi junxiao.bi@oracle.com
mainline inclusion from mainline-next-20231220 commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
That commit introduced the following race and can cause system hung.
md_write_start: raid5d: // mddev->in_sync == 1 set "MD_SB_CHANGE_PENDING" // running before md_write_start wakeup it waiting "MD_SB_CHANGE_PENDING" cleared >>>>>>>>> hung wakeup mddev->thread ... waiting "MD_SB_CHANGE_PENDING" cleared
hung, raid5d should clear this flag
but get hung by same flag.
The issue reverted commit fixing is fixed by last patch in a new way.
Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Signed-off-by: Junxiao Bi junxiao.bi@oracle.com Reviewed-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231108182216.73611-2-junxiao.bi@oracle.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5.c | 12 ------------ 1 file changed, 12 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c84ccc97329b..3302191e4255 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -36,7 +36,6 @@ */
#include <linux/blkdev.h> -#include <linux/delay.h> #include <linux/kthread.h> #include <linux/raid/pq.h> #include <linux/async_tx.h> @@ -6820,18 +6819,7 @@ static void raid5d(struct md_thread *thread) spin_unlock_irq(&conf->device_lock); md_check_recovery(mddev); spin_lock_irq(&conf->device_lock); - - /* - * Waiting on MD_SB_CHANGE_PENDING below may deadlock - * seeing md_check_recovery() is needed to clear - * the flag when using mdmon. - */ - continue; } - - wait_event_lock_irq(mddev->sb_wait, - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags), - conf->device_lock); } pr_debug("%d stripes handled\n", handled);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-next-20231220 commit c891f1fd90e66e584bb1353e1859cef7c9eb36f8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
rcu is not used correctly here, because synchronize_rcu() is called before replacing old value, for example:
remove_and_add_spares // other path synchronize_rcu // called before replacing old value set_bit(RemoveSynchronized) rcu_read_lock() rdev = conf->mirros[].rdev pers->hot_remove_disk conf->mirros[].rdev = NULL; if (!test_bit(RemoveSynchronized)) synchronize_rcu /* * won't be called, and won't wait * for concurrent readers to be done. */ // access rdev after remove_and_add_spares() rcu_read_unlock()
Fortunately, there is a separate rcu protection to prevent such rdev to be freed:
md_kick_rdev_from_array //other path rcu_read_lock() rdev = conf->mirros[].rdev list_del_rcu(&rdev->same_set)
rcu_read_unlock() /* * rdev can be removed from conf, but * rdev won't be freed. */ synchronize_rcu() free rdev
Hence remove this useless flag and prepare to remove rcu protection to access rdev from 'conf'.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 5 ----- drivers/md/md-multipath.c | 9 --------- drivers/md/md.c | 37 ++++++------------------------------- drivers/md/raid1.c | 9 --------- drivers/md/raid10.c | 9 --------- drivers/md/raid5.c | 9 --------- 6 files changed, 6 insertions(+), 72 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index ade83af123a2..8d881cc59799 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -190,11 +190,6 @@ enum flag_bits { * than other devices in the array */ ClusterRemove, - RemoveSynchronized, /* synchronize_rcu() was called after - * this device was known to be faulty, - * so it is safe to remove without - * another synchronize_rcu() call. - */ ExternalBbl, /* External metadata provides bad * block management for a disk */ diff --git a/drivers/md/md-multipath.c b/drivers/md/md-multipath.c index d22276870283..aa77133f3188 100644 --- a/drivers/md/md-multipath.c +++ b/drivers/md/md-multipath.c @@ -258,15 +258,6 @@ static int multipath_remove_disk(struct mddev *mddev, struct md_rdev *rdev) goto abort; } p->rdev = NULL; - if (!test_bit(RemoveSynchronized, &rdev->flags)) { - synchronize_rcu(); - if (atomic_read(&rdev->nr_pending)) { - /* lost the race, try later */ - err = -EBUSY; - p->rdev = rdev; - goto abort; - } - } err = md_integrity_register(mddev); } abort: diff --git a/drivers/md/md.c b/drivers/md/md.c index 2860a43e56be..945a35ea3b5f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9245,44 +9245,19 @@ static int remove_and_add_spares(struct mddev *mddev, struct md_rdev *rdev; int spares = 0; int removed = 0; - bool remove_some = false;
if (this && test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) /* Mustn't remove devices when resync thread is running */ return 0;
rdev_for_each(rdev, mddev) { - if ((this == NULL || rdev == this) && - rdev->raid_disk >= 0 && - !test_bit(Blocked, &rdev->flags) && - test_bit(Faulty, &rdev->flags) && - atomic_read(&rdev->nr_pending)==0) { - /* Faulty non-Blocked devices with nr_pending == 0 - * never get nr_pending incremented, - * never get Faulty cleared, and never get Blocked set. - * So we can synchronize_rcu now rather than once per device - */ - remove_some = true; - set_bit(RemoveSynchronized, &rdev->flags); - } - } - - if (remove_some) - synchronize_rcu(); - rdev_for_each(rdev, mddev) { - if ((this == NULL || rdev == this) && - (test_bit(RemoveSynchronized, &rdev->flags) || - rdev_removeable(rdev))) { - if (mddev->pers->hot_remove_disk( - mddev, rdev) == 0) { - sysfs_unlink_rdev(mddev, rdev); - rdev->saved_raid_disk = rdev->raid_disk; - rdev->raid_disk = -1; - removed++; - } + if ((this == NULL || rdev == this) && rdev_removeable(rdev) && + !mddev->pers->hot_remove_disk(mddev, rdev)) { + sysfs_unlink_rdev(mddev, rdev); + rdev->saved_raid_disk = rdev->raid_disk; + rdev->raid_disk = -1; + removed++; } - if (remove_some && test_bit(RemoveSynchronized, &rdev->flags)) - clear_bit(RemoveSynchronized, &rdev->flags); }
if (removed && mddev->kobj.sd) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 35d12948e0a9..a678e0e6e102 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1863,15 +1863,6 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev) goto abort; } p->rdev = NULL; - if (!test_bit(RemoveSynchronized, &rdev->flags)) { - synchronize_rcu(); - if (atomic_read(&rdev->nr_pending)) { - /* lost the race, try later */ - err = -EBUSY; - p->rdev = rdev; - goto abort; - } - } if (conf->mirrors[conf->raid_disks + number].rdev) { /* We just removed a device that is being replaced. * Move down the replacement. We drain all IO before diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index a5927e98dc67..132a79523338 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -2247,15 +2247,6 @@ static int raid10_remove_disk(struct mddev *mddev, struct md_rdev *rdev) goto abort; } *rdevp = NULL; - if (!test_bit(RemoveSynchronized, &rdev->flags)) { - synchronize_rcu(); - if (atomic_read(&rdev->nr_pending)) { - /* lost the race, try later */ - err = -EBUSY; - *rdevp = rdev; - goto abort; - } - } if (p->replacement) { /* We must have just cleared 'rdev' */ p->rdev = p->replacement; diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3302191e4255..37072adac05d 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -8229,15 +8229,6 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev) goto abort; } *rdevp = NULL; - if (!test_bit(RemoveSynchronized, &rdev->flags)) { - lockdep_assert_held(&mddev->reconfig_mutex); - synchronize_rcu(); - if (atomic_read(&rdev->nr_pending)) { - /* lost the race, try later */ - err = -EBUSY; - rcu_assign_pointer(*rdevp, rdev); - } - } if (!err) { err = log_modify(conf, rdev, false); if (err)
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-next-20231220 commit a448af25becf4b555660b5ba2618c7ed3c4de6da category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares().
And these will cover all the scenarios in raid10.
This patch also cleanup the code to handle the case that replacement replace rdev while IO is still inflight.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231125081604.3939938-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid10.c | 213 ++++++++++++-------------------------------- 1 file changed, 58 insertions(+), 155 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 132a79523338..375c11d6159f 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -743,7 +743,6 @@ static struct md_rdev *read_balance(struct r10conf *conf, struct geom *geo = &conf->geo;
raid10_find_phys(conf, r10_bio); - rcu_read_lock(); best_dist_slot = -1; min_pending = UINT_MAX; best_dist_rdev = NULL; @@ -775,18 +774,11 @@ static struct md_rdev *read_balance(struct r10conf *conf, if (r10_bio->devs[slot].bio == IO_BLOCKED) continue; disk = r10_bio->devs[slot].devnum; - rdev = rcu_dereference(conf->mirrors[disk].replacement); + rdev = conf->mirrors[disk].replacement; if (rdev == NULL || test_bit(Faulty, &rdev->flags) || r10_bio->devs[slot].addr + sectors > - rdev->recovery_offset) { - /* - * Read replacement first to prevent reading both rdev - * and replacement as NULL during replacement replace - * rdev. - */ - smp_mb(); - rdev = rcu_dereference(conf->mirrors[disk].rdev); - } + rdev->recovery_offset) + rdev = conf->mirrors[disk].rdev; if (rdev == NULL || test_bit(Faulty, &rdev->flags)) continue; @@ -876,7 +868,6 @@ static struct md_rdev *read_balance(struct r10conf *conf, r10_bio->read_slot = slot; } else rdev = NULL; - rcu_read_unlock(); *max_sectors = best_good_sectors;
return rdev; @@ -1198,9 +1189,8 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio, */ gfp = GFP_NOIO | __GFP_HIGH;
- rcu_read_lock(); disk = r10_bio->devs[slot].devnum; - err_rdev = rcu_dereference(conf->mirrors[disk].rdev); + err_rdev = conf->mirrors[disk].rdev; if (err_rdev) snprintf(b, sizeof(b), "%pg", err_rdev->bdev); else { @@ -1208,7 +1198,6 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio, /* This never gets dereferenced */ err_rdev = r10_bio->devs[slot].rdev; } - rcu_read_unlock(); }
if (!regular_request_wait(mddev, conf, bio, r10_bio->sectors)) @@ -1279,15 +1268,8 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio, int devnum = r10_bio->devs[n_copy].devnum; struct bio *mbio;
- if (replacement) { - rdev = conf->mirrors[devnum].replacement; - if (rdev == NULL) { - /* Replacement just got moved to main 'rdev' */ - smp_mb(); - rdev = conf->mirrors[devnum].rdev; - } - } else - rdev = conf->mirrors[devnum].rdev; + rdev = replacement ? conf->mirrors[devnum].replacement : + conf->mirrors[devnum].rdev;
mbio = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO, &mddev->bio_set); if (replacement) @@ -1321,25 +1303,6 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio, } }
-static struct md_rdev *dereference_rdev_and_rrdev(struct raid10_info *mirror, - struct md_rdev **prrdev) -{ - struct md_rdev *rdev, *rrdev; - - rrdev = rcu_dereference(mirror->replacement); - /* - * Read replacement first to prevent reading both rdev and - * replacement as NULL during replacement replace rdev. - */ - smp_mb(); - rdev = rcu_dereference(mirror->rdev); - if (rdev == rrdev) - rrdev = NULL; - - *prrdev = rrdev; - return rdev; -} - static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio) { int i; @@ -1348,11 +1311,11 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
retry_wait: blocked_rdev = NULL; - rcu_read_lock(); for (i = 0; i < conf->copies; i++) { struct md_rdev *rdev, *rrdev;
- rdev = dereference_rdev_and_rrdev(&conf->mirrors[i], &rrdev); + rdev = conf->mirrors[i].rdev; + rrdev = conf->mirrors[i].replacement; if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) { atomic_inc(&rdev->nr_pending); blocked_rdev = rdev; @@ -1391,7 +1354,6 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio) } } } - rcu_read_unlock();
if (unlikely(blocked_rdev)) { /* Have to wait for this device to get unblocked, then retry */ @@ -1474,14 +1436,14 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
wait_blocked_dev(mddev, r10_bio);
- rcu_read_lock(); max_sectors = r10_bio->sectors;
for (i = 0; i < conf->copies; i++) { int d = r10_bio->devs[i].devnum; struct md_rdev *rdev, *rrdev;
- rdev = dereference_rdev_and_rrdev(&conf->mirrors[d], &rrdev); + rdev = conf->mirrors[d].rdev; + rrdev = conf->mirrors[d].replacement; if (rdev && (test_bit(Faulty, &rdev->flags))) rdev = NULL; if (rrdev && (test_bit(Faulty, &rrdev->flags))) @@ -1535,7 +1497,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio, atomic_inc(&rrdev->nr_pending); } } - rcu_read_unlock();
if (max_sectors < r10_bio->sectors) r10_bio->sectors = max_sectors; @@ -1625,17 +1586,8 @@ static void raid10_end_discard_request(struct bio *bio) set_bit(R10BIO_Uptodate, &r10_bio->state);
dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl); - if (repl) - rdev = conf->mirrors[dev].replacement; - if (!rdev) { - /* - * raid10_remove_disk uses smp_mb to make sure rdev is set to - * replacement before setting replacement to NULL. It can read - * rdev first without barrier protect even replacement is NULL - */ - smp_rmb(); - rdev = conf->mirrors[dev].rdev; - } + rdev = repl ? conf->mirrors[dev].replacement : + conf->mirrors[dev].rdev;
raid_end_discard_bio(r10_bio); rdev_dec_pending(rdev, conf->mddev); @@ -1785,11 +1737,11 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio) * inc refcount on their rdev. Record them by setting * bios[x] to bio */ - rcu_read_lock(); for (disk = 0; disk < geo->raid_disks; disk++) { struct md_rdev *rdev, *rrdev;
- rdev = dereference_rdev_and_rrdev(&conf->mirrors[disk], &rrdev); + rdev = conf->mirrors[disk].rdev; + rrdev = conf->mirrors[disk].replacement; r10_bio->devs[disk].bio = NULL; r10_bio->devs[disk].repl_bio = NULL;
@@ -1809,7 +1761,6 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio) atomic_inc(&rrdev->nr_pending); } } - rcu_read_unlock();
atomic_set(&r10_bio->remaining, 1); for (disk = 0; disk < geo->raid_disks; disk++) { @@ -1939,6 +1890,8 @@ static void raid10_status(struct seq_file *seq, struct mddev *mddev) struct r10conf *conf = mddev->private; int i;
+ lockdep_assert_held(&mddev->lock); + if (conf->geo.near_copies < conf->geo.raid_disks) seq_printf(seq, " %dK chunks", mddev->chunk_sectors / 2); if (conf->geo.near_copies > 1) @@ -1953,12 +1906,11 @@ static void raid10_status(struct seq_file *seq, struct mddev *mddev) } seq_printf(seq, " [%d/%d] [", conf->geo.raid_disks, conf->geo.raid_disks - mddev->degraded); - rcu_read_lock(); for (i = 0; i < conf->geo.raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + struct md_rdev *rdev = READ_ONCE(conf->mirrors[i].rdev); + seq_printf(seq, "%s", rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_"); } - rcu_read_unlock(); seq_printf(seq, "]"); }
@@ -1980,7 +1932,6 @@ static int _enough(struct r10conf *conf, int previous, int ignore) ncopies = conf->geo.near_copies; }
- rcu_read_lock(); do { int n = conf->copies; int cnt = 0; @@ -1988,7 +1939,7 @@ static int _enough(struct r10conf *conf, int previous, int ignore) while (n--) { struct md_rdev *rdev; if (this != ignore && - (rdev = rcu_dereference(conf->mirrors[this].rdev)) && + (rdev = conf->mirrors[this].rdev) && test_bit(In_sync, &rdev->flags)) cnt++; this = (this+1) % disks; @@ -1999,7 +1950,6 @@ static int _enough(struct r10conf *conf, int previous, int ignore) } while (first != 0); has_enough = 1; out: - rcu_read_unlock(); return has_enough; }
@@ -2072,8 +2022,7 @@ static void print_conf(struct r10conf *conf) pr_debug(" --- wd:%d rd:%d\n", conf->geo.raid_disks - conf->mddev->degraded, conf->geo.raid_disks);
- /* This is only called with ->reconfix_mutex held, so - * rcu protection of rdev is not needed */ + lockdep_assert_held(&conf->mddev->reconfig_mutex); for (i = 0; i < conf->geo.raid_disks; i++) { rdev = conf->mirrors[i].rdev; if (rdev) @@ -2190,7 +2139,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev) err = 0; if (rdev->saved_raid_disk != mirror) conf->fullsync = 1; - rcu_assign_pointer(p->rdev, rdev); + WRITE_ONCE(p->rdev, rdev); break; }
@@ -2204,7 +2153,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev) disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9); conf->fullsync = 1; - rcu_assign_pointer(p->replacement, rdev); + WRITE_ONCE(p->replacement, rdev); }
print_conf(conf); @@ -2246,15 +2195,12 @@ static int raid10_remove_disk(struct mddev *mddev, struct md_rdev *rdev) err = -EBUSY; goto abort; } - *rdevp = NULL; + WRITE_ONCE(*rdevp, NULL); if (p->replacement) { /* We must have just cleared 'rdev' */ - p->rdev = p->replacement; + WRITE_ONCE(p->rdev, p->replacement); clear_bit(Replacement, &p->replacement->flags); - smp_mb(); /* Make sure other CPUs may see both as identical - * but will never see neither -- if they are careful. - */ - p->replacement = NULL; + WRITE_ONCE(p->replacement, NULL); }
clear_bit(WantReplacement, &rdev->flags); @@ -2754,20 +2700,18 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 if (s > (PAGE_SIZE>>9)) s = PAGE_SIZE >> 9;
- rcu_read_lock(); do { sector_t first_bad; int bad_sectors;
d = r10_bio->devs[sl].devnum; - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; if (rdev && test_bit(In_sync, &rdev->flags) && !test_bit(Faulty, &rdev->flags) && is_badblock(rdev, r10_bio->devs[sl].addr + sect, s, &first_bad, &bad_sectors) == 0) { atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); success = sync_page_io(rdev, r10_bio->devs[sl].addr + sect, @@ -2775,7 +2719,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 conf->tmppage, REQ_OP_READ, false); rdev_dec_pending(rdev, mddev); - rcu_read_lock(); if (success) break; } @@ -2783,7 +2726,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 if (sl == conf->copies) sl = 0; } while (sl != slot); - rcu_read_unlock();
if (!success) { /* Cannot read from anywhere, just mark the block @@ -2807,20 +2749,18 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
start = sl; /* write it back and re-read */ - rcu_read_lock(); while (sl != slot) { if (sl==0) sl = conf->copies; sl--; d = r10_bio->devs[sl].devnum; - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; if (!rdev || test_bit(Faulty, &rdev->flags) || !test_bit(In_sync, &rdev->flags)) continue;
atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); if (r10_sync_page_io(rdev, r10_bio->devs[sl].addr + sect, @@ -2839,7 +2779,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 rdev->bdev); } rdev_dec_pending(rdev, mddev); - rcu_read_lock(); } sl = start; while (sl != slot) { @@ -2847,14 +2786,13 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 sl = conf->copies; sl--; d = r10_bio->devs[sl].devnum; - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; if (!rdev || test_bit(Faulty, &rdev->flags) || !test_bit(In_sync, &rdev->flags)) continue;
atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); switch (r10_sync_page_io(rdev, r10_bio->devs[sl].addr + sect, @@ -2882,9 +2820,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 }
rdev_dec_pending(rdev, mddev); - rcu_read_lock(); } - rcu_read_unlock();
sectors -= s; sect += s; @@ -3358,14 +3294,13 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, /* Completed a full sync so the replacements * are now fully recovered. */ - rcu_read_lock(); for (i = 0; i < conf->geo.raid_disks; i++) { struct md_rdev *rdev = - rcu_dereference(conf->mirrors[i].replacement); + conf->mirrors[i].replacement; + if (rdev) rdev->recovery_offset = MaxSector; } - rcu_read_unlock(); } conf->fullsync = 0; } @@ -3446,9 +3381,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, struct raid10_info *mirror = &conf->mirrors[i]; struct md_rdev *mrdev, *mreplace;
- rcu_read_lock(); - mrdev = rcu_dereference(mirror->rdev); - mreplace = rcu_dereference(mirror->replacement); + mrdev = mirror->rdev; + mreplace = mirror->replacement;
if (mrdev && (test_bit(Faulty, &mrdev->flags) || test_bit(In_sync, &mrdev->flags))) @@ -3456,22 +3390,18 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, if (mreplace && test_bit(Faulty, &mreplace->flags)) mreplace = NULL;
- if (!mrdev && !mreplace) { - rcu_read_unlock(); + if (!mrdev && !mreplace) continue; - }
still_degraded = 0; /* want to reconstruct this device */ rb2 = r10_bio; sect = raid10_find_virt(conf, sector_nr, i); - if (sect >= mddev->resync_max_sectors) { + if (sect >= mddev->resync_max_sectors) /* last stripe is not complete - don't * try to recover this sector. */ - rcu_read_unlock(); continue; - } /* Unless we are doing a full sync, or a replacement * we only need to recover the block if it is set in * the bitmap @@ -3487,14 +3417,12 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, * that there will never be anything to do here */ chunks_skipped = -1; - rcu_read_unlock(); continue; } if (mrdev) atomic_inc(&mrdev->nr_pending); if (mreplace) atomic_inc(&mreplace->nr_pending); - rcu_read_unlock();
r10_bio = raid10_alloc_init_r10buf(conf); r10_bio->state = 0; @@ -3513,10 +3441,9 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, /* Need to check if the array will still be * degraded */ - rcu_read_lock(); for (j = 0; j < conf->geo.raid_disks; j++) { - struct md_rdev *rdev = rcu_dereference( - conf->mirrors[j].rdev); + struct md_rdev *rdev = conf->mirrors[j].rdev; + if (rdev == NULL || test_bit(Faulty, &rdev->flags)) { still_degraded = 1; break; @@ -3531,8 +3458,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, int k; int d = r10_bio->devs[j].devnum; sector_t from_addr, to_addr; - struct md_rdev *rdev = - rcu_dereference(conf->mirrors[d].rdev); + struct md_rdev *rdev = conf->mirrors[d].rdev; sector_t sector, first_bad; int bad_sectors; if (!rdev || @@ -3611,7 +3537,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, atomic_inc(&r10_bio->remaining); break; } - rcu_read_unlock(); if (j == conf->copies) { /* Cannot recover, so abort the recovery or * record a bad block */ @@ -3738,12 +3663,10 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
bio = r10_bio->devs[i].bio; bio->bi_status = BLK_STS_IOERR; - rcu_read_lock(); - rdev = rcu_dereference(conf->mirrors[d].rdev); - if (rdev == NULL || test_bit(Faulty, &rdev->flags)) { - rcu_read_unlock(); + rdev = conf->mirrors[d].rdev; + if (rdev == NULL || test_bit(Faulty, &rdev->flags)) continue; - } + sector = r10_bio->devs[i].addr; if (is_badblock(rdev, sector, max_sync, &first_bad, &bad_sectors)) { @@ -3753,7 +3676,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, bad_sectors -= (sector - first_bad); if (max_sync > bad_sectors) max_sync = bad_sectors; - rcu_read_unlock(); continue; } } @@ -3769,11 +3691,10 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, bio_set_dev(bio, rdev->bdev); count++;
- rdev = rcu_dereference(conf->mirrors[d].replacement); - if (rdev == NULL || test_bit(Faulty, &rdev->flags)) { - rcu_read_unlock(); + rdev = conf->mirrors[d].replacement; + if (rdev == NULL || test_bit(Faulty, &rdev->flags)) continue; - } + atomic_inc(&rdev->nr_pending);
/* Need to set up for writing to the replacement */ @@ -3790,7 +3711,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr, bio->bi_iter.bi_sector = sector + rdev->data_offset; bio_set_dev(bio, rdev->bdev); count++; - rcu_read_unlock(); }
if (count < 2) { @@ -4500,11 +4420,11 @@ static int calc_degraded(struct r10conf *conf) int degraded, degraded2; int i;
- rcu_read_lock(); degraded = 0; /* 'prev' section first */ for (i = 0; i < conf->prev.raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + struct md_rdev *rdev = conf->mirrors[i].rdev; + if (!rdev || test_bit(Faulty, &rdev->flags)) degraded++; else if (!test_bit(In_sync, &rdev->flags)) @@ -4514,13 +4434,12 @@ static int calc_degraded(struct r10conf *conf) */ degraded++; } - rcu_read_unlock(); if (conf->geo.raid_disks == conf->prev.raid_disks) return degraded; - rcu_read_lock(); degraded2 = 0; for (i = 0; i < conf->geo.raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + struct md_rdev *rdev = conf->mirrors[i].rdev; + if (!rdev || test_bit(Faulty, &rdev->flags)) degraded2++; else if (!test_bit(In_sync, &rdev->flags)) { @@ -4533,7 +4452,6 @@ static int calc_degraded(struct r10conf *conf) degraded2++; } } - rcu_read_unlock(); if (degraded2 > degraded) return degraded2; return degraded; @@ -4965,16 +4883,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, blist = read_bio; read_bio->bi_next = NULL;
- rcu_read_lock(); for (s = 0; s < conf->copies*2; s++) { struct bio *b; int d = r10_bio->devs[s/2].devnum; struct md_rdev *rdev2; if (s&1) { - rdev2 = rcu_dereference(conf->mirrors[d].replacement); + rdev2 = conf->mirrors[d].replacement; b = r10_bio->devs[s/2].repl_bio; } else { - rdev2 = rcu_dereference(conf->mirrors[d].rdev); + rdev2 = conf->mirrors[d].rdev; b = r10_bio->devs[s/2].bio; } if (!rdev2 || test_bit(Faulty, &rdev2->flags)) @@ -5008,7 +4925,6 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, sector_nr += len >> 9; nr_sectors += len >> 9; } - rcu_read_unlock(); r10_bio->sectors = nr_sectors;
/* Now submit the read */ @@ -5061,20 +4977,17 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio) struct bio *b; int d = r10_bio->devs[s/2].devnum; struct md_rdev *rdev; - rcu_read_lock(); if (s&1) { - rdev = rcu_dereference(conf->mirrors[d].replacement); + rdev = conf->mirrors[d].replacement; b = r10_bio->devs[s/2].repl_bio; } else { - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; b = r10_bio->devs[s/2].bio; } - if (!rdev || test_bit(Faulty, &rdev->flags)) { - rcu_read_unlock(); + if (!rdev || test_bit(Faulty, &rdev->flags)) continue; - } + atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); md_sync_acct_bio(b, r10_bio->sectors); atomic_inc(&r10_bio->remaining); b->bi_next = NULL; @@ -5145,10 +5058,9 @@ static int handle_reshape_read_error(struct mddev *mddev, if (s > (PAGE_SIZE >> 9)) s = PAGE_SIZE >> 9;
- rcu_read_lock(); while (!success) { int d = r10b->devs[slot].devnum; - struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev); + struct md_rdev *rdev = conf->mirrors[d].rdev; sector_t addr; if (rdev == NULL || test_bit(Faulty, &rdev->flags) || @@ -5157,14 +5069,12 @@ static int handle_reshape_read_error(struct mddev *mddev,
addr = r10b->devs[slot].addr + idx * PAGE_SIZE; atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); success = sync_page_io(rdev, addr, s << 9, pages[idx], REQ_OP_READ, false); rdev_dec_pending(rdev, mddev); - rcu_read_lock(); if (success) break; failed: @@ -5174,7 +5084,6 @@ static int handle_reshape_read_error(struct mddev *mddev, if (slot == first_slot) break; } - rcu_read_unlock(); if (!success) { /* couldn't read this block, must give up */ set_bit(MD_RECOVERY_INTR, @@ -5200,12 +5109,8 @@ static void end_reshape_write(struct bio *bio) struct md_rdev *rdev = NULL;
d = find_bio_disk(conf, r10_bio, bio, &slot, &repl); - if (repl) - rdev = conf->mirrors[d].replacement; - if (!rdev) { - smp_mb(); - rdev = conf->mirrors[d].rdev; - } + rdev = repl ? conf->mirrors[d].replacement : + conf->mirrors[d].rdev;
if (bio->bi_status) { /* FIXME should record badblock */ @@ -5240,18 +5145,16 @@ static void raid10_finish_reshape(struct mddev *mddev) mddev->resync_max_sectors = mddev->array_sectors; } else { int d; - rcu_read_lock(); for (d = conf->geo.raid_disks ; d < conf->geo.raid_disks - mddev->delta_disks; d++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev); + struct md_rdev *rdev = conf->mirrors[d].rdev; if (rdev) clear_bit(In_sync, &rdev->flags); - rdev = rcu_dereference(conf->mirrors[d].replacement); + rdev = conf->mirrors[d].replacement; if (rdev) clear_bit(In_sync, &rdev->flags); } - rcu_read_unlock(); } mddev->layout = mddev->new_layout; mddev->chunk_sectors = 1 << conf->geo.chunk_shift;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-next-20231220 commit 2d32777d60de81aa020a2431567020af26564c71 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares().
And these will cover all the scenarios in raid1.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231125081604.3939938-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid1.c | 62 +++++++++++++++++----------------------------- 1 file changed, 23 insertions(+), 39 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index a678e0e6e102..9348f1709512 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -609,7 +609,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect int choose_first; int choose_next_idle;
- rcu_read_lock(); /* * Check if we can balance. We can balance on the whole * device if no resync is going on, or below the resync window. @@ -642,7 +641,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect unsigned int pending; bool nonrot;
- rdev = rcu_dereference(conf->mirrors[disk].rdev); + rdev = conf->mirrors[disk].rdev; if (r1_bio->bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Faulty, &rdev->flags)) @@ -773,7 +772,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect }
if (best_disk >= 0) { - rdev = rcu_dereference(conf->mirrors[best_disk].rdev); + rdev = conf->mirrors[best_disk].rdev; if (!rdev) goto retry; atomic_inc(&rdev->nr_pending); @@ -784,7 +783,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
conf->mirrors[best_disk].next_seq_sect = this_sector + sectors; } - rcu_read_unlock(); *max_sectors = sectors;
return best_disk; @@ -1235,14 +1233,12 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
if (r1bio_existed) { /* Need to get the block device name carefully */ - struct md_rdev *rdev; - rcu_read_lock(); - rdev = rcu_dereference(conf->mirrors[r1_bio->read_disk].rdev); + struct md_rdev *rdev = conf->mirrors[r1_bio->read_disk].rdev; + if (rdev) snprintf(b, sizeof(b), "%pg", rdev->bdev); else strcpy(b, "???"); - rcu_read_unlock(); }
/* @@ -1396,10 +1392,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
disks = conf->raid_disks * 2; blocked_rdev = NULL; - rcu_read_lock(); max_sectors = r1_bio->sectors; for (i = 0; i < disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + struct md_rdev *rdev = conf->mirrors[i].rdev;
/* * The write-behind io is only attempted on drives marked as @@ -1465,7 +1460,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, } r1_bio->bios[i] = bio; } - rcu_read_unlock();
if (unlikely(blocked_rdev)) { /* Wait for this device to become unblocked */ @@ -1617,15 +1611,16 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) struct r1conf *conf = mddev->private; int i;
+ lockdep_assert_held(&mddev->lock); + seq_printf(seq, " [%d/%d] [", conf->raid_disks, conf->raid_disks - mddev->degraded); - rcu_read_lock(); for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + struct md_rdev *rdev = READ_ONCE(conf->mirrors[i].rdev); + seq_printf(seq, "%s", rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_"); } - rcu_read_unlock(); seq_printf(seq, "]"); }
@@ -1691,16 +1686,15 @@ static void print_conf(struct r1conf *conf) pr_debug(" --- wd:%d rd:%d\n", conf->raid_disks - conf->mddev->degraded, conf->raid_disks);
- rcu_read_lock(); + lockdep_assert_held(&conf->mddev->reconfig_mutex); for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev); + struct md_rdev *rdev = conf->mirrors[i].rdev; if (rdev) pr_debug(" disk %d, wo:%d, o:%d, dev:%pg\n", i, !test_bit(In_sync, &rdev->flags), !test_bit(Faulty, &rdev->flags), rdev->bdev); } - rcu_read_unlock(); }
static void close_sync(struct r1conf *conf) @@ -1810,7 +1804,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev) */ if (rdev->saved_raid_disk < 0) conf->fullsync = 1; - rcu_assign_pointer(p->rdev, rdev); + WRITE_ONCE(p->rdev, rdev); break; } if (test_bit(WantReplacement, &p->rdev->flags) && @@ -1826,7 +1820,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev) rdev->raid_disk = repl_slot; err = 0; conf->fullsync = 1; - rcu_assign_pointer(p[conf->raid_disks].rdev, rdev); + WRITE_ONCE(p[conf->raid_disks].rdev, rdev); }
print_conf(conf); @@ -1862,7 +1856,7 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev) err = -EBUSY; goto abort; } - p->rdev = NULL; + WRITE_ONCE(p->rdev, NULL); if (conf->mirrors[conf->raid_disks + number].rdev) { /* We just removed a device that is being replaced. * Move down the replacement. We drain all IO before @@ -1883,7 +1877,7 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev) goto abort; } clear_bit(Replacement, &repl->flags); - p->rdev = repl; + WRITE_ONCE(p->rdev, repl); conf->mirrors[conf->raid_disks + number].rdev = NULL; unfreeze_array(conf); } @@ -2281,8 +2275,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk, sector_t first_bad; int bad_sectors;
- rcu_read_lock(); - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; if (rdev && (test_bit(In_sync, &rdev->flags) || (!test_bit(Faulty, &rdev->flags) && @@ -2290,15 +2283,14 @@ static void fix_read_error(struct r1conf *conf, int read_disk, is_badblock(rdev, sect, s, &first_bad, &bad_sectors) == 0) { atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); if (sync_page_io(rdev, sect, s<<9, conf->tmppage, REQ_OP_READ, false)) success = 1; rdev_dec_pending(rdev, mddev); if (success) break; - } else - rcu_read_unlock(); + } + d++; if (d == conf->raid_disks * 2) d = 0; @@ -2317,29 +2309,24 @@ static void fix_read_error(struct r1conf *conf, int read_disk, if (d==0) d = conf->raid_disks * 2; d--; - rcu_read_lock(); - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; if (rdev && !test_bit(Faulty, &rdev->flags)) { atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); r1_sync_page_io(rdev, sect, s, conf->tmppage, WRITE); rdev_dec_pending(rdev, mddev); - } else - rcu_read_unlock(); + } } d = start; while (d != read_disk) { if (d==0) d = conf->raid_disks * 2; d--; - rcu_read_lock(); - rdev = rcu_dereference(conf->mirrors[d].rdev); + rdev = conf->mirrors[d].rdev; if (rdev && !test_bit(Faulty, &rdev->flags)) { atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); if (r1_sync_page_io(rdev, sect, s, conf->tmppage, READ)) { atomic_add(s, &rdev->corrected_errors); @@ -2350,8 +2337,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk, rdev->bdev); } rdev_dec_pending(rdev, mddev); - } else - rcu_read_unlock(); + } } sectors -= s; sect += s; @@ -2732,7 +2718,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
r1_bio = raid1_alloc_init_r1buf(conf);
- rcu_read_lock(); /* * If we get a correctably read error during resync or recovery, * we might want to read from a different device. So we @@ -2753,7 +2738,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr, struct md_rdev *rdev; bio = r1_bio->bios[i];
- rdev = rcu_dereference(conf->mirrors[i].rdev); + rdev = conf->mirrors[i].rdev; if (rdev == NULL || test_bit(Faulty, &rdev->flags)) { if (i < conf->raid_disks) @@ -2811,7 +2796,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr, bio->bi_opf |= MD_FAILFAST; } } - rcu_read_unlock(); if (disk < 0) disk = wonly; r1_bio->read_disk = disk;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-next-20231220 commit ad8606702f268903b26795e6b93605646fd1a6a8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares().
And these will cover all the scenarios in raid456.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231125081604.3939938-5-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5.h | 4 +- drivers/md/raid5-cache.c | 11 +-- drivers/md/raid5-ppl.c | 16 +--- drivers/md/raid5.c | 182 +++++++++++++-------------------------- 4 files changed, 69 insertions(+), 144 deletions(-)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 97a795979a35..9163c8cefb3f 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -473,8 +473,8 @@ enum { */
struct disk_info { - struct md_rdev __rcu *rdev; - struct md_rdev __rcu *replacement; + struct md_rdev *rdev; + struct md_rdev *replacement; struct page *extra_page; /* extra page to use in prexor */ };
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c index 6157f5beb9fe..874874fe4fa1 100644 --- a/drivers/md/raid5-cache.c +++ b/drivers/md/raid5-cache.c @@ -1890,28 +1890,22 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf, continue;
/* in case device is broken */ - rcu_read_lock(); - rdev = rcu_dereference(conf->disks[disk_index].rdev); + rdev = conf->disks[disk_index].rdev; if (rdev) { atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); sync_page_io(rdev, sh->sector, PAGE_SIZE, sh->dev[disk_index].page, REQ_OP_WRITE, false); rdev_dec_pending(rdev, rdev->mddev); - rcu_read_lock(); } - rrdev = rcu_dereference(conf->disks[disk_index].replacement); + rrdev = conf->disks[disk_index].replacement; if (rrdev) { atomic_inc(&rrdev->nr_pending); - rcu_read_unlock(); sync_page_io(rrdev, sh->sector, PAGE_SIZE, sh->dev[disk_index].page, REQ_OP_WRITE, false); rdev_dec_pending(rrdev, rrdev->mddev); - rcu_read_lock(); } - rcu_read_unlock(); } ctx->data_parity_stripes++; out: @@ -2948,7 +2942,6 @@ bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect) if (!log) return false;
- WARN_ON_ONCE(!rcu_read_lock_held()); tree_index = r5c_tree_index(conf, sect); slot = radix_tree_lookup(&log->big_stripe_tree, tree_index); return slot != NULL; diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c index eaea57aee602..da4ba736c4f0 100644 --- a/drivers/md/raid5-ppl.c +++ b/drivers/md/raid5-ppl.c @@ -620,11 +620,9 @@ static void ppl_do_flush(struct ppl_io_unit *io) struct md_rdev *rdev; struct block_device *bdev = NULL;
- rcu_read_lock(); - rdev = rcu_dereference(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; if (rdev && !test_bit(Faulty, &rdev->flags)) bdev = rdev->bdev; - rcu_read_unlock();
if (bdev) { struct bio *bio; @@ -882,9 +880,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e, (unsigned long long)r_sector, dd_idx, (unsigned long long)sector);
- /* Array has not started so rcu dereference is safe */ - rdev = rcu_dereference_protected( - conf->disks[dd_idx].rdev, 1); + rdev = conf->disks[dd_idx].rdev; if (!rdev || (!test_bit(In_sync, &rdev->flags) && sector >= rdev->recovery_offset)) { pr_debug("%s:%*s data member disk %d missing\n", @@ -936,9 +932,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e, 0, &disk, &sh); BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
- /* Array has not started so rcu dereference is safe */ - parity_rdev = rcu_dereference_protected( - conf->disks[sh.pd_idx].rdev, 1); + parity_rdev = conf->disks[sh.pd_idx].rdev;
BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev); pr_debug("%s:%*s write parity at sector %llu, disk %pg\n", @@ -1404,9 +1398,7 @@ int ppl_init_log(struct r5conf *conf)
for (i = 0; i < ppl_conf->count; i++) { struct ppl_log *log = &ppl_conf->child_logs[i]; - /* Array has not started so rcu dereference is safe */ - struct md_rdev *rdev = - rcu_dereference_protected(conf->disks[i].rdev, 1); + struct md_rdev *rdev = conf->disks[i].rdev;
mutex_init(&log->io_mutex); spin_lock_init(&log->io_list_lock); diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 37072adac05d..2af56a2ad573 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -693,12 +693,12 @@ int raid5_calc_degraded(struct r5conf *conf) int degraded, degraded2; int i;
- rcu_read_lock(); degraded = 0; for (i = 0; i < conf->previous_raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev); + struct md_rdev *rdev = READ_ONCE(conf->disks[i].rdev); + if (rdev && test_bit(Faulty, &rdev->flags)) - rdev = rcu_dereference(conf->disks[i].replacement); + rdev = READ_ONCE(conf->disks[i].replacement); if (!rdev || test_bit(Faulty, &rdev->flags)) degraded++; else if (test_bit(In_sync, &rdev->flags)) @@ -716,15 +716,14 @@ int raid5_calc_degraded(struct r5conf *conf) if (conf->raid_disks >= conf->previous_raid_disks) degraded++; } - rcu_read_unlock(); if (conf->raid_disks == conf->previous_raid_disks) return degraded; - rcu_read_lock(); degraded2 = 0; for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev); + struct md_rdev *rdev = READ_ONCE(conf->disks[i].rdev); + if (rdev && test_bit(Faulty, &rdev->flags)) - rdev = rcu_dereference(conf->disks[i].replacement); + rdev = READ_ONCE(conf->disks[i].replacement); if (!rdev || test_bit(Faulty, &rdev->flags)) degraded2++; else if (test_bit(In_sync, &rdev->flags)) @@ -738,7 +737,6 @@ int raid5_calc_degraded(struct r5conf *conf) if (conf->raid_disks <= conf->previous_raid_disks) degraded2++; } - rcu_read_unlock(); if (degraded2 > degraded) return degraded2; return degraded; @@ -1183,14 +1181,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) bi = &dev->req; rbi = &dev->rreq; /* For writing to replacement */
- rcu_read_lock(); - rrdev = rcu_dereference(conf->disks[i].replacement); - smp_mb(); /* Ensure that if rrdev is NULL, rdev won't be */ - rdev = rcu_dereference(conf->disks[i].rdev); - if (!rdev) { - rdev = rrdev; - rrdev = NULL; - } + rdev = conf->disks[i].rdev; + rrdev = conf->disks[i].replacement; if (op_is_write(op)) { if (replace_only) rdev = NULL; @@ -1211,7 +1203,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) rrdev = NULL; if (rrdev) atomic_inc(&rrdev->nr_pending); - rcu_read_unlock();
/* We have already checked bad blocks for reads. Now * need to check for writes. We never accept write errors @@ -2730,28 +2721,6 @@ static void shrink_stripes(struct r5conf *conf) conf->slab_cache = NULL; }
-/* - * This helper wraps rcu_dereference_protected() and can be used when - * it is known that the nr_pending of the rdev is elevated. - */ -static struct md_rdev *rdev_pend_deref(struct md_rdev __rcu *rdev) -{ - return rcu_dereference_protected(rdev, - atomic_read(&rcu_access_pointer(rdev)->nr_pending)); -} - -/* - * This helper wraps rcu_dereference_protected() and should be used - * when it is known that the mddev_lock() is held. This is safe - * seeing raid5_remove_disk() has the same lock held. - */ -static struct md_rdev *rdev_mdlock_deref(struct mddev *mddev, - struct md_rdev __rcu *rdev) -{ - return rcu_dereference_protected(rdev, - lockdep_is_held(&mddev->reconfig_mutex)); -} - static void raid5_end_read_request(struct bio * bi) { struct stripe_head *sh = bi->bi_private; @@ -2777,9 +2746,9 @@ static void raid5_end_read_request(struct bio * bi) * In that case it moved down to 'rdev'. * rdev is not removed until all requests are finished. */ - rdev = rdev_pend_deref(conf->disks[i].replacement); + rdev = conf->disks[i].replacement; if (!rdev) - rdev = rdev_pend_deref(conf->disks[i].rdev); + rdev = conf->disks[i].rdev;
if (use_new_offset(conf, sh)) s = sh->sector + rdev->new_data_offset; @@ -2892,11 +2861,11 @@ static void raid5_end_write_request(struct bio *bi)
for (i = 0 ; i < disks; i++) { if (bi == &sh->dev[i].req) { - rdev = rdev_pend_deref(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; break; } if (bi == &sh->dev[i].rreq) { - rdev = rdev_pend_deref(conf->disks[i].replacement); + rdev = conf->disks[i].replacement; if (rdev) replacement = 1; else @@ -2904,7 +2873,7 @@ static void raid5_end_write_request(struct bio *bi) * replaced it. rdev is not removed * until all requests are finished. */ - rdev = rdev_pend_deref(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; break; } } @@ -3666,15 +3635,13 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh, int bitmap_end = 0;
if (test_bit(R5_ReadError, &sh->dev[i].flags)) { - struct md_rdev *rdev; - rcu_read_lock(); - rdev = rcu_dereference(conf->disks[i].rdev); + struct md_rdev *rdev = conf->disks[i].rdev; + if (rdev && test_bit(In_sync, &rdev->flags) && !test_bit(Faulty, &rdev->flags)) atomic_inc(&rdev->nr_pending); else rdev = NULL; - rcu_read_unlock(); if (rdev) { if (!rdev_set_badblocks( rdev, @@ -3792,16 +3759,17 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh, /* During recovery devices cannot be removed, so * locking and refcounting of rdevs is not needed */ - rcu_read_lock(); for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev); + struct md_rdev *rdev = conf->disks[i].rdev; + if (rdev && !test_bit(Faulty, &rdev->flags) && !test_bit(In_sync, &rdev->flags) && !rdev_set_badblocks(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), 0)) abort = 1; - rdev = rcu_dereference(conf->disks[i].replacement); + rdev = conf->disks[i].replacement; + if (rdev && !test_bit(Faulty, &rdev->flags) && !test_bit(In_sync, &rdev->flags) @@ -3809,7 +3777,6 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh, RAID5_STRIPE_SECTORS(conf), 0)) abort = 1; } - rcu_read_unlock(); if (abort) conf->recovery_disabled = conf->mddev->recovery_disabled; @@ -3822,15 +3789,13 @@ static int want_replace(struct stripe_head *sh, int disk_idx) struct md_rdev *rdev; int rv = 0;
- rcu_read_lock(); - rdev = rcu_dereference(sh->raid_conf->disks[disk_idx].replacement); + rdev = sh->raid_conf->disks[disk_idx].replacement; if (rdev && !test_bit(Faulty, &rdev->flags) && !test_bit(In_sync, &rdev->flags) && (rdev->recovery_offset <= sh->sector || rdev->mddev->recovery_cp <= sh->sector)) rv = 1; - rcu_read_unlock(); return rv; }
@@ -4707,7 +4672,6 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) s->log_failed = r5l_log_disk_error(conf);
/* Now to look around and see what can be done */ - rcu_read_lock(); for (i=disks; i--; ) { struct md_rdev *rdev; sector_t first_bad; @@ -4752,7 +4716,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) /* Prefer to use the replacement for reads, but only * if it is recovered enough and has no bad blocks. */ - rdev = rcu_dereference(conf->disks[i].replacement); + rdev = conf->disks[i].replacement; if (rdev && !test_bit(Faulty, &rdev->flags) && rdev->recovery_offset >= sh->sector + RAID5_STRIPE_SECTORS(conf) && !is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), @@ -4763,7 +4727,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) set_bit(R5_NeedReplace, &dev->flags); else clear_bit(R5_NeedReplace, &dev->flags); - rdev = rcu_dereference(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; clear_bit(R5_ReadRepl, &dev->flags); } if (rdev && test_bit(Faulty, &rdev->flags)) @@ -4810,8 +4774,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) if (test_bit(R5_WriteError, &dev->flags)) { /* This flag does not apply to '.replacement' * only to .rdev, so make sure to check that*/ - struct md_rdev *rdev2 = rcu_dereference( - conf->disks[i].rdev); + struct md_rdev *rdev2 = conf->disks[i].rdev; + if (rdev2 == rdev) clear_bit(R5_Insync, &dev->flags); if (rdev2 && !test_bit(Faulty, &rdev2->flags)) { @@ -4823,8 +4787,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) if (test_bit(R5_MadeGood, &dev->flags)) { /* This flag does not apply to '.replacement' * only to .rdev, so make sure to check that*/ - struct md_rdev *rdev2 = rcu_dereference( - conf->disks[i].rdev); + struct md_rdev *rdev2 = conf->disks[i].rdev; + if (rdev2 && !test_bit(Faulty, &rdev2->flags)) { s->handle_bad_blocks = 1; atomic_inc(&rdev2->nr_pending); @@ -4832,8 +4796,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) clear_bit(R5_MadeGood, &dev->flags); } if (test_bit(R5_MadeGoodRepl, &dev->flags)) { - struct md_rdev *rdev2 = rcu_dereference( - conf->disks[i].replacement); + struct md_rdev *rdev2 = conf->disks[i].replacement; + if (rdev2 && !test_bit(Faulty, &rdev2->flags)) { s->handle_bad_blocks = 1; atomic_inc(&rdev2->nr_pending); @@ -4854,8 +4818,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) if (rdev && !test_bit(Faulty, &rdev->flags)) do_recovery = 1; else if (!rdev) { - rdev = rcu_dereference( - conf->disks[i].replacement); + rdev = conf->disks[i].replacement; if (rdev && !test_bit(Faulty, &rdev->flags)) do_recovery = 1; } @@ -4882,7 +4845,6 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) else s->replacing = 1; } - rcu_read_unlock(); }
/* @@ -5339,23 +5301,23 @@ static void handle_stripe(struct stripe_head *sh) struct r5dev *dev = &sh->dev[i]; if (test_and_clear_bit(R5_WriteError, &dev->flags)) { /* We own a safe reference to the rdev */ - rdev = rdev_pend_deref(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; if (!rdev_set_badblocks(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), 0)) md_error(conf->mddev, rdev); rdev_dec_pending(rdev, conf->mddev); } if (test_and_clear_bit(R5_MadeGood, &dev->flags)) { - rdev = rdev_pend_deref(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; rdev_clear_badblocks(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), 0); rdev_dec_pending(rdev, conf->mddev); } if (test_and_clear_bit(R5_MadeGoodRepl, &dev->flags)) { - rdev = rdev_pend_deref(conf->disks[i].replacement); + rdev = conf->disks[i].replacement; if (!rdev) /* rdev have been moved down */ - rdev = rdev_pend_deref(conf->disks[i].rdev); + rdev = conf->disks[i].rdev; rdev_clear_badblocks(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf), 0); rdev_dec_pending(rdev, conf->mddev); @@ -5514,24 +5476,22 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) &dd_idx, NULL); end_sector = sector + bio_sectors(raid_bio);
- rcu_read_lock(); if (r5c_big_stripe_cached(conf, sector)) - goto out_rcu_unlock; + return 0;
- rdev = rcu_dereference(conf->disks[dd_idx].replacement); + rdev = conf->disks[dd_idx].replacement; if (!rdev || test_bit(Faulty, &rdev->flags) || rdev->recovery_offset < end_sector) { - rdev = rcu_dereference(conf->disks[dd_idx].rdev); + rdev = conf->disks[dd_idx].rdev; if (!rdev) - goto out_rcu_unlock; + return 0; if (test_bit(Faulty, &rdev->flags) || !(test_bit(In_sync, &rdev->flags) || rdev->recovery_offset >= end_sector)) - goto out_rcu_unlock; + return 0; }
atomic_inc(&rdev->nr_pending); - rcu_read_unlock();
if (is_badblock(rdev, sector, bio_sectors(raid_bio), &first_bad, &bad_sectors)) { @@ -5575,10 +5535,6 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) raid_bio->bi_iter.bi_sector); submit_bio_noacct(align_bio); return 1; - -out_rcu_unlock: - rcu_read_unlock(); - return 0; }
static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio) @@ -6581,14 +6537,12 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n * Note in case of > 1 drive failures it's possible we're rebuilding * one drive while leaving another faulty drive in array. */ - rcu_read_lock(); for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev); + struct md_rdev *rdev = conf->disks[i].rdev;
if (rdev == NULL || test_bit(Faulty, &rdev->flags)) still_degraded = 1; } - rcu_read_unlock();
md_bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
@@ -7895,18 +7849,10 @@ static int raid5_run(struct mddev *mddev)
for (i = 0; i < conf->raid_disks && conf->previous_raid_disks; i++) { - rdev = rdev_mdlock_deref(mddev, conf->disks[i].rdev); - if (!rdev && conf->disks[i].replacement) { - /* The replacement is all we have yet */ - rdev = rdev_mdlock_deref(mddev, - conf->disks[i].replacement); - conf->disks[i].replacement = NULL; - clear_bit(Replacement, &rdev->flags); - rcu_assign_pointer(conf->disks[i].rdev, rdev); - } + rdev = conf->disks[i].rdev; if (!rdev) continue; - if (rcu_access_pointer(conf->disks[i].replacement) && + if (conf->disks[i].replacement && conf->reshape_progress != MaxSector) { /* replacements and reshape simply do not mix. */ pr_warn("md: cannot handle concurrent replacement and reshape.\n"); @@ -8090,15 +8036,16 @@ static void raid5_status(struct seq_file *seq, struct mddev *mddev) struct r5conf *conf = mddev->private; int i;
+ lockdep_assert_held(&mddev->lock); + seq_printf(seq, " level %d, %dk chunk, algorithm %d", mddev->level, conf->chunk_sectors / 2, mddev->layout); seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->raid_disks - mddev->degraded); - rcu_read_lock(); for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev); + struct md_rdev *rdev = READ_ONCE(conf->disks[i].rdev); + seq_printf (seq, "%s", rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_"); } - rcu_read_unlock(); seq_printf (seq, "]"); }
@@ -8136,9 +8083,8 @@ static int raid5_spare_active(struct mddev *mddev) unsigned long flags;
for (i = 0; i < conf->raid_disks; i++) { - rdev = rdev_mdlock_deref(mddev, conf->disks[i].rdev); - replacement = rdev_mdlock_deref(mddev, - conf->disks[i].replacement); + rdev = conf->disks[i].rdev; + replacement = conf->disks[i].replacement; if (replacement && replacement->recovery_offset == MaxSector && !test_bit(Faulty, &replacement->flags) @@ -8177,7 +8123,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev) struct r5conf *conf = mddev->private; int err = 0; int number = rdev->raid_disk; - struct md_rdev __rcu **rdevp; + struct md_rdev **rdevp; struct disk_info *p; struct md_rdev *tmp;
@@ -8200,9 +8146,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev) if (unlikely(number >= conf->pool_size)) return 0; p = conf->disks + number; - if (rdev == rcu_access_pointer(p->rdev)) + if (rdev == p->rdev) rdevp = &p->rdev; - else if (rdev == rcu_access_pointer(p->replacement)) + else if (rdev == p->replacement) rdevp = &p->replacement; else return 0; @@ -8222,28 +8168,24 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev) if (!test_bit(Faulty, &rdev->flags) && mddev->recovery_disabled != conf->recovery_disabled && !has_failed(conf) && - (!rcu_access_pointer(p->replacement) || - rcu_access_pointer(p->replacement) == rdev) && + (!p->replacement || p->replacement == rdev) && number < conf->raid_disks) { err = -EBUSY; goto abort; } - *rdevp = NULL; + WRITE_ONCE(*rdevp, NULL); if (!err) { err = log_modify(conf, rdev, false); if (err) goto abort; }
- tmp = rcu_access_pointer(p->replacement); + tmp = p->replacement; if (tmp) { /* We must have just cleared 'rdev' */ - rcu_assign_pointer(p->rdev, tmp); + WRITE_ONCE(p->rdev, tmp); clear_bit(Replacement, &tmp->flags); - smp_mb(); /* Make sure other CPUs may see both as identical - * but will never see neither - if they are careful - */ - rcu_assign_pointer(p->replacement, NULL); + WRITE_ONCE(p->replacement, NULL);
if (!err) err = log_modify(conf, tmp, true); @@ -8311,7 +8253,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev) rdev->raid_disk = disk; if (rdev->saved_raid_disk != disk) conf->fullsync = 1; - rcu_assign_pointer(p->rdev, rdev); + WRITE_ONCE(p->rdev, rdev);
err = log_modify(conf, rdev, true);
@@ -8320,7 +8262,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev) } for (disk = first; disk <= last; disk++) { p = conf->disks + disk; - tmp = rdev_mdlock_deref(mddev, p->rdev); + tmp = p->rdev; if (test_bit(WantReplacement, &tmp->flags) && mddev->reshape_position == MaxSector && p->replacement == NULL) { @@ -8329,7 +8271,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev) rdev->raid_disk = disk; err = 0; conf->fullsync = 1; - rcu_assign_pointer(p->replacement, rdev); + WRITE_ONCE(p->replacement, rdev); break; } } @@ -8462,7 +8404,7 @@ static int raid5_start_reshape(struct mddev *mddev) if (mddev->recovery_cp < MaxSector) return -EBUSY; for (i = 0; i < conf->raid_disks; i++) - if (rdev_mdlock_deref(mddev, conf->disks[i].replacement)) + if (conf->disks[i].replacement) return -EBUSY;
rdev_for_each(rdev, mddev) { @@ -8633,12 +8575,10 @@ static void raid5_finish_reshape(struct mddev *mddev) for (d = conf->raid_disks ; d < conf->raid_disks - mddev->delta_disks; d++) { - rdev = rdev_mdlock_deref(mddev, - conf->disks[d].rdev); + rdev = conf->disks[d].rdev; if (rdev) clear_bit(In_sync, &rdev->flags); - rdev = rdev_mdlock_deref(mddev, - conf->disks[d].replacement); + rdev = conf->disks[d].replacement; if (rdev) clear_bit(In_sync, &rdev->flags); }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-next-20231220 commit 7ecab28c3b2c26c1a51154d997433e8432eb49ba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array;
And these will cover all the scenarios in md-multipath.
Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231125081604.3939938-6-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md-multipath.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/drivers/md/md-multipath.c b/drivers/md/md-multipath.c index aa77133f3188..19c8625ea642 100644 --- a/drivers/md/md-multipath.c +++ b/drivers/md/md-multipath.c @@ -32,17 +32,15 @@ static int multipath_map (struct mpconf *conf) * now we use the first available disk. */
- rcu_read_lock(); for (i = 0; i < disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->multipaths[i].rdev); + struct md_rdev *rdev = conf->multipaths[i].rdev; + if (rdev && test_bit(In_sync, &rdev->flags) && !test_bit(Faulty, &rdev->flags)) { atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); return i; } } - rcu_read_unlock();
pr_crit_ratelimited("multipath_map(): no more operational IO paths?\n"); return (-1); @@ -137,14 +135,16 @@ static void multipath_status(struct seq_file *seq, struct mddev *mddev) struct mpconf *conf = mddev->private; int i;
+ lockdep_assert_held(&mddev->lock); + seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->raid_disks - mddev->degraded); - rcu_read_lock(); for (i = 0; i < conf->raid_disks; i++) { - struct md_rdev *rdev = rcu_dereference(conf->multipaths[i].rdev); - seq_printf (seq, "%s", rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_"); + struct md_rdev *rdev = READ_ONCE(conf->multipaths[i].rdev); + + seq_printf(seq, "%s", + rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_"); } - rcu_read_unlock(); seq_putc(seq, ']'); }
@@ -182,7 +182,7 @@ static void multipath_error (struct mddev *mddev, struct md_rdev *rdev) conf->raid_disks - mddev->degraded); }
-static void print_multipath_conf (struct mpconf *conf) +static void print_multipath_conf(struct mpconf *conf) { int i; struct multipath_info *tmp; @@ -195,6 +195,7 @@ static void print_multipath_conf (struct mpconf *conf) pr_debug(" --- wd:%d rd:%d\n", conf->raid_disks - conf->mddev->degraded, conf->raid_disks);
+ lockdep_assert_held(&conf->mddev->reconfig_mutex); for (i = 0; i < conf->raid_disks; i++) { tmp = conf->multipaths + i; if (tmp->rdev) @@ -231,7 +232,7 @@ static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev) rdev->raid_disk = path; set_bit(In_sync, &rdev->flags); spin_unlock_irq(&conf->device_lock); - rcu_assign_pointer(p->rdev, rdev); + WRITE_ONCE(p->rdev, rdev); err = 0; break; } @@ -257,7 +258,7 @@ static int multipath_remove_disk(struct mddev *mddev, struct md_rdev *rdev) err = -EBUSY; goto abort; } - p->rdev = NULL; + WRITE_ONCE(p->rdev, NULL); err = md_integrity_register(mddev); } abort:
From: David Jeffery djeffery@redhat.com
mainline inclusion from mainline-v6.7-rc5 commit c467e97f079f0019870c314996fae952cc768e82 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
During a reshape or a RAID6 array such as expanding by adding an additional disk, I/Os to the region of the array which have not yet been reshaped can stall indefinitely. This is from errors in the stripe_ahead_of_reshape function causing md to think the I/O is to a region in the actively undergoing the reshape.
stripe_ahead_of_reshape fails to account for the q disk having a sector value of 0. By not excluding the q disk from the for loop, raid6 will always generate a min_sector value of 0, causing a return value which stalls.
The function's max_sector calculation also uses min() when it should use max(), causing the max_sector value to always be 0. During a backwards rebuild this can cause the opposite problem where it allows I/O to advance when it should wait.
Fixing these errors will allow safe I/O to advance in a timely manner and delay only I/O which is unsafe due to stripes in the middle of undergoing the reshape.
Fixes: 486f60558607 ("md/raid5: Check all disks in a stripe_head for reshape progress") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: David Jeffery djeffery@redhat.com Tested-by: Laurence Oberman loberman@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231128181233.6187-1-djeffery@redhat.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid5.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2af56a2ad573..213f90f8bb8e 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5847,11 +5847,11 @@ static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf, int dd_idx;
for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) { - if (dd_idx == sh->pd_idx) + if (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx) continue;
min_sector = min(min_sector, sh->dev[dd_idx].sector); - max_sector = min(max_sector, sh->dev[dd_idx].sector); + max_sector = max(max_sector, sh->dev[dd_idx].sector); }
spin_lock_irq(&conf->device_lock);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-next-20231220 commit fa2bbff7b0b4e211fec5e5686ef96350690597b5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently rcu is used to protect iterating rdev from submit_flushes():
submit_flushes remove_and_add_spares synchronize_rcu pers->hot_remove_disk() rcu_read_lock() rdev_for_each_rcu if (rdev->raid_disk >= 0) rdev->radi_disk = -1; atomic_inc(&rdev->nr_pending) rcu_read_unlock() bi = bio_alloc_bioset() bi->bi_end_io = md_end_flush bi->private = rdev submit_bio // issue io for removed rdev
Fix this problem by grabbing 'acive_io' before iterating rdev, make sure that remove_and_add_spares() won't concurrent with submit_flushes().
Fixes: a2826aa92e2e ("md: support barrier requests on all personalities.") Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231129020234.1586910-1-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 945a35ea3b5f..4742991f0795 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -530,6 +530,9 @@ static void md_end_flush(struct bio *bio) rdev_dec_pending(rdev, mddev);
if (atomic_dec_and_test(&mddev->flush_pending)) { + /* The pair is percpu_ref_get() from md_flush_request() */ + percpu_ref_put(&mddev->active_io); + /* The pre-request flush has finished */ queue_work(md_wq, &mddev->flush_work); } @@ -549,12 +552,8 @@ static void submit_flushes(struct work_struct *ws) rdev_for_each_rcu(rdev, mddev) if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) { - /* Take two references, one is dropped - * when request finishes, one after - * we reclaim rcu_read_lock - */ struct bio *bi; - atomic_inc(&rdev->nr_pending); + atomic_inc(&rdev->nr_pending); rcu_read_unlock(); bi = bio_alloc_bioset(rdev->bdev, 0, @@ -565,7 +564,6 @@ static void submit_flushes(struct work_struct *ws) atomic_inc(&mddev->flush_pending); submit_bio(bi); rcu_read_lock(); - rdev_dec_pending(rdev, mddev); } rcu_read_unlock(); if (atomic_dec_and_test(&mddev->flush_pending)) @@ -618,6 +616,18 @@ bool md_flush_request(struct mddev *mddev, struct bio *bio) /* new request after previous flush is completed */ if (ktime_after(req_start, mddev->prev_flush_start)) { WARN_ON(mddev->flush_bio); + /* + * Grab a reference to make sure mddev_suspend() will wait for + * this flush to be done. + * + * md_flush_reqeust() is called under md_handle_request() and + * 'active_io' is already grabbed, hence percpu_ref_is_zero() + * won't pass, percpu_ref_tryget_live() can't be used because + * percpu_ref_kill() can be called by mddev_suspend() + * concurrently. + */ + WARN_ON(percpu_ref_is_zero(&mddev->active_io)); + percpu_ref_get(&mddev->active_io); mddev->flush_bio = bio; bio = NULL; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc5 commit f2d87a759f6841a132e845e2fafdad37385ddd30 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit ac619781967b ("md: use separate work_struct for md_start_sync()") use a new sync_work to replace del_work, however, stop_sync_thread() and __md_stop_writes() was trying to wait for sync_thread to be done, hence they should switch to use sync_work as well.
Noted that md_start_sync() from sync_work will grab 'reconfig_mutex', hence other contex can't held the same lock to flush work, and this will be fixed in later patches.
Fixes: ac619781967b ("md: use separate work_struct for md_start_sync()") Signed-off-by: Yu Kuai yukuai3@huawei.com Acked-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231205094215.1824240-2-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 4742991f0795..092c47e50f7a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4869,7 +4869,7 @@ static void stop_sync_thread(struct mddev *mddev) return; }
- if (work_pending(&mddev->del_work)) + if (work_pending(&mddev->sync_work)) flush_workqueue(md_misc_wq);
set_bit(MD_RECOVERY_INTR, &mddev->recovery); @@ -6277,7 +6277,7 @@ static void md_clean(struct mddev *mddev) static void __md_stop_writes(struct mddev *mddev) { set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - if (work_pending(&mddev->del_work)) + if (work_pending(&mddev->sync_work)) flush_workqueue(md_misc_wq); if (mddev->sync_thread) { set_bit(MD_RECOVERY_INTR, &mddev->recovery);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc5 commit c9f7cb5b2bc968adcdc686c197ed108f47fd8eb0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If md_set_readonly() failed, the array could still be read-write, however 'MD_RECOVERY_FROZEN' could still be set, which leave the array in an abnormal state that sync or recovery can't continue anymore. Hence make sure the flag is cleared after md_set_readonly() returns.
Fixes: 88724bfa68be ("md: wait for pending superblock updates before switching to read-only") Signed-off-by: Yu Kuai yukuai3@huawei.com Acked-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231205094215.1824240-3-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 092c47e50f7a..b02efcca8c1c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6367,6 +6367,9 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) int err = 0; int did_freeze = 0;
+ if (mddev->external && test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) + return -EBUSY; + if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) { did_freeze = 1; set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); @@ -6381,8 +6384,6 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) */ md_wakeup_thread_directly(mddev->sync_thread);
- if (mddev->external && test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) - return -EBUSY; mddev_unlock(mddev); wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); @@ -6395,29 +6396,30 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) mddev->sync_thread || test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { pr_warn("md: %s still in use.\n",mdname(mddev)); - if (did_freeze) { - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); - md_wakeup_thread(mddev->thread); - } err = -EBUSY; goto out; } + if (mddev->pers) { __md_stop_writes(mddev);
- err = -ENXIO; - if (mddev->ro == MD_RDONLY) + if (mddev->ro == MD_RDONLY) { + err = -ENXIO; goto out; + } + mddev->ro = MD_RDONLY; set_disk_ro(mddev->gendisk, 1); + } + +out: + if ((mddev->pers && !err) || did_freeze) { clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); sysfs_notify_dirent_safe(mddev->sysfs_state); - err = 0; } -out: + mutex_unlock(&mddev->open_mutex); return err; }
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc5 commit f52f5c71f3d4bb0992800139d2f35cf9f6f6e0ee category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently sync thread is stopped from multiple contex: - idle_sync_thread - frozen_sync_thread - __md_stop_writes - md_set_readonly - do_md_stop
And there are some problems: 1) sync_work is flushed while reconfig_mutex is grabbed, this can deadlock because the work function will grab reconfig_mutex as well. 2) md_reap_sync_thread() can't be called directly while md_do_sync() is not finished yet, for example, commit 130443d60b1b ("md: refactor idle/frozen_sync_thread() to fix deadlock"). 3) If MD_RECOVERY_RUNNING is not set, there is no need to stop sync_thread at all because sync_thread must not be registered.
Factor out a helper stop_sync_thread(), so that above contex will behave the same. Fix 1) by flushing sync_work after reconfig_mutex is released, before waiting for sync_thread to be done; Fix 2) bt letting daemon thread to unregister sync_thread; Fix 3) by always checking MD_RECOVERY_RUNNING first.
Fixes: db5e653d7c9f ("md: delay choosing sync action to md_start_sync()") Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231205094215.1824240-4-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 90 ++++++++++++++++++++----------------------------- 1 file changed, 37 insertions(+), 53 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index b02efcca8c1c..ee20955ef59f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4852,25 +4852,29 @@ action_show(struct mddev *mddev, char *page) return sprintf(page, "%s\n", type); }
-static void stop_sync_thread(struct mddev *mddev) +/** + * stop_sync_thread() - wait for sync_thread to stop if it's running. + * @mddev: the array. + * @locked: if set, reconfig_mutex will still be held after this function + * return; if not set, reconfig_mutex will be released after this + * function return. + * @check_seq: if set, only wait for curent running sync_thread to stop, noted + * that new sync_thread can still start. + */ +static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq) { - if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) - return; + int sync_seq;
- if (mddev_lock(mddev)) - return; + if (check_seq) + sync_seq = atomic_read(&mddev->sync_seq);
- /* - * Check again in case MD_RECOVERY_RUNNING is cleared before lock is - * held. - */ if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { - mddev_unlock(mddev); + if (!locked) + mddev_unlock(mddev); return; }
- if (work_pending(&mddev->sync_work)) - flush_workqueue(md_misc_wq); + mddev_unlock(mddev);
set_bit(MD_RECOVERY_INTR, &mddev->recovery); /* @@ -4878,21 +4882,28 @@ static void stop_sync_thread(struct mddev *mddev) * never happen */ md_wakeup_thread_directly(mddev->sync_thread); + if (work_pending(&mddev->sync_work)) + flush_work(&mddev->sync_work);
- mddev_unlock(mddev); + wait_event(resync_wait, + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || + (check_seq && sync_seq != atomic_read(&mddev->sync_seq))); + + if (locked) + mddev_lock_nointr(mddev); }
static void idle_sync_thread(struct mddev *mddev) { - int sync_seq = atomic_read(&mddev->sync_seq); - mutex_lock(&mddev->sync_mutex); clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - stop_sync_thread(mddev);
- wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) || - !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); + if (mddev_lock(mddev)) { + mutex_unlock(&mddev->sync_mutex); + return; + }
+ stop_sync_thread(mddev, false, true); mutex_unlock(&mddev->sync_mutex); }
@@ -4900,11 +4911,13 @@ static void frozen_sync_thread(struct mddev *mddev) { mutex_lock(&mddev->sync_mutex); set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - stop_sync_thread(mddev);
- wait_event(resync_wait, mddev->sync_thread == NULL && - !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); + if (mddev_lock(mddev)) { + mutex_unlock(&mddev->sync_mutex); + return; + }
+ stop_sync_thread(mddev, false, false); mutex_unlock(&mddev->sync_mutex); }
@@ -6276,14 +6289,7 @@ static void md_clean(struct mddev *mddev)
static void __md_stop_writes(struct mddev *mddev) { - set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - if (work_pending(&mddev->sync_work)) - flush_workqueue(md_misc_wq); - if (mddev->sync_thread) { - set_bit(MD_RECOVERY_INTR, &mddev->recovery); - md_reap_sync_thread(mddev); - } - + stop_sync_thread(mddev, true, false); del_timer_sync(&mddev->safemode_timer);
if (mddev->pers && mddev->pers->quiesce) { @@ -6375,18 +6381,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); md_wakeup_thread(mddev->thread); } - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) - set_bit(MD_RECOVERY_INTR, &mddev->recovery); - - /* - * Thread might be blocked waiting for metadata update which will now - * never happen - */ - md_wakeup_thread_directly(mddev->sync_thread);
- mddev_unlock(mddev); - wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING, - &mddev->recovery)); + stop_sync_thread(mddev, false, false); wait_event(mddev->sb_wait, !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); mddev_lock_nointr(mddev); @@ -6440,20 +6436,8 @@ static int do_md_stop(struct mddev *mddev, int mode, set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); md_wakeup_thread(mddev->thread); } - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- /* - * Thread might be blocked waiting for metadata update which will now - * never happen - */ - md_wakeup_thread_directly(mddev->sync_thread); - - mddev_unlock(mddev); - wait_event(resync_wait, (mddev->sync_thread == NULL && - !test_bit(MD_RECOVERY_RUNNING, - &mddev->recovery))); - mddev_lock_nointr(mddev); + stop_sync_thread(mddev, true, false);
mutex_lock(&mddev->open_mutex); if ((mddev->pers && atomic_read(&mddev->openers) > !!bdev) ||
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc5 commit b39113349de60e9b0bc97c2e129181b193c45054 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
New mddev_resume() calls are added to synchronize IO with array reconfiguration, however, this introduces a performance regression while adding it in md_start_sync():
1) someone sets MD_RECOVERY_NEEDED first; 2) daemon thread grabs reconfig_mutex, then clears MD_RECOVERY_NEEDED and queues a new sync work; 3) daemon thread releases reconfig_mutex; 4) in md_start_sync a) check that there are spares that can be added/removed, then suspend the array; b) remove_and_add_spares may not be called, or called without really add/remove spares; c) resume the array, then set MD_RECOVERY_NEEDED again!
Loop between 2 - 4, then mddev_suspend() will be called quite often, for consequence, normal IO will be quite slow.
Fix this problem by don't set MD_RECOVERY_NEEDED again in md_start_sync(), hence the loop will be broken.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Suggested-by: Song Liu song@kernel.org Reported-by: Janpieter Sollie janpieter.sollie@edpnet.be Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200 Signed-off-by: Yu Kuai yukuai3@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231207020724.2797445-1-yukuai1@huaweicloud.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 30 ++++++++++++++++++++++++++---- 1 file changed, 26 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index ee20955ef59f..68ade3002e1d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -491,7 +491,7 @@ int mddev_suspend(struct mddev *mddev, bool interruptible) } EXPORT_SYMBOL_GPL(mddev_suspend);
-void mddev_resume(struct mddev *mddev) +static void __mddev_resume(struct mddev *mddev, bool recovery_needed) { lockdep_assert_not_held(&mddev->reconfig_mutex);
@@ -508,12 +508,18 @@ void mddev_resume(struct mddev *mddev) percpu_ref_resurrect(&mddev->active_io); wake_up(&mddev->sb_wait);
- set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + if (recovery_needed) + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
mutex_unlock(&mddev->suspend_mutex); } + +void mddev_resume(struct mddev *mddev) +{ + return __mddev_resume(mddev, true); +} EXPORT_SYMBOL_GPL(mddev_resume);
/* @@ -9376,7 +9382,15 @@ static void md_start_sync(struct work_struct *ws) goto not_running; }
- suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); + mddev_unlock(mddev); + /* + * md_start_sync was triggered by MD_RECOVERY_NEEDED, so we should + * not set it again. Otherwise, we may cause issue like this one: + * https://bugzilla.kernel.org/show_bug.cgi?id=218200 + * Therefore, use __mddev_resume(mddev, false). + */ + if (suspend) + __mddev_resume(mddev, false); md_wakeup_thread(mddev->sync_thread); sysfs_notify_dirent_safe(mddev->sysfs_action); md_new_event(); @@ -9388,7 +9402,15 @@ static void md_start_sync(struct work_struct *ws) clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev); + mddev_unlock(mddev); + /* + * md_start_sync was triggered by MD_RECOVERY_NEEDED, so we should + * not set it again. Otherwise, we may cause issue like this one: + * https://bugzilla.kernel.org/show_bug.cgi?id=218200 + * Therefore, use __mddev_resume(mddev, false). + */ + if (suspend) + __mddev_resume(mddev, false);
wake_up(&resync_wait); if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
From: Gou Hao gouhao@uniontech.com
mainline inclusion from mainline-next-20231220 commit af140f806ae2679f9dba48ea0f5811da83854eb6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If %__GFP_DIRECT_RECLAIM is set then bio_alloc_bioset will always be able to allocate a bio. See comment of bio_alloc_bioset.
Signed-off-by: Gou Hao gouhao@uniontech.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231214151458.28970-1-gouhao@uniontech.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/raid1.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 9348f1709512..19c9bf0060ae 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1124,8 +1124,6 @@ static void alloc_behind_master_bio(struct r1bio *r1_bio,
behind_bio = bio_alloc_bioset(NULL, vcnt, 0, GFP_NOIO, &r1_bio->mddev->bio_set); - if (!behind_bio) - return;
/* discard op, we don't support writezero/writesame yet */ if (!bio_has_data(bio)) {
From: Alex Lyakas alex.lyakas@zadara.com
mainline inclusion from mainline-next-20231220 commit dc1cc22ed58f11d58d8553c5ec5f11cbfc3e3039 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Upon assembling the array, both kernel and mdadm allow the devices to have event counter difference of 1, and still consider them as up-to-date. However, a device whose event count is behind by 1, may in fact not be up-to-date, and array resync with such a device may cause data corruption. To avoid this, consult the superblock of the freshest device about the status of a device, whose event counter is behind by 1.
Signed-off-by: Alex Lyakas alex.lyakas@zadara.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zada... Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 54 ++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 44 insertions(+), 10 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 68ade3002e1d..b4772955e110 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1213,6 +1213,7 @@ struct super_type { struct md_rdev *refdev, int minor_version); int (*validate_super)(struct mddev *mddev, + struct md_rdev *freshest, struct md_rdev *rdev); void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); @@ -1350,8 +1351,9 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor
/* * validate_super for 0.90.0 + * note: we are not using "freshest" for 0.9 superblock */ -static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) +static int super_90_validate(struct mddev *mddev, struct md_rdev *freshest, struct md_rdev *rdev) { mdp_disk_t *desc; mdp_super_t *sb = page_address(rdev->sb_page); @@ -1863,7 +1865,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ return ret; }
-static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) +static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struct md_rdev *rdev) { struct mdp_superblock_1 *sb = page_address(rdev->sb_page); __u64 ev1 = le64_to_cpu(sb->events); @@ -1959,13 +1961,15 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) } } else if (mddev->pers == NULL) { /* Insist of good event counter while assembling, except for - * spares (which don't need an event count) */ - ++ev1; + * spares (which don't need an event count). + * Similar to mdadm, we allow event counter difference of 1 + * from the freshest device. + */ if (rdev->desc_nr >= 0 && rdev->desc_nr < le32_to_cpu(sb->max_dev) && (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX || le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL)) - if (ev1 < mddev->events) + if (ev1 + 1 < mddev->events) return -EINVAL; } else if (mddev->bitmap) { /* If adding to array with a bitmap, then we can accept an @@ -1986,8 +1990,38 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) rdev->desc_nr >= le32_to_cpu(sb->max_dev)) { role = MD_DISK_ROLE_SPARE; rdev->desc_nr = -1; - } else + } else if (mddev->pers == NULL && freshest && ev1 < mddev->events) { + /* + * If we are assembling, and our event counter is smaller than the + * highest event counter, we cannot trust our superblock about the role. + * It could happen that our rdev was marked as Faulty, and all other + * superblocks were updated with +1 event counter. + * Then, before the next superblock update, which typically happens when + * remove_and_add_spares() removes the device from the array, there was + * a crash or reboot. + * If we allow current rdev without consulting the freshest superblock, + * we could cause data corruption. + * Note that in this case our event counter is smaller by 1 than the + * highest, otherwise, this rdev would not be allowed into array; + * both kernel and mdadm allow event counter difference of 1. + */ + struct mdp_superblock_1 *freshest_sb = page_address(freshest->sb_page); + u32 freshest_max_dev = le32_to_cpu(freshest_sb->max_dev); + + if (rdev->desc_nr >= freshest_max_dev) { + /* this is unexpected, better not proceed */ + pr_warn("md: %s: rdev[%pg]: desc_nr(%d) >= freshest(%pg)->sb->max_dev(%u)\n", + mdname(mddev), rdev->bdev, rdev->desc_nr, + freshest->bdev, freshest_max_dev); + return -EUCLEAN; + } + + role = le16_to_cpu(freshest_sb->dev_roles[rdev->desc_nr]); + pr_debug("md: %s: rdev[%pg]: role=%d(0x%x) according to freshest %pg\n", + mdname(mddev), rdev->bdev, role, role, freshest->bdev); + } else { role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]); + } switch(role) { case MD_DISK_ROLE_SPARE: /* spare */ break; @@ -2894,7 +2928,7 @@ static int add_bound_rdev(struct md_rdev *rdev) * and should be added immediately. */ super_types[mddev->major_version]. - validate_super(mddev, rdev); + validate_super(mddev, NULL/*freshest*/, rdev); err = mddev->pers->hot_add_disk(mddev, rdev); if (err) { md_kick_rdev_from_array(rdev); @@ -3831,7 +3865,7 @@ static int analyze_sbs(struct mddev *mddev) }
super_types[mddev->major_version]. - validate_super(mddev, freshest); + validate_super(mddev, NULL/*freshest*/, freshest);
i = 0; rdev_for_each_safe(rdev, tmp, mddev) { @@ -3846,7 +3880,7 @@ static int analyze_sbs(struct mddev *mddev) } if (rdev != freshest) { if (super_types[mddev->major_version]. - validate_super(mddev, rdev)) { + validate_super(mddev, freshest, rdev)) { pr_warn("md: kicking non-fresh %pg from array!\n", rdev->bdev); md_kick_rdev_from_array(rdev); @@ -6840,7 +6874,7 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info) rdev->saved_raid_disk = rdev->raid_disk; } else super_types[mddev->major_version]. - validate_super(mddev, rdev); + validate_super(mddev, NULL/*freshest*/, rdev); if ((info->state & (1<<MD_DISK_SYNC)) && rdev->raid_disk != info->raid_disk) { /* This was a hot-add request, but events doesn't
mainline inclusion from mainline-next-20231220 commit 1979dbbe328ca4e1d0f061c94381cfa03388088d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Move check_decay_read_errors() to raid1-10.c and factor out a helper exceed_read_errors() to check if read_errors exceeds the limit, so that raid1 can also use it. There are no functional changes.
Signed-off-by: Li Nan linan122@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com --- drivers/md/raid1-10.c | 54 +++++++++++++++++++++++++++++++++++++++++++ drivers/md/raid1.c | 1 + drivers/md/raid10.c | 49 +++------------------------------------ 3 files changed, 58 insertions(+), 46 deletions(-)
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c index 3f22edec70e7..512746551f36 100644 --- a/drivers/md/raid1-10.c +++ b/drivers/md/raid1-10.c @@ -173,3 +173,57 @@ static inline void raid1_prepare_flush_writes(struct bitmap *bitmap) else md_bitmap_unplug(bitmap); } + +/* + * Used by fix_read_error() to decay the per rdev read_errors. + * We halve the read error count for every hour that has elapsed + * since the last recorded read error. + */ +static inline void check_decay_read_errors(struct mddev *mddev, struct md_rdev *rdev) +{ + long cur_time_mon; + unsigned long hours_since_last; + unsigned int read_errors = atomic_read(&rdev->read_errors); + + cur_time_mon = ktime_get_seconds(); + + if (rdev->last_read_error == 0) { + /* first time we've seen a read error */ + rdev->last_read_error = cur_time_mon; + return; + } + + hours_since_last = (long)(cur_time_mon - + rdev->last_read_error) / 3600; + + rdev->last_read_error = cur_time_mon; + + /* + * if hours_since_last is > the number of bits in read_errors + * just set read errors to 0. We do this to avoid + * overflowing the shift of read_errors by hours_since_last. + */ + if (hours_since_last >= 8 * sizeof(read_errors)) + atomic_set(&rdev->read_errors, 0); + else + atomic_set(&rdev->read_errors, read_errors >> hours_since_last); +} + +static inline bool exceed_read_errors(struct mddev *mddev, struct md_rdev *rdev) +{ + int max_read_errors = atomic_read(&mddev->max_corr_read_errors); + int read_errors; + + check_decay_read_errors(mddev, rdev); + read_errors = atomic_inc_return(&rdev->read_errors); + if (read_errors > max_read_errors) { + pr_notice("md/"RAID_1_10_NAME":%s: %pg: Raid device exceeded read_error threshold [cur %d:max %d]\n", + mdname(mddev), rdev->bdev, read_errors, max_read_errors); + pr_notice("md/"RAID_1_10_NAME":%s: %pg: Failing raid device\n", + mdname(mddev), rdev->bdev); + md_error(mddev, rdev); + return true; + } + + return false; +} diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 19c9bf0060ae..8c65ee0c4445 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -49,6 +49,7 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr); #define raid1_log(md, fmt, args...) \ do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
+#define RAID_1_10_NAME "raid1" #include "raid1-10.c"
#define START(node) ((node)->start) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 375c11d6159f..7412066ea22c 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -19,6 +19,8 @@ #include <linux/raid/md_p.h> #include <trace/events/block.h> #include "md.h" + +#define RAID_1_10_NAME "raid10" #include "raid10.h" #include "raid0.h" #include "md-bitmap.h" @@ -2592,42 +2594,6 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) } }
-/* - * Used by fix_read_error() to decay the per rdev read_errors. - * We halve the read error count for every hour that has elapsed - * since the last recorded read error. - * - */ -static void check_decay_read_errors(struct mddev *mddev, struct md_rdev *rdev) -{ - long cur_time_mon; - unsigned long hours_since_last; - unsigned int read_errors = atomic_read(&rdev->read_errors); - - cur_time_mon = ktime_get_seconds(); - - if (rdev->last_read_error == 0) { - /* first time we've seen a read error */ - rdev->last_read_error = cur_time_mon; - return; - } - - hours_since_last = (long)(cur_time_mon - - rdev->last_read_error) / 3600; - - rdev->last_read_error = cur_time_mon; - - /* - * if hours_since_last is > the number of bits in read_errors - * just set read errors to 0. We do this to avoid - * overflowing the shift of read_errors by hours_since_last. - */ - if (hours_since_last >= 8 * sizeof(read_errors)) - atomic_set(&rdev->read_errors, 0); - else - atomic_set(&rdev->read_errors, read_errors >> hours_since_last); -} - static int r10_sync_page_io(struct md_rdev *rdev, sector_t sector, int sectors, struct page *page, enum req_op op) { @@ -2665,7 +2631,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 int sect = 0; /* Offset from r10_bio->sector */ int sectors = r10_bio->sectors, slot = r10_bio->read_slot; struct md_rdev *rdev; - int max_read_errors = atomic_read(&mddev->max_corr_read_errors); int d = r10_bio->devs[slot].devnum;
/* still own a reference to this rdev, so it cannot @@ -2678,15 +2643,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10 more fix_read_error() attempts */ return;
- check_decay_read_errors(mddev, rdev); - atomic_inc(&rdev->read_errors); - if (atomic_read(&rdev->read_errors) > max_read_errors) { - pr_notice("md/raid10:%s: %pg: Raid device exceeded read_error threshold [cur %d:max %d]\n", - mdname(mddev), rdev->bdev, - atomic_read(&rdev->read_errors), max_read_errors); - pr_notice("md/raid10:%s: %pg: Failing raid device\n", - mdname(mddev), rdev->bdev); - md_error(mddev, rdev); + if (exceed_read_errors(mddev, rdev)) { r10_bio->devs[slot].bio = IO_BLOCKED; return; }
mainline inclusion from mainline-next-20231220 commit ca294b34aaf3a417fe9069b174e52508ac918ec8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
After commit 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable read errors."), rdev will be set to faulty if it reads data error to many times in raid10. Add this mechanism to raid1 now.
Signed-off-by: Li Nan linan122@huawei.com Signed-off-by: Song Liu song@kernel.org Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com --- drivers/md/raid1.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 8c65ee0c4445..aaa434f0c175 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2256,16 +2256,24 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio) * 3. Performs writes following reads for array synchronising. */
-static void fix_read_error(struct r1conf *conf, int read_disk, - sector_t sect, int sectors) +static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio) { + sector_t sect = r1_bio->sector; + int sectors = r1_bio->sectors; + int read_disk = r1_bio->read_disk; struct mddev *mddev = conf->mddev; + struct md_rdev *rdev = rcu_dereference(conf->mirrors[read_disk].rdev); + + if (exceed_read_errors(mddev, rdev)) { + r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; + return; + } + while(sectors) { int s = sectors; int d = read_disk; int success = 0; int start; - struct md_rdev *rdev;
if (s > (PAGE_SIZE>>9)) s = PAGE_SIZE >> 9; @@ -2506,8 +2514,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) if (mddev->ro == 0 && !test_bit(FailFast, &rdev->flags)) { freeze_array(conf, 1); - fix_read_error(conf, r1_bio->read_disk, - r1_bio->sector, r1_bio->sectors); + fix_read_error(conf, r1_bio); unfreeze_array(conf); } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { md_error(mddev, rdev);
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.7-rc7 commit db29d79b34d9593179de5f868be45c650923e7b4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T02O
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
After commit db5e653d7c9f ("md: delay choosing sync action to md_start_sync()"), md_start_sync() will hold 'reconfig_mutex', however, in order to make sure event_work is done, __md_stop() will flush workqueue with reconfig_mutex grabbed, hence if sync_work is still pending, deadlock will be triggered.
Fortunately, former pacthes to fix stopping sync_thread already make sure all sync_work is done already, hence such deadlock is not possible anymore. However, in order not to cause confusions for people by this implicit dependency, delay flushing event_work to dm-raid where 'reconfig_mutex' is not held, and add some comments to emphasize that the workqueue can't be flushed with 'reconfig_mutex'.
Fixes: db5e653d7c9f ("md: delay choosing sync action to md_start_sync()") Depends-on: f52f5c71f3d4 ("md: fix stopping sync thread") Signed-off-by: Yu Kuai yukuai3@huawei.com Acked-by: Xiao Ni xni@redhat.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/dm-raid.c | 3 +++ drivers/md/md.c | 11 ++++++++--- 2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index a4692f8f98ee..51f15c20f621 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3317,6 +3317,9 @@ static void raid_dtr(struct dm_target *ti) mddev_lock_nointr(&rs->md); md_stop(&rs->md); mddev_unlock(&rs->md); + + if (work_pending(&rs->md.event_work)) + flush_work(&rs->md.event_work); raid_set_free(rs); }
diff --git a/drivers/md/md.c b/drivers/md/md.c index b4772955e110..f97e2e0bd2a1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -82,6 +82,14 @@ static struct module *md_cluster_mod;
static DECLARE_WAIT_QUEUE_HEAD(resync_wait); static struct workqueue_struct *md_wq; + +/* + * This workqueue is used for sync_work to register new sync_thread, and for + * del_work to remove rdev, and for event_work that is only set by dm-raid. + * + * Noted that sync_work will grab reconfig_mutex, hence never flush this + * workqueue whith reconfig_mutex grabbed. + */ static struct workqueue_struct *md_misc_wq; struct workqueue_struct *md_bitmap_wq;
@@ -6376,9 +6384,6 @@ static void __md_stop(struct mddev *mddev) struct md_personality *pers = mddev->pers; md_bitmap_destroy(mddev); mddev_detach(mddev); - /* Ensure ->event_work is done */ - if (mddev->event_work.func) - flush_workqueue(md_misc_wq); spin_lock(&mddev->lock); mddev->pers = NULL; spin_unlock(&mddev->lock);
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/3761 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/E...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/3761 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/E...