Pull new CVEs: CVE-2023-22998
cgroup bugfix from Gaosheng Cui
sched bugfix from Xia Fukun
block bugfixes from Zhong Jinghua and Yu Kuai
iomap and ext4 bugfixes from Baokun Li
md bugfixes from Yu Kuai
From: Christoph Hellwig <hch@lst.de>

mainline inclusion
from mainline-v5.17-rc1
commit e16e506ccd673a3a888a34f8f694698305840044
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MQLP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Unify the functionality that implements a partition rescan for a gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
        block/blk.h
        block/genhd.c
        block/ioctl.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 block/blk.h   |  1 +
 block/genhd.c | 17 ++++++++++-------
 block/ioctl.c | 33 +++++++--------------------------
 3 files changed, 18 insertions(+), 33 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index a644910678ac..7b3693149183 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -186,6 +186,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,

 void blk_account_io_start(struct request *req);
 void blk_account_io_done(struct request *req, u64 now);
+int disk_scan_partitions(struct gendisk *disk, fmode_t mode);

 /*
  * Plug flush limits
diff --git a/block/genhd.c b/block/genhd.c
index a5a6840f2e85..f224381ed46f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -736,17 +736,19 @@ static void register_disk(struct device *parent, struct gendisk *disk,
        }
 }

-static void disk_scan_partitions(struct gendisk *disk)
+int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
 {
        struct block_device *bdev;

-       if (!get_capacity(disk) || !disk_part_scan_enabled(disk))
-               return;
+       if (!disk_part_scan_enabled(disk))
+               return -EINVAL;

        set_bit(GD_NEED_PART_SCAN, &disk->state);
-       bdev = blkdev_get_by_dev(disk_devt(disk), FMODE_READ, NULL);
-       if (!IS_ERR(bdev))
-               blkdev_put(bdev, FMODE_READ);
+       bdev = blkdev_get_by_dev(disk_devt(disk), mode, NULL);
+       if (IS_ERR(bdev))
+               return PTR_ERR(bdev);
+       blkdev_put(bdev, mode);
+       return 0;
 }

 static void disk_init_partition(struct gendisk *disk)
@@ -755,7 +757,8 @@ static void disk_init_partition(struct gendisk *disk)
        struct disk_part_iter piter;
        struct hd_struct *part;

-       disk_scan_partitions(disk);
+       if (get_capacity(disk))
+               disk_scan_partitions(disk, FMODE_READ);

        /* announce disk after possible partitions are created */
        dev_set_uevent_suppress(ddev, 0);
diff --git a/block/ioctl.c b/block/ioctl.c
index 7c578f84a4fd..565d43c9addd 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -90,31 +90,6 @@ static int compat_blkpg_ioctl(struct block_device *bdev,
 }
 #endif

-static int blkdev_reread_part(struct block_device *bdev, fmode_t mode)
-{
-       struct block_device *tmp;
-
-       if (!disk_part_scan_enabled(bdev->bd_disk) || bdev_is_partition(bdev))
-               return -EINVAL;
-       if (!capable(CAP_SYS_ADMIN))
-               return -EACCES;
-       if (bdev->bd_part_count)
-               return -EBUSY;
-
-       /*
-        * Reopen the device to revalidate the driver state and force a
-        * partition rescan.
-        */
-       mode &= ~FMODE_EXCL;
-       set_bit(GD_NEED_PART_SCAN, &bdev->bd_disk->state);
-
-       tmp = blkdev_get_by_dev(bdev->bd_dev, mode, NULL);
-       if (IS_ERR(tmp))
-               return PTR_ERR(tmp);
-       blkdev_put(tmp, mode);
-       return 0;
-}
-
 static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
                unsigned long arg, unsigned long flags)
 {
@@ -562,7 +537,13 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
                bdev->bd_bdi->ra_pages = (arg * 512) / PAGE_SIZE;
                return 0;
        case BLKRRPART:
-               return blkdev_reread_part(bdev, mode);
+               if (!capable(CAP_SYS_ADMIN))
+                       return -EACCES;
+               if (bdev_is_partition(bdev))
+                       return -EINVAL;
+               if (bdev->bd_part_count)
+                       return -EBUSY;
+               return disk_scan_partitions(bdev->bd_disk, mode & ~FMODE_EXCL);
        case BLKTRACESTART:
        case BLKTRACESTOP:
        case BLKTRACETEARDOWN:
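For reference, BLKRRPART is the ioctl whose handler the hunk above rewires onto disk_scan_partitions(). A minimal userspace sketch that exercises this path (the device path /dev/vda is an assumption; requires root):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKRRPART */

int main(void)
{
        int fd = open("/dev/vda", O_RDONLY);    /* hypothetical whole-disk node */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* ask the kernel to reread the partition table */
        if (ioctl(fd, BLKRRPART) < 0)
                perror("ioctl(BLKRRPART)");     /* e.g. EBUSY while a partition is in use */
        close(fd);
        return 0;
}

Per the checks in the BLKRRPART case above, the ioctl fails with EACCES without CAP_SYS_ADMIN, EINVAL on a partition node, and EBUSY while partitions are held open.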
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.3-rc1
commit e5cfefa97bccf956ea0bb6464c1f6c84fd7a8d9f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MQLP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As explained in commit 36369f46e917 ("block: Do not reread partition table on exclusively open device"), rereading the partition table on a device that is exclusively opened by someone else is problematic.

This patch makes sure a partition scan will only proceed if the current thread has the device open exclusively, or if the device is not opened exclusively at all; in the latter case, other scanners and exclusive openers are blocked temporarily until the partition scan is done.
Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions")
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230217022200.3092987-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
        block/genhd.c
        block/ioctl.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 block/genhd.c | 38 ++++++++++++++++++++++++++++++++++----
 block/ioctl.c |  2 +-
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index f224381ed46f..b6ad3554876f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -739,16 +739,42 @@ static void register_disk(struct device *parent, struct gendisk *disk,
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
 {
        struct block_device *bdev;
+       struct block_device *claim;
+       int ret = 0;

        if (!disk_part_scan_enabled(disk))
                return -EINVAL;

+       /*
+        * If the device is opened exclusively by current thread already, it's
+        * safe to scan partitons, otherwise, use bd_prepare_to_claim() to
+        * synchronize with other exclusive openers and other partition
+        * scanners.
+        */
+       if (!(mode & FMODE_EXCL)) {
+               claim = bdget_part(&disk->part0);
+               if (!claim)
+                       return -ENOMEM;
+
+               ret = bd_prepare_to_claim(claim, claim, disk_scan_partitions);
+               if (ret) {
+                       bdput(claim);
+                       return ret;
+               }
+       }
+
        set_bit(GD_NEED_PART_SCAN, &disk->state);
-       bdev = blkdev_get_by_dev(disk_devt(disk), mode, NULL);
+       bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
        if (IS_ERR(bdev))
-               return PTR_ERR(bdev);
-       blkdev_put(bdev, mode);
-       return 0;
+               ret = PTR_ERR(bdev);
+       else
+               blkdev_put(bdev, mode);
+
+       if (!(mode & FMODE_EXCL)) {
+               bd_abort_claiming(claim, claim, disk_scan_partitions);
+               bdput(claim);
+       }
+       return ret;
 }

 static void disk_init_partition(struct gendisk *disk)
@@ -850,6 +876,10 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
        disk_add_events(disk);
        blk_integrity_add(disk);

+       /* Make sure the first partition scan will be proceed */
+       if (get_capacity(disk) && disk_part_scan_enabled(disk))
+               set_bit(GD_NEED_PART_SCAN, &disk->state);
+
        /*
         * Set the flag at last, so that block devcie can't be opened
         * before it's registration is done.
diff --git a/block/ioctl.c b/block/ioctl.c
index 565d43c9addd..e3c5a27c23b1 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -543,7 +543,7 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
                        return -EINVAL;
                if (bdev->bd_part_count)
                        return -EBUSY;
-               return disk_scan_partitions(bdev->bd_disk, mode & ~FMODE_EXCL);
+               return disk_scan_partitions(bdev->bd_disk, mode);
        case BLKTRACESTART:
        case BLKTRACESTOP:
        case BLKTRACETEARDOWN:
From: Yu Kuai <yukuai3@huawei.com>

mainline inclusion
from mainline-v6.3-rc2
commit 428913bce1e67ccb4dae317fd0332545bf8c9233
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MQLP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If disk_scan_partitions() is called with 'FMODE_EXCL', blkdev_get_by_dev() will be called without 'FMODE_EXCL'; however, the following blkdev_put() is still called with 'FMODE_EXCL', which causes the 'bd_holders' counter to leak.

Fix the problem by using the right mode for blkdev_put().
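A hedged userspace model of the imbalance (mock names and a plain counter, not the kernel API): holders are only counted for exclusive gets and puts, so stripping the exclusive bit on the get but not on the put leaves the count skewed:

#include <stdio.h>

#define MODE_EXCL 0x1           /* stand-in for FMODE_EXCL */

static int bd_holders;          /* stand-in for the kernel's holder count */

static void mock_get(int mode)
{
        if (mode & MODE_EXCL)
                bd_holders++;   /* a holder is only recorded for EXCL gets */
}

static void mock_put(int mode)
{
        if (mode & MODE_EXCL)
                bd_holders--;   /* ...and only dropped for EXCL puts */
}

int main(void)
{
        int mode = MODE_EXCL;           /* caller had the device open exclusively */

        mock_get(mode & ~MODE_EXCL);    /* scan path strips EXCL for the get */
        mock_put(mode);                 /* bug: the put still passes EXCL */
        printf("after buggy pair:   bd_holders = %d\n", bd_holders);    /* -1 */

        mock_get(mode & ~MODE_EXCL);
        mock_put(mode & ~MODE_EXCL);    /* fix: the put mode matches the get */
        printf("after matched pair: bd_holders = %d\n", bd_holders);    /* net 0 change */
        return 0;
}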
Reported-by: syzbot+2bcc0d79e548c4f62a59@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/f9649d501bc8c3444769418f6c26263555d9d3be.camel@...
Tested-by: Julian Ruess <julianr@linux.ibm.com>
Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index b6ad3554876f..3c34dd011585 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -768,7 +768,7 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
        if (IS_ERR(bdev))
                ret = PTR_ERR(bdev);
        else
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, mode & ~FMODE_EXCL);

        if (!(mode & FMODE_EXCL)) {
                bd_abort_claiming(claim, claim, disk_scan_partitions);
From: Guoqing Jiang <guoqing.jiang@linux.dev>

mainline inclusion
from mainline-v6.0-rc1
commit 9dfbdafda3b34e262e43e786077bab8e476a89d1
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
--------------------------------
Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") is specific to the action_store path, other callers which reap sync_thread don't need to be changed.

Let's pull md_unregister_thread() out of md_reap_sync_thread(), then fix the previous bug as follows:

1. Unlock mddev before md_reap_sync_thread() in action_store.
2. Save reshape_position before unlocking, then restore it to ensure the position is not changed accidentally by others.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/dm-raid.c |  1 +
 drivers/md/md.c      | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index a2d09c9c6e9f..134c51027dce 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3693,6 +3693,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
        if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
                if (mddev->sync_thread) {
                        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+                       md_unregister_thread(&mddev->sync_thread);
                        md_reap_sync_thread(mddev);
                }
        } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e072ccb08735..cf7ca756b216 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4872,6 +4872,19 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                        if (work_pending(&mddev->del_work))
                                flush_workqueue(md_misc_wq);
                        if (mddev->sync_thread) {
+                               sector_t save_rp = mddev->reshape_position;
+
+                               mddev_unlock(mddev);
+                               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+                               md_unregister_thread(&mddev->sync_thread);
+                               mddev_lock_nointr(mddev);
+                               /*
+                                * set RECOVERY_INTR again and restore reshape
+                                * position in case others changed them after
+                                * got lock, eg, reshape_position_store and
+                                * md_check_recovery.
+                                */
+                               mddev->reshape_position = save_rp;
                                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                md_reap_sync_thread(mddev);
                        }
@@ -6251,6 +6264,7 @@ static void __md_stop_writes(struct mddev *mddev)
                flush_workqueue(md_misc_wq);
        if (mddev->sync_thread) {
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+               md_unregister_thread(&mddev->sync_thread);
                md_reap_sync_thread(mddev);
        }
@@ -9321,6 +9335,7 @@ void md_check_recovery(struct mddev *mddev)
                         * ->spare_active and clear saved_raid_disk
                         */
                        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+                       md_unregister_thread(&mddev->sync_thread);
                        md_reap_sync_thread(mddev);
                        clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
                        clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -9356,6 +9371,7 @@ void md_check_recovery(struct mddev *mddev)
                        goto unlock;
                }
                if (mddev->sync_thread) {
+                       md_unregister_thread(&mddev->sync_thread);
                        md_reap_sync_thread(mddev);
                        goto unlock;
                }
@@ -9435,8 +9451,7 @@ void md_reap_sync_thread(struct mddev *mddev)
        sector_t old_dev_sectors = mddev->dev_sectors;
        bool is_reshaped = false;

-       /* resync has finished, collect result */
-       md_unregister_thread(&mddev->sync_thread);
+       /* sync_thread should be unregistered, collect result */
        if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
            !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
            mddev->degraded != mddev->raid_disks) {
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.
Because it introduces a defect: sync_thread can be running while MD_RECOVERY_RUNNING is cleared, which causes some unexpected problems, for example:
list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
 __list_add_valid+0xfc/0x140
 insert_work+0x78/0x1a0
 __queue_work+0x500/0xcf4
 queue_work_on+0xe8/0x12c
 md_check_recovery+0xa34/0xf30
 raid10d+0xb8/0x900 [raid10]
 md_thread+0x16c/0x2cc
 kthread+0x1a4/0x1ec
 ret_from_fork+0x10/0x18
This is because the work is requeued while it is still inside the workqueue:
t1 (action_store):
 mddev_lock
 if (mddev->sync_thread)
  mddev_unlock
  md_unregister_thread
  // first sync_thread is done

t2 (md_check_recovery, racing with t1):
 mddev_try_lock
 /*
  * once MD_RECOVERY_DONE is set, new sync_thread
  * can start.
  */
 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
 INIT_WORK(&mddev->del_work, md_start_sync)
 queue_work(md_misc_wq, &mddev->del_work)
  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
  // set pending bit
  insert_work
   list_add_tail
 mddev_unlock

t1 (continued):
  mddev_lock_nointr
  md_reap_sync_thread
  // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
 INIT_WORK(&mddev->del_work, md_start_sync)
  work->data = 0
  // work pending bit is cleared
 queue_work(md_misc_wq, &mddev->del_work)
  insert_work
   list_add_tail
   // list is corrupted
This patch reverts the commit to fix the problem; the deadlock that the reverted commit tried to fix will be addressed in the following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230322064122.2384589-2-yukuai1@huaweicloud.com
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/dm-raid.c |  1 -
 drivers/md/md.c      | 19 ++-----------------
 2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 134c51027dce..a2d09c9c6e9f 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3693,7 +3693,6 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
        if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
                if (mddev->sync_thread) {
                        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-                       md_unregister_thread(&mddev->sync_thread);
                        md_reap_sync_thread(mddev);
                }
        } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index cf7ca756b216..e072ccb08735 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4872,19 +4872,6 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                        if (work_pending(&mddev->del_work))
                                flush_workqueue(md_misc_wq);
                        if (mddev->sync_thread) {
-                               sector_t save_rp = mddev->reshape_position;
-
-                               mddev_unlock(mddev);
-                               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-                               md_unregister_thread(&mddev->sync_thread);
-                               mddev_lock_nointr(mddev);
-                               /*
-                                * set RECOVERY_INTR again and restore reshape
-                                * position in case others changed them after
-                                * got lock, eg, reshape_position_store and
-                                * md_check_recovery.
-                                */
-                               mddev->reshape_position = save_rp;
                                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                md_reap_sync_thread(mddev);
                        }
@@ -6264,7 +6251,6 @@ static void __md_stop_writes(struct mddev *mddev)
                flush_workqueue(md_misc_wq);
        if (mddev->sync_thread) {
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-               md_unregister_thread(&mddev->sync_thread);
                md_reap_sync_thread(mddev);
        }
@@ -9335,7 +9321,6 @@ void md_check_recovery(struct mddev *mddev)
                         * ->spare_active and clear saved_raid_disk
                         */
                        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-                       md_unregister_thread(&mddev->sync_thread);
                        md_reap_sync_thread(mddev);
                        clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
                        clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -9371,7 +9356,6 @@ void md_check_recovery(struct mddev *mddev)
                        goto unlock;
                }
                if (mddev->sync_thread) {
-                       md_unregister_thread(&mddev->sync_thread);
                        md_reap_sync_thread(mddev);
                        goto unlock;
                }
@@ -9451,7 +9435,8 @@ void md_reap_sync_thread(struct mddev *mddev)
        sector_t old_dev_sectors = mddev->dev_sectors;
        bool is_reshaped = false;

-       /* sync_thread should be unregistered, collect result */
+       /* resync has finished, collect result */
+       md_unregister_thread(&mddev->sync_thread);
        if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
            !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
            mddev->degraded != mddev->raid_disks) {
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
Prepare to handle 'idle' and 'frozen' differently in order to fix a deadlock. There are no functional changes, except that MD_RECOVERY_RUNNING is checked again after 'reconfig_mutex' is held.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/md.c | 61 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e072ccb08735..e02b18205d86 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4855,6 +4855,46 @@ action_show(struct mddev *mddev, char *page)
        return sprintf(page, "%s\n", type);
 }

+static void stop_sync_thread(struct mddev *mddev)
+{
+       if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+               return;
+
+       if (mddev_lock(mddev))
+               return;
+
+       /*
+        * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
+        * held.
+        */
+       if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+               mddev_unlock(mddev);
+               return;
+       }
+
+       if (work_pending(&mddev->del_work))
+               flush_workqueue(md_misc_wq);
+
+       if (mddev->sync_thread) {
+               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+               md_reap_sync_thread(mddev);
+       }
+
+       mddev_unlock(mddev);
+}
+
+static void idle_sync_thread(struct mddev *mddev)
+{
+       clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+       stop_sync_thread(mddev);
+}
+
+static void frozen_sync_thread(struct mddev *mddev)
+{
+       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+       stop_sync_thread(mddev);
+}
+
 static ssize_t
 action_store(struct mddev *mddev, const char *page, size_t len)
 {
@@ -4862,22 +4902,11 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                return -EINVAL;

-       if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
-               if (cmd_match(page, "frozen"))
-                       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-               else
-                       clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-                   mddev_lock(mddev) == 0) {
-                       if (work_pending(&mddev->del_work))
-                               flush_workqueue(md_misc_wq);
-                       if (mddev->sync_thread) {
-                               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-                               md_reap_sync_thread(mddev);
-                       }
-                       mddev_unlock(mddev);
-               }
-       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+       if (cmd_match(page, "idle"))
+               idle_sync_thread(mddev);
+       else if (cmd_match(page, "frozen"))
+               frozen_sync_thread(mddev);
+       else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
                return -EBUSY;
        else if (cmd_match(page, "resync"))
                clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
Currently, for 'idle' and 'frozen', action_store() will hold 'reconfig_mutex' and call md_reap_sync_thread() to stop the sync thread; however, this causes a deadlock (explained in the next patch). In order to fix the problem, a following patch will release 'reconfig_mutex' and wait on 'resync_wait', like md_set_readonly() and do_md_stop() do.

Consider that action_store() sets/clears 'MD_RECOVERY_FROZEN' unconditionally, which might cause unexpected problems: for example, 'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress, while 'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started, which might starve the in-progress 'frozen'.

This patch adds a mutex to synchronize 'idle' and 'frozen' from action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/md.c | 5 +++++
 drivers/md/md.h | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e02b18205d86..a04485cdae29 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -690,6 +690,7 @@ void mddev_init(struct mddev *mddev)
        mutex_init(&mddev->open_mutex);
        mutex_init(&mddev->reconfig_mutex);
        mutex_init(&mddev->bitmap_info.mutex);
+       mutex_init(&mddev->sync_mutex);
        INIT_LIST_HEAD(&mddev->disks);
        INIT_LIST_HEAD(&mddev->all_mddevs);
        timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
@@ -4885,14 +4886,18 @@ static void stop_sync_thread(struct mddev *mddev)

 static void idle_sync_thread(struct mddev *mddev)
 {
+       mutex_lock(&mddev->sync_mutex);
        clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);
+       mutex_unlock(&mddev->sync_mutex);
 }

 static void frozen_sync_thread(struct mddev *mddev)
 {
+       mutex_lock(&mddev->sync_mutex);
        set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);
+       mutex_unlock(&mddev->sync_mutex);
 }

 static ssize_t
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dbc566d309d6..50c7d3e4f2d7 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -528,6 +528,9 @@ struct mddev {
        bool    has_superblocks:1;
        bool    fail_last_dev:1;
        bool    serialize_policy:1;
+
+       /* Used to synchronize idle and frozen for action_store() */
+       struct mutex                    sync_mutex;
 };

 enum recovery_flags {
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
Our test found the following deadlock in raid10:

1) Issue a normal write, and the write fails:
raid10_end_write_request
 set_bit(R10BIO_WriteError, &r10_bio->state)
 one_write_done
  reschedule_retry

// later from md thread
raid10d
 handle_write_completed
  list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

// later from md thread
raid10d
 if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
  list_move(conf->bio_end_io_list.prev, &tmp)
 r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
 raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for updating superblock
2) Trigger a recovery:
raid10_sync_request
 raise_barrier
Dependency chain 2: sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
action_store
 mddev_lock
 md_unregister_thread
  kthread_stop
Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
4) md thread can't update superblock:
raid10d
 md_check_recovery
  if (mddev_trylock(mddev))
   md_update_sb
Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
Hence a cyclic dependency exists; in order to fix the problem, we must break one of the dependencies. Dependencies 1 and 2 can't be broken because they are part of the foundational design. Breaking dependency 4 might be possible if it could be guaranteed that no io is inflight, but that would require a new mechanism, which seems complex. Dependency 3 is a good choice, because 'idle'/'frozen' only requires the sync thread to finish, which can already be done asynchronously, and 'reconfig_mutex' is not needed for that.

This patch switches 'idle' and 'frozen' to waiting for the sync thread to finish asynchronously, and it also adds a sequence counter recording how many times the sync thread has finished, so that 'idle' won't keep waiting on a newly started sync thread.
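A hedged userspace model of the sequence-counter idea (pthreads and invented names; md itself uses resync_wait and wait_event()): 'idle' snapshots the counter before waiting, so it stops waiting as soon as the run it observed finishes, even if another run starts immediately afterwards. Build with -pthread:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t resync_wait = PTHREAD_COND_INITIALIZER;
static int sync_seq;    /* bumped each time a sync run finishes */
static int running;

static void *sync_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        running = 0;
        sync_seq++;                     /* this run is done */
        pthread_cond_broadcast(&resync_wait);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_mutex_lock(&lock);
        running = 1;
        int seq = sync_seq;             /* snapshot before waiting */
        pthread_mutex_unlock(&lock);

        pthread_create(&t, NULL, sync_thread, NULL);

        pthread_mutex_lock(&lock);
        /* wait for *that* run to finish, or for nothing to be running */
        while (seq == sync_seq && running)
                pthread_cond_wait(&resync_wait, &lock);
        pthread_mutex_unlock(&lock);

        puts("idle: the sync run we observed is finished");
        pthread_join(t, NULL);
        return 0;
}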
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/md.c | 24 ++++++++++++++++++++----
 drivers/md/md.h |  2 ++
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a04485cdae29..abf58ef633a3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -697,6 +697,7 @@ void mddev_init(struct mddev *mddev)
        atomic_set(&mddev->active, 1);
        atomic_set(&mddev->openers, 0);
        atomic_set(&mddev->active_io, 0);
+       atomic_set(&mddev->sync_seq, 0);
        spin_lock_init(&mddev->lock);
        atomic_set(&mddev->flush_pending, 0);
        init_waitqueue_head(&mddev->sb_wait);
@@ -4876,19 +4877,28 @@ static void stop_sync_thread(struct mddev *mddev)
        if (work_pending(&mddev->del_work))
                flush_workqueue(md_misc_wq);

-       if (mddev->sync_thread) {
-               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-               md_reap_sync_thread(mddev);
-       }
+       set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+       /*
+        * Thread might be blocked waiting for metadata update which will now
+        * never happen.
+        */
+       if (mddev->sync_thread)
+               wake_up_process(mddev->sync_thread->tsk);

        mddev_unlock(mddev);
 }

 static void idle_sync_thread(struct mddev *mddev)
 {
+       int sync_seq = atomic_read(&mddev->sync_seq);
+
        mutex_lock(&mddev->sync_mutex);
        clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);
+
+       wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+                       !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
        mutex_unlock(&mddev->sync_mutex);
 }

@@ -4897,6 +4907,10 @@ static void frozen_sync_thread(struct mddev *mddev)
        mutex_lock(&mddev->sync_mutex);
        set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);
+
+       wait_event(resync_wait, mddev->sync_thread == NULL &&
+                       !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
        mutex_unlock(&mddev->sync_mutex);
 }

@@ -9471,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)

        /* resync has finished, collect result */
        md_unregister_thread(&mddev->sync_thread);
+       atomic_inc(&mddev->sync_seq);
+
        if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
            !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
            mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 50c7d3e4f2d7..4565f17fc048 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -531,6 +531,8 @@ struct mddev {

        /* Used to synchronize idle and frozen for action_store() */
        struct mutex                    sync_mutex;
+       /* The sequence number for sync thread */
+       atomic_t                        sync_seq;
 };

 enum recovery_flags {
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
We just replaced md_reap_sync_thread() with wait_event(resync_wait, ...) in action_store(); this patch makes sure that action_store() still waits for everything in md_reap_sync_thread() to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index abf58ef633a3..cd86a3a1f771 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9531,7 +9531,6 @@ void md_reap_sync_thread(struct mddev *mddev)
        if (mddev_is_clustered(mddev) && is_reshaped &&
            !test_bit(MD_CLOSING, &mddev->flags))
                md_cluster_ops->update_size(mddev, old_dev_sectors);
-       wake_up(&resync_wait);
        /* flag recovery needed just to double check */
        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
        sysfs_notify_dirent_safe(mddev->sysfs_completed);
@@ -9539,6 +9538,7 @@ void md_reap_sync_thread(struct mddev *mddev)
        md_new_event(mddev);
        if (mddev->event_work.func)
                queue_work(md_misc_wq, &mddev->event_work);
+       wake_up(&resync_wait);
 }
 EXPORT_SYMBOL(md_reap_sync_thread);
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
Before 'idle' and 'frozen' were refactored out of action_store(), interruptible APIs were used, so that a hung-task warning wouldn't be triggered if it took too long to finish the idle/frozen sync_thread. This patch does the same.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/md.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index cd86a3a1f771..61f68689ddfd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4892,11 +4892,14 @@ static void idle_sync_thread(struct mddev *mddev)
 {
        int sync_seq = atomic_read(&mddev->sync_seq);

-       mutex_lock(&mddev->sync_mutex);
+       if (mutex_lock_interruptible(&mddev->sync_mutex))
+               return;
+
        clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);

-       wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+       wait_event_interruptible(resync_wait,
+                       sync_seq != atomic_read(&mddev->sync_seq) ||
                        !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));

        mutex_unlock(&mddev->sync_mutex);
@@ -4904,11 +4907,13 @@ static void idle_sync_thread(struct mddev *mddev)

 static void frozen_sync_thread(struct mddev *mddev)
 {
-       mutex_lock(&mddev->sync_mutex);
+       if (mutex_lock_interruptible(&mddev->sync_mutex))
+               return;
+
        set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);

-       wait_event(resync_wait, mddev->sync_thread == NULL &&
+       wait_event_interruptible(resync_wait, mddev->sync_thread == NULL &&
                        !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));

        mutex_unlock(&mddev->sync_mutex);
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA
--------------------------------
struct mddev is only used inside the raid drivers; wrap the new fields in KABI_EXTEND() just in case md_mod is compiled from the new kernel while raid1/raid10 or other out-of-tree raid modules are compiled from an old kernel.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/md/md.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4565f17fc048..0954b9fe4cf6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -530,9 +530,9 @@ struct mddev {
        bool    serialize_policy:1;

        /* Used to synchronize idle and frozen for action_store() */
-       struct mutex                    sync_mutex;
+       KABI_EXTEND(struct mutex sync_mutex)
        /* The sequence number for sync thread */
-       atomic_t                        sync_seq;
+       KABI_EXTEND(atomic_t sync_seq)
 };

 enum recovery_flags {
From: Andreas Gruenbacher <agruenba@redhat.com>

mainline inclusion
from mainline-v5.14-rc2
commit 8e1bcef8e18d0fec4afe527c074bb1fd6c2b140c
category: bugfix
bugzilla: 188764, https://gitee.com/openeuler/kernel/issues/I736LW
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Create an iop in the writeback path if one doesn't exist. This allows us to avoid creating the iop in some cases. We'll initially do that for pages with inline data, but it can be extended to pages which are entirely within an extent. It also allows for an iop to be removed from pages in the future (eg page split).
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/iomap/buffered-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 33ddce5c5bbc..342331115ca5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1382,14 +1382,13 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, struct writeback_control *wbc,
                struct inode *inode, struct page *page, u64 end_offset)
 {
-       struct iomap_page *iop = to_iomap_page(page);
+       struct iomap_page *iop = iomap_page_create(inode, page);
        struct iomap_ioend *ioend, *next;
        unsigned len = i_blocksize(inode);
        u64 file_offset; /* file offset of page */
        int error = 0, count = 0, i;
        LIST_HEAD(submit_list);

-       WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
        WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);

        /*
From: Andreas Gruenbacher <agruenba@redhat.com>

mainline inclusion
from mainline-v5.14-rc2
commit 637d3375953e052a62c0db409557e3b3354be88a
category: bugfix
bugzilla: 188764, https://gitee.com/openeuler/kernel/issues/I736LW
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In iomap_readpage_actor, don't create iop objects for inline inodes. Otherwise, iomap_read_inline_data will set PageUptodate without setting iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the page doesn't have private data attached in iomap_read_inline_data.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/iomap/buffered-io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 342331115ca5..67871d0a7e4f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -217,6 +217,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
        if (PageUptodate(page))
                return;

+       BUG_ON(page_has_private(page));
        BUG_ON(page->index);
        BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));

@@ -241,7 +242,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 {
        struct iomap_readpage_ctx *ctx = data;
        struct page *page = ctx->cur_page;
-       struct iomap_page *iop = iomap_page_create(inode, page);
+       struct iomap_page *iop;
        bool same_page = false, is_contig = false;
        loff_t orig_pos = pos;
        unsigned poff, plen;
@@ -254,6 +255,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
        }

        /* zero post-eof blocks as the page may be mapped */
+       iop = iomap_page_create(inode, page);
        iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
        if (plen == 0)
                goto done;
From: Andreas Gruenbacher <agruenba@redhat.com>

mainline inclusion
from mainline-v5.14-rc2
commit 229adf3c64dbeae4e2f45fb561907ada9fcc0d0c
category: bugfix
bugzilla: 188764, https://gitee.com/openeuler/kernel/issues/I736LW
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that we create those objects in iomap_writepage_map when needed, there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/iomap/buffered-io.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 67871d0a7e4f..e625c1920458 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1002,7 +1002,6 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
                block_commit_write(page, 0, length);
        } else {
                WARN_ON_ONCE(!PageUptodate(page));
-               iomap_page_create(inode, page);
                set_page_dirty(page);
        }
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.19-rc1 commit e9c3a8e820ed0eeb2be05072f29f80d1b79f053b category: bugfix bugzilla: 188775, https://gitee.com/openeuler/kernel/issues/I73IFH
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
XFS has the unique behavior (as compared to the other Linux filesystems) that on writeback errors it will completely invalidate the affected folio and force the page cache to reread the contents from disk. All other filesystems leave the page mapped and up to date.
This is a rude awakening for user programs, since (in the case where write fails but reread doesn't) file contents will appear to revert to old disk contents with no notification other than an EIO on fsync. This might have been annoying back in the days when iomap dealt with one page at a time, but with multipage folios, we can now throw away *megabytes* worth of data for a single write error.
On *most* Linux filesystems, a program can respond to an EIO on write by redirtying the entire file and scheduling it for writeback. This isn't foolproof, since the page that failed writeback is no longer dirty and could be evicted, but programs that want to recover properly *also* have to detect XFS and regenerate every write they've made to the file.
When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio invalidates multipage folios that could be undergoing writeback. If, say, we have a 256K folio caching a mix of written and unwritten extents, it's possible that we could start writeback of the first (say) 64K of the folio and then hit a writeback error on the next 64K. We then free the iop attached to the folio, which is really bad because writeback completion on the first 64k will trip over the "blocks per folio > 1 && !iop" assertion.
This can't be fixed by only invalidating the folio if writeback fails at the start of the folio, since the folio is marked !uptodate, which trips other assertions elsewhere. Get rid of the whole behavior entirely.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Conflicts:
        fs/xfs/xfs_aops.c
        fs/iomap/buffered-io.c

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/iomap/buffered-io.c | 1 -
 fs/xfs/xfs_aops.c      | 4 +---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e625c1920458..b6f7eae6ccdf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1438,7 +1438,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
                if (wpc->ops->discard_page)
                        wpc->ops->discard_page(page, file_offset);
                if (!count) {
-                       ClearPageUptodate(page);
                        unlock_page(page);
                        goto done;
                }
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d4d531c78ae4..fecb1fddcc79 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -525,7 +525,7 @@ xfs_discard_page(
        int                     error;

        if (XFS_FORCED_SHUTDOWN(mp))
-               goto out_invalidate;
+               return;

        xfs_alert_ratelimited(mp,
                "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
@@ -535,8 +535,6 @@ xfs_discard_page(
                        i_blocks_per_page(inode, page) - pageoff_fsb);
        if (error && !XFS_FORCED_SHUTDOWN(mp))
                xfs_alert(mp, "page discard unable to remove delalloc mapping.");
-out_invalidate:
-       iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
 }

 static const struct iomap_writeback_ops xfs_writeback_ops = {
From: Tudor Ambarus <tudor.ambarus@linaro.org>

stable inclusion
from stable-v5.10.180
commit 0dde3141c527b09b96bef1e7eeb18b8127810ce9
category: bugfix
bugzilla: 188791, https://gitee.com/openeuler/kernel/issues/I76XUJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 4f04351888a83e595571de672e0a4a8b74f4fb31 upstream.
When modifying the block device while it is mounted by the filesystem, syzbot reported the following:
BUG: KASAN: slab-out-of-bounds in crc16+0x206/0x280 lib/crc16.c:58
Read of size 1 at addr ffff888075f5c0a8 by task syz-executor.2/15586

CPU: 1 PID: 15586 Comm: syz-executor.2 Not tainted 6.2.0-rc5-syzkaller-00205-gc96618275234 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 crc16+0x206/0x280 lib/crc16.c:58
 ext4_group_desc_csum+0x81b/0xb20 fs/ext4/super.c:3187
 ext4_group_desc_csum_set+0x195/0x230 fs/ext4/super.c:3210
 ext4_mb_clear_bb fs/ext4/mballoc.c:6027 [inline]
 ext4_free_blocks+0x191a/0x2810 fs/ext4/mballoc.c:6173
 ext4_remove_blocks fs/ext4/extents.c:2527 [inline]
 ext4_ext_rm_leaf fs/ext4/extents.c:2710 [inline]
 ext4_ext_remove_space+0x24ef/0x46a0 fs/ext4/extents.c:2958
 ext4_ext_truncate+0x177/0x220 fs/ext4/extents.c:4416
 ext4_truncate+0xa6a/0xea0 fs/ext4/inode.c:4342
 ext4_setattr+0x10c8/0x1930 fs/ext4/inode.c:5622
 notify_change+0xe50/0x1100 fs/attr.c:482
 do_truncate+0x200/0x2f0 fs/open.c:65
 handle_truncate fs/namei.c:3216 [inline]
 do_open fs/namei.c:3561 [inline]
 path_openat+0x272b/0x2dd0 fs/namei.c:3714
 do_filp_open+0x264/0x4f0 fs/namei.c:3741
 do_sys_openat2+0x124/0x4e0 fs/open.c:1310
 do_sys_open fs/open.c:1326 [inline]
 __do_sys_creat fs/open.c:1402 [inline]
 __se_sys_creat fs/open.c:1396 [inline]
 __x64_sys_creat+0x11f/0x160 fs/open.c:1396
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f72f8a8c0c9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f72f97e3168 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 00007f72f8bac050 RCX: 00007f72f8a8c0c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000280
RBP: 00007f72f8ae7ae9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd165348bf R14: 00007f72f97e3300 R15: 0000000000022000
Replace le16_to_cpu(sbi->s_es->s_desc_size) with sbi->s_desc_size.

This reduces ext4's compiled text size and makes the code more efficient (we remove an extra indirect reference and a potential byte swap on big-endian systems), and there is no downside. It also avoids the potential KASAN / syzkaller failure, as a bonus.
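A self-contained illustration of the pattern, assuming nothing about ext4 internals beyond what the commit message states: decode the little-endian on-disk field once and cache the CPU-endian value, so the hot checksum path never converts again:

#include <stdint.h>
#include <stdio.h>

struct on_disk_sb { unsigned char s_desc_size[2]; }; /* LE bytes on disk */
struct sb_info { uint16_t s_desc_size; };            /* cached, CPU-endian */

/* portable little-endian decode; a byte swap on big-endian hosts */
static uint16_t le16_decode(const unsigned char b[2])
{
        return (uint16_t)(b[0] | (b[1] << 8));
}

int main(void)
{
        struct on_disk_sb es = { .s_desc_size = { 64, 0 } }; /* 64 */
        struct sb_info sbi;

        sbi.s_desc_size = le16_decode(es.s_desc_size); /* once, at "mount" */

        /* the hot path (the checksum loop) now reads only the cached value */
        printf("desc size = %u\n", sbi.s_desc_size);
        return 0;
}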
Reported-by: syzbot+fc51227e7100c9294894@syzkaller.appspotmail.com
Reported-by: syzbot+8785e41224a3afd04321@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=70d28d11ab14bd7938f3e088365252aa923cff4...
Link: https://syzkaller.appspot.com/bug?id=b85721b38583ecc6b5e72ff524c67302abbc30f...
Link: https://lore.kernel.org/all/000000000000ece18705f3b20934@google.com/
Fixes: 717d50e4971b ("Ext4: Uninitialized Block Groups")
Cc: stable@vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://lore.kernel.org/r/20230504121525.3275886-1-tudor.ambarus@linaro.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/ext4/super.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3be882d07616..5e776f27976c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2878,11 +2878,9 @@ static __le16 ext4_group_desc_csum(struct super_block *sb, __u32 block_group,
        crc = crc16(crc, (__u8 *)gdp, offset);
        offset += sizeof(gdp->bg_checksum); /* skip checksum */
        /* for checksum of struct ext4_group_desc do the rest...*/
-       if (ext4_has_feature_64bit(sb) &&
-           offset < le16_to_cpu(sbi->s_es->s_desc_size))
+       if (ext4_has_feature_64bit(sb) && offset < sbi->s_desc_size)
                crc = crc16(crc, (__u8 *)gdp + offset,
-                           le16_to_cpu(sbi->s_es->s_desc_size) -
-                           offset);
+                           sbi->s_desc_size - offset);

 out:
        return cpu_to_le16(crc);
From: Zhong Jinghua <zhongjinghua@huawei.com>

hulk inclusion
category: bugfix
bugzilla: 187268, https://gitee.com/openeuler/kernel/issues/I76JDY
CVE: NA
----------------------------------------
In block_ioctl, we can pass in an unsigned number such as 0x8000000000000000 as an input parameter, like below:
block_ioctl
 blkdev_ioctl
  blkpg_ioctl
   blkpg_do_ioctl
    copy_from_user
    bdev_add_partition
     add_partition
      p->start_sect = start; // start = 0x8000000000000000
Then a warning is triggered when the bio is submitted:
WARNING: CPU: 0 PID: 382 at fs/iomap/apply.c:54
Call trace:
 iomap_apply+0x644/0x6e0
 __iomap_dio_rw+0x5cc/0xa24
 iomap_dio_rw+0x4c/0xcc
 ext4_dio_read_iter
 ext4_file_read_iter
 ext4_file_read_iter+0x318/0x39c
 call_read_iter
 lo_rw_aio.isra.0+0x748/0x75c
 do_req_filebacked+0x2d4/0x370
 loop_handle_cmd
 loop_queue_work+0x94/0x23c
 kthread_worker_fn+0x160/0x6bc
 loop_kthread_worker_fn+0x3c/0x50
 kthread+0x20c/0x25c
 ret_from_fork+0x10/0x18
Stack:
submit_bio_noacct
 submit_bio_checks
  blk_partition_remap
   bio->bi_iter.bi_sector += p->start_sect
   // bio->bi_iter.bi_sector = 0xffc0000000000000 + 65408

..
loop_queue_work
 loop_handle_cmd
  do_req_filebacked
   pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset // pos < 0
   lo_rw_aio
    call_read_iter
     ext4_dio_read_iter
      __iomap_dio_rw
       iomap_apply
        ext4_iomap_begin
         map.m_lblk = offset >> blkbits
         ext4_set_iomap
          iomap->offset = (u64) map->m_lblk << blkbits
          // iomap->offset = 64512
        WARN_ON(iomap.offset > pos)
        // iomap.offset = 64512 and pos < 0
This is unreasonable, since start + length > disk->part0.nr_sects. There is already a similar check in blk_add_partition(). Fix it by adding a check in bdev_add_partition().
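A hedged userspace model of the added validation (validate() and the capacity value are invented for the demo; the kernel-side checks are in the hunk below):

#include <stdio.h>

#define SECTOR_SHIFT 9

/* reject sign/wraparound problems before shifting, then range-check */
static int validate(long long start, long long length, long long capacity)
{
        if (start < 0 || length <= 0 || start + length < 0)
                return -1;
        start >>= SECTOR_SHIFT;
        length >>= SECTOR_SHIFT;
        /* length may be equal to 0 after the right shift */
        if (!length || start + length > capacity)
                return -1;
        return 0;
}

int main(void)
{
        /* 0x8000000000000000 arrives as a negative long long */
        long long bad = (long long)0x8000000000000000ULL;

        printf("bad start:  %d\n", validate(bad, 512, 2097152)); /* -1 */
        printf("good start: %d\n", validate(0, 512, 2097152));   /*  0 */
        return 0;
}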
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 block/ioctl.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index e3c5a27c23b1..f2a3524ed0ac 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -32,9 +32,16 @@ static int blkpg_do_ioctl(struct block_device *bdev,
        if (op == BLKPG_DEL_PARTITION)
                return bdev_del_partition(bdev, p.pno);

+       if (p.start < 0 || p.length <= 0 || p.start + p.length < 0)
+               return -EINVAL;
+
        start = p.start >> SECTOR_SHIFT;
        length = p.length >> SECTOR_SHIFT;

+       /* length may be equal to 0 after right shift */
+       if (!length || start + length > get_capacity(bdev->bd_disk))
+               return -EINVAL;
+
        /* check for fit in a hd_struct */
        if (sizeof(sector_t) < sizeof(long long)) {
                long pstart = start, plength = length;
From: Xia Fukun <xiafukun@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6YJJQ
CVE: NA
----------------------------------------
The function sd_llc_free_all() is called to release allocated resources when space allocation for the scheduling domain structure fails. However, it did not check whether sd is a null pointer when releasing the sdd resources, resulting in the error "Unable to handle kernel paging request at virtual address".

Fix this issue by adding a null pointer check.
Fixes: 79bec4c643fb ("sched/topology: Provide hooks to allocate data shared per LLC")
Signed-off-by: Xia Fukun <xiafukun@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: songping yu <yusongping@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9b4e3b25ddff..d8827c0535a8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2035,7 +2035,7 @@ static void sd_llc_free_all(const struct cpumask *cpu_map)

        for_each_sd_topology(tl) {
                sdd = &tl->data;
-               if (!sdd)
+               if (!sdd || !sdd->sd)
                        continue;
                for_each_cpu(j, cpu_map) {
                        sd = *per_cpu_ptr(sdd->sd, j);
From: Miaoqian Lin <linmq006@gmail.com>

stable inclusion
from stable-v5.10.171
commit 0a4181b23acf53e9c95b351df6a7891116b98f9b
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6IKWF
CVE: CVE-2023-22998
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit c24968734abfed81c8f93dc5f44a7b7a9aecadfa upstream.
Since the drm_prime_pages_to_sg() function returns error pointers, the drm_gem_shmem_get_sg_table() function returns error pointers too. Use IS_ERR() to check the return value to fix this.
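For readers unfamiliar with the idiom, a hedged userspace mock (simplified re-definitions relying on a flat address space, not the kernel's headers) showing why a NULL check can never catch an ERR_PTR return:

#include <errno.h>
#include <stdio.h>

#define MAX_ERRNO 4095

static void *ERR_PTR(long err) { return (void *)err; }
static long PTR_ERR(const void *p) { return (long)p; }
static int IS_ERR(const void *p)
{
        /* error pointers occupy the top 4095 address values */
        return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

static void *get_sg_table(void)
{
        return ERR_PTR(-ENOMEM);        /* failure path returns an error pointer */
}

int main(void)
{
        void *pages = get_sg_table();

        if (pages == NULL)              /* the buggy check: never true here */
                puts("NULL check caught it");
        if (IS_ERR(pages))              /* the correct check */
                printf("IS_ERR caught error %ld\n", PTR_ERR(pages));
        return 0;
}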
Fixes: 2f2aa13724d5 ("drm/virtio: move virtio_gpu_mem_entry initialization to new function")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220602104223.54527-1-linmq006...
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/gpu/drm/virtio/virtgpu_object.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_object.c b/drivers/gpu/drm/virtio/virtgpu_object.c
index 0c98978e2e55..d4fab3361d2c 100644
--- a/drivers/gpu/drm/virtio/virtgpu_object.c
+++ b/drivers/gpu/drm/virtio/virtgpu_object.c
@@ -157,9 +157,9 @@ static int virtio_gpu_object_shmem_init(struct virtio_gpu_device *vgdev,
         * since virtio_gpu doesn't support dma-buf import from other devices.
         */
        shmem->pages = drm_gem_shmem_get_sg_table(&bo->base.base);
-       if (!shmem->pages) {
+       if (IS_ERR(shmem->pages)) {
                drm_gem_shmem_unpin(&bo->base.base);
-               return -EINVAL;
+               return PTR_ERR(shmem->pages);
        }

        if (use_dma_api) {
From: Dmitry Osipenko <dmitry.osipenko@collabora.com>

stable inclusion
from stable-v5.10.171
commit 87c647def389354c95263d6635c62ca0de7d12ca
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6IKWF
CVE: CVE-2023-22998
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 64b88afbd92fbf434759d1896a7cf705e1c00e79 upstream.
The previous commit fixed checking of the ERR_PTR value returned by drm_gem_shmem_get_sg_table(), but it missed zeroing out shmem->pages, which will crash virtio_gpu_cleanup_object(). Add the missing zeroing of shmem->pages.
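A hedged mock of the cleanup hazard (invented struct and names, simplified ERR_PTR re-definitions): a leftover ERR_PTR is non-NULL, so a cleanup path that tests the pointer would treat it as a valid allocation unless it is zeroed first:

#include <errno.h>
#include <stdio.h>

#define MAX_ERRNO 4095

static void *ERR_PTR(long err) { return (void *)err; }
static int IS_ERR(const void *p)
{
        return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

struct obj { void *pages; };

static void cleanup(struct obj *o)
{
        if (o->pages)           /* cleanup treats non-NULL as "something to free" */
                printf("would free pages at %p (crash if ERR_PTR!)\n", o->pages);
        else
                puts("cleanup: nothing to do");
}

int main(void)
{
        struct obj o = { .pages = ERR_PTR(-ENOMEM) };   /* failed get */

        if (IS_ERR(o.pages))
                o.pages = NULL;         /* the fix: don't leave an ERR_PTR behind */
        cleanup(&o);                    /* now safe */
        return 0;
}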
Fixes: c24968734abf ("drm/virtio: Fix NULL vs IS_ERR checking in virtio_gpu_object_shmem_init")
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220630200726.1884320-2-dmitry...
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/gpu/drm/virtio/virtgpu_object.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/virtio/virtgpu_object.c b/drivers/gpu/drm/virtio/virtgpu_object.c
index d4fab3361d2c..168148686001 100644
--- a/drivers/gpu/drm/virtio/virtgpu_object.c
+++ b/drivers/gpu/drm/virtio/virtgpu_object.c
@@ -159,6 +159,7 @@ static int virtio_gpu_object_shmem_init(struct virtio_gpu_device *vgdev,
        shmem->pages = drm_gem_shmem_get_sg_table(&bo->base.base);
        if (IS_ERR(shmem->pages)) {
                drm_gem_shmem_unpin(&bo->base.base);
+               shmem->pages = NULL;
                return PTR_ERR(shmem->pages);
        }
From: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>

stable inclusion
from stable-v5.10.173
commit c5fe3fba1b7bfecb6f17f93a433782b8500fe377
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6IKWF
CVE: CVE-2023-22998
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
In virtio_gpu_object_shmem_init() we are passing NULL to PTR_ERR, which returns 0/success.

Fix this by storing the error value in the 'ret' variable before setting shmem->pages to NULL.
Found using static analysis with Smatch.
Fixes: 64b88afbd92f ("drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 drivers/gpu/drm/virtio/virtgpu_object.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_object.c b/drivers/gpu/drm/virtio/virtgpu_object.c
index 168148686001..49fa59e09187 100644
--- a/drivers/gpu/drm/virtio/virtgpu_object.c
+++ b/drivers/gpu/drm/virtio/virtgpu_object.c
@@ -159,8 +159,9 @@ static int virtio_gpu_object_shmem_init(struct virtio_gpu_device *vgdev,
        shmem->pages = drm_gem_shmem_get_sg_table(&bo->base.base);
        if (IS_ERR(shmem->pages)) {
                drm_gem_shmem_unpin(&bo->base.base);
+               ret = PTR_ERR(shmem->pages);
                shmem->pages = NULL;
-               return PTR_ERR(shmem->pages);
+               return ret;
        }

        if (use_dma_api) {
From: Xiu Jianfeng <xiujianfeng@huaweicloud.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I798WQ
CVE: NA
----------------------------------------------------------------------
We found a refcount UAF bug as follows:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
Workqueue: events cpuset_hotplug_workfn
Call trace:
 refcount_warn_saturate+0xa0/0x148
 __refcount_add.constprop.0+0x5c/0x80
 css_task_iter_advance_css_set+0xd8/0x210
 css_task_iter_advance+0xa8/0x120
 css_task_iter_next+0x94/0x158
 update_tasks_root_domain+0x58/0x98
 rebuild_root_domains+0xa0/0x1b0
 rebuild_sched_domains_locked+0x144/0x188
 cpuset_hotplug_workfn+0x138/0x5a0
 process_one_work+0x1e8/0x448
 worker_thread+0x228/0x3e0
 kthread+0xe0/0xf0
 ret_from_fork+0x10/0x20
then a kernel panic will be triggered as below:
Unable to handle kernel paging request at virtual address 00000000c0000010
Call trace:
 cgroup_apply_control_disable+0xa4/0x16c
 rebind_subsystems+0x224/0x590
 cgroup_destroy_root+0x64/0x2e0
 css_free_rwork_fn+0x198/0x2a0
 process_one_work+0x1d4/0x4bc
 worker_thread+0x158/0x410
 kthread+0x108/0x13c
 ret_from_fork+0x10/0x18
The race that causes this bug is shown below:
(hotplug cpu)                      | (umount cpuset)
mutex_lock(&cpuset_mutex)          | mutex_lock(&cgroup_mutex)
cpuset_hotplug_workfn              |
 rebuild_root_domains              |  rebind_subsystems
  update_tasks_root_domain         |   spin_lock_irq(&css_set_lock)
   css_task_iter_start             |   list_move_tail(&cset->e_cset_node[ss->id]
   while(css_task_iter_next)       |                  &dcgrp->e_csets[ss->id]);
   css_task_iter_end               |   spin_unlock_irq(&css_set_lock)
mutex_unlock(&cpuset_mutex)        | mutex_unlock(&cgroup_mutex)
Inside css_task_iter_start/next/end, css_set_lock is held and then released, so while iterating tasks (left side), the css_set may be moved to another list (right side). Then it->cset_head points to the old list head and it->cset_pos->next points to the head node of the new list, which can't be used as a struct css_set.

To fix this issue, introduce a CSS_TASK_ITER_STOPPED flag for css_task_iter. When moving a css_set to dcgrp->e_csets[ss->id] in rebind_subsystems(), stop the task iteration.
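A self-contained model of the broken invariant (plain C and invented names, not the kernel list API): the iterator caches the old list head as its stop condition, so once the element migrates, advancing walks onto the new list's head, which is not a real element:

#include <stdio.h>

struct node { struct node *next; int val; };

int main(void)
{
        struct node n = { .val = 42 };  /* the element being iterated */
        struct node head_a = { .next = &n, .val = -1 }; /* old list head */
        struct node head_b = { .next = NULL, .val = -2 }; /* new list head */

        n.next = &head_a;       /* circular list A: head_a -> n -> head_a */

        struct node *stop = &head_a;    /* iterator's cached stop condition */
        struct node *pos = head_a.next; /* iteration is currently at n */

        /* concurrent "rebind" moves n from list A to list B */
        head_a.next = &head_a;
        n.next = &head_b;
        head_b.next = &n;

        pos = pos->next;        /* iterator resumes and advances... */
        if (pos != stop)        /* ...the stop condition never matches... */
                /* ...so list B's head is treated as a real element, the
                   analogue of misusing a list head as a struct css_set */
                printf("iterator landed on a foreign list head (val=%d)\n",
                       pos->val);
        return 0;
}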
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://www.spinics.net/lists/cgroups/msg37935.html
Fixes: f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huaweicloud.com>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9226ec130ba3..9d50e3e01648 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -47,6 +47,7 @@ struct kernel_clone_args;

 /* internal flags */
 #define CSS_TASK_ITER_SKIPPED  (1U << 16)
+#define CSS_TASK_ITER_STOPPED  (1U << 17)

 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 46d5c120c626..f9c5375ad8fd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -214,6 +214,8 @@ static int cgroup_apply_control(struct cgroup *cgrp);
 static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
 static void css_task_iter_skip(struct css_task_iter *it,
                               struct task_struct *task);
+static void css_task_iter_stop(struct css_task_iter *it,
+                              struct cgroup_subsys *ss);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
                                              struct cgroup_subsys *ss);
@@ -854,6 +856,19 @@ static void css_set_skip_task_iters(struct css_set *cset,
                css_task_iter_skip(it, task);
 }

+/*
+ * @cset is moving to other list, it's not safe to continue the iteration,
+ * because the cset_head of css_task_iter which is from the old list can
+ * not be used as the stop condition of iteration.
+ */
+static void css_set_stop_iters(struct css_set *cset, struct cgroup_subsys *ss)
+{
+       struct css_task_iter *it, *pos;
+
+       list_for_each_entry_safe(it, pos, &cset->task_iters, iters_node)
+               css_task_iter_stop(it, ss);
+}
+
 /**
  * css_set_move_task - move a task from one css_set to another
  * @task: task being moved
@@ -1774,9 +1789,11 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
                css->cgroup = dcgrp;

                spin_lock_irq(&css_set_lock);
-               hash_for_each(css_set_table, i, cset, hlist)
+               hash_for_each(css_set_table, i, cset, hlist) {
+                       css_set_stop_iters(cset, ss);
                        list_move_tail(&cset->e_cset_node[ss->id],
                                       &dcgrp->e_csets[ss->id]);
+               }
                spin_unlock_irq(&css_set_lock);

                if (ss->css_rstat_flush) {
@@ -4664,6 +4681,16 @@ static void css_task_iter_skip(struct css_task_iter *it,
        }
 }

+static void css_task_iter_stop(struct css_task_iter *it,
+                              struct cgroup_subsys *ss)
+{
+       lockdep_assert_held(&css_set_lock);
+
+       if (it->ss == ss) {
+               it->flags |= CSS_TASK_ITER_STOPPED;
+       }
+}
+
 static void css_task_iter_advance(struct css_task_iter *it)
 {
        struct task_struct *task;
@@ -4767,6 +4794,11 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)

        spin_lock_irq(&css_set_lock);

+       if (it->flags & CSS_TASK_ITER_STOPPED) {
+               spin_unlock_irq(&css_set_lock);
+               return NULL;
+       }
+
        /* @it may be half-advanced by skips, finish advancing */
        if (it->flags & CSS_TASK_ITER_SKIPPED)
                css_task_iter_advance(it);