Li Nan (1): md: ensure child flush IO does not affect origin bio->bi_status
Mikulas Patocka (1): md: fix a crash in mempool_free
Yu Kuai (1): md: Remove flush handling
drivers/md/md.h | 10 ---- drivers/md/md.c | 137 ++++++++++++------------------------------------ 2 files changed, 34 insertions(+), 113 deletions(-)
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from stable-v5.10.163 commit ae7793027766491c5f8635b12d15a5940d3b8698 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAKQB5 CVE: CVE-2024-43855
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
----------------------------------------------------
commit 341097ee53573e06ab9fc675d96a052385b851fa upstream.
There's a crash in mempool_free when running the lvm test shell/lvchange-rebuild-raid.sh.
The reason for the crash is this: * super_written calls atomic_dec_and_test(&mddev->pending_writes) and wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev) and bio_put(bio). * so, the process that waited on sb_wait and that is woken up is racing with bio_put(bio). * if the process wins the race, it calls bioset_exit before bio_put(bio) is executed. * bio_put(bio) attempts to free a bio into a destroyed bio set - causing a crash in mempool_free.
We fix this bug by moving bio_put before atomic_dec_and_test.
We also move rdev_dec_pending before atomic_dec_and_test as suggested by Neil Brown.
The function md_end_flush has a similar bug - we must call bio_put before we decrement the number of in-progress bios.
BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 11557f0067 P4D 11557f0067 PUD 0 Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: kdelayd flush_expired_bios [dm_delay] RIP: 0010:mempool_free+0x47/0x80 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05 FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0 Call Trace: <TASK> clone_endio+0xf4/0x1c0 [dm_mod] clone_endio+0xf4/0x1c0 [dm_mod] __submit_bio+0x76/0x120 submit_bio_noacct_nocheck+0xb6/0x2a0 flush_expired_bios+0x28/0x2f [dm_delay] process_one_work+0x1b4/0x300 worker_thread+0x45/0x3e0 ? rescuer_thread+0x380/0x380 kthread+0xc2/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd] CR2: 0000000000000000 ---[ end trace 0000000000000000 ]---
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Song Liu song@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: zhaoxiaoqiang11 zhaoxiaoqiang11@jd.com Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index f958b1e9a672..0292eb1c8363 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -545,13 +545,14 @@ static void md_end_flush(struct bio *bio) struct md_rdev *rdev = bio->bi_private; struct mddev *mddev = rdev->mddev;
+ bio_put(bio); + rdev_dec_pending(rdev, mddev);
if (atomic_dec_and_test(&mddev->flush_pending)) { /* The pre-request flush has finished */ queue_work(md_wq, &mddev->flush_work); } - bio_put(bio); }
static void md_submit_flush_data(struct work_struct *ws); @@ -955,10 +956,12 @@ static void super_written(struct bio *bio) } else clear_bit(LastDev, &rdev->flags);
+ bio_put(bio); + + rdev_dec_pending(rdev, mddev); + if (atomic_dec_and_test(&mddev->pending_writes)) wake_up(&mddev->sb_wait); - rdev_dec_pending(rdev, mddev); - bio_put(bio); }
void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
From: Yu Kuai yukuai3@huawei.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAKQB5 CVE: CVE-2024-43855
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
--------------------------------
For flush request, md has a special flush handling to merge concurrent flush request into single one, however, the whole mechanism is based on a disk level spin_lock 'mddev->lock'. And fsync can be called quite often in some user cases, for consequence, spin lock from IO fast path can cause performance degradation.
Fortunately, the block layer already has flush handling to merge concurrent flush request, and it only acquires hctx level spin lock. (see details in blk-flush.c)
This patch removes the flush handling in md, and converts to use general block layer flush handling in underlying disks.
Flush test for 4 nvme raid10: start 128 threads to do fsync 100000 times, on arm64, see how long it takes.
Test script: void* thread_func(void* arg) { int fd = *(int*)arg; for (int i = 0; i < FSYNC_COUNT; i++) { fsync(fd); } return NULL; }
int main() { int fd = open("/dev/md0", O_RDWR); if (fd < 0) { perror("open"); exit(1); }
pthread_t threads[THREADS]; struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < THREADS; i++) { pthread_create(&threads[i], NULL, thread_func, &fd); }
for (int i = 0; i < THREADS; i++) { pthread_join(threads[i], NULL); }
gettimeofday(&end, NULL);
close(fd);
long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec); printf("Elapsed time: %lld microseconds\n", elapsed);
return 0; }
Test result: about 10 times faster: Before this patch: 50943374 microseconds After this patch: 5096347 microseconds
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Yu Kuai yukuai3@huawei.com Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu song@kernel.org Conflicts: drivers/md/md.c [ There is a lot of conflict, we just delete the same function as origin patch. In the new function, the differences from mainline are: 1. remove the check of 'mddev->active_io'. 2. use rdev_for_each_rcu() and add rcu lock. 3. change the way of alloc 'new'. ] Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.h | 10 ---- drivers/md/md.c | 127 +++++++----------------------------------------- 2 files changed, 18 insertions(+), 119 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h index 6eba883eddd6..e13d3654aef5 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -508,16 +508,6 @@ struct mddev { */ struct bio_set io_acct_set; /* for raid0 and raid5 io accounting */
- /* Generic flush handling. - * The last to finish preflush schedules a worker to submit - * the rest of the request (without the REQ_PREFLUSH flag). - */ - struct bio *flush_bio; - atomic_t flush_pending; - ktime_t start_flush, last_flush; /* last_flush is when the last completed - * flush was started. - */ - struct work_struct flush_work; struct work_struct event_work; /* used by dm to report failure event */ mempool_t *serial_info_pool; void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); diff --git a/drivers/md/md.c b/drivers/md/md.c index 0292eb1c8363..94ea0de3f06f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -536,123 +536,33 @@ void mddev_resume(struct mddev *mddev) } EXPORT_SYMBOL_GPL(mddev_resume);
-/* - * Generic flush handling for md - */ - -static void md_end_flush(struct bio *bio) -{ - struct md_rdev *rdev = bio->bi_private; - struct mddev *mddev = rdev->mddev; - - bio_put(bio); - - rdev_dec_pending(rdev, mddev); - - if (atomic_dec_and_test(&mddev->flush_pending)) { - /* The pre-request flush has finished */ - queue_work(md_wq, &mddev->flush_work); - } -} - -static void md_submit_flush_data(struct work_struct *ws); - -static void submit_flushes(struct work_struct *ws) +bool md_flush_request(struct mddev *mddev, struct bio *bio) { - struct mddev *mddev = container_of(ws, struct mddev, flush_work); struct md_rdev *rdev; + struct bio *new;
- mddev->start_flush = ktime_get_boottime(); - INIT_WORK(&mddev->flush_work, md_submit_flush_data); - atomic_set(&mddev->flush_pending, 1); rcu_read_lock(); - rdev_for_each_rcu(rdev, mddev) - if (rdev->raid_disk >= 0 && - !test_bit(Faulty, &rdev->flags)) { - /* Take two references, one is dropped - * when request finishes, one after - * we reclaim rcu_read_lock - */ - struct bio *bi; - atomic_inc(&rdev->nr_pending); - atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); - bi = bio_alloc_mddev(GFP_NOIO, 0, mddev); - bi->bi_end_io = md_end_flush; - bi->bi_private = rdev; - bio_set_dev(bi, rdev->bdev); - bi->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; - atomic_inc(&mddev->flush_pending); - submit_bio(bi); - rcu_read_lock(); - rdev_dec_pending(rdev, mddev); - } - rcu_read_unlock(); - if (atomic_dec_and_test(&mddev->flush_pending)) - queue_work(md_wq, &mddev->flush_work); -} - -static void md_submit_flush_data(struct work_struct *ws) -{ - struct mddev *mddev = container_of(ws, struct mddev, flush_work); - struct bio *bio = mddev->flush_bio; + rdev_for_each_rcu(rdev, mddev) { + if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags)) + continue;
- /* - * must reset flush_bio before calling into md_handle_request to avoid a - * deadlock, because other bios passed md_handle_request suspend check - * could wait for this and below md_handle_request could wait for those - * bios because of suspend check - */ - spin_lock_irq(&mddev->lock); - mddev->last_flush = mddev->start_flush; - mddev->flush_bio = NULL; - spin_unlock_irq(&mddev->lock); - wake_up(&mddev->sb_wait); + rcu_read_unlock(); + new = bio_alloc_mddev(GFP_NOIO, 0, mddev); + bio_set_dev(new, rdev->bdev); + new->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; + bio_chain(new, bio); + submit_bio(new); + rcu_read_lock(); + } + rcu_read_unlock();
- if (bio->bi_iter.bi_size == 0) { - /* an empty barrier - all done */ + if (bio_sectors(bio) == 0) { bio_endio(bio); - } else { - bio->bi_opf &= ~REQ_PREFLUSH; - md_handle_request(mddev, bio); + return true; } -}
-/* - * Manages consolidation of flushes and submitting any flushes needed for - * a bio with REQ_PREFLUSH. Returns true if the bio is finished or is - * being finished in another context. Returns false if the flushing is - * complete but still needs the I/O portion of the bio to be processed. - */ -bool md_flush_request(struct mddev *mddev, struct bio *bio) -{ - ktime_t start = ktime_get_boottime(); - spin_lock_irq(&mddev->lock); - wait_event_lock_irq(mddev->sb_wait, - !mddev->flush_bio || - ktime_after(mddev->last_flush, start), - mddev->lock); - if (!ktime_after(mddev->last_flush, start)) { - WARN_ON(mddev->flush_bio); - mddev->flush_bio = bio; - bio = NULL; - } - spin_unlock_irq(&mddev->lock); - - if (!bio) { - INIT_WORK(&mddev->flush_work, submit_flushes); - queue_work(md_wq, &mddev->flush_work); - } else { - /* flush was performed for some other bio while we waited. */ - if (bio->bi_iter.bi_size == 0) - /* an empty barrier - all done */ - bio_endio(bio); - else { - bio->bi_opf &= ~REQ_PREFLUSH; - return false; - } - } - return true; + bio->bi_opf &= ~REQ_PREFLUSH; + return false; } EXPORT_SYMBOL(md_flush_request);
@@ -701,7 +611,6 @@ void mddev_init(struct mddev *mddev) atomic_set(&mddev->active_io, 0); atomic_set(&mddev->sync_seq, 0); spin_lock_init(&mddev->lock); - atomic_set(&mddev->flush_pending, 0); init_waitqueue_head(&mddev->sb_wait); init_waitqueue_head(&mddev->recovery_wait); mddev->reshape_position = MaxSector;
hulk inclusion category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAKQB5 CVE: CVE-2024-43855
--------------------------------
When a flush is issued to an RAID array, a child flush IO is created and issued for each member disk in the RAID array. Since patch b75197e86e6d ("md: Remove flush handling"), each child flush IO has been chained with the original bio. As a result, the failure of any child IO could modify the bi_status of the original bio, potentially impacting the upper-layer filesystem.
Fix the issue by preventing child flush IO from altering the original bio->bi_status as before. However, this design introduces a known issue: in the event of a power failure, if a flush IO on a member disk fails, the upper layers may not be informed. This issue will be fixed in a future patch.
Fixes: b75197e86e6d ("md: Remove flush handling") Signed-off-by: Li Nan linan122@huawei.com --- drivers/md/md.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 94ea0de3f06f..d96921ff1d29 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -536,6 +536,23 @@ void mddev_resume(struct mddev *mddev) } EXPORT_SYMBOL_GPL(mddev_resume);
+static void md_end_flush(struct bio *bio) +{ + struct bio *parent = bio->bi_private; + char b[BDEVNAME_SIZE]; + + /* + * If any flush io error before the power failure, + * disk data may be lost. + */ + if (bio->bi_status) + pr_err("md: %s flush io error %d\n", bio_devname(bio, b), + blk_status_to_errno(bio->bi_status)); + + bio_put(bio); + bio_endio(parent); +} + bool md_flush_request(struct mddev *mddev, struct bio *bio) { struct md_rdev *rdev; @@ -550,7 +567,9 @@ bool md_flush_request(struct mddev *mddev, struct bio *bio) new = bio_alloc_mddev(GFP_NOIO, 0, mddev); bio_set_dev(new, rdev->bdev); new->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; - bio_chain(new, bio); + new->bi_private = bio; + new->bi_end_io = md_end_flush; + bio_inc_remaining(bio); submit_bio(new); rcu_read_lock(); }
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/12117 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/M...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/12117 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/M...