[PATCH OLK-6.6 0/8] cgroupv2 io bugfix patches
cgroupv2 io bugfix patches

David Laight (1):
  block: use min() instead of min_t()

Jialin Wang (1):
  blk-iocost: fix busy_level reset when no IOs complete

Julian Sun (1):
  jbd2: increase IO priority of checkpoint

Li Lingfeng (1):
  block: flush all throttled bios when deleting the cgroup

Ming Lei (1):
  blk-throttle: don't take carryover for prioritized processing of metadata

Yu Kuai (3):
  blk-throttle: fix lower control under super low iops limit
  blk-throttle: support prioritized processing of metadata
  blk-throttle: fix lower bps rate by throtl_trim_slice()

 block/blk-iocost.c     |  29 ++++++---
 block/blk-throttle.c   | 140 +++++++++++++++++++++++++++--------------
 block/partitions/efi.c |   3 +-
 fs/jbd2/checkpoint.c   |   2 +-
 4 files changed, 113 insertions(+), 61 deletions(-)

-- 
2.52.0
From: Yu Kuai <yukuai3@huawei.com>
Subject: [PATCH OLK-6.6 1/8] blk-throttle: fix lower control under super low iops limit

mainline inclusion
from mainline-v6.11-rc1
commit 1beabab88ecee0698ecee7b54afa9cce7046ef96
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Users configure an allowed iops limit per second, and
calculate_io_allowed() computes the allowed iops within a slice as:

	limit * throtl_slice / HZ

However, if the limit is quite low, the result can be 0, so no IO is
allowed in the slice; this causes missed dispatches, and the actual
rate ends up below the limit. For example, with iops_limit set to 5 on
an HD disk, tests show an actual rate of 3 iops. This is usually not a
big deal, because users are unlikely to configure such a low iops
limit, but it is still a problem in the extreme case.

Fix the problem by making sure the wait time calculated by
tg_within_iops_limit() allows at least one IO to be dispatched.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240618062108.3680835-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-throttle.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 61fc85f8f53a..623d4145cf4e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -875,10 +875,13 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
 
 	if (io_allowed > 0 && tg->io_disp[rw] + 1 <= io_allowed)
 		return 0;
 
 	/* Calc approx time to dispatch */
 	jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed;
+
+	/* make sure at least one io can be dispatched after waiting */
+	jiffy_wait = max(jiffy_wait, HZ / iops_limit + 1);
 	return jiffy_wait;
 }
 
 static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
 					 u64 bps_limit)
-- 
2.52.0
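To make the rounding concrete, here is a minimal user-space sketch of
the arithmetic (HZ, the HD slice length, and io_allowed() are assumed
stand-ins here, not the kernel's constants or calculate_io_allowed()
itself):

#include <stdio.h>

#define HZ		100		/* assumed: 1 jiffy = 10ms */
#define THROTL_SLICE	(HZ / 10)	/* assumed HD default: 100ms slice */

/* allowed IOs over an elapsed period: limit * jiffies / HZ, truncated */
static unsigned long io_allowed(unsigned long iops_limit, unsigned long jiffies)
{
	return iops_limit * jiffies / HZ;
}

int main(void)
{
	unsigned long limit = 5;
	unsigned long wait = HZ / limit + 1;	/* floor added by the fix */

	/* 5 iops over a 100ms slice is 0.5, truncated to 0: nothing dispatches */
	printf("IOs allowed per slice: %lu\n", io_allowed(limit, THROTL_SLICE));

	/* the fix waits 21 jiffies here, enough for at least one IO */
	printf("after %lu-jiffy wait:  %lu IO(s) allowed\n",
	       wait, io_allowed(limit, wait));
	return 0;
}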
From: Yu Kuai <yukuai3@huawei.com>
Subject: [PATCH OLK-6.6 2/8] blk-throttle: support prioritized processing of metadata

mainline inclusion
from mainline-v6.12-rc1
commit 29390bb5661d49d10424ad8e915230de1f7074c9
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Currently, blk-throttle handles all IO in FIFO order; hence if data IO
is throttled and meta IO is dispatched afterwards, the meta IO has to
wait for the data IO, causing priority inversion. This patch supports
handling metadata first and paying the debt afterwards while data is
throttled.

Test script: use cgroup v1 to throttle the root cgroup, then create a
new dir and file while writeback is throttled:

test() {
	mkdir /mnt/test/xxx
	touch /mnt/test/xxx/1
	sync /mnt/test/xxx
	sync /mnt/test/xxx
}

mkfs.ext4 -F /dev/nvme0n1 -E lazy_itable_init=0,lazy_journal_init=0
mount /dev/nvme0n1 /mnt/test

echo "259:0 $((1024*1024))" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
dd if=/dev/zero of=/mnt/test/foo1 bs=16M count=1 conv=fdatasync status=none &
sleep 4

time test
echo "259:0 0" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device

sleep 1
umount /dev/nvme0n1

Test result: time cost for creating a new dir and file

before this patch:	14s
after this patch:	0.1s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240903135149.271857-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
	block/blk-throttle.c
[throttle-low has not been removed; this does not affect this patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-throttle.c | 64 +++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 22 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 623d4145cf4e..a19f8e538d42 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2203,10 +2203,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 static void throtl_upgrade_state(struct throtl_data *td)
 {
 }
 #endif
 
+static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
+{
+	/* throtl is FIFO - if bios are already queued, should queue */
+	if (tg->service_queue.nr_queued[rw])
+		return false;
+
+	return tg_may_dispatch(tg, bio, NULL);
+}
+
+static void tg_dispatch_in_debt(struct throtl_grp *tg, struct bio *bio, bool rw)
+{
+	if (!bio_flagged(bio, BIO_BPS_THROTTLED))
+		tg->carryover_bytes[rw] -= throtl_bio_data_size(bio);
+	tg->carryover_ios[rw]--;
+}
+
 bool __blk_throtl_bio(struct bio *bio)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	struct blkcg_gq *blkg = bio->bi_blkg;
 	struct throtl_qnode *qn = NULL;
@@ -2230,40 +2246,44 @@ bool __blk_throtl_bio(struct bio *bio)
 	while (true) {
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
-		/* throtl is FIFO - if bios are already queued, should queue */
-		if (sq->nr_queued[rw])
-			break;
 
-		/* if above limits, break to queue */
-		if (!tg_may_dispatch(tg, bio, NULL)) {
+		if (tg_within_limit(tg, bio, rw)) {
+			/* within limits, let's charge and dispatch directly */
+			throtl_charge_bio(tg, bio);
+
+			/*
+			 * We need to trim slice even when bios are not being
+			 * queued otherwise it might happen that a bio is not
+			 * queued for a long time and slice keeps on extending
+			 * and trim is not called for a long time. Now if limits
+			 * are reduced suddenly we take into account all the IO
+			 * dispatched so far at new low rate and * newly queued
+			 * IO gets a really long dispatch time.
+			 *
+			 * So keep on trimming slice even if bio is not queued.
+			 */
+			throtl_trim_slice(tg, rw);
+		} else if (bio_issue_as_root_blkg(bio)) {
+			/*
+			 * IOs which may cause priority inversions are
+			 * dispatched directly, even if they're over limit.
+			 * Debts are handled by carryover_bytes/ios while
+			 * calculating wait time.
+			 */
+			tg_dispatch_in_debt(tg, bio, rw);
+		} else {
+			/* if above limits, break to queue */
 			tg->last_low_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(td, tg)) {
 				throtl_upgrade_state(td);
 				goto again;
 			}
 			break;
 		}
 
-		/* within limits, let's charge and dispatch directly */
-		throtl_charge_bio(tg, bio);
-
-		/*
-		 * We need to trim slice even when bios are not being queued
-		 * otherwise it might happen that a bio is not queued for
-		 * a long time and slice keeps on extending and trim is not
-		 * called for a long time. Now if limits are reduced suddenly
-		 * we take into account all the IO dispatched so far at new
-		 * low rate and * newly queued IO gets a really long dispatch
-		 * time.
-		 *
-		 * So keep on trimming slice even if bio is not queued.
-		 */
-		throtl_trim_slice(tg, rw);
-
 		/*
 		 * @bio passed through this layer without being throttled.
 		 * Climb up the ladder.  If we're already at the top, it
 		 * can be executed directly.
 		 */
-- 
2.52.0
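For reference, the decision order introduced by this patch condenses to
the sketch below (stub types rather than the kernel structs;
bio_issue_as_root_blkg() is what marks the priority-inversion-prone
IOs):

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for struct throtl_grp / struct bio state */
struct tg_stub	{ int nr_queued; bool within_limit; };
struct bio_stub	{ bool issued_as_root_blkg; };

enum action { DISPATCH_AND_TRIM, DISPATCH_IN_DEBT, QUEUE };

static enum action classify(const struct tg_stub *tg, const struct bio_stub *bio)
{
	/* throtl is FIFO: queued bios make the group look over limit here */
	if (!tg->nr_queued && tg->within_limit)
		return DISPATCH_AND_TRIM;	/* charge + throtl_trim_slice() */
	/* metadata skips the queue entirely and pays via carryover debt */
	if (bio->issued_as_root_blkg)
		return DISPATCH_IN_DEBT;	/* tg_dispatch_in_debt() */
	return QUEUE;				/* throttled data waits */
}

int main(void)
{
	struct tg_stub tg = { .nr_queued = 8, .within_limit = false };
	struct bio_stub meta = { .issued_as_root_blkg = true };
	struct bio_stub data = { .issued_as_root_blkg = false };

	/* meta dispatches in debt even while data bios are queued */
	printf("meta: %d, data: %d\n", classify(&tg, &meta), classify(&tg, &data));
	return 0;
}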
From: Ming Lei <ming.lei@redhat.com>
Subject: [PATCH OLK-6.6 3/8] blk-throttle: don't take carryover for prioritized processing of metadata

stable inclusion
from stable-v6.12.31
commit 95080412e9304cb5eeadd8a1f54aae9465330c54
category: other
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...

--------------------------------

[ Upstream commit a9fc8868b350cbf4ff730a4ea9651319cc669516 ]

Commit 29390bb5661d ("blk-throttle: support prioritized processing of
metadata") takes bytes/ios carryover for prioritized processing of
metadata. It turns out we can support this by charging the bio directly
without trimming the slice; the result is the same as with carryover.

Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Conflicts:
	block/blk-throttle.c
[throttle-low has not been removed; this does not affect this patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-throttle.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a19f8e538d42..8e8402a7cde4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2212,17 +2212,10 @@ static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
 		return false;
 
 	return tg_may_dispatch(tg, bio, NULL);
 }
 
-static void tg_dispatch_in_debt(struct throtl_grp *tg, struct bio *bio, bool rw)
-{
-	if (!bio_flagged(bio, BIO_BPS_THROTTLED))
-		tg->carryover_bytes[rw] -= throtl_bio_data_size(bio);
-	tg->carryover_ios[rw]--;
-}
-
 bool __blk_throtl_bio(struct bio *bio)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	struct blkcg_gq *blkg = bio->bi_blkg;
 	struct throtl_qnode *qn = NULL;
@@ -2266,14 +2259,16 @@ bool __blk_throtl_bio(struct bio *bio)
 			throtl_trim_slice(tg, rw);
 		} else if (bio_issue_as_root_blkg(bio)) {
 			/*
 			 * IOs which may cause priority inversions are
 			 * dispatched directly, even if they're over limit.
-			 * Debts are handled by carryover_bytes/ios while
-			 * calculating wait time.
+			 *
+			 * Charge and dispatch directly, and our throttle
+			 * control algorithm is adaptive, and extra IO byte
+			 * will be throttled for paying the debt
 			 */
-			tg_dispatch_in_debt(tg, bio, rw);
+			throtl_charge_bio(tg, bio);
 		} else {
 			/* if above limits, break to queue */
 			tg->last_low_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(td, tg)) {
 				throtl_upgrade_state(td);
-- 
2.52.0
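The claimed equivalence is easy to check: the next wait time depends
only on bytes_disp - (bytes_allowed + carryover), and both schemes move
that difference by the bio size. A standalone sketch with simplified
arithmetic, not the kernel code:

#include <stdio.h>

int main(void)
{
	long long bytes_allowed = 20, bytes_disp = 0, carryover = 0, size = 21;

	/* old scheme: leave bytes_disp alone, record the debt in carryover */
	long long extra_old = bytes_disp - (bytes_allowed + (carryover - size));

	/* new scheme: charge the bio like any ordinary dispatch */
	long long extra_new = (bytes_disp + size) - (bytes_allowed + carryover);

	/* both print 1: the same debt is paid off either way */
	printf("extra bytes to pay off: old=%lld new=%lld\n", extra_old, extra_new);
	return 0;
}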
From: Julian Sun <sunjunchao@bytedance.com>
Subject: [PATCH OLK-6.6 4/8] jbd2: increase IO priority of checkpoint

mainline inclusion
from mainline-v6.18-rc1
commit 0f3b05c12158ec7545fb336469ccce38c31c7f9f
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

In commit 6a3afb6ac6df ("jbd2: increase the journal IO's priority"),
the priority of IOs initiated by jbd2 has been raised, exempting them
from WBT throttling. Checkpoint is also a crucial operation of jbd2.
While no serious issues have been observed so far, it should still be
reasonable to exempt checkpoint from WBT throttling.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 fs/jbd2/checkpoint.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 368ae50d8a59..b9686b79d59d 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -129,11 +129,11 @@ __flush_batch(journal_t *journal, int *batch_count)
 	int i;
 	struct blk_plug plug;
 
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
-		write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);
+		write_dirty_buffer(journal->j_chkpt_bhs[i], JBD2_JOURNAL_REQ_FLAGS);
 	blk_finish_plug(&plug);
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
 		BUFFER_TRACE(bh, "brelse");
-- 
2.52.0
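A sketch of why the new flags bypass WBT, assuming
JBD2_JOURNAL_REQ_FLAGS expands to REQ_META | REQ_SYNC | REQ_IDLE (per
commit 6a3afb6ac6df) and mirroring the (REQ_SYNC | REQ_IDLE) exemption
in wbt_should_throttle(); the bit values below are illustrative only,
not the kernel's:

#include <stdio.h>

#define REQ_SYNC	(1u << 0)	/* illustrative bit positions only */
#define REQ_META	(1u << 1)
#define REQ_IDLE	(1u << 2)
#define JBD2_JOURNAL_REQ_FLAGS	(REQ_META | REQ_SYNC | REQ_IDLE)

/* mirrors the (REQ_SYNC | REQ_IDLE) test blk-wbt applies to writes */
static int wbt_would_throttle_write(unsigned int opf)
{
	return (opf & (REQ_SYNC | REQ_IDLE)) != (REQ_SYNC | REQ_IDLE);
}

int main(void)
{
	printf("REQ_SYNC only (old):          throttled=%d\n",
	       wbt_would_throttle_write(REQ_SYNC));
	printf("JBD2_JOURNAL_REQ_FLAGS (new): throttled=%d\n",
	       wbt_would_throttle_write(JBD2_JOURNAL_REQ_FLAGS));
	return 0;
}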
From: Li Lingfeng <lilingfeng3@huawei.com>
Subject: [PATCH OLK-6.6 5/8] block: flush all throttled bios when deleting the cgroup

mainline inclusion
from mainline-v6.13-rc1
commit 919b5139bd1d557a4d4cd4b2466e2440dda65484
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When a process migrates to another cgroup and the original cgroup is
deleted, the restrictions on its throttled bios cannot be removed. If
the restrictions are set too low, it will take a long time to complete
these bios.

Follow the same approach as deleting a disk: remove the restrictions
and issue the throttled bios when the cgroup is deleted.

This changes the behavior of throttled bios:
Before: the limit of the throttled bios can't be changed, and the bios
complete under this limit.
Now: the limit is canceled and the throttled bios are flushed
immediately.

References:
[1] https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com
[2] https://lore.kernel.org/all/da861d63-58c6-3ca0-2535-9089993e9e28@huaweicloud...

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240817071108.1919729-1-lilingfeng@huaweicloud.co...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
	block/blk-throttle.c
[Commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
removed throtl_pd_offline in mainline; this does not affect this patch.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-throttle.c | 65 ++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8e8402a7cde4..883255d8f8e0 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -495,10 +495,11 @@ static void blk_throtl_update_limit_valid(struct throtl_data *td)
 static inline void blk_throtl_update_limit_valid(struct throtl_data *td)
 {
 }
 #endif
 
+static void tg_flush_bios(struct throtl_grp *tg);
 static void throtl_upgrade_state(struct throtl_data *td);
 
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -509,10 +510,12 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
 	blk_throtl_update_limit_valid(tg->td);
 	if (!tg->td->limit_valid[tg->td->limit_index])
 		throtl_upgrade_state(tg->td);
+
+	tg_flush_bios(tg);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1735,10 +1738,41 @@ static void throtl_shutdown_wq(struct request_queue *q)
 	struct throtl_data *td = q->td;
 
 	cancel_work_sync(&td->dispatch_work);
 }
 
+static void tg_flush_bios(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+
+	if (tg->flags & THROTL_TG_CANCELING)
+		return;
+	/*
+	 * Set the flag to make sure throtl_pending_timer_fn() won't
+	 * stop until all throttled bios are dispatched.
+	 */
+	tg->flags |= THROTL_TG_CANCELING;
+
+	/*
+	 * Do not dispatch cgroup without THROTL_TG_PENDING or cgroup
+	 * will be inserted to service queue without THROTL_TG_PENDING
+	 * set in tg_update_disptime below. Then IO dispatched from
+	 * child in tg_dispatch_one_bio will trigger double insertion
+	 * and corrupt the tree.
+	 */
+	if (!(tg->flags & THROTL_TG_PENDING))
+		return;
+
+	/*
+	 * Update disptime after setting the above flag to make sure
+	 * throtl_select_dispatch() won't exit without dispatching.
+	 */
+	tg_update_disptime(tg);
+
+	throtl_schedule_pending_timer(sq, jiffies + 1);
+}
+
 struct blkcg_policy blkcg_policy_throtl = {
 	.dfl_cftypes		= throtl_files,
 	.legacy_cftypes		= throtl_legacy_files,
 	.pd_alloc_fn		= throtl_pd_alloc,
@@ -1760,36 +1794,19 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
 	 * However, rcu lock is still held to emphasize that following
 	 * path need RCU protection and to prevent warning from lockdep.
 	 */
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
-		struct throtl_grp *tg = blkg_to_tg(blkg);
-		struct throtl_service_queue *sq = &tg->service_queue;
-
-		/*
-		 * Set the flag to make sure throtl_pending_timer_fn() won't
-		 * stop until all throttled bios are dispatched.
-		 */
-		tg->flags |= THROTL_TG_CANCELING;
-
 		/*
-		 * Do not dispatch cgroup without THROTL_TG_PENDING or cgroup
-		 * will be inserted to service queue without THROTL_TG_PENDING
-		 * set in tg_update_disptime below. Then IO dispatched from
-		 * child in tg_dispatch_one_bio will trigger double insertion
-		 * and corrupt the tree.
+		 * disk_release will call pd_offline_fn to cancel bios.
+		 * However, disk_release can't be called if someone get
+		 * the refcount of device and issued bios which are
+		 * inflight after del_gendisk.
+		 * Cancel bios here to ensure no bios are inflight after
+		 * del_gendisk.
 		 */
-		if (!(tg->flags & THROTL_TG_PENDING))
-			continue;
-
-		/*
-		 * Update disptime after setting the above flag to make sure
-		 * throtl_select_dispatch() won't exit without dispatching.
-		 */
-		tg_update_disptime(tg);
-
-		throtl_schedule_pending_timer(sq, jiffies + 1);
+		tg_flush_bios(blkg_to_tg(blkg));
 	}
 	rcu_read_unlock();
 	spin_unlock_irq(&q->queue_lock);
 }
-- 
2.52.0
From: Yu Kuai <yukuai3@huawei.com>
Subject: [PATCH OLK-6.6 6/8] blk-throttle: fix lower bps rate by throtl_trim_slice()

stable inclusion
from stable-v6.14.2
commit 1c5ceca353089c7948a1064f220aa21e4f904096
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...

--------------------------------

[ Upstream commit 29cb955934302a5da525db6b327c795572538426 ]

The bio submission time may be a few jiffies more than the expected
waiting time, because 'extra_bytes' can't be divided evenly in
tg_within_bps_limit(), and also because of timer wakeup delay. In this
case, adjusting slice_start to jiffies will discard the extra wait
time, causing a lower rate than expected.

The current in-tree code already covers this deviation by rounddown(),
but it turns out that is not enough, because jiffies - slice_start can
be a multiple of throtl_slice. For example, assume bps_limit is
1000 bytes/s, 1 jiffy is 10ms, and the slice is 20ms (2 jiffies); the
expected rate is 1000 / 1000 * 20 = 20 bytes per slice.

If the user issues two 21-byte IOs, the wait time will be 30ms for the
first IO:

bytes_allowed = 20, extra_bytes = 1;
jiffy_wait = 1 + 2 = 3 jiffies

and, considering an extra jiffy from the timer, throtl_trim_slice()
will be called at:

jiffies = 40ms
slice_start = 0ms, slice_end = 40ms
bytes_disp = 21

In this case, before the patch, the real rate in the first two slices
is 10.5 bytes per slice, and the slice will be updated to:

jiffies = 40ms
slice_start = 40ms, slice_end = 60ms
bytes_disp = 0

Hence the second IO will have to wait another 30ms. With the patch,
the real rate in the first slice is 20 bytes per slice, which is the
same as expected, and the slice will be updated to:

jiffies = 40ms
slice_start = 20ms, slice_end = 60ms
bytes_disp = 1

Now there are still 19 bytes allowed in the second slice, and the
second IO will only have to wait 10ms.

This problem will cause a blktests throtl/001 failure in the case of
CONFIG_HZ_100=y; fix it by preserving one extra finished slice in
throtl_trim_slice().

Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy")
Reported-by: Ming Lei <ming.lei@redhat.com>
Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicl...
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-throttle.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 883255d8f8e0..087a31e50678 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -784,18 +784,27 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	 * that initially cgroup limit was very low resulting in high
 	 * slice_end, but later limit was bumped up and bio was dispatched
 	 * sooner, then we need to reduce slice_end. A high bogus slice_end
 	 * is bad because it does not allow new slice to start.
 	 */
-	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
 
 	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
 				 tg->td->throtl_slice);
-	if (!time_elapsed)
+	/* Don't trim slice until at least 2 slices are used */
+	if (time_elapsed < tg->td->throtl_slice * 2)
 		return;
 
+	/*
+	 * The bio submission time may be a few jiffies more than the expected
+	 * waiting time, due to 'extra_bytes' can't be divided in
+	 * tg_within_bps_limit(), and also due to timer wakeup delay. In this
+	 * case, adjust slice_start will discard the extra wait time, causing
+	 * lower rate than expected. Therefore, other than the above rounddown,
+	 * one extra slice is preserved for deviation.
+	 */
+	time_elapsed -= tg->td->throtl_slice;
+
 	bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw),
 					     time_elapsed) +
 		     tg->carryover_bytes[rw];
 	io_trim = calculate_io_allowed(tg_iops_limit(tg, rw), time_elapsed) +
 		  tg->carryover_ios[rw];
-- 
2.52.0
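The numbers above can be reproduced with a few lines of plain C (a
sketch of the trim arithmetic only, using the example's assumed HZ=100
and 2-jiffy slice, not the kernel code):

#include <stdio.h>

#define HZ		100	/* 1 jiffy = 10ms */
#define THROTL_SLICE	2	/* 20ms slice */
#define BPS_LIMIT	1000	/* bytes per second */

static long bytes_allowed(long jiffies_elapsed)
{
	return (long)BPS_LIMIT * jiffies_elapsed / HZ;
}

int main(void)
{
	long slice_start = 0, now = 4;	/* throtl_trim_slice() runs at 40ms */
	long bytes_disp = 21;		/* one 21-byte IO dispatched so far */

	/* old: trim every full slice */
	long old_elapsed = (now - slice_start) / THROTL_SLICE * THROTL_SLICE;
	/* new: preserve one finished slice for the deviation */
	long new_elapsed = old_elapsed - THROTL_SLICE;

	long old_trim = bytes_allowed(old_elapsed);	/* 40 bytes */
	long new_trim = bytes_allowed(new_elapsed);	/* 20 bytes */

	/* old: slice_start -> 40ms, bytes_disp -> 0 (extra wait discarded) */
	printf("old: slice_start -> %ldms, bytes_disp -> %ld\n",
	       (slice_start + old_elapsed) * 1000 / HZ,
	       bytes_disp > old_trim ? bytes_disp - old_trim : 0);

	/* new: slice_start -> 20ms, bytes_disp -> 1, as in the commit message */
	printf("new: slice_start -> %ldms, bytes_disp -> %ld\n",
	       (slice_start + new_elapsed) * 1000 / HZ,
	       bytes_disp - new_trim);
	return 0;
}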
From: David Laight <david.laight.linux@gmail.com>
Subject: [PATCH OLK-6.6 7/8] block: use min() instead of min_t()

mainline inclusion
from mainline-v6.19-rc1
commit 9420e720ad192c53c8d2803c5a2313b2d586adbd
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead, as it promotes any 'unsigned int' to
'unsigned long' and so cannot discard significant bits.

In this case the 'unsigned long' value is small enough that the result
is ok. (Similarly for max_t() and clamp_t().)

Detected by an extra check added to min_t().

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-iocost.c     | 6 ++----
 block/partitions/efi.c | 3 +--
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index a9c3f17ab680..0019b67aed15 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2333,14 +2333,12 @@ static void ioc_timer_fn(struct timer_list *timer)
 		if (time_after64(iocg->activated_at, ioc->period_at))
 			usage_dur = max_t(u64, now.now - iocg->activated_at, 1);
 		else
 			usage_dur = max_t(u64, now.now - ioc->period_at, 1);
 
-		usage = clamp_t(u32,
-				DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
-						   usage_dur),
-				1, WEIGHT_ONE);
+		usage = clamp(DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, usage_dur),
+			      1, WEIGHT_ONE);
 
 		/*
 		 * Already donating or accumulated enough to start.
 		 * Determine the donation amount.
 		 */
diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 7acba66eed48..638261e9f2fb 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -213,12 +213,11 @@ static int is_pmbr_valid(legacy_mbr *mbr, sector_t total_sectors)
 	 */
 	if (ret == GPT_MBR_PROTECTIVE) {
 		sz = le32_to_cpu(mbr->partition_record[part].size_in_lba);
 		if (sz != (uint32_t) total_sectors - 1 && sz != 0xFFFFFFFF)
 			pr_debug("GPT: mbr size in lba (%u) different than whole disk (%u).\n",
-				 sz, min_t(uint32_t,
-					   total_sectors - 1, 0xFFFFFFFF));
+				 sz, (uint32_t)min(total_sectors - 1, 0xFFFFFFFF));
 	}
 done:
 	return ret;
 }
-- 
2.52.0
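To see the truncation this guards against, here is a user-space sketch;
min_t()/min() below are simplified stand-ins for the kernel macros (the
real min() additionally enforces type compatibility):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* simplified stand-ins: the kernel versions add type checking */
#define min_t(t, a, b)	((t)(a) < (t)(b) ? (t)(a) : (t)(b))
#define min(a, b)	((a) < (b) ? (a) : (b))

int main(void)
{
	uint64_t total_sectors = 0x100000001ULL;	/* needs more than 32 bits */

	/* min_t casts first: 0x100000000 truncates to 0 and wins the compare */
	printf("min_t: %" PRIu32 "\n",
	       min_t(uint32_t, total_sectors - 1, 0xFFFFFFFF));

	/* min compares in 64 bits; the cast applies to the already-small result */
	printf("min:   %" PRIu32 "\n",
	       (uint32_t)min(total_sectors - 1, 0xFFFFFFFFULL));
	return 0;
}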
From: Jialin Wang <wjl.linux@gmail.com>
Subject: [PATCH OLK-6.6 8/8] blk-iocost: fix busy_level reset when no IOs complete

mainline inclusion
from mainline-v7.1-rc1
commit f91ffe89b2016d280995a9c28d73288b02d83615
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9104
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, and iocost incorrectly interprets this as meeting the
QoS targets, resetting busy_level to 0. This reset prevents busy_level
from reaching the threshold (4) needed to reduce vrate.

On certain cloud storage, such as Azure Premium SSD, we observed that
iocost may fail to reduce vrate for tens of seconds during saturation,
failing to mitigate noisy neighbor issues.

Fix this by tracking the number of IO completions (nr_done) in a
period. If nr_done is 0 and there are lagging IOs, the saturation
status is unknown, so we keep busy_level unchanged.

The issue is consistently reproducible on Azure Standard_D8as_v5
(Dasv5) VMs with a 512GB Premium SSD (P20) using the script below. It
was not observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd),
and no regressions were found with this patch.

In this script, cgA performs large IOs with iodepth=128, while cgB
performs small IOs with iodepth=1 rate_iops=100 rw=randrw. With iocost
enabled, we expect it to throttle cgA: the submission latency (slat)
of cgA should be significantly higher, while cgB should reach 200 IOPS
with low completion latency (clat).

BLK_DEVID="8:0"
MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"
echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos

CG_A="/sys/fs/cgroup/cgA"
CG_B="/sys/fs/cgroup/cgB"
FILE_A="/path/to/sda/A.fio.testfile"
FILE_B="/path/to/sda/B.fio.testfile"
RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"

get_result() {
	local file=$1
	local label=$2
	local results=$(jq -r '
		.jobs[0].mixed |
		( .iops | tonumber | round ) as $iops |
		( .bw_bytes / 1024 / 1024 ) as $bps |
		( .slat_ns.mean / 1000000 ) as $slat |
		( .clat_ns.mean / 1000000 ) as $avg |
		( .clat_ns.max / 1000000 ) as $max |
		( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
		( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
		( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
		( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
		"\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
	' "$file")

	IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
	printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
		"$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
}

run_fio() {
	local cg_path=$1
	local filename=$2
	local name=$3
	local bs=$4
	local qd=$5
	local out=$6
	shift 6
	local extra=$@

	(
		pid=$(sh -c 'echo $PPID')
		echo $pid >"${cg_path}/cgroup.procs"
		fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
			--ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
			--time_based --group_reporting --unified_rw_reporting=mixed \
			--output-format=json --output="$out" $extra >/dev/null 2>&1
	) &
}

echo "Starting Test ..."
for bs_b in "4k" "32k" "256k"; do
	echo "Running iteration: BS=$bs_b"
	out_a="${RESULT_DIR}/cgA_1m.json"
	out_b="${RESULT_DIR}/cgB_${bs_b}.json"

	# cgA: Heavy background (BS 1MB, QD 128)
	run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"

	# cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
	run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"

	wait
	SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
	SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
done

echo -e "\nFinal Results Summary:\n"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
	"" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
	"CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
echo "$SUMMARY_DATA"
echo "Results saved in $RESULT_DIR"

Before:
                         slat     clat     clat     clat     clat     clat     clat
CGROUP   IOPS   MB/s     avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

cgA-1m   166    166.37   3.44     748.95   1298.29  977.27   1233.13  1300.23  1300.23
cgB-4k   5      0.02     0.02     181.74   761.32   742.39   759.17   759.17   759.17

cgA-1m   167    166.51   1.98     748.68   1549.41  809.50   1451.23  1551.89  1551.89
cgB-32k  6      0.18     0.02     169.98   761.76   742.39   759.17   759.17   759.17

cgA-1m   166    165.55   2.89     750.89   1540.37  851.44   1451.23  1535.12  1535.12
cgB-256k 5      1.30     0.02     191.35   759.51   750.78   759.17   759.17   759.17

After:
                         slat     clat     clat     clat     clat     clat     clat
CGROUP   IOPS   MB/s     avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

cgA-1m   162    162.48   6.14     749.69   850.02   826.28   834.67   843.06   851.44
cgB-4k   199    0.78     0.01     1.95     42.12    2.57     7.50     34.87    42.21

cgA-1m   146    146.20   6.83     833.04   908.68   893.39   901.78   910.16   910.16
cgB-32k  200    6.25     0.01     2.32     31.40    3.06     7.50     16.58    31.33

cgA-1m   110    110.46   9.04     1082.67  1197.91  1182.79  1199.57  1199.57  1199.57
cgB-256k 200    49.98    0.02     3.69     22.20    4.88     9.11     20.05    22.15

Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 block/blk-iocost.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 0019b67aed15..133d3f5828ee 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1595,11 +1595,12 @@ static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
 
 	iocg_unlock(iocg, pay_debt, &flags);
 	return HRTIMER_NORESTART;
 }
 
-static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p)
+static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p,
+			 u32 *nr_done)
 {
 	u32 nr_met[2] = { };
 	u32 nr_missed[2] = { };
 	u64 rq_wait_ns = 0;
 	int cpu, rw;
@@ -1632,10 +1633,12 @@ static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p
 			missed_ppm_ar[rw] = 0;
 	}
 
 	*rq_wait_pct_p = div64_u64(rq_wait_ns * 100,
 				   ioc->period_us * NSEC_PER_USEC);
+
+	*nr_done = nr_met[READ] + nr_met[WRITE] + nr_missed[READ] + nr_missed[WRITE];
 }
 
 /* was iocg idle this period? */
 static bool iocg_is_idle(struct ioc_gq *iocg)
 {
@@ -2249,16 +2252,16 @@ static void ioc_timer_fn(struct timer_list *timer)
 	LIST_HEAD(surpluses);
 	int nr_debtors, nr_shortages = 0, nr_lagging = 0;
 	u64 usage_us_sum = 0;
 	u32 ppm_rthr;
 	u32 ppm_wthr;
-	u32 missed_ppm[2], rq_wait_pct;
+	u32 missed_ppm[2], rq_wait_pct, nr_done;
 	u64 period_vtime;
 	int prev_busy_level;
 
 	/* how were the latencies during the period? */
-	ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct);
+	ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct, &nr_done);
 
 	/* take care of active iocgs */
 	spin_lock_irq(&ioc->lock);
 
 	ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
@@ -2396,13 +2399,21 @@ static void ioc_timer_fn(struct timer_list *timer)
 	 * too much IO and should lower vtime rate.  If we're not missing
 	 * and experiencing shortages but not surpluses, we're too stingy
 	 * and should increase vtime rate.
 	 */
 	prev_busy_level = ioc->busy_level;
-	if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
-	    missed_ppm[READ] > ppm_rthr ||
-	    missed_ppm[WRITE] > ppm_wthr) {
+	if (!nr_done && nr_lagging) {
+		/*
+		 * When there are lagging IOs but no completions, we don't
+		 * know if the IO latency will meet the QoS targets. The
+		 * disk might be saturated or not. We should not reset
+		 * busy_level to 0 (which would prevent vrate from scaling
+		 * up or down), but rather to keep it unchanged.
+		 */
+	} else if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
+		   missed_ppm[READ] > ppm_rthr ||
+		   missed_ppm[WRITE] > ppm_wthr) {
 		/* clearly missing QoS targets, slow down vrate */
 		ioc->busy_level = max(ioc->busy_level, 0);
 		ioc->busy_level++;
 	} else if (rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 &&
 		   missed_ppm[READ] <= ppm_rthr * UNBUSY_THR_PCT / 100 &&
-- 
2.52.0
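The new control flow reduces to the small state update below (a
condensed sketch with plain ints standing in for the ioc state and QoS
checks; the vrate-reduction threshold of 4 mentioned above lives
elsewhere in blk-iocost):

#include <stdio.h>

static int update_busy_level(int busy, unsigned int nr_done, int nr_lagging,
			     int missing_qos, int under_qos)
{
	if (!nr_done && nr_lagging)
		return busy;			/* saturation unknown: keep as-is */
	if (missing_qos)
		return (busy < 0 ? 0 : busy) + 1;	/* slow down vrate */
	if (under_qos)
		return (busy > 0 ? 0 : busy) - 1;	/* speed up vrate */
	return 0;				/* on target: reset */
}

int main(void)
{
	int busy = 3;

	/* stalled period: no completions, lagging IOs -> busy_level stays 3 */
	busy = update_busy_level(busy, 0, 8, 0, 0);
	printf("after stalled period: %d\n", busy);

	/* next period misses QoS -> reaches the vrate-reduction threshold */
	busy = update_busy_level(busy, 100, 8, 1, 0);
	printf("after missed period:  %d\n", busy);
	return 0;
}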
FeedBack:
The patch(es) which you have sent to kernel@openeuler.org mailing list
has been converted to a pull request successfully!
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/22204
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/HIX...