This series is a bcache backport up to Linux v5.19. The referenced bug is
https://gitee.com/openeuler/kernel/issues/I59A5L#note_10515416

All the patches are tested on the Intel x86-64 platform, but not tested on
the ARM64 platform yet.
Coly Li
---
Arnd Bergmann (1):
  md: bcache: avoid -Wempty-body warnings

Bhaskar Chowdhury (1):
  md: bcache: Trivial typo fixes in the file journal.c

Chao Yu (1):
  bcache: fix error info in register_bcache()

Christoph Hellwig (1):
  bcache: remove PTR_CACHE

Coly Li (10):
  bcache: fix a regression of code compiling failure in debug.c
  bcache: remove bcache device self-defined readahead
  bcache: avoid oversized read request in cache missing code path
  bcache: move uapi header bcache.h to bcache code directory
  bcache: improve multithreaded bch_btree_check()
  bcache: improve multithreaded bch_sectors_dirty_init()
  bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
  bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
  bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()

Ding Senjie (1):
  md: bcache: Fix spelling of 'acquire'

Dongsheng Yang (1):
  bcache: fix race between setting bdev state to none and new write request direct to backing

Greg Kroah-Hartman (1):
  bcache: use default_groups in kobj_type

Gustavo A. R. Silva (1):
  bcache: Use 64-bit arithmetic instead of 32-bit

Jia-Ju Bai (1):
  md: bcache: check the return value of kzalloc() in detached_dev_do_request()

Joe Perches (1):
  bcache: Avoid comma separated statements

Kai Krakow (1):
  bcache: Fix register_device_aync typo

Lin Feng (2):
  bcache: move calc_cached_dev_sectors to proper place on backing device detach
  bcache: fix NULL pointer reference in cached_dev_detach_finish

Ming Lei (1):
  bcache: don't pass BIOSET_NEED_BVECS for the 'bio_set' embedded in 'cache_set'

Mingzhe Zou (2):
  bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing
  bcache: fixup multiple threads crash

Qing Wang (1):
  bcache: replace snprintf in show functions with sysfs_emit

Yang Li (1):
  bcache: use NULL instead of using plain integer as pointer

Yi Li (2):
  bcache: remove a superfluous check in register_bcache
  bcache: set pdev_set_uuid before scond loop iteration

YueHaibing (1):
  lib: crc64: fix kernel-doc warning

Zheng Yongjun (1):
  md/bcache: convert comma to semicolon

Zhiqiang Liu (1):
  bcache: reduce redundant code in bch_cached_dev_run()

dongdong tao (1):
  bcache: consider the fragmentation when update the writeback rate
 drivers/md/bcache/alloc.c     |   5 +-
 drivers/md/bcache/bcache.h    |  25 ++-
 .../md/bcache/bcache_ondisk.h |   0
 drivers/md/bcache/bset.c      |  12 +-
 drivers/md/bcache/bset.h      |   2 +-
 drivers/md/bcache/btree.c     |  67 +++---
 drivers/md/bcache/btree.h     |   2 +-
 drivers/md/bcache/debug.c     |   2 +-
 drivers/md/bcache/extents.c   |   4 +-
 drivers/md/bcache/features.c  |   4 +-
 drivers/md/bcache/features.h  |   3 +-
 drivers/md/bcache/io.c        |   4 +-
 drivers/md/bcache/journal.c   |  37 +++-
 drivers/md/bcache/journal.h   |   2 +
 drivers/md/bcache/request.c   |  25 +--
 drivers/md/bcache/stats.c     |  17 +-
 drivers/md/bcache/stats.h     |   1 -
 drivers/md/bcache/super.c     |  63 +++---
 drivers/md/bcache/sysfs.c     |  50 +++--
 drivers/md/bcache/sysfs.h     |  20 +-
 drivers/md/bcache/util.h      |  19 +-
 drivers/md/bcache/writeback.c | 198 +++++++++++-------
 drivers/md/bcache/writeback.h |   6 +-
 lib/crc64.c                   |   2 +-
 24 files changed, 324 insertions(+), 246 deletions(-)
 rename include/uapi/linux/bcache.h => drivers/md/bcache/bcache_ondisk.h (100%)
From: Dongsheng Yang <dongsheng.yang@easystack.cn>

mainline inclusion
from v5.11-rc1
commit df4ad53242158f9f1f97daf4feddbb4f8b77f080
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
There is a race condition in detaching as below:

A. detaching                          B. Write request
(1) writing back
(2) write back done, set bdev
    state to clean.
(3) cached_dev_put() and
    schedule_work(&dc->detach);
                                      (4) write data [0 - 4K] directly
                                          into backing and ack to user.
(5) power-failure...
When we restart this bcache device, the bdev is clean but not detached; when
we then read [0 - 4K], we will get unexpected old data from the cache device.

To fix this problem, set the bdev state to none when writeback is done during
detaching. Then if a power failure happens as above, the data in the cache
will not be used at the next bcache device start: the device is detached, and
we will read the correct data directly from the backing device.
Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/super.c     | 9 ---------
 drivers/md/bcache/writeback.c | 9 +++++++++
 2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 81f1cc5b3499..b7d9d1b79ac2 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1151,9 +1151,6 @@ static void cancel_writeback_rate_update_dwork(struct cached_dev *dc)
 static void cached_dev_detach_finish(struct work_struct *w)
 {
        struct cached_dev *dc = container_of(w, struct cached_dev, detach);
-       struct closure cl;
-
-       closure_init_stack(&cl);
 
        BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags));
        BUG_ON(refcount_read(&dc->count));
@@ -1167,12 +1164,6 @@ static void cached_dev_detach_finish(struct work_struct *w)
                dc->writeback_thread = NULL;
        }
 
-       memset(&dc->sb.set_uuid, 0, 16);
-       SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE);
-
-       bch_write_bdev_super(dc, &cl);
-       closure_sync(&cl);
-
        mutex_lock(&bch_register_lock);
 
        calc_cached_dev_sectors(dc->disk.c);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 3c74996978da..a129e4d2707c 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -705,6 +705,15 @@ static int bch_writeback_thread(void *arg)
                         * bch_cached_dev_detach().
                         */
                        if (test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)) {
+                               struct closure cl;
+
+                               closure_init_stack(&cl);
+                               memset(&dc->sb.set_uuid, 0, 16);
+                               SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE);
+
+                               bch_write_bdev_super(dc, &cl);
+                               closure_sync(&cl);
+
                                up_write(&dc->writeback_lock);
                                break;
                        }
From: Yi Li <yili@winhong.com>

mainline inclusion
from v5.11-rc1
commit 117ae250cfa3718f21bd07df0650dfbe3bc3a823
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
There is no reassignment of bdev after the IS_ERR() check, so the second
!IS_ERR(bdev) check is superfluous.
After commit 4e7b5671c6a8 ("block: remove i_bdev"), "Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code."
So after the lookup_bdev() call, there is no need to call bdput().

Remove the superfluous bdev check and don't call bdput() after lookup_bdev().
Fixes: 4e7b5671c6a8 ("block: remove i_bdev")
Signed-off-by: Yi Li <yili@winhong.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/super.c | 2 --
 1 file changed, 2 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b7d9d1b79ac2..1572190c32ec 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2588,8 +2588,6 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
                else
                        err = "device busy";
                mutex_unlock(&bch_register_lock);
-               if (!IS_ERR(bdev))
-                       bdput(bdev);
                if (attr == &ksysfs_register_quiet)
                        goto done;
        }
From: Zheng Yongjun <zhengyongjun3@huawei.com>

mainline inclusion
from v5.11-rc1
commit 46926127d76359b46659c556df7b4aa1b6325d90
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
Replace a comma between expression statements by a semicolon.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..00a520c03f41 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -404,7 +404,7 @@ STORE(__cached_dev)
                        if (!env)
                                return -ENOMEM;
                        add_uevent_var(env, "DRIVER=bcache");
-                       add_uevent_var(env, "CACHED_UUID=%pU", dc->sb.uuid),
+                       add_uevent_var(env, "CACHED_UUID=%pU", dc->sb.uuid);
                        add_uevent_var(env, "CACHED_LABEL=%s", buf);
                        kobject_uevent_env(&disk_to_dev(dc->disk.disk)->kobj,
                                           KOBJ_CHANGE,
From: Yi Li <yili@winhong.com>

mainline inclusion
from v5.11-rc3
commit e80927079fd97b4d5457e3af2400a0087b561564
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
There is no need to reassign pdev_set_uuid in the second loop iteration, so
move it to just before the second loop.
Signed-off-by: Yi Li <yili@winhong.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1572190c32ec..456e41a81c06 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2697,8 +2697,8 @@ static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
        }
 
        list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+               char *pdev_set_uuid = pdev->dc->sb.set_uuid;
                list_for_each_entry_safe(c, tc, &bch_cache_sets, list) {
-                       char *pdev_set_uuid = pdev->dc->sb.set_uuid;
                        char *set_uuid = c->set_uuid;
 
                        if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from v5.12-rc1
commit faa8e2c4fb30f336a289e3cbaa1e9a9dfd92ac8c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
This bioset is only used for allocating bios from bio_next_split, and it
doesn't need bvecs, so remove the flag.
Cc: linux-bcache@vger.kernel.org
Cc: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 456e41a81c06..7195b289780a 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1947,7 +1947,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
                goto err;
 
        if (bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio),
-                       BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER))
+                       BIOSET_NEED_RESCUER))
                goto err;
 
        c->uuids = alloc_meta_bucket_pages(GFP_KERNEL, sb);
From: dongdong tao <dongdong.tao@canonical.com>

mainline inclusion
from v5.12-rc1
commit 71dda2a5625f31bc3410cb69c3d31376a2b66f28
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
The current way to calculate the writeback rate only considers dirty sectors.
This usually works fine when fragmentation is not high, but it gives an
unreasonably small rate when very few dirty sectors consume a lot of dirty
buckets. In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC
while the dirty data (sectors) has not even reached writeback_percent; the
writeback rate will then still be the minimum value (4k), causing all writes
to be stuck in a non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness: the first
stage starts when the dirty bucket percent exceeds
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default, the first stage tries
to write back the amount of dirty data in one bucket (on average) in
(1 / (dirty_buckets_percent - 50)) seconds, the second stage in
(1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third stage in
(1 / (dirty_buckets_percent - 64)) milliseconds.
The initial rate at each stage can be controlled by 3 configurable
parameters, writeback_rate_fp_term_{low|mid|high}. They default to 1, 10 and
1000 respectively; the IO throughput these values try to achieve is described
in the paragraph above. The reason I chose these values as defaults is based
on testing and production data. Some details below:
A. When it comes to the low stage, we are still a bit away from the 70
threshold, so we only want to give it a small push by setting the term to 1.
That means an initial rate of 170 if the fragmentation is 6 (calculated as
bucket_size/fragmentation). This rate is very small, but still much more
reasonable than the minimum 8. For a production bcache with a light workload,
if the cache device is bigger than 1 TB, it may take hours to consume 1% of
the buckets, so it is very possible to reclaim enough dirty buckets in this
stage and thus avoid entering the next one.

B. If the dirty bucket ratio didn't turn around during the first stage, it
comes to the mid stage, where it is necessary to be more aggressive than the
low stage. So I chose an initial rate 10 times higher, which means 1700 if
the fragmentation is 6. This is a normal rate we usually see for a normal
workload when writeback happens because of writeback_percent.

C. If the dirty bucket ratio didn't turn around during the low and mid
stages, it comes to the third stage, the last chance to turn around and avoid
the horrible cutoff writeback sync issue. Here we choose to be 100 times more
aggressive than the mid stage, which means an initial rate of 170000 if the
fragmentation is 6. This is also inferred from a production bcache: I
collected one week of writeback rate data from a production bcache with quite
heavy workloads (again with writeback triggered by writeback_percent), and
the highest rate area is around 100000 to 240000, so I believe this kind of
aggressiveness at this stage is reasonable for production. It should also be
mostly enough, because the hint tries to reclaim 1000 buckets per second, and
that heavy production env consumed 50 buckets per second on average over the
week.
The option writeback_consider_fragment controls whether this feature is on or
off; it is on by default.

Lastly, below is the performance data for all the testing results, including
the data from the production env:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxsc...
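To make the staged arithmetic above concrete, here is a minimal standalone
sketch of the fp_term selection (editor's illustration, not code from this
patch: the thresholds and default terms are the ones introduced here, while
the in_use value is a made-up input):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          int64_t in_use = 60;    /* hypothetical dirty bucket percent, > 50 as gated in the patch */
          int64_t fp_term_low = 1, fp_term_mid = 10, fp_term_high = 1000; /* defaults */
          int64_t fp_term;

          if (in_use <= 57)               /* BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID */
                  fp_term = fp_term_low * (in_use - 50);
          else if (in_use <= 64)          /* BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH */
                  fp_term = fp_term_mid * (in_use - 57);
          else
                  fp_term = fp_term_high * (in_use - 64);

          /* 60% dirty buckets falls in the mid stage: 10 * (60 - 57) = 30 */
          printf("fp_term = %lld\n", (long long)fp_term);
          return 0;
  }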
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/bcache.h    |  4 ++++
 drivers/md/bcache/sysfs.c     | 23 +++++++++++++++++++
 drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
 drivers/md/bcache/writeback.h |  4 ++++
 4 files changed, 73 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index e8bf4f752e8b..848dd4db1659 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -373,6 +373,7 @@ struct cached_dev {
        unsigned int            partial_stripes_expensive:1;
        unsigned int            writeback_metadata:1;
        unsigned int            writeback_running:1;
+       unsigned int            writeback_consider_fragment:1;
        unsigned char           writeback_percent;
        unsigned int            writeback_delay;
 
@@ -385,6 +386,9 @@ struct cached_dev {
        unsigned int            writeback_rate_update_seconds;
        unsigned int            writeback_rate_i_term_inverse;
        unsigned int            writeback_rate_p_term_inverse;
+       unsigned int            writeback_rate_fp_term_low;
+       unsigned int            writeback_rate_fp_term_mid;
+       unsigned int            writeback_rate_fp_term_high;
        unsigned int            writeback_rate_minimum;
 
        enum stop_on_failure    stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 00a520c03f41..eef15f8022ba 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -117,10 +117,14 @@ rw_attribute(writeback_running);
 rw_attribute(writeback_percent);
 rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
+rw_attribute(writeback_consider_fragment);
 
 rw_attribute(writeback_rate_update_seconds);
 rw_attribute(writeback_rate_i_term_inverse);
 rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
 rw_attribute(writeback_rate_minimum);
 read_attribute(writeback_rate_debug);
 
@@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
        var_printf(bypass_torture_test, "%i");
        var_printf(writeback_metadata,  "%i");
        var_printf(writeback_running,   "%i");
+       var_printf(writeback_consider_fragment, "%i");
        var_print(writeback_delay);
        var_print(writeback_percent);
        sysfs_hprint(writeback_rate,
@@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
        var_print(writeback_rate_update_seconds);
        var_print(writeback_rate_i_term_inverse);
        var_print(writeback_rate_p_term_inverse);
+       var_print(writeback_rate_fp_term_low);
+       var_print(writeback_rate_fp_term_mid);
+       var_print(writeback_rate_fp_term_high);
        var_print(writeback_rate_minimum);
 
        if (attr == &sysfs_writeback_rate_debug) {
@@ -303,6 +311,7 @@ STORE(__cached_dev)
        sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
        sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
        sysfs_strtoul_bool(writeback_running, dc->writeback_running);
+       sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
        sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
 
        sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
@@ -331,6 +340,16 @@ STORE(__cached_dev)
        sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
                            dc->writeback_rate_p_term_inverse,
                            1, UINT_MAX);
+       sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+                           dc->writeback_rate_fp_term_low,
+                           1, dc->writeback_rate_fp_term_mid - 1);
+       sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+                           dc->writeback_rate_fp_term_mid,
+                           dc->writeback_rate_fp_term_low + 1,
+                           dc->writeback_rate_fp_term_high - 1);
+       sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+                           dc->writeback_rate_fp_term_high,
+                           dc->writeback_rate_fp_term_mid + 1, UINT_MAX);
        sysfs_strtoul_clamp(writeback_rate_minimum,
                            dc->writeback_rate_minimum,
                            1, UINT_MAX);
@@ -499,9 +518,13 @@ static struct attribute *bch_cached_dev_files[] = {
        &sysfs_writeback_delay,
        &sysfs_writeback_percent,
        &sysfs_writeback_rate,
+       &sysfs_writeback_consider_fragment,
        &sysfs_writeback_rate_update_seconds,
        &sysfs_writeback_rate_i_term_inverse,
        &sysfs_writeback_rate_p_term_inverse,
+       &sysfs_writeback_rate_fp_term_low,
+       &sysfs_writeback_rate_fp_term_mid,
+       &sysfs_writeback_rate_fp_term_high,
        &sysfs_writeback_rate_minimum,
        &sysfs_writeback_rate_debug,
        &sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..82d4e0880a99 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
        int64_t integral_scaled;
        uint32_t new_rate;
 
+       /*
+        * We need to consider the number of dirty buckets as well
+        * when calculating the proportional_scaled, Otherwise we might
+        * have an unreasonable small writeback rate at a highly fragmented situation
+        * when very few dirty sectors consumed a lot dirty buckets, the
+        * worst case is when dirty buckets reached cutoff_writeback_sync and
+        * dirty data is still not even reached to writeback percent, so the rate
+        * still will be at the minimum value, which will cause the write
+        * stuck at a non-writeback mode.
+        */
+       struct cache_set *c = dc->disk.c;
+
+       int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+       if (dc->writeback_consider_fragment &&
+           c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+               int64_t fragment =
+                       div_s64((dirty_buckets * c->cache->sb.bucket_size), dirty);
+               int64_t fp_term;
+               int64_t fps;
+
+               if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+                       fp_term = dc->writeback_rate_fp_term_low *
+                       (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+               } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+                       fp_term = dc->writeback_rate_fp_term_mid *
+                       (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+               } else {
+                       fp_term = dc->writeback_rate_fp_term_high *
+                       (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+               }
+               fps = div_s64(dirty, dirty_buckets) * fp_term;
+               if (fragment > 3 && fps > proportional_scaled) {
+                       /* Only overrite the p when fragment > 3 */
+                       proportional_scaled = fps;
+               }
+       }
+
        if ((error < 0 && dc->writeback_rate_integral > 0) ||
            (error > 0 && time_before64(local_clock(),
                         dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 
        dc->writeback_metadata          = true;
        dc->writeback_running           = false;
+       dc->writeback_consider_fragment = true;
        dc->writeback_percent           = 10;
        dc->writeback_delay             = 30;
        atomic_long_set(&dc->writeback_rate.rate, 1024);
@@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 
        dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
        dc->writeback_rate_p_term_inverse = 40;
+       dc->writeback_rate_fp_term_low = 1;
+       dc->writeback_rate_fp_term_mid = 10;
+       dc->writeback_rate_fp_term_high = 1000;
        dc->writeback_rate_i_term_inverse = 10000;
 
        WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
 
 #define BCH_AUTO_GC_DIRTY_THRESHOLD    50
 
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
 #define BCH_DIRTY_INIT_THRD_MAX        64
 /*
  * 14 (16384ths) is chosen here as something that each backing device
From: Kai Krakow <kai@kaishome.de>

mainline inclusion
from v5.12-rc1
commit d7fae7b4fa152795ab70c680d3a63c7843c9368c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
Should be `register_device_async`.
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/super.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 7195b289780a..3e2dc136bc37 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2527,7 +2527,7 @@ static void register_cache_worker(struct work_struct *work)
        module_put(THIS_MODULE);
 }
 
-static void register_device_aync(struct async_reg_args *args)
+static void register_device_async(struct async_reg_args *args)
 {
        if (SB_IS_BDEV(args->sb))
                INIT_DELAYED_WORK(&args->reg_work, register_bdev_worker);
@@ -2619,7 +2619,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
                args->sb        = sb;
                args->sb_disk   = sb_disk;
                args->bdev      = bdev;
-               register_device_aync(args);
+               register_device_async(args);
                /* No wait and returns to user space */
                goto async_done;
        }
From: Joe Perches <joe@perches.com>

mainline inclusion
from v5.12-rc1
commit 6751c1e3cff3aa763c760c08862627069a37b50e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
Use semicolons and braces.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/bset.c  | 12 ++++++++----
 drivers/md/bcache/sysfs.c |  6 ++++--
 2 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 67a2c47f4201..94d38e8a59b3 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -712,8 +712,10 @@ void bch_bset_build_written_tree(struct btree_keys *b)
        for (j = inorder_next(0, t->size);
             j;
             j = inorder_next(j, t->size)) {
-               while (bkey_to_cacheline(t, k) < cacheline)
-                       prev = k, k = bkey_next(k);
+               while (bkey_to_cacheline(t, k) < cacheline) {
+                       prev = k;
+                       k = bkey_next(k);
+               }
 
                t->prev[j] = bkey_u64s(prev);
                t->tree[j].m = bkey_to_cacheline_offset(t, cacheline++, k);
@@ -901,8 +903,10 @@ unsigned int bch_btree_insert_key(struct btree_keys *b, struct bkey *k,
                status = BTREE_INSERT_STATUS_INSERT;
 
                while (m != bset_bkey_last(i) &&
-                      bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0)
-                       prev = m, m = bkey_next(m);
+                      bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0) {
+                       prev = m;
+                       m = bkey_next(m);
+               }
 
                /* prev is in the tree, if we merge we're done */
                status = BTREE_INSERT_STATUS_BACK_MERGE;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index eef15f8022ba..cc89f3156d1a 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -1094,8 +1094,10 @@ SHOW(__bch_cache)
                        --n;
 
                while (cached < p + n &&
-                      *cached == BTREE_PRIO)
-                       cached++, n--;
+                      *cached == BTREE_PRIO) {
+                       cached++;
+                       n--;
+               }
 
                for (i = 0; i < n; i++)
                        sum += INITIAL_PRIO - cached[i];
From: Zhiqiang Liu <liuzhiqiang26@huawei.com>

mainline inclusion
from v5.13-rc1
commit 13e1db65d2b9263c3dfe447077981e7a32c857ae
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
In bch_cached_dev_run(), the kfree(env[1])/kfree(env[2])/kfree(buf) sequence
shows up three times. This patch introduces an out tag in which the kfree()
calls appear only once; when we need to call them because an error occurred,
we can set the error code in ret and then goto the out tag directly.
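For reference, the resulting error path is the common goto-out cleanup idiom
(a generic editor's sketch, not the driver code itself):

  #include <stdlib.h>

  static int do_work(void)
  {
          int ret = 0;
          char *a = malloc(16), *b = malloc(16);

          if (!a || !b) {
                  ret = -1;       /* set the error code ... */
                  goto out;       /* ... and take the single cleanup path */
          }
          /* real work would go here */
  out:
          free(a);                /* free(NULL) is a no-op, so this is safe */
          free(b);
          return ret;
  }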
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/super.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3e2dc136bc37..a52f491e5209 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1058,6 +1058,7 @@ static int cached_dev_status_update(void *arg)
 
 int bch_cached_dev_run(struct cached_dev *dc)
 {
+       int ret = 0;
        struct bcache_device *d = &dc->disk;
        char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL);
        char *env[] = {
@@ -1070,19 +1071,15 @@ int bch_cached_dev_run(struct cached_dev *dc)
        if (dc->io_disable) {
                pr_err("I/O disabled on cached dev %s\n",
                       dc->backing_dev_name);
-               kfree(env[1]);
-               kfree(env[2]);
-               kfree(buf);
-               return -EIO;
+               ret = -EIO;
+               goto out;
        }
 
        if (atomic_xchg(&dc->running, 1)) {
-               kfree(env[1]);
-               kfree(env[2]);
-               kfree(buf);
                pr_info("cached dev %s is running already\n",
                        dc->backing_dev_name);
-               return -EBUSY;
+               ret = -EBUSY;
+               goto out;
        }
 
        if (!d->c &&
@@ -1103,15 +1100,13 @@ int bch_cached_dev_run(struct cached_dev *dc)
         * only class / kset properties are persistent
         */
        kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env);
-       kfree(env[1]);
-       kfree(env[2]);
-       kfree(buf);
 
        if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") ||
            sysfs_create_link(&disk_to_dev(d->disk)->kobj,
                              &d->kobj, "bcache")) {
                pr_err("Couldn't create bcache dev <-> disk sysfs symlinks\n");
-               return -ENOMEM;
+               ret = -ENOMEM;
+               goto out;
        }
 
        dc->status_update_thread = kthread_run(cached_dev_status_update,
@@ -1120,7 +1115,11 @@ int bch_cached_dev_run(struct cached_dev *dc)
                pr_warn("failed to create bcache_status_update kthread, continue to run without monitoring backing device status\n");
        }
 
-       return 0;
+out:
+       kfree(env[1]);
+       kfree(env[2]);
+       kfree(buf);
+       return ret;
 }
 
 /*
From: Christoph Hellwig <hch@lst.de>

mainline inclusion
from v5.13-rc1
commit 11e9560e6c005b4adca12d17b27dc5ac22b40663
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
Remove the PTR_CACHE inline and replace it with a direct dereference of c->cache.
(Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/alloc.c     |  5 ++---
 drivers/md/bcache/bcache.h    | 11 ++---------
 drivers/md/bcache/btree.c     |  4 ++--
 drivers/md/bcache/debug.c     |  2 +-
 drivers/md/bcache/extents.c   |  4 ++--
 drivers/md/bcache/io.c        |  4 ++--
 drivers/md/bcache/journal.c   |  2 +-
 drivers/md/bcache/writeback.c |  5 ++---
 8 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 8c371d5eef8e..097577ae3c47 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -482,8 +482,7 @@ void bch_bucket_free(struct cache_set *c, struct bkey *k)
        unsigned int i;
 
        for (i = 0; i < KEY_PTRS(k); i++)
-               __bch_bucket_free(PTR_CACHE(c, k, i),
-                                 PTR_BUCKET(c, k, i));
+               __bch_bucket_free(c->cache, PTR_BUCKET(c, k, i));
 }
 
 int __bch_bucket_alloc_set(struct cache_set *c, unsigned int reserve,
@@ -674,7 +673,7 @@ bool bch_alloc_sectors(struct cache_set *c,
                SET_PTR_OFFSET(&b->key, i, PTR_OFFSET(&b->key, i) + sectors);
 
                atomic_long_add(sectors,
-                               &PTR_CACHE(c, &b->key, i)->sectors_written);
+                               &c->cache->sectors_written);
        }
 
        if (b->sectors_free < c->cache->sb.block_size)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 848dd4db1659..0a4551e165ab 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -804,13 +804,6 @@ static inline sector_t bucket_remainder(struct cache_set *c, sector_t s)
        return s & (c->cache->sb.bucket_size - 1);
 }
 
-static inline struct cache *PTR_CACHE(struct cache_set *c,
-                                     const struct bkey *k,
-                                     unsigned int ptr)
-{
-       return c->cache;
-}
-
 static inline size_t PTR_BUCKET_NR(struct cache_set *c,
                                   const struct bkey *k,
                                   unsigned int ptr)
@@ -822,7 +815,7 @@ static inline struct bucket *PTR_BUCKET(struct cache_set *c,
                                        const struct bkey *k,
                                        unsigned int ptr)
 {
-       return PTR_CACHE(c, k, ptr)->buckets + PTR_BUCKET_NR(c, k, ptr);
+       return c->cache->buckets + PTR_BUCKET_NR(c, k, ptr);
 }
 
 static inline uint8_t gen_after(uint8_t a, uint8_t b)
@@ -841,7 +834,7 @@ static inline uint8_t ptr_stale(struct cache_set *c, const struct bkey *k,
 static inline bool ptr_available(struct cache_set *c, const struct bkey *k,
                                 unsigned int i)
 {
-       return (PTR_DEV(k, i) < MAX_CACHES_PER_SET) && PTR_CACHE(c, k, i);
+       return (PTR_DEV(k, i) < MAX_CACHES_PER_SET) && c->cache;
 }
 
 /* Btree key macros */
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fe6dce125aba..183a58c89377 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
        do_btree_node_write(b);
 
        atomic_long_add(set_blocks(i, block_bytes(b->c->cache)) * b->c->cache->sb.block_size,
-                       &PTR_CACHE(b->c, &b->key, 0)->btree_sectors_written);
+                       &b->c->cache->btree_sectors_written);
 
        b->written += set_blocks(i, block_bytes(b->c->cache));
 }
@@ -1161,7 +1161,7 @@ static void make_btree_freeing_key(struct btree *b, struct bkey *k)
 
        for (i = 0; i < KEY_PTRS(k); i++)
                SET_PTR_GEN(k, i,
-                           bch_inc_gen(PTR_CACHE(b->c, &b->key, i),
+                           bch_inc_gen(b->c->cache,
                                        PTR_BUCKET(b->c, &b->key, i)));
 
        mutex_unlock(&b->c->bucket_lock);
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index b00fd08d696b..b2eb59b9cd71 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -50,7 +50,7 @@ void bch_btree_verify(struct btree *b)
        v->keys.ops = b->keys.ops;
 
        bio = bch_bbio_alloc(b->c);
-       bio_set_dev(bio, PTR_CACHE(b->c, &b->key, 0)->bdev);
+       bio_set_dev(bio, c->cache->bdev);
        bio->bi_iter.bi_sector  = PTR_OFFSET(&b->key, 0);
        bio->bi_iter.bi_size    = KEY_SIZE(&v->key) << 9;
        bio->bi_opf             = REQ_OP_READ | REQ_META;
diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c
index f4658a1f37b8..d626ffcbecb9 100644
--- a/drivers/md/bcache/extents.c
+++ b/drivers/md/bcache/extents.c
@@ -50,7 +50,7 @@ static bool __ptr_invalid(struct cache_set *c, const struct bkey *k)
 
        for (i = 0; i < KEY_PTRS(k); i++)
                if (ptr_available(c, k, i)) {
-                       struct cache *ca = PTR_CACHE(c, k, i);
+                       struct cache *ca = c->cache;
                        size_t bucket = PTR_BUCKET_NR(c, k, i);
                        size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
 
@@ -71,7 +71,7 @@ static const char *bch_ptr_status(struct cache_set *c, const struct bkey *k)
 
        for (i = 0; i < KEY_PTRS(k); i++)
                if (ptr_available(c, k, i)) {
-                       struct cache *ca = PTR_CACHE(c, k, i);
+                       struct cache *ca = c->cache;
                        size_t bucket = PTR_BUCKET_NR(c, k, i);
                        size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
 
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index dad71a6b7889..e4388fe3ab7e 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -36,7 +36,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
        struct bbio *b = container_of(bio, struct bbio, bio);
 
        bio->bi_iter.bi_sector = PTR_OFFSET(&b->key, 0);
-       bio_set_dev(bio, PTR_CACHE(c, &b->key, 0)->bdev);
+       bio_set_dev(bio, c->cache->bdev);
 
        b->submit_time_us = local_clock_us();
        closure_bio_submit(c, bio, bio->bi_private);
@@ -137,7 +137,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
                              blk_status_t error, const char *m)
 {
        struct bbio *b = container_of(bio, struct bbio, bio);
-       struct cache *ca = PTR_CACHE(c, &b->key, 0);
+       struct cache *ca = c->cache;
        int is_read = (bio_data_dir(bio) == READ ? 1 : 0);
 
        unsigned int threshold = op_is_write(bio_op(bio))
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c6613e817333..de2c0d7699cf 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -768,7 +768,7 @@ static void journal_write_unlocked(struct closure *cl)
        w->data->csum           = csum_set(w->data);
 
        for (i = 0; i < KEY_PTRS(k); i++) {
-               ca = PTR_CACHE(c, k, i);
+               ca = c->cache;
                bio = &ca->journal.bio;
 
                atomic_long_add(sectors, &ca->meta_sectors_written);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 82d4e0880a99..bcd550a2b0da 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -416,7 +416,7 @@ static void read_dirty_endio(struct bio *bio)
        struct dirty_io *io = w->private;
 
        /* is_read = 1 */
-       bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
+       bch_count_io_errors(io->dc->disk.c->cache,
                            bio->bi_status, 1,
                            "reading dirty data from cache");
 
@@ -510,8 +510,7 @@ static void read_dirty(struct cached_dev *dc)
                        dirty_init(w);
                        bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
                        io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
-                       bio_set_dev(&io->bio,
-                                   PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
+                       bio_set_dev(&io->bio, dc->disk.c->cache->bdev);
                        io->bio.bi_end_io       = read_dirty_endio;
 
                        if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
From: Yang Li <yang.lee@linux.alibaba.com>

mainline inclusion
from v5.13-rc1
commit f9a018e8a6af2898dc782f6e526bd11f6f352e87
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
This fixes the following sparse warnings:

drivers/md/bcache/features.c:22:16: warning: Using plain integer as NULL pointer
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/features.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
index d636b7b2d070..6d2b7b84a7b7 100644
--- a/drivers/md/bcache/features.c
+++ b/drivers/md/bcache/features.c
@@ -19,7 +19,7 @@ struct feature {
 static struct feature feature_list[] = {
        {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE,
                "large_bucket"},
-       {0, 0, 0 },
+       {0, 0, NULL },
 };
 
 #define compose_feature_string(type)                           \
From: Arnd Bergmann <arnd@arndb.de>

mainline inclusion
from v5.13-rc1
commit be3bacececd7c4ab233105171d39082858de1baa
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
Building with 'make W=1' shows a harmless warning for each user of the
EBUG_ON() macro:

drivers/md/bcache/bset.c: In function 'bch_btree_sort_partial':
drivers/md/bcache/util.h:30:55: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
   30 | #define EBUG_ON(cond)                   do { if (cond); } while (0)
      |                                                       ^
drivers/md/bcache/bset.c:1312:9: note: in expansion of macro 'EBUG_ON'
 1312 |         EBUG_ON(oldsize >= 0 && bch_count_data(b) != oldsize);
      |         ^~~~~~~
Reword the macro slightly to avoid the warning.
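The before/after shapes of the macro, side by side (copied from the hunk
below; the _OLD/_NEW suffixes are added here only to show both forms at
once):

  /* Before: the null statement after if (cond) trips -Wempty-body */
  #define EBUG_ON_OLD(cond)     do { if (cond); } while (0)

  /* After: an explicit empty loop body keeps the semantics (cond is
   * still evaluated exactly once) without looking like a typo */
  #define EBUG_ON_NEW(cond)     do { if (cond) do {} while (0); } while (0)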
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/util.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index c029f7443190..bca4a7c97da7 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -27,7 +27,7 @@ struct closure;
 
 #else /* DEBUG */
 
-#define EBUG_ON(cond)          do { if (cond); } while (0)
+#define EBUG_ON(cond)  do { if (cond) do {} while (0); } while (0)
 #define atomic_dec_bug(v)      atomic_dec(v)
 #define atomic_inc_bug(v, i)   atomic_inc(v)
From: Bhaskar Chowdhury <unixbhaskar@gmail.com>

mainline inclusion
from v5.13-rc1
commit 9c9b81c45619e76d315eb3b9934e9d4bfa7d3bcd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
s/condidate/candidate/

s/folowing/following/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/journal.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index de2c0d7699cf..61bd79babf7a 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -111,7 +111,7 @@ reread:                left = ca->sb.bucket_size - offset;
                 * Check from the oldest jset for last_seq. If
                 * i->j.seq < j->last_seq, it means the oldest jset
                 * in list is expired and useless, remove it from
-                * this list. Otherwise, j is a condidate jset for
+                * this list. Otherwise, j is a candidate jset for
                 * further following checks.
                 */
                while (!list_empty(list)) {
@@ -498,7 +498,7 @@ static void btree_flush_write(struct cache_set *c)
                 *   - If there are matched nodes recorded in btree_nodes[],
                 *     they are clean now (this is why and how the oldest
                 *     journal entry can be reclaimed). These selected nodes
-                *     will be ignored and skipped in the folowing for-loop.
+                *     will be ignored and skipped in the following for-loop.
                 */
                if (((btree_current_write(b)->journal - fifo_front_p) &
                     mask) != 0) {
From: "Gustavo A. R. Silva" gustavoars@kernel.org
mainline inclusion from v5.13-rc1 commit 62594f189e81caffa6a3bfa2fdb08eec2e347c76 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
Cast multiple variables to (int64_t) in order to give the compiler complete information about the proper arithmetic to use. Notice that these variables are being used in contexts that expect expressions of type int64_t (64 bit, signed). And currently, such expressions are being evaluated using 32-bit arithmetic.
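The class of bug is easy to reproduce in isolation (an editor's sketch with
made-up magnitudes, not the driver's actual values):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned int term = 3000000000u;        /* e.g. a large sysfs-tuned fp term */
          int delta = 36;                         /* e.g. in_use - threshold */

          int64_t wrong = term * delta;           /* multiplied in 32 bits, wraps mod 2^32 first */
          int64_t right = (int64_t)term * delta;  /* promoted to 64 bits before multiplying */

          printf("wrong=%lld right=%lld\n", (long long)wrong, (long long)right);
          return 0;
  }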
Fixes: d0cf9503e908 ("octeontx2-pf: ethtool fec mode support")
Addresses-Coverity-ID: 1501724 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1501725 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1501726 ("Unintentional integer overflow")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/writeback.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index bcd550a2b0da..8120da278161 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -110,13 +110,13 @@ static void __update_writeback_rate(struct cached_dev *dc)
                int64_t fps;
 
                if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
-                       fp_term = dc->writeback_rate_fp_term_low *
+                       fp_term = (int64_t)dc->writeback_rate_fp_term_low *
                        (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
                } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
-                       fp_term = dc->writeback_rate_fp_term_mid *
+                       fp_term = (int64_t)dc->writeback_rate_fp_term_mid *
                        (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
                } else {
-                       fp_term = dc->writeback_rate_fp_term_high *
+                       fp_term = (int64_t)dc->writeback_rate_fp_term_high *
                        (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
                }
                fps = div_s64(dirty, dirty_buckets) * fp_term;
mainline inclusion
from v5.13-rc1
commit 33ec5dfe8f42aaf0163a16e2b450ab06f3a7f1f3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
The patch "bcache: remove PTR_CACHE" introduces a compiling failure in debug.c with following error message, In file included from drivers/md/bcache/bcache.h:182:0, from drivers/md/bcache/debug.c:9: drivers/md/bcache/debug.c: In function 'bch_btree_verify': drivers/md/bcache/debug.c:53:19: error: 'c' undeclared (first use in this function) bio_set_dev(bio, c->cache->bdev); ^ This patch fixes the regression by replacing c->cache->bdev by b->c-> cache->bdev.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210411134316.80274-8-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index b2eb59b9cd71..45e7d54a40ff 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -50,7 +50,7 @@ void bch_btree_verify(struct btree *b)
        v->keys.ops = b->keys.ops;
 
        bio = bch_bbio_alloc(b->c);
-       bio_set_dev(bio, c->cache->bdev);
+       bio_set_dev(bio, b->c->cache->bdev);
        bio->bi_iter.bi_sector  = PTR_OFFSET(&b->key, 0);
        bio->bi_iter.bi_size    = KEY_SIZE(&v->key) << 9;
        bio->bi_opf             = REQ_OP_READ | REQ_META;
From: YueHaibing <yuehaibing@huawei.com>

mainline inclusion
from v5.13-rc5
commit 415f0c835ba799e47ce077b01876568431da1ff3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
Fix W=1 kernel build warning:
lib/crc64.c:40: warning: bad line: or the previous crc64 value if computing incrementally.
Link: https://lkml.kernel.org/r/20210601135851.15444-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Coly Li <colyli@suse.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 lib/crc64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/crc64.c b/lib/crc64.c
index 47cfa054827f..9f852a89ee2a 100644
--- a/lib/crc64.c
+++ b/lib/crc64.c
@@ -37,7 +37,7 @@ MODULE_LICENSE("GPL v2");
 /**
  * crc64_be - Calculate bitwise big-endian ECMA-182 CRC64
  * @crc: seed value for computation. 0 or (u64)~0 for a new CRC calculation,
-       or the previous crc64 value if computing incrementally.
+ *     or the previous crc64 value if computing incrementally.
  * @p: pointer to buffer over which CRC64 is run
  * @len: length of buffer @p
  */
mainline inclusion
from v5.13-rc6
commit 1616a4c2ab1a80893b6890ae93da40a2b1d0c691
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
For a read cache miss, bcache defines a readahead size for the read I/O
request to the backing device for the missing data. This readahead size is
initialized to 0, and almost no one uses it, to avoid unnecessary read
amplification on the backing device and write amplification on the cache
device. Considering that upper-layer file system code already has its own
readahead logic and works fine with the readahead_cache_policy sysfs
interface, we don't have to keep bcache's self-defined readahead anymore.

This patch removes the bcache self-defined readahead for cache miss requests
to the backing device, and the readahead sysfs file interfaces are removed as
well.

This is the preparation for the next patch, which fixes a potential kernel
panic due to oversized requests with a simpler method.
Reported-by: Alexander Ullrich <ealex1979@gmail.com>
Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: Marco Rebhan <me@dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache@mfedv.net>
Reported-by: Victor Westerhuis <victor@westerhu.is>
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
Link: https://lore.kernel.org/r/20210607125052.21277-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bcache/bcache.h  |  1 -
 drivers/md/bcache/request.c | 12 +-----------
 drivers/md/bcache/stats.c   | 14 --------------
 drivers/md/bcache/stats.h   |  1 -
 drivers/md/bcache/sysfs.c   |  4 ----
 5 files changed, 1 insertion(+), 31 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 0a4551e165ab..5fc989a6d452 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -364,7 +364,6 @@ struct cached_dev {
 
        /* The rest of this all shows up in sysfs */
        unsigned int            sequential_cutoff;
-       unsigned int            readahead;
 
        unsigned int            io_disable:1;
        unsigned int            verify:1;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 214326383145..f66889d882f5 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -878,7 +878,6 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
                                 struct bio *bio, unsigned int sectors)
 {
        int ret = MAP_CONTINUE;
-       unsigned int reada = 0;
        struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
        struct bio *miss, *cache_bio;
 
@@ -890,13 +889,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
                goto out_submit;
        }
 
-       if (!(bio->bi_opf & REQ_RAHEAD) &&
-           !(bio->bi_opf & (REQ_META|REQ_PRIO)) &&
-           s->iop.c->gc_stats.in_use < CUTOFF_CACHE_READA)
-               reada = min_t(sector_t, dc->readahead >> 9,
-                             get_capacity(bio->bi_disk) - bio_end_sector(bio));
-
-       s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
+       s->insert_bio_sectors = min(sectors, bio_sectors(bio));
 
        s->iop.replace_key = KEY(s->iop.inode,
                                 bio->bi_iter.bi_sector + s->insert_bio_sectors,
@@ -930,9 +923,6 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
        if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
                goto out_put;
 
-       if (reada)
-               bch_mark_cache_readahead(s->iop.c, s->d);
-
        s->cache_miss   = miss;
        s->iop.bio      = cache_bio;
        bio_get(cache_bio);
diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c
index 503aafe188dc..4c7ee5fedb9d 100644
--- a/drivers/md/bcache/stats.c
+++ b/drivers/md/bcache/stats.c
@@ -46,7 +46,6 @@ read_attribute(cache_misses);
 read_attribute(cache_bypass_hits);
 read_attribute(cache_bypass_misses);
 read_attribute(cache_hit_ratio);
-read_attribute(cache_readaheads);
 read_attribute(cache_miss_collisions);
 read_attribute(bypassed);
 
@@ -64,7 +63,6 @@ SHOW(bch_stats)
                 DIV_SAFE(var(cache_hits) * 100,
                          var(cache_hits) + var(cache_misses)));
 
-       var_print(cache_readaheads);
        var_print(cache_miss_collisions);
        sysfs_hprint(bypassed,  var(sectors_bypassed) << 9);
 #undef var
@@ -86,7 +84,6 @@ static struct attribute *bch_stats_files[] = {
        &sysfs_cache_bypass_hits,
        &sysfs_cache_bypass_misses,
        &sysfs_cache_hit_ratio,
-       &sysfs_cache_readaheads,
        &sysfs_cache_miss_collisions,
        &sysfs_bypassed,
        NULL
@@ -113,7 +110,6 @@ void bch_cache_accounting_clear(struct cache_accounting *acc)
        acc->total.cache_misses = 0;
        acc->total.cache_bypass_hits = 0;
        acc->total.cache_bypass_misses = 0;
-       acc->total.cache_readaheads = 0;
        acc->total.cache_miss_collisions = 0;
        acc->total.sectors_bypassed = 0;
 }
@@ -145,7 +141,6 @@ static void scale_stats(struct cache_stats *stats, unsigned long rescale_at)
                scale_stat(&stats->cache_misses);
                scale_stat(&stats->cache_bypass_hits);
                scale_stat(&stats->cache_bypass_misses);
-               scale_stat(&stats->cache_readaheads);
                scale_stat(&stats->cache_miss_collisions);
                scale_stat(&stats->sectors_bypassed);
        }
@@ -168,7 +163,6 @@ static void scale_accounting(struct timer_list *t)
        move_stat(cache_misses);
        move_stat(cache_bypass_hits);
        move_stat(cache_bypass_misses);
-       move_stat(cache_readaheads);
        move_stat(cache_miss_collisions);
        move_stat(sectors_bypassed);
 
@@ -209,14 +203,6 @@ void bch_mark_cache_accounting(struct cache_set *c, struct bcache_device *d,
        mark_cache_stats(&c->accounting.collector, hit, bypass);
 }
 
-void bch_mark_cache_readahead(struct cache_set *c, struct bcache_device *d)
-{
-       struct cached_dev *dc = container_of(d, struct cached_dev, disk);
-
-       atomic_inc(&dc->accounting.collector.cache_readaheads);
-       atomic_inc(&c->accounting.collector.cache_readaheads);
-}
-
 void bch_mark_cache_miss_collision(struct cache_set *c, struct bcache_device *d)
 {
        struct cached_dev *dc = container_of(d, struct cached_dev, disk);
diff --git a/drivers/md/bcache/stats.h b/drivers/md/bcache/stats.h
index abfaabf7e7fc..ca4f435f7216 100644
--- a/drivers/md/bcache/stats.h
+++ b/drivers/md/bcache/stats.h
@@ -7,7 +7,6 @@ struct cache_stat_collector {
        atomic_t cache_misses;
        atomic_t cache_bypass_hits;
        atomic_t cache_bypass_misses;
-       atomic_t cache_readaheads;
        atomic_t cache_miss_collisions;
        atomic_t sectors_bypassed;
 };
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index cc89f3156d1a..05ac1d6fbbf3 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -137,7 +137,6 @@ rw_attribute(io_disable);
 rw_attribute(discard);
 rw_attribute(running);
 rw_attribute(label);
-rw_attribute(readahead);
 rw_attribute(errors);
 rw_attribute(io_error_limit);
 rw_attribute(io_error_halflife);
@@ -260,7 +259,6 @@ SHOW(__bch_cached_dev)
        var_printf(partial_stripes_expensive,   "%u");
 
        var_hprint(sequential_cutoff);
-       var_hprint(readahead);
 
        sysfs_print(running,            atomic_read(&dc->running));
        sysfs_print(state,              states[BDEV_STATE(&dc->sb)]);
@@ -365,7 +363,6 @@ STORE(__cached_dev)
        sysfs_strtoul_clamp(sequential_cutoff,
                            dc->sequential_cutoff,
                            0, UINT_MAX);
-       d_strtoi_h(readahead);
 
        if (attr == &sysfs_clear_stats)
                bch_cache_accounting_clear(&dc->accounting);
@@ -538,7 +535,6 @@ static struct attribute *bch_cached_dev_files[] = {
        &sysfs_running,
        &sysfs_state,
        &sysfs_label,
-       &sysfs_readahead,
 #ifdef CONFIG_BCACHE_DEBUG
        &sysfs_verify,
        &sysfs_bypass_torture_test,
mainline inclusion
from v5.13-rc6
commit 41fe8d088e96472f63164e213de44ec77be69478
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A
In the cache miss code path of the cached device, if a proper location from
the internal B+ tree is matched for a cache miss range, function
cached_dev_cache_miss() will be called in cache_lookup_fn() in the following
code block,

[code block 1]
  526         unsigned int sectors = KEY_INODE(k) == s->iop.inode
  527                 ? min_t(uint64_t, INT_MAX,
  528                         KEY_START(k) - bio->bi_iter.bi_sector)
  529                 : INT_MAX;
  530         int ret = s->d->cache_miss(b, s, bio, sectors);

Here s->d->cache_miss() is the callback function pointer initialized to
cached_dev_cache_miss(); the last parameter 'sectors' is an important hint
for calculating the size of the read request to the backing device for the
missing cache data.

The current calculation in the above code block may generate an oversized
value of 'sectors', which consequently may trigger 2 different potential
kernel panics by BUG() or BUG_ON(), as listed below,

1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
   886         BUG_ON(b->ops->is_extents && !KEY_SIZE(k));

2) BUG() inside biovec_slab(),
[code block 3]
    51         default:
    52                 BUG();
    53                 return NULL;

Both panics originate from cached_dev_cache_miss() via the oversized
parameter 'sectors'.

Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate the
size of data read from the backing device for the cache miss. This size is
stored in s->insert_bio_sectors by the following line of code,

[code block 4]
   909         s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);

Then the actual key to insert into the internal B+ tree is generated and
stored in s->iop.replace_key by the following lines of code,

[code block 5]
   911         s->iop.replace_key = KEY(s->iop.inode,
   912                                  bio->bi_iter.bi_sector + s->insert_bio_sectors,
   913                                  s->insert_bio_sectors);

The oversized parameter 'sectors' may trigger panic 1) by the BUG_ON() from
the above code block.

And the bio sent to the backing device for the missing data is allocated with
a hint from s->insert_bio_sectors by the following lines of code,

[code block 6]
   926         cache_bio = bio_alloc_bioset(GFP_NOWAIT,
   927                         DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
   928                         &dc->disk.bio_split);

The oversized parameter 'sectors' may trigger panic 2) by the BUG() from the
above code block.

Now let me explain how the panics happen with the oversized 'sectors'. In
code block 5, replace_key is generated by macro KEY(). From the definition of
macro KEY(),

[code block 7]
    71 #define KEY(inode, offset, size)                                  \
    72 ((struct bkey) {                                                  \
    73         .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),  \
    74         .low = (offset)                                           \
    75 })

Here 'size' is a 16-bit-wide field embedded in the 64-bit member 'high' of
struct bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector" is
very probably larger than (1 << 16) - 1, which makes the bkey size
calculation in code block 5 overflow. In one bug report the value of
parameter 'sectors' is 131072 (= 1 << 17); the oversized 'sectors' results in
an oversized s->insert_bio_sectors in code block 4, which then makes the size
field of s->iop.replace_key 0 in code block 5. Then the 0-sized
s->iop.replace_key is inserted into the internal B+ tree as the cache miss
check key (a special key to detect and avoid a racing between a normal write
request and a cache miss read request), as,

[code block 8]
   915         ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);

Then the 0-sized s->iop.replace_key as the 3rd parameter triggers the bkey
size check BUG_ON() in code block 2 and causes kernel panic 1).
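To make the truncation concrete, here is a minimal standalone sketch of the
KEY() packing from code block 7 (editor's illustration, not driver code; the
131072 value is the oversized 'sectors' from the bug report, and the field
layout mirrors the 16-bit size at bit offset 20):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t size = 131072; /* 1 << 17, oversized 'sectors' from the report */
          /* Pack as in code block 7: size lands at bits 20..35 of 'high' */
          uint64_t high = (1ULL << 63) | (size << 20);
          /* Unpack the 16-bit size field, as KEY_SIZE() does */
          uint64_t key_size = (high >> 20) & ((1ULL << 16) - 1);

          /* 1 << 17 has no set bits inside a 16-bit field, so this prints 0 */
          printf("KEY_SIZE = %llu\n", (unsigned long long)key_size);
          return 0;
  }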
Another kernel panic is from code block 6, is by the bvecs number oversized value s->insert_bio_sectors from code block 4, min(sectors, bio_sectors(bio) + reada) There are two possibility for oversized reresult, - bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized. - sectors < bio_sectors(bio) + reada, but sectors is oversized.
From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many other values which larther than BIO_MAX_VECS (a.k.a 256). When calling bio_alloc_bioset() with such larger-than-256 value as the 2nd parameter, this value will eventually be sent to biovec_slab() as parameter 'nr_vecs' in following code path, bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab() Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG() in code block 3 is triggered inside biovec_slab().
From the above analysis, we know that the 4th parameter 'sector' sent into cached_dev_cache_miss() may cause overflow in code block 5 and 6, and finally cause kernel panic in code block 2 and 3. And if result of bio_sectors(bio) + reada exceeds valid bvecs number, it may also trigger kernel panic in code block 3 from code block 6.
Now the almost-useless readahead size for cache missing request back to backing device is removed, this patch can fix the oversized issue with more simpler method. - add a local variable size_limit, set it by the minimum value from the max bkey size and max bio bvecs number. - set s->insert_bio_sectors by the minimum value from size_limit, sectors, and the sectors size of bio. - replace sectors by s->insert_bio_sectors to do bio_next_split.
By the above method with size_limit, s->insert_bio_sectors will never result oversized replace_key size or bio bvecs number. And split bio 'miss' from bio_next_split() will always match the size of 'cache_bio', that is the current maximum bio size we can sent to backing device for fetching the cache missing data.
The problematic code can be partially found since Linux v3.13-rc1; therefore all maintained stable kernels should try to apply this fix.
(Coly Li: still use BIO_MAX_PAGES because BIO_MAX_VECS is not defined yet in the 5.10-based kernel.)
Reported-by: Alexander Ullrich ealex1979@gmail.com Reported-by: Diego Ercolani diego.ercolani@gmail.com Reported-by: Jan Szubiak jan.szubiak@linuxpolska.pl Reported-by: Marco Rebhan me@dblsaiko.net Reported-by: Matthias Ferdinand bcache@mfedv.net Reported-by: Victor Westerhuis victor@westerhu.is Reported-by: Vojtech Pavlik vojtech@suse.cz Reported-and-tested-by: Rolf Fokkens rolf@rolffokkens.nl Reported-and-tested-by: Thorsten Knabe linux@thorsten-knabe.de Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Kent Overstreet kent.overstreet@gmail.com Cc: Nix nix@esperi.org.uk Cc: Takashi Iwai tiwai@suse.com Link: https://lore.kernel.org/r/20210607125052.21277-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/request.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index f66889d882f5..19014a87a7db 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -880,6 +880,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, int ret = MAP_CONTINUE; struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); struct bio *miss, *cache_bio; + unsigned int size_limit;
s->cache_missed = 1;
@@ -889,7 +890,10 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, goto out_submit; }
- s->insert_bio_sectors = min(sectors, bio_sectors(bio)); + /* Limitation for valid replace key size and cache_bio bvecs number */ + size_limit = min_t(unsigned int, BIO_MAX_PAGES * PAGE_SECTORS, + (1 << KEY_SIZE_BITS) - 1); + s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));
s->iop.replace_key = KEY(s->iop.inode, bio->bi_iter.bi_sector + s->insert_bio_sectors, @@ -901,7 +905,8 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
s->iop.replace = true;
- miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split); + miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO, + &s->d->bio_split);
/* btree_search_recurse()'s btree iterator is no good anymore */ ret = miss == bio ? MAP_DONE : -EINTR;
From: Ding Senjie dingsenjie@yulong.com
mainline inclusion from v5.16-rc1 commit a307e2abfc22880a3026bc2f2a997402b7c2d833 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
acqurie -> acquire
Signed-off-by: Ding Senjie dingsenjie@yulong.com Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211020143812.6403-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a52f491e5209..ecdd613833f4 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2759,7 +2759,7 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x) * The reason bch_register_lock is not held to call * bch_cache_set_stop() and bcache_device_stop() is to * avoid potential deadlock during reboot, because cache - * set or bcache device stopping process will acqurie + * set or bcache device stopping process will acquire * bch_register_lock too. * * We are safe here because bcache_is_reboot sets to
From: Chao Yu yuchao0@huawei.com
mainline inclusion from v5.16-rc1 commit d55f7cb2e5c053010d2b527494da9bbb722a78ba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
In register_bcache(), there are several cases where we didn't set the correct error info (return value and/or error message):
- if kzalloc() fails, it needs to return -ENOMEM and print "cannot allocate memory";
- if register_cache() fails, it's better to propagate its return value rather than using the default -EINVAL.
Signed-off-by: Chao Yu yuchao0@huawei.com Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211020143812.6403-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/super.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index ecdd613833f4..06f69970f836 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2626,8 +2626,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, if (SB_IS_BDEV(sb)) { struct cached_dev *dc = kzalloc(sizeof(*dc), GFP_KERNEL);
- if (!dc) + if (!dc) { + ret = -ENOMEM; + err = "cannot allocate memory"; goto out_put_sb_page; + }
mutex_lock(&bch_register_lock); ret = register_bdev(sb, sb_disk, bdev, dc); @@ -2638,11 +2641,15 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, } else { struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
- if (!ca) + if (!ca) { + ret = -ENOMEM; + err = "cannot allocate memory"; goto out_put_sb_page; + }
/* blkdev_put() will be called in bch_cache_release() */ - if (register_cache(sb, sb_disk, bdev, ca) != 0) + ret = register_cache(sb, sb_disk, bdev, ca); + if (ret) goto out_free_sb; }
From: Lin Feng linf@wangsu.com
mainline inclusion from v5.16-rc1 commit 0259d4498ba48454749ecfb9c81e892cdb8d1a32 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
Calculation of a cache_set's cached sectors is done by traversing the cached_devs list, as shown below:
static void calc_cached_dev_sectors(struct cache_set *c)
{
	...
	list_for_each_entry(dc, &c->cached_devs, list)
		sectors += bdev_sectors(dc->bdev);

	c->cached_dev_sectors = sectors;
}
But a cached_dev won't be unlinked from the c->cached_devs list until the following list_move(&dc->list, &uncached_devices) is called, so the previous fix in commit 46010141da6677b81cc77f9b47f8ac62bd1cbfd3 ("bcache: recal cached_dev_sectors on detach") is wrong; now we move it to its right place.
Signed-off-by: Lin Feng linf@wangsu.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211020143812.6403-5-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 06f69970f836..203f85599018 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1165,9 +1165,9 @@ static void cached_dev_detach_finish(struct work_struct *w)
mutex_lock(&bch_register_lock);
- calc_cached_dev_sectors(dc->disk.c); bcache_device_detach(&dc->disk); list_move(&dc->list, &uncached_devices); + calc_cached_dev_sectors(dc->disk.c);
clear_bit(BCACHE_DEV_DETACHING, &dc->disk.flags); clear_bit(BCACHE_DEV_UNLINK_DONE, &dc->disk.flags);
mainline inclusion from v5.16-rc1 commit cf2197ca4b8c199d188593ca6800ea1827c42171 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
The header file include/uapi/linux/bcache.h is not really a user space API header. This file defines the on-disk format of bcache internal meta data, but no one includes it from user space; bcache-tools has its own copy of this header with minor modifications.
Therefore, this patch moves include/uapi/linux/bcache.h to bcache code directory as drivers/md/bcache/bcache_ondisk.h.
Suggested-by: Arnd Bergmann arnd@kernel.org Suggested-by: Christoph Hellwig hch@lst.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211029060930.119923-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/bcache.h | 2 +- .../uapi/linux/bcache.h => drivers/md/bcache/bcache_ondisk.h | 0 drivers/md/bcache/bset.h | 2 +- drivers/md/bcache/features.c | 2 +- drivers/md/bcache/features.h | 3 ++- 5 files changed, 5 insertions(+), 4 deletions(-) rename include/uapi/linux/bcache.h => drivers/md/bcache/bcache_ondisk.h (100%)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 5fc989a6d452..2a011469af02 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -178,7 +178,6 @@
#define pr_fmt(fmt) "bcache: %s() " fmt, __func__
-#include <linux/bcache.h> #include <linux/bio.h> #include <linux/kobject.h> #include <linux/list.h> @@ -190,6 +189,7 @@ #include <linux/workqueue.h> #include <linux/kthread.h>
+#include "bcache_ondisk.h" #include "bset.h" #include "util.h" #include "closure.h" diff --git a/include/uapi/linux/bcache.h b/drivers/md/bcache/bcache_ondisk.h similarity index 100% rename from include/uapi/linux/bcache.h rename to drivers/md/bcache/bcache_ondisk.h diff --git a/drivers/md/bcache/bset.h b/drivers/md/bcache/bset.h index a50dcfda656f..d795c84246b0 100644 --- a/drivers/md/bcache/bset.h +++ b/drivers/md/bcache/bset.h @@ -2,10 +2,10 @@ #ifndef _BCACHE_BSET_H #define _BCACHE_BSET_H
-#include <linux/bcache.h> #include <linux/kernel.h> #include <linux/types.h>
+#include "bcache_ondisk.h" #include "util.h" /* for time_stats */
/* diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c index 6d2b7b84a7b7..634922c5601d 100644 --- a/drivers/md/bcache/features.c +++ b/drivers/md/bcache/features.c @@ -6,7 +6,7 @@ * Copyright 2020 Coly Li colyli@suse.de * */ -#include <linux/bcache.h> +#include "bcache_ondisk.h" #include "bcache.h" #include "features.h"
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h index d1c8fd3977fc..09161b89c63e 100644 --- a/drivers/md/bcache/features.h +++ b/drivers/md/bcache/features.h @@ -2,10 +2,11 @@ #ifndef _BCACHE_FEATURES_H #define _BCACHE_FEATURES_H
-#include <linux/bcache.h> #include <linux/kernel.h> #include <linux/types.h>
+#include "bcache_ondisk.h" + #define BCH_FEATURE_COMPAT 0 #define BCH_FEATURE_RO_COMPAT 1 #define BCH_FEATURE_INCOMPAT 2
From: Qing Wang wangqing@vivo.com
mainline inclusion from v5.16-rc1 commit 1b86db5f4e025840e0bf7cef2b10e84531954386 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
coccicheck complains about the use of snprintf() in sysfs show functions.
Fix the following coccicheck warning:
drivers/md/bcache/sysfs.h:54:12-20: WARNING: use scnprintf or sprintf.
Implement sysfs_print() by sysfs_emit() and remove snprint() since no one uses it any more.
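For readers unfamiliar with the GCC builtin this macro relies on, below is a minimal stand-alone sketch of compile-time format selection with __builtin_types_compatible_p(); it is illustrative only and not the bcache macro (FMT_FOR is a made-up name):

#include <stdio.h>

/* Pick a printf format string based on the static type of 'var';
 * the conditions are compile-time constants, so the compiler folds
 * the chain down to a single string literal. */
#define FMT_FOR(var)							\
	(__builtin_types_compatible_p(typeof(var), int) ? "%i\n" :	\
	 __builtin_types_compatible_p(typeof(var), unsigned long) ?	\
	 "%lu\n" : "%i\n")

int main(void)
{
	int a = -3;
	unsigned long b = 42;

	printf(FMT_FOR(a), a);	/* uses "%i\n"  */
	printf(FMT_FOR(b), b);	/* uses "%lu\n" */
	return 0;
}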
Suggested-by: Coly Li colyli@suse.de Signed-off-by: Qing Wang wangqing@vivo.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211029060930.119923-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/sysfs.h | 18 ++++++++++++++++-- drivers/md/bcache/util.h | 17 ----------------- 2 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/drivers/md/bcache/sysfs.h b/drivers/md/bcache/sysfs.h index 215df32f567b..c1752ba2e05b 100644 --- a/drivers/md/bcache/sysfs.h +++ b/drivers/md/bcache/sysfs.h @@ -51,13 +51,27 @@ STORE(fn) \ #define sysfs_printf(file, fmt, ...) \ do { \ if (attr == &sysfs_ ## file) \ - return snprintf(buf, PAGE_SIZE, fmt "\n", __VA_ARGS__); \ + return sysfs_emit(buf, fmt "\n", __VA_ARGS__); \ } while (0)
#define sysfs_print(file, var) \ do { \ if (attr == &sysfs_ ## file) \ - return snprint(buf, PAGE_SIZE, var); \ + return sysfs_emit(buf, \ + __builtin_types_compatible_p(typeof(var), int) \ + ? "%i\n" : \ + __builtin_types_compatible_p(typeof(var), unsigned int) \ + ? "%u\n" : \ + __builtin_types_compatible_p(typeof(var), long) \ + ? "%li\n" : \ + __builtin_types_compatible_p(typeof(var), unsigned long)\ + ? "%lu\n" : \ + __builtin_types_compatible_p(typeof(var), int64_t) \ + ? "%lli\n" : \ + __builtin_types_compatible_p(typeof(var), uint64_t) \ + ? "%llu\n" : \ + __builtin_types_compatible_p(typeof(var), const char *) \ + ? "%s\n" : "%i\n", var); \ } while (0)
#define sysfs_hprint(file, val) \ diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index bca4a7c97da7..97c524679c8a 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -342,23 +342,6 @@ static inline int bch_strtoul_h(const char *cp, long *res) _r; \ })
-#define snprint(buf, size, var) \ - snprintf(buf, size, \ - __builtin_types_compatible_p(typeof(var), int) \ - ? "%i\n" : \ - __builtin_types_compatible_p(typeof(var), unsigned int) \ - ? "%u\n" : \ - __builtin_types_compatible_p(typeof(var), long) \ - ? "%li\n" : \ - __builtin_types_compatible_p(typeof(var), unsigned long)\ - ? "%lu\n" : \ - __builtin_types_compatible_p(typeof(var), int64_t) \ - ? "%lli\n" : \ - __builtin_types_compatible_p(typeof(var), uint64_t) \ - ? "%llu\n" : \ - __builtin_types_compatible_p(typeof(var), const char *) \ - ? "%s\n" : "%i\n", var) - ssize_t bch_hprint(char *buf, int64_t v);
bool bch_is_zero(const char *p, size_t n);
From: Lin Feng linf@wangsu.com
mainline inclusion from v5.16-rc6 commit aa97f6cdb7e92909e17c8ca63e622fcb81d57a57 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
Commit 0259d4498ba4 ("bcache: move calc_cached_dev_sectors to proper place on backing device detach") tries to fix calc_cached_dev_sectors when bcache device detaches, but now we have:
cached_dev_detach_finish
    ...
    bcache_device_detach(&dc->disk);
        ...
        closure_put(&d->c->caching);
        d->c = NULL; [*explicitly set dc->disk.c to NULL*]
    list_move(&dc->list, &uncached_devices);
    calc_cached_dev_sectors(dc->disk.c); [*passing a NULL pointer*]
    ...
The code flow above shows how the bug happens. This patch fixes the problem by caching dc->disk.c beforehand. The cache_set won't be freed under us, because the c->caching closure still holds at least one reference, and its callback __cache_set_unregister is only called by bch_cache_set_stop() via closure_queue(&c->caching); that means the c->caching closure callback that destroys the cache_set won't be triggered by the previous closure_put(&d->c->caching). So at this stage (while cached_dev_detach_finish() is running) it is safe to access the cache_set through dc->disk.c.
Fixes: 0259d4498ba4 ("bcache: move calc_cached_dev_sectors to proper place on backing device detach") Signed-off-by: Lin Feng linf@wangsu.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211112053629.3437-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/super.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 203f85599018..a98b46d3cd0b 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1150,6 +1150,7 @@ static void cancel_writeback_rate_update_dwork(struct cached_dev *dc) static void cached_dev_detach_finish(struct work_struct *w) { struct cached_dev *dc = container_of(w, struct cached_dev, detach); + struct cache_set *c = dc->disk.c;
BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)); BUG_ON(refcount_read(&dc->count)); @@ -1167,7 +1168,7 @@ static void cached_dev_detach_finish(struct work_struct *w)
bcache_device_detach(&dc->disk); list_move(&dc->list, &uncached_devices); - calc_cached_dev_sectors(dc->disk.c); + calc_cached_dev_sectors(c);
clear_bit(BCACHE_DEV_DETACHING, &dc->disk.flags); clear_bit(BCACHE_DEV_UNLINK_DONE, &dc->disk.flags);
From: Greg Kroah-Hartman gregkh@linuxfoundation.org
mainline inclusion from v5.18-rc1 commit fa97cb843cfb874c50cd1dcc46a2f28187e184e9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the bcache sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field.
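For context, ATTRIBUTE_GROUPS() in <linux/sysfs.h> roughly expands as sketched below (paraphrased from memory of the kernel header, so the exact expansion may differ between kernel versions); this is why one macro invocation per attribute array is enough to feed the default_groups field:

/* Rough shape of ATTRIBUTE_GROUPS(); paraphrased, not copied verbatim */
#define ATTRIBUTE_GROUPS(_name)					\
static const struct attribute_group _name##_group = {		\
	.attrs = _name##_attrs,					\
};								\
static const struct attribute_group *_name##_groups[] = {	\
	&_name##_group,						\
	NULL,							\
}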
Cc: Kent Overstreet kent.overstreet@gmail.com Cc: linux-bcache@vger.kernel.org Acked-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220106100004.3277439-1-gregkh@linuxfoundation.or... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/bcache/stats.c | 3 ++- drivers/md/bcache/sysfs.c | 15 ++++++++++----- drivers/md/bcache/sysfs.h | 2 +- 3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c index 4c7ee5fedb9d..68b02216033d 100644 --- a/drivers/md/bcache/stats.c +++ b/drivers/md/bcache/stats.c @@ -78,7 +78,7 @@ static void bch_stats_release(struct kobject *k) { }
-static struct attribute *bch_stats_files[] = { +static struct attribute *bch_stats_attrs[] = { &sysfs_cache_hits, &sysfs_cache_misses, &sysfs_cache_bypass_hits, @@ -88,6 +88,7 @@ static struct attribute *bch_stats_files[] = { &sysfs_bypassed, NULL }; +ATTRIBUTE_GROUPS(bch_stats); static KTYPE(bch_stats);
int bch_cache_accounting_add_kobjs(struct cache_accounting *acc, diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 05ac1d6fbbf3..8467e37411a7 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -500,7 +500,7 @@ STORE(bch_cached_dev) return size; }
-static struct attribute *bch_cached_dev_files[] = { +static struct attribute *bch_cached_dev_attrs[] = { &sysfs_attach, &sysfs_detach, &sysfs_stop, @@ -543,6 +543,7 @@ static struct attribute *bch_cached_dev_files[] = { &sysfs_backing_dev_uuid, NULL }; +ATTRIBUTE_GROUPS(bch_cached_dev); KTYPE(bch_cached_dev);
SHOW(bch_flash_dev) @@ -600,7 +601,7 @@ STORE(__bch_flash_dev) } STORE_LOCKED(bch_flash_dev)
-static struct attribute *bch_flash_dev_files[] = { +static struct attribute *bch_flash_dev_attrs[] = { &sysfs_unregister, #if 0 &sysfs_data_csum, @@ -609,6 +610,7 @@ static struct attribute *bch_flash_dev_files[] = { &sysfs_size, NULL }; +ATTRIBUTE_GROUPS(bch_flash_dev); KTYPE(bch_flash_dev);
struct bset_stats_op { @@ -955,7 +957,7 @@ static void bch_cache_set_internal_release(struct kobject *k) { }
-static struct attribute *bch_cache_set_files[] = { +static struct attribute *bch_cache_set_attrs[] = { &sysfs_unregister, &sysfs_stop, &sysfs_synchronous, @@ -980,9 +982,10 @@ static struct attribute *bch_cache_set_files[] = { &sysfs_clear_stats, NULL }; +ATTRIBUTE_GROUPS(bch_cache_set); KTYPE(bch_cache_set);
-static struct attribute *bch_cache_set_internal_files[] = { +static struct attribute *bch_cache_set_internal_attrs[] = { &sysfs_active_journal_entries,
sysfs_time_stats_attribute_list(btree_gc, sec, ms) @@ -1022,6 +1025,7 @@ static struct attribute *bch_cache_set_internal_files[] = { &sysfs_feature_incompat, NULL }; +ATTRIBUTE_GROUPS(bch_cache_set_internal); KTYPE(bch_cache_set_internal);
static int __bch_cache_cmp(const void *l, const void *r) @@ -1182,7 +1186,7 @@ STORE(__bch_cache) } STORE_LOCKED(bch_cache)
-static struct attribute *bch_cache_files[] = { +static struct attribute *bch_cache_attrs[] = { &sysfs_bucket_size, &sysfs_block_size, &sysfs_nbuckets, @@ -1196,4 +1200,5 @@ static struct attribute *bch_cache_files[] = { &sysfs_cache_replacement_policy, NULL }; +ATTRIBUTE_GROUPS(bch_cache); KTYPE(bch_cache); diff --git a/drivers/md/bcache/sysfs.h b/drivers/md/bcache/sysfs.h index c1752ba2e05b..a2ff6447b699 100644 --- a/drivers/md/bcache/sysfs.h +++ b/drivers/md/bcache/sysfs.h @@ -9,7 +9,7 @@ struct kobj_type type ## _ktype = { \ .show = type ## _show, \ .store = type ## _store \ }), \ - .default_attrs = type ## _files \ + .default_groups = type ## _groups \ }
#define SHOW(fn) \
From: Mingzhe Zou mingzhe.zou@easystack.cn
mainline inclusion from v5.18-rc1 commit 7b1002f7cfe581930f63787a0b3de0144e61ed55 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
When attaching a cached device (a.k.a backing device) to a cache device, bch_sectors_dirty_init() is called to count dirty sectors and stripes (see what bcache_dev_sectors_dirty_add() does) on the cache device.
When bcache_dev_sectors_dirty_add() is called, a set_bit(stripe, d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes) operation is always performed. In full_dirty_stripes, each 1 bit represents stripe_size (8192) sectors (512B each), so 1 bit = 4MB (8192 * 512), and each CPU cache line = 64B = 512 bits, covering 2048MB. When 20 threads process a cached disk with 100G of dirty data, a single thread processes about 23M at a time, and 20 threads together cover 460M. The full_dirty_stripes bits corresponding to this 460M of data are likely to fall in the same CPU cache line. When one of these threads performs a set_bit or clear_bit operation, the same CPU cache line in the other threads becomes invalid and must be read from main memory again. Compared with a single thread, the time of a bcache_dev_sectors_dirty_add() call is increased by about 50 times in our test (100G dirty data, 20 threads, bcache_dev_sectors_dirty_add() called more than 20 million times).
This patch does a test_bit() before the set_bit() or clear_bit() operation. Therefore, many forced set and clear operations are avoided, and most bcache_dev_sectors_dirty_add() calls will only read the CPU cache line.
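The idea in user-space form, as a minimal sketch (GCC __atomic builtins stand in for the kernel's atomic bitops, and mark_stripe()/test_bit_ul() are made-up names):

#include <stdint.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static int test_bit_ul(unsigned long nr, const unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* Mark one stripe fully dirty or not, but only write (and therefore
 * invalidate other CPUs' copy of) the shared cache line when the bit
 * actually changes */
static void mark_stripe(unsigned long nr, unsigned long *bitmap, int full)
{
	unsigned long *word = &bitmap[nr / BITS_PER_LONG];
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);

	if (full) {
		if (!test_bit_ul(nr, bitmap))
			__atomic_fetch_or(word, mask, __ATOMIC_RELAXED);
	} else {
		if (test_bit_ul(nr, bitmap))
			__atomic_fetch_and(word, ~mask, __ATOMIC_RELAXED);
	}
}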
Signed-off-by: Mingzhe Zou mingzhe.zou@easystack.cn Signed-off-by: Coly Li colyli@suse.de --- drivers/md/bcache/writeback.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 8120da278161..e93a09ccaddd 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -585,10 +585,13 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned int inode,
sectors_dirty = atomic_add_return(s, d->stripe_sectors_dirty + stripe); - if (sectors_dirty == d->stripe_size) - set_bit(stripe, d->full_dirty_stripes); - else - clear_bit(stripe, d->full_dirty_stripes); + if (sectors_dirty == d->stripe_size) { + if (!test_bit(stripe, d->full_dirty_stripes)) + set_bit(stripe, d->full_dirty_stripes); + } else { + if (test_bit(stripe, d->full_dirty_stripes)) + clear_bit(stripe, d->full_dirty_stripes); + }
nr_sectors -= s; stripe_offset = 0;
From: Mingzhe Zou mingzhe.zou@easystack.cn
mainline inclusion from v5.18-rc1 commit 887554ab96588de2917b6c8c73e552da082e5368 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
When multiple threads check btree nodes in parallel, the main thread waits for all threads to stop or for the CACHE_SET_IO_DISABLE flag to be set:
wait_event_interruptible(check_state->wait,
			 atomic_read(&check_state->started) == 0 ||
			 test_bit(CACHE_SET_IO_DISABLE, &c->flags));
However, bch_btree_node_read() and bch_btree_node_read_done() may call bch_cache_set_error(), which sets CACHE_SET_IO_DISABLE. If the flag is already set, the main thread returns an error while some worker threads may still be running; they can then dereference a NULL pointer and crash the kernel.
This patch changes the event wait condition: the main thread must wait for all threads to stop.
Fixes: 8e7102273f597 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Mingzhe Zou mingzhe.zou@easystack.cn Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Coly Li colyli@suse.de --- drivers/md/bcache/btree.c | 6 ++++-- drivers/md/bcache/writeback.c | 6 ++++-- 2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 183a58c89377..8eecc9df319b 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2060,9 +2060,11 @@ int bch_btree_check(struct cache_set *c) } }
+ /* + * Must wait for all threads to stop. + */ wait_event_interruptible(check_state->wait, - atomic_read(&check_state->started) == 0 || - test_bit(CACHE_SET_IO_DISABLE, &c->flags)); + atomic_read(&check_state->started) == 0);
for (i = 0; i < check_state->total_threads; i++) { if (check_state->infos[i].result) { diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index e93a09ccaddd..8bd9098185b6 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -1001,9 +1001,11 @@ void bch_sectors_dirty_init(struct bcache_device *d) } }
+ /* + * Must wait for all threads to stop. + */ wait_event_interruptible(state->wait, - atomic_read(&state->started) == 0 || - test_bit(CACHE_SET_IO_DISABLE, &c->flags)); + atomic_read(&state->started) == 0);
out: kfree(state);
mainline inclusion from v5.19-rc1 commit 622536443b6731ec82c563aae7807165adbe9178 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
Commit 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() much faster when checking all btree nodes during cache device registration. But it isn't in ideal shape yet and can still be improved.
This patch does the following things to improve the current parallel btree node checks by multiple threads in bch_btree_check():
- Add a read lock on the root node while checking all the btree nodes with multiple threads. Although it is not currently mandatory, it is good to have a read lock in the code logic.
- Remove the local variable 'char name[32]' and generate the kernel thread name string directly when calling kthread_run().
- Allocate the local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is enough indeed.
- Increase check_state->started to count a created kernel thread only after it is successfully created.
- When waiting for all checking kernel threads to finish, use wait_event() to replace wait_event_interruptible().
With this change, the code is more clear, and some potential error conditions are avoided.
Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/btree.c | 58 ++++++++++++++++++--------------------- drivers/md/bcache/btree.h | 2 +- 2 files changed, 27 insertions(+), 33 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 8eecc9df319b..7b6f8bfef927 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2006,8 +2006,7 @@ int bch_btree_check(struct cache_set *c) int i; struct bkey *k = NULL; struct btree_iter iter; - struct btree_check_state *check_state; - char name[32]; + struct btree_check_state check_state;
/* check and mark root node keys */ for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid) @@ -2018,63 +2017,58 @@ int bch_btree_check(struct cache_set *c) if (c->root->level == 0) return 0;
- check_state = kzalloc(sizeof(struct btree_check_state), GFP_KERNEL); - if (!check_state) - return -ENOMEM; - - check_state->c = c; - check_state->total_threads = bch_btree_chkthread_nr(); - check_state->key_idx = 0; - spin_lock_init(&check_state->idx_lock); - atomic_set(&check_state->started, 0); - atomic_set(&check_state->enough, 0); - init_waitqueue_head(&check_state->wait); + check_state.c = c; + check_state.total_threads = bch_btree_chkthread_nr(); + check_state.key_idx = 0; + spin_lock_init(&check_state.idx_lock); + atomic_set(&check_state.started, 0); + atomic_set(&check_state.enough, 0); + init_waitqueue_head(&check_state.wait);
+ rw_lock(0, c->root, c->root->level); /* * Run multiple threads to check btree nodes in parallel, - * if check_state->enough is non-zero, it means current + * if check_state.enough is non-zero, it means current * running check threads are enough, unncessary to create * more. */ - for (i = 0; i < check_state->total_threads; i++) { - /* fetch latest check_state->enough earlier */ + for (i = 0; i < check_state.total_threads; i++) { + /* fetch latest check_state.enough earlier */ smp_mb__before_atomic(); - if (atomic_read(&check_state->enough)) + if (atomic_read(&check_state.enough)) break;
- check_state->infos[i].result = 0; - check_state->infos[i].state = check_state; - snprintf(name, sizeof(name), "bch_btrchk[%u]", i); - atomic_inc(&check_state->started); + check_state.infos[i].result = 0; + check_state.infos[i].state = &check_state;
- check_state->infos[i].thread = + check_state.infos[i].thread = kthread_run(bch_btree_check_thread, - &check_state->infos[i], - name); - if (IS_ERR(check_state->infos[i].thread)) { + &check_state.infos[i], + "bch_btrchk[%d]", i); + if (IS_ERR(check_state.infos[i].thread)) { pr_err("fails to run thread bch_btrchk[%d]\n", i); for (--i; i >= 0; i--) - kthread_stop(check_state->infos[i].thread); + kthread_stop(check_state.infos[i].thread); ret = -ENOMEM; goto out; } + atomic_inc(&check_state.started); }
/* * Must wait for all threads to stop. */ - wait_event_interruptible(check_state->wait, - atomic_read(&check_state->started) == 0); + wait_event(check_state.wait, atomic_read(&check_state.started) == 0);
- for (i = 0; i < check_state->total_threads; i++) { - if (check_state->infos[i].result) { - ret = check_state->infos[i].result; + for (i = 0; i < check_state.total_threads; i++) { + if (check_state.infos[i].result) { + ret = check_state.infos[i].result; goto out; } }
out: - kfree(check_state); + rw_unlock(0, c->root); return ret; }
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h index 50482107134f..1b5fdbc0d83e 100644 --- a/drivers/md/bcache/btree.h +++ b/drivers/md/bcache/btree.h @@ -226,7 +226,7 @@ struct btree_check_info { int result; };
-#define BCH_BTR_CHKTHREAD_MAX 64 +#define BCH_BTR_CHKTHREAD_MAX 12 struct btree_check_state { struct cache_set *c; int total_threads;
mainline inclusion from v5.19-rc1 commit 4dc34ae1b45fe26e772a44379f936c72623dd407 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet and can still be improved.
This patch does the following changes to improve the current parallel dirty key iteration on the btree:
- Add a read lock to the root node when multiple threads iterate the btree, to prevent the root node from being split by I/Os from other registered bcache devices.
- Remove the local variable "char name[32]" and generate the kernel thread name string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on the stack and avoid the unnecessary dynamic memory allocation for it.
- Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12, which is enough indeed.
- Increase &state->started to count a created kernel thread only after it is successfully created.
- When waiting for all dirty key counting threads to finish, use wait_event() to replace wait_event_interruptible().
With the above changes, the code is more clear, and some potential error conditions are avoided.
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/writeback.c | 62 ++++++++++++++--------------------- drivers/md/bcache/writeback.h | 2 +- 2 files changed, 26 insertions(+), 38 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 8bd9098185b6..7ca91d1eb83b 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -948,10 +948,10 @@ void bch_sectors_dirty_init(struct bcache_device *d) struct btree_iter iter; struct sectors_dirty_init op; struct cache_set *c = d->c; - struct bch_dirty_init_state *state; - char name[32]; + struct bch_dirty_init_state state;
/* Just count root keys if no leaf node */ + rw_lock(0, c->root, c->root->level); if (c->root->level == 0) { bch_btree_op_init(&op.op, -1); op.inode = d->id; @@ -961,54 +961,42 @@ void bch_sectors_dirty_init(struct bcache_device *d) for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid) sectors_dirty_init_fn(&op.op, c->root, k); + rw_unlock(0, c->root); return; }
- state = kzalloc(sizeof(struct bch_dirty_init_state), GFP_KERNEL); - if (!state) { - pr_warn("sectors dirty init failed: cannot allocate memory\n"); - return; - } - - state->c = c; - state->d = d; - state->total_threads = bch_btre_dirty_init_thread_nr(); - state->key_idx = 0; - spin_lock_init(&state->idx_lock); - atomic_set(&state->started, 0); - atomic_set(&state->enough, 0); - init_waitqueue_head(&state->wait); - - for (i = 0; i < state->total_threads; i++) { - /* Fetch latest state->enough earlier */ + state.c = c; + state.d = d; + state.total_threads = bch_btre_dirty_init_thread_nr(); + state.key_idx = 0; + spin_lock_init(&state.idx_lock); + atomic_set(&state.started, 0); + atomic_set(&state.enough, 0); + init_waitqueue_head(&state.wait); + + for (i = 0; i < state.total_threads; i++) { + /* Fetch latest state.enough earlier */ smp_mb__before_atomic(); - if (atomic_read(&state->enough)) + if (atomic_read(&state.enough)) break;
- state->infos[i].state = state; - atomic_inc(&state->started); - snprintf(name, sizeof(name), "bch_dirty_init[%d]", i); - - state->infos[i].thread = - kthread_run(bch_dirty_init_thread, - &state->infos[i], - name); - if (IS_ERR(state->infos[i].thread)) { + state.infos[i].state = &state; + state.infos[i].thread = + kthread_run(bch_dirty_init_thread, &state.infos[i], + "bch_dirtcnt[%d]", i); + if (IS_ERR(state.infos[i].thread)) { pr_err("fails to run thread bch_dirty_init[%d]\n", i); for (--i; i >= 0; i--) - kthread_stop(state->infos[i].thread); + kthread_stop(state.infos[i].thread); goto out; } + atomic_inc(&state.started); }
- /* - * Must wait for all threads to stop. - */ - wait_event_interruptible(state->wait, - atomic_read(&state->started) == 0); - out: - kfree(state); + /* Must wait for all threads to stop. */ + wait_event(state.wait, atomic_read(&state.started) == 0); + rw_unlock(0, c->root); }
void bch_cached_dev_writeback_init(struct cached_dev *dc) diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h index 02b2f9df73f6..31df716951f6 100644 --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -20,7 +20,7 @@ #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57 #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
-#define BCH_DIRTY_INIT_THRD_MAX 64 +#define BCH_DIRTY_INIT_THRD_MAX 12 /* * 14 (16384ths) is chosen here as something that each backing device * should be a reasonable fraction of the share, and not to blow up
mainline inclusion from v5.19-rc1 commit 80db4e4707e78cb22287da7d058d7274bd4cb370 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
After making bch_sectors_dirty_init() multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release the btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on the btree root node to prevent the btree from being split during the dirty sector counting, other I/O requesters have no chance to gain the write lock, even after restarting bcache_btree().
That is to say, the incremental dirty sector counting is incompatible with the multithreaded bch_sectors_dirty_init(). We have to choose one and drop the other.
In my testing, with 512 byte random writes, I generated 1.2T of dirty data and a btree with 400K nodes. With a single thread and incremental dirty sector counting, it takes 30+ minutes to register the backing device. With multithreaded dirty sector counting, the backing device registration can be accomplished within 2 minutes.
The 30+ minutes vs. 2 minutes difference made me decide to keep the multithreaded bch_sectors_dirty_init() and drop the incremental dirty sector counting. This is what this patch does.
But INIT_KEYS_EACH_TIME is kept: in sectors_dirty_init_fn() the CPU will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This is to avoid the watchdog reporting a bogus soft lockup warning.
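The kept pattern in user-space form, as a small sketch (sched_yield() stands in for the kernel's cond_resched(), and scan_keys() is a made-up name):

#include <sched.h>
#include <stddef.h>

#define INIT_KEYS_EACH_TIME 500000

/* Walk a long key range but give up the CPU periodically, so the scan
 * cannot monopolize one CPU and trip the soft-lockup watchdog */
static void scan_keys(size_t nr_keys)
{
	size_t count = 0;

	for (size_t i = 0; i < nr_keys; i++) {
		/* ... per-key dirty sector accounting would go here ... */
		count++;
		if (!(count % INIT_KEYS_EACH_TIME))
			sched_yield();
	}
}

int main(void)
{
	scan_keys(1500000);	/* yields after every 500000 keys */
	return 0;
}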
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/writeback.c | 41 +++++++++++------------------------ 1 file changed, 13 insertions(+), 28 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 7ca91d1eb83b..2b6cb308f73a 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -805,13 +805,11 @@ static int bch_writeback_thread(void *arg)
/* Init */ #define INIT_KEYS_EACH_TIME 500000 -#define INIT_KEYS_SLEEP_MS 100
struct sectors_dirty_init { struct btree_op op; unsigned int inode; size_t count; - struct bkey start; };
static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b, @@ -827,11 +825,8 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b, KEY_START(k), KEY_SIZE(k));
op->count++; - if (atomic_read(&b->c->search_inflight) && - !(op->count % INIT_KEYS_EACH_TIME)) { - bkey_copy_key(&op->start, k); - return -EAGAIN; - } + if (!(op->count % INIT_KEYS_EACH_TIME)) + cond_resched();
return MAP_CONTINUE; } @@ -846,24 +841,16 @@ static int bch_root_node_dirty_init(struct cache_set *c, bch_btree_op_init(&op.op, -1); op.inode = d->id; op.count = 0; - op.start = KEY(op.inode, 0, 0); - - do { - ret = bcache_btree(map_keys_recurse, - k, - c->root, - &op.op, - &op.start, - sectors_dirty_init_fn, - 0); - if (ret == -EAGAIN) - schedule_timeout_interruptible( - msecs_to_jiffies(INIT_KEYS_SLEEP_MS)); - else if (ret < 0) { - pr_warn("sectors dirty init failed, ret=%d!\n", ret); - break; - } - } while (ret == -EAGAIN); + + ret = bcache_btree(map_keys_recurse, + k, + c->root, + &op.op, + &KEY(op.inode, 0, 0), + sectors_dirty_init_fn, + 0); + if (ret < 0) + pr_warn("sectors dirty init failed, ret=%d!\n", ret);
return ret; } @@ -907,7 +894,6 @@ static int bch_dirty_init_thread(void *arg) goto out; } skip_nr--; - cond_resched(); }
if (p) { @@ -917,7 +903,6 @@ static int bch_dirty_init_thread(void *arg)
p = NULL; prev_idx = cur_idx; - cond_resched(); }
out: @@ -956,11 +941,11 @@ void bch_sectors_dirty_init(struct bcache_device *d) bch_btree_op_init(&op.op, -1); op.inode = d->id; op.count = 0; - op.start = KEY(op.inode, 0, 0);
for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid) sectors_dirty_init_fn(&op.op, c->root, k); + rw_unlock(0, c->root); return; }
mainline inclusion from v5.19-rc1 commit 32feee36c30ea06e38ccb8ae6e5c44c6eec790a6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
The journal no-space deadlock was reported from time to time. Such a deadlock can happen in the following situation.
When all journal buckets are fully filled by active jsets under heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and insert them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results in a btree node split, a new journal request might be triggered. For example, if the btree grows one more level after the node split, the root node record in the cache device super block will be updated by bch_journal_meta() from bch_btree_set_root(). But when there is no space in the journal buckets, the journal replay has to wait for a new journal bucket to be reclaimed after at least one journal bucket is replayed. This is one example of how the journal no-space deadlock happens.
The solution to avoid the deadlock is to reserve 1 journal bucket at run time, and only permit the reserved journal bucket to be used during the cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, and there is no chance for the journal no-space deadlock to happen anymore.
This patch adds a new member "bool do_reserve" in struct journal. It is initialized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization is done in run_cache_set(). At run time, when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up, and try next time to see whether there is a free journal bucket to allocate. By this method, there is always 1 journal bucket reserved at run time.
During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock.
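A worked example of the free_journal_buckets() arithmetic from the diff below, as a stand-alone sketch (the bucket counts are made-up example values):

#include <stdio.h>

static unsigned int free_journal_buckets(unsigned int njournal_buckets,
					 unsigned int cur_idx,
					 unsigned int discard_idx,
					 unsigned int do_reserve)
{
	unsigned int n;

	/* In case njournal_buckets is not a power of 2 */
	if (cur_idx >= discard_idx)
		n = njournal_buckets + discard_idx - cur_idx;
	else
		n = discard_idx - cur_idx;

	if (n > (1 + do_reserve))
		return n - (1 + do_reserve);

	return 0;
}

int main(void)
{
	/* 8 journal buckets, writer at index 5, discard at index 3:
	 * the window holds 6 buckets. At run time (do_reserve = 1) one
	 * bucket stays reserved, so only 4 are reported free. */
	printf("%u\n", free_journal_buckets(8, 5, 3, 1));	/* prints 4 */

	/* During registration (do_reserve = 0) the reserved bucket may
	 * be used too, so 5 are reported free. */
	printf("%u\n", free_journal_buckets(8, 5, 3, 0));	/* prints 5 */
	return 0;
}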
Reported-by: Nikhil Kshirsagar nkshirsagar@gmail.com Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++----- drivers/md/bcache/journal.h | 2 ++ drivers/md/bcache/super.c | 1 + 3 files changed, 29 insertions(+), 5 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 61bd79babf7a..346a92c43858 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -407,6 +407,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list) return ret; }
+void bch_journal_space_reserve(struct journal *j) +{ + j->do_reserve = true; +} + /* Journalling */
static void btree_flush_write(struct cache_set *c) @@ -625,12 +630,30 @@ static void do_journal_discard(struct cache *ca) } }
+static unsigned int free_journal_buckets(struct cache_set *c) +{ + struct journal *j = &c->journal; + struct cache *ca = c->cache; + struct journal_device *ja = &c->cache->journal; + unsigned int n; + + /* In case njournal_buckets is not power of 2 */ + if (ja->cur_idx >= ja->discard_idx) + n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx; + else + n = ja->discard_idx - ja->cur_idx; + + if (n > (1 + j->do_reserve)) + return n - (1 + j->do_reserve); + + return 0; +} + static void journal_reclaim(struct cache_set *c) { struct bkey *k = &c->journal.key; struct cache *ca = c->cache; uint64_t last_seq; - unsigned int next; struct journal_device *ja = &ca->journal; atomic_t p __maybe_unused;
@@ -653,12 +676,10 @@ static void journal_reclaim(struct cache_set *c) if (c->journal.blocks_free) goto out;
- next = (ja->cur_idx + 1) % ca->sb.njournal_buckets; - /* No space available on this device */ - if (next == ja->discard_idx) + if (!free_journal_buckets(c)) goto out;
- ja->cur_idx = next; + ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets; k->ptr[0] = MAKE_PTR(0, bucket_to_sector(c, ca->sb.d[ja->cur_idx]), ca->sb.nr_this_dev); diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h index f2ea34d5f431..cd316b4a1e95 100644 --- a/drivers/md/bcache/journal.h +++ b/drivers/md/bcache/journal.h @@ -105,6 +105,7 @@ struct journal { spinlock_t lock; spinlock_t flush_write_lock; bool btree_flushing; + bool do_reserve; /* used when waiting because the journal was full */ struct closure_waitlist wait; struct closure io; @@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);
void bch_journal_free(struct cache_set *c); int bch_journal_alloc(struct cache_set *c); +void bch_journal_space_reserve(struct journal *j);
#endif /* _BCACHE_JOURNAL_H */ diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a98b46d3cd0b..b5601f200c09 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2141,6 +2141,7 @@ static int run_cache_set(struct cache_set *c)
flash_devs_run(c);
+ bch_journal_space_reserve(&c->journal); set_bit(CACHE_SET_RUNNING, &c->flags); return 0; err:
mainline inclusion from v5.19-rc1 commit 7d6b902ea0e02b2a25c480edf471cbaa4ebe6b3c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
The local variables check_state (in bch_btree_check()) and state (in bch_sectors_dirty_init()) should be fully filled with 0. Before they were moved onto the stack, they were dynamically allocated by kzalloc() and therefore implicitly zeroed.
Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/btree.c | 1 + drivers/md/bcache/writeback.c | 1 + 2 files changed, 2 insertions(+)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 7b6f8bfef927..98daa9d200f7 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2017,6 +2017,7 @@ int bch_btree_check(struct cache_set *c) if (c->root->level == 0) return 0;
+ memset(&check_state, 0, sizeof(struct btree_check_state)); check_state.c = c; check_state.total_threads = bch_btree_chkthread_nr(); check_state.key_idx = 0; diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 2b6cb308f73a..2bf831fe4ca4 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -950,6 +950,7 @@ void bch_sectors_dirty_init(struct bcache_device *d) return; }
+ memset(&state, 0, sizeof(struct bch_dirty_init_state)); state.c = c; state.d = d; state.total_threads = bch_btre_dirty_init_thread_nr();
From: Jia-Ju Bai baijiaju1990@gmail.com
mainline inclusion from v5.19-rc1 commit 40f567bbb3b0639d2ec7d1c6ad4b1b018f80cf19 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked.
Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot oslab@tsinghua.edu.cn Signed-off-by: Jia-Ju Bai baijiaju1990@gmail.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/request.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 19014a87a7db..c1a1bd7aa9ec 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1104,6 +1104,12 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) * which would call closure_get(&dc->disk.cl) */ ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO); + if (!ddip) { + bio->bi_status = BLK_STS_RESOURCE; + bio->bi_end_io(bio); + return; + } + ddip->d = d; /* Count on the bcache device */ ddip->start_time = part_start_io_acct(d->disk, &ddip->part, bio);
mainline inclusion from v5.19-rc1 commit a1a2d8f0162b27e85e7ce0ae6a35c96a490e0559 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
The kworker routine update_writeback_rate() is scheduled to update the writeback rate every 5 seconds by default. Before calling __update_writeback_rate() to do the real job, the semaphore dc->writeback_lock should be held by the kworker routine.
At the same time, the bcache writeback thread routine bch_writeback_thread() also needs to hold dc->writeback_lock before flushing dirty data back to the backing device. If the dirty data set is large, it might take a very long time for bch_writeback_thread() to scan all the dirty buckets and release dc->writeback_lock. In such a case update_writeback_rate() can be starved for long enough that the kernel reports a soft lockup warning started like:
watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713]
Such a soft lockup condition is unnecessary, because after the writeback thread finishes its job and releases dc->writeback_lock, the kworker update_writeback_rate() may continue to work and everything is indeed fine.
This patch avoids the unnecessary soft lockup by the following method,
- Add a new member to struct cached_dev: dc->rate_update_retry (0 by default).
- In update_writeback_rate(), call down_read_trylock(&dc->writeback_lock) first; if it fails then lock contention happens.
- If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), don't acquire the lock and reschedule the kworker for the next try.
- If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, don't retry anymore and call down_read(&dc->writeback_lock) to wait for the lock.
By the above method, in the worst case update_writeback_rate() may retry for 1+ minutes (15 skipped attempts at the default 5-second interval is 75 seconds) before blocking on dc->writeback_lock by calling down_read(). For a 4TB cache device with 1TB of dirty data, 90%+ of the unnecessary soft lockup warning messages can be avoided.
When update_writeback_rate() is retrying to acquire dc->writeback_lock, of course the writeback rate cannot be updated. This is fair, because when the kworker is blocked on the lock contention of dc->writeback_lock, the writeback rate cannot be updated either.
This change follows Jens Axboe's suggestion for a clearer and simpler version.
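The retry pattern in user-space form, as a minimal sketch (a pthread rwlock stands in for the kernel rwsem, and wb_state/try_update_rate() are made-up names):

#include <pthread.h>
#include <stdbool.h>

#define RATE_UPDATE_MAX_SKIPS 15	/* mirrors BCH_WBRATE_UPDATE_MAX_SKIPS */

struct wb_state {
	pthread_rwlock_t writeback_lock;
	unsigned int rate_update_retry;
};

/* Returns true if the rate was updated, false if the caller should
 * reschedule this work and try again later */
static bool try_update_rate(struct wb_state *s)
{
	if (pthread_rwlock_tryrdlock(&s->writeback_lock) != 0) {
		if (++s->rate_update_retry <= RATE_UPDATE_MAX_SKIPS)
			return false;	/* skip this round, retry later */
		/* Too many skips: block until the writer lets go */
		pthread_rwlock_rdlock(&s->writeback_lock);
		s->rate_update_retry = 0;
	}
	/* ... the __update_writeback_rate() equivalent would run here ... */
	pthread_rwlock_unlock(&s->writeback_lock);
	return true;
}

int main(void)
{
	struct wb_state s = { .rate_update_retry = 0 };

	pthread_rwlock_init(&s.writeback_lock, NULL);
	try_update_rate(&s);	/* uncontended: takes the read lock at once */
	pthread_rwlock_destroy(&s.writeback_lock);
	return 0;
}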
Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk --- drivers/md/bcache/bcache.h | 7 +++++++ drivers/md/bcache/writeback.c | 31 +++++++++++++++++++++---------- 2 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 2a011469af02..0563a40812fa 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -396,6 +396,13 @@ struct cached_dev { unsigned int error_limit; unsigned int offline_seconds;
+ /* + * Retry to update writeback_rate if contention happens for + * down_read(dc->writeback_lock) in update_writeback_rate() + */ +#define BCH_WBRATE_UPDATE_MAX_SKIPS 15 + unsigned int rate_update_retry; + char backing_dev_name[BDEVNAME_SIZE]; };
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 2bf831fe4ca4..3e879e985373 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -235,19 +235,27 @@ static void update_writeback_rate(struct work_struct *work) return; }
- if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { - /* - * If the whole cache set is idle, set_at_max_writeback_rate() - * will set writeback rate to a max number. Then it is - * unncessary to update writeback rate for an idle cache set - * in maximum writeback rate number(s). - */ - if (!set_at_max_writeback_rate(c, dc)) { - down_read(&dc->writeback_lock); + /* + * If the whole cache set is idle, set_at_max_writeback_rate() + * will set writeback rate to a max number. Then it is + * unncessary to update writeback rate for an idle cache set + * in maximum writeback rate number(s). + */ + if (atomic_read(&dc->has_dirty) && dc->writeback_percent && + !set_at_max_writeback_rate(c, dc)) { + do { + if (!down_read_trylock((&dc->writeback_lock))) { + dc->rate_update_retry++; + if (dc->rate_update_retry <= + BCH_WBRATE_UPDATE_MAX_SKIPS) + break; + down_read(&dc->writeback_lock); + dc->rate_update_retry = 0; + } __update_writeback_rate(dc); update_gc_after_writeback(c); up_read(&dc->writeback_lock); - } + } while (0); }
@@ -1006,6 +1014,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc) dc->writeback_rate_fp_term_high = 1000; dc->writeback_rate_i_term_inverse = 10000;
+ /* For dc->writeback_lock contention in update_writeback_rate() */ + dc->rate_update_retry = 0; + WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)); INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate); }