From: Dongsheng Yang dongsheng.yang@easystack.cn
mainline inclusion from v5.11-rc1 commit df4ad53242158f9f1f97daf4feddbb4f8b77f080 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------------------
There is a race condition in detaching as below:
A. detaching                          B. Write request
(1) writing back
(2) write back done, set bdev
    state to clean.
(3) cached_dev_put() and
    schedule_work(&dc->detach);
                                      (4) write data [0 - 4K] directly
                                          into backing and ack to user.
(5) power-failure...
When we restart this bcache device, the bdev is clean but not detached, and when we read [0 - 4K], we will get unexpected old data from the cache device.
To fix this problem, set the bdev state to none when writeback is done in detaching. Then if a power failure happens as above, the data in the cache will not be used the next time the bcache device starts, since it is detached, and we will read the correct data from the backing device directly.
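For reference, the detaching branch of bch_writeback_thread() after this change looks roughly as below (a condensed sketch reconstructed from the diff that follows, not the verbatim patched source):

	if (test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)) {
		struct closure cl;

		closure_init_stack(&cl);
		/*
		 * Write the super block with a cleared set_uuid and state
		 * BDEV_STATE_NONE here, while writeback is finishing, instead
		 * of in the later detach work: if a power failure happens
		 * after direct writes to the backing device, the stale cache
		 * data will not be used on the next start, because the bdev
		 * is already detached.
		 */
		memset(&dc->sb.set_uuid, 0, 16);
		SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE);

		bch_write_bdev_super(dc, &cl);
		closure_sync(&cl);

		up_write(&dc->writeback_lock);
		break;
	}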
Signed-off-by: Dongsheng Yang dongsheng.yang@easystack.cn Signed-off-by: Coly Li colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 9 --------- drivers/md/bcache/writeback.c | 9 +++++++++ 2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 81f1cc5b3499..b7d9d1b79ac2 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1151,9 +1151,6 @@ static void cancel_writeback_rate_update_dwork(struct cached_dev *dc) static void cached_dev_detach_finish(struct work_struct *w) { struct cached_dev *dc = container_of(w, struct cached_dev, detach); - struct closure cl; - - closure_init_stack(&cl);
BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)); BUG_ON(refcount_read(&dc->count)); @@ -1167,12 +1164,6 @@ static void cached_dev_detach_finish(struct work_struct *w) dc->writeback_thread = NULL; }
- memset(&dc->sb.set_uuid, 0, 16); - SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE); - - bch_write_bdev_super(dc, &cl); - closure_sync(&cl); - mutex_lock(&bch_register_lock);
calc_cached_dev_sectors(dc->disk.c); diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 3c74996978da..a129e4d2707c 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -705,6 +705,15 @@ static int bch_writeback_thread(void *arg) * bch_cached_dev_detach(). */ if (test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)) { + struct closure cl; + + closure_init_stack(&cl); + memset(&dc->sb.set_uuid, 0, 16); + SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE); + + bch_write_bdev_super(dc, &cl); + closure_sync(&cl); + up_write(&dc->writeback_lock); break; }
From: Yi Li yili@winhong.com
mainline inclusion from v5.11-rc1 commit 117ae250cfa3718f21bd07df0650dfbe3bc3a823 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
------------------------------------------------
The bdev is not reassigned after the IS_ERR() check, so the second !IS_ERR(bdev) check is superfluous.
After commit 4e7b5671c6a8 ("block: remove i_bdev"), "Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code."
So after the lookup_bdev() call, there is no need to call bdput().
Remove the superfluous bdev check and do not call bdput() after lookup_bdev().
Fixes: 4e7b5671c6a8("block: remove i_bdev") Signed-off-by: Yi Li yili@winhong.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Coly Li colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index b7d9d1b79ac2..1572190c32ec 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2588,8 +2588,6 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, else err = "device busy"; mutex_unlock(&bch_register_lock); - if (!IS_ERR(bdev)) - bdput(bdev); if (attr == &ksysfs_register_quiet) goto done; }
From: Zheng Yongjun zhengyongjun3@huawei.com
mainline inclusion from v5.11-rc1 commit 46926127d76359b46659c556df7b4aa1b6325d90 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------------
Replace a comma between expression statements by a semicolon.
Signed-off-by: Zheng Yongjun zhengyongjun3@huawei.com Signed-off-by: Coly Li colyli@sue.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 554e3afc9b68..00a520c03f41 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -404,7 +404,7 @@ STORE(__cached_dev) if (!env) return -ENOMEM; add_uevent_var(env, "DRIVER=bcache"); - add_uevent_var(env, "CACHED_UUID=%pU", dc->sb.uuid), + add_uevent_var(env, "CACHED_UUID=%pU", dc->sb.uuid); add_uevent_var(env, "CACHED_LABEL=%s", buf); kobject_uevent_env(&disk_to_dev(dc->disk.disk)->kobj, KOBJ_CHANGE,
From: Yi Li yili@winhong.com
mainline inclusion from v5.11-rc3 commit e80927079fd97b4d5457e3af2400a0087b561564 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
------------------------------------------------
There is no need to reassign pdev_set_uuid in every iteration of the second loop, so move it to just before the second loop.
Signed-off-by: Yi Li yili@winhong.com Signed-off-by: Coly Li colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 1572190c32ec..456e41a81c06 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2697,8 +2697,8 @@ static ssize_t bch_pending_bdevs_cleanup(struct kobject *k, }
list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) { + char *pdev_set_uuid = pdev->dc->sb.set_uuid; list_for_each_entry_safe(c, tc, &bch_cache_sets, list) { - char *pdev_set_uuid = pdev->dc->sb.set_uuid; char *set_uuid = c->set_uuid;
if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
From: Ming Lei ming.lei@redhat.com
mainline inclusion from v5.12-rc1 commit faa8e2c4fb30f336a289e3cbaa1e9a9dfd92ac8c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-------------------------------------------
This bioset is only used for allocating bios from bio_next_split(), and it doesn't need bvecs, so remove the flag.
Cc: linux-bcache@vger.kernel.org Cc: Coly Li colyli@suse.de Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Ming Lei ming.lei@redhat.com Acked-by: Coly Li colyli@suse.de Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 456e41a81c06..7195b289780a 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1947,7 +1947,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) goto err;
if (bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio), - BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER)) + BIOSET_NEED_RESCUER)) goto err;
c->uuids = alloc_meta_bucket_pages(GFP_KERNEL, sb);
From: dongdong tao dongdong.tao@canonical.com
mainline inclusion from v5.12-rc1 commit 71dda2a5625f31bc3410cb69c3d31376a2b66f28 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------------------
The current way to calculate the writeback rate only considers the dirty sectors. This usually works fine when fragmentation is not high, but it gives an unreasonably small rate when very few dirty sectors consume a lot of dirty buckets. In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even reached writeback_percent; the writeback rate will still be the minimum value (4k), which causes all writes to be stuck in a non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness: the first stage starts when the dirty bucket percentage goes above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first stage tries to write back the amount of dirty data in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second, the second stage in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, and the third stage in (1 / (dirty_buckets_percent - 64)) millisecond.
The initial rate at each stage can be controlled by 3 configurable parameters, writeback_rate_fp_term_{low|mid|high}, which default to 1, 10 and 1000 respectively. The IO throughput hint that these values try to achieve is described in the paragraph above; the reason I chose these defaults is based on testing and production data. Some details follow, with a small numeric sketch after item C:
A. When it comes to the low stage, we are still a bit far from the 70 threshold, so we only want to give it a little push by setting the term to 1. This means the initial rate will be 170 if the fragmentation is 6 (it is calculated as bucket_size/fragmentation); this rate is very small, but still much more reasonable than the minimum of 8. For a production bcache with an unheavy workload, if the cache device is bigger than 1 TB, it may take hours to consume 1% of the buckets, so it is very possible to reclaim enough dirty buckets in this stage and thus avoid entering the next stage.
B. If the dirty bucket ratio didn't turn around during the first stage, we come to the mid stage, and it is necessary for the mid stage to be more aggressive than the low stage, so I chose the initial rate to be 10 times that of the low stage, which means an initial rate of 1700 if the fragmentation is 6. This is a normal rate we usually see for a normal workload when writeback happens because of writeback_percent.
C. If the dirty bucket ratio didn't turn around during the low and mid stages, we come to the third stage, the last chance to turn around and avoid the horrible cutoff writeback sync issue. Here we choose to be 100 times more aggressive than the mid stage, which means an initial rate of 170000 if the fragmentation is 6. This is also inferred from a production bcache: I've got one week's writeback rate data from a production bcache with quite heavy workloads where, again, the writeback is triggered by writeback_percent, and the highest rate area is around 100000 to 240000, so I believe this kind of aggressiveness at this stage is reasonable for production. It should also be mostly enough, because the hint tries to reclaim 1000 buckets per second, while that heavy production environment consumes 50 buckets per second on average in one week's data.
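To make the three stages concrete, here is a small standalone userspace sketch (not part of the patch) that reproduces the example numbers above with the default fp_term values, assuming a 1024-sector bucket and a fragmentation of 6:

	#include <stdio.h>
	#include <stdint.h>

	/* Candidate proportional term (sectors/s) from the fragmentation logic. */
	static int64_t fragment_rate(int64_t in_use, int64_t dirty_per_bucket)
	{
		int64_t fp_term;

		if (in_use <= 50)
			return 0;			/* feature not engaged yet */
		else if (in_use <= 57)
			fp_term = 1 * (in_use - 50);	/* writeback_rate_fp_term_low */
		else if (in_use <= 64)
			fp_term = 10 * (in_use - 57);	/* writeback_rate_fp_term_mid */
		else
			fp_term = 1000 * (in_use - 64);	/* writeback_rate_fp_term_high */

		return dirty_per_bucket * fp_term;
	}

	int main(void)
	{
		/* bucket_size / fragmentation: 1024 / 6 ~= 170 dirty sectors per bucket */
		int64_t dirty_per_bucket = 1024 / 6;

		printf("51%%: %lld\n", (long long)fragment_rate(51, dirty_per_bucket)); /* ~170 */
		printf("58%%: %lld\n", (long long)fragment_rate(58, dirty_per_bucket)); /* ~1700 */
		printf("65%%: %lld\n", (long long)fragment_rate(65, dirty_per_bucket)); /* ~170000 */
		return 0;
	}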
The option writeback_consider_fragment controls whether this feature is on or off; it is on by default.
Lastly, below is the performance data for all the testing results, including the data from the production environment: https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxsc...
Signed-off-by: dongdong tao dongdong.tao@canonical.com Signed-off-by: Coly Li colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/bcache.h | 4 ++++ drivers/md/bcache/sysfs.c | 23 +++++++++++++++++++ drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++ drivers/md/bcache/writeback.h | 4 ++++ 4 files changed, 73 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index e8bf4f752e8b..848dd4db1659 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -373,6 +373,7 @@ struct cached_dev { unsigned int partial_stripes_expensive:1; unsigned int writeback_metadata:1; unsigned int writeback_running:1; + unsigned int writeback_consider_fragment:1; unsigned char writeback_percent; unsigned int writeback_delay;
@@ -385,6 +386,9 @@ struct cached_dev { unsigned int writeback_rate_update_seconds; unsigned int writeback_rate_i_term_inverse; unsigned int writeback_rate_p_term_inverse; + unsigned int writeback_rate_fp_term_low; + unsigned int writeback_rate_fp_term_mid; + unsigned int writeback_rate_fp_term_high; unsigned int writeback_rate_minimum;
enum stop_on_failure stop_when_cache_set_failed; diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 00a520c03f41..eef15f8022ba 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -117,10 +117,14 @@ rw_attribute(writeback_running); rw_attribute(writeback_percent); rw_attribute(writeback_delay); rw_attribute(writeback_rate); +rw_attribute(writeback_consider_fragment);
rw_attribute(writeback_rate_update_seconds); rw_attribute(writeback_rate_i_term_inverse); rw_attribute(writeback_rate_p_term_inverse); +rw_attribute(writeback_rate_fp_term_low); +rw_attribute(writeback_rate_fp_term_mid); +rw_attribute(writeback_rate_fp_term_high); rw_attribute(writeback_rate_minimum); read_attribute(writeback_rate_debug);
@@ -195,6 +199,7 @@ SHOW(__bch_cached_dev) var_printf(bypass_torture_test, "%i"); var_printf(writeback_metadata, "%i"); var_printf(writeback_running, "%i"); + var_printf(writeback_consider_fragment, "%i"); var_print(writeback_delay); var_print(writeback_percent); sysfs_hprint(writeback_rate, @@ -205,6 +210,9 @@ SHOW(__bch_cached_dev) var_print(writeback_rate_update_seconds); var_print(writeback_rate_i_term_inverse); var_print(writeback_rate_p_term_inverse); + var_print(writeback_rate_fp_term_low); + var_print(writeback_rate_fp_term_mid); + var_print(writeback_rate_fp_term_high); var_print(writeback_rate_minimum);
if (attr == &sysfs_writeback_rate_debug) { @@ -303,6 +311,7 @@ STORE(__cached_dev) sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test); sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata); sysfs_strtoul_bool(writeback_running, dc->writeback_running); + sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment); sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, @@ -331,6 +340,16 @@ STORE(__cached_dev) sysfs_strtoul_clamp(writeback_rate_p_term_inverse, dc->writeback_rate_p_term_inverse, 1, UINT_MAX); + sysfs_strtoul_clamp(writeback_rate_fp_term_low, + dc->writeback_rate_fp_term_low, + 1, dc->writeback_rate_fp_term_mid - 1); + sysfs_strtoul_clamp(writeback_rate_fp_term_mid, + dc->writeback_rate_fp_term_mid, + dc->writeback_rate_fp_term_low + 1, + dc->writeback_rate_fp_term_high - 1); + sysfs_strtoul_clamp(writeback_rate_fp_term_high, + dc->writeback_rate_fp_term_high, + dc->writeback_rate_fp_term_mid + 1, UINT_MAX); sysfs_strtoul_clamp(writeback_rate_minimum, dc->writeback_rate_minimum, 1, UINT_MAX); @@ -499,9 +518,13 @@ static struct attribute *bch_cached_dev_files[] = { &sysfs_writeback_delay, &sysfs_writeback_percent, &sysfs_writeback_rate, + &sysfs_writeback_consider_fragment, &sysfs_writeback_rate_update_seconds, &sysfs_writeback_rate_i_term_inverse, &sysfs_writeback_rate_p_term_inverse, + &sysfs_writeback_rate_fp_term_low, + &sysfs_writeback_rate_fp_term_mid, + &sysfs_writeback_rate_fp_term_high, &sysfs_writeback_rate_minimum, &sysfs_writeback_rate_debug, &sysfs_io_errors, diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index a129e4d2707c..82d4e0880a99 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc) int64_t integral_scaled; uint32_t new_rate;
+ /* + * We need to consider the number of dirty buckets as well + * when calculating the proportional_scaled, Otherwise we might + * have an unreasonable small writeback rate at a highly fragmented situation + * when very few dirty sectors consumed a lot dirty buckets, the + * worst case is when dirty buckets reached cutoff_writeback_sync and + * dirty data is still not even reached to writeback percent, so the rate + * still will be at the minimum value, which will cause the write + * stuck at a non-writeback mode. + */ + struct cache_set *c = dc->disk.c; + + int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets; + + if (dc->writeback_consider_fragment && + c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) { + int64_t fragment = + div_s64((dirty_buckets * c->cache->sb.bucket_size), dirty); + int64_t fp_term; + int64_t fps; + + if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) { + fp_term = dc->writeback_rate_fp_term_low * + (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW); + } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) { + fp_term = dc->writeback_rate_fp_term_mid * + (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID); + } else { + fp_term = dc->writeback_rate_fp_term_high * + (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH); + } + fps = div_s64(dirty, dirty_buckets) * fp_term; + if (fragment > 3 && fps > proportional_scaled) { + /* Only overrite the p when fragment > 3 */ + proportional_scaled = fps; + } + } + if ((error < 0 && dc->writeback_rate_integral > 0) || (error > 0 && time_before64(local_clock(), dc->writeback_rate.next + NSEC_PER_MSEC))) { @@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_metadata = true; dc->writeback_running = false; + dc->writeback_consider_fragment = true; dc->writeback_percent = 10; dc->writeback_delay = 30; atomic_long_set(&dc->writeback_rate.rate, 1024); @@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT; dc->writeback_rate_p_term_inverse = 40; + dc->writeback_rate_fp_term_low = 1; + dc->writeback_rate_fp_term_mid = 10; + dc->writeback_rate_fp_term_high = 1000; dc->writeback_rate_i_term_inverse = 10000;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)); diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h index 3f1230e22de0..02b2f9df73f6 100644 --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -16,6 +16,10 @@
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50 +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57 +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64 + #define BCH_DIRTY_INIT_THRD_MAX 64 /* * 14 (16384ths) is chosen here as something that each backing device
From: Kai Krakow kai@kaishome.de
mainline inclusion from 5.12-rc1 commit d7fae7b4fa152795ab70c680d3a63c7843c9368c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------------------------
Should be `register_device_async`.
Cc: Coly Li colyli@suse.de Signed-off-by: Kai Krakow kai@kaishome.de Signed-off-by: Coly Li colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 7195b289780a..3e2dc136bc37 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2527,7 +2527,7 @@ static void register_cache_worker(struct work_struct *work) module_put(THIS_MODULE); }
-static void register_device_aync(struct async_reg_args *args) +static void register_device_async(struct async_reg_args *args) { if (SB_IS_BDEV(args->sb)) INIT_DELAYED_WORK(&args->reg_work, register_bdev_worker); @@ -2619,7 +2619,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, args->sb = sb; args->sb_disk = sb_disk; args->bdev = bdev; - register_device_aync(args); + register_device_async(args); /* No wait and returns to user space */ goto async_done; }
From: Joe Perches joe@perches.com
mainline inclusion from v5.12-rc1 commit 6751c1e3cff3aa763c760c08862627069a37b50e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------------------
Use semicolons and braces.
Signed-off-by: Joe Perches joe@perches.com Signed-off-by: Coly Li colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/bset.c | 12 ++++++++---- drivers/md/bcache/sysfs.c | 6 ++++-- 2 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index 67a2c47f4201..94d38e8a59b3 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -712,8 +712,10 @@ void bch_bset_build_written_tree(struct btree_keys *b) for (j = inorder_next(0, t->size); j; j = inorder_next(j, t->size)) { - while (bkey_to_cacheline(t, k) < cacheline) - prev = k, k = bkey_next(k); + while (bkey_to_cacheline(t, k) < cacheline) { + prev = k; + k = bkey_next(k); + }
t->prev[j] = bkey_u64s(prev); t->tree[j].m = bkey_to_cacheline_offset(t, cacheline++, k); @@ -901,8 +903,10 @@ unsigned int bch_btree_insert_key(struct btree_keys *b, struct bkey *k, status = BTREE_INSERT_STATUS_INSERT;
while (m != bset_bkey_last(i) && - bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0) - prev = m, m = bkey_next(m); + bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0) { + prev = m; + m = bkey_next(m); + }
/* prev is in the tree, if we merge we're done */ status = BTREE_INSERT_STATUS_BACK_MERGE; diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index eef15f8022ba..cc89f3156d1a 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -1094,8 +1094,10 @@ SHOW(__bch_cache) --n;
while (cached < p + n && - *cached == BTREE_PRIO) - cached++, n--; + *cached == BTREE_PRIO) { + cached++; + n--; + }
for (i = 0; i < n; i++) sum += INITIAL_PRIO - cached[i];
From: Zhiqiang Liu liuzhiqiang26@huawei.com
mainline inclusion from v5.13-rc1 commit 13e1db65d2b9263c3dfe447077981e7a32c857ae category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------------------
In bch_cached_dev_run(), free(env[1])|free(env[2])|free(buf) shows up three times. This patch introduces an out tag in which free(env[1])|free(env[2])|free(buf) is called only once. If we need to call free() when an error occurs, we can set the error code in ret and then goto the out tag directly.
Signed-off-by: Zhiqiang Liu liuzhiqiang26@huawei.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20210411134316.80274-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 3e2dc136bc37..a52f491e5209 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1058,6 +1058,7 @@ static int cached_dev_status_update(void *arg)
int bch_cached_dev_run(struct cached_dev *dc) { + int ret = 0; struct bcache_device *d = &dc->disk; char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL); char *env[] = { @@ -1070,19 +1071,15 @@ int bch_cached_dev_run(struct cached_dev *dc) if (dc->io_disable) { pr_err("I/O disabled on cached dev %s\n", dc->backing_dev_name); - kfree(env[1]); - kfree(env[2]); - kfree(buf); - return -EIO; + ret = -EIO; + goto out; }
if (atomic_xchg(&dc->running, 1)) { - kfree(env[1]); - kfree(env[2]); - kfree(buf); pr_info("cached dev %s is running already\n", dc->backing_dev_name); - return -EBUSY; + ret = -EBUSY; + goto out; }
if (!d->c && @@ -1103,15 +1100,13 @@ int bch_cached_dev_run(struct cached_dev *dc) * only class / kset properties are persistent */ kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env); - kfree(env[1]); - kfree(env[2]); - kfree(buf);
if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") || sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache")) { pr_err("Couldn't create bcache dev <-> disk sysfs symlinks\n"); - return -ENOMEM; + ret = -ENOMEM; + goto out; }
dc->status_update_thread = kthread_run(cached_dev_status_update, @@ -1120,7 +1115,11 @@ int bch_cached_dev_run(struct cached_dev *dc) pr_warn("failed to create bcache_status_update kthread, continue to run without monitoring backing device status\n"); }
- return 0; +out: + kfree(env[1]); + kfree(env[2]); + kfree(buf); + return ret; }
/*
From: Christoph Hellwig hch@lst.de
mainline inclusion from 5.13-rc1 commit 11e9560e6c005b4adca12d17b27dc5ac22b40663 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------
Remove the PTR_CACHE inline and replace it with a direct dereference of c->cache.
(Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log)
Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/alloc.c | 5 ++--- drivers/md/bcache/bcache.h | 11 ++--------- drivers/md/bcache/btree.c | 4 ++-- drivers/md/bcache/debug.c | 2 +- drivers/md/bcache/extents.c | 4 ++-- drivers/md/bcache/io.c | 4 ++-- drivers/md/bcache/journal.c | 2 +- drivers/md/bcache/writeback.c | 5 ++--- 8 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index 8c371d5eef8e..097577ae3c47 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -482,8 +482,7 @@ void bch_bucket_free(struct cache_set *c, struct bkey *k) unsigned int i;
for (i = 0; i < KEY_PTRS(k); i++) - __bch_bucket_free(PTR_CACHE(c, k, i), - PTR_BUCKET(c, k, i)); + __bch_bucket_free(c->cache, PTR_BUCKET(c, k, i)); }
int __bch_bucket_alloc_set(struct cache_set *c, unsigned int reserve, @@ -674,7 +673,7 @@ bool bch_alloc_sectors(struct cache_set *c, SET_PTR_OFFSET(&b->key, i, PTR_OFFSET(&b->key, i) + sectors);
atomic_long_add(sectors, - &PTR_CACHE(c, &b->key, i)->sectors_written); + &c->cache->sectors_written); }
if (b->sectors_free < c->cache->sb.block_size) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 848dd4db1659..0a4551e165ab 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -804,13 +804,6 @@ static inline sector_t bucket_remainder(struct cache_set *c, sector_t s) return s & (c->cache->sb.bucket_size - 1); }
-static inline struct cache *PTR_CACHE(struct cache_set *c, - const struct bkey *k, - unsigned int ptr) -{ - return c->cache; -} - static inline size_t PTR_BUCKET_NR(struct cache_set *c, const struct bkey *k, unsigned int ptr) @@ -822,7 +815,7 @@ static inline struct bucket *PTR_BUCKET(struct cache_set *c, const struct bkey *k, unsigned int ptr) { - return PTR_CACHE(c, k, ptr)->buckets + PTR_BUCKET_NR(c, k, ptr); + return c->cache->buckets + PTR_BUCKET_NR(c, k, ptr); }
static inline uint8_t gen_after(uint8_t a, uint8_t b) @@ -841,7 +834,7 @@ static inline uint8_t ptr_stale(struct cache_set *c, const struct bkey *k, static inline bool ptr_available(struct cache_set *c, const struct bkey *k, unsigned int i) { - return (PTR_DEV(k, i) < MAX_CACHES_PER_SET) && PTR_CACHE(c, k, i); + return (PTR_DEV(k, i) < MAX_CACHES_PER_SET) && c->cache; }
/* Btree key macros */ diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index fe6dce125aba..183a58c89377 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -426,7 +426,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent) do_btree_node_write(b);
atomic_long_add(set_blocks(i, block_bytes(b->c->cache)) * b->c->cache->sb.block_size, - &PTR_CACHE(b->c, &b->key, 0)->btree_sectors_written); + &b->c->cache->btree_sectors_written);
b->written += set_blocks(i, block_bytes(b->c->cache)); } @@ -1161,7 +1161,7 @@ static void make_btree_freeing_key(struct btree *b, struct bkey *k)
for (i = 0; i < KEY_PTRS(k); i++) SET_PTR_GEN(k, i, - bch_inc_gen(PTR_CACHE(b->c, &b->key, i), + bch_inc_gen(b->c->cache, PTR_BUCKET(b->c, &b->key, i)));
mutex_unlock(&b->c->bucket_lock); diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index b00fd08d696b..b2eb59b9cd71 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -50,7 +50,7 @@ void bch_btree_verify(struct btree *b) v->keys.ops = b->keys.ops;
bio = bch_bbio_alloc(b->c); - bio_set_dev(bio, PTR_CACHE(b->c, &b->key, 0)->bdev); + bio_set_dev(bio, c->cache->bdev); bio->bi_iter.bi_sector = PTR_OFFSET(&b->key, 0); bio->bi_iter.bi_size = KEY_SIZE(&v->key) << 9; bio->bi_opf = REQ_OP_READ | REQ_META; diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c index f4658a1f37b8..d626ffcbecb9 100644 --- a/drivers/md/bcache/extents.c +++ b/drivers/md/bcache/extents.c @@ -50,7 +50,7 @@ static bool __ptr_invalid(struct cache_set *c, const struct bkey *k)
for (i = 0; i < KEY_PTRS(k); i++) if (ptr_available(c, k, i)) { - struct cache *ca = PTR_CACHE(c, k, i); + struct cache *ca = c->cache; size_t bucket = PTR_BUCKET_NR(c, k, i); size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
@@ -71,7 +71,7 @@ static const char *bch_ptr_status(struct cache_set *c, const struct bkey *k)
for (i = 0; i < KEY_PTRS(k); i++) if (ptr_available(c, k, i)) { - struct cache *ca = PTR_CACHE(c, k, i); + struct cache *ca = c->cache; size_t bucket = PTR_BUCKET_NR(c, k, i); size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index dad71a6b7889..e4388fe3ab7e 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -36,7 +36,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c) struct bbio *b = container_of(bio, struct bbio, bio);
bio->bi_iter.bi_sector = PTR_OFFSET(&b->key, 0); - bio_set_dev(bio, PTR_CACHE(c, &b->key, 0)->bdev); + bio_set_dev(bio, c->cache->bdev);
b->submit_time_us = local_clock_us(); closure_bio_submit(c, bio, bio->bi_private); @@ -137,7 +137,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, blk_status_t error, const char *m) { struct bbio *b = container_of(bio, struct bbio, bio); - struct cache *ca = PTR_CACHE(c, &b->key, 0); + struct cache *ca = c->cache; int is_read = (bio_data_dir(bio) == READ ? 1 : 0);
unsigned int threshold = op_is_write(bio_op(bio)) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index c6613e817333..de2c0d7699cf 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -768,7 +768,7 @@ static void journal_write_unlocked(struct closure *cl) w->data->csum = csum_set(w->data);
for (i = 0; i < KEY_PTRS(k); i++) { - ca = PTR_CACHE(c, k, i); + ca = c->cache; bio = &ca->journal.bio;
atomic_long_add(sectors, &ca->meta_sectors_written); diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 82d4e0880a99..bcd550a2b0da 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -416,7 +416,7 @@ static void read_dirty_endio(struct bio *bio) struct dirty_io *io = w->private;
/* is_read = 1 */ - bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0), + bch_count_io_errors(io->dc->disk.c->cache, bio->bi_status, 1, "reading dirty data from cache");
@@ -510,8 +510,7 @@ static void read_dirty(struct cached_dev *dc) dirty_init(w); bio_set_op_attrs(&io->bio, REQ_OP_READ, 0); io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0); - bio_set_dev(&io->bio, - PTR_CACHE(dc->disk.c, &w->key, 0)->bdev); + bio_set_dev(&io->bio, dc->disk.c->cache->bdev); io->bio.bi_end_io = read_dirty_endio;
if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
From: Yang Li yang.lee@linux.alibaba.com
mainline inclusion from 5.13-rc1 commit f9a018e8a6af2898dc782f6e526bd11f6f352e87 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
--------------------------------------------------
This fixes the following sparse warnings: drivers/md/bcache/features.c:22:16: warning: Using plain integer as NULL pointer
Reported-by: Abaci Robot abaci@linux.alibaba.com Signed-off-by: Yang Li yang.lee@linux.alibaba.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20210411134316.80274-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/features.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c index d636b7b2d070..6d2b7b84a7b7 100644 --- a/drivers/md/bcache/features.c +++ b/drivers/md/bcache/features.c @@ -19,7 +19,7 @@ struct feature { static struct feature feature_list[] = { {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, "large_bucket"}, - {0, 0, 0 }, + {0, 0, NULL }, };
#define compose_feature_string(type) \
From: Arnd Bergmann arnd@arndb.de
mainline inclusion from v5.13-rc1 commit be3bacececd7c4ab233105171d39082858de1baa category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
--------------------------------------------
Building with 'make W=1' shows a harmless warning for each user of the EBUG_ON() macro:
drivers/md/bcache/bset.c: In function 'bch_btree_sort_partial':
drivers/md/bcache/util.h:30:55: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
   30 | #define EBUG_ON(cond)  do { if (cond); } while (0)
      |                                              ^
drivers/md/bcache/bset.c:1312:9: note: in expansion of macro 'EBUG_ON'
 1312 |         EBUG_ON(oldsize >= 0 && bch_count_data(b) != oldsize);
      |         ^~~~~~~
Reword the macro slightly to avoid the warning.
Signed-off-by: Arnd Bergmann arnd@arndb.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20210411134316.80274-5-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/util.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index c029f7443190..bca4a7c97da7 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -27,7 +27,7 @@ struct closure;
#else /* DEBUG */
-#define EBUG_ON(cond) do { if (cond); } while (0) +#define EBUG_ON(cond) do { if (cond) do {} while (0); } while (0) #define atomic_dec_bug(v) atomic_dec(v) #define atomic_inc_bug(v, i) atomic_inc(v)
From: Bhaskar Chowdhury unixbhaskar@gmail.com
mainline inclusion from v5.13-rc1 commit 9c9b81c45619e76d315eb3b9934e9d4bfa7d3bcd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-------------------------------------------------
s/condidate/candidate/ s/folowing/following/
Signed-off-by: Bhaskar Chowdhury unixbhaskar@gmail.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20210411134316.80274-6-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/journal.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index de2c0d7699cf..61bd79babf7a 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -111,7 +111,7 @@ reread: left = ca->sb.bucket_size - offset; * Check from the oldest jset for last_seq. If * i->j.seq < j->last_seq, it means the oldest jset * in list is expired and useless, remove it from - * this list. Otherwise, j is a condidate jset for + * this list. Otherwise, j is a candidate jset for * further following checks. */ while (!list_empty(list)) { @@ -498,7 +498,7 @@ static void btree_flush_write(struct cache_set *c) * - If there are matched nodes recorded in btree_nodes[], * they are clean now (this is why and how the oldest * journal entry can be reclaimed). These selected nodes - * will be ignored and skipped in the folowing for-loop. + * will be ignored and skipped in the following for-loop. */ if (((btree_current_write(b)->journal - fifo_front_p) & mask) != 0) {
From: "Gustavo A. R. Silva" gustavoars@kernel.org
mainline inclusion from v5.13-rc1 commit 62594f189e81caffa6a3bfa2fdb08eec2e347c76 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
--------------------------------------------
Cast multiple variables to (int64_t) in order to give the compiler complete information about the proper arithmetic to use. Notice that these variables are being used in contexts that expect expressions of type int64_t (64 bit, signed). And currently, such expressions are being evaluated using 32-bit arithmetic.
Fixes: d0cf9503e908 ("octeontx2-pf: ethtool fec mode support") Addresses-Coverity-ID: 1501724 ("Unintentional integer overflow") Addresses-Coverity-ID: 1501725 ("Unintentional integer overflow") Addresses-Coverity-ID: 1501726 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva gustavoars@kernel.org Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20210411134316.80274-7-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/writeback.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index bcd550a2b0da..8120da278161 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -110,13 +110,13 @@ static void __update_writeback_rate(struct cached_dev *dc) int64_t fps;
if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) { - fp_term = dc->writeback_rate_fp_term_low * + fp_term = (int64_t)dc->writeback_rate_fp_term_low * (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW); } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) { - fp_term = dc->writeback_rate_fp_term_mid * + fp_term = (int64_t)dc->writeback_rate_fp_term_mid * (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID); } else { - fp_term = dc->writeback_rate_fp_term_high * + fp_term = (int64_t)dc->writeback_rate_fp_term_high * (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH); } fps = div_s64(dirty, dirty_buckets) * fp_term;
From: Coly Li colyli@suse.de
mainline inclusion from v5.13-rc1 commit 33ec5dfe8f42aaf0163a16e2b450ab06f3a7f1f3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------
The patch "bcache: remove PTR_CACHE" introduces a compiling failure in debug.c with following error message, In file included from drivers/md/bcache/bcache.h:182:0, from drivers/md/bcache/debug.c:9: drivers/md/bcache/debug.c: In function 'bch_btree_verify': drivers/md/bcache/debug.c:53:19: error: 'c' undeclared (first use in this function) bio_set_dev(bio, c->cache->bdev); ^ This patch fixes the regression by replacing c->cache->bdev by b->c-> cache->bdev.
Signed-off-by: Coly Li colyli@suse.de Cc: Christoph Hellwig hch@lst.de Link: https://lore.kernel.org/r/20210411134316.80274-8-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index b2eb59b9cd71..45e7d54a40ff 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -50,7 +50,7 @@ void bch_btree_verify(struct btree *b) v->keys.ops = b->keys.ops;
bio = bch_bbio_alloc(b->c); - bio_set_dev(bio, c->cache->bdev); + bio_set_dev(bio, b->c->cache->bdev); bio->bi_iter.bi_sector = PTR_OFFSET(&b->key, 0); bio->bi_iter.bi_size = KEY_SIZE(&v->key) << 9; bio->bi_opf = REQ_OP_READ | REQ_META;
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from v5.13-rc5 commit 415f0c835ba799e47ce077b01876568431da1ff3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------------
Fix W=1 kernel build warning:
lib/crc64.c:40: warning: bad line: or the previous crc64 value if computing incrementally.
Link: https://lkml.kernel.org/r/20210601135851.15444-1-yuehaibing@huawei.com Signed-off-by: YueHaibing yuehaibing@huawei.com Reviewed-by: Coly Li colyli@suse.de Acked-by: Randy Dunlap rdunlap@infradead.org Tested-by: Randy Dunlap rdunlap@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- lib/crc64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/crc64.c b/lib/crc64.c index 47cfa054827f..9f852a89ee2a 100644 --- a/lib/crc64.c +++ b/lib/crc64.c @@ -37,7 +37,7 @@ MODULE_LICENSE("GPL v2"); /** * crc64_be - Calculate bitwise big-endian ECMA-182 CRC64 * @crc: seed value for computation. 0 or (u64)~0 for a new CRC calculation, - or the previous crc64 value if computing incrementally. + * or the previous crc64 value if computing incrementally. * @p: pointer to buffer over which CRC64 is run * @len: length of buffer @p */
From: Coly Li colyli@suse.de
mainline inclusion from v5.13-rc6 commit 1616a4c2ab1a80893b6890ae93da40a2b1d0c691 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
------------------------------------
For a read cache miss, bcache defines a readahead size for the read I/O request to the backing device for the missing data. This readahead size is initialized to 0, and almost no one uses it, to avoid unnecessary read amplification onto the backing device and write amplification onto the cache device. Considering that the upper-layer file system code already has readahead logic and works fine with the readahead_cache_policy sysfs interface, we don't have to keep bcache's self-defined readahead anymore.
This patch removes the bcache self-defined readahead for cache miss requests to the backing device, and the readahead sysfs file interfaces are removed as well.
This is the preparation for the next patch, which fixes a potential kernel panic due to an oversized request with a simpler method.
Reported-by: Alexander Ullrich ealex1979@gmail.com Reported-by: Diego Ercolani diego.ercolani@gmail.com Reported-by: Jan Szubiak jan.szubiak@linuxpolska.pl Reported-by: Marco Rebhan me@dblsaiko.net Reported-by: Matthias Ferdinand bcache@mfedv.net Reported-by: Victor Westerhuis victor@westerhu.is Reported-by: Vojtech Pavlik vojtech@suse.cz Reported-and-tested-by: Rolf Fokkens rolf@rolffokkens.nl Reported-and-tested-by: Thorsten Knabe linux@thorsten-knabe.de Signed-off-by: Coly Li colyli@suse.de Reviewed-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org Cc: Kent Overstreet kent.overstreet@gmail.com Cc: Nix nix@esperi.org.uk Cc: Takashi Iwai tiwai@suse.com Link: https://lore.kernel.org/r/20210607125052.21277-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/bcache.h | 1 - drivers/md/bcache/request.c | 12 +----------- drivers/md/bcache/stats.c | 14 -------------- drivers/md/bcache/stats.h | 1 - drivers/md/bcache/sysfs.c | 4 ---- 5 files changed, 1 insertion(+), 31 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 0a4551e165ab..5fc989a6d452 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -364,7 +364,6 @@ struct cached_dev {
/* The rest of this all shows up in sysfs */ unsigned int sequential_cutoff; - unsigned int readahead;
unsigned int io_disable:1; unsigned int verify:1; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 214326383145..f66889d882f5 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -878,7 +878,6 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, struct bio *bio, unsigned int sectors) { int ret = MAP_CONTINUE; - unsigned int reada = 0; struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); struct bio *miss, *cache_bio;
@@ -890,13 +889,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, goto out_submit; }
- if (!(bio->bi_opf & REQ_RAHEAD) && - !(bio->bi_opf & (REQ_META|REQ_PRIO)) && - s->iop.c->gc_stats.in_use < CUTOFF_CACHE_READA) - reada = min_t(sector_t, dc->readahead >> 9, - get_capacity(bio->bi_disk) - bio_end_sector(bio)); - - s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada); + s->insert_bio_sectors = min(sectors, bio_sectors(bio));
s->iop.replace_key = KEY(s->iop.inode, bio->bi_iter.bi_sector + s->insert_bio_sectors, @@ -930,9 +923,6 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO)) goto out_put;
- if (reada) - bch_mark_cache_readahead(s->iop.c, s->d); - s->cache_miss = miss; s->iop.bio = cache_bio; bio_get(cache_bio); diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c index 503aafe188dc..4c7ee5fedb9d 100644 --- a/drivers/md/bcache/stats.c +++ b/drivers/md/bcache/stats.c @@ -46,7 +46,6 @@ read_attribute(cache_misses); read_attribute(cache_bypass_hits); read_attribute(cache_bypass_misses); read_attribute(cache_hit_ratio); -read_attribute(cache_readaheads); read_attribute(cache_miss_collisions); read_attribute(bypassed);
@@ -64,7 +63,6 @@ SHOW(bch_stats) DIV_SAFE(var(cache_hits) * 100, var(cache_hits) + var(cache_misses)));
- var_print(cache_readaheads); var_print(cache_miss_collisions); sysfs_hprint(bypassed, var(sectors_bypassed) << 9); #undef var @@ -86,7 +84,6 @@ static struct attribute *bch_stats_files[] = { &sysfs_cache_bypass_hits, &sysfs_cache_bypass_misses, &sysfs_cache_hit_ratio, - &sysfs_cache_readaheads, &sysfs_cache_miss_collisions, &sysfs_bypassed, NULL @@ -113,7 +110,6 @@ void bch_cache_accounting_clear(struct cache_accounting *acc) acc->total.cache_misses = 0; acc->total.cache_bypass_hits = 0; acc->total.cache_bypass_misses = 0; - acc->total.cache_readaheads = 0; acc->total.cache_miss_collisions = 0; acc->total.sectors_bypassed = 0; } @@ -145,7 +141,6 @@ static void scale_stats(struct cache_stats *stats, unsigned long rescale_at) scale_stat(&stats->cache_misses); scale_stat(&stats->cache_bypass_hits); scale_stat(&stats->cache_bypass_misses); - scale_stat(&stats->cache_readaheads); scale_stat(&stats->cache_miss_collisions); scale_stat(&stats->sectors_bypassed); } @@ -168,7 +163,6 @@ static void scale_accounting(struct timer_list *t) move_stat(cache_misses); move_stat(cache_bypass_hits); move_stat(cache_bypass_misses); - move_stat(cache_readaheads); move_stat(cache_miss_collisions); move_stat(sectors_bypassed);
@@ -209,14 +203,6 @@ void bch_mark_cache_accounting(struct cache_set *c, struct bcache_device *d, mark_cache_stats(&c->accounting.collector, hit, bypass); }
-void bch_mark_cache_readahead(struct cache_set *c, struct bcache_device *d) -{ - struct cached_dev *dc = container_of(d, struct cached_dev, disk); - - atomic_inc(&dc->accounting.collector.cache_readaheads); - atomic_inc(&c->accounting.collector.cache_readaheads); -} - void bch_mark_cache_miss_collision(struct cache_set *c, struct bcache_device *d) { struct cached_dev *dc = container_of(d, struct cached_dev, disk); diff --git a/drivers/md/bcache/stats.h b/drivers/md/bcache/stats.h index abfaabf7e7fc..ca4f435f7216 100644 --- a/drivers/md/bcache/stats.h +++ b/drivers/md/bcache/stats.h @@ -7,7 +7,6 @@ struct cache_stat_collector { atomic_t cache_misses; atomic_t cache_bypass_hits; atomic_t cache_bypass_misses; - atomic_t cache_readaheads; atomic_t cache_miss_collisions; atomic_t sectors_bypassed; }; diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index cc89f3156d1a..05ac1d6fbbf3 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -137,7 +137,6 @@ rw_attribute(io_disable); rw_attribute(discard); rw_attribute(running); rw_attribute(label); -rw_attribute(readahead); rw_attribute(errors); rw_attribute(io_error_limit); rw_attribute(io_error_halflife); @@ -260,7 +259,6 @@ SHOW(__bch_cached_dev) var_printf(partial_stripes_expensive, "%u");
var_hprint(sequential_cutoff); - var_hprint(readahead);
sysfs_print(running, atomic_read(&dc->running)); sysfs_print(state, states[BDEV_STATE(&dc->sb)]); @@ -365,7 +363,6 @@ STORE(__cached_dev) sysfs_strtoul_clamp(sequential_cutoff, dc->sequential_cutoff, 0, UINT_MAX); - d_strtoi_h(readahead);
if (attr == &sysfs_clear_stats) bch_cache_accounting_clear(&dc->accounting); @@ -538,7 +535,6 @@ static struct attribute *bch_cached_dev_files[] = { &sysfs_running, &sysfs_state, &sysfs_label, - &sysfs_readahead, #ifdef CONFIG_BCACHE_DEBUG &sysfs_verify, &sysfs_bypass_torture_test,
From: Coly Li colyli@suse.de
mainline inclusion from v5.13-rc6 commit 41fe8d088e96472f63164e213de44ec77be69478 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------------------
In the cache missing code path of a cached device, if a proper location from the internal B+ tree is matched for a cache miss range, function cached_dev_cache_miss() will be called in cache_lookup_fn() in the following code block,
[code block 1]
526 unsigned int sectors = KEY_INODE(k) == s->iop.inode
527 	? min_t(uint64_t, INT_MAX,
528 		  KEY_START(k) - bio->bi_iter.bi_sector)
529 	: INT_MAX;
530 int ret = s->d->cache_miss(b, s, bio, sectors);
Here s->d->cache_miss() is the callback function pointer initialized to cached_dev_cache_miss(); the last parameter 'sectors' is an important hint for calculating the size of the read request to the backing device for the missing cache data.
The current calculation in the above code block may generate an oversized value of 'sectors', which consequently may trigger 2 different potential kernel panics by BUG() or BUG_ON(), as listed below,
1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
886 	BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
51 	default:
52 		BUG();
53 		return NULL;
All the above panics originate from cached_dev_cache_miss() with the oversized parameter 'sectors'.
Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate the size of the data read from the backing device for the cache miss. This size is stored in s->insert_bio_sectors by the following lines of code,
[code block 4]
909 	s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
Then the actual key to be inserted into the internal B+ tree is generated and stored in s->iop.replace_key by the following lines of code,
[code block 5]
911 	s->iop.replace_key = KEY(s->iop.inode,
912 			bio->bi_iter.bi_sector + s->insert_bio_sectors,
913 			s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from the above code block.
And the bio sent to the backing device for the missing data is allocated with a hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
926 	cache_bio = bio_alloc_bioset(GFP_NOWAIT,
927 		DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
928 		&dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) by BUG() from the above code block.
Now let me explain how the panics happen with the oversized 'sectors'. In code block 5, replace_key is generated by macro KEY(). From the definition of macro KEY(),
[code block 7]
71 #define KEY(inode, offset, size)					\
72 ((struct bkey) {							\
73 	.high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),	\
74 	.low = (offset)							\
75 })
Here 'size' is a 16-bit field embedded in the 64-bit member 'high' of struct bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector" can easily be larger than (1<<16) - 1, which makes the bkey size calculation in code block 5 overflow. In one bug report the value of parameter 'sectors' is 131072 (= 1 << 17); the overflowed 'sectors' results in an overflowed s->insert_bio_sectors in code block 4, which then makes the size field of s->iop.replace_key 0 in code block 5. Then the 0-sized s->iop.replace_key is inserted into the internal B+ tree as the cache missing check key (a special key to detect and avoid a race between a normal write request and a cache missing read request) as,
[code block 8]
915 	ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
Then the 0-sized s->iop.replace_key, passed as the 3rd parameter, triggers the bkey size check BUG_ON() in code block 2 and causes kernel panic 1).
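A tiny userspace illustration (not from the patch; the field layout is simplified from code block 7, with the 16-bit size field assumed to sit at bit 20 of 'high') shows how a 'sectors' value of 131072 reads back as a 0-sized key:

	#include <stdio.h>
	#include <stdint.h>

	/* Simplified from code block 7: build 'high' and read the size back. */
	#define KEY_HIGH(inode, size)	((1ULL << 63) | ((uint64_t)(size) << 20) | (inode))
	#define KEY_SIZE(high)		(((high) >> 20) & 0xFFFF)

	int main(void)
	{
		uint64_t ok  = KEY_HIGH(0, 8);		/* a normal 8-sector key */
		uint64_t bad = KEY_HIGH(0, 131072);	/* sectors = 1 << 17, as in the bug report */

		printf("size 8      -> KEY_SIZE = %llu\n", (unsigned long long)KEY_SIZE(ok));	/* 8 */
		printf("size 131072 -> KEY_SIZE = %llu\n", (unsigned long long)KEY_SIZE(bad));	/* 0 */
		return 0;
	}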
The other kernel panic, from code block 6, is caused by an oversized bvec count derived from the value s->insert_bio_sectors computed in code block 4,
	min(sectors, bio_sectors(bio) + reada)
There are two possibilities for an oversized result,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.
From a bug report, the result of "DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many other values which are larger than BIO_MAX_VECS (a.k.a. 256). When bio_alloc_bioset() is called with such a larger-than-256 value as the 2nd parameter, this value is eventually sent to biovec_slab() as parameter 'nr_vecs' via the following code path,
	bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is a larger-than-256 value, the panic by BUG() in code block 3 is triggered inside biovec_slab().
From the above analysis, we know that the 4th parameter 'sectors' sent into cached_dev_cache_miss() may cause an overflow in code blocks 5 and 6, and finally cause a kernel panic in code blocks 2 and 3. And if the result of bio_sectors(bio) + reada exceeds the valid bvec count, it may also trigger the kernel panic in code block 3 from code block 6.
Now that the almost-useless readahead size for the cache miss request to the backing device is removed, this patch can fix the oversized issue with a simpler method:
- add a local variable size_limit, set it to the minimum of the max bkey size and the max bio bvec count.
- set s->insert_bio_sectors to the minimum of size_limit, sectors, and the sector count of the bio.
- replace sectors with s->insert_bio_sectors when doing bio_next_split.
With the above size_limit method, s->insert_bio_sectors will never result in an oversized replace_key size or bio bvec count. And the split bio 'miss' from bio_next_split() will always match the size of 'cache_bio', which is the current maximum bio size we can send to the backing device for fetching the cache missing data.
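As a hedged numeric check (assuming 4 KB pages, so PAGE_SECTORS is 8, with BIO_MAX_PAGES being 256 and KEY_SIZE_BITS being 16 in this 5.10 based kernel), the limit introduced in the diff below works out to:

	size_limit = min_t(unsigned int,
			   BIO_MAX_PAGES * PAGE_SECTORS,	/* 256 * 8  = 2048  */
			   (1 << KEY_SIZE_BITS) - 1);		/* 2^16 - 1 = 65535 */
	/* => size_limit = 2048 sectors (1 MiB) per cache miss read to the backing device */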
The problematic code can partially be found since Linux v3.13-rc1, therefore all maintained stable kernels should try to apply this fix.
(Coly Li: still use BIO_MAX_PAGES because BIO_MAX_VECS is not defined yet in 5.10 based kernel.)
Reported-by: Alexander Ullrich ealex1979@gmail.com Reported-by: Diego Ercolani diego.ercolani@gmail.com Reported-by: Jan Szubiak jan.szubiak@linuxpolska.pl Reported-by: Marco Rebhan me@dblsaiko.net Reported-by: Matthias Ferdinand bcache@mfedv.net Reported-by: Victor Westerhuis victor@westerhu.is Reported-by: Vojtech Pavlik vojtech@suse.cz Reported-and-tested-by: Rolf Fokkens rolf@rolffokkens.nl Reported-and-tested-by: Thorsten Knabe linux@thorsten-knabe.de Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Kent Overstreet kent.overstreet@gmail.com Cc: Nix nix@esperi.org.uk Cc: Takashi Iwai tiwai@suse.com Link: https://lore.kernel.org/r/20210607125052.21277-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/request.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index f66889d882f5..19014a87a7db 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -880,6 +880,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, int ret = MAP_CONTINUE; struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); struct bio *miss, *cache_bio; + unsigned int size_limit;
s->cache_missed = 1;
@@ -889,7 +890,10 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, goto out_submit; }
- s->insert_bio_sectors = min(sectors, bio_sectors(bio)); + /* Limitation for valid replace key size and cache_bio bvecs number */ + size_limit = min_t(unsigned int, BIO_MAX_PAGES * PAGE_SECTORS, + (1 << KEY_SIZE_BITS) - 1); + s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));
s->iop.replace_key = KEY(s->iop.inode, bio->bi_iter.bi_sector + s->insert_bio_sectors, @@ -901,7 +905,8 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
s->iop.replace = true;
- miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split); + miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO, + &s->d->bio_split);
/* btree_search_recurse()'s btree iterator is no good anymore */ ret = miss == bio ? MAP_DONE : -EINTR;
From: Ding Senjie dingsenjie@yulong.com
mainline inclusion from v5.16-rc1 commit a307e2abfc22880a3026bc2f2a997402b7c2d833 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------------------
acqurie -> acquire
Signed-off-by: Ding Senjie dingsenjie@yulong.com Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211020143812.6403-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a52f491e5209..ecdd613833f4 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2759,7 +2759,7 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x) * The reason bch_register_lock is not held to call * bch_cache_set_stop() and bcache_device_stop() is to * avoid potential deadlock during reboot, because cache - * set or bcache device stopping process will acqurie + * set or bcache device stopping process will acquire * bch_register_lock too. * * We are safe here because bcache_is_reboot sets to
From: Chao Yu yuchao0@huawei.com
mainline inclusion from v5.16-rc1 commit d55f7cb2e5c053010d2b527494da9bbb722a78ba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------
In register_bcache(), there are several cases where we didn't set the correct error info (return value and/or error message): - if kzalloc() fails, it needs to return ENOMEM and print "cannot allocate memory"; - if register_cache() fails, it's better to propagate its return value rather than using the default EINVAL.
Signed-off-by: Chao Yu yuchao0@huawei.com Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211020143812.6403-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index ecdd613833f4..06f69970f836 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2626,8 +2626,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, if (SB_IS_BDEV(sb)) { struct cached_dev *dc = kzalloc(sizeof(*dc), GFP_KERNEL);
- if (!dc) + if (!dc) { + ret = -ENOMEM; + err = "cannot allocate memory"; goto out_put_sb_page; + }
mutex_lock(&bch_register_lock); ret = register_bdev(sb, sb_disk, bdev, dc); @@ -2638,11 +2641,15 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, } else { struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
- if (!ca) + if (!ca) { + ret = -ENOMEM; + err = "cannot allocate memory"; goto out_put_sb_page; + }
/* blkdev_put() will be called in bch_cache_release() */ - if (register_cache(sb, sb_disk, bdev, ca) != 0) + ret = register_cache(sb, sb_disk, bdev, ca); + if (ret) goto out_free_sb; }
From: Lin Feng linf@wangsu.com
mainline inclusion from v5.16-rc1 commit 0259d4498ba48454749ecfb9c81e892cdb8d1a32 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------------------
Calculation of a cache_set's cached sectors is done by traversing the cached_devs list as shown below:
static void calc_cached_dev_sectors(struct cache_set *c) { ... list_for_each_entry(dc, &c->cached_devs, list) sectors += bdev_sectors(dc->bdev);
c->cached_dev_sectors = sectors; }
But cached_dev won't be unlinked from the c->cached_devs list until the following list_move(&dc->list, &uncached_devices) is called, so the previous fix in commit 46010141da6677b81cc77f9b47f8ac62bd1cbfd3 ("bcache: recal cached_dev_sectors on detach") is wrong; now we move it to its right place.
Signed-off-by: Lin Feng linf@wangsu.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211020143812.6403-5-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 06f69970f836..203f85599018 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1165,9 +1165,9 @@ static void cached_dev_detach_finish(struct work_struct *w)
mutex_lock(&bch_register_lock);
- calc_cached_dev_sectors(dc->disk.c); bcache_device_detach(&dc->disk); list_move(&dc->list, &uncached_devices); + calc_cached_dev_sectors(dc->disk.c);
clear_bit(BCACHE_DEV_DETACHING, &dc->disk.flags); clear_bit(BCACHE_DEV_UNLINK_DONE, &dc->disk.flags);
From: Coly Li colyli@suse.de
mainline inclusion from v5.16-rc1 commit cf2197ca4b8c199d188593ca6800ea1827c42171 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------------
The header file include/uapi/linux/bcache.h is not really a user space API header. This file defines the on-disk format of bcache internal metadata, but no one includes it from user space; bcache-tools has its own copy of this header with minor modifications.
Therefore, this patch moves include/uapi/linux/bcache.h to bcache code directory as drivers/md/bcache/bcache_ondisk.h.
Suggested-by: Arnd Bergmann arnd@kernel.org Suggested-by: Christoph Hellwig hch@lst.de Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211029060930.119923-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/bcache.h | 2 +- .../uapi/linux/bcache.h => drivers/md/bcache/bcache_ondisk.h | 0 drivers/md/bcache/bset.h | 2 +- drivers/md/bcache/features.c | 2 +- drivers/md/bcache/features.h | 3 ++- 5 files changed, 5 insertions(+), 4 deletions(-) rename include/uapi/linux/bcache.h => drivers/md/bcache/bcache_ondisk.h (100%)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 5fc989a6d452..2a011469af02 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -178,7 +178,6 @@
#define pr_fmt(fmt) "bcache: %s() " fmt, __func__
-#include <linux/bcache.h> #include <linux/bio.h> #include <linux/kobject.h> #include <linux/list.h> @@ -190,6 +189,7 @@ #include <linux/workqueue.h> #include <linux/kthread.h>
+#include "bcache_ondisk.h" #include "bset.h" #include "util.h" #include "closure.h" diff --git a/include/uapi/linux/bcache.h b/drivers/md/bcache/bcache_ondisk.h similarity index 100% rename from include/uapi/linux/bcache.h rename to drivers/md/bcache/bcache_ondisk.h diff --git a/drivers/md/bcache/bset.h b/drivers/md/bcache/bset.h index a50dcfda656f..d795c84246b0 100644 --- a/drivers/md/bcache/bset.h +++ b/drivers/md/bcache/bset.h @@ -2,10 +2,10 @@ #ifndef _BCACHE_BSET_H #define _BCACHE_BSET_H
-#include <linux/bcache.h> #include <linux/kernel.h> #include <linux/types.h>
+#include "bcache_ondisk.h" #include "util.h" /* for time_stats */
/* diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c index 6d2b7b84a7b7..634922c5601d 100644 --- a/drivers/md/bcache/features.c +++ b/drivers/md/bcache/features.c @@ -6,7 +6,7 @@ * Copyright 2020 Coly Li colyli@suse.de * */ -#include <linux/bcache.h> +#include "bcache_ondisk.h" #include "bcache.h" #include "features.h"
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h index d1c8fd3977fc..09161b89c63e 100644 --- a/drivers/md/bcache/features.h +++ b/drivers/md/bcache/features.h @@ -2,10 +2,11 @@ #ifndef _BCACHE_FEATURES_H #define _BCACHE_FEATURES_H
-#include <linux/bcache.h> #include <linux/kernel.h> #include <linux/types.h>
+#include "bcache_ondisk.h" + #define BCH_FEATURE_COMPAT 0 #define BCH_FEATURE_RO_COMPAT 1 #define BCH_FEATURE_INCOMPAT 2
From: Qing Wang wangqing@vivo.com
mainline inclusion from v5.16-rc1 commit 1b86db5f4e025840e0bf7cef2b10e84531954386 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------
coccicheck complains about the use of snprintf() in sysfs show functions.
Fix the following coccicheck warning: drivers/md/bcache/sysfs.h:54:12-20: WARNING: use scnprintf or sprintf.
Implement sysfs_print() with sysfs_emit() and remove snprint() since no one uses it anymore.
Suggested-by: Coly Li colyli@suse.de Signed-off-by: Qing Wang wangqing@vivo.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211029060930.119923-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/sysfs.h | 18 ++++++++++++++++-- drivers/md/bcache/util.h | 17 ----------------- 2 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/drivers/md/bcache/sysfs.h b/drivers/md/bcache/sysfs.h index 215df32f567b..c1752ba2e05b 100644 --- a/drivers/md/bcache/sysfs.h +++ b/drivers/md/bcache/sysfs.h @@ -51,13 +51,27 @@ STORE(fn) \ #define sysfs_printf(file, fmt, ...) \ do { \ if (attr == &sysfs_ ## file) \ - return snprintf(buf, PAGE_SIZE, fmt "\n", __VA_ARGS__); \ + return sysfs_emit(buf, fmt "\n", __VA_ARGS__); \ } while (0)
#define sysfs_print(file, var) \ do { \ if (attr == &sysfs_ ## file) \ - return snprint(buf, PAGE_SIZE, var); \ + return sysfs_emit(buf, \ + __builtin_types_compatible_p(typeof(var), int) \ + ? "%i\n" : \ + __builtin_types_compatible_p(typeof(var), unsigned int) \ + ? "%u\n" : \ + __builtin_types_compatible_p(typeof(var), long) \ + ? "%li\n" : \ + __builtin_types_compatible_p(typeof(var), unsigned long)\ + ? "%lu\n" : \ + __builtin_types_compatible_p(typeof(var), int64_t) \ + ? "%lli\n" : \ + __builtin_types_compatible_p(typeof(var), uint64_t) \ + ? "%llu\n" : \ + __builtin_types_compatible_p(typeof(var), const char *) \ + ? "%s\n" : "%i\n", var); \ } while (0)
#define sysfs_hprint(file, val) \ diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index bca4a7c97da7..97c524679c8a 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -342,23 +342,6 @@ static inline int bch_strtoul_h(const char *cp, long *res) _r; \ })
-#define snprint(buf, size, var) \ - snprintf(buf, size, \ - __builtin_types_compatible_p(typeof(var), int) \ - ? "%i\n" : \ - __builtin_types_compatible_p(typeof(var), unsigned int) \ - ? "%u\n" : \ - __builtin_types_compatible_p(typeof(var), long) \ - ? "%li\n" : \ - __builtin_types_compatible_p(typeof(var), unsigned long)\ - ? "%lu\n" : \ - __builtin_types_compatible_p(typeof(var), int64_t) \ - ? "%lli\n" : \ - __builtin_types_compatible_p(typeof(var), uint64_t) \ - ? "%llu\n" : \ - __builtin_types_compatible_p(typeof(var), const char *) \ - ? "%s\n" : "%i\n", var) - ssize_t bch_hprint(char *buf, int64_t v);
bool bch_is_zero(const char *p, size_t n);
From: Lin Feng linf@wangsu.com
mainline inclusion from v5.16-rc6 commit aa97f6cdb7e92909e17c8ca63e622fcb81d57a57 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------
Commit 0259d4498ba4 ("bcache: move calc_cached_dev_sectors to proper place on backing device detach") tries to fix calc_cached_dev_sectors when the bcache device detaches, but now we have:
cached_dev_detach_finish ... bcache_device_detach(&dc->disk); ... closure_put(&d->c->caching); d->c = NULL; [*explicitly set dc->disk.c to NULL*] list_move(&dc->list, &uncached_devices); calc_cached_dev_sectors(dc->disk.c); [*passing a NULL pointer*] ...
The code flow above shows how the bug happens. This patch fixes the problem by caching dc->disk.c beforehand; the cache_set won't be freed under us because the c->caching closure still holds at least one reference, and its callback __cache_set_unregister is only called by bch_cache_set_stop() via closure_queue(&c->caching). That means the c->caching closure callback for destroying the cache_set won't be triggered by the previous closure_put(&d->c->caching). So at this stage (while cached_dev_detach_finish() is running) it is safe to access the cache_set dc->disk.c.
Fixes: 0259d4498ba4 ("bcache: move calc_cached_dev_sectors to proper place on backing device detach") Signed-off-by: Lin Feng linf@wangsu.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20211112053629.3437-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/super.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 203f85599018..a98b46d3cd0b 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1150,6 +1150,7 @@ static void cancel_writeback_rate_update_dwork(struct cached_dev *dc) static void cached_dev_detach_finish(struct work_struct *w) { struct cached_dev *dc = container_of(w, struct cached_dev, detach); + struct cache_set *c = dc->disk.c;
BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)); BUG_ON(refcount_read(&dc->count)); @@ -1167,7 +1168,7 @@ static void cached_dev_detach_finish(struct work_struct *w)
bcache_device_detach(&dc->disk); list_move(&dc->list, &uncached_devices); - calc_cached_dev_sectors(dc->disk.c); + calc_cached_dev_sectors(c);
clear_bit(BCACHE_DEV_DETACHING, &dc->disk.flags); clear_bit(BCACHE_DEV_UNLINK_DONE, &dc->disk.flags);
From: Greg Kroah-Hartman gregkh@linuxfoundation.org
mainline inclusion from v5.18-rc1 commit fa97cb843cfb874c50cd1dcc46a2f28187e184e9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
--------------------------------------
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field and the default_groups field. Move the bcache sysfs code to use the default_groups field, which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type"), so that we can soon get rid of the obsolete default_attrs field.
Cc: Kent Overstreet kent.overstreet@gmail.com Cc: linux-bcache@vger.kernel.org Acked-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220106100004.3277439-1-gregkh@linuxfoundation.or... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/stats.c | 3 ++- drivers/md/bcache/sysfs.c | 15 ++++++++++----- drivers/md/bcache/sysfs.h | 2 +- 3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c index 4c7ee5fedb9d..68b02216033d 100644 --- a/drivers/md/bcache/stats.c +++ b/drivers/md/bcache/stats.c @@ -78,7 +78,7 @@ static void bch_stats_release(struct kobject *k) { }
-static struct attribute *bch_stats_files[] = { +static struct attribute *bch_stats_attrs[] = { &sysfs_cache_hits, &sysfs_cache_misses, &sysfs_cache_bypass_hits, @@ -88,6 +88,7 @@ static struct attribute *bch_stats_files[] = { &sysfs_bypassed, NULL }; +ATTRIBUTE_GROUPS(bch_stats); static KTYPE(bch_stats);
int bch_cache_accounting_add_kobjs(struct cache_accounting *acc, diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 05ac1d6fbbf3..8467e37411a7 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -500,7 +500,7 @@ STORE(bch_cached_dev) return size; }
-static struct attribute *bch_cached_dev_files[] = { +static struct attribute *bch_cached_dev_attrs[] = { &sysfs_attach, &sysfs_detach, &sysfs_stop, @@ -543,6 +543,7 @@ static struct attribute *bch_cached_dev_files[] = { &sysfs_backing_dev_uuid, NULL }; +ATTRIBUTE_GROUPS(bch_cached_dev); KTYPE(bch_cached_dev);
SHOW(bch_flash_dev) @@ -600,7 +601,7 @@ STORE(__bch_flash_dev) } STORE_LOCKED(bch_flash_dev)
-static struct attribute *bch_flash_dev_files[] = { +static struct attribute *bch_flash_dev_attrs[] = { &sysfs_unregister, #if 0 &sysfs_data_csum, @@ -609,6 +610,7 @@ static struct attribute *bch_flash_dev_files[] = { &sysfs_size, NULL }; +ATTRIBUTE_GROUPS(bch_flash_dev); KTYPE(bch_flash_dev);
struct bset_stats_op { @@ -955,7 +957,7 @@ static void bch_cache_set_internal_release(struct kobject *k) { }
-static struct attribute *bch_cache_set_files[] = { +static struct attribute *bch_cache_set_attrs[] = { &sysfs_unregister, &sysfs_stop, &sysfs_synchronous, @@ -980,9 +982,10 @@ static struct attribute *bch_cache_set_files[] = { &sysfs_clear_stats, NULL }; +ATTRIBUTE_GROUPS(bch_cache_set); KTYPE(bch_cache_set);
-static struct attribute *bch_cache_set_internal_files[] = { +static struct attribute *bch_cache_set_internal_attrs[] = { &sysfs_active_journal_entries,
sysfs_time_stats_attribute_list(btree_gc, sec, ms) @@ -1022,6 +1025,7 @@ static struct attribute *bch_cache_set_internal_files[] = { &sysfs_feature_incompat, NULL }; +ATTRIBUTE_GROUPS(bch_cache_set_internal); KTYPE(bch_cache_set_internal);
static int __bch_cache_cmp(const void *l, const void *r) @@ -1182,7 +1186,7 @@ STORE(__bch_cache) } STORE_LOCKED(bch_cache)
-static struct attribute *bch_cache_files[] = { +static struct attribute *bch_cache_attrs[] = { &sysfs_bucket_size, &sysfs_block_size, &sysfs_nbuckets, @@ -1196,4 +1200,5 @@ static struct attribute *bch_cache_files[] = { &sysfs_cache_replacement_policy, NULL }; +ATTRIBUTE_GROUPS(bch_cache); KTYPE(bch_cache); diff --git a/drivers/md/bcache/sysfs.h b/drivers/md/bcache/sysfs.h index c1752ba2e05b..a2ff6447b699 100644 --- a/drivers/md/bcache/sysfs.h +++ b/drivers/md/bcache/sysfs.h @@ -9,7 +9,7 @@ struct kobj_type type ## _ktype = { \ .show = type ## _show, \ .store = type ## _store \ }), \ - .default_attrs = type ## _files \ + .default_groups = type ## _groups \ }
#define SHOW(fn) \
From: Mingzhe Zou mingzhe.zou@easystack.cn
mainline inclusion from v5.18-rc1 commit 7b1002f7cfe581930f63787a0b3de0144e61ed55 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
--------------------------------------
When attaching a cached device (a.k.a backing device) to a cache device, bch_sectors_dirty_init() is called to count dirty sectors and stripes (see what bcache_dev_sectors_dirty_add() does) on the cache device.
When bcache_dev_sectors_dirty_add() is called, a set_bit(stripe, d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes) operation is always performed. In full_dirty_stripes, each bit represents stripe_size (8192) sectors (512B), so 1 bit = 4MB (8192*512), and each CPU cache line = 64B = 512 bits = 2048MB. When 20 threads process a cached disk with 100G of dirty data, a single thread processes about 23M at a time, and 20 threads total 460M. The full_dirty_stripes bits corresponding to this 460M of data are likely to fall in the same CPU cache line. When one of these threads performs a set_bit or clear_bit operation, that cache line becomes invalid on the other threads' CPUs and they must read full_dirty_stripes from main memory again. Compared with a single thread, the time of a bcache_dev_sectors_dirty_add() call is increased by about 50 times in our test (100G dirty data, 20 threads, bcache_dev_sectors_dirty_add() called more than 20 million times).
This patch calls test_bit() before the set_bit() or clear_bit() operation. Therefore, a lot of forced set and clear operations are avoided, and most bcache_dev_sectors_dirty_add() calls only read the CPU cache line.
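The idea can be illustrated with a minimal user-space sketch (hypothetical names, not the real bcache bitmap): writing a bit only when its value actually changes keeps the shared cache line readable on the other CPUs instead of invalidating it on every call.

  #include <stdatomic.h>
  #include <limits.h>

  #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

  /* Hypothetical shared bitmap, one bit per stripe. */
  static _Atomic unsigned long full_dirty[1024];

  /* Set the bit only if it is not already set: an unchanged bit never
   * generates a store, so the cache line can stay shared between CPUs
   * instead of being invalidated on every call. */
  static void set_bit_if_needed(unsigned long stripe)
  {
      _Atomic unsigned long *word = &full_dirty[stripe / BITS_PER_LONG];
      unsigned long mask = 1UL << (stripe % BITS_PER_LONG);

      if (!(atomic_load_explicit(word, memory_order_relaxed) & mask))
          atomic_fetch_or_explicit(word, mask, memory_order_relaxed);
  }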
Signed-off-by: Mingzhe Zou mingzhe.zou@easystack.cn Signed-off-by: Coly Li colyli@suse.de Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/writeback.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 8120da278161..e93a09ccaddd 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -585,10 +585,13 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned int inode,
sectors_dirty = atomic_add_return(s, d->stripe_sectors_dirty + stripe); - if (sectors_dirty == d->stripe_size) - set_bit(stripe, d->full_dirty_stripes); - else - clear_bit(stripe, d->full_dirty_stripes); + if (sectors_dirty == d->stripe_size) { + if (!test_bit(stripe, d->full_dirty_stripes)) + set_bit(stripe, d->full_dirty_stripes); + } else { + if (test_bit(stripe, d->full_dirty_stripes)) + clear_bit(stripe, d->full_dirty_stripes); + }
nr_sectors -= s; stripe_offset = 0;
From: Mingzhe Zou mingzhe.zou@easystack.cn
mainline inclusion from v5.18-rc1 commit 887554ab96588de2917b6c8c73e552da082e5368 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------
When multiple threads check btree nodes in parallel, the main thread waits for all threads to stop or for the CACHE_SET_IO_DISABLE flag:
wait_event_interruptible(check_state->wait, atomic_read(&check_state->started) == 0 || test_bit(CACHE_SET_IO_DISABLE, &c->flags));
However, bch_btree_node_read() and bch_btree_node_read_done() may call bch_cache_set_error(), which sets CACHE_SET_IO_DISABLE. If the flag is already set, the main thread returns an error while some checking threads may still be running; they can then dereference a NULL pointer and the kernel will crash.
This patch changes the event wait condition: the main thread must wait for all threads to stop.
Fixes: 8e7102273f597 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Mingzhe Zou mingzhe.zou@easystack.cn Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Coly Li colyli@suse.de Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/btree.c | 6 ++++-- drivers/md/bcache/writeback.c | 6 ++++-- 2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 183a58c89377..8eecc9df319b 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2060,9 +2060,11 @@ int bch_btree_check(struct cache_set *c) } }
+ /* + * Must wait for all threads to stop. + */ wait_event_interruptible(check_state->wait, - atomic_read(&check_state->started) == 0 || - test_bit(CACHE_SET_IO_DISABLE, &c->flags)); + atomic_read(&check_state->started) == 0);
for (i = 0; i < check_state->total_threads; i++) { if (check_state->infos[i].result) { diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index e93a09ccaddd..8bd9098185b6 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -1001,9 +1001,11 @@ void bch_sectors_dirty_init(struct bcache_device *d) } }
+ /* + * Must wait for all threads to stop. + */ wait_event_interruptible(state->wait, - atomic_read(&state->started) == 0 || - test_bit(CACHE_SET_IO_DISABLE, &c->flags)); + atomic_read(&state->started) == 0);
out: kfree(state);
From: Coly Li colyli@suse.de
mainline inclusion from v5.19-rc1 commit 622536443b6731ec82c563aae7807165adbe9178 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------
Commit 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() much faster when checking all btree nodes during cache device registration. But it isn't in ideal shape yet and can still be improved.
This patch does the following things to improve the current parallel btree node check by multiple threads in bch_btree_check(), - Add a read lock to the root node while checking all the btree nodes with multiple threads. Although it is not currently mandatory, it is good to have a read lock in the code logic. - Remove the local variable 'char name[32]', and generate the kernel thread name string directly when calling kthread_run(). - Allocate the local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it. - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is enough indeed. - Increment check_state->started to count a created kernel thread only after it is successfully created. - When waiting for all checking kernel threads to finish, use wait_event() instead of wait_event_interruptible().
With this change, the code is clearer, and some potential error conditions are avoided.
Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/btree.c | 58 ++++++++++++++++++--------------------- drivers/md/bcache/btree.h | 2 +- 2 files changed, 27 insertions(+), 33 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 8eecc9df319b..7b6f8bfef927 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2006,8 +2006,7 @@ int bch_btree_check(struct cache_set *c) int i; struct bkey *k = NULL; struct btree_iter iter; - struct btree_check_state *check_state; - char name[32]; + struct btree_check_state check_state;
/* check and mark root node keys */ for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid) @@ -2018,63 +2017,58 @@ int bch_btree_check(struct cache_set *c) if (c->root->level == 0) return 0;
- check_state = kzalloc(sizeof(struct btree_check_state), GFP_KERNEL); - if (!check_state) - return -ENOMEM; - - check_state->c = c; - check_state->total_threads = bch_btree_chkthread_nr(); - check_state->key_idx = 0; - spin_lock_init(&check_state->idx_lock); - atomic_set(&check_state->started, 0); - atomic_set(&check_state->enough, 0); - init_waitqueue_head(&check_state->wait); + check_state.c = c; + check_state.total_threads = bch_btree_chkthread_nr(); + check_state.key_idx = 0; + spin_lock_init(&check_state.idx_lock); + atomic_set(&check_state.started, 0); + atomic_set(&check_state.enough, 0); + init_waitqueue_head(&check_state.wait);
+ rw_lock(0, c->root, c->root->level); /* * Run multiple threads to check btree nodes in parallel, - * if check_state->enough is non-zero, it means current + * if check_state.enough is non-zero, it means current * running check threads are enough, unncessary to create * more. */ - for (i = 0; i < check_state->total_threads; i++) { - /* fetch latest check_state->enough earlier */ + for (i = 0; i < check_state.total_threads; i++) { + /* fetch latest check_state.enough earlier */ smp_mb__before_atomic(); - if (atomic_read(&check_state->enough)) + if (atomic_read(&check_state.enough)) break;
- check_state->infos[i].result = 0; - check_state->infos[i].state = check_state; - snprintf(name, sizeof(name), "bch_btrchk[%u]", i); - atomic_inc(&check_state->started); + check_state.infos[i].result = 0; + check_state.infos[i].state = &check_state;
- check_state->infos[i].thread = + check_state.infos[i].thread = kthread_run(bch_btree_check_thread, - &check_state->infos[i], - name); - if (IS_ERR(check_state->infos[i].thread)) { + &check_state.infos[i], + "bch_btrchk[%d]", i); + if (IS_ERR(check_state.infos[i].thread)) { pr_err("fails to run thread bch_btrchk[%d]\n", i); for (--i; i >= 0; i--) - kthread_stop(check_state->infos[i].thread); + kthread_stop(check_state.infos[i].thread); ret = -ENOMEM; goto out; } + atomic_inc(&check_state.started); }
/* * Must wait for all threads to stop. */ - wait_event_interruptible(check_state->wait, - atomic_read(&check_state->started) == 0); + wait_event(check_state.wait, atomic_read(&check_state.started) == 0);
- for (i = 0; i < check_state->total_threads; i++) { - if (check_state->infos[i].result) { - ret = check_state->infos[i].result; + for (i = 0; i < check_state.total_threads; i++) { + if (check_state.infos[i].result) { + ret = check_state.infos[i].result; goto out; } }
out: - kfree(check_state); + rw_unlock(0, c->root); return ret; }
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h index 50482107134f..1b5fdbc0d83e 100644 --- a/drivers/md/bcache/btree.h +++ b/drivers/md/bcache/btree.h @@ -226,7 +226,7 @@ struct btree_check_info { int result; };
-#define BCH_BTR_CHKTHREAD_MAX 64 +#define BCH_BTR_CHKTHREAD_MAX 12 struct btree_check_state { struct cache_set *c; int total_threads;
From: Coly Li colyli@suse.de
mainline inclusion from v5.19-rc1 commit 4dc34ae1b45fe26e772a44379f936c72623dd407 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-------------------------------------------
Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet and can still be improved.
This patch does the following changes to improve the current parallel dirty keys iteration on the btree, - Add a read lock to the root node when multiple threads iterate the btree, to prevent the root node from being split by I/Os from other registered bcache devices. - Remove the local variable "char name[32]" and generate the kernel thread name string directly when calling kthread_run(). - Allocate "struct bch_dirty_init_state state" directly on the stack and avoid the unnecessary dynamic memory allocation for it. - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12, which is enough indeed. - Increment &state->started to count a created kernel thread only after it is successfully created. - When waiting for all dirty key counting threads to finish, use wait_event() instead of wait_event_interruptible().
With the above changes, the code is clearer, and some potential error conditions are avoided.
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/writeback.c | 62 ++++++++++++++--------------------- drivers/md/bcache/writeback.h | 2 +- 2 files changed, 26 insertions(+), 38 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 8bd9098185b6..7ca91d1eb83b 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -948,10 +948,10 @@ void bch_sectors_dirty_init(struct bcache_device *d) struct btree_iter iter; struct sectors_dirty_init op; struct cache_set *c = d->c; - struct bch_dirty_init_state *state; - char name[32]; + struct bch_dirty_init_state state;
/* Just count root keys if no leaf node */ + rw_lock(0, c->root, c->root->level); if (c->root->level == 0) { bch_btree_op_init(&op.op, -1); op.inode = d->id; @@ -961,54 +961,42 @@ void bch_sectors_dirty_init(struct bcache_device *d) for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid) sectors_dirty_init_fn(&op.op, c->root, k); + rw_unlock(0, c->root); return; }
- state = kzalloc(sizeof(struct bch_dirty_init_state), GFP_KERNEL); - if (!state) { - pr_warn("sectors dirty init failed: cannot allocate memory\n"); - return; - } - - state->c = c; - state->d = d; - state->total_threads = bch_btre_dirty_init_thread_nr(); - state->key_idx = 0; - spin_lock_init(&state->idx_lock); - atomic_set(&state->started, 0); - atomic_set(&state->enough, 0); - init_waitqueue_head(&state->wait); - - for (i = 0; i < state->total_threads; i++) { - /* Fetch latest state->enough earlier */ + state.c = c; + state.d = d; + state.total_threads = bch_btre_dirty_init_thread_nr(); + state.key_idx = 0; + spin_lock_init(&state.idx_lock); + atomic_set(&state.started, 0); + atomic_set(&state.enough, 0); + init_waitqueue_head(&state.wait); + + for (i = 0; i < state.total_threads; i++) { + /* Fetch latest state.enough earlier */ smp_mb__before_atomic(); - if (atomic_read(&state->enough)) + if (atomic_read(&state.enough)) break;
- state->infos[i].state = state; - atomic_inc(&state->started); - snprintf(name, sizeof(name), "bch_dirty_init[%d]", i); - - state->infos[i].thread = - kthread_run(bch_dirty_init_thread, - &state->infos[i], - name); - if (IS_ERR(state->infos[i].thread)) { + state.infos[i].state = &state; + state.infos[i].thread = + kthread_run(bch_dirty_init_thread, &state.infos[i], + "bch_dirtcnt[%d]", i); + if (IS_ERR(state.infos[i].thread)) { pr_err("fails to run thread bch_dirty_init[%d]\n", i); for (--i; i >= 0; i--) - kthread_stop(state->infos[i].thread); + kthread_stop(state.infos[i].thread); goto out; } + atomic_inc(&state.started); }
- /* - * Must wait for all threads to stop. - */ - wait_event_interruptible(state->wait, - atomic_read(&state->started) == 0); - out: - kfree(state); + /* Must wait for all threads to stop. */ + wait_event(state.wait, atomic_read(&state.started) == 0); + rw_unlock(0, c->root); }
void bch_cached_dev_writeback_init(struct cached_dev *dc) diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h index 02b2f9df73f6..31df716951f6 100644 --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -20,7 +20,7 @@ #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57 #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
-#define BCH_DIRTY_INIT_THRD_MAX 64 +#define BCH_DIRTY_INIT_THRD_MAX 12 /* * 14 (16384ths) is chosen here as something that each backing device * should be a reasonable fraction of the share, and not to blow up
From: Coly Li colyli@suse.de
mainline inclusion from v5.19-rc1 commit 80db4e4707e78cb22287da7d058d7274bd4cb370 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
----------------------------------------
After making bch_sectors_dirty_init() multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release the btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on the btree root node to prevent the btree from being split during the dirty sectors counting, other I/O requesters have no chance to gain the write lock, even by restarting bcache_btree().
That is to say, the incremental dirty sectors counting is incompatible with the multithreaded bch_sectors_dirty_init(). We have to choose one and drop the other.
In my testing, with 512-byte random writes, I generate 1.2T of dirty data and a btree with 400K nodes. With a single thread and incremental dirty sectors counting, it takes 30+ minutes to register the backing device. And with multithreaded dirty sectors counting, the backing device registration can be accomplished within 2 minutes.
The 30+ minutes vs. 2 minutes difference makes me decide to keep the multithreaded bch_sectors_dirty_init() and drop the incremental dirty sectors counting. This is what this patch does.
But INIT_KEYS_EACH_TIME is kept: in sectors_dirty_init_fn() the CPU is released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This avoids the watchdog reporting a bogus soft lockup warning.
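A minimal sketch of the remaining throttling (hypothetical iterator names; only cond_resched() and the batch size match the patch): yield the CPU every INIT_KEYS_EACH_TIME keys so the loop never runs long enough to trip the soft-lockup watchdog.

  #define INIT_KEYS_EACH_TIME 500000

  /* Walk a large key set, voluntarily rescheduling every
   * INIT_KEYS_EACH_TIME keys so the watchdog never sees a long
   * uninterrupted loop on this CPU. */
  static void count_dirty_keys(struct key_iter *it)
  {
      unsigned long count = 0;

      while (next_key(it)) {
          account_key(it);
          if (!(++count % INIT_KEYS_EACH_TIME))
              cond_resched();
      }
  }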
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/writeback.c | 41 +++++++++++------------------------ 1 file changed, 13 insertions(+), 28 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 7ca91d1eb83b..2b6cb308f73a 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -805,13 +805,11 @@ static int bch_writeback_thread(void *arg)
/* Init */ #define INIT_KEYS_EACH_TIME 500000 -#define INIT_KEYS_SLEEP_MS 100
struct sectors_dirty_init { struct btree_op op; unsigned int inode; size_t count; - struct bkey start; };
static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b, @@ -827,11 +825,8 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b, KEY_START(k), KEY_SIZE(k));
op->count++; - if (atomic_read(&b->c->search_inflight) && - !(op->count % INIT_KEYS_EACH_TIME)) { - bkey_copy_key(&op->start, k); - return -EAGAIN; - } + if (!(op->count % INIT_KEYS_EACH_TIME)) + cond_resched();
return MAP_CONTINUE; } @@ -846,24 +841,16 @@ static int bch_root_node_dirty_init(struct cache_set *c, bch_btree_op_init(&op.op, -1); op.inode = d->id; op.count = 0; - op.start = KEY(op.inode, 0, 0); - - do { - ret = bcache_btree(map_keys_recurse, - k, - c->root, - &op.op, - &op.start, - sectors_dirty_init_fn, - 0); - if (ret == -EAGAIN) - schedule_timeout_interruptible( - msecs_to_jiffies(INIT_KEYS_SLEEP_MS)); - else if (ret < 0) { - pr_warn("sectors dirty init failed, ret=%d!\n", ret); - break; - } - } while (ret == -EAGAIN); + + ret = bcache_btree(map_keys_recurse, + k, + c->root, + &op.op, + &KEY(op.inode, 0, 0), + sectors_dirty_init_fn, + 0); + if (ret < 0) + pr_warn("sectors dirty init failed, ret=%d!\n", ret);
return ret; } @@ -907,7 +894,6 @@ static int bch_dirty_init_thread(void *arg) goto out; } skip_nr--; - cond_resched(); }
if (p) { @@ -917,7 +903,6 @@ static int bch_dirty_init_thread(void *arg)
p = NULL; prev_idx = cur_idx; - cond_resched(); }
out: @@ -956,11 +941,11 @@ void bch_sectors_dirty_init(struct bcache_device *d) bch_btree_op_init(&op.op, -1); op.inode = d->id; op.count = 0; - op.start = KEY(op.inode, 0, 0);
for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid) sectors_dirty_init_fn(&op.op, c->root, k); + rw_unlock(0, c->root); return; }
From: Coly Li colyli@suse.de
mainline inclusion from v5.19-rc1 commit 32feee36c30ea06e38ccb8ae6e5c44c6eec790a6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------
The journal no-space deadlock was reported from time to time. Such a deadlock can happen in the following situation.
When all journal buckets are fully filled by active jsets under heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and insert them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results in a btree node split, a new journal request might be triggered. For example, if the btree grows one more level after the node split, the root node record in the cache device super block will be updated by bch_journal_meta() from bch_btree_set_root(). But if there is no space in the journal buckets, the journal replay has to wait for a new journal bucket to be reclaimed, which only happens after at least one journal bucket is replayed. This is one example of how the journal no-space deadlock happens.
The solution to avoid the deadlock is to reserve 1 journal bucket at run time, and only permit the reserved journal bucket to be used during the cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, and there is no chance for the journal no-space deadlock to happen anymore.
This patch adds a new member "bool do_reserve" to struct journal. It is initialized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization is done in run_cache_set(). At run time, when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up and try again next time to see whether there is a free journal bucket to allocate. By this method, there is always 1 journal bucket reserved at run time.
During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock.
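As a stand-alone illustration of the ring arithmetic described above (hypothetical names; the real code lives in free_journal_buckets()): cur chases discard around a ring of nbuckets journal slots, and when the reservation is active one extra slot is held back, so callers are told "no space" while one bucket is still free for registration-time use.

  /* Free slots in a circular journal of 'nbuckets' entries, where 'cur'
   * is the bucket currently being written and 'discard' is the oldest
   * bucket not yet reclaimed.  With 'do_reserve' set, one extra bucket
   * is held back from run-time allocation. */
  static unsigned int free_slots(unsigned int nbuckets, unsigned int cur,
                                 unsigned int discard, unsigned int do_reserve)
  {
      unsigned int n;

      if (cur >= discard)                /* works even if nbuckets is not a power of 2 */
          n = nbuckets + discard - cur;
      else
          n = discard - cur;

      if (n > 1 + do_reserve)
          return n - (1 + do_reserve);

      return 0;                          /* the last bucket(s) stay reserved */
  }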
Reported-by: Nikhil Kshirsagar nkshirsagar@gmail.com Signed-off-by: Coly Li colyli@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++----- drivers/md/bcache/journal.h | 2 ++ drivers/md/bcache/super.c | 1 + 3 files changed, 29 insertions(+), 5 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 61bd79babf7a..346a92c43858 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -407,6 +407,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list) return ret; }
+void bch_journal_space_reserve(struct journal *j) +{ + j->do_reserve = true; +} + /* Journalling */
static void btree_flush_write(struct cache_set *c) @@ -625,12 +630,30 @@ static void do_journal_discard(struct cache *ca) } }
+static unsigned int free_journal_buckets(struct cache_set *c) +{ + struct journal *j = &c->journal; + struct cache *ca = c->cache; + struct journal_device *ja = &c->cache->journal; + unsigned int n; + + /* In case njournal_buckets is not power of 2 */ + if (ja->cur_idx >= ja->discard_idx) + n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx; + else + n = ja->discard_idx - ja->cur_idx; + + if (n > (1 + j->do_reserve)) + return n - (1 + j->do_reserve); + + return 0; +} + static void journal_reclaim(struct cache_set *c) { struct bkey *k = &c->journal.key; struct cache *ca = c->cache; uint64_t last_seq; - unsigned int next; struct journal_device *ja = &ca->journal; atomic_t p __maybe_unused;
@@ -653,12 +676,10 @@ static void journal_reclaim(struct cache_set *c) if (c->journal.blocks_free) goto out;
- next = (ja->cur_idx + 1) % ca->sb.njournal_buckets; - /* No space available on this device */ - if (next == ja->discard_idx) + if (!free_journal_buckets(c)) goto out;
- ja->cur_idx = next; + ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets; k->ptr[0] = MAKE_PTR(0, bucket_to_sector(c, ca->sb.d[ja->cur_idx]), ca->sb.nr_this_dev); diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h index f2ea34d5f431..cd316b4a1e95 100644 --- a/drivers/md/bcache/journal.h +++ b/drivers/md/bcache/journal.h @@ -105,6 +105,7 @@ struct journal { spinlock_t lock; spinlock_t flush_write_lock; bool btree_flushing; + bool do_reserve; /* used when waiting because the journal was full */ struct closure_waitlist wait; struct closure io; @@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);
void bch_journal_free(struct cache_set *c); int bch_journal_alloc(struct cache_set *c); +void bch_journal_space_reserve(struct journal *j);
#endif /* _BCACHE_JOURNAL_H */ diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a98b46d3cd0b..b5601f200c09 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2141,6 +2141,7 @@ static int run_cache_set(struct cache_set *c)
flash_devs_run(c);
+ bch_journal_space_reserve(&c->journal); set_bit(CACHE_SET_RUNNING, &c->flags); return 0; err:
From: Coly Li colyli@suse.de
mainline inclusion from v5.19-rc1 commit 7d6b902ea0e02b2a25c480edf471cbaa4ebe6b3c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
---------------------------------------
The local variables check_state (in bch_btree_check()) and state (in bch_sectors_dirty_init()) should be fully filled with 0, because before being allocated on the stack they were dynamically allocated by kzalloc().
Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/btree.c | 1 + drivers/md/bcache/writeback.c | 1 + 2 files changed, 2 insertions(+)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 7b6f8bfef927..98daa9d200f7 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2017,6 +2017,7 @@ int bch_btree_check(struct cache_set *c) if (c->root->level == 0) return 0;
+ memset(&check_state, 0, sizeof(struct btree_check_state)); check_state.c = c; check_state.total_threads = bch_btree_chkthread_nr(); check_state.key_idx = 0; diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 2b6cb308f73a..2bf831fe4ca4 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -950,6 +950,7 @@ void bch_sectors_dirty_init(struct bcache_device *d) return; }
+ memset(&state, 0, sizeof(struct bch_dirty_init_state)); state.c = c; state.d = d; state.total_threads = bch_btre_dirty_init_thread_nr();
From: Jia-Ju Bai baijiaju1990@gmail.com
mainline inclusion from v5.19-rc1 commit 40f567bbb3b0639d2ec7d1c6ad4b1b018f80cf19 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
------------------------------------------
The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked.
Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot oslab@tsinghua.edu.cn Signed-off-by: Jia-Ju Bai baijiaju1990@gmail.com Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/request.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 19014a87a7db..c1a1bd7aa9ec 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1104,6 +1104,12 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) * which would call closure_get(&dc->disk.cl) */ ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO); + if (!ddip) { + bio->bi_status = BLK_STS_RESOURCE; + bio->bi_end_io(bio); + return; + } + ddip->d = d; /* Count on the bcache device */ ddip->start_time = part_start_io_acct(d->disk, &ddip->part, bio);
From: Coly Li colyli@suse.de
mainline inclusion from v5.19-rc1 commit a1a2d8f0162b27e85e7ce0ae6a35c96a490e0559 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue CVE: N/A
-----------------------------------------
The kworker routine update_writeback_rate() is scheduled to update the writeback rate every 5 seconds by default. Before calling __update_writeback_rate() to do the real job, the semaphore dc->writeback_lock should be held by the kworker routine.
At the same time, the bcache writeback thread routine bch_writeback_thread() also needs to hold dc->writeback_lock before flushing dirty data back to the backing device. If the dirty data set is large, it might take a very long time for bch_writeback_thread() to scan all dirty buckets and release dc->writeback_lock. In such a case update_writeback_rate() can be starved long enough that the kernel reports a soft lockup warning starting like: watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713]
Such a soft lockup condition is unnecessary, because after the writeback thread finishes its job and releases dc->writeback_lock, the kworker update_writeback_rate() may continue to work and everything is fine indeed.
This patch avoids the unnecessary soft lockup by the following method, - Add a new member to struct cached_dev - dc->rate_update_retry (0 by default). - In update_writeback_rate() call down_read_trylock(&dc->writeback_lock) first; if it fails, lock contention has happened. - If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), don't acquire the lock and reschedule the kworker for the next try. - If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, stop retrying and call down_read(&dc->writeback_lock) to wait for the lock.
By the above method, in the worst case update_writeback_rate() may retry for 1+ minutes before blocking on dc->writeback_lock by calling down_read(). For a 4TB cache device with 1TB of dirty data, 90%+ of the unnecessary soft lockup warning messages can be avoided.
When retrying to acquire dc->writeback_lock in update_writeback_rate(), of course the writeback rate cannot be updated. That is fair, because when the kworker is blocked on the lock contention of dc->writeback_lock, the writeback rate cannot be updated either.
This change follows Jens Axboe's suggestion for a clearer and simpler version.
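A compact user-space sketch of the trylock-with-retry-budget pattern described above (pthreads stand-ins with hypothetical names; the real patch keeps the counter in struct cached_dev and reschedules the delayed work instead of returning):

  #include <pthread.h>

  #define MAX_SKIPS 15

  struct rate_ctx {
      pthread_rwlock_t lock;
      int skips;
  };

  /* Periodic worker that tolerates lock contention for a bounded
   * number of runs before it finally blocks on the lock. */
  static void periodic_update(struct rate_ctx *c)
  {
      if (pthread_rwlock_tryrdlock(&c->lock) != 0) {
          if (++c->skips <= MAX_SKIPS)
              return;                           /* skip this run, retry next period */
          pthread_rwlock_rdlock(&c->lock);      /* budget exhausted: wait */
          c->skips = 0;
      }
      /* ... recompute the writeback rate under the read lock ... */
      pthread_rwlock_unlock(&c->lock);
  }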
Signed-off-by: Coly Li colyli@suse.de Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/md/bcache/bcache.h | 7 +++++++ drivers/md/bcache/writeback.c | 31 +++++++++++++++++++++---------- 2 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 2a011469af02..0563a40812fa 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -396,6 +396,13 @@ struct cached_dev { unsigned int error_limit; unsigned int offline_seconds;
+ /* + * Retry to update writeback_rate if contention happens for + * down_read(dc->writeback_lock) in update_writeback_rate() + */ +#define BCH_WBRATE_UPDATE_MAX_SKIPS 15 + unsigned int rate_update_retry; + char backing_dev_name[BDEVNAME_SIZE]; };
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 2bf831fe4ca4..3e879e985373 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -235,19 +235,27 @@ static void update_writeback_rate(struct work_struct *work) return; }
- if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { - /* - * If the whole cache set is idle, set_at_max_writeback_rate() - * will set writeback rate to a max number. Then it is - * unncessary to update writeback rate for an idle cache set - * in maximum writeback rate number(s). - */ - if (!set_at_max_writeback_rate(c, dc)) { - down_read(&dc->writeback_lock); + /* + * If the whole cache set is idle, set_at_max_writeback_rate() + * will set writeback rate to a max number. Then it is + * unncessary to update writeback rate for an idle cache set + * in maximum writeback rate number(s). + */ + if (atomic_read(&dc->has_dirty) && dc->writeback_percent && + !set_at_max_writeback_rate(c, dc)) { + do { + if (!down_read_trylock((&dc->writeback_lock))) { + dc->rate_update_retry++; + if (dc->rate_update_retry <= + BCH_WBRATE_UPDATE_MAX_SKIPS) + break; + down_read(&dc->writeback_lock); + dc->rate_update_retry = 0; + } __update_writeback_rate(dc); update_gc_after_writeback(c); up_read(&dc->writeback_lock); - } + } while (0); }
@@ -1006,6 +1014,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc) dc->writeback_rate_fp_term_high = 1000; dc->writeback_rate_i_term_inverse = 10000;
+ /* For dc->writeback_lock contention in update_writeback_rate() */ + dc->rate_update_retry = 0; + WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)); INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate); }
From: Zhengchao Shao shaozhengchao@huawei.com
hulk inclusion category: bugfix bugzilla: 186807 https://gitee.com/openeuler/kernel/issues/I5ATLD CVE: NA
--------------------------------
When we clean up a namespace, we have to notify every netdevice that the device is down. If too many devices are registered, the notification will take too much CPU time and cause a CPU soft-lockup issue. The reproduction procedure is as follows: NIFS=50 for ((i=0; i<$NIFS; i++)) do ip netns add dummy-ns$i ip netns exec dummy-ns$i ip link set lo up done
for ((j=0; j<$NIFS; j++)) do for ((i=0; i<1000; i++)) do if=eth$j$i ip netns exec dummy-ns$j ip link add $if type dummy ip netns exec dummy-ns$j ip link set $if up done done
for ((i=0; i<$NIFS; i++)) do ip netns del dummy-ns$i done The test will result in the following stack. So the cleanup work must sleep for a while when notifying device down/change.
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u8:5:288] Modules linked in: CPU: 0 PID: 288 Comm: kworker/u8:5 Tainted: G B 5.10.0+ #5 Hardware name: linux,dummy-virt (DT) Workqueue: netns cleanup_net pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--) pc : atomic_set include/asm-generic/atomic-instrumented.h:46 [inline] pc : __alloc_skb+0x268/0x450 net/core/skbuff.c:241 lr : atomic_set include/asm-generic/atomic-instrumented.h:46 [inline] lr : __alloc_skb+0x268/0x450 net/core/skbuff.c:241 sp : ffff000015607610 x29: ffff000015607610 x28: 00000000ffffffff x27: 0000000000000001 x26: ffff0000cc9400e0 x25: ffff0000c745c1be x24: 1fffe00002ac0ed0 x23: 0000000000000000 x22: ffff0000cc9400c0 x21: ffff0000c745c234 x20: ffff0000cc940000 x19: ffff0000c745c140 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 1fffe00002ac0f00 x13: 0000000000000000 x12: ffff80001992801d x11: 1fffe0001992801c x10: ffff80001992801c x9 : dfffa00000000000 x8 : ffff0000cc9400e3 x7 : 0000000000000001 x6 : ffff80001992801c x5 : ffff0000cc9400e0 x4 : dfffa00000000000 x3 : ffffa00011529a78 x2 : 0000000000000003 x1 : 0000000000000000 x0 : ffff0000cc9400e0 Call trace: atomic_set include/asm-generic/atomic-instrumented.h:46 [inline] __alloc_skb+0x268/0x450 net/core/skbuff.c:241 alloc_skb include/linux/skbuff.h:1107 [inline] nlmsg_new include/net/netlink.h:958 [inline] rtmsg_ifa+0xf4/0x1e0 net/ipv4/devinet.c:1900 __inet_del_ifa+0x328/0x650 net/ipv4/devinet.c:427 inet_del_ifa net/ipv4/devinet.c:465 [inline] inetdev_destroy net/ipv4/devinet.c:318 [inline] inetdev_event+0x2ac/0xac0 net/ipv4/devinet.c:1599 notifier_call_chain kernel/notifier.c:83 [inline] raw_notifier_call_chain+0x94/0xd0 kernel/notifier.c:410 call_netdevice_notifiers_info+0x9c/0x14c net/core/dev.c:2047 call_netdevice_notifiers_extack net/core/dev.c:2059 [inline] call_netdevice_notifiers net/core/dev.c:2073 [inline] rollback_registered_many+0x3d0/0x7dc net/core/dev.c:9558 unregister_netdevice_many+0x40/0x1b0 net/core/dev.c:10779 default_device_exit_batch+0x24c/0x2a0 net/core/dev.c:11262 ops_exit_list+0xb4/0xd0 net/core/net_namespace.c:192 cleanup_net+0x2b8/0x540 net/core/net_namespace.c:608 process_one_work+0x3ec/0xa40 kernel/workqueue.c:2279 worker_thread+0x110/0x8b0 kernel/workqueue.c:2425 kthread+0x1ac/0x1fc kernel/kthread.c:313 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1034
Signed-off-by: Zhengchao Shao shaozhengchao@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/core/dev.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c index 406ed8c7f22d..12089c484b30 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1648,6 +1648,7 @@ void dev_close_many(struct list_head *head, bool unlink) call_netdevice_notifiers(NETDEV_DOWN, dev); if (unlink) list_del_init(&dev->close_list); + cond_resched(); } } EXPORT_SYMBOL(dev_close_many); @@ -9591,6 +9592,7 @@ static void rollback_registered_many(struct list_head *head) /* Remove XPS queueing entries */ netif_reset_xps_queues_gt(dev, 0); #endif + cond_resched(); }
synchronize_net();
From: Baokun Li libaokun1@huawei.com
hulk inclusion category: bugfix bugzilla: 186777, https://gitee.com/openeuler/kernel/issues/I5C568 CVE: NA
--------------------------------
Hulk Robot reported a BUG_ON:
================================================================== kernel BUG at fs/ext4/mballoc.c:3211! [...] RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f [...] Call Trace: ext4_mb_new_blocks+0x9df/0x5d30 ext4_ext_map_blocks+0x1803/0x4d80 ext4_map_blocks+0x3a4/0x1a10 ext4_writepages+0x126d/0x2c30 do_writepages+0x7f/0x1b0 __filemap_fdatawrite_range+0x285/0x3b0 file_write_and_wait_range+0xb1/0x140 ext4_sync_file+0x1aa/0xca0 vfs_fsync_range+0xfb/0x260 do_fsync+0x48/0xa0 [...] ==================================================================
The above issue may happen as follows:
-------------------------------------
do_fsync
 vfs_fsync_range
  ext4_sync_file
   file_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      ext4_writepages
       mpage_map_and_submit_extent
        mpage_map_one_extent
         ext4_map_blocks
          ext4_mb_new_blocks
           ext4_mb_normalize_request
            >>> start + size <= ac->ac_o_ex.fe_logical
           ext4_mb_regular_allocator
            ext4_mb_simple_scan_group
             ext4_mb_use_best_found
              ext4_mb_new_preallocation
               ext4_mb_new_inode_pa
                ext4_mb_use_inode_pa
                 >>> set ac->ac_b_ex.fe_len <= 0
           ext4_mb_mark_diskspace_used
            >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
We can easily reproduce this problem with the following commands:
`fallocate -l100M disk`
`mkfs.ext4 -b 1024 -g 256 disk`
`mount disk /mnt`
`fsstress -d /mnt -l 0 -n 1000 -p 1`
The allocation size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP, so after the size is truncated, "start + size <= ac->ac_o_ex.fe_logical" may occur. Therefore, start should be the start of the group in which ac_o_ex.fe_logical lies, i.e. fe_logical rounded down to a group boundary. In addition, when fe_logical or EXT4_BLOCKS_PER_GROUP is very large, the value calculated from start_off is more accurate.
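A worked example with hypothetical numbers matching the reproducer's mkfs parameters (1 KiB blocks, 256 blocks per group): if ac_o_ex.fe_logical is 300 and normalization yields start = 0 with size capped at 256, then start + size = 256 <= 300, so the originally requested logical block is no longer covered by the normalized range and the allocator later trips the BUG_ON. Clamping start to rounddown(300, 256) = 256 keeps fe_logical inside [256, 512).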
Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/mballoc.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 110c25824a67..a2fa2b992179 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3494,6 +3494,15 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, size = size >> bsbits; start = start_off >> bsbits;
+ /* + * For tiny groups (smaller than 8MB) the chosen allocation + * alignment may be larger than group size. Make sure the + * alignment does not move allocation to a different group which + * makes mballoc fail assertions later. + */ + start = max(start, rounddown(ac->ac_o_ex.fe_logical, + (ext4_lblk_t)EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); + /* don't cover already allocated blocks in selected range */ if (ar->pleft && start <= ar->lleft) { size -= ar->lleft + 1 - start;
From: Baokun Li libaokun1@huawei.com
hulk inclusion category: bugfix bugzilla: 186777, https://gitee.com/openeuler/kernel/issues/I5C568 CVE: NA
--------------------------------
ext4_mb_normalize_request() can move the logical start of the allocated blocks to reduce fragmentation and better utilize preallocation. However, the logical block requested as the start of the allocation (ac->ac_o_ex.fe_logical) should always be covered by the allocated blocks, so check for that by changing the "and" to an "or" in the assertion: with "&&" the two sub-conditions (the range ends at or before fe_logical, and the range starts after fe_logical) can never both be true, so the sanity check could never fire.
Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/mballoc.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index a2fa2b992179..3663bd261089 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3575,7 +3575,22 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, } rcu_read_unlock();
- if (start + size <= ac->ac_o_ex.fe_logical && + /* + * In this function "start" and "size" are normalized for better + * alignment and length such that we could preallocate more blocks. + * This normalization is done such that original request of + * ac->ac_o_ex.fe_logical & fe_len should always lie within "start" and + * "size" boundaries. + * (Note fe_len can be relaxed since FS block allocation API does not + * provide gurantee on number of contiguous blocks allocation since that + * depends upon free space left, etc). + * In case of inode pa, later we use the allocated blocks + * [pa_start + fe_logical - pa_lstart, fe_len/size] from the preallocated + * range of goal/best blocks [start, size] to put it at the + * ac_o_ex.fe_logical extent of this inode. + * (See ext4_mb_use_inode_pa() for more details) + */ + if (start + size <= ac->ac_o_ex.fe_logical || start > ac->ac_o_ex.fe_logical) { ext4_msg(ac->ac_sb, KERN_ERR, "start %lu, size %lu, fe_logical %lu",
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v5.10.119 commit 33f1b4a27abced7ae0f740d2ec3040debf7c4b3c category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C3A9 CVE: CVE-2022-32296
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 190cc82489f46f9d88e73c81a47e14f80a791e1a upstream.
RFC 6056 (Recommendations for Transport-Protocol Port Randomization) provides a good summary of why source port selection needs extra care.
David Dworken reminded us that Linux implements Algorithm 3 as described in RFC 6056 section 3.3.3.
Quoting David:
In the context of the web, this creates an interesting info leak where websites can count how many TCP connections a user's computer is establishing over time. For example, this allows a website to count exactly how many subresources a third party website loaded.
This also allows:
- Distinguishing between different users behind a VPN based on distinct source port ranges.
- Tracking users over time across multiple networks.
- Covert communication channels between different browsers/browser profiles running on the same computer.
- Tracking what applications are running on a computer based on the pattern of how fast source ports are getting incremented.
Section 3.3.4 describes an enhancement that reduces an attacker's ability to use the basic information currently stored in the shared 'u32 hint'.
This change also decreases the collision rate when multiple applications need to connect() to different destinations.
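To make the scheme concrete, here is a minimal user-space sketch of the per-destination-bucket hint idea (RFC 6056 Algorithm 4) that the hunks below adopt. The hash, the in-use check, the constants and all names are illustrative assumptions, not the kernel's implementation, and the even/odd parity handling of the real code is omitted:

/* toy_port_select.c - sketch only, not kernel code */
#include <stdint.h>
#include <stdio.h>

#define PERTURB_SHIFT 8                    /* 2^8 buckets, as in this backport */
#define PERTURB_SIZE  (1u << PERTURB_SHIFT)

static uint32_t table_perturb[PERTURB_SIZE];

/* stand-in for the kernel's keyed hash of the destination */
static uint32_t toy_hash(uint32_t key)
{
	return (key * 2654435761u) >> (32 - PERTURB_SHIFT);
}

/* stand-in for "is this source port already bound?" */
static int port_in_use(uint16_t port)
{
	return port % 7 == 0;              /* arbitrary toy collisions */
}

/* pick a source port in [low, low + remaining) for one destination key */
static int pick_port(uint32_t dest_key, uint16_t low, uint32_t remaining)
{
	uint32_t index = toy_hash(dest_key);
	uint32_t offset = table_perturb[index] + dest_key;
	uint32_t i;

	offset %= remaining;
	for (i = 0; i < remaining; i++) {
		uint16_t port = low + (uint16_t)((offset + i) % remaining);

		if (!port_in_use(port)) {
			/* advance only this destination's bucket, mirroring the
			 * kernel's "table_perturb[index] += i + 2", so walking far
			 * for one destination is not observable via another one */
			table_perturb[index] += i + 2;
			return port;
		}
	}
	return -1;                         /* range exhausted */
}

int main(void)
{
	for (int n = 0; n < 3; n++)
		printf("dest A -> %d, dest B -> %d\n",
		       pick_port(0x0a000001u, 32768, 28232),
		       pick_port(0x0a000002u, 32768, 28232));
	return 0;
}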
Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: David Dworken ddworken@google.com Cc: Willem de Bruijn willemb@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Stefan Ghinea stefan.ghinea@windriver.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Conflicts: net/ipv4/inet_hashtables.c
Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/ipv4/inet_hashtables.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index fe74b45ae5b8..692399e56f69 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -711,6 +711,17 @@ void inet_unhash(struct sock *sk) } EXPORT_SYMBOL_GPL(inet_unhash);
+/* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm + * Note that we use 32bit integers (vs RFC 'short integers') + * because 2^16 is not a multiple of num_ephemeral and this + * property might be used by clever attacker. + * RFC claims using TABLE_LENGTH=10 buckets gives an improvement, + * we use 256 instead to really give more isolation and + * privacy, this only consumes 1 KB of kernel memory. + */ +#define INET_TABLE_PERTURB_SHIFT 8 +static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT]; + int __inet_hash_connect(struct inet_timewait_death_row *death_row, struct sock *sk, u64 port_offset, int (*check_established)(struct inet_timewait_death_row *, @@ -724,8 +735,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, struct inet_bind_bucket *tb; u32 remaining, offset; int ret, i, low, high; - static u32 hint; int l3mdev; + u32 index;
if (port) { head = &hinfo->bhash[inet_bhashfn(net, port, @@ -752,7 +763,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, if (likely(remaining > 1)) remaining &= ~1U;
- offset = hint + port_offset; + net_get_random_once(table_perturb, sizeof(table_perturb)); + index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT); + + offset = READ_ONCE(table_perturb[index]) + port_offset; offset %= remaining; /* In first pass we try ports of @low parity. * inet_csk_get_port() does the opposite choice. @@ -807,7 +821,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, return -EADDRNOTAVAIL;
ok: - hint += i + 2; + WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
/* Head lock still held and bh's disabled */ inet_bind_hash(sk, tb, port);
From: Willy Tarreau w@1wt.eu
mainline inclusion from mainline-v5.18-rc6 commit 4c2c8f03a5ab7cb04ec64724d7d176d00bcc91e5 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C3A9 CVE: CVE-2022-32296
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately identify a client by forcing it to emit only 40 times more connections than there are entries in the table_perturb[] table. The previous two improvements consisting in resalting the secret every 10s and adding randomness to each port selection only slightly improved the situation, and the current value of 2^8 was too small as it's not very difficult to make a client emit 10k connections in less than 10 seconds.
Thus we're increasing the perturb table from 2^8 to 2^16 so that the same precision now requires 2.6M connections, which is more difficult in this time frame and harder to hide as a background activity. The impact is that the table now uses 256 kB instead of 1 kB, which could mostly affect devices making frequent outgoing connections. However such components usually target a small set of destinations (load balancers, database clients, perf assessment tools), and in practice only a few entries will be visited, like before.
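The arithmetic behind those sizes: the table holds one u32 per bucket, so 2^16 entries x 4 bytes = 256 kB, versus the previous 2^8 entries x 4 bytes = 1 kB.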
A live test at 1 million connections per second showed no performance difference from the previous value.
Reported-by: Moshe Kol moshe.kol@mail.huji.ac.il Reported-by: Yossi Gilad yossi.gilad@mail.huji.ac.il Reported-by: Amit Klein aksecurity@gmail.com Reviewed-by: Eric Dumazet edumazet@google.com Signed-off-by: Willy Tarreau w@1wt.eu Signed-off-by: Jakub Kicinski kuba@kernel.org
Conflicts: net/ipv4/inet_hashtables.c
Signed-off-by: Baisong Zhong zhongbaisong@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- net/ipv4/inet_hashtables.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 692399e56f69..cf178d8eea81 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -715,11 +715,12 @@ EXPORT_SYMBOL_GPL(inet_unhash); * Note that we use 32bit integers (vs RFC 'short integers') * because 2^16 is not a multiple of num_ephemeral and this * property might be used by clever attacker. - * RFC claims using TABLE_LENGTH=10 buckets gives an improvement, - * we use 256 instead to really give more isolation and - * privacy, this only consumes 1 KB of kernel memory. + * RFC claims using TABLE_LENGTH=10 buckets gives an improvement, though + * attacks were since demonstrated, thus we use 65536 instead to really + * give more isolation and privacy, at the expense of 256kB of kernel + * memory. */ -#define INET_TABLE_PERTURB_SHIFT 8 +#define INET_TABLE_PERTURB_SHIFT 16 static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];
int __inet_hash_connect(struct inet_timewait_death_row *death_row,
From: Chengchang Tang tangchengchang@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit 5a32949d81cc category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=5a3...
The parameter "op_modifier" is only used for HIP06. It is useless for HIP08 and later versions. After removing HIP06, this parameter is no longer used, so remove it.
Link: https://lore.kernel.org/r/20220302064830.61706-2-liangwenpeng@huawei.com Signed-off-by: Chengchang Tang tangchengchang@huawei.com Signed-off-by: Haoyue Xu xuhaoyue1@hisilicon.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_cmd.c | 36 ++++++++----------- drivers/infiniband/hw/hns/hns_roce_cmd.h | 3 +- drivers/infiniband/hw/hns/hns_roce_cq.c | 4 +-- drivers/infiniband/hw/hns/hns_roce_device.h | 2 +- drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 26 +++++++------- .../infiniband/hw/hns/hns_roce_hw_v2_dfx.c | 2 +- drivers/infiniband/hw/hns/hns_roce_mr.c | 6 ++-- drivers/infiniband/hw/hns/hns_roce_srq.c | 4 +-- 8 files changed, 37 insertions(+), 46 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.c b/drivers/infiniband/hw/hns/hns_roce_cmd.c index 4b693d542ace..ab89e70b6f04 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.c +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.c @@ -39,25 +39,22 @@ #define CMD_MAX_NUM 32
static int hns_roce_cmd_mbox_post_hw(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, - u8 op_modifier, u16 op, u16 token, - int event) + u64 out_param, u32 in_modifier, u16 op, + u16 token, int event) { return hr_dev->hw->post_mbox(hr_dev, in_param, out_param, in_modifier, - op_modifier, op, token, event); + op, token, event); }
/* this should be called with "poll_sem" */ static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u8 op_modifier, u16 op, - unsigned int timeout) + u16 op, unsigned int timeout) { int ret;
ret = hns_roce_cmd_mbox_post_hw(hr_dev, in_param, out_param, - in_modifier, op_modifier, op, - CMD_POLL_TOKEN, 0); + in_modifier, op, CMD_POLL_TOKEN, 0); if (ret) { dev_err_ratelimited(hr_dev->dev, "failed to post mailbox 0x%x in poll mode, ret = %d.\n", @@ -70,13 +67,13 @@ static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param,
static int hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u8 op_modifier, u16 op, unsigned int timeout) + u16 op, unsigned int timeout) { int ret;
down(&hr_dev->cmd.poll_sem); ret = __hns_roce_cmd_mbox_poll(hr_dev, in_param, out_param, in_modifier, - op_modifier, op, timeout); + op, timeout); up(&hr_dev->cmd.poll_sem);
return ret; @@ -102,8 +99,7 @@ void hns_roce_cmd_event(struct hns_roce_dev *hr_dev, u16 token, u8 status,
static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u8 op_modifier, u16 op, - unsigned int timeout) + u16 op, unsigned int timeout) { struct hns_roce_cmdq *cmd = &hr_dev->cmd; struct hns_roce_cmd_context *context; @@ -125,8 +121,7 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, reinit_completion(&context->done);
ret = hns_roce_cmd_mbox_post_hw(hr_dev, in_param, out_param, - in_modifier, op_modifier, op, - context->token, 1); + in_modifier, op, context->token, 1); if (ret) { dev_err_ratelimited(dev, "failed to post mailbox 0x%x in event mode, ret = %d.\n", @@ -154,21 +149,20 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param,
static int hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u8 op_modifier, u16 op, unsigned int timeout) + u16 op, unsigned int timeout) { int ret;
down(&hr_dev->cmd.event_sem); ret = __hns_roce_cmd_mbox_wait(hr_dev, in_param, out_param, in_modifier, - op_modifier, op, timeout); + op, timeout); up(&hr_dev->cmd.event_sem);
return ret; }
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u8 op_modifier, u16 op, - unsigned int timeout) + unsigned long in_modifier, u16 op, unsigned int timeout) { bool is_busy;
@@ -178,12 +172,10 @@ int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param,
if (hr_dev->cmd.use_events) return hns_roce_cmd_mbox_wait(hr_dev, in_param, out_param, - in_modifier, op_modifier, op, - timeout); + in_modifier, op, timeout); else return hns_roce_cmd_mbox_poll(hr_dev, in_param, out_param, - in_modifier, op_modifier, op, - timeout); + in_modifier, op, timeout); }
int hns_roce_cmd_init(struct hns_roce_dev *hr_dev) diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h index 8025e7f657fa..3055996935d5 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.h +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h @@ -140,8 +140,7 @@ enum { };
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u8 op_modifier, u16 op, - unsigned int timeout); + unsigned long in_modifier, u16 op, unsigned int timeout);
struct hns_roce_cmd_mailbox * hns_roce_alloc_cmd_mailbox(struct hns_roce_dev *hr_dev); diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c index 65e1e6126d95..2b73675953dc 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cq.c +++ b/drivers/infiniband/hw/hns/hns_roce_cq.c @@ -140,7 +140,7 @@ static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) hr_dev->hw->write_cqc(hr_dev, hr_cq, mailbox->buf, mtts, dma_handle);
/* Send mailbox to hw */ - ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, 0, + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, HNS_ROCE_CMD_CREATE_CQC, HNS_ROCE_CMD_TIMEOUT_MSECS); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) { @@ -174,7 +174,7 @@ static void free_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) struct device *dev = hr_dev->dev; int ret;
- ret = hns_roce_cmd_mbox(hr_dev, 0, 0, hr_cq->cqn, 1, + ret = hns_roce_cmd_mbox(hr_dev, 0, 0, hr_cq->cqn, HNS_ROCE_CMD_DESTROY_CQC, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index 8bea6de7f955..e7d8f0b408a1 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -852,7 +852,7 @@ struct hns_roce_hw { int (*hw_init)(struct hns_roce_dev *hr_dev); void (*hw_exit)(struct hns_roce_dev *hr_dev); int (*post_mbox)(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u8 op_modifier, u16 op, + u64 out_param, u32 in_modifier, u16 op, u16 token, int event); int (*poll_mbox_done)(struct hns_roce_dev *hr_dev, unsigned int timeout); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index 587b46ddecfc..6a072e8c14d6 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -1353,7 +1353,7 @@ static int config_hem_ba_to_hw(struct hns_roce_dev *hr_dev, unsigned long obj, if (IS_ERR(mbox)) return PTR_ERR(mbox);
- ret = hns_roce_cmd_mbox(hr_dev, base_addr, mbox->dma, obj, 0, op, + ret = hns_roce_cmd_mbox(hr_dev, base_addr, mbox->dma, obj, op, HNS_ROCE_CMD_TIMEOUT_MSECS); hns_roce_free_cmd_mailbox(hr_dev, mbox); return ret; @@ -2757,7 +2757,7 @@ static void hns_roce_v2_exit(struct hns_roce_dev *hr_dev) }
static int hns_roce_mbox_post(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u8 op_modifier, + u64 out_param, u32 in_modifier, u16 op, u16 token, int event) { struct hns_roce_cmq_desc desc; @@ -2824,7 +2824,7 @@ static int v2_wait_mbox_complete(struct hns_roce_dev *hr_dev, u32 timeout, }
static int v2_post_mbox(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u8 op_modifier, + u64 out_param, u32 in_modifier, u16 op, u16 token, int event) { u8 status = 0; @@ -2842,7 +2842,7 @@ static int v2_post_mbox(struct hns_roce_dev *hr_dev, u64 in_param,
/* Post new message to mbox */ ret = hns_roce_mbox_post(hr_dev, in_param, out_param, in_modifier, - op_modifier, op, token, event); + op, token, event); if (ret) dev_err_ratelimited(hr_dev->dev, "failed to post mailbox, ret = %d.\n", ret); @@ -3968,7 +3968,7 @@ static int hns_roce_v2_clear_hem(struct hns_roce_dev *hr_dev, return PTR_ERR(mailbox);
/* configure the tag and op */ - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, obj, 0, op, + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, obj, op, HNS_ROCE_CMD_TIMEOUT_MSECS);
hns_roce_free_cmd_mailbox(hr_dev, mailbox); @@ -3993,7 +3993,7 @@ static int hns_roce_v2_qp_modify(struct hns_roce_dev *hr_dev, memcpy(mailbox->buf, context, qpc_size); memcpy(mailbox->buf + qpc_size, qpc_mask, qpc_size);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_qp->qpn, 0, + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_qp->qpn, HNS_ROCE_CMD_MODIFY_QPC, HNS_ROCE_CMD_TIMEOUT_MSECS);
@@ -5040,7 +5040,7 @@ static int hns_roce_v2_query_qpc(struct hns_roce_dev *hr_dev, if (IS_ERR(mailbox)) return PTR_ERR(mailbox);
- ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, hr_qp->qpn, 0, + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, hr_qp->qpn, HNS_ROCE_CMD_QUERY_QPC, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) @@ -5408,7 +5408,7 @@ static int hns_roce_v2_modify_srq(struct ib_srq *ibsrq, hr_reg_write(srq_context, SRQC_LIMIT_WL, srq_attr->srq_limit); hr_reg_clear(srqc_mask, SRQC_LIMIT_WL);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, srq->srqn, 0, + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, srq->srqn, HNS_ROCE_CMD_MODIFY_SRQC, HNS_ROCE_CMD_TIMEOUT_MSECS); hns_roce_free_cmd_mailbox(hr_dev, mailbox); @@ -5436,7 +5436,7 @@ static int hns_roce_v2_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr) return PTR_ERR(mailbox);
srq_context = mailbox->buf; - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, srq->srqn, 0, + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, srq->srqn, HNS_ROCE_CMD_QUERY_SRQC, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) { @@ -5478,7 +5478,7 @@ static int hns_roce_v2_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) hr_reg_write(cq_context, CQC_CQ_PERIOD, cq_period); hr_reg_clear(cqc_mask, CQC_CQ_PERIOD);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, 1, + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, HNS_ROCE_CMD_MODIFY_CQC, HNS_ROCE_CMD_TIMEOUT_MSECS); hns_roce_free_cmd_mailbox(hr_dev, mailbox); @@ -5810,11 +5810,11 @@ static void hns_roce_v2_destroy_eqc(struct hns_roce_dev *hr_dev, u32 eqn)
if (eqn < hr_dev->caps.num_comp_vectors) ret = hns_roce_cmd_mbox(hr_dev, 0, 0, eqn & HNS_ROCE_V2_EQN_M, - 0, HNS_ROCE_CMD_DESTROY_CEQC, + HNS_ROCE_CMD_DESTROY_CEQC, HNS_ROCE_CMD_TIMEOUT_MSECS); else ret = hns_roce_cmd_mbox(hr_dev, 0, 0, eqn & HNS_ROCE_V2_EQN_M, - 0, HNS_ROCE_CMD_DESTROY_AEQC, + HNS_ROCE_CMD_DESTROY_AEQC, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) dev_err(dev, "[mailbox cmd] destroy eqc(%u) failed.\n", eqn); @@ -5931,7 +5931,7 @@ static int hns_roce_v2_create_eq(struct hns_roce_dev *hr_dev, if (ret) goto err_cmd_mbox;
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq->eqn, 0, + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq->eqn, eq_cmd, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) { dev_err(hr_dev->dev, "[mailbox cmd] create eqc failed.\n"); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c index 5a97b5a0b7be..bce3a67b0b2d 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c @@ -18,7 +18,7 @@ int hns_roce_v2_query_cqc_info(struct hns_roce_dev *hr_dev, u32 cqn, return PTR_ERR(mailbox);
cq_context = mailbox->buf; - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, cqn, 0, + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, cqn, HNS_ROCE_CMD_QUERY_CQC, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) { diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index 44e0ee3b5b6c..6b942e659363 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -51,7 +51,7 @@ static int hns_roce_hw_create_mpt(struct hns_roce_dev *hr_dev, struct hns_roce_cmd_mailbox *mailbox, unsigned long mpt_index) { - return hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, mpt_index, 0, + return hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, mpt_index, HNS_ROCE_CMD_CREATE_MPT, HNS_ROCE_CMD_TIMEOUT_MSECS); } @@ -61,7 +61,7 @@ int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, unsigned long mpt_index) { return hns_roce_cmd_mbox(hr_dev, 0, mailbox ? mailbox->dma : 0, - mpt_index, !mailbox, HNS_ROCE_CMD_DESTROY_MPT, + mpt_index, HNS_ROCE_CMD_DESTROY_MPT, HNS_ROCE_CMD_TIMEOUT_MSECS); }
@@ -302,7 +302,7 @@ int hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start, u64 length, return PTR_ERR(mailbox);
mtpt_idx = key_to_hw_index(mr->key) & (hr_dev->caps.num_mtpts - 1); - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, mtpt_idx, 0, + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, mtpt_idx, HNS_ROCE_CMD_QUERY_MPT, HNS_ROCE_CMD_TIMEOUT_MSECS); if (ret) diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index 21962e547243..2c286d3bc688 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -63,7 +63,7 @@ static int hns_roce_hw_create_srq(struct hns_roce_dev *dev, struct hns_roce_cmd_mailbox *mailbox, unsigned long srq_num) { - return hns_roce_cmd_mbox(dev, mailbox->dma, 0, srq_num, 0, + return hns_roce_cmd_mbox(dev, mailbox->dma, 0, srq_num, HNS_ROCE_CMD_CREATE_SRQ, HNS_ROCE_CMD_TIMEOUT_MSECS); } @@ -73,7 +73,7 @@ static int hns_roce_hw_destroy_srq(struct hns_roce_dev *dev, unsigned long srq_num) { return hns_roce_cmd_mbox(dev, 0, mailbox ? mailbox->dma : 0, srq_num, - mailbox ? 0 : 1, HNS_ROCE_CMD_DESTROY_SRQ, + HNS_ROCE_CMD_DESTROY_SRQ, HNS_ROCE_CMD_TIMEOUT_MSECS); }
From: Chengchang Tang tangchengchang@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit 0018ed4bb07f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=001...
The function parameter "timeout" always takes the same value, so it is unnecessary to pass it on every call. Remove it and use the constant directly in the mailbox code.
Link: https://lore.kernel.org/r/20220302064830.61706-3-liangwenpeng@huawei.com Signed-off-by: Chengchang Tang tangchengchang@huawei.com Signed-off-by: Haoyue Xu xuhaoyue1@hisilicon.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_cmd.c | 22 ++++++------ drivers/infiniband/hw/hns/hns_roce_cmd.h | 2 +- drivers/infiniband/hw/hns/hns_roce_cq.c | 5 ++- drivers/infiniband/hw/hns/hns_roce_device.h | 3 +- drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 35 +++++++------------ .../infiniband/hw/hns/hns_roce_hw_v2_dfx.c | 3 +- drivers/infiniband/hw/hns/hns_roce_mr.c | 9 ++--- drivers/infiniband/hw/hns/hns_roce_srq.c | 6 ++-- 8 files changed, 34 insertions(+), 51 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.c b/drivers/infiniband/hw/hns/hns_roce_cmd.c index ab89e70b6f04..3642e9282b42 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.c +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.c @@ -49,7 +49,7 @@ static int hns_roce_cmd_mbox_post_hw(struct hns_roce_dev *hr_dev, u64 in_param, /* this should be called with "poll_sem" */ static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op, unsigned int timeout) + u16 op) { int ret;
@@ -62,18 +62,18 @@ static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, return ret; }
- return hr_dev->hw->poll_mbox_done(hr_dev, timeout); + return hr_dev->hw->poll_mbox_done(hr_dev); }
static int hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op, unsigned int timeout) + u16 op) { int ret;
down(&hr_dev->cmd.poll_sem); ret = __hns_roce_cmd_mbox_poll(hr_dev, in_param, out_param, in_modifier, - op, timeout); + op); up(&hr_dev->cmd.poll_sem);
return ret; @@ -99,7 +99,7 @@ void hns_roce_cmd_event(struct hns_roce_dev *hr_dev, u16 token, u8 status,
static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op, unsigned int timeout) + u16 op) { struct hns_roce_cmdq *cmd = &hr_dev->cmd; struct hns_roce_cmd_context *context; @@ -130,7 +130,7 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, }
if (!wait_for_completion_timeout(&context->done, - msecs_to_jiffies(timeout))) { + msecs_to_jiffies(HNS_ROCE_CMD_TIMEOUT_MSECS))) { dev_err_ratelimited(dev, "[cmd] token 0x%x mailbox 0x%x timeout.\n", context->token, op); ret = -EBUSY; @@ -149,20 +149,20 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param,
static int hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op, unsigned int timeout) + u16 op) { int ret;
down(&hr_dev->cmd.event_sem); ret = __hns_roce_cmd_mbox_wait(hr_dev, in_param, out_param, in_modifier, - op, timeout); + op); up(&hr_dev->cmd.event_sem);
return ret; }
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u16 op, unsigned int timeout) + unsigned long in_modifier, u16 op) { bool is_busy;
@@ -172,10 +172,10 @@ int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param,
if (hr_dev->cmd.use_events) return hns_roce_cmd_mbox_wait(hr_dev, in_param, out_param, - in_modifier, op, timeout); + in_modifier, op); else return hns_roce_cmd_mbox_poll(hr_dev, in_param, out_param, - in_modifier, op, timeout); + in_modifier, op); }
int hns_roce_cmd_init(struct hns_roce_dev *hr_dev) diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h index 3055996935d5..23937b106aa5 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.h +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h @@ -140,7 +140,7 @@ enum { };
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u16 op, unsigned int timeout); + unsigned long in_modifier, u16 op);
struct hns_roce_cmd_mailbox * hns_roce_alloc_cmd_mailbox(struct hns_roce_dev *hr_dev); diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c index 2b73675953dc..c2ea0f1c5d11 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cq.c +++ b/drivers/infiniband/hw/hns/hns_roce_cq.c @@ -141,7 +141,7 @@ static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq)
/* Send mailbox to hw */ ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, - HNS_ROCE_CMD_CREATE_CQC, HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_CREATE_CQC); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) { ibdev_err(ibdev, @@ -175,8 +175,7 @@ static void free_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) int ret;
ret = hns_roce_cmd_mbox(hr_dev, 0, 0, hr_cq->cqn, - HNS_ROCE_CMD_DESTROY_CQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_DESTROY_CQC); if (ret) dev_err(dev, "DESTROY_CQ failed (%d) for CQN %06lx\n", ret, hr_cq->cqn); diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index e7d8f0b408a1..4077d5e95af2 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -854,8 +854,7 @@ struct hns_roce_hw { int (*post_mbox)(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, u32 in_modifier, u16 op, u16 token, int event); - int (*poll_mbox_done)(struct hns_roce_dev *hr_dev, - unsigned int timeout); + int (*poll_mbox_done)(struct hns_roce_dev *hr_dev); bool (*chk_mbox_avail)(struct hns_roce_dev *hr_dev, bool *is_busy); int (*set_gid)(struct hns_roce_dev *hr_dev, int gid_index, const union ib_gid *gid, const struct ib_gid_attr *attr); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index 6a072e8c14d6..fc3517306f19 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -1353,8 +1353,7 @@ static int config_hem_ba_to_hw(struct hns_roce_dev *hr_dev, unsigned long obj, if (IS_ERR(mbox)) return PTR_ERR(mbox);
- ret = hns_roce_cmd_mbox(hr_dev, base_addr, mbox->dma, obj, op, - HNS_ROCE_CMD_TIMEOUT_MSECS); + ret = hns_roce_cmd_mbox(hr_dev, base_addr, mbox->dma, obj, op); hns_roce_free_cmd_mailbox(hr_dev, mbox); return ret; } @@ -2850,12 +2849,13 @@ static int v2_post_mbox(struct hns_roce_dev *hr_dev, u64 in_param, return ret; }
-static int v2_poll_mbox_done(struct hns_roce_dev *hr_dev, unsigned int timeout) +static int v2_poll_mbox_done(struct hns_roce_dev *hr_dev) { u8 status = 0; int ret;
- ret = v2_wait_mbox_complete(hr_dev, timeout, &status); + ret = v2_wait_mbox_complete(hr_dev, HNS_ROCE_CMD_TIMEOUT_MSECS, + &status); if (!ret) { if (status != MB_ST_COMPLETE_SUCC) return -EBUSY; @@ -3968,8 +3968,7 @@ static int hns_roce_v2_clear_hem(struct hns_roce_dev *hr_dev, return PTR_ERR(mailbox);
/* configure the tag and op */ - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, obj, op, - HNS_ROCE_CMD_TIMEOUT_MSECS); + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, obj, op);
hns_roce_free_cmd_mailbox(hr_dev, mailbox); return ret; @@ -3994,8 +3993,7 @@ static int hns_roce_v2_qp_modify(struct hns_roce_dev *hr_dev, memcpy(mailbox->buf + qpc_size, qpc_mask, qpc_size);
ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_qp->qpn, - HNS_ROCE_CMD_MODIFY_QPC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_MODIFY_QPC);
hns_roce_free_cmd_mailbox(hr_dev, mailbox);
@@ -5041,8 +5039,7 @@ static int hns_roce_v2_query_qpc(struct hns_roce_dev *hr_dev, return PTR_ERR(mailbox);
ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, hr_qp->qpn, - HNS_ROCE_CMD_QUERY_QPC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_QUERY_QPC); if (ret) goto out;
@@ -5409,8 +5406,7 @@ static int hns_roce_v2_modify_srq(struct ib_srq *ibsrq, hr_reg_clear(srqc_mask, SRQC_LIMIT_WL);
ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, srq->srqn, - HNS_ROCE_CMD_MODIFY_SRQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_MODIFY_SRQC); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) { ibdev_err(&hr_dev->ib_dev, @@ -5437,8 +5433,7 @@ static int hns_roce_v2_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr)
srq_context = mailbox->buf; ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, srq->srqn, - HNS_ROCE_CMD_QUERY_SRQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_QUERY_SRQC); if (ret) { ibdev_err(&hr_dev->ib_dev, "failed to process cmd of querying SRQ, ret = %d.\n", @@ -5479,8 +5474,7 @@ static int hns_roce_v2_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) hr_reg_clear(cqc_mask, CQC_CQ_PERIOD);
ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, - HNS_ROCE_CMD_MODIFY_CQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_MODIFY_CQC); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) ibdev_err(&hr_dev->ib_dev, @@ -5810,12 +5804,10 @@ static void hns_roce_v2_destroy_eqc(struct hns_roce_dev *hr_dev, u32 eqn)
if (eqn < hr_dev->caps.num_comp_vectors) ret = hns_roce_cmd_mbox(hr_dev, 0, 0, eqn & HNS_ROCE_V2_EQN_M, - HNS_ROCE_CMD_DESTROY_CEQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_DESTROY_CEQC); else ret = hns_roce_cmd_mbox(hr_dev, 0, 0, eqn & HNS_ROCE_V2_EQN_M, - HNS_ROCE_CMD_DESTROY_AEQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_DESTROY_AEQC); if (ret) dev_err(dev, "[mailbox cmd] destroy eqc(%u) failed.\n", eqn); } @@ -5931,8 +5923,7 @@ static int hns_roce_v2_create_eq(struct hns_roce_dev *hr_dev, if (ret) goto err_cmd_mbox;
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq->eqn, - eq_cmd, HNS_ROCE_CMD_TIMEOUT_MSECS); + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq->eqn, eq_cmd); if (ret) { dev_err(hr_dev->dev, "[mailbox cmd] create eqc failed.\n"); goto err_cmd_mbox; diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c index bce3a67b0b2d..107288150e3f 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c @@ -19,8 +19,7 @@ int hns_roce_v2_query_cqc_info(struct hns_roce_dev *hr_dev, u32 cqn,
cq_context = mailbox->buf; ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, cqn, - HNS_ROCE_CMD_QUERY_CQC, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_QUERY_CQC); if (ret) { dev_err(hr_dev->dev, "QUERY cqc cmd process error\n"); goto err_mailbox; diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index 6b942e659363..ece569107869 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -52,8 +52,7 @@ static int hns_roce_hw_create_mpt(struct hns_roce_dev *hr_dev, unsigned long mpt_index) { return hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, mpt_index, - HNS_ROCE_CMD_CREATE_MPT, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_CREATE_MPT); }
int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, @@ -61,8 +60,7 @@ int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, unsigned long mpt_index) { return hns_roce_cmd_mbox(hr_dev, 0, mailbox ? mailbox->dma : 0, - mpt_index, HNS_ROCE_CMD_DESTROY_MPT, - HNS_ROCE_CMD_TIMEOUT_MSECS); + mpt_index, HNS_ROCE_CMD_DESTROY_MPT); }
static int alloc_mr_key(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr) @@ -303,8 +301,7 @@ int hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start, u64 length,
mtpt_idx = key_to_hw_index(mr->key) & (hr_dev->caps.num_mtpts - 1); ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, mtpt_idx, - HNS_ROCE_CMD_QUERY_MPT, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_QUERY_MPT); if (ret) goto free_cmd_mbox;
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index 2c286d3bc688..b39965db3184 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -64,8 +64,7 @@ static int hns_roce_hw_create_srq(struct hns_roce_dev *dev, unsigned long srq_num) { return hns_roce_cmd_mbox(dev, mailbox->dma, 0, srq_num, - HNS_ROCE_CMD_CREATE_SRQ, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_CREATE_SRQ); }
static int hns_roce_hw_destroy_srq(struct hns_roce_dev *dev, @@ -73,8 +72,7 @@ static int hns_roce_hw_destroy_srq(struct hns_roce_dev *dev, unsigned long srq_num) { return hns_roce_cmd_mbox(dev, 0, mailbox ? mailbox->dma : 0, srq_num, - HNS_ROCE_CMD_DESTROY_SRQ, - HNS_ROCE_CMD_TIMEOUT_MSECS); + HNS_ROCE_CMD_DESTROY_SRQ); }
static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq)
From: Wenpeng Liang liangwenpeng@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit 479dc93ba75d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=479...
The parameter "out_param" of the mailbox is always null when the context is destroyed. So remove the function parameter "mailbox".
Link: https://lore.kernel.org/r/20220302064830.61706-4-liangwenpeng@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_device.h | 1 - drivers/infiniband/hw/hns/hns_roce_mr.c | 11 +++++------ drivers/infiniband/hw/hns/hns_roce_srq.c | 6 ++---- 3 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index 4077d5e95af2..d02f235ab820 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -1140,7 +1140,6 @@ int hns_roce_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents, unsigned int *sg_offset); int hns_roce_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata); int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, - struct hns_roce_cmd_mailbox *mailbox, unsigned long mpt_index); unsigned long key_to_hw_index(u32 key);
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index ece569107869..d80a06cb8aa1 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -56,11 +56,10 @@ static int hns_roce_hw_create_mpt(struct hns_roce_dev *hr_dev, }
int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, - struct hns_roce_cmd_mailbox *mailbox, unsigned long mpt_index) { - return hns_roce_cmd_mbox(hr_dev, 0, mailbox ? mailbox->dma : 0, - mpt_index, HNS_ROCE_CMD_DESTROY_MPT); + return hns_roce_cmd_mbox(hr_dev, 0, 0, mpt_index, + HNS_ROCE_CMD_DESTROY_MPT); }
static int alloc_mr_key(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr) @@ -142,7 +141,7 @@ static void hns_roce_mr_free(struct hns_roce_dev *hr_dev, int ret;
if (mr->enabled) { - ret = hns_roce_hw_destroy_mpt(hr_dev, NULL, + ret = hns_roce_hw_destroy_mpt(hr_dev, key_to_hw_index(mr->key) & (hr_dev->caps.num_mtpts - 1)); if (ret) @@ -305,7 +304,7 @@ int hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start, u64 length, if (ret) goto free_cmd_mbox;
- ret = hns_roce_hw_destroy_mpt(hr_dev, NULL, mtpt_idx); + ret = hns_roce_hw_destroy_mpt(hr_dev, mtpt_idx); if (ret) ibdev_warn(ib_dev, "failed to destroy MPT, ret = %d.\n", ret);
@@ -474,7 +473,7 @@ static void hns_roce_mw_free(struct hns_roce_dev *hr_dev, int ret;
if (mw->enabled) { - ret = hns_roce_hw_destroy_mpt(hr_dev, NULL, + ret = hns_roce_hw_destroy_mpt(hr_dev, key_to_hw_index(mw->rkey) & (hr_dev->caps.num_mtpts - 1)); if (ret) diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index b39965db3184..cf24c8a23983 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -68,11 +68,9 @@ static int hns_roce_hw_create_srq(struct hns_roce_dev *dev, }
static int hns_roce_hw_destroy_srq(struct hns_roce_dev *dev, - struct hns_roce_cmd_mailbox *mailbox, unsigned long srq_num) { - return hns_roce_cmd_mbox(dev, 0, mailbox ? mailbox->dma : 0, srq_num, - HNS_ROCE_CMD_DESTROY_SRQ); + return hns_roce_cmd_mbox(dev, 0, 0, srq_num, HNS_ROCE_CMD_DESTROY_SRQ); }
static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) @@ -144,7 +142,7 @@ static void free_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) struct hns_roce_srq_table *srq_table = &hr_dev->srq_table; int ret;
- ret = hns_roce_hw_destroy_srq(hr_dev, NULL, srq->srqn); + ret = hns_roce_hw_destroy_srq(hr_dev, srq->srqn); if (ret) dev_err(hr_dev->dev, "DESTROY_SRQ failed (%d) for SRQN %06lx\n", ret, srq->srqn);
From: Wenpeng Liang liangwenpeng@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit e50cda2b9f84 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=e50...
The "op" field of the mailbox occupies 8 bits, so the parameter "op" should be of type u8.
Link: https://lore.kernel.org/r/20220302064830.61706-5-liangwenpeng@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_cmd.c | 12 ++++---- drivers/infiniband/hw/hns/hns_roce_cmd.h | 2 +- drivers/infiniband/hw/hns/hns_roce_device.h | 6 ++-- drivers/infiniband/hw/hns/hns_roce_hem.c | 4 +-- drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 33 ++++++++++----------- 5 files changed, 28 insertions(+), 29 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.c b/drivers/infiniband/hw/hns/hns_roce_cmd.c index 3642e9282b42..df11acd8030e 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.c +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.c @@ -39,7 +39,7 @@ #define CMD_MAX_NUM 32
static int hns_roce_cmd_mbox_post_hw(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u16 op, + u64 out_param, u32 in_modifier, u8 op, u16 token, int event) { return hr_dev->hw->post_mbox(hr_dev, in_param, out_param, in_modifier, @@ -49,7 +49,7 @@ static int hns_roce_cmd_mbox_post_hw(struct hns_roce_dev *hr_dev, u64 in_param, /* this should be called with "poll_sem" */ static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op) + u8 op) { int ret;
@@ -67,7 +67,7 @@ static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param,
static int hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op) + u8 op) { int ret;
@@ -99,7 +99,7 @@ void hns_roce_cmd_event(struct hns_roce_dev *hr_dev, u16 token, u8 status,
static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op) + u8 op) { struct hns_roce_cmdq *cmd = &hr_dev->cmd; struct hns_roce_cmd_context *context; @@ -149,7 +149,7 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param,
static int hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, unsigned long in_modifier, - u16 op) + u8 op) { int ret;
@@ -162,7 +162,7 @@ static int hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, }
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u16 op) + unsigned long in_modifier, u8 op) { bool is_busy;
diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h index 23937b106aa5..7928790061b8 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.h +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h @@ -140,7 +140,7 @@ enum { };
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u16 op); + unsigned long in_modifier, u8 op);
struct hns_roce_cmd_mailbox * hns_roce_alloc_cmd_mailbox(struct hns_roce_dev *hr_dev); diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index d02f235ab820..d353610dd529 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -852,7 +852,7 @@ struct hns_roce_hw { int (*hw_init)(struct hns_roce_dev *hr_dev); void (*hw_exit)(struct hns_roce_dev *hr_dev); int (*post_mbox)(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u16 op, + u64 out_param, u32 in_modifier, u8 op, u16 token, int event); int (*poll_mbox_done)(struct hns_roce_dev *hr_dev); bool (*chk_mbox_avail)(struct hns_roce_dev *hr_dev, bool *is_busy); @@ -872,10 +872,10 @@ struct hns_roce_hw { struct hns_roce_cq *hr_cq, void *mb_buf, u64 *mtts, dma_addr_t dma_handle); int (*set_hem)(struct hns_roce_dev *hr_dev, - struct hns_roce_hem_table *table, int obj, int step_idx); + struct hns_roce_hem_table *table, int obj, u32 step_idx); int (*clear_hem)(struct hns_roce_dev *hr_dev, struct hns_roce_hem_table *table, int obj, - int step_idx); + u32 step_idx); int (*modify_qp)(struct ib_qp *ibqp, const struct ib_qp_attr *attr, int attr_mask, enum ib_qp_state cur_state, enum ib_qp_state new_state); diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c index 7cc45a332fc0..a5f7b8775756 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hem.c +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c @@ -488,7 +488,7 @@ static int set_mhop_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem_index *index) { struct ib_device *ibdev = &hr_dev->ib_dev; - int step_idx; + u32 step_idx; int ret = 0;
if (index->inited & HEM_INDEX_L0) { @@ -618,7 +618,7 @@ static void clear_mhop_hem(struct hns_roce_dev *hr_dev, struct ib_device *ibdev = &hr_dev->ib_dev; u32 hop_num = mhop->hop_num; u32 chunk_ba_num; - int step_idx; + u32 step_idx;
index->inited = HEM_INDEX_BUF; chunk_ba_num = mhop->bt_chunk_size / BA_BYTE_LEN; diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index fc3517306f19..60eea430bf70 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -1345,7 +1345,7 @@ static int hns_roce_cmq_send(struct hns_roce_dev *hr_dev, }
static int config_hem_ba_to_hw(struct hns_roce_dev *hr_dev, unsigned long obj, - dma_addr_t base_addr, u16 op) + dma_addr_t base_addr, u8 op) { struct hns_roce_cmd_mailbox *mbox = hns_roce_alloc_cmd_mailbox(hr_dev); int ret; @@ -2757,7 +2757,7 @@ static void hns_roce_v2_exit(struct hns_roce_dev *hr_dev)
static int hns_roce_mbox_post(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, u32 in_modifier, - u16 op, u16 token, int event) + u8 op, u16 token, int event) { struct hns_roce_cmq_desc desc; struct hns_roce_post_mbox *mb = (struct hns_roce_post_mbox *)desc.data; @@ -2824,7 +2824,7 @@ static int v2_wait_mbox_complete(struct hns_roce_dev *hr_dev, u32 timeout,
static int v2_post_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, u32 in_modifier, - u16 op, u16 token, int event) + u8 op, u16 token, int event) { u8 status = 0; int ret; @@ -3794,9 +3794,9 @@ static int hns_roce_v2_poll_cq(struct ib_cq *ibcq, int num_entries, }
static int get_op_for_set_hem(struct hns_roce_dev *hr_dev, u32 type, - int step_idx, u16 *mbox_op) + u32 step_idx, u8 *mbox_op) { - u16 op; + u8 op;
switch (type) { case HEM_TYPE_QPC: @@ -3848,10 +3848,10 @@ static int config_gmv_ba_to_hw(struct hns_roce_dev *hr_dev, unsigned long obj, }
static int set_hem_to_hw(struct hns_roce_dev *hr_dev, int obj, - dma_addr_t base_addr, u32 hem_type, int step_idx) + dma_addr_t base_addr, u32 hem_type, u32 step_idx) { int ret; - u16 op; + u8 op;
if (unlikely(hem_type == HEM_TYPE_GMV)) return config_gmv_ba_to_hw(hr_dev, obj, base_addr); @@ -3868,7 +3868,7 @@ static int set_hem_to_hw(struct hns_roce_dev *hr_dev, int obj,
static int hns_roce_v2_set_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem_table *table, int obj, - int step_idx) + u32 step_idx) { struct hns_roce_hem_iter iter; struct hns_roce_hem_mhop mhop; @@ -3927,12 +3927,12 @@ static int hns_roce_v2_set_hem(struct hns_roce_dev *hr_dev,
static int hns_roce_v2_clear_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem_table *table, int obj, - int step_idx) + u32 step_idx) { - struct device *dev = hr_dev->dev; struct hns_roce_cmd_mailbox *mailbox; + struct device *dev = hr_dev->dev; + u8 op = 0xff; int ret; - u16 op = 0xff;
if (!hns_roce_check_whether_mhop(hr_dev, table->type)) return 0; @@ -5904,8 +5904,7 @@ static int alloc_eq_buf(struct hns_roce_dev *hr_dev, struct hns_roce_eq *eq) }
static int hns_roce_v2_create_eq(struct hns_roce_dev *hr_dev, - struct hns_roce_eq *eq, - unsigned int eq_cmd) + struct hns_roce_eq *eq, u8 eq_cmd) { struct hns_roce_cmd_mailbox *mailbox; int ret; @@ -6034,14 +6033,14 @@ static int hns_roce_v2_init_eq_table(struct hns_roce_dev *hr_dev) struct hns_roce_eq_table *eq_table = &hr_dev->eq_table; struct device *dev = hr_dev->dev; struct hns_roce_eq *eq; - unsigned int eq_cmd; - int irq_num; - int eq_num; int other_num; int comp_num; int aeq_num; - int i; + int irq_num; + int eq_num; + u8 eq_cmd; int ret; + int i;
other_num = hr_dev->caps.num_other_vectors; comp_num = hr_dev->caps.num_comp_vectors;
From: Chengchang Tang tangchengchang@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit 162e29feabba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=162...
The current mailbox functions have too many parameters, making the code difficult to maintain. So construct a new structure mbox_msg to pass the information needed by the mailbox.
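For reference, the fields populated at the call sites in this patch (in_param, out_param, cmd, tag, token, event_en) suggest a structure along these lines; this is a sketch inferred from the hunks below, not a verbatim copy of the header change:

struct hns_roce_mbox_msg {
	u64 in_param;
	u64 out_param;
	u8 cmd;
	u32 tag;
	u16 token;
	u8 event_en;
};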
Link: https://lore.kernel.org/r/20220302064830.61706-6-liangwenpeng@huawei.com Signed-off-by: Chengchang Tang tangchengchang@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_cmd.c | 73 ++++++------ drivers/infiniband/hw/hns/hns_roce_cmd.h | 2 +- drivers/infiniband/hw/hns/hns_roce_cq.c | 9 +- drivers/infiniband/hw/hns/hns_roce_device.h | 14 ++- drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 111 +++++++++--------- .../infiniband/hw/hns/hns_roce_hw_v2_dfx.c | 4 +- drivers/infiniband/hw/hns/hns_roce_mr.c | 12 +- drivers/infiniband/hw/hns/hns_roce_srq.c | 6 +- 8 files changed, 119 insertions(+), 112 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.c b/drivers/infiniband/hw/hns/hns_roce_cmd.c index df11acd8030e..7e37066b272d 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.c +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.c @@ -38,42 +38,36 @@ #define CMD_POLL_TOKEN 0xffff #define CMD_MAX_NUM 32
-static int hns_roce_cmd_mbox_post_hw(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u8 op, - u16 token, int event) +static int hns_roce_cmd_mbox_post_hw(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { - return hr_dev->hw->post_mbox(hr_dev, in_param, out_param, in_modifier, - op, token, event); + return hr_dev->hw->post_mbox(hr_dev, mbox_msg); }
/* this should be called with "poll_sem" */ -static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, unsigned long in_modifier, - u8 op) +static int __hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { int ret;
- ret = hns_roce_cmd_mbox_post_hw(hr_dev, in_param, out_param, - in_modifier, op, CMD_POLL_TOKEN, 0); + ret = hns_roce_cmd_mbox_post_hw(hr_dev, mbox_msg); if (ret) { dev_err_ratelimited(hr_dev->dev, "failed to post mailbox 0x%x in poll mode, ret = %d.\n", - op, ret); + mbox_msg->cmd, ret); return ret; }
return hr_dev->hw->poll_mbox_done(hr_dev); }
-static int hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, unsigned long in_modifier, - u8 op) +static int hns_roce_cmd_mbox_poll(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { int ret;
down(&hr_dev->cmd.poll_sem); - ret = __hns_roce_cmd_mbox_poll(hr_dev, in_param, out_param, in_modifier, - op); + ret = __hns_roce_cmd_mbox_poll(hr_dev, mbox_msg); up(&hr_dev->cmd.poll_sem);
return ret; @@ -97,9 +91,8 @@ void hns_roce_cmd_event(struct hns_roce_dev *hr_dev, u16 token, u8 status, complete(&context->done); }
-static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, unsigned long in_modifier, - u8 op) +static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { struct hns_roce_cmdq *cmd = &hr_dev->cmd; struct hns_roce_cmd_context *context; @@ -120,19 +113,19 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param,
reinit_completion(&context->done);
- ret = hns_roce_cmd_mbox_post_hw(hr_dev, in_param, out_param, - in_modifier, op, context->token, 1); + mbox_msg->token = context->token; + ret = hns_roce_cmd_mbox_post_hw(hr_dev, mbox_msg); if (ret) { dev_err_ratelimited(dev, "failed to post mailbox 0x%x in event mode, ret = %d.\n", - op, ret); + mbox_msg->cmd, ret); goto out; }
if (!wait_for_completion_timeout(&context->done, msecs_to_jiffies(HNS_ROCE_CMD_TIMEOUT_MSECS))) { dev_err_ratelimited(dev, "[cmd] token 0x%x mailbox 0x%x timeout.\n", - context->token, op); + context->token, mbox_msg->cmd); ret = -EBUSY; goto out; } @@ -140,42 +133,50 @@ static int __hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, ret = context->result; if (ret) dev_err_ratelimited(dev, "[cmd] token 0x%x mailbox 0x%x error %d.\n", - context->token, op, ret); + context->token, mbox_msg->cmd, ret);
out: context->busy = 0; return ret; }
-static int hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, unsigned long in_modifier, - u8 op) +static int hns_roce_cmd_mbox_wait(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { int ret;
down(&hr_dev->cmd.event_sem); - ret = __hns_roce_cmd_mbox_wait(hr_dev, in_param, out_param, in_modifier, - op); + ret = __hns_roce_cmd_mbox_wait(hr_dev, mbox_msg); up(&hr_dev->cmd.event_sem);
return ret; }
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u8 op) + u8 cmd, unsigned long tag) { + struct hns_roce_mbox_msg mbox_msg = {}; bool is_busy;
if (hr_dev->hw->chk_mbox_avail) if (!hr_dev->hw->chk_mbox_avail(hr_dev, &is_busy)) return is_busy ? -EBUSY : 0;
- if (hr_dev->cmd.use_events) - return hns_roce_cmd_mbox_wait(hr_dev, in_param, out_param, - in_modifier, op); - else - return hns_roce_cmd_mbox_poll(hr_dev, in_param, out_param, - in_modifier, op); + mbox_msg.in_param = in_param; + mbox_msg.out_param = out_param; + mbox_msg.cmd = cmd; + mbox_msg.tag = tag; + + if (hr_dev->cmd.use_events) { + mbox_msg.event_en = 1; + + return hns_roce_cmd_mbox_wait(hr_dev, &mbox_msg); + } else { + mbox_msg.event_en = 0; + mbox_msg.token = CMD_POLL_TOKEN; + + return hns_roce_cmd_mbox_poll(hr_dev, &mbox_msg); + } }
int hns_roce_cmd_init(struct hns_roce_dev *hr_dev) diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h index 7928790061b8..759da8981c71 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.h +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h @@ -140,7 +140,7 @@ enum { };
int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param, - unsigned long in_modifier, u8 op); + u8 cmd, unsigned long tag);
struct hns_roce_cmd_mailbox * hns_roce_alloc_cmd_mailbox(struct hns_roce_dev *hr_dev); diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c index c2ea0f1c5d11..0ef503c5e485 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cq.c +++ b/drivers/infiniband/hw/hns/hns_roce_cq.c @@ -139,9 +139,8 @@ static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq)
hr_dev->hw->write_cqc(hr_dev, hr_cq, mailbox->buf, mtts, dma_handle);
- /* Send mailbox to hw */ - ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, - HNS_ROCE_CMD_CREATE_CQC); + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, + HNS_ROCE_CMD_CREATE_CQC, hr_cq->cqn); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) { ibdev_err(ibdev, @@ -174,8 +173,8 @@ static void free_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) struct device *dev = hr_dev->dev; int ret;
- ret = hns_roce_cmd_mbox(hr_dev, 0, 0, hr_cq->cqn, - HNS_ROCE_CMD_DESTROY_CQC); + ret = hns_roce_cmd_mbox(hr_dev, 0, 0, HNS_ROCE_CMD_DESTROY_CQC, + hr_cq->cqn); if (ret) dev_err(dev, "DESTROY_CQ failed (%d) for CQN %06lx\n", ret, hr_cq->cqn); diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index d353610dd529..5ed7e00bc90b 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -561,6 +561,15 @@ struct hns_roce_cmd_mailbox { dma_addr_t dma; };
+struct hns_roce_mbox_msg { + u64 in_param; + u64 out_param; + u8 cmd; + u32 tag; + u16 token; + u8 event_en; +}; + struct hns_roce_dev;
struct hns_roce_rinl_sge { @@ -851,9 +860,8 @@ struct hns_roce_hw { int (*hw_profile)(struct hns_roce_dev *hr_dev); int (*hw_init)(struct hns_roce_dev *hr_dev); void (*hw_exit)(struct hns_roce_dev *hr_dev); - int (*post_mbox)(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, u8 op, - u16 token, int event); + int (*post_mbox)(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg); int (*poll_mbox_done)(struct hns_roce_dev *hr_dev); bool (*chk_mbox_avail)(struct hns_roce_dev *hr_dev, bool *is_busy); int (*set_gid)(struct hns_roce_dev *hr_dev, int gid_index, diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index 60eea430bf70..e68c0034a66d 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -1344,16 +1344,17 @@ static int hns_roce_cmq_send(struct hns_roce_dev *hr_dev, return ret; }
-static int config_hem_ba_to_hw(struct hns_roce_dev *hr_dev, unsigned long obj, - dma_addr_t base_addr, u8 op) +static int config_hem_ba_to_hw(struct hns_roce_dev *hr_dev, + dma_addr_t base_addr, u8 cmd, unsigned long tag) { - struct hns_roce_cmd_mailbox *mbox = hns_roce_alloc_cmd_mailbox(hr_dev); + struct hns_roce_cmd_mailbox *mbox; int ret;
+ mbox = hns_roce_alloc_cmd_mailbox(hr_dev); if (IS_ERR(mbox)) return PTR_ERR(mbox);
- ret = hns_roce_cmd_mbox(hr_dev, base_addr, mbox->dma, obj, op); + ret = hns_roce_cmd_mbox(hr_dev, base_addr, mbox->dma, cmd, tag); hns_roce_free_cmd_mailbox(hr_dev, mbox); return ret; } @@ -2755,21 +2756,21 @@ static void hns_roce_v2_exit(struct hns_roce_dev *hr_dev) free_dip_list(hr_dev); }
-static int hns_roce_mbox_post(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, - u8 op, u16 token, int event) +static int hns_roce_mbox_post(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { struct hns_roce_cmq_desc desc; struct hns_roce_post_mbox *mb = (struct hns_roce_post_mbox *)desc.data;
hns_roce_cmq_setup_basic_desc(&desc, HNS_ROCE_OPC_POST_MB, false);
- mb->in_param_l = cpu_to_le32(in_param); - mb->in_param_h = cpu_to_le32(in_param >> 32); - mb->out_param_l = cpu_to_le32(out_param); - mb->out_param_h = cpu_to_le32(out_param >> 32); - mb->cmd_tag = cpu_to_le32(in_modifier << 8 | op); - mb->token_event_en = cpu_to_le32(event << 16 | token); + mb->in_param_l = cpu_to_le32(mbox_msg->in_param); + mb->in_param_h = cpu_to_le32(mbox_msg->in_param >> 32); + mb->out_param_l = cpu_to_le32(mbox_msg->out_param); + mb->out_param_h = cpu_to_le32(mbox_msg->out_param >> 32); + mb->cmd_tag = cpu_to_le32(mbox_msg->tag << 8 | mbox_msg->cmd); + mb->token_event_en = cpu_to_le32(mbox_msg->event_en << 16 | + mbox_msg->token);
return hns_roce_cmq_send(hr_dev, &desc, 1); } @@ -2822,9 +2823,8 @@ static int v2_wait_mbox_complete(struct hns_roce_dev *hr_dev, u32 timeout, return ret; }
-static int v2_post_mbox(struct hns_roce_dev *hr_dev, u64 in_param, - u64 out_param, u32 in_modifier, - u8 op, u16 token, int event) +static int v2_post_mbox(struct hns_roce_dev *hr_dev, + struct hns_roce_mbox_msg *mbox_msg) { u8 status = 0; int ret; @@ -2840,8 +2840,7 @@ static int v2_post_mbox(struct hns_roce_dev *hr_dev, u64 in_param, }
/* Post new message to mbox */ - ret = hns_roce_mbox_post(hr_dev, in_param, out_param, in_modifier, - op, token, event); + ret = hns_roce_mbox_post(hr_dev, mbox_msg); if (ret) dev_err_ratelimited(hr_dev->dev, "failed to post mailbox, ret = %d.\n", ret); @@ -3794,38 +3793,38 @@ static int hns_roce_v2_poll_cq(struct ib_cq *ibcq, int num_entries, }
static int get_op_for_set_hem(struct hns_roce_dev *hr_dev, u32 type, - u32 step_idx, u8 *mbox_op) + u32 step_idx, u8 *mbox_cmd) { - u8 op; + u8 cmd;
switch (type) { case HEM_TYPE_QPC: - op = HNS_ROCE_CMD_WRITE_QPC_BT0; + cmd = HNS_ROCE_CMD_WRITE_QPC_BT0; break; case HEM_TYPE_MTPT: - op = HNS_ROCE_CMD_WRITE_MPT_BT0; + cmd = HNS_ROCE_CMD_WRITE_MPT_BT0; break; case HEM_TYPE_CQC: - op = HNS_ROCE_CMD_WRITE_CQC_BT0; + cmd = HNS_ROCE_CMD_WRITE_CQC_BT0; break; case HEM_TYPE_SRQC: - op = HNS_ROCE_CMD_WRITE_SRQC_BT0; + cmd = HNS_ROCE_CMD_WRITE_SRQC_BT0; break; case HEM_TYPE_SCCC: - op = HNS_ROCE_CMD_WRITE_SCCC_BT0; + cmd = HNS_ROCE_CMD_WRITE_SCCC_BT0; break; case HEM_TYPE_QPC_TIMER: - op = HNS_ROCE_CMD_WRITE_QPC_TIMER_BT0; + cmd = HNS_ROCE_CMD_WRITE_QPC_TIMER_BT0; break; case HEM_TYPE_CQC_TIMER: - op = HNS_ROCE_CMD_WRITE_CQC_TIMER_BT0; + cmd = HNS_ROCE_CMD_WRITE_CQC_TIMER_BT0; break; default: dev_warn(hr_dev->dev, "failed to check hem type %u.\n", type); return -EINVAL; }
- *mbox_op = op + step_idx; + *mbox_cmd = cmd + step_idx;
return 0; } @@ -3851,7 +3850,7 @@ static int set_hem_to_hw(struct hns_roce_dev *hr_dev, int obj, dma_addr_t base_addr, u32 hem_type, u32 step_idx) { int ret; - u8 op; + u8 cmd;
if (unlikely(hem_type == HEM_TYPE_GMV)) return config_gmv_ba_to_hw(hr_dev, obj, base_addr); @@ -3859,11 +3858,11 @@ static int set_hem_to_hw(struct hns_roce_dev *hr_dev, int obj, if (unlikely(hem_type == HEM_TYPE_SCCC && step_idx)) return 0;
- ret = get_op_for_set_hem(hr_dev, hem_type, step_idx, &op); + ret = get_op_for_set_hem(hr_dev, hem_type, step_idx, &cmd); if (ret < 0) return ret;
- return config_hem_ba_to_hw(hr_dev, obj, base_addr, op); + return config_hem_ba_to_hw(hr_dev, base_addr, cmd, obj); }
static int hns_roce_v2_set_hem(struct hns_roce_dev *hr_dev, @@ -3926,12 +3925,12 @@ static int hns_roce_v2_set_hem(struct hns_roce_dev *hr_dev, }
static int hns_roce_v2_clear_hem(struct hns_roce_dev *hr_dev, - struct hns_roce_hem_table *table, int obj, - u32 step_idx) + struct hns_roce_hem_table *table, + int tag, u32 step_idx) { struct hns_roce_cmd_mailbox *mailbox; struct device *dev = hr_dev->dev; - u8 op = 0xff; + u8 cmd = 0xff; int ret;
if (!hns_roce_check_whether_mhop(hr_dev, table->type)) @@ -3939,16 +3938,16 @@ static int hns_roce_v2_clear_hem(struct hns_roce_dev *hr_dev,
switch (table->type) { case HEM_TYPE_QPC: - op = HNS_ROCE_CMD_DESTROY_QPC_BT0; + cmd = HNS_ROCE_CMD_DESTROY_QPC_BT0; break; case HEM_TYPE_MTPT: - op = HNS_ROCE_CMD_DESTROY_MPT_BT0; + cmd = HNS_ROCE_CMD_DESTROY_MPT_BT0; break; case HEM_TYPE_CQC: - op = HNS_ROCE_CMD_DESTROY_CQC_BT0; + cmd = HNS_ROCE_CMD_DESTROY_CQC_BT0; break; case HEM_TYPE_SRQC: - op = HNS_ROCE_CMD_DESTROY_SRQC_BT0; + cmd = HNS_ROCE_CMD_DESTROY_SRQC_BT0; break; case HEM_TYPE_SCCC: case HEM_TYPE_QPC_TIMER: @@ -3961,14 +3960,13 @@ static int hns_roce_v2_clear_hem(struct hns_roce_dev *hr_dev, return 0; }
- op += step_idx; + cmd += step_idx;
mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); if (IS_ERR(mailbox)) return PTR_ERR(mailbox);
- /* configure the tag and op */ - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, obj, op); + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, cmd, tag);
hns_roce_free_cmd_mailbox(hr_dev, mailbox); return ret; @@ -3992,8 +3990,8 @@ static int hns_roce_v2_qp_modify(struct hns_roce_dev *hr_dev, memcpy(mailbox->buf, context, qpc_size); memcpy(mailbox->buf + qpc_size, qpc_mask, qpc_size);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_qp->qpn, - HNS_ROCE_CMD_MODIFY_QPC); + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, + HNS_ROCE_CMD_MODIFY_QPC, hr_qp->qpn);
hns_roce_free_cmd_mailbox(hr_dev, mailbox);
@@ -5038,8 +5036,8 @@ static int hns_roce_v2_query_qpc(struct hns_roce_dev *hr_dev, if (IS_ERR(mailbox)) return PTR_ERR(mailbox);
- ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, hr_qp->qpn, - HNS_ROCE_CMD_QUERY_QPC); + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, HNS_ROCE_CMD_QUERY_QPC, + hr_qp->qpn); if (ret) goto out;
@@ -5405,8 +5403,8 @@ static int hns_roce_v2_modify_srq(struct ib_srq *ibsrq, hr_reg_write(srq_context, SRQC_LIMIT_WL, srq_attr->srq_limit); hr_reg_clear(srqc_mask, SRQC_LIMIT_WL);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, srq->srqn, - HNS_ROCE_CMD_MODIFY_SRQC); + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, + HNS_ROCE_CMD_MODIFY_SRQC, srq->srqn); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) { ibdev_err(&hr_dev->ib_dev, @@ -5432,8 +5430,8 @@ static int hns_roce_v2_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr) return PTR_ERR(mailbox);
srq_context = mailbox->buf; - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, srq->srqn, - HNS_ROCE_CMD_QUERY_SRQC); + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, + HNS_ROCE_CMD_QUERY_SRQC, srq->srqn); if (ret) { ibdev_err(&hr_dev->ib_dev, "failed to process cmd of querying SRQ, ret = %d.\n", @@ -5473,8 +5471,8 @@ static int hns_roce_v2_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) hr_reg_write(cq_context, CQC_CQ_PERIOD, cq_period); hr_reg_clear(cqc_mask, CQC_CQ_PERIOD);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, hr_cq->cqn, - HNS_ROCE_CMD_MODIFY_CQC); + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, + HNS_ROCE_CMD_MODIFY_CQC, hr_cq->cqn); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) ibdev_err(&hr_dev->ib_dev, @@ -5801,13 +5799,14 @@ static void hns_roce_v2_destroy_eqc(struct hns_roce_dev *hr_dev, u32 eqn) { struct device *dev = hr_dev->dev; int ret; + u8 cmd;
if (eqn < hr_dev->caps.num_comp_vectors) - ret = hns_roce_cmd_mbox(hr_dev, 0, 0, eqn & HNS_ROCE_V2_EQN_M, - HNS_ROCE_CMD_DESTROY_CEQC); + cmd = HNS_ROCE_CMD_DESTROY_CEQC; else - ret = hns_roce_cmd_mbox(hr_dev, 0, 0, eqn & HNS_ROCE_V2_EQN_M, - HNS_ROCE_CMD_DESTROY_AEQC); + cmd = HNS_ROCE_CMD_DESTROY_AEQC; + + ret = hns_roce_cmd_mbox(hr_dev, 0, 0, cmd, eqn & HNS_ROCE_V2_EQN_M); if (ret) dev_err(dev, "[mailbox cmd] destroy eqc(%u) failed.\n", eqn); } @@ -5922,7 +5921,7 @@ static int hns_roce_v2_create_eq(struct hns_roce_dev *hr_dev, if (ret) goto err_cmd_mbox;
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq->eqn, eq_cmd); + ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq_cmd, eq->eqn); if (ret) { dev_err(hr_dev->dev, "[mailbox cmd] create eqc failed.\n"); goto err_cmd_mbox; diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c index 107288150e3f..f7a75a7cda74 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2_dfx.c @@ -18,8 +18,8 @@ int hns_roce_v2_query_cqc_info(struct hns_roce_dev *hr_dev, u32 cqn, return PTR_ERR(mailbox);
cq_context = mailbox->buf; - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, cqn, - HNS_ROCE_CMD_QUERY_CQC); + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, HNS_ROCE_CMD_QUERY_CQC, + cqn); if (ret) { dev_err(hr_dev->dev, "QUERY cqc cmd process error\n"); goto err_mailbox; diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index d80a06cb8aa1..fb57571215f0 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -51,15 +51,15 @@ static int hns_roce_hw_create_mpt(struct hns_roce_dev *hr_dev, struct hns_roce_cmd_mailbox *mailbox, unsigned long mpt_index) { - return hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, mpt_index, - HNS_ROCE_CMD_CREATE_MPT); + return hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, + HNS_ROCE_CMD_CREATE_MPT, mpt_index); }
int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, unsigned long mpt_index) { - return hns_roce_cmd_mbox(hr_dev, 0, 0, mpt_index, - HNS_ROCE_CMD_DESTROY_MPT); + return hns_roce_cmd_mbox(hr_dev, 0, 0, HNS_ROCE_CMD_DESTROY_MPT, + mpt_index); }
static int alloc_mr_key(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr) @@ -299,8 +299,8 @@ int hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start, u64 length, return PTR_ERR(mailbox);
mtpt_idx = key_to_hw_index(mr->key) & (hr_dev->caps.num_mtpts - 1); - ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, mtpt_idx, - HNS_ROCE_CMD_QUERY_MPT); + ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, HNS_ROCE_CMD_QUERY_MPT, + mtpt_idx); if (ret) goto free_cmd_mbox;
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index cf24c8a23983..c569cc37483e 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -63,14 +63,14 @@ static int hns_roce_hw_create_srq(struct hns_roce_dev *dev, struct hns_roce_cmd_mailbox *mailbox, unsigned long srq_num) { - return hns_roce_cmd_mbox(dev, mailbox->dma, 0, srq_num, - HNS_ROCE_CMD_CREATE_SRQ); + return hns_roce_cmd_mbox(dev, mailbox->dma, 0, HNS_ROCE_CMD_CREATE_SRQ, + srq_num); }
static int hns_roce_hw_destroy_srq(struct hns_roce_dev *dev, unsigned long srq_num) { - return hns_roce_cmd_mbox(dev, 0, 0, srq_num, HNS_ROCE_CMD_DESTROY_SRQ); + return hns_roce_cmd_mbox(dev, 0, 0, HNS_ROCE_CMD_DESTROY_SRQ, srq_num); }
static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq)
From: Chengchang Tang tangchengchang@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit cf7f8f5c1c54 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=cf7...
Remove duplicate code for creating and destroying hardware contexts via mailbox.
Link: https://lore.kernel.org/r/20220302064830.61706-7-liangwenpeng@huawei.com Signed-off-by: Chengchang Tang tangchengchang@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_cmd.c | 12 +++++++++ drivers/infiniband/hw/hns/hns_roce_cmd.h | 5 ++++ drivers/infiniband/hw/hns/hns_roce_cq.c | 8 +++--- drivers/infiniband/hw/hns/hns_roce_device.h | 2 -- drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 4 +-- drivers/infiniband/hw/hns/hns_roce_mr.c | 29 ++++++--------------- drivers/infiniband/hw/hns/hns_roce_srq.c | 20 +++----------- 7 files changed, 35 insertions(+), 45 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.c b/drivers/infiniband/hw/hns/hns_roce_cmd.c index 7e37066b272d..864413607571 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.c +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.c @@ -262,3 +262,15 @@ void hns_roce_free_cmd_mailbox(struct hns_roce_dev *hr_dev, dma_pool_free(hr_dev->cmd.pool, mailbox->buf, mailbox->dma); kfree(mailbox); } + +int hns_roce_create_hw_ctx(struct hns_roce_dev *dev, + struct hns_roce_cmd_mailbox *mailbox, + u8 cmd, unsigned long idx) +{ + return hns_roce_cmd_mbox(dev, mailbox->dma, 0, cmd, idx); +} + +int hns_roce_destroy_hw_ctx(struct hns_roce_dev *dev, u8 cmd, unsigned long idx) +{ + return hns_roce_cmd_mbox(dev, 0, 0, cmd, idx); +} diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h index 759da8981c71..052a3d60905a 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cmd.h +++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h @@ -146,5 +146,10 @@ struct hns_roce_cmd_mailbox * hns_roce_alloc_cmd_mailbox(struct hns_roce_dev *hr_dev); void hns_roce_free_cmd_mailbox(struct hns_roce_dev *hr_dev, struct hns_roce_cmd_mailbox *mailbox); +int hns_roce_create_hw_ctx(struct hns_roce_dev *dev, + struct hns_roce_cmd_mailbox *mailbox, + u8 cmd, unsigned long idx); +int hns_roce_destroy_hw_ctx(struct hns_roce_dev *dev, u8 cmd, + unsigned long idx);
#endif /* _HNS_ROCE_CMD_H */ diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c index 0ef503c5e485..ea562645967f 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cq.c +++ b/drivers/infiniband/hw/hns/hns_roce_cq.c @@ -139,8 +139,8 @@ static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq)
hr_dev->hw->write_cqc(hr_dev, hr_cq, mailbox->buf, mtts, dma_handle);
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, - HNS_ROCE_CMD_CREATE_CQC, hr_cq->cqn); + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_CQC, + hr_cq->cqn); hns_roce_free_cmd_mailbox(hr_dev, mailbox); if (ret) { ibdev_err(ibdev, @@ -173,8 +173,8 @@ static void free_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) struct device *dev = hr_dev->dev; int ret;
- ret = hns_roce_cmd_mbox(hr_dev, 0, 0, HNS_ROCE_CMD_DESTROY_CQC, - hr_cq->cqn); + ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_CQC, + hr_cq->cqn); if (ret) dev_err(dev, "DESTROY_CQ failed (%d) for CQN %06lx\n", ret, hr_cq->cqn); diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index 5ed7e00bc90b..21fa93c86b04 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -1147,8 +1147,6 @@ struct ib_mr *hns_roce_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type, int hns_roce_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents, unsigned int *sg_offset); int hns_roce_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata); -int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, - unsigned long mpt_index); unsigned long key_to_hw_index(u32 key);
int hns_roce_alloc_mw(struct ib_mw *mw, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index e68c0034a66d..f77fb6a5295a 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -5806,7 +5806,7 @@ static void hns_roce_v2_destroy_eqc(struct hns_roce_dev *hr_dev, u32 eqn) else cmd = HNS_ROCE_CMD_DESTROY_AEQC;
- ret = hns_roce_cmd_mbox(hr_dev, 0, 0, cmd, eqn & HNS_ROCE_V2_EQN_M); + ret = hns_roce_destroy_hw_ctx(hr_dev, cmd, eqn & HNS_ROCE_V2_EQN_M); if (ret) dev_err(dev, "[mailbox cmd] destroy eqc(%u) failed.\n", eqn); } @@ -5921,7 +5921,7 @@ static int hns_roce_v2_create_eq(struct hns_roce_dev *hr_dev, if (ret) goto err_cmd_mbox;
- ret = hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, eq_cmd, eq->eqn); + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, eq_cmd, eq->eqn); if (ret) { dev_err(hr_dev->dev, "[mailbox cmd] create eqc failed.\n"); goto err_cmd_mbox; diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index fb57571215f0..57c1de3c39ec 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -47,21 +47,6 @@ unsigned long key_to_hw_index(u32 key) return (key << 24) | (key >> 8); }
-static int hns_roce_hw_create_mpt(struct hns_roce_dev *hr_dev, - struct hns_roce_cmd_mailbox *mailbox, - unsigned long mpt_index) -{ - return hns_roce_cmd_mbox(hr_dev, mailbox->dma, 0, - HNS_ROCE_CMD_CREATE_MPT, mpt_index); -} - -int hns_roce_hw_destroy_mpt(struct hns_roce_dev *hr_dev, - unsigned long mpt_index) -{ - return hns_roce_cmd_mbox(hr_dev, 0, 0, HNS_ROCE_CMD_DESTROY_MPT, - mpt_index); -} - static int alloc_mr_key(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr) { struct hns_roce_ida *mtpt_ida = &hr_dev->mr_table.mtpt_ida; @@ -141,7 +126,7 @@ static void hns_roce_mr_free(struct hns_roce_dev *hr_dev, int ret;
if (mr->enabled) { - ret = hns_roce_hw_destroy_mpt(hr_dev, + ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_MPT, key_to_hw_index(mr->key) & (hr_dev->caps.num_mtpts - 1)); if (ret) @@ -177,7 +162,7 @@ static int hns_roce_mr_enable(struct hns_roce_dev *hr_dev, goto err_page; }
- ret = hns_roce_hw_create_mpt(hr_dev, mailbox, + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT, mtpt_idx & (hr_dev->caps.num_mtpts - 1)); if (ret) { dev_err(dev, "failed to create mpt, ret = %d.\n", ret); @@ -304,7 +289,8 @@ int hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start, u64 length, if (ret) goto free_cmd_mbox;
- ret = hns_roce_hw_destroy_mpt(hr_dev, mtpt_idx); + ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_MPT, + mtpt_idx); if (ret) ibdev_warn(ib_dev, "failed to destroy MPT, ret = %d.\n", ret);
@@ -334,7 +320,8 @@ int hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start, u64 length, goto free_cmd_mbox; }
- ret = hns_roce_hw_create_mpt(hr_dev, mailbox, mtpt_idx); + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT, + mtpt_idx); if (ret) { ibdev_err(ib_dev, "failed to create MPT, ret = %d.\n", ret); goto free_cmd_mbox; @@ -473,7 +460,7 @@ static void hns_roce_mw_free(struct hns_roce_dev *hr_dev, int ret;
if (mw->enabled) { - ret = hns_roce_hw_destroy_mpt(hr_dev, + ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_MPT, key_to_hw_index(mw->rkey) & (hr_dev->caps.num_mtpts - 1)); if (ret) @@ -513,7 +500,7 @@ static int hns_roce_mw_enable(struct hns_roce_dev *hr_dev, goto err_page; }
- ret = hns_roce_hw_create_mpt(hr_dev, mailbox, + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT, mtpt_idx & (hr_dev->caps.num_mtpts - 1)); if (ret) { dev_err(dev, "MW CREATE_MPT failed (%d)\n", ret); diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index c569cc37483e..fabe3959a98b 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -59,20 +59,6 @@ static void hns_roce_ib_srq_event(struct hns_roce_srq *srq, } }
-static int hns_roce_hw_create_srq(struct hns_roce_dev *dev, - struct hns_roce_cmd_mailbox *mailbox, - unsigned long srq_num) -{ - return hns_roce_cmd_mbox(dev, mailbox->dma, 0, HNS_ROCE_CMD_CREATE_SRQ, - srq_num); -} - -static int hns_roce_hw_destroy_srq(struct hns_roce_dev *dev, - unsigned long srq_num) -{ - return hns_roce_cmd_mbox(dev, 0, 0, HNS_ROCE_CMD_DESTROY_SRQ, srq_num); -} - static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) { struct hns_roce_srq_table *srq_table = &hr_dev->srq_table; @@ -115,7 +101,8 @@ static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) goto err_mbox; }
- ret = hns_roce_hw_create_srq(hr_dev, mailbox, srq->srqn); + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_SRQ, + srq->srqn); if (ret) { ibdev_err(ibdev, "failed to config SRQC, ret = %d.\n", ret); goto err_mbox; @@ -142,7 +129,8 @@ static void free_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) struct hns_roce_srq_table *srq_table = &hr_dev->srq_table; int ret;
- ret = hns_roce_hw_destroy_srq(hr_dev, srq->srqn); + ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_SRQ, + srq->srqn); if (ret) dev_err(hr_dev->dev, "DESTROY_SRQ failed (%d) for SRQN %06lx\n", ret, srq->srqn);
From: Wenpeng Liang liangwenpeng@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit 904de76c42b7 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=904...
hns_roce_alloc_cmd_mailbox() never returns NULL, so the check should be IS_ERR(). And the error code should be propagated as the function's return value.
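For context, the fix relies on the kernel's ERR_PTR convention: failure is signalled as an encoded errno inside the pointer, never as NULL. A minimal sketch of that pattern follows; the demo_* names are illustrative, not the driver's own.

    #include <linux/err.h>
    #include <linux/slab.h>

    struct demo_mailbox {
    	void *buf;
    };

    static struct demo_mailbox *demo_alloc_mailbox(void)
    {
    	struct demo_mailbox *mbox = kzalloc(sizeof(*mbox), GFP_KERNEL);

    	if (!mbox)
    		return ERR_PTR(-ENOMEM);	/* failure is an ERR_PTR, never NULL */
    	return mbox;
    }

    static int demo_caller(void)
    {
    	struct demo_mailbox *mbox = demo_alloc_mailbox();

    	if (IS_ERR(mbox))		/* IS_ERR_OR_NULL() would be redundant */
    		return PTR_ERR(mbox);	/* propagate the encoded errno, not a fixed -ENOMEM */

    	kfree(mbox);
    	return 0;
    }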
Link: https://lore.kernel.org/r/20220302064830.61706-8-liangwenpeng@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 4 ++-- drivers/infiniband/hw/hns/hns_roce_mr.c | 6 ++---- drivers/infiniband/hw/hns/hns_roce_srq.c | 4 ++-- 3 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index f77fb6a5295a..936176712758 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -5910,8 +5910,8 @@ static int hns_roce_v2_create_eq(struct hns_roce_dev *hr_dev,
/* Allocate mailbox memory */ mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); - if (IS_ERR_OR_NULL(mailbox)) - return -ENOMEM; + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox);
ret = alloc_eq_buf(hr_dev, eq); if (ret) diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index 57c1de3c39ec..214833a87542 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -148,10 +148,8 @@ static int hns_roce_mr_enable(struct hns_roce_dev *hr_dev,
/* Allocate mailbox memory */ mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); - if (IS_ERR(mailbox)) { - ret = PTR_ERR(mailbox); - return ret; - } + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox);
if (mr->type != MR_TYPE_FRMR) ret = hr_dev->hw->write_mtpt(hr_dev, mailbox->buf, mr); diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index fabe3959a98b..b65cca33d239 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -89,9 +89,9 @@ static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) }
mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); - if (IS_ERR_OR_NULL(mailbox)) { + if (IS_ERR(mailbox)) { ibdev_err(ibdev, "failed to alloc mailbox for SRQC.\n"); - ret = -ENOMEM; + ret = PTR_ERR(mailbox); goto err_xa; }
From: Chengchang Tang tangchengchang@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit b65afbd2a05c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=b65...
Split alloc_srqc() into several parts and separate alloc_srqn() from alloc_srqc().
Link: https://lore.kernel.org/r/20220302064830.61706-9-liangwenpeng@huawei.com Signed-off-by: Chengchang Tang tangchengchang@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_srq.c | 80 +++++++++++++++--------- 1 file changed, 52 insertions(+), 28 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c index b65cca33d239..f3e19c66283f 100644 --- a/drivers/infiniband/hw/hns/hns_roce_srq.c +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c @@ -59,40 +59,39 @@ static void hns_roce_ib_srq_event(struct hns_roce_srq *srq, } }
-static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) +static int alloc_srqn(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) { - struct hns_roce_srq_table *srq_table = &hr_dev->srq_table; struct hns_roce_ida *srq_ida = &hr_dev->srq_table.srq_ida; - struct ib_device *ibdev = &hr_dev->ib_dev; - struct hns_roce_cmd_mailbox *mailbox; - int ret; int id;
id = ida_alloc_range(&srq_ida->ida, srq_ida->min, srq_ida->max, GFP_KERNEL); if (id < 0) { - ibdev_err(ibdev, "failed to alloc srq(%d).\n", id); + ibdev_err(&hr_dev->ib_dev, "failed to alloc srq(%d).\n", id); return -ENOMEM; } - srq->srqn = (unsigned long)id;
- ret = hns_roce_table_get(hr_dev, &srq_table->table, srq->srqn); - if (ret) { - ibdev_err(ibdev, "failed to get SRQC table, ret = %d.\n", ret); - goto err_out; - } + srq->srqn = id;
- ret = xa_err(xa_store(&srq_table->xa, srq->srqn, srq, GFP_KERNEL)); - if (ret) { - ibdev_err(ibdev, "failed to store SRQC, ret = %d.\n", ret); - goto err_put; - } + return 0; +} + +static void free_srqn(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) +{ + ida_free(&hr_dev->srq_table.srq_ida.ida, (int)srq->srqn); +} + +static int hns_roce_create_srqc(struct hns_roce_dev *hr_dev, + struct hns_roce_srq *srq) +{ + struct ib_device *ibdev = &hr_dev->ib_dev; + struct hns_roce_cmd_mailbox *mailbox; + int ret;
mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); if (IS_ERR(mailbox)) { ibdev_err(ibdev, "failed to alloc mailbox for SRQC.\n"); - ret = PTR_ERR(mailbox); - goto err_xa; + return PTR_ERR(mailbox); }
ret = hr_dev->hw->write_srqc(srq, mailbox->buf); @@ -103,23 +102,42 @@ static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq)
ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_SRQ, srq->srqn); - if (ret) { + if (ret) ibdev_err(ibdev, "failed to config SRQC, ret = %d.\n", ret); - goto err_mbox; - }
+err_mbox: hns_roce_free_cmd_mailbox(hr_dev, mailbox); + return ret; +} + +static int alloc_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) +{ + struct hns_roce_srq_table *srq_table = &hr_dev->srq_table; + struct ib_device *ibdev = &hr_dev->ib_dev; + int ret; + + ret = hns_roce_table_get(hr_dev, &srq_table->table, srq->srqn); + if (ret) { + ibdev_err(ibdev, "failed to get SRQC table, ret = %d.\n", ret); + return ret; + } + + ret = xa_err(xa_store(&srq_table->xa, srq->srqn, srq, GFP_KERNEL)); + if (ret) { + ibdev_err(ibdev, "failed to store SRQC, ret = %d.\n", ret); + goto err_put; + } + + ret = hns_roce_create_srqc(hr_dev, srq); + if (ret) + goto err_xa;
return 0;
-err_mbox: - hns_roce_free_cmd_mailbox(hr_dev, mailbox); err_xa: xa_erase(&srq_table->xa, srq->srqn); err_put: hns_roce_table_put(hr_dev, &srq_table->table, srq->srqn); -err_out: - ida_free(&srq_ida->ida, id);
return ret; } @@ -142,7 +160,6 @@ static void free_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq) wait_for_completion(&srq->free);
hns_roce_table_put(hr_dev, &srq_table->table, srq->srqn); - ida_free(&srq_table->srq_ida.ida, (int)srq->srqn); }
static int alloc_srq_idx(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq, @@ -390,10 +407,14 @@ int hns_roce_create_srq(struct ib_srq *ib_srq, if (ret) return ret;
- ret = alloc_srqc(hr_dev, srq); + ret = alloc_srqn(hr_dev, srq); if (ret) goto err_srq_buf;
+ ret = alloc_srqc(hr_dev, srq); + if (ret) + goto err_srqn; + if (udata) { resp.srqn = srq->srqn; if (ib_copy_to_udata(udata, &resp, @@ -412,6 +433,8 @@ int hns_roce_create_srq(struct ib_srq *ib_srq,
err_srqc: free_srqc(hr_dev, srq); +err_srqn: + free_srqn(hr_dev, srq); err_srq_buf: free_srq_buf(hr_dev, srq);
@@ -424,6 +447,7 @@ int hns_roce_destroy_srq(struct ib_srq *ibsrq, struct ib_udata *udata) struct hns_roce_srq *srq = to_hr_srq(ibsrq);
free_srqc(hr_dev, srq); + free_srqn(hr_dev, srq); free_srq_buf(hr_dev, srq); return 0; }
From: Wenpeng Liang liangwenpeng@huawei.com
mainline inclusion from mainline-v5.18-rc1 commit 73f7e05609ec category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK cve: NA
reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=73f...
Split alloc_cqc() into several parts and move the processing that is unrelated to allocating the CQC out of it.
Link: https://lore.kernel.org/r/20220302064830.61706-10-liangwenpeng@huawei.com Signed-off-by: Wenpeng Liang liangwenpeng@huawei.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Zhengfeng Luo luozhengfeng@h-partners.com Reviewed-by: Yangyang Li liyangyang20@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/infiniband/hw/hns/hns_roce_cq.c | 65 ++++++++++++++----------- 1 file changed, 37 insertions(+), 28 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c index ea562645967f..5320f4a4c312 100644 --- a/drivers/infiniband/hw/hns/hns_roce_cq.c +++ b/drivers/infiniband/hw/hns/hns_roce_cq.c @@ -100,12 +100,39 @@ static void free_cqn(struct hns_roce_dev *hr_dev, unsigned long cqn) mutex_unlock(&cq_table->bank_mutex); }
+static int hns_roce_create_cqc(struct hns_roce_dev *hr_dev, + struct hns_roce_cq *hr_cq, + u64 *mtts, dma_addr_t dma_handle) +{ + struct ib_device *ibdev = &hr_dev->ib_dev; + struct hns_roce_cmd_mailbox *mailbox; + int ret; + + mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); + if (IS_ERR(mailbox)) { + ibdev_err(ibdev, "failed to alloc mailbox for CQC.\n"); + return PTR_ERR(mailbox); + } + + hr_dev->hw->write_cqc(hr_dev, hr_cq, mailbox->buf, mtts, dma_handle); + + ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_CQC, + hr_cq->cqn); + if (ret) + ibdev_err(ibdev, + "failed to send create cmd for CQ(0x%lx), ret = %d.\n", + hr_cq->cqn, ret); + + hns_roce_free_cmd_mailbox(hr_dev, mailbox); + + return ret; +} + static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) { struct hns_roce_cq_table *cq_table = &hr_dev->cq_table; struct ib_device *ibdev = &hr_dev->ib_dev; - struct hns_roce_cmd_mailbox *mailbox; - u64 mtts[MTT_MIN_COUNT] = { 0 }; + u64 mtts[MTT_MIN_COUNT] = {}; dma_addr_t dma_handle; int ret;
@@ -121,7 +148,7 @@ static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) if (ret) { ibdev_err(ibdev, "failed to get CQ(0x%lx) context, ret = %d.\n", hr_cq->cqn, ret); - goto err_out; + return ret; }
ret = xa_err(xa_store(&cq_table->array, hr_cq->cqn, hr_cq, GFP_KERNEL)); @@ -130,40 +157,17 @@ static int alloc_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq) goto err_put; }
- /* Allocate mailbox memory */ - mailbox = hns_roce_alloc_cmd_mailbox(hr_dev); - if (IS_ERR(mailbox)) { - ret = PTR_ERR(mailbox); - goto err_xa; - } - - hr_dev->hw->write_cqc(hr_dev, hr_cq, mailbox->buf, mtts, dma_handle); - - ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_CQC, - hr_cq->cqn); - hns_roce_free_cmd_mailbox(hr_dev, mailbox); - if (ret) { - ibdev_err(ibdev, - "failed to send create cmd for CQ(0x%lx), ret = %d.\n", - hr_cq->cqn, ret); + ret = hns_roce_create_cqc(hr_dev, hr_cq, mtts, dma_handle); + if (ret) goto err_xa; - } - - hr_cq->cons_index = 0; - hr_cq->arm_sn = 1; - - refcount_set(&hr_cq->refcount, 1); - init_completion(&hr_cq->free);
return 0;
err_xa: xa_erase(&cq_table->array, hr_cq->cqn); - err_put: hns_roce_table_put(hr_dev, &cq_table->table, hr_cq->cqn);
-err_out: return ret; }
@@ -412,6 +416,11 @@ int hns_roce_create_cq(struct ib_cq *ib_cq, const struct ib_cq_init_attr *attr, goto err_cqc; }
+ hr_cq->cons_index = 0; + hr_cq->arm_sn = 1; + refcount_set(&hr_cq->refcount, 1); + init_completion(&hr_cq->free); + return 0;
err_cqc:
From: Michael Ellerman mpe@ellerman.id.au
mainline inclusion from mainline-v5.19-rc2 commit 8e1278444446fc97778a5e5c99bca1ce0bbc5ec9 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C43D?from=project-issue CVE: CVE-2022-32981
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id...
--------------------------------
The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process to read/write registers of another process.
To get/set a register, the API takes an index into an imaginary address space called the "USER area", where the registers of the process are laid out in some fashion.
The kernel then maps that index to a particular register in its own data structures and gets/sets the value.
The API only allows a single machine-word to be read/written at a time. So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
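As a user-space reference, a PEEKUSER request transfers exactly one such word per call. The sketch below is illustrative only; the offset argument is a placeholder, and real register offsets (PT_FPR0 and friends) come from the architecture's asm/ptrace.h.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>

    /* Read one machine word from the tracee's USER area at the given byte
     * offset.  The tracee must already be stopped under ptrace. */
    static long demo_peek_user(pid_t pid, unsigned long offset)
    {
    	long word;

    	errno = 0;
    	word = ptrace(PTRACE_PEEKUSER, pid, (void *)offset, NULL);
    	if (word == -1 && errno != 0)
    		perror("PTRACE_PEEKUSER");
    	return word;
    }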
The way floating point registers (FPRs) are addressed is somewhat complicated, because double precision float values are 64-bit even on 32-bit CPUs. That means on 32-bit kernels each FPR occupies two word-sized locations in the USER area. On 64-bit kernels each FPR occupies one word-sized location in the USER area.
Internally the kernel stores the FPRs in an array of u64s, or if VSX is enabled, an array of pairs of u64s where one half of each pair stores the FPR. Which half of the pair stores the FPR depends on the kernel's endianness.
To handle the different layouts of the FPRs depending on VSX/no-VSX and big/little endian, the TS_FPR() macro was introduced.
Unfortunately the TS_FPR() macro does not take into account the fact that the addressing of each FPR differs between 32-bit and 64-bit kernels. It just takes the index into the "USER area" passed from userspace and indexes into the fp_state.fpr array.
On 32-bit there are 64 indexes that address FPRs, but only 32 entries in the fp_state.fpr array, meaning the user can read/write 256 bytes past the end of the array. Because the fp_state sits in the middle of the thread_struct there are various fields that can be overwritten, including some pointers. As such it may be exploitable.
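The arithmetic behind that overrun can be sketched outside the kernel as follows; the struct and function names are stand-ins for fp_state and the ptrace handlers, not the real code.

    #include <stdint.h>

    struct demo_fp_state {
    	uint64_t fpr[32];	/* 32 FPRs stored as u64s: 256 bytes in total */
    	/* ... fpscr and further thread_struct fields follow in memory ... */
    };

    /* Correct 32-bit handling: the USER-area index covers 64 32-bit words,
     * all of which still land inside fpr[]. */
    static uint32_t demo_peek_fpr_correct(struct demo_fp_state *fp, unsigned int fpidx)
    {
    	return ((uint32_t *)fp->fpr)[fpidx];	/* fpidx in [0, 63] */
    }

    /* Buggy handling: indexing u64 entries with the same 0..63 index means
     * fpidx 32..63 read the 256 bytes immediately after fpr[]. */
    static uint64_t demo_peek_fpr_buggy(struct demo_fp_state *fp, unsigned int fpidx)
    {
    	return fp->fpr[fpidx];
    }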
It has also been observed to cause systems to hang or otherwise misbehave when using gdbserver, and is probably the root cause of this report which could not be easily reproduced: https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@ke...
Rather than trying to make the TS_FPR() macro even more complicated to fix the bug, or add more macros, instead add a special-case for 32-bit kernels. This is more obvious and hopefully avoids a similar bug happening again in future.
Note that because 32-bit kernels never have VSX enabled the code doesn't need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to ensure that 32-bit && VSX is never enabled.
Fixes: 87fec0514f61 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds") Cc: stable@vger.kernel.org # v3.13+ Reported-by: Ariel Miculas ariel.miculas@belden.com Tested-by: Christophe Leroy christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au Signed-off-by: Yipeng Zou zouyipeng@huawei.com Reviewed-by: Zhang Jianhua chris.zjh@huawei.com Reviewed-by: Liao Chang liaochang1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/powerpc/kernel/ptrace/ptrace.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) mode change 100644 => 100755 arch/powerpc/kernel/ptrace/ptrace.c
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c old mode 100644 new mode 100755 index f6e51be47c6e..81125c822008 --- a/arch/powerpc/kernel/ptrace/ptrace.c +++ b/arch/powerpc/kernel/ptrace/ptrace.c @@ -74,10 +74,13 @@ long arch_ptrace(struct task_struct *child, long request, unsigned int fpidx = index - PT_FPR0;
flush_fp_to_thread(child); - if (fpidx < (PT_FPSCR - PT_FPR0)) - memcpy(&tmp, &child->thread.TS_FPR(fpidx), - sizeof(long)); - else + if (fpidx < (PT_FPSCR - PT_FPR0)) { + if (IS_ENABLED(CONFIG_PPC32)) + // On 32-bit the index we are passed refers to 32-bit words + tmp = ((u32 *)child->thread.fp_state.fpr)[fpidx]; + else + memcpy(&tmp, &child->thread.TS_FPR(fpidx), sizeof(long)); + } else tmp = child->thread.fp_state.fpscr; } ret = put_user(tmp, datalp); @@ -107,10 +110,13 @@ long arch_ptrace(struct task_struct *child, long request, unsigned int fpidx = index - PT_FPR0;
flush_fp_to_thread(child); - if (fpidx < (PT_FPSCR - PT_FPR0)) - memcpy(&child->thread.TS_FPR(fpidx), &data, - sizeof(long)); - else + if (fpidx < (PT_FPSCR - PT_FPR0)) { + if (IS_ENABLED(CONFIG_PPC32)) + // On 32-bit the index we are passed refers to 32-bit words + ((u32 *)child->thread.fp_state.fpr)[fpidx] = data; + else + memcpy(&child->thread.TS_FPR(fpidx), &data, sizeof(long)); + } else child->thread.fp_state.fpscr = data; ret = 0; } @@ -478,4 +484,7 @@ void __init pt_regs_check(void) * real registers. */ BUILD_BUG_ON(PT_DSCR < sizeof(struct user_pt_regs) / sizeof(unsigned long)); + + // ptrace_get/put_fpr() rely on PPC32 and VSX being incompatible + BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_VSX)); }
From: Xiyu Yang xiyuyang19@fudan.edu.cn
mainline inclusion from mainline-v5.16-rc1 commit 31d21d219b51dcfb16e18427eddae5394d402820 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5C8IW CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
The refcount_t type and its corresponding API can protect reference counters from accidental underflow and overflow, and from the use-after-free situations that follow.
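A minimal sketch of the pattern this conversion targets; the demo_* object is hypothetical and only illustrates that refcount_t saturates and warns instead of silently wrapping like atomic_t.

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct demo_obj {
    	refcount_t count;
    };

    static struct demo_obj *demo_alloc(void)
    {
    	struct demo_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

    	if (obj)
    		refcount_set(&obj->count, 1);
    	return obj;
    }

    static struct demo_obj *demo_get(struct demo_obj *obj)
    {
    	refcount_inc(&obj->count);	/* warns and saturates on overflow */
    	return obj;
    }

    static void demo_put(struct demo_obj *obj)
    {
    	if (refcount_dec_and_test(&obj->count))	/* warns on underflow */
    		kfree(obj);
    }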
Signed-off-by: Xiyu Yang xiyuyang19@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/1626674355-55795-1-git-send-email-xiyuyang19@fudan... Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ext4/ext4.h | 3 ++- fs/ext4/page-io.c | 8 ++++---- 2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c11a23d73c79..277f89d5de03 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -17,6 +17,7 @@ #ifndef _EXT4_H #define _EXT4_H
+#include <linux/refcount.h> #include <linux/types.h> #include <linux/blkdev.h> #include <linux/magic.h> @@ -235,7 +236,7 @@ typedef struct ext4_io_end { struct bio *bio; /* Linked list of completed * bios covering the extent */ unsigned int flag; /* unwritten or not */ - atomic_t count; /* reference counter */ + refcount_t count; /* reference counter */ struct list_head list_vec; /* list of ext4_io_end_vec */ } ext4_io_end_t;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 4569075a7da0..b076fabb72e2 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -284,14 +284,14 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags) io_end->inode = inode; INIT_LIST_HEAD(&io_end->list); INIT_LIST_HEAD(&io_end->list_vec); - atomic_set(&io_end->count, 1); + refcount_set(&io_end->count, 1); } return io_end; }
void ext4_put_io_end_defer(ext4_io_end_t *io_end) { - if (atomic_dec_and_test(&io_end->count)) { + if (refcount_dec_and_test(&io_end->count)) { if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || list_empty(&io_end->list_vec)) { ext4_release_io_end(io_end); @@ -305,7 +305,7 @@ int ext4_put_io_end(ext4_io_end_t *io_end) { int err = 0;
- if (atomic_dec_and_test(&io_end->count)) { + if (refcount_dec_and_test(&io_end->count)) { if (io_end->flag & EXT4_IO_END_UNWRITTEN) { err = ext4_convert_unwritten_io_end_vec(io_end->handle, io_end); @@ -319,7 +319,7 @@ int ext4_put_io_end(ext4_io_end_t *io_end)
ext4_io_end_t *ext4_get_io_end(ext4_io_end_t *io_end) { - atomic_inc(&io_end->count); + refcount_inc(&io_end->count); return io_end; }
From: Pavel Reichl preichl@redhat.com
mainline inclusion from stable-v5.13-rc1 commit 0f98b4ece18da9d8287bb4cc4e8f78b8760ea0d0 category: bugfix bugzilla: 186908, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Rename the mp variable to parsing_mp so that it is easy to distinguish between the current mount point handle and the handle for the mount point whose mount options are being parsed.
Suggested-by: Eric Sandeen sandeen@redhat.com Signed-off-by: Pavel Reichl preichl@redhat.com
Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Carlos Maiolino cmaiolino@redhat.com Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com
Conflicts: fs/xfs/xfs_super.c Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_super.c | 100 ++++++++++++++++++++++----------------------- 1 file changed, 50 insertions(+), 50 deletions(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 8533571421e6..1834653f0bc1 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1215,7 +1215,7 @@ xfs_fc_parse_param( struct fs_context *fc, struct fs_parameter *param) { - struct xfs_mount *mp = fc->s_fs_info; + struct xfs_mount *parsing_mp = fc->s_fs_info; struct fs_parse_result result; int size = 0; int opt; @@ -1226,138 +1226,138 @@ xfs_fc_parse_param(
switch (opt) { case Opt_logbufs: - mp->m_logbufs = result.uint_32; + parsing_mp->m_logbufs = result.uint_32; return 0; case Opt_logbsize: - if (suffix_kstrtoint(param->string, 10, &mp->m_logbsize)) + if (suffix_kstrtoint(param->string, 10, &parsing_mp->m_logbsize)) return -EINVAL; return 0; case Opt_logdev: - kfree(mp->m_logname); - mp->m_logname = kstrdup(param->string, GFP_KERNEL); - if (!mp->m_logname) + kfree(parsing_mp->m_logname); + parsing_mp->m_logname = kstrdup(param->string, GFP_KERNEL); + if (!parsing_mp->m_logname) return -ENOMEM; return 0; case Opt_rtdev: - kfree(mp->m_rtname); - mp->m_rtname = kstrdup(param->string, GFP_KERNEL); - if (!mp->m_rtname) + kfree(parsing_mp->m_rtname); + parsing_mp->m_rtname = kstrdup(param->string, GFP_KERNEL); + if (!parsing_mp->m_rtname) return -ENOMEM; return 0; case Opt_allocsize: if (suffix_kstrtoint(param->string, 10, &size)) return -EINVAL; - mp->m_allocsize_log = ffs(size) - 1; - mp->m_flags |= XFS_MOUNT_ALLOCSIZE; + parsing_mp->m_allocsize_log = ffs(size) - 1; + parsing_mp->m_flags |= XFS_MOUNT_ALLOCSIZE; return 0; case Opt_grpid: case Opt_bsdgroups: - mp->m_flags |= XFS_MOUNT_GRPID; + parsing_mp->m_flags |= XFS_MOUNT_GRPID; return 0; case Opt_nogrpid: case Opt_sysvgroups: - mp->m_flags &= ~XFS_MOUNT_GRPID; + parsing_mp->m_flags &= ~XFS_MOUNT_GRPID; return 0; case Opt_wsync: - mp->m_flags |= XFS_MOUNT_WSYNC; + parsing_mp->m_flags |= XFS_MOUNT_WSYNC; return 0; case Opt_norecovery: - mp->m_flags |= XFS_MOUNT_NORECOVERY; + parsing_mp->m_flags |= XFS_MOUNT_NORECOVERY; return 0; case Opt_noalign: - mp->m_flags |= XFS_MOUNT_NOALIGN; + parsing_mp->m_flags |= XFS_MOUNT_NOALIGN; return 0; case Opt_swalloc: - mp->m_flags |= XFS_MOUNT_SWALLOC; + parsing_mp->m_flags |= XFS_MOUNT_SWALLOC; return 0; case Opt_sunit: - mp->m_dalign = result.uint_32; + parsing_mp->m_dalign = result.uint_32; return 0; case Opt_swidth: - mp->m_swidth = result.uint_32; + parsing_mp->m_swidth = result.uint_32; return 0; case Opt_inode32: - mp->m_flags |= XFS_MOUNT_SMALL_INUMS; + parsing_mp->m_flags |= XFS_MOUNT_SMALL_INUMS; return 0; case Opt_inode64: - mp->m_flags &= ~XFS_MOUNT_SMALL_INUMS; + parsing_mp->m_flags &= ~XFS_MOUNT_SMALL_INUMS; return 0; case Opt_nouuid: - mp->m_flags |= XFS_MOUNT_NOUUID; + parsing_mp->m_flags |= XFS_MOUNT_NOUUID; return 0; case Opt_largeio: - mp->m_flags |= XFS_MOUNT_LARGEIO; + parsing_mp->m_flags |= XFS_MOUNT_LARGEIO; return 0; case Opt_nolargeio: - mp->m_flags &= ~XFS_MOUNT_LARGEIO; + parsing_mp->m_flags &= ~XFS_MOUNT_LARGEIO; return 0; case Opt_filestreams: - mp->m_flags |= XFS_MOUNT_FILESTREAMS; + parsing_mp->m_flags |= XFS_MOUNT_FILESTREAMS; return 0; case Opt_noquota: - mp->m_qflags &= ~XFS_ALL_QUOTA_ACCT; - mp->m_qflags &= ~XFS_ALL_QUOTA_ENFD; + parsing_mp->m_qflags &= ~XFS_ALL_QUOTA_ACCT; + parsing_mp->m_qflags &= ~XFS_ALL_QUOTA_ENFD; return 0; case Opt_quota: case Opt_uquota: case Opt_usrquota: - mp->m_qflags |= (XFS_UQUOTA_ACCT | XFS_UQUOTA_ENFD); + parsing_mp->m_qflags |= (XFS_UQUOTA_ACCT | XFS_UQUOTA_ENFD); return 0; case Opt_qnoenforce: case Opt_uqnoenforce: - mp->m_qflags |= XFS_UQUOTA_ACCT; - mp->m_qflags &= ~XFS_UQUOTA_ENFD; + parsing_mp->m_qflags |= XFS_UQUOTA_ACCT; + parsing_mp->m_qflags &= ~XFS_UQUOTA_ENFD; return 0; case Opt_pquota: case Opt_prjquota: - mp->m_qflags |= (XFS_PQUOTA_ACCT | XFS_PQUOTA_ENFD); + parsing_mp->m_qflags |= (XFS_PQUOTA_ACCT | XFS_PQUOTA_ENFD); return 0; case Opt_pqnoenforce: - mp->m_qflags |= XFS_PQUOTA_ACCT; - mp->m_qflags &= ~XFS_PQUOTA_ENFD; + parsing_mp->m_qflags |= XFS_PQUOTA_ACCT; + 
parsing_mp->m_qflags &= ~XFS_PQUOTA_ENFD; return 0; case Opt_gquota: case Opt_grpquota: - mp->m_qflags |= (XFS_GQUOTA_ACCT | XFS_GQUOTA_ENFD); + parsing_mp->m_qflags |= (XFS_GQUOTA_ACCT | XFS_GQUOTA_ENFD); return 0; case Opt_gqnoenforce: - mp->m_qflags |= XFS_GQUOTA_ACCT; - mp->m_qflags &= ~XFS_GQUOTA_ENFD; + parsing_mp->m_qflags |= XFS_GQUOTA_ACCT; + parsing_mp->m_qflags &= ~XFS_GQUOTA_ENFD; return 0; case Opt_discard: - mp->m_flags |= XFS_MOUNT_DISCARD; + parsing_mp->m_flags |= XFS_MOUNT_DISCARD; return 0; case Opt_nodiscard: - mp->m_flags &= ~XFS_MOUNT_DISCARD; + parsing_mp->m_flags &= ~XFS_MOUNT_DISCARD; return 0; #ifdef CONFIG_FS_DAX case Opt_dax: - xfs_mount_set_dax_mode(mp, XFS_DAX_ALWAYS); + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); return 0; case Opt_dax_enum: - xfs_mount_set_dax_mode(mp, result.uint_32); + xfs_mount_set_dax_mode(parsing_mp, result.uint_32); return 0; #endif /* Following mount options will be removed in September 2025 */ case Opt_ikeep: - xfs_warn(mp, "%s mount option is deprecated.", param->key); - mp->m_flags |= XFS_MOUNT_IKEEP; + xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + parsing_mp->m_flags |= XFS_MOUNT_IKEEP; return 0; case Opt_noikeep: - xfs_warn(mp, "%s mount option is deprecated.", param->key); - mp->m_flags &= ~XFS_MOUNT_IKEEP; + xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + parsing_mp->m_flags &= ~XFS_MOUNT_IKEEP; return 0; case Opt_attr2: - xfs_warn(mp, "%s mount option is deprecated.", param->key); - mp->m_flags |= XFS_MOUNT_ATTR2; + xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + parsing_mp->m_flags |= XFS_MOUNT_ATTR2; return 0; case Opt_noattr2: - xfs_warn(mp, "%s mount option is deprecated.", param->key); - mp->m_flags &= ~XFS_MOUNT_ATTR2; - mp->m_flags |= XFS_MOUNT_NOATTR2; + xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + parsing_mp->m_flags &= ~XFS_MOUNT_ATTR2; + parsing_mp->m_flags |= XFS_MOUNT_NOATTR2; return 0; default: - xfs_warn(mp, "unknown mount option [%s].", param->key); + xfs_warn(parsing_mp, "unknown mount option [%s].", param->key); return -EINVAL; }
From: Pavel Reichl preichl@redhat.com
mainline inclusion from stable-v5.13-rc1 commit 92cf7d36384b99d5a57bf4422904a3c16dc4527a category: bugfix bugzilla: 186908, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Skip the warning about a mount option being deprecated if we are remounting and the deprecated option's state is not changing.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211605 Fix-suggested-by: Eric Sandeen sandeen@redhat.com Signed-off-by: Pavel Reichl preichl@redhat.com
Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Carlos Maiolino cmaiolino@redhat.com Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_super.c | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 1834653f0bc1..9148170a12cb 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1205,6 +1205,22 @@ suffix_kstrtoint( return ret; }
+static inline void +xfs_fs_warn_deprecated( + struct fs_context *fc, + struct fs_parameter *param, + uint64_t flag, + bool value) +{ + /* Don't print the warning if reconfiguring and current mount point + * already had the flag set + */ + if ((fc->purpose & FS_CONTEXT_FOR_RECONFIGURE) && + !!(XFS_M(fc->root->d_sb)->m_flags & flag) == value) + return; + xfs_warn(fc->s_fs_info, "%s mount option is deprecated.", param->key); +} + /* * Set mount state from a mount option. * @@ -1340,19 +1356,19 @@ xfs_fc_parse_param( #endif /* Following mount options will be removed in September 2025 */ case Opt_ikeep: - xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + xfs_fs_warn_deprecated(fc, param, XFS_MOUNT_IKEEP, true); parsing_mp->m_flags |= XFS_MOUNT_IKEEP; return 0; case Opt_noikeep: - xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + xfs_fs_warn_deprecated(fc, param, XFS_MOUNT_IKEEP, false); parsing_mp->m_flags &= ~XFS_MOUNT_IKEEP; return 0; case Opt_attr2: - xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + xfs_fs_warn_deprecated(fc, param, XFS_MOUNT_ATTR2, true); parsing_mp->m_flags |= XFS_MOUNT_ATTR2; return 0; case Opt_noattr2: - xfs_warn(parsing_mp, "%s mount option is deprecated.", param->key); + xfs_fs_warn_deprecated(fc, param, XFS_MOUNT_NOATTR2, true); parsing_mp->m_flags &= ~XFS_MOUNT_ATTR2; parsing_mp->m_flags |= XFS_MOUNT_NOATTR2; return 0;
From: ChenXiaoSong chenxiaosong2@huawei.com
hulk inclusion category: bugfix bugzilla: 186345, https://gitee.com/openeuler/kernel/issues/I4T2WV CVE: NA
--------------------------------
This reverts commit ce368536dd614452407dc31e2449eb84681a06af.
filemap_sample_wb_err() will return 0 if nobody has seen the error yet; filemap_check_wb_err() will then return the old, unchanged writeback error, and the asynchronous write() will degrade into a synchronous write().
Reproducer:

  nfs server                       | nfs client
  ---------------------------------|----------------------------------------------
  # No space left on server        |
  fallocate -l 100G /server/nospc  |
                                   | mount -t nfs $nfs_server_ip:/ /mnt
                                   |
                                   | # Expected error: No space left on device
                                   | dd if=/dev/zero of=/mnt/file count=1 ibs=1K
                                   |
                                   | # Release space on mountpoint
                                   | rm /mnt/nospc
                                   |
                                   | # Very very slow
                                   | dd if=/dev/zero of=/mnt/file count=1 ibs=1K
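The interaction described above can be sketched as follows; this is not the real nfs_file_write(), only the sample/check pattern that the reverted commit wrapped around the write path.

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    static int demo_write_error_check(struct address_space *mapping)
    {
    	errseq_t since;

    	/* Returns 0 while an old writeback error is recorded but nobody has
    	 * "seen" it yet -- the situation in the reproducer above. */
    	since = filemap_sample_wb_err(mapping);

    	/* ... the buffered write completes here without any new error ... */

    	/* With since == 0 this reports the stale error again, so the caller
    	 * treats a successful async write as failed and falls back to a
    	 * synchronous nfs_wb_all() flush. */
    	return filemap_check_wb_err(mapping, since);
    }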
Signed-off-by: ChenXiaoSong chenxiaosong2@huawei.com Reviewed-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/nfs/file.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 4556e75d4591..f96367a2463e 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -587,14 +587,12 @@ static const struct vm_operations_struct nfs_file_vm_ops = { .page_mkwrite = nfs_vm_page_mkwrite, };
-static int nfs_need_check_write(struct file *filp, struct inode *inode, - int error) +static int nfs_need_check_write(struct file *filp, struct inode *inode) { struct nfs_open_context *ctx;
ctx = nfs_file_open_context(filp); - if (nfs_error_is_fatal_on_server(error) || - nfs_ctx_key_to_expire(ctx, inode)) + if (nfs_ctx_key_to_expire(ctx, inode)) return 1; return 0; } @@ -605,8 +603,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from) struct inode *inode = file_inode(file); unsigned long written = 0; ssize_t result; - errseq_t since; - int error;
result = nfs_key_timeout_notify(file, inode); if (result) @@ -631,7 +627,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from) if (iocb->ki_pos > i_size_read(inode)) nfs_revalidate_mapping(inode, file->f_mapping);
- since = filemap_sample_wb_err(file->f_mapping); nfs_start_io_write(inode); result = generic_write_checks(iocb, from); if (result > 0) { @@ -650,8 +645,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from) goto out;
/* Return error values */ - error = filemap_check_wb_err(file->f_mapping, since); - if (nfs_need_check_write(file, inode, error)) { + if (nfs_need_check_write(file, inode)) { int err = nfs_wb_all(inode); if (err < 0) result = err;
From: Namjae Jeon linkinjeon@kernel.org
mainline inclusion from mainline-v5.19-rc1 commit f26967b9f7a830e228bb13fb41bd516ddd9d789d category: bugfix bugzilla: 186929, https://gitee.com/src-openeuler/kernel/issues/I5D82L CVE: CVE-2022-1973
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------
log_read_rst() returns -ENOMEM when there is not enough memory. In this case, if the info structure is returned without initialization, the caller attempts to kfree() the uninitialized info->r_page pointer. This patch moves the memset() initialization to before log_read_rst() is called.
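Reduced to a sketch, the bug looks like this (the call chain is simplified; names follow the ntfs3 code, the error label is illustrative):

    struct restart_info rst_info;             /* stack memory, garbage until zeroed  */
    int err;

    err = log_read_rst(log, l_size, true, &rst_info);
    if (err)                                  /* -ENOMEM is returned before the      */
            goto out;                         /* function gets to zero *info         */
    ...
    out:
            kfree(rst_info.r_page);           /* kfree() on an uninitialized pointer */
            return err;

With the memset() moved into log_replay() before the call, r_page starts out as NULL and kfree(NULL) is a harmless no-op on this error path.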
Reported-by: Gerald Lee sundaywind2004@gmail.com Signed-off-by: Namjae Jeon linkinjeon@kernel.org Signed-off-by: Konstantin Komarov almaz.alexandrovich@paragon-software.com Signed-off-by: ZhaoLong Wang wangzhaolong1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/ntfs3/fslog.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ntfs3/fslog.c b/fs/ntfs3/fslog.c index 06492f088d60..fc36c53b865a 100644 --- a/fs/ntfs3/fslog.c +++ b/fs/ntfs3/fslog.c @@ -1185,8 +1185,6 @@ static int log_read_rst(struct ntfs_log *log, u32 l_size, bool first, if (!r_page) return -ENOMEM;
- memset(info, 0, sizeof(struct restart_info)); - /* Determine which restart area we are looking for. */ if (first) { vbo = 0; @@ -3791,10 +3789,11 @@ int log_replay(struct ntfs_inode *ni, bool *initialized) if (!log) return -ENOMEM;
+ memset(&rst_info, 0, sizeof(struct restart_info)); + log->ni = ni; log->l_size = l_size; log->one_page_buf = kmalloc(page_size, GFP_NOFS); - if (!log->one_page_buf) { err = -ENOMEM; goto out; @@ -3842,6 +3841,7 @@ int log_replay(struct ntfs_inode *ni, bool *initialized) if (rst_info.vbo) goto check_restart_area;
+ memset(&rst_info2, 0, sizeof(struct restart_info)); err = log_read_rst(log, l_size, false, &rst_info2);
/* Determine which restart area to use. */
From: Gou Hao gouhao@uniontech.com
uniontech inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I40JRR CVE: NA
-------------------
After allocating the sbi->persisters array, dep_init() calls dep_fini() when an error happens. Because sbi->persisters is not zeroed, dep_fini() can be called with sbi->persisters[] entries uninitialized, so kthread_stop() can be called on a random value.
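A reduced sketch of why zeroing matters here (the cleanup-loop shape is assumed from the description above, not copied from fs/eulerfs):

    /* kzalloc() guarantees every slot starts out as NULL */
    sbi->persisters = kzalloc(sizeof(struct task_struct *) *
                              persisters_per_socket * num_sockets, GFP_KERNEL);

    /* dep_fini()-style cleanup after a later allocation failure */
    for (i = 0; i < persisters_per_socket * num_sockets; i++) {
            if (!sbi->persisters[i])           /* only reliable if the array was zeroed */
                    continue;
            kthread_stop(sbi->persisters[i]);  /* with kmalloc(), a garbage pointer could land here */
    }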
Signed-off-by: Gou Hao gouhao@uniontech.com Reviewed-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/eulerfs/dep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/eulerfs/dep.c b/fs/eulerfs/dep.c index ec014bbf3700..a41471c5f2ec 100644 --- a/fs/eulerfs/dep.c +++ b/fs/eulerfs/dep.c @@ -718,7 +718,7 @@ int dep_init(struct super_block *sb) for_each_possible_cpu(cpu) init_llist_head(per_cpu_ptr(sbi->persistee_list, cpu));
- sbi->persisters = kmalloc(sizeof(struct task_struct *) * + sbi->persisters = kzalloc(sizeof(struct task_struct *) * persisters_per_socket * num_sockets, GFP_KERNEL); if (!sbi->persisters) {
From: tatataeki shengzeyu19_98@163.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4MC3F CVE: NA
----------------------------------
Several operations on cgroups in cgroup v1 depend on the state of the cgroup. That state can be displayed in cgroup v2 but not in cgroup v1, so a memory.flag_stat file is added to the memory cgroup to display the state of the current cgroup and its child cgroups.
Testing result:

List the status of user.slice:
[root@test user.slice]# cat memory.flag_stat
NO_REF 0
ONLINE 1
RELEASED 0
VISIBLE 1
DYING 0
CHILD_NO_REF 0
CHILD_ONLINE 1
CHILD_RELEASED 0
CHILD_VISIBLE 1
CHILD_DYING 0
Create a new cgroup in user.slice:
[root@test user.slice]# mkdir user-test
List the current status of user.slice after the operation above:
[root@test user.slice]# cat memory.flag_stat
NO_REF 0
ONLINE 1
RELEASED 0
VISIBLE 1
DYING 0
CHILD_NO_REF 0
CHILD_ONLINE 2
CHILD_RELEASED 0
CHILD_VISIBLE 2
CHILD_DYING 0
Signed-off-by: tatataeki shengzeyu19_98@163.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3ede56d6b307..1938e69ad5cc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4556,6 +4556,53 @@ static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg, spin_unlock(&memcg_oom_lock); }
+static const char *const memcg_flag_name[] = { + "NO_REF", + "ONLINE", + "RELEASED", + "VISIBLE", + "DYING" +}; + +static void memcg_flag_stat_get(int mem_flags, int *stat) +{ + int i; + int flags = mem_flags; + + for (i = 0; i < ARRAY_SIZE(memcg_flag_name); i++) { + if (flags & 1) + stat[i] += 1; + flags >>= 1; + } +} + +static int memcg_flag_stat_show(struct seq_file *sf, void *v) +{ + int self_flag[ARRAY_SIZE(memcg_flag_name)]; + int child_flag[ARRAY_SIZE(memcg_flag_name)]; + int iter; + struct cgroup_subsys_state *child; + struct cgroup_subsys_state *css = seq_css(sf); + + memset(self_flag, 0, sizeof(self_flag)); + memset(child_flag, 0, sizeof(child_flag)); + + memcg_flag_stat_get(css->flags, self_flag); + + rcu_read_lock(); + css_for_each_child(child, css) + memcg_flag_stat_get(child->flags, child_flag); + rcu_read_unlock(); + + for (iter = 0; iter < ARRAY_SIZE(memcg_flag_name); iter++) + seq_printf(sf, "%s %d\n", memcg_flag_name[iter], self_flag[iter]); + + for (iter = 0; iter < ARRAY_SIZE(memcg_flag_name); iter++) + seq_printf(sf, "CHILD_%s %d\n", memcg_flag_name[iter], child_flag[iter]); + + return 0; +} + static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(sf); @@ -5259,6 +5306,10 @@ static struct cftype mem_cgroup_legacy_files[] = { .write_u64 = mem_cgroup_oom_control_write, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "flag_stat", + .seq_show = memcg_flag_stat_show, + }, { .name = "pressure_level", },
From: Huaixin Chang changhuaixin@linux.alibaba.com
mainline inclusion from mainline-v5.13-rc6 commit f4183717b370ad28dd0c0d74760142b20e6e7931 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty, so they get throttled even when their average utilization is under quota. They are latency sensitive at the same time, so throttling them is undesired.
We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.
Traditional (UP-EDF) bandwidth control is something like:
(U = \Sum u_i) <= 1
This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime we'd have to run more than a second of program time, and obviously miss our deadline; but the next deadline will be further out still, there is never time to catch up, unbounded fail.
This work observes that a workload doesn't always execute the full quota; this enables one to describe u_i as a statistical distribution.
For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average.
That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs.
At the same time, we can say that the worst case deadline miss, will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET).
The benefit of burst is seen when testing with schbench. The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
The average CPU usage is at 80%. I ran this 10 times; long tail latency showed up 6 times and the group got throttled 8 times.
Tail latencies are shown below, and it wasn't the worst case.
Latency percentiles (usec)
        50.0000th: 19872
        75.0000th: 21344
        90.0000th: 22176
        95.0000th: 22496
        *99.0000th: 22752
        99.5000th: 22752
        99.9000th: 22752
        min=0, max=22727
rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
The interference when using burst is evaluated by the probability of missing the deadline and the average WCET. Test results showed that when there are many cgroups or the CPU is under-utilized, the interference is limited. More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alib...
Co-developed-by: Shanpei Chen shanpeic@linux.alibaba.com Signed-off-by: Shanpei Chen shanpeic@linux.alibaba.com Co-developed-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Huaixin Chang changhuaixin@linux.alibaba.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Ben Segall bsegall@google.com Acked-by: Tejun Heo tj@kernel.org Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.... Signed-off-by: Hui Tang tanghui20@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/sched/core.c | 68 ++++++++++++++++++++++++++++++++++++++++---- kernel/sched/fair.c | 14 ++++++--- kernel/sched/sched.h | 4 +++ 3 files changed, 76 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0fd88dc7660f..5b374129c1cb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8214,7 +8214,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota, + u64 burst) { int i, ret = 0, runtime_enabled, runtime_was_enabled; struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; @@ -8244,6 +8245,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) if (quota != RUNTIME_INF && quota > max_cfs_runtime) return -EINVAL;
+ if (quota != RUNTIME_INF && (burst > quota || + burst + quota > max_cfs_runtime)) + return -EINVAL; + /* * Prevent race between setting of cfs_rq->runtime_enabled and * unthrottle_offline_cfs_rqs(). @@ -8265,6 +8270,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) raw_spin_lock_irq(&cfs_b->lock); cfs_b->period = ns_to_ktime(period); cfs_b->quota = quota; + cfs_b->burst = burst;
__refill_cfs_bandwidth_runtime(cfs_b);
@@ -8298,9 +8304,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) { - u64 quota, period; + u64 quota, period, burst;
period = ktime_to_ns(tg->cfs_bandwidth.period); + burst = tg->cfs_bandwidth.burst; if (cfs_quota_us < 0) quota = RUNTIME_INF; else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC) @@ -8308,7 +8315,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) else return -EINVAL;
- return tg_set_cfs_bandwidth(tg, period, quota); + return tg_set_cfs_bandwidth(tg, period, quota, burst); }
static long tg_get_cfs_quota(struct task_group *tg) @@ -8326,15 +8333,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us) { - u64 quota, period; + u64 quota, period, burst;
if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC) return -EINVAL;
period = (u64)cfs_period_us * NSEC_PER_USEC; quota = tg->cfs_bandwidth.quota; + burst = tg->cfs_bandwidth.burst;
- return tg_set_cfs_bandwidth(tg, period, quota); + return tg_set_cfs_bandwidth(tg, period, quota, burst); }
static long tg_get_cfs_period(struct task_group *tg) @@ -8347,6 +8355,30 @@ static long tg_get_cfs_period(struct task_group *tg) return cfs_period_us; }
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us) +{ + u64 quota, period, burst; + + if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC) + return -EINVAL; + + burst = (u64)cfs_burst_us * NSEC_PER_USEC; + period = ktime_to_ns(tg->cfs_bandwidth.period); + quota = tg->cfs_bandwidth.quota; + + return tg_set_cfs_bandwidth(tg, period, quota, burst); +} + +static long tg_get_cfs_burst(struct task_group *tg) +{ + u64 burst_us; + + burst_us = tg->cfs_bandwidth.burst; + do_div(burst_us, NSEC_PER_USEC); + + return burst_us; +} + static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -8371,6 +8403,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css, return tg_set_cfs_period(css_tg(css), cfs_period_us); }
+static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return tg_get_cfs_burst(css_tg(css)); +} + +static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 cfs_burst_us) +{ + return tg_set_cfs_burst(css_tg(css), cfs_burst_us); +} + struct cfs_schedulable_data { struct task_group *tg; u64 period, quota; @@ -8586,6 +8630,11 @@ static struct cftype cpu_legacy_files[] = { .read_u64 = cpu_cfs_period_read_u64, .write_u64 = cpu_cfs_period_write_u64, }, + { + .name = "cfs_burst_us", + .read_u64 = cpu_cfs_burst_read_u64, + .write_u64 = cpu_cfs_burst_write_u64, + }, { .name = "stat", .seq_show = cpu_cfs_stat_show, @@ -8758,12 +8807,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of, { struct task_group *tg = css_tg(of_css(of)); u64 period = tg_get_cfs_period(tg); + u64 burst = tg_get_cfs_burst(tg); u64 quota; int ret;
ret = cpu_period_quota_parse(buf, &period, "a); if (!ret) - ret = tg_set_cfs_bandwidth(tg, period, quota); + ret = tg_set_cfs_bandwidth(tg, period, quota, burst); return ret ?: nbytes; } #endif @@ -8790,6 +8840,12 @@ static struct cftype cpu_files[] = { .seq_show = cpu_max_show, .write = cpu_max_write, }, + { + .name = "max.burst", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_cfs_burst_read_u64, + .write_u64 = cpu_cfs_burst_write_u64, + }, #endif #ifdef CONFIG_UCLAMP_TASK_GROUP { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9d5c780160c5..593e763bb1f2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4738,8 +4738,11 @@ static inline u64 sched_cfs_bandwidth_slice(void) */ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) { - if (cfs_b->quota != RUNTIME_INF) - cfs_b->runtime = cfs_b->quota; + if (unlikely(cfs_b->quota == RUNTIME_INF)) + return; + + cfs_b->runtime += cfs_b->quota; + cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst); }
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -5095,6 +5098,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u throttled = !list_empty(&cfs_b->throttled_cfs_rq); cfs_b->nr_periods += overrun;
+ /* Refill extra burst quota even if cfs_b->idle */ + __refill_cfs_bandwidth_runtime(cfs_b); + /* * idle depends on !throttled (for the case of a large deficit), and if * we're going inactive then everything else can be deferred @@ -5102,8 +5108,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u if (cfs_b->idle && !throttled) goto out_deactivate;
- __refill_cfs_bandwidth_runtime(cfs_b); - if (!throttled) { /* mark as potentially idle for the upcoming period */ cfs_b->idle = 1; @@ -5356,6 +5360,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) if (new < max_cfs_quota_period) { cfs_b->period = ns_to_ktime(new); cfs_b->quota *= 2; + cfs_b->burst *= 2;
pr_warn_ratelimited( "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n", @@ -5387,6 +5392,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->runtime = 0; cfs_b->quota = RUNTIME_INF; cfs_b->period = ns_to_ktime(default_cfs_period()); + cfs_b->burst = 0;
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq); hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0d40bb700f3c..d05c787f0658 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -385,7 +385,11 @@ struct cfs_bandwidth { int nr_throttled; u64 throttled_time;
+#if !defined(__GENKSYMS__) + u64 burst; +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4)
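To make the new refill rule concrete, here is a small userspace sketch with assumed values (quota 20ms, burst 10ms per period); it mirrors only the arithmetic of __refill_cfs_bandwidth_runtime(), not the scheduler code itself:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            const uint64_t quota = 20000, burst = 10000;  /* microseconds, assumed      */
            uint64_t runtime = 5000;   /* 15ms of the 20ms quota were used last period  */

            runtime += quota;                  /* accumulate: 25000us                   */
            if (runtime > quota + burst)       /* cap at quota + burst = 30000us        */
                    runtime = quota + burst;

            printf("runtime available this period: %lluus\n",
                   (unsigned long long)runtime);   /* prints 25000us                    */
            return 0;
    }

With burst = 0 the cap collapses back to quota and the carried-over 5ms is discarded, which matches the old behavior.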
From: Huaixin Chang changhuaixin@linux.alibaba.com
mainline inclusion from mainline-v5.15-rc4 commit bcb1704a1ed2de580a46f28922e223a65f16e0f5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Two new statistics are introduced to show the internals of the burst feature and to explain why burst helps or not.
nr_bursts:  number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time (in nanoseconds) that any CPUs have used
            above quota in the respective periods
Co-developed-by: Shanpei Chen shanpeic@linux.alibaba.com Signed-off-by: Shanpei Chen shanpeic@linux.alibaba.com Co-developed-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Huaixin Chang changhuaixin@linux.alibaba.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Daniel Jordan daniel.m.jordan@oracle.com Acked-by: Tejun Heo tj@kernel.org Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.... Signed-off-by: Hui Tang tanghui20@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- kernel/sched/core.c | 13 ++++++++++--- kernel/sched/fair.c | 9 +++++++++ kernel/sched/sched.h | 5 ++++- 3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5b374129c1cb..b55de01ec68e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8517,6 +8517,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v) seq_printf(sf, "wait_sum %llu\n", ws); }
+ seq_printf(sf, "nr_bursts %d\n", cfs_b->nr_burst); + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time); + return 0; } #endif /* CONFIG_CFS_BANDWIDTH */ @@ -8683,16 +8686,20 @@ static int cpu_extra_stat_show(struct seq_file *sf, { struct task_group *tg = css_tg(css); struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; - u64 throttled_usec; + u64 throttled_usec, burst_usec;
throttled_usec = cfs_b->throttled_time; do_div(throttled_usec, NSEC_PER_USEC); + burst_usec = cfs_b->burst_time; + do_div(burst_usec, NSEC_PER_USEC);
seq_printf(sf, "nr_periods %d\n" "nr_throttled %d\n" - "throttled_usec %llu\n", + "throttled_usec %llu\n" + "nr_bursts %d\n" + "burst_usec %llu\n", cfs_b->nr_periods, cfs_b->nr_throttled, - throttled_usec); + throttled_usec, cfs_b->nr_burst, burst_usec); } #endif return 0; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 593e763bb1f2..50d457979db6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4738,11 +4738,20 @@ static inline u64 sched_cfs_bandwidth_slice(void) */ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) { + s64 runtime; + if (unlikely(cfs_b->quota == RUNTIME_INF)) return;
cfs_b->runtime += cfs_b->quota; + runtime = cfs_b->runtime_snap - cfs_b->runtime; + if (runtime > 0) { + cfs_b->burst_time += runtime; + cfs_b->nr_burst++; + } + cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst); + cfs_b->runtime_snap = cfs_b->runtime; }
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d05c787f0658..e41a5207a212 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -387,12 +387,15 @@ struct cfs_bandwidth {
#if !defined(__GENKSYMS__) u64 burst; + u64 runtime_snap; + int nr_burst; + u64 burst_time; #else KABI_RESERVE(1) -#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) +#endif KABI_RESERVE(5) KABI_RESERVE(6) #endif
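For reference, my reading of how the two counters fall out of the refill hunk above (the role of runtime_snap is my interpretation; the changelog does not spell it out): runtime_snap records the runtime value right after the previous refill, and cfs_b->runtime shrinks as bandwidth is handed out to the per-CPU silos during the period, so if runtime_snap still exceeds the freshly refilled runtime, the group consumed more than one quota during the elapsed period, i.e. it burst.

    /* annotated restatement of the refill hunk above */
    cfs_b->runtime += cfs_b->quota;                  /* refill for the new period          */
    runtime = cfs_b->runtime_snap - cfs_b->runtime;  /* = consumed_last_period - quota     */
    if (runtime > 0) {                               /* consumed more than quota: a burst  */
            cfs_b->burst_time += runtime;            /* by how much                        */
            cfs_b->nr_burst++;                       /* in how many periods                */
    }
    cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
    cfs_b->runtime_snap = cfs_b->runtime;            /* snapshot for the next period       */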
From: Huaixin Chang changhuaixin@linux.alibaba.com
mainline inclusion from mainline-v5.15-rc4 commit d73df887b6b8174dfbb7f5f878fbd1e0e2eb3f08 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Basic description of usage and effect for CFS Bandwidth Control Burst.
Co-developed-by: Shanpei Chen shanpeic@linux.alibaba.com Signed-off-by: Shanpei Chen shanpeic@linux.alibaba.com Co-developed-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Tianchen Ding dtcccc@linux.alibaba.com Signed-off-by: Huaixin Chang changhuaixin@linux.alibaba.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Daniel Jordan daniel.m.jordan@oracle.com Acked-by: Tejun Heo tj@kernel.org Link: https://lore.kernel.org/r/20210830032215.16302-3-changhuaixin@linux.alibaba.... Signed-off-by: Hui Tang tanghui20@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/cgroup-v2.rst | 8 +++ Documentation/scheduler/sched-bwc.rst | 84 ++++++++++++++++++++++--- 2 files changed, 83 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index c54db136d9b4..5d9b7e552fb0 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -997,6 +997,8 @@ All time durations are in microseconds. - nr_periods - nr_throttled - throttled_usec + - nr_bursts + - burst_usec
cpu.weight A read-write single value file which exists on non-root @@ -1028,6 +1030,12 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated.
+ cpu.max.burst + A read-write single value file which exists on non-root + cgroups. The default is "0". + + The burst in the range [0, $MAX]. + cpu.pressure A read-only nested-key file which exists on non-root cgroups.
diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst index 9801d6b284b1..5723d8c69e35 100644 --- a/Documentation/scheduler/sched-bwc.rst +++ b/Documentation/scheduler/sched-bwc.rst @@ -21,33 +21,84 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it is transferred to cpu-local "silos" on a demand basis. The amount transferred within each of these updates is tunable and described as the "slice".
+Burst feature +------------- +This feature borrows time now against our future underrun, at the cost of +increased interference against the other system users. All nicely bounded. + +Traditional (UP-EDF) bandwidth control is something like: + + (U = \Sum u_i) <= 1 + +This guaranteeds both that every deadline is met and that the system is +stable. After all, if U were > 1, then for every second of walltime, +we'd have to run more than a second of program time, and obviously miss +our deadline, but the next deadline will be further out still, there is +never time to catch up, unbounded fail. + +The burst feature observes that a workload doesn't always executes the full +quota; this enables one to describe u_i as a statistical distribution. + +For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100) +(the traditional WCET). This effectively allows u to be smaller, +increasing the efficiency (we can pack more tasks in the system), but at +the cost of missing deadlines when all the odds line up. However, it +does maintain stability, since every overrun must be paired with an +underrun as long as our x is above the average. + +That is, suppose we have 2 tasks, both specify a p(95) value, then we +have a p(95)*p(95) = 90.25% chance both tasks are within their quota and +everything is good. At the same time we have a p(5)p(5) = 0.25% chance +both tasks will exceed their quota at the same time (guaranteed deadline +fail). Somewhere in between there's a threshold where one exceeds and +the other doesn't underrun enough to compensate; this depends on the +specific CDFs. + +At the same time, we can say that the worst case deadline miss, will be +\Sum e_i; that is, there is a bounded tardiness (under the assumption +that x+e is indeed WCET). + +The interferenece when using burst is valued by the possibilities for +missing the deadline and the average WCET. Test results showed that when +there many cgroups or CPU is under utilized, the interference is +limited. More details are shown in: +https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alib... + Management ---------- -Quota and period are managed within the cpu subsystem via cgroupfs. +Quota, period and burst are managed within the cpu subsystem via cgroupfs.
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds) +cpu.cfs_quota_us: run-time replenished within a period (in microseconds) cpu.cfs_period_us: the length of a period (in microseconds) cpu.stat: exports throttling statistics [explained further below] +cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
The default values are::
cpu.cfs_period_us=100ms - cpu.cfs_quota=-1 + cpu.cfs_quota_us=-1 + cpu.cfs_burst_us=0
A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS.
-Writing any (valid) positive value(s) will enact the specified bandwidth limit. -The minimum quota allowed for the quota or period is 1ms. There is also an -upper bound on the period length of 1s. Additional restrictions exist when -bandwidth limits are used in a hierarchical fashion, these are explained in -more detail below. +Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will +enact the specified bandwidth limit. The minimum quota allowed for the quota or +period is 1ms. There is also an upper bound on the period length of 1s. +Additional restrictions exist when bandwidth limits are used in a hierarchical +fashion, these are explained in more detail below.
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit and return the group to an unconstrained state once more.
+A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate +any unused bandwidth. It makes the traditional bandwidth control behavior for +CFS unchanged. Writing any (valid) positive value(s) no larger than +cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth +accumulation. + Any updates to a group's bandwidth specification will result in it becoming unthrottled if it is in a constrained state.
@@ -67,7 +118,7 @@ for more fine-grained consumption.
Statistics ---------- -A group's bandwidth statistics are exported via 3 fields in cpu.stat. +A group's bandwidth statistics are exported via 5 fields in cpu.stat.
cpu.stat:
@@ -75,6 +126,9 @@ cpu.stat: - nr_throttled: Number of times the group has been throttled/limited. - throttled_time: The total time duration (in nanoseconds) for which entities of the group have been throttled. +- nr_bursts: Number of periods burst occurs. +- burst_time: Cumulative wall-time (in nanoseconds) that any CPUs has used + above quota in respective periods
This interface is read-only.
@@ -172,3 +226,15 @@ Examples
By using a small period here we are ensuring a consistent latency response at the expense of burst capacity. + +4. Limit a group to 40% of 1 CPU, and allow accumulate up to 20% of 1 CPU + additionally, in case accumulation has been done. + + With 50ms period, 20ms quota will be equivalent to 40% of 1 CPU. + And 10ms burst will be equivalent to 20% of 1 CPU. + + # echo 20000 > cpu.cfs_quota_us /* quota = 20ms */ + # echo 50000 > cpu.cfs_period_us /* period = 50ms */ + # echo 10000 > cpu.cfs_burst_us /* burst = 10ms */ + + Larger buffer setting (no larger than quota) allows greater burst capacity.