From: Baokun Li libaokun1@huawei.com
hulk inclusion
category: bugfix
bugzilla: 182993 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
Hulk Robot reported a KASAN slab-out-of-bounds warning:

==================================================================
BUG: KASAN: slab-out-of-bounds in ubifs_change_lp+0x3a9/0x1390 [ubifs]
Read of size 8 at addr ffff888101c961f8 by task fsstress/1068
[...]
Call Trace:
 check_memory_region+0x1c1/0x1e0
 ubifs_change_lp+0x3a9/0x1390 [ubifs]
 ubifs_change_one_lp+0x170/0x220 [ubifs]
 ubifs_garbage_collect+0x7f9/0xda0 [ubifs]
 ubifs_budget_space+0xfe4/0x1bd0 [ubifs]
 ubifs_write_begin+0x528/0x10c0 [ubifs]

Allocated by task 1068:
 kmemdup+0x25/0x50
 ubifs_lpt_lookup_dirty+0x372/0xb00 [ubifs]
 ubifs_update_one_lp+0x46/0x260 [ubifs]
 ubifs_tnc_end_commit+0x98b/0x1720 [ubifs]
 do_commit+0x6cb/0x1950 [ubifs]
 ubifs_run_commit+0x15a/0x2b0 [ubifs]
 ubifs_budget_space+0x1061/0x1bd0 [ubifs]
 ubifs_write_begin+0x528/0x10c0 [ubifs]
 [...]
==================================================================
In ubifs_garbage_collect(), if ubifs_find_dirty_leb() returns an error, lp is left uninitialized. However, lp.lnum may still be used in the "out" branch, where it holds a random stack value. If that value happens to be -1, or any other value that passes the checks, a slab-out-of-bounds access can occur later in ubifs_change_lp().

To solve this problem, initialize lp.lnum to -1 at the start of each loop iteration; ubifs_find_dirty_leb() then sets it to a valid LEB number (never -1) on success, and ubifs_return_leb() is executed only when lp.lnum != -1.

This also fixes a related bug: if an iteration finds a retained or indexing LEB and continues to the next loop iteration, but that iteration breaks before finding another LEB, the stale lp.lnum from the earlier iteration would be passed to ubifs_return_leb(), wrongly clearing the "taken" flag of that earlier LEB. Resetting lp.lnum on every iteration prevents that as well.
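As an illustration, the core of the bug reduces to the following stand-alone sketch (all names here are hypothetical stand-ins, not the actual ubifs code): an out-parameter that the search helper fills in only on success is consumed on the error path, and the sentinel makes that path safe.

  /* Hypothetical reduction of the bug pattern (not the ubifs code). */
  #include <stdio.h>

  struct leb_props { int lnum; };

  /* Stand-in for ubifs_find_dirty_leb(): fills *lp only on success. */
  static int find_dirty_leb(struct leb_props *lp, int fail)
  {
          if (fail)
                  return -1;      /* lp left untouched, as in ubifs */
          lp->lnum = 42;
          return 0;
  }

  static int gc_iteration(int fail)
  {
          struct leb_props lp;
          int err;

          lp.lnum = -1;           /* the fix: sentinel before the search */
          err = find_dirty_leb(&lp, fail);
          if (err)
                  goto out;
          /* ... garbage-collect the found LEB ... */
  out:
          if (lp.lnum != -1)      /* only release a LEB we actually hold */
                  printf("returning LEB %d\n", lp.lnum);
          return err;
  }

  int main(void)
  {
          gc_iteration(0);
          gc_iteration(1);
          return 0;
  }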
Reported-by: Hulk Robot hulkci@huawei.com
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 fs/ubifs/gc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ubifs/gc.c b/fs/ubifs/gc.c
index dc3e26e9ed7b..05e1eeae8457 100644
--- a/fs/ubifs/gc.c
+++ b/fs/ubifs/gc.c
@@ -692,6 +692,9 @@ int ubifs_garbage_collect(struct ubifs_info *c, int anyway)
     for (i = 0; ; i++) {
         int space_before, space_after;
 
+        /* Maybe continue after find and break before find */
+        lp.lnum = -1;
+
         cond_resched();
 
         /* Give the commit an opportunity to run */
@@ -843,7 +846,8 @@ int ubifs_garbage_collect(struct ubifs_info *c, int anyway)
             ubifs_wbuf_sync_nolock(wbuf);
             ubifs_ro_mode(c, ret);
             mutex_unlock(&wbuf->io_mutex);
-            ubifs_return_leb(c, lp.lnum);
+            if (lp.lnum != -1)
+                ubifs_return_leb(c, lp.lnum);
             return ret;
         }
From: Baokun Li libaokun1@huawei.com
hulk inclusion
category: bugfix
bugzilla: 182993 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
If ubifs_garbage_collect_leb() returns -EAGAIN and execution then reaches the "out" branch, ubifs_return_leb() is executed twice on the same lnum. Under concurrency, this can cause data loss.
Reported-by: Hulk Robot hulkci@huawei.com
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 fs/ubifs/gc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ubifs/gc.c b/fs/ubifs/gc.c
index 05e1eeae8457..238a60294d10 100644
--- a/fs/ubifs/gc.c
+++ b/fs/ubifs/gc.c
@@ -758,6 +758,8 @@ int ubifs_garbage_collect(struct ubifs_info *c, int anyway)
             err = ubifs_return_leb(c, lp.lnum);
             if (err)
                 ret = err;
+            /* Maybe double return if go out */
+            lp.lnum = -1;
             break;
         }
         goto out;
From: Baokun Li libaokun1@huawei.com
hulk inclusion
category: bugfix
bugzilla: 182993 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
If ubifs_garbage_collect_leb() returns -EAGAIN and the subsequent ubifs_return_leb() fails, a LEB is left with its "taken" flag set forever. In this case, set the ubifs to read-only to prevent a worse situation.
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 fs/ubifs/gc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ubifs/gc.c b/fs/ubifs/gc.c
index 238a60294d10..e43c8161423d 100644
--- a/fs/ubifs/gc.c
+++ b/fs/ubifs/gc.c
@@ -756,8 +756,14 @@ int ubifs_garbage_collect(struct ubifs_info *c, int anyway)
              * caller instead of the original '-EAGAIN'.
              */
             err = ubifs_return_leb(c, lp.lnum);
-            if (err)
+            if (err) {
                 ret = err;
+                /* LEB may always be "taken". So set
+                 * the ubifs to read-only. Sync wbuf
+                 * will return -EROFS, then go "out".
+                 */
+                ubifs_ro_mode(c, ret);
+            }
             /* Maybe double return if go out */
             lp.lnum = -1;
             break;
From: Laibin Qiu qiulaibin@huawei.com
hulk inclusion
category: bugfix
bugzilla: 185722 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
KASAN reports a use-after-free when doing a block test:
==================================================================
[10050.967049] BUG: KASAN: use-after-free in submit_bio_checks+0x1539/0x1550

[10050.977638] Call Trace:
[10050.978190]  dump_stack+0x9b/0xce
[10050.979674]  print_address_description.constprop.6+0x3e/0x60
[10050.983510]  kasan_report.cold.9+0x22/0x3a
[10050.986089]  submit_bio_checks+0x1539/0x1550
[10050.989576]  submit_bio_noacct+0x83/0xc80
[10050.993714]  submit_bio+0xa7/0x330
[10050.994435]  mpage_readahead+0x380/0x500
[10050.998009]  read_pages+0x1c1/0xbf0
[10051.002057]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413]  do_page_cache_ra+0xda/0x110
[10051.008207]  force_page_cache_ra+0x23d/0x3d0
[10051.009087]  page_cache_sync_ra+0xca/0x300
[10051.009970]  generic_file_buffered_read+0xbea/0x2130
[10051.012685]  generic_file_read_iter+0x315/0x490
[10051.014472]  blkdev_read_iter+0x113/0x1b0
[10051.015300]  aio_read+0x2ad/0x450
[10051.023786]  io_submit_one+0xc8e/0x1d60
[10051.029855]  __se_sys_io_submit+0x125/0x350
[10051.033442]  do_syscall_64+0x2d/0x40
[10051.034156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.048733] Allocated by task 18598:
[10051.049482]  kasan_save_stack+0x19/0x40
[10051.050263]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230]  kmem_cache_alloc+0x146/0x440
[10051.052060]  mempool_alloc+0x125/0x2f0
[10051.052818]  bio_alloc_bioset+0x353/0x590
[10051.053658]  mpage_alloc+0x3b/0x240
[10051.054382]  do_mpage_readpage+0xddf/0x1ef0
[10051.055250]  mpage_readahead+0x264/0x500
[10051.056060]  read_pages+0x1c1/0xbf0
[10051.056758]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702]  do_page_cache_ra+0xda/0x110
[10051.058511]  force_page_cache_ra+0x23d/0x3d0
[10051.059373]  page_cache_sync_ra+0xca/0x300
[10051.060198]  generic_file_buffered_read+0xbea/0x2130
[10051.061195]  generic_file_read_iter+0x315/0x490
[10051.062189]  blkdev_read_iter+0x113/0x1b0
[10051.063015]  aio_read+0x2ad/0x450
[10051.063686]  io_submit_one+0xc8e/0x1d60
[10051.064467]  __se_sys_io_submit+0x125/0x350
[10051.065318]  do_syscall_64+0x2d/0x40
[10051.066082]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.067455] Freed by task 13307:
[10051.068136]  kasan_save_stack+0x19/0x40
[10051.068931]  kasan_set_track+0x1c/0x30
[10051.069726]  kasan_set_free_info+0x1b/0x30
[10051.070621]  __kasan_slab_free+0x111/0x160
[10051.071480]  kmem_cache_free+0x94/0x460
[10051.072256]  mempool_free+0xd6/0x320
[10051.072985]  bio_free+0xe0/0x130
[10051.073630]  bio_put+0xab/0xe0
[10051.074252]  bio_endio+0x3a6/0x5d0
[10051.074984]  blk_update_request+0x590/0x1370
[10051.075870]  scsi_end_request+0x7d/0x400
[10051.076667]  scsi_io_completion+0x1aa/0xe50
[10051.077503]  scsi_softirq_done+0x11b/0x240
[10051.078344]  blk_mq_complete_request+0xd4/0x120
[10051.079275]  scsi_mq_done+0xf0/0x200
[10051.080036]  virtscsi_vq_done+0xbc/0x150
[10051.080850]  vring_interrupt+0x179/0x390
[10051.081650]  __handle_irq_event_percpu+0xf7/0x490
[10051.082626]  handle_irq_event_percpu+0x7b/0x160
[10051.083527]  handle_irq_event+0xcc/0x170
[10051.084297]  handle_edge_irq+0x215/0xb20
[10051.085122]  asm_call_irq_on_stack+0xf/0x20
[10051.085986]  common_interrupt+0xae/0x120
[10051.086830]  asm_common_interrupt+0x1e/0x40
==================================================================
A bio is checked at the beginning of submit_bio_noacct(). If it needs to be throttled, blk_throtl_bio() starts a timer and submission stops right there; the bio is submitted later from blk_throtl_dispatch_work_fn() when the timer expires. However, in the current code, even when the bio has been throttled, its issue time is still set via blkcg_bio_issue_init(). This is redundant and can cause the above use-after-free:

CPU0                                    CPU1
submit_bio
 submit_bio_noacct
  submit_bio_checks
   blk_throtl_bio()
    <= mod_timer(&sq->pending_timer)
                                        blk_throtl_dispatch_work_fn
                                         submit_bio_noacct()
                                          <= bio has the throttle tag and
                                             is submitted directly; its
                                             issue value is set here
                                        bio_endio()
                                         bio_put()
                                          bio_free() <= bio is freed
   blkcg_bio_issue_init(bio)
    <= bio has been freed and
       will lead to UAF
   return BLK_QC_T_NONE
Fix this by removing the extra blkcg_bio_issue_init() call.
Fixes: e439bedf6b24 ("blkcg: consolidate bio_issue_init() to be a part of core")
Signed-off-by: Laibin Qiu qiulaibin@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 block/blk-core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1fc33e1482aa..89f1e74785dc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -897,10 +897,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
     if (unlikely(!current->io_context))
         create_task_io_context(current, GFP_ATOMIC, q->node);
 
-    if (blk_throtl_bio(bio)) {
-        blkcg_bio_issue_init(bio);
+    if (blk_throtl_bio(bio))
         return false;
-    }
 
     blk_cgroup_bio_start(bio);
     blkcg_bio_issue_init(bio);
From: Li Hua hucool.lihua@huawei.com
hulk inclusion
category: bugfix
bugzilla: 185759 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
When rt_runtime is modified from -1 to a valid control value, the task may end up throttled forever. Operations like the following trigger the bug:

1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us
2. Run a FIFO task named A that executes while(1)
3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

When rt_runtime is -1, the RT period timer is not activated when task A is enqueued. After rt_runtime is set to 950000, task A is throttled; because the period timer was never started, its runtime is never replenished, so the task stays throttled forever.
Signed-off-by: Li Hua hucool.lihua@huawei.com
Reviewed-by: Chen Hui judy.chenhui@huawei.com
Reviewed-by: Cheng Jian cj.chengjian@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 kernel/sched/rt.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ad9bec90fb7a..dae1e8eaa983 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -944,6 +944,18 @@ static inline int rt_se_prio(struct sched_rt_entity *rt_se)
     return rt_task_of(rt_se)->prio;
 }
 
+static inline void try_start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
+    raw_spin_lock(&rt_b->rt_runtime_lock);
+    if (!rt_b->rt_period_active) {
+        rt_b->rt_period_active = 1;
+        hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+        hrtimer_start_expires(&rt_b->rt_period_timer,
+                      HRTIMER_MODE_ABS_PINNED_HARD);
+    }
+    raw_spin_unlock(&rt_b->rt_runtime_lock);
+}
+
 static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 {
     u64 runtime = sched_rt_runtime(rt_rq);
@@ -1020,13 +1032,17 @@ static void update_curr_rt(struct rq *rq)
 
     for_each_sched_rt_entity(rt_se) {
         struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+        int exceeded;
 
         if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
             raw_spin_lock(&rt_rq->rt_runtime_lock);
             rt_rq->rt_time += delta_exec;
-            if (sched_rt_runtime_exceeded(rt_rq))
+            exceeded = sched_rt_runtime_exceeded(rt_rq);
+            if (exceeded)
                 resched_curr(rq);
             raw_spin_unlock(&rt_rq->rt_runtime_lock);
+            if (exceeded)
+                try_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
         }
     }
 }
@@ -2721,8 +2737,10 @@ static int sched_rt_global_validate(void)
 
 static void sched_rt_do_global(void)
 {
+    raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
     def_rt_bandwidth.rt_runtime = global_rt_runtime();
     def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
+    raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
 }
 
 int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
From: Ye Weihua yeweihua4@huawei.com
hulk inclusion
category: bugfix
bugzilla: 185757 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
During testing, we found that a running function to be patched is not detected when the livepatch is enabled. This can cause unknown problems.

The cause is that the return value of klp_check_jump_func() is inverted in its check_func_list() helper. To solve the problem, reverse the return values.
Signed-off-by: Ye Weihua yeweihua4@huawei.com
Reviewed-by: Kuohai Xu xukuohai@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 arch/arm/kernel/livepatch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/livepatch.c b/arch/arm/kernel/livepatch.c
index e98c8e47a344..221275714899 100644
--- a/arch/arm/kernel/livepatch.c
+++ b/arch/arm/kernel/livepatch.c
@@ -278,11 +278,11 @@ static bool check_func_list(struct klp_func_list *funcs, int *ret, unsigned long
         *ret = klp_compare_address(pc, funcs->func_addr, funcs->func_name,
                 klp_size_to_check(funcs->func_size, funcs->force));
         if (*ret) {
-            return false;
+            return true;
         }
         funcs = funcs->next;
     }
-    return true;
+    return false;
 }
 
 static int klp_check_jump_func(struct stackframe *frame, void *data)
From: Luo Meng luomeng12@huawei.com
mainline inclusion
from mainline-v5.11-rc1
commit 16238415eb9886328a89fe7a3cb0b88c7335fe16
category: bugfix
bugzilla: 38689 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------------------------
When the sum of fl->fl_start and l->l_len overflows, UBSAN shows the following warning:
UBSAN: Undefined behaviour in fs/locks.c:482:29
signed integer overflow:
2 + 9223372036854775806 cannot be represented in type 'long long int'
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0xe4/0x14e lib/dump_stack.c:118
 ubsan_epilogue+0xe/0x81 lib/ubsan.c:161
 handle_overflow+0x193/0x1e2 lib/ubsan.c:192
 flock64_to_posix_lock fs/locks.c:482 [inline]
 flock_to_posix_lock+0x595/0x690 fs/locks.c:515
 fcntl_setlk+0xf3/0xa90 fs/locks.c:2262
 do_fcntl+0x456/0xf60 fs/fcntl.c:387
 __do_sys_fcntl fs/fcntl.c:483 [inline]
 __se_sys_fcntl fs/fcntl.c:468 [inline]
 __x64_sys_fcntl+0x12d/0x180 fs/fcntl.c:468
 do_syscall_64+0xc8/0x5a0 arch/x86/entry/common.c:293
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix it by parenthesizing 'l->l_len - 1'.
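A quick user-space illustration of why the parentheses matter (a sketch assuming 64-bit long long and the same values as in the UBSAN report above):

  /* start + len - 1 is evaluated as (start + len) - 1, so the
   * intermediate sum can overflow even when the final value fits. */
  #include <assert.h>
  #include <limits.h>

  int main(void)
  {
          long long start = 2;
          long long len = 9223372036854775806LL;  /* LLONG_MAX - 1 */

          /* start + len - 1   -> (start + len) overflows: UB      */
          /* start + (len - 1) -> 2 + (LLONG_MAX - 2) == LLONG_MAX */
          assert(start + (len - 1) == LLONG_MAX);
          return 0;
  }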
Signed-off-by: Luo Meng luomeng12@huawei.com
Signed-off-by: Jeff Layton jlayton@kernel.org
Signed-off-by: Luo Meng luomeng12@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 fs/locks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/locks.c b/fs/locks.c
index 32c948fe2944..6e118955b9b6 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -542,7 +542,7 @@ static int flock64_to_posix_lock(struct file *filp, struct file_lock *fl,
     if (l->l_len > 0) {
         if (l->l_len - 1 > OFFSET_MAX - fl->fl_start)
             return -EOVERFLOW;
-        fl->fl_end = fl->fl_start + l->l_len - 1;
+        fl->fl_end = fl->fl_start + (l->l_len - 1);
     } else if (l->l_len < 0) {
         if (fl->fl_start + l->l_len < 0)
From: Jeremy Cline jcline@redhat.com
mainline inclusion
from mainline
commit aff2299e0d81b26304ccc6a1ec0170e437f38efc
bugzilla: 185776 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2020-27820
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Nouveau does not currently support hot-unplugging, but it still makes sense to switch from drm_dev_unregister() to drm_dev_unplug(). drm_dev_unplug() calls drm_dev_unregister() after marking the device as unplugged, but only after any device critical sections are finished.
Since nouveau isn't using drm_dev_enter() and drm_dev_exit(), there are no critical sections, so this is nearly functionally equivalent. However, the DRM layer does check whether the device is unplugged, and if it is, returns appropriate error codes.
In the future nouveau can add critical sections in order to truly support hot-unplugging.
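For reference, such a critical section would take roughly the following shape (a sketch of the drm_dev_enter()/drm_dev_exit() pattern, not actual nouveau code):

  /* Sketch: guard device-touching code against concurrent unplug. */
  #include <drm/drm_drv.h>

  static void some_device_callback(struct drm_device *dev)
  {
          int idx;

          if (!drm_dev_enter(dev, &idx))
                  return;         /* device already unplugged, bail out */

          /* ... safely touch hardware / device-private state ... */

          drm_dev_exit(idx);
  }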
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Jeremy Cline jcline@redhat.com
Reviewed-by: Lyude Paul lyude@redhat.com
Reviewed-by: Ben Skeggs bskeggs@redhat.com
Tested-by: Karol Herbst kherbst@redhat.com
Signed-off-by: Karol Herbst kherbst@redhat.com
Link: https://patchwork.freedesktop.org/patch/msgid/20201125202648.5220-2-jcline@r...
Link: https://gitlab.freedesktop.org/drm/nouveau/-/merge_requests/14
Signed-off-by: Chen Jun chenjun102@huawei.com
Reviewed-by: weiyang wang wangweiyang2@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 42fc5c813a9b..470b3c8f7392 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -792,7 +792,7 @@ nouveau_drm_device_remove(struct drm_device *dev)
     struct nvkm_client *client;
     struct nvkm_device *device;
 
-    drm_dev_unregister(dev);
+    drm_dev_unplug(dev);
 
     dev->irq_enabled = false;
     client = nvxx_client(&drm->client.base);
From: Jeremy Cline jcline@redhat.com
mainline inclusion
from mainline
commit abae9164a421bc4a41a3769f01ebcd1f9d955e0e
bugzilla: 185776 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2020-27820
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Rather than protecting the nouveau_drm clients list with the lock within the "client" nouveau_cli, add a dedicated lock to serialize access to the list. This is both clearer and necessary to avoid lockdep being upset with us when we need to iterate through all the clients in the list and potentially lock their mutex, which is the same class as the lock protecting the entire list.
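The resulting locking shape is roughly the following (an illustrative sketch with a hypothetical helper name; the list lock and the per-client mutex are now different lock classes, so nesting them no longer looks like recursive locking to lockdep):

  /* Sketch: walk the client list under the dedicated list lock while
   * taking each client's own mutex, a separate lock class. */
  static void for_each_client_locked(struct nouveau_drm *drm)
  {
          struct nouveau_cli *cli;

          mutex_lock(&drm->clients_lock);         /* protects the list */
          list_for_each_entry(cli, &drm->clients, head) {
                  mutex_lock(&cli->mutex);        /* per-client class */
                  /* ... per-client work ... */
                  mutex_unlock(&cli->mutex);
          }
          mutex_unlock(&drm->clients_lock);
  }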
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Jeremy Cline jcline@redhat.com
Reviewed-by: Lyude Paul lyude@redhat.com
Reviewed-by: Ben Skeggs bskeggs@redhat.com
Tested-by: Karol Herbst kherbst@redhat.com
Signed-off-by: Karol Herbst kherbst@redhat.com
Link: https://patchwork.freedesktop.org/patch/msgid/20201125202648.5220-3-jcline@r...
Link: https://gitlab.freedesktop.org/drm/nouveau/-/merge_requests/14
Signed-off-by: Chen Jun chenjun102@huawei.com
Reviewed-by: weiyang wang wangweiyang2@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 10 ++++++----
 drivers/gpu/drm/nouveau/nouveau_drv.h |  5 +++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 470b3c8f7392..8201e9d11df9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -557,6 +557,7 @@ nouveau_drm_device_init(struct drm_device *dev)
     nvkm_dbgopt(nouveau_debug, "DRM");
 
     INIT_LIST_HEAD(&drm->clients);
+    mutex_init(&drm->clients_lock);
     spin_lock_init(&drm->tile.lock);
 
     /* workaround an odd issue on nvc1 by disabling the device's
@@ -654,6 +655,7 @@ nouveau_drm_device_fini(struct drm_device *dev)
     nouveau_cli_fini(&drm->client);
     nouveau_cli_fini(&drm->master);
     nvif_parent_dtor(&drm->parent);
+    mutex_destroy(&drm->clients_lock);
     kfree(drm);
 }
 
@@ -1086,9 +1088,9 @@ nouveau_drm_open(struct drm_device *dev, struct drm_file *fpriv)
 
     fpriv->driver_priv = cli;
 
-    mutex_lock(&drm->client.mutex);
+    mutex_lock(&drm->clients_lock);
     list_add(&cli->head, &drm->clients);
-    mutex_unlock(&drm->client.mutex);
+    mutex_unlock(&drm->clients_lock);
 
 done:
     if (ret && cli) {
@@ -1114,9 +1116,9 @@ nouveau_drm_postclose(struct drm_device *dev, struct drm_file *fpriv)
     nouveau_abi16_fini(cli->abi16);
     mutex_unlock(&cli->mutex);
 
-    mutex_lock(&drm->client.mutex);
+    mutex_lock(&drm->clients_lock);
     list_del(&cli->head);
-    mutex_unlock(&drm->client.mutex);
+    mutex_unlock(&drm->clients_lock);
 
     nouveau_cli_fini(cli);
     kfree(cli);
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.h b/drivers/gpu/drm/nouveau/nouveau_drv.h
index b8025507a9e4..8b252dca0fc3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.h
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.h
@@ -142,6 +142,11 @@ struct nouveau_drm {
 
     struct list_head clients;
 
+    /**
+     * @clients_lock: Protects access to the @clients list of &struct nouveau_cli.
+     */
+    struct mutex clients_lock;
+
     u8 old_pm_cap;
 
     struct {
From: Jeremy Cline jcline@redhat.com
mainline inclusion
from mainline
commit f55aaf63bde0d0336c3823bb3713bd4a464abbcf
bugzilla: 185776 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2020-27820
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The postclose handler can run after the device has been removed (or the driver has been unbound) since userspace clients are free to hold the file open as long as they want. Because the device removal callback frees the entire nouveau_drm structure, any reference to it in the postclose handler will result in a use-after-free.
To reproduce this, one must simply open the device file, unbind the driver (or physically remove the device), and then close the device file. This was found and can be reproduced easily with the IGT core_hotunplug tests.
To avoid this, all clients are cleaned up in the device finalization rather than deferring it to the postclose handler, and the postclose handler is protected by a critical section which ensures the drm_dev_unplug() and the postclose handler won't race.
This is not an ideal fix, since as I understand the proposed plan for the kernel<->userspace interface for hotplug support, destroying the client before the file is closed will cause problems. However, I believe to properly fix this issue, the lifetime of the nouveau_drm structure needs to be extended to match the drm_device, and this proved to be a rather invasive change. Thus, I've broken this out so the fix can be easily backported.
Together with the two previous commits, this fixes CVE-2020-27820 (Karol).
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Jeremy Cline jcline@redhat.com
Reviewed-by: Lyude Paul lyude@redhat.com
Reviewed-by: Ben Skeggs bskeggs@redhat.com
Tested-by: Karol Herbst kherbst@redhat.com
Signed-off-by: Karol Herbst kherbst@redhat.com
Link: https://patchwork.freedesktop.org/patch/msgid/20201125202648.5220-4-jcline@r...
Link: https://gitlab.freedesktop.org/drm/nouveau/-/merge_requests/14
Signed-off-by: Chen Jun chenjun102@huawei.com
Reviewed-by: weiyang wang wangweiyang2@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 30 +++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 8201e9d11df9..ac96b6ab44c0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -628,6 +628,7 @@ nouveau_drm_device_init(struct drm_device *dev)
 static void
 nouveau_drm_device_fini(struct drm_device *dev)
 {
+    struct nouveau_cli *cli, *temp_cli;
     struct nouveau_drm *drm = nouveau_drm(dev);
 
     if (nouveau_pmops_runtime()) {
@@ -652,6 +653,24 @@ nouveau_drm_device_fini(struct drm_device *dev)
     nouveau_ttm_fini(drm);
     nouveau_vga_fini(drm);
 
+    /*
+     * There may be existing clients from as-yet unclosed files. For now,
+     * clean them up here rather than deferring until the file is closed,
+     * but this likely not correct if we want to support hot-unplugging
+     * properly.
+     */
+    mutex_lock(&drm->clients_lock);
+    list_for_each_entry_safe(cli, temp_cli, &drm->clients, head) {
+        list_del(&cli->head);
+        mutex_lock(&cli->mutex);
+        if (cli->abi16)
+            nouveau_abi16_fini(cli->abi16);
+        mutex_unlock(&cli->mutex);
+        nouveau_cli_fini(cli);
+        kfree(cli);
+    }
+    mutex_unlock(&drm->clients_lock);
+
     nouveau_cli_fini(&drm->client);
     nouveau_cli_fini(&drm->master);
     nvif_parent_dtor(&drm->parent);
@@ -1108,6 +1127,16 @@ nouveau_drm_postclose(struct drm_device *dev, struct drm_file *fpriv)
 {
     struct nouveau_cli *cli = nouveau_cli(fpriv);
     struct nouveau_drm *drm = nouveau_drm(dev);
+    int dev_index;
+
+    /*
+     * The device is gone, and as it currently stands all clients are
+     * cleaned up in the removal codepath. In the future this may change
+     * so that we can support hot-unplugging, but for now we immediately
+     * return to avoid a double-free situation.
+     */
+    if (!drm_dev_enter(dev, &dev_index))
+        return;
 
     pm_runtime_get_sync(dev->dev);
 
@@ -1124,6 +1153,7 @@ nouveau_drm_postclose(struct drm_device *dev, struct drm_file *fpriv)
     kfree(cli);
     pm_runtime_mark_last_busy(dev->dev);
     pm_runtime_put_autosuspend(dev->dev);
+    drm_dev_exit(dev_index);
 }
 
 static const struct drm_ioctl_desc
From: Huang Guobin huangguobin4@huawei.com
mainline inclusion
from mainline
commit b93c6a911a3fe926b00add28f3b932007827c4ca
category: bugfix
bugzilla: 185687 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------
When fuzzing the bonding device interface, I got the following use-after-free call trace:
==================================================================
BUG: KASAN: use-after-free in bond_enslave+0x1521/0x24f0
Read of size 8 at addr ffff88825bc11c00 by task ifenslave/7365

CPU: 5 PID: 7365 Comm: ifenslave Tainted: G    E     5.15.0-rc1+ #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
Call Trace:
 dump_stack_lvl+0x6c/0x8b
 print_address_description.constprop.0+0x48/0x70
 kasan_report.cold+0x82/0xdb
 __asan_load8+0x69/0x90
 bond_enslave+0x1521/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f19159cf577
Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 78
RSP: 002b:00007ffeb3083c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffeb3084bca RCX: 00007f19159cf577
RDX: 00007ffeb3083ce0 RSI: 0000000000008990 RDI: 0000000000000003
RBP: 00007ffeb3084bc4 R08: 0000000000000040 R09: 0000000000000000
R10: 00007ffeb3084bc0 R11: 0000000000000246 R12: 00007ffeb3083ce0
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeb3083cb0

Allocated by task 7365:
 kasan_save_stack+0x23/0x50
 __kasan_kmalloc+0x83/0xa0
 kmem_cache_alloc_trace+0x22e/0x470
 bond_enslave+0x2e1/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 7365:
 kasan_save_stack+0x23/0x50
 kasan_set_track+0x20/0x30
 kasan_set_free_info+0x24/0x40
 __kasan_slab_free+0xf2/0x130
 kfree+0xd1/0x5c0
 slave_kobj_release+0x61/0x90
 kobject_put+0x102/0x180
 bond_sysfs_slave_add+0x7a/0xa0
 bond_enslave+0x11b6/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Last potentially related work creation:
 kasan_save_stack+0x23/0x50
 kasan_record_aux_stack+0xb7/0xd0
 insert_work+0x43/0x190
 __queue_work+0x2e3/0x970
 delayed_work_timer_fn+0x3e/0x50
 call_timer_fn+0x148/0x470
 run_timer_softirq+0x8a8/0xc50
 __do_softirq+0x107/0x55f

Second to last potentially related work creation:
 kasan_save_stack+0x23/0x50
 kasan_record_aux_stack+0xb7/0xd0
 insert_work+0x43/0x190
 __queue_work+0x2e3/0x970
 __queue_delayed_work+0x130/0x180
 queue_delayed_work_on+0xa7/0xb0
 bond_enslave+0xe25/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88825bc11c00
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 0 bytes inside of
 1024-byte region [ffff88825bc11c00, ffff88825bc12000)
The buggy address belongs to the page:
page:ffffea00096f0400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x25bc10
head:ffffea00096f0400 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
raw: 057ff00000010200 ffffea0009a71c08 ffff888240001968 ffff88810004dbc0
raw: 0000000000000000 00000000000a000a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88825bc11b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88825bc11b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88825bc11c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff88825bc11c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88825bc11d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Putting new_slave in bond_sysfs_slave_add() causes a use-after-free when new_slave is accessed in the subsequent error handling path. Since new_slave is put in that error handling path anyway, remove the unnecessary kobject_put() to fix it. In addition, when sysfs_create_file() fails after some files have already been created successfully, those files need to be removed again with sysfs_remove_file(). Since sysfs_create_files() and sysfs_remove_files() already handle all of this, use these two functions instead.
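For reference, sysfs_create_files() already unwinds any partially created files on failure; its fs/sysfs/file.c implementation is roughly the following (quoted from memory, so treat it as a sketch rather than the exact source):

  int sysfs_create_files(struct kobject *kobj,
                         const struct attribute * const *ptr)
  {
          int err = 0;
          int i;

          /* create until done or first failure ... */
          for (i = 0; ptr[i] && !err; i++)
                  err = sysfs_create_file(kobj, ptr[i]);
          /* ... then remove whatever was created before the failure */
          if (err)
                  while (--i >= 0)
                          sysfs_remove_file(kobj, ptr[i]);
          return err;
  }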
Fixes: 7afcaec49696 ("bonding: use kobject_put instead of _del after kobject_add")
Signed-off-by: Huang Guobin huangguobin4@huawei.com
Reviewed-by: Jakub Kicinski kuba@kernel.org
Signed-off-by: David S. Miller davem@davemloft.net
Signed-off-by: Huang Guobin huangguobin4@huawei.com
Reviewed-by: Wei Yongjun weiyongjun1@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 drivers/net/bonding/bond_sysfs_slave.c | 36 ++++++++------------------
 1 file changed, 11 insertions(+), 25 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs_slave.c b/drivers/net/bonding/bond_sysfs_slave.c
index fd07561da034..6a6cdd0bb258 100644
--- a/drivers/net/bonding/bond_sysfs_slave.c
+++ b/drivers/net/bonding/bond_sysfs_slave.c
@@ -108,15 +108,15 @@ static ssize_t ad_partner_oper_port_state_show(struct slave *slave, char *buf)
 }
 static SLAVE_ATTR_RO(ad_partner_oper_port_state);
 
-static const struct slave_attribute *slave_attrs[] = {
-    &slave_attr_state,
-    &slave_attr_mii_status,
-    &slave_attr_link_failure_count,
-    &slave_attr_perm_hwaddr,
-    &slave_attr_queue_id,
-    &slave_attr_ad_aggregator_id,
-    &slave_attr_ad_actor_oper_port_state,
-    &slave_attr_ad_partner_oper_port_state,
+static const struct attribute *slave_attrs[] = {
+    &slave_attr_state.attr,
+    &slave_attr_mii_status.attr,
+    &slave_attr_link_failure_count.attr,
+    &slave_attr_perm_hwaddr.attr,
+    &slave_attr_queue_id.attr,
+    &slave_attr_ad_aggregator_id.attr,
+    &slave_attr_ad_actor_oper_port_state.attr,
+    &slave_attr_ad_partner_oper_port_state.attr,
     NULL
 };
 
@@ -137,24 +137,10 @@ const struct sysfs_ops slave_sysfs_ops = {
 
 int bond_sysfs_slave_add(struct slave *slave)
 {
-    const struct slave_attribute **a;
-    int err;
-
-    for (a = slave_attrs; *a; ++a) {
-        err = sysfs_create_file(&slave->kobj, &((*a)->attr));
-        if (err) {
-            kobject_put(&slave->kobj);
-            return err;
-        }
-    }
-
-    return 0;
+    return sysfs_create_files(&slave->kobj, slave_attrs);
 }
 
 void bond_sysfs_slave_del(struct slave *slave)
 {
-    const struct slave_attribute **a;
-
-    for (a = slave_attrs; *a; ++a)
-        sysfs_remove_file(&slave->kobj, &((*a)->attr));
+    sysfs_remove_files(&slave->kobj, slave_attrs);
 }
From: Zekun Shen bruceshenzk@gmail.com
maillist inclusion
bugzilla: 185787 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2021-43976
Reference: https://patchwork.kernel.org/project/linux-wireless/patch/YX4CqjfRcTa6bVL+@Z...
--------------------------------
Currently, on an unknown recv_type, mwifiex_usb_recv() just returns -1 without restoring the skb. The next time mwifiex_usb_rx_complete() is invoked with the same skb, calling skb_put() causes skb_over_panic.
The bug is triggerable with a compromised/malfunctioning usb device. After applying the patch, skb_over_panic no longer shows up with the same input.
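The shape of the fix is the driver's usual restore pattern, sketched below (a hypothetical reduction; HDR_LEN and KNOWN_TYPE are illustrative stand-ins for the driver's INTF_HEADER_LEN and its real recv_type values):

  #include <linux/skbuff.h>

  #define HDR_LEN     4        /* stand-in for INTF_HEADER_LEN */
  #define KNOWN_TYPE  0xf00d   /* hypothetical recv_type value */

  /* Sketch: undo the skb_pull() on the unknown-type path so the skb
   * handed back to the RX path has its original length again. */
  static int usb_recv_sketch(struct sk_buff *skb, u32 recv_type)
  {
          int ret = 0;

          skb_pull(skb, HDR_LEN);         /* consume interface header */

          switch (recv_type) {
          case KNOWN_TYPE:
                  /* ... hand off the payload ... */
                  break;
          default:
                  ret = -1;
                  goto exit_restore_skb;
          }
          return ret;

  exit_restore_skb:
          skb_push(skb, HDR_LEN);         /* restore original data/len */
          return ret;
  }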
Attached is the panic report from fuzzing:

skbuff: skb_over_panic: text:000000003bf1b5fa len:2048 put:4
head:00000000dd6a115b data:000000000a9445d8 tail:0x844 end:0x840 dev:<NULL>
kernel BUG at net/core/skbuff.c:109!
invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 PID: 198 Comm: in:imklog Not tainted 5.6.0 #60
RIP: 0010:skb_panic+0x15f/0x161
Call Trace:
 <IRQ>
 ? mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb]
 skb_put.cold+0x24/0x24
 mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb]
 __usb_hcd_giveback_urb+0x1e4/0x380
 usb_giveback_urb_bh+0x241/0x4f0
 ? __hrtimer_run_queues+0x316/0x740
 ? __usb_hcd_giveback_urb+0x380/0x380
 tasklet_action_common.isra.0+0x135/0x330
 __do_softirq+0x18c/0x634
 irq_exit+0x114/0x140
 smp_apic_timer_interrupt+0xde/0x380
 apic_timer_interrupt+0xf/0x20
 </IRQ>
Reported-by: Brendan Dolan-Gavitt brendandg@nyu.edu
Signed-off-by: Zekun Shen bruceshenzk@gmail.com
Signed-off-by: Chen Jun chenjun102@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Wei Yongjun weiyongjun1@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 drivers/net/wireless/marvell/mwifiex/usb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/usb.c b/drivers/net/wireless/marvell/mwifiex/usb.c
index 426e39d4ccf0..6d81e87861ca 100644
--- a/drivers/net/wireless/marvell/mwifiex/usb.c
+++ b/drivers/net/wireless/marvell/mwifiex/usb.c
@@ -130,7 +130,8 @@ static int mwifiex_usb_recv(struct mwifiex_adapter *adapter,
         default:
             mwifiex_dbg(adapter, ERROR,
                     "unknown recv_type %#x\n", recv_type);
-            return -1;
+            ret = -1;
+            goto exit_restore_skb;
         }
         break;
     case MWIFIEX_USB_EP_DATA:
From: Ye Bin yebin10@huawei.com
hulk inclusion
category: bugfix
bugzilla: 185781 https://gitee.com/openeuler/kernel/issues/I4DDEL
-----------------------------------------------
blk_mq_quiesce_queue() in elevator_init_mq() waits for an RCU grace period to make sure no I/O happens while blk_mq_init_sched() runs. With a large number of devices, this noticeably slows down boot. Address this by following Ming Lei's suggestion: "We are called before adding disk, when there isn't any FS I/O, so freezing queue plus canceling dispatch work is enough to drain any dispatch activities originated from passthrough requests, then no need to quiesce queue which may add long boot latency, especially when lots of disks are involved."
Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 block/blk-mq.c   | 13 +++++++++++++
 block/blk-mq.h   |  1 +
 block/elevator.c | 10 ++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01ec97aa9ec8..4b2507cdece3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3990,6 +3990,19 @@ unsigned int blk_mq_rq_cpu(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_rq_cpu);
 
+void blk_mq_cancel_work_sync(struct request_queue *q)
+{
+    if (queue_is_mq(q)) {
+        struct blk_mq_hw_ctx *hctx;
+        int i;
+
+        cancel_delayed_work_sync(&q->requeue_work);
+
+        queue_for_each_hw_ctx(q, hctx, i)
+            cancel_delayed_work_sync(&hctx->run_work);
+    }
+}
+
 static int __init blk_mq_init(void)
 {
     int i;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f792a0920ebb..6f87c0681443 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -129,6 +129,7 @@ extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
+void blk_mq_cancel_work_sync(struct request_queue *q);
 void blk_mq_release(struct request_queue *q);
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/elevator.c b/block/elevator.c
index 4ce6b22813a1..27eb70ec277a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -684,12 +684,18 @@ void elevator_init_mq(struct request_queue *q)
     if (!e)
         return;
 
+    /*
+     * We are called before adding disk, when there isn't any FS I/O,
+     * so freezing queue plus canceling dispatch work is enough to
+     * drain any dispatch activities originated from passthrough
+     * requests, then no need to quiesce queue which may add long boot
+     * latency, especially when lots of disks are involved.
+     */
     blk_mq_freeze_queue(q);
-    blk_mq_quiesce_queue(q);
+    blk_mq_cancel_work_sync(q);
 
     err = blk_mq_init_sched(q, e);
 
-    blk_mq_unquiesce_queue(q);
     blk_mq_unfreeze_queue(q);
 
     if (err) {
From: Daniel Borkmann daniel@iogearbox.net
mainline inclusion
from mainline-v5.16-rc1
commit 353050be4c19e102178ccc05988101887c25ae53
category: bugfix
bugzilla: 185802 https://gitee.com/openeuler/kernel/issues/I4DDEL
CVE: CVE-2021-4001
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit a23740ec43ba ("bpf: Track contents of read-only maps as scalars") is checking whether maps are read-only both from BPF program side and user space side, and then, given their content is constant, reading out their data via map->ops->map_direct_value_addr() which is then subsequently used as known scalar value for the register, that is, it is marked as __mark_reg_known() with the read value at verification time. Before a23740ec43ba, the register content was marked as an unknown scalar so the verifier could not make any assumptions about the map content.
The current implementation however is prone to a TOCTOU race, meaning, the value read as known scalar for the register is not guaranteed to be exactly the same at a later point when the program is executed, and as such, the prior made assumptions of the verifier with regards to the program will be invalid which can cause issues such as OOB access, etc.
While the BPF_F_RDONLY_PROG map flag is always fixed and required to be specified at map creation time, the map->frozen property is initially set to false for the map given the map value needs to be populated, e.g. for global data sections. Once complete, the loader "freezes" the map from user space such that no subsequent updates/deletes are possible anymore. For the rest of the lifetime of the map, this freeze one-time trigger cannot be undone anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_* cmd calls which would update/delete map entries will be rejected with -EPERM since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also means that pending update/delete map entries must still complete before this guarantee is given. This corner case is not an issue for loaders since they create and prepare such program private map in successive steps.
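From the loader side, the populate-then-freeze sequence described above looks roughly like this (a sketch using libbpf's legacy map APIs of this era; error handling omitted, values illustrative):

  /* Sketch: create a prog-read-only map, fill it, then freeze it so
   * syscall-side writes are refused (-EPERM) from then on. */
  #include <bpf/bpf.h>

  int make_frozen_map(void)
  {
          char value[64] = "constant data";
          int key = 0;
          int fd;

          fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
                              sizeof(value), 1, BPF_F_RDONLY_PROG);
          bpf_map_update_elem(fd, &key, value, BPF_ANY); /* still writable */
          bpf_map_freeze(fd);  /* one-way: later updates fail with -EPERM */
          return fd;
  }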
However, a malicious user is able to trigger this TOCTOU race in two different ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is used to expand the competition interval, so that map_update_elem() can modify the contents of the map after map_freeze() and bpf_prog_load() were executed. This works, because userfaultfd halts the parallel thread which triggered a map_update_elem() at the time where we copy key/value from the user buffer and this already passed the FMODE_CAN_WRITE capability test given at that time the map was not "frozen". Then, the main thread performs the map_freeze() and bpf_prog_load(), and once that had completed successfully, the other thread is woken up to complete the pending map_update_elem() which then changes the map content. For ii) the idea of the batched update is similar, meaning, when there are a large number of updates to be processed, it can increase the competition interval between the two. It is therefore possible in practice to modify the contents of the map after executing map_freeze() and bpf_prog_load().
One way to fix both i) and ii) at the same time is to expand the use of the map's map->writecnt. The latter was introduced in fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2e2 ("bpf: Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with the rationale to make a writable mmap()'ing of a map mutually exclusive with read-only freezing. The counter indicates writable mmap() mappings and then prevents/fails the freeze operation. Its semantics can be expanded beyond just mmap() by generally indicating ongoing write phases. This would essentially span any parallel regular and batched flavor of update/delete operation and then also have map_freeze() fail with -EBUSY. For the check_mem_access() in the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all last pending writes have completed via bpf_map_write_active() test. Once the map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0 only then we are really guaranteed to use the map's data as known constants. For map->frozen being set and pending writes in process of still being completed we fall back to marking that register as unknown scalar so we don't end up making assumptions about it. With this, both TOCTOU reproducers from i) and ii) are fixed.
Note that the map->writecnt has been converted into an atomic64 in the fix in order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning the freeze_mutex over entire map update/delete operations in syscall side would not be possible due to then causing everything to be serialized. Similarly, something like synchronize_rcu() after setting map->frozen to wait for update/deletes to complete is not possible either since it would also have to span the user copy which can sleep. On the libbpf side, this won't break d66562fba1ce ("libbpf: Add BPF object skeleton support") as the anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed mmap()-ed memory where for .rodata it's non-writable.
Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Acked-by: Andrii Nakryiko andrii@kernel.org
Signed-off-by: Alexei Starovoitov ast@kernel.org

conflicts:
    kernel/bpf/syscall.c

Signed-off-by: He Fengqing hefengqing@huawei.com
Reviewed-by: Kuohai Xu xukuohai@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com
---
 include/linux/bpf.h   |  3 ++-
 kernel/bpf/syscall.c  | 57 +++++++++++++++++++++++++++----------------
 kernel/bpf/verifier.c | 17 ++++++++++++-
 3 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 89b07eaf4415..88245386a4b6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -173,7 +173,7 @@ struct bpf_map {
     atomic64_t usercnt;
     struct work_struct work;
     struct mutex freeze_mutex;
-    u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
+    atomic64_t writecnt;
 };
 
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
@@ -1273,6 +1273,7 @@ void bpf_map_charge_move(struct bpf_map_memory *dst,
 void *bpf_map_area_alloc(u64 size, int numa_node);
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base);
+bool bpf_map_write_active(const struct bpf_map *map);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 int generic_map_lookup_batch(struct bpf_map *map,
                  const union bpf_attr *attr,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5b6da64da46d..bb9a9cb1f321 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -127,6 +127,21 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
     return map;
 }
 
+static void bpf_map_write_active_inc(struct bpf_map *map)
+{
+    atomic64_inc(&map->writecnt);
+}
+
+static void bpf_map_write_active_dec(struct bpf_map *map)
+{
+    atomic64_dec(&map->writecnt);
+}
+
+bool bpf_map_write_active(const struct bpf_map *map)
+{
+    return atomic64_read(&map->writecnt) != 0;
+}
+
 static u32 bpf_map_value_size(struct bpf_map *map)
 {
     if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -588,11 +603,8 @@ static void bpf_map_mmap_open(struct vm_area_struct *vma)
 {
     struct bpf_map *map = vma->vm_file->private_data;
 
-    if (vma->vm_flags & VM_MAYWRITE) {
-        mutex_lock(&map->freeze_mutex);
-        map->writecnt++;
-        mutex_unlock(&map->freeze_mutex);
-    }
+    if (vma->vm_flags & VM_MAYWRITE)
+        bpf_map_write_active_inc(map);
 }
 
 /* called for all unmapped memory region (including initial) */
@@ -600,11 +612,8 @@ static void bpf_map_mmap_close(struct vm_area_struct *vma)
 {
     struct bpf_map *map = vma->vm_file->private_data;
 
-    if (vma->vm_flags & VM_MAYWRITE) {
-        mutex_lock(&map->freeze_mutex);
-        map->writecnt--;
-        mutex_unlock(&map->freeze_mutex);
-    }
+    if (vma->vm_flags & VM_MAYWRITE)
+        bpf_map_write_active_dec(map);
 }
 
 static const struct vm_operations_struct bpf_map_default_vmops = {
@@ -654,7 +663,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
         goto out;
 
     if (vma->vm_flags & VM_MAYWRITE)
-        map->writecnt++;
+        bpf_map_write_active_inc(map);
 out:
     mutex_unlock(&map->freeze_mutex);
     return err;
@@ -1086,6 +1095,7 @@ static int map_update_elem(union bpf_attr *attr)
     map = __bpf_map_get(f);
     if (IS_ERR(map))
         return PTR_ERR(map);
+    bpf_map_write_active_inc(map);
     if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
         err = -EPERM;
         goto err_put;
@@ -1127,6 +1137,7 @@ static int map_update_elem(union bpf_attr *attr)
 free_key:
     kfree(key);
 err_put:
+    bpf_map_write_active_dec(map);
     fdput(f);
     return err;
 }
@@ -1149,6 +1160,7 @@ static int map_delete_elem(union bpf_attr *attr)
     map = __bpf_map_get(f);
     if (IS_ERR(map))
         return PTR_ERR(map);
+    bpf_map_write_active_inc(map);
     if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
         err = -EPERM;
         goto err_put;
@@ -1179,6 +1191,7 @@ static int map_delete_elem(union bpf_attr *attr)
 out:
     kfree(key);
 err_put:
+    bpf_map_write_active_dec(map);
     fdput(f);
     return err;
 }
@@ -1483,6 +1496,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
     map = __bpf_map_get(f);
     if (IS_ERR(map))
         return PTR_ERR(map);
+    bpf_map_write_active_inc(map);
     if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) ||
         !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
         err = -EPERM;
@@ -1524,6 +1538,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 free_key:
     kfree(key);
 err_put:
+    bpf_map_write_active_dec(map);
     fdput(f);
     return err;
 }
@@ -1550,8 +1565,7 @@ static int map_freeze(const union bpf_attr *attr)
     }
 
     mutex_lock(&map->freeze_mutex);
-
-    if (map->writecnt) {
+    if (bpf_map_write_active(map)) {
         err = -EBUSY;
         goto err_put;
     }
@@ -3976,6 +3990,9 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
                 union bpf_attr __user *uattr,
                 int cmd)
 {
+    bool has_read = cmd == BPF_MAP_LOOKUP_BATCH ||
+            cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH;
+    bool has_write = cmd != BPF_MAP_LOOKUP_BATCH;
     struct bpf_map *map;
     int err, ufd;
     struct fd f;
@@ -3988,16 +4005,13 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
     map = __bpf_map_get(f);
     if (IS_ERR(map))
         return PTR_ERR(map);
-
-    if ((cmd == BPF_MAP_LOOKUP_BATCH ||
-         cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH) &&
-        !(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+    if (has_write)
+        bpf_map_write_active_inc(map);
+    if (has_read && !(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
         err = -EPERM;
         goto err_put;
     }
-
-    if (cmd != BPF_MAP_LOOKUP_BATCH &&
-        !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
+    if (has_write && !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
         err = -EPERM;
         goto err_put;
     }
@@ -4010,8 +4024,9 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
         BPF_DO_BATCH(map->ops->map_update_batch);
     else
         BPF_DO_BATCH(map->ops->map_delete_batch);
-
 err_put:
+    if (has_write)
+        bpf_map_write_active_dec(map);
     fdput(f);
     return err;
 }
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0c26757ea7fb..4215c2ff6aeb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3486,7 +3486,22 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
 
 static bool bpf_map_is_rdonly(const struct bpf_map *map)
 {
-    return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen;
+    /* A map is considered read-only if the following condition are true:
+     *
+     * 1) BPF program side cannot change any of the map content. The
+     *    BPF_F_RDONLY_PROG flag is throughout the lifetime of a map
+     *    and was set at map creation time.
+     * 2) The map value(s) have been initialized from user space by a
+     *    loader and then "frozen", such that no new map update/delete
+     *    operations from syscall side are possible for the rest of
+     *    the map's lifetime from that point onwards.
+     * 3) Any parallel/pending map update/delete operations from syscall
+     *    side have been completed. Only after that point, it's safe to
+     *    assume that map value(s) are immutable.
+     */
+    return (map->map_flags & BPF_F_RDONLY_PROG) &&
+           READ_ONCE(map->frozen) &&
+           !bpf_map_write_active(map);
 }
 
 static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val)