From: Filipe Manana fdmanana@suse.com
mainline inclusion from mainline-v6.2-rc8 commit 2f1a6be12ab6c8470d5776e68644726c94257c54 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6PQCT CVE: CVE-2023-1611
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The quota assign ioctl can currently run in parallel with a quota disable ioctl call. The assign ioctl uses the quota root, while the disable ioctl frees that root, and therefore we can have a use-after-free triggered in the assign ioctl, leading to a trace like the following when KASAN is enabled:
[672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0 [672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736 [672.724][T736] [672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37 [672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [672.727][T736] Call Trace: [672.728][T736] <TASK> [672.728][T736] dump_stack_lvl+0xd9/0x150 [672.725][T736] print_report+0xc1/0x5e0 [672.720][T736] ? __virt_addr_valid+0x61/0x2e0 [672.727][T736] ? __phys_addr+0xc9/0x150 [672.725][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.722][T736] kasan_report+0xc0/0xf0 [672.729][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.724][T736] btrfs_search_slot+0x2962/0x2db0 [672.723][T736] ? fs_reclaim_acquire+0xba/0x160 [672.722][T736] ? split_leaf+0x13d0/0x13d0 [672.726][T736] ? rcu_is_watching+0x12/0xb0 [672.723][T736] ? kmem_cache_alloc+0x338/0x3c0 [672.722][T736] update_qgroup_status_item+0xf7/0x320 [672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0 [672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0 [672.730][T736] ? spin_bug+0x1d0/0x1d0 [672.737][T736] btrfs_run_qgroups+0x5de/0x840 [672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70 [672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00 [672.738][T736] btrfs_ioctl+0x3d58/0x5d80 [672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550 [672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0 [672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90 [672.734][T736] ? do_vfs_ioctl+0x132/0x1660 [672.730][T736] ? vfs_fileattr_set+0xc40/0xc40 [672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50 [672.732][T736] ? sigprocmask+0xf2/0x340 [672.737][T736] ? __fget_files+0x26a/0x480 [672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10 [672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.736][T736] __x64_sys_ioctl+0x198/0x210 [672.736][T736] do_syscall_64+0x39/0xb0 [672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.739][T736] RIP: 0033:0x4556ad [672.742][T736] </TASK> [672.743][T736] [672.748][T736] Allocated by task 27677: [672.743][T736] kasan_save_stack+0x22/0x40 [672.741][T736] kasan_set_track+0x25/0x30 [672.741][T736] __kasan_kmalloc+0xa4/0xb0 [672.749][T736] btrfs_alloc_root+0x48/0x90 [672.746][T736] btrfs_create_tree+0x146/0xa20 [672.744][T736] btrfs_quota_enable+0x461/0x1d20 [672.743][T736] btrfs_ioctl+0x4a1c/0x5d80 [672.747][T736] __x64_sys_ioctl+0x198/0x210 [672.749][T736] do_syscall_64+0x39/0xb0 [672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.756][T736] [672.757][T736] Freed by task 27677: [672.759][T736] kasan_save_stack+0x22/0x40 [672.759][T736] kasan_set_track+0x25/0x30 [672.756][T736] kasan_save_free_info+0x2e/0x50 [672.751][T736] ____kasan_slab_free+0x162/0x1c0 [672.758][T736] slab_free_freelist_hook+0x89/0x1c0 [672.752][T736] __kmem_cache_free+0xaf/0x2e0 [672.752][T736] btrfs_put_root+0x1ff/0x2b0 [672.759][T736] btrfs_quota_disable+0x80a/0xbc0 [672.752][T736] btrfs_ioctl+0x3e5f/0x5d80 [672.756][T736] __x64_sys_ioctl+0x198/0x210 [672.753][T736] do_syscall_64+0x39/0xb0 [672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.769][T736] [672.768][T736] The buggy address belongs to the object at ffff888022ec0000 [672.768][T736] which belongs to the cache kmalloc-4k of size 4096 [672.769][T736] The buggy address is located 520 bytes inside of [672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000) [672.760][T736] [672.764][T736] The buggy address belongs to the physical page: [672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0 [672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) [672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002 [672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000 [672.771][T736] page dumped because: kasan: bad access detected [672.778][T736] page_owner tracks the page as allocated [672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88 [672.779][T736] get_page_from_freelist+0x119c/0x2d50 [672.779][T736] __alloc_pages+0x1cb/0x4a0 [672.776][T736] alloc_pages+0x1aa/0x270 [672.773][T736] allocate_slab+0x260/0x390 [672.771][T736] ___slab_alloc+0xa9a/0x13e0 [672.778][T736] __slab_alloc.constprop.0+0x56/0xb0 [672.771][T736] __kmem_cache_alloc_node+0x136/0x320 [672.789][T736] __kmalloc+0x4e/0x1a0 [672.783][T736] tomoyo_realpath_from_path+0xc3/0x600 [672.781][T736] tomoyo_path_perm+0x22f/0x420 [672.782][T736] tomoyo_path_unlink+0x92/0xd0 [672.780][T736] security_path_unlink+0xdb/0x150 [672.788][T736] do_unlinkat+0x377/0x680 [672.788][T736] __x64_sys_unlink+0xca/0x110 [672.789][T736] do_syscall_64+0x39/0xb0 [672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.784][T736] page last free stack trace: [672.787][T736] free_pcp_prepare+0x4e5/0x920 [672.787][T736] free_unref_page+0x1d/0x4e0 [672.784][T736] __unfreeze_partials+0x17c/0x1a0 [672.797][T736] qlist_free_all+0x6a/0x180 [672.796][T736] kasan_quarantine_reduce+0x189/0x1d0 [672.797][T736] __kasan_slab_alloc+0x64/0x90 [672.793][T736] kmem_cache_alloc+0x17c/0x3c0 [672.799][T736] getname_flags.part.0+0x50/0x4e0 [672.799][T736] getname_flags+0x9e/0xe0 [672.792][T736] vfs_fstatat+0x77/0xb0 [672.791][T736] __do_sys_newlstat+0x84/0x100 [672.798][T736] do_syscall_64+0x39/0xb0 [672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.790][T736] [672.791][T736] Memory state around the buggy address: [672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ^ [672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex before calling btrfs_run_qgroups(), which is what all qgroup ioctls should call.
Reported-by: butt3rflyh4ck butterflyhuangxx@gmail.com Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8w... CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo wqu@suse.com Signed-off-by: Filipe Manana fdmanana@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com
Conflicts: fs/btrfs/qgroup.c
Signed-off-by: Long Li leo.lilong@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- fs/btrfs/ioctl.c | 2 ++ fs/btrfs/qgroup.c | 11 ++++++++++- 2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a5ae02bf3652..00424d3f3464 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -5240,7 +5240,9 @@ static long btrfs_ioctl_qgroup_assign(struct file *file, void __user *arg) }
/* update qgroup status and info */ + mutex_lock(&fs_info->qgroup_ioctl_lock); err = btrfs_run_qgroups(trans); + mutex_unlock(&fs_info->qgroup_ioctl_lock); if (err < 0) btrfs_handle_fs_error(fs_info, err, "failed to update qgroup status and info"); diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 7916f711daf5..8e58c58f73a3 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2196,7 +2196,8 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans) }
/* - * called from commit_transaction. Writes all changed qgroups to disk. + * Writes all changed qgroups to disk. + * Called by the transaction commit path and the qgroup assign ioctl. */ int btrfs_run_qgroups(struct btrfs_trans_handle *trans) { @@ -2204,6 +2205,14 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans) struct btrfs_root *quota_root = fs_info->quota_root; int ret = 0;
+ /* + * In case we are called from the qgroup assign ioctl, assert that we + * are holding the qgroup_ioctl_lock, otherwise we can race with a quota + * disable operation (ioctl) and access a freed quota root. + */ + if (trans->transaction->state != TRANS_STATE_COMMIT_DOING) + lockdep_assert_held(&fs_info->qgroup_ioctl_lock); + if (!quota_root) return ret;
From: Songping Yu yusongping@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6TE76 CVE: NA
--------------------------------
This reverts commit 3a44e838d0e4ae53caeb40b17abcabf5b7709f84.
Signed-off-by: Songping Yu yusongping@huawei.com Reviewed-by: Zhang Qiao zhangqiao22@huawei.com Reviewed-by: chenhui judy.chenhui@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- kernel/sched/fair.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ad6a7923c9ed..0dba06ce0677 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3949,14 +3949,10 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se) #endif }
-/* The threshold when the wakee's vruntime should be set cfs_rq->min_vruntime, default: 200 days */ -#define WAKEUP_REINIT_THRESHOLD_NS (200 * 24 * 3600 * NSEC_PER_SEC) - static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) { u64 vruntime = cfs_rq->min_vruntime; - struct rq *rq = rq_of(cfs_rq);
/* * The 'current' period is already promised to the current tasks, @@ -3981,11 +3977,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) vruntime -= thresh; }
- if (unlikely(!initial && (s64)(rq_clock_task(rq) - se->exec_start) > WAKEUP_REINIT_THRESHOLD_NS)) - se->vruntime = vruntime; - else /* ensure we never gain time by being placed backwards. */ - se->vruntime = max_vruntime(se->vruntime, vruntime); + se->vruntime = max_vruntime(se->vruntime, vruntime); }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
From: Zhang Qiao zhangqiao22@huawei.com
mainline inclusion from mainline-v6.3-rc4 commit 829c1651e9c4a6f78398d3e67651cef9bb6b42cc category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6TE76 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=829c1...
--------------------------------
When a scheduling entity is placed onto cfs_rq, its vruntime is pulled to the base level (around cfs_rq->min_vruntime), so that the entity doesn't gain extra boost when placed backwards.
However, if the entity being placed wasn't executed for a long time, its vruntime may get too far behind (e.g. while cfs_rq was executing a low-weight hog), which can inverse the vruntime comparison due to s64 overflow. This results in the entity being placed with its original vruntime way forwards, so that it will effectively never get to the cpu.
To prevent that, ignore the vruntime of the entity being placed if it didn't execute for much longer than the characteristic sheduler time scale.
[rkagan: formatted, adjusted commit log, comments, cutoff value] Signed-off-by: Zhang Qiao zhangqiao22@huawei.com Co-developed-by: Roman Kagan rkagan@amazon.de Signed-off-by: Roman Kagan rkagan@amazon.de Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20230130122216.3555094-1-rkagan@amazon.de Signed-off-by: Songping Yu yusongping@huawei.com Reviewed-by: Zhang Qiao zhangqiao22@huawei.com Reviewed-by: chenhui judy.chenhui@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- kernel/sched/fair.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0dba06ce0677..0c3a9f4e971b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3953,6 +3953,7 @@ static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) { u64 vruntime = cfs_rq->min_vruntime; + u64 sleep_time;
/* * The 'current' period is already promised to the current tasks, @@ -3977,8 +3978,18 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) vruntime -= thresh; }
- /* ensure we never gain time by being placed backwards. */ - se->vruntime = max_vruntime(se->vruntime, vruntime); + /* + * Pull vruntime of the entity being placed to the base level of + * cfs_rq, to prevent boosting it if placed backwards. If the entity + * slept for a long time, don't even try to compare its vruntime with + * the base as it may be too far off and the comparison may get + * inversed due to s64 overflow. + */ + sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start; + if ((s64)sleep_time > 60LL * NSEC_PER_SEC) + se->vruntime = vruntime; + else + se->vruntime = max_vruntime(se->vruntime, vruntime); }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
From: Vincent Guittot vincent.guittot@linaro.org
mainline inclusion from mainline-v6.3-rc4 commit a53ce18cacb477dd0513c607f187d16f0fa96f71 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6TE76 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=a53ce...
--------------------------------
Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed") fixes an overflowing bug, but ignore a case that se->exec_start is reset after a migration.
For fixing this case, we delay the reset of se->exec_start after placing the entity which se->exec_start to detect long sleeping task.
In order to take into account a possible divergence between the clock_task of 2 rqs, we increase the threshold to around 104 days.
Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed") Originally-by: Zhang Qiao zhangqiao22@huawei.com Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Zhang Qiao zhangqiao22@huawei.com Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org Signed-off-by: Songping Yu yusongping@huawei.com Reviewed-by: Zhang Qiao zhangqiao22@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- kernel/sched/core.c | 3 +++ kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++--------- 2 files changed, 47 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 970616070da8..5d2a0a591e18 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -767,6 +767,9 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
void activate_task(struct rq *rq, struct task_struct *p, int flags) { + if (task_on_rq_migrating(p)) + flags |= ENQUEUE_MIGRATED; + if (task_contributes_to_load(p)) rq->nr_uninterruptible--;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0c3a9f4e971b..77a49eae2cdd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3949,11 +3949,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se) #endif }
+static inline bool entity_is_long_sleeper(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq; + u64 sleep_time; + + if (se->exec_start == 0) + return false; + + cfs_rq = cfs_rq_of(se); + + sleep_time = rq_clock_task(rq_of(cfs_rq)); + + /* Happen while migrating because of clock task divergence */ + if (sleep_time <= se->exec_start) + return false; + + sleep_time -= se->exec_start; + if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD))) + return true; + + return false; +} + static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial) { u64 vruntime = cfs_rq->min_vruntime; - u64 sleep_time;
/* * The 'current' period is already promised to the current tasks, @@ -3980,13 +4002,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
/* * Pull vruntime of the entity being placed to the base level of - * cfs_rq, to prevent boosting it if placed backwards. If the entity - * slept for a long time, don't even try to compare its vruntime with - * the base as it may be too far off and the comparison may get - * inversed due to s64 overflow. - */ - sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start; - if ((s64)sleep_time > 60LL * NSEC_PER_SEC) + * cfs_rq, to prevent boosting it if placed backwards. + * However, min_vruntime can advance much faster than real time, with + * the extreme being when an entity with the minimal weight always runs + * on the cfs_rq. If the waking entity slept for a long time, its + * vruntime difference from min_vruntime may overflow s64 and their + * comparison may get inversed, so ignore the entity's original + * vruntime in that case. + * The maximal vruntime speedup is given by the ratio of normal to + * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES. + * When placing a migrated waking entity, its exec_start has been set + * from a different rq. In order to take into account a possible + * divergence between new and prev rq's clocks task because of irq and + * stolen time, we take an additional margin. + * So, cutting off on the sleep time of + * 2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days + * should be safe. + */ + if (entity_is_long_sleeper(se)) se->vruntime = vruntime; else se->vruntime = max_vruntime(se->vruntime, vruntime); @@ -4085,6 +4118,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & ENQUEUE_WAKEUP) place_entity(cfs_rq, se, 0); + /* Entity has migrated, no longer consider this task hot */ + if (flags & ENQUEUE_MIGRATED) + se->exec_start = 0;
check_schedstat_required(); update_stats_enqueue(cfs_rq, se, flags); @@ -6912,9 +6948,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) /* Tell new CPU we are migrated */ p->se.avg.last_update_time = 0;
- /* We have migrated, no longer consider this task hot */ - p->se.exec_start = 0; - update_scan_period(p, new_cpu); }