From: Hugh Dickins <hughd@google.com>
mainline inclusion
from mainline-v5.18-rc1
commit 56a8c8eb1eaf21261be8cdc4e3715239ac087342
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6113U
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Mikulas asked "Do we still need commit a0ee5ec520ed ('tmpfs: allocate on read when stacked')?" in [1].
Lukas noticed this unusual behavior of a loop device backed by tmpfs in [2].
Normally, shmem_file_read_iter() copies the ZERO_PAGE when reading holes; but if it looks like it might be a read for "a stacking filesystem", it allocates actual pages to the page cache, and even marks them as dirty. And reads from the loop device do satisfy the test that is used.
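For context, the heuristic being removed looked roughly like this at the top of shmem_file_read_iter() (reconstructed from the shmem.c hunk below; note that reads issued by the loop driver arrive as ITER_BVEC iterators, so iter_is_iovec() is false for them):

	enum sgp_type sgp = SGP_READ;

	/*
	 * Might this read be for a stacking filesystem?  Then when reading
	 * holes of a sparse file, we actually need to allocate those pages,
	 * and even mark them dirty, so it cannot exceed the max_blocks limit.
	 */
	if (!iter_is_iovec(to))
		sgp = SGP_CACHE;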
This oddity was added for an old version of unionfs, to help to limit its usage to the limited size of the tmpfs mount involved; but about the same time as the tmpfs mod went in (2.6.25), unionfs was reworked to proceed differently; and the mod was kept just in case others needed it.
Do we still need it? I cannot answer with more certainty than "Probably not". It's nasty enough that we really should try to delete it; but if a regression is reported somewhere, then we might have to revert later.
It's not quite as simple as just removing the test (as Mikulas did): xfstests generic/013 hung because splice from tmpfs failed on page not up-to-date and page mapping unset. That can be fixed just by marking the ZERO_PAGE as Uptodate, which of course it is: do so in pagecache_init() - it might be useful to others than tmpfs.
My intention, though, was to stop using the ZERO_PAGE here altogether: surely iov_iter_zero() is better for this case? Sadly not: it relies on clear_user(), and the x86 clear_user() is slower than its copy_user() [3].
But while we are still using the ZERO_PAGE, let's stop dirtying its struct page cacheline with unnecessary get_page() and put_page().
Link: https://lore.kernel.org/linux-mm/alpine.LRH.2.02.2007210510230.6959@file01.i... [1]
Link: https://lore.kernel.org/linux-mm/20211126075100.gd64odg2bcptiqeb@work/ [2]
Link: https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com... [3]
Link: https://lkml.kernel.org/r/90bc5e69-9984-b5fa-a685-be55f2b64b@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 mm/filemap.c |  6 ++++++
 mm/shmem.c   | 20 ++++++--------------
 2 files changed, 12 insertions(+), 14 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c index 98b448d9873f..a44baa79c1e7 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -996,6 +996,12 @@ void __init pagecache_init(void) init_waitqueue_head(&page_wait_table[i]);
page_writeback_init(); + + /* + * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date, + * and splice's page_cache_pipe_buf_confirm() needs to see that. + */ + SetPageUptodate(ZERO_PAGE(0)); }
/* diff --git a/mm/shmem.c b/mm/shmem.c index ad2d68150ed2..c5028634afee 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2573,19 +2573,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) struct address_space *mapping = inode->i_mapping; pgoff_t index; unsigned long offset; - enum sgp_type sgp = SGP_READ; int error = 0; ssize_t retval = 0; loff_t *ppos = &iocb->ki_pos;
- /* - * Might this read be for a stacking filesystem? Then when reading - * holes of a sparse file, we actually need to allocate those pages, - * and even mark them dirty, so it cannot exceed the max_blocks limit. - */ - if (!iter_is_iovec(to)) - sgp = SGP_CACHE; - index = *ppos >> PAGE_SHIFT; offset = *ppos & ~PAGE_MASK;
@@ -2594,6 +2585,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) pgoff_t end_index; unsigned long nr, ret; loff_t i_size = i_size_read(inode); + bool got_page;
end_index = i_size >> PAGE_SHIFT; if (index > end_index) @@ -2604,15 +2596,13 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) break; }
- error = shmem_getpage(inode, index, &page, sgp); + error = shmem_getpage(inode, index, &page, SGP_READ); if (error) { if (error == -EINVAL) error = 0; break; } if (page) { - if (sgp == SGP_CACHE) - set_page_dirty(page); unlock_page(page); }
@@ -2646,9 +2636,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) */ if (!offset) mark_page_accessed(page); + got_page = true; } else { page = ZERO_PAGE(0); - get_page(page); + got_page = false; }
/* @@ -2661,7 +2652,8 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) index += offset >> PAGE_SHIFT; offset &= ~PAGE_MASK;
- put_page(page); + if (got_page) + put_page(page); if (!iov_iter_count(to)) break; if (ret < nr) {
From: Hugh Dickins <hughd@google.com>
mainline inclusion
from mainline-v5.18-rc3
commit 1bdec44b1eee32e311b44b5b06144bb7d9b33938
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6113U
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.
Whilst nfsd_splice_actor() does contain some questionable handling of repeated pages, and Chuck was able to work around it there, history from Mark Hemment makes clear that there might be similar dangers elsewhere: it was not a good idea for me to pass ZERO_PAGE down to unknown actors.
Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when iter_is_iovec(); in other cases, use the more natural iov_iter_zero() instead of copy_page_to_iter().
We would use iov_iter_zero() throughout, but the x86 clear_user() is not nearly so well optimized as copy to user (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
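The comparison above can be approximated with a plain dd over a sparse tmpfs file; the mount point and sizes here are illustrative:

  # mount -t tmpfs -o size=2T tmpfs /mnt
  # truncate -s 1T /mnt/sparse
  # dd if=/mnt/sparse of=/dev/null bs=1M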
And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)), which had caused a boot failure on arm noMMU STM32F7 and STM32H7 boards.
Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
Fixes: 56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever III <chuck.lever@oracle.com>
Cc: Mark Hemment <markhemm@googlemail.com>
Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 mm/filemap.c |  6 ------
 mm/shmem.c   | 31 ++++++++++++++++++++-----------
 2 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c index a44baa79c1e7..98b448d9873f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -996,12 +996,6 @@ void __init pagecache_init(void) init_waitqueue_head(&page_wait_table[i]);
page_writeback_init(); - - /* - * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date, - * and splice's page_cache_pipe_buf_confirm() needs to see that. - */ - SetPageUptodate(ZERO_PAGE(0)); }
/* diff --git a/mm/shmem.c b/mm/shmem.c index c5028634afee..34b36bfbaf7e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2585,7 +2585,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) pgoff_t end_index; unsigned long nr, ret; loff_t i_size = i_size_read(inode); - bool got_page;
end_index = i_size >> PAGE_SHIFT; if (index > end_index) @@ -2636,24 +2635,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) */ if (!offset) mark_page_accessed(page); - got_page = true; + /* + * Ok, we have the page, and it's up-to-date, so + * now we can copy it to user space... + */ + ret = copy_page_to_iter(page, offset, nr, to); + put_page(page); + + } else if (iter_is_iovec(to)) { + /* + * Copy to user tends to be so well optimized, but + * clear_user() not so much, that it is noticeably + * faster to copy the zero page instead of clearing. + */ + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to); } else { - page = ZERO_PAGE(0); - got_page = false; + /* + * But submitting the same page twice in a row to + * splice() - or others? - can result in confusion: + * so don't attempt that optimization on pipes etc. + */ + ret = iov_iter_zero(nr, to); }
- /* - * Ok, we have the page, and it's up-to-date, so - * now we can copy it to user space... - */ - ret = copy_page_to_iter(page, offset, nr, to); retval += ret; offset += ret; index += offset >> PAGE_SHIFT; offset &= ~PAGE_MASK;
- if (got_page) - put_page(page); if (!iov_iter_count(to)) break; if (ret < nr) {
From: Zheng Zucheng <zhengzucheng@huawei.com>
hulk inclusion
category: feature
bugzilla: 187196, https://gitee.com/openeuler/kernel/issues/I612CS
CVE: NA
-------------------------------
Allocate a new task_struct_resvd object for the recently cloned task.
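A note on the back-pointer: since the extension object is a separate allocation, container_of() on one of its members can only recover the task_struct_resvd, never the owning task_struct, so the .task field is the only way back. A later patch in this series relies on exactly that; roughly:

	static void wake_oom_reaper(struct timer_list *timer)
	{
		struct task_struct_resvd *tsk_resvd = container_of(timer,
				struct task_struct_resvd, oom_reaper_timer);
		struct task_struct *tsk = tsk_resvd->task;	/* back-pointer */
		/* ... proceed with tsk ... */
	}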
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: chenhui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/linux/sched.h |  2 ++
 init/init_task.c      |  5 +++++
 kernel/fork.c         | 21 ++++++++++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 6631387012e7..cd68fc0de8ee 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -673,6 +673,8 @@ struct wake_q_node { * struct task_struct_resvd - KABI extension struct */ struct task_struct_resvd { + /* pointer back to the main task_struct */ + struct task_struct *task; };
struct task_struct { diff --git a/init/init_task.c b/init/init_task.c index 5fa18ed59d33..891007de2eef 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -57,6 +57,10 @@ unsigned long init_shadow_call_stack[SCS_SIZE / sizeof(long)] }; #endif
+static struct task_struct_resvd init_task_struct_resvd = { + .task = &init_task, +}; + /* * Set up the first task table, touch at your own risk!. Base=0, * limit=0x1fffff (=2MB) @@ -213,6 +217,7 @@ struct task_struct init_task #ifdef CONFIG_SECCOMP_FILTER .seccomp = { .filter_count = ATOMIC_INIT(0) }, #endif + ._resvd = &init_task_struct_resvd, }; EXPORT_SYMBOL(init_task);
diff --git a/kernel/fork.c b/kernel/fork.c index 0fb86b65ae60..8ceaece248fa 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -174,6 +174,7 @@ static inline struct task_struct *alloc_task_struct_node(int node)
static inline void free_task_struct(struct task_struct *tsk) { + kfree(tsk->_resvd); kmem_cache_free(task_struct_cachep, tsk); } #endif @@ -851,6 +852,18 @@ void set_task_stack_end_magic(struct task_struct *tsk) *stackend = STACK_END_MAGIC; /* for overflow detection */ }
+static bool dup_resvd_task_struct(struct task_struct *dst, + struct task_struct *orig, int node) +{ + dst->_resvd = kmalloc_node(sizeof(struct task_struct_resvd), + GFP_KERNEL, node); + if (!dst->_resvd) + return false; + + dst->_resvd->task = dst; + return true; +} + static struct task_struct *dup_task_struct(struct task_struct *orig, int node) { struct task_struct *tsk; @@ -863,6 +876,12 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) tsk = alloc_task_struct_node(node); if (!tsk) return NULL; + /* + * before proceeding, we need to make tsk->_resvd = NULL, + * otherwise the error paths below, if taken, might end up causing + * a double-free for task_struct_resvd extension object. + */ + WRITE_ONCE(tsk->_resvd, NULL);
stack = alloc_thread_stack_node(tsk, node); if (!stack) @@ -888,7 +907,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) refcount_set(&tsk->stack_refcount, 1); #endif
- if (err) + if (err || !dup_resvd_task_struct(tsk, orig, node)) goto free_stack;
err = scs_prepare(tsk, node);
From: Nico Pache <npache@redhat.com>
mainline inclusion
from mainline-v5.18-rc4
commit e4a38402c36e42df28eb1a5394be87e6571fb48a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61FDP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can be targeted by the oom reaper. This mapping is used to store the futex robust list head; the kernel does not keep a copy of the robust list and instead references a userspace address to maintain the robustness during a process death.
A race can occur between exit_mm and the oom reaper that allows the oom reaper to free the memory of the futex robust list before the exit path has handled the futex death:
CPU1                                   CPU2
--------------------------------------------------------------------
page_fault
do_exit "signal"
wake_oom_reaper
                                       oom_reaper
                                       oom_reap_task_mm (invalidates mm)
exit_mm
exit_mm_release
  futex_exit_release
    futex_cleanup
      exit_robust_list
        get_user (EFAULT- can't access memory)
If the get_user EFAULT's, the kernel will be unable to recover the waiters on the robust_list, leaving userspace mutexes hung indefinitely.
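To make the failure mode concrete, here is a minimal userspace sketch (not part of the patch; purely illustrative) of the robust-mutex recovery that depends on exit_robust_list() being able to read the robust list:

	#include <errno.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		pthread_mutexattr_t attr;
		pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
		pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
		pthread_mutex_init(m, &attr);

		if (fork() == 0) {		/* child dies holding the lock */
			pthread_mutex_lock(m);
			_exit(0);
		}
		wait(NULL);

		/*
		 * This returns EOWNERDEAD because exit_robust_list() walked the
		 * child's robust list; had the oom reaper freed that memory
		 * first, this lock attempt would hang forever instead.
		 */
		if (pthread_mutex_lock(m) == EOWNERDEAD) {
			pthread_mutex_consistent(m);
			printf("recovered mutex from dead owner\n");
		}
		pthread_mutex_unlock(m);
		return 0;
	}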
Delay the OOM reaper, allowing more time for the exit path to perform the futex cleanup.
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Based on a patch by Michal Hocko.
Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: chenhui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 54 ++++++++++++++++++++++++++++++++----------------
 2 files changed, 41 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index cd68fc0de8ee..afc7651a2659 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1353,6 +1353,7 @@ struct task_struct { int pagefault_disabled; #ifdef CONFIG_MMU struct task_struct *oom_reaper_list; + struct timer_list oom_reaper_timer; #endif #ifdef CONFIG_VMAP_STACK struct vm_struct *stack_vm_area; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ffbe8fe2bbf6..2933e2beba6f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -680,7 +680,7 @@ static void oom_reap_task(struct task_struct *tsk) */ set_bit(MMF_OOM_SKIP, &mm->flags);
- /* Drop a reference taken by wake_oom_reaper */ + /* Drop a reference taken by queue_oom_reaper */ put_task_struct(tsk); }
@@ -690,12 +690,12 @@ static int oom_reaper(void *unused) struct task_struct *tsk = NULL;
wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); - spin_lock(&oom_reaper_lock); + spin_lock_irq(&oom_reaper_lock); if (oom_reaper_list != NULL) { tsk = oom_reaper_list; oom_reaper_list = tsk->oom_reaper_list; } - spin_unlock(&oom_reaper_lock); + spin_unlock_irq(&oom_reaper_lock);
if (tsk) oom_reap_task(tsk); @@ -704,22 +704,48 @@ static int oom_reaper(void *unused) return 0; }
-static void wake_oom_reaper(struct task_struct *tsk) +static void wake_oom_reaper(struct timer_list *timer) { - /* mm is already queued? */ - if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) - return; + struct task_struct *tsk = container_of(timer, struct task_struct, + oom_reaper_timer); + struct mm_struct *mm = tsk->signal->oom_mm; + unsigned long flags;
- get_task_struct(tsk); + /* The victim managed to terminate on its own - see exit_mmap */ + if (test_bit(MMF_OOM_SKIP, &mm->flags)) { + put_task_struct(tsk); + return; + }
- spin_lock(&oom_reaper_lock); + spin_lock_irqsave(&oom_reaper_lock, flags); tsk->oom_reaper_list = oom_reaper_list; oom_reaper_list = tsk; - spin_unlock(&oom_reaper_lock); + spin_unlock_irqrestore(&oom_reaper_lock, flags); trace_wake_reaper(tsk->pid); wake_up(&oom_reaper_wait); }
+/* + * Give the OOM victim time to exit naturally before invoking the oom_reaping. + * The timers timeout is arbitrary... the longer it is, the longer the worst + * case scenario for the OOM can take. If it is too small, the oom_reaper can + * get in the way and release resources needed by the process exit path. + * e.g. The futex robust list can sit in Anon|Private memory that gets reaped + * before the exit path is able to wake the futex waiters. + */ +#define OOM_REAPER_DELAY (2*HZ) +static void queue_oom_reaper(struct task_struct *tsk) +{ + /* mm is already queued? */ + if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) + return; + + get_task_struct(tsk); + timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0); + tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; + add_timer(&tsk->oom_reaper_timer); +} + static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -727,7 +753,7 @@ static int __init oom_init(void) } subsys_initcall(oom_init) #else -static inline void wake_oom_reaper(struct task_struct *tsk) +static inline void queue_oom_reaper(struct task_struct *tsk) { } #endif /* CONFIG_MMU */ @@ -978,7 +1004,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) rcu_read_unlock();
if (can_oom_reap) - wake_oom_reaper(victim); + queue_oom_reaper(victim);
mmdrop(mm); put_task_struct(victim); @@ -1014,7 +1040,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message) task_lock(victim); if (task_will_free_mem(victim)) { mark_oom_victim(victim); - wake_oom_reaper(victim); + queue_oom_reaper(victim); task_unlock(victim); put_task_struct(victim); return; @@ -1156,7 +1182,7 @@ bool out_of_memory(struct oom_control *oc) */ if (task_will_free_mem(current)) { mark_oom_victim(current); - wake_oom_reaper(current); + queue_oom_reaper(current); return true; }
From: Ma Wupeng <mawupeng1@huawei.com>
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I61FDP
CVE: NA
-------------------------------
Move oom_reaper_timer from task_struct to task_struct_resvd to fix the KABI breakage.
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: chenhui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 include/linux/sched.h |  5 ++++-
 mm/oom_kill.c         | 11 ++++++-----
 2 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index afc7651a2659..d748c6f16174 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -675,6 +675,10 @@ struct wake_q_node { struct task_struct_resvd { /* pointer back to the main task_struct */ struct task_struct *task; + +#ifdef CONFIG_MMU + struct timer_list oom_reaper_timer; +#endif };
struct task_struct { @@ -1353,7 +1357,6 @@ struct task_struct { int pagefault_disabled; #ifdef CONFIG_MMU struct task_struct *oom_reaper_list; - struct timer_list oom_reaper_timer; #endif #ifdef CONFIG_VMAP_STACK struct vm_struct *stack_vm_area; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 2933e2beba6f..7eb4fda1ce87 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -706,8 +706,9 @@ static int oom_reaper(void *unused)
static void wake_oom_reaper(struct timer_list *timer) { - struct task_struct *tsk = container_of(timer, struct task_struct, - oom_reaper_timer); + struct task_struct_resvd *tsk_resvd = container_of(timer, + struct task_struct_resvd, oom_reaper_timer); + struct task_struct *tsk = tsk_resvd->task; struct mm_struct *mm = tsk->signal->oom_mm; unsigned long flags;
@@ -741,9 +742,9 @@ static void queue_oom_reaper(struct task_struct *tsk) return;
get_task_struct(tsk); - timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0); - tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; - add_timer(&tsk->oom_reaper_timer); + timer_setup(&tsk->_resvd->oom_reaper_timer, wake_oom_reaper, 0); + tsk->_resvd->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; + add_timer(&tsk->_resvd->oom_reaper_timer); }
static int __init oom_init(void)
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion
from mainline-v5.16-rc1
commit d417b49fff3e2f21043c834841e8623a6098741d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6110W
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
It is not safe to check page->index without holding the page lock. It can be changed if the page is moved between the swap cache and the page cache for a shmem file, for example. There is a VM_BUG_ON below which checks page->index is correct after taking the page lock.
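In other words, loops like find_lock_entries() may only trust page->index once the page is locked; a minimal sketch of the safe ordering (using the function's xas naming):

	if (!trylock_page(page))
		goto put;
	/* page->index is only stable now that the page is locked */
	VM_BUG_ON_PAGE(page->index != xas.xa_index, page);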
Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 mm/filemap.c | 1 -
 1 file changed, 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c index 98b448d9873f..bf92156150ed 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1956,7 +1956,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, next_idx = page->index + thp_nr_pages(page); if (page->index < start) goto put; - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); if (page->index + thp_nr_pages(page) - 1 > end) goto put; if (!trylock_page(page))
From: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
mainline inclusion
from mainline-v5.17-rc1
commit 0878355b51f5f26632e652c848a8e174bb02d22d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I699A9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If start_per_cpu_kthreads(), called from osnoise_workload_start(), returns an error, the event hooks are left in a broken state: unhook_irq_events() has been called, but unhook_thread_events() and unhook_softirq_events() have not, and the trace_osnoise_callback_enabled flag has not been cleared.
On the next tracer enable, the hooks are not installed because the trace_osnoise_callback_enabled flag is still set.
And on a subsequent tracer disable, an attempt to remove the never-installed hooks hits a WARN_ON_ONCE() in tracepoint_remove_func().
Fix the error path by adding the missing part of the cleanup. While at it, introduce osnoise_unhook_events() to avoid code duplication between this error path and the normal tracer disable.
Link: https://lkml.kernel.org/r/20220109153459.3701773-1-nikita.yushchenko@virtuoz...
Cc: <stable@vger.kernel.org>
Fixes: bce29ac9ce0b ("trace: Add osnoise tracer")
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 kernel/trace/trace_osnoise.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index cfb80feb291e..3f806a3be8b2 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2103,6 +2103,13 @@ static int osnoise_hook_events(void) return -EINVAL; }
+static void osnoise_unhook_events(void) +{ + unhook_thread_events(); + unhook_softirq_events(); + unhook_irq_events(); +} + /* * osnoise_workload_start - start the workload and hook to events */ @@ -2135,7 +2142,14 @@ static int osnoise_workload_start(void)
retval = start_per_cpu_kthreads(); if (retval) { - unhook_irq_events(); + trace_osnoise_callback_enabled = false; + /* + * Make sure that ftrace_nmi_enter/exit() see + * trace_osnoise_callback_enabled as false before continuing. + */ + barrier(); + + osnoise_unhook_events(); return retval; }
@@ -2166,9 +2180,7 @@ static void osnoise_workload_stop(void)
stop_per_cpu_kthreads();
- unhook_irq_events(); - unhook_softirq_events(); - unhook_thread_events(); + osnoise_unhook_events(); }
static void osnoise_tracer_start(struct trace_array *tr)
From: Daniel Bristot de Oliveira <bristot@kernel.org>
mainline inclusion
from mainline-v5.17-rc8
commit f0cfe17bcc1dd2f0872966b554a148e888833ee9
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I699A9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Nicolas reported that using:
# trace-cmd record -e all -M 10 -p osnoise --poll
Resulted in the following kernel warning:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1217 at kernel/tracepoint.c:404 tracepoint_probe_unregister+0x280/0x370
[...]
CPU: 0 PID: 1217 Comm: trace-cmd Not tainted 5.17.0-rc6-next-20220307-nico+ #19
RIP: 0010:tracepoint_probe_unregister+0x280/0x370
[...]
CR2: 00007ff919b29497 CR3: 0000000109da4005 CR4: 0000000000170ef0
Call Trace:
 <TASK>
 osnoise_workload_stop+0x36/0x90
 tracing_set_tracer+0x108/0x260
 tracing_set_trace_write+0x94/0xd0
 ? __check_object_size.part.0+0x10a/0x150
 ? selinux_file_permission+0x104/0x150
 vfs_write+0xb5/0x290
 ksys_write+0x5f/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7ff919a18127
[...]
---[ end trace 0000000000000000 ]---
The warning complains about an attempt to unregister an unregistered tracepoint.
This happens on trace-cmd because it first stops tracing, and then switches the tracer to nop, which is equivalent to:
  # cd /sys/kernel/tracing/
  # echo osnoise > current_tracer
  # echo 0 > tracing_on
  # echo nop > current_tracer
The osnoise tracer stops the workload when no trace instance is actually collecting data. This can be caused either by disabling tracing or by disabling the tracer itself.
To avoid unregistering events twice, use the existing trace_osnoise_callback_enabled variable to check if the events (and the workload) are actually active before trying to deactivate them.
Link: https://lore.kernel.org/all/c898d1911f7f9303b7e14726e7cc9678fbfb4a0e.camel@r...
Link: https://lkml.kernel.org/r/938765e17d5a781c2df429a98f0b2e7cc317b022.164682391...
Cc: <stable@vger.kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Fixes: 2fac8d6486d5 ("tracing/osnoise: Allow multiple instances of the same tracer")
Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 kernel/trace/trace_osnoise.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 3f806a3be8b2..d23304b7f12c 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2171,6 +2171,17 @@ static void osnoise_workload_stop(void) if (osnoise_has_registered_instances()) return;
+ /* + * If callbacks were already disabled in a previous stop + * call, there is no need to disable then again. + * + * For instance, this happens when tracing is stopped via: + * echo 0 > tracing_on + * echo nop > current_tracer. + */ + if (!trace_osnoise_callback_enabled) + return; + trace_osnoise_callback_enabled = false; /* * Make sure that ftrace_nmi_enter/exit() see
From: Alon Zahavi <zahavi.alon@gmail.com>
mainline inclusion
from mainline-v6.2-rc1
commit 6d5c9e79b726cc473d40e9cb60976dbe8e669624
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I683ER
CVE: CVE-2022-4842
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The bug occurs due to a misuse of the `attr` variable instead of `attr_b`. `attr` is initialized to NULL, then dereferenced as `attr->res.data_size`.
This bug crashes the ntfs3 driver itself; if the driver is compiled directly into the kernel, it crashes the whole system.
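Schematically, the pre-fix code path looks like this (simplified from attr_punch_hole(); the lookup details are elided):

	struct ATTRIB *attr = NULL, *attr_b;

	/* ... only attr_b has been looked up here; attr is still NULL ... */
	if (!attr_b->non_res) {
		u32 data_size = le32_to_cpu(attr->res.data_size);	/* NULL deref */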
Signed-off-by: Alon Zahavi <zahavi.alon@gmail.com>
Co-developed-by: Tal Lossos <tallossos@gmail.com>
Signed-off-by: Tal Lossos <tallossos@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/ntfs3/attrib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ntfs3/attrib.c b/fs/ntfs3/attrib.c index e8c00dda42ad..4e74bc8f01ed 100644 --- a/fs/ntfs3/attrib.c +++ b/fs/ntfs3/attrib.c @@ -1949,7 +1949,7 @@ int attr_punch_hole(struct ntfs_inode *ni, u64 vbo, u64 bytes, u32 *frame_size) return -ENOENT;
if (!attr_b->non_res) { - u32 data_size = le32_to_cpu(attr->res.data_size); + u32 data_size = le32_to_cpu(attr_b->res.data_size); u32 from, to;
if (vbo > data_size)
From: William Liu <will@willsroot.io>
mainline inclusion
from mainline-v6.2-rc4
commit 797805d81baa814f76cf7bdab35f86408a79d707
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6A0GH
CVE: CVE-2023-0210
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
"nt_len - CIFS_ENCPWD_SIZE" is passed directly from ksmbd_decode_ntlmssp_auth_blob to ksmbd_auth_ntlmv2. Malicious requests can set nt_len to less than CIFS_ENCPWD_SIZE, which results in a negative number (or large unsigned value) used for a subsequent memcpy in ksmbd_auth_ntlvm2 and can cause a panic.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Cc: <stable@vger.kernel.org>
Signed-off-by: William Liu <will@willsroot.io>
Signed-off-by: Hrvoje Mišetić <misetichrvoje@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/ksmbd/auth.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ksmbd/auth.c b/fs/ksmbd/auth.c index fc87c9913c8d..0ac85a1a63c0 100644 --- a/fs/ksmbd/auth.c +++ b/fs/ksmbd/auth.c @@ -321,7 +321,8 @@ int ksmbd_decode_ntlmssp_auth_blob(struct authenticate_message *authblob, dn_off = le32_to_cpu(authblob->DomainName.BufferOffset); dn_len = le16_to_cpu(authblob->DomainName.Length);
- if (blob_len < (u64)dn_off + dn_len || blob_len < (u64)nt_off + nt_len) + if (blob_len < (u64)dn_off + dn_len || blob_len < (u64)nt_off + nt_len || + nt_len < CIFS_ENCPWD_SIZE) return -EINVAL;
/* TODO : use domain name that imported from configuration file */
From: Guo Mengqi <guomengqi3@huawei.com>
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I69JDF
CVE: NA
-------------------------------
Delete the svm driver, as it was specially designed for the Hisilicon platform.
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 arch/arm64/configs/openeuler_defconfig |    1 -
 drivers/char/Kconfig                   |   10 -
 drivers/char/Makefile                  |    1 -
 drivers/char/svm.c                     | 1772 --------------------------------
 mm/mmap.c                              |    2 +
 5 files changed, 2 insertions(+), 1784 deletions(-)
 delete mode 100644 drivers/char/svm.c
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 34a7b8d500a1..21fa51ad2d69 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -3395,7 +3395,6 @@ CONFIG_TCG_TIS_ST33ZP24_I2C=m CONFIG_TCG_TIS_ST33ZP24_SPI=m # CONFIG_XILLYBUS is not set CONFIG_PIN_MEMORY_DEV=m -CONFIG_HISI_SVM=m # CONFIG_RANDOM_TRUST_CPU is not set # CONFIG_RANDOM_TRUST_BOOTLOADER is not set # end of Character devices diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 701811fcc0fd..6fa56a473995 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -478,16 +478,6 @@ config PIN_MEMORY_DEV help pin memory driver
-config HISI_SVM - tristate "Hisilicon svm driver" - depends on ARM64 && ARM_SMMU_V3 && MMU_NOTIFIER - default m - help - This driver provides character-level access to Hisilicon - SVM chipset. Typically, you can bind a task to the - svm and share the virtual memory with hisilicon svm device. - When in doubt, say "N". - config RANDOM_TRUST_CPU bool "Initialize RNG using CPU RNG instructions" default y diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 362d4a9cd4cf..71d76fd62692 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -48,4 +48,3 @@ obj-$(CONFIG_XILLYBUS) += xillybus/ obj-$(CONFIG_POWERNV_OP_PANEL) += powernv-op-panel.o obj-$(CONFIG_ADI) += adi.o obj-$(CONFIG_PIN_MEMORY_DEV) += pin_memory.o -obj-$(CONFIG_HISI_SVM) += svm.o diff --git a/drivers/char/svm.c b/drivers/char/svm.c deleted file mode 100644 index 6945e93354b4..000000000000 --- a/drivers/char/svm.c +++ /dev/null @@ -1,1772 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * Copyright (c) 2017-2018 Hisilicon Limited. - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ - -#include <asm/esr.h> -#include <linux/mmu_context.h> - -#include <linux/delay.h> -#include <linux/err.h> -#include <linux/interrupt.h> -#include <linux/io.h> -#include <linux/iommu.h> -#include <linux/miscdevice.h> -#include <linux/mman.h> -#include <linux/mmu_notifier.h> -#include <linux/module.h> -#include <linux/of.h> -#include <linux/of_address.h> -#include <linux/of_device.h> -#include <linux/platform_device.h> -#include <linux/ptrace.h> -#include <linux/security.h> -#include <linux/slab.h> -#include <linux/uaccess.h> -#include <linux/sched.h> -#include <linux/hugetlb.h> -#include <linux/sched/mm.h> -#include <linux/msi.h> -#include <linux/acpi.h> - -#define SVM_DEVICE_NAME "svm" -#define ASID_SHIFT 48 - -#define SVM_IOCTL_REMAP_PROC 0xfff4 -#define SVM_IOCTL_UNPIN_MEMORY 0xfff5 -#define SVM_IOCTL_PIN_MEMORY 0xfff7 -#define SVM_IOCTL_GET_PHYS 0xfff9 -#define SVM_IOCTL_LOAD_FLAG 0xfffa -#define SVM_IOCTL_SET_RC 0xfffc -#define SVM_IOCTL_PROCESS_BIND 0xffff - -#define CORE_SID 0 - -#define SVM_IOCTL_RELEASE_PHYS32 0xfff3 -#define SVM_REMAP_MEM_LEN_MAX (16 * 1024 * 1024) -#define MMAP_PHY32_MAX (16 * 1024 * 1024) - -static int probe_index; -static LIST_HEAD(child_list); -static DECLARE_RWSEM(svm_sem); -static struct rb_root svm_process_root = RB_ROOT; -static struct mutex svm_process_mutex; - -struct core_device { - struct device dev; - struct iommu_group *group; - struct iommu_domain *domain; - u8 smmu_bypass; - struct list_head entry; -}; - -struct svm_device { - unsigned long long id; - struct miscdevice miscdev; - struct device *dev; - phys_addr_t l2buff; - unsigned long l2size; -}; - -struct svm_bind_process { - pid_t vpid; - u64 ttbr; - u64 tcr; - int pasid; - u32 flags; -#define SVM_BIND_PID (1 << 0) -}; - -/* - *svm_process is released in svm_notifier_release() when mm refcnt - *goes down zero. We should access svm_process only in the context - *where mm_struct is valid, which means we should always get mm - *refcnt first. 
- */ -struct svm_process { - struct pid *pid; - struct mm_struct *mm; - unsigned long asid; - struct rb_node rb_node; - struct mmu_notifier notifier; - /* For postponed release */ - struct rcu_head rcu; - int pasid; - struct mutex mutex; - struct rb_root sdma_list; - struct svm_device *sdev; - struct iommu_sva *sva; -}; - -struct svm_sdma { - struct rb_node node; - unsigned long addr; - int nr_pages; - struct page **pages; - atomic64_t ref; -}; - -struct svm_proc_mem { - u32 dev_id; - u32 len; - u64 pid; - u64 vaddr; - u64 buf; -}; - -static char *svm_cmd_to_string(unsigned int cmd) -{ - switch (cmd) { - case SVM_IOCTL_PROCESS_BIND: - return "bind"; - case SVM_IOCTL_GET_PHYS: - return "get phys"; - case SVM_IOCTL_SET_RC: - return "set rc"; - case SVM_IOCTL_PIN_MEMORY: - return "pin memory"; - case SVM_IOCTL_UNPIN_MEMORY: - return "unpin memory"; - case SVM_IOCTL_REMAP_PROC: - return "remap proc"; - case SVM_IOCTL_LOAD_FLAG: - return "load flag"; - case SVM_IOCTL_RELEASE_PHYS32: - return "release phys"; - default: - return "unsupported"; - } - - return NULL; -} - -/* - * image word of slot - * SVM_IMAGE_WORD_INIT: initial value, indicating that the slot is not used. - * SVM_IMAGE_WORD_VALID: valid data is filled in the slot - * SVM_IMAGE_WORD_DONE: the DMA operation is complete when the TS uses this address, - * so, this slot can be freed. - */ -#define SVM_IMAGE_WORD_INIT 0x0 -#define SVM_IMAGE_WORD_VALID 0xaa55aa55 -#define SVM_IMAGE_WORD_DONE 0x55ff55ff - -/* - * The length of this structure must be 64 bytes, which is the agreement with the TS. - * And the data type and sequence cannot be changed, because the TS core reads data - * based on the data type and sequence. - * image_word: slot status. For details, see SVM_IMAGE_WORD_xxx - * pid: pid of process which ioctl svm device to get physical addr, it is used for - * verification by TS. - * data_type: used to determine the data type by TS. Currently, data type must be - * SVM_VA2PA_TYPE_DMA. - * char data[48]: for the data type SVM_VA2PA_TYPE_DMA, the DMA address is stored. 
- */ -struct svm_va2pa_slot { - int image_word; - int resv; - int pid; - int data_type; - union { - char user_defined_data[48]; - struct { - unsigned long phys; - unsigned long len; - char reserved[32]; - }; - }; -}; - -struct svm_va2pa_trunk { - struct svm_va2pa_slot *slots; - int slot_total; - int slot_used; - unsigned long *bitmap; - struct mutex mutex; -}; - -struct svm_va2pa_trunk va2pa_trunk; - -#define SVM_VA2PA_TRUNK_SIZE_MAX 0x3200000 -#define SVM_VA2PA_MEMORY_ALIGN 64 -#define SVM_VA2PA_SLOT_SIZE sizeof(struct svm_va2pa_slot) -#define SVM_VA2PA_TYPE_DMA 0x1 -#define SVM_MEM_REG "va2pa trunk" -#define SVM_VA2PA_CLEAN_BATCH_NUM 0x80 - -struct device_node *svm_find_mem_reg_node(struct device *dev, const char *compat) -{ - int index = 0; - struct device_node *tmp = NULL; - struct device_node *np = dev->of_node; - - for (; ; index++) { - tmp = of_parse_phandle(np, "memory-region", index); - if (!tmp) - break; - - if (of_device_is_compatible(tmp, compat)) - return tmp; - - of_node_put(tmp); - } - - return NULL; -} - -static int svm_parse_trunk_memory(struct device *dev, phys_addr_t *base, unsigned long *size) -{ - int err; - struct resource r; - struct device_node *trunk = NULL; - - trunk = svm_find_mem_reg_node(dev, SVM_MEM_REG); - if (!trunk) { - dev_err(dev, "Didn't find reserved memory\n"); - return -EINVAL; - } - - err = of_address_to_resource(trunk, 0, &r); - of_node_put(trunk); - if (err) { - dev_err(dev, "Couldn't address to resource for reserved memory\n"); - return -ENOMEM; - } - - *base = r.start; - *size = resource_size(&r); - - return 0; -} - -static int svm_setup_trunk(struct device *dev, phys_addr_t base, unsigned long size) -{ - int slot_total; - unsigned long *bitmap = NULL; - struct svm_va2pa_slot *slot = NULL; - - if (!IS_ALIGNED(base, SVM_VA2PA_MEMORY_ALIGN)) { - dev_err(dev, "Didn't aligned to %u\n", SVM_VA2PA_MEMORY_ALIGN); - return -EINVAL; - } - - if ((size == 0) || (size > SVM_VA2PA_TRUNK_SIZE_MAX)) { - dev_err(dev, "Size of reserved memory is not right\n"); - return -EINVAL; - } - - slot_total = size / SVM_VA2PA_SLOT_SIZE; - if (slot_total < BITS_PER_LONG) - return -EINVAL; - - bitmap = kvcalloc(slot_total / BITS_PER_LONG, sizeof(unsigned long), GFP_KERNEL); - if (!bitmap) { - dev_err(dev, "alloc memory failed\n"); - return -ENOMEM; - } - - slot = ioremap(base, size); - if (!slot) { - kvfree(bitmap); - dev_err(dev, "Ioremap trunk failed\n"); - return -ENXIO; - } - - va2pa_trunk.slots = slot; - va2pa_trunk.slot_used = 0; - va2pa_trunk.slot_total = slot_total; - va2pa_trunk.bitmap = bitmap; - mutex_init(&va2pa_trunk.mutex); - - return 0; -} - -static void svm_remove_trunk(struct device *dev) -{ - iounmap(va2pa_trunk.slots); - kvfree(va2pa_trunk.bitmap); - - va2pa_trunk.slots = NULL; - va2pa_trunk.bitmap = NULL; -} - -static void svm_set_slot_valid(unsigned long index, unsigned long phys, unsigned long len) -{ - struct svm_va2pa_slot *slot = &va2pa_trunk.slots[index]; - - slot->phys = phys; - slot->len = len; - slot->image_word = SVM_IMAGE_WORD_VALID; - slot->pid = current->tgid; - slot->data_type = SVM_VA2PA_TYPE_DMA; - __bitmap_set(va2pa_trunk.bitmap, index, 1); - va2pa_trunk.slot_used++; -} - -static void svm_set_slot_init(unsigned long index) -{ - struct svm_va2pa_slot *slot = &va2pa_trunk.slots[index]; - - slot->image_word = SVM_IMAGE_WORD_INIT; - __bitmap_clear(va2pa_trunk.bitmap, index, 1); - va2pa_trunk.slot_used--; -} - -static void svm_clean_done_slots(void) -{ - int used = va2pa_trunk.slot_used; - int count = 0; - long temp = -1; - phys_addr_t addr; - 
unsigned long *bitmap = va2pa_trunk.bitmap; - - for (; count < used && count < SVM_VA2PA_CLEAN_BATCH_NUM;) { - temp = find_next_bit(bitmap, va2pa_trunk.slot_total, temp + 1); - if (temp == va2pa_trunk.slot_total) - break; - - count++; - if (va2pa_trunk.slots[temp].image_word != SVM_IMAGE_WORD_DONE) - continue; - - addr = (phys_addr_t)va2pa_trunk.slots[temp].phys; - put_page(pfn_to_page(PHYS_PFN(addr))); - svm_set_slot_init(temp); - } -} - -static int svm_find_slot_init(unsigned long *index) -{ - int temp; - unsigned long *bitmap = va2pa_trunk.bitmap; - - temp = find_first_zero_bit(bitmap, va2pa_trunk.slot_total); - if (temp == va2pa_trunk.slot_total) - return -ENOSPC; - - *index = temp; - return 0; -} - -static int svm_va2pa_trunk_init(struct device *dev) -{ - int err; - phys_addr_t base; - unsigned long size; - - err = svm_parse_trunk_memory(dev, &base, &size); - if (err) - return err; - - err = svm_setup_trunk(dev, base, size); - if (err) - return err; - - return 0; -} - -static struct svm_process *find_svm_process(unsigned long asid) -{ - struct rb_node *node = svm_process_root.rb_node; - - while (node) { - struct svm_process *process = NULL; - - process = rb_entry(node, struct svm_process, rb_node); - if (asid < process->asid) - node = node->rb_left; - else if (asid > process->asid) - node = node->rb_right; - else - return process; - } - - return NULL; -} - -static void insert_svm_process(struct svm_process *process) -{ - struct rb_node **p = &svm_process_root.rb_node; - struct rb_node *parent = NULL; - - while (*p) { - struct svm_process *tmp_process = NULL; - - parent = *p; - tmp_process = rb_entry(parent, struct svm_process, rb_node); - if (process->asid < tmp_process->asid) - p = &(*p)->rb_left; - else if (process->asid > tmp_process->asid) - p = &(*p)->rb_right; - else { - WARN_ON_ONCE("asid already in the tree"); - return; - } - } - - rb_link_node(&process->rb_node, parent, p); - rb_insert_color(&process->rb_node, &svm_process_root); -} - -static void delete_svm_process(struct svm_process *process) -{ - rb_erase(&process->rb_node, &svm_process_root); - RB_CLEAR_NODE(&process->rb_node); -} - -static struct svm_device *file_to_sdev(struct file *file) -{ - return container_of(file->private_data, - struct svm_device, miscdev); -} - -static inline struct core_device *to_core_device(struct device *d) -{ - return container_of(d, struct core_device, dev); -} - -static struct svm_sdma *svm_find_sdma(struct svm_process *process, - unsigned long addr, int nr_pages) -{ - struct rb_node *node = process->sdma_list.rb_node; - - while (node) { - struct svm_sdma *sdma = NULL; - - sdma = rb_entry(node, struct svm_sdma, node); - if (addr < sdma->addr) - node = node->rb_left; - else if (addr > sdma->addr) - node = node->rb_right; - else if (nr_pages < sdma->nr_pages) - node = node->rb_left; - else if (nr_pages > sdma->nr_pages) - node = node->rb_right; - else - return sdma; - } - - return NULL; -} - -static int svm_insert_sdma(struct svm_process *process, struct svm_sdma *sdma) -{ - struct rb_node **p = &process->sdma_list.rb_node; - struct rb_node *parent = NULL; - - while (*p) { - struct svm_sdma *tmp_sdma = NULL; - - parent = *p; - tmp_sdma = rb_entry(parent, struct svm_sdma, node); - if (sdma->addr < tmp_sdma->addr) - p = &(*p)->rb_left; - else if (sdma->addr > tmp_sdma->addr) - p = &(*p)->rb_right; - else if (sdma->nr_pages < tmp_sdma->nr_pages) - p = &(*p)->rb_left; - else if (sdma->nr_pages > tmp_sdma->nr_pages) - p = &(*p)->rb_right; - else { - /* - * add reference count and return -EBUSY - 
* to free former alloced one. - */ - atomic64_inc(&tmp_sdma->ref); - return -EBUSY; - } - } - - rb_link_node(&sdma->node, parent, p); - rb_insert_color(&sdma->node, &process->sdma_list); - - return 0; -} - -static void svm_remove_sdma(struct svm_process *process, - struct svm_sdma *sdma, bool try_rm) -{ - int null_count = 0; - - if (try_rm && (!atomic64_dec_and_test(&sdma->ref))) - return; - - rb_erase(&sdma->node, &process->sdma_list); - RB_CLEAR_NODE(&sdma->node); - - while (sdma->nr_pages--) { - if (sdma->pages[sdma->nr_pages] == NULL) { - pr_err("null pointer, nr_pages:%d.\n", sdma->nr_pages); - null_count++; - continue; - } - - put_page(sdma->pages[sdma->nr_pages]); - } - - if (null_count) - dump_stack(); - - kvfree(sdma->pages); - kfree(sdma); -} - -static int svm_pin_pages(unsigned long addr, int nr_pages, - struct page **pages) -{ - int err; - - err = get_user_pages_fast(addr, nr_pages, 1, pages); - if (err > 0 && err < nr_pages) { - while (err--) - put_page(pages[err]); - err = -EFAULT; - } else if (err == 0) { - err = -EFAULT; - } - - return err; -} - -static int svm_add_sdma(struct svm_process *process, - unsigned long addr, unsigned long size) -{ - int err; - struct svm_sdma *sdma = NULL; - - sdma = kzalloc(sizeof(struct svm_sdma), GFP_KERNEL); - if (sdma == NULL) - return -ENOMEM; - - atomic64_set(&sdma->ref, 1); - sdma->addr = addr & PAGE_MASK; - sdma->nr_pages = (PAGE_ALIGN(size + addr) >> PAGE_SHIFT) - - (sdma->addr >> PAGE_SHIFT); - sdma->pages = kvcalloc(sdma->nr_pages, sizeof(char *), GFP_KERNEL); - if (sdma->pages == NULL) { - err = -ENOMEM; - goto err_free_sdma; - } - - /* - * If always pin the same addr with the same nr_pages, pin pages - * maybe should move after insert sdma with mutex lock. - */ - err = svm_pin_pages(sdma->addr, sdma->nr_pages, sdma->pages); - if (err < 0) { - pr_err("%s: failed to pin pages addr 0x%pK, size 0x%lx\n", - __func__, (void *)addr, size); - goto err_free_pages; - } - - err = svm_insert_sdma(process, sdma); - if (err < 0) { - err = 0; - pr_debug("%s: sdma already exist!\n", __func__); - goto err_unpin_pages; - } - - return err; - -err_unpin_pages: - while (sdma->nr_pages--) - put_page(sdma->pages[sdma->nr_pages]); -err_free_pages: - kvfree(sdma->pages); -err_free_sdma: - kfree(sdma); - - return err; -} - -static int svm_pin_memory(unsigned long __user *arg) -{ - int err; - struct svm_process *process = NULL; - unsigned long addr, size, asid; - - if (!acpi_disabled) - return -EPERM; - - if (arg == NULL) - return -EINVAL; - - if (get_user(addr, arg)) - return -EFAULT; - - if (get_user(size, arg + 1)) - return -EFAULT; - - if ((addr + size <= addr) || (size >= (u64)UINT_MAX) || (addr == 0)) - return -EINVAL; - - asid = arm64_mm_context_get(current->mm); - if (!asid) - return -ENOSPC; - - mutex_lock(&svm_process_mutex); - process = find_svm_process(asid); - if (process == NULL) { - mutex_unlock(&svm_process_mutex); - err = -ESRCH; - goto out; - } - mutex_unlock(&svm_process_mutex); - - mutex_lock(&process->mutex); - err = svm_add_sdma(process, addr, size); - mutex_unlock(&process->mutex); - -out: - arm64_mm_context_put(current->mm); - - return err; -} - -static int svm_unpin_memory(unsigned long __user *arg) -{ - int err = 0, nr_pages; - struct svm_sdma *sdma = NULL; - unsigned long addr, size, asid; - struct svm_process *process = NULL; - - if (!acpi_disabled) - return -EPERM; - - if (arg == NULL) - return -EINVAL; - - if (get_user(addr, arg)) - return -EFAULT; - - if (get_user(size, arg + 1)) - return -EFAULT; - - if (ULONG_MAX - addr < 
size) - return -EINVAL; - - asid = arm64_mm_context_get(current->mm); - if (!asid) - return -ENOSPC; - - nr_pages = (PAGE_ALIGN(size + addr) >> PAGE_SHIFT) - - ((addr & PAGE_MASK) >> PAGE_SHIFT); - addr &= PAGE_MASK; - - mutex_lock(&svm_process_mutex); - process = find_svm_process(asid); - if (process == NULL) { - mutex_unlock(&svm_process_mutex); - err = -ESRCH; - goto out; - } - mutex_unlock(&svm_process_mutex); - - mutex_lock(&process->mutex); - sdma = svm_find_sdma(process, addr, nr_pages); - if (sdma == NULL) { - mutex_unlock(&process->mutex); - err = -ESRCH; - goto out; - } - - svm_remove_sdma(process, sdma, true); - mutex_unlock(&process->mutex); - -out: - arm64_mm_context_put(current->mm); - - return err; -} - -static void svm_unpin_all(struct svm_process *process) -{ - struct rb_node *node = NULL; - - while ((node = rb_first(&process->sdma_list))) - svm_remove_sdma(process, - rb_entry(node, struct svm_sdma, node), - false); -} - -static int svm_acpi_bind_core(struct core_device *cdev, void *data) -{ - struct task_struct *task = NULL; - struct svm_process *process = data; - - if (cdev->smmu_bypass) - return 0; - - task = get_pid_task(process->pid, PIDTYPE_PID); - if (!task) { - pr_err("failed to get task_struct\n"); - return -ESRCH; - } - - process->sva = iommu_sva_bind_device(&cdev->dev, task->mm, NULL); - if (!process->sva) { - pr_err("failed to bind device\n"); - return PTR_ERR(process->sva); - } - - process->pasid = task->mm->pasid; - put_task_struct(task); - - return 0; -} - -static int svm_dt_bind_core(struct device *dev, void *data) -{ - struct task_struct *task = NULL; - struct svm_process *process = data; - struct core_device *cdev = to_core_device(dev); - - if (cdev->smmu_bypass) - return 0; - - task = get_pid_task(process->pid, PIDTYPE_PID); - if (!task) { - pr_err("failed to get task_struct\n"); - return -ESRCH; - } - - process->sva = iommu_sva_bind_device(dev, task->mm, NULL); - if (!process->sva) { - pr_err("failed to bind device\n"); - return PTR_ERR(process->sva); - } - - process->pasid = task->mm->pasid; - put_task_struct(task); - - return 0; -} - -static void svm_dt_bind_cores(struct svm_process *process) -{ - device_for_each_child(process->sdev->dev, process, svm_dt_bind_core); -} - -static void svm_acpi_bind_cores(struct svm_process *process) -{ - struct core_device *pos = NULL; - - list_for_each_entry(pos, &child_list, entry) { - svm_acpi_bind_core(pos, process); - } -} - -static void svm_process_free(struct mmu_notifier *mn) -{ - struct svm_process *process = NULL; - - process = container_of(mn, struct svm_process, notifier); - svm_unpin_all(process); - arm64_mm_context_put(process->mm); - kfree(process); -} - -static void svm_process_release(struct svm_process *process) -{ - delete_svm_process(process); - put_pid(process->pid); - - mmu_notifier_put(&process->notifier); -} - -static void svm_notifier_release(struct mmu_notifier *mn, - struct mm_struct *mm) -{ - struct svm_process *process = NULL; - - process = container_of(mn, struct svm_process, notifier); - - /* - * No need to call svm_unbind_cores(), as iommu-sva will do the - * unbind in its mm_notifier callback. 
- */ - - mutex_lock(&svm_process_mutex); - svm_process_release(process); - mutex_unlock(&svm_process_mutex); -} - -static struct mmu_notifier_ops svm_process_mmu_notifier = { - .release = svm_notifier_release, - .free_notifier = svm_process_free, -}; - -static struct svm_process * -svm_process_alloc(struct svm_device *sdev, struct pid *pid, - struct mm_struct *mm, unsigned long asid) -{ - struct svm_process *process = kzalloc(sizeof(*process), GFP_ATOMIC); - - if (!process) - return ERR_PTR(-ENOMEM); - - process->sdev = sdev; - process->pid = pid; - process->mm = mm; - process->asid = asid; - process->sdma_list = RB_ROOT; //lint !e64 - mutex_init(&process->mutex); - process->notifier.ops = &svm_process_mmu_notifier; - - return process; -} - -static struct task_struct *svm_get_task(struct svm_bind_process params) -{ - struct task_struct *task = NULL; - - if (params.flags & ~SVM_BIND_PID) - return ERR_PTR(-EINVAL); - - if (params.flags & SVM_BIND_PID) { - struct mm_struct *mm = NULL; - - task = find_get_task_by_vpid(params.vpid); - if (task == NULL) - return ERR_PTR(-ESRCH); - - /* check the permission */ - mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS); - if (IS_ERR_OR_NULL(mm)) { - pr_err("cannot access mm\n"); - put_task_struct(task); - return ERR_PTR(-ESRCH); - } - - mmput(mm); - } else { - get_task_struct(current); - task = current; - } - - return task; -} - -static int svm_process_bind(struct task_struct *task, - struct svm_device *sdev, u64 *ttbr, u64 *tcr, int *pasid) -{ - int err; - unsigned long asid; - struct pid *pid = NULL; - struct svm_process *process = NULL; - struct mm_struct *mm = NULL; - - if ((ttbr == NULL) || (tcr == NULL) || (pasid == NULL)) - return -EINVAL; - - pid = get_task_pid(task, PIDTYPE_PID); - if (pid == NULL) - return -EINVAL; - - mm = get_task_mm(task); - if (!mm) { - err = -EINVAL; - goto err_put_pid; - } - - asid = arm64_mm_context_get(mm); - if (!asid) { - err = -ENOSPC; - goto err_put_mm; - } - - /* If a svm_process already exists, use it */ - mutex_lock(&svm_process_mutex); - process = find_svm_process(asid); - if (process == NULL) { - process = svm_process_alloc(sdev, pid, mm, asid); - if (IS_ERR(process)) { - err = PTR_ERR(process); - mutex_unlock(&svm_process_mutex); - goto err_put_mm_context; - } - err = mmu_notifier_register(&process->notifier, mm); - if (err) { - mutex_unlock(&svm_process_mutex); - goto err_free_svm_process; - } - - insert_svm_process(process); - - if (acpi_disabled) - svm_dt_bind_cores(process); - else - svm_acpi_bind_cores(process); - - mutex_unlock(&svm_process_mutex); - } else { - mutex_unlock(&svm_process_mutex); - arm64_mm_context_put(mm); - put_pid(pid); - } - - - *ttbr = virt_to_phys(mm->pgd) | asid << ASID_SHIFT; - *tcr = read_sysreg(tcr_el1); - *pasid = process->pasid; - - mmput(mm); - return 0; - -err_free_svm_process: - kfree(process); -err_put_mm_context: - arm64_mm_context_put(mm); -err_put_mm: - mmput(mm); -err_put_pid: - put_pid(pid); - - return err; -} - -static pte_t *svm_get_pte(struct vm_area_struct *vma, - pud_t *pud, - unsigned long addr, - unsigned long *page_size, - unsigned long *offset) -{ - pte_t *pte = NULL; - unsigned long size = 0; - - if (is_vm_hugetlb_page(vma)) { - if (pud_present(*pud)) { - if (pud_val(*pud) && !(pud_val(*pud) & PUD_TABLE_BIT)) { - pte = (pte_t *)pud; - *offset = addr & (PUD_SIZE - 1); - size = PUD_SIZE; - } else { - pte = (pte_t *)pmd_offset(pud, addr); - *offset = addr & (PMD_SIZE - 1); - size = PMD_SIZE; - } - } else { - pr_err("%s:hugetlb but pud not present\n", __func__); 
- } - } else { - pmd_t *pmd = pmd_offset(pud, addr); - - if (pmd_none(*pmd)) - return NULL; - - if (pmd_trans_huge(*pmd)) { - pte = (pte_t *)pmd; - *offset = addr & (PMD_SIZE - 1); - size = PMD_SIZE; - } else { - pte = pte_offset_map(pmd, addr); - *offset = addr & (PAGE_SIZE - 1); - size = PAGE_SIZE; - } - } - - if (page_size) - *page_size = size; - - return pte; -} - -/* Must be called with mmap_lock held */ -static pte_t *svm_walk_pt(unsigned long addr, unsigned long *page_size, - unsigned long *offset) -{ - pgd_t *pgd = NULL; - p4d_t *p4d = NULL; - pud_t *pud = NULL; - struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = NULL; - - vma = find_vma(mm, addr); - if (!vma) - return NULL; - - pgd = pgd_offset(mm, addr); - if (pgd_none(*pgd)) - return NULL; - - p4d = p4d_offset(pgd, addr); - if (p4d_none(*p4d)) - return NULL; - - pud = pud_offset(p4d, addr); - if (pud_none(*pud)) - return NULL; - - return svm_get_pte(vma, pud, addr, page_size, offset); -} - -static int svm_get_phys(unsigned long __user *arg) -{ - int err; - pte_t *ptep = NULL; - pte_t pte; - unsigned long index = 0; - struct page *page; - unsigned long addr, phys, offset; - struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = NULL; - unsigned long len; - - if (!acpi_disabled) - return -EPERM; - - if (get_user(addr, arg)) - return -EFAULT; - - down_read(&mm->mmap_lock); - ptep = svm_walk_pt(addr, NULL, &offset); - if (!ptep) { - up_read(&mm->mmap_lock); - return -EINVAL; - } - - pte = READ_ONCE(*ptep); - if (!pte_present(pte) || !(pfn_in_present_section(pte_pfn(pte)))) { - up_read(&mm->mmap_lock); - return -EINVAL; - } - - page = pte_page(pte); - get_page(page); - - phys = PFN_PHYS(pte_pfn(pte)) + offset; - - /* fix ts problem, which need the len to check out memory */ - len = 0; - vma = find_vma(mm, addr); - if (vma) - len = vma->vm_end - addr; - - up_read(&mm->mmap_lock); - - mutex_lock(&va2pa_trunk.mutex); - svm_clean_done_slots(); - if (va2pa_trunk.slot_used == va2pa_trunk.slot_total) { - err = -ENOSPC; - goto err_mutex_unlock; - } - - err = svm_find_slot_init(&index); - if (err) - goto err_mutex_unlock; - - svm_set_slot_valid(index, phys, len); - - err = put_user(index * SVM_VA2PA_SLOT_SIZE, (unsigned long __user *)arg); - if (err) - goto err_slot_init; - - mutex_unlock(&va2pa_trunk.mutex); - return 0; - -err_slot_init: - svm_set_slot_init(index); -err_mutex_unlock: - mutex_unlock(&va2pa_trunk.mutex); - put_page(page); - return err; -} - -static struct bus_type svm_bus_type = { - .name = "svm_bus", -}; - -static int svm_open(struct inode *inode, struct file *file) -{ - return 0; -} - -static int svm_proc_load_flag(int __user *arg) -{ - static atomic_t l2buf_load_flag = ATOMIC_INIT(0); - int flag; - - if (!acpi_disabled) - return -EPERM; - - if (arg == NULL) - return -EINVAL; - - if (0 == (atomic_cmpxchg(&l2buf_load_flag, 0, 1))) - flag = 0; - else - flag = 1; - - return put_user(flag, arg); -} - -static int svm_mmap(struct file *file, struct vm_area_struct *vma) -{ - int err; - struct svm_device *sdev = file_to_sdev(file); - - if (!acpi_disabled) - return -EPERM; - - if (vma->vm_flags & VM_PA32BIT) { - unsigned long vm_size = vma->vm_end - vma->vm_start; - struct page *page = NULL; - - if ((vma->vm_end < vma->vm_start) || (vm_size > MMAP_PHY32_MAX)) - return -EINVAL; - - /* vma->vm_pgoff transfer the nid */ - if (vma->vm_pgoff == 0) - page = alloc_pages(GFP_KERNEL | GFP_DMA32, - get_order(vm_size)); - else - page = alloc_pages_node((int)vma->vm_pgoff, - GFP_KERNEL | __GFP_THISNODE, - 
get_order(vm_size)); - if (!page) { - dev_err(sdev->dev, "fail to alloc page on node 0x%lx\n", - vma->vm_pgoff); - return -ENOMEM; - } - - err = remap_pfn_range(vma, - vma->vm_start, - page_to_pfn(page), - vm_size, vma->vm_page_prot); - if (err) - dev_err(sdev->dev, - "fail to remap 0x%pK err=%d\n", - (void *)vma->vm_start, err); - } else { - if ((vma->vm_end < vma->vm_start) || - ((vma->vm_end - vma->vm_start) > sdev->l2size)) - return -EINVAL; - - vma->vm_page_prot = __pgprot((~PTE_SHARED) & - vma->vm_page_prot.pgprot); - - err = remap_pfn_range(vma, - vma->vm_start, - sdev->l2buff >> PAGE_SHIFT, - vma->vm_end - vma->vm_start, - __pgprot(vma->vm_page_prot.pgprot | PTE_DIRTY)); - if (err) - dev_err(sdev->dev, - "fail to remap 0x%pK err=%d\n", - (void *)vma->vm_start, err); - } - - return err; -} - -static int svm_release_phys32(unsigned long __user *arg) -{ - struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = NULL; - struct page *page = NULL; - pte_t *pte = NULL; - unsigned long phys, addr, offset; - unsigned int len = 0; - - if (arg == NULL) - return -EINVAL; - - if (get_user(addr, arg)) - return -EFAULT; - - down_read(&mm->mmap_lock); - pte = svm_walk_pt(addr, NULL, &offset); - if (pte && pte_present(*pte)) { - phys = PFN_PHYS(pte_pfn(*pte)) + offset; - } else { - up_read(&mm->mmap_lock); - return -EINVAL; - } - - vma = find_vma(mm, addr); - if (!vma) { - up_read(&mm->mmap_lock); - return -EFAULT; - } - - page = phys_to_page(phys); - len = vma->vm_end - vma->vm_start; - - __free_pages(page, get_order(len)); - - up_read(&mm->mmap_lock); - - return 0; -} - -static long svm_ioctl(struct file *file, unsigned int cmd, - unsigned long arg) -{ - int err = -EINVAL; - struct svm_bind_process params; - struct svm_device *sdev = file_to_sdev(file); - struct task_struct *task; - - if (!arg) - return -EINVAL; - - if (cmd == SVM_IOCTL_PROCESS_BIND) { - err = copy_from_user(¶ms, (void __user *)arg, - sizeof(params)); - if (err) { - dev_err(sdev->dev, "fail to copy params %d\n", err); - return -EFAULT; - } - } - - switch (cmd) { - case SVM_IOCTL_PROCESS_BIND: - task = svm_get_task(params); - if (IS_ERR(task)) { - dev_err(sdev->dev, "failed to get task\n"); - return PTR_ERR(task); - } - - err = svm_process_bind(task, sdev, ¶ms.ttbr, - ¶ms.tcr, ¶ms.pasid); - if (err) { - put_task_struct(task); - dev_err(sdev->dev, "failed to bind task %d\n", err); - return err; - } - - put_task_struct(task); - err = copy_to_user((void __user *)arg, ¶ms, - sizeof(params)); - if (err) { - dev_err(sdev->dev, "failed to copy to user!\n"); - return -EFAULT; - } - break; - case SVM_IOCTL_GET_PHYS: - err = svm_get_phys((unsigned long __user *)arg); - break; - case SVM_IOCTL_PIN_MEMORY: - err = svm_pin_memory((unsigned long __user *)arg); - break; - case SVM_IOCTL_UNPIN_MEMORY: - err = svm_unpin_memory((unsigned long __user *)arg); - break; - case SVM_IOCTL_LOAD_FLAG: - err = svm_proc_load_flag((int __user *)arg); - break; - case SVM_IOCTL_RELEASE_PHYS32: - err = svm_release_phys32((unsigned long __user *)arg); - break; - default: - err = -EINVAL; - } - - if (err) - dev_err(sdev->dev, "%s: %s failed err = %d\n", __func__, - svm_cmd_to_string(cmd), err); - - return err; -} - -static const struct file_operations svm_fops = { - .owner = THIS_MODULE, - .open = svm_open, - .mmap = svm_mmap, - .unlocked_ioctl = svm_ioctl, -}; - -static void cdev_device_release(struct device *dev) -{ - struct core_device *cdev = to_core_device(dev); - - if (!acpi_disabled) - list_del(&cdev->entry); - - kfree(cdev); -} - -static int 
svm_remove_core(struct device *dev, void *data) -{ - struct core_device *cdev = to_core_device(dev); - - if (!cdev->smmu_bypass) { - iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA); - iommu_detach_group(cdev->domain, cdev->group); - iommu_group_put(cdev->group); - iommu_domain_free(cdev->domain); - } - - device_unregister(&cdev->dev); - - return 0; -} - -#ifdef CONFIG_ACPI -static int svm_acpi_add_core(struct svm_device *sdev, - struct acpi_device *children, int id) -{ - int err; - struct core_device *cdev = NULL; - char *name = NULL; - enum dev_dma_attr attr; - const union acpi_object *obj; - - name = devm_kasprintf(sdev->dev, GFP_KERNEL, "svm_child_dev%d", id); - if (name == NULL) - return -ENOMEM; - - cdev = kzalloc(sizeof(*cdev), GFP_KERNEL); - if (cdev == NULL) - return -ENOMEM; - cdev->dev.fwnode = &children->fwnode; - cdev->dev.parent = sdev->dev; - cdev->dev.bus = &svm_bus_type; - cdev->dev.release = cdev_device_release; - cdev->smmu_bypass = 0; - list_add(&cdev->entry, &child_list); - dev_set_name(&cdev->dev, "%s", name); - - err = device_register(&cdev->dev); - if (err) { - dev_info(&cdev->dev, "core_device register failed\n"); - list_del(&cdev->entry); - kfree(cdev); - return err; - } - - attr = device_get_dma_attr(&children->dev); - if (attr != DEV_DMA_NOT_SUPPORTED) { - err = acpi_dma_configure(&cdev->dev, attr); - if (err) { - dev_dbg(&cdev->dev, "acpi_dma_configure failed\n"); - return err; - } - } - - err = acpi_dev_get_property(children, "hisi,smmu-bypass", - DEV_PROP_U8, &obj); - if (err) - dev_info(&children->dev, "read smmu bypass failed\n"); - - cdev->smmu_bypass = *(u8 *)obj->integer.value; - - cdev->group = iommu_group_get(&cdev->dev); - if (IS_ERR_OR_NULL(cdev->group)) { - dev_err(&cdev->dev, "smmu is not right configured\n"); - return -ENXIO; - } - - cdev->domain = iommu_domain_alloc(sdev->dev->bus); - if (cdev->domain == NULL) { - dev_info(&cdev->dev, "failed to alloc domain\n"); - return -ENOMEM; - } - - err = iommu_attach_group(cdev->domain, cdev->group); - if (err) { - dev_err(&cdev->dev, "failed group to domain\n"); - return err; - } - - err = iommu_dev_enable_feature(&cdev->dev, IOMMU_DEV_FEAT_IOPF); - if (err) { - dev_err(&cdev->dev, "failed to enable iopf feature, %d\n", err); - return err; - } - - err = iommu_dev_enable_feature(&cdev->dev, IOMMU_DEV_FEAT_SVA); - if (err) { - dev_err(&cdev->dev, "failed to enable sva feature\n"); - return err; - } - - return 0; -} - -static int svm_acpi_init_core(struct svm_device *sdev) -{ - int err = 0; - struct device *dev = sdev->dev; - struct acpi_device *adev = ACPI_COMPANION(sdev->dev); - struct acpi_device *cdev = NULL; - int id = 0; - - down_write(&svm_sem); - if (!svm_bus_type.iommu_ops) { - err = bus_register(&svm_bus_type); - if (err) { - up_write(&svm_sem); - dev_err(dev, "failed to register svm_bus_type\n"); - return err; - } - - err = bus_set_iommu(&svm_bus_type, dev->bus->iommu_ops); - if (err) { - up_write(&svm_sem); - dev_err(dev, "failed to set iommu for svm_bus_type\n"); - goto err_unregister_bus; - } - } else if (svm_bus_type.iommu_ops != dev->bus->iommu_ops) { - err = -EBUSY; - up_write(&svm_sem); - dev_err(dev, "iommu_ops configured, but changed!\n"); - return err; - } - up_write(&svm_sem); - - list_for_each_entry(cdev, &adev->children, node) { - err = svm_acpi_add_core(sdev, cdev, id++); - if (err) - device_for_each_child(dev, NULL, svm_remove_core); - } - - return err; - -err_unregister_bus: - bus_unregister(&svm_bus_type); - - return err; -} -#else -static int svm_acpi_init_core(struct 
svm_device *sdev) { return 0; } -#endif - -static int svm_of_add_core(struct svm_device *sdev, struct device_node *np) -{ - int err; - struct resource res; - struct core_device *cdev = NULL; - char *name = NULL; - - name = devm_kasprintf(sdev->dev, GFP_KERNEL, "svm%llu_%s", - sdev->id, np->name); - if (name == NULL) - return -ENOMEM; - - cdev = kzalloc(sizeof(*cdev), GFP_KERNEL); - if (cdev == NULL) - return -ENOMEM; - - cdev->dev.of_node = np; - cdev->dev.parent = sdev->dev; - cdev->dev.bus = &svm_bus_type; - cdev->dev.release = cdev_device_release; - cdev->smmu_bypass = of_property_read_bool(np, "hisi,smmu_bypass"); - dev_set_name(&cdev->dev, "%s", name); - - err = device_register(&cdev->dev); - if (err) { - dev_info(&cdev->dev, "core_device register failed\n"); - kfree(cdev); - return err; - } - - err = of_dma_configure(&cdev->dev, np, true); - if (err) { - dev_dbg(&cdev->dev, "of_dma_configure failed\n"); - return err; - } - - err = of_address_to_resource(np, 0, &res); - if (err) { - dev_info(&cdev->dev, "no reg, FW should install the sid\n"); - } else { - /* If the reg specified, install sid for the core */ - void __iomem *core_base = NULL; - int sid = cdev->dev.iommu->fwspec->ids[0]; - - core_base = ioremap(res.start, resource_size(&res)); - if (core_base == NULL) { - dev_err(&cdev->dev, "ioremap failed\n"); - return -ENOMEM; - } - - writel_relaxed(sid, core_base + CORE_SID); - iounmap(core_base); - } - - cdev->group = iommu_group_get(&cdev->dev); - if (IS_ERR_OR_NULL(cdev->group)) { - dev_err(&cdev->dev, "smmu is not right configured\n"); - return -ENXIO; - } - - cdev->domain = iommu_domain_alloc(sdev->dev->bus); - if (cdev->domain == NULL) { - dev_info(&cdev->dev, "failed to alloc domain\n"); - return -ENOMEM; - } - - err = iommu_attach_group(cdev->domain, cdev->group); - if (err) { - dev_err(&cdev->dev, "failed group to domain\n"); - return err; - } - - err = iommu_dev_enable_feature(&cdev->dev, IOMMU_DEV_FEAT_IOPF); - if (err) { - dev_err(&cdev->dev, "failed to enable iopf feature, %d\n", err); - return err; - } - - err = iommu_dev_enable_feature(&cdev->dev, IOMMU_DEV_FEAT_SVA); - if (err) { - dev_err(&cdev->dev, "failed to enable sva feature, %d\n", err); - return err; - } - - return 0; -} - -static int svm_dt_init_core(struct svm_device *sdev, struct device_node *np) -{ - int err = 0; - struct device_node *child = NULL; - struct device *dev = sdev->dev; - - down_write(&svm_sem); - if (svm_bus_type.iommu_ops == NULL) { - err = bus_register(&svm_bus_type); - if (err) { - up_write(&svm_sem); - dev_err(dev, "failed to register svm_bus_type\n"); - return err; - } - - err = bus_set_iommu(&svm_bus_type, dev->bus->iommu_ops); - if (err) { - up_write(&svm_sem); - dev_err(dev, "failed to set iommu for svm_bus_type\n"); - goto err_unregister_bus; - } - } else if (svm_bus_type.iommu_ops != dev->bus->iommu_ops) { - err = -EBUSY; - up_write(&svm_sem); - dev_err(dev, "iommu_ops configured, but changed!\n"); - return err; - } - up_write(&svm_sem); - - for_each_available_child_of_node(np, child) { - err = svm_of_add_core(sdev, child); - if (err) - device_for_each_child(dev, NULL, svm_remove_core); - } - - return err; - -err_unregister_bus: - bus_unregister(&svm_bus_type); - - return err; -} - -int svm_get_pasid(pid_t vpid, int dev_id __maybe_unused) -{ - int pasid; - unsigned long asid; - struct task_struct *task = NULL; - struct mm_struct *mm = NULL; - struct svm_process *process = NULL; - struct svm_bind_process params; - - params.flags = SVM_BIND_PID; - params.vpid = vpid; - params.pasid = 
-1; - params.ttbr = 0; - params.tcr = 0; - task = svm_get_task(params); - if (IS_ERR(task)) - return PTR_ERR(task); - - mm = get_task_mm(task); - if (mm == NULL) { - pasid = -EINVAL; - goto put_task; - } - - asid = arm64_mm_context_get(mm); - if (!asid) { - pasid = -ENOSPC; - goto put_mm; - } - - mutex_lock(&svm_process_mutex); - process = find_svm_process(asid); - mutex_unlock(&svm_process_mutex); - if (process) - pasid = process->pasid; - else - pasid = -ESRCH; - - arm64_mm_context_put(mm); -put_mm: - mmput(mm); -put_task: - put_task_struct(task); - - return pasid; -} -EXPORT_SYMBOL_GPL(svm_get_pasid); - -static int svm_dt_setup_l2buff(struct svm_device *sdev, struct device_node *np) -{ - struct device_node *l2buff = of_parse_phandle(np, "memory-region", 0); - - if (l2buff) { - struct resource r; - int err = of_address_to_resource(l2buff, 0, &r); - - if (err) { - of_node_put(l2buff); - return err; - } - - sdev->l2buff = r.start; - sdev->l2size = resource_size(&r); - } - - of_node_put(l2buff); - return 0; -} - -static int svm_device_probe(struct platform_device *pdev) -{ - int err = -1; - struct device *dev = &pdev->dev; - struct svm_device *sdev = NULL; - struct device_node *np = dev->of_node; - int alias_id; - - if (acpi_disabled && np == NULL) - return -ENODEV; - - if (!dev->bus) { - dev_dbg(dev, "this dev bus is NULL\n"); - return -EPROBE_DEFER; - } - - if (!dev->bus->iommu_ops) { - dev_dbg(dev, "defer probe svm device\n"); - return -EPROBE_DEFER; - } - - sdev = devm_kzalloc(dev, sizeof(*sdev), GFP_KERNEL); - if (sdev == NULL) - return -ENOMEM; - - if (!acpi_disabled) { - err = device_property_read_u64(dev, "svmid", &sdev->id); - if (err) { - dev_err(dev, "failed to get this svm device id\n"); - return err; - } - } else { - alias_id = of_alias_get_id(np, "svm"); - if (alias_id < 0) - sdev->id = probe_index; - else - sdev->id = alias_id; - } - - sdev->dev = dev; - sdev->miscdev.minor = MISC_DYNAMIC_MINOR; - sdev->miscdev.fops = &svm_fops; - sdev->miscdev.name = devm_kasprintf(dev, GFP_KERNEL, - SVM_DEVICE_NAME"%llu", sdev->id); - if (sdev->miscdev.name == NULL) - return -ENOMEM; - - dev_set_drvdata(dev, sdev); - err = misc_register(&sdev->miscdev); - if (err) { - dev_err(dev, "Unable to register misc device\n"); - return err; - } - - if (!acpi_disabled) { - err = svm_acpi_init_core(sdev); - if (err) { - dev_err(dev, "failed to init acpi cores\n"); - goto err_unregister_misc; - } - } else { - /* - * Get the l2buff phys address and size, if it do not exist - * just warn and continue, and runtime can not use L2BUFF. 
- */ - err = svm_dt_setup_l2buff(sdev, np); - if (err) - dev_warn(dev, "Cannot get l2buff\n"); - - if (svm_va2pa_trunk_init(dev)) { - dev_err(dev, "failed to init va2pa trunk\n"); - goto err_unregister_misc; - } - - err = svm_dt_init_core(sdev, np); - if (err) { - dev_err(dev, "failed to init dt cores\n"); - goto err_remove_trunk; - } - - probe_index++; - } - - mutex_init(&svm_process_mutex); - - return err; - -err_remove_trunk: - svm_remove_trunk(dev); - -err_unregister_misc: - misc_deregister(&sdev->miscdev); - - return err; -} - -static int svm_device_remove(struct platform_device *pdev) -{ - struct device *dev = &pdev->dev; - struct svm_device *sdev = dev_get_drvdata(dev); - - device_for_each_child(sdev->dev, NULL, svm_remove_core); - misc_deregister(&sdev->miscdev); - - return 0; -} - -static const struct acpi_device_id svm_acpi_match[] = { - { "HSVM1980", 0}, - { } -}; -MODULE_DEVICE_TABLE(acpi, svm_acpi_match); - -static const struct of_device_id svm_of_match[] = { - { .compatible = "hisilicon,svm" }, - { } -}; -MODULE_DEVICE_TABLE(of, svm_of_match); - -/*svm acpi probe and remove*/ -static struct platform_driver svm_driver = { - .probe = svm_device_probe, - .remove = svm_device_remove, - .driver = { - .name = SVM_DEVICE_NAME, - .acpi_match_table = ACPI_PTR(svm_acpi_match), - .of_match_table = svm_of_match, - }, -}; - -module_platform_driver(svm_driver); - -MODULE_DESCRIPTION("Hisilicon SVM driver"); -MODULE_AUTHOR("Fang Lijun fanglijun3@huawei.com"); -MODULE_LICENSE("GPL v2"); diff --git a/mm/mmap.c b/mm/mmap.c index 517239cea7d2..b5a183983f5b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1479,9 +1479,11 @@ unsigned long __do_mmap_mm(struct mm_struct *mm, struct file *file, pkey = 0; }
+#ifdef CONFIG_ASCEND_FEATURES /* Physical address is within 4G */ if (flags & MAP_PA32BIT) vm_flags |= VM_PA32BIT; +#endif
/* Do simple checking here so the lower-level routines won't have * to. we assume access permissions have been handled by the open
hulk inclusion category: performance bugzilla: 32059, https://gitee.com/openeuler/kernel/issues/I65DOZ CVE: NA
--------------------------------
This option optimizes the scheduler for common desktop workloads by automatically creating and populating task groups. This separation of workloads isolates aggressive CPU burners (like build jobs) from desktop applications. Task group autogeneration is currently based upon task session.
We do not need this for mostly-server workloads, so disable it by default. If you really need this feature, enable it via sysctl:
sysctl -w kernel.sched_autogroup_enabled=1
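To check the current value, or to keep the setting across reboots, the usual sysctl mechanisms apply (generic sysctl usage, not part of this patch):

sysctl kernel.sched_autogroup_enabled
echo 'kernel.sched_autogroup_enabled = 1' >> /etc/sysctl.conf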
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com --- kernel/sched/autogroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c index 2067080bb235..bcb2bb80919a 100644 --- a/kernel/sched/autogroup.c +++ b/kernel/sched/autogroup.c @@ -5,7 +5,7 @@ #include <linux/nospec.h> #include "sched.h"
-unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1; +unsigned int __read_mostly sysctl_sched_autogroup_enabled; static struct autogroup autogroup_default; static atomic_t autogroup_seq_nr;
From: Pavel Begunkov asml.silence@gmail.com
stable inclusion from stable-v5.10.141 commit 28d8d2737e82fc29ff9e788597661abecc7f7994 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I685FC CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
--------------------------------
Older kernels lack io_uring POLLFREE handling. As the only affected files are signalfd and android binder, the safest option is to disable polling of those files via io_uring and hope there are no users.
Fixes: 221c5eb233823 ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Conflicts: include/linux/fs.h
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/android/binder.c | 1 + fs/io_uring.c | 5 +++++ fs/signalfd.c | 1 + include/linux/fs.h | 1 + 4 files changed, 8 insertions(+)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c index b9985eee8c1b..cfb1393a0891 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -6081,6 +6081,7 @@ const struct file_operations binder_fops = { .open = binder_open, .flush = binder_flush, .release = binder_release, + .may_pollfree = true, };
static int __init init_binder_device(const char *name) diff --git a/fs/io_uring.c b/fs/io_uring.c index 9d5a041d329e..4c6e442a5edf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5233,6 +5233,11 @@ static __poll_t __io_arm_poll_handler(struct io_kiocb *req, struct io_ring_ctx *ctx = req->ctx; bool cancel = false;
+ if (req->file->f_op->may_pollfree) { + spin_lock_irq(&ctx->completion_lock); + return -EOPNOTSUPP; + } + INIT_HLIST_NODE(&req->hash_node); io_init_poll_iocb(poll, mask, wake_func); poll->file = req->file; diff --git a/fs/signalfd.c b/fs/signalfd.c index b94fb5f81797..41dc597b78cc 100644 --- a/fs/signalfd.c +++ b/fs/signalfd.c @@ -248,6 +248,7 @@ static const struct file_operations signalfd_fops = { .poll = signalfd_poll, .read = signalfd_read, .llseek = noop_llseek, + .may_pollfree = true, };
static int do_signalfd4(int ufd, sigset_t *mask, int flags) diff --git a/include/linux/fs.h b/include/linux/fs.h index b95ff48204ba..406c170b61fc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1898,6 +1898,7 @@ struct file_operations { struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); + bool may_pollfree;
KABI_RESERVE(1) KABI_RESERVE(2)
From: Li Lingfeng lilingfeng3@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I685FC CVE: NA
--------------------------------
Commit 0845c5803f3f ("[Backport] io_uring: disable polling pollfree files") adds a new member to file_operations, so we need to fix the resulting KABI breakage.
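For context, a simplified sketch of how such reserve/use macros typically work (hypothetical definitions; the exact openEuler macros in include/linux/kabi.h may differ): a reserve burns a pointer-sized padding slot, and KABI_USE overlays the new member on that slot through an anonymous union, so the struct size and the offsets of all following members stay unchanged.

/* Simplified sketch, not the exact openEuler definitions. */
#define _KABI_RESERVE(n)	unsigned long kabi_reserved##n
#define KABI_RESERVE(n)		_KABI_RESERVE(n);
#define KABI_USE(n, _new)	union { _new; _KABI_RESERVE(n); };

Replacing KABI_RESERVE(1) with KABI_USE(1, bool may_pollfree) therefore keeps the file_operations layout stable for existing binary consumers.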
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/linux/fs.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h index 406c170b61fc..b256911f03fc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1898,9 +1898,8 @@ struct file_operations { struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); - bool may_pollfree;
- KABI_RESERVE(1) + KABI_USE(1, bool may_pollfree) KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4)
From: Qi Zheng zhengqi.arch@bytedance.com
mainline inclusion from mainline-v6.1-rc7 commit ea4452de2ae987342fadbdd2c044034e6480daad category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I69VVC CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
When we specify __GFP_NOWARN, we only expect that no warnings will be issued for the current caller. But in __should_failslab() and __should_fail_alloc_page(), the local GFP flags alter the global {failslab|fail_page_alloc}.attr, which is persistent and shared by all tasks. This is not what we expected; let's fix it.
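For illustration, a minimal userspace sketch of the bug class (hypothetical names, not the kernel implementation): stashing a per-call option in shared persistent state leaks it to all callers, while a per-call flags argument, as this patch introduces, keeps it local.

#include <stdbool.h>
#include <stdio.h>

struct fault_attr_like {
	bool no_warn;			/* shared, persistent: the buggy design */
};

static struct fault_attr_like shared_attr;

/* Buggy: one caller asking for silence mutes every later caller. */
static bool should_fail_buggy(bool quiet)
{
	if (quiet)
		shared_attr.no_warn = true;	/* global side effect */
	if (!shared_attr.no_warn)
		printf("FAULT_INJECTION: forcing a failure.\n");
	return true;
}

#define FAULT_NOWARN (1 << 0)

/* Fixed: the option lives in a per-call flags argument only. */
static bool should_fail_fixed(int flags)
{
	if (!(flags & FAULT_NOWARN))
		printf("FAULT_INJECTION: forcing a failure.\n");
	return true;
}

int main(void)
{
	should_fail_buggy(true);		/* quiet caller taints shared state */
	should_fail_buggy(false);		/* loud caller is silenced too: bug */
	should_fail_fixed(FAULT_NOWARN);	/* quiet, no side effects */
	should_fail_fixed(0);			/* still warns, as expected */
	return 0;
}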
[akpm@linux-foundation.org: unexport should_fail_ex()] Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com Fixes: 3f913fc5f974 ("mm: fix missing handler for __GFP_NOWARN") Signed-off-by: Qi Zheng zhengqi.arch@bytedance.com Reported-by: Dmitry Vyukov dvyukov@google.com Reviewed-by: Akinobu Mita akinobu.mita@gmail.com Reviewed-by: Jason Gunthorpe jgg@nvidia.com Cc: Akinobu Mita akinobu.mita@gmail.com Cc: Matthew Wilcox willy@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ye Weihua yeweihua4@huawei.com Reviewed-by: tong tiangen tongtiangen@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/linux/fault-inject.h | 7 +++++-- lib/fault-inject.c | 13 ++++++++----- mm/failslab.c | 12 ++++++++++-- mm/page_alloc.c | 7 +++++-- 4 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h index d506ee960ffd..ff27c006732f 100644 --- a/include/linux/fault-inject.h +++ b/include/linux/fault-inject.h @@ -20,7 +20,6 @@ struct fault_attr { atomic_t space; unsigned long verbose; bool task_filter; - bool no_warn; unsigned long stacktrace_depth; unsigned long require_start; unsigned long require_end; @@ -32,6 +31,10 @@ struct fault_attr { struct dentry *dname; };
+enum fault_flags { + FAULT_NOWARN = 1 << 0, +}; + #define FAULT_ATTR_INITIALIZER { \ .interval = 1, \ .times = ATOMIC_INIT(1), \ @@ -40,11 +43,11 @@ struct fault_attr { .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \ .verbose = 2, \ .dname = NULL, \ - .no_warn = false, \ }
#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER int setup_fault_attr(struct fault_attr *attr, char *str); +bool should_fail_ex(struct fault_attr *attr, ssize_t size, int flags); bool should_fail(struct fault_attr *attr, ssize_t size);
#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS diff --git a/lib/fault-inject.c b/lib/fault-inject.c index 423784d9c058..70768d8a2200 100644 --- a/lib/fault-inject.c +++ b/lib/fault-inject.c @@ -41,9 +41,6 @@ EXPORT_SYMBOL_GPL(setup_fault_attr);
static void fail_dump(struct fault_attr *attr) { - if (attr->no_warn) - return; - if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) { printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n" "name %pd, interval %lu, probability %lu, " @@ -103,7 +100,7 @@ static inline bool fail_stacktrace(struct fault_attr *attr) * http://www.nongnu.org/failmalloc/ */
-bool should_fail(struct fault_attr *attr, ssize_t size) +bool should_fail_ex(struct fault_attr *attr, ssize_t size, int flags) { if (in_task()) { unsigned int fail_nth = READ_ONCE(current->fail_nth); @@ -146,13 +143,19 @@ bool should_fail(struct fault_attr *attr, ssize_t size) return false;
fail: - fail_dump(attr); + if (!(flags & FAULT_NOWARN)) + fail_dump(attr);
if (atomic_read(&attr->times) != -1) atomic_dec_not_zero(&attr->times);
return true; } + +bool should_fail(struct fault_attr *attr, ssize_t size) +{ + return should_fail_ex(attr, size, 0); +} EXPORT_SYMBOL_GPL(should_fail);
#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS diff --git a/mm/failslab.c b/mm/failslab.c index 58df9789f1d2..ffc420c0e767 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -16,6 +16,8 @@ static struct {
bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) { + int flags = 0; + /* No fault-injection for bootstrap cache */ if (unlikely(s == kmem_cache)) return false; @@ -30,10 +32,16 @@ bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB)) return false;
+ /* + * In some cases, it expects to specify __GFP_NOWARN + * to avoid printing any information (not just a warning), + * thus avoiding deadlocks. See commit 6b9dbedbe349 for + * details. + */ if (gfpflags & __GFP_NOWARN) - failslab.attr.no_warn = true; + flags |= FAULT_NOWARN;
- return should_fail(&failslab.attr, s->object_size); + return should_fail_ex(&failslab.attr, s->object_size, flags); }
static int __init setup_failslab(char *str) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 179a6d4948af..df3723c0a819 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3545,6 +3545,8 @@ __setup("fail_page_alloc=", setup_fail_page_alloc);
static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { + int flags = 0; + if (order < fail_page_alloc.min_order) return false; if (gfp_mask & __GFP_NOFAIL) @@ -3555,10 +3557,11 @@ static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) (gfp_mask & __GFP_DIRECT_RECLAIM)) return false;
+ /* See comment in __should_failslab() */ if (gfp_mask & __GFP_NOWARN) - fail_page_alloc.attr.no_warn = true; + flags |= FAULT_NOWARN;
- return should_fail(&fail_page_alloc.attr, 1 << order); + return should_fail_ex(&fail_page_alloc.attr, 1 << order, flags); }
#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
From: Dave Chinner dchinner@redhat.com
stable inclusion from stable-v5.10.129 commit b261cd005ab980c4018634a849f77e036bfd4f80 category: bugfix bugzilla: 188251, https://gitee.com/openeuler/kernel/issues/I5YNDQ CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 756b1c343333a5aefcc26b0409f3fd16f72281bf upstream.
The iomap code using PF_MEMALLOC_NOFS to detect transaction recursion in XFS is just wrong. Remove it from the iomap code and replace it with XFS-specific internal checks using current->journal_info instead.
[djwong: This change also realigns the lifetime of NOFS flag changes to match the incore transaction, instead of the inconsistent scheme we have now.]
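As a rough userspace analogue of the new scheme (a simplified sketch with hypothetical names; the real helpers are in the fs/xfs/xfs_trans.h hunk below), a per-task pointer marks transaction context and writeback bails out when it is set:

#include <assert.h>
#include <stddef.h>

/* Per-task pointer, standing in for current->journal_info. */
static _Thread_local void *journal_info;

struct txn { int dummy; };

static void trans_set_context(struct txn *tp)
{
	assert(journal_info == NULL);	/* no transaction recursion */
	journal_info = tp;
}

static void trans_clear_context(struct txn *tp)
{
	if (journal_info == tp)
		journal_info = NULL;
}

static int writepages(void)
{
	if (journal_info != NULL)
		return 0;	/* in transaction context: refuse writeback */
	/* ... do the actual writeback here ... */
	return 1;
}

int main(void)
{
	struct txn t;

	assert(writepages() == 1);	/* normal context proceeds */
	trans_set_context(&t);
	assert(writepages() == 0);	/* recursive context bails out */
	trans_clear_context(&t);
	return 0;
}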
Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS") Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Conflicts: fs/xfs/xfs_aops.c fs/xfs/xfs_trans.c fs/xfs/xfs_trans.h
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/iomap/buffered-io.c | 7 ------- fs/xfs/libxfs/xfs_btree.c | 12 ++++++++++-- fs/xfs/xfs_aops.c | 15 ++++++++++++++- fs/xfs/xfs_trans.c | 20 +++++--------------- fs/xfs/xfs_trans.h | 30 ++++++++++++++++++++++++++++++ 5 files changed, 59 insertions(+), 25 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index e2cdce897f83..986020dcf026 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1460,13 +1460,6 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) PF_MEMALLOC)) goto redirty;
- /* - * Given that we do not allow direct reclaim to call us, we should - * never be called in a recursive filesystem reclaim context. - */ - if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) - goto redirty; - /* * Is this page beyond the end of the file? * diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 98c82f4935e1..24c7d30e41df 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -2811,7 +2811,7 @@ xfs_btree_split_worker( struct xfs_btree_split_args *args = container_of(work, struct xfs_btree_split_args, work); unsigned long pflags; - unsigned long new_pflags = PF_MEMALLOC_NOFS; + unsigned long new_pflags = 0;
/* * we are in a transaction context here, but may also be doing work @@ -2823,12 +2823,20 @@ xfs_btree_split_worker( new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
current_set_flags_nested(&pflags, new_pflags); + xfs_trans_set_context(args->cur->bc_tp);
args->result = __xfs_btree_split(args->cur, args->level, args->ptrp, args->key, args->curp, args->stat); - complete(args->done);
+ xfs_trans_clear_context(args->cur->bc_tp); current_restore_flags_nested(&pflags, new_pflags); + + /* + * Do not access args after complete() has run here. We don't own args + * and the owner may run and free args before we return here. + */ + complete(args->done); + }
/* diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 7aa67d95c578..e341d6531e68 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -98,7 +98,7 @@ xfs_setfilesize_ioend( * thus we need to mark ourselves as being in a transaction manually. * Similarly for freeze protection. */ - current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); + xfs_trans_set_context(tp); __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
/* we abort the update if there was an IO error */ @@ -538,6 +538,12 @@ xfs_vm_writepage( { struct xfs_writepage_ctx wpc = { };
+ if (WARN_ON_ONCE(current->journal_info)) { + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return 0; + } + return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops); }
@@ -548,6 +554,13 @@ xfs_vm_writepages( { struct xfs_writepage_ctx wpc = { };
+ /* + * Writing back data in a transaction context can result in recursive + * transactions. This is bad, so issue a warning and get out of here. + */ + if (WARN_ON_ONCE(current->journal_info)) + return 0; + xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED); return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops); } diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index e8a9967e7194..8836bb02d82d 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -72,6 +72,7 @@ xfs_trans_free( xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
trace_xfs_trans_free(tp, _RET_IP_); + xfs_trans_clear_context(tp); if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT)) sb_end_intwrite(tp->t_mountp->m_super); xfs_trans_free_dqinfo(tp); @@ -123,7 +124,8 @@ xfs_trans_dup(
ntp->t_rtx_res = tp->t_rtx_res - tp->t_rtx_res_used; tp->t_rtx_res = tp->t_rtx_res_used; - ntp->t_pflags = tp->t_pflags; + + xfs_trans_switch_context(tp, ntp);
/* move deferred ops over to the new tp */ xfs_defer_move(ntp, tp); @@ -157,9 +159,6 @@ xfs_trans_reserve( int error = 0; bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
- /* Mark this thread as being in a transaction */ - current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); - /* * Attempt to reserve the needed disk blocks by decrementing * the number needed from the number available. This will @@ -167,10 +166,8 @@ xfs_trans_reserve( */ if (blocks > 0) { error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd); - if (error != 0) { - current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); + if (error != 0) return -ENOSPC; - } tp->t_blk_res += blocks; }
@@ -244,9 +241,6 @@ xfs_trans_reserve( xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd); tp->t_blk_res = 0; } - - current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); - return error; }
@@ -272,6 +266,7 @@ xfs_trans_alloc( tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL); if (!(flags & XFS_TRANS_NO_WRITECOUNT)) sb_start_intwrite(mp->m_super); + xfs_trans_set_context(tp);
/* * Zero-reservation ("empty") transactions can't modify anything, so @@ -893,7 +888,6 @@ __xfs_trans_commit(
xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
- current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); xfs_trans_free(tp);
/* @@ -925,7 +919,6 @@ __xfs_trans_commit( xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket); tp->t_ticket = NULL; } - current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); xfs_trans_free_items(tp, !!error); xfs_trans_free(tp);
@@ -985,9 +978,6 @@ xfs_trans_cancel( tp->t_ticket = NULL; }
- /* mark this thread as no longer being in a transaction */ - current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); - xfs_trans_free_items(tp, dirty); xfs_trans_free(tp); } diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index f95566fe981b..50da47f23a07 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -266,4 +266,34 @@ int xfs_trans_alloc_ichange(struct xfs_inode *ip, struct xfs_dquot *udqp, struct xfs_dquot *gdqp, struct xfs_dquot *pdqp, bool force, struct xfs_trans **tpp);
+static inline void +xfs_trans_set_context( + struct xfs_trans *tp) +{ + ASSERT(current->journal_info == NULL); + tp->t_pflags = memalloc_nofs_save(); + current->journal_info = tp; +} + +static inline void +xfs_trans_clear_context( + struct xfs_trans *tp) +{ + if (current->journal_info == tp) { + memalloc_nofs_restore(tp->t_pflags); + current->journal_info = NULL; + } +} + +static inline void +xfs_trans_switch_context( + struct xfs_trans *old_tp, + struct xfs_trans *new_tp) +{ + ASSERT(current->journal_info == old_tp); + new_tp->t_pflags = old_tp->t_pflags; + old_tp->t_pflags = 0; + current->journal_info = new_tp; +} + #endif /* __XFS_TRANS_H__ */
From: "Darrick J. Wong" djwong@kernel.org
stable inclusion from stable-v5.10.141 commit cb41f22df3ec7b0b4f7cd9a730a538645e324f2f category: bugfix bugzilla: 188251, https://gitee.com/openeuler/kernel/issues/I685FC CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 15f04fdc75aaaa1cccb0b8b3af1be290e118a7bc upstream.
[Added wrapper xfs_fdblocks_unavailable() for 5.10.y backport]
Infinite loops in kernel code are scary. Calls to xfs_reserve_blocks should be rare (people should just use the defaults!) so we really don't need to try so hard. Simplify the logic here by removing the infinite loop.
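The retry loop collapses into one clamped attempt; a toy illustration with made-up numbers (not the kernel code):

#include <stdio.h>

int main(void)
{
	long request = 8192;	/* desired reserve pool size */
	long resblks = 1024;	/* current reserve pool size */
	long free = 3000;	/* free blocks, sampled once */
	long delta = request - resblks;
	long fdblks_delta = 0;

	/* Take what we can in a single pass; a partial fill is fine. */
	if (delta > 0 && free > 0)
		fdblks_delta = delta < free ? delta : free;

	printf("move %ld blocks into the reserve, no retry loop\n",
	       fdblks_delta);
	return 0;
}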
Cc: Brian Foster bfoster@redhat.com Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Conflicts: fs/xfs/xfs_mount.h
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/xfs/xfs_fsops.c | 52 +++++++++++++++++++--------------------------- fs/xfs/xfs_mount.h | 8 +++++++ 2 files changed, 29 insertions(+), 31 deletions(-)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index 2741dbd22704..3209e77ef84a 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -374,46 +374,36 @@ xfs_reserve_blocks( * If the request is larger than the current reservation, reserve the * blocks before we update the reserve counters. Sample m_fdblocks and * perform a partial reservation if the request exceeds free space. + * + * The code below estimates how many blocks it can request from + * fdblocks to stash in the reserve pool. This is a classic TOCTOU + * race since fdblocks updates are not always coordinated via + * m_sb_lock. */ - error = -ENOSPC; - do { - free = percpu_counter_sum(&mp->m_fdblocks) - - mp->m_alloc_set_aside; - if (free <= 0) - break; - - delta = request - mp->m_resblks; - lcounter = free - delta; - if (lcounter < 0) - /* We can't satisfy the request, just get what we can */ - fdblks_delta = free; - else - fdblks_delta = delta; - + free = percpu_counter_sum(&mp->m_fdblocks) - + xfs_fdblocks_unavailable(mp); + delta = request - mp->m_resblks; + if (delta > 0 && free > 0) { /* * We'll either succeed in getting space from the free block - * count or we'll get an ENOSPC. If we get a ENOSPC, it means - * things changed while we were calculating fdblks_delta and so - * we should try again to see if there is anything left to - * reserve. - * - * Don't set the reserved flag here - we don't want to reserve - * the extra reserve blocks from the reserve..... + * count or we'll get an ENOSPC. Don't set the reserved flag + * here - we don't want to reserve the extra reserve blocks + * from the reserve. */ + fdblks_delta = min(free, delta); spin_unlock(&mp->m_sb_lock); error = xfs_mod_fdblocks(mp, -fdblks_delta, 0); spin_lock(&mp->m_sb_lock); - } while (error == -ENOSPC);
- /* - * Update the reserve counters if blocks have been successfully - * allocated. - */ - if (!error && fdblks_delta) { - mp->m_resblks += fdblks_delta; - mp->m_resblks_avail += fdblks_delta; + /* + * Update the reserve counters if blocks have been successfully + * allocated. + */ + if (!error) { + mp->m_resblks += fdblks_delta; + mp->m_resblks_avail += fdblks_delta; + } } - out: if (outval) { outval->resblks = mp->m_resblks; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 46dba3289d10..91a66b7b8815 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -467,6 +467,14 @@ extern void xfs_unmountfs(xfs_mount_t *); */ #define XFS_FDBLOCKS_BATCH 1024
+/* Accessor added for 5.10.y backport */ +static inline uint64_t +xfs_fdblocks_unavailable( + struct xfs_mount *mp) +{ + return mp->m_alloc_set_aside; +} + extern int xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta, bool reserved); extern int xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
From: "Darrick J. Wong" djwong@kernel.org
stable inclusion from stable-v5.10.141 commit 72a259bdd50dd6646a88e29fc769e50377e06d57 category: bugfix bugzilla: 188251, https://gitee.com/openeuler/kernel/issues/I685FC CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 0baa2657dc4d79202148be79a3dc36c35f425060 upstream.
Nowadays, xfs_mod_fdblocks will always choose to fill the reserve pool with freed blocks before adding to fdblocks. Therefore, we can change the behavior of xfs_reserve_blocks slightly -- setting the target size of the pool should always succeed, since a deficiency will eventually be made up as blocks get freed.
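A single-threaded toy model of that refill-first behavior (made-up numbers and names; the real counters are per-cpu and protected by locks):

#include <stdio.h>

static long resblks = 8192;		/* reserve pool target */
static long resblks_avail = 100;	/* reserve pool fill level */
static long fdblocks = 1000000;		/* global free-space counter */

/* Freed blocks top up an undersized reserve before hitting fdblocks. */
static void mod_fdblocks(long delta)
{
	if (delta > 0 && resblks_avail < resblks) {
		long fill = resblks - resblks_avail;

		if (fill > delta)
			fill = delta;
		resblks_avail += fill;
		delta -= fill;
	}
	fdblocks += delta;
}

int main(void)
{
	mod_fdblocks(10000);	/* free 10000 blocks */
	printf("reserve %ld/%ld, fdblocks %ld\n",
	       resblks_avail, resblks, fdblocks);
	return 0;
}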
Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/xfs/xfs_fsops.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index 3209e77ef84a..e555a8bbdbd7 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -378,11 +378,14 @@ xfs_reserve_blocks( * The code below estimates how many blocks it can request from * fdblocks to stash in the reserve pool. This is a classic TOCTOU * race since fdblocks updates are not always coordinated via - * m_sb_lock. + * m_sb_lock. Set the reserve size even if there's not enough free + * space to fill it because mod_fdblocks will refill an undersized + * reserve when it can. */ free = percpu_counter_sum(&mp->m_fdblocks) - xfs_fdblocks_unavailable(mp); delta = request - mp->m_resblks; + mp->m_resblks = request; if (delta > 0 && free > 0) { /* * We'll either succeed in getting space from the free block @@ -399,10 +402,8 @@ xfs_reserve_blocks( * Update the reserve counters if blocks have been successfully * allocated. */ - if (!error) { - mp->m_resblks += fdblks_delta; + if (!error) mp->m_resblks_avail += fdblks_delta; - } } out: if (outval) {
From: "Darrick J. Wong" djwong@kernel.org
stable inclusion from stable-v5.10.141 commit f168801da95fe62c6751235665c27edf5ca2458a category: bugfix bugzilla: 188251, https://gitee.com/openeuler/kernel/issues/I685FC CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 82be38bcf8a2e056b4c99ce79a3827fa743df6ec upstream.
Due to cycling of m_sb_lock, it's possible for multiple callers of xfs_reserve_blocks to race at changing the pool size, subtracting blocks from fdblocks, and actually putting it in the pool. The result of all this is that we can overfill the reserve pool to hilarious levels.
xfs_mod_fdblocks, when called with a positive value, already knows how to take freed blocks and either fill the reserve until it's full, or put them in fdblocks. Use that instead of setting m_resblks_avail directly.
Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/xfs/xfs_fsops.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index e555a8bbdbd7..81cb2b3b20f2 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -392,18 +392,17 @@ xfs_reserve_blocks( * count or we'll get an ENOSPC. Don't set the reserved flag * here - we don't want to reserve the extra reserve blocks * from the reserve. + * + * The desired reserve size can change after we drop the lock. + * Use mod_fdblocks to put the space into the reserve or into + * fdblocks as appropriate. */ fdblks_delta = min(free, delta); spin_unlock(&mp->m_sb_lock); error = xfs_mod_fdblocks(mp, -fdblks_delta, 0); - spin_lock(&mp->m_sb_lock); - - /* - * Update the reserve counters if blocks have been successfully - * allocated. - */ if (!error) - mp->m_resblks_avail += fdblks_delta; + xfs_mod_fdblocks(mp, fdblks_delta, 0); + spin_lock(&mp->m_sb_lock); } out: if (outval) {
From: Bing-Jhong Billy Jheng billy@starlabs.sg
stable inclusion from stable-v5.10.160 commit 75454b4bbfc7e6a4dd8338556f36ea9107ddf61a category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6AUN9 CVE: CVE-2022-4696
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=l...
--------------------------------
Splice is like read/write and should grab current->nsproxy, denoted by IO_WQ_WORK_FILES, as it refers to current->files as well.
Signed-off-by: Bing-Jhong Billy Jheng billy@starlabs.sg Reviewed-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4c6e442a5edf..4ace89ae4832 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -935,7 +935,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, - .work_flags = IO_WQ_WORK_BLKCG, + .work_flags = IO_WQ_WORK_BLKCG | IO_WQ_WORK_FILES, }, [IORING_OP_PROVIDE_BUFFERS] = {}, [IORING_OP_REMOVE_BUFFERS] = {},
From: Carlos Llamas cmllamas@google.com
stable inclusion from stable-v5.10.154 commit 015ac18be7de25d17d6e5f1643cb3b60bfbe859e category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I68WW5 CVE: CVE-2023-20928
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
In commit 720c24192404 ("ANDROID: binder: change down_write to down_read") binder assumed the mmap read lock is sufficient to protect alloc->vma inside binder_update_page_range(). This used to be accurate until commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), which now downgrades the mmap_lock after detaching the vma from the rbtree in munmap(). Then it proceeds to tear down and free the vma with only the read lock held.
This means that accesses to alloc->vma in binder_update_page_range() now will race with vm_area_free() in munmap() and can cause a UAF as shown in the following KASAN trace:
================================================================== BUG: KASAN: use-after-free in vm_insert_page+0x7c/0x1f0 Read of size 8 at addr ffff16204ad00600 by task server/558
CPU: 3 PID: 558 Comm: server Not tainted 5.10.150-00001-gdc8dcf942daa #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2a0 show_stack+0x18/0x2c dump_stack+0xf8/0x164 print_address_description.constprop.0+0x9c/0x538 kasan_report+0x120/0x200 __asan_load8+0xa0/0xc4 vm_insert_page+0x7c/0x1f0 binder_update_page_range+0x278/0x50c binder_alloc_new_buf+0x3f0/0xba0 binder_transaction+0x64c/0x3040 binder_thread_write+0x924/0x2020 binder_ioctl+0x1610/0x2e5c __arm64_sys_ioctl+0xd4/0x120 el0_svc_common.constprop.0+0xac/0x270 do_el0_svc+0x38/0xa0 el0_svc+0x1c/0x2c el0_sync_handler+0xe8/0x114 el0_sync+0x180/0x1c0
Allocated by task 559: kasan_save_stack+0x38/0x6c __kasan_kmalloc.constprop.0+0xe4/0xf0 kasan_slab_alloc+0x18/0x2c kmem_cache_alloc+0x1b0/0x2d0 vm_area_alloc+0x28/0x94 mmap_region+0x378/0x920 do_mmap+0x3f0/0x600 vm_mmap_pgoff+0x150/0x17c ksys_mmap_pgoff+0x284/0x2dc __arm64_sys_mmap+0x84/0xa4 el0_svc_common.constprop.0+0xac/0x270 do_el0_svc+0x38/0xa0 el0_svc+0x1c/0x2c el0_sync_handler+0xe8/0x114 el0_sync+0x180/0x1c0
Freed by task 560: kasan_save_stack+0x38/0x6c kasan_set_track+0x28/0x40 kasan_set_free_info+0x24/0x4c __kasan_slab_free+0x100/0x164 kasan_slab_free+0x14/0x20 kmem_cache_free+0xc4/0x34c vm_area_free+0x1c/0x2c remove_vma+0x7c/0x94 __do_munmap+0x358/0x710 __vm_munmap+0xbc/0x130 __arm64_sys_munmap+0x4c/0x64 el0_svc_common.constprop.0+0xac/0x270 do_el0_svc+0x38/0xa0 el0_svc+0x1c/0x2c el0_sync_handler+0xe8/0x114 el0_sync+0x180/0x1c0
[...] ==================================================================
To prevent the race above, revert to taking the mmap write lock inside binder_update_page_range(). One might expect an increase in mmap lock contention. However, binder already serializes these calls via the top-level alloc->mutex. Also, no performance impact was shown when running the binder benchmark tests.
Note this patch is specific to stable branches 5.4 and 5.10, since in newer kernel releases binder no longer caches a pointer to the vma; instead, it has been refactored to use vma_lookup(), which avoids the issue described here. This switch was introduced in commit a43cfc87caaf ("android: binder: stop saving a pointer to the VMA").
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Reported-by: Jann Horn jannh@google.com Cc: stable@vger.kernel.org # 5.10.x Cc: Minchan Kim minchan@kernel.org Cc: Yang Shi yang.shi@linux.alibaba.com Cc: Liam Howlett liam.howlett@oracle.com Signed-off-by: Carlos Llamas cmllamas@google.com Acked-by: Todd Kjos tkjos@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Chen Jiahao chenjiahao16@huawei.com Reviewed-by: Liao Chang liaochang1@huawei.com Reviewed-by: Zhang Jianhua chris.zjh@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/android/binder_alloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c index 95ca4f934d28..a77ed66425f2 100644 --- a/drivers/android/binder_alloc.c +++ b/drivers/android/binder_alloc.c @@ -212,7 +212,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate, mm = alloc->vma_vm_mm;
if (mm) { - mmap_read_lock(mm); + mmap_write_lock(mm); vma = alloc->vma; }
@@ -270,7 +270,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate, trace_binder_alloc_page_end(alloc, index); } if (mm) { - mmap_read_unlock(mm); + mmap_write_unlock(mm); mmput(mm); } return 0; @@ -303,7 +303,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate, } err_no_vma: if (mm) { - mmap_read_unlock(mm); + mmap_write_unlock(mm); mmput(mm); } return vma ? -ENOMEM : -ESRCH;