Baruch Siach (1): gpio: mvebu: fix pwm .get_state period calculation
Eric Biggers (1): fs: fix lazytime expiration handling in __writeback_single_inode()
Gaurav Kohli (1): tracing: Fix race in trace_open and buffer resize call
Greg Kroah-Hartman (1): Linux 4.19.172
Jan Kara (1): writeback: Drop I_DIRTY_TIME_EXPIRE
Jason Gerecke (1): HID: wacom: Correct NULL dereference on AES pen proximity
Jean-Philippe Brucker (1): tools: Factor HOSTCC, HOSTLD, HOSTAR definitions
Mikulas Patocka (1): dm integrity: conditionally disable "recalculate" feature
Thomas Gleixner (18): futex: Move futex exit handling into futex code futex: Replace PF_EXITPIDONE with a state exit/exec: Seperate mm_release() futex: Split futex_mm_release() for exit/exec futex: Set task::futex_state to DEAD right after handling futex exit futex: Mark the begin of futex exit explicitly futex: Sanitize exit state handling futex: Provide state handling for exec() as well futex: Add mutex around futex exit futex: Provide distinct return value when owner is exiting futex: Prevent exit livelock futex: Ensure the correct return value from futex_lock_pi() futex: Replace pointless printk in fixup_owner() futex: Provide and use pi_state_update_owner() rtmutex: Remove unused argument from rt_mutex_proxy_unlock() futex: Use pi_state_update_owner() in put_pi_state() futex: Simplify fixup_pi_state_owner() futex: Handle faults correctly for PI futexes
Wang Hai (1): Revert "mm/slub: fix a memory leak in sysfs_slab_add()"
Documentation/device-mapper/dm-integrity.txt | 7 + Makefile | 2 +- drivers/gpio/gpio-mvebu.c | 25 +- drivers/hid/wacom_sys.c | 7 +- drivers/hid/wacom_wac.h | 2 +- drivers/md/dm-integrity.c | 24 +- fs/exec.c | 2 +- fs/ext4/inode.c | 2 +- fs/fs-writeback.c | 36 +- fs/xfs/xfs_trans_inode.c | 4 +- include/linux/compat.h | 2 - include/linux/fs.h | 1 - include/linux/futex.h | 40 +- include/linux/sched.h | 3 +- include/linux/sched/mm.h | 6 +- include/trace/events/writeback.h | 1 - kernel/exit.c | 30 +- kernel/fork.c | 40 +- kernel/futex.c | 485 +++++++++++++------ kernel/locking/rtmutex.c | 3 +- kernel/locking/rtmutex_common.h | 3 +- kernel/trace/ring_buffer.c | 4 + mm/slub.c | 4 +- tools/build/Makefile | 4 - tools/objtool/Makefile | 9 - tools/perf/Makefile.perf | 4 - tools/power/acpi/Makefile.config | 1 - tools/scripts/Makefile.include | 10 + 28 files changed, 459 insertions(+), 302 deletions(-)
From: Baruch Siach baruch@tkos.co.il
commit e73b0101ae5124bf7cd3fb5d250302ad2f16a416 upstream.
The period is the sum of on and off values. That is, calculate period as
($on + $off) / clkrate
instead of
$off / clkrate - $on / clkrate
that makes no sense.
Reported-by: Russell King linux@armlinux.org.uk Reviewed-by: Uwe Kleine-König u.kleine-koenig@pengutronix.de Fixes: 757642f9a584e ("gpio: mvebu: Add limited PWM support") Signed-off-by: Baruch Siach baruch@tkos.co.il Signed-off-by: Bartosz Golaszewski bgolaszewski@baylibre.com [baruch: backport to kernels <= v5.10] Reviewed-by: Uwe Kleine-König u.kleine-koenig@pengutronix.de Signed-off-by: Baruch Siach baruch@tkos.co.il Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/gpio/gpio-mvebu.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/drivers/gpio/gpio-mvebu.c b/drivers/gpio/gpio-mvebu.c index 3b78dcda4736..874caed72390 100644 --- a/drivers/gpio/gpio-mvebu.c +++ b/drivers/gpio/gpio-mvebu.c @@ -650,9 +650,8 @@ static void mvebu_pwm_get_state(struct pwm_chip *chip,
spin_lock_irqsave(&mvpwm->lock, flags);
- val = (unsigned long long) - readl_relaxed(mvebu_pwmreg_blink_on_duration(mvpwm)); - val *= NSEC_PER_SEC; + u = readl_relaxed(mvebu_pwmreg_blink_on_duration(mvpwm)); + val = (unsigned long long) u * NSEC_PER_SEC; do_div(val, mvpwm->clk_rate); if (val > UINT_MAX) state->duty_cycle = UINT_MAX; @@ -661,21 +660,17 @@ static void mvebu_pwm_get_state(struct pwm_chip *chip, else state->duty_cycle = 1;
- val = (unsigned long long) - readl_relaxed(mvebu_pwmreg_blink_off_duration(mvpwm)); + val = (unsigned long long) u; /* on duration */ + /* period = on + off duration */ + val += readl_relaxed(mvebu_pwmreg_blink_off_duration(mvpwm)); val *= NSEC_PER_SEC; do_div(val, mvpwm->clk_rate); - if (val < state->duty_cycle) { + if (val > UINT_MAX) + state->period = UINT_MAX; + else if (val) + state->period = val; + else state->period = 1; - } else { - val -= state->duty_cycle; - if (val > UINT_MAX) - state->period = UINT_MAX; - else if (val) - state->period = val; - else - state->period = 1; - }
regmap_read(mvchip->regs, GPIO_BLINK_EN_OFF + mvchip->offset, &u); if (u)
From: Wang Hai wanghai38@huawei.com
commit 757fed1d0898b893d7daa84183947c70f27632f3 upstream.
This reverts commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2.
syzbot report a double-free bug. The following case can cause this bug.
- mm/slab_common.c: create_cache(): if the __kmem_cache_create() fails, it does:
out_free_cache: kmem_cache_free(kmem_cache, s);
- but __kmem_cache_create() - at least for slub() - will have done
sysfs_slab_add(s) -> sysfs_create_group() .. fails .. -> kobject_del(&s->kobj); .. which frees s ...
We can't remove the kmem_cache_free() in create_cache(), because other error cases of __kmem_cache_create() do not free this.
So, revert the commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") to fix this.
Reported-by: syzbot+d0bd96b4696c1ef67991@syzkaller.appspotmail.com Fixes: dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") Acked-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Wang Hai wanghai38@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/slub.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c index 182b403b7ec6..f5e94499b736 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -5746,10 +5746,8 @@ static int sysfs_slab_add(struct kmem_cache *s)
s->kobj.kset = kset; err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name); - if (err) { - kobject_put(&s->kobj); + if (err) goto out; - }
err = sysfs_create_group(&s->kobj, &slab_attr_group); if (err)
From: Thomas Gleixner tglx@linutronix.de
commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream
The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code.
Split it out into a function, move it into futex code and make the various futex exit functions static.
Preparatory only and no functional change.
Folded build fix from Borislav.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/compat.h | 2 -- include/linux/futex.h | 29 ++++++++++++++++------------- kernel/fork.c | 25 +++---------------------- kernel/futex.c | 33 +++++++++++++++++++++++++++++---- 4 files changed, 48 insertions(+), 41 deletions(-)
diff --git a/include/linux/compat.h b/include/linux/compat.h index de0c13bdcd2c..189d0e111d57 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -445,8 +445,6 @@ struct compat_kexec_segment; struct compat_mq_attr; struct compat_msgbuf;
-extern void compat_exit_robust_list(struct task_struct *curr); - #define BITS_PER_COMPAT_LONG (8*sizeof(compat_long_t))
#define BITS_TO_COMPAT_LONGS(bits) DIV_ROUND_UP(bits, BITS_PER_COMPAT_LONG) diff --git a/include/linux/futex.h b/include/linux/futex.h index a61bf436dcf3..82b8a5cd4018 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -2,7 +2,9 @@ #ifndef _LINUX_FUTEX_H #define _LINUX_FUTEX_H
+#include <linux/sched.h> #include <linux/ktime.h> + #include <uapi/linux/futex.h>
struct inode; @@ -51,15 +53,24 @@ union futex_key { #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
#ifdef CONFIG_FUTEX -extern void exit_robust_list(struct task_struct *curr);
-long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, - u32 __user *uaddr2, u32 val2, u32 val3); -#else -static inline void exit_robust_list(struct task_struct *curr) +static inline void futex_init_task(struct task_struct *tsk) { + tsk->robust_list = NULL; +#ifdef CONFIG_COMPAT + tsk->compat_robust_list = NULL; +#endif + INIT_LIST_HEAD(&tsk->pi_state_list); + tsk->pi_state_cache = NULL; }
+void futex_mm_release(struct task_struct *tsk); + +long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, + u32 __user *uaddr2, u32 val2, u32 val3); +#else +static inline void futex_init_task(struct task_struct *tsk) { } +static inline void futex_mm_release(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) @@ -68,12 +79,4 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val, } #endif
-#ifdef CONFIG_FUTEX_PI -extern void exit_pi_state_list(struct task_struct *curr); -#else -static inline void exit_pi_state_list(struct task_struct *curr) -{ -} -#endif - #endif diff --git a/kernel/fork.c b/kernel/fork.c index 61496b70cfb8..09fe70b5e802 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1287,20 +1287,7 @@ static int wait_for_vfork_done(struct task_struct *child, void mm_release(struct task_struct *tsk, struct mm_struct *mm) { /* Get rid of any futexes when releasing the mm */ -#ifdef CONFIG_FUTEX - if (unlikely(tsk->robust_list)) { - exit_robust_list(tsk); - tsk->robust_list = NULL; - } -#ifdef CONFIG_COMPAT - if (unlikely(tsk->compat_robust_list)) { - compat_exit_robust_list(tsk); - tsk->compat_robust_list = NULL; - } -#endif - if (unlikely(!list_empty(&tsk->pi_state_list))) - exit_pi_state_list(tsk); -#endif + futex_mm_release(tsk);
uprobe_free_utask(tsk);
@@ -2007,14 +1994,8 @@ static __latent_entropy struct task_struct *copy_process( #ifdef CONFIG_BLOCK p->plug = NULL; #endif -#ifdef CONFIG_FUTEX - p->robust_list = NULL; -#ifdef CONFIG_COMPAT - p->compat_robust_list = NULL; -#endif - INIT_LIST_HEAD(&p->pi_state_list); - p->pi_state_cache = NULL; -#endif + futex_init_task(p); + /* * sigaltstack should be cleared when sharing the same VM */ diff --git a/kernel/futex.c b/kernel/futex.c index 8b67e8a7a5e9..434c98553651 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -341,6 +341,12 @@ static inline bool should_fail_futex(bool fshared) } #endif /* CONFIG_FAIL_FUTEX */
+#ifdef CONFIG_COMPAT +static void compat_exit_robust_list(struct task_struct *curr); +#else +static inline void compat_exit_robust_list(struct task_struct *curr) { } +#endif + static inline void futex_get_mm(union futex_key *key) { mmgrab(key->private.mm); @@ -895,7 +901,7 @@ static void put_pi_state(struct futex_pi_state *pi_state) * Kernel cleans up PI-state, but userspace is likely hosed. * (Robust-futex cleanup is separate and might save the day for userspace.) */ -void exit_pi_state_list(struct task_struct *curr) +static void exit_pi_state_list(struct task_struct *curr) { struct list_head *next, *head = &curr->pi_state_list; struct futex_pi_state *pi_state; @@ -965,7 +971,8 @@ void exit_pi_state_list(struct task_struct *curr) } raw_spin_unlock_irq(&curr->pi_lock); } - +#else +static inline void exit_pi_state_list(struct task_struct *curr) { } #endif
/* @@ -3630,7 +3637,7 @@ static inline int fetch_robust_entry(struct robust_list __user **entry, * * We silently return on any sign of list-walking problem. */ -void exit_robust_list(struct task_struct *curr) +static void exit_robust_list(struct task_struct *curr) { struct robust_list_head __user *head = curr->robust_list; struct robust_list __user *entry, *next_entry, *pending; @@ -3695,6 +3702,24 @@ void exit_robust_list(struct task_struct *curr) } }
+void futex_mm_release(struct task_struct *tsk) +{ + if (unlikely(tsk->robust_list)) { + exit_robust_list(tsk); + tsk->robust_list = NULL; + } + +#ifdef CONFIG_COMPAT + if (unlikely(tsk->compat_robust_list)) { + compat_exit_robust_list(tsk); + tsk->compat_robust_list = NULL; + } +#endif + + if (unlikely(!list_empty(&tsk->pi_state_list))) + exit_pi_state_list(tsk); +} + long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) { @@ -3822,7 +3847,7 @@ static void __user *futex_uaddr(struct robust_list __user *entry, * * We silently return on any sign of list-walking problem. */ -void compat_exit_robust_list(struct task_struct *curr) +static void compat_exit_robust_list(struct task_struct *curr) { struct compat_robust_list_head __user *head = curr->compat_robust_list; struct robust_list __user *entry, *next_entry, *pending;
From: Thomas Gleixner tglx@linutronix.de
commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream
The futex exit handling relies on PF_ flags. That's suboptimal as it requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in the middle of do_exit() to enforce the observability of PF_EXITING in the futex code.
Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic over to the new state. The PF_EXITING dependency will be cleaned up in a later step.
This prepares for handling various futex exit issues later.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/futex.h | 33 +++++++++++++++++++++++++++++++++ include/linux/sched.h | 2 +- kernel/exit.c | 18 ++---------------- kernel/futex.c | 25 +++++++++++++------------ 4 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index 82b8a5cd4018..84e945116126 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -53,6 +53,10 @@ union futex_key { #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
#ifdef CONFIG_FUTEX +enum { + FUTEX_STATE_OK, + FUTEX_STATE_DEAD, +};
static inline void futex_init_task(struct task_struct *tsk) { @@ -62,6 +66,34 @@ static inline void futex_init_task(struct task_struct *tsk) #endif INIT_LIST_HEAD(&tsk->pi_state_list); tsk->pi_state_cache = NULL; + tsk->futex_state = FUTEX_STATE_OK; +} + +/** + * futex_exit_done - Sets the tasks futex state to FUTEX_STATE_DEAD + * @tsk: task to set the state on + * + * Set the futex exit state of the task lockless. The futex waiter code + * observes that state when a task is exiting and loops until the task has + * actually finished the futex cleanup. The worst case for this is that the + * waiter runs through the wait loop until the state becomes visible. + * + * This has two callers: + * + * - futex_mm_release() after the futex exit cleanup has been done + * + * - do_exit() from the recursive fault handling path. + * + * In case of a recursive fault this is best effort. Either the futex exit + * code has run already or not. If the OWNER_DIED bit has been set on the + * futex then the waiter can take it over. If not, the problem is pushed + * back to user space. If the futex exit code did not run yet, then an + * already queued waiter might block forever, but there is nothing which + * can be done about that. + */ +static inline void futex_exit_done(struct task_struct *tsk) +{ + tsk->futex_state = FUTEX_STATE_DEAD; }
void futex_mm_release(struct task_struct *tsk); @@ -71,6 +103,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, #else static inline void futex_init_task(struct task_struct *tsk) { } static inline void futex_mm_release(struct task_struct *tsk) { } +static inline void futex_exit_done(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) diff --git a/include/linux/sched.h b/include/linux/sched.h index ceb584190bb8..dbd78dff4ac6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1001,6 +1001,7 @@ struct task_struct { #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; + unsigned int futex_state; #endif #ifdef CONFIG_PERF_EVENTS struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts]; @@ -1391,7 +1392,6 @@ extern struct pid *cad_pid; */ #define PF_IDLE 0x00000002 /* I am an IDLE thread */ #define PF_EXITING 0x00000004 /* Getting shut down */ -#define PF_EXITPIDONE 0x00000008 /* PI exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ #define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */ #define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */ diff --git a/kernel/exit.c b/kernel/exit.c index 73519db2542a..370ca4105546 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -809,16 +809,7 @@ void __noreturn do_exit(long code) */ if (unlikely(tsk->flags & PF_EXITING)) { pr_alert("Fixing recursive fault but reboot is needed!\n"); - /* - * We can do this unlocked here. The futex code uses - * this flag just to verify whether the pi state - * cleanup has been done or not. In the worst case it - * loops once more. We pretend that the cleanup was - * done as there is no way to return. Either the - * OWNER_DIED bit is set by now or we push the blocked - * task into the wait for ever nirwana as well. - */ - tsk->flags |= PF_EXITPIDONE; + futex_exit_done(tsk); set_current_state(TASK_UNINTERRUPTIBLE); schedule(); } @@ -909,12 +900,7 @@ void __noreturn do_exit(long code) * Make sure we are holding no locks: */ debug_check_no_locks_held(); - /* - * We can do this unlocked here. The futex code uses this flag - * just to verify whether the pi state cleanup has been done - * or not. In the worst case it loops once more. - */ - tsk->flags |= PF_EXITPIDONE; + futex_exit_done(tsk);
if (tsk->io_context) exit_io_context(tsk); diff --git a/kernel/futex.c b/kernel/futex.c index 434c98553651..9a2fea087b87 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1187,9 +1187,10 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, u32 uval2;
/* - * If PF_EXITPIDONE is not yet set, then try again. + * If the futex exit state is not yet FUTEX_STATE_DEAD, wait + * for it to finish. */ - if (tsk && !(tsk->flags & PF_EXITPIDONE)) + if (tsk && tsk->futex_state != FUTEX_STATE_DEAD) return -EAGAIN;
/* @@ -1208,8 +1209,9 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, * *uaddr = 0xC0000000; tsk = get_task(PID); * } if (!tsk->flags & PF_EXITING) { * ... attach(); - * tsk->flags |= PF_EXITPIDONE; } else { - * if (!(tsk->flags & PF_EXITPIDONE)) + * tsk->futex_state = } else { + * FUTEX_STATE_DEAD; if (tsk->futex_state != + * FUTEX_STATE_DEAD) * return -EAGAIN; * return -ESRCH; <--- FAIL * } @@ -1265,17 +1267,16 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, }
/* - * We need to look at the task state flags to figure out, - * whether the task is exiting. To protect against the do_exit - * change of the task flags, we do this protected by - * p->pi_lock: + * We need to look at the task state to figure out, whether the + * task is exiting. To protect against the change of the task state + * in futex_exit_release(), we do this protected by p->pi_lock: */ raw_spin_lock_irq(&p->pi_lock); - if (unlikely(p->flags & PF_EXITING)) { + if (unlikely(p->futex_state != FUTEX_STATE_OK)) { /* - * The task is on the way out. When PF_EXITPIDONE is - * set, we know that the task has finished the - * cleanup: + * The task is on the way out. When the futex state is + * FUTEX_STATE_DEAD, we know that the task has finished + * the cleanup: */ int ret = handle_exit_race(uaddr, uval, p);
From: Thomas Gleixner tglx@linutronix.de
commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream
mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm().
In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched.
As the futex exit code needs further protections against exit races, this needs to be split into two functions.
Preparatory only, no functional change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/exec.c | 2 +- include/linux/sched/mm.h | 6 ++++-- kernel/exit.c | 2 +- kernel/fork.c | 12 +++++++++++- 4 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index 713e4ccec6e9..6fd4fe856281 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1011,7 +1011,7 @@ static int exec_mmap(struct mm_struct *mm) /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; - mm_release(tsk, old_mm); + exec_mm_release(tsk, old_mm);
if (old_mm) { sync_mm_rss(old_mm); diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 1f7224d3483c..f41c0aff0223 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -119,8 +119,10 @@ extern struct mm_struct *get_task_mm(struct task_struct *task); * succeeds. */ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode); -/* Remove the current tasks stale references to the old mm_struct */ -extern void mm_release(struct task_struct *, struct mm_struct *); +/* Remove the current tasks stale references to the old mm_struct on exit() */ +extern void exit_mm_release(struct task_struct *, struct mm_struct *); +/* Remove the current tasks stale references to the old mm_struct on exec() */ +extern void exec_mm_release(struct task_struct *, struct mm_struct *);
#ifdef CONFIG_MM_OWNER extern void mm_update_next_owner(struct mm_struct *mm); diff --git a/kernel/exit.c b/kernel/exit.c index 370ca4105546..8ed8c5e6205e 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -489,7 +489,7 @@ static void exit_mm(void) struct mm_struct *mm = current->mm; struct core_state *core_state;
- mm_release(current, mm); + exit_mm_release(current, mm); if (!mm) return; sync_mm_rss(mm); diff --git a/kernel/fork.c b/kernel/fork.c index 09fe70b5e802..5f79eaf6f188 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1284,7 +1284,7 @@ static int wait_for_vfork_done(struct task_struct *child, * restoring the old one. . . * Eric Biederman 10 January 1998 */ -void mm_release(struct task_struct *tsk, struct mm_struct *mm) +static void mm_release(struct task_struct *tsk, struct mm_struct *mm) { /* Get rid of any futexes when releasing the mm */ futex_mm_release(tsk); @@ -1321,6 +1321,16 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) complete_vfork_done(tsk); }
+void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm) +{ + mm_release(tsk, mm); +} + +void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm) +{ + mm_release(tsk, mm); +} + /* * Allocate a new mm structure and copy contents from the * mm structure of the passed in task structure.
From: Thomas Gleixner tglx@linutronix.de
commit 150d71584b12809144b8145b817e83b81158ae5f upstream
To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites.
Preparatory only, no functional change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/futex.h | 6 ++++-- kernel/fork.c | 5 ++--- kernel/futex.c | 7 ++++++- 3 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index 84e945116126..d141df679863 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -96,14 +96,16 @@ static inline void futex_exit_done(struct task_struct *tsk) tsk->futex_state = FUTEX_STATE_DEAD; }
-void futex_mm_release(struct task_struct *tsk); +void futex_exit_release(struct task_struct *tsk); +void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3); #else static inline void futex_init_task(struct task_struct *tsk) { } -static inline void futex_mm_release(struct task_struct *tsk) { } static inline void futex_exit_done(struct task_struct *tsk) { } +static inline void futex_exit_release(struct task_struct *tsk) { } +static inline void futex_exec_release(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) diff --git a/kernel/fork.c b/kernel/fork.c index 5f79eaf6f188..da79ba9c83ac 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1286,9 +1286,6 @@ static int wait_for_vfork_done(struct task_struct *child, */ static void mm_release(struct task_struct *tsk, struct mm_struct *mm) { - /* Get rid of any futexes when releasing the mm */ - futex_mm_release(tsk); - uprobe_free_utask(tsk);
/* Get rid of any cached register state */ @@ -1323,11 +1320,13 @@ static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm) { + futex_exit_release(tsk); mm_release(tsk, mm); }
void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm) { + futex_exec_release(tsk); mm_release(tsk, mm); }
diff --git a/kernel/futex.c b/kernel/futex.c index 9a2fea087b87..908d557283c9 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3703,7 +3703,7 @@ static void exit_robust_list(struct task_struct *curr) } }
-void futex_mm_release(struct task_struct *tsk) +void futex_exec_release(struct task_struct *tsk) { if (unlikely(tsk->robust_list)) { exit_robust_list(tsk); @@ -3721,6 +3721,11 @@ void futex_mm_release(struct task_struct *tsk) exit_pi_state_list(tsk); }
+void futex_exit_release(struct task_struct *tsk) +{ + futex_exec_release(tsk); +} + long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) {
From: Thomas Gleixner tglx@linutronix.de
commit f24f22435dcc11389acc87e5586239c1819d217c upstream
Setting task::futex_state in do_exit() is rather arbitrarily placed for no reason. Move it into the futex code.
Note, this is only done for the exit cleanup as the exec cleanup cannot set the state to FUTEX_STATE_DEAD because the task struct is still in active use.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/exit.c | 1 - kernel/futex.c | 1 + 2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index 8ed8c5e6205e..1fa0fa91704a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -900,7 +900,6 @@ void __noreturn do_exit(long code) * Make sure we are holding no locks: */ debug_check_no_locks_held(); - futex_exit_done(tsk);
if (tsk->io_context) exit_io_context(tsk); diff --git a/kernel/futex.c b/kernel/futex.c index 908d557283c9..aa8d2d4eabc8 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3724,6 +3724,7 @@ void futex_exec_release(struct task_struct *tsk) void futex_exit_release(struct task_struct *tsk) { futex_exec_release(tsk); + futex_exit_done(tsk); }
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
From: Thomas Gleixner tglx@linutronix.de
commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream
Instead of relying on PF_EXITING use an explicit state for the futex exit and set it in the futex exit function. This moves the smp barrier and the lock/unlock serialization into the futex code.
As with the DEAD state this is restricted to the exit path as exec continues to use the same task struct.
This allows to simplify that logic in a next step.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/futex.h | 31 +++---------------------------- kernel/exit.c | 13 +------------ kernel/futex.c | 37 ++++++++++++++++++++++++++++++++++++- 3 files changed, 40 insertions(+), 41 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index d141df679863..1db16694ea16 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -55,6 +55,7 @@ union futex_key { #ifdef CONFIG_FUTEX enum { FUTEX_STATE_OK, + FUTEX_STATE_EXITING, FUTEX_STATE_DEAD, };
@@ -69,33 +70,7 @@ static inline void futex_init_task(struct task_struct *tsk) tsk->futex_state = FUTEX_STATE_OK; }
-/** - * futex_exit_done - Sets the tasks futex state to FUTEX_STATE_DEAD - * @tsk: task to set the state on - * - * Set the futex exit state of the task lockless. The futex waiter code - * observes that state when a task is exiting and loops until the task has - * actually finished the futex cleanup. The worst case for this is that the - * waiter runs through the wait loop until the state becomes visible. - * - * This has two callers: - * - * - futex_mm_release() after the futex exit cleanup has been done - * - * - do_exit() from the recursive fault handling path. - * - * In case of a recursive fault this is best effort. Either the futex exit - * code has run already or not. If the OWNER_DIED bit has been set on the - * futex then the waiter can take it over. If not, the problem is pushed - * back to user space. If the futex exit code did not run yet, then an - * already queued waiter might block forever, but there is nothing which - * can be done about that. - */ -static inline void futex_exit_done(struct task_struct *tsk) -{ - tsk->futex_state = FUTEX_STATE_DEAD; -} - +void futex_exit_recursive(struct task_struct *tsk); void futex_exit_release(struct task_struct *tsk); void futex_exec_release(struct task_struct *tsk);
@@ -103,7 +78,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3); #else static inline void futex_init_task(struct task_struct *tsk) { } -static inline void futex_exit_done(struct task_struct *tsk) { } +static inline void futex_exit_recursive(struct task_struct *tsk) { } static inline void futex_exit_release(struct task_struct *tsk) { } static inline void futex_exec_release(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, diff --git a/kernel/exit.c b/kernel/exit.c index 1fa0fa91704a..62cf0982d94b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -809,23 +809,12 @@ void __noreturn do_exit(long code) */ if (unlikely(tsk->flags & PF_EXITING)) { pr_alert("Fixing recursive fault but reboot is needed!\n"); - futex_exit_done(tsk); + futex_exit_recursive(tsk); set_current_state(TASK_UNINTERRUPTIBLE); schedule(); }
exit_signals(tsk); /* sets PF_EXITING */ - /* - * Ensure that all new tsk->pi_lock acquisitions must observe - * PF_EXITING. Serializes against futex.c:attach_to_pi_owner(). - */ - smp_mb(); - /* - * Ensure that we must observe the pi_state in exit_mm() -> - * mm_release() -> exit_pi_state_list(). - */ - raw_spin_lock_irq(&tsk->pi_lock); - raw_spin_unlock_irq(&tsk->pi_lock);
/* sync mm's RSS info before statistics gathering */ if (tsk->mm) diff --git a/kernel/futex.c b/kernel/futex.c index aa8d2d4eabc8..0c2944229ad5 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3721,10 +3721,45 @@ void futex_exec_release(struct task_struct *tsk) exit_pi_state_list(tsk); }
+/** + * futex_exit_recursive - Set the tasks futex state to FUTEX_STATE_DEAD + * @tsk: task to set the state on + * + * Set the futex exit state of the task lockless. The futex waiter code + * observes that state when a task is exiting and loops until the task has + * actually finished the futex cleanup. The worst case for this is that the + * waiter runs through the wait loop until the state becomes visible. + * + * This is called from the recursive fault handling path in do_exit(). + * + * This is best effort. Either the futex exit code has run already or + * not. If the OWNER_DIED bit has been set on the futex then the waiter can + * take it over. If not, the problem is pushed back to user space. If the + * futex exit code did not run yet, then an already queued waiter might + * block forever, but there is nothing which can be done about that. + */ +void futex_exit_recursive(struct task_struct *tsk) +{ + tsk->futex_state = FUTEX_STATE_DEAD; +} + void futex_exit_release(struct task_struct *tsk) { + tsk->futex_state = FUTEX_STATE_EXITING; + /* + * Ensure that all new tsk->pi_lock acquisitions must observe + * FUTEX_STATE_EXITING. Serializes against attach_to_pi_owner(). + */ + smp_mb(); + /* + * Ensure that we must observe the pi_state in exit_pi_state_list(). + */ + raw_spin_lock_irq(&tsk->pi_lock); + raw_spin_unlock_irq(&tsk->pi_lock); + futex_exec_release(tsk); - futex_exit_done(tsk); + + tsk->futex_state = FUTEX_STATE_DEAD; }
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
From: Thomas Gleixner tglx@linutronix.de
commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream
Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move the state setting into to the lock section.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 0c2944229ad5..5d16e9bb22fd 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3745,16 +3745,19 @@ void futex_exit_recursive(struct task_struct *tsk)
void futex_exit_release(struct task_struct *tsk) { - tsk->futex_state = FUTEX_STATE_EXITING; - /* - * Ensure that all new tsk->pi_lock acquisitions must observe - * FUTEX_STATE_EXITING. Serializes against attach_to_pi_owner(). - */ - smp_mb(); /* - * Ensure that we must observe the pi_state in exit_pi_state_list(). + * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. + * + * This ensures that all subsequent checks of tsk->futex_state in + * attach_to_pi_owner() must observe FUTEX_STATE_EXITING with + * tsk->pi_lock held. + * + * It guarantees also that a pi_state which was queued right before + * the state change under tsk->pi_lock by a concurrent waiter must + * be observed in exit_pi_state_list(). */ raw_spin_lock_irq(&tsk->pi_lock); + tsk->futex_state = FUTEX_STATE_EXITING; raw_spin_unlock_irq(&tsk->pi_lock);
futex_exec_release(tsk);
From: Thomas Gleixner tglx@linutronix.de
commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream
exec() attempts to handle potentially held futexes gracefully by running the futex exit handling code like exit() does.
The current implementation has no protection against concurrent incoming waiters. The reason is that the futex state cannot be set to FUTEX_STATE_DEAD after the cleanup because the task struct is still active and just about to execute the new binary.
While its arguably buggy when a task holds a futex over exec(), for consistency sake the state handling can at least cover the actual futex exit cleanup section. This provides state consistency protection accross the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the cleanup has been finished, this cannot prevent subsequent attempts to attach to the task in case that the cleanup was not successfull in mopping up all leftovers.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 38 ++++++++++++++++++++++++++++++++++---- 1 file changed, 34 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 5d16e9bb22fd..9f30c9228032 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3703,7 +3703,7 @@ static void exit_robust_list(struct task_struct *curr) } }
-void futex_exec_release(struct task_struct *tsk) +static void futex_cleanup(struct task_struct *tsk) { if (unlikely(tsk->robust_list)) { exit_robust_list(tsk); @@ -3743,7 +3743,7 @@ void futex_exit_recursive(struct task_struct *tsk) tsk->futex_state = FUTEX_STATE_DEAD; }
-void futex_exit_release(struct task_struct *tsk) +static void futex_cleanup_begin(struct task_struct *tsk) { /* * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. @@ -3759,10 +3759,40 @@ void futex_exit_release(struct task_struct *tsk) raw_spin_lock_irq(&tsk->pi_lock); tsk->futex_state = FUTEX_STATE_EXITING; raw_spin_unlock_irq(&tsk->pi_lock); +}
- futex_exec_release(tsk); +static void futex_cleanup_end(struct task_struct *tsk, int state) +{ + /* + * Lockless store. The only side effect is that an observer might + * take another loop until it becomes visible. + */ + tsk->futex_state = state; +}
- tsk->futex_state = FUTEX_STATE_DEAD; +void futex_exec_release(struct task_struct *tsk) +{ + /* + * The state handling is done for consistency, but in the case of + * exec() there is no way to prevent futher damage as the PID stays + * the same. But for the unlikely and arguably buggy case that a + * futex is held on exec(), this provides at least as much state + * consistency protection which is possible. + */ + futex_cleanup_begin(tsk); + futex_cleanup(tsk); + /* + * Reset the state to FUTEX_STATE_OK. The task is alive and about + * exec a new binary. + */ + futex_cleanup_end(tsk, FUTEX_STATE_OK); +} + +void futex_exit_release(struct task_struct *tsk) +{ + futex_cleanup_begin(tsk); + futex_cleanup(tsk); + futex_cleanup_end(tsk, FUTEX_STATE_DEAD); }
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
From: Thomas Gleixner tglx@linutronix.de
commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream
The mutex will be used in subsequent changes to replace the busy looping of a waiter when the futex owner is currently executing the exit cleanup to prevent a potential live lock.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/futex.h | 1 + include/linux/sched.h | 1 + kernel/futex.c | 16 ++++++++++++++++ 3 files changed, 18 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h index 1db16694ea16..b70df27d7e85 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -68,6 +68,7 @@ static inline void futex_init_task(struct task_struct *tsk) INIT_LIST_HEAD(&tsk->pi_state_list); tsk->pi_state_cache = NULL; tsk->futex_state = FUTEX_STATE_OK; + mutex_init(&tsk->futex_exit_mutex); }
void futex_exit_recursive(struct task_struct *tsk); diff --git a/include/linux/sched.h b/include/linux/sched.h index dbd78dff4ac6..62cb3d2ea186 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1001,6 +1001,7 @@ struct task_struct { #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; + struct mutex futex_exit_mutex; unsigned int futex_state; #endif #ifdef CONFIG_PERF_EVENTS diff --git a/kernel/futex.c b/kernel/futex.c index 9f30c9228032..ec7d13989f4c 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3740,11 +3740,22 @@ static void futex_cleanup(struct task_struct *tsk) */ void futex_exit_recursive(struct task_struct *tsk) { + /* If the state is FUTEX_STATE_EXITING then futex_exit_mutex is held */ + if (tsk->futex_state == FUTEX_STATE_EXITING) + mutex_unlock(&tsk->futex_exit_mutex); tsk->futex_state = FUTEX_STATE_DEAD; }
static void futex_cleanup_begin(struct task_struct *tsk) { + /* + * Prevent various race issues against a concurrent incoming waiter + * including live locks by forcing the waiter to block on + * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in + * attach_to_pi_owner(). + */ + mutex_lock(&tsk->futex_exit_mutex); + /* * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. * @@ -3768,6 +3779,11 @@ static void futex_cleanup_end(struct task_struct *tsk, int state) * take another loop until it becomes visible. */ tsk->futex_state = state; + /* + * Drop the exit protection. This unblocks waiters which observed + * FUTEX_STATE_EXITING to reevaluate the state. + */ + mutex_unlock(&tsk->futex_exit_mutex); }
void futex_exec_release(struct task_struct *tsk)
From: Thomas Gleixner tglx@linutronix.de
commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream
attach_to_pi_owner() returns -EAGAIN for various cases:
- Owner task is exiting - Futex value has changed
The caller drops the held locks (hash bucket, mmap_sem) and retries the operation. In case of the owner task exiting this can result in a live lock.
As a preparatory step for seperating those cases, provide a distinct return value (EBUSY) for the owner exiting case.
No functional change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index ec7d13989f4c..bfe3194d7a34 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1187,11 +1187,11 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, u32 uval2;
/* - * If the futex exit state is not yet FUTEX_STATE_DEAD, wait - * for it to finish. + * If the futex exit state is not yet FUTEX_STATE_DEAD, tell the + * caller that the alleged owner is busy. */ if (tsk && tsk->futex_state != FUTEX_STATE_DEAD) - return -EAGAIN; + return -EBUSY;
/* * Reread the user space value to handle the following situation: @@ -2100,12 +2100,13 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, if (!ret) goto retry; goto out; + case -EBUSY: case -EAGAIN: /* * Two reasons for this: - * - Owner is exiting and we just wait for the + * - EBUSY: Owner is exiting and we just wait for the * exit to complete. - * - The user space value changed. + * - EAGAIN: The user space value changed. */ double_unlock_hb(hb1, hb2); hb_waiters_dec(hb2); @@ -2878,12 +2879,13 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, goto out_unlock_put_key; case -EFAULT: goto uaddr_faulted; + case -EBUSY: case -EAGAIN: /* * Two reasons for this: - * - Task is exiting and we just wait for the + * - EBUSY: Task is exiting and we just wait for the * exit to complete. - * - The user space value changed. + * - EAGAIN: The user space value changed. */ queue_unlock(hb); put_futex_key(&q.key);
From: Thomas Gleixner tglx@linutronix.de
commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream
Oleg provided the following test case:
int main(void) { struct sched_param sp = {};
sp.sched_priority = 2; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
int lock = vfork(); if (!lock) { sp.sched_priority = 1; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); _exit(0); }
syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0); return 0; }
This creates an unkillable RT process spinning in futex_lock_pi() on a UP machine or if the process is affine to a single CPU. The reason is:
parent child
set FIFO prio 2
vfork() -> set FIFO prio 1 implies wait_for_child() sched_setscheduler(...) exit() do_exit() .... mm_release() tsk->futex_state = FUTEX_STATE_EXITING; exit_futex(); (NOOP in this case) complete() --> wakes parent sys_futex() loop infinite because tsk->futex_state == FUTEX_STATE_EXITING
The same problem can happen just by regular preemption as well:
task holds futex ... do_exit() tsk->futex_state = FUTEX_STATE_EXITING;
--> preemption (unrelated wakeup of some other higher prio task, e.g. timer)
switch_to(other_task)
return to user sys_futex() loop infinite as above
Just for the fun of it the futex exit cleanup could trigger the wakeup itself before the task sets its futex state to DEAD.
To cure this, the handling of the exiting owner is changed so:
- A refcount is held on the task
- The task pointer is stored in a caller visible location
- The caller drops all locks (hash bucket, mmap_sem) and blocks on task::futex_exit_mutex. When the mutex is acquired then the exiting task has completed the cleanup and the state is consistent and can be reevaluated.
This is not a pretty solution, but there is no choice other than returning an error code to user space, which would break the state consistency guarantee and open another can of problems including regressions.
For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538 are required as well, but for anything older than 5.3.y the backports are going to be provided when this hits mainline as the other dependencies for those kernels are definitely not stable material.
Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") Reported-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Stable Team stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 106 ++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 91 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index bfe3194d7a34..38911f21b3ff 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1181,6 +1181,36 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval, return ret; }
+/** + * wait_for_owner_exiting - Block until the owner has exited + * @exiting: Pointer to the exiting task + * + * Caller must hold a refcount on @exiting. + */ +static void wait_for_owner_exiting(int ret, struct task_struct *exiting) +{ + if (ret != -EBUSY) { + WARN_ON_ONCE(exiting); + return; + } + + if (WARN_ON_ONCE(ret == -EBUSY && !exiting)) + return; + + mutex_lock(&exiting->futex_exit_mutex); + /* + * No point in doing state checking here. If the waiter got here + * while the task was in exec()->exec_futex_release() then it can + * have any FUTEX_STATE_* value when the waiter has acquired the + * mutex. OK, if running, EXITING or DEAD if it reached exit() + * already. Highly unlikely and not a problem. Just one more round + * through the futex maze. + */ + mutex_unlock(&exiting->futex_exit_mutex); + + put_task_struct(exiting); +} + static int handle_exit_race(u32 __user *uaddr, u32 uval, struct task_struct *tsk) { @@ -1242,7 +1272,8 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, * it after doing proper sanity checks. */ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, - struct futex_pi_state **ps) + struct futex_pi_state **ps, + struct task_struct **exiting) { pid_t pid = uval & FUTEX_TID_MASK; struct futex_pi_state *pi_state; @@ -1281,7 +1312,19 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock); - put_task_struct(p); + /* + * If the owner task is between FUTEX_STATE_EXITING and + * FUTEX_STATE_DEAD then store the task pointer and keep + * the reference on the task struct. The calling code will + * drop all locks, wait for the task to reach + * FUTEX_STATE_DEAD and then drop the refcount. This is + * required to prevent a live lock when the current task + * preempted the exiting task between the two states. + */ + if (ret == -EBUSY) + *exiting = p; + else + put_task_struct(p); return ret; }
@@ -1320,7 +1363,8 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
static int lookup_pi_state(u32 __user *uaddr, u32 uval, struct futex_hash_bucket *hb, - union futex_key *key, struct futex_pi_state **ps) + union futex_key *key, struct futex_pi_state **ps, + struct task_struct **exiting) { struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1335,7 +1379,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval, * We are the first waiter - try to look up the owner based on * @uval and attach to it. */ - return attach_to_pi_owner(uaddr, uval, key, ps); + return attach_to_pi_owner(uaddr, uval, key, ps, exiting); }
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) @@ -1363,6 +1407,8 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) * lookup * @task: the task to perform the atomic lock work for. This will * be "current" except in the case of requeue pi. + * @exiting: Pointer to store the task pointer of the owner task + * which is in the middle of exiting * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) * * Return: @@ -1371,11 +1417,17 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) * - <0 - error * * The hb->lock and futex_key refs shall be held by the caller. + * + * @exiting is only set when the return value is -EBUSY. If so, this holds + * a refcount on the exiting task on return and the caller needs to drop it + * after waiting for the exit to complete. */ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, union futex_key *key, struct futex_pi_state **ps, - struct task_struct *task, int set_waiters) + struct task_struct *task, + struct task_struct **exiting, + int set_waiters) { u32 uval, newval, vpid = task_pid_vnr(task); struct futex_q *top_waiter; @@ -1445,7 +1497,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, * attach to the owner. If that fails, no harm done, we only * set the FUTEX_WAITERS bit in the user space variable. */ - return attach_to_pi_owner(uaddr, newval, key, ps); + return attach_to_pi_owner(uaddr, newval, key, ps, exiting); }
/** @@ -1866,6 +1918,8 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key, * @key1: the from futex key * @key2: the to futex key * @ps: address to store the pi_state pointer + * @exiting: Pointer to store the task pointer of the owner task + * which is in the middle of exiting * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) * * Try and get the lock on behalf of the top waiter if we can do it atomically. @@ -1873,16 +1927,20 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key, * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit. * hb1 and hb2 must be held by the caller. * + * @exiting is only set when the return value is -EBUSY. If so, this holds + * a refcount on the exiting task on return and the caller needs to drop it + * after waiting for the exit to complete. + * * Return: * - 0 - failed to acquire the lock atomically; * - >0 - acquired the lock, return value is vpid of the top_waiter * - <0 - error */ -static int futex_proxy_trylock_atomic(u32 __user *pifutex, - struct futex_hash_bucket *hb1, - struct futex_hash_bucket *hb2, - union futex_key *key1, union futex_key *key2, - struct futex_pi_state **ps, int set_waiters) +static int +futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1, + struct futex_hash_bucket *hb2, union futex_key *key1, + union futex_key *key2, struct futex_pi_state **ps, + struct task_struct **exiting, int set_waiters) { struct futex_q *top_waiter = NULL; u32 curval; @@ -1919,7 +1977,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex, */ vpid = task_pid_vnr(top_waiter->task); ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task, - set_waiters); + exiting, set_waiters); if (ret == 1) { requeue_pi_wake_futex(top_waiter, key2, hb2); return vpid; @@ -2048,6 +2106,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, }
if (requeue_pi && (task_count - nr_wake < nr_requeue)) { + struct task_struct *exiting = NULL; + /* * Attempt to acquire uaddr2 and wake the top waiter. If we * intend to requeue waiters, force setting the FUTEX_WAITERS @@ -2055,7 +2115,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, * faults rather in the requeue loop below. */ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1, - &key2, &pi_state, nr_requeue); + &key2, &pi_state, + &exiting, nr_requeue);
/* * At this point the top_waiter has either taken uaddr2 or is @@ -2082,7 +2143,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, * If that call succeeds then we have pi_state and an * initial refcount on it. */ - ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state); + ret = lookup_pi_state(uaddr2, ret, hb2, &key2, + &pi_state, &exiting); }
switch (ret) { @@ -2112,6 +2174,12 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, hb_waiters_dec(hb2); put_futex_key(&key2); put_futex_key(&key1); + /* + * Handle the case where the owner is in the middle of + * exiting. Wait for the exit to complete otherwise + * this task might loop forever, aka. live lock. + */ + wait_for_owner_exiting(ret, exiting); cond_resched(); goto retry; default: @@ -2839,6 +2907,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, { struct hrtimer_sleeper timeout, *to = NULL; struct futex_pi_state *pi_state = NULL; + struct task_struct *exiting = NULL; struct rt_mutex_waiter rt_waiter; struct futex_hash_bucket *hb; struct futex_q q = futex_q_init; @@ -2866,7 +2935,8 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, retry_private: hb = queue_lock(&q);
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0); + ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, + &exiting, 0); if (unlikely(ret)) { /* * Atomic work succeeded and we got the lock, @@ -2889,6 +2959,12 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, */ queue_unlock(hb); put_futex_key(&q.key); + /* + * Handle the case where the owner is in the middle of + * exiting. Wait for the exit to complete otherwise + * this task might loop forever, aka. live lock. + */ + wait_for_owner_exiting(ret, exiting); cond_resched(); goto retry; default:
From: Thomas Gleixner tglx@linutronix.de
commit 12bb3f7f1b03d5913b3f9d4236a488aa7774dfe9 upstream
In case that futex_lock_pi() was aborted by a signal or a timeout and the task returned without acquiring the rtmutex, but is the designated owner of the futex due to a concurrent futex_unlock_pi() fixup_owner() is invoked to establish consistent state. In that case it invokes fixup_pi_state_owner() which in turn tries to acquire the rtmutex again. If that succeeds then it does not propagate this success to fixup_owner() and futex_lock_pi() returns -EINTR or -ETIMEOUT despite having the futex locked.
Return success from fixup_pi_state_owner() in all cases where the current task owns the rtmutex and therefore the futex and propagate it correctly through fixup_owner(). Fixup the other callsite which does not expect a positive return value.
Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 38911f21b3ff..b2f33b221def 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2494,8 +2494,8 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, }
if (__rt_mutex_futex_trylock(&pi_state->pi_mutex)) { - /* We got the lock after all, nothing to fix. */ - ret = 0; + /* We got the lock. pi_state is correct. Tell caller. */ + ret = 1; goto out_unlock; }
@@ -2523,7 +2523,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * We raced against a concurrent self; things are * already fixed up. Nothing to do. */ - ret = 0; + ret = 1; goto out_unlock; } newowner = argowner; @@ -2569,7 +2569,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, raw_spin_unlock(&newowner->pi_lock); raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
- return 0; + return argowner == current;
/* * In order to reschedule or handle a page fault, we need to drop the @@ -2611,7 +2611,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * Check if someone else fixed it for us: */ if (pi_state->owner != oldowner) { - ret = 0; + ret = argowner == current; goto out_unlock; }
@@ -2644,8 +2644,6 @@ static long futex_wait_restart(struct restart_block *restart); */ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) { - int ret = 0; - if (locked) { /* * Got the lock. We might not be the anticipated owner if we @@ -2656,8 +2654,8 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) * stable state, anything else needs more attention. */ if (q->pi_state->owner != current) - ret = fixup_pi_state_owner(uaddr, q, current); - goto out; + return fixup_pi_state_owner(uaddr, q, current); + return 1; }
/* @@ -2668,10 +2666,8 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) * Another speculative read; pi_state->owner == current is unstable * but needs our attention. */ - if (q->pi_state->owner == current) { - ret = fixup_pi_state_owner(uaddr, q, NULL); - goto out; - } + if (q->pi_state->owner == current) + return fixup_pi_state_owner(uaddr, q, NULL);
/* * Paranoia check. If we did not take the lock, then we should not be @@ -2684,8 +2680,7 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) q->pi_state->owner); }
-out: - return ret ? ret : locked; + return 0; }
/** @@ -3416,7 +3411,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, if (q.pi_state && (q.pi_state->owner != current)) { spin_lock(q.lock_ptr); ret = fixup_pi_state_owner(uaddr2, &q, current); - if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { + if (ret < 0 && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { pi_state = q.pi_state; get_pi_state(pi_state); } @@ -3426,6 +3421,11 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, */ put_pi_state(q.pi_state); spin_unlock(q.lock_ptr); + /* + * Adjust the return value. It's either -EFAULT or + * success (1) but the caller expects 0 for success. + */ + ret = ret < 0 ? ret : 0; } } else { struct rt_mutex *pi_mutex;
From: Thomas Gleixner tglx@linutronix.de
commit 04b79c55201f02ffd675e1231d731365e335c307 upstream
If that unexpected case of inconsistent arguments ever happens then the futex state is left completely inconsistent and the printk is not really helpful. Replace it with a warning and make the state consistent.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index b2f33b221def..04d48b80dc5c 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2671,14 +2671,10 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
/* * Paranoia check. If we did not take the lock, then we should not be - * the owner of the rt_mutex. + * the owner of the rt_mutex. Warn and establish consistent state. */ - if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) { - printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p " - "pi-state %p\n", ret, - q->pi_state->pi_mutex.owner, - q->pi_state->owner); - } + if (WARN_ON_ONCE(rt_mutex_owner(&q->pi_state->pi_mutex) == current)) + return fixup_pi_state_owner(uaddr, q, current);
return 0; }
From: Thomas Gleixner tglx@linutronix.de
commit c5cade200ab9a2a3be9e7f32a752c8d86b502ec7 upstream
Updating pi_state::owner is done at several places with the same code. Provide a function for it and use that at the obvious places.
This is also a preparation for a bug fix to avoid yet another copy of the same code or alternatively introducing a completely unpenetratable mess of gotos.
Originally-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 66 +++++++++++++++++++++++++------------------------- 1 file changed, 33 insertions(+), 33 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 04d48b80dc5c..50f5f1eb9b50 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -844,6 +844,29 @@ static struct futex_pi_state *alloc_pi_state(void) return pi_state; }
+static void pi_state_update_owner(struct futex_pi_state *pi_state, + struct task_struct *new_owner) +{ + struct task_struct *old_owner = pi_state->owner; + + lockdep_assert_held(&pi_state->pi_mutex.wait_lock); + + if (old_owner) { + raw_spin_lock(&old_owner->pi_lock); + WARN_ON(list_empty(&pi_state->list)); + list_del_init(&pi_state->list); + raw_spin_unlock(&old_owner->pi_lock); + } + + if (new_owner) { + raw_spin_lock(&new_owner->pi_lock); + WARN_ON(!list_empty(&pi_state->list)); + list_add(&pi_state->list, &new_owner->pi_state_list); + pi_state->owner = new_owner; + raw_spin_unlock(&new_owner->pi_lock); + } +} + static void get_pi_state(struct futex_pi_state *pi_state) { WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount)); @@ -1602,26 +1625,15 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_ ret = -EINVAL; }
- if (ret) - goto out_unlock; - - /* - * This is a point of no return; once we modify the uval there is no - * going back and subsequent operations must not fail. - */ - - raw_spin_lock(&pi_state->owner->pi_lock); - WARN_ON(list_empty(&pi_state->list)); - list_del_init(&pi_state->list); - raw_spin_unlock(&pi_state->owner->pi_lock); - - raw_spin_lock(&new_owner->pi_lock); - WARN_ON(!list_empty(&pi_state->list)); - list_add(&pi_state->list, &new_owner->pi_state_list); - pi_state->owner = new_owner; - raw_spin_unlock(&new_owner->pi_lock); - - postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q); + if (!ret) { + /* + * This is a point of no return; once we modified the uval + * there is no going back and subsequent operations must + * not fail. + */ + pi_state_update_owner(pi_state, new_owner); + postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q); + }
out_unlock: raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); @@ -2554,19 +2566,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * We fixed up user space. Now we need to fix the pi_state * itself. */ - if (pi_state->owner != NULL) { - raw_spin_lock(&pi_state->owner->pi_lock); - WARN_ON(list_empty(&pi_state->list)); - list_del_init(&pi_state->list); - raw_spin_unlock(&pi_state->owner->pi_lock); - } - - pi_state->owner = newowner; - - raw_spin_lock(&newowner->pi_lock); - WARN_ON(!list_empty(&pi_state->list)); - list_add(&pi_state->list, &newowner->pi_state_list); - raw_spin_unlock(&newowner->pi_lock); + pi_state_update_owner(pi_state, newowner); raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
return argowner == current;
From: Thomas Gleixner tglx@linutronix.de
commit 2156ac1934166d6deb6cd0f6ffc4c1076ec63697 upstream
Nothing uses the argument. Remove it as preparation to use pi_state_update_owner().
Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 2 +- kernel/locking/rtmutex.c | 3 +-- kernel/locking/rtmutex_common.h | 3 +-- 3 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 50f5f1eb9b50..44a52505907d 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -899,7 +899,7 @@ static void put_pi_state(struct futex_pi_state *pi_state) list_del_init(&pi_state->list); raw_spin_unlock(&owner->pi_lock); } - rt_mutex_proxy_unlock(&pi_state->pi_mutex, owner); + rt_mutex_proxy_unlock(&pi_state->pi_mutex); raw_spin_unlock_irqrestore(&pi_state->pi_mutex.wait_lock, flags); }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 9562aaa2afdc..a5ec4f68527e 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1719,8 +1719,7 @@ void rt_mutex_init_proxy_locked(struct rt_mutex *lock, * possible because it belongs to the pi_state which is about to be freed * and it is not longer visible to other tasks. */ -void rt_mutex_proxy_unlock(struct rt_mutex *lock, - struct task_struct *proxy_owner) +void rt_mutex_proxy_unlock(struct rt_mutex *lock) { debug_rt_mutex_proxy_unlock(lock); rt_mutex_set_owner(lock, NULL); diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index d1d62f942be2..ca6fb489007b 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -133,8 +133,7 @@ enum rtmutex_chainwalk { extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock); extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner); -extern void rt_mutex_proxy_unlock(struct rt_mutex *lock, - struct task_struct *proxy_owner); +extern void rt_mutex_proxy_unlock(struct rt_mutex *lock); extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter); extern int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
From: Thomas Gleixner tglx@linutronix.de
commit 6ccc84f917d33312eb2846bd7b567639f585ad6d upstream
No point in open coding it. This way it gains the extra sanity checks.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 44a52505907d..bceb7daadc55 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -889,16 +889,10 @@ static void put_pi_state(struct futex_pi_state *pi_state) * and has cleaned up the pi_state already */ if (pi_state->owner) { - struct task_struct *owner; unsigned long flags;
raw_spin_lock_irqsave(&pi_state->pi_mutex.wait_lock, flags); - owner = pi_state->owner; - if (owner) { - raw_spin_lock(&owner->pi_lock); - list_del_init(&pi_state->list); - raw_spin_unlock(&owner->pi_lock); - } + pi_state_update_owner(pi_state, NULL); rt_mutex_proxy_unlock(&pi_state->pi_mutex); raw_spin_unlock_irqrestore(&pi_state->pi_mutex.wait_lock, flags); }
From: Thomas Gleixner tglx@linutronix.de
commit f2dac39d93987f7de1e20b3988c8685523247ae2 upstream
Too many gotos already and an upcoming fix would make it even more unreadable.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 53 +++++++++++++++++++++++++------------------------- 1 file changed, 26 insertions(+), 27 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index bceb7daadc55..3e8619024cb8 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2450,18 +2450,13 @@ static void unqueue_me_pi(struct futex_q *q) spin_unlock(q->lock_ptr); }
-static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, - struct task_struct *argowner) +static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, + struct task_struct *argowner) { + u32 uval, uninitialized_var(curval), newval, newtid; struct futex_pi_state *pi_state = q->pi_state; - u32 uval, uninitialized_var(curval), newval; struct task_struct *oldowner, *newowner; - u32 newtid; - int ret, err = 0; - - lockdep_assert_held(q->lock_ptr); - - raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + int err = 0;
oldowner = pi_state->owner;
@@ -2495,14 +2490,12 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * We raced against a concurrent self; things are * already fixed up. Nothing to do. */ - ret = 0; - goto out_unlock; + return 0; }
if (__rt_mutex_futex_trylock(&pi_state->pi_mutex)) { /* We got the lock. pi_state is correct. Tell caller. */ - ret = 1; - goto out_unlock; + return 1; }
/* @@ -2529,8 +2522,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * We raced against a concurrent self; things are * already fixed up. Nothing to do. */ - ret = 1; - goto out_unlock; + return 1; } newowner = argowner; } @@ -2561,7 +2553,6 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * itself. */ pi_state_update_owner(pi_state, newowner); - raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
return argowner == current;
@@ -2584,17 +2575,16 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
switch (err) { case -EFAULT: - ret = fault_in_user_writeable(uaddr); + err = fault_in_user_writeable(uaddr); break;
case -EAGAIN: cond_resched(); - ret = 0; + err = 0; break;
default: WARN_ON_ONCE(1); - ret = err; break; }
@@ -2604,17 +2594,26 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, /* * Check if someone else fixed it for us: */ - if (pi_state->owner != oldowner) { - ret = argowner == current; - goto out_unlock; - } + if (pi_state->owner != oldowner) + return argowner == current;
- if (ret) - goto out_unlock; + /* Retry if err was -EAGAIN or the fault in succeeded */ + if (!err) + goto retry;
- goto retry; + return err; +}
-out_unlock: +static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, + struct task_struct *argowner) +{ + struct futex_pi_state *pi_state = q->pi_state; + int ret; + + lockdep_assert_held(q->lock_ptr); + + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + ret = __fixup_pi_state_owner(uaddr, q, argowner); raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); return ret; }
From: Thomas Gleixner tglx@linutronix.de
commit 34b1a1ce1458f50ef27c54e28eb9b1947012907a upstream
fixup_pi_state_owner() tries to ensure that the state of the rtmutex, pi_state and the user space value related to the PI futex are consistent before returning to user space. In case that the user space value update faults and the fault cannot be resolved by faulting the page in via fault_in_user_writeable() the function returns with -EFAULT and leaves the rtmutex and pi_state owner state inconsistent.
A subsequent futex_unlock_pi() operates on the inconsistent pi_state and releases the rtmutex despite not owning it which can corrupt the RB tree of the rtmutex and cause a subsequent kernel stack use after free.
It was suggested to loop forever in fixup_pi_state_owner() if the fault cannot be resolved, but that results in runaway tasks which is especially undesired when the problem happens due to a programming error and not due to malice.
As the user space value cannot be fixed up, the proper solution is to make the rtmutex and the pi_state consistent so both have the same owner. This leaves the user space value out of sync. Any subsequent operation on the futex will fail because the 10th rule of PI futexes (pi_state owner and user space value are consistent) has been violated.
As a consequence this removes the inept attempts of 'fixing' the situation in case that the current task owns the rtmutex when returning with an unresolvable fault by unlocking the rtmutex which left pi_state::owner and rtmutex::owner out of sync in a different and only slightly less dangerous way.
Fixes: 1b7558e457ed ("futexes: fix fault handling in futex_lock_pi") Reported-by: gzobqq@gmail.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/futex.c | 56 ++++++++++++++++++-------------------------------- 1 file changed, 20 insertions(+), 36 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 3e8619024cb8..7ece653364d2 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1039,7 +1039,8 @@ static inline void exit_pi_state_list(struct task_struct *curr) { } * FUTEX_OWNER_DIED bit. See [4] * * [10] There is no transient state which leaves owner and user space - * TID out of sync. + * TID out of sync. Except one error case where the kernel is denied + * write access to the user address, see fixup_pi_state_owner(). * * * Serialization and lifetime rules: @@ -2601,6 +2602,24 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, if (!err) goto retry;
+ /* + * fault_in_user_writeable() failed so user state is immutable. At + * best we can make the kernel state consistent but user state will + * be most likely hosed and any subsequent unlock operation will be + * rejected due to PI futex rule [10]. + * + * Ensure that the rtmutex owner is also the pi_state owner despite + * the user space value claiming something different. There is no + * point in unlocking the rtmutex if current is the owner as it + * would need to wait until the next waiter has taken the rtmutex + * to guarantee consistent state. Keep it simple. Userspace asked + * for this wreckaged state. + * + * The rtmutex has an owner - either current or some other + * task. See the EAGAIN loop above. + */ + pi_state_update_owner(pi_state, rt_mutex_owner(&pi_state->pi_mutex)); + return err; }
@@ -2890,7 +2909,6 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock) { struct hrtimer_sleeper timeout, *to = NULL; - struct futex_pi_state *pi_state = NULL; struct task_struct *exiting = NULL; struct rt_mutex_waiter rt_waiter; struct futex_hash_bucket *hb; @@ -3033,23 +3051,9 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, if (res) ret = (res < 0) ? res : 0;
- /* - * If fixup_owner() faulted and was unable to handle the fault, unlock - * it and return the fault to userspace. - */ - if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) { - pi_state = q.pi_state; - get_pi_state(pi_state); - } - /* Unqueue and drop the lock */ unqueue_me_pi(&q);
- if (pi_state) { - rt_mutex_futex_unlock(&pi_state->pi_mutex); - put_pi_state(pi_state); - } - goto out_put_key;
out_unlock_put_key: @@ -3315,7 +3319,6 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, u32 __user *uaddr2) { struct hrtimer_sleeper timeout, *to = NULL; - struct futex_pi_state *pi_state = NULL; struct rt_mutex_waiter rt_waiter; struct futex_hash_bucket *hb; union futex_key key2 = FUTEX_KEY_INIT; @@ -3400,10 +3403,6 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, if (q.pi_state && (q.pi_state->owner != current)) { spin_lock(q.lock_ptr); ret = fixup_pi_state_owner(uaddr2, &q, current); - if (ret < 0 && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { - pi_state = q.pi_state; - get_pi_state(pi_state); - } /* * Drop the reference to the pi state which * the requeue_pi() code acquired for us. @@ -3445,25 +3444,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, if (res) ret = (res < 0) ? res : 0;
- /* - * If fixup_pi_state_owner() faulted and was unable to handle - * the fault, unlock the rt_mutex and return the fault to - * userspace. - */ - if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { - pi_state = q.pi_state; - get_pi_state(pi_state); - } - /* Unqueue and drop the lock. */ unqueue_me_pi(&q); }
- if (pi_state) { - rt_mutex_futex_unlock(&pi_state->pi_mutex); - put_pi_state(pi_state); - } - if (ret == -EINTR) { /* * We've already been requeued, but cannot restart by calling
From: Jason Gerecke killertofu@gmail.com
commit 179e8e47c02a1950f1c556f2b854bdb2259078fb upstream.
The recent commit to fix a memory leak introduced an inadvertant NULL pointer dereference. The `wacom_wac->pen_fifo` variable was never intialized, resuling in a crash whenever functions tried to use it. Since the FIFO is only used by AES pens (to buffer events from pen proximity until the hardware reports the pen serial number) this would have been easily overlooked without testing an AES device.
This patch converts `wacom_wac->pen_fifo` over to a pointer (since the call to `devres_alloc` allocates memory for us) and ensures that we assign it to point to the allocated and initalized `pen_fifo` before the function returns.
Link: https://github.com/linuxwacom/input-wacom/issues/230 Fixes: 37309f47e2f5 ("HID: wacom: Fix memory leakage caused by kfifo_alloc") CC: stable@vger.kernel.org # v4.19+ Signed-off-by: Jason Gerecke jason.gerecke@wacom.com Tested-by: Ping Cheng ping.cheng@wacom.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/wacom_sys.c | 7 ++++--- drivers/hid/wacom_wac.h | 2 +- 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/hid/wacom_sys.c b/drivers/hid/wacom_sys.c index 523014f2c0eb..8006732b8f42 100644 --- a/drivers/hid/wacom_sys.c +++ b/drivers/hid/wacom_sys.c @@ -150,9 +150,9 @@ static int wacom_wac_pen_serial_enforce(struct hid_device *hdev, }
if (flush) - wacom_wac_queue_flush(hdev, &wacom_wac->pen_fifo); + wacom_wac_queue_flush(hdev, wacom_wac->pen_fifo); else if (insert) - wacom_wac_queue_insert(hdev, &wacom_wac->pen_fifo, + wacom_wac_queue_insert(hdev, wacom_wac->pen_fifo, raw_data, report_size);
return insert && !flush; @@ -1251,7 +1251,7 @@ static void wacom_devm_kfifo_release(struct device *dev, void *res) static int wacom_devm_kfifo_alloc(struct wacom *wacom) { struct wacom_wac *wacom_wac = &wacom->wacom_wac; - struct kfifo_rec_ptr_2 *pen_fifo = &wacom_wac->pen_fifo; + struct kfifo_rec_ptr_2 *pen_fifo; int error;
pen_fifo = devres_alloc(wacom_devm_kfifo_release, @@ -1268,6 +1268,7 @@ static int wacom_devm_kfifo_alloc(struct wacom *wacom) }
devres_add(&wacom->hdev->dev, pen_fifo); + wacom_wac->pen_fifo = pen_fifo;
return 0; } diff --git a/drivers/hid/wacom_wac.h b/drivers/hid/wacom_wac.h index f67d871841c0..46da97162ef4 100644 --- a/drivers/hid/wacom_wac.h +++ b/drivers/hid/wacom_wac.h @@ -344,7 +344,7 @@ struct wacom_wac { struct input_dev *pen_input; struct input_dev *touch_input; struct input_dev *pad_input; - struct kfifo_rec_ptr_2 pen_fifo; + struct kfifo_rec_ptr_2 *pen_fifo; int pid; int num_contacts_left; u8 bt_features;
From: Gaurav Kohli gkohli@codeaurora.org
commit bbeb97464eefc65f506084fd9f18f21653e01137 upstream.
Below race can come, if trace_open and resize of cpu buffer is running parallely on different cpus CPUX CPUY ring_buffer_resize atomic_read(&buffer->resize_disabled) tracing_open tracing_reset_online_cpus ring_buffer_reset_cpu rb_reset_cpu rb_update_pages remove/insert pages resetting pointer
This race can cause data abort or some times infinte loop in rb_remove_pages and rb_insert_pages while checking pages for sanity.
Take buffer lock to fix this.
Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeauror...
Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Denis Efremov efremov@linux.com Signed-off-by: Gaurav Kohli gkohli@codeaurora.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/trace/ring_buffer.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 87ce9736043d..360129e47540 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4393,6 +4393,8 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
if (!cpumask_test_cpu(cpu, buffer->cpumask)) return; + /* prevent another thread from changing buffer sizes */ + mutex_lock(&buffer->mutex);
atomic_inc(&buffer->resize_disabled); atomic_inc(&cpu_buffer->record_disabled); @@ -4416,6 +4418,8 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
atomic_dec(&cpu_buffer->record_disabled); atomic_dec(&buffer->resize_disabled); + + mutex_unlock(&buffer->mutex); } EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
From: Jean-Philippe Brucker jean-philippe@linaro.org
commit c8a950d0d3b926a02c7b2e713850d38217cec3d1 upstream.
Several Makefiles in tools/ need to define the host toolchain variables. Move their definition to tools/scripts/Makefile.include
Signed-off-by: Jean-Philippe Brucker jean-philippe@linaro.org Signed-off-by: Andrii Nakryiko andrii@kernel.org Acked-by: Jiri Olsa jolsa@redhat.com Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Link: https://lore.kernel.org/bpf/20201110164310.2600671-2-jean-philippe@linaro.or... Cc: Alistair Delva adelva@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- tools/build/Makefile | 4 ---- tools/objtool/Makefile | 9 --------- tools/perf/Makefile.perf | 4 ---- tools/power/acpi/Makefile.config | 1 - tools/scripts/Makefile.include | 10 ++++++++++ 5 files changed, 10 insertions(+), 18 deletions(-)
diff --git a/tools/build/Makefile b/tools/build/Makefile index 727050c40f09..8a55378e8b7c 100644 --- a/tools/build/Makefile +++ b/tools/build/Makefile @@ -15,10 +15,6 @@ endef $(call allow-override,CC,$(CROSS_COMPILE)gcc) $(call allow-override,LD,$(CROSS_COMPILE)ld)
-HOSTCC ?= gcc -HOSTLD ?= ld -HOSTAR ?= ar - export HOSTCC HOSTLD HOSTAR
ifeq ($(V),1) diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile index baa92279c137..15f32f67cf34 100644 --- a/tools/objtool/Makefile +++ b/tools/objtool/Makefile @@ -7,15 +7,6 @@ ARCH := x86 endif
# always use the host compiler -ifneq ($(LLVM),) -HOSTAR ?= llvm-ar -HOSTCC ?= clang -HOSTLD ?= ld.lld -else -HOSTAR ?= ar -HOSTCC ?= gcc -HOSTLD ?= ld -endif AR = $(HOSTAR) CC = $(HOSTCC) LD = $(HOSTLD) diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index 0be411695379..678aa7feb84d 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -148,10 +148,6 @@ endef
LD += $(EXTRA_LDFLAGS)
-HOSTCC ?= gcc -HOSTLD ?= ld -HOSTAR ?= ar - PKG_CONFIG = $(CROSS_COMPILE)pkg-config LLVM_CONFIG ?= llvm-config
diff --git a/tools/power/acpi/Makefile.config b/tools/power/acpi/Makefile.config index fc116c060b98..32ff7baf39df 100644 --- a/tools/power/acpi/Makefile.config +++ b/tools/power/acpi/Makefile.config @@ -57,7 +57,6 @@ INSTALL_SCRIPT = ${INSTALL_PROGRAM} CROSS = #/usr/i386-linux-uclibc/usr/bin/i386-uclibc- CROSS_COMPILE ?= $(CROSS) LD = $(CC) -HOSTCC = gcc
# check if compiler option is supported cc-supports = ${shell if $(CC) ${1} -S -o /dev/null -x c /dev/null > /dev/null 2>&1; then echo "$(1)"; fi;} diff --git a/tools/scripts/Makefile.include b/tools/scripts/Makefile.include index 8fc6b1ca47dc..42dbe05b1807 100644 --- a/tools/scripts/Makefile.include +++ b/tools/scripts/Makefile.include @@ -60,6 +60,16 @@ $(call allow-override,LD,$(CROSS_COMPILE)ld) $(call allow-override,CXX,$(CROSS_COMPILE)g++) $(call allow-override,STRIP,$(CROSS_COMPILE)strip)
+ifneq ($(LLVM),) +HOSTAR ?= llvm-ar +HOSTCC ?= clang +HOSTLD ?= ld.lld +else +HOSTAR ?= ar +HOSTCC ?= gcc +HOSTLD ?= ld +endif + ifeq ($(CC_NO_CLANG), 1) EXTRA_WARNINGS += -Wstrict-aliasing=3 endif
From: Mikulas Patocka mpatocka@redhat.com
commit 5c02406428d5219c367c5f53457698c58bc5f917 upstream.
Otherwise a malicious user could (ab)use the "recalculate" feature that makes dm-integrity calculate the checksums in the background while the device is already usable. When the system restarts before all checksums have been calculated, the calculation continues where it was interrupted even if the recalculate feature is not requested the next time the dm device is set up.
Disable recalculating if we use internal_hash or journal_hash with a key (e.g. HMAC) and we don't have the "legacy_recalculate" flag.
This may break activation of a volume, created by an older kernel, that is not yet fully recalculated -- if this happens, the user should add the "legacy_recalculate" flag to constructor parameters.
Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka mpatocka@redhat.com Reported-by: Daniel Glockner dg@emlix.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- Documentation/device-mapper/dm-integrity.txt | 7 ++++++ drivers/md/dm-integrity.c | 24 +++++++++++++++++++- 2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/Documentation/device-mapper/dm-integrity.txt b/Documentation/device-mapper/dm-integrity.txt index 297251b0d2d5..bf6af2ade0a6 100644 --- a/Documentation/device-mapper/dm-integrity.txt +++ b/Documentation/device-mapper/dm-integrity.txt @@ -146,6 +146,13 @@ block_size:number Supported values are 512, 1024, 2048 and 4096 bytes. If not specified the default block size is 512 bytes.
+legacy_recalculate + Allow recalculating of volumes with HMAC keys. This is disabled by + default for security reasons - an attacker could modify the volume, + set recalc_sector to zero, and the kernel would not detect the + modification. + + The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can be changed when reloading the target (load an inactive table and swap the tables with suspend and resume). The other arguments should not be changed diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index 1917051b512f..cffd42317272 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -240,6 +240,7 @@ struct dm_integrity_c {
bool journal_uptodate; bool just_formatted; + bool legacy_recalculate;
struct alg_spec internal_hash_alg; struct alg_spec journal_crypt_alg; @@ -345,6 +346,14 @@ static int dm_integrity_failed(struct dm_integrity_c *ic) return READ_ONCE(ic->failed); }
+static bool dm_integrity_disable_recalculate(struct dm_integrity_c *ic) +{ + if ((ic->internal_hash_alg.key || ic->journal_mac_alg.key) && + !ic->legacy_recalculate) + return true; + return false; +} + static commit_id_t dm_integrity_commit_id(struct dm_integrity_c *ic, unsigned i, unsigned j, unsigned char seq) { @@ -2503,6 +2512,7 @@ static void dm_integrity_status(struct dm_target *ti, status_type_t type, arg_count += !!ic->internal_hash_alg.alg_string; arg_count += !!ic->journal_crypt_alg.alg_string; arg_count += !!ic->journal_mac_alg.alg_string; + arg_count += ic->legacy_recalculate; DMEMIT("%s %llu %u %c %u", ic->dev->name, (unsigned long long)ic->start, ic->tag_size, ic->mode, arg_count); if (ic->meta_dev) @@ -2516,6 +2526,8 @@ static void dm_integrity_status(struct dm_target *ti, status_type_t type, DMEMIT(" buffer_sectors:%u", 1U << ic->log2_buffer_sectors); DMEMIT(" journal_watermark:%u", (unsigned)watermark_percentage); DMEMIT(" commit_time:%u", ic->autocommit_msec); + if (ic->legacy_recalculate) + DMEMIT(" legacy_recalculate");
#define EMIT_ALG(a, n) \ do { \ @@ -3118,7 +3130,7 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) unsigned extra_args; struct dm_arg_set as; static const struct dm_arg _args[] = { - {0, 15, "Invalid number of feature args"}, + {0, 12, "Invalid number of feature args"}, }; unsigned journal_sectors, interleave_sectors, buffer_sectors, journal_watermark, sync_msec; bool recalculate; @@ -3248,6 +3260,8 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) goto bad; } else if (!strcmp(opt_string, "recalculate")) { recalculate = true; + } else if (!strcmp(opt_string, "legacy_recalculate")) { + ic->legacy_recalculate = true; } else { r = -EINVAL; ti->error = "Invalid argument"; @@ -3523,6 +3537,14 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) } }
+ if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING) && + le64_to_cpu(ic->sb->recalc_sector) < ic->provided_data_sectors && + dm_integrity_disable_recalculate(ic)) { + ti->error = "Recalculating with HMAC is disabled for security reasons - if you really need it, use the argument "legacy_recalculate""; + r = -EOPNOTSUPP; + goto bad; + } + ic->bufio = dm_bufio_client_create(ic->meta_dev ? ic->meta_dev->bdev : ic->dev->bdev, 1U << (SECTOR_SHIFT + ic->log2_buffer_sectors), 1, 0, NULL, NULL); if (IS_ERR(ic->bufio)) {
From: Jan Kara jack@suse.cz
commit 5fcd57505c002efc5823a7355e21f48dd02d5a51 upstream.
The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag.
Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Eric Biggers ebiggers@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ext4/inode.c | 2 +- fs/fs-writeback.c | 28 +++++++++++----------------- fs/xfs/xfs_trans_inode.c | 4 ++-- include/linux/fs.h | 1 - include/trace/events/writeback.h | 1 - 5 files changed, 14 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d03dc1b90419..8b70f35c54d0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5227,7 +5227,7 @@ static int other_inode_match(struct inode * inode, unsigned long ino, (inode->i_state & I_DIRTY_TIME)) { struct ext4_inode_info *ei = EXT4_I(inode);
- inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED); + inode->i_state &= ~I_DIRTY_TIME; spin_unlock(&inode->i_lock);
spin_lock(&ei->i_raw_lock); diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 63d94b906945..15946c7b01ff 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1158,7 +1158,7 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t) */ static int move_expired_inodes(struct list_head *delaying_queue, struct list_head *dispatch_queue, - int flags, unsigned long dirtied_before) + unsigned long dirtied_before) { LIST_HEAD(tmp); struct list_head *pos, *node; @@ -1174,8 +1174,6 @@ static int move_expired_inodes(struct list_head *delaying_queue, list_move(&inode->i_io_list, &tmp); moved++; spin_lock(&inode->i_lock); - if (flags & EXPIRE_DIRTY_ATIME) - inode->i_state |= I_DIRTY_TIME_EXPIRED; inode->i_state |= I_SYNC_QUEUED; spin_unlock(&inode->i_lock); if (sb_is_blkdev_sb(inode->i_sb)) @@ -1223,11 +1221,11 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work,
assert_spin_locked(&wb->list_lock); list_splice_init(&wb->b_more_io, &wb->b_io); - moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, dirtied_before); + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, dirtied_before); if (!work->for_sync) time_expire_jif = jiffies - dirtytime_expire_interval * HZ; moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, - EXPIRE_DIRTY_ATIME, time_expire_jif); + time_expire_jif); if (moved) wb_io_lists_populated(wb); trace_writeback_queue_io(wb, work, dirtied_before, moved); @@ -1403,18 +1401,14 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY; - if (inode->i_state & I_DIRTY_TIME) { - if ((dirty & I_DIRTY_INODE) || - wbc->sync_mode == WB_SYNC_ALL || - unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || - unlikely(time_after(jiffies, - (inode->dirtied_time_when + - dirtytime_expire_interval * HZ)))) { - dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED; - trace_writeback_lazytime(inode); - } - } else - inode->i_state &= ~I_DIRTY_TIME_EXPIRED; + if ((inode->i_state & I_DIRTY_TIME) && + ((dirty & I_DIRTY_INODE) || + wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync || + time_after(jiffies, inode->dirtied_time_when + + dirtytime_expire_interval * HZ))) { + dirty |= I_DIRTY_TIME; + trace_writeback_lazytime(inode); + } inode->i_state &= ~dirty;
/* diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index ae453dd236a6..6fcdf7e449fe 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -99,9 +99,9 @@ xfs_trans_log_inode( * to log the timestamps, or will clear already cleared fields in the * worst case. */ - if (inode->i_state & (I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED)) { + if (inode->i_state & I_DIRTY_TIME) { spin_lock(&inode->i_lock); - inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED); + inode->i_state &= ~I_DIRTY_TIME; spin_unlock(&inode->i_lock); }
diff --git a/include/linux/fs.h b/include/linux/fs.h index 70a545ada6fd..8d5ee697727c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2115,7 +2115,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) #define I_DIO_WAKEUP (1 << __I_DIO_WAKEUP) #define I_LINKABLE (1 << 10) #define I_DIRTY_TIME (1 << 11) -#define I_DIRTY_TIME_EXPIRED (1 << 12) #define I_WB_SWITCH (1 << 13) #define I_OVL_INUSE (1 << 14) #define I_CREATING (1 << 15) diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index c0b7c41b73f1..4e6af6c80698 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -20,7 +20,6 @@ {I_CLEAR, "I_CLEAR"}, \ {I_SYNC, "I_SYNC"}, \ {I_DIRTY_TIME, "I_DIRTY_TIME"}, \ - {I_DIRTY_TIME_EXPIRED, "I_DIRTY_TIME_EXPIRED"}, \ {I_REFERENCED, "I_REFERENCED"} \ )
From: Eric Biggers ebiggers@google.com
commit 1e249cb5b7fc09ff216aa5a12f6c302e434e88f9 upstream.
When lazytime is enabled and an inode is being written due to its in-memory updated timestamps having expired, either due to a sync() or syncfs() system call or due to dirtytime_expire_interval having elapsed, the VFS needs to inform the filesystem so that the filesystem can copy the inode's timestamps out to the on-disk data structures.
This is done by __writeback_single_inode() calling mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).
However, this occurs after __writeback_single_inode() has already cleared the dirty flags from ->i_state. This causes two bugs:
- mark_inode_dirty_sync() redirties the inode, causing it to remain dirty. This wastefully causes the inode to be written twice. But more importantly, it breaks cases where sync_filesystem() is expected to clean dirty inodes. This includes the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (as reported at https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well as possibly filesystem freezing (freeze_super()).
- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is called from __writeback_single_inode() for lazytime expiration, xfs_fs_dirty_inode() ignores the notification. (XFS only cares about lazytime expirations, and it assumes that i_state will contain I_DIRTY_TIME during those.) Therefore, lazy timestamps aren't persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.
Fix this by moving the call to mark_inode_dirty_sync() to earlier in __writeback_single_inode(), before the dirty flags are cleared from i_state. This makes filesystems be properly notified of the timestamp expiration, and it avoids incorrectly redirtying the inode.
This fixes xfstest generic/580 (which tests FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime enabled. It also fixes the new lazytime xfstest I've proposed, which reproduces the above-mentioned XFS bug (https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).
Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly. But due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the right thing to do because mark_inode_dirty_sync() now knows not to move the inode to a writeback list if it is currently queued for sync.
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") Cc: stable@vger.kernel.org Depends-on: 5afced3bf281 ("writeback: Avoid skipping inode writeback") Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org Suggested-by: Jan Kara jack@suse.cz Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Eric Biggers ebiggers@google.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/fs-writeback.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 15946c7b01ff..c65796603ec0 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1394,21 +1394,25 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) }
/* - * Some filesystems may redirty the inode during the writeback - * due to delalloc, clear dirty metadata flags right before - * write_inode() + * If the inode has dirty timestamps and we need to write them, call + * mark_inode_dirty_sync() to notify the filesystem about it and to + * change I_DIRTY_TIME into I_DIRTY_SYNC. */ - spin_lock(&inode->i_lock); - - dirty = inode->i_state & I_DIRTY; if ((inode->i_state & I_DIRTY_TIME) && - ((dirty & I_DIRTY_INODE) || - wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync || + (wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync || time_after(jiffies, inode->dirtied_time_when + dirtytime_expire_interval * HZ))) { - dirty |= I_DIRTY_TIME; trace_writeback_lazytime(inode); + mark_inode_dirty_sync(inode); } + + /* + * Some filesystems may redirty the inode during the writeback + * due to delalloc, clear dirty metadata flags right before + * write_inode() + */ + spin_lock(&inode->i_lock); + dirty = inode->i_state & I_DIRTY; inode->i_state &= ~dirty;
/* @@ -1429,8 +1433,6 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
spin_unlock(&inode->i_lock);
- if (dirty & I_DIRTY_TIME) - mark_inode_dirty_sync(inode); /* Don't write the inode if only I_DIRTY_PAGES was set */ if (dirty & ~I_DIRTY_PAGES) { int err = write_inode(inode, wbc);
From: Greg Kroah-Hartman gregkh@linuxfoundation.org
Merge 27 patches from 4.19.172 stable branch (27 total) beside 0 already merged patches.
Tested-by: Pavel Machek (CIP) pavel@denx.de Link: https://lore.kernel.org/r/20210129105910.685105711@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Makefile b/Makefile index 335b015c5c9b..7da0ddd65052 100644 --- a/Makefile +++ b/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 VERSION = 4 PATCHLEVEL = 19 -SUBLEVEL = 171 +SUBLEVEL = 172 EXTRAVERSION = NAME = "People's Front"