Al Viro (1): don't dump the threads that had been already exiting when zapped.
Aleksandr Nogikh (1): netem: fix zero division in tabledist
Andy Shevchenko (2):
      device property: Keep secondary firmware node secondary by type
      device property: Don't clear secondary pointer for shared primary firmware node
Arjun Roy (1): tcp: Prevent low rmem stalls with SO_RCVLOWAT.
Arnaldo Carvalho de Melo (1): perf scripting python: Avoid declaring function pointers with a visibility attribute
Boris Protopopov (1): Convert trailing spaces and periods in path components
Brian Foster (1): xfs: flush new eof page on truncate to avoid post-eof corruption
Chen Zhou (1): selinux: Fix error return code in sel_ib_pkey_sid_slow()
Christoph Hellwig (2):
      nbd: fix a block_device refcount leak in nbd_release
      xfs: fix a missing unlock on error in xfs_fs_map_blocks
Chunyan Zhang (1): tick/common: Touch watchdog in tick_unfreeze() on all CPUs
Dan Carpenter (1): futex: Don't enable IRQs unconditionally in put_pi_state()
Darrick J. Wong (12):
      xfs: fix realtime bitmap/summary file truncation when growing rt volume
      xfs: don't free rt blocks when we're doing a REMAP bunmapi call
      xfs: set xefi_discard when creating a deferred agfl free log intent item
      xfs: fix scrub flagging rtinherit even if there is no rt device
      xfs: fix flags argument to rmap lookup when converting shared file rmaps
      xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents
      xfs: fix rmap key and record comparison functions
      xfs: fix brainos in the refcount scrubber's rmap fragment processor
      vfs: remove lockdep bogosity in __sb_start_write
      xfs: fix the minrecs logic when dealing with inode root child blocks
      xfs: strengthen rmap record flags checking
      xfs: revert "xfs: fix rmap key and record comparison functions"
Dinghao Liu (1): ext4: fix error handling code in add_new_gdb
Dongli Zhang (1): page_frag: Recover from memory pressure
Douglas Gilbert (1): sgl_alloc_order: fix memory leak
Eddy Wu (1): fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
Eric Biggers (1): ext4: fix leaking sysfs kobject after failed mount
Gabriel Krisman Bertazi (2):
      blk-cgroup: Fix memleak on error path
      blk-cgroup: Pre-allocate tree node on blkg_conf_prep
George Spelvin (1): random32: make prandom_u32() output unpredictable
Gerald Schaefer (1): mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
Heiner Kallweit (1): net: bridge: add missing counters to ndo_get_stats64 callback
Jan Kara (2):
      ext4: Detect already used quota file early
      ext4: fix bogus warning in ext4_update_dx_flag()
Jason A. Donenfeld (1): netfilter: use actual socket sk rather than skb sk when routing harder
Jiri Olsa (2):
      perf python scripting: Fix printable strings in python3 scripts
      perf tools: Add missing swap for ino_generation
Jonathan Cameron (1): ACPI: Add out of bounds and numa_off protections to pxm_to_node()
Joseph Qi (1): ext4: unlock xattr_sem properly in ext4_inline_data_truncate()
Kaixu Xia (1): ext4: correctly report "not supported" for {usr, grp}jquota when !CONFIG_QUOTA
Lang Dai (1): uio: free uio id after uio file node is freed
Lee Jones (1): Fonts: Replace discarded const qualifier
Leo Yan (1): perf lock: Don't free "lock_seq_stat" if read_count isn't zero
Luo Meng (2):
      ext4: fix invalid inode checksum
      fail_function: Remove a redundant mutex unlock
Mao Wenan (1): net: Update window_clamp if SOCK_RCVBUF is set
Marc Zyngier (1): arm64: Run ARCH_WORKAROUND_1 enabling code on all CPUs
Mateusz Nosek (1): futex: Fix incorrect should_fail_futex() handling
Matteo Croce (2):
      Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
      reboot: fix overflow parsing reboot cpu number
Michael Schaller (1): efivarfs: Replace invalid slashes with exclamation marks in dentries.
Mike Galbraith (1): futex: Handle transient "ownerless" rtmutex state correctly
Miklos Szeredi (1): fuse: fix page dereference after free
Ming Lei (1): scsi: core: Don't start concurrent async scan on same host
Nicholas Piggin (1): mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Oleg Nesterov (1): ptrace: fix task_join_group_stop() for the case when current is traced
Oliver Herms (1): IPv6: Set SIT tunnel hard_header_len to zero
Peter Zijlstra (2):
      serial: pl011: Fix lockdep splat when handling magic-sysrq interrupt
      perf: Fix get_recursion_context()
Qiujun Huang (2):
      ring-buffer: Return 0 on success from ring_buffer_resize()
      tracing: Fix out of bounds write in get_trace_buf
Ronnie Sahlberg (1): cifs: handle -EINTR in cifs_setattr
Ryan Sharpelletti (1): tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate
Shijie Luo (1): mm: mempolicy: fix potential pte_unmap_unlock pte error
Shin'ichiro Kawasaki (1): uio: Fix use-after-free in uio_unregister_device()
Stefano Brivio (1): netfilter: ipset: Update byte and packet counters regardless of whether they match
Steven Rostedt (VMware) (3):
      ring-buffer: Fix recursion protection transitions between interrupt context
      ftrace: Fix recursion check for NMI test
      ftrace: Handle tracing when switching between context
Tyler Hicks (1): tpm: efi: Don't create binary_bios_measurements file for an empty log
Valentin Schneider (1): arm64: topology: Stop using MPIDR for topology information
Vamshi K Sthambamkadi (1): efivarfs: fix memory leak in efivarfs_create()
Wang Hai (2):
      devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill()
      inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()
Wengang Wang (1): ocfs2: initialize ip_next_orphan
Will Deacon (1): arm64: psci: Avoid printing in cpu_psci_cpu_die()
Xiubo Li (1): nbd: make the config put is called before the notifying the waiter
Yi-Hung Wei (1): ip_tunnels: Set tunnel option flag when tunnel metadata is present
Yicong Yang (1): libfs: fix error cast of negative value in simple_attr_write()
Yunsheng Lin (1): net: sch_generic: fix the missing new qdisc assignment bug
Zeng Tao (1): time: Prevent undefined behaviour in timespec64_to_ns()
Zhengyuan Liu (1): arm64/mm: return cpu_all_mask when node is NUMA_NO_NODE
Zqiang (1): kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
zhuoliang zhang (1): net: xfrm: fix a race condition during allocing spi
 arch/Kconfig | 7 +
 arch/arm64/include/asm/numa.h | 3 +
 arch/arm64/kernel/cpu_errata.c | 8 +
 arch/arm64/kernel/psci.c | 5 +-
 arch/arm64/kernel/topology.c | 43 +-
 arch/arm64/mm/numa.c | 6 +-
 block/blk-cgroup.c | 15 +-
 drivers/acpi/numa.c | 2 +-
 drivers/base/core.c | 4 +-
 drivers/block/nbd.c | 3 +-
 drivers/char/random.c | 1 -
 drivers/char/tpm/eventlog/efi.c | 5 +
 drivers/net/geneve.c | 3 +-
 drivers/scsi/scsi_scan.c | 7 +-
 drivers/tty/serial/amba-pl011.c | 11 +-
 drivers/uio/uio.c | 12 +-
 fs/cifs/cifs_unicode.c | 8 +-
 fs/cifs/inode.c | 13 +-
 fs/efivarfs/super.c | 4 +
 fs/exec.c | 15 +-
 fs/ext4/ext4.h | 3 +-
 fs/ext4/inline.c | 1 +
 fs/ext4/inode.c | 11 +-
 fs/ext4/resize.c | 4 +-
 fs/ext4/super.c | 10 +-
 fs/fuse/dev.c | 28 +-
 fs/libfs.c | 6 +-
 fs/ocfs2/super.c | 1 +
 fs/super.c | 33 +-
 fs/xfs/libxfs/xfs_alloc.c | 1 +
 fs/xfs/libxfs/xfs_bmap.c | 19 +-
 fs/xfs/libxfs/xfs_bmap.h | 2 +-
 fs/xfs/libxfs/xfs_rmap.c | 2 +-
 fs/xfs/scrub/bmap.c | 10 +-
 fs/xfs/scrub/btree.c | 45 +-
 fs/xfs/scrub/inode.c | 3 +-
 fs/xfs/scrub/refcount.c | 8 +-
 fs/xfs/xfs_iops.c | 10 +
 fs/xfs/xfs_pnfs.c | 2 +-
 fs/xfs/xfs_rtalloc.c | 10 +-
 include/linux/netfilter_ipv4.h | 2 +-
 include/linux/netfilter_ipv6.h | 2 +-
 include/linux/prandom.h | 36 +-
 include/linux/time64.h | 4 +
 include/net/ip_tunnels.h | 7 +-
 kernel/events/internal.h | 2 +-
 kernel/exit.c | 5 +-
 kernel/fail_function.c | 5 +-
 kernel/fork.c | 10 +-
 kernel/futex.c | 25 +-
 kernel/kthread.c | 3 +-
 kernel/reboot.c | 28 +-
 kernel/signal.c | 19 +-
 kernel/time/itimer.c | 4 -
 kernel/time/tick-common.c | 2 +
 kernel/time/timer.c | 7 -
 kernel/trace/ring_buffer.c | 66 ++-
 kernel/trace/trace.c | 2 +-
 kernel/trace/trace.h | 26 +-
 kernel/trace/trace_selftest.c | 9 +-
 lib/fonts/font_10x18.c | 2 +-
 lib/fonts/font_6x10.c | 2 +-
 lib/fonts/font_6x11.c | 2 +-
 lib/fonts/font_7x14.c | 2 +-
 lib/fonts/font_8x16.c | 2 +-
 lib/fonts/font_8x8.c | 2 +-
 lib/fonts/font_acorn_8x8.c | 2 +-
 lib/fonts/font_mini_4x6.c | 2 +-
 lib/fonts/font_pearl_8x8.c | 2 +-
 lib/fonts/font_sun12x22.c | 2 +-
 lib/fonts/font_sun8x16.c | 2 +-
 lib/random32.c | 462 +++++++++++-------
 lib/scatterlist.c | 2 +-
 mm/huge_memory.c | 9 +-
 mm/mempolicy.c | 6 +-
 mm/page_alloc.c | 5 +
 net/bridge/br_device.c | 1 +
 net/core/devlink.c | 6 +-
 net/ipv4/inet_diag.c | 4 +-
 net/ipv4/netfilter.c | 12 +-
 net/ipv4/netfilter/ipt_SYNPROXY.c | 2 +-
 net/ipv4/netfilter/iptable_mangle.c | 2 +-
 net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 2 +-
 net/ipv4/netfilter/nf_reject_ipv4.c | 2 +-
 net/ipv4/netfilter/nft_chain_route_ipv4.c | 2 +-
 net/ipv4/syncookies.c | 9 +-
 net/ipv4/tcp.c | 2 +
 net/ipv4/tcp_bbr.c | 2 +-
 net/ipv4/tcp_input.c | 3 +-
 net/ipv6/netfilter.c | 6 +-
 net/ipv6/netfilter/ip6table_mangle.c | 2 +-
 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 2 +-
 net/ipv6/netfilter/nft_chain_route_ipv6.c | 2 +-
 net/ipv6/sit.c | 2 -
 net/ipv6/syncookies.c | 10 +-
 net/netfilter/ipset/ip_set_core.c | 3 +-
 net/netfilter/ipvs/ip_vs_core.c | 4 +-
 net/sched/sch_generic.c | 3 +
 net/sched/sch_netem.c | 9 +-
 net/xfrm/xfrm_state.c | 8 +-
 security/selinux/ibpkey.c | 4 +-
 tools/perf/builtin-lock.c | 2 +-
 tools/perf/util/print_binary.c | 2 +-
 .../scripting-engines/trace-event-python.c | 7 +-
 tools/perf/util/session.c | 1 +
 105 files changed, 809 insertions(+), 461 deletions(-)
From: Marc Zyngier maz@kernel.org
stable inclusion from linux-4.19.155 commit d953cd3fb149a999534ff48af323eb091fd5aa3b
--------------------------------
commit 18fce56134c987e5b4eceddafdbe4b00c07e2ae1 upstream.
Commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") changed the way we deal with ARCH_WORKAROUND_1, by moving most of the enabling code to the .matches() callback.
This has the unfortunate effect that the workaround gets only enabled on the first affected CPU, and no other.
In order to address this, forcefully call the .matches() callback from a .cpu_enable() callback, which brings us back to the original behaviour.
Fixes: 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof")
Cc: stable@vger.kernel.org
Reviewed-by: Suzuki K Poulose suzuki.poulose@arm.com
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Will Deacon will@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/arm64/kernel/cpu_errata.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 5f2c30ae82ac..c0b2ef5e7ea3 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -647,6 +647,12 @@ check_branch_predictor(const struct arm64_cpu_capabilities *entry, int scope)
 	return (need_wa > 0);
 }
 
+static void
+cpu_enable_branch_predictor_hardening(const struct arm64_cpu_capabilities *cap)
+{
+	cap->matches(cap, SCOPE_LOCAL_CPU);
+}
+
 static const __maybe_unused struct midr_range tx2_family_cpus[] = {
 	MIDR_ALL_VERSIONS(MIDR_BRCM_VULCAN),
 	MIDR_ALL_VERSIONS(MIDR_CAVIUM_THUNDERX2),
@@ -829,9 +835,11 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 	},
 #endif
 	{
+		.desc = "Branch predictor hardening",
 		.capability = ARM64_HARDEN_BRANCH_PREDICTOR,
 		.type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM,
 		.matches = check_branch_predictor,
+		.cpu_enable = cpu_enable_branch_predictor_hardening,
 	},
 #ifdef CONFIG_HARDEN_EL2_VECTORS
 	{
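As background for why routing the enable path through .matches() works: the capability framework runs .matches() during detection and .cpu_enable() on every CPU that comes up with the capability. A minimal userspace sketch of that callback shape (simplified stand-in types and names, not the kernel's actual cpufeature framework):

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct arm64_cpu_capabilities. */
struct cpu_cap {
	const char *desc;
	/* Detects (and, for this erratum, also enables) the workaround. */
	bool (*matches)(const struct cpu_cap *cap, int cpu);
	/* Run on every CPU that has the capability, including late CPUs. */
	void (*cpu_enable)(const struct cpu_cap *cap, int cpu);
};

static bool check_branch_predictor(const struct cpu_cap *cap, int cpu)
{
	printf("CPU%d: detect + enable '%s'\n", cpu, cap->desc);
	return true;	/* pretend every CPU is affected */
}

/* The shape of the fix: re-run the detection/enable logic per CPU. */
static void enable_hardening(const struct cpu_cap *cap, int cpu)
{
	cap->matches(cap, cpu);
}

int main(void)
{
	struct cpu_cap cap = {
		.desc = "branch predictor hardening",
		.matches = check_branch_predictor,
		.cpu_enable = enable_hardening,
	};

	/* The boot CPU runs matches() once during detection... */
	cap.matches(&cap, 0);
	/* ...and without a cpu_enable hook, CPUs 1..3 would be skipped. */
	for (int cpu = 1; cpu < 4; cpu++)
		cap.cpu_enable(&cap, cpu);
	return 0;
}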
From: Michael Schaller misch@google.com
stable inclusion from linux-4.19.155 commit 02bb497cd6de22e9ce17396957e76fa5aa11102f
--------------------------------
commit 336af6a4686d885a067ecea8c3c3dd129ba4fc75 upstream.
Without this patch efivarfs_alloc_dentry creates dentries with slashes in their name if the respective EFI variable has slashes in its name. This in turn causes EIO on getdents64, which prevents a complete directory listing of /sys/firmware/efi/efivars/.
This patch replaces the invalid slashes with exclamation marks like kobject_set_name_vargs does for /sys/firmware/efi/vars/ to have consistently named dentries under /sys/firmware/efi/vars/ and /sys/firmware/efi/efivars/.
Signed-off-by: Michael Schaller misch@google.com
Link: https://lore.kernel.org/r/20200925074502.150448-1-misch@google.com
Signed-off-by: Ard Biesheuvel ardb@kernel.org
Signed-off-by: dann frazier dann.frazier@canonical.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/efivarfs/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 5b68e4294faa..834615f13f3e 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -145,6 +145,9 @@ static int efivarfs_callback(efi_char16_t *name16, efi_guid_t vendor,
 
 	name[len + EFI_VARIABLE_GUID_LEN+1] = '\0';
 
+	/* replace invalid slashes like kobject_set_name_vargs does for /sys/firmware/efi/vars. */
+	strreplace(name, '/', '!');
+
 	inode = efivarfs_get_inode(sb, d_inode(root), S_IFREG | 0644,
 				   0, is_removable);
 	if (!inode)
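strreplace() simply rewrites one character in place over a NUL-terminated string. A userspace sketch of the equivalent operation (the EFI variable name below is made up for illustration):

#include <stdio.h>

/* Userspace sketch of the kernel's strreplace(): swap every
 * occurrence of 'old' for 'new' in a NUL-terminated string. */
static char *str_replace_char(char *s, char old, char new)
{
	for (char *p = s; *p; p++)
		if (*p == old)
			*p = new;
	return s;
}

int main(void)
{
	/* Hypothetical EFI variable name containing a slash. */
	char name[] = "Boot/Option-8be4df61-93ca-11d2-aa0d-00e098032b8c";

	/* Prints "Boot!Option-...", a name a dentry can carry. */
	printf("%s\n", str_replace_char(name, '/', '!'));
	return 0;
}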
From: Aleksandr Nogikh nogikh@google.com
stable inclusion from linux-4.19.155 commit 95ba2236b8e69de3cb9b12e1cd6c4252a1574a19
--------------------------------
[ Upstream commit eadd1befdd778a1eca57fad058782bd22b4db804 ]
Currently it is possible to craft a special netlink RTM_NEWQDISC command that can result in jitter being equal to 0x80000000. It is enough to set the 32 bit jitter to 0x02000000 (it will later be multiplied by 2^6) or just set the 64 bit jitter via TCA_NETEM_JITTER64. This causes an overflow during the generation of uniformly distributed numbers in tabledist(), which in turn leads to division by zero (sigma != 0, but sigma * 2 is 0).
The related fragment of code needs 32-bit division - see commit 9b0ed89 ("netem: remove unnecessary 64 bit modulus"), so switching to 64 bit is not an option.
Fix the issue by keeping the value of jitter within the range that can be adequately handled by tabledist() - [0;INT_MAX]. As negative std deviation makes no sense, take the absolute value of the passed value and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit arithmetic in order to prevent overflows.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Aleksandr Nogikh nogikh@google.com
Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com
Acked-by: Stephen Hemminger stephen@networkplumber.org
Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.com
Signed-off-by: Jakub Kicinski kuba@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 net/sched/sch_netem.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 014a28d8dd4f..02d8d3fd84a5 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -330,7 +330,7 @@ static s64 tabledist(s64 mu, s32 sigma,
 
 	/* default uniform distribution */
 	if (dist == NULL)
-		return ((rnd % (2 * sigma)) + mu) - sigma;
+		return ((rnd % (2 * (u32)sigma)) + mu) - sigma;
 
 	t = dist->table[rnd % dist->size];
 	x = (sigma % NETEM_DIST_SCALE) * t;
@@ -787,6 +787,10 @@ static void get_slot(struct netem_sched_data *q, const struct nlattr *attr)
 		q->slot_config.max_packets = INT_MAX;
 	if (q->slot_config.max_bytes == 0)
 		q->slot_config.max_bytes = INT_MAX;
+
+	/* capping dist_jitter to the range acceptable by tabledist() */
+	q->slot_config.dist_jitter = min_t(__s64, INT_MAX, abs(q->slot_config.dist_jitter));
+
 	q->slot.packets_left = q->slot_config.max_packets;
 	q->slot.bytes_left = q->slot_config.max_bytes;
 	if (q->slot_config.min_delay | q->slot_config.max_delay |
@@ -1011,6 +1015,9 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt,
 	if (tb[TCA_NETEM_SLOT])
 		get_slot(q, tb[TCA_NETEM_SLOT]);
 
+	/* capping jitter to the range acceptable by tabledist() */
+	q->jitter = min_t(s64, abs(q->jitter), INT_MAX);
+
 	return ret;
 
 get_table_failure:
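The arithmetic can be reproduced outside the kernel: with a 32-bit signed sigma of 0x80000000, 2 * sigma wraps to 0 and the modulo divides by zero, while 2 * INT_MAX overflows signed arithmetic. Capping sigma to [0, INT_MAX] and widening to u32 keeps the divisor sane. A standalone sketch (illustrative values; rnd stands in for the kernel's pseudo-random sample):

#include <stdint.h>
#include <stdio.h>

/* Sketch of tabledist()'s uniform branch; illustrative, not kernel code. */
static int64_t uniform_sample(int64_t mu, int32_t sigma, uint32_t rnd)
{
	/*
	 * Buggy form: rnd % (2 * sigma), evaluated in 32-bit signed
	 * arithmetic. A jitter of 0x02000000 << 6 = 0x80000000 makes
	 * 2 * sigma wrap to 0, so the modulo traps; sigma = INT32_MAX
	 * makes 2 * sigma overflow instead.
	 *
	 * Fixed form, assuming sigma was already capped to [0, INT32_MAX]
	 * as netem_change()/get_slot() now do: widen to u32 so the
	 * divisor 2 * sigma stays nonzero and positive.
	 */
	return ((rnd % (2 * (uint32_t)sigma)) + mu) - sigma;
}

int main(void)
{
	int32_t sigma = INT32_MAX;	/* jitter after capping */
	uint32_t rnd = 123456789u;	/* stand-in random sample */

	printf("divisor = %u\n", 2 * (uint32_t)sigma);	/* 4294967294 */
	printf("sample  = %lld\n", (long long)uniform_sample(0, sigma, rnd));
	return 0;
}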
From: Arjun Roy arjunroy@google.com
stable inclusion from linux-4.19.155 commit f5a4ea7839b15d702fb9c731eddcca3e3fecf8d8
--------------------------------
[ Upstream commit 435ccfa894e35e3d4a1799e6ac030e48a7b69ef5 ]
With SO_RCVLOWAT, under memory pressure, it is possible to enter a state where:
1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.
In this case, we advertise 0 rwnd (due to #3) but the application does not drain the receive queue (no wakeup because of #1 and #2) so the flow stalls.
Modify the heuristic for SO_RCVLOWAT so that, if we are advertising rwnd<=rcv_mss, force a wakeup to prevent a stall.
Without this patch, setting tcp_rmem to 6143 and disabling TCP autotune causes a stalled flow. With this patch, no stall occurs. This is with RPC-style traffic with large messages.
Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Arjun Roy arjunroy@google.com
Acked-by: Soheil Hassas Yeganeh soheil@google.com
Acked-by: Neal Cardwell ncardwell@google.com
Signed-off-by: Eric Dumazet edumazet@google.com
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski kuba@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 net/ipv4/tcp.c       | 2 ++
 net/ipv4/tcp_input.c | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4ce3397e6fcf..98e8ee8bb759 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -495,6 +495,8 @@ static inline bool tcp_stream_is_readable(const struct tcp_sock *tp,
 			return true;
 		if (tcp_rmem_pressure(sk))
 			return true;
+		if (tcp_receive_window(tp) <= inet_csk(sk)->icsk_ack.rcv_mss)
+			return true;
 	}
 	if (sk->sk_prot->stream_memory_read)
 		return sk->sk_prot->stream_memory_read(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8f04d2eb2799..0b3f8637d4fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4712,7 +4712,8 @@ void tcp_data_ready(struct sock *sk)
 	int avail = tp->rcv_nxt - tp->copied_seq;
 
 	if (avail < sk->sk_rcvlowat && !tcp_rmem_pressure(sk) &&
-	    !sock_flag(sk, SOCK_DONE))
+	    !sock_flag(sk, SOCK_DONE) &&
+	    tcp_receive_window(tp) > inet_csk(sk)->icsk_ack.rcv_mss)
 		return;
 
 	sk->sk_data_ready(sk);
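For reference, the socket option at the center of this fix is set from userspace with setsockopt(). A minimal sketch (the 64 KiB threshold is an arbitrary example value):

#include <stdio.h>
#include <sys/socket.h>

/* Ask recv()/poll() not to wake the reader until at least 64 KiB are
 * queued; this is the setting whose stall the patch prevents. */
static int set_rcvlowat(int fd)
{
	int lowat = 64 * 1024;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
		       &lowat, sizeof(lowat)) < 0) {
		perror("setsockopt(SO_RCVLOWAT)");
		return -1;
	}
	return 0;
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0 || set_rcvlowat(fd) < 0)
		return 1;
	return 0;
}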
From: Miklos Szeredi mszeredi@redhat.com
stable inclusion from linux-4.19.155 commit 6ef82327906270f66adb4b13517e818d3a20448e
--------------------------------
commit d78092e4937de9ce55edcb4ee4c5e3c707be0190 upstream.
After unlock_request() pages from the ap->pages[] array may be put (e.g. by aborting the connection) and the pages can be freed.
Prevent use after free by grabbing a reference to the page before calling unlock_request().
The original patch was created by Pradeep P V K.
Reported-by: Pradeep P V K ppvk@codeaurora.org
Cc: stable@vger.kernel.org
Signed-off-by: Miklos Szeredi mszeredi@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/fuse/dev.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index c51c9a6881e4..1ff5a6b21db0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -853,15 +853,16 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
 	struct page *newpage;
 	struct pipe_buffer *buf = cs->pipebufs;
 
+	get_page(oldpage);
 	err = unlock_request(cs->req);
 	if (err)
-		return err;
+		goto out_put_old;
 
 	fuse_copy_finish(cs);
 
 	err = pipe_buf_confirm(cs->pipe, buf);
 	if (err)
-		return err;
+		goto out_put_old;
 
 	BUG_ON(!cs->nr_segs);
 	cs->currbuf = buf;
@@ -901,7 +902,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
 	err = replace_page_cache_page(oldpage, newpage, GFP_KERNEL);
 	if (err) {
 		unlock_page(newpage);
-		return err;
+		goto out_put_old;
 	}
 
 	get_page(newpage);
@@ -920,14 +921,19 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
 	if (err) {
 		unlock_page(newpage);
 		put_page(newpage);
-		return err;
+		goto out_put_old;
 	}
 
 	unlock_page(oldpage);
+	/* Drop ref for ap->pages[] array */
 	put_page(oldpage);
 	cs->len = 0;
 
-	return 0;
+	err = 0;
+out_put_old:
+	/* Drop ref obtained in this function */
+	put_page(oldpage);
+	return err;
 
 out_fallback_unlock:
 	unlock_page(newpage);
@@ -936,10 +942,10 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
 	cs->offset = buf->offset;
 
 	err = lock_request(cs->req);
-	if (err)
-		return err;
+	if (!err)
+		err = 1;
 
-	return 1;
+	goto out_put_old;
 }
 
 static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
@@ -951,14 +957,16 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
 	if (cs->nr_segs == cs->pipe->buffers)
 		return -EIO;
 
+	get_page(page);
 	err = unlock_request(cs->req);
-	if (err)
+	if (err) {
+		put_page(page);
 		return err;
+	}
 
 	fuse_copy_finish(cs);
 
 	buf = cs->pipebufs;
-	get_page(page);
 	buf->page = page;
 	buf->offset = offset;
 	buf->len = count;
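The shape of the fix is the classic "pin before you unlock" rule: take your own reference on an object before dropping whatever currently keeps it alive. A generic userspace sketch of the discipline (hypothetical refcounted object, not fuse code; in the real fix the object is a struct page kept alive by the request's pages array):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	atomic_int refs;
	int payload;
};

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refs, 1);
}

static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		free(o);	/* last reference gone */
}

/* Dropping 'the lock' may let another thread put the only other
 * reference, so pin the object first: the shape of the fuse fix. */
static int use_after_unlock(struct obj *o)
{
	obj_get(o);	/* analogous to get_page() before unlock_request() */
	/* ... unlock here; o->payload remains safe to touch ... */
	int val = o->payload;
	obj_put(o);	/* drop the reference we took */
	return val;
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	atomic_init(&o->refs, 1);
	o->payload = 42;
	printf("%d\n", use_after_unlock(o));
	obj_put(o);
	return 0;
}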
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.155 commit ad1e8fc2a51e48c6c51f6ca2236e9b9fcbe21fea
--------------------------------
commit 534cf755d9df99e214ddbe26b91cd4d81d2603e2 upstream.
Issuing a magic-sysrq via the PL011 causes the following lockdep splat, which is easily reproducible under QEMU:
 | sysrq: Changing Loglevel
 | sysrq: Loglevel set to 9
 |
 | ======================================================
 | WARNING: possible circular locking dependency detected
 | 5.9.0-rc7 #1 Not tainted
 | ------------------------------------------------------
 | systemd-journal/138 is trying to acquire lock:
 | ffffab133ad950c0 (console_owner){-.-.}-{0:0}, at: console_lock_spinning_enable+0x34/0x70
 |
 | but task is already holding lock:
 | ffff0001fd47b098 (&port_lock_key){-.-.}-{2:2}, at: pl011_int+0x40/0x488
 |
 | which lock already depends on the new lock.
[...]
 | Possible unsafe locking scenario:
 |
 |       CPU0                    CPU1
 |       ----                    ----
 |  lock(&port_lock_key);
 |                               lock(console_owner);
 |                               lock(&port_lock_key);
 |  lock(console_owner);
 |
 |  *** DEADLOCK ***
The issue being that CPU0 takes 'port_lock' on the irq path in pl011_int() before taking 'console_owner' on the printk() path, whereas CPU1 takes the two locks in the opposite order on the printk() path due to setting the "console_owner" prior to calling into the actual console driver.
Fix this in the same way as the msm-serial driver by dropping 'port_lock' before handling the sysrq.
Cc: stable@vger.kernel.org # 4.19+
Cc: Russell King linux@armlinux.org.uk
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Jiri Slaby jirislaby@kernel.org
Link: https://lore.kernel.org/r/20200811101313.GA6970@willie-the-truck
Signed-off-by: Peter Zijlstra peterz@infradead.org
Tested-by: Will Deacon will@kernel.org
Signed-off-by: Will Deacon will@kernel.org
Link: https://lore.kernel.org/r/20200930120432.16551-1-will@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/tty/serial/amba-pl011.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 8d2cffedbe37..35b86a2d5ddb 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -313,8 +313,9 @@ static void pl011_write(unsigned int val, const struct uart_amba_port *uap,
  */
 static int pl011_fifo_to_tty(struct uart_amba_port *uap)
 {
-	u16 status;
 	unsigned int ch, flag, fifotaken;
+	int sysrq;
+	u16 status;
 
 	for (fifotaken = 0; fifotaken != 256; fifotaken++) {
 		status = pl011_read(uap, REG_FR);
@@ -349,10 +350,12 @@ static int pl011_fifo_to_tty(struct uart_amba_port *uap)
 			flag = TTY_FRAME;
 		}
 
-		if (uart_handle_sysrq_char(&uap->port, ch & 255))
-			continue;
+		spin_unlock(&uap->port.lock);
+		sysrq = uart_handle_sysrq_char(&uap->port, ch & 255);
+		spin_lock(&uap->port.lock);
 
-		uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag);
+		if (!sysrq)
+			uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag);
 	}
 
 	return fifotaken;
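The pattern applied here is generic: never call into another subsystem while holding a lock that subsystem's path may also take; drop the lock, call out, retake it. A hedged pthread sketch (hypothetical lock and handler names, not the driver's code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t port_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for uart_handle_sysrq_char(): may end up in printk(),
 * which takes console locks of its own. */
static int handle_sysrq(int ch)
{
	printf("sysrq '%c'\n", ch);
	return 1;
}

static void rx_path(int ch)
{
	pthread_mutex_lock(&port_lock);
	/* ... FIFO handling under port_lock ... */

	pthread_mutex_unlock(&port_lock);	/* drop before calling out */
	int sysrq = handle_sysrq(ch);
	pthread_mutex_lock(&port_lock);		/* retake afterwards */

	if (!sysrq) {
		/* ... insert the character into the TTY buffer ... */
	}
	pthread_mutex_unlock(&port_lock);
}

int main(void)
{
	rx_path('h');
	return 0;
}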
From: Mateusz Nosek mateusznosek0@gmail.com
stable inclusion from linux-4.19.155 commit ae5aa5685c7cc1def1e07e0ab562087c5b8c78e7
--------------------------------
[ Upstream commit 921c7ebd1337d1a46783d7e15a850e12aed2eaa0 ]
If should_fail_futex() returns true in wake_futex_pi(), then the 'ret' variable is set to -EFAULT and then immediately overwritten. So the failure injection is non-functional.
Fix it by actually leaving the function and returning -EFAULT.
The Fixes tag is kinda blurry because the initial commit which introduced failure injection was already sloppy, but the below mentioned commit broke it completely.
[ tglx: Massaged changelog ]
Fixes: 6b4f4bc9cb22 ("locking/futex: Allow low-level atomic operations to return -EAGAIN")
Signed-off-by: Mateusz Nosek mateusznosek0@gmail.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 kernel/futex.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 28b321e23053..04bc59324d7f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1522,8 +1522,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 	 */
 	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
 
-	if (unlikely(should_fail_futex(true)))
+	if (unlikely(should_fail_futex(true))) {
 		ret = -EFAULT;
+		goto out_unlock;
+	}
 
 	ret = cmpxchg_futex_value_locked(&curval, uaddr, uval, newval);
 	if (!ret && (curval != uval)) {
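Stripped of the futex details, the bug is an error status assigned and then unconditionally clobbered by the next statement. A self-contained sketch (hypothetical should_fail()/do_op() helpers, not the kernel code):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static bool should_fail(void) { return true; }   /* injected failure */
static int do_op(void)        { return 0; }      /* would succeed */

/* BUG: the injected error is overwritten by the very next statement. */
static int buggy(void)
{
	int ret = 0;

	if (should_fail())
		ret = -EFAULT;
	ret = do_op();		/* clobbers -EFAULT */
	return ret;
}

/* FIX: actually take the error path, as the patch does with goto. */
static int fixed(void)
{
	int ret = 0;

	if (should_fail()) {
		ret = -EFAULT;
		goto out;
	}
	ret = do_op();
out:
	return ret;
}

int main(void)
{
	printf("buggy=%d fixed=%d\n", buggy(), fixed()); /* buggy=0 fixed=-14 */
	return 0;
}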
From: Nicholas Piggin npiggin@gmail.com
stable inclusion from linux-4.19.155 commit b664645274e922b3356115be9392a49cf3937ce2
--------------------------------
commit d53c3dfb23c45f7d4f910c3a3ca84bf0a99c6143 upstream.
Reading and modifying current->mm and current->active_mm and switching mm should be done with irqs off, to prevent races seeing an intermediate state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate"). At exec-time when the new mm is activated, the old one should usually be single-threaded and no longer used, unless something else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb switching. Consider the kernel_execve case where the current thread is using a lazy tlb active mm:
call_usermodehelper()
  kernel_execve()
    old_mm = current->mm;
    active_mm = current->active_mm;
    *** preempt *** -------------------->  schedule()
                                             prev->active_mm = NULL;
                                             mmdrop(prev active_mm);
                                           ...
                    <--------------------  schedule()
    current->mm = mm;
    current->active_mm = mm;
    if (!old_mm)
        mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm and ->active_mm are being switched, but the TLB problem requires also holding interrupts off over activate_mm. Unfortunately not all archs can do that yet, e.g., arm defers the switch if irqs are disabled and expects finish_arch_post_lock_switch() to be called to complete the flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates to close the lazy tlb preempt race, and provide an arch option to extend that to activate_mm which allows architectures doing IPI based TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin npiggin@gmail.com
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
[mpe: Manual backport to 4.19 due to membarrier_exec_mmap(mm) changes]
Signed-off-by: Michael Ellerman mpe@ellerman.id.au
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Conflicts:
	fs/exec.c
[yyl: adjust context]
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/Kconfig |  7 +++++++
 fs/exec.c    | 15 ++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 895b252d11b0..0dd478fcee0b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -381,6 +381,13 @@ config HAVE_RCU_TABLE_FREE
 config HAVE_RCU_TABLE_INVALIDATE
 	bool
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+	bool
+	help
+	  Temporary select until all architectures can be converted to have
+	  irqs disabled over activate_mm. Architectures that do IPI based TLB
+	  shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/fs/exec.c b/fs/exec.c
index 15d30416886d..8099ac98e605 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1028,11 +1028,24 @@ static int exec_mmap(struct mm_struct *mm)
 		}
 	}
 	task_lock(tsk);
+
+	local_irq_disable();
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
-	tsk->mm = mm;
 	tsk->active_mm = mm;
+	tsk->mm = mm;
+	/*
+	 * This prevents preemption while active_mm is being loaded and
+	 * it and mm are being updated, which could cause problems for
+	 * lazy tlb mm refcounting when these are updated by context
+	 * switches. Not all architectures can handle irqs off over
+	 * activate_mm yet.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	activate_mm(active_mm, mm);
+	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.155 commit bda64b43564ac3eca39202db4f1462487ca4a6bf
--------------------------------
[ Upstream commit f4c32e87de7d66074d5612567c5eac7325024428 ]
The realtime bitmap and summary files are regular files that are hidden away from the directory tree. Since they're regular files, inode inactivation will try to purge what it thinks are speculative preallocations beyond the incore size of the file. Unfortunately, xfs_growfs_rt forgets to update the incore size when it resizes the inodes, with the result that inactivating the rt inodes at unmount time will cause their contents to be truncated.
Fix this by updating the incore size when we change the ondisk size as part of updating the superblock. Note that we don't do this when we're allocating blocks to the rt inodes because we actually want those blocks to get purged if the growfs fails.
This fixes corruption complaints from the online rtsummary checker when running xfs/233. Since that test requires rmap, one can also trigger this by growing an rt volume, cycling the mount, and creating rt files.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Reviewed-by: Chandan Babu R chandanrlinux@gmail.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/xfs/xfs_rtalloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 08da48b66235..280965fc9bbd 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -998,10 +998,13 @@ xfs_growfs_rt(
 		xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, mp->m_rbmip, XFS_ILOCK_EXCL);
 		/*
-		 * Update the bitmap inode's size.
+		 * Update the bitmap inode's size ondisk and incore. We need
+		 * to update the incore size so that inode inactivation won't
+		 * punch what it thinks are "posteof" blocks.
 		 */
 		mp->m_rbmip->i_d.di_size =
 			nsbp->sb_rbmblocks * nsbp->sb_blocksize;
+		i_size_write(VFS_I(mp->m_rbmip), mp->m_rbmip->i_d.di_size);
 		xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE);
 		/*
 		 * Get the summary inode into the transaction.
@@ -1009,9 +1012,12 @@ xfs_growfs_rt(
 		xfs_ilock(mp->m_rsumip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, mp->m_rsumip, XFS_ILOCK_EXCL);
 		/*
-		 * Update the summary inode's size.
+		 * Update the summary inode's size. We need to update the
+		 * incore size so that inode inactivation won't punch what it
+		 * thinks are "posteof" blocks.
 		 */
 		mp->m_rsumip->i_d.di_size = nmp->m_rsumsize;
+		i_size_write(VFS_I(mp->m_rsumip), mp->m_rsumip->i_d.di_size);
 		xfs_trans_log_inode(tp, mp->m_rsumip, XFS_ILOG_CORE);
 		/*
 		 * Copy summary data from old to new sizes.
From: Valentin Schneider valentin.schneider@arm.com
stable inclusion from linux-4.19.155 commit f3dc5ac4011eac5a3a8b28fc13fe08b7dfb5af34
--------------------------------
[ Upstream commit 3102bc0e6ac752cc5df896acb557d779af4d82a1 ]
In the absence of ACPI or DT topology data, we fall back to haphazardly decoding *something* out of MPIDR. Sadly, the contents of that register are mostly unusable due to the implementation leniency and things like Aff0 having to be capped to 15 (despite being encoded on 8 bits).
Consider a simple system with a single package of 32 cores, all under the same LLC. We ought to be shoving them in the same core_sibling mask, but MPIDR is going to look like:
| CPU  | 0 | ... | 15 | 16 | ... | 31 |
|------+---+-----+----+----+-----+----+
| Aff0 | 0 | ... | 15 |  0 | ... | 15 |
| Aff1 | 0 | ... |  0 |  1 | ... |  1 |
| Aff2 | 0 | ... |  0 |  0 | ... |  0 |
Which will eventually yield
core_sibling(0-15)  == 0-15
core_sibling(16-31) == 16-31
NUMA woes
=========
If we try to play games with this and set up NUMA boundaries within those groups of 16 cores via e.g. QEMU:
# Node0: 0-9; Node1: 10-19
$ qemu-system-aarch64 <blah> \
  -smp 20 -numa node,cpus=0-9,nodeid=0 -numa node,cpus=10-19,nodeid=1
The scheduler's MC domain (all CPUs with same LLC) is going to be built via
arch_topology.c::cpu_coregroup_mask()
In there we try to figure out a sensible mask out of the topology information we have. In short, here we'll pick the smallest of NUMA or core sibling mask.
node_mask(CPU9)    == 0-9
core_sibling(CPU9) == 0-15
MC mask for CPU9 will thus be 0-9, not a problem.
node_mask(CPU10)    == 10-19
core_sibling(CPU10) == 0-15
MC mask for CPU10 will thus be 10-19, not a problem.
node_mask(CPU16)    == 10-19
core_sibling(CPU16) == 16-19
MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two different unique MC spans, and the scheduler has no idea what to make of that. That triggers the WARN_ON() added by commit
ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
Fixing MPIDR-derived topology
=============================
We could try to come up with some cleverer scheme to figure out which of the available masks to pick, but really if one of those masks resulted from MPIDR then it should be discarded because it's bound to be bogus.
I was hoping to give MPIDR a chance for SMT, to figure out which threads are in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out to me that there are systems out there where *all* cores have non-zero values in their higher affinity fields (e.g. RK3288 has "5" in all of its cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
Stop using MPIDR for topology information. When no other source of topology information is available, mark each CPU as its own core and its NUMA node as its LLC domain.
Signed-off-by: Valentin Schneider valentin.schneider@arm.com
Reviewed-by: Sudeep Holla sudeep.holla@arm.com
Link: https://lore.kernel.org/r/20200829130016.26106-1-valentin.schneider@arm.com
Signed-off-by: Will Deacon will@kernel.org
Signed-off-by: Sasha Levin sashal@kernel.org
Conflicts:
	arch/arm64/kernel/topology.c
[yyl: resolve conflicts introduced by ARM_CPU_IMP_PHYTIUM]
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/arm64/kernel/topology.c | 43 +++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index d6fcafa22f31..4c22fe0cd3c6 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -272,26 +272,29 @@ void store_cpu_topology(unsigned int cpuid)
 	if (mpidr & MPIDR_UP_BITMASK)
 		return;
 
-	/* Create cpu topology mapping based on MPIDR. */
-	if (mpidr & MPIDR_MT_BITMASK) {
-		/* Multiprocessor system : Multi-threads per core */
-		cpuid_topo->thread_id  = MPIDR_AFFINITY_LEVEL(mpidr, 0);
-		cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 1);
-		cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) |
-					 MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8;
-	} else {
-		/* Multiprocessor system : Single-thread per core */
-		cpuid_topo->thread_id  = -1;
-		cpuid_topo->core_id    = MPIDR_AFFINITY_LEVEL(mpidr, 0);
-		cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1) |
-					 MPIDR_AFFINITY_LEVEL(mpidr, 2) << 8 |
-					 MPIDR_AFFINITY_LEVEL(mpidr, 3) << 16;
-
-		if (read_cpuid_implementor() == ARM_CPU_IMP_PHYTIUM) {
-			cpuid_topo->thread_id = 0;
-			cpuid_topo->core_id = cpuid;
-			cpuid_topo->package_id = 0;
-		}
+	/*
+	 * This would be the place to create cpu topology based on MPIDR.
+	 *
+	 * However, it cannot be trusted to depict the actual topology; some
+	 * pieces of the architecture enforce an artificial cap on Aff0 values
+	 * (e.g. GICv3's ICC_SGI1R_EL1 limits it to 15), leading to an
+	 * artificial cycling of Aff1, Aff2 and Aff3 values. IOW, these end up
+	 * having absolutely no relationship to the actual underlying system
+	 * topology, and cannot be reasonably used as core / package ID.
+	 *
+	 * If the MT bit is set, Aff0 *could* be used to define a thread ID, but
+	 * we still wouldn't be able to obtain a sane core ID. This means we
+	 * need to entirely ignore MPIDR for any topology deduction.
+	 */
+	cpuid_topo->thread_id  = -1;
+	cpuid_topo->core_id    = cpuid;
+	cpuid_topo->package_id = cpu_to_node(cpuid);
+
+	if (!(mpidr & MPIDR_MT_BITMASK) &&
+	    read_cpuid_implementor() == ARM_CPU_IMP_PHYTIUM) {
+		cpuid_topo->thread_id = 0;
+		cpuid_topo->core_id = cpuid;
+		cpuid_topo->package_id = 0;
 	}
pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
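To make the Aff-field cycling concrete, a small decoder for the affinity fields (the bit layout, Aff0 [7:0], Aff1 [15:8], Aff2 [23:16], Aff3 [39:32], is the architectural one; the per-CPU MPIDR values below are synthetic, mirroring the 32-core example above):

#include <stdint.h>
#include <stdio.h>

static unsigned int aff(uint64_t mpidr, int level)
{
	static const int shift[] = { 0, 8, 16, 32 };

	return (mpidr >> shift[level]) & 0xff;
}

int main(void)
{
	/* 32 cores in one package, but Aff0 capped at 15: the cap makes
	 * Aff1 cycle even though the "package boundary" it suggests is
	 * fiction. */
	for (int cpu = 0; cpu < 32; cpu += 15) {
		uint64_t mpidr = ((uint64_t)(cpu / 16) << 8) | (cpu % 16);

		printf("CPU%-2d Aff0=%u Aff1=%u Aff2=%u\n",
		       cpu, aff(mpidr, 0), aff(mpidr, 1), aff(mpidr, 2));
	}
	return 0;
}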
From: Lang Dai lang.dai@intel.com
stable inclusion from linux-4.19.155 commit ed23397d1daabd47fe463a3c0d5cfdd495448ccb
--------------------------------
[ Upstream commit 8fd0e2a6df262539eaa28b0a2364cca10d1dc662 ]
uio_register_device() does two things:
1) get a uio id from a global pool, e.g. the id is <A>
2) create file nodes like /sys/class/uio/uio<A>

uio_unregister_device() does two things:
1) free the uio id <A> and return it to the global pool
2) free the file node /sys/class/uio/uio<A>

There is a situation where one worker is calling uio_unregister_device() while another worker is calling uio_register_device(). If the two workers are X and Y, they go through the below sequence:
1) X frees the uio id <AAA>
2) Y gets the uio id <AAA>
3) Y creates the file node /sys/class/uio/uio<AAA>
4) X frees the file node /sys/class/uio/uio<AAA>
It will then fail at the 3rd step and cause the phenomenon we saw, as it is creating a duplicated file node.
Failure reports as follows:
sysfs: cannot create duplicate filename '/class/uio/uio10'
Call Trace:
 sysfs_do_create_link_sd.isra.2+0x9e/0xb0
 sysfs_create_link+0x25/0x40
 device_add+0x2c4/0x640
 __uio_register_device+0x1c5/0x576 [uio]
 adf_uio_init_bundle_dev+0x231/0x280 [intel_qat]
 adf_uio_register+0x1c0/0x340 [intel_qat]
 adf_dev_start+0x202/0x370 [intel_qat]
 adf_dev_start_async+0x40/0xa0 [intel_qat]
 process_one_work+0x14d/0x410
 worker_thread+0x4b/0x460
 kthread+0x105/0x140
 ? process_one_work+0x410/0x410
 ? kthread_bind+0x40/0x40
 ret_from_fork+0x1f/0x40
Code: 85 c0 48 89 c3 74 12 b9 00 10 00 00 48 89 c2 31 f6 4c 89 ef e8 ec c4 ff ff 4c 89 e2 48 89 de 48 c7 c7 e8 b4 ee b4 e8 6a d4 d7 ff <0f> 0b 48 89 df e8 20 fa f3 ff 5b 41 5c 41 5d 5d c3 66 0f 1f 84
---[ end trace a7531c1ed5269e84 ]---
c6xxvf b002:00:00.0: Failed to register UIO devices
c6xxvf b002:00:00.0: Failed to register UIO devices
Signed-off-by: Lang Dai lang.dai@intel.com
Link: https://lore.kernel.org/r/1600054002-17722-1-git-send-email-lang.dai@intel.c...
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/uio/uio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index 5fd5a1d302f6..e743246aaeac 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -1008,8 +1008,6 @@ void uio_unregister_device(struct uio_info *info)
 
 	idev = info->uio_dev;
 
-	uio_free_minor(idev);
-
 	mutex_lock(&idev->info_lock);
 	uio_dev_del_attributes(idev);
 
@@ -1024,6 +1022,8 @@ void uio_unregister_device(struct uio_info *info)
 
 	device_unregister(&idev->dev);
 
+	uio_free_minor(idev);
+
 	return;
 }
 EXPORT_SYMBOL_GPL(uio_unregister_device);
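The fix is purely an ordering rule: an identifier must not return to its allocator while objects named after it still exist. A toy sketch of that rule (hypothetical id pool and node helpers, not the uio code):

#include <stdbool.h>
#include <stdio.h>

#define MAX_IDS 4
static bool id_used[MAX_IDS];

static int id_alloc(void)
{
	for (int i = 0; i < MAX_IDS; i++)
		if (!id_used[i]) {
			id_used[i] = true;
			return i;
		}
	return -1;
}

static void id_free(int id) { id_used[id] = false; }

/* Stand-ins for sysfs node creation/removal keyed by the id. */
static void node_create(int id) { printf("create uio%d\n", id); }
static void node_remove(int id) { printf("remove uio%d\n", id); }

static void unregister_fixed(int id)
{
	node_remove(id);	/* tear down everything using the id... */
	id_free(id);		/* ...only then recycle it */
}

int main(void)
{
	int id = id_alloc();

	node_create(id);
	/* Freeing the id *before* node_remove() would let a concurrent
	 * register reuse it and collide on "create uio0": the bug. */
	unregister_fixed(id);
	return 0;
}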
From: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn
stable inclusion from linux-4.19.155 commit 0a2cb1eae0a70acf701600b510de04c55404bb31
--------------------------------
[ Upstream commit a194c5f2d2b3a05428805146afcabe5140b5d378 ]
The @node passed to cpumask_of_node() can be NUMA_NO_NODE, in that case it will trigger the following WARN_ON(node >= nr_node_ids) due to mismatched data types of @node and @nr_node_ids. Actually we should return cpu_all_mask just like most other architectures do if passed NUMA_NO_NODE.
Also add a similar check to the inline cpumask_of_node() in numa.h.
Signed-off-by: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn
Reviewed-by: Gavin Shan gshan@redhat.com
Link: https://lore.kernel.org/r/20200921023936.21846-1-liuzhengyuan@tj.kylinos.cn
Signed-off-by: Will Deacon will@kernel.org
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/arm64/include/asm/numa.h | 3 +++
 arch/arm64/mm/numa.c          | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 563461a14a43..43bfff72a32f 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -28,6 +28,9 @@ const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+	if (node == NUMA_NO_NODE)
+		return cpu_all_mask;
+
 	return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index c50d22205f8b..a9d3ad5ee0cc 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -85,7 +85,11 @@ EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
-	if (WARN_ON(node >= nr_node_ids))
+
+	if (node == NUMA_NO_NODE)
+		return cpu_all_mask;
+
+	if (WARN_ON(node < 0 || node >= nr_node_ids))
 		return cpu_none_mask;
 
 	if (WARN_ON(node_to_cpumask_map[node] == NULL))
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.155 commit 451a7c74c23b2ec42410ca5a7110c1a85ebae688
--------------------------------
[ Upstream commit 8df0fa39bdd86ca81a8d706a6ed9d33cc65ca625 ]
When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent to be unmapped from the given file fork without the extent being freed. We do this for non-rt files, but we forgot to do this for realtime files. So far this isn't a big deal since nobody makes a bunmapi call to a rt file with the REMAP flag set, but don't leave a logic bomb.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Dave Chinner dchinner@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/xfs/libxfs/xfs_bmap.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a902f3b6e7db..2eb1cb71a43d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4969,20 +4969,25 @@ xfs_bmap_del_extent_real(
 
 	flags = XFS_ILOG_CORE;
 	if (whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip)) {
-		xfs_fsblock_t	bno;
 		xfs_filblks_t	len;
 		xfs_extlen_t	mod;
 
-		bno = div_u64_rem(del->br_startblock, mp->m_sb.sb_rextsize,
-				  &mod);
-		ASSERT(mod == 0);
 		len = div_u64_rem(del->br_blockcount, mp->m_sb.sb_rextsize,
 				  &mod);
 		ASSERT(mod == 0);
 
-		error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-		if (error)
-			goto done;
+		if (!(bflags & XFS_BMAPI_REMAP)) {
+			xfs_fsblock_t	bno;
+
+			bno = div_u64_rem(del->br_startblock,
+					mp->m_sb.sb_rextsize, &mod);
+			ASSERT(mod == 0);
+
+			error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
+			if (error)
+				goto done;
+		}
+
 		do_fx = 0;
 		nblks = len * mp->m_sb.sb_rextsize;
 		qfield = XFS_TRANS_DQ_RTBCOUNT;
From: Jonathan Cameron Jonathan.Cameron@huawei.com
stable inclusion from linux-4.19.155 commit d2828ab0a92683e56e210cae4769af1f83519197
--------------------------------
[ Upstream commit 8a3decac087aa897df5af04358c2089e52e70ac4 ]
The function should check the validity of the pxm value before using it to index the pxm_to_node_map[] array.
Whilst hardening this code may be good in general, the main intent here is to enable following patches that use this function to replace acpi_map_pxm_to_node() for non SRAT usecases which should return NO_NUMA_NODE for PXM entries not matching with those in SRAT.
Signed-off-by: Jonathan Cameron Jonathan.Cameron@huawei.com
Reviewed-by: Barry Song song.bao.hua@hisilicon.com
Reviewed-by: Hanjun Guo guohanjun@huawei.com
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/acpi/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 0da58f0bf7e5..a28ff3cfbc29 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -46,7 +46,7 @@ int acpi_numa __initdata;
 
 int pxm_to_node(int pxm)
 {
-	if (pxm < 0)
+	if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off)
 		return NUMA_NO_NODE;
 	return pxm_to_node_map[pxm];
 }
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.155 commit d808c88f0b88697d7df75eb233b50c235db34b24
--------------------------------
[ Upstream commit e0770e91424f694b461141cbc99adf6b23006b60 ]
When we try to use file already used as a quota file again (for the same or different quota type), strange things can happen. At the very least lockdep annotations may be wrong but also inode flags may be wrongly set / reset. When the file is used for two quota types at once we can even corrupt the file and likely crash the kernel. Catch all these cases by checking whether passed file is already used as quota file and bail early in that case.
This fixes occasional generic/219 failure due to lockdep complaint.
Reviewed-by: Andreas Dilger adilger@dilger.ca
Reported-by: Ritesh Harjani riteshh@linux.ibm.com
Signed-off-by: Jan Kara jack@suse.cz
Link: https://lore.kernel.org/r/20201015110330.28716-1-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/super.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1af560c0b9f6..3972580c66ae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5818,6 +5818,11 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
 	/* Quotafile not on the same filesystem? */
 	if (path->dentry->d_sb != sb)
 		return -EXDEV;
+
+	/* Quota already enabled for this file? */
+	if (IS_NOQUOTA(d_inode(path->dentry)))
+		return -EBUSY;
+
 	/* Journaling quota? */
 	if (EXT4_SB(sb)->s_qf_names[type]) {
 		/* Quotafile not in fs root? */
From: Ronnie Sahlberg lsahlber@redhat.com
stable inclusion from linux-4.19.155 commit 80dc9c4266b5696e797bded1d0f6087446d7cf9d
--------------------------------
[ Upstream commit c6cc4c5a72505a0ecefc9b413f16bec512f38078 ]
RHBZ: 1848178
Some calls that set attributes, like utimensat(), are not supposed to return -EINTR and thus do not have handlers for this in glibc, which causes us to leak -EINTR to the applications, which are also unprepared to handle it.
For example tar will break if utimensat() return -EINTR and abort unpacking the archive. Other applications may break too.
To handle this we add checks, and retry, for -EINTR in cifs_setattr()
Signed-off-by: Ronnie Sahlberg lsahlber@redhat.com
Signed-off-by: Steve French stfrench@microsoft.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/cifs/inode.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 4a38f16d944d..d30eb4350656 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2550,13 +2550,18 @@ cifs_setattr(struct dentry *direntry, struct iattr *attrs)
 {
 	struct cifs_sb_info *cifs_sb = CIFS_SB(direntry->d_sb);
 	struct cifs_tcon *pTcon = cifs_sb_master_tcon(cifs_sb);
+	int rc, retries = 0;
 
-	if (pTcon->unix_ext)
-		return cifs_setattr_unix(direntry, attrs);
-
-	return cifs_setattr_nounix(direntry, attrs);
+	do {
+		if (pTcon->unix_ext)
+			rc = cifs_setattr_unix(direntry, attrs);
+		else
+			rc = cifs_setattr_nounix(direntry, attrs);
+		retries++;
+	} while (is_retryable_error(rc) && retries < 2);
 
 	/* BB: add cifs_setattr_legacy for really old servers */
+	return rc;
 }
 
 #if 0
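From the application side, the defensive retry idiom that most programs omit for these calls would look roughly like this (a sketch; with this kernel fix applied, applications shouldn't need it for cifs):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Retry utimensat() on EINTR: the handling tar and friends lack. */
static int utimens_retry(const char *path, const struct timespec times[2])
{
	int rc;

	do {
		rc = utimensat(AT_FDCWD, path, times, 0);
	} while (rc < 0 && errno == EINTR);
	return rc;
}

int main(int argc, char **argv)
{
	if (argc > 1 && utimens_retry(argv[1], NULL) < 0)
		perror("utimensat");
	return 0;
}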
From: Xiubo Li xiubli@redhat.com
stable inclusion from linux-4.19.155 commit 2ef6f4bd60411934e3fc2715442c2afe70f84bf3
--------------------------------
[ Upstream commit 87aac3a80af5cbad93e63250e8a1e19095ba0d30 ]
There is one race case for ceph's rbd-nbd tool. When doing the mapping it may fail with EBUSY from ioctl(nbd, NBD_DO_IT), but actually the nbd device has already been unmapped.

This happens if, just after the wake_up(), the recv_work() is scheduled out and defers calling nbd_config_put(): though the map process has exited, "nbd->recv_task" is not cleared.
Signed-off-by: Xiubo Li xiubli@redhat.com
Reviewed-by: Josef Bacik josef@toxicpanda.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 292129174c38..c3b1d53c9be8 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -758,9 +758,9 @@ static void recv_work(struct work_struct *work)
 
 		blk_mq_complete_request(blk_mq_rq_from_pdu(cmd));
 	}
+	nbd_config_put(nbd);
 	atomic_dec(&config->recv_threads);
 	wake_up(&config->recv_wq);
-	nbd_config_put(nbd);
 	kfree(args);
 }
From: Douglas Gilbert dgilbert@interlog.com
stable inclusion from linux-4.19.155 commit e6786fd18fe2b91a4844f7d1606c50e09d4cebcf
--------------------------------
[ Upstream commit b2a182a40278bc5849730e66bca01a762188ed86 ]
sgl_alloc_order() can fail when 'length' is large on a memory constrained system. When order > 0 it will potentially be making several multi-page allocations with the later ones more likely to fail than the earlier one. So it is important that sgl_alloc_order() frees up any pages it has obtained before returning NULL. In the case when order > 0 it calls the wrong free page function and leaks. In testing the leak was sufficient to bring down my 8 GiB laptop with OOM.
Reviewed-by: Bart Van Assche bvanassche@acm.org
Signed-off-by: Douglas Gilbert dgilbert@interlog.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 lib/scatterlist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 8c3036c37ba0..cf18373f4f24 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -506,7 +506,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 		elem_len = min_t(u64, length, PAGE_SIZE << order);
 		page = alloc_pages(gfp, order);
 		if (!page) {
-			sgl_free(sgl);
+			sgl_free_order(sgl, order);
 			return NULL;
 		}
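The bug class here is unwinding a partially built result with a deallocator that doesn't match how it was allocated. A generic cleanup-on-failure sketch (toy allocator, not the scatterlist API):

#include <stdlib.h>

/* Allocate n buffers of sz bytes; on mid-loop failure, free exactly
 * what was allocated, with the matching deallocator. This is the
 * sgl_alloc_order() lesson: its unwind path assumed order-0 pages
 * and leaked the larger order > 0 allocations. */
static void **alloc_all(size_t n, size_t sz)
{
	void **v = calloc(n, sizeof(*v));

	if (!v)
		return NULL;

	for (size_t i = 0; i < n; i++) {
		v[i] = malloc(sz);
		if (!v[i]) {
			while (i--)
				free(v[i]);	/* unwind partial work */
			free(v);
			return NULL;
		}
	}
	return v;
}

int main(void)
{
	void **v = alloc_all(8, 4096);

	if (v) {
		for (size_t i = 0; i < 8; i++)
			free(v[i]);
		free(v);
	}
	return 0;
}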
From: Jiri Olsa jolsa@kernel.org
stable inclusion from linux-4.19.155 commit 753e8a597d54209aa97aa985342f7f98d4240be4
--------------------------------
commit 6fcd5ddc3b1467b3586972ef785d0d926ae4cdf4 upstream.
Hagen reported broken strings in python3 tracepoint scripts:
  make PYTHON=python3
  perf record -e sched:sched_switch -a -- sleep 5
  perf script --gen-script py
  perf script -s ./perf-script.py

  [..]
  sched__sched_switch 7 563231.759525792 0 swapper prev_comm=bytearray(b'swapper/7\x00\x00\x00\x00\x00\x00\x00'), prev_pid=0, prev_prio=120, prev_state=, next_comm=bytearray(b'mutex-thread-co\x00'),
The problem is in the is_printable_array function, which does not take the zero byte into account and claims such a string is not printable, so the code will create a byte array instead of a string.
Committer testing:
After this fix:
sched__sched_switch 3 484522.497072626 1158680 kworker/3:0-eve prev_comm=kworker/3:0, prev_pid=1158680, prev_prio=120, prev_state=I, next_comm=swapper/3, next_pid=0, next_prio=120 Sample: {addr=0, cpu=3, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1158680, tid=1158680, time=484522497072626, transaction=0, values=[(0, 0)], weight=0}
sched__sched_switch 4 484522.497085610 1225814 perf prev_comm=perf, prev_pid=1225814, prev_prio=120, prev_state=, next_comm=migration/4, next_pid=30, next_prio=0 Sample: {addr=0, cpu=4, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1225814, tid=1225814, time=484522497085610, transaction=0, values=[(0, 0)], weight=0}
Fixes: 249de6e07458 ("perf script python: Fix string vs byte array resolving")
Signed-off-by: Jiri Olsa jolsa@kernel.org
Tested-by: Arnaldo Carvalho de Melo acme@redhat.com
Tested-by: Hagen Paul Pfeifer hagen@jauu.net
Cc: Alexander Shishkin alexander.shishkin@linux.intel.com
Cc: Mark Rutland mark.rutland@arm.com
Cc: Michael Petlan mpetlan@redhat.com
Cc: Namhyung Kim namhyung@kernel.org
Cc: Peter Zijlstra peterz@infradead.org
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/20200928201135.3633850-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 tools/perf/util/print_binary.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/print_binary.c b/tools/perf/util/print_binary.c
index 23e367063446..71aeaf6f45cc 100644
--- a/tools/perf/util/print_binary.c
+++ b/tools/perf/util/print_binary.c
@@ -50,7 +50,7 @@ int is_printable_array(char *p, unsigned int len)
 
 	len--;
 
-	for (i = 0; i < len; i++) {
+	for (i = 0; i < len && p[i]; i++) {
 		if (!isprint(p[i]) && !isspace(p[i]))
 			return 0;
 	}
From: Qiujun Huang hqjagain@gmail.com
stable inclusion from linux-4.19.155 commit 57ebe9102928e835d72ffec43c7995723cfea132
--------------------------------
commit 0a1754b2a97efa644aa6e84d1db5b17c42251483 upstream.
We don't need to check the new buffer size, and the return value had confused resize_buffer_duplicate_size().

	...
	ret = ring_buffer_resize(trace_buf->buffer,
		per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id);
	if (ret == 0)
		per_cpu_ptr(trace_buf->data, cpu_id)->entries =
			per_cpu_ptr(size_buf->data, cpu_id)->entries;
	...
Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org
Fixes: d60da506cbeb3 ("tracing: Add a resize function to make one buffer equivalent to another buffer")
Signed-off-by: Qiujun Huang hqjagain@gmail.com
Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 kernel/trace/ring_buffer.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 564d22691dd7..eef05eb3b284 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1692,18 +1692,18 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
 	unsigned long nr_pages;
-	int cpu, err = 0;
+	int cpu, err;
 
 	/*
 	 * Always succeed at resizing a non-existent buffer:
 	 */
 	if (!buffer)
-		return size;
+		return 0;
 
 	/* Make sure the requested buffer exists */
 	if (cpu_id != RING_BUFFER_ALL_CPUS &&
 	    !cpumask_test_cpu(cpu_id, buffer->cpumask))
-		return size;
+		return 0;
 
 	nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
 
@@ -1843,7 +1843,7 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
 	}
 
 	mutex_unlock(&buffer->mutex);
-	return size;
+	return 0;
 
 out_err:
	for_each_buffer_cpu(buffer, cpu) {
From: Eric Biggers ebiggers@google.com
stable inclusion from linux-4.19.155 commit 314b5a46c65726750fa1fa3e665df9485d09551a
--------------------------------
commit cb8d53d2c97369029cc638c9274ac7be0a316c75 upstream.
ext4_unregister_sysfs() only deletes the kobject. The reference to it needs to be put separately, like ext4_put_super() does.
This addresses the syzbot report "memory leak in kobject_set_name_vargs (3)" (https://syzkaller.appspot.com/bug?extid=9f864abad79fae7c17e1).
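For reference, a minimal sketch of the kobject rule being applied (hypothetical helper, not ext4 code): deleting the sysfs entry and dropping the creation reference are two separate steps, and error paths need both, just as ext4_put_super() already does.

    #include <linux/kobject.h>

    /* Hypothetical teardown helper: unregistering/deleting a kobject
     * only removes it from sysfs; the reference taken when it was
     * added must still be dropped with kobject_put(). */
    static void example_teardown(struct kobject *kobj)
    {
            kobject_del(kobj);      /* remove the sysfs entry */
            kobject_put(kobj);      /* drop the creation reference */
    }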
Reported-by: syzbot+9f864abad79fae7c17e1@syzkaller.appspotmail.com Fixes: 72ba74508b28 ("ext4: release sysfs kobject when failing to enable quotas on mount") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers ebiggers@google.com Link: https://lore.kernel.org/r/20200922162456.93657-1-ebiggers@kernel.org Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3972580c66ae..9b1ab74df916 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4652,6 +4652,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount8: ext4_unregister_sysfs(sb); + kobject_put(&sbi->s_kobj); failed_mount7: ext4_unregister_li_request(sb); failed_mount6:
From: Dinghao Liu dinghao.liu@zju.edu.cn
stable inclusion from linux-4.19.155 commit 6ad22dc99cb627fa13812bc0e27a6595bba62f5e
--------------------------------
commit c9e87161cc621cbdcfc472fa0b2d81c63780c8f5 upstream.
When ext4_journal_get_write_access() fails, we should terminate the execution flow and release n_group_desc, iloc.bh, dind and gdb_bh.
Cc: stable@kernel.org Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Reviewed-by: Andreas Dilger adilger@dilger.ca Link: https://lore.kernel.org/r/20200829025403.3139-1-dinghao.liu@zju.edu.cn Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/resize.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index e9409ecda9c8..6a0c5c880354 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -843,8 +843,10 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
BUFFER_TRACE(dind, "get_write_access"); err = ext4_journal_get_write_access(handle, dind); - if (unlikely(err)) + if (unlikely(err)) { ext4_std_error(sb, err); + goto errout; + }
/* ext4_reserve_inode_write() gets a reference on the iloc */ err = ext4_reserve_inode_write(handle, inode, &iloc);
From: Luo Meng luomeng12@huawei.com
stable inclusion from linux-4.19.155 commit 7b97149296ef752a1576d0939f629dd30e83e03d
--------------------------------
commit 1322181170bb01bce3c228b82ae3d5c6b793164f upstream.
During a stability test, errors like the following appeared:

  ext4_lookup:1590: inode #6967: comm fsstress: iget: checksum invalid.
If inode->i_blocks is too big and the huge-file flag is not set, the checksum will not be recalculated when the inode information is updated into its buffer. If another inode then marks the buffer dirty, the inconsistent inode will be flushed to disk.
Fix this problem by checking i_blocks in advance.
Cc: stable@kernel.org Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Link: https://lore.kernel.org/r/20201020013631.3796673-1-luomeng12@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/inode.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0bde4afe6b7f..061d4f6b0844 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5285,6 +5285,12 @@ static int ext4_do_update_inode(handle_t *handle, if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);
+ err = ext4_inode_blocks_set(handle, raw_inode, ei); + if (err) { + spin_unlock(&ei->i_raw_lock); + goto out_brelse; + } + raw_inode->i_mode = cpu_to_le16(inode->i_mode); i_uid = i_uid_read(inode); i_gid = i_gid_read(inode); @@ -5318,11 +5324,6 @@ static int ext4_do_update_inode(handle_t *handle, EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);
- err = ext4_inode_blocks_set(handle, raw_inode, ei); - if (err) { - spin_unlock(&ei->i_raw_lock); - goto out_brelse; - } raw_inode->i_dtime = cpu_to_le32(ei->i_dtime); raw_inode->i_flags = cpu_to_le32(ei->i_flags & 0xFFFFFFFF); if (likely(!test_opt2(inode->i_sb, HURD_COMPAT)))
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
stable inclusion from linux-4.19.155 commit 76b712488dcfc71c158ba2e2785c23aca3ed2924
--------------------------------
commit d5dcce0c414fcbfe4c2037b66ac69ea5f9b3f75c upstream.
Primary and secondary describe the type of the nodes, which might define their ordering. However, if the primary node is gone, we cannot maintain the ordering by definition of the linked list; the secondary node then becomes first in the list, but its meaning is still secondary (or auxiliary). The type of the node is tracked by the secondary pointer in it:
  secondary pointer     Meaning
  NULL or valid         primary node
  ERR_PTR(-ENODEV)      secondary node
So, if for some reason we do the following sequence of calls:

  set_primary_fwnode(dev, NULL);
  set_primary_fwnode(dev, primary);

we should preserve the secondary node.
This concept is supported by the description of set_primary_fwnode() along with the implementation of set_secondary_fwnode(). Hence, fix commit c15e1bdda436 to follow this as well.
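For reference, the driver core infers the node type from that pointer; a sketch of the check, matching the fwnode_is_primary() helper in drivers/base/core.c:

    #include <linux/err.h>
    #include <linux/fwnode.h>

    /* A primary node has secondary == NULL or a valid pointer; a
     * secondary node carries ERR_PTR(-ENODEV), per the table above. */
    static inline bool fwnode_is_primary(struct fwnode_handle *fwnode)
    {
            return fwnode && !IS_ERR_OR_NULL(fwnode->secondary);
    }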
Fixes: c15e1bdda436 ("device property: Fix the secondary firmware node handling in set_primary_fwnode()") Cc: Ferry Toth fntoth@gmail.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Reviewed-by: Heikki Krogerus heikki.krogerus@linux.intel.com Tested-by: Ferry Toth fntoth@gmail.com Cc: 5.9+ stable@vger.kernel.org # 5.9+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/base/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index b911c38ad18c..b2e2f086f586 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -3347,7 +3347,7 @@ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode) } else { if (fwnode_is_primary(fn)) { dev->fwnode = fn->secondary; - fn->secondary = NULL; + fn->secondary = ERR_PTR(-ENODEV); } else { dev->fwnode = NULL; }
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
stable inclusion from linux-4.19.155 commit 9169e2f12323994af8244dd64c036f1d974ea51d
--------------------------------
commit 99aed9227073fb34ce2880cbc7063e04185a65e1 upstream.
It appears that firmware nodes can be shared between devices. In such a case, when a (child) device is about to be deleted, its firmware node may be shared, and the ACPI_COMPANION_SET(..., NULL) call for it breaks the secondary link of the shared primary firmware node.

To prevent that, check whether the device has a parent whose firmware node is shared with the child, and avoid breaking the link in that case.
Fixes: c15e1bdda436 ("device property: Fix the secondary firmware node handling in set_primary_fwnode()") Reported-by: Ferry Toth fntoth@gmail.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Reviewed-by: Heikki Krogerus heikki.krogerus@linux.intel.com Tested-by: Ferry Toth fntoth@gmail.com Cc: 5.9+ stable@vger.kernel.org # 5.9+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/base/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index b2e2f086f586..f0cdf38ed31c 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -3333,6 +3333,7 @@ static inline bool fwnode_is_primary(struct fwnode_handle *fwnode) */ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode) { + struct device *parent = dev->parent; struct fwnode_handle *fn = dev->fwnode;
if (fwnode) { @@ -3347,7 +3348,8 @@ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode) } else { if (fwnode_is_primary(fn)) { dev->fwnode = fn->secondary; - fn->secondary = ERR_PTR(-ENODEV); + if (!(parent && fn == parent->fwnode)) + fn->secondary = ERR_PTR(-ENODEV); } else { dev->fwnode = NULL; }
From: Oleg Nesterov oleg@redhat.com
stable inclusion from linux-4.19.156 commit caf8f9c19a93ec501d0ffca5e6767a969961baea
--------------------------------
commit 7b3c36fc4c231ca532120bbc0df67a12f09c1d96 upstream.
This testcase
    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <pthread.h>
    #include <assert.h>

    void *tf(void *arg)
    {
            return NULL;
    }

    int main(void)
    {
            int pid = fork();
            if (!pid) {
                    kill(getpid(), SIGSTOP);

                    pthread_t th;
                    pthread_create(&th, NULL, tf, NULL);

                    return 0;
            }

            waitpid(pid, NULL, WSTOPPED);

            ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
            waitpid(pid, NULL, 0);

            ptrace(PTRACE_CONT, pid, 0,0);
            waitpid(pid, NULL, 0);

            int status;
            int thread = waitpid(-1, &status, 0);
            assert(thread > 0 && thread != pid);
            assert(status == 0x80137f);

            return 0;
    }
fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().
This is because task_join_group_stop() has 2 problems when current is traced:
1. We can't rely on the "JOBCTL_STOP_PENDING" check: a stopped tracee can be woken up by the debugger and can clone another thread, which should join the group-stop.
We need to check group_stop_count || SIGNAL_STOP_STOPPED.
2. If SIGNAL_STOP_STOPPED is already set, we should not increment sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread should stop without another do_notify_parent_cldstop() report.
To clarify, the problem is very old and we should blame ptrace_init_task(). But now that we have task_join_group_stop() it makes more sense to fix this helper to avoid the code duplication.
Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Jens Axboe axboe@kernel.dk Cc: Christian Brauner christian@brauner.io Cc: "Eric W . Biederman" ebiederm@xmission.com Cc: Zhiqiang Liu liuzhiqiang26@huawei.com Cc: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/signal.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c index 6fd732c4c30c..deba77ef0573 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -382,16 +382,17 @@ static bool task_participate_group_stop(struct task_struct *task)
void task_join_group_stop(struct task_struct *task) { + unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK; + struct signal_struct *sig = current->signal; + + if (sig->group_stop_count) { + sig->group_stop_count++; + mask |= JOBCTL_STOP_CONSUME; + } else if (!(sig->flags & SIGNAL_STOP_STOPPED)) + return; + /* Have the new thread join an on-going signal group stop */ - unsigned long jobctl = current->jobctl; - if (jobctl & JOBCTL_STOP_PENDING) { - struct signal_struct *sig = current->signal; - unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK; - unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME; - if (task_set_jobctl_pending(task, signr | gstop)) { - sig->group_stop_count++; - } - } + task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING); }
/*
From: Lee Jones lee.jones@linaro.org
stable inclusion from linux-4.19.156 commit 11af858054baadff13eb92eb347443f93768fde7
--------------------------------
commit 9522750c66c689b739e151fcdf895420dc81efc0 upstream.
Commit 6735b4632def ("Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts") introduced the following error when building rpc_defconfig (only this build appears to be affected):
  `acorndata_8x8' referenced in section `.text' of arch/arm/boot/compressed/ll_char_wr.o:
  defined in discarded section `.data' of arch/arm/boot/compressed/font.o
  `acorndata_8x8' referenced in section `.data.rel.ro' of arch/arm/boot/compressed/font.o:
  defined in discarded section `.data' of arch/arm/boot/compressed/font.o
  make[3]: *** [/scratch/linux/arch/arm/boot/compressed/Makefile:191: arch/arm/boot/compressed/vmlinux] Error 1
  make[2]: *** [/scratch/linux/arch/arm/boot/Makefile:61: arch/arm/boot/compressed/vmlinux] Error 2
  make[1]: *** [/scratch/linux/arch/arm/Makefile:317: zImage] Error 2
The .data section is discarded at link time. Reinstating acorndata_8x8 as const ensures it is still available after linking. Do the same for the other built-in fonts as well, for consistency.
Cc: stable@vger.kernel.org Cc: Russell King linux@armlinux.org.uk Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Fixes: 6735b4632def ("Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts") Signed-off-by: Lee Jones lee.jones@linaro.org Co-developed-by: Peilin Ye yepeilin.cs@gmail.com Signed-off-by: Peilin Ye yepeilin.cs@gmail.com Signed-off-by: Daniel Vetter daniel.vetter@ffwll.ch Link: https://patchwork.freedesktop.org/patch/msgid/20201102183242.2031659-1-yepei... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/fonts/font_10x18.c | 2 +- lib/fonts/font_6x10.c | 2 +- lib/fonts/font_6x11.c | 2 +- lib/fonts/font_7x14.c | 2 +- lib/fonts/font_8x16.c | 2 +- lib/fonts/font_8x8.c | 2 +- lib/fonts/font_acorn_8x8.c | 2 +- lib/fonts/font_mini_4x6.c | 2 +- lib/fonts/font_pearl_8x8.c | 2 +- lib/fonts/font_sun12x22.c | 2 +- lib/fonts/font_sun8x16.c | 2 +- 11 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/lib/fonts/font_10x18.c b/lib/fonts/font_10x18.c index 0e2deac97da0..e02f9df24d1e 100644 --- a/lib/fonts/font_10x18.c +++ b/lib/fonts/font_10x18.c @@ -8,7 +8,7 @@
#define FONTDATAMAX 9216
-static struct font_data fontdata_10x18 = { +static const struct font_data fontdata_10x18 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, 0x00, /* 0000000000 */ diff --git a/lib/fonts/font_6x10.c b/lib/fonts/font_6x10.c index 87da8acd07db..6e3c4b7691c8 100644 --- a/lib/fonts/font_6x10.c +++ b/lib/fonts/font_6x10.c @@ -3,7 +3,7 @@
#define FONTDATAMAX 2560
-static struct font_data fontdata_6x10 = { +static const struct font_data fontdata_6x10 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_6x11.c b/lib/fonts/font_6x11.c index 5e975dfa10a5..2d22a24e816f 100644 --- a/lib/fonts/font_6x11.c +++ b/lib/fonts/font_6x11.c @@ -9,7 +9,7 @@
#define FONTDATAMAX (11*256)
-static struct font_data fontdata_6x11 = { +static const struct font_data fontdata_6x11 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_7x14.c b/lib/fonts/font_7x14.c index 86d298f38505..9cc7ae2e03f7 100644 --- a/lib/fonts/font_7x14.c +++ b/lib/fonts/font_7x14.c @@ -8,7 +8,7 @@
#define FONTDATAMAX 3584
-static struct font_data fontdata_7x14 = { +static const struct font_data fontdata_7x14 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 0000000 */ diff --git a/lib/fonts/font_8x16.c b/lib/fonts/font_8x16.c index 37cedd36ca5e..bab25dc59e8d 100644 --- a/lib/fonts/font_8x16.c +++ b/lib/fonts/font_8x16.c @@ -10,7 +10,7 @@
#define FONTDATAMAX 4096
-static struct font_data fontdata_8x16 = { +static const struct font_data fontdata_8x16 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_8x8.c b/lib/fonts/font_8x8.c index 8ab695538395..109d0572368f 100644 --- a/lib/fonts/font_8x8.c +++ b/lib/fonts/font_8x8.c @@ -9,7 +9,7 @@
#define FONTDATAMAX 2048
-static struct font_data fontdata_8x8 = { +static const struct font_data fontdata_8x8 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_acorn_8x8.c b/lib/fonts/font_acorn_8x8.c index 069b3e80c434..fb395f0d4031 100644 --- a/lib/fonts/font_acorn_8x8.c +++ b/lib/fonts/font_acorn_8x8.c @@ -5,7 +5,7 @@
#define FONTDATAMAX 2048
-static struct font_data acorndata_8x8 = { +static const struct font_data acorndata_8x8 = { { 0, 0, FONTDATAMAX, 0 }, { /* 00 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* ^@ */ /* 01 */ 0x7e, 0x81, 0xa5, 0x81, 0xbd, 0x99, 0x81, 0x7e, /* ^A */ diff --git a/lib/fonts/font_mini_4x6.c b/lib/fonts/font_mini_4x6.c index 1449876c6a27..592774a90917 100644 --- a/lib/fonts/font_mini_4x6.c +++ b/lib/fonts/font_mini_4x6.c @@ -43,7 +43,7 @@ __END__;
#define FONTDATAMAX 1536
-static struct font_data fontdata_mini_4x6 = { +static const struct font_data fontdata_mini_4x6 = { { 0, 0, FONTDATAMAX, 0 }, { /*{*/ /* Char 0: ' ' */ diff --git a/lib/fonts/font_pearl_8x8.c b/lib/fonts/font_pearl_8x8.c index 32d65551e7ed..a6f95ebce950 100644 --- a/lib/fonts/font_pearl_8x8.c +++ b/lib/fonts/font_pearl_8x8.c @@ -14,7 +14,7 @@
#define FONTDATAMAX 2048
-static struct font_data fontdata_pearl8x8 = { +static const struct font_data fontdata_pearl8x8 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_sun12x22.c b/lib/fonts/font_sun12x22.c index 641a6b4dca42..a5b65bd49604 100644 --- a/lib/fonts/font_sun12x22.c +++ b/lib/fonts/font_sun12x22.c @@ -3,7 +3,7 @@
#define FONTDATAMAX 11264
-static struct font_data fontdata_sun12x22 = { +static const struct font_data fontdata_sun12x22 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, 0x00, /* 000000000000 */ diff --git a/lib/fonts/font_sun8x16.c b/lib/fonts/font_sun8x16.c index 193fe6d988e0..e577e76a6a7c 100644 --- a/lib/fonts/font_sun8x16.c +++ b/lib/fonts/font_sun8x16.c @@ -3,7 +3,7 @@
#define FONTDATAMAX 4096
-static struct font_data fontdata_sun8x16 = { +static const struct font_data fontdata_sun8x16 = { { 0, 0, FONTDATAMAX, 0 }, { /* */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* */ 0x00,0x00,0x7e,0x81,0xa5,0x81,0x81,0xbd,0x99,0x81,0x81,0x7e,0x00,0x00,0x00,0x00,
From: Shijie Luo luoshijie1@huawei.com
stable inclusion from linux-4.19.156 commit d4fea27ffcb01bbf8ca8ae54f60479945394b83e
--------------------------------
commit 3f08842098e842c51e3b97d0dcdebf810b32558e upstream.
When flags in queue_pages_pte_range() don't have the MPOL_MF_MOVE or MPOL_MF_MOVE_ALL bits set, the code breaks out of the loop early, and passing the original pte - 1 to pte_unmap_unlock() is not a good idea.
queue_pages_pte_range() can run in MPOL_MF_MOVE_ALL mode, which doesn't migrate misplaced pages but returns with EIO when encountering such a page. Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"), an early break on the first pte in the range results in pte_unmap_unlock() on an underflowed pte. This can lead to lockups later on when somebody tries to lock the pte or the page_table_lock again.
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified") Signed-off-by: Shijie Luo luoshijie1@huawei.com Signed-off-by: Miaohe Lin linmiaohe@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Oscar Salvador osalvador@suse.de Acked-by: Michal Hocko mhocko@suse.com Cc: Miaohe Lin linmiaohe@huawei.com Cc: Feilong Lin linfeilong@huawei.com Cc: Shijie Luo luoshijie1@huawei.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/mempolicy.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index cab2ee936b5d..8f420c64934e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -535,7 +535,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long flags = qp->flags; int ret; bool has_unmovable = false; - pte_t *pte; + pte_t *pte, *mapped_pte; spinlock_t *ptl;
ptl = pmd_trans_huge_lock(pmd, vma); @@ -549,7 +549,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0;
- pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; @@ -581,7 +581,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, } else break; } - pte_unmap_unlock(pte - 1, ptl); + pte_unmap_unlock(mapped_pte, ptl); cond_resched();
if (has_unmovable)
From: Zqiang qiang.zhang@windriver.com
stable inclusion from linux-4.19.156 commit 68e8b8ed787a8acafa3e987d52f6d415c5949ec6
--------------------------------
commit 6993d0fdbee0eb38bfac350aa016f65ad11ed3b1 upstream.
There is a small race window when a delayed work is being canceled and the work might still be queued from the timer_fn:

    CPU0                                      CPU1
    kthread_cancel_delayed_work_sync()
      __kthread_cancel_work_sync()
        __kthread_cancel_work()
          work->canceling++;
                                              kthread_delayed_work_timer_fn()
                                                kthread_insert_work();
BUG: kthread_insert_work() should not get called when work->canceling is set.
Signed-off-by: Zqiang qiang.zhang@windriver.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Petr Mladek pmladek@suse.com Acked-by: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/kthread.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c index 7973996e061f..ba1aacf0755a 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -859,7 +859,8 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) /* Move the work from worker->delayed_work_list. */ WARN_ON_ONCE(list_empty(&work->node)); list_del_init(&work->node); - kthread_insert_work(worker, work, &worker->work_list); + if (!work->canceling) + kthread_insert_work(worker, work, &worker->work_list);
spin_unlock(&worker->lock); }
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.156 commit b410d07e965a039dc67073889dab9ff01cee8402
--------------------------------
commit b02414c8f045ab3b9afc816c3735bc98c5c3d262 upstream.
The recursion protection of the ring buffer depends on preempt_count() to be correct. But it is possible that the ring buffer gets called after an interrupt comes in but before it updates the preempt_count(). This will trigger a false positive in the recursion code.
Use the same trick from the ftrace function callback recursion code, which uses a "transition" bit that gets set to allow for a single recursion when handling transitions between contexts.
Cc: stable@vger.kernel.org Fixes: 567cd4da54ff4 ("ring-buffer: User context bit recursion checking") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/trace/ring_buffer.c | 58 ++++++++++++++++++++++++++++++-------- 1 file changed, 46 insertions(+), 12 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index eef05eb3b284..87ce9736043d 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -444,14 +444,16 @@ struct rb_event_info {
/* * Used for which event context the event is in. - * NMI = 0 - * IRQ = 1 - * SOFTIRQ = 2 - * NORMAL = 3 + * TRANSITION = 0 + * NMI = 1 + * IRQ = 2 + * SOFTIRQ = 3 + * NORMAL = 4 * * See trace_recursive_lock() comment below for more details. */ enum { + RB_CTX_TRANSITION, RB_CTX_NMI, RB_CTX_IRQ, RB_CTX_SOFTIRQ, @@ -2620,10 +2622,10 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) * a bit of overhead in something as critical as function tracing, * we use a bitmask trick. * - * bit 0 = NMI context - * bit 1 = IRQ context - * bit 2 = SoftIRQ context - * bit 3 = normal context. + * bit 1 = NMI context + * bit 2 = IRQ context + * bit 3 = SoftIRQ context + * bit 4 = normal context. * * This works because this is the order of contexts that can * preempt other contexts. A SoftIRQ never preempts an IRQ @@ -2646,6 +2648,30 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) * The least significant bit can be cleared this way, and it * just so happens that it is the same bit corresponding to * the current context. + * + * Now the TRANSITION bit breaks the above slightly. The TRANSITION bit + * is set when a recursion is detected at the current context, and if + * the TRANSITION bit is already set, it will fail the recursion. + * This is needed because there's a lag between the changing of + * interrupt context and updating the preempt count. In this case, + * a false positive will be found. To handle this, one extra recursion + * is allowed, and this is done by the TRANSITION bit. If the TRANSITION + * bit is already set, then it is considered a recursion and the function + * ends. Otherwise, the TRANSITION bit is set, and that bit is returned. + * + * On the trace_recursive_unlock(), the TRANSITION bit will be the first + * to be cleared. Even if it wasn't the context that set it. That is, + * if an interrupt comes in while NORMAL bit is set and the ring buffer + * is called before preempt_count() is updated, since the check will + * be on the NORMAL bit, the TRANSITION bit will then be set. If an + * NMI then comes in, it will set the NMI bit, but when the NMI code + * does the trace_recursive_unlock() it will clear the TRANSTION bit + * and leave the NMI bit set. But this is fine, because the interrupt + * code that set the TRANSITION bit will then clear the NMI bit when it + * calls trace_recursive_unlock(). If another NMI comes in, it will + * set the TRANSITION bit and continue. + * + * Note: The TRANSITION bit only handles a single transition between context. */
static __always_inline int @@ -2661,8 +2687,16 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer) bit = pc & NMI_MASK ? RB_CTX_NMI : pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;
- if (unlikely(val & (1 << (bit + cpu_buffer->nest)))) - return 1; + if (unlikely(val & (1 << (bit + cpu_buffer->nest)))) { + /* + * It is possible that this was called by transitioning + * between interrupt context, and preempt_count() has not + * been updated yet. In this case, use the TRANSITION bit. + */ + bit = RB_CTX_TRANSITION; + if (val & (1 << (bit + cpu_buffer->nest))) + return 1; + }
val |= (1 << (bit + cpu_buffer->nest)); cpu_buffer->current_context = val; @@ -2677,8 +2711,8 @@ trace_recursive_unlock(struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->current_context - (1 << cpu_buffer->nest); }
-/* The recursive locking above uses 4 bits */ -#define NESTED_BITS 4 +/* The recursive locking above uses 5 bits */ +#define NESTED_BITS 5
/** * ring_buffer_nest_start - Allow to trace while nested
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.156 commit ee2b95c08515633da67c048beb164966224ffe9e
--------------------------------
commit ee11b93f95eabdf8198edd4668bf9102e7248270 upstream.
The code that checks recursion will only do the recursion check once if there are nested checks. The top one will do the check, and the other nested checks will see that recursion was already checked and return zero for their "bit". On the return side, nothing is done if the "bit" is zero.
The problem is that zero is returned for the "good" bit when in NMI context. This will set the bit for NMIs making it look like *all* NMI tracing is recursing, and prevent tracing of anything in NMI context!
The simple fix is to return "bit + 1" and subtract that bit on the end to get the real bit.
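In condensed form (illustrative names, not the kernel source), the encoding looks like this: zero becomes the reserved "nested check, nothing to clear" value, so the NMI context bit no longer collides with it.

    /* Sketch: return bit + 1 so 0 stays free as a sentinel; the clear
     * path treats 0 as a no-op and subtracts 1 to get the real bit. */
    static int test_and_set_recursion(unsigned long *rec, int bit)
    {
            if (*rec & (1UL << bit))
                    return -1;              /* genuine recursion */
            *rec |= 1UL << bit;
            return bit + 1;                 /* bit 0 (NMI) now returns 1 */
    }

    static void clear_recursion(unsigned long *rec, int ret)
    {
            if (!ret)
                    return;                 /* nothing was set */
            *rec &= ~(1UL << (ret - 1));
    }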
Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/trace/trace.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 8cc1861163dc..868bd11ce72a 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -595,7 +595,7 @@ static __always_inline int trace_test_and_set_recursion(int start, int max) current->trace_recursion = val; barrier();
- return bit; + return bit + 1; }
static __always_inline void trace_clear_recursion(int bit) @@ -605,6 +605,7 @@ static __always_inline void trace_clear_recursion(int bit) if (!bit) return;
+ bit--; bit = 1 << bit; val &= ~bit;
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.156 commit 2de780dfbe1ad05c3f15f8d6f2d22dae9cf95267
--------------------------------
commit 726b3d3f141fba6f841d715fc4d8a4a84f02c02a upstream.
When an interrupt or NMI comes in and switches the context, there's a delay before preempt_count() shows the update. Since preempt_count() is used to detect recursion, with each context setting its own bit when tracing starts, a set bit is normally treated as a recursion and the function exits. But if this happens in the window where the context has changed while preempt_count() has not yet been updated, it will be incorrectly flagged as a recursion.
To handle this case, create another bit, called TRANSITION, and test it if the current context bit is already set. Flag the call as a recursion if the TRANSITION bit is already set; if not, set it and continue. The TRANSITION bit will be cleared normally on the return of the function that set it, or, if the current context bit is clear, set it and clear the TRANSITION bit to allow for another transition between the current context and an even higher one.
Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/trace/trace.h | 23 +++++++++++++++++++++-- kernel/trace/trace_selftest.c | 9 +++++++-- 2 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 868bd11ce72a..4088742fbb08 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -534,6 +534,12 @@ enum {
TRACE_GRAPH_DEPTH_START_BIT, TRACE_GRAPH_DEPTH_END_BIT, + + /* + * When transitioning between context, the preempt_count() may + * not be correct. Allow for a single recursion to cover this case. + */ + TRACE_TRANSITION_BIT, };
#define trace_recursion_set(bit) do { (current)->trace_recursion |= (1<<(bit)); } while (0) @@ -588,8 +594,21 @@ static __always_inline int trace_test_and_set_recursion(int start, int max) return 0;
bit = trace_get_context_bit() + start; - if (unlikely(val & (1 << bit))) - return -1; + if (unlikely(val & (1 << bit))) { + /* + * It could be that preempt_count has not been updated during + * a switch between contexts. Allow for a single recursion. + */ + bit = TRACE_TRANSITION_BIT; + if (trace_recursion_test(bit)) + return -1; + trace_recursion_set(bit); + barrier(); + return bit + 1; + } + + /* Normal check passed, clear the transition to allow it again */ + trace_recursion_clear(TRACE_TRANSITION_BIT);
val |= 1 << bit; current->trace_recursion = val; diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 11e9daa4a568..330ba2b081ed 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -492,8 +492,13 @@ trace_selftest_function_recursion(void) unregister_ftrace_function(&test_rec_probe);
ret = -1; - if (trace_selftest_recursion_cnt != 1) { - pr_cont("*callback not called once (%d)* ", + /* + * Recursion allows for transitions between context, + * and may call the callback twice. + */ + if (trace_selftest_recursion_cnt != 1 && + trace_selftest_recursion_cnt != 2) { + pr_cont("*callback not called once (or twice) (%d)* ", trace_selftest_recursion_cnt); goto out; }
From: Qiujun Huang hqjagain@gmail.com
stable inclusion from linux-4.19.156 commit 7e4eeff7da1df5602e9af340d01b2ce1b4662d3e
--------------------------------
commit c1acb4ac1a892cf08d27efcb964ad281728b0545 upstream.
The nesting count of trace_printk allows for 4 levels of nesting. The nesting counter starts at zero and is incremented before being used to retrieve the current context's buffer. But the index into the buffer uses the nesting counter after it was incremented, and not its original value, which is what it needs to use.
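A condensed sketch of the fixed function (the struct layout mirrors kernel/trace/trace.c; the per-CPU lookup is omitted and the buffer size is illustrative):

    #include <linux/compiler.h>

    #define TRACE_BUF_SIZE 1024     /* illustrative size */

    struct trace_buffer_struct {
            int nesting;
            char buffer[4][TRACE_BUF_SIZE];
    };

    /* The counter is incremented before use, so the slot for the
     * current level is nesting - 1. Indexing with the post-increment
     * value wrote one slot too high and, at the fourth level, past
     * the end of the 4-entry array. */
    static char *get_trace_buf(struct trace_buffer_struct *buffer)
    {
            if (!buffer || buffer->nesting >= 4)
                    return NULL;

            buffer->nesting++;

            /* Interrupts must see nesting incremented before we use the buffer */
            barrier();
            return &buffer->buffer[buffer->nesting - 1][0];
    }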
Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org Fixes: 3d9622c12c887 ("tracing: Add barrier to trace_printk() buffer nesting modification") Signed-off-by: Qiujun Huang hqjagain@gmail.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index e32aae03f6be..b4f429106ad4 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2817,7 +2817,7 @@ static char *get_trace_buf(void)
/* Interrupts must see nesting incremented before we use the buffer */ barrier(); - return &buffer->buffer[buffer->nesting][0]; + return &buffer->buffer[buffer->nesting - 1][0]; }
static void put_trace_buf(void)
From: Mike Galbraith efault@gmx.de
stable inclusion from linux-4.19.156 commit c096a3d44ead54b53a3b65d9d210022099bdc7ef
--------------------------------
commit 9f5d1c336a10c0d24e83e40b4c1b9539f7dba627 upstream.
Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner(). This is one possible chain of events leading to this:
  Task     Prio    Operation
  T1       120     lock(F)
  T2       120     lock(F)   -> blocks (top waiter)
  T3       50 (RT) lock(F)   -> boosts T1 and blocks (new top waiter)
  XX               timeout/  -> wakes T2
                   signal
  T1       50      unlock(F) -> wakes T3 (rtmutex->owner == NULL,
                                waiter bit is set)
  T2       120     cleanup   -> try_to_take_mutex() fails because T3 is
                                the top waiter and the lower-priority T2
                                cannot steal the lock.
                             -> fixup_pi_state_owner() sees newowner == NULL
                                -> BUG_ON()
The comment states that this is invalid and that rt_mutex_real_owner() must return a non-NULL owner when the trylock failed, but in the case of a queued and woken-up waiter, rt_mutex_real_owner() == NULL is a valid transient state. The higher priority waiter has simply not yet managed to take over the rtmutex.
The BUG_ON() is therefore wrong and this is just another retry condition in fixup_pi_state_owner().
Drop the locks, so that T3 can make progress, and then try the fixup again.
Gratian provided a great analysis, traces and a reproducer. The analysis is to the point, but it confused the hell out of that tglx dude who had to page in all the futex horrors again. Condensed version is above.
[ tglx: Wrote comment and changelog ]
Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Reported-by: Gratian Crisan gratian.crisan@ni.com Signed-off-by: Mike Galbraith efault@gmx.de Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com Link: https://lore.kernel.org/r/87sg9pkvf7.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/futex.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 04bc59324d7f..cd852c498bbe 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2422,10 +2422,22 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, }
/* - * Since we just failed the trylock; there must be an owner. + * The trylock just failed, so either there is an owner or + * there is a higher priority waiter than this one. */ newowner = rt_mutex_owner(&pi_state->pi_mutex); - BUG_ON(!newowner); + /* + * If the higher priority waiter has not yet taken over the + * rtmutex then newowner is NULL. We can't return here with + * that state because it's inconsistent vs. the user space + * state. So drop the locks and try again. It's a valid + * situation and not any different from the other retry + * conditions. + */ + if (unlikely(!newowner)) { + err = -EAGAIN; + goto handle_err; + } } else { WARN_ON_ONCE(argowner != current); if (oldowner == current) {
From: Gabriel Krisman Bertazi krisman@collabora.com
stable inclusion from linux-4.19.156 commit 6dba51828dfd31e326e6642727ede24c054a86fe
--------------------------------
[ Upstream commit 52abfcbd57eefdd54737fc8c2dc79d8f46d4a3e5 ]
If the new_blkg allocation races with a blk_policy change and blkg_lookup_check() fails, new_blkg is leaked.
Acked-by: Tejun Heo tj@kernel.org Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/blk-cgroup.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index c64f0afa27dc..1a21a3151741 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -900,6 +900,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, blkg = blkg_lookup_check(pos, pol, q); if (IS_ERR(blkg)) { ret = PTR_ERR(blkg); + blkg_free(new_blkg); goto fail_unlock; }
From: Gabriel Krisman Bertazi krisman@collabora.com
stable inclusion from linux-4.19.156 commit 239aed5d2ecbfd381ca642f0a6bcb272c98408df
--------------------------------
[ Upstream commit f255c19b3ab46d3cad3b1b2e1036f4c926cb1d0c ]
Similarly to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq outside request queue spinlock"), blkg_create() can also trigger occasional -ENOMEM failures at the radix insertion, because any allocation inside blkg_create() has to be non-blocking, making it more likely to fail. This causes trouble for userspace tools trying to configure io weights, which need to deal with this condition.
This patch reduces the occurrence of -ENOMEMs on this path by preloading the radix tree element in a GFP_KERNEL context, such that we guarantee the later non-blocking insertion won't fail.
A similar solution exists in blkcg_init_queue for the same situation.
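In isolation, the pattern looks like this (a sketch with hypothetical names, not the blk-cgroup code): reserve tree nodes while sleeping is still allowed, then insert under the lock knowing the insertion cannot fail for lack of memory.

    #include <linux/gfp.h>
    #include <linux/radix-tree.h>
    #include <linux/spinlock.h>

    static int insert_preloaded(struct radix_tree_root *root,
                                unsigned long index, void *item,
                                spinlock_t *lock)
    {
            int err;

            /* Sleepable context: reserve enough nodes for one insert. */
            if (radix_tree_preload(GFP_KERNEL))
                    return -ENOMEM;

            spin_lock(lock);
            err = radix_tree_insert(root, index, item); /* uses the preload */
            spin_unlock(lock);

            radix_tree_preload_end();
            return err;
    }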
Acked-by: Tejun Heo tj@kernel.org Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/blk-cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 1a21a3151741..7ded51051e0b 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -894,6 +894,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, goto fail; }
+ if (radix_tree_preload(GFP_KERNEL)) { + blkg_free(new_blkg); + ret = -ENOMEM; + goto fail; + } + rcu_read_lock(); spin_lock_irq(q->queue_lock);
@@ -901,7 +907,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, if (IS_ERR(blkg)) { ret = PTR_ERR(blkg); blkg_free(new_blkg); - goto fail_unlock; + goto fail_preloaded; }
if (blkg) { @@ -910,10 +916,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, blkg = blkg_create(pos, q, new_blkg); if (unlikely(IS_ERR(blkg))) { ret = PTR_ERR(blkg); - goto fail_unlock; + goto fail_preloaded; } }
+ radix_tree_preload_end(); + if (pos == blkcg) goto success; } @@ -923,6 +931,8 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, ctx->body = body; return 0;
+fail_preloaded: + radix_tree_preload_end(); fail_unlock: spin_unlock_irq(q->queue_lock); rcu_read_unlock();
From: Ming Lei ming.lei@redhat.com
stable inclusion from linux-4.19.156 commit 6280db912e03089b15a3e2c2882d573d3b09bd0c
--------------------------------
[ Upstream commit 831e3405c2a344018a18fcc2665acc5a38c3a707 ]
The current scanning mechanism is supposed to fall back to a synchronous host scan if an asynchronous scan is in progress. However, this rule isn't strictly respected: scsi_prep_async_scan() doesn't hold scan_mutex when checking shost->async_scan. When scsi_scan_host() is called concurrently, two async scans on the same host can be started, and a hang in do_scan_async() is observed.
Fix this issue by checking and setting shost->async_scan atomically under shost->scan_mutex.
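In pattern form (a minimal sketch with made-up names, not the SCSI code): the flag test and the flag set must form one critical section, otherwise two callers can both observe the flag clear and both start a scan.

    #include <linux/mutex.h>

    static DEFINE_MUTEX(example_scan_mutex);
    static bool example_async_scan;

    static bool example_try_start_scan(void)
    {
            bool started = false;

            mutex_lock(&example_scan_mutex);
            if (!example_async_scan) {      /* checked under the lock */
                    example_async_scan = true;
                    started = true;
            }
            mutex_unlock(&example_scan_mutex);

            return started;
    }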
Link: https://lore.kernel.org/r/20201010032539.426615-1-ming.lei@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Ewan D. Milne emilne@redhat.com Cc: Hannes Reinecke hare@suse.de Cc: Bart Van Assche bvanassche@acm.org Reviewed-by: Lee Duncan lduncan@suse.com Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/scsi_scan.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 9a7e3a3bd5ce..009a5b2aa3d0 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -1722,15 +1722,16 @@ static void scsi_sysfs_add_devices(struct Scsi_Host *shost) */ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost) { - struct async_scan_data *data; + struct async_scan_data *data = NULL; unsigned long flags;
if (strncmp(scsi_scan_type, "sync", 4) == 0) return NULL;
+ mutex_lock(&shost->scan_mutex); if (shost->async_scan) { shost_printk(KERN_DEBUG, shost, "%s called twice\n", __func__); - return NULL; + goto err; }
data = kmalloc(sizeof(*data), GFP_KERNEL); @@ -1741,7 +1742,6 @@ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost) goto err; init_completion(&data->prev_finished);
- mutex_lock(&shost->scan_mutex); spin_lock_irqsave(shost->host_lock, flags); shost->async_scan = 1; spin_unlock_irqrestore(shost->host_lock, flags); @@ -1756,6 +1756,7 @@ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost) return data;
err: + mutex_unlock(&shost->scan_mutex); kfree(data); return NULL; }
From: Eddy Wu itseddy0402@gmail.com
stable inclusion from linux-4.19.156 commit b177d2d915cea2d0a590f0034a20299dd1ee3ef2
--------------------------------
commit b4e00444cab4c3f3fec876dc0cccc8cbb0d1a948 upstream.
current->group_leader->exit_signal may change during copy_process() if current->real_parent exits.
Move the assignment inside tasklist_lock to avoid the race.
Signed-off-by: Eddy Wu eddy_wu@trendmicro.com Acked-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/fork.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c index 68b0fd51302e..dc059179fd18 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2026,14 +2026,9 @@ static __latent_entropy struct task_struct *copy_process( /* ok, now we should be set up.. */ p->pid = pid_nr(pid); if (clone_flags & CLONE_THREAD) { - p->exit_signal = -1; p->group_leader = current->group_leader; p->tgid = current->tgid; } else { - if (clone_flags & CLONE_PARENT) - p->exit_signal = current->group_leader->exit_signal; - else - p->exit_signal = (clone_flags & CSIGNAL); p->group_leader = p; p->tgid = p->pid; } @@ -2079,10 +2074,15 @@ static __latent_entropy struct task_struct *copy_process( p->real_parent = current->real_parent; p->parent_exec_id = current->parent_exec_id; p->parent_exec_id_u64 = current->parent_exec_id_u64; + if (clone_flags & CLONE_THREAD) + p->exit_signal = -1; + else + p->exit_signal = current->group_leader->exit_signal; } else { p->real_parent = current; p->parent_exec_id = current->self_exec_id; p->parent_exec_id_u64 = current->self_exec_id_u64; + p->exit_signal = (clone_flags & CSIGNAL); }
klp_copy_process(p);
From: Zeng Tao prime.zeng@hisilicon.com
stable inclusion from linux-4.19.158 commit 68e51bf3761736359110b198c42e6f78056e3719
--------------------------------
[ Upstream commit cb47755725da7b90fecbb2aa82ac3b24a7adb89b ]
UBSAN reports:
  Undefined behaviour in ./include/linux/time64.h:127:27
  signed integer overflow:
  17179869187 * 1000000000 cannot be represented in type 'long long int'
  Call Trace:
   timespec64_to_ns include/linux/time64.h:127 [inline]
   set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180
   do_setitimer+0x8e/0x740 kernel/time/itimer.c:245
   __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336
   do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295
Commit bd40a175769d ("y2038: itimer: change implementation to timespec64") replaced the original conversion, which handled time clamping correctly, with timespec64_to_ns(), which has no overflow protection.
Fix it in timespec64_to_ns() as this is not necessarily limited to the usage in itimers.
[ tglx: Added comment and adjusted the fixes tag ]
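The overflow from the report can be shown against the fixed helper directly (a sketch; KTIME_SEC_MAX is KTIME_MAX / NSEC_PER_SEC, roughly 9.2e9 seconds):

    #include <linux/ktime.h>
    #include <linux/time64.h>

    /* tv_sec from the UBSAN report: 17179869187 * NSEC_PER_SEC does
     * not fit in s64, so the clamped helper returns KTIME_MAX
     * instead of overflowing. */
    static s64 overflow_example(void)
    {
            struct timespec64 ts = { .tv_sec = 17179869187LL, .tv_nsec = 0 };

            return timespec64_to_ns(&ts);
    }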
Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64") Signed-off-by: Zeng Tao prime.zeng@hisilicon.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Arnd Bergmann arnd@arndb.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisili... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/time64.h | 4 ++++ kernel/time/itimer.c | 4 ---- 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/time64.h b/include/linux/time64.h index be946b03c030..b58142aa0b1d 100644 --- a/include/linux/time64.h +++ b/include/linux/time64.h @@ -138,6 +138,10 @@ static inline bool timespec64_valid_settod(const struct timespec64 *ts) */ static inline s64 timespec64_to_ns(const struct timespec64 *ts) { + /* Prevent multiplication overflow */ + if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX) + return KTIME_MAX; + return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec; }
diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c index 9a65713c8309..2e2b335ef101 100644 --- a/kernel/time/itimer.c +++ b/kernel/time/itimer.c @@ -154,10 +154,6 @@ static void set_cpu_itimer(struct task_struct *tsk, unsigned int clock_id, u64 oval, nval, ointerval, ninterval; struct cpu_itimer *it = &tsk->signal->it[clock_id];
- /* - * Use the to_ktime conversion because that clamps the maximum - * value to KTIME_MAX and avoid multiplication overflows. - */ nval = ktime_to_ns(timeval_to_ktime(value->it_value)); ninterval = ktime_to_ns(timeval_to_ktime(value->it_interval));
From: zhuoliang zhang zhuoliang.zhang@mediatek.com
stable inclusion from linux-4.19.158 commit 243a16469c3431affdfdcc26674bd266fbfd5f8e
--------------------------------
[ Upstream commit a779d91314ca7208b7feb3ad817b62904397c56d ]
We found that the following race condition exists in the xfrm_alloc_userspi() flow:
  user thread                                   state_hash_work thread
  ----                                          ----
  xfrm_alloc_userspi()
  __find_acq_core()
  /*alloc new xfrm_state:x*/
  xfrm_state_alloc()
  /*schedule state_hash_work thread*/
  xfrm_hash_grow_check()                        xfrm_hash_resize()
  xfrm_alloc_spi                                /*hold lock*/
  x->id.spi = htonl(spi)                        spin_lock_bh(&net->xfrm.xfrm_state_lock)
  /*waiting lock release*/
  spin_lock_bh(&net->xfrm.xfrm_state_lock)      xfrm_hash_transfer()
                                                /*add x into hlist:net->xfrm.state_byspi*/
                                                hlist_add_head_rcu(&x->byspi)
                                                spin_unlock_bh(&net->xfrm.xfrm_state_lock)

  /*add x into hlist:net->xfrm.state_byspi 2 times*/
  hlist_add_head_rcu(&x->byspi)
1. a new state x is alloced in xfrm_state_alloc() and added into the
   bydst hlist in __find_acq_core() on the LHS;
2. on the RHS, the state_hash_work thread traverses the old bydst and
   transfers every xfrm_state (including x) into the new bydst hlist
   and the new byspi hlist;
3. the user thread on the LHS gets the lock and adds x into the new
   byspi hlist again.
So the same xfrm_state (x) is added into the same hash list (net->xfrm.state_byspi) twice, turning the list into an infinite loop.
To fix the race, the x->id.spi = htonl(spi) assignment in xfrm_alloc_spi() is moved after the spin_lock_bh(), so that the state_hash_work thread no longer adds an x whose id.spi is zero into the hash list.
Fixes: f034b5d4efdf ("[XFRM]: Dynamic xfrm_state hash table sizing.") Signed-off-by: zhuoliang zhang zhuoliang.zhang@mediatek.com Acked-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/xfrm/xfrm_state.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index a649d7c2f48c..84dea0ad1666 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -1825,6 +1825,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high) int err = -ENOENT; __be32 minspi = htonl(low); __be32 maxspi = htonl(high); + __be32 newspi = 0; u32 mark = x->mark.v & x->mark.m;
spin_lock_bh(&x->lock); @@ -1843,21 +1844,22 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high) xfrm_state_put(x0); goto unlock; } - x->id.spi = minspi; + newspi = minspi; } else { u32 spi = 0; for (h = 0; h < high-low+1; h++) { spi = low + prandom_u32()%(high-low+1); x0 = xfrm_state_lookup(net, mark, &x->id.daddr, htonl(spi), x->id.proto, x->props.family); if (x0 == NULL) { - x->id.spi = htonl(spi); + newspi = htonl(spi); break; } xfrm_state_put(x0); } } - if (x->id.spi) { + if (newspi) { spin_lock_bh(&net->xfrm.xfrm_state_lock); + x->id.spi = newspi; h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto, x->props.family); hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h); spin_unlock_bh(&net->xfrm.xfrm_state_lock);
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit bbec3ef4913e0b3918c55d4ac23ca23aa880ffbe
--------------------------------
[ Upstream commit 2c334e12f957cd8c6bb66b4aa3f79848b7c33cab ]
Make sure that we actually initialize xefi_skip_discard when we're scheduling a deferred free of an AGFL block. This was (eventually) found by UBSAN while I was banging on realtime rmap problems, but it exists in the upstream codebase. While we're at it, rearrange the structure to reduce the struct size from 64 to 56 bytes.
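The eight-byte saving comes from alignment; a worked layout, assuming 64-bit pointers and the 4.19 field sizes (byte counts illustrative):

    /* Old order (64 bytes):              New order (56 bytes):
     *  xefi_startblock     8              xefi_startblock      8
     *  xefi_blockcount     4 + 4 pad      xefi_blockcount      4
     *  xefi_list          16              xefi_skip_discard    1 + 3 pad
     *  xefi_oinfo         24              xefi_list           16
     *  xefi_skip_discard   1 + 7 pad      xefi_oinfo          24
     *
     * Moving the bool into the hole after the 32-bit extent count
     * removes both the old hole and the tail padding.
     */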
Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Brian Foster bfoster@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 1 + fs/xfs/libxfs/xfs_bmap.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 4ec82352d965..d2e100601102 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2213,6 +2213,7 @@ xfs_defer_agfl_block( new->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno); new->xefi_blockcount = 1; new->xefi_oinfo = *oinfo; + new->xefi_skip_discard = false;
trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index dc15440bff5c..4c9df0499c42 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -52,9 +52,9 @@ struct xfs_extent_free_item { xfs_fsblock_t xefi_startblock;/* starting fs block number */ xfs_extlen_t xefi_blockcount;/* number of blocks in extent */ + bool xefi_skip_discard; struct list_head xefi_list; struct xfs_owner_info xefi_oinfo; /* extent owner */ - bool xefi_skip_discard; };
#define XFS_BMAP_MAX_NMAP 4
From: Stefano Brivio sbrivio@redhat.com
stable inclusion from linux-4.19.158 commit a79b767a92242b2801bb2f4433501a1b78bcb4e3
--------------------------------
[ Upstream commit 7d10e62c2ff8e084c136c94d32d9a94de4d31248 ]
In ip_set_match_extensions(), for sets with counters, we take care of updating counters themselves by calling ip_set_update_counter(), and of checking if the given comparison and values match, by calling ip_set_match_counter() if needed.
However, if a given comparison on counters doesn't match the configured values, that doesn't mean the set entry itself isn't matching.
This fix restores the behaviour we had before commit 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching"), without reintroducing the issue fixed there: back then, mtype_data_match() first updated counters in any case, and then took care of matching on counters.
Now, if the IPSET_FLAG_SKIP_COUNTER_UPDATE flag is set, ip_set_update_counter() will anyway skip counter updates if desired.
The issue observed is illustrated by this reproducer:
  ipset create c hash:ip counters
  ipset add c 192.0.2.1
  iptables -I INPUT -m set --match-set c src --bytes-gt 800 -j DROP
If we now send packets from 192.0.2.1, the bytes and packets counters for the entry as shown by 'ipset list' are always zero and, no matter how many bytes we send, the rule will never match, because the counters themselves are not updated.
Reported-by: Mithil Mhatre mmhatre@redhat.com Fixes: 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching") Signed-off-by: Stefano Brivio sbrivio@redhat.com Signed-off-by: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/netfilter/ipset/ip_set_core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index ea42c40213aa..1916cd39190c 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -488,13 +488,14 @@ ip_set_match_extensions(struct ip_set *set, const struct ip_set_ext *ext, if (SET_WITH_COUNTER(set)) { struct ip_set_counter *counter = ext_counter(data, set);
+ ip_set_update_counter(counter, ext, flags); + if (flags & IPSET_FLAG_MATCH_COUNTERS && !(ip_set_match_counter(ip_set_get_packets(counter), mext->packets, mext->packets_op) && ip_set_match_counter(ip_set_get_bytes(counter), mext->bytes, mext->bytes_op))) return false; - ip_set_update_counter(counter, ext, flags); } if (SET_WITH_SKBINFO(set)) ip_set_get_skbinfo(ext_skbinfo(data, set),
From: Jiri Olsa jolsa@kernel.org
stable inclusion from linux-4.19.158 commit 5e6d8c982529d72c307a2597178e6639b02b26c4
--------------------------------
[ Upstream commit fe01adb72356a4e2f8735e4128af85921ca98fa1 ]
We are missing the byte swap for the ino_generation field.
Fixes: 5c5e854bc760 ("perf tools: Add attr->mmap2 support") Signed-off-by: Jiri Olsa jolsa@kernel.org Acked-by: Namhyung Kim namhyung@kernel.org Link: https://lore.kernel.org/r/20201101233103.3537427-2-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/util/session.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c index 29ecff1fc25b..063ecec3d697 100644 --- a/tools/perf/util/session.c +++ b/tools/perf/util/session.c @@ -489,6 +489,7 @@ static void perf_event__mmap2_swap(union perf_event *event, event->mmap2.maj = bswap_32(event->mmap2.maj); event->mmap2.min = bswap_32(event->mmap2.min); event->mmap2.ino = bswap_64(event->mmap2.ino); + event->mmap2.ino_generation = bswap_64(event->mmap2.ino_generation);
if (sample_id_all) { void *data = &event->mmap2.filename;
From: Brian Foster bfoster@redhat.com
stable inclusion from linux-4.19.158 commit 955113c89ca9d49b8eef9f46f248397d1a1e55b1
--------------------------------
[ Upstream commit 869ae85dae64b5540e4362d7fe4cd520e10ec05c ]
It is possible to expose non-zeroed post-EOF data in XFS if the new EOF page is dirty, backed by an unwritten block and the truncate happens to race with writeback. iomap_truncate_page() will not zero the post-EOF portion of the page if the underlying block is unwritten. The subsequent call to truncate_setsize() will, but doesn't dirty the page. Therefore, if writeback happens to complete after iomap_truncate_page() (so it still sees the unwritten block) but before truncate_setsize(), the cached page becomes inconsistent with the on-disk block. A mapped read after the associated page is reclaimed or invalidated exposes non-zero post-EOF data.
For example, consider the following sequence when run on a kernel modified to explicitly flush the new EOF page within the race window:
$ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
$ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
...
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400:  00 00 00 00 00 00 00 00  ........
$ umount /mnt/; mount <dev> /mnt/
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400:  cd cd cd cd cd cd cd cd  ........
Update xfs_setattr_size() to explicitly flush the new EOF page prior to the page truncate to ensure iomap has the latest state of the underlying block.
Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") Signed-off-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/xfs_iops.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index e427ad097e2e..948ac1290121 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -895,6 +895,16 @@ xfs_setattr_size(
 		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
 				&did_zeroing, &xfs_iomap_ops);
 	} else {
+		/*
+		 * iomap won't detect a dirty page over an unwritten block (or a
+		 * cow block over a hole) and subsequently skips zeroing the
+		 * newly post-EOF portion of the page. Flush the new EOF to
+		 * convert the block before the pagecache truncate.
+		 */
+		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
+						     newsize);
+		if (error)
+			return error;
 		error = iomap_truncate_page(inode, newsize, &did_zeroing,
 				&xfs_iomap_ops);
 	}
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit cba36f708a8a23a6e51dd2ddf70ab8449c692c15
--------------------------------
[ Upstream commit c1f6b1ac00756a7108e5fcb849a2f8230c0b62a5 ]
The kernel has always allowed directories to have the rtinherit flag set, even if there is no rt device, so this check is wrong.
Fixes: 80e4e1268802 ("xfs: scrub inodes") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/scrub/inode.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index e386c9b0b4ab..8d45d60832db 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -131,8 +131,7 @@ xchk_inode_flags(
 		goto bad;
 
 	/* rt flags require rt device */
-	if ((flags & (XFS_DIFLAG_REALTIME | XFS_DIFLAG_RTINHERIT)) &&
-	    !mp->m_rtdev_targp)
+	if ((flags & XFS_DIFLAG_REALTIME) && !mp->m_rtdev_targp)
 		goto bad;
 
 	/* new rt bitmap flag only valid for rbmino */
From: Tyler Hicks tyhicks@linux.microsoft.com
stable inclusion from linux-4.19.158 commit 0982aa54d054bf11fe202522e29e4da9b288653b
--------------------------------
[ Upstream commit 8ffd778aff45be760292225049e0141255d4ad6e ]
Mimic the pre-existing ACPI and Device Tree event log behavior by not creating the binary_bios_measurements file when the EFI TPM event log is empty.
This fixes the following NULL pointer dereference that can occur when reading /sys/kernel/security/tpm0/binary_bios_measurements after the kernel received an empty event log from the firmware:
BUG: kernel NULL pointer dereference, address: 000000000000002c
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 3932 Comm: fwupdtpmevlog Not tainted 5.9.0-00003-g629990edad62 #17
Hardware name: LENOVO 20LCS03L00/20LCS03L00, BIOS N27ET38W (1.24 ) 11/28/2019
RIP: 0010:tpm2_bios_measurements_start+0x3a/0x550
Code: 54 53 48 83 ec 68 48 8b 57 70 48 8b 1e 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 82 c0 06 00 00 48 8b 8a c8 06 00 00 <44> 8b 60 1c 48 89 4d a0 4c 89 e2 49 83 c4 20 48 83 fb 00 75 2a 49
RSP: 0018:ffffa9c901203db0 EFLAGS: 00010246
RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000010
RDX: ffff8ba1eb99c000 RSI: ffff8ba1e4ce8280 RDI: ffff8ba1e4ce8258
RBP: ffffa9c901203e40 R08: ffffa9c901203dd8 R09: ffff8ba1ec443300
R10: ffffa9c901203e50 R11: 0000000000000000 R12: ffff8ba1e4ce8280
R13: ffffa9c901203ef0 R14: ffffa9c901203ef0 R15: ffff8ba1e4ce8258
FS:  00007f6595460880(0000) GS:ffff8ba1ef880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000002c CR3: 00000007d8d18003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? __kmalloc_node+0x113/0x320
 ? kvmalloc_node+0x31/0x80
 seq_read+0x94/0x420
 vfs_read+0xa7/0x190
 ksys_read+0xa7/0xe0
 __x64_sys_read+0x1a/0x20
 do_syscall_64+0x37/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
In this situation, the bios_event_log pointer in the tpm_bios_log struct was not NULL but was equal to the ZERO_SIZE_PTR (0x10) value. This was due to the following kmemdup() in tpm_read_log_efi():
int tpm_read_log_efi(struct tpm_chip *chip)
{
...
	/* malloc EventLog space */
	log->bios_event_log = kmemdup(log_tbl->log, log_size, GFP_KERNEL);
	if (!log->bios_event_log) {
		ret = -ENOMEM;
		goto out;
	}
...
}
When log_size is zero, due to an empty event log from firmware, ZERO_SIZE_PTR is returned from kmemdup(). Upon a read of the binary_bios_measurements file, the tpm2_bios_measurements_start() function does not perform a ZERO_OR_NULL_PTR() check on the bios_event_log pointer before dereferencing it.
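For background, zero-length kernel allocations do not return NULL; they return the ZERO_SIZE_PTR sentinel, which passes a plain NULL check but must never be dereferenced. Roughly, per include/linux/slab.h, applied to the kmemdup() above (a sketch, not the patched code):

/* A distinct non-NULL value: may be passed to kfree(), never dereferenced. */
#define ZERO_SIZE_PTR ((void *)16)
#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) <= (unsigned long)ZERO_SIZE_PTR)

log->bios_event_log = kmemdup(log_tbl->log, 0, GFP_KERNEL); /* ZERO_SIZE_PTR */
if (!log->bios_event_log)	/* false: 0x10 is not NULL, -ENOMEM not taken */
	ret = -ENOMEM;
/* ...a later field access dereferences the sentinel, faulting at 0x10 + offset */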
Rather than add a ZERO_OR_NULL_PTR() check in functions that make use of the bios_event_log pointer, simply avoid creating the binary_bios_measurements_file as is done in other event log retrieval backends.
Explicitly ignore all of the events in the final event log when the main event log is empty. The list of events in the final event log cannot be accurately parsed without referring to the first event in the main event log (the event log header) so the final event log is useless in such a situation.
Fixes: 58cc1e4faf10 ("tpm: parse TPM event logs based on EFI table") Link: https://lore.kernel.org/linux-integrity/E1FDCCCB-CA51-4AEE-AC83-9CDE995EAE52... Reported-by: Kai-Heng Feng kai.heng.feng@canonical.com Reported-by: Kenneth R. Crudup kenny@panix.com Reported-by: Mimi Zohar zohar@linux.ibm.com Cc: Thiébaud Weksteen tweek@google.com Cc: Ard Biesheuvel ardb@kernel.org Signed-off-by: Tyler Hicks tyhicks@linux.microsoft.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/char/tpm/eventlog/efi.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/char/tpm/eventlog/efi.c b/drivers/char/tpm/eventlog/efi.c
index 3e673ab22cb4..abd3beeb5158 100644
--- a/drivers/char/tpm/eventlog/efi.c
+++ b/drivers/char/tpm/eventlog/efi.c
@@ -43,6 +43,11 @@ int tpm_read_log_efi(struct tpm_chip *chip)
 	log_size = log_tbl->size;
 	memunmap(log_tbl);
 
+	if (!log_size) {
+		pr_warn("UEFI TPM log area empty\n");
+		return -EIO;
+	}
+
 	log_tbl = memremap(efi.tpm_log, sizeof(*log_tbl) + log_size,
 			   MEMREMAP_WB);
 	if (!log_tbl) {
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.158 commit 580a117919f3ae1390be6b4111253ee9595938f5
--------------------------------
commit 46d6c5ae953cc0be38efd0e469284df7c4328cf8 upstream.
If netfilter changes the packet mark when mangling, the packet is rerouted using the route_me_harder set of functions. Prior to this commit, there's one big difference between route_me_harder and the ordinary initial routing functions, described in the comment above __ip_queue_xmit():
/* Note: skb->sk can be different from sk, in case of tunnels */
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
That function goes on to correctly make use of sk->sk_bound_dev_if, rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a tunnel will receive a packet in ndo_start_xmit with an initial skb->sk. It will make some transformations to that packet, and then it will send the encapsulated packet out of a *new* socket. That new socket will basically always have a different sk_bound_dev_if (otherwise there'd be a routing loop). So for the purposes of routing the encapsulated packet, the routing information as it pertains to the socket should come from that socket's sk, rather than the packet's original skb->sk. For that reason __ip_queue_xmit() and related functions all do the right thing.
One might argue that all tunnels should just call skb_orphan(skb) before transmitting the encapsulated packet into the new socket. But tunnels do *not* do this -- and this is wisely avoided in skb_scrub_packet() too -- because features like TSQ rely on skb->destructor() being called when that buffer space is truly available again. Calling skb_orphan(skb) too early would result in buffers filling up unnecessarily and accounting info being all wrong. Instead, additional routing must take into account the new sk, just as __ip_queue_xmit() notes.
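For reference, skb_orphan() is exactly the operation that would run that accounting too early; per include/linux/skbuff.h it looks roughly like:

static inline void skb_orphan(struct sk_buff *skb)
{
	if (skb->destructor) {
		skb->destructor(skb);	/* e.g. sock_wfree(): returns wmem, wakes TSQ */
		skb->destructor = NULL;
		skb->sk		= NULL;
	} else {
		BUG_ON(skb->sk);
	}
}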
So, this commit addresses the problem by fishing the correct sk out of state->sk -- it's already set properly in the call to nf_hook() in __ip_local_out(), which receives the sk as part of its normal functionality. So we make sure to plumb state->sk through the various route_me_harder functions, and then make correct use of it following the example of __ip_queue_xmit().
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
Reviewed-by: Florian Westphal fw@strlen.de
Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org
Signed-off-by: Sasha Levin sashal@kernel.org
[Jason: backported to 4.19 from Sasha's 5.4 backport]
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 include/linux/netfilter_ipv4.h            |  2 +-
 include/linux/netfilter_ipv6.h            |  2 +-
 net/ipv4/netfilter.c                      | 12 +++++++-----
 net/ipv4/netfilter/ipt_SYNPROXY.c         |  2 +-
 net/ipv4/netfilter/iptable_mangle.c       |  2 +-
 net/ipv4/netfilter/nf_nat_l3proto_ipv4.c  |  2 +-
 net/ipv4/netfilter/nf_reject_ipv4.c       |  2 +-
 net/ipv4/netfilter/nft_chain_route_ipv4.c |  2 +-
 net/ipv6/netfilter.c                      |  6 +++---
 net/ipv6/netfilter/ip6table_mangle.c      |  2 +-
 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c  |  2 +-
 net/ipv6/netfilter/nft_chain_route_ipv6.c |  2 +-
 net/netfilter/ipvs/ip_vs_core.c           |  4 ++--
 13 files changed, 22 insertions(+), 20 deletions(-)
diff --git a/include/linux/netfilter_ipv4.h b/include/linux/netfilter_ipv4.h
index 95ab5cc64422..45ff1330b339 100644
--- a/include/linux/netfilter_ipv4.h
+++ b/include/linux/netfilter_ipv4.h
@@ -16,7 +16,7 @@ struct ip_rt_info {
 	u_int32_t mark;
 };
 
-int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned addr_type);
+int ip_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb, unsigned addr_type);
 
 struct nf_queue_entry;
diff --git a/include/linux/netfilter_ipv6.h b/include/linux/netfilter_ipv6.h
index c0dc4dd78887..47a2de582f57 100644
--- a/include/linux/netfilter_ipv6.h
+++ b/include/linux/netfilter_ipv6.h
@@ -36,7 +36,7 @@ struct nf_ipv6_ops {
 };
 
 #ifdef CONFIG_NETFILTER
-int ip6_route_me_harder(struct net *net, struct sk_buff *skb);
+int ip6_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb);
 __sum16 nf_ip6_checksum(struct sk_buff *skb, unsigned int hook,
 			unsigned int dataoff, u_int8_t protocol);
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index 8d2e5dc9a827..3d670d5aea34 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -17,17 +17,19 @@
 #include <net/netfilter/nf_queue.h>
 
 /* route_me_harder function, used by iptable_nat, iptable_mangle + ip_queue */
-int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_type)
+int ip_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb, unsigned int addr_type)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	struct rtable *rt;
 	struct flowi4 fl4 = {};
 	__be32 saddr = iph->saddr;
-	const struct sock *sk = skb_to_full_sk(skb);
-	__u8 flags = sk ? inet_sk_flowi_flags(sk) : 0;
+	__u8 flags;
 	struct net_device *dev = skb_dst(skb)->dev;
 	unsigned int hh_len;
 
+	sk = sk_to_full_sk(sk);
+	flags = sk ? inet_sk_flowi_flags(sk) : 0;
+
 	if (addr_type == RTN_UNSPEC)
 		addr_type = inet_addr_type_dev_table(net, dev, saddr);
 	if (addr_type == RTN_LOCAL || addr_type == RTN_UNICAST)
@@ -91,8 +93,8 @@ int nf_ip_reroute(struct sk_buff *skb, const struct nf_queue_entry *entry)
 		      skb->mark == rt_info->mark &&
 		      iph->daddr == rt_info->daddr &&
 		      iph->saddr == rt_info->saddr))
-			return ip_route_me_harder(entry->state.net, skb,
-						  RTN_UNSPEC);
+			return ip_route_me_harder(entry->state.net, entry->state.sk,
+						  skb, RTN_UNSPEC);
 	}
 	return 0;
 }
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 690b17ef6a44..d64b1ef43c10 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -54,7 +54,7 @@ synproxy_send_tcp(struct net *net,
 
 	skb_dst_set_noref(nskb, skb_dst(skb));
 	nskb->protocol = htons(ETH_P_IP);
-	if (ip_route_me_harder(net, nskb, RTN_UNSPEC))
+	if (ip_route_me_harder(net, nskb->sk, nskb, RTN_UNSPEC))
 		goto free_nskb;
 
 	if (nfct) {
diff --git a/net/ipv4/netfilter/iptable_mangle.c b/net/ipv4/netfilter/iptable_mangle.c
index dea138ca8925..0829f46ddfdd 100644
--- a/net/ipv4/netfilter/iptable_mangle.c
+++ b/net/ipv4/netfilter/iptable_mangle.c
@@ -65,7 +65,7 @@ ipt_mangle_out(struct sk_buff *skb, const struct nf_hook_state *state)
 		    iph->daddr != daddr ||
 		    skb->mark != mark ||
 		    iph->tos != tos) {
-			err = ip_route_me_harder(state->net, skb, RTN_UNSPEC);
+			err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);
 			if (err < 0)
 				ret = NF_DROP_ERR(err);
 		}
diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
index 52e3f9c7bc69..290a04820701 100644
--- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
@@ -331,7 +331,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
 
 	if (ct->tuplehash[dir].tuple.dst.u3.ip !=
 	    ct->tuplehash[!dir].tuple.src.u3.ip) {
-		err = ip_route_me_harder(state->net, skb, RTN_UNSPEC);
+		err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);
 		if (err < 0)
 			ret = NF_DROP_ERR(err);
 	}
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 5cd06ba3535d..4996db1f64a1 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -129,7 +129,7 @@ void nf_send_reset(struct net *net, struct sk_buff *oldskb, int hook)
 				   ip4_dst_hoplimit(skb_dst(nskb)));
 	nf_reject_ip_tcphdr_put(nskb, oldskb, oth);
 
-	if (ip_route_me_harder(net, nskb, RTN_UNSPEC))
+	if (ip_route_me_harder(net, nskb->sk, nskb, RTN_UNSPEC))
 		goto free_nskb;
 
 	niph = ip_hdr(nskb);
diff --git a/net/ipv4/netfilter/nft_chain_route_ipv4.c b/net/ipv4/netfilter/nft_chain_route_ipv4.c
index 7d82934c46f4..61003768e52b 100644
--- a/net/ipv4/netfilter/nft_chain_route_ipv4.c
+++ b/net/ipv4/netfilter/nft_chain_route_ipv4.c
@@ -50,7 +50,7 @@ static unsigned int nf_route_table_hook(void *priv,
 		    iph->daddr != daddr ||
 		    skb->mark != mark ||
 		    iph->tos != tos) {
-			err = ip_route_me_harder(state->net, skb, RTN_UNSPEC);
+			err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);
 			if (err < 0)
 				ret = NF_DROP_ERR(err);
 		}
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 6d0b1f3e927b..5679fa3f696a 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -17,10 +17,10 @@
 #include <net/xfrm.h>
 #include <net/netfilter/nf_queue.h>
 
-int ip6_route_me_harder(struct net *net, struct sk_buff *skb)
+int ip6_route_me_harder(struct net *net, struct sock *sk_partial, struct sk_buff *skb)
 {
 	const struct ipv6hdr *iph = ipv6_hdr(skb);
-	struct sock *sk = sk_to_full_sk(skb->sk);
+	struct sock *sk = sk_to_full_sk(sk_partial);
 	unsigned int hh_len;
 	struct dst_entry *dst;
 	int strict = (ipv6_addr_type(&iph->daddr) &
@@ -81,7 +81,7 @@ static int nf_ip6_reroute(struct sk_buff *skb,
 		if (!ipv6_addr_equal(&iph->daddr, &rt_info->daddr) ||
 		    !ipv6_addr_equal(&iph->saddr, &rt_info->saddr) ||
 		    skb->mark != rt_info->mark)
-			return ip6_route_me_harder(entry->state.net, skb);
+			return ip6_route_me_harder(entry->state.net, entry->state.sk, skb);
 	}
 	return 0;
 }
diff --git a/net/ipv6/netfilter/ip6table_mangle.c b/net/ipv6/netfilter/ip6table_mangle.c
index b0524b18c4fb..acba3757ff60 100644
--- a/net/ipv6/netfilter/ip6table_mangle.c
+++ b/net/ipv6/netfilter/ip6table_mangle.c
@@ -60,7 +60,7 @@ ip6t_mangle_out(struct sk_buff *skb, const struct nf_hook_state *state)
 		     skb->mark != mark ||
 		     ipv6_hdr(skb)->hop_limit != hop_limit ||
 		     flowlabel != *((u_int32_t *)ipv6_hdr(skb)))) {
-		err = ip6_route_me_harder(state->net, skb);
+		err = ip6_route_me_harder(state->net, state->sk, skb);
 		if (err < 0)
 			ret = NF_DROP_ERR(err);
 	}
diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
index bf25263b2dcd..2160936e5b91 100644
--- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
@@ -354,7 +354,7 @@ nf_nat_ipv6_local_fn(void *priv, struct sk_buff *skb,
 
 	if (!nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.dst.u3,
 			      &ct->tuplehash[!dir].tuple.src.u3)) {
-		err = ip6_route_me_harder(state->net, skb);
+		err = ip6_route_me_harder(state->net, state->sk, skb);
 		if (err < 0)
 			ret = NF_DROP_ERR(err);
 	}
diff --git a/net/ipv6/netfilter/nft_chain_route_ipv6.c b/net/ipv6/netfilter/nft_chain_route_ipv6.c
index da3f1f8cb325..afe79cb46e63 100644
--- a/net/ipv6/netfilter/nft_chain_route_ipv6.c
+++ b/net/ipv6/netfilter/nft_chain_route_ipv6.c
@@ -52,7 +52,7 @@ static unsigned int nf_route_table_hook(void *priv,
 		     skb->mark != mark ||
 		     ipv6_hdr(skb)->hop_limit != hop_limit ||
 		     flowlabel != *((u_int32_t *)ipv6_hdr(skb)))) {
-		err = ip6_route_me_harder(state->net, skb);
+		err = ip6_route_me_harder(state->net, state->sk, skb);
 		if (err < 0)
 			ret = NF_DROP_ERR(err);
 	}
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index d5e4329579e2..acaeeaf81441 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -725,12 +725,12 @@ static int ip_vs_route_me_harder(struct netns_ipvs *ipvs, int af,
 		struct dst_entry *dst = skb_dst(skb);
 
 		if (dst->dev && !(dst->dev->flags & IFF_LOOPBACK) &&
-		    ip6_route_me_harder(ipvs->net, skb) != 0)
+		    ip6_route_me_harder(ipvs->net, skb->sk, skb) != 0)
 			return 1;
 	} else
 #endif
 		if (!(skb_rtable(skb)->rt_flags & RTCF_LOCAL) &&
-		    ip_route_me_harder(ipvs->net, skb, RTN_LOCAL) != 0)
+		    ip_route_me_harder(ipvs->net, skb->sk, skb, RTN_LOCAL) != 0)
 			return 1;
 
 	return 0;
From: Chunyan Zhang zhang.lyra@gmail.com
stable inclusion from linux-4.19.158 commit 880d94c7811ebe89ab7edc7e8839ccdf0eedcd91
--------------------------------
commit 5167c506d62dd9ffab73eba23c79b0a8845c9fe1 upstream.
Suspend to IDLE invokes tick_unfreeze() on resume. tick_unfreeze() on the first resuming CPU resumes timekeeping, which also has the side effect of resetting the softlockup watchdog on this CPU.
But on the secondary CPUs the watchdog is not reset in the resume / unfreeze() path, which can result in false softlockup warnings on those CPUs depending on the time spent in suspend.
Prevent this by clearing the softlockup watchdog in the unfreeze path on the secondary resuming CPUs as well.
[ tglx: Massaged changelog ]
Signed-off-by: Chunyan Zhang chunyan.zhang@unisoc.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/20200110083902.27276-1-chunyan.zhang@unisoc.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/time/tick-common.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index a02e0f6b287c..0a3cc37e4b83 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -15,6 +15,7 @@
 #include <linux/err.h>
 #include <linux/hrtimer.h>
 #include <linux/interrupt.h>
+#include <linux/nmi.h>
 #include <linux/percpu.h>
 #include <linux/profile.h>
 #include <linux/sched.h>
@@ -520,6 +521,7 @@ void tick_unfreeze(void)
 		trace_suspend_resume(TPS("timekeeping_freeze"),
 				     smp_processor_id(), false);
 	} else {
+		touch_softlockup_watchdog();
 		tick_resume_local();
 	}
From: Christoph Hellwig hch@lst.de
stable inclusion from linux-4.19.158 commit d9f4534a9a286877ae29e15b5f9ba8ecb7370924
--------------------------------
[ Upstream commit 2bd645b2d3f0bacadaa6037f067538e1cd4e42ef ]
bdget_disk needs to be paired with bdput to not leak a reference on the block device inode.
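In sketch form, the pairing rule the fix restores (nbd_release() had taken the reference via bdget_disk() but returned without dropping it):

	struct block_device *bdev = bdget_disk(disk, 0); /* takes a bdev inode ref */

	if (bdev) {
		/* ... use the block device ... */
		bdput(bdev);	/* every successful bdget_disk() needs this */
	}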
Fixes: 08ba91ee6e2c ("nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.") Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/block/nbd.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index c3b1d53c9be8..49f98b9b2777 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1452,6 +1452,7 @@ static void nbd_release(struct gendisk *disk, fmode_t mode)
 	if (test_bit(NBD_RT_DISCONNECT_ON_CLOSE, &nbd->config->runtime_flags) &&
 			bdev->bd_openers == 0)
 		nbd_disconnect_and_put(nbd);
+	bdput(bdev);
 
 	nbd_config_put(nbd);
 	nbd_put(nbd);
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit 83e65294aff8635877dfd0a019dbb9cd1abcab8a
--------------------------------
[ Upstream commit ea8439899c0b15a176664df62aff928010fad276 ]
Pass the same oldext argument (which contains the existing rmapping's unwritten state) to xfs_rmap_lookup_le_range at the start of xfs_rmap_convert_shared. At this point in the code, flags is zero, which means that we perform lookups using the wrong key.
Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_rmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 245af452840e..ab3e72e702f0 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -1387,7 +1387,7 @@ xfs_rmap_convert_shared(
 	 * record for our insertion point. This will also give us the record for
 	 * start block contiguity tests.
 	 */
-	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
+	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, oldext,
 			&PREV, &i);
 	if (error)
 		goto done;
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit 5fda0976e88dd4addc401af6b2ab53b2842e47a6
--------------------------------
[ Upstream commit 5dda3897fd90783358c4c6115ef86047d8c8f503 ]
When the bmbt scrubber is looking up rmap extents, we need to set the extent flags from the bmbt record fully. This will matter once we fix the rmap btree comparison functions to check those flags correctly.
Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/scrub/bmap.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index e1d11f3223e3..d1fc8a3fb891 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -102,6 +102,8 @@ xchk_bmap_get_rmap(
 
 	if (info->whichfork == XFS_ATTR_FORK)
 		rflags |= XFS_RMAP_ATTR_FORK;
+	if (irec->br_state == XFS_EXT_UNWRITTEN)
+		rflags |= XFS_RMAP_UNWRITTEN;
 
 	/*
 	 * CoW staging extents are owned (on disk) by the refcountbt, so
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit b7fec5a9a726e9e2a095d53b6de4e62bf179b481
--------------------------------
[ Upstream commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f ]
Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows:
(physical block, owner, fork, is_btree, is_unwritten, offset)
This provides users the ability to look up a reverse mapping from a bmbt record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset.
However, the key comparison functions incorrectly remove the fork/btree/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with:
(physical block, owner, offset)
This means that queries can return incorrect results. On consistent filesystems this hasn't been an issue because blocks are never shared between forks or with bmbt blocks; and are never unwritten. However, this bug means that online repair cannot always detect corruption in the key information in internal rmapbt nodes.
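To see why stripping the flags breaks ordering, note that two keys differing only in a flag bit encoded in the upper offset bits compare equal once those bits are masked off. A self-contained illustration (the bit position and mask width here are assumptions for illustration, not quoted from xfs_format.h):

#include <stdint.h>
#include <stdio.h>

#define OFF_UNWRITTEN	(1ULL << 61)		/* flag stored in the on-disk offset */
#define OFF_MASK	((1ULL << 54) - 1)	/* what XFS_RMAP_OFF()-style masking keeps */

int main(void)
{
	uint64_t k1 = 100;			/* written extent at file offset 100 */
	uint64_t k2 = 100 | OFF_UNWRITTEN;	/* unwritten extent, same offset */

	/* buggy comparison: flags masked away, distinct keys collapse */
	printf("masked equal:   %d\n", (k1 & OFF_MASK) == (k2 & OFF_MASK)); /* 1 */
	/* fixed comparison: full encoded value, keys stay distinct */
	printf("unmasked equal: %d\n", k1 == k2);                           /* 0 */
	return 0;
}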
Found by fuzzing keys[1].attrfork = ones on xfs/371.
Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_rmap_btree.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index f79cf040d745..77528f413286 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -247,8 +247,8 @@ xfs_rmapbt_key_diff(
 	else if (y > x)
 		return -1;
 
-	x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset));
-	y = rec->rm_offset;
+	x = be64_to_cpu(kp->rm_offset);
+	y = xfs_rmap_irec_offset_pack(rec);
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -279,8 +279,8 @@ xfs_rmapbt_diff_two_keys(
 	else if (y > x)
 		return -1;
 
-	x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
-	y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
+	x = be64_to_cpu(kp1->rm_offset);
+	y = be64_to_cpu(kp2->rm_offset);
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -393,8 +393,8 @@ xfs_rmapbt_keys_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset));
-	b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset));
+	a = be64_to_cpu(k1->rmap.rm_offset);
+	b = be64_to_cpu(k2->rmap.rm_offset);
 	if (a <= b)
 		return 1;
 	return 0;
@@ -423,8 +423,8 @@ xfs_rmapbt_recs_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset));
-	b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset));
+	a = be64_to_cpu(r1->rmap.rm_offset);
+	b = be64_to_cpu(r2->rmap.rm_offset);
 	if (a <= b)
 		return 1;
 	return 0;
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit 964e25377fab8e9071c07d8cdf1c9a4fb079285e
--------------------------------
[ Upstream commit 54e9b09e153842ab5adb8a460b891e11b39e9c3d ]
Fix some serious WTF in the reference count scrubber's rmap fragment processing. The code comment says that this loop is supposed to move all fragment records starting at or before bno onto the worklist, but there's no obvious reason why nr (the number of items added) should increment starting from 1, and breaking the loop when we've added the target number seems dubious since we could have more rmap fragments that should have been added to the worklist.
This seems to manifest in xfs/411 when adding one to the refcount field.
Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/scrub/refcount.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index e8c82b026083..76e4f16a9fab 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -180,7 +180,6 @@ xchk_refcountbt_process_rmap_fragments(
 	 */
 	INIT_LIST_HEAD(&worklist);
 	rbno = NULLAGBLOCK;
-	nr = 1;
 
 	/* Make sure the fragments actually /are/ in agbno order. */
 	bno = 0;
@@ -194,15 +193,14 @@ xchk_refcountbt_process_rmap_fragments(
 	 * Find all the rmaps that start at or before the refc extent,
 	 * and put them on the worklist.
 	 */
+	nr = 0;
 	list_for_each_entry_safe(frag, n, &refchk->fragments, list) {
-		if (frag->rm.rm_startblock > refchk->bno)
-			goto done;
+		if (frag->rm.rm_startblock > refchk->bno || nr > target_nr)
+			break;
 		bno = frag->rm.rm_startblock + frag->rm.rm_blockcount;
 		if (bno < rbno)
 			rbno = bno;
		list_move_tail(&frag->list, &worklist);
-		if (nr == target_nr)
-			break;
 		nr++;
 	}
From: Christoph Hellwig hch@lst.de
stable inclusion from linux-4.19.158 commit 05e0bb210209126f495401051996389831f7ba48
--------------------------------
[ Upstream commit 2bd3fa793aaa7e98b74e3653fdcc72fa753913b5 ]
We also need to drop the iolock when invalidate_inode_pages2() fails, not just in all the other error and success cases.
Fixes: 527851124d10 ("xfs: implement pNFS export operations") Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/xfs_pnfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index f44c3599527d..1c9bced3e860 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -141,7 +141,7 @@ xfs_fs_map_blocks(
 		goto out_unlock;
 	error = invalidate_inode_pages2(inode->i_mapping);
 	if (WARN_ON_ONCE(error))
-		return error;
+		goto out_unlock;
 
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + length);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.158 commit 0f4eb125c56b524776a2f396bb7a7c119bb2e0a1
--------------------------------
[ Upstream commit ce0f17fc93f63ee91428af10b7b2ddef38cd19e5 ]
One should use in_serving_softirq() to detect SoftIRQ context.
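The two predicates differ in whether "bottom halves disabled" counts; roughly, per include/linux/preempt.h:

/* softirq_count() is also nonzero while BHs are merely disabled via
 * local_bh_disable(); only the SOFTIRQ_OFFSET bit of preempt_count()
 * means a softirq handler is executing right now.
 */
#define in_softirq()		(softirq_count())
#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)

So task-context code running under local_bh_disable() was misclassified into the softirq recursion slot.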
Fixes: 96f6d4444302 ("perf_counter: avoid recursion") Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20201030151955.120572175@infradead.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/events/internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 6dc725a7e7bc..8fc0ddc38cb6 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -209,7 +209,7 @@ static inline int get_recursion_context(int *recursion)
 		rctx = 3;
 	else if (in_irq())
 		rctx = 2;
-	else if (in_softirq())
+	else if (in_serving_softirq())
 		rctx = 1;
 	else
 		rctx = 0;
From: Kaixu Xia kaixuxia@tencent.com
stable inclusion from linux-4.19.158 commit f0bb30a1fc8fe0d9a3068c7cf1644302c8196cfc
--------------------------------
commit 174fe5ba2d1ea0d6c5ab2a7d4aa058d6d497ae4d upstream.
The macro MOPT_Q is used to indicate that a mount option is related to quota and is defined to be MOPT_NOSUPPORT when CONFIG_QUOTA is disabled. Normally the quota options are handled explicitly, so it didn't matter that the MOPT_STRING flag was missing, even though the usrjquota and grpjquota mount options take a string argument. It is important that the flag be present in the !CONFIG_QUOTA case, since without MOPT_STRING the mount option matcher will only match usrjquota= followed by an integer and will otherwise skip the table entry, so the "mount option not supported" error message is never reported.
[ Fixed up the commit description to better explain why the fix works. --TYT ]
Fixes: 26092bf52478 ("ext4: use a table-driven handler for mount options") Signed-off-by: Kaixu Xia kaixuxia@tencent.com Link: https://lore.kernel.org/r/1603986396-28917-1-git-send-email-kaixuxia@tencent... Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9b1ab74df916..1a0d494e57ae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1791,8 +1791,8 @@ static const struct mount_opts {
 	{Opt_noquota, (EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
 		       EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA),
 							MOPT_CLEAR | MOPT_Q},
-	{Opt_usrjquota, 0, MOPT_Q},
-	{Opt_grpjquota, 0, MOPT_Q},
+	{Opt_usrjquota, 0, MOPT_Q | MOPT_STRING},
+	{Opt_grpjquota, 0, MOPT_Q | MOPT_STRING},
 	{Opt_offusrjquota, 0, MOPT_Q},
 	{Opt_offgrpjquota, 0, MOPT_Q},
 	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
From: Joseph Qi joseph.qi@linux.alibaba.com
stable inclusion from linux-4.19.158 commit 9d9ad220bb414da6e8b6ae0e35f9e4db976a97ef
--------------------------------
commit 7067b2619017d51e71686ca9756b454de0e5826a upstream.
ext4_inline_data_truncate() takes the xattr_sem to re-check for inline data, but does not release it on the path where the inode turns out to have none. Unlock it before returning there.
Fixes: aef1c8513c1f ("ext4: let ext4_truncate handle inline data correctly") Reported-by: Dan Carpenter dan.carpenter@oracle.com Cc: Tao Ma boyu.mt@taobao.com Signed-off-by: Joseph Qi joseph.qi@linux.alibaba.com Reviewed-by: Andreas Dilger adilger@dilger.ca Link: https://lore.kernel.org/r/1604370542-124630-1-git-send-email-joseph.qi@linux... Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/inline.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 56f6e1782d5f..32b266c478f9 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1921,6 +1921,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 
 	ext4_write_lock_xattr(inode, &no_expand);
 	if (!ext4_has_inline_data(inode)) {
+		ext4_write_unlock_xattr(inode, &no_expand);
 		*has_inline = 0;
 		ext4_journal_stop(handle);
 		return 0;
From: Shin'ichiro Kawasaki shinichiro.kawasaki@wdc.com
stable inclusion from linux-4.19.158 commit ebfe3b94c746b9c9b1b7fa2cfc842380a85d7c40
--------------------------------
commit 092561f06702dd4fdd7fb74dd3a838f1818529b7 upstream.
Commit 8fd0e2a6df26 ("uio: free uio id after uio file node is freed") triggered KASAN use-after-free failure at deletion of TCM-user backstores [1].
In uio_unregister_device(), struct uio_device *idev is passed to uio_free_minor() to refer to idev->minor. However, by the time uio_free_minor() is called, idev has already been freed by uio_device_release() during the call to device_unregister().
To avoid referencing idev->minor after idev is freed, keep the idev->minor value in a local variable. Also modify the uio_free_minor() argument to receive that value.
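The fix follows the standard teardown pattern: copy whatever is still needed out of an object before the call that can drop its last reference. A minimal sketch of the resulting flow:

	unsigned long minor = idev->minor;  /* copy while idev is guaranteed alive */

	device_unregister(&idev->dev);      /* may free idev via the release path */

	uio_free_minor(minor);              /* safe: uses the copy, not idev */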
[1]
BUG: KASAN: use-after-free in uio_unregister_device+0x166/0x190
Read of size 4 at addr ffff888105196508 by task targetcli/49158

CPU: 3 PID: 49158 Comm: targetcli Not tainted 5.10.0-rc1 #1
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
Call Trace:
 dump_stack+0xae/0xe5
 ? uio_unregister_device+0x166/0x190
 print_address_description.constprop.0+0x1c/0x210
 ? uio_unregister_device+0x166/0x190
 ? uio_unregister_device+0x166/0x190
 kasan_report.cold+0x37/0x7c
 ? kobject_put+0x80/0x410
 ? uio_unregister_device+0x166/0x190
 uio_unregister_device+0x166/0x190
 tcmu_destroy_device+0x1c4/0x280 [target_core_user]
 ? tcmu_release+0x90/0x90 [target_core_user]
 ? __mutex_unlock_slowpath+0xd6/0x5d0
 target_free_device+0xf3/0x2e0 [target_core_mod]
 config_item_cleanup+0xea/0x210
 configfs_rmdir+0x651/0x860
 ? detach_groups.isra.0+0x380/0x380
 vfs_rmdir.part.0+0xec/0x3a0
 ? __lookup_hash+0x20/0x150
 do_rmdir+0x252/0x320
 ? do_file_open_root+0x420/0x420
 ? strncpy_from_user+0xbc/0x2f0
 ? getname_flags.part.0+0x8e/0x450
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f9e2bfc91fb
Code: 73 01 c3 48 8b 0d 9d ec 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 54 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d ec 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffdd2baafe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
RAX: ffffffffffffffda RBX: 00007f9e2beb44a0 RCX: 00007f9e2bfc91fb
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f9e1c20be90
RBP: 00007ffdd2bab000 R08: 0000000000000000 R09: 00007f9e2bdf2440
R10: 00007ffdd2baaf37 R11: 0000000000000246 R12: 00000000ffffff9c
R13: 000055f9abb7e390 R14: 000055f9abcf9558 R15: 00007f9e2be7a780

Allocated by task 34735:
 kasan_save_stack+0x1b/0x40
 __kasan_kmalloc.constprop.0+0xc2/0xd0
 __uio_register_device+0xeb/0xd40
 tcmu_configure_device+0x5a0/0xbc0 [target_core_user]
 target_configure_device+0x12f/0x760 [target_core_mod]
 target_dev_enable_store+0x32/0x50 [target_core_mod]
 configfs_write_file+0x2bb/0x450
 vfs_write+0x1ce/0x610
 ksys_write+0xe9/0x1b0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 49158:
 kasan_save_stack+0x1b/0x40
 kasan_set_track+0x1c/0x30
 kasan_set_free_info+0x1b/0x30
 __kasan_slab_free+0x110/0x150
 slab_free_freelist_hook+0x5a/0x170
 kfree+0xc6/0x560
 device_release+0x9b/0x210
 kobject_put+0x13e/0x410
 uio_unregister_device+0xf9/0x190
 tcmu_destroy_device+0x1c4/0x280 [target_core_user]
 target_free_device+0xf3/0x2e0 [target_core_mod]
 config_item_cleanup+0xea/0x210
 configfs_rmdir+0x651/0x860
 vfs_rmdir.part.0+0xec/0x3a0
 do_rmdir+0x252/0x320
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at ffff888105196000
 which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1288 bytes inside of
 2048-byte region [ffff888105196000, ffff888105196800)
The buggy address belongs to the page:
page:0000000098e6ca81 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x105190
head:0000000098e6ca81 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x17ffffc0010200(slab|head)
raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff888100043040
raw: 0000000000000000 0000000000080008 00000001ffffffff ffff88810eb55c01
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff88810eb55c01

Memory state around the buggy address:
 ffff888105196400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888105196480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888105196500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff888105196580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888105196600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 8fd0e2a6df26 ("uio: free uio id after uio file node is freed") Cc: stable stable@vger.kernel.org Signed-off-by: Shin'ichiro Kawasaki shinichiro.kawasaki@wdc.com Link: https://lore.kernel.org/r/20201102122819.2346270-1-shinichiro.kawasaki@wdc.c... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/uio/uio.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index e743246aaeac..1876e4c4b81a 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -413,10 +413,10 @@ static int uio_get_minor(struct uio_device *idev)
 	return retval;
 }
 
-static void uio_free_minor(struct uio_device *idev)
+static void uio_free_minor(unsigned long minor)
 {
 	mutex_lock(&minor_lock);
-	idr_remove(&uio_idr, idev->minor);
+	idr_remove(&uio_idr, minor);
 	mutex_unlock(&minor_lock);
 }
 
@@ -988,7 +988,7 @@ int __uio_register_device(struct module *owner,
 err_uio_dev_add_attributes:
 	device_del(&idev->dev);
 err_device_create:
-	uio_free_minor(idev);
+	uio_free_minor(idev->minor);
 	put_device(&idev->dev);
 	return ret;
 }
@@ -1002,11 +1002,13 @@ EXPORT_SYMBOL_GPL(__uio_register_device);
 void uio_unregister_device(struct uio_info *info)
 {
 	struct uio_device *idev;
+	unsigned long minor;
 
 	if (!info || !info->uio_dev)
 		return;
 
 	idev = info->uio_dev;
+	minor = idev->minor;
 
 	mutex_lock(&idev->info_lock);
 	uio_dev_del_attributes(idev);
@@ -1022,7 +1024,7 @@ void uio_unregister_device(struct uio_info *info)
 
 	device_unregister(&idev->dev);
 
-	uio_free_minor(idev);
+	uio_free_minor(minor);
 
 	return;
 }
From: Dan Carpenter dan.carpenter@oracle.com
stable inclusion from linux-4.19.158 commit 3f7277405fb3fe2853ce5d017ac7935ffca4ccfd
--------------------------------
commit 1e106aa3509b86738769775969822ffc1ec21bf4 upstream.
The exit_pi_state_list() function calls put_pi_state() with IRQs disabled and is not expecting that IRQs will be enabled inside the function.
Use the _irqsave() variant so that IRQs are restored to the original state instead of being enabled unconditionally.
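In sketch form, the difference between the two unlock variants (a generic spinlock example, not the futex code itself):

	unsigned long flags;

	raw_spin_lock_irqsave(&lock, flags);      /* records whether IRQs were on */
	/* ... critical section ... */
	raw_spin_unlock_irqrestore(&lock, flags); /* puts the IRQ flag back as found */

raw_spin_unlock_irq(), by contrast, ends with an unconditional local_irq_enable(), which is wrong for a caller like exit_pi_state_list() that must keep IRQs disabled across the call.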
Fixes: 153fbd1226fb ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201106085205.GA1159983@mwanda Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/futex.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index cd852c498bbe..8b67e8a7a5e9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -861,8 +861,9 @@ static void put_pi_state(struct futex_pi_state *pi_state)
 	 */
 	if (pi_state->owner) {
 		struct task_struct *owner;
+		unsigned long flags;
 
-		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+		raw_spin_lock_irqsave(&pi_state->pi_mutex.wait_lock, flags);
 		owner = pi_state->owner;
 		if (owner) {
 			raw_spin_lock(&owner->pi_lock);
@@ -870,7 +871,7 @@ static void put_pi_state(struct futex_pi_state *pi_state)
 			raw_spin_unlock(&owner->pi_lock);
 		}
 		rt_mutex_proxy_unlock(&pi_state->pi_mutex, owner);
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irqrestore(&pi_state->pi_mutex.wait_lock, flags);
 	}
 
 	if (current->pi_state_cache) {
From: Wengang Wang wen.gang.wang@oracle.com
stable inclusion from linux-4.19.158 commit bc2d96b3061ad35150c4db14661e91916766f241
--------------------------------
commit f5785283dd64867a711ca1fb1f5bb172f252ecdf upstream.
Though the problem was found on an older 4.1.12 kernel, I think upstream has the same issue.
On one node in the cluster, there is the following call trace:
# cat /proc/21473/stack
 __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
 ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
 ocfs2_evict_inode+0x152/0x820 [ocfs2]
 evict+0xae/0x1a0
 iput+0x1c6/0x230
 ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
 ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
 ocfs2_dir_foreach+0x29/0x30 [ocfs2]
 ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
 ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
 process_one_work+0x169/0x4a0
 worker_thread+0x5b/0x560
 kthread+0xcb/0xf0
 ret_from_fork+0x61/0x90
The above stack is not reasonable; the final iput() should not happen inside ocfs2_orphan_filldir(). Looking at the code,
2067         /* Skip inodes which are already added to recover list, since dio may
2068          * happen concurrently with unlink/rename */
2069         if (OCFS2_I(iter)->ip_next_orphan) {
2070                 iput(iter);
2071                 return 0;
2072         }
2073
The logic assumes the inode is already in the recovery list when it sees a non-NULL ip_next_orphan, so it skips this inode after dropping the reference that was taken in ocfs2_iget().
However, if the inode really were in the recovery list, it would hold another reference, and the iput() at line 2070 would not be the final iput() (the one dropping the last reference). So I don't think the inode is actually in the recovery list (there is no vmcore to confirm).
Note that ocfs2_queue_orphans(), though it does not show up in the call trace, is holding the cluster lock on the orphan directory while looking up unlinked inodes. The on-disk inode eviction can involve a lot of I/O and may take a long time to finish. That means this node can hold the cluster lock for a very long time, which can leave lock requests from other nodes against the orphan directory hanging for just as long.
Looking further at ip_next_orphan, I found it is never initialized when a new ocfs2_inode_info structure is allocated.
This is what causes reflink operations from some nodes to hang for a very long time waiting for the cluster lock on the orphan directory.
Fix: initialize ip_next_orphan as NULL.
Signed-off-by: Wengang Wang wen.gang.wang@oracle.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Joseph Qi joseph.qi@linux.alibaba.com Cc: Mark Fasheh mark@fasheh.com Cc: Joel Becker jlbec@evilplan.org Cc: Junxiao Bi junxiao.bi@oracle.com Cc: Changwei Ge gechangwei@live.cn Cc: Gang He ghe@suse.com Cc: Jun Piao piaojun@huawei.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ocfs2/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 3415e0b09398..74da9158e4b6 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1747,6 +1747,7 @@ static void ocfs2_inode_init_once(void *data)
 
 	oi->ip_blkno = 0ULL;
 	oi->ip_clusters = 0;
+	oi->ip_next_orphan = NULL;
 
 	ocfs2_resv_init_once(&oi->ip_la_data_resv);
From: Chen Zhou chenzhou10@huawei.com
stable inclusion from linux-4.19.158 commit 46977ef987a743133b2a66208464c898d7a03e3c
--------------------------------
commit c350f8bea271782e2733419bd2ab9bf4ec2051ef upstream.
Fix to return a negative error code from the error handling case instead of 0 in function sel_ib_pkey_sid_slow(), as done elsewhere in this function.
Cc: stable@vger.kernel.org Fixes: 409dcf31538a ("selinux: Add a cache for quicker retreival of PKey SIDs") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- security/selinux/ibpkey.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/security/selinux/ibpkey.c b/security/selinux/ibpkey.c
index 0a4b89d48297..cb05ae28ce00 100644
--- a/security/selinux/ibpkey.c
+++ b/security/selinux/ibpkey.c
@@ -161,8 +161,10 @@ static int sel_ib_pkey_sid_slow(u64 subnet_prefix, u16 pkey_num, u32 *sid)
 	 * is valid, it just won't be added to the cache.
 	 */
 	new = kzalloc(sizeof(*new), GFP_ATOMIC);
-	if (!new)
+	if (!new) {
+		ret = -ENOMEM;
 		goto out;
+	}
 
 	new->psec.subnet_prefix = subnet_prefix;
 	new->psec.pkey = pkey_num;
From: Al Viro viro@zeniv.linux.org.uk
stable inclusion from linux-4.19.158 commit 9bb7c3825457a6319c6e6cc4442e52ce3edcdaf4
--------------------------------
commit 77f6ab8b7768cf5e6bdd0e72499270a0671506ee upstream.
Coredump logics needs to report not only the registers of the dumping thread, but (since 2.5.43) those of other threads getting killed.
Doing that might require extra state saved on the stack in asm glue at kernel entry; signal delivery logics does that (we need to be able to save sigcontext there, at the very least) and so does seccomp.
That covers all callers of do_coredump(). Secondary threads get hit with SIGKILL and are caught as soon as they reach exit_mm(), which normally happens in signal delivery, so those are also fine most of the time. Unfortunately, it is possible to end up with a secondary zapped when it has already entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm() when mm->core_state is already set, but the stack contents are not what we would have in signal delivery.
At least on two architectures (alpha and m68k) it leads to infoleaks - we end up with a chunk of kernel stack written into coredump, with the contents consisting of normal C stack frames of the call chain leading to exit_mm() instead of the expected copy of userland registers. In case of alpha we leak 312 bytes of stack. Other architectures (including the regset-using ones) might have similar problems - the normal user of regsets is ptrace and the state of tracee at the time of such calls is special in the same way signal delivery is.
Note that had the zapper gotten to the exiting thread slightly later, it wouldn't have been included into coredump anyway - we skip the threads that have already cleared their ->mm. So let's pretend that zapper always loses the race. IOW, have exit_mm() only insert into the dumper list if we'd gotten there from handling a fatal signal[*]
As the result, the callers of do_exit() that have *not* gone through get_signal() are not seen by coredump logics as secondary threads. Which excludes voluntary exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that, so seccomp is fine.
[*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed out that PF_SIGNALED is already doing just what we need.
Cc: stable@vger.kernel.org Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Acked-by: "Eric W. Biederman" ebiederm@xmission.com Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/exit.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index a085052c02ae..d3f1108e57f7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -517,7 +517,10 @@ static void exit_mm(void)
 		up_read(&mm->mmap_sem);
 
 		self.task = current;
-		self.next = xchg(&core_state->dumper.next, &self);
+		if (self.task->flags & PF_SIGNALED)
+			self.next = xchg(&core_state->dumper.next, &self);
+		else
+			self.task = NULL;
 		/*
 		 * Implies mb(), the result of xchg() must be visible
 		 * to core_state->dumper.
From: Oliver Herms oliver.peter.herms@gmail.com
stable inclusion from linux-4.19.158 commit 7fe18ca0435629dc6b5bcc76a6dbb2ba66f86d70
--------------------------------
[ Upstream commit 8ef9ba4d666614497a057d09b0a6eafc1e34eadf ]
Due to the legacy usage of hard_header_len for SIT tunnels while already using infrastructure from net/ipv4/ip_tunnel.c the calculation of the path MTU in tnl_update_pmtu is incorrect. This leads to unnecessary creation of MTU exceptions for any flow going over a SIT tunnel.
As SIT tunnels do not have a header themselves other than their transport (L3, L2) headers, we're leaving hard_header_len set to zero, as tnl_update_pmtu() already takes care of the transport header sizes.
This will also help avoiding unnecessary IPv6 GC runs and spinlock contention seen when using SIT tunnels and for more than net.ipv6.route.gc_thresh flows.
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Signed-off-by: Oliver Herms oliver.peter.herms@gmail.com Acked-by: Willem de Bruijn willemb@google.com Link: https://lore.kernel.org/r/20201103104133.GA1573211@tws Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv6/sit.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index bfed7508ba19..98c108baf35e 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1087,7 +1087,6 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	if (tdev && !netif_is_l3_master(tdev)) {
 		int t_hlen = tunnel->hlen + sizeof(struct iphdr);
 
-		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
 		dev->mtu = tdev->mtu - t_hlen;
 		if (dev->mtu < IPV6_MIN_MTU)
 			dev->mtu = IPV6_MIN_MTU;
@@ -1377,7 +1376,6 @@ static void ipip6_tunnel_setup(struct net_device *dev)
 	dev->priv_destructor	= ipip6_dev_free;
 
 	dev->type		= ARPHRD_SIT;
-	dev->hard_header_len	= LL_MAX_HEADER + t_hlen;
 	dev->mtu		= ETH_DATA_LEN - t_hlen;
 	dev->min_mtu		= IPV6_MIN_MTU;
 	dev->max_mtu		= IP6_MAX_MTU - t_hlen;
From: Mao Wenan wenan.mao@linux.alibaba.com
stable inclusion from linux-4.19.158 commit bd73928fa9561a25144be6ed9313493bd9fb9ce3
--------------------------------
[ Upstream commit 909172a149749242990a6e64cb55d55460d4e417 ]
When net.ipv4.tcp_syncookies=1 and a SYN flood happens, cookie_v4_check() or cookie_v6_check() tries to redo what tcp_v4_send_synack() or tcp_v6_send_synack() did. If SO_RCVBUF is set, rsk_window_clamp will be changed, which makes rcv_wscale different: the client still operates with the initial window scale and can overshoot the granted window, because the client uses the initial scale while the local server uses the new scale to advertise the window value, and the session works abnormally.
Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt") Signed-off-by: Mao Wenan wenan.mao@linux.alibaba.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/syncookies.c | 9 +++++++-- net/ipv6/syncookies.c | 10 ++++++++-- 2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index f66b2e6d97a7..1a06850ef3cc 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -296,7 +296,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	__u32 cookie = ntohl(th->ack_seq) - 1;
 	struct sock *ret = sk;
 	struct request_sock *req;
-	int mss;
+	int full_space, mss;
 	struct rtable *rt;
 	__u8 rcv_wscale;
 	struct flowi4 fl4;
@@ -391,8 +391,13 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+	/* limit the window selection if the user enforce a smaller rx buffer */
+	full_space = tcp_full_space(sk);
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
+	    (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
+		req->rsk_window_clamp = full_space;
 
-	tcp_select_initial_window(sk, tcp_full_space(sk), req->mss,
+	tcp_select_initial_window(sk, full_space, req->mss,
 				  &req->rsk_rcv_wnd, &req->rsk_window_clamp,
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(&rt->dst, RTAX_INITRWND));
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index a377be8a9fb4..ec61b67a92be 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -141,7 +141,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	__u32 cookie = ntohl(th->ack_seq) - 1;
 	struct sock *ret = sk;
 	struct request_sock *req;
-	int mss;
+	int full_space, mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
 	u32 tsoff = 0;
@@ -246,7 +246,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	}
 
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-	tcp_select_initial_window(sk, tcp_full_space(sk), req->mss,
+	/* limit the window selection if the user enforce a smaller rx buffer */
+	full_space = tcp_full_space(sk);
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
+	    (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
+		req->rsk_window_clamp = full_space;
+
+	tcp_select_initial_window(sk, full_space, req->mss,
 				  &req->rsk_rcv_wnd, &req->rsk_window_clamp,
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(dst, RTAX_INITRWND));
From: George Spelvin lkml@sdf.org
stable inclusion from linux-4.19.158 commit 81d7c56d6fab5ccbf522c47a655cd427808679f2
--------------------------------
commit c51f8f88d705e06bd696d7510aff22b33eb8e638 upstream.
Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits.
It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops.
This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted.
Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question.
Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it.
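The replacement keeps four words of per-CPU state, seeded with strong randomness, and runs two SipHash rounds per output word. The resulting core generator looks roughly like this (reconstructed from the upstream commit; treat the details as a sketch):

static inline u32 siprand_u32(struct rnd_state *s)
{
	unsigned long v0 = s->s1, v1 = s->s2, v2 = s->s3, v3 = s->s4;

	PRND_SIPROUND(v0, v1, v2, v3);	/* two rounds per output word */
	PRND_SIPROUND(v0, v1, v2, v3);
	s->s1 = v0; s->s2 = v1; s->s3 = v2; s->s4 = v3;
	return v1 + v3;			/* combine two lanes for the result */
}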
Reported-by: Amit Klein aksecurity@gmail.com
Cc: Willy Tarreau w@1wt.eu
Cc: Eric Dumazet edumazet@google.com
Cc: "Jason A. Donenfeld" Jason@zx2c4.com
Cc: Andy Lutomirski luto@kernel.org
Cc: Kees Cook keescook@chromium.org
Cc: Thomas Gleixner tglx@linutronix.de
Cc: Peter Zijlstra peterz@infradead.org
Cc: Linus Torvalds torvalds@linux-foundation.org
Cc: tytso@mit.edu
Cc: Florian Westphal fw@strlen.de
Cc: Marc Plumb lkml.mplumb@gmail.com
Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin lkml@sdf.org
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ]
[wt: backported to 4.19 -- various context adjustments]
Signed-off-by: Willy Tarreau w@1wt.eu
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/char/random.c   |   1 -
 include/linux/prandom.h |  36 +++-
 kernel/time/timer.c     |   7 -
 lib/random32.c          | 462 ++++++++++++++++++++++++----------------
 4 files changed, 317 insertions(+), 189 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c index c210a6f4551a..c7c344e69f19 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1257,7 +1257,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
fast_mix(fast_pool); add_interrupt_bench(cycles); - this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
if (unlikely(crng_init == 0)) { if ((fast_pool->count >= 64) && diff --git a/include/linux/prandom.h b/include/linux/prandom.h index aa16e6468f91..cc1e71334e53 100644 --- a/include/linux/prandom.h +++ b/include/linux/prandom.h @@ -16,12 +16,44 @@ void prandom_bytes(void *buf, size_t nbytes); void prandom_seed(u32 seed); void prandom_reseed_late(void);
+#if BITS_PER_LONG == 64 +/* + * The core SipHash round function. Each line can be executed in + * parallel given enough CPU resources. + */ +#define PRND_SIPROUND(v0, v1, v2, v3) ( \ + v0 += v1, v1 = rol64(v1, 13), v2 += v3, v3 = rol64(v3, 16), \ + v1 ^= v0, v0 = rol64(v0, 32), v3 ^= v2, \ + v0 += v3, v3 = rol64(v3, 21), v2 += v1, v1 = rol64(v1, 17), \ + v3 ^= v0, v1 ^= v2, v2 = rol64(v2, 32) \ +) + +#define PRND_K0 (0x736f6d6570736575 ^ 0x6c7967656e657261) +#define PRND_K1 (0x646f72616e646f6d ^ 0x7465646279746573) + +#elif BITS_PER_LONG == 32 +/* + * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash. + * This is weaker, but 32-bit machines are not used for high-traffic + * applications, so there is less output for an attacker to analyze. + */ +#define PRND_SIPROUND(v0, v1, v2, v3) ( \ + v0 += v1, v1 = rol32(v1, 5), v2 += v3, v3 = rol32(v3, 8), \ + v1 ^= v0, v0 = rol32(v0, 16), v3 ^= v2, \ + v0 += v3, v3 = rol32(v3, 7), v2 += v1, v1 = rol32(v1, 13), \ + v3 ^= v0, v1 ^= v2, v2 = rol32(v2, 16) \ +) +#define PRND_K0 0x6c796765 +#define PRND_K1 0x74656462 + +#else +#error Unsupported BITS_PER_LONG +#endif + struct rnd_state { __u32 s1, s2, s3, s4; };
-DECLARE_PER_CPU(struct rnd_state, net_rand_state); - u32 prandom_u32_state(struct rnd_state *state); void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes); void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state); diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 61e41ea3a96e..a6e88d9bb931 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1655,13 +1655,6 @@ void update_process_times(int user_tick) scheduler_tick(); if (IS_ENABLED(CONFIG_POSIX_TIMERS)) run_posix_cpu_timers(p); - - /* The current CPU might make use of net randoms without receiving IRQs - * to renew them often enough. Let's update the net_rand_state from a - * non-constant value that's not affine to the number of calls to make - * sure it's updated when there's some activity (we don't care in idle). - */ - this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick); }
/** diff --git a/lib/random32.c b/lib/random32.c index b6f3325e38e4..9085b1172015 100644 --- a/lib/random32.c +++ b/lib/random32.c @@ -40,16 +40,6 @@ #include <linux/sched.h> #include <asm/unaligned.h>
-#ifdef CONFIG_RANDOM32_SELFTEST -static void __init prandom_state_selftest(void); -#else -static inline void prandom_state_selftest(void) -{ -} -#endif - -DEFINE_PER_CPU(struct rnd_state, net_rand_state) __latent_entropy; - /** * prandom_u32_state - seeded pseudo-random number generator. * @state: pointer to state structure holding seeded state. @@ -69,25 +59,6 @@ u32 prandom_u32_state(struct rnd_state *state) } EXPORT_SYMBOL(prandom_u32_state);
-/** - * prandom_u32 - pseudo random number generator - * - * A 32 bit pseudo-random number is generated using a fast - * algorithm suitable for simulation. This algorithm is NOT - * considered safe for cryptographic use. - */ -u32 prandom_u32(void) -{ - struct rnd_state *state = &get_cpu_var(net_rand_state); - u32 res; - - res = prandom_u32_state(state); - put_cpu_var(net_rand_state); - - return res; -} -EXPORT_SYMBOL(prandom_u32); - /** * prandom_bytes_state - get the requested number of pseudo-random bytes * @@ -119,20 +90,6 @@ void prandom_bytes_state(struct rnd_state *state, void *buf, size_t bytes) } EXPORT_SYMBOL(prandom_bytes_state);
-/** - * prandom_bytes - get the requested number of pseudo-random bytes - * @buf: where to copy the pseudo-random bytes to - * @bytes: the requested number of bytes - */ -void prandom_bytes(void *buf, size_t bytes) -{ - struct rnd_state *state = &get_cpu_var(net_rand_state); - - prandom_bytes_state(state, buf, bytes); - put_cpu_var(net_rand_state); -} -EXPORT_SYMBOL(prandom_bytes); - static void prandom_warmup(struct rnd_state *state) { /* Calling RNG ten times to satisfy recurrence condition */ @@ -148,96 +105,6 @@ static void prandom_warmup(struct rnd_state *state) prandom_u32_state(state); }
-static u32 __extract_hwseed(void) -{ - unsigned int val = 0; - - (void)(arch_get_random_seed_int(&val) || - arch_get_random_int(&val)); - - return val; -} - -static void prandom_seed_early(struct rnd_state *state, u32 seed, - bool mix_with_hwseed) -{ -#define LCG(x) ((x) * 69069U) /* super-duper LCG */ -#define HWSEED() (mix_with_hwseed ? __extract_hwseed() : 0) - state->s1 = __seed(HWSEED() ^ LCG(seed), 2U); - state->s2 = __seed(HWSEED() ^ LCG(state->s1), 8U); - state->s3 = __seed(HWSEED() ^ LCG(state->s2), 16U); - state->s4 = __seed(HWSEED() ^ LCG(state->s3), 128U); -} - -/** - * prandom_seed - add entropy to pseudo random number generator - * @seed: seed value - * - * Add some additional seeding to the prandom pool. - */ -void prandom_seed(u32 entropy) -{ - int i; - /* - * No locking on the CPUs, but then somewhat random results are, well, - * expected. - */ - for_each_possible_cpu(i) { - struct rnd_state *state = &per_cpu(net_rand_state, i); - - state->s1 = __seed(state->s1 ^ entropy, 2U); - prandom_warmup(state); - } -} -EXPORT_SYMBOL(prandom_seed); - -/* - * Generate some initially weak seeding values to allow - * to start the prandom_u32() engine. - */ -static int __init prandom_init(void) -{ - int i; - - prandom_state_selftest(); - - for_each_possible_cpu(i) { - struct rnd_state *state = &per_cpu(net_rand_state, i); - u32 weak_seed = (i + jiffies) ^ random_get_entropy(); - - prandom_seed_early(state, weak_seed, true); - prandom_warmup(state); - } - - return 0; -} -core_initcall(prandom_init); - -static void __prandom_timer(struct timer_list *unused); - -static DEFINE_TIMER(seed_timer, __prandom_timer); - -static void __prandom_timer(struct timer_list *unused) -{ - u32 entropy; - unsigned long expires; - - get_random_bytes(&entropy, sizeof(entropy)); - prandom_seed(entropy); - - /* reseed every ~60 seconds, in [40 .. 80) interval with slack */ - expires = 40 + prandom_u32_max(40); - seed_timer.expires = jiffies + msecs_to_jiffies(expires * MSEC_PER_SEC); - - add_timer(&seed_timer); -} - -static void __init __prandom_start_seed_timer(void) -{ - seed_timer.expires = jiffies + msecs_to_jiffies(40 * MSEC_PER_SEC); - add_timer(&seed_timer); -} - void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state) { int i; @@ -257,51 +124,6 @@ void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state) } EXPORT_SYMBOL(prandom_seed_full_state);
-/* - * Generate better values after random number generator - * is fully initialized. - */ -static void __prandom_reseed(bool late) -{ - unsigned long flags; - static bool latch = false; - static DEFINE_SPINLOCK(lock); - - /* Asking for random bytes might result in bytes getting - * moved into the nonblocking pool and thus marking it - * as initialized. In this case we would double back into - * this function and attempt to do a late reseed. - * Ignore the pointless attempt to reseed again if we're - * already waiting for bytes when the nonblocking pool - * got initialized. - */ - - /* only allow initial seeding (late == false) once */ - if (!spin_trylock_irqsave(&lock, flags)) - return; - - if (latch && !late) - goto out; - - latch = true; - prandom_seed_full_state(&net_rand_state); -out: - spin_unlock_irqrestore(&lock, flags); -} - -void prandom_reseed_late(void) -{ - __prandom_reseed(true); -} - -static int __init prandom_reseed(void) -{ - __prandom_reseed(false); - __prandom_start_seed_timer(); - return 0; -} -late_initcall(prandom_reseed); - #ifdef CONFIG_RANDOM32_SELFTEST static struct prandom_test1 { u32 seed; @@ -421,7 +243,28 @@ static struct prandom_test2 { { 407983964U, 921U, 728767059U }, };
-static void __init prandom_state_selftest(void) +static u32 __extract_hwseed(void) +{ + unsigned int val = 0; + + (void)(arch_get_random_seed_int(&val) || + arch_get_random_int(&val)); + + return val; +} + +static void prandom_seed_early(struct rnd_state *state, u32 seed, + bool mix_with_hwseed) +{ +#define LCG(x) ((x) * 69069U) /* super-duper LCG */ +#define HWSEED() (mix_with_hwseed ? __extract_hwseed() : 0) + state->s1 = __seed(HWSEED() ^ LCG(seed), 2U); + state->s2 = __seed(HWSEED() ^ LCG(state->s1), 8U); + state->s3 = __seed(HWSEED() ^ LCG(state->s2), 16U); + state->s4 = __seed(HWSEED() ^ LCG(state->s3), 128U); +} + +static int __init prandom_state_selftest(void) { int i, j, errors = 0, runs = 0; bool error = false; @@ -461,5 +304,266 @@ static void __init prandom_state_selftest(void) pr_warn("prandom: %d/%d self tests failed\n", errors, runs); else pr_info("prandom: %d self tests passed\n", runs); + return 0; } +core_initcall(prandom_state_selftest); #endif + +/* + * The prandom_u32() implementation is now completely separate from the + * prandom_state() functions, which are retained (for now) for compatibility. + * + * Because of (ab)use in the networking code for choosing random TCP/UDP port + * numbers, which open DoS possibilities if guessable, we want something + * stronger than a standard PRNG. But the performance requirements of + * the network code do not allow robust crypto for this application. + * + * So this is a homebrew Junior Spaceman implementation, based on the + * lowest-latency trustworthy crypto primitive available, SipHash. + * (The authors of SipHash have not been consulted about this abuse of + * their work.) + * + * Standard SipHash-2-4 uses 2n+4 rounds to hash n words of input to + * one word of output. This abbreviated version uses 2 rounds per word + * of output. + */ + +struct siprand_state { + unsigned long v0; + unsigned long v1; + unsigned long v2; + unsigned long v3; +}; + +static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; + +/* + * This is the core CPRNG function. As "pseudorandom", this is not used + * for truly valuable things, just intended to be a PITA to guess. + * For maximum speed, we do just two SipHash rounds per word. This is + * the same rate as 4 rounds per 64 bits that SipHash normally uses, + * so hopefully it's reasonably secure. + * + * There are two changes from the official SipHash finalization: + * - We omit some constants XORed with v2 in the SipHash spec as irrelevant; + * they are there only to make the output rounds distinct from the input + * rounds, and this application has no input rounds. + * - Rather than returning v0^v1^v2^v3, return v1+v3. + * If you look at the SipHash round, the last operation on v3 is + * "v3 ^= v0", so "v0 ^ v3" just undoes that, a waste of time. + * Likewise "v1 ^= v2". (The rotate of v2 makes a difference, but + * it still cancels out half of the bits in v2 for no benefit.) + * Second, since the last combining operation was xor, continue the + * pattern of alternating xor/add for a tiny bit of extra non-linearity. + */ +static inline u32 siprand_u32(struct siprand_state *s) +{ + unsigned long v0 = s->v0, v1 = s->v1, v2 = s->v2, v3 = s->v3; + + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + s->v0 = v0; s->v1 = v1; s->v2 = v2; s->v3 = v3; + return v1 + v3; +} + + +/** + * prandom_u32 - pseudo random number generator + * + * A 32 bit pseudo-random number is generated using a fast + * algorithm suitable for simulation. 
This algorithm is NOT + * considered safe for cryptographic use. + */ +u32 prandom_u32(void) +{ + struct siprand_state *state = get_cpu_ptr(&net_rand_state); + u32 res = siprand_u32(state); + + put_cpu_ptr(&net_rand_state); + return res; +} +EXPORT_SYMBOL(prandom_u32); + +/** + * prandom_bytes - get the requested number of pseudo-random bytes + * @buf: where to copy the pseudo-random bytes to + * @bytes: the requested number of bytes + */ +void prandom_bytes(void *buf, size_t bytes) +{ + struct siprand_state *state = get_cpu_ptr(&net_rand_state); + u8 *ptr = buf; + + while (bytes >= sizeof(u32)) { + put_unaligned(siprand_u32(state), (u32 *)ptr); + ptr += sizeof(u32); + bytes -= sizeof(u32); + } + + if (bytes > 0) { + u32 rem = siprand_u32(state); + + do { + *ptr++ = (u8)rem; + rem >>= BITS_PER_BYTE; + } while (--bytes > 0); + } + put_cpu_ptr(&net_rand_state); +} +EXPORT_SYMBOL(prandom_bytes); + +/** + * prandom_seed - add entropy to pseudo random number generator + * @entropy: entropy value + * + * Add some additional seed material to the prandom pool. + * The "entropy" is actually our IP address (the only caller is + * the network code), not for unpredictability, but to ensure that + * different machines are initialized differently. + */ +void prandom_seed(u32 entropy) +{ + int i; + + add_device_randomness(&entropy, sizeof(entropy)); + + for_each_possible_cpu(i) { + struct siprand_state *state = per_cpu_ptr(&net_rand_state, i); + unsigned long v0 = state->v0, v1 = state->v1; + unsigned long v2 = state->v2, v3 = state->v3; + + do { + v3 ^= entropy; + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + v0 ^= entropy; + } while (unlikely(!v0 || !v1 || !v2 || !v3)); + + WRITE_ONCE(state->v0, v0); + WRITE_ONCE(state->v1, v1); + WRITE_ONCE(state->v2, v2); + WRITE_ONCE(state->v3, v3); + } +} +EXPORT_SYMBOL(prandom_seed); + +/* + * Generate some initially weak seeding values to allow + * the prandom_u32() engine to be started. + */ +static int __init prandom_init_early(void) +{ + int i; + unsigned long v0, v1, v2, v3; + + if (!arch_get_random_long(&v0)) + v0 = jiffies; + if (!arch_get_random_long(&v1)) + v1 = random_get_entropy(); + v2 = v0 ^ PRND_K0; + v3 = v1 ^ PRND_K1; + + for_each_possible_cpu(i) { + struct siprand_state *state; + + v3 ^= i; + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + v0 ^= i; + + state = per_cpu_ptr(&net_rand_state, i); + state->v0 = v0; state->v1 = v1; + state->v2 = v2; state->v3 = v3; + } + + return 0; +} +core_initcall(prandom_init_early); + + +/* Stronger reseeding when available, and periodically thereafter. */ +static void prandom_reseed(struct timer_list *unused); + +static DEFINE_TIMER(seed_timer, prandom_reseed); + +static void prandom_reseed(struct timer_list *unused) +{ + unsigned long expires; + int i; + + /* + * Reinitialize each CPU's PRNG with 128 bits of key. + * No locking on the CPUs, but then somewhat random results are, + * well, expected. + */ + for_each_possible_cpu(i) { + struct siprand_state *state; + unsigned long v0 = get_random_long(), v2 = v0 ^ PRND_K0; + unsigned long v1 = get_random_long(), v3 = v1 ^ PRND_K1; +#if BITS_PER_LONG == 32 + int j; + + /* + * On 32-bit machines, hash in two extra words to + * approximate 128-bit key length. Not that the hash + * has that much security, but this prevents a trivial + * 64-bit brute force. 
+ */ + for (j = 0; j < 2; j++) { + unsigned long m = get_random_long(); + + v3 ^= m; + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + v0 ^= m; + } +#endif + /* + * Probably impossible in practice, but there is a + * theoretical risk that a race between this reseeding + * and the target CPU writing its state back could + * create the all-zero SipHash fixed point. + * + * To ensure that never happens, ensure the state + * we write contains no zero words. + */ + state = per_cpu_ptr(&net_rand_state, i); + WRITE_ONCE(state->v0, v0 ? v0 : -1ul); + WRITE_ONCE(state->v1, v1 ? v1 : -1ul); + WRITE_ONCE(state->v2, v2 ? v2 : -1ul); + WRITE_ONCE(state->v3, v3 ? v3 : -1ul); + } + + /* reseed every ~60 seconds, in [40 .. 80) interval with slack */ + expires = round_jiffies(jiffies + 40 * HZ + prandom_u32_max(40 * HZ)); + mod_timer(&seed_timer, expires); +} + +/* + * The random ready callback can be called from almost any interrupt. + * To avoid worrying about whether it's safe to delay that interrupt + * long enough to seed all CPUs, just schedule an immediate timer event. + */ +static void prandom_timer_start(struct random_ready_callback *unused) +{ + mod_timer(&seed_timer, jiffies); +} + +/* + * Start periodic full reseeding as soon as strong + * random numbers are available. + */ +static int __init prandom_init_late(void) +{ + static struct random_ready_callback random_ready = { + .func = prandom_timer_start + }; + int ret = add_random_ready_callback(&random_ready); + + if (ret == -EALREADY) { + prandom_timer_start(&random_ready); + ret = 0; + } + return ret; +} +late_initcall(prandom_init_late);
From: Arnaldo Carvalho de Melo acme@redhat.com
stable inclusion from linux-4.19.158 commit b9658d18499e4cf1b6c5f122f12e1cc985ac5568
--------------------------------
commit d0e7b0c71fbb653de90a7163ef46912a96f0bdaf upstream.
To avoid this:
util/scripting-engines/trace-event-python.c: In function 'python_start_script': util/scripting-engines/trace-event-python.c:1595:2: error: 'visibility' attribute ignored [-Werror=attributes] 1595 | PyMODINIT_FUNC (*initfunc)(void); | ^~~~~~~~~~~~~~
That started breaking when building with PYTHON=python3 and these gcc versions (I haven't checked with the clang ones, maybe it breaks there as well):
# export PERF_TARBALL=http://192.168.86.5/perf/perf-5.9.0.tar.xz # dm fedora:33 fedora:rawhide 1 107.80 fedora:33 : Ok gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5), clang version 11.0.0 (Fedora 11.0.0-1.fc33) 2 92.47 fedora:rawhide : Ok gcc (GCC) 10.2.1 20201016 (Red Hat 10.2.1-6), clang version 11.0.0 (Fedora 11.0.0-1.fc34) #
Avoid that by ditching that 'initfunc' function pointer with its:
#define Py_EXPORTED_SYMBOL __attribute__ ((visibility ("default"))) #define PyMODINIT_FUNC Py_EXPORTED_SYMBOL PyObject*
And just call PyImport_AppendInittab() at the end of the ifdef python3 block with the functions that were being attributed to that initfunc.
Cc: Adrian Hunter adrian.hunter@intel.com Cc: Ian Rogers irogers@google.com Cc: Jiri Olsa jolsa@kernel.org Cc: Namhyung Kim namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Tapas Kundu tkundu@vmware.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/util/scripting-engines/trace-event-python.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c b/tools/perf/util/scripting-engines/trace-event-python.c index 9569cc06e0a7..2814251df06b 100644 --- a/tools/perf/util/scripting-engines/trace-event-python.c +++ b/tools/perf/util/scripting-engines/trace-event-python.c @@ -1493,7 +1493,6 @@ static void _free_command_line(wchar_t **command_line, int num) static int python_start_script(const char *script, int argc, const char **argv) { struct tables *tables = &tables_global; - PyMODINIT_FUNC (*initfunc)(void); #if PY_MAJOR_VERSION < 3 const char **command_line; #else @@ -1508,20 +1507,18 @@ static int python_start_script(const char *script, int argc, const char **argv) FILE *fp;
#if PY_MAJOR_VERSION < 3 - initfunc = initperf_trace_context; command_line = malloc((argc + 1) * sizeof(const char *)); command_line[0] = script; for (i = 1; i < argc + 1; i++) command_line[i] = argv[i - 1]; + PyImport_AppendInittab(name, initperf_trace_context); #else - initfunc = PyInit_perf_trace_context; command_line = malloc((argc + 1) * sizeof(wchar_t *)); command_line[0] = Py_DecodeLocale(script, NULL); for (i = 1; i < argc + 1; i++) command_line[i] = Py_DecodeLocale(argv[i - 1], NULL); + PyImport_AppendInittab(name, PyInit_perf_trace_context); #endif - - PyImport_AppendInittab(name, initfunc); Py_Initialize();
#if PY_MAJOR_VERSION < 3
From: Matteo Croce mcroce@microsoft.com
stable inclusion from linux-4.19.158 commit 9a6cea8220c608f6d59763fab06ff196185cc3ff
--------------------------------
commit 8b92c4ff4423aa9900cf838d3294fcade4dbda35 upstream.
Patch series "fix parsing of reboot= cmdline", v3.
The parsing of the reboot= cmdline has two major errors:
- a missing bound check can crash the system on reboot
- parsing of the cpu number only works if specified last
Fix both.
This patch (of 2):
This reverts commit 616feab753972b97.
kstrtoint() and simple_strtoul() have a subtle difference which makes them non-interchangeable: if a non-digit character is found during parsing, the former will return an error, while the latter will just stop parsing, e.g. simple_strtoul("123xyx") = 123.
The kernel cmdline reboot= argument allows specifying the CPU used for rebooting, with the syntax `s####` among the other flags, e.g. "reboot=warm,s31,force"; so if this flag is not the last one given, it is silently ignored, as are the subsequent flags.
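A userspace analogy of the difference (kstrtoint() and simple_strtoul() are kernel-only, so strtol() and strtoul() stand in for them here):

/* Userspace analogy: strict whole-string parse vs. stop-at-first-junk. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Behaves like kstrtoint: the whole string must be a number. */
static int strict_atoi(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 0);
    if (errno || end == s || *end != '\0')
        return -EINVAL;
    *out = (int)v;
    return 0;
}

int main(void)
{
    int v;

    /* simple_strtoul-style parsing stops at the first non-digit: */
    printf("%lu\n", strtoul("31,force", NULL, 0));  /* prints 31 */

    /* a kstrtoint-style parse rejects the same input outright: */
    printf("%d\n", strict_atoi("31,force", &v));    /* prints -22 (-EINVAL) */
    return 0;
}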
Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint") Signed-off-by: Matteo Croce mcroce@microsoft.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Guenter Roeck linux@roeck-us.net Cc: Petr Mladek pmladek@suse.com Cc: Arnd Bergmann arnd@arndb.de Cc: Mike Rapoport rppt@kernel.org Cc: Kees Cook keescook@chromium.org Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Robin Holt robinmholt@gmail.com Cc: Fabian Frederick fabf@skynet.be Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee sudipm.mukherjee@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/reboot.c | 21 +++++++-------------- 1 file changed, 7 insertions(+), 14 deletions(-)
diff --git a/kernel/reboot.c b/kernel/reboot.c index 8fb44dec9ad7..db0eed5916d5 100644 --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -539,22 +539,15 @@ static int __init reboot_setup(char *str) break;
case 's': - { - int rc; - - if (isdigit(*(str+1))) { - rc = kstrtoint(str+1, 0, &reboot_cpu); - if (rc) - return rc; - } else if (str[1] == 'm' && str[2] == 'p' && - isdigit(*(str+3))) { - rc = kstrtoint(str+3, 0, &reboot_cpu); - if (rc) - return rc; - } else + if (isdigit(*(str+1))) + reboot_cpu = simple_strtoul(str+1, NULL, 0); + else if (str[1] == 'm' && str[2] == 'p' && + isdigit(*(str+3))) + reboot_cpu = simple_strtoul(str+3, NULL, 0); + else reboot_mode = REBOOT_SOFT; break; - } + case 'g': reboot_mode = REBOOT_GPIO; break;
From: Matteo Croce mcroce@microsoft.com
stable inclusion from linux-4.19.158 commit 2e021b7197fb911f934efcc45f019e5f7717e8db
--------------------------------
commit df5b0ab3e08a156701b537809914b339b0daa526 upstream.
Limit the CPU number to num_possible_cpus(), because setting it to a value lower than INT_MAX but higher than NR_CPUS produces the following error on reboot and shutdown:
BUG: unable to handle page fault for address: ffffffff90ab1bb0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1c09067 P4D 1c09067 PUD 1c0a063 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 5.9.0-rc8-kvm #110 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:migrate_to_reboot_cpu+0xe/0x60 Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da <48> 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c RSP: 0018:ffffc90000013e08 EFLAGS: 00010246 RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000 RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0 RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __do_sys_reboot.cold+0x34/0x5b do_syscall_64+0x2d/0x40
Fixes: 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel") Signed-off-by: Matteo Croce mcroce@microsoft.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Arnd Bergmann arnd@arndb.de Cc: Fabian Frederick fabf@skynet.be Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Guenter Roeck linux@roeck-us.net Cc: Kees Cook keescook@chromium.org Cc: Mike Rapoport rppt@kernel.org Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Petr Mladek pmladek@suse.com Cc: Robin Holt robinmholt@gmail.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201103214025.116799-3-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee sudipm.mukherjee@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/reboot.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/kernel/reboot.c b/kernel/reboot.c index db0eed5916d5..45bea54f9aee 100644 --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -546,6 +546,13 @@ static int __init reboot_setup(char *str) reboot_cpu = simple_strtoul(str+3, NULL, 0); else reboot_mode = REBOOT_SOFT; + if (reboot_cpu >= num_possible_cpus()) { + pr_err("Ignoring the CPU number in reboot= option. " + "CPU %d exceeds possible cpu number %d\n", + reboot_cpu, num_possible_cpus()); + reboot_cpu = 0; + break; + } break;
case 'g':
From: Yunsheng Lin linyunsheng@huawei.com
stable inclusion from linux-4.19.158 commit 81504d1952d712c8bb9c3966896efee8a37ea966
--------------------------------
When commit 2fb541c862c9 ("net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc") is backported to stable kernel, one assignment is missing, which causes two problems reported by Joakim and Vishwanath, see [1] and [2].
So add the assignment back to fix it.
1. https://www.spinics.net/lists/netdev/msg693916.html 2. https://www.spinics.net/lists/netdev/msg695131.html
Fixes: 749cc0b0c7f3 ("net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc") Signed-off-by: Yunsheng Lin linyunsheng@huawei.com Acked-by: Jakub Kicinski kuba@kernel.org Tested-by: Brian Norris briannorris@chromium.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/sched/sch_generic.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index bd96fd261dba..4e15913e7519 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -1116,10 +1116,13 @@ static void dev_deactivate_queue(struct net_device *dev, void *_qdisc_default) { struct Qdisc *qdisc = rtnl_dereference(dev_queue->qdisc); + struct Qdisc *qdisc_default = _qdisc_default;
if (qdisc) { if (!(qdisc->flags & TCQ_F_BUILTIN)) set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state); + + rcu_assign_pointer(dev_queue->qdisc, qdisc_default); } }
From: Boris Protopopov pboris@amazon.com
stable inclusion from linux-4.19.158 commit 987c45d57f388f9531d523b8424b8e97687bb323
--------------------------------
commit 57c176074057531b249cf522d90c22313fa74b0b upstream.
When converting trailing spaces and periods in paths, do so for every component of the path, not just the last component. If the conversion is not done for every path component, then subsequent operations in directories with trailing spaces or periods (e.g. create(), mkdir()) will fail with ENOENT. This is because on the server, the directory will have a special symbol in its name, and the client needs to provide the same.
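A rough userspace sketch of the per-component idea (illustrative only; the SFM codepoints for trailing space and period shown below follow the cifs_unicode.h definitions, but treat the exact values as assumptions):

/* Remap trailing spaces/periods in every path component, not just the last. */
#include <stdio.h>

#define SFM_SPACE   0xF028  /* assumed SFM mapping for a trailing ' ' */
#define SFM_PERIOD  0xF029  /* assumed SFM mapping for a trailing '.' */

static void remap_path(const char *src, unsigned short *dst)
{
    int i;

    for (i = 0; src[i]; i++) {
        /* end of a component: next char is a separator or NUL */
        int end_of_component = (src[i + 1] == '\\' || src[i + 1] == '\0');

        if (end_of_component && src[i] == ' ')
            dst[i] = SFM_SPACE;
        else if (end_of_component && src[i] == '.')
            dst[i] = SFM_PERIOD;
        else
            dst[i] = (unsigned short)src[i];
    }
}

int main(void)
{
    unsigned short out[64] = { 0 };
    int i;

    remap_path("dir \\file.", out); /* both components end specially */
    for (i = 0; out[i]; i++)
        printf("%04x ", out[i]);
    printf("\n");
    return 0;
}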
Signed-off-by: Boris Protopopov pboris@amazon.com Acked-by: Ronnie Sahlberg lsahlber@redhat.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/cifs/cifs_unicode.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c index a2b2355e7f01..9986817532b1 100644 --- a/fs/cifs/cifs_unicode.c +++ b/fs/cifs/cifs_unicode.c @@ -501,7 +501,13 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen, else if (map_chars == SFM_MAP_UNI_RSVD) { bool end_of_string;
- if (i == srclen - 1) + /** + * Remap spaces and periods found at the end of every + * component of the path. The special cases of '.' and + * '..' do not need to be dealt with explicitly because + * they are addressed in namei.c:link_path_walk(). + **/ + if ((i == srclen - 1) || (source[i+1] == '\\')) end_of_string = true; else end_of_string = false;
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.160 commit 8196e11a66d751752a25b62c6dc0e2923a5e6ff0
--------------------------------
[ Upstream commit 849920c703392957f94023f77ec89ca6cf119d43 ]
If sb_occ_port_pool_get() failed in devlink_nl_sb_port_pool_fill(), msg should be canceled by genlmsg_cancel().
Fixes: df38dafd2559 ("devlink: implement shared buffer occupancy monitoring interface") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201113111622.11040-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/core/devlink.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/core/devlink.c b/net/core/devlink.c index a77e3777c8dd..6ad095264896 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -1113,7 +1113,7 @@ static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg, err = ops->sb_occ_port_pool_get(devlink_port, devlink_sb->index, pool_index, &cur, &max); if (err && err != -EOPNOTSUPP) - return err; + goto sb_occ_get_failure; if (!err) { if (nla_put_u32(msg, DEVLINK_ATTR_SB_OCC_CUR, cur)) goto nla_put_failure; @@ -1126,8 +1126,10 @@ static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg, return 0;
nla_put_failure: + err = -EMSGSIZE; +sb_occ_get_failure: genlmsg_cancel(msg, hdr); - return -EMSGSIZE; + return err; }
static int devlink_nl_cmd_sb_port_pool_get_doit(struct sk_buff *skb,
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.160 commit f0ec2cd03100ab287c9e5fce5ae56a76d6fc17a4
--------------------------------
[ Upstream commit e33de7c5317e2827b2ba6fd120a505e9eb727b05 ]
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message.
Fixes: d545caca827b ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/inet_diag.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index f0957ebf82cf..d07917059d70 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -392,8 +392,10 @@ static int inet_req_diag_fill(struct sock *sk, struct sk_buff *skb, r->idiag_inode = 0;
if (net_admin && nla_put_u32(skb, INET_DIAG_MARK, - inet_rsk(reqsk)->ir_mark)) + inet_rsk(reqsk)->ir_mark)) { + nlmsg_cancel(skb, nlh); return -EMSGSIZE; + }
nlmsg_end(skb, nlh); return 0;
From: Heiner Kallweit hkallweit1@gmail.com
stable inclusion from linux-4.19.160 commit b1197f5d6f532a3f82a66940c8cec44cf302bf6d
--------------------------------
[ Upstream commit 7a30ecc9237681bb125cbd30eee92bef7e86293d ]
In br_forward.c and br_input.c fields dev->stats.tx_dropped and dev->stats.multicast are populated, but they are ignored in ndo_get_stats64.
Fixes: 28172739f0a2 ("net: fix 64 bit counters on 32 bit arches") Signed-off-by: Heiner Kallweit hkallweit1@gmail.com Link: https://lore.kernel.org/r/58ea9963-77ad-a7cf-8dfd-fc95ab95f606@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/bridge/br_device.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c index 9ce661e2590d..a350c05b7ff5 100644 --- a/net/bridge/br_device.c +++ b/net/bridge/br_device.c @@ -215,6 +215,7 @@ static void br_get_stats64(struct net_device *dev, sum.rx_packets += tmp.rx_packets; }
+ netdev_stats_to_stats64(stats, &dev->stats); stats->tx_bytes = sum.tx_bytes; stats->tx_packets = sum.tx_packets; stats->rx_bytes = sum.rx_bytes;
From: Dongli Zhang dongli.zhang@oracle.com
stable inclusion from linux-4.19.160 commit 082b21d03479b566681f43b74677c249f6aad337
--------------------------------
[ Upstream commit d8c19014bba8f565d8a2f1f46b4e38d1d97bf1a7 ]
The ethernet driver may allocate skb (and skb->data) via napi_alloc_skb(). This ends up calling page_frag_alloc() to allocate skb->data from page_frag_cache->va.
Under memory pressure, page_frag_cache->va may be allocated as a pfmemalloc page. As a result, skb->pfmemalloc is always true because skb->data comes from page_frag_cache->va. The skb will be dropped if the sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour under memory pressure.
However, once the kernel is no longer under memory pressure (suppose a large number of pages have just been reclaimed), page_frag_alloc() may still re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a result, skb->pfmemalloc stays true unless page_frag_cache->va is re-allocated, even though the kernel is no longer under memory pressure.
Here is how the kernel runs into this issue.
1. The kernel is under memory pressure and allocation of PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead, a pfmemalloc page is allocated for page_frag_cache->va.
2. All skb->data from page_frag_cache->va (pfmemalloc) will have skb->pfmemalloc=true. The skb will always be dropped by a sock without SOCK_MEMALLOC. This is expected behaviour.
3. Suppose a large number of pages are reclaimed and the kernel is no longer under memory pressure. We expect the skb drops due to pfmemalloc to stop happening.
4. Unfortunately, page_frag_alloc() does not proactively re-allocate page_frag_cache->va and will always re-use the prior pfmemalloc page. skb->pfmemalloc therefore remains true even though the kernel is no longer under memory pressure.
Fix this by freeing and re-allocating the page instead of recycling it.
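A toy model of that decision, with malloc()/free() standing in for the page allocator and a hypothetical under_pressure flag standing in for the pfmemalloc condition:

/* Toy model: never recycle a cache page that was allocated under pressure. */
#include <stdbool.h>
#include <stdlib.h>

struct frag_cache {
    void *va;
    bool pfmemalloc;    /* was va allocated from emergency reserves? */
    unsigned int offset;
    unsigned int size;
};

static void cache_refill(struct frag_cache *nc, bool under_pressure)
{
    nc->va = malloc(4096);
    nc->pfmemalloc = under_pressure;    /* recorded at allocation time */
    nc->size = 4096;
}

static void *frag_alloc(struct frag_cache *nc, unsigned int len,
                        bool under_pressure)
{
    if (nc->offset + len > nc->size) {
        /* The fix: free a pfmemalloc page instead of recycling it. */
        if (nc->va && nc->pfmemalloc) {
            free(nc->va);
            nc->va = NULL;
        }
        if (!nc->va)
            cache_refill(nc, under_pressure);
        nc->offset = 0;
    }
    nc->offset += len;
    return (char *)nc->va + nc->offset - len;
}

int main(void)
{
    struct frag_cache nc = { 0 };

    frag_alloc(&nc, 4096, true);    /* refilled while under pressure */
    frag_alloc(&nc, 4096, false);   /* pfmemalloc page freed, not reused */
    free(nc.va);
    return 0;
}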
References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/ References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/ Suggested-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Aruna Ramakrishna aruna.ramakrishna@oracle.com Cc: Bert Barbe bert.barbe@oracle.com Cc: Rama Nichanamatlu rama.nichanamatlu@oracle.com Cc: Venkat Venkatsubra venkat.x.venkatsubra@oracle.com Cc: Manjunath Patil manjunath.b.patil@oracle.com Cc: Joe Jin joe.jin@oracle.com Cc: SRINIVAS srinivas.eeda@oracle.com Fixes: 79930f5892e1 ("net: do not deplete pfmemalloc reserve") Signed-off-by: Dongli Zhang dongli.zhang@oracle.com Acked-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_alloc.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 504c9699b1af..fd77bf4979a9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4653,6 +4653,11 @@ void *page_frag_alloc(struct page_frag_cache *nc, if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) goto refill;
+ if (unlikely(nc->pfmemalloc)) { + free_the_page(page, compound_order(page)); + goto refill; + } + #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) /* if size can vary use size else just use PAGE_SIZE */ size = nc->size;
From: Ryan Sharpelletti sharpelletti@google.com
stable inclusion from linux-4.19.160 commit 88c2cfa3f78120bc755e538ffa8b456fc2ecf874
--------------------------------
[ Upstream commit 1b9e2a8c99a5c021041bfb2d512dc3ed92a94ffd ]
During loss recovery, retransmitted packets are forced to use TCP timestamps to calculate the RTT samples, which have a millisecond granularity. BBR is designed using a microsecond granularity. As a result, multiple RTT samples could be truncated to the same RTT value during loss recovery. This is problematic, as BBR will not enter PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning that if there are persistent losses, PROBE_RTT will constantly be pushed off and potentially never re-entered. This patch makes sure that BBR enters PROBE_RTT by checking if RTT sample is < the current min_rtt sample, rather than <=.
The Netflix transport/TCP team discovered this bug in the Linux TCP BBR code during lab tests.
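A toy simulation of the effect (illustrative numbers, not the kernel code): when millisecond-truncated samples keep equalling min_rtt, a "<=" comparison refreshes min_rtt_stamp every time, so the 10-second filter window never expires; "<" lets it expire:

/* Toy model of the min_rtt filter with truncated, persistently equal samples. */
#include <stdbool.h>
#include <stdio.h>

#define FILTER_WIN 10000    /* min_rtt filter window in ms (BBR uses 10 s) */

static bool reaches_probe_rtt(bool strict_less)
{
    unsigned int min_rtt = 5, min_rtt_stamp = 0, now;

    for (now = 0; now < 60000; now += 100) {
        unsigned int sample = 5;    /* truncated to the same ms value */
        bool expired = now > min_rtt_stamp + FILTER_WIN;

        if ((strict_less && sample < min_rtt) ||
            (!strict_less && sample <= min_rtt)) {
            min_rtt = sample;
            min_rtt_stamp = now;    /* window pushed out again */
        } else if (expired) {
            return true;            /* would enter PROBE_RTT */
        }
    }
    return false;
}

int main(void)
{
    printf("with <=: PROBE_RTT reached? %d\n", reaches_probe_rtt(false)); /* 0 */
    printf("with < : PROBE_RTT reached? %d\n", reaches_probe_rtt(true));  /* 1 */
    return 0;
}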
Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Ryan Sharpelletti sharpelletti@google.com Signed-off-by: Neal Cardwell ncardwell@google.com Signed-off-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: Yuchung Cheng ycheng@google.com Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.c... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/tcp_bbr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 93f176336297..b70c9365e131 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -917,7 +917,7 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) filter_expired = after(tcp_jiffies32, bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ); if (rs->rtt_us >= 0 && - (rs->rtt_us <= bbr->min_rtt_us || + (rs->rtt_us < bbr->min_rtt_us || (filter_expired && !rs->is_ack_delayed))) { bbr->min_rtt_us = rs->rtt_us; bbr->min_rtt_stamp = tcp_jiffies32;
From: Will Deacon will@kernel.org
stable inclusion from linux-4.19.160 commit 331298e711e7da5c296c55b4e7ec9ee7894126a7
--------------------------------
[ Upstream commit 891deb87585017d526b67b59c15d38755b900fea ]
cpu_psci_cpu_die() is called in the context of the dying CPU, which will no longer be online or tracked by RCU. It is therefore not generally safe to call printk() if the PSCI "cpu off" request fails, so remove the pr_crit() invocation.
Cc: Qian Cai cai@redhat.com Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Link: https://lore.kernel.org/r/20201106103602.9849-2-will@kernel.org Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/arm64/kernel/psci.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/psci.c b/arch/arm64/kernel/psci.c index 3856d51c645b..3ebb2a56e5f7 100644 --- a/arch/arm64/kernel/psci.c +++ b/arch/arm64/kernel/psci.c @@ -69,7 +69,6 @@ static int cpu_psci_cpu_disable(unsigned int cpu)
static void cpu_psci_cpu_die(unsigned int cpu) { - int ret; /* * There are no known implementations of PSCI actually using the * power state field, pass a sensible default for now. @@ -77,9 +76,7 @@ static void cpu_psci_cpu_die(unsigned int cpu) u32 state = PSCI_POWER_STATE_TYPE_POWER_DOWN << PSCI_0_2_POWER_STATE_TYPE_SHIFT;
- ret = psci_ops.cpu_off(state); - - pr_crit("unable to power off CPU%u (%d)\n", cpu, ret); + psci_ops.cpu_off(state); }
static int cpu_psci_cpu_kill(unsigned int cpu)
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit c7d7fe001b6fb46d2e625c2f28acecf0c2d7dae6
--------------------------------
[ Upstream commit 22843291efc986ce7722610073fcf85a39b4cb13 ]
__sb_start_write has some weird looking lockdep code that claims to exist to handle nested freeze locking requests from xfs. The code as written seems broken -- if we think we hold a read lock on any of the higher freeze levels (e.g. we hold SB_FREEZE_WRITE and are trying to lock SB_FREEZE_PAGEFAULT), it converts a blocking lock attempt into a trylock.
However, it's not correct to downgrade a blocking lock attempt to a trylock unless the downgrading code or the callers are prepared to deal with that situation. Neither __sb_start_write nor its callers handle this at all. For example:
sb_start_pagefault ignores the return value completely, with the result that if xfs_filemap_fault loses a race with a different thread trying to fsfreeze, it will proceed without pagefault freeze protection (thereby breaking locking rules) and then unlocks the pagefault freeze lock that it doesn't own on its way out (thereby corrupting the lock state), which leads to a system hang shortly afterwards.
Normally, this won't happen because our ownership of a read lock on a higher freeze protection level blocks fsfreeze from grabbing a write lock on that higher level. *However*, if lockdep is offline, lock_is_held_type unconditionally returns 1, which means that percpu_rwsem_is_held returns 1, which means that __sb_start_write unconditionally converts blocking freeze lock attempts into trylocks, even when we *don't* hold anything that would block a fsfreeze.
Apparently this all held together until 5.10-rc1, when bugs in lockdep caused lockdep to shut itself off early in an fstests run, and once fstests gets to the "race writes with freezer" tests, kaboom. This might explain the long trail of vanishingly infrequent livelocks in fstests after lockdep goes offline that I've never been able to diagnose.
We could fix it by spinning on the trylock if wait==true, but AFAICT the locking works fine if lockdep is not built at all (and I didn't see any complaints running fstests overnight), so remove this snippet entirely.
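A distilled pthreads analogy of why the silent downgrade is dangerous (this is not the kernel locking code; a pthread_rwlock_t stands in for the freeze protection semaphores):

/* Blocking lock silently downgraded to a trylock, result ignored by the
 * caller -- exactly the pattern described above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t freeze_lock = PTHREAD_RWLOCK_INITIALIZER;

static void *freezer(void *arg)
{
    pthread_rwlock_wrlock(&freeze_lock);    /* "fsfreeze" in progress */
    sleep(1);
    pthread_rwlock_unlock(&freeze_lock);
    return NULL;
}

static void sb_start_pagefault_buggy(void)
{
    (void)pthread_rwlock_tryrdlock(&freeze_lock); /* may fail, nobody checks */
}

int main(void)
{
    pthread_t t;
    int err;

    pthread_create(&t, NULL, freezer, NULL);
    usleep(100 * 1000);         /* let the freezer win the race */

    sb_start_pagefault_buggy(); /* trylock failed silently */
    /* ...fault handling proceeds unprotected, then on the way out: */
    err = pthread_rwlock_unlock(&freeze_lock);  /* a lock we never took:
                                                   undefined behaviour */
    printf("bogus unlock returned %d\n", err);
    pthread_join(t, NULL);
    return 0;
}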
NOTE: Commit f4b554af9931 in 2015 created the current weird logic (which used to exist in a different form in commit 5accdf82ba25c from 2012) in __sb_start_write. XFS solved this whole problem in the late 2.6 era by creating a variant of transactions (XFS_TRANS_NO_WRITECOUNT) that don't grab intwrite freeze protection, thus making lockdep's solution unnecessary. The commit claims that Dave Chinner explained that the trylock hack + comment could be removed, but nobody ever did.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/super.c | 33 ++++----------------------------- 1 file changed, 4 insertions(+), 29 deletions(-)
diff --git a/fs/super.c b/fs/super.c index f3a8c008e164..9fb4553c46e6 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1360,36 +1360,11 @@ EXPORT_SYMBOL(__sb_end_write); */ int __sb_start_write(struct super_block *sb, int level, bool wait) { - bool force_trylock = false; - int ret = 1; + if (!wait) + return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
-#ifdef CONFIG_LOCKDEP - /* - * We want lockdep to tell us about possible deadlocks with freezing - * but it's it bit tricky to properly instrument it. Getting a freeze - * protection works as getting a read lock but there are subtle - * problems. XFS for example gets freeze protection on internal level - * twice in some cases, which is OK only because we already hold a - * freeze protection also on higher level. Due to these cases we have - * to use wait == F (trylock mode) which must not fail. - */ - if (wait) { - int i; - - for (i = 0; i < level - 1; i++) - if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) { - force_trylock = true; - break; - } - } -#endif - if (wait && !force_trylock) - percpu_down_read(sb->s_writers.rw_sem + level-1); - else - ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1); - - WARN_ON(force_trylock && !ret); - return ret; + percpu_down_read(sb->s_writers.rw_sem + level-1); + return 1; } EXPORT_SYMBOL(__sb_start_write);
From: Leo Yan leo.yan@linaro.org
stable inclusion from linux-4.19.160 commit c4d54937576e5b9cdedb84e371eedf1f0ba67c0b
--------------------------------
[ Upstream commit b0e5a05cc9e37763c7f19366d94b1a6160c755bc ]
When execute command "perf lock report", it hits failure and outputs log as follows:
perf: builtin-lock.c:623: report_lock_release_event: Assertion `!(seq->read_count < 0)' failed. Aborted
This is an imbalance issue. The locking sequence structure "lock_seq_stat" contains the reader counter, which is used to check whether the locking sequence is balanced between acquiring and releasing.
If the tool wrongly frees "lock_seq_stat" while "read_count" isn't zero, "read_count" will be reset to zero when a new structure is allocated the next time; this miscounts the readers and finally results in the imbalance.
To fix this issue, if "read_count" is found to be non-zero (meaning there are still readers in the locking sequence), jump to the "end" tag and skip freeing the "lock_seq_stat" structure.
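The shape of the fix, distilled into a standalone C sketch (the names mirror the perf code, but the sketch itself is illustrative):

/* On release, decrement the read count and free the sequence state only
 * when no readers remain. */
#include <assert.h>
#include <stdlib.h>

struct lock_seq_stat {
    int read_count;
    /* ...per-sequence state... */
};

static void report_release(struct lock_seq_stat **seqp)
{
    struct lock_seq_stat *seq = *seqp;

    seq->read_count--;
    assert(seq->read_count >= 0);
    if (seq->read_count)
        return;     /* still read-acquired: keep the state */

    free(seq);      /* last reader gone: now safe to free */
    *seqp = NULL;
}

int main(void)
{
    struct lock_seq_stat *seq = calloc(1, sizeof(*seq));

    seq->read_count = 2;
    report_release(&seq);   /* one reader left: state kept */
    report_release(&seq);   /* last reader: state freed */
    return seq == NULL ? 0 : 1;
}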
Fixes: e4cef1f65061 ("perf lock: Fix state machine to recognize lock sequence") Signed-off-by: Leo Yan leo.yan@linaro.org Acked-by: Jiri Olsa jolsa@redhat.com Link: https://lore.kernel.org/r/20201104094229.17509-2-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/builtin-lock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c index 6e0189df2b3b..0cb7f7b731fb 100644 --- a/tools/perf/builtin-lock.c +++ b/tools/perf/builtin-lock.c @@ -620,7 +620,7 @@ static int report_lock_release_event(struct perf_evsel *evsel, case SEQ_STATE_READ_ACQUIRED: seq->read_count--; BUG_ON(seq->read_count < 0); - if (!seq->read_count) { + if (seq->read_count) { ls->nr_release++; goto end; }
From: Yi-Hung Wei yihung.wei@gmail.com
stable inclusion from linux-4.19.160 commit 901e04cd478133ba6458fecbd3b232b528d26f3b
--------------------------------
[ Upstream commit 9c2e14b48119b39446031d29d994044ae958d8fc ]
Currently, we may set the tunnel option flag even when the size of the metadata is zero. For example, we set TUNNEL_GENEVE_OPT in the receive function regardless of whether the geneve option is present. As this may cause problems for consumers of the tunnel flags, this patch fixes the issue.
Related discussion: * https://lore.kernel.org/netdev/1604448694-19351-1-git-send-email-yihung.wei@...
Fixes: 256c87c17c53 ("net: check tunnel option type in tunnel flags") Signed-off-by: Yi-Hung Wei yihung.wei@gmail.com Link: https://lore.kernel.org/r/1605053800-74072-1-git-send-email-yihung.wei@gmail... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/net/geneve.c | 3 +-- include/net/ip_tunnels.h | 7 ++++--- 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index fdb251e1f471..e3b15daeedfe 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -223,8 +223,7 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs, if (ip_tunnel_collect_metadata() || gs->collect_md) { __be16 flags;
- flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT | - (gnvh->oam ? TUNNEL_OAM : 0) | + flags = TUNNEL_KEY | (gnvh->oam ? TUNNEL_OAM : 0) | (gnvh->critical ? TUNNEL_CRIT_OPT : 0);
tun_dst = udp_tun_rx_dst(skb, geneve_get_sk_family(gs), flags, diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index e11423530d64..f8873c4eb003 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -489,9 +489,11 @@ static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info, const void *from, int len, __be16 flags) { - memcpy(ip_tunnel_info_opts(info), from, len); info->options_len = len; - info->key.tun_flags |= flags; + if (len > 0) { + memcpy(ip_tunnel_info_opts(info), from, len); + info->key.tun_flags |= flags; + } }
static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate) @@ -537,7 +539,6 @@ static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info, __be16 flags) { info->options_len = 0; - info->key.tun_flags |= flags; }
#endif /* CONFIG_INET */
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit 804b269c252445d3b98497c8b02db0152c65cd63
--------------------------------
[ Upstream commit e95b6c3ef1311dd7b20467d932a24b6d0fd88395 ]
The comment and logic in xchk_btree_check_minrecs for dealing with inode-rooted btrees aren't quite correct. While the direct children of the inode root are allowed to have fewer records than what would normally be allowed for a regular ondisk btree block, this is only true if there is only one child block and the number of records doesn't fit in the inode root.
Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/scrub/btree.c | 45 ++++++++++++++++++++++++++------------------ 1 file changed, 27 insertions(+), 18 deletions(-)
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 4ae959f7ad2c..c924fe3cdad6 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -450,32 +450,41 @@ xchk_btree_check_minrecs( int level, struct xfs_btree_block *block) { - unsigned int numrecs; - int ok_level; - - numrecs = be16_to_cpu(block->bb_numrecs); + struct xfs_btree_cur *cur = bs->cur; + unsigned int root_level = cur->bc_nlevels - 1; + unsigned int numrecs = be16_to_cpu(block->bb_numrecs);
/* More records than minrecs means the block is ok. */ - if (numrecs >= bs->cur->bc_ops->get_minrecs(bs->cur, level)) + if (numrecs >= cur->bc_ops->get_minrecs(cur, level)) return;
/* - * Certain btree blocks /can/ have fewer than minrecs records. Any - * level greater than or equal to the level of the highest dedicated - * btree block are allowed to violate this constraint. - * - * For a btree rooted in a block, the btree root can have fewer than - * minrecs records. If the btree is rooted in an inode and does not - * store records in the root, the direct children of the root and the - * root itself can have fewer than minrecs records. + * For btrees rooted in the inode, it's possible that the root block + * contents spilled into a regular ondisk block because there wasn't + * enough space in the inode root. The number of records in that + * child block might be less than the standard minrecs, but that's ok + * provided that there's only one direct child of the root. */ - ok_level = bs->cur->bc_nlevels - 1; - if (bs->cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) - ok_level--; - if (level >= ok_level) + if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) && + level == cur->bc_nlevels - 2) { + struct xfs_btree_block *root_block; + struct xfs_buf *root_bp; + int root_maxrecs; + + root_block = xfs_btree_get_block(cur, root_level, &root_bp); + root_maxrecs = cur->bc_ops->get_dmaxrecs(cur, root_level); + if (be16_to_cpu(root_block->bb_numrecs) != 1 || + numrecs <= root_maxrecs) + xchk_btree_set_corrupt(bs->sc, cur, level); return; + }
- xchk_btree_set_corrupt(bs->sc, bs->cur, level); + /* + * Otherwise, only the root level is allowed to have fewer than minrecs + * records or keyptrs. + */ + if (level < root_level) + xchk_btree_set_corrupt(bs->sc, cur, level); }
/*
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit d88cc6f344a2b4a8095a0c8f392b4fa632707276
--------------------------------
[ Upstream commit 498fe261f0d6d5189f8e11d283705dd97b474b54 ]
We always know the correct state of the rmap record flags (attr, bmbt, unwritten) so check them by direct comparison.
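The patch uses the `!!(a) != !!(b)` idiom: both conditions are normalized to 0/1, so a mismatch is caught in either direction, where the old check only caught one. A tiny standalone example (the flag value is illustrative):

/* Direct comparison of a predicate against a flag bit, both directions. */
#include <stdio.h>

#define UNWRITTEN_FLAG 0x4  /* illustrative flag bit */

static int state_mismatch(int is_unwritten, unsigned int rm_flags)
{
    return !!is_unwritten != !!(rm_flags & UNWRITTEN_FLAG);
}

int main(void)
{
    printf("%d\n", state_mismatch(1, UNWRITTEN_FLAG)); /* 0: states agree */
    printf("%d\n", state_mismatch(0, UNWRITTEN_FLAG)); /* 1: missed by the old one-way check */
    printf("%d\n", state_mismatch(1, 0));              /* 1: the only case the old check caught */
    return 0;
}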
Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/scrub/bmap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index d1fc8a3fb891..dcc7e65bb7be 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -207,13 +207,13 @@ xchk_bmap_xref_rmap( * which doesn't track unwritten state. */ if (owner != XFS_RMAP_OWN_COW && - irec->br_state == XFS_EXT_UNWRITTEN && - !(rmap.rm_flags & XFS_RMAP_UNWRITTEN)) + !!(irec->br_state == XFS_EXT_UNWRITTEN) != + !!(rmap.rm_flags & XFS_RMAP_UNWRITTEN)) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff);
- if (info->whichfork == XFS_ATTR_FORK && - !(rmap.rm_flags & XFS_RMAP_ATTR_FORK)) + if (!!(info->whichfork == XFS_ATTR_FORK) != + !!(rmap.rm_flags & XFS_RMAP_ATTR_FORK)) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); if (rmap.rm_flags & XFS_RMAP_BMBT_BLOCK)
From: Luo Meng luomeng12@huawei.com
stable inclusion from linux-4.19.160 commit 730b192ad262545a269483609f52d29bff503507
--------------------------------
[ Upstream commit 2801a5da5b25b7af9dd2addd19b2315c02d17b64 ]
Fix a mutex_unlock() issue: the mutex is not yet locked when copy_from_user() is called, so the error path for a failed copy_from_user() must not unlock it.
Fixes: 4b1a29a7f542 ("error-injection: Support fault injection framework") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Luo Meng luomeng12@huawei.com Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Alexei Starovoitov ast@kernel.org Acked-by: Masami Hiramatsu mhiramat@kernel.org Link: https://lore.kernel.org/bpf/160570737118.263807.8358435412898356284.stgit@de... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/fail_function.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/fail_function.c b/kernel/fail_function.c index bc80a4e268c0..a52151a2291f 100644 --- a/kernel/fail_function.c +++ b/kernel/fail_function.c @@ -261,7 +261,7 @@ static ssize_t fei_write(struct file *file, const char __user *buffer,
if (copy_from_user(buf, buffer, count)) { ret = -EFAULT; - goto out; + goto out_free; } buf[count] = '\0'; sym = strstrip(buf); @@ -315,8 +315,9 @@ static ssize_t fei_write(struct file *file, const char __user *buffer, ret = count; } out: - kfree(buf); mutex_unlock(&fei_lock); +out_free: + kfree(buf); return ret; }
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit 088ad76f42062164775af47464f56c7fc6a13fca
--------------------------------
[ Upstream commit eb8409071a1d47e3593cfe077107ac46853182ab ]
This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f.
Your maintainer committed a major braino in the rmap code by adding the attr fork, bmbt, and unwritten extent usage bits into rmap record key comparisons. While XFS uses the usage bits *in the rmap records* for cross-referencing metadata in xfs_scrub and xfs_repair, it only needs the owner and offset information to distinguish between reverse mappings of the same physical extent into the data fork of a file at multiple offsets. The other bits are not important for key comparisons for index lookups, and never have been.
Eric Sandeen reports that this causes regressions in generic/299, so undo this patch before it does more damage.
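For illustration, the masking behaviour the revert restores, as a standalone sketch (the bit positions follow the XFS on-disk packing of rm_offset, but treat them as assumptions here):

/* Compare packed rmap offsets as index keys: usage bits must not matter. */
#include <stdint.h>
#include <stdio.h>

#define RMAP_OFF_ATTR_FORK  (1ULL << 63)    /* assumed bit positions */
#define RMAP_OFF_BMBT_BLOCK (1ULL << 62)
#define RMAP_OFF_UNWRITTEN  (1ULL << 61)
#define RMAP_OFF_FLAGS      (RMAP_OFF_ATTR_FORK | RMAP_OFF_BMBT_BLOCK | \
                             RMAP_OFF_UNWRITTEN)
#define RMAP_OFF(off)       ((off) & ~RMAP_OFF_FLAGS)

static int key_cmp(uint64_t a, uint64_t b)
{
    uint64_t x = RMAP_OFF(a), y = RMAP_OFF(b);

    return (x > y) - (x < y);
}

int main(void)
{
    uint64_t k1 = 100;
    uint64_t k2 = 100 | RMAP_OFF_UNWRITTEN;

    /* Same logical offset, different usage bits: equal as index keys. */
    printf("%d\n", key_cmp(k1, k2));    /* 0 */
    return 0;
}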
Reported-by: Eric Sandeen sandeen@sandeen.net Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Eric Sandeen sandeen@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_rmap_btree.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 77528f413286..f79cf040d745 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -247,8 +247,8 @@ xfs_rmapbt_key_diff(
 	else if (y > x)
 		return -1;
 
-	x = be64_to_cpu(kp->rm_offset);
-	y = xfs_rmap_irec_offset_pack(rec);
+	x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset));
+	y = rec->rm_offset;
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -279,8 +279,8 @@ xfs_rmapbt_diff_two_keys(
 	else if (y > x)
 		return -1;
 
-	x = be64_to_cpu(kp1->rm_offset);
-	y = be64_to_cpu(kp2->rm_offset);
+	x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
+	y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -393,8 +393,8 @@ xfs_rmapbt_keys_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = be64_to_cpu(k1->rmap.rm_offset);
-	b = be64_to_cpu(k2->rmap.rm_offset);
+	a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset));
+	b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset));
 	if (a <= b)
 		return 1;
 	return 0;
@@ -423,8 +423,8 @@ xfs_rmapbt_recs_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = be64_to_cpu(r1->rmap.rm_offset);
-	b = be64_to_cpu(r2->rmap.rm_offset);
+	a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset));
+	b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset));
 	if (a <= b)
 		return 1;
 	return 0;
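The revert boils down to one property: index key comparisons must look only at the owner/offset information, so any usage bits packed into the offset word have to be masked off (that is what XFS_RMAP_OFF does) before comparing. A small stand-alone sketch of that property, with a made-up bit layout rather than the real rmap encoding:

#include <stdint.h>
#include <stdio.h>

/* Made-up layout: low 54 bits carry the file offset, high bits carry
 * usage flags (the real rmap encoding differs). */
#define OFF_MASK	((UINT64_C(1) << 54) - 1)
#define FLAG_ATTR_FORK	(UINT64_C(1) << 62)

/* Key comparison for index lookups: offsets only, usage bits masked. */
static int offset_cmp(uint64_t a, uint64_t b)
{
	uint64_t x = a & OFF_MASK;
	uint64_t y = b & OFF_MASK;

	return (x > y) - (x < y);
}

int main(void)
{
	/* Same offset, different usage bits: equal as an index key. */
	printf("%d\n", offset_cmp(100 | FLAG_ATTR_FORK, 100));	/* 0 */
	return 0;
}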
From: Yicong Yang yangyicong@hisilicon.com
stable inclusion from linux-4.19.160 commit 7dfd3751915578b219cf1a25e312569fe0571739
--------------------------------
[ Upstream commit 488dac0c9237647e9b8f788b6a342595bfa40bda ]
The attr->set() callback receives a u64 value, but simple_strtoll() is used for the conversion, so a negative user input is cast incorrectly.
Use kstrtoull() instead of simple_strtoll() to convert the user-supplied string to an unsigned value: the former returns -EINVAL if it gets a negative value, while the latter cannot handle that case correctly. Also make 'val' an unsigned long long, as kstrtoull() expects; this eliminates a compile warning on non-64-bit architectures.
Fixes: f7b88631a897 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
Signed-off-by: Yicong Yang yangyicong@hisilicon.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Cc: Al Viro viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisil...
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/libfs.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 4bb80685ee6e..608566e38d12 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -934,7 +934,7 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf,
 			  size_t len, loff_t *ppos)
 {
 	struct simple_attr *attr;
-	u64 val;
+	unsigned long long val;
 	size_t size;
 	ssize_t ret;
 
@@ -952,7 +952,9 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf,
 		goto out;
 
 	attr->set_buf[size] = '\0';
-	val = simple_strtoll(attr->set_buf, NULL, 0);
+	ret = kstrtoull(attr->set_buf, 0, &val);
+	if (ret)
+		goto out;
 	ret = attr->set(attr->data, val);
 	if (ret == 0)
 		ret = len; /* on success, claim we got the whole input */
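A quick user-space demonstration of the behavior difference, with ordinary libc strtoll()/strtoull() standing in for the kernel helpers (this is only an analogy, not the kernel code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *input = "-1";
	unsigned long long val;

	/* Old behavior: signed parse, then cast into an unsigned value --
	 * "-1" silently becomes a huge number. */
	val = (unsigned long long)strtoll(input, NULL, 0);
	printf("signed parse + cast: %llu\n", val);

	/* New behavior: refuse negative input before the unsigned parse,
	 * roughly what kstrtoull() does by returning -EINVAL. */
	if (input[0] == '-')
		printf("kstrtoull()-style: reject \"%s\"\n", input);
	else
		val = strtoull(input, NULL, 0);
	return 0;
}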
From: Vamshi K Sthambamkadi vamshi.k.sthambamkadi@gmail.com
stable inclusion from linux-4.19.160 commit 3a07581b04648c8acb634045c7060db392853efe
--------------------------------
commit fe5186cf12e30facfe261e9be6c7904a170bd822 upstream.
kmemleak report:
unreferenced object 0xffff9b8915fcb000 (size 4096):
  comm "efivarfs.sh", pid 2360, jiffies 4294920096 (age 48.264s)
  hex dump (first 32 bytes):
    2d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  -...............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000cc4d897c>] kmem_cache_alloc_trace+0x155/0x4b0
    [<000000007d1dfa72>] efivarfs_create+0x6e/0x1a0
    [<00000000e6ee18fc>] path_openat+0xe4b/0x1120
    [<000000000ad0414f>] do_filp_open+0x91/0x100
    [<00000000ce93a198>] do_sys_openat2+0x20c/0x2d0
    [<000000002a91be6d>] do_sys_open+0x46/0x80
    [<000000000a854999>] __x64_sys_openat+0x20/0x30
    [<00000000c50d89c9>] do_syscall_64+0x38/0x90
    [<00000000cecd6b5f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
In efivarfs_create(), inode->i_private is set up with an efivar_entry object which is never freed.
Cc: stable@vger.kernel.org
Signed-off-by: Vamshi K Sthambamkadi vamshi.k.sthambamkadi@gmail.com
Link: https://lore.kernel.org/r/20201023115429.GA2479@cosmos
Signed-off-by: Ard Biesheuvel ardb@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/efivarfs/super.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 834615f13f3e..7808a26bd33f 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -23,6 +23,7 @@ LIST_HEAD(efivarfs_list);
 static void efivarfs_evict_inode(struct inode *inode)
 {
 	clear_inode(inode);
+	kfree(inode->i_private);
 }
 
 static const struct super_operations efivarfs_ops = {
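The one-line fix encodes a general ownership rule: whatever create hangs off inode->i_private, evict must free. A user-space analog of the leak and its fix (all names illustrative):

#include <stdlib.h>

/* The node owns private_data, as an efivarfs inode owns i_private. */
struct node {
	void *private_data;
};

static void node_evict(struct node *n)
{
	/* The fix is this free(): teardown releases what create allocated. */
	free(n->private_data);
	n->private_data = NULL;
}

int main(void)
{
	struct node n = { .private_data = malloc(64) };

	node_evict(&n);		/* without the free() above, 64 bytes leak
				 * per create/evict cycle -- the pattern
				 * kmemleak flagged */
	return 0;
}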
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.160 commit f2a32b26684550173f9fdd326923c0799b552916
--------------------------------
commit f902b216501094495ff75834035656e8119c537f upstream.
The idea of the warning in ext4_update_dx_flag() is that we should warn when we are clearing EXT4_INODE_INDEX on a filesystem with metadata checksums enabled since after clearing the flag, checksums for internal htree nodes will become invalid. So there's no need to warn (or actually do anything) when EXT4_INODE_INDEX is not set.
Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
Fixes: 48a34311953d ("ext4: fix checksum errors with indexed dirs")
Reported-by: Eric Biggers ebiggers@kernel.org
Reviewed-by: Eric Biggers ebiggers@google.com
Signed-off-by: Jan Kara jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/ext4.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03cf8faf51b9..b2e876bf0dec 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2410,7 +2410,8 @@ void ext4_insert_dentry(struct inode *inode,
 			 struct ext4_filename *fname);
 static inline void ext4_update_dx_flag(struct inode *inode)
 {
-	if (!ext4_has_feature_dir_index(inode->i_sb)) {
+	if (!ext4_has_feature_dir_index(inode->i_sb) &&
+	    ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) {
 		/* ext4_iget() should have caught this... */
 		WARN_ON_ONCE(ext4_has_feature_metadata_csum(inode->i_sb));
 		ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
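The extra ext4_test_inode_flag() check means the clear-and-warn path only runs when there is actually a set INDEX flag to clear. A compact stand-alone model of the fixed condition (the flag names here are placeholders, not the ext4 API):

#include <stdbool.h>
#include <stdio.h>

static bool feature_dir_index;		/* filesystem-wide feature bit */
static bool feature_csum = true;	/* metadata checksums enabled */
static bool inode_index_flag;		/* stand-in for EXT4_INODE_INDEX */

static void update_dx_flag(void)
{
	/* Fixed condition: warn and clear only if the flag is really set. */
	if (!feature_dir_index && inode_index_flag) {
		if (feature_csum)
			fprintf(stderr, "WARN: htree checksums now stale\n");
		inode_index_flag = false;
	}
}

int main(void)
{
	inode_index_flag = false;	/* flag already clear... */
	update_dx_flag();		/* ...so no bogus warning fires */
	return 0;
}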
From: Gerald Schaefer gerald.schaefer@linux.ibm.com
stable inclusion from linux-4.19.160 commit dc94f8e932b596926649fa2778eff8d90495e0ff
--------------------------------
commit bfe8cc1db02ab243c62780f17fc57f65bde0afe1 upstream.
Alexander reported a syzkaller / KASAN finding on s390, see below for complete output.
In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be freed in some cases. In the case of userfaultfd_missing(), this will happen after calling handle_userfault(), which might have released the mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will access an unstable vma->vm_mm, which could have been freed or re-used already.
For all architectures other than s390 this will go w/o any negative impact, because pte_free() simply frees the page and ignores the passed-in mm. The implementation for SPARC32 would also access mm->page_table_lock for pte_free(), but there is no THP support in SPARC32, so the buggy code path will not be used there.
For s390, the mm->context.pgtable_list is being used to maintain the 2K pagetable fragments, and operating on an already freed or even re-used mm could result in various more or less subtle bugs due to list / pagetable corruption.
Fix this by calling pte_free() before handle_userfault(), similar to how it is already done in __do_huge_pmd_anonymous_page() for the WRITE / non-huge_zero_page case.
Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults") actually introduced both the do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page() changes with respect to calling handle_userfault(), but only in the latter case did it put the pte_free() before the handle_userfault() call.
BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334

CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
Hardware name: IBM 3906 M04 701 (KVM/Linux)
Call Trace:
 do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
 create_huge_pmd mm/memory.c:4256 [inline]
 __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
 handle_mm_fault+0x288/0x748 mm/memory.c:4607
 do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
 do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
 pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
 copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
 raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
 _copy_from_user+0x48/0xa8 lib/usercopy.c:16
 copy_from_user include/linux/uaccess.h:192 [inline]
 __do_sys_sigaltstack kernel/signal.c:4064 [inline]
 __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
 system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

Allocated by task 9334:
 slab_alloc_node mm/slub.c:2891 [inline]
 slab_alloc mm/slub.c:2899 [inline]
 kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
 vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
 __split_vma+0xba/0x560 mm/mmap.c:2742
 split_vma+0xca/0x108 mm/mmap.c:2800
 mlock_fixup+0x4ae/0x600 mm/mlock.c:550
 apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
 do_mlock+0x1aa/0x718 mm/mlock.c:711
 __do_sys_mlock2 mm/mlock.c:738 [inline]
 __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
 system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

Freed by task 9333:
 slab_free mm/slub.c:3142 [inline]
 kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
 __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
 vma_merge+0x87e/0xce0 mm/mmap.c:1209
 userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
 __fput+0x22c/0x7a8 fs/file_table.c:281
 task_work_run+0x200/0x320 kernel/task_work.c:151
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
 system_call+0xe6/0x28c arch/s390/kernel/entry.S:416

The buggy address belongs to the object at 00000000962d6948
 which belongs to the cache vm_area_struct of size 200
The buggy address is located 64 bytes inside of
 200-byte region [00000000962d6948, 00000000962d6a10)
The buggy address belongs to the page:
page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6
flags: 0x3ffff00000000200(slab)
raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
page dumped because: kasan: bad access detected
page->mem_cgroup:0000000096959501

Memory state around the buggy address:
 00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
>00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
 00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
Reported-by: Alexander Egorenkov egorenar@linux.ibm.com
Signed-off-by: Gerald Schaefer gerald.schaefer@linux.ibm.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Heiko Carstens hca@linux.ibm.com
Cc: stable@vger.kernel.org [4.3+]
Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.c...
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 mm/huge_memory.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1287ea703a42..b60cbf77e902 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -702,7 +702,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			transparent_hugepage_use_zero_page()) {
 		pgtable_t pgtable;
 		struct page *zero_page;
-		bool set;
 		vm_fault_t ret;
 		pgtable = pte_alloc_one(vma->vm_mm, haddr);
 		if (unlikely(!pgtable))
@@ -715,25 +714,25 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		}
 		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 		ret = 0;
-		set = false;
 		if (pmd_none(*vmf->pmd)) {
 			ret = check_stable_address_space(vma->vm_mm);
 			if (ret) {
 				spin_unlock(vmf->ptl);
+				pte_free(vma->vm_mm, pgtable);
 			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
+				pte_free(vma->vm_mm, pgtable);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
 				set_huge_zero_page(pgtable, vma->vm_mm, vma,
 						   haddr, vmf->pmd, zero_page);
 				spin_unlock(vmf->ptl);
-				set = true;
 			}
-		} else
+		} else {
 			spin_unlock(vmf->ptl);
-		if (!set)
 			pte_free(vma->vm_mm, pgtable);
+		}
 		return ret;
 	}
 	gfp = alloc_hugepage_direct_gfpmask(vma);
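The reordering follows one rule: anything that dereferences vma->vm_mm must happen before a call that can drop mmap_lock and let the mm disappear. A user-space analog of the use-after-free and the fixed ordering (structures and names are illustrative, not the kernel types):

#include <stdlib.h>

/* pool_free() takes the mm as an argument and dereferences it, like the
 * s390 pte_free() walking mm->context.pgtable_list. */
struct mm { int pgtable_count; };
struct vma { struct mm *mm; };

static void pool_free(struct mm *mm, void *pgtable)
{
	mm->pgtable_count--;	/* touches the mm */
	free(pgtable);
}

/* Models handle_userfault(): after it returns, the mm may be gone. */
static void may_release_mm(struct vma *v)
{
	free(v->mm);
	v->mm = NULL;
}

int main(void)
{
	struct vma v = { .mm = calloc(1, sizeof(struct mm)) };
	void *pgtable = malloc(64);

	/* Fixed ordering: free the pagetable while v.mm is still valid... */
	pool_free(v.mm, pgtable);
	may_release_mm(&v);
	/* ...calling pool_free() after may_release_mm() would read freed
	 * memory -- the access KASAN reported. */
	return 0;
}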