bugfix for 20.03 @ 2021/02/22
Aichun Li (7): netpoll: remove dev argument from netpoll_send_skb_on_dev() netpoll: move netpoll_send_skb() out of line netpoll: netpoll_send_skb() returns transmit status netpoll: accept NULL np argument in netpoll_send_skb() bonding: add an option to specify a delay between peer notifications bonding: fix value exported by Netlink for peer_notif_delay bonding: add documentation for peer_notif_delay
Akilesh Kailash (1): dm snapshot: flush merged data before committing metadata
Al Viro (2): don't dump the threads that had been already exiting when zapped. dump_common_audit_data(): fix racy accesses to ->d_name
Aleksandr Nogikh (1): netem: fix zero division in tabledist
Alexander Duyck (1): tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
Alexander Lobakin (1): skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too
Alexey Dobriyan (2): proc: change ->nlink under proc_subdir_lock proc: fix lookup in /proc/net subdirectories after setns(2)
Alexey Kardashevskiy (1): serial_core: Check for port state when tty is in error state
Amit Cohen (1): mlxsw: core: Fix use-after-free in mlxsw_emad_trans_finish()
Andy Shevchenko (2): device property: Keep secondary firmware node secondary by type device property: Don't clear secondary pointer for shared primary firmware node
Antoine Tenart (6): netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal net: ip6_gre: set dev->hard_header_len when using header_ops net-sysfs: take the rtnl lock when storing xps_cpus net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc net-sysfs: take the rtnl lock when storing xps_rxqs net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
Ard Biesheuvel (3): efivarfs: revert "fix memory leak in efivarfs_create()" arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table efi/arm: Revert "Defer persistent reservations until after paging_init()"
Arjun Roy (1): tcp: Prevent low rmem stalls with SO_RCVLOWAT.
Arnaldo Carvalho de Melo (1): perf scripting python: Avoid declaring function pointers with a visibility attribute
Axel Lin (1): ASoC: msm8916-wcd-digital: Select REGMAP_MMIO to fix build error
Aya Levin (1): net: ipv6: Validate GSO SKB before finish IPv6 processing
Bart Van Assche (1): scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
Bharat Gooty (1): PCI: iproc: Fix out-of-bound array accesses
Bixuan Cui (2): mmap: fix a compiling error for 'MAP_CHECKNODE' powerpc: fix a compiling error for 'access_ok'
Bjorn Helgaas (1): PCI: Bounds-check command-line resource alignment requests
Björn Töpel (1): ixgbe: avoid premature Rx buffer reuse
Boqun Feng (1): fcntl: Fix potential deadlock in send_sig{io, urg}()
Boris Protopopov (1): Convert trailing spaces and periods in path components
Brian Foster (1): xfs: flush new eof page on truncate to avoid post-eof corruption
Calum Mackay (1): lockd: don't use interval-based rebinding over TCP
Chen Zhou (1): selinux: Fix error return code in sel_ib_pkey_sid_slow()
Chenguangli (1): scsi/hifc:Fix the bug that the system may be oops during unintall hifc module.
Cheng Lin (1): nfs_common: need lock during iterate through the list
Christoph Hellwig (2): nbd: fix a block_device refcount leak in nbd_release xfs: fix a missing unlock on error in xfs_fs_map_blocks
Chunguang Xu (1): ext4: fix a memory leak of ext4_free_data
Chunyan Zhang (1): tick/common: Touch watchdog in tick_unfreeze() on all CPUs
Colin Ian King (1): PCI: Fix overflow in command-line resource alignment requests
Cong Wang (1): erspan: fix version 1 check in gre_parse_header()
Damien Le Moal (1): null_blk: Fix zone size initialization
Dan Carpenter (1): futex: Don't enable IRQs unconditionally in put_pi_state()
Daniel Scally (1): Revert "ACPI / resources: Use AE_CTRL_TERMINATE to terminate resources walks"
Darrick J. Wong (12): xfs: fix realtime bitmap/summary file truncation when growing rt volume xfs: don't free rt blocks when we're doing a REMAP bunmapi call xfs: set xefi_discard when creating a deferred agfl free log intent item xfs: fix scrub flagging rtinherit even if there is no rt device xfs: fix flags argument to rmap lookup when converting shared file rmaps xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents xfs: fix rmap key and record comparison functions xfs: fix brainos in the refcount scrubber's rmap fragment processor vfs: remove lockdep bogosity in __sb_start_write xfs: fix the minrecs logic when dealing with inode root child blocks xfs: strengthen rmap record flags checking xfs: revert "xfs: fix rmap key and record comparison functions"
Dave Wysochanski (1): NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
Dexuan Cui (1): ACPI: scan: Harden acpi_device_add() against device ID overflows
Dinghao Liu (4): ext4: fix error handling code in add_new_gdb net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups net/mlx5e: Fix two double free cases netfilter: nf_nat: Fix memleak in nf_nat_init
Dongdong Wang (1): lwt: Disable BH too in run_lwt_bpf()
Dongli Zhang (1): page_frag: Recover from memory pressure
Douglas Gilbert (1): sgl_alloc_order: fix memory leak
Eddy Wu (1): fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
Eran Ben Elisha (1): net/mlx5: Fix wrong address reclaim when command interface is down
Eric Auger (1): vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
Eric Biggers (1): ext4: fix leaking sysfs kobject after failed mount
Eric Dumazet (4): tcp: select sane initial rcvq_space.space for big MSS net: avoid 32 x truesize under-estimation for tiny skbs net_sched: avoid shift-out-of-bounds in tcindex_set_parms() net_sched: reject silly cell_log in qdisc_get_rtab()
Fang Lijun (3): arm64/ascend: mm: Add MAP_CHECKNODE flag to check node hugetlb arm64/ascend: mm: Fix arm32 compile warnings arm64/ascend: mm: Fix hugetlb check node error
Fangrui Song (1): arm64: Change .weak to SYM_FUNC_START_WEAK_PI for arch/arm64/lib/mem*.S
Florian Fainelli (1): net: Have netpoll bring-up DSA management interface
Florian Westphal (4): netfilter: nf_tables: avoid false-postive lockdep splat netfilter: xt_RATEEST: reject non-null terminated string from userspace net: ip: always refragment ip defragmented packets net: fix pmtu check in nopmtudisc mode
Gabriel Krisman Bertazi (2): blk-cgroup: Fix memleak on error path blk-cgroup: Pre-allocate tree node on blkg_conf_prep
George Spelvin (1): random32: make prandom_u32() output unpredictable
Gerald Schaefer (1): mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
Guillaume Nault (4): ipv4: Fix tos mask in inet_rtm_getroute() ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() netfilter: rpfilter: mask ecn bits before fib lookup udp: mask TOS bits in udp_v4_early_demux()
Hanjun Guo (1): clocksource/drivers/arch_timer: Fix vdso_fix compile error for arm32
Hannes Reinecke (1): dm: avoid filesystem lookup in dm_get_dev_t()
Hans de Goede (1): ACPI: scan: Make acpi_bus_get_device() clear return pointer on error
Heiner Kallweit (1): net: bridge: add missing counters to ndo_get_stats64 callback
Hoang Le (1): tipc: fix NULL deref in tipc_link_xmit()
Huang Shijie (1): lib/genalloc: fix the overflow when size is too big
Huang Ying (1): mm: fix a race during THP splitting
Hugh Dickins (2): mlock: fix unevictable_pgs event counts on THP mm: fix check_move_unevictable_pages() on THP
Hui Wang (1): ACPI: PNP: compare the string length in the matching_id()
Hyeongseok Kim (1): dm verity: skip verity work if I/O error when system is shutting down
Ido Schimmel (2): mlxsw: core: Fix memory leak on module removal mlxsw: core: Use variable timeout for EMAD retries
Ilya Dryomov (1): libceph: clear con->out_msg on Policy::stateful_server faults
Jakub Kicinski (1): net: vlan: avoid leaks on register_vlan_dev() failures
Jamie Iles (1): bonding: wait for sysfs kobject destruction before freeing struct slave
Jan Kara (8): ext4: Detect already used quota file early ext4: fix bogus warning in ext4_update_dx_flag() ext4: Protect superblock modifications with a buffer lock ext4: fix deadlock with fs freezing and EA inodes ext4: don't remount read-only with errors=continue on reboot quota: Don't overflow quota file offsets bfq: Fix computation of shallow depth ext4: fix superblock checksum failure when setting password salt
Jann Horn (1): mm, slub: consider rest of partial list if acquire_slab() fails
Jason A. Donenfeld (3): netfilter: use actual socket sk rather than skb sk when routing harder net: introduce skb_list_walk_safe for skb segment walking net: skbuff: disambiguate argument and member for skb_list_walk_safe helper
Jeff Dike (1): virtio_net: Fix recursive call to cpus_read_lock()
Jens Axboe (1): proc: don't allow async path resolution of /proc/self components
Jesper Dangaard Brouer (1): netfilter: conntrack: fix reading nf_conntrack_buckets
Jessica Yu (1): module: delay kobject uevent until after module init call
Jiri Olsa (2): perf python scripting: Fix printable strings in python3 scripts perf tools: Add missing swap for ino_generation
Johannes Thumshirn (1): block: factor out requeue handling from dispatch code
Jonathan Cameron (1): ACPI: Add out of bounds and numa_off protections to pxm_to_node()
Joseph Qi (1): ext4: unlock xattr_sem properly in ext4_inline_data_truncate()
Jubin Zhong (1): PCI: Fix pci_slot_release() NULL pointer dereference
Kaixu Xia (1): ext4: correctly report "not supported" for {usr, grp}jquota when !CONFIG_QUOTA
Keqian Zhu (1): clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI
Kirill Tkhai (1): mm: move nr_deactivate accounting to shrink_active_list()
Lang Dai (1): uio: free uio id after uio file node is freed
Lecopzer Chen (2): kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow kasan: fix incorrect arguments passing in kasan_add_zero_shadow
Lee Duncan (1): scsi: libiscsi: Fix NOP race condition
Lee Jones (1): Fonts: Replace discarded const qualifier
Leo Yan (1): perf lock: Don't free "lock_seq_stat" if read_count isn't zero
Leon Romanovsky (1): net/mlx5: Properly convey driver version to firmware
Lijie (1): config: enable CONFIG_NVME_MULTIPATH by default
Liu Shixin (3): config: set default value of CONFIG_TEST_FREE_PAGES mm: memcontrol: add struct mem_cgroup_extension mm: fix kabi broken
Lorenzo Pieralisi (1): asm-generic/io.h: Fix !CONFIG_GENERIC_IOMAP pci_iounmap() implementation
Lu Jialin (1): fs: fix files.usage bug when move tasks
Luc Van Oostenryck (1): xsk: Fix xsk_poll()'s return type
Luo Meng (2): ext4: fix invalid inode checksum fail_function: Remove a redundant mutex unlock
Mao Wenan (1): net: Update window_clamp if SOCK_RCVBUF is set
Marc Zyngier (2): arm64: Run ARCH_WORKAROUND_1 enabling code on all CPUs genirq/irqdomain: Don't try to free an interrupt that has no mapping
Mark Rutland (3): arm64: syscall: exit userspace before unmasking exceptions arm64: module: rework special section handling arm64: module/ftrace: intialize PLT at load time
Martin Wilck (1): scsi: core: Fix VPD LUN ID designator priorities
Mateusz Nosek (1): futex: Fix incorrect should_fail_futex() handling
Matteo Croce (4): Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint" reboot: fix overflow parsing reboot cpu number ipv6: create multicast route with RTPROT_KERNEL ipv6: set multicast flag on the multicast route
Matthew Wilcox (Oracle) (1): mm/page_alloc.c: fix freeing non-compound pages
Maurizio Lombardi (2): scsi: target: remove boilerplate code scsi: target: fix hang when multiple threads try to destroy the same iscsi session
Maxim Mikityanskiy (1): net/tls: Protect from calling tls_dev_del for TLS RX twice
Miaohe Lin (1): mm/hugetlb: fix potential missing huge page size info
Michael Schaller (1): efivarfs: Replace invalid slashes with exclamation marks in dentries.
Mike Christie (1): scsi: target: iscsi: Fix cmd abort fabric stop race
Mike Galbraith (1): futex: Handle transient "ownerless" rtmutex state correctly
Miklos Szeredi (1): fuse: fix page dereference after free
Mikulas Patocka (3): dm integrity: fix the maximum number of arguments dm integrity: fix flush with external metadata device dm integrity: fix a crash if "recalculate" used without "internal_hash"
Ming Lei (2): scsi: core: Don't start concurrent async scan on same host block: fix use-after-free in disk_part_iter_next
Minwoo Im (1): nvme: free sq/cq dbbuf pointers when dbbuf set fails
Miroslav Benes (1): module: set MODULE_STATE_GOING state when a module fails to load
Moshe Shemesh (2): net/mlx4_en: Avoid scheduling restart task if it is already running net/mlx4_en: Handle TX error CQE
Naoya Horiguchi (1): mm, hwpoison: double-check page count in __get_any_page()
Naveen N. Rao (1): ftrace: Fix updating FTRACE_FL_TRAMP
Neal Cardwell (1): tcp: fix cwnd-limited bug for TSO deferral where we send nothing
NeilBrown (1): NFS: switch nfsiod to be an UNBOUND workqueue.
Nicholas Piggin (1): mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Oleg Nesterov (1): ptrace: fix task_join_group_stop() for the case when current is traced
Oliver Herms (1): IPv6: Set SIT tunnel hard_header_len to zero
Paul Moore (1): selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
Paulo Alcantara (1): cifs: fix potential use-after-free in cifs_echo_request()
Peng Liu (1): sched/deadline: Fix sched_dl_global_validate()
Peter Zijlstra (2): serial: pl011: Fix lockdep splat when handling magic-sysrq interrupt perf: Fix get_recursion_context()
Petr Malat (1): sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platforms
Qian Cai (1): mm/swapfile: do not sleep with a spin lock held
Qiujun Huang (2): ring-buffer: Return 0 on success from ring_buffer_resize() tracing: Fix out of bounds write in get_trace_buf
Rafael J. Wysocki (1): driver core: Extend device_is_dependent()
Randy Dunlap (1): net: sched: prevent invalid Scell_log shift count
Ritika Srivastava (2): block: Return blk_status_t instead of errno codes block: better deal with the delayed not supported case in blk_cloned_rq_check_limits
Ronnie Sahlberg (1): cifs: handle -EINTR in cifs_setattr
Ryan Sharpelletti (1): tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate
Sami Tolvanen (1): arm64: lse: fix LSE atomics with LLVM's integrated assembler
Sean Tranchetti (1): net: ipv6: fib: flush exceptions when purging route
Shakeel Butt (2): mm: swap: fix vmstats for huge pages mm: swap: memcg: fix memcg stats for huge pages
Shijie Luo (1): mm: mempolicy: fix potential pte_unmap_unlock pte error
Shin'ichiro Kawasaki (1): uio: Fix use-after-free in uio_unregister_device()
Stefano Brivio (1): netfilter: ipset: Update byte and packet counters regardless of whether they match
Steven Rostedt (VMware) (4): ring-buffer: Fix recursion protection transitions between interrupt context ftrace: Fix recursion check for NMI test ftrace: Handle tracing when switching between context tracing: Fix userstacktrace option for instances
Subash Abhinov Kasiviswanathan (2): netfilter: x_tables: Switch synchronization to RCU netfilter: x_tables: Update remaining dereference to RCU
Sven Eckelmann (2): vxlan: Add needed_headroom for lower device vxlan: Copy needed_tailroom from lowerdev
Sylwester Dziedziuch (2): i40e: Fix removing driver while bare-metal VFs pass traffic i40e: Fix Error I40E_AQ_RC_EINVAL when removing VFs
Takashi Iwai (1): libata: transport: Use scnprintf() for avoiding potential buffer overflow
Tariq Toukan (1): net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
Thomas Gleixner (12): sched: Reenable interrupts in do_sched_yield() futex: Move futex exit handling into futex code futex: Replace PF_EXITPIDONE with a state exit/exec: Seperate mm_release() futex: Split futex_mm_release() for exit/exec futex: Set task::futex_state to DEAD right after handling futex exit futex: Mark the begin of futex exit explicitly futex: Sanitize exit state handling futex: Provide state handling for exec() as well futex: Add mutex around futex exit futex: Provide distinct return value when owner is exiting futex: Prevent exit livelock
Tianyue Ren (1): selinux: fix error initialization in inode_doinit_with_dentry()
Trond Myklebust (5): SUNRPC: xprt_load_transport() needs to support the netid "rdma6" NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode pNFS: Mark layout for return if return-on-close was not sent NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter NFS: nfs_igrab_and_active must first reference the superblock
Tung Nguyen (1): tipc: fix memory leak caused by tipc_buf_append()
Tyler Hicks (1): tpm: efi: Don't create binary_bios_measurements file for an empty log
Uwe Kleine-König (1): spi: fix resource leak for drivers without .remove callback
Vadim Fedorenko (1): net/tls: missing received data after fast remote close
Valentin Schneider (1): arm64: topology: Stop using MPIDR for topology information
Vamshi K Sthambamkadi (1): efivarfs: fix memory leak in efivarfs_create()
Vasily Averin (2): netfilter: ipset: fix shift-out-of-bounds in htable_bits() net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
Vincenzo Frascino (1): arm64: lse: Fix LSE atomics with LLVM
Vladyslav Tarasiuk (1): net/mlx5: Disable QoS when min_rates on all VFs are zero
Wang Hai (4): devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill() inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill() tipc: fix memory leak in tipc_topsrv_start() ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init
Wang Wensheng (1): sbsa_gwdt: Add WDIOF_PRETIMEOUT flag to watchdog_info at defination
Wei Li (1): irqchip/gic-v3: Fix compiling error on ARM32 with GICv3
Wei Yang (1): mm: thp: don't need care deferred split queue in memcg charge move path
Weilong Chen (1): hugetlbfs: Add dependency with ascend memory features
Wengang Wang (1): ocfs2: initialize ip_next_orphan
Will Deacon (3): arm64: psci: Avoid printing in cpu_psci_cpu_die() arm64: pgtable: Fix pte_accessible() arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
Willem de Bruijn (1): sock: set sk_err to ee_errno on dequeue from errq
Wu Bo (1): scsi: libiscsi: fix task hung when iscsid deamon exited
Xie XiuQi (1): cputime: fix undefined reference to get_idle_time when CONFIG_PROC_FS disabled
Xin Long (1): sctp: change to hold/put transport for proto_unreach_timer
Xiongfeng Wang (1): arm64: fix compile error when CONFIG_HOTPLUG_CPU is disabled
Xiubo Li (1): nbd: make the config put is called before the notifying the waiter
Xu Qiang (4): NMI: Enable arm-pmu interrupt as NMI in Acensed. irqchip/gic-v3-its: Unconditionally save/restore the ITS state on suspend. irqchip/irq-gic-v3: Add workaround bindings in device tree to init ts core GICR. Document: In the binding document, add enable-init-all-GICR field description.
Yang Shi (6): mm: list_lru: set shrinker map bit when child nr_items is not zero mm: thp: extract split_queue_* into a struct mm: move mem_cgroup_uncharge out of __page_cache_release() mm: shrinker: make shrinker not depend on memcg kmem mm: thp: make deferred split shrinker memcg aware mm: vmscan: protect shrinker idr replace with CONFIG_MEMCG
Yang Yingliang (5): arm64: arch_timer: only do cntvct workaround on VDSO path on D05 armv7 fix compile error Kconfig: disable KTASK by default futex: sched: fix kabi broken in task_struct futex: sched: fix UAF when free futex_exit_mutex in free_task()
Yi-Hung Wei (1): ip_tunnels: Set tunnel option flag when tunnel metadata is present
Yicong Yang (1): libfs: fix error cast of negative value in simple_attr_write()
Yu Kuai (2): blk-cgroup: prevent rcu_sched detected stalls warnings in blkg_destroy_all() blk-throttle: don't check whether or not lower limit is valid if CONFIG_BLK_DEV_THROTTLING_LOW is off
Yufen Yu (2): bdi: fix compiler error in bdi_get_dev_name() scsi: do quiesce for enclosure driver
Yunfeng Ye (1): workqueue: Kick a worker based on the actual activation of delayed works
Yunjian Wang (2): net: hns: fix return value check in __lb_other_process() vhost_net: fix ubuf refcount incorrectly when sendmsg fails
Yunsheng Lin (1): net: sch_generic: fix the missing new qdisc assignment bug
Zeng Tao (1): time: Prevent undefined behaviour in timespec64_to_ns()
Zhang Changzhong (2): ah6: fix error return code in ah6_input() net: bridge: vlan: fix error return code in __vlan_add()
Zhao Heming (2): md/cluster: block reshape with remote resync job md/cluster: fix deadlock when node is doing resync job
Zhengyuan Liu (3): arm64/mm: return cpu_all_mask when node is NUMA_NO_NODE hifc: remove unnecessary __init specifier mmap: fix a compiling error for 'MAP_PA32BIT'
Zhou Guanghui (2): memcg/ascend: Check sysctl oom config for memcg oom memcg/ascend: enable kmem cgroup by default for ascend
Zqiang (1): kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
chenmaodong (1): fix virtio_gpu use-after-free while creating dumb
j.nixdorf@avm.de (1): net: sunrpc: interpret the return value of kstrtou32 correctly
lijinlin (1): ext4: add ext3 report error to userspace by netlink
miaoyubo (1): KVM: Enable PUD huge mappings only on 1620
yangerkun (1): ext4: fix bug for rename with RENAME_WHITEOUT
zhuoliang zhang (1): net: xfrm: fix a race condition during allocing spi
.../interrupt-controller/arm,gic-v3.txt | 4 + Documentation/networking/bonding.txt | 16 +- arch/Kconfig | 7 + arch/alpha/include/uapi/asm/mman.h | 2 + arch/arm/kernel/perf_event_v7.c | 16 +- arch/arm64/Kconfig | 2 + arch/arm64/configs/euleros_defconfig | 1 + arch/arm64/configs/hulk_defconfig | 1 + arch/arm64/configs/openeuler_defconfig | 3 +- arch/arm64/include/asm/atomic_lse.h | 76 ++- arch/arm64/include/asm/cpufeature.h | 4 + arch/arm64/include/asm/lse.h | 6 +- arch/arm64/include/asm/memory.h | 11 + arch/arm64/include/asm/numa.h | 3 + arch/arm64/include/asm/pgtable.h | 34 +- arch/arm64/kernel/cpu_errata.c | 8 + arch/arm64/kernel/cpufeature.c | 2 +- arch/arm64/kernel/ftrace.c | 50 +- arch/arm64/kernel/module.c | 47 +- arch/arm64/kernel/psci.c | 5 +- arch/arm64/kernel/setup.c | 5 +- arch/arm64/kernel/syscall.c | 2 +- arch/arm64/kernel/topology.c | 43 +- arch/arm64/lib/memcpy.S | 3 +- arch/arm64/lib/memmove.S | 3 +- arch/arm64/lib/memset.S | 3 +- arch/arm64/mm/init.c | 17 +- arch/arm64/mm/numa.c | 6 +- arch/mips/include/uapi/asm/mman.h | 2 + arch/parisc/include/uapi/asm/mman.h | 2 + arch/powerpc/include/asm/uaccess.h | 4 +- arch/powerpc/include/uapi/asm/mman.h | 2 + arch/sparc/include/uapi/asm/mman.h | 2 + arch/x86/configs/hulk_defconfig | 1 + arch/x86/configs/openeuler_defconfig | 1 + arch/xtensa/include/uapi/asm/mman.h | 2 + block/bfq-iosched.c | 8 +- block/blk-cgroup.c | 30 +- block/blk-core.c | 39 +- block/blk-mq.c | 29 +- block/blk-throttle.c | 6 + block/genhd.c | 9 +- drivers/acpi/acpi_pnp.c | 3 + drivers/acpi/internal.h | 2 +- drivers/acpi/numa.c | 2 +- drivers/acpi/resource.c | 2 +- drivers/acpi/scan.c | 17 +- drivers/ata/libata-transport.c | 10 +- drivers/base/core.c | 21 +- drivers/block/nbd.c | 3 +- drivers/block/null_blk_zoned.c | 20 +- drivers/char/Kconfig | 2 +- drivers/char/random.c | 1 - drivers/char/tpm/eventlog/efi.c | 5 + drivers/clocksource/arm_arch_timer.c | 36 +- drivers/firmware/efi/efi.c | 4 - drivers/firmware/efi/libstub/arm-stub.c | 3 - drivers/gpu/drm/virtio/virtgpu_gem.c | 4 +- drivers/irqchip/irq-gic-v3-its.c | 30 +- drivers/irqchip/irq-gic-v3.c | 18 +- drivers/md/dm-bufio.c | 6 + drivers/md/dm-integrity.c | 58 ++- drivers/md/dm-snap.c | 24 + drivers/md/dm-table.c | 15 +- drivers/md/dm-verity-target.c | 12 +- drivers/md/md-cluster.c | 67 +-- drivers/md/md.c | 14 +- drivers/mtd/hisilicon/sfc/hrd_sfc_driver.c | 4 +- drivers/net/bonding/bond_main.c | 92 ++-- drivers/net/bonding/bond_netlink.c | 14 + drivers/net/bonding/bond_options.c | 71 ++- drivers/net/bonding/bond_procfs.c | 2 + drivers/net/bonding/bond_sysfs.c | 13 + drivers/net/bonding/bond_sysfs_slave.c | 18 +- .../net/ethernet/hisilicon/hns/hns_ethtool.c | 4 + drivers/net/ethernet/intel/i40e/i40e.h | 4 + drivers/net/ethernet/intel/i40e/i40e_main.c | 32 +- .../ethernet/intel/i40e/i40e_virtchnl_pf.c | 30 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +- .../net/ethernet/mellanox/mlx4/en_netdev.c | 21 +- drivers/net/ethernet/mellanox/mlx4/en_tx.c | 40 +- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 12 +- .../net/ethernet/mellanox/mlx5/core/en_fs.c | 3 + .../net/ethernet/mellanox/mlx5/core/eswitch.c | 15 +- .../net/ethernet/mellanox/mlx5/core/main.c | 6 +- .../ethernet/mellanox/mlx5/core/pagealloc.c | 21 +- drivers/net/ethernet/mellanox/mlxsw/core.c | 8 +- drivers/net/geneve.c | 3 +- drivers/net/macvlan.c | 5 +- drivers/net/virtio_net.c | 12 +- drivers/net/vxlan.c | 3 + drivers/nvme/host/pci.c | 15 + drivers/pci/controller/pcie-iproc.c | 10 +- drivers/pci/pci.c | 14 +- drivers/pci/slot.c | 6 +- 
drivers/scsi/huawei/hifc/unf_common.h | 2 +- drivers/scsi/huawei/hifc/unf_scsi.c | 23 + drivers/scsi/libiscsi.c | 33 +- drivers/scsi/scsi_lib.c | 126 +++-- drivers/scsi/scsi_scan.c | 11 +- drivers/scsi/scsi_transport_spi.c | 27 +- drivers/spi/spi.c | 19 +- drivers/target/iscsi/iscsi_target.c | 96 ++-- drivers/target/iscsi/iscsi_target.h | 1 - drivers/target/iscsi/iscsi_target_configfs.c | 5 +- drivers/target/iscsi/iscsi_target_login.c | 5 +- drivers/tty/serial/amba-pl011.c | 11 +- drivers/tty/serial/serial_core.c | 4 + drivers/uio/uio.c | 12 +- drivers/vfio/pci/vfio_pci.c | 3 +- drivers/vhost/net.c | 6 +- drivers/watchdog/sbsa_gwdt.c | 6 +- fs/cifs/cifs_unicode.c | 8 +- fs/cifs/connect.c | 2 + fs/cifs/inode.c | 13 +- fs/efivarfs/inode.c | 2 + fs/efivarfs/super.c | 3 + fs/exec.c | 17 +- fs/ext4/ext4.h | 6 +- fs/ext4/ext4_jbd2.c | 1 - fs/ext4/file.c | 1 + fs/ext4/inline.c | 1 + fs/ext4/inode.c | 31 +- fs/ext4/ioctl.c | 3 + fs/ext4/mballoc.c | 1 + fs/ext4/namei.c | 23 +- fs/ext4/resize.c | 8 +- fs/ext4/super.c | 32 +- fs/ext4/xattr.c | 1 + fs/fcntl.c | 10 +- fs/filescontrol.c | 73 +-- fs/fuse/dev.c | 28 +- fs/hugetlbfs/inode.c | 2 +- fs/libfs.c | 6 +- fs/lockd/host.c | 20 +- fs/nfs/inode.c | 2 +- fs/nfs/internal.h | 12 +- fs/nfs/nfs4proc.c | 2 +- fs/nfs/nfs4super.c | 2 +- fs/nfs/pnfs.c | 40 +- fs/nfs/pnfs.h | 5 + fs/nfs_common/grace.c | 6 +- fs/ocfs2/super.c | 1 + fs/proc/generic.c | 55 ++- fs/proc/internal.h | 7 + fs/proc/proc_net.c | 16 - fs/proc/self.c | 7 + fs/quota/quota_tree.c | 8 +- fs/super.c | 33 +- fs/xfs/libxfs/xfs_alloc.c | 1 + fs/xfs/libxfs/xfs_bmap.c | 19 +- fs/xfs/libxfs/xfs_bmap.h | 2 +- fs/xfs/libxfs/xfs_rmap.c | 2 +- fs/xfs/scrub/bmap.c | 10 +- fs/xfs/scrub/btree.c | 45 +- fs/xfs/scrub/inode.c | 3 +- fs/xfs/scrub/refcount.c | 8 +- fs/xfs/xfs_iops.c | 10 + fs/xfs/xfs_pnfs.c | 2 +- fs/xfs/xfs_rtalloc.c | 10 +- include/asm-generic/io.h | 39 +- include/linux/backing-dev.h | 1 + include/linux/blkdev.h | 1 + include/linux/compat.h | 2 - include/linux/dm-bufio.h | 1 + include/linux/efi.h | 7 - include/linux/futex.h | 39 +- include/linux/huge_mm.h | 9 + include/linux/hugetlb.h | 10 +- include/linux/if_team.h | 5 +- include/linux/memblock.h | 3 - include/linux/memcontrol.h | 32 +- include/linux/mm.h | 2 + include/linux/mm_types.h | 1 + include/linux/mman.h | 15 + include/linux/mmzone.h | 8 + include/linux/netfilter/x_tables.h | 5 +- include/linux/netfilter_ipv4.h | 2 +- include/linux/netfilter_ipv6.h | 2 +- include/linux/netpoll.h | 10 +- include/linux/prandom.h | 36 +- include/linux/proc_fs.h | 8 +- include/linux/sched.h | 5 +- include/linux/sched/cputime.h | 5 + include/linux/sched/mm.h | 6 +- include/linux/shrinker.h | 7 +- include/linux/skbuff.h | 5 + include/linux/sunrpc/xprt.h | 1 + include/linux/time64.h | 4 + include/net/bond_options.h | 1 + include/net/bonding.h | 14 +- include/net/ip_tunnels.h | 7 +- include/net/red.h | 4 +- include/net/tls.h | 6 + include/scsi/libiscsi.h | 3 + include/target/iscsi/iscsi_target_core.h | 2 +- include/uapi/asm-generic/mman.h | 1 + include/uapi/linux/if_link.h | 1 + init/Kconfig | 2 +- kernel/events/internal.h | 2 +- kernel/exit.c | 35 +- kernel/fail_function.c | 5 +- kernel/fork.c | 61 ++- kernel/futex.c | 291 +++++++++-- kernel/irq/irqdomain.c | 11 +- kernel/kthread.c | 3 +- kernel/module.c | 6 +- kernel/reboot.c | 28 +- kernel/sched/core.c | 6 +- kernel/sched/cputime.c | 6 + kernel/sched/deadline.c | 5 +- kernel/sched/sched.h | 42 +- kernel/signal.c | 19 +- kernel/time/itimer.c | 4 - kernel/time/tick-common.c | 2 + kernel/time/timer.c | 7 - 
kernel/trace/ftrace.c | 22 +- kernel/trace/ring_buffer.c | 66 ++- kernel/trace/trace.c | 9 +- kernel/trace/trace.h | 32 +- kernel/trace/trace_selftest.c | 9 +- kernel/workqueue.c | 13 +- lib/Kconfig.debug | 9 + lib/Makefile | 1 + lib/fonts/font_10x18.c | 2 +- lib/fonts/font_6x10.c | 2 +- lib/fonts/font_6x11.c | 2 +- lib/fonts/font_7x14.c | 2 +- lib/fonts/font_8x16.c | 2 +- lib/fonts/font_8x8.c | 2 +- lib/fonts/font_acorn_8x8.c | 2 +- lib/fonts/font_mini_4x6.c | 2 +- lib/fonts/font_pearl_8x8.c | 2 +- lib/fonts/font_sun12x22.c | 2 +- lib/fonts/font_sun8x16.c | 2 +- lib/genalloc.c | 25 +- lib/random32.c | 462 +++++++++++------- lib/scatterlist.c | 2 +- lib/test_free_pages.c | 42 ++ mm/huge_memory.c | 167 +++++-- mm/hugetlb.c | 5 +- mm/kasan/kasan_init.c | 23 +- mm/list_lru.c | 10 +- mm/memblock.c | 11 +- mm/memcontrol.c | 34 +- mm/memory-failure.c | 6 + mm/mempolicy.c | 6 +- mm/mlock.c | 25 +- mm/mmap.c | 27 +- mm/page_alloc.c | 9 + mm/slub.c | 2 +- mm/swap.c | 37 +- mm/swapfile.c | 4 +- mm/vmscan.c | 64 +-- net/8021q/vlan.c | 3 +- net/8021q/vlan_dev.c | 5 +- net/bridge/br_device.c | 1 + net/bridge/br_netfilter_hooks.c | 7 +- net/bridge/br_private.h | 5 +- net/bridge/br_vlan.c | 4 +- net/ceph/messenger.c | 5 + net/core/dev.c | 5 + net/core/devlink.c | 6 +- net/core/lwt_bpf.c | 8 +- net/core/net-sysfs.c | 65 ++- net/core/netpoll.c | 51 +- net/core/skbuff.c | 23 +- net/dsa/slave.c | 5 +- net/ipv4/fib_frontend.c | 2 +- net/ipv4/gre_demux.c | 2 +- net/ipv4/inet_diag.c | 4 +- net/ipv4/ip_output.c | 2 +- net/ipv4/ip_tunnel.c | 10 +- net/ipv4/netfilter.c | 12 +- net/ipv4/netfilter/arp_tables.c | 16 +- net/ipv4/netfilter/ip_tables.c | 16 +- net/ipv4/netfilter/ipt_SYNPROXY.c | 2 +- net/ipv4/netfilter/ipt_rpfilter.c | 2 +- net/ipv4/netfilter/iptable_mangle.c | 2 +- net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 2 +- net/ipv4/netfilter/nf_reject_ipv4.c | 2 +- net/ipv4/netfilter/nft_chain_route_ipv4.c | 2 +- net/ipv4/route.c | 7 +- net/ipv4/syncookies.c | 9 +- net/ipv4/tcp.c | 2 + net/ipv4/tcp_bbr.c | 2 +- net/ipv4/tcp_cong.c | 5 + net/ipv4/tcp_input.c | 6 +- net/ipv4/tcp_output.c | 9 +- net/ipv4/udp.c | 3 +- net/ipv6/addrconf.c | 3 +- net/ipv6/addrlabel.c | 26 +- net/ipv6/ah6.c | 3 +- net/ipv6/ip6_fib.c | 5 +- net/ipv6/ip6_gre.c | 16 +- net/ipv6/ip6_output.c | 40 +- net/ipv6/netfilter.c | 6 +- net/ipv6/netfilter/ip6_tables.c | 16 +- net/ipv6/netfilter/ip6table_mangle.c | 2 +- net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 2 +- net/ipv6/netfilter/nft_chain_route_ipv6.c | 2 +- net/ipv6/sit.c | 2 - net/ipv6/syncookies.c | 10 +- net/netfilter/ipset/ip_set_core.c | 3 +- net/netfilter/ipset/ip_set_hash_gen.h | 20 +- net/netfilter/ipvs/ip_vs_core.c | 4 +- net/netfilter/nf_conntrack_standalone.c | 3 + net/netfilter/nf_nat_core.c | 1 + net/netfilter/nf_tables_api.c | 3 +- net/netfilter/x_tables.c | 49 +- net/netfilter/xt_RATEEST.c | 3 + net/sched/cls_tcindex.c | 8 +- net/sched/sch_api.c | 3 +- net/sched/sch_choke.c | 2 +- net/sched/sch_generic.c | 3 + net/sched/sch_gred.c | 2 +- net/sched/sch_netem.c | 9 +- net/sched/sch_red.c | 2 +- net/sched/sch_sfq.c | 2 +- net/sctp/input.c | 4 +- net/sctp/sm_sideeffect.c | 8 +- net/sctp/transport.c | 2 +- net/sunrpc/addr.c | 2 +- net/sunrpc/xprt.c | 65 ++- net/sunrpc/xprtrdma/module.c | 1 + net/sunrpc/xprtrdma/transport.c | 1 + net/sunrpc/xprtsock.c | 4 + net/tipc/link.c | 9 +- net/tipc/msg.c | 5 +- net/tipc/topsrv.c | 10 +- net/tls/tls_device.c | 5 +- net/tls/tls_sw.c | 6 + net/xdp/xsk.c | 8 +- net/xfrm/xfrm_state.c | 8 +- security/lsm_audit.c | 7 +- security/selinux/hooks.c | 16 
+- security/selinux/ibpkey.c | 4 +- sound/soc/codecs/Kconfig | 1 + tools/include/uapi/linux/if_link.h | 1 + tools/perf/builtin-lock.c | 2 +- tools/perf/util/print_binary.c | 2 +- .../scripting-engines/trace-event-python.c | 7 +- tools/perf/util/session.c | 1 + virt/kvm/arm/mmu.c | 7 + 344 files changed, 3424 insertions(+), 1701 deletions(-) create mode 100644 lib/test_free_pages.c
From: Yu Kuai yukuai3@huawei.com
hulk inclusion
category: bugfix
bugzilla: 46357
CVE: NA
---------------------------
test procedures:
a. create 20000 cgroups, and echo "8:0 10000" to blkio.throttle.write_bps_device
b. echo 1 > /sys/block/sda/device/delete
test result:
rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [5/1143]
rcu: 0-...0: (0 ticks this GP) idle=0f2/1/0x4000000000000000 softirq=15507/15507 fq
rcu: (detected by 6, t=60012 jiffies, g=119977, q=27153)
NMI backtrace for cpu 0
CPU: 0 PID: 443 Comm: bash Not tainted 4.19.95-00061-g0bcc83b30eec #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-p4
RIP: 0010:blk_throtl_update_limit_valid.isra.0+0x116/0x2a0
Code: 01 00 00 e8 7c dd 74 ff 48 83 bb 78 01 00 00 00 0f 85 54 01 00 00 48 8d bb 88 01 1
RSP: 0018:ffff8881030bf9f0 EFLAGS: 00000046
RAX: 0000000000000000 RBX: ffff8880b4f37080 RCX: ffffffff95da0afe
RDX: dffffc0000000000 RSI: ffff888100373980 RDI: ffff8880b4f37208
RBP: ffff888100deca00 R08: ffffffff9528f951 R09: 0000000000000001
R10: ffffed10159dbf56 R11: ffff8880acedfab3 R12: ffff8880b9fda498
R13: ffff8880b9fda4f4 R14: 0000000000000050 R15: ffffffff98b622c0
FS: 00007feb51c51700(0000) GS:ffff888106200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561619547080 CR3: 0000000102bc9000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 throtl_pd_offline+0x98/0x100
 blkg_destroy+0x133/0x630
 ? blkcg_deactivate_policy+0x2c0/0x2c0
 ? lock_timer_base+0x65/0x110
 blkg_destroy_all+0x7f/0x100
 blkcg_exit_queue+0x3f/0xa5
 blk_exit_queue+0x69/0xa0
 blk_cleanup_queue+0x226/0x360
 __scsi_remove_device+0xb4/0x3c0
 scsi_remove_device+0x38/0x60
 sdev_store_delete+0x74/0x100
 ? dev_driver_string+0xb0/0xb0
 dev_attr_store+0x41/0x70
 sysfs_kf_write+0x89/0xc0
 kernfs_fop_write+0x1b6/0x2e0
 ? sysfs_kf_bin_read+0x130/0x130
 __vfs_write+0xca/0x420
 ? kernel_read+0xc0/0xc0
 ? __alloc_fd+0x16f/0x2d0
 ? __fd_install+0x95/0x1a0
 ? handle_mm_fault+0x3e0/0x560
 vfs_write+0x11a/0x2f0
 ksys_write+0xb9/0x1e0
 ? __x64_sys_read+0x60/0x60
 ? kasan_check_write+0x20/0x30
 ? filp_close+0xb5/0xf0
 __x64_sys_write+0x46/0x60
 do_syscall_64+0xd9/0x1f0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Using this many blkgs is very rare; however, the problem does exist in theory. To avoid such stall warnings, briefly release 'q->queue_lock' after each batch of blkgs has been destroyed.
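For intuition, the batching pattern looks roughly like the following self-contained userspace sketch; DESTROY_BATCH, destroy_all() and the pthread mutex are stand-ins invented for the example (the real code batches on q->queue_lock and uses cond_resched()):

/* Illustrative model only: process items in batches and drop the lock
 * between batches so other contexts can run. Not kernel code. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define DESTROY_BATCH 4

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int remaining = 10;      /* stand-in for the length of q->blkg_list */

static void destroy_all(void)
{
    int count;

    pthread_mutex_lock(&queue_lock);
again:
    count = DESTROY_BATCH;
    while (remaining > 0) {
        remaining--;                    /* stand-in for blkg_destroy() */
        if (!--count) {
            /* Batch finished: give the lock (and the CPU) away briefly. */
            pthread_mutex_unlock(&queue_lock);
            sched_yield();              /* plays the role of cond_resched() */
            pthread_mutex_lock(&queue_lock);
            goto again;
        }
    }
    pthread_mutex_unlock(&queue_lock);
    printf("all destroyed\n");
}

int main(void)
{
    destroy_all();
    return 0;
}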
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Tao Hou houtao1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-cgroup.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index e592167449aa..c64f0afa27dc 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -364,16 +364,31 @@ static void blkg_destroy(struct blkcg_gq *blkg)
  */
 static void blkg_destroy_all(struct request_queue *q)
 {
+#define BLKG_DESTROY_BATCH 4096
        struct blkcg_gq *blkg, *n;
+       int count;
 
        lockdep_assert_held(q->queue_lock);
 
+again:
+       count = BLKG_DESTROY_BATCH;
        list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
                struct blkcg *blkcg = blkg->blkcg;
 
                spin_lock(&blkcg->lock);
                blkg_destroy(blkg);
                spin_unlock(&blkcg->lock);
+               /*
+                * If the list is too long, the loop can take a long time,
+                * so release the lock for a while after a batch of blkgs
+                * has been destroyed.
+                */
+               if (!--count) {
+                       spin_unlock_irq(q->queue_lock);
+                       cond_resched();
+                       spin_lock_irq(q->queue_lock);
+                       goto again;
+               }
        }
 
        q->root_blkg = NULL;
From: Yu Kuai yukuai3@huawei.com
hulk inclusion
category: bugfix
bugzilla: 46357
CVE: NA
---------------------------
blk_throtl_update_limit_valid() walks all descendants to check whether any 'LIMIT_LOW' bps/iops limit, for READ or WRITE, is nonzero. However, these limits are always zero if CONFIG_BLK_DEV_THROTTLING_LOW is not set, so iterating over the descendants only wastes time.

Thus, make blk_throtl_update_limit_valid() do nothing in that case.
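The change uses the usual compile-time stub pattern: the real function exists only when the config option is set, and an empty static inline takes its place otherwise, so callers need no #ifdefs. A minimal sketch, assuming a made-up CONFIG_FEATURE_LOW macro rather than the real Kconfig symbol:

/* Compile with -DCONFIG_FEATURE_LOW to get the real implementation;
 * without it, the call compiles away to nothing. Illustrative only. */
#include <stdio.h>

#ifdef CONFIG_FEATURE_LOW
static void update_limit_valid(void)
{
    /* The expensive walk over all descendants would live here. */
    printf("walking descendants\n");
}
#else
static inline void update_limit_valid(void)
{
    /* Feature compiled out: deliberately a no-op. */
}
#endif

int main(void)
{
    update_limit_valid();       /* callers look the same either way */
    return 0;
}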
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Tao Hou houtao1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-throttle.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index caee658609d7..c7b4f905feb5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -568,6 +568,7 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
        tg_update_has_rules(tg);
 }
 
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 static void blk_throtl_update_limit_valid(struct throtl_data *td)
 {
        struct cgroup_subsys_state *pos_css;
@@ -588,6 +589,11 @@ static void blk_throtl_update_limit_valid(struct throtl_data *td)
 
        td->limit_valid[LIMIT_LOW] = low_valid;
 }
+#else
+static inline void blk_throtl_update_limit_valid(struct throtl_data *td)
+{
+}
+#endif
 
 static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
From: Kirill Tkhai ktkhai@virtuozzo.com
mainline inclusion
from mainline-v5.2-rc1
commit 9851ac13592df77958ae7bac6ba39e71420c38ec
category: bugfix
bugzilla: 36232
CVE: NA
-------------------------------------------------
We know which LRU is not active.
[chris@chrisdown.name: fix build on !CONFIG_MEMCG] Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@local... Signed-off-by: Kirill Tkhai ktkhai@virtuozzo.com Signed-off-by: Chris Down chris@chrisdown.name Reviewed-by: Daniel Jordan daniel.m.jordan@oracle.com Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/memcontrol.h | 6 ++++++ mm/vmscan.c | 10 ++++------ 2 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 24068566ada2..0069cd618b1e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1100,6 +1100,12 @@ static inline void count_memcg_events(struct mem_cgroup *memcg, { }
+static inline void __count_memcg_events(struct mem_cgroup *memcg, + enum vm_event_item idx, + unsigned long count) +{ +} + static inline void count_memcg_page_event(struct page *page, int idx) { diff --git a/mm/vmscan.c b/mm/vmscan.c index ac6f4cbef5ea..01abd3f43a02 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2076,12 +2076,6 @@ static unsigned move_active_pages_to_lru(struct lruvec *lruvec, } }
- if (!is_active_lru(lru)) { - __count_vm_events(PGDEACTIVATE, nr_moved); - count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, - nr_moved); - } - return nr_moved; }
@@ -2176,6 +2170,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru); nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE); + + __count_vm_events(PGDEACTIVATE, nr_deactivate); + __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate); + __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); spin_unlock_irq(&pgdat->lru_lock);
From: Shakeel Butt shakeelb@google.com
mainline inclusion
from mainline-v5.8-rc1
commit 5d91f31faf8ebed2acfc3a1d6ac344f95c488d66
category: bugfix
bugzilla: 36232
CVE: NA
-------------------------------------------------
Many of the callbacks called by pagevec_lru_move_fn() do not correctly update the vmstats for huge pages. Fix that. Also make __pagevec_lru_add_fn() use the irq-unsafe alternative to update the stat, as irqs are already disabled there.
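For intuition, the counting rule the fix applies looks like this tiny userspace sketch; struct page, nr_base_pages() and the pgactivate counter below are invented stand-ins for hpage_nr_pages() and the vmstat counters, not kernel APIs:

/* Illustrative model: bump the counter by the number of base pages a
 * (possibly huge) page covers, instead of by 1. Not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_NR 512    /* base pages per 2MB huge page on x86-64 */

struct page { bool is_huge; };

static unsigned long pgactivate;        /* stand-in for the PGACTIVATE vmstat */

static int nr_base_pages(const struct page *page)
{
    return page->is_huge ? HPAGE_NR : 1;
}

static void activate(const struct page *page)
{
    /* The bug was the equivalent of "pgactivate++;" here. */
    pgactivate += nr_base_pages(page);
}

int main(void)
{
    struct page small = { .is_huge = false };
    struct page huge = { .is_huge = true };

    activate(&small);
    activate(&huge);
    printf("PGACTIVATE = %lu\n", pgactivate);   /* 513, not 2 */
    return 0;
}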
Signed-off-by: Shakeel Butt shakeelb@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Johannes Weiner hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20200527182916.249910-1-shakeelb@google.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/swap.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c index 320ac35bde26..a819f8746ee2 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -224,7 +224,7 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, del_page_from_lru_list(page, lruvec, page_lru(page)); ClearPageActive(page); add_page_to_lru_list_tail(page, lruvec, page_lru(page)); - (*pgmoved)++; + (*pgmoved) += hpage_nr_pages(page); } }
@@ -284,7 +284,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec, add_page_to_lru_list(page, lruvec, lru); trace_mm_lru_activate(page);
- __count_vm_event(PGACTIVATE); + __count_vm_events(PGACTIVATE, hpage_nr_pages(page)); update_page_reclaim_stat(lruvec, file, 1); } } @@ -499,6 +499,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, { int lru, file; bool active; + int nr_pages = hpage_nr_pages(page);
if (!PageLRU(page)) return; @@ -532,11 +533,11 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, * We moves tha page into tail of inactive. */ list_move_tail(&page->lru, &lruvec->lists[lru]); - __count_vm_event(PGROTATED); + __count_vm_events(PGROTATED, nr_pages); }
if (active) - __count_vm_event(PGDEACTIVATE); + __count_vm_events(PGDEACTIVATE, nr_pages); update_page_reclaim_stat(lruvec, file, 0); }
@@ -870,6 +871,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, { enum lru_list lru; int was_unevictable = TestClearPageUnevictable(page); + int nr_pages = hpage_nr_pages(page);
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -907,13 +909,13 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, update_page_reclaim_stat(lruvec, page_is_file_cache(page), PageActive(page)); if (was_unevictable) - count_vm_event(UNEVICTABLE_PGRESCUED); + __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); } else { lru = LRU_UNEVICTABLE; ClearPageActive(page); SetPageUnevictable(page); if (!was_unevictable) - count_vm_event(UNEVICTABLE_PGCULLED); + __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); }
add_page_to_lru_list(page, lruvec, lru);
From: Shakeel Butt shakeelb@google.com
mainline inclusion
from mainline-v5.8-rc1
commit 21e330fc632d6a288f73de48045b782cc51d501a
category: bugfix
bugzilla: 36232
CVE: NA
-------------------------------------------------
Commit 2262185c5b28 ("mm: per-cgroup memory reclaim stats") added PGLAZYFREE, PGACTIVATE & PGDEACTIVATE stats for cgroups but missed a couple of places, and PGLAZYFREE missed huge page handling. Fix that. Also, for PGLAZYFREE, use the irq-unsafe function to update the stat, as irqs are already disabled.
Fixes: 2262185c5b28 ("mm: per-cgroup memory reclaim stats") Signed-off-by: Shakeel Butt shakeelb@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Johannes Weiner hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20200527182947.251343-1-shakeelb@google.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/swap.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c index a819f8746ee2..0156aa4d772c 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -277,6 +277,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec, if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { int file = page_is_file_cache(page); int lru = page_lru_base_type(page); + int nr_pages = hpage_nr_pages(page);
del_page_from_lru_list(page, lruvec, lru); SetPageActive(page); @@ -284,7 +285,9 @@ static void __activate_page(struct page *page, struct lruvec *lruvec, add_page_to_lru_list(page, lruvec, lru); trace_mm_lru_activate(page);
- __count_vm_events(PGACTIVATE, hpage_nr_pages(page)); + __count_vm_events(PGACTIVATE, nr_pages); + __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, + nr_pages); update_page_reclaim_stat(lruvec, file, 1); } } @@ -536,18 +539,21 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, __count_vm_events(PGROTATED, nr_pages); }
- if (active) + if (active) { __count_vm_events(PGDEACTIVATE, nr_pages); + __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, + nr_pages); + } update_page_reclaim_stat(lruvec, file, 0); }
- static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, void *arg) { if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { bool active = PageActive(page); + int nr_pages = hpage_nr_pages(page);
del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active); @@ -561,8 +567,9 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, ClearPageSwapBacked(page); add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
- __count_vm_events(PGLAZYFREE, hpage_nr_pages(page)); - count_memcg_page_event(page, PGLAZYFREE); + __count_vm_events(PGLAZYFREE, nr_pages); + __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, + nr_pages); update_page_reclaim_stat(lruvec, 1, 0); } }
From: Hugh Dickins hughd@google.com
mainline inclusion
from mainline-v5.9-rc6
commit 0964730bf46b4e271c5ecad5badbbd95737c087b
category: bugfix
bugzilla: 36232
CVE: NA
-------------------------------------------------
5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has established that vm_events should count every subpage of a THP, including unevictable_pgs_culled and unevictable_pgs_rescued; but lru_cache_add_inactive_or_unevictable() was not doing so for unevictable_pgs_mlocked, and mm/mlock.c was not doing so for unevictable_pgs mlocked, munlocked, cleared and stranded.
Fix them; but THPs don't go the pagevec way in mlock.c, so no fixes needed on that path.
Fixes: 5d91f31faf8e ("mm: swap: fix vmstats for huge page") Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Yang Shi shy828301@gmail.com Cc: Alex Shi alex.shi@linux.alibaba.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Qian Cai cai@lca.pw Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301408230.5954@eggly.anvils Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/mlock.c | 25 +++++++++++++++---------- mm/swap.c | 6 +++--- 2 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c index 8112a2200d3e..4077b1e8e199 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -58,12 +58,14 @@ EXPORT_SYMBOL(can_do_mlock); */ void clear_page_mlock(struct page *page) { + int nr_pages; + if (!TestClearPageMlocked(page)) return;
- mod_zone_page_state(page_zone(page), NR_MLOCK, - -hpage_nr_pages(page)); - count_vm_event(UNEVICTABLE_PGCLEARED); + nr_pages = hpage_nr_pages(page); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); /* * The previous TestClearPageMlocked() corresponds to the smp_mb() * in __pagevec_lru_add_fn(). @@ -77,7 +79,7 @@ void clear_page_mlock(struct page *page) * We lost the race. the page already moved to evictable list. */ if (PageUnevictable(page)) - count_vm_event(UNEVICTABLE_PGSTRANDED); + count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); } }
@@ -94,9 +96,10 @@ void mlock_vma_page(struct page *page) VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
if (!TestSetPageMlocked(page)) { - mod_zone_page_state(page_zone(page), NR_MLOCK, - hpage_nr_pages(page)); - count_vm_event(UNEVICTABLE_PGMLOCKED); + int nr_pages = hpage_nr_pages(page); + + mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); if (!isolate_lru_page(page)) putback_lru_page(page); } @@ -139,7 +142,7 @@ static void __munlock_isolated_page(struct page *page)
/* Did try_to_unlock() succeed or punt? */ if (!PageMlocked(page)) - count_vm_event(UNEVICTABLE_PGMUNLOCKED); + count_vm_events(UNEVICTABLE_PGMUNLOCKED, hpage_nr_pages(page));
putback_lru_page(page); } @@ -155,10 +158,12 @@ static void __munlock_isolated_page(struct page *page) */ static void __munlock_isolation_failed(struct page *page) { + int nr_pages = hpage_nr_pages(page); + if (PageUnevictable(page)) - __count_vm_event(UNEVICTABLE_PGSTRANDED); + __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); else - __count_vm_event(UNEVICTABLE_PGMUNLOCKED); + __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); }
/** diff --git a/mm/swap.c b/mm/swap.c index 0156aa4d772c..bdb9b294afbf 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -464,14 +464,14 @@ void lru_cache_add_active_or_unevictable(struct page *page, if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) SetPageActive(page); else if (!TestSetPageMlocked(page)) { + int nr_pages = hpage_nr_pages(page); /* * We use the irq-unsafe __mod_zone_page_stat because this * counter is not modified from interrupt context, and the pte * lock is held(spinlock), which implies preemption disabled. */ - __mod_zone_page_state(page_zone(page), NR_MLOCK, - hpage_nr_pages(page)); - count_vm_event(UNEVICTABLE_PGMLOCKED); + __mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); } lru_cache_add(page); }
From: Hugh Dickins hughd@google.com
mainline inclusion
from mainline-v5.9-rc6
commit 8d8869ca5d2d9d86db96271ab063fdcfa9baf5b4
category: bugfix
bugzilla: 36232
CVE: NA
-------------------------------------------------
check_move_unevictable_pages() is used in making unevictable shmem pages evictable: by shmem_unlock_mapping(), drm_gem_check_release_pagevec() and i915/gem check_release_pagevec(). Those may pass down subpages of a huge page, when /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force".
That does not crash or warn at present, but the accounting of vmstats unevictable_pgs_scanned and unevictable_pgs_rescued is inconsistent: scanned being incremented on each subpage, rescued only on the head (since tails already appear evictable once the head has been updated).
5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has established that vm_events in general (and unevictable_pgs_rescued in particular) should count every subpage: so follow that precedent here.
Do this in such a way that if mem_cgroup_page_lruvec() is made stricter (to check page->mem_cgroup is always set), no problem: skip the tails before calling it, and add thp_nr_pages() to vmstats on the head.
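As a rough self-contained illustration of that scanning rule (the struct page, its fields and the counter below are invented stand-ins for PageTransTail(), thp_nr_pages() and unevictable_pgs_scanned, not the kernel's types):

/* Illustrative model: skip tail pages of a compound page and account
 * the whole compound page on its head. Not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct page {
    bool tail;          /* stand-in for PageTransTail() */
    int nr_subpages;    /* stand-in for thp_nr_pages() on the head */
};

int main(void)
{
    /* A 4-subpage compound page followed by one normal page. */
    struct page pages[] = {
        { .tail = false, .nr_subpages = 4 },
        { .tail = true }, { .tail = true }, { .tail = true },
        { .tail = false, .nr_subpages = 1 },
    };
    unsigned long pgscanned = 0;

    for (unsigned int i = 0; i < sizeof(pages) / sizeof(pages[0]); i++) {
        if (pages[i].tail)
            continue;           /* only heads are processed */
        pgscanned += pages[i].nr_subpages;
    }
    printf("scanned = %lu\n", pgscanned);       /* 5: every subpage counted once */
    return 0;
}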
Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Yang Shi shy828301@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Qian Cai cai@lca.pw Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301405000.5954@eggly.anvils Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/vmscan.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c index 01abd3f43a02..f49c11a6cff3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4364,8 +4364,14 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages) for (i = 0; i < nr_pages; i++) { struct page *page = pages[i]; struct pglist_data *pagepgdat = page_pgdat(page); + int _nr_pages; + + if (PageTransTail(page)) + continue; + + _nr_pages = hpage_nr_pages(page); + pgscanned += _nr_pages;
- pgscanned++; if (pagepgdat != pgdat) { if (pgdat) spin_unlock_irq(&pgdat->lru_lock); @@ -4384,7 +4390,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages) ClearPageUnevictable(page); del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE); add_page_to_lru_list(page, lruvec, lru); - pgrescued++; + pgrescued += _nr_pages; } }
From: Huang Ying ying.huang@intel.com
mainline inclusion
from mainline-v5.10-rc1
commit c4f9c701f9b44299e6adbc58d1a4bb2c40383494
category: bugfix
bugzilla: 44801
CVE: NA
-------------------------------------------------
It is reported that the following bug is triggered if an HDD is used as the swap device:
[ 5758.157556] BUG: kernel NULL pointer dereference, address: 0000000000000007
[ 5758.165331] #PF: supervisor write access in kernel mode
[ 5758.171161] #PF: error_code(0x0002) - not-present page
[ 5758.176894] PGD 0 P4D 0
[ 5758.179721] Oops: 0002 [#1] SMP PTI
[ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S --------- --- 5.9.0-0.rc3.1.tst.el8.x86_64 #1
[ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
[ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
[ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
[ 5758.234478] RSP: 0018:ffffb147442d7af0 EFLAGS: 00010246
[ 5758.240309] RAX: 0000000000000000 RBX: 000000000014b217 RCX: ffffb14779fd9000
[ 5758.248281] RDX: 000000000014b217 RSI: ffff9c52f2ab1400 RDI: 000000000014b217
[ 5758.256246] RBP: ffffe00c51168080 R08: ffffe00c5116fe08 R09: ffff9c52fffd3000
[ 5758.264208] R10: ffffe00c511537c8 R11: ffff9c52fffd3c90 R12: 0000000000000000
[ 5758.272172] R13: ffffe00c51170000 R14: ffffe00c51170000 R15: ffffe00c51168040
[ 5758.280134] FS: 0000000000000000(0000) GS:ffff9c52f2a80000(0000) knlGS:0000000000000000
[ 5758.289163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5758.295575] CR2: 0000000000000007 CR3: 0000000022a0e003 CR4: 00000000000606e0
[ 5758.303538] Call Trace:
[ 5758.306273]  split_huge_page_to_list+0x88b/0x950
[ 5758.311433]  deferred_split_scan+0x1ca/0x310
[ 5758.316202]  do_shrink_slab+0x12c/0x2a0
[ 5758.320491]  shrink_slab+0x20f/0x2c0
[ 5758.324482]  shrink_node+0x240/0x6c0
[ 5758.328469]  balance_pgdat+0x2d1/0x550
[ 5758.332652]  kswapd+0x201/0x3c0
[ 5758.336157]  ? finish_wait+0x80/0x80
[ 5758.340147]  ? balance_pgdat+0x550/0x550
[ 5758.344525]  kthread+0x114/0x130
[ 5758.348126]  ? kthread_park+0x80/0x80
[ 5758.352214]  ret_from_fork+0x22/0x30
[ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
[ 5758.412673] CR2: 0000000000000007
[ 0.000000] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 (mockbuild@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020
After further digging it's found that the following race condition exists in the original implementation,
CPU1                                  CPU2
----                                  ----
deferred_split_scan()
  split_huge_page(page) /* page isn't compound head */
    split_huge_page_to_list(page, NULL)
      __split_huge_page(page, )
        ClearPageCompound(head)
        /* unlock all subpages except page (not head) */
                                      add_to_swap(head) /* not THP */
                                        get_swap_page(head)
                                        add_to_swap_cache(head, )
                                          SetPageSwapCache(head)
      if PageSwapCache(head)
        split_swap_cluster(/* swap entry of head */)
        /* Deref sis->cluster_info: NULL accessing! */
So, in split_huge_page_to_list(), PageSwapCache() is called on the already split and unlocked "head", which may have been added to the swap cache on another CPU in the meantime. As a result, split_swap_cluster() may be called wrongly.

To fix the race, the call to split_swap_cluster() is moved into __split_huge_page(), before the subpages are unlocked, so that the result of PageSwapCache() is stable.
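In sketch form, the reordered path looks roughly as follows (condensed from the diff below; the remaining subpage loop and error handling are omitted):

	/* in __split_huge_page(), with the head page still locked */
	remap_page(head);

	if (PageSwapCache(head)) {
		swp_entry_t entry = { .val = page_private(head) };

		/* no other CPU can add "head" to the swap cache here */
		split_swap_cluster(entry);
	}

	/* only now unlock the subpages */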
Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out") Reported-by: Rafael Aquini aquini@redhat.com Signed-off-by: "Huang, Ying" ying.huang@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Tested-by: Rafael Aquini aquini@redhat.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Hugh Dickins hughd@google.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Matthew Wilcox willy@infradead.org Link: https://lkml.kernel.org/r/20201009073647.1531083-1-ying.huang@intel.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/huge_memory.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6c4bda24c9b2..ffc4470e65f0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2549,6 +2549,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
remap_page(head);
+ if (PageSwapCache(head)) { + swp_entry_t entry = { .val = page_private(head) }; + + split_swap_cluster(entry); + } + for (i = 0; i < HPAGE_PMD_NR; i++) { struct page *subpage = head + i; if (subpage == page) @@ -2786,12 +2792,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) __dec_node_page_state(page, NR_SHMEM_THPS); spin_unlock(&pgdata->split_queue_lock); __split_huge_page(page, list, end, flags); - if (PageSwapCache(head)) { - swp_entry_t entry = { .val = page_private(head) }; - - ret = split_swap_cluster(entry); - } else - ret = 0; + ret = 0; } else { if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) { pr_alert("total_mapcount: %u, page_count(): %u\n",
From: Naoya Horiguchi naoya.horiguchi@nec.com
mainline inclusion from mainline-v5.10-rc1 commit 1f2481ddbe444de5bed72f167d7180d1b2708e56 category: bugfix bugzilla: 44803 CVE: NA
-------------------------------------------------
Soft offlining could fail with EIO due to a race condition with hugepage migration. This issue became visible due to the change in the previous patch that makes the soft offline handler take the page refcount on its own. We have no way to directly pin a zero-refcount page, and a page considered to have zero refcount could be allocated just after the first check.

This patch adds a second check to catch the race and gives us a chance to handle it more reliably.
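Condensed, the retry amounts to the following sketch (simplified from the diff below; the rest of get_any_page() is unchanged):

	static int get_any_page(struct page *page, unsigned long pfn, int flags)
	{
		int ret = __get_any_page(page, pfn, flags);

		/*
		 * -EBUSY means the page looked free on the first check but
		 * gained a refcount in the meantime (raced with allocation);
		 * check once more so it can be classified reliably.
		 */
		if (ret == -EBUSY)
			ret = __get_any_page(page, pfn, flags);
		/* ... */
		return ret;
	}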
Reported-by: Qian Cai cai@lca.pw Signed-off-by: Naoya Horiguchi naoya.horiguchi@nec.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.vnet.ibm.com Cc: Aristeu Rozanski aris@ruivo.org Cc: Dave Hansen dave.hansen@intel.com Cc: David Hildenbrand david@redhat.com Cc: Dmitry Yakunin zeil@yandex-team.ru Cc: Michal Hocko mhocko@kernel.org Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Oscar Salvador osalvador@suse.com Cc: Tony Luck tony.luck@intel.com Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/memory-failure.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 148fdd929a19..fa091bfb5943 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1646,6 +1646,9 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags) } else if (is_free_buddy_page(p)) { pr_info("%s: %#lx free buddy page\n", __func__, pfn); ret = 0; + } else if (page_count(p)) { + /* raced with allocation */ + ret = -EBUSY; } else { pr_info("%s: %#lx: unknown zero refcount page type %lx\n", __func__, pfn, p->flags); @@ -1662,6 +1665,9 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags) { int ret = __get_any_page(page, pfn, flags);
+ if (ret == -EBUSY) + ret = __get_any_page(page, pfn, flags); + if (ret == 1 && !PageHuge(page) && !PageLRU(page) && !__PageMovable(page)) { /*
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v5.10-rc1 commit e320d3012d25b1fb5f3df4edb7bd44a1c362ec10 category: bugfix bugzilla: 43588 CVE: NA
-------------------------------------------------
Here is a very rare race which leaks memory:
Page P0 is allocated to the page cache. Page P1 is free.
Thread A                         Thread B                      Thread C
find_get_entry():
xas_load() returns P0
                                 Removes P0 from page cache
                                 P0 finds its buddy P1
                                                               alloc_pages(GFP_KERNEL, 1) returns P0
                                                               P0 has refcount 1
page_cache_get_speculative(P0)
P0 has refcount 2
                                                               __free_pages(P0)
                                                               P0 has refcount 1
put_page(P0)
P1 is not freed
Fix this by freeing all the pages in __free_pages() that won't be freed by the call to put_page(). It's usually not a good idea to split a page, but this is a very unlikely scenario.
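A sketch of the resulting logic (free_the_page() is the internal helper shown in the diff below): when a leftover speculative reference keeps the first page alive, the other sub-blocks of the order-N allocation are released one by one.

	void __free_pages(struct page *page, unsigned int order)
	{
		if (put_page_testzero(page))
			free_the_page(page, order);
		else if (!PageHead(page))
			/*
			 * Someone else still holds a reference to the first
			 * page only; free the second half, then the second
			 * quarter, ... down to the single page after "page".
			 */
			while (order-- > 0)
				free_the_page(page + (1 << order), order);
	}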
Fixes: e286781d5f2e ("mm: speculative page references") Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Mike Rapoport rppt@linux.ibm.com Cc: Nick Piggin npiggin@gmail.com Cc: Hugh Dickins hughd@google.com Cc: Peter Zijlstra peterz@infradead.org Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- lib/Kconfig.debug | 9 +++++++++ lib/Makefile | 1 + lib/test_free_pages.c | 42 ++++++++++++++++++++++++++++++++++++++++++ mm/page_alloc.c | 3 +++ 4 files changed, 55 insertions(+) create mode 100644 lib/test_free_pages.c
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 47dca144782b..0ee305de7d0e 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1984,6 +1984,15 @@ config TEST_DEBUG_VIRTUAL
If unsure, say N.
+config TEST_FREE_PAGES + tristate "Test freeing pages" + help + Test that a memory leak does not occur due to a race between + freeing a block of pages and a speculative page reference. + Loading this module is safe if your kernel has the bug fixed. + If the bug is not fixed, it will leak gigabytes of memory and + probably OOM your system. + endif # RUNTIME_TESTING_MENU
config MEMTEST diff --git a/lib/Makefile b/lib/Makefile index 0ab808318202..73abfe59558f 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -82,6 +82,7 @@ obj-$(CONFIG_TEST_UUID) += test_uuid.o obj-$(CONFIG_TEST_PARMAN) += test_parman.o obj-$(CONFIG_TEST_KMOD) += test_kmod.o obj-$(CONFIG_TEST_DEBUG_VIRTUAL) += test_debug_virtual.o +obj-$(CONFIG_TEST_FREE_PAGES) += test_free_pages.o
ifeq ($(CONFIG_DEBUG_KOBJECT),y) CFLAGS_kobject.o += -DDEBUG diff --git a/lib/test_free_pages.c b/lib/test_free_pages.c new file mode 100644 index 000000000000..074e76bd76b2 --- /dev/null +++ b/lib/test_free_pages.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * test_free_pages.c: Check that free_pages() doesn't leak memory + * Copyright (c) 2020 Oracle + * Author: Matthew Wilcox willy@infradead.org + */ + +#include <linux/gfp.h> +#include <linux/mm.h> +#include <linux/module.h> + +static void test_free_pages(gfp_t gfp) +{ + unsigned int i; + + for (i = 0; i < 1000 * 1000; i++) { + unsigned long addr = __get_free_pages(gfp, 3); + struct page *page = virt_to_page(addr); + + /* Simulate page cache getting a speculative reference */ + get_page(page); + free_pages(addr, 3); + put_page(page); + } +} + +static int m_in(void) +{ + test_free_pages(GFP_KERNEL); + test_free_pages(GFP_KERNEL | __GFP_COMP); + + return 0; +} + +static void m_ex(void) +{ +} + +module_init(m_in); +module_exit(m_ex); +MODULE_AUTHOR("Matthew Wilcox willy@infradead.org"); +MODULE_LICENSE("GPL"); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 869074294cce..504c9699b1af 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4561,6 +4561,9 @@ void __free_pages(struct page *page, unsigned int order) { if (put_page_testzero(page)) free_the_page(page, order); + else if (!PageHead(page)) + while (order-- > 0) + free_the_page(page + (1 << order), order); } EXPORT_SYMBOL(__free_pages);
From: Liu Shixin liushixin2@huawei.com
hulk inclusion category:feature bugzilla: 43588 CVE: NA
-------------------------------------------------
set default value of CONFIG_TEST_FREE_PAGES
Signed-off-by: Liu Shixin liushixin2@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/configs/euleros_defconfig | 1 + arch/arm64/configs/hulk_defconfig | 1 + arch/arm64/configs/openeuler_defconfig | 1 + arch/x86/configs/hulk_defconfig | 1 + arch/x86/configs/openeuler_defconfig | 1 + 5 files changed, 5 insertions(+)
diff --git a/arch/arm64/configs/euleros_defconfig b/arch/arm64/configs/euleros_defconfig index b26910444f52..122be650462a 100644 --- a/arch/arm64/configs/euleros_defconfig +++ b/arch/arm64/configs/euleros_defconfig @@ -5629,6 +5629,7 @@ CONFIG_TEST_KSTRTOX=y # CONFIG_TEST_UDELAY is not set # CONFIG_TEST_STATIC_KEYS is not set # CONFIG_TEST_KMOD is not set +# CONFIG_TEST_FREE_PAGES is not set # CONFIG_MEMTEST is not set # CONFIG_BUG_ON_DATA_CORRUPTION is not set # CONFIG_SAMPLES is not set diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig index 090ff6c93730..b85d790d630b 100644 --- a/arch/arm64/configs/hulk_defconfig +++ b/arch/arm64/configs/hulk_defconfig @@ -5634,6 +5634,7 @@ CONFIG_TEST_KSTRTOX=y # CONFIG_TEST_UDELAY is not set # CONFIG_TEST_STATIC_KEYS is not set # CONFIG_TEST_KMOD is not set +# CONFIG_TEST_FREE_PAGES is not set # CONFIG_MEMTEST is not set # CONFIG_BUG_ON_DATA_CORRUPTION is not set # CONFIG_SAMPLES is not set diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 5561751ad6eb..e1f2135f2b5b 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -5941,6 +5941,7 @@ CONFIG_TEST_KSTRTOX=y # CONFIG_TEST_UDELAY is not set # CONFIG_TEST_STATIC_KEYS is not set # CONFIG_TEST_KMOD is not set +# CONFIG_TEST_FREE_PAGES is not set # CONFIG_MEMTEST is not set # CONFIG_BUG_ON_DATA_CORRUPTION is not set # CONFIG_SAMPLES is not set diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig index 4079cef76f2c..c7c7d8776c2a 100644 --- a/arch/x86/configs/hulk_defconfig +++ b/arch/x86/configs/hulk_defconfig @@ -7485,6 +7485,7 @@ CONFIG_TEST_KSTRTOX=y # CONFIG_TEST_UDELAY is not set # CONFIG_TEST_STATIC_KEYS is not set # CONFIG_TEST_KMOD is not set +# CONFIG_TEST_FREE_PAGES is not set # CONFIG_MEMTEST is not set # CONFIG_BUG_ON_DATA_CORRUPTION is not set # CONFIG_SAMPLES is not set diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index bb1bcb8de1c5..a52e5932dcc6 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -7486,6 +7486,7 @@ CONFIG_TEST_KSTRTOX=y # CONFIG_TEST_UDELAY is not set # CONFIG_TEST_STATIC_KEYS is not set # CONFIG_TEST_KMOD is not set +# CONFIG_TEST_FREE_PAGES is not set # CONFIG_MEMTEST is not set # CONFIG_BUG_ON_DATA_CORRUPTION is not set # CONFIG_SAMPLES is not set
From: Marc Zyngier maz@kernel.org
stable inclusion from linux-4.19.155 commit d953cd3fb149a999534ff48af323eb091fd5aa3b
--------------------------------
commit 18fce56134c987e5b4eceddafdbe4b00c07e2ae1 upstream.
Commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") changed the way we deal with ARCH_WORKAROUND_1, by moving most of the enabling code to the .matches() callback.
This has the unfortunate effect that the workaround gets only enabled on the first affected CPU, and no other.
In order to address this, forcefully call the .matches() callback from a .cpu_enable() callback, which brings us back to the original behaviour.
Fixes: 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") Cc: stable@vger.kernel.org Reviewed-by: Suzuki K Poulose suzuki.poulose@arm.com Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/cpu_errata.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index 5f2c30ae82ac..c0b2ef5e7ea3 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -647,6 +647,12 @@ check_branch_predictor(const struct arm64_cpu_capabilities *entry, int scope) return (need_wa > 0); }
+static void +cpu_enable_branch_predictor_hardening(const struct arm64_cpu_capabilities *cap) +{ + cap->matches(cap, SCOPE_LOCAL_CPU); +} + static const __maybe_unused struct midr_range tx2_family_cpus[] = { MIDR_ALL_VERSIONS(MIDR_BRCM_VULCAN), MIDR_ALL_VERSIONS(MIDR_CAVIUM_THUNDERX2), @@ -829,9 +835,11 @@ const struct arm64_cpu_capabilities arm64_errata[] = { }, #endif { + .desc = "Branch predictor hardening", .capability = ARM64_HARDEN_BRANCH_PREDICTOR, .type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM, .matches = check_branch_predictor, + .cpu_enable = cpu_enable_branch_predictor_hardening, }, #ifdef CONFIG_HARDEN_EL2_VECTORS {
From: Michael Schaller misch@google.com
stable inclusion from linux-4.19.155 commit 02bb497cd6de22e9ce17396957e76fa5aa11102f
--------------------------------
commit 336af6a4686d885a067ecea8c3c3dd129ba4fc75 upstream.
Without this patch efivarfs_alloc_dentry creates dentries with slashes in their name if the respective EFI variable has slashes in its name. This in turn causes EIO on getdents64, which prevents a complete directory listing of /sys/firmware/efi/efivars/.
This patch replaces the invalid slashes with exclamation marks, like kobject_set_name_vargs() does for /sys/firmware/efi/vars/, so that dentries are named consistently under /sys/firmware/efi/vars/ and /sys/firmware/efi/efivars/.
Signed-off-by: Michael Schaller misch@google.com Link: https://lore.kernel.org/r/20200925074502.150448-1-misch@google.com Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: dann frazier dann.frazier@canonical.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/efivarfs/super.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c index 5b68e4294faa..834615f13f3e 100644 --- a/fs/efivarfs/super.c +++ b/fs/efivarfs/super.c @@ -145,6 +145,9 @@ static int efivarfs_callback(efi_char16_t *name16, efi_guid_t vendor,
name[len + EFI_VARIABLE_GUID_LEN+1] = '\0';
+ /* replace invalid slashes like kobject_set_name_vargs does for /sys/firmware/efi/vars. */ + strreplace(name, '/', '!'); + inode = efivarfs_get_inode(sb, d_inode(root), S_IFREG | 0644, 0, is_removable); if (!inode)
From: Aleksandr Nogikh nogikh@google.com
stable inclusion from linux-4.19.155 commit 95ba2236b8e69de3cb9b12e1cd6c4252a1574a19
--------------------------------
[ Upstream commit eadd1befdd778a1eca57fad058782bd22b4db804 ]
Currently it is possible to craft a special netlink RTM_NEWQDISC command that can result in jitter being equal to 0x80000000. It is enough to set the 32 bit jitter to 0x02000000 (it will later be multiplied by 2^6) or just set the 64 bit jitter via TCA_NETEM_JITTER64. This causes an overflow during the generation of uniformly distributed numbers in tabledist(), which in turn leads to division by zero (sigma != 0, but sigma * 2 is 0).
The related fragment of code needs 32-bit division - see commit 9b0ed89 ("netem: remove unnecessary 64 bit modulus"), so switching to 64 bit is not an option.
Fix the issue by keeping the value of jitter within the range that can be adequately handled by tabledist() - [0;INT_MAX]. As negative std deviation makes no sense, take the absolute value of the passed value and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit arithmetic in order to prevent overflows.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Aleksandr Nogikh nogikh@google.com Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com Acked-by: Stephen Hemminger stephen@networkplumber.org Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sched/sch_netem.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c index 014a28d8dd4f..02d8d3fd84a5 100644 --- a/net/sched/sch_netem.c +++ b/net/sched/sch_netem.c @@ -330,7 +330,7 @@ static s64 tabledist(s64 mu, s32 sigma,
/* default uniform distribution */ if (dist == NULL) - return ((rnd % (2 * sigma)) + mu) - sigma; + return ((rnd % (2 * (u32)sigma)) + mu) - sigma;
t = dist->table[rnd % dist->size]; x = (sigma % NETEM_DIST_SCALE) * t; @@ -787,6 +787,10 @@ static void get_slot(struct netem_sched_data *q, const struct nlattr *attr) q->slot_config.max_packets = INT_MAX; if (q->slot_config.max_bytes == 0) q->slot_config.max_bytes = INT_MAX; + + /* capping dist_jitter to the range acceptable by tabledist() */ + q->slot_config.dist_jitter = min_t(__s64, INT_MAX, abs(q->slot_config.dist_jitter)); + q->slot.packets_left = q->slot_config.max_packets; q->slot.bytes_left = q->slot_config.max_bytes; if (q->slot_config.min_delay | q->slot_config.max_delay | @@ -1011,6 +1015,9 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt, if (tb[TCA_NETEM_SLOT]) get_slot(q, tb[TCA_NETEM_SLOT]);
+ /* capping jitter to the range acceptable by tabledist() */ + q->jitter = min_t(s64, abs(q->jitter), INT_MAX); + return ret;
get_table_failure:
From: Arjun Roy arjunroy@google.com
stable inclusion from linux-4.19.155 commit f5a4ea7839b15d702fb9c731eddcca3e3fecf8d8
--------------------------------
[ Upstream commit 435ccfa894e35e3d4a1799e6ac030e48a7b69ef5 ]
With SO_RCVLOWAT, under memory pressure, it is possible to enter a state where:
1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.
In this case, we advertise 0 rwnd (due to #3) but the application does not drain the receive queue (no wakeup because of #1 and #2) so the flow stalls.
Modify the heuristic for SO_RCVLOWAT so that, if we are advertising rwnd<=rcv_mss, force a wakeup to prevent a stall.
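In sketch form, the adjusted wakeup condition becomes (condensed from the tcp_data_ready() hunk in the diff below):

	/* only skip the wakeup when it is actually safe to wait */
	if (avail < sk->sk_rcvlowat && !tcp_rmem_pressure(sk) &&
	    !sock_flag(sk, SOCK_DONE) &&
	    tcp_receive_window(tp) > inet_csk(sk)->icsk_ack.rcv_mss)
		return;			/* still room to receive more data */

	sk->sk_data_ready(sk);		/* rwnd is (nearly) closed: wake the reader */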
Without this patch, setting tcp_rmem to 6143 and disabling TCP autotune causes a stalled flow. With this patch, no stall occurs. This is with RPC-style traffic with large messages.
Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users") Signed-off-by: Arjun Roy arjunroy@google.com Acked-by: Soheil Hassas Yeganeh soheil@google.com Acked-by: Neal Cardwell ncardwell@google.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp.c | 2 ++ net/ipv4/tcp_input.c | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 4ce3397e6fcf..98e8ee8bb759 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -495,6 +495,8 @@ static inline bool tcp_stream_is_readable(const struct tcp_sock *tp, return true; if (tcp_rmem_pressure(sk)) return true; + if (tcp_receive_window(tp) <= inet_csk(sk)->icsk_ack.rcv_mss) + return true; } if (sk->sk_prot->stream_memory_read) return sk->sk_prot->stream_memory_read(sk); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8f04d2eb2799..0b3f8637d4fa 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4712,7 +4712,8 @@ void tcp_data_ready(struct sock *sk) int avail = tp->rcv_nxt - tp->copied_seq;
if (avail < sk->sk_rcvlowat && !tcp_rmem_pressure(sk) && - !sock_flag(sk, SOCK_DONE)) + !sock_flag(sk, SOCK_DONE) && + tcp_receive_window(tp) > inet_csk(sk)->icsk_ack.rcv_mss) return;
sk->sk_data_ready(sk);
From: Miklos Szeredi mszeredi@redhat.com
stable inclusion from linux-4.19.155 commit 6ef82327906270f66adb4b13517e818d3a20448e
--------------------------------
commit d78092e4937de9ce55edcb4ee4c5e3c707be0190 upstream.
After unlock_request() pages from the ap->pages[] array may be put (e.g. by aborting the connection) and the pages can be freed.
Prevent use after free by grabbing a reference to the page before calling unlock_request().
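The essence of the fix is a "pin before you let go" pattern; a condensed sketch of fuse_try_move_page() (see the diff below for the full error paths):

	get_page(oldpage);		/* extra ref: survives an abort */
	err = unlock_request(cs->req);	/* after this the request may be aborted
					 * and the ap->pages[] references put */
	if (err)
		goto out_put_old;

	/* ... work on oldpage / newpage ... */

	err = 0;
out_put_old:
	put_page(oldpage);		/* drop the reference taken above */
	return err;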
The original patch was created by Pradeep P V K.
Reported-by: Pradeep P V K ppvk@codeaurora.org Cc: stable@vger.kernel.org Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/fuse/dev.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index c51c9a6881e4..1ff5a6b21db0 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -853,15 +853,16 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep) struct page *newpage; struct pipe_buffer *buf = cs->pipebufs;
+ get_page(oldpage); err = unlock_request(cs->req); if (err) - return err; + goto out_put_old;
fuse_copy_finish(cs);
err = pipe_buf_confirm(cs->pipe, buf); if (err) - return err; + goto out_put_old;
BUG_ON(!cs->nr_segs); cs->currbuf = buf; @@ -901,7 +902,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep) err = replace_page_cache_page(oldpage, newpage, GFP_KERNEL); if (err) { unlock_page(newpage); - return err; + goto out_put_old; }
get_page(newpage); @@ -920,14 +921,19 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep) if (err) { unlock_page(newpage); put_page(newpage); - return err; + goto out_put_old; }
unlock_page(oldpage); + /* Drop ref for ap->pages[] array */ put_page(oldpage); cs->len = 0;
- return 0; + err = 0; +out_put_old: + /* Drop ref obtained in this function */ + put_page(oldpage); + return err;
out_fallback_unlock: unlock_page(newpage); @@ -936,10 +942,10 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep) cs->offset = buf->offset;
err = lock_request(cs->req); - if (err) - return err; + if (!err) + err = 1;
- return 1; + goto out_put_old; }
static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page, @@ -951,14 +957,16 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page, if (cs->nr_segs == cs->pipe->buffers) return -EIO;
+ get_page(page); err = unlock_request(cs->req); - if (err) + if (err) { + put_page(page); return err; + }
fuse_copy_finish(cs);
buf = cs->pipebufs; - get_page(page); buf->page = page; buf->offset = offset; buf->len = count;
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.155 commit ad1e8fc2a51e48c6c51f6ca2236e9b9fcbe21fea
--------------------------------
commit 534cf755d9df99e214ddbe26b91cd4d81d2603e2 upstream.
Issuing a magic-sysrq via the PL011 causes the following lockdep splat, which is easily reproducible under QEMU:
| sysrq: Changing Loglevel
| sysrq: Loglevel set to 9
|
| ======================================================
| WARNING: possible circular locking dependency detected
| 5.9.0-rc7 #1 Not tainted
| ------------------------------------------------------
| systemd-journal/138 is trying to acquire lock:
| ffffab133ad950c0 (console_owner){-.-.}-{0:0}, at: console_lock_spinning_enable+0x34/0x70
|
| but task is already holding lock:
| ffff0001fd47b098 (&port_lock_key){-.-.}-{2:2}, at: pl011_int+0x40/0x488
|
| which lock already depends on the new lock.
[...]
| Possible unsafe locking scenario:
|
|       CPU0                    CPU1
|       ----                    ----
|  lock(&port_lock_key);
|                               lock(console_owner);
|                               lock(&port_lock_key);
|  lock(console_owner);
|
| *** DEADLOCK ***
The issue is that CPU0 takes 'port_lock' on the irq path in pl011_int() before taking 'console_owner' on the printk() path, whereas CPU1 takes the two locks in the opposite order on the printk() path because "console_owner" is set prior to calling into the actual console driver.
Fix this in the same way as the msm-serial driver by dropping 'port_lock' before handling the sysrq.
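Sketch of the resulting receive-loop body (simplified from the diff below):

	spin_unlock(&uap->port.lock);	/* sysrq may end up in printk/console */
	sysrq = uart_handle_sysrq_char(&uap->port, ch & 255);
	spin_lock(&uap->port.lock);

	if (!sysrq)
		uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag);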
Cc: stable@vger.kernel.org # 4.19+ Cc: Russell King linux@armlinux.org.uk Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Jiri Slaby jirislaby@kernel.org Link: https://lore.kernel.org/r/20200811101313.GA6970@willie-the-truck Signed-off-by: Peter Zijlstra peterz@infradead.org Tested-by: Will Deacon will@kernel.org Signed-off-by: Will Deacon will@kernel.org Link: https://lore.kernel.org/r/20200930120432.16551-1-will@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/tty/serial/amba-pl011.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c index 8d2cffedbe37..35b86a2d5ddb 100644 --- a/drivers/tty/serial/amba-pl011.c +++ b/drivers/tty/serial/amba-pl011.c @@ -313,8 +313,9 @@ static void pl011_write(unsigned int val, const struct uart_amba_port *uap, */ static int pl011_fifo_to_tty(struct uart_amba_port *uap) { - u16 status; unsigned int ch, flag, fifotaken; + int sysrq; + u16 status;
for (fifotaken = 0; fifotaken != 256; fifotaken++) { status = pl011_read(uap, REG_FR); @@ -349,10 +350,12 @@ static int pl011_fifo_to_tty(struct uart_amba_port *uap) flag = TTY_FRAME; }
- if (uart_handle_sysrq_char(&uap->port, ch & 255)) - continue; + spin_unlock(&uap->port.lock); + sysrq = uart_handle_sysrq_char(&uap->port, ch & 255); + spin_lock(&uap->port.lock);
- uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag); + if (!sysrq) + uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag); }
return fifotaken;
From: Mateusz Nosek mateusznosek0@gmail.com
stable inclusion from linux-4.19.155 commit ae5aa5685c7cc1def1e07e0ab562087c5b8c78e7
--------------------------------
[ Upstream commit 921c7ebd1337d1a46783d7e15a850e12aed2eaa0 ]
If should_futex_fail() returns true in futex_wake_pi(), then the 'ret' variable is set to -EFAULT and then immediately overwritten. So the failure injection is non-functional.
Fix it by actually leaving the function and returning -EFAULT.
The Fixes tag is kind of blurry because the initial commit which introduced failure injection was already sloppy, but the below-mentioned commit broke it completely.
[ tglx: Massaged changelog ]
Fixes: 6b4f4bc9cb22 ("locking/futex: Allow low-level atomic operations to return -EAGAIN") Signed-off-by: Mateusz Nosek mateusznosek0@gmail.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/futex.c b/kernel/futex.c index a1ac8fc68991..e963fac0d5fc 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1540,8 +1540,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_ */ newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
- if (unlikely(should_fail_futex(true))) + if (unlikely(should_fail_futex(true))) { ret = -EFAULT; + goto out_unlock; + }
ret = cmpxchg_futex_value_locked(&curval, uaddr, uval, newval); if (!ret && (curval != uval)) {
From: Nicholas Piggin npiggin@gmail.com
stable inclusion from linux-4.19.155 commit b664645274e922b3356115be9392a49cf3937ce2
--------------------------------
commit d53c3dfb23c45f7d4f910c3a3ca84bf0a99c6143 upstream.
Reading and modifying current->mm and current->active_mm and switching mm should be done with irqs off, to prevent races seeing an intermediate state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate"). At exec-time when the new mm is activated, the old one should usually be single-threaded and no longer used, unless something else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb switching. Consider the kernel_execve case where the current thread is using a lazy tlb active mm:
call_usermodehelper()
  kernel_execve()
    old_mm = current->mm;
    active_mm = current->active_mm;
    *** preempt *** -------------------->  schedule()
                                             prev->active_mm = NULL;
                                             mmdrop(prev active_mm);
                                           ...
                    <--------------------  schedule()
    current->mm = mm;
    current->active_mm = mm;
    if (!old_mm)
        mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm and ->active_mm are being switched, but the TLB problem requires also holding interrupts off over activate_mm. Unfortunately not all archs can do that yet, e.g., arm defers the switch if irqs are disabled and expects finish_arch_post_lock_switch() to be called to complete the flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates to close the lazy tlb preempt race, and provide an arch option to extend that to activate_mm which allows architectures doing IPI based TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting before all architectures are converted this is a compromise.
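Roughly, the exec_mmap() switch then looks like this (a sketch of the backported flow in the diff below):

	local_irq_disable();		/* close the lazy tlb preempt race */
	active_mm = tsk->active_mm;
	membarrier_exec_mmap(mm);
	tsk->active_mm = mm;
	tsk->mm = mm;
	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
		local_irq_enable();	/* arch can't take irqs-off activate_mm yet */
	activate_mm(active_mm, mm);
	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
		local_irq_enable();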
Signed-off-by: Nicholas Piggin npiggin@gmail.com Acked-by: Peter Zijlstra (Intel) peterz@infradead.org [mpe: Manual backport to 4.19 due to membarrier_exec_mmap(mm) changes] Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: fs/exec.c [yyl: adjust context] Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/Kconfig | 7 +++++++ fs/exec.c | 15 ++++++++++++++- 2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig index 895b252d11b0..0dd478fcee0b 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -381,6 +381,13 @@ config HAVE_RCU_TABLE_FREE config HAVE_RCU_TABLE_INVALIDATE bool
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM + bool + help + Temporary select until all architectures can be converted to have + irqs disabled over activate_mm. Architectures that do IPI based TLB + shootdowns should enable this. + config ARCH_HAVE_NMI_SAFE_CMPXCHG bool
diff --git a/fs/exec.c b/fs/exec.c index 15d30416886d..8099ac98e605 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1028,11 +1028,24 @@ static int exec_mmap(struct mm_struct *mm) } } task_lock(tsk); + + local_irq_disable(); active_mm = tsk->active_mm; membarrier_exec_mmap(mm); - tsk->mm = mm; tsk->active_mm = mm; + tsk->mm = mm; + /* + * This prevents preemption while active_mm is being loaded and + * it and mm are being updated, which could cause problems for + * lazy tlb mm refcounting when these are updated by context + * switches. Not all architectures can handle irqs off over + * activate_mm yet. + */ + if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) + local_irq_enable(); activate_mm(active_mm, mm); + if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) + local_irq_enable(); tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk);
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.155 commit bda64b43564ac3eca39202db4f1462487ca4a6bf
--------------------------------
[ Upstream commit f4c32e87de7d66074d5612567c5eac7325024428 ]
The realtime bitmap and summary files are regular files that are hidden away from the directory tree. Since they're regular files, inode inactivation will try to purge what it thinks are speculative preallocations beyond the incore size of the file. Unfortunately, xfs_growfs_rt forgets to update the incore size when it resizes the inodes, with the result that inactivating the rt inodes at unmount time will cause their contents to be truncated.
Fix this by updating the incore size when we change the ondisk size as part of updating the superblock. Note that we don't do this when we're allocating blocks to the rt inodes because we actually want those blocks to get purged if the growfs fails.
This fixes corruption complaints from the online rtsummary checker when running xfs/233. Since that test requires rmap, one can also trigger this by growing an rt volume, cycling the mount, and creating rt files.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/xfs_rtalloc.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c index 08da48b66235..280965fc9bbd 100644 --- a/fs/xfs/xfs_rtalloc.c +++ b/fs/xfs/xfs_rtalloc.c @@ -998,10 +998,13 @@ xfs_growfs_rt( xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, mp->m_rbmip, XFS_ILOCK_EXCL); /* - * Update the bitmap inode's size. + * Update the bitmap inode's size ondisk and incore. We need + * to update the incore size so that inode inactivation won't + * punch what it thinks are "posteof" blocks. */ mp->m_rbmip->i_d.di_size = nsbp->sb_rbmblocks * nsbp->sb_blocksize; + i_size_write(VFS_I(mp->m_rbmip), mp->m_rbmip->i_d.di_size); xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE); /* * Get the summary inode into the transaction. @@ -1009,9 +1012,12 @@ xfs_growfs_rt( xfs_ilock(mp->m_rsumip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, mp->m_rsumip, XFS_ILOCK_EXCL); /* - * Update the summary inode's size. + * Update the summary inode's size. We need to update the + * incore size so that inode inactivation won't punch what it + * thinks are "posteof" blocks. */ mp->m_rsumip->i_d.di_size = nmp->m_rsumsize; + i_size_write(VFS_I(mp->m_rsumip), mp->m_rsumip->i_d.di_size); xfs_trans_log_inode(tp, mp->m_rsumip, XFS_ILOG_CORE); /* * Copy summary data from old to new sizes.
From: Valentin Schneider valentin.schneider@arm.com
stable inclusion from linux-4.19.155 commit f3dc5ac4011eac5a3a8b28fc13fe08b7dfb5af34
--------------------------------
[ Upstream commit 3102bc0e6ac752cc5df896acb557d779af4d82a1 ]
In the absence of ACPI or DT topology data, we fall back to haphazardly decoding *something* out of MPIDR. Sadly, the contents of that register are mostly unusable due to the implementation leniency and things like Aff0 having to be capped to 15 (despite being encoded on 8 bits).
Consider a simple system with a single package of 32 cores, all under the same LLC. We ought to be shoving them in the same core_sibling mask, but MPIDR is going to look like:
| CPU  | 0 | ... | 15 | 16 | ... | 31 |
|------+---+-----+----+----+-----+----|
| Aff0 | 0 | ... | 15 |  0 | ... | 15 |
| Aff1 | 0 | ... |  0 |  1 | ... |  1 |
| Aff2 | 0 | ... |  0 |  0 | ... |  0 |
Which will eventually yield
core_sibling(0-15)  == 0-15
core_sibling(16-31) == 16-31
NUMA woes
=========
If we try to play games with this and set up NUMA boundaries within those groups of 16 cores via e.g. QEMU:
# Node0: 0-9; Node1: 10-19
$ qemu-system-aarch64 <blah> \
  -smp 20 -numa node,cpus=0-9,nodeid=0 -numa node,cpus=10-19,nodeid=1
The scheduler's MC domain (all CPUs with same LLC) is going to be built via
arch_topology.c::cpu_coregroup_mask()
In there we try to figure out a sensible mask out of the topology information we have. In short, here we'll pick the smallest of NUMA or core sibling mask.
node_mask(CPU9)      == 0-9
core_sibling(CPU9)   == 0-15
MC mask for CPU9 will thus be 0-9, not a problem.
node_mask(CPU10)     == 10-19
core_sibling(CPU10)  == 0-15
MC mask for CPU10 will thus be 10-19, not a problem.
node_mask(CPU16)     == 10-19
core_sibling(CPU16)  == 16-19
MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two different unique MC spans, and the scheduler has no idea what to make of that. That triggers the WARN_ON() added by commit
ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
Fixing MPIDR-derived topology
=============================
We could try to come up with some cleverer scheme to figure out which of the available masks to pick, but really if one of those masks resulted from MPIDR then it should be discarded because it's bound to be bogus.
I was hoping to give MPIDR a chance for SMT, to figure out which threads are in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out to me that there are systems out there where *all* cores have non-zero values in their higher affinity fields (e.g. RK3288 has "5" in all of its cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
Stop using MPIDR for topology information. When no other source of topology information is available, mark each CPU as its own core and its NUMA node as its LLC domain.
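The fallback then becomes trivial; a sketch of the new store_cpu_topology() default when firmware provides no topology (the Phytium special case carried in this tree is kept separately, see the diff below):

	/* MPIDR is not trusted: one CPU per "core", LLC spans the NUMA node */
	cpuid_topo->thread_id  = -1;
	cpuid_topo->core_id    = cpuid;
	cpuid_topo->package_id = cpu_to_node(cpuid);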
Signed-off-by: Valentin Schneider valentin.schneider@arm.com Reviewed-by: Sudeep Holla sudeep.holla@arm.com Link: https://lore.kernel.org/r/20200829130016.26106-1-valentin.schneider@arm.com Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Conflicts: arch/arm64/kernel/topology.c [yyl: resolve conflicts introduced by ARM_CPU_IMP_PHYTIUM] Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/topology.c | 43 +++++++++++++++++++----------------- 1 file changed, 23 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c index d6fcafa22f31..4c22fe0cd3c6 100644 --- a/arch/arm64/kernel/topology.c +++ b/arch/arm64/kernel/topology.c @@ -272,26 +272,29 @@ void store_cpu_topology(unsigned int cpuid) if (mpidr & MPIDR_UP_BITMASK) return;
- /* Create cpu topology mapping based on MPIDR. */ - if (mpidr & MPIDR_MT_BITMASK) { - /* Multiprocessor system : Multi-threads per core */ - cpuid_topo->thread_id = MPIDR_AFFINITY_LEVEL(mpidr, 0); - cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL(mpidr, 1); - cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 2) | - MPIDR_AFFINITY_LEVEL(mpidr, 3) << 8; - } else { - /* Multiprocessor system : Single-thread per core */ - cpuid_topo->thread_id = -1; - cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL(mpidr, 0); - cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1) | - MPIDR_AFFINITY_LEVEL(mpidr, 2) << 8 | - MPIDR_AFFINITY_LEVEL(mpidr, 3) << 16; - - if (read_cpuid_implementor() == ARM_CPU_IMP_PHYTIUM) { - cpuid_topo->thread_id = 0; - cpuid_topo->core_id = cpuid; - cpuid_topo->package_id = 0; - } + /* + * This would be the place to create cpu topology based on MPIDR. + * + * However, it cannot be trusted to depict the actual topology; some + * pieces of the architecture enforce an artificial cap on Aff0 values + * (e.g. GICv3's ICC_SGI1R_EL1 limits it to 15), leading to an + * artificial cycling of Aff1, Aff2 and Aff3 values. IOW, these end up + * having absolutely no relationship to the actual underlying system + * topology, and cannot be reasonably used as core / package ID. + * + * If the MT bit is set, Aff0 *could* be used to define a thread ID, but + * we still wouldn't be able to obtain a sane core ID. This means we + * need to entirely ignore MPIDR for any topology deduction. + */ + cpuid_topo->thread_id = -1; + cpuid_topo->core_id = cpuid; + cpuid_topo->package_id = cpu_to_node(cpuid); + + if (!(mpidr & MPIDR_MT_BITMASK) && + read_cpuid_implementor() == ARM_CPU_IMP_PHYTIUM) { + cpuid_topo->thread_id = 0; + cpuid_topo->core_id = cpuid; + cpuid_topo->package_id = 0; }
pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
From: Lang Dai lang.dai@intel.com
stable inclusion from linux-4.19.155 commit ed23397d1daabd47fe463a3c0d5cfdd495448ccb
--------------------------------
[ Upstream commit 8fd0e2a6df262539eaa28b0a2364cca10d1dc662 ]
uio_register_device() does two things:
1) get a uio id from a global pool, e.g. the id is <A>
2) create file nodes like /sys/class/uio/uio<A>

uio_unregister_device() does two things:
1) free the uio id <A> and return it to the global pool
2) free the file node /sys/class/uio/uio<A>

There is a situation where one worker is calling uio_unregister_device() while another worker is calling uio_register_device(). If the two workers are X and Y, they can run in the following sequence:
1) X frees the uio id <AAA>
2) Y gets the uio id <AAA>
3) Y creates the file node /sys/class/uio/uio<AAA>
4) X frees the file node /sys/class/uio/uio<AAA>
Then step 3 fails because it creates a duplicated file node, which is the phenomenon we saw.
Failure reports as follows:
sysfs: cannot create duplicate filename '/class/uio/uio10'
Call Trace:
 sysfs_do_create_link_sd.isra.2+0x9e/0xb0
 sysfs_create_link+0x25/0x40
 device_add+0x2c4/0x640
 __uio_register_device+0x1c5/0x576 [uio]
 adf_uio_init_bundle_dev+0x231/0x280 [intel_qat]
 adf_uio_register+0x1c0/0x340 [intel_qat]
 adf_dev_start+0x202/0x370 [intel_qat]
 adf_dev_start_async+0x40/0xa0 [intel_qat]
 process_one_work+0x14d/0x410
 worker_thread+0x4b/0x460
 kthread+0x105/0x140
 ? process_one_work+0x410/0x410
 ? kthread_bind+0x40/0x40
 ret_from_fork+0x1f/0x40
Code: 85 c0 48 89 c3 74 12 b9 00 10 00 00 48 89 c2 31 f6 4c 89 ef e8 ec c4 ff ff 4c 89 e2 48 89 de 48 c7 c7 e8 b4 ee b4 e8 6a d4 d7 ff <0f> 0b 48 89 df e8 20 fa f3 ff 5b 41 5c 41 5d 5d c3 66 0f 1f 84
---[ end trace a7531c1ed5269e84 ]---
c6xxvf b002:00:00.0: Failed to register UIO devices
c6xxvf b002:00:00.0: Failed to register UIO devices
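The diff below simply reorders uio_unregister_device() so that the minor id is returned to the pool only after the device nodes are gone; in sketch form:

	/* tear down the visible device first ... */
	uio_dev_del_attributes(idev);
	device_unregister(&idev->dev);

	/* ... and only then allow the uio id to be reused */
	uio_free_minor(idev);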
Signed-off-by: Lang Dai lang.dai@intel.com
Link: https://lore.kernel.org/r/1600054002-17722-1-git-send-email-lang.dai@intel.c... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/uio/uio.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c index 5fd5a1d302f6..e743246aaeac 100644 --- a/drivers/uio/uio.c +++ b/drivers/uio/uio.c @@ -1008,8 +1008,6 @@ void uio_unregister_device(struct uio_info *info)
idev = info->uio_dev;
- uio_free_minor(idev); - mutex_lock(&idev->info_lock); uio_dev_del_attributes(idev);
@@ -1024,6 +1022,8 @@ void uio_unregister_device(struct uio_info *info)
device_unregister(&idev->dev);
+ uio_free_minor(idev); + return; } EXPORT_SYMBOL_GPL(uio_unregister_device);
From: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn
stable inclusion from linux-4.19.155 commit 0a2cb1eae0a70acf701600b510de04c55404bb31
--------------------------------
[ Upstream commit a194c5f2d2b3a05428805146afcabe5140b5d378 ]
The @node passed to cpumask_of_node() can be NUMA_NO_NODE, in that case it will trigger the following WARN_ON(node >= nr_node_ids) due to mismatched data types of @node and @nr_node_ids. Actually we should return cpu_all_mask just like most other architectures do if passed NUMA_NO_NODE.
Also add a similar check to the inline cpumask_of_node() in numa.h.
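A condensed sketch of the hardened helper (the NULL-map fallback is left out; see the diff below):

	const struct cpumask *cpumask_of_node(int node)
	{
		if (node == NUMA_NO_NODE)
			return cpu_all_mask;	/* no node: all CPUs qualify */

		if (WARN_ON(node < 0 || node >= nr_node_ids))
			return cpu_none_mask;	/* out of range: no CPUs */

		return node_to_cpumask_map[node];
	}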
Signed-off-by: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn Reviewed-by: Gavin Shan gshan@redhat.com Link: https://lore.kernel.org/r/20200921023936.21846-1-liuzhengyuan@tj.kylinos.cn Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/numa.h | 3 +++ arch/arm64/mm/numa.c | 6 +++++- 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h index 563461a14a43..43bfff72a32f 100644 --- a/arch/arm64/include/asm/numa.h +++ b/arch/arm64/include/asm/numa.h @@ -28,6 +28,9 @@ const struct cpumask *cpumask_of_node(int node); /* Returns a pointer to the cpumask of CPUs on Node 'node'. */ static inline const struct cpumask *cpumask_of_node(int node) { + if (node == NUMA_NO_NODE) + return cpu_all_mask; + return node_to_cpumask_map[node]; } #endif diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index c50d22205f8b..a9d3ad5ee0cc 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -85,7 +85,11 @@ EXPORT_SYMBOL(node_to_cpumask_map); */ const struct cpumask *cpumask_of_node(int node) { - if (WARN_ON(node >= nr_node_ids)) + + if (node == NUMA_NO_NODE) + return cpu_all_mask; + + if (WARN_ON(node < 0 || node >= nr_node_ids)) return cpu_none_mask;
if (WARN_ON(node_to_cpumask_map[node] == NULL))
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.155 commit 451a7c74c23b2ec42410ca5a7110c1a85ebae688
--------------------------------
[ Upstream commit 8df0fa39bdd86ca81a8d706a6ed9d33cc65ca625 ]
When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent to be unmapped from the given file fork without the extent being freed. We do this for non-rt files, but we forgot to do this for realtime files. So far this isn't a big deal since nobody makes a bunmapi call to a rt file with the REMAP flag set, but don't leave a logic bomb.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index a902f3b6e7db..2eb1cb71a43d 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -4969,20 +4969,25 @@ xfs_bmap_del_extent_real(
flags = XFS_ILOG_CORE; if (whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip)) { - xfs_fsblock_t bno; xfs_filblks_t len; xfs_extlen_t mod;
- bno = div_u64_rem(del->br_startblock, mp->m_sb.sb_rextsize, - &mod); - ASSERT(mod == 0); len = div_u64_rem(del->br_blockcount, mp->m_sb.sb_rextsize, &mod); ASSERT(mod == 0);
- error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len); - if (error) - goto done; + if (!(bflags & XFS_BMAPI_REMAP)) { + xfs_fsblock_t bno; + + bno = div_u64_rem(del->br_startblock, + mp->m_sb.sb_rextsize, &mod); + ASSERT(mod == 0); + + error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len); + if (error) + goto done; + } + do_fx = 0; nblks = len * mp->m_sb.sb_rextsize; qfield = XFS_TRANS_DQ_RTBCOUNT;
From: Jonathan Cameron Jonathan.Cameron@huawei.com
stable inclusion from linux-4.19.155 commit d2828ab0a92683e56e210cae4769af1f83519197
--------------------------------
[ Upstream commit 8a3decac087aa897df5af04358c2089e52e70ac4 ]
The function should check the validity of the pxm value before using it to index the pxm_to_node_map[] array.
Whilst hardening this code may be good in general, the main intent here is to enable the following patches that use this function to replace acpi_map_pxm_to_node() for non-SRAT use cases, which should return NUMA_NO_NODE for PXM entries not matching those in SRAT.
Signed-off-by: Jonathan Cameron Jonathan.Cameron@huawei.com Reviewed-by: Barry Song song.bao.hua@hisilicon.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/acpi/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c index 0da58f0bf7e5..a28ff3cfbc29 100644 --- a/drivers/acpi/numa.c +++ b/drivers/acpi/numa.c @@ -46,7 +46,7 @@ int acpi_numa __initdata;
int pxm_to_node(int pxm) { - if (pxm < 0) + if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off) return NUMA_NO_NODE; return pxm_to_node_map[pxm]; }
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.155 commit d808c88f0b88697d7df75eb233b50c235db34b24
--------------------------------
[ Upstream commit e0770e91424f694b461141cbc99adf6b23006b60 ]
When we try to use a file that is already used as a quota file again (for the same or a different quota type), strange things can happen. At the very least lockdep annotations may be wrong, but inode flags may also be wrongly set / reset. When the file is used for two quota types at once we can even corrupt the file and likely crash the kernel. Catch all these cases by checking whether the passed file is already used as a quota file and bail out early in that case.
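The added guard in ext4_quota_on() is a one-liner (sketch matching the diff below):

	/* refuse to reuse a file that is already an active quota file */
	if (IS_NOQUOTA(d_inode(path->dentry)))
		return -EBUSY;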
This fixes occasional generic/219 failure due to lockdep complaint.
Reviewed-by: Andreas Dilger adilger@dilger.ca Reported-by: Ritesh Harjani riteshh@linux.ibm.com Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201015110330.28716-1-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/super.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 1af560c0b9f6..3972580c66ae 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5818,6 +5818,11 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id, /* Quotafile not on the same filesystem? */ if (path->dentry->d_sb != sb) return -EXDEV; + + /* Quota already enabled for this file? */ + if (IS_NOQUOTA(d_inode(path->dentry))) + return -EBUSY; + /* Journaling quota? */ if (EXT4_SB(sb)->s_qf_names[type]) { /* Quotafile not in fs root? */
From: Ronnie Sahlberg lsahlber@redhat.com
stable inclusion from linux-4.19.155 commit 80dc9c4266b5696e797bded1d0f6087446d7cf9d
--------------------------------
[ Upstream commit c6cc4c5a72505a0ecefc9b413f16bec512f38078 ]
RHBZ: 1848178
Some calls that set attributes, like utimensat(), are not supposed to return -EINTR and thus do not have handlers for this in glibc which causes us to leak -EINTR to the applications which are also unprepared to handle it.
For example tar will break if utimensat() return -EINTR and abort unpacking the archive. Other applications may break too.
To handle this we add checks for -EINTR in cifs_setattr() and retry the operation when it is hit.
Signed-off-by: Ronnie Sahlberg lsahlber@redhat.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/inode.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index 4a38f16d944d..d30eb4350656 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -2550,13 +2550,18 @@ cifs_setattr(struct dentry *direntry, struct iattr *attrs) { struct cifs_sb_info *cifs_sb = CIFS_SB(direntry->d_sb); struct cifs_tcon *pTcon = cifs_sb_master_tcon(cifs_sb); + int rc, retries = 0;
- if (pTcon->unix_ext) - return cifs_setattr_unix(direntry, attrs); - - return cifs_setattr_nounix(direntry, attrs); + do { + if (pTcon->unix_ext) + rc = cifs_setattr_unix(direntry, attrs); + else + rc = cifs_setattr_nounix(direntry, attrs); + retries++; + } while (is_retryable_error(rc) && retries < 2);
/* BB: add cifs_setattr_legacy for really old servers */ + return rc; }
#if 0
From: Xiubo Li xiubli@redhat.com
stable inclusion from linux-4.19.155 commit 2ef6f4bd60411934e3fc2715442c2afe70f84bf3
--------------------------------
[ Upstream commit 87aac3a80af5cbad93e63250e8a1e19095ba0d30 ]
There is one race case with ceph's rbd-nbd tool: when doing the mapping, it may fail with EBUSY from ioctl(nbd, NBD_DO_IT), even though the nbd device has actually already been unmapped.

This is because, if recv_work() is scheduled out just after the wake_up() and defers calling nbd_config_put(), then "nbd->recv_task" is not cleared even though the map process has exited.
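The fix reorders the teardown in recv_work() so the config reference is dropped before the waiter is woken (sketch; see the diff below):

	nbd_config_put(nbd);		/* drop the config ref (clearing the recv
					 * task) before the mapper can see the
					 * wake-up and try to map again */
	atomic_dec(&config->recv_threads);
	wake_up(&config->recv_wq);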
Signed-off-by: Xiubo Li xiubli@redhat.com Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/block/nbd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 1b7855c2291c..4b624cd98f90 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -758,9 +758,9 @@ static void recv_work(struct work_struct *work)
blk_mq_complete_request(blk_mq_rq_from_pdu(cmd)); } + nbd_config_put(nbd); atomic_dec(&config->recv_threads); wake_up(&config->recv_wq); - nbd_config_put(nbd); kfree(args); }
From: Douglas Gilbert dgilbert@interlog.com
stable inclusion from linux-4.19.155 commit e6786fd18fe2b91a4844f7d1606c50e09d4cebcf
--------------------------------
[ Upstream commit b2a182a40278bc5849730e66bca01a762188ed86 ]
sgl_alloc_order() can fail when 'length' is large on a memory constrained system. When order > 0 it will potentially be making several multi-page allocations with the later ones more likely to fail than the earlier one. So it is important that sgl_alloc_order() frees up any pages it has obtained before returning NULL. In the case when order > 0 it calls the wrong free page function and leaks. In testing the leak was sufficient to bring down my 8 GiB laptop with OOM.
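A sketch of the corrected error path (matching the diff below): sgl_free_order() knows the allocation order, whereas sgl_free() assumes order 0 and therefore leaked the tail pages of each higher-order chunk.

	page = alloc_pages(gfp, order);
	if (!page) {
		/* free what was already allocated, with the right order */
		sgl_free_order(sgl, order);
		return NULL;
	}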
Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Douglas Gilbert dgilbert@interlog.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- lib/scatterlist.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 8c3036c37ba0..cf18373f4f24 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -506,7 +506,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, elem_len = min_t(u64, length, PAGE_SIZE << order); page = alloc_pages(gfp, order); if (!page) { - sgl_free(sgl); + sgl_free_order(sgl, order); return NULL; }
From: Jiri Olsa jolsa@kernel.org
stable inclusion from linux-4.19.155 commit 753e8a597d54209aa97aa985342f7f98d4240be4
--------------------------------
commit 6fcd5ddc3b1467b3586972ef785d0d926ae4cdf4 upstream.
Hagen reported broken strings in python3 tracepoint scripts:
make PYTHON=python3
perf record -e sched:sched_switch -a -- sleep 5
perf script --gen-script py
perf script -s ./perf-script.py
[..] sched__sched_switch 7 563231.759525792 0 swapper prev_comm=bytearray(b'swapper/7\x00\x00\x00\x00\x00\x00\x00'), prev_pid=0, prev_prio=120, prev_state=, next_comm=bytearray(b'mutex-thread-co\x00'),
The problem is in the is_printable_array function that does not take the zero byte into account and claim such string as not printable, so the code will create byte array instead of string.
Committer testing:
After this fix:
sched__sched_switch 3 484522.497072626 1158680 kworker/3:0-eve prev_comm=kworker/3:0, prev_pid=1158680, prev_prio=120, prev_state=I, next_comm=swapper/3, next_pid=0, next_prio=120 Sample: {addr=0, cpu=3, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1158680, tid=1158680, time=484522497072626, transaction=0, values=[(0, 0)], weight=0}
sched__sched_switch 4 484522.497085610 1225814 perf prev_comm=perf, prev_pid=1225814, prev_prio=120, prev_state=, next_comm=migration/4, next_pid=30, next_prio=0 Sample: {addr=0, cpu=4, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1225814, tid=1225814, time=484522497085610, transaction=0, values=[(0, 0)], weight=0}
Fixes: 249de6e07458 ("perf script python: Fix string vs byte array resolving") Signed-off-by: Jiri Olsa jolsa@kernel.org Tested-by: Arnaldo Carvalho de Melo acme@redhat.com Tested-by: Hagen Paul Pfeifer hagen@jauu.net Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Mark Rutland mark.rutland@arm.com Cc: Michael Petlan mpetlan@redhat.com Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: stable@vger.kernel.org Link: http://lore.kernel.org/lkml/20200928201135.3633850-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/perf/util/print_binary.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/print_binary.c b/tools/perf/util/print_binary.c index 23e367063446..71aeaf6f45cc 100644 --- a/tools/perf/util/print_binary.c +++ b/tools/perf/util/print_binary.c @@ -50,7 +50,7 @@ int is_printable_array(char *p, unsigned int len)
len--;
- for (i = 0; i < len; i++) { + for (i = 0; i < len && p[i]; i++) { if (!isprint(p[i]) && !isspace(p[i])) return 0; }
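To see the effect of the one-character change, a small self-contained userspace restatement of the check follows (is_printable_buf() and the swapper/7 example are illustrative stand-ins, not the perf source):

#include <ctype.h>
#include <stdio.h>

/* Simplified stand-in for the fixed check: stop at the first NUL so
 * trailing terminator/padding bytes do not make an otherwise printable
 * string be reported as a byte array. */
static int is_printable_buf(const char *p, unsigned int len)
{
	unsigned int i;

	if (!p || !len)
		return 0;

	len--;	/* the last byte is allowed to be the terminator */

	for (i = 0; i < len && p[i]; i++) {
		if (!isprint((unsigned char)p[i]) && !isspace((unsigned char)p[i]))
			return 0;
	}
	return 1;
}

int main(void)
{
	char comm[16] = "swapper/7";	/* remainder is zero padding */

	/* With the "&& p[i]" condition this prints 1; without it the NUL
	 * padding fails the isprint()/isspace() test and it prints 0. */
	printf("%d\n", is_printable_buf(comm, sizeof(comm)));
	return 0;
}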
From: Qiujun Huang hqjagain@gmail.com
stable inclusion from linux-4.19.155 commit 57ebe9102928e835d72ffec43c7995723cfea132
--------------------------------
commit 0a1754b2a97efa644aa6e84d1db5b17c42251483 upstream.
We don't need to check the new buffer size, and the return value had confused resize_buffer_duplicate_size().

	...
	ret = ring_buffer_resize(trace_buf->buffer,
		per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id);
	if (ret == 0)
		per_cpu_ptr(trace_buf->data, cpu_id)->entries =
			per_cpu_ptr(size_buf->data, cpu_id)->entries;
	...
Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org Fixes: d60da506cbeb3 ("tracing: Add a resize function to make one buffer equivalent to another buffer") Signed-off-by: Qiujun Huang hqjagain@gmail.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/ring_buffer.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 564d22691dd7..eef05eb3b284 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1692,18 +1692,18 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, { struct ring_buffer_per_cpu *cpu_buffer; unsigned long nr_pages; - int cpu, err = 0; + int cpu, err;
/* * Always succeed at resizing a non-existent buffer: */ if (!buffer) - return size; + return 0;
/* Make sure the requested buffer exists */ if (cpu_id != RING_BUFFER_ALL_CPUS && !cpumask_test_cpu(cpu_id, buffer->cpumask)) - return size; + return 0;
nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
@@ -1843,7 +1843,7 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, }
mutex_unlock(&buffer->mutex); - return size; + return 0;
out_err: for_each_buffer_cpu(buffer, cpu) {
From: Eric Biggers ebiggers@google.com
stable inclusion from linux-4.19.155 commit 314b5a46c65726750fa1fa3e665df9485d09551a
--------------------------------
commit cb8d53d2c97369029cc638c9274ac7be0a316c75 upstream.
ext4_unregister_sysfs() only deletes the kobject. The reference to it needs to be put separately, like ext4_put_super() does.
This addresses the syzbot report "memory leak in kobject_set_name_vargs (3)" (https://syzkaller.appspot.com/bug?extid=9f864abad79fae7c17e1).
Reported-by: syzbot+9f864abad79fae7c17e1@syzkaller.appspotmail.com Fixes: 72ba74508b28 ("ext4: release sysfs kobject when failing to enable quotas on mount") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers ebiggers@google.com Link: https://lore.kernel.org/r/20200922162456.93657-1-ebiggers@kernel.org Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3972580c66ae..9b1ab74df916 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4652,6 +4652,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount8: ext4_unregister_sysfs(sb); + kobject_put(&sbi->s_kobj); failed_mount7: ext4_unregister_li_request(sb); failed_mount6:
From: Dinghao Liu dinghao.liu@zju.edu.cn
stable inclusion from linux-4.19.155 commit 6ad22dc99cb627fa13812bc0e27a6595bba62f5e
--------------------------------
commit c9e87161cc621cbdcfc472fa0b2d81c63780c8f5 upstream.
When ext4_journal_get_write_access() fails, we should terminate the execution flow and release n_group_desc, iloc.bh, dind and gdb_bh.
Cc: stable@kernel.org Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Reviewed-by: Andreas Dilger adilger@dilger.ca Link: https://lore.kernel.org/r/20200829025403.3139-1-dinghao.liu@zju.edu.cn Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/resize.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index e9409ecda9c8..6a0c5c880354 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -843,8 +843,10 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
BUFFER_TRACE(dind, "get_write_access"); err = ext4_journal_get_write_access(handle, dind); - if (unlikely(err)) + if (unlikely(err)) { ext4_std_error(sb, err); + goto errout; + }
/* ext4_reserve_inode_write() gets a reference on the iloc */ err = ext4_reserve_inode_write(handle, inode, &iloc);
From: Luo Meng luomeng12@huawei.com
stable inclusion from linux-4.19.155 commit 7b97149296ef752a1576d0939f629dd30e83e03d
--------------------------------
commit 1322181170bb01bce3c228b82ae3d5c6b793164f upstream.
During the stability test, there are some errors:

  ext4_lookup:1590: inode #6967: comm fsstress: iget: checksum invalid.

If inode->i_blocks is too big and the huge-file flag is not set, the checksum is not recalculated when the inode information is updated to its buffer. If another inode then marks the buffer dirty, the inconsistent inode is flushed to disk.
Fix this problem by checking i_blocks in advance.
Cc: stable@kernel.org Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Link: https://lore.kernel.org/r/20201020013631.3796673-1-luomeng12@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/inode.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0bde4afe6b7f..061d4f6b0844 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5285,6 +5285,12 @@ static int ext4_do_update_inode(handle_t *handle, if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);
+ err = ext4_inode_blocks_set(handle, raw_inode, ei); + if (err) { + spin_unlock(&ei->i_raw_lock); + goto out_brelse; + } + raw_inode->i_mode = cpu_to_le16(inode->i_mode); i_uid = i_uid_read(inode); i_gid = i_gid_read(inode); @@ -5318,11 +5324,6 @@ static int ext4_do_update_inode(handle_t *handle, EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);
- err = ext4_inode_blocks_set(handle, raw_inode, ei); - if (err) { - spin_unlock(&ei->i_raw_lock); - goto out_brelse; - } raw_inode->i_dtime = cpu_to_le32(ei->i_dtime); raw_inode->i_flags = cpu_to_le32(ei->i_flags & 0xFFFFFFFF); if (likely(!test_opt2(inode->i_sb, HURD_COMPAT)))
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
stable inclusion from linux-4.19.155 commit 76b712488dcfc71c158ba2e2785c23aca3ed2924
--------------------------------
commit d5dcce0c414fcbfe4c2037b66ac69ea5f9b3f75c upstream.
By primary and secondary we mean the type of the nodes, which may define their ordering. However, if the primary node is gone, we cannot maintain that ordering by definition of the linked list: the secondary node then becomes first in the list, but its meaning is still secondary (or auxiliary). The type of the node is encoded by its secondary pointer:
secondary pointer       Meaning
NULL or valid           primary node
ERR_PTR(-ENODEV)        secondary node
So, if for some reason we do the following sequence of calls
set_primary_fwnode(dev, NULL); set_primary_fwnode(dev, primary);
we should preserve secondary node.
This concept is supported by the description of set_primary_fwnode() along with implementation of set_secondary_fwnode(). Hence, fix the commit c15e1bdda436 to follow this as well.
Fixes: c15e1bdda436 ("device property: Fix the secondary firmware node handling in set_primary_fwnode()") Cc: Ferry Toth fntoth@gmail.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Reviewed-by: Heikki Krogerus heikki.krogerus@linux.intel.com Tested-by: Ferry Toth fntoth@gmail.com Cc: 5.9+ stable@vger.kernel.org # 5.9+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/base/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index b911c38ad18c..b2e2f086f586 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -3347,7 +3347,7 @@ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode) } else { if (fwnode_is_primary(fn)) { dev->fwnode = fn->secondary; - fn->secondary = NULL; + fn->secondary = ERR_PTR(-ENODEV); } else { dev->fwnode = NULL; }
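A hedged sketch of how that encoding is read back, consistent with the fwnode_is_primary() helper visible in the hunk context below (illustration only; the real helper lives in drivers/base/core.c and the _sketch suffix marks this as a stand-in):

static bool fwnode_is_primary_sketch(struct fwnode_handle *fwnode)
{
	/* Primary nodes keep NULL or a valid pointer in ->secondary;
	 * secondary nodes carry ERR_PTR(-ENODEV) there. */
	return fwnode && !IS_ERR(fwnode->secondary);
}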
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
stable inclusion from linux-4.19.155 commit 9169e2f12323994af8244dd64c036f1d974ea51d
--------------------------------
commit 99aed9227073fb34ce2880cbc7063e04185a65e1 upstream.
It appears that firmware nodes can be shared between devices. In such a case, when a (child) device is about to be deleted, its firmware node may be shared, and the ACPI_COMPANION_SET(..., NULL) call for it breaks the secondary link of the shared primary firmware node.

In order to prevent that, check whether the device has a parent whose firmware node is shared with its child, and if so avoid breaking the link.
Fixes: c15e1bdda436 ("device property: Fix the secondary firmware node handling in set_primary_fwnode()") Reported-by: Ferry Toth fntoth@gmail.com Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Reviewed-by: Heikki Krogerus heikki.krogerus@linux.intel.com Tested-by: Ferry Toth fntoth@gmail.com Cc: 5.9+ stable@vger.kernel.org # 5.9+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/base/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index b2e2f086f586..f0cdf38ed31c 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -3333,6 +3333,7 @@ static inline bool fwnode_is_primary(struct fwnode_handle *fwnode) */ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode) { + struct device *parent = dev->parent; struct fwnode_handle *fn = dev->fwnode;
if (fwnode) { @@ -3347,7 +3348,8 @@ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode) } else { if (fwnode_is_primary(fn)) { dev->fwnode = fn->secondary; - fn->secondary = ERR_PTR(-ENODEV); + if (!(parent && fn == parent->fwnode)) + fn->secondary = ERR_PTR(-ENODEV); } else { dev->fwnode = NULL; }
From: Oleg Nesterov oleg@redhat.com
stable inclusion from linux-4.19.156 commit caf8f9c19a93ec501d0ffca5e6767a969961baea
--------------------------------
commit 7b3c36fc4c231ca532120bbc0df67a12f09c1d96 upstream.
This testcase
	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <pthread.h>
	#include <assert.h>

	void *tf(void *arg)
	{
		return NULL;
	}

	int main(void)
	{
		int pid = fork();
		if (!pid) {
			kill(getpid(), SIGSTOP);

			pthread_t th;
			pthread_create(&th, NULL, tf, NULL);

			return 0;
		}

		waitpid(pid, NULL, WSTOPPED);

		ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
		waitpid(pid, NULL, 0);

		ptrace(PTRACE_CONT, pid, 0,0);
		waitpid(pid, NULL, 0);

		int status;
		int thread = waitpid(-1, &status, 0);
		assert(thread > 0 && thread != pid);
		assert(status == 0x80137f);

		return 0;
	}
fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().
This is because task_join_group_stop() has 2 problems when current is traced:
1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee can be woken up by debugger and it can clone another thread which should join the group-stop.
We need to check group_stop_count || SIGNAL_STOP_STOPPED.
2. If SIGNAL_STOP_STOPPED is already set, we should not increment sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread should stop without another do_notify_parent_cldstop() report.
To clarify, the problem is very old and we should blame ptrace_init_task(). But now that we have task_join_group_stop() it makes more sense to fix this helper to avoid the code duplication.
Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Jens Axboe axboe@kernel.dk Cc: Christian Brauner christian@brauner.io Cc: "Eric W . Biederman" ebiederm@xmission.com Cc: Zhiqiang Liu liuzhiqiang26@huawei.com Cc: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/signal.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c index 6fd732c4c30c..deba77ef0573 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -382,16 +382,17 @@ static bool task_participate_group_stop(struct task_struct *task)
void task_join_group_stop(struct task_struct *task) { + unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK; + struct signal_struct *sig = current->signal; + + if (sig->group_stop_count) { + sig->group_stop_count++; + mask |= JOBCTL_STOP_CONSUME; + } else if (!(sig->flags & SIGNAL_STOP_STOPPED)) + return; + /* Have the new thread join an on-going signal group stop */ - unsigned long jobctl = current->jobctl; - if (jobctl & JOBCTL_STOP_PENDING) { - struct signal_struct *sig = current->signal; - unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK; - unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME; - if (task_set_jobctl_pending(task, signr | gstop)) { - sig->group_stop_count++; - } - } + task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING); }
/*
From: Lee Jones lee.jones@linaro.org
stable inclusion from linux-4.19.156 commit 11af858054baadff13eb92eb347443f93768fde7
--------------------------------
commit 9522750c66c689b739e151fcdf895420dc81efc0 upstream.
Commit 6735b4632def ("Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts") introduced the following error when building rpc_defconfig (only this build appears to be affected):
`acorndata_8x8' referenced in section `.text' of arch/arm/boot/compressed/ll_char_wr.o: defined in discarded section `.data' of arch/arm/boot/compressed/font.o
`acorndata_8x8' referenced in section `.data.rel.ro' of arch/arm/boot/compressed/font.o: defined in discarded section `.data' of arch/arm/boot/compressed/font.o
make[3]: *** [/scratch/linux/arch/arm/boot/compressed/Makefile:191: arch/arm/boot/compressed/vmlinux] Error 1
make[2]: *** [/scratch/linux/arch/arm/boot/Makefile:61: arch/arm/boot/compressed/vmlinux] Error 2
make[1]: *** [/scratch/linux/arch/arm/Makefile:317: zImage] Error 2
The .data section is discarded at link time. Reinstating acorndata_8x8 as const ensures it is still available after linking. Do the same for the other 12 built-in fonts as well, for consistency purposes.
Cc: stable@vger.kernel.org Cc: Russell King linux@armlinux.org.uk Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Fixes: 6735b4632def ("Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts") Signed-off-by: Lee Jones lee.jones@linaro.org Co-developed-by: Peilin Ye yepeilin.cs@gmail.com Signed-off-by: Peilin Ye yepeilin.cs@gmail.com Signed-off-by: Daniel Vetter daniel.vetter@ffwll.ch Link: https://patchwork.freedesktop.org/patch/msgid/20201102183242.2031659-1-yepei... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- lib/fonts/font_10x18.c | 2 +- lib/fonts/font_6x10.c | 2 +- lib/fonts/font_6x11.c | 2 +- lib/fonts/font_7x14.c | 2 +- lib/fonts/font_8x16.c | 2 +- lib/fonts/font_8x8.c | 2 +- lib/fonts/font_acorn_8x8.c | 2 +- lib/fonts/font_mini_4x6.c | 2 +- lib/fonts/font_pearl_8x8.c | 2 +- lib/fonts/font_sun12x22.c | 2 +- lib/fonts/font_sun8x16.c | 2 +- 11 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/lib/fonts/font_10x18.c b/lib/fonts/font_10x18.c index 0e2deac97da0..e02f9df24d1e 100644 --- a/lib/fonts/font_10x18.c +++ b/lib/fonts/font_10x18.c @@ -8,7 +8,7 @@
#define FONTDATAMAX 9216
-static struct font_data fontdata_10x18 = { +static const struct font_data fontdata_10x18 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, 0x00, /* 0000000000 */ diff --git a/lib/fonts/font_6x10.c b/lib/fonts/font_6x10.c index 87da8acd07db..6e3c4b7691c8 100644 --- a/lib/fonts/font_6x10.c +++ b/lib/fonts/font_6x10.c @@ -3,7 +3,7 @@
#define FONTDATAMAX 2560
-static struct font_data fontdata_6x10 = { +static const struct font_data fontdata_6x10 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_6x11.c b/lib/fonts/font_6x11.c index 5e975dfa10a5..2d22a24e816f 100644 --- a/lib/fonts/font_6x11.c +++ b/lib/fonts/font_6x11.c @@ -9,7 +9,7 @@
#define FONTDATAMAX (11*256)
-static struct font_data fontdata_6x11 = { +static const struct font_data fontdata_6x11 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_7x14.c b/lib/fonts/font_7x14.c index 86d298f38505..9cc7ae2e03f7 100644 --- a/lib/fonts/font_7x14.c +++ b/lib/fonts/font_7x14.c @@ -8,7 +8,7 @@
#define FONTDATAMAX 3584
-static struct font_data fontdata_7x14 = { +static const struct font_data fontdata_7x14 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 0000000 */ diff --git a/lib/fonts/font_8x16.c b/lib/fonts/font_8x16.c index 37cedd36ca5e..bab25dc59e8d 100644 --- a/lib/fonts/font_8x16.c +++ b/lib/fonts/font_8x16.c @@ -10,7 +10,7 @@
#define FONTDATAMAX 4096
-static struct font_data fontdata_8x16 = { +static const struct font_data fontdata_8x16 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_8x8.c b/lib/fonts/font_8x8.c index 8ab695538395..109d0572368f 100644 --- a/lib/fonts/font_8x8.c +++ b/lib/fonts/font_8x8.c @@ -9,7 +9,7 @@
#define FONTDATAMAX 2048
-static struct font_data fontdata_8x8 = { +static const struct font_data fontdata_8x8 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_acorn_8x8.c b/lib/fonts/font_acorn_8x8.c index 069b3e80c434..fb395f0d4031 100644 --- a/lib/fonts/font_acorn_8x8.c +++ b/lib/fonts/font_acorn_8x8.c @@ -5,7 +5,7 @@
#define FONTDATAMAX 2048
-static struct font_data acorndata_8x8 = { +static const struct font_data acorndata_8x8 = { { 0, 0, FONTDATAMAX, 0 }, { /* 00 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* ^@ */ /* 01 */ 0x7e, 0x81, 0xa5, 0x81, 0xbd, 0x99, 0x81, 0x7e, /* ^A */ diff --git a/lib/fonts/font_mini_4x6.c b/lib/fonts/font_mini_4x6.c index 1449876c6a27..592774a90917 100644 --- a/lib/fonts/font_mini_4x6.c +++ b/lib/fonts/font_mini_4x6.c @@ -43,7 +43,7 @@ __END__;
#define FONTDATAMAX 1536
-static struct font_data fontdata_mini_4x6 = { +static const struct font_data fontdata_mini_4x6 = { { 0, 0, FONTDATAMAX, 0 }, { /*{*/ /* Char 0: ' ' */ diff --git a/lib/fonts/font_pearl_8x8.c b/lib/fonts/font_pearl_8x8.c index 32d65551e7ed..a6f95ebce950 100644 --- a/lib/fonts/font_pearl_8x8.c +++ b/lib/fonts/font_pearl_8x8.c @@ -14,7 +14,7 @@
#define FONTDATAMAX 2048
-static struct font_data fontdata_pearl8x8 = { +static const struct font_data fontdata_pearl8x8 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, /* 00000000 */ diff --git a/lib/fonts/font_sun12x22.c b/lib/fonts/font_sun12x22.c index 641a6b4dca42..a5b65bd49604 100644 --- a/lib/fonts/font_sun12x22.c +++ b/lib/fonts/font_sun12x22.c @@ -3,7 +3,7 @@
#define FONTDATAMAX 11264
-static struct font_data fontdata_sun12x22 = { +static const struct font_data fontdata_sun12x22 = { { 0, 0, FONTDATAMAX, 0 }, { /* 0 0x00 '^@' */ 0x00, 0x00, /* 000000000000 */ diff --git a/lib/fonts/font_sun8x16.c b/lib/fonts/font_sun8x16.c index 193fe6d988e0..e577e76a6a7c 100644 --- a/lib/fonts/font_sun8x16.c +++ b/lib/fonts/font_sun8x16.c @@ -3,7 +3,7 @@
#define FONTDATAMAX 4096
-static struct font_data fontdata_sun8x16 = { +static const struct font_data fontdata_sun8x16 = { { 0, 0, FONTDATAMAX, 0 }, { /* */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* */ 0x00,0x00,0x7e,0x81,0xa5,0x81,0x81,0xbd,0x99,0x81,0x81,0x7e,0x00,0x00,0x00,0x00,
From: Shijie Luo luoshijie1@huawei.com
stable inclusion from linux-4.19.156 commit d4fea27ffcb01bbf8ca8ae54f60479945394b83e
--------------------------------
commit 3f08842098e842c51e3b97d0dcdebf810b32558e upstream.
When the flags in queue_pages_pte_range() don't have the MPOL_MF_MOVE or MPOL_MF_MOVE_ALL bits set, the code breaks out of the loop, and passing the original pte - 1 to pte_unmap_unlock() is not a good idea.

queue_pages_pte_range() can run in MPOL_MF_MOVE_ALL mode, which doesn't migrate misplaced pages but returns with EIO when encountering such a page. Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"), an early break on the first pte in the range results in pte_unmap_unlock() on an underflow pte. This can lead to lockups later on when somebody tries to lock the pte resp. the page_table_lock again.
Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified") Signed-off-by: Shijie Luo luoshijie1@huawei.com Signed-off-by: Miaohe Lin linmiaohe@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Oscar Salvador osalvador@suse.de Acked-by: Michal Hocko mhocko@suse.com Cc: Miaohe Lin linmiaohe@huawei.com Cc: Feilong Lin linfeilong@huawei.com Cc: Shijie Luo luoshijie1@huawei.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/mempolicy.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index cab2ee936b5d..8f420c64934e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -535,7 +535,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long flags = qp->flags; int ret; bool has_unmovable = false; - pte_t *pte; + pte_t *pte, *mapped_pte; spinlock_t *ptl;
ptl = pmd_trans_huge_lock(pmd, vma); @@ -549,7 +549,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0;
- pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; @@ -581,7 +581,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, } else break; } - pte_unmap_unlock(pte - 1, ptl); + pte_unmap_unlock(mapped_pte, ptl); cond_resched();
if (has_unmovable)
From: Zqiang qiang.zhang@windriver.com
stable inclusion from linux-4.19.156 commit 68e8b8ed787a8acafa3e987d52f6d415c5949ec6
--------------------------------
commit 6993d0fdbee0eb38bfac350aa016f65ad11ed3b1 upstream.
There is a small race window when a delayed work is being canceled and the work still might be queued from the timer_fn:
CPU0                                            CPU1
kthread_cancel_delayed_work_sync()
  __kthread_cancel_work_sync()
    __kthread_cancel_work()
      work->canceling++;
                                                kthread_delayed_work_timer_fn()
                                                  kthread_insert_work();
BUG: kthread_insert_work() should not get called when work->canceling is set.
Signed-off-by: Zqiang qiang.zhang@windriver.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Petr Mladek pmladek@suse.com Acked-by: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/kthread.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c index 7973996e061f..ba1aacf0755a 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -859,7 +859,8 @@ void kthread_delayed_work_timer_fn(struct timer_list *t) /* Move the work from worker->delayed_work_list. */ WARN_ON_ONCE(list_empty(&work->node)); list_del_init(&work->node); - kthread_insert_work(worker, work, &worker->work_list); + if (!work->canceling) + kthread_insert_work(worker, work, &worker->work_list);
spin_unlock(&worker->lock); }
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.156 commit b410d07e965a039dc67073889dab9ff01cee8402
--------------------------------
commit b02414c8f045ab3b9afc816c3735bc98c5c3d262 upstream.
The recursion protection of the ring buffer depends on preempt_count() to be correct. But it is possible that the ring buffer gets called after an interrupt comes in but before it updates the preempt_count(). This will trigger a false positive in the recursion code.
Use the same trick from the ftrace function callback recursion code, which uses a "transition" bit that gets set to allow for a single recursion to handle transitions between contexts.
Cc: stable@vger.kernel.org Fixes: 567cd4da54ff4 ("ring-buffer: User context bit recursion checking") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/ring_buffer.c | 58 ++++++++++++++++++++++++++++++-------- 1 file changed, 46 insertions(+), 12 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index eef05eb3b284..87ce9736043d 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -444,14 +444,16 @@ struct rb_event_info {
/* * Used for which event context the event is in. - * NMI = 0 - * IRQ = 1 - * SOFTIRQ = 2 - * NORMAL = 3 + * TRANSITION = 0 + * NMI = 1 + * IRQ = 2 + * SOFTIRQ = 3 + * NORMAL = 4 * * See trace_recursive_lock() comment below for more details. */ enum { + RB_CTX_TRANSITION, RB_CTX_NMI, RB_CTX_IRQ, RB_CTX_SOFTIRQ, @@ -2620,10 +2622,10 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) * a bit of overhead in something as critical as function tracing, * we use a bitmask trick. * - * bit 0 = NMI context - * bit 1 = IRQ context - * bit 2 = SoftIRQ context - * bit 3 = normal context. + * bit 1 = NMI context + * bit 2 = IRQ context + * bit 3 = SoftIRQ context + * bit 4 = normal context. * * This works because this is the order of contexts that can * preempt other contexts. A SoftIRQ never preempts an IRQ @@ -2646,6 +2648,30 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) * The least significant bit can be cleared this way, and it * just so happens that it is the same bit corresponding to * the current context. + * + * Now the TRANSITION bit breaks the above slightly. The TRANSITION bit + * is set when a recursion is detected at the current context, and if + * the TRANSITION bit is already set, it will fail the recursion. + * This is needed because there's a lag between the changing of + * interrupt context and updating the preempt count. In this case, + * a false positive will be found. To handle this, one extra recursion + * is allowed, and this is done by the TRANSITION bit. If the TRANSITION + * bit is already set, then it is considered a recursion and the function + * ends. Otherwise, the TRANSITION bit is set, and that bit is returned. + * + * On the trace_recursive_unlock(), the TRANSITION bit will be the first + * to be cleared. Even if it wasn't the context that set it. That is, + * if an interrupt comes in while NORMAL bit is set and the ring buffer + * is called before preempt_count() is updated, since the check will + * be on the NORMAL bit, the TRANSITION bit will then be set. If an + * NMI then comes in, it will set the NMI bit, but when the NMI code + * does the trace_recursive_unlock() it will clear the TRANSTION bit + * and leave the NMI bit set. But this is fine, because the interrupt + * code that set the TRANSITION bit will then clear the NMI bit when it + * calls trace_recursive_unlock(). If another NMI comes in, it will + * set the TRANSITION bit and continue. + * + * Note: The TRANSITION bit only handles a single transition between context. */
static __always_inline int @@ -2661,8 +2687,16 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer) bit = pc & NMI_MASK ? RB_CTX_NMI : pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;
- if (unlikely(val & (1 << (bit + cpu_buffer->nest)))) - return 1; + if (unlikely(val & (1 << (bit + cpu_buffer->nest)))) { + /* + * It is possible that this was called by transitioning + * between interrupt context, and preempt_count() has not + * been updated yet. In this case, use the TRANSITION bit. + */ + bit = RB_CTX_TRANSITION; + if (val & (1 << (bit + cpu_buffer->nest))) + return 1; + }
val |= (1 << (bit + cpu_buffer->nest)); cpu_buffer->current_context = val; @@ -2677,8 +2711,8 @@ trace_recursive_unlock(struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->current_context - (1 << cpu_buffer->nest); }
-/* The recursive locking above uses 4 bits */ -#define NESTED_BITS 4 +/* The recursive locking above uses 5 bits */ +#define NESTED_BITS 5
/** * ring_buffer_nest_start - Allow to trace while nested
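The bitmask trick plus the new TRANSITION bit can be modelled in a few lines of plain userspace C; the names below are hypothetical and the model ignores the per-CPU state and the nest offset, but the accept/reject behaviour matches the description above:

#include <stdio.h>

enum { CTX_TRANSITION, CTX_NMI, CTX_IRQ, CTX_SOFTIRQ, CTX_NORMAL };

static unsigned int current_context;

static int recursive_lock(int ctx_bit)
{
	unsigned int val = current_context;

	if (val & (1 << ctx_bit)) {
		/* Possibly just a context transition that the preempt
		 * count has not reflected yet: tolerate one extra level
		 * by claiming the TRANSITION bit instead. */
		ctx_bit = CTX_TRANSITION;
		if (val & (1 << ctx_bit))
			return 1;	/* genuine recursion, reject */
	}

	current_context = val | (1 << ctx_bit);
	return 0;
}

static void recursive_unlock(void)
{
	/* val & (val - 1) clears the least significant set bit, i.e. the
	 * TRANSITION bit if it was used, otherwise the context bit. */
	current_context &= current_context - 1;
}

int main(void)
{
	printf("%d\n", recursive_lock(CTX_NORMAL));	/* 0: first entry */
	printf("%d\n", recursive_lock(CTX_NORMAL));	/* 0: transition tolerated */
	printf("%d\n", recursive_lock(CTX_NORMAL));	/* 1: real recursion */
	recursive_unlock();
	recursive_unlock();
	return 0;
}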
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.156 commit ee2b95c08515633da67c048beb164966224ffe9e
--------------------------------
commit ee11b93f95eabdf8198edd4668bf9102e7248270 upstream.
The code that checks recursion will work to only do the recursion check once if there's nested checks. The top one will do the check, the other nested checks will see recursion was already checked and return zero for its "bit". On the return side, nothing will be done if the "bit" is zero.
The problem is that zero is returned for the "good" bit when in NMI context. This will set the bit for NMIs making it look like *all* NMI tracing is recursing, and prevent tracing of anything in NMI context!
The simple fix is to return "bit + 1" and subtract that bit on the end to get the real bit.
Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/trace.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 8cc1861163dc..868bd11ce72a 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -595,7 +595,7 @@ static __always_inline int trace_test_and_set_recursion(int start, int max) current->trace_recursion = val; barrier();
- return bit; + return bit + 1; }
static __always_inline void trace_clear_recursion(int bit) @@ -605,6 +605,7 @@ static __always_inline void trace_clear_recursion(int bit) if (!bit) return;
+ bit--; bit = 1 << bit; val &= ~bit;
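The off-by-one can be shown with a tiny userspace model (hypothetical names, not the trace.h source): context bit 0 is a legitimate answer for NMI context, so the "nothing was set by this level" case cannot also be encoded as 0.

#include <stdio.h>

static unsigned long recursion;

static int test_and_set_recursion(int ctx_bit)
{
	if (recursion & (1UL << ctx_bit))
		return 0;		/* an outer level already owns the check */
	recursion |= 1UL << ctx_bit;
	return ctx_bit + 1;		/* never 0, even for context bit 0 */
}

static void clear_recursion(int bit)
{
	if (!bit)
		return;			/* this level set nothing, clear nothing */
	bit--;
	recursion &= ~(1UL << bit);
}

int main(void)
{
	int outer = test_and_set_recursion(0);	/* "NMI" context: returns 1 */
	int inner = test_and_set_recursion(0);	/* nested check: returns 0 */

	clear_recursion(inner);			/* no-op */
	clear_recursion(outer);			/* clears bit 0 */
	printf("outer=%d inner=%d left=%lu\n", outer, inner, recursion);
	return 0;
}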
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.156 commit 2de780dfbe1ad05c3f15f8d6f2d22dae9cf95267
--------------------------------
commit 726b3d3f141fba6f841d715fc4d8a4a84f02c02a upstream.
When an interrupt or NMI comes in and switches the context, there's a delay before preempt_count() shows the update. As preempt_count() is used to detect recursion, each context has its own bit that gets set when tracing starts, and if that bit is already set, it is considered a recursion and the function exits. But if this happens in the window where the context has changed but preempt_count() has not been updated yet, it is incorrectly flagged as a recursion.

To handle this case, create another bit called TRANSITION and test it when the current context bit is already set. Flag the call as a recursion if the TRANSITION bit is already set; if not, set it and continue. The TRANSITION bit is cleared normally on the return of the function that set it, or, if the current context bit is clear, that bit is set and the TRANSITION bit is cleared to allow for another transition between the current context and an even higher one.
Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/trace.h | 23 +++++++++++++++++++++-- kernel/trace/trace_selftest.c | 9 +++++++-- 2 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 868bd11ce72a..4088742fbb08 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -534,6 +534,12 @@ enum {
TRACE_GRAPH_DEPTH_START_BIT, TRACE_GRAPH_DEPTH_END_BIT, + + /* + * When transitioning between context, the preempt_count() may + * not be correct. Allow for a single recursion to cover this case. + */ + TRACE_TRANSITION_BIT, };
#define trace_recursion_set(bit) do { (current)->trace_recursion |= (1<<(bit)); } while (0) @@ -588,8 +594,21 @@ static __always_inline int trace_test_and_set_recursion(int start, int max) return 0;
bit = trace_get_context_bit() + start; - if (unlikely(val & (1 << bit))) - return -1; + if (unlikely(val & (1 << bit))) { + /* + * It could be that preempt_count has not been updated during + * a switch between contexts. Allow for a single recursion. + */ + bit = TRACE_TRANSITION_BIT; + if (trace_recursion_test(bit)) + return -1; + trace_recursion_set(bit); + barrier(); + return bit + 1; + } + + /* Normal check passed, clear the transition to allow it again */ + trace_recursion_clear(TRACE_TRANSITION_BIT);
val |= 1 << bit; current->trace_recursion = val; diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 11e9daa4a568..330ba2b081ed 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -492,8 +492,13 @@ trace_selftest_function_recursion(void) unregister_ftrace_function(&test_rec_probe);
ret = -1; - if (trace_selftest_recursion_cnt != 1) { - pr_cont("*callback not called once (%d)* ", + /* + * Recursion allows for transitions between context, + * and may call the callback twice. + */ + if (trace_selftest_recursion_cnt != 1 && + trace_selftest_recursion_cnt != 2) { + pr_cont("*callback not called once (or twice) (%d)* ", trace_selftest_recursion_cnt); goto out; }
From: Qiujun Huang hqjagain@gmail.com
stable inclusion from linux-4.19.156 commit 7e4eeff7da1df5602e9af340d01b2ce1b4662d3e
--------------------------------
commit c1acb4ac1a892cf08d27efcb964ad281728b0545 upstream.
The nesting count of trace_printk allows for 4 levels of nesting. The nesting counter starts at zero and is incremented before being used to retrieve the current context's buffer. But the index to the buffer uses the nesting counter after it was incremented, and not its original number, which is what it needs to do.
Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org Fixes: 3d9622c12c887 ("tracing: Add barrier to trace_printk() buffer nesting modification") Signed-off-by: Qiujun Huang hqjagain@gmail.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index e32aae03f6be..b4f429106ad4 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2817,7 +2817,7 @@ static char *get_trace_buf(void)
/* Interrupts must see nesting incremented before we use the buffer */ barrier(); - return &buffer->buffer[buffer->nesting][0]; + return &buffer->buffer[buffer->nesting - 1][0]; }
static void put_trace_buf(void)
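The off-by-one is easy to model in userspace (tb, get_buf() and put_buf() are hypothetical stand-ins for the per-CPU trace buffer): the counter is bumped before use, so the slot just claimed is nesting - 1.

#include <stdio.h>

#define NEST_MAX 4
#define BUF_SZ   32

static struct {
	char buffer[NEST_MAX][BUF_SZ];
	int  nesting;
} tb;

static char *get_buf(void)
{
	if (tb.nesting >= NEST_MAX)
		return NULL;

	tb.nesting++;
	/* The fix: index the slot just claimed, not one past it. */
	return &tb.buffer[tb.nesting - 1][0];
}

static void put_buf(void)
{
	tb.nesting--;
}

int main(void)
{
	char *outer = get_buf();	/* slot 0 */
	char *inner = get_buf();	/* slot 1 */

	printf("distinct slots: %d\n", outer != inner);	/* 1 */
	put_buf();
	put_buf();
	return 0;
}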
From: Mike Galbraith efault@gmx.de
stable inclusion from linux-4.19.156 commit c096a3d44ead54b53a3b65d9d210022099bdc7ef
--------------------------------
commit 9f5d1c336a10c0d24e83e40b4c1b9539f7dba627 upstream.
Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner(). This is one possible chain of events leading to this:
Task    Prio    Operation
T1      120     lock(F)
T2      120     lock(F)   -> blocks (top waiter)
T3      50 (RT) lock(F)   -> boosts T1 and blocks (new top waiter)
XX              timeout/  -> wakes T2
                signal
T1      50      unlock(F) -> wakes T3 (rtmutex->owner == NULL,
                             waiter bit is set)
T2      120     cleanup   -> try_to_take_mutex() fails because T3 is the
                             top waiter and the lower priority T2 cannot
                             steal the lock.
                          -> fixup_pi_state_owner() sees newowner == NULL
                             -> BUG_ON()
The comment states that this is invalid and rt_mutex_real_owner() must return a non NULL owner when the trylock failed, but in case of a queued and woken up waiter rt_mutex_real_owner() == NULL is a valid transient state. The higher priority waiter has simply not yet managed to take over the rtmutex.
The BUG_ON() is therefore wrong and this is just another retry condition in fixup_pi_state_owner().
Drop the locks, so that T3 can make progress, and then try the fixup again.
Gratian provided a great analysis, traces and a reproducer. The analysis is to the point, but it confused the hell out of that tglx dude who had to page in all the futex horrors again. Condensed version is above.
[ tglx: Wrote comment and changelog ]
Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Reported-by: Gratian Crisan gratian.crisan@ni.com Signed-off-by: Mike Galbraith efault@gmx.de Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com Link: https://lore.kernel.org/r/87sg9pkvf7.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index e963fac0d5fc..1cfa2330f16a 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2422,10 +2422,22 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, }
/* - * Since we just failed the trylock; there must be an owner. + * The trylock just failed, so either there is an owner or + * there is a higher priority waiter than this one. */ newowner = rt_mutex_owner(&pi_state->pi_mutex); - BUG_ON(!newowner); + /* + * If the higher priority waiter has not yet taken over the + * rtmutex then newowner is NULL. We can't return here with + * that state because it's inconsistent vs. the user space + * state. So drop the locks and try again. It's a valid + * situation and not any different from the other retry + * conditions. + */ + if (unlikely(!newowner)) { + err = -EAGAIN; + goto handle_err; + } } else { WARN_ON_ONCE(argowner != current); if (oldowner == current) {
From: Gabriel Krisman Bertazi krisman@collabora.com
stable inclusion from linux-4.19.156 commit 6dba51828dfd31e326e6642727ede24c054a86fe
--------------------------------
[ Upstream commit 52abfcbd57eefdd54737fc8c2dc79d8f46d4a3e5 ]
If new_blkg allocation raced with blk_policy change and blkg_lookup_check fails, new_blkg is leaked.
Acked-by: Tejun Heo tj@kernel.org Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-cgroup.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index c64f0afa27dc..1a21a3151741 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -900,6 +900,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, blkg = blkg_lookup_check(pos, pol, q); if (IS_ERR(blkg)) { ret = PTR_ERR(blkg); + blkg_free(new_blkg); goto fail_unlock; }
From: Gabriel Krisman Bertazi krisman@collabora.com
stable inclusion from linux-4.19.156 commit 239aed5d2ecbfd381ca642f0a6bcb272c98408df
--------------------------------
[ Upstream commit f255c19b3ab46d3cad3b1b2e1036f4c926cb1d0c ]
Similarly to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq outside request queue spinlock"), blkg_create can also trigger occasional -ENOMEM failures at the radix insertion because any allocation inside blkg_create has to be non-blocking, making it more likely to fail. This causes trouble for userspace tools trying to configure io weights, which then need to deal with this condition.
This patch reduces the occurrence of -ENOMEMs on this path by preloading the radix tree element on a GFP_KERNEL context, such that we guarantee the later non-blocking insertion won't fail.
A similar solution exists in blkcg_init_queue for the same situation.
Acked-by: Tejun Heo tj@kernel.org Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 1a21a3151741..7ded51051e0b 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -894,6 +894,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, goto fail; }
+ if (radix_tree_preload(GFP_KERNEL)) { + blkg_free(new_blkg); + ret = -ENOMEM; + goto fail; + } + rcu_read_lock(); spin_lock_irq(q->queue_lock);
@@ -901,7 +907,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, if (IS_ERR(blkg)) { ret = PTR_ERR(blkg); blkg_free(new_blkg); - goto fail_unlock; + goto fail_preloaded; }
if (blkg) { @@ -910,10 +916,12 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, blkg = blkg_create(pos, q, new_blkg); if (unlikely(IS_ERR(blkg))) { ret = PTR_ERR(blkg); - goto fail_unlock; + goto fail_preloaded; } }
+ radix_tree_preload_end(); + if (pos == blkcg) goto success; } @@ -923,6 +931,8 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, ctx->body = body; return 0;
+fail_preloaded: + radix_tree_preload_end(); fail_unlock: spin_unlock_irq(q->queue_lock); rcu_read_unlock();
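For reference, a hedged sketch of the preload pattern the patch applies (kernel context only; insert_item(), struct item, my_tree and my_lock are hypothetical, not blk-cgroup code). The blocking allocation happens before the spinlock is taken, so the insertion under the lock cannot fail with -ENOMEM:

static int insert_item(struct radix_tree_root *my_tree, spinlock_t *my_lock,
		       unsigned long index, struct item *item)
{
	int err;

	if (radix_tree_preload(GFP_KERNEL))	/* may sleep, fills a per-CPU node reserve */
		return -ENOMEM;

	spin_lock_irq(my_lock);
	err = radix_tree_insert(my_tree, index, item);	/* consumes preloaded nodes */
	spin_unlock_irq(my_lock);

	radix_tree_preload_end();		/* drops the preemption disable from preload */

	return err;
}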
From: Ming Lei ming.lei@redhat.com
stable inclusion from linux-4.19.156 commit 6280db912e03089b15a3e2c2882d573d3b09bd0c
--------------------------------
[ Upstream commit 831e3405c2a344018a18fcc2665acc5a38c3a707 ]
The current scanning mechanism is supposed to fall back to a synchronous host scan if an asynchronous scan is in progress. However, this rule isn't strictly respected: scsi_prep_async_scan() doesn't hold scan_mutex when checking shost->async_scan. When scsi_scan_host() is called concurrently, two async scans on the same host can be started, and a hang in do_scan_async() is observed.

Fix this issue by checking and setting shost->async_scan atomically under shost->scan_mutex.
Link: https://lore.kernel.org/r/20201010032539.426615-1-ming.lei@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Ewan D. Milne emilne@redhat.com Cc: Hannes Reinecke hare@suse.de Cc: Bart Van Assche bvanassche@acm.org Reviewed-by: Lee Duncan lduncan@suse.com Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/scsi_scan.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 9a7e3a3bd5ce..009a5b2aa3d0 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -1722,15 +1722,16 @@ static void scsi_sysfs_add_devices(struct Scsi_Host *shost) */ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost) { - struct async_scan_data *data; + struct async_scan_data *data = NULL; unsigned long flags;
if (strncmp(scsi_scan_type, "sync", 4) == 0) return NULL;
+ mutex_lock(&shost->scan_mutex); if (shost->async_scan) { shost_printk(KERN_DEBUG, shost, "%s called twice\n", __func__); - return NULL; + goto err; }
data = kmalloc(sizeof(*data), GFP_KERNEL); @@ -1741,7 +1742,6 @@ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost) goto err; init_completion(&data->prev_finished);
- mutex_lock(&shost->scan_mutex); spin_lock_irqsave(shost->host_lock, flags); shost->async_scan = 1; spin_unlock_irqrestore(shost->host_lock, flags); @@ -1756,6 +1756,7 @@ static struct async_scan_data *scsi_prep_async_scan(struct Scsi_Host *shost) return data;
err: + mutex_unlock(&shost->scan_mutex); kfree(data); return NULL; }
From: Eddy Wu itseddy0402@gmail.com
stable inclusion from linux-4.19.156 commit b177d2d915cea2d0a590f0034a20299dd1ee3ef2
--------------------------------
commit b4e00444cab4c3f3fec876dc0cccc8cbb0d1a948 upstream.
current->group_leader->exit_signal may change during copy_process() if current->real_parent exits.
Move the assignment inside tasklist_lock to avoid the race.
Signed-off-by: Eddy Wu eddy_wu@trendmicro.com Acked-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/fork.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c index 68b0fd51302e..dc059179fd18 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2026,14 +2026,9 @@ static __latent_entropy struct task_struct *copy_process( /* ok, now we should be set up.. */ p->pid = pid_nr(pid); if (clone_flags & CLONE_THREAD) { - p->exit_signal = -1; p->group_leader = current->group_leader; p->tgid = current->tgid; } else { - if (clone_flags & CLONE_PARENT) - p->exit_signal = current->group_leader->exit_signal; - else - p->exit_signal = (clone_flags & CSIGNAL); p->group_leader = p; p->tgid = p->pid; } @@ -2079,10 +2074,15 @@ static __latent_entropy struct task_struct *copy_process( p->real_parent = current->real_parent; p->parent_exec_id = current->parent_exec_id; p->parent_exec_id_u64 = current->parent_exec_id_u64; + if (clone_flags & CLONE_THREAD) + p->exit_signal = -1; + else + p->exit_signal = current->group_leader->exit_signal; } else { p->real_parent = current; p->parent_exec_id = current->self_exec_id; p->parent_exec_id_u64 = current->self_exec_id_u64; + p->exit_signal = (clone_flags & CSIGNAL); }
klp_copy_process(p);
From: Zeng Tao prime.zeng@hisilicon.com
stable inclusion from linux-4.19.158 commit 68e51bf3761736359110b198c42e6f78056e3719
--------------------------------
[ Upstream commit cb47755725da7b90fecbb2aa82ac3b24a7adb89b ]
UBSAN reports:
Undefined behaviour in ./include/linux/time64.h:127:27
signed integer overflow:
17179869187 * 1000000000 cannot be represented in type 'long long int'
Call Trace:
 timespec64_to_ns include/linux/time64.h:127 [inline]
 set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180
 do_setitimer+0x8e/0x740 kernel/time/itimer.c:245
 __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336
 do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295
Commit bd40a175769d ("y2038: itimer: change implementation to timespec64") replaced the original conversion which handled time clamping correctly with timespec64_to_ns() which has no overflow protection.
Fix it in timespec64_to_ns() as this is not necessarily limited to the usage in itimers.
[ tglx: Added comment and adjusted the fixes tag ]
Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64") Signed-off-by: Zeng Tao prime.zeng@hisilicon.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Arnd Bergmann arnd@arndb.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisili... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/time64.h | 4 ++++ kernel/time/itimer.c | 4 ---- 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/time64.h b/include/linux/time64.h index be946b03c030..b58142aa0b1d 100644 --- a/include/linux/time64.h +++ b/include/linux/time64.h @@ -138,6 +138,10 @@ static inline bool timespec64_valid_settod(const struct timespec64 *ts) */ static inline s64 timespec64_to_ns(const struct timespec64 *ts) { + /* Prevent multiplication overflow */ + if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX) + return KTIME_MAX; + return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec; }
diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c index 9a65713c8309..2e2b335ef101 100644 --- a/kernel/time/itimer.c +++ b/kernel/time/itimer.c @@ -154,10 +154,6 @@ static void set_cpu_itimer(struct task_struct *tsk, unsigned int clock_id, u64 oval, nval, ointerval, ninterval; struct cpu_itimer *it = &tsk->signal->it[clock_id];
- /* - * Use the to_ktime conversion because that clamps the maximum - * value to KTIME_MAX and avoid multiplication overflows. - */ nval = ktime_to_ns(timeval_to_ktime(value->it_value)); ninterval = ktime_to_ns(timeval_to_ktime(value->it_interval));
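A self-contained userspace restatement of the fixed conversion, using the tv_sec value from the UBSAN report above (the _model suffix marks these as illustrative stand-ins for the kernel definitions):

#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000LL
#define KTIME_MAX     INT64_MAX
#define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC)

struct timespec64_model { int64_t tv_sec; long tv_nsec; };

/* Clamp before multiplying so a huge tv_sec cannot overflow the signed
 * 64-bit nanosecond value. */
static int64_t timespec64_to_ns_model(const struct timespec64_model *ts)
{
	if ((uint64_t)ts->tv_sec >= KTIME_SEC_MAX)
		return KTIME_MAX;

	return ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

int main(void)
{
	/* 17179869187 s is far above KTIME_SEC_MAX (about 9.2e9 s), so the
	 * result is clamped to KTIME_MAX instead of wrapping. */
	struct timespec64_model ts = { .tv_sec = 17179869187LL, .tv_nsec = 0 };

	printf("%lld\n", (long long)timespec64_to_ns_model(&ts));
	return 0;
}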
From: zhuoliang zhang zhuoliang.zhang@mediatek.com
stable inclusion from linux-4.19.158 commit 243a16469c3431affdfdcc26674bd266fbfd5f8e
--------------------------------
[ Upstream commit a779d91314ca7208b7feb3ad817b62904397c56d ]
we found that the following race condition exists in xfrm_alloc_userspi flow:
user thread                                       state_hash_work thread
----                                              ----
xfrm_alloc_userspi()
__find_acq_core()
  /*alloc new xfrm_state:x*/
  xfrm_state_alloc()
  /*schedule state_hash_work thread*/
  xfrm_hash_grow_check()                          xfrm_hash_resize()
xfrm_alloc_spi                                      /*hold lock*/
  x->id.spi = htonl(spi)                            spin_lock_bh(&net->xfrm.xfrm_state_lock)
  /*waiting lock release*/                          xfrm_hash_transfer()
  spin_lock_bh(&net->xfrm.xfrm_state_lock)            /*add x into hlist:net->xfrm.state_byspi*/
                                                      hlist_add_head_rcu(&x->byspi)
                                                    spin_unlock_bh(&net->xfrm.xfrm_state_lock)

  /*add x into hlist:net->xfrm.state_byspi 2 times*/
  hlist_add_head_rcu(&x->byspi)
1. A new state x is allocated in xfrm_state_alloc() and added into the bydst hlist in __find_acq_core() on the LHS;
2. on the RHS, the state_hash_work thread traverses the old bydst hlist and transfers every xfrm_state (including x) into the new bydst hlist and the new byspi hlist;
3. the user thread on the LHS then gets the lock and adds x into the new byspi hlist again.

So the same xfrm_state (x) is added into the same hash list (net->xfrm.state_byspi) twice, which turns the list into an infinite loop.

To fix the race, the x->id.spi = htonl(spi) assignment in xfrm_alloc_spi() is moved after the spin_lock_bh(), so that the state_hash_work thread no longer adds an x whose id.spi is still zero into the hash list.
Fixes: f034b5d4efdf ("[XFRM]: Dynamic xfrm_state hash table sizing.") Signed-off-by: zhuoliang zhang zhuoliang.zhang@mediatek.com Acked-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/xfrm/xfrm_state.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index a649d7c2f48c..84dea0ad1666 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -1825,6 +1825,7 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high) int err = -ENOENT; __be32 minspi = htonl(low); __be32 maxspi = htonl(high); + __be32 newspi = 0; u32 mark = x->mark.v & x->mark.m;
spin_lock_bh(&x->lock); @@ -1843,21 +1844,22 @@ int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high) xfrm_state_put(x0); goto unlock; } - x->id.spi = minspi; + newspi = minspi; } else { u32 spi = 0; for (h = 0; h < high-low+1; h++) { spi = low + prandom_u32()%(high-low+1); x0 = xfrm_state_lookup(net, mark, &x->id.daddr, htonl(spi), x->id.proto, x->props.family); if (x0 == NULL) { - x->id.spi = htonl(spi); + newspi = htonl(spi); break; } xfrm_state_put(x0); } } - if (x->id.spi) { + if (newspi) { spin_lock_bh(&net->xfrm.xfrm_state_lock); + x->id.spi = newspi; h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto, x->props.family); hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h); spin_unlock_bh(&net->xfrm.xfrm_state_lock);
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit bbec3ef4913e0b3918c55d4ac23ca23aa880ffbe
--------------------------------
[ Upstream commit 2c334e12f957cd8c6bb66b4aa3f79848b7c33cab ]
Make sure that we actually initialize xefi_discard when we're scheduling a deferred free of an AGFL block. This was (eventually) found by the UBSAN while I was banging on realtime rmap problems, but it exists in the upstream codebase. While we're at it, rearrange the structure to reduce the struct size from 64 to 56 bytes.
Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Brian Foster bfoster@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 1 + fs/xfs/libxfs/xfs_bmap.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 4ec82352d965..d2e100601102 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2213,6 +2213,7 @@ xfs_defer_agfl_block( new->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno); new->xefi_blockcount = 1; new->xefi_oinfo = *oinfo; + new->xefi_skip_discard = false;
trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index dc15440bff5c..4c9df0499c42 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -52,9 +52,9 @@ struct xfs_extent_free_item { xfs_fsblock_t xefi_startblock;/* starting fs block number */ xfs_extlen_t xefi_blockcount;/* number of blocks in extent */ + bool xefi_skip_discard; struct list_head xefi_list; struct xfs_owner_info xefi_oinfo; /* extent owner */ - bool xefi_skip_discard; };
#define XFS_BMAP_MAX_NMAP 4
From: Stefano Brivio sbrivio@redhat.com
stable inclusion from linux-4.19.158 commit a79b767a92242b2801bb2f4433501a1b78bcb4e3
--------------------------------
[ Upstream commit 7d10e62c2ff8e084c136c94d32d9a94de4d31248 ]
In ip_set_match_extensions(), for sets with counters, we take care of updating counters themselves by calling ip_set_update_counter(), and of checking if the given comparison and values match, by calling ip_set_match_counter() if needed.
However, if a given comparison on counters doesn't match the configured values, that doesn't mean the set entry itself isn't matching.
This fix restores the behaviour we had before commit 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching"), without reintroducing the issue fixed there: back then, mtype_data_match() first updated counters in any case, and then took care of matching on counters.
Now, if the IPSET_FLAG_SKIP_COUNTER_UPDATE flag is set, ip_set_update_counter() will anyway skip counter updates if desired.
The issue observed is illustrated by this reproducer:
ipset create c hash:ip counters
ipset add c 192.0.2.1
iptables -I INPUT -m set --match-set c src --bytes-gt 800 -j DROP
If we now send packets from 192.0.2.1, the bytes and packets counters for the entry, as shown by 'ipset list', stay at zero, and no matter how many bytes we send, the rule will never match, because the counters themselves are not updated.
Reported-by: Mithil Mhatre mmhatre@redhat.com Fixes: 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching") Signed-off-by: Stefano Brivio sbrivio@redhat.com Signed-off-by: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/ipset/ip_set_core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index ea42c40213aa..1916cd39190c 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -488,13 +488,14 @@ ip_set_match_extensions(struct ip_set *set, const struct ip_set_ext *ext, if (SET_WITH_COUNTER(set)) { struct ip_set_counter *counter = ext_counter(data, set);
+ ip_set_update_counter(counter, ext, flags); + if (flags & IPSET_FLAG_MATCH_COUNTERS && !(ip_set_match_counter(ip_set_get_packets(counter), mext->packets, mext->packets_op) && ip_set_match_counter(ip_set_get_bytes(counter), mext->bytes, mext->bytes_op))) return false; - ip_set_update_counter(counter, ext, flags); } if (SET_WITH_SKBINFO(set)) ip_set_get_skbinfo(ext_skbinfo(data, set),
From: Jiri Olsa jolsa@kernel.org
stable inclusion from linux-4.19.158 commit 5e6d8c982529d72c307a2597178e6639b02b26c4
--------------------------------
[ Upstream commit fe01adb72356a4e2f8735e4128af85921ca98fa1 ]
We are missing the byte swap for the ino_generation field.
Fixes: 5c5e854bc760 ("perf tools: Add attr->mmap2 support") Signed-off-by: Jiri Olsa jolsa@kernel.org Acked-by: Namhyung Kim namhyung@kernel.org Link: https://lore.kernel.org/r/20201101233103.3537427-2-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/perf/util/session.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c index 29ecff1fc25b..063ecec3d697 100644 --- a/tools/perf/util/session.c +++ b/tools/perf/util/session.c @@ -489,6 +489,7 @@ static void perf_event__mmap2_swap(union perf_event *event, event->mmap2.maj = bswap_32(event->mmap2.maj); event->mmap2.min = bswap_32(event->mmap2.min); event->mmap2.ino = bswap_64(event->mmap2.ino); + event->mmap2.ino_generation = bswap_64(event->mmap2.ino_generation);
if (sample_id_all) { void *data = &event->mmap2.filename;
From: Brian Foster bfoster@redhat.com
stable inclusion from linux-4.19.158 commit 955113c89ca9d49b8eef9f46f248397d1a1e55b1
--------------------------------
[ Upstream commit 869ae85dae64b5540e4362d7fe4cd520e10ec05c ]
It is possible to expose non-zeroed post-EOF data in XFS if the new EOF page is dirty, backed by an unwritten block and the truncate happens to race with writeback. iomap_truncate_page() will not zero the post-EOF portion of the page if the underlying block is unwritten. The subsequent call to truncate_setsize() will, but doesn't dirty the page. Therefore, if writeback happens to complete after iomap_truncate_page() (so it still sees the unwritten block) but before truncate_setsize(), the cached page becomes inconsistent with the on-disk block. A mapped read after the associated page is reclaimed or invalidated exposes non-zero post-EOF data.
For example, consider the following sequence when run on a kernel modified to explicitly flush the new EOF page within the race window:
$ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
$ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
...
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400: 00 00 00 00 00 00 00 00 ........
$ umount /mnt/; mount <dev> /mnt/
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400: cd cd cd cd cd cd cd cd ........
Update xfs_setattr_size() to explicitly flush the new EOF page prior to the page truncate to ensure iomap has the latest state of the underlying block.
Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") Signed-off-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/xfs_iops.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index e427ad097e2e..948ac1290121 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -895,6 +895,16 @@ xfs_setattr_size( error = iomap_zero_range(inode, oldsize, newsize - oldsize, &did_zeroing, &xfs_iomap_ops); } else { + /* + * iomap won't detect a dirty page over an unwritten block (or a + * cow block over a hole) and subsequently skips zeroing the + * newly post-EOF portion of the page. Flush the new EOF to + * convert the block before the pagecache truncate. + */ + error = filemap_write_and_wait_range(inode->i_mapping, newsize, + newsize); + if (error) + return error; error = iomap_truncate_page(inode, newsize, &did_zeroing, &xfs_iomap_ops); }
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit cba36f708a8a23a6e51dd2ddf70ab8449c692c15
--------------------------------
[ Upstream commit c1f6b1ac00756a7108e5fcb849a2f8230c0b62a5 ]
The kernel has always allowed directories to have the rtinherit flag set, even if there is no rt device, so this check is wrong.
Fixes: 80e4e1268802 ("xfs: scrub inodes") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/scrub/inode.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index e386c9b0b4ab..8d45d60832db 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -131,8 +131,7 @@ xchk_inode_flags( goto bad;
/* rt flags require rt device */ - if ((flags & (XFS_DIFLAG_REALTIME | XFS_DIFLAG_RTINHERIT)) && - !mp->m_rtdev_targp) + if ((flags & XFS_DIFLAG_REALTIME) && !mp->m_rtdev_targp) goto bad;
/* new rt bitmap flag only valid for rbmino */
From: Tyler Hicks tyhicks@linux.microsoft.com
stable inclusion from linux-4.19.158 commit 0982aa54d054bf11fe202522e29e4da9b288653b
--------------------------------
[ Upstream commit 8ffd778aff45be760292225049e0141255d4ad6e ]
Mimic the pre-existing ACPI and Device Tree event log behavior by not creating the binary_bios_measurements file when the EFI TPM event log is empty.
This fixes the following NULL pointer dereference that can occur when reading /sys/kernel/security/tpm0/binary_bios_measurements after the kernel received an empty event log from the firmware:
BUG: kernel NULL pointer dereference, address: 000000000000002c #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 3932 Comm: fwupdtpmevlog Not tainted 5.9.0-00003-g629990edad62 #17 Hardware name: LENOVO 20LCS03L00/20LCS03L00, BIOS N27ET38W (1.24 ) 11/28/2019 RIP: 0010:tpm2_bios_measurements_start+0x3a/0x550 Code: 54 53 48 83 ec 68 48 8b 57 70 48 8b 1e 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 82 c0 06 00 00 48 8b 8a c8 06 00 00 <44> 8b 60 1c 48 89 4d a0 4c 89 e2 49 83 c4 20 48 83 fb 00 75 2a 49 RSP: 0018:ffffa9c901203db0 EFLAGS: 00010246 RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000010 RDX: ffff8ba1eb99c000 RSI: ffff8ba1e4ce8280 RDI: ffff8ba1e4ce8258 RBP: ffffa9c901203e40 R08: ffffa9c901203dd8 R09: ffff8ba1ec443300 R10: ffffa9c901203e50 R11: 0000000000000000 R12: ffff8ba1e4ce8280 R13: ffffa9c901203ef0 R14: ffffa9c901203ef0 R15: ffff8ba1e4ce8258 FS: 00007f6595460880(0000) GS:ffff8ba1ef880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000002c CR3: 00000007d8d18003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? __kmalloc_node+0x113/0x320 ? kvmalloc_node+0x31/0x80 seq_read+0x94/0x420 vfs_read+0xa7/0x190 ksys_read+0xa7/0xe0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9
In this situation, the bios_event_log pointer in the tpm_bios_log struct was not NULL but was equal to the ZERO_SIZE_PTR (0x10) value. This was due to the following kmemdup() in tpm_read_log_efi():
int tpm_read_log_efi(struct tpm_chip *chip)
{
	...
	/* malloc EventLog space */
	log->bios_event_log = kmemdup(log_tbl->log, log_size, GFP_KERNEL);
	if (!log->bios_event_log) {
		ret = -ENOMEM;
		goto out;
	}
	...
}
When log_size is zero, due to an empty event log from firmware, ZERO_SIZE_PTR is returned from kmemdup(). Upon a read of the binary_bios_measurements file, the tpm2_bios_measurements_start() function does not perform a ZERO_OR_NULL_PTR() check on the bios_event_log pointer before dereferencing it.
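To illustrate the ZERO_SIZE_PTR pitfall in isolation (a sketch, not code from this driver): the kmalloc-family helpers return the special ZERO_SIZE_PTR value (0x10) rather than NULL for a zero-length allocation, so a plain NULL check passes while any dereference faults, and ZERO_OR_NULL_PTR() is the check that would catch it:

#include <linux/slab.h>

/* Sketch: why a plain NULL check is not enough for a zero-length copy. */
static int copy_log(const void *src, size_t len, void **out)
{
	void *buf = kmemdup(src, len, GFP_KERNEL);

	if (!buf)			/* false for len == 0: buf == ZERO_SIZE_PTR */
		return -ENOMEM;

	if (ZERO_OR_NULL_PTR(buf))	/* catches the empty-log case */
		return -EIO;

	*out = buf;
	return 0;
}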
Rather than add a ZERO_OR_NULL_PTR() check in functions that make use of the bios_event_log pointer, simply avoid creating the binary_bios_measurements_file as is done in other event log retrieval backends.
Explicitly ignore all of the events in the final event log when the main event log is empty. The list of events in the final event log cannot be accurately parsed without referring to the first event in the main event log (the event log header) so the final event log is useless in such a situation.
Fixes: 58cc1e4faf10 ("tpm: parse TPM event logs based on EFI table") Link: https://lore.kernel.org/linux-integrity/E1FDCCCB-CA51-4AEE-AC83-9CDE995EAE52... Reported-by: Kai-Heng Feng kai.heng.feng@canonical.com Reported-by: Kenneth R. Crudup kenny@panix.com Reported-by: Mimi Zohar zohar@linux.ibm.com Cc: Thiébaud Weksteen tweek@google.com Cc: Ard Biesheuvel ardb@kernel.org Signed-off-by: Tyler Hicks tyhicks@linux.microsoft.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/char/tpm/eventlog/efi.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/char/tpm/eventlog/efi.c b/drivers/char/tpm/eventlog/efi.c index 3e673ab22cb4..abd3beeb5158 100644 --- a/drivers/char/tpm/eventlog/efi.c +++ b/drivers/char/tpm/eventlog/efi.c @@ -43,6 +43,11 @@ int tpm_read_log_efi(struct tpm_chip *chip) log_size = log_tbl->size; memunmap(log_tbl);
+ if (!log_size) { + pr_warn("UEFI TPM log area empty\n"); + return -EIO; + } + log_tbl = memremap(efi.tpm_log, sizeof(*log_tbl) + log_size, MEMREMAP_WB); if (!log_tbl) {
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.158 commit 580a117919f3ae1390be6b4111253ee9595938f5
--------------------------------
commit 46d6c5ae953cc0be38efd0e469284df7c4328cf8 upstream.
If netfilter changes the packet mark when mangling, the packet is rerouted using the route_me_harder set of functions. Prior to this commit, there's one big difference between route_me_harder and the ordinary initial routing functions, described in the comment above __ip_queue_xmit():
/* Note: skb->sk can be different from sk, in case of tunnels */ int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
That function goes on to correctly make use of sk->sk_bound_dev_if, rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a tunnel will receive a packet in ndo_start_xmit with an initial skb->sk. It will make some transformations to that packet, and then it will send the encapsulated packet out of a *new* socket. That new socket will basically always have a different sk_bound_dev_if (otherwise there'd be a routing loop). So for the purposes of routing the encapsulated packet, the routing information as it pertains to the socket should come from that socket's sk, rather than the packet's original skb->sk. For that reason __ip_queue_xmit() and related functions all do the right thing.
One might argue that all tunnels should just call skb_orphan(skb) before transmitting the encapsulated packet into the new socket. But tunnels do *not* do this -- and this is wisely avoided in skb_scrub_packet() too -- because features like TSQ rely on skb->destructor() being called when that buffer space is truly available again. Calling skb_orphan(skb) too early would result in buffers filling up unnecessarily and accounting info being all wrong. Instead, additional routing must take into account the new sk, just as __ip_queue_xmit() notes.
So, this commit addresses the problem by fishing the correct sk out of state->sk -- it's already set properly in the call to nf_hook() in __ip_local_out(), which receives the sk as part of its normal functionality. So we make sure to plumb state->sk through the various route_me_harder functions, and then make correct use of it following the example of __ip_queue_xmit().
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org [Jason: backported to 4.19 from Sasha's 5.4 backport] Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netfilter_ipv4.h | 2 +- include/linux/netfilter_ipv6.h | 2 +- net/ipv4/netfilter.c | 12 +++++++----- net/ipv4/netfilter/ipt_SYNPROXY.c | 2 +- net/ipv4/netfilter/iptable_mangle.c | 2 +- net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 2 +- net/ipv4/netfilter/nf_reject_ipv4.c | 2 +- net/ipv4/netfilter/nft_chain_route_ipv4.c | 2 +- net/ipv6/netfilter.c | 6 +++--- net/ipv6/netfilter/ip6table_mangle.c | 2 +- net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 2 +- net/ipv6/netfilter/nft_chain_route_ipv6.c | 2 +- net/netfilter/ipvs/ip_vs_core.c | 4 ++-- 13 files changed, 22 insertions(+), 20 deletions(-)
diff --git a/include/linux/netfilter_ipv4.h b/include/linux/netfilter_ipv4.h index 95ab5cc64422..45ff1330b339 100644 --- a/include/linux/netfilter_ipv4.h +++ b/include/linux/netfilter_ipv4.h @@ -16,7 +16,7 @@ struct ip_rt_info { u_int32_t mark; };
-int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned addr_type); +int ip_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb, unsigned addr_type);
struct nf_queue_entry;
diff --git a/include/linux/netfilter_ipv6.h b/include/linux/netfilter_ipv6.h index c0dc4dd78887..47a2de582f57 100644 --- a/include/linux/netfilter_ipv6.h +++ b/include/linux/netfilter_ipv6.h @@ -36,7 +36,7 @@ struct nf_ipv6_ops { };
#ifdef CONFIG_NETFILTER -int ip6_route_me_harder(struct net *net, struct sk_buff *skb); +int ip6_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb); __sum16 nf_ip6_checksum(struct sk_buff *skb, unsigned int hook, unsigned int dataoff, u_int8_t protocol);
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c index 8d2e5dc9a827..3d670d5aea34 100644 --- a/net/ipv4/netfilter.c +++ b/net/ipv4/netfilter.c @@ -17,17 +17,19 @@ #include <net/netfilter/nf_queue.h>
/* route_me_harder function, used by iptable_nat, iptable_mangle + ip_queue */ -int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_type) +int ip_route_me_harder(struct net *net, struct sock *sk, struct sk_buff *skb, unsigned int addr_type) { const struct iphdr *iph = ip_hdr(skb); struct rtable *rt; struct flowi4 fl4 = {}; __be32 saddr = iph->saddr; - const struct sock *sk = skb_to_full_sk(skb); - __u8 flags = sk ? inet_sk_flowi_flags(sk) : 0; + __u8 flags; struct net_device *dev = skb_dst(skb)->dev; unsigned int hh_len;
+ sk = sk_to_full_sk(sk); + flags = sk ? inet_sk_flowi_flags(sk) : 0; + if (addr_type == RTN_UNSPEC) addr_type = inet_addr_type_dev_table(net, dev, saddr); if (addr_type == RTN_LOCAL || addr_type == RTN_UNICAST) @@ -91,8 +93,8 @@ int nf_ip_reroute(struct sk_buff *skb, const struct nf_queue_entry *entry) skb->mark == rt_info->mark && iph->daddr == rt_info->daddr && iph->saddr == rt_info->saddr)) - return ip_route_me_harder(entry->state.net, skb, - RTN_UNSPEC); + return ip_route_me_harder(entry->state.net, entry->state.sk, + skb, RTN_UNSPEC); } return 0; } diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c index 690b17ef6a44..d64b1ef43c10 100644 --- a/net/ipv4/netfilter/ipt_SYNPROXY.c +++ b/net/ipv4/netfilter/ipt_SYNPROXY.c @@ -54,7 +54,7 @@ synproxy_send_tcp(struct net *net,
skb_dst_set_noref(nskb, skb_dst(skb)); nskb->protocol = htons(ETH_P_IP); - if (ip_route_me_harder(net, nskb, RTN_UNSPEC)) + if (ip_route_me_harder(net, nskb->sk, nskb, RTN_UNSPEC)) goto free_nskb;
if (nfct) { diff --git a/net/ipv4/netfilter/iptable_mangle.c b/net/ipv4/netfilter/iptable_mangle.c index dea138ca8925..0829f46ddfdd 100644 --- a/net/ipv4/netfilter/iptable_mangle.c +++ b/net/ipv4/netfilter/iptable_mangle.c @@ -65,7 +65,7 @@ ipt_mangle_out(struct sk_buff *skb, const struct nf_hook_state *state) iph->daddr != daddr || skb->mark != mark || iph->tos != tos) { - err = ip_route_me_harder(state->net, skb, RTN_UNSPEC); + err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC); if (err < 0) ret = NF_DROP_ERR(err); } diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c index 52e3f9c7bc69..290a04820701 100644 --- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c +++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c @@ -331,7 +331,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
if (ct->tuplehash[dir].tuple.dst.u3.ip != ct->tuplehash[!dir].tuple.src.u3.ip) { - err = ip_route_me_harder(state->net, skb, RTN_UNSPEC); + err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC); if (err < 0) ret = NF_DROP_ERR(err); } diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c index 5cd06ba3535d..4996db1f64a1 100644 --- a/net/ipv4/netfilter/nf_reject_ipv4.c +++ b/net/ipv4/netfilter/nf_reject_ipv4.c @@ -129,7 +129,7 @@ void nf_send_reset(struct net *net, struct sk_buff *oldskb, int hook) ip4_dst_hoplimit(skb_dst(nskb))); nf_reject_ip_tcphdr_put(nskb, oldskb, oth);
- if (ip_route_me_harder(net, nskb, RTN_UNSPEC)) + if (ip_route_me_harder(net, nskb->sk, nskb, RTN_UNSPEC)) goto free_nskb;
niph = ip_hdr(nskb); diff --git a/net/ipv4/netfilter/nft_chain_route_ipv4.c b/net/ipv4/netfilter/nft_chain_route_ipv4.c index 7d82934c46f4..61003768e52b 100644 --- a/net/ipv4/netfilter/nft_chain_route_ipv4.c +++ b/net/ipv4/netfilter/nft_chain_route_ipv4.c @@ -50,7 +50,7 @@ static unsigned int nf_route_table_hook(void *priv, iph->daddr != daddr || skb->mark != mark || iph->tos != tos) { - err = ip_route_me_harder(state->net, skb, RTN_UNSPEC); + err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC); if (err < 0) ret = NF_DROP_ERR(err); } diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c index 6d0b1f3e927b..5679fa3f696a 100644 --- a/net/ipv6/netfilter.c +++ b/net/ipv6/netfilter.c @@ -17,10 +17,10 @@ #include <net/xfrm.h> #include <net/netfilter/nf_queue.h>
-int ip6_route_me_harder(struct net *net, struct sk_buff *skb) +int ip6_route_me_harder(struct net *net, struct sock *sk_partial, struct sk_buff *skb) { const struct ipv6hdr *iph = ipv6_hdr(skb); - struct sock *sk = sk_to_full_sk(skb->sk); + struct sock *sk = sk_to_full_sk(sk_partial); unsigned int hh_len; struct dst_entry *dst; int strict = (ipv6_addr_type(&iph->daddr) & @@ -81,7 +81,7 @@ static int nf_ip6_reroute(struct sk_buff *skb, if (!ipv6_addr_equal(&iph->daddr, &rt_info->daddr) || !ipv6_addr_equal(&iph->saddr, &rt_info->saddr) || skb->mark != rt_info->mark) - return ip6_route_me_harder(entry->state.net, skb); + return ip6_route_me_harder(entry->state.net, entry->state.sk, skb); } return 0; } diff --git a/net/ipv6/netfilter/ip6table_mangle.c b/net/ipv6/netfilter/ip6table_mangle.c index b0524b18c4fb..acba3757ff60 100644 --- a/net/ipv6/netfilter/ip6table_mangle.c +++ b/net/ipv6/netfilter/ip6table_mangle.c @@ -60,7 +60,7 @@ ip6t_mangle_out(struct sk_buff *skb, const struct nf_hook_state *state) skb->mark != mark || ipv6_hdr(skb)->hop_limit != hop_limit || flowlabel != *((u_int32_t *)ipv6_hdr(skb)))) { - err = ip6_route_me_harder(state->net, skb); + err = ip6_route_me_harder(state->net, state->sk, skb); if (err < 0) ret = NF_DROP_ERR(err); } diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c index bf25263b2dcd..2160936e5b91 100644 --- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c +++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c @@ -354,7 +354,7 @@ nf_nat_ipv6_local_fn(void *priv, struct sk_buff *skb,
if (!nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.dst.u3, &ct->tuplehash[!dir].tuple.src.u3)) { - err = ip6_route_me_harder(state->net, skb); + err = ip6_route_me_harder(state->net, state->sk, skb); if (err < 0) ret = NF_DROP_ERR(err); } diff --git a/net/ipv6/netfilter/nft_chain_route_ipv6.c b/net/ipv6/netfilter/nft_chain_route_ipv6.c index da3f1f8cb325..afe79cb46e63 100644 --- a/net/ipv6/netfilter/nft_chain_route_ipv6.c +++ b/net/ipv6/netfilter/nft_chain_route_ipv6.c @@ -52,7 +52,7 @@ static unsigned int nf_route_table_hook(void *priv, skb->mark != mark || ipv6_hdr(skb)->hop_limit != hop_limit || flowlabel != *((u_int32_t *)ipv6_hdr(skb)))) { - err = ip6_route_me_harder(state->net, skb); + err = ip6_route_me_harder(state->net, state->sk, skb); if (err < 0) ret = NF_DROP_ERR(err); } diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c index d5e4329579e2..acaeeaf81441 100644 --- a/net/netfilter/ipvs/ip_vs_core.c +++ b/net/netfilter/ipvs/ip_vs_core.c @@ -725,12 +725,12 @@ static int ip_vs_route_me_harder(struct netns_ipvs *ipvs, int af, struct dst_entry *dst = skb_dst(skb);
if (dst->dev && !(dst->dev->flags & IFF_LOOPBACK) && - ip6_route_me_harder(ipvs->net, skb) != 0) + ip6_route_me_harder(ipvs->net, skb->sk, skb) != 0) return 1; } else #endif if (!(skb_rtable(skb)->rt_flags & RTCF_LOCAL) && - ip_route_me_harder(ipvs->net, skb, RTN_LOCAL) != 0) + ip_route_me_harder(ipvs->net, skb->sk, skb, RTN_LOCAL) != 0) return 1;
return 0;
From: Chunyan Zhang zhang.lyra@gmail.com
stable inclusion from linux-4.19.158 commit 880d94c7811ebe89ab7edc7e8839ccdf0eedcd91
--------------------------------
commit 5167c506d62dd9ffab73eba23c79b0a8845c9fe1 upstream.
Suspend to IDLE invokes tick_unfreeze() on resume. tick_unfreeze() on the first resuming CPU resumes timekeeping, which also has the side effect of resetting the softlockup watchdog on this CPU.
But on the secondary CPUs the watchdog is not reset in the resume / unfreeze() path, which can result in false softlockup warnings on those CPUs depending on the time spent in suspend.
Prevent this by clearing the softlockup watchdog in the unfreeze path on the secondary resuming CPUs as well.
[ tglx: Massaged changelog ]
Signed-off-by: Chunyan Zhang chunyan.zhang@unisoc.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/20200110083902.27276-1-chunyan.zhang@unisoc.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/time/tick-common.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index a02e0f6b287c..0a3cc37e4b83 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -15,6 +15,7 @@ #include <linux/err.h> #include <linux/hrtimer.h> #include <linux/interrupt.h> +#include <linux/nmi.h> #include <linux/percpu.h> #include <linux/profile.h> #include <linux/sched.h> @@ -520,6 +521,7 @@ void tick_unfreeze(void) trace_suspend_resume(TPS("timekeeping_freeze"), smp_processor_id(), false); } else { + touch_softlockup_watchdog(); tick_resume_local(); }
From: Christoph Hellwig hch@lst.de
stable inclusion from linux-4.19.158 commit d9f4534a9a286877ae29e15b5f9ba8ecb7370924
--------------------------------
[ Upstream commit 2bd645b2d3f0bacadaa6037f067538e1cd4e42ef ]
bdget_disk needs to be paired with bdput to not leak a reference on the block device inode.
Fixes: 08ba91ee6e2c ("nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.") Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/block/nbd.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 4b624cd98f90..d0c4b5bcf874 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -1465,6 +1465,7 @@ static void nbd_release(struct gendisk *disk, fmode_t mode) if (test_bit(NBD_RT_DISCONNECT_ON_CLOSE, &nbd->config->runtime_flags) && bdev->bd_openers == 0) nbd_disconnect_and_put(nbd); + bdput(bdev);
nbd_config_put(nbd); nbd_put(nbd);
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit 83e65294aff8635877dfd0a019dbb9cd1abcab8a
--------------------------------
[ Upstream commit ea8439899c0b15a176664df62aff928010fad276 ]
Pass the same oldext argument (which contains the existing rmapping's unwritten state) to xfs_rmap_lookup_le_range at the start of xfs_rmap_convert_shared. At this point in the code, flags is zero, which means that we perform lookups using the wrong key.
Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/libxfs/xfs_rmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index 245af452840e..ab3e72e702f0 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -1387,7 +1387,7 @@ xfs_rmap_convert_shared( * record for our insertion point. This will also give us the record for * start block contiguity tests. */ - error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags, + error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, oldext, &PREV, &i); if (error) goto done;
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit 5fda0976e88dd4addc401af6b2ab53b2842e47a6
--------------------------------
[ Upstream commit 5dda3897fd90783358c4c6115ef86047d8c8f503 ]
When the bmbt scrubber is looking up rmap extents, we need to set the extent flags from the bmbt record fully. This will matter once we fix the rmap btree comparison functions to check those flags correctly.
Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/scrub/bmap.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index e1d11f3223e3..d1fc8a3fb891 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -102,6 +102,8 @@ xchk_bmap_get_rmap(
if (info->whichfork == XFS_ATTR_FORK) rflags |= XFS_RMAP_ATTR_FORK; + if (irec->br_state == XFS_EXT_UNWRITTEN) + rflags |= XFS_RMAP_UNWRITTEN;
/* * CoW staging extents are owned (on disk) by the refcountbt, so
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit b7fec5a9a726e9e2a095d53b6de4e62bf179b481
--------------------------------
[ Upstream commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f ]
Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows:
(physical block, owner, fork, is_btree, is_unwritten, offset)
This provides users the ability to look up a reverse mapping from a bmbt record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset.
However, the key comparison functions incorrectly remove the fork/btree/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with:
(physical block, owner, offset)
This means that queries can return incorrect results. On consistent filesystems this hasn't been an issue because blocks are never shared between forks or with bmbt blocks; and are never unwritten. However, this bug means that online repair cannot always detect corruption in the key information in internal rmapbt nodes.
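For context, those extra key fields are packed into the high bits of the 64-bit on-disk rm_offset value; stripping them with XFS_RMAP_OFF(), as the buggy comparisons did, discards exactly the fork/bmbt/unwritten information. A simplified sketch of the packing follows (the real macros and the xfs_rmap_irec_offset_pack() helper live in the XFS headers; the bit positions below are quoted from memory and should be treated as illustrative):

/* Illustrative only -- flag bits carried in the key offset. */
#define RMAP_OFF_ATTR_FORK	((__u64)1 << 63)	/* attr fork extent */
#define RMAP_OFF_BMBT_BLOCK	((__u64)1 << 62)	/* bmbt block, not data */
#define RMAP_OFF_UNWRITTEN	((__u64)1 << 61)	/* unwritten extent */

static inline __u64 rmap_offset_pack(__u64 off, bool attr_fork,
				     bool bmbt_block, bool unwritten)
{
	__u64 x = off;		/* file offset in the low bits */

	if (attr_fork)
		x |= RMAP_OFF_ATTR_FORK;
	if (bmbt_block)
		x |= RMAP_OFF_BMBT_BLOCK;
	if (unwritten)
		x |= RMAP_OFF_UNWRITTEN;
	return x;		/* compare this whole value, not just the offset */
}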
Found by fuzzing keys[1].attrfork = ones on xfs/371.
Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/libxfs/xfs_rmap_btree.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index f79cf040d745..77528f413286 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -247,8 +247,8 @@ xfs_rmapbt_key_diff( else if (y > x) return -1;
- x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset)); - y = rec->rm_offset; + x = be64_to_cpu(kp->rm_offset); + y = xfs_rmap_irec_offset_pack(rec); if (x > y) return 1; else if (y > x) @@ -279,8 +279,8 @@ xfs_rmapbt_diff_two_keys( else if (y > x) return -1;
- x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset)); - y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset)); + x = be64_to_cpu(kp1->rm_offset); + y = be64_to_cpu(kp2->rm_offset); if (x > y) return 1; else if (y > x) @@ -393,8 +393,8 @@ xfs_rmapbt_keys_inorder( return 1; else if (a > b) return 0; - a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)); - b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)); + a = be64_to_cpu(k1->rmap.rm_offset); + b = be64_to_cpu(k2->rmap.rm_offset); if (a <= b) return 1; return 0; @@ -423,8 +423,8 @@ xfs_rmapbt_recs_inorder( return 1; else if (a > b) return 0; - a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)); - b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)); + a = be64_to_cpu(r1->rmap.rm_offset); + b = be64_to_cpu(r2->rmap.rm_offset); if (a <= b) return 1; return 0;
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.158 commit 964e25377fab8e9071c07d8cdf1c9a4fb079285e
--------------------------------
[ Upstream commit 54e9b09e153842ab5adb8a460b891e11b39e9c3d ]
Fix some serious WTF in the reference count scrubber's rmap fragment processing. The code comment says that this loop is supposed to move all fragment records starting at or before bno onto the worklist, but there's no obvious reason why nr (the number of items added) should increment starting from 1, and breaking the loop when we've added the target number seems dubious since we could have more rmap fragments that should have been added to the worklist.
This seems to manifest in xfs/411 when adding one to the refcount field.
Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/scrub/refcount.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index e8c82b026083..76e4f16a9fab 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -180,7 +180,6 @@ xchk_refcountbt_process_rmap_fragments( */ INIT_LIST_HEAD(&worklist); rbno = NULLAGBLOCK; - nr = 1;
/* Make sure the fragments actually /are/ in agbno order. */ bno = 0; @@ -194,15 +193,14 @@ xchk_refcountbt_process_rmap_fragments( * Find all the rmaps that start at or before the refc extent, * and put them on the worklist. */ + nr = 0; list_for_each_entry_safe(frag, n, &refchk->fragments, list) { - if (frag->rm.rm_startblock > refchk->bno) - goto done; + if (frag->rm.rm_startblock > refchk->bno || nr > target_nr) + break; bno = frag->rm.rm_startblock + frag->rm.rm_blockcount; if (bno < rbno) rbno = bno; list_move_tail(&frag->list, &worklist); - if (nr == target_nr) - break; nr++; }
From: Christoph Hellwig hch@lst.de
stable inclusion from linux-4.19.158 commit 05e0bb210209126f495401051996389831f7ba48
--------------------------------
[ Upstream commit 2bd3fa793aaa7e98b74e3653fdcc72fa753913b5 ]
We also need to drop the iolock when invalidate_inode_pages2 fails, not only on all other error or successful cases.
Fixes: 527851124d10 ("xfs: implement pNFS export operations") Signed-off-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/xfs_pnfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c index f44c3599527d..1c9bced3e860 100644 --- a/fs/xfs/xfs_pnfs.c +++ b/fs/xfs/xfs_pnfs.c @@ -141,7 +141,7 @@ xfs_fs_map_blocks( goto out_unlock; error = invalidate_inode_pages2(inode->i_mapping); if (WARN_ON_ONCE(error)) - return error; + goto out_unlock;
end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + length); offset_fsb = XFS_B_TO_FSBT(mp, offset);
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.158 commit 0f4eb125c56b524776a2f396bb7a7c119bb2e0a1
--------------------------------
[ Upstream commit ce0f17fc93f63ee91428af10b7b2ddef38cd19e5 ]
One should use in_serving_softirq() to detect SoftIRQ context.
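The distinction matters because in_softirq() is also true while bottom halves are merely disabled, for example inside a spin_lock_bh() section in process context, whereas in_serving_softirq() is only true while a softirq handler is actually executing. A minimal sketch of the difference (generic kernel API, not the perf code itself):

#include <linux/interrupt.h>
#include <linux/spinlock.h>

/* Sketch: why in_softirq() over-matches when picking a recursion context. */
static void context_demo(spinlock_t *lock)
{
	spin_lock_bh(lock);
	/*
	 * Still plain process context here, yet:
	 *   in_softirq()         -> true  (softirq count is non-zero because
	 *                                  bottom halves are disabled)
	 *   in_serving_softirq() -> false (no softirq handler is running)
	 * Using in_softirq() would therefore wrongly select the softirq
	 * recursion slot for events happening under a BH-disabled lock.
	 */
	spin_unlock_bh(lock);
}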
Fixes: 96f6d4444302 ("perf_counter: avoid recursion") Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20201030151955.120572175@infradead.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/events/internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 6dc725a7e7bc..8fc0ddc38cb6 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -209,7 +209,7 @@ static inline int get_recursion_context(int *recursion) rctx = 3; else if (in_irq()) rctx = 2; - else if (in_softirq()) + else if (in_serving_softirq()) rctx = 1; else rctx = 0;
From: Kaixu Xia kaixuxia@tencent.com
stable inclusion from linux-4.19.158 commit f0bb30a1fc8fe0d9a3068c7cf1644302c8196cfc
--------------------------------
commit 174fe5ba2d1ea0d6c5ab2a7d4aa058d6d497ae4d upstream.
The macro MOPT_Q is used to indicate that a mount option is related to quota, and is defined as MOPT_NOSUPPORT when CONFIG_QUOTA is disabled. Normally the quota options are handled explicitly, so it didn't matter that the MOPT_STRING flag was missing, even though the usrjquota and grpjquota mount options take a string argument. It is important that the flag is present in the !CONFIG_QUOTA case: without MOPT_STRING, the mount option matcher will only match usrjquota= followed by an integer and will otherwise skip the table entry, so the "mount option not supported" error message is never reported.
[ Fixed up the commit description to better explain why the fix works. --TYT ]
Fixes: 26092bf52478 ("ext4: use a table-driven handler for mount options") Signed-off-by: Kaixu Xia kaixuxia@tencent.com Link: https://lore.kernel.org/r/1603986396-28917-1-git-send-email-kaixuxia@tencent... Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/super.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 9b1ab74df916..1a0d494e57ae 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1791,8 +1791,8 @@ static const struct mount_opts { {Opt_noquota, (EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA | EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA), MOPT_CLEAR | MOPT_Q}, - {Opt_usrjquota, 0, MOPT_Q}, - {Opt_grpjquota, 0, MOPT_Q}, + {Opt_usrjquota, 0, MOPT_Q | MOPT_STRING}, + {Opt_grpjquota, 0, MOPT_Q | MOPT_STRING}, {Opt_offusrjquota, 0, MOPT_Q}, {Opt_offgrpjquota, 0, MOPT_Q}, {Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
From: Joseph Qi joseph.qi@linux.alibaba.com
stable inclusion from linux-4.19.158 commit 9d9ad220bb414da6e8b6ae0e35f9e4db976a97ef
--------------------------------
commit 7067b2619017d51e71686ca9756b454de0e5826a upstream.
ext4_inline_data_truncate() takes xattr_sem to check for inline data again, but does not unlock it on the path where the inode no longer has inline data. So unlock it before returning.
Fixes: aef1c8513c1f ("ext4: let ext4_truncate handle inline data correctly") Reported-by: Dan Carpenter dan.carpenter@oracle.com Cc: Tao Ma boyu.mt@taobao.com Signed-off-by: Joseph Qi joseph.qi@linux.alibaba.com Reviewed-by: Andreas Dilger adilger@dilger.ca Link: https://lore.kernel.org/r/1604370542-124630-1-git-send-email-joseph.qi@linux... Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/inline.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 56f6e1782d5f..32b266c478f9 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -1921,6 +1921,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
ext4_write_lock_xattr(inode, &no_expand); if (!ext4_has_inline_data(inode)) { + ext4_write_unlock_xattr(inode, &no_expand); *has_inline = 0; ext4_journal_stop(handle); return 0;
From: Shin'ichiro Kawasaki shinichiro.kawasaki@wdc.com
stable inclusion from linux-4.19.158 commit ebfe3b94c746b9c9b1b7fa2cfc842380a85d7c40
--------------------------------
commit 092561f06702dd4fdd7fb74dd3a838f1818529b7 upstream.
Commit 8fd0e2a6df26 ("uio: free uio id after uio file node is freed") triggered KASAN use-after-free failure at deletion of TCM-user backstores [1].
In uio_unregister_device(), struct uio_device *idev is passed to uio_free_minor(), which refers to idev->minor. However, by the time uio_free_minor() is called, idev has already been freed by uio_device_release() during the call to device_unregister().
To avoid referencing idev->minor after idev is freed, keep the idev->minor value in a local variable, and change the uio_free_minor() argument to take that value.
[1] BUG: KASAN: use-after-free in uio_unregister_device+0x166/0x190 Read of size 4 at addr ffff888105196508 by task targetcli/49158
CPU: 3 PID: 49158 Comm: targetcli Not tainted 5.10.0-rc1 #1 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015 Call Trace: dump_stack+0xae/0xe5 ? uio_unregister_device+0x166/0x190 print_address_description.constprop.0+0x1c/0x210 ? uio_unregister_device+0x166/0x190 ? uio_unregister_device+0x166/0x190 kasan_report.cold+0x37/0x7c ? kobject_put+0x80/0x410 ? uio_unregister_device+0x166/0x190 uio_unregister_device+0x166/0x190 tcmu_destroy_device+0x1c4/0x280 [target_core_user] ? tcmu_release+0x90/0x90 [target_core_user] ? __mutex_unlock_slowpath+0xd6/0x5d0 target_free_device+0xf3/0x2e0 [target_core_mod] config_item_cleanup+0xea/0x210 configfs_rmdir+0x651/0x860 ? detach_groups.isra.0+0x380/0x380 vfs_rmdir.part.0+0xec/0x3a0 ? __lookup_hash+0x20/0x150 do_rmdir+0x252/0x320 ? do_file_open_root+0x420/0x420 ? strncpy_from_user+0xbc/0x2f0 ? getname_flags.part.0+0x8e/0x450 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f9e2bfc91fb Code: 73 01 c3 48 8b 0d 9d ec 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 54 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d ec 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffdd2baafe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054 RAX: ffffffffffffffda RBX: 00007f9e2beb44a0 RCX: 00007f9e2bfc91fb RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f9e1c20be90 RBP: 00007ffdd2bab000 R08: 0000000000000000 R09: 00007f9e2bdf2440 R10: 00007ffdd2baaf37 R11: 0000000000000246 R12: 00000000ffffff9c R13: 000055f9abb7e390 R14: 000055f9abcf9558 R15: 00007f9e2be7a780
Allocated by task 34735: kasan_save_stack+0x1b/0x40 __kasan_kmalloc.constprop.0+0xc2/0xd0 __uio_register_device+0xeb/0xd40 tcmu_configure_device+0x5a0/0xbc0 [target_core_user] target_configure_device+0x12f/0x760 [target_core_mod] target_dev_enable_store+0x32/0x50 [target_core_mod] configfs_write_file+0x2bb/0x450 vfs_write+0x1ce/0x610 ksys_write+0xe9/0x1b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 49158: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x1b/0x30 __kasan_slab_free+0x110/0x150 slab_free_freelist_hook+0x5a/0x170 kfree+0xc6/0x560 device_release+0x9b/0x210 kobject_put+0x13e/0x410 uio_unregister_device+0xf9/0x190 tcmu_destroy_device+0x1c4/0x280 [target_core_user] target_free_device+0xf3/0x2e0 [target_core_mod] config_item_cleanup+0xea/0x210 configfs_rmdir+0x651/0x860 vfs_rmdir.part.0+0xec/0x3a0 do_rmdir+0x252/0x320 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff888105196000 which belongs to the cache kmalloc-2k of size 2048 The buggy address is located 1288 bytes inside of 2048-byte region [ffff888105196000, ffff888105196800) The buggy address belongs to the page: page:0000000098e6ca81 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x105190 head:0000000098e6ca81 order:3 compound_mapcount:0 compound_pincount:0 flags: 0x17ffffc0010200(slab|head) raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff888100043040 raw: 0000000000000000 0000000000080008 00000001ffffffff ffff88810eb55c01 page dumped because: kasan: bad access detected page->mem_cgroup:ffff88810eb55c01
Memory state around the buggy address: ffff888105196400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888105196480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888105196500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff888105196580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888105196600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 8fd0e2a6df26 ("uio: free uio id after uio file node is freed") Cc: stable stable@vger.kernel.org Signed-off-by: Shin'ichiro Kawasaki shinichiro.kawasaki@wdc.com Link: https://lore.kernel.org/r/20201102122819.2346270-1-shinichiro.kawasaki@wdc.c... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/uio/uio.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c index e743246aaeac..1876e4c4b81a 100644 --- a/drivers/uio/uio.c +++ b/drivers/uio/uio.c @@ -413,10 +413,10 @@ static int uio_get_minor(struct uio_device *idev) return retval; }
-static void uio_free_minor(struct uio_device *idev) +static void uio_free_minor(unsigned long minor) { mutex_lock(&minor_lock); - idr_remove(&uio_idr, idev->minor); + idr_remove(&uio_idr, minor); mutex_unlock(&minor_lock); }
@@ -988,7 +988,7 @@ int __uio_register_device(struct module *owner, err_uio_dev_add_attributes: device_del(&idev->dev); err_device_create: - uio_free_minor(idev); + uio_free_minor(idev->minor); put_device(&idev->dev); return ret; } @@ -1002,11 +1002,13 @@ EXPORT_SYMBOL_GPL(__uio_register_device); void uio_unregister_device(struct uio_info *info) { struct uio_device *idev; + unsigned long minor;
if (!info || !info->uio_dev) return;
idev = info->uio_dev; + minor = idev->minor;
mutex_lock(&idev->info_lock); uio_dev_del_attributes(idev); @@ -1022,7 +1024,7 @@ void uio_unregister_device(struct uio_info *info)
device_unregister(&idev->dev);
- uio_free_minor(idev); + uio_free_minor(minor);
return; }
From: Dan Carpenter dan.carpenter@oracle.com
stable inclusion from linux-4.19.158 commit 3f7277405fb3fe2853ce5d017ac7935ffca4ccfd
--------------------------------
commit 1e106aa3509b86738769775969822ffc1ec21bf4 upstream.
The exit_pi_state_list() function calls put_pi_state() with IRQs disabled and is not expecting that IRQs will be enabled inside the function.
Use the _irqsave() variant so that IRQs are restored to the original state instead of being enabled unconditionally.
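As a minimal sketch of the difference between the two locking flavours (generic kernel API, not the futex code itself): the plain _irq variants unconditionally re-enable interrupts on unlock, while the _irqsave/_irqrestore pair puts the interrupt state back exactly as the caller had it:

#include <linux/spinlock.h>

static void update_wrong(raw_spinlock_t *lock)
{
	raw_spin_lock_irq(lock);
	/* ... critical section ... */
	raw_spin_unlock_irq(lock);	/* always re-enables IRQs, even if the
					 * caller entered with them disabled */
}

static void update_right(raw_spinlock_t *lock)
{
	unsigned long flags;

	raw_spin_lock_irqsave(lock, flags);
	/* ... critical section ... */
	raw_spin_unlock_irqrestore(lock, flags);	/* restores the caller's
							 * original IRQ state */
}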
Fixes: 153fbd1226fb ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201106085205.GA1159983@mwanda Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 1cfa2330f16a..a84b993f09cb 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -883,11 +883,12 @@ static void put_pi_state(struct futex_pi_state *pi_state) * and has cleaned up the pi_state already */ if (pi_state->owner) { + unsigned long flags;
- raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + raw_spin_lock_irqsave(&pi_state->pi_mutex.wait_lock, flags); pi_state_update_owner(pi_state, NULL); rt_mutex_proxy_unlock(&pi_state->pi_mutex); - raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); + raw_spin_unlock_irqrestore(&pi_state->pi_mutex.wait_lock, flags); }
if (current->pi_state_cache) {
From: Wengang Wang wen.gang.wang@oracle.com
stable inclusion from linux-4.19.158 commit bc2d96b3061ad35150c4db14661e91916766f241
--------------------------------
commit f5785283dd64867a711ca1fb1f5bb172f252ecdf upstream.
Though the problem was found on an older 4.1.12 kernel, I think upstream has the same issue.
In one node in the cluster, there is the following callback trace:
# cat /proc/21473/stack
__ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
ocfs2_evict_inode+0x152/0x820 [ocfs2]
evict+0xae/0x1a0
iput+0x1c6/0x230
ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
ocfs2_dir_foreach+0x29/0x30 [ocfs2]
ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
process_one_work+0x169/0x4a0
worker_thread+0x5b/0x560
kthread+0xcb/0xf0
ret_from_fork+0x61/0x90
The above stack is not reasonable; the final iput() shouldn't happen in the ocfs2_orphan_filldir() function. Looking at the code,
2067	/* Skip inodes which are already added to recover list, since dio may
2068	 * happen concurrently with unlink/rename */
2069	if (OCFS2_I(iter)->ip_next_orphan) {
2070		iput(iter);
2071		return 0;
2072	}
2073
The logic thinks the inode is already in the recovery list when it sees a non-NULL ip_next_orphan, so it skips this inode after dropping the reference that was taken in ocfs2_iget().
However, if the inode really were in the recovery list, it would hold another reference, and the iput() at line 2070 would not be the final iput() (the one dropping the last reference). So I don't think the inode is actually in the recovery list (no vmcore to confirm).
Note that ocfs2_queue_orphans(), though it does not show up in the call trace, holds the cluster lock on the orphan directory while looking up unlinked inodes. The on-disk inode eviction can involve a lot of I/O and may take a long time to finish. That means this node can hold the cluster lock for a very long time, which can leave lock requests from other nodes on the orphan directory hanging for a long time.
Looking further at ip_next_orphan, I found it is not initialized when a new ocfs2_inode_info structure is allocated.
This causes reflink operations from some nodes to hang for a very long time waiting for the cluster lock on the orphan directory.
Fix: initialize ip_next_orphan as NULL.
Signed-off-by: Wengang Wang wen.gang.wang@oracle.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Joseph Qi joseph.qi@linux.alibaba.com Cc: Mark Fasheh mark@fasheh.com Cc: Joel Becker jlbec@evilplan.org Cc: Junxiao Bi junxiao.bi@oracle.com Cc: Changwei Ge gechangwei@live.cn Cc: Gang He ghe@suse.com Cc: Jun Piao piaojun@huawei.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ocfs2/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 3415e0b09398..74da9158e4b6 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1747,6 +1747,7 @@ static void ocfs2_inode_init_once(void *data)
oi->ip_blkno = 0ULL; oi->ip_clusters = 0; + oi->ip_next_orphan = NULL;
ocfs2_resv_init_once(&oi->ip_la_data_resv);
From: Chen Zhou chenzhou10@huawei.com
stable inclusion from linux-4.19.158 commit 46977ef987a743133b2a66208464c898d7a03e3c
--------------------------------
commit c350f8bea271782e2733419bd2ab9bf4ec2051ef upstream.
Fix to return a negative error code from the error handling case instead of 0 in function sel_ib_pkey_sid_slow(), as done elsewhere in this function.
Cc: stable@vger.kernel.org Fixes: 409dcf31538a ("selinux: Add a cache for quicker retreival of PKey SIDs") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- security/selinux/ibpkey.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/security/selinux/ibpkey.c b/security/selinux/ibpkey.c index 0a4b89d48297..cb05ae28ce00 100644 --- a/security/selinux/ibpkey.c +++ b/security/selinux/ibpkey.c @@ -161,8 +161,10 @@ static int sel_ib_pkey_sid_slow(u64 subnet_prefix, u16 pkey_num, u32 *sid) * is valid, it just won't be added to the cache. */ new = kzalloc(sizeof(*new), GFP_ATOMIC); - if (!new) + if (!new) { + ret = -ENOMEM; goto out; + }
new->psec.subnet_prefix = subnet_prefix; new->psec.pkey = pkey_num;
From: Al Viro viro@zeniv.linux.org.uk
stable inclusion from linux-4.19.158 commit 9bb7c3825457a6319c6e6cc4442e52ce3edcdaf4
--------------------------------
commit 77f6ab8b7768cf5e6bdd0e72499270a0671506ee upstream.
Coredump logics needs to report not only the registers of the dumping thread, but (since 2.5.43) those of other threads getting killed.
Doing that might require extra state saved on the stack in asm glue at kernel entry; signal delivery logics does that (we need to be able to save sigcontext there, at the very least) and so does seccomp.
That covers all callers of do_coredump(). Secondary threads get hit with SIGKILL and caught as soon as they reach exit_mm(), which normally happens in signal delivery, so those are also fine most of the time. Unfortunately, it is possible to end up with secondary zapped when it has already entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm() when mm->core_state is already set, but the stack contents is not what we would have in signal delivery.
At least on two architectures (alpha and m68k) it leads to infoleaks - we end up with a chunk of kernel stack written into coredump, with the contents consisting of normal C stack frames of the call chain leading to exit_mm() instead of the expected copy of userland registers. In case of alpha we leak 312 bytes of stack. Other architectures (including the regset-using ones) might have similar problems - the normal user of regsets is ptrace and the state of tracee at the time of such calls is special in the same way signal delivery is.
Note that had the zapper gotten to the exiting thread slightly later, it wouldn't have been included into coredump anyway - we skip the threads that have already cleared their ->mm. So let's pretend that zapper always loses the race. IOW, have exit_mm() only insert into the dumper list if we'd gotten there from handling a fatal signal[*]
As the result, the callers of do_exit() that have *not* gone through get_signal() are not seen by coredump logics as secondary threads. Which excludes voluntary exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that, so seccomp is fine.
[*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed out that PF_SIGNALED is already doing just what we need.
Cc: stable@vger.kernel.org Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Acked-by: "Eric W. Biederman" ebiederm@xmission.com Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/exit.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index a085052c02ae..d3f1108e57f7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -517,7 +517,10 @@ static void exit_mm(void) up_read(&mm->mmap_sem);
self.task = current; - self.next = xchg(&core_state->dumper.next, &self); + if (self.task->flags & PF_SIGNALED) + self.next = xchg(&core_state->dumper.next, &self); + else + self.task = NULL; /* * Implies mb(), the result of xchg() must be visible * to core_state->dumper.
From: Oliver Herms oliver.peter.herms@gmail.com
stable inclusion from linux-4.19.158 commit 7fe18ca0435629dc6b5bcc76a6dbb2ba66f86d70
--------------------------------
[ Upstream commit 8ef9ba4d666614497a057d09b0a6eafc1e34eadf ]
Due to the legacy usage of hard_header_len for SIT tunnels, even though the infrastructure from net/ipv4/ip_tunnel.c is already in use, the path MTU calculation in tnl_update_pmtu is incorrect. This leads to the unnecessary creation of MTU exceptions for any flow going over a SIT tunnel.
As SIT tunnels do not have a header of their own other than their transport (L3, L2) headers, we leave hard_header_len set to zero, since tnl_update_pmtu already takes care of the transport header sizes.
This also helps avoid the unnecessary IPv6 GC runs and spinlock contention seen when SIT tunnels carry more than net.ipv6.route.gc_thresh flows.
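To make the double subtraction concrete, here is a back-of-the-envelope illustration (standalone, with made-up but typical numbers; the simplified arithmetic assumes tnl_update_pmtu subtracts dev->hard_header_len on top of the tunnel header, as described above):

#include <stdio.h>

/* Illustrative only: SIT tunnel over an Ethernet underlay. */
int main(void)
{
	int tdev_mtu = 1500;			/* underlying device MTU */
	int t_hlen = 20;			/* tunnel->hlen (0) + sizeof(struct iphdr) */
	int hh_len = 14 + 20;			/* old code: tdev->hard_header_len + iphdr */

	int dev_mtu = tdev_mtu - t_hlen;	/* 1480: the correct tunnel MTU */
	int pmtu_bug = dev_mtu - hh_len;	/* 1446: headers subtracted a second time,
						 * so an MTU exception is created for
						 * every flow over the tunnel */
	int pmtu_fix = dev_mtu;			/* 1480: hard_header_len is 0 after the fix */

	printf("buggy PMTU %d, fixed PMTU %d\n", pmtu_bug, pmtu_fix);
	return 0;
}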
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Signed-off-by: Oliver Herms oliver.peter.herms@gmail.com Acked-by: Willem de Bruijn willemb@google.com Link: https://lore.kernel.org/r/20201103104133.GA1573211@tws Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/sit.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index bfed7508ba19..98c108baf35e 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1087,7 +1087,6 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) if (tdev && !netif_is_l3_master(tdev)) { int t_hlen = tunnel->hlen + sizeof(struct iphdr);
- dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr); dev->mtu = tdev->mtu - t_hlen; if (dev->mtu < IPV6_MIN_MTU) dev->mtu = IPV6_MIN_MTU; @@ -1377,7 +1376,6 @@ static void ipip6_tunnel_setup(struct net_device *dev) dev->priv_destructor = ipip6_dev_free;
dev->type = ARPHRD_SIT; - dev->hard_header_len = LL_MAX_HEADER + t_hlen; dev->mtu = ETH_DATA_LEN - t_hlen; dev->min_mtu = IPV6_MIN_MTU; dev->max_mtu = IP6_MAX_MTU - t_hlen;
From: Mao Wenan wenan.mao@linux.alibaba.com
stable inclusion from linux-4.19.158 commit bd73928fa9561a25144be6ed9313493bd9fb9ce3
--------------------------------
[ Upstream commit 909172a149749242990a6e64cb55d55460d4e417 ]
When net.ipv4.tcp_syncookies=1 and a SYN flood happens, cookie_v4_check or cookie_v6_check tries to redo what tcp_v4_send_synack or tcp_v6_send_synack did. If the receive buffer size was locked by the user via SO_RCVBUF (SOCK_RCVBUF_LOCK), rsk_window_clamp will be changed, which makes rcv_wscale different: the client still operates with the initial window scale and can overshoot the granted window, i.e. the client uses the initial scale while the local server advertises window values with the new scale, and the session works abnormally.
Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt") Signed-off-by: Mao Wenan wenan.mao@linux.alibaba.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/syncookies.c | 9 +++++++-- net/ipv6/syncookies.c | 10 ++++++++-- 2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index f66b2e6d97a7..1a06850ef3cc 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -296,7 +296,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) __u32 cookie = ntohl(th->ack_seq) - 1; struct sock *ret = sk; struct request_sock *req; - int mss; + int full_space, mss; struct rtable *rt; __u8 rcv_wscale; struct flowi4 fl4; @@ -391,8 +391,13 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
/* Try to redo what tcp_v4_send_synack did. */ req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW); + /* limit the window selection if the user enforce a smaller rx buffer */ + full_space = tcp_full_space(sk); + if (sk->sk_userlocks & SOCK_RCVBUF_LOCK && + (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0)) + req->rsk_window_clamp = full_space;
- tcp_select_initial_window(sk, tcp_full_space(sk), req->mss, + tcp_select_initial_window(sk, full_space, req->mss, &req->rsk_rcv_wnd, &req->rsk_window_clamp, ireq->wscale_ok, &rcv_wscale, dst_metric(&rt->dst, RTAX_INITRWND)); diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index a377be8a9fb4..ec61b67a92be 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -141,7 +141,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) __u32 cookie = ntohl(th->ack_seq) - 1; struct sock *ret = sk; struct request_sock *req; - int mss; + int full_space, mss; struct dst_entry *dst; __u8 rcv_wscale; u32 tsoff = 0; @@ -246,7 +246,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) }
req->rsk_window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW); - tcp_select_initial_window(sk, tcp_full_space(sk), req->mss, + /* limit the window selection if the user enforce a smaller rx buffer */ + full_space = tcp_full_space(sk); + if (sk->sk_userlocks & SOCK_RCVBUF_LOCK && + (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0)) + req->rsk_window_clamp = full_space; + + tcp_select_initial_window(sk, full_space, req->mss, &req->rsk_rcv_wnd, &req->rsk_window_clamp, ireq->wscale_ok, &rcv_wscale, dst_metric(dst, RTAX_INITRWND));
From: George Spelvin lkml@sdf.org
stable inclusion from linux-4.19.158 commit 81d7c56d6fab5ccbf522c47a655cd427808679f2
--------------------------------
commit c51f8f88d705e06bd696d7510aff22b33eb8e638 upstream.
Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits.
It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops.
This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted.
Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question.
Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it.
Reported-by: Amit Klein aksecurity@gmail.com Cc: Willy Tarreau w@1wt.eu Cc: Eric Dumazet edumazet@google.com Cc: "Jason A. Donenfeld" Jason@zx2c4.com Cc: Andy Lutomirski luto@kernel.org Cc: Kees Cook keescook@chromium.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: tytso@mit.edu Cc: Florian Westphal fw@strlen.de Cc: Marc Plumb lkml.mplumb@gmail.com Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Signed-off-by: George Spelvin lkml@sdf.org Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ] [wt: backported to 4.19 -- various context adjustments] Signed-off-by: Willy Tarreau w@1wt.eu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/char/random.c | 1 - include/linux/prandom.h | 36 +++- kernel/time/timer.c | 7 - lib/random32.c | 462 ++++++++++++++++++++++++---------------- 4 files changed, 317 insertions(+), 189 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c index c210a6f4551a..c7c344e69f19 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1257,7 +1257,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
fast_mix(fast_pool); add_interrupt_bench(cycles); - this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
if (unlikely(crng_init == 0)) { if ((fast_pool->count >= 64) && diff --git a/include/linux/prandom.h b/include/linux/prandom.h index aa16e6468f91..cc1e71334e53 100644 --- a/include/linux/prandom.h +++ b/include/linux/prandom.h @@ -16,12 +16,44 @@ void prandom_bytes(void *buf, size_t nbytes); void prandom_seed(u32 seed); void prandom_reseed_late(void);
+#if BITS_PER_LONG == 64 +/* + * The core SipHash round function. Each line can be executed in + * parallel given enough CPU resources. + */ +#define PRND_SIPROUND(v0, v1, v2, v3) ( \ + v0 += v1, v1 = rol64(v1, 13), v2 += v3, v3 = rol64(v3, 16), \ + v1 ^= v0, v0 = rol64(v0, 32), v3 ^= v2, \ + v0 += v3, v3 = rol64(v3, 21), v2 += v1, v1 = rol64(v1, 17), \ + v3 ^= v0, v1 ^= v2, v2 = rol64(v2, 32) \ +) + +#define PRND_K0 (0x736f6d6570736575 ^ 0x6c7967656e657261) +#define PRND_K1 (0x646f72616e646f6d ^ 0x7465646279746573) + +#elif BITS_PER_LONG == 32 +/* + * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash. + * This is weaker, but 32-bit machines are not used for high-traffic + * applications, so there is less output for an attacker to analyze. + */ +#define PRND_SIPROUND(v0, v1, v2, v3) ( \ + v0 += v1, v1 = rol32(v1, 5), v2 += v3, v3 = rol32(v3, 8), \ + v1 ^= v0, v0 = rol32(v0, 16), v3 ^= v2, \ + v0 += v3, v3 = rol32(v3, 7), v2 += v1, v1 = rol32(v1, 13), \ + v3 ^= v0, v1 ^= v2, v2 = rol32(v2, 16) \ +) +#define PRND_K0 0x6c796765 +#define PRND_K1 0x74656462 + +#else +#error Unsupported BITS_PER_LONG +#endif + struct rnd_state { __u32 s1, s2, s3, s4; };
-DECLARE_PER_CPU(struct rnd_state, net_rand_state); - u32 prandom_u32_state(struct rnd_state *state); void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes); void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state); diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 61e41ea3a96e..a6e88d9bb931 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1655,13 +1655,6 @@ void update_process_times(int user_tick) scheduler_tick(); if (IS_ENABLED(CONFIG_POSIX_TIMERS)) run_posix_cpu_timers(p); - - /* The current CPU might make use of net randoms without receiving IRQs - * to renew them often enough. Let's update the net_rand_state from a - * non-constant value that's not affine to the number of calls to make - * sure it's updated when there's some activity (we don't care in idle). - */ - this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick); }
/** diff --git a/lib/random32.c b/lib/random32.c index b6f3325e38e4..9085b1172015 100644 --- a/lib/random32.c +++ b/lib/random32.c @@ -40,16 +40,6 @@ #include <linux/sched.h> #include <asm/unaligned.h>
-#ifdef CONFIG_RANDOM32_SELFTEST -static void __init prandom_state_selftest(void); -#else -static inline void prandom_state_selftest(void) -{ -} -#endif - -DEFINE_PER_CPU(struct rnd_state, net_rand_state) __latent_entropy; - /** * prandom_u32_state - seeded pseudo-random number generator. * @state: pointer to state structure holding seeded state. @@ -69,25 +59,6 @@ u32 prandom_u32_state(struct rnd_state *state) } EXPORT_SYMBOL(prandom_u32_state);
-/** - * prandom_u32 - pseudo random number generator - * - * A 32 bit pseudo-random number is generated using a fast - * algorithm suitable for simulation. This algorithm is NOT - * considered safe for cryptographic use. - */ -u32 prandom_u32(void) -{ - struct rnd_state *state = &get_cpu_var(net_rand_state); - u32 res; - - res = prandom_u32_state(state); - put_cpu_var(net_rand_state); - - return res; -} -EXPORT_SYMBOL(prandom_u32); - /** * prandom_bytes_state - get the requested number of pseudo-random bytes * @@ -119,20 +90,6 @@ void prandom_bytes_state(struct rnd_state *state, void *buf, size_t bytes) } EXPORT_SYMBOL(prandom_bytes_state);
-/** - * prandom_bytes - get the requested number of pseudo-random bytes - * @buf: where to copy the pseudo-random bytes to - * @bytes: the requested number of bytes - */ -void prandom_bytes(void *buf, size_t bytes) -{ - struct rnd_state *state = &get_cpu_var(net_rand_state); - - prandom_bytes_state(state, buf, bytes); - put_cpu_var(net_rand_state); -} -EXPORT_SYMBOL(prandom_bytes); - static void prandom_warmup(struct rnd_state *state) { /* Calling RNG ten times to satisfy recurrence condition */ @@ -148,96 +105,6 @@ static void prandom_warmup(struct rnd_state *state) prandom_u32_state(state); }
-static u32 __extract_hwseed(void) -{ - unsigned int val = 0; - - (void)(arch_get_random_seed_int(&val) || - arch_get_random_int(&val)); - - return val; -} - -static void prandom_seed_early(struct rnd_state *state, u32 seed, - bool mix_with_hwseed) -{ -#define LCG(x) ((x) * 69069U) /* super-duper LCG */ -#define HWSEED() (mix_with_hwseed ? __extract_hwseed() : 0) - state->s1 = __seed(HWSEED() ^ LCG(seed), 2U); - state->s2 = __seed(HWSEED() ^ LCG(state->s1), 8U); - state->s3 = __seed(HWSEED() ^ LCG(state->s2), 16U); - state->s4 = __seed(HWSEED() ^ LCG(state->s3), 128U); -} - -/** - * prandom_seed - add entropy to pseudo random number generator - * @seed: seed value - * - * Add some additional seeding to the prandom pool. - */ -void prandom_seed(u32 entropy) -{ - int i; - /* - * No locking on the CPUs, but then somewhat random results are, well, - * expected. - */ - for_each_possible_cpu(i) { - struct rnd_state *state = &per_cpu(net_rand_state, i); - - state->s1 = __seed(state->s1 ^ entropy, 2U); - prandom_warmup(state); - } -} -EXPORT_SYMBOL(prandom_seed); - -/* - * Generate some initially weak seeding values to allow - * to start the prandom_u32() engine. - */ -static int __init prandom_init(void) -{ - int i; - - prandom_state_selftest(); - - for_each_possible_cpu(i) { - struct rnd_state *state = &per_cpu(net_rand_state, i); - u32 weak_seed = (i + jiffies) ^ random_get_entropy(); - - prandom_seed_early(state, weak_seed, true); - prandom_warmup(state); - } - - return 0; -} -core_initcall(prandom_init); - -static void __prandom_timer(struct timer_list *unused); - -static DEFINE_TIMER(seed_timer, __prandom_timer); - -static void __prandom_timer(struct timer_list *unused) -{ - u32 entropy; - unsigned long expires; - - get_random_bytes(&entropy, sizeof(entropy)); - prandom_seed(entropy); - - /* reseed every ~60 seconds, in [40 .. 80) interval with slack */ - expires = 40 + prandom_u32_max(40); - seed_timer.expires = jiffies + msecs_to_jiffies(expires * MSEC_PER_SEC); - - add_timer(&seed_timer); -} - -static void __init __prandom_start_seed_timer(void) -{ - seed_timer.expires = jiffies + msecs_to_jiffies(40 * MSEC_PER_SEC); - add_timer(&seed_timer); -} - void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state) { int i; @@ -257,51 +124,6 @@ void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state) } EXPORT_SYMBOL(prandom_seed_full_state);
-/* - * Generate better values after random number generator - * is fully initialized. - */ -static void __prandom_reseed(bool late) -{ - unsigned long flags; - static bool latch = false; - static DEFINE_SPINLOCK(lock); - - /* Asking for random bytes might result in bytes getting - * moved into the nonblocking pool and thus marking it - * as initialized. In this case we would double back into - * this function and attempt to do a late reseed. - * Ignore the pointless attempt to reseed again if we're - * already waiting for bytes when the nonblocking pool - * got initialized. - */ - - /* only allow initial seeding (late == false) once */ - if (!spin_trylock_irqsave(&lock, flags)) - return; - - if (latch && !late) - goto out; - - latch = true; - prandom_seed_full_state(&net_rand_state); -out: - spin_unlock_irqrestore(&lock, flags); -} - -void prandom_reseed_late(void) -{ - __prandom_reseed(true); -} - -static int __init prandom_reseed(void) -{ - __prandom_reseed(false); - __prandom_start_seed_timer(); - return 0; -} -late_initcall(prandom_reseed); - #ifdef CONFIG_RANDOM32_SELFTEST static struct prandom_test1 { u32 seed; @@ -421,7 +243,28 @@ static struct prandom_test2 { { 407983964U, 921U, 728767059U }, };
-static void __init prandom_state_selftest(void) +static u32 __extract_hwseed(void) +{ + unsigned int val = 0; + + (void)(arch_get_random_seed_int(&val) || + arch_get_random_int(&val)); + + return val; +} + +static void prandom_seed_early(struct rnd_state *state, u32 seed, + bool mix_with_hwseed) +{ +#define LCG(x) ((x) * 69069U) /* super-duper LCG */ +#define HWSEED() (mix_with_hwseed ? __extract_hwseed() : 0) + state->s1 = __seed(HWSEED() ^ LCG(seed), 2U); + state->s2 = __seed(HWSEED() ^ LCG(state->s1), 8U); + state->s3 = __seed(HWSEED() ^ LCG(state->s2), 16U); + state->s4 = __seed(HWSEED() ^ LCG(state->s3), 128U); +} + +static int __init prandom_state_selftest(void) { int i, j, errors = 0, runs = 0; bool error = false; @@ -461,5 +304,266 @@ static void __init prandom_state_selftest(void) pr_warn("prandom: %d/%d self tests failed\n", errors, runs); else pr_info("prandom: %d self tests passed\n", runs); + return 0; } +core_initcall(prandom_state_selftest); #endif + +/* + * The prandom_u32() implementation is now completely separate from the + * prandom_state() functions, which are retained (for now) for compatibility. + * + * Because of (ab)use in the networking code for choosing random TCP/UDP port + * numbers, which open DoS possibilities if guessable, we want something + * stronger than a standard PRNG. But the performance requirements of + * the network code do not allow robust crypto for this application. + * + * So this is a homebrew Junior Spaceman implementation, based on the + * lowest-latency trustworthy crypto primitive available, SipHash. + * (The authors of SipHash have not been consulted about this abuse of + * their work.) + * + * Standard SipHash-2-4 uses 2n+4 rounds to hash n words of input to + * one word of output. This abbreviated version uses 2 rounds per word + * of output. + */ + +struct siprand_state { + unsigned long v0; + unsigned long v1; + unsigned long v2; + unsigned long v3; +}; + +static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy; + +/* + * This is the core CPRNG function. As "pseudorandom", this is not used + * for truly valuable things, just intended to be a PITA to guess. + * For maximum speed, we do just two SipHash rounds per word. This is + * the same rate as 4 rounds per 64 bits that SipHash normally uses, + * so hopefully it's reasonably secure. + * + * There are two changes from the official SipHash finalization: + * - We omit some constants XORed with v2 in the SipHash spec as irrelevant; + * they are there only to make the output rounds distinct from the input + * rounds, and this application has no input rounds. + * - Rather than returning v0^v1^v2^v3, return v1+v3. + * If you look at the SipHash round, the last operation on v3 is + * "v3 ^= v0", so "v0 ^ v3" just undoes that, a waste of time. + * Likewise "v1 ^= v2". (The rotate of v2 makes a difference, but + * it still cancels out half of the bits in v2 for no benefit.) + * Second, since the last combining operation was xor, continue the + * pattern of alternating xor/add for a tiny bit of extra non-linearity. + */ +static inline u32 siprand_u32(struct siprand_state *s) +{ + unsigned long v0 = s->v0, v1 = s->v1, v2 = s->v2, v3 = s->v3; + + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + s->v0 = v0; s->v1 = v1; s->v2 = v2; s->v3 = v3; + return v1 + v3; +} + + +/** + * prandom_u32 - pseudo random number generator + * + * A 32 bit pseudo-random number is generated using a fast + * algorithm suitable for simulation. 
This algorithm is NOT + * considered safe for cryptographic use. + */ +u32 prandom_u32(void) +{ + struct siprand_state *state = get_cpu_ptr(&net_rand_state); + u32 res = siprand_u32(state); + + put_cpu_ptr(&net_rand_state); + return res; +} +EXPORT_SYMBOL(prandom_u32); + +/** + * prandom_bytes - get the requested number of pseudo-random bytes + * @buf: where to copy the pseudo-random bytes to + * @bytes: the requested number of bytes + */ +void prandom_bytes(void *buf, size_t bytes) +{ + struct siprand_state *state = get_cpu_ptr(&net_rand_state); + u8 *ptr = buf; + + while (bytes >= sizeof(u32)) { + put_unaligned(siprand_u32(state), (u32 *)ptr); + ptr += sizeof(u32); + bytes -= sizeof(u32); + } + + if (bytes > 0) { + u32 rem = siprand_u32(state); + + do { + *ptr++ = (u8)rem; + rem >>= BITS_PER_BYTE; + } while (--bytes > 0); + } + put_cpu_ptr(&net_rand_state); +} +EXPORT_SYMBOL(prandom_bytes); + +/** + * prandom_seed - add entropy to pseudo random number generator + * @entropy: entropy value + * + * Add some additional seed material to the prandom pool. + * The "entropy" is actually our IP address (the only caller is + * the network code), not for unpredictability, but to ensure that + * different machines are initialized differently. + */ +void prandom_seed(u32 entropy) +{ + int i; + + add_device_randomness(&entropy, sizeof(entropy)); + + for_each_possible_cpu(i) { + struct siprand_state *state = per_cpu_ptr(&net_rand_state, i); + unsigned long v0 = state->v0, v1 = state->v1; + unsigned long v2 = state->v2, v3 = state->v3; + + do { + v3 ^= entropy; + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + v0 ^= entropy; + } while (unlikely(!v0 || !v1 || !v2 || !v3)); + + WRITE_ONCE(state->v0, v0); + WRITE_ONCE(state->v1, v1); + WRITE_ONCE(state->v2, v2); + WRITE_ONCE(state->v3, v3); + } +} +EXPORT_SYMBOL(prandom_seed); + +/* + * Generate some initially weak seeding values to allow + * the prandom_u32() engine to be started. + */ +static int __init prandom_init_early(void) +{ + int i; + unsigned long v0, v1, v2, v3; + + if (!arch_get_random_long(&v0)) + v0 = jiffies; + if (!arch_get_random_long(&v1)) + v1 = random_get_entropy(); + v2 = v0 ^ PRND_K0; + v3 = v1 ^ PRND_K1; + + for_each_possible_cpu(i) { + struct siprand_state *state; + + v3 ^= i; + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + v0 ^= i; + + state = per_cpu_ptr(&net_rand_state, i); + state->v0 = v0; state->v1 = v1; + state->v2 = v2; state->v3 = v3; + } + + return 0; +} +core_initcall(prandom_init_early); + + +/* Stronger reseeding when available, and periodically thereafter. */ +static void prandom_reseed(struct timer_list *unused); + +static DEFINE_TIMER(seed_timer, prandom_reseed); + +static void prandom_reseed(struct timer_list *unused) +{ + unsigned long expires; + int i; + + /* + * Reinitialize each CPU's PRNG with 128 bits of key. + * No locking on the CPUs, but then somewhat random results are, + * well, expected. + */ + for_each_possible_cpu(i) { + struct siprand_state *state; + unsigned long v0 = get_random_long(), v2 = v0 ^ PRND_K0; + unsigned long v1 = get_random_long(), v3 = v1 ^ PRND_K1; +#if BITS_PER_LONG == 32 + int j; + + /* + * On 32-bit machines, hash in two extra words to + * approximate 128-bit key length. Not that the hash + * has that much security, but this prevents a trivial + * 64-bit brute force. 
+ */ + for (j = 0; j < 2; j++) { + unsigned long m = get_random_long(); + + v3 ^= m; + PRND_SIPROUND(v0, v1, v2, v3); + PRND_SIPROUND(v0, v1, v2, v3); + v0 ^= m; + } +#endif + /* + * Probably impossible in practice, but there is a + * theoretical risk that a race between this reseeding + * and the target CPU writing its state back could + * create the all-zero SipHash fixed point. + * + * To ensure that never happens, ensure the state + * we write contains no zero words. + */ + state = per_cpu_ptr(&net_rand_state, i); + WRITE_ONCE(state->v0, v0 ? v0 : -1ul); + WRITE_ONCE(state->v1, v1 ? v1 : -1ul); + WRITE_ONCE(state->v2, v2 ? v2 : -1ul); + WRITE_ONCE(state->v3, v3 ? v3 : -1ul); + } + + /* reseed every ~60 seconds, in [40 .. 80) interval with slack */ + expires = round_jiffies(jiffies + 40 * HZ + prandom_u32_max(40 * HZ)); + mod_timer(&seed_timer, expires); +} + +/* + * The random ready callback can be called from almost any interrupt. + * To avoid worrying about whether it's safe to delay that interrupt + * long enough to seed all CPUs, just schedule an immediate timer event. + */ +static void prandom_timer_start(struct random_ready_callback *unused) +{ + mod_timer(&seed_timer, jiffies); +} + +/* + * Start periodic full reseeding as soon as strong + * random numbers are available. + */ +static int __init prandom_init_late(void) +{ + static struct random_ready_callback random_ready = { + .func = prandom_timer_start + }; + int ret = add_random_ready_callback(&random_ready); + + if (ret == -EALREADY) { + prandom_timer_start(&random_ready); + ret = 0; + } + return ret; +} +late_initcall(prandom_init_late);
From: Arnaldo Carvalho de Melo acme@redhat.com
stable inclusion from linux-4.19.158 commit b9658d18499e4cf1b6c5f122f12e1cc985ac5568
--------------------------------
commit d0e7b0c71fbb653de90a7163ef46912a96f0bdaf upstream.
To avoid this:
util/scripting-engines/trace-event-python.c: In function 'python_start_script': util/scripting-engines/trace-event-python.c:1595:2: error: 'visibility' attribute ignored [-Werror=attributes] 1595 | PyMODINIT_FUNC (*initfunc)(void); | ^~~~~~~~~~~~~~
That started breaking when building with PYTHON=python3 and these gcc versions (I haven't checked with the clang ones, maybe it breaks there as well):
# export PERF_TARBALL=http://192.168.86.5/perf/perf-5.9.0.tar.xz # dm fedora:33 fedora:rawhide 1 107.80 fedora:33 : Ok gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5), clang version 11.0.0 (Fedora 11.0.0-1.fc33) 2 92.47 fedora:rawhide : Ok gcc (GCC) 10.2.1 20201016 (Red Hat 10.2.1-6), clang version 11.0.0 (Fedora 11.0.0-1.fc34) #
Avoid that by ditching that 'initfunc' function pointer with its:
#define Py_EXPORTED_SYMBOL __attribute__ ((visibility ("default"))) #define PyMODINIT_FUNC Py_EXPORTED_SYMBOL PyObject*
And just call PyImport_AppendInittab() at the end of the ifdef python3 block with the functions that were being attributed to that initfunc.
Cc: Adrian Hunter adrian.hunter@intel.com Cc: Ian Rogers irogers@google.com Cc: Jiri Olsa jolsa@kernel.org Cc: Namhyung Kim namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Tapas Kundu tkundu@vmware.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/perf/util/scripting-engines/trace-event-python.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c b/tools/perf/util/scripting-engines/trace-event-python.c index 9569cc06e0a7..2814251df06b 100644 --- a/tools/perf/util/scripting-engines/trace-event-python.c +++ b/tools/perf/util/scripting-engines/trace-event-python.c @@ -1493,7 +1493,6 @@ static void _free_command_line(wchar_t **command_line, int num) static int python_start_script(const char *script, int argc, const char **argv) { struct tables *tables = &tables_global; - PyMODINIT_FUNC (*initfunc)(void); #if PY_MAJOR_VERSION < 3 const char **command_line; #else @@ -1508,20 +1507,18 @@ static int python_start_script(const char *script, int argc, const char **argv) FILE *fp;
#if PY_MAJOR_VERSION < 3 - initfunc = initperf_trace_context; command_line = malloc((argc + 1) * sizeof(const char *)); command_line[0] = script; for (i = 1; i < argc + 1; i++) command_line[i] = argv[i - 1]; + PyImport_AppendInittab(name, initperf_trace_context); #else - initfunc = PyInit_perf_trace_context; command_line = malloc((argc + 1) * sizeof(wchar_t *)); command_line[0] = Py_DecodeLocale(script, NULL); for (i = 1; i < argc + 1; i++) command_line[i] = Py_DecodeLocale(argv[i - 1], NULL); + PyImport_AppendInittab(name, PyInit_perf_trace_context); #endif - - PyImport_AppendInittab(name, initfunc); Py_Initialize();
#if PY_MAJOR_VERSION < 3
From: Matteo Croce mcroce@microsoft.com
stable inclusion from linux-4.19.158 commit 9a6cea8220c608f6d59763fab06ff196185cc3ff
--------------------------------
commit 8b92c4ff4423aa9900cf838d3294fcade4dbda35 upstream.
Patch series "fix parsing of reboot= cmdline", v3.
The parsing of the reboot= cmdline has two major errors:
- a missing bound check can crash the system on reboot
- parsing of the cpu number only works if specified last
Fix both.
This patch (of 2):
This reverts commit 616feab753972b97.
kstrtoint() and simple_strtoul() have a subtle difference which makes them non-interchangeable: if a non-digit character is found during parsing, the former will return an error, while the latter will just stop parsing, e.g. simple_strtoul("123xyx") = 123.
The kernel cmdline reboot= argument allows specifying the CPU used for rebooting, with the syntax `s####` among the other flags, e.g. "reboot=warm,s31,force". So if this flag is not the last one given, the CPU number and all subsequent flags are silently ignored.
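As a quick illustration of the parsing difference (a user-space sketch only; strtoul() and sscanf() stand in for the kernel's simple_strtoul() and kstrtoint(), which cannot be run outside the kernel):
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
	const char *flag = "31,force";	/* what reboot_setup() sees after 's' */
	char *end;
	/* simple_strtoul()-like behaviour: stop at the first non-digit */
	unsigned long cpu = strtoul(flag, &end, 0);
	printf("simple_strtoul style: cpu=%lu, unparsed=\"%s\"\n", cpu, end);
	/* kstrtoint()-like behaviour: trailing garbage is an error, so the
	 * CPU number (and the remaining flags) would be lost entirely */
	int val;
	char extra;
	if (sscanf(flag, "%d%c", &val, &extra) != 1)
		printf("kstrtoint style: rejects \"%s\" entirely\n", flag);
	return 0;
}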
Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint") Signed-off-by: Matteo Croce mcroce@microsoft.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Guenter Roeck linux@roeck-us.net Cc: Petr Mladek pmladek@suse.com Cc: Arnd Bergmann arnd@arndb.de Cc: Mike Rapoport rppt@kernel.org Cc: Kees Cook keescook@chromium.org Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Robin Holt robinmholt@gmail.com Cc: Fabian Frederick fabf@skynet.be Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee sudipm.mukherjee@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/reboot.c | 21 +++++++-------------- 1 file changed, 7 insertions(+), 14 deletions(-)
diff --git a/kernel/reboot.c b/kernel/reboot.c index 8fb44dec9ad7..db0eed5916d5 100644 --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -539,22 +539,15 @@ static int __init reboot_setup(char *str) break;
case 's': - { - int rc; - - if (isdigit(*(str+1))) { - rc = kstrtoint(str+1, 0, &reboot_cpu); - if (rc) - return rc; - } else if (str[1] == 'm' && str[2] == 'p' && - isdigit(*(str+3))) { - rc = kstrtoint(str+3, 0, &reboot_cpu); - if (rc) - return rc; - } else + if (isdigit(*(str+1))) + reboot_cpu = simple_strtoul(str+1, NULL, 0); + else if (str[1] == 'm' && str[2] == 'p' && + isdigit(*(str+3))) + reboot_cpu = simple_strtoul(str+3, NULL, 0); + else reboot_mode = REBOOT_SOFT; break; - } + case 'g': reboot_mode = REBOOT_GPIO; break;
From: Matteo Croce mcroce@microsoft.com
stable inclusion from linux-4.19.158 commit 2e021b7197fb911f934efcc45f019e5f7717e8db
--------------------------------
commit df5b0ab3e08a156701b537809914b339b0daa526 upstream.
Limit the CPU number to num_possible_cpus(), because setting it to a value lower than INT_MAX but higher than NR_CPUS produces the following error on reboot and shutdown:
BUG: unable to handle page fault for address: ffffffff90ab1bb0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1c09067 P4D 1c09067 PUD 1c0a063 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 5.9.0-rc8-kvm #110 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:migrate_to_reboot_cpu+0xe/0x60 Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da <48> 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c RSP: 0018:ffffc90000013e08 EFLAGS: 00010246 RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000 RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0 RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __do_sys_reboot.cold+0x34/0x5b do_syscall_64+0x2d/0x40
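As a concrete illustration (inferred from the dump above, where RBX/RDX hold 0x77359400 = 2000000000): booting with something like reboot=warm,s2000000000 passes the parser because the value is below INT_MAX, but the cpu_online()/cpumask test in migrate_to_reboot_cpu() then probes a bit roughly two billion positions past the CPU mask, producing the not-present page fault shown.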
Fixes: 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel") Signed-off-by: Matteo Croce mcroce@microsoft.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Arnd Bergmann arnd@arndb.de Cc: Fabian Frederick fabf@skynet.be Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Guenter Roeck linux@roeck-us.net Cc: Kees Cook keescook@chromium.org Cc: Mike Rapoport rppt@kernel.org Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Petr Mladek pmladek@suse.com Cc: Robin Holt robinmholt@gmail.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201103214025.116799-3-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee sudipm.mukherjee@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/reboot.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/kernel/reboot.c b/kernel/reboot.c index db0eed5916d5..45bea54f9aee 100644 --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -546,6 +546,13 @@ static int __init reboot_setup(char *str) reboot_cpu = simple_strtoul(str+3, NULL, 0); else reboot_mode = REBOOT_SOFT; + if (reboot_cpu >= num_possible_cpus()) { + pr_err("Ignoring the CPU number in reboot= option. " + "CPU %d exceeds possible cpu number %d\n", + reboot_cpu, num_possible_cpus()); + reboot_cpu = 0; + break; + } break;
case 'g':
From: Yunsheng Lin linyunsheng@huawei.com
stable inclusion from linux-4.19.158 commit 81504d1952d712c8bb9c3966896efee8a37ea966
--------------------------------
When commit 2fb541c862c9 ("net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc") was backported to the stable kernel, one assignment was missing, which causes the two problems reported by Joakim and Vishwanath; see [1] and [2].
So add the assignment back to fix it.
1. https://www.spinics.net/lists/netdev/msg693916.html 2. https://www.spinics.net/lists/netdev/msg695131.html
Fixes: 749cc0b0c7f3 ("net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc") Signed-off-by: Yunsheng Lin linyunsheng@huawei.com Acked-by: Jakub Kicinski kuba@kernel.org Tested-by: Brian Norris briannorris@chromium.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sched/sch_generic.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index bd96fd261dba..4e15913e7519 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -1116,10 +1116,13 @@ static void dev_deactivate_queue(struct net_device *dev, void *_qdisc_default) { struct Qdisc *qdisc = rtnl_dereference(dev_queue->qdisc); + struct Qdisc *qdisc_default = _qdisc_default;
if (qdisc) { if (!(qdisc->flags & TCQ_F_BUILTIN)) set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state); + + rcu_assign_pointer(dev_queue->qdisc, qdisc_default); } }
From: Boris Protopopov pboris@amazon.com
stable inclusion from linux-4.19.158 commit 987c45d57f388f9531d523b8424b8e97687bb323
--------------------------------
commit 57c176074057531b249cf522d90c22313fa74b0b upstream.
When converting trailing spaces and periods in paths, do so for every component of the path, not just the last component. If the conversion is not done for every path component, then subsequent operations in directories with trailing spaces or periods (e.g. create(), mkdir()) will fail with ENOENT. This is because on the server, the directory will have a special symbol in its name, and the client needs to provide the same.
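For example (hypothetical names): after mkdir("dir.") succeeds, the server stores the directory under a name whose trailing period has been remapped to the special substitute character; a later create("dir./file") previously remapped only the final component, so the "dir." component was sent with a literal '.' and the server returned ENOENT.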
Signed-off-by: Boris Protopopov pboris@amazon.com Acked-by: Ronnie Sahlberg lsahlber@redhat.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/cifs_unicode.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c index a2b2355e7f01..9986817532b1 100644 --- a/fs/cifs/cifs_unicode.c +++ b/fs/cifs/cifs_unicode.c @@ -501,7 +501,13 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen, else if (map_chars == SFM_MAP_UNI_RSVD) { bool end_of_string;
- if (i == srclen - 1) + /** + * Remap spaces and periods found at the end of every + * component of the path. The special cases of '.' and + * '..' do not need to be dealt with explicitly because + * they are addressed in namei.c:link_path_walk(). + **/ + if ((i == srclen - 1) || (source[i+1] == '\\')) end_of_string = true; else end_of_string = false;
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.160 commit 8196e11a66d751752a25b62c6dc0e2923a5e6ff0
--------------------------------
[ Upstream commit 849920c703392957f94023f77ec89ca6cf119d43 ]
If sb_occ_port_pool_get() failed in devlink_nl_sb_port_pool_fill(), msg should be canceled by genlmsg_cancel().
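For context, a minimal sketch of the generic-netlink fill pattern this fix restores (the attribute type and value below are placeholders, not devlink's actual attributes): every exit taken after genlmsg_put() must either genlmsg_end() or genlmsg_cancel(), and the real error code must be propagated.
#include <net/genetlink.h>
static int example_fill(struct sk_buff *msg, u32 portid, u32 seq,
			const struct genl_family *family, u8 cmd)
{
	void *hdr = genlmsg_put(msg, portid, seq, family, 0, cmd);
	int err;
	if (!hdr)
		return -EMSGSIZE;
	err = nla_put_u32(msg, 1 /* placeholder attr type */, 42);
	if (err)
		goto err_cancel;	/* propagate the real error... */
	genlmsg_end(msg, hdr);
	return 0;
err_cancel:
	genlmsg_cancel(msg, hdr);	/* ...but always unwind the header */
	return err;
}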
Fixes: df38dafd2559 ("devlink: implement shared buffer occupancy monitoring interface") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201113111622.11040-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/devlink.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/core/devlink.c b/net/core/devlink.c index a77e3777c8dd..6ad095264896 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -1113,7 +1113,7 @@ static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg, err = ops->sb_occ_port_pool_get(devlink_port, devlink_sb->index, pool_index, &cur, &max); if (err && err != -EOPNOTSUPP) - return err; + goto sb_occ_get_failure; if (!err) { if (nla_put_u32(msg, DEVLINK_ATTR_SB_OCC_CUR, cur)) goto nla_put_failure; @@ -1126,8 +1126,10 @@ static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg, return 0;
nla_put_failure: + err = -EMSGSIZE; +sb_occ_get_failure: genlmsg_cancel(msg, hdr); - return -EMSGSIZE; + return err; }
static int devlink_nl_cmd_sb_port_pool_get_doit(struct sk_buff *skb,
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.160 commit f0ec2cd03100ab287c9e5fce5ae56a76d6fc17a4
--------------------------------
[ Upstream commit e33de7c5317e2827b2ba6fd120a505e9eb727b05 ]
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message.
Fixes: d545caca827b ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/inet_diag.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index f0957ebf82cf..d07917059d70 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -392,8 +392,10 @@ static int inet_req_diag_fill(struct sock *sk, struct sk_buff *skb, r->idiag_inode = 0;
if (net_admin && nla_put_u32(skb, INET_DIAG_MARK, - inet_rsk(reqsk)->ir_mark)) + inet_rsk(reqsk)->ir_mark)) { + nlmsg_cancel(skb, nlh); return -EMSGSIZE; + }
nlmsg_end(skb, nlh); return 0;
From: Heiner Kallweit hkallweit1@gmail.com
stable inclusion from linux-4.19.160 commit b1197f5d6f532a3f82a66940c8cec44cf302bf6d
--------------------------------
[ Upstream commit 7a30ecc9237681bb125cbd30eee92bef7e86293d ]
In br_forward.c and br_input.c fields dev->stats.tx_dropped and dev->stats.multicast are populated, but they are ignored in ndo_get_stats64.
Fixes: 28172739f0a2 ("net: fix 64 bit counters on 32 bit arches") Signed-off-by: Heiner Kallweit hkallweit1@gmail.com Link: https://lore.kernel.org/r/58ea9963-77ad-a7cf-8dfd-fc95ab95f606@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/bridge/br_device.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c index 9ce661e2590d..a350c05b7ff5 100644 --- a/net/bridge/br_device.c +++ b/net/bridge/br_device.c @@ -215,6 +215,7 @@ static void br_get_stats64(struct net_device *dev, sum.rx_packets += tmp.rx_packets; }
+ netdev_stats_to_stats64(stats, &dev->stats); stats->tx_bytes = sum.tx_bytes; stats->tx_packets = sum.tx_packets; stats->rx_bytes = sum.rx_bytes;
From: Dongli Zhang dongli.zhang@oracle.com
stable inclusion from linux-4.19.160 commit 082b21d03479b566681f43b74677c249f6aad337
--------------------------------
[ Upstream commit d8c19014bba8f565d8a2f1f46b4e38d1d97bf1a7 ]
The ethernet driver may allocate the skb (and skb->data) via napi_alloc_skb(). This ends up calling page_frag_alloc() to allocate skb->data from page_frag_cache->va.
Under memory pressure, page_frag_cache->va may be allocated as a pfmemalloc page. As a result, skb->pfmemalloc is always true because skb->data comes from page_frag_cache->va. The skb will be dropped if the sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour under memory pressure.
However, once the kernel is no longer under memory pressure (suppose a large amount of memory pages have just been reclaimed), page_frag_alloc() may still re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a result, skb->pfmemalloc stays true unless page_frag_cache->va is re-allocated, even though the kernel is no longer under memory pressure.
Here is how kernel runs into issue.
1. The kernel is under memory pressure and allocation of PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead, the pfmemalloc page is allocated for page_frag_cache->va.
2. All skb->data from page_frag_cache->va (pfmemalloc) will have skb->pfmemalloc=true. The skb will always be dropped by a sock without SOCK_MEMALLOC. This is expected behaviour.
3. Suppose a large amount of pages are reclaimed and kernel is not under memory pressure any longer. We expect skb->pfmemalloc drop will not happen.
4. Unfortunately, page_frag_alloc() does not proactively re-allocate page_frag_cache->va and will always re-use the prior pfmemalloc page. skb->pfmemalloc stays true even though the kernel is no longer under memory pressure.
Fix this by freeing and re-allocating the page instead of recycling it.
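For context, the receive-side check that makes a stale pfmemalloc flag harmful looks roughly like this (a sketch modelled on sk_filter_trim_cap() in net/core/filter.c, not the exact code):
#include <linux/skbuff.h>
#include <net/sock.h>
/* skbs carved from a pfmemalloc page are only deliverable to sockets
 * that opted in with SOCK_MEMALLOC; everyone else drops them, so a
 * stale nc->pfmemalloc keeps dropping traffic after pressure ends. */
static int example_can_deliver(const struct sk_buff *skb, const struct sock *sk)
{
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;	/* receiver drops the packet */
	return 0;
}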
References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/ References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/ Suggested-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Aruna Ramakrishna aruna.ramakrishna@oracle.com Cc: Bert Barbe bert.barbe@oracle.com Cc: Rama Nichanamatlu rama.nichanamatlu@oracle.com Cc: Venkat Venkatsubra venkat.x.venkatsubra@oracle.com Cc: Manjunath Patil manjunath.b.patil@oracle.com Cc: Joe Jin joe.jin@oracle.com Cc: SRINIVAS srinivas.eeda@oracle.com Fixes: 79930f5892e1 ("net: do not deplete pfmemalloc reserve") Signed-off-by: Dongli Zhang dongli.zhang@oracle.com Acked-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/page_alloc.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 504c9699b1af..fd77bf4979a9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4653,6 +4653,11 @@ void *page_frag_alloc(struct page_frag_cache *nc, if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) goto refill;
+ if (unlikely(nc->pfmemalloc)) { + free_the_page(page, compound_order(page)); + goto refill; + } + #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) /* if size can vary use size else just use PAGE_SIZE */ size = nc->size;
From: Ryan Sharpelletti sharpelletti@google.com
stable inclusion from linux-4.19.160 commit 88c2cfa3f78120bc755e538ffa8b456fc2ecf874
--------------------------------
[ Upstream commit 1b9e2a8c99a5c021041bfb2d512dc3ed92a94ffd ]
During loss recovery, retransmitted packets are forced to use TCP timestamps to calculate the RTT samples, which have a millisecond granularity. BBR is designed using a microsecond granularity. As a result, multiple RTT samples could be truncated to the same RTT value during loss recovery. This is problematic, as BBR will not enter PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning that if there are persistent losses, PROBE_RTT will constantly be pushed off and potentially never re-entered. This patch makes sure that BBR enters PROBE_RTT by checking if RTT sample is < the current min_rtt sample, rather than <=.
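As a worked illustration: during recovery with millisecond timestamps, a string of samples may all truncate to 5 ms while min_rtt is already 5 ms; with the old '<=' test every such sample refreshes min_rtt_stamp, so the roughly 10-second min-RTT filter never expires and PROBE_RTT keeps being deferred, whereas with '<' an equal sample leaves the stamp alone and the filter can expire.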
The Netflix transport/TCP team discovered this bug in the Linux TCP BBR code during lab tests.
Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Ryan Sharpelletti sharpelletti@google.com Signed-off-by: Neal Cardwell ncardwell@google.com Signed-off-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: Yuchung Cheng ycheng@google.com Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.c... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp_bbr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 93f176336297..b70c9365e131 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -917,7 +917,7 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) filter_expired = after(tcp_jiffies32, bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ); if (rs->rtt_us >= 0 && - (rs->rtt_us <= bbr->min_rtt_us || + (rs->rtt_us < bbr->min_rtt_us || (filter_expired && !rs->is_ack_delayed))) { bbr->min_rtt_us = rs->rtt_us; bbr->min_rtt_stamp = tcp_jiffies32;
From: Will Deacon will@kernel.org
stable inclusion from linux-4.19.160 commit 331298e711e7da5c296c55b4e7ec9ee7894126a7
--------------------------------
[ Upstream commit 891deb87585017d526b67b59c15d38755b900fea ]
cpu_psci_cpu_die() is called in the context of the dying CPU, which will no longer be online or tracked by RCU. It is therefore not generally safe to call printk() if the PSCI "cpu off" request fails, so remove the pr_crit() invocation.
Cc: Qian Cai cai@redhat.com Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Link: https://lore.kernel.org/r/20201106103602.9849-2-will@kernel.org Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/psci.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/psci.c b/arch/arm64/kernel/psci.c index 3856d51c645b..3ebb2a56e5f7 100644 --- a/arch/arm64/kernel/psci.c +++ b/arch/arm64/kernel/psci.c @@ -69,7 +69,6 @@ static int cpu_psci_cpu_disable(unsigned int cpu)
static void cpu_psci_cpu_die(unsigned int cpu) { - int ret; /* * There are no known implementations of PSCI actually using the * power state field, pass a sensible default for now. @@ -77,9 +76,7 @@ static void cpu_psci_cpu_die(unsigned int cpu) u32 state = PSCI_POWER_STATE_TYPE_POWER_DOWN << PSCI_0_2_POWER_STATE_TYPE_SHIFT;
- ret = psci_ops.cpu_off(state); - - pr_crit("unable to power off CPU%u (%d)\n", cpu, ret); + psci_ops.cpu_off(state); }
static int cpu_psci_cpu_kill(unsigned int cpu)
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit c7d7fe001b6fb46d2e625c2f28acecf0c2d7dae6
--------------------------------
[ Upstream commit 22843291efc986ce7722610073fcf85a39b4cb13 ]
__sb_start_write has some weird looking lockdep code that claims to exist to handle nested freeze locking requests from xfs. The code as written seems broken -- if we think we hold a read lock on any of the higher freeze levels (e.g. we hold SB_FREEZE_WRITE and are trying to lock SB_FREEZE_PAGEFAULT), it converts a blocking lock attempt into a trylock.
However, it's not correct to downgrade a blocking lock attempt to a trylock unless the downgrading code or the callers are prepared to deal with that situation. Neither __sb_start_write nor its callers handle this at all. For example:
sb_start_pagefault ignores the return value completely, with the result that if xfs_filemap_fault loses a race with a different thread trying to fsfreeze, it will proceed without pagefault freeze protection (thereby breaking locking rules) and then unlocks the pagefault freeze lock that it doesn't own on its way out (thereby corrupting the lock state), which leads to a system hang shortly afterwards.
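A rough sketch of the calling pattern described above (modelled on the xfs fault path, simplified; not the actual code):
#include <linux/fs.h>
#include <linux/mm.h>
static vm_fault_t example_filemap_fault(struct vm_fault *vmf)
{
	struct super_block *sb = file_inode(vmf->vma->vm_file)->i_sb;
	vm_fault_t ret;
	sb_start_pagefault(sb);	/* void wrapper: a downgraded, failed
				 * trylock inside __sb_start_write()
				 * cannot be reported to the caller */
	ret = filemap_fault(vmf);	/* may run without freeze protection */
	sb_end_pagefault(sb);	/* and this drops a lock we never took */
	return ret;
}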
Normally, this won't happen because our ownership of a read lock on a higher freeze protection level blocks fsfreeze from grabbing a write lock on that higher level. *However*, if lockdep is offline, lock_is_held_type unconditionally returns 1, which means that percpu_rwsem_is_held returns 1, which means that __sb_start_write unconditionally converts blocking freeze lock attempts into trylocks, even when we *don't* hold anything that would block a fsfreeze.
Apparently this all held together until 5.10-rc1, when bugs in lockdep caused lockdep to shut itself off early in an fstests run, and once fstests gets to the "race writes with freezer" tests, kaboom. This might explain the long trail of vanishingly infrequent livelocks in fstests after lockdep goes offline that I've never been able to diagnose.
We could fix it by spinning on the trylock if wait==true, but AFAICT the locking works fine if lockdep is not built at all (and I didn't see any complaints running fstests overnight), so remove this snippet entirely.
NOTE: Commit f4b554af9931 in 2015 created the current weird logic (which used to exist in a different form in commit 5accdf82ba25c from 2012) in __sb_start_write. XFS solved this whole problem in the late 2.6 era by creating a variant of transactions (XFS_TRANS_NO_WRITECOUNT) that don't grab intwrite freeze protection, thus making lockdep's solution unnecessary. The commit claims that Dave Chinner explained that the trylock hack + comment could be removed, but nobody ever did.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/super.c | 33 ++++----------------------------- 1 file changed, 4 insertions(+), 29 deletions(-)
diff --git a/fs/super.c b/fs/super.c index f3a8c008e164..9fb4553c46e6 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1360,36 +1360,11 @@ EXPORT_SYMBOL(__sb_end_write); */ int __sb_start_write(struct super_block *sb, int level, bool wait) { - bool force_trylock = false; - int ret = 1; + if (!wait) + return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
-#ifdef CONFIG_LOCKDEP - /* - * We want lockdep to tell us about possible deadlocks with freezing - * but it's it bit tricky to properly instrument it. Getting a freeze - * protection works as getting a read lock but there are subtle - * problems. XFS for example gets freeze protection on internal level - * twice in some cases, which is OK only because we already hold a - * freeze protection also on higher level. Due to these cases we have - * to use wait == F (trylock mode) which must not fail. - */ - if (wait) { - int i; - - for (i = 0; i < level - 1; i++) - if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) { - force_trylock = true; - break; - } - } -#endif - if (wait && !force_trylock) - percpu_down_read(sb->s_writers.rw_sem + level-1); - else - ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1); - - WARN_ON(force_trylock && !ret); - return ret; + percpu_down_read(sb->s_writers.rw_sem + level-1); + return 1; } EXPORT_SYMBOL(__sb_start_write);
From: Leo Yan leo.yan@linaro.org
stable inclusion from linux-4.19.160 commit c4d54937576e5b9cdedb84e371eedf1f0ba67c0b
--------------------------------
[ Upstream commit b0e5a05cc9e37763c7f19366d94b1a6160c755bc ]
When executing the command "perf lock report", it hits a failure and outputs the following log:
perf: builtin-lock.c:623: report_lock_release_event: Assertion `!(seq->read_count < 0)' failed. Aborted
This is an imbalance issue. The locking sequence structure "lock_seq_stat" contains the reader counter, which is used to check whether the locking sequence is balanced between acquiring and releasing.
If the tool wrongly frees "lock_seq_stat" while "read_count" isn't zero, "read_count" will be reset to zero when a new structure is allocated the next time; this causes wrong reader counting and finally results in the imbalance issue.
To fix this issue, if "read_count" is found to be non-zero (meaning there are still readers in the locking sequence), go to the "end" tag to skip freeing the "lock_seq_stat" structure.
Fixes: e4cef1f65061 ("perf lock: Fix state machine to recognize lock sequence") Signed-off-by: Leo Yan leo.yan@linaro.org Acked-by: Jiri Olsa jolsa@redhat.com Link: https://lore.kernel.org/r/20201104094229.17509-2-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/perf/builtin-lock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c index 6e0189df2b3b..0cb7f7b731fb 100644 --- a/tools/perf/builtin-lock.c +++ b/tools/perf/builtin-lock.c @@ -620,7 +620,7 @@ static int report_lock_release_event(struct perf_evsel *evsel, case SEQ_STATE_READ_ACQUIRED: seq->read_count--; BUG_ON(seq->read_count < 0); - if (!seq->read_count) { + if (seq->read_count) { ls->nr_release++; goto end; }
From: Yi-Hung Wei yihung.wei@gmail.com
stable inclusion from linux-4.19.160 commit 901e04cd478133ba6458fecbd3b232b528d26f3b
--------------------------------
[ Upstream commit 9c2e14b48119b39446031d29d994044ae958d8fc ]
Currently, we may set a tunnel option flag even when the size of the metadata is zero. For example, we set TUNNEL_GENEVE_OPT in the receive function regardless of whether a geneve option is actually present. As this may cause issues for consumers of the tunnel flags, this patch fixes the issue.
Related discussion: * https://lore.kernel.org/netdev/1604448694-19351-1-git-send-email-yihung.wei@...
Fixes: 256c87c17c53 ("net: check tunnel option type in tunnel flags") Signed-off-by: Yi-Hung Wei yihung.wei@gmail.com Link: https://lore.kernel.org/r/1605053800-74072-1-git-send-email-yihung.wei@gmail... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/geneve.c | 3 +-- include/net/ip_tunnels.h | 7 ++++--- 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index fdb251e1f471..e3b15daeedfe 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -223,8 +223,7 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs, if (ip_tunnel_collect_metadata() || gs->collect_md) { __be16 flags;
- flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT | - (gnvh->oam ? TUNNEL_OAM : 0) | + flags = TUNNEL_KEY | (gnvh->oam ? TUNNEL_OAM : 0) | (gnvh->critical ? TUNNEL_CRIT_OPT : 0);
tun_dst = udp_tun_rx_dst(skb, geneve_get_sk_family(gs), flags, diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index e11423530d64..f8873c4eb003 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -489,9 +489,11 @@ static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info, const void *from, int len, __be16 flags) { - memcpy(ip_tunnel_info_opts(info), from, len); info->options_len = len; - info->key.tun_flags |= flags; + if (len > 0) { + memcpy(ip_tunnel_info_opts(info), from, len); + info->key.tun_flags |= flags; + } }
static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate) @@ -537,7 +539,6 @@ static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info, __be16 flags) { info->options_len = 0; - info->key.tun_flags |= flags; }
#endif /* CONFIG_INET */
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit 804b269c252445d3b98497c8b02db0152c65cd63
--------------------------------
[ Upstream commit e95b6c3ef1311dd7b20467d932a24b6d0fd88395 ]
The comment and logic in xchk_btree_check_minrecs for dealing with inode-rooted btrees aren't quite correct. While the direct children of the inode root are allowed to have fewer records than would normally be allowed for a regular ondisk btree block, this is only true if there is only one child block and the number of records doesn't fit in the inode root.
Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/scrub/btree.c | 45 ++++++++++++++++++++++++++------------------ 1 file changed, 27 insertions(+), 18 deletions(-)
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 4ae959f7ad2c..c924fe3cdad6 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -450,32 +450,41 @@ xchk_btree_check_minrecs( int level, struct xfs_btree_block *block) { - unsigned int numrecs; - int ok_level; - - numrecs = be16_to_cpu(block->bb_numrecs); + struct xfs_btree_cur *cur = bs->cur; + unsigned int root_level = cur->bc_nlevels - 1; + unsigned int numrecs = be16_to_cpu(block->bb_numrecs);
/* More records than minrecs means the block is ok. */ - if (numrecs >= bs->cur->bc_ops->get_minrecs(bs->cur, level)) + if (numrecs >= cur->bc_ops->get_minrecs(cur, level)) return;
/* - * Certain btree blocks /can/ have fewer than minrecs records. Any - * level greater than or equal to the level of the highest dedicated - * btree block are allowed to violate this constraint. - * - * For a btree rooted in a block, the btree root can have fewer than - * minrecs records. If the btree is rooted in an inode and does not - * store records in the root, the direct children of the root and the - * root itself can have fewer than minrecs records. + * For btrees rooted in the inode, it's possible that the root block + * contents spilled into a regular ondisk block because there wasn't + * enough space in the inode root. The number of records in that + * child block might be less than the standard minrecs, but that's ok + * provided that there's only one direct child of the root. */ - ok_level = bs->cur->bc_nlevels - 1; - if (bs->cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) - ok_level--; - if (level >= ok_level) + if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) && + level == cur->bc_nlevels - 2) { + struct xfs_btree_block *root_block; + struct xfs_buf *root_bp; + int root_maxrecs; + + root_block = xfs_btree_get_block(cur, root_level, &root_bp); + root_maxrecs = cur->bc_ops->get_dmaxrecs(cur, root_level); + if (be16_to_cpu(root_block->bb_numrecs) != 1 || + numrecs <= root_maxrecs) + xchk_btree_set_corrupt(bs->sc, cur, level); return; + }
- xchk_btree_set_corrupt(bs->sc, bs->cur, level); + /* + * Otherwise, only the root level is allowed to have fewer than minrecs + * records or keyptrs. + */ + if (level < root_level) + xchk_btree_set_corrupt(bs->sc, cur, level); }
/*
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit d88cc6f344a2b4a8095a0c8f392b4fa632707276
--------------------------------
[ Upstream commit 498fe261f0d6d5189f8e11d283705dd97b474b54 ]
We always know the correct state of the rmap record flags (attr, bmbt, unwritten) so check them by direct comparison.
Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/scrub/bmap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index d1fc8a3fb891..dcc7e65bb7be 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -207,13 +207,13 @@ xchk_bmap_xref_rmap( * which doesn't track unwritten state. */ if (owner != XFS_RMAP_OWN_COW && - irec->br_state == XFS_EXT_UNWRITTEN && - !(rmap.rm_flags & XFS_RMAP_UNWRITTEN)) + !!(irec->br_state == XFS_EXT_UNWRITTEN) != + !!(rmap.rm_flags & XFS_RMAP_UNWRITTEN)) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff);
- if (info->whichfork == XFS_ATTR_FORK && - !(rmap.rm_flags & XFS_RMAP_ATTR_FORK)) + if (!!(info->whichfork == XFS_ATTR_FORK) != + !!(rmap.rm_flags & XFS_RMAP_ATTR_FORK)) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); if (rmap.rm_flags & XFS_RMAP_BMBT_BLOCK)
From: Luo Meng luomeng12@huawei.com
stable inclusion from linux-4.19.160 commit 730b192ad262545a269483609f52d29bff503507
--------------------------------
[ Upstream commit 2801a5da5b25b7af9dd2addd19b2315c02d17b64 ]
Fix a mutex_unlock() imbalance: when copy_from_user() fails, the error path called mutex_unlock() even though the mutex had not been locked yet.
Fixes: 4b1a29a7f542 ("error-injection: Support fault injection framework") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Luo Meng luomeng12@huawei.com Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Alexei Starovoitov ast@kernel.org Acked-by: Masami Hiramatsu mhiramat@kernel.org Link: https://lore.kernel.org/bpf/160570737118.263807.8358435412898356284.stgit@de... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/fail_function.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/fail_function.c b/kernel/fail_function.c index bc80a4e268c0..a52151a2291f 100644 --- a/kernel/fail_function.c +++ b/kernel/fail_function.c @@ -261,7 +261,7 @@ static ssize_t fei_write(struct file *file, const char __user *buffer,
if (copy_from_user(buf, buffer, count)) { ret = -EFAULT; - goto out; + goto out_free; } buf[count] = '\0'; sym = strstrip(buf); @@ -315,8 +315,9 @@ static ssize_t fei_write(struct file *file, const char __user *buffer, ret = count; } out: - kfree(buf); mutex_unlock(&fei_lock); +out_free: + kfree(buf); return ret; }
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.160 commit 088ad76f42062164775af47464f56c7fc6a13fca
--------------------------------
[ Upstream commit eb8409071a1d47e3593cfe077107ac46853182ab ]
This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f.
Your maintainer committed a major braino in the rmap code by adding the attr fork, bmbt, and unwritten extent usage bits into rmap record key comparisons. While XFS uses the usage bits *in the rmap records* for cross-referencing metadata in xfs_scrub and xfs_repair, it only needs the owner and offset information to distinguish between reverse mappings of the same physical extent into the data fork of a file at multiple offsets. The other bits are not important for key comparisons for index lookups, and never have been.
Eric Sandeen reports that this causes regressions in generic/299, so undo this patch before it does more damage.
Reported-by: Eric Sandeen sandeen@sandeen.net Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Eric Sandeen sandeen@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/libxfs/xfs_rmap_btree.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index 77528f413286..f79cf040d745 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -247,8 +247,8 @@ xfs_rmapbt_key_diff( else if (y > x) return -1;
- x = be64_to_cpu(kp->rm_offset); - y = xfs_rmap_irec_offset_pack(rec); + x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset)); + y = rec->rm_offset; if (x > y) return 1; else if (y > x) @@ -279,8 +279,8 @@ xfs_rmapbt_diff_two_keys( else if (y > x) return -1;
- x = be64_to_cpu(kp1->rm_offset); - y = be64_to_cpu(kp2->rm_offset); + x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset)); + y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset)); if (x > y) return 1; else if (y > x) @@ -393,8 +393,8 @@ xfs_rmapbt_keys_inorder( return 1; else if (a > b) return 0; - a = be64_to_cpu(k1->rmap.rm_offset); - b = be64_to_cpu(k2->rmap.rm_offset); + a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)); + b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)); if (a <= b) return 1; return 0; @@ -423,8 +423,8 @@ xfs_rmapbt_recs_inorder( return 1; else if (a > b) return 0; - a = be64_to_cpu(r1->rmap.rm_offset); - b = be64_to_cpu(r2->rmap.rm_offset); + a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)); + b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)); if (a <= b) return 1; return 0;
From: Yicong Yang yangyicong@hisilicon.com
stable inclusion from linux-4.19.160 commit 7dfd3751915578b219cf1a25e312569fe0571739
--------------------------------
[ Upstream commit 488dac0c9237647e9b8f788b6a342595bfa40bda ]
The attr->set() callback receives a value of u64, but simple_strtoll() is used for the conversion. This leads to an erroneous cast if the user inputs a negative value.

Use kstrtoull() instead of simple_strtoll() to convert a string obtained from the user to an unsigned value. The former will return '-EINVAL' if it gets a negative value, but the latter can't handle the situation correctly. Make 'val' unsigned long long to match what kstrtoull() takes; this eliminates the compile warning on non-64-bit architectures.
Fixes: f7b88631a897 ("fs/libfs.c: fix simple_attr_write() on 32bit machines") Signed-off-by: Yicong Yang yangyicong@hisilicon.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Al Viro viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisil... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/libfs.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c index 4bb80685ee6e..608566e38d12 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -934,7 +934,7 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf, size_t len, loff_t *ppos) { struct simple_attr *attr; - u64 val; + unsigned long long val; size_t size; ssize_t ret;
@@ -952,7 +952,9 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf, goto out;
attr->set_buf[size] = '\0'; - val = simple_strtoll(attr->set_buf, NULL, 0); + ret = kstrtoull(attr->set_buf, 0, &val); + if (ret) + goto out; ret = attr->set(attr->data, val); if (ret == 0) ret = len; /* on success, claim we got the whole input */
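A hedged userspace illustration of the behavioural difference the patch relies on: simple_strtoll() happily parses a negative string and the result is then cast to the unsigned attribute value, while kstrtoull() rejects anything that is not a plain non-negative integer with -EINVAL. demo_kstrtoull() below only mimics that rejection rule; it is not the kernel implementation.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static int demo_kstrtoull(const char *s, unsigned int base,
			  unsigned long long *res)
{
	char *end;

	if (s[0] == '-')	/* kstrtoull() refuses negative input */
		return -EINVAL;
	errno = 0;
	*res = strtoull(s, &end, base);
	if (errno || end == s || *end != '\0')
		return -EINVAL;
	return 0;
}

int main(void)
{
	unsigned long long val = 0;

	/* Old behaviour: "-1" parses fine and becomes a huge u64. */
	printf("simple_strtoll()-style: %llu\n",
	       (unsigned long long)strtoll("-1", NULL, 0));

	/* New behaviour: the write is rejected before attr->set() runs. */
	printf("kstrtoull()-style: %d\n", demo_kstrtoull("-1", 0, &val));
	return 0;
}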
From: Vamshi K Sthambamkadi vamshi.k.sthambamkadi@gmail.com
stable inclusion from linux-4.19.160 commit 3a07581b04648c8acb634045c7060db392853efe
--------------------------------
commit fe5186cf12e30facfe261e9be6c7904a170bd822 upstream.
kmemleak report: unreferenced object 0xffff9b8915fcb000 (size 4096): comm "efivarfs.sh", pid 2360, jiffies 4294920096 (age 48.264s) hex dump (first 32 bytes): 2d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 -............... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000cc4d897c>] kmem_cache_alloc_trace+0x155/0x4b0 [<000000007d1dfa72>] efivarfs_create+0x6e/0x1a0 [<00000000e6ee18fc>] path_openat+0xe4b/0x1120 [<000000000ad0414f>] do_filp_open+0x91/0x100 [<00000000ce93a198>] do_sys_openat2+0x20c/0x2d0 [<000000002a91be6d>] do_sys_open+0x46/0x80 [<000000000a854999>] __x64_sys_openat+0x20/0x30 [<00000000c50d89c9>] do_syscall_64+0x38/0x90 [<00000000cecd6b5f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
In efivarfs_create(), inode->i_private is set up with an efivar_entry object which is never freed.
Cc: stable@vger.kernel.org Signed-off-by: Vamshi K Sthambamkadi vamshi.k.sthambamkadi@gmail.com Link: https://lore.kernel.org/r/20201023115429.GA2479@cosmos Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/efivarfs/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c index 834615f13f3e..7808a26bd33f 100644 --- a/fs/efivarfs/super.c +++ b/fs/efivarfs/super.c @@ -23,6 +23,7 @@ LIST_HEAD(efivarfs_list); static void efivarfs_evict_inode(struct inode *inode) { clear_inode(inode); + kfree(inode->i_private); }
static const struct super_operations efivarfs_ops = {
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.160 commit f2a32b26684550173f9fdd326923c0799b552916
--------------------------------
commit f902b216501094495ff75834035656e8119c537f upstream.
The idea of the warning in ext4_update_dx_flag() is that we should warn when we are clearing EXT4_INODE_INDEX on a filesystem with metadata checksums enabled since after clearing the flag, checksums for internal htree nodes will become invalid. So there's no need to warn (or actually do anything) when EXT4_INODE_INDEX is not set.
Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz Fixes: 48a34311953d ("ext4: fix checksum errors with indexed dirs") Reported-by: Eric Biggers ebiggers@kernel.org Reviewed-by: Eric Biggers ebiggers@google.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/ext4.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 03cf8faf51b9..b2e876bf0dec 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2410,7 +2410,8 @@ void ext4_insert_dentry(struct inode *inode, struct ext4_filename *fname); static inline void ext4_update_dx_flag(struct inode *inode) { - if (!ext4_has_feature_dir_index(inode->i_sb)) { + if (!ext4_has_feature_dir_index(inode->i_sb) && + ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) { /* ext4_iget() should have caught this... */ WARN_ON_ONCE(ext4_has_feature_metadata_csum(inode->i_sb)); ext4_clear_inode_flag(inode, EXT4_INODE_INDEX);
From: Gerald Schaefer gerald.schaefer@linux.ibm.com
stable inclusion from linux-4.19.160 commit dc94f8e932b596926649fa2778eff8d90495e0ff
--------------------------------
commit bfe8cc1db02ab243c62780f17fc57f65bde0afe1 upstream.
Alexander reported a syzkaller / KASAN finding on s390, see below for complete output.
In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be freed in some cases. In the case of userfaultfd_missing(), this will happen after calling handle_userfault(), which might have released the mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will access an unstable vma->vm_mm, which could have been freed or re-used already.
For all architectures other than s390 this will go w/o any negative impact, because pte_free() simply frees the page and ignores the passed-in mm. The implementation for SPARC32 would also access mm->page_table_lock for pte_free(), but there is no THP support in SPARC32, so the buggy code path will not be used there.
For s390, the mm->context.pgtable_list is being used to maintain the 2K pagetable fragments, and operating on an already freed or even re-used mm could result in various more or less subtle bugs due to list / pagetable corruption.
Fix this by calling pte_free() before handle_userfault(), similar to how it is already done in __do_huge_pmd_anonymous_page() for the WRITE / non-huge_zero_page case.
Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults") actually introduced both, the do_huge_pmd_anonymous_page() and also __do_huge_pmd_anonymous_page() changes wrt to calling handle_userfault(), but only in the latter case it put the pte_free() before calling handle_userfault().
BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744 Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334
CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0 Hardware name: IBM 3906 M04 701 (KVM/Linux) Call Trace: do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744 create_huge_pmd mm/memory.c:4256 [inline] __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480 handle_mm_fault+0x288/0x748 mm/memory.c:4607 do_exception+0x394/0xae0 arch/s390/mm/fault.c:479 do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567 pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706 copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline] raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174 _copy_from_user+0x48/0xa8 lib/usercopy.c:16 copy_from_user include/linux/uaccess.h:192 [inline] __do_sys_sigaltstack kernel/signal.c:4064 [inline] __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060 system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
Allocated by task 9334: slab_alloc_node mm/slub.c:2891 [inline] slab_alloc mm/slub.c:2899 [inline] kmem_cache_alloc+0x118/0x348 mm/slub.c:2904 vm_area_dup+0x9c/0x2b8 kernel/fork.c:356 __split_vma+0xba/0x560 mm/mmap.c:2742 split_vma+0xca/0x108 mm/mmap.c:2800 mlock_fixup+0x4ae/0x600 mm/mlock.c:550 apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619 do_mlock+0x1aa/0x718 mm/mlock.c:711 __do_sys_mlock2 mm/mlock.c:738 [inline] __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728 system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
Freed by task 9333: slab_free mm/slub.c:3142 [inline] kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158 __vma_adjust+0x7b2/0x2508 mm/mmap.c:960 vma_merge+0x87e/0xce0 mm/mmap.c:1209 userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868 __fput+0x22c/0x7a8 fs/file_table.c:281 task_work_run+0x200/0x320 kernel/task_work.c:151 tracehook_notify_resume include/linux/tracehook.h:188 [inline] do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538 system_call+0xe6/0x28c arch/s390/kernel/entry.S:416
The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200 The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10) The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab) raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00 raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501 page dumped because: kasan: bad access detected page->mem_cgroup:0000000096959501
Memory state around the buggy address: 00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ 00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ==================================================================
Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults") Reported-by: Alexander Egorenkov egorenar@linux.ibm.com Signed-off-by: Gerald Schaefer gerald.schaefer@linux.ibm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Andrea Arcangeli aarcange@redhat.com Cc: Heiko Carstens hca@linux.ibm.com Cc: stable@vger.kernel.org [4.3+] Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.c... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/huge_memory.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ffc4470e65f0..816a7fd3c6ff 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -702,7 +702,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) transparent_hugepage_use_zero_page()) { pgtable_t pgtable; struct page *zero_page; - bool set; vm_fault_t ret; pgtable = pte_alloc_one(vma->vm_mm, haddr); if (unlikely(!pgtable)) @@ -715,25 +714,25 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) } vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); ret = 0; - set = false; if (pmd_none(*vmf->pmd)) { ret = check_stable_address_space(vma->vm_mm); if (ret) { spin_unlock(vmf->ptl); + pte_free(vma->vm_mm, pgtable); } else if (userfaultfd_missing(vma)) { spin_unlock(vmf->ptl); + pte_free(vma->vm_mm, pgtable); ret = handle_userfault(vmf, VM_UFFD_MISSING); VM_BUG_ON(ret & VM_FAULT_FALLBACK); } else { set_huge_zero_page(pgtable, vma->vm_mm, vma, haddr, vmf->pmd, zero_page); spin_unlock(vmf->ptl); - set = true; } - } else + } else { spin_unlock(vmf->ptl); - if (!set) pte_free(vma->vm_mm, pgtable); + } return ret; } gfp = alloc_hugepage_direct_gfpmask(vma);
From: Lijie lijie34@huawei.com
driver inclusion category: bugfix bugzilla: NA Link: https://gitee.com/openeuler/kernel/issues/I1WGZE
--------------------------------
Set CONFIG_NVME_MULTIPATH to 'y' by default in the openeuler_defconfig.
Reviewed-by: Chao Leng lengchao@huawei.com Reviewed-by: Jike Cheng chengjike.cheng@huawei.com Signed-off-by: Lijie lijie34@huawei.com Acked-by: Hanjun Guo guohanjun@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/configs/openeuler_defconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index e1f2135f2b5b..ad321302cf2a 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -1956,7 +1956,7 @@ CONFIG_BLK_DEV_RBD=m # CONFIG_NVME_CORE=m CONFIG_BLK_DEV_NVME=m -# CONFIG_NVME_MULTIPATH is not set +CONFIG_NVME_MULTIPATH=y CONFIG_NVME_FABRICS=m CONFIG_NVME_RDMA=m CONFIG_NVME_FC=m
From: Fang Lijun fanglijun3@huawei.com
ascend inclusion category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
DVPP uses the MAP_CHECKNODE flag to enable the hugetlb NUMA node check. The global variable 'numanode' made mmap() non-reentrant, so pass the node id in flag bits [26:31] directly instead.
Fixes: cbdbfc7514ab ("mm: Check numa node hugepages enough when mmap hugetlb") Signed-off-by: Fang Lijun fanglijun3@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/hugetlbfs/inode.c | 2 +- include/linux/mm.h | 2 ++ include/uapi/asm-generic/mman.h | 1 + mm/hugetlb.c | 2 +- mm/mmap.c | 28 ++++++++++++++++------------ 5 files changed, 21 insertions(+), 14 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 7a8994585b89..a878f009347c 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -211,7 +211,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) inode_lock(inode); file_accessed(file);
- if (is_set_cdmmask()) { + if (is_set_cdmmask() && (vma->vm_flags & VM_CHECKNODE)) { ret = hugetlb_checknode(vma, len >> huge_page_shift(h)); if (ret < 0) goto out; diff --git a/include/linux/mm.h b/include/linux/mm.h index 679860bb2238..7ea591ec38c4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -234,6 +234,8 @@ extern unsigned int kobjsize(const void *objp); #define VM_CDM 0x100000000 /* Contains coherent device memory */ #endif
+#define VM_CHECKNODE 0x200000000 + #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS #define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */ diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h index 1915bcc107eb..233a5e82407c 100644 --- a/include/uapi/asm-generic/mman.h +++ b/include/uapi/asm-generic/mman.h @@ -5,6 +5,7 @@ #include <asm-generic/mman-common.h>
#define MAP_GROWSDOWN 0x0100 /* stack-like segment */ +#define MAP_CHECKNODE 0x0400 /* hugetlb numa node check */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ #define MAP_LOCKED 0x2000 /* pages are locked */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index efb68abe9f35..327d24f0cf0d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -970,7 +970,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { SetPagePrivate(page); h->resv_huge_pages--; - if (is_set_cdmmask()) + if (is_set_cdmmask() && (vma->vm_flags & VM_CHECKNODE)) h->resv_huge_pages_node[vma->vm_flags >> CHECKNODE_BITS]--; }
diff --git a/mm/mmap.c b/mm/mmap.c index 00702331afc1..6197c5590ded 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -69,7 +69,6 @@ const int mmap_rnd_compat_bits_max = CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX; int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS; #endif
-static unsigned long numanode; static bool ignore_rlimit_data; core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
@@ -1565,8 +1564,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr, /* set numa node id into vm_flags, * hugetlbfs file mmap will use it to check node */ - if (is_set_cdmmask()) - vm_flags |= ((numanode << CHECKNODE_BITS) & CHECKNODE_MASK); + if (is_set_cdmmask() && (flags & MAP_CHECKNODE)) + vm_flags |= VM_CHECKNODE | ((((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK) + << CHECKNODE_BITS) & CHECKNODE_MASK);
addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) && @@ -1583,12 +1583,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, struct file *file = NULL; unsigned long retval;
- /* get mmap numa node id */ - if (is_set_cdmmask()) { - numanode = (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK; - flags &= ~(MAP_HUGE_MASK << MAP_HUGE_SHIFT); - } - if (!(flags & MAP_ANONYMOUS)) { audit_mmap_fd(fd, flags); file = fget(fd); @@ -1602,12 +1596,23 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, } else if (flags & MAP_HUGETLB) { struct user_struct *user = NULL; struct hstate *hs; + int page_size_log;
- hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); + /* + * If config cdm node, flags bits [26:31] used for + * mmap hugetlb check node + */ + if (is_set_cdmmask()) + page_size_log = 0; + else + page_size_log = (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK; + + hs = hstate_sizelog(page_size_log); if (!hs) return -EINVAL;
len = ALIGN(len, huge_page_size(hs)); + /* * VM_NORESERVE is used because the reservations will be * taken when vm_ops->mmap() is called @@ -1616,8 +1621,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, */ file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE, - &user, HUGETLB_ANONHUGE_INODE, - (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); + &user, HUGETLB_ANONHUGE_INODE, page_size_log); if (IS_ERR(file)) return PTR_ERR(file); }
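A hedged userspace sketch of the flag encoding described above: the NUMA node id rides in the MAP_HUGE_SHIFT bit field (mmap flag bits [26:31]) together with MAP_CHECKNODE, and the kernel recovers it with the same shift-and-mask that do_mmap() uses in this patch. The DEMO_* constants simply restate the values from the hunk above and from the asm-generic mman headers.

#include <stdio.h>

#define DEMO_MAP_CHECKNODE	0x0400	/* from the hunk above */
#define DEMO_MAP_HUGE_SHIFT	26
#define DEMO_MAP_HUGE_MASK	0x3f	/* bits [26:31] */

int main(void)
{
	unsigned long node = 3;	/* example NUMA node id */
	unsigned long flags = DEMO_MAP_CHECKNODE |
			      ((node & DEMO_MAP_HUGE_MASK) << DEMO_MAP_HUGE_SHIFT);

	/* Kernel side: only honour the field when MAP_CHECKNODE is set. */
	if (flags & DEMO_MAP_CHECKNODE) {
		unsigned long recovered =
			(flags >> DEMO_MAP_HUGE_SHIFT) & DEMO_MAP_HUGE_MASK;
		printf("flags=0x%lx, recovered node id=%lu\n", flags, recovered);
	}
	return 0;
}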
From: Xu Qiang xuqiang36@huawei.com
ascend inclusion category: feature bugzilla: NA CVE: NA
---------------------------------------
For the Ascend platform, automatically enable enable_pseudo_nmi and pmu_nmi_enable.
Signed-off-by: Xu Qiang xuqiang36@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/cpufeature.h | 4 ++++ arch/arm64/kernel/cpufeature.c | 2 +- arch/arm64/mm/init.c | 12 +++++++++++- 3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index f776f8239699..c43d02365fb6 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -91,6 +91,10 @@ extern struct arm64_ftr_reg arm64_ftr_reg_ctrel0;
int arm64_cpu_ftr_regs_traverse(int (*op)(u32, u64, void *), void *argp);
+#ifdef CONFIG_ARM64_PSEUDO_NMI +extern bool enable_pseudo_nmi; +#endif + /* * CPU capabilities: * diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 6a00aa3cc876..822e6a2c0af1 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1230,7 +1230,7 @@ ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, }
#ifdef CONFIG_ARM64_PSEUDO_NMI -static bool enable_pseudo_nmi; +bool enable_pseudo_nmi;
static int __init early_enable_pseudo_nmi(char *p) { diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index f23da539a476..a6b9048ea1a4 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -769,6 +769,9 @@ __setup("keepinitrd", keepinitrd_setup); #endif
#ifdef CONFIG_ASCEND_FEATURES + +#include <linux/perf/arm_pmu.h> + void ascend_enable_all_features(void) { if (IS_ENABLED(CONFIG_ASCEND_DVPP_MMAP)) @@ -782,6 +785,13 @@ void ascend_enable_all_features(void)
if (IS_ENABLED(CONFIG_SUSPEND)) mem_sleep_current = PM_SUSPEND_ON; + + if (IS_ENABLED(CONFIG_PMU_WATCHDOG)) + pmu_nmi_enable = true; + +#ifdef CONFIG_ARM64_PSEUDO_NMI + enable_pseudo_nmi = true; +#endif }
static int __init ascend_enable_setup(char *__unused) @@ -791,7 +801,7 @@ static int __init ascend_enable_setup(char *__unused) return 1; }
-__setup("ascend_enable_all", ascend_enable_setup); +early_param("ascend_enable_all", ascend_enable_setup); #endif
From: Wang Wensheng wangwensheng4@huawei.com
ascend inclusion category: bugfix bugzilla: NA
-------------------
This flag is used by the watchdog core to determine the visibility of the attributes associated with pretimeout when the watchdog core probes the device. If we set the flag after that process, those attributes cannot be created properly.
Fixes: 569f7c7750c1 ("sbsa_gwdt: Introduce a panic notifier") Signed-off-by: Wang Wensheng wangwensheng4@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/watchdog/sbsa_gwdt.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/watchdog/sbsa_gwdt.c b/drivers/watchdog/sbsa_gwdt.c index 00723fb9e4ca..8045a66cdb69 100644 --- a/drivers/watchdog/sbsa_gwdt.c +++ b/drivers/watchdog/sbsa_gwdt.c @@ -214,11 +214,14 @@ static irqreturn_t sbsa_gwdt_interrupt(int irq, void *dev_id) return IRQ_HANDLED; }
-static struct watchdog_info sbsa_gwdt_info = { +static const struct watchdog_info sbsa_gwdt_info = { .identity = WATCHDOG_NAME, .options = WDIOF_SETTIMEOUT | WDIOF_KEEPALIVEPING | WDIOF_MAGICCLOSE | +#ifdef CONFIG_ARM_SBSA_WATCHDOG_PANIC_NOTIFIER + WDIOF_PRETIMEOUT | +#endif WDIOF_CARDRESET, };
@@ -369,7 +372,6 @@ static int sbsa_gwdt_probe(struct platform_device *pdev) * Add WDIOF_PRETIMEOUT flags to enable user to configure it. */ gwdt->wdd.pretimeout = gwdt->wdd.timeout - 1; - sbsa_gwdt_info.options |= WDIOF_PRETIMEOUT;
sbsa_register_panic_notifier(gwdt); }
From: Xu Qiang xuqiang36@huawei.com
mainline inclusion from mainline-v5.10-rc6 commit 74cde1a53368aed4f2b4b54bf7030437f64a534b category: bugfix bugzilla: NA CVE: NA
---------------------------------------
On systems without HW-based collections (i.e. anything except GIC-500), we rely on firmware to perform the ITS save/restore. This doesn't really work, as although FW can properly save everything, it cannot fully restore the state of the command queue (the read-side is reset to the head of the queue). This results in the ITS consuming previously processed commands, potentially corrupting the state.
Instead, let's always save the ITS state on suspend, disabling it in the process, and restore the full state on resume. This saves us from broken FW as long as it doesn't enable the ITS by itself (for which we can't do anything).
This amounts to simply dropping the ITS_FLAGS_SAVE_SUSPEND_STATE.
Signed-off-by: Xu Qiang xuqiang36@huawei.com [maz: added warning on resume, rewrote commit message] Signed-off-by: Marc Zyngier maz@kernel.org Link: https://lore.kernel.org/r/20201107104226.14282-1-xuqiang36@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/irqchip/irq-gic-v3-its.c | 30 +++--------------------------- 1 file changed, 3 insertions(+), 27 deletions(-)
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index f95c8ccff768..03b720aaa054 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -54,8 +54,6 @@ #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING (1ULL << 0) #define ITS_FLAGS_WORKAROUND_CAVIUM_22375 (1ULL << 1) #define ITS_FLAGS_WORKAROUND_CAVIUM_23144 (1ULL << 2) -#define ITS_FLAGS_SAVE_SUSPEND_STATE (1ULL << 3) -#define ITS_FLAGS_SAVE_HIBERNATE_STATE (1ULL << 4)
#define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING (1 << 0) #define RDIST_FLAGS_RD_TABLES_PREALLOCATED (1 << 1) @@ -3617,17 +3615,6 @@ static int its_save_disable(void) raw_spin_lock(&its_lock); list_for_each_entry(its, &its_nodes, entry) { void __iomem *base; - u64 flags; - - if (system_in_hibernation()) - its->flags |= ITS_FLAGS_SAVE_HIBERNATE_STATE; - - flags = its->flags; - flags &= (ITS_FLAGS_SAVE_SUSPEND_STATE | - ITS_FLAGS_SAVE_HIBERNATE_STATE); - - if (!flags) - continue;
base = its->base; its->ctlr_save = readl_relaxed(base + GITS_CTLR); @@ -3647,9 +3634,6 @@ static int its_save_disable(void) list_for_each_entry_continue_reverse(its, &its_nodes, entry) { void __iomem *base;
- if (!(its->flags & ITS_FLAGS_SAVE_SUSPEND_STATE)) - continue; - base = its->base; writel_relaxed(its->ctlr_save, base + GITS_CTLR); } @@ -3667,16 +3651,8 @@ static void its_restore_enable(void) raw_spin_lock(&its_lock); list_for_each_entry(its, &its_nodes, entry) { void __iomem *base; - u64 flags; int i;
- flags = its->flags; - flags &= (ITS_FLAGS_SAVE_SUSPEND_STATE | - ITS_FLAGS_SAVE_HIBERNATE_STATE); - if (!flags) - continue; - - its->flags &= ~ITS_FLAGS_SAVE_HIBERNATE_STATE; base = its->base;
/* @@ -3684,7 +3660,10 @@ static void its_restore_enable(void) * don't restore it since writing to CBASER or BASER<n> * registers is undefined according to the GIC v3 ITS * Specification. + * + * Firmware resuming with the ITS enabled is terminally broken. */ + WARN_ON(readl_relaxed(base + GITS_CTLR) & GITS_CTLR_ENABLE); ret = its_force_quiescent(base); if (ret) { pr_err("ITS@%pa: failed to quiesce on resume: %d\n", @@ -3950,9 +3929,6 @@ static int __init its_probe_one(struct resource *res, ctlr |= GITS_CTLR_ImDe; writel_relaxed(ctlr, its->base + GITS_CTLR);
- if (GITS_TYPER_HCC(typer)) - its->flags |= ITS_FLAGS_SAVE_SUSPEND_STATE; - err = its_init_domain(handle, its); if (err) goto out_free_tables;
From: Ido Schimmel idosch@nvidia.com
stable inclusion from linux-4.19.155 commit 84013ba77c1704c1461b299fbd336d6d6b6d3a9f
--------------------------------
[ Upstream commit adc80b6cfedff6dad8b93d46a5ea2775fd5af9ec ]
Free the devlink instance during the teardown sequence in the non-reload case to avoid the following memory leak.
unreferenced object 0xffff888232895000 (size 2048): comm "modprobe", pid 1073, jiffies 4295568857 (age 164.871s) hex dump (first 32 bytes): 00 01 00 00 00 00 ad de 22 01 00 00 00 00 ad de ........"....... 10 50 89 32 82 88 ff ff 10 50 89 32 82 88 ff ff .P.2.....P.2.... backtrace: [<00000000c704e9a6>] __kmalloc+0x13a/0x2a0 [<00000000ee30129d>] devlink_alloc+0xff/0x760 [<0000000092ab3e5d>] 0xffffffffa042e5b0 [<000000004f3f8a31>] 0xffffffffa042f6ad [<0000000092800b4b>] 0xffffffffa0491df3 [<00000000c4843903>] local_pci_probe+0xcb/0x170 [<000000006993ded7>] pci_device_probe+0x2c2/0x4e0 [<00000000a8e0de75>] really_probe+0x2c5/0xf90 [<00000000d42ba75d>] driver_probe_device+0x1eb/0x340 [<00000000bcc95e05>] device_driver_attach+0x294/0x300 [<000000000e2bc177>] __driver_attach+0x167/0x2f0 [<000000007d44cd6e>] bus_for_each_dev+0x148/0x1f0 [<000000003cd5a91e>] driver_attach+0x45/0x60 [<000000000041ce51>] bus_add_driver+0x3b8/0x720 [<00000000f5215476>] driver_register+0x230/0x4e0 [<00000000d79356f5>] __pci_register_driver+0x190/0x200
Fixes: a22712a96291 ("mlxsw: core: Fix devlink unregister flow") Signed-off-by: Ido Schimmel idosch@nvidia.com Reported-by: Vadim Pasternak vadimp@nvidia.com Tested-by: Oleksandr Shamray oleksandrs@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlxsw/core.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index d8e7ca48753f..1ae8234053c1 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -1111,6 +1111,8 @@ void mlxsw_core_bus_device_unregister(struct mlxsw_core *mlxsw_core, if (!reload) devlink_resources_unregister(devlink, NULL); mlxsw_core->bus->fini(mlxsw_core->bus_priv); + if (!reload) + devlink_free(devlink);
return;
From: Tung Nguyen tung.q.nguyen@dektech.com.au
stable inclusion from linux-4.19.155 commit 48ea03f8bb651ed06c29bad75fcb35b7c2a13d3e
--------------------------------
[ Upstream commit ceb1eb2fb609c88363e06618b8d4bbf7815a4e03 ]
Commit ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") replaced skb_unshare() with skb_copy() to not reduce the data reference counter of the original skb intentionally. This is not the correct way to handle the cloned skb because it causes memory leak in 2 following cases: 1/ Sending multicast messages via broadcast link The original skb list is cloned to the local skb list for local destination. After that, the data reference counter of each skb in the original list has the value of 2. This causes each skb not to be freed after receiving ACK: tipc_link_advance_transmq() { ... /* release skb */ __skb_unlink(skb, &l->transmq); kfree_skb(skb); <-- memory exists after being freed }
2/ Sending multicast messages via replicast link Similar to the above case, each skb cannot be freed after purging the skb list: tipc_mcast_xmit() { ... __skb_queue_purge(pkts); <-- memory exists after being freed }
This commit fixes this issue by using skb_unshare() instead. Besides, to avoid use-after-free error reported by KASAN, the pointer to the fragment is set to NULL before calling skb_unshare() to make sure that the original skb is not freed after freeing the fragment 2 times in case skb_unshare() returns NULL.
Fixes: ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") Acked-by: Jon Maloy jmaloy@redhat.com Reported-by: Thang Hoang Ngo thang.h.ngo@dektech.com.au Signed-off-by: Tung Nguyen tung.q.nguyen@dektech.com.au Reviewed-by: Xin Long lucien.xin@gmail.com Acked-by: Cong Wang xiyou.wangcong@gmail.com Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/tipc/msg.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 0b8446cd541c..f04843ca8216 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -140,12 +140,11 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf) if (fragid == FIRST_FRAGMENT) { if (unlikely(head)) goto err; - if (skb_cloned(frag)) - frag = skb_copy(frag, GFP_ATOMIC); + *buf = NULL; + frag = skb_unshare(frag, GFP_ATOMIC); if (unlikely(!frag)) goto err; head = *headbuf = frag; - *buf = NULL; TIPC_SKB_CB(head)->tail = NULL; if (skb_is_nonlinear(head)) { skb_walk_frags(head, tail) {
From: Amit Cohen amcohen@nvidia.com
stable inclusion from linux-4.19.155 commit 30e91357e3e47c160d51decf4400603ea66b8925
--------------------------------
[ Upstream commit 0daf2bf5a2dcf33d446b76360908f109816e2e21 ]
Each EMAD transaction stores the skb used to issue the EMAD request ('trans->tx_skb') so that the request could be retried in case of a timeout. The skb can be freed when a corresponding response is received or as part of the retry logic (e.g., failed retransmit, exceeded maximum number of retries).
The two tasks (i.e., response processing and retransmits) are synchronized by the atomic 'trans->active' field which ensures that responses to inactive transactions are ignored.
In case of a failed retransmit the transaction is finished and all of its resources are freed. However, the current code does not mark it as inactive. Syzkaller was able to hit a race condition in which a concurrent response is processed while the transaction's resources are being freed, resulting in a use-after-free [1].
Fix the issue by making sure to mark the transaction as inactive after a failed retransmit and free its resources only if a concurrent task did not already do that.
[1] BUG: KASAN: use-after-free in consume_skb+0x30/0x370 net/core/skbuff.c:833 Read of size 4 at addr ffff88804f570494 by task syz-executor.0/1004
CPU: 0 PID: 1004 Comm: syz-executor.0 Not tainted 5.8.0-rc7+ #68 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xf6/0x16e lib/dump_stack.c:118 print_address_description.constprop.0+0x1c/0x250 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x14e/0x1b0 mm/kasan/generic.c:192 instrument_atomic_read include/linux/instrumented.h:56 [inline] atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] refcount_read include/linux/refcount.h:147 [inline] skb_unref include/linux/skbuff.h:1044 [inline] consume_skb+0x30/0x370 net/core/skbuff.c:833 mlxsw_emad_trans_finish+0x64/0x1c0 drivers/net/ethernet/mellanox/mlxsw/core.c:592 mlxsw_emad_process_response drivers/net/ethernet/mellanox/mlxsw/core.c:651 [inline] mlxsw_emad_rx_listener_func+0x5c9/0xac0 drivers/net/ethernet/mellanox/mlxsw/core.c:672 mlxsw_core_skb_receive+0x4df/0x770 drivers/net/ethernet/mellanox/mlxsw/core.c:2063 mlxsw_pci_cqe_rdq_handle drivers/net/ethernet/mellanox/mlxsw/pci.c:595 [inline] mlxsw_pci_cq_tasklet+0x12a6/0x2520 drivers/net/ethernet/mellanox/mlxsw/pci.c:651 tasklet_action_common.isra.0+0x13f/0x3e0 kernel/softirq.c:550 __do_softirq+0x223/0x964 kernel/softirq.c:292 asm_call_on_stack+0x12/0x20 arch/x86/entry/entry_64.S:711
Allocated by task 1006: save_stack+0x1b/0x40 mm/kasan/common.c:48 set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc mm/kasan/common.c:494 [inline] __kasan_kmalloc.constprop.0+0xc2/0xd0 mm/kasan/common.c:467 slab_post_alloc_hook mm/slab.h:586 [inline] slab_alloc_node mm/slub.c:2824 [inline] slab_alloc mm/slub.c:2832 [inline] kmem_cache_alloc+0xcd/0x2e0 mm/slub.c:2837 __build_skb+0x21/0x60 net/core/skbuff.c:311 __netdev_alloc_skb+0x1e2/0x360 net/core/skbuff.c:464 netdev_alloc_skb include/linux/skbuff.h:2810 [inline] mlxsw_emad_alloc drivers/net/ethernet/mellanox/mlxsw/core.c:756 [inline] mlxsw_emad_reg_access drivers/net/ethernet/mellanox/mlxsw/core.c:787 [inline] mlxsw_core_reg_access_emad+0x1ab/0x1420 drivers/net/ethernet/mellanox/mlxsw/core.c:1817 mlxsw_reg_trans_query+0x39/0x50 drivers/net/ethernet/mellanox/mlxsw/core.c:1831 mlxsw_sp_sb_pm_occ_clear drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c:260 [inline] mlxsw_sp_sb_occ_max_clear+0xbff/0x10a0 drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c:1365 mlxsw_devlink_sb_occ_max_clear+0x76/0xb0 drivers/net/ethernet/mellanox/mlxsw/core.c:1037 devlink_nl_cmd_sb_occ_max_clear_doit+0x1ec/0x280 net/core/devlink.c:1765 genl_family_rcv_msg_doit net/netlink/genetlink.c:669 [inline] genl_family_rcv_msg net/netlink/genetlink.c:714 [inline] genl_rcv_msg+0x617/0x980 net/netlink/genetlink.c:731 netlink_rcv_skb+0x152/0x440 net/netlink/af_netlink.c:2470 genl_rcv+0x24/0x40 net/netlink/genetlink.c:742 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x53a/0x750 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x850/0xd90 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0x150/0x190 net/socket.c:671 ____sys_sendmsg+0x6d8/0x840 net/socket.c:2359 ___sys_sendmsg+0xff/0x170 net/socket.c:2413 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2446 do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:384 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 73: save_stack+0x1b/0x40 mm/kasan/common.c:48 set_track mm/kasan/common.c:56 [inline] kasan_set_free_info mm/kasan/common.c:316 [inline] __kasan_slab_free+0x12c/0x170 mm/kasan/common.c:455 slab_free_hook mm/slub.c:1474 [inline] slab_free_freelist_hook mm/slub.c:1507 [inline] slab_free mm/slub.c:3072 [inline] kmem_cache_free+0xbe/0x380 mm/slub.c:3088 kfree_skbmem net/core/skbuff.c:622 [inline] kfree_skbmem+0xef/0x1b0 net/core/skbuff.c:616 __kfree_skb net/core/skbuff.c:679 [inline] consume_skb net/core/skbuff.c:837 [inline] consume_skb+0xe1/0x370 net/core/skbuff.c:831 mlxsw_emad_trans_finish+0x64/0x1c0 drivers/net/ethernet/mellanox/mlxsw/core.c:592 mlxsw_emad_transmit_retry.isra.0+0x9d/0xc0 drivers/net/ethernet/mellanox/mlxsw/core.c:613 mlxsw_emad_trans_timeout_work+0x43/0x50 drivers/net/ethernet/mellanox/mlxsw/core.c:625 process_one_work+0xa3e/0x17a0 kernel/workqueue.c:2269 worker_thread+0x9e/0x1050 kernel/workqueue.c:2415 kthread+0x355/0x470 kernel/kthread.c:291 ret_from_fork+0x22/0x30 arch/x86/entry/entry_64.S:293
The buggy address belongs to the object at ffff88804f5703c0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 212 bytes inside of 224-byte region [ffff88804f5703c0, ffff88804f5704a0) The buggy address belongs to the page: page:ffffea00013d5c00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x100000000000200(slab) raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806c625400 raw: 0000000000000000 00000000000c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff88804f570380: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff88804f570400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88804f570480: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff88804f570500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88804f570580: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
Fixes: caf7297e7ab5f ("mlxsw: core: Introduce support for asynchronous EMAD register access") Signed-off-by: Amit Cohen amcohen@nvidia.com Reviewed-by: Jiri Pirko jiri@nvidia.com Signed-off-by: Ido Schimmel idosch@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlxsw/core.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index 1ae8234053c1..423c3e9925d0 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -488,6 +488,9 @@ static void mlxsw_emad_transmit_retry(struct mlxsw_core *mlxsw_core, err = mlxsw_emad_transmit(trans->core, trans); if (err == 0) return; + + if (!atomic_dec_and_test(&trans->active)) + return; } else { err = -EIO; }
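A hedged userspace model of the ownership rule the one-line fix enforces: both the response path and the failed-retransmit path end the transaction, but only the task whose decrement takes 'active' from 1 to 0 may free the resources, which is what atomic_dec_and_test() expresses in the driver. The types and names below are illustrative, not the mlxsw code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_trans {
	atomic_int active;	/* 1 while the transaction is live */
	char *tx_skb;		/* stands in for the saved request skb */
};

/* Returns true if this caller performed the 1 -> 0 transition and
 * therefore owned the final cleanup. */
static bool demo_trans_finish(struct demo_trans *trans)
{
	if (atomic_fetch_sub(&trans->active, 1) != 1)
		return false;	/* someone else already finished it */
	free(trans->tx_skb);
	trans->tx_skb = NULL;
	return true;
}

int main(void)
{
	struct demo_trans t = { .tx_skb = malloc(16) };

	atomic_init(&t.active, 1);

	/* e.g. the failed-retransmit path runs first ... */
	printf("retry path freed resources: %d\n", demo_trans_finish(&t));
	/* ... and a late response must not free them again. */
	printf("response path freed resources: %d\n", demo_trans_finish(&t));
	return 0;
}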
From: Ilya Dryomov idryomov@gmail.com
stable inclusion from linux-4.19.155 commit 416ea9783571eedcabdc9247653518e1005eca73
--------------------------------
commit 28e1581c3b4ea5f98530064a103c6217bedeea73 upstream.
con->out_msg must be cleared on Policy::stateful_server (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the reconnection attempt, because after writing the banner the messenger moves on to writing the data section of that message (either from where it got interrupted by the connection reset or from the beginning) instead of writing struct ceph_msg_connect. This results in a bizarre error message because the server sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct ceph_msg_connect:
libceph: mds0 (1)172.21.15.45:6828 socket error on write ceph: mds0 reconnect start libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN) libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32 libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch
AFAICT this bug goes back to the dawn of the kernel client. The reason it survived for so long is that only MDS sessions are stateful and only two MDS messages have a data section: CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare) and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved). The connection has to get reset precisely when such message is being sent -- in this case it was the former.
Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/47723 Signed-off-by: Ilya Dryomov idryomov@gmail.com Reviewed-by: Jeff Layton jlayton@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ceph/messenger.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index f7d7f32ac673..21bd37ec5511 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -3037,6 +3037,11 @@ static void con_fault(struct ceph_connection *con) ceph_msg_put(con->in_msg); con->in_msg = NULL; } + if (con->out_msg) { + BUG_ON(con->out_msg->con != con); + ceph_msg_put(con->out_msg); + con->out_msg = NULL; + }
/* Requeue anything that hasn't been acked */ list_splice_init(&con->out_sent, &con->out_queue);
From: Petr Malat oss@malat.biz
stable inclusion from linux-4.19.156 commit 64752f5cfda61aa7ca12d23ca1ecc7d36e996f93
--------------------------------
[ Upstream commit b6df8c81412190fbd5eaa3cec7f642142d9c16cd ]
Commit 978aa0474115 ("sctp: fix some type cast warnings introduced since very beginning") broke reading 'err' from sctp_arg, because it reads the value as a 32-bit integer although the value is stored as a 16-bit integer. Later this value is passed to userspace in a 16-bit variable, so the user always gets 0 on big-endian platforms. Fix it by reading the __u16 field of the sctp_arg union, as reading the err field would produce a sparse warning.
Fixes: 978aa0474115 ("sctp: fix some type cast warnings introduced since very beginning") Signed-off-by: Petr Malat oss@malat.biz Acked-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Link: https://lore.kernel.org/r/20201030132633.7045-1-oss@malat.biz Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sctp/sm_sideeffect.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c index 567517e44811..bb4f1d0da153 100644 --- a/net/sctp/sm_sideeffect.c +++ b/net/sctp/sm_sideeffect.c @@ -1615,12 +1615,12 @@ static int sctp_cmd_interpreter(enum sctp_event event_type, break;
case SCTP_CMD_INIT_FAILED: - sctp_cmd_init_failed(commands, asoc, cmd->obj.u32); + sctp_cmd_init_failed(commands, asoc, cmd->obj.u16); break;
case SCTP_CMD_ASSOC_FAILED: sctp_cmd_assoc_failed(commands, asoc, event_type, - subtype, chunk, cmd->obj.u32); + subtype, chunk, cmd->obj.u16); break;
case SCTP_CMD_INIT_COUNTER_INC:
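A hedged userspace demonstration of why the 32-bit read returned 0 to userspace on big-endian machines: a 16-bit value stored through one union member sits in the upper half of the 32-bit view there, so truncating the 32-bit read back to 16 bits loses it. The union below is a simplified stand-in for the kernel's sctp_arg, not its real definition.

#include <stdint.h>
#include <stdio.h>

union demo_arg {
	uint16_t u16;
	uint32_t u32;
};

int main(void)
{
	union demo_arg a;

	a.u32 = 0;
	a.u16 = 0x00ab;		/* e.g. an SCTP error code stored as 16 bits */

	/* Little-endian: (a.u32 & 0xffff) == 0x00ab, so the bug stays hidden.
	 * Big-endian:    a.u32 == 0x00ab0000, so the truncated read is 0. */
	printf("read through u16: 0x%04x\n", a.u16);
	printf("read through u32, truncated to 16 bits: 0x%04x\n",
	       (unsigned int)(a.u32 & 0xffff));
	return 0;
}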
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.158 commit cf9913775901db56dd5c5c4488c3c3f47d5ff4e0
--------------------------------
[ Upstream commit fa6882c63621821f73cc806f291208e1c6ea6187 ]
kmemleak report a memory leak as follows:
unreferenced object 0xffff88810a596800 (size 512): comm "ip", pid 21558, jiffies 4297568990 (age 112.120s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff 00 83 60 b0 ff ff ff ff ..........`..... backtrace: [<0000000022bbe21f>] tipc_topsrv_init_net+0x1f3/0xa70 [<00000000fe15ddf7>] ops_init+0xa8/0x3c0 [<00000000138af6f2>] setup_net+0x2de/0x7e0 [<000000008c6807a3>] copy_net_ns+0x27d/0x530 [<000000006b21adbd>] create_new_namespaces+0x382/0xa30 [<00000000bb169746>] unshare_nsproxy_namespaces+0xa1/0x1d0 [<00000000fe2e42bc>] ksys_unshare+0x39c/0x780 [<0000000009ba3b19>] __x64_sys_unshare+0x2d/0x40 [<00000000614ad866>] do_syscall_64+0x56/0xa0 [<00000000a1b5ca3c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
'srv' is allocated with kmalloc() in tipc_topsrv_start() but not freed before leaving from the error handling cases. We need to free it.
Fixes: 5c45ab24ac77 ("tipc: make struct tipc_server private for server.c") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201109140913.47370-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/tipc/topsrv.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index ec9a7137d267..1c4733153d74 100644 --- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -671,12 +671,18 @@ static int tipc_topsrv_start(struct net *net)
ret = tipc_topsrv_work_start(srv); if (ret < 0) - return ret; + goto err_start;
ret = tipc_topsrv_create_listener(srv); if (ret < 0) - tipc_topsrv_work_stop(srv); + goto err_create;
+ return 0; + +err_create: + tipc_topsrv_work_stop(srv); +err_start: + kfree(srv); return ret; }
From: Zhang Changzhong zhangchangzhong@huawei.com
stable inclusion from linux-4.19.160 commit f8c18eab5413d1edbfdeda8e8102780c5af8c64d
--------------------------------
[ Upstream commit a5ebcbdf34b65fcc07f38eaf2d60563b42619a59 ]
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Zhang Changzhong zhangchangzhong@huawei.com Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/ah6.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c index 78c974391567..2b68bd7c83c4 100644 --- a/net/ipv6/ah6.c +++ b/net/ipv6/ah6.c @@ -600,7 +600,8 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff *skb) memcpy(auth_data, ah->auth_data, ahp->icv_trunc_len); memset(ah->auth_data, 0, ahp->icv_trunc_len);
- if (ipv6_clear_mutable_options(ip6h, hdr_len, XFRM_POLICY_IN)) + err = ipv6_clear_mutable_options(ip6h, hdr_len, XFRM_POLICY_IN); + if (err) goto out_free;
ip6h->priority = 0;
From: Ido Schimmel idosch@nvidia.com
stable inclusion from linux-4.19.160 commit 5ca57e8ee2ccf5dcab7a6be6db8b23f7c26f0cb7
--------------------------------
[ Upstream commit 1f492eab67bced119a0ac7db75ef2047e29a30c6 ]
The driver sends Ethernet Management Datagram (EMAD) packets to the device for configuration purposes and waits for up to 200ms for a reply. A request is retried up to 5 times.
When the system is under heavy load, replies are not always processed in time and EMAD transactions fail.
Make the process more robust to such delays by using exponential backoff. First wait for up to 200ms, then retransmit and wait for up to 400ms and so on.
Fixes: caf7297e7ab5 ("mlxsw: core: Introduce support for asynchronous EMAD register access") Reported-by: Denis Yulevich denisyu@nvidia.com Tested-by: Denis Yulevich denisyu@nvidia.com Signed-off-by: Ido Schimmel idosch@nvidia.com Reviewed-by: Jiri Pirko jiri@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlxsw/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index 423c3e9925d0..049ca4ba49de 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -439,7 +439,8 @@ static void mlxsw_emad_trans_timeout_schedule(struct mlxsw_reg_trans *trans) if (trans->core->fw_flash_in_progress) timeout = msecs_to_jiffies(MLXSW_EMAD_TIMEOUT_DURING_FW_FLASH_MS);
- queue_delayed_work(trans->core->emad_wq, &trans->timeout_dw, timeout); + queue_delayed_work(trans->core->emad_wq, &trans->timeout_dw, + timeout << trans->retries); }
static int mlxsw_emad_transmit(struct mlxsw_core *mlxsw_core,
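To illustrate the exponential backoff described above, here is a minimal sketch (not the driver code; the function name is hypothetical): with a 200 ms base timeout, left-shifting by the retry count doubles the wait on every retransmission, matching the 200 ms, 400 ms, ... progression.

static unsigned long emad_backoff_ms(unsigned long base_ms, unsigned int retries)
{
	/* retries 0..4 yield 200, 400, 800, 1600, 3200 ms for a 200 ms base */
	return base_ms << retries;
}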
From: Florian Fainelli f.fainelli@gmail.com
stable inclusion from linux-4.19.160 commit 24184dfd51c9fbe0e478570f4cb3c3fbc8d1b594
--------------------------------
[ Upstream commit 1532b9778478577152201adbafa7738b1e844868 ]
DSA network devices rely on having their DSA management interface up and running, otherwise their ndo_open() will return -ENETDOWN. Without doing this it would not be possible to use DSA devices as netconsole when configured on the command line. These devices also do not utilize the upper/lower linking, so the check for the netpoll device having an upper is not going to be a problem.
The solution adopted here is identical to the one done for net/ipv4/ipconfig.c with 728c02089a0e ("net: ipv4: handle DSA enabled master network devices"), with the network namespace scope being restricted to that of the process configuring netpoll.
Fixes: 04ff53f96a93 ("net: dsa: Add netconsole support") Tested-by: Vladimir Oltean olteanv@gmail.com Signed-off-by: Florian Fainelli f.fainelli@gmail.com Link: https://lore.kernel.org/r/20201117035236.22658-1-f.fainelli@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/netpoll.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/net/core/netpoll.c b/net/core/netpoll.c index 023ce0fbb496..41e32a958d08 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -28,6 +28,7 @@ #include <linux/slab.h> #include <linux/export.h> #include <linux/if_vlan.h> +#include <net/dsa.h> #include <net/tcp.h> #include <net/udp.h> #include <net/addrconf.h> @@ -638,15 +639,15 @@ EXPORT_SYMBOL_GPL(__netpoll_setup);
int netpoll_setup(struct netpoll *np) { - struct net_device *ndev = NULL; + struct net_device *ndev = NULL, *dev = NULL; + struct net *net = current->nsproxy->net_ns; struct in_device *in_dev; int err;
rtnl_lock(); - if (np->dev_name[0]) { - struct net *net = current->nsproxy->net_ns; + if (np->dev_name[0]) ndev = __dev_get_by_name(net, np->dev_name); - } + if (!ndev) { np_err(np, "%s doesn't exist, aborting\n", np->dev_name); err = -ENODEV; @@ -654,6 +655,19 @@ int netpoll_setup(struct netpoll *np) } dev_hold(ndev);
+ /* bring up DSA management network devices up first */ + for_each_netdev(net, dev) { + if (!netdev_uses_dsa(dev)) + continue; + + err = dev_change_flags(dev, dev->flags | IFF_UP); + if (err < 0) { + np_err(np, "%s failed to open %s\n", + np->dev_name, dev->name); + goto put; + } + } + if (netdev_master_upper_dev_get(ndev)) { np_err(np, "%s is a slave device, aborting\n", np->dev_name); err = -EBUSY;
From: Xin Long lucien.xin@gmail.com
stable inclusion from linux-4.19.160 commit c962aae4207730ec026a4a1f39128db11626953c
--------------------------------
[ Upstream commit 057a10fa1f73d745c8e69aa54ab147715f5630ae ]
A call trace was found in Hangbin's Codenomicon testing with a debug kernel:
[ 2615.981988] ODEBUG: free active (active state 0) object type: timer_list hint: sctp_generate_proto_unreach_event+0x0/0x3a0 [sctp] [ 2615.995050] WARNING: CPU: 17 PID: 0 at lib/debugobjects.c:328 debug_print_object+0x199/0x2b0 [ 2616.095934] RIP: 0010:debug_print_object+0x199/0x2b0 [ 2616.191533] Call Trace: [ 2616.194265] <IRQ> [ 2616.202068] debug_check_no_obj_freed+0x25e/0x3f0 [ 2616.207336] slab_free_freelist_hook+0xeb/0x140 [ 2616.220971] kfree+0xd6/0x2c0 [ 2616.224293] rcu_do_batch+0x3bd/0xc70 [ 2616.243096] rcu_core+0x8b9/0xd00 [ 2616.256065] __do_softirq+0x23d/0xacd [ 2616.260166] irq_exit+0x236/0x2a0 [ 2616.263879] smp_apic_timer_interrupt+0x18d/0x620 [ 2616.269138] apic_timer_interrupt+0xf/0x20 [ 2616.273711] </IRQ>
This is because the code holds the asoc when transport->proto_unreach_timer starts and puts the asoc when the timer stops; since the transport itself is not held, the transport could be freed while the timer is still running.
So fix it by holding/putting transport instead for proto_unreach_timer in transport, just like other timers in transport.
v1->v2: - Also use sctp_transport_put() for the "out_unlock:" path in sctp_generate_proto_unreach_event(), as Marcelo noticed.
Fixes: 50b5d6ad6382 ("sctp: Fix a race between ICMP protocol unreachable and connect()") Reported-by: Hangbin Liu liuhangbin@gmail.com Signed-off-by: Xin Long lucien.xin@gmail.com Acked-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Link: https://lore.kernel.org/r/102788809b554958b13b95d33440f5448113b8d6.160533137... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sctp/input.c | 4 ++-- net/sctp/sm_sideeffect.c | 4 ++-- net/sctp/transport.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/net/sctp/input.c b/net/sctp/input.c index 2de98b5da4c1..ee922867a5e0 100644 --- a/net/sctp/input.c +++ b/net/sctp/input.c @@ -461,7 +461,7 @@ void sctp_icmp_proto_unreachable(struct sock *sk, else { if (!mod_timer(&t->proto_unreach_timer, jiffies + (HZ/20))) - sctp_association_hold(asoc); + sctp_transport_hold(t); } } else { struct net *net = sock_net(sk); @@ -470,7 +470,7 @@ void sctp_icmp_proto_unreachable(struct sock *sk, "encountered!\n", __func__);
if (del_timer(&t->proto_unreach_timer)) - sctp_association_put(asoc); + sctp_transport_put(t);
sctp_do_sm(net, SCTP_EVENT_T_OTHER, SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH), diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c index bb4f1d0da153..2a94240eac36 100644 --- a/net/sctp/sm_sideeffect.c +++ b/net/sctp/sm_sideeffect.c @@ -434,7 +434,7 @@ void sctp_generate_proto_unreach_event(struct timer_list *t) /* Try again later. */ if (!mod_timer(&transport->proto_unreach_timer, jiffies + (HZ/20))) - sctp_association_hold(asoc); + sctp_transport_hold(transport); goto out_unlock; }
@@ -450,7 +450,7 @@ void sctp_generate_proto_unreach_event(struct timer_list *t)
out_unlock: bh_unlock_sock(sk); - sctp_association_put(asoc); + sctp_transport_put(transport); }
/* Handle the timeout of the RE-CONFIG timer. */ diff --git a/net/sctp/transport.c b/net/sctp/transport.c index c0d55ed62d2e..78302e547b2b 100644 --- a/net/sctp/transport.c +++ b/net/sctp/transport.c @@ -148,7 +148,7 @@ void sctp_transport_free(struct sctp_transport *transport)
/* Delete the ICMP proto unreachable timer if it's active. */ if (del_timer(&transport->proto_unreach_timer)) - sctp_association_put(transport->asoc); + sctp_transport_put(transport);
sctp_transport_put(transport); }
From: Vladyslav Tarasiuk vladyslavt@nvidia.com
stable inclusion from linux-4.19.160 commit ba0046dc414d64fff9bd5d74a0641cd25b214a2a
--------------------------------
[ Upstream commit 470b74758260e4abc2508cf1614573c00a00465c ]
Currently, when QoS is enabled for a VF and any min_rate is configured, the driver sets the bw_share value to at least 1 and does not allow setting it to 0 to make the minimal rate unlimited. This means there is always a minimal rate configured for every VF, even if the user tries to remove it.
In order to make QoS disable possible, check whether all vports have configured min_rate = 0. If this is true, set their bw_share to 0 to disable min_rate limitations.
Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Signed-off-by: Vladyslav Tarasiuk vladyslavt@nvidia.com Reviewed-by: Moshe Shemesh moshe@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c index 7366033cd31c..2190daace873 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c @@ -1999,12 +1999,15 @@ static u32 calculate_vports_min_rate_divider(struct mlx5_eswitch *esw) max_guarantee = evport->info.min_rate; }
- return max_t(u32, max_guarantee / fw_max_bw_share, 1); + if (max_guarantee) + return max_t(u32, max_guarantee / fw_max_bw_share, 1); + return 0; }
-static int normalize_vports_min_rate(struct mlx5_eswitch *esw, u32 divider) +static int normalize_vports_min_rate(struct mlx5_eswitch *esw) { u32 fw_max_bw_share = MLX5_CAP_QOS(esw->dev, max_tsar_bw_share); + u32 divider = calculate_vports_min_rate_divider(esw); struct mlx5_vport *evport; u32 vport_max_rate; u32 vport_min_rate; @@ -2018,9 +2021,9 @@ static int normalize_vports_min_rate(struct mlx5_eswitch *esw, u32 divider) continue; vport_min_rate = evport->info.min_rate; vport_max_rate = evport->info.max_rate; - bw_share = MLX5_MIN_BW_SHARE; + bw_share = 0;
- if (vport_min_rate) + if (divider) bw_share = MLX5_RATE_TO_BW_SHARE(vport_min_rate, divider, fw_max_bw_share); @@ -2045,7 +2048,6 @@ int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, int vport, struct mlx5_vport *evport; u32 fw_max_bw_share; u32 previous_min_rate; - u32 divider; bool min_rate_supported; bool max_rate_supported; int err = 0; @@ -2071,8 +2073,7 @@ int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, int vport,
previous_min_rate = evport->info.min_rate; evport->info.min_rate = min_rate; - divider = calculate_vports_min_rate_divider(esw); - err = normalize_vports_min_rate(esw, divider); + err = normalize_vports_min_rate(esw); if (err) { evport->info.min_rate = previous_min_rate; goto unlock;
From: Weilong Chen chenweilong@huawei.com
ascend inclusion category: feature bugzilla: NA CVE: NA
-------------------------------------------------
The ascend memory feature depends on the hugepage (HUGETLBFS) feature, but this dependency is not expressed in Kconfig. When CONFIG_HUGETLBFS is turned off, the compilation fails.
This patch adds the missing dependencies to the Kconfig entries.
Acked-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Weilong Chen chenweilong@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/Kconfig | 2 ++ drivers/char/Kconfig | 2 +- include/linux/hugetlb.h | 7 +++++++ 3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e6f0170e2872..d8917873f549 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1361,6 +1361,7 @@ config ASCEND_IOPF_HIPRI
config ASCEND_CHARGE_MIGRATE_HUGEPAGES bool "Enable support for migrate hugepages" + depends on HUGETLBFS default y help When reseved hugepages are used up, we attempts to apply for migrate @@ -1382,6 +1383,7 @@ config ASCEND_WATCHDOG_SYSFS_CONFIGURE
config ASCEND_AUTO_TUNING_HUGEPAGE bool "Enable support for the auto-tuning hugepage" + depends on HUGETLBFS default y help The hugepage auto-tuning means the kernel dynamically manages the number of diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 0c25d37ca1be..60e05b27b2a3 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -554,7 +554,7 @@ config ADI
config HISI_SVM bool "Hisilicon svm driver" - depends on ARM64 && ARM_SMMU_V3 && MMU_NOTIFIER + depends on ARM64 && ARM_SMMU_V3 && MMU_NOTIFIER && HUGETLBFS default y help This driver provides character-level access to Hisilicon diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 32f2837a6075..9cd938bb24fe 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -629,6 +629,13 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr pte_t *ptep, pte_t pte, unsigned long sz) { } + +static inline int hugetlb_insert_hugepage_pte_by_pa(struct mm_struct *mm, + unsigned long vir_addr, + pgprot_t prot, unsigned long phy_addr) +{ + return 0; +} #endif /* CONFIG_HUGETLB_PAGE */
static inline spinlock_t *huge_pte_lock(struct hstate *h,
From: Lorenzo Pieralisi lorenzo.pieralisi@arm.com
mainline inclusion from mainline-v5.10-rc1 commit f5810e5c329238b8553ebd98b914bdbefd8e6737 category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
For arches that do not select CONFIG_GENERIC_IOMAP, the current pci_iounmap() function does nothing, causing obvious memory leaks for mapped regions that are backed by MMIO physical space.
In order to detect if a mapped pointer is IO vs MMIO, a check must be made available to the pci_iounmap() function so that it can actually detect whether the pointer has to be unmapped.
In configurations where CONFIG_HAS_IOPORT_MAP && !CONFIG_GENERIC_IOMAP, a mapped port is detected using an ioport_map() stub defined in asm-generic/io.h.
Use the same logic to implement a stub (ie __pci_ioport_unmap()) that detects if the passed in pointer in pci_iounmap() is IO vs MMIO to iounmap conditionally and call it in pci_iounmap() fixing the issue.
Leave __pci_ioport_unmap() as a NOP for all other config options.
Tested-by: George Cherian george.cherian@marvell.com Link: https://lore.kernel.org/lkml/20200905024811.74701-1-yangyingliang@huawei.com Link: https://lore.kernel.org/lkml/20200824132046.3114383-1-george.cherian@marvell... Link: https://lore.kernel.org/r/a9daf8d8444d0ebd00bc6d64e336ec49dbb50784.160025414... Reported-by: George Cherian george.cherian@marvell.com Signed-off-by: Lorenzo Pieralisi lorenzo.pieralisi@arm.com Signed-off-by: Lorenzo Pieralisi lorenzo.pieralisi@arm.com Reviewed-by: Catalin Marinas catalin.marinas@arm.com Cc: Arnd Bergmann arnd@arndb.de Cc: George Cherian george.cherian@marvell.com Cc: Will Deacon will@kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: Catalin Marinas catalin.marinas@arm.com Cc: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/asm-generic/io.h | 39 +++++++++++++++++++++++++++------------ 1 file changed, 27 insertions(+), 12 deletions(-)
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h index 303871651f8a..cc946f38182d 100644 --- a/include/asm-generic/io.h +++ b/include/asm-generic/io.h @@ -894,18 +894,6 @@ static inline void iowrite64_rep(volatile void __iomem *addr, #include <linux/vmalloc.h> #define __io_virt(x) ((void __force *)(x))
-#ifndef CONFIG_GENERIC_IOMAP -struct pci_dev; -extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max); - -#ifndef pci_iounmap -#define pci_iounmap pci_iounmap -static inline void pci_iounmap(struct pci_dev *dev, void __iomem *p) -{ -} -#endif -#endif /* CONFIG_GENERIC_IOMAP */ - /* * Change virtual addresses to physical addresses and vv. * These are pretty trivial @@ -1029,6 +1017,16 @@ static inline void __iomem *ioport_map(unsigned long port, unsigned int nr) port &= IO_SPACE_LIMIT; return (port > MMIO_UPPER_LIMIT) ? NULL : PCI_IOBASE + port; } +#define __pci_ioport_unmap __pci_ioport_unmap +static inline void __pci_ioport_unmap(void __iomem *p) +{ + uintptr_t start = (uintptr_t) PCI_IOBASE; + uintptr_t addr = (uintptr_t) p; + + if (addr >= start && addr < start + IO_SPACE_LIMIT) + return; + iounmap(p); +} #endif
#ifndef ioport_unmap @@ -1043,6 +1041,23 @@ extern void ioport_unmap(void __iomem *p); #endif /* CONFIG_GENERIC_IOMAP */ #endif /* CONFIG_HAS_IOPORT_MAP */
+#ifndef CONFIG_GENERIC_IOMAP +struct pci_dev; +extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max); + +#ifndef __pci_ioport_unmap +static inline void __pci_ioport_unmap(void __iomem *p) {} +#endif + +#ifndef pci_iounmap +#define pci_iounmap pci_iounmap +static inline void pci_iounmap(struct pci_dev *dev, void __iomem *p) +{ + __pci_ioport_unmap(p); +} +#endif +#endif /* CONFIG_GENERIC_IOMAP */ + /* * Convert a virtual cached pointer to an uncached pointer */
From: Xu Qiang xuqiang36@huawei.com
ascend inclusion category: bugfix bugzilla: NA CVE: NA
---------------------------------------------
Add workaround bindings in the device tree to initialize the TS core GICR.
Signed-off-by: Xu Qiang xuqiang36@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/irqchip/irq-gic-v3.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 7bf14acdcd28..6bb787ba1764 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -922,12 +922,18 @@ static struct workaround_oem_info gicr_wkrd_info[] = { } };
-static void gic_check_hisi_workaround(void) +static void gic_check_hisi_workaround(struct fwnode_handle *handle) { struct acpi_table_header *tbl; acpi_status status = AE_OK; + struct device_node *node = to_of_node(handle); int i;
+ if ((node != NULL) && of_property_read_bool(node, "enable-init-all-gicr")) { + its_enable_init_all_gicr(); + return; + } + status = acpi_get_table(ACPI_SIG_MADT, 0, &tbl); if (ACPI_FAILURE(status) || !tbl) return; @@ -1088,11 +1094,11 @@ static void gic_cpu_init_others(void) } } #else -static inline void gic_check_hisi_workaround(void) {} +#define gic_check_hisi_workaround(x)
-static inline void gic_compute_nr_gicr(void) {} +#define gic_compute_nr_gicr()
-static inline void gic_cpu_init_others(void) {} +#define gic_cpu_init_others() #endif
#ifdef CONFIG_SMP @@ -1549,7 +1555,7 @@ static int __init gic_init_bases(void __iomem *dist_base, gic_data.rdists.rdist = alloc_percpu(typeof(*gic_data.rdists.rdist)); gic_data.rdists.has_vlpis = true; gic_data.rdists.has_direct_lpi = true; - gic_check_hisi_workaround(); + gic_check_hisi_workaround(handle); gic_compute_nr_gicr();
if (WARN_ON(!gic_data.domain) || WARN_ON(!gic_data.rdists.rdist)) {
From: Xu Qiang xuqiang36@huawei.com
ascend inclusion category: bugfix bugzilla: NA CVE: NA
---------------------------------------------
Add a description of the enable-init-all-gicr property to the binding document.
Signed-off-by: Xu Qiang xuqiang36@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../devicetree/bindings/interrupt-controller/arm,gic-v3.txt | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt index 3ea78c4ef887..9f4fe47d9d54 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt @@ -71,6 +71,10 @@ Optional region containing only the {SET,CLR}SPI registers to be used if isolation is required, and if supported by the HW.
+- enable-init-all-gicr: Boolean property. Identifies kernel initializes + message interrupt functionality for other GICR not managed by this + operating system. + Sub-nodes:
PPI affinity can be expressed as a single "ppi-partitions" node,
From: Takashi Iwai tiwai@suse.de
mainline inclusion from mainline-v5.7-rc1 commit a900cc5cd846edc6964736834e098e88d7ecfcd6 category: bugfix bugzilla: 32178 CVE: NA
-------------------------------------------------
Since snprintf() returns the would-be-output size instead of the actual output size, the succeeding calls may go beyond the given buffer limit. Fix it by replacing with scnprintf().
Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Feng Yubo fengyubo3@huawei.com Reviewed-by: Tao Hou houtao1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/ata/libata-transport.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/ata/libata-transport.c b/drivers/ata/libata-transport.c index a0b0b4d986f2..c3f446fc24c4 100644 --- a/drivers/ata/libata-transport.c +++ b/drivers/ata/libata-transport.c @@ -208,7 +208,7 @@ show_ata_port_##name(struct device *dev, \ { \ struct ata_port *ap = transport_class_to_port(dev); \ \ - return snprintf(buf, 20, format_string, cast ap->field); \ + return scnprintf(buf, 20, format_string, cast ap->field); \ }
#define ata_port_simple_attr(field, name, format_string, type) \ @@ -479,7 +479,7 @@ show_ata_dev_##field(struct device *dev, \ { \ struct ata_device *ata_dev = transport_class_to_dev(dev); \ \ - return snprintf(buf, 20, format_string, cast ata_dev->field); \ + return scnprintf(buf, 20, format_string, cast ata_dev->field); \ }
#define ata_dev_simple_attr(field, format_string, type) \ @@ -533,7 +533,7 @@ show_ata_dev_id(struct device *dev, if (ata_dev->class == ATA_DEV_PMP) return 0; for(i=0;i<ATA_ID_WORDS;i++) { - written += snprintf(buf+written, 20, "%04x%c", + written += scnprintf(buf+written, 20, "%04x%c", ata_dev->id[i], ((i+1) & 7) ? ' ' : '\n'); } @@ -552,7 +552,7 @@ show_ata_dev_gscr(struct device *dev, if (ata_dev->class != ATA_DEV_PMP) return 0; for(i=0;i<SATA_PMP_GSCR_DWORDS;i++) { - written += snprintf(buf+written, 20, "%08x%c", + written += scnprintf(buf+written, 20, "%08x%c", ata_dev->gscr[i], ((i+1) & 3) ? ' ' : '\n'); } @@ -581,7 +581,7 @@ show_ata_dev_trim(struct device *dev, else mode = "unqueued";
- return snprintf(buf, 20, "%s\n", mode); + return scnprintf(buf, 20, "%s\n", mode); }
static DEVICE_ATTR(trim, S_IRUGO, show_ata_dev_trim, NULL);
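For illustration only (hypothetical buffer and input, not part of the patch): snprintf() returns the length the output would have had without truncation, so accumulating its return value can push a write offset past the end of the buffer, while scnprintf() returns the number of characters actually stored.

char buf[8];
int len;

len = snprintf(buf, sizeof(buf), "%s", "0123456789");
/* len == 10: the would-be length, already past the 8-byte buffer,
 * so a follow-up call at buf + len would write out of bounds */

len = scnprintf(buf, sizeof(buf), "%s", "0123456789");
/* len == 7: the characters actually written, excluding the trailing NUL */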
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
--------------------------------
Applying the cntvct workaround on the VDSO path on boards other than D05 may cause unexpected errors, so only do it on D05.
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/clocksource/arm_arch_timer.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c index a9869c2e0d92..cc2eb77b41d9 100644 --- a/drivers/clocksource/arm_arch_timer.c +++ b/drivers/clocksource/arm_arch_timer.c @@ -551,9 +551,14 @@ void arch_timer_enable_workaround(const struct arch_timer_erratum_workaround *wa * change both the default value and the vdso itself. */ if (wa->read_cntvct_el0) { - clocksource_counter.archdata.vdso_direct = true; - vdso_default = true; - vdso_fix = true; + if (wa->read_cntvct_el0 == hisi_161010101_read_cntvct_el0) { + clocksource_counter.archdata.vdso_direct = true; + vdso_default = true; + vdso_fix = true; + } else { + clocksource_counter.archdata.vdso_direct = false; + vdso_default = false; + } } }
From: Jan Kara jack@suse.cz
hulk inclusion category: bugfix bugzilla: 46758 CVE: NA ---------------------------
Protect all superblock modifications (including checksum computation) with a superblock buffer lock. That way we are sure computed checksum matches current superblock contents (a mismatch could cause checksum failures in nojournal mode or if an unjournalled superblock update races with a journalled one). Also we avoid modifying superblock contents while it is being written out (which can cause DIF/DIX failures if we are running in nojournal mode).
Signed-off-by: Jan Kara jack@suse.cz [backport the 10th patch: https://www.spinics.net/lists/linux-ext4/msg75423.html drop the other lock_buffer besides operation for orphan] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/ext4_jbd2.c | 1 - fs/ext4/file.c | 1 + fs/ext4/inode.c | 1 + fs/ext4/namei.c | 6 ++++++ fs/ext4/resize.c | 4 ++++ fs/ext4/xattr.c | 1 + 6 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index a589b7f79558..f9ac7dfd93bf 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -361,7 +361,6 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line, struct buffer_head *bh = EXT4_SB(sb)->s_sbh; int err = 0;
- ext4_superblock_csum_set(sb); if (ext4_handle_valid(handle)) { err = jbd2_journal_dirty_metadata(handle, bh); if (err) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 52d155b4e733..1703871fa2d0 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -434,6 +434,7 @@ static int ext4_sample_last_mounted(struct super_block *sb, goto out_journal; strlcpy(sbi->s_es->s_last_mounted, cp, sizeof(sbi->s_es->s_last_mounted)); + ext4_superblock_csum_set(sb); ext4_handle_dirty_super(handle, sb); out_journal: ext4_journal_stop(handle); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 061d4f6b0844..0e047426c0c9 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5394,6 +5394,7 @@ static int ext4_do_update_inode(handle_t *handle, if (err) goto out_brelse; ext4_set_feature_large_file(sb); + ext4_superblock_csum_set(sb); ext4_handle_sync(handle); err = ext4_handle_dirty_super(handle, sb); } diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index b24864418524..2297d38aaf5e 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2856,7 +2856,10 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) (le32_to_cpu(sbi->s_es->s_inodes_count))) { /* Insert this inode at the head of the on-disk orphan list */ NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); + lock_buffer(sbi->s_sbh); sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); + ext4_superblock_csum_set(sb); + unlock_buffer(sbi->s_sbh); dirty = true; } list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); @@ -2939,7 +2942,10 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; } + lock_buffer(sbi->s_sbh); sbi->s_es->s_last_orphan = cpu_to_le32(ino_next); + ext4_superblock_csum_set(inode->i_sb); + unlock_buffer(sbi->s_sbh); mutex_unlock(&sbi->s_orphan_lock); err = ext4_handle_dirty_super(handle, inode->i_sb); } else { diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 6a0c5c880354..c2e007d836e4 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -901,6 +901,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode, ext4_kvfree_array_rcu(o_group_desc);
le16_add_cpu(&es->s_reserved_gdt_blocks, -1); + ext4_superblock_csum_set(sb); err = ext4_handle_dirty_super(handle, sb); if (err) ext4_std_error(sb, err); @@ -1423,6 +1424,7 @@ static void ext4_update_super(struct super_block *sb, * active. */ ext4_r_blocks_count_set(es, ext4_r_blocks_count(es) + reserved_blocks); + ext4_superblock_csum_set(sb);
/* Update the free space counts */ percpu_counter_add(&sbi->s_freeclusters_counter, @@ -1721,6 +1723,7 @@ static int ext4_group_extend_no_check(struct super_block *sb,
ext4_blocks_count_set(es, o_blocks_count + add); ext4_free_blocks_count_set(es, ext4_free_blocks_count(es) + add); + ext4_superblock_csum_set(sb); ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count, o_blocks_count + add); /* We add the blocks to the bitmap and set the group need init bit */ @@ -1882,6 +1885,7 @@ static int ext4_convert_meta_bg(struct super_block *sb, struct inode *inode) ext4_set_feature_meta_bg(sb); sbi->s_es->s_first_meta_bg = cpu_to_le32(num_desc_blocks(sb, sbi->s_groups_count)); + ext4_superblock_csum_set(sb);
err = ext4_handle_dirty_super(handle, sb); if (err) { diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 24cf730ba6b0..ae029dccebc1 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -791,6 +791,7 @@ static void ext4_xattr_update_super_block(handle_t *handle, BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get_write_access"); if (ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh) == 0) { ext4_set_feature_xattr(sb); + ext4_superblock_csum_set(sb); ext4_handle_dirty_super(handle, sb); } }
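A minimal sketch of the update pattern this backport applies for the orphan list (simplified, error handling omitted; handle, sb and ino are assumed to be in scope): superblock fields are modified and the checksum recomputed under the buffer lock, and only then is the superblock marked dirty.

struct ext4_sb_info *sbi = EXT4_SB(sb);

lock_buffer(sbi->s_sbh);                      /* keep writeback from seeing a half-updated sb */
sbi->s_es->s_last_orphan = cpu_to_le32(ino);  /* modify superblock contents */
ext4_superblock_csum_set(sb);                 /* checksum now matches what was just written */
unlock_buffer(sbi->s_sbh);
int err = ext4_handle_dirty_super(handle, sb); /* mark the buffer dirty in the journal afterwards */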
From: Maurizio Lombardi mlombard@redhat.com
stable inclusion from linux-4.19.117 commit 4b05ecfee5d7f294db728a7f552b11f7e02eb612
--------------------------------
[ Upstream commit e49a7d994379278d3353d7ffc7994672752fb0ad ]
iscsit_free_session() is equivalent to iscsit_stop_session() followed by a call to iscsit_close_session().
Link: https://lore.kernel.org/r/20200313170656.9716-2-mlombard@redhat.com Tested-by: Rahul Kundu rahul.kundu@chelsio.com Signed-off-by: Maurizio Lombardi mlombard@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/target/iscsi/iscsi_target.c | 46 ++--------------------------- drivers/target/iscsi/iscsi_target.h | 1 - 2 files changed, 2 insertions(+), 45 deletions(-)
diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index efe93d6b02b0..5c7bb643f857 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -4560,49 +4560,6 @@ void iscsit_fail_session(struct iscsi_session *sess) sess->session_state = TARG_SESS_STATE_FAILED; }
-int iscsit_free_session(struct iscsi_session *sess) -{ - u16 conn_count = atomic_read(&sess->nconn); - struct iscsi_conn *conn, *conn_tmp = NULL; - int is_last; - - spin_lock_bh(&sess->conn_lock); - atomic_set(&sess->sleep_on_sess_wait_comp, 1); - - list_for_each_entry_safe(conn, conn_tmp, &sess->sess_conn_list, - conn_list) { - if (conn_count == 0) - break; - - if (list_is_last(&conn->conn_list, &sess->sess_conn_list)) { - is_last = 1; - } else { - iscsit_inc_conn_usage_count(conn_tmp); - is_last = 0; - } - iscsit_inc_conn_usage_count(conn); - - spin_unlock_bh(&sess->conn_lock); - iscsit_cause_connection_reinstatement(conn, 1); - spin_lock_bh(&sess->conn_lock); - - iscsit_dec_conn_usage_count(conn); - if (is_last == 0) - iscsit_dec_conn_usage_count(conn_tmp); - - conn_count--; - } - - if (atomic_read(&sess->nconn)) { - spin_unlock_bh(&sess->conn_lock); - wait_for_completion(&sess->session_wait_comp); - } else - spin_unlock_bh(&sess->conn_lock); - - iscsit_close_session(sess); - return 0; -} - void iscsit_stop_session( struct iscsi_session *sess, int session_sleep, @@ -4687,7 +4644,8 @@ int iscsit_release_sessions_for_tpg(struct iscsi_portal_group *tpg, int force) list_for_each_entry_safe(se_sess, se_sess_tmp, &free_list, sess_list) { sess = (struct iscsi_session *)se_sess->fabric_sess_ptr;
- iscsit_free_session(sess); + iscsit_stop_session(sess, 1, 1); + iscsit_close_session(sess); session_count++; }
diff --git a/drivers/target/iscsi/iscsi_target.h b/drivers/target/iscsi/iscsi_target.h index 48bac0acf8c7..11a481cf6ead 100644 --- a/drivers/target/iscsi/iscsi_target.h +++ b/drivers/target/iscsi/iscsi_target.h @@ -43,7 +43,6 @@ extern int iscsi_target_rx_thread(void *); extern int iscsit_close_connection(struct iscsi_conn *); extern int iscsit_close_session(struct iscsi_session *); extern void iscsit_fail_session(struct iscsi_session *); -extern int iscsit_free_session(struct iscsi_session *); extern void iscsit_stop_session(struct iscsi_session *, int, int); extern int iscsit_release_sessions_for_tpg(struct iscsi_portal_group *, int);
From: Maurizio Lombardi mlombard@redhat.com
stable inclusion from linux-4.19.117 commit 8b0d0d8c725e67dc9607bdf38472c154250e7355
--------------------------------
[ Upstream commit 57c46e9f33da530a2485fa01aa27b6d18c28c796 ]
A number of hangs have been reported against the target driver; they are due to the fact that multiple threads may try to destroy the iscsi session at the same time. This may be reproduced for example when a "targetcli iscsi/iqn.../tpg1 disable" command is executed while a logout operation is underway.
When this happens, two or more threads may end up sleeping and waiting for iscsit_close_connection() to execute "complete(session_wait_comp)". Only one of the threads will wake up and proceed to destroy the session structure; the remaining threads will hang forever.
Note that if the blocked threads are somehow forced to wake up with complete_all(), they will try to free the same iscsi session structure destroyed by the first thread, causing double frees, memory corruptions etc...
With this patch, the threads that want to destroy the iscsi session will increase the session refcount and will set the "session_close" flag to 1; then they wait for the driver to close the remaining active connections. When the last connection is closed, iscsit_close_connection() will wake up all the threads and will wait for the session's refcount to reach zero; when this happens, iscsit_close_connection() will destroy the session structure because no one is referencing it anymore.
INFO: task targetcli:5971 blocked for more than 120 seconds. Tainted: P OE 4.15.0-72-generic #81~16.04.1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. targetcli D 0 5971 1 0x00000080 Call Trace: __schedule+0x3d6/0x8b0 ? vprintk_func+0x44/0xe0 schedule+0x36/0x80 schedule_timeout+0x1db/0x370 ? __dynamic_pr_debug+0x8a/0xb0 wait_for_completion+0xb4/0x140 ? wake_up_q+0x70/0x70 iscsit_free_session+0x13d/0x1a0 [iscsi_target_mod] iscsit_release_sessions_for_tpg+0x16b/0x1e0 [iscsi_target_mod] iscsit_tpg_disable_portal_group+0xca/0x1c0 [iscsi_target_mod] lio_target_tpg_enable_store+0x66/0xe0 [iscsi_target_mod] configfs_write_file+0xb9/0x120 __vfs_write+0x1b/0x40 vfs_write+0xb8/0x1b0 SyS_write+0x5c/0xe0 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Link: https://lore.kernel.org/r/20200313170656.9716-3-mlombard@redhat.com Reported-by: Matt Coleman mcoleman@datto.com Tested-by: Matt Coleman mcoleman@datto.com Tested-by: Rahul Kundu rahul.kundu@chelsio.com Signed-off-by: Maurizio Lombardi mlombard@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/target/iscsi/iscsi_target.c | 35 ++++++++++++-------- drivers/target/iscsi/iscsi_target_configfs.c | 5 ++- drivers/target/iscsi/iscsi_target_login.c | 5 +-- include/target/iscsi/iscsi_target_core.h | 2 +- 4 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index 5c7bb643f857..e8823a256663 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -4294,30 +4294,37 @@ int iscsit_close_connection( if (!atomic_read(&sess->session_reinstatement) && atomic_read(&sess->session_fall_back_to_erl0)) { spin_unlock_bh(&sess->conn_lock); + complete_all(&sess->session_wait_comp); iscsit_close_session(sess);
return 0; } else if (atomic_read(&sess->session_logout)) { pr_debug("Moving to TARG_SESS_STATE_FREE.\n"); sess->session_state = TARG_SESS_STATE_FREE; - spin_unlock_bh(&sess->conn_lock);
- if (atomic_read(&sess->sleep_on_sess_wait_comp)) - complete(&sess->session_wait_comp); + if (atomic_read(&sess->session_close)) { + spin_unlock_bh(&sess->conn_lock); + complete_all(&sess->session_wait_comp); + iscsit_close_session(sess); + } else { + spin_unlock_bh(&sess->conn_lock); + }
return 0; } else { pr_debug("Moving to TARG_SESS_STATE_FAILED.\n"); sess->session_state = TARG_SESS_STATE_FAILED;
- if (!atomic_read(&sess->session_continuation)) { - spin_unlock_bh(&sess->conn_lock); + if (!atomic_read(&sess->session_continuation)) iscsit_start_time2retain_handler(sess); - } else - spin_unlock_bh(&sess->conn_lock);
- if (atomic_read(&sess->sleep_on_sess_wait_comp)) - complete(&sess->session_wait_comp); + if (atomic_read(&sess->session_close)) { + spin_unlock_bh(&sess->conn_lock); + complete_all(&sess->session_wait_comp); + iscsit_close_session(sess); + } else { + spin_unlock_bh(&sess->conn_lock); + }
return 0; } @@ -4423,9 +4430,9 @@ static void iscsit_logout_post_handler_closesession( complete(&conn->conn_logout_comp);
iscsit_dec_conn_usage_count(conn); + atomic_set(&sess->session_close, 1); iscsit_stop_session(sess, sleep, sleep); iscsit_dec_session_usage_count(sess); - iscsit_close_session(sess); }
static void iscsit_logout_post_handler_samecid( @@ -4570,8 +4577,6 @@ void iscsit_stop_session( int is_last;
spin_lock_bh(&sess->conn_lock); - if (session_sleep) - atomic_set(&sess->sleep_on_sess_wait_comp, 1);
if (connection_sleep) { list_for_each_entry_safe(conn, conn_tmp, &sess->sess_conn_list, @@ -4629,12 +4634,15 @@ int iscsit_release_sessions_for_tpg(struct iscsi_portal_group *tpg, int force) spin_lock(&sess->conn_lock); if (atomic_read(&sess->session_fall_back_to_erl0) || atomic_read(&sess->session_logout) || + atomic_read(&sess->session_close) || (sess->time2retain_timer_flags & ISCSI_TF_EXPIRED)) { spin_unlock(&sess->conn_lock); continue; } + iscsit_inc_session_usage_count(sess); atomic_set(&sess->session_reinstatement, 1); atomic_set(&sess->session_fall_back_to_erl0, 1); + atomic_set(&sess->session_close, 1); spin_unlock(&sess->conn_lock);
list_move_tail(&se_sess->sess_list, &free_list); @@ -4644,8 +4652,9 @@ int iscsit_release_sessions_for_tpg(struct iscsi_portal_group *tpg, int force) list_for_each_entry_safe(se_sess, se_sess_tmp, &free_list, sess_list) { sess = (struct iscsi_session *)se_sess->fabric_sess_ptr;
+ list_del_init(&se_sess->sess_list); iscsit_stop_session(sess, 1, 1); - iscsit_close_session(sess); + iscsit_dec_session_usage_count(sess); session_count++; }
diff --git a/drivers/target/iscsi/iscsi_target_configfs.c b/drivers/target/iscsi/iscsi_target_configfs.c index 95d0a22b2ad6..d25cadc4f4f1 100644 --- a/drivers/target/iscsi/iscsi_target_configfs.c +++ b/drivers/target/iscsi/iscsi_target_configfs.c @@ -1501,20 +1501,23 @@ static void lio_tpg_close_session(struct se_session *se_sess) spin_lock(&sess->conn_lock); if (atomic_read(&sess->session_fall_back_to_erl0) || atomic_read(&sess->session_logout) || + atomic_read(&sess->session_close) || (sess->time2retain_timer_flags & ISCSI_TF_EXPIRED)) { spin_unlock(&sess->conn_lock); spin_unlock_bh(&se_tpg->session_lock); return; } + iscsit_inc_session_usage_count(sess); atomic_set(&sess->session_reinstatement, 1); atomic_set(&sess->session_fall_back_to_erl0, 1); + atomic_set(&sess->session_close, 1); spin_unlock(&sess->conn_lock);
iscsit_stop_time2retain_timer(sess); spin_unlock_bh(&se_tpg->session_lock);
iscsit_stop_session(sess, 1, 1); - iscsit_close_session(sess); + iscsit_dec_session_usage_count(sess); }
static u32 lio_tpg_get_inst_index(struct se_portal_group *se_tpg) diff --git a/drivers/target/iscsi/iscsi_target_login.c b/drivers/target/iscsi/iscsi_target_login.c index 185302023aec..db93bd0a9b88 100644 --- a/drivers/target/iscsi/iscsi_target_login.c +++ b/drivers/target/iscsi/iscsi_target_login.c @@ -164,6 +164,7 @@ int iscsi_check_for_session_reinstatement(struct iscsi_conn *conn) spin_lock(&sess_p->conn_lock); if (atomic_read(&sess_p->session_fall_back_to_erl0) || atomic_read(&sess_p->session_logout) || + atomic_read(&sess_p->session_close) || (sess_p->time2retain_timer_flags & ISCSI_TF_EXPIRED)) { spin_unlock(&sess_p->conn_lock); continue; @@ -174,6 +175,7 @@ int iscsi_check_for_session_reinstatement(struct iscsi_conn *conn) (sess_p->sess_ops->SessionType == sessiontype))) { atomic_set(&sess_p->session_reinstatement, 1); atomic_set(&sess_p->session_fall_back_to_erl0, 1); + atomic_set(&sess_p->session_close, 1); spin_unlock(&sess_p->conn_lock); iscsit_inc_session_usage_count(sess_p); iscsit_stop_time2retain_timer(sess_p); @@ -198,7 +200,6 @@ int iscsi_check_for_session_reinstatement(struct iscsi_conn *conn) if (sess->session_state == TARG_SESS_STATE_FAILED) { spin_unlock_bh(&sess->conn_lock); iscsit_dec_session_usage_count(sess); - iscsit_close_session(sess); return 0; } spin_unlock_bh(&sess->conn_lock); @@ -206,7 +207,6 @@ int iscsi_check_for_session_reinstatement(struct iscsi_conn *conn) iscsit_stop_session(sess, 1, 1); iscsit_dec_session_usage_count(sess);
- iscsit_close_session(sess); return 0; }
@@ -494,6 +494,7 @@ static int iscsi_login_non_zero_tsih_s2( sess_p = (struct iscsi_session *)se_sess->fabric_sess_ptr; if (atomic_read(&sess_p->session_fall_back_to_erl0) || atomic_read(&sess_p->session_logout) || + atomic_read(&sess_p->session_close) || (sess_p->time2retain_timer_flags & ISCSI_TF_EXPIRED)) continue; if (!memcmp(sess_p->isid, pdu->isid, 6) && diff --git a/include/target/iscsi/iscsi_target_core.h b/include/target/iscsi/iscsi_target_core.h index f2e6abea8490..70d18444d18d 100644 --- a/include/target/iscsi/iscsi_target_core.h +++ b/include/target/iscsi/iscsi_target_core.h @@ -674,7 +674,7 @@ struct iscsi_session { atomic_t session_logout; atomic_t session_reinstatement; atomic_t session_stop_active; - atomic_t sleep_on_sess_wait_comp; + atomic_t session_close; /* connection list */ struct list_head sess_conn_list; struct list_head cr_active_list;
From: Mike Christie michael.christie@oracle.com
stable inclusion from linux-4.19.161 commit 97b5295711ae0a03a5cfaf025d768c6c87a209f6
--------------------------------
[ Upstream commit f36199355c64a39fe82cfddc7623d827c7e050da ]
Maurizio found that the abort and cmd stop paths can race as follows:
1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP.
2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do:
target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd
The cmd is now freed.
3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released.
In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the CMD_T_ABORTED if the driver is not cleaning up the cmd because of a session shutdown. However, iscsit_release_commands_from_conn only sets the CMD_T_FABRIC_STOP and does not check to see if the abort path has claimed completion ownership of the command.
This adds a check in iscsit_release_commands_from_conn() so that only the abort path or the fabric stop path cleans up the command.
Link: https://lore.kernel.org/r/1605318378-9269-1-git-send-email-michael.christie@... Reported-by: Maurizio Lombardi mlombard@redhat.com Reviewed-by: Maurizio Lombardi mlombard@redhat.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/target/iscsi/iscsi_target.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index e8823a256663..92843e8db711 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -492,8 +492,7 @@ EXPORT_SYMBOL(iscsit_queue_rsp); void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) { spin_lock_bh(&conn->cmd_lock); - if (!list_empty(&cmd->i_conn_node) && - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) + if (!list_empty(&cmd->i_conn_node)) list_del_init(&cmd->i_conn_node); spin_unlock_bh(&conn->cmd_lock);
@@ -4058,12 +4057,22 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) spin_lock_bh(&conn->cmd_lock); list_splice_init(&conn->conn_cmd_list, &tmp_list);
- list_for_each_entry(cmd, &tmp_list, i_conn_node) { + list_for_each_entry_safe(cmd, cmd_tmp, &tmp_list, i_conn_node) { struct se_cmd *se_cmd = &cmd->se_cmd;
if (se_cmd->se_tfo != NULL) { spin_lock_irq(&se_cmd->t_state_lock); - se_cmd->transport_state |= CMD_T_FABRIC_STOP; + if (se_cmd->transport_state & CMD_T_ABORTED) { + /* + * LIO's abort path owns the cleanup for this, + * so put it back on the list and let + * aborted_task handle it. + */ + list_move_tail(&cmd->i_conn_node, + &conn->conn_cmd_list); + } else { + se_cmd->transport_state |= CMD_T_FABRIC_STOP; + } spin_unlock_irq(&se_cmd->t_state_lock); } }
From: Xiongfeng Wang wangxiongfeng2@huawei.com
hulk inclusion category: bugfix bugzilla: 46894 CVE: NA
---------------------------
Fix the following compile error when CONFIG_HOTPLUG_CPU is disabled.
arch/arm64/kernel/setup.c: In function 'arch_unregister_cpu': arch/arm64/kernel/setup.c:455:2: error: implicit declaration of function 'unregister_cpu' [-Werror=implicit-function-declaration] unregister_cpu(cpu); ^~~~~~~~~~~~~~
Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/setup.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 155b8a61f98c..6e4f42944631 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -430,6 +430,8 @@ static int __init register_kernel_offset_dumper(void) } __initcall(register_kernel_offset_dumper);
+#ifdef CONFIG_HOTPLUG_CPU + int arch_register_cpu(int num) { struct cpu *cpu = &per_cpu(cpu_data.cpu, num); @@ -446,3 +448,5 @@ void arch_unregister_cpu(int num) unregister_cpu(cpu); } EXPORT_SYMBOL(arch_unregister_cpu); + +#endif
From: Yufen Yu yuyufen@huawei.com
hulk inclusion category: bugfix bugzilla: 30109 CVE: NA ---------------------------
Fix a compiler error in bdi_get_dev_name() on some other architectures.
include/linux/backing-dev.h:511:27: note: previous implicit declaration of 'dev_name' was here 511 | strlcpy(dname, rcu_dev ? dev_name(&rcu_dev->dev) : "(unknown)", len); | ^~~~~~~~
Fixes: 4bafd511afa9 ("bdi: get device name under rcu protect") Signed-off-by: Yufen Yu yuyufen@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/backing-dev.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 947b046ff588..ec3ca0b197dd 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -17,6 +17,7 @@ #include <linux/blk-cgroup.h> #include <linux/backing-dev-defs.h> #include <linux/slab.h> +#include <linux/device.h>
#define BDI_DEV_NAME_LEN 32
From: Zhou Guanghui zhouguanghui1@huawei.com
ascend inclusion category: feature bugzilla: NA CVE: NA
--------------------------------------------------
If memcg oom_kill_disable is false (the OOM killer is enabled for the memcg) while the sysctl enable_oom_killer is false (the OOM killer is disabled globally), the memory allocated for this memcg will exceed the limit. This means that the memcg cannot limit memory usage.
Therefore, only invoke the OOM killer when the sysctl enable_oom_killer is not false; otherwise, wait for memory resources.
Signed-off-by: Zhou Guanghui zhouguanghui1@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/memcontrol.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d926de1686ba..43e5102f90f0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1799,7 +1799,11 @@ bool mem_cgroup_oom_synchronize(bool handle) if (locked) mem_cgroup_oom_notify(memcg);
- if (locked && !memcg->oom_kill_disable) { + if (locked && +#ifdef CONFIG_ASCEND_OOM + sysctl_enable_oom_killer != 0 && +#endif + !memcg->oom_kill_disable) { mem_cgroup_unmark_under_oom(memcg); finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
From: Zhou Guanghui zhouguanghui1@huawei.com
ascend inclusion category: feature bugzilla: NA CVE: NA
-------------------------------------
The kmem cgroup functionality is enabled by default for ascend.
Signed-off-by: Zhou Guanghui zhouguanghui1@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/mm/init.c | 5 +++++ mm/memcontrol.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index a6b9048ea1a4..0773205251ca 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -789,6 +789,11 @@ void ascend_enable_all_features(void) if (IS_ENABLED(CONFIG_PMU_WATCHDOG)) pmu_nmi_enable = true;
+ if (IS_ENABLED(CONFIG_MEMCG_KMEM)) { + extern bool cgroup_memory_nokmem; + cgroup_memory_nokmem = false; + } + #ifdef CONFIG_ARM64_PSEUDO_NMI enable_pseudo_nmi = true; #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 43e5102f90f0..2553017683e9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -85,7 +85,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly; static bool cgroup_memory_nosocket;
/* Kernel memory accounting disabled */ -static bool cgroup_memory_nokmem = true; +bool cgroup_memory_nokmem = true;
/* Whether the swap controller is active */ #ifdef CONFIG_MEMCG_SWAP
From: Xie XiuQi xiexiuqi@huawei.com
hulk inclusion category: bugfix bugzilla: 46892 CVE: NA
Add a get_idle_time() stub function when CONFIG_PROC_FS is disabled, to avoid the following compile error.
$ make tinyconfig && make
kernel/sched/cputime.o: In function `sched_idle_time_adjust': cputime.c:(.text+0x3bb): undefined reference to `get_idle_time' make: *** [vmlinux] Error 1
Signed-off-by: Xie XiuQi xiexiuqi@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sched/cputime.h | 5 +++++ kernel/sched/cputime.c | 6 ++++++ 2 files changed, 11 insertions(+)
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h index 1ebbeec02051..6b1793606fc9 100644 --- a/include/linux/sched/cputime.h +++ b/include/linux/sched/cputime.h @@ -189,6 +189,11 @@ task_sched_runtime(struct task_struct *task); extern int use_sched_idle_time; extern int sched_idle_time_adjust(int cpu, u64 *utime, u64 *stime); extern unsigned long long sched_get_idle_time(int cpu); + +#ifdef CONFIG_PROC_FS extern u64 get_idle_time(int cpu); +#else +static inline u64 get_idle_time(int cpu) { return -1ULL; } +#endif
#endif /* _LINUX_SCHED_CPUTIME_H */ diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index a10b51259b6a..159e4c467773 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -593,6 +593,12 @@ int sched_idle_time_adjust(int cpu, u64 *utime, u64 *stime)
raw_spin_lock(&rq_cputime->lock);
+ /* If failed to get idle time, drop adjustment */ + if (get_idle_time(cpu) == -1ULL) { + raw_spin_unlock(&rq_cputime->lock); + return 0; + } + #ifdef CONFIG_IRQ_TIME_ACCOUNTING if (sched_clock_irqtime) { hi = kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
--------------------------------
Fix the following compiler error.
arch/arm/kernel/perf_event_v7.c: In function 'krait_pmu_disable_event': arch/arm/kernel/perf_event_v7.c:1563:13: error: invalid storage class for function 'krait_pmu_enable_event' static void krait_pmu_enable_event(struct perf_event *event) ^ arch/arm/kernel/perf_event_v7.c: In function 'krait_pmu_enable_event': arch/arm/kernel/perf_event_v7.c:1601:13: error: invalid storage class for function 'krait_pmu_reset' static void krait_pmu_reset(void *info)
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm/kernel/perf_event_v7.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c index 4f0a208072cf..c01ed569590a 100644 --- a/arch/arm/kernel/perf_event_v7.c +++ b/arch/arm/kernel/perf_event_v7.c @@ -1541,7 +1541,7 @@ static void krait_pmu_disable_event(struct perf_event *event) struct pmu_hw_events *events = this_cpu_ptr(cpu_pmu->hw_events);
/* Disable counter and interrupt */ - if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_lock_irqsave(&events->pmu_lock, flags);
/* Disable counter */ @@ -1556,7 +1556,7 @@ static void krait_pmu_disable_event(struct perf_event *event) /* Disable interrupt for this counter */ armv7_pmnc_disable_intens(idx);
- if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_unlock_irqrestore(&events->pmu_lock, flags); }
@@ -1572,7 +1572,7 @@ static void krait_pmu_enable_event(struct perf_event *event) * Enable counter and interrupt, and set the counter to count * the event that we're interested in. */ - if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_lock_irqsave(&events->pmu_lock, flags);
/* Disable counter */ @@ -1594,7 +1594,7 @@ static void krait_pmu_enable_event(struct perf_event *event) /* Enable counter */ armv7_pmnc_enable_counter(idx);
- if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_unlock_irqrestore(&events->pmu_lock, flags); }
@@ -1878,7 +1878,7 @@ static void scorpion_pmu_disable_event(struct perf_event *event) struct pmu_hw_events *events = this_cpu_ptr(cpu_pmu->hw_events);
/* Disable counter and interrupt */ - if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_lock_irqsave(&events->pmu_lock, flags);
/* Disable counter */ @@ -1893,7 +1893,7 @@ static void scorpion_pmu_disable_event(struct perf_event *event) /* Disable interrupt for this counter */ armv7_pmnc_disable_intens(idx);
- if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_unlock_irqrestore(&events->pmu_lock, flags); }
@@ -1909,7 +1909,7 @@ static void scorpion_pmu_enable_event(struct perf_event *event) * Enable counter and interrupt, and set the counter to count * the event that we're interested in. */ - if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_lock_irqsave(&events->pmu_lock, flags);
/* Disable counter */ @@ -1931,7 +1931,7 @@ static void scorpion_pmu_enable_event(struct perf_event *event) /* Enable counter */ armv7_pmnc_enable_counter(idx);
- if (!pmu_nmi_enable) { + if (!pmu_nmi_enable) raw_spin_unlock_irqrestore(&events->pmu_lock, flags); }
From: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn
hulk inclusion category: bugfix bugzilla: NA CVE: NA
The __init specifier added to hrd_flashProbe() isn't appropriate; it can cause the following build warning:
WARNING: drivers/mtd/hisilicon/sfc/hi-sfc.o(.text.unlikely+0x38c): Section mismatch in reference from the \ function flash_map_init() to the function .init.text:hrd_flashProbe() The function flash_map_init() references the function __init hrd_flashProbe(). This is often because flash_map_init lacks a __init annotation or the annotation of hrd_flashProbe is wrong.
Fixes: 4a81715b7bf2 ("drivers: add sfc driver to hulk") Signed-off-by: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/mtd/hisilicon/sfc/hrd_sfc_driver.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/mtd/hisilicon/sfc/hrd_sfc_driver.c b/drivers/mtd/hisilicon/sfc/hrd_sfc_driver.c index 23254e325a85..69fb1b306f0d 100644 --- a/drivers/mtd/hisilicon/sfc/hrd_sfc_driver.c +++ b/drivers/mtd/hisilicon/sfc/hrd_sfc_driver.c @@ -87,8 +87,8 @@ static int _hrd_flashProbe(const char **mtdDrv, struct map_info *map, return HRD_ERR; }
-static int __init hrd_flashProbe(const char **mtdDrv, struct map_info *map, - struct resource *sfc_regres, struct mtd_info **mtd) +static int hrd_flashProbe(const char **mtdDrv, struct map_info *map, + struct resource *sfc_regres, struct mtd_info **mtd) { int ret;
From: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn
hulk inclusion category: bugfix bugzilla: NA CVE: NA
MAP_PA32BIT was defined in uapi/asm-generic/mman.h, which is not automatically included by mm/mmap.c when building on platforms such as mips, resulting in the following compile error:
mm/mmap.c: In function ‘do_mmap’: mm/mmap.c:1450:14: error: ‘MAP_PA32BIT’ undeclared (first use in this function); did you mean ‘MAP_32BIT’? if (flags & MAP_PA32BIT) ^~~~~~~~~~~ MAP_32BIT mm/mmap.c:1450:14: note: each undeclared identifier is reported only once for each function it appears in make[1]: *** [scripts/Makefile.build:303: mm/mmap.o] Error 1
Fixes: e422eca7f9c5 ("svm: add support for allocing memory which is within 4G physical address in svm_mmap")
Signed-off-by: Zhengyuan Liu liuzhengyuan@tj.kylinos.cn Signed-off-by: Bixuan Cui cuibixuan@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/alpha/include/uapi/asm/mman.h | 1 + arch/mips/include/uapi/asm/mman.h | 1 + arch/parisc/include/uapi/asm/mman.h | 1 + arch/powerpc/include/uapi/asm/mman.h | 1 + arch/sparc/include/uapi/asm/mman.h | 1 + arch/xtensa/include/uapi/asm/mman.h | 1 + 6 files changed, 6 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index f9d4e6b6d4bd..b3acfc00c8ec 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -33,6 +33,7 @@ #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x100000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_PA32BIT 0x400000 /* physical address is within 4G */
#define MS_ASYNC 1 /* sync memory asynchronously */ #define MS_SYNC 2 /* synchronous memory sync */ diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 3035ca499cd8..72a00c746e78 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -51,6 +51,7 @@ #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x80000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_PA32BIT 0x400000 /* physical address is within 4G */
/* * Flags for msync diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 870fbf8c7088..9e989d649e85 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -27,6 +27,7 @@ #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x80000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_PA32BIT 0x400000 /* physical address is within 4G */
#define MS_SYNC 1 /* synchronous memory sync */ #define MS_ASYNC 2 /* sync memory asynchronously */ diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h index 65065ce32814..95f884ada96f 100644 --- a/arch/powerpc/include/uapi/asm/mman.h +++ b/arch/powerpc/include/uapi/asm/mman.h @@ -29,6 +29,7 @@ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x40000 /* create a huge page mapping */ +#define MAP_PA32BIT 0x400000 /* physical address is within 4G */
/* Override any generic PKEY permission defines */ #define PKEY_DISABLE_EXECUTE 0x4 diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h index f6f99ec65bb3..0d1881b8f30d 100644 --- a/arch/sparc/include/uapi/asm/mman.h +++ b/arch/sparc/include/uapi/asm/mman.h @@ -26,6 +26,7 @@ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x40000 /* create a huge page mapping */ +#define MAP_PA32BIT 0x400000 /* physical address is within 4G */
#endif /* _UAPI__SPARC_MMAN_H__ */ diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index 58f29a9d895d..f584a590bb00 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -58,6 +58,7 @@ #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x80000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_PA32BIT 0x400000 /* physical address is within 4G */ #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED # define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
From: Wu Bo wubo40@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
https://gitee.com/src-openeuler/kernel/issues/I28N9J ---------------------------
For some reason, while rebooting the system, iscsi.service fails to log out all sessions. The kernel then hangs forever in its sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE command to all still-existent paths.
[ 1044.098991] reboot: Mddev shutdown finished. [ 1044.099311] reboot: Usermodehelper disable finished. [ 1050.611244] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295152378, last ping 4295153633, now 4295154944 [ 1348.599676] Call trace: [ 1348.599887] __switch_to+0xe8/0x150 [ 1348.600113] __schedule+0x33c/0xa08 [ 1348.600372] schedule+0x2c/0x88 [ 1348.600567] schedule_timeout+0x184/0x3a8 [ 1348.600820] io_schedule_timeout+0x28/0x48 [ 1348.601089] wait_for_common_io.constprop.2+0x168/0x258 [ 1348.601425] wait_for_completion_io_timeout+0x28/0x38 [ 1348.601762] blk_execute_rq+0x98/0xd8 [ 1348.602006] __scsi_execute+0xe0/0x1e8 [ 1348.602262] sd_sync_cache+0xd0/0x220 [sd_mod] [ 1348.602551] sd_shutdown+0x6c/0xf8 [sd_mod] [ 1348.602826] device_shutdown+0x13c/0x250 [ 1348.603078] kernel_restart_prepare+0x5c/0x68 [ 1348.603400] kernel_restart+0x20/0x98 [ 1348.603683] __se_sys_reboot+0x214/0x260 [ 1348.603987] __arm64_sys_reboot+0x24/0x30 [ 1348.604300] el0_svc_common+0x80/0x1b8 [ 1348.604590] el0_svc_handler+0x78/0xe0 [ 1348.604877] el0_svc+0x10/0x260
The SCSI midlayer timeout handler, scsi_times_out(), calls iscsi_eh_cmd_timed_out() to perform error handling.
If system_state is still SYSTEM_RUNNING, iscsi_eh_cmd_timed_out() returns BLK_EH_RESET_TIMER, so the block layer resets the timer and waits for the next timeout, and the cycle repeats indefinitely.
To solve this hung command problem, add a timeout limit.
Signed-off-by: Wu Bo wubo40@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/libiscsi.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index 35d603db33bb..6a437743f1c0 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -2025,7 +2025,10 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) * Instead, handle cmd, allow completion to happen and let * upper layer to deal with the result. */ - if (unlikely(system_state != SYSTEM_RUNNING)) { + conn = session->leadconn; + if (unlikely(system_state != SYSTEM_RUNNING) || (conn && + time_before_eq(conn->last_recv + + (cls_session->recovery_tmo * HZ), jiffies))) { sc->result = DID_NO_CONNECT << 16; ISCSI_DBG_EH(session, "sc on shutdown, handled\n"); rc = BLK_EH_DONE; @@ -2471,10 +2474,11 @@ int iscsi_eh_session_reset(struct scsi_cmnd *sc) iscsi_conn_failure(conn, ISCSI_ERR_SCSI_EH_SESSION_RST);
ISCSI_DBG_EH(session, "wait for relogin\n"); - wait_event_interruptible(conn->ehwait, + wait_event_interruptible_timeout(conn->ehwait, session->state == ISCSI_STATE_TERMINATE || session->state == ISCSI_STATE_LOGGED_IN || - session->state == ISCSI_STATE_RECOVERY_FAILED); + session->state == ISCSI_STATE_RECOVERY_FAILED, + cls_session->recovery_tmo * HZ); if (signal_pending(current)) flush_signals(current);
From: Will Deacon will@kernel.org
stable inclusion from linux-4.19.161 commit ad1fc801a3e9bdaed26b112687f6677612da8427
--------------------------------
commit 07509e10dcc77627f8b6a57381e878fe269958d3 upstream.
pte_accessible() is used by ptep_clear_flush() to figure out whether TLB invalidation is necessary when unmapping pages for reclaim. Although our implementation is correct according to the architecture, returning true only for valid, young ptes in the absence of racing page-table modifications, this is in fact flawed due to lazy invalidation of old ptes in ptep_clear_flush_young() where we elide the expensive DSB instruction for completing the TLB invalidation.
Rather than penalise the aging path, adjust pte_accessible() to return true for any valid pte, even if the access flag is cleared.
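For context, a simplified sketch of the generic caller mentioned above; pte_accessible() exists precisely to decide whether the old entry might still be cached in the TLB and therefore needs flushing:

/* simplified from mm/pgtable-generic.c */
pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
		       pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t pte;

	pte = ptep_get_and_clear(mm, address, ptep);
	if (pte_accessible(mm, pte))	/* with this patch: true for any valid pte, not just young ones */
		flush_tlb_page(vma, address);
	return pte;
}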
Cc: stable@vger.kernel.org Fixes: 76c714be0e5e ("arm64: pgtable: implement pte_accessible()") Reported-by: Yu Zhao yuzhao@google.com Acked-by: Yu Zhao yuzhao@google.com Reviewed-by: Minchan Kim minchan@kernel.org Reviewed-by: Catalin Marinas catalin.marinas@arm.com Link: https://lore.kernel.org/r/20201120143557.6715-2-will@kernel.org Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/pgtable.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 6341ca2b4472..6c0a375f1d29 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -108,8 +108,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]; #define pte_valid(pte) (!!(pte_val(pte) & PTE_VALID)) #define pte_valid_not_user(pte) \ ((pte_val(pte) & (PTE_VALID | PTE_USER)) == PTE_VALID) -#define pte_valid_young(pte) \ - ((pte_val(pte) & (PTE_VALID | PTE_AF)) == (PTE_VALID | PTE_AF)) #define pte_valid_user(pte) \ ((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
@@ -117,9 +115,12 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]; * Could the pte be present in the TLB? We must check mm_tlb_flush_pending * so that we don't erroneously return false for pages that have been * remapped as PROT_NONE but are yet to be flushed from the TLB. + * Note that we can't make any assumptions based on the state of the access + * flag, since ptep_clear_flush_young() elides a DSB when invalidating the + * TLB. */ #define pte_accessible(mm, pte) \ - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte)) + (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
/* * p??_access_permitted() is true for valid user mappings (subject to the
From: Will Deacon will@kernel.org
stable inclusion from linux-4.19.161 commit 46c69325028226332be13b6df9f36f2d6b07b489
--------------------------------
commit ff1712f953e27f0b0718762ec17d0adb15c9fd0b upstream.
With hardware dirty bit management, calling pte_wrprotect() on a writable, dirty PTE will lose the dirty state and return a read-only, clean entry.
Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that the dirty bit is preserved for writable entries, as this is required for soft-dirty bit management if we enable it in the future.
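As a reminder of the encoding this relies on (abridged arm64 definition): with hardware DBM a writable, dirty pte is one with PTE_RDONLY clear, so blindly setting PTE_RDONLY would silently discard the dirty state unless it is first transferred to the software PTE_DIRTY bit.

/* arm64, abridged: hardware-dirty == writable and not read-only */
#define pte_hw_dirty(pte)	(pte_write(pte) && !(pte_val(pte) & PTE_RDONLY))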
Cc: stable@vger.kernel.org Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits") Reviewed-by: Catalin Marinas catalin.marinas@arm.com Link: https://lore.kernel.org/r/20201120143557.6715-3-will@kernel.org Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/pgtable.h | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 6c0a375f1d29..85af92cb2456 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -146,13 +146,6 @@ static inline pte_t set_pte_bit(pte_t pte, pgprot_t prot) return pte; }
-static inline pte_t pte_wrprotect(pte_t pte) -{ - pte = clear_pte_bit(pte, __pgprot(PTE_WRITE)); - pte = set_pte_bit(pte, __pgprot(PTE_RDONLY)); - return pte; -} - static inline pte_t pte_mkwrite(pte_t pte) { pte = set_pte_bit(pte, __pgprot(PTE_WRITE)); @@ -178,6 +171,20 @@ static inline pte_t pte_mkdirty(pte_t pte) return pte; }
+static inline pte_t pte_wrprotect(pte_t pte) +{ + /* + * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY + * clear), set the PTE_DIRTY bit. + */ + if (pte_hw_dirty(pte)) + pte = pte_mkdirty(pte); + + pte = clear_pte_bit(pte, __pgprot(PTE_WRITE)); + pte = set_pte_bit(pte, __pgprot(PTE_RDONLY)); + return pte; +} + static inline pte_t pte_mkold(pte_t pte) { return clear_pte_bit(pte, __pgprot(PTE_AF)); @@ -709,12 +716,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres pte = READ_ONCE(*ptep); do { old_pte = pte; - /* - * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY - * clear), set the PTE_DIRTY bit. - */ - if (pte_hw_dirty(pte)) - pte = pte_mkdirty(pte); pte = pte_wrprotect(pte); pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep), pte_val(old_pte), pte_val(pte));
From: Jens Axboe axboe@kernel.dk
stable inclusion from linux-4.19.161 commit 49d04a6948825b309411d13e97109857e4f76de6
--------------------------------
[ Upstream commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19 ]
If this is attempted by a kthread, then return -EOPNOTSUPP as we don't currently support that. Once we can get task_pid_ptr() doing the right thing, then this can go away again.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/proc/self.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/fs/proc/self.c b/fs/proc/self.c index cc6d4253399d..7922edf70ce1 100644 --- a/fs/proc/self.c +++ b/fs/proc/self.c @@ -16,6 +16,13 @@ static const char *proc_self_get_link(struct dentry *dentry, pid_t tgid = task_tgid_nr_ns(current, ns); char *name;
+ /* + * Not currently supported. Once we can inherit all of struct pid, + * we can allow this. + */ + if (current->flags & PF_KTHREAD) + return ERR_PTR(-EOPNOTSUPP); + if (!tgid) return ERR_PTR(-ENOENT); /* max length of unsigned int in decimal + NULL term */
From: Minwoo Im minwoo.im.dev@gmail.com
stable inclusion from linux-4.19.161 commit fd1c1de8c4589fdd528733bfd01ed0c5f3f69204
--------------------------------
[ Upstream commit 0f0d2c876c96d4908a9ef40959a44bec21bdd6cf ]
If the Doorbell Buffer Config command fails even though 'dev->dbbuf_dbs != NULL' (i.e. OACS indicates that NVME_CTRL_OACS_DBBUF_SUPP is set), nvme_dbbuf_update_and_check_event() will still check the event even though the doorbell buffers were never successfully set up.
This patch fixes the mismatch among the doorbell buffers for the SQs/CQs in case the dbbuf command fails.
Signed-off-by: Minwoo Im minwoo.im.dev@gmail.com Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/nvme/host/pci.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d52e44346503..194576d87ade 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -277,9 +277,21 @@ static void nvme_dbbuf_init(struct nvme_dev *dev, nvmeq->dbbuf_cq_ei = &dev->dbbuf_eis[cq_idx(qid, dev->db_stride)]; }
+static void nvme_dbbuf_free(struct nvme_queue *nvmeq) +{ + if (!nvmeq->qid) + return; + + nvmeq->dbbuf_sq_db = NULL; + nvmeq->dbbuf_cq_db = NULL; + nvmeq->dbbuf_sq_ei = NULL; + nvmeq->dbbuf_cq_ei = NULL; +} + static void nvme_dbbuf_set(struct nvme_dev *dev) { struct nvme_command c; + unsigned int i;
if (!dev->dbbuf_dbs) return; @@ -293,6 +305,9 @@ static void nvme_dbbuf_set(struct nvme_dev *dev) dev_warn(dev->ctrl.device, "unable to set dbbuf\n"); /* Free memory and continue on */ nvme_dbbuf_dma_free(dev); + + for (i = 1; i <= dev->online_queues; i++) + nvme_dbbuf_free(&dev->queues[i]); } }
From: Lee Duncan lduncan@suse.com
stable inclusion from linux-4.19.161 commit e0172455d4e9e946a2d388386ec48dd7538275ec
--------------------------------
[ Upstream commit fe0a8a95e7134d0b44cd407bc0085b9ba8d8fe31 ]
iSCSI NOPs are sometimes "lost", mistakenly sent to the user-land iscsid daemon instead of handled in the kernel, as they should be, resulting in a message from the daemon like:
iscsid: Got nop in, but kernel supports nop handling.
This can occur because of the new forward- and back-locks, and the fact that an iSCSI NOP response can occur before processing of the NOP send is complete. This can result in "conn->ping_task" being NULL in iscsi_nop_out_rsp(), when the pointer is actually in the process of being set.
To work around this, we add a new state to the "ping_task" pointer. In addition to NULL (not assigned) and a pointer (assigned), we add the state "being set", which is signaled with an INVALID pointer (using "-1").
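In other words, conn->ping_task becomes a small three-state variable (sketch), always accessed with READ_ONCE()/WRITE_ONCE():

/* conn->ping_task states after this change (sketch):
 *   NULL               - no NOP-Out of our own outstanding
 *   INVALID_SCSI_TASK  - "being set": a NOP-Out is being queued but the
 *                        task pointer has not been assigned yet
 *   valid task pointer - NOP-Out sent, a matching NOP-In is expected
 */
#define INVALID_SCSI_TASK	(struct iscsi_task *)-1l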
Link: https://lore.kernel.org/r/20201106193317.16993-1-leeman.duncan@gmail.com Reviewed-by: Mike Christie michael.christie@oracle.com Signed-off-by: Lee Duncan lduncan@suse.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/libiscsi.c | 23 +++++++++++++++-------- include/scsi/libiscsi.h | 3 +++ 2 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index 6a437743f1c0..c4e02d97058a 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -577,8 +577,8 @@ static void iscsi_complete_task(struct iscsi_task *task, int state) if (conn->task == task) conn->task = NULL;
- if (conn->ping_task == task) - conn->ping_task = NULL; + if (READ_ONCE(conn->ping_task) == task) + WRITE_ONCE(conn->ping_task, NULL);
/* release get from queueing */ __iscsi_put_task(task); @@ -787,6 +787,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr, task->conn->session->age); }
+ if (unlikely(READ_ONCE(conn->ping_task) == INVALID_SCSI_TASK)) + WRITE_ONCE(conn->ping_task, task); + if (!ihost->workq) { if (iscsi_prep_mgmt_task(conn, task)) goto free_task; @@ -997,8 +1000,11 @@ static int iscsi_send_nopout(struct iscsi_conn *conn, struct iscsi_nopin *rhdr) struct iscsi_nopout hdr; struct iscsi_task *task;
- if (!rhdr && conn->ping_task) - return -EINVAL; + if (!rhdr) { + if (READ_ONCE(conn->ping_task)) + return -EINVAL; + WRITE_ONCE(conn->ping_task, INVALID_SCSI_TASK); + }
memset(&hdr, 0, sizeof(struct iscsi_nopout)); hdr.opcode = ISCSI_OP_NOOP_OUT | ISCSI_OP_IMMEDIATE; @@ -1013,11 +1019,12 @@ static int iscsi_send_nopout(struct iscsi_conn *conn, struct iscsi_nopin *rhdr)
task = __iscsi_conn_send_pdu(conn, (struct iscsi_hdr *)&hdr, NULL, 0); if (!task) { + if (!rhdr) + WRITE_ONCE(conn->ping_task, NULL); iscsi_conn_printk(KERN_ERR, conn, "Could not send nopout\n"); return -EIO; } else if (!rhdr) { /* only track our nops */ - conn->ping_task = task; conn->last_ping = jiffies; }
@@ -1040,7 +1047,7 @@ static int iscsi_nop_out_rsp(struct iscsi_task *task, struct iscsi_conn *conn = task->conn; int rc = 0;
- if (conn->ping_task != task) { + if (READ_ONCE(conn->ping_task) != task) { /* * If this is not in response to one of our * nops then it must be from userspace. @@ -1984,7 +1991,7 @@ static void iscsi_start_tx(struct iscsi_conn *conn) */ static int iscsi_has_ping_timed_out(struct iscsi_conn *conn) { - if (conn->ping_task && + if (READ_ONCE(conn->ping_task) && time_before_eq(conn->last_recv + (conn->recv_timeout * HZ) + (conn->ping_timeout * HZ), jiffies)) return 1; @@ -2122,7 +2129,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) * Checking the transport already or nop from a cmd timeout still * running */ - if (conn->ping_task) { + if (READ_ONCE(conn->ping_task)) { task->have_checked_conn = true; rc = BLK_EH_RESET_TIMER; goto done; diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h index e03b9244499e..8e146aa623cf 100644 --- a/include/scsi/libiscsi.h +++ b/include/scsi/libiscsi.h @@ -145,6 +145,9 @@ struct iscsi_task { void *dd_data; /* driver/transport data */ };
+/* invalid scsi_task pointer */ +#define INVALID_SCSI_TASK (struct iscsi_task *)-1l + static inline int iscsi_task_has_unsol_data(struct iscsi_task *task) { return task->unsol_r2t.data_length > task->unsol_r2t.sent;
From: Ard Biesheuvel ardb@kernel.org
stable inclusion from linux-4.19.161 commit db9e8d33b066d7022a4ed1c8fa16b6496cd651c3
--------------------------------
[ Upstream commit ff04f3b6f2e27f8ae28a498416af2a8dd5072b43 ]
The memory leak addressed by commit fe5186cf12e3 is a false positive: all allocations are recorded in a linked list, and freed when the filesystem is unmounted. This leads to double frees, and as reported by David, leads to crashes if SLUB is configured to self destruct when double frees occur.
So drop the redundant kfree() again, and instead, mark the offending pointer variable so the allocation is ignored by kmemleak.
Cc: Vamshi K Sthambamkadi vamshi.k.sthambamkadi@gmail.com Fixes: fe5186cf12e3 ("efivarfs: fix memory leak in efivarfs_create()") Reported-by: David Laight David.Laight@aculab.com Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/efivarfs/inode.c | 2 ++ fs/efivarfs/super.c | 1 - 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c index 8c6ab6c95727..7f40343b39b0 100644 --- a/fs/efivarfs/inode.c +++ b/fs/efivarfs/inode.c @@ -10,6 +10,7 @@ #include <linux/efi.h> #include <linux/fs.h> #include <linux/ctype.h> +#include <linux/kmemleak.h> #include <linux/slab.h> #include <linux/uuid.h>
@@ -106,6 +107,7 @@ static int efivarfs_create(struct inode *dir, struct dentry *dentry, var->var.VariableName[i] = '\0';
inode->i_private = var; + kmemleak_ignore(var);
err = efivar_entry_add(var, &efivarfs_list); if (err) diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c index 7808a26bd33f..834615f13f3e 100644 --- a/fs/efivarfs/super.c +++ b/fs/efivarfs/super.c @@ -23,7 +23,6 @@ LIST_HEAD(efivarfs_list); static void efivarfs_evict_inode(struct inode *inode) { clear_inode(inode); - kfree(inode->i_private); }
static const struct super_operations efivarfs_ops = {
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.162 commit 716cd2eba30d7737fd140abd00e1667c76b32fc5
--------------------------------
[ Upstream commit e255e11e66da8281e337e4e352956e8a4999fca4 ]
kmemleak report a memory leak as follows:
unreferenced object 0xffff8880059c6a00 (size 64): comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s) hex dump (first 32 bytes): 20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00 ............... 1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00 ................ backtrace: [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0 [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170 [<000000006a9ca9d4>] ops_init+0xa8/0x3c0 [<000000002da57bf2>] setup_net+0x2de/0x7e0 [<000000004e52d573>] copy_net_ns+0x27d/0x530 [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30 [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0 [<0000000030653721>] ksys_unshare+0x3a4/0x780 [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40 [<0000000031a10c08>] do_syscall_64+0x33/0x40 [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
We should free all rules when we catch an error in ip6addrlbl_net_init(), otherwise a memory leak will occur.
Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Wang Hai wanghai38@huawei.com Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/addrlabel.c | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c index 1d6ced37ad71..c7dc8b2de6c2 100644 --- a/net/ipv6/addrlabel.c +++ b/net/ipv6/addrlabel.c @@ -306,7 +306,9 @@ static int ip6addrlbl_del(struct net *net, /* add default label */ static int __net_init ip6addrlbl_net_init(struct net *net) { - int err = 0; + struct ip6addrlbl_entry *p = NULL; + struct hlist_node *n; + int err; int i;
ADDRLABEL(KERN_DEBUG "%s\n", __func__); @@ -315,14 +317,20 @@ static int __net_init ip6addrlbl_net_init(struct net *net) INIT_HLIST_HEAD(&net->ipv6.ip6addrlbl_table.head);
for (i = 0; i < ARRAY_SIZE(ip6addrlbl_init_table); i++) { - int ret = ip6addrlbl_add(net, - ip6addrlbl_init_table[i].prefix, - ip6addrlbl_init_table[i].prefixlen, - 0, - ip6addrlbl_init_table[i].label, 0); - /* XXX: should we free all rules when we catch an error? */ - if (ret && (!err || err != -ENOMEM)) - err = ret; + err = ip6addrlbl_add(net, + ip6addrlbl_init_table[i].prefix, + ip6addrlbl_init_table[i].prefixlen, + 0, + ip6addrlbl_init_table[i].label, 0); + if (err) + goto err_ip6addrlbl_add; + } + return 0; + +err_ip6addrlbl_add: + hlist_for_each_entry_safe(p, n, &net->ipv6.ip6addrlbl_table.head, list) { + hlist_del_rcu(&p->list); + kfree_rcu(p, rcu); } return err; }
From: Willem de Bruijn willemb@google.com
stable inclusion from linux-4.19.162 commit 0dc25f979633a39b9fb9cbf12414308657fecc7e
--------------------------------
[ Upstream commit 985f7337421a811cb354ca93882f943c8335a6f5 ]
When setting sk_err, set it to ee_errno, not ee_origin.
Commit f5f99309fa74 ("sock: do not set sk_err in sock_dequeue_err_skb") disabled updating sk_err on errq dequeue, which is correct for most error types (origins):
- sk->sk_err = err;
Commit 38b257938ac6 ("sock: reset sk_err when the error queue is empty") re-enabled the behavior for ICMP origins, which do require it:
+ if (icmp_next) + sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;
But it reads from ee_origin where it should read from ee_errno.
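For reference, the fields involved (abridged from include/uapi/linux/errqueue.h); sk_err is meant to carry the error number, not the origin constant:

struct sock_extended_err {
	__u32	ee_errno;	/* the error to report, e.g. EMSGSIZE        */
	__u8	ee_origin;	/* where it came from: SO_EE_ORIGIN_ICMP ... */
	__u8	ee_type;
	__u8	ee_code;
	__u8	ee_pad;
	__u32	ee_info;
	__u32	ee_data;
};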
Fixes: 38b257938ac6 ("sock: reset sk_err when the error queue is empty") Reported-by: Ayush Ranjan ayushranjan@google.com Signed-off-by: Willem de Bruijn willemb@google.com Acked-by: Soheil Hassas Yeganeh soheil@google.com Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gma... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/skbuff.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index be4bc833c28a..b5d9c9b2c702 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4265,7 +4265,7 @@ struct sk_buff *sock_dequeue_err_skb(struct sock *sk) if (skb && (skb_next = skb_peek(q))) { icmp_next = is_icmp_err_skb(skb_next); if (icmp_next) - sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin; + sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_errno; } spin_unlock_irqrestore(&q->lock, flags);
From: Alexander Duyck alexanderduyck@fb.com
stable inclusion from linux-4.19.162 commit 471186632170f46aeb66d4996b8089e0474a1b5d
--------------------------------
[ Upstream commit 55472017a4219ca965a957584affdb17549ae4a4 ]
When setting congestion control via a BPF program it is seen that the SYN/ACK for packets within a given flow will not include the ECT0 flag. A bit of simple printk debugging shows that when this is configured without BPF we will see the INET_ECN_xmit configuration initialized in tcp_assign_congestion_control; however, when we configure this via BPF the socket is in the closed state and as such it isn't configured, and I do not see it being initialized when we transition the socket into the listen state. The result of this is that the ECT0 bit is configured based on whatever the default state is for the socket.
An easy way to reproduce this is to monitor the following with tcpdump: tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca
Without this patch the SYN/ACK will follow whatever the default is. If it is dctcp, all SYN/ACK packets will have the ECT0 bit set; if it is not, ECT0 will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK bit matches the value seen on the other packets in the given stream.
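For context, what the two added calls do (abridged sketch of the helpers in include/net/inet_ecn.h):

static inline void INET_ECN_xmit(struct sock *sk)
{
	inet_sk(sk)->tos |= INET_ECN_ECT_0;
	if (inet6_sk(sk) != NULL)
		inet6_sk(sk)->tclass |= INET_ECN_ECT_0;
}

static inline void INET_ECN_dontxmit(struct sock *sk)
{
	inet_sk(sk)->tos &= ~INET_ECN_MASK;
	if (inet6_sk(sk) != NULL)
		inet6_sk(sk)->tclass &= ~INET_ECN_MASK;
}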
Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp_cong.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index 32c3162baf3e..00a7482b6fbd 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -196,6 +196,11 @@ static void tcp_reinit_congestion_control(struct sock *sk, icsk->icsk_ca_setsockopt = 1; memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
+ if (ca->flags & TCP_CONG_NEEDS_ECN) + INET_ECN_xmit(sk); + else + INET_ECN_dontxmit(sk); + if (!((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) tcp_init_congestion_control(sk); }
From: Jamie Iles jamie@nuviainc.com
stable inclusion from linux-4.19.162 commit 8285a15cd47d6607186e0e63d882f19d462f71bd
--------------------------------
[ Upstream commit b9ad3e9f5a7a760ab068e33e1f18d240ba32ce92 ]
syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, releasing a struct slave device could result in the following splat:
kobject: 'bonding_slave' (00000000cecdd4fe): kobject_release, parent 0000000074ceb2b2 (delayed 1000) bond0 (unregistering): (slave bond_slave_1): Releasing backup interface ------------[ cut here ]------------ ODEBUG: free active (active state 0) object type: timer_list hint: workqueue_select_cpu_near kernel/workqueue.c:1549 [inline] ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98 kernel/workqueue.c:1600 WARNING: CPU: 1 PID: 842 at lib/debugobjects.c:485 debug_print_object+0x180/0x240 lib/debugobjects.c:485 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 842 Comm: kworker/u4:4 Tainted: G S 5.9.0-rc8+ #96 Hardware name: linux,dummy-virt (DT) Workqueue: netns cleanup_net Call trace: dump_backtrace+0x0/0x4d8 include/linux/bitmap.h:239 show_stack+0x34/0x48 arch/arm64/kernel/traps.c:142 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x174/0x1f8 lib/dump_stack.c:118 panic+0x360/0x7a0 kernel/panic.c:231 __warn+0x244/0x2ec kernel/panic.c:600 report_bug+0x240/0x398 lib/bug.c:198 bug_handler+0x50/0xc0 arch/arm64/kernel/traps.c:974 call_break_hook+0x160/0x1d8 arch/arm64/kernel/debug-monitors.c:322 brk_handler+0x30/0xc0 arch/arm64/kernel/debug-monitors.c:329 do_debug_exception+0x184/0x340 arch/arm64/mm/fault.c:864 el1_dbg+0x48/0xb0 arch/arm64/kernel/entry-common.c:65 el1_sync_handler+0x170/0x1c8 arch/arm64/kernel/entry-common.c:93 el1_sync+0x80/0x100 arch/arm64/kernel/entry.S:594 debug_print_object+0x180/0x240 lib/debugobjects.c:485 __debug_check_no_obj_freed lib/debugobjects.c:967 [inline] debug_check_no_obj_freed+0x200/0x430 lib/debugobjects.c:998 slab_free_hook mm/slub.c:1536 [inline] slab_free_freelist_hook+0x190/0x210 mm/slub.c:1577 slab_free mm/slub.c:3138 [inline] kfree+0x13c/0x460 mm/slub.c:4119 bond_free_slave+0x8c/0xf8 drivers/net/bonding/bond_main.c:1492 __bond_release_one+0xe0c/0xec8 drivers/net/bonding/bond_main.c:2190 bond_slave_netdev_event drivers/net/bonding/bond_main.c:3309 [inline] bond_netdev_event+0x8f0/0xa70 drivers/net/bonding/bond_main.c:3420 notifier_call_chain+0xf0/0x200 kernel/notifier.c:83 __raw_notifier_call_chain kernel/notifier.c:361 [inline] raw_notifier_call_chain+0x44/0x58 kernel/notifier.c:368 call_netdevice_notifiers_info+0xbc/0x150 net/core/dev.c:2033 call_netdevice_notifiers_extack net/core/dev.c:2045 [inline] call_netdevice_notifiers net/core/dev.c:2059 [inline] rollback_registered_many+0x6a4/0xec0 net/core/dev.c:9347 unregister_netdevice_many.part.0+0x2c/0x1c0 net/core/dev.c:10509 unregister_netdevice_many net/core/dev.c:10508 [inline] default_device_exit_batch+0x294/0x338 net/core/dev.c:10992 ops_exit_list.isra.0+0xec/0x150 net/core/net_namespace.c:189 cleanup_net+0x44c/0x888 net/core/net_namespace.c:603 process_one_work+0x96c/0x18c0 kernel/workqueue.c:2269 worker_thread+0x3f0/0xc30 kernel/workqueue.c:2415 kthread+0x390/0x498 kernel/kthread.c:292 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:925
This is a potential use-after-free if the sysfs nodes are being accessed whilst removing the struct slave, so wait for the object destruction to complete before freeing the struct slave itself.
Fixes: 07699f9a7c8d ("bonding: add sysfs /slave dir for bond slave devices.") Fixes: a068aab42258 ("bonding: Fix reference count leak in bond_sysfs_slave_add.") Cc: Qiushi Wu wu000273@umn.edu Cc: Jay Vosburgh j.vosburgh@gmail.com Cc: Veaceslav Falico vfalico@gmail.com Cc: Andy Gospodarek andy@greyhouse.net Signed-off-by: Jamie Iles jamie@nuviainc.com Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Link: https://lore.kernel.org/r/20201120142827.879226-1-jamie@nuviainc.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/bonding/bond_main.c | 61 ++++++++++++++++++-------- drivers/net/bonding/bond_sysfs_slave.c | 18 +------- include/net/bonding.h | 8 ++++ 3 files changed, 52 insertions(+), 35 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index a59333b87eaf..c21c4291921f 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1268,7 +1268,39 @@ static void bond_upper_dev_unlink(struct bonding *bond, struct slave *slave) slave->dev->flags &= ~IFF_SLAVE; }
-static struct slave *bond_alloc_slave(struct bonding *bond) +static void slave_kobj_release(struct kobject *kobj) +{ + struct slave *slave = to_slave(kobj); + struct bonding *bond = bond_get_bond_by_slave(slave); + + cancel_delayed_work_sync(&slave->notify_work); + if (BOND_MODE(bond) == BOND_MODE_8023AD) + kfree(SLAVE_AD_INFO(slave)); + + kfree(slave); +} + +static struct kobj_type slave_ktype = { + .release = slave_kobj_release, +#ifdef CONFIG_SYSFS + .sysfs_ops = &slave_sysfs_ops, +#endif +}; + +static int bond_kobj_init(struct slave *slave) +{ + int err; + + err = kobject_init_and_add(&slave->kobj, &slave_ktype, + &(slave->dev->dev.kobj), "bonding_slave"); + if (err) + kobject_put(&slave->kobj); + + return err; +} + +static struct slave *bond_alloc_slave(struct bonding *bond, + struct net_device *slave_dev) { struct slave *slave = NULL;
@@ -1276,11 +1308,17 @@ static struct slave *bond_alloc_slave(struct bonding *bond) if (!slave) return NULL;
+ slave->bond = bond; + slave->dev = slave_dev; + + if (bond_kobj_init(slave)) + return NULL; + if (BOND_MODE(bond) == BOND_MODE_8023AD) { SLAVE_AD_INFO(slave) = kzalloc(sizeof(struct ad_slave_info), GFP_KERNEL); if (!SLAVE_AD_INFO(slave)) { - kfree(slave); + kobject_put(&slave->kobj); return NULL; } } @@ -1289,17 +1327,6 @@ static struct slave *bond_alloc_slave(struct bonding *bond) return slave; }
-static void bond_free_slave(struct slave *slave) -{ - struct bonding *bond = bond_get_bond_by_slave(slave); - - cancel_delayed_work_sync(&slave->notify_work); - if (BOND_MODE(bond) == BOND_MODE_8023AD) - kfree(SLAVE_AD_INFO(slave)); - - kfree(slave); -} - static void bond_fill_ifbond(struct bonding *bond, struct ifbond *info) { info->bond_mode = BOND_MODE(bond); @@ -1487,14 +1514,12 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, bond->dev->addr_assign_type == NET_ADDR_RANDOM) bond_set_dev_addr(bond->dev, slave_dev);
- new_slave = bond_alloc_slave(bond); + new_slave = bond_alloc_slave(bond, slave_dev); if (!new_slave) { res = -ENOMEM; goto err_undo_flags; }
- new_slave->bond = bond; - new_slave->dev = slave_dev; /* Set the new_slave's queue_id to be zero. Queue ID mapping * is set via sysfs or module option if desired. */ @@ -1821,7 +1846,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, dev_set_mtu(slave_dev, new_slave->original_mtu);
err_free: - bond_free_slave(new_slave); + kobject_put(&new_slave->kobj);
err_undo_flags: /* Enslave of first slave has failed and we need to fix master's mac */ @@ -2009,7 +2034,7 @@ static int __bond_release_one(struct net_device *bond_dev, if (!netif_is_bond_master(slave_dev)) slave_dev->priv_flags &= ~IFF_BONDING;
- bond_free_slave(slave); + kobject_put(&slave->kobj);
return 0; } diff --git a/drivers/net/bonding/bond_sysfs_slave.c b/drivers/net/bonding/bond_sysfs_slave.c index 36dee305c687..9ec0498d7d54 100644 --- a/drivers/net/bonding/bond_sysfs_slave.c +++ b/drivers/net/bonding/bond_sysfs_slave.c @@ -125,7 +125,6 @@ static const struct slave_attribute *slave_attrs[] = { };
#define to_slave_attr(_at) container_of(_at, struct slave_attribute, attr) -#define to_slave(obj) container_of(obj, struct slave, kobj)
static ssize_t slave_show(struct kobject *kobj, struct attribute *attr, char *buf) @@ -136,28 +135,15 @@ static ssize_t slave_show(struct kobject *kobj, return slave_attr->show(slave, buf); }
-static const struct sysfs_ops slave_sysfs_ops = { +const struct sysfs_ops slave_sysfs_ops = { .show = slave_show, };
-static struct kobj_type slave_ktype = { -#ifdef CONFIG_SYSFS - .sysfs_ops = &slave_sysfs_ops, -#endif -}; - int bond_sysfs_slave_add(struct slave *slave) { const struct slave_attribute **a; int err;
- err = kobject_init_and_add(&slave->kobj, &slave_ktype, - &(slave->dev->dev.kobj), "bonding_slave"); - if (err) { - kobject_put(&slave->kobj); - return err; - } - for (a = slave_attrs; *a; ++a) { err = sysfs_create_file(&slave->kobj, &((*a)->attr)); if (err) { @@ -175,6 +161,4 @@ void bond_sysfs_slave_del(struct slave *slave)
for (a = slave_attrs; *a; ++a) sysfs_remove_file(&slave->kobj, &((*a)->attr)); - - kobject_put(&slave->kobj); } diff --git a/include/net/bonding.h b/include/net/bonding.h index 8116648873c3..c458f084f7bb 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -170,6 +170,11 @@ struct slave { struct rtnl_link_stats64 slave_stats; };
+static inline struct slave *to_slave(struct kobject *kobj) +{ + return container_of(kobj, struct slave, kobj); +} + struct bond_up_slave { unsigned int count; struct rcu_head rcu; @@ -733,6 +738,9 @@ extern struct bond_parm_tbl ad_select_tbl[]; /* exported from bond_netlink.c */ extern struct rtnl_link_ops bond_link_ops;
+/* exported from bond_sysfs_slave.c */ +extern const struct sysfs_ops slave_sysfs_ops; + static inline void bond_tx_drop(struct net_device *dev, struct sk_buff *skb) { atomic_long_inc(&dev->tx_dropped);
From: Antoine Tenart atenart@kernel.org
stable inclusion from linux-4.19.162 commit c9c048d4e36519a53623afb05a7403204cec707d
--------------------------------
[ Upstream commit 44f64f23bae2f0fad25503bc7ab86cd08d04cd47 ]
Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the hooks as, while it's an expected value for a bridge, routing expects PACKET_HOST. The change is undone later on after hook traversal. This can be seen with pairs of functions updating skb->pkt_type and then reverting it to its original value:
For hook NF_INET_PRE_ROUTING: setup_pre_routing / br_nf_pre_routing_finish
For hook NF_INET_FORWARD: br_nf_forward_ip / br_nf_forward_finish
But in the third case where netfilter does this, hook NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing but never reverted. A comment says:
/* We assume any code from br_dev_queue_push_xmit onwards doesn't care * about the value of skb->pkt_type. */
But when having a tunnel (say vxlan) attached to a bridge we have the following call trace:
br_nf_pre_routing br_nf_pre_routing_ipv6 br_nf_pre_routing_finish br_nf_forward_ip br_nf_forward_finish br_nf_post_routing <- pkt_type is updated to PACKET_HOST br_nf_dev_queue_xmit <- but not reverted to its original value vxlan_xmit vxlan_xmit_one skb_tunnel_check_pmtu <- a check on pkt_type is performed
In this specific case, this creates issues such as when an ICMPv6 PTB should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST and returns early).
If the comment is right and no one cares about the value of skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting it to its original value should be safe.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Antoine Tenart atenart@kernel.org Reviewed-by: Florian Westphal fw@strlen.de Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/bridge/br_netfilter_hooks.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index f9179accf319..a403179eea49 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -719,6 +719,11 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff mtu_reserved = nf_bridge_mtu_reduction(skb); mtu = skb->dev->mtu;
+ if (nf_bridge->pkt_otherhost) { + skb->pkt_type = PACKET_OTHERHOST; + nf_bridge->pkt_otherhost = false; + } + if (nf_bridge->frag_max_size && nf_bridge->frag_max_size < mtu) mtu = nf_bridge->frag_max_size;
@@ -812,8 +817,6 @@ static unsigned int br_nf_post_routing(void *priv, else return NF_ACCEPT;
- /* We assume any code from br_dev_queue_push_xmit onwards doesn't care - * about the value of skb->pkt_type. */ if (skb->pkt_type == PACKET_OTHERHOST) { skb->pkt_type = PACKET_HOST; nf_bridge->pkt_otherhost = true;
From: Guillaume Nault gnault@redhat.com
stable inclusion from linux-4.19.162 commit dbd97d579e60fae91a2db06e7b02b8f3a4a0872a
--------------------------------
[ Upstream commit 1ebf179037cb46c19da3a9c1e2ca16e7a754b75e ]
When inet_rtm_getroute() was converted to use the RCU variants of ip_route_input() and ip_route_output_key(), the TOS parameters stopped being masked with IPTOS_RT_MASK before doing the route lookup.
As a result, "ip route get" can return a different route than what would be used when sending real packets.
For example:
$ ip route add 192.0.2.11/32 dev eth0 $ ip route add unreachable 192.0.2.11/32 tos 2 $ ip route get 192.0.2.11 tos 2 RTNETLINK answers: No route to host
But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would actually be routed using the first route:
$ ping -c 1 -Q 2 192.0.2.11 PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data. 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
--- 192.0.2.11 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to return results consistent with real route lookups.
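The mask in question strips the two low-order (ECN) bits, which is why real packets sent with TOS 2, i.e. ECT(0), used the plain route above: after masking, the effective routing TOS is 0.

#define IPTOS_TOS_MASK	0x1E			/* include/uapi/linux/ip.h   */
#define IPTOS_RT_MASK	(IPTOS_TOS_MASK & ~3)	/* include/net/route.h: 0x1c */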
Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Signed-off-by: Guillaume Nault gnault@redhat.com Reviewed-by: David Ahern dsahern@kernel.org Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.160641289... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/route.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c index f74b7513d607..106c97002088 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2871,7 +2871,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, memset(&fl4, 0, sizeof(fl4)); fl4.daddr = dst; fl4.saddr = src; - fl4.flowi4_tos = rtm->rtm_tos; + fl4.flowi4_tos = rtm->rtm_tos & IPTOS_RT_MASK; fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0; fl4.flowi4_mark = mark; fl4.flowi4_uid = uid; @@ -2895,8 +2895,9 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, fl4.flowi4_iif = iif; /* for rt_fill_info */ skb->dev = dev; skb->mark = mark; - err = ip_route_input_rcu(skb, dst, src, rtm->rtm_tos, - dev, &res); + err = ip_route_input_rcu(skb, dst, src, + rtm->rtm_tos & IPTOS_RT_MASK, dev, + &res);
rt = skb_rtable(skb); if (err == 0 && rt->dst.error)
From: Antoine Tenart atenart@kernel.org
stable inclusion from linux-4.19.162 commit cf3d9717a5271573fa193c4dfa0df856295d5b15
--------------------------------
[ Upstream commit 832ba596494b2c9eac7760259eff2d8b7dcad0ee ]
syzkaller managed to crash the kernel using an NBMA ip6gre interface. I could reproduce it by creating an NBMA ip6gre interface and forwarding traffic to it:
skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:109! Call Trace: skb_push+0x10/0x10 ip6gre_header+0x47/0x1b0 neigh_connected_output+0xae/0xf0
ip6gre tunnel provides its own header_ops->create, and sets it conditionally when initializing the tunnel in NBMA mode. When header_ops->create is used, dev->hard_header_len should reflect the length of the header created. Otherwise, when not used, dev->needed_headroom should be used.
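The rule the fix applies can be summarised as follows (sketch); both fields contribute to the headroom reserved for the device, but only hard_header_len describes a header that header_ops->create() will skb_push():

/* Sketch of the sizing rule applied by this patch:
 *   dev->header_ops set   -> dev->hard_header_len must cover the header
 *                            built (and pushed) by header_ops->create()
 *   dev->header_ops unset -> dev->needed_headroom only reserves room for
 *                            the header the tunnel pushes itself on xmit
 */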
Fixes: eb95f52fc72d ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap") Cc: Maria Pasechnik mariap@mellanox.com Signed-off-by: Antoine Tenart atenart@kernel.org Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/ip6_gre.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index 7cc9cd83ecb5..86c8ea7d7006 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -1140,8 +1140,13 @@ static void ip6gre_tnl_link_config_route(struct ip6_tnl *t, int set_mtu, return;
if (rt->dst.dev) { - dev->needed_headroom = rt->dst.dev->hard_header_len + - t_hlen; + unsigned short dst_len = rt->dst.dev->hard_header_len + + t_hlen; + + if (t->dev->header_ops) + dev->hard_header_len = dst_len; + else + dev->needed_headroom = dst_len;
if (set_mtu) { dev->mtu = rt->dst.dev->mtu - t_hlen; @@ -1166,7 +1171,12 @@ static int ip6gre_calc_hlen(struct ip6_tnl *tunnel) tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr); - tunnel->dev->needed_headroom = LL_MAX_HEADER + t_hlen; + + if (tunnel->dev->header_ops) + tunnel->dev->hard_header_len = LL_MAX_HEADER + t_hlen; + else + tunnel->dev->needed_headroom = LL_MAX_HEADER + t_hlen; + return t_hlen; }
From: "Naveen N. Rao" naveen.n.rao@linux.vnet.ibm.com
stable inclusion from linux-4.19.163 commit 73b14c21c5dce6046daff0cc8965226581eb6681
--------------------------------
commit 4c75b0ff4e4bf7a45b5aef9639799719c28d0073 upstream.
On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in ftrace_get_addr_new() followed by the below message: Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001)
The set of steps leading to this involved: - modprobe ftrace-direct-too - enable_probe - modprobe ftrace-direct - rmmod ftrace-direct <-- trigger
The problem turned out to be that we were not updating flags in the ftrace record properly. From the above message about the trampoline accounting being bad, it can be seen that the ftrace record still has FTRACE_FL_TRAMP set though ftrace-direct module is going away. This happens because we are checking if any ftrace_ops has the FTRACE_FL_TRAMP flag set _before_ updating the filter hash.
The fix for this is to look for any _other_ ftrace_ops that also needs FTRACE_FL_TRAMP.
Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.160641243...
Cc: stable@vger.kernel.org Fixes: a124692b698b0 ("ftrace: Enable trampoline when rec count returns back to one") Signed-off-by: Naveen N. Rao naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/ftrace.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index fd42d3e685f4..85fe273cadae 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1650,6 +1650,8 @@ static bool test_rec_ops_needs_regs(struct dyn_ftrace *rec) static struct ftrace_ops * ftrace_find_tramp_ops_any(struct dyn_ftrace *rec); static struct ftrace_ops * +ftrace_find_tramp_ops_any_other(struct dyn_ftrace *rec, struct ftrace_ops *op_exclude); +static struct ftrace_ops * ftrace_find_tramp_ops_next(struct dyn_ftrace *rec, struct ftrace_ops *ops);
static bool __ftrace_hash_rec_update(struct ftrace_ops *ops, @@ -1787,7 +1789,7 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops, * to it. */ if (ftrace_rec_count(rec) == 1 && - ftrace_find_tramp_ops_any(rec)) + ftrace_find_tramp_ops_any_other(rec, ops)) rec->flags |= FTRACE_FL_TRAMP; else rec->flags &= ~FTRACE_FL_TRAMP; @@ -2215,6 +2217,24 @@ ftrace_find_tramp_ops_any(struct dyn_ftrace *rec) return NULL; }
+static struct ftrace_ops * +ftrace_find_tramp_ops_any_other(struct dyn_ftrace *rec, struct ftrace_ops *op_exclude) +{ + struct ftrace_ops *op; + unsigned long ip = rec->ip; + + do_for_each_ftrace_op(op, ftrace_ops_list) { + + if (op == op_exclude || !op->trampoline) + continue; + + if (hash_contains_ip(ip, op->func_hash)) + return op; + } while_for_each_ftrace_op(op); + + return NULL; +} + static struct ftrace_ops * ftrace_find_tramp_ops_next(struct dyn_ftrace *rec, struct ftrace_ops *op)
From: Paulo Alcantara pc@cjr.nz
stable inclusion from linux-4.19.163 commit ba54b13c835609b3976714d3dde010c57d0fe23e
--------------------------------
commit 212253367dc7b49ed3fc194ce71b0992eacaecf2 upstream.
This patch fixes a potential use-after-free bug in cifs_echo_request().
For instance,
thread 1 -------- cifs_demultiplex_thread() clean_demultiplex_info() kfree(server)
thread 2 (workqueue) -------- apic_timer_interrupt() smp_apic_timer_interrupt() irq_exit() __do_softirq() run_timer_softirq() call_timer_fn() cifs_echo_request() <- use-after-free in server ptr
Signed-off-by: Paulo Alcantara (SUSE) pc@cjr.nz CC: Stable stable@vger.kernel.org Reviewed-by: Ronnie Sahlberg lsahlber@redhat.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/connect.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index d3bdf8c418f9..423bc5b481e4 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -777,6 +777,8 @@ static void clean_demultiplex_info(struct TCP_Server_Info *server) list_del_init(&server->tcp_ses_list); spin_unlock(&cifs_tcp_ses_lock);
+ cancel_delayed_work_sync(&server->echo); + spin_lock(&GlobalMid_Lock); server->tcpStatus = CifsExiting; spin_unlock(&GlobalMid_Lock);
From: Yang Shi shy828301@gmail.com
stable inclusion from linux-4.19.163 commit 7517a3834271fd761ffab9b8b3f55c16e6063f87
--------------------------------
commit 8199be001a470209f5c938570cc199abb012fe53 upstream.
When investigating a slab cache bloat problem, a significant amount of negative dentry cache was seen, but confusingly it was neither shrunk by the reclaimer (the host has very tight memory) nor by dropping caches. The vmcore shows there are over 14M negative dentry objects on the lru, but tracing shows they were not even scanned at all.
Further investigation shows the memcg's vfs shrinker_map bit is not set, so the reclaimer and cache dropping simply skip calling the vfs shrinker. We had to reboot the hosts to get the memory back.
I didn't manage to come up with a reproducer in a test environment, and the problem can't be reproduced after rebooting. But from code inspection it seems there is a race between clearing the shrinker map bit and reparenting. The hypothesis is elaborated below.
The memcg hierarchy on our production environment looks like:
        root
       /    \
  system    user
The main workloads are running under user slice's children, and it creates and removes memcg frequently. So reparenting happens very often under user slice, but no task is under user slice directly.
So with the frequent reparenting and tight memory pressure, the below hypothetical race condition may happen:
CPU A (reparenting)                       CPU B (shrinker)
reparent:
    dst->nr_items == 0
                                          shrinker run:
                                              total_objects == 0
    add src->nr_items to dst
    set_bit
                                              return SHRINK_EMPTY
                                              clear_bit
child memcg offline:
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
                                          list_lru_del() between shrinker runs:
                                              sees parent's kmemcg_id
                                              dec dst->nr_items
reparent again:
    dst->nr_items may go negative
    due to concurrent list_lru_del()
                                          second shrinker run:
                                              reads nr_items without any
                                              synchronization, so it may see
                                              the intermediate negative
                                              nr_items, and total_objects may
                                              return 0 coincidentally
                                              keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add src->nr_items to dst
After this point dst->nr_items may never go back to zero, so reparenting will not set the shrinker_map bit anymore. And since there is no task under the user slice directly, no new object will be added to its lru to set the shrinker map bit either. That bit is kept cleared forever.
How does list_lru_del() race with reparenting? It is because reparenting replaces the children's kmemcg_id with the parent's without protection from nlru->lock, so list_lru_del() may see the parent's kmemcg_id while actually deleting items from the child's lru, but decrementing the parent's nr_items; hence the parent's nr_items may go negative, as commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on css offline") says.
Since it is impossible that dst->nr_items goes negative and src->nr_items goes to zero at the same time, it seems we could set the shrinker map bit iff src->nr_items != 0. We could synchronize list_lru_count_one() and reparenting with nlru->lock, but checking src->nr_items in reparenting seems the simplest and avoids lock contention.
Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance") Suggested-by: Roman Gushchin guro@fb.com Signed-off-by: Yang Shi shy828301@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Roman Gushchin guro@fb.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Kirill Tkhai ktkhai@virtuozzo.com Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: stable@vger.kernel.org [4.19] Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/list_lru.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c index 758653dd1443..f7d2ffc209fb 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -542,7 +542,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid, struct list_lru_node *nlru = &lru->node[nid]; int dst_idx = dst_memcg->kmemcg_id; struct list_lru_one *src, *dst; - bool set;
/* * Since list_lru_{add,del} may be called under an IRQ-safe lock, @@ -554,11 +553,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid, dst = list_lru_from_memcg_idx(nlru, dst_idx);
list_splice_init(&src->list, &dst->list); - set = (!dst->nr_items && src->nr_items); - dst->nr_items += src->nr_items; - if (set) + + if (src->nr_items) { + dst->nr_items += src->nr_items; memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru)); - src->nr_items = 0; + src->nr_items = 0; + }
spin_unlock_irq(&nlru->lock); }
From: Qian Cai qcai@redhat.com
stable inclusion from linux-4.19.163 commit 2498ae24602cdd33666cc305e16773c73c2ec252
--------------------------------
commit b11a76b37a5aa7b07c3e3eeeaae20b25475bddd3 upstream.
We can't call kvfree() with a spin lock held, so defer it. Fixes a might_sleep() runtime warning.
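The fix follows the usual pattern (sketch): remember the pointer while the lock is held and free it only after dropping the lock, since kvfree() can end up in vfree(), which may sleep:

	spin_lock(&swap_lock);
	/* ... */
	defer = p;			/* do not kvfree() under the spinlock */
	/* ... */
	spin_unlock(&swap_lock);
	kvfree(defer);			/* safe here: no spinlock held */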
Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation") Signed-off-by: Qian Cai qcai@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Andrew Morton akpm@linux-foundation.org Cc: Hugh Dickins hughd@google.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/swapfile.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c index 46e6b0c06f09..13171e764c56 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2930,6 +2930,7 @@ late_initcall(max_swapfiles_check); static struct swap_info_struct *alloc_swap_info(void) { struct swap_info_struct *p; + struct swap_info_struct *defer = NULL; unsigned int type; int i; int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); @@ -2959,7 +2960,7 @@ static struct swap_info_struct *alloc_swap_info(void) smp_wmb(); WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); } else { - kvfree(p); + defer = p; p = swap_info[type]; /* * Do not memset this entry: a racing procfs swap_next() @@ -2972,6 +2973,7 @@ static struct swap_info_struct *alloc_swap_info(void) plist_node_init(&p->avail_lists[i], 0); p->flags = SWP_USED; spin_unlock(&swap_lock); + kvfree(defer); spin_lock_init(&p->lock); spin_lock_init(&p->cont_lock);
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.163 commit 1093c9a445ae0da5a3faedb7cc5b2f70cf05d82c
--------------------------------
commit bcee5278958802b40ee8b26679155a6d9231783e upstream.
When trace instances were given the ability to use their own options, the userstacktrace option was left hardcoded to the top level. This made the per-instance userstacktrace option basically a nop, which confuses users who set it only to find that nothing happens (I was confused when it happened to me!).
Cc: stable@vger.kernel.org Fixes: 16270145ce6b ("tracing: Add trace options for core options to instances") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/trace.c | 7 ++++--- kernel/trace/trace.h | 6 ++++-- 2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b4f429106ad4..fd5415230e4a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2414,7 +2414,7 @@ void trace_buffer_unlock_commit_regs(struct trace_array *tr, * two. They are not that meaningful. */ ftrace_trace_stack(tr, buffer, flags, regs ? 0 : STACK_SKIP, pc, regs); - ftrace_trace_userstack(buffer, flags, pc); + ftrace_trace_userstack(tr, buffer, flags, pc); }
/* @@ -2734,14 +2734,15 @@ void trace_dump_stack(int skip) static DEFINE_PER_CPU(int, user_stack_count);
void -ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc) +ftrace_trace_userstack(struct trace_array *tr, + struct ring_buffer *buffer, unsigned long flags, int pc) { struct trace_event_call *call = &event_user_stack; struct ring_buffer_event *event; struct userstack_entry *entry; struct stack_trace trace;
- if (!(global_trace.trace_flags & TRACE_ITER_USERSTACKTRACE)) + if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE)) return;
/* diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 4088742fbb08..6b83264e1bea 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -743,13 +743,15 @@ void update_max_tr_single(struct trace_array *tr, #endif /* CONFIG_TRACER_MAX_TRACE */
#ifdef CONFIG_STACKTRACE -void ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, +void ftrace_trace_userstack(struct trace_array *tr, + struct ring_buffer *buffer, unsigned long flags, int pc);
void __trace_stack(struct trace_array *tr, unsigned long flags, int skip, int pc); #else -static inline void ftrace_trace_userstack(struct ring_buffer *buffer, +static inline void ftrace_trace_userstack(struct trace_array *tr, + struct ring_buffer *buffer, unsigned long flags, int pc) { }
From: Florian Westphal fw@strlen.de
stable inclusion from linux-4.19.163 commit 0fc22510ff5c8fb695781c7968b95da4a6110625
--------------------------------
commit c0700dfa2cae44c033ed97dade8a2679c7d22a9d upstream.
There are reports of lockdep splats in nftables, e.g.:

------------[ cut here ]------------
WARNING: CPU: 2 PID: 31416 at net/netfilter/nf_tables_api.c:622 lockdep_nfnl_nft_mutex_not_held+0x28/0x38 [nf_tables]
...
These are caused by an earlier, unrelated bug such as an ABBA deadlock in a different subsystem. In such an event, lockdep is disabled and lockdep_is_held returns true unconditionally. This then causes the WARN() in nf_tables.
Make the WARN conditional on lockdep still active to avoid this.
Fixes: f102d66b335a417 ("netfilter: nf_tables: use dedicated mutex to guard transactions") Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Link: https://lore.kernel.org/linux-kselftest/CA+G9fYvFUpODs+NkSYcnwKnXm62tmP=ksLe... Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nf_tables_api.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 5b4632826dc6..9cc8e92f4b00 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -532,7 +532,8 @@ static void nft_request_module(struct net *net, const char *fmt, ...) static void lockdep_nfnl_nft_mutex_not_held(void) { #ifdef CONFIG_PROVE_LOCKING - WARN_ON_ONCE(lockdep_nfnl_is_held(NFNL_SUBSYS_NFTABLES)); + if (debug_locks) + WARN_ON_ONCE(lockdep_nfnl_is_held(NFNL_SUBSYS_NFTABLES)); #endif }
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.164 commit 8bf5774646e56c0860fa45c078ab6c1b8e8443c5
--------------------------------
[ Upstream commit 72d05c00d7ecda85df29abd046da7e41cc071c17 ]
Before commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.
This is no longer the case, and Hazem Mohamed Abuelfotoh reported that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.
Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit of rcvq_space.space computation, while it can select later a smaller value for tp->rcv_ssthresh and tp->window_clamp.
ss -temoi on the receiver would show:
skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596
This means that TCP can not increase its window in tcp_grow_window(), and that DRS can never kick in.
Fix this by making sure that rcvq_space.space is not bigger than number of bytes that can be held in TCP receive queue.
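In readable form, the hunk below drops the early initialisation and, once rcv_ssthresh and window_clamp have been clamped, sets:

	tp->rcvq_space.space = min3(tp->rcv_ssthresh, tp->rcv_wnd,
				    (u32)TCP_INIT_CWND * tp->advmss);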
People unable/unwilling to change their kernel can work around this issue by selecting a bigger tcp_rmem[1] value, as in:
echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem
Based on an initial report and patch from Hazem Mohamed Abuelfotoh https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner") Reported-by: Hazem Mohamed Abuelfotoh abuehaze@amazon.com Signed-off-by: Eric Dumazet edumazet@google.com Acked-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp_input.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0b3f8637d4fa..b58c8fe44dab 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -439,7 +439,6 @@ void tcp_init_buffer_space(struct sock *sk) if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) tcp_sndbuf_expand(sk);
- tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss); tcp_mstamp_refresh(tp); tp->rcvq_space.time = tp->tcp_mstamp; tp->rcvq_space.seq = tp->copied_seq; @@ -463,6 +462,8 @@ void tcp_init_buffer_space(struct sock *sk)
tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp); tp->snd_cwnd_stamp = tcp_jiffies32; + tp->rcvq_space.space = min3(tp->rcv_ssthresh, tp->rcv_wnd, + (u32)TCP_INIT_CWND * tp->advmss); }
/* 4. Recalculate window clamp after socket hit its memory bounds. */
From: Neal Cardwell ncardwell@google.com
stable inclusion from linux-4.19.164 commit 1bbbaaf3e64d934785f4c17f03d327a695d01006
--------------------------------
[ Upstream commit 299bcb55ecd1412f6df606e9dc0912d55610029e ]
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get into persistent scenarios where we have the following sequence:
(1) ACK for full-sized skb of N*MSS arrives
    -> tcp_write_xmit() transmit full-sized skb with N*MSS
    -> move pacing release time forward
    -> exit tcp_write_xmit() because pacing time is in the future
(2) TSQ callback or TCP internal pacing timer fires
    -> try to transmit next skb, but TSO deferral finds remainder of
       available cwnd is not big enough to trigger an immediate send
       now, so we defer sending until the next ACK.
(3) repeat...
So we can get into a case where we never mark ourselves as cwnd-limited for many seconds at a time, even with bulk/infinite-backlog senders, because:
o In case (1) above, every time in tcp_write_xmit() we have enough cwnd to send a full-sized skb, we are not fully using the cwnd (because cwnd is not a multiple of the TSO skb size). So every time we send data, we are not cwnd limited, and so in the cwnd-limited tracking code in tcp_cwnd_validate() we mark ourselves as not cwnd-limited.
o In case (2) above, every time in tcp_write_xmit() that we try to transmit the "remainder" of the cwnd but defer, we set the local variable is_cwnd_limited to true, but we do not send any packets, so sent_pkts is zero, so we don't call the cwnd-limited logic to update tp->is_cwnd_limited.
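In readable form, the hunk below moves the accounting so that a deferred (zero-send) round still updates the cwnd-limited state:

	is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
	if (likely(sent_pkts || is_cwnd_limited))
		tcp_cwnd_validate(sk, is_cwnd_limited);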
Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Reported-by: Ingemar Johansson ingemar.s.johansson@ericsson.com Signed-off-by: Neal Cardwell ncardwell@google.com Signed-off-by: Yuchung Cheng ycheng@google.com Acked-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.co... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp_output.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 74fb211e0ea6..3cfefec81975 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1622,7 +1622,8 @@ static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited) * window, and remember whether we were cwnd-limited then. */ if (!before(tp->snd_una, tp->max_packets_seq) || - tp->packets_out > tp->max_packets_out) { + tp->packets_out > tp->max_packets_out || + is_cwnd_limited) { tp->max_packets_out = tp->packets_out; tp->max_packets_seq = tp->snd_nxt; tp->is_cwnd_limited = is_cwnd_limited; @@ -2407,6 +2408,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, else tcp_chrono_stop(sk, TCP_CHRONO_RWND_LIMITED);
+ is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd); + if (likely(sent_pkts || is_cwnd_limited)) + tcp_cwnd_validate(sk, is_cwnd_limited); + if (likely(sent_pkts)) { if (tcp_in_cwnd_reduction(sk)) tp->prr_out += sent_pkts; @@ -2414,8 +2419,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, /* Send one loss probe per tail loss episode. */ if (push_one != 2) tcp_schedule_loss_probe(sk, false); - is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd); - tcp_cwnd_validate(sk, is_cwnd_limited); return false; } return !tp->packets_out && !tcp_write_queue_empty(sk);
From: Zhang Changzhong zhangchangzhong@huawei.com
stable inclusion from linux-4.19.164 commit e64cc462703f209ff0db62e5c75b7019df0cac97
--------------------------------
[ Upstream commit ee4f52a8de2c6f78b01f10b4c330867d88c1653a ]
Return a negative error code from the error handling case instead of 0, as is done elsewhere in this function.
Fixes: f8ed289fab84 ("bridge: vlan: use br_vlan_(get|put)_master to deal with refcounts") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Zhang Changzhong zhangchangzhong@huawei.com Acked-by: Nikolay Aleksandrov nikolay@nvidia.com Link: https://lore.kernel.org/r/1607071737-33875-1-git-send-email-zhangchangzhong@... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/bridge/br_vlan.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c index 5f3950f00f73..a82d0021d461 100644 --- a/net/bridge/br_vlan.c +++ b/net/bridge/br_vlan.c @@ -242,8 +242,10 @@ static int __vlan_add(struct net_bridge_vlan *v, u16 flags) }
masterv = br_vlan_get_master(br, v->vid); - if (!masterv) + if (!masterv) { + err = -ENOMEM; goto out_filt; + } v->brvlan = masterv; v->stats = masterv->stats; } else {
From: Sami Tolvanen samitolvanen@google.com
stable inclusion from linux-4.19.164 commit 5681cce44cf41dc3eed8fd62b9e299bb57d1cb7b
--------------------------------
commit e0d5896bd356cd577f9710a02d7a474cdf58426b upstream.
Unlike gcc, clang considers each inline assembly block to be independent and therefore, when using the integrated assembler for inline assembly, any preambles that enable features must be repeated in each block.
This change defines __LSE_PREAMBLE and adds it to each inline assembly block that has LSE instructions, which allows them to be compiled also with clang's assembler.
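A minimal, illustrative example of the pattern (not the kernel's actual atomic macros; the hunks below show the real ones). The preamble string is prepended to every asm() block, so each independently assembled block enables LSE:

	#define __LSE_PREAMBLE	".arch armv8-a+lse\n"

	/* hypothetical helper, for illustration only */
	static inline void lse_add(int i, int *counter)
	{
		asm volatile(__LSE_PREAMBLE
			     "stadd %w[i], %[v]\n"
			     : [v] "+Q" (*counter)
			     : [i] "r" (i));
	}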
Link: https://github.com/ClangBuiltLinux/linux/issues/671 Signed-off-by: Sami Tolvanen samitolvanen@google.com Tested-by: Andrew Murray andrew.murray@arm.com Tested-by: Kees Cook keescook@chromium.org Reviewed-by: Andrew Murray andrew.murray@arm.com Reviewed-by: Kees Cook keescook@chromium.org Reviewed-by: Nick Desaulniers ndesaulniers@google.com Signed-off-by: Will Deacon will@kernel.org [nd: backport adjusted due to missing: commit addfc38672c7 ("arm64: atomics: avoid out-of-line ll/sc atomics")] Signed-off-by: Nick Desaulniers ndesaulniers@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/atomic_lse.h | 76 +++++++++++++++++++++-------- arch/arm64/include/asm/lse.h | 6 +-- 2 files changed, 60 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h index f9b0b09153e0..eab3de4f2ad2 100644 --- a/arch/arm64/include/asm/atomic_lse.h +++ b/arch/arm64/include/asm/atomic_lse.h @@ -32,7 +32,9 @@ static inline void atomic_##op(int i, atomic_t *v) \ register int w0 asm ("w0") = i; \ register atomic_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(op), \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(op), \ " " #asm_op " %w[i], %[v]\n") \ : [i] "+r" (w0), [v] "+Q" (v->counter) \ : "r" (x1) \ @@ -52,7 +54,9 @@ static inline int atomic_fetch_##op##name(int i, atomic_t *v) \ register int w0 asm ("w0") = i; \ register atomic_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC(fetch_##op##name), \ /* LSE atomics */ \ @@ -84,7 +88,9 @@ static inline int atomic_add_return##name(int i, atomic_t *v) \ register int w0 asm ("w0") = i; \ register atomic_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC(add_return##name) \ __nops(1), \ @@ -110,7 +116,9 @@ static inline void atomic_and(int i, atomic_t *v) register int w0 asm ("w0") = i; register atomic_t *x1 asm ("x1") = v;
- asm volatile(ARM64_LSE_ATOMIC_INSN( + asm volatile( + __LSE_PREAMBLE + ARM64_LSE_ATOMIC_INSN( /* LL/SC */ __LL_SC_ATOMIC(and) __nops(1), @@ -128,7 +136,9 @@ static inline int atomic_fetch_and##name(int i, atomic_t *v) \ register int w0 asm ("w0") = i; \ register atomic_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC(fetch_and##name) \ __nops(1), \ @@ -154,7 +164,9 @@ static inline void atomic_sub(int i, atomic_t *v) register int w0 asm ("w0") = i; register atomic_t *x1 asm ("x1") = v;
- asm volatile(ARM64_LSE_ATOMIC_INSN( + asm volatile( + __LSE_PREAMBLE + ARM64_LSE_ATOMIC_INSN( /* LL/SC */ __LL_SC_ATOMIC(sub) __nops(1), @@ -172,7 +184,9 @@ static inline int atomic_sub_return##name(int i, atomic_t *v) \ register int w0 asm ("w0") = i; \ register atomic_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC(sub_return##name) \ __nops(2), \ @@ -200,7 +214,9 @@ static inline int atomic_fetch_sub##name(int i, atomic_t *v) \ register int w0 asm ("w0") = i; \ register atomic_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC(fetch_sub##name) \ __nops(1), \ @@ -229,7 +245,9 @@ static inline void atomic64_##op(long i, atomic64_t *v) \ register long x0 asm ("x0") = i; \ register atomic64_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(op), \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(op), \ " " #asm_op " %[i], %[v]\n") \ : [i] "+r" (x0), [v] "+Q" (v->counter) \ : "r" (x1) \ @@ -249,7 +267,9 @@ static inline long atomic64_fetch_##op##name(long i, atomic64_t *v) \ register long x0 asm ("x0") = i; \ register atomic64_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC64(fetch_##op##name), \ /* LSE atomics */ \ @@ -281,7 +301,9 @@ static inline long atomic64_add_return##name(long i, atomic64_t *v) \ register long x0 asm ("x0") = i; \ register atomic64_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC64(add_return##name) \ __nops(1), \ @@ -307,7 +329,9 @@ static inline void atomic64_and(long i, atomic64_t *v) register long x0 asm ("x0") = i; register atomic64_t *x1 asm ("x1") = v;
- asm volatile(ARM64_LSE_ATOMIC_INSN( + asm volatile( + __LSE_PREAMBLE + ARM64_LSE_ATOMIC_INSN( /* LL/SC */ __LL_SC_ATOMIC64(and) __nops(1), @@ -325,7 +349,9 @@ static inline long atomic64_fetch_and##name(long i, atomic64_t *v) \ register long x0 asm ("x0") = i; \ register atomic64_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC64(fetch_and##name) \ __nops(1), \ @@ -351,7 +377,9 @@ static inline void atomic64_sub(long i, atomic64_t *v) register long x0 asm ("x0") = i; register atomic64_t *x1 asm ("x1") = v;
- asm volatile(ARM64_LSE_ATOMIC_INSN( + asm volatile( + __LSE_PREAMBLE + ARM64_LSE_ATOMIC_INSN( /* LL/SC */ __LL_SC_ATOMIC64(sub) __nops(1), @@ -369,7 +397,9 @@ static inline long atomic64_sub_return##name(long i, atomic64_t *v) \ register long x0 asm ("x0") = i; \ register atomic64_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC64(sub_return##name) \ __nops(2), \ @@ -397,7 +427,9 @@ static inline long atomic64_fetch_sub##name(long i, atomic64_t *v) \ register long x0 asm ("x0") = i; \ register atomic64_t *x1 asm ("x1") = v; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_ATOMIC64(fetch_sub##name) \ __nops(1), \ @@ -422,7 +454,9 @@ static inline long atomic64_dec_if_positive(atomic64_t *v) { register long x0 asm ("x0") = (long)v;
- asm volatile(ARM64_LSE_ATOMIC_INSN( + asm volatile( + __LSE_PREAMBLE + ARM64_LSE_ATOMIC_INSN( /* LL/SC */ __LL_SC_ATOMIC64(dec_if_positive) __nops(6), @@ -455,7 +489,9 @@ static inline unsigned long __cmpxchg_case_##name(volatile void *ptr, \ register unsigned long x1 asm ("x1") = old; \ register unsigned long x2 asm ("x2") = new; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_CMPXCHG(name) \ __nops(2), \ @@ -507,7 +543,9 @@ static inline long __cmpxchg_double##name(unsigned long old1, \ register unsigned long x3 asm ("x3") = new2; \ register unsigned long x4 asm ("x4") = (unsigned long)ptr; \ \ - asm volatile(ARM64_LSE_ATOMIC_INSN( \ + asm volatile( \ + __LSE_PREAMBLE \ + ARM64_LSE_ATOMIC_INSN( \ /* LL/SC */ \ __LL_SC_CMPXCHG_DBL(name) \ __nops(3), \ diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h index 8262325e2fc6..960213767f1e 100644 --- a/arch/arm64/include/asm/lse.h +++ b/arch/arm64/include/asm/lse.h @@ -4,6 +4,8 @@
#if defined(CONFIG_AS_LSE) && defined(CONFIG_ARM64_LSE_ATOMICS)
+#define __LSE_PREAMBLE ".arch armv8-a+lse\n" + #include <linux/compiler_types.h> #include <linux/export.h> #include <linux/stringify.h> @@ -20,8 +22,6 @@
#else /* __ASSEMBLER__ */
-__asm__(".arch_extension lse"); - /* Move the ll/sc atomics out-of-line */ #define __LL_SC_INLINE notrace #define __LL_SC_PREFIX(x) __ll_sc_##x @@ -33,7 +33,7 @@ __asm__(".arch_extension lse");
/* In-line patching at runtime */ #define ARM64_LSE_ATOMIC_INSN(llsc, lse) \ - ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS) + ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
#endif /* __ASSEMBLER__ */ #else /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */
From: Vincenzo Frascino vincenzo.frascino@arm.com
stable inclusion from linux-4.19.164 commit 96fd4981791602ba6f4cbe23c4bd9386408940c1
--------------------------------
commit dd1f6308b28edf0452dd5dc7877992903ec61e69 upstream.
Commit e0d5896bd356 ("arm64: lse: fix LSE atomics with LLVM's integrated assembler") broke the build when clang is used in connjunction with the binutils assembler ("-no-integrated-as"). This happens because __LSE_PREAMBLE is defined as ".arch armv8-a+lse", which overrides the version of the CPU architecture passed via the "-march" paramter to gas:
$ aarch64-none-linux-gnu-as -EL -I ./arch/arm64/include -I ./arch/arm64/include/generated -I ./include -I ./include -I ./arch/arm64/include/uapi -I ./arch/arm64/include/generated/uapi -I ./include/uapi -I ./include/generated/uapi -I ./init -I ./init -march=armv8.3-a -o init/do_mounts.o /tmp/do_mounts-d7992a.s
/tmp/do_mounts-d7992a.s: Assembler messages:
/tmp/do_mounts-d7992a.s:1959: Error: selected processor does not support `autiasp'
/tmp/do_mounts-d7992a.s:2021: Error: selected processor does not support `paciasp'
/tmp/do_mounts-d7992a.s:2157: Error: selected processor does not support `autiasp'
/tmp/do_mounts-d7992a.s:2175: Error: selected processor does not support `paciasp'
/tmp/do_mounts-d7992a.s:2494: Error: selected processor does not support `autiasp'
Fix the issue by replacing ".arch armv8-a+lse" with ".arch_extension lse". Sami confirms that the clang integrated assembler does now support the '.arch_extension' directive, so this change will be fine even for LTO builds in future.
Fixes: e0d5896bd356cd ("arm64: lse: fix LSE atomics with LLVM's integrated assembler") Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Reported-by: Amit Kachhap Amit.Kachhap@arm.com Tested-by: Sami Tolvanen samitolvanen@google.com Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Nick Desaulniers ndesaulniers@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/lse.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h index 960213767f1e..13536c4da2c2 100644 --- a/arch/arm64/include/asm/lse.h +++ b/arch/arm64/include/asm/lse.h @@ -4,7 +4,7 @@
#if defined(CONFIG_AS_LSE) && defined(CONFIG_ARM64_LSE_ATOMICS)
-#define __LSE_PREAMBLE ".arch armv8-a+lse\n" +#define __LSE_PREAMBLE ".arch_extension lse\n"
#include <linux/compiler_types.h> #include <linux/export.h>
From: Fangrui Song maskray@google.com
stable inclusion from linux-4.19.164 commit 13ed97c2bb939890fe0814d6952189dfec57797f
--------------------------------
commit ec9d78070de986ecf581ea204fd322af4d2477ec upstream.
Commit 39d114ddc682 ("arm64: add KASAN support") added .weak directives to arch/arm64/lib/mem*.S instead of changing the existing SYM_FUNC_START_PI macros. This can lead to the assembly snippet `.weak memcpy ... .globl memcpy` which will produce a STB_WEAK memcpy with GNU as but STB_GLOBAL memcpy with LLVM's integrated assembler before LLVM 12. LLVM 12 (since https://reviews.llvm.org/D90108) will error on such an overridden symbol binding.
Use the appropriate SYM_FUNC_START_WEAK_PI instead.
Fixes: 39d114ddc682 ("arm64: add KASAN support") Reported-by: Sami Tolvanen samitolvanen@google.com Signed-off-by: Fangrui Song maskray@google.com Tested-by: Sami Tolvanen samitolvanen@google.com Tested-by: Nick Desaulniers ndesaulniers@google.com Reviewed-by: Nick Desaulniers ndesaulniers@google.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201029181951.1866093-1-maskray@google.com Signed-off-by: Will Deacon will@kernel.org [nd: backport to adjust for missing: commit 3ac0f4526dfb ("arm64: lib: Use modern annotations for assembly functions") commit 35e61c77ef38 ("arm64: asm: Add new-style position independent function annotations")] Signed-off-by: Nick Desaulniers ndesaulniers@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/lib/memcpy.S | 3 +-- arch/arm64/lib/memmove.S | 3 +-- arch/arm64/lib/memset.S | 3 +-- 3 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S index 67613937711f..dfedd4ab1a76 100644 --- a/arch/arm64/lib/memcpy.S +++ b/arch/arm64/lib/memcpy.S @@ -68,9 +68,8 @@ stp \ptr, \regB, [\regC], \val .endm
- .weak memcpy ENTRY(__memcpy) -ENTRY(memcpy) +WEAK(memcpy) #include "copy_template.S" ret ENDPIPROC(memcpy) diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S index a5a4459013b1..e3de8f05c21a 100644 --- a/arch/arm64/lib/memmove.S +++ b/arch/arm64/lib/memmove.S @@ -57,9 +57,8 @@ C_h .req x12 D_l .req x13 D_h .req x14
- .weak memmove ENTRY(__memmove) -ENTRY(memmove) +WEAK(memmove) cmp dstin, src b.lo __memcpy add tmp1, src, count diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S index f2670a9f218c..316263c47c00 100644 --- a/arch/arm64/lib/memset.S +++ b/arch/arm64/lib/memset.S @@ -54,9 +54,8 @@ dst .req x8 tmp3w .req w9 tmp3 .req x9
- .weak memset ENTRY(__memset) -ENTRY(memset) +WEAK(memset) mov dst, dstin /* Preserve return value. */ and A_lw, val, #255 orr A_lw, A_lw, A_lw, lsl #8
From: Johannes Thumshirn johannes.thumshirn@wdc.com
stable inclusion from linux-4.19.164 commit 7894076fbc90e8c2528e339d50f0e669a6e7fc7a
--------------------------------
[ Upstream commit c92a41031a6d57395889b5c87cea359220a24d2a ]
Factor out the requeue handling from the dispatch code, this will make subsequent addition of different requeueing schemes easier.
Signed-off-by: Johannes Thumshirn johannes.thumshirn@wdc.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-mq.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index d095fe0b2ac0..795ed40ce415 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1142,6 +1142,23 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
#define BLK_MQ_RESOURCE_DELAY 3 /* ms units */
+static void blk_mq_handle_dev_resource(struct request *rq, + struct list_head *list) +{ + struct request *next = + list_first_entry_or_null(list, struct request, queuelist); + + /* + * If an I/O scheduler has been configured and we got a driver tag for + * the next request already, free it. + */ + if (next) + blk_mq_put_driver_tag(next); + + list_add(&rq->queuelist, list); + __blk_mq_requeue_request(rq); +} + /* * Returns true if we did some work AND can potentially do more. */ @@ -1213,17 +1230,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
ret = q->mq_ops->queue_rq(hctx, &bd); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { - /* - * If an I/O scheduler has been configured and we got a - * driver tag for the next request already, free it - * again. - */ - if (!list_empty(list)) { - nxt = list_first_entry(list, struct request, queuelist); - blk_mq_put_driver_tag(nxt); - } - list_add(&rq->queuelist, list); - __blk_mq_requeue_request(rq); + blk_mq_handle_dev_resource(rq, list); break; }
From: Subash Abhinov Kasiviswanathan subashab@codeaurora.org
stable inclusion from linux-4.19.164 commit 98ab3ff5e789985ec8c24f813c7a989b445da084
--------------------------------
[ Upstream commit cc00bcaa589914096edef7fb87ca5cee4a166b5c ]
When running concurrent iptables rules replacement with data, the per CPU sequence count is checked after the assignment of the new information. The sequence count is used to synchronize with the packet path without the use of any explicit locking. If there are any packets in the packet path using the table information, the sequence count is incremented to an odd value and is incremented to an even after the packet process completion.
The new table value assignment is followed by a write memory barrier so every CPU should see the latest value. If the packet path has started with the old table information, the sequence counter will be odd and the iptables replacement will wait till the sequence count is even prior to freeing the old table info.
However, this assumes that the new table information assignment and the memory barrier are actually executed prior to the counter check in the replacement thread. If the CPU decides to execute the assignment later, as there is no user of the table information prior to the sequence check, the packet path on another CPU may use the old table information. The replacement thread would then free the table information under it, leading to a use-after-free in the packet processing context:
Unable to handle kernel NULL pointer dereference at virtual address 000000000000008e
 pc : ip6t_do_table+0x5d0/0x89c
 lr : ip6t_do_table+0x5b8/0x89c
 ip6t_do_table+0x5d0/0x89c
 ip6table_filter_hook+0x24/0x30
 nf_hook_slow+0x84/0x120
 ip6_input+0x74/0xe0
 ip6_rcv_finish+0x7c/0x128
 ipv6_rcv+0xac/0xe4
 __netif_receive_skb+0x84/0x17c
 process_backlog+0x15c/0x1b8
 napi_poll+0x88/0x284
 net_rx_action+0xbc/0x23c
 __do_softirq+0x20c/0x48c
This could be fixed by forcing instruction order after the new table information assignment or by switching to RCU for the synchronization.
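As a sketch of the RCU scheme the patch below switches to (simplified; 'table_mutex' here stands in for the real xt[af].mutex), readers dereference the private pointer from their packet-path context, while the writer publishes the new table with rcu_assign_pointer() and waits in synchronize_rcu() before the old table info may be freed:

	static struct xt_table_info __rcu *private;	/* table->private in the real code */

	/* writer, called with table_mutex held */
	static struct xt_table_info *replace_table(struct xt_table_info *newinfo)
	{
		struct xt_table_info *old;

		old = rcu_dereference_protected(private,
						mutex_is_locked(&table_mutex));
		rcu_assign_pointer(private, newinfo);
		synchronize_rcu();	/* wait for in-flight packet-path readers */
		return old;		/* caller may now free it */
	}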
Fixes: 80055dab5de0 ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore") Reported-by: Sean Tranchetti stranche@codeaurora.org Reported-by: kernel test robot lkp@intel.com Suggested-by: Florian Westphal fw@strlen.de Signed-off-by: Subash Abhinov Kasiviswanathan subashab@codeaurora.org Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netfilter/x_tables.h | 5 ++- net/ipv4/netfilter/arp_tables.c | 14 ++++----- net/ipv4/netfilter/ip_tables.c | 14 ++++----- net/ipv6/netfilter/ip6_tables.c | 14 ++++----- net/netfilter/x_tables.c | 49 +++++++++--------------------- 5 files changed, 40 insertions(+), 56 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 9077b3ebea08..728d7716bf4f 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -227,7 +227,7 @@ struct xt_table { unsigned int valid_hooks;
/* Man behind the curtain... */ - struct xt_table_info *private; + struct xt_table_info __rcu *private;
/* Set this to THIS_MODULE if you are a module, otherwise NULL */ struct module *me; @@ -449,6 +449,9 @@ xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)
struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);
+struct xt_table_info +*xt_table_get_private_protected(const struct xt_table *table); + #ifdef CONFIG_COMPAT #include <net/compat.h>
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 10d8f95eb771..ca20efe775ee 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -202,7 +202,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
local_bh_disable(); addend = xt_write_recseq_begin(); - private = READ_ONCE(table->private); /* Address dependency. */ + private = rcu_access_pointer(table->private); cpu = smp_processor_id(); table_base = private->entries; jumpstack = (struct arpt_entry **)private->jumpstack[cpu]; @@ -648,7 +648,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table);
/* We need atomic snapshot of counters: rest doesn't change * (other than comefrom, which userspace doesn't care @@ -672,7 +672,7 @@ static int copy_entries_to_user(unsigned int total_size, unsigned int off, num; const struct arpt_entry *e; struct xt_counters *counters; - struct xt_table_info *private = table->private; + struct xt_table_info *private = xt_table_get_private_protected(table); int ret = 0; void *loc_cpu_entry;
@@ -807,7 +807,7 @@ static int get_info(struct net *net, void __user *user, t = xt_request_find_table_lock(net, NFPROTO_ARP, name); if (!IS_ERR(t)) { struct arpt_getinfo info; - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); #ifdef CONFIG_COMPAT struct xt_table_info tmp;
@@ -860,7 +860,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
t = xt_find_table_lock(net, NFPROTO_ARP, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t);
if (get.size == private->size) ret = copy_entries_to_user(private->size, @@ -1019,7 +1019,7 @@ static int do_add_counters(struct net *net, const void __user *user, }
local_bh_disable(); - private = t->private; + private = xt_table_get_private_protected(t); if (private->number != tmp.num_counters) { ret = -EINVAL; goto unlock_up_free; @@ -1356,7 +1356,7 @@ static int compat_copy_entries_to_user(unsigned int total_size, void __user *userptr) { struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table); void __user *pos; unsigned int size; int ret = 0; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index e77872c93c20..115d48049686 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -261,7 +261,7 @@ ipt_do_table(struct sk_buff *skb, WARN_ON(!(table->valid_hooks & (1 << hook))); local_bh_disable(); addend = xt_write_recseq_begin(); - private = READ_ONCE(table->private); /* Address dependency. */ + private = rcu_access_pointer(table->private); cpu = smp_processor_id(); table_base = private->entries; jumpstack = (struct ipt_entry **)private->jumpstack[cpu]; @@ -794,7 +794,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table);
/* We need atomic snapshot of counters: rest doesn't change (other than comefrom, which userspace doesn't care @@ -818,7 +818,7 @@ copy_entries_to_user(unsigned int total_size, unsigned int off, num; const struct ipt_entry *e; struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table); int ret = 0; const void *loc_cpu_entry;
@@ -968,7 +968,7 @@ static int get_info(struct net *net, void __user *user, t = xt_request_find_table_lock(net, AF_INET, name); if (!IS_ERR(t)) { struct ipt_getinfo info; - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); #ifdef CONFIG_COMPAT struct xt_table_info tmp;
@@ -1022,7 +1022,7 @@ get_entries(struct net *net, struct ipt_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); if (get.size == private->size) ret = copy_entries_to_user(private->size, t, uptr->entrytable); @@ -1178,7 +1178,7 @@ do_add_counters(struct net *net, const void __user *user, }
local_bh_disable(); - private = t->private; + private = xt_table_get_private_protected(t); if (private->number != tmp.num_counters) { ret = -EINVAL; goto unlock_up_free; @@ -1573,7 +1573,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table, void __user *userptr) { struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table); void __user *pos; unsigned int size; int ret = 0; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index daf2e9e9193d..b1441349e151 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -283,7 +283,7 @@ ip6t_do_table(struct sk_buff *skb,
local_bh_disable(); addend = xt_write_recseq_begin(); - private = READ_ONCE(table->private); /* Address dependency. */ + private = rcu_access_pointer(table->private); cpu = smp_processor_id(); table_base = private->entries; jumpstack = (struct ip6t_entry **)private->jumpstack[cpu]; @@ -810,7 +810,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table);
/* We need atomic snapshot of counters: rest doesn't change (other than comefrom, which userspace doesn't care @@ -834,7 +834,7 @@ copy_entries_to_user(unsigned int total_size, unsigned int off, num; const struct ip6t_entry *e; struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table); int ret = 0; const void *loc_cpu_entry;
@@ -984,7 +984,7 @@ static int get_info(struct net *net, void __user *user, t = xt_request_find_table_lock(net, AF_INET6, name); if (!IS_ERR(t)) { struct ip6t_getinfo info; - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); #ifdef CONFIG_COMPAT struct xt_table_info tmp;
@@ -1039,7 +1039,7 @@ get_entries(struct net *net, struct ip6t_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET6, get.name); if (!IS_ERR(t)) { - struct xt_table_info *private = t->private; + struct xt_table_info *private = xt_table_get_private_protected(t); if (get.size == private->size) ret = copy_entries_to_user(private->size, t, uptr->entrytable); @@ -1194,7 +1194,7 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len, }
local_bh_disable(); - private = t->private; + private = xt_table_get_private_protected(t); if (private->number != tmp.num_counters) { ret = -EINVAL; goto unlock_up_free; @@ -1582,7 +1582,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table, void __user *userptr) { struct xt_counters *counters; - const struct xt_table_info *private = table->private; + const struct xt_table_info *private = xt_table_get_private_protected(table); void __user *pos; unsigned int size; int ret = 0; diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 3bab89dbc371..6a7d0303d058 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1354,6 +1354,14 @@ struct xt_counters *xt_counters_alloc(unsigned int counters) } EXPORT_SYMBOL(xt_counters_alloc);
+struct xt_table_info +*xt_table_get_private_protected(const struct xt_table *table) +{ + return rcu_dereference_protected(table->private, + mutex_is_locked(&xt[table->af].mutex)); +} +EXPORT_SYMBOL(xt_table_get_private_protected); + struct xt_table_info * xt_replace_table(struct xt_table *table, unsigned int num_counters, @@ -1361,7 +1369,6 @@ xt_replace_table(struct xt_table *table, int *error) { struct xt_table_info *private; - unsigned int cpu; int ret;
ret = xt_jumpstack_alloc(newinfo); @@ -1371,47 +1378,20 @@ xt_replace_table(struct xt_table *table, }
/* Do the substitution. */ - local_bh_disable(); - private = table->private; + private = xt_table_get_private_protected(table);
/* Check inside lock: is the old number correct? */ if (num_counters != private->number) { pr_debug("num_counters != table->private->number (%u/%u)\n", num_counters, private->number); - local_bh_enable(); *error = -EAGAIN; return NULL; }
newinfo->initial_entries = private->initial_entries; - /* - * Ensure contents of newinfo are visible before assigning to - * private. - */ - smp_wmb(); - table->private = newinfo; - - /* make sure all cpus see new ->private value */ - smp_wmb();
- /* - * Even though table entries have now been swapped, other CPU's - * may still be using the old entries... - */ - local_bh_enable(); - - /* ... so wait for even xt_recseq on all cpus */ - for_each_possible_cpu(cpu) { - seqcount_t *s = &per_cpu(xt_recseq, cpu); - u32 seq = raw_read_seqcount(s); - - if (seq & 1) { - do { - cond_resched(); - cpu_relax(); - } while (seq == raw_read_seqcount(s)); - } - } + rcu_assign_pointer(table->private, newinfo); + synchronize_rcu();
#ifdef CONFIG_AUDIT if (audit_enabled) { @@ -1452,12 +1432,12 @@ struct xt_table *xt_register_table(struct net *net, }
/* Simplifies replace_table code. */ - table->private = bootstrap; + rcu_assign_pointer(table->private, bootstrap);
if (!xt_replace_table(table, 0, newinfo, &ret)) goto unlock;
- private = table->private; + private = xt_table_get_private_protected(table); pr_debug("table->private->number = %u\n", private->number);
/* save number of initial entries */ @@ -1480,7 +1460,8 @@ void *xt_unregister_table(struct xt_table *table) struct xt_table_info *private;
mutex_lock(&xt[table->af].mutex); - private = table->private; + private = xt_table_get_private_protected(table); + RCU_INIT_POINTER(table->private, NULL); list_del(&table->list); mutex_unlock(&xt[table->af].mutex); kfree(table);
From: Mark Rutland mark.rutland@arm.com
stable inclusion from linux-4.19.164 commit 6abd3ab44001ff55ccff27793b925983cef23198
--------------------------------
[ Upstream commit ca1314d73eed493c49bb1932c60a8605530db2e4 ]
In el0_svc_common() we unmask exceptions before we call user_exit(), and so there's a window where an IRQ or debug exception can be taken while RCU is not watching. In do_debug_exception() we account for this in via debug_exception_{enter,exit}(), but in the el1_irq asm we do not and we call trace functions which rely on RCU before we have a guarantee that RCU is watching.
Let's avoid this by having el0_svc_common() exit userspace before unmasking exceptions, matching what we do for all other EL0 entry paths. We can use user_exit_irqoff() to avoid the pointless save/restore of IRQ flags while we're sure exceptions are masked in DAIF.
The workaround for Cortex-A76 erratum 1463225 may trigger a debug exception before this point, but the debug code invoked in this case is safe even when RCU is not watching.
Signed-off-by: Mark Rutland mark.rutland@arm.com Cc: Catalin Marinas catalin.marinas@arm.com Cc: James Morse james.morse@arm.com Cc: Will Deacon will@kernel.org Link: https://lore.kernel.org/r/20201130115950.22492-2-mark.rutland@arm.com Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/syscall.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c index 49caf0df5316..cee2933bd6c1 100644 --- a/arch/arm64/kernel/syscall.c +++ b/arch/arm64/kernel/syscall.c @@ -99,8 +99,8 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr, regs->syscallno = scno;
cortex_a76_erratum_1463225_svc_handler(); + user_exit_irqoff(); local_daif_restore(DAIF_PROCCTX); - user_exit();
if (has_syscall_work(flags)) { /* set default errno for user-issued syscall(-1) */
From: Alexey Kardashevskiy aik@ozlabs.ru
stable inclusion from linux-4.19.164 commit 7a3c3a1c67e00942ae4890281b5b56026650bed8
--------------------------------
commit 2f70e49ed860020f5abae4f7015018ebc10e1f0e upstream.
At the moment opening a serial device node (such as /dev/ttyS3) succeeds even if there is no actual serial device behind it. Reading/writing/ioctls fail as expected because the uart port is not initialized (the type is PORT_UNKNOWN) and the TTY_IO_ERROR error state bit is set for the tty.
However, setting a line discipline does not perform these checks in 8250_port.c (8250 is the default choice made by univ8250_console_init()). As a result of PORT_UNKNOWN, uart_port::iobase is NULL, which the platform translates to some address; accessing that address produces a crash.
This adds tty_port_initialized() to uart_set_ldisc() to prevent the crash.
Found by syzkaller.
Signed-off-by: Alexey Kardashevskiy aik@ozlabs.ru Link: https://lore.kernel.org/r/20201203055834.45838-1-aik@ozlabs.ru Cc: stable stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/tty/serial/serial_core.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c index 2a5bf4c14fb8..80fa06b16d9d 100644 --- a/drivers/tty/serial/serial_core.c +++ b/drivers/tty/serial/serial_core.c @@ -1421,6 +1421,10 @@ static void uart_set_ldisc(struct tty_struct *tty) { struct uart_state *state = tty->driver_data; struct uart_port *uport; + struct tty_port *port = &state->port; + + if (!tty_port_initialized(port)) + return;
mutex_lock(&state->port.mutex); uport = uart_port_check(state);
From: Tianyue Ren rentianyue@kylinos.cn
stable inclusion from linux-4.19.164 commit bd7223dda090c9246b290a0b901c112d6484466a
--------------------------------
[ Upstream commit 83370b31a915493231e5b9addc72e4bef69f8d31 ]
Mark the inode security label as invalid if we cannot find a dentry so that we will retry later rather than marking it initialized with the unlabeled SID.
Fixes: 9287aed2ad1f ("selinux: Convert isec->lock into a spinlock") Signed-off-by: Tianyue Ren rentianyue@kylinos.cn [PM: minor comment tweaks] Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- security/selinux/hooks.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index de8f8948e6d6..710f5f2e9b50 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -1618,7 +1618,13 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent * inode_doinit with a dentry, before these inodes could * be used again by userspace. */ - goto out; + isec->initialized = LABEL_INVALID; + /* + * There is nothing useful to jump to the "out" + * label, except a needless spin lock/unlock + * cycle. + */ + return 0; }
len = INITCONTEXTLEN; @@ -1733,8 +1739,15 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent * inode_doinit() with a dentry, before these inodes * could be used again by userspace. */ - if (!dentry) - goto out; + if (!dentry) { + isec->initialized = LABEL_INVALID; + /* + * There is nothing useful to jump to the "out" + * label, except a needless spin lock/unlock + * cycle. + */ + return 0; + } rc = selinux_genfs_get_sid(dentry, sclass, sbsec->flags, &sid); dput(dentry);
From: Peng Liu iwtbavbm@gmail.com
stable inclusion from linux-4.19.164 commit 6db84b27221204c37a53dd7c58b01b5d7e91ce0b
--------------------------------
[ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ]
When changing sched_rt_{runtime, period}_us, we validate that the new settings at least accommodate the currently allocated -dl bandwidth:
sched_rt_handler()
  --> sched_dl_bandwidth_validate()
  {
	new_bw = global_rt_runtime() / global_rt_period();

	for_each_possible_cpu(cpu) {
		dl_b = dl_bw_of(cpu);
		if (new_bw < dl_b->total_bw)    <-------
			ret = -EBUSY;
	}
  }
But under CONFIG_SMP, dl_bw is per root domain, not per CPU: dl_b->total_bw is the allocated bandwidth of the whole root domain. We should therefore compare dl_b->total_bw against "cpus * new_bw", where 'cpus' is the number of CPUs in the root domain.
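In readable form (locking and RCU details omitted; see the hunk below), the corrected admission check scales the new per-CPU bandwidth by the root domain's CPU count:

	for_each_possible_cpu(cpu) {
		dl_b = dl_bw_of(cpu);
		cpus = dl_bw_cpus(cpu);		/* CPUs in this root domain */

		if (new_bw * cpus < dl_b->total_bw)
			ret = -EBUSY;
	}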
Also, the implementation implied by the annotation below (in kernel/sched/sched.h) only existed in SCHED_DEADLINE v2 [1]; the deadline scheduler kept evolving until it was merged (v9), but the annotation was never updated and is now meaningless and misleading. Update it.
 * With respect to SMP, the bandwidth is given on a per-CPU basis,
 * meaning that:
 *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
 *  - dl_total_bw array contains, in the i-eth element, the currently
 *    allocated bandwidth on the i-eth CPU.
[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/
Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks") Signed-off-by: Peng Liu iwtbavbm@gmail.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Daniel Bristot de Oliveira bristot@redhat.com Acked-by: Juri Lelli juri.lelli@redhat.com Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.160217106... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/sched/deadline.c | 5 +++-- kernel/sched/sched.h | 42 ++++++++++++++++++----------------------- 2 files changed, 21 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 6dcbaa21de0d..50a35d7b1eda 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2469,7 +2469,7 @@ int sched_dl_global_validate(void) u64 period = global_rt_period(); u64 new_bw = to_ratio(period, runtime); struct dl_bw *dl_b; - int cpu, ret = 0; + int cpu, cpus, ret = 0; unsigned long flags;
/* @@ -2484,9 +2484,10 @@ int sched_dl_global_validate(void) for_each_possible_cpu(cpu) { rcu_read_lock_sched(); dl_b = dl_bw_of(cpu); + cpus = dl_bw_cpus(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags); - if (new_bw < dl_b->total_bw) + if (new_bw * cpus < dl_b->total_bw) ret = -EBUSY; raw_spin_unlock_irqrestore(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 955abd645ff9..5b31fe4bb1b4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -248,30 +248,6 @@ struct rt_bandwidth {
void __dl_clear_params(struct task_struct *p);
-/* - * To keep the bandwidth of -deadline tasks and groups under control - * we need some place where: - * - store the maximum -deadline bandwidth of the system (the group); - * - cache the fraction of that bandwidth that is currently allocated. - * - * This is all done in the data structure below. It is similar to the - * one used for RT-throttling (rt_bandwidth), with the main difference - * that, since here we are only interested in admission control, we - * do not decrease any runtime while the group "executes", neither we - * need a timer to replenish it. - * - * With respect to SMP, the bandwidth is given on a per-CPU basis, - * meaning that: - * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU; - * - dl_total_bw array contains, in the i-eth element, the currently - * allocated bandwidth on the i-eth CPU. - * Moreover, groups consume bandwidth on each CPU, while tasks only - * consume bandwidth on the CPU they're running on. - * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw - * that will be shown the next time the proc or cgroup controls will - * be red. It on its turn can be changed by writing on its own - * control. - */ struct dl_bandwidth { raw_spinlock_t dl_runtime_lock; u64 dl_runtime; @@ -283,6 +259,24 @@ static inline int dl_bandwidth_enabled(void) return sysctl_sched_rt_runtime >= 0; }
+/* + * To keep the bandwidth of -deadline tasks under control + * we need some place where: + * - store the maximum -deadline bandwidth of each cpu; + * - cache the fraction of bandwidth that is currently allocated in + * each root domain; + * + * This is all done in the data structure below. It is similar to the + * one used for RT-throttling (rt_bandwidth), with the main difference + * that, since here we are only interested in admission control, we + * do not decrease any runtime while the group "executes", neither we + * need a timer to replenish it. + * + * With respect to SMP, bandwidth is given on a per root domain basis, + * meaning that: + * - bw (< 100%) is the deadline bandwidth of each CPU; + * - total_bw is the currently allocated bandwidth in each root domain; + */ struct dl_bw { raw_spinlock_t lock; u64 bw;
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.164 commit b6b6ba5754eee1907497504e4b31e22c78f7670f
--------------------------------
[ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ]
do_sched_yield() invokes schedule() with interrupts disabled, which is not allowed. This goes back to the pre-git era, to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.
Reenable interrupts and remove the misleading comment which "explains" it.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/sched/core.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0d0527cdcd4e..e89fbd222783 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5020,12 +5020,8 @@ static void do_sched_yield(void) schedstat_inc(rq->yld_count); current->sched_class->yield_task(rq);
- /* - * Since we are going to call schedule() anyway, there's - * no need to preempt or enable interrupts: - */ preempt_disable(); - rq_unlock(rq, &rf); + rq_unlock_irq(rq, &rf); sched_preempt_enable_no_resched();
schedule();
From: Paul Moore paul@paul-moore.com
stable inclusion from linux-4.19.164 commit 87294e61dafc7be280188581191722eac8b87932
--------------------------------
[ Upstream commit 200ea5a2292dc444a818b096ae6a32ba3caa51b9 ]
A previous fix, commit 83370b31a915 ("selinux: fix error initialization in inode_doinit_with_dentry()"), changed how failures were handled before a SELinux policy was loaded. Unfortunately that patch was potentially problematic for two reasons: it set the isec->initialized state without holding a lock, and it didn't set the inode's SELinux label to the "default" for the particular filesystem. The latter can be a problem if/when a later attempt to revalidate the inode fails and SELinux reverts to the existing inode label.
This patch should restore the default inode labeling that existed before the original fix, without affecting the LABEL_INVALID marking such that revalidation will still be attempted in the future.
Fixes: 83370b31a915 ("selinux: fix error initialization in inode_doinit_with_dentry()") Reported-by: Sven Schnelle svens@linux.ibm.com Tested-by: Sven Schnelle svens@linux.ibm.com Reviewed-by: Ondrej Mosnacek omosnace@redhat.com Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- security/selinux/hooks.c | 31 +++++++++++++------------------ 1 file changed, 13 insertions(+), 18 deletions(-)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 710f5f2e9b50..1b611c068669 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -1618,13 +1618,7 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent * inode_doinit with a dentry, before these inodes could * be used again by userspace. */ - isec->initialized = LABEL_INVALID; - /* - * There is nothing useful to jump to the "out" - * label, except a needless spin lock/unlock - * cycle. - */ - return 0; + goto out_invalid; }
len = INITCONTEXTLEN; @@ -1739,15 +1733,8 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent * inode_doinit() with a dentry, before these inodes * could be used again by userspace. */ - if (!dentry) { - isec->initialized = LABEL_INVALID; - /* - * There is nothing useful to jump to the "out" - * label, except a needless spin lock/unlock - * cycle. - */ - return 0; - } + if (!dentry) + goto out_invalid; rc = selinux_genfs_get_sid(dentry, sclass, sbsec->flags, &sid); dput(dentry); @@ -1760,11 +1747,10 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent out: spin_lock(&isec->lock); if (isec->initialized == LABEL_PENDING) { - if (!sid || rc) { + if (rc) { isec->initialized = LABEL_INVALID; goto out_unlock; } - isec->initialized = LABEL_INITIALIZED; isec->sid = sid; } @@ -1772,6 +1758,15 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent out_unlock: spin_unlock(&isec->lock); return rc; + +out_invalid: + spin_lock(&isec->lock); + if (isec->initialized == LABEL_PENDING) { + isec->initialized = LABEL_INVALID; + isec->sid = sid; + } + spin_unlock(&isec->lock); + return 0; }
/* Convert a Linux signal to an access vector. */
From: Martin Wilck mwilck@suse.com
stable inclusion from linux-4.19.164 commit 2544b54d6384541845f0b0a7d55892214473bea9
--------------------------------
[ Upstream commit 2e4209b3806cda9b89c30fd5e7bfecb7044ec78b ]
The current implementation of scsi_vpd_lun_id() uses the designator length as an implicit measure of priority. This works most of the time, but not always. For example, some Hitachi storage arrays return this in VPD 0x83:
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 24
    designator_type: T10 vendor identification, code_set: ASCII
    associated with the Addressed logical unit
      vendor id: HITACHI
      vendor specific: 5030C3502025
  Designation descriptor number 2, descriptor length: 6
    designator_type: vendor specific [0x0], code_set: Binary
    associated with the Target port
      vendor specific: 08 03
  Designation descriptor number 3, descriptor length: 20
    designator_type: NAA, code_set: Binary
    associated with the Addressed logical unit
      NAA 6, IEEE Company_id: 0x60e8
      Vendor Specific Identifier: 0x7c35000
      Vendor Specific Identifier Extension: 0x30c35000002025
      [0x60060e8007c350000030c35000002025]
The current code would use the first descriptor because it's longer than the NAA descriptor. But this is wrong: the kernel is supposed to prefer NAA descriptors over the T10 vendor ID. Designator length should only be used to compare designators of the same type.
This patch addresses the issue by separating designator priority and length.
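To make the new rule concrete, here is a minimal sketch (a hypothetical helper, not the driver code; designator_prio() stands in for the priority table the patch adds, and the cur_* names are assumptions): compare by priority first, and use the length only to break ties within the same type.

```
/*
 * Sketch only: pick the better of two designators by comparing priority
 * first and using the designator length only to break ties.
 */
static int designator_is_better(unsigned char prio, unsigned char len,
				unsigned char cur_prio, unsigned char cur_len)
{
	if (prio == 0)			/* unusable/unsupported designator */
		return 0;
	if (prio != cur_prio)		/* a higher priority always wins ...   */
		return prio > cur_prio;
	return len > cur_len;		/* ... length only within the same type */
}
```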
Link: https://lore.kernel.org/r/20201029170846.14786-1-mwilck@suse.com Fixes: 9983bed3907c ("scsi: Add scsi_vpd_lun_id()") Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Martin Wilck mwilck@suse.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/scsi_lib.c | 126 +++++++++++++++++++++++++++------------- 1 file changed, 86 insertions(+), 40 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index bdb90f5c9eeb..f5ee3b714b2a 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -3483,6 +3483,78 @@ void sdev_enable_disk_events(struct scsi_device *sdev) } EXPORT_SYMBOL(sdev_enable_disk_events);
+static unsigned char designator_prio(const unsigned char *d) +{ + if (d[1] & 0x30) + /* not associated with LUN */ + return 0; + + if (d[3] == 0) + /* invalid length */ + return 0; + + /* + * Order of preference for lun descriptor: + * - SCSI name string + * - NAA IEEE Registered Extended + * - EUI-64 based 16-byte + * - EUI-64 based 12-byte + * - NAA IEEE Registered + * - NAA IEEE Extended + * - EUI-64 based 8-byte + * - SCSI name string (truncated) + * - T10 Vendor ID + * as longer descriptors reduce the likelyhood + * of identification clashes. + */ + + switch (d[1] & 0xf) { + case 8: + /* SCSI name string, variable-length UTF-8 */ + return 9; + case 3: + switch (d[4] >> 4) { + case 6: + /* NAA registered extended */ + return 8; + case 5: + /* NAA registered */ + return 5; + case 4: + /* NAA extended */ + return 4; + case 3: + /* NAA locally assigned */ + return 1; + default: + break; + } + break; + case 2: + switch (d[3]) { + case 16: + /* EUI64-based, 16 byte */ + return 7; + case 12: + /* EUI64-based, 12 byte */ + return 6; + case 8: + /* EUI64-based, 8 byte */ + return 3; + default: + break; + } + break; + case 1: + /* T10 vendor ID */ + return 1; + default: + break; + } + + return 0; +} + /** * scsi_vpd_lun_id - return a unique device identification * @sdev: SCSI device @@ -3499,7 +3571,7 @@ EXPORT_SYMBOL(sdev_enable_disk_events); */ int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len) { - u8 cur_id_type = 0xff; + u8 cur_id_prio = 0; u8 cur_id_size = 0; const unsigned char *d, *cur_id_str; const struct scsi_vpd *vpd_pg83; @@ -3512,20 +3584,6 @@ int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len) return -ENXIO; }
- /* - * Look for the correct descriptor. - * Order of preference for lun descriptor: - * - SCSI name string - * - NAA IEEE Registered Extended - * - EUI-64 based 16-byte - * - EUI-64 based 12-byte - * - NAA IEEE Registered - * - NAA IEEE Extended - * - T10 Vendor ID - * as longer descriptors reduce the likelyhood - * of identification clashes. - */ - /* The id string must be at least 20 bytes + terminating NULL byte */ if (id_len < 21) { rcu_read_unlock(); @@ -3535,8 +3593,9 @@ int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len) memset(id, 0, id_len); d = vpd_pg83->data + 4; while (d < vpd_pg83->data + vpd_pg83->len) { - /* Skip designators not referring to the LUN */ - if ((d[1] & 0x30) != 0x00) + u8 prio = designator_prio(d); + + if (prio == 0 || cur_id_prio > prio) goto next_desig;
switch (d[1] & 0xf) { @@ -3544,28 +3603,19 @@ int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len) /* T10 Vendor ID */ if (cur_id_size > d[3]) break; - /* Prefer anything */ - if (cur_id_type > 0x01 && cur_id_type != 0xff) - break; + cur_id_prio = prio; cur_id_size = d[3]; if (cur_id_size + 4 > id_len) cur_id_size = id_len - 4; cur_id_str = d + 4; - cur_id_type = d[1] & 0xf; id_size = snprintf(id, id_len, "t10.%*pE", cur_id_size, cur_id_str); break; case 0x2: /* EUI-64 */ - if (cur_id_size > d[3]) - break; - /* Prefer NAA IEEE Registered Extended */ - if (cur_id_type == 0x3 && - cur_id_size == d[3]) - break; + cur_id_prio = prio; cur_id_size = d[3]; cur_id_str = d + 4; - cur_id_type = d[1] & 0xf; switch (cur_id_size) { case 8: id_size = snprintf(id, id_len, @@ -3583,17 +3633,14 @@ int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len) cur_id_str); break; default: - cur_id_size = 0; break; } break; case 0x3: /* NAA */ - if (cur_id_size > d[3]) - break; + cur_id_prio = prio; cur_id_size = d[3]; cur_id_str = d + 4; - cur_id_type = d[1] & 0xf; switch (cur_id_size) { case 8: id_size = snprintf(id, id_len, @@ -3606,26 +3653,25 @@ int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len) cur_id_str); break; default: - cur_id_size = 0; break; } break; case 0x8: /* SCSI name string */ - if (cur_id_size + 4 > d[3]) + if (cur_id_size > d[3]) break; /* Prefer others for truncated descriptor */ - if (cur_id_size && d[3] > id_len) - break; + if (d[3] > id_len) { + prio = 2; + if (cur_id_prio > prio) + break; + } + cur_id_prio = prio; cur_id_size = id_size = d[3]; cur_id_str = d + 4; - cur_id_type = d[1] & 0xf; if (cur_id_size >= id_len) cur_id_size = id_len - 1; memcpy(id, cur_id_str, cur_id_size); - /* Decrease priority for truncated descriptor */ - if (cur_id_size != id_size) - cur_id_size = 6; break; default: break;
From: Uwe Kleine-König u.kleine-koenig@pengutronix.de
stable inclusion from linux-4.19.164 commit d85ed98f5a91acb93b23b0e648f9c5c8fb426f14
--------------------------------
[ Upstream commit 440408dbadfe47a615afd0a0a4a402e629be658a ]
Consider an SPI driver with a .probe but without a .remove callback (e.g. rtc-ds1347). The function spi_drv_probe() is called to bind a device, and so dev_pm_domain_attach() is called. As there is no remove callback, however, spi_drv_remove() isn't called at unbind time, so dev_pm_domain_detach() is never called and the PM domain stays active.
To fix this, always use both spi_drv_probe() and spi_drv_remove() and make them handle the respective callback not being set. This has the side effect that for a (hypothetical) driver that has neither .probe nor .remove, the clock and PM domain setup is still done.
Fixes: 33cf00e57082 ("spi: attach/detach SPI device to the ACPI power domain") Signed-off-by: Uwe Kleine-König u.kleine-koenig@pengutronix.de Link: https://lore.kernel.org/r/20201119161604.2633521-1-u.kleine-koenig@pengutron... Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/spi/spi.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c index 80c4abf411b2..f0f21f93d293 100644 --- a/drivers/spi/spi.c +++ b/drivers/spi/spi.c @@ -362,9 +362,11 @@ static int spi_drv_probe(struct device *dev) if (ret) return ret;
- ret = sdrv->probe(spi); - if (ret) - dev_pm_domain_detach(dev, true); + if (sdrv->probe) { + ret = sdrv->probe(spi); + if (ret) + dev_pm_domain_detach(dev, true); + }
return ret; } @@ -372,9 +374,10 @@ static int spi_drv_probe(struct device *dev) static int spi_drv_remove(struct device *dev) { const struct spi_driver *sdrv = to_spi_driver(dev->driver); - int ret; + int ret = 0;
- ret = sdrv->remove(to_spi_device(dev)); + if (sdrv->remove) + ret = sdrv->remove(to_spi_device(dev)); dev_pm_domain_detach(dev, true);
return ret; @@ -399,10 +402,8 @@ int __spi_register_driver(struct module *owner, struct spi_driver *sdrv) { sdrv->driver.owner = owner; sdrv->driver.bus = &spi_bus_type; - if (sdrv->probe) - sdrv->driver.probe = spi_drv_probe; - if (sdrv->remove) - sdrv->driver.remove = spi_drv_remove; + sdrv->driver.probe = spi_drv_probe; + sdrv->driver.remove = spi_drv_remove; if (sdrv->shutdown) sdrv->driver.shutdown = spi_drv_shutdown; return driver_register(&sdrv->driver);
From: Marc Zyngier maz@kernel.org
stable inclusion from linux-4.19.164 commit 4763ddb834462097ff818a8dcae2c545c0d5ba1a
--------------------------------
[ Upstream commit 4615fbc3788ddc8e7c6d697714ad35a53729aa2c ]
When an interrupt allocation fails for N interrupts, it is pretty common for the error handling code to free the same number of interrupts, no matter how many interrupts have actually been allocated.
This may result in the domain freeing code being called unexpectedly for interrupts that have no mapping in that domain. Things end pretty badly.
Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure that it does not follow the hierarchy if no mapping exists for a given interrupt.
Fixes: 6a6544e520abe ("genirq/irqdomain: Remove auto-recursive hierarchy support") Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/irq/irqdomain.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index d105b85fe054..a3c94cdf44e7 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -1247,8 +1247,15 @@ static void irq_domain_free_irqs_hierarchy(struct irq_domain *domain, unsigned int irq_base, unsigned int nr_irqs) { - if (domain->ops->free) - domain->ops->free(domain, irq_base, nr_irqs); + unsigned int i; + + if (!domain->ops->free) + return; + + for (i = 0; i < nr_irqs; i++) { + if (irq_domain_get_irq_data(domain, irq_base + i)) + domain->ops->free(domain, irq_base + i, 1); + } }
int irq_domain_alloc_irqs_hierarchy(struct irq_domain *domain,
From: Bjorn Helgaas bhelgaas@google.com
stable inclusion from linux-4.19.164 commit 6d7483c6434f1da19b0140bd4d62c7e18a26ffa6
--------------------------------
[ Upstream commit 6534aac198b58309ff2337981d3f893e0be1d19d ]
32-bit BARs are limited to 2GB size (2^31). By extension, I assume 64-bit BARs are limited to 2^63 bytes. Limit the alignment requested by the "pci=resource_alignment=" command-line parameter to 2^63.
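A minimal user-space sketch of why the cap matters (the order value and the PAGE_SHIFT fallback below are illustrative; only the 63 limit mirrors the patch): shifting a 64-bit one by 64 or more is undefined, so an over-large requested order has to be rejected before the shift is performed.

```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int align_order = 64;		/* e.g. a bogus pci=resource_alignment=64@... request */

	if (align_order > 63) {		/* cap the order, as the patch does */
		fprintf(stderr, "invalid alignment order %d, falling back to page size\n",
			align_order);
		align_order = 12;	/* PAGE_SHIFT on most platforms */
	}

	uint64_t align = UINT64_C(1) << align_order;	/* now always a defined shift */
	printf("align = %llu\n", (unsigned long long)align);
	return 0;
}
```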
Link: https://lore.kernel.org/r/20201007123045.GS4282@kadam Reported-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/pci/pci.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index cea3c778a905..d2de3cf3fc34 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5864,19 +5864,21 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, while (*p) { count = 0; if (sscanf(p, "%d%n", &align_order, &count) == 1 && - p[count] == '@') { + p[count] == '@') { p += count + 1; + if (align_order > 63) { + pr_err("PCI: Invalid requested alignment (order %d)\n", + align_order); + align_order = PAGE_SHIFT; + } } else { - align_order = -1; + align_order = PAGE_SHIFT; }
ret = pci_dev_str_match(dev, p, &p); if (ret == 1) { *resize = true; - if (align_order == -1) - align = PAGE_SIZE; - else - align = 1 << align_order; + align = 1 << align_order; break; } else if (ret < 0) { pr_err("PCI: Can't parse resource_alignment parameter: %s\n",
From: Colin Ian King colin.king@canonical.com
stable inclusion from linux-4.19.164 commit e9817c8cf22f750f3547962544d4ae4e5d28c93c
--------------------------------
[ Upstream commit cc73eb321d246776e5a9f7723d15708809aa3699 ]
The shift of 1 by align_order is evaluated using 32 bit arithmetic and the result is assigned to a resource_size_t type variable that is a 64 bit unsigned integer on 64 bit platforms. Fix an overflow before widening issue by making the 1 a ULL.
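A small standalone example of the overflow-before-widening pattern being fixed (variable names are illustrative): the constant must be 64-bit so the shift is not evaluated in 32-bit int arithmetic before being assigned to the wider type.

```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int align_order = 40;	/* any order >= 31 triggers the 32-bit overflow */

	/*
	 * "1 << align_order" would be evaluated as a 32-bit int and overflow
	 * before being widened to 64 bits; making the constant ULL forces the
	 * whole shift into 64-bit arithmetic, which is what the patch does.
	 */
	uint64_t align = 1ULL << align_order;

	printf("align = %llu\n", (unsigned long long)align);
	return 0;
}
```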
Addresses-Coverity: ("Unintentional integer overflow") Fixes: 32a9a682bef2 ("PCI: allow assignment of memory resources with a specified alignment") Signed-off-by: Colin Ian King colin.king@canonical.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Logan Gunthorpe logang@deltatee.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/pci/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index d2de3cf3fc34..140c6bd328f8 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5878,7 +5878,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, ret = pci_dev_str_match(dev, p, &p); if (ret == 1) { *resize = true; - align = 1 << align_order; + align = 1ULL << align_order; break; } else if (ret < 0) { pr_err("PCI: Can't parse resource_alignment parameter: %s\n",
From: Bharat Gooty bharat.gooty@broadcom.com
stable inclusion from linux-4.19.164 commit 06773ce45d65c74c0aebdd766fbc3a916546d4ba
--------------------------------
[ Upstream commit a3ff529f5d368a17ff35ada8009e101162ebeaf9 ]
Declare the full-size array for all revisions of the PAX register sets to avoid potential out-of-bounds accesses of the register array when it is initialized in iproc_pcie_rev_init().
Link: https://lore.kernel.org/r/20201001060054.6616-2-srinath.mannam@broadcom.com Fixes: 06324ede76cdf ("PCI: iproc: Improve core register population") Signed-off-by: Bharat Gooty bharat.gooty@broadcom.com Signed-off-by: Lorenzo Pieralisi lorenzo.pieralisi@arm.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/pci/controller/pcie-iproc.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/controller/pcie-iproc.c b/drivers/pci/controller/pcie-iproc.c index 3160e9342a2f..9191fdd8ce0c 100644 --- a/drivers/pci/controller/pcie-iproc.c +++ b/drivers/pci/controller/pcie-iproc.c @@ -300,7 +300,7 @@ enum iproc_pcie_reg { };
/* iProc PCIe PAXB BCMA registers */ -static const u16 iproc_pcie_reg_paxb_bcma[] = { +static const u16 iproc_pcie_reg_paxb_bcma[IPROC_PCIE_MAX_NUM_REG] = { [IPROC_PCIE_CLK_CTRL] = 0x000, [IPROC_PCIE_CFG_IND_ADDR] = 0x120, [IPROC_PCIE_CFG_IND_DATA] = 0x124, @@ -311,7 +311,7 @@ static const u16 iproc_pcie_reg_paxb_bcma[] = { };
/* iProc PCIe PAXB registers */ -static const u16 iproc_pcie_reg_paxb[] = { +static const u16 iproc_pcie_reg_paxb[IPROC_PCIE_MAX_NUM_REG] = { [IPROC_PCIE_CLK_CTRL] = 0x000, [IPROC_PCIE_CFG_IND_ADDR] = 0x120, [IPROC_PCIE_CFG_IND_DATA] = 0x124, @@ -327,7 +327,7 @@ static const u16 iproc_pcie_reg_paxb[] = { };
/* iProc PCIe PAXB v2 registers */ -static const u16 iproc_pcie_reg_paxb_v2[] = { +static const u16 iproc_pcie_reg_paxb_v2[IPROC_PCIE_MAX_NUM_REG] = { [IPROC_PCIE_CLK_CTRL] = 0x000, [IPROC_PCIE_CFG_IND_ADDR] = 0x120, [IPROC_PCIE_CFG_IND_DATA] = 0x124, @@ -355,7 +355,7 @@ static const u16 iproc_pcie_reg_paxb_v2[] = { };
/* iProc PCIe PAXC v1 registers */ -static const u16 iproc_pcie_reg_paxc[] = { +static const u16 iproc_pcie_reg_paxc[IPROC_PCIE_MAX_NUM_REG] = { [IPROC_PCIE_CLK_CTRL] = 0x000, [IPROC_PCIE_CFG_IND_ADDR] = 0x1f0, [IPROC_PCIE_CFG_IND_DATA] = 0x1f4, @@ -364,7 +364,7 @@ static const u16 iproc_pcie_reg_paxc[] = { };
/* iProc PCIe PAXC v2 registers */ -static const u16 iproc_pcie_reg_paxc_v2[] = { +static const u16 iproc_pcie_reg_paxc_v2[IPROC_PCIE_MAX_NUM_REG] = { [IPROC_PCIE_MSI_GIC_MODE] = 0x050, [IPROC_PCIE_MSI_BASE_ADDR] = 0x074, [IPROC_PCIE_MSI_WINDOW_SIZE] = 0x078,
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.164 commit 99bba785816dcfa122fcfc3872dc25b7ff82da4e
--------------------------------
[ Upstream commit d5aa6b22e2258f05317313ecc02efbb988ed6d38 ]
According to RFC5666, the correct netid for an IPv6 addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is that when we try to load the module "xprtrdma6", it will fail, since there is no module alias of that name.
Fixes: 181342c5ebe8 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sunrpc/xprt.h | 1 + net/sunrpc/xprt.c | 65 +++++++++++++++++++++++++-------- net/sunrpc/xprtrdma/module.c | 1 + net/sunrpc/xprtrdma/transport.c | 1 + net/sunrpc/xprtsock.c | 4 ++ 5 files changed, 56 insertions(+), 16 deletions(-)
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h index e7bbd82908b1..69fed13e633b 100644 --- a/include/linux/sunrpc/xprt.h +++ b/include/linux/sunrpc/xprt.h @@ -317,6 +317,7 @@ struct xprt_class { struct rpc_xprt * (*setup)(struct xprt_create *); struct module *owner; char name[32]; + const char * netid[]; };
/* diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index 5e7c13aa66d0..9c4235ce5789 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -143,31 +143,64 @@ int xprt_unregister_transport(struct xprt_class *transport) } EXPORT_SYMBOL_GPL(xprt_unregister_transport);
+static void +xprt_class_release(const struct xprt_class *t) +{ + module_put(t->owner); +} + +static const struct xprt_class * +xprt_class_find_by_netid_locked(const char *netid) +{ + const struct xprt_class *t; + unsigned int i; + + list_for_each_entry(t, &xprt_list, list) { + for (i = 0; t->netid[i][0] != '\0'; i++) { + if (strcmp(t->netid[i], netid) != 0) + continue; + if (!try_module_get(t->owner)) + continue; + return t; + } + } + return NULL; +} + +static const struct xprt_class * +xprt_class_find_by_netid(const char *netid) +{ + const struct xprt_class *t; + + spin_lock(&xprt_list_lock); + t = xprt_class_find_by_netid_locked(netid); + if (!t) { + spin_unlock(&xprt_list_lock); + request_module("rpc%s", netid); + spin_lock(&xprt_list_lock); + t = xprt_class_find_by_netid_locked(netid); + } + spin_unlock(&xprt_list_lock); + return t; +} + /** * xprt_load_transport - load a transport implementation - * @transport_name: transport to load + * @netid: transport to load * * Returns: * 0: transport successfully loaded * -ENOENT: transport module not available */ -int xprt_load_transport(const char *transport_name) +int xprt_load_transport(const char *netid) { - struct xprt_class *t; - int result; + const struct xprt_class *t;
- result = 0; - spin_lock(&xprt_list_lock); - list_for_each_entry(t, &xprt_list, list) { - if (strcmp(t->name, transport_name) == 0) { - spin_unlock(&xprt_list_lock); - goto out; - } - } - spin_unlock(&xprt_list_lock); - result = request_module("xprt%s", transport_name); -out: - return result; + t = xprt_class_find_by_netid(netid); + if (!t) + return -ENOENT; + xprt_class_release(t); + return 0; } EXPORT_SYMBOL_GPL(xprt_load_transport);
diff --git a/net/sunrpc/xprtrdma/module.c b/net/sunrpc/xprtrdma/module.c index 620327c01302..45c5b41ac8dc 100644 --- a/net/sunrpc/xprtrdma/module.c +++ b/net/sunrpc/xprtrdma/module.c @@ -24,6 +24,7 @@ MODULE_DESCRIPTION("RPC/RDMA Transport"); MODULE_LICENSE("Dual BSD/GPL"); MODULE_ALIAS("svcrdma"); MODULE_ALIAS("xprtrdma"); +MODULE_ALIAS("rpcrdma6");
static void __exit rpc_rdma_cleanup(void) { diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c index 8e6cb8f1abd2..bcac4bafa7c2 100644 --- a/net/sunrpc/xprtrdma/transport.c +++ b/net/sunrpc/xprtrdma/transport.c @@ -836,6 +836,7 @@ static struct xprt_class xprt_rdma = { .owner = THIS_MODULE, .ident = XPRT_TRANSPORT_RDMA, .setup = xprt_setup_rdma, + .netid = { "rdma", "rdma6", "" }, };
void xprt_rdma_cleanup(void) diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 1f8d97084237..bb796f03507f 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -3242,6 +3242,7 @@ static struct xprt_class xs_local_transport = { .owner = THIS_MODULE, .ident = XPRT_TRANSPORT_LOCAL, .setup = xs_setup_local, + .netid = { "" }, };
static struct xprt_class xs_udp_transport = { @@ -3250,6 +3251,7 @@ static struct xprt_class xs_udp_transport = { .owner = THIS_MODULE, .ident = XPRT_TRANSPORT_UDP, .setup = xs_setup_udp, + .netid = { "udp", "udp6", "" }, };
static struct xprt_class xs_tcp_transport = { @@ -3258,6 +3260,7 @@ static struct xprt_class xs_tcp_transport = { .owner = THIS_MODULE, .ident = XPRT_TRANSPORT_TCP, .setup = xs_setup_tcp, + .netid = { "tcp", "tcp6", "" }, };
static struct xprt_class xs_bc_tcp_transport = { @@ -3266,6 +3269,7 @@ static struct xprt_class xs_bc_tcp_transport = { .owner = THIS_MODULE, .ident = XPRT_TRANSPORT_BC_TCP, .setup = xs_setup_bc_tcp, + .netid = { "" }, };
/**
From: Calum Mackay calum.mackay@oracle.com
stable inclusion from linux-4.19.164 commit ea5569c61da9644fdee4700f0461a560c84d06b7
--------------------------------
[ Upstream commit 9b82d88d5976e5f2b8015d58913654856576ace5 ]
NLM uses an interval-based rebinding, i.e. it clears the transport's binding under certain conditions if more than 60 seconds have elapsed since the connection was last bound.
This rebinding is not necessary for an autobind RPC client over a connection-oriented protocol like TCP.
It can also cause problems: it is possible for nlm_bind_host() to clear XPRT_BOUND whilst a connection worker is in the middle of trying to reconnect, after it had already been checked in xprt_connect().
When the connection worker notices that XPRT_BOUND has been cleared under it, in xs_tcp_finish_connecting(), that results in:
xs_tcp_setup_socket: connect returned unhandled error -107
Worse, it's possible that the two can get into lockstep, resulting in the same behaviour repeated indefinitely, with the above error every 300 seconds, without ever recovering, and the connection never being established. This has been seen in practice, with a large number of NLM client tasks, following a server restart.
The existing callers of nlm_bind_host & nlm_rebind_host should not need to force the rebind, for TCP, so restrict the interval-based rebinding to UDP only.
For TCP, we will still rebind when needed, e.g. on timeout, and connection error (including closure), since connection-related errors on an existing connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(), already unconditionally clear XPRT_BOUND.
To avoid having to add the fix, and explanation, to both nlm_bind_host() and nlm_rebind_host(), remove the duplicate code from the former, and have it call the latter.
Drop the dprintk, which adds no value over a trace.
Signed-off-by: Calum Mackay calum.mackay@oracle.com Fixes: 35f5a422ce1a ("SUNRPC: new interface to force an RPC rebind") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/lockd/host.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/fs/lockd/host.c b/fs/lockd/host.c index f0b5c987d6ae..3f6ba0cd2bd9 100644 --- a/fs/lockd/host.c +++ b/fs/lockd/host.c @@ -432,12 +432,7 @@ nlm_bind_host(struct nlm_host *host) * RPC rebind is required */ if ((clnt = host->h_rpcclnt) != NULL) { - if (time_after_eq(jiffies, host->h_nextrebind)) { - rpc_force_rebind(clnt); - host->h_nextrebind = jiffies + NLM_HOST_REBIND; - dprintk("lockd: next rebind in %lu jiffies\n", - host->h_nextrebind - jiffies); - } + nlm_rebind_host(host); } else { unsigned long increment = nlmsvc_timeout; struct rpc_timeout timeparms = { @@ -485,13 +480,20 @@ nlm_bind_host(struct nlm_host *host) return clnt; }
-/* - * Force a portmap lookup of the remote lockd port +/** + * nlm_rebind_host - If needed, force a portmap lookup of the peer's lockd port + * @host: NLM host handle for peer + * + * This is not needed when using a connection-oriented protocol, such as TCP. + * The existing autobind mechanism is sufficient to force a rebind when + * required, e.g. on connection state transitions. */ void nlm_rebind_host(struct nlm_host *host) { - dprintk("lockd: rebind host %s\n", host->h_name); + if (host->h_proto != IPPROTO_UDP) + return; + if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) { rpc_force_rebind(host->h_rpcclnt); host->h_nextrebind = jiffies + NLM_HOST_REBIND;
From: NeilBrown neilb@suse.de
stable inclusion from linux-4.19.164 commit a5c43265c6244ce6df26b0c7a3d7f69449fb1a0a
--------------------------------
[ Upstream commit bf701b765eaa82dd164d65edc5747ec7288bb5c3 ]
nfsiod is currently a concurrency-managed workqueue (CMWQ). This means that work items scheduled to nfsiod on a given CPU are queued behind all other work items queued on any CMWQ on the same CPU. This can introduce unexpected latency.
Occasionally nfsiod can even cause excessive latency. If the work item to complete a CLOSE request calls the final iput() on an inode, the address_space of that inode will be dismantled. This takes time proportional to the number of in-memory pages, which on a large host working on large files (e.g. 5TB) can be a large number of pages, resulting in a noticeable number of seconds.
We can avoid these latency problems by switching nfsiod to WQ_UNBOUND. This causes each concurrent work item to get a dedicated thread which can be scheduled to an idle CPU.
There is precedent for this as several other filesystems use WQ_UNBOUND workqueue for handling various async events.
Signed-off-by: NeilBrown neilb@suse.de Fixes: ada609ee2ac2 ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 5d8e0eb0d663..905041eb24fe 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -2150,7 +2150,7 @@ static int nfsiod_start(void) { struct workqueue_struct *wq; dprintk("RPC: creating workqueue nfsiod\n"); - wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM, 0); + wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0); if (wq == NULL) return -ENOMEM; nfsiod_workqueue = wq;
From: Keqian Zhu zhukeqian1@huawei.com
stable inclusion from linux-4.19.164 commit 21253ff57b2c953aba6d4615fd185431bf6ff4cc
--------------------------------
[ Upstream commit 8b7770b877d187bfdae1eaf587bd2b792479a31c ]
The ARM virtual counter supports an event stream: it can only trigger an event when the trigger bit (selected by the value of CNTKCTL_EL1.EVNTI) of CNTVCT_EL0 changes, so the actual period of the event stream is 2^(cntkctl_evnti + 1). For example, when the trigger bit is 0, the virtual counter triggers an event every two cycles.
While we're at it, rework the way we compute the trigger bit position by making it more obvious that when bits [n:n-1] are both set (with n being the most significant bit), we pick bit (n + 1).
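A user-space sketch of the computation described above (assumptions: fls() is emulated with a compiler builtin and the frequencies are example values, not taken from the driver):

```
#include <stdio.h>

/* Emulate the kernel's fls(): position of the most significant set bit, 1-based. */
static int fls_emul(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

int main(void)
{
	unsigned int timer_rate = 50000000;	/* example counter frequency (Hz)  */
	unsigned int evt_stream_freq = 10000;	/* example target stream frequency */

	/* The event period is 2^(evnti + 1) cycles, so use half the frequency. */
	unsigned int evt_stream_div = timer_rate / evt_stream_freq / 2;

	/* Closest power of two: round up when the bit below the MSB is also set. */
	int lsb = fls_emul(evt_stream_div) - 1;
	if (lsb > 0 && (evt_stream_div & (1u << (lsb - 1))))
		lsb++;

	printf("divider=%u -> trigger bit=%d\n", evt_stream_div, lsb);
	return 0;
}
```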
Fixes: 037f637767a8 ("drivers: clocksource: add support for ARM architected timer event stream") Suggested-by: Marc Zyngier maz@kernel.org Signed-off-by: Keqian Zhu zhukeqian1@huawei.com Acked-by: Marc Zyngier maz@kernel.org Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org Link: https://lore.kernel.org/r/20201204073126.6920-3-zhukeqian1@huawei.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/clocksource/arm_arch_timer.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c index cc2eb77b41d9..f552a38feeb9 100644 --- a/drivers/clocksource/arm_arch_timer.c +++ b/drivers/clocksource/arm_arch_timer.c @@ -834,15 +834,24 @@ static void arch_timer_evtstrm_enable(int divider)
static void arch_timer_configure_evtstream(void) { - int evt_stream_div, pos; + int evt_stream_div, lsb; + + /* + * As the event stream can at most be generated at half the frequency + * of the counter, use half the frequency when computing the divider. + */ + evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ / 2; + + /* + * Find the closest power of two to the divisor. If the adjacent bit + * of lsb (last set bit, starts from 0) is set, then we use (lsb + 1). + */ + lsb = fls(evt_stream_div) - 1; + if (lsb > 0 && (evt_stream_div & BIT(lsb - 1))) + lsb++;
- /* Find the closest power of two to the divisor */ - evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ; - pos = fls(evt_stream_div); - if (pos > 1 && !(evt_stream_div & (1 << (pos - 2)))) - pos--; /* enable event stream */ - arch_timer_evtstrm_enable(min(pos, 15)); + arch_timer_evtstrm_enable(max(0, min(lsb, 15))); }
static void arch_counter_set_user_access(void)
From: Cheng Lin cheng.lin130@zte.com.cn
stable inclusion from linux-4.19.164 commit 5b1204d4d0cb9cefa59a32ce7a37d0d47c9eadc4
--------------------------------
[ Upstream commit 4a9d81caf841cd2c0ae36abec9c2963bf21d0284 ]
If an element is deleted from the list while it is being iterated over, the iteration falls into an endless loop.
kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [nfsd:17137]
PID: 17137  TASK: ffff8818d93c0000  CPU: 4  COMMAND: "nfsd"
 [exception RIP: __state_in_grace+76]
 RIP: ffffffffc00e817c  RSP: ffff8818d3aefc98  RFLAGS: 00000246
 RAX: ffff881dc0c38298  RBX: ffffffff81b03580  RCX: ffff881dc02c9f50
 RDX: ffff881e3fce8500  RSI: 0000000000000001  RDI: ffffffff81b03580
 RBP: ffff8818d3aefca0   R8: 0000000000000020   R9: ffff8818d3aefd40
 R10: ffff88017fc03800  R11: ffff8818e83933c0  R12: ffff8818d3aefd40
 R13: 0000000000000000  R14: ffff8818e8391068  R15: ffff8818fa6e4000
 CS: 0010  SS: 0018
 #0 [ffff8818d3aefc98] opens_in_grace at ffffffffc00e81e3 [grace]
 #1 [ffff8818d3aefca8] nfs4_preprocess_stateid_op at ffffffffc02a3e6c [nfsd]
 #2 [ffff8818d3aefd18] nfsd4_write at ffffffffc028ed5b [nfsd]
 #3 [ffff8818d3aefd80] nfsd4_proc_compound at ffffffffc0290a0d [nfsd]
 #4 [ffff8818d3aefdd0] nfsd_dispatch at ffffffffc027b800 [nfsd]
 #5 [ffff8818d3aefe08] svc_process_common at ffffffffc02017f3 [sunrpc]
 #6 [ffff8818d3aefe70] svc_process at ffffffffc0201ce3 [sunrpc]
 #7 [ffff8818d3aefe98] nfsd at ffffffffc027b117 [nfsd]
 #8 [ffff8818d3aefec8] kthread at ffffffff810b88c1
 #9 [ffff8818d3aeff50] ret_from_fork at ffffffff816d1607
The troublesome element:
crash> lock_manager ffff881dc0c38298
struct lock_manager {
  list = {
    next = 0xffff881dc0c38298,
    prev = 0xffff881dc0c38298
  },
  block_opens = false
}
Fixes: c87fb4a378f9 ("lockd: NLM grace period shouldn't block NFSv4 opens") Signed-off-by: Cheng Lin cheng.lin130@zte.com.cn Signed-off-by: Yi Wang wang.yi59@zte.com.cn Signed-off-by: Chuck Lever chuck.lever@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs_common/grace.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/nfs_common/grace.c b/fs/nfs_common/grace.c index 5be08f02a76b..4f90c444907f 100644 --- a/fs/nfs_common/grace.c +++ b/fs/nfs_common/grace.c @@ -68,10 +68,14 @@ __state_in_grace(struct net *net, bool open) if (!open) return !list_empty(grace_list);
+ spin_lock(&grace_lock); list_for_each_entry(lm, grace_list, list) { - if (lm->block_opens) + if (lm->block_opens) { + spin_unlock(&grace_lock); return true; + } } + spin_unlock(&grace_lock); return false; }
From: Daniel Scally djrscally@gmail.com
stable inclusion from linux-4.19.164 commit 6d6a32df9465923445c8f657edb34f3419e74ef7
--------------------------------
commit 12fc4dad94dfac25599f31257aac181c691ca96f upstream.
This reverts commit 8a66790b7850a6669129af078768a1d42076a0ef.
Switching this function to AE_CTRL_TERMINATE broke the documented behaviour of acpi_dev_get_resources() - AE_CTRL_TERMINATE does not, in fact, terminate the resource walk because acpi_walk_resource_buffer() ignores it (specifically converting it to AE_OK), referring to that value as "an OK termination by the user function". This means that acpi_dev_get_resources() does not abort processing when the preproc function returns a negative value.
Signed-off-by: Daniel Scally djrscally@gmail.com Cc: 3.10+ stable@vger.kernel.org # 3.10+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/acpi/resource.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/resource.c b/drivers/acpi/resource.c index 316a0fc785e3..d3f9a320e880 100644 --- a/drivers/acpi/resource.c +++ b/drivers/acpi/resource.c @@ -549,7 +549,7 @@ static acpi_status acpi_dev_process_resource(struct acpi_resource *ares, ret = c->preproc(ares, c->preproc_data); if (ret < 0) { c->error = ret; - return AE_CTRL_TERMINATE; + return AE_ABORT_METHOD; } else if (ret > 0) { return AE_OK; }
From: Hui Wang hui.wang@canonical.com
stable inclusion from linux-4.19.164 commit 20ef32728f5ef8987d7fe5de8b306b3beb451193
--------------------------------
commit b08221c40febcbda9309dd70c61cf1b0ebb0e351 upstream.
Recently we met a touchscreen problem on some Thinkpad machines, the touchscreen driver (i2c-hid) is not loaded and the touchscreen can't work.
An I2C ACPI device with the name WACF2200 is defined in the BIOS. With the current rule in matching_id(), this device is regarded as a PNP device since there is a WACFXXX entry in acpi_pnp_device_ids[], and the PNP device is attached to the ACPI device as the first physical node. This makes the I2C bus match fail when the I2C bus calls acpi_companion_match() to match the acpi_id_table in the i2c-hid driver.
WACF2200 is an I2C device, not a PNP device. After adding the string length comparison, matching_id() returns false when matching WACF2200 against WACFXXX, and it is reasonable to compare string lengths when matching two IDs.
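A self-contained sketch of the matching rule with the added length check (simplified relative to the actual driver code; the IDs in main() come from the report above, the rest is illustrative):

```
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Simplified ACPI/PNP ID match with the new length check added first. */
static bool matching_id(const char *idstr, const char *list_id)
{
	int i;

	if (strlen(idstr) != strlen(list_id))	/* the fix: WACF2200 vs WACFXXX */
		return false;

	if (memcmp(idstr, list_id, 3))		/* vendor prefix must match */
		return false;

	for (i = 3; idstr[i]; i++) {		/* 'X' in the table matches any character */
		if (list_id[i] != 'X' &&
		    toupper((unsigned char)idstr[i]) != list_id[i])
			return false;
	}
	return true;
}

int main(void)
{
	printf("%d\n", matching_id("WACF2200", "WACFXXX")); /* 0: lengths differ   */
	printf("%d\n", matching_id("WACF004", "WACFXXX"));  /* 1: matches pattern  */
	return 0;
}
```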
Suggested-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Hui Wang hui.wang@canonical.com Cc: All applicable stable@vger.kernel.org Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/acpi/acpi_pnp.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/acpi/acpi_pnp.c b/drivers/acpi/acpi_pnp.c index 67d97c0090a2..5d72baf60ac8 100644 --- a/drivers/acpi/acpi_pnp.c +++ b/drivers/acpi/acpi_pnp.c @@ -320,6 +320,9 @@ static bool matching_id(const char *idstr, const char *list_id) { int i;
+ if (strlen(idstr) != strlen(list_id)) + return false; + if (memcmp(idstr, list_id, 3)) return false;
From: Chunguang Xu brookxu@tencent.com
stable inclusion from linux-4.19.164 commit 48b91786875e25a2e958f5089fdad997615b538b
--------------------------------
commit cca415537244f6102cbb09b5b90db6ae2c953bdd upstream.
When freeing metadata, we will create an ext4_free_data and insert it into the pending free list. After the current transaction is committed, the object will be freed.
ext4_mb_free_metadata() will check whether the area to be freed overlaps with the pending free list. If it does, it returns directly, and at that point the ext4_free_data is leaked. Fortunately, the probability of this problem is small, since it only occurs if the file system is corrupted such that a block is claimed by more than one inode and those inodes are deleted within a single jbd2 transaction.
Signed-off-by: Chunguang Xu brookxu@tencent.com Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.c... Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/mballoc.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index ca402e3f014d..23e94193c8b4 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4691,6 +4691,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, ext4_group_first_block_no(sb, group) + EXT4_C2B(sbi, cluster), "Block already on to-be-freed list"); + kmem_cache_free(ext4_free_data_cachep, new_entry); return 0; } }
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.164 commit 338b60216653f17080d9c36107a69fc0e6b2e46f
--------------------------------
commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.
Xattr code using inodes with large xattr data can end up dropping the last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with a transaction started, and so ext4_evict_inode() can deadlock against fs freezing like:
CPU1 CPU2
removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state
Generally it is advisable to delete inodes from a separate transaction, as deletion can consume quite a few credits; in this case, however, that would be quite clumsy, and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid taking the freeze protection if we already have a transaction started, since it is not really needed then anyway.
Cc: stable@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara jack@suse.cz Reviewed-by: Andreas Dilger adilger@dilger.ca Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/inode.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0e047426c0c9..7f960f8cb0a4 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -177,6 +177,7 @@ void ext4_evict_inode(struct inode *inode) */ int extra_credits = 6; struct ext4_xattr_inode_array *ea_inode_array = NULL; + bool freeze_protected = false;
trace_ext4_evict_inode(inode);
@@ -234,9 +235,14 @@ void ext4_evict_inode(struct inode *inode)
/* * Protect us against freezing - iput() caller didn't have to have any - * protection against it + * protection against it. When we are in a running transaction though, + * we are already protected against freezing and we cannot grab further + * protection due to lock ordering constraints. */ - sb_start_intwrite(inode->i_sb); + if (!ext4_journal_current_handle()) { + sb_start_intwrite(inode->i_sb); + freeze_protected = true; + }
if (!IS_NOQUOTA(inode)) extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb); @@ -255,7 +261,8 @@ void ext4_evict_inode(struct inode *inode) * cleaned up. */ ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); goto no_delete; }
@@ -296,7 +303,8 @@ void ext4_evict_inode(struct inode *inode) stop_handle: ext4_journal_stop(handle); ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); goto no_delete; } @@ -325,7 +333,8 @@ void ext4_evict_inode(struct inode *inode) else ext4_free_inode(handle, inode); ext4_journal_stop(handle); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); return; no_delete:
From: Zhao Heming heming.zhao@suse.com
stable inclusion from linux-4.19.164 commit 5149da7860b222bbf2b19561de669c5a4f397783
--------------------------------
commit a8da01f79c89755fad55ed0ea96e8d2103242a72 upstream.
A reshape request should be blocked while a resync job is ongoing. In a cluster environment, a node can start a resync job even if the resync command wasn't executed on it; e.g., the user runs "mdadm --grow" on node A and sometimes node B starts the resync job. However, the current update_raid_disks() only checks the local recovery status, which is incomplete. As a result, the user can run "mdadm --grow" successfully on the local node while the remote node refuses to do the reshape job because it is doing a resync job. This inconsistent handling causes the array to enter an unexpected state. If the user doesn't notice this issue and keeps running mdadm commands, the array eventually stops working.
Fix this issue by blocking the reshape request. When a node executes "--grow" and detects an ongoing resync, it should stop and report an error to the user.
The following script reproduces the issue with ~100% probability. (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
        count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming heming.zhao@suse.com Signed-off-by: Song Liu songliubraving@fb.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/md.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c index 436e82df5298..abffae809bc8 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6930,6 +6930,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks) return -EINVAL; if (mddev->sync_thread || test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || + test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) || mddev->reshape_position != MaxSector) return -EBUSY;
@@ -9272,8 +9273,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev) } }
- if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) - update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); + if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) { + ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks)); + if (ret) + pr_warn("md: updating array disks failed. %d\n", ret); + }
/* Finally set the event to be up to date */ mddev->events = le64_to_cpu(sb->events);
From: Zhao Heming heming.zhao@suse.com
stable inclusion from linux-4.19.164 commit 05cafe5ad8a2e4644469ee5991560f615181e520
--------------------------------
commit bca5b0658020be90b6b504ca514fd80110204f71 upstream.
md-cluster uses MD_CLUSTER_SEND_LOCK so that a node can exclusively send msgs. While sending a msg, a node can concurrently receive a msg from another node. When a node does a resync job, grabbing token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
------------------------    ------------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                            b.
                            md_do_sync
                             resync_info_update
                               send RESYNCING
                                + set MD_CLUSTER_SEND_LOCK
                                + wait for holding token_lockres:EX

                            c.
                            mdadm /dev/md0 --remove /dev/sdg
                             + held reconfig_mutex
                             + send REMOVE
                                + wait_event(MD_CLUSTER_SEND_LOCK)

                            d.
                            recv_daemon //METADATA_UPDATED from A
                             process_metadata_update
                              + (mddev_trylock(mddev) ||
                                 MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                              //this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED. This blocks any other node from sending a msg.
b. B does sync jobs, which send RESYNCING at intervals. This gets blocked waiting to hold the token_lockres:EX lock.
c. B do "mdadm --remove", which will send REMOVE. This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
d. B receives the METADATA_UPDATED msg that A sent in step <a>. This gets blocked by step <c>: since the mddev lock is already held there, the wait_event in process_metadata_update() can never take the mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays ZERO in this scenario.)
There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg"). In that commit, step c is "update sb"; in this patch, step c is "mdadm --remove".
To fix this issue, we can refer to the solution in metadata_update_start(), which does the same grab-lock_token action. lock_comm() can use the same steps to avoid the deadlock: move the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD handling from lock_token() to lock_comm(). This enlarges the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD window a little, but it is safe and breaks the deadlock.
Repro steps (I only triggered it 3 times in hundreds of tests):
two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
        count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
        --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
The test script hangs when executing "mdadm --remove".
```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```
Finally, thanks to Xiao for the solution.
Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming heming.zhao@suse.com Suggested-by: Xiao Ni xni@redhat.com Reviewed-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu songliubraving@fb.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/md-cluster.c | 67 +++++++++++++++++++++++------------------ drivers/md/md.c | 6 ++-- 2 files changed, 42 insertions(+), 31 deletions(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 0b2af6e74fc3..9a640863cd8c 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -669,9 +669,27 @@ static void recv_daemon(struct md_thread *thread) * Takes the lock on the TOKEN lock resource so no other * node can communicate while the operation is underway. */ -static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked) +static int lock_token(struct md_cluster_info *cinfo) { - int error, set_bit = 0; + int error; + + error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); + if (error) { + pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n", + __func__, __LINE__, error); + } else { + /* Lock the receive sequence */ + mutex_lock(&cinfo->recv_mutex); + } + return error; +} + +/* lock_comm() + * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel. + */ +static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked) +{ + int rv, set_bit = 0; struct mddev *mddev = cinfo->mddev;
/* @@ -682,34 +700,19 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked) */ if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state)) { - error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, + rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state); - WARN_ON_ONCE(error); + WARN_ON_ONCE(rv); md_wakeup_thread(mddev->thread); set_bit = 1; } - error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); - if (set_bit) - clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
- if (error) - pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n", - __func__, __LINE__, error); - - /* Lock the receive sequence */ - mutex_lock(&cinfo->recv_mutex); - return error; -} - -/* lock_comm() - * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel. - */ -static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked) -{ wait_event(cinfo->wait, !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state)); - - return lock_token(cinfo, mddev_locked); + rv = lock_token(cinfo); + if (set_bit) + clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state); + return rv; }
static void unlock_comm(struct md_cluster_info *cinfo) @@ -789,9 +792,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg, { int ret;
- lock_comm(cinfo, mddev_locked); - ret = __sendmsg(cinfo, cmsg); - unlock_comm(cinfo); + ret = lock_comm(cinfo, mddev_locked); + if (!ret) { + ret = __sendmsg(cinfo, cmsg); + unlock_comm(cinfo); + } return ret; }
@@ -1063,7 +1068,7 @@ static int metadata_update_start(struct mddev *mddev) return 0; }
- ret = lock_token(cinfo, 1); + ret = lock_token(cinfo); clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state); return ret; } @@ -1181,7 +1186,10 @@ static void update_size(struct mddev *mddev, sector_t old_dev_sectors) int raid_slot = -1;
md_update_sb(mddev, 1); - lock_comm(cinfo, 1); + if (lock_comm(cinfo, 1)) { + pr_err("%s: lock_comm failed\n", __func__); + return; + }
memset(&cmsg, 0, sizeof(cmsg)); cmsg.type = cpu_to_le32(METADATA_UPDATED); @@ -1330,7 +1338,8 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev) cmsg.type = cpu_to_le32(NEWDISK); memcpy(cmsg.uuid, uuid, 16); cmsg.raid_slot = cpu_to_le32(rdev->desc_nr); - lock_comm(cinfo, 1); + if (lock_comm(cinfo, 1)) + return -EAGAIN; ret = __sendmsg(cinfo, &cmsg); if (ret) { unlock_comm(cinfo); diff --git a/drivers/md/md.c b/drivers/md/md.c index abffae809bc8..a4b796ab6712 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6599,8 +6599,10 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev) goto busy;
kick_rdev: - if (mddev_is_clustered(mddev)) - md_cluster_ops->remove_disk(mddev, rdev); + if (mddev_is_clustered(mddev)) { + if (md_cluster_ops->remove_disk(mddev, rdev)) + goto busy; + }
md_kick_rdev_from_array(rdev); set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
From: Jubin Zhong zhongjubin@huawei.com
stable inclusion from linux-4.19.164 commit 3b20e285bb9b20dda3d046b1b1e7131c3839869c
--------------------------------
commit 4684709bf81a2d98152ed6b610e3d5c403f9bced upstream.
If kobject_init_and_add() fails, pci_slot_release() is called to delete slot->list from parent->slots. But slot->list hasn't been initialized yet, so we dereference a NULL pointer:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
...
CPU: 10 PID: 1 Comm: swapper/0 Not tainted 4.4.240 #197
task: ffffeb398a45ef10 task.stack: ffffeb398a470000
PC is at __list_del_entry_valid+0x5c/0xb0
LR is at pci_slot_release+0x84/0xe4
...
 __list_del_entry_valid+0x5c/0xb0
 pci_slot_release+0x84/0xe4
 kobject_put+0x184/0x1c4
 pci_create_slot+0x17c/0x1b4
 __pci_hp_initialize+0x68/0xa4
 pciehp_probe+0x1a4/0x2fc
 pcie_port_probe_service+0x58/0x84
 driver_probe_device+0x320/0x470
Initialize slot->list before calling kobject_init_and_add() to avoid this.
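A standalone illustration, not the PCI code itself, of why list_del() on a zeroed, never-initialized list_head dereferences NULL, and why initializing and linking the entry first makes the error path safe:

```
#include <stdio.h>

/* Minimal doubly linked list in the style of the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

static void list_del(struct list_head *e)
{
	/*
	 * On a zeroed, never-initialized entry e->next and e->prev are NULL,
	 * so these stores dereference a NULL pointer, as in the oops above.
	 */
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

int main(void)
{
	struct list_head slots, slot = {0};

	INIT_LIST_HEAD(&slots);

	/*
	 * The fix: initialize and link the entry before any error path that
	 * may call the release function (which does list_del()) can run.
	 */
	INIT_LIST_HEAD(&slot);
	list_add(&slot, &slots);

	list_del(&slot);	/* safe now; would crash if slot were still {0} */
	printf("ok\n");
	return 0;
}
```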
Fixes: 8a94644b440e ("PCI: Fix pci_create_slot() reference count leak") Link: https://lore.kernel.org/r/1606876422-117457-1-git-send-email-zhongjubin@huaw... Signed-off-by: Jubin Zhong zhongjubin@huawei.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/pci/slot.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c index fb7478b6c4f9..dfbe9cbf292c 100644 --- a/drivers/pci/slot.c +++ b/drivers/pci/slot.c @@ -307,6 +307,9 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr, goto err; }
+ INIT_LIST_HEAD(&slot->list); + list_add(&slot->list, &parent->slots); + err = kobject_init_and_add(&slot->kobj, &pci_slot_ktype, NULL, "%s", slot_name); if (err) { @@ -314,9 +317,6 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr, goto err; }
- INIT_LIST_HEAD(&slot->list); - list_add(&slot->list, &parent->slots); - down_read(&pci_bus_sem); list_for_each_entry(dev, &parent->devices, bus_list) if (PCI_SLOT(dev->devfn) == slot_nr)
From: Wei Li liwei391@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
----------------------------------------------
/builds/1mHAjaH6WFVPAq6OVOJsJsJKHA2/drivers/irqchip/irq-gic-v3.c: In function 'gic_cpu_init':
/builds/1mHAjaH6WFVPAq6OVOJsJsJKHA2/drivers/irqchip/irq-gic-v3.c:888:3: error: implicit declaration of function 'ipi_set_nmi_prio' [-Werror=implicit-function-declaration]
  888 |   ipi_set_nmi_prio(rbase, GICD_INT_NMI_PRI);
      |   ^~~~~~~~~~~~~~~~
On ARM32, ipi_set_nmi_prio() is not implemented because CONFIG_ARM64_PSEUDO_NMI is not supported. Skip setting the NMI priority of SGIs during initialization.
Signed-off-by: Wei Li liwei391@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/irqchip/irq-gic-v3.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 6bb787ba1764..14e9c1a5627b 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -884,8 +884,10 @@ static void gic_cpu_init(void)
gic_cpu_config(rbase, gic_redist_wait_for_rwp);
+#ifdef CONFIG_ARM64_PSEUDO_NMI if (gic_supports_nmi()) ipi_set_nmi_prio(rbase, GICD_INT_NMI_PRI); +#endif
/* initialise system registers */ gic_cpu_sys_reg_init();
From: Axel Lin axel.lin@ingics.com
mainline inclusion from mainline-v5.1-rc1 commit 5b86fa6d2903c41c3c3dbff4def76a5b8bbe0660 category: bugfix bugzilla: NA CVE: NA
Fix below build error: ERROR: "__devm_regmap_init_mmio_clk" [sound/soc/codecs/snd-soc-msm8916-digital.ko] undefined!
Signed-off-by: Axel Lin axel.lin@ingics.com Acked-by: Srinivas Kandagatla srinivas.kandagatla@linaro.org Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- sound/soc/codecs/Kconfig | 1 + 1 file changed, 1 insertion(+)
diff --git a/sound/soc/codecs/Kconfig b/sound/soc/codecs/Kconfig index efb095dbcd71..1384353db0f4 100644 --- a/sound/soc/codecs/Kconfig +++ b/sound/soc/codecs/Kconfig @@ -679,6 +679,7 @@ config SND_SOC_MSM8916_WCD_ANALOG
config SND_SOC_MSM8916_WCD_DIGITAL tristate "Qualcomm MSM8916 WCD DIGITAL Codec" + select REGMAP_MMIO
config SND_SOC_PCM1681 tristate "Texas Instruments PCM1681 CODEC"
From: Ritika Srivastava ritika.srivastava@oracle.com
mainline inclusion from mainline-v5.10 commit 143d2600faf1563fd4125686ab84afaab5fb513b category: bugfix bugzilla: 46895 CVE: NA ---------------------------
Replace returning legacy errno codes with blk_status_t in blk_cloned_rq_check_limits().
Signed-off-by: Ritika Srivastava ritika.srivastava@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: block/blk-core.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index bf1049cf598b..498f74166574 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2594,12 +2594,12 @@ EXPORT_SYMBOL_GPL(blk_poll); * limits when retrying requests on other queues. Those requests need * to be checked against the new queue limits again during dispatch. */ -static int blk_cloned_rq_check_limits(struct request_queue *q, +static blk_status_t blk_cloned_rq_check_limits(struct request_queue *q, struct request *rq) { if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) { printk(KERN_ERR "%s: over max size limit.\n", __func__); - return -EIO; + return BLK_STS_IOERR; }
/* @@ -2611,10 +2611,10 @@ static int blk_cloned_rq_check_limits(struct request_queue *q, blk_recalc_rq_segments(rq); if (rq->nr_phys_segments > queue_max_segments(q)) { printk(KERN_ERR "%s: over max segments limit.\n", __func__); - return -EIO; + return BLK_STS_IOERR; }
- return 0; + return BLK_STS_OK; }
/**
From: Ritika Srivastava ritika.srivastava@oracle.com
mainline inclusion from mainline-v5.10 commit 8327cce5ff9376fac9ff713a8d5c99c16ba3fa33 category: bugfix bugzilla: 46895 CVE: NA ---------------------------
If the WRITE_ZERO/WRITE_SAME operation is not supported by the storage, blk_cloned_rq_check_limits() will return an I/O error, which will cause device-mapper to fail the paths.
Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP. BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the paths.
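A minimal caller-side sketch of how the distinct status can be consumed (illustrative only, not part of this patch; handle_unsupported() and handle_io_error() are hypothetical helpers used purely for illustration):

#include <linux/blkdev.h>
#include <linux/blk_types.h>

/* hypothetical helpers, for illustration only */
static void handle_unsupported(struct request *rq) { }
static void handle_io_error(struct request *rq) { }

static blk_status_t issue_clone(struct request_queue *q, struct request *clone)
{
	blk_status_t ret = blk_insert_cloned_request(q, clone);

	switch (ret) {
	case BLK_STS_OK:
		break;				/* request was queued */
	case BLK_STS_NOTSUPP:
		/* e.g. WRITE SAME/ZEROES rejected: do not fail the path */
		handle_unsupported(clone);
		break;
	default:
		handle_io_error(clone);		/* genuine I/O error */
		break;
	}
	return ret;
}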
Suggested-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Ritika Srivastava ritika.srivastava@oracle.com Reviewed-by: Martin K. Petersen martin.petersen@oracle.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: block/blk-core.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 498f74166574..9b2a05d00b96 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2597,7 +2597,22 @@ EXPORT_SYMBOL_GPL(blk_poll); static blk_status_t blk_cloned_rq_check_limits(struct request_queue *q, struct request *rq) { - if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) { + unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq)); + + if (blk_rq_sectors(rq) > max_sectors) { + /* + * SCSI device does not have a good way to return if + * Write Same/Zero is actually supported. If a device rejects + * a non-read/write command (discard, write same,etc.) the + * low-level device driver will set the relevant queue limit to + * 0 to prevent blk-lib from issuing more of the offending + * operations. Commands queued prior to the queue limit being + * reset need to be completed with BLK_STS_NOTSUPP to avoid I/O + * errors being propagated to upper layers. + */ + if (max_sectors == 0) + return BLK_STS_NOTSUPP; + printk(KERN_ERR "%s: over max size limit.\n", __func__); return BLK_STS_IOERR; } @@ -2627,8 +2642,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * unsigned long flags; int where = ELEVATOR_INSERT_BACK;
- if (blk_cloned_rq_check_limits(q, rq)) - return BLK_STS_IOERR; + blk_status_t ret; + + ret = blk_cloned_rq_check_limits(q, rq); + if (ret != BLK_STS_OK) + return ret;
if (rq->rq_disk && should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
From: Ard Biesheuvel ard.biesheuvel@linaro.org
mainline inclusion from mainline-v5.0-rc7 commit 8a5b403d71affa098009cc3dff1b2c45113021ad category: bugfix bugzilla: 5501 CVE: NA
-------------------------------------------------
In the irqchip and EFI code, we have what basically amounts to a quirk to work around a peculiarity in the GICv3 architecture, which permits the system memory address of LPI tables to be programmable only once after a CPU reset. This means kexec kernels must use the same memory as the first kernel, and thus ensure that this memory has not been given out for other purposes by the time the ITS init code runs, which is not very early for secondary CPUs.
On systems with many CPUs, these reservations could overflow the memblock reservation table, and this was addressed in commit:
eff896288872 ("efi/arm: Defer persistent reservations until after paging_init()")
However, this turns out to have made things worse, since the allocation of page tables and heap space for the resized memblock reservation table itself may overwrite the regions we are attempting to reserve, which may cause all kinds of corruption, also considering that the ITS will still be poking bits into that memory in response to incoming MSIs.
So instead, let's grow the static memblock reservation table on such systems so it can accommodate these reservations at an earlier time. This will permit us to revert the above commit in a subsequent patch.
[ mingo: Minor cleanups. ]
Signed-off-by: Ard Biesheuvel ard.biesheuvel@linaro.org Acked-by: Mike Rapoport rppt@linux.ibm.com Acked-by: Will Deacon will.deacon@arm.com Acked-by: Marc Zyngier marc.zyngier@arm.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: linux-arm-kernel@lists.infradead.org Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/memory.h | 11 +++++++++++ include/linux/memblock.h | 3 --- mm/memblock.c | 11 +++++++++-- 3 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index fc13895cfc79..eeaf48163f7b 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -312,6 +312,17 @@ static inline void *phys_to_virt(phys_addr_t x) #define virt_addr_valid(kaddr) (_virt_addr_is_linear(kaddr) && \ _virt_addr_valid(kaddr))
+/* + * Given that the GIC architecture permits ITS implementations that can only be + * configured with a LPI table address once, GICv3 systems with many CPUs may + * end up reserving a lot of different regions after a kexec for their LPI + * tables (one per CPU), as we are forced to reuse the same memory after kexec + * (and thus reserve it persistently with EFI beforehand) + */ +#if defined(CONFIG_EFI) && defined(CONFIG_ARM_GIC_V3_ITS) +# define INIT_MEMBLOCK_RESERVED_REGIONS (INIT_MEMBLOCK_REGIONS + NR_CPUS + 1) +#endif + #include <asm-generic/memory_model.h>
#endif diff --git a/include/linux/memblock.h b/include/linux/memblock.h index e5905349ff2b..b12b4070e028 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -17,9 +17,6 @@ #include <linux/init.h> #include <linux/mm.h>
-#define INIT_MEMBLOCK_REGIONS 128 -#define INIT_PHYSMEM_REGIONS 4 - /** * enum memblock_flags - definition of memory region attributes * @MEMBLOCK_NONE: no special request diff --git a/mm/memblock.c b/mm/memblock.c index 519a5cb11eed..5c943854bdfb 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -27,6 +27,13 @@
#include "internal.h"
+#define INIT_MEMBLOCK_REGIONS 128 +#define INIT_PHYSMEM_REGIONS 4 + +#ifndef INIT_MEMBLOCK_RESERVED_REGIONS +# define INIT_MEMBLOCK_RESERVED_REGIONS INIT_MEMBLOCK_REGIONS +#endif + /** * DOC: memblock overview * @@ -83,7 +90,7 @@ */
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; -static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; +static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock; #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock; #endif @@ -96,7 +103,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.regions = memblock_reserved_init_regions, .reserved.cnt = 1, /* empty dummy entry */ - .reserved.max = INIT_MEMBLOCK_REGIONS, + .reserved.max = INIT_MEMBLOCK_RESERVED_REGIONS, .reserved.name = "reserved",
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
From: Ard Biesheuvel ard.biesheuvel@linaro.org
mainline inclusion from mainline-v5.0-rc7 commit 582a32e708823e5957fd73ccd78dc4a9e49d21ea category: bugfix bugzilla: 5501 CVE: NA
-------------------------------------------------
This reverts commit eff896288872d687d9662000ec9ae11b6d61766f, which deferred the processing of persistent memory reservations to a point where the memory may have already been allocated and overwritten, defeating the purpose.
Signed-off-by: Ard Biesheuvel ard.biesheuvel@linaro.org Acked-by: Will Deacon will.deacon@arm.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Marc Zyngier marc.zyngier@arm.com Cc: Mike Rapoport rppt@linux.ibm.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: linux-arm-kernel@lists.infradead.org Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20190215123333.21209-3-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/setup.c | 1 - drivers/firmware/efi/efi.c | 4 ---- drivers/firmware/efi/libstub/arm-stub.c | 3 --- include/linux/efi.h | 7 ------- 4 files changed, 15 deletions(-)
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 6e4f42944631..0bd2a5320736 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -333,7 +333,6 @@ void __init setup_arch(char **cmdline_p) arm64_memblock_init();
paging_init(); - efi_apply_persistent_mem_reservations();
acpi_table_upgrade();
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 67cbca754f36..0f3ce03c1116 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -598,11 +598,7 @@ int __init efi_config_parse_tables(void *config_tables, int count, int sz,
early_memunmap(tbl, sizeof(*tbl)); } - return 0; -}
-int __init efi_apply_persistent_mem_reservations(void) -{ if (efi.mem_reserve != EFI_INVALID_TABLE_ADDR) { unsigned long prsv = efi.mem_reserve;
diff --git a/drivers/firmware/efi/libstub/arm-stub.c b/drivers/firmware/efi/libstub/arm-stub.c index 47c439a82964..296b3211f689 100644 --- a/drivers/firmware/efi/libstub/arm-stub.c +++ b/drivers/firmware/efi/libstub/arm-stub.c @@ -75,9 +75,6 @@ void install_memreserve_table(efi_system_table_t *sys_table_arg) efi_guid_t memreserve_table_guid = LINUX_EFI_MEMRESERVE_TABLE_GUID; efi_status_t status;
- if (IS_ENABLED(CONFIG_ARM)) - return; - status = efi_call_early(allocate_pool, EFI_LOADER_DATA, sizeof(*rsv), (void **)&rsv); if (status != EFI_SUCCESS) { diff --git a/include/linux/efi.h b/include/linux/efi.h index 0320a0da7a07..34c255c2a487 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -1180,8 +1180,6 @@ static inline bool efi_enabled(int feature) extern void efi_reboot(enum reboot_mode reboot_mode, const char *__unused);
extern bool efi_is_table_address(unsigned long phys_addr); - -extern int efi_apply_persistent_mem_reservations(void); #else static inline bool efi_enabled(int feature) { @@ -1200,11 +1198,6 @@ static inline bool efi_is_table_address(unsigned long phys_addr) { return false; } - -static inline int efi_apply_persistent_mem_reservations(void) -{ - return 0; -} #endif
extern int efi_status_to_err(efi_status_t status);
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc2 commit 307f660d056b5eb8f5bb2328fac3915ab75b5007 category: bugfix bugzilla: 46921 CVE: NA
-------------------
netpoll_send_skb_on_dev() can get the device pointer directly from np->dev
Rename it to __netpoll_send_skb()
A following patch will move netpoll_send_skb() out of line.
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netpoll.h | 5 ++--- net/core/netpoll.c | 10 ++++++---- 2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h index 3ef82d3a78db..66c9612d7e1c 100644 --- a/include/linux/netpoll.h +++ b/include/linux/netpoll.h @@ -65,13 +65,12 @@ int netpoll_setup(struct netpoll *np); void __netpoll_cleanup(struct netpoll *np); void __netpoll_free_async(struct netpoll *np); void netpoll_cleanup(struct netpoll *np); -void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb, - struct net_device *dev); +void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb); static inline void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) { unsigned long flags; local_irq_save(flags); - netpoll_send_skb_on_dev(np, skb, np->dev); + __netpoll_send_skb(np, skb); local_irq_restore(flags); }
diff --git a/net/core/netpoll.c b/net/core/netpoll.c index 41e32a958d08..d0cb045a8807 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -305,17 +305,19 @@ static int netpoll_owner_active(struct net_device *dev) }
/* call with IRQ disabled */ -void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb, - struct net_device *dev) +void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) { int status = NETDEV_TX_BUSY; + struct net_device *dev; unsigned long tries; /* It is up to the caller to keep npinfo alive. */ struct netpoll_info *npinfo;
lockdep_assert_irqs_disabled();
- npinfo = rcu_dereference_bh(np->dev->npinfo); + dev = np->dev; + npinfo = rcu_dereference_bh(dev->npinfo); + if (!npinfo || !netif_running(dev) || !netif_device_present(dev)) { dev_kfree_skb_irq(skb); return; @@ -358,7 +360,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb, schedule_delayed_work(&npinfo->tx_work,0); } } -EXPORT_SYMBOL(netpoll_send_skb_on_dev); +EXPORT_SYMBOL(__netpoll_send_skb);
void netpoll_send_udp(struct netpoll *np, const char *msg, int len) {
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc2 commit fb1eee476b0d3be3e58dac1a3a96f726c6278bed category: bugfix bugzilla: 46921 CVE: NA
-------------------
There is no need to inline this helper, as we intend to add more code in this function.
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netpoll.h | 9 +-------- net/core/netpoll.c | 13 +++++++++++-- 2 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h index 66c9612d7e1c..9c3514114f49 100644 --- a/include/linux/netpoll.h +++ b/include/linux/netpoll.h @@ -65,14 +65,7 @@ int netpoll_setup(struct netpoll *np); void __netpoll_cleanup(struct netpoll *np); void __netpoll_free_async(struct netpoll *np); void netpoll_cleanup(struct netpoll *np); -void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb); -static inline void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) -{ - unsigned long flags; - local_irq_save(flags); - __netpoll_send_skb(np, skb); - local_irq_restore(flags); -} +void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb);
#ifdef CONFIG_NETPOLL static inline void *netpoll_poll_lock(struct napi_struct *napi) diff --git a/net/core/netpoll.c b/net/core/netpoll.c index d0cb045a8807..2bc754c09486 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -305,7 +305,7 @@ static int netpoll_owner_active(struct net_device *dev) }
/* call with IRQ disabled */ -void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) +static void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) { int status = NETDEV_TX_BUSY; struct net_device *dev; @@ -360,7 +360,16 @@ void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) schedule_delayed_work(&npinfo->tx_work,0); } } -EXPORT_SYMBOL(__netpoll_send_skb); + +void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) +{ + unsigned long flags; + + local_irq_save(flags); + __netpoll_send_skb(np, skb); + local_irq_restore(flags); +} +EXPORT_SYMBOL(netpoll_send_skb);
void netpoll_send_udp(struct netpoll *np, const char *msg, int len) {
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc2 commit 1ddabdfaf70c202b88925edd74c66f4707dbd92e category: bugfix bugzilla: 46921 CVE: NA
-------------------
Some callers want to know if the packet has been sent or dropped, to inform upper stacks.
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netpoll.h | 2 +- net/core/netpoll.c | 11 +++++++---- 2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h index 9c3514114f49..ccd2f54233e3 100644 --- a/include/linux/netpoll.h +++ b/include/linux/netpoll.h @@ -65,7 +65,7 @@ int netpoll_setup(struct netpoll *np); void __netpoll_cleanup(struct netpoll *np); void __netpoll_free_async(struct netpoll *np); void netpoll_cleanup(struct netpoll *np); -void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb); +netdev_tx_t netpoll_send_skb(struct netpoll *np, struct sk_buff *skb);
#ifdef CONFIG_NETPOLL static inline void *netpoll_poll_lock(struct napi_struct *napi) diff --git a/net/core/netpoll.c b/net/core/netpoll.c index 2bc754c09486..97d3083265d3 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -305,7 +305,7 @@ static int netpoll_owner_active(struct net_device *dev) }
/* call with IRQ disabled */ -static void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) +static netdev_tx_t __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) { int status = NETDEV_TX_BUSY; struct net_device *dev; @@ -320,7 +320,7 @@ static void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
if (!npinfo || !netif_running(dev) || !netif_device_present(dev)) { dev_kfree_skb_irq(skb); - return; + return NET_XMIT_DROP; }
/* don't get messages out of order, and no recursion */ @@ -359,15 +359,18 @@ static void __netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) skb_queue_tail(&npinfo->txq, skb); schedule_delayed_work(&npinfo->tx_work,0); } + return NETDEV_TX_OK; }
-void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) +netdev_tx_t netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) { unsigned long flags; + netdev_tx_t ret;
local_irq_save(flags); - __netpoll_send_skb(np, skb); + ret = __netpoll_send_skb(np, skb); local_irq_restore(flags); + return ret; } EXPORT_SYMBOL(netpoll_send_skb);
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc2 commit f78ed2204db9fc35b545d693865bddbe0149aa1f category: bugfix bugzilla: 46921 CVE: NA
-------------------
netpoll_send_skb() callers seem to leak skb if the np pointer is NULL. While this should not happen, we can make the code more robust.
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/macvlan.c | 5 ++--- include/linux/if_team.h | 5 +---- include/net/bonding.h | 5 +---- net/8021q/vlan_dev.c | 5 ++--- net/bridge/br_private.h | 5 +---- net/core/netpoll.c | 11 ++++++++--- net/dsa/slave.c | 5 ++--- 7 files changed, 17 insertions(+), 24 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index e226a96da3a3..44853cc2d9be 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -549,12 +549,11 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, struct sk_buff *skb) { #ifdef CONFIG_NET_POLL_CONTROLLER - if (vlan->netpoll) - netpoll_send_skb(vlan->netpoll, skb); + return netpoll_send_skb(vlan->netpoll, skb); #else BUG(); -#endif return NETDEV_TX_OK; +#endif }
static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb, diff --git a/include/linux/if_team.h b/include/linux/if_team.h index ac42da56f7a2..1a9034ea94c7 100644 --- a/include/linux/if_team.h +++ b/include/linux/if_team.h @@ -106,10 +106,7 @@ static inline bool team_port_dev_txable(const struct net_device *port_dev) static inline void team_netpoll_send_skb(struct team_port *port, struct sk_buff *skb) { - struct netpoll *np = port->np; - - if (np) - netpoll_send_skb(np, skb); + netpoll_send_skb(port->np, skb); } #else static inline void team_netpoll_send_skb(struct team_port *port, diff --git a/include/net/bonding.h b/include/net/bonding.h index c458f084f7bb..58a8c2b66aa5 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -503,10 +503,7 @@ static inline unsigned long slave_last_rx(struct bonding *bond, static inline void bond_netpoll_send_skb(const struct slave *slave, struct sk_buff *skb) { - struct netpoll *np = slave->np; - - if (np) - netpoll_send_skb(np, skb); + netpoll_send_skb(slave->np, skb); } #else static inline void bond_netpoll_send_skb(const struct slave *slave, diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index 84ef83772114..52e33fd9c9bf 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -94,12 +94,11 @@ static int vlan_dev_hard_header(struct sk_buff *skb, struct net_device *dev, static inline netdev_tx_t vlan_netpoll_send_skb(struct vlan_dev_priv *vlan, struct sk_buff *skb) { #ifdef CONFIG_NET_POLL_CONTROLLER - if (vlan->netpoll) - netpoll_send_skb(vlan->netpoll, skb); + return netpoll_send_skb(vlan->netpoll, skb); #else BUG(); -#endif return NETDEV_TX_OK; +#endif }
static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb, diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 33b8222db75c..2c9dbd437306 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -500,10 +500,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev); static inline void br_netpoll_send_skb(const struct net_bridge_port *p, struct sk_buff *skb) { - struct netpoll *np = p->np; - - if (np) - netpoll_send_skb(np, skb); + netpoll_send_skb(p->np, skb); }
int br_netpoll_enable(struct net_bridge_port *p); diff --git a/net/core/netpoll.c b/net/core/netpoll.c index 97d3083265d3..c77cc26dd7e5 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -367,9 +367,14 @@ netdev_tx_t netpoll_send_skb(struct netpoll *np, struct sk_buff *skb) unsigned long flags; netdev_tx_t ret;
- local_irq_save(flags); - ret = __netpoll_send_skb(np, skb); - local_irq_restore(flags); + if (unlikely(!np)) { + dev_kfree_skb_irq(skb); + ret = NET_XMIT_DROP; + } else { + local_irq_save(flags); + ret = __netpoll_send_skb(np, skb); + local_irq_restore(flags); + } return ret; } EXPORT_SYMBOL(netpoll_send_skb); diff --git a/net/dsa/slave.c b/net/dsa/slave.c index b39720d0995d..fbf1375427db 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -393,12 +393,11 @@ static inline netdev_tx_t dsa_slave_netpoll_send_skb(struct net_device *dev, #ifdef CONFIG_NET_POLL_CONTROLLER struct dsa_slave_priv *p = netdev_priv(dev);
- if (p->netpoll) - netpoll_send_skb(p->netpoll, skb); + return netpoll_send_skb(p->netpoll, skb); #else BUG(); -#endif return NETDEV_TX_OK; +#endif }
static void dsa_skb_tx_timestamp(struct dsa_slave_priv *p,
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: 46923 CVE: NA
----------------------------------------------
With KTASK enabled in the vfio path, if the BAR size of a passthrough device is too large, the guest will crash on boot, so make KTASK default to off.
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- init/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/init/Kconfig b/init/Kconfig index 121ad5dbc1c0..aa3e2e54215b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -352,7 +352,7 @@ config AUDIT_TREE config KTASK bool "Multithread CPU-intensive kernel work" depends on SMP - default y + default n help Parallelize CPU-intensive kernel work. This feature is designed for big machines that can take advantage of their extra CPUs to speed up
From: Fang Lijun fanglijun3@huawei.com
ascend inclusion category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
The vm_flags value will overflow on arm32 because it is shifted left by CHECKNODE_BITS (48), which exceeds the width of the 32-bit vm_flags_t used there.
The checknode handling is only used by the CDM feature, so guard it with CONFIG_COHERENT_DEVICE.
Fixes: cdccf4d4b7b5 ("arm64/ascend: mm: Add MAP_CHECKNODE flag to check node hugetlb") Signed-off-by: Fang Lijun fanglijun3@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/hugetlb.h | 3 --- include/linux/mman.h | 14 ++++++++++++++ mm/hugetlb.c | 1 + mm/mmap.c | 5 ++--- 4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 9cd938bb24fe..3f355dfaf31f 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -15,9 +15,6 @@ struct ctl_table; struct user_struct; struct mmu_gather;
-#define CHECKNODE_BITS 48 -#define CHECKNODE_MASK (~((_AC(1, UL) << CHECKNODE_BITS) - 1)) - #ifndef is_hugepd /* * Some architectures requires a hugepage directory format that is diff --git a/include/linux/mman.h b/include/linux/mman.h index f4c25c06653c..d35d984c058c 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -73,6 +73,20 @@ static inline int dvpp_mmap_zone(unsigned long addr) { return 0; }
#endif
+#ifdef CONFIG_COHERENT_DEVICE +#define CHECKNODE_BITS 48 +#define CHECKNODE_MASK (~((_AC(1, UL) << CHECKNODE_BITS) - 1)) +static inline void set_vm_checknode(vm_flags_t vm_flags, unsigned long flags) +{ + if (is_set_cdmmask()) + vm_flags |= VM_CHECKNODE | ((((flags >> MAP_HUGE_SHIFT) & + MAP_HUGE_MASK) << CHECKNODE_BITS) & CHECKNODE_MASK); +} +#else +#define CHECKNODE_BITS (0) +static inline void set_vm_checknode(vm_flags_t vm_flags, unsigned long flags) {} +#endif + /* * Arrange for legacy / undefined architecture specific flags to be * ignored by mmap handling code. diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 327d24f0cf0d..7c2f51528c1c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -25,6 +25,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/jhash.h> +#include <linux/mman.h>
#include <asm/page.h> #include <asm/pgtable.h> diff --git a/mm/mmap.c b/mm/mmap.c index 6197c5590ded..4cc9ee8a0287 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1564,9 +1564,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr, /* set numa node id into vm_flags, * hugetlbfs file mmap will use it to check node */ - if (is_set_cdmmask() && (flags & MAP_CHECKNODE)) - vm_flags |= VM_CHECKNODE | ((((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK) - << CHECKNODE_BITS) & CHECKNODE_MASK); + if (flags & MAP_CHECKNODE) + set_vm_checknode(vm_flags, flags);
addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) &&
From: lijinlin lijinlin3@huawei.com
euleros inclusion category: feature feature: report ext3 errors to userspace via netlink bugzilla: NA
--------------------------------
reason: This patch implements abnormal-event alarms for the ext3 file system. It can be enabled by setting "FILESYSTEM_MONITOR" or "FILESYSTEM_ALARM" in the configuration file; with this enabled, an alarm is raised when an ext3 file system exception occurs.
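A hedged userspace sketch of a listener for these alarms (not part of this patch): EXT4_NL_PROTO is a placeholder that must match whatever protocol number the kernel side registers with netlink_kernel_create(), and the struct below mirrors only the leading fields shown in the diff.

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define EXT4_NL_PROTO		31	/* hypothetical protocol number */
#define NL_EXT4_ERROR_GROUP	1
#define EXT3_ERROR_MAGIC	0xAE43125U
#define EXT4_ERROR_MAGIC	0xAE32014U

struct ext4_err_msg {
	int magic;
	char s_id[32];
	/* s_flags, ext4_errno, ... follow in the kernel definition */
};

int main(void)
{
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_groups = 1u << (NL_EXT4_ERROR_GROUP - 1),
	};
	char buf[4096];
	int fd = socket(AF_NETLINK, SOCK_RAW, EXT4_NL_PROTO);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		struct ext4_err_msg *msg;

		if (len <= 0 || !NLMSG_OK(nlh, (unsigned int)len))
			continue;
		msg = NLMSG_DATA(nlh);
		if (msg->magic == EXT3_ERROR_MAGIC)
			printf("ext3 error on %s\n", msg->s_id);
		else if (msg->magic == EXT4_ERROR_MAGIC)
			printf("ext4 error on %s\n", msg->s_id);
	}
}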
Signed-off-by: lijinlin lijinlin3@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/ext4.h | 3 +++ fs/ext4/super.c | 7 ++++++- 2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index b2e876bf0dec..4587fba7de7b 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -55,7 +55,10 @@ #endif
#define NL_EXT4_ERROR_GROUP 1 + +#define EXT3_ERROR_MAGIC 0xAE43125U #define EXT4_ERROR_MAGIC 0xAE32014U + struct ext4_err_msg { int magic; char s_id[32]; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 1a0d494e57ae..9becf3e2e20a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -455,6 +455,8 @@ static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno) struct ext4_err_msg *msg;
if (ext4nl) { + if (IS_EXT2_SB(sb)) + return; size = NLMSG_SPACE(sizeof(struct ext4_err_msg)); skb = alloc_skb(size, GFP_ATOMIC); if (!skb) { @@ -466,7 +468,10 @@ static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno) if (!nlh) goto nlmsg_failure; msg = (struct ext4_err_msg *)NLMSG_DATA(nlh); - msg->magic = EXT4_ERROR_MAGIC; + if (IS_EXT3_SB(sb)) + msg->magic = EXT3_ERROR_MAGIC; + else + msg->magic = EXT4_ERROR_MAGIC; memcpy(msg->s_id, sb->s_id, sizeof(sb->s_id)); msg->s_flags = sb->s_flags; msg->ext4_errno = ext4_errno;
From: chenmaodong chenmaodong@huawei.com
euleros inclusion category: bugfix bugzilla: 46917
-------------------------
virtio_gpu drops the reference taken at allocation in virtio_gpu_gem_create() when creating a dumb buffer, but after that the code continues to use the virtio_gpu_object in virtio_gpu_object_attach(), which causes a use-after-free. See details in the bugzilla entry.
Signed-off-by: chenmaodong chenmaodong@huawei.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/gpu/drm/virtio/virtgpu_gem.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/virtio/virtgpu_gem.c b/drivers/gpu/drm/virtio/virtgpu_gem.c index 0f2768eacaee..692776abdcf1 100644 --- a/drivers/gpu/drm/virtio/virtgpu_gem.c +++ b/drivers/gpu/drm/virtio/virtgpu_gem.c @@ -71,9 +71,6 @@ int virtio_gpu_gem_create(struct drm_file *file,
*obj_p = &obj->gem_base;
- /* drop reference from allocate - handle holds it now */ - drm_gem_object_put_unlocked(&obj->gem_base); - *handle_p = handle; return 0; } @@ -107,6 +104,7 @@ int virtio_gpu_mode_dumb_create(struct drm_file *file_priv, /* attach the object to the resource */ obj = gem_to_virtio_gpu_obj(gobj); ret = virtio_gpu_object_attach(vgdev, obj, resid, NULL); + drm_gem_object_put_unlocked(&obj->gem_base); if (ret) goto fail;
From: Fang Lijun fanglijun3@huawei.com
ascend inclusion category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
The vm_flags value is changed by MAP_CHECKNODE, so it must be passed to set_vm_checknode() as an output argument (by pointer).
Fixes: d0a645813b91 ("arm64/ascend: mm: Fix arm32 compile warnings") Signed-off-by: Fang Lijun fanglijun3@huawei.com Reviewed-by: Ding Tianhong dingtianhong@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/mman.h | 7 ++++--- mm/mmap.c | 2 +- 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/linux/mman.h b/include/linux/mman.h index d35d984c058c..a8ea591faed7 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -76,15 +76,16 @@ static inline int dvpp_mmap_zone(unsigned long addr) { return 0; } #ifdef CONFIG_COHERENT_DEVICE #define CHECKNODE_BITS 48 #define CHECKNODE_MASK (~((_AC(1, UL) << CHECKNODE_BITS) - 1)) -static inline void set_vm_checknode(vm_flags_t vm_flags, unsigned long flags) +static inline void set_vm_checknode(vm_flags_t *vm_flags, unsigned long flags) { if (is_set_cdmmask()) - vm_flags |= VM_CHECKNODE | ((((flags >> MAP_HUGE_SHIFT) & + *vm_flags |= VM_CHECKNODE | ((((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK) << CHECKNODE_BITS) & CHECKNODE_MASK); } #else #define CHECKNODE_BITS (0) -static inline void set_vm_checknode(vm_flags_t vm_flags, unsigned long flags) {} +static inline void set_vm_checknode(vm_flags_t *vm_flags, unsigned long flags) +{} #endif
/* diff --git a/mm/mmap.c b/mm/mmap.c index 4cc9ee8a0287..04e34c022775 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1565,7 +1565,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, * hugetlbfs file mmap will use it to check node */ if (flags & MAP_CHECKNODE) - set_vm_checknode(vm_flags, flags); + set_vm_checknode(&vm_flags, flags);
addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) &&
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.5-rc5 commit 07a4ddec3ce9b0a533b5f90f582f1057390d5e63 category: bugfix bugzilla: 46933 CVE: NA
-------------------
Currently, gratuitous ARP/ND packets are sent every `miimon' milliseconds. This commit allows a user to specify a custom delay through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of `miimon' to avoid managing an additional work queue. The configuration logic is copied from `updelay' and `downdelay'. However, the default value cannot be set using a module parameter: Netlink or sysfs should be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable. The change is due to the fact we cannot rely on a notification to lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is only triggered once every N times, while we need to decrement the counter each time.
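A minimal sketch of the counting scheme described above, mirroring the logic in the diff below (illustrative only; the function and parameter names are not from the driver):

#include <linux/types.h>

/*
 * send_peer_notif is loaded with num_peer_notif * peer_notif_delay and
 * decremented once per miimon tick; a gratuitous ARP/NA is emitted only
 * when the remaining count is a non-zero multiple of peer_notif_delay,
 * i.e. one notification every peer_notif_delay * miimon milliseconds,
 * num_peer_notif times in total.
 */
static bool should_notify_this_tick(unsigned int send_peer_notif,
				    unsigned int peer_notif_delay)
{
	unsigned int step = peer_notif_delay ? peer_notif_delay : 1;

	return send_peer_notif && (send_peer_notif % step == 0);
}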
iproute2 also needs to be updated to be able to specify this new attribute through `ip link'.
Signed-off-by: Vincent Bernat vincent@bernat.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/bonding/bond_main.c | 31 ++++++++----- drivers/net/bonding/bond_netlink.c | 14 ++++++ drivers/net/bonding/bond_options.c | 71 +++++++++++++++++++----------- drivers/net/bonding/bond_procfs.c | 2 + drivers/net/bonding/bond_sysfs.c | 13 ++++++ include/net/bond_options.h | 1 + include/net/bonding.h | 1 + include/uapi/linux/if_link.h | 1 + tools/include/uapi/linux/if_link.h | 1 + 9 files changed, 98 insertions(+), 37 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index c21c4291921f..b7db57e6c96a 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -786,6 +786,8 @@ static bool bond_should_notify_peers(struct bonding *bond) slave ? slave->dev->name : "NULL");
if (!slave || !bond->send_peer_notif || + bond->send_peer_notif % + max(1, bond->params.peer_notif_delay) != 0 || !netif_carrier_ok(bond->dev) || test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state)) return false; @@ -878,15 +880,18 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
if (netif_running(bond->dev)) { bond->send_peer_notif = - bond->params.num_peer_notif; + bond->params.num_peer_notif * + max(1, bond->params.peer_notif_delay); should_notify_peers = bond_should_notify_peers(bond); }
call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, bond->dev); - if (should_notify_peers) + if (should_notify_peers) { + bond->send_peer_notif--; call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, bond->dev); + } } }
@@ -2315,6 +2320,7 @@ static void bond_mii_monitor(struct work_struct *work) struct bonding *bond = container_of(work, struct bonding, mii_work.work); bool should_notify_peers = false; + bool commit; unsigned long delay; struct slave *slave; struct list_head *iter; @@ -2325,12 +2331,19 @@ static void bond_mii_monitor(struct work_struct *work) goto re_arm;
rcu_read_lock(); - should_notify_peers = bond_should_notify_peers(bond); - - if (bond_miimon_inspect(bond)) { + commit = !!bond_miimon_inspect(bond); + if (bond->send_peer_notif) { + rcu_read_unlock(); + if (rtnl_trylock()) { + bond->send_peer_notif--; + rtnl_unlock(); + } + } else { rcu_read_unlock(); + }
+ if (commit) { /* Race avoidance with bond_close cancel of workqueue */ if (!rtnl_trylock()) { delay = 1; @@ -2344,8 +2357,7 @@ static void bond_mii_monitor(struct work_struct *work) bond_miimon_commit(bond);
rtnl_unlock(); /* might sleep, hold no other locks */ - } else - rcu_read_unlock(); + }
re_arm: if (bond->params.miimon) @@ -3118,10 +3130,6 @@ static int bond_master_netdev_event(unsigned long event, case NETDEV_REGISTER: bond_create_proc_entry(event_bond); break; - case NETDEV_NOTIFY_PEERS: - if (event_bond->send_peer_notif) - event_bond->send_peer_notif--; - break; default: break; } @@ -4768,6 +4776,7 @@ static int bond_check_params(struct bond_params *params) params->arp_all_targets = arp_all_targets_value; params->updelay = updelay; params->downdelay = downdelay; + params->peer_notif_delay = 0; params->use_carrier = use_carrier; params->lacp_fast = lacp_fast; params->primary[0] = 0; diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c index fbcd8a752ee7..d446b9423bae 100644 --- a/drivers/net/bonding/bond_netlink.c +++ b/drivers/net/bonding/bond_netlink.c @@ -112,6 +112,7 @@ static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = { [IFLA_BOND_AD_ACTOR_SYSTEM] = { .type = NLA_BINARY, .len = ETH_ALEN }, [IFLA_BOND_TLB_DYNAMIC_LB] = { .type = NLA_U8 }, + [IFLA_BOND_PEER_NOTIF_DELAY] = { .type = NLA_U32 }, };
static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = { @@ -219,6 +220,14 @@ static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[], if (err) return err; } + if (data[IFLA_BOND_PEER_NOTIF_DELAY]) { + int delay = nla_get_u32(data[IFLA_BOND_PEER_NOTIF_DELAY]); + + bond_opt_initval(&newval, delay); + err = __bond_opt_set(bond, BOND_OPT_PEER_NOTIF_DELAY, &newval); + if (err) + return err; + } if (data[IFLA_BOND_USE_CARRIER]) { int use_carrier = nla_get_u8(data[IFLA_BOND_USE_CARRIER]);
@@ -497,6 +506,7 @@ static size_t bond_get_size(const struct net_device *bond_dev) nla_total_size(sizeof(u16)) + /* IFLA_BOND_AD_USER_PORT_KEY */ nla_total_size(ETH_ALEN) + /* IFLA_BOND_AD_ACTOR_SYSTEM */ nla_total_size(sizeof(u8)) + /* IFLA_BOND_TLB_DYNAMIC_LB */ + nla_total_size(sizeof(u32)) + /* IFLA_BOND_PEER_NOTIF_DELAY */ 0; }
@@ -539,6 +549,10 @@ static int bond_fill_info(struct sk_buff *skb, bond->params.downdelay * bond->params.miimon)) goto nla_put_failure;
+ if (nla_put_u32(skb, IFLA_BOND_PEER_NOTIF_DELAY, + bond->params.downdelay * bond->params.miimon)) + goto nla_put_failure; + if (nla_put_u8(skb, IFLA_BOND_USE_CARRIER, bond->params.use_carrier)) goto nla_put_failure;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index 80867bd8f44c..7e07e213f4c2 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -28,6 +28,8 @@ static int bond_option_updelay_set(struct bonding *bond, const struct bond_opt_value *newval); static int bond_option_downdelay_set(struct bonding *bond, const struct bond_opt_value *newval); +static int bond_option_peer_notif_delay_set(struct bonding *bond, + const struct bond_opt_value *newval); static int bond_option_use_carrier_set(struct bonding *bond, const struct bond_opt_value *newval); static int bond_option_arp_interval_set(struct bonding *bond, @@ -428,6 +430,13 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = { .desc = "Number of peer notifications to send on failover event", .values = bond_num_peer_notif_tbl, .set = bond_option_num_peer_notif_set + }, + [BOND_OPT_PEER_NOTIF_DELAY] = { + .id = BOND_OPT_PEER_NOTIF_DELAY, + .name = "peer_notif_delay", + .desc = "Delay between each peer notification on failover event, in milliseconds", + .values = bond_intmax_tbl, + .set = bond_option_peer_notif_delay_set } };
@@ -850,6 +859,9 @@ static int bond_option_miimon_set(struct bonding *bond, if (bond->params.downdelay) netdev_dbg(bond->dev, "Note: Updating downdelay (to %d) since it is a multiple of the miimon value\n", bond->params.downdelay * bond->params.miimon); + if (bond->params.peer_notif_delay) + netdev_dbg(bond->dev, "Note: Updating peer_notif_delay (to %d) since it is a multiple of the miimon value\n", + bond->params.peer_notif_delay * bond->params.miimon); if (newval->value && bond->params.arp_interval) { netdev_dbg(bond->dev, "MII monitoring cannot be used with ARP monitoring - disabling ARP monitoring...\n"); bond->params.arp_interval = 0; @@ -873,52 +885,59 @@ static int bond_option_miimon_set(struct bonding *bond, return 0; }
-/* Set up and down delays. These must be multiples of the - * MII monitoring value, and are stored internally as the multiplier. - * Thus, we must translate to MS for the real world. +/* Set up, down and peer notification delays. These must be multiples + * of the MII monitoring value, and are stored internally as the + * multiplier. Thus, we must translate to MS for the real world. */ -static int bond_option_updelay_set(struct bonding *bond, - const struct bond_opt_value *newval) +static int _bond_option_delay_set(struct bonding *bond, + const struct bond_opt_value *newval, + const char *name, + int *target) { int value = newval->value;
if (!bond->params.miimon) { - netdev_err(bond->dev, "Unable to set up delay as MII monitoring is disabled\n"); + netdev_err(bond->dev, "Unable to set %s as MII monitoring is disabled\n", + name); return -EPERM; } if ((value % bond->params.miimon) != 0) { - netdev_warn(bond->dev, "up delay (%d) is not a multiple of miimon (%d), updelay rounded to %d ms\n", + netdev_warn(bond->dev, + "%s (%d) is not a multiple of miimon (%d), value rounded to %d ms\n", + name, value, bond->params.miimon, (value / bond->params.miimon) * bond->params.miimon); } - bond->params.updelay = value / bond->params.miimon; - netdev_dbg(bond->dev, "Setting up delay to %d\n", - bond->params.updelay * bond->params.miimon); + *target = value / bond->params.miimon; + netdev_dbg(bond->dev, "Setting %s to %d\n", + name, + *target * bond->params.miimon);
return 0; }
+static int bond_option_updelay_set(struct bonding *bond, + const struct bond_opt_value *newval) +{ + return _bond_option_delay_set(bond, newval, "up delay", + &bond->params.updelay); +} + static int bond_option_downdelay_set(struct bonding *bond, const struct bond_opt_value *newval) { - int value = newval->value; - - if (!bond->params.miimon) { - netdev_err(bond->dev, "Unable to set down delay as MII monitoring is disabled\n"); - return -EPERM; - } - if ((value % bond->params.miimon) != 0) { - netdev_warn(bond->dev, "down delay (%d) is not a multiple of miimon (%d), delay rounded to %d ms\n", - value, bond->params.miimon, - (value / bond->params.miimon) * - bond->params.miimon); - } - bond->params.downdelay = value / bond->params.miimon; - netdev_dbg(bond->dev, "Setting down delay to %d\n", - bond->params.downdelay * bond->params.miimon); + return _bond_option_delay_set(bond, newval, "down delay", + &bond->params.downdelay); +}
- return 0; +static int bond_option_peer_notif_delay_set(struct bonding *bond, + const struct bond_opt_value *newval) +{ + int ret = _bond_option_delay_set(bond, newval, + "peer notification delay", + &bond->params.peer_notif_delay); + return ret; }
static int bond_option_use_carrier_set(struct bonding *bond, diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c index 9f7d83e827c3..fd5c9cbe45b1 100644 --- a/drivers/net/bonding/bond_procfs.c +++ b/drivers/net/bonding/bond_procfs.c @@ -104,6 +104,8 @@ static void bond_info_show_master(struct seq_file *seq) bond->params.updelay * bond->params.miimon); seq_printf(seq, "Down Delay (ms): %d\n", bond->params.downdelay * bond->params.miimon); + seq_printf(seq, "Peer Notification Delay (ms): %d\n", + bond->params.peer_notif_delay * bond->params.miimon);
/* ARP information */ diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 35847250da5a..0ecafe115f32 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -343,6 +343,18 @@ static ssize_t bonding_show_updelay(struct device *d, static DEVICE_ATTR(updelay, 0644, bonding_show_updelay, bonding_sysfs_store_option);
+static ssize_t bonding_show_peer_notif_delay(struct device *d, + struct device_attribute *attr, + char *buf) +{ + struct bonding *bond = to_bond(d); + + return sprintf(buf, "%d\n", + bond->params.peer_notif_delay * bond->params.miimon); +} +static DEVICE_ATTR(peer_notif_delay, 0644, + bonding_show_peer_notif_delay, bonding_sysfs_store_option); + /* Show the LACP interval. */ static ssize_t bonding_show_lacp(struct device *d, struct device_attribute *attr, @@ -734,6 +746,7 @@ static struct attribute *per_bond_attrs[] = { &dev_attr_arp_ip_target.attr, &dev_attr_downdelay.attr, &dev_attr_updelay.attr, + &dev_attr_peer_notif_delay.attr, &dev_attr_lacp_rate.attr, &dev_attr_ad_select.attr, &dev_attr_xmit_hash_policy.attr, diff --git a/include/net/bond_options.h b/include/net/bond_options.h index d79d28f5318c..6889a04d9520 100644 --- a/include/net/bond_options.h +++ b/include/net/bond_options.h @@ -67,6 +67,7 @@ enum { BOND_OPT_AD_ACTOR_SYSTEM, BOND_OPT_AD_USER_PORT_KEY, BOND_OPT_NUM_PEER_NOTIF_ALIAS, + BOND_OPT_PEER_NOTIF_DELAY, BOND_OPT_LAST };
diff --git a/include/net/bonding.h b/include/net/bonding.h index 58a8c2b66aa5..a548bada8e8e 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -114,6 +114,7 @@ struct bond_params { int fail_over_mac; int updelay; int downdelay; + int peer_notif_delay; int lacp_fast; unsigned int min_links; int ad_select; diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 9d8eac55da5a..f9a0359ceb6a 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -615,6 +615,7 @@ enum { IFLA_BOND_AD_USER_PORT_KEY, IFLA_BOND_AD_ACTOR_SYSTEM, IFLA_BOND_TLB_DYNAMIC_LB, + IFLA_BOND_PEER_NOTIF_DELAY, __IFLA_BOND_MAX, };
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h index 43391e2d1153..f57af58f1919 100644 --- a/tools/include/uapi/linux/if_link.h +++ b/tools/include/uapi/linux/if_link.h @@ -614,6 +614,7 @@ enum { IFLA_BOND_AD_USER_PORT_KEY, IFLA_BOND_AD_ACTOR_SYSTEM, IFLA_BOND_TLB_DYNAMIC_LB, + IFLA_BOND_PEER_NOTIF_DELAY, __IFLA_BOND_MAX, };
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.5-rc5 commit ee4f56f46ab7a795dc8f7ecdfa11b616a5754b50 category: bugfix bugzilla: 46933 CVE: NA
-------------------
IFLA_BOND_PEER_NOTIF_DELAY was set to the value of downdelay instead of peer_notif_delay. After this change, the correct value is exported.
Fixes: 07a4ddec3ce9 ("bonding: add an option to specify a delay between peer notifications") Signed-off-by: Vincent Bernat vincent@bernat.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/bonding/bond_netlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c index d446b9423bae..d89aba932b8e 100644 --- a/drivers/net/bonding/bond_netlink.c +++ b/drivers/net/bonding/bond_netlink.c @@ -550,7 +550,7 @@ static int bond_fill_info(struct sk_buff *skb, goto nla_put_failure;
if (nla_put_u32(skb, IFLA_BOND_PEER_NOTIF_DELAY, - bond->params.downdelay * bond->params.miimon)) + bond->params.peer_notif_delay * bond->params.miimon)) goto nla_put_failure;
if (nla_put_u8(skb, IFLA_BOND_USE_CARRIER, bond->params.use_carrier))
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.5-rc5 commit 0307d589c4d6f84349afb7a53717baa2392f6a38 category: bugfix bugzilla: 46933 CVE: NA
-------------------
Ability to tweak the interval between peer notifications has been added in 07a4ddec3ce9 ("bonding: add an option to specify a delay between peer notifications") but the documentation was not updated.
Signed-off-by: Vincent Bernat vincent@bernat.ch Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- Documentation/networking/bonding.txt | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index d3e5dd26db12..e3abfbd32f71 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -706,9 +706,9 @@ num_unsol_na unsolicited IPv6 Neighbor Advertisements) to be issued after a failover event. As soon as the link is up on the new slave (possibly immediately) a peer notification is sent on the - bonding device and each VLAN sub-device. This is repeated at - each link monitor interval (arp_interval or miimon, whichever - is active) if the number is greater than 1. + bonding device and each VLAN sub-device. This is repeated at + the rate specified by peer_notif_delay if the number is + greater than 1.
The valid range is 0 - 255; the default value is 1. These options affect only the active-backup mode. These options were added for @@ -727,6 +727,16 @@ packets_per_slave The valid range is 0 - 65535; the default value is 1. This option has effect only in balance-rr mode.
+peer_notif_delay + + Specify the delay, in milliseconds, between each peer + notification (gratuitous ARP and unsolicited IPv6 Neighbor + Advertisement) when they are issued after a failover event. + This delay should be a multiple of the link monitor interval + (arp_interval or miimon, whichever is active). The default + value is 0 which means to match the value of the link monitor + interval. + primary
A string (eth0, eth2, etc) specifying which slave is the
From: Yang Shi yang.shi@linux.alibaba.com
mainline inclusion from mainline-v5.4-rc1 commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
Patch series "Make deferred split shrinker memcg aware", v6.
Currently THP deferred split shrinker is not memcg aware, this may cause premature OOM with some configuration. For example the below test would run into premature OOM easily:
$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000
transhuge-stress comes from kernel selftest.
It is easy to hit OOM, but there are still a lot THP on the deferred split queue, memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware.
Convert deferred split shrinker memcg aware by introducing per memcg deferred split queue. The THP should be on either per node or per memcg deferred split queue if it belongs to a memcg. When the page is immigrated to the other memcg, it will be immigrated to the target memcg's deferred split queue too.
Reuse the second tail page's deferred_list for per memcg list since the same THP can't be on multiple deferred split queues.
Make deferred split shrinker not depend on memcg kmem since it is not slab. It doesn't make sense to not shrink THP even though memcg kmem is disabled.
With the above change the test demonstrated above doesn't trigger OOM even though with cgroup.memory=nokmem.
This patch (of 4):
Put split_queue, split_queue_lock and split_queue_len into a struct in order to reduce code duplication when we convert deferred_split to memcg aware in the later patches.
Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.al... Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Suggested-by: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Kirill Tkhai ktkhai@virtuozzo.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Hugh Dickins hughd@google.com Cc: Shakeel Butt shakeelb@google.com Cc: David Rientjes rientjes@google.com Cc: Qian Cai cai@lca.pw Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/mmzone.h | 12 ++++++++--- mm/huge_memory.c | 45 +++++++++++++++++++++++------------------- mm/page_alloc.c | 8 +++++--- 3 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 01349f51fe88..4fa5a20f3da9 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -616,6 +616,14 @@ struct zonelist { extern struct page *mem_map; #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE +struct deferred_split { + spinlock_t split_queue_lock; + struct list_head split_queue; + unsigned long split_queue_len; +}; +#endif + /* * On NUMA machines, each NUMA node would have a pg_data_t to describe * it's memory layout. On UMA machines there is a single pglist_data which @@ -702,9 +710,7 @@ typedef struct pglist_data { #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE - spinlock_t split_queue_lock; - struct list_head split_queue; - unsigned long split_queue_len; + struct deferred_split deferred_split_queue; #endif
/* Fields commonly accessed by the page reclaim scanner */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 816a7fd3c6ff..be44f09ae93f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2690,6 +2690,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) { struct page *head = compound_head(page); struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); + struct deferred_split *ds_queue = &pgdata->deferred_split_queue; struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; @@ -2779,17 +2780,17 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) }
/* Prevent deferred_split_scan() touching ->_refcount */ - spin_lock(&pgdata->split_queue_lock); + spin_lock(&ds_queue->split_queue_lock); count = page_count(head); mapcount = total_mapcount(head); if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) { if (!list_empty(page_deferred_list(head))) { - pgdata->split_queue_len--; + ds_queue->split_queue_len--; list_del(page_deferred_list(head)); } if (mapping) __dec_node_page_state(page, NR_SHMEM_THPS); - spin_unlock(&pgdata->split_queue_lock); + spin_unlock(&ds_queue->split_queue_lock); __split_huge_page(page, list, end, flags); ret = 0; } else { @@ -2801,7 +2802,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) dump_page(page, "total_mapcount(head) > 0"); BUG(); } - spin_unlock(&pgdata->split_queue_lock); + spin_unlock(&ds_queue->split_queue_lock); fail: if (mapping) xa_unlock(&mapping->i_pages); spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags); @@ -2824,52 +2825,56 @@ fail: if (mapping) void free_transhuge_page(struct page *page) { struct pglist_data *pgdata = NODE_DATA(page_to_nid(page)); + struct deferred_split *ds_queue = &pgdata->deferred_split_queue; unsigned long flags;
- spin_lock_irqsave(&pgdata->split_queue_lock, flags); + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(page_deferred_list(page))) { - pgdata->split_queue_len--; + ds_queue->split_queue_len--; list_del(page_deferred_list(page)); } - spin_unlock_irqrestore(&pgdata->split_queue_lock, flags); + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); free_compound_page(page); }
void deferred_split_huge_page(struct page *page) { struct pglist_data *pgdata = NODE_DATA(page_to_nid(page)); + struct deferred_split *ds_queue = &pgdata->deferred_split_queue; unsigned long flags;
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- spin_lock_irqsave(&pgdata->split_queue_lock, flags); + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (list_empty(page_deferred_list(page))) { count_vm_event(THP_DEFERRED_SPLIT_PAGE); - list_add_tail(page_deferred_list(page), &pgdata->split_queue); - pgdata->split_queue_len++; + list_add_tail(page_deferred_list(page), &ds_queue->split_queue); + ds_queue->split_queue_len++; } - spin_unlock_irqrestore(&pgdata->split_queue_lock, flags); + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); }
static unsigned long deferred_split_count(struct shrinker *shrink, struct shrink_control *sc) { struct pglist_data *pgdata = NODE_DATA(sc->nid); - return READ_ONCE(pgdata->split_queue_len); + struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + return READ_ONCE(ds_queue->split_queue_len); }
static unsigned long deferred_split_scan(struct shrinker *shrink, struct shrink_control *sc) { struct pglist_data *pgdata = NODE_DATA(sc->nid); + struct deferred_split *ds_queue = &pgdata->deferred_split_queue; unsigned long flags; LIST_HEAD(list), *pos, *next; struct page *page; int split = 0;
- spin_lock_irqsave(&pgdata->split_queue_lock, flags); + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); /* Take pin on all head pages to avoid freeing them under us */ - list_for_each_safe(pos, next, &pgdata->split_queue) { + list_for_each_safe(pos, next, &ds_queue->split_queue) { page = list_entry((void *)pos, struct page, mapping); page = compound_head(page); if (get_page_unless_zero(page)) { @@ -2877,12 +2882,12 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, } else { /* We lost race with put_compound_page() */ list_del_init(page_deferred_list(page)); - pgdata->split_queue_len--; + ds_queue->split_queue_len--; } if (!--sc->nr_to_scan) break; } - spin_unlock_irqrestore(&pgdata->split_queue_lock, flags); + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
list_for_each_safe(pos, next, &list) { page = list_entry((void *)pos, struct page, mapping); @@ -2896,15 +2901,15 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, put_page(page); }
- spin_lock_irqsave(&pgdata->split_queue_lock, flags); - list_splice_tail(&list, &pgdata->split_queue); - spin_unlock_irqrestore(&pgdata->split_queue_lock, flags); + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + list_splice_tail(&list, &ds_queue->split_queue); + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
/* * Stop shrinker if we didn't split any page, but the queue is empty. * This can happen if pages were freed under us. */ - if (!split && list_empty(&pgdata->split_queue)) + if (!split && list_empty(&ds_queue->split_queue)) return SHRINK_STOP; return split; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd77bf4979a9..f5746b4c5a3f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6352,9 +6352,11 @@ static unsigned long __init calc_memmap_size(unsigned long spanned_pages, #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void pgdat_init_split_queue(struct pglist_data *pgdat) { - spin_lock_init(&pgdat->split_queue_lock); - INIT_LIST_HEAD(&pgdat->split_queue); - pgdat->split_queue_len = 0; + struct deferred_split *ds_queue = &pgdat->deferred_split_queue; + + spin_lock_init(&ds_queue->split_queue_lock); + INIT_LIST_HEAD(&ds_queue->split_queue); + ds_queue->split_queue_len = 0; } #else static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
From: Yang Shi yang.shi@linux.alibaba.com
mainline inclusion from mainline-v5.4-rc1 commit 7ae88534cdd96235cd775c03b32a75009355740b category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
A later patch makes the THP deferred split shrinker memcg aware, but it needs the page->mem_cgroup information in the THP destructor, which is currently called after mem_cgroup_uncharge().
So move mem_cgroup_uncharge() from __page_cache_release() to the compound page destructor, which is called for both THPs and other compound pages except HugeTLB, and call it in __put_single_page() for single-order pages.
Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.al... Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Suggested-by: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Kirill Tkhai ktkhai@virtuozzo.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Hugh Dickins hughd@google.com Cc: Shakeel Butt shakeelb@google.com Cc: David Rientjes rientjes@google.com Cc: Qian Cai cai@lca.pw Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/page_alloc.c | 1 + mm/swap.c | 2 +- mm/vmscan.c | 6 ++---- 3 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f5746b4c5a3f..d85b5a0bfda6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -593,6 +593,7 @@ static void bad_page(struct page *page, const char *reason,
void free_compound_page(struct page *page) { + mem_cgroup_uncharge(page); __free_pages_ok(page, compound_order(page)); }
diff --git a/mm/swap.c b/mm/swap.c index bdb9b294afbf..002c98a81555 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -71,12 +71,12 @@ static void __page_cache_release(struct page *page) spin_unlock_irqrestore(zone_lru_lock(zone), flags); } __ClearPageWaiters(page); - mem_cgroup_uncharge(page); }
static void __put_single_page(struct page *page) { __page_cache_release(page); + mem_cgroup_uncharge(page); free_unref_page(page); }
diff --git a/mm/vmscan.c b/mm/vmscan.c index f49c11a6cff3..497df95dd7b8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1475,10 +1475,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, * Is there need to periodically free_page_list? It would * appear not as the counts should be low */ - if (unlikely(PageTransHuge(page))) { - mem_cgroup_uncharge(page); + if (unlikely(PageTransHuge(page))) (*get_compound_page_dtor(page))(page); - } else + else list_add(&page->lru, &free_pages); continue;
@@ -1876,7 +1875,6 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
if (unlikely(PageCompound(page))) { spin_unlock_irq(&pgdat->lru_lock); - mem_cgroup_uncharge(page); (*get_compound_page_dtor(page))(page); spin_lock_irq(&pgdat->lru_lock); } else
From: Yang Shi yang.shi@linux.alibaba.com
mainline inclusion from mainline-v5.4-rc1 commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
Currently the shrinker is only allocated, and can only work, when memcg kmem is enabled. But the THP deferred split shrinker is not a slab shrinker, so it doesn't make much sense to have such a shrinker depend on memcg kmem. It should be able to reclaim THPs even when memcg kmem is disabled.
Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers. When memcg kmem is disabled, only such shrinkers are called when shrinking memcg slabs.
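As an illustration of how the flag is meant to be used, here is a minimal sketch of a memcg-aware, non-slab shrinker declaring it; this mirrors the deferred split shrinker as changed later in this series rather than introducing anything new:

  /*
   * Sketch: a memcg-aware shrinker that is not backed by slab memory.
   * With SHRINKER_NONSLAB set, shrink_slab_memcg() still invokes it even
   * when memcg kmem is disabled; without the flag it would be skipped.
   */
  static struct shrinker deferred_split_shrinker = {
          .count_objects  = deferred_split_count,
          .scan_objects   = deferred_split_scan,
          .seeks          = DEFAULT_SEEKS,
          .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
                            SHRINKER_NONSLAB,
  };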
[yang.shi@linux.alibaba.com: add comment] Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.al... Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.al... Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Kirill Tkhai ktkhai@virtuozzo.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Hugh Dickins hughd@google.com Cc: Shakeel Butt shakeelb@google.com Cc: David Rientjes rientjes@google.com Cc: Qian Cai cai@lca.pw Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/memcontrol.h | 19 +++++++++++-------- include/linux/shrinker.h | 7 ++++++- mm/memcontrol.c | 9 +-------- mm/vmscan.c | 36 +++++++++++++++++++----------------- 4 files changed, 37 insertions(+), 34 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0069cd618b1e..e454c5d45401 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -133,9 +133,8 @@ struct mem_cgroup_per_node {
struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
-#ifdef CONFIG_MEMCG_KMEM struct memcg_shrinker_map __rcu *shrinker_map; -#endif + struct rb_node tree_node; /* RB tree node */ unsigned long usage_in_excess;/* Set to the value by which */ /* the soft limit is exceeded*/ @@ -1263,6 +1262,11 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) } while ((memcg = parent_mem_cgroup(memcg))); return false; } + +extern int memcg_expand_shrinker_maps(int new_id); + +extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg, + int nid, int shrinker_id); #else #define mem_cgroup_sockets_enabled 0 static inline void mem_cgroup_sk_alloc(struct sock *sk) { }; @@ -1271,6 +1275,11 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) { return false; } + +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, + int nid, int shrinker_id) +{ +} #endif
struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); @@ -1332,10 +1341,6 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg) return memcg ? memcg->kmemcg_id : -1; }
-extern int memcg_expand_shrinker_maps(int new_id); - -extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg, - int nid, int shrinker_id); #else
static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) @@ -1377,8 +1382,6 @@ static inline void memcg_put_cache_ids(void) { }
-static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, - int nid, int shrinker_id) { } #endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 9443cafd1969..0f80123650e2 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -69,7 +69,7 @@ struct shrinker {
/* These are for internal use */ struct list_head list; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG /* ID in shrinker_idr */ int id; #endif @@ -81,6 +81,11 @@ struct shrinker { /* Flags */ #define SHRINKER_NUMA_AWARE (1 << 0) #define SHRINKER_MEMCG_AWARE (1 << 1) +/* + * It just makes sense when the shrinker is also MEMCG_AWARE for now, + * non-MEMCG_AWARE shrinker should not have this flag set. + */ +#define SHRINKER_NONSLAB (1 << 2)
extern int prealloc_shrinker(struct shrinker *shrinker); extern void register_shrinker_prepared(struct shrinker *shrinker); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2553017683e9..97cf7bd3f782 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -320,6 +320,7 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key); EXPORT_SYMBOL(memcg_kmem_enabled_key);
struct workqueue_struct *memcg_kmem_cache_wq; +#endif
static int memcg_shrinker_map_size; static DEFINE_MUTEX(memcg_shrinker_map_mutex); @@ -445,14 +446,6 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) } }
-#else /* CONFIG_MEMCG_KMEM */ -static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg) -{ - return 0; -} -static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { } -#endif /* CONFIG_MEMCG_KMEM */ - /** * mem_cgroup_css_from_page - css of the memcg associated with a page * @page: page of interest diff --git a/mm/vmscan.c b/mm/vmscan.c index 497df95dd7b8..099bbd411121 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -186,8 +186,7 @@ static DEFINE_PER_CPU(struct delayed_work, vmscan_work); static LIST_HEAD(shrinker_list); static DECLARE_RWSEM(shrinker_rwsem);
-#ifdef CONFIG_MEMCG_KMEM - +#ifdef CONFIG_MEMCG /* * We allow subsystems to populate their shrinker-related * LRU lists before register_shrinker_prepared() is called @@ -239,18 +238,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker) idr_remove(&shrinker_idr, id); up_write(&shrinker_rwsem); } -#else /* CONFIG_MEMCG_KMEM */ -static int prealloc_memcg_shrinker(struct shrinker *shrinker) -{ - return 0; -}
-static void unregister_memcg_shrinker(struct shrinker *shrinker) -{ -} -#endif /* CONFIG_MEMCG_KMEM */ - -#ifdef CONFIG_MEMCG static bool global_reclaim(struct scan_control *sc) { return !sc->target_mem_cgroup; @@ -305,6 +293,15 @@ static bool memcg_congested(pg_data_t *pgdat,
} #else +static int prealloc_memcg_shrinker(struct shrinker *shrinker) +{ + return 0; +} + +static void unregister_memcg_shrinker(struct shrinker *shrinker) +{ +} + static bool global_reclaim(struct scan_control *sc) { return true; @@ -582,7 +579,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, return freed; }
-#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority) { @@ -590,7 +587,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, unsigned long ret, freed = 0; int i;
- if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)) + if (!mem_cgroup_online(memcg)) return 0;
if (!down_read_trylock(&shrinker_rwsem)) @@ -616,6 +613,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, continue; }
+ /* Call non-slab shrinkers even though kmem is disabled */ + if (!memcg_kmem_enabled() && + !(shrinker->flags & SHRINKER_NONSLAB)) + continue; + ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) { clear_bit(i, map->map); @@ -652,13 +654,13 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, up_read(&shrinker_rwsem); return freed; } -#else /* CONFIG_MEMCG_KMEM */ +#else /* CONFIG_MEMCG */ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority) { return 0; } -#endif /* CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_MEMCG */
/** * shrink_slab - shrink slab caches
From: Yang Shi yang.shi@linux.alibaba.com
mainline inclusion from mainline-v5.4-rc1 commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
Currently the THP deferred split shrinker is not memcg aware; this may cause premature OOM with some configurations. For example, the test below runs into premature OOM easily:
$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000
transhuge-stress comes from kernel selftest.
It is easy to hit OOM, but there are still a lot of THPs on the deferred split queue; memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware.
Make the deferred split shrinker memcg aware by introducing a per-memcg deferred split queue. A THP is put on either the per-node or the per-memcg deferred split queue, depending on whether it belongs to a memcg. When a page is migrated to another memcg, it is moved to the target memcg's deferred split queue as well.
Reuse the second tail page's deferred_list for the per-memcg list, since the same THP can't be on multiple deferred split queues at the same time.
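The queue selection this introduces boils down to the helper sketched below; it is a simplified form of the mm/huge_memory.c hunk further down (the CONFIG_MEMCG=n variant always returns the per-node queue):

  static inline struct deferred_split *get_deferred_split_queue(struct page *page)
  {
          struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
          struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

          /*
           * A THP charged to a memcg lives on that memcg's queue;
           * otherwise it falls back to the per-node queue.
           */
          if (memcg)
                  return &memcg->deferred_split_queue;
          return &pgdat->deferred_split_queue;
  }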
[yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai] Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.al... Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.al... Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Kirill Tkhai ktkhai@virtuozzo.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Hugh Dickins hughd@google.com Cc: Shakeel Butt shakeelb@google.com Cc: David Rientjes rientjes@google.com Cc: Qian Cai cai@lca.pw Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/huge_mm.h | 9 ++++++ include/linux/memcontrol.h | 4 +++ include/linux/mm_types.h | 1 + mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++------ mm/memcontrol.c | 24 +++++++++++++++ 5 files changed, 91 insertions(+), 10 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e375f2249f52..7a447e8e8481 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -250,6 +250,15 @@ static inline bool thp_migration_supported(void) return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION); }
+static inline struct list_head *page_deferred_list(struct page *page) +{ + /* + * Global or memcg deferred list in the second tail pages is + * occupied by compound_head. + */ + return &page[2].deferred_list; +} + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e454c5d45401..9afd655ff646 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -310,6 +310,10 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct deferred_split deferred_split_queue; +#endif + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 178e9dee217a..9903776a967e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -134,6 +134,7 @@ struct page { struct { /* Second tail page of compound page */ unsigned long _compound_pad_1; /* compound_head */ unsigned long _compound_pad_2; + /* For both global and memcg */ struct list_head deferred_list; }; struct { /* Page table pages */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index be44f09ae93f..0b6874cf3a3d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -492,11 +492,25 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) return pmd; }
-static inline struct list_head *page_deferred_list(struct page *page) +#ifdef CONFIG_MEMCG +static inline struct deferred_split *get_deferred_split_queue(struct page *page) { - /* ->lru in the tail pages is occupied by compound_head. */ - return &page[2].deferred_list; + struct mem_cgroup *memcg = compound_head(page)->mem_cgroup; + struct pglist_data *pgdat = NODE_DATA(page_to_nid(page)); + + if (memcg) + return &memcg->deferred_split_queue; + else + return &pgdat->deferred_split_queue; } +#else +static inline struct deferred_split *get_deferred_split_queue(struct page *page) +{ + struct pglist_data *pgdat = NODE_DATA(page_to_nid(page)); + + return &pgdat->deferred_split_queue; +} +#endif
void prep_transhuge_page(struct page *page) { @@ -2689,8 +2703,7 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) int split_huge_page_to_list(struct page *page, struct list_head *list) { struct page *head = compound_head(page); - struct pglist_data *pgdata = NODE_DATA(page_to_nid(head)); - struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + struct deferred_split *ds_queue = get_deferred_split_queue(page); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; @@ -2824,8 +2837,7 @@ fail: if (mapping)
void free_transhuge_page(struct page *page) { - struct pglist_data *pgdata = NODE_DATA(page_to_nid(page)); - struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + struct deferred_split *ds_queue = get_deferred_split_queue(page); unsigned long flags;
spin_lock_irqsave(&ds_queue->split_queue_lock, flags); @@ -2839,17 +2851,37 @@ void free_transhuge_page(struct page *page)
void deferred_split_huge_page(struct page *page) { - struct pglist_data *pgdata = NODE_DATA(page_to_nid(page)); - struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + struct deferred_split *ds_queue = get_deferred_split_queue(page); +#ifdef CONFIG_MEMCG + struct mem_cgroup *memcg = compound_head(page)->mem_cgroup; +#endif unsigned long flags;
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+ /* + * The try_to_unmap() in page reclaim path might reach here too, + * this may cause a race condition to corrupt deferred split queue. + * And, if page reclaim is already handling the same page, it is + * unnecessary to handle it again in shrinker. + * + * Check PageSwapCache to determine if the page is being + * handled by page reclaim since THP swap would add the page into + * swap cache before calling try_to_unmap(). + */ + if (PageSwapCache(page)) + return; + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (list_empty(page_deferred_list(page))) { count_vm_event(THP_DEFERRED_SPLIT_PAGE); list_add_tail(page_deferred_list(page), &ds_queue->split_queue); ds_queue->split_queue_len++; +#ifdef CONFIG_MEMCG + if (memcg) + memcg_set_shrinker_bit(memcg, page_to_nid(page), + deferred_split_shrinker.id); +#endif } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } @@ -2859,6 +2891,11 @@ static unsigned long deferred_split_count(struct shrinker *shrink, { struct pglist_data *pgdata = NODE_DATA(sc->nid); struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + +#ifdef CONFIG_MEMCG + if (sc->memcg) + ds_queue = &sc->memcg->deferred_split_queue; +#endif return READ_ONCE(ds_queue->split_queue_len); }
@@ -2872,6 +2909,11 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, struct page *page; int split = 0;
+#ifdef CONFIG_MEMCG + if (sc->memcg) + ds_queue = &sc->memcg->deferred_split_queue; +#endif + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); /* Take pin on all head pages to avoid freeing them under us */ list_for_each_safe(pos, next, &ds_queue->split_queue) { @@ -2918,7 +2960,8 @@ static struct shrinker deferred_split_shrinker = { .count_objects = deferred_split_count, .scan_objects = deferred_split_scan, .seeks = DEFAULT_SEEKS, - .flags = SHRINKER_NUMA_AWARE, + .flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE | + SHRINKER_NONSLAB, };
#ifdef CONFIG_DEBUG_FS diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 97cf7bd3f782..eabe38a0d56f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4565,6 +4565,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void) #endif #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); +#endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); + INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue); + memcg->deferred_split_queue.split_queue_len = 0; #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; @@ -4943,6 +4948,14 @@ static int mem_cgroup_move_account(struct page *page, __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (compound && !list_empty(page_deferred_list(page))) { + spin_lock(&from->deferred_split_queue.split_queue_lock); + list_del_init(page_deferred_list(page)); + from->deferred_split_queue.split_queue_len--; + spin_unlock(&from->deferred_split_queue.split_queue_lock); + } +#endif /* * It is safe to change page->mem_cgroup here because the page * is referenced, charged, and isolated - we can't race with @@ -4951,6 +4964,17 @@ static int mem_cgroup_move_account(struct page *page,
/* caller should have done css_get */ page->mem_cgroup = to; + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (compound && list_empty(page_deferred_list(page))) { + spin_lock(&to->deferred_split_queue.split_queue_lock); + list_add_tail(page_deferred_list(page), + &to->deferred_split_queue.split_queue); + to->deferred_split_queue.split_queue_len++; + spin_unlock(&to->deferred_split_queue.split_queue_lock); + } +#endif + spin_unlock_irqrestore(&from->move_lock, flags);
ret = 0;
From: Yang Shi yang.shi@linux.alibaba.com
mainline inclusion from mainline-v5.5-rc3 commit 42a9a53bb394a1de2247ef78f0b802ae86798122 category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
Since commit 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem"), shrinkers' idr is protected by CONFIG_MEMCG instead of CONFIG_MEMCG_KMEM, so it makes no sense to protect shrinker idr replace with CONFIG_MEMCG_KMEM.
In the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only one shrinker, the deferred_split_shrinker. But it is never actually called, since idr_replace() is never compiled in due to the wrong #ifdef. The deferred_split_shrinker stays in a half-registered state the whole time, and it is never called for subordinate mem cgroups.
Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.al... Fixes: 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem") Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Reviewed-by: Kirill Tkhai ktkhai@virtuozzo.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Shakeel Butt shakeelb@google.com Cc: Roman Gushchin guro@fb.com Cc: stable@vger.kernel.org [5.4+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c index 099bbd411121..48976d98d5e8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -421,7 +421,7 @@ void register_shrinker_prepared(struct shrinker *shrinker) { down_write(&shrinker_rwsem); list_add_tail(&shrinker->list, &shrinker_list); -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG if (shrinker->flags & SHRINKER_MEMCG_AWARE) idr_replace(&shrinker_idr, shrinker, shrinker->id); #endif
From: Wei Yang richardw.yang@linux.intel.com
mainline inclusion from mainline-v5.6-rc1 commit fac0516b5534897bf4c4a88daa06a8cfa5611b23 category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
If compound is true, the page is a PMD-mapped THP, which implies it is not linked to any deferred split list, so the first code chunk will not be executed.

For the same reason, it would not be proper to add such a page to a deferred split list, so the second code chunk is not correct either.

Based on this, we should remove the deferred split list related code.
[yang.shi@linux.alibaba.com: better patch title] Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Signed-off-by: Wei Yang richardw.yang@linux.intel.com Suggested-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Yang Shi yang.shi@linux.alibaba.com Cc: David Rientjes rientjes@google.com Cc: Michal Hocko mhocko@suse.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: stable@vger.kernel.org [5.4+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/memcontrol.c | 18 ------------------ 1 file changed, 18 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index eabe38a0d56f..dca37f310b7b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4948,14 +4948,6 @@ static int mem_cgroup_move_account(struct page *page, __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); }
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (compound && !list_empty(page_deferred_list(page))) { - spin_lock(&from->deferred_split_queue.split_queue_lock); - list_del_init(page_deferred_list(page)); - from->deferred_split_queue.split_queue_len--; - spin_unlock(&from->deferred_split_queue.split_queue_lock); - } -#endif /* * It is safe to change page->mem_cgroup here because the page * is referenced, charged, and isolated - we can't race with @@ -4965,16 +4957,6 @@ static int mem_cgroup_move_account(struct page *page, /* caller should have done css_get */ page->mem_cgroup = to;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (compound && list_empty(page_deferred_list(page))) { - spin_lock(&to->deferred_split_queue.split_queue_lock); - list_add_tail(page_deferred_list(page), - &to->deferred_split_queue.split_queue); - to->deferred_split_queue.split_queue_len++; - spin_unlock(&to->deferred_split_queue.split_queue_lock); - } -#endif - spin_unlock_irqrestore(&from->move_lock, flags);
ret = 0;
From: Liu Shixin liushixin2@huawei.com
hulk inclusion category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
Changing struct mem_cgroup directly would break the kABI, so add a new struct mem_cgroup_extension to hold the new variables.
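A minimal sketch of the pattern, assuming the usual kABI-extension approach: the exported struct is embedded in a private wrapper, and internal code recovers the wrapper with container_of(). The accessor name below is illustrative and not part of the patch:

  struct mem_cgroup_extension {
          /* new, kABI-invisible fields go here in later patches */
          struct mem_cgroup memcg;        /* exported struct, layout unchanged */
  };

  /* Hypothetical accessor: recover the wrapper from a mem_cgroup pointer. */
  static inline struct mem_cgroup_extension *to_memcg_ext(struct mem_cgroup *memcg)
  {
          return container_of(memcg, struct mem_cgroup_extension, memcg);
  }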
Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/memcontrol.h | 4 ++++ mm/memcontrol.c | 11 ++++++++--- 2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9afd655ff646..7da3f6c660f5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -318,6 +318,10 @@ struct mem_cgroup { /* WARNING: nodeinfo must be the last member here */ };
+struct mem_cgroup_extension { + struct mem_cgroup memcg; +}; + /* * size of first charge trial. "32" comes from vmscan.c's magic value. * TODO: maybe necessary to use big numbers in big irons. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dca37f310b7b..ea67a63a4e71 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4508,11 +4508,14 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) static void __mem_cgroup_free(struct mem_cgroup *memcg) { int node; + struct mem_cgroup_extension *memcg_ext;
for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->stat_cpu); - kfree(memcg); + + memcg_ext = container_of(memcg, struct mem_cgroup_extension, memcg); + kfree(memcg_ext); }
static void mem_cgroup_free(struct mem_cgroup *memcg) @@ -4524,13 +4527,15 @@ static void mem_cgroup_free(struct mem_cgroup *memcg) static struct mem_cgroup *mem_cgroup_alloc(void) { struct mem_cgroup *memcg; + struct mem_cgroup_extension *memcg_ext; size_t size; int node;
- size = sizeof(struct mem_cgroup); + size = sizeof(struct mem_cgroup_extension); size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
- memcg = kzalloc(size, GFP_KERNEL); + memcg_ext = kzalloc(size, GFP_KERNEL); + memcg = &memcg_ext->memcg; if (!memcg) return NULL;
From: Liu Shixin liushixin2@huawei.com
hulk inclusion category: bugfix bugzilla: 47240 CVE: NA
-------------------------------------------------
Fix the kABI breakage for struct pglist_data and struct mem_cgroup.
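Concretely, struct deferred_split changes from holding the data to holding pointers to it, and a helper fills those pointers from either the pglist_data or the mem_cgroup_extension at the call site, so neither exported struct changes layout. A simplified sketch of the usage pattern (the real hunks follow below):

  struct deferred_split {
          spinlock_t       *split_queue_lock;
          struct list_head *split_queue;
          unsigned long    *split_queue_len;
  };

  /*
   * Callers build the view on the stack and work through it,
   * e.g. in free_transhuge_page():
   */
  struct deferred_split ds_queue;

  get_deferred_split_queue(page, &ds_queue);
  spin_lock_irqsave(ds_queue.split_queue_lock, flags);
  if (!list_empty(page_deferred_list(page))) {
          (*ds_queue.split_queue_len)--;
          list_del(page_deferred_list(page));
  }
  spin_unlock_irqrestore(ds_queue.split_queue_lock, flags);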
Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/memcontrol.h | 7 +-- include/linux/mmzone.h | 10 +-- mm/huge_memory.c | 121 +++++++++++++++++++++++++------------ mm/memcontrol.c | 6 +- mm/page_alloc.c | 8 +-- 5 files changed, 96 insertions(+), 56 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7da3f6c660f5..f354e76221db 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -310,15 +310,14 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE - struct deferred_split deferred_split_queue; -#endif - struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ };
struct mem_cgroup_extension { + spinlock_t split_queue_lock; + struct list_head split_queue; + unsigned long split_queue_len; struct mem_cgroup memcg; };
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4fa5a20f3da9..08ea0f24077e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -618,9 +618,9 @@ extern struct page *mem_map;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE struct deferred_split { - spinlock_t split_queue_lock; - struct list_head split_queue; - unsigned long split_queue_len; + spinlock_t *split_queue_lock; + struct list_head *split_queue; + unsigned long *split_queue_len; }; #endif
@@ -710,7 +710,9 @@ typedef struct pglist_data { #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE - struct deferred_split deferred_split_queue; + spinlock_t split_queue_lock; + struct list_head split_queue; + unsigned long split_queue_len; #endif
/* Fields commonly accessed by the page reclaim scanner */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6874cf3a3d..28e2ed5090e9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -493,22 +493,62 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) }
#ifdef CONFIG_MEMCG -static inline struct deferred_split *get_deferred_split_queue(struct page *page) +static inline void get_deferred_split_queue(struct page *page, + struct deferred_split *ds_queue) { struct mem_cgroup *memcg = compound_head(page)->mem_cgroup; struct pglist_data *pgdat = NODE_DATA(page_to_nid(page)); + struct mem_cgroup_extension *memcg_ext;
- if (memcg) - return &memcg->deferred_split_queue; - else - return &pgdat->deferred_split_queue; + if (memcg) { + memcg_ext = container_of(memcg, struct mem_cgroup_extension, memcg); + ds_queue->split_queue_lock = &memcg_ext->split_queue_lock; + ds_queue->split_queue = &memcg_ext->split_queue; + ds_queue->split_queue_len = &memcg_ext->split_queue_len; + } else { + ds_queue->split_queue_lock = &pgdat->split_queue_lock; + ds_queue->split_queue = &pgdat->split_queue; + ds_queue->split_queue_len = &pgdat->split_queue_len; + } +} + +static inline void get_deferred_split_queue_from_sc(struct shrink_control *sc, + struct deferred_split *ds_queue) +{ + struct mem_cgroup *memcg = sc->memcg; + struct pglist_data *pgdat = NODE_DATA(sc->nid); + struct mem_cgroup_extension *memcg_ext; + + if (memcg) { + memcg_ext = container_of(memcg, struct mem_cgroup_extension, memcg); + ds_queue->split_queue_lock = &memcg_ext->split_queue_lock; + ds_queue->split_queue = &memcg_ext->split_queue; + ds_queue->split_queue_len = &memcg_ext->split_queue_len; + } else { + ds_queue->split_queue_lock = &pgdat->split_queue_lock; + ds_queue->split_queue = &pgdat->split_queue; + ds_queue->split_queue_len = &pgdat->split_queue_len; + } } #else -static inline struct deferred_split *get_deferred_split_queue(struct page *page) +static inline void get_deferred_split_queue(struct page *page, + struct deferred_split *ds_queue) { struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
- return &pgdat->deferred_split_queue; + ds_queue->split_queue_lock = &pgdat->split_queue_lock; + ds_queue->split_queue = &pgdat->split_queue; + ds_queue->split_queue_len = &pgdat->split_queue_len; +} + +static inline void get_deferred_split_queue_from_sc(struct shrink_control *sc, + struct deferred_split *ds_queue) +{ + struct pglist_data *pgdat = NODE_DATA(sc->nid); + + ds_queue->split_queue_lock = &pgdat->split_queue_lock; + ds_queue->split_queue = &pgdat->split_queue; + ds_queue->split_queue_len = &pgdat->split_queue_len; } #endif
@@ -2703,7 +2743,7 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) int split_huge_page_to_list(struct page *page, struct list_head *list) { struct page *head = compound_head(page); - struct deferred_split *ds_queue = get_deferred_split_queue(page); + struct deferred_split ds_queue; struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int count, mapcount, extra_pins, ret; @@ -2793,17 +2833,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) }
/* Prevent deferred_split_scan() touching ->_refcount */ - spin_lock(&ds_queue->split_queue_lock); + get_deferred_split_queue(page, &ds_queue); + spin_lock(ds_queue.split_queue_lock); count = page_count(head); mapcount = total_mapcount(head); if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) { if (!list_empty(page_deferred_list(head))) { - ds_queue->split_queue_len--; + (*ds_queue.split_queue_len)--; list_del(page_deferred_list(head)); } if (mapping) __dec_node_page_state(page, NR_SHMEM_THPS); - spin_unlock(&ds_queue->split_queue_lock); + spin_unlock(ds_queue.split_queue_lock); __split_huge_page(page, list, end, flags); ret = 0; } else { @@ -2815,7 +2856,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) dump_page(page, "total_mapcount(head) > 0"); BUG(); } - spin_unlock(&ds_queue->split_queue_lock); + spin_unlock(ds_queue.split_queue_lock); fail: if (mapping) xa_unlock(&mapping->i_pages); spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags); @@ -2837,21 +2878,22 @@ fail: if (mapping)
void free_transhuge_page(struct page *page) { - struct deferred_split *ds_queue = get_deferred_split_queue(page); + struct deferred_split ds_queue; unsigned long flags;
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + get_deferred_split_queue(page, &ds_queue); + spin_lock_irqsave(ds_queue.split_queue_lock, flags); if (!list_empty(page_deferred_list(page))) { - ds_queue->split_queue_len--; + (*ds_queue.split_queue_len)--; list_del(page_deferred_list(page)); } - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + spin_unlock_irqrestore(ds_queue.split_queue_lock, flags); free_compound_page(page); }
void deferred_split_huge_page(struct page *page) { - struct deferred_split *ds_queue = get_deferred_split_queue(page); + struct deferred_split ds_queue; #ifdef CONFIG_MEMCG struct mem_cgroup *memcg = compound_head(page)->mem_cgroup; #endif @@ -2872,51 +2914,50 @@ void deferred_split_huge_page(struct page *page) if (PageSwapCache(page)) return;
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + get_deferred_split_queue(page, &ds_queue); + spin_lock_irqsave(ds_queue.split_queue_lock, flags); if (list_empty(page_deferred_list(page))) { count_vm_event(THP_DEFERRED_SPLIT_PAGE); - list_add_tail(page_deferred_list(page), &ds_queue->split_queue); - ds_queue->split_queue_len++; + list_add_tail(page_deferred_list(page), ds_queue.split_queue); + (*ds_queue.split_queue_len)++; #ifdef CONFIG_MEMCG if (memcg) memcg_set_shrinker_bit(memcg, page_to_nid(page), deferred_split_shrinker.id); #endif } - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + spin_unlock_irqrestore(ds_queue.split_queue_lock, flags); }
static unsigned long deferred_split_count(struct shrinker *shrink, struct shrink_control *sc) { struct pglist_data *pgdata = NODE_DATA(sc->nid); - struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + unsigned long *split_queue_len = &pgdata->split_queue_len; + struct mem_cgroup_extension *memcg_ext;
#ifdef CONFIG_MEMCG - if (sc->memcg) - ds_queue = &sc->memcg->deferred_split_queue; + if (sc->memcg) { + memcg_ext = container_of(sc->memcg, struct mem_cgroup_extension, memcg); + split_queue_len = &memcg_ext->split_queue_len; + } #endif - return READ_ONCE(ds_queue->split_queue_len); + return READ_ONCE(*split_queue_len); }
static unsigned long deferred_split_scan(struct shrinker *shrink, struct shrink_control *sc) { - struct pglist_data *pgdata = NODE_DATA(sc->nid); - struct deferred_split *ds_queue = &pgdata->deferred_split_queue; + struct deferred_split ds_queue; unsigned long flags; LIST_HEAD(list), *pos, *next; struct page *page; int split = 0;
-#ifdef CONFIG_MEMCG - if (sc->memcg) - ds_queue = &sc->memcg->deferred_split_queue; -#endif - - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + get_deferred_split_queue_from_sc(sc, &ds_queue); + spin_lock_irqsave(ds_queue.split_queue_lock, flags); /* Take pin on all head pages to avoid freeing them under us */ - list_for_each_safe(pos, next, &ds_queue->split_queue) { + list_for_each_safe(pos, next, ds_queue.split_queue) { page = list_entry((void *)pos, struct page, mapping); page = compound_head(page); if (get_page_unless_zero(page)) { @@ -2924,12 +2965,12 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, } else { /* We lost race with put_compound_page() */ list_del_init(page_deferred_list(page)); - ds_queue->split_queue_len--; + (*ds_queue.split_queue_len)--; } if (!--sc->nr_to_scan) break; } - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + spin_unlock_irqrestore(ds_queue.split_queue_lock, flags);
list_for_each_safe(pos, next, &list) { page = list_entry((void *)pos, struct page, mapping); @@ -2943,15 +2984,15 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, put_page(page); }
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags); - list_splice_tail(&list, &ds_queue->split_queue); - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + spin_lock_irqsave(ds_queue.split_queue_lock, flags); + list_splice_tail(&list, ds_queue.split_queue); + spin_unlock_irqrestore(ds_queue.split_queue_lock, flags);
/* * Stop shrinker if we didn't split any page, but the queue is empty. * This can happen if pages were freed under us. */ - if (!split && list_empty(&ds_queue->split_queue)) + if (!split && list_empty(ds_queue.split_queue)) return SHRINK_STOP; return split; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ea67a63a4e71..f401be9d45a5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4572,9 +4572,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void) INIT_LIST_HEAD(&memcg->cgwb_list); #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE - spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); - INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue); - memcg->deferred_split_queue.split_queue_len = 0; + spin_lock_init(&memcg_ext->split_queue_lock); + INIT_LIST_HEAD(&memcg_ext->split_queue); + memcg_ext->split_queue_len = 0; #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d85b5a0bfda6..0888870e3458 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6353,11 +6353,9 @@ static unsigned long __init calc_memmap_size(unsigned long spanned_pages, #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void pgdat_init_split_queue(struct pglist_data *pgdat) { - struct deferred_split *ds_queue = &pgdat->deferred_split_queue; - - spin_lock_init(&ds_queue->split_queue_lock); - INIT_LIST_HEAD(&ds_queue->split_queue); - ds_queue->split_queue_len = 0; + spin_lock_init(&pgdat->split_queue_lock); + INIT_LIST_HEAD(&pgdat->split_queue); + pgdat->split_queue_len = 0; } #else static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
From: yangerkun yangerkun@huawei.com
mainline inclusion from mainline-5.11-rc4 commit 6b4b8e6b4ad8553660421d6360678b3811d5deb9 category: bugfix bugzilla: 46884 CVE: NA ---------------------------
We got a "deleted inode referenced" warning during our fsstress test. The bug can be reproduced easily with the following steps:
cd /dev/shm
mkdir test/
fallocate -l 128M img
mkfs.ext4 -b 1024 img
mount img test/
dd if=/dev/zero of=test/foo bs=1M count=128
mkdir test/dir/ && cd test/dir/
for ((i=0;i<1000;i++)); do touch file$i; done   # consume all blocks
cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD, /dev/shm/test/dir/dst_file, RENAME_WHITEOUT)
# ext4_add_entry in ext4_rename will return ENOSPC!!
cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1

We will get the output:
"ls: cannot access 'test/dir/file1': Structure needs cleaning"
and dmesg shows:
"EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls: deleted inode referenced: 139"
ext4_rename will create a special inode for the whiteout and use its 'ino' to replace the 'ino' in the source file's directory entry. Once an error happens later (the error above was the ENOSPC returned from ext4_add_entry in ext4_rename, since all space had been consumed), the cleanup drops the nlink of the whiteout but forgets to restore the source file's 'ino'. This triggers the bug described above.
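In other words, the error path has to put the original inode number and file type back into the source directory entry before dropping the whiteout, which is what the fix below does in end_rename (shown here simplified):

  if (whiteout) {
          if (retval) {
                  /* restore the source entry that was pointed at the whiteout */
                  ext4_setent(handle, &old, old.inode->i_ino, old_file_type);
                  drop_nlink(whiteout);
          }
          unlock_new_inode(whiteout);
          iput(whiteout);
  }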
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jan Kara jack@suse.cz Cc: stable@vger.kernel.org Fixes: cd808deced43 ("ext4: support RENAME_WHITEOUT") Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu
Conflicts: fs/ext4/namei.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/namei.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 2297d38aaf5e..34a4a1a298df 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -3425,9 +3425,6 @@ static int ext4_setent(handle_t *handle, struct ext4_renament *ent, return retval; } } - brelse(ent->bh); - ent->bh = NULL; - return 0; }
@@ -3626,6 +3623,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry, } }
+ old_file_type = old.de->file_type; if (IS_DIRSYNC(old.dir) || IS_DIRSYNC(new.dir)) ext4_handle_sync(handle);
@@ -3653,7 +3651,6 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry, force_reread = (new.dir->i_ino == old.dir->i_ino && ext4_test_inode_flag(new.dir, EXT4_INODE_INLINE_DATA));
- old_file_type = old.de->file_type; if (whiteout) { /* * Do this before adding a new entry, so the old entry is sure @@ -3725,15 +3722,19 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry, retval = 0;
end_rename: - brelse(old.dir_bh); - brelse(old.bh); - brelse(new.bh); if (whiteout) { - if (retval) + if (retval) { + ext4_setent(handle, &old, + old.inode->i_ino, old_file_type); drop_nlink(whiteout); + } unlock_new_inode(whiteout); iput(whiteout); + } + brelse(old.dir_bh); + brelse(old.bh); + brelse(new.bh); if (handle) ext4_journal_stop(handle); return retval;
From: Yufen Yu yuyufen@huawei.com
hulk inclusion category: bugfix bugzilla: 46860
--------------------------------
Some drivers (such as the SCSI enclosure driver) never call blk_register_queue() to initialize the request_queue, and in that case we rely on the driver itself to deal with the race when cleaning up the queue.
But some self-developed drivers cannot handle that race. To avoid a NULL pointer dereference like the one below, do the quiesce in the kernel.
[67760.308034] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 ... [67760.308069] pc : blk_mq_do_dispatch_sched+0x94/0x130 [67760.308072] lr : blk_mq_sched_dispatch_requests+0x128/0x1f0 [67760.308072] sp : ffff0000b2bb3ca0 [67760.308073] x29: ffff0000b2bb3ca0 x28: 0000000000000000 [67760.308075] x27: 0000000000000000 x26: ffff00008128a000 [67760.308076] x25: ffff8042b13e9700 x24: 0000000000000000 [67760.308077] x23: ffff0000b2bb3cf8 x22: ffff8042b2976808 [67760.308078] x21: ffff0000b2bb3d58 x20: ffff00008128a000 [67760.308079] x19: ffff8042b2976800 x18: 0000000000000000 [67760.308080] x17: 0000000000000000 x16: 0000000000000000 [67760.308081] x15: 0000000000000000 x14: 0000000000000000 [67760.308082] x13: 0000000000000000 x12: 0000000000000000 [67760.308084] x11: ffff800040801004 x10: ffff80004080100c [67760.308085] x9 : 0000000000000060 x8 : ffff809e58f7e500 [67760.308087] x7 : 0000000000000000 x6 : 00000000ffffffff [67760.308088] x5 : ffff000080ab7550 x4 : ffff80418e84ec80 [67760.308089] x3 : ffff0000815d7f10 x2 : 94e133dc71839c00 [67760.308090] x1 : 0000000000000000 x0 : ffff00008128a748 [67760.308091] Call trace: [67760.308093] blk_mq_do_dispatch_sched+0x94/0x130 [67760.308095] blk_mq_sched_dispatch_requests+0x128/0x1f0 [67760.308096] __blk_mq_run_hw_queue+0x98/0x138 [67760.308097] blk_mq_run_work_fn+0x28/0x38
Signed-off-by: Yufen Yu yuyufen@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 7 +++++-- drivers/scsi/scsi_scan.c | 4 ++++ include/linux/blkdev.h | 1 + 3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 9b2a05d00b96..ffbe326c70b9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -791,9 +791,12 @@ void blk_cleanup_queue(struct request_queue *q) * from more than one contexts. * * We rely on driver to deal with the race in case that queue - * initialization isn't done. + * initialization isn't done. If driver cannot deal the race, + * we try to call quiesce in kernel for these drivers that have + * set QUEUE_FLAG_FORECE_QUIESCE flag. */ - if (q->mq_ops && blk_queue_init_done(q)) + if (q->mq_ops && (blk_queue_init_done(q) || + test_bit(QUEUE_FLAG_FORECE_QUIESCE, &q->queue_flags))) blk_mq_quiesce_queue(q);
/* for synchronous bio-based driver finish in-flight integrity i/o */ diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 009a5b2aa3d0..27610816ccb8 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -844,6 +844,10 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result, *bflags |= BLIST_NOREPORTLUN; }
+ if (sdev->type == TYPE_ENCLOSURE) + set_bit(QUEUE_FLAG_FORECE_QUIESCE, + &sdev->request_queue->queue_flags); + /* * For a peripheral qualifier (PQ) value of 1 (001b), the SCSI * spec says: The device server is capable of supporting the diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index d3d81b56007f..ad2b10c4a079 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -720,6 +720,7 @@ struct request_queue { #define QUEUE_FLAG_REGISTERED 26 /* queue has been registered to a disk */ #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */ #define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */ +#define QUEUE_FLAG_FORECE_QUIESCE 29 /* force quiesce when cleanup queue */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_SAME_COMP) | \
From: Lu Jialin lujialin4@huawei.com
hulk inclusion category: bugfix bugzilla: 46924 CVE: NA
--------
If the parent cgroup's files.limit is 0, moving a task into the child cgroup fails to charge the file resource. When the task is then killed, the files.usage of both the parent cgroup and the child cgroup becomes abnormal.
/sys/fs/cgroup/parent # ls cgroup.clone_children files.limit tasks cgroup.procs files.usage child notify_on_release /sys/fs/cgroup/parent # echo 0 >files.limit /sys/fs/cgroup/parent # cd child /sys/fs/cgroup/parent/child # ls cgroup.clone_children files.limit notify_on_release cgroup.procs files.usage tasks /sys/fs/cgroup/parent/child # echo 156 >tasks [ 879.564728] Open files limit overcommited /sys/fs/cgroup/parent/child # kill -9 156 /sys/fs/cgroup/parent/child # [ 886.363690] WARNING: CPU: 0 PID: 156 at mm/page_counter.c:62 page_counter_cancel+0x26/0x30 [ 886.364093] Modules linked in: [ 886.364093] CPU: 0 PID: 156 Comm: top Not tainted 4.18.0+ #1 [ 886.364093] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 886.365350] RIP: 0010:page_counter_cancel+0x26/0x30 [ 886.365350] Code: 0f 1f 40 00 66 66 66 66 90 48 89 f0 53 48 f7 d8 f0 48 0f c1 07 48 29 f0 48 89 c3 48 89 c6 e8 61 ff ff ff 48 85 d5 [ 886.365350] RSP: 0018:ffffb754006b7d00 EFLAGS: 00000286 [ 886.365350] RAX: 0000000000000000 RBX: ffffffffffffffff RCX: 0000000000000001 [ 886.365350] RDX: 0000000000000000 RSI: ffffffffffffffff RDI: ffff9ca61888b930 [ 886.365350] RBP: 0000000000000001 R08: 00000000000295c0 R09: ffffffff820597aa [ 886.365350] R10: ffffffffffffffff R11: ffffd78601823508 R12: 0000000000000000 [ 886.365350] R13: ffff9ca6181c0628 R14: 0000000000000000 R15: ffff9ca663e9d000 [ 886.365350] FS: 0000000000000000(0000) GS:ffff9ca661e00000(0000) knlGS:0000000000000000 [ 886.365350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 886.365350] CR2: 0000000000867fb8 CR3: 0000000017a0a000 CR4: 00000000000006f0 [ 886.365350] Call Trace: [ 886.369392] page_counter_uncharge+0x1d/0x30 [ 886.369392] put_files_struct+0x7c/0xe0 [ 886.369392] do_exit+0x2c7/0xb90 [ 886.369392] ? __schedule+0x2a1/0x900 [ 886.369392] do_group_exit+0x3a/0xa0 [ 886.369392] get_signal+0x15e/0x870 [ 886.369392] do_signal+0x36/0x610 [ 886.369392] ? do_vfs_ioctl+0xa4/0x640 [ 886.369392] ? do_vfs_ioctl+0xa4/0x640 [ 886.369392] ? dput+0x29/0x110 [ 886.369392] exit_to_usermode_loop+0x71/0xe0 [ 886.369392] do_syscall_64+0x181/0x1b0 [ 886.369392] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 886.369392] RIP: 0033:0x4b9b5a [ 886.369392] Code: Bad RIP value. [ 886.369392] RSP: 002b:00007ffe27221968 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 886.373373] RAX: fffffffffffffe00 RBX: 0000000000000001 RCX: 00000000004b9b5a [ 886.373373] RDX: 00007ffe27221930 RSI: 0000000000005402 RDI: 0000000000000000 [ 886.373373] RBP: 0000000000000135 R08: 00007ffe272219a4 R09: 0000000000000010 [ 886.373373] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 [ 886.373373] R13: 0000000000000005 R14: 0000000000000135 R15: 0000000000000000 [ 886.373373] ---[ end trace 56c4971a753a98c5 ]---
[1]+ Killed top /sys/fs/cgroup/parent/child # ls cgroup.clone_children files.limit notify_on_release cgroup.procs files.usage tasks /sys/fs/cgroup/parent/child # cat files.usage 18446744073709551613 /sys/fs/cgroup/parent/child # cd .. /sys/fs/cgroup/parent # ls cgroup.clone_children files.limit tasks cgroup.procs files.usage child notify_on_release /sys/fs/cgroup/parent # cat files.usage 18446744073709551613
The reason is that when moving a task into the child cgroup fails, the files.usage of the child cgroup and its parent cgroup is left as before, but the struct files_cgroup already points to the dst_css. Therefore, when the task is killed, page_counter_uncharge() subtracts the files.usage of the child cgroup and its parent cgroup again, and files.usage becomes abnormal.
If we only switched the struct files_cgroup pointer when the charge succeeds in files_cgroup_attach, problems would still occur in some extreme scenarios. 1) If we charged num_files back into the original page_counter when charging the file resource into the new cgroup fails, files.usage could exceed files.limit of the original cgroup if a new task moves into the original cgroup at the same time. 2) If we uncharged num_files from the original page_counter only after charging the file resource into the new cgroup succeeds, then when the parent's files.limit equals its files.usage and the parent has two child cgroups, moving a task from one child cgroup to the other would fail.
The patch folds the work of files_cgroup_attach() into files_cgroup_can_attach() and deletes files_cgroup_attach(). This moves the file-related resource into the new cgroup before the task itself is moved; if try_charge fails, the task and its file resource stay in the original cgroup, which solves the problems above.
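Condensed from the diff below, the charge ordering in the new files_cgroup_can_attach() looks roughly like this (an abbreviated sketch of the patched code, not a drop-in replacement; declarations and the surrounding locking context are trimmed):

	spin_lock(&files->file_lock);
	num_files = files_cgroup_count_fds(files);
	page_counter_uncharge(from_res, num_files);        /* drop the old cgroup's charge */
	if (!page_counter_try_charge(to_res, num_files, &fail_res)) {
		page_counter_charge(from_res, num_files);  /* failed: restore the old charge */
		can_attach = false;
	} else {
		css_put(from_css);                         /* succeeded: switch accounting */
		css_get(to_css);
		task->files->files_cgroup = css_fcg(to_css);
		can_attach = true;
	}
	spin_unlock(&files->file_lock);
	return can_attach ? 0 : -ENOSPC;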
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/filescontrol.c | 73 +++++++++-------------------------------------- 1 file changed, 13 insertions(+), 60 deletions(-)
diff --git a/fs/filescontrol.c b/fs/filescontrol.c index 32f3a127579c..cfb1a6ba26b3 100644 --- a/fs/filescontrol.c +++ b/fs/filescontrol.c @@ -102,66 +102,14 @@ u64 files_cgroup_count_fds(struct files_struct *files) return retval; }
-/* - * cgroup core uses cgroup_threadgroup_rwsem to ensure - * the task will not exit and task->files will not be NULL - * during the migration of cgroup. - */ -static u64 files_in_taskset(struct cgroup_taskset *tset) -{ - struct task_struct *task; - u64 files = 0; - struct cgroup_subsys_state *css; - - cgroup_taskset_for_each(task, css, tset) { - if (!thread_group_leader(task)) - continue; - - /* - * use file_lock to ensure fd will not be created or destroyed, - * and the fd table will not be expanded. - */ - spin_lock(&task->files->file_lock); - files += files_cgroup_count_fds(task->files); - spin_unlock(&task->files->file_lock); - } - - return files; -} - /* * If attaching this cgroup would overcommit the resource then deny - * the attach. + * the attach. If not, attach the file resource into new cgroup. */ static int files_cgroup_can_attach(struct cgroup_taskset *tset) -{ - struct cgroup_subsys_state *css; - unsigned long margin; - struct page_counter *cnt; - unsigned long counter; - u64 files = files_in_taskset(tset); - - cgroup_taskset_first(tset, &css); - cnt = css_res_open_handles(css); - - counter = (unsigned long)atomic_long_read(&cnt->usage); - if (cnt->max > counter) - margin = cnt->max - counter; - else - margin = 0; - if (margin < files) - return -ENOMEM; - return 0; -} - -/* - * If resource counts have gone up between can_attach and attach then - * this may overcommit resources. In that case just deny further allocation - * until the resource usage drops. - */ -static void files_cgroup_attach(struct cgroup_taskset *tset) { u64 num_files; + bool can_attach; struct cgroup_subsys_state *to_css; struct cgroup_subsys_state *from_css; struct page_counter *from_res; @@ -176,7 +124,7 @@ static void files_cgroup_attach(struct cgroup_taskset *tset) files = task->files; if (!files || files == &init_files) { task_unlock(task); - return; + return 0; }
from_css = &files_cgroup_from_files(files)->css; @@ -185,14 +133,20 @@ static void files_cgroup_attach(struct cgroup_taskset *tset) spin_lock(&files->file_lock); num_files = files_cgroup_count_fds(files); page_counter_uncharge(from_res, num_files); - css_put(from_css);
- if (!page_counter_try_charge(to_res, num_files, &fail_res)) + if (!page_counter_try_charge(to_res, num_files, &fail_res)) { + page_counter_charge(from_res, num_files); pr_err("Open files limit overcommited\n"); - css_get(to_css); - task->files->files_cgroup = css_fcg(to_css); + can_attach = false; + } else { + css_put(from_css); + css_get(to_css); + task->files->files_cgroup = css_fcg(to_css); + can_attach = true; + } spin_unlock(&files->file_lock); task_unlock(task); + return can_attach ? 0 : -ENOSPC; }
int files_cgroup_alloc_fd(struct files_struct *files, u64 n) @@ -322,7 +276,6 @@ struct cgroup_subsys files_cgrp_subsys = { .css_alloc = files_cgroup_css_alloc, .css_free = files_cgroup_css_free, .can_attach = files_cgroup_can_attach, - .attach = files_cgroup_attach, .legacy_cftypes = files, .dfl_cftypes = files, };
From: miaoyubo miaoyubo@huawei.com
virt inclusion category: bugfix bugzilla: 47428 CVE: NA
Disable PUD_SIZE huge mappings on 161x series boards and only enable them on 1620CS, due to the low performance of cache maintenance and the complex implementation required to pre-set up stage2 page tables with PUD huge pages, which could even lead to a hard lockup.
Signed-off-by: Zenghui Yu yuzenghui@huawei.com Signed-off-by: miaoyubo miaoyubo@huawei.com Reviewed-by: zhanghailiang zhang.zhanghailiang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- virt/kvm/arm/mmu.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index 8d0777027493..aec599488847 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -1754,6 +1754,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (vma_pagesize == PMD_SIZE || (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT; + + /* Only enable PUD_SIZE huge mapping on 1620 serial boards */ + if (vma_pagesize == PUD_SIZE && !kvm_ncsnp_support) { + vma_pagesize = PMD_SIZE; + gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT; + } + up_read(¤t->mm->mmap_sem);
/* We need minimum second+third level pages */
From: Chenguangli chenguangli2@huawei.com
driver inclusion category: bug bugzilla: NA
------------------------------------------------------------------
Resolve an issue where the system may oops because resources were released after shutdown while the SCSI layer continued to send FCP commands to the hifc driver.
Signed-off-by: Chenguangli chenguangli2@huawei.com Reviewed-by: Zengweiliang zengweiliang.zengweiliang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/huawei/hifc/unf_common.h | 2 +- drivers/scsi/huawei/hifc/unf_scsi.c | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/huawei/hifc/unf_common.h b/drivers/scsi/huawei/hifc/unf_common.h index 2793acb5dea7..bc11cc542c56 100644 --- a/drivers/scsi/huawei/hifc/unf_common.h +++ b/drivers/scsi/huawei/hifc/unf_common.h @@ -13,7 +13,7 @@ /* B version, B0XX Corresponding x.x */ #define UNF_B_VERSION "5.0" /* Indicates the minor version number of the driver */ -#define UNF_DRIVER_VERSION "10" +#define UNF_DRIVER_VERSION "11" /* version num */ #define UNF_FC_VERSION UNF_MAJOR_VERSION "." UNF_B_VERSION "." UNF_DRIVER_VERSION extern unsigned int unf_dbg_level; diff --git a/drivers/scsi/huawei/hifc/unf_scsi.c b/drivers/scsi/huawei/hifc/unf_scsi.c index 11331bacb914..2f5cb0e723fb 100644 --- a/drivers/scsi/huawei/hifc/unf_scsi.c +++ b/drivers/scsi/huawei/hifc/unf_scsi.c @@ -845,6 +845,29 @@ static int unf_scsi_queue_cmd(struct Scsi_Host *shost, return 0; }
+ if (unlikely(!scsi_image_table->wwn_rport_info_table)) { + UNF_TRACE(UNF_EVTLOG_DRIVER_INFO, UNF_LOG_ABNORMAL, UNF_WARN, + "[warn]Port(0x%x) WwnRportInfoTable NULL", lport->port_id); + + cmd->result = DID_NO_CONNECT << 16; + cmd->scsi_done(cmd); + ret_value = DID_NO_CONNECT; + UNF_IO_RESULT_CNT(scsi_image_table, scsi_id, ret_value); + return 0; + } + + if (unlikely(lport->b_port_removing == UNF_TRUE)) { + UNF_TRACE(UNF_EVTLOG_DRIVER_INFO, UNF_LOG_ABNORMAL, UNF_WARN, + "[warn]Port(0x%x) scsi_id(0x%x) rport(0x%p) target_id(0x%x) cmd(0x%p) is removing", + lport->port_id, scsi_id, p_rport, p_rport->scsi_target_id, cmd); + + cmd->result = DID_NO_CONNECT << 16; + cmd->scsi_done(cmd); + ret_value = DID_NO_CONNECT; + UNF_IO_RESULT_CNT(scsi_image_table, scsi_id, ret_value); + return 0; + } + en_scsi_state = atomic_read(&scsi_image_table->wwn_rport_info_table[scsi_id].en_scsi_state); if (unlikely(en_scsi_state != UNF_SCSI_ST_ONLINE)) { if (en_scsi_state == UNF_SCSI_ST_OFFLINE) {
From: Hanjun Guo guohanjun@huawei.com
hulk inclusion category: bugfix bugzilla: 47461 CVE: NA
-------------------------------------------------
We got a compile error:
drivers/clocksource/arm_arch_timer.c: In function 'arch_counter_register':
drivers/clocksource/arm_arch_timer.c:1009:31: error: 'struct arch_clocksource_data' has no member named 'vdso_fix'
 1009 |   clocksource_counter.archdata.vdso_fix = vdso_fix;
      |                                ^
make[3]: *** [/builds/1mzfdQzleCy69KZFb5qHNSEgabZ/scripts/Makefile.build:303: drivers/clocksource/arm_arch_timer.o] Error 1
make[3]: Target '__build' not remade because of errors.
Fix it by guarding vdso_fix with #ifdef CONFIG_ARM64 .. #endif
Signed-off-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/clocksource/arm_arch_timer.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c index f552a38feeb9..58863fd9c91b 100644 --- a/drivers/clocksource/arm_arch_timer.c +++ b/drivers/clocksource/arm_arch_timer.c @@ -1006,7 +1006,9 @@ static void __init arch_counter_register(unsigned type) arch_timer_read_counter = arch_counter_get_cntpct;
clocksource_counter.archdata.vdso_direct = vdso_default; +#ifdef CONFIG_ARM64 clocksource_counter.archdata.vdso_fix = vdso_fix; +#endif } else { arch_timer_read_counter = arch_counter_get_cntvct_mem; }
From: Vadim Fedorenko vfedorenko@novek.ru
stable inclusion from linux-4.19.162 commit 9e401870db6c901debe3f14eae9d477bdec0e1af
--------------------------------
[ Upstream commit 20ffc7adf53a5fd3d19751fbff7895bcca66686e ]
If the TCP socket received a FIN after some data and the parser had not started before the data was read, the caller will receive an empty buffer. This behavior differs from a plain TCP socket and requires special treatment in user-space. The flow that triggers the race is simple: the server sends a small amount of data right after the connection is configured to use TLS, then closes the connection. The receiver sees the TLS handshake data and configures the TLS socket right after the Change Cipher Spec record. While the configuration is in progress, the TCP socket receives a small Application Data record, an Encrypted Alert record and a FIN packet, so it changes sk_shutdown to RCV_SHUTDOWN and sets the SK_DONE bit in sk_flag. The received data is not parsed upon arrival and is never delivered to user-space.
The patch unpauses the parser directly if there is unparsed data in the TCP receive queue.
Fixes: fcf4793e278e ("tls: check RCV_SHUTDOWN in tls_wait_data") Signed-off-by: Vadim Fedorenko vfedorenko@novek.ru Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/tls/tls_sw.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index bbb2da70e870..7d761244a360 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -630,6 +630,12 @@ static struct sk_buff *tls_wait_data(struct sock *sk, int flags, return NULL; }
+ if (!skb_queue_empty(&sk->sk_receive_queue)) { + __strp_unpause(&ctx->strp); + if (ctx->recv_pkt) + return ctx->recv_pkt; + } + if (sk->sk_shutdown & RCV_SHUTDOWN) return NULL;
From: Maxim Mikityanskiy maximmi@mellanox.com
stable inclusion from linux-4.19.162 commit 2ba413da624ee45b29cc9de5fff710c19c4efcde
--------------------------------
[ Upstream commit 025cc2fb6a4e84e9a0552c0017dcd1c24b7ac7da ]
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after calling tls_dev_del if TLS TX offload is also enabled. Clearing tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a time frame when tls_device_down may get called and call tls_dev_del for RX one extra time, confusing the driver, which may lead to a crash.
This patch corrects this racy behavior by adding a flag to prevent tls_device_down from calling tls_dev_del the second time.
Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Maxim Mikityanskiy maximmi@mellanox.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/net/tls.h | 6 ++++++ net/tls/tls_device.c | 5 ++++- 2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/net/tls.h b/include/net/tls.h index 01dafc4cf6b8..e2d84b791e5c 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -170,6 +170,12 @@ enum {
enum tls_context_flags { TLS_RX_SYNC_RUNNING = 0, + /* tls_dev_del was called for the RX side, device state was released, + * but tls_ctx->netdev might still be kept, because TX-side driver + * resources might not be released yet. Used to prevent the second + * tls_dev_del call in tls_device_down if it happens simultaneously. + */ + TLS_RX_DEV_CLOSED = 2, };
struct cipher_context { diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index dd0fc2aa6875..228e3ce48d43 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -955,6 +955,8 @@ void tls_device_offload_cleanup_rx(struct sock *sk) if (tls_ctx->tx_conf != TLS_HW) { dev_put(netdev); tls_ctx->netdev = NULL; + } else { + set_bit(TLS_RX_DEV_CLOSED, &tls_ctx->flags); } out: up_read(&device_offload_lock); @@ -984,7 +986,8 @@ static int tls_device_down(struct net_device *netdev) if (ctx->tx_conf == TLS_HW) netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX); - if (ctx->rx_conf == TLS_HW) + if (ctx->rx_conf == TLS_HW && + !test_bit(TLS_RX_DEV_CLOSED, &ctx->flags)) netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_RX); WRITE_ONCE(ctx->netdev, NULL);
From: Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com
stable inclusion from linux-4.19.162 commit 71d97c0f61e198022e599520cf3896504e74d7cc
--------------------------------
[ Upstream commit 2980cbd4dce7b1e9bf57df3ced43a7b184986f50 ]
Prevent VFs from resetting when PF driver is being unloaded:
- introduce new pf state: __I40E_VF_RESETS_DISABLED;
- check if pf state has __I40E_VF_RESETS_DISABLED state set, if so, disable any further VFLR event notifications;
- when i40e_remove (rmmod i40e) is called, disable any resets on the VFs;
Previously if there were bare-metal VFs passing traffic and PF driver was removed, there was a possibility of VFs triggering a Tx timeout right before iavf_remove. This was causing iavf_close to not be called because there is a check in the beginning of iavf_remove that bails out early if adapter->state < IAVF_DOWN_PENDING. This makes it so some resources do not get cleaned up.
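Condensed from the diff below, the new teardown ordering in i40e_remove() is roughly as follows (an abbreviated sketch; the full hunk also masks further VFLR interrupts in i40e_intr() once __I40E_VF_RESETS_DISABLED is set):

	/* disable VF resets and free the VFs before marking the PF suspended/down */
	if (pf->flags & I40E_FLAG_SRIOV_ENABLED) {
		set_bit(__I40E_VF_RESETS_DISABLED, pf->state);
		i40e_free_vfs(pf);
		pf->flags &= ~I40E_FLAG_SRIOV_ENABLED;
	}
	/* no more scheduling of any task */
	set_bit(__I40E_SUSPENDED, pf->state);
	set_bit(__I40E_DOWN, pf->state);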
Fixes: 6a9ddb36eeb8 ("i40e: disable IOV before freeing resources") Signed-off-by: Slawomir Laba slawomirx.laba@intel.com Signed-off-by: Brett Creeley brett.creeley@intel.com Signed-off-by: Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com Tested-by: Konrad Jankowski konrad0.jankowski@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Link: https://lore.kernel.org/r/20201120180640.3654474-1-anthony.l.nguyen@intel.co... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: drivers/net/ethernet/intel/i40e/i40e.h [yyl: adjust context] Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/intel/i40e/i40e.h | 1 + drivers/net/ethernet/intel/i40e/i40e_main.c | 22 +++++++++++----- .../ethernet/intel/i40e/i40e_virtchnl_pf.c | 26 +++++++++++-------- 3 files changed, 31 insertions(+), 18 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index ff4d7d8aa326..1752fea024eb 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -148,6 +148,7 @@ enum i40e_state_t { __I40E_CLIENT_L2_CHANGE, __I40E_CLIENT_RESET, __I40E_VIRTCHNL_OP_PENDING, + __I40E_VF_RESETS_DISABLED, /* disable resets during i40e_remove */ /* This must be last as it determines the size of the BITMAP */ __I40E_STATE_SIZE__, }; diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 3200c75b9ed2..985601a65def 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3895,8 +3895,16 @@ static irqreturn_t i40e_intr(int irq, void *data) }
if (icr0 & I40E_PFINT_ICR0_VFLR_MASK) { - ena_mask &= ~I40E_PFINT_ICR0_ENA_VFLR_MASK; - set_bit(__I40E_VFLR_EVENT_PENDING, pf->state); + /* disable any further VFLR event notifications */ + if (test_bit(__I40E_VF_RESETS_DISABLED, pf->state)) { + u32 reg = rd32(hw, I40E_PFINT_ICR0_ENA); + + reg &= ~I40E_PFINT_ICR0_VFLR_MASK; + wr32(hw, I40E_PFINT_ICR0_ENA, reg); + } else { + ena_mask &= ~I40E_PFINT_ICR0_ENA_VFLR_MASK; + set_bit(__I40E_VFLR_EVENT_PENDING, pf->state); + } }
if (icr0 & I40E_PFINT_ICR0_GRST_MASK) { @@ -14155,6 +14163,11 @@ static void i40e_remove(struct pci_dev *pdev) while (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state)) usleep_range(1000, 2000);
+ if (pf->flags & I40E_FLAG_SRIOV_ENABLED) { + set_bit(__I40E_VF_RESETS_DISABLED, pf->state); + i40e_free_vfs(pf); + pf->flags &= ~I40E_FLAG_SRIOV_ENABLED; + } /* no more scheduling of any task */ set_bit(__I40E_SUSPENDED, pf->state); set_bit(__I40E_DOWN, pf->state); @@ -14168,11 +14181,6 @@ static void i40e_remove(struct pci_dev *pdev) */ i40e_notify_client_of_netdev_close(pf->vsi[pf->lan_vsi], false);
- if (pf->flags & I40E_FLAG_SRIOV_ENABLED) { - i40e_free_vfs(pf); - pf->flags &= ~I40E_FLAG_SRIOV_ENABLED; - } - i40e_fdir_teardown(pf);
/* If there is a switch structure or any orphans, remove them. diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index bbbff3f7e84c..5b2395bd6ed8 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -1204,7 +1204,8 @@ static void i40e_cleanup_reset_vf(struct i40e_vf *vf) * @vf: pointer to the VF structure * @flr: VFLR was issued or not * - * Returns true if the VF is reset, false otherwise. + * Returns true if the VF is in reset, resets successfully, or resets + * are disabled and false otherwise. **/ bool i40e_reset_vf(struct i40e_vf *vf, bool flr) { @@ -1214,11 +1215,14 @@ bool i40e_reset_vf(struct i40e_vf *vf, bool flr) u32 reg; int i;
+ if (test_bit(__I40E_VF_RESETS_DISABLED, pf->state)) + return true; + /* If the VFs have been disabled, this means something else is * resetting the VF, so we shouldn't continue. */ if (test_and_set_bit(__I40E_VF_DISABLE, pf->state)) - return false; + return true;
i40e_trigger_vf_reset(vf, flr);
@@ -1382,6 +1386,15 @@ void i40e_free_vfs(struct i40e_pf *pf)
i40e_notify_client_of_vf_enable(pf, 0);
+ /* Disable IOV before freeing resources. This lets any VF drivers + * running in the host get themselves cleaned up before we yank + * the carpet out from underneath their feet. + */ + if (!pci_vfs_assigned(pf->pdev)) + pci_disable_sriov(pf->pdev); + else + dev_warn(&pf->pdev->dev, "VFs are assigned - not disabling SR-IOV\n"); + /* Amortize wait time by stopping all VFs at the same time */ for (i = 0; i < pf->num_alloc_vfs; i++) { if (test_bit(I40E_VF_STATE_INIT, &pf->vf[i].vf_states)) @@ -1397,15 +1410,6 @@ void i40e_free_vfs(struct i40e_pf *pf) i40e_vsi_wait_queues_disabled(pf->vsi[pf->vf[i].lan_vsi_idx]); }
- /* Disable IOV before freeing resources. This lets any VF drivers - * running in the host get themselves cleaned up before we yank - * the carpet out from underneath their feet. - */ - if (!pci_vfs_assigned(pf->pdev)) - pci_disable_sriov(pf->pdev); - else - dev_warn(&pf->pdev->dev, "VFs are assigned - not disabling SR-IOV\n"); - /* free up VF resources */ tmp = pf->num_alloc_vfs; pf->num_alloc_vfs = 0;
From: Eran Ben Elisha eranbe@nvidia.com
stable inclusion from linux-4.19.162 commit 67633216cc548d9a9beccab244e80e8b209c5ad8
--------------------------------
[ Upstream commit 1d2bb5ad89f47d8ce8aedc70ef85059ab3870292 ]
When the command interface is down, the driver reclaims all 4K page chunks that were held by the firmware. Fix a bug on 64K page size systems, where the driver repeatedly released only the first chunk of each page.
Define a helper function that fills in the 4K chunks of a given firmware page. Iterate over all unreleased firmware pages and call the helper for each.
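As a rough illustration of the bug (assuming the usual MLX5_NUM_4K_IN_PAGE = PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE definition in pagealloc.c):

	/* 64K kernel page / 4K adapter page = 16 chunks tracked in fwp->bitmask       */
	/* before: one MLX5_ARRAY_SET64(..., i, fwp->addr) per fw_page -> only chunk 0 */
	/* after:  i += fwp_fill_manage_pages_out(fwp, out, i, npages - i);            */
	/*         every allocated 4K chunk of that fw_page is reported to firmware    */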
Fixes: 5adff6a08862 ("net/mlx5: Fix incorrect page count when in internal error") Signed-off-by: Eran Ben Elisha eranbe@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../ethernet/mellanox/mlx5/core/pagealloc.c | 21 +++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c index e36d3e3675f9..9c3653e06886 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c @@ -331,6 +331,24 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages, return err; }
+static u32 fwp_fill_manage_pages_out(struct fw_page *fwp, u32 *out, u32 index, + u32 npages) +{ + u32 pages_set = 0; + unsigned int n; + + for_each_clear_bit(n, &fwp->bitmask, MLX5_NUM_4K_IN_PAGE) { + MLX5_ARRAY_SET64(manage_pages_out, out, pas, index + pages_set, + fwp->addr + (n * MLX5_ADAPTER_PAGE_SIZE)); + pages_set++; + + if (!--npages) + break; + } + + return pages_set; +} + static int reclaim_pages_cmd(struct mlx5_core_dev *dev, u32 *in, int in_size, u32 *out, int out_size) { @@ -354,8 +372,7 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev, if (fwp->func_id != func_id) continue;
- MLX5_ARRAY_SET64(manage_pages_out, out, pas, i, fwp->addr); - i++; + i += fwp_fill_manage_pages_out(fwp, out, i, npages - i); }
MLX5_SET(manage_pages_out, out, output_num_entries, i);
From: Moshe Shemesh moshe@mellanox.com
stable inclusion from linux-4.19.164 commit 7eb728345c422220407b68e339abd4a0cfb26439
--------------------------------
[ Upstream commit fed91613c9dd455dd154b22fa8e11b8526466082 ]
Add a restarting state flag to avoid scheduling another restart task while such a task is already running. Change the task name from watchdog_task to restart_task to better fit the task's role.
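The core of the guard, condensed from the diff below (the same test_and_set_bit pattern replaces every place that used to queue watchdog_task unconditionally):

	/* only the first caller schedules a restart; later callers see the bit already set */
	if (!test_and_set_bit(MLX4_EN_STATE_FLAG_RESTARTING, &priv->state)) {
		en_dbg(DRV, priv, "Scheduling port restart\n");
		queue_work(mdev->workqueue, &priv->restart_task);
	}
	/* mlx4_en_start_port() clears the bit once the port is running again */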
Fixes: 1e338db56e5a ("mlx4_en: Fix a race at restart task") Signed-off-by: Moshe Shemesh moshe@mellanox.com Signed-off-by: Tariq Toukan tariqt@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../net/ethernet/mellanox/mlx4/en_netdev.c | 20 ++++++++++++------- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 7 ++++++- 2 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 5868ec11db1a..a3a1947486e0 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -1384,8 +1384,10 @@ static void mlx4_en_tx_timeout(struct net_device *dev) }
priv->port_stats.tx_timeout++; - en_dbg(DRV, priv, "Scheduling watchdog\n"); - queue_work(mdev->workqueue, &priv->watchdog_task); + if (!test_and_set_bit(MLX4_EN_STATE_FLAG_RESTARTING, &priv->state)) { + en_dbg(DRV, priv, "Scheduling port restart\n"); + queue_work(mdev->workqueue, &priv->restart_task); + } }
@@ -1835,6 +1837,7 @@ int mlx4_en_start_port(struct net_device *dev) local_bh_enable(); }
+ clear_bit(MLX4_EN_STATE_FLAG_RESTARTING, &priv->state); netif_tx_start_all_queues(dev); netif_device_attach(dev);
@@ -2005,7 +2008,7 @@ void mlx4_en_stop_port(struct net_device *dev, int detach) static void mlx4_en_restart(struct work_struct *work) { struct mlx4_en_priv *priv = container_of(work, struct mlx4_en_priv, - watchdog_task); + restart_task); struct mlx4_en_dev *mdev = priv->mdev; struct net_device *dev = priv->dev;
@@ -2387,7 +2390,7 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu) if (netif_running(dev)) { mutex_lock(&mdev->state_lock); if (!mdev->device_up) { - /* NIC is probably restarting - let watchdog task reset + /* NIC is probably restarting - let restart task reset * the port */ en_dbg(DRV, priv, "Change MTU called with card down!?\n"); } else { @@ -2396,7 +2399,9 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu) if (err) { en_err(priv, "Failed restarting port:%d\n", priv->port); - queue_work(mdev->workqueue, &priv->watchdog_task); + if (!test_and_set_bit(MLX4_EN_STATE_FLAG_RESTARTING, + &priv->state)) + queue_work(mdev->workqueue, &priv->restart_task); } } mutex_unlock(&mdev->state_lock); @@ -2882,7 +2887,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog) if (err) { en_err(priv, "Failed starting port %d for XDP change\n", priv->port); - queue_work(mdev->workqueue, &priv->watchdog_task); + if (!test_and_set_bit(MLX4_EN_STATE_FLAG_RESTARTING, &priv->state)) + queue_work(mdev->workqueue, &priv->restart_task); } }
@@ -3280,7 +3286,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, priv->counter_index = MLX4_SINK_COUNTER_INDEX(mdev->dev); spin_lock_init(&priv->stats_lock); INIT_WORK(&priv->rx_mode_task, mlx4_en_do_set_rx_mode); - INIT_WORK(&priv->watchdog_task, mlx4_en_restart); + INIT_WORK(&priv->restart_task, mlx4_en_restart); INIT_WORK(&priv->linkstate_task, mlx4_en_linkstate); INIT_DELAYED_WORK(&priv->stats_task, mlx4_en_do_get_stats); INIT_DELAYED_WORK(&priv->service_task, mlx4_en_service_task); diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h index 240f9c9ca943..3cce101e83ee 100644 --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h @@ -530,6 +530,10 @@ struct mlx4_en_stats_bitmap { struct mutex mutex; /* for mutual access to stats bitmap */ };
+enum { + MLX4_EN_STATE_FLAG_RESTARTING, +}; + struct mlx4_en_priv { struct mlx4_en_dev *mdev; struct mlx4_en_port_profile *prof; @@ -595,7 +599,7 @@ struct mlx4_en_priv { struct mlx4_en_cq *rx_cq[MAX_RX_RINGS]; struct mlx4_qp drop_qp; struct work_struct rx_mode_task; - struct work_struct watchdog_task; + struct work_struct restart_task; struct work_struct linkstate_task; struct delayed_work stats_task; struct delayed_work service_task; @@ -643,6 +647,7 @@ struct mlx4_en_priv { u32 pflags; u8 rss_key[MLX4_EN_RSS_KEY_SIZE]; u8 rss_hash_fn; + unsigned long state; };
enum mlx4_en_wol {
From: Moshe Shemesh moshe@mellanox.com
stable inclusion from linux-4.19.164 commit ba88ac9e5832be764f25d233f51e064302717314
--------------------------------
[ Upstream commit ba603d9d7b1215c72513d7c7aa02b6775fd4891b ]
If an error CQE is found while polling the TX CQ, the QP is in error state and all posted WQEs will generate error CQEs without any data being transmitted. Fix it by reopening the channels, using the same method as TX timeout handling.
In addition add some more info on error CQE and WQE for debug.
Fixes: bd2f631d7c60 ("net/mlx4_en: Notify user when TX ring in error state") Signed-off-by: Moshe Shemesh moshe@mellanox.com Signed-off-by: Tariq Toukan tariqt@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../net/ethernet/mellanox/mlx4/en_netdev.c | 1 + drivers/net/ethernet/mellanox/mlx4/en_tx.c | 40 +++++++++++++++---- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 5 +++ 3 files changed, 39 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index a3a1947486e0..47eee3e083ec 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -1741,6 +1741,7 @@ int mlx4_en_start_port(struct net_device *dev) mlx4_en_deactivate_cq(priv, cq); goto tx_err; } + clear_bit(MLX4_EN_TX_RING_STATE_RECOVERING, &tx_ring->state); if (t != TX_XDP) { tx_ring->tx_queue = netdev_get_tx_queue(dev, i); tx_ring->recycle_ring = NULL; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index e58052d07e39..29041d4a3f28 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -385,6 +385,35 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring) return cnt; }
+static void mlx4_en_handle_err_cqe(struct mlx4_en_priv *priv, struct mlx4_err_cqe *err_cqe, + u16 cqe_index, struct mlx4_en_tx_ring *ring) +{ + struct mlx4_en_dev *mdev = priv->mdev; + struct mlx4_en_tx_info *tx_info; + struct mlx4_en_tx_desc *tx_desc; + u16 wqe_index; + int desc_size; + + en_err(priv, "CQE error - cqn 0x%x, ci 0x%x, vendor syndrome: 0x%x syndrome: 0x%x\n", + ring->sp_cqn, cqe_index, err_cqe->vendor_err_syndrome, err_cqe->syndrome); + print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET, 16, 1, err_cqe, sizeof(*err_cqe), + false); + + wqe_index = be16_to_cpu(err_cqe->wqe_index) & ring->size_mask; + tx_info = &ring->tx_info[wqe_index]; + desc_size = tx_info->nr_txbb << LOG_TXBB_SIZE; + en_err(priv, "Related WQE - qpn 0x%x, wqe index 0x%x, wqe size 0x%x\n", ring->qpn, + wqe_index, desc_size); + tx_desc = ring->buf + (wqe_index << LOG_TXBB_SIZE); + print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET, 16, 1, tx_desc, desc_size, false); + + if (test_and_set_bit(MLX4_EN_STATE_FLAG_RESTARTING, &priv->state)) + return; + + en_err(priv, "Scheduling port restart\n"); + queue_work(mdev->workqueue, &priv->restart_task); +} + bool mlx4_en_process_tx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int napi_budget) { @@ -431,13 +460,10 @@ bool mlx4_en_process_tx_cq(struct net_device *dev, dma_rmb();
if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == - MLX4_CQE_OPCODE_ERROR)) { - struct mlx4_err_cqe *cqe_err = (struct mlx4_err_cqe *)cqe; - - en_err(priv, "CQE error - vendor syndrome: 0x%x syndrome: 0x%x\n", - cqe_err->vendor_err_syndrome, - cqe_err->syndrome); - } + MLX4_CQE_OPCODE_ERROR)) + if (!test_and_set_bit(MLX4_EN_TX_RING_STATE_RECOVERING, &ring->state)) + mlx4_en_handle_err_cqe(priv, (struct mlx4_err_cqe *)cqe, index, + ring);
/* Skip over last polled CQE */ new_index = be16_to_cpu(cqe->wqe_index) & size_mask; diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h index 3cce101e83ee..1a57ea9a7ea5 100644 --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h @@ -271,6 +271,10 @@ struct mlx4_en_page_cache { } buf[MLX4_EN_CACHE_SIZE]; };
+enum { + MLX4_EN_TX_RING_STATE_RECOVERING, +}; + struct mlx4_en_priv;
struct mlx4_en_tx_ring { @@ -317,6 +321,7 @@ struct mlx4_en_tx_ring { * Only queue_stopped might be used if BQL is not properly working. */ unsigned long queue_stopped; + unsigned long state; struct mlx4_hwq_resources sp_wqres; struct mlx4_qp sp_qp; struct mlx4_qp_context sp_context;
From: Luc Van Oostenryck luc.vanoostenryck@gmail.com
stable inclusion from linux-4.19.164 commit 377bd57ed7ad2417228c6969b014e868a5b90c3a
--------------------------------
[ Upstream commit 5d946c5abbaf68083fa6a41824dd79e1f06286d8 ]
xsk_poll() is defined as returning 'unsigned int' but the .poll method is declared as returning '__poll_t', a bitwise type.
Fix this by using the proper return type and using the EPOLL constants instead of the POLL ones, as required for __poll_t.
Signed-off-by: Luc Van Oostenryck luc.vanoostenryck@gmail.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Björn Töpel bjorn.topel@intel.com Link: https://lore.kernel.org/bpf/20191120001042.30830-1-luc.vanoostenryck@gmail.c... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/xdp/xsk.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 9ff2ab63e639..6bb0649c028c 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -289,17 +289,17 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len); }
-static unsigned int xsk_poll(struct file *file, struct socket *sock, +static __poll_t xsk_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait) { - unsigned int mask = datagram_poll(file, sock, wait); + __poll_t mask = datagram_poll(file, sock, wait); struct sock *sk = sock->sk; struct xdp_sock *xs = xdp_sk(sk);
if (xs->rx && !xskq_empty_desc(xs->rx)) - mask |= POLLIN | POLLRDNORM; + mask |= EPOLLIN | EPOLLRDNORM; if (xs->tx && !xskq_full_desc(xs->tx)) - mask |= POLLOUT | POLLWRNORM; + mask |= EPOLLOUT | EPOLLWRNORM;
return mask; }
From: Björn Töpel bjorn.topel@intel.com
stable inclusion from linux-4.19.164 commit 35d826c94266d842cd4c0a0a60a8805c58397cc2
--------------------------------
[ Upstream commit a06316dc87bdc000f7f39a315476957af2ba0f05 ]
The page recycle code incorrectly relied on the assumption that a page fragment could not be freed inside xdp_do_redirect(). This assumption means that page fragments used by the stack/XDP redirect can be reused and overwritten.
To avoid this, store the page count prior to invoking xdp_do_redirect().
Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect") Reported-and-analyzed-by: Li RongQing lirongqing@baidu.com Signed-off-by: Björn Töpel bjorn.topel@intel.com Tested-by: Sandeep Penigalapati sandeep.penigalapati@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++++++++++++------ 1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 630adacbe722..b52585fbb295 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1943,7 +1943,8 @@ static inline bool ixgbe_page_is_reserved(struct page *page) return (page_to_nid(page) != numa_mem_id()) || page_is_pfmemalloc(page); }
-static bool ixgbe_can_reuse_rx_page(struct ixgbe_rx_buffer *rx_buffer) +static bool ixgbe_can_reuse_rx_page(struct ixgbe_rx_buffer *rx_buffer, + int rx_buffer_pgcnt) { unsigned int pagecnt_bias = rx_buffer->pagecnt_bias; struct page *page = rx_buffer->page; @@ -1954,7 +1955,7 @@ static bool ixgbe_can_reuse_rx_page(struct ixgbe_rx_buffer *rx_buffer)
#if (PAGE_SIZE < 8192) /* if we are only owner of page we can reuse it */ - if (unlikely((page_ref_count(page) - pagecnt_bias) > 1)) + if (unlikely((rx_buffer_pgcnt - pagecnt_bias) > 1)) return false; #else /* The last offset is a bit aggressive in that we assume the @@ -2019,11 +2020,18 @@ static void ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring, static struct ixgbe_rx_buffer *ixgbe_get_rx_buffer(struct ixgbe_ring *rx_ring, union ixgbe_adv_rx_desc *rx_desc, struct sk_buff **skb, - const unsigned int size) + const unsigned int size, + int *rx_buffer_pgcnt) { struct ixgbe_rx_buffer *rx_buffer;
rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean]; + *rx_buffer_pgcnt = +#if (PAGE_SIZE < 8192) + page_count(rx_buffer->page); +#else + 0; +#endif prefetchw(rx_buffer->page); *skb = rx_buffer->skb;
@@ -2053,9 +2061,10 @@ static struct ixgbe_rx_buffer *ixgbe_get_rx_buffer(struct ixgbe_ring *rx_ring,
static void ixgbe_put_rx_buffer(struct ixgbe_ring *rx_ring, struct ixgbe_rx_buffer *rx_buffer, - struct sk_buff *skb) + struct sk_buff *skb, + int rx_buffer_pgcnt) { - if (ixgbe_can_reuse_rx_page(rx_buffer)) { + if (ixgbe_can_reuse_rx_page(rx_buffer, rx_buffer_pgcnt)) { /* hand second half of page back to the ring */ ixgbe_reuse_rx_page(rx_ring, rx_buffer); } else { @@ -2299,6 +2308,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, union ixgbe_adv_rx_desc *rx_desc; struct ixgbe_rx_buffer *rx_buffer; struct sk_buff *skb; + int rx_buffer_pgcnt; unsigned int size;
/* return some buffers to hardware, one at a time is too slow */ @@ -2318,7 +2328,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, */ dma_rmb();
- rx_buffer = ixgbe_get_rx_buffer(rx_ring, rx_desc, &skb, size); + rx_buffer = ixgbe_get_rx_buffer(rx_ring, rx_desc, &skb, size, &rx_buffer_pgcnt);
/* retrieve a buffer from the ring */ if (!skb) { @@ -2360,7 +2370,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, break; }
- ixgbe_put_rx_buffer(rx_ring, rx_buffer, skb); + ixgbe_put_rx_buffer(rx_ring, rx_buffer, skb, rx_buffer_pgcnt); cleaned_count++;
/* place incomplete frames back on ring for completion */
From: Sven Eckelmann sven@narfation.org
stable inclusion from linux-4.19.164 commit bf187ef6e11c96cc49227eab966572806dbbc59b
--------------------------------
[ Upstream commit 0a35dc41fea67ac4495ce7584406bf9557a6e7d0 ]
It was observed that sending data via batadv over vxlan (on top of wireguard) reduced performance massively compared to raw ethernet or batadv on raw ethernet. A check of perf data showed that vxlan_build_skb was calling pskb_expand_head all the time to allocate enough headroom for:
min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len + VXLAN_HLEN + iphdr_len;
But the vxlan_config_apply only requested needed headroom for:
lowerdev->hard_header_len + VXLAN6_HEADROOM or VXLAN_HEADROOM
So it completely ignored the needed_headroom of the lower device. The first caller of net_dev_xmit could therefore never make sure that enough headroom was allocated for the rest of the transmit path.
Cc: Annika Wickert annika.wickert@exaring.de Signed-off-by: Sven Eckelmann sven@narfation.org Tested-by: Annika Wickert aw@awlnx.space Link: https://lore.kernel.org/r/20201126125247.1047977-1-sven@narfation.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/vxlan.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index abf85f0ab72f..8481a21fe7af 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -3180,6 +3180,7 @@ static void vxlan_config_apply(struct net_device *dev, dev->gso_max_segs = lowerdev->gso_max_segs;
needed_headroom = lowerdev->hard_header_len; + needed_headroom += lowerdev->needed_headroom;
max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM);
From: Sven Eckelmann sven@narfation.org
stable inclusion from linux-4.19.164 commit 709392673a451c600b5e2db747a1528d39f8d7d4
--------------------------------
[ Upstream commit a5e74021e84bb5eadf760aaf2c583304f02269be ]
While vxlan doesn't need any extra tailroom, the lowerdev might need it. In that case, copy it over to reduce the chance for additional (re)allocations in the transmit path.
Signed-off-by: Sven Eckelmann sven@narfation.org Link: https://lore.kernel.org/r/20201126125247.1047977-2-sven@narfation.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/vxlan.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 8481a21fe7af..66fffbd64a33 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -3182,6 +3182,8 @@ static void vxlan_config_apply(struct net_device *dev, needed_headroom = lowerdev->hard_header_len; needed_headroom += lowerdev->needed_headroom;
+ dev->needed_tailroom = lowerdev->needed_tailroom; + max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM); if (max_mtu < ETH_MIN_MTU)
From: Leon Romanovsky leonro@nvidia.com
stable inclusion from linux-4.19.164 commit 949a49c8dad024489a202565f9e88cd027811ddc
--------------------------------
[ Upstream commit 907af0f0cab4ee5d5604f182ecec2c5b5119d294 ]
mlx5 firmware expects the driver version in the specific format X.X.X, so make it always correct and base it on the real kernel version the driver is aligned with.
Fixes: 012e50e109fd ("net/mlx5: Set driver version into firmware") Signed-off-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlx5/core/main.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 5fac00ea6245..a2b25afa2472 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -51,6 +51,7 @@ #ifdef CONFIG_RFS_ACCEL #include <linux/cpu_rmap.h> #endif +#include <linux/version.h> #include <net/devlink.h> #include "mlx5_core.h" #include "fs_core.h" @@ -211,7 +212,10 @@ static void mlx5_set_driver_version(struct mlx5_core_dev *dev) strncat(string, ",", remaining_size);
remaining_size = max_t(int, 0, driver_ver_sz - strlen(string)); - strncat(string, DRIVER_VERSION, remaining_size); + + snprintf(string + strlen(string), remaining_size, "%u.%u.%u", + (u8)((LINUX_VERSION_CODE >> 16) & 0xff), (u8)((LINUX_VERSION_CODE >> 8) & 0xff), + (u16)(LINUX_VERSION_CODE & 0xffff));
/*Send the command*/ MLX5_SET(set_driver_version_in, in, opcode,
From: Dongdong Wang wangdongdong.6@bytedance.com
stable inclusion from linux-4.19.164 commit d27b8a31096c7acb1a7e2e340d6fb6d481bb23ae
--------------------------------
[ Upstream commit d9054a1ff585ba01029584ab730efc794603d68f ]
The per-cpu bpf_redirect_info is shared among all skb_do_redirect() and BPF redirect helpers. Callers on the RX path are all in BH context; disabling preemption is not sufficient to prevent BH interruption.
In production, we observed strange packet drops because of the race condition between LWT xmit and TC ingress, and we verified this issue is fixed after we disable BH.
Although this bug was technically introduced from the beginning, that is commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"), at that time call_rcu() had to be call_rcu_bh() to match the RCU context. So this patch may not work well before RCU flavor consolidation has been completed around v5.0.
Update the comments above the code too, as call_rcu() is now BH friendly.
Signed-off-by: Dongdong Wang wangdongdong.6@bytedance.com Signed-off-by: Alexei Starovoitov ast@kernel.org Reviewed-by: Cong Wang cong.wang@bytedance.com Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/lwt_bpf.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index a648568c5e8f..4a5f4fbffd83 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -44,12 +44,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, { int ret;
- /* Preempt disable is needed to protect per-cpu redirect_info between - * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and - * access to maps strictly require a rcu_read_lock() for protection, - * mixing with BH RCU lock doesn't work. + /* Preempt disable and BH disable are needed to protect per-cpu + * redirect_info between BPF prog and skb_do_redirect(). */ preempt_disable(); + local_bh_disable(); bpf_compute_data_pointers(skb); ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -82,6 +81,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt, break; }
+ local_bh_enable(); preempt_enable();
return ret;
From: Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com
stable inclusion from linux-4.19.167 commit 7d6eaa841a0077951c556689a03eb5fcd1da624d
--------------------------------
[ Upstream commit 3ac874fa84d1baaf0c0175f2a1499f5d88d528b2 ]
When removing VFs for a PF that was added to a bridge, an I40E_AQ_RC_EINVAL error occurred. It was caused by not properly resetting and reinitializing the PF when adding/removing VFs. Change how the reset is performed when adding/removing VFs so that the PF's VSI is properly reinitialized.
Fixes: fc60861e9b00 ("i40e: start up in VEPA mode by default") Signed-off-by: Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com Tested-by: Konrad Jankowski konrad0.jankowski@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/intel/i40e/i40e.h | 3 +++ drivers/net/ethernet/intel/i40e/i40e_main.c | 10 ++++++++++ drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 4 ++-- 3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index 1752fea024eb..6fcf51baca13 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -127,6 +127,7 @@ enum i40e_state_t { __I40E_RESET_INTR_RECEIVED, __I40E_REINIT_REQUESTED, __I40E_PF_RESET_REQUESTED, + __I40E_PF_RESET_AND_REBUILD_REQUESTED, __I40E_CORE_RESET_REQUESTED, __I40E_GLOBAL_RESET_REQUESTED, __I40E_EMP_RESET_REQUESTED, @@ -154,6 +155,8 @@ enum i40e_state_t { };
#define I40E_PF_RESET_FLAG BIT_ULL(__I40E_PF_RESET_REQUESTED) +#define I40E_PF_RESET_AND_REBUILD_FLAG \ + BIT_ULL(__I40E_PF_RESET_AND_REBUILD_REQUESTED)
/* VSI state flags */ enum i40e_vsi_state_t { diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 985601a65def..a728b6a7872c 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -42,6 +42,8 @@ static int i40e_setup_misc_vector(struct i40e_pf *pf); static void i40e_determine_queue_usage(struct i40e_pf *pf); static int i40e_setup_pf_filter_control(struct i40e_pf *pf); static void i40e_prep_for_reset(struct i40e_pf *pf, bool lock_acquired); +static void i40e_reset_and_rebuild(struct i40e_pf *pf, bool reinit, + bool lock_acquired); static int i40e_reset(struct i40e_pf *pf); static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired); static void i40e_fdir_sb_setup(struct i40e_pf *pf); @@ -7929,6 +7931,14 @@ void i40e_do_reset(struct i40e_pf *pf, u32 reset_flags, bool lock_acquired) dev_dbg(&pf->pdev->dev, "PFR requested\n"); i40e_handle_reset_warning(pf, lock_acquired);
+ } else if (reset_flags & I40E_PF_RESET_AND_REBUILD_FLAG) { + /* Request a PF Reset + * + * Resets PF and reinitializes PFs VSI. + */ + i40e_prep_for_reset(pf, lock_acquired); + i40e_reset_and_rebuild(pf, true, lock_acquired); + } else if (reset_flags & BIT_ULL(__I40E_REINIT_REQUESTED)) { int v;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index 5b2395bd6ed8..9154abe13b93 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -1573,7 +1573,7 @@ int i40e_pci_sriov_configure(struct pci_dev *pdev, int num_vfs) if (num_vfs) { if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) { pf->flags |= I40E_FLAG_VEB_MODE_ENABLED; - i40e_do_reset_safe(pf, I40E_PF_RESET_FLAG); + i40e_do_reset_safe(pf, I40E_PF_RESET_AND_REBUILD_FLAG); } ret = i40e_pci_sriov_enable(pdev, num_vfs); goto sriov_configure_out; @@ -1582,7 +1582,7 @@ int i40e_pci_sriov_configure(struct pci_dev *pdev, int num_vfs) if (!pci_vfs_assigned(pf->pdev)) { i40e_free_vfs(pf); pf->flags &= ~I40E_FLAG_VEB_MODE_ENABLED; - i40e_do_reset_safe(pf, I40E_PF_RESET_FLAG); + i40e_do_reset_safe(pf, I40E_PF_RESET_AND_REBUILD_FLAG); } else { dev_warn(&pdev->dev, "Unable to free VFs because some are assigned to VMs.\n"); ret = -EINVAL;
From: Antoine Tenart atenart@kernel.org
stable inclusion from linux-4.19.167 commit bbebd73855f1778b8fd0e6f76aa627b9448c3dad
--------------------------------
[ Upstream commit 1ad58225dba3f2f598d2c6daed4323f24547168f ]
Two race conditions can be triggered when storing xps cpus, resulting in various oops and invalid memory accesses:
1. Calling netdev_set_num_tc while netif_set_xps_queue:
- netif_set_xps_queue uses dev->num_tc as one of the parameters to compute the size of new_dev_maps when allocating it. dev->num_tc is also used to access the map, and the compiler may generate code that retrieves this field multiple times in the function.
- netdev_set_num_tc sets dev->num_tc.
If new_dev_maps is allocated using dev->num_tc and then dev->num_tc is set to a higher value through netdev_set_num_tc, later accesses to new_dev_maps in netif_set_xps_queue could lead to accessing memory outside of new_dev_maps, triggering an oops.
2. Calling netif_set_xps_queue while netdev_set_num_tc is running:
2.1. netdev_set_num_tc starts by resetting the xps queues; dev->num_tc isn't updated yet.
2.2. netif_set_xps_queue is called, setting up the map with the *old* dev->num_tc.
2.3. netdev_set_num_tc updates dev->num_tc.
2.4. Later accesses to the map lead to out of bound accesses and oops.
A similar issue can be found with netdev_reset_tc.
One way of triggering this is to set an iface up (for which the driver uses netdev_set_num_tc in the open path, such as bnx2x) and writing to xps_cpus in a concurrent thread. With the right timing an oops is triggered.
Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc and netdev_reset_tc should be mutually exclusive. We do that by taking the rtnl lock in xps_cpus_store.
Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes") Signed-off-by: Antoine Tenart atenart@kernel.org Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/net-sysfs.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 001d7f07e780..cf0704134c05 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1323,7 +1323,13 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue, return err; }
+ if (!rtnl_trylock()) { + free_cpumask_var(mask); + return restart_syscall(); + } + err = netif_set_xps_queue(dev, mask, index); + rtnl_unlock();
free_cpumask_var(mask);
From: Antoine Tenart atenart@kernel.org
stable inclusion from linux-4.19.167 commit bd7233f43b4526ccf7ef34e9e244dd56926f32c6
--------------------------------
[ Upstream commit fb25038586d0064123e393cadf1fadd70a9df97a ]
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't see an actual bug being triggered, but let's be safe here and take the rtnl lock while accessing the map in sysfs.
Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes") Signed-off-by: Antoine Tenart atenart@kernel.org Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/net-sysfs.c | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index cf0704134c05..574f09a544c0 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1244,8 +1244,8 @@ static const struct attribute_group dql_group = { static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf) { + int cpu, len, ret, num_tc = 1, tc = 0; struct net_device *dev = queue->dev; - int cpu, len, num_tc = 1, tc = 0; struct xps_dev_maps *dev_maps; cpumask_var_t mask; unsigned long index; @@ -1255,22 +1255,31 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
index = get_netdev_queue_index(queue);
+ if (!rtnl_trylock()) + return restart_syscall(); + if (dev->num_tc) { /* Do not allow XPS on subordinate device directly */ num_tc = dev->num_tc; - if (num_tc < 0) - return -EINVAL; + if (num_tc < 0) { + ret = -EINVAL; + goto err_rtnl_unlock; + }
/* If queue belongs to subordinate dev use its map */ dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
tc = netdev_txq_to_tc(dev, index); - if (tc < 0) - return -EINVAL; + if (tc < 0) { + ret = -EINVAL; + goto err_rtnl_unlock; + } }
- if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) - return -ENOMEM; + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_rtnl_unlock; + }
rcu_read_lock(); dev_maps = rcu_dereference(dev->xps_cpus_map); @@ -1293,9 +1302,15 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue, } rcu_read_unlock();
+ rtnl_unlock(); + len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask)); free_cpumask_var(mask); return len < PAGE_SIZE ? len : -EINVAL; + +err_rtnl_unlock: + rtnl_unlock(); + return ret; }
static ssize_t xps_cpus_store(struct netdev_queue *queue,
From: Guillaume Nault gnault@redhat.com
stable inclusion from linux-4.19.167 commit d45a1c9d0438bc526173674b86a35b0b56d12b98
--------------------------------
[ Upstream commit 21fdca22eb7df2a1e194b8adb812ce370748b733 ]
RT_TOS() only clears one of the ECN bits. Therefore, when fib_compute_spec_dst() resorts to a fib lookup, it can return different results depending on the value of the second ECN bit.
For example, ECT(0) and ECT(1) packets could be treated differently.
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up

$ ip -netns ns0 address add 192.0.2.10/24 dev veth01
$ ip -netns ns1 address add 192.0.2.11/24 dev veth10

$ ip -netns ns1 address add 192.0.2.21/32 dev lo
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
$ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21 (ping uses -Q to set all TOS and ECN bits):
$ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11 because the "tos 4" route isn't matched:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
After this patch the ECN bits don't affect the result anymore:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
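To show the masking difference in isolation, here is a small userspace sketch (the IPTOS_* values mirror include/uapi/linux/ip.h; the two inputs correspond to the -Q 5 and -Q 6 pings above). RT_TOS() keeps one ECN bit in the lookup key, while IPTOS_RT_MASK strips both:

#include <stdio.h>

#define IPTOS_TOS_MASK  0x1E                    /* as in include/uapi/linux/ip.h */
#define IPTOS_RT_MASK   (IPTOS_TOS_MASK & ~3)
#define RT_TOS(tos)     ((tos) & IPTOS_TOS_MASK)

int main(void)
{
    unsigned int ect1 = 0x05;   /* TOS 4 + ECT(1), i.e. ping -Q 5 */
    unsigned int ect0 = 0x06;   /* TOS 4 + ECT(0), i.e. ping -Q 6 */

    printf("RT_TOS:        ECT(1) -> %#x, ECT(0) -> %#x\n",
           RT_TOS(ect1), RT_TOS(ect0));                 /* 0x4 vs 0x6 */
    printf("IPTOS_RT_MASK: ECT(1) -> %#x, ECT(0) -> %#x\n",
           ect1 & IPTOS_RT_MASK, ect0 & IPTOS_RT_MASK); /* 0x4, 0x4 */
    return 0;
}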
Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.") Signed-off-by: Guillaume Nault gnault@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/fib_frontend.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 7f4ec36e5f70..b96aa88087be 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -302,7 +302,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb) .flowi4_iif = LOOPBACK_IFINDEX, .flowi4_oif = l3mdev_master_ifindex_rcu(dev), .daddr = ip_hdr(skb)->saddr, - .flowi4_tos = RT_TOS(ip_hdr(skb)->tos), + .flowi4_tos = ip_hdr(skb)->tos & IPTOS_RT_MASK, .flowi4_scope = scope, .flowi4_mark = vmark ? skb->mark : 0, };
From: Yunjian Wang wangyunjian@huawei.com
stable inclusion from linux-4.19.167 commit 6928179970635accad665f34fbdaef3113b5cee0
--------------------------------
[ Upstream commit 5ede3ada3da7f050519112b81badc058190b9f9f ]
The function skb_copy() could return NULL, so the return value needs to be checked.
Fixes: b5996f11ea54 ("net: add Hisilicon Network Subsystem basic ethernet support") Signed-off-by: Yunjian Wang wangyunjian@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/hisilicon/hns/hns_ethtool.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c index b617a25e6f45..497e03525576 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c @@ -419,6 +419,10 @@ static void __lb_other_process(struct hns_nic_ring_data *ring_data, /* for mutl buffer*/ new_skb = skb_copy(skb, GFP_ATOMIC); dev_kfree_skb_any(skb); + if (!new_skb) { + netdev_err(ndev, "skb alloc failed\n"); + return; + } skb = new_skb;
check_ok = 0;
From: Cong Wang cong.wang@bytedance.com
stable inclusion from linux-4.19.167 commit 5a4fbd8b8a9d65207ac47079f0c5574673bb8773
--------------------------------
[ Upstream commit 085c7c4e1c0e50d90b7d90f61a12e12b317a91e2 ]
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not have an erspan header. So the check in gre_parse_header() is wrong; we have to distinguish version 1 from version 0.
We can just check the gre header length like is_erspan_type1().
Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup") Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com Cc: William Tu u9012063@gmail.com Cc: Lorenzo Bianconi lorenzo.bianconi@redhat.com Signed-off-by: Cong Wang cong.wang@bytedance.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/gre_demux.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c index ad9ea82daeb3..9376b30cf626 100644 --- a/net/ipv4/gre_demux.c +++ b/net/ipv4/gre_demux.c @@ -133,7 +133,7 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi, * to 0 and sets the configured key in the * inner erspan header field */ - if (greh->protocol == htons(ETH_P_ERSPAN) || + if ((greh->protocol == htons(ETH_P_ERSPAN) && hdr_len != 4) || greh->protocol == htons(ETH_P_ERSPAN2)) { struct erspan_base_hdr *ershdr;
From: Randy Dunlap rdunlap@infradead.org
stable inclusion from linux-4.19.167 commit e94775ddeed0d27723cc204c590c0e1655662b7b
--------------------------------
[ Upstream commit bd1248f1ddbc48b0c30565fce897a3b6423313b8 ]
Check Scell_log shift size in red_check_params() and modify all callers of red_check_params() to pass Scell_log.
This prevents a shift out-of-bounds as detected by UBSAN: UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22 shift exponent 72 is too large for 32-bit type 'int'
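For illustration, a userspace restatement of the patched check (fls32() stands in for the kernel's fls(), and the numeric arguments are arbitrary); a Scell_log of 72 is now rejected before it can feed a 32-bit shift:

#include <stdbool.h>
#include <stdio.h>

static int fls32(unsigned int x)    /* stand-in for the kernel's fls() */
{
    return x ? 32 - __builtin_clz(x) : 0;
}

/* userspace restatement of the patched check in include/net/red.h */
static bool red_check_params(unsigned int qth_min, unsigned int qth_max,
                             unsigned char Wlog, unsigned char Scell_log)
{
    if (fls32(qth_min) + Wlog > 32)
        return false;
    if (fls32(qth_max) + Wlog > 32)
        return false;
    if (Scell_log >= 32)            /* 1 << Scell_log would be undefined */
        return false;
    if (qth_max < qth_min)
        return false;
    return true;
}

int main(void)
{
    printf("Scell_log 10: %d\n", red_check_params(1000, 2000, 1, 10)); /* 1 */
    printf("Scell_log 72: %d\n", red_check_params(1000, 2000, 1, 72)); /* 0 */
    return 0;
}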
Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values") Signed-off-by: Randy Dunlap rdunlap@infradead.org Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com Cc: Nogah Frankel nogahf@mellanox.com Cc: Jamal Hadi Salim jhs@mojatatu.com Cc: Cong Wang xiyou.wangcong@gmail.com Cc: Jiri Pirko jiri@resnulli.us Cc: netdev@vger.kernel.org Cc: "David S. Miller" davem@davemloft.net Cc: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/net/red.h | 4 +++- net/sched/sch_choke.c | 2 +- net/sched/sch_gred.c | 2 +- net/sched/sch_red.c | 2 +- net/sched/sch_sfq.c | 2 +- 5 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/net/red.h b/include/net/red.h index 9665582c4687..e21e7fd4fe07 100644 --- a/include/net/red.h +++ b/include/net/red.h @@ -168,12 +168,14 @@ static inline void red_set_vars(struct red_vars *v) v->qcount = -1; }
-static inline bool red_check_params(u32 qth_min, u32 qth_max, u8 Wlog) +static inline bool red_check_params(u32 qth_min, u32 qth_max, u8 Wlog, u8 Scell_log) { if (fls(qth_min) + Wlog > 32) return false; if (fls(qth_max) + Wlog > 32) return false; + if (Scell_log >= 32) + return false; if (qth_max < qth_min) return false; return true; diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c index 63bfceeb8e3c..d058397d284b 100644 --- a/net/sched/sch_choke.c +++ b/net/sched/sch_choke.c @@ -371,7 +371,7 @@ static int choke_change(struct Qdisc *sch, struct nlattr *opt,
ctl = nla_data(tb[TCA_CHOKE_PARMS]);
- if (!red_check_params(ctl->qth_min, ctl->qth_max, ctl->Wlog)) + if (!red_check_params(ctl->qth_min, ctl->qth_max, ctl->Wlog, ctl->Scell_log)) return -EINVAL;
if (ctl->limit > CHOKE_MAX_QUEUE) diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c index 4a042abf844c..6f94bca75520 100644 --- a/net/sched/sch_gred.c +++ b/net/sched/sch_gred.c @@ -357,7 +357,7 @@ static inline int gred_change_vq(struct Qdisc *sch, int dp, struct gred_sched *table = qdisc_priv(sch); struct gred_sched_data *q = table->tab[dp];
- if (!red_check_params(ctl->qth_min, ctl->qth_max, ctl->Wlog)) + if (!red_check_params(ctl->qth_min, ctl->qth_max, ctl->Wlog, ctl->Scell_log)) return -EINVAL;
if (!q) { diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c index 56c181c3feeb..a3dc2118539b 100644 --- a/net/sched/sch_red.c +++ b/net/sched/sch_red.c @@ -214,7 +214,7 @@ static int red_change(struct Qdisc *sch, struct nlattr *opt, max_P = tb[TCA_RED_MAX_P] ? nla_get_u32(tb[TCA_RED_MAX_P]) : 0;
ctl = nla_data(tb[TCA_RED_PARMS]); - if (!red_check_params(ctl->qth_min, ctl->qth_max, ctl->Wlog)) + if (!red_check_params(ctl->qth_min, ctl->qth_max, ctl->Wlog, ctl->Scell_log)) return -EINVAL;
if (ctl->limit > 0) { diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c index b89cf0971d3d..74f697c4a4d3 100644 --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -651,7 +651,7 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt) }
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, - ctl_v1->Wlog)) + ctl_v1->Wlog, ctl_v1->Scell_log)) return -EINVAL; if (ctl_v1 && ctl_v1->qth_min) { p = kmalloc(sizeof(*p), GFP_KERNEL);
From: Antoine Tenart atenart@kernel.org
stable inclusion from linux-4.19.167 commit c2bda35d36bf91784e7cabd73a64d583bae47d15
--------------------------------
[ Upstream commit 2d57b4f142e0b03e854612b8e28978935414bced ]
Two race conditions can be triggered when storing xps rxqs, resulting in various oops and invalid memory accesses:
1. Calling netdev_set_num_tc while netif_set_xps_queue:
- netif_set_xps_queue uses dev->num_tc as one of the parameters to compute the size of new_dev_maps when allocating it. dev->num_tc is also used to access the map, and the compiler may generate code to retrieve this field multiple times in the function.
- netdev_set_num_tc sets dev->num_tc.
If new_dev_maps is allocated using dev->num_tc and then dev->num_tc is set to a higher value through netdev_set_num_tc, later accesses to new_dev_maps in netif_set_xps_queue could lead to accessing memory outside of new_dev_maps, triggering an oops.
2. Calling netif_set_xps_queue while netdev_set_num_tc is running:
2.1. netdev_set_num_tc starts by resetting the xps queues; dev->num_tc isn't updated yet.
2.2. netif_set_xps_queue is called, setting up the map with the *old* dev->num_tc.
2.3. netdev_set_num_tc updates dev->num_tc.
2.4. Later accesses to the map lead to out-of-bounds accesses and oops.
A similar issue can be found with netdev_reset_tc.
One way of triggering this is to set an iface up (for which the driver uses netdev_set_num_tc in the open path, such as bnx2x) and writing to xps_rxqs in a concurrent thread. With the right timing an oops is triggered.
Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc and netdev_reset_tc should be mutually exclusive. We do that by taking the rtnl lock in xps_rxqs_store.
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue") Signed-off-by: Antoine Tenart atenart@kernel.org Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/net-sysfs.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 574f09a544c0..0f08dcb94f10 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1428,10 +1428,17 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf, return err; }
+ if (!rtnl_trylock()) { + bitmap_free(mask); + return restart_syscall(); + } + cpus_read_lock(); err = __netif_set_xps_queue(dev, mask, index, true); cpus_read_unlock();
+ rtnl_unlock(); + kfree(mask); return err ? : len; }
From: Antoine Tenart atenart@kernel.org
stable inclusion from linux-4.19.167 commit e552fbd60944d6ae3c1942d9cf5eae88112d36eb
--------------------------------
[ Upstream commit 4ae2bb81649dc03dfc95875f02126b14b773f7ab ]
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't see an actual bug being triggered, but let's be safe here and take the rtnl lock while accessing the map in sysfs.
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue") Signed-off-by: Antoine Tenart atenart@kernel.org Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/net-sysfs.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 0f08dcb94f10..fe0d255d66c8 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1356,23 +1356,30 @@ static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf) { + int j, len, ret, num_tc = 1, tc = 0; struct net_device *dev = queue->dev; struct xps_dev_maps *dev_maps; unsigned long *mask, index; - int j, len, num_tc = 1, tc = 0;
index = get_netdev_queue_index(queue);
+ if (!rtnl_trylock()) + return restart_syscall(); + if (dev->num_tc) { num_tc = dev->num_tc; tc = netdev_txq_to_tc(dev, index); - if (tc < 0) - return -EINVAL; + if (tc < 0) { + ret = -EINVAL; + goto err_rtnl_unlock; + } } mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), GFP_KERNEL); - if (!mask) - return -ENOMEM; + if (!mask) { + ret = -ENOMEM; + goto err_rtnl_unlock; + }
rcu_read_lock(); dev_maps = rcu_dereference(dev->xps_rxqs_map); @@ -1398,10 +1405,16 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf) out_no_maps: rcu_read_unlock();
+ rtnl_unlock(); + len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues); kfree(mask);
return len < PAGE_SIZE ? len : -EINVAL; + +err_rtnl_unlock: + rtnl_unlock(); + return ret; }
static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
From: Subash Abhinov Kasiviswanathan subashab@codeaurora.org
stable inclusion from linux-4.19.167 commit 954cc63f5d49851073c508d596af054da59505ab
--------------------------------
commit 443d6e86f821a165fae3fc3fc13086d27ac140b1 upstream.
This fixes the dereference to fetch the RCU pointer when holding the appropriate xtables lock.
Reported-by: kernel test robot lkp@intel.com Fixes: cc00bcaa5899 ("netfilter: x_tables: Switch synchronization to RCU") Signed-off-by: Subash Abhinov Kasiviswanathan subashab@codeaurora.org Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/netfilter/arp_tables.c | 2 +- net/ipv4/netfilter/ip_tables.c | 2 +- net/ipv6/netfilter/ip6_tables.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index ca20efe775ee..a2cae543a285 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -1405,7 +1405,7 @@ static int compat_get_entries(struct net *net, xt_compat_lock(NFPROTO_ARP); t = xt_find_table_lock(net, NFPROTO_ARP, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); struct xt_table_info info;
ret = compat_table_info(private, &info); diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 115d48049686..6672172a7512 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -1619,7 +1619,7 @@ compat_get_entries(struct net *net, struct compat_ipt_get_entries __user *uptr, xt_compat_lock(AF_INET); t = xt_find_table_lock(net, AF_INET, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); struct xt_table_info info; ret = compat_table_info(private, &info); if (!ret && get.size == info.size) diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index b1441349e151..3b067d5a62ee 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -1628,7 +1628,7 @@ compat_get_entries(struct net *net, struct compat_ip6t_get_entries __user *uptr, xt_compat_lock(AF_INET6); t = xt_find_table_lock(net, AF_INET6, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = t->private; + const struct xt_table_info *private = xt_table_get_private_protected(t); struct xt_table_info info; ret = compat_table_info(private, &info); if (!ret && get.size == info.size)
From: Vasily Averin vvs@virtuozzo.com
stable inclusion from linux-4.19.167 commit 1406d39993ea3939829c589bc08dba4cdeb662d7
--------------------------------
commit 5c8193f568ae16f3242abad6518dc2ca6c8eef86 upstream.
htable_bits() can call jhash_size(32) and trigger shift-out-of-bounds
UBSAN: shift-out-of-bounds in net/netfilter/ipset/ip_set_hash_gen.h:151:6 shift exponent 32 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 8498 Comm: syz-executor519 Not tainted 5.10.0-rc7-next-20201208-syzkaller #0 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 htable_bits net/netfilter/ipset/ip_set_hash_gen.h:151 [inline] hash_mac_create.cold+0x58/0x9b net/netfilter/ipset/ip_set_hash_gen.h:1524 ip_set_create+0x610/0x1380 net/netfilter/ipset/ip_set_core.c:1115 nfnetlink_rcv_msg+0xecc/0x1180 net/netfilter/nfnetlink.c:252 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:600 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2345 ___sys_sendmsg+0xf3/0x170 net/socket.c:2399 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9
This patch replaces htable_bits() by a simple fls(hashsize - 1) call: it alone returns valid nbits both for round and non-round hashsizes. It is fine to set any nbits here because it is validated inside the following htable_size() call, which returns 0 for nbits>31.
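A small userspace sketch of the computation (fls32() again stands in for the kernel's fls()): round hashsizes keep their exact nbits, and non-round ones are rounded up to the next power of two:

#include <stdio.h>

static int fls32(unsigned int x)    /* stand-in for the kernel's fls() */
{
    return x ? 32 - __builtin_clz(x) : 0;
}

int main(void)
{
    unsigned int sizes[] = { 64, 1000, 1024, 1025 };
    unsigned int i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("hashsize %u -> hbits %d (table size %u)\n",
               sizes[i], fls32(sizes[i] - 1), 1u << fls32(sizes[i] - 1));
    /* 64 -> 6, 1000 -> 10, 1024 -> 10, 1025 -> 11 */
    return 0;
}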
Fixes: 1feab10d7e6d("netfilter: ipset: Unified hash type generation") Reported-by: syzbot+d66bfadebca46cf61a2b@syzkaller.appspotmail.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Acked-by: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/ipset/ip_set_hash_gen.h | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-)
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h index ddfe06d7530b..154b40f2f2a8 100644 --- a/net/netfilter/ipset/ip_set_hash_gen.h +++ b/net/netfilter/ipset/ip_set_hash_gen.h @@ -115,20 +115,6 @@ htable_size(u8 hbits) return hsize * sizeof(struct hbucket *) + sizeof(struct htable); }
-/* Compute htable_bits from the user input parameter hashsize */ -static u8 -htable_bits(u32 hashsize) -{ - /* Assume that hashsize == 2^htable_bits */ - u8 bits = fls(hashsize - 1); - - if (jhash_size(bits) != hashsize) - /* Round up to the first 2^n value */ - bits = fls(hashsize); - - return bits; -} - #ifdef IP_SET_HASH_WITH_NETS #if IPSET_NET_COUNT > 1 #define __CIDR(cidr, i) (cidr[i]) @@ -1287,7 +1273,11 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set, if (!h) return -ENOMEM;
- hbits = htable_bits(hashsize); + /* Compute htable_bits from the user input parameter hashsize. + * Assume that hashsize == 2^htable_bits, + * otherwise round up to the first 2^n value. + */ + hbits = fls(hashsize - 1); hsize = htable_size(hbits); if (hsize == 0) { kfree(h);
From: Florian Westphal fw@strlen.de
stable inclusion from linux-4.19.167 commit e810ec8e6216297c47765b0d031f13a3250a2bad
--------------------------------
commit 6cb56218ad9e580e519dcd23bfb3db08d8692e5a upstream.
syzbot reports: detected buffer overflow in strlen [..] Call Trace: strlen include/linux/string.h:325 [inline] strlcpy include/linux/string.h:348 [inline] xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143
strlcpy assumes src is a c-string. Check info->name before it's used.
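The idea, sketched in userspace with a hypothetical 16-byte destination: bound the length scan to the destination size and refuse the copy when no NUL terminator is found within it:

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define EST_NAME_LEN 16     /* hypothetical destination size */

static int check_and_copy(char *dst, const char *src)
{
    /* reject names with no NUL terminator within the destination size */
    if (strnlen(src, EST_NAME_LEN) >= EST_NAME_LEN)
        return -ENAMETOOLONG;
    strcpy(dst, src);       /* now known to fit, terminator included */
    return 0;
}

int main(void)
{
    char dst[EST_NAME_LEN];
    char bogus[32];

    memset(bogus, 'A', sizeof(bogus));  /* not a C string */
    printf("short name: %d\n", check_and_copy(dst, "eth0-est")); /* 0 */
    printf("bogus name: %d\n", check_and_copy(dst, bogus));      /* -ENAMETOOLONG */
    return 0;
}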
Reported-by: syzbot+e86f7c428c8c50db65b4@syzkaller.appspotmail.com Fixes: 5859034d7eb8793 ("[NETFILTER]: x_tables: add RATEEST target") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/xt_RATEEST.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c index 9e05c86ba5c4..932c0ae99bdc 100644 --- a/net/netfilter/xt_RATEEST.c +++ b/net/netfilter/xt_RATEEST.c @@ -118,6 +118,9 @@ static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par) } cfg; int ret;
+ if (strnlen(info->name, sizeof(est->name)) >= sizeof(est->name)) + return -ENAMETOOLONG; + net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
mutex_lock(&xn->hash_lock);
From: Jakub Kicinski kuba@kernel.org
stable inclusion from linux-4.19.168 commit 78f33b3eb6088b5add645ef215af3a68869a7505
--------------------------------
[ Upstream commit 55b7ab1178cbf41f979ff83236d3321ad35ed2ad ]
VLAN checks for NETREG_UNINITIALIZED to distinguish between registration failure and unregistration in progress.
Since commit cb626bf566eb ("net-sysfs: Fix reference count leak") registration failure may, however, result in NETREG_UNREGISTERED as well as NETREG_UNINITIALIZED.
This fix is similar to cebb69754f37 ("rtnetlink: Fix memory(net_device) leak when ->newlink fails").
Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak") Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/8021q/vlan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index 5e9950453955..512ada90657b 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -277,7 +277,8 @@ static int register_vlan_device(struct net_device *real_dev, u16 vlan_id) return 0;
out_free_newdev: - if (new_dev->reg_state == NETREG_UNINITIALIZED) + if (new_dev->reg_state == NETREG_UNINITIALIZED || + new_dev->reg_state == NETREG_UNREGISTERED) free_netdev(new_dev); return err; }
From: Florian Westphal fw@strlen.de
stable inclusion from linux-4.19.168 commit c147585c97195f3462246b19e0cf380d7b8ddd7c
--------------------------------
[ Upstream commit bb4cc1a18856a73f0ff5137df0c2a31f4c50f6cf ]
Conntrack reassembly records the largest fragment size seen in IPCB. However, when this gets forwarded/transmitted, fragmentation will only be forced if one of the fragmented packets had the DF bit set.
In that case, a flag in IPCB will force fragmentation even if the MTU is large enough.
This should work fine, but it breaks with ip tunnels. Consider a client that sends a UDP datagram of size X to another host.
The client fragments the datagram, so two packets, of size y and z, are sent. DF bit is not set on any of these packets.
Middlebox netfilter reassembles those packets back to single size-X packet, before routing decision.
packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit isn't set. At output time, ip refragmentation is skipped as well because x is still smaller than the mtu of the output device.
If the transmit device is an ip tunnel, the packet size increases to x+overhead.
Also, tunnel might be configured to force DF bit on outer header.
In this case, packet will be dropped (exceeds MTU) and an ICMP error is generated back to sender.
But the sender already respects the announced MTU; all the packets that it sent did fit the announced mtu.
Force refragmentation as per original sizes unconditionally so ip tunnel will encapsulate the fragments instead.
The only other solution I see is to place ip refragmentation in the ip_tunnel code to handle this case.
Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet") Reported-by: Christian Perle christian.perle@secunet.com Signed-off-by: Florian Westphal fw@strlen.de Acked-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/ip_output.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index f0faf1193dd8..e411c42d8428 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -312,7 +312,7 @@ static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *sk if (skb_is_gso(skb)) return ip_finish_output_gso(net, sk, skb, mtu);
- if (skb->len > mtu || (IPCB(skb)->flags & IPSKB_FRAG_PMTU)) + if (skb->len > mtu || IPCB(skb)->frag_max_size) return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
return ip_finish_output2(net, sk, skb);
From: Florian Westphal fw@strlen.de
stable inclusion from linux-4.19.168 commit 8593ed480c1fbf06b680608df14199999b330538
--------------------------------
[ Upstream commit 50c661670f6a3908c273503dfa206dfc7aa54c07 ]
For some reason ip_tunnel insists on setting the DF bit anyway when the inner header has the DF bit set, EVEN if the tunnel was configured with 'nopmtudisc'.
This means that the script added in the previous commit cannot be made to work by adding the 'nopmtudisc' flag to the ip tunnel configuration. Doing so breaks connectivity even for the without-conntrack/netfilter scenario.
When nopmtudisc is set, the tunnel will skip the mtu check, so no icmp error is sent to the client. Then, because the inner header has DF set, the outer header gets added with the DF bit set as well.
IP stack then sends an error to itself because the packet exceeds the device MTU.
Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.") Cc: Stefano Brivio sbrivio@redhat.com Signed-off-by: Florian Westphal fw@strlen.de Acked-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/ip_tunnel.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 375d0e516d85..1cad731039c3 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -736,7 +736,11 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, goto tx_error; }
- if (tnl_update_pmtu(dev, skb, rt, tnl_params->frag_off, inner_iph)) { + df = tnl_params->frag_off; + if (skb->protocol == htons(ETH_P_IP) && !tunnel->ignore_df) + df |= (inner_iph->frag_off & htons(IP_DF)); + + if (tnl_update_pmtu(dev, skb, rt, df, inner_iph)) { ip_rt_put(rt); goto tx_error; } @@ -764,10 +768,6 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, ttl = ip4_dst_hoplimit(&rt->dst); }
- df = tnl_params->frag_off; - if (skb->protocol == htons(ETH_P_IP) && !tunnel->ignore_df) - df |= (inner_iph->frag_off&htons(IP_DF)); - max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr) + rt->dst.header_len + ip_encap_hlen(&tunnel->encap); if (max_headroom > dev->needed_headroom)
From: Sean Tranchetti stranche@codeaurora.org
stable inclusion from linux-4.19.168 commit 8981235e1a4696679605870272c8688a08eb7a2b
--------------------------------
[ Upstream commit d8f5c29653c3f6995e8979be5623d263e92f6b86 ]
Route removal is handled by two code paths. The main removal path is via fib6_del_route() which will handle purging any PMTU exceptions from the cache, removing all per-cpu copies of the DST entry used by the route, and releasing the fib6_info struct.
The second removal location is during fib6_add_rt2node() during a route replacement operation. This path also calls fib6_purge_rt() to handle cleaning up the per-cpu copies of the DST entries and releasing the fib6_info associated with the older route, but it does not flush any PMTU exceptions that the older route had. Since the older route is removed from the tree during the replacement, we lose any way of accessing it again.
As these lingering DSTs and the fib6_info struct are holding references to the underlying netdevice struct as well, unregistering that device from the kernel can never complete.
Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache") Signed-off-by: Sean Tranchetti stranche@codeaurora.org Reviewed-by: David Ahern dsahern@kernel.org Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/ip6_fib.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 8b5459b34bc4..e0e464b72c1f 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -906,6 +906,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn, { struct fib6_table *table = rt->fib6_table;
+ /* Flush all cached dst in exception table */ + rt6_flush_exceptions(rt); if (rt->rt6i_pcpu) fib6_drop_pcpu_from(rt, table);
@@ -1756,9 +1758,6 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn, net->ipv6.rt6_stats->fib_rt_entries--; net->ipv6.rt6_stats->fib_discarded_routes++;
- /* Flush all cached dst in exception table */ - rt6_flush_exceptions(rt); - /* Reset round-robin state, if necessary */ if (rcu_access_pointer(fn->rr_ptr) == rt) fn->rr_ptr = NULL;
From: Dinghao Liu dinghao.liu@zju.edu.cn
stable inclusion from linux-4.19.168 commit 778e1171b4dccd6706e2b8717379c4dd675149be
--------------------------------
commit 5b0bb12c58ac7d22e05b5bfdaa30a116c8c32e32 upstream.
When mlx5_create_flow_group() fails, ft->g should be freed just like when kvzalloc() fails. The caller of mlx5e_create_l2_table_groups() does not catch this issue on failure, which leads to a memory leak.
Fixes: 33cfaaa8f36f ("net/mlx5e: Split the main flow steering table") Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlx5/core/en_fs.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c index 7ddacc9e4fe4..96bbe465320a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c @@ -1312,6 +1312,7 @@ static int mlx5e_create_l2_table_groups(struct mlx5e_l2_table *l2_table) ft->g[ft->num_groups] = NULL; mlx5e_destroy_groups(ft); kvfree(in); + kfree(ft->g);
return err; }
From: Dinghao Liu dinghao.liu@zju.edu.cn
stable inclusion from linux-4.19.168 commit 087ca73fc5d2bc8ef8ab12c756c14740a91f2ae3
--------------------------------
commit 7a6eb072a9548492ead086f3e820e9aac71c7138 upstream.
mlx5e_create_ttc_table_groups() frees ft->g on failure of kvzalloc(), but such failure will be caught by its caller in mlx5e_create_ttc_table() and ft->g will be freed again in mlx5e_destroy_flow_table(). The same issue also occurs in mlx5e_create_inner_ttc_table_groups(). Set ft->g to NULL after kfree() to avoid the double free.
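A minimal userspace sketch of the pattern, with free()/malloc() standing in for kfree()/kvzalloc(): nulling the pointer right after the first free turns the later cleanup path's free into a no-op:

#include <stdlib.h>

struct flow_table {
    void *g;            /* stands in for ft->g */
};

static void destroy_flow_table(struct flow_table *ft)
{
    free(ft->g);        /* free(NULL), like kfree(NULL), is a no-op */
    ft->g = NULL;
}

int main(void)
{
    struct flow_table ft = { .g = malloc(128) };

    /* error path: release the groups array and poison the pointer */
    free(ft.g);
    ft.g = NULL;

    /* later cleanup path frees again; harmless now */
    destroy_flow_table(&ft);
    return 0;
}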
Fixes: 7b3722fa9ef6 ("net/mlx5e: Support RSS for GRE tunneled packets") Fixes: 33cfaaa8f36f ("net/mlx5e: Split the main flow steering table") Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlx5/core/en_fs.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c index 96bbe465320a..c6eea6b6b1bb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c @@ -893,6 +893,7 @@ static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc, in = kvzalloc(inlen, GFP_KERNEL); if (!in) { kfree(ft->g); + ft->g = NULL; return -ENOMEM; }
@@ -1033,6 +1034,7 @@ static int mlx5e_create_inner_ttc_table_groups(struct mlx5e_ttc_table *ttc) in = kvzalloc(inlen, GFP_KERNEL); if (!in) { kfree(ft->g); + ft->g = NULL; return -ENOMEM; }
From: Vasily Averin vvs@virtuozzo.com
stable inclusion from linux-4.19.168 commit 77ca19cf335772b79c1a76b6f108463a5402ace4
--------------------------------
commit 54970a2fbb673f090b7f02d7f57b10b2e0707155 upstream.
syzbot reproduces a BUG_ON in skb_checksum_help(): tun creates a (bogus) skb with a huge partial-checksummed area and a small ip packet inside. Then ip_rcv trims the skb based on the size of the internal ip packet; after that the csum offset points beyond the trimmed skb. Then checksum_tg(), called via a netfilter hook, triggers the BUG_ON:
offset = skb_checksum_start_offset(skb);
BUG_ON(offset >= skb_headlen(skb));
To work around the problem, this patch forces pskb_trim_rcsum_slow() to return -EINVAL in the described scenario. This allows its callers to drop such packets.
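A sketch of the added bound check with made-up offsets: if the checksum field (csum start plus csum offset) would land beyond the data left after the trim, the trim is refused:

#include <errno.h>
#include <stdio.h>

/* sketch of the added bound check; the offsets below are illustrative */
static int trim_allowed(unsigned int trim_len, unsigned int headlen,
                        unsigned int csum_start, unsigned int csum_offset)
{
    unsigned int hdlen = trim_len > headlen ? headlen : trim_len;
    unsigned int offset = csum_start + csum_offset;

    if (offset + sizeof(unsigned short) > hdlen)
        return -EINVAL;     /* checksum field lies beyond the trimmed data */
    return 0;
}

int main(void)
{
    /* checksum field at byte 40 + 16 = 56 */
    printf("trim to 128: %d\n", trim_allowed(128, 256, 40, 16));    /* 0 */
    printf("trim to  20: %d\n", trim_allowed(20, 256, 40, 16));     /* -EINVAL */
    return 0;
}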
Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd... Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Acked-by: Willem de Bruijn willemb@google.com Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/skbuff.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b5d9c9b2c702..5b87d2dd7151 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1853,6 +1853,12 @@ int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len) skb->csum = csum_block_sub(skb->csum, skb_checksum(skb, len, delta, 0), len); + } else if (skb->ip_summed == CHECKSUM_PARTIAL) { + int hdlen = (len > skb_headlen(skb)) ? skb_headlen(skb) : len; + int offset = skb_checksum_start_offset(skb) + skb->csum_offset; + + if (offset + sizeof(__sum16) > hdlen) + return -EINVAL; } return __pskb_trim(skb, len); }
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-v5.5-rc1 commit bd8b21d3dd661658addc1cd4cc869bab11d28596 category: bugfix bugzilla: 25285 CVE: NA
--------------------------------
When we load a module, we have to perform some special work for a couple of named sections. To do this, we iterate over all of the module's sections, and perform work for each section we recognize.
To make it easier to handle the unexpected absence of a section, and to make the section-specific logic easier to read, let's factor the section search into a helper. Similar is already done in the core module loader and other architectures (and ideally we'd unify these in future).
If we expect a module to have an ftrace trampoline section, but it doesn't have one, we'll now reject loading the module. When ARM64_MODULE_PLTS is selected, any correctly built module should have one (and this is assumed by arm64's ftrace PLT code) and the absence of such a section implies something has gone wrong at build time.
Subsequent patches will make use of the new helper.
Signed-off-by: Mark Rutland mark.rutland@arm.com Reviewed-by: Ard Biesheuvel ard.biesheuvel@linaro.org Reviewed-by: Torsten Duwe duwe@suse.de Tested-by: Amit Daniel Kachhap amit.kachhap@arm.com Tested-by: Torsten Duwe duwe@suse.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: James Morse james.morse@arm.com Cc: Will Deacon will@kernel.org Signed-off-by: Pu Lehui pulehui@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/module.c | 35 ++++++++++++++++++++++++++--------- 1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 834ecf108b53..55b00bfb704f 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -484,22 +484,39 @@ int apply_relocate_add(Elf64_Shdr *sechdrs, return -ENOEXEC; }
-int module_finalize(const Elf_Ehdr *hdr, - const Elf_Shdr *sechdrs, - struct module *me) +static const Elf_Shdr *find_section(const Elf_Ehdr *hdr, + const Elf_Shdr *sechdrs, + const char *name) { const Elf_Shdr *s, *se; const char *secstrs = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
for (s = sechdrs, se = sechdrs + hdr->e_shnum; s < se; s++) { - if (strcmp(".altinstructions", secstrs + s->sh_name) == 0) - apply_alternatives_module((void *)s->sh_addr, s->sh_size); + if (strcmp(name, secstrs + s->sh_name) == 0) + return s; + } + + return NULL; +} + +int module_finalize(const Elf_Ehdr *hdr, + const Elf_Shdr *sechdrs, + struct module *me) +{ + const Elf_Shdr *s; + + s = find_section(hdr, sechdrs, ".altinstructions"); + if (s) + apply_alternatives_module((void *)s->sh_addr, s->sh_size); + #ifdef CONFIG_ARM64_MODULE_PLTS - if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE) && - !strcmp(".text.ftrace_trampoline", secstrs + s->sh_name)) - me->arch.ftrace_trampoline = (void *)s->sh_addr; -#endif + if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE)) { + s = find_section(hdr, sechdrs, ".text.ftrace_trampoline"); + if (!s) + return -ENOEXEC; + me->arch.ftrace_trampoline = (void *)s->sh_addr; } +#endif
return 0; }
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-v5.5-rc1 commit f1a54ae9af0da4d76239256ed640a93ab3aadac0 category: bugfix bugzilla: 25285 CVE: NA
--------------------------------
Currently we lazily-initialize a module's ftrace PLT at runtime when we install the first ftrace call. To do so we have to apply a number of sanity checks, transiently mark the module text as RW, and perform an IPI as part of handling Neoverse-N1 erratum #1542419.
We only expect the ftrace trampoline to point at ftrace_caller() (AKA FTRACE_ADDR), so let's simplify all of this by initializing the PLT at module load time, before the module loader marks the module RO and performs the initial I-cache maintenance for the module.
Thus we can rely on the module having been correctly initialized, and can simplify the runtime work necessary to install an ftrace call in a module. This will also allow for the removal of module_disable_ro().
Tested by forcing ftrace_make_call() to use the module PLT, and then loading up a module after setting up ftrace with:
| echo ":mod:<module-name>" > set_ftrace_filter; | echo function > current_tracer; | modprobe <module-name>
Since FTRACE_ADDR is only defined when CONFIG_DYNAMIC_FTRACE is selected, we wrap its use along with most of module_init_ftrace_plt() with ifdeffery rather than using IS_ENABLED().
Signed-off-by: Mark Rutland mark.rutland@arm.com Reviewed-by: Amit Daniel Kachhap amit.kachhap@arm.com Reviewed-by: Ard Biesheuvel ard.biesheuvel@linaro.org Reviewed-by: Torsten Duwe duwe@suse.de Tested-by: Amit Daniel Kachhap amit.kachhap@arm.com Tested-by: Torsten Duwe duwe@suse.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: James Morse james.morse@arm.com Cc: Peter Zijlstra peterz@infradead.org Cc: Will Deacon will@kernel.org
Conflicts: arch/arm64/kernel/ftrace.c arch/arm64/kernel/module.c
Signed-off-by: Pu Lehui pulehui@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/ftrace.c | 50 +++++++++++--------------------------- arch/arm64/kernel/module.c | 32 +++++++++++++++--------- 2 files changed, 35 insertions(+), 47 deletions(-)
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c index b6618391be8c..4e511d0663b6 100644 --- a/arch/arm64/kernel/ftrace.c +++ b/arch/arm64/kernel/ftrace.c @@ -76,9 +76,21 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
if (offset < -SZ_128M || offset >= SZ_128M) { #ifdef CONFIG_ARM64_MODULE_PLTS - struct plt_entry trampoline, *dst; struct module *mod;
+ /* + * There is only one ftrace trampoline per module. For now, + * this is not a problem since on arm64, all dynamic ftrace + * invocations are routed via ftrace_caller(). This will need + * to be revisited if support for multiple ftrace entry points + * is added in the future, but for now, the pr_err() below + * deals with a theoretical issue only. + */ + if (addr != FTRACE_ADDR) { + pr_err("ftrace: far branches to multiple entry points unsupported inside a single module\n"); + return -EINVAL; + } + /* * On kernels that support module PLTs, the offset between the * branch instruction and its target may legally exceed the @@ -96,41 +108,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) if (WARN_ON(!mod)) return -EINVAL;
- /* - * There is only one ftrace trampoline per module. For now, - * this is not a problem since on arm64, all dynamic ftrace - * invocations are routed via ftrace_caller(). This will need - * to be revisited if support for multiple ftrace entry points - * is added in the future, but for now, the pr_err() below - * deals with a theoretical issue only. - */ - dst = mod->arch.ftrace_trampoline; - trampoline = get_plt_entry(addr); - if (!plt_entries_equal(dst, &trampoline)) { - if (!plt_entries_equal(dst, &(struct plt_entry){})) { - pr_err("ftrace: far branches to multiple entry points unsupported inside a single module\n"); - return -EINVAL; - } - - /* point the trampoline to our ftrace entry point */ - module_disable_ro(mod); - *dst = trampoline; - module_enable_ro(mod, true); - - /* - * Ensure updated trampoline is visible to instruction - * fetch before we patch in the branch. Although the - * architecture doesn't require an IPI in this case, - * Neoverse-N1 erratum #1542419 does require one - * if the TLB maintenance in module_enable_ro() is - * skipped due to rodata_enabled. It doesn't seem worth - * it to make it conditional given that this is - * certainly not a fast-path. - */ - flush_icache_range((unsigned long)&dst[0], - (unsigned long)&dst[1]); - } - addr = (unsigned long)dst; + addr = (unsigned long)mod->arch.ftrace_trampoline; #else /* CONFIG_ARM64_MODULE_PLTS */ return -EINVAL; #endif /* CONFIG_ARM64_MODULE_PLTS */ diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 55b00bfb704f..81e1eb1eae53 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -20,6 +20,7 @@
#include <linux/bitops.h> #include <linux/elf.h> +#include <linux/ftrace.h> #include <linux/gfp.h> #include <linux/kasan.h> #include <linux/kernel.h> @@ -499,24 +500,33 @@ static const Elf_Shdr *find_section(const Elf_Ehdr *hdr, return NULL; }
+static int module_init_ftrace_plt(const Elf_Ehdr *hdr, + const Elf_Shdr *sechdrs, + struct module *mod) +{ +#if defined(CONFIG_ARM64_MODULE_PLTS) && defined(CONFIG_DYNAMIC_FTRACE) + const Elf_Shdr *s; + struct plt_entry *plt; + + s = find_section(hdr, sechdrs, ".text.ftrace_trampoline"); + if (!s) + return -ENOEXEC; + + plt = (void *)s->sh_addr; + *plt = get_plt_entry(FTRACE_ADDR); + mod->arch.ftrace_trampoline = plt; +#endif + return 0; +} + int module_finalize(const Elf_Ehdr *hdr, const Elf_Shdr *sechdrs, struct module *me) { const Elf_Shdr *s; - s = find_section(hdr, sechdrs, ".altinstructions"); if (s) apply_alternatives_module((void *)s->sh_addr, s->sh_size);
-#ifdef CONFIG_ARM64_MODULE_PLTS - if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE)) { - s = find_section(hdr, sechdrs, ".text.ftrace_trampoline"); - if (!s) - return -ENOEXEC; - me->arch.ftrace_trampoline = (void *)s->sh_addr; - } -#endif - - return 0; + return module_init_ftrace_plt(hdr, sechdrs, me); }
From: Eric Auger eric.auger@redhat.com
stable inclusion from linux-4.19.165 commit 706a63e840d0ab86ecba189277f22f82e5356c63
--------------------------------
[ Upstream commit 16b8fe4caf499ae8e12d2ab1b1324497e36a7b83 ]
In case an error occurs in vfio_pci_enable() before the call to vfio_pci_probe_mmaps(), vfio_pci_disable() will try to iterate on an uninitialized list and cause a kernel panic.
Let's move the initialization to vfio_pci_probe() to fix the issue.
Signed-off-by: Eric Auger eric.auger@redhat.com Fixes: 05f0c03fbac1 ("vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive") CC: Stable stable@vger.kernel.org # v4.7+ Signed-off-by: Alex Williamson alex.williamson@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/vfio/pci/vfio_pci.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 58e7336b2748..87d586a3dee2 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -118,8 +118,6 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) int bar; struct vfio_pci_dummy_resource *dummy_res;
- INIT_LIST_HEAD(&vdev->dummy_resources_list); - for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { res = vdev->pdev->resource + bar;
@@ -1522,6 +1520,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) mutex_init(&vdev->igate); spin_lock_init(&vdev->irqlock); mutex_init(&vdev->ioeventfds_lock); + INIT_LIST_HEAD(&vdev->dummy_resources_list); INIT_LIST_HEAD(&vdev->ioeventfds_list); mutex_init(&vdev->vma_lock); INIT_LIST_HEAD(&vdev->vma_list);
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.165 commit 81629230815ff27439a61d675ad9873e93190204
--------------------------------
[ Upstream commit b08070eca9e247f60ab39d79b2c25d274750441f ]
ext4_handle_error() with errors=continue mount option can accidentally remount the filesystem read-only when the system is rebooting. Fix that.
Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara jack@suse.cz Reviewed-by: Andreas Dilger adilger@dilger.ca Cc: stable@kernel.org Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Sasha Levin sashal@kernel.org Conflicts: fs/ext4/super.c [yyl: adjust context] Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/super.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 9becf3e2e20a..15f8aeda9ee7 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -504,20 +504,17 @@ static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno)
static void ext4_handle_error(struct super_block *sb) { + journal_t *journal = EXT4_SB(sb)->s_journal; + if (test_opt(sb, WARN_ON_ERROR)) WARN_ON_ONCE(1);
- if (sb_rdonly(sb)) + if (sb_rdonly(sb) || test_opt(sb, ERRORS_CONT)) return;
- if (!test_opt(sb, ERRORS_CONT)) { - journal_t *journal = EXT4_SB(sb)->s_journal; - - EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; - if (journal) - jbd2_journal_abort(journal, -EIO); - } - + EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; + if (journal) + jbd2_journal_abort(journal, -EIO); /* * We force ERRORS_RO behavior when system is rebooting. Otherwise we * could panic during 'reboot -f' as the underlying device got already
From: Damien Le Moal damien.lemoal@wdc.com
stable inclusion from linux-4.19.165 commit 1344c5564d84ce01fdf844a6c77fa4be92b8039a
--------------------------------
commit 0ebcdd702f49aeb0ad2e2d894f8c124a0acc6e23 upstream.
A null_blk device with zoned mode enabled is currently initialized with a number of zones equal to the device capacity divided by the zone size, without considering whether the device capacity is a multiple of the zone size. If the zone size is not a divisor of the capacity, the zones end up not covering the entire capacity, potentially resulting in out-of-bounds accesses to the zone array.
Fix this by adding one last smaller zone with a size equal to the remainder of the disk capacity divided by the zone size if the capacity is not a multiple of the zone size. For such a smaller last zone, the zone capacity is also checked so that it does not exceed the smaller zone size.
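A quick userspace sketch of the resulting zone layout (sizes are illustrative, and plain division stands in for the ilog2()-based shift used in the driver):

#include <stdio.h>

int main(void)
{
    /* illustrative: 1000 MiB capacity, 256 MiB zones, 512-byte sectors */
    unsigned long long cap = (1000ULL << 20) >> 9;  /* 2048000 sectors */
    unsigned long long zsz = (256ULL << 20) >> 9;   /*  524288 sectors */
    unsigned int nr_zones = cap / zsz;              /* 3 full zones */
    unsigned int i;

    if (cap % zsz)
        nr_zones++;                                 /* plus a smaller last zone */

    for (i = 0; i < nr_zones; i++) {
        unsigned long long start = (unsigned long long)i * zsz;
        unsigned long long len = start + zsz > cap ? cap - start : zsz;

        printf("zone %u: start %llu len %llu\n", i, start, len);
    }
    return 0;
}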
Reported-by: Naohiro Aota naohiro.aota@wdc.com Fixes: ca4b2a011948 ("null_blk: add zone support") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal damien.lemoal@wdc.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Johannes Thumshirn johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/block/null_blk_zoned.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c index 7c6b86d98700..1a74871f1bfd 100644 --- a/drivers/block/null_blk_zoned.c +++ b/drivers/block/null_blk_zoned.c @@ -1,9 +1,9 @@ // SPDX-License-Identifier: GPL-2.0 #include <linux/vmalloc.h> +#include <linux/sizes.h> #include "null_blk.h"
-/* zone_size in MBs to sectors. */ -#define ZONE_SIZE_SHIFT 11 +#define MB_TO_SECTS(mb) (((sector_t)mb * SZ_1M) >> SECTOR_SHIFT)
static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect) { @@ -12,7 +12,7 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
int null_zone_init(struct nullb_device *dev) { - sector_t dev_size = (sector_t)dev->size * 1024 * 1024; + sector_t dev_capacity_sects; sector_t sector = 0; unsigned int i;
@@ -21,9 +21,12 @@ int null_zone_init(struct nullb_device *dev) return -EINVAL; }
- dev->zone_size_sects = dev->zone_size << ZONE_SIZE_SHIFT; - dev->nr_zones = dev_size >> - (SECTOR_SHIFT + ilog2(dev->zone_size_sects)); + dev_capacity_sects = MB_TO_SECTS(dev->size); + dev->zone_size_sects = MB_TO_SECTS(dev->zone_size); + dev->nr_zones = dev_capacity_sects >> ilog2(dev->zone_size_sects); + if (dev_capacity_sects & (dev->zone_size_sects - 1)) + dev->nr_zones++; + dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct blk_zone), GFP_KERNEL | __GFP_ZERO); if (!dev->zones) @@ -33,7 +36,10 @@ int null_zone_init(struct nullb_device *dev) struct blk_zone *zone = &dev->zones[i];
zone->start = zone->wp = sector; - zone->len = dev->zone_size_sects; + if (zone->start + dev->zone_size_sects > dev_capacity_sects) + zone->len = dev_capacity_sects - zone->start; + else + zone->len = dev->zone_size_sects; zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ; zone->cond = BLK_ZONE_COND_EMPTY;
From: Boqun Feng boqun.feng@gmail.com
stable inclusion from linux-4.19.165 commit 8e63266b0d42a2dc233cfc468636889b5b3ba1cf
--------------------------------
commit 8d1ddb5e79374fb277985a6b3faa2ed8631c5b4c upstream.
Syzbot reports a potential deadlock found by the newly added recursive read deadlock detection in lockdep:
[...] ======================================================== [...] WARNING: possible irq lock inversion dependency detected [...] 5.9.0-rc2-syzkaller #0 Not tainted [...] -------------------------------------------------------- [...] syz-executor.1/10214 just changed the state of lock: [...] ffff88811f506338 (&f->f_owner.lock){.+..}-{2:2}, at: send_sigurg+0x1d/0x200 [...] but this lock was taken by another, HARDIRQ-safe lock in the past: [...] (&dev->event_lock){-...}-{2:2} [...] [...] [...] and interrupts could create inverse lock ordering between them. [...] [...] [...] other info that might help us debug this: [...] Chain exists of: [...] &dev->event_lock --> &new->fa_lock --> &f->f_owner.lock [...] [...] Possible interrupt unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] lock(&f->f_owner.lock); [...] local_irq_disable(); [...] lock(&dev->event_lock); [...] lock(&new->fa_lock); [...] <Interrupt> [...] lock(&dev->event_lock); [...] [...] *** DEADLOCK ***
The corresponding deadlock case is as follows:
CPU 0                          CPU 1                          CPU 2
read_lock(&fown->lock);
                               spin_lock_irqsave(&dev->event_lock, ...)
                                                              write_lock_irq(&filp->f_owner.lock); // wait for the lock
                               read_lock(&fown->lock); // have to wait until the writer release
                                                       // due to the fairness
<interrupted>
spin_lock_irqsave(&dev->event_lock); // wait for the lock
The lock dependency on CPU 1 happens if there exists a call sequence:
input_inject_event():
  spin_lock_irqsave(&dev->event_lock, ...);
  input_handle_event():
    input_pass_values():
      input_to_handler():
        handler->event(): // evdev_event()
          evdev_pass_values():
            spin_lock(&client->buffer_lock);
            __pass_event():
              kill_fasync():
                kill_fasync_rcu():
                  read_lock(&fa->fa_lock);
                  send_sigio():
                    read_lock(&fown->lock);
To fix this, make the reader in send_sigurg() and send_sigio() use read_lock_irqsave() and read_unlock_irqrestore().
Reported-by: syzbot+22e87cdf94021b984aa6@syzkaller.appspotmail.com Reported-by: syzbot+c5e32344981ad9f33750@syzkaller.appspotmail.com Signed-off-by: Boqun Feng boqun.feng@gmail.com Signed-off-by: Jeff Layton jlayton@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/fcntl.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index 0c70a8ed2a98..2e54fdc14e51 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -799,9 +799,10 @@ void send_sigio(struct fown_struct *fown, int fd, int band) { struct task_struct *p; enum pid_type type; + unsigned long flags; struct pid *pid; - read_lock(&fown->lock); + read_lock_irqsave(&fown->lock, flags);
type = fown->pid_type; pid = fown->pid; @@ -822,7 +823,7 @@ void send_sigio(struct fown_struct *fown, int fd, int band) read_unlock(&tasklist_lock); } out_unlock_fown: - read_unlock(&fown->lock); + read_unlock_irqrestore(&fown->lock, flags); }
static void send_sigurg_to_task(struct task_struct *p, @@ -837,9 +838,10 @@ int send_sigurg(struct fown_struct *fown) struct task_struct *p; enum pid_type type; struct pid *pid; + unsigned long flags; int ret = 0; - read_lock(&fown->lock); + read_lock_irqsave(&fown->lock, flags);
type = fown->pid_type; pid = fown->pid; @@ -862,7 +864,7 @@ int send_sigurg(struct fown_struct *fown) read_unlock(&tasklist_lock); } out_unlock_fown: - read_unlock(&fown->lock); + read_unlock_irqrestore(&fown->lock, flags); return ret; }
From: Miroslav Benes mbenes@suse.cz
stable inclusion from linux-4.19.165 commit bea7f4d1ffa33ced2801947eb70e400387b07575
--------------------------------
[ Upstream commit 5e8ed280dab9eeabc1ba0b2db5dbe9fe6debb6b5 ]
If a module fails to load due to an error in prepare_coming_module(), the subsequent error handling in load_module() runs with MODULE_STATE_COMING in the module's state. Fix it by correctly setting MODULE_STATE_GOING under the "bug_cleanup" label.
Signed-off-by: Miroslav Benes mbenes@suse.cz Signed-off-by: Jessica Yu jeyu@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/module.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/kernel/module.c b/kernel/module.c index eb82703f9a26..5aa79d8e79ed 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3837,6 +3837,7 @@ static int load_module(struct load_info *info, const char __user *uargs, MODULE_STATE_GOING, mod); klp_module_going(mod); bug_cleanup: + mod->state = MODULE_STATE_GOING; /* module_bug_cleanup needs module_mutex protection */ mutex_lock(&module_mutex); module_bug_cleanup(mod);
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.165 commit 75f1bd7955f17cf15412165d25f03df7d659568e
--------------------------------
[ Upstream commit 10f04d40a9fa29785206c619f80d8beedb778837 ]
The on-disk quota format supports quota files with up to 2^32 blocks. Be careful when computing quota file offsets from block numbers, as they can overflow 32-bit types. Since quota files larger than 4GB would require ~26 million quota users, this is mostly a theoretical concern for now, but better to be careful; fuzzers would find the problem sooner or later anyway...
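A rough userspace sketch of the overflow (illustrative values only, not the kernel code; the block number and the 10-bit block-size shift are made-up examples), showing how the 32-bit shift wraps while the widened shift used by the patch does not:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t blk = 5000000;            /* hypothetical block number in a huge quota file */
        unsigned int blocksize_bits = 10;  /* example value: 1 KiB quota tree blocks */

        uint32_t bad_off = blk << blocksize_bits;            /* wraps modulo 2^32 */
        int64_t good_off = (int64_t)blk << blocksize_bits;   /* mirrors the (loff_t) cast */

        printf("32-bit offset: %u\n", bad_off);               /* 825032704 (wrapped) */
        printf("64-bit offset: %lld\n", (long long)good_off); /* 5120000000 */
        return 0;
    }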
Reviewed-by: Andreas Dilger adilger@dilger.ca Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/quota/quota_tree.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c index bb3f59bcfcf5..656f9ff63edd 100644 --- a/fs/quota/quota_tree.c +++ b/fs/quota/quota_tree.c @@ -61,7 +61,7 @@ static ssize_t read_blk(struct qtree_mem_dqinfo *info, uint blk, char *buf)
memset(buf, 0, info->dqi_usable_bs); return sb->s_op->quota_read(sb, info->dqi_type, buf, - info->dqi_usable_bs, blk << info->dqi_blocksize_bits); + info->dqi_usable_bs, (loff_t)blk << info->dqi_blocksize_bits); }
static ssize_t write_blk(struct qtree_mem_dqinfo *info, uint blk, char *buf) @@ -70,7 +70,7 @@ static ssize_t write_blk(struct qtree_mem_dqinfo *info, uint blk, char *buf) ssize_t ret;
ret = sb->s_op->quota_write(sb, info->dqi_type, buf, - info->dqi_usable_bs, blk << info->dqi_blocksize_bits); + info->dqi_usable_bs, (loff_t)blk << info->dqi_blocksize_bits); if (ret != info->dqi_usable_bs) { quota_error(sb, "dquota write failed"); if (ret >= 0) @@ -283,7 +283,7 @@ static uint find_free_dqentry(struct qtree_mem_dqinfo *info, blk); goto out_buf; } - dquot->dq_off = (blk << info->dqi_blocksize_bits) + + dquot->dq_off = ((loff_t)blk << info->dqi_blocksize_bits) + sizeof(struct qt_disk_dqdbheader) + i * info->dqi_entry_size; kfree(buf); @@ -558,7 +558,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info, ret = -EIO; goto out_buf; } else { - ret = (blk << info->dqi_blocksize_bits) + sizeof(struct + ret = ((loff_t)blk << info->dqi_blocksize_bits) + sizeof(struct qt_disk_dqdbheader) + i * info->dqi_entry_size; } out_buf:
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.165 commit 05a0aec6787885a335a9d3cec7e1cfffee253b84
--------------------------------
[ Upstream commit b6d49ecd1081740b6e632366428b960461f8158b ]
When returning the layout in nfs4_evict_inode(), we need to ensure that the layout is actually done being freed before we can proceed to free the inode itself.
Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/nfs4super.c | 2 +- fs/nfs/pnfs.c | 33 +++++++++++++++++++++++++++++++-- fs/nfs/pnfs.h | 5 +++++ 3 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c index 6fb7cb6b3f4b..e7a10f5f5405 100644 --- a/fs/nfs/nfs4super.c +++ b/fs/nfs/nfs4super.c @@ -95,7 +95,7 @@ static void nfs4_evict_inode(struct inode *inode) nfs_inode_return_delegation_noreclaim(inode); /* Note that above delegreturn would trigger pnfs return-on-close */ pnfs_return_layout(inode); - pnfs_destroy_layout(NFS_I(inode)); + pnfs_destroy_layout_final(NFS_I(inode)); /* First call standard NFS clear_inode() code */ nfs_clear_inode(inode); } diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index c818f9886f61..8f79d28bd974 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -294,6 +294,7 @@ void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo) { struct inode *inode; + unsigned long i_state;
if (!lo) return; @@ -304,8 +305,12 @@ pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo) if (!list_empty(&lo->plh_segs)) WARN_ONCE(1, "NFS: BUG unfreed layout segments.\n"); pnfs_detach_layout_hdr(lo); + i_state = inode->i_state; spin_unlock(&inode->i_lock); pnfs_free_layout_hdr(lo); + /* Notify pnfs_destroy_layout_final() that we're done */ + if (i_state & (I_FREEING | I_CLEAR)) + wake_up_var(lo); } }
@@ -713,8 +718,7 @@ pnfs_free_lseg_list(struct list_head *free_me) } }
-void -pnfs_destroy_layout(struct nfs_inode *nfsi) +static struct pnfs_layout_hdr *__pnfs_destroy_layout(struct nfs_inode *nfsi) { struct pnfs_layout_hdr *lo; LIST_HEAD(tmp_list); @@ -732,9 +736,34 @@ pnfs_destroy_layout(struct nfs_inode *nfsi) pnfs_put_layout_hdr(lo); } else spin_unlock(&nfsi->vfs_inode.i_lock); + return lo; +} + +void pnfs_destroy_layout(struct nfs_inode *nfsi) +{ + __pnfs_destroy_layout(nfsi); } EXPORT_SYMBOL_GPL(pnfs_destroy_layout);
+static bool pnfs_layout_removed(struct nfs_inode *nfsi, + struct pnfs_layout_hdr *lo) +{ + bool ret; + + spin_lock(&nfsi->vfs_inode.i_lock); + ret = nfsi->layout != lo; + spin_unlock(&nfsi->vfs_inode.i_lock); + return ret; +} + +void pnfs_destroy_layout_final(struct nfs_inode *nfsi) +{ + struct pnfs_layout_hdr *lo = __pnfs_destroy_layout(nfsi); + + if (lo) + wait_var_event(lo, pnfs_layout_removed(nfsi, lo)); +} + static bool pnfs_layout_add_bulk_destroy_list(struct inode *inode, struct list_head *layout_list) diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h index ece367ebde69..670b1de3b5eb 100644 --- a/fs/nfs/pnfs.h +++ b/fs/nfs/pnfs.h @@ -253,6 +253,7 @@ struct pnfs_layout_segment *pnfs_layout_process(struct nfs4_layoutget *lgp); void pnfs_layoutget_free(struct nfs4_layoutget *lgp); void pnfs_free_lseg_list(struct list_head *tmp_list); void pnfs_destroy_layout(struct nfs_inode *); +void pnfs_destroy_layout_final(struct nfs_inode *); void pnfs_destroy_all_layouts(struct nfs_client *); int pnfs_destroy_layouts_byfsid(struct nfs_client *clp, struct nfs_fsid *fsid, @@ -644,6 +645,10 @@ static inline void pnfs_destroy_layout(struct nfs_inode *nfsi) { }
+static inline void pnfs_destroy_layout_final(struct nfs_inode *nfsi) +{ +} + static inline struct pnfs_layout_segment * pnfs_get_lseg(struct pnfs_layout_segment *lseg) {
From: Jessica Yu jeyu@kernel.org
stable inclusion from linux-4.19.165 commit 74925430503eccee4ddf20f3b46a580ca6a72bab
--------------------------------
[ Upstream commit 38dc717e97153e46375ee21797aa54777e5498f3 ]
Apparently there has been a longstanding race between udev/systemd and the module loader. Currently, the module loader sends a uevent right after sysfs initialization, but before the module calls its init function. However, some udev rules expect that the module has initialized already upon receiving the uevent.
This race has been triggered recently (see link in references) in some systemd mount unit files. For instance, the configfs module creates the /sys/kernel/config mount point in its init function, however the module loader issues the uevent before this happens. sys-kernel-config.mount expects to be able to mount /sys/kernel/config upon receipt of the module loading uevent, but if the configfs module has not called its init function yet, then this directory will not exist and the mount unit fails. A similar situation exists for sys-fs-fuse-connections.mount, as the fuse sysfs mount point is created during the fuse module's init function. If udev is faster than module initialization then the mount unit would fail in a similar fashion.
To fix this race, delay the module KOBJ_ADD uevent until after the module has finished calling its init routine.
References: https://github.com/systemd/systemd/issues/17586 Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Tested-By: Nicolas Morey-Chaisemartin nmoreychaisemartin@suse.com Signed-off-by: Jessica Yu jeyu@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/module.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c index 5aa79d8e79ed..820b8e134a71 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1788,7 +1788,6 @@ static int mod_sysfs_init(struct module *mod) if (err) mod_kobject_put(mod);
- /* delay uevent until full sysfs population */ out: return err; } @@ -1825,7 +1824,6 @@ static int mod_sysfs_setup(struct module *mod, add_sect_attrs(mod, info); add_notes_attrs(mod, info);
- kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD); return 0;
out_unreg_modinfo_attrs: @@ -3495,6 +3493,9 @@ static noinline int do_init_module(struct module *mod) blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_LIVE, mod);
+ /* Delay uevent until module has finished its init routine */ + kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD); + /* * We need to finish all async code before the module init sequence * is done. This has potential to deadlock. For example, a newly
From: Hyeongseok Kim hyeongseok@gmail.com
stable inclusion from linux-4.19.165 commit 63d881957e59dde38f38100631ae138d5c0cee88
--------------------------------
[ Upstream commit 252bd1256396cebc6fc3526127fdb0b317601318 ]
If an emergency system shutdown is triggered, for example by a thermal shutdown, a dm device can still be alive after the underlying block device has stopped processing I/O requests. In this state, handling I/O errors for new or already in-flight dm I/O requests can put the verity device into a corrupted state, which is a misjudgment.
So, skip verity work in response to I/O error when system is shutting down.
Signed-off-by: Hyeongseok Kim hyeongseok@gmail.com Reviewed-by: Sami Tolvanen samitolvanen@google.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-verity-target.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c index 50ab1af8ac8a..6ec3e97814b1 100644 --- a/drivers/md/dm-verity-target.c +++ b/drivers/md/dm-verity-target.c @@ -533,6 +533,15 @@ static int verity_verify_io(struct dm_verity_io *io) return 0; }
+/* + * Skip verity work in response to I/O error when system is shutting down. + */ +static inline bool verity_is_system_shutting_down(void) +{ + return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF + || system_state == SYSTEM_RESTART; +} + /* * End one "io" structure with a given error. */ @@ -560,7 +569,8 @@ static void verity_end_io(struct bio *bio) { struct dm_verity_io *io = bio->bi_private;
- if (bio->bi_status && !verity_fec_is_enabled(io->v)) { + if (bio->bi_status && + (!verity_fec_is_enabled(io->v) || verity_is_system_shutting_down())) { verity_finish_io(io, bio->bi_status); return; }
From: Yunfeng Ye yeyunfeng@huawei.com
stable inclusion from linux-4.19.167 commit 2e3e4b337f51ab6e532d9f7c2b900a464cc17460
--------------------------------
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenarios, we do not want any interference on the isolated CPU cores, but invoking alloc_workqueue() for a percpu wq on a housekeeping CPU kicks a kworker on the isolated CPU.
alloc_workqueue
  pwq_adjust_max_active
    wake_up_worker
The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped"
So it is unnecessary to kick a kworker for a percpu wq when invoking alloc_workqueue(). This patch only kicks a worker based on the actual activation of delayed works.
Signed-off-by: Yunfeng Ye yeyunfeng@huawei.com Reviewed-by: Lai Jiangshan jiangshanlai@gmail.com Signed-off-by: Tejun Heo tj@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/workqueue.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c59d02bf186a..4f05c714f99c 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3646,17 +3646,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq) * is updated and visible. */ if (!freezable || !workqueue_freezing) { + bool kick = false; + pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) && - pwq->nr_active < pwq->max_active) + pwq->nr_active < pwq->max_active) { pwq_activate_first_delayed(pwq); + kick = true; + }
/* * Need to kick a worker after thawed or an unbound wq's - * max_active is bumped. It's a slow path. Do it always. + * max_active is bumped. In realtime scenarios, always kicking a + * worker will cause interference on the isolated cpu cores, so + * let's kick iff work items were activated. */ - wake_up_worker(pwq->pool); + if (kick) + wake_up_worker(pwq->pool); } else { pwq->max_active = 0; }
From: Bart Van Assche bvanassche@acm.org
stable inclusion from linux-4.19.167 commit 5a195bdd075cb185859db4eadb5fc17993957843
--------------------------------
[ Upstream commit cfefd9f8240a7b9fdd96fcd54cb029870b6d8d88 ]
Disable runtime power management during domain validation. Since a later patch removes RQF_PREEMPT, set RQF_PM for domain validation commands such that these are executed in the quiesced SCSI device state.
Link: https://lore.kernel.org/r/20201209052951.16136-6-bvanassche@acm.org Cc: Alan Stern stern@rowland.harvard.edu Cc: James Bottomley James.Bottomley@HansenPartnership.com Cc: Woody Suwalski terraluna977@gmail.com Cc: Can Guo cang@codeaurora.org Cc: Stanley Chu stanley.chu@mediatek.com Cc: Ming Lei ming.lei@redhat.com Cc: Rafael J. Wysocki rafael.j.wysocki@intel.com Cc: Stan Johnson userm57@yahoo.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Jens Axboe axboe@kernel.dk Reviewed-by: Hannes Reinecke hare@suse.de Signed-off-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/scsi_transport_spi.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c index 40b85b752b79..5390a9005584 100644 --- a/drivers/scsi/scsi_transport_spi.c +++ b/drivers/scsi/scsi_transport_spi.c @@ -130,12 +130,16 @@ static int spi_execute(struct scsi_device *sdev, const void *cmd, sshdr = &sshdr_tmp;
for(i = 0; i < DV_RETRIES; i++) { + /* + * The purpose of the RQF_PM flag below is to bypass the + * SDEV_QUIESCE state. + */ result = scsi_execute(sdev, cmd, dir, buffer, bufflen, sense, sshdr, DV_TIMEOUT, /* retries */ 1, REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER, - 0, NULL); + RQF_PM, NULL); if (driver_byte(result) != DRIVER_SENSE || sshdr->sense_key != UNIT_ATTENTION) break; @@ -1018,23 +1022,26 @@ spi_dv_device(struct scsi_device *sdev) */ lock_system_sleep();
+ if (scsi_autopm_get_device(sdev)) + goto unlock_system_sleep; + if (unlikely(spi_dv_in_progress(starget))) - goto unlock; + goto put_autopm;
if (unlikely(scsi_device_get(sdev))) - goto unlock; + goto put_autopm;
spi_dv_in_progress(starget) = 1;
buffer = kzalloc(len, GFP_KERNEL);
if (unlikely(!buffer)) - goto out_put; + goto put_sdev;
/* We need to verify that the actual device will quiesce; the * later target quiesce is just a nice to have */ if (unlikely(scsi_device_quiesce(sdev))) - goto out_free; + goto free_buffer;
scsi_target_quiesce(starget);
@@ -1054,12 +1061,16 @@ spi_dv_device(struct scsi_device *sdev)
spi_initial_dv(starget) = 1;
- out_free: +free_buffer: kfree(buffer); - out_put: + +put_sdev: spi_dv_in_progress(starget) = 0; scsi_device_put(sdev); -unlock: +put_autopm: + scsi_autopm_put_device(sdev); + +unlock_system_sleep: unlock_system_sleep(); } EXPORT_SYMBOL(spi_dv_device);
From: Huang Shijie sjhuang@iluvatar.ai
stable inclusion from linux-4.19.167 commit d8f0a87f20ca85bfd927c6e753574d682d2f6c50
--------------------------------
[ Upstream commit 36845663843fc59c5d794e3dc0641472e3e572da ]
Some graphics cards have very large on-chip memory, such as 32 GB.
In the following case, it will cause overflow:
    pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
    ret = gen_pool_add(pool, 0x1000000, SZ_32G, NUMA_NO_NODE);
va = gen_pool_alloc(pool, SZ_4G);
The overflow occurs in gen_pool_alloc_algo_owner():
    ....
    size = nbits << order;
    ....
@nbits is of type "int", so the shift overflows, and gen_pool_avail() then returns the wrong value.
This patch converts some "int" variables to "unsigned long" and changes the comparison code in the while loops.
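A minimal userspace sketch of the arithmetic (illustrative values only, not the kernel code; unsigned arithmetic is used here so the wrap-around is well defined, whereas the old kernel code used signed int), showing that for a 4 GiB allocation with PAGE_SHIFT order, nbits << order no longer fits in 32 bits:

    #include <stdio.h>

    int main(void)
    {
        unsigned int order = 12;                       /* PAGE_SHIFT on most systems */
        unsigned int nbits = 0x100000000ULL >> order;  /* bits needed for a 4 GiB allocation */

        unsigned int bad   = nbits << order;                 /* 2^32 wraps to 0 in 32 bits */
        unsigned long good = (unsigned long)nbits << order;  /* 64-bit on the LP64 systems the fix targets */

        printf("nbits = %u\n", nbits);        /* 1048576 */
        printf("32-bit size = %u\n", bad);    /* 0 */
        printf("64-bit size = %lu\n", good);  /* 4294967296 */
        return 0;
    }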
Link: https://lkml.kernel.org/r/20201229060657.3389-1-sjhuang@iluvatar.ai Signed-off-by: Huang Shijie sjhuang@iluvatar.ai Reported-by: Shi Jiasheng jiasheng.shi@iluvatar.ai Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- lib/genalloc.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/lib/genalloc.c b/lib/genalloc.c index 7e85d1e37a6e..0b8ee173cf3a 100644 --- a/lib/genalloc.c +++ b/lib/genalloc.c @@ -83,14 +83,14 @@ static int clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear) * users set the same bit, one user will return remain bits, otherwise * return 0. */ -static int bitmap_set_ll(unsigned long *map, int start, int nr) +static int bitmap_set_ll(unsigned long *map, unsigned long start, unsigned long nr) { unsigned long *p = map + BIT_WORD(start); - const int size = start + nr; + const unsigned long size = start + nr; int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG); unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
- while (nr - bits_to_set >= 0) { + while (nr >= bits_to_set) { if (set_bits_ll(p, mask_to_set)) return nr; nr -= bits_to_set; @@ -118,14 +118,15 @@ static int bitmap_set_ll(unsigned long *map, int start, int nr) * users clear the same bit, one user will return remain bits, * otherwise return 0. */ -static int bitmap_clear_ll(unsigned long *map, int start, int nr) +static unsigned long +bitmap_clear_ll(unsigned long *map, unsigned long start, unsigned long nr) { unsigned long *p = map + BIT_WORD(start); - const int size = start + nr; + const unsigned long size = start + nr; int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG); unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
- while (nr - bits_to_clear >= 0) { + while (nr >= bits_to_clear) { if (clear_bits_ll(p, mask_to_clear)) return nr; nr -= bits_to_clear; @@ -184,8 +185,8 @@ int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phy size_t size, int nid) { struct gen_pool_chunk *chunk; - int nbits = size >> pool->min_alloc_order; - int nbytes = sizeof(struct gen_pool_chunk) + + unsigned long nbits = size >> pool->min_alloc_order; + unsigned long nbytes = sizeof(struct gen_pool_chunk) + BITS_TO_LONGS(nbits) * sizeof(long);
chunk = vzalloc_node(nbytes, nid); @@ -242,7 +243,7 @@ void gen_pool_destroy(struct gen_pool *pool) struct list_head *_chunk, *_next_chunk; struct gen_pool_chunk *chunk; int order = pool->min_alloc_order; - int bit, end_bit; + unsigned long bit, end_bit;
list_for_each_safe(_chunk, _next_chunk, &pool->chunks) { chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk); @@ -293,7 +294,7 @@ unsigned long gen_pool_alloc_algo(struct gen_pool *pool, size_t size, struct gen_pool_chunk *chunk; unsigned long addr = 0; int order = pool->min_alloc_order; - int nbits, start_bit, end_bit, remain; + unsigned long nbits, start_bit, end_bit, remain;
#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG BUG_ON(in_nmi()); @@ -376,7 +377,7 @@ void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size) { struct gen_pool_chunk *chunk; int order = pool->min_alloc_order; - int start_bit, nbits, remain; + unsigned long start_bit, nbits, remain;
#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG BUG_ON(in_nmi()); @@ -638,7 +639,7 @@ unsigned long gen_pool_best_fit(unsigned long *map, unsigned long size, index = bitmap_find_next_zero_area(map, size, start, nr, 0);
while (index < size) { - int next_bit = find_next_bit(map, size, index + nr); + unsigned long next_bit = find_next_bit(map, size, index + nr); if ((next_bit - index) < len) { len = next_bit - index; start_bit = index;
From: Alexey Dobriyan adobriyan@gmail.com
stable inclusion from linux-4.19.167 commit 972013f7351fff34c7b76cfec9e21f5f60548d41
--------------------------------
[ Upstream commit e06689bf57017ac022ccf0f2a5071f760821ce0f ]
Currently, gluing a PDE into the global /proc tree is done under the lock, but changing ->nlink is not. Additionally, struct proc_dir_entry::nlink is not atomic, so updates can be lost.
Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2 Signed-off-by: Alexey Dobriyan adobriyan@gmail.com Cc: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/proc/generic.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c index e39bac94dead..7820fe524cb0 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -137,8 +137,12 @@ static int proc_getattr(const struct path *path, struct kstat *stat, { struct inode *inode = d_inode(path->dentry); struct proc_dir_entry *de = PDE(inode); - if (de && de->nlink) - set_nlink(inode, de->nlink); + if (de) { + nlink_t nlink = READ_ONCE(de->nlink); + if (nlink > 0) { + set_nlink(inode, nlink); + } + }
generic_fillattr(inode, stat); return 0; @@ -361,6 +365,7 @@ struct proc_dir_entry *proc_register(struct proc_dir_entry *dir, write_unlock(&proc_subdir_lock); goto out_free_inum; } + dir->nlink++; write_unlock(&proc_subdir_lock);
return dp; @@ -471,10 +476,7 @@ struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, ent->data = data; ent->proc_fops = &proc_dir_operations; ent->proc_iops = &proc_dir_inode_operations; - parent->nlink++; ent = proc_register(parent, ent); - if (!ent) - parent->nlink--; } return ent; } @@ -504,10 +506,7 @@ struct proc_dir_entry *proc_create_mount_point(const char *name) ent->data = NULL; ent->proc_fops = NULL; ent->proc_iops = NULL; - parent->nlink++; ent = proc_register(parent, ent); - if (!ent) - parent->nlink--; } return ent; } @@ -665,8 +664,12 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent) len = strlen(fn);
de = pde_subdir_find(parent, fn, len); - if (de) + if (de) { rb_erase(&de->subdir_node, &parent->subdir); + if (S_ISDIR(de->mode)) { + parent->nlink--; + } + } write_unlock(&proc_subdir_lock); if (!de) { WARN(1, "name '%s'\n", name); @@ -675,9 +678,6 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
proc_entry_rundown(de);
- if (S_ISDIR(de->mode)) - parent->nlink--; - de->nlink = 0; WARN(pde_subdir_first(de), "%s: removing non-empty directory '%s/%s', leaking at least '%s'\n", __func__, de->parent->name, de->name, pde_subdir_first(de)->name); @@ -713,13 +713,12 @@ int remove_proc_subtree(const char *name, struct proc_dir_entry *parent) de = next; continue; } - write_unlock(&proc_subdir_lock); - - proc_entry_rundown(de); next = de->parent; if (S_ISDIR(de->mode)) next->nlink--; - de->nlink = 0; + write_unlock(&proc_subdir_lock); + + proc_entry_rundown(de); if (de == root) break; pde_put(de);
From: Alexey Dobriyan adobriyan@gmail.com
stable inclusion from linux-4.19.167 commit 6ccab11c562666b2a850c4db21c0bd10a7d63707
--------------------------------
[ Upstream commit c6c75deda81344c3a95d1d1f606d5cee109e5d54 ]
Commit 1fde6f21d90f ("proc: fix /proc/net/* after setns(2)") only forced revalidation of regular files under /proc/net/
However, /proc/net/ is unusual in the sense that /proc/net/foo handlers take the netns pointer from the parent directory, which is the old netns.
Steps to reproduce:
(void)open("/proc/net/sctp/snmp", O_RDONLY); unshare(CLONE_NEWNET);
int fd = open("/proc/net/sctp/snmp", O_RDONLY); read(fd, &c, 1);
The read will return wrong data, taken from the original netns.
This patch forces a lookup on every directory under /proc/net.
Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain Fixes: 1da4d377f943 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan adobriyan@gmail.com Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" tommi.t.rantala@nokia.com Cc: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/proc/generic.c | 24 ++++++++++++++++++++++-- fs/proc/internal.h | 7 +++++++ fs/proc/proc_net.c | 16 ---------------- include/linux/proc_fs.h | 8 +++++++- 4 files changed, 36 insertions(+), 19 deletions(-)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 7820fe524cb0..bab10368a04d 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -341,6 +341,16 @@ static const struct file_operations proc_dir_operations = { .iterate_shared = proc_readdir, };
+static int proc_net_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + return 0; +} + +const struct dentry_operations proc_net_dentry_ops = { + .d_revalidate = proc_net_d_revalidate, + .d_delete = always_delete_dentry, +}; + /* * proc directories can do almost nothing.. */ @@ -463,8 +473,8 @@ struct proc_dir_entry *proc_symlink(const char *name, } EXPORT_SYMBOL(proc_symlink);
-struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, - struct proc_dir_entry *parent, void *data) +struct proc_dir_entry *_proc_mkdir(const char *name, umode_t mode, + struct proc_dir_entry *parent, void *data, bool force_lookup) { struct proc_dir_entry *ent;
@@ -476,10 +486,20 @@ struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, ent->data = data; ent->proc_fops = &proc_dir_operations; ent->proc_iops = &proc_dir_inode_operations; + if (force_lookup) { + pde_force_lookup(ent); + } ent = proc_register(parent, ent); } return ent; } +EXPORT_SYMBOL_GPL(_proc_mkdir); + +struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, + struct proc_dir_entry *parent, void *data) +{ + return _proc_mkdir(name, mode, parent, data, false); +} EXPORT_SYMBOL_GPL(proc_mkdir_data);
struct proc_dir_entry *proc_mkdir_mode(const char *name, umode_t mode, diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 95b14196f284..4f14906ef16b 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -305,3 +305,10 @@ extern unsigned long task_statm(struct mm_struct *, unsigned long *, unsigned long *, unsigned long *, unsigned long *); extern void task_mem(struct seq_file *, struct mm_struct *); + +extern const struct dentry_operations proc_net_dentry_ops; +static inline void pde_force_lookup(struct proc_dir_entry *pde) +{ + /* /proc/net/ entries can be changed under us by setns(CLONE_NEWNET) */ + pde->proc_dops = &proc_net_dentry_ops; +} diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c index a7b12435519e..096bcc1e7a8a 100644 --- a/fs/proc/proc_net.c +++ b/fs/proc/proc_net.c @@ -38,22 +38,6 @@ static struct net *get_proc_net(const struct inode *inode) return maybe_get_net(PDE_NET(PDE(inode))); }
-static int proc_net_d_revalidate(struct dentry *dentry, unsigned int flags) -{ - return 0; -} - -static const struct dentry_operations proc_net_dentry_ops = { - .d_revalidate = proc_net_d_revalidate, - .d_delete = always_delete_dentry, -}; - -static void pde_force_lookup(struct proc_dir_entry *pde) -{ - /* /proc/net/ entries can be changed under us by setns(CLONE_NEWNET) */ - pde->proc_dops = &proc_net_dentry_ops; -} - static int seq_open_net(struct inode *inode, struct file *file) { unsigned int state_size = PDE(inode)->state_size; diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index d0e1f1522a78..5141657a0f7f 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -21,6 +21,7 @@ extern void proc_flush_task(struct task_struct *);
extern struct proc_dir_entry *proc_symlink(const char *, struct proc_dir_entry *, const char *); +struct proc_dir_entry *_proc_mkdir(const char *, umode_t, struct proc_dir_entry *, void *, bool); extern struct proc_dir_entry *proc_mkdir(const char *, struct proc_dir_entry *); extern struct proc_dir_entry *proc_mkdir_data(const char *, umode_t, struct proc_dir_entry *, void *); @@ -89,6 +90,11 @@ static inline struct proc_dir_entry *proc_symlink(const char *name, static inline struct proc_dir_entry *proc_mkdir(const char *name, struct proc_dir_entry *parent) {return NULL;} static inline struct proc_dir_entry *proc_create_mount_point(const char *name) { return NULL; } +static inline struct proc_dir_entry *_proc_mkdir(const char *name, umode_t mode, + struct proc_dir_entry *parent, void *data, bool force_lookup) +{ + return NULL; +} static inline struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, struct proc_dir_entry *parent, void *data) { return NULL; } static inline struct proc_dir_entry *proc_mkdir_mode(const char *name, @@ -121,7 +127,7 @@ struct net; static inline struct proc_dir_entry *proc_net_mkdir( struct net *net, const char *name, struct proc_dir_entry *parent) { - return proc_mkdir_data(name, 0, parent, net); + return _proc_mkdir(name, 0, parent, net, true); }
struct ns_common;
From: Jeff Dike jdike@akamai.com
stable inclusion from linux-4.19.167 commit 6413249a00b145454cc111a17742a6eda821b60b
--------------------------------
[ Upstream commit de33212f768c5d9e2fe791b008cb26f92f0aa31c ]
virtnet_set_channels can recursively call cpus_read_lock if CONFIG_XPS and CONFIG_HOTPLUG are enabled.
The path is:
    virtnet_set_channels - calls get_online_cpus(), which is a trivial
        wrapper around cpus_read_lock()
    netif_set_real_num_tx_queues
    netif_reset_xps_queues_gt
    netif_reset_xps_queues - calls cpus_read_lock()
This call chain and potential deadlock happens when the number of TX queues is reduced.
This commit removes the netif_set_real_num_[tr]x_queues calls from inside the get/put_online_cpus section, as they don't require that it be held.
Fixes: 47be24796c13 ("virtio-net: fix the set affinity bug when CPU IDs are not consecutive") Signed-off-by: Jeff Dike jdike@akamai.com Acked-by: Jason Wang jasowang@redhat.com Acked-by: Michael S. Tsirkin mst@redhat.com Link: https://lore.kernel.org/r/20201223025421.671-1-jdike@akamai.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/virtio_net.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c88ee376a2eb..e3553056cec0 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2075,14 +2075,16 @@ static int virtnet_set_channels(struct net_device *dev,
get_online_cpus(); err = _virtnet_set_queues(vi, queue_pairs); - if (!err) { - netif_set_real_num_tx_queues(dev, queue_pairs); - netif_set_real_num_rx_queues(dev, queue_pairs); - - virtnet_set_affinity(vi); + if (err) { + put_online_cpus(); + goto err; } + virtnet_set_affinity(vi); put_online_cpus();
+ netif_set_real_num_tx_queues(dev, queue_pairs); + netif_set_real_num_rx_queues(dev, queue_pairs); + err: return err; }
From: Yunjian Wang wangyunjian@huawei.com
stable inclusion from linux-4.19.167 commit cf7dd7f3bec70fb52a127388facd832157a877c9
--------------------------------
[ Upstream commit 01e31bea7e622f1890c274f4aaaaf8bccd296aa5 ]
Currently vhost_zerocopy_callback() may be called to decrease the refcount when sendmsg fails in tun. The error handling in vhost handle_tx_zerocopy() will then try to decrease the same refcount again. This is wrong. To fix this issue, we only call vhost_net_ubuf_put() when vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS.
Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support") Signed-off-by: Yunjian Wang wangyunjian@huawei.com Acked-by: Willem de Bruijn willemb@google.com Acked-by: Michael S. Tsirkin mst@redhat.com Acked-by: Jason Wang jasowang@redhat.com Link: https://lore.kernel.org/r/1609207308-20544-1-git-send-email-wangyunjian@huaw... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/vhost/net.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 88c8c158ec25..0c7bbc92b22a 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -613,6 +613,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) size_t len, total_len = 0; int err; struct vhost_net_ubuf_ref *uninitialized_var(ubufs); + struct ubuf_info *ubuf; bool zcopy_used; int sent_pkts = 0;
@@ -645,9 +646,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
/* use msg_control to pass vhost zerocopy ubuf info to skb */ if (zcopy_used) { - struct ubuf_info *ubuf; ubuf = nvq->ubuf_info + nvq->upend_idx; - vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head); vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS; ubuf->callback = vhost_zerocopy_callback; @@ -675,7 +674,8 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock) err = sock->ops->sendmsg(sock, &msg, len); if (unlikely(err < 0)) { if (zcopy_used) { - vhost_net_ubuf_put(ubufs); + if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS) + vhost_net_ubuf_put(ubufs); nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV; }
From: Ming Lei ming.lei@redhat.com
stable inclusion from linux-4.19.168 commit 3d13ebbd06697e2fb180a61bbff1ff244a9513e4
--------------------------------
commit aebf5db917055b38f4945ed6d621d9f07a44ff30 upstream.
Make sure that bdgrab() is done on the 'block_device' instance before referring to it, to avoid a use-after-free.
Cc: stable@vger.kernel.org Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com Signed-off-by: Ming Lei ming.lei@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/genhd.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c index 862a2f3c536f..d8a9c901eaef 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -208,14 +208,17 @@ struct hd_struct *disk_part_iter_next(struct disk_part_iter *piter) part = rcu_dereference(ptbl->part[piter->idx]); if (!part) continue; + get_device(part_to_dev(part)); + piter->part = part; if (!part_nr_sects_read(part) && !(piter->flags & DISK_PITER_INCL_EMPTY) && !(piter->flags & DISK_PITER_INCL_EMPTY_PART0 && - piter->idx == 0)) + piter->idx == 0)) { + put_device(part_to_dev(part)); + piter->part = NULL; continue; + }
- get_device(part_to_dev(part)); - piter->part = part; piter->idx += inc; break; }
From: Dexuan Cui decui@microsoft.com
stable inclusion from linux-4.19.169 commit 56d772d831648da003bb18f8641e7973c3984915
--------------------------------
commit a58015d638cd4e4555297b04bec9b49028369075 upstream.
Linux VM on Hyper-V crashes with the latest mainline:
[ 4.069624] detected buffer overflow in strcpy
[ 4.077733] kernel BUG at lib/string.c:1149!
..
[ 4.085819] RIP: 0010:fortify_panic+0xf/0x11
...
[ 4.085819] Call Trace:
[ 4.085819]  acpi_device_add.cold.15+0xf2/0xfb
[ 4.085819]  acpi_add_single_object+0x2a6/0x690
[ 4.085819]  acpi_bus_check_add+0xc6/0x280
[ 4.085819]  acpi_ns_walk_namespace+0xda/0x1aa
[ 4.085819]  acpi_walk_namespace+0x9a/0xc2
[ 4.085819]  acpi_bus_scan+0x78/0x90
[ 4.085819]  acpi_scan_init+0xfa/0x248
[ 4.085819]  acpi_init+0x2c1/0x321
[ 4.085819]  do_one_initcall+0x44/0x1d0
[ 4.085819]  kernel_init_freeable+0x1ab/0x1f4
This is because of the recent buffer overflow detection in the commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified string functions")
Here acpi_device_bus_id->bus_id can only hold 14 characters, while acpi_device_hid(device) returns a 22-char string "HYPER_V_GEN_COUNTER_V1".
Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a string, it must be of the form AAA#### or NNNN####, i.e. 7 chars or 8 chars.
The field bus_id in struct acpi_device_bus_id was originally defined as char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI devices")
Fix the issue by changing the field bus_id to const char *, and use kstrdup_const() to initialize it.
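As a quick size check (a userspace sketch with the old field size reproduced from the description above; everything else is illustrative), the fixed-size field simply cannot hold the Hyper-V HID, which is what the fortified strcpy() catches:

    #include <stdio.h>
    #include <string.h>

    struct acpi_device_bus_id_old { char bus_id[15]; };  /* field size before the fix */

    int main(void)
    {
        const char *hid = "HYPER_V_GEN_COUNTER_V1";  /* 22 characters */
        size_t dst = sizeof(((struct acpi_device_bus_id_old){0}).bus_id);

        printf("bus_id[] holds %zu chars + NUL, the HID needs %zu + NUL\n",
               dst - 1, strlen(hid));
        /* strcpy() would have to write 23 bytes into a 15-byte field; the
         * fortified strcpy() detects this and panics. Duplicating the string
         * instead sizes the allocation to the actual HID length. */
        return 0;
    }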
Signed-off-by: Dexuan Cui decui@microsoft.com Tested-By: Jethro Beekman jethro@fortanix.com [ rjw: Subject change, whitespace adjustment ] Cc: All applicable stable@vger.kernel.org Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/acpi/internal.h | 2 +- drivers/acpi/scan.c | 15 ++++++++++++++- 2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index f59d0b9e2683..6def196cc23c 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -98,7 +98,7 @@ void acpi_scan_table_handler(u32 event, void *table, void *context); extern struct list_head acpi_bus_id_list;
struct acpi_device_bus_id { - char bus_id[15]; + const char *bus_id; unsigned int instance_no; struct list_head node; }; diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 1dcc48b9d33c..46f6698c72fd 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -486,6 +486,7 @@ static void acpi_device_del(struct acpi_device *device) acpi_device_bus_id->instance_no--; else { list_del(&acpi_device_bus_id->node); + kfree_const(acpi_device_bus_id->bus_id); kfree(acpi_device_bus_id); } break; @@ -674,7 +675,14 @@ int acpi_device_add(struct acpi_device *device, } if (!found) { acpi_device_bus_id = new_bus_id; - strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device)); + acpi_device_bus_id->bus_id = + kstrdup_const(acpi_device_hid(device), GFP_KERNEL); + if (!acpi_device_bus_id->bus_id) { + pr_err(PREFIX "Memory allocation error for bus id\n"); + result = -ENOMEM; + goto err_free_new_bus_id; + } + acpi_device_bus_id->instance_no = 0; list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list); } @@ -709,6 +717,11 @@ int acpi_device_add(struct acpi_device *device, if (device->parent) list_del(&device->node); list_del(&device->wakeup_list); + + err_free_new_bus_id: + if (!found) + kfree(new_bus_id); + mutex_unlock(&acpi_device_lock);
err_detach:
From: Miaohe Lin linmiaohe@huawei.com
stable inclusion from linux-4.19.169 commit bba1a0da5bbdd938907cf7f5c3573b3d8e199074
--------------------------------
commit 0eb98f1588c2cc7a79816d84ab18a55d254f481c upstream.
The huge page size is only decoded for VM_FAULT_HWPOISON_LARGE errors. So if we return plain VM_FAULT_HWPOISON, the encoded huge page size is just ignored.
Link: https://lkml.kernel.org/r/20210107123449.38481-1-linmiaohe@huawei.com Fixes: aa50d3a7aa81 ("Encode huge page size for VM_FAULT_HWPOISON errors") Signed-off-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7c2f51528c1c..ea0902276cb5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4012,7 +4012,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * So we need to block hugepage fault by PG_hwpoison bit check. */ if (unlikely(PageHWPoison(page))) { - ret = VM_FAULT_HWPOISON | + ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto backout_unlocked; }
From: Akilesh Kailash akailash@google.com
stable inclusion from linux-4.19.169 commit aef593d2c87598839d6200772c5aee8a9ecdcd9f
--------------------------------
commit fcc42338375a1e67b8568dbb558f8b784d0f3b01 upstream.
If the origin device has a volatile write-back cache and the following events occur:
1: After finishing merge operation of one set of exceptions, merge_callback() is invoked.

2: Update the metadata in COW device tracking the merge completion. This update to COW device is flushed cleanly.

3: System crashes and the origin device's cache where the recent merge was completed has not been flushed.
During the next cycle when we read the metadata from the COW device, we will skip reading those metadata whose merge was completed in step (1). This will lead to data loss/corruption.
To address this, flush the origin device post merge IO before updating the metadata.
Cc: stable@vger.kernel.org Signed-off-by: Akilesh Kailash akailash@google.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-snap.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index d3f28a9e3fd9..9e930a150aa2 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -137,6 +137,11 @@ struct dm_snapshot { * for them to be committed. */ struct bio_list bios_queued_during_merge; + + /* + * Flush data after merge. + */ + struct bio flush_bio; };
/* @@ -1061,6 +1066,17 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
static void error_bios(struct bio *bio);
+static int flush_data(struct dm_snapshot *s) +{ + struct bio *flush_bio = &s->flush_bio; + + bio_reset(flush_bio); + bio_set_dev(flush_bio, s->origin->bdev); + flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; + + return submit_bio_wait(flush_bio); +} + static void merge_callback(int read_err, unsigned long write_err, void *context) { struct dm_snapshot *s = context; @@ -1074,6 +1090,11 @@ static void merge_callback(int read_err, unsigned long write_err, void *context) goto shut; }
+ if (flush_data(s) < 0) { + DMERR("Flush after merge failed: shutting down merge"); + goto shut; + } + if (s->store->type->commit_merge(s->store, s->num_merging_chunks) < 0) { DMERR("Write error in exception store: shutting down merge"); @@ -1198,6 +1219,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) s->first_merging_chunk = 0; s->num_merging_chunks = 0; bio_list_init(&s->bios_queued_during_merge); + bio_init(&s->flush_bio, NULL, 0);
/* Allocate hash table for COW data */ if (init_hash_tables(s)) { @@ -1391,6 +1413,8 @@ static void snapshot_dtr(struct dm_target *ti)
mutex_destroy(&s->lock);
+ bio_uninit(&s->flush_bio); + dm_put_device(ti, s->cow);
dm_put_device(ti, s->origin);
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.169 commit dd712c8f5f4f46b1c4682109c5cacbe001459133
--------------------------------
commit 17ffc193cdc6dc7a613d00d8ad47fc1f801b9bf0 upstream.
Advance the maximum number of arguments from 9 to 15 to account for all potential feature flags that may be supplied.
Linux 4.19 added "meta_device" (356d9d52e1221ba0c9f10b8b38652f78a5298329) and "recalculate" (a3fcf7253139609bf9ff901fbf955fba047e75dd) flags.
Commit 468dfca38b1a6fbdccd195d875599cb7c8875cd9 added "sectors_per_bit" and "bitmap_flush_interval".
Commit 84597a44a9d86ac949900441cea7da0af0f2f473 added "allow_discards".
And the commit d537858ac8aaf4311b51240893add2fc62003b97 added "fix_padding".
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-integrity.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index 24157b31d31b..183288348dee 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -3091,7 +3091,7 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) unsigned extra_args; struct dm_arg_set as; static const struct dm_arg _args[] = { - {0, 9, "Invalid number of feature args"}, + {0, 15, "Invalid number of feature args"}, }; unsigned journal_sectors, interleave_sectors, buffer_sectors, journal_watermark, sync_msec; bool recalculate;
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.169 commit 9f8b5931be0bf003a8e90fcc93552089820fc2ee
--------------------------------
[ Upstream commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a ]
BFQ computes the number of tags it allows to be allocated for each request type based on the tag bitmap. However, it uses 1 << bitmap.shift as the number of available tags, which is wrong. 'shift' is just an internal bitmap value containing the logarithm of how many bits the bitmap uses in each bitmap word. Thus the number of tags allowed for some request types can be far too low. Use the proper bitmap.depth, which holds the number of tags, instead.
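A small arithmetic sketch of how the shift-based limit undercounts (illustrative only: the 256-tag queue is a made-up example, and shift == 6 assumes a 64-bit kernel where an sbitmap word covers 64 bits):

    #include <stdio.h>

    int main(void)
    {
        unsigned int depth = 256;  /* total tags in the bitmap (example value) */
        unsigned int shift = 6;    /* log2(bits per sbitmap word) on 64-bit */

        unsigned int old_async = (1U << shift) >> 1;  /* old: 32 tags for async I/O */
        unsigned int new_async = depth >> 1;          /* fixed: 128 tags for async I/O */

        printf("async tag limit: old=%u fixed=%u (of %u tags)\n",
               old_async, new_async, depth);
        return 0;
    }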
Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/bfq-iosched.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 97b033b00d4e..0c2c0cd8546c 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -5283,13 +5283,13 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd, * limit 'something'. */ /* no more than 50% of tags for async I/O */ - bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U); + bfqd->word_depths[0][0] = max(bt->sb.depth >> 1, 1U); /* * no more than 75% of tags for sync writes (25% extra tags * w.r.t. async I/O, to prevent async I/O from starving sync * writes) */ - bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U); + bfqd->word_depths[0][1] = max((bt->sb.depth * 3) >> 2, 1U);
/* * In-word depths in case some bfq_queue is being weight- @@ -5299,9 +5299,9 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd, * shortage. */ /* no more than ~18% of tags for async I/O */ - bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U); + bfqd->word_depths[1][0] = max((bt->sb.depth * 3) >> 4, 1U); /* no more than ~37% of tags for sync writes (~20% extra tags) */ - bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U); + bfqd->word_depths[1][1] = max((bt->sb.depth * 6) >> 4, 1U);
for (i = 0; i < 2; i++) for (j = 0; j < 2; j++)
From: Al Viro viro@zeniv.linux.org.uk
stable inclusion from linux-4.19.169 commit fda4bb55c45bd9fdf490c39c3e567f0cea931e54
--------------------------------
commit d36a1dd9f77ae1e72da48f4123ed35627848507d upstream.
We are not guaranteed the locking environment that would prevent the dentry from getting renamed right under us. And it's possible for an old long name to be freed after the rename, leading to a UAF here.
Cc: stable@kernel.org # v2.6.2+ Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- security/lsm_audit.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/security/lsm_audit.c b/security/lsm_audit.c index 33028c098ef3..d6fea68d22ad 100644 --- a/security/lsm_audit.c +++ b/security/lsm_audit.c @@ -277,7 +277,9 @@ static void dump_common_audit_data(struct audit_buffer *ab, struct inode *inode;
audit_log_format(ab, " name="); + spin_lock(&a->u.dentry->d_lock); audit_log_untrustedstring(ab, a->u.dentry->d_name.name); + spin_unlock(&a->u.dentry->d_lock);
inode = d_backing_inode(a->u.dentry); if (inode) { @@ -295,8 +297,9 @@ static void dump_common_audit_data(struct audit_buffer *ab, dentry = d_find_alias(inode); if (dentry) { audit_log_format(ab, " name="); - audit_log_untrustedstring(ab, - dentry->d_name.name); + spin_lock(&dentry->d_lock); + audit_log_untrustedstring(ab, dentry->d_name.name); + spin_unlock(&dentry->d_lock); dput(dentry); } audit_log_format(ab, " dev=");
From: Dave Wysochanski dwysocha@redhat.com
stable inclusion from linux-4.19.169 commit 825e0ffaad2cf1ca789a61ede31dab213618b648
--------------------------------
commit 3d1a90ab0ed93362ec8ac85cf291243c87260c21 upstream.
It is only safe to call the tracepoint before rpc_put_task() because 'data' is freed inside nfs4_lock_release (rpc_release).
Fixes: 48c9579a1afe ("Adding stateid information to tracepoints") Signed-off-by: Dave Wysochanski dwysocha@redhat.com Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/nfs4proc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index b964f9737dcd..672cef2c7ce7 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -6706,9 +6706,9 @@ static int _nfs4_do_setlk(struct nfs4_state *state, int cmd, struct file_lock *f data->arg.new_lock_owner, ret); } else data->cancelled = true; + trace_nfs4_set_lock(fl, state, &data->res.stateid, cmd, ret); rpc_put_task(task); dprintk("%s: done, ret = %d!\n", __func__, ret); - trace_nfs4_set_lock(fl, state, &data->res.stateid, cmd, ret); return ret; }
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.169 commit 87396ce3b5059916e81b6c3456a5c2b9b7636b5f
--------------------------------
commit 67bbceedc9bb8ad48993a8bd6486054756d711f4 upstream.
If the layout return-on-close failed because the layoutreturn was never sent, then we should mark the layout for return again.
Fixes: 9c47b18cf722 ("pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/pnfs.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 8f79d28bd974..18c8d25fe608 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1451,12 +1451,18 @@ void pnfs_roc_release(struct nfs4_layoutreturn_args *args, int ret) { struct pnfs_layout_hdr *lo = args->layout; + struct inode *inode = args->inode; const nfs4_stateid *arg_stateid = NULL; const nfs4_stateid *res_stateid = NULL; struct nfs4_xdr_opaque_data *ld_private = args->ld_private;
switch (ret) { case -NFS4ERR_NOMATCHING_LAYOUT: + spin_lock(&inode->i_lock); + if (pnfs_layout_is_valid(lo) && + nfs4_stateid_match_other(&args->stateid, &lo->plh_stateid)) + pnfs_set_plh_return_info(lo, args->range.iomode, 0); + spin_unlock(&inode->i_lock); break; case 0: if (res->lrs_present)
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.169 commit b2f9fbbc3fa20f066e20f1d676175e4af0e177ba
--------------------------------
commit cb2856c5971723910a86b7d1d0cf623d6919cbc4 upstream.
If we exit _lgopen_prepare_attached() without setting a layout, we will currently leak the plh_outstanding counter.
Fixes: 411ae722d10a ("pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout()") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/pnfs.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 18c8d25fe608..26c1be5900c3 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -2138,6 +2138,7 @@ static void _lgopen_prepare_attached(struct nfs4_opendata *data, &rng, GFP_KERNEL); if (!lgp) { pnfs_clear_first_layoutget(lo); + nfs_layoutget_end(lo); pnfs_put_layout_hdr(lo); return; }
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.169 commit df7adeee7477c865fecdd43b55aee6cca0e9bb0a
--------------------------------
commit 896567ee7f17a8a736cda8a28cc987228410a2ac upstream.
Before referencing the inode, we must ensure that the superblock can be referenced. Otherwise, we can end up with iput() calling superblock operations that are no longer valid or accessible.
Fixes: ea7c38fef0b7 ("NFSv4: Ensure we reference the inode for return-on-close in delegreturn") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/internal.h | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 8357ff69962f..cc07189a501f 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -575,12 +575,14 @@ extern int nfs4_test_session_trunk(struct rpc_clnt *,
static inline struct inode *nfs_igrab_and_active(struct inode *inode) { - inode = igrab(inode); - if (inode != NULL && !nfs_sb_active(inode->i_sb)) { - iput(inode); - inode = NULL; + struct super_block *sb = inode->i_sb; + + if (sb && nfs_sb_active(sb)) { + if (igrab(inode)) + return inode; + nfs_sb_deactive(sb); } - return inode; + return NULL; }
static inline void nfs_iput_and_deactive(struct inode *inode)
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.169 commit 8e45768d6a279b20fe4c51a41a6d18c54899d0e6
--------------------------------
commit dfd56c2c0c0dbb11be939b804ddc8d5395ab3432 upstream.
When setting password salt in the superblock, we forget to recompute the superblock checksum so it will not match until the next superblock modification which recomputes the checksum. Fix it.
CC: Michael Halcrow mhalcrow@google.com Reported-by: Andreas Dilger adilger@dilger.ca Fixes: 9bd8212f981e ("ext4 crypto: add encryption policy and password salt support") Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-8-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/ioctl.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 3f6985fa5a46..c6cac2d40f71 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -1096,7 +1096,10 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) err = ext4_journal_get_write_access(handle, sbi->s_sbh); if (err) goto pwsalt_err_journal; + lock_buffer(sbi->s_sbh); generate_random_uuid(sbi->s_es->s_encrypt_pw_salt); + ext4_superblock_csum_set(sb); + unlock_buffer(sbi->s_sbh); err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh); pwsalt_err_journal:
From: Jann Horn jannh@google.com
stable inclusion from linux-4.19.169 commit 21a466aed724f6f2fb10635c982a50c0900ed8f3
--------------------------------
commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf upstream.
acquire_slab() fails if there is contention on the freelist of the page (probably because some other CPU is concurrently freeing an object from the page). In that case, it might make sense to look for a different page (since there might be more remote frees to the page from other CPUs, and we don't want contention on struct page).
However, the current code accidentally stops looking at the partial list completely in that case. Especially on kernels without CONFIG_NUMA set, this means that get_partial() fails and new_slab_objects() falls back to new_slab(), allocating new pages. This could lead to an unnecessary increase in memory fragmentation.
Link: https://lkml.kernel.org/r/20201228130853.1871516-1-jannh@google.com Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop") Signed-off-by: Jann Horn jannh@google.com Acked-by: David Rientjes rientjes@google.com Acked-by: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Christoph Lameter cl@linux.com Cc: Pekka Enberg penberg@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c index 4a2f26139f10..4c5714000943 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1829,7 +1829,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
t = acquire_slab(s, n, page, object == NULL, &objects); if (!t) - break; + continue; /* cmpxchg raced */
available += objects; if (!object) {
From: "j.nixdorf@avm.de" j.nixdorf@avm.de
stable inclusion from linux-4.19.169 commit f6ced16ce529dda32000b154e254f84118ffbd8e
--------------------------------
commit 86b53fbf08f48d353a86a06aef537e78e82ba721 upstream.
A return value of 0 means success. This is documented in lib/kstrtox.c.
This was found by trying to mount an NFS share from a link-local IPv6 address with the interface specified by its index:
mount("[fe80::1%1]:/srv/nfs", "/mnt", "nfs", 0, "nolock,addr=fe80::1%1")
Before this commit this failed with EINVAL and also caused the following message in dmesg:
[...] NFS: bad IP address specified: addr=fe80::1%1
The same syscall succeeds when the address specifies the interface by name instead of by its index.
Credits for this patch go to my colleague Christian Speich, who traced the origin of this bug to this line of code.
Signed-off-by: Johannes Nixdorf j.nixdorf@avm.de Fixes: 00cfaa943ec3 ("replace strict_strto calls") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sunrpc/addr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/addr.c b/net/sunrpc/addr.c index 8391c2785550..7404f02702a1 100644 --- a/net/sunrpc/addr.c +++ b/net/sunrpc/addr.c @@ -184,7 +184,7 @@ static int rpc_parse_scope_id(struct net *net, const char *buf, scope_id = dev->ifindex; dev_put(dev); } else { - if (kstrtou32(p, 10, &scope_id) == 0) { + if (kstrtou32(p, 10, &scope_id) != 0) { kfree(p); return 0; }
From: Jesper Dangaard Brouer brouer@redhat.com
stable inclusion from linux-4.19.169 commit b69a79c62c7055a7c36464092b45b25018958153
--------------------------------
commit f6351c3f1c27c80535d76cac2299aec44c36291e upstream.
The old way of changing the conntrack hashsize at runtime was through changing the module param via the file /sys/module/nf_conntrack/parameters/hashsize. This was extended to a sysctl change in commit 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too").
The commit introduced a second "user" variable, nf_conntrack_htable_size_user, which shadows the actual variable nf_conntrack_htable_size. When hashsize is changed via the module param, this "user" variable isn't updated. As a result, sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users update via the old way.
This patch fixes the issue by always updating the "user" variable when reading the proc file. This takes care of changes to the actual variable without sysctl needing to be aware of them.
Fixes: 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too") Reported-by: Yoel Caspersen yoel@kviknet.dk Signed-off-by: Jesper Dangaard Brouer brouer@redhat.com Acked-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nf_conntrack_standalone.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4c94f3ba2ae4..dcd8e7922951 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -500,6 +500,9 @@ nf_conntrack_hash_sysctl(struct ctl_table *table, int write, { int ret;
+ /* module_param hashsize could have changed value */ + nf_conntrack_htable_size_user = nf_conntrack_htable_size; + ret = proc_dointvec(table, write, buffer, lenp, ppos); if (ret < 0 || !write) return ret;
From: Dinghao Liu dinghao.liu@zju.edu.cn
stable inclusion from linux-4.19.169 commit 1921060a11b30dedfa8732c83de382788811bb83
--------------------------------
commit 869f4fdaf4ca7bb6e0d05caf6fa1108dddc346a7 upstream.
When register_pernet_subsys() fails, nf_nat_bysource should be freed just like when nf_ct_extend_register() fails.
Fixes: 1cd472bf036ca ("netfilter: nf_nat: add nat hook register functions to nf_nat") Signed-off-by: Dinghao Liu dinghao.liu@zju.edu.cn Acked-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nf_nat_core.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index f1576e46ee2c..8ea39b816c02 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -1087,6 +1087,7 @@ static int __init nf_nat_init(void) ret = register_pernet_subsys(&nat_net_ops); if (ret < 0) { nf_ct_extend_unregister(&nat_extend); + kvfree(nf_nat_bysource); return ret; }
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.170 commit f9f5547bf02a8ec1a9240b8ab6fece5b28d74d1b
--------------------------------
commit 9b5948267adc9e689da609eb61cf7ed49cae5fa8 upstream.
With an external metadata device, flush requests are not passed down to the data device.
Fix this by submitting the flush request in dm_integrity_flush_buffers. In order to not degrade performance, we overlap the data device flush with the metadata device flush.
Reported-by: Lukas Straub lukasstraub2@web.de Signed-off-by: Mikulas Patocka mpatocka@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-bufio.c | 6 +++++ drivers/md/dm-integrity.c | 50 +++++++++++++++++++++++++++++++++++---- include/linux/dm-bufio.h | 1 + 3 files changed, 52 insertions(+), 5 deletions(-)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index dc385b70e4c3..b6e4ab67ae44 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1471,6 +1471,12 @@ sector_t dm_bufio_get_device_size(struct dm_bufio_client *c) } EXPORT_SYMBOL_GPL(dm_bufio_get_device_size);
+struct dm_io_client *dm_bufio_get_dm_io_client(struct dm_bufio_client *c) +{ + return c->dm_io; +} +EXPORT_SYMBOL_GPL(dm_bufio_get_dm_io_client); + sector_t dm_bufio_get_block_number(struct dm_buffer *b) { return b->block; diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index 183288348dee..bb38db2040c3 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -1154,12 +1154,52 @@ static int dm_integrity_rw_tag(struct dm_integrity_c *ic, unsigned char *tag, se return 0; }
-static void dm_integrity_flush_buffers(struct dm_integrity_c *ic) +struct flush_request { + struct dm_io_request io_req; + struct dm_io_region io_reg; + struct dm_integrity_c *ic; + struct completion comp; +}; + +static void flush_notify(unsigned long error, void *fr_) +{ + struct flush_request *fr = fr_; + if (unlikely(error != 0)) + dm_integrity_io_error(fr->ic, "flusing disk cache", -EIO); + complete(&fr->comp); +} + +static void dm_integrity_flush_buffers(struct dm_integrity_c *ic, bool flush_data) { int r; + + struct flush_request fr; + + if (!ic->meta_dev) + flush_data = false; + if (flush_data) { + fr.io_req.bi_op = REQ_OP_WRITE, + fr.io_req.bi_op_flags = REQ_PREFLUSH | REQ_SYNC, + fr.io_req.mem.type = DM_IO_KMEM, + fr.io_req.mem.ptr.addr = NULL, + fr.io_req.notify.fn = flush_notify, + fr.io_req.notify.context = &fr; + fr.io_req.client = dm_bufio_get_dm_io_client(ic->bufio), + fr.io_reg.bdev = ic->dev->bdev, + fr.io_reg.sector = 0, + fr.io_reg.count = 0, + fr.ic = ic; + init_completion(&fr.comp); + r = dm_io(&fr.io_req, 1, &fr.io_reg, NULL); + BUG_ON(r); + } + r = dm_bufio_write_dirty_buffers(ic->bufio); if (unlikely(r)) dm_integrity_io_error(ic, "writing tags", r); + + if (flush_data) + wait_for_completion(&fr.comp); }
static void sleep_on_endio_wait(struct dm_integrity_c *ic) @@ -1859,7 +1899,7 @@ static void integrity_commit(struct work_struct *w) flushes = bio_list_get(&ic->flush_bio_list); if (unlikely(ic->mode != 'J')) { spin_unlock_irq(&ic->endio_wait.lock); - dm_integrity_flush_buffers(ic); + dm_integrity_flush_buffers(ic, true); goto release_flush_bios; }
@@ -2070,7 +2110,7 @@ static void do_journal_write(struct dm_integrity_c *ic, unsigned write_start, complete_journal_op(&comp); wait_for_completion_io(&comp.comp);
- dm_integrity_flush_buffers(ic); + dm_integrity_flush_buffers(ic, true); }
static void integrity_writer(struct work_struct *w) @@ -2112,7 +2152,7 @@ static void recalc_write_super(struct dm_integrity_c *ic) { int r;
- dm_integrity_flush_buffers(ic); + dm_integrity_flush_buffers(ic, false); if (dm_integrity_failed(ic)) return;
@@ -2422,7 +2462,7 @@ static void dm_integrity_postsuspend(struct dm_target *ti) if (ic->meta_dev) queue_work(ic->writer_wq, &ic->writer_work); drain_workqueue(ic->writer_wq); - dm_integrity_flush_buffers(ic); + dm_integrity_flush_buffers(ic, true); }
BUG_ON(!RB_EMPTY_ROOT(&ic->in_progress)); diff --git a/include/linux/dm-bufio.h b/include/linux/dm-bufio.h index 3c8b7d274bd9..45ba37aaf6b7 100644 --- a/include/linux/dm-bufio.h +++ b/include/linux/dm-bufio.h @@ -138,6 +138,7 @@ void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n);
unsigned dm_bufio_get_block_size(struct dm_bufio_client *c); sector_t dm_bufio_get_device_size(struct dm_bufio_client *c); +struct dm_io_client *dm_bufio_get_dm_io_client(struct dm_bufio_client *c); sector_t dm_bufio_get_block_number(struct dm_buffer *b); void *dm_bufio_get_block_data(struct dm_buffer *b); void *dm_bufio_get_aux_data(struct dm_buffer *b);
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.170 commit 669c0b5782fba3c4b0a5f68bca53b3b6055b3b2f
--------------------------------
[ Upstream commit 3226b158e67cfaa677fd180152bfb28989cb2fac ]
Both virtio net and napi_get_frags() allocate skbs with a very small skb->head
While using page fragments instead of a kmalloc backed skb->head might give a small performance improvement in some cases, there is a huge risk of underestimating memory usage.
For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations per page (order-3 page in x86), or even 64 on PowerPC
We have been tracking OOM issues on GKE hosts hitting tcp_mem limits but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
Even if we force napi_alloc_skb() to only use order-0 pages, the issue would still be there on arches with PAGE_SIZE >= 32768
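A rough back-of-the-envelope view of the risk described above (a sketch only, assuming an order-3 page and roughly 1 KB of the page consumed per tiny head, which is where the 32-allocations-per-page figure comes from):
  order-3 page size                        = 32768 bytes
  truesize charged for one tiny skb head   ~  1024 bytes (just its fragment)
  memory actually pinned if the 31 sibling fragments are freed
  while this one skb sits in a socket queue = 32768 bytes (the whole compound page)
  => accounting can be off by up to 32768 / 1024 = 32x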
This patch makes sure that small skb heads are kmalloc backed, so that other objects in the slab page can be reused instead of being held as long as skbs are sitting in socket queues.
Note that we might in the future use the sk_buff napi cache, instead of going through a more expensive __alloc_skb()
Another idea would be to use separate page sizes depending on the allocated length (to never have more than 4 frags per page)
I would like to thank Greg Thelen for his precious help on this matter, analysing crash dumps is always a time consuming task.
Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Paolo Abeni pabeni@redhat.com Cc: Greg Thelen gthelen@google.com Reviewed-by: Alexander Duyck alexanderduyck@fb.com Acked-by: Michael S. Tsirkin mst@redhat.com Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/skbuff.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5b87d2dd7151..73f208466363 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -459,13 +459,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb); struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, gfp_t gfp_mask) { - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct napi_alloc_cache *nc; struct sk_buff *skb; void *data;
len += NET_SKB_PAD + NET_IP_ALIGN;
- if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || + /* If requested length is either too small or too big, + * we use kmalloc() for skb->head allocation. + */ + if (len <= SKB_WITH_OVERHEAD(1024) || + len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); if (!skb) @@ -473,6 +477,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, goto skb_success; }
+ nc = this_cpu_ptr(&napi_alloc_cache); len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); len = SKB_DATA_ALIGN(len);
From: Hoang Le hoang.h.le@dektech.com.au
stable inclusion from linux-4.19.170 commit 4d1d3dddcb3f26000e66cd0a9b8b16f7c2eb41bb
--------------------------------
[ Upstream commit b77413446408fdd256599daf00d5be72b5f3e7c6 ]
The buffer list can have zero skb as following path: tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so we need to check the list before casting an &sk_buff.
Fault report: [] tipc: Bulk publication failure [] general protection fault, probably for non-canonical [#1] PREEMPT [...] [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf] [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2 [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011 [] RIP: 0010:tipc_link_xmit+0xc1/0x2180 [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...] [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202 [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8 [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148 [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018 [] FS: 0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...] [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0
Fixes: af9b028e270fd ("tipc: make media xmit call outside node spinlock context") Acked-by: Jon Maloy jmaloy@redhat.com Signed-off-by: Hoang Le hoang.h.le@dektech.com.au Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/tipc/link.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/net/tipc/link.c b/net/tipc/link.c index f756b721f93e..bd28ac7f2195 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -914,9 +914,7 @@ void tipc_link_reset(struct tipc_link *l) int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, struct sk_buff_head *xmitq) { - struct tipc_msg *hdr = buf_msg(skb_peek(list)); unsigned int maxwin = l->window; - int imp = msg_importance(hdr); unsigned int mtu = l->mtu; u16 ack = l->rcv_nxt - 1; u16 seqno = l->snd_nxt; @@ -925,13 +923,20 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, struct sk_buff_head *backlogq = &l->backlogq; struct sk_buff *skb, *_skb, **tskb; int pkt_cnt = skb_queue_len(list); + struct tipc_msg *hdr; int rc = 0; + int imp; + + if (pkt_cnt <= 0) + return 0;
+ hdr = buf_msg(skb_peek(list)); if (unlikely(msg_size(hdr) > mtu)) { __skb_queue_purge(list); return -EMSGSIZE; }
+ imp = msg_importance(hdr); /* Allow oversubscription of one data msg per source at congestion */ if (unlikely(l->backlog[imp].len >= l->backlog[imp].limit)) { if (imp == TIPC_SYSTEM_IMPORTANCE) {
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.170 commit 8e6f5a1d3bd462ff54cdb86f2938bfaa4223ccc9
--------------------------------
commit dcfea72e79b0aa7a057c8f6024169d86a1bbc84b upstream.
As part of the continual effort to remove direct usage of skb->next and skb->prev, this patch adds a helper for iterating through the singly-linked variant of skb lists, which are used for lists of GSO packets. The name "skb_list_..." has been chosen to match the existing function, "kfree_skb_list", which also operates on these singly-linked lists, and the "..._walk_safe" part is the same idiom as elsewhere in the kernel.
This patch removes the helper from wireguard and puts it into linux/skbuff.h, while making it a bit more robust for general usage. In particular, parentheses are added around the macro argument usage, and it now accounts for trying to iterate through an already-null skb pointer, which will simply run the iteration zero times. This latter enhancement means it can be used to replace both do { ... } while and while (...) open-coded idioms.
This should take care of these three possible usages, which match all current methods of iterations.
skb_list_walk_safe(segs, skb, next) { ... } skb_list_walk_safe(skb, skb, next) { ... } skb_list_walk_safe(segs, skb, segs) { ... }
Gcc appears to generate efficient code for each of these.
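As a usage sketch only (a hypothetical transmit helper, not taken from this patch or from any driver), a loop over a GSO segment list could look like the following; because the macro checks the skb pointer before each step, passing a NULL first pointer simply runs the body zero times:

static void xmit_gso_segments(struct sk_buff *segs)
{
	struct sk_buff *skb, *next;

	skb_list_walk_safe(segs, skb, next) {
		skb_mark_not_on_list(skb);	/* detach the segment from the singly-linked list */
		dev_queue_xmit(skb);		/* hand each segment on individually */
	}
}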
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net [ Just the skbuff.h changes for backporting - gregkh] Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/skbuff.h | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 703ce71caeac..881038a0e1c8 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1363,6 +1363,11 @@ static inline void skb_mark_not_on_list(struct sk_buff *skb) skb->next = NULL; }
+/* Iterate through singly-linked GSO fragments of an skb. */ +#define skb_list_walk_safe(first, skb, next) \ + for ((skb) = (first), (next) = (skb) ? (skb)->next : NULL; (skb); \ + (skb) = (next), (next) = (skb) ? (skb)->next : NULL) + static inline void skb_list_del_init(struct sk_buff *skb) { __list_del_entry(&skb->list);
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.170 commit bae6277ad341d4fbfa564409c310bd151f235250
--------------------------------
commit 5eee7bd7e245914e4e050c413dfe864e31805207 upstream.
This worked before, because we made all callers name their next pointer "next". But in trying to be more "drop-in" ready, the silliness here is revealed. This commit fixes the problem by making the macro argument and the member use different names.
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/skbuff.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 881038a0e1c8..06176ef2a842 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1364,9 +1364,9 @@ static inline void skb_mark_not_on_list(struct sk_buff *skb) }
/* Iterate through singly-linked GSO fragments of an skb. */ -#define skb_list_walk_safe(first, skb, next) \ - for ((skb) = (first), (next) = (skb) ? (skb)->next : NULL; (skb); \ - (skb) = (next), (next) = (skb) ? (skb)->next : NULL) +#define skb_list_walk_safe(first, skb, next_skb) \ + for ((skb) = (first), (next_skb) = (skb) ? (skb)->next : NULL; (skb); \ + (skb) = (next_skb), (next_skb) = (skb) ? (skb)->next : NULL)
static inline void skb_list_del_init(struct sk_buff *skb) {
From: Aya Levin ayal@nvidia.com
stable inclusion from linux-4.19.170 commit 394d9608da91e7909d74895af22c26bba6c0be5d
--------------------------------
[ Upstream commit b210de4f8c97d57de051e805686248ec4c6cfc52 ]
There are cases where a GSO segment's length exceeds the egress MTU: - Forwarding of a TCP GRO skb, when DF flag is not set. - Forwarding of an skb that arrived on a virtualisation interface (virtio-net/vhost/tap) with TSO/GSO size set by another network stack. - Local GSO skb transmitted on a NETIF_F_TSO tunnel stacked over an interface with a smaller MTU. - Arriving GRO skb (or GSO skb in a virtualised environment) that is bridged to a NETIF_F_TSO tunnel stacked over an interface with an insufficient MTU.
If so: - Consume the SKB and its segments. - Issue an ICMP packet with 'Packet Too Big' message containing the MTU, allowing the source host to reduce its Path MTU appropriately.
Note: These cases are handled in the same manner in IPv4 output finish. This patch aligns the behavior of IPv6 and the one of IPv4.
Fixes: 9e50849054a4 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation") Signed-off-by: Aya Levin ayal@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/ip6_output.c | 40 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 22665e3638ac..e1bb7db88483 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -128,8 +128,42 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff * return -EINVAL; }
+static int +ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, + struct sk_buff *skb, unsigned int mtu) +{ + struct sk_buff *segs, *nskb; + netdev_features_t features; + int ret = 0; + + /* Please see corresponding comment in ip_finish_output_gso + * describing the cases where GSO segment length exceeds the + * egress MTU. + */ + features = netif_skb_features(skb); + segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); + if (IS_ERR_OR_NULL(segs)) { + kfree_skb(skb); + return -ENOMEM; + } + + consume_skb(skb); + + skb_list_walk_safe(segs, segs, nskb) { + int err; + + skb_mark_not_on_list(segs); + err = ip6_fragment(net, sk, segs, ip6_finish_output2); + if (err && ret == 0) + ret = err; + } + + return ret; +} + static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) { + unsigned int mtu; int ret;
ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb); @@ -146,7 +180,11 @@ static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *s } #endif
- if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) || + mtu = ip6_skb_dst_mtu(skb); + if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu)) + return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); + + if ((skb->len > mtu && !skb_is_gso(skb)) || dst_allfrag(skb_dst(skb)) || (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size)) return ip6_fragment(net, sk, skb, ip6_finish_output2);
From: Hans de Goede hdegoede@redhat.com
stable inclusion from linux-4.19.171 commit 7efb1858e28b96b4ea0eac7616dc90290de42335
--------------------------------
commit 78a18fec5258c8df9435399a1ea022d73d3eceb9 upstream.
Set the acpi_device pointer which acpi_bus_get_device() returns-by- reference to NULL on errors.
We've recently had 2 cases where callers of acpi_bus_get_device() did not properly error check the return value, so set the returned- by-reference acpi_device pointer to NULL, because at least some callers of acpi_bus_get_device() expect that to be done on errors.
[ rjw: This issue was exposed by commit 71da201f38df ("ACPI: scan: Defer enumeration of devices with _DEP lists") which caused it to be much more likely to occur on some systems, but the real defect had been introduced by an earlier commit. ]
Fixes: 40e7fcb19293 ("ACPI: Add _DEP support to fix battery issue on Asus T100TA") Fixes: bcfcd409d4db ("usb: split code locating ACPI companion into port and device") Reported-by: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com Tested-by: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com Diagnosed-by: Rafael J. Wysocki rafael@kernel.org Signed-off-by: Hans de Goede hdegoede@redhat.com Cc: All applicable stable@vger.kernel.org [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/acpi/scan.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 46f6698c72fd..3110ac8176cb 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -586,6 +586,8 @@ static int acpi_get_device_data(acpi_handle handle, struct acpi_device **device, if (!device) return -EINVAL;
+ *device = NULL; + status = acpi_get_data_full(handle, acpi_scan_drop_device, (void **)device, callback); if (ACPI_FAILURE(status) || !*device) {
From: Hannes Reinecke hare@suse.de
stable inclusion from linux-4.19.171 commit 97c5846e04532a0651da7016b4d35862993ea56d
--------------------------------
commit 809b1e4945774c9ec5619a8f4e2189b7b3833c0c upstream.
This reverts commit 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()")
dm_get_dev_t() is just used to convert an arbitrary 'path' string into a dev_t. It doesn't presume that the device is present; that check will be done later, as the only caller is dm_get_device(), which does a dm_get_table_device() later on, which will properly open the device.
So if the path string already _is_ in major:minor representation we can convert it directly, avoiding a recursion into the filesystem to lookup the block device.
This avoids a hang in multipath_message() when the filesystem is inaccessible.
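A minimal sketch of the added fast path (hypothetical input string, mirroring the hunk below): a path that is already in major:minor form is converted arithmetically, with no filesystem lookup at all.

	unsigned int major, minor;
	char dummy;	/* %c catches trailing garbage such as "253:7x" */

	if (sscanf("253:7", "%u:%u%c", &major, &minor, &dummy) == 2) {
		dev_t dev = MKDEV(major, minor);
		/* MAJOR(dev)/MINOR(dev) are compared back against major/minor
		 * to reject numbers that do not fit in a dev_t (-EOVERFLOW). */
	}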
Fixes: 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()") Cc: stable@vger.kernel.org Signed-off-by: Hannes Reinecke hare@suse.de Signed-off-by: Martin Wilck mwilck@suse.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-table.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 4ac9ac151fd3..418671ba057a 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -431,14 +431,23 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode, { int r; dev_t dev; + unsigned int major, minor; + char dummy; struct dm_dev_internal *dd; struct dm_table *t = ti->table;
BUG_ON(!t);
- dev = dm_get_dev_t(path); - if (!dev) - return -ENODEV; + if (sscanf(path, "%u:%u%c", &major, &minor, &dummy) == 2) { + /* Extract the major/minor numbers */ + dev = MKDEV(major, minor); + if (MAJOR(dev) != major || MINOR(dev) != minor) + return -EOVERFLOW; + } else { + dev = dm_get_dev_t(path); + if (!dev) + return -ENODEV; + }
dd = find_device(&t->devices, dev); if (!dd) {
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.171 commit cf750e09d17f25cf763644168417fcad73e3ce0c
--------------------------------
commit 2d06dfecb132a1cc2e374a44eae83b5c4356b8b4 upstream.
Recalculate can only be specified with internal_hash.
Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-integrity.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index bb38db2040c3..a3acb59272cf 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -3520,6 +3520,12 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) r = -ENOMEM; goto bad; } + } else { + if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING)) { + ti->error = "Recalculate can only be specified with internal_hash"; + r = -EINVAL; + goto bad; + } }
ic->bufio = dm_bufio_client_create(ic->meta_dev ? ic->meta_dev->bdev : ic->dev->bdev,
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
stable inclusion from linux-4.19.171 commit cfd402c22ca216e6be8d7337da427bfcfaa4e2b3
--------------------------------
commit 3d1cf435e201d1fd63e4346b141881aed086effd upstream.
If the device passed as the target (second argument) to device_is_dependent() is not completely registered (that is, it has been initialized, but not added yet), but the parent pointer of it is set, it may be missing from the list of the parent's children and device_for_each_child() called by device_is_dependent() cannot be relied on to catch that dependency.
For this reason, modify device_is_dependent() to check the ancestors of the target device by following its parent pointer in addition to the device_for_each_child() walk.
Fixes: 9ed9895370ae ("driver core: Functional dependencies tracking support") Reported-by: Stephan Gerhold stephan@gerhold.net Tested-by: Stephan Gerhold stephan@gerhold.net Reviewed-by: Saravana Kannan saravanak@google.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Link: https://lore.kernel.org/r/17705994.d592GUb2YH@kreacher Cc: stable stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/base/core.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c index f0cdf38ed31c..e196e6a67e63 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -93,6 +93,16 @@ void device_links_read_unlock(int not_used) } #endif /* !CONFIG_SRCU */
+static bool device_is_ancestor(struct device *dev, struct device *target) +{ + while (target->parent) { + target = target->parent; + if (dev == target) + return true; + } + return false; +} + /** * device_is_dependent - Check if one device depends on another one * @dev: Device to check dependencies for. @@ -106,7 +116,12 @@ static int device_is_dependent(struct device *dev, void *target) struct device_link *link; int ret;
- if (dev == target) + /* + * The "ancestors" check is needed to catch the case when the target + * device has not been completely initialized yet and it is still + * missing from the list of children of its parent device. + */ + if (dev == target || device_is_ancestor(dev, target)) return 1;
ret = device_for_each_child(dev, target, device_is_dependent);
From: Guillaume Nault gnault@redhat.com
stable inclusion from linux-4.19.171 commit c9d1f7029832414846f932381178b2eda25e9dda
--------------------------------
commit 2e5a6266fbb11ae93c468dfecab169aca9c27b43 upstream.
RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt() treats Not-ECT or ECT(1) packets in a different way than those with ECT(0) or CE.
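A quick bit-level check (a worked example only, assuming the usual encoding: ECN in the two low-order bits of the tos byte, IPTOS_TOS_MASK == 0x1E, IPTOS_RT_MASK == 0x1C) shows the asymmetry that the reproducer below exercises:
  TOS 4 + Not-ECT -> tos 0x04; RT_TOS(0x04) = 0x04 & 0x1E = 0x04  (matches the tos-4 route)
  TOS 4 + ECT(1)  -> tos 0x05; RT_TOS(0x05) = 0x05 & 0x1E = 0x04  (matches)
  TOS 4 + ECT(0)  -> tos 0x06; RT_TOS(0x06) = 0x06 & 0x1E = 0x06  (no match, rpfilter drops)
  TOS 4 + CE      -> tos 0x07; RT_TOS(0x07) = 0x07 & 0x1E = 0x06  (no match, rpfilter drops)
  With the fix:      0x06 & IPTOS_RT_MASK = 0x06 & 0x1C = 0x04    (both ECN bits cleared, matches)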
Reproducer:
Create two netns, connected with a veth: $ ip netns add ns0 $ ip netns add ns1 $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 $ ip -netns ns0 link set dev veth01 up $ ip -netns ns1 link set dev veth10 up $ ip -netns ns0 address add 192.0.2.10/32 dev veth01 $ ip -netns ns1 address add 192.0.2.11/32 dev veth10
Add a route to ns1 in ns0: $ ip -netns ns0 route add 192.0.2.11/32 dev veth01
In ns1, only packets with TOS 4 can be routed to ns0: $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10
Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS is 4: $ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT ... 0% packet loss ... $ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1) ... 0% packet loss ... $ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0) ... 0% packet loss ... $ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE ... 0% packet loss ...
Now use iptable's rpfilter module in ns1: $ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP
Not-ECT and ECT(1) packets still pass: $ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT ... 0% packet loss ... $ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1) ... 0% packet loss ...
But ECT(0) and ECN packets are dropped: $ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0) ... 100% packet loss ... $ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE ... 100% packet loss ...
After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.
Fixes: 8f97339d3feb ("netfilter: add ipv4 reverse path filter match") Signed-off-by: Guillaume Nault gnault@redhat.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/netfilter/ipt_rpfilter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c index 74b19a5c572e..088320ce77a1 100644 --- a/net/ipv4/netfilter/ipt_rpfilter.c +++ b/net/ipv4/netfilter/ipt_rpfilter.c @@ -94,7 +94,7 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par) flow.daddr = iph->saddr; flow.saddr = rpfilter_get_saddr(iph->daddr); flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0; - flow.flowi4_tos = RT_TOS(iph->tos); + flow.flowi4_tos = iph->tos & IPTOS_RT_MASK; flow.flowi4_scope = RT_SCOPE_UNIVERSE; flow.flowi4_oif = l3mdev_master_ifindex_rcu(xt_in(par));
From: Alexander Lobakin alobakin@pm.me
stable inclusion from linux-4.19.171 commit 66fb76f3a8d7a031f8fd80c76f8a2941cc182469
--------------------------------
commit 66c556025d687dbdd0f748c5e1df89c977b6c02a upstream.
Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") ensured that skbs with data size lower than 1025 bytes will be kmalloc'ed to avoid excessive page cache fragmentation and memory consumption. However, the fix addressed only __napi_alloc_skb() (primarily for virtio_net and napi_get_frags()), but the issue can still be triggered through __netdev_alloc_skb(), which is still used by several drivers. Drivers often allocate a tiny skb for headers and place the rest of the frame in frags (so-called copybreak). Mirror the condition to __netdev_alloc_skb() to handle this case too.
Since v1 [0]: - fix "Fixes:" tag; - refine commit message (mention copybreak usecase).
[0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me
Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()") Signed-off-by: Alexander Lobakin alobakin@pm.me Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/skbuff.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 73f208466363..4a9ab2596e78 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -398,7 +398,11 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
len += NET_SKB_PAD;
- if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || + /* If requested length is either too small or too big, + * we use kmalloc() for skb->head allocation. + */ + if (len <= SKB_WITH_OVERHEAD(1024) || + len > SKB_WITH_OVERHEAD(PAGE_SIZE) || (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); if (!skb)
From: Lecopzer Chen lecopzer@gmail.com
stable inclusion from linux-4.19.171 commit 5cb8624526309dfb7979ef4a09205c947788099b
--------------------------------
commit a11a496ee6e2ab6ed850233c96b94caf042af0b9 upstream.
During testing of kasan_populate_early_shadow and kasan_remove_zero_shadow, if the shadow start and end addresses in kasan_remove_zero_shadow() are not aligned to PMD_SIZE, the remaining unaligned PTEs won't be removed.
In the test case for kasan_remove_zero_shadow():
shadow_start: 0xffffffb802000000, shadow end: 0xffffffbfbe000000
3-level page table: PUD_SIZE: 0x40000000 PMD_SIZE: 0x200000 PAGE_SIZE: 4K
0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next address is 0xffffffbfbdf80000 which is not aligned to PUD_SIZE.
Under the correct condition, this should fall back to the next-level kasan_remove_pmd_table(), but the condition flow always continues and skips the unaligned part.
Fix this by correcting the condition when neither next nor addr is aligned.
Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen lecopzer.chen@mediatek.com Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Dan Williams dan.j.williams@intel.com Cc: Dmitry Vyukov dvyukov@google.com Cc: Alexander Potapenko glider@google.com Cc: YJ Chiang yj.chiang@mediatek.com Cc: Andrey Konovalov andreyknvl@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/kasan/kasan_init.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c index 7a2a2f13f86f..ee4d5462b558 100644 --- a/mm/kasan/kasan_init.c +++ b/mm/kasan/kasan_init.c @@ -372,9 +372,10 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
if (kasan_pte_table(*pmd)) { if (IS_ALIGNED(addr, PMD_SIZE) && - IS_ALIGNED(next, PMD_SIZE)) + IS_ALIGNED(next, PMD_SIZE)) { pmd_clear(pmd); - continue; + continue; + } } pte = pte_offset_kernel(pmd, addr); kasan_remove_pte_table(pte, addr, next); @@ -397,9 +398,10 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
if (kasan_pmd_table(*pud)) { if (IS_ALIGNED(addr, PUD_SIZE) && - IS_ALIGNED(next, PUD_SIZE)) + IS_ALIGNED(next, PUD_SIZE)) { pud_clear(pud); - continue; + continue; + } } pmd = pmd_offset(pud, addr); pmd_base = pmd_offset(pud, 0); @@ -423,9 +425,10 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
if (kasan_pud_table(*p4d)) { if (IS_ALIGNED(addr, P4D_SIZE) && - IS_ALIGNED(next, P4D_SIZE)) + IS_ALIGNED(next, P4D_SIZE)) { p4d_clear(p4d); - continue; + continue; + } } pud = pud_offset(p4d, addr); kasan_remove_pud_table(pud, addr, next); @@ -457,9 +460,10 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
if (kasan_p4d_table(*pgd)) { if (IS_ALIGNED(addr, PGDIR_SIZE) && - IS_ALIGNED(next, PGDIR_SIZE)) + IS_ALIGNED(next, PGDIR_SIZE)) { pgd_clear(pgd); - continue; + continue; + } }
p4d = p4d_offset(pgd, addr);
From: Lecopzer Chen lecopzer@gmail.com
stable inclusion from linux-4.19.171 commit 1ad3d65c19b9ea823331afd58541a0936b9f2c6c
--------------------------------
commit 5dabd1712cd056814f9ab15f1d68157ceb04e741 upstream.
kasan_remove_zero_shadow() shall use the original virtual address, start and size, instead of the shadow address.
Link: https://lkml.kernel.org/r/20210103063847.5963-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen lecopzer.chen@mediatek.com Reviewed-by: Andrey Konovalov andreyknvl@google.com Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Dan Williams dan.j.williams@intel.com Cc: Dmitry Vyukov dvyukov@google.com Cc: Alexander Potapenko glider@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/kasan/kasan_init.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c index ee4d5462b558..7a731c74be7d 100644 --- a/mm/kasan/kasan_init.c +++ b/mm/kasan/kasan_init.c @@ -487,7 +487,6 @@ int kasan_add_zero_shadow(void *start, unsigned long size)
ret = kasan_populate_zero_shadow(shadow_start, shadow_end); if (ret) - kasan_remove_zero_shadow(shadow_start, - size >> KASAN_SHADOW_SCALE_SHIFT); + kasan_remove_zero_shadow(start, size); return ret; }
From: Guillaume Nault gnault@redhat.com
stable inclusion from linux-4.19.171 commit 372d963821abeb4ef92ffdb1a0e94daea6f025c7
--------------------------------
commit 8d2b51b008c25240914984208b2ced57d1dd25a5 upstream.
udp_v4_early_demux() is the only function that calls ip_mc_validate_source() with a TOS that hasn't been masked with IPTOS_RT_MASK.
This results in different behaviours for incoming multicast UDPv4 packets, depending on if ip_mc_validate_source() is called from the early-demux path (udp_v4_early_demux) or from the regular input path (ip_route_input_noref).
ECN would normally not be used with UDP multicast packets, so the practical consequences should be limited on that side. However, IPTOS_RT_MASK is used so that the TOS's high-order bits are masked as well, to align with the behaviour of the non-early-demux path.
Reproducer:
Setup two netns, connected with veth: $ ip netns add ns0 $ ip netns add ns1 $ ip -netns ns0 link set dev lo up $ ip -netns ns1 link set dev lo up $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 $ ip -netns ns0 link set dev veth01 up $ ip -netns ns1 link set dev veth10 up $ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01 $ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10
In ns0, add route to multicast address 224.0.2.0/24 using source address 198.51.100.10: $ ip -netns ns0 address add 198.51.100.10/32 dev lo $ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10
In ns1, define route to 198.51.100.10, only for packets with TOS 4: $ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10
Also activate rp_filter in ns1, so that incoming packets not matching the above route get dropped: $ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1
Now try to receive packets on 224.0.2.11: $ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -
In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is, tos 6 for socat): $ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test0" message is properly received by socat in ns1, because early-demux has no cached dst to use, so source address validation is done by ip_route_input_mc(), which receives a TOS that has the ECN bits masked.
Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0): $ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test1" message isn't received by socat in ns1, because, now, early-demux has a cached dst to use and calls ip_mc_validate_source() immediately, without masking the ECN bits.
Fixes: bc044e8db796 ("udp: perform source validation for mcast early demux") Signed-off-by: Guillaume Nault gnault@redhat.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/udp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 1d27d3b24a5e..871f41b7ab70 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2413,7 +2413,8 @@ int udp_v4_early_demux(struct sk_buff *skb) */ if (!inet_sk(sk)->inet_daddr && in_dev) return ip_mc_validate_source(skb, iph->daddr, - iph->saddr, iph->tos, + iph->saddr, + iph->tos & IPTOS_RT_MASK, skb->dev, in_dev, &itag); } return 0;
From: Matteo Croce mcroce@microsoft.com
stable inclusion from linux-4.19.171 commit be33a52751d2482630bfc085179edc95356ba7fb
--------------------------------
commit a826b04303a40d52439aa141035fca5654ccaccd upstream.
The ff00::/8 multicast route is created without specifying the fc_protocol field, so the default RTPROT_BOOT value is used:
$ ip -6 -d route unicast ::1 dev lo proto kernel scope global metric 256 pref medium unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium
As the documentation says, this value identifies routes installed during boot, but this route is created when the interface is set up. Change the value to RTPROT_KERNEL, which is a better fit.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Matteo Croce mcroce@microsoft.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/addrconf.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 627cd24b7c0d..e613df6814fc 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2397,6 +2397,7 @@ static void addrconf_add_mroute(struct net_device *dev) .fc_flags = RTF_UP, .fc_type = RTN_UNICAST, .fc_nlinfo.nl_net = dev_net(dev), + .fc_protocol = RTPROT_KERNEL, };
ipv6_addr_set(&cfg.fc_dst, htonl(0xFF000000), 0, 0, 0);
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.171 commit 22c1b22672f3c56289ea91cf5eaffa61db3e4b2e
--------------------------------
commit bcd0cf19ef8258ac31b9a20248b05c15a1f4b4b0 upstream.
tc_index being 16 bits wide, we need to check that the TCA_TCINDEX_SHIFT attribute is not silly.
UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29 shift exponent 255 is too large for 32-bit type 'int' CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 valid_perfect_hash net/sched/cls_tcindex.c:260 [inline] tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425 tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546 tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127 rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2336 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9
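A minimal sketch of the underlying problem (illustrative values only, not the classifier code): tc_index is a u16, so no meaningful shift can exceed 16, and shifting a 32-bit int by 255 is undefined behaviour.

	u16 tc_index = 0xffff;
	u32 shift = 255;			/* unchecked TCA_TCINDEX_SHIFT from netlink */
	int hash = tc_index >> shift;		/* tc_index promotes to int; a shift count >= 32
						 * is undefined behaviour, which is the UBSAN
						 * splat above; the patch now rejects shift > 16
						 * with -EINVAL */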
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sched/cls_tcindex.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c index 79d8f3f79017..ee20bd2eb1ef 100644 --- a/net/sched/cls_tcindex.c +++ b/net/sched/cls_tcindex.c @@ -339,9 +339,13 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, if (tb[TCA_TCINDEX_MASK]) cp->mask = nla_get_u16(tb[TCA_TCINDEX_MASK]);
- if (tb[TCA_TCINDEX_SHIFT]) + if (tb[TCA_TCINDEX_SHIFT]) { cp->shift = nla_get_u32(tb[TCA_TCINDEX_SHIFT]); - + if (cp->shift > 16) { + err = -EINVAL; + goto errout; + } + } if (!cp->hash) { /* Hash not specified, use perfect hash if the upper limit * of the hashing index is below the threshold.
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.171 commit df4646250fc74e6871dae4cdf3ededf29a11811c
--------------------------------
commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.
iproute2 probably never goes beyond 8 for the cell exponent, but stick to the max shift exponent for a signed 32-bit int.
UBSAN reported: UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22 shift exponent 130 is too large for 32-bit type 'int' CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x183/0x22e lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg net/socket.c:672 [inline] ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345 ___sys_sendmsg net/socket.c:2399 [inline] __sys_sendmsg+0x319/0x400 net/socket.c:2432 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Acked-by: Cong Wang cong.wang@bytedance.com Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sched/sch_api.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 84fdc4857771..576f70e078bc 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -398,7 +398,8 @@ struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r, { struct qdisc_rate_table *rtab;
- if (tab == NULL || r->rate == 0 || r->cell_log == 0 || + if (tab == NULL || r->rate == 0 || + r->cell_log == 0 || r->cell_log >= 32 || nla_len(tab) != TC_RTAB_SIZE) { NL_SET_ERR_MSG(extack, "Invalid rate table parameters for searching"); return NULL;
From: Matteo Croce mcroce@microsoft.com
stable inclusion from linux-4.19.171 commit 47d6a700430234cb0328722833ac3a0788ea7f38
--------------------------------
commit ceed9038b2783d14e0422bdc6fd04f70580efb4c upstream.
The multicast route ff00::/8 is created with type RTN_UNICAST:
$ ip -6 -d route unicast ::1 dev lo proto kernel scope global metric 256 pref medium unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium
Set the type to RTN_MULTICAST which is more appropriate.
Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") Signed-off-by: Matteo Croce mcroce@microsoft.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/addrconf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index e613df6814fc..76c097552ea7 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2395,7 +2395,7 @@ static void addrconf_add_mroute(struct net_device *dev) .fc_ifindex = dev->ifindex, .fc_dst_len = 8, .fc_flags = RTF_UP, - .fc_type = RTN_UNICAST, + .fc_type = RTN_MULTICAST, .fc_nlinfo.nl_net = dev_net(dev), .fc_protocol = RTPROT_KERNEL, };
From: Tariq Toukan tariqt@nvidia.com
stable inclusion from linux-4.19.171 commit fffe7ab69dc57460913854b815351b2cdf9a3c9d
--------------------------------
commit a3eb4e9d4c9218476d05c52dfd2be3d6fdce6b91 upstream.
With NETIF_F_HW_TLS_RX, packets are decrypted in HW. This cannot logically be done when RXCSUM offload is off.
Fixes: 14136564c8ee ("net: Add TLS RX offload feature") Signed-off-by: Tariq Toukan tariqt@nvidia.com Reviewed-by: Boris Pismenny borisp@nvidia.com Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/dev.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c index c77d12a35f92..5d9800804d4a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8351,6 +8351,11 @@ static netdev_features_t netdev_fix_features(struct net_device *dev, } }
+ if ((features & NETIF_F_HW_TLS_RX) && !(features & NETIF_F_RXCSUM)) { + netdev_dbg(dev, "Dropping TLS RX HW offload feature since no RXCSUM feature.\n"); + features &= ~NETIF_F_HW_TLS_RX; + } + return features; }
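As an aside, here is a minimal userspace sketch of the pattern the hunk above follows (the feature bits are stand-ins for illustration, not the real NETIF_F_* values): when a feature's prerequisite is disabled, the dependent feature is silently dropped rather than left half-working.

#include <stdio.h>

/* Stand-in feature bits, for illustration only. */
#define F_RXCSUM    (1u << 0)
#define F_HW_TLS_RX (1u << 1)

/* Mirrors the netdev_fix_features() style: drop a feature whose
 * prerequisite is not enabled. */
static unsigned int fix_features(unsigned int features)
{
    if ((features & F_HW_TLS_RX) && !(features & F_RXCSUM)) {
        fprintf(stderr, "dropping TLS RX offload: RXCSUM is off\n");
        features &= ~F_HW_TLS_RX;
    }
    return features;
}

int main(void)
{
    unsigned int f = fix_features(F_HW_TLS_RX);    /* RXCSUM deliberately off */

    printf("TLS RX still set? %s\n", (f & F_HW_TLS_RX) ? "yes" : "no");
    return 0;
}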
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 3fe0ed7bd7329e28aca7cdaf7353c6d4778b3d41
--------------------------------
commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream
The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code.
Split it out into a function, move it into futex code and make the various futex exit functions static.
Preparatory only and no functional change.
Folded build fix from Borislav.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/compat.h | 2 -- include/linux/futex.h | 29 ++++++++++++++++------------- kernel/fork.c | 25 +++---------------------- kernel/futex.c | 33 +++++++++++++++++++++++++++++---- 4 files changed, 48 insertions(+), 41 deletions(-)
diff --git a/include/linux/compat.h b/include/linux/compat.h index de0c13bdcd2c..189d0e111d57 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -445,8 +445,6 @@ struct compat_kexec_segment; struct compat_mq_attr; struct compat_msgbuf;
-extern void compat_exit_robust_list(struct task_struct *curr); - #define BITS_PER_COMPAT_LONG (8*sizeof(compat_long_t))
#define BITS_TO_COMPAT_LONGS(bits) DIV_ROUND_UP(bits, BITS_PER_COMPAT_LONG) diff --git a/include/linux/futex.h b/include/linux/futex.h index a61bf436dcf3..82b8a5cd4018 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -2,7 +2,9 @@ #ifndef _LINUX_FUTEX_H #define _LINUX_FUTEX_H
+#include <linux/sched.h> #include <linux/ktime.h> + #include <uapi/linux/futex.h>
struct inode; @@ -51,15 +53,24 @@ union futex_key { #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
#ifdef CONFIG_FUTEX -extern void exit_robust_list(struct task_struct *curr);
-long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, - u32 __user *uaddr2, u32 val2, u32 val3); -#else -static inline void exit_robust_list(struct task_struct *curr) +static inline void futex_init_task(struct task_struct *tsk) { + tsk->robust_list = NULL; +#ifdef CONFIG_COMPAT + tsk->compat_robust_list = NULL; +#endif + INIT_LIST_HEAD(&tsk->pi_state_list); + tsk->pi_state_cache = NULL; }
+void futex_mm_release(struct task_struct *tsk); + +long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, + u32 __user *uaddr2, u32 val2, u32 val3); +#else +static inline void futex_init_task(struct task_struct *tsk) { } +static inline void futex_mm_release(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) @@ -68,12 +79,4 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val, } #endif
-#ifdef CONFIG_FUTEX_PI -extern void exit_pi_state_list(struct task_struct *curr); -#else -static inline void exit_pi_state_list(struct task_struct *curr) -{ -} -#endif - #endif diff --git a/kernel/fork.c b/kernel/fork.c index dc059179fd18..82345524b086 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1278,20 +1278,7 @@ static int wait_for_vfork_done(struct task_struct *child, void mm_release(struct task_struct *tsk, struct mm_struct *mm) { /* Get rid of any futexes when releasing the mm */ -#ifdef CONFIG_FUTEX - if (unlikely(tsk->robust_list)) { - exit_robust_list(tsk); - tsk->robust_list = NULL; - } -#ifdef CONFIG_COMPAT - if (unlikely(tsk->compat_robust_list)) { - compat_exit_robust_list(tsk); - tsk->compat_robust_list = NULL; - } -#endif - if (unlikely(!list_empty(&tsk->pi_state_list))) - exit_pi_state_list(tsk); -#endif + futex_mm_release(tsk);
uprobe_free_utask(tsk);
@@ -1998,14 +1985,8 @@ static __latent_entropy struct task_struct *copy_process( #ifdef CONFIG_BLOCK p->plug = NULL; #endif -#ifdef CONFIG_FUTEX - p->robust_list = NULL; -#ifdef CONFIG_COMPAT - p->compat_robust_list = NULL; -#endif - INIT_LIST_HEAD(&p->pi_state_list); - p->pi_state_cache = NULL; -#endif + futex_init_task(p); + /* * sigaltstack should be cleared when sharing the same VM */ diff --git a/kernel/futex.c b/kernel/futex.c index a84b993f09cb..291e3f8e275f 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -341,6 +341,12 @@ static inline bool should_fail_futex(bool fshared) } #endif /* CONFIG_FAIL_FUTEX */
+#ifdef CONFIG_COMPAT +static void compat_exit_robust_list(struct task_struct *curr); +#else +static inline void compat_exit_robust_list(struct task_struct *curr) { } +#endif + static inline void futex_get_mm(union futex_key *key) { mmgrab(key->private.mm); @@ -912,7 +918,7 @@ static void put_pi_state(struct futex_pi_state *pi_state) * Kernel cleans up PI-state, but userspace is likely hosed. * (Robust-futex cleanup is separate and might save the day for userspace.) */ -void exit_pi_state_list(struct task_struct *curr) +static void exit_pi_state_list(struct task_struct *curr) { struct list_head *next, *head = &curr->pi_state_list; struct futex_pi_state *pi_state; @@ -982,7 +988,8 @@ void exit_pi_state_list(struct task_struct *curr) } raw_spin_unlock_irq(&curr->pi_lock); } - +#else +static inline void exit_pi_state_list(struct task_struct *curr) { } #endif
/* @@ -3603,7 +3610,7 @@ static inline int fetch_robust_entry(struct robust_list __user **entry, * * We silently return on any sign of list-walking problem. */ -void exit_robust_list(struct task_struct *curr) +static void exit_robust_list(struct task_struct *curr) { struct robust_list_head __user *head = curr->robust_list; struct robust_list __user *entry, *next_entry, *pending; @@ -3668,6 +3675,24 @@ void exit_robust_list(struct task_struct *curr) } }
+void futex_mm_release(struct task_struct *tsk) +{ + if (unlikely(tsk->robust_list)) { + exit_robust_list(tsk); + tsk->robust_list = NULL; + } + +#ifdef CONFIG_COMPAT + if (unlikely(tsk->compat_robust_list)) { + compat_exit_robust_list(tsk); + tsk->compat_robust_list = NULL; + } +#endif + + if (unlikely(!list_empty(&tsk->pi_state_list))) + exit_pi_state_list(tsk); +} + long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) { @@ -3795,7 +3820,7 @@ static void __user *futex_uaddr(struct robust_list __user *entry, * * We silently return on any sign of list-walking problem. */ -void compat_exit_robust_list(struct task_struct *curr) +static void compat_exit_robust_list(struct task_struct *curr) { struct compat_robust_list_head __user *head = curr->compat_robust_list; struct robust_list __user *entry, *next_entry, *pending;
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 095444fad7e35dcd63d0c6b86461d024314e2051
--------------------------------
commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream
The futex exit handling relies on PF_ flags. That's suboptimal as it requires a smp_mb() and an ugly lock/unlock of the exiting task's pi_lock in the middle of do_exit() to enforce the observability of PF_EXITING in the futex code.
Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic over to the new state. The PF_EXITING dependency will be cleaned up in a later step.
This prepares for handling various futex exit issues later.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/futex.h | 33 +++++++++++++++++++++++++++++++++ include/linux/sched.h | 2 +- kernel/exit.c | 18 ++---------------- kernel/futex.c | 25 +++++++++++++------------ 4 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index 82b8a5cd4018..84e945116126 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -53,6 +53,10 @@ union futex_key { #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
#ifdef CONFIG_FUTEX +enum { + FUTEX_STATE_OK, + FUTEX_STATE_DEAD, +};
static inline void futex_init_task(struct task_struct *tsk) { @@ -62,6 +66,34 @@ static inline void futex_init_task(struct task_struct *tsk) #endif INIT_LIST_HEAD(&tsk->pi_state_list); tsk->pi_state_cache = NULL; + tsk->futex_state = FUTEX_STATE_OK; +} + +/** + * futex_exit_done - Sets the tasks futex state to FUTEX_STATE_DEAD + * @tsk: task to set the state on + * + * Set the futex exit state of the task lockless. The futex waiter code + * observes that state when a task is exiting and loops until the task has + * actually finished the futex cleanup. The worst case for this is that the + * waiter runs through the wait loop until the state becomes visible. + * + * This has two callers: + * + * - futex_mm_release() after the futex exit cleanup has been done + * + * - do_exit() from the recursive fault handling path. + * + * In case of a recursive fault this is best effort. Either the futex exit + * code has run already or not. If the OWNER_DIED bit has been set on the + * futex then the waiter can take it over. If not, the problem is pushed + * back to user space. If the futex exit code did not run yet, then an + * already queued waiter might block forever, but there is nothing which + * can be done about that. + */ +static inline void futex_exit_done(struct task_struct *tsk) +{ + tsk->futex_state = FUTEX_STATE_DEAD; }
void futex_mm_release(struct task_struct *tsk); @@ -71,6 +103,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, #else static inline void futex_init_task(struct task_struct *tsk) { } static inline void futex_mm_release(struct task_struct *tsk) { } +static inline void futex_exit_done(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) diff --git a/include/linux/sched.h b/include/linux/sched.h index ced2d7fc0754..e446bfd8c7ef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1002,6 +1002,7 @@ struct task_struct { #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; + unsigned int futex_state; #endif #ifdef CONFIG_PERF_EVENTS struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts]; @@ -1397,7 +1398,6 @@ extern struct pid *cad_pid; */ #define PF_IDLE 0x00000002 /* I am an IDLE thread */ #define PF_EXITING 0x00000004 /* Getting shut down */ -#define PF_EXITPIDONE 0x00000008 /* PI exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ #define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */ #define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */ diff --git a/kernel/exit.c b/kernel/exit.c index d3f1108e57f7..e8a3ca3a3685 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -804,16 +804,7 @@ void __noreturn do_exit(long code) */ if (unlikely(tsk->flags & PF_EXITING)) { pr_alert("Fixing recursive fault but reboot is needed!\n"); - /* - * We can do this unlocked here. The futex code uses - * this flag just to verify whether the pi state - * cleanup has been done or not. In the worst case it - * loops once more. We pretend that the cleanup was - * done as there is no way to return. Either the - * OWNER_DIED bit is set by now or we push the blocked - * task into the wait for ever nirwana as well. - */ - tsk->flags |= PF_EXITPIDONE; + futex_exit_done(tsk); set_current_state(TASK_UNINTERRUPTIBLE); schedule(); } @@ -911,12 +902,7 @@ void __noreturn do_exit(long code) * Make sure we are holding no locks: */ debug_check_no_locks_held(); - /* - * We can do this unlocked here. The futex code uses this flag - * just to verify whether the pi state cleanup has been done - * or not. In the worst case it loops once more. - */ - tsk->flags |= PF_EXITPIDONE; + futex_exit_done(tsk);
if (tsk->io_context) exit_io_context(tsk); diff --git a/kernel/futex.c b/kernel/futex.c index 291e3f8e275f..26688507fce5 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1205,9 +1205,10 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, u32 uval2;
/* - * If PF_EXITPIDONE is not yet set, then try again. + * If the futex exit state is not yet FUTEX_STATE_DEAD, wait + * for it to finish. */ - if (tsk && !(tsk->flags & PF_EXITPIDONE)) + if (tsk && tsk->futex_state != FUTEX_STATE_DEAD) return -EAGAIN;
/* @@ -1226,8 +1227,9 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, * *uaddr = 0xC0000000; tsk = get_task(PID); * } if (!tsk->flags & PF_EXITING) { * ... attach(); - * tsk->flags |= PF_EXITPIDONE; } else { - * if (!(tsk->flags & PF_EXITPIDONE)) + * tsk->futex_state = } else { + * FUTEX_STATE_DEAD; if (tsk->futex_state != + * FUTEX_STATE_DEAD) * return -EAGAIN; * return -ESRCH; <--- FAIL * } @@ -1283,17 +1285,16 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, }
/* - * We need to look at the task state flags to figure out, - * whether the task is exiting. To protect against the do_exit - * change of the task flags, we do this protected by - * p->pi_lock: + * We need to look at the task state to figure out, whether the + * task is exiting. To protect against the change of the task state + * in futex_exit_release(), we do this protected by p->pi_lock: */ raw_spin_lock_irq(&p->pi_lock); - if (unlikely(p->flags & PF_EXITING)) { + if (unlikely(p->futex_state != FUTEX_STATE_OK)) { /* - * The task is on the way out. When PF_EXITPIDONE is - * set, we know that the task has finished the - * cleanup: + * The task is on the way out. When the futex state is + * FUTEX_STATE_DEAD, we know that the task has finished + * the cleanup: */ int ret = handle_exit_race(uaddr, uval, p);
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 9425476fb17a29b9f1c564321ae4b80129534c57
--------------------------------
commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream
mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm().
In the exit_mm() case, PF_EXITING is set and the futex state is updated. In the exec_mm() case these states are not touched.
As the futex exit code needs further protections against exit races, this needs to be split into two functions.
Preparatory only, no functional change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/exec.c | 2 +- include/linux/sched/mm.h | 6 ++++-- kernel/exit.c | 2 +- kernel/fork.c | 12 +++++++++++- 4 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index 8099ac98e605..07b536eb6c44 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1011,7 +1011,7 @@ static int exec_mmap(struct mm_struct *mm) /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; - mm_release(tsk, old_mm); + exec_mm_release(tsk, old_mm);
if (old_mm) { sync_mm_rss(old_mm); diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 290a613687b5..deb1bf7f5aa7 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -119,8 +119,10 @@ extern struct mm_struct *get_task_mm(struct task_struct *task); * succeeds. */ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode); -/* Remove the current tasks stale references to the old mm_struct */ -extern void mm_release(struct task_struct *, struct mm_struct *); +/* Remove the current tasks stale references to the old mm_struct on exit() */ +extern void exit_mm_release(struct task_struct *, struct mm_struct *); +/* Remove the current tasks stale references to the old mm_struct on exec() */ +extern void exec_mm_release(struct task_struct *, struct mm_struct *);
#ifdef CONFIG_MEMCG extern void mm_update_next_owner(struct mm_struct *mm); diff --git a/kernel/exit.c b/kernel/exit.c index e8a3ca3a3685..1abc9b4a626d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -498,7 +498,7 @@ static void exit_mm(void) struct mm_struct *mm = current->mm; struct core_state *core_state;
- mm_release(current, mm); + exit_mm_release(current, mm); if (!mm) return; sync_mm_rss(mm); diff --git a/kernel/fork.c b/kernel/fork.c index 82345524b086..a647bf7ee65b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1275,7 +1275,7 @@ static int wait_for_vfork_done(struct task_struct *child, * restoring the old one. . . * Eric Biederman 10 January 1998 */ -void mm_release(struct task_struct *tsk, struct mm_struct *mm) +static void mm_release(struct task_struct *tsk, struct mm_struct *mm) { /* Get rid of any futexes when releasing the mm */ futex_mm_release(tsk); @@ -1312,6 +1312,16 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) complete_vfork_done(tsk); }
+void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm) +{ + mm_release(tsk, mm); +} + +void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm) +{ + mm_release(tsk, mm); +} + /* * Allocate a new mm structure and copy contents from the * mm structure of the passed in task structure.
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 1dd589346a127db6aa14889ef4099366dcab3a96
--------------------------------
commit 150d71584b12809144b8145b817e83b81158ae5f upstream
To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites.
Preparatory only, no functional change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/futex.h | 6 ++++-- kernel/fork.c | 5 ++--- kernel/futex.c | 7 ++++++- 3 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index 84e945116126..d141df679863 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -96,14 +96,16 @@ static inline void futex_exit_done(struct task_struct *tsk) tsk->futex_state = FUTEX_STATE_DEAD; }
-void futex_mm_release(struct task_struct *tsk); +void futex_exit_release(struct task_struct *tsk); +void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3); #else static inline void futex_init_task(struct task_struct *tsk) { } -static inline void futex_mm_release(struct task_struct *tsk) { } static inline void futex_exit_done(struct task_struct *tsk) { } +static inline void futex_exit_release(struct task_struct *tsk) { } +static inline void futex_exec_release(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) diff --git a/kernel/fork.c b/kernel/fork.c index a647bf7ee65b..5ee7ed5151d1 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1277,9 +1277,6 @@ static int wait_for_vfork_done(struct task_struct *child, */ static void mm_release(struct task_struct *tsk, struct mm_struct *mm) { - /* Get rid of any futexes when releasing the mm */ - futex_mm_release(tsk); - uprobe_free_utask(tsk);
/* Get rid of any cached register state */ @@ -1314,11 +1311,13 @@ static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm) { + futex_exit_release(tsk); mm_release(tsk, mm); }
void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm) { + futex_exec_release(tsk); mm_release(tsk, mm); }
diff --git a/kernel/futex.c b/kernel/futex.c index 26688507fce5..b89ac1a4ba4e 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3676,7 +3676,7 @@ static void exit_robust_list(struct task_struct *curr) } }
-void futex_mm_release(struct task_struct *tsk) +void futex_exec_release(struct task_struct *tsk) { if (unlikely(tsk->robust_list)) { exit_robust_list(tsk); @@ -3694,6 +3694,11 @@ void futex_mm_release(struct task_struct *tsk) exit_pi_state_list(tsk); }
+void futex_exit_release(struct task_struct *tsk) +{ + futex_exec_release(tsk); +} + long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) {
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 8f9a98a0e00ad101e2301ebb78d8537133e39ceb
--------------------------------
commit f24f22435dcc11389acc87e5586239c1819d217c upstream
Setting task::futex_state in do_exit() is a rather arbitrary placement; there is no reason for it to live there. Move it into the futex code.
Note, this is only done for the exit cleanup as the exec cleanup cannot set the state to FUTEX_STATE_DEAD because the task struct is still in active use.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/exit.c | 1 - kernel/futex.c | 1 + 2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index 1abc9b4a626d..6c866afc1a0b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -902,7 +902,6 @@ void __noreturn do_exit(long code) * Make sure we are holding no locks: */ debug_check_no_locks_held(); - futex_exit_done(tsk);
if (tsk->io_context) exit_io_context(tsk); diff --git a/kernel/futex.c b/kernel/futex.c index b89ac1a4ba4e..cd9485383faf 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3697,6 +3697,7 @@ void futex_exec_release(struct task_struct *tsk) void futex_exit_release(struct task_struct *tsk) { futex_exec_release(tsk); + futex_exit_done(tsk); }
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 226eed1ef71dbc0e505e15954c41a97275b87058
--------------------------------
commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream
Instead of relying on PF_EXITING use an explicit state for the futex exit and set it in the futex exit function. This moves the smp barrier and the lock/unlock serialization into the futex code.
As with the DEAD state this is restricted to the exit path as exec continues to use the same task struct.
This allows that logic to be simplified in a later step.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/futex.h | 31 +++---------------------------- kernel/exit.c | 13 +------------ kernel/futex.c | 37 ++++++++++++++++++++++++++++++++++++- 3 files changed, 40 insertions(+), 41 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index d141df679863..1db16694ea16 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -55,6 +55,7 @@ union futex_key { #ifdef CONFIG_FUTEX enum { FUTEX_STATE_OK, + FUTEX_STATE_EXITING, FUTEX_STATE_DEAD, };
@@ -69,33 +70,7 @@ static inline void futex_init_task(struct task_struct *tsk) tsk->futex_state = FUTEX_STATE_OK; }
-/** - * futex_exit_done - Sets the tasks futex state to FUTEX_STATE_DEAD - * @tsk: task to set the state on - * - * Set the futex exit state of the task lockless. The futex waiter code - * observes that state when a task is exiting and loops until the task has - * actually finished the futex cleanup. The worst case for this is that the - * waiter runs through the wait loop until the state becomes visible. - * - * This has two callers: - * - * - futex_mm_release() after the futex exit cleanup has been done - * - * - do_exit() from the recursive fault handling path. - * - * In case of a recursive fault this is best effort. Either the futex exit - * code has run already or not. If the OWNER_DIED bit has been set on the - * futex then the waiter can take it over. If not, the problem is pushed - * back to user space. If the futex exit code did not run yet, then an - * already queued waiter might block forever, but there is nothing which - * can be done about that. - */ -static inline void futex_exit_done(struct task_struct *tsk) -{ - tsk->futex_state = FUTEX_STATE_DEAD; -} - +void futex_exit_recursive(struct task_struct *tsk); void futex_exit_release(struct task_struct *tsk); void futex_exec_release(struct task_struct *tsk);
@@ -103,7 +78,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3); #else static inline void futex_init_task(struct task_struct *tsk) { } -static inline void futex_exit_done(struct task_struct *tsk) { } +static inline void futex_exit_recursive(struct task_struct *tsk) { } static inline void futex_exit_release(struct task_struct *tsk) { } static inline void futex_exec_release(struct task_struct *tsk) { } static inline long do_futex(u32 __user *uaddr, int op, u32 val, diff --git a/kernel/exit.c b/kernel/exit.c index 6c866afc1a0b..08e1ec2584aa 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -804,23 +804,12 @@ void __noreturn do_exit(long code) */ if (unlikely(tsk->flags & PF_EXITING)) { pr_alert("Fixing recursive fault but reboot is needed!\n"); - futex_exit_done(tsk); + futex_exit_recursive(tsk); set_current_state(TASK_UNINTERRUPTIBLE); schedule(); }
exit_signals(tsk); /* sets PF_EXITING */ - /* - * Ensure that all new tsk->pi_lock acquisitions must observe - * PF_EXITING. Serializes against futex.c:attach_to_pi_owner(). - */ - smp_mb(); - /* - * Ensure that we must observe the pi_state in exit_mm() -> - * mm_release() -> exit_pi_state_list(). - */ - raw_spin_lock_irq(&tsk->pi_lock); - raw_spin_unlock_irq(&tsk->pi_lock);
if (unlikely(in_atomic())) { pr_info("note: %s[%d] exited with preempt_count %d\n", diff --git a/kernel/futex.c b/kernel/futex.c index cd9485383faf..7f7a76600e93 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3694,10 +3694,45 @@ void futex_exec_release(struct task_struct *tsk) exit_pi_state_list(tsk); }
+/** + * futex_exit_recursive - Set the tasks futex state to FUTEX_STATE_DEAD + * @tsk: task to set the state on + * + * Set the futex exit state of the task lockless. The futex waiter code + * observes that state when a task is exiting and loops until the task has + * actually finished the futex cleanup. The worst case for this is that the + * waiter runs through the wait loop until the state becomes visible. + * + * This is called from the recursive fault handling path in do_exit(). + * + * This is best effort. Either the futex exit code has run already or + * not. If the OWNER_DIED bit has been set on the futex then the waiter can + * take it over. If not, the problem is pushed back to user space. If the + * futex exit code did not run yet, then an already queued waiter might + * block forever, but there is nothing which can be done about that. + */ +void futex_exit_recursive(struct task_struct *tsk) +{ + tsk->futex_state = FUTEX_STATE_DEAD; +} + void futex_exit_release(struct task_struct *tsk) { + tsk->futex_state = FUTEX_STATE_EXITING; + /* + * Ensure that all new tsk->pi_lock acquisitions must observe + * FUTEX_STATE_EXITING. Serializes against attach_to_pi_owner(). + */ + smp_mb(); + /* + * Ensure that we must observe the pi_state in exit_pi_state_list(). + */ + raw_spin_lock_irq(&tsk->pi_lock); + raw_spin_unlock_irq(&tsk->pi_lock); + futex_exec_release(tsk); - futex_exit_done(tsk); + + tsk->futex_state = FUTEX_STATE_DEAD; }
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit b45696340f5321cd7bd4f4865bae86c229d2bcc1
--------------------------------
commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream
Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock, move the state setting into the lock section.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 7f7a76600e93..778173080ad5 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3718,16 +3718,19 @@ void futex_exit_recursive(struct task_struct *tsk)
void futex_exit_release(struct task_struct *tsk) { - tsk->futex_state = FUTEX_STATE_EXITING; - /* - * Ensure that all new tsk->pi_lock acquisitions must observe - * FUTEX_STATE_EXITING. Serializes against attach_to_pi_owner(). - */ - smp_mb(); /* - * Ensure that we must observe the pi_state in exit_pi_state_list(). + * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. + * + * This ensures that all subsequent checks of tsk->futex_state in + * attach_to_pi_owner() must observe FUTEX_STATE_EXITING with + * tsk->pi_lock held. + * + * It guarantees also that a pi_state which was queued right before + * the state change under tsk->pi_lock by a concurrent waiter must + * be observed in exit_pi_state_list(). */ raw_spin_lock_irq(&tsk->pi_lock); + tsk->futex_state = FUTEX_STATE_EXITING; raw_spin_unlock_irq(&tsk->pi_lock);
futex_exec_release(tsk);
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit ab89202056ca11596ebed057a7f54fe68576d940
--------------------------------
commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream
exec() attempts to handle potentially held futexes gracefully by running the futex exit handling code like exit() does.
The current implementation has no protection against concurrent incoming waiters. The reason is that the futex state cannot be set to FUTEX_STATE_DEAD after the cleanup because the task struct is still active and just about to execute the new binary.
While it's arguably buggy when a task holds a futex over exec(), for consistency's sake the state handling can at least cover the actual futex exit cleanup section. This provides state consistency protection across the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the cleanup has been finished, this cannot prevent subsequent attempts to attach to the task in case the cleanup was not successful in mopping up all leftovers.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 38 ++++++++++++++++++++++++++++++++++---- 1 file changed, 34 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 778173080ad5..0fdf272a0cc8 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3676,7 +3676,7 @@ static void exit_robust_list(struct task_struct *curr) } }
-void futex_exec_release(struct task_struct *tsk) +static void futex_cleanup(struct task_struct *tsk) { if (unlikely(tsk->robust_list)) { exit_robust_list(tsk); @@ -3716,7 +3716,7 @@ void futex_exit_recursive(struct task_struct *tsk) tsk->futex_state = FUTEX_STATE_DEAD; }
-void futex_exit_release(struct task_struct *tsk) +static void futex_cleanup_begin(struct task_struct *tsk) { /* * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. @@ -3732,10 +3732,40 @@ void futex_exit_release(struct task_struct *tsk) raw_spin_lock_irq(&tsk->pi_lock); tsk->futex_state = FUTEX_STATE_EXITING; raw_spin_unlock_irq(&tsk->pi_lock); +}
- futex_exec_release(tsk); +static void futex_cleanup_end(struct task_struct *tsk, int state) +{ + /* + * Lockless store. The only side effect is that an observer might + * take another loop until it becomes visible. + */ + tsk->futex_state = state; +}
- tsk->futex_state = FUTEX_STATE_DEAD; +void futex_exec_release(struct task_struct *tsk) +{ + /* + * The state handling is done for consistency, but in the case of + * exec() there is no way to prevent futher damage as the PID stays + * the same. But for the unlikely and arguably buggy case that a + * futex is held on exec(), this provides at least as much state + * consistency protection which is possible. + */ + futex_cleanup_begin(tsk); + futex_cleanup(tsk); + /* + * Reset the state to FUTEX_STATE_OK. The task is alive and about + * exec a new binary. + */ + futex_cleanup_end(tsk, FUTEX_STATE_OK); +} + +void futex_exit_release(struct task_struct *tsk) +{ + futex_cleanup_begin(tsk); + futex_cleanup(tsk); + futex_cleanup_end(tsk, FUTEX_STATE_DEAD); }
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit f9b0c6c556dbf3694fee05f1334830ce2ec1f5bc
--------------------------------
commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream
In subsequent changes the mutex will replace the busy looping of a waiter when the futex owner is currently executing the exit cleanup, preventing a potential live lock.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/futex.h | 1 + include/linux/sched.h | 1 + kernel/futex.c | 16 ++++++++++++++++ 3 files changed, 18 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h index 1db16694ea16..b70df27d7e85 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -68,6 +68,7 @@ static inline void futex_init_task(struct task_struct *tsk) INIT_LIST_HEAD(&tsk->pi_state_list); tsk->pi_state_cache = NULL; tsk->futex_state = FUTEX_STATE_OK; + mutex_init(&tsk->futex_exit_mutex); }
void futex_exit_recursive(struct task_struct *tsk); diff --git a/include/linux/sched.h b/include/linux/sched.h index e446bfd8c7ef..5f77e3b546db 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1002,6 +1002,7 @@ struct task_struct { #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; + struct mutex futex_exit_mutex; unsigned int futex_state; #endif #ifdef CONFIG_PERF_EVENTS diff --git a/kernel/futex.c b/kernel/futex.c index 0fdf272a0cc8..8bebf9233ca1 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3713,11 +3713,22 @@ static void futex_cleanup(struct task_struct *tsk) */ void futex_exit_recursive(struct task_struct *tsk) { + /* If the state is FUTEX_STATE_EXITING then futex_exit_mutex is held */ + if (tsk->futex_state == FUTEX_STATE_EXITING) + mutex_unlock(&tsk->futex_exit_mutex); tsk->futex_state = FUTEX_STATE_DEAD; }
static void futex_cleanup_begin(struct task_struct *tsk) { + /* + * Prevent various race issues against a concurrent incoming waiter + * including live locks by forcing the waiter to block on + * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in + * attach_to_pi_owner(). + */ + mutex_lock(&tsk->futex_exit_mutex); + /* * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. * @@ -3741,6 +3752,11 @@ static void futex_cleanup_end(struct task_struct *tsk, int state) * take another loop until it becomes visible. */ tsk->futex_state = state; + /* + * Drop the exit protection. This unblocks waiters which observed + * FUTEX_STATE_EXITING to reevaluate the state. + */ + mutex_unlock(&tsk->futex_exit_mutex); }
void futex_exec_release(struct task_struct *tsk)
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 7f237695d3f02730d2dd60c83e92abe1487c7e89
--------------------------------
commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream
attach_to_pi_owner() returns -EAGAIN for various cases:
- Owner task is exiting
- Futex value has changed
The caller drops the held locks (hash bucket, mmap_sem) and retries the operation. In case of the owner task exiting this can result in a live lock.
As a preparatory step for separating those cases, provide a distinct return value (EBUSY) for the owner exiting case.
No functional change.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 8bebf9233ca1..9ecefb8c4fe9 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1205,11 +1205,11 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, u32 uval2;
/* - * If the futex exit state is not yet FUTEX_STATE_DEAD, wait - * for it to finish. + * If the futex exit state is not yet FUTEX_STATE_DEAD, tell the + * caller that the alleged owner is busy. */ if (tsk && tsk->futex_state != FUTEX_STATE_DEAD) - return -EAGAIN; + return -EBUSY;
/* * Reread the user space value to handle the following situation: @@ -2107,12 +2107,13 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, if (!ret) goto retry; goto out; + case -EBUSY: case -EAGAIN: /* * Two reasons for this: - * - Owner is exiting and we just wait for the + * - EBUSY: Owner is exiting and we just wait for the * exit to complete. - * - The user space value changed. + * - EAGAIN: The user space value changed. */ double_unlock_hb(hb1, hb2); hb_waiters_dec(hb2); @@ -2880,12 +2881,13 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, goto out_unlock_put_key; case -EFAULT: goto uaddr_faulted; + case -EBUSY: case -EAGAIN: /* * Two reasons for this: - * - Task is exiting and we just wait for the + * - EBUSY: Task is exiting and we just wait for the * exit to complete. - * - The user space value changed. + * - EAGAIN: The user space value changed. */ queue_unlock(hb); put_futex_key(&q.key);
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.172 commit 7874eee0130adf9bee28e8720bb5dd051089def3
--------------------------------
commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream
Oleg provided the following test case:
#define _GNU_SOURCE        /* for syscall() */
#include <assert.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

int main(void)
{
    struct sched_param sp = {};

    sp.sched_priority = 2;
    assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

    int lock = vfork();
    if (!lock) {
        sp.sched_priority = 1;
        assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
        _exit(0);
    }

    syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
    return 0;
}
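(The reproducer above should build with a plain gcc invocation; the SCHED_FIFO calls need root or CAP_SYS_NICE, and on an SMP machine pinning the process to a single CPU, e.g. with taskset, should recreate the single-CPU condition described below.)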
This creates an unkillable RT process spinning in futex_lock_pi() on a UP machine or if the process is affine to a single CPU. The reason is:
parent                              child

 set FIFO prio 2

 vfork()                    ->      set FIFO prio 1
 implies wait_for_child()           sched_setscheduler(...)
                                    exit()
                                    do_exit()
                                    ....
                                    mm_release()
                                      tsk->futex_state = FUTEX_STATE_EXITING;
                                      exit_futex(); (NOOP in this case)
                                      complete() --> wakes parent
 sys_futex()
   loop infinite because
   tsk->futex_state == FUTEX_STATE_EXITING
The same problem can happen just by regular preemption as well:
task holds futex
...
do_exit()
  tsk->futex_state = FUTEX_STATE_EXITING;

--> preemption (unrelated wakeup of some other
                higher prio task, e.g. timer)

switch_to(other_task)

return to user
sys_futex()
  loop infinite as above
Just for the fun of it the futex exit cleanup could trigger the wakeup itself before the task sets its futex state to DEAD.
To cure this, the handling of the exiting owner is changed so:
- A refcount is held on the task
- The task pointer is stored in a caller visible location
- The caller drops all locks (hash bucket, mmap_sem) and blocks on task::futex_exit_mutex. When the mutex is acquired then the exiting task has completed the cleanup and the state is consistent and can be reevaluated.
This is not a pretty solution, but the only alternative would be to return an error code to user space, which would break the state consistency guarantee and open another can of problems, including regressions.
For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538 are required as well, but for anything older than 5.3.y the backports are going to be provided when this hits mainline as the other dependencies for those kernels are definitely not stable material.
Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") Reported-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ingo Molnar mingo@kernel.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Stable Team stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/futex.c | 106 ++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 91 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 9ecefb8c4fe9..7ece653364d2 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1199,6 +1199,36 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval, return ret; }
+/** + * wait_for_owner_exiting - Block until the owner has exited + * @exiting: Pointer to the exiting task + * + * Caller must hold a refcount on @exiting. + */ +static void wait_for_owner_exiting(int ret, struct task_struct *exiting) +{ + if (ret != -EBUSY) { + WARN_ON_ONCE(exiting); + return; + } + + if (WARN_ON_ONCE(ret == -EBUSY && !exiting)) + return; + + mutex_lock(&exiting->futex_exit_mutex); + /* + * No point in doing state checking here. If the waiter got here + * while the task was in exec()->exec_futex_release() then it can + * have any FUTEX_STATE_* value when the waiter has acquired the + * mutex. OK, if running, EXITING or DEAD if it reached exit() + * already. Highly unlikely and not a problem. Just one more round + * through the futex maze. + */ + mutex_unlock(&exiting->futex_exit_mutex); + + put_task_struct(exiting); +} + static int handle_exit_race(u32 __user *uaddr, u32 uval, struct task_struct *tsk) { @@ -1260,7 +1290,8 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval, * it after doing proper sanity checks. */ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, - struct futex_pi_state **ps) + struct futex_pi_state **ps, + struct task_struct **exiting) { pid_t pid = uval & FUTEX_TID_MASK; struct futex_pi_state *pi_state; @@ -1299,7 +1330,19 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock); - put_task_struct(p); + /* + * If the owner task is between FUTEX_STATE_EXITING and + * FUTEX_STATE_DEAD then store the task pointer and keep + * the reference on the task struct. The calling code will + * drop all locks, wait for the task to reach + * FUTEX_STATE_DEAD and then drop the refcount. This is + * required to prevent a live lock when the current task + * preempted the exiting task between the two states. + */ + if (ret == -EBUSY) + *exiting = p; + else + put_task_struct(p); return ret; }
@@ -1338,7 +1381,8 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
static int lookup_pi_state(u32 __user *uaddr, u32 uval, struct futex_hash_bucket *hb, - union futex_key *key, struct futex_pi_state **ps) + union futex_key *key, struct futex_pi_state **ps, + struct task_struct **exiting) { struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1353,7 +1397,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval, * We are the first waiter - try to look up the owner based on * @uval and attach to it. */ - return attach_to_pi_owner(uaddr, uval, key, ps); + return attach_to_pi_owner(uaddr, uval, key, ps, exiting); }
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) @@ -1381,6 +1425,8 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) * lookup * @task: the task to perform the atomic lock work for. This will * be "current" except in the case of requeue pi. + * @exiting: Pointer to store the task pointer of the owner task + * which is in the middle of exiting * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) * * Return: @@ -1389,11 +1435,17 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) * - <0 - error * * The hb->lock and futex_key refs shall be held by the caller. + * + * @exiting is only set when the return value is -EBUSY. If so, this holds + * a refcount on the exiting task on return and the caller needs to drop it + * after waiting for the exit to complete. */ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, union futex_key *key, struct futex_pi_state **ps, - struct task_struct *task, int set_waiters) + struct task_struct *task, + struct task_struct **exiting, + int set_waiters) { u32 uval, newval, vpid = task_pid_vnr(task); struct futex_q *top_waiter; @@ -1463,7 +1515,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, * attach to the owner. If that fails, no harm done, we only * set the FUTEX_WAITERS bit in the user space variable. */ - return attach_to_pi_owner(uaddr, newval, key, ps); + return attach_to_pi_owner(uaddr, newval, key, ps, exiting); }
/** @@ -1873,6 +1925,8 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key, * @key1: the from futex key * @key2: the to futex key * @ps: address to store the pi_state pointer + * @exiting: Pointer to store the task pointer of the owner task + * which is in the middle of exiting * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) * * Try and get the lock on behalf of the top waiter if we can do it atomically. @@ -1880,16 +1934,20 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key, * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit. * hb1 and hb2 must be held by the caller. * + * @exiting is only set when the return value is -EBUSY. If so, this holds + * a refcount on the exiting task on return and the caller needs to drop it + * after waiting for the exit to complete. + * * Return: * - 0 - failed to acquire the lock atomically; * - >0 - acquired the lock, return value is vpid of the top_waiter * - <0 - error */ -static int futex_proxy_trylock_atomic(u32 __user *pifutex, - struct futex_hash_bucket *hb1, - struct futex_hash_bucket *hb2, - union futex_key *key1, union futex_key *key2, - struct futex_pi_state **ps, int set_waiters) +static int +futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1, + struct futex_hash_bucket *hb2, union futex_key *key1, + union futex_key *key2, struct futex_pi_state **ps, + struct task_struct **exiting, int set_waiters) { struct futex_q *top_waiter = NULL; u32 curval; @@ -1926,7 +1984,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex, */ vpid = task_pid_vnr(top_waiter->task); ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task, - set_waiters); + exiting, set_waiters); if (ret == 1) { requeue_pi_wake_futex(top_waiter, key2, hb2); return vpid; @@ -2055,6 +2113,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, }
if (requeue_pi && (task_count - nr_wake < nr_requeue)) { + struct task_struct *exiting = NULL; + /* * Attempt to acquire uaddr2 and wake the top waiter. If we * intend to requeue waiters, force setting the FUTEX_WAITERS @@ -2062,7 +2122,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, * faults rather in the requeue loop below. */ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1, - &key2, &pi_state, nr_requeue); + &key2, &pi_state, + &exiting, nr_requeue);
/* * At this point the top_waiter has either taken uaddr2 or is @@ -2089,7 +2150,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, * If that call succeeds then we have pi_state and an * initial refcount on it. */ - ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state); + ret = lookup_pi_state(uaddr2, ret, hb2, &key2, + &pi_state, &exiting); }
switch (ret) { @@ -2119,6 +2181,12 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, hb_waiters_dec(hb2); put_futex_key(&key2); put_futex_key(&key1); + /* + * Handle the case where the owner is in the middle of + * exiting. Wait for the exit to complete otherwise + * this task might loop forever, aka. live lock. + */ + wait_for_owner_exiting(ret, exiting); cond_resched(); goto retry; default: @@ -2841,6 +2909,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock) { struct hrtimer_sleeper timeout, *to = NULL; + struct task_struct *exiting = NULL; struct rt_mutex_waiter rt_waiter; struct futex_hash_bucket *hb; struct futex_q q = futex_q_init; @@ -2868,7 +2937,8 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, retry_private: hb = queue_lock(&q);
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0); + ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, + &exiting, 0); if (unlikely(ret)) { /* * Atomic work succeeded and we got the lock, @@ -2891,6 +2961,12 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, */ queue_unlock(hb); put_futex_key(&q.key); + /* + * Handle the case where the owner is in the middle of + * exiting. Wait for the exit to complete otherwise + * this task might loop forever, aka. live lock. + */ + wait_for_owner_exiting(ret, exiting); cond_resched(); goto retry; default:
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
Fix the KABI breakage introduced by c5f21df94d82 ("futex: Replace PF_EXITPIDONE with a state") and 5c7385e6eb43 ("futex: Add mutex around futex exit").
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/futex.h | 1 - include/linux/sched.h | 6 +++--- kernel/fork.c | 11 ++++++++++- kernel/futex.c | 10 +++++----- 4 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h index b70df27d7e85..1db16694ea16 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -68,7 +68,6 @@ static inline void futex_init_task(struct task_struct *tsk) INIT_LIST_HEAD(&tsk->pi_state_list); tsk->pi_state_cache = NULL; tsk->futex_state = FUTEX_STATE_OK; - mutex_init(&tsk->futex_exit_mutex); }
void futex_exit_recursive(struct task_struct *tsk); diff --git a/include/linux/sched.h b/include/linux/sched.h index 5f77e3b546db..9dc064305c13 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1002,8 +1002,6 @@ struct task_struct { #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; - struct mutex futex_exit_mutex; - unsigned int futex_state; #endif #ifdef CONFIG_PERF_EVENTS struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts]; @@ -1218,12 +1216,14 @@ struct task_struct { #ifndef __GENKSYMS__ u64 parent_exec_id_u64; u64 self_exec_id_u64; + struct mutex *futex_exit_mutex; + unsigned long futex_state; #else KABI_RESERVE(1) KABI_RESERVE(2) -#endif KABI_RESERVE(3) KABI_RESERVE(4) +#endif KABI_RESERVE(5) KABI_RESERVE(6) KABI_RESERVE(7) diff --git a/kernel/fork.c b/kernel/fork.c index 5ee7ed5151d1..36d3edeff7ac 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -458,6 +458,8 @@ void free_task(struct task_struct *tsk) arch_release_task_struct(tsk); if (tsk->flags & PF_KTHREAD) free_kthread_struct(tsk); + kfree(tsk->futex_exit_mutex); + tsk->futex_exit_mutex = NULL; free_task_struct(tsk); } EXPORT_SYMBOL(free_task); @@ -1995,6 +1997,10 @@ static __latent_entropy struct task_struct *copy_process( p->plug = NULL; #endif futex_init_task(p); + p->futex_exit_mutex = kmalloc(sizeof(struct mutex), GFP_KERNEL); + if (!p->futex_exit_mutex) + goto bad_fork_free_pid; + mutex_init(p->futex_exit_mutex);
/* * sigaltstack should be cleared when sharing the same VM @@ -2040,7 +2046,7 @@ static __latent_entropy struct task_struct *copy_process( */ retval = cgroup_can_fork(p); if (retval) - goto bad_fork_free_pid; + goto bad_fork_free_futex_mutex;
/* * From this point on we must avoid any synchronous user-space @@ -2164,6 +2170,9 @@ static __latent_entropy struct task_struct *copy_process( spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); cgroup_cancel_fork(p); +bad_fork_free_futex_mutex: + kfree(p->futex_exit_mutex); + p->futex_exit_mutex = NULL; bad_fork_free_pid: cgroup_threadgroup_change_end(current); if (pid != &init_struct_pid) diff --git a/kernel/futex.c b/kernel/futex.c index 7ece653364d2..fc56076a2cc2 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1215,7 +1215,7 @@ static void wait_for_owner_exiting(int ret, struct task_struct *exiting) if (WARN_ON_ONCE(ret == -EBUSY && !exiting)) return;
- mutex_lock(&exiting->futex_exit_mutex); + mutex_lock(exiting->futex_exit_mutex); /* * No point in doing state checking here. If the waiter got here * while the task was in exec()->exec_futex_release() then it can @@ -1224,7 +1224,7 @@ static void wait_for_owner_exiting(int ret, struct task_struct *exiting) * already. Highly unlikely and not a problem. Just one more round * through the futex maze. */ - mutex_unlock(&exiting->futex_exit_mutex); + mutex_unlock(exiting->futex_exit_mutex);
put_task_struct(exiting); } @@ -3793,7 +3793,7 @@ void futex_exit_recursive(struct task_struct *tsk) { /* If the state is FUTEX_STATE_EXITING then futex_exit_mutex is held */ if (tsk->futex_state == FUTEX_STATE_EXITING) - mutex_unlock(&tsk->futex_exit_mutex); + mutex_unlock(tsk->futex_exit_mutex); tsk->futex_state = FUTEX_STATE_DEAD; }
@@ -3805,7 +3805,7 @@ static void futex_cleanup_begin(struct task_struct *tsk) * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in * attach_to_pi_owner(). */ - mutex_lock(&tsk->futex_exit_mutex); + mutex_lock(tsk->futex_exit_mutex);
/* * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock. @@ -3834,7 +3834,7 @@ static void futex_cleanup_end(struct task_struct *tsk, int state) * Drop the exit protection. This unblocks waiters which observed * FUTEX_STATE_EXITING to reevaluate the state. */ - mutex_unlock(&tsk->futex_exit_mutex); + mutex_unlock(tsk->futex_exit_mutex); }
void futex_exec_release(struct task_struct *tsk)
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
If free_task() is called on an error path, it will free the parent process's futex_exit_mutex and cause a use-after-free (UAF), so move the freeing of futex_exit_mutex to __put_task_struct().
Fixes: f9a5a3dea71b ("futex: sched: fix kabi broken in task_struct") Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Jian Cheng cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/fork.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 36d3edeff7ac..80a7e29920cd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -458,8 +458,6 @@ void free_task(struct task_struct *tsk)
 	arch_release_task_struct(tsk);
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
-	kfree(tsk->futex_exit_mutex);
-	tsk->futex_exit_mutex = NULL;
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -731,6 +729,8 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	kfree(tsk->futex_exit_mutex);
+	tsk->futex_exit_mutex = NULL;

 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
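To make the ownership problem concrete, here is a small stand-alone C model of the aliasing that this fork.c change addresses. It is only an illustrative sketch under stated assumptions (that the real kernel allocates the per-task futex_exit_mutex later in copy_process(), and that dup_task_struct() starts the child as a byte copy of the parent, so an early error path still sees the parent's pointer); it is not kernel code.

/*
 * Stand-alone sketch (not kernel code) of the use-after-free fixed above:
 * freeing ->futex_exit_mutex in free_task() on the fork error path would
 * free the parent's mutex, because the child starts as a copy of the
 * parent's task_struct.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct mutex { int dummy; };

struct task_struct {
	struct mutex *futex_exit_mutex;	/* a pointer, to keep kABI stable */
};

/* Models dup_task_struct(): the child begins as a byte copy of the parent. */
static struct task_struct *dup_task_struct(const struct task_struct *orig)
{
	struct task_struct *tsk = malloc(sizeof(*tsk));

	memcpy(tsk, orig, sizeof(*tsk));	/* child aliases parent's mutex */
	return tsk;
}

int main(void)
{
	struct task_struct parent = { .futex_exit_mutex = malloc(sizeof(struct mutex)) };
	struct task_struct *child = dup_task_struct(&parent);

	/*
	 * Pretend copy_process() fails before the child gets its own mutex.
	 * Freeing child->futex_exit_mutex here (old free_task() behaviour)
	 * would free the parent's mutex; any later use by the parent is then
	 * a use-after-free.  With the fix, the mutex is only freed in
	 * __put_task_struct(), i.e. when the task that owns it really dies.
	 */
	printf("child aliases parent's mutex: %s\n",
	       child->futex_exit_mutex == parent.futex_exit_mutex ? "yes" : "no");

	free(child);			/* free_task(): no kfree of the mutex */
	free(parent.futex_exit_mutex);	/* parent still owns its allocation */
	return 0;
}

Compiled as a normal user-space program this prints that the child's pointer aliases the parent's, which is exactly why free_task() must not kfree() it on the fork error path.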
From: Bixuan Cui cuibixuan@huawei.com
hulk inclusion
category: bugfix
bugzilla: NA
CVE: NA
MAP_CHECKNODE was defined in uapi/asm-generic/mman.h, which is not automatically included by mm/mmap.c when building on platforms such as mips, resulting in the following compile error:
mm/mmap.c: In function ‘__do_mmap’:
mm/mmap.c:1581:14: error: ‘MAP_CHECKNODE’ undeclared (first use in this function)
  if (flags & MAP_CHECKNODE)
              ^
mm/mmap.c:1581:14: note: each undeclared identifier is reported only once for each function it appears in
scripts/Makefile.build:303: recipe for target 'mm/mmap.o' failed
Fixes: cdccf4d4b7b5 ("arm64/ascend: mm: Add MAP_CHECKNODE flag to check node hugetlb")
Signed-off-by: Bixuan Cui cuibixuan@huawei.com
Reviewed-by: Hanjun Guo guohanjun@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 arch/alpha/include/uapi/asm/mman.h   | 1 +
 arch/mips/include/uapi/asm/mman.h    | 1 +
 arch/parisc/include/uapi/asm/mman.h  | 1 +
 arch/powerpc/include/uapi/asm/mman.h | 1 +
 arch/sparc/include/uapi/asm/mman.h   | 1 +
 arch/xtensa/include/uapi/asm/mman.h  | 1 +
 6 files changed, 6 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index b3acfc00c8ec..1c7ce2716ad3 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -34,6 +34,7 @@
 #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE	0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_PA32BIT	0x400000	/* physical address is within 4G */
+#define MAP_CHECKNODE	0x800000	/* hugetlb numa node check */

 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC	2		/* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 72a00c746e78..4570a54ac1d9 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -52,6 +52,7 @@
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_PA32BIT	0x400000	/* physical address is within 4G */
+#define MAP_CHECKNODE	0x800000	/* hugetlb numa node check */

 /*
  * Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 9e989d649e85..06857eb1bee8 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -28,6 +28,7 @@
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_PA32BIT	0x400000	/* physical address is within 4G */
+#define MAP_CHECKNODE	0x800000	/* hugetlb numa node check */

 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 95f884ada96f..24354f792b00 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
 #define MAP_PA32BIT	0x400000	/* physical address is within 4G */
+#define MAP_CHECKNODE	0x800000	/* hugetlb numa node check */

 /* Override any generic PKEY permission defines */
 #define PKEY_DISABLE_EXECUTE   0x4
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0d1881b8f30d..214abe17f44a 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -27,6 +27,7 @@
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
 #define MAP_PA32BIT	0x400000	/* physical address is within 4G */
+#define MAP_CHECKNODE	0x800000	/* hugetlb numa node check */

 #endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index f584a590bb00..2c9d70560238 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -59,6 +59,7 @@
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_PA32BIT	0x400000	/* physical address is within 4G */
+#define MAP_CHECKNODE	0x800000	/* hugetlb numa node check */
 #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
 # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
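For context, the following user-space sketch shows how a MAP_CHECKNODE mapping might be requested on a kernel carrying these patches. The flag is not in mainline, so the failure behaviour shown here is an assumption based on the cdccf4d4b7b5 commit summary (check that the hugetlb pages come from the intended NUMA node); a real caller would normally pin the node first with set_mempolicy() or mbind(), which is omitted here.

/* Hypothetical usage sketch; MAP_CHECKNODE semantics are assumed, see above. */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_CHECKNODE
#define MAP_CHECKNODE 0x800000	/* value added by the patch above */
#endif

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* one 2 MiB hugetlb page */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_CHECKNODE,
		       -1, 0);

	if (p == MAP_FAILED) {
		/* expected when no hugetlb page is available on the chosen node */
		fprintf(stderr, "mmap: %s\n", strerror(errno));
		return 1;
	}
	munmap(p, len);
	return 0;
}

The fallback #define only mirrors the value added by the patch so the sketch builds even with older uapi headers; it does not make the flag work on kernels that lack the feature.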
From: Bixuan Cui cuibixuan@huawei.com
hulk inclusion
category: bugfix
bugzilla: NA
CVE: NA
An error is reported when compiling for the powerpc platform because the VERIFY_WRITE argument has already been removed from access_ok() on powerpc:
./arch/powerpc/include/asm/uaccess.h: In function ‘clear_user’:
./arch/powerpc/include/asm/uaccess.h:446:48: error: macro "access_ok" passed 3 arguments, but takes just 2
  if (likely(access_ok(VERIFY_WRITE, addr, size))) {
                                                ^
In file included from ./include/asm-generic/div64.h:25:0,
                 from ./arch/powerpc/include/generated/asm/div64.h:1,
                 from ./include/linux/math64.h:6,
                 from ./include/linux/time64.h:5,
                 from ./include/linux/compat_time.h:6,
                 from ./include/linux/compat.h:10,
                 from arch/powerpc/kernel/asm-offsets.c:16:
./arch/powerpc/include/asm/uaccess.h:446:13: error: ‘access_ok’ undeclared (first use in this function)
  if (likely(access_ok(VERIFY_WRITE, addr, size))) {
             ^
./include/linux/compiler.h:76:40: note: in definition of macro ‘likely’
 # define likely(x)	__builtin_expect(!!(x), 1)
                                        ^
./arch/powerpc/include/asm/uaccess.h:446:13: note: each undeclared identifier is reported only once for each function it appears in
  if (likely(access_ok(VERIFY_WRITE, addr, size))) {
             ^
./include/linux/compiler.h:76:40: note: in definition of macro ‘likely’
 # define likely(x)	__builtin_expect(!!(x), 1)
                                        ^
Kbuild:56: recipe for target 'arch/powerpc/kernel/asm-offsets.s' failed
Fixes: 837baab68b87 ("powerpc: Add a framework for user access tracking")
Fixes: a10f8b4fe993 ("powerpc: Implement user_access_begin and friends")
Signed-off-by: Bixuan Cui cuibixuan@huawei.com
Reviewed-by: Hanjun Guo guohanjun@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 arch/powerpc/include/asm/uaccess.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 1bcf1b01a873..f04f5d43496a 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -443,7 +443,7 @@ static inline unsigned long clear_user(void __user *addr, unsigned long size)
 {
 	unsigned long ret = size;
 	might_fault();
-	if (likely(access_ok(VERIFY_WRITE, addr, size))) {
+	if (likely(access_ok(addr, size))) {
 		allow_write_to_user(addr, size);
 		ret = __arch_clear_user(addr, size);
 		prevent_write_to_user(addr, size);
@@ -464,7 +464,7 @@ extern long __copy_from_user_flushcache(void *dst, const void __user *src,
 extern void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 	size_t len);

-#define user_access_begin(type, ptr, len) access_ok(type, ptr, len)
+#define user_access_begin(ptr, len) access_ok(ptr, len)
 #define user_access_end()		  prevent_user_access(NULL, NULL, ~0ul)

 #define unsafe_op_wrap(op, err) do { if (unlikely(op)) goto err; } while (0)
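The practical consequence of this change for callers is that access_ok() loses its VERIFY_READ/VERIFY_WRITE argument, matching mainline commit 96d4f267e40f ("Remove 'type' argument from access_ok() function"). A minimal sketch of an adapted caller, not taken from the patch (put_status() is a made-up helper):

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* Older trees spelled the check access_ok(VERIFY_WRITE, ubuf, sizeof(*ubuf)). */
static inline int put_status(u32 __user *ubuf, u32 status)
{
	if (!access_ok(ubuf, sizeof(*ubuf)))	/* two-argument form used by this tree */
		return -EFAULT;

	return __put_user(status, ubuf);	/* __put_user skips the repeated range check */
}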