update openEuler 20.03 @ 20210414 step 1
Aleksandr Miloserdov (2):
      scsi: target: core: Add cmd length set before cmd complete
      scsi: target: core: Prevent underflow for service actions

Anna-Maria Behnsen (1):
      hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

Arnd Bergmann (1):
      mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning

Cheng Jian (5):
      sched/fair: Optimize select_idle_cpu
      disable stealing by default
      sched/fair: introduce SCHED_STEAL
      config: enable CONFIG_SCHED_STEAL by default
      sched/fair: fix try_steal compile error

Dan Carpenter (1):
      ocfs2: fix a use after free on error

Daniel Borkmann (1):
      net: Fix gro aggregation for udp encaps with zero csum

Daniel Kobras (1):
      sunrpc: fix refcount leak for rpc auth modules

Daniel Wagner (2):
      block: Use non _rcu version of list functions for tag_set_list
      block: Suppress uevent for hidden device when removed

David Rientjes (1):
      KVM: SVM: Periodically schedule when unregistering regions on destroy

Eric Biggers (1):
      random: fix the RNDRESEEDCRNG ioctl

Eric Dumazet (6):
      net: qrtr: fix a kernel-infoleak in qrtr_recvmsg()
      tcp: fix SO_RCVLOWAT related hangs under mem pressure
      ipv6: icmp6: avoid indirect call for icmpv6_send()
      tcp: annotate tp->copied_seq lockless reads
      tcp: annotate tp->write_seq lockless reads
      tcp: add sanity tests to TCP_QUEUE_SEQ

Fangrui Song (1):
      module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbols

Florian Westphal (1):
      netfilter: ctnetlink: fix dump of the expect mask attribute

Frank Sorenson (1):
      NFS: Correct size calculation for create reply length

Geert Uytterhoeven (1):
      PCI: Fix pci_register_io_range() memory leak

Guo Fan (2):
      userswap: add a new flag 'MAP_REPLACE' for mmap()
      userswap: support userswap via userfaultfd

Hillf Danton (1):
      mm/gup: Let __get_user_pages_locked() return -EINTR for fatal signal

Jan Beulich (1):
      xen-blkback: don't leak persistent grants from xen_blkbk_map()

Jan Kara (2):
      bfq: Avoid false bfq queue merging
      ext4: add reclaim checks to xattr code

Jason A. Donenfeld (6):
      icmp: introduce helper for nat'd source address in network device context
      icmp: allow icmpv6_ndo_send to work with CONFIG_IPV6=n
      gtp: use icmp_ndo_send helper
      sunvnet: use icmp_ndo_send helper
      xfrm: interface: use icmp_ndo_send helper
      net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending

Jeffle Xu (3):
      dm table: fix iterate_devices based device capability checks
      dm table: fix DAX iterate_devices based device capability checks
      dm table: fix zoned iterate_devices based device capability checks

Jens Axboe (1):
      swap: fix swapfile read/write offset

JeongHyeon Lee (1):
      dm verity: add root hash pkcs#7 signature verification

Kefeng Wang (1):
      mm: slub: Expanded the scope of corrupted freelist workaround

Kuppuswamy Sathyanarayanan (1):
      mm/vmalloc.c: fix percpu free VM area search criteria

Leon Romanovsky (1):
      ipv6: silence compilation warning for non-IPV6 builds

Li Xinhai (1):
      mm/hugetlb.c: fix unnecessary address expansion of pmd sharing

Linus Torvalds (1):
      Revert "mm, slub: consider rest of partial list if acquire_slab() fails"

Marc Zyngier (1):
      arm64: Add missing ISB after invalidating TLB in __primary_switch

Marco Elver (1):
      net: fix up truesize of cloned skb in skb_prepare_for_shift()

Mark Tomlinson (3):
      Revert "netfilter: x_tables: Switch synchronization to RCU"
      netfilter: x_tables: Use correct memory barriers.
      Revert "netfilter: x_tables: Update remaining dereference to RCU"

Matthew Wilcox (Oracle) (1):
      include/linux/sched/mm.h: use rcu_dereference in in_vfork()

Miaohe Lin (3):
      mm/memory.c: fix potential pte_unmap_unlock pte error
      mm/hugetlb: fix potential double free in hugetlb_register_node() error path
      mm/rmap: fix potential pte_unmap on an not mapped pte

Michael Braun (1):
      gianfar: fix jumbo packets+napi+rx overrun crash

Michal Hocko (1):
      mm, mempolicy: fix up gup usage in lookup_node

Mike Kravetz (2):
      hugetlb: fix copy_huge_page_from_user contig page struct assumption
      hugetlb: fix update_and_free_page contig page struct assumption

Mikulas Patocka (4):
      blk-settings: align max_sectors on "logical_block_size" boundary
      dm: fix deadlock when swapping to encrypted device
      dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size
      dm ioctl: fix out of bounds array access when no devices

Ming Lei (1):
      block: respect queue limit of max discard segment

Muchun Song (1):
      printk: fix deadlock when kernel panic

NeilBrown (1):
      x86: fix seq_file iteration for pat/memtype.c

Oleg Nesterov (1):
      kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()

Pan Bian (1):
      isofs: release buffer head before return

Paulo Alcantara (1):
      cifs: return proper error code in statfs(2)

Pavel Tatashin (1):
      arm64: kdump: update ppos when reading elfcorehdr

Peter Xu (4):
      mm: allow VM_FAULT_RETRY for multiple times
      mm/gup: allow VM_FAULT_RETRY for multiple times
      mm/gup: fix fixup_user_fault() on multiple retries
      mm/mempolicy: Allow lookup_node() to handle fatal signal

Peter Zijlstra (2):
      jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations
      locking/static_key: Fix false positive warnings on concurrent dec/inc

Rafael J. Wysocki (1):
      ACPI: property: Fix fwnode string properties matching

Rustam Kovhaev (1):
      KVM: fix memory leak in kvm_io_bus_unregister_dev()

Sagi Grimberg (1):
      nvme-rdma: fix possible hang when failing to set io queues

Sakari Ailus (1):
      media: v4l: ioctl: Fix memory leak in video_usercopy

Shaoying Xu (1):
      arm64 module: set plt* section addresses to 0x0

Shuah Khan (1):
      usbip: fix stub_dev usbip_sockfd_store() races leading to gpf

Steve Sistare (10):
      sched: Provide sparsemask, a reduced contention bitmap
      sched/topology: Provide hooks to allocate data shared per LLC
      sched/topology: Provide cfs_overload_cpus bitmap
      sched/fair: Dynamically update cfs_overload_cpus
      sched/fair: Hoist idle_stamp up from idle_balance
      sched/fair: Generalize the detach_task interface
      sched/fair: Provide can_migrate_task_llc
      sched/fair: Steal work from an overloaded CPU when CPU goes idle
      sched/fair: disable stealing if too many NUMA nodes
      sched/fair: Provide idle search schedstats

Steven Rostedt (VMware) (1):
      tracepoint: Do not fail unregistering a probe due to memory failure

Thomas Gleixner (1):
      locking/mutex: Fix non debug version of mutex_lock_io_nested()

Uladzislau Rezki (Sony) (3):
      mm/vmalloc.c: keep track of free blocks for vmap allocation
      mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
      mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro

Vasily Averin (1):
      netfilter: x_tables: gpf inside xt_find_revision()

Vincent Whitchurch (1):
      cifs: Fix preauth hash corruption

Viresh Kumar (4):
      sched/core: Create task_has_idle_policy() helper
      sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq
      sched/fair: Fall back to sched-idle CPU if idle CPU isn't found
      sched/fair: Make sched-idle CPU selection consistent throughout

Yufen Yu (1):
      block: only update parent bi_status when bio fail

Yumei Huang (1):
      xfs: Fix assert failure in xfs_setattr_size()

wanglin (1):
      RDMA/hns: fix timer, gid_type, scc cfg

zhangyi (F) (1):
      ext4: do not try to set xattr into ea_inode if value is empty

 arch/alpha/mm/fault.c | 2 +-
 arch/arc/mm/fault.c | 1 -
 arch/arm/mm/fault.c | 3 -
 arch/arm64/configs/euleros_defconfig | 1 +
 arch/arm64/configs/hulk_defconfig | 1 +
 arch/arm64/configs/openeuler_defconfig | 1 +
 arch/arm64/configs/storage_ci_defconfig | 1 +
 arch/arm64/configs/syzkaller_defconfig | 1 +
 arch/arm64/kernel/crash_dump.c | 2 +
 arch/arm64/kernel/head.S | 1 +
 arch/arm64/kernel/module.lds | 6 +-
 arch/arm64/mm/fault.c | 5 -
 arch/hexagon/mm/vm_fault.c | 1 -
 arch/ia64/mm/fault.c | 1 -
 arch/m68k/mm/fault.c | 3 -
 arch/microblaze/mm/fault.c | 1 -
 arch/mips/mm/fault.c | 1 -
 arch/nds32/mm/fault.c | 1 -
 arch/nios2/mm/fault.c | 3 -
 arch/openrisc/mm/fault.c | 1 -
 arch/parisc/mm/fault.c | 4 +-
 arch/powerpc/mm/fault.c | 6 -
 arch/riscv/mm/fault.c | 5 -
 arch/s390/mm/fault.c | 5 +-
 arch/sh/mm/fault.c | 1 -
 arch/sparc/mm/fault_32.c | 1 -
 arch/sparc/mm/fault_64.c | 1 -
 arch/um/kernel/trap.c | 1 -
 arch/unicore32/mm/fault.c | 4 +-
 arch/x86/configs/hulk_defconfig | 1 +
 arch/x86/configs/openeuler_defconfig | 1 +
 arch/x86/configs/storage_ci_defconfig | 1 +
 arch/x86/configs/syzkaller_defconfig | 1 +
 arch/x86/kvm/svm.c | 1 +
 arch/x86/mm/fault.c | 2 -
 arch/x86/mm/pat.c | 3 +-
 arch/xtensa/mm/fault.c | 1 -
 block/bfq-iosched.c | 1 +
 block/bio.c | 2 +-
 block/blk-merge.c | 11 +-
 block/blk-mq.c | 4 +-
 block/blk-settings.c | 12 +
 block/genhd.c | 4 +-
 drivers/acpi/property.c | 44 +-
 drivers/block/xen-blkback/blkback.c | 2 +-
 drivers/char/random.c | 2 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +-
 .../infiniband/hw/hns/hns_roce_hw_sysfs_v2.c | 2 +-
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 16 +-
 drivers/md/dm-bufio.c | 4 +
 drivers/md/dm-core.h | 4 +
 drivers/md/dm-crypt.c | 1 +
 drivers/md/dm-ioctl.c | 2 +-
 drivers/md/dm-table.c | 174 ++-
 drivers/md/dm-verity-target.c | 2 +-
 drivers/md/dm.c | 60 +
 drivers/media/v4l2-core/v4l2-ioctl.c | 19 +-
 drivers/net/ethernet/freescale/gianfar.c | 15 +
 drivers/net/ethernet/sun/sunvnet_common.c | 23 +-
 drivers/net/gtp.c | 5 +-
 drivers/nvme/host/rdma.c | 7 +-
 drivers/pci/pci.c | 4 +
 drivers/target/target_core_pr.c | 15 +-
 drivers/target/target_core_transport.c | 15 +-
 drivers/usb/usbip/stub_dev.c | 32 +-
 fs/cifs/cifsfs.c | 2 +-
 fs/cifs/transport.c | 7 +-
 fs/ext4/xattr.c | 6 +-
 fs/isofs/dir.c | 1 +
 fs/isofs/namei.c | 1 +
 fs/nfs/nfs3xdr.c | 3 +-
 fs/ocfs2/cluster/heartbeat.c | 8 +-
 fs/proc/task_mmu.c | 3 +
 fs/select.c | 10 +-
 fs/userfaultfd.c | 26 +-
 fs/xfs/xfs_iops.c | 2 +-
 include/linux/device-mapper.h | 5 +
 include/linux/icmpv6.h | 48 +-
 include/linux/ipv6.h | 2 +-
 include/linux/mm.h | 46 +-
 include/linux/mutex.h | 2 +-
 include/linux/netfilter/x_tables.h | 7 +-
 include/linux/rmap.h | 3 +-
 include/linux/sched/mm.h | 3 +-
 include/linux/sched/topology.h | 3 +
 include/linux/swap.h | 12 +-
 include/linux/thread_info.h | 13 +
 include/linux/userfaultfd_k.h | 4 +
 include/linux/vmalloc.h | 6 +-
 include/net/icmp.h | 10 +
 include/net/tcp.h | 11 +-
 include/target/target_core_backend.h | 1 +
 include/trace/events/mmflags.h | 7 +
 include/uapi/asm-generic/mman.h | 4 +
 include/uapi/linux/userfaultfd.h | 3 +
 init/Kconfig | 15 +
 kernel/futex.c | 3 +-
 kernel/jump_label.c | 26 +-
 kernel/module.c | 21 +-
 kernel/printk/printk_safe.c | 16 +-
 kernel/sched/core.c | 39 +-
 kernel/sched/debug.c | 2 +-
 kernel/sched/fair.c | 418 ++++++-
 kernel/sched/features.h | 8 +
 kernel/sched/sched.h | 28 +-
 kernel/sched/sparsemask.h | 210 ++++
 kernel/sched/stats.c | 15 +
 kernel/sched/stats.h | 20 +
 kernel/sched/topology.c | 141 ++-
 kernel/time/alarmtimer.c | 2 +-
 kernel/time/hrtimer.c | 62 +-
 kernel/time/posix-cpu-timers.c | 2 +-
 kernel/tracepoint.c | 80 +-
 lib/logic_pio.c | 3 +
 mm/Kconfig | 9 +
 mm/filemap.c | 2 +-
 mm/gup.c | 47 +-
 mm/hugetlb.c | 38 +-
 mm/internal.h | 6 +-
 mm/memory.c | 35 +-
 mm/mempolicy.c | 4 +-
 mm/mmap.c | 207 ++++
 mm/page_io.c | 11 +-
 mm/slub.c | 14 +-
 mm/swapfile.c | 2 +-
 mm/userfaultfd.c | 26 +
 mm/vmalloc.c | 1099 +++++++++++++----
 net/core/skbuff.c | 14 +-
 net/ipv4/icmp.c | 34 +
 net/ipv4/netfilter/arp_tables.c | 16 +-
 net/ipv4/netfilter/ip_tables.c | 16 +-
 net/ipv4/tcp.c | 59 +-
 net/ipv4/tcp_diag.c | 5 +-
 net/ipv4/tcp_input.c | 6 +-
 net/ipv4/tcp_ipv4.c | 23 +-
 net/ipv4/tcp_minisocks.c | 4 +-
 net/ipv4/tcp_output.c | 6 +-
 net/ipv4/udp_offload.c | 2 +-
 net/ipv6/icmp.c | 19 +-
 net/ipv6/ip6_icmp.c | 46 +-
 net/ipv6/netfilter/ip6_tables.c | 16 +-
 net/ipv6/tcp_ipv6.c | 15 +-
 net/netfilter/nf_conntrack_netlink.c | 1 +
 net/netfilter/x_tables.c | 55 +-
 net/qrtr/qrtr.c | 5 +
 net/sunrpc/svc.c | 6 +-
 net/xfrm/xfrm_interface.c | 6 +-
 virt/kvm/kvm_main.c | 21 +-
 148 files changed, 3024 insertions(+), 791 deletions(-)
 create mode 100644 kernel/sched/sparsemask.h
From: Daniel Wagner <dwagner@suse.de>

mainline inclusion
from mainline-5.9-rc1
commit 08c875cbf481d74db82d6bba2fbcf580087dee24
category: bugfix
bugzilla: 39772
CVE: NA
---------------------------
tag_set_list is only accessed under the tag_set_lock lock. There is no need to use the _rcu list functions.
The _rcu list functions were introduced to allow read access to the tag_set_list protected under RCU, see 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") and 05b79413946d ("Revert "blk-mq: don't handle TAG_SHARED in restart""). Those changes got reverted later but the cleanup commit missed a couple of places to undo the changes.
Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 block/blk-mq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b5c2a4f65402..4965023121b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2521,7 +2521,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	struct blk_mq_tag_set *set = q->tag_set;
 
 	mutex_lock(&set->tag_list_lock);
-	list_del_rcu(&q->tag_set_list);
+	list_del(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_SHARED;
@@ -2550,7 +2550,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	}
 	if (set->flags & BLK_MQ_F_TAG_SHARED)
 		queue_set_hctx_shared(q, true);
-	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
+	list_add_tail(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
 }
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.9-rc3
commit 943b40c832beb71115e38a1c4d99b640b5342738
category: bugfix
bugzilla: 41908
CVE: NA
---------------------------
When queue_max_discard_segments(q) is 1, blk_discard_mergable() will return false for a discard request, and normal request merging is applied. However, only queue_max_segments() is checked, so the max discard segment limit isn't respected.
Check the max discard segment limit in the request merge code to fix the issue.
This fixes discard request failures on virtio_blk.
Fixes: 69840466086d ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflict: block/blk-merge.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 block/blk-merge.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 044bff9afa5e..d24a6c9398ed 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -477,13 +477,20 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+static inline unsigned int blk_rq_get_max_segments(struct request *rq)
+{
+	if (req_op(rq) == REQ_OP_DISCARD)
+		return queue_max_discard_segments(rq->q);
+	return queue_max_segments(rq->q);
+}
+
 static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct request *req,
 				    struct bio *bio)
 {
 	int nr_phys_segs = bio_phys_segments(q, bio);
 
-	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
+	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
 		goto no_merge;
 
 	if (blk_integrity_merge_bio(q, req, bio) == false)
@@ -606,7 +613,7 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 		total_phys_segments--;
 	}
 
-	if (total_phys_segments > queue_max_segments(q))
+	if (total_phys_segments > blk_rq_get_max_segments(req))
 		return 0;
 
 	if (blk_integrity_merge_rq(q, req, next) == false)
From: wanglin <wanglin137@huawei.com>

driver inclusion
category: bugfix
bugzilla: NA
CVE: NA
This patch fixes the timer, gid_type and scc configuration checks so that they apply to HIP08_B and later revisions, not just HIP08_B.
Reviewed-by: Hu Chunzhi <huchunzhi@huawei.com>
Reviewed-by: Zhao Weibo <zhaoweibo3@huawei.com>
Signed-off-by: wanglin <wanglin137@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_sysfs_v2.c |  2 +-
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c       | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_sysfs_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_sysfs_v2.c
index a780a0e10386..c6106379d243 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_sysfs_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_sysfs_v2.c
@@ -340,7 +340,7 @@ int hns_roce_v2_query_pkt_stat(struct hns_roce_dev *hr_dev,
 	if (status)
 		return status;
 
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B) {
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B) {
 		hns_roce_cmq_setup_basic_desc(&desc_cnp_rx,
 				HNS_ROCE_OPC_QUEYR_CNP_RX_CNT, true);
 		status = hns_roce_cmq_send(hr_dev, &desc_cnp_rx, 1);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 253263b68cf1..e1ba9877985b 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -1873,7 +1873,7 @@ static void set_default_caps(struct hns_roce_dev *hr_dev)
 	caps->max_srq_wrs	= HNS_ROCE_V2_MAX_SRQ_WR;
 	caps->max_srq_sges	= HNS_ROCE_V2_MAX_SRQ_SGE;
 
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B) {
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B) {
 		caps->flags |= HNS_ROCE_CAP_FLAG_ATOMIC | HNS_ROCE_CAP_FLAG_MW |
 			       HNS_ROCE_CAP_FLAG_SRQ | HNS_ROCE_CAP_FLAG_FRMR |
 			       HNS_ROCE_CAP_FLAG_QP_FLOW_CTRL;
@@ -2122,7 +2122,7 @@ static int hns_roce_query_pf_caps(struct hns_roce_dev *hr_dev)
 		  caps->srqc_bt_num, &caps->srqc_buf_pg_sz,
 		  &caps->srqc_ba_pg_sz, HEM_TYPE_SRQC);
 
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B) {
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B) {
 		caps->qpc_timer_hop_num = HNS_ROCE_HOP_NUM_0;
 		caps->cqc_timer_hop_num = HNS_ROCE_HOP_NUM_0;
 		caps->scc_ctx_hop_num = ctx_hop_num;
@@ -2186,7 +2186,7 @@ static int hns_roce_v2_profile(struct hns_roce_dev *hr_dev)
 	if (ret)
 		return ret;
 
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B) {
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B) {
 		ret = hns_roce_query_pf_timer_resource(hr_dev);
 		if (ret) {
 			dev_err(hr_dev->dev,
@@ -2202,7 +2202,7 @@ static int hns_roce_v2_profile(struct hns_roce_dev *hr_dev)
 		return ret;
 	}
 
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B) {
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B) {
 		ret = hns_roce_set_vf_switch_param(hr_dev, 0);
 		if (ret) {
 			dev_err(hr_dev->dev,
@@ -2503,7 +2503,7 @@ static int hns_roce_v2_init(struct hns_roce_dev *hr_dev)
 			goto err_cqc_timer_failed;
 		}
 	}
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B)
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B)
 		hns_roce_clear_extdb_list_info(hr_dev);
 
 	return 0;
@@ -2531,7 +2531,7 @@ static void hns_roce_v2_exit(struct hns_roce_dev *hr_dev)
 {
 	struct hns_roce_v2_priv *priv = hr_dev->priv;
 
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B)
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B)
 		hns_roce_function_clear(hr_dev);
 
 	hns_roce_free_link_table(hr_dev, &priv->tpq);
@@ -4873,10 +4873,10 @@ static int hns_roce_v2_set_path(struct ib_qp *ibqp,
 		       V2_QPC_BYTE_24_HOP_LIMIT_S, 0);
 
 #ifdef CONFIG_KERNEL_419
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B &&
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B &&
 	    gid_attr->gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
 #else
-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08_B &&
+	if (hr_dev->pci_dev->revision >= PCI_REVISION_ID_HIP08_B &&
 	    gid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
 #endif
 		roce_set_field(context->byte_24_mtu_tc, V2_QPC_BYTE_24_TC_M,
From: Yufen Yu <yuyufen@huawei.com>

mainline inclusion
from mainline-v5.12-rc3
commit 3edf5346e4f2ce2fa0c94651a90a8dda169565ee
category: bugfix
bugzilla: 47613
CVE: NA

---------------------------
For multiple split bios, if one of the bios fails, the whole I/O should return an error to the application. But we found there is a race between bio_integrity_verify_fn and bio completion, which can report I/O success to the application even though one of the split bios failed. The race is as follows:
split bio(READ)          kworker

nvme_complete_rq
blk_update_request //split error=0
bio_endio
bio_integrity_endio
queue_work(kintegrityd_wq, &bip->bip_work);

                         bio_integrity_verify_fn
                         bio_endio //split bio
                         __bio_chain_endio
                         if (!parent->bi_status)

<interrupt entry>
nvme_irq
blk_update_request //parent error=7
req_bio_endio
   bio->bi_status = 7 //parent bio
<interrupt exit>

                         parent->bi_status = 0
                         parent->bi_end_io() // return bi_status=0
The bio has been split in two: the split bio and the parent. When the split bio completes, it relies on a kworker to do its endio, while bio_integrity_verify_fn can be interrupted by the parent bio's completion irq handler. The parent's bio->bi_status, which was set in the irq handler, is then overwritten by the kworker.
In fact, even without the above race, we also need to consider the concurrency between multiple split bio completions updating the same parent bi_status. Normally, multiple split bios will be issued to the same hctx and complete from the same irq vector. But if the queue map has been updated between multiple split bios, these bios may complete on different hw queues and different irq vectors. The concurrent updates of the parent bi_status may then produce a wrong final status.
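For illustration, here is a minimal userspace model of the propagation rule the one-line fix below introduces; the names (child_endio, status_t) are invented for this sketch and are not kernel API:

#include <stdio.h>

typedef int status_t;	/* 0 == success, non-zero == error (7 above) */

static void child_endio(status_t child_status, status_t *parent_status)
{
	/* Old rule: if (!*parent_status) *parent_status = child_status;
	 * The kworker could test the parent before the irq handler
	 * stored 7, then store 0 afterwards, losing the error.
	 * New rule: a successful child (status 0) never writes at all,
	 * so the parent status can only move from success to failure. */
	if (child_status && !*parent_status)
		*parent_status = child_status;
}

int main(void)
{
	status_t parent = 0;

	child_endio(7, &parent);	/* failing split bio records error */
	child_endio(0, &parent);	/* successful split bio cannot clear it */
	printf("parent status: %d\n", parent);	/* prints 7 */
	return 0;
}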
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Kuai Yu <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index 94d0f4798b5b..da05350dfba2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -314,7 +314,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
 {
 	struct bio *parent = bio->bi_private;
 
-	if (!parent->bi_status)
+	if (bio->bi_status && !parent->bi_status)
 		parent->bi_status = bio->bi_status;
 	bio_put(bio);
 	return parent;
From: Sakari Ailus <sakari.ailus@linux.intel.com>

stable inclusion
from linux-4.19.179
commit ff2111a6fab31923685b6ca8ea466ea0576b8a0e
CVE: CVE-2021-30002
--------------------------------
commit fb18802a338b36f675a388fc03d2aa504a0d0899 upstream.
When an IOCTL with an argument size larger than 128 bytes that also uses array arguments was handled, two memory allocations were made, but only the latter of them was released. This happened because there was only a single local variable to hold such a temporary allocation.
Fix this by adding separate variables to hold the pointers to the temporary allocations.
Reported-by: Arnd Bergmann <arnd@kernel.org>
Reported-by: syzbot+1115e79c8df6472c612b@syzkaller.appspotmail.com
Fixes: d14e6d76ebf7 ("[media] v4l: Add multi-planar ioctl handling code")
Cc: stable@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/media/v4l2-core/v4l2-ioctl.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/drivers/media/v4l2-core/v4l2-ioctl.c b/drivers/media/v4l2-core/v4l2-ioctl.c
index 7675b645db2e..b01e5f0c5c0c 100644
--- a/drivers/media/v4l2-core/v4l2-ioctl.c
+++ b/drivers/media/v4l2-core/v4l2-ioctl.c
@@ -2939,7 +2939,7 @@ video_usercopy(struct file *file, unsigned int cmd, unsigned long arg,
 	       v4l2_kioctl func)
 {
 	char	sbuf[128];
-	void	*mbuf = NULL;
+	void	*mbuf = NULL, *array_buf = NULL;
 	void	*parg = (void *)arg;
 	long	err  = -EINVAL;
 	bool	has_array_args;
@@ -2998,20 +2998,14 @@ video_usercopy(struct file *file, unsigned int cmd, unsigned long arg,
 	has_array_args = err;
 
 	if (has_array_args) {
-		/*
-		 * When adding new types of array args, make sure that the
-		 * parent argument to ioctl (which contains the pointer to the
-		 * array) fits into sbuf (so that mbuf will still remain
-		 * unused up to here).
-		 */
-		mbuf = kvmalloc(array_size, GFP_KERNEL);
+		array_buf = kvmalloc(array_size, GFP_KERNEL);
 		err = -ENOMEM;
-		if (NULL == mbuf)
+		if (array_buf == NULL)
 			goto out_array_args;
 		err = -EFAULT;
-		if (copy_from_user(mbuf, user_ptr, array_size))
+		if (copy_from_user(array_buf, user_ptr, array_size))
 			goto out_array_args;
-		*kernel_ptr = mbuf;
+		*kernel_ptr = array_buf;
 	}
 
 	/* Handles IOCTL */
@@ -3030,7 +3024,7 @@ video_usercopy(struct file *file, unsigned int cmd, unsigned long arg,
 
 	if (has_array_args) {
 		*kernel_ptr = (void __force *)user_ptr;
-		if (copy_to_user(user_ptr, mbuf, array_size))
+		if (copy_to_user(user_ptr, array_buf, array_size))
 			err = -EFAULT;
 		goto out_array_args;
 	}
@@ -3052,6 +3046,7 @@ video_usercopy(struct file *file, unsigned int cmd, unsigned long arg,
 	}
 
 out:
+	kvfree(array_buf);
 	kvfree(mbuf);
 	return err;
 }
From: Michael Braun <michael-dev@fami-braun.de>

stable inclusion
from linux-4.19.184
commit 9943741c2792a7f1d091aad38f496ed6eb7681c4
CVE: CVE-2021-29264
--------------------------------
[ Upstream commit d8861bab48b6c1fc3cdbcab8ff9d1eaea43afe7f ]
When using jumbo packets and overrunning the rx queue with napi enabled, the following sequence is observed in gfar_add_rx_frag:
   | lstatus                             |       | skb                   |
t  | lstatus, size, flags                | first | len, data_len, *ptr   |
---+-------------------------------------+-------+-----------------------+
13 | 18002348, 9032, INTERRUPT LAST      | 0     | 9600, 8000, f554c12e  |
12 | 10000640, 1600, INTERRUPT           | 0     | 8000, 6400, f554c12e  |
11 | 10000640, 1600, INTERRUPT           | 0     | 6400, 4800, f554c12e  |
10 | 10000640, 1600, INTERRUPT           | 0     | 4800, 3200, f554c12e  |
09 | 10000640, 1600, INTERRUPT           | 0     | 3200, 1600, f554c12e  |
08 | 14000640, 1600, INTERRUPT FIRST     | 0     | 1600, 0, f554c12e     |
07 | 14000640, 1600, INTERRUPT FIRST     | 1     | 0, 0, f554c12e        |
06 | 1c000080, 128, INTERRUPT LAST FIRST | 1     | 0, 0, abf3bd6e        |
05 | 18002348, 9032, INTERRUPT LAST      | 0     | 8000, 6400, c5a57780  |
04 | 10000640, 1600, INTERRUPT           | 0     | 6400, 4800, c5a57780  |
03 | 10000640, 1600, INTERRUPT           | 0     | 4800, 3200, c5a57780  |
02 | 10000640, 1600, INTERRUPT           | 0     | 3200, 1600, c5a57780  |
01 | 10000640, 1600, INTERRUPT           | 0     | 1600, 0, c5a57780     |
00 | 14000640, 1600, INTERRUPT FIRST     | 1     | 0, 0, c5a57780        |
So at t=7 a new packet is started but not finished, probably due to rx overrun - but rx overrun is not indicated in the flags. Instead a new packet starts at t=8. This results in skb->len exceeding size for the LAST fragment at t=13 and thus a negative fragment size being added to the skb.
This then crashes:
kernel BUG at include/linux/skbuff.h:2277!
Oops: Exception in kernel mode, sig: 5 [#1]
...
NIP [c04689f4] skb_pull+0x2c/0x48
LR [c03f62ac] gfar_clean_rx_ring+0x2e4/0x844
Call Trace:
[ec4bfd38] [c06a84c4] _raw_spin_unlock_irqrestore+0x60/0x7c (unreliable)
[ec4bfda8] [c03f6a44] gfar_poll_rx_sq+0x48/0xe4
[ec4bfdc8] [c048d504] __napi_poll+0x54/0x26c
[ec4bfdf8] [c048d908] net_rx_action+0x138/0x2c0
[ec4bfe68] [c06a8f34] __do_softirq+0x3a4/0x4fc
[ec4bfed8] [c0040150] run_ksoftirqd+0x58/0x70
[ec4bfee8] [c0066ecc] smpboot_thread_fn+0x184/0x1cc
[ec4bff08] [c0062718] kthread+0x140/0x144
[ec4bff38] [c0012350] ret_from_kernel_thread+0x14/0x1c
This patch fixes this by checking the computed LAST fragment size, so a negative-sized fragment is never added. In order to prevent the newer rx frame from getting corrupted, the FIRST flag is checked to discard the incomplete older frame.
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index c97c4edfa31b..6e2245fdc18e 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2942,6 +2942,10 @@ static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, u32 lstatus,
 		if (lstatus & BD_LFLAG(RXBD_LAST))
 			size -= skb->len;
 
+		WARN(size < 0, "gianfar: rx fragment size underflow");
+		if (size < 0)
+			return false;
+
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
 				rxb->page_offset + RXBUF_ALIGNMENT,
 				size, GFAR_RXB_TRUESIZE);
@@ -3103,6 +3107,17 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 		if (lstatus & BD_LFLAG(RXBD_EMPTY))
 			break;
 
+		/* lost RXBD_LAST descriptor due to overrun */
+		if (skb &&
+		    (lstatus & BD_LFLAG(RXBD_FIRST))) {
+			/* discard faulty buffer */
+			dev_kfree_skb(skb);
+			skb = NULL;
+			rx_queue->stats.rx_dropped++;
+
+			/* can continue normally */
+		}
+
 		/* order rx buffer descriptor reads */
 		rmb();
From: Jan Beulich <jbeulich@suse.com>

stable inclusion
from linux-4.19.184
commit 16356ddb587867c2a5ab85407eeb75f2b8818207
CVE: CVE-2021-28688
--------------------------------
commit a846738f8c3788d846ed1f587270d2f2e3d32432 upstream.
The fix for XSA-365 zapped too many of the ->persistent_gnt[] entries. Ones successfully obtained should not be overwritten, but instead left for xen_blkbk_unmap_prepare() to pick up and put.
This is XSA-371.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/block/xen-blkback/blkback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 208f3eea3641..d98cfd3b64ff 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -944,7 +944,7 @@ static int xen_blkbk_map(struct xen_blkif_ring *ring,
 out:
 	for (i = last_map; i < num; i++) {
 		/* Don't zap current batch's valid persistent grants. */
-		if(i >= last_map + segs_to_map)
+		if(i >= map_until)
 			pages[i]->persistent_gnt = NULL;
 		pages[i]->handle = BLKBACK_INVALID_HANDLE;
 	}
From: Eric Dumazet <edumazet@google.com>

stable inclusion
from linux-4.19.184
commit 5f09be2a1a35cb8bd6c178d5f205b7265bd68646
CVE: CVE-2021-29647
--------------------------------
commit 50535249f624d0072cd885bcdce4e4b6fb770160 upstream.
struct sockaddr_qrtr has a 2-byte hole, and qrtr_recvmsg() currently does not clear it before copying kernel data to user space.
It might be too late to name the hole since the sockaddr_qrtr structure is uapi.
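For reference, a small userspace sketch (assuming the uapi header and a typical ABI where __u32 is 4-byte aligned) that makes the hole visible:

#include <stdio.h>
#include <stddef.h>
#include <linux/qrtr.h>	/* uapi definition of struct sockaddr_qrtr */

int main(void)
{
	/* sq_family is 2 bytes but sq_node is 4-byte aligned, so bytes
	 * 2-3 are implicit padding. Field-by-field assignment never
	 * initializes them, which is the infoleak KMSAN reports below. */
	printf("sizeof(struct sockaddr_qrtr) = %zu\n",
	       sizeof(struct sockaddr_qrtr));
	printf("sq_family at %zu, sq_node at %zu, sq_port at %zu\n",
	       offsetof(struct sockaddr_qrtr, sq_family),
	       offsetof(struct sockaddr_qrtr, sq_node),
	       offsetof(struct sockaddr_qrtr, sq_port));
	return 0;
}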
BUG: KMSAN: kernel-infoleak in kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
CPU: 0 PID: 29705 Comm: syz-executor.3 Not tainted 5.11.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x21c/0x280 lib/dump_stack.c:120
 kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
 kmsan_internal_check_memory+0x202/0x520 mm/kmsan/kmsan.c:402
 kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
 instrument_copy_to_user include/linux/instrumented.h:121 [inline]
 _copy_to_user+0x1ac/0x270 lib/usercopy.c:33
 copy_to_user include/linux/uaccess.h:209 [inline]
 move_addr_to_user+0x3a2/0x640 net/socket.c:237
 ____sys_recvmsg+0x696/0xd50 net/socket.c:2575
 ___sys_recvmsg net/socket.c:2610 [inline]
 do_recvmmsg+0xa97/0x22d0 net/socket.c:2710
 __sys_recvmmsg net/socket.c:2789 [inline]
 __do_sys_recvmmsg net/socket.c:2812 [inline]
 __se_sys_recvmmsg+0x24a/0x410 net/socket.c:2805
 __x64_sys_recvmmsg+0x62/0x80 net/socket.c:2805
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x465f69
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f43659d6188 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 000000000056bf60 RCX: 0000000000465f69
RDX: 0000000000000008 RSI: 0000000020003e40 RDI: 0000000000000003
RBP: 00000000004bfa8f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000010060 R11: 0000000000000246 R12: 000000000056bf60
R13: 0000000000a9fb1f R14: 00007f43659d6300 R15: 0000000000022000

Local variable ----addr@____sys_recvmsg created at:
 ____sys_recvmsg+0x168/0xd50 net/socket.c:2550
 ____sys_recvmsg+0x168/0xd50 net/socket.c:2550

Bytes 2-3 of 12 are uninitialized
Memory access of size 12 starts at ffff88817c627b40
Data copied to user address 0000000020000140
Fixes: bdabad3e363d ("net: Add Qualcomm IPC router")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Courtney Cavin <courtney.cavin@sonymobile.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 net/qrtr/qrtr.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index 5c75118539bb..e72076337638 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -857,6 +857,11 @@ static int qrtr_recvmsg(struct socket *sock, struct msghdr *msg,
 	rc = copied;
 
 	if (addr) {
+		/* There is an anonymous 2-byte hole after sq_family,
+		 * make sure to clear it.
+		 */
+		memset(addr, 0, sizeof(*addr));
+
 		cb = (struct qrtr_cb *)skb->cb;
 		addr->sq_family = AF_QIPCRTR;
 		addr->sq_node = cb->src_node;
From: Rustam Kovhaev <rkovhaev@gmail.com>

stable inclusion
from linux-4.19.148
commit 19184bd06f488af62924ff1747614a8cb284ad63
CVE: CVE-2020-36312
--------------------------------
[ Upstream commit f65886606c2d3b562716de030706dfe1bea4ed5e ]
When kmalloc() fails in kvm_io_bus_unregister_dev(), before removing the bus we should iterate over all other devices linked to it and call kvm_iodevice_destructor() for them.
Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail")
Cc: stable@vger.kernel.org
Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 virt/kvm/kvm_main.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0292f9f0a774..44a9532aaf4b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3696,7 +3696,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			       struct kvm_io_device *dev)
 {
-	int i;
+	int i, j;
 	struct kvm_io_bus *new_bus, *bus;
 
 	bus = kvm_get_bus(kvm, bus_idx);
@@ -3713,17 +3713,20 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 
 	new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count - 1) *
 			  sizeof(struct kvm_io_range)), GFP_KERNEL);
-	if (!new_bus) {
+	if (new_bus) {
+		memcpy(new_bus, bus, sizeof(*bus) + i * sizeof(struct kvm_io_range));
+		new_bus->dev_count--;
+		memcpy(new_bus->range + i, bus->range + i + 1,
+		       (new_bus->dev_count - i) * sizeof(struct kvm_io_range));
+	} else {
 		pr_err("kvm: failed to shrink bus, removing it completely\n");
-		goto broken;
+		for (j = 0; j < bus->dev_count; j++) {
+			if (j == i)
+				continue;
+			kvm_iodevice_destructor(bus->range[j].dev);
+		}
 	}
 
-	memcpy(new_bus, bus, sizeof(*bus) + i * sizeof(struct kvm_io_range));
-	new_bus->dev_count--;
-	memcpy(new_bus->range + i, bus->range + i + 1,
-	       (new_bus->dev_count - i) * sizeof(struct kvm_io_range));
-
-broken:
 	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
 	synchronize_srcu_expedited(&kvm->srcu);
 	kfree(bus);
From: Peter Zijlstra <peterz@infradead.org>

stable inclusion
from linux-4.19.178
commit 158c3ec956d3881c86df5c0a842f39a2ee0c926b
--------------------------------
commit cb538267ea1e9e025ec692577c9ae75797261889 upstream.
Weirdly we seem to have forgotten this...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 kernel/jump_label.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index ee72f937bedc..ba2454af4141 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -83,6 +83,7 @@ void static_key_slow_inc_cpuslocked(struct static_key *key)
 	int v, v1;
 
 	STATIC_KEY_CHECK_USE(key);
+	lockdep_assert_cpus_held();
 
 	/*
 	 * Careful if we get concurrent static_key_slow_inc() calls;
@@ -128,6 +129,7 @@ EXPORT_SYMBOL_GPL(static_key_slow_inc);
 void static_key_enable_cpuslocked(struct static_key *key)
 {
 	STATIC_KEY_CHECK_USE(key);
+	lockdep_assert_cpus_held();
 
 	if (atomic_read(&key->enabled) > 0) {
 		WARN_ON_ONCE(atomic_read(&key->enabled) != 1);
@@ -158,6 +160,7 @@ EXPORT_SYMBOL_GPL(static_key_enable);
 void static_key_disable_cpuslocked(struct static_key *key)
 {
 	STATIC_KEY_CHECK_USE(key);
+	lockdep_assert_cpus_held();
 
 	if (atomic_read(&key->enabled) != 1) {
 		WARN_ON_ONCE(atomic_read(&key->enabled) != 0);
@@ -183,6 +186,8 @@ static void __static_key_slow_dec_cpuslocked(struct static_key *key,
 					   unsigned long rate_limit,
 					   struct delayed_work *work)
 {
+	lockdep_assert_cpus_held();
+
 	/*
 	 * The negative count check is valid even when a negative
 	 * key->enabled is in use by static_key_slow_inc(); a
From: Peter Zijlstra <peterz@infradead.org>

stable inclusion
from linux-4.19.178
commit 4eb9488bd27b969b248748ae02053f508c9b529e
--------------------------------
commit a1247d06d01045d7ab2882a9c074fbf21137c690 upstream.
Even though the atomic_dec_and_mutex_lock() in __static_key_slow_dec_cpuslocked() can never see a negative value in key->enabled, the subsequent sanity check re-reads key->enabled, which may have been set to -1 in the meantime by static_key_slow_inc_cpuslocked().
CPU A					CPU B

__static_key_slow_dec_cpuslocked():	static_key_slow_inc_cpuslocked():
					# enabled = 1
  atomic_dec_and_mutex_lock()
  # enabled = 0
					  atomic_read() == 0
					  atomic_set(-1)
					  # enabled = -1
  val = atomic_read()
  # Oops - val == -1!
The test case is TCP's clean_acked_data_enable() / clean_acked_data_disable() as tickled by KTLS (net/ktls).
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: ard.biesheuvel@linaro.org
Cc: oss-drivers@netronome.com
Cc: pbonzini@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 kernel/jump_label.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index ba2454af4141..0cf0a1496925 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -186,6 +186,8 @@ static void __static_key_slow_dec_cpuslocked(struct static_key *key,
 					   unsigned long rate_limit,
 					   struct delayed_work *work)
 {
+	int val;
+
 	lockdep_assert_cpus_held();
 
 	/*
@@ -195,17 +197,20 @@ static void __static_key_slow_dec_cpuslocked(struct static_key *key,
 	 * returns is unbalanced, because all other static_key_slow_inc()
 	 * instances block while the update is in progress.
 	 */
-	if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
-		WARN(atomic_read(&key->enabled) < 0,
-		     "jump label: negative count!\n");
+	val = atomic_fetch_add_unless(&key->enabled, -1, 1);
+	if (val != 1) {
+		WARN(val < 0, "jump label: negative count!\n");
 		return;
 	}
 
-	if (rate_limit) {
-		atomic_inc(&key->enabled);
-		schedule_delayed_work(work, rate_limit);
-	} else {
-		jump_label_update(key);
+	jump_label_lock();
+	if (atomic_dec_and_test(&key->enabled)) {
+		if (rate_limit) {
+			atomic_inc(&key->enabled);
+			schedule_delayed_work(work, rate_limit);
+		} else {
+			jump_label_update(key);
+		}
 	}
 	jump_label_unlock();
 }
From: Jan Kara <jack@suse.cz>

stable inclusion
from linux-4.19.178
commit 904e2953231a8b040108584965561a1ba8c197f2
--------------------------------
commit 41e76c85660c022c6bf5713bfb6c21e64a487cec upstream.
bfq_setup_cooperator() uses bfqd->in_serv_last_pos to detect whether it makes sense to merge the current bfq queue with the in-service queue. However, if the in-service queue is freshly scheduled and didn't dispatch any requests yet, bfqd->in_serv_last_pos is stale and contains a value from the previously scheduled bfq queue, which can thus result in a bogus decision that the two queues should be merged. This bug can be observed for example with the following fio jobfile:
[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read

[reader]
numjobs=4
directory=/mnt
where the 4 processes will end up in the one shared bfq queue although they do IO to physically very distant files (for some reason I was able to observe this only with slice_idle=1ms setting).
Fix the problem by invalidating bfqd->in_serv_last_pos when switching in-service queue.
Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 block/bfq-iosched.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 97b033b00d4e..a214c95f071a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2479,6 +2479,7 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
 	}
 
 	bfqd->in_service_queue = bfqq;
+	bfqd->in_serv_last_pos = 0;
 }
 
 /*
From: Eric Biggers <ebiggers@google.com>

stable inclusion
from linux-4.19.178
commit 1fc338cde538bc2d73006fffb0ea20fa97fdbd55
--------------------------------
commit 11a0b5e0ec8c13bef06f7414f9e914506140d5cb upstream.
The RNDRESEEDCRNG ioctl reseeds the primary_crng from itself, which doesn't make sense. Reseed it from the input_pool instead.
Fixes: d848e5f8e1eb ("random: add new ioctl RNDRESEEDCRNG")
Cc: stable@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210112192818.69921-1-ebiggers@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index c7c344e69f19..401a2cce29ef 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2011,7 +2011,7 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 			return -EPERM;
 		if (crng_init < 2)
 			return -ENODATA;
-		crng_reseed(&primary_crng, NULL);
+		crng_reseed(&primary_crng, &input_pool);
 		crng_global_init_time = jiffies - 1;
 		return 0;
 	default:
From: Eric Dumazet <edumazet@google.com>

stable inclusion
from linux-4.19.178
commit 777d796966484f5b2b6245706057a05d1d1b642a
--------------------------------
[ Upstream commit f969dc5a885736842c3511ecdea240fbb02d25d9 ]
While commit 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs") fixed an issue with a too-small sk_rcvbuf for a given sk_rcvlowat constraint, it missed addressing the issue caused by memory pressure.

1) If we are under memory pressure and the socket receive queue is empty, the first incoming packet is allowed to be queued, after commit 76dfa6082032 ("tcp: allow one skb to be received per socket under memory pressure")

But we do not send EPOLLIN yet, in case tcp_data_ready() sees that sk_rcvlowat is bigger than the skb length.

2) Then, when the next packet comes, it is dropped, and we directly call sk->sk_data_ready().

3) If the application is using poll(), tcp_poll() will then use tcp_stream_is_readable() and decide the socket receive queue is not yet filled, so nothing will happen.

Even when the sender retransmits packets, phases 2) & 3) repeat and the flow is effectively frozen, until memory pressure is off.

The fix is to consider tcp_under_memory_pressure(), which takes care of both global memory pressure and memcg pressure.
Fixes: 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun Roy <arjunroy@google.com>
Suggested-by: Wei Wang <weiwan@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 include/net/tcp.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 444762485615..f5128bc28bb7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1380,8 +1380,13 @@ static inline int tcp_full_space(const struct sock *sk)
  */
 static inline bool tcp_rmem_pressure(const struct sock *sk)
 {
-	int rcvbuf = READ_ONCE(sk->sk_rcvbuf);
-	int threshold = rcvbuf - (rcvbuf >> 3);
+	int rcvbuf, threshold;
+
+	if (tcp_under_memory_pressure(sk))
+		return true;
+
+	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
+	threshold = rcvbuf - (rcvbuf >> 3);
 
 	return atomic_read(&sk->sk_rmem_alloc) > threshold;
 }
From: Pan Bian <bianpan2016@163.com>

stable inclusion
from linux-4.19.178
commit 8e51a6f8cf9c2a6e8e7b321a7fbccf6108b9e50c
--------------------------------
[ Upstream commit 0a6dc67a6aa45f19bd4ff89b4f468fc50c4b8daa ]
Release the buffer_head before returning an error code in do_isofs_readdir() and isofs_find_entry().
Fixes: 2deb1acc653c ("isofs: fix access to unallocated memory when reading corrupted filesystem")
Link: https://lore.kernel.org/r/20210118120455.118955-1-bianpan2016@163.com
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/isofs/dir.c   | 1 +
 fs/isofs/namei.c | 1 +
 2 files changed, 2 insertions(+)
diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 947ce22f5b3c..55df4d80793b 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -152,6 +152,7 @@ static int do_isofs_readdir(struct inode *inode, struct file *file,
 			printk(KERN_NOTICE "iso9660: Corrupted directory entry"
 			       " in block %lu of inode %lu\n", block,
 			       inode->i_ino);
+			brelse(bh);
 			return -EIO;
 		}
 
diff --git a/fs/isofs/namei.c b/fs/isofs/namei.c
index cac468f04820..558e7c51ce0d 100644
--- a/fs/isofs/namei.c
+++ b/fs/isofs/namei.c
@@ -102,6 +102,7 @@ isofs_find_entry(struct inode *dir, struct dentry *dentry,
 			printk(KERN_NOTICE "iso9660: Corrupted directory entry"
 			       " in block %lu of inode %lu\n", block,
 			       dir->i_ino);
+			brelse(bh);
 			return 0;
 		}
 
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.178 commit dc782e5a4d4cd20e5c365532b85be53696f0c320
--------------------------------
[ Upstream commit befe6d946551d65cddbd32b9cb0170b0249fd5ed ]
The list of tracepoint callbacks is managed by an array that is protected by RCU. To update this array, a new array is allocated, the updates are copied over to the new array, and then the list of functions for the tracepoint is switched over to the new array. After a completion of an RCU grace period, the old array is freed.
This process happens for both adding a callback as well as removing one. But on removing a callback, if the new array fails to be allocated, the callback is not removed, and may be used after it is freed by the clients of the tracepoint.
There's really no reason to fail if the allocation for a new array fails when removing a function. Instead, the function can simply be replaced by a stub function that could be cleaned up on the next modification of the array. That is, instead of calling the function registered to the tracepoint, it would call a stub function in its place.
Link: https://lore.kernel.org/r/20201115055256.65625-1-mmullins@mmlx.us
Link: https://lore.kernel.org/r/20201116175107.02db396d@gandalf.local.home
Link: https://lore.kernel.org/r/20201117211836.54acaef2@oasis.local.home
Link: https://lkml.kernel.org/r/20201118093405.7a6d2290@gandalf.local.home
[ Note, this version does use undefined compiler behavior (assuming that a stub function with no parameters or return, can be called by a location that thinks it has parameters but still no return value). Static calls do the same thing, so this trick is not without precedent.
There's another solution that uses RCU tricks and is more complex, but can be an alternative if this solution becomes an issue.
Link: https://lore.kernel.org/lkml/20210127170721.58bce7cc@gandalf.local.home/ ]
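As a standalone illustration of the trick the note describes, here is a hedged userspace sketch; it deliberately relies on the same undefined-but-ABI-safe behavior, and all names (stub_func, probe_func_t) are invented for the sketch:

#include <stdio.h>

/* Stands in for tp_stub_func in the patch below: no parameters,
 * no return value. */
static void stub_func(void)
{
}

/* The tracepoint call site believes the probe takes arguments. */
typedef void (*probe_func_t)(int a, int b);

int main(void)
{
	/* Calling a (void) function through a pointer type that passes
	 * arguments is undefined behavior per the C standard, but is
	 * safe under the calling conventions the kernel supports: the
	 * callee simply never looks at the argument registers/stack. */
	probe_func_t fn = (probe_func_t)stub_func;

	fn(1, 2);	/* arguments ignored by the stub */
	printf("stub call returned\n");
	return 0;
}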
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@chromium.org>
Cc: netdev <netdev@vger.kernel.org>
Cc: bpf <bpf@vger.kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fw@deneb.enyo.de>
Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints")
Reported-by: syzbot+83aa762ef23b6f0d1991@syzkaller.appspotmail.com
Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com
Reported-by: Matt Mullins <mmullins@mmlx.us>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Matt Mullins <mmullins@mmlx.us>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 kernel/tracepoint.c | 80 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 64 insertions(+), 16 deletions(-)
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index a3be42304485..d5ce69231912 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -66,6 +66,12 @@ struct tp_probes {
 	struct tracepoint_func probes[0];
 };
 
+/* Called in removal of a func but failed to allocate a new tp_funcs */
+static void tp_stub_func(void)
+{
+	return;
+}
+
 static inline void *allocate_probes(int count)
 {
 	struct tp_probes *p = kmalloc(count * sizeof(struct tracepoint_func)
@@ -144,6 +150,7 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 {
 	struct tracepoint_func *old, *new;
 	int nr_probes = 0;
+	int stub_funcs = 0;
 	int pos = -1;
 
 	if (WARN_ON(!tp_func->func))
@@ -160,14 +167,34 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 			if (old[nr_probes].func == tp_func->func &&
 			    old[nr_probes].data == tp_func->data)
 				return ERR_PTR(-EEXIST);
+			if (old[nr_probes].func == tp_stub_func)
+				stub_funcs++;
 		}
 	}
-	/* + 2 : one for new probe, one for NULL func */
-	new = allocate_probes(nr_probes + 2);
+	/* + 2 : one for new probe, one for NULL func - stub functions */
+	new = allocate_probes(nr_probes + 2 - stub_funcs);
 	if (new == NULL)
 		return ERR_PTR(-ENOMEM);
 	if (old) {
-		if (pos < 0) {
+		if (stub_funcs) {
+			/* Need to copy one at a time to remove stubs */
+			int probes = 0;
+
+			pos = -1;
+			for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
+				if (old[nr_probes].func == tp_stub_func)
+					continue;
+				if (pos < 0 && old[nr_probes].prio < prio)
+					pos = probes++;
+				new[probes++] = old[nr_probes];
+			}
+			nr_probes = probes;
+			if (pos < 0)
+				pos = probes;
+			else
+				nr_probes--; /* Account for insertion */
+
+		} else if (pos < 0) {
 			pos = nr_probes;
 			memcpy(new, old, nr_probes * sizeof(struct tracepoint_func));
 		} else {
@@ -201,8 +228,9 @@ static void *func_remove(struct tracepoint_func **funcs,
 	/* (N -> M), (N > 1, M >= 0) probes */
 	if (tp_func->func) {
 		for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
-			if (old[nr_probes].func == tp_func->func &&
-			     old[nr_probes].data == tp_func->data)
+			if ((old[nr_probes].func == tp_func->func &&
+			     old[nr_probes].data == tp_func->data) ||
+			    old[nr_probes].func == tp_stub_func)
 				nr_del++;
 		}
 	}
@@ -221,14 +249,32 @@ static void *func_remove(struct tracepoint_func **funcs,
 		/* N -> M, (N > 1, M > 0) */
 		/* + 1 for NULL */
 		new = allocate_probes(nr_probes - nr_del + 1);
-		if (new == NULL)
-			return ERR_PTR(-ENOMEM);
-		for (i = 0; old[i].func; i++)
-			if (old[i].func != tp_func->func
-					|| old[i].data != tp_func->data)
-				new[j++] = old[i];
-		new[nr_probes - nr_del].func = NULL;
-		*funcs = new;
+		if (new) {
+			for (i = 0; old[i].func; i++)
+				if ((old[i].func != tp_func->func
+				     || old[i].data != tp_func->data)
+				    && old[i].func != tp_stub_func)
+					new[j++] = old[i];
+			new[nr_probes - nr_del].func = NULL;
+			*funcs = new;
+		} else {
+			/*
+			 * Failed to allocate, replace the old function
+			 * with calls to tp_stub_func.
+			 */
+			for (i = 0; old[i].func; i++)
+				if (old[i].func == tp_func->func &&
+				    old[i].data == tp_func->data) {
+					old[i].func = tp_stub_func;
+					/* Set the prio to the next event. */
+					if (old[i + 1].func)
+						old[i].prio = old[i + 1].prio;
+					else
+						old[i].prio = -1;
+				}
+			*funcs = old;
+		}
 	}
 	debug_print_probes(*funcs);
 	return old;
@@ -284,10 +330,12 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 	tp_funcs = rcu_dereference_protected(tp->funcs,
 			lockdep_is_held(&tracepoints_mutex));
 	old = func_remove(&tp_funcs, func);
-	if (IS_ERR(old)) {
-		WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
+	if (WARN_ON_ONCE(IS_ERR(old)))
 		return PTR_ERR(old);
-	}
+
+	if (tp_funcs == old)
+		/* Failed allocating new tp_funcs, replaced func with stub */
+		return 0;
 
 	if (!tp_funcs) {
 		/* Removed last function */
From: Dan Carpenter <dan.carpenter@oracle.com>

stable inclusion
from linux-4.19.178
commit 23f96a69ba239bfe39d2845c5105016f48da5955
--------------------------------
[ Upstream commit c57d117f2b2f2a19b570c36f2819ef8d8210af20 ]
The error handling in this function frees "reg" but it is still on the "o2hb_all_regions" list, so it will lead to a use after free. Joseph Qi points out that we need to clear the bit in the "o2hb_region_bitmap" as well.
Link: https://lkml.kernel.org/r/YBk4M6HUG8jB/jc7@mwanda
Fixes: 1cf257f51191 ("ocfs2: fix memory leak")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/ocfs2/cluster/heartbeat.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index 9b2ed62dd638..19b0d358a0d6 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -2154,7 +2154,7 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g o2hb_nego_timeout_handler, reg, NULL, ®->hr_handler_list); if (ret) - goto free; + goto remove_item;
ret = o2net_register_handler(O2HB_NEGO_APPROVE_MSG, reg->hr_key, sizeof(struct o2hb_nego_msg), @@ -2173,6 +2173,12 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g
unregister_handler: o2net_unregister_handler_list(®->hr_handler_list); +remove_item: + spin_lock(&o2hb_live_lock); + list_del(®->hr_all_item); + if (o2hb_global_heartbeat_active()) + clear_bit(reg->hr_region_num, o2hb_region_bitmap); + spin_unlock(&o2hb_live_lock); free: kfree(reg); return ERR_PTR(ret);
From: Marc Zyngier maz@kernel.org
stable inclusion from linux-4.19.178 commit 73ff5db113009d6072e63b25b8beed1f47e55baf
--------------------------------
[ Upstream commit 9d41053e8dc115c92b8002c3db5f545d7602498b ]
Although there has been a bit of back and forth on the subject, it appears that invalidating TLBs requires an ISB instruction when FEAT_ETS is not implemented by the CPU.
From the bible (the Arm Architecture Reference Manual):
| In an implementation that does not implement FEAT_ETS, a TLB
| maintenance instruction executed by a PE, PEx, can complete at any
| time after it is issued, but is only guaranteed to be finished for a
| PE, PEx, after the execution of DSB by the PEx followed by a Context
| synchronization event
Add the missing ISB in __primary_switch, just in case.
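For illustration only, a minimal sketch of the sequence the quoted rule implies on a CPU without FEAT_ETS (the helper name is made up; the hunk below simply adds the missing isb in __primary_switch):

/*
 * Illustrative sketch, not the kernel's actual helper: a full local
 * stage-1 TLB invalidation with the completion guarantee described
 * above. The DSB waits for the TLBI to finish; the ISB is the context
 * synchronization event that makes it architecturally visible.
 */
static inline void local_flush_tlb_all_sync(void)
{
	asm volatile(
	"	tlbi	vmalle1\n"	/* drop stage-1 EL1 TLB entries */
	"	dsb	nsh\n"		/* wait for the TLBI to complete */
	"	isb"			/* context synchronization event */
	: : : "memory");
}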
Fixes: 3c5e9f238bc4 ("arm64: head.S: move KASLR processing out of __enable_mmu()") Suggested-by: Will Deacon will@kernel.org Signed-off-by: Marc Zyngier maz@kernel.org Acked-by: Mark Rutland mark.rutland@arm.com Link: https://lore.kernel.org/r/20210224093738.3629662-3-maz@kernel.org Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/head.S | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S index 2c1f6e0e5c59..34bd0790d1e6 100644 --- a/arch/arm64/kernel/head.S +++ b/arch/arm64/kernel/head.S @@ -867,6 +867,7 @@ __primary_switch:
tlbi vmalle1 // Remove any stale TLB entries dsb nsh + isb
msr sctlr_el1, x19 // re-enable the MMU isb
From: Fangrui Song maskray@google.com
stable inclusion from linux-4.19.178 commit 8697aa8614cb58ad882c5453ca6786b32696a9fc
--------------------------------
commit ebfac7b778fac8b0e8e92ec91d0b055f046b4604 upstream.
clang-12 -fno-pic (since https://github.com/llvm/llvm-project/commit/a084c0388e2a59b9556f2de008333323...) can emit `call __stack_chk_fail@PLT` instead of `call __stack_chk_fail` on x86. The two forms should have identical behaviors on x86-64 but the former causes GNU as<2.37 to produce an unreferenced undefined symbol _GLOBAL_OFFSET_TABLE_.
(On x86-32, there is an R_386_PC32 vs R_386_PLT32 difference but the linker behavior is identical as far as Linux kernel is concerned.)
Simply ignore _GLOBAL_OFFSET_TABLE_ for now, like what scripts/mod/modpost.c:ignore_undef_symbol does. This also fixes the problem for gcc/clang -fpie and -fpic, which may emit `call foo@PLT` for external function calls on x86.
Note: ld -z defs and dynamic loaders do not error for unreferenced undefined symbols so the module loader is reading too much. If we ever need to ignore more symbols, the code should be refactored to ignore unreferenced symbols.
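As a hedged illustration of the toolchain behavior described above (a hypothetical function, not taken from the report): any stack-protector-instrumented function can end up with the PLT-form canary check.

#include <string.h>

/*
 * Hypothetical reproducer: under -fstack-protector-strong the local
 * buffer makes this a canary candidate, and clang-12 -fno-pic on
 * x86-64 may emit the failure path as "call __stack_chk_fail@PLT".
 * GNU as < 2.37 then records an unreferenced undefined
 * _GLOBAL_OFFSET_TABLE_ symbol in the resulting object file.
 */
void copy_name(char *dst, size_t dst_len, const char *src)
{
	char buf[64];

	strncpy(buf, src, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';
	strncpy(dst, buf, dst_len);
}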
Cc: stable@vger.kernel.org Link: https://github.com/ClangBuiltLinux/linux/issues/1250 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=27178 Reported-by: Marco Elver elver@google.com Reviewed-by: Nick Desaulniers ndesaulniers@google.com Reviewed-by: Nathan Chancellor natechancellor@gmail.com Tested-by: Marco Elver elver@google.com Signed-off-by: Fangrui Song maskray@google.com Signed-off-by: Jessica Yu jeyu@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/module.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c index 820b8e134a71..ad4c1d7b7a95 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2262,6 +2262,21 @@ static int verify_export_symbols(struct module *mod) return 0; }
+static bool ignore_undef_symbol(Elf_Half emachine, const char *name) +{ + /* + * On x86, PIC code and Clang non-PIC code may have call foo@PLT. GNU as + * before 2.37 produces an unreferenced _GLOBAL_OFFSET_TABLE_ on x86-64. + * i386 has a similar problem but may not deserve a fix. + * + * If we ever have to ignore many symbols, consider refactoring the code to + * only warn if referenced by a relocation. + */ + if (emachine == EM_386 || emachine == EM_X86_64) + return !strcmp(name, "_GLOBAL_OFFSET_TABLE_"); + return false; +} + /* Change all symbols so that st_value encodes the pointer directly. */ static int simplify_symbols(struct module *mod, const struct load_info *info) { @@ -2307,8 +2322,10 @@ static int simplify_symbols(struct module *mod, const struct load_info *info) break; }
- /* Ok if weak. */ - if (!ksym && ELF_ST_BIND(sym[i].st_info) == STB_WEAK) + /* Ok if weak or ignored. */ + if (!ksym && + (ELF_ST_BIND(sym[i].st_info) == STB_WEAK || + ignore_undef_symbol(info->hdr->e_machine, name))) break;
ret = PTR_ERR(ksym) ?: -ENOENT;
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.178 commit ba7ae3629d5b2ae4ed86a08f29afeac623550511
--------------------------------
commit 8a8109f303e25a27f92c1d8edd67d7cbbc60a4eb upstream.
printk_safe_flush_on_panic() caused the following deadlock on our server:
CPU0:                                        CPU1:
panic                                        rcu_dump_cpu_stacks
  kdump_nmi_shootdown_cpus                     nmi_trigger_cpumask_backtrace
    register_nmi_handler(crash_nmi_callback)     printk_safe_flush
                                                   __printk_safe_flush
                                                     raw_spin_lock_irqsave(&read_lock)
    // send NMI to other processors
    apic_send_IPI_allbutself(NMI_VECTOR)
                                                     // NMI interrupt, dead loop
                                                     crash_nmi_callback
                                                       printk_safe_flush_on_panic
                                                         printk_safe_flush
                                                           __printk_safe_flush
                                                             // deadlock
                                                             raw_spin_lock_irqsave(&read_lock)
DEADLOCK: read_lock is taken on CPU1 and will never get released.
It happens when panic() stops a CPU by NMI while it has been in the middle of printk_safe_flush().
Handle the lock the same way as logbuf_lock. The printk_safe buffers are flushed only when both locks can be safely taken. This avoids the deadlock _in this particular case_ at the expense of losing the contents of the printk_safe buffers.
Note: It would actually be safe to re-init the locks when all CPUs were stopped by NMI. But it would require passing this information from arch-specific code. It is not worth the complexity. Especially because logbuf_lock and printk_safe buffers have been obsoleted by the lockless ring buffer.
Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic") Signed-off-by: Muchun Song songmuchun@bytedance.com Reviewed-by: Petr Mladek pmladek@suse.com Cc: stable@vger.kernel.org Acked-by: Sergey Senozhatsky sergey.senozhatsky@gmail.com Signed-off-by: Petr Mladek pmladek@suse.com Link: https://lore.kernel.org/r/20210210034823.64867-1-songmuchun@bytedance.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/printk/printk_safe.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c index 63ba880a6df0..809f92492ec7 100644 --- a/kernel/printk/printk_safe.c +++ b/kernel/printk/printk_safe.c @@ -55,6 +55,8 @@ struct printk_safe_seq_buf { static DEFINE_PER_CPU(struct printk_safe_seq_buf, safe_print_seq); static DEFINE_PER_CPU(int, printk_context);
+static DEFINE_RAW_SPINLOCK(safe_read_lock); + #ifdef CONFIG_PRINTK_NMI static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq); #endif @@ -190,8 +192,6 @@ static void report_message_lost(struct printk_safe_seq_buf *s) */ static void __printk_safe_flush(struct irq_work *work) { - static raw_spinlock_t read_lock = - __RAW_SPIN_LOCK_INITIALIZER(read_lock); struct printk_safe_seq_buf *s = container_of(work, struct printk_safe_seq_buf, work); unsigned long flags; @@ -205,7 +205,7 @@ static void __printk_safe_flush(struct irq_work *work) * different CPUs. This is especially important when printing * a backtrace. */ - raw_spin_lock_irqsave(&read_lock, flags); + raw_spin_lock_irqsave(&safe_read_lock, flags);
i = 0; more: @@ -242,7 +242,7 @@ static void __printk_safe_flush(struct irq_work *work)
out: report_message_lost(s); - raw_spin_unlock_irqrestore(&read_lock, flags); + raw_spin_unlock_irqrestore(&safe_read_lock, flags); }
/** @@ -288,6 +288,14 @@ void printk_safe_flush_on_panic(void) raw_spin_lock_init(&logbuf_lock); }
+ if (raw_spin_is_locked(&safe_read_lock)) { + if (num_online_cpus() > 1) + return; + + debug_locks_off(); + raw_spin_lock_init(&safe_read_lock); + } + printk_safe_flush(); } EXPORT_SYMBOL_GPL(printk_safe_flush_on_panic);
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.178 commit 3efde1864ab5552d8c9411e0112f5508b4c7ec47
--------------------------------
commit 0b41713b606694257b90d61ba7e2712d8457648b upstream.
This introduces a helper function, to be called only by network drivers, that wraps calls to icmp[v6]_send in a conntrack transformation in case NAT has been used. We don't want to pollute the non-driver path, though, so we introduce this as a helper to be called by places that actually make use of this, as suggested by Florian.
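A hedged usage sketch (foo_xmit and the MTU value are invented; the real conversions follow in later patches of this series): a driver's transmit path reports the error through the NAT-aware wrapper instead of calling icmp_send() directly.

/* Hypothetical ndo_start_xmit fragment, kernel context assumed. */
static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
	unsigned int mtu = 1400;	/* assumed tunnel MTU, illustration only */

	if (skb->len > mtu) {
		/* un-SNATs the source address before building the error */
		icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
			      htonl(mtu));
		kfree_skb(skb);
		return NETDEV_TX_OK;
	}
	/* ... hand the packet to the underlying transport here ... */
	return NETDEV_TX_OK;
}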
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Cc: Florian Westphal fw@strlen.de Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/icmpv6.h | 6 ++++++ include/net/icmp.h | 6 ++++++ net/ipv4/icmp.c | 33 +++++++++++++++++++++++++++++++++ net/ipv6/ip6_icmp.c | 34 ++++++++++++++++++++++++++++++++++ 4 files changed, 79 insertions(+)
diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h index a8f888976137..adb981ab7de9 100644 --- a/include/linux/icmpv6.h +++ b/include/linux/icmpv6.h @@ -31,6 +31,12 @@ static inline void icmpv6_send(struct sk_buff *skb, } #endif
+#if IS_ENABLED(CONFIG_NF_NAT) +void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info); +#else +#define icmpv6_ndo_send icmpv6_send +#endif + extern int icmpv6_init(void); extern int icmpv6_err_convert(u8 type, u8 code, int *err); diff --git a/include/net/icmp.h b/include/net/icmp.h index 8665bf24e3b7..9c344e2655d2 100644 --- a/include/net/icmp.h +++ b/include/net/icmp.h @@ -47,6 +47,12 @@ static inline void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 __icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt); }
+#if IS_ENABLED(CONFIG_NF_NAT) +void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info); +#else +#define icmp_ndo_send icmp_send +#endif + int icmp_rcv(struct sk_buff *skb); void icmp_err(struct sk_buff *skb, u32 info); int icmp_init(void); diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 71a25e6db892..07d62cc5873a 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -757,6 +757,39 @@ out:; } EXPORT_SYMBOL(__icmp_send);
+#if IS_ENABLED(CONFIG_NF_NAT) +#include <net/netfilter/nf_conntrack.h> +void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info) +{ + struct sk_buff *cloned_skb = NULL; + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; + __be32 orig_ip; + + ct = nf_ct_get(skb_in, &ctinfo); + if (!ct || !(ct->status & IPS_SRC_NAT)) { + icmp_send(skb_in, type, code, info); + return; + } + + if (skb_shared(skb_in)) + skb_in = cloned_skb = skb_clone(skb_in, GFP_ATOMIC); + + if (unlikely(!skb_in || skb_network_header(skb_in) < skb_in->head || + (skb_network_header(skb_in) + sizeof(struct iphdr)) > + skb_tail_pointer(skb_in) || skb_ensure_writable(skb_in, + skb_network_offset(skb_in) + sizeof(struct iphdr)))) + goto out; + + orig_ip = ip_hdr(skb_in)->saddr; + ip_hdr(skb_in)->saddr = ct->tuplehash[0].tuple.src.u3.ip; + icmp_send(skb_in, type, code, info); + ip_hdr(skb_in)->saddr = orig_ip; +out: + consume_skb(cloned_skb); +} +EXPORT_SYMBOL(icmp_ndo_send); +#endif
static void icmp_socket_deliver(struct sk_buff *skb, u32 info) { diff --git a/net/ipv6/ip6_icmp.c b/net/ipv6/ip6_icmp.c index 02045494c24c..e0086758b6ee 100644 --- a/net/ipv6/ip6_icmp.c +++ b/net/ipv6/ip6_icmp.c @@ -45,4 +45,38 @@ void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) rcu_read_unlock(); } EXPORT_SYMBOL(icmpv6_send); + +#if IS_ENABLED(CONFIG_NF_NAT) +#include <net/netfilter/nf_conntrack.h> +void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info) +{ + struct sk_buff *cloned_skb = NULL; + enum ip_conntrack_info ctinfo; + struct in6_addr orig_ip; + struct nf_conn *ct; + + ct = nf_ct_get(skb_in, &ctinfo); + if (!ct || !(ct->status & IPS_SRC_NAT)) { + icmpv6_send(skb_in, type, code, info); + return; + } + + if (skb_shared(skb_in)) + skb_in = cloned_skb = skb_clone(skb_in, GFP_ATOMIC); + + if (unlikely(!skb_in || skb_network_header(skb_in) < skb_in->head || + (skb_network_header(skb_in) + sizeof(struct ipv6hdr)) > + skb_tail_pointer(skb_in) || skb_ensure_writable(skb_in, + skb_network_offset(skb_in) + sizeof(struct ipv6hdr)))) + goto out; + + orig_ip = ipv6_hdr(skb_in)->saddr; + ipv6_hdr(skb_in)->saddr = ct->tuplehash[0].tuple.src.u3.in6; + icmpv6_send(skb_in, type, code, info); + ipv6_hdr(skb_in)->saddr = orig_ip; +out: + consume_skb(cloned_skb); +} +EXPORT_SYMBOL(icmpv6_ndo_send); +#endif #endif
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.178 commit e1820f4376528503812dc078de7651f17e84f43b
--------------------------------
commit a8e41f6033a0c5633d55d6e35993c9e2005d872f upstream.
The icmpv6_send function has long had a static inline implementation with an empty body for CONFIG_IPV6=n, so that code calling it doesn't need to be ifdef'd. The new icmpv6_ndo_send function, which is intended for drivers as a drop-in replacement with an identical function signature, should follow the same pattern. Without this patch, drivers that used to build with CONFIG_IPV6=n now fail with a linker error.
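The header idiom being followed, as a generic sketch (CONFIG_FOO and foo_notify are placeholders): an empty static inline under the disabled config keeps callers free of #ifdefs and leaves nothing for the linker to resolve.

#if IS_ENABLED(CONFIG_FOO)
void foo_notify(struct sk_buff *skb, u8 type, u8 code, __u32 info);
#else
static inline void foo_notify(struct sk_buff *skb, u8 type, u8 code,
			      __u32 info)
{
	/* no-op when CONFIG_FOO=n; call sites compile away */
}
#endif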
Cc: Chen Zhou chenzhou10@huawei.com Reported-by: Hulk Robot hulkci@huawei.com Fixes: 0b41713b6066 ("icmp: introduce helper for nat'd source address in network device context") Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/icmpv6.h | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h index adb981ab7de9..024b7a4cd98e 100644 --- a/include/linux/icmpv6.h +++ b/include/linux/icmpv6.h @@ -22,19 +22,23 @@ extern int inet6_unregister_icmp_sender(ip6_icmp_send_t *fn); int ip6_err_gen_icmpv6_unreach(struct sk_buff *skb, int nhs, int type, unsigned int data_len);
+#if IS_ENABLED(CONFIG_NF_NAT) +void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info); +#else +#define icmpv6_ndo_send icmpv6_send +#endif + #else
static inline void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) { - } -#endif
-#if IS_ENABLED(CONFIG_NF_NAT) -void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info); -#else -#define icmpv6_ndo_send icmpv6_send +static inline void icmpv6_ndo_send(struct sk_buff *skb, + u8 type, u8 code, __u32 info) +{ +} #endif
extern int icmpv6_init(void);
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.178 commit 3d6fad4686ec1418a36fb5295163652ffe8266fc
--------------------------------
commit 45942ba890e6f35232727a5fa33d732681f4eb9f upstream.
Because xfrmi is calling icmp from network device context, it should use the ndo helper so that the rate limiting applies correctly.
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Cc: Nicolas Dichtel nicolas.dichtel@6wind.com Cc: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/xfrm/xfrm_interface.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 0079e5922067..2e3fc48ced42 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -300,10 +300,10 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) if (mtu < IPV6_MIN_MTU) mtu = IPV6_MIN_MTU;
- icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu); + icmpv6_ndo_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu); } else { - icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, - htonl(mtu)); + icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, + htonl(mtu)); }
dst_release(dst);
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.178 commit 00d3dc031d9a1aed4a56d4a83a3c31a5d42a0e4b
--------------------------------
commit cc7a21b6fbd945f8d8f61422ccd27203c1fafeb7 upstream.
If IPv6 is builtin, we do not need an expensive indirect call to reach icmp6_send().
v2: put inline keyword before the type to avoid sparse warnings.
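Sketch of the IS_BUILTIN() dispatch idiom applied here (CONFIG_FOO and foo_impl are placeholder names): bind the call at compile time when the provider is builtin, and keep the indirect, RCU-protected call only for the modular case.

#if IS_BUILTIN(CONFIG_FOO)
/* Provider compiled in: a direct, inlinable call, no pointer chase. */
static inline void foo_notify(int arg)
{
	foo_impl(arg);
}
#else
/* Provider may be a module: resolved at runtime through a function
 * pointer registered by the module and read under rcu_read_lock(). */
void foo_notify(int arg);
#endif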
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/icmpv6.h | 22 +++++++++++++++++++++- net/ipv6/icmp.c | 5 +++-- net/ipv6/ip6_icmp.c | 10 +++++----- 3 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h index 024b7a4cd98e..f54f2c07c319 100644 --- a/include/linux/icmpv6.h +++ b/include/linux/icmpv6.h @@ -13,12 +13,32 @@ static inline struct icmp6hdr *icmp6_hdr(const struct sk_buff *skb) #include <linux/netdevice.h>
#if IS_ENABLED(CONFIG_IPV6) -extern void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info);
typedef void ip6_icmp_send_t(struct sk_buff *skb, u8 type, u8 code, __u32 info, const struct in6_addr *force_saddr); +#if IS_BUILTIN(CONFIG_IPV6) +void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, + const struct in6_addr *force_saddr); +static inline void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) +{ + icmp6_send(skb, type, code, info, NULL); +} +static inline int inet6_register_icmp_sender(ip6_icmp_send_t *fn) +{ + BUILD_BUG_ON(fn != icmp6_send); + return 0; +} +static inline int inet6_unregister_icmp_sender(ip6_icmp_send_t *fn) +{ + BUILD_BUG_ON(fn != icmp6_send); + return 0; +} +#else +extern void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info); extern int inet6_register_icmp_sender(ip6_icmp_send_t *fn); extern int inet6_unregister_icmp_sender(ip6_icmp_send_t *fn); +#endif + int ip6_err_gen_icmpv6_unreach(struct sk_buff *skb, int nhs, int type, unsigned int data_len);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index 6d14cbe443f8..da0637b7e456 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -418,8 +418,8 @@ static int icmp6_iif(const struct sk_buff *skb) /* * Send an ICMP message in response to a packet in error */ -static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, - const struct in6_addr *force_saddr) +void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, + const struct in6_addr *force_saddr) { struct inet6_dev *idev = NULL; struct ipv6hdr *hdr = ipv6_hdr(skb); @@ -592,6 +592,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, out_bh_enable: local_bh_enable(); } +EXPORT_SYMBOL(icmp6_send);
/* Slightly more convenient version of icmp6_send. */ diff --git a/net/ipv6/ip6_icmp.c b/net/ipv6/ip6_icmp.c index e0086758b6ee..70c8c2f36c98 100644 --- a/net/ipv6/ip6_icmp.c +++ b/net/ipv6/ip6_icmp.c @@ -9,6 +9,8 @@
#if IS_ENABLED(CONFIG_IPV6)
+#if !IS_BUILTIN(CONFIG_IPV6) + static ip6_icmp_send_t __rcu *ip6_icmp_send;
int inet6_register_icmp_sender(ip6_icmp_send_t *fn) @@ -37,14 +39,12 @@ void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
rcu_read_lock(); send = rcu_dereference(ip6_icmp_send); - - if (!send) - goto out; - send(skb, type, code, info, NULL); -out: + if (send) + send(skb, type, code, info, NULL); rcu_read_unlock(); } EXPORT_SYMBOL(icmpv6_send); +#endif
#if IS_ENABLED(CONFIG_NF_NAT) #include <net/netfilter/nf_conntrack.h>
From: Leon Romanovsky leonro@nvidia.com
stable inclusion from linux-4.19.178 commit 480c09809f7c67b148a102de5f8cdd5fcab04685
--------------------------------
commit 1faba27f11c8da244e793546a1b35a9b1da8208e upstream.
The W=1 compilation of allmodconfig generates the following warning:
net/ipv6/icmp.c:448:6: warning: no previous prototype for 'icmp6_send' [-Wmissing-prototypes]
  448 | void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
      |      ^~~~~~~~~~
Fix it by providing a function declaration for builds with IPv6 as a module.
Signed-off-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/icmpv6.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h index f54f2c07c319..74fc27926bd5 100644 --- a/include/linux/icmpv6.h +++ b/include/linux/icmpv6.h @@ -16,9 +16,9 @@ static inline struct icmp6hdr *icmp6_hdr(const struct sk_buff *skb)
typedef void ip6_icmp_send_t(struct sk_buff *skb, u8 type, u8 code, __u32 info, const struct in6_addr *force_saddr); -#if IS_BUILTIN(CONFIG_IPV6) void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, const struct in6_addr *force_saddr); +#if IS_BUILTIN(CONFIG_IPV6) static inline void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) { icmp6_send(skb, type, code, info, NULL);
From: "Jason A. Donenfeld" Jason@zx2c4.com
stable inclusion from linux-4.19.178 commit 9efa1af186114b17a8e797e89c005876b5eceaac
--------------------------------
commit ee576c47db60432c37e54b1e2b43a8ca6d3a8dca upstream.
The icmp{,v6}_send functions make all sorts of use of skb->cb, casting it with IPCB or IP6CB, assuming the skb to have come directly from the inet layer. But when the packet comes from the ndo layer, especially when forwarded, there's no telling what might be in skb->cb at that point. As a result, the icmp sending code risks reading bogus memory contents, which can result in nasty stack overflows such as this one reported by a user:
panic+0x108/0x2ea
__stack_chk_fail+0x14/0x20
__icmp_send+0x5bd/0x5c0
icmp_ndo_send+0x148/0x160
In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read from it. The optlen parameter there is of particular note, as it can induce writes beyond bounds. There are quite a few ways that can happen in __ip_options_echo. For example:
    // sptr/skb are attacker-controlled skb bytes
    sptr = skb_network_header(skb);

    // dptr/dopt points to stack memory allocated by __icmp_send
    dptr = dopt->__data;

    // sopt is the corrupt skb->cb in question
    if (sopt->rr) {
        optlen  = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
        soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
        // this now writes potentially attacker-controlled data, over
        // flowing the stack:
        memcpy(dptr, sptr+sopt->rr, optlen);
    }
In the icmpv6_send case, the story is similar, but not as dire, as only IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is worse than the iif case, but it is passed to ipv6_find_tlv, which does a bit of bounds checking on the value.
This is easy to simulate by doing a `memset(skb->cb, 0x41, sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by good fortune and the rarity of icmp sending from that context that we've avoided reports like this until now. For example, in KASAN:
BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
Write of size 38 at addr ffff888006f1f80e by task ping/89
CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
Call Trace:
 dump_stack+0x9a/0xcc
 print_address_description.constprop.0+0x1a/0x160
 __kasan_report.cold+0x20/0x38
 kasan_report+0x32/0x40
 check_memory_region+0x145/0x1a0
 memcpy+0x39/0x60
 __ip_options_echo+0xa0e/0x12b0
 __icmp_send+0x744/0x1700
Actually, out of the 4 drivers that do this, only gtp zeroed the cb for the v4 case, while the rest did not. So this commit actually removes the gtp-specific zeroing, while putting the code where it belongs in the shared infrastructure of icmp{,v6}_ndo_send.
This commit fixes the issue by passing an empty IPCB or IP6CB along to the functions that actually do the work. For icmp_send, this was already trivial, thanks to __icmp_send providing the plumbing function. For icmpv6_send, this required a tiny bit of refactoring to make it behave like the v4 case, after which it was straightforward.
Fixes: a2b78e9b2cac ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs") Reported-by: SinYu liuxyon@gmail.com Reviewed-by: Willem de Bruijn willemb@google.com Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA... Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/gtp.c | 1 - include/linux/icmpv6.h | 26 ++++++++++++++++++++------ include/linux/ipv6.h | 2 +- include/net/icmp.h | 6 +++++- net/ipv4/icmp.c | 5 +++-- net/ipv6/icmp.c | 16 ++++++++-------- net/ipv6/ip6_icmp.c | 12 +++++++----- 7 files changed, 44 insertions(+), 24 deletions(-)
diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index 5804ac5b8f52..01ff214048ef 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -549,7 +549,6 @@ static int gtp_build_skb_ip4(struct sk_buff *skb, struct net_device *dev, if (!skb_is_gso(skb) && (iph->frag_off & htons(IP_DF)) && mtu < ntohs(iph->tot_len)) { netdev_dbg(dev, "packet too big, fragmentation needed\n"); - memset(IPCB(skb), 0, sizeof(*IPCB(skb))); icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); goto err_rt; diff --git a/include/linux/icmpv6.h b/include/linux/icmpv6.h index 74fc27926bd5..0be0d68fbb00 100644 --- a/include/linux/icmpv6.h +++ b/include/linux/icmpv6.h @@ -3,6 +3,7 @@ #define _LINUX_ICMPV6_H
#include <linux/skbuff.h> +#include <linux/ipv6.h> #include <uapi/linux/icmpv6.h>
static inline struct icmp6hdr *icmp6_hdr(const struct sk_buff *skb) @@ -15,13 +16,16 @@ static inline struct icmp6hdr *icmp6_hdr(const struct sk_buff *skb) #if IS_ENABLED(CONFIG_IPV6)
typedef void ip6_icmp_send_t(struct sk_buff *skb, u8 type, u8 code, __u32 info, - const struct in6_addr *force_saddr); + const struct in6_addr *force_saddr, + const struct inet6_skb_parm *parm); void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, - const struct in6_addr *force_saddr); + const struct in6_addr *force_saddr, + const struct inet6_skb_parm *parm); #if IS_BUILTIN(CONFIG_IPV6) -static inline void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) +static inline void __icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, + const struct inet6_skb_parm *parm) { - icmp6_send(skb, type, code, info, NULL); + icmp6_send(skb, type, code, info, NULL, parm); } static inline int inet6_register_icmp_sender(ip6_icmp_send_t *fn) { @@ -34,18 +38,28 @@ static inline int inet6_unregister_icmp_sender(ip6_icmp_send_t *fn) return 0; } #else -extern void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info); +extern void __icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, + const struct inet6_skb_parm *parm); extern int inet6_register_icmp_sender(ip6_icmp_send_t *fn); extern int inet6_unregister_icmp_sender(ip6_icmp_send_t *fn); #endif
+static inline void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) +{ + __icmpv6_send(skb, type, code, info, IP6CB(skb)); +} + int ip6_err_gen_icmpv6_unreach(struct sk_buff *skb, int nhs, int type, unsigned int data_len);
#if IS_ENABLED(CONFIG_NF_NAT) void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info); #else -#define icmpv6_ndo_send icmpv6_send +static inline void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info) +{ + struct inet6_skb_parm parm = { 0 }; + __icmpv6_send(skb_in, type, code, info, &parm); +} #endif
#else diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 90eaabe33331..c0589ef1c850 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -4,6 +4,7 @@
#include <linux/kabi.h> #include <uapi/linux/ipv6.h> +#include <uapi/linux/icmpv6.h>
#define ipv6_optlen(p) (((p)->hdrlen+1) << 3) #define ipv6_authlen(p) (((p)->hdrlen+2) << 2) @@ -101,7 +102,6 @@ struct ipv6_params { __s32 autoconf; }; extern struct ipv6_params ipv6_defaults; -#include <linux/icmpv6.h> #include <linux/tcp.h> #include <linux/udp.h>
diff --git a/include/net/icmp.h b/include/net/icmp.h index 9c344e2655d2..ffe4a5d2bbe7 100644 --- a/include/net/icmp.h +++ b/include/net/icmp.h @@ -50,7 +50,11 @@ static inline void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 #if IS_ENABLED(CONFIG_NF_NAT) void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info); #else -#define icmp_ndo_send icmp_send +static inline void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info) +{ + struct ip_options opts = { 0 }; + __icmp_send(skb_in, type, code, info, &opts); +} #endif
int icmp_rcv(struct sk_buff *skb); diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 07d62cc5873a..609a73fe6fbb 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -762,13 +762,14 @@ EXPORT_SYMBOL(__icmp_send); void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info) { struct sk_buff *cloned_skb = NULL; + struct ip_options opts = { 0 }; enum ip_conntrack_info ctinfo; struct nf_conn *ct; __be32 orig_ip;
ct = nf_ct_get(skb_in, &ctinfo); if (!ct || !(ct->status & IPS_SRC_NAT)) { - icmp_send(skb_in, type, code, info); + __icmp_send(skb_in, type, code, info, &opts); return; }
@@ -783,7 +784,7 @@ void icmp_ndo_send(struct sk_buff *skb_in, int type, int code, __be32 info)
orig_ip = ip_hdr(skb_in)->saddr; ip_hdr(skb_in)->saddr = ct->tuplehash[0].tuple.src.u3.ip; - icmp_send(skb_in, type, code, info); + __icmp_send(skb_in, type, code, info, &opts); ip_hdr(skb_in)->saddr = orig_ip; out: consume_skb(cloned_skb); diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index da0637b7e456..fbc8746371b6 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -309,10 +309,9 @@ static int icmpv6_getfrag(void *from, char *to, int offset, int len, int odd, st }
#if IS_ENABLED(CONFIG_IPV6_MIP6) -static void mip6_addr_swap(struct sk_buff *skb) +static void mip6_addr_swap(struct sk_buff *skb, const struct inet6_skb_parm *opt) { struct ipv6hdr *iph = ipv6_hdr(skb); - struct inet6_skb_parm *opt = IP6CB(skb); struct ipv6_destopt_hao *hao; struct in6_addr tmp; int off; @@ -329,7 +328,7 @@ static void mip6_addr_swap(struct sk_buff *skb) } } #else -static inline void mip6_addr_swap(struct sk_buff *skb) {} +static inline void mip6_addr_swap(struct sk_buff *skb, const struct inet6_skb_parm *opt) {} #endif
static struct dst_entry *icmpv6_route_lookup(struct net *net, @@ -419,7 +418,8 @@ static int icmp6_iif(const struct sk_buff *skb) * Send an ICMP message in response to a packet in error */ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, - const struct in6_addr *force_saddr) + const struct in6_addr *force_saddr, + const struct inet6_skb_parm *parm) { struct inet6_dev *idev = NULL; struct ipv6hdr *hdr = ipv6_hdr(skb); @@ -512,7 +512,7 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, if (!(skb->dev->flags&IFF_LOOPBACK) && !icmpv6_global_allow(type)) goto out_bh_enable;
- mip6_addr_swap(skb); + mip6_addr_swap(skb, parm);
memset(&fl6, 0, sizeof(fl6)); fl6.flowi6_proto = IPPROTO_ICMPV6; @@ -598,7 +598,7 @@ EXPORT_SYMBOL(icmp6_send); */ void icmpv6_param_prob(struct sk_buff *skb, u8 code, int pos) { - icmp6_send(skb, ICMPV6_PARAMPROB, code, pos, NULL); + icmp6_send(skb, ICMPV6_PARAMPROB, code, pos, NULL, IP6CB(skb)); kfree_skb(skb); }
@@ -655,10 +655,10 @@ int ip6_err_gen_icmpv6_unreach(struct sk_buff *skb, int nhs, int type, } if (type == ICMP_TIME_EXCEEDED) icmp6_send(skb2, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, - info, &temp_saddr); + info, &temp_saddr, IP6CB(skb2)); else icmp6_send(skb2, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, - info, &temp_saddr); + info, &temp_saddr, IP6CB(skb2)); if (rt) ip6_rt_put(rt);
diff --git a/net/ipv6/ip6_icmp.c b/net/ipv6/ip6_icmp.c index 70c8c2f36c98..9e3574880cb0 100644 --- a/net/ipv6/ip6_icmp.c +++ b/net/ipv6/ip6_icmp.c @@ -33,23 +33,25 @@ int inet6_unregister_icmp_sender(ip6_icmp_send_t *fn) } EXPORT_SYMBOL(inet6_unregister_icmp_sender);
-void icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info) +void __icmpv6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, + const struct inet6_skb_parm *parm) { ip6_icmp_send_t *send;
rcu_read_lock(); send = rcu_dereference(ip6_icmp_send); if (send) - send(skb, type, code, info, NULL); + send(skb, type, code, info, NULL, parm); rcu_read_unlock(); } -EXPORT_SYMBOL(icmpv6_send); +EXPORT_SYMBOL(__icmpv6_send); #endif
#if IS_ENABLED(CONFIG_NF_NAT) #include <net/netfilter/nf_conntrack.h> void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info) { + struct inet6_skb_parm parm = { 0 }; struct sk_buff *cloned_skb = NULL; enum ip_conntrack_info ctinfo; struct in6_addr orig_ip; @@ -57,7 +59,7 @@ void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info)
ct = nf_ct_get(skb_in, &ctinfo); if (!ct || !(ct->status & IPS_SRC_NAT)) { - icmpv6_send(skb_in, type, code, info); + __icmpv6_send(skb_in, type, code, info, &parm); return; }
@@ -72,7 +74,7 @@ void icmpv6_ndo_send(struct sk_buff *skb_in, u8 type, u8 code, __u32 info)
orig_ip = ipv6_hdr(skb_in)->saddr; ipv6_hdr(skb_in)->saddr = ct->tuplehash[0].tuple.src.u3.in6; - icmpv6_send(skb_in, type, code, info); + __icmpv6_send(skb_in, type, code, info, &parm); ipv6_hdr(skb_in)->saddr = orig_ip; out: consume_skb(cloned_skb);
From: Mike Kravetz mike.kravetz@oracle.com
stable inclusion from linux-4.19.179 commit 08831f662b88f7117be51c5e55bd1f120087f90c
--------------------------------
commit dbfee5aee7e54f83d96ceb8e3e80717fac62ad63 upstream.
page structs are not guaranteed to be contiguous for gigantic pages. The routine update_and_free_page can encounter a gigantic page, yet it assumes page structs are contiguous when setting page flags in subpages.
If update_and_free_page encounters non-contiguous page structs, we can see “BUG: Bad page state in process …” errors.
Non-contiguous page structs are generally not an issue. However, they can exist with a specific kernel configuration and hotplug operations. For example: Configure the kernel with CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP. Then, hotplug add memory for the area where the gigantic page will be allocated. Zi Yan outlined steps to reproduce here [1].
[1] https://lore.kernel.org/linux-mm/16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia...
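For context, a paraphrase of the helper the loop switches to (based on mm/internal.h of this kernel generation; treat the details as approximate): it re-derives the struct page from the pfn whenever the offset crosses a MAX_ORDER block, instead of assuming iter + 1 is valid.

/* Approximate shape of mem_map_next(): struct pages may only be
 * advanced by pointer arithmetic within a MAX_ORDER block; across a
 * boundary the page is re-looked-up via its pfn, because with
 * SPARSEMEM && !SPARSEMEM_VMEMMAP the memmap is not one flat array.
 */
static inline struct page *mem_map_next(struct page *iter,
					struct page *base, int offset)
{
	if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
		unsigned long pfn = page_to_pfn(base) + offset;

		if (!pfn_valid(pfn))
			return NULL;
		return pfn_to_page(pfn);
	}
	return iter + 1;
}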
Link: https://lkml.kernel.org/r/20210217184926.33567-1-mike.kravetz@oracle.com Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") Signed-off-by: Zi Yan ziy@nvidia.com Signed-off-by: Mike Kravetz mike.kravetz@oracle.com Cc: Zi Yan ziy@nvidia.com Cc: Davidlohr Bueso dbueso@suse.de Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Matthew Wilcox willy@infradead.org Cc: Oscar Salvador osalvador@suse.de Cc: Joao Martins joao.m.martins@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Mike Kravetz mike.kravetz@oracle.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/hugetlb.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 824252c70366..ab4ebe574b14 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1189,14 +1189,16 @@ static inline void destroy_compound_gigantic_page(struct page *page, static void update_and_free_page(struct hstate *h, struct page *page) { int i; + struct page *subpage = page;
if (hstate_is_gigantic(h) && !gigantic_page_supported()) return;
h->nr_huge_pages--; h->nr_huge_pages_node[page_to_nid(page)]--; - for (i = 0; i < pages_per_huge_page(h); i++) { - page[i].flags &= ~(1 << PG_locked | 1 << PG_error | + for (i = 0; i < pages_per_huge_page(h); + i++, subpage = mem_map_next(subpage, page, i)) { + subpage->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced | 1 << PG_dirty | 1 << PG_active | 1 << PG_private | 1 << PG_writeback);
From: Shaoying Xu shaoyi@amazon.com
stable inclusion from linux-4.19.179 commit 79e73552f4fea6feedd5d05e6d882095b4925ba3
--------------------------------
commit f5c6d0fcf90ce07ee0d686d465b19b247ebd5ed7 upstream.
These plt* and .text.ftrace_trampoline sections specified for arm64 have non-zero addresses. Non-zero section addresses in a relocatable ELF would confuse GDB when it tries to compute the section offsets and it ends up printing wrong symbol addresses. Therefore, set them to zero, which mirrors the change in commit 5d8591bc0fba ("module: set ksymtab/kcrctab* section addresses to 0x0").
Reported-by: Frank van der Linden fllinden@amazon.com Signed-off-by: Shaoying Xu shaoyi@amazon.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210216183234.GA23876@amazon.com Signed-off-by: Will Deacon will@kernel.org [shaoyi@amazon.com: made same changes in arch/arm64/kernel/module.lds for 5.4] Signed-off-by: Shaoying Xu shaoyi@amazon.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/module.lds | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kernel/module.lds b/arch/arm64/kernel/module.lds index 22e36a21c113..09a0eef71d12 100644 --- a/arch/arm64/kernel/module.lds +++ b/arch/arm64/kernel/module.lds @@ -1,5 +1,5 @@ SECTIONS { - .plt (NOLOAD) : { BYTE(0) } - .init.plt (NOLOAD) : { BYTE(0) } - .text.ftrace_trampoline (NOLOAD) : { BYTE(0) } + .plt 0 (NOLOAD) : { BYTE(0) } + .init.plt 0 (NOLOAD) : { BYTE(0) } + .text.ftrace_trampoline 0 (NOLOAD) : { BYTE(0) } }
From: Yumei Huang yuhuang@redhat.com
stable inclusion from linux-4.19.179 commit f62847c98f6d30a0ac751359dacd02cdb8e0df4a
--------------------------------
commit 88a9e03beef22cc5fabea344f54b9a0dfe63de08 upstream.
An assert failure is triggered by a syzkaller test because ATTR_KILL_PRIV is not cleared before xfs_setattr_size. As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size, just remove it from the assert.
Signed-off-by: Yumei Huang yuhuang@redhat.com Reviewed-by: Brian Foster bfoster@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 870e7b77b11c..0ac63cafb32a 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -860,7 +860,7 @@ xfs_setattr_size( ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL)); ASSERT(S_ISREG(inode->i_mode)); ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET| - ATTR_MTIME_SET|ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0); + ATTR_MTIME_SET|ATTR_TIMES_SET)) == 0);
oldsize = inode->i_size; newsize = iattr->ia_size;
From: Marco Elver elver@google.com
stable inclusion from linux-4.19.179 commit 5d55a6a46a7f70bc1d270c50200edd3096cb380c
--------------------------------
commit 097b9146c0e26aabaa6ff3e5ea536a53f5254a79 upstream.
Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when cloning an skb, save and restore truesize after pskb_expand_head(). This can occur if the allocator decides to service an allocation of the same size differently (e.g. use a different size class, or pass the allocation on to KFENCE).
Because truesize is used for bookkeeping (such as sk_wmem_queued), a modified truesize of a cloned skb may result in corrupt bookkeeping and relevant warnings (such as in sk_stream_kill_queues()).
Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com Suggested-by: Eric Dumazet edumazet@google.com Signed-off-by: Marco Elver elver@google.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/skbuff.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 4a9ab2596e78..ea9684bcc2e8 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3092,7 +3092,19 @@ EXPORT_SYMBOL(skb_split); */ static int skb_prepare_for_shift(struct sk_buff *skb) { - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC); + int ret = 0; + + if (skb_cloned(skb)) { + /* Save and restore truesize: pskb_expand_head() may reallocate + * memory where ksize(kmalloc(S)) != ksize(kmalloc(S)), but we + * cannot change truesize at this point. + */ + unsigned int save_truesize = skb->truesize; + + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC); + skb->truesize = save_truesize; + } + return ret; }
/**
From: Li Xinhai lixinhai.lxh@gmail.com
stable inclusion from linux-4.19.179 commit 66a013879cdb304c214e94a49d2129363f29e174
--------------------------------
commit a1ba9da8f0f9a37d900ff7eff66482cf7de8015e upstream.
The current code would unnecessarily expand the address range. Consider one example: with (start, end) = (1G-2M, 3G+2M) and (vm_start, vm_end) = (1G-4M, 3G+4M), the expected adjustment is to keep (1G-2M, 3G+2M) without expanding it, but the current result is (1G-4M, 3G+4M). In fact, the ranges (1G-4M, 1G) and (3G, 3G+4M) would never be involved in pmd sharing.

After this patch, we check that the vma spans at least one PUD-aligned size and that the (start, end) range overlaps the aligned range of the vma.

With the above example, the aligned vma range is (1G, 3G), so if the (start, end) range lies within (1G-4M, 1G) or within (3G, 3G+4M), neither start nor end is adjusted. Otherwise, start may be adjusted downwards or end upwards without exceeding (vm_start, vm_end), as the sketch below verifies.
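A hedged sanity check of the example (plain userspace arithmetic, PUD_SIZE taken as 1G):

#include <assert.h>

#define PUD        0x40000000ULL	/* 1G, for illustration */
#define UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))
#define DOWN(x, a) ((x) & ~((a) - 1))

int main(void)
{
	unsigned long long vm_start = PUD - 0x400000;	/* 1G-4M */
	unsigned long long vm_end = 3 * PUD + 0x400000;	/* 3G+4M */
	unsigned long long start = PUD - 0x200000;	/* 1G-2M */
	unsigned long long end = 3 * PUD + 0x200000;	/* 3G+2M */
	unsigned long long v_start = UP(vm_start, PUD);	/* 1G */
	unsigned long long v_end = DOWN(vm_end, PUD);	/* 3G */

	/* New logic: only endpoints strictly inside the aligned span move. */
	if (start > v_start)
		start = DOWN(start, PUD);
	if (end < v_end)
		end = UP(end, PUD);

	/* (1G-2M, 3G+2M) is kept as-is; the old code gave (1G-4M, 3G+4M). */
	assert(start == PUD - 0x200000 && end == 3 * PUD + 0x200000);
	return 0;
}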
Mike:
: The 'adjusted range' is used for calls to mmu notifiers and cache(tlb)
: flushing. Since the current code unnecessarily expands the range in some
: cases, more entries than necessary would be flushed. This would/could
: result in performance degradation. However, this is highly dependent on
: the user runtime. Is there a combination of vma layout and calls to
: actually hit this issue? If the issue is hit, will those entries
: unnecessarily flushed be used again and need to be unnecessarily reloaded?
Link: https://lkml.kernel.org/r/20210104081631.2921415-1-lixinhai.lxh@gmail.com Fixes: 75802ca66354 ("mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible") Signed-off-by: Li Xinhai lixinhai.lxh@gmail.com Suggested-by: Mike Kravetz mike.kravetz@oracle.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: Peter Xu peterx@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/hugetlb.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ab4ebe574b14..510aa3b2806c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4875,21 +4875,23 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end) { - unsigned long a_start, a_end; + unsigned long v_start = ALIGN(vma->vm_start, PUD_SIZE), + v_end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
- if (!(vma->vm_flags & VM_MAYSHARE)) + /* + * vma need span at least one aligned PUD size and the start,end range + * must at least partialy within it. + */ + if (!(vma->vm_flags & VM_MAYSHARE) || !(v_end > v_start) || + (*end <= v_start) || (*start >= v_end)) return;
/* Extend the range to be PUD aligned for a worst case scenario */ - a_start = ALIGN_DOWN(*start, PUD_SIZE); - a_end = ALIGN(*end, PUD_SIZE); + if (*start > v_start) + *start = ALIGN_DOWN(*start, PUD_SIZE);
- /* - * Intersect the range with the vma range, since pmd sharing won't be - * across vma after all - */ - *start = max(vma->vm_start, a_start); - *end = min(vma->vm_end, a_end); + if (*end < v_end) + *end = ALIGN(*end, PUD_SIZE); }
/*
From: Jens Axboe axboe@kernel.dk
stable inclusion from linux-4.19.179 commit 6ceea9a764b422e32d9028e35d37ad68761ffd2c
--------------------------------
commit caf6912f3f4af7232340d500a4a2008f81b93f14 upstream.
We're not factoring in the start of the file for where to write and read the swapfile, which leads to very unfortunate side effects of writing where we should not be...
[This issue only affects swapfiles on filesystems on top of blockdevs that implement rw_page ops (brd, zram, btt, pmem), and not on top of any other block devices, in contrast to the upstream commit fix.]
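A toy model of the offset bug (made-up numbers, userspace C): bdev_{read,write}_page() expect a sector relative to the block device, while the removed swap_page_sector() shifted the page index within the swapfile, ignoring where the file's first extent sits on the device.

#include <stdio.h>

int main(void)
{
	unsigned long page_to_sect = 12 - 9;	/* PAGE_SHIFT - 9 */
	unsigned long file_page = 3;		/* page index inside swapfile */
	unsigned long extent_start = 2048;	/* extent start on the device,
						   in pages (8 MiB here) */

	unsigned long buggy = file_page << page_to_sect;
	unsigned long fixed = (extent_start + file_page) << page_to_sect;

	/* buggy = 24: the I/O lands near the device start, not the file. */
	printf("buggy sector %lu, fixed sector %lu\n", buggy, fixed);
	return 0;
}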
Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Anthony Iliopoulos ailiop@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/page_io.c | 11 +++-------- mm/swapfile.c | 2 +- 2 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c index 97aaf9bb6a9e..17cff71a9304 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -38,7 +38,6 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
bio->bi_iter.bi_sector = map_swap_page(page, &bdev); bio_set_dev(bio, bdev); - bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9; bio->bi_end_io = end_io;
for (i = 0; i < nr; i++) @@ -262,11 +261,6 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return ret; }
-static sector_t swap_page_sector(struct page *page) -{ - return (sector_t)__page_file_index(page) << (PAGE_SHIFT - 9); -} - static inline void count_swpout_vm_event(struct page *page) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -325,7 +319,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, return ret; }
- ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc); + ret = bdev_write_page(sis->bdev, map_swap_page(page, &sis->bdev), + page, wbc); if (!ret) { count_swpout_vm_event(page); return 0; @@ -376,7 +371,7 @@ int swap_readpage(struct page *page, bool synchronous) return ret; }
- ret = bdev_read_page(sis->bdev, swap_page_sector(page), page); + ret = bdev_read_page(sis->bdev, map_swap_page(page, &sis->bdev), page); if (!ret) { if (trylock_page(page)) { swap_slot_free_notify(page); diff --git a/mm/swapfile.c b/mm/swapfile.c index c54b0afd8c87..4028994a51ae 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2381,7 +2381,7 @@ sector_t map_swap_page(struct page *page, struct block_device **bdev) { swp_entry_t entry; entry.val = page_private(page); - return map_swap_entry(entry, bdev); + return map_swap_entry(entry, bdev) << (PAGE_SHIFT - 9); }
/*
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.180 commit 85295f08866a6a26d013ae46699104b80699fb4e
--------------------------------
commit a14e5ec66a7a66e57b24e2469f9212a78460207e upstream.
dm_bufio_get_device_size returns the device size in blocks. Before returning the value, we must subtract the number of starting sectors. The number of starting sectors may not be divisible by the block size.
Note that currently, no target is using dm_bufio_set_sector_offset and dm_bufio_get_device_size simultaneously, so this change has no effect. However, an upcoming dm-verity-fec fix needs this change.
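A hedged worked example with made-up geometry:

#include <assert.h>

typedef unsigned long long sector_t;

/* Toy version of the fixed calculation: subtract the (possibly
 * unaligned) start offset before converting sectors to blocks. */
static sector_t usable_blocks(sector_t dev_sectors, sector_t start,
			      unsigned int block_bits)
{
	sector_t s = dev_sectors >= start ? dev_sectors - start : 0;

	return s >> block_bits;
}

int main(void)
{
	/* 40-sector device, client data starts at sector 7, 8-sector
	 * blocks: the old code reported 40 >> 3 = 5 blocks, but only
	 * 4 whole blocks fit after the start offset. */
	assert(usable_blocks(40, 7, 3) == 4);
	return 0;
}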
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Reviewed-by: Milan Broz gmazyland@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-bufio.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index b6e4ab67ae44..b3b799f84dcc 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1463,6 +1463,10 @@ EXPORT_SYMBOL_GPL(dm_bufio_get_block_size); sector_t dm_bufio_get_device_size(struct dm_bufio_client *c) { sector_t s = i_size_read(c->bdev->bd_inode) >> SECTOR_SHIFT; + if (s >= c->start) + s -= c->start; + else + s = 0; if (likely(c->sectors_per_block_bits >= 0)) s >>= c->sectors_per_block_bits; else
From: Jeffle Xu jefflexu@linux.alibaba.com
stable inclusion from linux-4.19.180 commit adbe8d9d3d45e02271f28f212375c67c30ced700
--------------------------------
commit a4c8dd9c2d0987cf542a2a0c42684c9c6d78a04e upstream.
According to the definition of dm_iterate_devices_fn:

 * This function must iterate through each section of device used by the
 * target until it encounters a non-zero return code, which it then returns.
 * Returns zero if no callout returned non-zero.
For some target types (e.g. dm-stripe), one call of iterate_devices() may iterate multiple underlying devices internally, in which case a non-zero return code returned by iterate_devices_callout_fn will stop the iteration in advance. No iterate_devices_callout_fn should return non-zero unless device iteration should stop.
Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and elevate it for reuse to stop iterating (and return non-zero) on the first device that causes iterate_devices_callout_fn to return non-zero. Use dm_table_any_dev_attr() to properly iterate through devices.
Rename device_is_nonrot() to device_is_rotational() and invert its logic accordingly to fix the improper handling.
[jeffle: backport notes] Also convert the no_sg_merge capability check, which is introduced by commit 200612ec33e5 ("dm table: propagate QUEUE_FLAG_NO_SG_MERGE"), and removed since commit 2705c93742e9 ("block: kill QUEUE_FLAG_NO_SG_MERGE") in v5.1.
Also convert the partial completion capability check, which is introduced by commit 22c11858e800 ("dm: introduce DM_TYPE_NVME_BIO_BASED"), and removed since commit 9c37de297f65 ("dm: remove special-casing of bio-based immutable singleton target on NVMe") in v5.10.
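A compact sketch of the iteration contract being fixed (toy types, plain C): an "any" iterator stops at the first non-zero callout, so "all devices have property P" must be phrased as "no device lacks P".

#include <stdbool.h>
#include <stddef.h>

typedef bool (*dev_pred)(int dev);

/* Early-stopping "any", mirroring how iterate_devices() propagates
 * the first non-zero callout return code. */
static bool any_dev(const int *devs, size_t n, dev_pred p)
{
	for (size_t i = 0; i < n; i++)
		if (p(devs[i]))
			return true;
	return false;
}

static bool is_rotational(int dev) { return dev & 1; }	/* toy predicate */

/* "All non-rotational" via De Morgan: no device may be rotational.
 * Asserting the positive property per device and demanding that all
 * callouts return non-zero does not compose with early stopping;
 * that mismatch is the bug class fixed here. */
static bool all_nonrot(const int *devs, size_t n)
{
	return !any_dev(devs, n, is_rotational);
}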
Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set") Fixes: 4693c9668fdc ("dm table: propagate non rotational flag") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu jefflexu@linux.alibaba.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-table.c | 115 ++++++++++++++++++++++-------------------- 1 file changed, 60 insertions(+), 55 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 418671ba057a..fb084fe21fa6 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -1398,6 +1398,46 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector) return &t->targets[(KEYS_PER_NODE * n) + k]; }
+/* + * type->iterate_devices() should be called when the sanity check needs to + * iterate and check all underlying data devices. iterate_devices() will + * iterate all underlying data devices until it encounters a non-zero return + * code, returned by whether the input iterate_devices_callout_fn, or + * iterate_devices() itself internally. + * + * For some target type (e.g. dm-stripe), one call of iterate_devices() may + * iterate multiple underlying devices internally, in which case a non-zero + * return code returned by iterate_devices_callout_fn will stop the iteration + * in advance. + * + * Cases requiring _any_ underlying device supporting some kind of attribute, + * should use the iteration structure like dm_table_any_dev_attr(), or call + * it directly. @func should handle semantics of positive examples, e.g. + * capable of something. + * + * Cases requiring _all_ underlying devices supporting some kind of attribute, + * should use the iteration structure like dm_table_supports_nowait() or + * dm_table_supports_discards(). Or introduce dm_table_all_devs_attr() that + * uses an @anti_func that handle semantics of counter examples, e.g. not + * capable of something. So: return !dm_table_any_dev_attr(t, anti_func); + */ +static bool dm_table_any_dev_attr(struct dm_table *t, + iterate_devices_callout_fn func) +{ + struct dm_target *ti; + unsigned int i; + + for (i = 0; i < dm_table_get_num_targets(t); i++) { + ti = dm_table_get_target(t, i); + + if (ti->type->iterate_devices && + ti->type->iterate_devices(ti, func, NULL)) + return true; + } + + return false; +} + static int count_device(struct dm_target *ti, struct dm_dev *dev, sector_t start, sector_t len, void *data) { @@ -1714,12 +1754,12 @@ static int dm_table_supports_dax_write_cache(struct dm_table *t) return false; }
-static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev, - sector_t start, sector_t len, void *data) +static int device_is_rotational(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) { struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && blk_queue_nonrot(q); + return q && !blk_queue_nonrot(q); }
static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev, @@ -1730,43 +1770,26 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev, return q && !blk_queue_add_random(q); }
-static int queue_supports_sg_merge(struct dm_target *ti, struct dm_dev *dev, - sector_t start, sector_t len, void *data) +static int queue_no_sg_merge(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) { struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && !test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags); -} - -static bool dm_table_all_devices_attribute(struct dm_table *t, - iterate_devices_callout_fn func) -{ - struct dm_target *ti; - unsigned i; - - for (i = 0; i < dm_table_get_num_targets(t); i++) { - ti = dm_table_get_target(t, i); - - if (!ti->type->iterate_devices || - !ti->type->iterate_devices(ti, func, NULL)) - return false; - } - - return true; + return q && test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags); }
-static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev, +static int device_is_partial_completion(struct dm_target *ti, struct dm_dev *dev, sector_t start, sector_t len, void *data) { char b[BDEVNAME_SIZE];
/* For now, NVMe devices are the only devices of this class */ - return (strncmp(bdevname(dev->bdev, b), "nvme", 4) == 0); + return (strncmp(bdevname(dev->bdev, b), "nvme", 4) != 0); }
static bool dm_table_does_not_support_partial_completion(struct dm_table *t) { - return dm_table_all_devices_attribute(t, device_no_partial_completion); + return !dm_table_any_dev_attr(t, device_is_partial_completion); }
static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, @@ -1893,27 +1916,6 @@ static int device_requires_stable_pages(struct dm_target *ti, return q && bdi_cap_stable_pages_required(q->backing_dev_info); }
-/* - * If any underlying device requires stable pages, a table must require - * them as well. Only targets that support iterate_devices are considered: - * don't want error, zero, etc to require stable pages. - */ -static bool dm_table_requires_stable_pages(struct dm_table *t) -{ - struct dm_target *ti; - unsigned i; - - for (i = 0; i < dm_table_get_num_targets(t); i++) { - ti = dm_table_get_target(t, i); - - if (ti->type->iterate_devices && - ti->type->iterate_devices(ti, device_requires_stable_pages, NULL)) - return true; - } - - return false; -} - void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, struct queue_limits *limits) { @@ -1954,28 +1956,31 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, dax_write_cache(t->md->dax_dev, true);
/* Ensure that all underlying devices are non-rotational. */ - if (dm_table_all_devices_attribute(t, device_is_nonrot)) - blk_queue_flag_set(QUEUE_FLAG_NONROT, q); - else + if (dm_table_any_dev_attr(t, device_is_rotational)) blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); + else + blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
if (!dm_table_supports_write_same(t)) q->limits.max_write_same_sectors = 0; if (!dm_table_supports_write_zeroes(t)) q->limits.max_write_zeroes_sectors = 0;
- if (dm_table_all_devices_attribute(t, queue_supports_sg_merge)) - blk_queue_flag_clear(QUEUE_FLAG_NO_SG_MERGE, q); - else + if (dm_table_any_dev_attr(t, queue_no_sg_merge)) blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q); + else + blk_queue_flag_clear(QUEUE_FLAG_NO_SG_MERGE, q);
dm_table_verify_integrity(t);
/* * Some devices don't use blk_integrity but still want stable pages * because they do their own checksumming. + * If any underlying device requires stable pages, a table must require + * them as well. Only targets that support iterate_devices are considered: + * don't want error, zero, etc to require stable pages. */ - if (dm_table_requires_stable_pages(t)) + if (dm_table_any_dev_attr(t, device_requires_stable_pages)) q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; else q->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; @@ -1986,7 +1991,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, * Clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not * have it set. */ - if (blk_queue_add_random(q) && dm_table_all_devices_attribute(t, device_is_not_random)) + if (blk_queue_add_random(q) && dm_table_any_dev_attr(t, device_is_not_random)) blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
/* Allow reads to exceed readahead limits */
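As an aside on the pattern introduced above: the any/all duality can be restated in plain user-space C (a minimal, hypothetical sketch with toy types, not kernel code). Checking that all devices have an attribute is the same as checking that no device lacks it, which is why one any-style iterator plus negated ("counter example") predicates suffices.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for dm devices and iterate_devices_callout_fn. */
struct toy_dev { bool dax_capable; };
typedef int (*callout_fn)(const struct toy_dev *d);

/* Mirrors dm_table_any_dev_attr(): stop at the first non-zero return. */
static bool any_dev_attr(const struct toy_dev *devs, int n, callout_fn fn)
{
	for (int i = 0; i < n; i++)
		if (fn(&devs[i]))
			return true;
	return false;
}

/* Counter-example predicate, in the spirit of device_not_dax_capable(). */
static int not_dax_capable(const struct toy_dev *d)
{
	return !d->dax_capable;
}

int main(void)
{
	struct toy_dev devs[] = { { true }, { false }, { true } };

	/* "all devices support DAX" == "no device lacks DAX" */
	bool all_dax = !any_dev_attr(devs, 3, not_dax_capable);
	printf("all DAX capable: %s\n", all_dax ? "yes" : "no");
	return 0;
}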
From: Jeffle Xu jefflexu@linux.alibaba.com
stable inclusion from linux-4.19.180 commit 16ee4c4957e8a92c2d03b8b028e11664d825c167
--------------------------------
commit 5b0fab508992c2e120971da658ce80027acbc405 upstream.
Fix dm_table_supports_dax() and invert the logic of its iterate_devices_callout_fn so that all devices' DAX capabilities are properly checked.
Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu jefflexu@linux.alibaba.com Signed-off-by: Mike Snitzer snitzer@redhat.com [jeffle: no dax synchronous] Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-table.c | 25 ++++--------------------- 1 file changed, 4 insertions(+), 21 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index fb084fe21fa6..921f79cd0360 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -891,10 +891,10 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type) } EXPORT_SYMBOL_GPL(dm_table_set_type);
-static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, +static int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev, sector_t start, sector_t len, void *data) { - return bdev_dax_supported(dev->bdev, PAGE_SIZE); + return !bdev_dax_supported(dev->bdev, PAGE_SIZE); }
static bool dm_table_supports_dax(struct dm_table *t) @@ -910,7 +910,7 @@ static bool dm_table_supports_dax(struct dm_table *t) return false;
if (!ti->type->iterate_devices || - !ti->type->iterate_devices(ti, device_supports_dax, NULL)) + ti->type->iterate_devices(ti, device_not_dax_capable, NULL)) return false; }
@@ -1737,23 +1737,6 @@ static int device_dax_write_cache_enabled(struct dm_target *ti, return false; }
-static int dm_table_supports_dax_write_cache(struct dm_table *t) -{ - struct dm_target *ti; - unsigned i; - - for (i = 0; i < dm_table_get_num_targets(t); i++) { - ti = dm_table_get_target(t, i); - - if (ti->type->iterate_devices && - ti->type->iterate_devices(ti, - device_dax_write_cache_enabled, NULL)) - return true; - } - - return false; -} - static int device_is_rotational(struct dm_target *ti, struct dm_dev *dev, sector_t start, sector_t len, void *data) { @@ -1952,7 +1935,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, else blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
- if (dm_table_supports_dax_write_cache(t)) + if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled)) dax_write_cache(t->md->dax_dev, true);
/* Ensure that all underlying devices are non-rotational. */
From: Jeffle Xu jefflexu@linux.alibaba.com
stable inclusion from linux-4.19.180 commit c6547dc1e3753382a58bc339a24ed4ef04b09dd3
--------------------------------
commit 24f6b6036c9eec21191646930ad42808e6180510 upstream.
Fix dm_table_supports_zoned_model() and invert the logic of both iterate_devices_callout_fn callbacks so that all devices' zoned capabilities are properly checked.
Add one more parameter to dm_table_any_dev_attr(), which is passed through as the @data parameter of the iterate_devices_callout_fn, so that dm_table_matches_zone_sectors() can be replaced by dm_table_any_dev_attr().
Fixes: dd88d313bef02 ("dm table: add zoned block devices validation") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu jefflexu@linux.alibaba.com Signed-off-by: Mike Snitzer snitzer@redhat.com [jeffle: also convert no_sg_merge and partial completion check] Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-table.c | 52 +++++++++++++++---------------------------- 1 file changed, 18 insertions(+), 34 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 921f79cd0360..16acc33858dd 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -1419,10 +1419,10 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector) * should use the iteration structure like dm_table_supports_nowait() or * dm_table_supports_discards(). Or introduce dm_table_all_devs_attr() that * uses an @anti_func that handle semantics of counter examples, e.g. not - * capable of something. So: return !dm_table_any_dev_attr(t, anti_func); + * capable of something. So: return !dm_table_any_dev_attr(t, anti_func, data); */ static bool dm_table_any_dev_attr(struct dm_table *t, - iterate_devices_callout_fn func) + iterate_devices_callout_fn func, void *data) { struct dm_target *ti; unsigned int i; @@ -1431,7 +1431,7 @@ static bool dm_table_any_dev_attr(struct dm_table *t, ti = dm_table_get_target(t, i);
if (ti->type->iterate_devices && - ti->type->iterate_devices(ti, func, NULL)) + ti->type->iterate_devices(ti, func, data)) return true; }
@@ -1474,13 +1474,13 @@ bool dm_table_has_no_data_devices(struct dm_table *table) return true; }
-static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev, - sector_t start, sector_t len, void *data) +static int device_not_zoned_model(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) { struct request_queue *q = bdev_get_queue(dev->bdev); enum blk_zoned_model *zoned_model = data;
- return q && blk_queue_zoned_model(q) == *zoned_model; + return !q || blk_queue_zoned_model(q) != *zoned_model; }
static bool dm_table_supports_zoned_model(struct dm_table *t, @@ -1497,37 +1497,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t, return false;
if (!ti->type->iterate_devices || - !ti->type->iterate_devices(ti, device_is_zoned_model, &zoned_model)) + ti->type->iterate_devices(ti, device_not_zoned_model, &zoned_model)) return false; }
return true; }
-static int device_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev, - sector_t start, sector_t len, void *data) +static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) { struct request_queue *q = bdev_get_queue(dev->bdev); unsigned int *zone_sectors = data;
- return q && blk_queue_zone_sectors(q) == *zone_sectors; -} - -static bool dm_table_matches_zone_sectors(struct dm_table *t, - unsigned int zone_sectors) -{ - struct dm_target *ti; - unsigned i; - - for (i = 0; i < dm_table_get_num_targets(t); i++) { - ti = dm_table_get_target(t, i); - - if (!ti->type->iterate_devices || - !ti->type->iterate_devices(ti, device_matches_zone_sectors, &zone_sectors)) - return false; - } - - return true; + return !q || blk_queue_zone_sectors(q) != *zone_sectors; }
static int validate_hardware_zoned_model(struct dm_table *table, @@ -1547,7 +1530,7 @@ static int validate_hardware_zoned_model(struct dm_table *table, if (!zone_sectors || !is_power_of_2(zone_sectors)) return -EINVAL;
- if (!dm_table_matches_zone_sectors(table, zone_sectors)) { + if (dm_table_any_dev_attr(table, device_not_matches_zone_sectors, &zone_sectors)) { DMERR("%s: zone sectors is not consistent across all devices", dm_device_name(table->md)); return -EINVAL; @@ -1772,7 +1755,7 @@ static int device_is_partial_completion(struct dm_target *ti, struct dm_dev *dev
static bool dm_table_does_not_support_partial_completion(struct dm_table *t) { - return !dm_table_any_dev_attr(t, device_is_partial_completion); + return !dm_table_any_dev_attr(t, device_is_partial_completion, NULL); }
static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, @@ -1935,11 +1918,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, else blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
- if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled)) + if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL)) dax_write_cache(t->md->dax_dev, true);
/* Ensure that all underlying devices are non-rotational. */ - if (dm_table_any_dev_attr(t, device_is_rotational)) + if (dm_table_any_dev_attr(t, device_is_rotational, NULL)) blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); else blk_queue_flag_set(QUEUE_FLAG_NONROT, q); @@ -1949,7 +1932,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, if (!dm_table_supports_write_zeroes(t)) q->limits.max_write_zeroes_sectors = 0;
- if (dm_table_any_dev_attr(t, queue_no_sg_merge)) + if (dm_table_any_dev_attr(t, queue_no_sg_merge, NULL)) blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q); else blk_queue_flag_clear(QUEUE_FLAG_NO_SG_MERGE, q); @@ -1963,7 +1946,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, * them as well. Only targets that support iterate_devices are considered: * don't want error, zero, etc to require stable pages. */ - if (dm_table_any_dev_attr(t, device_requires_stable_pages)) + if (dm_table_any_dev_attr(t, device_requires_stable_pages, NULL)) q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; else q->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; @@ -1974,7 +1957,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, * Clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not * have it set. */ - if (blk_queue_add_random(q) && dm_table_any_dev_attr(t, device_is_not_random)) + if (blk_queue_add_random(q) && + dm_table_any_dev_attr(t, device_is_not_random, NULL)) blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
/* Allow reads to exceed readahead limits */
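Extending the earlier toy sketch with the @data plumbing added above (again user-space C with assumed names, not the kernel code): the opaque pointer lets a counter-example predicate such as device_not_matches_zone_sectors() compare every device against a caller-supplied value.

#include <stdbool.h>
#include <stdio.h>

struct toy_dev { unsigned int zone_sectors; };
typedef int (*callout_fn)(const struct toy_dev *d, void *data);

/* dm_table_any_dev_attr() with the new @data pass-through. */
static bool any_dev_attr(const struct toy_dev *devs, int n,
			 callout_fn fn, void *data)
{
	for (int i = 0; i < n; i++)
		if (fn(&devs[i], data))
			return true;
	return false;
}

/* Counter-example predicate, like device_not_matches_zone_sectors(). */
static int not_matches_zone_sectors(const struct toy_dev *d, void *data)
{
	unsigned int *zone_sectors = data;

	return d->zone_sectors != *zone_sectors;
}

int main(void)
{
	struct toy_dev devs[] = { { 128 }, { 128 }, { 256 } };
	unsigned int want = 128;

	/* consistent iff no device deviates from the expected value */
	bool ok = !any_dev_attr(devs, 3, not_matches_zone_sectors, &want);
	printf("zone sectors consistent: %s\n", ok ? "yes" : "no");
	return 0;
}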
From: Daniel Borkmann daniel@iogearbox.net
stable inclusion from linux-4.19.181 commit e5008f2e9157247e7758d8f19707033a2956e08c
--------------------------------
commit 89e5c58fc1e2857ccdaae506fb8bc5fed57ee063 upstream.
We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the csum for the UDP header itself is 0. In that case, GRO aggregation does not take place on the phys dev, but instead is deferred to the vxlan/geneve driver (see trace below).
The reason is essentially that GRO aggregation bails out in udp_gro_receive() for this case when drivers have marked the skb with CHECKSUM_UNNECESSARY (ice, i40e, others). For non-zero csums, 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") promotes such skbs to CHECKSUM_COMPLETE and the napi context has csum_valid set. This is, however, not the case for a zero UDP csum (here, csum_cnt is still 0 and csum_valid remains false).
At the same time 57c67ff4bd92 ("udp: additional GRO support") added matches on !uh->check ^ !uh2->check as part of determining candidates for aggregation, so udp_gro_receive() is certainly expected to handle zero csums. The purpose of the check added via 662880f44203 ("net: Allow GRO to use and set levels of checksum unnecessary") seems to be to catch a bad csum and stop aggregation right away.
One way to fix aggregation in the zero case is to only perform the !csum_valid check in udp_gro_receive() if uh->check is in fact non-zero.
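Restated as a standalone predicate (a sketch mirroring the one-line diff below; the parameters are stand-ins for the skb/GRO state, not a real API):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* True when udp_gro_receive() should give up on aggregation. After the
 * fix, an unvalidated csum only blocks GRO when the UDP header carries
 * a non-zero checksum (uh_check). */
static bool udp_gro_should_bail(bool encap_mark, uint16_t uh_check,
				bool csum_partial, int csum_cnt,
				bool csum_valid)
{
	return encap_mark ||
	       (uh_check && !csum_partial && csum_cnt == 0 && !csum_valid);
}

int main(void)
{
	/* zero csum, nothing validated: aggregation now proceeds (0) */
	printf("%d\n", udp_gro_should_bail(false, 0, false, 0, false));
	return 0;
}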
Before:
[...] swapper 0 [008] 731.946506: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100400 len=1500 (1) swapper 0 [008] 731.946507: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100200 len=1500 swapper 0 [008] 731.946507: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101100 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101700 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101b00 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100600 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100f00 len=1500 swapper 0 [008] 731.946509: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100a00 len=1500 swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100500 len=1500 swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100700 len=1500 swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101d00 len=1500 (2) swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101000 len=1500 swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101c00 len=1500 swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101400 len=1500 swapper 0 [008] 731.946518: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100e00 len=1500 swapper 0 [008] 731.946518: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101600 len=1500 swapper 0 [008] 731.946521: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100800 len=774 swapper 0 [008] 731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497100400 len=14032 (1) swapper 0 [008] 731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497101d00 len=9112 (2) [...]
# netperf -H 10.55.10.4 -t TCP_STREAM -l 20 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.01 13129.24
After:
[...] swapper 0 [026] 521.862641: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d479000 len=11286 (1) swapper 0 [026] 521.862643: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479000 len=11236 (1) swapper 0 [026] 521.862650: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d478500 len=2898 (2) swapper 0 [026] 521.862650: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d479f00 len=8490 (3) swapper 0 [026] 521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d478500 len=2848 (2) swapper 0 [026] 521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479f00 len=8440 (3) [...]
# netperf -H 10.55.10.4 -t TCP_STREAM -l 20 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.01 24576.53
Fixes: 57c67ff4bd92 ("udp: additional GRO support") Fixes: 662880f44203 ("net: Allow GRO to use and set levels of checksum unnecessary") Signed-off-by: Daniel Borkmann daniel@iogearbox.net Cc: Eric Dumazet edumazet@google.com Cc: Jesse Brandeburg jesse.brandeburg@intel.com Cc: Tom Herbert tom@herbertland.com Acked-by: Willem de Bruijn willemb@google.com Acked-by: John Fastabend john.fastabend@gmail.com Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/udp_offload.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index aa343654abfc..2d22d39952da 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -359,7 +359,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb, struct sock *sk;
if (NAPI_GRO_CB(skb)->encap_mark || - (skb->ip_summed != CHECKSUM_PARTIAL && + (uh->check && skb->ip_summed != CHECKSUM_PARTIAL && NAPI_GRO_CB(skb)->csum_cnt == 0 && !NAPI_GRO_CB(skb)->csum_valid)) goto out;
From: Vasily Averin vvs@virtuozzo.com
stable inclusion from linux-4.19.181 commit 072d8778f66ccc8452e82db6d1e83b315466fa81
--------------------------------
commit 8e24edddad152b998b37a7f583175137ed2e04a5 upstream.
Nested target/match_revfn() calls work on the xt[NFPROTO_UNSPEC] lists without taking xt[NFPROTO_UNSPEC].mutex. This can race with module unload and cause the host to crash:
general protection fault: 0000 [#1] Modules linked in: ... [last unloaded: xt_cluster] CPU: 0 PID: 542455 Comm: iptables RIP: 0010:[<ffffffff8ffbd518>] [<ffffffff8ffbd518>] strcmp+0x18/0x40 RDX: 0000000000000003 RSI: ffff9a5a5d9abe10 RDI: dead000000000111 R13: ffff9a5a5d9abe10 R14: ffff9a5a5d9abd8c R15: dead000000000100 (VvS: %R15 -- &xt_match, %RDI -- &xt_match.name, xt_cluster unregister match in xt[NFPROTO_UNSPEC].match list) Call Trace: [<ffffffff902ccf44>] match_revfn+0x54/0xc0 [<ffffffff902ccf9f>] match_revfn+0xaf/0xc0 [<ffffffff902cd01e>] xt_find_revision+0x6e/0xf0 [<ffffffffc05a5be0>] do_ipt_get_ctl+0x100/0x420 [ip_tables] [<ffffffff902cc6bf>] nf_getsockopt+0x4f/0x70 [<ffffffff902dd99e>] ip_getsockopt+0xde/0x100 [<ffffffff903039b5>] raw_getsockopt+0x25/0x50 [<ffffffff9026c5da>] sock_common_getsockopt+0x1a/0x20 [<ffffffff9026b89d>] SyS_getsockopt+0x7d/0xf0 [<ffffffff903cbf92>] system_call_fastpath+0x25/0x2a
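The shape of the fix can be restated as a small user-space sketch (pthreads, toy data, assumed names, not the kernel code): each family's list is scanned under that family's own mutex, and the NFPROTO_UNSPEC fallback takes the UNSPEC mutex inside the recursion instead of relying on the caller.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NFPROTO_UNSPEC 0
#define NFPROTO_IPV4   2

struct af_state {
	pthread_mutex_t mutex;
	const char *matches[4];	/* stand-in for xt[af].match */
	int n;
};

static struct af_state xt[3] = {
	[NFPROTO_UNSPEC] = { PTHREAD_MUTEX_INITIALIZER, { "cluster" }, 1 },
	[NFPROTO_IPV4]   = { PTHREAD_MUTEX_INITIALIZER, { "tcp" }, 1 },
};

/* Like match_revfn() after the fix: scan one family under its own lock,
 * then fall back to NFPROTO_UNSPEC, which locks its own list in the
 * recursive call. */
static int match_exists(int af, const char *name)
{
	int found = 0;

	pthread_mutex_lock(&xt[af].mutex);
	for (int i = 0; i < xt[af].n; i++)
		if (strcmp(xt[af].matches[i], name) == 0)
			found = 1;
	pthread_mutex_unlock(&xt[af].mutex);

	if (af != NFPROTO_UNSPEC && !found)
		return match_exists(NFPROTO_UNSPEC, name);
	return found;
}

int main(void)
{
	/* found via the UNSPEC fallback, each list under its own lock */
	printf("%d\n", match_exists(NFPROTO_IPV4, "cluster"));
	return 0;
}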
Fixes: 656caff20e1 ("netfilter 04/09: x_tables: fix match/target revision lookup") Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/x_tables.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 6a7d0303d058..1314de5f317f 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -335,6 +335,7 @@ static int match_revfn(u8 af, const char *name, u8 revision, int *bestp) const struct xt_match *m; int have_rev = 0;
+ mutex_lock(&xt[af].mutex); list_for_each_entry(m, &xt[af].match, list) { if (strcmp(m->name, name) == 0) { if (m->revision > *bestp) @@ -343,6 +344,7 @@ static int match_revfn(u8 af, const char *name, u8 revision, int *bestp) have_rev = 1; } } + mutex_unlock(&xt[af].mutex);
if (af != NFPROTO_UNSPEC && !have_rev) return match_revfn(NFPROTO_UNSPEC, name, revision, bestp); @@ -355,6 +357,7 @@ static int target_revfn(u8 af, const char *name, u8 revision, int *bestp) const struct xt_target *t; int have_rev = 0;
+ mutex_lock(&xt[af].mutex); list_for_each_entry(t, &xt[af].target, list) { if (strcmp(t->name, name) == 0) { if (t->revision > *bestp) @@ -363,6 +366,7 @@ static int target_revfn(u8 af, const char *name, u8 revision, int *bestp) have_rev = 1; } } + mutex_unlock(&xt[af].mutex);
if (af != NFPROTO_UNSPEC && !have_rev) return target_revfn(NFPROTO_UNSPEC, name, revision, bestp); @@ -376,12 +380,10 @@ int xt_find_revision(u8 af, const char *name, u8 revision, int target, { int have_rev, best = -1;
- mutex_lock(&xt[af].mutex); if (target == 1) have_rev = target_revfn(af, name, revision, &best); else have_rev = match_revfn(af, name, revision, &best); - mutex_unlock(&xt[af].mutex);
/* Nothing at all? Return 0 to try loading module. */ if (best == -1) {
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.181 commit 7209e120966c5557106caed20cb4bb08ab26d5d3
--------------------------------
[ Upstream commit 7db48e983930285b765743ebd665aecf9850582b ]
There are a few places where we fetch tp->copied_seq while this field can change from IRQ or another cpu.
We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing.
Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq).
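For illustration, the annotation pattern restated outside the kernel (simplified user-space stand-ins for the macros; the real definitions in the kernel's compiler headers do more):

#include <stdint.h>
#include <stdio.h>

/* Simplified equivalents: force single, untorn loads and stores. */
#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct sockish { uint32_t copied_seq; uint32_t rcv_nxt; };

/* Lockless reader, e.g. tcp_poll() computing unread bytes. */
static uint32_t unread_bytes(struct sockish *s)
{
	return READ_ONCE(s->rcv_nxt) - READ_ONCE(s->copied_seq);
}

/* Writer under the socket lock, e.g. recvmsg consuming data. */
static void consume(struct sockish *s, uint32_t used)
{
	WRITE_ONCE(s->copied_seq, s->copied_seq + used);
}

int main(void)
{
	struct sockish s = { .copied_seq = 10, .rcv_nxt = 30 };

	consume(&s, 5);
	printf("%u\n", unread_bytes(&s));	/* 15 */
	return 0;
}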
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp.c | 18 +++++++++--------- net/ipv4/tcp_diag.c | 3 ++- net/ipv4/tcp_input.c | 6 +++--- net/ipv4/tcp_ipv4.c | 2 +- net/ipv4/tcp_minisocks.c | 2 +- net/ipv4/tcp_output.c | 2 +- net/ipv6/tcp_ipv6.c | 2 +- 7 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 98e8ee8bb759..f639c7d60838 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -567,7 +567,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait) (state != TCP_SYN_RECV || tp->fastopen_rsk)) { int target = sock_rcvlowat(sk, 0, INT_MAX);
- if (tp->urg_seq == tp->copied_seq && + if (tp->urg_seq == READ_ONCE(tp->copied_seq) && !sock_flag(sk, SOCK_URGINLINE) && tp->urg_data) target++; @@ -628,7 +628,7 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg) unlock_sock_fast(sk, slow); break; case SIOCATMARK: - answ = tp->urg_data && tp->urg_seq == tp->copied_seq; + answ = tp->urg_data && tp->urg_seq == READ_ONCE(tp->copied_seq); break; case SIOCOUTQ: if (sk->sk_state == TCP_LISTEN) @@ -1696,9 +1696,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, sk_eat_skb(sk, skb); if (!desc->count) break; - tp->copied_seq = seq; + WRITE_ONCE(tp->copied_seq, seq); } - tp->copied_seq = seq; + WRITE_ONCE(tp->copied_seq, seq);
tcp_rcv_space_adjust(sk);
@@ -1835,7 +1835,7 @@ static int tcp_zerocopy_receive(struct sock *sk, out: up_read(¤t->mm->mmap_sem); if (length) { - tp->copied_seq = seq; + WRITE_ONCE(tp->copied_seq, seq); tcp_rcv_space_adjust(sk);
/* Clean up data we have read: This will do ACK frames. */ @@ -2112,7 +2112,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, if (urg_offset < used) { if (!urg_offset) { if (!sock_flag(sk, SOCK_URGINLINE)) { - ++*seq; + WRITE_ONCE(*seq, *seq + 1); urg_hole++; offset++; used--; @@ -2134,7 +2134,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, } }
- *seq += used; + WRITE_ONCE(*seq, *seq + used); copied += used; len -= used;
@@ -2163,7 +2163,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
found_fin_ok: /* Process the FIN. */ - ++*seq; + WRITE_ONCE(*seq, *seq + 1); if (!(flags & MSG_PEEK)) sk_eat_skb(sk, skb); break; @@ -2578,7 +2578,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_clear_xmit_timers(sk); __skb_queue_purge(&sk->sk_receive_queue); - tp->copied_seq = tp->rcv_nxt; + WRITE_ONCE(tp->copied_seq, tp->rcv_nxt); tp->urg_data = 0; tcp_write_queue_purge(sk); tcp_fastopen_active_disable_ofo_check(sk); diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c index c9e97f304f98..a96b252c742c 100644 --- a/net/ipv4/tcp_diag.c +++ b/net/ipv4/tcp_diag.c @@ -30,7 +30,8 @@ static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r, } else if (sk->sk_type == SOCK_STREAM) { const struct tcp_sock *tp = tcp_sk(sk);
- r->idiag_rqueue = max_t(int, READ_ONCE(tp->rcv_nxt) - tp->copied_seq, 0); + r->idiag_rqueue = max_t(int, READ_ONCE(tp->rcv_nxt) - + READ_ONCE(tp->copied_seq), 0); r->idiag_wqueue = tp->write_seq - tp->snd_una; } if (info) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c7c3df30d8bc..effba01eed84 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5897,7 +5897,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, /* Remember, tcp_poll() does not lock socket! * Change state from SYN-SENT only after copied_seq * is initialized. */ - tp->copied_seq = tp->rcv_nxt; + WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
smc_check_reset_syn(tp);
@@ -5972,7 +5972,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, }
WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1); - tp->copied_seq = tp->rcv_nxt; + WRITE_ONCE(tp->copied_seq, tp->rcv_nxt); tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
/* RFC1323: The window in SYN & SYN/ACK segments is @@ -6134,7 +6134,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) tcp_rearm_rto(sk); } else { tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); - tp->copied_seq = tp->rcv_nxt; + WRITE_ONCE(tp->copied_seq, tp->rcv_nxt); } smp_mb(); tcp_set_state(sk, TCP_ESTABLISHED); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 960c3b11fb93..c41e095a63fb 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2431,7 +2431,7 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i) * we might find a transient negative value. */ rx_queue = max_t(int, READ_ONCE(tp->rcv_nxt) - - tp->copied_seq, 0); + READ_ONCE(tp->copied_seq), 0);
seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX " "%08X %5u %8d %lu %d %pK %lu %lu %u %u %d", diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 7ba8a90772b0..0b1a04fa5439 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -470,7 +470,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
seq = treq->rcv_isn + 1; newtp->rcv_wup = seq; - newtp->copied_seq = seq; + WRITE_ONCE(newtp->copied_seq, seq); WRITE_ONCE(newtp->rcv_nxt, seq); newtp->segs_in = 1;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 3cfefec81975..662aa48173b8 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3381,7 +3381,7 @@ static void tcp_connect_init(struct sock *sk) else tp->rcv_tstamp = tcp_jiffies32; tp->rcv_wup = tp->rcv_nxt; - tp->copied_seq = tp->rcv_nxt; + WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
inet_csk(sk)->icsk_rto = tcp_timeout_init(sk); inet_csk(sk)->icsk_retransmits = 0; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 2e76ebfdc907..de9b9c0bf18f 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1839,7 +1839,7 @@ static void get_tcp6_sock(struct seq_file *seq, struct sock *sp, int i) * we might find a transient negative value. */ rx_queue = max_t(int, READ_ONCE(tp->rcv_nxt) - - tp->copied_seq, 0); + READ_ONCE(tp->copied_seq), 0);
seq_printf(seq, "%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.181 commit 92ba49b27efd409fd27bdcd5bbb2946d8a02938c
--------------------------------
[ Upstream commit 0f31746452e6793ad6271337438af8f4defb8940 ]
There are a few places where we fetch tp->write_seq while this field can change from IRQ or another cpu.
We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing.
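One detail visible in the diff below is worth restating: tcp_v4_connect() treats write_seq == 0 as "uninitialized" and replaces it with a secure ISN, so the rewritten disconnect and timewait paths bump a wrapped sequence to 1. A minimal restatement (hypothetical helper name):

#include <stdint.h>

/* Next write_seq after a disconnect: skip 0, which would be taken to
 * mean "uninitialized, pick a fresh secure ISN" on the next connect. */
uint32_t next_write_seq(uint32_t write_seq, uint32_t max_window)
{
	uint32_t seq = write_seq + max_window + 2;

	return seq ? seq : 1;
}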
Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/net/tcp.h | 2 +- net/ipv4/tcp.c | 20 ++++++++++++-------- net/ipv4/tcp_diag.c | 2 +- net/ipv4/tcp_ipv4.c | 21 ++++++++++++--------- net/ipv4/tcp_minisocks.c | 2 +- net/ipv4/tcp_output.c | 4 ++-- net/ipv6/tcp_ipv6.c | 13 +++++++------ 7 files changed, 36 insertions(+), 28 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h index f5128bc28bb7..a9a0db9bef5e 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1880,7 +1880,7 @@ static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp) static inline bool tcp_stream_memory_free(const struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); - u32 notsent_bytes = tp->write_seq - tp->snd_nxt; + u32 notsent_bytes = READ_ONCE(tp->write_seq) - tp->snd_nxt;
return notsent_bytes < tcp_notsent_lowat(tp); } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f639c7d60838..370faff782cd 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -637,7 +637,7 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg) if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) answ = 0; else - answ = tp->write_seq - tp->snd_una; + answ = READ_ONCE(tp->write_seq) - tp->snd_una; break; case SIOCOUTQNSD: if (sk->sk_state == TCP_LISTEN) @@ -646,7 +646,7 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg) if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) answ = 0; else - answ = tp->write_seq - tp->snd_nxt; + answ = READ_ONCE(tp->write_seq) - tp->snd_nxt; break; default: return -ENOIOCTLCMD; @@ -1037,7 +1037,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset, sk->sk_wmem_queued += copy; sk_mem_charge(sk, copy); skb->ip_summed = CHECKSUM_PARTIAL; - tp->write_seq += copy; + WRITE_ONCE(tp->write_seq, tp->write_seq + copy); TCP_SKB_CB(skb)->end_seq += copy; tcp_skb_pcount_set(skb, 0);
@@ -1391,7 +1391,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) if (!copied) TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
- tp->write_seq += copy; + WRITE_ONCE(tp->write_seq, tp->write_seq + copy); TCP_SKB_CB(skb)->end_seq += copy; tcp_skb_pcount_set(skb, 0);
@@ -2556,6 +2556,7 @@ int tcp_disconnect(struct sock *sk, int flags) struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); int old_state = sk->sk_state; + u32 seq;
if (old_state != TCP_CLOSE) tcp_set_state(sk, TCP_CLOSE); @@ -2593,9 +2594,12 @@ int tcp_disconnect(struct sock *sk, int flags) sock_reset_flag(sk, SOCK_DONE); tp->srtt_us = 0; tp->rcv_rtt_last_tsecr = 0; - tp->write_seq += tp->max_window + 2; - if (tp->write_seq == 0) - tp->write_seq = 1; + + seq = tp->write_seq + tp->max_window + 2; + if (!seq) + seq = 1; + WRITE_ONCE(tp->write_seq, seq); + tp->snd_cwnd = 2; icsk->icsk_probes_out = 0; tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; @@ -2885,7 +2889,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, if (sk->sk_state != TCP_CLOSE) err = -EPERM; else if (tp->repair_queue == TCP_SEND_QUEUE) - tp->write_seq = val; + WRITE_ONCE(tp->write_seq, val); else if (tp->repair_queue == TCP_RECV_QUEUE) { WRITE_ONCE(tp->rcv_nxt, val); WRITE_ONCE(tp->copied_seq, val); diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c index a96b252c742c..2a46f9f81ba0 100644 --- a/net/ipv4/tcp_diag.c +++ b/net/ipv4/tcp_diag.c @@ -32,7 +32,7 @@ static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
r->idiag_rqueue = max_t(int, READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq), 0); - r->idiag_wqueue = tp->write_seq - tp->snd_una; + r->idiag_wqueue = READ_ONCE(tp->write_seq) - tp->snd_una; } if (info) tcp_get_info(sk, info); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c41e095a63fb..b6e75deb286a 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -169,9 +169,11 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) * without appearing to create any others. */ if (likely(!tp->repair)) { - tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2; - if (tp->write_seq == 0) - tp->write_seq = 1; + u32 seq = tcptw->tw_snd_nxt + 65535 + 2; + + if (!seq) + seq = 1; + WRITE_ONCE(tp->write_seq, seq); tp->rx_opt.ts_recent = tcptw->tw_ts_recent; tp->rx_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp; } @@ -258,7 +260,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) tp->rx_opt.ts_recent = 0; tp->rx_opt.ts_recent_stamp = 0; if (likely(!tp->repair)) - tp->write_seq = 0; + WRITE_ONCE(tp->write_seq, 0); }
inet->inet_dport = usin->sin_port; @@ -296,10 +298,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
if (likely(!tp->repair)) { if (!tp->write_seq) - tp->write_seq = secure_tcp_seq(inet->inet_saddr, - inet->inet_daddr, - inet->inet_sport, - usin->sin_port); + WRITE_ONCE(tp->write_seq, + secure_tcp_seq(inet->inet_saddr, + inet->inet_daddr, + inet->inet_sport, + usin->sin_port)); tp->tsoffset = secure_tcp_ts_off(sock_net(sk), inet->inet_saddr, inet->inet_daddr); @@ -2436,7 +2439,7 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i) seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX " "%08X %5u %8d %lu %d %pK %lu %lu %u %u %d", i, src, srcp, dest, destp, state, - tp->write_seq - tp->snd_una, + READ_ONCE(tp->write_seq) - tp->snd_una, rx_queue, timer_active, jiffies_delta_to_clock_t(timer_expires - jiffies), diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 0b1a04fa5439..9436fb9b6a3d 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -510,7 +510,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, newtp->app_limited = ~0U;
tcp_init_xmit_timers(newsk); - newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1; + WRITE_ONCE(newtp->write_seq, newtp->pushed_seq = treq->snt_isn + 1);
newtp->rx_opt.saw_tstamp = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 662aa48173b8..9b74041e8dd1 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1175,7 +1175,7 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb) struct tcp_sock *tp = tcp_sk(sk);
/* Advance write_seq and place onto the write_queue. */ - tp->write_seq = TCP_SKB_CB(skb)->end_seq; + WRITE_ONCE(tp->write_seq, TCP_SKB_CB(skb)->end_seq); __skb_header_release(skb); tcp_add_write_queue_tail(sk, skb); sk->sk_wmem_queued += skb->truesize; @@ -3397,7 +3397,7 @@ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb) __skb_header_release(skb); sk->sk_wmem_queued += skb->truesize; sk_mem_charge(sk, skb->truesize); - tp->write_seq = tcb->end_seq; + WRITE_ONCE(tp->write_seq, tcb->end_seq); tp->packets_out += tcp_skb_pcount(skb); }
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index de9b9c0bf18f..6e84f2eb08d6 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -206,7 +206,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, !ipv6_addr_equal(&sk->sk_v6_daddr, &usin->sin6_addr)) { tp->rx_opt.ts_recent = 0; tp->rx_opt.ts_recent_stamp = 0; - tp->write_seq = 0; + WRITE_ONCE(tp->write_seq, 0); }
sk->sk_v6_daddr = usin->sin6_addr; @@ -304,10 +304,11 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
if (likely(!tp->repair)) { if (!tp->write_seq) - tp->write_seq = secure_tcpv6_seq(np->saddr.s6_addr32, - sk->sk_v6_daddr.s6_addr32, - inet->inet_sport, - inet->inet_dport); + WRITE_ONCE(tp->write_seq, + secure_tcpv6_seq(np->saddr.s6_addr32, + sk->sk_v6_daddr.s6_addr32, + inet->inet_sport, + inet->inet_dport)); tp->tsoffset = secure_tcpv6_ts_off(sock_net(sk), np->saddr.s6_addr32, sk->sk_v6_daddr.s6_addr32); @@ -1850,7 +1851,7 @@ static void get_tcp6_sock(struct seq_file *seq, struct sock *sp, int i) dest->s6_addr32[0], dest->s6_addr32[1], dest->s6_addr32[2], dest->s6_addr32[3], destp, state, - tp->write_seq - tp->snd_una, + READ_ONCE(tp->write_seq) - tp->snd_una, rx_queue, timer_active, jiffies_delta_to_clock_t(timer_expires - jiffies),
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.181 commit 319f460237fc2965a80aa9a055044e1da7b3692a
--------------------------------
[ Upstream commit 8811f4a9836e31c14ecdf79d9f3cb7c5d463265d ]
Qingyu Li reported a syzkaller bug where the repro changes RCV SEQ _after_ restoring data in the receive queue.
mprotect(0x4aa000, 12288, PROT_READ) = 0 mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000 mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000 mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000 socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0 connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20 setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0 recvfrom(3, NULL, 20, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
syslog shows: [ 111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0 [ 111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0
This should not be allowed. TCP_QUEUE_SEQ should only be used when queues are empty.
This patch fixes this case, and the tx path as well.
Fixes: ee9952831cfd ("tcp: Initial repair mode") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Pavel Emelyanov xemul@parallels.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005 Reported-by: Qingyu Li ieatmuttonchuan@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/tcp.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 370faff782cd..769e1f683471 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2886,16 +2886,23 @@ static int do_tcp_setsockopt(struct sock *sk, int level, break;
case TCP_QUEUE_SEQ: - if (sk->sk_state != TCP_CLOSE) + if (sk->sk_state != TCP_CLOSE) { err = -EPERM; - else if (tp->repair_queue == TCP_SEND_QUEUE) - WRITE_ONCE(tp->write_seq, val); - else if (tp->repair_queue == TCP_RECV_QUEUE) { - WRITE_ONCE(tp->rcv_nxt, val); - WRITE_ONCE(tp->copied_seq, val); - } - else + } else if (tp->repair_queue == TCP_SEND_QUEUE) { + if (!tcp_rtx_queue_empty(sk)) + err = -EPERM; + else + WRITE_ONCE(tp->write_seq, val); + } else if (tp->repair_queue == TCP_RECV_QUEUE) { + if (tp->rcv_nxt != tp->copied_seq) { + err = -EPERM; + } else { + WRITE_ONCE(tp->rcv_nxt, val); + WRITE_ONCE(tp->copied_seq, val); + } + } else { err = -EINVAL; + } break;
case TCP_REPAIR_OPTIONS:
From: Paulo Alcantara pc@cjr.nz
stable inclusion from linux-4.19.181 commit 3cbf408b5ae6d384fffc9dabe4f5cad2f3906d89
--------------------------------
commit 14302ee3301b3a77b331cc14efb95bf7184c73cc upstream.
In cifs_statfs(), if server->ops->queryfs is not NULL, we should use its return value rather than always returning 0. Instead, use the rc variable, as it is properly set to 0 when there is no server->ops->queryfs.
Signed-off-by: Paulo Alcantara (SUSE) pc@cjr.nz Reviewed-by: Aurelien Aptel aaptel@suse.com Reviewed-by: Ronnie Sahlberg lsahlber@redhat.com CC: stable@vger.kernel.org Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/cifsfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index d5457015801d..bc906fcf3f6d 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -229,7 +229,7 @@ cifs_statfs(struct dentry *dentry, struct kstatfs *buf) rc = server->ops->queryfs(xid, tcon, buf);
free_xid(xid); - return 0; + return rc; }
static long cifs_fallocate(struct file *file, int mode, loff_t off, loff_t len)
From: Linus Torvalds torvalds@linux-foundation.org
stable inclusion from linux-4.19.181 commit c6afb0eeed94a7d504acdfdde7128573ec6a9e3e
--------------------------------
commit 9b1ea29bc0d7b94d420f96a0f4121403efc3dd85 upstream.
This reverts commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf.
The kernel test robot reports a huge performance regression due to the commit, and the reason seems fairly straightforward: when there is contention on the page list (which is what causes acquire_slab() to fail), we do _not_ want to just loop and try again, because that will transfer the contention to the 'n->list_lock' spinlock we hold, and just make things even worse.
This is admittedly likely a problem only on big machines - the kernel test robot report comes from a 96-thread dual socket Intel Xeon Gold 6252 setup, but the regression there really is quite noticeable:
-47.9% regression of stress-ng.rawpkt.ops_per_sec
and the commit that was marked as being fixed (7ced37197196: "slub: Acquire_slab() avoid loop") actually did the early loop exit very intentionally (the hint being the "avoid loop" part of that commit message), exactly to avoid this issue.
The correct thing to do may be to pick some kind of reasonable middle ground: instead of breaking out of the loop on the very first sign of contention, or trying over and over and over again, the right thing may be to re-try _once_, and then give up on the second failure (or pick your favorite value for "once"..).
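That middle ground could look like the following purely illustrative user-space sketch (try_acquire() is a stand-in for acquire_slab()'s cmpxchg, not kernel code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for acquire_slab(): fails randomly to mimic cmpxchg races. */
static bool try_acquire(void)
{
	return rand() % 2 == 0;
}

/* Retry a bounded number of times, then move on: neither bailing on the
 * first failure (this revert) nor spinning forever (the reverted commit). */
static bool acquire_with_bounded_retry(int max_retries)
{
	for (int attempt = 0; attempt <= max_retries; attempt++)
		if (try_acquire())
			return true;
	return false;	/* still contended: try the next partial slab */
}

int main(void)
{
	printf("%d\n", acquire_with_bounded_retry(1));
	return 0;
}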
Reported-by: kernel test robot oliver.sang@intel.com Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn jannh@google.com Cc: David Rientjes rientjes@google.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Acked-by: Christoph Lameter cl@linux.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c index 4b681e71ba8c..212b2c0a0ee3 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1863,7 +1863,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
t = acquire_slab(s, n, page, object == NULL, &objects); if (!t) - continue; /* cmpxchg raced */ + break;
available += objects; if (!object) {
From: Geert Uytterhoeven geert+renesas@glider.be
stable inclusion from linux-4.19.181 commit 6a8b02ead7f7ed10af3dbe180c524e0b5ce16c6e
--------------------------------
[ Upstream commit f6bda644fa3a7070621c3bf12cd657f69a42f170 ]
Kmemleak reports:
unreferenced object 0xc328de40 (size 64): comm "kworker/1:1", pid 21, jiffies 4294938212 (age 1484.670s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 e0 d8 fc eb 00 00 00 00 ................ 00 00 10 fe 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace: [<ad758d10>] pci_register_io_range+0x3c/0x80 [<2c7f139e>] of_pci_range_to_resource+0x48/0xc0 [<f079ecc8>] devm_of_pci_get_host_bridge_resources.constprop.0+0x2ac/0x3ac [<e999753b>] devm_of_pci_bridge_init+0x60/0x1b8 [<a895b229>] devm_pci_alloc_host_bridge+0x54/0x64 [<e451ddb0>] rcar_pcie_probe+0x2c/0x644
In case a PCI host driver's probe is deferred, the same I/O range may be allocated again, and be ignored, causing a memory leak.
Fix this by (a) letting logic_pio_register_range() return -EEXIST if the passed range already exists, so pci_register_io_range() will free it, and by (b) making pci_register_io_range() not consider -EEXIST an error condition.
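The resulting ownership rule can be modelled in isolation (a toy registry with made-up names; the real fix spans drivers/pci/pci.c and lib/logic_pio.c as shown below):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_RANGES 8

/* Toy registry: takes ownership of a range pointer on success. */
static unsigned long *registered[MAX_RANGES];
static int nr_registered;

/* Like logic_pio_register_range() after the fix: report duplicates. */
static int logic_register(unsigned long *range)
{
	for (int i = 0; i < nr_registered; i++)
		if (*registered[i] == *range)
			return -EEXIST;	/* range already there */
	if (nr_registered == MAX_RANGES)
		return -ENOMEM;
	registered[nr_registered++] = range;
	return 0;
}

/* Like pci_register_io_range() after the fix: free on any failure,
 * then treat a duplicate (deferred re-probe) as success. */
static int register_io_range(unsigned long start)
{
	unsigned long *range = malloc(sizeof(*range));
	int ret;

	if (!range)
		return -ENOMEM;
	*range = start;

	ret = logic_register(range);
	if (ret)
		free(range);	/* registry did not take ownership */
	if (ret == -EEXIST)	/* duplicate due to deferred probing */
		ret = 0;
	return ret;
}

int main(void)
{
	/* first probe registers; the deferred re-probe no longer leaks */
	printf("%d %d\n", register_io_range(0xfe10), register_io_range(0xfe10));
	return 0;
}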
Link: https://lore.kernel.org/r/20210202100332.829047-1-geert+renesas@glider.be Signed-off-by: Geert Uytterhoeven geert+renesas@glider.be Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/pci/pci.c | 4 ++++ lib/logic_pio.c | 3 +++ 2 files changed, 7 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 140c6bd328f8..f6e3a813caf1 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -3834,6 +3834,10 @@ int pci_register_io_range(struct fwnode_handle *fwnode, phys_addr_t addr, ret = logic_pio_register_range(range); if (ret) kfree(range); + + /* Ignore duplicates due to deferred probing */ + if (ret == -EEXIST) + ret = 0; #endif
return ret; diff --git a/lib/logic_pio.c b/lib/logic_pio.c index 905027574e5d..774bb02fff10 100644 --- a/lib/logic_pio.c +++ b/lib/logic_pio.c @@ -27,6 +27,8 @@ static DEFINE_MUTEX(io_range_mutex); * @new_range: pointer to the IO range to be registered. * * Returns 0 on success, the error code in case of failure. + * If the range already exists, -EEXIST will be returned, which should be + * considered a success. * * Register a new IO range node in the IO range list. */ @@ -49,6 +51,7 @@ int logic_pio_register_range(struct logic_pio_hwaddr *new_range) list_for_each_entry(range, &io_range_list, list) { if (range->fwnode == new_range->fwnode) { /* range already there */ + ret = -EEXIST; goto end_register; } if (range->flags == LOGIC_PIO_CPU_MMIO &&
From: Aleksandr Miloserdov a.miloserdov@yadro.com
stable inclusion from linux-4.19.181 commit 7abc17dced7593cf40a46f8f9b1aafd634a0ebb1
--------------------------------
[ Upstream commit 1c73e0c5e54d5f7d77f422a10b03ebe61eaed5ad ]
TCM doesn't properly handle the underflow case for service actions. One way to prevent it is to always complete the command with target_complete_cmd_with_length(); however, that requires access to data_sg, which is not always available.
This change introduces the target_set_cmd_data_length() function, which allows setting the command's data length before completing it.
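The residual accounting the new helper centralizes follows the usual SCSI underflow rule; a toy restatement (simplified field names; a bool stands in for SCF_UNDERFLOW_BIT, and since the else branch is abridged in the hunk below, its body here is an assumption: set the flag and the initial residual):

#include <stdbool.h>
#include <stdio.h>

struct toy_cmd {
	int data_length;	/* bytes the initiator expects */
	int residual_count;	/* shortfall reported back */
	bool underflow;		/* stand-in for SCF_UNDERFLOW_BIT */
};

/* Shrink the transfer and account the residual, as the helper does. */
static void set_cmd_data_length(struct toy_cmd *cmd, int length)
{
	if (length < cmd->data_length) {
		if (cmd->underflow) {
			cmd->residual_count += cmd->data_length - length;
		} else {
			cmd->underflow = true;
			cmd->residual_count = cmd->data_length - length;
		}
		cmd->data_length = length;
	}
}

int main(void)
{
	struct toy_cmd cmd = { .data_length = 512 };

	set_cmd_data_length(&cmd, 100);	/* backend produced 100 bytes */
	printf("len=%d residual=%d\n", cmd.data_length, cmd.residual_count);
	return 0;
}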
Link: https://lore.kernel.org/r/20210209072202.41154-2-a.miloserdov@yadro.com Reviewed-by: Roman Bolshakov r.bolshakov@yadro.com Reviewed-by: Bodo Stroesser bostroesser@gmail.com Signed-off-by: Aleksandr Miloserdov a.miloserdov@yadro.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/target/target_core_transport.c | 15 +++++++++++---- include/target/target_core_backend.h | 1 + 2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index f1b730b77a31..bdada97cd4fe 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -841,11 +841,9 @@ void target_complete_cmd(struct se_cmd *cmd, u8 scsi_status) } EXPORT_SYMBOL(target_complete_cmd);
-void target_complete_cmd_with_length(struct se_cmd *cmd, u8 scsi_status, int length) +void target_set_cmd_data_length(struct se_cmd *cmd, int length) { - if ((scsi_status == SAM_STAT_GOOD || - cmd->se_cmd_flags & SCF_TREAT_READ_AS_NORMAL) && - length < cmd->data_length) { + if (length < cmd->data_length) { if (cmd->se_cmd_flags & SCF_UNDERFLOW_BIT) { cmd->residual_count += cmd->data_length - length; } else { @@ -855,6 +853,15 @@ void target_complete_cmd_with_length(struct se_cmd *cmd, u8 scsi_status, int len
cmd->data_length = length; } +} +EXPORT_SYMBOL(target_set_cmd_data_length); + +void target_complete_cmd_with_length(struct se_cmd *cmd, u8 scsi_status, int length) +{ + if (scsi_status == SAM_STAT_GOOD || + cmd->se_cmd_flags & SCF_TREAT_READ_AS_NORMAL) { + target_set_cmd_data_length(cmd, length); + }
target_complete_cmd(cmd, scsi_status); } diff --git a/include/target/target_core_backend.h b/include/target/target_core_backend.h index 51b6f50eabee..0deeff9b4496 100644 --- a/include/target/target_core_backend.h +++ b/include/target/target_core_backend.h @@ -69,6 +69,7 @@ int transport_backend_register(const struct target_backend_ops *); void target_backend_unregister(const struct target_backend_ops *);
void target_complete_cmd(struct se_cmd *, u8); +void target_set_cmd_data_length(struct se_cmd *, int); void target_complete_cmd_with_length(struct se_cmd *, u8, int);
void transport_copy_sense_to_cmd(struct se_cmd *, unsigned char *);
From: "Matthew Wilcox (Oracle)" willy@infradead.org
stable inclusion from linux-4.19.181 commit 5bd7642bd62805a91f5e90d4af8b5465515273f5
--------------------------------
[ Upstream commit 149fc787353f65b7e72e05e7b75d34863266c3e2 ]
Fix a sparse warning by using rcu_dereference(). Technically this is a bug and a sufficiently aggressive compiler could reload the `real_parent' pointer outside the protection of the rcu lock (and access freed memory), but I think it's pretty unlikely to happen.
Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org Fixes: b18dc5f291c0 ("mm, oom: skip vforked tasks from being selected") Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: Miaohe Lin linmiaohe@huawei.com Acked-by: Michal Hocko mhocko@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sched/mm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index deb1bf7f5aa7..ee7eada5b016 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -167,7 +167,8 @@ static inline bool in_vfork(struct task_struct *tsk) * another oom-unkillable task does this it should blame itself. */ rcu_read_lock(); - ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm; + ret = tsk->vfork_done && + rcu_dereference(tsk->real_parent)->mm == tsk->mm; rcu_read_unlock();
return ret;
From: Daniel Kobras kobras@puzzle-itc.de
stable inclusion from linux-4.19.183 commit 98982cf7997414245477ffa90c9cdf8492c0b4bc
--------------------------------
commit f1442d6349a2e7bb7a6134791bdc26cb776c79af upstream.
If an auth module's accept op returns SVC_CLOSE, svc_process_common() enters a call path that does not call svc_authorise() before leaving the function, and thus leaks a reference on the auth module's refcount. Hence, make sure calls to svc_authenticate() and svc_authorise() are paired on all call paths, so that rpc auth modules can be unloaded.
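Reduced to a toy, the control flow after the fix looks like this (refcount stand-ins; the label names follow the diff below):

#include <stdio.h>

static int auth_refs;

static void svc_authenticate_like(void) { auth_refs++; }
static void svc_authorise_like(void)    { auth_refs--; }

/* Every path that authenticated must authorise before leaving; paths
 * that never authenticated jump straight to transport-only cleanup. */
static int process(int fail_before_auth, int fail_after_auth)
{
	if (fail_before_auth)
		goto close_xprt;	/* no auth ref taken yet */

	svc_authenticate_like();

	if (fail_after_auth)
		goto close;		/* must drop the auth ref */

	svc_authorise_like();
	return 1;			/* normal completion */

close:
	svc_authorise_like();		/* pair the authenticate */
close_xprt:
	return 0;			/* e.g. svc_close_xprt() only */
}

int main(void)
{
	process(0, 1);
	process(1, 0);
	printf("leaked refs: %d\n", auth_refs);	/* 0 */
	return 0;
}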
Signed-off-by: Daniel Kobras kobras@puzzle-itc.de Fixes: 4d712ef1db05 ("svcauth_gss: Close connection when dropping an incoming message") Link: https://lore.kernel.org/linux-nfs/3F1B347F-B809-478F-A1E9-0BE98E22B0F0@oracl... Signed-off-by: Chuck Lever chuck.lever@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sunrpc/svc.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index faf145df6c60..429da1b3e62a 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -1330,7 +1330,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
sendit: if (svc_authorise(rqstp)) - goto close; + goto close_xprt; return 1; /* Caller can now send it */
dropit: @@ -1339,6 +1339,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv) return 0;
close: + svc_authorise(rqstp); +close_xprt: if (rqstp->rq_xprt && test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags)) svc_close_xprt(rqstp->rq_xprt); dprintk("svc: svc_process close\n"); @@ -1347,7 +1349,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv) err_short_len: svc_printk(rqstp, "short len %zd, dropping request\n", argv->iov_len); - goto close; + goto close_xprt;
err_bad_rpc: serv->sv_stats->rpcbadfmt++;
From: Sagi Grimberg sagi@grimberg.me
stable inclusion from linux-4.19.183 commit 5d9873e46c6d5a3c358341e40c373b79677f14e2
--------------------------------
[ Upstream commit c4c6df5fc84659690d4391d1fba155cd94185295 ]
We only set up I/O queues for nvme controllers, and it makes absolutely no sense to allow a controller (re)connect without any I/O queues. If we happen to fail setting the queue count for any reason, we should not treat this as a successful reconnect, as I/O has no chance of going through. Instead, just fail and schedule another reconnect.
Reported-by: Chao Leng lengchao@huawei.com Fixes: 711023071960 ("nvme-rdma: add a NVMe over Fabrics RDMA host driver") Signed-off-by: Sagi Grimberg sagi@grimberg.me Reviewed-by: Chao Leng lengchao@huawei.com Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/nvme/host/rdma.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index b318714e7591..79fefce2e4f3 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -651,8 +651,11 @@ static int nvme_rdma_alloc_io_queues(struct nvme_rdma_ctrl *ctrl) return ret;
ctrl->ctrl.queue_count = nr_io_queues + 1; - if (ctrl->ctrl.queue_count < 2) - return 0; + if (ctrl->ctrl.queue_count < 2) { + dev_err(ctrl->ctrl.device, + "unable to set any I/O queues\n"); + return -ENOMEM; + }
dev_info(ctrl->ctrl.device, "creating %d I/O queues.\n", nr_io_queues);
From: Oleg Nesterov oleg@redhat.com
stable inclusion from linux-4.19.183 commit 6cd1e19841fc245b44277d73e449c1dc82a56c73
--------------------------------
commit 5abbe51a526253b9f003e9a0a195638dc882d660 upstream.
Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT.
Add a new helper which sets restart_block->fn and calls a dummy arch_set_restart_data() helper.
Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174641.GA17871@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/select.c | 10 ++++------ include/linux/thread_info.h | 13 +++++++++++++ kernel/futex.c | 3 +-- kernel/time/alarmtimer.c | 2 +- kernel/time/hrtimer.c | 2 +- kernel/time/posix-cpu-timers.c | 2 +- 6 files changed, 21 insertions(+), 11 deletions(-)
diff --git a/fs/select.c b/fs/select.c index b3cce96718ab..be2f66c5cc8a 100644 --- a/fs/select.c +++ b/fs/select.c @@ -1000,10 +1000,9 @@ static long do_restart_poll(struct restart_block *restart_block)
ret = do_sys_poll(ufds, nfds, to);
- if (ret == -EINTR) { - restart_block->fn = do_restart_poll; - ret = -ERESTART_RESTARTBLOCK; - } + if (ret == -EINTR) + ret = set_restart_fn(restart_block, do_restart_poll); + return ret; }
@@ -1025,7 +1024,6 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds, struct restart_block *restart_block;
restart_block = ¤t->restart_block; - restart_block->fn = do_restart_poll; restart_block->poll.ufds = ufds; restart_block->poll.nfds = nfds;
@@ -1036,7 +1034,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds, } else restart_block->poll.has_timeout = 0;
- ret = -ERESTART_RESTARTBLOCK; + ret = set_restart_fn(restart_block, do_restart_poll); } return ret; } diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h index 06ca9c157980..e22fdce95308 100644 --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -12,12 +12,25 @@ #include <linux/bug.h> #include <linux/restart_block.h> #include <linux/thread_bits.h> +#include <linux/errno.h>
#include <linux/bitops.h> #include <asm/thread_info.h>
#ifdef __KERNEL__
+#ifndef arch_set_restart_data +#define arch_set_restart_data(restart) do { } while (0) +#endif + +static inline long set_restart_fn(struct restart_block *restart, + long (*fn)(struct restart_block *)) +{ + restart->fn = fn; + arch_set_restart_data(restart); + return -ERESTART_RESTARTBLOCK; +} + #ifndef THREAD_ALIGN #define THREAD_ALIGN THREAD_SIZE #endif diff --git a/kernel/futex.c b/kernel/futex.c index fc56076a2cc2..eeb1ac8b5bc6 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2862,14 +2862,13 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, goto out;
restart = ¤t->restart_block; - restart->fn = futex_wait_restart; restart->futex.uaddr = uaddr; restart->futex.val = val; restart->futex.time = *abs_time; restart->futex.bitset = bitset; restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
- ret = -ERESTART_RESTARTBLOCK; + ret = set_restart_fn(restart, futex_wait_restart);
out: if (to) { diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index 9eece67f29f3..6a2ba39889bd 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -822,9 +822,9 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags, if (flags == TIMER_ABSTIME) return -ERESTARTNOHAND;
- restart->fn = alarm_timer_nsleep_restart; restart->nanosleep.clockid = type; restart->nanosleep.expires = exp; + set_restart_fn(restart, alarm_timer_nsleep_restart); return ret; }
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index ccde4dc8462c..0e04b24cec81 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1771,9 +1771,9 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp, }
restart = ¤t->restart_block; - restart->fn = hrtimer_nanosleep_restart; restart->nanosleep.clockid = t.timer.base->clockid; restart->nanosleep.expires = hrtimer_get_expires_tv64(&t.timer); + set_restart_fn(restart, hrtimer_nanosleep_restart); out: destroy_hrtimer_on_stack(&t.timer); return ret; diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index d62d7ae5201c..bfaa44a80c03 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -1371,8 +1371,8 @@ static int posix_cpu_nsleep(const clockid_t which_clock, int flags, if (flags & TIMER_ABSTIME) return -ERESTARTNOHAND;
- restart_block->fn = posix_cpu_nsleep_restart; restart_block->nanosleep.clockid = which_clock; + set_restart_fn(restart_block, posix_cpu_nsleep_restart); } return error; }
From: "zhangyi (F)" yi.zhang@huawei.com
stable inclusion from linux-4.19.183 commit a8fb57ec924feec102d477c34a1e21685ff865e9
--------------------------------
commit 6b22489911b726eebbf169caee52fea52013fbdd upstream.
Syzbot reported a warning that ext4 may create an empty ea_inode if we set an empty extended attribute on a file in a filesystem that has no free blocks left.
WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
...
Call trace:
 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
 ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942
 ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390
 ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491
 ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37
 __vfs_setxattr+0x208/0x23c fs/xattr.c:177
...
Currently, ext4 tries to store the extended attribute in an external inode if ext4_xattr_block_set() returns -ENOSPC, but when the attribute value is empty, storing the entry in the xattr block is enough. A simple reproducer is below.
fallocate test.img -l 1M
mkfs.ext4 -F -b 2048 -O ea_inode test.img
mount test.img /mnt
dd if=/dev/zero of=/mnt/foo bs=2048 count=500
setfattr -n "user.test" /mnt/foo
Reported-by: syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com Fixes: 9c6e7853c531 ("ext4: reserve space for xattr entries/names") Cc: stable@kernel.org Signed-off-by: zhangyi (F) yi.zhang@huawei.com Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/xattr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 0654b00bbdc1..a9bc07e2e1ae 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -2402,7 +2402,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index, * external inode if possible. */ if (ext4_has_feature_ea_inode(inode->i_sb) && - !i.in_inode) { + i.value_len && !i.in_inode) { i.in_inode = 1; goto retry_inode; }
From: Vincent Whitchurch vincent.whitchurch@axis.com
stable inclusion from linux-4.19.183 commit b0834edc70e402244ed8da96664368c15d869582
--------------------------------
commit 05946d4b7a7349ae58bfa2d51ae832e64a394c2d upstream.
smb311_update_preauth_hash() uses the shash in server->secmech without appropriate locking, and this can lead to sessions corrupting each other's preauth hashes.
The following script can easily trigger the problem:
#!/bin/sh -e
NMOUNTS=10
for i in $(seq $NMOUNTS); do
    mkdir -p /tmp/mnt$i
    umount /tmp/mnt$i 2>/dev/null || :
done
while :; do
    for i in $(seq $NMOUNTS); do
        mount -t cifs //192.168.0.1/test /tmp/mnt$i -o ... &
    done
    wait
    for i in $(seq $NMOUNTS); do
        umount /tmp/mnt$i
    done
done
Usually within seconds this leads to one or more of the mounts failing with the following errors, and a "Bad SMB2 signature for message" is seen in the server logs:
CIFS: VFS: \192.168.0.1 failed to connect to IPC (rc=-13)
CIFS: VFS: cifs_mount failed w/return code = -13
Fix it by holding the server mutex just like in the other places where the shashes are used.
Fixes: 8bd68c6e47abff34e4 ("CIFS: implement v3.11 preauth integrity") Signed-off-by: Vincent Whitchurch vincent.whitchurch@axis.com CC: stable@vger.kernel.org Reviewed-by: Aurelien Aptel aaptel@suse.com Signed-off-by: Steve French stfrench@microsoft.com [aaptel: backport to kernel without CIFS_SESS_OP and multichannel] Signed-off-by: Aurelien Aptel aaptel@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/transport.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c index 70412944b267..59643acb6d67 100644 --- a/fs/cifs/transport.c +++ b/fs/cifs/transport.c @@ -891,9 +891,12 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, /* * Compounding is never used during session establish. */ - if ((ses->status == CifsNew) || (optype & CIFS_NEG_OP)) + if ((ses->status == CifsNew) || (optype & CIFS_NEG_OP)) { + mutex_lock(&ses->server->srv_mutex); smb311_update_preauth_hash(ses, rqst[0].rq_iov, rqst[0].rq_nvec); + mutex_unlock(&ses->server->srv_mutex); + }
if (timeout == CIFS_ASYNC_OP) goto out; @@ -964,7 +967,9 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, .iov_base = resp_iov[0].iov_base, .iov_len = resp_iov[0].iov_len }; + mutex_lock(&ses->server->srv_mutex); smb311_update_preauth_hash(ses, &iov, 1); + mutex_unlock(&ses->server->srv_mutex); }
out:
From: Frank Sorenson sorenson@redhat.com
stable inclusion from linux-4.19.184 commit 5cd09eeadd277c301344975e403bcd95d9c0f125
--------------------------------
[ Upstream commit ad3dbe35c833c2d4d0bbf3f04c785d32f931e7c9 ]
CREATE requests return a post_op_fh3, rather than nfs_fh3. The post_op_fh3 includes an extra word to indicate 'handle_follows'.
Without that additional word, create fails when full 64-byte filehandles are in use.
Add NFS3_post_op_fh_sz, and correct the size calculation for NFS3_createres_sz.
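Spelling out the arithmetic behind the change: NFS3_fhandle_sz is (1+16) = 17 XDR words (a length word plus up to 64 bytes of handle data), so the new NFS3_post_op_fh_sz is 1+17 = 18 words. The old NFS3_createres_sz used NFS3_fh_sz and therefore under-reserved the reply buffer by exactly the one handle_follows word, which only bites once a server returns a full 64-byte filehandle.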
Signed-off-by: Frank Sorenson sorenson@redhat.com Signed-off-by: Anna Schumaker Anna.Schumaker@Netapp.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/nfs/nfs3xdr.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index 64e4fa33d89f..9761f97e2c08 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -34,6 +34,7 @@ */ #define NFS3_fhandle_sz (1+16) #define NFS3_fh_sz (NFS3_fhandle_sz) /* shorthand */ +#define NFS3_post_op_fh_sz (1+NFS3_fh_sz) #define NFS3_sattr_sz (15) #define NFS3_filename_sz (1+(NFS3_MAXNAMLEN>>2)) #define NFS3_path_sz (1+(NFS3_MAXPATHLEN>>2)) @@ -71,7 +72,7 @@ #define NFS3_readlinkres_sz (1+NFS3_post_op_attr_sz+1) #define NFS3_readres_sz (1+NFS3_post_op_attr_sz+3) #define NFS3_writeres_sz (1+NFS3_wcc_data_sz+4) -#define NFS3_createres_sz (1+NFS3_fh_sz+NFS3_post_op_attr_sz+NFS3_wcc_data_sz) +#define NFS3_createres_sz (1+NFS3_post_op_fh_sz+NFS3_post_op_attr_sz+NFS3_wcc_data_sz) #define NFS3_renameres_sz (1+(2 * NFS3_wcc_data_sz)) #define NFS3_linkres_sz (1+NFS3_post_op_attr_sz+NFS3_wcc_data_sz) #define NFS3_readdirres_sz (1+NFS3_post_op_attr_sz+2)
From: Daniel Wagner dwagner@suse.de
stable inclusion from linux-4.19.184 commit 4c083481b30a17568273deed595736e091d17a65
--------------------------------
[ Upstream commit 9ec491447b90ad6a4056a9656b13f0b3a1e83043 ]
register_disk() suppresses uevents for devices with the GENHD_FL_HIDDEN flag but re-enables uevents at the end in order to announce the disk after possible partitions have been created.
When the device is removed, uevents are still enabled, so userland sees 'remove' messages for devices that were never 'add'ed to the system.
KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)
Let's suppress the uevents for GENHD_FL_HIDDEN devices by not enabling uevents at all.
Signed-off-by: Daniel Wagner dwagner@suse.de Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Martin Wilck mwilck@suse.com Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/genhd.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c index e109a0702968..8df6420194a0 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -622,10 +622,8 @@ static void register_disk(struct device *parent, struct gendisk *disk) disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj); disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
- if (disk->flags & GENHD_FL_HIDDEN) { - dev_set_uevent_suppress(ddev, 0); + if (disk->flags & GENHD_FL_HIDDEN) return; - }
/* No minors to use for partitions */ if (!disk_part_scan_enabled(disk))
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.184 commit 76aa61c55279fdaa8d428236ba8834edf313b372
--------------------------------
commit 4edbe1d7bcffcd6269f3b5eb63f710393ff2ec7a upstream.
If there are no dm devices, we need to zero the "dev" field in the first dm_name_list structure. However, this can cause an out-of-bounds write: because no device matched, the "needed" size is zero, so the len < needed check passes even when len is smaller than the eight bytes of "nl->dev" that are about to be written.
Fix this bug by reporting DM_BUFFER_FULL_FLAG if the result buffer is too small to hold the "nl->dev" value.
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Reported-by: Dan Carpenter dan.carpenter@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/dm-ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c index f666778ad237..2b1db5135250 100644 --- a/drivers/md/dm-ioctl.c +++ b/drivers/md/dm-ioctl.c @@ -529,7 +529,7 @@ static int list_devices(struct file *filp, struct dm_ioctl *param, size_t param_ * Grab our output buffer. */ nl = orig_nl = get_result_buffer(param, param_size, &len); - if (len < needed) { + if (len < needed || len < sizeof(nl->dev)) { param->flags |= DM_BUFFER_FULL_FLAG; goto out; }
From: Florian Westphal fw@strlen.de
stable inclusion from linux-4.19.184 commit 2ca21906400986780cb5216e8bdd27201fd4a780
--------------------------------
[ Upstream commit b58f33d49e426dc66e98ed73afb5d97b15a25f2d ]
Before this change, the mask is never included in the netlink message, so "conntrack -E expect" always prints 0.0.0.0.
In older kernels the l3num callback struct was passed as argument, based on tuple->src.l3num. After the l3num indirection got removed, the call chain is based on m.src.l3num, but this value is 0xffff.
Init l3num to the correct value.
Fixes: f957be9d349a3 ("netfilter: conntrack: remove ctnetlink callbacks from l3 protocol trackers") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nf_conntrack_netlink.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 0b89609a6e9d..15c9fbcd32f2 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -2656,6 +2656,7 @@ static int ctnetlink_exp_dump_mask(struct sk_buff *skb, memset(&m, 0xFF, sizeof(m)); memcpy(&m.src.u3, &mask->src.u3, sizeof(m.src.u3)); m.src.u.all = mask->src.u.all; + m.src.l3num = tuple->src.l3num; m.dst.protonum = tuple->dst.protonum;
nest_parms = nla_nest_start(skb, CTA_EXPECT_MASK | NLA_F_NESTED);
From: Pavel Tatashin pasha.tatashin@soleen.com
stable inclusion from linux-4.19.184 commit df6f09cb7143be1f981f42ec363e5551c5d890c2
--------------------------------
[ Upstream commit 141f8202cfa4192c3af79b6cbd68e7760bb01b5a ]
The ppos points to a position in the old kernel memory (and, in the case of arm64, in the crash kernel, since elfcorehdr is passed as a segment). The function should advance ppos by the amount that was read. It is only by accident that this bug has not been exposed; other platforms update this value properly. So, fix it in the arm64 version of elfcorehdr_read() as well.
Signed-off-by: Pavel Tatashin pasha.tatashin@soleen.com Fixes: e62aaeac426a ("arm64: kdump: provide /proc/vmcore file") Reviewed-by: Tyler Hicks tyhicks@linux.microsoft.com Link: https://lore.kernel.org/r/20210319205054.743368-1-pasha.tatashin@soleen.com Signed-off-by: Will Deacon will@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/kernel/crash_dump.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c index f46d57c31443..76905a258550 100644 --- a/arch/arm64/kernel/crash_dump.c +++ b/arch/arm64/kernel/crash_dump.c @@ -67,5 +67,7 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf, ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos) { memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count); + *ppos += count; + return count; }
From: Mark Tomlinson mark.tomlinson@alliedtelesis.co.nz
stable inclusion from linux-4.19.184 commit 0abcfaf058d77aa6450ceb29985e50f72bf6b782
--------------------------------
[ Upstream commit d3d40f237480abf3268956daf18cdc56edd32834 ]
This reverts commit cc00bcaa589914096edef7fb87ca5cee4a166b5c.
This (and the preceding) patch basically re-implemented the RCU mechanisms of patch 784544739a25. That patch was replaced because of the performance problems that it created when replacing tables. Now, we have the same issue: the call to synchronize_rcu() makes replacing tables slower by as much as an order of magnitude.
Prior to using RCU a script calling "iptables" approx. 200 times was taking 1.16s. With RCU this increased to 11.59s.
Revert these patches and fix the issue in a different way.
Signed-off-by: Mark Tomlinson mark.tomlinson@alliedtelesis.co.nz Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netfilter/x_tables.h | 5 +-- net/ipv4/netfilter/arp_tables.c | 14 ++++----- net/ipv4/netfilter/ip_tables.c | 14 ++++----- net/ipv6/netfilter/ip6_tables.c | 14 ++++----- net/netfilter/x_tables.c | 49 +++++++++++++++++++++--------- 5 files changed, 56 insertions(+), 40 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 728d7716bf4f..9077b3ebea08 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -227,7 +227,7 @@ struct xt_table { unsigned int valid_hooks;
/* Man behind the curtain... */ - struct xt_table_info __rcu *private; + struct xt_table_info *private;
/* Set this to THIS_MODULE if you are a module, otherwise NULL */ struct module *me; @@ -449,9 +449,6 @@ xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)
struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);
-struct xt_table_info -*xt_table_get_private_protected(const struct xt_table *table); - #ifdef CONFIG_COMPAT #include <net/compat.h>
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index a2cae543a285..b1106d4507fd 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -202,7 +202,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
local_bh_disable(); addend = xt_write_recseq_begin(); - private = rcu_access_pointer(table->private); + private = READ_ONCE(table->private); /* Address dependency. */ cpu = smp_processor_id(); table_base = private->entries; jumpstack = (struct arpt_entry **)private->jumpstack[cpu]; @@ -648,7 +648,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change * (other than comefrom, which userspace doesn't care @@ -672,7 +672,7 @@ static int copy_entries_to_user(unsigned int total_size, unsigned int off, num; const struct arpt_entry *e; struct xt_counters *counters; - struct xt_table_info *private = xt_table_get_private_protected(table); + struct xt_table_info *private = table->private; int ret = 0; void *loc_cpu_entry;
@@ -807,7 +807,7 @@ static int get_info(struct net *net, void __user *user, t = xt_request_find_table_lock(net, NFPROTO_ARP, name); if (!IS_ERR(t)) { struct arpt_getinfo info; - const struct xt_table_info *private = xt_table_get_private_protected(t); + const struct xt_table_info *private = t->private; #ifdef CONFIG_COMPAT struct xt_table_info tmp;
@@ -860,7 +860,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
t = xt_find_table_lock(net, NFPROTO_ARP, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = xt_table_get_private_protected(t); + const struct xt_table_info *private = t->private;
if (get.size == private->size) ret = copy_entries_to_user(private->size, @@ -1019,7 +1019,7 @@ static int do_add_counters(struct net *net, const void __user *user, }
local_bh_disable(); - private = xt_table_get_private_protected(t); + private = t->private; if (private->number != tmp.num_counters) { ret = -EINVAL; goto unlock_up_free; @@ -1356,7 +1356,7 @@ static int compat_copy_entries_to_user(unsigned int total_size, void __user *userptr) { struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private; void __user *pos; unsigned int size; int ret = 0; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 6672172a7512..2c1d66bef720 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -261,7 +261,7 @@ ipt_do_table(struct sk_buff *skb, WARN_ON(!(table->valid_hooks & (1 << hook))); local_bh_disable(); addend = xt_write_recseq_begin(); - private = rcu_access_pointer(table->private); + private = READ_ONCE(table->private); /* Address dependency. */ cpu = smp_processor_id(); table_base = private->entries; jumpstack = (struct ipt_entry **)private->jumpstack[cpu]; @@ -794,7 +794,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change (other than comefrom, which userspace doesn't care @@ -818,7 +818,7 @@ copy_entries_to_user(unsigned int total_size, unsigned int off, num; const struct ipt_entry *e; struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private; int ret = 0; const void *loc_cpu_entry;
@@ -968,7 +968,7 @@ static int get_info(struct net *net, void __user *user, t = xt_request_find_table_lock(net, AF_INET, name); if (!IS_ERR(t)) { struct ipt_getinfo info; - const struct xt_table_info *private = xt_table_get_private_protected(t); + const struct xt_table_info *private = t->private; #ifdef CONFIG_COMPAT struct xt_table_info tmp;
@@ -1022,7 +1022,7 @@ get_entries(struct net *net, struct ipt_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET, get.name); if (!IS_ERR(t)) { - const struct xt_table_info *private = xt_table_get_private_protected(t); + const struct xt_table_info *private = t->private; if (get.size == private->size) ret = copy_entries_to_user(private->size, t, uptr->entrytable); @@ -1178,7 +1178,7 @@ do_add_counters(struct net *net, const void __user *user, }
local_bh_disable(); - private = xt_table_get_private_protected(t); + private = t->private; if (private->number != tmp.num_counters) { ret = -EINVAL; goto unlock_up_free; @@ -1573,7 +1573,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table, void __user *userptr) { struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private; void __user *pos; unsigned int size; int ret = 0; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 3b067d5a62ee..19eb40355dd1 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -283,7 +283,7 @@ ip6t_do_table(struct sk_buff *skb,
local_bh_disable(); addend = xt_write_recseq_begin(); - private = rcu_access_pointer(table->private); + private = READ_ONCE(table->private); /* Address dependency. */ cpu = smp_processor_id(); table_base = private->entries; jumpstack = (struct ip6t_entry **)private->jumpstack[cpu]; @@ -810,7 +810,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change (other than comefrom, which userspace doesn't care @@ -834,7 +834,7 @@ copy_entries_to_user(unsigned int total_size, unsigned int off, num; const struct ip6t_entry *e; struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private; int ret = 0; const void *loc_cpu_entry;
@@ -984,7 +984,7 @@ static int get_info(struct net *net, void __user *user, t = xt_request_find_table_lock(net, AF_INET6, name); if (!IS_ERR(t)) { struct ip6t_getinfo info; - const struct xt_table_info *private = xt_table_get_private_protected(t); + const struct xt_table_info *private = t->private; #ifdef CONFIG_COMPAT struct xt_table_info tmp;
@@ -1039,7 +1039,7 @@ get_entries(struct net *net, struct ip6t_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET6, get.name); if (!IS_ERR(t)) { - struct xt_table_info *private = xt_table_get_private_protected(t); + struct xt_table_info *private = t->private; if (get.size == private->size) ret = copy_entries_to_user(private->size, t, uptr->entrytable); @@ -1194,7 +1194,7 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len, }
local_bh_disable(); - private = xt_table_get_private_protected(t); + private = t->private; if (private->number != tmp.num_counters) { ret = -EINVAL; goto unlock_up_free; @@ -1582,7 +1582,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table, void __user *userptr) { struct xt_counters *counters; - const struct xt_table_info *private = xt_table_get_private_protected(table); + const struct xt_table_info *private = table->private; void __user *pos; unsigned int size; int ret = 0; diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 1314de5f317f..8b83806f4f8c 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1356,14 +1356,6 @@ struct xt_counters *xt_counters_alloc(unsigned int counters) } EXPORT_SYMBOL(xt_counters_alloc);
-struct xt_table_info -*xt_table_get_private_protected(const struct xt_table *table) -{ - return rcu_dereference_protected(table->private, - mutex_is_locked(&xt[table->af].mutex)); -} -EXPORT_SYMBOL(xt_table_get_private_protected); - struct xt_table_info * xt_replace_table(struct xt_table *table, unsigned int num_counters, @@ -1371,6 +1363,7 @@ xt_replace_table(struct xt_table *table, int *error) { struct xt_table_info *private; + unsigned int cpu; int ret;
ret = xt_jumpstack_alloc(newinfo); @@ -1380,20 +1373,47 @@ xt_replace_table(struct xt_table *table, }
/* Do the substitution. */ - private = xt_table_get_private_protected(table); + local_bh_disable(); + private = table->private;
/* Check inside lock: is the old number correct? */ if (num_counters != private->number) { pr_debug("num_counters != table->private->number (%u/%u)\n", num_counters, private->number); + local_bh_enable(); *error = -EAGAIN; return NULL; }
newinfo->initial_entries = private->initial_entries; + /* + * Ensure contents of newinfo are visible before assigning to + * private. + */ + smp_wmb(); + table->private = newinfo; + + /* make sure all cpus see new ->private value */ + smp_wmb();
- rcu_assign_pointer(table->private, newinfo); - synchronize_rcu(); + /* + * Even though table entries have now been swapped, other CPU's + * may still be using the old entries... + */ + local_bh_enable(); + + /* ... so wait for even xt_recseq on all cpus */ + for_each_possible_cpu(cpu) { + seqcount_t *s = &per_cpu(xt_recseq, cpu); + u32 seq = raw_read_seqcount(s); + + if (seq & 1) { + do { + cond_resched(); + cpu_relax(); + } while (seq == raw_read_seqcount(s)); + } + }
#ifdef CONFIG_AUDIT if (audit_enabled) { @@ -1434,12 +1454,12 @@ struct xt_table *xt_register_table(struct net *net, }
/* Simplifies replace_table code. */ - rcu_assign_pointer(table->private, bootstrap); + table->private = bootstrap;
if (!xt_replace_table(table, 0, newinfo, &ret)) goto unlock;
- private = xt_table_get_private_protected(table); + private = table->private; pr_debug("table->private->number = %u\n", private->number);
/* save number of initial entries */ @@ -1462,8 +1482,7 @@ void *xt_unregister_table(struct xt_table *table) struct xt_table_info *private;
mutex_lock(&xt[table->af].mutex); - private = xt_table_get_private_protected(table); - RCU_INIT_POINTER(table->private, NULL); + private = table->private; list_del(&table->list); mutex_unlock(&xt[table->af].mutex); kfree(table);
From: Hillf Danton hdanton@sina.com
mainline inclusion
from mainline-v5.7-rc1
commit ae46d2aa6a7fbe8ca0946f24b061b6ccdc6c3f25
category: bugfix
bugzilla: 47439
CVE: NA
---------------------------
__get_user_pages_locked() will return 0 instead of -EINTR after commit 4426e945df588 ("mm/gup: allow VM_FAULT_RETRY for multiple times"), which added extra code to let gup detect a fatal signal faster.
Restore the original -EINTR behavior.
Cc: Andrew Morton akpm@linux-foundation.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times") Reported-by: syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com Signed-off-by: Hillf Danton hdanton@sina.com Acked-by: Michal Hocko mhocko@suse.com Signed-off-by: Peter Xu peterx@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/gup.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c index 8be20cbec785..83f0737e57a7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -947,8 +947,11 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, * start trying again otherwise it can loop forever. */
- if (fatal_signal_pending(current)) + if (fatal_signal_pending(current)) { + if (!pages_done) + pages_done = -EINTR; break; + }
*locked = 1; down_read(&mm->mmap_sem);
From: Peter Xu peterx@redhat.com
mainline inclusion
from mainline-v5.7-rc1
commit ba841078cd0557b43b59c63f5c048b12168f0db2
category: bugfix
bugzilla: 47439
CVE: NA
---------------------------
lookup_node() uses gup to pin the page and get node information. It checks against ret>=0 assuming the page will be filled in. However, it is also possible for gup to return zero, for example when the thread is quickly killed with a fatal signal. Teach lookup_node() to gracefully return -EFAULT if that happens.

Meanwhile, initialize "page" to NULL to avoid any risk of using an uninitialized pointer.
Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times") Reported-by: syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com Signed-off-by: Peter Xu peterx@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Conflicts: mm/mempolicy.c Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/mempolicy.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 8f420c64934e..59c7e6069c1e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -897,11 +897,14 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
static int lookup_node(unsigned long addr) { - struct page *p; + struct page *p = NULL; int err;
err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL); - if (err >= 0) { + if (err == 0) { + /* E.g. GUP interrupted by fatal signal */ + err = -EFAULT; + } else if (err > 0) { err = page_to_nid(p); put_page(p); }
From: Michal Hocko mhocko@suse.com
mainline inclusion
from mainline-v5.8-rc1
commit 2d3a36a47964371101d9a71691c18d59ee611e87
category: bugfix
bugzilla: 47439
CVE: NA
---------------------------
ba841078cd05 ("mm/mempolicy: Allow lookup_node() to handle fatal signal") added special-casing for a 0 return value because that was a possible gup return value when interrupted by a fatal signal. This has since been fixed by ae46d2aa6a7f ("mm/gup: Let __get_user_pages_locked() return -EINTR for fatal signal"), so ba841078cd05 can be reverted.

This patch, however, doesn't go all the way to a plain revert, because the check for 0 is wrong and confusing here. Firstly, it is inherently unsafe to access the page when get_user_pages_locked returns 0 (i.e. no page returned).

Fortunately, this will not happen, because get_user_pages_locked will not return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified, which is not the case here. Document this potential error code in the gup code while we are at it.
Signed-off-by: Michal Hocko mhocko@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Cc: Peter Xu peterx@redhat.com Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/gup.c [wangxiongfeng: conflicts in comments ] Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/gup.c | 5 +++++ mm/mempolicy.c | 5 +---- 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c index 83f0737e57a7..5801d4bd523a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -632,6 +632,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) * were pinned, returns -errno. Each page returned must be released * with a put_page() call when it is finished with. vmas will only * remain valid while mmap_sem is held. + * -- 0 return value is possible when the fault would need to be retried. * * Must be called with mmap_sem held. It may be released. See below. * @@ -877,6 +878,10 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, } EXPORT_SYMBOL_GPL(fixup_user_fault);
+/* + * Please note that this function, unlike __get_user_pages will not + * return 0 for nr_pages > 0 without FOLL_NOWAIT + */ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 59c7e6069c1e..0bd78e8cdf89 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -901,10 +901,7 @@ static int lookup_node(unsigned long addr) int err;
err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL); - if (err == 0) { - /* E.g. GUP interrupted by fatal signal */ - err = -EFAULT; - } else if (err > 0) { + if (err > 0) { err = page_to_nid(p); put_page(p); }
From: Guo Fan guofan5@huawei.com
hulk inclusion
category: feature
bugzilla: 47439
CVE: NA
-------------------------------------------------
To make sure no other userspace threads can access the memory region we are swapping out, we need to unmap the memory region, map it to a new address, and use the new address to perform the swapout. We add a new flag 'MAP_REPLACE' for mmap() that unmaps the pages at the input address 'VA' and remaps them to a new tmpVA.
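A rough sketch of the calling convention this gives userspace (an illustration based on the patch below, not a tested example; the helper name is invented and error handling is elided):

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_REPLACE
#define MAP_REPLACE 0x1000000	/* value introduced by this patch */
#endif

/* Ask the kernel to unmap the pages at 'va' and remap them at a new
 * address of its choosing; the swapper then works on the new mapping. */
static void *uswap_replace(void *va, size_t len, int *dirty)
{
	void *tmp = mmap(va, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_REPLACE, -1, 0);

	if (tmp == MAP_FAILED)
		return NULL;
	/* Per do_user_swap() below, bit 0 of the returned address
	 * reports whether any of the pages were dirty. */
	*dirty = (int)((uintptr_t)tmp & 1);
	return (void *)((uintptr_t)tmp & ~(uintptr_t)1);
}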
Signed-off-by: Guo Fan guofan5@huawei.com Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/proc/task_mmu.c | 3 + include/linux/mm.h | 5 + include/linux/swap.h | 12 +- include/trace/events/mmflags.h | 7 ++ include/uapi/asm-generic/mman.h | 4 + mm/Kconfig | 9 ++ mm/mmap.c | 207 ++++++++++++++++++++++++++++++++ 7 files changed, 246 insertions(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ac7f57badcfd..66939a7998ab 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -665,6 +665,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_PKEY_BIT4)] = "", #endif #endif /* CONFIG_ARCH_HAS_PKEYS */ +#ifdef CONFIG_USERSWAP + [ilog2(VM_USWAP)] = "us", +#endif }; size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h index 073295cc94f3..61734ef3c184 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -236,6 +236,11 @@ extern unsigned int kobjsize(const void *objp);
#define VM_CHECKNODE 0x200000000
+#ifdef CONFIG_USERSWAP +/* bit[32:36] is the protection key of intel, so use a large value for VM_USWAP */ +#define VM_USWAP 0x2000000000000000 +#endif + #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS #define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */ diff --git a/include/linux/swap.h b/include/linux/swap.h index c6f9dba6d713..b7cfad35987a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -52,6 +52,16 @@ static inline int current_is_kswapd(void) * actions on faults. */
+/* + * Userswap entry type + */ +#ifdef CONFIG_USERSWAP +#define SWP_USERSWAP_NUM 1 +#define SWP_USERSWAP_ENTRY (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+SWP_DEVICE_NUM) +#else +#define SWP_USERSWAP_NUM 0 +#endif + /* * Unaddressable device memory support. See include/linux/hmm.h and * Documentation/vm/hmm.rst. Short description is we need struct pages for @@ -92,7 +102,7 @@ static inline int current_is_kswapd(void)
#define MAX_SWAPFILES \ ((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \ - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) + SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_USERSWAP_NUM)
/* * Magic header for a swap area. The first part of the union is diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 2994f1c86a46..b817bf1885a0 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -130,6 +130,12 @@ IF_HAVE_PG_IDLE(PG_idle, "idle"), \ #define IF_HAVE_VM_SOFTDIRTY(flag,name) #endif
+#ifdef CONFIG_USERSWAP +#define IF_HAVE_VM_USWAP(flag,name) {flag, name }, +#else +#define IF_HAVE_VM_USWAP(flag,name) +#endif + #define __def_vmaflag_names \ {VM_READ, "read" }, \ {VM_WRITE, "write" }, \ @@ -161,6 +167,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ {VM_MIXEDMAP, "mixedmap" }, \ {VM_HUGEPAGE, "hugepage" }, \ {VM_NOHUGEPAGE, "nohugepage" }, \ +IF_HAVE_VM_USWAP(VM_USWAP, "userswap" ) \ {VM_MERGEABLE, "mergeable" } \
#define show_vma_flags(flags) \ diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h index 233a5e82407c..defdf92911c3 100644 --- a/include/uapi/asm-generic/mman.h +++ b/include/uapi/asm-generic/mman.h @@ -17,6 +17,10 @@ #define MAP_SYNC 0x80000 /* perform synchronous page faults for the mapping */ #define MAP_PA32BIT 0x400000 /* physical address is within 4G */
+#ifdef CONFIG_USERSWAP +#define MAP_REPLACE 0x1000000 +#endif + /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
#define MCL_CURRENT 1 /* lock all current mappings */ diff --git a/mm/Kconfig b/mm/Kconfig index 7ebd52dc1e40..4e075a27d737 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -503,6 +503,15 @@ config SHRINK_PAGECACHE
if unsure, say N to disable the SHRINK_PAGECACHE.
+config USERSWAP + bool "Enable User Swap" + depends on MMU && USERFAULTFD + depends on X86 || ARM64 + default n + help + Support for User Swap. This is based on userfaultfd. We can implement + our own swapout and swapin functions in usersapce. + config CMA bool "Contiguous Memory Allocator" depends on HAVE_MEMBLOCK && MMU diff --git a/mm/mmap.c b/mm/mmap.c index 3fcfed26d298..1b0eda02dc7f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -46,6 +46,7 @@ #include <linux/pkeys.h> #include <linux/oom.h> #include <linux/sched/mm.h> +#include <linux/swapops.h>
#include <linux/uaccess.h> #include <asm/cacheflush.h> @@ -1372,6 +1373,169 @@ int unregister_mmap_notifier(struct notifier_block *nb) EXPORT_SYMBOL_GPL(unregister_mmap_notifier); #endif
+#ifdef CONFIG_USERSWAP +/* + * Check if pages between 'addr ~ addr+len' can be user swapped. If so, get + * the reference of the pages and return the pages through input parameters + * 'ppages'. + */ +int pages_can_be_swapped(struct mm_struct *mm, unsigned long addr, + unsigned long len, struct page ***ppages) +{ + struct vm_area_struct *vma; + struct page *page = NULL; + struct page **pages = NULL; + unsigned long addr_start, addr_end; + unsigned long ret; + int i, page_num = 0; + + pages = kmalloc(sizeof(struct page *) * (len / PAGE_SIZE), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + addr_start = addr; + addr_end = addr + len; + while (addr < addr_end) { + vma = find_vma(mm, addr); + if (!vma || !vma_is_anonymous(vma) || + (vma->vm_flags & VM_LOCKED) || vma->vm_file + || (vma->vm_flags & VM_STACK) || (vma->vm_flags & (VM_IO | VM_PFNMAP))) { + ret = -EINVAL; + goto out; + } + if (!(vma->vm_flags & VM_UFFD_MISSING)) { + ret = -EAGAIN; + goto out; + } +get_again: + /* follow_page will inc page ref, dec the ref after we remap the page */ + page = follow_page(vma, addr, FOLL_GET); + if (IS_ERR_OR_NULL(page)) { + ret = -ENODEV; + goto out; + } + pages[page_num] = page; + page_num++; + if (!PageAnon(page) || !PageSwapBacked(page) || PageHuge(page) || PageSwapCache(page)) { + ret = -EINVAL; + goto out; + } else if (PageTransCompound(page)) { + if (trylock_page(page)) { + if (!split_huge_page(page)) { + put_page(page); + page_num--; + unlock_page(page); + goto get_again; + } else { + unlock_page(page); + ret = -EINVAL; + goto out; + } + } else { + ret = -EINVAL; + goto out; + } + } + if (page_mapcount(page) > 1 || page_mapcount(page) + 1 != page_count(page)) { + ret = -EBUSY; + goto out; + } + addr += PAGE_SIZE; + } + + *ppages = pages; + return 0; + +out: + for (i = 0; i < page_num; i++) + put_page(pages[i]); + if (pages) + kfree(pages); + *ppages = NULL; + return ret; +} + +/* + * In uswap situation, we use the bit 0 of the returned address to indicate + * whether the pages are dirty. 
+ */ +#define USWAP_PAGES_DIRTY 1 + +/* unmap the pages between 'addr ~ addr+len' and remap them to a new address */ +unsigned long do_user_swap(struct mm_struct *mm, unsigned long addr_start, + unsigned long len, struct page **pages, unsigned long new_addr) +{ + struct vm_area_struct *vma; + struct page *page; + pmd_t *pmd; + pte_t *pte, old_pte; + spinlock_t *ptl; + unsigned long addr, addr_end; + bool pages_dirty = false; + int i, err; + + addr_end = addr_start + len; + lru_add_drain(); + mmu_notifier_invalidate_range_start(mm, addr_start, addr_end); + addr = addr_start; + i = 0; + while (addr < addr_end) { + page = pages[i]; + vma = find_vma(mm, addr); + if (!vma) { + mmu_notifier_invalidate_range_end(mm, addr_start, addr_end); + WARN_ON("find_vma failed\n"); + return -EINVAL; + } + pmd = mm_find_pmd(mm, addr); + if (!pmd) { + mmu_notifier_invalidate_range_end(mm, addr_start, addr_end); + WARN_ON("mm_find_pmd failed, addr:%llx\n"); + return -ENXIO; + } + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + flush_cache_page(vma, addr, pte_pfn(*pte)); + old_pte = ptep_clear_flush(vma, addr, pte); + if (pte_dirty(old_pte) || PageDirty(page)) + pages_dirty = true; + set_pte(pte, swp_entry_to_pte(swp_entry(SWP_USERSWAP_ENTRY, page_to_pfn(page)))); + dec_mm_counter(mm, MM_ANONPAGES); + page_remove_rmap(page, false); + put_page(page); + + pte_unmap_unlock(pte, ptl); + vma->vm_flags |= VM_USWAP; + page->mapping = NULL; + addr += PAGE_SIZE; + i++; + } + mmu_notifier_invalidate_range_end(mm, addr_start, addr_end); + + addr_start = new_addr; + addr_end = new_addr + len; + addr = addr_start; + vma = find_vma(mm, addr); + i = 0; + while (addr < addr_end) { + page = pages[i]; + if (addr > vma->vm_end - 1) + vma = find_vma(mm, addr); + err = vm_insert_page(vma, addr, page); + if (err) { + pr_err("vm_insert_page failed:%d\n", err); + } + i++; + addr += PAGE_SIZE; + } + vma->vm_flags |= VM_USWAP; + + if (pages_dirty) + new_addr = new_addr | USWAP_PAGES_DIRTY; + + return new_addr; +} +#endif + /* * The caller must hold down_write(¤t->mm->mmap_sem). */ @@ -1383,6 +1547,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr, { struct mm_struct *mm = current->mm; int pkey = 0; +#ifdef CONFIG_USERSWAP + struct page **pages = NULL; + unsigned long addr_start = addr; + int i, page_num = 0; + unsigned long ret; +#endif
*populate = 0;
@@ -1399,6 +1569,17 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (!(file && path_noexec(&file->f_path))) prot |= PROT_EXEC;
+#ifdef CONFIG_USERSWAP + if (flags & MAP_REPLACE) { + if (offset_in_page(addr) || (len % PAGE_SIZE)) + return -EINVAL; + page_num = len / PAGE_SIZE; + ret = pages_can_be_swapped(mm, addr, len, &pages); + if (ret) + return ret; + } +#endif + /* force arch specific MAP_FIXED handling in get_unmapped_area */ if (flags & MAP_FIXED_NOREPLACE) flags |= MAP_FIXED; @@ -1571,12 +1752,38 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (flags & MAP_CHECKNODE) set_vm_checknode(&vm_flags, flags);
+#ifdef CONFIG_USERSWAP + /* mark the vma as special to avoid merging with other vmas */ + if (flags & MAP_REPLACE) + vm_flags |= VM_SPECIAL; +#endif + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) *populate = len; +#ifndef CONFIG_USERSWAP return addr; +#else + if (!(flags & MAP_REPLACE)) + return addr; + + if (IS_ERR_VALUE(addr)) { + pr_info("mmap_region failed, return addr:%lx\n", addr); + ret = addr; + goto out; + } + + ret = do_user_swap(mm, addr_start, len, pages, addr); +out: + /* follow_page() above increased the reference*/ + for (i = 0; i < page_num; i++) + put_page(pages[i]); + if (pages) + kfree(pages); + return ret; +#endif }
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
From: Guo Fan guofan5@huawei.com
hulk inclusion
category: feature
bugzilla: 47439
CVE: NA
-------------------------------------------------
This patch modifies userfaultfd to support userswap. To be able to check whether the pages have been dirtied since the last swap-in, we make them clean when we swap them in. Userspace may also swap in a large area of which only part was swapped out; we need to skip the pages that were never swapped out.
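For illustration, registration from a userspace swapper might look like the sketch below (UFFDIO_REGISTER_MODE_USWAP is taken from this patch; the surrounding setup is abbreviated and untested):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

#ifndef UFFDIO_REGISTER_MODE_USWAP
#define UFFDIO_REGISTER_MODE_USWAP ((__u64)1 << 2)	/* from this patch */
#endif

static int uswap_register(void *base, unsigned long len)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)base, .len = len },
		/* MISSING delivers faults on the swapped-out pages to us;
		 * USWAP marks the vma so the special userswap entries are
		 * routed to handle_userfault() as in do_swap_page() below. */
		.mode = UFFDIO_REGISTER_MODE_MISSING |
			UFFDIO_REGISTER_MODE_USWAP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;
	return uffd;
}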
Signed-off-by: Guo Fan guofan5@huawei.com Signed-off-by: Xiongfeng Wang wangxiongfeng2@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/userfaultfd.c | 26 +++++++++++++++++++++++++- include/linux/userfaultfd_k.h | 4 ++++ include/uapi/linux/userfaultfd.h | 3 +++ mm/memory.c | 19 +++++++++++++++++++ mm/userfaultfd.c | 26 ++++++++++++++++++++++++++ 5 files changed, 77 insertions(+), 1 deletion(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index d269d1139f7f..0d19adb40dc2 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -327,6 +327,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, * Lockless access: we're in a wait_event so it's ok if it * changes under us. */ +#ifdef CONFIG_USERSWAP + if ((reason & VM_USWAP) && (!pte_present(*pte))) + ret = true; +#endif if (pte_none(*pte)) ret = true; pte_unmap(pte); @@ -1321,10 +1325,30 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, ret = -EINVAL; if (!uffdio_register.mode) goto out; + vm_flags = 0; +#ifdef CONFIG_USERSWAP + /* + * register the whole vma overlapping with the address range to avoid + * splitting the vma. + */ + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_USWAP) { + uffdio_register.mode &= ~UFFDIO_REGISTER_MODE_USWAP; + vm_flags |= VM_USWAP; + end = uffdio_register.range.start + uffdio_register.range.len - 1; + vma = find_vma(mm, uffdio_register.range.start); + if (!vma) + goto out; + uffdio_register.range.start = vma->vm_start; + + vma = find_vma(mm, end); + if (!vma) + goto out; + uffdio_register.range.len = vma->vm_end - uffdio_register.range.start; + } +#endif if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING| UFFDIO_REGISTER_MODE_WP)) goto out; - vm_flags = 0; if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING) vm_flags |= VM_UFFD_MISSING; if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) { diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 37c9eba75c98..5912381ec765 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -47,7 +47,11 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
static inline bool userfaultfd_missing(struct vm_area_struct *vma) { +#ifdef CONFIG_USERSWAP + return (vma->vm_flags & VM_UFFD_MISSING) && !(vma->vm_flags & VM_USWAP); +#else return vma->vm_flags & VM_UFFD_MISSING; +#endif }
static inline bool userfaultfd_armed(struct vm_area_struct *vma) diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 48f1a7c2f1f0..42e0f860e7f7 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -190,6 +190,9 @@ struct uffdio_register { struct uffdio_range range; #define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0) #define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1) +#ifdef CONFIG_USERSWAP +#define UFFDIO_REGISTER_MODE_USWAP ((__u64)1<<2) +#endif __u64 mode;
/* diff --git a/mm/memory.c b/mm/memory.c index 17f3016c7acd..dbf7fd76958a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2769,6 +2769,25 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out;
entry = pte_to_swp_entry(vmf->orig_pte); +#ifdef CONFIG_USERSWAP + if (swp_type(entry) == SWP_USERSWAP_ENTRY) { + /* print error if we come across a nested fault */ + if (!strncmp(current->comm, "uswap", 5)) { + pr_err("USWAP: fault %lx is triggered by %s\n", + vmf->address, current->comm); + return VM_FAULT_SIGBUS; + } + if (!(vma->vm_flags & VM_UFFD_MISSING)) { + pr_err("USWAP: addr %lx flags %lx is not a user swap page", + vmf->address, vma->vm_flags); + goto skip_uswap; + } + BUG_ON(!(vma->vm_flags & VM_UFFD_MISSING)); + ret = handle_userfault(vmf, VM_UFFD_MISSING | VM_USWAP); + return ret; + } +skip_uswap: +#endif if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { migration_entry_wait(vma->vm_mm, vmf->pmd, diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 7529d3fcc899..cc6ea42d1ea8 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -60,6 +60,10 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, *pagep = NULL; }
+#ifdef CONFIG_USERSWAP + if (dst_vma->vm_flags & VM_USWAP) + ClearPageDirty(page); +#endif /* * The memory barrier inside __SetPageUptodate makes sure that * preceeding stores to the page contents become visible before @@ -74,6 +78,10 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, _dst_pte = mk_pte(page, dst_vma->vm_page_prot); if (dst_vma->vm_flags & VM_WRITE) _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte)); +#ifdef CONFIG_USERSWAP + if (dst_vma->vm_flags & VM_USWAP) + _dst_pte = pte_mkclean(_dst_pte); +#endif
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); if (dst_vma->vm_file) { @@ -85,9 +93,27 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, if (unlikely(offset >= max_off)) goto out_release_uncharge_unlock; } + +#ifdef CONFIG_USERSWAP + if (!(dst_vma->vm_flags & VM_USWAP)) { + ret = -EEXIST; + if (!pte_none(*dst_pte)) + goto out_release_uncharge_unlock; + } else { + /* + * The userspace may swap in a large area. Part of the area is + * not swapped out. Skip those pages. + */ + ret = 0; + if (swp_type(pte_to_swp_entry(*dst_pte)) != SWP_USERSWAP_ENTRY || + pte_present(*dst_pte)) + goto out_release_uncharge_unlock; + } +#else ret = -EEXIST; if (!pte_none(*dst_pte)) goto out_release_uncharge_unlock; +#endif
inc_mm_counter(dst_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-5.2-rc1
commit 68ad4a3304335358f95a417f2a2b0c909e5119c4
category: bugfix
bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
CVE: NA

-------------------------------------------------

Patch series "improve vmap allocation", v3.
Objective ---------
Please have a look at the description at:
https://lkml.org/lkml/2018/10/19/786
but let me also summarize it a bit here as well.
The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation times. When I say "long" I mean milliseconds.
Description -----------
This approach organizes the KVA memory layout into free areas spanning the 1..ULONG_MAX range, i.e. an allocation is done by looking up free areas instead of finding a hole between two busy blocks. It allows a lower number of objects to represent the free space, and therefore a less fragmented memory allocator, because free blocks are always kept as large as possible.

It uses an augmented tree where all free areas are sorted in ascending order of va->va_start address, paired with a linked list that provides O(1) access to prev/next elements.

Since the tree is augmented, we also maintain for each VA a "subtree_max_size" that reflects the maximum available free block in its left or right sub-tree. Knowing that, we can easily traverse toward the lowest (left-most) suitable free area.

Allocation: ~O(log(N)) complexity. It is a sequential allocation method and therefore tends to maximize locality. The search runs until the first suitable block large enough to encompass the requested parameters is found. Bigger areas are split.

I copy-paste here the description of how an area is split, since I described it in https://lkml.org/lkml/2018/10/19/786
<snip>
A free block can be split in three different ways. The names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how the requested size and alignment fit within a free block.

FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Compared with the current design there is the extra work of updating the rb-tree.

LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits. In this case all we do is cut the free block. It is as fast as the current design. Most vmalloc allocations end up in this case, because the edge is always aligned to 1.

NE_FIT_TYPE - a much less common case. Basically it happens when the requested size and alignment fit neither the left nor the right edge, i.e. the allocation lies between them. In this case, during splitting, we have to build a remaining left free area and place it back in the free list/tree.

Compared with the current design there are two extra steps. First, we have to allocate a new vmap_area structure. Second, we have to insert that remaining free block into the address-sorted list/tree.

To optimize the first step there is a cache of free vmap objects: instead of allocating from the slab we just take an object from the cache and reuse it.

The second step is already well optimized: since we know the starting point in the tree, we do not search from the top; instead, the traversal begins from the rb-tree node we split. <snip>
De-allocation: ~O(log(N)) complexity. An area is not inserted straight into the tree/list; instead we identify the spot first and check whether it can be merged with its neighbours. The list provides O(1) access to prev/next, so checking is fast. Summarizing: if the area is merged, large coalesced areas are created; if not, the area is just linked, making more fragments.
There is one more thing I should mention here. After a VA node is modified, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. On top of that, the value may also be propagated back up to higher levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and its description.
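To make the augmented-tree mechanics concrete, here is a minimal self-contained model of the two key operations (an illustration only: plain child pointers instead of rb_node, and alignment/vstart handling is omitted, so the names and structure deliberately do not match mm/vmalloc.c):

struct area {
	unsigned long va_start, va_end;	/* free block [va_start, va_end) */
	unsigned long subtree_max_size;	/* augment: max free size below */
	struct area *left, *right;	/* address-sorted children */
};

static unsigned long own_size(const struct area *a)
{
	return a->va_end - a->va_start;
}

/* Recompute the augment after a node or a child changes; the kernel's
 * __augment_tree_propagate_from() repeats this up to the root. */
static void update_subtree_max_size(struct area *a)
{
	unsigned long max = own_size(a);

	if (a->left && a->left->subtree_max_size > max)
		max = a->left->subtree_max_size;
	if (a->right && a->right->subtree_max_size > max)
		max = a->right->subtree_max_size;
	a->subtree_max_size = max;
}

/* ~O(log(N)) search for the free block with the lowest start address
 * that is big enough: prefer the left subtree whenever its augment
 * promises a fit there. */
static struct area *find_lowest_fit(struct area *a, unsigned long size)
{
	while (a) {
		if (a->left && a->left->subtree_max_size >= size)
			a = a->left;
		else if (own_size(a) >= size)
			return a;
		else
			a = a->right;
	}
	return NULL;
}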
Tests and stressing -------------------
I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it.
Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding the last one, I do not have physical access to a NUMA system, so I emulated it. The stressing ran for days.
If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da31...
After massive testing, I have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good.
Performance analysis --------------------
I have used two systems for testing. One has an i5-3320M CPU @ 2.60GHz and the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel, whereas the HiKey960 uses 4.15. Both systems could also run 5.1-rc1, but those results were not ready by the time of writing.
Currently the suite consists of 8 tests. Three of them correspond to the different types of splitting described above (to compare with the default behaviour). The other five do allocations under different conditions.
a) sudo ./test_vmalloc.sh performance
When the test driver is run in "performance" mode, it runs all available tests pinned to the first online CPU with a sequential test execution order. We do this in order to get stable and repeatable results. Take a look at the time difference in "long_busy_list_alloc_test"; it is not surprising, because the worst case is O(N).
How many cycles all tests took: CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
How many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
b) time sudo ./test_vmalloc.sh test_repeat_count=1
With this configuration, all tests are run on all available online CPUs. Before running, each CPU shuffles its test execution order, which gives random allocation behaviour. So it is a rough comparison, but it certainly paints the picture.
         <default>      <patched>
real     101m22.813s    0m56.805s
user     0m0.011s       0m0.015s
sys      0m5.076s       0m0.023s
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
         <default>      <patched>
real     unknown        4m25.214s
user     unknown        0m0.011s
sys      unknown        0m0.670s
I did not manage to complete this test on the default HiKey960 kernel. After 24 hours it was still running, so I had to cancel it. That is why real/user/sys are "unknown".
This patch (of 3):
Currently, allocation of a new vmap area is done by iterating over the busy list (complexity O(N)) until a suitable hole is found between two busy areas. Each new allocation therefore grows the list. With an over-fragmented list and differing permissive parameters, an allocation can take a long time; on embedded devices, for example, it takes milliseconds.
This patch organizes the KVA memory layout into free areas covering the 1..ULONG_MAX range. It uses an augmented red-black tree that keeps blocks sorted by their offsets, paired with a linked list that keeps the free space in order of increasing addresses.
Each node is augmented with the size of the maximum available free block in its left or right sub-tree. This makes it possible, at every step, to traverse toward the block that fits and has the lowest start address, i.e. allocation is sequential.
Allocation: to allocate a new block, a search is done over the tree until the lowest (left-most) suitable block is found that is large enough to accommodate the requested size, alignment and vstart point. If the block is bigger than the requested size, it is split.
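As a rough illustration of that descent, here is a minimal user-space sketch with a simplified node layout; the real find_vmap_lowest_match() additionally honors alignment and the vstart restriction, which can force a roll-back to a parent node, and the helper name find_lowest_fit is invented.

struct node {
	unsigned long va_start, va_end;
	unsigned long subtree_max_size;	/* max block size in this sub-tree */
	struct node *left, *right;
};

static unsigned long node_size(const struct node *n)
{
	return n->va_end - n->va_start;
}

static unsigned long subtree_max(const struct node *n)
{
	return n ? n->subtree_max_size : 0;
}

/*
 * Descend toward the lowest-address block that can hold "size".
 * Prefer the left sub-tree whenever its cached maximum is large
 * enough, since lower addresses live there; otherwise try the
 * current node, then the right sub-tree.
 */
static struct node *find_lowest_fit(struct node *root, unsigned long size)
{
	struct node *n = root;

	while (n) {
		if (subtree_max(n->left) >= size)
			n = n->left;
		else if (node_size(n) >= size)
			return n;
		else if (subtree_max(n->right) >= size)
			n = n->right;
		else
			break;	/* no block in this sub-tree is big enough */
	}

	return NULL;
}

Because every step is guided by subtree_max_size, the search visits O(log(N)) nodes instead of scanning the whole list.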
De-allocation: when a busy vmap area is freed, it is either merged with its neighbors or inserted into the tree. The red-black tree allows a spot to be found efficiently, whereas the linked list provides constant-time access to the previous and next blocks to check whether merging can be done. When a de-allocated chunk is merged, a larger coalesced area is created.
Complexity: ~O(log(N))
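A minimal sketch of the merge-on-free step described above, assuming free areas sit on an address-sorted doubly-linked list; the rb-tree bookkeeping and subtree_max_size propagation are omitted, and the names (struct area, merge_or_add) are invented for the example.

#include <stdlib.h>

struct area {
	unsigned long start, end;
	struct area *prev, *next;	/* address-sorted free list */
};

/*
 * Free the busy range [start, end). "next" is the first free area
 * whose start is >= end. Returns the resulting free area, or NULL
 * on allocation failure.
 */
static struct area *merge_or_add(struct area *next,
				 unsigned long start, unsigned long end)
{
	struct area *prev = next->prev;

	if (next->start == end) {
		/* The freed range adjoins "next": grow it downwards. */
		next->start = start;
		if (prev && prev->end == start) {
			/* It also adjoins "prev": coalesce all three. */
			prev->end = next->end;
			prev->next = next->next;
			if (next->next)
				next->next->prev = prev;
			free(next);
			return prev;
		}
		return next;
	}

	if (prev && prev->end == start) {
		/* Only "prev" adjoins: grow it upwards. */
		prev->end = end;
		return prev;
	}

	/* No neighbor adjoins: link a fresh area before "next". */
	struct area *a = malloc(sizeof(*a));
	if (!a)
		return NULL;
	a->start = start;
	a->end = end;
	a->prev = prev;
	a->next = next;
	next->prev = a;
	if (prev)
		prev->next = a;
	return a;
}

Only the two list neighbors ever need to be examined, which keeps the merge check O(1) on top of the O(log(N)) spot lookup.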
[urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@suse.com Cc: Matthew Wilcox willy@infradead.org Cc: Thomas Garnier thgarnie@google.com Cc: Oleksiy Avramchenko oleksiy.avramchenko@sonymobile.com Cc: Steven Rostedt rostedt@goodmis.org Cc: Joel Fernandes joelaf@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@elte.hu Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 68ad4a3304335358f95a417f2a2b0c909e5119c4) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/vmalloc.h | 6 +- mm/vmalloc.c | 1004 +++++++++++++++++++++++++++++---------- 2 files changed, 763 insertions(+), 247 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 206957b1b54d..239571c69d6b 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -45,12 +45,16 @@ struct vm_struct { struct vmap_area { unsigned long va_start; unsigned long va_end; + + /* + * Largest available free size in subtree. + */ + unsigned long subtree_max_size; unsigned long flags; struct rb_node rb_node; /* address sorted rbtree */ struct list_head list; /* address sorted list */ struct llist_node purge_list; /* "lazy purge" list */ struct vm_struct *vm; - struct rcu_head rcu_head; };
/* diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 1817871b0239..d03cd4770ca4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -32,6 +32,7 @@ #include <linux/llist.h> #include <linux/bitops.h> #include <linux/overflow.h> +#include <linux/rbtree_augmented.h>
#include <linux/uaccess.h> #include <asm/tlbflush.h> @@ -332,14 +333,67 @@ static DEFINE_SPINLOCK(vmap_area_lock); LIST_HEAD(vmap_area_list); static LLIST_HEAD(vmap_purge_list); static struct rb_root vmap_area_root = RB_ROOT; +static bool vmap_initialized __read_mostly;
-/* The vmap cache globals are protected by vmap_area_lock */ -static struct rb_node *free_vmap_cache; -static unsigned long cached_hole_size; -static unsigned long cached_vstart; -static unsigned long cached_align; +/* + * This kmem_cache is used for vmap_area objects. Instead of + * allocating from slab we reuse an object from this cache to + * make things faster. Especially in "no edge" splitting of + * free block. + */ +static struct kmem_cache *vmap_area_cachep; + +/* + * This linked list is used in pair with free_vmap_area_root. + * It gives O(1) access to prev/next to perform fast coalescing. + */ +static LIST_HEAD(free_vmap_area_list); + +/* + * This augment red-black tree represents the free vmap space. + * All vmap_area objects in this tree are sorted by va->va_start + * address. It is used for allocation and merging when a vmap + * object is released. + * + * Each vmap_area node contains a maximum available free block + * of its sub-tree, right or left. Therefore it is possible to + * find a lowest match of free area. + */ +static struct rb_root free_vmap_area_root = RB_ROOT; + +static __always_inline unsigned long +va_size(struct vmap_area *va) +{ + return (va->va_end - va->va_start); +} + +static __always_inline unsigned long +get_subtree_max_size(struct rb_node *node) +{ + struct vmap_area *va; + + va = rb_entry_safe(node, struct vmap_area, rb_node); + return va ? va->subtree_max_size : 0; +}
-static unsigned long vmap_area_pcpu_hole; +/* + * Gets called when remove the node and rotate. + */ +static __always_inline unsigned long +compute_subtree_max_size(struct vmap_area *va) +{ + return max3(va_size(va), + get_subtree_max_size(va->rb_node.rb_left), + get_subtree_max_size(va->rb_node.rb_right)); +} + +RB_DECLARE_CALLBACKS(static, free_vmap_area_rb_augment_cb, + struct vmap_area, rb_node, unsigned long, subtree_max_size, + compute_subtree_max_size) + +static void purge_vmap_area_lazy(void); +static BLOCKING_NOTIFIER_HEAD(vmap_notify_list); +static unsigned long lazy_max_pages(void);
static struct vmap_area *__find_vmap_area(unsigned long addr) { @@ -360,41 +414,522 @@ static struct vmap_area *__find_vmap_area(unsigned long addr) return NULL; }
-static void __insert_vmap_area(struct vmap_area *va) -{ - struct rb_node **p = &vmap_area_root.rb_node; - struct rb_node *parent = NULL; - struct rb_node *tmp; +/* + * This function returns back addresses of parent node + * and its left or right link for further processing. + */ +static __always_inline struct rb_node ** +find_va_links(struct vmap_area *va, + struct rb_root *root, struct rb_node *from, + struct rb_node **parent) +{ + struct vmap_area *tmp_va; + struct rb_node **link; + + if (root) { + link = &root->rb_node; + if (unlikely(!*link)) { + *parent = NULL; + return link; + } + } else { + link = &from; + }
- while (*p) { - struct vmap_area *tmp_va; + /* + * Go to the bottom of the tree. When we hit the last point + * we end up with parent rb_node and correct direction, i name + * it link, where the new va->rb_node will be attached to. + */ + do { + tmp_va = rb_entry(*link, struct vmap_area, rb_node);
- parent = *p; - tmp_va = rb_entry(parent, struct vmap_area, rb_node); - if (va->va_start < tmp_va->va_end) - p = &(*p)->rb_left; - else if (va->va_end > tmp_va->va_start) - p = &(*p)->rb_right; + /* + * During the traversal we also do some sanity check. + * Trigger the BUG() if there are sides(left/right) + * or full overlaps. + */ + if (va->va_start < tmp_va->va_end && + va->va_end <= tmp_va->va_start) + link = &(*link)->rb_left; + else if (va->va_end > tmp_va->va_start && + va->va_start >= tmp_va->va_end) + link = &(*link)->rb_right; else BUG(); + } while (*link); + + *parent = &tmp_va->rb_node; + return link; +} + +static __always_inline struct list_head * +get_va_next_sibling(struct rb_node *parent, struct rb_node **link) +{ + struct list_head *list; + + if (unlikely(!parent)) + /* + * The red-black tree where we try to find VA neighbors + * before merging or inserting is empty, i.e. it means + * there is no free vmap space. Normally it does not + * happen but we handle this case anyway. + */ + return NULL; + + list = &rb_entry(parent, struct vmap_area, rb_node)->list; + return (&parent->rb_right == link ? list->next : list); +} + +static __always_inline void +link_va(struct vmap_area *va, struct rb_root *root, + struct rb_node *parent, struct rb_node **link, struct list_head *head) +{ + /* + * VA is still not in the list, but we can + * identify its future previous list_head node. + */ + if (likely(parent)) { + head = &rb_entry(parent, struct vmap_area, rb_node)->list; + if (&parent->rb_right != link) + head = head->prev; }
- rb_link_node(&va->rb_node, parent, p); - rb_insert_color(&va->rb_node, &vmap_area_root); + /* Insert to the rb-tree */ + rb_link_node(&va->rb_node, parent, link); + if (root == &free_vmap_area_root) { + /* + * Some explanation here. Just perform simple insertion + * to the tree. We do not set va->subtree_max_size to + * its current size before calling rb_insert_augmented(). + * It is because of we populate the tree from the bottom + * to parent levels when the node _is_ in the tree. + * + * Therefore we set subtree_max_size to zero after insertion, + * to let __augment_tree_propagate_from() puts everything to + * the correct order later on. + */ + rb_insert_augmented(&va->rb_node, + root, &free_vmap_area_rb_augment_cb); + va->subtree_max_size = 0; + } else { + rb_insert_color(&va->rb_node, root); + }
- /* address-sort this list */ - tmp = rb_prev(&va->rb_node); - if (tmp) { - struct vmap_area *prev; - prev = rb_entry(tmp, struct vmap_area, rb_node); - list_add_rcu(&va->list, &prev->list); - } else - list_add_rcu(&va->list, &vmap_area_list); + /* Address-sort this list */ + list_add(&va->list, head); }
-static void purge_vmap_area_lazy(void); +static __always_inline void +unlink_va(struct vmap_area *va, struct rb_root *root) +{ + /* + * During merging a VA node can be empty, therefore + * not linked with the tree nor list. Just check it. + */ + if (!RB_EMPTY_NODE(&va->rb_node)) { + if (root == &free_vmap_area_root) + rb_erase_augmented(&va->rb_node, + root, &free_vmap_area_rb_augment_cb); + else + rb_erase(&va->rb_node, root);
-static BLOCKING_NOTIFIER_HEAD(vmap_notify_list); + list_del(&va->list); + RB_CLEAR_NODE(&va->rb_node); + } +} + +/* + * This function populates subtree_max_size from bottom to upper + * levels starting from VA point. The propagation must be done + * when VA size is modified by changing its va_start/va_end. Or + * in case of newly inserting of VA to the tree. + * + * It means that __augment_tree_propagate_from() must be called: + * - After VA has been inserted to the tree(free path); + * - After VA has been shrunk(allocation path); + * - After VA has been increased(merging path). + * + * Please note that, it does not mean that upper parent nodes + * and their subtree_max_size are recalculated all the time up + * to the root node. + * + * 4--8 + * /\ + * / \ + * / \ + * 2--2 8--8 + * + * For example if we modify the node 4, shrinking it to 2, then + * no any modification is required. If we shrink the node 2 to 1 + * its subtree_max_size is updated only, and set to 1. If we shrink + * the node 8 to 6, then its subtree_max_size is set to 6 and parent + * node becomes 4--6. + */ +static __always_inline void +augment_tree_propagate_from(struct vmap_area *va) +{ + struct rb_node *node = &va->rb_node; + unsigned long new_va_sub_max_size; + + while (node) { + va = rb_entry(node, struct vmap_area, rb_node); + new_va_sub_max_size = compute_subtree_max_size(va); + + /* + * If the newly calculated maximum available size of the + * subtree is equal to the current one, then it means that + * the tree is propagated correctly. So we have to stop at + * this point to save cycles. + */ + if (va->subtree_max_size == new_va_sub_max_size) + break; + + va->subtree_max_size = new_va_sub_max_size; + node = rb_parent(&va->rb_node); + } +} + +static void +insert_vmap_area(struct vmap_area *va, + struct rb_root *root, struct list_head *head) +{ + struct rb_node **link; + struct rb_node *parent; + + link = find_va_links(va, root, NULL, &parent); + link_va(va, root, parent, link, head); +} + +static void +insert_vmap_area_augment(struct vmap_area *va, + struct rb_node *from, struct rb_root *root, + struct list_head *head) +{ + struct rb_node **link; + struct rb_node *parent; + + if (from) + link = find_va_links(va, NULL, from, &parent); + else + link = find_va_links(va, root, NULL, &parent); + + link_va(va, root, parent, link, head); + augment_tree_propagate_from(va); +} + +/* + * Merge de-allocated chunk of VA memory with previous + * and next free blocks. If coalesce is not done a new + * free area is inserted. If VA has been merged, it is + * freed. + */ +static __always_inline void +merge_or_add_vmap_area(struct vmap_area *va, + struct rb_root *root, struct list_head *head) +{ + struct vmap_area *sibling; + struct list_head *next; + struct rb_node **link; + struct rb_node *parent; + bool merged = false; + + /* + * Find a place in the tree where VA potentially will be + * inserted, unless it is merged with its sibling/siblings. + */ + link = find_va_links(va, root, NULL, &parent); + + /* + * Get next node of VA to check if merging can be done. + */ + next = get_va_next_sibling(parent, link); + if (unlikely(next == NULL)) + goto insert; + + /* + * start end + * | | + * |<------VA------>|<-----Next----->| + * | | + * start end + */ + if (next != head) { + sibling = list_entry(next, struct vmap_area, list); + if (sibling->va_start == va->va_end) { + sibling->va_start = va->va_start; + + /* Check and update the tree if needed. */ + augment_tree_propagate_from(sibling); + + /* Remove this VA, it has been merged. 
*/ + unlink_va(va, root); + + /* Free vmap_area object. */ + kmem_cache_free(vmap_area_cachep, va); + + /* Point to the new merged area. */ + va = sibling; + merged = true; + } + } + + /* + * start end + * | | + * |<-----Prev----->|<------VA------>| + * | | + * start end + */ + if (next->prev != head) { + sibling = list_entry(next->prev, struct vmap_area, list); + if (sibling->va_end == va->va_start) { + sibling->va_end = va->va_end; + + /* Check and update the tree if needed. */ + augment_tree_propagate_from(sibling); + + /* Remove this VA, it has been merged. */ + unlink_va(va, root); + + /* Free vmap_area object. */ + kmem_cache_free(vmap_area_cachep, va); + + return; + } + } + +insert: + if (!merged) { + link_va(va, root, parent, link, head); + augment_tree_propagate_from(va); + } +} + +static __always_inline bool +is_within_this_va(struct vmap_area *va, unsigned long size, + unsigned long align, unsigned long vstart) +{ + unsigned long nva_start_addr; + + if (va->va_start > vstart) + nva_start_addr = ALIGN(va->va_start, align); + else + nva_start_addr = ALIGN(vstart, align); + + /* Can be overflowed due to big size or alignment. */ + if (nva_start_addr + size < nva_start_addr || + nva_start_addr < vstart) + return false; + + return (nva_start_addr + size <= va->va_end); +} + +/* + * Find the first free block(lowest start address) in the tree, + * that will accomplish the request corresponding to passing + * parameters. + */ +static __always_inline struct vmap_area * +find_vmap_lowest_match(unsigned long size, + unsigned long align, unsigned long vstart) +{ + struct vmap_area *va; + struct rb_node *node; + unsigned long length; + + /* Start from the root. */ + node = free_vmap_area_root.rb_node; + + /* Adjust the search size for alignment overhead. */ + length = size + align - 1; + + while (node) { + va = rb_entry(node, struct vmap_area, rb_node); + + if (get_subtree_max_size(node->rb_left) >= length && + vstart < va->va_start) { + node = node->rb_left; + } else { + if (is_within_this_va(va, size, align, vstart)) + return va; + + /* + * Does not make sense to go deeper towards the right + * sub-tree if it does not have a free block that is + * equal or bigger to the requested search length. + */ + if (get_subtree_max_size(node->rb_right) >= length) { + node = node->rb_right; + continue; + } + + /* + * OK. We roll back and find the fist right sub-tree, + * that will satisfy the search criteria. It can happen + * only once due to "vstart" restriction. + */ + while ((node = rb_parent(node))) { + va = rb_entry(node, struct vmap_area, rb_node); + if (is_within_this_va(va, size, align, vstart)) + return va; + + if (get_subtree_max_size(node->rb_right) >= length && + vstart <= va->va_start) { + node = node->rb_right; + break; + } + } + } + } + + return NULL; +} + +enum fit_type { + NOTHING_FIT = 0, + FL_FIT_TYPE = 1, /* full fit */ + LE_FIT_TYPE = 2, /* left edge fit */ + RE_FIT_TYPE = 3, /* right edge fit */ + NE_FIT_TYPE = 4 /* no edge fit */ +}; + +static __always_inline enum fit_type +classify_va_fit_type(struct vmap_area *va, + unsigned long nva_start_addr, unsigned long size) +{ + enum fit_type type; + + /* Check if it is within VA. */ + if (nva_start_addr < va->va_start || + nva_start_addr + size > va->va_end) + return NOTHING_FIT; + + /* Now classify. 
*/ + if (va->va_start == nva_start_addr) { + if (va->va_end == nva_start_addr + size) + type = FL_FIT_TYPE; + else + type = LE_FIT_TYPE; + } else if (va->va_end == nva_start_addr + size) { + type = RE_FIT_TYPE; + } else { + type = NE_FIT_TYPE; + } + + return type; +} + +static __always_inline int +adjust_va_to_fit_type(struct vmap_area *va, + unsigned long nva_start_addr, unsigned long size, + enum fit_type type) +{ + struct vmap_area *lva; + + if (type == FL_FIT_TYPE) { + /* + * No need to split VA, it fully fits. + * + * | | + * V NVA V + * |---------------| + */ + unlink_va(va, &free_vmap_area_root); + kmem_cache_free(vmap_area_cachep, va); + } else if (type == LE_FIT_TYPE) { + /* + * Split left edge of fit VA. + * + * | | + * V NVA V R + * |-------|-------| + */ + va->va_start += size; + } else if (type == RE_FIT_TYPE) { + /* + * Split right edge of fit VA. + * + * | | + * L V NVA V + * |-------|-------| + */ + va->va_end = nva_start_addr; + } else if (type == NE_FIT_TYPE) { + /* + * Split no edge of fit VA. + * + * | | + * L V NVA V R + * |---|-------|---| + */ + lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT); + if (unlikely(!lva)) + return -1; + + /* + * Build the remainder. + */ + lva->va_start = va->va_start; + lva->va_end = nva_start_addr; + + /* + * Shrink this VA to remaining size. + */ + va->va_start = nva_start_addr + size; + } else { + return -1; + } + + if (type != FL_FIT_TYPE) { + augment_tree_propagate_from(va); + + if (type == NE_FIT_TYPE) + insert_vmap_area_augment(lva, &va->rb_node, + &free_vmap_area_root, &free_vmap_area_list); + } + + return 0; +} + +/* + * Returns a start address of the newly allocated area, if success. + * Otherwise a vend is returned that indicates failure. + */ +static __always_inline unsigned long +__alloc_vmap_area(unsigned long size, unsigned long align, + unsigned long vstart, unsigned long vend, int node) +{ + unsigned long nva_start_addr; + struct vmap_area *va; + enum fit_type type; + int ret; + + va = find_vmap_lowest_match(size, align, vstart); + if (unlikely(!va)) + return vend; + + if (va->va_start > vstart) + nva_start_addr = ALIGN(va->va_start, align); + else + nva_start_addr = ALIGN(vstart, align); + + /* Check the "vend" restriction. */ + if (nva_start_addr + size > vend) + return vend; + + /* Classify what we have found. */ + type = classify_va_fit_type(va, nva_start_addr, size); + if (WARN_ON_ONCE(type == NOTHING_FIT)) + return vend; + + /* Update the free vmap_area. */ + ret = adjust_va_to_fit_type(va, nva_start_addr, size, type); + if (ret) + return vend; + + return nva_start_addr; +}
/* * Allocate a region of KVA of the specified size and alignment, within the @@ -406,18 +941,19 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, int node, gfp_t gfp_mask) { struct vmap_area *va; - struct rb_node *n; unsigned long addr; int purged = 0; - struct vmap_area *first;
BUG_ON(!size); BUG_ON(offset_in_page(size)); BUG_ON(!is_power_of_2(align));
+ if (unlikely(!vmap_initialized)) + return ERR_PTR(-EBUSY); + might_sleep();
- va = kmalloc_node(sizeof(struct vmap_area), + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask & GFP_RECLAIM_MASK, node); if (unlikely(!va)) return ERR_PTR(-ENOMEM); @@ -430,87 +966,20 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
retry: spin_lock(&vmap_area_lock); - /* - * Invalidate cache if we have more permissive parameters. - * cached_hole_size notes the largest hole noticed _below_ - * the vmap_area cached in free_vmap_cache: if size fits - * into that hole, we want to scan from vstart to reuse - * the hole instead of allocating above free_vmap_cache. - * Note that __free_vmap_area may update free_vmap_cache - * without updating cached_hole_size or cached_align. - */ - if (!free_vmap_cache || - size < cached_hole_size || - vstart < cached_vstart || - align < cached_align) { -nocache: - cached_hole_size = 0; - free_vmap_cache = NULL; - } - /* record if we encounter less permissive parameters */ - cached_vstart = vstart; - cached_align = align; - - /* find starting point for our search */ - if (free_vmap_cache) { - first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); - addr = ALIGN(first->va_end, align); - if (addr < vstart) - goto nocache; - if (addr + size < addr) - goto overflow; - - } else { - addr = ALIGN(vstart, align); - if (addr + size < addr) - goto overflow; - - n = vmap_area_root.rb_node; - first = NULL; - - while (n) { - struct vmap_area *tmp; - tmp = rb_entry(n, struct vmap_area, rb_node); - if (tmp->va_end >= addr) { - first = tmp; - if (tmp->va_start <= addr) - break; - n = n->rb_left; - } else - n = n->rb_right; - } - - if (!first) - goto found; - }
- /* from the starting point, walk areas until a suitable hole is found */ - while (addr + size > first->va_start && addr + size <= vend) { - if (addr + cached_hole_size < first->va_start) - cached_hole_size = first->va_start - addr; - addr = ALIGN(first->va_end, align); - if (addr + size < addr) - goto overflow; - - if (list_is_last(&first->list, &vmap_area_list)) - goto found; - - first = list_next_entry(first, list); - } - -found: /* - * Check also calculated address against the vstart, - * because it can be 0 because of big align request. + * If an allocation fails, the "vend" address is + * returned. Therefore trigger the overflow path. */ - if (addr + size > vend || addr < vstart) + addr = __alloc_vmap_area(size, align, vstart, vend, node); + if (unlikely(addr == vend)) goto overflow;
va->va_start = addr; va->va_end = addr + size; va->flags = 0; - __insert_vmap_area(va); - free_vmap_cache = &va->rb_node; + insert_vmap_area(va, &vmap_area_root, &vmap_area_list); + spin_unlock(&vmap_area_lock);
BUG_ON(!IS_ALIGNED(va->va_start, align)); @@ -539,7 +1008,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n", size); - kfree(va); + + kmem_cache_free(vmap_area_cachep, va); return ERR_PTR(-EBUSY); }
@@ -559,35 +1029,16 @@ static void __free_vmap_area(struct vmap_area *va) { BUG_ON(RB_EMPTY_NODE(&va->rb_node));
- if (free_vmap_cache) { - if (va->va_end < cached_vstart) { - free_vmap_cache = NULL; - } else { - struct vmap_area *cache; - cache = rb_entry(free_vmap_cache, struct vmap_area, rb_node); - if (va->va_start <= cache->va_start) { - free_vmap_cache = rb_prev(&va->rb_node); - /* - * We don't try to update cached_hole_size or - * cached_align, but it won't go very wrong. - */ - } - } - } - rb_erase(&va->rb_node, &vmap_area_root); - RB_CLEAR_NODE(&va->rb_node); - list_del_rcu(&va->list); - /* - * Track the highest possible candidate for pcpu area - * allocation. Areas outside of vmalloc area can be returned - * here too, consider only end addresses which fall inside - * vmalloc area proper. + * Remove from the busy tree/list. */ - if (va->va_end > VMALLOC_START && va->va_end <= VMALLOC_END) - vmap_area_pcpu_hole = max(vmap_area_pcpu_hole, va->va_end); + unlink_va(va, &vmap_area_root);
- kfree_rcu(va, rcu_head); + /* + * Merge VA with its neighbors, otherwise just add it. + */ + merge_or_add_vmap_area(va, + &free_vmap_area_root, &free_vmap_area_list); }
/* @@ -788,8 +1239,6 @@ static struct vmap_area *find_vmap_area(unsigned long addr)
#define VMAP_BLOCK_SIZE (VMAP_BBMAP_BITS * PAGE_SIZE)
-static bool vmap_initialized __read_mostly = false; - struct vmap_block_queue { spinlock_t lock; struct list_head free; @@ -1243,12 +1692,58 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align) vm_area_add_early(vm); }
+static void vmap_init_free_space(void) +{ + unsigned long vmap_start = 1; + const unsigned long vmap_end = ULONG_MAX; + struct vmap_area *busy, *free; + + /* + * B F B B B F + * -|-----|.....|-----|-----|-----|.....|- + * | The KVA space | + * |<--------------------------------->| + */ + list_for_each_entry(busy, &vmap_area_list, list) { + if (busy->va_start - vmap_start > 0) { + free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); + if (!WARN_ON_ONCE(!free)) { + free->va_start = vmap_start; + free->va_end = busy->va_start; + + insert_vmap_area_augment(free, NULL, + &free_vmap_area_root, + &free_vmap_area_list); + } + } + + vmap_start = busy->va_end; + } + + if (vmap_end - vmap_start > 0) { + free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); + if (!WARN_ON_ONCE(!free)) { + free->va_start = vmap_start; + free->va_end = vmap_end; + + insert_vmap_area_augment(free, NULL, + &free_vmap_area_root, + &free_vmap_area_list); + } + } +} + void __init vmalloc_init(void) { struct vmap_area *va; struct vm_struct *tmp; int i;
+ /* + * Create the cache for vmap_area objects. + */ + vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC); + for_each_possible_cpu(i) { struct vmap_block_queue *vbq; struct vfree_deferred *p; @@ -1263,16 +1758,21 @@ void __init vmalloc_init(void)
/* Import existing vmlist entries. */ for (tmp = vmlist; tmp; tmp = tmp->next) { - va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT); + va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); + if (WARN_ON_ONCE(!va)) + continue; + va->flags = VM_VM_AREA; va->va_start = (unsigned long)tmp->addr; va->va_end = va->va_start + tmp->size; va->vm = tmp; - __insert_vmap_area(va); + insert_vmap_area(va, &vmap_area_root, &vmap_area_list); }
- vmap_area_pcpu_hole = VMALLOC_END; - + /* + * Now we can initialize a free vmap space. + */ + vmap_init_free_space(); vmap_initialized = true; }
@@ -2390,81 +2890,64 @@ static struct vmap_area *node_to_va(struct rb_node *n) }
/** - * pvm_find_next_prev - find the next and prev vmap_area surrounding @end - * @end: target address - * @pnext: out arg for the next vmap_area - * @pprev: out arg for the previous vmap_area + * pvm_find_va_enclose_addr - find the vmap_area @addr belongs to + * @addr: target address * - * Returns: %true if either or both of next and prev are found, - * %false if no vmap_area exists - * - * Find vmap_areas end addresses of which enclose @end. ie. if not - * NULL, *pnext->va_end > @end and *pprev->va_end <= @end. + * Returns: vmap_area if it is found. If there is no such area + * the first highest(reverse order) vmap_area is returned + * i.e. va->va_start < addr && va->va_end < addr or NULL + * if there are no any areas before @addr. */ -static bool pvm_find_next_prev(unsigned long end, - struct vmap_area **pnext, - struct vmap_area **pprev) +static struct vmap_area * +pvm_find_va_enclose_addr(unsigned long addr) { - struct rb_node *n = vmap_area_root.rb_node; - struct vmap_area *va = NULL; + struct vmap_area *va, *tmp; + struct rb_node *n; + + n = free_vmap_area_root.rb_node; + va = NULL;
while (n) { - va = rb_entry(n, struct vmap_area, rb_node); - if (end < va->va_end) - n = n->rb_left; - else if (end > va->va_end) + tmp = rb_entry(n, struct vmap_area, rb_node); + if (tmp->va_start <= addr) { + va = tmp; + if (tmp->va_end >= addr) + break; + n = n->rb_right; - else - break; + } else { + n = n->rb_left; + } }
- if (!va) - return false; - - if (va->va_end > end) { - *pnext = va; - *pprev = node_to_va(rb_prev(&(*pnext)->rb_node)); - } else { - *pprev = va; - *pnext = node_to_va(rb_next(&(*pprev)->rb_node)); - } - return true; + return va; }
/** - * pvm_determine_end - find the highest aligned address between two vmap_areas - * @pnext: in/out arg for the next vmap_area - * @pprev: in/out arg for the previous vmap_area - * @align: alignment - * - * Returns: determined end address + * pvm_determine_end_from_reverse - find the highest aligned address + * of free block below VMALLOC_END + * @va: + * in - the VA we start the search(reverse order); + * out - the VA with the highest aligned end address. * - * Find the highest aligned address between *@pnext and *@pprev below - * VMALLOC_END. *@pnext and *@pprev are adjusted so that the aligned - * down address is between the end addresses of the two vmap_areas. - * - * Please note that the address returned by this function may fall - * inside *@pnext vmap_area. The caller is responsible for checking - * that. + * Returns: determined end address within vmap_area */ -static unsigned long pvm_determine_end(struct vmap_area **pnext, - struct vmap_area **pprev, - unsigned long align) +static unsigned long +pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align) { - const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1); + unsigned long vmalloc_end = VMALLOC_END & ~(align - 1); unsigned long addr;
- if (*pnext) - addr = min((*pnext)->va_start & ~(align - 1), vmalloc_end); - else - addr = vmalloc_end; - - while (*pprev && (*pprev)->va_end > addr) { - *pnext = *pprev; - *pprev = node_to_va(rb_prev(&(*pnext)->rb_node)); + if (likely(*va)) { + list_for_each_entry_from_reverse((*va), + &free_vmap_area_list, list) { + addr = min((*va)->va_end & ~(align - 1), vmalloc_end); + if ((*va)->va_start < addr) + return addr; + } }
- return addr; + return 0; }
/** @@ -2484,12 +2967,12 @@ static unsigned long pvm_determine_end(struct vmap_area **pnext, * to gigabytes. To avoid interacting with regular vmallocs, these * areas are allocated from top. * - * Despite its complicated look, this allocator is rather simple. It - * does everything top-down and scans areas from the end looking for - * matching slot. While scanning, if any of the areas overlaps with - * existing vmap_area, the base address is pulled down to fit the - * area. Scanning is repeated till all the areas fit and then all - * necessary data structures are inserted and the result is returned. + * Despite its complicated look, this allocator is rather simple. It + * does everything top-down and scans free blocks from the end looking + * for matching base. While scanning, if any of the areas do not fit the + * base address is pulled down to fit the area. Scanning is repeated till + * all the areas fit and then all necessary data structures are inserted + * and the result is returned. */ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, const size_t *sizes, int nr_vms, @@ -2497,11 +2980,12 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, { const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align); const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1); - struct vmap_area **vas, *prev, *next; + struct vmap_area **vas, *va; struct vm_struct **vms; int area, area2, last_area, term_area; - unsigned long base, start, end, last_end; + unsigned long base, start, size, end, last_end; bool purged = false; + enum fit_type type;
/* verify parameters and allocate data structures */ BUG_ON(offset_in_page(align) || !is_power_of_2(align)); @@ -2537,7 +3021,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, goto err_free2;
for (area = 0; area < nr_vms; area++) { - vas[area] = kzalloc(sizeof(struct vmap_area), GFP_KERNEL); + vas[area] = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL); vms[area] = kzalloc(sizeof(struct vm_struct), GFP_KERNEL); if (!vas[area] || !vms[area]) goto err_free; @@ -2550,49 +3034,29 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, start = offsets[area]; end = start + sizes[area];
- if (!pvm_find_next_prev(vmap_area_pcpu_hole, &next, &prev)) { - base = vmalloc_end - last_end; - goto found; - } - base = pvm_determine_end(&next, &prev, align) - end; + va = pvm_find_va_enclose_addr(vmalloc_end); + base = pvm_determine_end_from_reverse(&va, align) - end;
while (true) { - BUG_ON(next && next->va_end <= base + end); - BUG_ON(prev && prev->va_end > base + end); - /* * base might have underflowed, add last_end before * comparing. */ - if (base + last_end < vmalloc_start + last_end) { - spin_unlock(&vmap_area_lock); - if (!purged) { - purge_vmap_area_lazy(); - purged = true; - goto retry; - } - goto err_free; - } + if (base + last_end < vmalloc_start + last_end) + goto overflow;
/* - * If next overlaps, move base downwards so that it's - * right below next and then recheck. + * Fitting base has not been found. */ - if (next && next->va_start < base + end) { - base = pvm_determine_end(&next, &prev, align) - end; - term_area = area; - continue; - } + if (va == NULL) + goto overflow;
/* - * If prev overlaps, shift down next and prev and move - * base so that it's right below new next and then - * recheck. + * If this VA does not fit, move base downwards and recheck. */ - if (prev && prev->va_end > base + start) { - next = prev; - prev = node_to_va(rb_prev(&next->rb_node)); - base = pvm_determine_end(&next, &prev, align) - end; + if (base + start < va->va_start || base + end > va->va_end) { + va = node_to_va(rb_prev(&va->rb_node)); + base = pvm_determine_end_from_reverse(&va, align) - end; term_area = area; continue; } @@ -2604,21 +3068,40 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, area = (area + nr_vms - 1) % nr_vms; if (area == term_area) break; + start = offsets[area]; end = start + sizes[area]; - pvm_find_next_prev(base + end, &next, &prev); + va = pvm_find_va_enclose_addr(base + end); } -found: + /* we've found a fitting base, insert all va's */ for (area = 0; area < nr_vms; area++) { - struct vmap_area *va = vas[area]; + int ret;
- va->va_start = base + offsets[area]; - va->va_end = va->va_start + sizes[area]; - __insert_vmap_area(va); - } + start = base + offsets[area]; + size = sizes[area];
- vmap_area_pcpu_hole = base + offsets[last_area]; + va = pvm_find_va_enclose_addr(start); + if (WARN_ON_ONCE(va == NULL)) + /* It is a BUG(), but trigger recovery instead. */ + goto recovery; + + type = classify_va_fit_type(va, start, size); + if (WARN_ON_ONCE(type == NOTHING_FIT)) + /* It is a BUG(), but trigger recovery instead. */ + goto recovery; + + ret = adjust_va_to_fit_type(va, start, size, type); + if (unlikely(ret)) + goto recovery; + + /* Allocated area. */ + va = vas[area]; + va->va_start = start; + va->va_end = start + size; + + insert_vmap_area(va, &vmap_area_root, &vmap_area_list); + }
spin_unlock(&vmap_area_lock);
@@ -2630,9 +3113,38 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, kfree(vas); return vms;
+recovery: + /* Remove previously inserted areas. */ + while (area--) { + __free_vmap_area(vas[area]); + vas[area] = NULL; + } + +overflow: + spin_unlock(&vmap_area_lock); + if (!purged) { + purge_vmap_area_lazy(); + purged = true; + + /* Before "retry", check if we recover. */ + for (area = 0; area < nr_vms; area++) { + if (vas[area]) + continue; + + vas[area] = kmem_cache_zalloc( + vmap_area_cachep, GFP_KERNEL); + if (!vas[area]) + goto err_free; + } + + goto retry; + } + err_free: for (area = 0; area < nr_vms; area++) { - kfree(vas[area]); + if (vas[area]) + kmem_cache_free(vmap_area_cachep, vas[area]); + kfree(vms[area]); } err_free2:
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-5.2-rc1 commit bb850f4dae4abb18c5ee727bb2d6df9ca47ede49 category: bugfix bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25 CVE: NA
------------------------------------------------- This macro adds some debug code to check that the augmented tree is maintained correctly, meaning that every node contains a valid subtree_max_size value.
By default this option is set to 0 and is not active. Activating it requires recompiling the kernel: set the macro to 1 and rebuild.
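Concretely (the macro lives in mm/vmalloc.c, as the hunk below shows):

/* mm/vmalloc.c: flip the compile-time switch, then rebuild. */
#define DEBUG_AUGMENT_PROPAGATE_CHECK 1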
[urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Ingo Molnar mingo@elte.hu Cc: Joel Fernandes joelaf@google.com Cc: Matthew Wilcox willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Oleksiy Avramchenko oleksiy.avramchenko@sonymobile.com Cc: Steven Rostedt rostedt@goodmis.org Cc: Tejun Heo tj@kernel.org Cc: Thomas Garnier thgarnie@google.com Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit bb850f4dae4abb18c5ee727bb2d6df9ca47ede49) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/vmalloc.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d03cd4770ca4..557e46f73ab7 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -325,6 +325,8 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
/*** Global kva allocator ***/
+#define DEBUG_AUGMENT_PROPAGATE_CHECK 0 + #define VM_LAZY_FREE 0x02 #define VM_VM_AREA 0x04
@@ -539,6 +541,48 @@ unlink_va(struct vmap_area *va, struct rb_root *root) } }
+#if DEBUG_AUGMENT_PROPAGATE_CHECK +static void +augment_tree_propagate_check(struct rb_node *n) +{ + struct vmap_area *va; + struct rb_node *node; + unsigned long size; + bool found = false; + + if (n == NULL) + return; + + va = rb_entry(n, struct vmap_area, rb_node); + size = va->subtree_max_size; + node = n; + + while (node) { + va = rb_entry(node, struct vmap_area, rb_node); + + if (get_subtree_max_size(node->rb_left) == size) { + node = node->rb_left; + } else { + if (va_size(va) == size) { + found = true; + break; + } + + node = node->rb_right; + } + } + + if (!found) { + va = rb_entry(n, struct vmap_area, rb_node); + pr_emerg("tree is corrupted: %lu, %lu\n", + va_size(va), va->subtree_max_size); + } + + augment_tree_propagate_check(n->rb_left); + augment_tree_propagate_check(n->rb_right); +} +#endif + /* * This function populates subtree_max_size from bottom to upper * levels starting from VA point. The propagation must be done @@ -588,6 +632,10 @@ augment_tree_propagate_from(struct vmap_area *va) va->subtree_max_size = new_va_sub_max_size; node = rb_parent(&va->rb_node); } + +#if DEBUG_AUGMENT_PROPAGATE_CHECK + augment_tree_propagate_check(free_vmap_area_root.rb_node); +#endif }
static void
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-5.2-rc1 commit a6cf4e0fe3e740ed7af39fdda721e1ac12247dd3 category: bugfix bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25 CVE: NA
------------------------------------------------- This macro adds some debug code to check that vmap allocations happen in ascending order.
By default this option is set to 0 and is not active. Activating it requires recompiling the kernel: set the macro to 1 and rebuild.
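As with the previous check, it is a compile-time switch in mm/vmalloc.c:

/* mm/vmalloc.c: flip the compile-time switch, then rebuild. */
#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 1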
[urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Roman Gushchin guro@fb.com Cc: Ingo Molnar mingo@elte.hu Cc: Joel Fernandes joelaf@google.com Cc: Matthew Wilcox willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Oleksiy Avramchenko oleksiy.avramchenko@sonymobile.com Cc: Steven Rostedt rostedt@goodmis.org Cc: Tejun Heo tj@kernel.org Cc: Thomas Garnier thgarnie@google.com Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit a6cf4e0fe3e740ed7af39fdda721e1ac12247dd3) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/vmalloc.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 557e46f73ab7..b246ebe224c4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -326,6 +326,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn); /*** Global kva allocator ***/
#define DEBUG_AUGMENT_PROPAGATE_CHECK 0 +#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
#define VM_LAZY_FREE 0x02 #define VM_VM_AREA 0x04 @@ -834,6 +835,44 @@ find_vmap_lowest_match(unsigned long size, return NULL; }
+#if DEBUG_AUGMENT_LOWEST_MATCH_CHECK +#include <linux/random.h> + +static struct vmap_area * +find_vmap_lowest_linear_match(unsigned long size, + unsigned long align, unsigned long vstart) +{ + struct vmap_area *va; + + list_for_each_entry(va, &free_vmap_area_list, list) { + if (!is_within_this_va(va, size, align, vstart)) + continue; + + return va; + } + + return NULL; +} + +static void +find_vmap_lowest_match_check(unsigned long size) +{ + struct vmap_area *va_1, *va_2; + unsigned long vstart; + unsigned int rnd; + + get_random_bytes(&rnd, sizeof(rnd)); + vstart = VMALLOC_START + rnd; + + va_1 = find_vmap_lowest_match(size, 1, vstart); + va_2 = find_vmap_lowest_linear_match(size, 1, vstart); + + if (va_1 != va_2) + pr_emerg("not lowest: t: 0x%p, l: 0x%p, v: 0x%lx\n", + va_1, va_2, vstart); +} +#endif + enum fit_type { NOTHING_FIT = 0, FL_FIT_TYPE = 1, /* full fit */ @@ -976,6 +1015,10 @@ __alloc_vmap_area(unsigned long size, unsigned long align, if (ret) return vend;
+#if DEBUG_AUGMENT_LOWEST_MATCH_CHECK + find_vmap_lowest_match_check(size); +#endif + return nva_start_addr; }
From: Arnd Bergmann arnd@arndb.de
mainline inclusion from mainline-5.2-rc7 commit 2c9292336a09f7bf019689580ceea9a2d116b999 category: bugfix bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25 CVE: NA
------------------------------------------------- gcc gets confused in pcpu_get_vm_areas() because there are too many branches that affect whether 'lva' was initialized before it gets used:
mm/vmalloc.c: In function 'pcpu_get_vm_areas':
mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    insert_vmap_area_augment(lva, &va->rb_node,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      &free_vmap_area_root, &free_vmap_area_list);
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/vmalloc.c:916:20: note: 'lva' was declared here
  struct vmap_area *lva;
                    ^~~
Add an initialization to NULL, and check whether it has changed before the first use.
[akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Arnd Bergmann arnd@arndb.de Reviewed-by: Uladzislau Rezki (Sony) urezki@gmail.com Cc: Joel Fernandes joelaf@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 2c9292336a09f7bf019689580ceea9a2d116b999) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/vmalloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b246ebe224c4..1e459e02b6af 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -912,7 +912,7 @@ adjust_va_to_fit_type(struct vmap_area *va, unsigned long nva_start_addr, unsigned long size, enum fit_type type) { - struct vmap_area *lva; + struct vmap_area *lva = NULL;
if (type == FL_FIT_TYPE) { /* @@ -971,7 +971,7 @@ adjust_va_to_fit_type(struct vmap_area *va, if (type != FL_FIT_TYPE) { augment_tree_propagate_from(va);
- if (type == NE_FIT_TYPE) + if (lva) /* type == NE_FIT_TYPE */ insert_vmap_area_augment(lva, &va->rb_node, &free_vmap_area_root, &free_vmap_area_list); }