Al Viro (1): iov_iter_fault_in_readable() should do nothing in xarray case
Alex Shi (1): mm: add VM_WARN_ON_ONCE_PAGE() macro
Alex Williamson (1): vfio/pci: Handle concurrent vma faults
Alexander Ovechkin (1): net: send SYNACK packet with accepted fwmark
Anirudh Rayabharam (1): ext4: fix kernel infoleak via ext4_extent_header
Casey Chen (1): nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
Dan Carpenter (1): scsi: scsi_dh_alua: Fix signedness bug in alua_rtpg()
Dimitri John Ledkov (1): lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
Dmitry Bogdanov (1): scsi: target: Fix protect handling in WRITE SAME(32)
Dmitry Torokhov (1): i2c: core: Disable client irq on reboot/shutdown
Eric Dumazet (8):
      pkt_sched: sch_qfq: fix qfq_change_class() error path
      vxlan: add missing rcu_read_lock() in neigh_reduce()
      ipv6: exthdrs: do not blindly use init_net
      ipv6: fix out-of-bound access in ip6_parse_tlv()
      tcp: annotate data races around tp->mtu_info
      ipv6: tcp: drop silly ICMPv6 packet too big messages
      udp: annotate data races around unix_sk(sk)->gso_size
      net/tcp_fastopen: fix data races around tfo_active_disable_stamp
Gabriel Krisman Bertazi (1): dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath
Gao Xiang (1): nfs: fix acl memory leak of posix_acl_create()
Hangbin Liu (1): net: ip_tunnel: fix mtu calculation for ETHER tunnel devices
Hannes Reinecke (1): scsi: scsi_dh_alua: Check for negative result value
Haoran Luo (1): tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
Hugh Dickins (16):
      mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
      mm/thp: make is_huge_zero_pmd() safe and quicker
      mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
      mm/thp: fix vma_address() if virtual address below file offset
      mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
      mm: page_vma_mapped_walk(): use page for pvmw->page
      mm: page_vma_mapped_walk(): settle PageHuge on entry
      mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
      mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
      mm: page_vma_mapped_walk(): crossing page table boundary
      mm: page_vma_mapped_walk(): add a level of indentation
      mm: page_vma_mapped_walk(): use goto instead of while (1)
      mm: page_vma_mapped_walk(): get vma_address_end() earlier
      mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
      mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
      mm, futex: fix shared futex pgoff on shmem huge page
Jakub Kicinski (1): net: ip: avoid OOM kills with large UDP sends over loopback
Jason Ekstrand (1): dma-buf/sync_file: Don't leak fences on merge failure
Jason Gunthorpe (1): vfio-pci: Use io_remap_pfn_range() for PCI IO memory
Javed Hasan (1): scsi: libfc: Fix array index out of bound exception
Joe Thornber (1): dm space maps: don't reset space map allocation cursor when committing
Jue Wang (1): mm/thp: fix page_address_in_vma() on file THP tails
Krzysztof Wilczyński (2):
      ACPI: sysfs: Fix a buffer overrun problem with description_show()
      PCI/sysfs: Fix dsm_label_utf16s_to_utf8s() buffer overrun
Liu Shixin (1): netlabel: Fix memory leak in netlbl_mgmt_add_common
Longpeng(Mike) (1): vsock: notify server to shutdown when client has pending signal
Maciej Żenczykowski (1): bpf: Do not change gso_size during bpf_skb_change_proto()
ManYi Li (1): scsi: sr: Return appropriate error code when disk is ejected
Marcelo Henrique Cerri (1): proc: Avoid mixing integer types in mem_rw()
Mario Limonciello (1): ACPI: processor idle: Fix up C-state latency if not ordered
Miao Wang (1): net/ipv4: swap flow ports when validating source
Miaohe Lin (3):
      mm/rmap: remove unneeded semicolon in page_not_mapped()
      mm/rmap: use page_not_mapped in try_to_unmap()
      mm/huge_memory.c: don't discard hugepage if other processes are mapping it
Michael S. Tsirkin (1): virtio_net: move tx vq operation under tx queue lock
Mike Christie (4):
      scsi: iscsi: Add iscsi_cls_conn refcount helpers
      scsi: iscsi: Fix shost->max_id use
      scsi: qedi: Fix null ref during abort handling
      scsi: iscsi: Fix iface sysfs attr detection
Miklos Szeredi (1): fuse: check connected before queueing on fpq->io
Mikulas Patocka (3):
      dm writecache: fix data corruption when reloading the target
      dm writecache: return the exact table values that were set
      dm writecache: fix writing beyond end of underlying device when shrinking
Mimi Zohar (1): evm: fix writing <securityfs>/evm overflow
Muchun Song (1): writeback: fix obtain a reference to a freeing memcg css
Nicolas Dichtel (1): ipv6: fix 'disable_policy' for fwd packets
Nikolay Aleksandrov (1): net: bridge: multicast: fix PIM hello router port marking race
Odin Ugedal (1): sched/fair: Fix CFS bandwidth hrtimer expiry type
Pablo Neira Ayuso (3):
      netfilter: nft_exthdr: check for IPv6 packet before further processing
      netfilter: nft_osf: check for TCP packet before further processing
      netfilter: nft_tproxy: restrict support to TCP and UDP transport protocols
Pan Dong (1): ext4: fix avefreec in find_group_orlov
Peilin Ye (1): net/sched: act_skbmod: Skip non-Ethernet packets
Petr Mladek (2):
      kthread_worker: split code for canceling the delayed work timer
      kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
Petr Pavlu (1): ipmi/watchdog: Stop watchdog timer when the current action is 'none'
Quat Le (1): scsi: core: Retry I/O for Notify (Enable Spinup) Required error
Richard Fitzgerald (1): lib: vsprintf: Fix handling of number field widths in vsscanf
Roberto Sassu (2):
      evm: Execute evm_inode_init_security() only when an HMAC key is loaded
      evm: Refuse EVM_ALLOW_METADATA_WRITES only if an HMAC key is loaded
Steffen Klassert (1): xfrm: Fix error reporting in xfrm_state_construct.
Stephen Brennan (1): ext4: use ext4_grp_locked_error in mb_find_extent
Steven Rostedt (VMware) (1): tracing: Do not reference char * as a string in histograms
Sunwook Eom (1): dm verity fec: fix hash block number in verity_fec_decode
Taehee Yoo (1): net: validate lwtstate->data before returning from skb_tunnel_info()
Thomas Gleixner (1): x86/fpu: Limit xstate copy size in xstateregs_set()
Trond Myklebust (3):
      NFS: nfs_find_open_context() may only select open files
      NFSv4: Initialise connection to the server in nfs4_alloc_client()
      NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
Tyrel Datwyler (1): scsi: core: Fix bad pointer dereference when ehandler kthread is invalid
Vasily Averin (1): netfilter: ctnetlink: suspicious RCU usage in ctnetlink_dump_helpinfo
Willy Tarreau (1): ipv6: use prandom_u32() for ID generation
Wolfgang Bumiller (1): net: bridge: sync fdb to new unicast-filtering ports
Xianting Tian (1): virtio_net: Remove BUG() to avoid machine dead
Xie Yongji (3):
      virtio-blk: Fix memory leak among suspend/resume procedure
      virtio_net: Fix error handling in virtnet_restore()
      virtio_console: Assure used length from device is limited
Xin Long (1): sctp: update active_key for asoc when old key is being replaced
Yajun Deng (1): net: sched: cls_api: Fix the the wrong parameter
Yang Shi (1): mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
Yang Yingliang (1): ext4: return error code when ext4_fill_flex_info() fails
Yun Zhou (2):
      seq_buf: Make trace_seq_putmem_hex() support data longer than 8
      seq_buf: Fix overflow in seq_buf_putmem_hex()
Zhang Yi (2):
      ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
      ext4: remove check for zero nr_to_scan in ext4_es_scan()
Zhihao Cheng (1): nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
 Documentation/ABI/testing/evm                 |  26 ++-
 arch/x86/kernel/fpu/regset.c                  |   2 +-
 drivers/acpi/device_sysfs.c                   |   2 +-
 drivers/acpi/processor_idle.c                 |  40 +++++
 drivers/block/virtio_blk.c                    |   2 +
 drivers/char/ipmi/ipmi_watchdog.c             |  22 +--
 drivers/char/virtio_console.c                 |   4 +-
 drivers/dma-buf/sync_file.c                   |  13 +-
 drivers/i2c/i2c-core-base.c                   |   3 +
 drivers/md/dm-mpath.c                         |   6 +-
 drivers/md/dm-verity-fec.c                    |   2 +-
 drivers/md/dm-writecache.c                    | 102 +++++++----
 .../md/persistent-data/dm-space-map-disk.c    |   9 +-
 .../persistent-data/dm-space-map-metadata.c   |   9 +-
 drivers/net/virtio_net.c                      |  29 +++-
 drivers/net/vxlan.c                           |   2 +
 drivers/nvme/host/pci.c                       |   5 +-
 drivers/pci/pci-label.c                       |   2 +-
 drivers/scsi/be2iscsi/be_main.c               |   4 +-
 drivers/scsi/bnx2i/bnx2i_iscsi.c              |   2 +-
 drivers/scsi/cxgbi/libcxgbi.c                 |   4 +-
 drivers/scsi/device_handler/scsi_dh_alua.c    |  11 +-
 drivers/scsi/hosts.c                          |   1 +
 drivers/scsi/libfc/fc_rport.c                 |  13 +-
 drivers/scsi/libiscsi.c                       |   7 +-
 drivers/scsi/qedi/qedi_fw.c                   |   2 +-
 drivers/scsi/qedi/qedi_main.c                 |   2 +-
 drivers/scsi/scsi_lib.c                       |   1 +
 drivers/scsi/scsi_transport_iscsi.c           | 102 +++++------
 drivers/scsi/sr.c                             |   2 +
 drivers/target/target_core_sbc.c              |  35 ++--
 drivers/vfio/pci/vfio_pci.c                   |  31 +++-
 fs/ext4/extents.c                             |   3 +
 fs/ext4/extents_status.c                      |   4 +-
 fs/ext4/ialloc.c                              |  11 +-
 fs/ext4/mballoc.c                             |   9 +-
 fs/ext4/super.c                               |   1 +
 fs/fs-writeback.c                             |   9 +-
 fs/fuse/dev.c                                 |   9 +
 fs/nfs/inode.c                                |   4 +
 fs/nfs/nfs3proc.c                             |   4 +-
 fs/nfs/nfs4client.c                           |  82 ++++-----
 fs/nfs/pnfs_nfs.c                             |  52 +++---
 fs/proc/base.c                                |   2 +-
 include/linux/huge_mm.h                       |   8 +-
 include/linux/hugetlb.h                       |  16 --
 include/linux/mm.h                            |   3 +
 include/linux/mmdebug.h                       |  13 ++
 include/linux/nfs_fs.h                        |   1 +
 include/linux/pagemap.h                       |  13 +-
 include/linux/rmap.h                          |   3 +-
 include/net/dst_metadata.h                    |   4 +-
 include/scsi/scsi_transport_iscsi.h           |   2 +
 kernel/futex.c                                |   2 +-
 kernel/kthread.c                              |  77 ++++++---
 kernel/sched/fair.c                           |   4 +-
 kernel/trace/ring_buffer.c                    |  28 ++-
 kernel/trace/trace_events_hist.c              |   6 +-
 lib/decompress_unlz4.c                        |   8 +
 lib/iov_iter.c                                |   2 +-
 lib/kstrtox.c                                 |  13 +-
 lib/kstrtox.h                                 |   2 +
 lib/seq_buf.c                                 |   8 +-
 lib/vsprintf.c                                |  82 +++++----
 mm/huge_memory.c                              |  58 ++++---
 mm/hugetlb.c                                  |   5 +-
 mm/internal.h                                 |  53 ++++--
 mm/memory.c                                   |  41 +++++
 mm/page_vma_mapped.c                          | 160 +++++++++++-------
 mm/pgtable-generic.c                          |   4 +-
 mm/rmap.c                                     |  48 +++---
 mm/truncate.c                                 |  43 +++--
 net/bridge/br_if.c                            |  17 +-
 net/bridge/br_multicast.c                     |   2 +
 net/core/filter.c                             |   4 -
 net/ipv4/fib_frontend.c                       |   2 +
 net/ipv4/ip_output.c                          |  32 ++--
 net/ipv4/ip_tunnel.c                          |  22 ++-
 net/ipv4/tcp_fastopen.c                       |  19 ++-
 net/ipv4/tcp_ipv4.c                           |   4 +-
 net/ipv4/tcp_output.c                         |   1 +
 net/ipv4/udp.c                                |   6 +-
 net/ipv6/exthdrs.c                            |  31 ++--
 net/ipv6/ip6_output.c                         |  36 ++--
 net/ipv6/output_core.c                        |  28 +--
 net/ipv6/tcp_ipv6.c                           |  22 ++-
 net/ipv6/udp.c                                |   2 +-
 net/netfilter/nf_conntrack_netlink.c          |   3 +
 net/netfilter/nft_exthdr.c                    |   3 +
 net/netfilter/nft_osf.c                       |   5 +
 net/netfilter/nft_tproxy.c                    |   9 +-
 net/netlabel/netlabel_mgmt.c                  |  19 ++-
 net/sched/act_skbmod.c                        |  12 +-
 net/sched/cls_api.c                           |   2 +-
 net/sched/sch_qfq.c                           |   8 +-
 net/sctp/auth.c                               |   2 +
 net/vmw_vsock/af_vsock.c                      |   2 +-
 net/xfrm/xfrm_user.c                          |  28 +--
 security/integrity/evm/evm_main.c             |   2 +-
 security/integrity/evm/evm_secfs.c            |  13 +-
 100 files changed, 1092 insertions(+), 638 deletions(-)
From: Alex Shi <alex.shi@linux.alibaba.com>
stable inclusion from linux-4.19.197 commit 131cb7a08c767e2d082ff5de6d384d8d935fb6f1
--------------------------------
[ Upstream commit a4055888629bc0467d12d912cd7c90acdf3d9b12 part ]
Add VM_WARN_ON_ONCE_PAGE() macro.
Link: https://lkml.kernel.org/r/1604283436-18880-3-git-send-email-alex.shi@linux.a...
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: the original commit was titled "mm/memcg: warning on !memcg after readahead page charged", which included uses of this macro in mm/memcontrol.c: here omitted.
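A minimal usage sketch (the call below mirrors how a later patch in this series uses the macro in unmap_page(); it is shown here only to illustrate the calling convention):

	/* Warn once and dump the page if it is unexpectedly still mapped. */
	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);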
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/mmdebug.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h index 2ad72d2c8cc52..5d0767cb424aa 100644 --- a/include/linux/mmdebug.h +++ b/include/linux/mmdebug.h @@ -37,6 +37,18 @@ void dump_mm(const struct mm_struct *mm); BUG(); \ } \ } while (0) +#define VM_WARN_ON_ONCE_PAGE(cond, page) ({ \ + static bool __section(".data.once") __warned; \ + int __ret_warn_once = !!(cond); \ + \ + if (unlikely(__ret_warn_once && !__warned)) { \ + dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\ + __warned = true; \ + WARN_ON(1); \ + } \ + unlikely(__ret_warn_once); \ +}) + #define VM_WARN_ON(cond) (void)WARN_ON(cond) #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond) #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format) @@ -48,6 +60,7 @@ void dump_mm(const struct mm_struct *mm); #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond) #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond) +#define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond) #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond) #endif
From: Miaohe Lin <linmiaohe@huawei.com>
stable inclusion from linux-4.19.197 commit 784445344c9e7cdef2a2a724fab34243933a91dc
--------------------------------
[ Upstream commit e0af87ff7afcde2660be44302836d2d5618185af ]
Remove extra semicolon without any functional change intended.
Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/rmap.c b/mm/rmap.c index aabd094d310f1..924e14bc8da8c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1740,7 +1740,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags) static int page_not_mapped(struct page *page) { return !page_mapped(page); -}; +}
/** * try_to_munlock - try to munlock a page
From: Miaohe Lin <linmiaohe@huawei.com>
stable inclusion from linux-4.19.197 commit 420afb489a1f6981cbe2fb9ee10007dba878e223
--------------------------------
[ Upstream commit b7e188ec98b1644ff70a6d3624ea16aadc39f5e0 ]
page_mapcount_is_zero() calculates accurately how many mappings a hugepage has in order to check against 0 only. This is a waste of cpu time. We can do this via page_not_mapped() to save some possible atomic_read cycles. Remove the function page_mapcount_is_zero() as it's not used anymore and move page_not_mapped() above try_to_unmap() to avoid identifier undeclared compilation error.
Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/rmap.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c index 924e14bc8da8c..ed6ad441a50c2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1693,9 +1693,9 @@ static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg) return is_vma_temporary_stack(vma); }
-static int page_mapcount_is_zero(struct page *page) +static int page_not_mapped(struct page *page) { - return !total_mapcount(page); + return !page_mapped(page); }
/** @@ -1713,7 +1713,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags) struct rmap_walk_control rwc = { .rmap_one = try_to_unmap_one, .arg = (void *)flags, - .done = page_mapcount_is_zero, + .done = page_not_mapped, .anon_lock = page_lock_anon_vma_read, };
@@ -1737,11 +1737,6 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags) return !page_mapcount(page) ? true : false; }
-static int page_not_mapped(struct page *page) -{ - return !page_mapped(page); -} - /** * try_to_munlock - try to munlock a page * @page: the page to be munlocked
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.197 commit 629ee482e0966c23ddc1abeebf71aa84ff8d3673
--------------------------------
[ Upstream commit 99fa8a48203d62b3743d866fc48ef6abaee682be ]
Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
Here is v2 batch of long-standing THP bug fixes that I had not got around to sending before, but prompted now by Wang Yugui's report https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and they have done no harm, but have *not* fixed that issue: something more is needed and I have no idea of what.
This patch (of 7):
Stressing huge tmpfs page migration racing hole punch often crashed on the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel; or shortly afterwards, on a bad dereference in __split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd migration entries in the non-anonymous case.
Full disclosure: those particular experiments were on a kernel with more relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the vanilla kernel: it is conceivable that stricter locking happens to avoid those cases, or makes them less likely; but __split_huge_pmd_locked() already allowed for pmd migration entries when handling anonymous THPs, so this commit brings the shmem and file THP handling into line.
And while there: use old_pmd rather than _pmd, as in the following blocks; and make it clearer to the eye that the !vma_is_anonymous() block is self-contained, making an early return after accounting for unmapping.
Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: this commit made intervening cleanups in pmdp_huge_clear_flush() redundant: here it's rediffed to skip them.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/huge_memory.c     | 27 ++++++++++++++++++---------
 mm/pgtable-generic.c |  4 ++--
 2 files changed, 20 insertions(+), 11 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e3f4bba22c82b..227ed32d3d40f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2182,7 +2182,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, count_vm_event(THP_SPLIT_PMD);
if (!vma_is_anonymous(vma)) { - _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd); + old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd); /* * We are going to unmap this huge page. So * just go ahead and zap it @@ -2191,16 +2191,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, zap_deposited_table(mm, pmd); if (vma_is_dax(vma)) return; - page = pmd_page(_pmd); - if (!PageDirty(page) && pmd_dirty(_pmd)) - set_page_dirty(page); - if (!PageReferenced(page) && pmd_young(_pmd)) - SetPageReferenced(page); - page_remove_rmap(page, true); - put_page(page); + if (unlikely(is_pmd_migration_entry(old_pmd))) { + swp_entry_t entry; + + entry = pmd_to_swp_entry(old_pmd); + page = migration_entry_to_page(entry); + } else { + page = pmd_page(old_pmd); + if (!PageDirty(page) && pmd_dirty(old_pmd)) + set_page_dirty(page); + if (!PageReferenced(page) && pmd_young(old_pmd)) + SetPageReferenced(page); + page_remove_rmap(page, true); + put_page(page); + } add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); return; - } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) { + } + + if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) { /* * FIXME: Do we want to invalidate secondary mmu by calling * mmu_notifier_invalidate_range() see comments below inside diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index cf2af04b34b97..36770fcdc3582 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -125,8 +125,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address, { pmd_t pmd; VM_BUG_ON(address & ~HPAGE_PMD_MASK); - VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) && - !pmd_devmap(*pmdp)) || !pmd_present(*pmdp)); + VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) && + !pmd_devmap(*pmdp)); pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp); flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); return pmd;
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.197 commit fc1fbc5b017b6f5ef24a4a93f33cd022225e01c8
--------------------------------
[ Upstream commit 3b77e8c8cde581dadab9a0f1543a347e24315f11 ]
Most callers of is_huge_zero_pmd() supply a pmd already verified present; but a few (notably zap_huge_pmd()) do not - it might be a pmd migration entry, in which the pfn is encoded differently from a present pmd: which might pass the is_huge_zero_pmd() test (though not on x86, since L1TF forced us to protect against that); or perhaps even crash in pmd_page() applied to a swap-like entry.
Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself; and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
__split_huge_pmd_locked() checked pmd_trans_huge() before: that worked, but is unnecessary now that is_huge_zero_pmd() checks present.
Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/huge_mm.h | 8 +++++++-
 mm/huge_memory.c        | 5 ++++-
 2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7a447e8e84811..00f474e1c64b3 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -224,6 +224,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
extern struct page *huge_zero_page; +extern unsigned long huge_zero_pfn;
static inline bool is_huge_zero_page(struct page *page) { @@ -232,7 +233,7 @@ static inline bool is_huge_zero_page(struct page *page)
static inline bool is_huge_zero_pmd(pmd_t pmd) { - return is_huge_zero_page(pmd_page(pmd)); + return READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd) && pmd_present(pmd); }
static inline bool is_huge_zero_pud(pud_t pud) @@ -351,6 +352,11 @@ static inline bool is_huge_zero_page(struct page *page) return false; }
+static inline bool is_huge_zero_pmd(pmd_t pmd) +{ + return false; +} + static inline bool is_huge_zero_pud(pud_t pud) { return false; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 227ed32d3d40f..c2f5b338786d2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -62,6 +62,7 @@ static struct shrinker deferred_split_shrinker;
static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; +unsigned long huge_zero_pfn __read_mostly = ~0UL;
bool transparent_hugepage_enabled(struct vm_area_struct *vma) { @@ -93,6 +94,7 @@ static struct page *get_huge_zero_page(void) __free_pages(zero_page, compound_order(zero_page)); goto retry; } + WRITE_ONCE(huge_zero_pfn, page_to_pfn(zero_page));
/* We take additional reference here. It will be put back by shrinker */ atomic_set(&huge_zero_refcount, 2); @@ -142,6 +144,7 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink, if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) { struct page *zero_page = xchg(&huge_zero_page, NULL); BUG_ON(zero_page == NULL); + WRITE_ONCE(huge_zero_pfn, ~0UL); __free_pages(zero_page, compound_order(zero_page)); return HPAGE_PMD_NR; } @@ -2209,7 +2212,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, return; }
- if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) { + if (is_huge_zero_pmd(*pmd)) { /* * FIXME: Do we want to invalidate secondary mmu by calling * mmu_notifier_invalidate_range() see comments below inside
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.197 commit 205899d6be9a1d5494318920a211852a32886e46
--------------------------------
[ Upstream commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 ]
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE (!unmap_success): with dump_page() showing mapcount:1, but then its raw struct page output showing _mapcount ffffffff i.e. mapcount 0.
And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all indicative of some mapcount difficulty in development here perhaps. But the !CONFIG_DEBUG_VM path handles the failures correctly and silently.
I believe the problem is that once a racing unmap has cleared pte or pmd, try_to_unmap_one() may skip taking the page table lock, and emerge from try_to_unmap() before the racing task has reached decrementing mapcount.
Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC to the options, and passing that from unmap_page().
When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same for both: the slight overhead added should rarely matter, except perhaps if splitting sparsely-populated multiply-mapped shmem. Once confident that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated.
Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: upstream TTU_SYNC 0x10 takes the value which 5.11 commit 013339df116c ("mm/rmap: always do TTU_IGNORE_ACCESS") freed. It is very tempting to backport that commit (as 5.10 already did) and make no change here; but on reflection, good as that commit is, I'm reluctant to include any possible side-effect of it in this series.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/rmap.h |  3 ++-
 mm/huge_memory.c     |  2 +-
 mm/page_vma_mapped.c | 11 +++++++++++
 mm/rmap.c            | 17 ++++++++++++++++-
 4 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h index d7d6d4eb17949..91ccae9467164 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -98,7 +98,8 @@ enum ttu_flags { * do a final flush if necessary */ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: * caller holds it */ - TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */ + TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */ + TTU_SYNC = 0x200, /* avoid racy checks with PVMW_SYNC */ };
#ifdef CONFIG_MMU diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c2f5b338786d2..b64904e9faa0c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2487,7 +2487,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | - TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD; + TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | TTU_SYNC; bool unmap_success;
VM_BUG_ON_PAGE(!PageHead(page), page); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 11df03e71288c..08e283ad46606 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -208,6 +208,17 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pvmw->ptl = NULL; } } else if (!pmd_present(pmde)) { + /* + * If PVMW_SYNC, take and drop THP pmd lock so that we + * cannot return prematurely, while zap_huge_pmd() has + * cleared *pmd but not decremented compound_mapcount(). + */ + if ((pvmw->flags & PVMW_SYNC) && + PageTransCompound(pvmw->page)) { + spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); + + spin_unlock(ptl); + } return false; } if (!map_pte(pvmw)) diff --git a/mm/rmap.c b/mm/rmap.c index ed6ad441a50c2..4f82fc2fbc9d1 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1354,6 +1354,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, unsigned long start = address, end; enum ttu_flags flags = (enum ttu_flags)arg;
+ /* + * When racing against e.g. zap_pte_range() on another cpu, + * in between its ptep_get_and_clear_full() and page_remove_rmap(), + * try_to_unmap() may return false when it is about to become true, + * if page table locking is skipped: use TTU_SYNC to wait for that. + */ + if (flags & TTU_SYNC) + pvmw.flags = PVMW_SYNC; + /* munlock has nothing to gain from examining un-locked vmas */ if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED)) return true; @@ -1734,7 +1743,13 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags) else rmap_walk(page, &rwc);
- return !page_mapcount(page) ? true : false; + /* + * When racing against e.g. zap_pte_range() on another cpu, + * in between its ptep_get_and_clear_full() and page_remove_rmap(), + * try_to_unmap() may return false when it is about to become true, + * if page table locking is skipped: use TTU_SYNC to wait for that. + */ + return !page_mapcount(page); }
/**
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.197 commit f38722557339bd22a5f1e8bae013f81b69309e9a
--------------------------------
[ Upstream commit 494334e43c16d63b878536a26505397fce6ff3a2 ]
Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
When that BUG() was changed to a WARN(), it would later crash on the VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap().
vma_address() is usually correct, but there's a wraparound case when the vm_start address is unusually low, but vm_pgoff not so low: vma_address() chooses max(start, vma->vm_start), but that decides on the wrong address, because start has become almost ULONG_MAX.
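For illustration only (the numbers below are made up, not taken from any report), the old calculation was, in condensed form:

	start = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	return max(start, vma->vm_start);

With 4K pages, a file THP head at pgoff 0x1f0, mapped by a vma with vm_pgoff 0x200 but an unusually low vm_start of 0x1000, underflows the unsigned subtraction: (0x1f0 - 0x200) << PAGE_SHIFT is -0x10000 modulo 2^64, so start wraps to 0xffffffffffff1000, and max(start, vma->vm_start) returns that bogus near-ULONG_MAX address instead of clamping to vm_start. The rewritten helpers below either clamp such a head page to vma->vm_start or return -EFAULT.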
Rewrite vma_address() to be more careful about vm_pgoff; move the VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can be safely used from page_mapped_in_vma() and page_address_in_vma() too.
Add vma_address_end() to apply similar care to end address calculation, in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one(); though it raises a question of whether callers would do better to supply pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.
An irritation is that their apparent generality breaks down on KSM pages, which cannot be located by the page->index that page_to_pgoff() uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages") once discovered. I dithered over the best thing to do about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and vma_address_end(); though the only place in danger of using it on them was try_to_unmap_one().
Sidenote: vma_address() and vma_address_end() now use compound_nr() on a head page, instead of thp_size(): to make the right calculation on a hugetlbfs page, whether or not THPs are configured. try_to_unmap() is used on hugetlbfs pages, but perhaps the wrong calculation never mattered.
Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: fixed up conflicts on intervening thp_size(), and mmu_notifier_range initializations; substitute for compound_nr().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/internal.h        | 53 ++++++++++++++++++++++++++++++++------------
 mm/page_vma_mapped.c | 16 +++++--------
 mm/rmap.c            | 14 ++++++------
 3 files changed, 52 insertions(+), 31 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h index 91ffb1dced61d..bfa97c43a3fc5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -331,27 +331,52 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page) extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
/* - * At what user virtual address is page expected in @vma? + * At what user virtual address is page expected in vma? + * Returns -EFAULT if all of the page is outside the range of vma. + * If page is a compound head, the entire compound page is considered. */ static inline unsigned long -__vma_address(struct page *page, struct vm_area_struct *vma) +vma_address(struct page *page, struct vm_area_struct *vma) { - pgoff_t pgoff = page_to_pgoff(page); - return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + pgoff_t pgoff; + unsigned long address; + + VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */ + pgoff = page_to_pgoff(page); + if (pgoff >= vma->vm_pgoff) { + address = vma->vm_start + + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + /* Check for address beyond vma (or wrapped through 0?) */ + if (address < vma->vm_start || address >= vma->vm_end) + address = -EFAULT; + } else if (PageHead(page) && + pgoff + (1UL << compound_order(page)) - 1 >= vma->vm_pgoff) { + /* Test above avoids possibility of wrap to 0 on 32-bit */ + address = vma->vm_start; + } else { + address = -EFAULT; + } + return address; }
+/* + * Then at what user virtual address will none of the page be found in vma? + * Assumes that vma_address() already returned a good starting address. + * If page is a compound head, the entire compound page is considered. + */ static inline unsigned long -vma_address(struct page *page, struct vm_area_struct *vma) +vma_address_end(struct page *page, struct vm_area_struct *vma) { - unsigned long start, end; - - start = __vma_address(page, vma); - end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1); - - /* page should be within @vma mapping range */ - VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma); - - return max(start, vma->vm_start); + pgoff_t pgoff; + unsigned long address; + + VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */ + pgoff = page_to_pgoff(page) + (1UL << compound_order(page)); + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + /* Check for address beyond vma (or wrapped through 0?) */ + if (address < vma->vm_start || address > vma->vm_end) + address = vma->vm_end; + return address; }
static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 08e283ad46606..bb7488e86bea3 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -224,18 +224,18 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (!map_pte(pvmw)) goto next_pte; while (1) { + unsigned long end; + if (check_pte(pvmw)) return true; next_pte: /* Seek to next pte only makes sense for THP */ if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page)) return not_found(pvmw); + end = vma_address_end(pvmw->page, pvmw->vma); do { pvmw->address += PAGE_SIZE; - if (pvmw->address >= pvmw->vma->vm_end || - pvmw->address >= - __vma_address(pvmw->page, pvmw->vma) + - hpage_nr_pages(pvmw->page) * PAGE_SIZE) + if (pvmw->address >= end) return not_found(pvmw); /* Did we cross page table boundary? */ if (pvmw->address % PMD_SIZE == 0) { @@ -273,14 +273,10 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) .vma = vma, .flags = PVMW_SYNC, }; - unsigned long start, end; - - start = __vma_address(page, vma); - end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
- if (unlikely(end < vma->vm_start || start >= vma->vm_end)) + pvmw.address = vma_address(page, vma); + if (pvmw.address == -EFAULT) return 0; - pvmw.address = max(start, vma->vm_start); if (!page_vma_mapped_walk(&pvmw)) return 0; page_vma_mapped_walk_done(&pvmw); diff --git a/mm/rmap.c b/mm/rmap.c index 4f82fc2fbc9d1..beb191abde219 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -692,7 +692,6 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) */ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) { - unsigned long address; if (PageAnon(page)) { struct anon_vma *page__anon_vma = page_anon_vma(page); /* @@ -707,10 +706,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) return -EFAULT; } else return -EFAULT; - address = __vma_address(page, vma); - if (unlikely(address < vma->vm_start || address >= vma->vm_end)) - return -EFAULT; - return address; + + return vma_address(page, vma); }
pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address) @@ -902,7 +899,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, * We have to assume the worse case ie pmd for invalidation. Note that * the page can not be free from this function. */ - end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + end = vma_address_end(page, vma); mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
while (page_vma_mapped_walk(&pvmw)) { @@ -1384,7 +1381,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * Note that the page can not be free in this function as call of * try_to_unmap() must hold a reference on the page. */ - end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + end = PageKsm(page) ? + address + PAGE_SIZE : vma_address_end(page, vma); if (PageHuge(page)) { /* * If sharing is possible, start and end will be adjusted @@ -1846,6 +1844,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, struct vm_area_struct *vma = avc->vma; unsigned long address = vma_address(page, vma);
+ VM_BUG_ON_VMA(address == -EFAULT, vma); cond_resched();
if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) @@ -1900,6 +1899,7 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, pgoff_start, pgoff_end) { unsigned long address = vma_address(page, vma);
+ VM_BUG_ON_VMA(address == -EFAULT, vma); cond_resched();
if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
From: Jue Wang <juew@google.com>
stable inclusion from linux-4.19.197 commit 2c595b67fceda4a345ac4dcbd0caa2049fd17cb7
--------------------------------
[ Upstream commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 ]
Anon THP tails were already supported, but memory-failure may need to use page_address_in_vma() on file THP tails, which its page->mapping check did not permit: fix it.
hughd adds: no current usage is known to hit the issue, but this does fix a subtle trap in a general helper: best fixed in stable sooner than later.
Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/rmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c index beb191abde219..738e07ee35345 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -701,11 +701,11 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) if (!vma->anon_vma || !page__anon_vma || vma->anon_vma->root != page__anon_vma->root) return -EFAULT; - } else if (page->mapping) { - if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping) - return -EFAULT; - } else + } else if (!vma->vm_file) { + return -EFAULT; + } else if (vma->vm_file->f_mapping != compound_head(page)->mapping) { return -EFAULT; + }
return vma_address(page, vma); }
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.197 commit d5cd96a7880322692d64fbe75d321ccd39392537
--------------------------------
[ Upstream commit 22061a1ffabdb9c3385de159c5db7aac3a4df1cc ]
There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped.
The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page.
However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside.
The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside.
Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly.
This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe.
Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped.
Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: fixed up call to truncate_cleanup_page() in truncate_inode_pages_range(). Use hpage_nr_pages() in unmap_mapping_page().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 41 +++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c      | 43 +++++++++++++++++++------------------------
 3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 99fd979c03d5b..db53f49e13b91 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1428,6 +1428,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ + struct page *single_page; /* Locked page to be unmapped */ };
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, @@ -1518,6 +1519,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked); +void unmap_mapping_page(struct page *page); void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows); void unmap_mapping_range(struct address_space *mapping, @@ -1538,6 +1540,7 @@ static inline int fixup_user_fault(struct task_struct *tsk, BUG(); return -EFAULT; } +static inline void unmap_mapping_page(struct page *page) { } static inline void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows) { } static inline void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c index a1fefe2b368d3..6acc5ae64dcd6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1193,7 +1193,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, else if (zap_huge_pmd(tlb, vma, pmd, addr)) goto next; /* fall through */ + } else if (details && details->single_page && + PageTransCompound(details->single_page) && + next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) { + spinlock_t *ptl = pmd_lock(tlb->mm, pmd); + /* + * Take and drop THP pmd lock so that we cannot return + * prematurely, while zap_huge_pmd() has cleared *pmd, + * but not yet decremented compound_mapcount(). + */ + spin_unlock(ptl); } + /* * Here there can be other concurrent MADV_DONTNEED or * trans huge page faults running, and if the pmd is @@ -2682,6 +2693,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root, } }
+/** + * unmap_mapping_page() - Unmap single page from processes. + * @page: The locked page to be unmapped. + * + * Unmap this page from any userspace process which still has it mmaped. + * Typically, for efficiency, the range of nearby pages has already been + * unmapped by unmap_mapping_pages() or unmap_mapping_range(). But once + * truncation or invalidation holds the lock on a page, it may find that + * the page has been remapped again: and then uses unmap_mapping_page() + * to unmap it finally. + */ +void unmap_mapping_page(struct page *page) +{ + struct address_space *mapping = page->mapping; + struct zap_details details = { }; + + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(PageTail(page)); + + details.check_mapping = mapping; + details.first_index = page->index; + details.last_index = page->index + hpage_nr_pages(page) - 1; + details.single_page = page; + + i_mmap_lock_write(mapping); + if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))) + unmap_mapping_range_tree(&mapping->i_mmap, &details); + i_mmap_unlock_write(mapping); +} + /** * unmap_mapping_pages() - Unmap pages from processes. * @mapping: The address space containing pages to be unmapped. diff --git a/mm/truncate.c b/mm/truncate.c index 71b65aab80775..43c73db17a0a6 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -175,13 +175,10 @@ void do_invalidatepage(struct page *page, unsigned int offset, * its lock, b) when a concurrent invalidate_mapping_pages got there first and * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. */ -static void -truncate_cleanup_page(struct address_space *mapping, struct page *page) +static void truncate_cleanup_page(struct page *page) { - if (page_mapped(page)) { - pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1; - unmap_mapping_pages(mapping, page->index, nr, false); - } + if (page_mapped(page)) + unmap_mapping_page(page);
if (page_has_private(page)) do_invalidatepage(page, 0, PAGE_SIZE); @@ -226,7 +223,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page) if (page->mapping != mapping) return -EIO;
- truncate_cleanup_page(mapping, page); + truncate_cleanup_page(page); delete_from_page_cache(page); return 0; } @@ -364,7 +361,7 @@ void truncate_inode_pages_range(struct address_space *mapping, pagevec_add(&locked_pvec, page); } for (i = 0; i < pagevec_count(&locked_pvec); i++) - truncate_cleanup_page(mapping, locked_pvec.pages[i]); + truncate_cleanup_page(locked_pvec.pages[i]); delete_from_page_cache_batch(mapping, &locked_pvec); for (i = 0; i < pagevec_count(&locked_pvec); i++) unlock_page(locked_pvec.pages[i]); @@ -703,6 +700,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping, continue; }
+ if (!did_range_unmap && page_mapped(page)) { + /* + * If page is mapped, before taking its lock, + * zap the rest of the file in one hit. + */ + unmap_mapping_pages(mapping, index, + (1 + end - index), false); + did_range_unmap = 1; + } + lock_page(page); WARN_ON(page_to_index(page) != index); if (page->mapping != mapping) { @@ -710,23 +717,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping, continue; } wait_on_page_writeback(page); - if (page_mapped(page)) { - if (!did_range_unmap) { - /* - * Zap the rest of the file in one hit. - */ - unmap_mapping_pages(mapping, index, - (1 + end - index), false); - did_range_unmap = 1; - } else { - /* - * Just zap this page - */ - unmap_mapping_pages(mapping, index, - 1, false); - } - } + + if (page_mapped(page)) + unmap_mapping_page(page); BUG_ON(page_mapped(page)); + ret2 = do_launder_page(mapping, page); if (ret2 == 0) { if (!invalidate_complete_page2(mapping, page))
From: Yang Shi <shy828301@gmail.com>
stable inclusion from linux-4.19.197 commit e17afb6d0b7c74f316dbff33588210190600efc7
--------------------------------
[ Upstream commit 504e070dc08f757bccaed6d05c0f53ecbfac8a23 ]
When debugging the bug reported by Wang Yugui [1], try_to_unmap() may fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount() however it may miss the failure when head page is unmapped but other subpage is mapped. Then the second DEBUG_VM BUG() that check total mapcount would catch it. This may incur some confusion.
As this is not a fatal issue, so consolidate the two DEBUG_VM checks into one VM_WARN_ON_ONCE_PAGE().
[1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: fixed up variables and split_queue_lock in split_huge_page_to_list(), and conflict on ttu_flags in unmap_page().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Conflicts:
	mm/huge_memory.c
[yyl: keep it same as mainline]
---
 mm/huge_memory.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b64904e9faa0c..45004eb20bbb8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2488,15 +2488,15 @@ static void unmap_page(struct page *page) { enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | TTU_SYNC; - bool unmap_success;
VM_BUG_ON_PAGE(!PageHead(page), page);
if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE;
- unmap_success = try_to_unmap(page, ttu_flags); - VM_BUG_ON_PAGE(!unmap_success, page); + try_to_unmap(page, ttu_flags); + + VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); }
static void remap_page(struct page *page) @@ -2761,7 +2761,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) struct deferred_split ds_queue; struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; - int count, mapcount, extra_pins, ret; + int extra_pins, ret; bool mlocked; unsigned long flags; pgoff_t end; @@ -2823,7 +2823,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
mlocked = PageMlocked(page); unmap_page(head); - VM_BUG_ON_PAGE(compound_mapcount(head), head);
/* Make sure the page is not on per-CPU pagevec as it takes pin */ if (mlocked) @@ -2850,9 +2849,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) /* Prevent deferred_split_scan() touching ->_refcount */ get_deferred_split_queue(page, &ds_queue); spin_lock(ds_queue.split_queue_lock); - count = page_count(head); - mapcount = total_mapcount(head); - if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) { + if (page_ref_freeze(head, 1 + extra_pins)) { if (!list_empty(page_deferred_list(head))) { (*ds_queue.split_queue_len)--; list_del(page_deferred_list(head)); @@ -2863,16 +2860,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) __split_huge_page(page, list, end, flags); ret = 0; } else { - if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) { - pr_alert("total_mapcount: %u, page_count(): %u\n", - mapcount, count); - if (PageTail(page)) - dump_page(head, NULL); - dump_page(page, "total_mapcount(head) > 0"); - BUG(); - } spin_unlock(ds_queue.split_queue_lock); -fail: if (mapping) +fail: + if (mapping) xa_unlock(&mapping->i_pages); spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags); remap_page(head);
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit be9ab2d00d071f930cc0389a8ce1424fd84d4c6b
--------------------------------
[ Upstream commit f003c03bd29e6f46fef1b9a8e8d636ac732286d5 ]
Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".
I've marked all of these for stable: many are merely cleanups, but I think they are much better before the main fix than after.
This patch (of 11):
page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page was used, sometimes pvmw->page itself: use the local copy "page" throughout.
Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Alistair Popple apopple@nvidia.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Peter Xu peterx@redhat.com Cc: Yang Shi shy828301@gmail.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Matthew Wilcox willy@infradead.org Cc: Ralph Campbell rcampbell@nvidia.com Cc: Zi Yan ziy@nvidia.com Cc: Will Deacon will@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index bb7488e86bea3..d146f6d31a624 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -151,7 +151,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pvmw->pte) goto next_pte;
- if (unlikely(PageHuge(pvmw->page))) { + if (unlikely(PageHuge(page))) { /* when pud is not present, pte will be NULL */ pvmw->pte = huge_pte_offset(mm, pvmw->address, PAGE_SIZE << compound_order(page)); @@ -213,8 +213,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) * cannot return prematurely, while zap_huge_pmd() has * cleared *pmd but not decremented compound_mapcount(). */ - if ((pvmw->flags & PVMW_SYNC) && - PageTransCompound(pvmw->page)) { + if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) { spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
spin_unlock(ptl); @@ -230,9 +229,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) return true; next_pte: /* Seek to next pte only makes sense for THP */ - if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page)) + if (!PageTransHuge(page) || PageHuge(page)) return not_found(pvmw); - end = vma_address_end(pvmw->page, pvmw->vma); + end = vma_address_end(page, pvmw->vma); do { pvmw->address += PAGE_SIZE; if (pvmw->address >= end)
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit a027b699161023c017b816b61be9990e3f9286f4
--------------------------------
[ Upstream commit 6d0fd5987657cb0c9756ce684e3a74c0f6351728 ]
page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the way at the start, so no need to worry about it later.
Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Peter Xu peterx@redhat.com Cc: Alistair Popple apopple@nvidia.com Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Cc: Matthew Wilcox willy@infradead.org Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index d146f6d31a624..036ce20bb154c 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -148,10 +148,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pvmw->pmd && !pvmw->pte) return not_found(pvmw);
- if (pvmw->pte) - goto next_pte; - if (unlikely(PageHuge(page))) { + /* The only possible mapping was handled on last iteration */ + if (pvmw->pte) + return not_found(pvmw); + /* when pud is not present, pte will be NULL */ pvmw->pte = huge_pte_offset(mm, pvmw->address, PAGE_SIZE << compound_order(page)); @@ -164,6 +165,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) return not_found(pvmw); return true; } + + if (pvmw->pte) + goto next_pte; restart: pgd = pgd_offset(mm, pvmw->address); if (!pgd_present(*pgd)) @@ -229,7 +233,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) return true; next_pte: /* Seek to next pte only makes sense for THP */ - if (!PageTransHuge(page) || PageHuge(page)) + if (!PageTransHuge(page)) return not_found(pvmw); end = vma_address_end(page, pvmw->vma); do {
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit 9fbb45c5d59d8e4c81c90cc418bed167503db227
--------------------------------
[ Upstream commit 3306d3119ceacc43ea8b141a73e21fea68eec30c ]
page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then use it in subsequent tests, instead of repeatedly dereferencing pointer.
Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Peter Xu peterx@redhat.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 036ce20bb154c..7e1db77b096a1 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -187,18 +187,19 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pmde = READ_ONCE(*pvmw->pmd); if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) { pvmw->ptl = pmd_lock(mm, pvmw->pmd); - if (likely(pmd_trans_huge(*pvmw->pmd))) { + pmde = *pvmw->pmd; + if (likely(pmd_trans_huge(pmde))) { if (pvmw->flags & PVMW_MIGRATION) return not_found(pvmw); - if (pmd_page(*pvmw->pmd) != page) + if (pmd_page(pmde) != page) return not_found(pvmw); return true; - } else if (!pmd_present(*pvmw->pmd)) { + } else if (!pmd_present(pmde)) { if (thp_migration_supported()) { if (!(pvmw->flags & PVMW_MIGRATION)) return not_found(pvmw); - if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) { - swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd); + if (is_migration_entry(pmd_to_swp_entry(pmde))) { + swp_entry_t entry = pmd_to_swp_entry(pmde);
if (migration_entry_to_page(entry) != page) return not_found(pvmw);
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit d4b99cf445d24089255a49ff927ed8884ca2f7f5
--------------------------------
[ Upstream commit e2e1d4076c77b3671cf8ce702535ae7dee3acf89 ]
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to follow the same "return not_found, return not_found, return true" pattern as the block above it (note: returning not_found there is never premature, since existence or prior existence of huge pmd guarantees good alignment).
Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Reviewed-by: Peter Xu peterx@redhat.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 7e1db77b096a1..e32ed79129239 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -194,24 +194,22 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pmd_page(pmde) != page) return not_found(pvmw); return true; - } else if (!pmd_present(pmde)) { - if (thp_migration_supported()) { - if (!(pvmw->flags & PVMW_MIGRATION)) - return not_found(pvmw); - if (is_migration_entry(pmd_to_swp_entry(pmde))) { - swp_entry_t entry = pmd_to_swp_entry(pmde); + } + if (!pmd_present(pmde)) { + swp_entry_t entry;
- if (migration_entry_to_page(entry) != page) - return not_found(pvmw); - return true; - } - } - return not_found(pvmw); - } else { - /* THP pmd was split under us: handle on pte level */ - spin_unlock(pvmw->ptl); - pvmw->ptl = NULL; + if (!thp_migration_supported() || + !(pvmw->flags & PVMW_MIGRATION)) + return not_found(pvmw); + entry = pmd_to_swp_entry(pmde); + if (!is_migration_entry(entry) || + migration_entry_to_page(entry) != page) + return not_found(pvmw); + return true; } + /* THP pmd was split under us: handle on pte level */ + spin_unlock(pvmw->ptl); + pvmw->ptl = NULL; } else if (!pmd_present(pmde)) { /* * If PVMW_SYNC, take and drop THP pmd lock so that we
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit 97a79b7896d6b8c009561f695649b31c1b35437c
--------------------------------
[ Upstream commit 448282487483d6fa5b2eeeafaa0acc681e544a9c ]
page_vma_mapped_walk() cleanup: adjust the test for crossing page table boundary - I believe pvmw->address is always page-aligned, but nothing else here assumed that; and remember to reset pvmw->pte to NULL after unmapping the page table, though I never saw any bug from that.
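As a quick illustration of the adjusted test (a minimal user-space sketch with illustrative x86_64/4K constants, not kernel code): the masked test agrees with the old modulo test whenever the address is page-aligned, and in addition fires for any address that falls inside the first page of a PMD range.

#include <stdio.h>

#define PAGE_SIZE 0x1000UL            /* illustrative: 4 KiB  */
#define PMD_SIZE  0x200000UL          /* illustrative: 2 MiB  */

static int crossed_old(unsigned long addr)
{
	return addr % PMD_SIZE == 0;
}

static int crossed_new(unsigned long addr)
{
	/* true for any address within the first page of a PMD range */
	return (addr & (PMD_SIZE - PAGE_SIZE)) == 0;
}

int main(void)
{
	unsigned long addrs[] = { 0x200000, 0x201000, 0x200008 };

	for (int i = 0; i < 3; i++)
		printf("%#lx old=%d new=%d\n",
		       addrs[i], crossed_old(addrs[i]), crossed_new(addrs[i]));
	/* 0x200008: old=0, new=1 - the new test tolerates a sub-page offset */
	return 0;
}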
Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index e32ed79129239..1fccadd0eb6d9 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -240,16 +240,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pvmw->address >= end) return not_found(pvmw); /* Did we cross page table boundary? */ - if (pvmw->address % PMD_SIZE == 0) { - pte_unmap(pvmw->pte); + if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) { if (pvmw->ptl) { spin_unlock(pvmw->ptl); pvmw->ptl = NULL; } + pte_unmap(pvmw->pte); + pvmw->pte = NULL; goto restart; - } else { - pvmw->pte++; } + pvmw->pte++; } while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit ac0324b14dae4dd664fd94d9ada161089f4d2b8b
--------------------------------
[ Upstream commit b3807a91aca7d21c05d5790612e49969117a72b9 ]
page_vma_mapped_walk() cleanup: add a level of indentation to much of the body, making no functional change in this commit, but reducing the later diff when this is all converted to a loop.
[hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix] Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com
Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 105 ++++++++++++++++++++++--------------------- 1 file changed, 55 insertions(+), 50 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 1fccadd0eb6d9..8199456782640 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -169,62 +169,67 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pvmw->pte) goto next_pte; restart: - pgd = pgd_offset(mm, pvmw->address); - if (!pgd_present(*pgd)) - return false; - p4d = p4d_offset(pgd, pvmw->address); - if (!p4d_present(*p4d)) - return false; - pud = pud_offset(p4d, pvmw->address); - if (!pud_present(*pud)) - return false; - pvmw->pmd = pmd_offset(pud, pvmw->address); - /* - * Make sure the pmd value isn't cached in a register by the - * compiler and used as a stale value after we've observed a - * subsequent update. - */ - pmde = READ_ONCE(*pvmw->pmd); - if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) { - pvmw->ptl = pmd_lock(mm, pvmw->pmd); - pmde = *pvmw->pmd; - if (likely(pmd_trans_huge(pmde))) { - if (pvmw->flags & PVMW_MIGRATION) - return not_found(pvmw); - if (pmd_page(pmde) != page) - return not_found(pvmw); - return true; - } - if (!pmd_present(pmde)) { - swp_entry_t entry; + { + pgd = pgd_offset(mm, pvmw->address); + if (!pgd_present(*pgd)) + return false; + p4d = p4d_offset(pgd, pvmw->address); + if (!p4d_present(*p4d)) + return false; + pud = pud_offset(p4d, pvmw->address); + if (!pud_present(*pud)) + return false;
- if (!thp_migration_supported() || - !(pvmw->flags & PVMW_MIGRATION)) - return not_found(pvmw); - entry = pmd_to_swp_entry(pmde); - if (!is_migration_entry(entry) || - migration_entry_to_page(entry) != page) - return not_found(pvmw); - return true; - } - /* THP pmd was split under us: handle on pte level */ - spin_unlock(pvmw->ptl); - pvmw->ptl = NULL; - } else if (!pmd_present(pmde)) { + pvmw->pmd = pmd_offset(pud, pvmw->address); /* - * If PVMW_SYNC, take and drop THP pmd lock so that we - * cannot return prematurely, while zap_huge_pmd() has - * cleared *pmd but not decremented compound_mapcount(). + * Make sure the pmd value isn't cached in a register by the + * compiler and used as a stale value after we've observed a + * subsequent update. */ - if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) { - spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); + pmde = READ_ONCE(*pvmw->pmd);
- spin_unlock(ptl); + if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) { + pvmw->ptl = pmd_lock(mm, pvmw->pmd); + pmde = *pvmw->pmd; + if (likely(pmd_trans_huge(pmde))) { + if (pvmw->flags & PVMW_MIGRATION) + return not_found(pvmw); + if (pmd_page(pmde) != page) + return not_found(pvmw); + return true; + } + if (!pmd_present(pmde)) { + swp_entry_t entry; + + if (!thp_migration_supported() || + !(pvmw->flags & PVMW_MIGRATION)) + return not_found(pvmw); + entry = pmd_to_swp_entry(pmde); + if (!is_migration_entry(entry) || + migration_entry_to_page(entry) != page) + return not_found(pvmw); + return true; + } + /* THP pmd was split under us: handle on pte level */ + spin_unlock(pvmw->ptl); + pvmw->ptl = NULL; + } else if (!pmd_present(pmde)) { + /* + * If PVMW_SYNC, take and drop THP pmd lock so that we + * cannot return prematurely, while zap_huge_pmd() has + * cleared *pmd but not decremented compound_mapcount(). + */ + if ((pvmw->flags & PVMW_SYNC) && + PageTransCompound(page)) { + spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); + + spin_unlock(ptl); + } + return false; } - return false; + if (!map_pte(pvmw)) + goto next_pte; } - if (!map_pte(pvmw)) - goto next_pte; while (1) { unsigned long end;
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit 7d82908ba4fb6ae809db8a5d63f85f1acc6d2946
--------------------------------
[ Upstream commit 474466301dfd8b39a10c01db740645f3f7ae9a28 ]
page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte, and use "goto this_pte", in place of the "while (1)" loop at the end.
Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 8199456782640..470f19c917ad5 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -139,6 +139,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) { struct mm_struct *mm = pvmw->vma->vm_mm; struct page *page = pvmw->page; + unsigned long end; pgd_t *pgd; p4d_t *p4d; pud_t *pud; @@ -229,10 +230,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) } if (!map_pte(pvmw)) goto next_pte; - } - while (1) { - unsigned long end; - +this_pte: if (check_pte(pvmw)) return true; next_pte: @@ -261,6 +259,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pvmw->ptl = pte_lockptr(mm, pvmw->pmd); spin_lock(pvmw->ptl); } + goto this_pte; } }
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit b114408b1aa2b0957877632113009bdf5efc5da2
--------------------------------
[ Upstream commit a765c417d876cc635f628365ec9aa6f09470069a ]
page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start, rather than later at next_pte.
It's a little unnecessary overhead on the first call, but makes for a simpler loop in the following commit.
Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 470f19c917ad5..7239c38e99e25 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -167,6 +167,15 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) return true; }
+ /* + * Seek to next pte only makes sense for THP. + * But more important than that optimization, is to filter out + * any PageKsm page: whose page->index misleads vma_address() + * and vma_address_end() to disaster. + */ + end = PageTransCompound(page) ? + vma_address_end(page, pvmw->vma) : + pvmw->address + PAGE_SIZE; if (pvmw->pte) goto next_pte; restart: @@ -234,10 +243,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (check_pte(pvmw)) return true; next_pte: - /* Seek to next pte only makes sense for THP */ - if (!PageTransHuge(page)) - return not_found(pvmw); - end = vma_address_end(page, pvmw->vma); do { pvmw->address += PAGE_SIZE; if (pvmw->address >= end)
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit 69784c9d5cb080a57062d944acb34c334758ff0a
--------------------------------
[ Upstream commit a9a7504d9beaf395481faa91e70e2fd08f7a3dde ]
Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
Crash dumps showed two tail pages of a shmem huge page remained mapped by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a long unmapped range; and no page table had yet been allocated for the head of the huge page to be mapped into.
Although designed to handle these odd misaligned huge-page-mapped-by-pte cases, page_vma_mapped_walk() falls short by returning false prematurely when !pmd_present or !pud_present or !p4d_present or !pgd_present: there are cases when a huge page may span the boundary, with ptes present in the next page table.
Restructure page_vma_mapped_walk() as a loop to continue in these cases, while keeping its layout much as before. Add a step_forward() helper to advance pvmw->address across those boundaries: originally I tried to use mm's standard p?d_addr_end() macros, but hit the same crash 512 times less often: because of the way redundant levels are folded together, but folded differently in different configurations, it was just too difficult to use them correctly; and step_forward() is simpler anyway.
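For illustration, a minimal user-space sketch of the address-stepping arithmetic (constants assumed for x86_64 with 4K pages; it mirrors the step_forward() helper added in the diff below, so treat it as a demonstration of the rounding, not the kernel code): the address is rounded up to the next boundary of the missing level, and saturates to ULONG_MAX instead of wrapping back to zero.

#include <stdio.h>
#include <limits.h>

#define PMD_SIZE 0x200000UL	/* illustrative: 2 MiB */

static unsigned long step_forward(unsigned long address, unsigned long size)
{
	address = (address + size) & ~(size - 1);
	return address ? address : ULONG_MAX;	/* saturate on wrap-around */
}

int main(void)
{
	printf("%#lx\n", step_forward(0x201234, PMD_SIZE));		/* 0x400000 */
	printf("%#lx\n", step_forward(ULONG_MAX - 0x1000, PMD_SIZE));	/* saturates to ULONG_MAX */
	return 0;
}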
Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Wang Yugui wangyugui@e16-tech.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 7239c38e99e25..b7a8009c1549f 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -111,6 +111,13 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) return pfn_in_hpage(pvmw->page, pfn); }
+static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size) +{ + pvmw->address = (pvmw->address + size) & ~(size - 1); + if (!pvmw->address) + pvmw->address = ULONG_MAX; +} + /** * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at * @pvmw->address @@ -179,16 +186,22 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pvmw->pte) goto next_pte; restart: - { + do { pgd = pgd_offset(mm, pvmw->address); - if (!pgd_present(*pgd)) - return false; + if (!pgd_present(*pgd)) { + step_forward(pvmw, PGDIR_SIZE); + continue; + } p4d = p4d_offset(pgd, pvmw->address); - if (!p4d_present(*p4d)) - return false; + if (!p4d_present(*p4d)) { + step_forward(pvmw, P4D_SIZE); + continue; + } pud = pud_offset(p4d, pvmw->address); - if (!pud_present(*pud)) - return false; + if (!pud_present(*pud)) { + step_forward(pvmw, PUD_SIZE); + continue; + }
pvmw->pmd = pmd_offset(pud, pvmw->address); /* @@ -235,7 +248,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
spin_unlock(ptl); } - return false; + step_forward(pvmw, PMD_SIZE); + continue; } if (!map_pte(pvmw)) goto next_pte; @@ -265,7 +279,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) spin_lock(pvmw->ptl); } goto this_pte; - } + } while (pvmw->address < end); + + return false; }
/**
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit e943b4373cf706ee8ee433988bc0c4d6e3ea5907
--------------------------------
[ Upstream commit a7a69d8ba88d8dcee7ef00e91d413a4bd003a814 ]
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds ptlock in the PVMW_SYNC case? That too might have been responsible for BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though I've never seen any.
Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Tested-by: Wang Yugui wangyugui@e16-tech.com Cc: Alistair Popple apopple@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/page_vma_mapped.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index b7a8009c1549f..edca786093187 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -272,6 +272,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) goto restart; } pvmw->pte++; + if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) { + pvmw->ptl = pte_lockptr(mm, pvmw->pmd); + spin_lock(pvmw->ptl); + } } while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.197 commit 2445837e9cd084ba849a7c1c70086a6cdc608f48
--------------------------------
[ Upstream commit fe19bd3dae3d15d2fbfdb3de8839a6ea0fe94264 ]
If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong.
When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key") was added, the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages.
page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head.
Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it.
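For illustration, a minimal user-space sketch of the arithmetic (illustrative numbers, not the kernel implementation): hugetlbfs keeps page->index on the head page in huge-page-sized units, so a tail page's offset in PAGE_SIZE units is the head index scaled by pages-per-huge-page plus the tail's position within the compound page, which is what the futex key needs.

#include <stdio.h>

int main(void)
{
	unsigned long head_index = 3;       /* 4th huge page in the file */
	unsigned int compound_order = 9;    /* 2 MiB huge page, 4 KiB base pages */
	unsigned long tail_offset = 5;      /* 6th base page within the huge page */

	unsigned long pgoff = (head_index << compound_order) + tail_offset;

	printf("basepage index = %lu\n", pgoff);   /* 3 * 512 + 5 = 1541 */
	return 0;
}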
[akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu neelnatu@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Acked-by: Thomas Gleixner tglx@linutronix.de Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com Cc: Zhang Yi wetpzy@gmail.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Darren Hart dvhart@infradead.org Cc: Davidlohr Bueso dave@stgolabs.net Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Note on stable backport: leave redundant #include <linux/hugetlb.h> in kernel/futex.c, to avoid conflict over the header files included.
Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/hugetlb.h | 16 ---------------- include/linux/pagemap.h | 13 +++++++------ kernel/futex.c | 2 +- mm/hugetlb.c | 5 +---- 4 files changed, 9 insertions(+), 27 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 831e8b654ab81..6831b24defd64 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -485,17 +485,6 @@ static inline int hstate_index(struct hstate *h) return h - hstates; }
-pgoff_t __basepage_index(struct page *page); - -/* Return page->index in PAGE_SIZE units */ -static inline pgoff_t basepage_index(struct page *page) -{ - if (!PageCompound(page)) - return page->index; - - return __basepage_index(page); -} - extern int dissolve_free_huge_page(struct page *page); extern int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn); @@ -590,11 +579,6 @@ static inline int hstate_index(struct hstate *h) return 0; }
-static inline pgoff_t basepage_index(struct page *page) -{ - return page->index; -} - static inline int dissolve_free_huge_page(struct page *page) { return 0; diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e889d99296159..085aed892ce58 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -427,7 +427,7 @@ static inline struct page *read_mapping_page(struct address_space *mapping, }
/* - * Get index of the page with in radix-tree + * Get index of the page within radix-tree (but not for hugetlb pages). * (TODO: remove once hugetlb pages will have ->index in PAGE_SIZE) */ static inline pgoff_t page_to_index(struct page *page) @@ -446,15 +446,16 @@ static inline pgoff_t page_to_index(struct page *page) return pgoff; }
+extern pgoff_t hugetlb_basepage_index(struct page *page); + /* - * Get the offset in PAGE_SIZE. - * (TODO: hugepage should have ->index in PAGE_SIZE) + * Get the offset in PAGE_SIZE (even for hugetlb pages). + * (TODO: hugetlb pages should have ->index in PAGE_SIZE) */ static inline pgoff_t page_to_pgoff(struct page *page) { - if (unlikely(PageHeadHuge(page))) - return page->index << compound_order(page); - + if (unlikely(PageHuge(page))) + return hugetlb_basepage_index(page); return page_to_index(page); }
diff --git a/kernel/futex.c b/kernel/futex.c index 13574134bbc44..a36a006e5b138 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -724,7 +724,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, enum futex_a
key->both.offset |= FUT_OFF_INODE; /* inode-based key */ key->shared.i_seq = get_inode_sequence_number(inode); - key->shared.pgoff = basepage_index(tail); + key->shared.pgoff = page_to_pgoff(tail); rcu_read_unlock(); }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 27d03b7197eea..5132fefb60c86 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1409,15 +1409,12 @@ int PageHeadHuge(struct page *page_head) return get_compound_page_dtor(page_head) == free_huge_page; }
-pgoff_t __basepage_index(struct page *page) +pgoff_t hugetlb_basepage_index(struct page *page) { struct page *page_head = compound_head(page); pgoff_t index = page_index(page_head); unsigned long compound_idx;
- if (!PageHuge(page_head)) - return page_index(page); - if (compound_order(page_head) >= MAX_ORDER) compound_idx = page_to_pfn(page) - page_to_pfn(page_head); else
From: ManYi Li limanyi@uniontech.com
stable inclusion from linux-4.19.197 commit b52a404434a93d83b243187c6cf135dd6ba529c8
--------------------------------
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has been removed from the drive and user intervention is required to proceed. Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com Signed-off-by: ManYi Li limanyi@uniontech.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/sr.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c index 5be3d6b7991b4..a46fbe2d2ee63 100644 --- a/drivers/scsi/sr.c +++ b/drivers/scsi/sr.c @@ -216,6 +216,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev) return DISK_EVENT_EJECT_REQUEST; else if (med->media_event_code == 2) return DISK_EVENT_MEDIA_CHANGE; + else if (med->media_event_code == 3) + return DISK_EVENT_EJECT_REQUEST; return 0; }
From: Petr Mladek pmladek@suse.com
stable inclusion from linux-4.19.197 commit 13bcf5aeb33caa283c6b03e14b7a254e1223c0d7
--------------------------------
commit 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 upstream.
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()".
This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling.
This patch (of 2):
Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().
It does not modify the existing behavior.
Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek pmladek@suse.com Cc: jenhaochen@google.com Cc: Martin Liu liumartin@google.com Cc: Minchan Kim minchan@google.com Cc: Nathan Chancellor nathan@kernel.org Cc: Nick Desaulniers ndesaulniers@google.com Cc: Oleg Nesterov oleg@redhat.com Cc: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/kthread.c | 46 +++++++++++++++++++++++++++++----------------- 1 file changed, 29 insertions(+), 17 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c index d866cf1a3e20e..112591bd4b7a6 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1005,6 +1005,33 @@ void kthread_flush_work(struct kthread_work *work) } EXPORT_SYMBOL_GPL(kthread_flush_work);
+/* + * Make sure that the timer is neither set nor running and could + * not manipulate the work list_head any longer. + * + * The function is called under worker->lock. The lock is temporary + * released but the timer can't be set again in the meantime. + */ +static void kthread_cancel_delayed_work_timer(struct kthread_work *work, + unsigned long *flags) +{ + struct kthread_delayed_work *dwork = + container_of(work, struct kthread_delayed_work, work); + struct kthread_worker *worker = work->worker; + + /* + * del_timer_sync() must be called to make sure that the timer + * callback is not running. The lock must be temporary released + * to avoid a deadlock with the callback. In the meantime, + * any queuing is blocked by setting the canceling counter. + */ + work->canceling++; + spin_unlock_irqrestore(&worker->lock, *flags); + del_timer_sync(&dwork->timer); + spin_lock_irqsave(&worker->lock, *flags); + work->canceling--; +} + /* * This function removes the work from the worker queue. Also it makes sure * that it won't get queued later via the delayed work's timer. @@ -1019,23 +1046,8 @@ static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork, unsigned long *flags) { /* Try to cancel the timer if exists. */ - if (is_dwork) { - struct kthread_delayed_work *dwork = - container_of(work, struct kthread_delayed_work, work); - struct kthread_worker *worker = work->worker; - - /* - * del_timer_sync() must be called to make sure that the timer - * callback is not running. The lock must be temporary released - * to avoid a deadlock with the callback. In the meantime, - * any queuing is blocked by setting the canceling counter. - */ - work->canceling++; - spin_unlock_irqrestore(&worker->lock, *flags); - del_timer_sync(&dwork->timer); - spin_lock_irqsave(&worker->lock, *flags); - work->canceling--; - } + if (is_dwork) + kthread_cancel_delayed_work_timer(work, flags);
/* * Try to remove the work from a worker list. It might either
From: Petr Mladek pmladek@suse.com
stable inclusion from linux-4.19.197 commit 6e2a98bc902ce347747fabb133485d7ce2bdd7e4
--------------------------------
commit 5fa54346caf67b4b1b10b1f390316ae466da4d53 upstream.
The system might hang with the following backtrace:
schedule+0x80/0x100
schedule_timeout+0x48/0x138
wait_for_common+0xa4/0x134
wait_for_completion+0x1c/0x2c
kthread_flush_work+0x114/0x1cc
kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
kthread_cancel_delayed_work_sync+0x18/0x2c
xxxx_pm_notify+0xb0/0xd8
blocking_notifier_call_chain_robust+0x80/0x194
pm_notifier_call_chain_robust+0x28/0x4c
suspend_prepare+0x40/0x260
enter_state+0x80/0x3f4
pm_suspend+0x60/0xdc
state_store+0x108/0x144
kobj_attr_store+0x38/0x88
sysfs_kf_write+0x64/0xc0
kernfs_fop_write_iter+0x108/0x1d0
vfs_write+0x2f4/0x368
ksys_write+0x7c/0xec
It is caused by the following race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync():
CPU0                                       CPU1

Context: Thread A                          Context: Thread B

kthread_mod_delayed_work()
  spin_lock()
  __kthread_cancel_work()
    spin_unlock()
    del_timer_sync()
                                           kthread_cancel_delayed_work_sync()
                                             spin_lock()
                                             __kthread_cancel_work()
                                               spin_unlock()
                                               del_timer_sync()
                                               spin_lock()

                                             work->canceling++
                                             spin_unlock()
    spin_lock()
  queue_delayed_work()
    // dwork is put into the worker->delayed_work_list

  spin_unlock()
                                           kthread_flush_work()
                                             // flush_work is put at the tail of the dwork

                                           wait_for_completion()

Context: IRQ

  kthread_delayed_work_timer_fn()
    spin_lock()
    list_del_init(&work->node);
    spin_unlock()

BANG: flush_work is no longer linked and will never get processed.
The problem is that kthread_mod_delayed_work() checks the work->canceling flag before canceling the timer.
A simple solution is to (re)check work->canceling after __kthread_cancel_work(). But then it is not clear what should be returned when __kthread_cancel_work() removed the work from the queue (list) and it can't queue it again with the new @delay.
The return value might be used for reference counting. The caller has to know whether a new work has been queued or an existing one was replaced.
The proper solution is that kthread_mod_delayed_work() will remove the work from the queue (list) _only_ when work->canceling is not set. The flag must be checked after the timer is stopped and the remaining operations can be done under worker->lock.
Note that kthread_mod_delayed_work() could remove the timer and then bail out. It is fine. The other canceling caller needs to cancel the timer as well. The important thing is that the queue (list) manipulation is done atomically under worker->lock.
Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work") Signed-off-by: Petr Mladek pmladek@suse.com Reported-by: Martin Liu liumartin@google.com Cc: jenhaochen@google.com Cc: Minchan Kim minchan@google.com Cc: Nathan Chancellor nathan@kernel.org Cc: Nick Desaulniers ndesaulniers@google.com Cc: Oleg Nesterov oleg@redhat.com Cc: Tejun Heo tj@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/kthread.c | 35 ++++++++++++++++++++++++----------- 1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c index 112591bd4b7a6..ef13ce4ade8a5 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1033,8 +1033,11 @@ static void kthread_cancel_delayed_work_timer(struct kthread_work *work, }
/* - * This function removes the work from the worker queue. Also it makes sure - * that it won't get queued later via the delayed work's timer. + * This function removes the work from the worker queue. + * + * It is called under worker->lock. The caller must make sure that + * the timer used by delayed work is not running, e.g. by calling + * kthread_cancel_delayed_work_timer(). * * The work might still be in use when this function finishes. See the * current_work proceed by the worker. @@ -1042,13 +1045,8 @@ static void kthread_cancel_delayed_work_timer(struct kthread_work *work, * Return: %true if @work was pending and successfully canceled, * %false if @work was not pending */ -static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork, - unsigned long *flags) +static bool __kthread_cancel_work(struct kthread_work *work) { - /* Try to cancel the timer if exists. */ - if (is_dwork) - kthread_cancel_delayed_work_timer(work, flags); - /* * Try to remove the work from a worker list. It might either * be from worker->work_list or from worker->delayed_work_list. @@ -1101,11 +1099,23 @@ bool kthread_mod_delayed_work(struct kthread_worker *worker, /* Work must not be used with >1 worker, see kthread_queue_work() */ WARN_ON_ONCE(work->worker != worker);
- /* Do not fight with another command that is canceling this work. */ + /* + * Temporary cancel the work but do not fight with another command + * that is canceling the work as well. + * + * It is a bit tricky because of possible races with another + * mod_delayed_work() and cancel_delayed_work() callers. + * + * The timer must be canceled first because worker->lock is released + * when doing so. But the work can be removed from the queue (list) + * only when it can be queued again so that the return value can + * be used for reference counting. + */ + kthread_cancel_delayed_work_timer(work, &flags); if (work->canceling) goto out; + ret = __kthread_cancel_work(work);
- ret = __kthread_cancel_work(work, true, &flags); fast_queue: __kthread_queue_delayed_work(worker, dwork, delay); out: @@ -1127,7 +1137,10 @@ static bool __kthread_cancel_work_sync(struct kthread_work *work, bool is_dwork) /* Work must not be used with >1 worker, see kthread_queue_work(). */ WARN_ON_ONCE(work->worker != worker);
- ret = __kthread_cancel_work(work, is_dwork, &flags); + if (is_dwork) + kthread_cancel_delayed_work_timer(work, &flags); + + ret = __kthread_cancel_work(work);
if (worker->current_work != work) goto out_fast;
From: Quat Le quat.le@oracle.com
stable inclusion from linux-4.19.198 commit 04e385f5eb2dedb074fe8d5ac6befb7c6e8f1a9d
--------------------------------
commit 104739aca4488909175e9e31d5cd7d75b82a2046 upstream.
If the device is power-cycled, it takes time for the initiator to transmit the periodic NOTIFY (ENABLE SPINUP) SAS primitive, and for the device to respond to the primitive to become ACTIVE. Retry the I/O request to allow the device time to become ACTIVE.
Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210629155826.48441-1-quat.le@oracle.com Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Quat Le quat.le@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/scsi_lib.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 1bca98745b009..5c6e05bbcb68d 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -929,6 +929,7 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result) case 0x07: /* operation in progress */ case 0x08: /* Long write in progress */ case 0x09: /* self test in progress */ + case 0x11: /* notify (enable spinup) required */ case 0x14: /* space allocation in progress */ case 0x1a: /* start stop unit in progress */ case 0x1b: /* sanitize in progress */
From: Al Viro viro@zeniv.linux.org.uk
stable inclusion from linux-4.19.198 commit faf8ab4355ec527f7fd1a4bab8c66f596c8490a8
--------------------------------
commit 0e8f0d67401589a141950856902c7d0ec8d9c985 upstream.
... and actually should just check it's given an iovec-backed iterator in the first place.
Cc: stable@vger.kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/iov_iter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c index a19d3423d775f..da01bb6551d97 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -417,7 +417,7 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) int err; struct iovec v;
- if (!(i->type & (ITER_BVEC|ITER_KVEC))) { + if (iter_is_iovec(i)) { iterate_iovec(i, bytes, v, iov, skip, ({ err = fault_in_pages_readable(v.iov_base, v.iov_len); if (unlikely(err))
From: Anirudh Rayabharam mail@anirudhrb.com
stable inclusion from linux-4.19.198 commit 9ed3a3d3a8d2cbe99d9e4386a98856491f0eade0
--------------------------------
commit ce3aba43599f0b50adbebff133df8d08a3d5fffe upstream.
Initialize the eh_generation field of struct ext4_extent_header to prevent leaking info to userspace. This fixes a KMSAN kernel-infoleak bug reported by syzbot at: http://syzkaller.appspot.com/bug?id=78e9ad0e6952a3ca16e8234724b2fa92d041b9b8
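For illustration, a minimal user-space sketch of the infoleak pattern being closed (not ext4 code; the struct, field names and values are made up): a field that is never written retains whatever bytes were already in the buffer, and those stale bytes escape when the whole struct is copied out to disk or userspace.

#include <stdio.h>
#include <string.h>

struct header {
	unsigned short magic;
	unsigned short entries;
	unsigned int   generation;	/* analogous to eh_generation */
};

int main(void)
{
	struct header h;

	memset(&h, 0xAA, sizeof(h));	/* stale bytes already in the on-disk buffer */
	h.magic = 0xF30A;
	h.entries = 0;
	/* h.generation deliberately left unset: 0xaaaaaaaa would reach userspace */

	printf("generation = %#x\n", h.generation);
	return 0;
}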
Cc: stable@kernel.org Reported-by: syzbot+2dcfeaf8cb49b05e8f1a@syzkaller.appspotmail.com Fixes: a86c61812637 ("[PATCH] ext3: add extent map support") Signed-off-by: Anirudh Rayabharam mail@anirudhrb.com Link: https://lore.kernel.org/r/20210506185655.7118-1-mail@anirudhrb.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/extents.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 479544b79cc04..6a0aab71a01f0 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -868,6 +868,7 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode) eh->eh_entries = 0; eh->eh_magic = EXT4_EXT_MAGIC; eh->eh_max = cpu_to_le16(ext4_ext_space_root(inode, 0)); + eh->eh_generation = 0; ext4_mark_inode_dirty(handle, inode); return 0; } @@ -1124,6 +1125,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode, neh->eh_max = cpu_to_le16(ext4_ext_space_block(inode, 0)); neh->eh_magic = EXT4_EXT_MAGIC; neh->eh_depth = 0; + neh->eh_generation = 0;
/* move remainder of path[depth] to the new leaf */ if (unlikely(path[depth].p_hdr->eh_entries != @@ -1201,6 +1203,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode, neh->eh_magic = EXT4_EXT_MAGIC; neh->eh_max = cpu_to_le16(ext4_ext_space_block_idx(inode, 0)); neh->eh_depth = cpu_to_le16(depth - i); + neh->eh_generation = 0; fidx = EXT_FIRST_INDEX(neh); fidx->ei_block = border; ext4_idx_store_pblock(fidx, oldblock);
stable inclusion from linux-4.19.198 commit 5485fe228f9771190de24f8936c43879fa52fc25
--------------------------------
commit 8f6840c4fd1e7bd715e403074fb161c1a04cda73 upstream.
After commit c89128a00838 ("ext4: handle errors on ext4_commit_super"), 'ret' may be set to 0 before calling ext4_fill_flex_info(). If ext4_fill_flex_info() then fails, ext4_mount() doesn't return an error code, which leaves 'root' NULL and causes a crash in legacy_get_tree().
Fixes: c89128a00838 ("ext4: handle errors on ext4_commit_super") Reported-by: Hulk Robot hulkci@huawei.com Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Yang Yingliang yangyingliang@huawei.com Link: https://lore.kernel.org/r/20210510111051.55650-1-yangyingliang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index f8e419ba9f0de..d793e597c0623 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4713,6 +4713,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) ext4_msg(sb, KERN_ERR, "unable to initialize " "flex_bg meta info!"); + ret = -ENOMEM; goto failed_mount6; }
From: Zhang Yi yi.zhang@huawei.com
stable inclusion from linux-4.19.198 commit 2338dc5d32636d26d6a99e022ca0cb1dc6e37741
--------------------------------
commit 4fb7c70a889ead2e91e184895ac6e5354b759135 upstream.
The cache_cnt parameter of the tracepoint ext4_es_shrink_exit means the remaining cache count after the shrink, but currently it is the cache count before the shrink; fix it by reading sbi->s_extent_cache_cnt again.
Fixes: 1ab6c4997e04 ("fs: convert fs shrinkers to new scan/count API") Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20210522103045.690103-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/extents_status.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index c4e6fb15101b4..ad3675cb640b0 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -1085,6 +1085,7 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL);
+ ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt); trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret); return nr_shrunk; }
From: Zhang Yi yi.zhang@huawei.com
stable inclusion from linux-4.19.198 commit cc0458bcd213f544edefe8b38d6ef23eedd568ee
--------------------------------
commit e5e7010e5444d923e4091cafff61d05f2d19cada upstream.
After converting the fs shrinkers to the new scan/count API, we no longer pass a zero nr_to_scan parameter to detect the number of objects to free; just remove this check.
Fixes: 1ab6c4997e04 ("fs: convert fs shrinkers to new scan/count API") Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20210522103045.690103-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/extents_status.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index ad3675cb640b0..027c3e1b9f61a 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -1080,9 +1080,6 @@ static unsigned long ext4_es_scan(struct shrinker *shrink, ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt); trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
- if (!nr_to_scan) - return ret; - nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL);
ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
From: Pan Dong pandong.peter@bytedance.com
stable inclusion from linux-4.19.198 commit fdb1e064f2cc38f99e4ccf6890f03b1ae3c52729
--------------------------------
commit c89849cc0259f3d33624cc3bd127685c3c0fa25d upstream.
The avefreec should be the average free clusters instead of the average free blocks; otherwise Orlov's allocator will not work properly when bigalloc is enabled.
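For illustration, a minimal user-space sketch with made-up numbers (variable names are mine, not the kernel's) of why mixing units breaks the comparison: with a bigalloc cluster ratio of 16, an average computed in blocks is 16x larger than the per-group free cluster counts, so no group ever looks better than "average" and the allocator falls back to its worst-case path.

#include <stdio.h>

int main(void)
{
	unsigned long free_clusters = 32000;   /* filesystem-wide free clusters */
	unsigned long ngroups = 8;
	unsigned long cluster_ratio = 16;      /* bigalloc: 16 blocks per cluster */
	unsigned long grp_free = 4000;         /* free clusters in one group */

	unsigned long avg_in_blocks   = free_clusters * cluster_ratio / ngroups; /* old: 64000 */
	unsigned long avg_in_clusters = free_clusters / ngroups;                 /* new:  4000 */

	printf("grp_free=%lu old-average=%lu new-average=%lu\n",
	       grp_free, avg_in_blocks, avg_in_clusters);
	/* old test: 4000 >= 64000 fails for every group; new test passes */
	return 0;
}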
Cc: stable@kernel.org Signed-off-by: Pan Dong pandong.peter@bytedance.com Link: https://lore.kernel.org/r/20210525073656.31594-1-pandong.peter@bytedance.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/ialloc.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index ffa1020548e52..835109f054936 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -396,7 +396,7 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g, * * We always try to spread first-level directories. * - * If there are blockgroups with both free inodes and free blocks counts + * If there are blockgroups with both free inodes and free clusters counts * not worse than average we return one with smallest directory count. * Otherwise we simply return a random group. * @@ -405,7 +405,7 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g, * It's OK to put directory into a group unless * it has too many directories already (max_dirs) or * it has too few free inodes left (min_inodes) or - * it has too few free blocks left (min_blocks) or + * it has too few free clusters left (min_clusters) or * Parent's group is preferred, if it doesn't satisfy these * conditions we search cyclically through the rest. If none * of the groups look good we just look for a group with more @@ -421,7 +421,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent, ext4_group_t real_ngroups = ext4_get_groups_count(sb); int inodes_per_group = EXT4_INODES_PER_GROUP(sb); unsigned int freei, avefreei, grp_free; - ext4_fsblk_t freeb, avefreec; + ext4_fsblk_t freec, avefreec; unsigned int ndirs; int max_dirs, min_inodes; ext4_grpblk_t min_clusters; @@ -440,9 +440,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter); avefreei = freei / ngroups; - freeb = EXT4_C2B(sbi, - percpu_counter_read_positive(&sbi->s_freeclusters_counter)); - avefreec = freeb; + freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter); + avefreec = freec; do_div(avefreec, ngroups); ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
From: Stephen Brennan stephen.s.brennan@oracle.com
stable inclusion from linux-4.19.198 commit 80f566c013ba357bc6e14fc2c55d2793e494d07c
--------------------------------
commit cd84bbbac12a173a381a64c6ec8b76a5277b87b5 upstream.
Commit 5d1b1b3f492f ("ext4: fix BUG when calling ext4_error with locked block group") introduces ext4_grp_locked_error to handle unlocking a group in error cases. Otherwise, there is a possibility of a sleep while atomic. However, since 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()"), mb_find_extent() has contained an ext4_error() call while a group spinlock is held. Replace this with ext4_grp_locked_error.
Fixes: 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()") Cc: stable@vger.kernel.org # 4.14+ Signed-off-by: Stephen Brennan stephen.s.brennan@oracle.com Reviewed-by: Lukas Czerner lczerner@redhat.com Reviewed-by: Junxiao Bi junxiao.bi@oracle.com Link: https://lore.kernel.org/r/20210623232114.34457-1-stephen.s.brennan@oracle.co... Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/mballoc.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 4a4641ac35841..69ed137d07a97 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -1542,10 +1542,11 @@ static int mb_find_extent(struct ext4_buddy *e4b, int block, if (ex->fe_start + ex->fe_len > EXT4_CLUSTERS_PER_GROUP(e4b->bd_sb)) { /* Should never happen! (but apparently sometimes does?!?) */ WARN_ON(1); - ext4_error(e4b->bd_sb, "corruption or bug in mb_find_extent " - "block=%d, order=%d needed=%d ex=%u/%d/%d@%u", - block, order, needed, ex->fe_group, ex->fe_start, - ex->fe_len, ex->fe_logical); + ext4_grp_locked_error(e4b->bd_sb, e4b->bd_group, 0, 0, + "corruption or bug in mb_find_extent " + "block=%d, order=%d needed=%d ex=%u/%d/%d@%u", + block, order, needed, ex->fe_group, ex->fe_start, + ex->fe_len, ex->fe_logical); ex->fe_len = 0; ex->fe_start = 0; ex->fe_group = 0;
From: Yun Zhou yun.zhou@windriver.com
stable inclusion from linux-4.19.198 commit c2e99a8d37b90b35d7d904c86fd125375ebec2b4
--------------------------------
commit 6a2cbc58d6c9d90cd74288cc497c2b45815bc064 upstream.
Since the raw memory pointer 'data' does not advance, repeated data is dumped if the data length is more than 8 bytes. If we want to dump longer data blocks, we need to call the SEQ_PUT_HEX_FIELD macro repeatedly. I think that is a bit redundant, and the extra function calls also hurt performance.
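For illustration, a minimal user-space analogue of the fixed loop (simplified; not the kernel seq_buf_putmem_hex() itself): the source pointer and remaining length must step past the chunk actually emitted, otherwise the same 8 bytes repeat until the length runs out.

#include <stdio.h>
#include <string.h>

#define MAX_MEMHEX_BYTES 8

static void putmem_hex(const unsigned char *data, size_t len)
{
	while (len) {
		size_t start_len = len < MAX_MEMHEX_BYTES ? len : MAX_MEMHEX_BYTES;

		for (size_t i = 0; i < start_len; i++)
			printf("%02x", data[i]);
		putchar(' ');

		/* the fix: step past the chunk we just emitted */
		len -= start_len;
		data += start_len;
	}
	putchar('\n');
}

int main(void)
{
	const char *msg = "0123456789abcdef";	/* 16 bytes -> two distinct chunks */

	putmem_hex((const unsigned char *)msg, strlen(msg));
	return 0;
}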
Link: https://lore.kernel.org/lkml/20210625122453.5e2fe304@oasis.local.home/ Link: https://lkml.kernel.org/r/20210626032156.47889-2-yun.zhou@windriver.com
Cc: stable@vger.kernel.org Fixes: 6d2289f3faa7 ("tracing: Make trace_seq_putmem_hex() more robust") Signed-off-by: Yun Zhou yun.zhou@windriver.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/seq_buf.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index bd807f545a9d7..b15dbb6f061a5 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -242,12 +242,14 @@ int seq_buf_putmem_hex(struct seq_buf *s, const void *mem,
 			break;
 
 		/* j increments twice per loop */
-		len -= j / 2;
 		hex[j++] = ' ';
 
 		seq_buf_putmem(s, hex, j);
 		if (seq_buf_has_overflowed(s))
 			return -1;
+
+		len -= start_len;
+		data += start_len;
 	}
 	return 0;
 }
From: Roberto Sassu roberto.sassu@huawei.com
stable inclusion from linux-4.19.198 commit 2c6b7016754559b1ead8d61e8ac67728d960fbbd
--------------------------------
commit 9eea2904292c2d8fa98df141d3bf7c41ec9dc1b5 upstream.
evm_inode_init_security() requires an HMAC key to calculate the HMAC on initial xattrs provided by LSMs. However, it checks generically whether a key has been loaded, including public keys, which is not correct, as public keys are not suitable for calculating the HMAC.
Originally, support for signature verification was introduced to verify a possibly immutable initial ram disk, when no new files are created, and to switch to HMAC for the root filesystem. By that time, an HMAC key should have been loaded and usable to calculate HMACs for new files.
More recently support for requiring an HMAC key was removed from the kernel, so that signature verification can be used alone. Since this is a legitimate use case, evm_inode_init_security() should not return an error when no HMAC key has been loaded.
This patch fixes this problem by replacing the evm_key_loaded() check with a check of the EVM_INIT_HMAC flag in evm_initialized.
Fixes: 26ddabfe96b ("evm: enable EVM when X509 certificate is loaded") Signed-off-by: Roberto Sassu roberto.sassu@huawei.com Cc: stable@vger.kernel.org # 4.5.x Signed-off-by: Mimi Zohar zohar@linux.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- security/integrity/evm/evm_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c
index 9f279a1f6eb48..c0b08ac32dcf4 100644
--- a/security/integrity/evm/evm_main.c
+++ b/security/integrity/evm/evm_main.c
@@ -521,7 +521,7 @@ void evm_inode_post_setattr(struct dentry *dentry, int ia_valid)
 }
 
 /*
- * evm_inode_init_security - initializes security.evm
+ * evm_inode_init_security - initializes security.evm HMAC value
  */
 int evm_inode_init_security(struct inode *inode,
 				 const struct xattr *lsm_xattr,
From: Roberto Sassu roberto.sassu@huawei.com
stable inclusion from linux-4.19.198 commit 77c94b2a1deea733e2796a2da7b637c1afd0cdb8
--------------------------------
commit 9acc89d31f0c94c8e573ed61f3e4340bbd526d0c upstream.
EVM_ALLOW_METADATA_WRITES is an EVM initialization flag that can be set to temporarily disable metadata verification until all xattrs/attrs necessary to verify an EVM portable signature are copied to the file. This flag is cleared when EVM is initialized with an HMAC key, to avoid that the HMAC is calculated on unverified xattrs/attrs.
Currently EVM unnecessarily denies setting this flag if EVM is initialized with a public key, which is not a concern as it cannot be used to trust xattrs/attrs updates. This patch removes this limitation.
Fixes: ae1ba1676b88e ("EVM: Allow userland to permit modification of EVM-protected metadata") Signed-off-by: Roberto Sassu roberto.sassu@huawei.com Cc: stable@vger.kernel.org # 4.16.x Signed-off-by: Mimi Zohar zohar@linux.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: security/integrity/evm/evm_secfs.c [yyl: keep same as mainline] Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- Documentation/ABI/testing/evm | 26 ++++++++++++++++++++++++-- security/integrity/evm/evm_secfs.c | 8 ++++---- 2 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/Documentation/ABI/testing/evm b/Documentation/ABI/testing/evm index 201d10319fa18..1df1177df68ad 100644 --- a/Documentation/ABI/testing/evm +++ b/Documentation/ABI/testing/evm @@ -42,8 +42,30 @@ Description: modification of EVM-protected metadata and disable all further modification of policy
- Note that once a key has been loaded, it will no longer be - possible to enable metadata modification. + Echoing a value is additive, the new value is added to the + existing initialization flags. + + For example, after:: + + echo 2 ><securityfs>/evm + + another echo can be performed:: + + echo 1 ><securityfs>/evm + + and the resulting value will be 3. + + Note that once an HMAC key has been loaded, it will no longer + be possible to enable metadata modification. Signaling that an + HMAC key has been loaded will clear the corresponding flag. + For example, if the current value is 6 (2 and 4 set):: + + echo 1 ><securityfs>/evm + + will set the new value to 3 (4 cleared). + + Loading an HMAC key is the only way to disable metadata + modification.
Until key loading has been signaled EVM can not create or validate the 'security.evm' xattr, but returns diff --git a/security/integrity/evm/evm_secfs.c b/security/integrity/evm/evm_secfs.c index 77de71b7794c9..cabf2dd5d186b 100644 --- a/security/integrity/evm/evm_secfs.c +++ b/security/integrity/evm/evm_secfs.c @@ -85,12 +85,12 @@ static ssize_t evm_write_key(struct file *file, const char __user *buf, if (!i || (i & ~EVM_INIT_MASK) != 0) return -EINVAL;
- /* Don't allow a request to freshly enable metadata writes if - * keys are loaded. + /* + * Don't allow a request to enable metadata writes if + * an HMAC key is loaded. */ if ((i & EVM_ALLOW_METADATA_WRITES) && - ((evm_initialized & EVM_KEY_MASK) != 0) && - !(evm_initialized & EVM_ALLOW_METADATA_WRITES)) + (evm_initialized & EVM_INIT_HMAC) != 0) return -EPERM;
if (i & EVM_INIT_HMAC) {
From: Miklos Szeredi mszeredi@redhat.com
stable inclusion from linux-4.19.198 commit 1c37784a00edfb88be67b24f7e776cda9eb803fd
--------------------------------
commit 80ef08670d4c28a06a3de954bd350368780bcfef upstream.
A request could end up on the fpq->io list after fuse_abort_conn() has reset fpq->connected and aborted requests on that list:
Thread-1                          Thread-2
========                          ========
->fuse_simple_request()           ->shutdown
  ->__fuse_request_send()
    ->queue_request()             ->fuse_abort_conn()
->fuse_dev_do_read()                ->acquire(fpq->lock)
  ->wait_for(fpq->lock)             ->set err to all req's in fpq->io
                                    ->release(fpq->lock)
  ->acquire(fpq->lock)
  ->add req to fpq->io
After the userspace copy is done the request will be ended, but req->out.h.error will remain uninitialized. Also the copy might block despite being already aborted.
Fix both issues by not allowing the request to be queued on the fpq->io list after fuse_abort_conn() has processed this list.
Reported-by: Pradeep P V K pragalla@codeaurora.org Fixes: fd22d62ed0c3 ("fuse: no fc->lock for iqueue parts") Cc: stable@vger.kernel.org # v4.2 Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/fuse/dev.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 1ff5a6b21db0b..498a4fab40801 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1310,6 +1310,15 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 		goto restart;
 	}
 	spin_lock(&fpq->lock);
+	/*
+	 *  Must not put request on fpq->io queue after having been shut down by
+	 *  fuse_abort_conn()
+	 */
+	if (!fpq->connected) {
+		req->out.h.error = err = -ECONNABORTED;
+		goto out_end;
+
+	}
 	list_add(&req->list, &fpq->io);
 	spin_unlock(&fpq->lock);
 	cs->req = req;
From: Mario Limonciello mario.limonciello@amd.com
stable inclusion from linux-4.19.198 commit 33f09d202e851ded5d1d9db1b059388da33f2361
--------------------------------
[ Upstream commit 65ea8f2c6e230bdf71fed0137cf9e9d1b307db32 ]
Generally, the C-state latency is provided by the _CST method or FADT, but some OEM platforms using AMD Picasso, Renoir, Van Gogh, and Cezanne set the C2 latency greater than C3's, which causes the C2 state to be skipped.
That will block the core entering PC6, which prevents S0ix working properly on Linux systems.
In other operating systems, the latency values are not validated, so states are not skipped and this does not cause problems.
To avoid this issue on Linux, detect when the latencies are not in ascending order and sort them.
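As a rough userspace sketch of that approach (the latency values below are made up, not actual _CST data, and the kernel fix sorts the C-state structures rather than a bare array):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_latency(const void *a, const void *b)
    {
        unsigned int x = *(const unsigned int *)a;
        unsigned int y = *(const unsigned int *)b;

        return (x > y) - (x < y);
    }

    int main(void)
    {
        /* Hypothetical C1/C2/C3 latencies; C2 > C3 is the firmware quirk. */
        unsigned int lat[] = { 1, 400, 30 };
        size_t i, n = sizeof(lat) / sizeof(lat[0]);
        int buggy = 0;

        for (i = 1; i < n; i++)
            if (lat[i] < lat[i - 1])
                buggy = 1;          /* latencies are out of order */

        if (buggy)
            qsort(lat, n, sizeof(lat[0]), cmp_latency);

        for (i = 0; i < n; i++)
            printf("C%zu latency: %u us\n", i + 1, lat[i]);
        return 0;
    }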
Link: https://gitlab.freedesktop.org/agd5f/linux/-/commit/026d186e4592c1ee9c1cb442... Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1230#note_712174 Suggested-by: Prike Liang Prike.Liang@amd.com Suggested-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Mario Limonciello mario.limonciello@amd.com [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/processor_idle.c | 40 +++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 6336f956a144f..5c46435f81b5f 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -29,6 +29,7 @@ #include <linux/acpi.h> #include <linux/dmi.h> #include <linux/sched.h> /* need_resched() */ +#include <linux/sort.h> #include <linux/tick.h> #include <linux/cpuidle.h> #include <linux/cpu.h> @@ -546,10 +547,37 @@ static void acpi_processor_power_verify_c3(struct acpi_processor *pr, return; }
+static int acpi_cst_latency_cmp(const void *a, const void *b) +{ + const struct acpi_processor_cx *x = a, *y = b; + + if (!(x->valid && y->valid)) + return 0; + if (x->latency > y->latency) + return 1; + if (x->latency < y->latency) + return -1; + return 0; +} +static void acpi_cst_latency_swap(void *a, void *b, int n) +{ + struct acpi_processor_cx *x = a, *y = b; + u32 tmp; + + if (!(x->valid && y->valid)) + return; + tmp = x->latency; + x->latency = y->latency; + y->latency = tmp; +} + static int acpi_processor_power_verify(struct acpi_processor *pr) { unsigned int i; unsigned int working = 0; + unsigned int last_latency = 0; + unsigned int last_type = 0; + bool buggy_latency = false;
pr->power.timer_broadcast_on_state = INT_MAX;
@@ -573,12 +601,24 @@ static int acpi_processor_power_verify(struct acpi_processor *pr) } if (!cx->valid) continue; + if (cx->type >= last_type && cx->latency < last_latency) + buggy_latency = true; + last_latency = cx->latency; + last_type = cx->type;
lapic_timer_check_state(i, pr, cx); tsc_check_state(cx->type); working++; }
+ if (buggy_latency) { + pr_notice("FW issue: working around C-state latencies out of order\n"); + sort(&pr->power.states[1], max_cstate, + sizeof(struct acpi_processor_cx), + acpi_cst_latency_cmp, + acpi_cst_latency_swap); + } + lapic_timer_propagate_broadcast(pr);
return (working);
From: Richard Fitzgerald rf@opensource.cirrus.com
stable inclusion from linux-4.19.198 commit 8278bf4874aac63527e32f010d411e05a66d1953
--------------------------------
[ Upstream commit 900fdc4573766dd43b847b4f54bd4a1ee2bc7360 ]
The existing code attempted to handle numbers by doing a strto[u]l(), ignoring the field width, and then repeatedly dividing to extract the field out of the full converted value. If the string contains a run of valid digits longer than will fit in a long or long long, this would overflow and no amount of dividing can recover the correct value.
This patch fixes vsscanf() to obey number field widths when parsing the number.
A new _parse_integer_limit() is added that takes a limit for the number of characters to parse. The number field conversion in vsscanf is changed to use this new function.
If a number starts with a radix prefix, the field width must be long enough for at least one digit after the prefix. If not, it will be handled like this:
sscanf("0x4", "%1i", &i): i=0, scanning continues with the 'x' sscanf("0x4", "%2i", &i): i=0, scanning continues with the '4'
This is consistent with the observed behaviour of userland sscanf.
Note that this patch does NOT fix the problem of a single field value overflowing the target type. So for example:
sscanf("123456789abcdef", "%x", &i);
Will not produce the correct result because the value obviously overflows INT_MAX. But sscanf will report a successful conversion.
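A small userspace demonstration of both points, using the C library's sscanf() as an analogy for the kernel's vsscanf() (the inputs and names here are illustrative only):

    #include <stdio.h>

    int main(void)
    {
        int i = 0;
        unsigned int u = 0;
        int n;

        /* The field width bounds how many characters the conversion may
         * consume, so only "123" is parsed here. */
        n = sscanf("123456", "%3d", &i);
        printf("n=%d i=%d\n", n, i);       /* n=1 i=123 */

        /* Without a width the whole digit run is consumed; the conversion
         * is still reported as successful even though the value cannot
         * fit in the target type. */
        n = sscanf("123456789abcdef", "%x", &u);
        printf("n=%d u=%#x\n", n, u);      /* n=1, value overflowed */
        return 0;
    }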
Note that where a very large number is used to mean "unlimited", the value INT_MAX is used for consistency with the behaviour of vsnprintf().
Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com Reviewed-by: Petr Mladek pmladek@suse.com Signed-off-by: Petr Mladek pmladek@suse.com Link: https://lore.kernel.org/r/20210514161206.30821-2-rf@opensource.cirrus.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/kstrtox.c | 13 ++++++-- lib/kstrtox.h | 2 ++ lib/vsprintf.c | 82 +++++++++++++++++++++++++++++--------------------- 3 files changed, 60 insertions(+), 37 deletions(-)
diff --git a/lib/kstrtox.c b/lib/kstrtox.c index 661a1e807bd1a..1a02b87b19c7e 100644 --- a/lib/kstrtox.c +++ b/lib/kstrtox.c @@ -39,20 +39,22 @@ const char *_parse_integer_fixup_radix(const char *s, unsigned int *base)
/* * Convert non-negative integer string representation in explicitly given radix - * to an integer. + * to an integer. A maximum of max_chars characters will be converted. + * * Return number of characters consumed maybe or-ed with overflow bit. * If overflow occurs, result integer (incorrect) is still returned. * * Don't you dare use this function. */ -unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long *p) +unsigned int _parse_integer_limit(const char *s, unsigned int base, unsigned long long *p, + size_t max_chars) { unsigned long long res; unsigned int rv;
res = 0; rv = 0; - while (1) { + while (max_chars--) { unsigned int c = *s; unsigned int lc = c | 0x20; /* don't tolower() this line */ unsigned int val; @@ -82,6 +84,11 @@ unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long return rv; }
+unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long *p) +{ + return _parse_integer_limit(s, base, p, INT_MAX); +} + static int _kstrtoull(const char *s, unsigned int base, unsigned long long *res) { unsigned long long _res; diff --git a/lib/kstrtox.h b/lib/kstrtox.h index 3b4637bcd2540..158c400ca8658 100644 --- a/lib/kstrtox.h +++ b/lib/kstrtox.h @@ -4,6 +4,8 @@
#define KSTRTOX_OVERFLOW (1U << 31) const char *_parse_integer_fixup_radix(const char *s, unsigned int *base); +unsigned int _parse_integer_limit(const char *s, unsigned int base, unsigned long long *res, + size_t max_chars); unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long *res);
#endif diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 12a5441d221db..b3b44c2c8a925 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -48,6 +48,31 @@ #include <linux/string_helpers.h> #include "kstrtox.h"
+static unsigned long long simple_strntoull(const char *startp, size_t max_chars, + char **endp, unsigned int base) +{ + const char *cp; + unsigned long long result = 0ULL; + size_t prefix_chars; + unsigned int rv; + + cp = _parse_integer_fixup_radix(startp, &base); + prefix_chars = cp - startp; + if (prefix_chars < max_chars) { + rv = _parse_integer_limit(cp, base, &result, max_chars - prefix_chars); + /* FIXME */ + cp += (rv & ~KSTRTOX_OVERFLOW); + } else { + /* Field too short for prefix + digit, skip over without converting */ + cp = startp + max_chars; + } + + if (endp) + *endp = (char *)cp; + + return result; +} + /** * simple_strtoull - convert a string to an unsigned long long * @cp: The start of the string @@ -58,18 +83,7 @@ */ unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base) { - unsigned long long result; - unsigned int rv; - - cp = _parse_integer_fixup_radix(cp, &base); - rv = _parse_integer(cp, base, &result); - /* FIXME */ - cp += (rv & ~KSTRTOX_OVERFLOW); - - if (endp) - *endp = (char *)cp; - - return result; + return simple_strntoull(cp, INT_MAX, endp, base); } EXPORT_SYMBOL(simple_strtoull);
@@ -104,6 +118,21 @@ long simple_strtol(const char *cp, char **endp, unsigned int base) } EXPORT_SYMBOL(simple_strtol);
+static long long simple_strntoll(const char *cp, size_t max_chars, char **endp, + unsigned int base) +{ + /* + * simple_strntoull() safely handles receiving max_chars==0 in the + * case cp[0] == '-' && max_chars == 1. + * If max_chars == 0 we can drop through and pass it to simple_strntoull() + * and the content of *cp is irrelevant. + */ + if (*cp == '-' && max_chars > 0) + return -simple_strntoull(cp + 1, max_chars - 1, endp, base); + + return simple_strntoull(cp, max_chars, endp, base); +} + /** * simple_strtoll - convert a string to a signed long long * @cp: The start of the string @@ -114,10 +143,7 @@ EXPORT_SYMBOL(simple_strtol); */ long long simple_strtoll(const char *cp, char **endp, unsigned int base) { - if (*cp == '-') - return -simple_strtoull(cp + 1, endp, base); - - return simple_strtoull(cp, endp, base); + return simple_strntoll(cp, INT_MAX, endp, base); } EXPORT_SYMBOL(simple_strtoll);
@@ -3217,25 +3243,13 @@ int vsscanf(const char *buf, const char *fmt, va_list args) break;
if (is_sign) - val.s = qualifier != 'L' ? - simple_strtol(str, &next, base) : - simple_strtoll(str, &next, base); + val.s = simple_strntoll(str, + field_width >= 0 ? field_width : INT_MAX, + &next, base); else - val.u = qualifier != 'L' ? - simple_strtoul(str, &next, base) : - simple_strtoull(str, &next, base); - - if (field_width > 0 && next - str > field_width) { - if (base == 0) - _parse_integer_fixup_radix(str, &base); - while (next - str > field_width) { - if (is_sign) - val.s = div_s64(val.s, base); - else - val.u = div_u64(val.u, base); - --next; - } - } + val.u = simple_strntoull(str, + field_width >= 0 ? field_width : INT_MAX, + &next, base);
switch (qualifier) { case 'H': /* that's 'hh' in format */
From: Mimi Zohar zohar@linux.ibm.com
stable inclusion from linux-4.19.198 commit 8657d0d57d6df9d6d04277c18889da2804164ca3
--------------------------------
[ Upstream commit 49219d9b8785ba712575c40e48ce0f7461254626 ]
EVM_SETUP_COMPLETE is defined as 0x80000000, which is larger than INT_MAX. The "-fno-strict-overflow" compiler option properly prevents signaling EVM that the EVM policy setup is complete. Define and read an unsigned int.
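For illustration, a tiny userspace program showing why this flag value cannot be represented in a signed int (the macro value is taken from the text above; the signed result shown is what a typical two's-complement platform produces):

    #include <limits.h>
    #include <stdio.h>

    #define EVM_SETUP_COMPLETE 0x80000000u

    int main(void)
    {
        unsigned int flag = EVM_SETUP_COMPLETE;

        printf("INT_MAX   = %d\n", INT_MAX);
        printf("flag      = %u\n", flag);
        /* Converting the out-of-range value to int is implementation-
         * defined; on common platforms it comes out negative. */
        printf("(int)flag = %d\n", (int)flag);
        return 0;
    }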
Fixes: f00d79750712 ("EVM: Allow userspace to signal an RSA key has been loaded") Signed-off-by: Mimi Zohar zohar@linux.ibm.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- security/integrity/evm/evm_secfs.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/security/integrity/evm/evm_secfs.c b/security/integrity/evm/evm_secfs.c
index cabf2dd5d186b..8a8bcf64d8d61 100644
--- a/security/integrity/evm/evm_secfs.c
+++ b/security/integrity/evm/evm_secfs.c
@@ -71,12 +71,13 @@ static ssize_t evm_read_key(struct file *filp, char __user *buf,
 static ssize_t evm_write_key(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	int i, ret;
+	unsigned int i;
+	int ret;
 
 	if (!capable(CAP_SYS_ADMIN) || (evm_initialized & EVM_SETUP_COMPLETE))
 		return -EPERM;
 
-	ret = kstrtoint_from_user(buf, count, 0, &i);
+	ret = kstrtouint_from_user(buf, count, 0, &i);
 
 	if (ret)
 		return ret;
From: Krzysztof WilczyĆski kw@linux.com
stable inclusion from linux-4.19.198 commit 32afb981e38c0f68185eac93a673d9d9a7e260ca
--------------------------------
[ Upstream commit 888be6067b97132c3992866bbcf647572253ab3f ]
Currently, a device description can be obtained using ACPI, if the _STR method exists for a particular device, and then exposed to the userspace via a sysfs object as a string value.
If the _STR method is available for a given device then the data (usually a Unicode string) is read and stored in a buffer (of the ACPI_TYPE_BUFFER type) with a pointer to said buffer cached in the struct acpi_device_pnp for later access.
The description_show() function is responsible for exposing the device description to the userspace via a corresponding sysfs object and internally calls the utf16s_to_utf8s() function with a pointer to the buffer that contains the Unicode string so that it can be converted from UTF16 encoding to UTF8 and thus allowing for the value to be safely stored and later displayed.
When invoking the utf16s_to_utf8s() function, the description_show() function also sets a limit of the data that can be saved into a provided buffer as a result of the character conversion to be a total of PAGE_SIZE, and upon completion, the utf16s_to_utf8s() function returns an integer value denoting the number of bytes that have been written into the provided buffer.
Following the execution of utf16s_to_utf8s(), a newline character is added at the end of the resulting buffer so that, when the value is read from userspace through the sysfs object, it includes a newline, which makes it easier to work with in the shell, etc. Normally this wouldn't be a problem, but if utf16s_to_utf8s() happens to report that the number of bytes written is precisely PAGE_SIZE, then we would overrun the buffer and write the newline character outside the allotted space, which can have undefined consequences or result in a failure.
To fix this buffer overrun, ensure that there always is enough space left for the newline character to be safely appended.
Fixes: d1efe3c324ea ("ACPI: Add new sysfs interface to export device description") Signed-off-by: Krzysztof WilczyĆski kw@linux.com Reviewed-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/acpi/device_sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/device_sysfs.c b/drivers/acpi/device_sysfs.c
index 8940054d6250f..6cc71a2e55087 100644
--- a/drivers/acpi/device_sysfs.c
+++ b/drivers/acpi/device_sysfs.c
@@ -460,7 +460,7 @@ static ssize_t description_show(struct device *dev,
 		(wchar_t *)acpi_dev->pnp.str_obj->buffer.pointer,
 		acpi_dev->pnp.str_obj->buffer.length,
 		UTF16_LITTLE_ENDIAN, buf,
-		PAGE_SIZE);
+		PAGE_SIZE - 1);
 
 	buf[result++] = '\n';
From: Liu Shixin liushixin2@huawei.com
stable inclusion from linux-4.19.198 commit 0b6201184892db5c055ae6a02069460b61c3b62a
--------------------------------
[ Upstream commit b8f6b0522c298ae9267bd6584e19b942a0636910 ]
Hulk Robot reported a memory leak in netlbl_mgmt_add_common(). The problem is that the map is not freed when netlbl_domhsh_add() fails.
BUG: memory leak unreferenced object 0xffff888100ab7080 (size 96): comm "syz-executor537", pid 360, jiffies 4294862456 (age 22.678s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 ................ backtrace: [<0000000008b40026>] netlbl_mgmt_add_common.isra.0+0xb2a/0x1b40 [<000000003be10950>] netlbl_mgmt_add+0x271/0x3c0 [<00000000c70487ed>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 [<000000001f2ff614>] genl_rcv_msg+0x2bf/0x4f0 [<0000000089045792>] netlink_rcv_skb+0x134/0x3d0 [<0000000020e96fdd>] genl_rcv+0x24/0x40 [<0000000042810c66>] netlink_unicast+0x4a0/0x6a0 [<000000002e1659f0>] netlink_sendmsg+0x789/0xc70 [<000000006e43415f>] sock_sendmsg+0x139/0x170 [<00000000680a73d7>] ____sys_sendmsg+0x658/0x7d0 [<0000000065cbb8af>] ___sys_sendmsg+0xf8/0x170 [<0000000019932b6c>] __sys_sendmsg+0xd3/0x190 [<00000000643ac172>] do_syscall_64+0x37/0x90 [<000000009b79d6dc>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 63c416887437 ("netlabel: Add network address selectors to the NetLabel/LSM domain mapping") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/netlabel/netlabel_mgmt.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/net/netlabel/netlabel_mgmt.c b/net/netlabel/netlabel_mgmt.c index 21e0095b1d142..71ba69cb50c9b 100644 --- a/net/netlabel/netlabel_mgmt.c +++ b/net/netlabel/netlabel_mgmt.c @@ -90,6 +90,7 @@ static const struct nla_policy netlbl_mgmt_genl_policy[NLBL_MGMT_A_MAX + 1] = { static int netlbl_mgmt_add_common(struct genl_info *info, struct netlbl_audit *audit_info) { + void *pmap = NULL; int ret_val = -EINVAL; struct netlbl_domaddr_map *addrmap = NULL; struct cipso_v4_doi *cipsov4 = NULL; @@ -189,6 +190,7 @@ static int netlbl_mgmt_add_common(struct genl_info *info, ret_val = -ENOMEM; goto add_free_addrmap; } + pmap = map; map->list.addr = addr->s_addr & mask->s_addr; map->list.mask = mask->s_addr; map->list.valid = 1; @@ -197,10 +199,8 @@ static int netlbl_mgmt_add_common(struct genl_info *info, map->def.cipso = cipsov4;
ret_val = netlbl_af4list_add(&map->list, &addrmap->list4); - if (ret_val != 0) { - kfree(map); - goto add_free_addrmap; - } + if (ret_val != 0) + goto add_free_map;
entry->family = AF_INET; entry->def.type = NETLBL_NLTYPE_ADDRSELECT; @@ -237,6 +237,7 @@ static int netlbl_mgmt_add_common(struct genl_info *info, ret_val = -ENOMEM; goto add_free_addrmap; } + pmap = map; map->list.addr = *addr; map->list.addr.s6_addr32[0] &= mask->s6_addr32[0]; map->list.addr.s6_addr32[1] &= mask->s6_addr32[1]; @@ -249,10 +250,8 @@ static int netlbl_mgmt_add_common(struct genl_info *info, map->def.calipso = calipso;
ret_val = netlbl_af6list_add(&map->list, &addrmap->list6); - if (ret_val != 0) { - kfree(map); - goto add_free_addrmap; - } + if (ret_val != 0) + goto add_free_map;
entry->family = AF_INET6; entry->def.type = NETLBL_NLTYPE_ADDRSELECT; @@ -262,10 +261,12 @@ static int netlbl_mgmt_add_common(struct genl_info *info,
ret_val = netlbl_domhsh_add(entry, audit_info); if (ret_val != 0) - goto add_free_addrmap; + goto add_free_map;
return 0;
+add_free_map: + kfree(pmap); add_free_addrmap: kfree(addrmap); add_doi_put_def:
From: Pablo Neira Ayuso pablo@netfilter.org
stable inclusion from linux-4.19.198 commit 0cb785dd9e42be71ec92fee8ef984c0a1e64dab3
--------------------------------
[ Upstream commit cdd73cc545c0fb9b1a1f7b209f4f536e7990cff4 ]
ipv6_find_hdr() does not validate that this is an IPv6 packet. Add a sanity check for calling ipv6_find_hdr() to make sure an IPv6 packet is passed for parsing.
Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/netfilter/nft_exthdr.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c
index a940c9fd9045e..64e69d6683cab 100644
--- a/net/netfilter/nft_exthdr.c
+++ b/net/netfilter/nft_exthdr.c
@@ -45,6 +45,9 @@ static void nft_exthdr_ipv6_eval(const struct nft_expr *expr,
 	unsigned int offset = 0;
 	int err;
 
+	if (pkt->skb->protocol != htons(ETH_P_IPV6))
+		goto err;
+
 	err = ipv6_find_hdr(pkt->skb, &offset, priv->type, NULL, NULL);
 	if (priv->flags & NFT_EXTHDR_F_PRESENT) {
 		*dest = (err >= 0);
From: Pablo Neira Ayuso pablo@netfilter.org
stable inclusion from linux-4.19.198 commit 3225cd78880a15928f8930e3de698a6edca63f42
--------------------------------
[ Upstream commit 8f518d43f89ae00b9cf5460e10b91694944ca1a8 ]
The osf expression only supports TCP packets; add an upfront sanity check to skip packet parsing if this is not a TCP packet.
Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Reported-by: kernel test robot lkp@intel.com Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/netfilter/nft_osf.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/net/netfilter/nft_osf.c b/net/netfilter/nft_osf.c
index a003533ff4d93..e259454b6a643 100644
--- a/net/netfilter/nft_osf.c
+++ b/net/netfilter/nft_osf.c
@@ -22,6 +22,11 @@ static void nft_osf_eval(const struct nft_expr *expr, struct nft_regs *regs,
 	struct tcphdr _tcph;
 	const char *os_name;
 
+	if (pkt->tprot != IPPROTO_TCP) {
+		regs->verdict.code = NFT_BREAK;
+		return;
+	}
+
 	tcp = skb_header_pointer(skb, ip_hdrlen(skb),
 				 sizeof(struct tcphdr), &_tcph);
 	if (!tcp) {
From: Pablo Neira Ayuso pablo@netfilter.org
stable inclusion from linux-4.19.198 commit 6b97694d0a76c7b5f5e374dab2a3b1ef54899dc7
--------------------------------
[ Upstream commit 52f0f4e178c757b3d356087376aad8bd77271828 ]
Add an upfront check for TCP and UDP packets before performing further processing.
Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/netfilter/nft_tproxy.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nft_tproxy.c b/net/netfilter/nft_tproxy.c
index 95980154ef02c..b97ab1198b03f 100644
--- a/net/netfilter/nft_tproxy.c
+++ b/net/netfilter/nft_tproxy.c
@@ -30,6 +30,12 @@ static void nft_tproxy_eval_v4(const struct nft_expr *expr,
 	__be16 tport = 0;
 	struct sock *sk;
 
+	if (pkt->tprot != IPPROTO_TCP &&
+	    pkt->tprot != IPPROTO_UDP) {
+		regs->verdict.code = NFT_BREAK;
+		return;
+	}
+
 	hp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(_hdr), &_hdr);
 	if (!hp) {
 		regs->verdict.code = NFT_BREAK;
@@ -91,7 +97,8 @@ static void nft_tproxy_eval_v6(const struct nft_expr *expr,
 
 	memset(&taddr, 0, sizeof(taddr));
 
-	if (!pkt->tprot_set) {
+	if (pkt->tprot != IPPROTO_TCP &&
+	    pkt->tprot != IPPROTO_UDP) {
 		regs->verdict.code = NFT_BREAK;
 		return;
 	}
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.198 commit 98fd088c325429327fc5ddb0b12c6a203ebf7f30
--------------------------------
[ Upstream commit 0cd58e5c53babb9237b741dbef711f0a9eb6d3fd ]
If qfq_change_class() is unable to allocate memory for qfq_aggregate, it frees the class that has been inserted in the class hash table, but does not unhash it.
Defer the insertion until after the problematic allocation.
BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:884 [inline] BUG: KASAN: use-after-free in qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731 Write of size 8 at addr ffff88814a534f10 by task syz-executor.4/31478
CPU: 0 PID: 31478 Comm: syz-executor.4 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:233 __kasan_report mm/kasan/report.c:419 [inline] kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:436 hlist_add_head include/linux/list.h:884 [inline] qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731 qfq_change_class+0x96c/0x1990 net/sched/sch_qfq.c:489 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x4665d9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fdc7b5f0188 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9 RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003 RBP: 00007fdc7b5f01d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 00007ffcf7310b3f R14: 00007fdc7b5f0300 R15: 0000000000022000
Allocated by task 31445: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:428 [inline] ____kasan_kmalloc mm/kasan/common.c:507 [inline] ____kasan_kmalloc mm/kasan/common.c:466 [inline] __kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:516 kmalloc include/linux/slab.h:556 [inline] kzalloc include/linux/slab.h:686 [inline] qfq_change_class+0x705/0x1990 net/sched/sch_qfq.c:464 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 31445: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:357 ____kasan_slab_free mm/kasan/common.c:360 [inline] ____kasan_slab_free mm/kasan/common.c:325 [inline] __kasan_slab_free+0xfb/0x130 mm/kasan/common.c:368 kasan_slab_free include/linux/kasan.h:212 [inline] slab_free_hook mm/slub.c:1583 [inline] slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1608 slab_free mm/slub.c:3168 [inline] kfree+0xe5/0x7f0 mm/slub.c:4212 qfq_change_class+0x10fb/0x1990 net/sched/sch_qfq.c:518 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88814a534f00 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 16 bytes inside of 128-byte region [ffff88814a534f00, ffff88814a534f80) The buggy address belongs to the page: page:ffffea0005294d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a534 flags: 0x57ff00000000200(slab|node=1|zone=2|lastcpupid=0x7ff) raw: 057ff00000000200 ffffea00004fee00 0000000600000006 ffff8880110418c0 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 29797, ts 604817765317, free_ts 604810151744 prep_new_page mm/page_alloc.c:2358 [inline] get_page_from_freelist+0x1033/0x2b60 mm/page_alloc.c:3994 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5200 alloc_pages+0x18c/0x2a0 mm/mempolicy.c:2272 alloc_slab_page mm/slub.c:1646 [inline] allocate_slab+0x2c5/0x4c0 mm/slub.c:1786 new_slab mm/slub.c:1849 [inline] new_slab_objects mm/slub.c:2595 [inline] ___slab_alloc+0x4a1/0x810 mm/slub.c:2758 __slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2798 slab_alloc_node mm/slub.c:2880 [inline] slab_alloc mm/slub.c:2922 [inline] __kmalloc+0x315/0x330 mm/slub.c:4050 kmalloc include/linux/slab.h:561 [inline] kzalloc include/linux/slab.h:686 [inline] __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318 mpls_dev_sysctl_register+0x1b7/0x2d0 net/mpls/af_mpls.c:1421 mpls_add_dev net/mpls/af_mpls.c:1472 [inline] mpls_dev_notify+0x214/0x8b0 net/mpls/af_mpls.c:1588 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83 call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:2121 call_netdevice_notifiers_extack net/core/dev.c:2133 [inline] call_netdevice_notifiers net/core/dev.c:2147 [inline] register_netdevice+0x106b/0x1500 net/core/dev.c:10312 veth_newlink+0x585/0xac0 drivers/net/veth.c:1547 __rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3452 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1298 [inline] free_pcp_prepare+0x223/0x300 mm/page_alloc.c:1342 free_unref_page_prepare mm/page_alloc.c:3250 [inline] free_unref_page+0x12/0x1d0 mm/page_alloc.c:3298 __vunmap+0x783/0xb60 mm/vmalloc.c:2566 free_work+0x58/0x70 mm/vmalloc.c:80 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422 kthread+0x3b1/0x4a0 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Memory state around the buggy address: ffff88814a534e00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88814a534e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88814a534f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff88814a534f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88814a535000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fixes: 462dbc9101acd ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/sched/sch_qfq.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index d61355cc86a6e..73ac3c65c747f 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -497,11 +497,6 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 
 	if (cl->qdisc != &noop_qdisc)
 		qdisc_hash_add(cl->qdisc, true);
-	sch_tree_lock(sch);
-	qdisc_class_hash_insert(&q->clhash, &cl->common);
-	sch_tree_unlock(sch);
-
-	qdisc_class_hash_grow(sch, &q->clhash);
 
 set_change_agg:
 	sch_tree_lock(sch);
@@ -519,8 +514,11 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 	}
 	if (existing)
 		qfq_deact_rm_from_agg(q, cl);
+	else
+		qdisc_class_hash_insert(&q->clhash, &cl->common);
 	qfq_add_to_agg(q, new_agg, cl);
 	sch_tree_unlock(sch);
+	qdisc_class_hash_grow(sch, &q->clhash);
 
 	*arg = (unsigned long)cl;
 	return 0;
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.198 commit f80201ff7937fddb039716ba5948775b485d7646
--------------------------------
[ Upstream commit 85e8b032d6ebb0f698a34dd22c2f13443d905888 ]
syzbot complained in neigh_reduce(), because rcu_read_lock_bh() is treated differently than rcu_read_lock()
WARNING: suspicious RCU usage 5.13.0-rc6-syzkaller #0 Not tainted ----------------------------- include/net/addrconf.h:313 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/0:0/5: #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247 #1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251 #2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180
stack backtrace: CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events ipvlan_process_multicast Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 __in6_dev_get include/net/addrconf.h:313 [inline] __in6_dev_get include/net/addrconf.h:311 [inline] neigh_reduce drivers/net/vxlan.c:2167 [inline] vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919 __netdev_start_xmit include/linux/netdevice.h:4944 [inline] netdev_start_xmit include/linux/netdevice.h:4958 [inline] xmit_one net/core/dev.c:3654 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670 __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246 ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422 kthread+0x3b1/0x4a0 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Fixes: f564f45c4518 ("vxlan: add ipv6 proxy support") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/net/vxlan.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 66fffbd64a33f..f119814c5e254 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1682,6 +1682,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 	struct neighbour *n;
 	struct nd_msg *msg;
 
+	rcu_read_lock();
 	in6_dev = __in6_dev_get(dev);
 	if (!in6_dev)
 		goto out;
@@ -1733,6 +1734,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 	}
 
 out:
+	rcu_read_unlock();
 	consume_skb(skb);
 	return NETDEV_TX_OK;
 }
From: Miao Wang shankerwangmiao@gmail.com
stable inclusion from linux-4.19.198 commit 0cb6b100b869164dcd4b48e8c236e0a9f1c8c0ad
--------------------------------
[ Upstream commit c69f114d09891adfa3e301a35d9e872b8b7b5a50 ]
When doing source address validation, the flowi4 struct used for fib_lookup should be in the reverse direction to the given skb. fl4_dport and fl4_sport returned by fib4_rules_early_flow_dissect should thus be swapped.
Fixes: 5a847a6e1477 ("net/ipv4: Initialize proto and ports in flow struct") Signed-off-by: Miao Wang shankerwangmiao@gmail.com Reviewed-by: David Ahern dsahern@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/fib_frontend.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b96aa88087be1..70e5e9e5d8351 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -353,6 +353,8 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 		fl4.flowi4_proto = 0;
 		fl4.fl4_sport = 0;
 		fl4.fl4_dport = 0;
+	} else {
+		swap(fl4.fl4_sport, fl4.fl4_dport);
 	}
 
 	if (fib_lookup(net, &fl4, &res, 0))
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.198 commit d317a20a84c434ae0a2011e551e6f400b995c32e
--------------------------------
[ Upstream commit bcc3f2a829b9edbe3da5fb117ee5a63686d31834 ]
I see no reason why max_dst_opts_cnt and max_hbh_opts_cnt are fetched from the initial net namespace.
The other sysctls (max_dst_opts_len & max_hbh_opts_len) are in fact already using the current ns.
Note: it is not clear why ipv6_destopt_rcv() uses two ways to get to the netns:
1) dev_net(dst->dev) Originally used to increment IPSTATS_MIB_INHDRERRORS
2) dev_net(skb->dev) Tom used this variant in his patch.
Maybe this calls for using ipv6_skb_net() instead?
Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Tom Herbert tom@quantonium.net Cc: Coco Li lixiaoyan@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv6/exthdrs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 20291c2036fcd..68b8084da83a7 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -309,7 +309,7 @@ static int ipv6_destopt_rcv(struct sk_buff *skb)
 #endif
 
 	if (ip6_parse_tlv(tlvprocdestopt_lst, skb,
-			  init_net.ipv6.sysctl.max_dst_opts_cnt)) {
+			  net->ipv6.sysctl.max_dst_opts_cnt)) {
 		skb->transport_header += extlen;
 		opt = IP6CB(skb);
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
@@ -848,7 +848,7 @@ int ipv6_parse_hopopts(struct sk_buff *skb)
 
 	opt->flags |= IP6SKB_HOPBYHOP;
 	if (ip6_parse_tlv(tlvprochopopt_lst, skb,
-			  init_net.ipv6.sysctl.max_hbh_opts_cnt)) {
+			  net->ipv6.sysctl.max_hbh_opts_cnt)) {
 		skb->transport_header += extlen;
 		opt = IP6CB(skb);
 		opt->nhoff = sizeof(struct ipv6hdr);
From: Maciej ƻenczykowski maze@google.com
stable inclusion from linux-4.19.198 commit 7c557d06b420b08e380e2875f04aff8624fb8dcc
--------------------------------
[ Upstream commit 364745fbe981a4370f50274475da4675661104df ]
This is technically a backwards incompatible change in behaviour, but I'm going to argue that it is very unlikely to break things, and likely to fix *far* more than it breaks.
In no particular order, various reasons follow:
(a) I've long had a bug assigned to myself to debug a super rare kernel crash on Android Pixel phones which can (per stacktrace) be traced back to BPF clat IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later on during transmit deep in the GSO engine, AFAICT precisely because of this change to gso_size, though I've never been able to manually reproduce it. I believe it may be related to the particular network offload support of attached USB ethernet dongle being used for tethering off of an IPv6-only cellular connection. The reason might be we end up with more segments than max permitted, or with a GSO packet with only one segment... (either way we break some assumption and hit a BUG_ON)
(b) There is no check that the gso_size is > 20 when reducing it by 20, so we might end up with a negative (or underflowing) gso_size or a gso_size of 0. This can't possibly be good. Indeed this is probably somehow exploitable (or at least can result in a kernel crash) by delivering crafted packets and perhaps triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS) is related to MTU, but not directly derived from it: gso_size/MSS may be significantly smaller then one would get by deriving from local MTU. And on some NICs (which do loose MTU checking on receive, it may even potentially be larger, for example my work pc with 1500 MTU can receive 1520 byte frames [and sometimes does due to bugs in a vendor plat46 implementation]). Indeed even just going from 21 to 1 is potentially problematic because it increases the number of segments by a factor of 21 (think DoS, or some other crash due to too many segments).
(c) It's always safe to not increase the gso_size, because it doesn't result in the max packet size increasing. So the skb_increase_gso_size() call was always unnecessary for correctness (and outright undesirable, see later). As such the only part which is potentially dangerous (ie. could cause backwards compatibility issues) is the removal of the skb_decrease_gso_size() call.
(d) If the packets are ultimately destined to the local device, then there is absolutely no benefit to playing around with gso_size. It only matters if the packets will egress the device. ie. we're either forwarding, or transmitting from the device.
(e) This logic only triggers for packets which are GSO. It does not trigger for skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a GSO packet (and you don't even know what the MTU is, so you can't even fix it). As such your transmit path must *already* be able to handle an MTU 20 bytes larger than your receive path (for IPv4 to IPv6 translation) - and indeed 28 bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size() call doesn't actually increase the size of the packets your transmit side must be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/route MTUs must already be set correctly. For example, with an IPv4 egress MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames, so you need a 1520 byte device MTU. This means if your IPv6 device's egress MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the need to handle fragments). This is to handle normal non-GSO packets. Thus the reduction is simply not needed for GSO packets, because when they're correctly built, they will already be the right size.
(f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP segments) should not be modified, so that TCP's MSS counting works correctly (this matters for congestion control). If protocol conversion changes the gso_size, then the number of TCP segments may increase or decrease. Packet loss after protocol conversion can result in partial loss of MSS segments that the sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs in the middle of the segments it sent?
(g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee that gso_size (and thus resulting packet MTU) is changed. ie. you must assume it won't be changed.
(h) changing gso_size is outright buggy for UDP GSO packets, where framing matters (I believe that's also the case for SCTP, but it's already excluded by [g]). So the only remaining case is TCP, which also doesn't want it (see [f]).
(i) see also the reasoning on the previous attempt at fixing this (commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e) which shows that the current behaviour causes TCP packet loss:
In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the coalesced packet payload can be > MSS, but < MSS + 20.
bpf_skb_proto_6_to_4() will upgrade the MSS, and the MSS can then exceed the payload length. tcp_gso_segment() afterwards checks whether the payload length is <= MSS, and that condition causes the packet to be dropped.
tcp_gso_segment():
    [...]
    mss = skb_shinfo(skb)->gso_size;
    if (unlikely(skb->len <= mss))
        goto out;
    [...]
Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary and buggy, and decreasing can go negative.
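The "decreasing can go negative" part is really a wrap-around, since gso_size is an unsigned 16-bit field; a tiny userspace demonstration with made-up values (not actual kernel code):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical tiny MSS and the 20-byte IPv4/IPv6 header difference. */
        unsigned short gso_size = 8;
        unsigned short len_diff = 20;

        gso_size -= len_diff;   /* wraps instead of going negative */
        printf("resulting gso_size: %u\n", (unsigned int)gso_size);   /* 65524 */
        return 0;
    }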
Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") Signed-off-by: Maciej ƻenczykowski maze@google.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Cc: Dongseok Yi dseok.yi@samsung.com Cc: Willem de Bruijn willemb@google.com Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-... Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/core/filter.c | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 6a9c4bf09d37b..54fa2b827096d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2639,8 +2639,6 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb)
 			shinfo->gso_type |= SKB_GSO_TCPV6;
 		}
 
-		/* Due to IPv6 header, MSS needs to be downgraded. */
-		skb_decrease_gso_size(shinfo, len_diff);
 		/* Header must be checked, and gso_segs recomputed. */
 		shinfo->gso_type |= SKB_GSO_DODGY;
 		shinfo->gso_segs = 0;
@@ -2680,8 +2678,6 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb)
 			shinfo->gso_type |= SKB_GSO_TCPV4;
 		}
 
-		/* Due to IPv4 header, MSS can be upgraded. */
-		skb_increase_gso_size(shinfo, len_diff);
 		/* Header must be checked, and gso_segs recomputed. */
 		shinfo->gso_type |= SKB_GSO_DODGY;
 		shinfo->gso_segs = 0;
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.198 commit 6c7a0c308b3ac149ab6e4a86c7b9bc9ffce6154e
--------------------------------
[ Upstream commit 624085a31c1ad6a80b1e53f686bf6ee92abbf6e8 ]
The first problem is that optlen is fetched without checking that there is more than one byte to parse.
Fix this by taking care of IPV6_TLV_PAD1 before fetching optlen (under appropriate sanity checks against len).
The second problem is that the IPV6_TLV_PADN checks of zero padding are performed before the check of remaining length.
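A self-contained userspace sketch of the safer parsing order described above (the constants mirror the IPv6 TLV encoding, but the sample option bytes and the printout are illustrative, not taken from the kernel sources): handle the one-byte Pad1 case first, then require at least two bytes before reading the option length, and bound that length by what remains.

    #include <stdio.h>

    #define TLV_PAD1 0
    #define TLV_PADN 1

    static int parse_tlv(const unsigned char *nh, int off, int len)
    {
        while (len > 0) {
            int optlen;

            if (nh[off] == TLV_PAD1) {  /* single byte, no length field */
                off++;
                len--;
                continue;
            }
            if (len < 2)                /* need type + length bytes */
                return 0;
            optlen = nh[off + 1] + 2;
            if (optlen > len)           /* option must fit in what remains */
                return 0;

            printf("option %u, length %d\n", nh[off], optlen);
            off += optlen;
            len -= optlen;
        }
        return 1;
    }

    int main(void)
    {
        const unsigned char opts[] = { TLV_PAD1, TLV_PADN, 2, 0, 0, 0x05, 1, 0xaa };

        return parse_tlv(opts, 0, sizeof(opts)) ? 0 : 1;
    }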
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: c1412fce7ecc ("net/ipv6/exthdrs.c: Strict PadN option checking") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Paolo Abeni pabeni@redhat.com Cc: Tom Herbert tom@herbertland.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv6/exthdrs.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c index 68b8084da83a7..fe2497ae4523f 100644 --- a/net/ipv6/exthdrs.c +++ b/net/ipv6/exthdrs.c @@ -138,18 +138,23 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, len -= 2;
while (len > 0) { - int optlen = nh[off + 1] + 2; - int i; + int optlen, i;
- switch (nh[off]) { - case IPV6_TLV_PAD1: - optlen = 1; + if (nh[off] == IPV6_TLV_PAD1) { padlen++; if (padlen > 7) goto bad; - break; + off++; + len--; + continue; + } + if (len < 2) + goto bad; + optlen = nh[off + 1] + 2; + if (optlen > len) + goto bad;
- case IPV6_TLV_PADN: + if (nh[off] == IPV6_TLV_PADN) { /* RFC 2460 states that the purpose of PadN is * to align the containing header to multiples * of 8. 7 is therefore the highest valid value. @@ -166,12 +171,7 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, if (nh[off + i] != 0) goto bad; } - break; - - default: /* Other TLV code so scan list */ - if (optlen > len) - goto bad; - + } else { tlv_count++; if (tlv_count > max_count) goto bad; @@ -191,7 +191,6 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, return false;
padlen = 0; - break; } off += optlen; len -= optlen;
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.198 commit fdf3ef3f66cb88f30fbf9396f876937b29506d9b
--------------------------------
[ Upstream commit 8b0ed8443ae6458786580d36b7d5f8125535c5d4 ]
The caller of wb_get_create() should pin the memcg, because wb_get_create() relies on this guarantee. The RCU read lock can only guarantee that the memcg css returned by css_from_id() cannot be released, but the memcg's reference count may already be zero.
rcu_read_lock()
memcg_css = css_from_id()
wb_get_create(memcg_css)
    cgwb_create(memcg_css)
        // css_get can change the ref counter from 0 back to 1
        css_get(memcg_css)
rcu_read_unlock()
Fix it by holding a reference to the css before calling wb_get_create(). This is not a problem I encountered in the real world. Just the result of a code review.
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Link: https://lore.kernel.org/r/20210402091145.80635-1-songmuchun@bytedance.com Signed-off-by: Muchun Song songmuchun@bytedance.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Tejun Heo tj@kernel.org Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/fs-writeback.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 87f1ef939cf45..5abb71da2b9a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -511,9 +511,14 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	/* find and pin the new wb */
 	rcu_read_lock();
 	memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys);
-	if (memcg_css)
-		isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+	if (memcg_css && !css_tryget(memcg_css))
+		memcg_css = NULL;
 	rcu_read_unlock();
+	if (!memcg_css)
+		goto out_free;
+
+	isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+	css_put(memcg_css);
 	if (!isw->new_wb)
 		goto out_free;
From: Jason Gunthorpe jgg@nvidia.com
stable inclusion from linux-4.19.164 commit bb3130192ff9333f93ca02cb6f17144cf55ba4d8
--------------------------------
[ Upstream commit 7b06a56d468b756ad6bb43ac21b11e474ebc54a0 ]
commit f8f6ae5d077a ("mm: always have io_remap_pfn_range() set pgprot_decrypted()") allows drivers using mmap to put PCI memory mapped BAR space into userspace to work correctly on AMD SME systems that default to all memory encrypted.
Since vfio_pci_mmap_fault() is working with PCI memory-mapped BAR space, it should be calling io_remap_pfn_range(), otherwise it will not work on SME systems.
Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking") Signed-off-by: Jason Gunthorpe jgg@nvidia.com Acked-by: Peter Xu peterx@redhat.com Tested-by: Tom Lendacky thomas.lendacky@amd.com Signed-off-by: Alex Williamson alex.williamson@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/vfio/pci/vfio_pci.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 87d586a3dee24..c48e1d84efb6b 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1378,8 +1378,8 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
mutex_unlock(&vdev->vma_lock);
- if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, - vma->vm_end - vma->vm_start, vma->vm_page_prot)) + if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + vma->vm_end - vma->vm_start, vma->vm_page_prot)) ret = VM_FAULT_SIGBUS;
up_out:
From: Alex Williamson alex.williamson@redhat.com
stable inclusion from linux-4.19.198 commit d067a6b62b430f7bedef9d4a2f600b7edf58e417
--------------------------------
[ Upstream commit 6a45ece4c9af473555f01f0f8b97eba56e3c7d0d ]
io_remap_pfn_range() will trigger a BUG_ON if it encounters a populated pte within the mapping range. This can occur because we map the entire vma on fault and multiple faults can be blocked behind the vma_lock. This leads to traces like the one reported below.
We can use our vma_list to test whether a given vma is mapped to avoid this issue.
[ 1591.733256] kernel BUG at mm/memory.c:2177! [ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O) [ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1 [ 1591.770735] Hardware name: , BIOS HixxxxFPGA 1P B600 V121-1 [ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) [ 1591.786134] pc : remap_pfn_range+0x214/0x340 [ 1591.793564] lr : remap_pfn_range+0x1b8/0x340 [ 1591.799117] sp : ffff80001068bbd0 [ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000 [ 1591.810404] x27: 0000001100910000 x26: 0000001300910000 [ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358 [ 1591.825144] x23: 0000001140000000 x22: 0000000000000041 [ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000 [ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000 [ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540 [ 1591.853570] x15: 0000000000000000 x14: 0000000000000000 [ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880 [ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000 [ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0 [ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000 [ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001 [ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3 [ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868 [ 1591.910234] Call trace: [ 1591.914837] remap_pfn_range+0x214/0x340 [ 1591.921765] vfio_pci_mmap_fault+0xac/0x130 [vfio_pci] [ 1591.931200] __do_fault+0x44/0x12c [ 1591.937031] handle_mm_fault+0xcc8/0x1230 [ 1591.942475] do_page_fault+0x16c/0x484 [ 1591.948635] do_translation_fault+0xbc/0xd8 [ 1591.954171] do_mem_abort+0x4c/0xc0 [ 1591.960316] el0_da+0x40/0x80 [ 1591.965585] el0_sync_handler+0x168/0x1b0 [ 1591.971608] el0_sync+0x174/0x180 [ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)
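The core of the fix, condensed from the diff below: with the vma_lock held, a fault that finds its vma already on vma_list simply returns instead of calling io_remap_pfn_range() a second time.

	mutex_lock(&vdev->vma_lock);
	/* the whole vma is populated on the first fault, so a later fault
	 * that finds it on the list must not remap the same range again */
	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
		if (mmap_vma->vma == vma)
			goto up_out;
	}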
Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking") Reported-by: Zeng Tao prime.zeng@hisilicon.com Suggested-by: Zeng Tao prime.zeng@hisilicon.com Link: https://lore.kernel.org/r/162497742783.3883260.3282953006487785034.stgit@ome... Signed-off-by: Alex Williamson alex.williamson@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/vfio/pci/vfio_pci.c | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index c48e1d84efb6b..51b791c750f1b 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1359,6 +1359,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct vfio_pci_device *vdev = vma->vm_private_data; + struct vfio_pci_mmap_vma *mmap_vma; vm_fault_t ret = VM_FAULT_NOPAGE;
mutex_lock(&vdev->vma_lock); @@ -1366,24 +1367,36 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
if (!__vfio_pci_memory_enabled(vdev)) { ret = VM_FAULT_SIGBUS; - mutex_unlock(&vdev->vma_lock); goto up_out; }
- if (__vfio_pci_add_vma(vdev, vma)) { - ret = VM_FAULT_OOM; - mutex_unlock(&vdev->vma_lock); - goto up_out; + /* + * We populate the whole vma on fault, so we need to test whether + * the vma has already been mapped, such as for concurrent faults + * to the same vma. io_remap_pfn_range() will trigger a BUG_ON if + * we ask it to fill the same range again. + */ + list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) { + if (mmap_vma->vma == vma) + goto up_out; }
- mutex_unlock(&vdev->vma_lock); - if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, - vma->vm_end - vma->vm_start, vma->vm_page_prot)) + vma->vm_end - vma->vm_start, + vma->vm_page_prot)) { ret = VM_FAULT_SIGBUS; + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start); + goto up_out; + } + + if (__vfio_pci_add_vma(vdev, vma)) { + ret = VM_FAULT_OOM; + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start); + }
up_out: up_read(&vdev->memory_lock); + mutex_unlock(&vdev->vma_lock); return ret; }
From: Miaohe Lin linmiaohe@huawei.com
stable inclusion from linux-4.19.198 commit c4e4a6f1c976aba407fa45fd95e4564291324eb9
--------------------------------
[ Upstream commit babbbdd08af98a59089334eb3effbed5a7a0cf7f ]
If other processes are mapping any other subpages of the hugepage, i.e. in the pte-mapped THP case, page_mapcount() will return 1 incorrectly. Then we would discard the page while other processes are still mapping it. Fix it by using total_mapcount(), which can tell whether other processes are still mapping it.
Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Reviewed-by: Yang Shi shy828301@gmail.com Signed-off-by: Miaohe Lin linmiaohe@huawei.com Cc: Alexey Dobriyan adobriyan@gmail.com Cc: "Aneesh Kumar K . V" aneesh.kumar@linux.ibm.com Cc: Anshuman Khandual anshuman.khandual@arm.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Matthew Wilcox willy@infradead.org Cc: Minchan Kim minchan@kernel.org Cc: Ralph Campbell rcampbell@nvidia.com Cc: Rik van Riel riel@surriel.com Cc: Song Liu songliubraving@fb.com Cc: William Kucharski william.kucharski@oracle.com Cc: Zi Yan ziy@nvidia.com Cc: Mike Kravetz mike.kravetz@oracle.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 45004eb20bbb8..31f1c580ba9c0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1749,7 +1749,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this page, we couldn't discard * the page unless they all do MADV_FREE so let's skip the page. */ - if (page_mapcount(page) != 1) + if (total_mapcount(page) != 1) goto out;
if (!trylock_page(page))
From: Willy Tarreau w@1wt.eu
stable inclusion from linux-4.19.198 commit f0be58ec9931907e980cf21737e51d369808eb95
--------------------------------
[ Upstream commit 62f20e068ccc50d6ab66fdb72ba90da2b9418c99 ]
This is a complement to commit aa6dd211e4b1 ("inet: use bigger hash table for IP ID generation"), but focusing on some specific aspects of IPv6.
Contrary to IPv4, IPv6 only uses packet IDs with fragments, and with a minimum MTU of 1280 it is much harder to force a remote peer to produce many fragments to explore its ID sequence. In addition, packet IDs are 32-bit in IPv6, which further complicates their analysis. On the other hand, it is often easier to choose among plenty of possible source addresses and partially work around the bigger hash table the commit above permits, which leaves IPv6 partially exposed to remote analysis, at the risk of weakening protocols like DNS if some IDs can be predicted with a good enough probability.
Given the wide range of permitted IDs, the risk of collision is extremely low so there's no need to rely on the positive increment algorithm that is shared with the IPv4 code via ip_idents_reserve(). We have a fast PRNG, so let's simply call prandom_u32() and be done with it.
Performance measurements at 10 Gbps couldn't show any difference with the previous code, even when using a single core, because, due to the large fragments, we're limited to only ~930 kpps at 10 Gbps and the cost of the random generation is completely offset by other operations and by the network transfer time. In addition, this change removes the need to update a shared entry in the idents table, so it may even end up being slightly faster on large-scale systems where this matters.
The risk of at least one collision here is about 1/80 million among 10 IDs, 1/850k among 100 IDs, and still only 1/8.5k among 1000 IDs, which remains very low compared to IPv4 where all IDs are reused every 4 to 80ms on a 10 Gbps flow depending on packet sizes.
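Those figures can be sanity-checked with a rough birthday approximation, P ~= 1 - exp(-n*(n-1)/2^33); a small stand-alone program (the results are estimates, in the same ballpark as the numbers quoted above):

	/* build with: cc -O2 collision.c -lm */
	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		const double space = 4294967296.0;	/* 2^32 possible IDs */
		const int counts[] = { 10, 100, 1000 };

		for (int i = 0; i < 3; i++) {
			double n = counts[i];
			double p = 1.0 - exp(-n * (n - 1.0) / (2.0 * space));

			printf("%4d IDs: collision odds ~ 1 in %.0f\n",
			       counts[i], 1.0 / p);
		}
		return 0;
	}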
Reported-by: Amit Klein aksecurity@gmail.com Signed-off-by: Willy Tarreau w@1wt.eu Reviewed-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20210529110746.6796-1-w@1wt.eu Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv6/output_core.c | 28 +++++----------------------- 1 file changed, 5 insertions(+), 23 deletions(-)
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c index 868ae23dbae19..3829b565c645d 100644 --- a/net/ipv6/output_core.c +++ b/net/ipv6/output_core.c @@ -14,29 +14,11 @@ static u32 __ipv6_select_ident(struct net *net, const struct in6_addr *dst, const struct in6_addr *src) { - const struct { - struct in6_addr dst; - struct in6_addr src; - } __aligned(SIPHASH_ALIGNMENT) combined = { - .dst = *dst, - .src = *src, - }; - u32 hash, id; - - /* Note the following code is not safe, but this is okay. */ - if (unlikely(siphash_key_is_zero(&net->ipv4.ip_id_key))) - get_random_bytes(&net->ipv4.ip_id_key, - sizeof(net->ipv4.ip_id_key)); - - hash = siphash(&combined, sizeof(combined), &net->ipv4.ip_id_key); - - /* Treat id of 0 as unset and if we get 0 back from ip_idents_reserve, - * set the hight order instead thus minimizing possible future - * collisions. - */ - id = ip_idents_reserve(hash, 1); - if (unlikely(!id)) - id = 1 << 31; + u32 id; + + do { + id = prandom_u32(); + } while (!id);
return id; }
From: Joe Thornber ejt@redhat.com
stable inclusion from linux-4.19.198 commit e20d40538987b78cdaadcfa2a88344bf97608eb1
--------------------------------
[ Upstream commit 5faafc77f7de69147d1e818026b9a0cbf036a7b2 ]
Current commit code resets the place where the search for free blocks will begin back to the start of the metadata device. There are a couple of repercussions to this:
- The first allocation after the commit is likely to take longer than normal as it searches for a free block in an area that is likely to have very few free blocks (if any).
- Any free blocks it finds will have been recently freed. Reusing them means we have fewer old copies of the metadata to aid recovery from hardware error.
Fix these issues by leaving the cursor alone, only resetting when the search hits the end of the metadata device.
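The allocation path, condensed from the diff below: the search starts at the saved cursor and only wraps around to the beginning of the device when it runs out of space.

	r = sm_ll_find_common_free_block(&smd->old_ll, &smd->ll,
					 smd->begin, smd->ll.nr_blocks, b);
	if (r == -ENOSPC) {
		/* nothing free between the cursor and the end of the device;
		 * retry below the cursor in case something has been freed */
		r = sm_ll_find_common_free_block(&smd->old_ll, &smd->ll,
						 0, smd->begin, b);
	}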
Signed-off-by: Joe Thornber ejt@redhat.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/persistent-data/dm-space-map-disk.c | 9 ++++++++- drivers/md/persistent-data/dm-space-map-metadata.c | 9 ++++++++- 2 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/md/persistent-data/dm-space-map-disk.c b/drivers/md/persistent-data/dm-space-map-disk.c index bf4c5e2ccb6ff..e0acae7a3815d 100644 --- a/drivers/md/persistent-data/dm-space-map-disk.c +++ b/drivers/md/persistent-data/dm-space-map-disk.c @@ -171,6 +171,14 @@ static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b) * Any block we allocate has to be free in both the old and current ll. */ r = sm_ll_find_common_free_block(&smd->old_ll, &smd->ll, smd->begin, smd->ll.nr_blocks, b); + if (r == -ENOSPC) { + /* + * There's no free block between smd->begin and the end of the metadata device. + * We search before smd->begin in case something has been freed. + */ + r = sm_ll_find_common_free_block(&smd->old_ll, &smd->ll, 0, smd->begin, b); + } + if (r) return r;
@@ -199,7 +207,6 @@ static int sm_disk_commit(struct dm_space_map *sm) return r;
memcpy(&smd->old_ll, &smd->ll, sizeof(smd->old_ll)); - smd->begin = 0; smd->nr_allocated_this_transaction = 0;
r = sm_disk_get_nr_free(sm, &nr_free); diff --git a/drivers/md/persistent-data/dm-space-map-metadata.c b/drivers/md/persistent-data/dm-space-map-metadata.c index 9e3c64ec2026f..da439ac857963 100644 --- a/drivers/md/persistent-data/dm-space-map-metadata.c +++ b/drivers/md/persistent-data/dm-space-map-metadata.c @@ -452,6 +452,14 @@ static int sm_metadata_new_block_(struct dm_space_map *sm, dm_block_t *b) * Any block we allocate has to be free in both the old and current ll. */ r = sm_ll_find_common_free_block(&smm->old_ll, &smm->ll, smm->begin, smm->ll.nr_blocks, b); + if (r == -ENOSPC) { + /* + * There's no free block between smm->begin and the end of the metadata device. + * We search before smm->begin in case something has been freed. + */ + r = sm_ll_find_common_free_block(&smm->old_ll, &smm->ll, 0, smm->begin, b); + } + if (r) return r;
@@ -503,7 +511,6 @@ static int sm_metadata_commit(struct dm_space_map *sm) return r;
memcpy(&smm->old_ll, &smm->ll, sizeof(smm->old_ll)); - smm->begin = 0; smm->allocated_this_transaction = 0;
return 0;
From: Xianting Tian xianting.tian@linux.alibaba.com
stable inclusion from linux-4.19.198 commit 34218ccb387c1e5e94b2baa6e337fb0367edede0
--------------------------------
[ Upstream commit 85eb1389458d134bdb75dad502cc026c3753a619 ]
We should not BUG() directly when there is a hdr error; it is better to output an error message when such an error happens. Currently, the caller of xmit_skb() already does it.
Signed-off-by: Xianting Tian xianting.tian@linux.alibaba.com Reviewed-by: Leon Romanovsky leonro@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/net/virtio_net.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index e3553056cec06..3595127f714f6 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1536,7 +1536,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) if (virtio_net_hdr_from_skb(skb, &hdr->hdr, virtio_is_little_endian(vi->vdev), false, 0)) - BUG(); + return -EPROTO;
if (vi->mergeable_rx_bufs) hdr->num_buffers = 0;
From: Steffen Klassert steffen.klassert@secunet.com
stable inclusion from linux-4.19.198 commit b559f417a1aa8715fe040322b1efcb164aea9851
--------------------------------
[ Upstream commit 6fd06963fa74197103cdbb4b494763127b3f2f34 ]
When memory allocation for XFRMA_ENCAP or XFRMA_COADDR fails, the error will not be reported because the -ENOMEM assignment to the err variable is later overwritten. Fix this by moving these two allocations to the front of the function so that memory allocation failures are reported.
Reported-by: Tobias Brunner tobias@strongswan.org Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/xfrm/xfrm_user.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 0b80c79077154..f94abe1fdd58f 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -579,6 +579,20 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
copy_from_user_state(x, p);
+ if (attrs[XFRMA_ENCAP]) { + x->encap = kmemdup(nla_data(attrs[XFRMA_ENCAP]), + sizeof(*x->encap), GFP_KERNEL); + if (x->encap == NULL) + goto error; + } + + if (attrs[XFRMA_COADDR]) { + x->coaddr = kmemdup(nla_data(attrs[XFRMA_COADDR]), + sizeof(*x->coaddr), GFP_KERNEL); + if (x->coaddr == NULL) + goto error; + } + if (attrs[XFRMA_SA_EXTRA_FLAGS]) x->props.extra_flags = nla_get_u32(attrs[XFRMA_SA_EXTRA_FLAGS]);
@@ -599,23 +613,9 @@ static struct xfrm_state *xfrm_state_construct(struct net *net, attrs[XFRMA_ALG_COMP]))) goto error;
- if (attrs[XFRMA_ENCAP]) { - x->encap = kmemdup(nla_data(attrs[XFRMA_ENCAP]), - sizeof(*x->encap), GFP_KERNEL); - if (x->encap == NULL) - goto error; - } - if (attrs[XFRMA_TFCPAD]) x->tfcpad = nla_get_u32(attrs[XFRMA_TFCPAD]);
- if (attrs[XFRMA_COADDR]) { - x->coaddr = kmemdup(nla_data(attrs[XFRMA_COADDR]), - sizeof(*x->coaddr), GFP_KERNEL); - if (x->coaddr == NULL) - goto error; - } - xfrm_mark_get(attrs, &x->mark);
xfrm_smark_init(attrs, &x->props.smark);
From: "Longpeng(Mike)" longpeng2@huawei.com
stable inclusion from linux-4.19.198 commit 0ea3919ffdb946390669b89e2437e0a70e8581be
--------------------------------
[ Upstream commit c7ff9cff70601ea19245d997bb977344663434c7 ]
The client's sk_state will be set to TCP_ESTABLISHED if the server replies to the client's connect request.
However, if the client has a pending signal, its sk_state will be set to TCP_CLOSE without notifying the server, so the server will hold on to the corrupted connection.
client                              server

1. sk_state=TCP_SYN_SENT
2. call ->connect()
3. wait reply
                                    4. sk_state=TCP_ESTABLISHED
                                    5. insert to connected list
                                    6. reply to the client
7. sk_state=TCP_ESTABLISHED
8. insert to connected list
9. *signal pending* <--- the user kills the client
10. sk_state=TCP_CLOSE
client is exiting...
11. call ->release()
    virtio_transport_close
      if (!(sk->sk_state == TCP_ESTABLISHED ||
            sk->sk_state == TCP_CLOSING))
              return true;
    *returns here, so the server cannot notice the connection is corrupt*
So the client should notify the peer in this case.
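The fixed connect error path, condensed from the diff below:

	if (signal_pending(current)) {
		err = sock_intr_errno(timeout);
		/* if the handshake already completed, go through TCP_CLOSING
		 * so ->release() still notifies the peer */
		sk->sk_state = sk->sk_state == TCP_ESTABLISHED ?
			       TCP_CLOSING : TCP_CLOSE;
		sock->state = SS_UNCONNECTED;
		vsock_transport_cancel_pkt(vsk);
		goto out_wait;
	}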
Cc: David S. Miller davem@davemloft.net Cc: Jakub Kicinski kuba@kernel.org Cc: Jorgen Hansen jhansen@vmware.com Cc: Norbert Slusarek nslusarek@gmx.net Cc: Andra Paraschiv andraprs@amazon.com Cc: Colin Ian King colin.king@canonical.com Cc: David Brazdil dbrazdil@google.com Cc: Alexander Popov alex.popov@linux.com Suggested-by: Stefano Garzarella sgarzare@redhat.com Link: https://lkml.org/lkml/2021/5/17/418 Signed-off-by: lixianming lixianming5@huawei.com Signed-off-by: Longpeng(Mike) longpeng2@huawei.com Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/vmw_vsock/af_vsock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 62ec9182f234d..4f9a0635d723e 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -1225,7 +1225,7 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
if (signal_pending(current)) { err = sock_intr_errno(timeout); - sk->sk_state = TCP_CLOSE; + sk->sk_state = sk->sk_state == TCP_ESTABLISHED ? TCP_CLOSING : TCP_CLOSE; sock->state = SS_UNCONNECTED; vsock_transport_cancel_pkt(vsk); goto out_wait;
From: Jakub Kicinski kuba@kernel.org
stable inclusion from linux-4.19.198 commit a847a5e25692a8d4637881f00d0453221ca086ef
--------------------------------
[ Upstream commit 6d123b81ac615072a8525c13c6c41b695270a15d ]
Dave observed a number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has an MTU of 64k, the kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse, if the message length is <32k, the allocation may trigger the OOM killer.
This is entirely avoidable, we can use an skb with page frags.
af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if the underlying device supports SG; if extra allocator pressure causes regressions in real workloads, we can switch to trying the large allocation first and falling back.
v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2
Reported-by: Dave Jones dsj@fb.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/ip_output.c | 32 ++++++++++++++++++-------------- net/ipv6/ip6_output.c | 32 +++++++++++++++++--------------- 2 files changed, 35 insertions(+), 29 deletions(-)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index e411c42d84289..e63905f7f6f95 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -940,7 +940,7 @@ static int __ip_append_data(struct sock *sk, unsigned int datalen; unsigned int fraglen; unsigned int fraggap; - unsigned int alloclen; + unsigned int alloclen, alloc_extra; unsigned int pagedlen; struct sk_buff *skb_prev; alloc_new_skb: @@ -960,35 +960,39 @@ static int __ip_append_data(struct sock *sk, fraglen = datalen + fragheaderlen; pagedlen = 0;
+ alloc_extra = hh_len + 15; + alloc_extra += exthdrlen; + + /* The last fragment gets additional space at tail. + * Note, with MSG_MORE we overallocate on fragments, + * because we have no idea what fragment will be + * the last. + */ + if (datalen == length + fraggap) + alloc_extra += rt->dst.trailer_len; + if ((flags & MSG_MORE) && !(rt->dst.dev->features&NETIF_F_SG)) alloclen = mtu; - else if (!paged) + else if (!paged && + (fraglen + alloc_extra < SKB_MAX_ALLOC || + !(rt->dst.dev->features & NETIF_F_SG))) alloclen = fraglen; else { alloclen = min_t(int, fraglen, MAX_HEADER); pagedlen = fraglen - alloclen; }
- alloclen += exthdrlen; - - /* The last fragment gets additional space at tail. - * Note, with MSG_MORE we overallocate on fragments, - * because we have no idea what fragment will be - * the last. - */ - if (datalen == length + fraggap) - alloclen += rt->dst.trailer_len; + alloclen += alloc_extra;
if (transhdrlen) { - skb = sock_alloc_send_skb(sk, - alloclen + hh_len + 15, + skb = sock_alloc_send_skb(sk, alloclen, (flags & MSG_DONTWAIT), &err); } else { skb = NULL; if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <= 2 * sk->sk_sndbuf) - skb = alloc_skb(alloclen + hh_len + 15, + skb = alloc_skb(alloclen, sk->sk_allocation); if (unlikely(!skb)) err = -ENOBUFS; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index e1bb7db88483d..aa8f19f852cc7 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1394,7 +1394,7 @@ static int __ip6_append_data(struct sock *sk, unsigned int datalen; unsigned int fraglen; unsigned int fraggap; - unsigned int alloclen; + unsigned int alloclen, alloc_extra; unsigned int pagedlen; alloc_new_skb: /* There's no room in the current skb */ @@ -1421,17 +1421,28 @@ static int __ip6_append_data(struct sock *sk, fraglen = datalen + fragheaderlen; pagedlen = 0;
+ alloc_extra = hh_len; + alloc_extra += dst_exthdrlen; + alloc_extra += rt->dst.trailer_len; + + /* We just reserve space for fragment header. + * Note: this may be overallocation if the message + * (without MSG_MORE) fits into the MTU. + */ + alloc_extra += sizeof(struct frag_hdr); + if ((flags & MSG_MORE) && !(rt->dst.dev->features&NETIF_F_SG)) alloclen = mtu; - else if (!paged) + else if (!paged && + (fraglen + alloc_extra < SKB_MAX_ALLOC || + !(rt->dst.dev->features & NETIF_F_SG))) alloclen = fraglen; else { alloclen = min_t(int, fraglen, MAX_HEADER); pagedlen = fraglen - alloclen; } - - alloclen += dst_exthdrlen; + alloclen += alloc_extra;
if (datalen != length + fraggap) { /* @@ -1441,30 +1452,21 @@ static int __ip6_append_data(struct sock *sk, datalen += rt->dst.trailer_len; }
- alloclen += rt->dst.trailer_len; fraglen = datalen + fragheaderlen;
- /* - * We just reserve space for fragment header. - * Note: this may be overallocation if the message - * (without MSG_MORE) fits into the MTU. - */ - alloclen += sizeof(struct frag_hdr); - copy = datalen - transhdrlen - fraggap - pagedlen; if (copy < 0) { err = -EINVAL; goto error; } if (transhdrlen) { - skb = sock_alloc_send_skb(sk, - alloclen + hh_len, + skb = sock_alloc_send_skb(sk, alloclen, (flags & MSG_DONTWAIT), &err); } else { skb = NULL; if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <= 2 * sk->sk_sndbuf) - skb = alloc_skb(alloclen + hh_len, + skb = alloc_skb(alloclen, sk->sk_allocation); if (unlikely(!skb)) err = -ENOBUFS;
From: Petr Pavlu petr.pavlu@suse.com
stable inclusion from linux-4.19.198 commit ba7f895082ab3a89a5e9da2730474dbae7f79988
--------------------------------
commit 2253042d86f57d90a621ac2513a7a7a13afcf809 upstream.
When an IPMI watchdog timer is being stopped in ipmi_close() or ipmi_ioctl(WDIOS_DISABLECARD), the current watchdog action is updated to WDOG_TIMEOUT_NONE and _ipmi_set_timeout(IPMI_SET_TIMEOUT_NO_HB) is called to install this action. The latter function ends up invoking __ipmi_set_timeout() which makes the actual 'Set Watchdog Timer' IPMI request.
For IPMI 1.0, this operation results in fully stopping the watchdog timer. For IPMI >= 1.5, __ipmi_set_timeout() always specifies the "don't stop" flag in the prepared 'Set Watchdog Timer' IPMI request. As a result, the watchdog timer has its action correctly updated to 'none' but the timer continues to run. A problem is that IPMI firmware can then still log an expiration event when the configured timeout is reached, which is unexpected because the watchdog timer was requested to be stopped.
The patch fixes this problem by not setting the "don't stop" flag in __ipmi_set_timeout() when the current action is WDOG_TIMEOUT_NONE, which results in stopping the watchdog timer. This makes the behaviour for IPMI >= 1.5 consistent with IPMI 1.0. It also matches the logic in __ipmi_heartbeat(), which does not allow resetting the watchdog if the current action is WDOG_TIMEOUT_NONE, as that would start the timer.
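The fixed flag handling, condensed from the diff below: the "don't stop" flag (or the IPMI 1.0 restart) is only applied while a real action is armed.

	if (ipmi_watchdog_state != WDOG_TIMEOUT_NONE) {
		if ((ipmi_version_major > 1) ||
		    ((ipmi_version_major == 1) && (ipmi_version_minor >= 5))) {
			/* IPMI >= 1.5: keep the running timer ticking */
			data[0] |= WDOG_DONT_STOP_ON_SET;
		} else {
			/* IPMI 1.0: setting the timer stops it, restart it */
			hbnow = 1;
		}
	}
	/* with WDOG_TIMEOUT_NONE neither branch runs, so the set request
	 * stops the timer on both IPMI 1.0 and IPMI >= 1.5 */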
Signed-off-by: Petr Pavlu petr.pavlu@suse.com Message-Id: 10a41bdc-9c99-089c-8d89-fa98ce5ea080@suse.com Cc: stable@vger.kernel.org Signed-off-by: Corey Minyard cminyard@mvista.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/char/ipmi/ipmi_watchdog.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/char/ipmi/ipmi_watchdog.c b/drivers/char/ipmi/ipmi_watchdog.c index ca1c5c5109f04..f016d54b25929 100644 --- a/drivers/char/ipmi/ipmi_watchdog.c +++ b/drivers/char/ipmi/ipmi_watchdog.c @@ -366,16 +366,18 @@ static int __ipmi_set_timeout(struct ipmi_smi_msg *smi_msg, data[0] = 0; WDOG_SET_TIMER_USE(data[0], WDOG_TIMER_USE_SMS_OS);
- if ((ipmi_version_major > 1) - || ((ipmi_version_major == 1) && (ipmi_version_minor >= 5))) { - /* This is an IPMI 1.5-only feature. */ - data[0] |= WDOG_DONT_STOP_ON_SET; - } else if (ipmi_watchdog_state != WDOG_TIMEOUT_NONE) { - /* - * In ipmi 1.0, setting the timer stops the watchdog, we - * need to start it back up again. - */ - hbnow = 1; + if (ipmi_watchdog_state != WDOG_TIMEOUT_NONE) { + if ((ipmi_version_major > 1) || + ((ipmi_version_major == 1) && (ipmi_version_minor >= 5))) { + /* This is an IPMI 1.5-only feature. */ + data[0] |= WDOG_DONT_STOP_ON_SET; + } else { + /* + * In ipmi 1.0, setting the timer stops the watchdog, we + * need to start it back up again. + */ + hbnow = 1; + } }
data[1] = 0;
From: Yun Zhou yun.zhou@windriver.com
stable inclusion from linux-4.19.198 commit 1f4c6061fccee64b2072b28dfa3e93cf859c4c0a
--------------------------------
commit d3b16034a24a112bb83aeb669ac5b9b01f744bb7 upstream.
There are two variables being incremented in that loop (i and j): i follows the raw data and j follows what is being written into the buffer. We should compare 'i' to MAX_MEMHEX_BYTES or compare 'j' to HEX_CHARS. Otherwise, if 'j' grows bigger than HEX_CHARS, it will overflow the destination buffer.
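A minimal stand-alone model of the bound, assuming the kernel's definitions MAX_MEMHEX_BYTES == 8 and HEX_CHARS == MAX_MEMHEX_BYTES * 2 + 1: every input byte produces two hex characters, so bounding a chunk by HEX_CHARS - 1 input bytes lets 'j' run past the HEX_CHARS-sized scratch buffer.

	#include <stdio.h>

	#define MAX_MEMHEX_BYTES 8
	#define HEX_CHARS (MAX_MEMHEX_BYTES * 2 + 1)

	int main(void)
	{
		unsigned int buggy_chunk = HEX_CHARS - 1;     /* 16 input bytes */
		unsigned int fixed_chunk = MAX_MEMHEX_BYTES;  /*  8 input bytes */

		printf("buggy: up to %u hex chars into a %u-char buffer\n",
		       2 * buggy_chunk, HEX_CHARS);
		printf("fixed: up to %u hex chars into a %u-char buffer\n",
		       2 * fixed_chunk, HEX_CHARS);
		return 0;
	}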
Link: https://lore.kernel.org/lkml/20210625122453.5e2fe304@oasis.local.home/ Link: https://lkml.kernel.org/r/20210626032156.47889-1-yun.zhou@windriver.com
Cc: stable@vger.kernel.org Fixes: 5e3ca0ec76fce ("ftrace: introduce the "hex" output method") Signed-off-by: Yun Zhou yun.zhou@windriver.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/seq_buf.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/lib/seq_buf.c b/lib/seq_buf.c index b15dbb6f061a5..5dd4d1d02a175 100644 --- a/lib/seq_buf.c +++ b/lib/seq_buf.c @@ -228,8 +228,10 @@ int seq_buf_putmem_hex(struct seq_buf *s, const void *mem,
WARN_ON(s->size == 0);
+ BUILD_BUG_ON(MAX_MEMHEX_BYTES * 2 >= HEX_CHARS); + while (len) { - start_len = min(len, HEX_CHARS - 1); + start_len = min(len, MAX_MEMHEX_BYTES); #ifdef __BIG_ENDIAN for (i = 0, j = 0; i < start_len; i++) { #else
From: Tyrel Datwyler tyreld@linux.ibm.com
stable inclusion from linux-4.19.198 commit e1bd3fac2baa3d5c04375980c1d5263a3335af92
--------------------------------
commit 93aa71ad7379900e61c8adff6a710a4c18c7c99b upstream.
Commit 66a834d09293 ("scsi: core: Fix error handling of scsi_host_alloc()") changed the allocation logic to call put_device() to perform host cleanup with the assumption that IDA removal and stopping the kthread would properly be performed in scsi_host_dev_release(). However, in the unlikely case that the error handler thread fails to spawn, shost->ehandler is set to ERR_PTR(-ENOMEM).
The error handler cleanup code in scsi_host_dev_release() will call kthread_stop() if shost->ehandler != NULL, which will always be the case whether the kthread was successfully spawned or not. In the case that it failed to spawn, this has the nasty side effect of trying to dereference an invalid pointer when kthread_stop() is called. The following splat provides an example of this behavior in the wild:
scsi host11: error handler thread failed to spawn, error = -4 Kernel attempted to read user page (10c) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x0000010c Faulting instruction address: 0xc00000000818e9a8 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: ibmvscsi(+) scsi_transport_srp dm_multipath dm_mirror dm_region hash dm_log dm_mod fuse overlay squashfs loop CPU: 12 PID: 274 Comm: systemd-udevd Not tainted 5.13.0-rc7 #1 NIP: c00000000818e9a8 LR: c0000000089846e8 CTR: 0000000000007ee8 REGS: c000000037d12ea0 TRAP: 0300 Not tainted (5.13.0-rc7) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28228228 XER: 20040001 CFAR: c0000000089846e4 DAR: 000000000000010c DSISR: 40000000 IRQMASK: 0 GPR00: c0000000089846e8 c000000037d13140 c000000009cc1100 fffffffffffffffc GPR04: 0000000000000001 0000000000000000 0000000000000000 c000000037dc0000 GPR08: 0000000000000000 c000000037dc0000 0000000000000001 00000000fffff7ff GPR12: 0000000000008000 c00000000a049000 c000000037d13d00 000000011134d5a0 GPR16: 0000000000001740 c0080000190d0000 c0080000190d1740 c000000009129288 GPR20: c000000037d13bc0 0000000000000001 c000000037d13bc0 c0080000190b7898 GPR24: c0080000190b7708 0000000000000000 c000000033bb2c48 0000000000000000 GPR28: c000000046b28280 0000000000000000 000000000000010c fffffffffffffffc NIP [c00000000818e9a8] kthread_stop+0x38/0x230 LR [c0000000089846e8] scsi_host_dev_release+0x98/0x160 Call Trace: [c000000033bb2c48] 0xc000000033bb2c48 (unreliable) [c0000000089846e8] scsi_host_dev_release+0x98/0x160 [c00000000891e960] device_release+0x60/0x100 [c0000000087e55c4] kobject_release+0x84/0x210 [c00000000891ec78] put_device+0x28/0x40 [c000000008984ea4] scsi_host_alloc+0x314/0x430 [c0080000190b38bc] ibmvscsi_probe+0x54/0xad0 [ibmvscsi] [c000000008110104] vio_bus_probe+0xa4/0x4b0 [c00000000892a860] really_probe+0x140/0x680 [c00000000892aefc] driver_probe_device+0x15c/0x200 [c00000000892b63c] device_driver_attach+0xcc/0xe0 [c00000000892b740] __driver_attach+0xf0/0x200 [c000000008926f28] bus_for_each_dev+0xa8/0x130 [c000000008929ce4] driver_attach+0x34/0x50 [c000000008928fc0] bus_add_driver+0x1b0/0x300 [c00000000892c798] driver_register+0x98/0x1a0 [c00000000810eb60] __vio_register_driver+0x80/0xe0 [c0080000190b4a30] ibmvscsi_module_init+0x9c/0xdc [ibmvscsi] [c0000000080121d0] do_one_initcall+0x60/0x2d0 [c000000008261abc] do_init_module+0x7c/0x320 [c000000008265700] load_module+0x2350/0x25b0 [c000000008265cb4] __do_sys_finit_module+0xd4/0x160 [c000000008031110] system_call_exception+0x150/0x2d0 [c00000000800d35c] system_call_common+0xec/0x278
Fix this by nulling shost->ehandler when the kthread fails to spawn.
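Roughly how the spawn path looks with the fix applied (a sketch; surrounding code abridged):

	shost->ehandler = kthread_run(scsi_error_handler, shost,
				      "scsi_eh_%d", shost->host_no);
	if (IS_ERR(shost->ehandler)) {
		shost_printk(KERN_WARNING, shost,
			     "error handler thread failed to spawn, error = %ld\n",
			     PTR_ERR(shost->ehandler));
		/* don't leave an ERR_PTR for scsi_host_dev_release() to pass
		 * to kthread_stop() */
		shost->ehandler = NULL;
		goto fail;
	}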
Link: https://lore.kernel.org/r/20210701195659.3185475-1-tyreld@linux.ibm.com Fixes: 66a834d09293 ("scsi: core: Fix error handling of scsi_host_alloc()") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei ming.lei@redhat.com Signed-off-by: Tyrel Datwyler tyreld@linux.ibm.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/hosts.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index 3bbbeb8e79f3c..7150164c5ed95 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -494,6 +494,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize) shost_printk(KERN_WARNING, shost, "error handler thread failed to spawn, error = %ld\n", PTR_ERR(shost->ehandler)); + shost->ehandler = NULL; goto fail; }
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.198 commit 1834fbbd664ca9aa51258c617bc8aa0206cd28b6
--------------------------------
commit 704adfb5a9978462cd861f170201ae2b5e3d3a80 upstream.
The histogram logic was allowing events with char * pointers to be used as normal strings. But it was easy to crash the kernel with:
# echo 'hist:keys=filename' > events/syscalls/sys_enter_openat/trigger
And open some files, and boom!
BUG: unable to handle page fault for address: 00007f2ced0c3280 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1173fa067 P4D 1173fa067 PUD 1171b6067 PMD 1171dd067 PTE 0 Oops: 0000 [#1] PREEMPT SMP CPU: 6 PID: 1810 Comm: cat Not tainted 5.13.0-rc5-test+ #61 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:strlen+0x0/0x20 Code: f6 82 80 2a 0b a9 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2a 0b a9 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3
RSP: 0018:ffffbdbf81567b50 EFLAGS: 00010246 RAX: 0000000000000003 RBX: ffff93815cdb3800 RCX: ffff9382401a22d0 RDX: 0000000000000100 RSI: 0000000000000000 RDI: 00007f2ced0c3280 RBP: 0000000000000100 R08: ffff9382409ff074 R09: ffffbdbf81567c98 R10: ffff9382409ff074 R11: 0000000000000000 R12: ffff9382409ff074 R13: 0000000000000001 R14: ffff93815a744f00 R15: 00007f2ced0c3280 FS: 00007f2ced0f8580(0000) GS:ffff93825a800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2ced0c3280 CR3: 0000000107069005 CR4: 00000000001706e0 Call Trace: event_hist_trigger+0x463/0x5f0 ? find_held_lock+0x32/0x90 ? sched_clock_cpu+0xe/0xd0 ? lock_release+0x155/0x440 ? kernel_init_free_pages+0x6d/0x90 ? preempt_count_sub+0x9b/0xd0 ? kernel_init_free_pages+0x6d/0x90 ? get_page_from_freelist+0x12c4/0x1680 ? __rb_reserve_next+0xe5/0x460 ? ring_buffer_lock_reserve+0x12a/0x3f0 event_triggers_call+0x52/0xe0 ftrace_syscall_enter+0x264/0x2c0 syscall_trace_enter.constprop.0+0x1ee/0x210 do_syscall_64+0x1c/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae
Where it triggered a fault on strlen(key) where key was the filename.
The reason is that filename is a char * to user space, and the histogram code just blindly dereferenced it, with obvious bad results.
I originally tried to use strncpy_from_user/kernel_nofault() but found that there are other places where it is dereferenced, and it is not worth the effort.
Just do not allow "char *" to act like strings.
Link: https://lkml.kernel.org/r/20210715000206.025df9d2@rorschach.local.home
Cc: Ingo Molnar mingo@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Tzvetomir Stoyanov tz.stoyanov@gmail.com Cc: stable@vger.kernel.org Acked-by: Namhyung Kim namhyung@kernel.org Acked-by: Tom Zanussi zanussi@kernel.org Fixes: 79e577cbce4c4 ("tracing: Support string type key properly") Fixes: 5967bd5c4239 ("tracing: Let filter_assign_type() detect FILTER_PTR_STRING") Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/trace/trace_events_hist.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index f9493e1d31d58..5ac561c9da57c 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -2262,7 +2262,9 @@ static struct hist_field *create_hist_field(struct hist_trigger_data *hist_data, if (WARN_ON_ONCE(!field)) goto out;
- if (is_string_field(field)) { + /* Pointers to strings are just pointers and dangerous to dereference */ + if (is_string_field(field) && + (field->filter_type != FILTER_PTR_STRING)) { flags |= HIST_FIELD_FL_STRING;
hist_field->size = MAX_FILTER_STR_VAL; @@ -4634,8 +4636,6 @@ static inline void add_to_key(char *compound_key, void *key, field = key_field->field; if (field->filter_type == FILTER_DYN_STRING) size = *(u32 *)(rec + field->offset) >> 16; - else if (field->filter_type == FILTER_PTR_STRING) - size = strlen(key); else if (field->filter_type == FILTER_STATIC_STRING) size = field->size;
From: Hannes Reinecke hare@suse.de
stable inclusion from linux-4.19.198 commit f4bde1d1bf905590c55acb02104c60d1c281a24f
--------------------------------
[ Upstream commit 7e26e3ea028740f934477ec01ba586ab033c35aa ]
scsi_execute() will now return a negative error if there was an error prior to command submission; evaluate that instead of checking for DRIVER_ERROR.
[mkp: build fix]
Link: https://lore.kernel.org/r/20210427083046.31620-6-hare@suse.de Signed-off-by: Hannes Reinecke hare@suse.de Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/device_handler/scsi_dh_alua.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index c95c782b93a53..eb6a7fc4c2096 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -562,12 +562,12 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) kfree(buff); return SCSI_DH_OK; } - if (!scsi_sense_valid(&sense_hdr)) { + if (retval < 0 || !scsi_sense_valid(&sense_hdr)) { sdev_printk(KERN_INFO, sdev, "%s: rtpg failed, result %d\n", ALUA_DH_NAME, retval); kfree(buff); - if (driver_byte(retval) == DRIVER_ERROR) + if (retval < 0) return SCSI_DH_DEV_TEMP_BUSY; return SCSI_DH_IO; } @@ -787,11 +787,11 @@ static unsigned alua_stpg(struct scsi_device *sdev, struct alua_port_group *pg) retval = submit_stpg(sdev, pg->group_id, &sense_hdr);
if (retval) { - if (!scsi_sense_valid(&sense_hdr)) { + if (retval < 0 || !scsi_sense_valid(&sense_hdr)) { sdev_printk(KERN_INFO, sdev, "%s: stpg failed, result %d", ALUA_DH_NAME, retval); - if (driver_byte(retval) == DRIVER_ERROR) + if (retval < 0) return SCSI_DH_DEV_TEMP_BUSY; } else { sdev_printk(KERN_INFO, sdev, "%s: stpg failed\n",
From: Mike Christie michael.christie@oracle.com
stable inclusion from linux-4.19.198 commit 8d3f5e3b4ebe1e4cdc9f6d79dd24e77aa8007efb
--------------------------------
[ Upstream commit b1d19e8c92cfb0ded180ef3376c20e130414e067 ]
There are a couple of places where we could free the iscsi_cls_conn while it's still in use. This adds some helpers to get/put a refcount on the struct and converts an existing user. Subsequent commits will then use the helpers to fix two bugs in the EH code.
Link: https://lore.kernel.org/r/20210525181821.7617-11-michael.christie@oracle.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/libiscsi.c | 7 ++----- drivers/scsi/scsi_transport_iscsi.c | 12 ++++++++++++ include/scsi/scsi_transport_iscsi.h | 2 ++ 3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index afa8091e19dd0..8ea441fe8aba3 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -1418,7 +1418,6 @@ void iscsi_session_failure(struct iscsi_session *session, enum iscsi_err err) { struct iscsi_conn *conn; - struct device *dev;
spin_lock_bh(&session->frwd_lock); conn = session->leadconn; @@ -1427,10 +1426,8 @@ void iscsi_session_failure(struct iscsi_session *session, return; }
- dev = get_device(&conn->cls_conn->dev); + iscsi_get_conn(conn->cls_conn); spin_unlock_bh(&session->frwd_lock); - if (!dev) - return; /* * if the host is being removed bypass the connection * recovery initialization because we are going to kill @@ -1440,7 +1437,7 @@ void iscsi_session_failure(struct iscsi_session *session, iscsi_conn_error_event(conn->cls_conn, err); else iscsi_conn_failure(conn, err); - put_device(dev); + iscsi_put_conn(conn->cls_conn); } EXPORT_SYMBOL_GPL(iscsi_session_failure);
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index e340b05278b63..2aaa5a2bd6135 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -2306,6 +2306,18 @@ int iscsi_destroy_conn(struct iscsi_cls_conn *conn) } EXPORT_SYMBOL_GPL(iscsi_destroy_conn);
+void iscsi_put_conn(struct iscsi_cls_conn *conn) +{ + put_device(&conn->dev); +} +EXPORT_SYMBOL_GPL(iscsi_put_conn); + +void iscsi_get_conn(struct iscsi_cls_conn *conn) +{ + get_device(&conn->dev); +} +EXPORT_SYMBOL_GPL(iscsi_get_conn); + /* * iscsi interface functions */ diff --git a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h index b266d2a3bcb1d..484e9787d817b 100644 --- a/include/scsi/scsi_transport_iscsi.h +++ b/include/scsi/scsi_transport_iscsi.h @@ -436,6 +436,8 @@ extern void iscsi_remove_session(struct iscsi_cls_session *session); extern void iscsi_free_session(struct iscsi_cls_session *session); extern struct iscsi_cls_conn *iscsi_create_conn(struct iscsi_cls_session *sess, int dd_size, uint32_t cid); +extern void iscsi_put_conn(struct iscsi_cls_conn *conn); +extern void iscsi_get_conn(struct iscsi_cls_conn *conn); extern int iscsi_destroy_conn(struct iscsi_cls_conn *conn); extern void iscsi_unblock_session(struct iscsi_cls_session *session); extern void iscsi_block_session(struct iscsi_cls_session *session);
From: Mike Christie michael.christie@oracle.com
stable inclusion from linux-4.19.198 commit 693e09c3be3cc130320036861c0d249da03bf1be
--------------------------------
[ Upstream commit bdd4aad7ff92ae39c2e93c415bb6761cb8b584da ]
The iscsi offload drivers are setting the shost->max_id to the max number of sessions they support. The problem is that max_id is not the max number of targets but the highest identifier the targets can have. To use it to limit the number of targets, we need to set it to max sessions - 1, or we can end up with a session we might not have preallocated resources for.
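A tiny stand-alone illustration of the off-by-one (MAX_SESSIONS here is a made-up value, not a driver constant):

	#include <stdio.h>

	#define MAX_SESSIONS 128	/* hypothetical preallocated sessions */

	int main(void)
	{
		unsigned int buggy_max_id = MAX_SESSIONS;	/* allows id 128 */
		unsigned int fixed_max_id = MAX_SESSIONS - 1;	/* ids 0..127 */

		printf("buggy: %u ids (0..%u) for %u sessions\n",
		       buggy_max_id + 1, buggy_max_id, MAX_SESSIONS);
		printf("fixed: %u ids (0..%u) for %u sessions\n",
		       fixed_max_id + 1, fixed_max_id, MAX_SESSIONS);
		return 0;
	}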
Link: https://lore.kernel.org/r/20210525181821.7617-15-michael.christie@oracle.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/be2iscsi/be_main.c | 4 ++-- drivers/scsi/bnx2i/bnx2i_iscsi.c | 2 +- drivers/scsi/cxgbi/libcxgbi.c | 4 ++-- drivers/scsi/qedi/qedi_main.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c index 3660059784f74..a5b807c676fcc 100644 --- a/drivers/scsi/be2iscsi/be_main.c +++ b/drivers/scsi/be2iscsi/be_main.c @@ -423,7 +423,7 @@ static struct beiscsi_hba *beiscsi_hba_alloc(struct pci_dev *pcidev) "beiscsi_hba_alloc - iscsi_host_alloc failed\n"); return NULL; } - shost->max_id = BE2_MAX_SESSIONS; + shost->max_id = BE2_MAX_SESSIONS - 1; shost->max_channel = 0; shost->max_cmd_len = BEISCSI_MAX_CMD_LEN; shost->max_lun = BEISCSI_NUM_MAX_LUN; @@ -5336,7 +5336,7 @@ static int beiscsi_enable_port(struct beiscsi_hba *phba) /* Re-enable UER. If different TPE occurs then it is recoverable. */ beiscsi_set_uer_feature(phba);
- phba->shost->max_id = phba->params.cxns_per_ctrl; + phba->shost->max_id = phba->params.cxns_per_ctrl - 1; phba->shost->can_queue = phba->params.ios_per_ctrl; ret = beiscsi_init_port(phba); if (ret < 0) { diff --git a/drivers/scsi/bnx2i/bnx2i_iscsi.c b/drivers/scsi/bnx2i/bnx2i_iscsi.c index 1af1a884109ed..f2f4150217707 100644 --- a/drivers/scsi/bnx2i/bnx2i_iscsi.c +++ b/drivers/scsi/bnx2i/bnx2i_iscsi.c @@ -793,7 +793,7 @@ struct bnx2i_hba *bnx2i_alloc_hba(struct cnic_dev *cnic) return NULL; shost->dma_boundary = cnic->pcidev->dma_mask; shost->transportt = bnx2i_scsi_xport_template; - shost->max_id = ISCSI_MAX_CONNS_PER_HBA; + shost->max_id = ISCSI_MAX_CONNS_PER_HBA - 1; shost->max_channel = 0; shost->max_lun = 512; shost->max_cmd_len = 16; diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c index cd2c247d6d0ce..dcb148a09b06a 100644 --- a/drivers/scsi/cxgbi/libcxgbi.c +++ b/drivers/scsi/cxgbi/libcxgbi.c @@ -338,7 +338,7 @@ void cxgbi_hbas_remove(struct cxgbi_device *cdev) EXPORT_SYMBOL_GPL(cxgbi_hbas_remove);
int cxgbi_hbas_add(struct cxgbi_device *cdev, u64 max_lun, - unsigned int max_id, struct scsi_host_template *sht, + unsigned int max_conns, struct scsi_host_template *sht, struct scsi_transport_template *stt) { struct cxgbi_hba *chba; @@ -358,7 +358,7 @@ int cxgbi_hbas_add(struct cxgbi_device *cdev, u64 max_lun,
shost->transportt = stt; shost->max_lun = max_lun; - shost->max_id = max_id; + shost->max_id = max_conns - 1; shost->max_channel = 0; shost->max_cmd_len = 16;
diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c index 763c7628356b1..756e0ccc37fb7 100644 --- a/drivers/scsi/qedi/qedi_main.c +++ b/drivers/scsi/qedi/qedi_main.c @@ -629,7 +629,7 @@ static struct qedi_ctx *qedi_host_alloc(struct pci_dev *pdev) goto exit_setup_shost; }
- shost->max_id = QEDI_MAX_ISCSI_CONNS_PER_HBA; + shost->max_id = QEDI_MAX_ISCSI_CONNS_PER_HBA - 1; shost->max_channel = 0; shost->max_lun = ~0; shost->max_cmd_len = 16;
From: Mike Christie michael.christie@oracle.com
stable inclusion from linux-4.19.198 commit 545de233de7d8ae2655d679d9e2a9e22497629f1
--------------------------------
[ Upstream commit 5777b7f0f03ce49372203b6521631f62f2810c8f ]
If qedi_process_cmd_cleanup_resp() finds the cmd, it frees the work and sets list_tmf_work to NULL, so qedi_tmf_work() should check whether list_tmf_work is non-NULL when it wants to force cleanup.
Link: https://lore.kernel.org/r/20210525181821.7617-20-michael.christie@oracle.com Reviewed-by: Manish Rangankar mrangankar@marvell.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/qedi/qedi_fw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/scsi/qedi/qedi_fw.c b/drivers/scsi/qedi/qedi_fw.c index 25d763ae5d5a6..d9a6f021050ec 100644 --- a/drivers/scsi/qedi/qedi_fw.c +++ b/drivers/scsi/qedi/qedi_fw.c @@ -1454,7 +1454,7 @@ static void qedi_tmf_work(struct work_struct *work)
ldel_exit: spin_lock_bh(&qedi_conn->tmf_work_lock); - if (!qedi_cmd->list_tmf_work) { + if (qedi_cmd->list_tmf_work) { list_del_init(&list_work->list); qedi_cmd->list_tmf_work = NULL; kfree(list_work);
From: Dmitry Torokhov dmitry.torokhov@gmail.com
stable inclusion from linux-4.19.198 commit 7c0bb53d48244b10782deea07d57e0a4c0fb4262
--------------------------------
[ Upstream commit b64210f2f7c11c757432ba3701d88241b2b98fb1 ]
If an i2c client receives an interrupt during reboot or shutdown it may be too late to service it by making an i2c transaction on the bus because the i2c controller has already been shutdown. This can lead to system hangs if the i2c controller tries to make a transfer that is doomed to fail because the access to the i2c pins is already shut down, or an iommu translation has been torn down so i2c controller register access doesn't work.
Let's simply disable the irq if there isn't a shutdown callback for an i2c client when there is an irq associated with the device. This will make sure that irqs don't come in later than the time that we can handle it. We don't do this if the i2c client device already has a shutdown callback because presumably they're doing the right thing and quieting the device so irqs don't come in after the shutdown callback returns.
Reported-by: kernel test robot lkp@intel.com [swboyd@chromium.org: Dropped newline, added commit text, added interrupt.h for robot build error] Signed-off-by: Stephen Boyd swboyd@chromium.org Signed-off-by: Dmitry Torokhov dmitry.torokhov@gmail.com Signed-off-by: Wolfram Sang wsa@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/i2c/i2c-core-base.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/i2c/i2c-core-base.c b/drivers/i2c/i2c-core-base.c index 1de10e5c70d7b..59e6f37180f6c 100644 --- a/drivers/i2c/i2c-core-base.c +++ b/drivers/i2c/i2c-core-base.c @@ -32,6 +32,7 @@ #include <linux/i2c-smbus.h> #include <linux/idr.h> #include <linux/init.h> +#include <linux/interrupt.h> #include <linux/irqflags.h> #include <linux/jump_label.h> #include <linux/kernel.h> @@ -449,6 +450,8 @@ static void i2c_device_shutdown(struct device *dev) driver = to_i2c_driver(dev->driver); if (driver->shutdown) driver->shutdown(client); + else if (client->irq > 0) + disable_irq(client->irq); }
static void i2c_client_dev_release(struct device *dev)
From: Dimitri John Ledkov dimitri.ledkov@canonical.com
stable inclusion from linux-4.19.198 commit a12a6a2d7cd2dd24663942499f40efab96e8d369
--------------------------------
[ Upstream commit 2c484419efc09e7234c667aa72698cb79ba8d8ed ]
lz4-compatible decompressor is simple. The format is underspecified and relies on EOF notification to determine when to stop. The initramfs buffer format[1] explicitly states that it can have an arbitrary number of zero padding. Thus, when operating without a fill function, be extra careful to ensure that sizes less than 4, or apparently empty chunk sizes, are treated as EOF.
To test this I created two cpio initrds: first a normal one, main.cpio, and a second one, second.cpio, with just a single /test-file containing "second". Then I compressed both of them with gzip and with lz4 -l, and created a 4-byte padding file (dd if=/dev/zero of=pad4 bs=1 count=4), to create four testcase initrds:
1) main.cpio.gzip + extra.cpio.gzip = pad0.gzip 2) main.cpio.lz4 + extra.cpio.lz4 = pad0.lz4 3) main.cpio.gzip + pad4 + extra.cpio.gzip = pad4.gzip 4) main.cpio.lz4 + pad4 + extra.cpio.lz4 = pad4.lz4
The pad4 test-cases replicate the initrd load by grub, as it pads and aligns every initrd it loads.
All of the above boot; however, /test-file was not accessible in the initrd for testcase #4, as decoding in the lz4 decompressor failed. Also, an error message was printed, which usually is harmless.
With a patched kernel, all of the above testcases now pass, and /test-file is accessible.
This fixes lz4 initrd decompress warning on every boot with grub. And more importantly this fixes inability to load multiple lz4 compressed initrds with grub. This patch has been shipping in Ubuntu kernels since January 2021.
[1] ./Documentation/driver-api/early-userspace/buffer-format.rst
BugLink: https://bugs.launchpad.net/bugs/1835660 Link: https://lore.kernel.org/lkml/20210114200256.196589-1-xnox@ubuntu.com/ # v0 Link: https://lkml.kernel.org/r/20210513104831.432975-1-dimitri.ledkov@canonical.c... Signed-off-by: Dimitri John Ledkov dimitri.ledkov@canonical.com Cc: Kyungsik Lee kyungsik.lee@lge.com Cc: Yinghai Lu yinghai@kernel.org Cc: Bongkyu Kim bongkyu.kim@lge.com Cc: Kees Cook keescook@chromium.org Cc: Sven Schmidt 4sschmid@informatik.uni-hamburg.de Cc: Rajat Asthana thisisrast7@gmail.com Cc: Nick Terrell terrelln@fb.com Cc: Gao Xiang hsiangkao@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- lib/decompress_unlz4.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/lib/decompress_unlz4.c b/lib/decompress_unlz4.c index 1b0baf3008ea3..b202aa864c48a 100644 --- a/lib/decompress_unlz4.c +++ b/lib/decompress_unlz4.c @@ -115,6 +115,9 @@ STATIC inline int INIT unlz4(u8 *input, long in_len, error("data corrupted"); goto exit_2; } + } else if (size < 4) { + /* empty or end-of-file */ + goto exit_3; }
chunksize = get_unaligned_le32(inp); @@ -128,6 +131,10 @@ STATIC inline int INIT unlz4(u8 *input, long in_len, continue; }
+ if (!fill && chunksize == 0) { + /* empty or end-of-file */ + goto exit_3; + }
if (posp) *posp += 4; @@ -187,6 +194,7 @@ STATIC inline int INIT unlz4(u8 *input, long in_len, } }
+exit_3: ret = 0; exit_2: if (!input)
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.198 commit e409580b19ec7923082e90c1ec9505aa0867f21d
--------------------------------
[ Upstream commit e97bc66377bca097e1f3349ca18ca17f202ff659 ]
If a file has already been closed, then it should not be selected to support further I/O.
Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com [Trond: Fix an invalid pointer deref reported by Colin Ian King] Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/nfs/inode.c | 4 ++++ include/linux/nfs_fs.h | 1 + 2 files changed, 5 insertions(+)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index c9670d822e1ca..6d5fedcd30d05 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -1040,6 +1040,7 @@ EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context); void nfs_file_set_open_context(struct file *filp, struct nfs_open_context *ctx) { filp->private_data = get_nfs_open_context(ctx); + set_bit(NFS_CONTEXT_FILE_OPEN, &ctx->flags); if (list_empty(&ctx->list)) nfs_inode_attach_open_context(ctx); } @@ -1059,6 +1060,8 @@ struct nfs_open_context *nfs_find_open_context(struct inode *inode, struct rpc_c continue; if ((pos->mode & (FMODE_READ|FMODE_WRITE)) != mode) continue; + if (!test_bit(NFS_CONTEXT_FILE_OPEN, &pos->flags)) + continue; ctx = get_nfs_open_context(pos); break; } @@ -1073,6 +1076,7 @@ void nfs_file_clear_open_context(struct file *filp) if (ctx) { struct inode *inode = d_inode(ctx->dentry);
+ clear_bit(NFS_CONTEXT_FILE_OPEN, &ctx->flags); /* * We fatal error on write before. Try to writeback * every page again. diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index dcf01bdcc3e3b..798dee258bff6 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -78,6 +78,7 @@ struct nfs_open_context { #define NFS_CONTEXT_RESEND_WRITES (1) #define NFS_CONTEXT_BAD (2) #define NFS_CONTEXT_UNLOCK (3) +#define NFS_CONTEXT_FILE_OPEN (4) int error;
struct list_head list;
From: Xie Yongji xieyongji@bytedance.com
stable inclusion from linux-4.19.198 commit 600942d2fd49b90e44857d20c774b20d16f3130f
--------------------------------
[ Upstream commit b71ba22e7c6c6b279c66f53ee7818709774efa1f ]
The vblk->vqs should be freed before we call init_vqs() in virtblk_restore().
Signed-off-by: Xie Yongji xieyongji@bytedance.com Link: https://lore.kernel.org/r/20210517084332.280-1-xieyongji@bytedance.com Acked-by: Jason Wang jasowang@redhat.com Signed-off-by: Michael S. Tsirkin mst@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/block/virtio_blk.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 3b00ee46b9d90..7fe0f8f75d550 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -940,6 +940,8 @@ static int virtblk_freeze(struct virtio_device *vdev) blk_mq_quiesce_queue(vblk->disk->queue);
vdev->config->del_vqs(vdev); + kfree(vblk->vqs); + return 0; }
From: Xie Yongji xieyongji@bytedance.com
stable inclusion from linux-4.19.198 commit 845ae8523f5a9ecfdbed48b485cb4ffae71df95b
--------------------------------
[ Upstream commit 3f2869cace829fb4b80fc53b3ddaa7f4ba9acbf1 ]
Do some cleanups in virtnet_restore() when virtnet_cpu_notif_add() fails.
Signed-off-by: Xie Yongji xieyongji@bytedance.com Link: https://lore.kernel.org/r/20210517084516.332-1-xieyongji@bytedance.com Acked-by: Jason Wang jasowang@redhat.com Signed-off-by: Michael S. Tsirkin mst@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/net/virtio_net.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 3595127f714f6..68eda4739a9ac 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -3185,8 +3185,11 @@ static __maybe_unused int virtnet_restore(struct virtio_device *vdev) virtnet_set_queues(vi, vi->curr_queue_pairs);
err = virtnet_cpu_notif_add(vi); - if (err) + if (err) { + virtnet_freeze_down(vdev); + remove_vq_common(vi); return err; + }
return 0; }
From: Xie Yongji xieyongji@bytedance.com
stable inclusion from linux-4.19.198 commit b5fba782ccd3d12a14f884cd20f255fc9c0eec0c
--------------------------------
[ Upstream commit d00d8da5869a2608e97cfede094dfc5e11462a46 ]
The buf->len might come from an untrusted device. Ensure the value does not exceed the size of the buffer to avoid data corruption or loss.
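A minimal sketch of the clamp, using stand-in field names rather than the driver's structures:

    #include <stddef.h>

    struct port_buf { char *data; size_t size; size_t len; };

    /* Never trust a device-reported length beyond what was allocated. */
    static void set_received_len(struct port_buf *buf, size_t dev_reported_len)
    {
        buf->len = dev_reported_len < buf->size ? dev_reported_len : buf->size;
    }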
Signed-off-by: Xie Yongji xieyongji@bytedance.com Acked-by: Jason Wang jasowang@redhat.com Link: https://lore.kernel.org/r/20210525125622.1203-1-xieyongji@bytedance.com Signed-off-by: Michael S. Tsirkin mst@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/char/virtio_console.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index b353a5e5f8b1c..46188d0a54b8a 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -488,7 +488,7 @@ static struct port_buffer *get_inbuf(struct port *port)
buf = virtqueue_get_buf(port->in_vq, &len); if (buf) { - buf->len = len; + buf->len = min_t(size_t, len, buf->size); buf->offset = 0; port->stats.bytes_received += len; } @@ -1738,7 +1738,7 @@ static void control_work_handler(struct work_struct *work) while ((buf = virtqueue_get_buf(vq, &len))) { spin_unlock(&portdev->c_ivq_lock);
- buf->len = len; + buf->len = min_t(size_t, len, buf->size); buf->offset = 0;
handle_control_message(vq->vdev, portdev, buf);
From: Krzysztof WilczyĆski kw@linux.com
stable inclusion from linux-4.19.198 commit 1b78ad07da08ca15c47af0669d5a4e4f896cbb3c
--------------------------------
[ Upstream commit bdcdaa13ad96f1a530711c29e6d4b8311eff767c ]
"utf16s_to_utf8s(..., buf, PAGE_SIZE)" puts up to PAGE_SIZE bytes into "buf" and returns the number of bytes it actually put there. If it wrote PAGE_SIZE bytes, the newline added by dsm_label_utf16s_to_utf8s() would overrun "buf".
Reduce the size available for utf16s_to_utf8s() to use so there is always space for the newline.
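The general pattern, sketched with illustrative names (BUF_SIZE standing in for PAGE_SIZE): when a conversion routine may fill the whole buffer, hand it one byte less so the character appended afterwards always fits.

    #include <stddef.h>

    #define BUF_SIZE 4096                         /* stand-in for PAGE_SIZE */

    static size_t fill_and_terminate(char *buf,
                                     size_t (*convert)(char *dst, size_t dst_len))
    {
        size_t len = convert(buf, BUF_SIZE - 1);  /* leave room for '\n' */

        buf[len] = '\n';                          /* len <= BUF_SIZE - 1 */
        return len + 1;
    }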
[bhelgaas: reorder patch in series, commit log] Fixes: 6058989bad05 ("PCI: Export ACPI _DSM provided firmware instance number and string name to sysfs") Link: https://lore.kernel.org/r/20210603000112.703037-7-kw@linux.com Reported-by: Joe Perches joe@perches.com Signed-off-by: Krzysztof WilczyĆski kw@linux.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/pci/pci-label.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pci-label.c b/drivers/pci/pci-label.c index a5910f9428576..9fb4ef568f405 100644 --- a/drivers/pci/pci-label.c +++ b/drivers/pci/pci-label.c @@ -162,7 +162,7 @@ static void dsm_label_utf16s_to_utf8s(union acpi_object *obj, char *buf) len = utf16s_to_utf8s((const wchar_t *)obj->buffer.pointer, obj->buffer.length, UTF16_LITTLE_ENDIAN, - buf, PAGE_SIZE); + buf, PAGE_SIZE - 1); buf[len] = '\n'; }
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.198 commit 743f6b973c8ba8a0a5ed15ab11e1d07fa00d5368
--------------------------------
[ Upstream commit dd99e9f98fbf423ff6d365b37a98e8879170f17c ]
Set up the connection to the NFSv4 server in nfs4_alloc_client(), before we've added the struct nfs_client to the net-namespace's nfs_client_list so that a downed server won't cause other mounts to hang in the trunking detection code.
Reported-by: Michael Wakabayashi mwakabayashi@vmware.com Fixes: 5c6e5b60aae4 ("NFS: Fix an Oops in the pNFS files and flexfiles connection setup to the DS") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/nfs/nfs4client.c | 82 +++++++++++++++++++++++---------------------- 1 file changed, 42 insertions(+), 40 deletions(-)
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c index cfc154a59c04f..36282e48cff64 100644 --- a/fs/nfs/nfs4client.c +++ b/fs/nfs/nfs4client.c @@ -191,8 +191,11 @@ void nfs40_shutdown_client(struct nfs_client *clp)
struct nfs_client *nfs4_alloc_client(const struct nfs_client_initdata *cl_init) { - int err; + char buf[INET6_ADDRSTRLEN + 1]; + const char *ip_addr = cl_init->ip_addr; struct nfs_client *clp = nfs_alloc_client(cl_init); + int err; + if (IS_ERR(clp)) return clp;
@@ -216,6 +219,44 @@ struct nfs_client *nfs4_alloc_client(const struct nfs_client_initdata *cl_init) init_waitqueue_head(&clp->cl_lock_waitq); #endif INIT_LIST_HEAD(&clp->pending_cb_stateids); + + if (cl_init->minorversion != 0) + __set_bit(NFS_CS_INFINITE_SLOTS, &clp->cl_flags); + __set_bit(NFS_CS_DISCRTRY, &clp->cl_flags); + __set_bit(NFS_CS_NO_RETRANS_TIMEOUT, &clp->cl_flags); + + /* + * Set up the connection to the server before we add add to the + * global list. + */ + err = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_GSS_KRB5I); + if (err == -EINVAL) + err = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX); + if (err < 0) + goto error; + + /* If no clientaddr= option was specified, find a usable cb address */ + if (ip_addr == NULL) { + struct sockaddr_storage cb_addr; + struct sockaddr *sap = (struct sockaddr *)&cb_addr; + + err = rpc_localaddr(clp->cl_rpcclient, sap, sizeof(cb_addr)); + if (err < 0) + goto error; + err = rpc_ntop(sap, buf, sizeof(buf)); + if (err < 0) + goto error; + ip_addr = (const char *)buf; + } + strlcpy(clp->cl_ipaddr, ip_addr, sizeof(clp->cl_ipaddr)); + + err = nfs_idmap_new(clp); + if (err < 0) { + dprintk("%s: failed to create idmapper. Error = %d\n", + __func__, err); + goto error; + } + __set_bit(NFS_CS_IDMAP, &clp->cl_res_state); return clp;
error: @@ -368,8 +409,6 @@ static int nfs4_init_client_minor_version(struct nfs_client *clp) struct nfs_client *nfs4_init_client(struct nfs_client *clp, const struct nfs_client_initdata *cl_init) { - char buf[INET6_ADDRSTRLEN + 1]; - const char *ip_addr = cl_init->ip_addr; struct nfs_client *old; int error;
@@ -377,43 +416,6 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp, /* the client is initialised already */ return clp;
- /* Check NFS protocol revision and initialize RPC op vector */ - clp->rpc_ops = &nfs_v4_clientops; - - if (clp->cl_minorversion != 0) - __set_bit(NFS_CS_INFINITE_SLOTS, &clp->cl_flags); - __set_bit(NFS_CS_DISCRTRY, &clp->cl_flags); - __set_bit(NFS_CS_NO_RETRANS_TIMEOUT, &clp->cl_flags); - - error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_GSS_KRB5I); - if (error == -EINVAL) - error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX); - if (error < 0) - goto error; - - /* If no clientaddr= option was specified, find a usable cb address */ - if (ip_addr == NULL) { - struct sockaddr_storage cb_addr; - struct sockaddr *sap = (struct sockaddr *)&cb_addr; - - error = rpc_localaddr(clp->cl_rpcclient, sap, sizeof(cb_addr)); - if (error < 0) - goto error; - error = rpc_ntop(sap, buf, sizeof(buf)); - if (error < 0) - goto error; - ip_addr = (const char *)buf; - } - strlcpy(clp->cl_ipaddr, ip_addr, sizeof(clp->cl_ipaddr)); - - error = nfs_idmap_new(clp); - if (error < 0) { - dprintk("%s: failed to create idmapper. Error = %d\n", - __func__, error); - goto error; - } - __set_bit(NFS_CS_IDMAP, &clp->cl_res_state); - error = nfs4_init_client_minor_version(clp); if (error < 0) goto error;
From: Gao Xiang hsiangkao@linux.alibaba.com
stable inclusion from linux-4.19.198 commit 0704f617040c397ae73c1f88f3956787ec5d6529
--------------------------------
[ Upstream commit 1fcb6fcd74a222d9ead54d405842fc763bb86262 ]
When looking into another nfs xfstests report, I found acl and default_acl in nfs3_proc_create() and nfs3_proc_mknod() error paths are possibly leaked. Fix them in advance.
Fixes: 013cdf1088d7 ("nfs: use generic posix ACL infrastructure for v3 Posix ACLs") Cc: Trond Myklebust trond.myklebust@hammerspace.com Cc: Anna Schumaker anna.schumaker@netapp.com Cc: Christoph Hellwig hch@infradead.org Cc: Joseph Qi joseph.qi@linux.alibaba.com Signed-off-by: Gao Xiang hsiangkao@linux.alibaba.com Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/nfs/nfs3proc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c index 37b5e2ba1448e..b73d6352d3fa0 100644 --- a/fs/nfs/nfs3proc.c +++ b/fs/nfs/nfs3proc.c @@ -350,7 +350,7 @@ nfs3_proc_create(struct inode *dir, struct dentry *dentry, struct iattr *sattr, break;
case NFS3_CREATE_UNCHECKED: - goto out; + goto out_release_acls; } nfs_fattr_init(data->res.dir_attr); nfs_fattr_init(data->res.fattr); @@ -717,7 +717,7 @@ nfs3_proc_mknod(struct inode *dir, struct dentry *dentry, struct iattr *sattr, break; default: status = -EINVAL; - goto out; + goto out_release_acls; }
d_alias = nfs3_do_create(dir, dentry, data);
From: Thomas Gleixner tglx@linutronix.de
stable inclusion from linux-4.19.198 commit fecb9b7b06ee288103f0f687b57cb033c96c54ca
--------------------------------
[ Upstream commit 07d6688b22e09be465652cf2da0da6bf86154df6 ]
If the count argument is larger than the xstate size, this will happily copy beyond the end of xstate.
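The boundary condition, reduced to a sketch with assumed parameter names: rejecting only short buffers (count < size) still lets an oversized count copy past the end, so the check must demand an exact match.

    #include <errno.h>
    #include <stddef.h>

    static int validate_xstate_request(size_t pos, size_t count, size_t xstate_size)
    {
        /* count > xstate_size would copy beyond the end of the xstate buffer,
         * count < xstate_size is not a whole standard-format XSAVE image */
        if (pos != 0 || count != xstate_size)
            return -EFAULT;
        return 0;
    }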
Fixes: 91c3dba7dbc1 ("x86/fpu/xstate: Fix PTRACE frames for XSAVES") Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Borislav Petkov bp@suse.de Reviewed-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Link: https://lkml.kernel.org/r/20210623121452.120741557@linutronix.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/kernel/fpu/regset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c index bc02f5144b958..621d249ded0b9 100644 --- a/arch/x86/kernel/fpu/regset.c +++ b/arch/x86/kernel/fpu/regset.c @@ -128,7 +128,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset, /* * A whole standard-format XSAVE buffer is needed: */ - if ((pos != 0) || (count < fpu_user_xstate_size)) + if (pos != 0 || count != fpu_user_xstate_size) return -EFAULT;
xsave = &fpu->state.xsave;
From: "Michael S. Tsirkin" mst@redhat.com
stable inclusion from linux-4.19.198 commit eede0a76a4adb097d0b4e0e0b678c4297b5c1cef
--------------------------------
[ Upstream commit 5a2f966d0f3fa0ef6dada7ab9eda74cacee96b8a ]
It's unsafe to operate a vq from multiple threads. Unfortunately, this is exactly what we do when invoking clean tx poll from rx napi. The same happens with napi-tx even without the opportunistic cleaning from the receive interrupt: that races with processing the vq in start_xmit.
As a fix move everything that deals with the vq to under tx lock.
Fixes: b92f1e6751a6 ("virtio-net: transmit napi") Signed-off-by: Michael S. Tsirkin mst@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/net/virtio_net.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 68eda4739a9ac..6d0bb3ce37e80 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1492,6 +1492,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) struct virtnet_info *vi = sq->vq->vdev->priv; unsigned int index = vq2txq(sq->vq); struct netdev_queue *txq; + int opaque; + bool done;
if (unlikely(is_xdp_raw_buffer_queue(vi, index))) { /* We don't need to enable cb for XDP */ @@ -1501,10 +1503,28 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); + virtqueue_disable_cb(sq->vq); free_old_xmit_skbs(sq, true); + + opaque = virtqueue_enable_cb_prepare(sq->vq); + + done = napi_complete_done(napi, 0); + + if (!done) + virtqueue_disable_cb(sq->vq); + __netif_tx_unlock(txq);
- virtqueue_napi_complete(napi, sq->vq, 0); + if (done) { + if (unlikely(virtqueue_poll(sq->vq, opaque))) { + if (napi_schedule_prep(napi)) { + __netif_tx_lock(txq, raw_smp_processor_id()); + virtqueue_disable_cb(sq->vq); + __netif_tx_unlock(txq); + __napi_schedule(napi); + } + } + }
if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) netif_tx_wake_queue(txq);
From: Trond Myklebust trond.myklebust@hammerspace.com
stable inclusion from linux-4.19.198 commit e3eeeaed0a54f664b025f290a819897af8a1eccc
--------------------------------
[ Upstream commit f46f84931a0aa344678efe412d4b071d84d8a805 ]
After we grab the lock in nfs4_pnfs_ds_connect(), there is no check for whether or not ds->ds_clp has already been initialised, so we can end up adding the same transports multiple times.
Fixes: fc821d59209d ("pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/nfs/pnfs_nfs.c | 52 +++++++++++++++++++++++------------------------ 1 file changed, 26 insertions(+), 26 deletions(-)
diff --git a/fs/nfs/pnfs_nfs.c b/fs/nfs/pnfs_nfs.c index acfb52bc0007d..3f0c2436254ac 100644 --- a/fs/nfs/pnfs_nfs.c +++ b/fs/nfs/pnfs_nfs.c @@ -555,19 +555,16 @@ nfs4_pnfs_ds_add(struct list_head *dsaddrs, gfp_t gfp_flags) } EXPORT_SYMBOL_GPL(nfs4_pnfs_ds_add);
-static void nfs4_wait_ds_connect(struct nfs4_pnfs_ds *ds) +static int nfs4_wait_ds_connect(struct nfs4_pnfs_ds *ds) { might_sleep(); - wait_on_bit(&ds->ds_state, NFS4DS_CONNECTING, - TASK_KILLABLE); + return wait_on_bit(&ds->ds_state, NFS4DS_CONNECTING, TASK_KILLABLE); }
static void nfs4_clear_ds_conn_bit(struct nfs4_pnfs_ds *ds) { smp_mb__before_atomic(); - clear_bit(NFS4DS_CONNECTING, &ds->ds_state); - smp_mb__after_atomic(); - wake_up_bit(&ds->ds_state, NFS4DS_CONNECTING); + clear_and_wake_up_bit(NFS4DS_CONNECTING, &ds->ds_state); }
static struct nfs_client *(*get_v3_ds_connect)( @@ -728,30 +725,33 @@ int nfs4_pnfs_ds_connect(struct nfs_server *mds_srv, struct nfs4_pnfs_ds *ds, { int err;
-again: - err = 0; - if (test_and_set_bit(NFS4DS_CONNECTING, &ds->ds_state) == 0) { - if (version == 3) { - err = _nfs4_pnfs_v3_ds_connect(mds_srv, ds, timeo, - retrans); - } else if (version == 4) { - err = _nfs4_pnfs_v4_ds_connect(mds_srv, ds, timeo, - retrans, minor_version); - } else { - dprintk("%s: unsupported DS version %d\n", __func__, - version); - err = -EPROTONOSUPPORT; - } + do { + err = nfs4_wait_ds_connect(ds); + if (err || ds->ds_clp) + goto out; + if (nfs4_test_deviceid_unavailable(devid)) + return -ENODEV; + } while (test_and_set_bit(NFS4DS_CONNECTING, &ds->ds_state) != 0);
- nfs4_clear_ds_conn_bit(ds); - } else { - nfs4_wait_ds_connect(ds); + if (ds->ds_clp) + goto connect_done;
- /* what was waited on didn't connect AND didn't mark unavail */ - if (!ds->ds_clp && !nfs4_test_deviceid_unavailable(devid)) - goto again; + switch (version) { + case 3: + err = _nfs4_pnfs_v3_ds_connect(mds_srv, ds, timeo, retrans); + break; + case 4: + err = _nfs4_pnfs_v4_ds_connect(mds_srv, ds, timeo, retrans, + minor_version); + break; + default: + dprintk("%s: unsupported DS version %d\n", __func__, version); + err = -EPROTONOSUPPORT; }
+connect_done: + nfs4_clear_ds_conn_bit(ds); +out: /* * At this point the ds->ds_clp should be ready, but it might have * hit an error.
From: Nikolay Aleksandrov nikolay@nvidia.com
stable inclusion from linux-4.19.198 commit 2e70bf39b1a96b40ab13044729613a913b46cddd
--------------------------------
commit 04bef83a3358946bfc98a5ecebd1b0003d83d882 upstream.
When a PIM hello packet is received on a bridge port with multicast snooping enabled, we mark it as a router port automatically, which includes adding that port to the router port list. The multicast lock protects that list, but it is not acquired in the PIM message case, leading to a race condition; take the lock to fix the race.
Cc: stable@vger.kernel.org Fixes: 91b02d3d133b ("bridge: mcast: add router port on PIM hello message") Signed-off-by: Nikolay Aleksandrov nikolay@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/bridge/br_multicast.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c index 6a362da211e1b..3504c3acd38f6 100644 --- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -1791,7 +1791,9 @@ static void br_multicast_pim(struct net_bridge *br, pim_hdr_type(pimhdr) != PIM_TYPE_HELLO) return;
+ spin_lock(&br->multicast_lock); br_multicast_mark_router(br, port); + spin_unlock(&br->multicast_lock); }
static int br_multicast_ipv4_rcv(struct net_bridge *br,
From: Dan Carpenter dan.carpenter@oracle.com
stable inclusion from linux-4.19.198 commit c021be62bbcbf5435ec89713ed542040196f5b89
--------------------------------
commit 80927822e8b6be46f488524cd7d5fe683de97fc4 upstream.
The "retval" variable needs to be signed for the error handling to work.
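The bug class in miniature (a standalone example, not the driver code): comparing an unsigned variable against a negative error code can never be true, so the failure is silently ignored.

    #include <stdio.h>

    static int do_io(void) { return -5; }    /* stand-in for a call that fails */

    int main(void)
    {
        unsigned int bad = do_io();          /* -5 stored as 4294967291 */
        int good = do_io();

        if (bad < 0)                         /* always false for an unsigned type */
            printf("unsigned: error detected\n");
        if (good < 0)                        /* correctly detects the failure */
            printf("signed: error detected\n");
        return 0;
    }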
Link: https://lore.kernel.org/r/YLjMEAFNxOas1mIp@mwanda Fixes: 7e26e3ea0287 ("scsi: scsi_dh_alua: Check for negative result value") Reviewed-by: Martin Wilck mwilck@suse.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/device_handler/scsi_dh_alua.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index eb6a7fc4c2096..e3d9ae8216f1e 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -522,7 +522,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg) struct alua_port_group *tmp_pg; int len, k, off, bufflen = ALUA_RTPG_SIZE; unsigned char *desc, *buff; - unsigned err, retval; + unsigned err; + int retval; unsigned int tpg_desc_tbl_off; unsigned char orig_transition_tmo; unsigned long flags;
From: Javed Hasan jhasan@marvell.com
stable inclusion from linux-4.19.199 commit 4921b1618045ffab71b1050bf0014df3313a2289
--------------------------------
[ Upstream commit b27c4577557045f1ab3cdfeabfc7f3cd24aca1fe ]
Fix array index out of bound exception in fc_rport_prli_resp().
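The shape of the fix, as a hedged sketch with invented names: an FC-4 type received from the peer is validated against the provider table size before it is used as an index.

    #include <stddef.h>

    #define PROV_TABLE_SIZE 8                 /* stand-in for FC_FC4_PROV_SIZE */

    struct provider;

    static struct provider *lookup_provider(struct provider *table[],
                                            unsigned int type_from_wire)
    {
        if (type_from_wire >= PROV_TABLE_SIZE)
            return NULL;                      /* out-of-range type: no provider */
        return table[type_from_wire];
    }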
Link: https://lore.kernel.org/r/20210615165939.24327-1-jhasan@marvell.com Signed-off-by: Javed Hasan jhasan@marvell.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/libfc/fc_rport.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c index 90a748551ede5..d8cf519da92c9 100644 --- a/drivers/scsi/libfc/fc_rport.c +++ b/drivers/scsi/libfc/fc_rport.c @@ -1166,6 +1166,7 @@ static void fc_rport_prli_resp(struct fc_seq *sp, struct fc_frame *fp, resp_code = (pp->spp.spp_flags & FC_SPP_RESP_MASK); FC_RPORT_DBG(rdata, "PRLI spp_flags = 0x%x spp_type 0x%x\n", pp->spp.spp_flags, pp->spp.spp_type); + rdata->spp_type = pp->spp.spp_type; if (resp_code != FC_SPP_RESP_ACK) { if (resp_code == FC_SPP_RESP_CONF) @@ -1186,11 +1187,13 @@ static void fc_rport_prli_resp(struct fc_seq *sp, struct fc_frame *fp, /* * Call prli provider if we should act as a target */ - prov = fc_passive_prov[rdata->spp_type]; - if (prov) { - memset(&temp_spp, 0, sizeof(temp_spp)); - prov->prli(rdata, pp->prli.prli_spp_len, - &pp->spp, &temp_spp); + if (rdata->spp_type < FC_FC4_PROV_SIZE) { + prov = fc_passive_prov[rdata->spp_type]; + if (prov) { + memset(&temp_spp, 0, sizeof(temp_spp)); + prov->prli(rdata, pp->prli.prli_spp_len, + &pp->spp, &temp_spp); + } } /* * Check if the image pair could be established
From: Odin Ugedal odin@uged.al
stable inclusion from linux-4.19.199 commit aa36bd8fc187f513af2d250b82a3c913053ceaf1
--------------------------------
[ Upstream commit 72d0ad7cb5bad265adb2014dbe46c4ccb11afaba ]
The time remaining until expiry of the refresh_timer can be negative. Casting the type to an unsigned 64-bit value will cause integer underflow, making runtime_refresh_within() return false instead of true. These situations are rare, but they do happen.
This does not cause user-facing issues or errors, other than possibly unthrottling cfs_rq's using runtime from the previous period(s), making the CFS bandwidth enforcement less strict in those (special) situations.
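A worked, standalone illustration of the underflow (values are arbitrary): a timer that has already expired yields a negative remaining time, which becomes an enormous value once forced into an unsigned 64-bit type, so the "about to expire" test wrongly fails.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t remaining_ns = -2000;          /* refresh timer expired 2us ago */
        uint64_t min_expire  = 500000;         /* 0.5 ms slack window */

        uint64_t wrapped = (uint64_t)remaining_ns;                 /* ~1.8e19 */
        printf("unsigned compare fires: %d\n", wrapped < min_expire);             /* 0 */
        printf("signed compare fires:   %d\n", remaining_ns < (int64_t)min_expire); /* 1 */
        return 0;
    }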
Signed-off-by: Odin Ugedal odin@uged.al Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Ben Segall bsegall@google.com Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 193cce15f69b0..ff86c37f0d292 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4800,7 +4800,7 @@ static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC; static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire) { struct hrtimer *refresh_timer = &cfs_b->period_timer; - u64 remaining; + s64 remaining;
/* if the call-back is running a quota refresh is already occurring */ if (hrtimer_callback_running(refresh_timer)) @@ -4808,7 +4808,7 @@ static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
/* is a quota refresh about to occur? */ remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer)); - if (remaining < min_expire) + if (remaining < (s64)min_expire) return 1;
return 0;
From: Sunwook Eom speed.eom@samsung.com
stable inclusion from linux-4.19.121 commit d4440a77802d2b3c494eb9711bf26e98331ab702
--------------------------------
commit ad4e80a639fc61d5ecebb03caa5cdbfb91fcebfc upstream.
The error correction data is computed as if data and hash blocks were concatenated. But the hash block number starts from v->hash_start, so we have to calculate the hash block number based on that.
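A small sketch of the mapping (parameter names are illustrative): the FEC data is laid out over data blocks followed by hash blocks, while on disk the hash blocks are numbered from hash_start, so a metadata block must be rebased before the FEC offset is computed.

    static unsigned long long fec_position(unsigned long long block,
                                           unsigned long long hash_start,
                                           unsigned long long data_blocks,
                                           int is_metadata)
    {
        if (is_metadata)
            return block - hash_start + data_blocks;  /* hash blocks follow data */
        return block;                                 /* data blocks map 1:1 */
    }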
Fixes: a739ff3f543af ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org Signed-off-by: Sunwook Eom speed.eom@samsung.com Reviewed-by: Sami Tolvanen samitolvanen@google.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/dm-verity-fec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c index 0f4a2143bf557..bb8327999705e 100644 --- a/drivers/md/dm-verity-fec.c +++ b/drivers/md/dm-verity-fec.c @@ -436,7 +436,7 @@ int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io, fio->level++;
if (type == DM_VERITY_BLOCK_TYPE_METADATA) - block += v->data_blocks; + block = block - v->hash_start + v->data_blocks;
/* * For RS(M, N), the continuous FEC data is divided into blocks of N
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.121 commit 7889936a925b318c787da898953e7c8857a93a30
--------------------------------
commit 31b22120194b5c0d460f59e0c98504de1d3f1f14 upstream.
The dm-writecache reads metadata in the target constructor. However, when we reload the target, there could be another active instance running on the same device. This is the sequence of operations when doing a reload:
1. construct new target 2. suspend old target 3. resume new target 4. destroy old target
Metadata written by the old target between steps 1 and 2 would not be visible to the new target.
Fix the data corruption by loading the metadata in the resume handler.
Also, validate that block_size is at least as large as both devices' logical block size, and only read 1 block of metadata in the target constructor -- there is no need to read the entirety of the metadata now that it is done during resume.
Fixes: 48debafe4f2f ("dm: add writecache target") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/dm-writecache.c | 52 +++++++++++++++++++++++++++----------- 1 file changed, 37 insertions(+), 15 deletions(-)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index a6aa366f65335..776aaf5951e4a 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -896,6 +896,24 @@ static int writecache_alloc_entries(struct dm_writecache *wc) return 0; }
+static int writecache_read_metadata(struct dm_writecache *wc, sector_t n_sectors) +{ + struct dm_io_region region; + struct dm_io_request req; + + region.bdev = wc->ssd_dev->bdev; + region.sector = wc->start_sector; + region.count = n_sectors; + req.bi_op = REQ_OP_READ; + req.bi_op_flags = REQ_SYNC; + req.mem.type = DM_IO_VMA; + req.mem.ptr.vma = (char *)wc->memory_map; + req.client = wc->dm_io; + req.notify.fn = NULL; + + return dm_io(&req, 1, ®ion, NULL); +} + static void writecache_resume(struct dm_target *ti) { struct dm_writecache *wc = ti->private; @@ -906,8 +924,18 @@ static void writecache_resume(struct dm_target *ti)
wc_lock(wc);
- if (WC_MODE_PMEM(wc)) + if (WC_MODE_PMEM(wc)) { persistent_memory_invalidate_cache(wc->memory_map, wc->memory_map_size); + } else { + r = writecache_read_metadata(wc, wc->metadata_sectors); + if (r) { + size_t sb_entries_offset; + writecache_error(wc, r, "unable to read metadata: %d", r); + sb_entries_offset = offsetof(struct wc_memory_superblock, entries); + memset((char *)wc->memory_map + sb_entries_offset, -1, + (wc->metadata_sectors << SECTOR_SHIFT) - sb_entries_offset); + } + }
wc->tree = RB_ROOT; INIT_LIST_HEAD(&wc->lru); @@ -1990,6 +2018,12 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) ti->error = "Invalid block size"; goto bad; } + if (wc->block_size < bdev_logical_block_size(wc->dev->bdev) || + wc->block_size < bdev_logical_block_size(wc->ssd_dev->bdev)) { + r = -EINVAL; + ti->error = "Block size is smaller than device logical block size"; + goto bad; + } wc->block_size_bits = __ffs(wc->block_size);
wc->max_writeback_jobs = MAX_WRITEBACK_JOBS; @@ -2078,8 +2112,6 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) goto bad; } } else { - struct dm_io_region region; - struct dm_io_request req; size_t n_blocks, n_metadata_blocks; uint64_t n_bitmap_bits;
@@ -2136,19 +2168,9 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) goto bad; }
- region.bdev = wc->ssd_dev->bdev; - region.sector = wc->start_sector; - region.count = wc->metadata_sectors; - req.bi_op = REQ_OP_READ; - req.bi_op_flags = REQ_SYNC; - req.mem.type = DM_IO_VMA; - req.mem.ptr.vma = (char *)wc->memory_map; - req.client = wc->dm_io; - req.notify.fn = NULL; - - r = dm_io(&req, 1, ®ion, NULL); + r = writecache_read_metadata(wc, wc->block_size >> SECTOR_SHIFT); if (r) { - ti->error = "Unable to read metadata"; + ti->error = "Unable to read first block of metadata"; goto bad; } }
From: Gabriel Krisman Bertazi krisman@collabora.com
stable inclusion from linux-4.19.121 commit a2cf97eaaa8fb2a7c3875b17a303da878de4d774
--------------------------------
commit 5686dee34dbfe0238c0274e0454fa0174ac0a57a upstream.
When adding devices that don't have a scsi_dh on a BIO-based multipath, I was able to consistently hit the warning below and lock up the system.
The problem is that __map_bio reads the flag before it is potentially modified by choose_pgpath, and ends up using the stale value.
The WARN_ON below is not trivially linked to the issue. It goes like this: The activate_path delayed_work is not initialized for non-scsi_dh devices, but we always set MPATHF_QUEUE_IO, asking for initialization. That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath. Nevertheless, only for BIO-based mpath, we cache the flag before calling choose_pgpath, and use the older version when deciding if we should initialize the path. Therefore, we end up trying to initialize the paths, and calling the non-initialized activate_path work.
[ 82.437100] ------------[ cut here ]------------ [ 82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624 __queue_delayed_work+0x71/0x90 [ 82.438436] Modules linked in: [ 82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339 [ 82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90 [ 82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9 94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74 a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07 [ 82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007 [ 82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000 [ 82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200 [ 82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970 [ 82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930 [ 82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008 [ 82.445427] FS: 00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000 [ 82.446040] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0 [ 82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 82.448012] Call Trace: [ 82.448164] queue_delayed_work_on+0x6d/0x80 [ 82.448472] __pg_init_all_paths+0x7b/0xf0 [ 82.448714] pg_init_all_paths+0x26/0x40 [ 82.448980] __multipath_map_bio.isra.0+0x84/0x210 [ 82.449267] __map_bio+0x3c/0x1f0 [ 82.449468] __split_and_process_non_flush+0x14a/0x1b0 [ 82.449775] __split_and_process_bio+0xde/0x340 [ 82.450045] ? dm_get_live_table+0x5/0xb0 [ 82.450278] dm_process_bio+0x98/0x290 [ 82.450518] dm_make_request+0x54/0x120 [ 82.450778] generic_make_request+0xd2/0x3e0 [ 82.451038] ? submit_bio+0x3c/0x150 [ 82.451278] submit_bio+0x3c/0x150 [ 82.451492] mpage_readpages+0x129/0x160 [ 82.451756] ? bdev_evict_inode+0x1d0/0x1d0 [ 82.452033] read_pages+0x72/0x170 [ 82.452260] __do_page_cache_readahead+0x1ba/0x1d0 [ 82.452624] force_page_cache_readahead+0x96/0x110 [ 82.452903] generic_file_read_iter+0x84f/0xae0 [ 82.453192] ? __seccomp_filter+0x7c/0x670 [ 82.453547] new_sync_read+0x10e/0x190 [ 82.453883] vfs_read+0x9d/0x150 [ 82.454172] ksys_read+0x65/0xe0 [ 82.454466] do_syscall_64+0x4e/0x210 [ 82.454828] entry_SYSCALL_64_after_hwframe+0x49/0xbe [...] [ 82.462501] ---[ end trace bb39975e9cf45daa ]---
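Reduced to a standalone illustration (names invented, not the dm-mpath ones), the bug is a flag snapshot taken before the call that may clear it:

    #include <stdbool.h>
    #include <stdio.h>

    static bool queue_io_flag = true;            /* set at map construction */

    static void choose_path(void)
    {
        queue_io_flag = false;                   /* path chosen: no init needed */
    }

    int main(void)
    {
        bool cached = queue_io_flag;             /* BUG: snapshot taken too early */

        choose_path();
        printf("stale view: %d, current view: %d\n", cached, queue_io_flag);
        /* the fix re-reads the flag only after choose_path() has run */
        return 0;
    }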
Cc: stable@vger.kernel.org Signed-off-by: Gabriel Krisman Bertazi krisman@collabora.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/dm-mpath.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index def4f6ec290be..207ca0ad0b59d 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -586,10 +586,12 @@ static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
/* Do we need to select a new pgpath? */ pgpath = READ_ONCE(m->current_pgpath); - queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags); - if (!pgpath || !queue_io) + if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags)) pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
+ /* MPATHF_QUEUE_IO might have been cleared by choose_pgpath. */ + queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags); + if ((pgpath && queue_io) || (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) { /* Queue for the daemon to resubmit */
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.199 commit fb55695f41f6471ae8dff9e0a175fcec24fd3180
--------------------------------
commit 054bee16163df023e2589db09fd27d81f7ad9e72 upstream.
LVM doesn't like it when the target returns different values from what was set in the constructor. Fix dm-writecache so that the returned table values are exactly the same as requested values.
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/dm-writecache.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index 776aaf5951e4a..c7ea0085fb474 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -153,6 +153,7 @@ struct dm_writecache { bool overwrote_committed:1; bool memory_vmapped:1;
+ bool start_sector_set:1; bool high_wm_percent_set:1; bool low_wm_percent_set:1; bool max_writeback_jobs_set:1; @@ -161,6 +162,10 @@ struct dm_writecache { bool writeback_fua_set:1; bool flush_on_suspend:1;
+ unsigned high_wm_percent_value; + unsigned low_wm_percent_value; + unsigned autocommit_time_value; + unsigned writeback_all; struct workqueue_struct *writeback_wq; struct work_struct writeback_work; @@ -2045,6 +2050,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) if (sscanf(string, "%llu%c", &start_sector, &dummy) != 1) goto invalid_optional; wc->start_sector = start_sector; + wc->start_sector_set = true; if (wc->start_sector != start_sector || wc->start_sector >= wc->memory_map_size >> SECTOR_SHIFT) goto invalid_optional; @@ -2054,6 +2060,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) goto invalid_optional; if (high_wm_percent < 0 || high_wm_percent > 100) goto invalid_optional; + wc->high_wm_percent_value = high_wm_percent; wc->high_wm_percent_set = true; } else if (!strcasecmp(string, "low_watermark") && opt_params >= 1) { string = dm_shift_arg(&as), opt_params--; @@ -2061,6 +2068,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) goto invalid_optional; if (low_wm_percent < 0 || low_wm_percent > 100) goto invalid_optional; + wc->low_wm_percent_value = low_wm_percent; wc->low_wm_percent_set = true; } else if (!strcasecmp(string, "writeback_jobs") && opt_params >= 1) { string = dm_shift_arg(&as), opt_params--; @@ -2080,6 +2088,7 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv) if (autocommit_msecs > 3600000) goto invalid_optional; wc->autocommit_jiffies = msecs_to_jiffies(autocommit_msecs); + wc->autocommit_time_value = autocommit_msecs; wc->autocommit_time_set = true; } else if (!strcasecmp(string, "fua")) { if (WC_MODE_PMEM(wc)) { @@ -2275,7 +2284,6 @@ static void writecache_status(struct dm_target *ti, status_type_t type, struct dm_writecache *wc = ti->private; unsigned extra_args; unsigned sz = 0; - uint64_t x;
switch (type) { case STATUSTYPE_INFO: @@ -2287,7 +2295,7 @@ static void writecache_status(struct dm_target *ti, status_type_t type, DMEMIT("%c %s %s %u ", WC_MODE_PMEM(wc) ? 'p' : 's', wc->dev->name, wc->ssd_dev->name, wc->block_size); extra_args = 0; - if (wc->start_sector) + if (wc->start_sector_set) extra_args += 2; if (wc->high_wm_percent_set) extra_args += 2; @@ -2303,26 +2311,18 @@ static void writecache_status(struct dm_target *ti, status_type_t type, extra_args++;
DMEMIT("%u", extra_args); - if (wc->start_sector) + if (wc->start_sector_set) DMEMIT(" start_sector %llu", (unsigned long long)wc->start_sector); - if (wc->high_wm_percent_set) { - x = (uint64_t)wc->freelist_high_watermark * 100; - x += wc->n_blocks / 2; - do_div(x, (size_t)wc->n_blocks); - DMEMIT(" high_watermark %u", 100 - (unsigned)x); - } - if (wc->low_wm_percent_set) { - x = (uint64_t)wc->freelist_low_watermark * 100; - x += wc->n_blocks / 2; - do_div(x, (size_t)wc->n_blocks); - DMEMIT(" low_watermark %u", 100 - (unsigned)x); - } + if (wc->high_wm_percent_set) + DMEMIT(" high_watermark %u", wc->high_wm_percent_value); + if (wc->low_wm_percent_set) + DMEMIT(" low_watermark %u", wc->low_wm_percent_value); if (wc->max_writeback_jobs_set) DMEMIT(" writeback_jobs %u", wc->max_writeback_jobs); if (wc->autocommit_blocks_set) DMEMIT(" autocommit_blocks %u", wc->autocommit_blocks); if (wc->autocommit_time_set) - DMEMIT(" autocommit_time %u", jiffies_to_msecs(wc->autocommit_jiffies)); + DMEMIT(" autocommit_time %u", wc->autocommit_time_value); if (wc->writeback_fua_set) DMEMIT(" %sfua", wc->writeback_fua ? "" : "no"); break;
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.199 commit 797b950e8aaec3fc30312bb9a44d702c6cc9d25e
--------------------------------
commit 4134455f2aafdfeab50cabb4cccb35e916034b93 upstream.
Do not attempt to write any data beyond the end of the underlying data device while shrinking it.
The DM writecache device must be suspended when the underlying data device is shrunk.
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/dm-writecache.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index c7ea0085fb474..6c6c23634bcfa 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -142,6 +142,7 @@ struct dm_writecache { size_t metadata_sectors; size_t n_blocks; uint64_t seq_count; + sector_t data_device_sectors; void *block_start; struct wc_entry *entries; unsigned block_size; @@ -929,6 +930,8 @@ static void writecache_resume(struct dm_target *ti)
wc_lock(wc);
+ wc->data_device_sectors = i_size_read(wc->dev->bdev->bd_inode) >> SECTOR_SHIFT; + if (WC_MODE_PMEM(wc)) { persistent_memory_invalidate_cache(wc->memory_map, wc->memory_map_size); } else { @@ -1499,6 +1502,10 @@ static bool wc_add_block(struct writeback_struct *wb, struct wc_entry *e, gfp_t void *address = memory_data(wc, e);
persistent_memory_flush_cache(address, block_size); + + if (unlikely(bio_end_sector(&wb->bio) >= wc->data_device_sectors)) + return true; + return bio_add_page(&wb->bio, persistent_memory_page(address), block_size, persistent_memory_page_offset(address)) != 0; } @@ -1571,6 +1578,9 @@ static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeba if (writecache_has_error(wc)) { bio->bi_status = BLK_STS_IOERR; bio_endio(&wb->bio); + } else if (unlikely(!bio_sectors(&wb->bio))) { + bio->bi_status = BLK_STS_OK; + bio_endio(&wb->bio); } else { submit_bio(&wb->bio); } @@ -1614,6 +1624,14 @@ static void __writecache_writeback_ssd(struct dm_writecache *wc, struct writebac e = f; }
+ if (unlikely(to.sector + to.count > wc->data_device_sectors)) { + if (to.sector >= wc->data_device_sectors) { + writecache_copy_endio(0, 0, c); + continue; + } + from.count = to.count = wc->data_device_sectors - to.sector; + } + dm_kcopyd_copy(wc->dm_kcopyd, &from, 1, &to, 0, writecache_copy_endio, c);
__writeback_throttle(wc, wbl);
From: Vasily Averin vvs@virtuozzo.com
stable inclusion from linux-4.19.199 commit 1b72d43e6b9f5dee6969df1570680652cf3ecbe4
--------------------------------
commit c23a9fd209bc6f8c1fa6ee303fdf037d784a1627 upstream.
The two patches listed below removed the ctnetlink_dump_helpinfo call from under rcu_read_lock. Now its rcu_dereference generates the following warning:
============================= WARNING: suspicious RCU usage 5.13.0+ #5 Not tainted ----------------------------- net/netfilter/nf_conntrack_netlink.c:221 suspicious rcu_dereference_check() usage!
other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 stack backtrace: CPU: 1 PID: 2251 Comm: conntrack Not tainted 5.13.0+ #5 Call Trace: dump_stack+0x7f/0xa1 ctnetlink_dump_helpinfo+0x134/0x150 [nf_conntrack_netlink] ctnetlink_fill_info+0x2c2/0x390 [nf_conntrack_netlink] ctnetlink_dump_table+0x13f/0x370 [nf_conntrack_netlink] netlink_dump+0x10c/0x370 __netlink_dump_start+0x1a7/0x260 ctnetlink_get_conntrack+0x1e5/0x250 [nf_conntrack_netlink] nfnetlink_rcv_msg+0x613/0x993 [nfnetlink] netlink_rcv_skb+0x50/0x100 nfnetlink_rcv+0x55/0x120 [nfnetlink] netlink_unicast+0x181/0x260 netlink_sendmsg+0x23f/0x460 sock_sendmsg+0x5b/0x60 __sys_sendto+0xf1/0x160 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x36/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 49ca022bccc5 ("netfilter: ctnetlink: don't dump ct extensions of unconfirmed conntracks") Fixes: 0b35f6031a00 ("netfilter: Remove duplicated rcu_read_lock.") Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/netfilter/nf_conntrack_netlink.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 15c9fbcd32f29..2850a638401d5 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -213,6 +213,7 @@ static int ctnetlink_dump_helpinfo(struct sk_buff *skb, if (!help) return 0;
+ rcu_read_lock(); helper = rcu_dereference(help->helper); if (!helper) goto out; @@ -228,9 +229,11 @@ static int ctnetlink_dump_helpinfo(struct sk_buff *skb,
nla_nest_end(skb, nest_helper); out: + rcu_read_unlock(); return 0;
nla_put_failure: + rcu_read_unlock(); return -1; }
From: Wolfgang Bumiller w.bumiller@proxmox.com
stable inclusion from linux-4.19.199 commit a2281d2f761f990cb004398506fc5a0e720817e6
--------------------------------
commit a019abd8022061b917da767cd1a66ed823724eab upstream.
Since commit 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.") bridges with `vlan_filtering 1` and only 1 auto-port don't set IFF_PROMISC for unicast-filtering-capable ports.
Normally on port changes `br_manage_promisc` is called to update the promisc flags and unicast filters if necessary, but it cannot distinguish between *new* ports and ones losing their promisc flag, and new ports end up not receiving the MAC address list.
Fix this by calling `br_fdb_sync_static` in `br_add_if` after the port promisc flags are updated and the unicast filter was supposed to have been filled.
Fixes: 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.") Signed-off-by: Wolfgang Bumiller w.bumiller@proxmox.com Acked-by: Nikolay Aleksandrov nikolay@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/bridge/br_if.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index ed2b6002ae537..5aa508a08a691 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -564,7 +564,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev, struct net_bridge_port *p; int err = 0; unsigned br_hr, dev_hr; - bool changed_addr; + bool changed_addr, fdb_synced = false;
/* Don't allow bridging non-ethernet like devices, or DSA-enabled * master network devices since the bridge layer rx_handler prevents @@ -640,6 +640,19 @@ int br_add_if(struct net_bridge *br, struct net_device *dev, list_add_rcu(&p->list, &br->port_list);
nbp_update_port_count(br); + if (!br_promisc_port(p) && (p->dev->priv_flags & IFF_UNICAST_FLT)) { + /* When updating the port count we also update all ports' + * promiscuous mode. + * A port leaving promiscuous mode normally gets the bridge's + * fdb synced to the unicast filter (if supported), however, + * `br_port_clear_promisc` does not distinguish between + * non-promiscuous ports and *new* ports, so we need to + * sync explicitly here. + */ + fdb_synced = br_fdb_sync_static(br, p) == 0; + if (!fdb_synced) + netdev_err(dev, "failed to sync bridge static fdb addresses to this port\n"); + }
netdev_update_features(br->dev);
@@ -680,6 +693,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev, return 0;
err7: + if (fdb_synced) + br_fdb_unsync_static(br, p); list_del_rcu(&p->list); br_fdb_delete_by_port(br, p, 0, 1); nbp_update_port_count(br);
From: Alexander Ovechkin ovov@yandex-team.ru
stable inclusion from linux-4.19.199 commit d8142fffdb3e8c32b282a1a6c48b763eb92a8acf
--------------------------------
commit 43b90bfad34bcb81b8a5bc7dc650800f4be1787e upstream.
commit e05a90ec9e16 ("net: reflect mark on tcp syn ack packets") fixed IPv4 only.
This part is for the IPv6 side.
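A standalone illustration of the mark selection (plain variables standing in for skb->mark and sk->sk_mark); the kernel uses the GNU "?:" shorthand, which means "use the left value unless it is zero":

    #include <stdio.h>

    int main(void)
    {
        unsigned int skb_mark = 0x17;            /* fwmark carried by the request */
        unsigned int sk_mark  = 0x0;             /* listener socket's own mark */

        unsigned int used = skb_mark ? skb_mark : sk_mark;   /* skb_mark ?: sk_mark */
        printf("SYNACK goes out with mark 0x%x\n", used);
        return 0;
    }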
Fixes: e05a90ec9e16 ("net: reflect mark on tcp syn ack packets") Signed-off-by: Alexander Ovechkin ovov@yandex-team.ru Acked-by: Dmitry Yakunin zeil@yandex-team.ru Reviewed-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv6/tcp_ipv6.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 8d822df83b082..2e7a28f327a86 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -503,7 +503,8 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, opt = ireq->ipv6_opt; if (!opt) opt = rcu_dereference(np->opt); - err = ip6_xmit(sk, skb, fl6, sk->sk_mark, opt, np->tclass); + err = ip6_xmit(sk, skb, fl6, skb->mark ? : sk->sk_mark, opt, + np->tclass); rcu_read_unlock(); err = net_xmit_eval(err); }
From: Taehee Yoo ap420073@gmail.com
stable inclusion from linux-4.19.199 commit 8bb1589c89e61e3b182dd546f1021928ebb5c2a6
--------------------------------
commit 67a9c94317402b826fc3db32afc8f39336803d97 upstream.
skb_tunnel_info() returns a pointer to lwtstate->data as the ip_tunnel_info type without validation. lwtstate->data can have various types, such as mpls_iptunnel_encap, and these are not compatible, so skb_tunnel_info() should validate the type before returning that pointer.
Splat looks like: BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan] Read of size 2 at addr ffff888106ec2698 by task ping/811
CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195 Call Trace: dump_stack_lvl+0x56/0x7b print_address_description.constprop.8.cold.13+0x13/0x2ee ? vxlan_get_route+0x418/0x4b0 [vxlan] ? vxlan_get_route+0x418/0x4b0 [vxlan] kasan_report.cold.14+0x83/0xdf ? vxlan_get_route+0x418/0x4b0 [vxlan] vxlan_get_route+0x418/0x4b0 [vxlan] [ ... ] vxlan_xmit_one+0x148b/0x32b0 [vxlan] [ ... ] vxlan_xmit+0x25c5/0x4780 [vxlan] [ ... ] dev_hard_start_xmit+0x1ae/0x6e0 __dev_queue_xmit+0x1f39/0x31a0 [ ... ] neigh_xmit+0x2f9/0x940 mpls_xmit+0x911/0x1600 [mpls_iptunnel] lwtunnel_xmit+0x18f/0x450 ip_finish_output2+0x867/0x2040 [ ... ]
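A self-contained sketch of the pattern with stand-in types (not the real lwtunnel structures): the encap type is checked before the opaque data is reinterpreted as tunnel info.

    #include <stddef.h>

    enum encap_type { ENCAP_IP, ENCAP_IP6, ENCAP_MPLS };

    struct lwt_state   { enum encap_type type; void *data; };
    struct ip_tun_info { int placeholder; };

    static struct ip_tun_info *tun_info_or_null(const struct lwt_state *lwt)
    {
        if (!lwt)
            return NULL;
        if (lwt->type != ENCAP_IP && lwt->type != ENCAP_IP6)
            return NULL;           /* e.g. MPLS state: layout is incompatible */
        return (struct ip_tun_info *)lwt->data;
    }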
Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Signed-off-by: Taehee Yoo ap420073@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/net/dst_metadata.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h index 56cb3c38569a7..14efa0ded75dd 100644 --- a/include/net/dst_metadata.h +++ b/include/net/dst_metadata.h @@ -45,7 +45,9 @@ skb_tunnel_info(const struct sk_buff *skb) return &md_dst->u.tun_info;
dst = skb_dst(skb); - if (dst && dst->lwtstate) + if (dst && dst->lwtstate && + (dst->lwtstate->type == LWTUNNEL_ENCAP_IP || + dst->lwtstate->type == LWTUNNEL_ENCAP_IP6)) return lwt_tun_info(dst->lwtstate);
return NULL;
From: Jason Ekstrand jason@jlekstrand.net
stable inclusion from linux-4.19.199 commit e0355a0ad31a1d677b2a4514206de4902bd550e8
--------------------------------
commit ffe000217c5068c5da07ccb1c0f8cce7ad767435 upstream.
Each add_fence() call does a dma_fence_get() on the relevant fence. In the error path, we weren't calling dma_fence_put(), so all those fences got leaked. Also, in the krealloc_array failure case, we weren't freeing the fences array. Instead, ensure that i and fences are always zero-initialized, then dma_fence_put() all the fences and kfree(fences) on every error path.
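The structure of the fix, as a hedged standalone analogue (a toy refcount instead of dma_fence, invented names): i and the array start out zeroed, so a single error label can drop every reference taken so far and free the array regardless of where the failure happened.

    #include <stdlib.h>

    struct ref { int count; };
    static void ref_get(struct ref *r) { r->count++; }
    static void ref_put(struct ref *r) { r->count--; }

    /* stand-in for sync_file_set_fence(); pretend it can fail */
    static int install_fences(struct ref **fences, int n) { return n ? 0 : -1; }

    static int merge(struct ref **in, int n, struct ref ***out, int *out_n)
    {
        struct ref **fences = NULL;       /* zero-initialized like the fix */
        int i = 0;

        fences = calloc(n, sizeof(*fences));
        if (!fences)
            goto err;
        for (i = 0; i < n; i++) {
            ref_get(in[i]);               /* one reference per merged fence */
            fences[i] = in[i];
        }
        if (install_fences(fences, i) < 0)
            goto err;
        *out = fences;
        *out_n = i;
        return 0;
    err:
        while (i)
            ref_put(fences[--i]);         /* drop every reference taken so far */
        free(fences);                     /* free(NULL) is a no-op */
        return -1;
    }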
Signed-off-by: Jason Ekstrand jason@jlekstrand.net Reviewed-by: Christian König christian.koenig@amd.com Fixes: a02b9dc90d84 ("dma-buf/sync_file: refactor fence storage in struct sync_file") Cc: Gustavo Padovan gustavo.padovan@collabora.co.uk Cc: Christian König christian.koenig@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20210624174732.1754546-1-jason... Signed-off-by: Christian König christian.koenig@amd.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/dma-buf/sync_file.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c index 91d620994123e..b0d2563cde5d2 100644 --- a/drivers/dma-buf/sync_file.c +++ b/drivers/dma-buf/sync_file.c @@ -220,8 +220,8 @@ static struct sync_file *sync_file_merge(const char *name, struct sync_file *a, struct sync_file *b) { struct sync_file *sync_file; - struct dma_fence **fences, **nfences, **a_fences, **b_fences; - int i, i_a, i_b, num_fences, a_num_fences, b_num_fences; + struct dma_fence **fences = NULL, **nfences, **a_fences, **b_fences; + int i = 0, i_a, i_b, num_fences, a_num_fences, b_num_fences;
sync_file = sync_file_alloc(); if (!sync_file) @@ -245,7 +245,7 @@ static struct sync_file *sync_file_merge(const char *name, struct sync_file *a, * If a sync_file can only be created with sync_file_merge * and sync_file_create, this is a reasonable assumption. */ - for (i = i_a = i_b = 0; i_a < a_num_fences && i_b < b_num_fences; ) { + for (i_a = i_b = 0; i_a < a_num_fences && i_b < b_num_fences; ) { struct dma_fence *pt_a = a_fences[i_a]; struct dma_fence *pt_b = b_fences[i_b];
@@ -286,15 +286,16 @@ static struct sync_file *sync_file_merge(const char *name, struct sync_file *a, fences = nfences; }
- if (sync_file_set_fence(sync_file, fences, i) < 0) { - kfree(fences); + if (sync_file_set_fence(sync_file, fences, i) < 0) goto err; - }
strlcpy(sync_file->user_name, name, sizeof(sync_file->user_name)); return sync_file;
err: + while (i) + dma_fence_put(fences[--i]); + kfree(fences); fput(sync_file->file); return NULL;
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.199 commit d694db2f90ffc44b8fb9f9ae2d38edaa4aa4f444
--------------------------------
commit 561022acb1ce62e50f7a8258687a21b84282a4cb upstream.
While tp->mtu_info is read while the socket is owned, the writes happen from err handlers (tcp_v[46]_mtu_reduced), which only own the socket spinlock.
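A simplified, standalone stand-in for the annotation (the kernel's READ_ONCE()/WRITE_ONCE() do more than a volatile cast, so this only illustrates the intent): both sides of the lockless access are marked so the compiler cannot tear, fuse, or re-load it.

    #include <stdint.h>

    #define WRITE_ONCE_U32(x, val)  (*(volatile uint32_t *)&(x) = (val))
    #define READ_ONCE_U32(x)        (*(volatile uint32_t *)&(x))

    static uint32_t mtu_info;

    void icmp_err_handler(uint32_t new_mtu)   /* holds only the socket spinlock */
    {
        WRITE_ONCE_U32(mtu_info, new_mtu);
    }

    uint32_t mtu_reduced_path(void)           /* runs with the socket owned */
    {
        return READ_ONCE_U32(mtu_info);
    }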
Fixes: 563d34d05786 ("tcp: dont drop MTU reduction indications") Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/tcp_ipv4.c | 4 ++-- net/ipv6/tcp_ipv6.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index b6e75deb286ad..2eb90ed6c0134 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -348,7 +348,7 @@ void tcp_v4_mtu_reduced(struct sock *sk)
if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) return; - mtu = tcp_sk(sk)->mtu_info; + mtu = READ_ONCE(tcp_sk(sk)->mtu_info); dst = inet_csk_update_pmtu(sk, mtu); if (!dst) return; @@ -516,7 +516,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info) if (sk->sk_state == TCP_LISTEN) goto out;
- tp->mtu_info = info; + WRITE_ONCE(tp->mtu_info, info); if (!sock_owned_by_user(sk)) { tcp_v4_mtu_reduced(sk); } else { diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 2e7a28f327a86..5121012cc7413 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -340,7 +340,7 @@ static void tcp_v6_mtu_reduced(struct sock *sk) if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) return;
- dst = inet6_csk_update_pmtu(sk, tcp_sk(sk)->mtu_info); + dst = inet6_csk_update_pmtu(sk, READ_ONCE(tcp_sk(sk)->mtu_info)); if (!dst) return;
@@ -429,7 +429,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (!ip6_sk_accept_pmtu(sk)) goto out;
- tp->mtu_info = ntohl(info); + WRITE_ONCE(tp->mtu_info, ntohl(info)); if (!sock_owned_by_user(sk)) tcp_v6_mtu_reduced(sk); else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.199 commit 4818f187041761e68bfabcce40601608fcf2591a
--------------------------------
commit c7bb4b89033b764eb07db4e060548a6311d801ee upstream.
While the TCP stack scales reasonably well, there is still one part that can be used to DDoS it.
IPv6 Packet Too Big messages have to look up/insert a new route and, if abused by attackers, can easily put hosts under high stress, with many cpus contending on a spinlock while one is stuck in fib6_run_gc().
ip6_protocol_deliver_rcu() icmpv6_rcv() icmpv6_notify() tcp_v6_err() tcp_v6_mtu_reduced() inet6_csk_update_pmtu() ip6_rt_update_pmtu() __ip6_rt_update_pmtu() ip6_rt_cache_alloc() ip6_dst_alloc() dst_alloc() ip6_dst_gc() fib6_run_gc() spin_lock_bh() ...
Some of our servers have been hit by malicious ICMPv6 packets trying to _increase_ the MTU/MSS of TCP flows.
We believe these ICMPv6 packets are a result of a bug in one ISP stack, since they were blindly sent back for _every_ (small) packet sent to them.
These packets are for one TCP flow: 09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
The TCP stack can filter some silly requests:
1) An MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
2) tcp_v6_mtu_reduced() can drop requests trying to increase the current MSS.
This tests happen before the IPv6 routing stack is entered, thus removing the potential contention and route exhaustion.
Note that IPv6 stack was performing these checks, but too late (ie : after the route has been added, and after the potential garbage collect war)
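Condensed sketch of the two early checks, lifted from the diff below (kernel-context fragment, not a complete function):

  /* In tcp_v6_err(), before touching the routing stack: */
  u32 mtu = ntohl(info);
  if (mtu < IPV6_MIN_MTU)
          goto out;                        /* filter silly/abusive MTUs early */

  /* In tcp_v6_mtu_reduced(): */
  mtu = READ_ONCE(tcp_sk(sk)->mtu_info);
  if (tcp_mtu_to_mss(sk, mtu) >= tcp_sk(sk)->mss_cache)
          return;                          /* never let ICMP increase our MSS */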
v2: fix typo caught by Martin, thanks!
v3: exports tcp_mtu_to_mss(), caught by David, thanks!
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet edumazet@google.com Reviewed-by: Maciej ƻenczykowski maze@google.com Cc: Martin KaFai Lau kafai@fb.com Acked-by: Martin KaFai Lau kafai@fb.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/tcp_output.c | 1 + net/ipv6/tcp_ipv6.c | 19 +++++++++++++++++-- 2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 9b74041e8dd10..d55ee43cd1b8f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1472,6 +1472,7 @@ int tcp_mtu_to_mss(struct sock *sk, int pmtu) return __tcp_mtu_to_mss(sk, pmtu) - (tcp_sk(sk)->tcp_header_len - sizeof(struct tcphdr)); } +EXPORT_SYMBOL(tcp_mtu_to_mss);
/* Inverse of above */ int tcp_mss_to_mtu(struct sock *sk, int mss) diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 5121012cc7413..e8d206725cb75 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -336,11 +336,20 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, static void tcp_v6_mtu_reduced(struct sock *sk) { struct dst_entry *dst; + u32 mtu;
if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) return;
- dst = inet6_csk_update_pmtu(sk, READ_ONCE(tcp_sk(sk)->mtu_info)); + mtu = READ_ONCE(tcp_sk(sk)->mtu_info); + + /* Drop requests trying to increase our current mss. + * Check done in __ip6_rt_update_pmtu() is too late. + */ + if (tcp_mtu_to_mss(sk, mtu) >= tcp_sk(sk)->mss_cache) + return; + + dst = inet6_csk_update_pmtu(sk, mtu); if (!dst) return;
@@ -419,6 +428,8 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, }
if (type == ICMPV6_PKT_TOOBIG) { + u32 mtu = ntohl(info); + /* We are not interested in TCP_LISTEN and open_requests * (SYN-ACKs send out by Linux are always <576bytes so * they should go through unfragmented). @@ -429,7 +440,11 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (!ip6_sk_accept_pmtu(sk)) goto out;
- WRITE_ONCE(tp->mtu_info, ntohl(info)); + if (mtu < IPV6_MIN_MTU) + goto out; + + WRITE_ONCE(tp->mtu_info, mtu); + if (!sock_owned_by_user(sk)) tcp_v6_mtu_reduced(sk); else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.199 commit dd1f607cad1f3be1880193521bee2bbd48354a23
--------------------------------
commit 18a419bad63b7f68a1979e28459782518e7b6bbe upstream.
Accesses to unix_sk(sk)->gso_size are lockless. Add READ_ONCE()/WRITE_ONCE() around them.
BUG: KCSAN: data-race in udp_lib_setsockopt / udpv6_sendmsg
write to 0xffff88812d78f47c of 2 bytes by task 10849 on cpu 1:
 udp_lib_setsockopt+0x3b3/0x710 net/ipv4/udp.c:2696
 udpv6_setsockopt+0x63/0x90 net/ipv6/udp.c:1630
 sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3265
 __sys_setsockopt+0x18f/0x200 net/socket.c:2104
 __do_sys_setsockopt net/socket.c:2115 [inline]
 __se_sys_setsockopt net/socket.c:2112 [inline]
 __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff88812d78f47c of 2 bytes by task 10852 on cpu 0:
 udpv6_sendmsg+0x161/0x16b0 net/ipv6/udp.c:1299
 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2337
 ___sys_sendmsg net/socket.c:2391 [inline]
 __sys_sendmmsg+0x315/0x4b0 net/socket.c:2477
 __do_sys_sendmmsg net/socket.c:2506 [inline]
 __se_sys_sendmmsg net/socket.c:2503 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2503
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000 -> 0x0005
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10852 Comm: syz-executor.0 Not tainted 5.13.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Willem de Bruijn willemb@google.com Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/udp.c | 6 +++--- net/ipv6/udp.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 7c68b9e213ef3..8200a2f542b06 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -994,7 +994,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) }
ipcm_init_sk(&ipc, inet); - ipc.gso_size = up->gso_size; + ipc.gso_size = READ_ONCE(up->gso_size);
if (msg->msg_controllen) { err = udp_cmsg_send(sk, msg, &ipc.gso_size); @@ -2502,7 +2502,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname, case UDP_SEGMENT: if (val < 0 || val > USHRT_MAX) return -EINVAL; - up->gso_size = val; + WRITE_ONCE(up->gso_size, val); break;
/* @@ -2596,7 +2596,7 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname, break;
case UDP_SEGMENT: - val = up->gso_size; + val = READ_ONCE(up->gso_size); break;
/* The following two cannot be changed on UDP sockets, the return is diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 1978a6d724c9f..b0553c0371e5a 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1175,7 +1175,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
ipcm6_init(&ipc6); - ipc6.gso_size = up->gso_size; + ipc6.gso_size = READ_ONCE(up->gso_size); ipc6.sockc.tsflags = sk->sk_tsflags;
/* destination address check */
From: Hangbin Liu liuhangbin@gmail.com
stable inclusion from linux-4.19.199 commit be680f4bf72231e75383a097981016c2b1ab0197
--------------------------------
commit 9992a078b1771da354ac1f9737e1e639b687caa2 upstream.
Commit 28e104d00281 ("net: ip_tunnel: fix mtu calculation") removed the dev->hard_header_len subtraction when calculating the MTU for tunnel devices, as there is an overhead for devices that have header_ops.
But there are ETHER tunnel devices, like gre_tap or erspan, which don't have header_ops but set dev->hard_header_len during setup. This means packets greater than (MTU - ETH_HLEN) cannot be transmitted. Fix it by subtracting the ETHER tunnel devices' dev->hard_header_len in the MTU calculation.
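A minimal stand-alone sketch of the MTU bookkeeping described above (constants and field names mirror the patch; this is an illustration, not the kernel function itself):

  #define ARPHRD_ETHER  1    /* device type of Ethernet-like tunnels */
  #define IPV4_MIN_MTU  68

  struct tun_dev {
          int type;              /* e.g. ARPHRD_ETHER for gre_tap/erspan */
          int hard_header_len;   /* ETH_HLEN, set at setup time */
  };

  /* usable MTU = link MTU - tunnel headers - Ethernet header (if any) */
  static int tunnel_payload_mtu(const struct tun_dev *dev, int link_mtu, int t_hlen)
  {
          int mtu = link_mtu - t_hlen;

          if (dev->type == ARPHRD_ETHER)
                  mtu -= dev->hard_header_len;

          return mtu < IPV4_MIN_MTU ? IPV4_MIN_MTU : mtu;
  }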
Fixes: 28e104d00281 ("net: ip_tunnel: fix mtu calculation") Reported-by: Jianlin Shi jishi@redhat.com Signed-off-by: Hangbin Liu liuhangbin@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/ip_tunnel.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index bdd073ea300ae..30e93b4f831f6 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -330,7 +330,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev) }
dev->needed_headroom = t_hlen + hlen; - mtu -= t_hlen; + mtu -= t_hlen + (dev->type == ARPHRD_ETHER ? dev->hard_header_len : 0);
if (mtu < IPV4_MIN_MTU) mtu = IPV4_MIN_MTU; @@ -361,6 +361,9 @@ static struct ip_tunnel *ip_tunnel_create(struct net *net, t_hlen = nt->hlen + sizeof(struct iphdr); dev->min_mtu = ETH_MIN_MTU; dev->max_mtu = IP_MAX_MTU - t_hlen; + if (dev->type == ARPHRD_ETHER) + dev->max_mtu -= dev->hard_header_len; + ip_tunnel_add(itn, nt); return nt;
@@ -502,13 +505,18 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb, const struct iphdr *inner_iph) { struct ip_tunnel *tunnel = netdev_priv(dev); - int pkt_size = skb->len - tunnel->hlen; + int pkt_size; int mtu;
- if (df) + pkt_size = skb->len - tunnel->hlen; + pkt_size -= dev->type == ARPHRD_ETHER ? dev->hard_header_len : 0; + + if (df) { mtu = dst_mtu(&rt->dst) - (sizeof(struct iphdr) + tunnel->hlen); - else + mtu -= dev->type == ARPHRD_ETHER ? dev->hard_header_len : 0; + } else { mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu; + }
skb_dst_update_pmtu_no_confirm(skb, mtu);
@@ -936,6 +944,9 @@ int __ip_tunnel_change_mtu(struct net_device *dev, int new_mtu, bool strict) int t_hlen = tunnel->hlen + sizeof(struct iphdr); int max_mtu = IP_MAX_MTU - t_hlen;
+ if (dev->type == ARPHRD_ETHER) + max_mtu -= dev->hard_header_len; + if (new_mtu < ETH_MIN_MTU) return -EINVAL;
@@ -1113,6 +1124,9 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[], if (tb[IFLA_MTU]) { unsigned int max = IP_MAX_MTU - (nt->hlen + sizeof(struct iphdr));
+ if (dev->type == ARPHRD_ETHER) + max -= dev->hard_header_len; + mtu = clamp(dev->mtu, (unsigned int)ETH_MIN_MTU, max); }
From: Nicolas Dichtel nicolas.dichtel@6wind.com
stable inclusion from linux-4.19.199 commit bff0854e2f804f68d3e93d19e4580dbd69777e1d
--------------------------------
[ Upstream commit ccd27f05ae7b8ebc40af5b004e94517a919aa862 ]
The goal of commit df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl") was to make the disable_policy from ipv4 available on ipv6. However, it's not exactly the same mechanism. On IPv4, all packets coming from an interface which has disable_policy set bypass the policy check. For ipv6, this is done only for local packets, i.e. for packets destined to an address configured on the incoming interface.
Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same effect for both protocols.
My first approach was to create a new kind of route cache entry, to be able to set DST_NOPOLICY without modifying routes. This would have added a lot of code. Because the local delivery path is already handled, I chose to focus on the forwarding path to minimize code churn.
Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl") Signed-off-by: Nicolas Dichtel nicolas.dichtel@6wind.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv6/ip6_output.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index aa8f19f852cc7..fc36f3b0dceb3 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -459,7 +459,9 @@ int ip6_forward(struct sk_buff *skb) if (skb_warn_if_lro(skb)) goto drop;
- if (!xfrm6_policy_check(NULL, XFRM_POLICY_FWD, skb)) { + if (!net->ipv6.devconf_all->disable_policy && + !idev->cnf.disable_policy && + !xfrm6_policy_check(NULL, XFRM_POLICY_FWD, skb)) { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); goto drop; }
From: Casey Chen cachen@purestorage.com
stable inclusion from linux-4.19.199 commit b248990aca0b1a352b94ead12605f84e878c4a5d
--------------------------------
[ Upstream commit 251ef6f71be2adfd09546a26643426fe62585173 ]
nvme_dev_remove_admin could free dev->admin_q and the admin_tagset while they are being accessed by nvme_dev_disable(), which can be called by nvme_reset_work via nvme_remove_dead_ctrl.
Commit cb4bfda62afa ("nvme-pci: fix hot removal during error handling") intended to avoid requests being stuck on a removed controller by killing the admin queue. But the later fix c8e9e9b7646e ("nvme-pci: unquiesce admin queue on shutdown"), together with nvme_dev_disable(dev, true) right before nvme_dev_remove_admin() could help dispatch requests and fail them early, so we don't need nvme_dev_remove_admin() any more.
Fixes: cb4bfda62afa ("nvme-pci: fix hot removal during error handling") Signed-off-by: Casey Chen cachen@purestorage.com Reviewed-by: Keith Busch kbusch@kernel.org Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/nvme/host/pci.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index cf51c7af5adc0..d43325761a144 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2614,7 +2614,6 @@ static void nvme_remove(struct pci_dev *pdev) if (!pci_device_is_present(pdev)) { nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD); nvme_dev_disable(dev, true); - nvme_dev_remove_admin(dev); }
flush_work(&dev->ctrl.reset_work);
From: Mike Christie michael.christie@oracle.com
stable inclusion from linux-4.19.199 commit 638d29f8563ddbfe699bd5ca77d945381d9853c9
--------------------------------
[ Upstream commit e746f3451ec7f91dcc9fd67a631239c715850a34 ]
An ISCSI_IFACE_PARAM can have the same value as an ISCSI_NET_PARAM, so when iscsi_iface_attr_is_visible() tries to figure out the type by just checking the value, we can collide and return the wrong type. When we then call into the driver we might not match, and we end up reporting that we don't want the attr visible in sysfs. The patch fixes this by setting the type at the point where we figure out what the param is.
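A tiny stand-alone illustration of the ambiguity (the enum names and values below are made up, not the real iSCSI definitions):

  /* Two independent parameter namespaces can reuse the same number. */
  enum example_net_param   { EX_NET_PARAM_MTU       = 3 };
  enum example_iface_param { EX_IFACE_PARAM_HDRDGST = 3 };

  /* Checking "param == 3" after the fact cannot tell these apart; the patch
   * therefore decides between ISCSI_IFACE_PARAM and ISCSI_NET_PARAM right
   * where each attribute is matched, and passes that type along explicitly. */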
Link: https://lore.kernel.org/r/20210701002559.89533-1-michael.christie@oracle.com Fixes: 3e0f65b34cc9 ("[SCSI] iscsi_transport: Additional parameters for network settings") Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/scsi_transport_iscsi.c | 90 +++++++++++------------------ 1 file changed, 34 insertions(+), 56 deletions(-)
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index 2aaa5a2bd6135..20e69052161e6 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -427,39 +427,10 @@ static umode_t iscsi_iface_attr_is_visible(struct kobject *kobj, struct device *dev = container_of(kobj, struct device, kobj); struct iscsi_iface *iface = iscsi_dev_to_iface(dev); struct iscsi_transport *t = iface->transport; - int param; - int param_type; + int param = -1;
if (attr == &dev_attr_iface_enabled.attr) param = ISCSI_NET_PARAM_IFACE_ENABLE; - else if (attr == &dev_attr_iface_vlan_id.attr) - param = ISCSI_NET_PARAM_VLAN_ID; - else if (attr == &dev_attr_iface_vlan_priority.attr) - param = ISCSI_NET_PARAM_VLAN_PRIORITY; - else if (attr == &dev_attr_iface_vlan_enabled.attr) - param = ISCSI_NET_PARAM_VLAN_ENABLED; - else if (attr == &dev_attr_iface_mtu.attr) - param = ISCSI_NET_PARAM_MTU; - else if (attr == &dev_attr_iface_port.attr) - param = ISCSI_NET_PARAM_PORT; - else if (attr == &dev_attr_iface_ipaddress_state.attr) - param = ISCSI_NET_PARAM_IPADDR_STATE; - else if (attr == &dev_attr_iface_delayed_ack_en.attr) - param = ISCSI_NET_PARAM_DELAYED_ACK_EN; - else if (attr == &dev_attr_iface_tcp_nagle_disable.attr) - param = ISCSI_NET_PARAM_TCP_NAGLE_DISABLE; - else if (attr == &dev_attr_iface_tcp_wsf_disable.attr) - param = ISCSI_NET_PARAM_TCP_WSF_DISABLE; - else if (attr == &dev_attr_iface_tcp_wsf.attr) - param = ISCSI_NET_PARAM_TCP_WSF; - else if (attr == &dev_attr_iface_tcp_timer_scale.attr) - param = ISCSI_NET_PARAM_TCP_TIMER_SCALE; - else if (attr == &dev_attr_iface_tcp_timestamp_en.attr) - param = ISCSI_NET_PARAM_TCP_TIMESTAMP_EN; - else if (attr == &dev_attr_iface_cache_id.attr) - param = ISCSI_NET_PARAM_CACHE_ID; - else if (attr == &dev_attr_iface_redirect_en.attr) - param = ISCSI_NET_PARAM_REDIRECT_EN; else if (attr == &dev_attr_iface_def_taskmgmt_tmo.attr) param = ISCSI_IFACE_PARAM_DEF_TASKMGMT_TMO; else if (attr == &dev_attr_iface_header_digest.attr) @@ -496,6 +467,38 @@ static umode_t iscsi_iface_attr_is_visible(struct kobject *kobj, param = ISCSI_IFACE_PARAM_STRICT_LOGIN_COMP_EN; else if (attr == &dev_attr_iface_initiator_name.attr) param = ISCSI_IFACE_PARAM_INITIATOR_NAME; + + if (param != -1) + return t->attr_is_visible(ISCSI_IFACE_PARAM, param); + + if (attr == &dev_attr_iface_vlan_id.attr) + param = ISCSI_NET_PARAM_VLAN_ID; + else if (attr == &dev_attr_iface_vlan_priority.attr) + param = ISCSI_NET_PARAM_VLAN_PRIORITY; + else if (attr == &dev_attr_iface_vlan_enabled.attr) + param = ISCSI_NET_PARAM_VLAN_ENABLED; + else if (attr == &dev_attr_iface_mtu.attr) + param = ISCSI_NET_PARAM_MTU; + else if (attr == &dev_attr_iface_port.attr) + param = ISCSI_NET_PARAM_PORT; + else if (attr == &dev_attr_iface_ipaddress_state.attr) + param = ISCSI_NET_PARAM_IPADDR_STATE; + else if (attr == &dev_attr_iface_delayed_ack_en.attr) + param = ISCSI_NET_PARAM_DELAYED_ACK_EN; + else if (attr == &dev_attr_iface_tcp_nagle_disable.attr) + param = ISCSI_NET_PARAM_TCP_NAGLE_DISABLE; + else if (attr == &dev_attr_iface_tcp_wsf_disable.attr) + param = ISCSI_NET_PARAM_TCP_WSF_DISABLE; + else if (attr == &dev_attr_iface_tcp_wsf.attr) + param = ISCSI_NET_PARAM_TCP_WSF; + else if (attr == &dev_attr_iface_tcp_timer_scale.attr) + param = ISCSI_NET_PARAM_TCP_TIMER_SCALE; + else if (attr == &dev_attr_iface_tcp_timestamp_en.attr) + param = ISCSI_NET_PARAM_TCP_TIMESTAMP_EN; + else if (attr == &dev_attr_iface_cache_id.attr) + param = ISCSI_NET_PARAM_CACHE_ID; + else if (attr == &dev_attr_iface_redirect_en.attr) + param = ISCSI_NET_PARAM_REDIRECT_EN; else if (iface->iface_type == ISCSI_IFACE_TYPE_IPV4) { if (attr == &dev_attr_ipv4_iface_ipaddress.attr) param = ISCSI_NET_PARAM_IPV4_ADDR; @@ -586,32 +589,7 @@ static umode_t iscsi_iface_attr_is_visible(struct kobject *kobj, return 0; }
- switch (param) { - case ISCSI_IFACE_PARAM_DEF_TASKMGMT_TMO: - case ISCSI_IFACE_PARAM_HDRDGST_EN: - case ISCSI_IFACE_PARAM_DATADGST_EN: - case ISCSI_IFACE_PARAM_IMM_DATA_EN: - case ISCSI_IFACE_PARAM_INITIAL_R2T_EN: - case ISCSI_IFACE_PARAM_DATASEQ_INORDER_EN: - case ISCSI_IFACE_PARAM_PDU_INORDER_EN: - case ISCSI_IFACE_PARAM_ERL: - case ISCSI_IFACE_PARAM_MAX_RECV_DLENGTH: - case ISCSI_IFACE_PARAM_FIRST_BURST: - case ISCSI_IFACE_PARAM_MAX_R2T: - case ISCSI_IFACE_PARAM_MAX_BURST: - case ISCSI_IFACE_PARAM_CHAP_AUTH_EN: - case ISCSI_IFACE_PARAM_BIDI_CHAP_EN: - case ISCSI_IFACE_PARAM_DISCOVERY_AUTH_OPTIONAL: - case ISCSI_IFACE_PARAM_DISCOVERY_LOGOUT_EN: - case ISCSI_IFACE_PARAM_STRICT_LOGIN_COMP_EN: - case ISCSI_IFACE_PARAM_INITIATOR_NAME: - param_type = ISCSI_IFACE_PARAM; - break; - default: - param_type = ISCSI_NET_PARAM; - } - - return t->attr_is_visible(param_type, param); + return t->attr_is_visible(ISCSI_NET_PARAM, param); }
static struct attribute *iscsi_iface_attrs[] = {
From: Dmitry Bogdanov d.bogdanov@yadro.com
stable inclusion from linux-4.19.199 commit 032fb65d3adc564f18f11a2c800866628cd259c9
--------------------------------
[ Upstream commit 6d8e7e7c932162bccd06872362751b0e1d76f5af ]
WRITE SAME(32) command handling reads WRPROTECT at the wrong offset, in the 1st byte instead of the 10th byte.
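A short stand-alone sketch of the byte/bit layout involved (offsets follow the description above and the diff below; the helper itself is illustrative, not a kernel API):

  #include <stdbool.h>

  typedef unsigned char u8;

  /* WRPROTECT is the top three bits of the flags byte: byte 1 for
   * WRITE SAME(10/16), byte 10 for WRITE SAME(32). */
  static u8 write_same_wrprotect(const u8 *cdb, bool is_write_same_32)
  {
          u8 flags = is_write_same_32 ? cdb[10] : cdb[1];

          return flags >> 5;
  }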
Link: https://lore.kernel.org/r/20210702091655.22818-1-d.bogdanov@yadro.com Fixes: afd73f1b60fc ("target: Perform PROTECT sanity checks for WRITE_SAME") Signed-off-by: Dmitry Bogdanov d.bogdanov@yadro.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/target/target_core_sbc.c | 35 ++++++++++++++++---------------- 1 file changed, 17 insertions(+), 18 deletions(-)
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c index ebac2b49b9c6e..af9b038da3ba6 100644 --- a/drivers/target/target_core_sbc.c +++ b/drivers/target/target_core_sbc.c @@ -38,7 +38,7 @@ #include "target_core_alua.h"
static sense_reason_t -sbc_check_prot(struct se_device *, struct se_cmd *, unsigned char *, u32, bool); +sbc_check_prot(struct se_device *, struct se_cmd *, unsigned char, u32, bool); static sense_reason_t sbc_execute_unmap(struct se_cmd *cmd);
static sense_reason_t @@ -292,14 +292,14 @@ static inline unsigned long long transport_lba_64_ext(unsigned char *cdb) }
static sense_reason_t -sbc_setup_write_same(struct se_cmd *cmd, unsigned char *flags, struct sbc_ops *ops) +sbc_setup_write_same(struct se_cmd *cmd, unsigned char flags, struct sbc_ops *ops) { struct se_device *dev = cmd->se_dev; sector_t end_lba = dev->transport->get_blocks(dev) + 1; unsigned int sectors = sbc_get_write_same_sectors(cmd); sense_reason_t ret;
- if ((flags[0] & 0x04) || (flags[0] & 0x02)) { + if ((flags & 0x04) || (flags & 0x02)) { pr_err("WRITE_SAME PBDATA and LBDATA" " bits not supported for Block Discard" " Emulation\n"); @@ -321,7 +321,7 @@ sbc_setup_write_same(struct se_cmd *cmd, unsigned char *flags, struct sbc_ops *o }
/* We always have ANC_SUP == 0 so setting ANCHOR is always an error */ - if (flags[0] & 0x10) { + if (flags & 0x10) { pr_warn("WRITE SAME with ANCHOR not supported\n"); return TCM_INVALID_CDB_FIELD; } @@ -329,7 +329,7 @@ sbc_setup_write_same(struct se_cmd *cmd, unsigned char *flags, struct sbc_ops *o * Special case for WRITE_SAME w/ UNMAP=1 that ends up getting * translated into block discard requests within backend code. */ - if (flags[0] & 0x08) { + if (flags & 0x08) { if (!ops->execute_unmap) return TCM_UNSUPPORTED_SCSI_OPCODE;
@@ -344,7 +344,7 @@ sbc_setup_write_same(struct se_cmd *cmd, unsigned char *flags, struct sbc_ops *o if (!ops->execute_write_same) return TCM_UNSUPPORTED_SCSI_OPCODE;
- ret = sbc_check_prot(dev, cmd, &cmd->t_task_cdb[0], sectors, true); + ret = sbc_check_prot(dev, cmd, flags >> 5, sectors, true); if (ret) return ret;
@@ -702,10 +702,9 @@ sbc_set_prot_op_checks(u8 protect, bool fabric_prot, enum target_prot_type prot_ }
static sense_reason_t -sbc_check_prot(struct se_device *dev, struct se_cmd *cmd, unsigned char *cdb, +sbc_check_prot(struct se_device *dev, struct se_cmd *cmd, unsigned char protect, u32 sectors, bool is_write) { - u8 protect = cdb[1] >> 5; int sp_ops = cmd->se_sess->sup_prot_ops; int pi_prot_type = dev->dev_attrib.pi_prot_type; bool fabric_prot = false; @@ -753,7 +752,7 @@ sbc_check_prot(struct se_device *dev, struct se_cmd *cmd, unsigned char *cdb, /* Fallthrough */ default: pr_err("Unable to determine pi_prot_type for CDB: 0x%02x " - "PROTECT: 0x%02x\n", cdb[0], protect); + "PROTECT: 0x%02x\n", cmd->t_task_cdb[0], protect); return TCM_INVALID_CDB_FIELD; }
@@ -828,7 +827,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) if (sbc_check_dpofua(dev, cmd, cdb)) return TCM_INVALID_CDB_FIELD;
- ret = sbc_check_prot(dev, cmd, cdb, sectors, false); + ret = sbc_check_prot(dev, cmd, cdb[1] >> 5, sectors, false); if (ret) return ret;
@@ -842,7 +841,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) if (sbc_check_dpofua(dev, cmd, cdb)) return TCM_INVALID_CDB_FIELD;
- ret = sbc_check_prot(dev, cmd, cdb, sectors, false); + ret = sbc_check_prot(dev, cmd, cdb[1] >> 5, sectors, false); if (ret) return ret;
@@ -856,7 +855,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) if (sbc_check_dpofua(dev, cmd, cdb)) return TCM_INVALID_CDB_FIELD;
- ret = sbc_check_prot(dev, cmd, cdb, sectors, false); + ret = sbc_check_prot(dev, cmd, cdb[1] >> 5, sectors, false); if (ret) return ret;
@@ -877,7 +876,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) if (sbc_check_dpofua(dev, cmd, cdb)) return TCM_INVALID_CDB_FIELD;
- ret = sbc_check_prot(dev, cmd, cdb, sectors, true); + ret = sbc_check_prot(dev, cmd, cdb[1] >> 5, sectors, true); if (ret) return ret;
@@ -891,7 +890,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) if (sbc_check_dpofua(dev, cmd, cdb)) return TCM_INVALID_CDB_FIELD;
- ret = sbc_check_prot(dev, cmd, cdb, sectors, true); + ret = sbc_check_prot(dev, cmd, cdb[1] >> 5, sectors, true); if (ret) return ret;
@@ -906,7 +905,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) if (sbc_check_dpofua(dev, cmd, cdb)) return TCM_INVALID_CDB_FIELD;
- ret = sbc_check_prot(dev, cmd, cdb, sectors, true); + ret = sbc_check_prot(dev, cmd, cdb[1] >> 5, sectors, true); if (ret) return ret;
@@ -965,7 +964,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) size = sbc_get_size(cmd, 1); cmd->t_task_lba = get_unaligned_be64(&cdb[12]);
- ret = sbc_setup_write_same(cmd, &cdb[10], ops); + ret = sbc_setup_write_same(cmd, cdb[10], ops); if (ret) return ret; break; @@ -1064,7 +1063,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) size = sbc_get_size(cmd, 1); cmd->t_task_lba = get_unaligned_be64(&cdb[2]);
- ret = sbc_setup_write_same(cmd, &cdb[1], ops); + ret = sbc_setup_write_same(cmd, cdb[1], ops); if (ret) return ret; break; @@ -1082,7 +1081,7 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops) * Follow sbcr26 with WRITE_SAME (10) and check for the existence * of byte 1 bit 3 UNMAP instead of original reserved field */ - ret = sbc_setup_write_same(cmd, &cdb[1], ops); + ret = sbc_setup_write_same(cmd, cdb[1], ops); if (ret) return ret; break;
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.199 commit 86aef177e03baedb8009c4a8692c4505c6063bec
--------------------------------
[ Upstream commit 6f20c8adb1813467ea52c1296d52c4e95978cb2f ]
tfo_active_disable_stamp is read and written locklessly. We need to annotate these accesses appropriately.
Then, we need to perform the atomic_inc(tfo_active_disable_times) after the timestamp has been updated, and thus add barriers to make sure tcp_fastopen_active_should_disable() won't read a stale timestamp.
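A stand-alone analogue of the required ordering, using C11 relaxed/release/acquire atomics in place of the kernel's WRITE_ONCE()/smp_mb__before_atomic()/smp_rmb() (a sketch under that substitution, not the kernel code):

  #include <stdatomic.h>
  #include <time.h>

  static _Atomic time_t disable_stamp;   /* analogue of tfo_active_disable_stamp */
  static atomic_int     disable_times;   /* analogue of tfo_active_disable_times */

  static void tfo_disable(void)
  {
          atomic_store_explicit(&disable_stamp, time(NULL), memory_order_relaxed);
          /* release: the stamp store above is visible before the counter bump */
          atomic_fetch_add_explicit(&disable_times, 1, memory_order_release);
  }

  static int tfo_should_disable(time_t *stamp)
  {
          if (atomic_load_explicit(&disable_times, memory_order_acquire) == 0)
                  return 0;
          /* acquire above guarantees we do not read a stale stamp */
          *stamp = atomic_load_explicit(&disable_stamp, memory_order_relaxed);
          return 1;
  }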
Fixes: cf1ef3f0719b ("net/tcp_fastopen: Disable active side TFO in certain scenarios") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Wei Wang weiwan@google.com Cc: Yuchung Cheng ycheng@google.com Cc: Neal Cardwell ncardwell@google.com Acked-by: Wei Wang weiwan@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/ipv4/tcp_fastopen.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c index 018a484773551..2ab371f555250 100644 --- a/net/ipv4/tcp_fastopen.c +++ b/net/ipv4/tcp_fastopen.c @@ -454,8 +454,15 @@ void tcp_fastopen_active_disable(struct sock *sk) { struct net *net = sock_net(sk);
+ /* Paired with READ_ONCE() in tcp_fastopen_active_should_disable() */ + WRITE_ONCE(net->ipv4.tfo_active_disable_stamp, jiffies); + + /* Paired with smp_rmb() in tcp_fastopen_active_should_disable(). + * We want net->ipv4.tfo_active_disable_stamp to be updated first. + */ + smp_mb__before_atomic(); atomic_inc(&net->ipv4.tfo_active_disable_times); - net->ipv4.tfo_active_disable_stamp = jiffies; + NET_INC_STATS(net, LINUX_MIB_TCPFASTOPENBLACKHOLE); }
@@ -473,10 +480,16 @@ bool tcp_fastopen_active_should_disable(struct sock *sk) if (!tfo_da_times) return false;
+ /* Paired with smp_mb__before_atomic() in tcp_fastopen_active_disable() */ + smp_rmb(); + /* Limit timout to max: 2^6 * initial timeout */ multiplier = 1 << min(tfo_da_times - 1, 6); - timeout = multiplier * tfo_bh_timeout * HZ; - if (time_before(jiffies, sock_net(sk)->ipv4.tfo_active_disable_stamp + timeout)) + + /* Paired with the WRITE_ONCE() in tcp_fastopen_active_disable(). */ + timeout = READ_ONCE(sock_net(sk)->ipv4.tfo_active_disable_stamp) + + multiplier * tfo_bh_timeout * HZ; + if (time_before(jiffies, timeout)) return true;
/* Mark check bit so we can check for successful active TFO
From: Peilin Ye peilin.ye@bytedance.com
stable inclusion from linux-4.19.199 commit e4fdca366806f6bab374d1a95e626a10a3854b0c
--------------------------------
[ Upstream commit 727d6a8b7ef3d25080fad228b2c4a1d4da5999c6 ]
Currently tcf_skbmod_act() assumes that packets use Ethernet as their L2 protocol, which is not always the case. As an example, for CAN devices:
  $ ip link add dev vcan0 type vcan
  $ ip link set up vcan0
  $ tc qdisc add dev vcan0 root handle 1: htb
  $ tc filter add dev vcan0 parent 1: protocol ip prio 10 \
        matchall action skbmod swap mac
Doing the above silently corrupts all the packets. Do not perform skbmod actions for non-Ethernet packets.
Fixes: 86da71b57383 ("net_sched: Introduce skbmod action") Reviewed-by: Cong Wang cong.wang@bytedance.com Signed-off-by: Peilin Ye peilin.ye@bytedance.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/sched/act_skbmod.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c index 21d1952961217..03a272af664ae 100644 --- a/net/sched/act_skbmod.c +++ b/net/sched/act_skbmod.c @@ -10,6 +10,7 @@ */
#include <linux/module.h> +#include <linux/if_arp.h> #include <linux/init.h> #include <linux/kernel.h> #include <linux/skbuff.h> @@ -36,6 +37,13 @@ static int tcf_skbmod_act(struct sk_buff *skb, const struct tc_action *a, tcf_lastuse_update(&d->tcf_tm); bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
+ action = READ_ONCE(d->tcf_action); + if (unlikely(action == TC_ACT_SHOT)) + goto drop; + + if (!skb->dev || skb->dev->type != ARPHRD_ETHER) + return action; + /* XXX: if you are going to edit more fields beyond ethernet header * (example when you add IP header replacement or vlan swap) * then MAX_EDIT_LEN needs to change appropriately @@ -44,10 +52,6 @@ static int tcf_skbmod_act(struct sk_buff *skb, const struct tc_action *a, if (unlikely(err)) /* best policy is to drop on the floor */ goto drop;
- action = READ_ONCE(d->tcf_action); - if (unlikely(action == TC_ACT_SHOT)) - goto drop; - p = rcu_dereference_bh(d->skbmod_p); flags = p->flags; if (flags & SKBMOD_F_DMAC)
From: Zhihao Cheng chengzhihao1@huawei.com
stable inclusion from linux-4.19.199 commit 0ac2cafd710e4db558e269b7f88f731f76e23f3e
--------------------------------
[ Upstream commit 7764656b108cd308c39e9a8554353b8f9ca232a3 ]
The following process:

nvme_probe
  nvme_reset_ctrl
    nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)
    queue_work(nvme_reset_wq, &ctrl->reset_work)
                 --------------> nvme_remove
                                   nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING)
worker_thread
  process_one_work
    nvme_reset_work
      WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)
This will trigger the WARN_ON in nvme_reset_work():

[ 127.534298] WARNING: CPU: 0 PID: 139 at drivers/nvme/host/pci.c:2594
[ 127.536161] CPU: 0 PID: 139 Comm: kworker/u8:7 Not tainted 5.13.0
[ 127.552518] Call Trace:
[ 127.552840]  ? kvm_sched_clock_read+0x25/0x40
[ 127.553936]  ? native_send_call_func_single_ipi+0x1c/0x30
[ 127.555117]  ? send_call_function_single_ipi+0x9b/0x130
[ 127.556263]  ? __smp_call_single_queue+0x48/0x60
[ 127.557278]  ? ttwu_queue_wakelist+0xfa/0x1c0
[ 127.558231]  ? try_to_wake_up+0x265/0x9d0
[ 127.559120]  ? ext4_end_io_rsv_work+0x160/0x290
[ 127.560118]  process_one_work+0x28c/0x640
[ 127.561002]  worker_thread+0x39a/0x700
[ 127.561833]  ? rescuer_thread+0x580/0x580
[ 127.562714]  kthread+0x18c/0x1e0
[ 127.563444]  ? set_kthread_struct+0x70/0x70
[ 127.564347]  ret_from_fork+0x1f/0x30
The preceding problem can be easily reproduced by executing the following script (based on the blktests suite):

test() {
	pdev="$(_get_pci_dev_from_blkdev)"
	sysfs="/sys/bus/pci/devices/${pdev}"

	for ((i = 0; i < 10; i++)); do
		echo 1 > "$sysfs/remove"
		echo 1 > /sys/bus/pci/rescan
	done
}
Since the device ctrl can be updated to a non-RESETTING state by repeated probe/remove from userspace (which is a normal situation), we can replace the stack-dumping WARN_ON with a warning message.
Fixes: 82b057caefaff ("nvme-pci: fix multiple ctrl removal schedulin") Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/nvme/host/pci.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d43325761a144..e752b4824b353 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2283,7 +2283,9 @@ static void nvme_reset_work(struct work_struct *work) int result; enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
- if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) { + if (dev->ctrl.state != NVME_CTRL_RESETTING) { + dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n", + dev->ctrl.state); result = -ENODEV; goto out; }
From: Xin Long lucien.xin@gmail.com
stable inclusion from linux-4.19.199 commit 50b57223da67653c61e405d0a7592355cfe4585e
--------------------------------
[ Upstream commit 58acd10092268831e49de279446c314727101292 ]
syzbot reported a call trace:
BUG: KASAN: use-after-free in sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112
Call Trace:
 sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112
 sctp_set_owner_w net/sctp/socket.c:131 [inline]
 sctp_sendmsg_to_asoc+0x152e/0x2180 net/sctp/socket.c:1865
 sctp_sendmsg+0x103b/0x1d30 net/sctp/socket.c:2027
 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821
 sock_sendmsg_nosec net/socket.c:703 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:723
This is a use-after-free issue caused by not updating asoc->shkey after it was replaced in the key list asoc->endpoint_shared_keys, and the old key was freed.

This patch fixes it by also updating the asoc's active_key when the old key is replaced with a new one. Note that this issue doesn't exist in sctp_auth_del_key_id(), as it's not allowed to delete the active_key from the asoc.
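A stand-alone illustration of the bug class (the names below are made up, this is not the SCTP code): a cached pointer must be refreshed when the entry it refers to is replaced and freed.

  #include <stdlib.h>

  struct shared_key { int id; };

  struct assoc {
          struct shared_key *active_key;   /* cached pointer into a key list */
  };

  static void replace_key(struct assoc *asoc, struct shared_key **slot,
                          struct shared_key *new_key)
  {
          struct shared_key *old = *slot;

          *slot = new_key;
          if (asoc && asoc->active_key == old)
                  asoc->active_key = new_key;   /* the refresh the fix adds */
          free(old);                            /* without it, active_key dangles */
  }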
Fixes: 1b1e0bc99474 ("sctp: add refcnt support for sh_key") Reported-by: syzbot+b774577370208727d12b@syzkaller.appspotmail.com Signed-off-by: Xin Long lucien.xin@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/sctp/auth.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/sctp/auth.c b/net/sctp/auth.c index 2bd8c80bd85fb..b2ca66c4a21d2 100644 --- a/net/sctp/auth.c +++ b/net/sctp/auth.c @@ -883,6 +883,8 @@ int sctp_auth_set_key(struct sctp_endpoint *ep, if (replace) { list_del_init(&shkey->key_list); sctp_auth_shkey_release(shkey); + if (asoc && asoc->active_key_id == auth_key->sca_keynumber) + sctp_auth_asoc_init_active_key(asoc, GFP_KERNEL); } list_add(&cur_key->key_list, sh_keys);
From: Yajun Deng yajun.deng@linux.dev
stable inclusion from linux-4.19.199 commit d682390655b388f7eb69add0006a4fce9cc33b3a
--------------------------------
[ Upstream commit 9d85a6f44bd5585761947f40f7821c9cd78a1bbe ]
The 4th parameter in tc_chain_notify() should be flags rather than seq. Let's change it back correctly.
Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi") Signed-off-by: Yajun Deng yajun.deng@linux.dev Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- net/sched/cls_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 184c20b86393b..4413aa8d4e829 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -1918,7 +1918,7 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n, break; case RTM_GETCHAIN: err = tc_chain_notify(chain, skb, n->nlmsg_seq, - n->nlmsg_seq, n->nlmsg_type, true); + n->nlmsg_flags, n->nlmsg_type, true); if (err < 0) NL_SET_ERR_MSG(extack, "Failed to send chain notify message"); break;
From: Marcelo Henrique Cerri marcelo.cerri@canonical.com
stable inclusion from linux-4.19.199 commit 66bcd449e04c2530d9859ff628d2596658bcf825
--------------------------------
[ Upstream commit d238692b4b9f2c36e35af4c6e6f6da36184aeb3e ]
Use size_t when capping the count argument received by mem_rw(). Since count is size_t, using min_t(int, ...) can lead to a negative value that will later be passed to access_remote_vm(), which can cause unexpected behavior.
Since we are capping the value to at most PAGE_SIZE, the conversion from size_t to int when passing it to access_remote_vm() as "len" shouldn't be a problem.
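A stand-alone demonstration of the pitfall (not the procfs code; the min_t macro below is a simplified local copy of the kernel's):

  #include <stdio.h>
  #include <stddef.h>

  #define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

  int main(void)
  {
          size_t count = (size_t)1 << 31;   /* e.g. a 2 GiB read request */
          size_t page_size = 4096;

          int    bad  = min_t(int,    count, page_size);  /* count wraps negative */
          size_t good = min_t(size_t, count, page_size);  /* capped to 4096 */

          printf("min_t(int, ...)    = %d\n", bad);   /* typically a negative value */
          printf("min_t(size_t, ...) = %zu\n", good);
          return 0;
  }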
Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.c... Reviewed-by: David Disseldorp ddiss@suse.de Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@canonical.com Signed-off-by: Marcelo Henrique Cerri marcelo.cerri@canonical.com Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Souza Cascardo cascardo@canonical.com Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Michel Lespinasse walken@google.com Cc: Helge Deller deller@gmx.de Cc: Oleg Nesterov oleg@redhat.com Cc: Lorenzo Stoakes lstoakes@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/proc/base.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c index a8679c244e40a..24fb694357338 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -835,7 +835,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf, flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
while (count > 0) { - int this_len = min_t(int, count, PAGE_SIZE); + size_t this_len = min_t(size_t, count, PAGE_SIZE);
if (write && copy_from_user(page, buf, this_len)) { copied = -EFAULT;
From: Haoran Luo www@aegistudio.net
stable inclusion from linux-4.19.199 commit 6a99bfee7f5625d2577a5c3b09a2bd2a845feb8a
--------------------------------
commit 67f0d6d9883c13174669f88adac4f0ee656cc16a upstream.
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when "head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to the same buffer page, whose "buffer_data_page" is empty and "read" field is non-zero.
An error scenario can be constructed as follows (kernel perspective):
1. All pages in the buffer have been accessed by reader(s), so all of them have a non-zero "read" field.
2. Read and clear all buffer pages so that "rb_num_of_entries()" returns 0, indicating there is no more data to read. It is also required that the "read_page", "commit_page" and "tail_page" point to the same page, while "head_page" is the next page of them.
3. Invoke "ring_buffer_lock_reserve()" with large enough "length" so that it shot pass the end of current tail buffer page. Now the "head_page", "commit_page" and "tail_page" points to the same page.
4. Discard the current event with "ring_buffer_discard_commit()", so that "head_page", "commit_page" and "tail_page" point to a page whose buffer data page is now empty.
When the error scenario has been constructed, "tracing_read_pipe" will be trapped inside a deadloop: "trace_empty()" returns 0 since "rb_per_cpu_empty()" returns 0 when it hits the CPU containing such constructed ring buffer. Then "trace_find_next_entry_inc()" always return NULL since "rb_num_of_entries()" reports there's no more entry to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking "tracing_read_pipe" back to the start of the "waitagain" loop.
I've also written a proof-of-concept script to construct the scenario and trigger the bug automatically, you can use it to trace and validate my reasoning above:
https://github.com/aegistudio/RingBufferDetonator.git
Tests have been carried out on linux kernel 5.14-rc2 (2734d6c1b1a089fb593ef6a23d4b70903526fe0c), my fixed version of the kernel (to test whether my update fixes the bug), and some older kernels (to determine the range of affected kernels). Test results are also attached to the proof-of-concept repository.
Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/ Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
Cc: stable@vger.kernel.org Fixes: bf41a158cacba ("ring-buffer: make reentrant") Suggested-by: Linus Torvalds torvalds@linuxfoundation.org Signed-off-by: Haoran Luo www@aegistudio.net Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/trace/ring_buffer.c | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 360129e475407..987d3447bf2a1 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3172,10 +3172,30 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) if (unlikely(!head)) return true;
- return reader->read == rb_page_commit(reader) && - (commit == reader || - (commit == head && - head->read == rb_page_commit(commit))); + /* Reader should exhaust content in reader page */ + if (reader->read != rb_page_commit(reader)) + return false; + + /* + * If writers are committing on the reader page, knowing all + * committed content has been read, the ring buffer is empty. + */ + if (commit == reader) + return true; + + /* + * If writers are committing on a page other than reader page + * and head page, there should always be content to read. + */ + if (commit != head) + return false; + + /* + * Writers are committing on the head page, we just need + * to care about there're committed data, and the reader will + * swap reader page with head page when it is to read data. + */ + return rb_page_commit(commit) == 0; }
/**