Al Grant (1): perf tools: Correct SNOOPX field offset
Al Viro (2):
      do_epoll_ctl(): clean the failure exits up a bit
      fix regression in "epoll: Keep a reference on files added to the check list"
Alex Williamson (1): vfio/type1: Add proper error unwind for vfio_iommu_replay()
Alvin Šipraga (1): macvlan: validate setting of multiple remote source MAC addresses
Bodo Stroesser (1): scsi: target: tcmu: Fix crash in tcmu_flush_dcache_range on ARM
Brian Foster (3):
      xfs: acquire superblock freeze protection on eofblocks scans
      xfs: reset buffer write failure state on successful completion
      xfs: fix duplicate verification from xfs_qm_dqflush()
Charan Teja Reddy (1): mm, page_alloc: fix core hung in free_pcppages_bulk()
Chris Wilson (1): locking/lockdep: Fix overflow in presentation of average lock-time
Chuck Lever (1): NFS: Zero-stateid SETATTR should first return delegation
Daniel Borkmann (1): uaccess: Add non-pagefault user-space write function
Darrick J. Wong (4):
      xfs: fix partially uninitialized structure in xfs_reflink_remap_extent
      xfs: fix inode quota reservation checks
      xfs: fix xfs_bmap_validate_extent_raw when checking attr fork of rt files
      xfs: initialize the shortform attr header padding entry
Dave Chinner (1): xfs: Don't allow logging of XFS_ISTALE inodes
David Milburn (2):
      nvme-fc: cancel async events before freeing event struct
      nvme-rdma: cancel async events before freeing event struct
Doug Berger (1): mm: include CMA pages in lowmem_reserve at boot
Eiichi Tsukata (1): xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init
Eric Biggers (1): xfs: clear PF_MEMALLOC before exiting xfsaild thread
George Kennedy (1): vt_ioctl: change VT_RESIZEX ioctl to check for error return from vc_resize()
Gustav Wiklander (1): spi: Fix memory leak on splited transfers
Heikki Krogerus (1): device property: Fix the secondary firmware node handling in set_primary_fwnode()
Helge Deller (1): fs/signalfd.c: fix inconsistent return codes for signalfd4
Hou Pu (1): scsi: target: iscsi: Fix hang in iscsit_access_np() when getting tpg->np_login_sem
Hugh Dickins (2):
      khugepaged: khugepaged_test_exit() check mmget_still_valid()
      khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
Jan Kara (5):
      ext4: fix checking of directory entry validity for inline directories
      ext4: don't BUG on inconsistent journal feature
      writeback: Protect inode->i_io_list with inode->i_lock
      writeback: Avoid skipping inode writeback
      writeback: Fix sync livelock due to b_dirty_time processing
Jarkko Sakkinen (1): tpm: Unify the mismatching TPM space buffer sizes
Jason Gunthorpe (1): include/linux/log2.h: add missing () around n in roundup_pow_of_two()
Javed Hasan (1): scsi: fcoe: Memory leak fix in fcoe_sysfs_fcf_del()
Jens Axboe (1): block: ensure bdi->io_pages is always initialized
Jing Xiangfeng (1): scsi: iscsi: Do not put host in iscsi_set_flashnode_param()
Li Heng (1): efi: add missed destroy_workqueue when efisubsys_init fails
Lukas Czerner (3):
      jbd2: make sure jh have b_transaction set in refile/unfile_buffer
      ext4: handle read only external journal device
      ext4: handle option set by mount flags correctly
Lukas Wunner (4):
      spi: Prevent adding devices below an unregistering controller
      serial: pl011: Fix oops on -EPROBE_DEFER
      serial: pl011: Don't leak amba_ports entry on driver register error
      serial: 8250: Avoid error message on reprobe
Luo Meng (1): ext4: only set last error block when check system zone failed
Mao Wenan (1): virtio_ring: Avoid loop when vq is broken in virtqueue_poll
Marc Zyngier (1): epoll: Keep a reference on files added to the check list
Masami Hiramatsu (2):
      perf probe: Fix memory leakage when the probe point is not found
      uaccess: Add non-pagefault user-space read functions
Max Reitz (1): xfs: Fix tail rounding in xfs_alloc_file_space()
Mikulas Patocka (2):
      xfs: don't update mtime on COW faults
      dm writecache: handle DAX to partitions on persistent memory correctly
Ming Lei (1): blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART
Muchun Song (1): kprobes: fix kill kprobe which has been marked as gone
Namhyung Kim (1): perf jevents: Fix suspicious code in fixregex()
Peter Xu (1): mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
Peter Zijlstra (1): cpuidle: Fixup IRQ state
Qiushi Wu (1): PCI: Fix pci_create_slot() reference count leak
Rafael J. Wysocki (1): PM: sleep: core: Fix the handling of pending runtime resume requests
Ralph Campbell (1): mm/thp: fix __split_huge_pmd_locked() for migration PMD
Robin Murphy (1): iommu/iova: Don't BUG on invalid PFNs
Selvin Xavier (1): RDMA/bnxt_re: Do not add user qps to flushlist
Sergey Senozhatsky (1): serial: 8250: change lock order in serial8250_do_startup()
Sunghyun Jin (1): percpu: fix first chunk size calculation for populated bitmap
Tejun Heo (1): libata: implement ATA_HORKAGE_MAX_TRIM_128M and apply to Sandisks
Tetsuo Handa (1): vt: defer kfree() of vc_screenbuf in vc_do_resize()
Valmer Huhn (1): serial: 8250_exar: Fix number of ports for Commtech PCIe cards
Varun Prakash (1): scsi: target: iscsi: Fix data digest calculation
Wei Yongjun (1): kernel/relay.c: fix memleak on destroy relay channel
Xianting Tian (1): fs: prevent BUG_ON in submit_bh_wbc()
Xunlei Pang (1): mm: memcg: fix memcg reclaim soft lockup
Yang Xu (1): KEYS: reaching the keys quotas correctly
kaixuxia (1): xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
zhangyi (F) (1): jbd2: add the missing unlock_buffer() in the error path of jbd2_write_superblock()
 block/blk-core.c | 2 +
 block/blk-mq-sched.c | 9 ++
 block/blk-mq.c | 9 ++
 drivers/ata/libata-core.c | 5 +-
 drivers/ata/libata-scsi.c | 8 +-
 drivers/base/core.c | 12 +-
 drivers/base/power/main.c | 16 ++-
 drivers/char/tpm/tpm-chip.c | 9 +-
 drivers/char/tpm/tpm.h | 6 +-
 drivers/char/tpm/tpm2-space.c | 26 ++--
 drivers/char/tpm/tpmrm-dev.c | 2 +-
 drivers/cpuidle/cpuidle.c | 3 +-
 drivers/firmware/efi/efi.c | 2 +
 drivers/infiniband/hw/bnxt_re/main.c | 3 +-
 drivers/iommu/iova.c | 4 +-
 drivers/md/dm-writecache.c | 12 +-
 drivers/net/macvlan.c | 21 ++-
 drivers/nvme/host/fc.c | 1 +
 drivers/nvme/host/rdma.c | 1 +
 drivers/pci/slot.c | 6 +-
 drivers/scsi/fcoe/fcoe_ctlr.c | 2 +-
 drivers/scsi/scsi_transport_iscsi.c | 2 +-
 drivers/spi/Kconfig | 3 +
 drivers/spi/spi.c | 30 +++-
 drivers/target/iscsi/iscsi_target.c | 17 ++-
 drivers/target/iscsi/iscsi_target_login.c | 6 +-
 drivers/target/iscsi/iscsi_target_login.h | 3 +-
 drivers/target/iscsi/iscsi_target_nego.c | 3 +-
 drivers/target/target_core_user.c | 2 +-
 drivers/tty/serial/8250/8250_core.c | 11 +-
 drivers/tty/serial/8250/8250_exar.c | 24 +++-
 drivers/tty/serial/8250/8250_port.c | 9 +-
 drivers/tty/serial/amba-pl011.c | 16 ++-
 drivers/tty/vt/vt.c | 5 +-
 drivers/tty/vt/vt_ioctl.c | 12 +-
 drivers/vfio/vfio_iommu_type1.c | 71 ++++++++-
 drivers/virtio/virtio_ring.c | 3 +
 fs/buffer.c | 9 ++
 fs/eventpoll.c | 23 +--
 fs/ext4/block_validity.c | 3 +-
 fs/ext4/namei.c | 6 +-
 fs/ext4/super.c | 147 ++++++++++++-------
 fs/fs-writeback.c | 83 ++++++-----
 fs/jbd2/journal.c | 4 +-
 fs/jbd2/transaction.c | 10 ++
 fs/nfs/nfs4proc.c | 4 +-
 fs/signalfd.c | 10 +-
 fs/xfs/libxfs/xfs_attr_leaf.c | 4 +-
 fs/xfs/libxfs/xfs_bmap.c | 2 +-
 fs/xfs/xfs_bmap_util.c | 4 +-
 fs/xfs/xfs_buf.c | 8 +-
 fs/xfs/xfs_dquot.c | 9 +-
 fs/xfs/xfs_file.c | 12 +-
 fs/xfs/xfs_icache.c | 13 +-
 fs/xfs/xfs_inode.c | 110 ++++++++------
 fs/xfs/xfs_ioctl.c | 5 +-
 fs/xfs/xfs_reflink.c | 1 +
 fs/xfs/xfs_sysfs.h | 6 +-
 fs/xfs/xfs_trans_ail.c | 4 +-
 fs/xfs/xfs_trans_dquot.c | 2 +-
 fs/xfs/xfs_trans_inode.c | 2 +
 include/linux/fs.h | 8 +-
 include/linux/libata.h | 1 +
 include/linux/log2.h | 2 +-
 include/linux/uaccess.h | 26 ++++
 include/trace/events/writeback.h | 13 +-
 kernel/kprobes.c | 9 +-
 kernel/locking/lockdep_proc.c | 2 +-
 kernel/relay.c | 1 +
 mm/huge_memory.c | 40 +++---
 mm/hugetlb.c | 24 ++--
 mm/khugepaged.c | 7 +-
 mm/maccess.c | 167 ++++++++++++++++++++--
 mm/page_alloc.c | 7 +-
 mm/percpu.c | 2 +-
 mm/vmscan.c | 8 ++
 security/keys/key.c | 2 +-
 security/keys/keyctl.c | 4 +-
 tools/include/uapi/linux/perf_event.h | 2 +-
 tools/perf/pmu-events/jevents.c | 2 +-
 tools/perf/util/probe-finder.c | 2 +-
 81 files changed, 864 insertions(+), 322 deletions(-)
From: Alvin Šipraga <alsi@bang-olufsen.dk>
stable inclusion from linux-4.19.143 commit b12151989d2930328b85419c20037a0c9769ef52
--------------------------------
[ Upstream commit 8b61fba503904acae24aeb2bd5569b4d6544d48f ]
Remote source MAC addresses can be set on a 'source mode' macvlan interface via the IFLA_MACVLAN_MACADDR_DATA attribute. This commit tightens the validation of these MAC addresses to match the validation already performed when setting or adding a single MAC address via the IFLA_MACVLAN_MACADDR attribute.
iproute2 uses IFLA_MACVLAN_MACADDR_DATA for its 'macvlan macaddr set' command, and IFLA_MACVLAN_MACADDR for its 'macvlan macaddr add' command, which demonstrates the inconsistent behaviour that this commit addresses:
  # ip link add link eth0 name macvlan0 type macvlan mode source
  # ip link set link dev macvlan0 type macvlan macaddr add 01:00:00:00:00:00
  RTNETLINK answers: Cannot assign requested address
  # ip link set link dev macvlan0 type macvlan macaddr set 01:00:00:00:00:00
  # ip -d link show macvlan0
  5: macvlan0@eth0: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1500 ...
      link/ether 2e:ac:fd:2d:69:f8 brd ff:ff:ff:ff:ff:ff promiscuity 0
      macvlan mode source remotes (1) 01:00:00:00:00:00 numtxqueues 1 ...
With this change, the 'set' command will (rightly) fail in the same way as the 'add' command.
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Aichun Li <liaichun@huawei.com>
Reviewed-by: wangxiaopeng <wangxiaopeng7@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 drivers/net/macvlan.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 349123592af0..e226a96da3a3 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1230,6 +1230,9 @@ static void macvlan_port_destroy(struct net_device *dev)
 static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[],
 			    struct netlink_ext_ack *extack)
 {
+	struct nlattr *nla, *head;
+	int rem, len;
+
 	if (tb[IFLA_ADDRESS]) {
 		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
 			return -EINVAL;
@@ -1277,6 +1280,20 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[],
 		return -EADDRNOTAVAIL;
 	}

+	if (data[IFLA_MACVLAN_MACADDR_DATA]) {
+		head = nla_data(data[IFLA_MACVLAN_MACADDR_DATA]);
+		len = nla_len(data[IFLA_MACVLAN_MACADDR_DATA]);
+
+		nla_for_each_attr(nla, head, len, rem) {
+			if (nla_type(nla) != IFLA_MACVLAN_MACADDR ||
+			    nla_len(nla) != ETH_ALEN)
+				return -EINVAL;
+
+			if (!is_valid_ether_addr(nla_data(nla)))
+				return -EADDRNOTAVAIL;
+		}
+	}
+
 	if (data[IFLA_MACVLAN_MACADDR_COUNT])
 		return -EINVAL;

@@ -1333,10 +1350,6 @@ static int macvlan_changelink_sources(struct macvlan_dev *vlan, u32 mode,
 	len = nla_len(data[IFLA_MACVLAN_MACADDR_DATA]);

 	nla_for_each_attr(nla, head, len, rem) {
-		if (nla_type(nla) != IFLA_MACVLAN_MACADDR ||
-		    nla_len(nla) != ETH_ALEN)
-			continue;
-
 		addr = nla_data(nla);
 		ret = macvlan_hash_add_source(vlan, addr);
 		if (ret)
From: kaixuxia <xiakaixu1987@gmail.com>
stable inclusion from linux-4.19.119 commit 6fb102dd99f40ad98a1d55b30c30ae34e7a19cdd
--------------------------------
commit bc56ad8c74b8588685c2875de0df8ab6974828ef upstream.
When performing a rename operation with the RENAME_WHITEOUT flag, we first hold the AGF lock to allocate or free extents while manipulating the dirents, and then make the xfs_iunlink_remove() call last, holding the AGI lock to modify the tmpfile info, so we take the locks in the order AGF->AGI.
The big problem here is that we have an ordering constraint on AGF and AGI locking - inode allocation locks the AGI, then can allocate a new extent for new inodes, locking the AGF after the AGI. Hence the ordering that is imposed by other parts of the code is AGI before AGF. So we get an ABBA deadlock between the AGI and AGF here.
Process A:
Call trace:
 ? __schedule+0x2bd/0x620
 schedule+0x33/0x90
 schedule_timeout+0x17d/0x290
 __down_common+0xef/0x125
 ? xfs_buf_find+0x215/0x6c0 [xfs]
 down+0x3b/0x50
 xfs_buf_lock+0x34/0xf0 [xfs]
 xfs_buf_find+0x215/0x6c0 [xfs]
 xfs_buf_get_map+0x37/0x230 [xfs]
 xfs_buf_read_map+0x29/0x190 [xfs]
 xfs_trans_read_buf_map+0x13d/0x520 [xfs]
 xfs_read_agf+0xa6/0x180 [xfs]
 ? schedule_timeout+0x17d/0x290
 xfs_alloc_read_agf+0x52/0x1f0 [xfs]
 xfs_alloc_fix_freelist+0x432/0x590 [xfs]
 ? down+0x3b/0x50
 ? xfs_buf_lock+0x34/0xf0 [xfs]
 ? xfs_buf_find+0x215/0x6c0 [xfs]
 xfs_alloc_vextent+0x301/0x6c0 [xfs]
 xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
 ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
 xfs_dialloc+0x116/0x290 [xfs]
 xfs_ialloc+0x6d/0x5e0 [xfs]
 ? xfs_log_reserve+0x165/0x280 [xfs]
 xfs_dir_ialloc+0x8c/0x240 [xfs]
 xfs_create+0x35a/0x610 [xfs]
 xfs_generic_create+0x1f1/0x2f0 [xfs]
 ...
Process B:
Call trace:
 ? __schedule+0x2bd/0x620
 ? xfs_bmapi_allocate+0x245/0x380 [xfs]
 schedule+0x33/0x90
 schedule_timeout+0x17d/0x290
 ? xfs_buf_find+0x1fd/0x6c0 [xfs]
 __down_common+0xef/0x125
 ? xfs_buf_get_map+0x37/0x230 [xfs]
 ? xfs_buf_find+0x215/0x6c0 [xfs]
 down+0x3b/0x50
 xfs_buf_lock+0x34/0xf0 [xfs]
 xfs_buf_find+0x215/0x6c0 [xfs]
 xfs_buf_get_map+0x37/0x230 [xfs]
 xfs_buf_read_map+0x29/0x190 [xfs]
 xfs_trans_read_buf_map+0x13d/0x520 [xfs]
 xfs_read_agi+0xa8/0x160 [xfs]
 xfs_iunlink_remove+0x6f/0x2a0 [xfs]
 ? current_time+0x46/0x80
 ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
 xfs_rename+0x57a/0xae0 [xfs]
 xfs_vn_rename+0xe4/0x150 [xfs]
 ...
In this patch we move the xfs_iunlink_remove() call to before acquiring the AGF lock to preserve correct AGI/AGF locking order.
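For readers less familiar with lock-order inversions, the ABBA pattern described above can be reduced to a minimal userspace sketch. This is purely illustrative: the two pthread mutexes stand in for the AGI and AGF buffer locks, and the function names are hypothetical, not XFS interfaces.

#include <pthread.h>
#include <unistd.h>

/* Stand-ins for the AGI and AGF buffer locks. */
static pthread_mutex_t agi = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t agf = PTHREAD_MUTEX_INITIALIZER;

/* Like inode allocation: takes the AGI, then the AGF. */
static void *task_a(void *unused)
{
        pthread_mutex_lock(&agi);
        sleep(1);                  /* widen the race window */
        pthread_mutex_lock(&agf);  /* waits on B, which holds the AGF */
        pthread_mutex_unlock(&agf);
        pthread_mutex_unlock(&agi);
        return NULL;
}

/* Like the old RENAME_WHITEOUT path: takes the AGF, then the AGI. */
static void *task_b(void *unused)
{
        pthread_mutex_lock(&agf);
        sleep(1);
        pthread_mutex_lock(&agi);  /* waits on A, which holds the AGI: ABBA */
        pthread_mutex_unlock(&agi);
        pthread_mutex_unlock(&agf);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, task_a, NULL);
        pthread_create(&b, NULL, task_b, NULL);
        pthread_join(a, NULL);     /* never returns: the two threads deadlock */
        pthread_join(b, NULL);
        return 0;
}

Moving xfs_iunlink_remove() before the directory manipulation makes the rename path take the AGI first, so every path agrees on AGI -> AGF and the cycle can no longer form.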
[Minor massage required due to an upstream change making xfs_bumplink() a void function, whereas in the 4.19.y tree the return value is checked, even though it is always zero. The only change was to the last code block removed by the patch. Functionally equivalent to upstream.]

Signed-off-by: kaixuxia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_inode.c | 85 +++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 43 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4cbcdd8e32fb..7206569efa70 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2959,7 +2959,8 @@ xfs_rename(
 			spaceres);

 	/*
-	 * Set up the target.
+	 * Check for expected errors before we dirty the transaction
+	 * so we can return an error without a transaction abort.
 	 */
 	if (target_ip == NULL) {
 		/*
@@ -2971,6 +2972,46 @@ xfs_rename(
 			if (error)
 				goto out_trans_cancel;
 		}
+	} else {
+		/*
+		 * If target exists and it's a directory, check that whether
+		 * it can be destroyed.
+		 */
+		if (S_ISDIR(VFS_I(target_ip)->i_mode) &&
+		    (!xfs_dir_isempty(target_ip) ||
+		     (VFS_I(target_ip)->i_nlink > 2))) {
+			error = -EEXIST;
+			goto out_trans_cancel;
+		}
+	}
+
+	/*
+	 * Directory entry creation below may acquire the AGF. Remove
+	 * the whiteout from the unlinked list first to preserve correct
+	 * AGI/AGF locking order. This dirties the transaction so failures
+	 * after this point will abort and log recovery will clean up the
+	 * mess.
+	 *
+	 * For whiteouts, we need to bump the link count on the whiteout
+	 * inode. After this point, we have a real link, clear the tmpfile
+	 * state flag from the inode so it doesn't accidentally get misused
+	 * in future.
+	 */
+	if (wip) {
+		ASSERT(VFS_I(wip)->i_nlink == 0);
+		error = xfs_iunlink_remove(tp, wip);
+		if (error)
+			goto out_trans_cancel;
+
+		xfs_bumplink(tp, wip);
+		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
+		VFS_I(wip)->i_state &= ~I_LINKABLE;
+	}
+
+	/*
+	 * Set up the target.
+	 */
+	if (target_ip == NULL) {
 		/*
 		 * If target does not exist and the rename crosses
 		 * directories, adjust the target directory link count
@@ -2990,22 +3031,6 @@ xfs_rename(
 			goto out_trans_cancel;
 		}
 	} else { /* target_ip != NULL */
-		/*
-		 * If target exists and it's a directory, check that both
-		 * target and source are directories and that target can be
-		 * destroyed, or that neither is a directory.
-		 */
-		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
-			/*
-			 * Make sure target dir is empty.
-			 */
-			if (!(xfs_dir_isempty(target_ip)) ||
-			    (VFS_I(target_ip)->i_nlink > 2)) {
-				error = -EEXIST;
-				goto out_trans_cancel;
-			}
-		}
-
 		/*
 		 * Link the source inode under the target name.
 		 * If the source inode is a directory and we are moving
@@ -3096,32 +3121,6 @@ xfs_rename(
 	if (error)
 		goto out_trans_cancel;

-	/*
-	 * For whiteouts, we need to bump the link count on the whiteout inode.
-	 * This means that failures all the way up to this point leave the inode
-	 * on the unlinked list and so cleanup is a simple matter of dropping
-	 * the remaining reference to it. If we fail here after bumping the link
-	 * count, we're shutting down the filesystem so we'll never see the
-	 * intermediate state on disk.
-	 */
-	if (wip) {
-		ASSERT(VFS_I(wip)->i_nlink == 0);
-		error = xfs_bumplink(tp, wip);
-		if (error)
-			goto out_trans_cancel;
-		error = xfs_iunlink_remove(tp, wip);
-		if (error)
-			goto out_trans_cancel;
-		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
-
-		/*
-		 * Now we have a real link, clear the "I'm a tmpfile" state
-		 * flag from the inode so it doesn't accidentally get misused in
-		 * future.
-		 */
-		VFS_I(wip)->i_state &= ~I_LINKABLE;
-	}
-
 	xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE);
 	if (new_parent)
From: Brian Foster <bfoster@redhat.com>
stable inclusion from linux-4.19.120 commit 88e28547f049536c589579a3f8bfa1291f2c5251
--------------------------------
commit 4b674b9ac852937af1f8c62f730c325fb6eadcdb upstream.
The filesystem freeze sequence in XFS waits on any background eofblocks or cowblocks scans to complete before the filesystem is quiesced. At this point, the freezer has already stopped the transaction subsystem, however, which means a truncate or cowblock cancellation in progress is likely blocked in transaction allocation. This results in a deadlock between freeze and the associated scanner.
Fix this problem by holding superblock write protection across calls into the block reapers. Since protection for background scans is acquired from the workqueue task context, trylock to avoid a similar deadlock between freeze and blocking on the write lock.
Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap")
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_icache.c | 10 ++++++++++
 fs/xfs/xfs_ioctl.c  | 5 ++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 245483cc282b..901f27ac94ab 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -902,7 +902,12 @@ xfs_eofblocks_worker(
 {
 	struct xfs_mount *mp = container_of(to_delayed_work(work),
 				struct xfs_mount, m_eofblocks_work);
+
+	if (!sb_start_write_trylock(mp->m_super))
+		return;
 	xfs_icache_free_eofblocks(mp, NULL);
+	sb_end_write(mp->m_super);
+
 	xfs_queue_eofblocks(mp);
 }

@@ -929,7 +934,12 @@ xfs_cowblocks_worker(
 {
 	struct xfs_mount *mp = container_of(to_delayed_work(work),
 				struct xfs_mount, m_cowblocks_work);
+
+	if (!sb_start_write_trylock(mp->m_super))
+		return;
 	xfs_icache_free_cowblocks(mp, NULL);
+	sb_end_write(mp->m_super);
+
 	xfs_queue_cowblocks(mp);
 }

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 62b12c11a229..5ef429053cb1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -2182,7 +2182,10 @@ xfs_file_ioctl(
 		if (error)
 			return error;

-		return xfs_icache_free_eofblocks(mp, &keofb);
+		sb_start_write(mp->m_super);
+		error = xfs_icache_free_eofblocks(mp, &keofb);
+		sb_end_write(mp->m_super);
+		return error;
 	}

 	default:
From: Eric Biggers <ebiggers@google.com>
stable inclusion from linux-4.19.120 commit aef81b0ffded319cafa330074d0d91c699416595
--------------------------------
commit 10a98cb16d80be3595fdb165fad898bb28b8b6d2 upstream.
Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set during do_exit(). That can confuse things. In particular, if BSD process accounting is enabled, then do_exit() writes data to an accounting file. If that file has FS_SYNC_FL set, then this write occurs synchronously and can misbehave if PF_MEMALLOC is set.
For example, if the accounting file is located on an XFS filesystem, then a WARN_ON_ONCE() in iomap_do_writepage() is triggered and the data doesn't get written when it should. Or if the accounting file is located on an ext4 filesystem without a journal, then a WARN_ON_ONCE() in ext4_write_inode() is triggered and the inode doesn't get written.
Fix this in xfsaild() by using the helper functions to save and restore PF_MEMALLOC.
This can be reproduced as follows in the kvm-xfstests test appliance modified to add the 'acct' Debian package, and with kvm-xfstests's recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
mkfs.xfs -f /dev/vdb
mount /vdb
touch /vdb/file
chattr +S /vdb/file
accton /vdb/file
mkfs.xfs -f /dev/vdc
mount /vdc
umount /vdc

It causes:

WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
[...]
Call Trace:
 write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
 iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
 xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
 do_writepages+0x41/0xe0 mm/page-writeback.c:2344
 __filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
 file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
 xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
 generic_write_sync include/linux/fs.h:2867 [inline]
 xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
 call_write_iter include/linux/fs.h:1901 [inline]
 new_sync_write+0x130/0x1d0 fs/read_write.c:483
 __kernel_write+0x54/0xe0 fs/read_write.c:515
 do_acct_process+0x122/0x170 kernel/acct.c:522
 slow_acct_process kernel/acct.c:581 [inline]
 acct_process+0x1d4/0x27c kernel/acct.c:607
 do_exit+0x83d/0xbc0 kernel/exit.c:791
 kthread+0xf1/0x140 kernel/kthread.c:257
 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
This bug was originally reported by syzbot at https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.
Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_trans_ail.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index d3a4e89bf4a0..66f167aefd94 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -520,8 +520,9 @@ xfsaild(
 {
 	struct xfs_ail	*ailp = data;
 	long		tout = 0;	/* milliseconds */
+	unsigned int	noreclaim_flag;

-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	set_freezable();

 	while (1) {
@@ -592,6 +593,7 @@ xfsaild(
 		tout = xfsaild_push(ailp);
 	}

+	memalloc_noreclaim_restore(noreclaim_flag);
 	return 0;
 }
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.120 commit 9276babd9d734be6b3aa95a24b834f4e88366dc1
--------------------------------
[ Upstream commit c142932c29e533ee892f87b44d8abc5719edceec ]
In the reflink extent remap function, it turns out that uirec (the block mapping corresponding only to the part of the passed-in mapping that got unmapped) was not fully initialized. Specifically, br_state was not being copied from the passed-in struct to the uirec. This could lead to unpredictable results such as the reflinked mapping being marked unwritten in the destination file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_reflink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f3c393f309e1..6622652a85a8 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1058,6 +1058,7 @@ xfs_reflink_remap_extent(
 		uirec.br_startblock = irec->br_startblock + rlen;
 		uirec.br_startoff = irec->br_startoff + rlen;
 		uirec.br_blockcount = unmap_len - rlen;
+		uirec.br_state = irec->br_state;
 		unmap_len = rlen;

 		/* If this isn't a real mapping, we're done. */
From: Brian Foster <bfoster@redhat.com>
stable inclusion from linux-4.19.129 commit cc9485cd593f1fb306b78c25a5aaca5d5c4510b7
--------------------------------
[ Upstream commit b6983e80b03bd4fd42de71993b3ac7403edac758 ]
The buffer write failure flag is intended to control the internal write retry that XFS has historically implemented to help mitigate the severity of transient I/O errors. The flag is set when a buffer is resubmitted from the I/O completion path due to a previous failure. It is checked on subsequent I/O completions to skip the internal retry and fall through to the higher level configurable error handling mechanism. The flag is cleared in the synchronous and delwri submission paths and also checked in various places to log write failure messages.
There are a couple minor problems with the current usage of this flag. One is that we issue an internal retry after every submission from xfsaild due to how delwri submission clears the flag. This results in double the expected or configured number of write attempts when under sustained failures. Another more subtle issue is that the flag is never cleared on successful I/O completion. This can cause xfs_wait_buftarg() to suggest that dirty buffers are being thrown away due to the existence of the flag, when the reality is that the flag might still be set because the write succeeded on the retry.
Clear the write failure flag on successful I/O completion to address both of these problems. This means that the internal retry attempt occurs once since the last time a buffer write failed and that various other contexts only see the flag set when the immediately previous write attempt has failed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_buf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c1f7c0d5d608..b33a9cd4fe94 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1202,8 +1202,10 @@ xfs_buf_ioend(
 			bp->b_ops->verify_read(bp);
 	}

-	if (!bp->b_error)
+	if (!bp->b_error) {
+		bp->b_flags &= ~XBF_WRITE_FAIL;
 		bp->b_flags |= XBF_DONE;
+	}

 	if (bp->b_iodone)
 		(*(bp->b_iodone))(bp);
@@ -1263,7 +1265,7 @@ xfs_bwrite(

 	bp->b_flags |= XBF_WRITE;
 	bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
-			 XBF_WRITE_FAIL | XBF_DONE);
+			 XBF_DONE);

 	error = xfs_buf_submit(bp);
 	if (error) {
@@ -2000,7 +2002,7 @@ xfs_buf_delwri_submit_buffers(
 		 * synchronously. Otherwise, drop the buffer from the delwri
 		 * queue and submit async.
 		 */
-		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
+		bp->b_flags &= ~_XBF_DELWRI_Q;
 		bp->b_flags |= XBF_WRITE;
 		if (wait_list) {
 			bp->b_flags &= ~XBF_ASYNC;
From: Brian Foster <bfoster@redhat.com>
stable inclusion from linux-4.19.129 commit edd948273038641da2aafa666a6666d6928777ee
--------------------------------
[ Upstream commit 629dcb38dc351947ed6a26a997d4b587f3bd5c7e ]
The pre-flush dquot verification in xfs_qm_dqflush() duplicates the read verifier by checking the dquot in the on-disk buffer. Instead, verify the in-core variant before it is flushed to the buffer.
Fixes: 7224fa482a6d ("xfs: add full xfs_dqblk verifier")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_dquot.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index a1af984e4913..59b2b29542f4 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1120,13 +1120,12 @@ xfs_qm_dqflush(
 	dqb = bp->b_addr + dqp->q_bufoffset;
 	ddqp = &dqb->dd_diskdq;

-	/*
-	 * A simple sanity check in case we got a corrupted dquot.
-	 */
-	fa = xfs_dqblk_verify(mp, dqb, be32_to_cpu(ddqp->d_id), 0);
+	/* sanity check the in-core structure before we flush */
+	fa = xfs_dquot_verify(mp, &dqp->q_core, be32_to_cpu(dqp->q_core.d_id),
+			      0);
 	if (fa) {
 		xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
-				be32_to_cpu(ddqp->d_id), fa);
+				be32_to_cpu(dqp->q_core.d_id), fa);
 		xfs_buf_relse(bp);
 		xfs_dqfunlock(dqp);
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
From: Masami Hiramatsu <mhiramat@kernel.org>
stable inclusion from linux-4.19.142 commit 6cb22ed4f00f73cb5e36581d42542ce1192c99b3
--------------------------------
[ Upstream commit 12d572e785b15bc764e956caaa8a4c846fd15694 ]
Fix the memory leakage in debuginfo__find_trace_events() when the probe point is not found in the debuginfo. If there is no probe point found in the debuginfo, debuginfo__find_probes() will NOT return -ENOENT, but 0.
Thus the caller of debuginfo__find_probes() must check the tf.ntevs and release the allocated memory for the array of struct probe_trace_event.
The current code releases the memory only if debuginfo__find_probes() hits an error, without checking tf.ntevs. As a result, the memory allocated for *tevs is not released when tf.ntevs == 0.
This fixes the memory leakage by checking tf.ntevs == 0 in addition to ret < 0.
Fixes: ff741783506c ("perf probe: Introduce debuginfo to encapsulate dwarf information")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/159438668346.62703.10887420400718492503.stgit@de...
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 tools/perf/util/probe-finder.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index 7ccabb891e5a..9d51b0c1f3a6 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -1357,7 +1357,7 @@ int debuginfo__find_trace_events(struct debuginfo *dbg,
 	tf.ntevs = 0;

 	ret = debuginfo__find_probes(dbg, &tf.pf);
-	if (ret < 0) {
+	if (ret < 0 || tf.ntevs == 0) {
 		for (i = 0; i < tf.ntevs; i++)
 			clear_probe_trace_event(&tf.tevs[i]);
 		zfree(tevs);
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.142 commit 17c08ee00b59067ed0537d91fd89873b3bba6491
--------------------------------
[ Upstream commit bbe98f9cadff58cdd6a4acaeba0efa8565dabe65 ]
Move collapse_huge_page()'s mmget_still_valid() check into khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP only, and earned its mmget_still_valid() check because it inserts a huge pmd entry in place of the page table's pmd entry; whereas collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp() merely clears the page table's pmd entry. But core dumping without mmap lock must have been as open to mistaking a racily cleared pmd entry for a page table at physical page 0, as exit_mmap() was. And we certainly have no interest in mapping as a THP once dumping core.
Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: stable@vger.kernel.org [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/khugepaged.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 483c4573695a..fbb3ac9ce086 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -394,7 +394,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,

 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
 }

 static bool hugepage_vma_check(struct vm_area_struct *vma,
@@ -1005,9 +1005,6 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	down_write(&mm->mmap_sem);
-	result = SCAN_ANY_PROCESS;
-	if (!mmget_still_valid(mm))
-		goto out;
 	result = hugepage_vma_revalidate(mm, address, &vma);
 	if (result)
 		goto out;
From: Hugh Dickins <hughd@google.com>
stable inclusion from linux-4.19.142 commit 2ef7ebb143705147bc74db2bc1bd0214c212e6bc
--------------------------------
[ Upstream commit f3f99d63a8156c7a4a6b20aac22b53c5579c7dc1 ]
syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in __khugepaged_enter(): yes, when one thread is about to dump core, has set core_state, and is waiting for others, another might do something calling __khugepaged_enter(), which now crashes because I lumped the core_state test (known as "mmget_still_valid") into khugepaged_test_exit(). I still think it's best to lump them together, so just in this exceptional case, check mm->mm_users directly instead of khugepaged_test_exit().
Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fbb3ac9ce086..f37be43f8cae 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -427,7 +427,7 @@ int __khugepaged_enter(struct mm_struct *mm)
 		return -ENOMEM;

 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
From: Wei Yongjun <weiyongjun1@huawei.com>
stable inclusion from linux-4.19.142 commit 3a54b901fd77e6f44a6ed6af2961671287e77fdf
--------------------------------
commit 71e843295c680898959b22dc877ae3839cc22470 upstream.
kmemleak report memory leak as follows:
unreferenced object 0x607ee4e5f948 (size 8):
  comm "syz-executor.1", pid 2098, jiffies 4295031601 (age 288.468s)
  hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00  ........
  backtrace:
    relay_open kernel/relay.c:583 [inline]
    relay_open+0xb6/0x970 kernel/relay.c:563
    do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557
    __blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597
    blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738
    blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613
    block_ioctl+0xe5/0x120 fs/block_dev.c:1871
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739
    do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

'chan->buf' is allocated in relay_open() by alloc_percpu() but is not freed when the relay channel is destroyed. Fix it by adding free_percpu() before returning from relay_destroy_channel().

Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Akash Goel <akash.goel@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 kernel/relay.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/relay.c b/kernel/relay.c
index 078f733bebae..5ad6cd2483ae 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -197,6 +197,7 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan)
 static void relay_destroy_channel(struct kref *kref)
 {
 	struct rchan *chan = container_of(kref, struct rchan, kref);
+	free_percpu(chan->buf);
 	kfree(chan);
 }
From: Doug Berger <opendmb@gmail.com>
stable inclusion from linux-4.19.142 commit 84b8dc232afadf3aab425a104def45a1e7346a58
--------------------------------
commit e08d3fdfe2dafa0331843f70ce1ff6c1c4900bf4 upstream.
The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function.
The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file.
The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting.
The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot.
This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged.
In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout
cma: Reserved 256 MiB at 0x0000000030000000
Zone ranges:
  DMA      [mem 0x0000000000000000-0x000000002fffffff]
  Normal   empty
  HighMem  [mem 0x0000000030000000-0x000000003fffffff]
would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily.
Funnily enough
$ cat /proc/sys/vm/lowmem_reserve_ratio
would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve.
This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values.
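The initcall-level guarantee being relied on here has a convenient userspace analogue in GCC constructor priorities, which may make the fix easier to see. This is an illustration only, not kernel code: two constructors with the same priority run in link order (the fragile situation described above), while distinct priorities impose a fixed order, which is what switching to postcore_initcall achieves.

#include <stdio.h>

/* Priorities 101+ are available to applications; lower runs first. */
__attribute__((constructor(102)))
static void account_cma_pages(void)  /* plays the role of cma_init_reserved_areas() */
{
        puts("1: CMA pages added to the zones' managed counts");
}

__attribute__((constructor(103)))
static void compute_lowmem_reserve(void)  /* plays the role of init_per_zone_wmark_min() */
{
        puts("2: lowmem_reserve computed from managed counts");
}

int main(void)
{
        return 0;  /* output order is fixed regardless of link order */
}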
Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fb49bcdb4da..5818f5f2fe88 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7516,7 +7516,7 @@ int __meminit init_per_zone_wmark_min(void)

 	return 0;
 }
-core_initcall(init_per_zone_wmark_min)
+postcore_initcall(init_per_zone_wmark_min)

 /*
  * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
From: Charan Teja Reddy <charante@codeaurora.org>
stable inclusion from linux-4.19.142 commit c666936d8d8b0ace4f3260d71a4eedefd53011d9
--------------------------------
commit 88e8ac11d2ea3acc003cf01bb5a38c8aa76c3cfd upstream.
The following race is observed with repeated online and offline operations, with a delay between two successive onlines of memory blocks of the movable zone.

P1: Onlines the first memory block in the movable zone. The pcp struct
    values are initialized to the defaults, i.e., pcp->high = 0 and
    pcp->batch = 1.

P2: Allocates pages from the movable zone.

P1: Tries to online the second memory block in the movable zone; it has
    entered online_pages() but has not yet called zone_pcp_update().

P2: Enters the exit path and tries to release its order-0 pages to the
    pcp lists through free_unref_page_commit(). As pcp->high = 0 and
    pcp->count = 1, it proceeds to call free_pcppages_bulk().

P1: Updates the pcp values, so the new values are, say, pcp->high = 378
    and pcp->batch = 63.

P2: Reads the pcp's batch value using READ_ONCE() and passes it to
    free_pcppages_bulk(); the values passed here are batch = 63,
    count = 1.

Since the number of pages in the pcp lists is less than ->batch, P2 gets stuck in the while (list_empty(list)) loop with interrupts disabled, hanging the core.

Avoid this by ensuring free_pcppages_bulk() is called with the proper count of pcp list pages.

The mentioned race is somewhat easily reproducible without [1] because the pcp's are not updated for the first memory block online, so there is enough of a race window for P2 between the alloc+free and the update of the pcp struct values through the onlining of the second memory block.
With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself.
This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet).
[1]: https://patchwork.kernel.org/patch/11696389/
Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: stable@vger.kernel.org [2.6+]
Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeauro...
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5818f5f2fe88..e394ae5ba2a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1116,6 +1116,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	struct page *page, *tmp;
 	LIST_HEAD(head);

+	/*
+	 * Ensure proper count is passed which otherwise would stuck in the
+	 * below while (list_empty(list)) loop.
+	 */
+	count = min(pcp->count, count);
 	while (count) {
 		struct list_head *list;
From: Jan Kara <jack@suse.cz>
stable inclusion from linux-4.19.142 commit a5d3f789b272d76dadca4d35365a39ef516b31bf
--------------------------------
commit 7303cb5bfe845f7d43cd9b2dbd37dbb266efda9b upstream.
ext4_search_dir() and ext4_generic_delete_entry() can be called both for standard directory blocks and for inline directories stored inside the inode or inline xattr space. For the latter case we didn't call ext4_check_dir_entry() with the proper constraints, which could result in accepting a corrupted directory entry as well as false-positive filesystem errors like:
EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400: block 113246792: comm dockerd: bad entry in directory: directory entry too close to block end - offset=0, inode=28320403, rec_len=32, name_len=8, size=4096
Fix the arguments passed to ext4_check_dir_entry().
Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/ext4/namei.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 8118c31dd6a6..b24864418524 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1309,8 +1309,8 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
 		    ext4_match(fname, de)) {
 			/* found a match - just to be sure, do
 			 * a full check */
-			if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
-						 bh->b_size, offset))
+			if (ext4_check_dir_entry(dir, NULL, de, bh, search_buf,
+						 buf_size, offset))
 				return -1;
 			*res_dir = de;
 			return 1;
@@ -2354,7 +2354,7 @@ int ext4_generic_delete_entry(handle_t *handle,
 	de = (struct ext4_dir_entry_2 *)entry_buf;
 	while (i < buf_size - csum_size) {
 		if (ext4_check_dir_entry(dir, NULL, de, bh,
-					 bh->b_data, bh->b_size, i))
+					 entry_buf, buf_size, i))
 			return -EFSCORRUPTED;
 		if (de == de_del) {
 			if (pde)
From: "zhangyi (F)" yi.zhang@huawei.com
stable inclusion from linux-4.19.142 commit 402ff143b90b48c1d1b29127fa538a2d227c2161
--------------------------------
commit ef3f5830b859604eda8723c26d90ab23edc027a4 upstream.
jbd2_write_superblock() is called under the buffer lock of the journal superblock before ending that superblock write, so add the missing unlock_buffer() in the error path before submitting the buffer.

Fixes: 742b06b5628f ("jbd2: check superblock mapped prior to committing")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/jbd2/journal.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 344f5cf53fb5..f1bc1c82c480 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1366,8 +1366,10 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
 	int ret;

 	/* Buffer got discarded which means block device got invalidated */
-	if (!buffer_mapped(bh))
+	if (!buffer_mapped(bh)) {
+		unlock_buffer(bh);
 		return -EIO;
+	}

 	trace_jbd2_write_superblock(journal, write_flags);
 	if (!(journal->j_flags & JBD2_BARRIER))
From: Lukas Wunner <lukas@wunner.de>
stable inclusion from linux-4.19.142 commit 30872ac7dbf038869a2489d64bb8923cdb27d756
--------------------------------
[ Upstream commit ddf75be47ca748f8b12d28ac64d624354fddf189 ]
CONFIG_OF_DYNAMIC and CONFIG_ACPI allow adding SPI devices at runtime using a DeviceTree overlay or DSDT patch. CONFIG_SPI_SLAVE allows the same via sysfs.
But there are no precautions to prevent adding a device below a controller that's being removed. Such a device is unusable and may not even be able to unbind cleanly as it becomes inaccessible once the controller has been torn down. E.g. it is then impossible to quiesce the device's interrupt.
of_spi_notify() and acpi_spi_notify() do hold a ref on the controller, but otherwise run lockless against spi_unregister_controller().
Fix by holding the spi_add_lock in spi_unregister_controller() and bailing out of spi_add_device() if the controller has been unregistered concurrently.
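Schematically, the fix is a check-under-lock pattern. The sketch below is a userspace model with hypothetical names, not the SPI core itself; it only shows how taking the same mutex in both paths closes the window.

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t add_lock = PTHREAD_MUTEX_INITIALIZER;
static bool ctlr_registered = true;

/* Models spi_add_device(): re-check registration while holding the lock. */
static int add_device(void)
{
        int status = 0;

        pthread_mutex_lock(&add_lock);
        if (!ctlr_registered)
                status = -ENODEV;  /* controller went away: bail out */
        /* else: ... set up and register the new device ... */
        pthread_mutex_unlock(&add_lock);
        return status;
}

/* Models spi_unregister_controller(): exclude adders during teardown. */
static void unregister_controller(void)
{
        pthread_mutex_lock(&add_lock);
        ctlr_registered = false;
        /* ... unregister child devices and tear down the controller ... */
        pthread_mutex_unlock(&add_lock);
}

int main(void)
{
        unregister_controller();
        return add_device() == -ENODEV ? 0 : 1;  /* a late add is refused */
}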
Fixes: ce79d54ae447 ("spi/of: Add OF notifier handler")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v3.19+
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Octavian Purdila <octavian.purdila@intel.com>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
Link: https://lore.kernel.org/r/a8c3205088a969dc8410eec1eba9aface60f36af.159645103...
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 drivers/spi/Kconfig | 3 +++
 drivers/spi/spi.c   | 21 ++++++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/spi/Kconfig b/drivers/spi/Kconfig
index 671d078349cc..0a7fd56c1ed9 100644
--- a/drivers/spi/Kconfig
+++ b/drivers/spi/Kconfig
@@ -817,4 +817,7 @@ config SPI_SLAVE_SYSTEM_CONTROL

 endif # SPI_SLAVE

+config SPI_DYNAMIC
+	def_bool ACPI || OF_DYNAMIC || SPI_SLAVE
+
 endif # SPI
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index b8b9f6d88016..f4b3ca0a7d13 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -432,6 +432,12 @@ static LIST_HEAD(spi_controller_list);
  */
 static DEFINE_MUTEX(board_lock);

+/*
+ * Prevents addition of devices with same chip select and
+ * addition of devices below an unregistering controller.
+ */
+static DEFINE_MUTEX(spi_add_lock);
+
 /**
  * spi_alloc_device - Allocate a new SPI device
  * @ctlr: Controller to which device is connected
@@ -510,7 +516,6 @@ static int spi_dev_check(struct device *dev, void *data)
  */
 int spi_add_device(struct spi_device *spi)
 {
-	static DEFINE_MUTEX(spi_add_lock);
 	struct spi_controller *ctlr = spi->controller;
 	struct device *dev = ctlr->dev.parent;
 	int status;
@@ -538,6 +543,13 @@ int spi_add_device(struct spi_device *spi)
 		goto done;
 	}

+	/* Controller may unregister concurrently */
+	if (IS_ENABLED(CONFIG_SPI_DYNAMIC) &&
+	    !device_is_registered(&ctlr->dev)) {
+		status = -ENODEV;
+		goto done;
+	}
+
 	if (ctlr->cs_gpios)
 		spi->cs_gpio = ctlr->cs_gpios[spi->chip_select];

@@ -2306,6 +2318,10 @@ void spi_unregister_controller(struct spi_controller *ctlr)
 	struct spi_controller *found;
 	int id = ctlr->bus_num;

+	/* Prevent addition of new devices, unregister existing ones */
+	if (IS_ENABLED(CONFIG_SPI_DYNAMIC))
+		mutex_lock(&spi_add_lock);
+
 	device_for_each_child(&ctlr->dev, NULL, __unregister);

 	/* First make sure that this controller was ever added */
@@ -2326,6 +2342,9 @@ void spi_unregister_controller(struct spi_controller *ctlr)
 	if (found == ctlr)
 		idr_remove(&spi_master_idr, id);
 	mutex_unlock(&board_lock);
+
+	if (IS_ENABLED(CONFIG_SPI_DYNAMIC))
+		mutex_unlock(&spi_add_lock);
 }
 EXPORT_SYMBOL_GPL(spi_unregister_controller);
From: Bodo Stroesser <bstroesser@ts.fujitsu.com>
stable inclusion from linux-4.19.142 commit 7d057ec39676bcb5fbf8177dad5f781f86e7d316
--------------------------------
[ Upstream commit 3145550a7f8b08356c8ff29feaa6c56aca12901d ]
This patch fixes the following crash (see https://bugzilla.kernel.org/show_bug.cgi?id=208045)
Process iscsi_trx (pid: 7496, stack limit = 0x0000000010dd111a)
CPU: 0 PID: 7496 Comm: iscsi_trx Not tainted 4.19.118-0419118-generic #202004230533
Hardware name: Greatwall QingTian DF720/F601, BIOS 601FBE20 Sep 26 2019
pstate: 80400005 (Nzcv daif +PAN -UAO)
pc : flush_dcache_page+0x18/0x40
lr : is_ring_space_avail+0x68/0x2f8 [target_core_user]
sp : ffff000015123a80
x29: ffff000015123a80 x28: 0000000000000000
x27: 0000000000001000 x26: ffff000023ea5000
x25: ffffcfa25bbe08b8 x24: 0000000000000078
x23: ffff7e0000000000 x22: ffff000023ea5001
x21: ffffcfa24b79c000 x20: 0000000000000fff
x19: ffff7e00008fa940 x18: 0000000000000000
x17: 0000000000000000 x16: ffff2d047e709138
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: ffff2d047fbd0a40
x11: 0000000000000000 x10: 0000000000000030
x9 : 0000000000000000 x8 : ffffc9a254820a00
x7 : 00000000000013b0 x6 : 000000000000003f
x5 : 0000000000000040 x4 : ffffcfa25bbe08e8
x3 : 0000000000001000 x2 : 0000000000000078
x1 : ffffcfa25bbe08b8 x0 : ffff2d040bc88a18
Call trace:
 flush_dcache_page+0x18/0x40
 is_ring_space_avail+0x68/0x2f8 [target_core_user]
 queue_cmd_ring+0x1f8/0x680 [target_core_user]
 tcmu_queue_cmd+0xe4/0x158 [target_core_user]
 __target_execute_cmd+0x30/0xf0 [target_core_mod]
 target_execute_cmd+0x294/0x390 [target_core_mod]
 transport_generic_new_cmd+0x1e8/0x358 [target_core_mod]
 transport_handle_cdb_direct+0x50/0xb0 [target_core_mod]
 iscsit_execute_cmd+0x2b4/0x350 [iscsi_target_mod]
 iscsit_sequence_cmd+0xd8/0x1d8 [iscsi_target_mod]
 iscsit_process_scsi_cmd+0xac/0xf8 [iscsi_target_mod]
 iscsit_get_rx_pdu+0x404/0xd00 [iscsi_target_mod]
 iscsi_target_rx_thread+0xb8/0x130 [iscsi_target_mod]
 kthread+0x130/0x138
 ret_from_fork+0x10/0x18
Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (f9400260)
---[ end trace 1e451c73f4266776 ]---
The solution is based on patch:
"scsi: target: tcmu: Optimize use of flush_dcache_page"
which restricts the use of tcmu_flush_dcache_range() to addresses from vmalloc'ed areas only.
This patch now replaces the virt_to_page() call in tcmu_flush_dcache_range() - which is wrong for vmalloc'ed addresses - with vmalloc_to_page().
The patch was tested on ARM with kernel 4.19.118 and 5.7.2
Link: https://lore.kernel.org/r/20200618131632.32748-3-bstroesser@ts.fujitsu.com
Tested-by: JiangYu <lnsyyj@hotmail.com>
Tested-by: Daniel Meyerholt <dxm523@gmail.com>
Acked-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Bodo Stroesser <bstroesser@ts.fujitsu.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 drivers/target/target_core_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 7ee0a75ce452..368d9f658b21 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -612,7 +612,7 @@ static inline void tcmu_flush_dcache_range(void *vaddr, size_t size)
 	size = round_up(size+offset, PAGE_SIZE);

 	while (size) {
-		flush_dcache_page(virt_to_page(start));
+		flush_dcache_page(vmalloc_to_page(start));
 		start += PAGE_SIZE;
 		size -= PAGE_SIZE;
 	}
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.142 commit 1bc31e520faf8af5555030e7662c0577f25a1be0
--------------------------------
[ Upstream commit f959b5d037e71a4d69b5bf71faffa065d9269b4a ]
xfs_trans_dqresv is the function that we use to make reservations against resource quotas. Each resource contains two counters: the q_core counter, which tracks resources allocated on disk; and the dquot reservation counter, which tracks how much of that resource has either been allocated or reserved by threads that are working on metadata updates.
For disk blocks, we compare the proposed reservation counter against the hard and soft limits to decide if we're going to fail the operation. However, for inodes we inexplicably compare against the q_core counter, not the incore reservation count.
Since the q_core counter is always lower than the reservation count and we unlock the dquot between reservation and transaction commit, this means that multiple threads can reserve the last inode count before we hit the hard limit, and when they commit, we'll be well over the hard limit.
Fix this by checking against the incore inode reservation counter, since we would appear to maintain that correctly (and that's what we report in GETQUOTA).
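As a hedged illustration of the race (the numbers are invented): with an inode hard limit of 100 and q_core.d_icount at 98, three threads can each pass the old check (98 + 1 = 99, under the limit) before any of them commits; once all three transactions commit, 101 inodes are allocated. Checking dqp->q_res_icount instead makes the third thread see 100 + 1 = 101 and fail with EDQUOT.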
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Reviewed-by: Allison Collins allison.henderson@oracle.com
Reviewed-by: Chandan Babu R chandanrlinux@gmail.com
Reviewed-by: Christoph Hellwig hch@lst.de
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/xfs/xfs_trans_dquot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index c23257a26c2b..b8f05d5909b5 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -657,7 +657,7 @@ xfs_trans_dqresv(
 		}
 	}
 	if (ninos > 0) {
-		total_count = be64_to_cpu(dqp->q_core.d_icount) + ninos;
+		total_count = dqp->q_res_icount + ninos;
 		timer = be32_to_cpu(dqp->q_core.d_itimer);
 		warns = be16_to_cpu(dqp->q_core.d_iwarns);
 		warnlimit = dqp->q_mount->m_quotainfo->qi_iwarnlimit;
From: Mao Wenan wenan.mao@linux.alibaba.com
stable inclusion from linux-4.19.142 commit 9f489d30979bd352ffbd555020b6ca79efb93b89
--------------------------------
[ Upstream commit 481a0d7422db26fb63e2d64f0652667a5c6d0f3e ]
A loop can occur if vq->broken is true: virtqueue_get_buf_ctx_packed or virtqueue_get_buf_ctx_split will return NULL, so virtnet_poll will reschedule napi to receive packets, driving CPU usage (si) to 100%.

call trace as below:

virtnet_poll
	virtnet_receive
		virtqueue_get_buf_ctx
			virtqueue_get_buf_ctx_packed
			virtqueue_get_buf_ctx_split
	virtqueue_napi_complete
		virtqueue_poll           //return true
			virtqueue_napi_schedule //it will reschedule napi

To fix this, return false if vq is broken in virtqueue_poll.
Signed-off-by: Mao Wenan wenan.mao@linux.alibaba.com
Acked-by: Michael S. Tsirkin mst@redhat.com
Link: https://lore.kernel.org/r/1596354249-96204-1-git-send-email-wenan.mao@linux....
Signed-off-by: Michael S. Tsirkin mst@redhat.com
Acked-by: Jason Wang jasowang@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/virtio/virtio_ring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6228b48d1e12..df7980aef927 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -828,6 +828,9 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);

+	if (unlikely(vq->broken))
+		return false;
+
 	virtio_mb(vq->weak_barriers);
 	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
 }
From: Eiichi Tsukata devel@etsukata.com
stable inclusion from linux-4.19.142 commit c90652abae15465210207418ac8c1ecb230e3b1e
--------------------------------
[ Upstream commit 96cf2a2c75567ff56195fe3126d497a2e7e4379f ]
If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN shows the following warning:
UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
member access within null pointer of type 'struct xfs_kobj'
Call Trace:
 dump_stack+0x10e/0x195
 ubsan_type_mismatch_common+0x241/0x280
 __ubsan_handle_type_mismatch_v1+0x32/0x40
 init_xfs_fs+0x12b/0x28f
 do_one_initcall+0xdd/0x1d0
 do_initcall_level+0x151/0x1b6
 do_initcalls+0x50/0x8f
 do_basic_setup+0x29/0x2b
 kernel_init_freeable+0x19f/0x20b
 kernel_init+0x11/0x1e0
 ret_from_fork+0x22/0x30
Fix it by checking parent_kobj before the code accesses its member.
Signed-off-by: Eiichi Tsukata devel@etsukata.com
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com
[darrick: minor whitespace edits]
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/xfs/xfs_sysfs.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_sysfs.h b/fs/xfs/xfs_sysfs.h
index e9f810fc6731..43585850f154 100644
--- a/fs/xfs/xfs_sysfs.h
+++ b/fs/xfs/xfs_sysfs.h
@@ -32,9 +32,11 @@ xfs_sysfs_init(
 	struct xfs_kobj		*parent_kobj,
 	const char		*name)
 {
+	struct kobject		*parent;
+
+	parent = parent_kobj ? &parent_kobj->kobject : NULL;
 	init_completion(&kobj->complete);
-	return kobject_init_and_add(&kobj->kobject, ktype,
-				    &parent_kobj->kobject, "%s", name);
+	return kobject_init_and_add(&kobj->kobject, ktype, parent, "%s", name);
 }

 static inline void
From: Helge Deller deller@gmx.de
stable inclusion from linux-4.19.142 commit 7c89e40ede2b0f1e292fb54d2e73fe952b4d44d8
--------------------------------
[ Upstream commit a089e3fd5a82aea20f3d9ec4caa5f4c65cc2cfcc ]
The kernel signalfd4() syscall returns different error codes when called either in compat or native mode. This behaviour makes correct emulation in qemu and testing programs like LTP more complicated.
Fix the code to always return EFAULT, in both modes, for inaccessible user memory, and EINVAL when called with an invalid signal mask.
Signed-off-by: Helge Deller deller@gmx.de
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Cc: Alexander Viro viro@zeniv.linux.org.uk
Cc: Laurent Vivier laurent@vivier.eu
Link: http://lkml.kernel.org/r/20200530100707.GA10159@ls3530.fritz.box
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/signalfd.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/signalfd.c b/fs/signalfd.c
index 4fcd1498acf5..3c40a3bf772c 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -313,9 +313,10 @@ SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask,
 {
 	sigset_t mask;

-	if (sizemask != sizeof(sigset_t) ||
-	    copy_from_user(&mask, user_mask, sizeof(mask)))
+	if (sizemask != sizeof(sigset_t))
 		return -EINVAL;
+	if (copy_from_user(&mask, user_mask, sizeof(mask)))
+		return -EFAULT;
 	return do_signalfd4(ufd, &mask, flags);
 }

@@ -324,9 +325,10 @@ SYSCALL_DEFINE3(signalfd, int, ufd, sigset_t __user *, user_mask,
 {
 	sigset_t mask;

-	if (sizemask != sizeof(sigset_t) ||
-	    copy_from_user(&mask, user_mask, sizeof(mask)))
+	if (sizemask != sizeof(sigset_t))
 		return -EINVAL;
+	if (copy_from_user(&mask, user_mask, sizeof(mask)))
+		return -EFAULT;
 	return do_signalfd4(ufd, &mask, 0);
 }
From: Alex Williamson alex.williamson@redhat.com
stable inclusion from linux-4.19.142 commit d200964e4deaa72f2a7d2eaefc9953d3df8784ce
--------------------------------
[ Upstream commit aae7a75a821a793ed6b8ad502a5890fb8e8f172d ]
The vfio_iommu_replay() function does not currently unwind on error, yet it does pin pages, perform IOMMU mapping, and modify the vfio_dma structure to indicate IOMMU mapping. The IOMMU mappings are torn down when the domain is destroyed, but the other actions go on to cause trouble later. For example, the iommu->domain_list can be empty if we only have a non-IOMMU backed mdev attached. We don't currently check if the list is empty before getting the first entry in the list, which leads to a bogus domain pointer. If a vfio_dma entry is erroneously marked as iommu_mapped, we'll attempt to use that bogus pointer to retrieve the existing physical page addresses.
This is the scenario that uncovered this issue, attempting to hot-add a vfio-pci device to a container with an existing mdev device and DMA mappings, one of which could not be pinned, causing a failure adding the new group to the existing container and setting the conditions for a subsequent attempt to explode.
To resolve this, we can first check if the domain_list is empty so that we can reject replay of a bogus domain, should we ever encounter this inconsistent state again in the future. The real fix though is to add the necessary unwind support, which means cleaning up the current pinning if an IOMMU mapping fails, then walking back through the r-b tree of DMA entries, reading from the IOMMU which ranges are mapped, and unmapping and unpinning those ranges. To be able to do this, we also defer marking the DMA entry as IOMMU mapped until all entries are processed, in order to allow the unwind to know the disposition of each entry.
Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Reported-by: Zhiyi Guo zhguo@redhat.com
Tested-by: Zhiyi Guo zhguo@redhat.com
Reviewed-by: Cornelia Huck cohuck@redhat.com
Signed-off-by: Alex Williamson alex.williamson@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/vfio/vfio_iommu_type1.c | 71 ++++++++++++++++++++++++++++++---
 1 file changed, 66 insertions(+), 5 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 01be1496307b..2bde1a098a29 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1269,13 +1269,16 @@ static int vfio_bus_type(struct device *dev, void *data) static int vfio_iommu_replay(struct vfio_iommu *iommu, struct vfio_domain *domain) { - struct vfio_domain *d; + struct vfio_domain *d = NULL; struct rb_node *n; unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; int ret;
/* Arbitrarily pick the first domain in the list for lookups */ - d = list_first_entry(&iommu->domain_list, struct vfio_domain, next); + if (!list_empty(&iommu->domain_list)) + d = list_first_entry(&iommu->domain_list, + struct vfio_domain, next); + n = rb_first(&iommu->dma_list);
for (; n; n = rb_next(n)) { @@ -1293,6 +1296,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, phys_addr_t p; dma_addr_t i;
+ if (WARN_ON(!d)) { /* mapped w/o a domain?! */ + ret = -EINVAL; + goto unwind; + } + phys = iommu_iova_to_phys(d->domain, iova);
if (WARN_ON(!phys)) { @@ -1323,7 +1331,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, if (npage <= 0) { WARN_ON(!npage); ret = (int)npage; - return ret; + goto unwind; }
phys = pfn << PAGE_SHIFT; @@ -1332,14 +1340,67 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
ret = iommu_map(domain->domain, iova, phys, size, dma->prot | domain->prot); - if (ret) - return ret; + if (ret) { + if (!dma->iommu_mapped) + vfio_unpin_pages_remote(dma, iova, + phys >> PAGE_SHIFT, + size >> PAGE_SHIFT, + true); + goto unwind; + }
iova += size; } + } + + /* All dmas are now mapped, defer to second tree walk for unwind */ + for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) { + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node); + dma->iommu_mapped = true; } + return 0; + +unwind: + for (; n; n = rb_prev(n)) { + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node); + dma_addr_t iova; + + if (dma->iommu_mapped) { + iommu_unmap(domain->domain, dma->iova, dma->size); + continue; + } + + iova = dma->iova; + while (iova < dma->iova + dma->size) { + phys_addr_t phys, p; + size_t size; + dma_addr_t i; + + phys = iommu_iova_to_phys(domain->domain, iova); + if (!phys) { + iova += PAGE_SIZE; + continue; + } + + size = PAGE_SIZE; + p = phys + size; + i = iova + size; + while (i < dma->iova + dma->size && + p == iommu_iova_to_phys(domain->domain, i)) { + size += PAGE_SIZE; + p += PAGE_SIZE; + i += PAGE_SIZE; + } + + iommu_unmap(domain->domain, iova, size); + vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT, + size >> PAGE_SHIFT, true); + } + } + + return ret; }
static int vfio_iommu_mm_exit(struct device *dev, int pasid, void *data)
From: Selvin Xavier selvin.xavier@broadcom.com
stable inclusion from linux-4.19.142 commit f0e3b33f47cfe719868e15b3a5d5d25f17312ef6
--------------------------------
[ Upstream commit a812f2d60a9fb7818f9c81f967180317b52545c0 ]
The driver should add only kernel qps to the flush list for cleanup. During async error events from the HW, the driver was adding qps to this list without checking whether the qp is a kernel qp.

Add a check to avoid adding user qps to the flush list.
Fixes: 942c9b6ca8de ("RDMA/bnxt_re: Avoid Hard lockup during error CQE processing")
Fixes: c50866e2853a ("bnxt_re: fix the regression due to changes in alloc_pbl")
Link: https://lore.kernel.org/r/1596689148-4023-1-git-send-email-selvin.xavier@bro...
Signed-off-by: Selvin Xavier selvin.xavier@broadcom.com
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/infiniband/hw/bnxt_re/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index 589b0d4677d5..f1b666c80f36 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -753,7 +753,8 @@ static int bnxt_re_handle_qp_async_event(struct creq_qp_event *qp_event,
 	struct ib_event event;
 	unsigned int flags;

-	if (qp->qplib_qp.state == CMDQ_MODIFY_QP_NEW_STATE_ERR) {
+	if (qp->qplib_qp.state == CMDQ_MODIFY_QP_NEW_STATE_ERR &&
+	    rdma_is_kernel_res(&qp->ib_qp.res)) {
 		flags = bnxt_re_lock_cqs(qp);
 		bnxt_qplib_add_flush_qp(&qp->qplib_qp);
 		bnxt_re_unlock_cqs(qp, flags);
From: Li Heng liheng40@huawei.com
stable inclusion from linux-4.19.142 commit 2ff3c97b47521d6700cc6485c7935908dcd2c27c
--------------------------------
commit 98086df8b70c06234a8f4290c46064e44dafa0ed upstream.
destroy_workqueue() should be called to destroy efi_rts_wq when efisubsys_init() fails to initialize resources.
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot hulkci@huawei.com
Signed-off-by: Li Heng liheng40@huawei.com
Link: https://lore.kernel.org/r/1595229738-10087-1-git-send-email-liheng40@huawei....
Signed-off-by: Ard Biesheuvel ardb@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/firmware/efi/efi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index f386a786aa70..67cbca754f36 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -361,6 +361,7 @@ static int __init efisubsys_init(void)
 	efi_kobj = kobject_create_and_add("efi", firmware_kobj);
 	if (!efi_kobj) {
 		pr_err("efi: Firmware registration failed.\n");
+		destroy_workqueue(efi_rts_wq);
 		return -ENOMEM;
 	}

@@ -397,6 +398,7 @@ static int __init efisubsys_init(void)
 	generic_ops_unregister();
 err_put:
 	kobject_put(efi_kobj);
+	destroy_workqueue(efi_rts_wq);
 	return error;
 }
From: Marc Zyngier maz@kernel.org
stable inclusion from linux-4.19.142 commit 4957d56414ff5ca8fe62f8f11f0f9814f35b1d11
--------------------------------
commit a9ed4a6560b8562b7e2e2bed9527e88001f7b682 upstream.
When adding a new fd to an epoll, and this new fd is itself an epoll fd, we recursively scan the fds attached to it to detect cycles, and add non-epoll files to a "check list" that gets subsequently parsed.

However, this check list isn't completely safe when deletions can happen concurrently. To sidestep the issue, make sure that a struct file placed on the check list sees its f_count increased, ensuring that a concurrent deletion won't result in the file disappearing from under our feet.
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/eventpoll.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 113730881f11..d7fd1b0e37ab 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1892,9 +1892,11 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
 			 * not already there, and calling reverse_path_check()
 			 * during ep_insert().
 			 */
-			if (list_empty(&epi->ffd.file->f_tfile_llink))
+			if (list_empty(&epi->ffd.file->f_tfile_llink)) {
+				get_file(epi->ffd.file);
 				list_add(&epi->ffd.file->f_tfile_llink,
 					 &tfile_check_list);
+			}
 		}
 	}
 	mutex_unlock(&ep->mtx);
@@ -1938,6 +1940,7 @@ static void clear_tfile_check_list(void)
 		file = list_first_entry(&tfile_check_list, struct file,
 					f_tfile_llink);
 		list_del_init(&file->f_tfile_llink);
+		fput(file);
 	}
 	INIT_LIST_HEAD(&tfile_check_list);
 }
@@ -2097,9 +2100,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 				clear_tfile_check_list();
 				goto error_tgt_fput;
 			}
-		} else
+		} else {
+			get_file(tf.file);
 			list_add(&tf.file->f_tfile_llink,
 				 &tfile_check_list);
+		}
 		mutex_lock_nested(&ep->mtx, 0);
 		if (is_file_epoll(tf.file)) {
 			tep = tf.file->private_data;
From: Al Viro viro@zeniv.linux.org.uk
stable inclusion from linux-4.19.142 commit dcb6e6efb3298e59d90ee05c6ed33de810314892
--------------------------------
commit 52c479697c9b73f628140dcdfcd39ea302d05482 upstream.
Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/eventpoll.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d7fd1b0e37ab..6387b6128f3e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2096,10 +2096,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		mutex_lock(&epmutex);
 		if (is_file_epoll(tf.file)) {
 			error = -ELOOP;
-			if (ep_loop_check(ep, tf.file) != 0) {
-				clear_tfile_check_list();
+			if (ep_loop_check(ep, tf.file) != 0)
 				goto error_tgt_fput;
-			}
 		} else {
 			get_file(tf.file);
 			list_add(&tf.file->f_tfile_llink,
@@ -2128,8 +2126,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			error = ep_insert(ep, &epds, tf.file, fd, full_check);
 		} else
 			error = -EEXIST;
-		if (full_check)
-			clear_tfile_check_list();
 		break;
 	case EPOLL_CTL_DEL:
 		if (epi)
@@ -2152,8 +2148,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	mutex_unlock(&ep->mtx);

 error_tgt_fput:
-	if (full_check)
+	if (full_check) {
+		clear_tfile_check_list();
 		mutex_unlock(&epmutex);
+	}

 	fdput(tf);
error_fput:
From: Peter Xu peterx@redhat.com
stable inclusion from linux-4.19.142 commit 734654ae7962be55c44ff3fb0bb0652b5149cc17
--------------------------------
commit 75802ca66354a39ab8e35822747cd08b3384a99a upstream.
This is found by code observation only.
Firstly, the worst case scenario should assume the whole range was covered by pmd sharing. The old algorithm might not work as expected for ranges like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the expected range should be (0, 2g).
While at it, remove the loop since it should not be required. With that, the new code should also be faster when the invalidating range is huge.
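For illustration (not part of the patch), the worst-case alignment for the example above, assuming PUD_SIZE is 1G:

	/* (1g-2m, 1g+2m) straddles a PUD boundary, so the worst-case
	 * sharing window is the two full PUDs covering it: (0, 2g). */
	unsigned long start = SZ_1G - SZ_2M, end = SZ_1G + SZ_2M;
	unsigned long a_start = ALIGN_DOWN(start, PUD_SIZE);	/* 0  */
	unsigned long a_end = ALIGN(end, PUD_SIZE);		/* 2g */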
Mike said:
: With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
: adjust to (0, 1g+2m) which is incorrect.
:
: We should cc stable. The original reason for adjusting the range was to
: prevent data corruption (getting wrong page). Since the range is not
: always adjusted correctly, the potential for corruption still exists.
:
: However, I am fairly confident that adjust_range_if_pmd_sharing_possible
: is only going to be called in two cases:
:
: 1) for a single page
: 2) for range == entire vma
:
: In those cases, the current code should produce the correct results.
:
: To be safe, let's just cc stable.
Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages")
Signed-off-by: Peter Xu peterx@redhat.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Matthew Wilcox willy@infradead.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Mike Kravetz mike.kravetz@oracle.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 mm/hugetlb.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64f685bd6ef2..efb68abe9f35 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4832,25 +4832,21 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)
 {
-	unsigned long check_addr = *start;
+	unsigned long a_start, a_end;

 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return;

-	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
-		unsigned long a_start = check_addr & PUD_MASK;
-		unsigned long a_end = a_start + PUD_SIZE;
+	/* Extend the range to be PUD aligned for a worst case scenario */
+	a_start = ALIGN_DOWN(*start, PUD_SIZE);
+	a_end = ALIGN(*end, PUD_SIZE);

-		/*
-		 * If sharing is possible, adjust start/end if necessary.
-		 */
-		if (range_in_vma(vma, a_start, a_end)) {
-			if (a_start < *start)
-				*start = a_start;
-			if (a_end > *end)
-				*end = a_end;
-		}
-	}
+	/*
+	 * Intersect the range with the vma range, since pmd sharing won't be
+	 * across vma after all
+	 */
+	*start = max(vma->vm_start, a_start);
+	*end = min(vma->vm_end, a_end);
 }

 /*
From: Robin Murphy robin.murphy@arm.com
stable inclusion from linux-4.19.143 commit 9c9723816024195e6d68255e29db0dd4658f76ff
--------------------------------
[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]
Unlike the other instances which represent a complete loss of consistency within the rcache mechanism itself, or a fundamental and obvious misconfiguration by an IOMMU driver, the BUG_ON() in iova_magazine_free_pfns() can be provoked at more or less any time in a "spooky action-at-a-distance" manner by any old device driver passing nonsense to dma_unmap_*() which then propagates through to queue_iova().
Not only is this well outside the IOVA layer's control, it's also nowhere near fatal enough to justify panicking anyway - all that really achieves is to make debugging the offending driver more difficult. Let's simply WARN and otherwise ignore bogus PFNs.
Reported-by: Prakash Gupta guptap@codeaurora.org
Signed-off-by: Robin Murphy robin.murphy@arm.com
Reviewed-by: Prakash Gupta guptap@codeaurora.org
Link: https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.159110329...
Signed-off-by: Joerg Roedel jroedel@suse.de
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index fd9ab92e9d1f..7f18dbeb7b7c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -833,7 +833,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 	for (i = 0 ; i < mag->size; ++i) {
 		struct iova *iova = private_find_iova(iovad, mag->pfns[i]);

-		BUG_ON(!iova);
+		if (WARN_ON(!iova))
+			continue;
+
 		private_free_iova(iovad, iova);
 	}
From: Dave Chinner dchinner@redhat.com
stable inclusion from linux-4.19.143 commit 6af2bb145126e4907ff0a64bbee04fa0b499f09c
--------------------------------
[ Upstream commit 96355d5a1f0ee6dcc182c37db4894ec0c29f1692 ]
In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer and then the rcu callback that freed inodes was assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed.
Debugging the issue indicated that this was a pre-existing issue resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. i.e. it is assumed we'll never dirty/relog an inode marked XFS_ISTALE.
On journal commit, dirty stale inodes are handled by both the buffer and inode log items: they are run through xfs_istale_done() and removed from the AIL (buffer log item commit), or the log item simply unpins them because the buffer log item will clean them. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care.
However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place.
Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use after free, list corruptions and other nasty issues later in this patchset.
Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" rule caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_ifree(). It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce this problem in the future.
Reproducer for this issue was generic/558 on a v4 filesystem.
Signed-off-by: Dave Chinner dchinner@redhat.com
Reviewed-by: Brian Foster bfoster@redhat.com
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/xfs/xfs_icache.c      |  3 ++-
 fs/xfs/xfs_inode.c       | 25 ++++++++++++++++++++++---
 fs/xfs/xfs_trans_inode.c |  2 ++
 3 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 901f27ac94ab..56e9043bddc7 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1127,7 +1127,7 @@ xfs_reclaim_inode( goto out_ifunlock; xfs_iunpin_wait(ip); } - if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) { + if (xfs_inode_clean(ip)) { xfs_ifunlock(ip); goto reclaim; } @@ -1214,6 +1214,7 @@ xfs_reclaim_inode( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_qm_dqdetach(ip); xfs_iunlock(ip, XFS_ILOCK_EXCL); + ASSERT(xfs_inode_clean(ip));
__xfs_inode_free(ip); return error; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 7206569efa70..c3226c48079d 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1772,10 +1772,31 @@ xfs_inactive_ifree( return error; }
+ /* + * We do not hold the inode locked across the entire rolling transaction + * here. We only need to hold it for the first transaction that + * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the + * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode + * here breaks the relationship between cluster buffer invalidation and + * stale inode invalidation on cluster buffer item journal commit + * completion, and can result in leaving dirty stale inodes hanging + * around in memory. + * + * We have no need for serialising this inode operation against other + * operations - we freed the inode and hence reallocation is required + * and that will serialise on reallocating the space the deferops need + * to free. Hence we can unlock the inode on the first commit of + * the transaction rather than roll it right through the deferops. This + * avoids relogging the XFS_ISTALE inode. + * + * We check that xfs_ifree() hasn't grown an internal transaction roll + * by asserting that the inode is still locked when it returns. + */ xfs_ilock(ip, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, ip, 0); + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
error = xfs_ifree(tp, ip); + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); if (error) { /* * If we fail to free the inode, shut down. The cancel @@ -1788,7 +1809,6 @@ xfs_inactive_ifree( xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR); } xfs_trans_cancel(tp); - xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; }
@@ -1806,7 +1826,6 @@ xfs_inactive_ifree( xfs_notice(mp, "%s: xfs_trans_commit returned error %d", __func__, error);
- xfs_iunlock(ip, XFS_ILOCK_EXCL); return 0; }
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index 542927321a61..ae453dd236a6 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -39,6 +39,7 @@ xfs_trans_ijoin(
ASSERT(iip->ili_lock_flags == 0); iip->ili_lock_flags = lock_flags; + ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
/* * Get a log_item_desc to point at the new item. @@ -90,6 +91,7 @@ xfs_trans_log_inode(
ASSERT(ip->i_itemp != NULL); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
/* * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
From: Qiushi Wu wu000273@umn.edu
stable inclusion from linux-4.19.143 commit 03f4e517e6ac202d6a6ca50f02a1319a4a70cdd6
--------------------------------
[ Upstream commit 8a94644b440eef5a7b9c104ac8aa7a7f413e35e5 ]
kobject_init_and_add() takes a reference even when it fails. If it returns an error, kobject_put() must be called to clean up the memory associated with the object.
When kobject_init_and_add() fails, call kobject_put() instead of kfree().
b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject") fixed a similar problem.
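The general pattern, as a hedged sketch (struct foo, foo_ktype, parent and name are hypothetical, not taken from this patch):

	struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);
	int err;

	if (!foo)
		return -ENOMEM;

	/* From here on the kobject core owns the memory: even on failure
	 * the embedded kobject holds a reference, so drop it with
	 * kobject_put(), which calls foo_ktype's ->release() to free foo. */
	err = kobject_init_and_add(&foo->kobj, &foo_ktype, parent, "%s", name);
	if (err) {
		kobject_put(&foo->kobj);	/* not kfree(foo) */
		return err;
	}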
Link: https://lore.kernel.org/r/20200528021322.1984-1-wu000273@umn.edu
Signed-off-by: Qiushi Wu wu000273@umn.edu
Signed-off-by: Bjorn Helgaas bhelgaas@google.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/pci/slot.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index a32897f83ee5..fb7478b6c4f9 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -303,13 +303,16 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 	slot_name = make_slot_name(name);
 	if (!slot_name) {
 		err = -ENOMEM;
+		kfree(slot);
 		goto err;
 	}

 	err = kobject_init_and_add(&slot->kobj, &pci_slot_ktype, NULL,
 				   "%s", slot_name);
-	if (err)
+	if (err) {
+		kobject_put(&slot->kobj);
 		goto err;
+	}

 	INIT_LIST_HEAD(&slot->list);
 	list_add(&slot->list, &parent->slots);
@@ -328,7 +331,6 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 	mutex_unlock(&pci_slot_mutex);
 	return slot;
 err:
-	kfree(slot);
 	slot = ERR_PTR(err);
 	goto out;
 }
From: Chris Wilson chris@chris-wilson.co.uk
stable inclusion from linux-4.19.143 commit db454f8ab4b694eaf6a23b97479aa4b42c38e6e3
--------------------------------
[ Upstream commit a7ef9b28aa8d72a1656fa6f0a01bbd1493886317 ]
Though the number of lock acquisitions is tracked as an unsigned long, it is passed as the divisor to div_s64(), which interprets it as an s32, giving nonsense values with more than 2 billion acquisitions. E.g.
  acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------
    2350439395           0.07         353.38   649647067.36          0.-32
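A hedged sketch of the truncation (variable names and values below are invented for illustration; div_s64() and div64_u64() live in linux/math64.h):

	unsigned long nr = 2350439395UL;	/* > S32_MAX */
	s64 total = 649647067;			/* scaled holdtime total */

	/* div_s64() takes an s32 divisor: nr wraps to a negative value,
	 * which is how the "0.-32" average above is produced. */
	s64 bad = div_s64(total, (s32)nr);

	/* div64_u64() keeps the full 64-bit divisor. */
	u64 good = div64_u64(total, nr);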
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar mingo@kernel.org
Cc: Peter Zijlstra peterz@infradead.org
Link: https://lore.kernel.org/r/20200725185110.11588-1-chris@chris-wilson.co.uk
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 kernel/locking/lockdep_proc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c
index 6fcc4650f0c4..53cc3bb7025a 100644
--- a/kernel/locking/lockdep_proc.c
+++ b/kernel/locking/lockdep_proc.c
@@ -394,7 +394,7 @@ static void seq_lock_time(struct seq_file *m, struct lock_time *lt)
 	seq_time(m, lt->min);
 	seq_time(m, lt->max);
 	seq_time(m, lt->total);
-	seq_time(m, lt->nr ? div_s64(lt->total, lt->nr) : 0);
+	seq_time(m, lt->nr ? div64_u64(lt->total, lt->nr) : 0);
 }

 static void seq_stats(struct seq_file *m, struct lock_stat_data *data)
From: Jing Xiangfeng jingxiangfeng@huawei.com
stable inclusion from linux-4.19.143 commit d10ceeb835b5db394e4b8443498190c6827e0576
--------------------------------
[ Upstream commit 68e12e5f61354eb42cfffbc20a693153fc39738e ]
If scsi_host_lookup() fails we will jump to put_host which may cause a panic. Jump to exit_set_fnode instead.
Link: https://lore.kernel.org/r/20200615081226.183068-1-jingxiangfeng@huawei.com
Reviewed-by: Mike Christie michael.christie@oracle.com
Signed-off-by: Jing Xiangfeng jingxiangfeng@huawei.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/scsi/scsi_transport_iscsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 04d095488c76..698347301198 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -3172,7 +3172,7 @@ static int iscsi_set_flashnode_param(struct iscsi_transport *transport,
 		pr_err("%s could not find host no %u\n",
 		       __func__, ev->u.set_flashnode.host_no);
 		err = -ENODEV;
-		goto put_host;
+		goto exit_set_fnode;
 	}

 	idx = ev->u.set_flashnode.flashnode_idx;
From: Javed Hasan jhasan@marvell.com
stable inclusion from linux-4.19.143 commit 15f15650ee10ff24cd5736dca5f9d693fc9f56d4
--------------------------------
[ Upstream commit e95b4789ff4380733006836d28e554dc296b2298 ]
In fcoe_sysfs_fcf_del(), we first deleted the fcf from the list and then freed it if ctlr_dev was not NULL. This was causing a memory leak.
Free the fcf even if ctlr_dev is NULL.
Link: https://lore.kernel.org/r/20200729081824.30996-3-jhasan@marvell.com
Reviewed-by: Girish Basrur gbasrur@marvell.com
Reviewed-by: Santosh Vernekar svernekar@marvell.com
Reviewed-by: Saurav Kashyap skashyap@marvell.com
Reviewed-by: Shyam Sundar ssundar@marvell.com
Signed-off-by: Javed Hasan jhasan@marvell.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/scsi/fcoe/fcoe_ctlr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
index 24cbd0a2cc69..658c0726581f 100644
--- a/drivers/scsi/fcoe/fcoe_ctlr.c
+++ b/drivers/scsi/fcoe/fcoe_ctlr.c
@@ -267,9 +267,9 @@ static void fcoe_sysfs_fcf_del(struct fcoe_fcf *new)
 		WARN_ON(!fcf_dev);
 		new->fcf_dev = NULL;
 		fcoe_fcf_device_delete(fcf_dev);
-		kfree(new);
 		mutex_unlock(&cdev->lock);
 	}
+	kfree(new);
 }

 /**
From: Lukas Czerner lczerner@redhat.com
stable inclusion from linux-4.19.143 commit 3b1a4ea0028afe0194971c2b6910474a3eb89012
--------------------------------
[ Upstream commit 24dc9864914eb5813173cfa53313fcd02e4aea7d ]
Callers of __jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() assume that b_transaction is set. In fact, if it's not, we can end up with journal_head refcounting errors, leading to a crash much later that might be very hard to track down. Add asserts to make sure that is the case.
We also make sure that b_next_transaction is NULL in __jbd2_journal_unfile_buffer() since the callers expect that as well and we should not get into that stage in this state anyway, leading to problems later on if we do.
Tested with fstests.
Signed-off-by: Lukas Czerner lczerner@redhat.com
Reviewed-by: Jan Kara jack@suse.cz
Link: https://lore.kernel.org/r/20200617092549.6712-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/jbd2/transaction.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 37ede30b5981..ffa6d3530f4b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1991,6 +1991,9 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
  */
 static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
 {
+	J_ASSERT_JH(jh, jh->b_transaction != NULL);
+	J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
+
 	__jbd2_journal_temp_unlink_buffer(jh);
 	jh->b_transaction = NULL;
 	jbd2_journal_put_journal_head(jh);
@@ -2554,6 +2557,13 @@ void __jbd2_journal_refile_buffer(struct journal_head *jh)

 	was_dirty = test_clear_buffer_jbddirty(bh);
 	__jbd2_journal_temp_unlink_buffer(jh);
+
+	/*
+	 * b_transaction must be set, otherwise the new b_transaction won't
+	 * be holding jh reference
+	 */
+	J_ASSERT_JH(jh, jh->b_transaction != NULL);
+
 	/*
 	 * We set b_transaction here because b_next_transaction will inherit
 	 * our jh reference and thus __jbd2_journal_file_buffer() must not
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.143 commit a6d49257cbe53c7bca1a0353a6443f53cbed9cc7
--------------------------------
[ Upstream commit 11215630aada28307ba555a43138db6ac54fa825 ]
A customer has reported a BUG_ON in ext4_clear_journal_err() hitting during an LTP testing. Either this has been caused by a test setup issue where the filesystem was being overwritten while LTP was mounting it or the journal replay has overwritten the superblock with invalid data. In either case it is preferable we don't take the machine down with a BUG_ON. So handle the situation of unexpectedly missing has_journal feature more gracefully. We issue warning and fail the mount in the cases where the race window is narrow and the failed check is most likely a programming error. In cases where fs corruption is more likely, we do full ext4_error() handling before failing mount / remount.
Reviewed-by: Lukas Czerner lczerner@redhat.com
Signed-off-by: Jan Kara jack@suse.cz
Link: https://lore.kernel.org/r/20200710140759.18031-1-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/super.c | 68 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 21 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 75fbf2372abe..7a84d2cc7831 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -69,10 +69,10 @@ static int ext4_load_journal(struct super_block *, struct ext4_super_block *, unsigned long journal_devnum); static int ext4_show_options(struct seq_file *seq, struct dentry *root); static int ext4_commit_super(struct super_block *sb, int sync); -static void ext4_mark_recovery_complete(struct super_block *sb, +static int ext4_mark_recovery_complete(struct super_block *sb, struct ext4_super_block *es); -static void ext4_clear_journal_err(struct super_block *sb, - struct ext4_super_block *es); +static int ext4_clear_journal_err(struct super_block *sb, + struct ext4_super_block *es); static int ext4_sync_fs(struct super_block *sb, int wait); static int ext4_remount(struct super_block *sb, int *flags, char *data); static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf); @@ -4605,7 +4605,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS; if (needs_recovery) { ext4_msg(sb, KERN_INFO, "recovery complete"); - ext4_mark_recovery_complete(sb, es); + err = ext4_mark_recovery_complete(sb, es); + if (err) + goto failed_mount8; } if (EXT4_SB(sb)->s_journal) { if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) @@ -4648,10 +4650,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) ext4_msg(sb, KERN_ERR, "VFS: Can't find ext4 filesystem"); goto failed_mount;
-#ifdef CONFIG_QUOTA failed_mount8: ext4_unregister_sysfs(sb); -#endif failed_mount7: ext4_unregister_li_request(sb); failed_mount6: @@ -4787,7 +4787,8 @@ static journal_t *ext4_get_journal(struct super_block *sb, struct inode *journal_inode; journal_t *journal;
- BUG_ON(!ext4_has_feature_journal(sb)); + if (WARN_ON_ONCE(!ext4_has_feature_journal(sb))) + return NULL;
journal_inode = ext4_get_journal_inode(sb, journal_inum); if (!journal_inode) @@ -4817,7 +4818,8 @@ static journal_t *ext4_get_dev_journal(struct super_block *sb, struct ext4_super_block *es; struct block_device *bdev;
- BUG_ON(!ext4_has_feature_journal(sb)); + if (WARN_ON_ONCE(!ext4_has_feature_journal(sb))) + return NULL;
bdev = ext4_blkdev_get(j_dev, sb); if (bdev == NULL) @@ -4909,7 +4911,8 @@ static int ext4_load_journal(struct super_block *sb, int err = 0; int really_read_only;
- BUG_ON(!ext4_has_feature_journal(sb)); + if (WARN_ON_ONCE(!ext4_has_feature_journal(sb))) + return -EFSCORRUPTED;
if (journal_devnum && journal_devnum != le32_to_cpu(es->s_journal_dev)) { @@ -4979,7 +4982,12 @@ static int ext4_load_journal(struct super_block *sb, }
EXT4_SB(sb)->s_journal = journal; - ext4_clear_journal_err(sb, es); + err = ext4_clear_journal_err(sb, es); + if (err) { + EXT4_SB(sb)->s_journal = NULL; + jbd2_journal_destroy(journal); + return err; + }
if (!really_read_only && journal_devnum && journal_devnum != le32_to_cpu(es->s_journal_dev)) { @@ -5075,26 +5083,32 @@ static int ext4_commit_super(struct super_block *sb, int sync) * remounting) the filesystem readonly, then we will end up with a * consistent fs on disk. Record that fact. */ -static void ext4_mark_recovery_complete(struct super_block *sb, - struct ext4_super_block *es) +static int ext4_mark_recovery_complete(struct super_block *sb, + struct ext4_super_block *es) { + int err; journal_t *journal = EXT4_SB(sb)->s_journal;
if (!ext4_has_feature_journal(sb)) { - BUG_ON(journal != NULL); - return; + if (journal != NULL) { + ext4_error(sb, "Journal got removed while the fs was " + "mounted!"); + return -EFSCORRUPTED; + } + return 0; } jbd2_journal_lock_updates(journal); - if (jbd2_journal_flush(journal) < 0) + err = jbd2_journal_flush(journal); + if (err < 0) goto out;
if (ext4_has_feature_journal_needs_recovery(sb) && sb_rdonly(sb)) { ext4_clear_feature_journal_needs_recovery(sb); ext4_commit_super(sb, 1); } - out: jbd2_journal_unlock_updates(journal); + return err; }
/* @@ -5102,14 +5116,17 @@ static void ext4_mark_recovery_complete(struct super_block *sb, * has recorded an error from a previous lifetime, move that error to the * main filesystem now. */ -static void ext4_clear_journal_err(struct super_block *sb, +static int ext4_clear_journal_err(struct super_block *sb, struct ext4_super_block *es) { journal_t *journal; int j_errno; const char *errstr;
- BUG_ON(!ext4_has_feature_journal(sb)); + if (!ext4_has_feature_journal(sb)) { + ext4_error(sb, "Journal got removed while the fs was mounted!"); + return -EFSCORRUPTED; + }
journal = EXT4_SB(sb)->s_journal;
@@ -5134,6 +5151,7 @@ static void ext4_clear_journal_err(struct super_block *sb, jbd2_journal_clear_err(journal); jbd2_journal_update_sb_errno(journal); } + return 0; }
/* @@ -5404,8 +5422,13 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) (sbi->s_mount_state & EXT4_VALID_FS)) es->s_state = cpu_to_le16(sbi->s_mount_state);
- if (sbi->s_journal) + if (sbi->s_journal) { + /* + * We let remount-ro finish even if marking fs + * as clean failed... + */ ext4_mark_recovery_complete(sb, es); + } if (sbi->s_mmp_tsk) kthread_stop(sbi->s_mmp_tsk); } else { @@ -5461,8 +5484,11 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) * been changed by e2fsck since we originally mounted * the partition.) */ - if (sbi->s_journal) - ext4_clear_journal_err(sb, es); + if (sbi->s_journal) { + err = ext4_clear_journal_err(sb, es); + if (err) + goto restore_opts; + } sbi->s_mount_state = le16_to_cpu(es->s_state);
err = ext4_setup_super(sb, es, 0);
From: Lukas Czerner lczerner@redhat.com
stable inclusion from linux-4.19.143 commit 47788043e5fa6415fa9ff0b4af217079b06815af
--------------------------------
[ Upstream commit 273108fa5015eeffc4bacfa5ce272af3434b96e4 ]
Ext4 uses blkdev_get_by_dev() to get the block_device for the journal device, which does not check whether the read-only block device was opened read-only.

As a result ext4 will happily proceed to mount the file system with an external journal on a read-only device. This is bad, as we would not be able to use the journal, leading to errors later on.

Instead of simply failing to mount the file system in this case, treat it the same way we treat an internal journal on a read-only device: allow mounting with -o noload in read-only mode.
This can be reproduced easily like this:
mke2fs -F -O journal_dev $JOURNAL_DEV 100M
mkfs.$FSTYPE -F -J device=$JOURNAL_DEV $FS_DEV
blockdev --setro $JOURNAL_DEV
mount $FS_DEV $MNT
touch $MNT/file
umount $MNT
leading to error like this
[ 1307.318713] ------------[ cut here ]------------
[ 1307.323362] generic_make_request: Trying to write to read-only block-device dm-2 (partno 0)
[ 1307.331741] WARNING: CPU: 36 PID: 3224 at block/blk-core.c:855 generic_make_request_checks+0x2c3/0x580
[ 1307.341041] Modules linked in: ext4 mbcache jbd2 rfkill intel_rapl_msr intel_rapl_common isst_if_commd
[ 1307.419445] CPU: 36 PID: 3224 Comm: jbd2/dm-2 Tainted: G W I 5.8.0-rc5 #2
[ 1307.427359] Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 2.3.10 08/15/2019
[ 1307.434932] RIP: 0010:generic_make_request_checks+0x2c3/0x580
[ 1307.440676] Code: 94 03 00 00 48 89 df 48 8d 74 24 08 c6 05 cf 2b 18 01 01 e8 7f a4 ff ff 48 c7 c7 50e
[ 1307.459420] RSP: 0018:ffffc0d70eb5fb48 EFLAGS: 00010286
[ 1307.464646] RAX: 0000000000000000 RBX: ffff9b33b2978300 RCX: 0000000000000000
[ 1307.471780] RDX: ffff9b33e12a81e0 RSI: ffff9b33e1298000 RDI: ffff9b33e1298000
[ 1307.478913] RBP: ffff9b7b9679e0c0 R08: 0000000000000837 R09: 0000000000000024
[ 1307.486044] R10: 0000000000000000 R11: ffffc0d70eb5f9f0 R12: 0000000000000400
[ 1307.493177] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 1307.500308] FS: 0000000000000000(0000) GS:ffff9b33e1280000(0000) knlGS:0000000000000000
[ 1307.508396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1307.514142] CR2: 000055eaf4109000 CR3: 0000003dee40a006 CR4: 00000000007606e0
[ 1307.521273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1307.528407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1307.535538] PKRU: 55555554
[ 1307.538250] Call Trace:
[ 1307.540708]  generic_make_request+0x30/0x340
[ 1307.544985]  submit_bio+0x43/0x190
[ 1307.548393]  ? bio_add_page+0x62/0x90
[ 1307.552068]  submit_bh_wbc+0x16a/0x190
[ 1307.555833]  jbd2_write_superblock+0xec/0x200 [jbd2]
[ 1307.560803]  jbd2_journal_update_sb_log_tail+0x65/0xc0 [jbd2]
[ 1307.566557]  jbd2_journal_commit_transaction+0x2ae/0x1860 [jbd2]
[ 1307.572566]  ? check_preempt_curr+0x7a/0x90
[ 1307.576756]  ? update_curr+0xe1/0x1d0
[ 1307.580421]  ? account_entity_dequeue+0x7b/0xb0
[ 1307.584955]  ? newidle_balance+0x231/0x3d0
[ 1307.589056]  ? __switch_to_asm+0x42/0x70
[ 1307.592986]  ? __switch_to_asm+0x36/0x70
[ 1307.596918]  ? lock_timer_base+0x67/0x80
[ 1307.600851]  kjournald2+0xbd/0x270 [jbd2]
[ 1307.604873]  ? finish_wait+0x80/0x80
[ 1307.608460]  ? commit_timeout+0x10/0x10 [jbd2]
[ 1307.612915]  kthread+0x114/0x130
[ 1307.616152]  ? kthread_park+0x80/0x80
[ 1307.619816]  ret_from_fork+0x22/0x30
[ 1307.623400] ---[ end trace 27490236265b1630 ]---
Signed-off-by: Lukas Czerner lczerner@redhat.com
Reviewed-by: Andreas Dilger adilger@dilger.ca
Link: https://lore.kernel.org/r/20200717090605.2612-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/super.c | 51 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 18 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 7a84d2cc7831..3e568a9f73c2 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4910,6 +4910,7 @@ static int ext4_load_journal(struct super_block *sb, dev_t journal_dev; int err = 0; int really_read_only; + int journal_dev_ro;
if (WARN_ON_ONCE(!ext4_has_feature_journal(sb))) return -EFSCORRUPTED; @@ -4922,7 +4923,31 @@ static int ext4_load_journal(struct super_block *sb, } else journal_dev = new_decode_dev(le32_to_cpu(es->s_journal_dev));
- really_read_only = bdev_read_only(sb->s_bdev); + if (journal_inum && journal_dev) { + ext4_msg(sb, KERN_ERR, + "filesystem has both journal inode and journal device!"); + return -EINVAL; + } + + if (journal_inum) { + journal = ext4_get_journal(sb, journal_inum); + if (!journal) + return -EINVAL; + } else { + journal = ext4_get_dev_journal(sb, journal_dev); + if (!journal) + return -EINVAL; + } + + journal_dev_ro = bdev_read_only(journal->j_dev); + really_read_only = bdev_read_only(sb->s_bdev) | journal_dev_ro; + + if (journal_dev_ro && !sb_rdonly(sb)) { + ext4_msg(sb, KERN_ERR, + "journal device read-only, try mounting with '-o ro'"); + err = -EROFS; + goto err_out; + }
/* * Are we loading a blank journal or performing recovery after a @@ -4937,27 +4962,14 @@ static int ext4_load_journal(struct super_block *sb, ext4_msg(sb, KERN_ERR, "write access " "unavailable, cannot proceed " "(try mounting with noload)"); - return -EROFS; + err = -EROFS; + goto err_out; } ext4_msg(sb, KERN_INFO, "write access will " "be enabled during recovery"); } }
- if (journal_inum && journal_dev) { - ext4_msg(sb, KERN_ERR, "filesystem has both journal " - "and inode journals!"); - return -EINVAL; - } - - if (journal_inum) { - if (!(journal = ext4_get_journal(sb, journal_inum))) - return -EINVAL; - } else { - if (!(journal = ext4_get_dev_journal(sb, journal_dev))) - return -EINVAL; - } - if (!(journal->j_flags & JBD2_BARRIER)) ext4_msg(sb, KERN_INFO, "barriers disabled");
@@ -4977,8 +4989,7 @@ static int ext4_load_journal(struct super_block *sb,
if (err) { ext4_msg(sb, KERN_ERR, "error loading journal"); - jbd2_journal_destroy(journal); - return err; + goto err_out; }
EXT4_SB(sb)->s_journal = journal; @@ -4998,6 +5009,10 @@ static int ext4_load_journal(struct super_block *sb, }
return 0; + +err_out: + jbd2_journal_destroy(journal); + return err; }
static int ext4_commit_super(struct super_block *sb, int sync)
From: Lukas Czerner lczerner@redhat.com
stable inclusion from linux-4.19.143 commit bfb8d9b74750e7c9b12f9e18b4885617a6433f6d
--------------------------------
[ Upstream commit f25391ebb475d3ffb3aa61bb90e3594c841749ef ]
Currently there is a problem with mount options that can be set either by the VFS using mount flags or via string parsing in ext4.
The i_version/iversion option gets lost after remount, for example:

$ mount -o i_version /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
$ mount -o remount,ro /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version

nolazytime gets ignored by ext4 on remount, for example:

$ mount -o lazytime /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
$ mount -o remount,nolazytime /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to s_flags before we parse the option and use the resulting state of the same flags in *flags at the end of successful remount.
Signed-off-by: Lukas Czerner lczerner@redhat.com
Reviewed-by: Ritesh Harjani riteshh@linux.ibm.com
Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Sasha Levin sashal@kernel.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/super.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3e568a9f73c2..4c43f497d33f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5309,7 +5309,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 {
 	struct ext4_super_block *es;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	unsigned long old_sb_flags;
+	unsigned long old_sb_flags, vfs_flags;
 	struct ext4_mount_options old_opts;
 	int enable_quota = 0;
 	ext4_group_t g;
@@ -5352,6 +5352,14 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	if (sbi->s_journal && sbi->s_journal->j_task->io_context)
 		journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;

+	/*
+	 * Some options can be enabled by ext4 and/or by VFS mount flag
+	 * either way we need to make sure it matches in both *flags and
+	 * s_flags. Copy those selected flags from *flags to s_flags
+	 */
+	vfs_flags = SB_LAZYTIME | SB_I_VERSION;
+	sb->s_flags = (sb->s_flags & ~vfs_flags) | (*flags & vfs_flags);
+
 	if (!parse_options(data, sb, NULL, &journal_ioprio, 1)) {
 		err = -EINVAL;
 		goto restore_opts;
@@ -5405,9 +5413,6 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 	}

-	if (*flags & SB_LAZYTIME)
-		sb->s_flags |= SB_LAZYTIME;
-
 	if ((bool)(*flags & SB_RDONLY) != sb_rdonly(sb)) {
 		if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED) {
 			err = -EROFS;
@@ -5567,7 +5572,13 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	if (!test_opt(sb, BLOCK_VALIDITY) && sbi->system_blks)
 		ext4_release_system_zone(sb);

-	*flags = (*flags & ~SB_LAZYTIME) | (sb->s_flags & SB_LAZYTIME);
+	/*
+	 * Some options can be enabled by ext4 and/or by VFS mount flag
+	 * either way we need to make sure it matches in both *flags and
+	 * s_flags. Copy those selected flags from s_flags to *flags
+	 */
+	*flags = (*flags & ~vfs_flags) | (sb->s_flags & vfs_flags);
+
 	ext4_msg(sb, KERN_INFO, "re-mounted. Opts: %s", orig_data);
 	kfree(orig_data);
 	return 0;
From: Xianting Tian xianting_tian@126.com
stable inclusion from linux-4.19.143 commit 4aaac9c537b79ffd0602db06cd5127a455e49275
--------------------------------
[ Upstream commit 377254b2cd2252c7c3151b113cbdf93a7736c2e9 ]
If a device is hot-removed --- for example, when a physical device is unplugged from a pcie slot or an nbd device's network is shut down --- this can result in a BUG_ON() crash in submit_bh_wbc(). This is because when the block device dies, the buffer heads will have their Buffer_Mapped flag cleared, leading to the crash in submit_bh_wbc().

We had attempted to work around this problem in commit a17712c8 ("ext4: check superblock mapped prior to committing"). Unfortunately, it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the device dies between the work-around check in ext4_commit_super() and the eventual call to submit_bh_wbc():
Code path:
ext4_commit_super
	judge if 'buffer_mapped(sbh)' is false, return  <== commit a17712c8
	lock_buffer(sbh)
	...
	unlock_buffer(sbh)
	__sync_dirty_buffer(sbh,...
		lock_buffer(sbh)
		judge if 'buffer_mapped(sbh))' is false, return  <== added by this patch
		submit_bh(...,sbh)
			submit_bh_wbc(...,sbh,...)
[100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
[100722.966503] invalid opcode: 0000 [#1] SMP
[100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000
[100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
[100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246
[100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000
[100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001
[100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d
[100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800
[100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000
[100722.966580] FS: 00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000
[100722.966580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0
[100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[100722.966583] PKRU: 55555554
[100722.966583] Call Trace:
[100722.966588]  __sync_dirty_buffer+0x6e/0xd0
[100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
[100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
[100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
[100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
[100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966663]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
[100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966682]  __mark_inode_dirty+0x17f/0x350
[100722.966686]  generic_update_time+0x87/0xd0
[100722.966687]  touch_atime+0xa9/0xd0
[100722.966690]  generic_file_read_iter+0xa09/0xcd0
[100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
[100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
[100722.966707]  ? __inode_security_revalidate+0x4f/0x60
[100722.966709]  __vfs_read+0xec/0x160
[100722.966711]  vfs_read+0x8c/0x130
[100722.966712]  SyS_pread64+0x87/0xb0
[100722.966716]  do_syscall_64+0x67/0x1b0
[100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25
To address this, add the check of 'buffer_mapped(bh)' to __sync_dirty_buffer(). This also has the benefit of fixing this for other file systems.
With this addition, we can drop the workaround in ext4_commit_super().
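The essential property of the fix is that the mapped-state check now happens after lock_buffer() has been taken, closing the window in which the device could die between the check and the submission. A standalone C sketch of the check-under-lock pattern (names hypothetical; a pthread mutex stands in for the buffer lock):

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct buf {
	pthread_mutex_t lock;
	bool mapped;		/* cleared when the backing device dies */
};

/* Re-checking "mapped" under the lock means the state cannot change
 * between the check and the (simulated) I/O submission. */
static int sync_buf(struct buf *b)
{
	int ret = 0;

	pthread_mutex_lock(&b->lock);
	if (!b->mapped) {
		/* device died; fail the I/O, as the patch does */
		ret = -EIO;
	} else {
		/* ... the write would be submitted here ... */
	}
	pthread_mutex_unlock(&b->lock);
	return ret;
}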
[ Commit description rewritten by tytso. ]
Signed-off-by: Xianting Tian xianting_tian@126.com Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126... Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/buffer.c | 9 +++++++++ fs/ext4/super.c | 7 ------- 2 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 50cdeb1023a2..10e7389786d3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3177,6 +3177,15 @@ int __sync_dirty_buffer(struct buffer_head *bh, int op_flags)
 	WARN_ON(atomic_read(&bh->b_count) < 1);
 	lock_buffer(bh);
 	if (test_clear_buffer_dirty(bh)) {
+		/*
+		 * The bh should be mapped, but it might not be if the
+		 * device was hot-removed. Not much we can do but fail the I/O.
+		 */
+		if (!buffer_mapped(bh)) {
+			unlock_buffer(bh);
+			return -EIO;
+		}
+
 		get_bh(bh);
 		bh->b_end_io = end_buffer_write_sync;
 		ret = submit_bh(REQ_OP_WRITE, op_flags, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4c43f497d33f..1af560c0b9f6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5024,13 +5024,6 @@ static int ext4_commit_super(struct super_block *sb, int sync)
 	if (!sbh || block_device_ejected(sb))
 		return error;

-	/*
-	 * The superblock bh should be mapped, but it might not be if the
-	 * device was hot-removed. Not much we can do but fail the I/O.
-	 */
-	if (!buffer_mapped(sbh))
-		return error;
-
 	/*
 	 * If the file system is mounted read-only, don't update the
 	 * superblock write time. This avoids updating the superblock
From: Ming Lei ming.lei@redhat.com
stable inclusion from linux-4.19.143 commit 77064570e4c3636da26f30ed28e4d132256424ec
--------------------------------
commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream.
The SCHED_RESTART code path is relied on to re-run the queue for requests left in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked when requests are added to hctx->dispatch.
Memory barriers have to be used to order the following two pairs of operations:
1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list()
2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch in blk_mq_sched_restart().
Without the added memory barrier, either:
1) blk_mq_sched_restart() may miss requests added to hctx->dispatch in the meantime, while blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not re-run the queue on the dispatch side,
or
2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART set and does not re-run the queue on the dispatch side, while the check for requests in hctx->dispatch from blk_mq_sched_restart() misses the newly added requests.
IO hang in ltp/fs_fill test is reported by kernel test robot:
https://lkml.org/lkml/2020/7/26/77
It turns out to be caused by the out-of-order operations above, and the IO hang can no longer be observed after applying this patch.
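For illustration only, here is the store/fence/load pairing expressed with C11 atomics (names hypothetical; atomic_thread_fence() stands in for smp_mb()). If both sides execute, at least one of them is guaranteed to observe the other's store, so either the dispatch side sees the flag already cleared or the restart side sees the queued request:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool sched_restart;	/* models the SCHED_RESTART bit */
static atomic_int  dispatch_count;	/* models hctx->dispatch length */

/* Dispatch side: publish the request, then check the flag. */
static bool request_queued_and_flag_set(void)
{
	atomic_fetch_add(&dispatch_count, 1);		/* add to dispatch */
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with below */
	return atomic_load(&sched_restart);
}

/* Restart side: clear the flag, then check the list. */
static bool flag_cleared_and_work_pending(void)
{
	atomic_store(&sched_restart, false);		/* clear the flag */
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with above */
	return atomic_load(&dispatch_count) > 0;
}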
Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Reported-by: kernel test robot rong.a.chen@intel.com Signed-off-by: Ming Lei ming.lei@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Cc: Bart Van Assche bvanassche@acm.org Cc: Christoph Hellwig hch@lst.de Cc: David Jeffery djeffery@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/blk-mq-sched.c | 9 +++++++++ block/blk-mq.c | 9 +++++++++ 2 files changed, 18 insertions(+)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index bc032131e32c..ce4b2ac6d697 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -69,6 +69,15 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 		return;
 	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

+	/*
+	 * Order clearing SCHED_RESTART and list_empty_careful(&hctx->dispatch)
+	 * in blk_mq_run_hw_queue(). Its pair is the barrier in
+	 * blk_mq_dispatch_rq_list(). So dispatch code won't see SCHED_RESTART,
+	 * meantime new request added to hctx->dispatch is missed to check in
+	 * blk_mq_run_hw_queue().
+	 */
+	smp_mb();
+
 	blk_mq_run_hw_queue(hctx, true);
 }

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4acd394047b9..a2b735263dff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1225,6 +1225,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		list_splice_tail_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);

+		/*
+		 * Order adding requests to hctx->dispatch and checking
+		 * SCHED_RESTART flag. The pair of this smp_mb() is the one
+		 * in blk_mq_sched_restart(). Avoid restart code path to
+		 * miss the new added requests to hctx->dispatch, meantime
+		 * SCHED_RESTART is observed here.
+		 */
+		smp_mb();
+
 		/*
 		 * If SCHED_RESTART was set by the caller of this function and
 		 * it is no longer set that means that it was cleared by another
From: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
stable inclusion from linux-4.19.143 commit c1fe757dd3d18497eaca831ed82aa20b4186affd
--------------------------------
commit f8d1653daec02315e06d30246cff4af72e76e54e upstream.
syzbot is reporting a UAF bug in set_origin() from vc_do_resize() [1], for vc_do_resize() calls kfree(vc->vc_screenbuf) before calling set_origin().
Unfortunately, in set_origin(), vc->vc_sw->con_set_origin() might access vc->vc_pos when scrolling is involved in order to manipulate the cursor, but vc->vc_pos still refers to the already-released vc->vc_screenbuf until it is updated based on the result of vc->vc_sw->con_set_origin().
Preserving the old buffer and tolerating outdated vc members until set_origin() completes is easier than preventing vc->vc_sw->con_set_origin() from accessing outdated vc members.
[1] https://syzkaller.appspot.com/bug?id=6649da2081e2ebdc65c0642c214b27fe91099db...
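The shape of the fix is the generic "install new, use, then free old" ordering. A userspace sketch (names hypothetical; malloc/free stand in for kmalloc/kfree):

#include <stdlib.h>

struct vc {
	unsigned short *screenbuf;
	size_t screenbuf_size;
};

static void touch_stale_state(struct vc *vc)
{
	(void)vc;	/* models set_origin(), which may follow old offsets */
}

static int do_resize(struct vc *vc, size_t new_size)
{
	unsigned short *newscreen = calloc(1, new_size);
	unsigned short *oldscreen;

	if (!newscreen)
		return -1;

	oldscreen = vc->screenbuf;	/* keep the old buffer alive ... */
	vc->screenbuf = newscreen;
	vc->screenbuf_size = new_size;
	touch_stale_state(vc);		/* ... while stale members are used */
	free(oldscreen);		/* only now is it safe to free */
	return 0;
}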
Reported-by: syzbot syzbot+9116ecc1978ca3a12f43@syzkaller.appspotmail.com Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/1596034621-4714-1-git-send-email-penguin-kernel@I-... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/vt/vt.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 982d9684c65e..758f522f331e 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -1199,7 +1199,7 @@ static int vc_do_resize(struct tty_struct *tty, struct vc_data *vc,
 	unsigned int old_rows, old_row_size, first_copied_row;
 	unsigned int new_cols, new_rows, new_row_size, new_screen_size;
 	unsigned int user;
-	unsigned short *newscreen;
+	unsigned short *oldscreen, *newscreen;
 	struct uni_screen *new_uniscr = NULL;

 	WARN_CONSOLE_UNLOCKED();
@@ -1297,10 +1297,11 @@ static int vc_do_resize(struct tty_struct *tty, struct vc_data *vc,
 	if (new_scr_end > new_origin)
 		scr_memsetw((void *)new_origin, vc->vc_video_erase_char,
 			    new_scr_end - new_origin);
-	kfree(vc->vc_screenbuf);
+	oldscreen = vc->vc_screenbuf;
 	vc->vc_screenbuf = newscreen;
 	vc->vc_screenbuf_size = new_screen_size;
 	set_origin(vc);
+	kfree(oldscreen);

 	/* do part of a reset_terminal() */
 	vc->vc_top = 0;
From: George Kennedy george.kennedy@oracle.com
stable inclusion from linux-4.19.143 commit 1221d11e5c35db18323ade3d4b2130bde89cc9df
--------------------------------
commit bc5269ca765057a1b762e79a1cfd267cd7bf1c46 upstream.
vc_resize() can return with an error after failure. Change the VT_RESIZEX ioctl to save the struct vc_data values that it modifies and restore the original values in case of error.
Signed-off-by: George Kennedy george.kennedy@oracle.com Cc: stable stable@vger.kernel.org Reported-by: syzbot+38a3699c7eaf165b97a6@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/1596213192-6635-2-git-send-email-george.kennedy@or... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/vt/vt_ioctl.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/tty/vt/vt_ioctl.c b/drivers/tty/vt/vt_ioctl.c
index 5de81431c835..6a82030cf1ef 100644
--- a/drivers/tty/vt/vt_ioctl.c
+++ b/drivers/tty/vt/vt_ioctl.c
@@ -893,12 +893,22 @@ int vt_ioctl(struct tty_struct *tty,
 			console_lock();
 			vcp = vc_cons[i].d;
 			if (vcp) {
+				int ret;
+				int save_scan_lines = vcp->vc_scan_lines;
+				int save_font_height = vcp->vc_font.height;
+
 				if (v.v_vlin)
 					vcp->vc_scan_lines = v.v_vlin;
 				if (v.v_clin)
 					vcp->vc_font.height = v.v_clin;
 				vcp->vc_resize_user = 1;
-				vc_resize(vcp, v.v_cols, v.v_rows);
+				ret = vc_resize(vcp, v.v_cols, v.v_rows);
+				if (ret) {
+					vcp->vc_scan_lines = save_scan_lines;
+					vcp->vc_font.height = save_font_height;
+					console_unlock();
+					return ret;
+				}
 			}
 			console_unlock();
 		}
From: Lukas Wunner lukas@wunner.de
stable inclusion from linux-4.19.143 commit eec2f7d9f0352a8bfe41980632e4e67a0d5c032b
--------------------------------
commit 27afac93e3bd7fa89749cf11da5d86ac9cde4dba upstream.
If probing of a pl011 gets deferred until after free_initmem(), an oops ensues because pl011_console_match(), which has by then been freed, is called.
Fix by removing the __init attribute from the function and those it calls.
Commit 10879ae5f12e ("serial: pl011: add console matching function") introduced pl011_console_match() not just for early consoles but for regular preferred consoles, such as those added by acpi_parse_spcr(). Regular consoles may be registered after free_initmem() for various reasons, one being deferred probing, another being dynamic enablement of serial ports using a DeviceTree overlay.
Thus, pl011_console_match() must not be declared __init and the functions it calls mustn't either.
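For background: __init code lives in .init.text, which the kernel discards once boot finishes, so any call made after free_initmem() jumps into freed memory. A schematic kernel-C fragment of the hazard (names hypothetical; not standalone-compilable outside the kernel):

/* Placed in .init.text; the backing memory is released by
 * free_initmem() when boot completes. */
static int __init match_helper(void)
{
	return 0;
}

static int late_console_match(void)
{
	/*
	 * If this path runs after free_initmem() -- deferred probe, or a
	 * DeviceTree overlay enabling a port late -- the call below jumps
	 * into freed memory and oopses, as in the trace that follows.
	 */
	return match_helper();
}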
Stack trace for posterity:
Unable to handle kernel paging request at virtual address 80c38b58
Internal error: Oops: 8000000d [#1] PREEMPT SMP ARM
PC is at pl011_console_match+0x0/0xfc
LR is at register_console+0x150/0x468
[<80187004>] (register_console)
[<805a8184>] (uart_add_one_port)
[<805b2b68>] (pl011_register_port)
[<805b3ce4>] (pl011_probe)
[<80569214>] (amba_probe)
[<805ca088>] (really_probe)
[<805ca2ec>] (driver_probe_device)
[<805ca5b0>] (__device_attach_driver)
[<805c8060>] (bus_for_each_drv)
[<805c9dfc>] (__device_attach)
[<805ca630>] (device_initial_probe)
[<805c90a8>] (bus_probe_device)
[<805c95a8>] (deferred_probe_work_func)
Fixes: 10879ae5f12e ("serial: pl011: add console matching function") Signed-off-by: Lukas Wunner lukas@wunner.de Cc: stable@vger.kernel.org # v4.10+ Cc: Aleksey Makarov amakarov@marvell.com Cc: Peter Hurley peter@hurleysoftware.com Cc: Russell King linux@armlinux.org.uk Cc: Christopher Covington cov@codeaurora.org Link: https://lore.kernel.org/r/f827ff09da55b8c57d316a1b008a137677b58921.159731555... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/serial/amba-pl011.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 781a27eebf0b..c41450e4e149 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -2316,9 +2316,8 @@ pl011_console_write(struct console *co, const char *s, unsigned int count)
 	clk_disable(uap->clk);
 }

-static void __init
-pl011_console_get_options(struct uart_amba_port *uap, int *baud,
-			  int *parity, int *bits)
+static void pl011_console_get_options(struct uart_amba_port *uap, int *baud,
+				      int *parity, int *bits)
 {
 	if (pl011_read(uap, REG_CR) & UART01x_CR_UARTEN) {
 		unsigned int lcr_h, ibrd, fbrd;
@@ -2351,7 +2350,7 @@ pl011_console_get_options(struct uart_amba_port *uap, int *baud,
 	}
 }

-static int __init pl011_console_setup(struct console *co, char *options)
+static int pl011_console_setup(struct console *co, char *options)
 {
 	struct uart_amba_port *uap;
 	int baud = 38400;
@@ -2419,8 +2418,8 @@ static int __init pl011_console_setup(struct console *co, char *options)
  *
  *	Returns 0 if console matches; otherwise non-zero to use default matching
  */
-static int __init pl011_console_match(struct console *co, char *name, int idx,
-				      char *options)
+static int pl011_console_match(struct console *co, char *name, int idx,
+			       char *options)
 {
 	unsigned char iotype;
 	resource_size_t addr;
From: Lukas Wunner lukas@wunner.de
stable inclusion from linux-4.19.143 commit e12f36220f6f89ecf78edc98a153d548352d6951
--------------------------------
commit 89efbe70b27dd325d8a8c177743a26b885f7faec upstream.
pl011_probe() calls pl011_setup_port() to reserve an amba_ports[] entry, then calls pl011_register_port() to register the uart driver with the tty layer.
If registration of the uart driver fails, the amba_ports[] entry is not released. If this happens 14 times (the value of the UART_NR macro), all amba_ports[] entries will have been leaked and driver probing is no longer possible. (To be fair, that can only happen if the DeviceTree doesn't contain alias IDs, since those cause the same entry to be used for a given port.) Fix it.
Fixes: ef2889f7ffee ("serial: pl011: Move uart_register_driver call to device") Signed-off-by: Lukas Wunner lukas@wunner.de Cc: stable@vger.kernel.org # v3.15+ Cc: Tushar Behera tushar.behera@linaro.org Link: https://lore.kernel.org/r/138f8c15afb2f184d8102583f8301575566064a6.159731616... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/serial/amba-pl011.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index c41450e4e149..f55e4d9596cb 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -2657,7 +2657,7 @@ static int pl011_setup_port(struct device *dev, struct uart_amba_port *uap,

 static int pl011_register_port(struct uart_amba_port *uap)
 {
-	int ret;
+	int ret, i;

 	/* Ensure interrupts from this UART are masked and cleared */
 	pl011_write(0, uap, REG_IMSC);
@@ -2668,6 +2668,9 @@ static int pl011_register_port(struct uart_amba_port *uap)
 		if (ret < 0) {
 			dev_err(uap->port.dev,
 				"Failed to register AMBA-PL011 driver\n");
+			for (i = 0; i < ARRAY_SIZE(amba_ports); i++)
+				if (amba_ports[i] == uap)
+					amba_ports[i] = NULL;
 			return ret;
 		}
 	}
From: Valmer Huhn valmer.huhn@concurrent-rt.com
stable inclusion from linux-4.19.143 commit ce755e4eae3e7f782d6a690ac6bd85ac9671b5e0
--------------------------------
commit c6b9e95dde7b54e6a53c47241201ab5a4035c320 upstream.
The following in 8250_exar.c line 589 is used to determine the number of ports for each Exar board:
nr_ports = board->num_ports ? board->num_ports : pcidev->device & 0x0f;
If the number of ports a card has is not explicitly specified, it defaults to the rightmost 4 bits of the PCI device ID. This is prone to error since not all PCI device IDs contain a number which corresponds to the number of ports that card provides.
This particular case involves COMMTECH_4222PCIE, COMMTECH_4224PCIE and COMMTECH_4228PCIE cards with device IDs 0x0022, 0x0020 and 0x0021. Currently the multiport cards receive 2, 0 and 1 port instead of 2, 4 and 8 ports respectively.
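The wrong counts follow directly from the masking; a quick standalone check of the arithmetic in the commit message:

#include <stdio.h>

int main(void)
{
	/* Device IDs and true port counts from the commit message */
	const unsigned int id[]    = { 0x0022, 0x0020, 0x0021 };
	const unsigned int ports[] = { 2, 4, 8 };

	for (int i = 0; i < 3; i++)
		printf("0x%04x & 0x0f = %u (actual ports: %u)\n",
		       id[i], id[i] & 0x0f, ports[i]);
	/* prints 2, 0 and 1 -- only the two-port card works by accident */
	return 0;
}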
To fix this, each Commtech Fastcom PCIe card is given a struct where the number of ports is explicitly specified. This ensures 'board->num_ports' is used instead of the default 'pcidev->device & 0x0f'.
Fixes: d0aeaa83f0b0 ("serial: exar: split out the exar code from 8250_pci") Signed-off-by: Valmer Huhn valmer.huhn@concurrent-rt.com Tested-by: Valmer Huhn valmer.huhn@concurrent-rt.com Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20200813165255.GC345440@icarus.concurrent-rt.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/serial/8250/8250_exar.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/drivers/tty/serial/8250/8250_exar.c b/drivers/tty/serial/8250/8250_exar.c
index 0089aa305ef9..18a373d89480 100644
--- a/drivers/tty/serial/8250/8250_exar.c
+++ b/drivers/tty/serial/8250/8250_exar.c
@@ -604,6 +604,24 @@ static const struct exar8250_board pbn_exar_XR17V35x = {
 	.exit		= pci_xr17v35x_exit,
 };

+static const struct exar8250_board pbn_fastcom35x_2 = {
+	.num_ports	= 2,
+	.setup		= pci_xr17v35x_setup,
+	.exit		= pci_xr17v35x_exit,
+};
+
+static const struct exar8250_board pbn_fastcom35x_4 = {
+	.num_ports	= 4,
+	.setup		= pci_xr17v35x_setup,
+	.exit		= pci_xr17v35x_exit,
+};
+
+static const struct exar8250_board pbn_fastcom35x_8 = {
+	.num_ports	= 8,
+	.setup		= pci_xr17v35x_setup,
+	.exit		= pci_xr17v35x_exit,
+};
+
 static const struct exar8250_board pbn_exar_XR17V4358 = {
 	.num_ports	= 12,
 	.setup		= pci_xr17v35x_setup,
@@ -665,9 +683,9 @@ static const struct pci_device_id exar_pci_tbl[] = {
 	EXAR_DEVICE(EXAR, EXAR_XR17V358, pbn_exar_XR17V35x),
 	EXAR_DEVICE(EXAR, EXAR_XR17V4358, pbn_exar_XR17V4358),
 	EXAR_DEVICE(EXAR, EXAR_XR17V8358, pbn_exar_XR17V8358),
-	EXAR_DEVICE(COMMTECH, COMMTECH_4222PCIE, pbn_exar_XR17V35x),
-	EXAR_DEVICE(COMMTECH, COMMTECH_4224PCIE, pbn_exar_XR17V35x),
-	EXAR_DEVICE(COMMTECH, COMMTECH_4228PCIE, pbn_exar_XR17V35x),
+	EXAR_DEVICE(COMMTECH, COMMTECH_4222PCIE, pbn_fastcom35x_2),
+	EXAR_DEVICE(COMMTECH, COMMTECH_4224PCIE, pbn_fastcom35x_4),
+	EXAR_DEVICE(COMMTECH, COMMTECH_4228PCIE, pbn_fastcom35x_8),

 	EXAR_DEVICE(COMMTECH, COMMTECH_4222PCI335, pbn_fastcom335_2),
 	EXAR_DEVICE(COMMTECH, COMMTECH_4224PCI335, pbn_fastcom335_4),
From: Sergey Senozhatsky sergey.senozhatsky@gmail.com
stable inclusion from linux-4.19.143 commit 2eb35a11bbcc3636cdabd6228299d6a1b91cde00
--------------------------------
commit 205d300aea75623e1ae4aa43e0d265ab9cf195fd upstream.
We have a number of "uart.port->desc.lock vs desc.lock->uart.port" lockdep reports coming from the 8250 driver; this causes people a bit of trouble, so let's fix it.
The problem is reverse lock order in two different call paths:
chain #1:
serial8250_do_startup()
  spin_lock_irqsave(&port->lock);
  disable_irq_nosync(port->irq);
    raw_spin_lock_irqsave(&desc->lock)
chain #2:
__report_bad_irq()
  raw_spin_lock_irqsave(&desc->lock)
  for_each_action_of_desc()
    printk()
      spin_lock_irqsave(&port->lock);
Fix this by changing the order of locks in serial8250_do_startup(): do disable_irq_nosync() first, which grabs desc->lock, and take the port lock after that, so that chain #1 and chain #2 have the same lock order.
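Abstractly, the fix enforces one global acquisition order for the two locks. A minimal sketch (pthread mutexes standing in for desc->lock and port->lock; names hypothetical):

#include <pthread.h>

static pthread_mutex_t desc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t port_lock = PTHREAD_MUTEX_INITIALIZER;

/* After the fix, startup takes (and releases) desc_lock via
 * disable_irq_nosync() before it ever touches port_lock, so it can no
 * longer hold port_lock while waiting for desc_lock. */
static void startup_fixed(void)
{
	pthread_mutex_lock(&desc_lock);		/* disable_irq_nosync() */
	pthread_mutex_unlock(&desc_lock);

	pthread_mutex_lock(&port_lock);		/* THRE test runs here */
	pthread_mutex_unlock(&port_lock);
}

/* The interrupt-report path keeps its order: desc_lock, then port_lock. */
static void report_bad_irq(void)
{
	pthread_mutex_lock(&desc_lock);
	pthread_mutex_lock(&port_lock);		/* printk -> console write */
	pthread_mutex_unlock(&port_lock);
	pthread_mutex_unlock(&desc_lock);
}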
Full lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.4.39 #55 Not tainted
======================================================

swapper/0/0 is trying to acquire lock:
ffffffffab65b6c0 (console_owner){-...}, at: console_lock_spinning_enable+0x31/0x57

but task is already holding lock:
ffff88810a8e34c0 (&irq_desc_lock_class){-.-.}, at: __report_bad_irq+0x5b/0xba

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&irq_desc_lock_class){-.-.}:
       _raw_spin_lock_irqsave+0x61/0x8d
       __irq_get_desc_lock+0x65/0x89
       __disable_irq_nosync+0x3b/0x93
       serial8250_do_startup+0x451/0x75c
       uart_startup+0x1b4/0x2ff
       uart_port_activate+0x73/0xa0
       tty_port_open+0xae/0x10a
       uart_open+0x1b/0x26
       tty_open+0x24d/0x3a0
       chrdev_open+0xd5/0x1cc
       do_dentry_open+0x299/0x3c8
       path_openat+0x434/0x1100
       do_filp_open+0x9b/0x10a
       do_sys_open+0x15f/0x3d7
       kernel_init_freeable+0x157/0x1dd
       kernel_init+0xe/0x105
       ret_from_fork+0x27/0x50

-> #1 (&port_lock_key){-.-.}:
       _raw_spin_lock_irqsave+0x61/0x8d
       serial8250_console_write+0xa7/0x2a0
       console_unlock+0x3b7/0x528
       vprintk_emit+0x111/0x17f
       printk+0x59/0x73
       register_console+0x336/0x3a4
       uart_add_one_port+0x51b/0x5be
       serial8250_register_8250_port+0x454/0x55e
       dw8250_probe+0x4dc/0x5b9
       platform_drv_probe+0x67/0x8b
       really_probe+0x14a/0x422
       driver_probe_device+0x66/0x130
       device_driver_attach+0x42/0x5b
       __driver_attach+0xca/0x139
       bus_for_each_dev+0x97/0xc9
       bus_add_driver+0x12b/0x228
       driver_register+0x64/0xed
       do_one_initcall+0x20c/0x4a6
       do_initcall_level+0xb5/0xc5
       do_basic_setup+0x4c/0x58
       kernel_init_freeable+0x13f/0x1dd
       kernel_init+0xe/0x105
       ret_from_fork+0x27/0x50

-> #0 (console_owner){-...}:
       __lock_acquire+0x118d/0x2714
       lock_acquire+0x203/0x258
       console_lock_spinning_enable+0x51/0x57
       console_unlock+0x25d/0x528
       vprintk_emit+0x111/0x17f
       printk+0x59/0x73
       __report_bad_irq+0xa3/0xba
       note_interrupt+0x19a/0x1d6
       handle_irq_event_percpu+0x57/0x79
       handle_irq_event+0x36/0x55
       handle_fasteoi_irq+0xc2/0x18a
       do_IRQ+0xb3/0x157
       ret_from_intr+0x0/0x1d
       cpuidle_enter_state+0x12f/0x1fd
       cpuidle_enter+0x2e/0x3d
       do_idle+0x1ce/0x2ce
       cpu_startup_entry+0x1d/0x1f
       start_kernel+0x406/0x46a
       secondary_startup_64+0xa4/0xb0

other info that might help us debug this:

Chain exists of:
  console_owner --> &port_lock_key --> &irq_desc_lock_class

Possible unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock(&irq_desc_lock_class);
                              lock(&port_lock_key);
                              lock(&irq_desc_lock_class);
 lock(console_owner);

*** DEADLOCK ***

2 locks held by swapper/0/0:
 #0: ffff88810a8e34c0 (&irq_desc_lock_class){-.-.}, at: __report_bad_irq+0x5b/0xba
 #1: ffffffffab65b5c0 (console_lock){+.+.}, at: console_trylock_spinning+0x20/0x181

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.39 #55
Hardware name: XXXXXX
Call Trace:
 <IRQ>
 dump_stack+0xbf/0x133
 ? print_circular_bug+0xd6/0xe9
 check_noncircular+0x1b9/0x1c3
 __lock_acquire+0x118d/0x2714
 lock_acquire+0x203/0x258
 ? console_lock_spinning_enable+0x31/0x57
 console_lock_spinning_enable+0x51/0x57
 ? console_lock_spinning_enable+0x31/0x57
 console_unlock+0x25d/0x528
 ? console_trylock+0x18/0x4e
 vprintk_emit+0x111/0x17f
 ? lock_acquire+0x203/0x258
 printk+0x59/0x73
 __report_bad_irq+0xa3/0xba
 note_interrupt+0x19a/0x1d6
 handle_irq_event_percpu+0x57/0x79
 handle_irq_event+0x36/0x55
 handle_fasteoi_irq+0xc2/0x18a
 do_IRQ+0xb3/0x157
 common_interrupt+0xf/0xf
 </IRQ>
Signed-off-by: Sergey Senozhatsky sergey.senozhatsky@gmail.com Fixes: 768aec0b5bcc ("serial: 8250: fix shared interrupts issues with SMP and RT kernels") Reported-by: Guenter Roeck linux@roeck-us.net Reported-by: Raul Rangel rrangel@google.com BugLink: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800 Link: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDaf... Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Reviewed-by: Guenter Roeck linux@roeck-us.net Tested-by: Guenter Roeck linux@roeck-us.net Cc: stable stable@vger.kernel.org Link: https://lore.kernel.org/r/20200817022646.1484638-1-sergey.senozhatsky@gmail.... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/serial/8250/8250_port.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index aa4de6907f77..368904825d5b 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -2255,6 +2255,10 @@ int serial8250_do_startup(struct uart_port *port)

 	if (port->irq && !(up->port.flags & UPF_NO_THRE_TEST)) {
 		unsigned char iir1;
+
+		if (port->irqflags & IRQF_SHARED)
+			disable_irq_nosync(port->irq);
+
 		/*
 		 * Test for UARTs that do not reassert THRE when the
 		 * transmitter is idle and the interrupt has already
@@ -2264,8 +2268,6 @@ int serial8250_do_startup(struct uart_port *port)
 		 * allow register changes to become visible.
 		 */
 		spin_lock_irqsave(&port->lock, flags);
-		if (up->port.irqflags & IRQF_SHARED)
-			disable_irq_nosync(port->irq);

 		wait_for_xmitr(up, UART_LSR_THRE);
 		serial_port_out_sync(port, UART_IER, UART_IER_THRI);
@@ -2277,9 +2279,10 @@ int serial8250_do_startup(struct uart_port *port)
 		iir = serial_port_in(port, UART_IIR);
 		serial_port_out(port, UART_IER, 0);

+		spin_unlock_irqrestore(&port->lock, flags);
+
 		if (port->irqflags & IRQF_SHARED)
 			enable_irq(port->irq);
-		spin_unlock_irqrestore(&port->lock, flags);

 		/*
 		 * If the interrupt is not reasserted, or we otherwise
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.143 commit 1f58ddc07eef098a32741e74516bb6ef18009ce1
--------------------------------
commit b35250c0816c7cf7d0a8de92f5fafb6a7508a708 upstream.
Currently, operations on inode->i_io_list are protected by wb->list_lock. In the following patches we'll need to maintain consistency between inode->i_state and inode->i_io_list, so change the code so that inode->i_lock also protects all of the inode's i_io_list handling.
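The patch follows the usual *_locked naming convention: the _locked variant asserts that the caller holds the lock, and a thin wrapper takes it for call sites that don't. A sketch of the idiom (hypothetical userspace rendition):

#include <pthread.h>

static pthread_mutex_t i_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold i_lock; the kernel version asserts this with
 * assert_spin_locked(&inode->i_lock). */
static void redirty_tail_locked(void)
{
	/* ... manipulate the inode's position on i_io_list ... */
}

/* Wrapper for call sites that do not already hold the lock. */
static void redirty_tail(void)
{
	pthread_mutex_lock(&i_lock);
	redirty_tail_locked();
	pthread_mutex_unlock(&i_lock);
}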
Reviewed-by: Martijn Coenen maco@android.com Reviewed-by: Christoph Hellwig hch@lst.de CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback" Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/fs-writeback.c | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c6023330cbb3..49067955ff7c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -160,6 +160,7 @@ static void inode_io_list_del_locked(struct inode *inode,
 				     struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
+	assert_spin_locked(&inode->i_lock);

 	list_del_init(&inode->i_io_list);
 	wb_io_lists_depopulated(wb);
@@ -1041,7 +1042,9 @@ void inode_io_list_del(struct inode *inode)
 	struct bdi_writeback *wb;

 	wb = inode_to_wb_and_lock_list(inode);
+	spin_lock(&inode->i_lock);
 	inode_io_list_del_locked(inode, wb);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 }
 EXPORT_SYMBOL(inode_io_list_del);
@@ -1091,8 +1094,10 @@ void sb_clear_inode_writeback(struct inode *inode)
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
 {
+	assert_spin_locked(&inode->i_lock);
+
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
@@ -1103,6 +1108,13 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 	inode_io_list_move_locked(inode, wb, &wb->b_dirty);
 }

+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+{
+	spin_lock(&inode->i_lock);
+	redirty_tail_locked(inode, wb);
+	spin_unlock(&inode->i_lock);
+}
+
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
@@ -1313,7 +1325,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * writeback is not making progress due to locked
 		 * buffers. Skip this inode for now.
 		 */
-		redirty_tail(inode, wb);
+		redirty_tail_locked(inode, wb);
 		return;
 	}
@@ -1333,7 +1345,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * retrying writeback of the dirty page/inode
 			 * that cannot be performed immediately.
 			 */
-			redirty_tail(inode, wb);
+			redirty_tail_locked(inode, wb);
 		}
 	} else if (inode->i_state & I_DIRTY) {
 		/*
@@ -1341,7 +1353,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * such as delayed allocation during submission or metadata
 		 * updates after data IO completion.
 		 */
-		redirty_tail(inode, wb);
+		redirty_tail_locked(inode, wb);
 	} else if (inode->i_state & I_DIRTY_TIME) {
 		inode->dirtied_when = jiffies;
 		inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
@@ -1588,8 +1600,8 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 */
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			redirty_tail_locked(inode, wb);
 			spin_unlock(&inode->i_lock);
-			redirty_tail(inode, wb);
 			continue;
 		}
 		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.143 commit be0937e03bf65b4e113213658bb3c54b5e5d4f0e
--------------------------------
commit 5afced3bf28100d81fb2fe7e98918632a08feaf5 upstream.
Inode's i_io_list list head is used to attach inode to several different lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker prepares a list of inodes to writeback e.g. for sync(2), it moves inodes to b_io list. Thus it is critical for sync(2) data integrity guarantees that inode is not requeued to any other writeback list when inode is queued for processing by flush worker. That's the reason why writeback_single_inode() does not touch i_io_list (unless the inode is completely clean) and why __mark_inode_dirty() does not touch i_io_list if I_SYNC flag is set.
However there are two flaws in the current logic:
1) When an inode has only I_DIRTY_TIME set but is already queued on the b_io list due to sync(2), a concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC) can still move the inode back to the b_dirty list, resulting in the inode's timestamps being skipped during sync(2) writeback.
2) When inode is on b_dirty_time list and writeback_single_inode() races with __mark_inode_dirty() like:
writeback_single_inode()                __mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
                                          inode->i_state |= I_DIRTY_PAGES;
                                          if (inode->i_state & I_SYNC)
                                            bail
  if (!(inode->i_state & I_DIRTY_ALL))
    - not true so nothing done
We end up with an I_DIRTY_PAGES inode on the b_dirty_time list, and thus standard background writeback will never write this inode back, leading to possible dirty-throttling stalls etc. (thanks to Martijn Coenen for this analysis).
Fix these problems by tracking whether inode is queued in b_io or b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we know flush worker has queued inode and we should not touch i_io_list. On the other hand we also know that once flush worker is done with the inode it will requeue the inode to appropriate dirty list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode to appropriate dirty list.
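Condensed to its core, the new rule in __mark_inode_dirty() is a single flag test; a hypothetical sketch (flag value taken from the patch):

#include <stdbool.h>

#define I_SYNC_QUEUED	(1 << 17)	/* bit value used by the patch */

struct inode_state {
	unsigned int i_state;
};

/*
 * The flush worker sets I_SYNC_QUEUED when it moves an inode to
 * b_io/b_more_io and clears it when requeueing the inode afterwards.
 * Dirtying code may only move the inode between dirty lists while the
 * flag is clear.
 */
static bool may_requeue_on_dirty(const struct inode_state *inode)
{
	return !(inode->i_state & I_SYNC_QUEUED);
}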
Reported-by: Martijn Coenen maco@android.com Reviewed-by: Martijn Coenen maco@android.com Tested-by: Martijn Coenen maco@android.com Reviewed-by: Christoph Hellwig hch@lst.de Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/fs-writeback.c | 17 ++++++++++++----- include/linux/fs.h | 8 ++++++-- 2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 49067955ff7c..eb2667b1006c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -162,6 +162,7 @@ static void inode_io_list_del_locked(struct inode *inode,
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);

+	inode->i_state &= ~I_SYNC_QUEUED;
 	list_del_init(&inode->i_io_list);
 	wb_io_lists_depopulated(wb);
 }
@@ -1106,6 +1107,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
 		inode->dirtied_when = jiffies;
 	}
 	inode_io_list_move_locked(inode, wb, &wb->b_dirty);
+	inode->i_state &= ~I_SYNC_QUEUED;
 }

 static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
@@ -1181,8 +1183,11 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 			break;
 		list_move(&inode->i_io_list, &tmp);
 		moved++;
+		spin_lock(&inode->i_lock);
 		if (flags & EXPIRE_DIRTY_ATIME)
-			set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
+			inode->i_state |= I_DIRTY_TIME_EXPIRED;
+		inode->i_state |= I_SYNC_QUEUED;
+		spin_unlock(&inode->i_lock);
 		if (sb_is_blkdev_sb(inode->i_sb))
 			continue;
 		if (sb && sb != inode->i_sb)
@@ -1357,6 +1362,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 	} else if (inode->i_state & I_DIRTY_TIME) {
 		inode->dirtied_when = jiffies;
 		inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+		inode->i_state &= ~I_SYNC_QUEUED;
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
 		inode_io_list_del_locked(inode, wb);
@@ -2222,11 +2228,12 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		inode->i_state |= flags;

 		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
+		 * If the inode is queued for writeback by flush worker, just
+		 * update its dirty state. Once the flush worker is done with
+		 * the inode it will place it on the appropriate superblock
+		 * list, based upon its state.
 		 */
-		if (inode->i_state & I_SYNC)
+		if (inode->i_state & I_SYNC_QUEUED)
 			goto out_unlock_inode;

 		/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e9a5b900fef2..ec5744cd8d0f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2097,6 +2097,10 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_SYNC_QUEUED	Inode is queued in b_io or b_more_io writeback lists.
+ *			Used to detect that mark_inode_dirty() should not move
+ *			inode between dirty lists.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2114,11 +2118,11 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 #define I_DIO_WAKEUP		(1 << __I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
 #define I_DIRTY_TIME		(1 << 11)
-#define __I_DIRTY_TIME_EXPIRED	12
-#define I_DIRTY_TIME_EXPIRED	(1 << __I_DIRTY_TIME_EXPIRED)
+#define I_DIRTY_TIME_EXPIRED	(1 << 12)
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_SYNC_QUEUED		(1 << 17)

 #define I_DIRTY_INODE	(I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY	(I_DIRTY_INODE | I_DIRTY_PAGES)
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.143 commit 7c3d77a31b0633dd8d61e3ab7e3d919af6fded75
--------------------------------
commit f9cae926f35e8230330f28c7b743ad088611a8de upstream.
When processing writeback for sync(2), move_expired_inodes() did not set any inode expiry value (older_than_this). This can result in writeback never completing if there's a steady stream of inodes being added to the b_dirty_time list, because writeback rechecks the dirty lists after each round for more work. Fix the problem by using the sync(2) start time as the inode expiry value when processing the b_dirty_time list, similarly to what is done for ordinarily dirtied inodes. This requires some refactoring of the older_than_this handling, which simplifies the code noticeably as a bonus.
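The heart of the fix is a wrap-safe cutoff comparison applied uniformly to both dirty lists; a standalone sketch in the spirit of inode_dirtied_after() (names hypothetical):

#include <stdbool.h>
#include <stdio.h>

/* Wrap-safe "was t dirtied strictly after the cutoff?" in jiffies-style
 * arithmetic: subtract and test the sign of the difference. */
static bool dirtied_after(unsigned long t, unsigned long cutoff)
{
	return (long)(t - cutoff) > 0;
}

int main(void)
{
	unsigned long sync_start = 1000;

	/* An inode dirtied after sync(2) started is left for later, so a
	 * steady stream of new dirtiers cannot livelock the sync. */
	printf("%d\n", dirtied_after(1500, sync_start));	/* 1: skip */
	printf("%d\n", dirtied_after(900, sync_start));		/* 0: write */
	return 0;
}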
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: include/trace/events/writeback.h [yyl: strlcpy() is replaced with bdi_get_dev_name()] Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/fs-writeback.c | 44 ++++++++++++-------------------- include/trace/events/writeback.h | 13 +++++----- 2 files changed, 23 insertions(+), 34 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eb2667b1006c..f39c83758d94 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -45,7 +45,6 @@ struct wb_completion {
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
-	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -1153,16 +1152,13 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
 #define EXPIRE_DIRTY_ATIME	0x0001

 /*
- * Move expired (dirtied before work->older_than_this) dirty inodes from
+ * Move expired (dirtied before dirtied_before) dirty inodes from
  * @delaying_queue to @dispatch_queue.
  */
 static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-			       int flags,
-			       struct wb_writeback_work *work)
+			       int flags, unsigned long dirtied_before)
 {
-	unsigned long *older_than_this = NULL;
-	unsigned long expire_time;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
@@ -1170,16 +1166,9 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 	int do_sb_sort = 0;
 	int moved = 0;

-	if ((flags & EXPIRE_DIRTY_ATIME) == 0)
-		older_than_this = work->older_than_this;
-	else if (!work->for_sync) {
-		expire_time = jiffies - (dirtytime_expire_interval * HZ);
-		older_than_this = &expire_time;
-	}
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (inode_dirtied_after(inode, dirtied_before))
 			break;
 		list_move(&inode->i_io_list, &tmp);
 		moved++;
@@ -1225,18 +1214,22 @@ static int move_expired_inodes(struct list_head *delaying_queue,
  *                                           |
  *                                           +--> dequeue for IO
  */
-static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
+static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work,
+		     unsigned long dirtied_before)
 {
 	int moved;
+	unsigned long time_expire_jif = dirtied_before;

 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, dirtied_before);
+	if (!work->for_sync)
+		time_expire_jif = jiffies - dirtytime_expire_interval * HZ;
 	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
-				     EXPIRE_DIRTY_ATIME, work);
+				     EXPIRE_DIRTY_ATIME, time_expire_jif);
 	if (moved)
 		wb_io_lists_populated(wb);
-	trace_writeback_queue_io(wb, work, moved);
+	trace_writeback_queue_io(wb, work, dirtied_before, moved);
 }

 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -1748,7 +1741,7 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 		blk_start_plug(&plug);
 		spin_lock(&wb->list_lock);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, &work);
+			queue_io(wb, &work, jiffies);
 		__writeback_inodes_wb(wb, &work);
 		spin_unlock(&wb->list_lock);
 		blk_finish_plug(&plug);
@@ -1768,7 +1761,7 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 * takes longer than a dirty_writeback_interval interval, then leave a
 * one-second gap.
 *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
+ * dirtied_before takes precedence over nr_to_write.  So we'll only write back
 * all dirty pages if they are all attached to "old" mappings.
 */
 static long wb_writeback(struct bdi_writeback *wb,
@@ -1776,14 +1769,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 {
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
-	unsigned long oldest_jif;
+	unsigned long dirtied_before = jiffies;
 	struct inode *inode;
 	long progress;
 	struct blk_plug plug;

-	oldest_jif = jiffies;
-	work->older_than_this = &oldest_jif;
-
 	blk_start_plug(&plug);
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -1817,14 +1807,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * safe.
 		 */
 		if (work->for_kupdate) {
-			oldest_jif = jiffies -
+			dirtied_before = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 		} else if (work->for_background)
-			oldest_jif = jiffies;
+			dirtied_before = jiffies;

 		trace_writeback_start(wb, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, work);
+			queue_io(wb, work, dirtied_before);
 		if (work->sb)
 			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index fbba9d27b0a3..c0b7c41b73f1 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -358,8 +358,9 @@ DEFINE_WBC_EVENT(wbc_writepage);
 TRACE_EVENT(writeback_queue_io,
 	TP_PROTO(struct bdi_writeback *wb,
 		 struct wb_writeback_work *work,
+		 unsigned long dirtied_before,
 		 int moved),
-	TP_ARGS(wb, work, moved),
+	TP_ARGS(wb, work, dirtied_before, moved),
 	TP_STRUCT__entry(
 		__array(char,		name, 32)
 		__field(unsigned long,	older)
@@ -369,19 +370,17 @@ TRACE_EVENT(writeback_queue_io,
 		__field(unsigned int,	cgroup_ino)
 	),
 	TP_fast_assign(
-		unsigned long *older_than_this = work->older_than_this;
 		bdi_get_dev_name(wb->bdi, __entry->name, 32);
-		__entry->older	= older_than_this ?  *older_than_this : 0;
-		__entry->age	= older_than_this ?
-				  (jiffies - *older_than_this) * 1000 / HZ : -1;
+		__entry->older	= dirtied_before;
+		__entry->age	= (jiffies - dirtied_before) * 1000 / HZ;
 		__entry->moved	= moved;
 		__entry->reason	= work->reason;
 		__entry->cgroup_ino	= __trace_wb_assign_cgroup(wb);
 	),
 	TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s cgroup_ino=%u",
 		__entry->name,
-		__entry->older,	/* older_than_this in jiffies */
-		__entry->age,	/* older_than_this in relative milliseconds */
+		__entry->older,	/* dirtied_before in jiffies */
+		__entry->age,	/* dirtied_before in relative milliseconds */
 		__entry->moved,
 		__print_symbolic(__entry->reason, WB_WORK_REASON),
 		__entry->cgroup_ino
From: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
stable inclusion from linux-4.19.143 commit 517b087a70a3fc59cdba5dcf99afcb8a2f90a3c7
--------------------------------
commit e3eb6e8fba65094328b8dca635d00de74ba75b45 upstream.
It has been reported that system-wide suspend may be aborted in the absence of any wakeup events due to unforeseen interactions between it and the runtime PM framework.
One failing scenario is when there are multiple devices sharing an ACPI power resource and runtime-resume needs to be carried out for one of them during system-wide suspend (for example, because it needs to be reconfigured before the whole system goes to sleep). In that case, the runtime-resume of that device involves turning the ACPI power resource "on" which in turn causes runtime-resume requests to be queued up for all of the other devices sharing it. Those requests go to the runtime PM workqueue which is frozen during system-wide suspend, so they are not actually taken care of until the resume of the whole system, but the pm_runtime_barrier() call in __device_suspend() sees them and triggers system wakeup events for them which then cause the system-wide suspend to be aborted if wakeup source objects are in active use.
Of course, the logic that leads to triggering those wakeup events is questionable in the first place, because clearly there are cases in which a pending runtime resume request for a device is not connected to any real wakeup events in any way (like the one above). Moreover, it is racy, because the device may be resuming already by the time the pm_runtime_barrier() runs and so if the driver doesn't take care of signaling the wakeup event as appropriate, it will be lost. However, if the driver does take care of that, the extra pm_wakeup_event() call in the core is redundant.
Accordingly, drop the conditional pm_wakeup_event() call from __device_suspend() and make the latter call pm_runtime_barrier() alone. Also modify the comment next to that call to reflect the new code, and extend it to mention the need to avoid unwanted interactions between runtime PM and system-wide device suspend callbacks.
Fixes: 1e2ef05bb8cf8 ("PM: Limit race conditions between runtime PM and system sleep (v2)") Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Acked-by: Alan Stern stern@rowland.harvard.edu Reported-by: Utkarsh H Patel utkarsh.h.patel@intel.com Tested-by: Utkarsh H Patel utkarsh.h.patel@intel.com Tested-by: Pengfei Xu pengfei.xu@intel.com Cc: All applicable stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/base/power/main.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 4abd7c6531d9..140e2df659e3 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1719,13 +1719,17 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	}

 	/*
-	 * If a device configured to wake up the system from sleep states
-	 * has been suspended at run time and there's a resume request pending
-	 * for it, this is equivalent to the device signaling wakeup, so the
-	 * system suspend operation should be aborted.
+	 * Wait for possible runtime PM transitions of the device in progress
+	 * to complete and if there's a runtime resume request pending for it,
+	 * resume it before proceeding with invoking the system-wide suspend
+	 * callbacks for it.
+	 *
+	 * If the system-wide suspend callbacks below change the configuration
+	 * of the device, they must disable runtime PM for it or otherwise
+	 * ensure that its runtime-resume callbacks will not be confused by that
+	 * change in case they are invoked going forward.
 	 */
-	if (pm_runtime_barrier(dev) && device_may_wakeup(dev))
-		pm_wakeup_event(dev, 0);
+	pm_runtime_barrier(dev);

 	if (pm_wakeup_pending()) {
 		dev->power.direct_complete = false;
From: Heikki Krogerus heikki.krogerus@linux.intel.com
stable inclusion from linux-4.19.143 commit 9d57313ce14b6449369f04e07afd8bfe2c70a2bc
--------------------------------
commit c15e1bdda4365a5f17cdadf22bf1c1df13884a9e upstream.
When the primary firmware node pointer is removed from a device (set to NULL) the secondary firmware node pointer, when it exists, is made the primary node for the device. However, the secondary firmware node pointer of the original primary firmware node is never cleared (set to NULL).
To avoid a situation where the secondary firmware node pointer points to a non-existing object, clear it properly when the primary node is removed from a device in set_primary_fwnode().
Fixes: 97badf873ab6 ("device property: Make it possible to use secondary firmware nodes") Cc: All applicable stable@vger.kernel.org Signed-off-by: Heikki Krogerus heikki.krogerus@linux.intel.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/base/core.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 928fc1532a70..b911c38ad18c 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3333,9 +3333,9 @@ static inline bool fwnode_is_primary(struct fwnode_handle *fwnode)
  */
 void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode)
 {
-	if (fwnode) {
-		struct fwnode_handle *fn = dev->fwnode;
+	struct fwnode_handle *fn = dev->fwnode;

+	if (fwnode) {
 		if (fwnode_is_primary(fn))
 			fn = fn->secondary;

@@ -3345,8 +3345,12 @@ void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode)
 		}
 		dev->fwnode = fwnode;
 	} else {
-		dev->fwnode = fwnode_is_primary(dev->fwnode) ?
-			dev->fwnode->secondary : NULL;
+		if (fwnode_is_primary(fn)) {
+			dev->fwnode = fn->secondary;
+			fn->secondary = NULL;
+		} else {
+			dev->fwnode = NULL;
+		}
 	}
 }
 EXPORT_SYMBOL_GPL(set_primary_fwnode);
From: Jarkko Sakkinen jarkko.sakkinen@linux.intel.com
stable inclusion from linux-4.19.143 commit dc828b79feea72cfbc34fd6104a8386bade6fd78
--------------------------------
[ Upstream commit 6c4e79d99e6f42b79040f1a33cd4018f5425030b ]
The size of the buffers for storing contexts and sessions can vary from arch to arch, as PAGE_SIZE can be anything between 4 kB and 256 kB (the maximum, for PPC64). Define a fixed buffer size set to 16 kB. This should be enough for most uses with three handles (which is how many we allow at the moment). Parametrize the buffer size while doing this, so that it is easier to revisit later on if required.
Cc: stable@vger.kernel.org Reported-by: Stefan Berger stefanb@linux.ibm.com Fixes: 745b361e989a ("tpm: infrastructure for TPM spaces") Reviewed-by: Jerry Snitselaar jsnitsel@redhat.com Tested-by: Stefan Berger stefanb@linux.ibm.com Signed-off-by: Jarkko Sakkinen jarkko.sakkinen@linux.intel.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/char/tpm/tpm-chip.c | 9 ++------- drivers/char/tpm/tpm.h | 6 +++++- drivers/char/tpm/tpm2-space.c | 26 ++++++++++++++++---------- drivers/char/tpm/tpmrm-dev.c | 2 +- 4 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index f3ae8a32d1d5..8280fcd5130e 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -274,13 +274,8 @@ struct tpm_chip *tpm_chip_alloc(struct device *pdev,
 	chip->cdev.owner = THIS_MODULE;
 	chip->cdevs.owner = THIS_MODULE;

-	chip->work_space.context_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!chip->work_space.context_buf) {
-		rc = -ENOMEM;
-		goto out;
-	}
-	chip->work_space.session_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!chip->work_space.session_buf) {
+	rc = tpm2_init_space(&chip->work_space, TPM2_SPACE_BUFFER_SIZE);
+	if (rc) {
 		rc = -ENOMEM;
 		goto out;
 	}
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 289221d653cb..b9a30f0b8825 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -188,6 +188,7 @@ struct tpm_space {
 	u8 *context_buf;
 	u32 session_tbl[3];
 	u8 *session_buf;
+	u32 buf_size;
 };

 enum tpm_chip_flags {
@@ -278,6 +279,9 @@ struct tpm_output_header {

 #define TPM_TAG_RQU_COMMAND 193

+/* TPM2 specific constants. */
+#define TPM2_SPACE_BUFFER_SIZE		16384 /* 16 kB */
+
 struct	stclear_flags_t {
 	__be16	tag;
 	u8	deactivated;
@@ -595,7 +599,7 @@ void tpm2_shutdown(struct tpm_chip *chip, u16 shutdown_type);
 unsigned long tpm2_calc_ordinal_duration(struct tpm_chip *chip, u32 ordinal);
 int tpm2_probe(struct tpm_chip *chip);
 int tpm2_find_cc(struct tpm_chip *chip, u32 cc);
-int tpm2_init_space(struct tpm_space *space);
+int tpm2_init_space(struct tpm_space *space, unsigned int buf_size);
 void tpm2_del_space(struct tpm_chip *chip, struct tpm_space *space);
 int tpm2_prepare_space(struct tpm_chip *chip, struct tpm_space *space, u32 cc,
 		       u8 *cmd);
diff --git a/drivers/char/tpm/tpm2-space.c b/drivers/char/tpm/tpm2-space.c
index d2e101b32482..9f4e22dcde27 100644
--- a/drivers/char/tpm/tpm2-space.c
+++ b/drivers/char/tpm/tpm2-space.c
@@ -43,18 +43,21 @@ static void tpm2_flush_sessions(struct tpm_chip *chip, struct tpm_space *space)
 	}
 }

-int tpm2_init_space(struct tpm_space *space)
+int tpm2_init_space(struct tpm_space *space, unsigned int buf_size)
 {
-	space->context_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	space->context_buf = kzalloc(buf_size, GFP_KERNEL);
 	if (!space->context_buf)
 		return -ENOMEM;

-	space->session_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	space->session_buf = kzalloc(buf_size, GFP_KERNEL);
 	if (space->session_buf == NULL) {
 		kfree(space->context_buf);
+		/* Prevent caller getting a dangling pointer. */
+		space->context_buf = NULL;
 		return -ENOMEM;
 	}

+	space->buf_size = buf_size;
 	return 0;
 }

@@ -276,8 +279,10 @@ int tpm2_prepare_space(struct tpm_chip *chip, struct tpm_space *space, u32 cc,
 	       sizeof(space->context_tbl));
 	memcpy(&chip->work_space.session_tbl, &space->session_tbl,
 	       sizeof(space->session_tbl));
-	memcpy(chip->work_space.context_buf, space->context_buf, PAGE_SIZE);
-	memcpy(chip->work_space.session_buf, space->session_buf, PAGE_SIZE);
+	memcpy(chip->work_space.context_buf, space->context_buf,
+	       space->buf_size);
+	memcpy(chip->work_space.session_buf, space->session_buf,
+	       space->buf_size);

 	rc = tpm2_load_space(chip);
 	if (rc) {
@@ -456,7 +461,7 @@ static int tpm2_save_space(struct tpm_chip *chip)
 			continue;

 		rc = tpm2_save_context(chip, space->context_tbl[i],
-				       space->context_buf, PAGE_SIZE,
+				       space->context_buf, space->buf_size,
 				       &offset);
 		if (rc == -ENOENT) {
 			space->context_tbl[i] = 0;
@@ -474,9 +479,8 @@ static int tpm2_save_space(struct tpm_chip *chip)
 			continue;

 		rc = tpm2_save_context(chip, space->session_tbl[i],
-				       space->session_buf, PAGE_SIZE,
+				       space->session_buf, space->buf_size,
 				       &offset);
-
 		if (rc == -ENOENT) {
 			/* handle error saving session, just forget it */
 			space->session_tbl[i] = 0;
@@ -522,8 +526,10 @@ int tpm2_commit_space(struct tpm_chip *chip, struct tpm_space *space,
 	       sizeof(space->context_tbl));
 	memcpy(&space->session_tbl, &chip->work_space.session_tbl,
 	       sizeof(space->session_tbl));
-	memcpy(space->context_buf, chip->work_space.context_buf, PAGE_SIZE);
-	memcpy(space->session_buf, chip->work_space.session_buf, PAGE_SIZE);
+	memcpy(space->context_buf, chip->work_space.context_buf,
+	       space->buf_size);
+	memcpy(space->session_buf, chip->work_space.session_buf,
+	       space->buf_size);

 	return 0;
 }
diff --git a/drivers/char/tpm/tpmrm-dev.c b/drivers/char/tpm/tpmrm-dev.c
index 1a0e97a5da5a..162fb16243d0 100644
--- a/drivers/char/tpm/tpmrm-dev.c
+++ b/drivers/char/tpm/tpmrm-dev.c
@@ -22,7 +22,7 @@ static int tpmrm_open(struct inode *inode, struct file *file)
 	if (priv == NULL)
 		return -ENOMEM;

-	rc = tpm2_init_space(&priv->space);
+	rc = tpm2_init_space(&priv->space, TPM2_SPACE_BUFFER_SIZE);
 	if (rc) {
 		kfree(priv);
 		return -ENOMEM;
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.144 commit c12a4e6ac32c6d5978fecc0501b7bb9c8ddd2bae
--------------------------------
[ Upstream commit 49d9c5936314e44d314c605c39cce0fd947f9c3a ]
Match the pattern elsewhere in this file.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Steven Rostedt (VMware) rostedt@goodmis.org Reviewed-by: Thomas Gleixner tglx@linutronix.de Acked-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Tested-by: Marco Elver elver@google.com Link: https://lkml.kernel.org/r/20200821085348.251340558@infradead.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/cpuidle/cpuidle.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 6df894d65d9e..2d182dc1b49e 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -148,7 +148,8 @@ static void enter_s2idle_proper(struct cpuidle_driver *drv,
 	 */
 	stop_critical_timings();
 	drv->states[index].enter_s2idle(dev, drv, index);
-	WARN_ON(!irqs_disabled());
+	if (WARN_ON_ONCE(!irqs_disabled()))
+		local_irq_disable();
 	/*
 	 * timekeeping_resume() that will be called by tick_unfreeze() for the
 	 * first CPU executing it calls functions containing RCU read-side
From: Al Grant al.grant@foss.arm.com
stable inclusion from linux-4.19.144 commit 5154e806105266406156b3fa67d05df7a398aa6c
--------------------------------
[ Upstream commit 39c0a53b114d0317e5c4e76b631f41d133af5cb0 ]
perf_event.h has macros that define the field offsets in the data_src bitmask in perf records. The SNOOPX and REMOTE offsets were both 37.
These are distinct fields, and the bitfield layout in perf_mem_data_src confirms that SNOOPX should be at offset 38.
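The 38 can be checked by summing the widths of the fields that precede mem_snoopx in the perf_mem_data_src bitfield (widths as declared in perf_event.h at the time of this patch):

#include <stdio.h>

int main(void)
{
	/* Field widths preceding mem_remote/mem_snoopx in
	 * perf_mem_data_src: op:5 lvl:14 snoop:5 lock:2 dtlb:7 lvl_num:4 */
	unsigned int remote_shift = 5 + 14 + 5 + 2 + 7 + 4;	/* 37 */
	unsigned int snoopx_shift = remote_shift + 1;		/* + remote:1 */

	printf("REMOTE shift = %u\n", remote_shift);	/* 37 */
	printf("SNOOPX shift = %u\n", snoopx_shift);	/* 38, not 37 */
	return 0;
}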
Committer notes:
This was extracted from a larger patch that also contained kernel changes.
Fixes: 52839e653b5629bd ("perf tools: Add support for printing new mem_info encodings") Signed-off-by: Al Grant al.grant@arm.com Reviewed-by: Andi Kleen ak@linux.intel.com Cc: Adrian Hunter adrian.hunter@intel.com Cc: Ian Rogers irogers@google.com Cc: Jiri Olsa jolsa@kernel.org Cc: Namhyung Kim namhyung@kernel.org Cc: Peter Zijlstra peterz@infradead.org Link: http://lore.kernel.org/lkml/9974f2d0-bf7f-518e-d9f7-4520e5ff1bb0@foss.arm.co... Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/include/uapi/linux/perf_event.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index f35eb72739c0..a45e7b4f0316 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1079,7 +1079,7 @@ union perf_mem_data_src {

 #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
 /* 1 free */
-#define PERF_MEM_SNOOPX_SHIFT	37
+#define PERF_MEM_SNOOPX_SHIFT	38

 /* locked instruction */
 #define PERF_MEM_LOCK_NA	0x01 /* not available */
From: Al Viro viro@zeniv.linux.org.uk
stable inclusion from linux-4.19.144 commit 37d933e8b41b83bb8278815e366aec5a542b7e31
--------------------------------
[ Upstream commit 77f4689de17c0887775bb77896f4cc11a39bf848 ]
epoll_loop_check_proc() can run into a file already committed to destruction; we can't grab a reference on those and don't need to add them to the set for reverse path check anyway.
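get_file_rcu() is the conditional form of get_file(): it bumps the reference count only if it has not already dropped to zero. A userspace sketch of the same inc-not-zero idiom (a hypothetical rendition of what the kernel does with atomic_long_inc_not_zero()):

#include <stdatomic.h>
#include <stdbool.h>

struct file_ref {
	atomic_long f_count;
};

/* Take a reference only while the object is still live; a file already
 * committed to destruction (count == 0) is skipped, never resurrected. */
static bool get_file_if_live(struct file_ref *f)
{
	long c = atomic_load(&f->f_count);

	while (c != 0)
		if (atomic_compare_exchange_weak(&f->f_count, &c, c + 1))
			return true;
	return false;
}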
Tested-by: Marc Zyngier maz@kernel.org Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list") Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/eventpoll.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6387b6128f3e..76f9079a6802 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1893,9 +1893,9 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
 			 * during ep_insert().
 			 */
 			if (list_empty(&epi->ffd.file->f_tfile_llink)) {
-				get_file(epi->ffd.file);
-				list_add(&epi->ffd.file->f_tfile_llink,
-					 &tfile_check_list);
+				if (get_file_rcu(epi->ffd.file))
+					list_add(&epi->ffd.file->f_tfile_llink,
+						 &tfile_check_list);
 			}
 		}
 	}
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.144 commit ab2413892e2d26015eae2f279f30935846ca24aa
--------------------------------
[ Upstream commit d0c20d38af135b2b4b90aa59df7878ef0c8fbef4 ]
The realtime flag only applies to the data fork, so don't use the realtime block number checks on the attr fork of a realtime file.
Fixes: 30b0984d9117 ("xfs: refactor bmap record validation") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Eric Sandeen sandeen@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 02cdcd999cf6..a902f3b6e7db 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -6179,7 +6179,7 @@ xfs_bmap_validate_extent(
isrt = XFS_IS_REALTIME_INODE(ip); endfsb = irec->br_startblock + irec->br_blockcount - 1; - if (isrt) { + if (isrt && whichfork == XFS_DATA_FORK) { if (!xfs_verify_rtbno(mp, irec->br_startblock)) return __this_address; if (!xfs_verify_rtbno(mp, endfsb))
From: Namhyung Kim namhyung@kernel.org
stable inclusion from linux-4.19.144 commit faec94592f7364be5f1e512f6330d3b86136166b
--------------------------------
[ Upstream commit e62458e3940eb3dfb009481850e140fbee183b04 ]
The new string should have enough space for the original string, the added backslashes, and the terminating NUL.
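A minimal sketch of the escaping that motivates the size calculation (illustrative reimplementation; the exact special-character set lives in fixregex()): every escaped character adds one backslash, so the output buffer needs len + esc_count + 1 bytes, not len + 1.

    #include <stdlib.h>
    #include <string.h>

    /* sketch: prepend a backslash to every character in 'special' */
    static char *escape_chars(const char *s, const char *special)
    {
            size_t len = strlen(s), esc = 0;
            const char *p;
            char *out, *q;

            for (p = s; *p; p++)
                    if (strchr(special, *p))
                            esc++;

            out = malloc(len + esc + 1);    /* string + escapes + NUL */
            if (!out)
                    return NULL;

            for (p = s, q = out; *p; p++) {
                    if (strchr(special, *p))
                            *q++ = '\\';
                    *q++ = *p;
            }
            *q = '\0';
            return out;
    }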
Fixes: fbc2844e84038ce3 ("perf vendor events: Use more flexible pattern matching for CPU identification for mapfile.csv") Signed-off-by: Namhyung Kim namhyung@kernel.org Reviewed-by: Ian Rogers irogers@google.com Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Andi Kleen andi@firstfloor.org Cc: Jiri Olsa jolsa@redhat.com Cc: John Garry john.garry@huawei.com Cc: Kajol Jain kjain@linux.ibm.com Cc: Mark Rutland mark.rutland@arm.com Cc: Peter Zijlstra peterz@infradead.org Cc: Stephane Eranian eranian@google.com Cc: William Cohen wcohen@redhat.com Link: http://lore.kernel.org/lkml/20200903152510.489233-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/perf/pmu-events/jevents.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c index 38b5888ef7b3..c17e59404171 100644 --- a/tools/perf/pmu-events/jevents.c +++ b/tools/perf/pmu-events/jevents.c @@ -137,7 +137,7 @@ static char *fixregex(char *s) return s;
/* allocate space for a new string */ - fixed = (char *) malloc(len + 1); + fixed = (char *) malloc(len + esc_count + 1); if (!fixed) return NULL;
From: Jason Gunthorpe jgg@nvidia.com
stable inclusion from linux-4.19.144 commit 95968e5cbb5db7ff33288d01a95349d476f19248
--------------------------------
[ Upstream commit 428fc0aff4e59399ec719ffcc1f7a5d29a4ee476 ]
Otherwise gcc generates warnings when the macro argument is a compound expression.
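A worked example of the misparse (hypothetical caller, real C operator precedence): with the old macro, roundup_pow_of_two(x & 0xff) expanded the constant test to

    x & 0xff == 1

which C groups as x & (0xff == 1), because == binds tighter than &, rather than the intended (x & 0xff) == 1; gcc warns about exactly this pattern. The parenthesized ((n) == 1) expands to ((x & 0xff) == 1) as intended.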
Fixes: 312a0c170945 ("[PATCH] LOG2: Alter roundup_pow_of_two() so that it can use a ilog2() on a constant") Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Link: https://lkml.kernel.org/r/0-v1-8a2697e3c003+41165-log_brackets_jgg@nvidia.co... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/log2.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/log2.h b/include/linux/log2.h index 2af7f77866d0..78496801cddf 100644 --- a/include/linux/log2.h +++ b/include/linux/log2.h @@ -177,7 +177,7 @@ unsigned long __rounddown_pow_of_two(unsigned long n) #define roundup_pow_of_two(n) \ ( \ __builtin_constant_p(n) ? ( \ - (n == 1) ? 1 : \ + ((n) == 1) ? 1 : \ (1UL << (ilog2((n) - 1) + 1)) \ ) : \ __roundup_pow_of_two(n) \
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.144 commit 884fee7632168ab59ed49a26de430fa3ed5c6a86
--------------------------------
commit b17164e258e3888d376a7434415013175d637377 upstream.
When running in DAX mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the xfs filesystem incorrectly updates ctime and mtime when the user hits a COW fault.
This breaks building of the Linux kernel. How to reproduce:
1. extract the Linux kernel tree on dax-mounted xfs filesystem
2. run make clean
3. run make -j12
4. run make -j12
at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3).
The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes a COW page fault on its .data section, and these faults incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree.
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/xfs_file.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index ba344f014782..1b2eb9d055ba 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1121,6 +1121,14 @@ __xfs_filemap_fault( return ret; }
+static inline bool +xfs_is_write_fault( + struct vm_fault *vmf) +{ + return (vmf->flags & FAULT_FLAG_WRITE) && + (vmf->vma->vm_flags & VM_SHARED); +} + static vm_fault_t xfs_filemap_fault( struct vm_fault *vmf) @@ -1128,7 +1136,7 @@ xfs_filemap_fault( /* DAX can shortcut the normal fault path on write faults! */ return __xfs_filemap_fault(vmf, PE_SIZE_PTE, IS_DAX(file_inode(vmf->vma->vm_file)) && - (vmf->flags & FAULT_FLAG_WRITE)); + xfs_is_write_fault(vmf)); }
static vm_fault_t @@ -1141,7 +1149,7 @@ xfs_filemap_huge_fault(
/* DAX can shortcut the normal fault path on write faults! */ return __xfs_filemap_fault(vmf, pe_size, - (vmf->flags & FAULT_FLAG_WRITE)); + xfs_is_write_fault(vmf)); }
static vm_fault_t
From: Masami Hiramatsu mhiramat@kernel.org
stable inclusion from linux-4.19.144 commit 61135a9c74c8f79fce637c28f6109ab58467d5bd
--------------------------------
[ Upstream commit 3d7081822f7f9eab867d9bcc8fd635208ec438e0 ]
Add probe_user_read(), strncpy_from_unsafe_user() and strnlen_unsafe_user(), which allow callers to access user-space memory from IRQ context.
The current probe_kernel_read() and strncpy_from_unsafe() cannot be used on user-space memory, because they set KERNEL_DS while accessing the data. On some architectures the user address space and the kernel address space coexist, while on others they cannot; where they do coexist, setting KERNEL_DS means the given address is treated as a kernel address. Also, strnlen_user() is only usable from user context, since it can sleep when pagefaults are enabled.
To access user-space memory without taking a pagefault, we need these new functions, which set USER_DS while accessing the data.
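A hedged usage sketch (the wrapper and messages are illustrative; only probe_user_read() comes from this patch): reading a word from user space in a context where faulting is not allowed, such as a kprobe or IRQ handler.

    #include <linux/uaccess.h>
    #include <linux/printk.h>

    /* illustrative: safe, non-faulting peek at a user address */
    static void dump_user_word(const void __user *uaddr)
    {
            unsigned long val;

            if (probe_user_read(&val, uaddr, sizeof(val)))
                    pr_info("user address not readable without faulting\n");
            else
                    pr_info("user word: %lx\n", val);
    }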
Link: http://lkml.kernel.org/r/155789869802.26965.4940338412595759063.stgit@devnot...
Acked-by: Ingo Molnar mingo@kernel.org Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Sasha Levin sashal@kernel.org Conflicts: mm/maccess.c [yyl: remove VERIFY_READ in access_ok()] Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/uaccess.h | 14 +++++ mm/maccess.c | 122 ++++++++++++++++++++++++++++++++++++++-- 2 files changed, 130 insertions(+), 6 deletions(-)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index c35a07ac4b1a..4939e6b5b471 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -239,6 +239,17 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to, extern long probe_kernel_read(void *dst, const void *src, size_t size); extern long __probe_kernel_read(void *dst, const void *src, size_t size);
+/* + * probe_user_read(): safely attempt to read from a location in user space + * @dst: pointer to the buffer that shall take the data + * @src: address to read from + * @size: size of the data chunk + * + * Safely read from address @src to the buffer at @dst. If a kernel fault + * happens, handle that and return -EFAULT. + */ +extern long probe_user_read(void *dst, const void __user *src, size_t size); + /* * probe_kernel_write(): safely attempt to write to a location * @dst: address to write to @@ -252,6 +263,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size); extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);
extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count); +extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, + long count); +extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
/** * probe_kernel_address(): safely attempt to read from a location diff --git a/mm/maccess.c b/mm/maccess.c index ec00be51a24f..19c8c3dc14df 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -5,8 +5,20 @@ #include <linux/mm.h> #include <linux/uaccess.h>
+static __always_inline long +probe_read_common(void *dst, const void __user *src, size_t size) +{ + long ret; + + pagefault_disable(); + ret = __copy_from_user_inatomic(dst, src, size); + pagefault_enable(); + + return ret ? -EFAULT : 0; +} + /** - * probe_kernel_read(): safely attempt to read from a location + * probe_kernel_read(): safely attempt to read from a kernel-space location * @dst: pointer to the buffer that shall take the data * @src: address to read from * @size: size of the data chunk @@ -29,16 +41,40 @@ long __probe_kernel_read(void *dst, const void *src, size_t size) mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS); - pagefault_disable(); - ret = __copy_from_user_inatomic(dst, - (__force const void __user *)src, size); - pagefault_enable(); + ret = probe_read_common(dst, (__force const void __user *)src, size); set_fs(old_fs);
- return ret ? -EFAULT : 0; + return ret; } EXPORT_SYMBOL_GPL(probe_kernel_read);
+/** + * probe_user_read(): safely attempt to read from a user-space location + * @dst: pointer to the buffer that shall take the data + * @src: address to read from. This must be a user address. + * @size: size of the data chunk + * + * Safely read from user address @src to the buffer at @dst. If a kernel fault + * happens, handle that and return -EFAULT. + */ + +long __weak probe_user_read(void *dst, const void __user *src, size_t size) + __attribute__((alias("__probe_user_read"))); + +long __probe_user_read(void *dst, const void __user *src, size_t size) +{ + long ret = -EFAULT; + mm_segment_t old_fs = get_fs(); + + set_fs(USER_DS); + if (access_ok(src, size)) + ret = probe_read_common(dst, src, size); + set_fs(old_fs); + + return ret; +} +EXPORT_SYMBOL_GPL(probe_user_read); + /** * probe_kernel_write(): safely attempt to write to a location * @dst: address to write to @@ -66,6 +102,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size) } EXPORT_SYMBOL_GPL(probe_kernel_write);
+ /** * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address. * @dst: Destination address, in kernel space. This buffer must be at @@ -105,3 +142,76 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
return ret ? -EFAULT : src - unsafe_addr; } + +/** + * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user + * address. + * @dst: Destination address, in kernel space. This buffer must be at + * least @count bytes long. + * @unsafe_addr: Unsafe user address. + * @count: Maximum number of bytes to copy, including the trailing NUL. + * + * Copies a NUL-terminated string from unsafe user address to kernel buffer. + * + * On success, returns the length of the string INCLUDING the trailing NUL. + * + * If access fails, returns -EFAULT (some data may have been copied + * and the trailing NUL added). + * + * If @count is smaller than the length of the string, copies @count-1 bytes, + * sets the last byte of @dst buffer to NUL and returns @count. + */ +long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, + long count) +{ + mm_segment_t old_fs = get_fs(); + long ret; + + if (unlikely(count <= 0)) + return 0; + + set_fs(USER_DS); + pagefault_disable(); + ret = strncpy_from_user(dst, unsafe_addr, count); + pagefault_enable(); + set_fs(old_fs); + + if (ret >= count) { + ret = count; + dst[ret - 1] = '\0'; + } else if (ret > 0) { + ret++; + } + + return ret; +} + +/** + * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL. + * @unsafe_addr: The string to measure. + * @count: Maximum count (including NUL) + * + * Get the size of a NUL-terminated string in user space without pagefault. + * + * Returns the size of the string INCLUDING the terminating NUL. + * + * If the string is too long, returns a number larger than @count. User + * has to check the return value against "> count". + * On exception (or invalid count), returns 0. + * + * Unlike strnlen_user, this can be used from IRQ handler etc. because + * it disables pagefaults. + */ +long strnlen_unsafe_user(const void __user *unsafe_addr, long count) +{ + mm_segment_t old_fs = get_fs(); + int ret; + + set_fs(USER_DS); + pagefault_disable(); + ret = strnlen_user(unsafe_addr, count); + pagefault_enable(); + set_fs(old_fs); + + return ret; +}
From: Daniel Borkmann daniel@iogearbox.net
stable inclusion from linux-4.19.144 commit cfb4721fce554cb596ea86af116d3d68a4d91254
--------------------------------
[ Upstream commit 1d1585ca0f48fe7ed95c3571f3e4a82b2b5045dc ]
Commit 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions") missed adding a probe write function. Therefore, factor out a probe_write_common() helper containing most of the logic of probe_kernel_write() except for setting KERNEL_DS, and add a new probe_user_write() helper so it can be used from the BPF side.
Again, on some architectures the user address space and the kernel address space can coexist and overlap; in that case, setting KERNEL_DS would mean that the given address is treated as being in the kernel address space.
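A matching usage sketch (illustrative wrapper, not actual BPF helper code): writing a value back into a user buffer from a non-faultable context.

    #include <linux/types.h>
    #include <linux/uaccess.h>

    /* illustrative: returns 0 on success, -EFAULT if the destination
     * page is not resident or the address fails access_ok() */
    static long poke_user_u32(void __user *udst, u32 val)
    {
            return probe_user_write(udst, &val, sizeof(val));
    }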
Signed-off-by: Daniel Borkmann daniel@iogearbox.net Signed-off-by: Alexei Starovoitov ast@kernel.org Acked-by: Andrii Nakryiko andriin@fb.com Cc: Masami Hiramatsu mhiramat@kernel.org Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649... Signed-off-by: Sasha Levin sashal@kernel.org Conflicts: mm/maccess.c [yyl: remove VERIFY_WRITE in access_ok()] Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/uaccess.h | 12 +++++++++++ mm/maccess.c | 45 +++++++++++++++++++++++++++++++++++++---- 2 files changed, 53 insertions(+), 4 deletions(-)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 4939e6b5b471..29ebf7f1e47c 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -262,6 +262,18 @@ extern long probe_user_read(void *dst, const void __user *src, size_t size); extern long notrace probe_kernel_write(void *dst, const void *src, size_t size); extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);
+/* + * probe_user_write(): safely attempt to write to a location in user space + * @dst: address to write to + * @src: pointer to the data that shall be written + * @size: size of the data chunk + * + * Safely write to address @dst from the buffer at @src. If a kernel fault + * happens, handle that and return -EFAULT. + */ +extern long notrace probe_user_write(void __user *dst, const void *src, size_t size); +extern long notrace __probe_user_write(void __user *dst, const void *src, size_t size); + extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count); extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, long count); diff --git a/mm/maccess.c b/mm/maccess.c index 19c8c3dc14df..35381585318f 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -17,6 +17,18 @@ probe_read_common(void *dst, const void __user *src, size_t size) return ret ? -EFAULT : 0; }
+static __always_inline long +probe_write_common(void __user *dst, const void *src, size_t size) +{ + long ret; + + pagefault_disable(); + ret = __copy_to_user_inatomic(dst, src, size); + pagefault_enable(); + + return ret ? -EFAULT : 0; +} + /** * probe_kernel_read(): safely attempt to read from a kernel-space location * @dst: pointer to the buffer that shall take the data @@ -84,6 +96,7 @@ EXPORT_SYMBOL_GPL(probe_user_read); * Safely write to address @dst from the buffer at @src. If a kernel fault * happens, handle that and return -EFAULT. */ + long __weak probe_kernel_write(void *dst, const void *src, size_t size) __attribute__((alias("__probe_kernel_write")));
@@ -93,15 +106,39 @@ long __probe_kernel_write(void *dst, const void *src, size_t size) mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS); - pagefault_disable(); - ret = __copy_to_user_inatomic((__force void __user *)dst, src, size); - pagefault_enable(); + ret = probe_write_common((__force void __user *)dst, src, size); set_fs(old_fs);
- return ret ? -EFAULT : 0; + return ret; } EXPORT_SYMBOL_GPL(probe_kernel_write);
+/** + * probe_user_write(): safely attempt to write to a user-space location + * @dst: address to write to + * @src: pointer to the data that shall be written + * @size: size of the data chunk + * + * Safely write to address @dst from the buffer at @src. If a kernel fault + * happens, handle that and return -EFAULT. + */ + +long __weak probe_user_write(void __user *dst, const void *src, size_t size) + __attribute__((alias("__probe_user_write"))); + +long __probe_user_write(void __user *dst, const void *src, size_t size) +{ + long ret = -EFAULT; + mm_segment_t old_fs = get_fs(); + + set_fs(USER_DS); + if (access_ok(dst, size)) + ret = probe_write_common(dst, src, size); + set_fs(old_fs); + + return ret; +} +EXPORT_SYMBOL_GPL(probe_user_write);
/** * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address.
From: Tejun Heo tj@kernel.org
stable inclusion from linux-4.19.144 commit a8bb7740aa313994bfa4c21cba399f65985a8a35
--------------------------------
commit 3b5455636fe26ea21b4189d135a424a6da016418 upstream.
All three generations of Sandisk SSDs lock up hard intermittently. Experiments showed that disabling NCQ lowered the failure rate significantly and the kernel has been disabling NCQ for some models of SD7's and 8's, which is obviously undesirable.
Karthik worked with Sandisk to root cause the hard lockups to trim commands larger than 128M. This patch implements ATA_HORKAGE_MAX_TRIM_128M which limits max trim size to 128M and applies it to all three generations of Sandisk SSDs.
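The limit is expressed in 512-byte blocks for the SCSI Block Limits VPD page; with SECTOR_SHIFT == 9, the arithmetic works out as:

    max_blocks = 128 << (20 - SECTOR_SHIFT)
               = 128 << 11
               = 262144 blocks
               = 262144 * 512 bytes = 128 MiB

versus the 65535 * ATA_MAX_TRIM_RNUM blocks advertised by default (with ATA_MAX_TRIM_RNUM being 64 in this era, roughly 2 GiB).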
Signed-off-by: Tejun Heo tj@kernel.org Cc: Karthik Shivaram karthikgs@fb.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/ata/libata-core.c | 5 ++--- drivers/ata/libata-scsi.c | 8 +++++++- include/linux/libata.h | 1 + 3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index f12392aa0339..290ff99bf1d8 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -4522,9 +4522,8 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = { /* https://bugzilla.kernel.org/show_bug.cgi?id=15573 */ { "C300-CTFDDAC128MAG", "0001", ATA_HORKAGE_NONCQ, },
- /* Some Sandisk SSDs lock up hard with NCQ enabled. Reported on - SD7SN6S256G and SD8SN8U256G */ - { "SanDisk SD[78]SN*G", NULL, ATA_HORKAGE_NONCQ, }, + /* Sandisk SD7/8/9s lock up hard on large trims */ + { "SanDisk SD[789]*", NULL, ATA_HORKAGE_MAX_TRIM_128M, },
/* devices which puke on READ_NATIVE_MAX */ { "HDS724040KLSA80", "KFAOA20N", ATA_HORKAGE_BROKEN_HPA, }, diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index ba07ed41e64b..16d76c8490c7 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -2392,6 +2392,7 @@ static unsigned int ata_scsiop_inq_89(struct ata_scsi_args *args, u8 *rbuf)
static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) { + struct ata_device *dev = args->dev; u16 min_io_sectors;
rbuf[1] = 0xb0; @@ -2417,7 +2418,12 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) * with the unmap bit set. */ if (ata_id_has_trim(args->id)) { - put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]); + u64 max_blocks = 65535 * ATA_MAX_TRIM_RNUM; + + if (dev->horkage & ATA_HORKAGE_MAX_TRIM_128M) + max_blocks = 128 << (20 - SECTOR_SHIFT); + + put_unaligned_be64(max_blocks, &rbuf[36]); put_unaligned_be32(1, &rbuf[28]); }
diff --git a/include/linux/libata.h b/include/linux/libata.h index 92a541a9c939..dc164b7ebbb0 100644 --- a/include/linux/libata.h +++ b/include/linux/libata.h @@ -438,6 +438,7 @@ enum { ATA_HORKAGE_NO_DMA_LOG = (1 << 23), /* don't use DMA for log read */ ATA_HORKAGE_NOTRIM = (1 << 24), /* don't use TRIM */ ATA_HORKAGE_MAX_SEC_1024 = (1 << 25), /* Limit max sects to 1024 */ + ATA_HORKAGE_MAX_TRIM_128M = (1 << 26), /* Limit max trim size to 128M */
/* DMA mask for user DMA control: User visible values; DO NOT renumber */
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.144 commit 154096e9966123f198cfc2f959bc559c562fc6b7
--------------------------------
commit f9e040efcc28309e5c592f7e79085a9a52e31f58 upstream.
The function dax_direct_access doesn't take partitions into account, it always maps pages from the beginning of the device. Therefore, persistent_memory_claim() must get the partition offset using get_start_sect() and add it to the page offsets passed to dax_direct_access().
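A worked example of the conversion (assuming 4 KiB pages, so PAGE_SHIFT == 12 and 8 sectors per page): get_start_sect() reports the partition start in 512-byte sectors, while dax_direct_access() takes an offset in pages.

    start sector 2048:  2048 & 7 == 0  -> aligned; page offset = 2048 >> 3 = 256
    start sector 2050:  2050 & 7 != 0  -> not page-aligned; fail with -EINVAL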
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Fixes: 48debafe4f2f ("dm: add writecache target") Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/md/dm-writecache.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index 774ca03a32de..a6aa366f6533 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -226,6 +226,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) pfn_t pfn; int id; struct page **pages; + sector_t offset;
wc->memory_vmapped = false;
@@ -244,9 +245,16 @@ static int persistent_memory_claim(struct dm_writecache *wc) goto err1; }
+ offset = get_start_sect(wc->ssd_dev->bdev); + if (offset & (PAGE_SIZE / 512 - 1)) { + r = -EINVAL; + goto err1; + } + offset >>= PAGE_SHIFT - 9; + id = dax_read_lock();
- da = dax_direct_access(wc->ssd_dev->dax_dev, 0, p, &wc->memory_map, &pfn); + da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn); if (da < 0) { wc->memory_map = NULL; r = da; @@ -268,7 +276,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) i = 0; do { long daa; - daa = dax_direct_access(wc->ssd_dev->dax_dev, i, p - i, + daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i, NULL, &pfn); if (daa <= 0) { r = daa ? daa : -EINVAL;
From: Jens Axboe axboe@kernel.dk
stable inclusion from linux-4.19.145 commit 732fd460bb72fd51607311009b7d474a6e0e47f3
--------------------------------
[ Upstream commit de1b0ee490eafdf65fac9eef9925391a8369f2dc ]
If a driver leaves the limit settings as the defaults, then we don't initialize bdi->io_pages. This means that file systems may need to work around bdi->io_pages == 0, which is somewhat messy.
Initialize the default value just like we do for ->ra_pages.
Cc: stable@vger.kernel.org Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting") Reported-by: OGAWA Hirofumi hirofumi@mail.parknet.co.jp Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/blk-core.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c index 7afe44d09741..bf1049cf598b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1028,6 +1028,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
q->backing_dev_info->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE; + q->backing_dev_info->io_pages = + (VM_MAX_READAHEAD * 1024) / PAGE_SIZE; q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK; q->backing_dev_info->name = "block"; q->node = node_id;
From: "Darrick J. Wong" darrick.wong@oracle.com
stable inclusion from linux-4.19.146 commit b701016288dc8cfe55d24b1b9e075adbab25ae78
--------------------------------
[ Upstream commit 125eac243806e021f33a1fdea3687eccbb9f7636 ]
Don't leak kernel memory contents into the shortform attr fork.
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Eric Sandeen sandeen@redhat.com Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_attr_leaf.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index 39025b62b449..39df2c5425fa 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -553,8 +553,8 @@ xfs_attr_shortform_create(xfs_da_args_t *args) ASSERT(ifp->if_flags & XFS_IFINLINE); } xfs_idata_realloc(dp, sizeof(*hdr), XFS_ATTR_FORK); - hdr = (xfs_attr_sf_hdr_t *)ifp->if_u1.if_data; - hdr->count = 0; + hdr = (struct xfs_attr_sf_hdr *)ifp->if_u1.if_data; + memset(hdr, 0, sizeof(*hdr)); hdr->totsize = cpu_to_be16(sizeof(*hdr)); xfs_trans_log_inode(args->trans, dp, XFS_ILOG_CORE | XFS_ILOG_ADATA); }
From: Varun Prakash varun@chelsio.com
stable inclusion from linux-4.19.146 commit 549a2cac6bc278b7f238a59eeb644205a10a86ca
--------------------------------
commit 5528d03183fe5243416c706f64b1faa518b05130 upstream.
The current code does not take 'page_off' into account when calculating the data digest. To fix this, add a local variable 'first_sg' and set first_sg.offset to sg->offset + page_off.
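A worked example with assumed sizes (illustrative numbers): if sg->length == 4096, page_off == 1024 and data_length == 8192, the first chunk must be hashed starting at sg->offset + 1024 for min(8192, 4096 - 1024) = 3072 bytes; the remaining 5120 bytes then come from the following entries at their natural offsets. The old loop computed the right length for the first chunk but hashed it from the start of the first entry, so the digest covered the wrong bytes.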
Link: https://lore.kernel.org/r/1598358910-3052-1-git-send-email-varun@chelsio.com Fixes: e48354ce078c ("iscsi-target: Add iSCSI fabric support for target v4.1") Cc: stable@vger.kernel.org Reviewed-by: Mike Christie michael.christie@oralce.com Signed-off-by: Varun Prakash varun@chelsio.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/target/iscsi/iscsi_target.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index 317d0f3f7a14..efe93d6b02b0 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -1383,14 +1383,27 @@ static u32 iscsit_do_crypto_hash_sg( sg = cmd->first_data_sg; page_off = cmd->first_data_sg_off;
+ if (data_length && page_off) { + struct scatterlist first_sg; + u32 len = min_t(u32, data_length, sg->length - page_off); + + sg_init_table(&first_sg, 1); + sg_set_page(&first_sg, sg_page(sg), len, sg->offset + page_off); + + ahash_request_set_crypt(hash, &first_sg, NULL, len); + crypto_ahash_update(hash); + + data_length -= len; + sg = sg_next(sg); + } + while (data_length) { - u32 cur_len = min_t(u32, data_length, (sg->length - page_off)); + u32 cur_len = min_t(u32, data_length, sg->length);
ahash_request_set_crypt(hash, sg, NULL, cur_len); crypto_ahash_update(hash);
data_length -= cur_len; - page_off = 0; /* iscsit_map_iovec has already checked for invalid sg pointers */ sg = sg_next(sg); }
From: Hou Pu houpu@bytedance.com
stable inclusion from linux-4.19.146 commit 4f78e55daaa2986bc533d1da3e8e7ac9cc3048f5
--------------------------------
commit ed43ffea78dcc97db3f561da834f1a49c8961e33 upstream.
The iSCSI target login thread might get stuck with the following stack:
cat /proc/`pidof iscsi_np`/stack
[<0>] down_interruptible+0x42/0x50
[<0>] iscsit_access_np+0xe3/0x167
[<0>] iscsi_target_locate_portal+0x695/0x8ac
[<0>] __iscsi_target_login_thread+0x855/0xb82
[<0>] iscsi_target_login_thread+0x2f/0x5a
[<0>] kthread+0xfa/0x130
[<0>] ret_from_fork+0x1f/0x30
This can be reproduced via the following steps:
1. Initiator A tries to log in to iqn1-tpg1 on port 3260. After the PDU exchange in the login thread finishes, and before the negotiation completes, the network link goes down. At this point A has not finished logging in, and tpg->np_login_sem is held.
2. Initiator B tries to log in to iqn2-tpg1 on port 3260. After finishing PDU exchange in the login thread the target expects to process remaining login PDUs in workqueue context.
3. Initiator A' tries to log in to iqn1-tpg1 on port 3260 from a new socket. A' waits on tpg->np_login_sem with np->np_login_timer armed, for at most 15 seconds. The semaphore is held by A, so A' eventually times out.
4. Before A' times out, initiator B's negotiation fails and it calls iscsi_target_login_drop()->iscsi_target_login_sess_out(). The np->np_login_timer is canceled, and initiator A' will hang forever. Because A' now occupies the login thread, no new login requests can be serviced.
Fix this by moving iscsi_stop_login_thread_timer() out of iscsi_target_login_sess_out(). Also remove iscsi_np parameter from iscsi_target_login_sess_out().
Link: https://lore.kernel.org/r/20200729130343.24976-1-houpu@bytedance.com Cc: stable@vger.kernel.org Reviewed-by: Mike Christie michael.christie@oracle.com Signed-off-by: Hou Pu houpu@bytedance.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/target/iscsi/iscsi_target_login.c | 6 +++--- drivers/target/iscsi/iscsi_target_login.h | 3 +-- drivers/target/iscsi/iscsi_target_nego.c | 3 +-- 3 files changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/target/iscsi/iscsi_target_login.c b/drivers/target/iscsi/iscsi_target_login.c index bb90c80ff388..185302023aec 100644 --- a/drivers/target/iscsi/iscsi_target_login.c +++ b/drivers/target/iscsi/iscsi_target_login.c @@ -1182,7 +1182,7 @@ void iscsit_free_conn(struct iscsi_conn *conn) }
void iscsi_target_login_sess_out(struct iscsi_conn *conn, - struct iscsi_np *np, bool zero_tsih, bool new_sess) + bool zero_tsih, bool new_sess) { if (!new_sess) goto old_sess_out; @@ -1200,7 +1200,6 @@ void iscsi_target_login_sess_out(struct iscsi_conn *conn, conn->sess = NULL;
old_sess_out: - iscsi_stop_login_thread_timer(np); /* * If login negotiation fails check if the Time2Retain timer * needs to be restarted. @@ -1440,8 +1439,9 @@ static int __iscsi_target_login_thread(struct iscsi_np *np) new_sess_out: new_sess = true; old_sess_out: + iscsi_stop_login_thread_timer(np); tpg_np = conn->tpg_np; - iscsi_target_login_sess_out(conn, np, zero_tsih, new_sess); + iscsi_target_login_sess_out(conn, zero_tsih, new_sess); new_sess = false;
if (tpg) { diff --git a/drivers/target/iscsi/iscsi_target_login.h b/drivers/target/iscsi/iscsi_target_login.h index 3b8e3639ff5d..fc95e6150253 100644 --- a/drivers/target/iscsi/iscsi_target_login.h +++ b/drivers/target/iscsi/iscsi_target_login.h @@ -22,8 +22,7 @@ extern int iscsit_put_login_tx(struct iscsi_conn *, struct iscsi_login *, u32); extern void iscsit_free_conn(struct iscsi_conn *); extern int iscsit_start_kthreads(struct iscsi_conn *); extern void iscsi_post_login_handler(struct iscsi_np *, struct iscsi_conn *, u8); -extern void iscsi_target_login_sess_out(struct iscsi_conn *, struct iscsi_np *, - bool, bool); +extern void iscsi_target_login_sess_out(struct iscsi_conn *, bool, bool); extern int iscsi_target_login_thread(void *); extern void iscsi_handle_login_thread_timeout(struct timer_list *t);
diff --git a/drivers/target/iscsi/iscsi_target_nego.c b/drivers/target/iscsi/iscsi_target_nego.c index 8a5e8d17a942..5db8842a8026 100644 --- a/drivers/target/iscsi/iscsi_target_nego.c +++ b/drivers/target/iscsi/iscsi_target_nego.c @@ -554,12 +554,11 @@ static bool iscsi_target_sk_check_and_clear(struct iscsi_conn *conn, unsigned in
static void iscsi_target_login_drop(struct iscsi_conn *conn, struct iscsi_login *login) { - struct iscsi_np *np = login->np; bool zero_tsih = login->zero_tsih;
iscsi_remove_failed_auth_entry(conn); iscsi_target_nego_release(conn); - iscsi_target_login_sess_out(conn, np, zero_tsih, true); + iscsi_target_login_sess_out(conn, zero_tsih, true); }
struct conn_timeout {
From: Chuck Lever chuck.lever@oracle.com
stable inclusion from linux-4.19.147 commit a6a2cf4d918f3c62b652a87d4e9b667049de4cb1
--------------------------------
[ Upstream commit 644c9f40cf71969f29add32f32349e71d4995c0b ]
If a write delegation isn't available, the Linux NFS client uses a zero-stateid when performing a SETATTR.
NFSv4.0 provides no mechanism for an NFS server to match such a request to a particular client. It recalls all delegations for that file, even delegations held by the client issuing the request. If that client happens to hold a read delegation, the server will recall it immediately, resulting in an NFS4ERR_DELAY/CB_RECALL/DELEGRETURN sequence.
Optimize out this pipeline bubble by having the client return any delegations it may hold on a file before it issues a SETATTR(zero-stateid) on that file.
Signed-off-by: Chuck Lever chuck.lever@oracle.com Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/nfs/nfs4proc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index fa5cabde5df8..b964f9737dcd 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -3120,8 +3120,10 @@ static int _nfs4_do_setattr(struct inode *inode,
/* Servers should only apply open mode checks for file size changes */ truncate = (arg->iap->ia_valid & ATTR_SIZE) ? true : false; - if (!truncate) + if (!truncate) { + nfs4_inode_make_writeable(inode); goto zero_stateid; + }
if (nfs4_copy_delegation_stateid(inode, FMODE_WRITE, &arg->stateid, &delegation_cred)) { /* Use that stateid */
From: David Milburn dmilburn@redhat.com
stable inclusion from linux-4.19.147 commit 514171c50909736af8b6cdf6365c0d15bdb869a2
--------------------------------
[ Upstream commit e126e8210e950bb83414c4f57b3120ddb8450742 ]
Cancel the async event work in case an async event has been queued up; otherwise nvme_fc_submit_async_event() can run after the event struct has been freed.
Signed-off-by: David Milburn dmilburn@redhat.com Reviewed-by: Keith Busch kbusch@kernel.org Reviewed-by: Sagi Grimberg sagi@grimberg.me Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/nvme/host/fc.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 1875f6b8a907..40514e4f8ead 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -1792,6 +1792,7 @@ nvme_fc_term_aen_ops(struct nvme_fc_ctrl *ctrl) struct nvme_fc_fcp_op *aen_op; int i;
+ cancel_work_sync(&ctrl->ctrl.async_event_work); aen_op = ctrl->aen_ops; for (i = 0; i < NVME_NR_AEN_COMMANDS; i++, aen_op++) { if (!aen_op->fcp_req.private)
From: David Milburn dmilburn@redhat.com
stable inclusion from linux-4.19.147 commit f10c9c9dce4d3ee542987680e2a8576871c05734
--------------------------------
[ Upstream commit 925dd04c1f9825194b9e444c12478084813b2b5d ]
Cancel the async event work in case an async event has been queued up; otherwise nvme_rdma_submit_async_event() can run after the event struct has been freed.
Signed-off-by: David Milburn dmilburn@redhat.com Reviewed-by: Keith Busch kbusch@kernel.org Reviewed-by: Sagi Grimberg sagi@grimberg.me Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/nvme/host/rdma.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index e4f167e35353..a652d5125c85 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -739,6 +739,7 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl, nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset); } if (ctrl->async_event_sqe.data) { + cancel_work_sync(&ctrl->ctrl.async_event_work); nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe, sizeof(struct nvme_command), DMA_TO_DEVICE); ctrl->async_event_sqe.data = NULL;
From: Gustav Wiklander gustavwi@axis.com
stable inclusion from linux-4.19.147 commit 80c468d9abc9d4129809c1ffc90b3c835a1202c2
--------------------------------
[ Upstream commit b59a7ca15464c78ea1ba3b280cfc5ac5ece11ade ]
In the prepare_message callback the bus driver has the opportunity to split a transfer into smaller chunks. spi_map_msg is done after prepare_message.
The spi_res_release() function releases the split transfers in the message; therefore spi_res_release() should be called after spi_map_msg().
The previous attempt at this was commit c9ba7a16d0f1, which released the split transfers after spi_finalize_current_message() had been called. This introduced a race, since the message struct could already be out of scope once the spi_sync() call had completed.
This fixes the following leak, seen with the spi-bcm2835.c bus driver when the transfer size is greater than 65532:
Kmemleak:
sg_alloc_table+0x28/0xc8
spi_map_buf+0xa4/0x300
__spi_pump_messages+0x370/0x748
__spi_sync+0x1d4/0x270
spi_sync+0x34/0x58
spi_test_execute_msg+0x60/0x340 [spi_loopback_test]
spi_test_run_iter+0x548/0x578 [spi_loopback_test]
spi_test_run_test+0x94/0x140 [spi_loopback_test]
spi_test_run_tests+0x150/0x180 [spi_loopback_test]
spi_loopback_test_probe+0x50/0xd0 [spi_loopback_test]
spi_drv_probe+0x84/0xe0
Signed-off-by: Gustav Wiklander gustavwi@axis.com Link: https://lore.kernel.org/r/20200908151129.15915-1-gustav.wiklander@axis.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/spi/spi.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c index f4b3ca0a7d13..80c4abf411b2 100644 --- a/drivers/spi/spi.c +++ b/drivers/spi/spi.c @@ -1116,8 +1116,6 @@ static int spi_transfer_one_message(struct spi_controller *ctlr, if (msg->status && ctlr->handle_err) ctlr->handle_err(ctlr, msg);
- spi_res_release(ctlr, msg); - spi_finalize_current_message(ctlr);
return ret; @@ -1375,6 +1373,13 @@ void spi_finalize_current_message(struct spi_controller *ctlr)
spi_unmap_msg(ctlr, mesg);
+ /* In the prepare_messages callback the spi bus has the opportunity to + * split a transfer to smaller chunks. + * Release splited transfers here since spi_map_msg is done on the + * splited transfers. + */ + spi_res_release(ctlr, mesg); + if (ctlr->cur_msg_prepared && ctlr->unprepare_message) { ret = ctlr->unprepare_message(ctlr, mesg); if (ret) {
From: Sunghyun Jin mcsmonk@gmail.com
stable inclusion from linux-4.19.147 commit 5afd52f302cac2700c59b86d19c329c0ba918977
--------------------------------
commit b3b33d3c43bbe0177d70653f4e889c78cc37f097 upstream.
The populated bitmap, which is a member of struct pcpu_chunk, is sized in units of unsigned long. However, its size is miscounted: BITS_TO_LONGS() returns a number of longs, not a number of bytes, so the result must be multiplied by sizeof(unsigned long). Fix this minor sizing error.
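A worked example of the miscount (64-bit, 4 KiB pages assumed): for a 1 MiB region, region_size >> PAGE_SHIFT = 256 pages to track, so BITS_TO_LONGS(256) = 4 longs. The bitmap therefore needs 4 * sizeof(unsigned long) = 32 bytes, but the old code added only 4 extra bytes to the allocation.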
Fixes: 8ab16c43ea79 ("percpu: change the number of pages marked in the first_chunk pop bitmap") Cc: stable@vger.kernel.org # 4.14+ Signed-off-by: Sunghyun Jin mcsmonk@gmail.com Signed-off-by: Dennis Zhou dennis@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c index ff76fa0b7528..0151f276ae68 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1103,7 +1103,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
/* allocate chunk */ chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) + - BITS_TO_LONGS(region_size >> PAGE_SHIFT), + BITS_TO_LONGS(region_size >> PAGE_SHIFT) * sizeof(unsigned long), 0);
INIT_LIST_HEAD(&chunk->list);
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.148 commit d44a437826119e8307c3904c1e4f513095ea17cb
--------------------------------
[ Upstream commit b0399092ccebd9feef68d4ceb8d6219a8c0caa05 ]
If a kprobe is marked as gone, we should not kill it again. Otherwise, we can disarm the kprobe more than once. In that case, the kprobe_ftrace_enabled count can become unbalanced, which can cause kprobes to stop working.
Fixes: e8386a0cb22f ("kprobes: support probing module __exit function") Co-developed-by: Chengming Zhou zhouchengming@bytedance.com Signed-off-by: Muchun Song songmuchun@bytedance.com Signed-off-by: Chengming Zhou zhouchengming@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Masami Hiramatsu mhiramat@kernel.org Cc: "Naveen N . Rao" naveen.n.rao@linux.ibm.com Cc: Anil S Keshavamurthy anil.s.keshavamurthy@intel.com Cc: David S. Miller davem@davemloft.net Cc: Song Liu songliubraving@fb.com Cc: Steven Rostedt rostedt@goodmis.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200822030055.32383-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/kprobes.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 1053330c5929..f231d3f335a3 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2050,6 +2050,9 @@ static void kill_kprobe(struct kprobe *p) { struct kprobe *kp;
+ if (WARN_ON_ONCE(kprobe_gone(p))) + return; + p->flags |= KPROBE_FLAG_GONE; if (kprobe_aggrprobe(p)) { /* @@ -2232,7 +2235,10 @@ static int kprobes_module_callback(struct notifier_block *nb, mutex_lock(&kprobe_mutex); for (i = 0; i < KPROBE_TABLE_SIZE; i++) { head = &kprobe_table[i]; - hlist_for_each_entry_rcu(p, head, hlist) + hlist_for_each_entry_rcu(p, head, hlist) { + if (kprobe_gone(p)) + continue; + if (within_module_init((unsigned long)p->addr, mod) || (checkcore && within_module_core((unsigned long)p->addr, mod))) { @@ -2249,6 +2255,7 @@ static int kprobes_module_callback(struct notifier_block *nb, */ kill_kprobe(p); } + } } mutex_unlock(&kprobe_mutex); return NOTIFY_DONE;
From: Ralph Campbell rcampbell@nvidia.com
stable inclusion from linux-4.19.148 commit ec56646e3b2a9a0c3a2fa63732fab731009a25af
--------------------------------
[ Upstream commit ec0abae6dcdf7ef88607c869bf35a4b63ce1b370 ]
A migrating transparent huge page has to already be unmapped. Otherwise, the page could be modified while it is being copied to a new page and data could be lost. The function __split_huge_pmd() checks for a PMD migration entry before calling __split_huge_pmd_locked() leading one to think that __split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead of migration_entry_to_pfn(pmd_to_swp_entry(pmd)). Fix these problems by checking for a PMD migration entry.
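Sketched in code (illustrative, using the helpers the message names): for a migration entry the PMD holds a swap entry rather than a real PFN, so the PFN has to be recovered from the swap entry.

    /* sketch: resolving the PFN correctly for both cases */
    unsigned long pfn;

    if (is_pmd_migration_entry(*pmd))
            pfn = migration_entry_to_pfn(pmd_to_swp_entry(*pmd));
    else
            pfn = pmd_pfn(*pmd);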
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Ralph Campbell rcampbell@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Yang Shi shy828301@gmail.com Reviewed-by: Zi Yan ziy@nvidia.com Cc: Jerome Glisse jglisse@redhat.com Cc: John Hubbard jhubbard@nvidia.com Cc: Alistair Popple apopple@nvidia.com Cc: Christoph Hellwig hch@lst.de Cc: Jason Gunthorpe jgg@nvidia.com Cc: Bharata B Rao bharata@linux.ibm.com Cc: Ben Skeggs bskeggs@redhat.com Cc: Shuah Khan shuah@kernel.org Cc: stable@vger.kernel.org [4.14+] Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/huge_memory.c | 40 +++++++++++++++++++++++----------------- 1 file changed, 23 insertions(+), 17 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cdceef783141..7641bb18fb6d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2153,7 +2153,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, put_page(page); add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); return; - } else if (is_huge_zero_pmd(*pmd)) { + } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) { /* * FIXME: Do we want to invalidate secondary mmu by calling * mmu_notifier_invalidate_range() see comments below inside @@ -2241,27 +2241,33 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pte = pte_offset_map(&_pmd, addr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, addr, pte, entry); - atomic_inc(&page[i]._mapcount); - pte_unmap(pte); - } - - /* - * Set PG_double_map before dropping compound_mapcount to avoid - * false-negative page_mapped(). - */ - if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) { - for (i = 0; i < HPAGE_PMD_NR; i++) + if (!pmd_migration) atomic_inc(&page[i]._mapcount); + pte_unmap(pte); }
- if (atomic_add_negative(-1, compound_mapcount_ptr(page))) { - /* Last compound_mapcount is gone. */ - __dec_node_page_state(page, NR_ANON_THPS); - if (TestClearPageDoubleMap(page)) { - /* No need in mapcount reference anymore */ + if (!pmd_migration) { + /* + * Set PG_double_map before dropping compound_mapcount to avoid + * false-negative page_mapped(). + */ + if (compound_mapcount(page) > 1 && + !TestSetPageDoubleMap(page)) { for (i = 0; i < HPAGE_PMD_NR; i++) - atomic_dec(&page[i]._mapcount); + atomic_inc(&page[i]._mapcount); + } + + lock_page_memcg(page); + if (atomic_add_negative(-1, compound_mapcount_ptr(page))) { + /* Last compound_mapcount is gone. */ + __dec_lruvec_page_state(page, NR_ANON_THPS); + if (TestClearPageDoubleMap(page)) { + /* No need in mapcount reference anymore */ + for (i = 0; i < HPAGE_PMD_NR; i++) + atomic_dec(&page[i]._mapcount); + } } + unlock_page_memcg(page); }
smp_wmb(); /* make pte visible before pmd */
From: Xunlei Pang xlpang@linux.alibaba.com
stable inclusion from linux-4.19.148 commit 1aa7a9e5eebc5c40f0a5ea4e4cb8e8bd0267aea1
--------------------------------
commit e3336cab2579012b1e72b5265adf98e2d6e244ad upstream.
We've hit a soft lockup with CONFIG_PREEMPT_NONE=y when the target memcg doesn't have any reclaimable memory.
It can be easily reproduced as below:
watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204]
CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
Call Trace:
shrink_lruvec+0x49f/0x640
shrink_node+0x2a6/0x6f0
do_try_to_free_pages+0xe9/0x3e0
try_to_free_mem_cgroup_pages+0xef/0x1f0
try_charge+0x2c1/0x750
mem_cgroup_charge+0xd7/0x240
__add_to_page_cache_locked+0x2fd/0x370
add_to_page_cache_lru+0x4a/0xc0
pagecache_get_page+0x10b/0x2f0
filemap_fault+0x661/0xad0
ext4_filemap_fault+0x2c/0x40
__do_fault+0x4d/0xf9
handle_mm_fault+0x1080/0x1790
It only happens on our 1-vcpu instances, because there's no chance for oom reaper to run to reclaim the to-be-killed process.
Add a cond_resched() in the outer shrink_node_memcgs() loop to solve this issue. This gives us a scheduling point for each memcg in the reclaimed hierarchy, with no dependency on the amount of reclaimable memory in that memcg, thus making the behavior more predictable.
Suggested-by: Michal Hocko mhocko@suse.com Signed-off-by: Xunlei Pang xlpang@linux.alibaba.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Chris Down chris@chrisdown.name Acked-by: Michal Hocko mhocko@suse.com Acked-by: Johannes Weiner hannes@cmpxchg.org Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alib... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Julius Hemanth Pitti jpitti@cisco.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- mm/vmscan.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c index f6fe7b59e6d8..d8b6ba5bd29e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2712,6 +2712,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) unsigned long reclaimed; unsigned long scanned;
+ /* + * This loop can become CPU-bound when target memcgs + * aren't eligible for reclaim - either because they + * don't have any reclaimable pages, or because their + * memory is explicitly protected. Avoid soft lockups. + */ + cond_resched(); + switch (mem_cgroup_protected(root, memcg)) { case MEMCG_PROT_MIN: /*
From: Lukas Wunner lukas@wunner.de
stable inclusion from linux-4.19.148 commit 8b4846ac1af4b0c99817aee7304e9f5dd6ffcb56
--------------------------------
commit e0a851fe6b9b619527bd928aa93caaddd003f70c upstream.
If the call to uart_add_one_port() in serial8250_register_8250_port() fails, a half-initialized entry in the serial_8250ports[] array is left behind.
A subsequent reprobe of the same serial port causes that entry to be reused. Because uart->port.dev is set, uart_remove_one_port() is called for the half-initialized entry and bails out with an error message:
bcm2835-aux-uart 3f215040.serial: Removing wrong port: (null) != (ptrval)
The same happens on failure of mctrl_gpio_init() since commit 4a96895f74c9 ("tty/serial/8250: use mctrl_gpio helpers").
Fix by zeroing the uart->port.dev pointer in the probe error path.
The bug was introduced in v2.6.10 by historical commit befff6f5bf5f ("[SERIAL] Add new port registration/unregistration functions."): https://git.kernel.org/tglx/history/c/befff6f5bf5f
The commit added an unconditional call to uart_remove_one_port() in serial8250_register_port(). In v3.7, commit 835d844d1a28 ("8250_pnp: do pnp probe before legacy probe") made that call conditional on uart->port.dev which allows me to fix the issue by zeroing that pointer in the error path. Thus, the present commit will fix the problem as far back as v3.7 whereas still older versions need to also cherry-pick 835d844d1a28.
Fixes: 835d844d1a28 ("8250_pnp: do pnp probe before legacy probe") Signed-off-by: Lukas Wunner lukas@wunner.de Cc: stable@vger.kernel.org # v2.6.10 Cc: stable@vger.kernel.org # v2.6.10: 835d844d1a28: 8250_pnp: do pnp probe before legacy Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Link: https://lore.kernel.org/r/b4a072013ee1a1d13ee06b4325afb19bda57ca1b.158928587... [iwamatsu: Backported to 4.14, 4.19: adjust context] Signed-off-by: Nobuhiro Iwamatsu (CIP) nobuhiro1.iwamatsu@toshiba.co.jp Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/tty/serial/8250/8250_core.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c index d34b8899af2b..373f34bb4061 100644 --- a/drivers/tty/serial/8250/8250_core.c +++ b/drivers/tty/serial/8250/8250_core.c @@ -1063,8 +1063,10 @@ int serial8250_register_8250_port(struct uart_8250_port *up) serial8250_apply_quirks(uart); ret = uart_add_one_port(&serial8250_reg, &uart->port); - if (ret == 0) - ret = uart->port.line; + if (ret) + goto err; + + ret = uart->port.line; } else { dev_info(uart->port.dev, "skipping CIR port at 0x%lx / 0x%llx, IRQ %d\n", @@ -1089,6 +1091,11 @@ int serial8250_register_8250_port(struct uart_8250_port *up) mutex_unlock(&serial_mutex);
return ret; + +err: + uart->port.dev = NULL; + mutex_unlock(&serial_mutex); + return ret; } EXPORT_SYMBOL(serial8250_register_8250_port);
From: Yang Xu xuyang2018.jy@cn.fujitsu.com
stable inclusion from linux-4.19.116 commit 14b96359440421c4d84033933a2a5de73de1510d
--------------------------------
commit 2e356101e72ab1361821b3af024d64877d9a798d upstream.
Currently, when we add a new user key, the calltrace as below:
add_key()
key_create_or_update()
key_alloc()
__key_instantiate_and_link
generic_key_instantiate
key_payload_reserve
......
Since commit a08bf91ce28e ("KEYS: allow reaching the keys quotas exactly"), we can reach the maximum number of bytes/keys in key_alloc(), but we forgot to relax the same limit when reserving space for the payload in key_payload_reserve(). So we can only reach the maximum number of keys, but not the maximum number of bytes, whenever there is a delta between plen and type->def_datalen. Relax this limit when instantiating the key as well, to stay consistent with key_alloc().
Also, fix the similar problem in keyctl_chown_key().
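A worked example with hypothetical numbers: if maxbytes is 20000 and key->user->qnbytes is 19900, an update whose delta brings the total to exactly 20000 should be allowed (reaching the quota exactly):

    qnbytes + delta == maxbytes  ->  allowed   (old '>=' check: -EDQUOT)
    qnbytes + delta >  maxbytes  ->  -EDQUOT   (unchanged)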
Fixes: 0b77f5bfb45c ("keys: make the keyring quotas controllable through /proc/sys") Fixes: a08bf91ce28e ("KEYS: allow reaching the keys quotas exactly") Cc: stable@vger.kernel.org # 5.0.x Cc: Eric Biggers ebiggers@google.com Signed-off-by: Yang Xu xuyang2018.jy@cn.fujitsu.com Reviewed-by: Jarkko Sakkinen jarkko.sakkinen@linux.intel.com Reviewed-by: Eric Biggers ebiggers@google.com Signed-off-by: Jarkko Sakkinen jarkko.sakkinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- security/keys/key.c | 2 +- security/keys/keyctl.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/security/keys/key.c b/security/keys/key.c index 749a5cf27a19..d5fa8c4fc554 100644 --- a/security/keys/key.c +++ b/security/keys/key.c @@ -383,7 +383,7 @@ int key_payload_reserve(struct key *key, size_t datalen) spin_lock(&key->user->lock);
if (delta > 0 && - (key->user->qnbytes + delta >= maxbytes || + (key->user->qnbytes + delta > maxbytes || key->user->qnbytes + delta < key->user->qnbytes)) { ret = -EDQUOT; } diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c index edbbc5055976..9394d72a77e8 100644 --- a/security/keys/keyctl.c +++ b/security/keys/keyctl.c @@ -948,8 +948,8 @@ long keyctl_chown_key(key_serial_t id, uid_t user, gid_t group) key_quota_root_maxbytes : key_quota_maxbytes;
spin_lock(&newowner->lock); - if (newowner->qnkeys + 1 >= maxkeys || - newowner->qnbytes + key->quotalen >= maxbytes || + if (newowner->qnkeys + 1 > maxkeys || + newowner->qnbytes + key->quotalen > maxbytes || newowner->qnbytes + key->quotalen < newowner->qnbytes) goto quota_overrun;
From: Max Reitz mreitz@redhat.com
mainline inclusion from mainline-v5.4-rc3 commit e093c4be760ebf46c131ae0dd6138865a22f46fa category: bugfix bugzilla: NA CVE: NA
---------------------------
To ensure that all blocks touched by the range [offset, offset + count) are allocated, we need to calculate the block count from the difference of the range end (rounded up) and the range start (rounded down).
Before this patch, we just round up the byte count, which may lead to unaligned ranges not being fully allocated:
$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..7]: 1396264..1396271
        1: [8..15]: hole
There should not be a hole there. Instead, the first two blocks should be fully allocated.
With this patch applied, the result is something like this:
$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..15]: 11024..11039
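The arithmetic behind the fix, with 4096-byte blocks and the fallocate call above (offset = 2048, count = 4096): before the patch, allocatesize_fsb = XFS_B_TO_FSB(4096) = 1 block from startoffset_fsb = XFS_B_TO_FSBT(2048) = 0, so only block 0 was allocated and bytes 4096..6143 fell into a hole. After the patch, endoffset_fsb = XFS_B_TO_FSB(2048 + 4096) = 2, so allocatesize_fsb = 2 - 0 = 2 blocks, covering the whole range, which matches the xfs_bmap output above.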
Signed-off-by: Max Reitz mreitz@redhat.com Reviewed-by: Carlos Maiolino cmaiolino@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/xfs_bmap_util.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index e638740f1681..7e274ee7cad1 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -869,6 +869,7 @@ xfs_alloc_file_space( xfs_filblks_t allocatesize_fsb; xfs_extlen_t extsz, temp; xfs_fileoff_t startoffset_fsb; + xfs_fileoff_t endoffset_fsb; int nimaps; int quota_flag; int rt; @@ -896,7 +897,8 @@ xfs_alloc_file_space( imapp = &imaps[0]; nimaps = 1; startoffset_fsb = XFS_B_TO_FSBT(mp, offset); - allocatesize_fsb = XFS_B_TO_FSB(mp, count); + endoffset_fsb = XFS_B_TO_FSB(mp, offset + count); + allocatesize_fsb = endoffset_fsb - startoffset_fsb;
/* * Allocate file space until done or until there is an error
From: Luo Meng luomeng12@huawei.com
hulk inclusion category: bugfix bugzilla: 39268 CVE: NA
-------------------------------------------------
Since commit c03b42efb1c4 ("ext4: Fold ext4_data_block_valid_rcu() into the caller"), when checking the validity of inode blocks we set the last error block before finally determining whether the block is invalid, which diverges from the logic in linux master.
A block should be treated as invalid only when it belongs to the system zone. The system zone is initialized at mount time; entry->ino can only be 0 or the journal inode number, and it never changes over its lifetime. Only a check against an inode with ino == 0 or ino == journal_ino could set the wrong last error block, but such inodes never call ext4_inode_block_valid(), so in practice this never causes a problem.
To keep the logic consistent with linux master and dispel the confusion, add an explicit check that the block is invalid before setting the last error block.
Fixes: c03b42efb1c4 ("ext4: Fold ext4_data_block_valid_rcu() into the caller") Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/block_validity.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c index 817e3896462f..2471577d5c09 100644 --- a/fs/ext4/block_validity.c +++ b/fs/ext4/block_validity.c @@ -326,8 +326,9 @@ int ext4_inode_block_valid(struct inode *inode, ext4_fsblk_t start_blk, else if (start_blk >= (entry->start_blk + entry->count)) n = n->rb_right; else { - sbi->s_es->s_last_error_block = cpu_to_le64(start_blk); ret = (entry->ino == inode->i_ino); + if (!ret) + sbi->s_es->s_last_error_block = cpu_to_le64(start_blk); break; } }