Bugfixes for 20.03 (2021-02)
Alexey Gladkov (1): moduleparam: Save information about built-in modules in separate file
Amir Goldstein (1): ovl: skip getxattr of security labels
Aurelien Aptel (1): cifs: report error instead of invalid when revalidating a dentry fails
Baptiste Lepers (1): udp: Prevent reuseport_select_sock from reading uninitialized socks
Byron Stanoszek (1): tmpfs: restore functionality of nr_inodes=0
Chris Down (2):
      tmpfs: per-superblock i_ino support
      tmpfs: support 64-bit inums per-sb
Christian Brauner (1): sysctl: handle overflow in proc_get_long
Cong Wang (1): af_key: relax availability checks for skb size calculation
Dave Wysochanski (2):
      SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
      SUNRPC: Handle 0 length opaque XDR object data properly
Dong Kai (1): livepatch/core: Fix jump_label_apply_nops called multi times
Edwin Peer (1): net: watchdog: hold device global xmit lock during tx disable
Eric Biggers (1): fs: fix lazytime expiration handling in __writeback_single_inode()
Eric Dumazet (1): net_sched: gen_estimator: support large ewma log
Eyal Birger (1): xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
Florian Westphal (1): netfilter: conntrack: skip identical origin tuple in same zone only
Gaurav Kohli (1): tracing: Fix race in trace_open and buffer resize call
Gustavo A. R. Silva (1): smb3: Fix out-of-bounds bug in SMB2_negotiate()
Hugh Dickins (1): mm: thp: fix MADV_REMOVE deadlock on shmem THP
Jakub Kicinski (1): net: sit: unregister_netdevice on newlink's error path
Jan Kara (1): writeback: Drop I_DIRTY_TIME_EXPIRE
Jozsef Kadlecsik (1): netfilter: xt_recent: Fix attempt to update deleted entry
Liangyan (1): ovl: fix dentry leak in ovl_get_redirect
Lin Feng (1): bfq-iosched: Revert "bfq: Fix computation of shallow depth"
Marc Zyngier (1): genirq/msi: Activate Multi-MSI early when MSI_FLAG_ACTIVATE_EARLY is set
Martin K. Petersen (2):
      scsi: sd: block: Fix regressions in read-only block device handling
      scsi: sd: block: Fix kabi change by 'scsi: sd: block: Fix regressions in read-only block device handling'
Martin Willi (1): vrf: Fix fast path output packet handling with async Netfilter rules
Masami Hiramatsu (1): tracing/kprobe: Fix to support kretprobe events on unloaded modules
Miklos Szeredi (4):
      proc/mounts: add cursor
      ovl: perform vfs_getxattr() with mounter creds
      cap: fix conversions on getxattr
      ovl: expand warning in ovl_d_real()
Mikulas Patocka (1): dm integrity: conditionally disable "recalculate" feature
Ming Lei (4):
      scsi: core: Run queue in case of I/O resource contention failure
      scsi: core: Only re-run queue in scsi_end_request() if device queue is busy
      block: don't hold q->sysfs_lock in elevator_init_mq
      blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
Muchun Song (4):
      mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
      mm: hugetlb: fix a race between freeing and dissolving the page
      mm: hugetlb: fix a race between isolating and freeing page
      mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active
NeilBrown (1): net: fix iteration for sctp transport seq_files
Ondrej Jirman (1): brcmfmac: Loading the correct firmware for brcm43456
Pablo Neira Ayuso (1): netfilter: nft_dynset: add timeout extension to template
Pavel Begunkov (1): list: introduce list_for_each_continue()
Pengcheng Yang (1): tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
Peter Zijlstra (2):
      kthread: Extract KTHREAD_IS_PER_CPU
      workqueue: Restrict affinity change to rescuer
Roi Dayan (1): net/mlx5: Fix memory leak on flow table creation error flow
Roman Gushchin (1): memblock: do not start bottom-up allocations with kernel_end
Sabyrzhan Tasbolatov (1): net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
Shmulik Ladkani (1): xfrm: Fix oops in xfrm_replay_advance_bmp
Steven Rostedt (VMware) (3):
      fgraph: Initialize tracing_graph_pause at task creation
      tracing: Do not count ftrace events in top level enable output
      tracing: Check length before giving out the filter buffer
Sven Auhagen (1): netfilter: flowtable: fix tcp and udp header checksum update
Vadim Fedorenko (1): net: ip_tunnel: fix mtu calculation
Wang Hai (1): Revert "mm/slub: fix a memory leak in sysfs_slab_add()"
Wang ShaoBo (1): kretprobe: Avoid re-registration of the same kretprobe earlier
Willem de Bruijn (1): esp: avoid unneeded kmap_atomic call
Xiao Ni (1): md: Set prev_flush_start and flush_bio in an atomic way
Yang Yingliang (3):
      sysctl/mm: fix compile error when CONFIG_SLUB is disabled
      config: disable config TMPFS_INODE64 by default
      connector: change GFP_CONNECTOR to bit-31
Ye Bin (3):
      scsi: sd: block: Fix read-only flag residuals when partition table change
      Revert "scsi: sd: block: Fix read-only flag residuals when partition table change"
      Revert "scsi: sg: fix memory leak in sg_build_indirect"
Yonglong Liu (4):
      net: hns3: adds support for setting pf max tx rate via sysfs
      net: hns3: update hns3 version to 1.9.38.10
      net: hns3: fix 'ret' may be used uninitialized problem
      net: hns3: update hns3 version to 1.9.38.11
Yufen Yu (1): scsi: fix kabi for scsi_device
Zhang Xiaoxu (1): proc/mounts: Fix kabi broken
liubo (3):
      etmem: add etmem-scan feature
      etmem: add etmem-swap feature
      config: Enable the config option of the etmem feature
shiyongbang (3):
      gpu: hibmc: Fix erratic display during startup stage.
      gpu: hibmc: Use drm get pci dev api.
      gpu: hibmc: Fix stuck when switch GUI to text.
zhangyi (F) (1): ext4: find old entry again if failed to rename whiteout
.gitignore | 1 +
Documentation/device-mapper/dm-integrity.txt | 7 +
Documentation/dontdiff | 1 +
Documentation/filesystems/tmpfs.txt | 17 +
Documentation/kbuild/kbuild.txt | 5 +
Makefile | 2 +
arch/arm64/configs/hulk_defconfig | 3 +
arch/arm64/configs/openeuler_defconfig | 2 +
arch/x86/configs/hulk_defconfig | 2 +
arch/x86/configs/openeuler_defconfig | 2 +
block/bfq-iosched.c | 8 +-
block/blk-mq.c | 7 -
block/elevator.c | 14 +-
block/genhd.c | 33 +-
block/ioctl.c | 4 +
block/partition-generic.c | 7 +-
drivers/connector/connector.c | 6 +-
.../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c | 68 +-
.../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h | 2 +
.../gpu/drm/hisilicon/hibmc/hibmc_drm_vdac.c | 1 +
drivers/md/dm-integrity.c | 24 +-
drivers/md/md.c | 2 +
drivers/net/ethernet/hisilicon/hns3/Makefile | 3 +-
drivers/net/ethernet/hisilicon/hns3/hnae3.h | 10 +-
.../hns3/hns3_cae/hns3_cae_version.h | 2 +-
.../net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
.../hns3_extension/hns3pf/hclge_main_it.c | 33 +
.../hns3/hns3_extension/hns3pf/hclge_sysfs.c | 140 +++
.../hns3/hns3_extension/hns3pf/hclge_sysfs.h | 14 +
.../hisilicon/hns3/hns3pf/hclge_main.c | 17 +
.../hisilicon/hns3/hns3pf/hclge_main.h | 7 +-
.../hisilicon/hns3/hns3vf/hclgevf_main.h | 2 +-
.../net/ethernet/mellanox/mlx5/core/fs_core.c | 1 +
drivers/net/vrf.c | 92 +-
.../broadcom/brcm80211/brcmfmac/sdio.c | 4 +-
drivers/scsi/scsi_lib.c | 59 +-
drivers/scsi/sg.c | 6 +-
fs/Kconfig | 21 +
fs/cifs/dir.c | 22 +-
fs/cifs/smb2pdu.h | 2 +-
fs/ext4/inode.c | 2 +-
fs/ext4/namei.c | 29 +-
fs/fs-writeback.c | 36 +-
fs/hugetlbfs/inode.c | 3 +-
fs/mount.h | 17 +-
fs/namespace.c | 119 +-
fs/overlayfs/copy_up.c | 15 +-
fs/overlayfs/dir.c | 2 +-
fs/overlayfs/inode.c | 2 +
fs/overlayfs/super.c | 13 +-
fs/proc/Makefile | 2 +
fs/proc/base.c | 4 +
fs/proc/etmem_scan.c | 1046 +++++++++++++++++
fs/proc/etmem_scan.h | 132 +++
fs/proc/etmem_swap.c | 102 ++
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 117 ++
fs/proc_namespace.c | 4 +-
fs/xfs/xfs_trans_inode.c | 4 +-
include/asm-generic/vmlinux.lds.h | 1 +
include/linux/fs.h | 16 +-
include/linux/genhd.h | 5 +
include/linux/gfp.h | 8 +-
include/linux/hugetlb.h | 3 +
include/linux/kprobes.h | 2 +-
include/linux/kthread.h | 3 +
include/linux/list.h | 10 +
include/linux/mm_types.h | 18 +
include/linux/module.h | 1 +
include/linux/moduleparam.h | 12 +-
include/linux/mount.h | 4 +-
include/linux/msi.h | 6 +
include/linux/netdevice.h | 2 +
include/linux/shmem_fs.h | 3 +
include/linux/sunrpc/xdr.h | 3 +-
include/linux/swap.h | 5 +
include/net/tcp.h | 2 +-
include/scsi/scsi_device.h | 3 +-
include/trace/events/writeback.h | 1 -
init/init_task.c | 3 +-
kernel/irq/msi.c | 44 +-
kernel/kprobes.c | 36 +-
kernel/kthread.c | 27 +-
kernel/livepatch/core.c | 20 +-
kernel/smpboot.c | 1 +
kernel/sysctl.c | 6 +
kernel/trace/ftrace.c | 2 -
kernel/trace/ring_buffer.c | 4 +
kernel/trace/trace.c | 2 +-
kernel/trace/trace_events.c | 3 +-
kernel/trace/trace_kprobe.c | 4 +-
kernel/workqueue.c | 9 +-
lib/Kconfig | 11 +
mm/huge_memory.c | 37 +-
mm/hugetlb.c | 48 +-
mm/memblock.c | 49 +-
mm/pagewalk.c | 1 +
mm/shmem.c | 133 ++-
mm/slub.c | 14 +-
mm/vmscan.c | 112 ++
mm/vmstat.c | 4 +
net/core/gen_estimator.c | 11 +-
net/core/sock_reuseport.c | 2 +-
net/ipv4/esp4.c | 7 +-
net/ipv4/ip_tunnel.c | 16 +-
net/ipv4/tcp_input.c | 10 +-
net/ipv4/tcp_recovery.c | 5 +-
net/ipv6/esp6.c | 7 +-
net/ipv6/sit.c | 5 +-
net/key/af_key.c | 6 +-
net/netfilter/nf_conntrack_core.c | 3 +-
net/netfilter/nf_flow_table_core.c | 4 +-
net/netfilter/nft_dynset.c | 4 +-
net/netfilter/xt_recent.c | 12 +-
net/rds/rdma.c | 3 +
net/sctp/proc.c | 16 +-
net/sunrpc/auth_gss/auth_gss.c | 30 +-
net/sunrpc/auth_gss/auth_gss_internal.h | 45 +
net/sunrpc/auth_gss/gss_krb5_mech.c | 31 +-
net/xfrm/xfrm_input.c | 2 +-
net/xfrm/xfrm_policy.c | 4 +-
scripts/link-vmlinux.sh | 3 +
security/commoncap.c | 67 +-
virt/kvm/kvm_main.c | 6 +
124 files changed, 2817 insertions(+), 466 deletions(-)
create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c
create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.h
create mode 100644 fs/proc/etmem_scan.c
create mode 100644 fs/proc/etmem_scan.h
create mode 100644 fs/proc/etmem_swap.c
create mode 100644 net/sunrpc/auth_gss/auth_gss_internal.h
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion
from mainline-v5.6-rc1
commit 28ca0d6d39ab1d01c86762c82a585b7cedd2920c
category: bugfix
bugzilla: 35619
CVE: NA
--------------------------------
Like the other *_continue() helpers, this continues iteration from a given position.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/list.h | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/include/linux/list.h b/include/linux/list.h index de04cc5ed536..0e540581d52c 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -455,6 +455,16 @@ static inline void list_splice_tail_init(struct list_head *list, #define list_for_each(pos, head) \ for (pos = (head)->next; pos != (head); pos = pos->next)
+/** + * list_for_each_continue - continue iteration over a list + * @pos: the &struct list_head to use as a loop cursor. + * @head: the head for your list. + * + * Continue to iterate over a list, continuing after the current position. + */ +#define list_for_each_continue(pos, head) \ + for (pos = pos->next; pos != (head); pos = pos->next) + /** * list_for_each_prev - iterate over a list backwards * @pos: the &struct list_head to use as a loop cursor.
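For reference, here is a minimal usage sketch of the new macro (the struct and helper are illustrative only, not part of this patch): the cursor is primed to a known position and iteration resumes from the element after it, which is how mnt_list_next() in the /proc/mounts cursor patch below uses it.

/*
 * Usage sketch (illustrative): resume a scan of a kernel linked list
 * from a remembered position. struct item and find_next_match() are
 * made-up names, not kernel code.
 */
#include <linux/list.h>

struct item {
	struct list_head node;
	int val;
};

/* Find the next item with a matching value after @start. */
static struct item *find_next_match(struct list_head *head,
				    struct list_head *start, int val)
{
	struct list_head *pos = start;	/* cursor starts at a known position */
	struct item *it;

	list_for_each_continue(pos, head) {	/* begins at start->next */
		it = list_entry(pos, struct item, node);
		if (it->val == val)
			return it;
	}
	return NULL;	/* reached the list head without a match */
}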
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion
from mainline-v5.8-rc1
commit 9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f
category: bugfix
bugzilla: 35619
CVE: NA
--------------------------------
If mounts are deleted after a read(2) call on /proc/self/mounts (or its kin), the subsequent read(2) could miss a mount that comes after the deleted one in the list. This is because the file position is interpreted as the number of mount entries from the start of the list.
E.g. first read gets entries #0 to #9; the seq file index will be 10. Then entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10, etc... The next read will continue from entry #10, and #9 is missed.
Solve this by adding a cursor entry for each open instance. Taking the global namespace_sem for write seems excessive, since we are only dealing with a per-namespace list. Instead add a per-namespace spinlock and use that together with namespace_sem taken for read to protect against concurrent modification of the mount list. This may reduce parallelism of is_local_mountpoint(), but it's hardly a big contention point. We could also use RCU freeing of cursors to make traversal not need additional locks, if that turns out to be necessary.
Only move the cursor once for each read (cursor is not added on open) to minimize cacheline invalidation. When EOF is reached, the cursor is taken off the list, in order to prevent an excessive number of cursors due to inactive open file descriptors.
Reported-by: Karel Zak kzak@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com
Conflicts: fs/mount.h fs/namespace.c
Signed-off-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/mount.h | 12 ++++-- fs/namespace.c | 91 +++++++++++++++++++++++++++++++++++-------- fs/proc_namespace.c | 4 +- include/linux/mount.h | 4 +- 4 files changed, 90 insertions(+), 21 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h index f39bc9da4d73..b8318db51ea1 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -9,7 +9,13 @@ struct mnt_namespace { atomic_t count; struct ns_common ns; struct mount * root; + /* + * Traversal and modification of .list is protected by either + * - taking namespace_sem for write, OR + * - taking namespace_sem for read AND taking .ns_lock. + */ struct list_head list; + spinlock_t ns_lock; struct user_namespace *user_ns; struct ucounts *ucounts; u64 seq; /* Sequence number to prevent loops */ @@ -131,9 +137,7 @@ struct proc_mounts { struct mnt_namespace *ns; struct path root; int (*show)(struct seq_file *, struct vfsmount *); - void *cached_mount; - u64 cached_event; - loff_t cached_index; + struct mount cursor; };
extern const struct seq_operations mounts_op; @@ -146,3 +150,5 @@ static inline bool is_local_mountpoint(struct dentry *dentry)
return __is_local_mountpoint(dentry); } + +extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor); diff --git a/fs/namespace.c b/fs/namespace.c index da90b9c878c1..c582abcdab59 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -653,6 +653,21 @@ struct vfsmount *lookup_mnt(const struct path *path) return m; }
+static inline void lock_ns_list(struct mnt_namespace *ns) +{ + spin_lock(&ns->ns_lock); +} + +static inline void unlock_ns_list(struct mnt_namespace *ns) +{ + spin_unlock(&ns->ns_lock); +} + +static inline bool mnt_is_cursor(struct mount *mnt) +{ + return mnt->mnt.mnt_flags & MNT_CURSOR; +} + /* * __is_local_mountpoint - Test to see if dentry is a mountpoint in the * current mount namespace. @@ -678,11 +693,15 @@ bool __is_local_mountpoint(struct dentry *dentry) goto out;
down_read(&namespace_sem); + lock_ns_list(ns); list_for_each_entry(mnt, &ns->list, mnt_list) { + if (mnt_is_cursor(mnt)) + continue; is_covered = (mnt->mnt_mountpoint == dentry); if (is_covered) break; } + unlock_ns_list(ns); up_read(&namespace_sem); out: return is_covered; @@ -1237,46 +1256,71 @@ struct vfsmount *mnt_clone_internal(const struct path *path) }
#ifdef CONFIG_PROC_FS +static struct mount *mnt_list_next(struct mnt_namespace *ns, + struct list_head *p) +{ + struct mount *mnt, *ret = NULL; + + lock_ns_list(ns); + list_for_each_continue(p, &ns->list) { + mnt = list_entry(p, typeof(*mnt), mnt_list); + if (!mnt_is_cursor(mnt)) { + ret = mnt; + break; + } + } + unlock_ns_list(ns); + + return ret; +} + /* iterator; we want it to have access to namespace_sem, thus here... */ static void *m_start(struct seq_file *m, loff_t *pos) { struct proc_mounts *p = m->private; + struct list_head *prev;
down_read(&namespace_sem); - if (p->cached_event == p->ns->event) { - void *v = p->cached_mount; - if (*pos == p->cached_index) - return v; - if (*pos == p->cached_index + 1) { - v = seq_list_next(v, &p->ns->list, &p->cached_index); - return p->cached_mount = v; - } + if (!*pos) { + prev = &p->ns->list; + } else { + prev = &p->cursor.mnt_list; + + /* Read after we'd reached the end? */ + if (list_empty(prev)) + return NULL; }
- p->cached_event = p->ns->event; - p->cached_mount = seq_list_start(&p->ns->list, *pos); - p->cached_index = *pos; - return p->cached_mount; + return mnt_list_next(p->ns, prev); }
static void *m_next(struct seq_file *m, void *v, loff_t *pos) { struct proc_mounts *p = m->private; + struct mount *mnt = v;
- p->cached_mount = seq_list_next(v, &p->ns->list, pos); - p->cached_index = *pos; - return p->cached_mount; + ++*pos; + return mnt_list_next(p->ns, &mnt->mnt_list); }
static void m_stop(struct seq_file *m, void *v) { + struct proc_mounts *p = m->private; + struct mount *mnt = v; + + lock_ns_list(p->ns); + if (mnt) + list_move_tail(&p->cursor.mnt_list, &mnt->mnt_list); + else + list_del_init(&p->cursor.mnt_list); + unlock_ns_list(p->ns); up_read(&namespace_sem); }
static int m_show(struct seq_file *m, void *v) { struct proc_mounts *p = m->private; - struct mount *r = list_entry(v, struct mount, mnt_list); + struct mount *r = v; return p->show(m, &r->mnt); }
@@ -1286,6 +1330,15 @@ const struct seq_operations mounts_op = { .stop = m_stop, .show = m_show, }; + +void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor) +{ + down_read(&namespace_sem); + lock_ns_list(ns); + list_del(&cursor->mnt_list); + unlock_ns_list(ns); + up_read(&namespace_sem); +} #endif /* CONFIG_PROC_FS */
/** @@ -2858,6 +2911,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) INIT_LIST_HEAD(&new_ns->list); init_waitqueue_head(&new_ns->poll); new_ns->event = 0; + spin_lock_init(&new_ns->ns_lock); new_ns->user_ns = get_user_ns(user_ns); new_ns->ucounts = ucounts; new_ns->mounts = 0; @@ -3312,10 +3366,14 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new, bool visible = false;
down_read(&namespace_sem); + lock_ns_list(ns); list_for_each_entry(mnt, &ns->list, mnt_list) { struct mount *child; int mnt_flags;
+ if (mnt_is_cursor(mnt)) + continue; + if (mnt->mnt.mnt_sb->s_type != new->mnt_sb->s_type) continue;
@@ -3363,6 +3421,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new, next: ; } found: + unlock_ns_list(ns); up_read(&namespace_sem); return visible; } diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index e16fb8f2049e..969f9c8fbdc0 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -279,7 +279,8 @@ static int mounts_open_common(struct inode *inode, struct file *file, p->ns = ns; p->root = root; p->show = show; - p->cached_event = ~0ULL; + INIT_LIST_HEAD(&p->cursor.mnt_list); + p->cursor.mnt.mnt_flags = MNT_CURSOR;
return 0;
@@ -296,6 +297,7 @@ static int mounts_release(struct inode *inode, struct file *file) struct seq_file *m = file->private_data; struct proc_mounts *p = m->private; path_put(&p->root); + mnt_cursor_del(p->ns, &p->cursor); put_mnt_ns(p->ns); return seq_release_private(inode, file); } diff --git a/include/linux/mount.h b/include/linux/mount.h index 4b0db4418954..46a77d791870 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -49,7 +49,8 @@ struct mnt_namespace; #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \ - MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED) + MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED | \ + MNT_CURSOR)
#define MNT_INTERNAL 0x4000
@@ -63,6 +64,7 @@ struct mnt_namespace; #define MNT_SYNC_UMOUNT 0x2000000 #define MNT_MARKED 0x4000000 #define MNT_UMOUNT 0x8000000 +#define MNT_CURSOR 0x10000000
struct vfsmount { struct dentry *mnt_root; /* root of the mounted tree */
From: Zhang Xiaoxu zhangxiaoxu5@huawei.com
hulk inclusion
category: bugfix
bugzilla: 35619
CVE: NA
---------------------------
Since ns_lock was added to struct mnt_namespace, the KABI was broken; use a wrapper structure to fix it.
We assume all modules only use pointers to struct mnt_namespace and do not rely on its internal layout.
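The wrapper trick can be sketched on its own. The names below are made up for illustration and this is not the actual kernel code; it only shows the container_of() pattern that the diff applies to struct mnt_namespace.

/*
 * Simplified sketch of the KABI-preserving wrapper pattern (illustrative
 * names). The public struct keeps its original size and layout; the new
 * private state lives only in the wrapper and is recovered with
 * container_of() where it is needed.
 */
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct public_ns {			/* layout frozen by the KABI */
	int refcount;
};

struct public_ns_wrapper {
	struct public_ns ns;		/* embedded public struct */
	int new_lock;			/* stand-in for the added spinlock */
};

static struct public_ns *alloc_ns(void)
{
	/* Allocate the wrapper but hand out only the embedded public part. */
	struct public_ns_wrapper *w = calloc(1, sizeof(*w));

	return w ? &w->ns : NULL;
}

static void free_ns(struct public_ns *ns)
{
	/* Recover the wrapper from the public pointer before freeing. */
	struct public_ns_wrapper *w =
		container_of(ns, struct public_ns_wrapper, ns);

	free(w);
}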
Signed-off-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/mount.h | 9 ++++++-- fs/namespace.c | 58 +++++++++++++++++++++++++++++++++----------------- 2 files changed, 46 insertions(+), 21 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h index b8318db51ea1..ab0150e2f0ad 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -12,10 +12,10 @@ struct mnt_namespace { /* * Traversal and modification of .list is protected by either * - taking namespace_sem for write, OR - * - taking namespace_sem for read AND taking .ns_lock. + * - taking namespace_sem for read AND taking .ns_lock + * in mnt_namespace_wrapper */ struct list_head list; - spinlock_t ns_lock; struct user_namespace *user_ns; struct ucounts *ucounts; u64 seq; /* Sequence number to prevent loops */ @@ -25,6 +25,11 @@ struct mnt_namespace { unsigned int pending_mounts; } __randomize_layout;
+struct mnt_namespace_wrapper { + struct mnt_namespace ns; + spinlock_t ns_lock; +}; + struct mnt_pcp { int mnt_count; int mnt_writers; diff --git a/fs/namespace.c b/fs/namespace.c index c582abcdab59..2c84a110ce2d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -653,14 +653,14 @@ struct vfsmount *lookup_mnt(const struct path *path) return m; }
-static inline void lock_ns_list(struct mnt_namespace *ns) +static inline void lock_ns_list(struct mnt_namespace_wrapper *nsw) { - spin_lock(&ns->ns_lock); + spin_lock(&nsw->ns_lock); }
-static inline void unlock_ns_list(struct mnt_namespace *ns) +static inline void unlock_ns_list(struct mnt_namespace_wrapper *nsw) { - spin_unlock(&ns->ns_lock); + spin_unlock(&nsw->ns_lock); }
static inline bool mnt_is_cursor(struct mount *mnt) @@ -686,14 +686,16 @@ static inline bool mnt_is_cursor(struct mount *mnt) bool __is_local_mountpoint(struct dentry *dentry) { struct mnt_namespace *ns = current->nsproxy->mnt_ns; + struct mnt_namespace_wrapper *nsw; struct mount *mnt; bool is_covered = false;
if (!d_mountpoint(dentry)) goto out;
+ nsw = container_of(ns, struct mnt_namespace_wrapper, ns); down_read(&namespace_sem); - lock_ns_list(ns); + lock_ns_list(nsw); list_for_each_entry(mnt, &ns->list, mnt_list) { if (mnt_is_cursor(mnt)) continue; @@ -701,7 +703,7 @@ bool __is_local_mountpoint(struct dentry *dentry) if (is_covered) break; } - unlock_ns_list(ns); + unlock_ns_list(nsw); up_read(&namespace_sem); out: return is_covered; @@ -1260,8 +1262,11 @@ static struct mount *mnt_list_next(struct mnt_namespace *ns, struct list_head *p) { struct mount *mnt, *ret = NULL; + struct mnt_namespace_wrapper *nsw;
- lock_ns_list(ns); + nsw = container_of(ns, struct mnt_namespace_wrapper, ns); + + lock_ns_list(nsw); list_for_each_continue(p, &ns->list) { mnt = list_entry(p, typeof(*mnt), mnt_list); if (!mnt_is_cursor(mnt)) { @@ -1269,7 +1274,7 @@ static struct mount *mnt_list_next(struct mnt_namespace *ns, break; } } - unlock_ns_list(ns); + unlock_ns_list(nsw);
return ret; } @@ -1307,13 +1312,16 @@ static void m_stop(struct seq_file *m, void *v) { struct proc_mounts *p = m->private; struct mount *mnt = v; + struct mnt_namespace_wrapper *nsw; + + nsw = container_of(p->ns, struct mnt_namespace_wrapper, ns);
- lock_ns_list(p->ns); + lock_ns_list(nsw); if (mnt) list_move_tail(&p->cursor.mnt_list, &mnt->mnt_list); else list_del_init(&p->cursor.mnt_list); - unlock_ns_list(p->ns); + unlock_ns_list(nsw); up_read(&namespace_sem); }
@@ -1333,10 +1341,14 @@ const struct seq_operations mounts_op = {
void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor) { + struct mnt_namespace_wrapper *nsw; + + nsw = container_of(ns, struct mnt_namespace_wrapper, ns); + down_read(&namespace_sem); - lock_ns_list(ns); + lock_ns_list(nsw); list_del(&cursor->mnt_list); - unlock_ns_list(ns); + unlock_ns_list(nsw); up_read(&namespace_sem); } #endif /* CONFIG_PROC_FS */ @@ -2868,10 +2880,13 @@ static void dec_mnt_namespaces(struct ucounts *ucounts)
static void free_mnt_ns(struct mnt_namespace *ns) { + struct mnt_namespace_wrapper *nsw; + + nsw = container_of(ns, struct mnt_namespace_wrapper, ns); ns_free_inum(&ns->ns); dec_mnt_namespaces(ns->ucounts); put_user_ns(ns->user_ns); - kfree(ns); + kfree(nsw); }
/* @@ -2886,6 +2901,7 @@ static atomic64_t mnt_ns_seq = ATOMIC64_INIT(1); static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) { struct mnt_namespace *new_ns; + struct mnt_namespace_wrapper *new_nsw; struct ucounts *ucounts; int ret;
@@ -2893,14 +2909,15 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) if (!ucounts) return ERR_PTR(-ENOSPC);
- new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL); - if (!new_ns) { + new_nsw = kmalloc(sizeof(struct mnt_namespace_wrapper), GFP_KERNEL); + if (!new_nsw) { dec_mnt_namespaces(ucounts); return ERR_PTR(-ENOMEM); } + new_ns = &new_nsw->ns; ret = ns_alloc_inum(&new_ns->ns); if (ret) { - kfree(new_ns); + kfree(new_nsw); dec_mnt_namespaces(ucounts); return ERR_PTR(ret); } @@ -2911,7 +2928,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) INIT_LIST_HEAD(&new_ns->list); init_waitqueue_head(&new_ns->poll); new_ns->event = 0; - spin_lock_init(&new_ns->ns_lock); + spin_lock_init(&new_nsw->ns_lock); new_ns->user_ns = get_user_ns(user_ns); new_ns->ucounts = ucounts; new_ns->mounts = 0; @@ -3364,9 +3381,12 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new, int new_flags = *new_mnt_flags; struct mount *mnt; bool visible = false; + struct mnt_namespace_wrapper *nsw; + + nsw = container_of(ns, struct mnt_namespace_wrapper, ns);
down_read(&namespace_sem); - lock_ns_list(ns); + lock_ns_list(nsw); list_for_each_entry(mnt, &ns->list, mnt_list) { struct mount *child; int mnt_flags; @@ -3421,7 +3441,7 @@ static bool mnt_already_visible(struct mnt_namespace *ns, struct vfsmount *new, next: ; } found: - unlock_ns_list(ns); + unlock_ns_list(nsw); up_read(&namespace_sem); return visible; }
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion
category: bugfix
bugzilla: 47452
CVE: NA
-------------------------------------------------
Fix the compile error:

mm/vmstat.c:1739: undefined reference to `isolate_cnt'
mm/vmstat.c:1740: undefined reference to `unexpect_free_cnt'
kernel/sysctl.o:(.data+0x1550): undefined reference to `sysctl_isolate_corrupted_freelist'
Fixes: add85f87f2a8c ("sysctl: control if check validity of...") Fixes: cd12ff78aee9e ("connector: debug: try catch unexpect free skb->data") Fixes: d5d59f865c97a ("mm: slub: check freelist validity to avoid crash") Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/sysctl.c | 4 ++++ mm/vmstat.c | 4 ++++ 2 files changed, 8 insertions(+)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c921ee10615a..f8a376720e87 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1245,7 +1245,9 @@ static struct ctl_table kern_table[] = { { } };
+#ifdef CONFIG_SLUB extern int sysctl_isolate_corrupted_freelist; +#endif static struct ctl_table vm_table[] = { { .procname = "overcommit_memory", @@ -1715,6 +1717,7 @@ static struct ctl_table vm_table[] = { .extra2 = (void *)&mmap_rnd_compat_bits_max, }, #endif +#ifdef CONFIG_SLUB { .procname = "isolate_corrupted_freelist", .data = &sysctl_isolate_corrupted_freelist, @@ -1724,6 +1727,7 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, .extra2 = &one, }, +#endif { } };
diff --git a/mm/vmstat.c b/mm/vmstat.c index 222d6a7cbef9..4258c2b344d2 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1732,12 +1732,16 @@ static int vmstat_show(struct seq_file *m, void *arg) return 0; }
+#ifdef CONFIG_SLUB extern int isolate_cnt; extern int unexpect_free_cnt; +#endif static void vmstat_stop(struct seq_file *m, void *arg) { +#ifdef CONFIG_SLUB seq_printf(m, "nr_freelist_isolated %d\n", isolate_cnt); seq_printf(m, "nr_unexpected_free %d\n", unexpect_free_cnt); +#endif kfree(m->private); m->private = NULL; }
From: "Martin K. Petersen" martin.petersen@oracle.com
hulk inclusion
category: bugfix
bugzilla: 46833
CVE: NA
-----------------------------------------------
Fix: https://gitee.com/src-openeuler/util-linux/issues/I28N07
Origin patch: https://patchwork.kernel.org/project/linux-scsi/patch/20190227041941.1568-1-martin.petersen@oracle.com/
If the partition table is changed online, record the partition read-only flag.
Some devices come online in write protected state and switch to read-write once they are ready to process I/O requests. These devices broke with commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") because we had no way to distinguish between a user decision to set a block_device read-only and the actual hardware device being write-protected.
Because partitions are dropped and recreated on revalidate we are unable to persist any user-provided policy in hd_struct. Introduce a bitmap in struct gendisk to track the user configuration. This bitmap is updated when BLKROSET is called on a given disk or partition.
A helper function, get_user_ro(), is provided to determine whether the ioctl has forced read-only state for a given block device. This helper is used by set_disk_ro() and add_partition() to ensure that both existing and newly created partitions will get the correct state.
- If BLKROSET sets a whole disk device read-only, all partitions will now end up in a read-only state.
- If BLKROSET sets a given partition read-only, that partition will remain read-only post revalidate.
- Otherwise both the whole disk device and any partitions will reflect the write protect state of the underlying device.
Since nobody knows what "policy" means, rename the field to "read_only" for clarity.
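For context, the user-facing control here is the pre-existing BLKROSET ioctl (what "blockdev --setro" issues). Below is a small userspace sketch, with an example device path, of setting and reading back the flag whose persistence this patch fixes.

/*
 * Userspace sketch: mark a block device read-only via BLKROSET and read
 * the state back with BLKROGET. The device path is just an example.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int ro = 1, state = 0;
	int fd = open("/dev/sdb1", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKROSET, &ro) < 0)	/* record the user's read-only policy */
		perror("BLKROSET");
	if (ioctl(fd, BLKROGET, &state) == 0)	/* read back the effective flag */
		printf("read-only: %d\n", state);
	close(fd);
	return 0;
}

With this patch, the state set this way also survives a subsequent partition rescan.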
Cc: Jeremy Cline jeremy@jcline.org Cc: Oleksii Kurochko olkuroch@cisco.com Cc: stable@vger.kernel.org # v4.16+ Reported-by: Oleksii Kurochko olkuroch@cisco.com Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221 Fixes: 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 2 +- block/genhd.c | 34 ++++++++++++++++++++++++---------- block/ioctl.c | 4 ++++ block/partition-generic.c | 7 +++++-- include/linux/genhd.h | 11 +++++++---- 5 files changed, 41 insertions(+), 17 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index ffbe326c70b9..41c40a6acdca 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2166,7 +2166,7 @@ static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part) { const int op = bio_op(bio);
- if (part->policy && op_is_write(op)) { + if (part->read_only && op_is_write(op)) { char b[BDEVNAME_SIZE];
if (op_is_flush(bio->bi_opf) && !bio_sectors(bio)) diff --git a/block/genhd.c b/block/genhd.c index d8a9c901eaef..d9ec6bb4f880 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1543,26 +1543,40 @@ static void set_disk_ro_uevent(struct gendisk *gd, int ro) kobject_uevent_env(&disk_to_dev(gd)->kobj, KOBJ_CHANGE, envp); }
-void set_device_ro(struct block_device *bdev, int flag) +void set_device_ro(struct block_device *bdev, bool state) { - bdev->bd_part->policy = flag; + bdev->bd_part->read_only = state; }
EXPORT_SYMBOL(set_device_ro);
-void set_disk_ro(struct gendisk *disk, int flag) +bool get_user_ro(struct gendisk *disk, unsigned int partno) +{ + /* Is the user read-only bit set for the whole disk device? */ + if (test_bit(0, disk->user_ro_bitmap)) + return true; + + /* Is the user read-only bit set for this particular partition? */ + if (test_bit(partno, disk->user_ro_bitmap)) + return true; + + return false; +} + +void set_disk_ro(struct gendisk *disk, bool state) { struct disk_part_iter piter; struct hd_struct *part;
- if (disk->part0.policy != flag) { - set_disk_ro_uevent(disk, flag); - disk->part0.policy = flag; - } + if (disk->part0.read_only != state) + set_disk_ro_uevent(disk, state);
- disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY); + disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY_PART0); while ((part = disk_part_iter_next(&piter))) - part->policy = flag; + if (get_user_ro(disk, part->partno)) + part->read_only = true; + else + part->read_only = state; disk_part_iter_exit(&piter); }
@@ -1572,7 +1586,7 @@ int bdev_read_only(struct block_device *bdev) { if (!bdev) return 0; - return bdev->bd_part->policy; + return bdev->bd_part->read_only; }
EXPORT_SYMBOL(bdev_read_only); diff --git a/block/ioctl.c b/block/ioctl.c index 5a6157b3735a..899ffd50a7c6 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -455,6 +455,10 @@ static int blkdev_roset(struct block_device *bdev, fmode_t mode, return ret; if (get_user(n, (int __user *)arg)) return -EFAULT; + if (n) + set_bit(bdev->bd_partno, bdev->bd_disk->user_ro_bitmap); + else + clear_bit(bdev->bd_partno, bdev->bd_disk->user_ro_bitmap); set_device_ro(bdev, n); return 0; } diff --git a/block/partition-generic.c b/block/partition-generic.c index d86d794d28e1..63b82df5bbb4 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -98,7 +98,7 @@ static ssize_t part_ro_show(struct device *dev, struct device_attribute *attr, char *buf) { struct hd_struct *p = dev_to_part(dev); - return sprintf(buf, "%d\n", p->policy ? 1 : 0); + return sprintf(buf, "%u\n", p->read_only ? 1 : 0); }
static ssize_t part_alignment_offset_show(struct device *dev, @@ -352,7 +352,10 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno, queue_limit_discard_alignment(&disk->queue->limits, start); p->nr_sects = len; p->partno = partno; - p->policy = get_disk_ro(disk); + if (get_user_ro(disk, partno)) + p->read_only = true; + else + p->read_only = get_disk_ro(disk); p->disk = disk;
if (info) { diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 666b23a88c6f..6a6da28ac0b9 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -116,7 +116,8 @@ struct hd_struct { unsigned int discard_alignment; struct device __dev; struct kobject *holder_dir; - int policy, partno; + bool read_only; + int partno; struct partition_meta_info *info; #ifdef CONFIG_FAIL_MAKE_REQUEST int make_it_fail; @@ -202,6 +203,7 @@ struct gendisk { */ struct disk_part_tbl __rcu *part_tbl; struct hd_struct part0; + DECLARE_BITMAP(user_ro_bitmap, DISK_MAX_PARTS);
const struct block_device_operations *fops; struct request_queue *queue; @@ -440,12 +442,13 @@ extern void del_gendisk(struct gendisk *gp); extern struct gendisk *get_gendisk(dev_t dev, int *partno); extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
-extern void set_device_ro(struct block_device *bdev, int flag); -extern void set_disk_ro(struct gendisk *disk, int flag); +extern void set_device_ro(struct block_device *bdev, bool state); +extern void set_disk_ro(struct gendisk *disk, bool state); +extern bool get_user_ro(struct gendisk *disk, unsigned int partno);
static inline int get_disk_ro(struct gendisk *disk) { - return disk->part0.policy; + return disk->part0.read_only; }
extern void disk_block_events(struct gendisk *disk);
From: Ye Bin yebin10@huawei.com
hulk inclusion
category: bugfix
bugzilla: 46833
CVE: NA
-----------------------------------------------
Fixes: ("scsi: sd: block: Fix regressions in read-only block device handling") Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/partition-generic.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/block/partition-generic.c b/block/partition-generic.c index 63b82df5bbb4..b27ed20d3db4 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -286,6 +286,7 @@ void delete_partition(struct gendisk *disk, int partno) if (!part) return;
+ clear_bit(partno, disk->user_ro_bitmap); get_device(disk_to_dev(disk)); rcu_assign_pointer(ptbl->part[partno], NULL);
From: "Martin K. Petersen" martin.petersen@oracle.com
hulk inclusion
category: bugfix
bugzilla: 46833
CVE: NA
-----------------------------------------------
Fixes: ("scsi: sd: block: Fix regressions in read-only block device handling") Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 2 +- block/genhd.c | 33 +++++++++++++++++++++------------ block/partition-generic.c | 6 +++--- include/linux/genhd.h | 16 +++++++++------- 4 files changed, 34 insertions(+), 23 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 41c40a6acdca..ffbe326c70b9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2166,7 +2166,7 @@ static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part) { const int op = bio_op(bio);
- if (part->read_only && op_is_write(op)) { + if (part->policy && op_is_write(op)) { char b[BDEVNAME_SIZE];
if (op_is_flush(bio->bi_opf) && !bio_sectors(bio)) diff --git a/block/genhd.c b/block/genhd.c index d9ec6bb4f880..e109a0702968 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1307,6 +1307,7 @@ static void disk_release(struct device *dev) hd_free_part(&disk->part0); if (disk->queue) blk_put_queue(disk->queue); + kfree(disk->user_ro_bitmap); kfree(disk); } struct class block_class = { @@ -1481,6 +1482,14 @@ struct gendisk *__alloc_disk_node(int minors, int node_id) return NULL; }
+ disk->user_ro_bitmap = kzalloc_node( + BITS_TO_LONGS(DISK_MAX_PARTS) * sizeof(long), + GFP_KERNEL, node_id); + if (!disk->user_ro_bitmap) { + hd_free_part(&disk->part0); + kfree(disk); + return NULL; + } disk->minors = minors; rand_initialize_disk(disk); disk_to_dev(disk)->class = &block_class; @@ -1543,40 +1552,40 @@ static void set_disk_ro_uevent(struct gendisk *gd, int ro) kobject_uevent_env(&disk_to_dev(gd)->kobj, KOBJ_CHANGE, envp); }
-void set_device_ro(struct block_device *bdev, bool state) +void set_device_ro(struct block_device *bdev, int flag) { - bdev->bd_part->read_only = state; + bdev->bd_part->policy = flag; }
EXPORT_SYMBOL(set_device_ro);
-bool get_user_ro(struct gendisk *disk, unsigned int partno) +int get_user_ro(struct gendisk *disk, unsigned int partno) { /* Is the user read-only bit set for the whole disk device? */ if (test_bit(0, disk->user_ro_bitmap)) - return true; + return 1;
/* Is the user read-only bit set for this particular partition? */ if (test_bit(partno, disk->user_ro_bitmap)) - return true; + return 1;
- return false; + return 0; }
-void set_disk_ro(struct gendisk *disk, bool state) +void set_disk_ro(struct gendisk *disk, int flag) { struct disk_part_iter piter; struct hd_struct *part;
- if (disk->part0.read_only != state) - set_disk_ro_uevent(disk, state); + if (disk->part0.policy != flag) + set_disk_ro_uevent(disk, flag);
disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY_PART0); while ((part = disk_part_iter_next(&piter))) if (get_user_ro(disk, part->partno)) - part->read_only = true; + part->policy = 1; else - part->read_only = state; + part->policy = flag; disk_part_iter_exit(&piter); }
@@ -1586,7 +1595,7 @@ int bdev_read_only(struct block_device *bdev) { if (!bdev) return 0; - return bdev->bd_part->read_only; + return bdev->bd_part->policy; }
EXPORT_SYMBOL(bdev_read_only); diff --git a/block/partition-generic.c b/block/partition-generic.c index b27ed20d3db4..a39c311aec38 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -98,7 +98,7 @@ static ssize_t part_ro_show(struct device *dev, struct device_attribute *attr, char *buf) { struct hd_struct *p = dev_to_part(dev); - return sprintf(buf, "%u\n", p->read_only ? 1 : 0); + return sprintf(buf, "%u\n", p->policy ? 1 : 0); }
static ssize_t part_alignment_offset_show(struct device *dev, @@ -354,9 +354,9 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno, p->nr_sects = len; p->partno = partno; if (get_user_ro(disk, partno)) - p->read_only = true; + p->policy = 1; else - p->read_only = get_disk_ro(disk); + p->policy = get_disk_ro(disk); p->disk = disk;
if (info) { diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 6a6da28ac0b9..404567f13cda 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -116,8 +116,7 @@ struct hd_struct { unsigned int discard_alignment; struct device __dev; struct kobject *holder_dir; - bool read_only; - int partno; + int policy, partno; struct partition_meta_info *info; #ifdef CONFIG_FAIL_MAKE_REQUEST int make_it_fail; @@ -203,7 +202,6 @@ struct gendisk { */ struct disk_part_tbl __rcu *part_tbl; struct hd_struct part0; - DECLARE_BITMAP(user_ro_bitmap, DISK_MAX_PARTS);
const struct block_device_operations *fops; struct request_queue *queue; @@ -223,7 +221,11 @@ struct gendisk { struct badblocks *bb; struct lockdep_map lockdep_map;
+#ifndef __GENKSYMS__ + unsigned long *user_ro_bitmap; +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) @@ -442,13 +444,13 @@ extern void del_gendisk(struct gendisk *gp); extern struct gendisk *get_gendisk(dev_t dev, int *partno); extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
-extern void set_device_ro(struct block_device *bdev, bool state); -extern void set_disk_ro(struct gendisk *disk, bool state); -extern bool get_user_ro(struct gendisk *disk, unsigned int partno); +extern void set_device_ro(struct block_device *bdev, int flag); +extern void set_disk_ro(struct gendisk *disk, int flag); +extern int get_user_ro(struct gendisk *disk, unsigned int partno);
static inline int get_disk_ro(struct gendisk *disk) { - return disk->part0.read_only; + return disk->part0.policy; }
extern void disk_block_events(struct gendisk *disk);
From: Ye Bin yebin10@huawei.com
hulk inclusion
category: bugfix
bugzilla: 49978
CVE: NA
-----------------------------------------------
Patch c4f20b042b70 may clear the read-only flag when the partition table is re-read.
This reverts commit c4f20b042b7060f505134dcb1fc283893ec41e2a
Fixes: c4f20b042b70 ("scsi: sd: block: Fix read-only flag residuals when partition table change") Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/partition-generic.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/block/partition-generic.c b/block/partition-generic.c index a39c311aec38..739c0cc5fd22 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -286,7 +286,6 @@ void delete_partition(struct gendisk *disk, int partno) if (!part) return;
- clear_bit(partno, disk->user_ro_bitmap); get_device(disk_to_dev(disk)); rcu_assign_pointer(ptbl->part[partno], NULL);
From: Ming Lei ming.lei@redhat.com
mainline inclusion
from mainline-v5.8-rc7
commit 3f0dcfbcd2e162fc0a11c1f59b7acd42ee45f126
category: bugfix
bugzilla: 47875
CVE: NA
-------------------------------------------------
I/O requests may be held in the scheduler queue because of resource contention. The starvation scenario was handled properly in the regular completion path, but we failed to account for it during I/O submission. This led to the hang captured below. Make sure we run the queue when resource contention is encountered in the submission path.
[ 39.054963] scsi 13:0:0:0: rejecting I/O to dead device
[ 39.058700] scsi 13:0:0:0: rejecting I/O to dead device
[ 39.087855] sd 13:0:0:1: [sdd] Synchronizing SCSI cache
[ 39.088909] scsi 13:0:0:1: rejecting I/O to dead device
[ 39.095351] scsi 13:0:0:1: rejecting I/O to dead device
[ 39.096962] scsi 13:0:0:1: rejecting I/O to dead device
[ 247.021859] INFO: task scsi-stress-rem:813 blocked for more than 122 seconds.
[ 247.023258] Not tainted 5.8.0-rc2 #8
[ 247.024069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.025331] scsi-stress-rem D 0 813 802 0x00004000
[ 247.025334] Call Trace:
[ 247.025354] __schedule+0x504/0x55f
[ 247.027987] schedule+0x72/0xa8
[ 247.027991] blk_mq_freeze_queue_wait+0x63/0x8c
[ 247.027994] ? do_wait_intr_irq+0x7a/0x7a
[ 247.027996] blk_cleanup_queue+0x4b/0xc9
[ 247.028000] __scsi_remove_device+0xf6/0x14e
[ 247.028002] scsi_remove_device+0x21/0x2b
[ 247.029037] sdev_store_delete+0x58/0x7c
[ 247.029041] kernfs_fop_write+0x10d/0x14f
[ 247.031281] vfs_write+0xa2/0xdf
[ 247.032670] ksys_write+0x6b/0xb3
[ 247.032673] do_syscall_64+0x56/0x82
[ 247.034053] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 247.034059] RIP: 0033:0x7f69f39e9008
[ 247.036330] Code: Bad RIP value.
[ 247.036331] RSP: 002b:00007ffdd8116498 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 247.037613] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f69f39e9008
[ 247.039714] RDX: 0000000000000002 RSI: 000055cde92a0ab0 RDI: 0000000000000001
[ 247.039715] RBP: 000055cde92a0ab0 R08: 000000000000000a R09: 00007f69f3a79e80
[ 247.039716] R10: 000000000000000a R11: 0000000000000246 R12: 00007f69f3abb780
[ 247.039717] R13: 0000000000000002 R14: 00007f69f3ab6740 R15: 0000000000000002
Link: https://lore.kernel.org/r/20200720025435.812030-1-ming.lei@redhat.com Cc: linux-block@vger.kernel.org Cc: Christoph Hellwig hch@lst.de Reviewed-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Conflict: drivers/scsi/scsi_lib.c [Yufen: compatible with commit 44ea147b2756 ("SCSI: fix queue cleanup race before scsi_requeue_run_queue is done")] Signed-off-by: Yufen Yu yuyufen@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/scsi_lib.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f5ee3b714b2a..0d5255482ca1 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -685,6 +685,21 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd) cmd->request->next_rq->special = NULL; }
+static void scsi_run_queue_async(struct scsi_device *sdev) +{ + struct request_queue *q = sdev->request_queue; + + percpu_ref_get(&q->q_usage_counter); + if (scsi_target(sdev)->single_lun || + !list_empty(&sdev->host->starved_list)) { + if (!kblockd_schedule_work(&sdev->requeue_work)) + percpu_ref_put(&q->q_usage_counter); + } else { + blk_mq_run_hw_queues(q, true); + percpu_ref_put(&q->q_usage_counter); + } +} + /* Returns false when no more bytes to process, true if there are more */ static bool scsi_end_request(struct request *req, blk_status_t error, unsigned int bytes, unsigned int bidi_bytes) @@ -735,14 +750,9 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
__blk_mq_end_request(req, error);
- if (scsi_target(sdev)->single_lun || - !list_empty(&sdev->host->starved_list)) { - if (!kblockd_schedule_work(&sdev->requeue_work)) - percpu_ref_put(&q->q_usage_counter); - } else { - blk_mq_run_hw_queues(q, true); - percpu_ref_put(&q->q_usage_counter); - } + scsi_run_queue_async(sdev); + + percpu_ref_put(&q->q_usage_counter); } else { unsigned long flags;
@@ -2206,6 +2216,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, */ if (req->rq_flags & RQF_DONTPREP) scsi_mq_uninit_cmd(cmd); + scsi_run_queue_async(sdev); break; } return ret;
From: Ming Lei ming.lei@redhat.com
mainline inclusion
from mainline-v5.10-rc1
commit ed5dd6a67d5eac5fb8873697b55dc1699752a9f3
category: bugfix
bugzilla: 47875
CVE: NA
-------------------------------------------------
The request queue is currently run unconditionally in scsi_end_request() if both target queue and host queue are ready.
Recently Long Li reported that cost of a queue run can be very heavy in case of high queue depth. Improve this situation by only running the request queue when this LUN is busy.
Link: https://lore.kernel.org/r/20200910075056.36509-1-ming.lei@redhat.com Reported-by: Long Li longli@microsoft.com Tested-by: Long Li longli@microsoft.com Tested-by: Kashyap Desai kashyap.desai@broadcom.com Reviewed-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Hannes Reinecke hare@suse.de Reviewed-by: Ewan D. Milne emilne@redhat.com Reviewed-by: John Garry john.garry@huawei.com Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Conflict: drivers/scsi/scsi_lib.c include/scsi/scsi_device.h Signed-off-by: Yufen Yu yuyufen@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/scsi_lib.c | 34 +++++++++++++++++++++++++++++++++- include/scsi/scsi_device.h | 1 + 2 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 0d5255482ca1..1bca98745b00 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -695,7 +695,23 @@ static void scsi_run_queue_async(struct scsi_device *sdev) if (!kblockd_schedule_work(&sdev->requeue_work)) percpu_ref_put(&q->q_usage_counter); } else { - blk_mq_run_hw_queues(q, true); + /* + * smp_mb() present in sbitmap_queue_clear() or implied in + * .end_io is for ordering writing .device_busy in + * scsi_device_unbusy() and reading sdev->restarts. + */ + int old = atomic_read(&sdev->restarts); + + /* + * ->restarts has to be kept as non-zero if new budget + * contention occurs. + * + * No need to run queue when either another re-run + * queue wins in updating ->restarts or a new budget + * contention occurs. + */ + if (old && atomic_cmpxchg(&sdev->restarts, old, 0) == old) + blk_mq_run_hw_queues(sdev->request_queue, true); percpu_ref_put(&q->q_usage_counter); } } @@ -2136,7 +2152,23 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
out_put_device: put_device(&sdev->sdev_gendev); + atomic_inc(&sdev->restarts); + + /* + * Orders atomic_inc(&sdev->restarts) and atomic_read(&sdev->device_busy). + * .restarts must be incremented before .device_busy is read because the + * code in scsi_run_queue_async() depends on the order of these operations. + */ + smp_mb__after_atomic(); out: + /* + * If all in-flight requests originated from this LUN are completed + * before reading .device_busy, sdev->device_busy will be observed as + * zero, then blk_mq_delay_run_hw_queues() will dispatch this request + * soon. Otherwise, completion of one of these requests will observe + * the .restarts flag, and the request queue will be run for handling + * this request, see scsi_end_request(). + */ if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev)) blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY); return false; diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 550739a5ea96..6e2edd66c598 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -110,6 +110,7 @@ struct scsi_device { atomic_t device_busy; /* commands actually active on LLDD */ atomic_t device_blocked; /* Device returned QUEUE_FULL. */
+ atomic_t restarts; spinlock_t list_lock; struct list_head cmd_list; /* queue of in use SCSI Command structures */ struct list_head starved_entry;
From: Yufen Yu yuyufen@huawei.com
hulk inclusion
category: bugfix
bugzilla: 47875
CVE: NA
-------------------------------------------------
Use the reserved room in struct scsi_device to store the newly added member.
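A simplified sketch of the reserve-slot pattern being used (the KABI_RESERVE definition and the struct below are stand-ins, not the real scsi_device code): genksyms computes symbol CRCs from the old padded layout, while ordinary builds compile the new member into the reserved space, so the structure size seen by out-of-tree modules does not change.

/*
 * Sketch of the KABI reserve-slot trick (illustrative names). Only the
 * genksyms CRC pass defines __GENKSYMS__, so it keeps seeing the padding
 * slot while real builds get the new member in its place.
 */
#define KABI_RESERVE(n)	unsigned long kabi_reserved##n;

struct example_device {			/* layout is part of the KABI */
	int busy;
#ifndef __GENKSYMS__
	unsigned long restarts;		/* new member reuses the reserved slot */
#else
	KABI_RESERVE(1)			/* what genksyms (and old modules) expect */
#endif
	KABI_RESERVE(2)
};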
Signed-off-by: Yufen Yu yuyufen@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/scsi/scsi_device.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 6e2edd66c598..f0b011ca39e1 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -110,7 +110,6 @@ struct scsi_device { atomic_t device_busy; /* commands actually active on LLDD */ atomic_t device_blocked; /* Device returned QUEUE_FULL. */
- atomic_t restarts; spinlock_t list_lock; struct list_head cmd_list; /* queue of in use SCSI Command structures */ struct list_head starved_entry; @@ -231,10 +230,11 @@ struct scsi_device { struct task_struct *quiesced_by; #ifndef __GENKSYMS__ unsigned long offline_already; /* Device offline message logged */ + atomic_t restarts; #else KABI_RESERVE(1) -#endif KABI_RESERVE(2) +#endif KABI_RESERVE(3) KABI_RESERVE(4) KABI_RESERVE(5)
From: Ye Bin yebin10@huawei.com
hulk inclusion
category: bugfix
bugzilla: 49978
CVE: NA
-----------------------------------------------
This reverts commit 544058bd9aa143489cd480c5c076caf76c33b6c1.
We got the following error:

2021/02/26 10:15:49 parsed 1 programs
2021/02/26 10:15:49 executed programs: 0
Message from syslogd@localhost at Feb 26 10:15:52 ...
kernel:[ 710.135641] page:ffff7e000309e600 count:-1 mapcount:0 mapping:0000000000000000 index:0x0

Message from syslogd@localhost at Feb 26 10:15:52 ...
kernel:[ 710.136201] flags: 0xffffe0000000000()
sg_remove_scat() checks schp->k_use_sg and then frees the pages. But in sg_build_indirect(), when (rem_sz > 0) we free the pages without clearing schp->k_use_sg or setting schp->pages[i] to NULL, so the same pages are freed again in sg_remove_scat().
Fixes: 544058bd9aa1 ("scsi: sg: fix memory leak in sg_build_indirect") Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/sg.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 749faafbc977..10da329fa53f 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1942,12 +1942,8 @@ sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size) k, rem_sz));
schp->bufflen = blk_size; - if (rem_sz > 0) { /* must have failed */ - for (i = 0; i < k; i++) - __free_pages(schp->pages[i], order); - + if (rem_sz > 0) /* must have failed */ return -ENOMEM; - } return 0; out: for (i = 0; i < k; i++)
From: Chris Down chris@chrisdown.name
mainline inclusion
from mainline-v5.9-rc1
commit e809d5f0b5c912fe981dce738f3283b2010665f0
category: bugfix
bugzilla: NA
CVE: NA

---------------------------
Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On affected tiers, in excess of 10% of hosts show multiple files with different content and the same inode number, with some servers even having as many as 150 duplicated inode numbers with differing file content.
This causes actual, tangible problems in production. For example, we have complaints from those working on remote caches that their application is reporting cache corruptions because it uses (device, inodenum) to establish the identity of a particular cache object, but because it's not unique any more, the application refuses to continue and reports cache corruption. Even worse, sometimes applications may not even detect the corruption but may continue anyway, causing phantom and hard to debug behaviour.
In general, userspace applications expect that (device, inodenum) should be enough to uniquely identify one inode, which seems fair enough. One might also need to check the generation, but in this case:
1. That's not currently exposed to userspace (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the same inode number on one device.
In order to mitigate this, we take a two-pronged approach:
1. Moving inum generation from being global to per-sb for tmpfs. This itself allows some reduction in i_ino churn. This works on both 64- and 32-bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with 64-bit ino_t only: we allow users to mount tmpfs with a new inode64 option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
You can see how this compares to previous related patches which didn't implement this per-superblock:
- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/
This patch (of 2):
get_next_ino has a number of problems:
- It uses and returns a uint, which is susceptible to overflow if a lot of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This means it's not that difficult to cause inode number wraparounds on a single device, which can result in having multiple distinct inodes with the same inode number.
This patch adds a per-superblock counter that mitigates the second case. This design also allows us to later have a specific i_ino size per-device, for example, allowing users to choose whether to use 32- or 64-bit inodes for each tmpfs mount. This is implemented in the next commit.
For internal shmem mounts which may be less tolerant to spinlock delays, we implement a percpu batching scheme which only takes the stat_lock at each batch boundary.
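As an illustration of the batching idea only (a userspace analogue with threads standing in for CPUs; ID_BATCH and the other names are made up and this is not the shmem code): each context hands out numbers from a private batch and takes the shared lock only when that batch is exhausted.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define ID_BATCH 1024

static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t next_id;			/* shared counter, like sbinfo->next_ino */
static __thread uint64_t local_next_id;		/* per-thread slot, like the per-cpu batch */

static uint64_t get_id(void)
{
	uint64_t id = local_next_id;

	if (id % ID_BATCH == 0) {		/* batch boundary: refill under the lock */
		pthread_mutex_lock(&stat_lock);
		id = next_id;
		next_id += ID_BATCH;
		pthread_mutex_unlock(&stat_lock);
		if (id == 0)			/* never hand out 0, like is_zero_ino() */
			id++;
	}
	local_next_id = id + 1;
	return id;
}

static void *worker(void *arg)
{
	for (int i = 0; i < 3; i++)
		printf("thread %ld got id %ju\n", (long)(intptr_t)arg, (uintmax_t)get_id());
	return NULL;
}

int main(void)
{
	pthread_t t[2];

	for (long i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
	for (long i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}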
Signed-off-by: Chris Down chris@chrisdown.name Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Hugh Dickins hughd@google.com Cc: Amir Goldstein amir73il@gmail.com Cc: Al Viro viro@zeniv.linux.org.uk Cc: Matthew Wilcox willy@infradead.org Cc: Jeff Layton jlayton@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Tejun Heo tj@kernel.org Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Luo Meng luomeng12@huawei.com
Conflicts: mm/shmem.c Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/fs.h | 15 +++++++++ include/linux/shmem_fs.h | 2 ++ mm/shmem.c | 66 +++++++++++++++++++++++++++++++++++++--- 3 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h index ec5744cd8d0f..7e9a64f52c6e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3010,6 +3010,21 @@ extern void discard_new_inode(struct inode *); extern unsigned int get_next_ino(void); extern void evict_inodes(struct super_block *sb);
+/* + * Userspace may rely on the the inode number being non-zero. For example, glibc + * simply ignores files with zero i_ino in unlink() and other places. + * + * As an additional complication, if userspace was compiled with + * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the + * lower 32 bits, so we need to check that those aren't zero explicitly. With + * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but + * better safe than sorry. + */ +static inline bool is_zero_ino(ino_t ino) +{ + return (u32)ino == 0; +} + extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index f155dc607112..dd67c6e59cea 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -34,6 +34,8 @@ struct shmem_sb_info { unsigned char huge; /* Whether to try for hugepages */ kuid_t uid; /* Mount uid for root directory */ kgid_t gid; /* Mount gid for root directory */ + ino_t next_ino; /* The next per-sb inode number to use */ + ino_t __percpu *ino_batch; /* The next per-cpu inode number to use */ struct mempolicy *mpol; /* default memory policy for mappings */ spinlock_t shrinklist_lock; /* Protects shrinklist */ struct list_head shrinklist; /* List of shinkable inodes */ diff --git a/mm/shmem.c b/mm/shmem.c index 98939e57f6cc..25205b46b053 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -267,18 +267,67 @@ bool vma_is_shmem(struct vm_area_struct *vma) static LIST_HEAD(shmem_swaplist); static DEFINE_MUTEX(shmem_swaplist_mutex);
-static int shmem_reserve_inode(struct super_block *sb) +/* + * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and + * produces a novel ino for the newly allocated inode. + * + * It may also be called when making a hard link to permit the space needed by + * each dentry. However, in that case, no new inode number is needed since that + * internally draws from another pool of inode numbers (currently global + * get_next_ino()). This case is indicated by passing NULL as inop. + */ +#define SHMEM_INO_BATCH 1024 +static int shmem_reserve_inode(struct super_block *sb, ino_t *inop) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); - if (sbinfo->max_inodes) { + ino_t ino; + + if (!(sb->s_flags & SB_KERNMOUNT)) { spin_lock(&sbinfo->stat_lock); if (!sbinfo->free_inodes) { spin_unlock(&sbinfo->stat_lock); return -ENOSPC; } sbinfo->free_inodes--; + if (inop) { + ino = sbinfo->next_ino++; + if (unlikely(is_zero_ino(ino))) + ino = sbinfo->next_ino++; + if (unlikely(ino > UINT_MAX)) { + /* + * Emulate get_next_ino uint wraparound for + * compatibility + */ + ino = 1; + } + *inop = ino; + } spin_unlock(&sbinfo->stat_lock); + } else if (inop) { + /* + * __shmem_file_setup, one of our callers, is lock-free: it + * doesn't hold stat_lock in shmem_reserve_inode since + * max_inodes is always 0, and is called from potentially + * unknown contexts. As such, use a per-cpu batched allocator + * which doesn't require the per-sb stat_lock unless we are at + * the batch boundary. + */ + ino_t *next_ino; + next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); + ino = *next_ino; + if (unlikely(ino % SHMEM_INO_BATCH == 0)) { + spin_lock(&sbinfo->stat_lock); + ino = sbinfo->next_ino; + sbinfo->next_ino += SHMEM_INO_BATCH; + spin_unlock(&sbinfo->stat_lock); + if (unlikely(is_zero_ino(ino))) + ino++; + } + *inop = ino; + *next_ino = ++ino; + put_cpu(); } + return 0; }
@@ -2218,13 +2267,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode struct inode *inode; struct shmem_inode_info *info; struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + ino_t ino;
- if (shmem_reserve_inode(sb)) + if (shmem_reserve_inode(sb, &ino)) return NULL;
inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = ino; inode_init_owner(inode, dir, mode); inode->i_blocks = 0; inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); @@ -2938,7 +2988,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr * first link must skip that, to get the accounting right. */ if (inode->i_nlink) { - ret = shmem_reserve_inode(inode->i_sb); + ret = shmem_reserve_inode(inode->i_sb, NULL); if (ret) goto out; } @@ -3539,6 +3589,7 @@ static void shmem_put_super(struct super_block *sb) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+ free_percpu(sbinfo->ino_batch); percpu_counter_destroy(&sbinfo->used_blocks); mpol_put(sbinfo->mpol); kfree(sbinfo); @@ -3583,6 +3634,11 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) #else sb->s_flags |= SB_NOUSER; #endif + if (sb->s_flags & SB_KERNMOUNT) { + sbinfo->ino_batch = alloc_percpu(ino_t); + if (!sbinfo->ino_batch) + goto failed; + }
spin_lock_init(&sbinfo->stat_lock); if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL))
From: Chris Down chris@chrisdown.name
mainline inclusion from mainline-v5.9-rc1 commit ea3271f7196c65ae5d3e1c7b3f733892c017dbd6 category: bugfix bugzilla: NA CVE: NA ---------------------------
The default is still set to inode32 for backwards compatibility, but system administrators can opt in to the new 64-bit inode numbers by either:
1. Passing inode64 on the command line when mounting, or 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
The inode64 and inode32 names are used based on existing precedent from XFS.
[hughd@google.com: Kconfig fixes] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
Signed-off-by: Chris Down chris@chrisdown.name Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Amir Goldstein amir73il@gmail.com Acked-by: Hugh Dickins hughd@google.com Cc: Al Viro viro@zeniv.linux.org.uk Cc: Matthew Wilcox willy@infradead.org Cc: Jeff Layton jlayton@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Tejun Heo tj@kernel.org Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218... Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- Documentation/filesystems/tmpfs.txt | 17 ++++++++ fs/Kconfig | 21 ++++++++++ include/linux/shmem_fs.h | 1 + mm/shmem.c | 61 ++++++++++++++++++++++++++--- 4 files changed, 95 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index d06e9a59a9f4..c72536fa78ae 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -135,6 +135,21 @@ gid: The group id These options do not have any effect on remount. You can change these parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode +numbers: + +======= ======================== +inode64 Use 64-bit inode numbers +inode32 Use 32-bit inode numbers +======= ======================== + +On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time. +On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default. inode64 avoids the +possibility of multiple files with the same inode number on a single device; +but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached - +if a long-lived tmpfs is accessed by 32-bit applications so ancient that +opening a file larger than 2GiB fails with EINVAL. +
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' will give you tmpfs instance on /mytmpfs which can allocate 10GB @@ -147,3 +162,5 @@ Updated: Hugh Dickins, 4 June 2007 Updated: KOSAKI Motohiro, 16 Mar 2010 +Updated: + Chris Down, 13 July 2020 diff --git a/fs/Kconfig b/fs/Kconfig index 3fa013a39981..2d9d472d8ba8 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -190,6 +190,27 @@ config TMPFS_XATTR
If unsure, say N.
+config TMPFS_INODE64 + bool "Use 64-bit ino_t by default in tmpfs" + depends on TMPFS && 64BIT + default n + help + tmpfs has historically used only inode numbers as wide as an unsigned + int. In some cases this can cause wraparound, potentially resulting + in multiple files with the same inode number on a single device. This + option makes tmpfs use the full width of ino_t by default, without + needing to specify the inode64 option when mounting. + + But if a long-lived tmpfs is to be accessed by 32-bit applications so + ancient that opening a file larger than 2GiB fails with EINVAL, then + the INODE64 config option and inode64 mount option risk operations + failing with EOVERFLOW once 33-bit inode numbers are reached. + + To override this configured default, use the inode32 or inode64 + option when mounting. + + If unsure, say N. + config HUGETLBFS bool "HugeTLB file system support" depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \ diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index dd67c6e59cea..401afcf81efe 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -34,6 +34,7 @@ struct shmem_sb_info { unsigned char huge; /* Whether to try for hugepages */ kuid_t uid; /* Mount uid for root directory */ kgid_t gid; /* Mount gid for root directory */ + bool full_inums; /* If i_ino should be uint or ino_t */ ino_t next_ino; /* The next per-sb inode number to use */ ino_t __percpu *ino_batch; /* The next per-cpu inode number to use */ struct mempolicy *mpol; /* default memory policy for mappings */ diff --git a/mm/shmem.c b/mm/shmem.c index 25205b46b053..ad5ca9ed943e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -293,12 +293,17 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop) ino = sbinfo->next_ino++; if (unlikely(is_zero_ino(ino))) ino = sbinfo->next_ino++; - if (unlikely(ino > UINT_MAX)) { + if (unlikely(!sbinfo->full_inums && + ino > UINT_MAX)) { /* * Emulate get_next_ino uint wraparound for * compatibility */ - ino = 1; + if (IS_ENABLED(CONFIG_64BIT)) + pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n", + __func__, MINOR(sb->s_dev)); + sbinfo->next_ino = 1; + ino = sbinfo->next_ino++; } *inop = ino; } @@ -311,6 +316,10 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop) * unknown contexts. As such, use a per-cpu batched allocator * which doesn't require the per-sb stat_lock unless we are at * the batch boundary. + * + * We don't need to worry about inode{32,64} since SB_KERNMOUNT + * shmem mounts are not exposed to userspace, so we don't need + * to worry about things like glibc compatibility. */ ino_t *next_ino; next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); @@ -3427,9 +3436,12 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, if ((value = strchr(this_char,'=')) != NULL) { *value++ = 0; } else { - pr_err("tmpfs: No value for mount option '%s'\n", - this_char); - goto error; + if (strcmp(this_char,"inode32") && + strcmp(this_char,"inode64")) { + pr_err("tmpfs: No value for mount option '%s'\n", + this_char); + goto error; + } }
if (!strcmp(this_char,"size")) { @@ -3495,6 +3507,14 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, if (mpol_parse_str(value, &mpol)) goto bad_val; #endif + } else if (!strcmp(this_char,"inode32")) { + sbinfo->full_inums = false; + } else if (!strcmp(this_char,"inode64")) { + if (sizeof(ino_t) < 8) { + pr_err("tmpfs: Cannot use inode64 with <64bit inums in kernel\n"); + goto error; + } + sbinfo->full_inums = true; } else { pr_err("tmpfs: Bad mount option %s\n", this_char); goto error; @@ -3539,8 +3559,15 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data) if (config.max_inodes && !sbinfo->max_inodes) goto out;
+ if (sbinfo->full_inums && !config.full_inums && + sbinfo->next_ino > UINT_MAX) { + pr_err("tmpfs: Current inum too high to switch to 32-bit inums\n"); + goto out; + } + error = 0; sbinfo->huge = config.huge; + sbinfo->full_inums = config.full_inums; sbinfo->max_blocks = config.max_blocks; sbinfo->max_inodes = config.max_inodes; sbinfo->free_inodes = config.max_inodes - inodes; @@ -3574,6 +3601,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root) if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, sbinfo->gid)); + + /* + * Showing inode{64,32} might be useful even if it's the system default, + * since then people don't have to resort to checking both here and + * /proc/config.gz to confirm 64-bit inums were successfully applied + * (which may not even exist if IKCONFIG_PROC isn't enabled). + * + * We hide it when inode64 isn't the default and we are using 32-bit + * inodes, since that probably just means the feature isn't even under + * consideration. + * + * As such: + * + * +-----------------+-----------------+ + * | TMPFS_INODE64=y | TMPFS_INODE64=n | + * +------------------+-----------------+-----------------+ + * | full_inums=true | show | show | + * | full_inums=false | show | hide | + * +------------------+-----------------+-----------------+ + * + */ + if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums) + seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32)); #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */ if (sbinfo->huge) @@ -3622,6 +3672,7 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) if (!(sb->s_flags & SB_KERNMOUNT)) { sbinfo->max_blocks = shmem_default_max_blocks(); sbinfo->max_inodes = shmem_default_max_inodes(); + sbinfo->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); if (shmem_parse_options(data, sbinfo, false)) { err = -EINVAL; goto failed;
From: Byron Stanoszek gandalf@winds.org
mainline inclusion from mainline-v5.9-rc6 commit bb3e96d63eb75a2f4ff790b089f6b93614c729a1 category: bugfix bugzilla: NA CVE: NA ---------------------------
Commit e809d5f0b5c9 ("tmpfs: per-superblock i_ino support") made changes to shmem_reserve_inode() in mm/shmem.c, however the original test for (sbinfo->max_inodes) got dropped. This causes mounting tmpfs with option nr_inodes=0 to fail:
# mount -ttmpfs -onr_inodes=0 none /ext0 mount: /ext0: mount(2) system call failed: Cannot allocate memory.
This patch restores the nr_inodes=0 functionality.
Fixes: e809d5f0b5c9 ("tmpfs: per-superblock i_ino support") Signed-off-by: Byron Stanoszek gandalf@winds.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Hugh Dickins hughd@google.com Acked-by: Chris Down chris@chrisdown.name Link: https://lkml.kernel.org/r/20200902035715.16414-1-gandalf@winds.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Luo Meng luomeng12@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/shmem.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c index ad5ca9ed943e..b4be0be77327 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -284,11 +284,13 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
if (!(sb->s_flags & SB_KERNMOUNT)) { spin_lock(&sbinfo->stat_lock); - if (!sbinfo->free_inodes) { - spin_unlock(&sbinfo->stat_lock); - return -ENOSPC; + if (sbinfo->max_inodes) { + if (!sbinfo->free_inodes) { + spin_unlock(&sbinfo->stat_lock); + return -ENOSPC; + } + sbinfo->free_inodes--; } - sbinfo->free_inodes--; if (inop) { ino = sbinfo->next_ino++; if (unlikely(is_zero_ino(ino)))
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
-----------------------------------------------
disable config TMPFS_INODE64 by default
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/configs/hulk_defconfig | 1 + 1 file changed, 1 insertion(+)
diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig index b85d790d630b..706f06dfcfd3 100644 --- a/arch/arm64/configs/hulk_defconfig +++ b/arch/arm64/configs/hulk_defconfig @@ -4888,6 +4888,7 @@ CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_TMPFS_POSIX_ACL=y CONFIG_TMPFS_XATTR=y +# CONFIG_TMPFS_INODE64 is not set CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y CONFIG_MEMFD_CREATE=y
From: "zhangyi (F)" yi.zhang@huawei.com
hulk inclusion category: bugfix bugzilla: 49893 CVE: NA ---------------------------
If we fail to add the new entry on rename whiteout, we cannot reset the old->de entry directly, because old->de could have moved from under us while the directory was being made indexed. So we need to find the old entry again before resetting it; otherwise the filesystem may be corrupted.
Fixes: 21401081d4e ("ext4: fix bug for rename with RENAME_WHITEOUT") Signed-off-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Reviewed-by: Ye bin yebin10@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/namei.c | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 34a4a1a298df..34b26dacbbe5 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -3428,6 +3428,31 @@ static int ext4_setent(handle_t *handle, struct ext4_renament *ent, return 0; }
+static void ext4_resetent(handle_t *handle, struct ext4_renament *ent, + unsigned ino, unsigned file_type) +{ + struct ext4_renament old = *ent; + int retval = 0; + + /* + * old->de could have moved from under us during make indexed dir, + * so the old->de may no longer valid and need to find it again + * before reset old inode info. + */ + old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de, NULL); + if (IS_ERR(old.bh)) + retval = PTR_ERR(old.bh); + if (!old.bh) + retval = -ENOENT; + if (retval) { + ext4_std_error(old.dir->i_sb, retval); + return; + } + + ext4_setent(handle, &old, ino, file_type); + brelse(old.bh); +} + static int ext4_find_delete_entry(handle_t *handle, struct inode *dir, const struct qstr *d_name) { @@ -3724,8 +3749,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry, end_rename: if (whiteout) { if (retval) { - ext4_setent(handle, &old, - old.inode->i_ino, old_file_type); + ext4_resetent(handle, &old, + old.inode->i_ino, old_file_type); drop_nlink(whiteout); } unlock_new_inode(whiteout);
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: 47452 CVE: NA
-------------------------------------------------
Using bit-23 for connector debugging changes __GFP_BITS_SHIFT, which breaks drivers that rely on the existing flag layout. So use bit-31 instead and keep __GFP_BITS_SHIFT unchanged.
Because GFP_CONNECTOR now sits above __GFP_BITS_SHIFT, it needs to be ignored when checking the slab bug mask in new_slab().
If a driver passes the wrong gfp flag (bit-31) together with __GFP_ZERO to allocate zeroed memory, the last 8 bytes of the object would not be set to 0, so set the magic number before clearing the memory to avoid this.
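As a quick sanity check of the bit arithmetic (plain userspace C, not kernel code; the constants mirror the values in the hunk below, and the slab-bug-mask detail is taken from the new_slab() change): with __GFP_BITS_SHIFT kept at 23 (plus one under lockdep), bit-31 lies above __GFP_BITS_MASK, which is exactly why new_slab() now has to tolerate GFP_CONNECTOR when checking for invalid flags.

#include <stdio.h>

int main(void)
{
	unsigned int gfp_connector  = 0x80000000u;	/* bit-31, ___GFP_CONNECTOR after this patch */
	unsigned int gfp_bits_shift = 23;		/* unchanged, CONFIG_LOCKDEP=n case */
	unsigned int gfp_bits_mask  = (1u << gfp_bits_shift) - 1;

	/* Bits above the mask are normally rejected by the slab bug-mask check,
	 * hence the explicit GFP_CONNECTOR exception added to new_slab(). */
	printf("connector bit above __GFP_BITS_MASK: %d\n",
	       !!(gfp_connector & ~gfp_bits_mask));	/* prints 1 */
	return 0;
}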
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/connector/connector.c | 6 +++++- include/linux/gfp.h | 8 ++++---- mm/slub.c | 10 +++++++--- 3 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index b5cdd0b99736..a4c5e04fcaaa 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -107,7 +107,11 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
size = sizeof(*msg) + len;
- skb = nlmsg_new(size, gfp_mask | GFP_CONNECTOR); + if (sysctl_connector_debug) + skb = nlmsg_new(size, gfp_mask | GFP_CONNECTOR); + else + skb = nlmsg_new(size, gfp_mask); + if (!skb) return -ENOMEM;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h index e3274fdc2736..d7b5f2333fc3 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -39,13 +39,13 @@ struct vm_area_struct; #define ___GFP_ACCOUNT 0x100000u #define ___GFP_DIRECT_RECLAIM 0x200000u #define ___GFP_KSWAPD_RECLAIM 0x400000u -#define ___GFP_CONNECTOR 0x800000u #ifdef CONFIG_LOCKDEP -#define ___GFP_NOLOCKDEP 0x1000000u +#define ___GFP_NOLOCKDEP 0x800000u #else #define ___GFP_NOLOCKDEP 0 #endif -/* used bit-23 for dfx in connector */ +/* used bit-31 for dfx in connector */ +#define ___GFP_CONNECTOR 0x80000000u #define __GFP_CONNECTOR ((__force gfp_t)___GFP_CONNECTOR) #define GFP_CONNECTOR __GFP_CONNECTOR /* If the above are modified, __GFP_BITS_SHIFT may need updating */ @@ -221,7 +221,7 @@ struct vm_area_struct; #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
/* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP)) +#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP)) #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/** diff --git a/mm/slub.c b/mm/slub.c index 2301c04353d0..3cfec27018ac 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1689,12 +1689,16 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) { if (unlikely(flags & GFP_SLAB_BUG_MASK)) { gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK; + if (sysctl_connector_debug && + (invalid_mask & GFP_CONNECTOR)) + goto out; flags &= ~GFP_SLAB_BUG_MASK; pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n", invalid_mask, &invalid_mask, flags, &flags); dump_stack(); }
+out: return allocate_slab(s, flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node); } @@ -2782,15 +2786,15 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, stat(s, ALLOC_FASTPATH); }
- if (unlikely(gfpflags & __GFP_ZERO) && object) - memset(object, 0, s->object_size); - if (sysctl_connector_debug && unlikely(gfpflags & __GFP_CONNECTOR) && object) { if (s->object_size == 512) memset(object + 504, 0xad, 8); }
+ if (unlikely(gfpflags & __GFP_ZERO) && object) + memset(object, 0, s->object_size); + slab_post_alloc_hook(s, gfpflags, 1, &object);
return object;
From: Yonglong Liu liuyonglong@huawei.com
driver inclusion category: feature bugzilla: NA CVE: NA
----------------------------
This patch adds support for setting/getting pf max tx rate via sysfs:
echo <rate Mbit/s> > /sys/class/net/<eth name>/kunpeng/pf/max_tx_rate cat /sys/class/net/<eth name>/kunpeng/pf/max_tx_rate
Signed-off-by: Yonglong Liu liuyonglong@huawei.com Reviewed-by: li yongxin liyongxin1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/hisilicon/hns3/Makefile | 3 +- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 8 + .../hns3_extension/hns3pf/hclge_main_it.c | 33 +++++ .../hns3/hns3_extension/hns3pf/hclge_sysfs.c | 140 ++++++++++++++++++ .../hns3/hns3_extension/hns3pf/hclge_sysfs.h | 14 ++ .../hisilicon/hns3/hns3pf/hclge_main.c | 17 +++ .../hisilicon/hns3/hns3pf/hclge_main.h | 5 + 7 files changed, 219 insertions(+), 1 deletion(-) create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c create mode 100644 drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.h
diff --git a/drivers/net/ethernet/hisilicon/hns3/Makefile b/drivers/net/ethernet/hisilicon/hns3/Makefile index e1c520226ddd..5bec09bb6e9e 100644 --- a/drivers/net/ethernet/hisilicon/hns3/Makefile +++ b/drivers/net/ethernet/hisilicon/hns3/Makefile @@ -38,7 +38,8 @@ HCLGE_OBJ = hns3pf/hclge_main.o \ hns3pf/hclge_err.o
-HCLGE_OBJ_IT_MAIN = hns3_extension/hns3pf/hclge_main_it.o +HCLGE_OBJ_IT_MAIN = hns3_extension/hns3pf/hclge_main_it.o \ + hns3_extension/hns3pf/hclge_sysfs.o obj-$(CONFIG_HNS3_HCLGE) += hclge.o hclge-objs := $(HCLGE_OBJ) $(HCLGE_OBJ_IT_MAIN) hclge-$(CONFIG_HNS3_DCB) += hns3pf/hclge_dcb.o diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 3c56cbc9afa6..7b3938054e5f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -578,6 +578,9 @@ struct hnae3_ae_ops { int (*ecc_handle)(struct hnae3_ae_dev *ae_dev); int (*priv_ops)(struct hnae3_handle *handle, int opcode, void *data, int length); + void (*ext_init)(struct hnae3_handle *handle); + void (*ext_uninit)(struct hnae3_handle *handle); + void (*ext_reset_done)(struct hnae3_handle *handle); #endif };
@@ -713,6 +716,11 @@ struct hnae3_handle {
/* Network interface message level enabled bits */ u32 msg_enable; + +#ifdef CONFIG_HNS3_TEST + /* for sysfs */ + struct kobject *kobj; +#endif };
#define hnae3_set_field(origin, mask, shift, val) \ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_main_it.c b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_main_it.c index 01f0e19f1e50..59e287848ee3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_main_it.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_main_it.c @@ -18,6 +18,9 @@ #include "hclge_main.h" #include "hnae3.h" #include "hclge_main_it.h" +#ifdef CONFIG_HNS3_TEST +#include "hclge_sysfs.h" +#endif
#ifdef CONFIG_IT_VALIDATION #define HCLGE_RESET_MAX_FAIL_CNT 1 @@ -174,8 +177,38 @@ bool hclge_reset_done_it(struct hnae3_handle *handle, bool done) return done; }
+#ifdef CONFIG_HNS3_TEST +void hclge_ext_init(struct hnae3_handle *handle) +{ + hclge_sysfs_init(handle); +} + +void hclge_ext_uninit(struct hnae3_handle *handle) +{ + struct hclge_vport *vport = hclge_get_vport(handle); + struct hclge_dev *hdev = vport->back; + + hclge_reset_pf_rate(hdev); + hclge_sysfs_uninit(handle); +} + +void hclge_ext_reset_done(struct hnae3_handle *handle) +{ + struct hclge_vport *vport = hclge_get_vport(handle); + struct hclge_dev *hdev = vport->back; + + hclge_resume_pf_rate(hdev); +} +#endif + int hclge_init_it(void) { +#ifdef CONFIG_HNS3_TEST + hclge_ops.ext_init = hclge_ext_init; + hclge_ops.ext_uninit = hclge_ext_uninit; + hclge_ops.ext_reset_done = hclge_ext_reset_done; +#endif + hclge_ops.reset_event = hclge_reset_event_it; hclge_ops.reset_done = hclge_reset_done_it; hclge_ops.handle_imp_error = hclge_handle_imp_error_it; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c new file mode 100644 index 000000000000..86e60938013c --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* Copyright (c) 2018-2021 Hisilicon Limited. */ + +#include <linux/device.h> +#include "hnae3.h" +#include "hclge_main.h" +#include "hclge_tm.h" +#include "hclge_sysfs.h" + +void hclge_reset_pf_rate(struct hclge_dev *hdev) +{ + struct hclge_vport *vport = &hdev->vport[0]; + int ret; + + /* zero means max rate, if max_tx_rate is zero, just return */ + if (!vport->vf_info.max_tx_rate) + return; + + vport->vf_info.max_tx_rate = 0; + + ret = hclge_tm_qs_shaper_cfg(vport, vport->vf_info.max_tx_rate); + if (ret) + dev_err(&hdev->pdev->dev, + "failed to reset pf tx rate to default, ret = %d.\n", + ret); +} + +int hclge_resume_pf_rate(struct hclge_dev *hdev) +{ + struct hclge_vport *vport = &hdev->vport[0]; + int ret; + + /* zero means max rate, after reset, firmware already set it to + * max rate, so just continue. + */ + if (!vport->vf_info.max_tx_rate) + return 0; + + ret = hclge_tm_qs_shaper_cfg(vport, vport->vf_info.max_tx_rate); + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to resume pf tx rate:%u, ret = %d.\n", + vport->vf_info.max_tx_rate, ret); + return ret; + } + + return 0; +} + +static ssize_t hclge_max_tx_rate_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + struct hclge_vport *vport = + container_of(kobj, struct hclge_vport, kobj); + + return sprintf(buf, "%d Mbit/s (0 means no limit)\n", + vport->vf_info.max_tx_rate); +} + +static ssize_t hclge_max_tx_rate_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, + size_t size) +{ + struct hclge_vport *vport = + container_of(kobj, struct hclge_vport, kobj); + struct hclge_dev *hdev = vport->back; + int max_tx_rate; + int ret; + + ret = kstrtoint(buf, 0, &max_tx_rate); + if (ret) + return -EINVAL; + + if (max_tx_rate < 0 || max_tx_rate > hdev->hw.mac.max_speed) { + dev_err(&hdev->pdev->dev, + "invalid max_tx_rate:%d [0, %u]\n", + max_tx_rate, hdev->hw.mac.max_speed); + return -EINVAL; + } + + ret = hclge_tm_qs_shaper_cfg(vport, max_tx_rate); + if (ret) + return ret; + + vport->vf_info.max_tx_rate = max_tx_rate; + + return ret ? 
(ssize_t)ret : size; +} + +static struct kobj_attribute hclge_attr_max_tx_rate = { + .attr = {.name = "max_tx_rate", + .mode = 0644 }, + .show = hclge_max_tx_rate_show, + .store = hclge_max_tx_rate_store, +}; + +static struct attribute *hclge_sysfs_attrs[] = { + &hclge_attr_max_tx_rate.attr, + NULL, +}; + +static struct kobj_type hclge_sysfs_type = { + .sysfs_ops = &kobj_sysfs_ops, + .default_attrs = hclge_sysfs_attrs, +}; + +void hclge_sysfs_init(struct hnae3_handle *handle) +{ + struct net_device *netdev = handle->netdev; + struct hclge_vport *vport = hclge_get_vport(handle); + int ret; + + handle->kobj = kobject_create_and_add("kunpeng", &netdev->dev.kobj); + if (!handle->kobj) { + netdev_err(netdev, "failed to create kobj, ret = %d\n", ret); + return; + } + + ret = kobject_init_and_add(&vport->kobj, &hclge_sysfs_type, + handle->kobj, "pf"); + if (ret) { + netdev_err(netdev, "failed to init kobj, ret = %d\n", ret); + kobject_put(handle->kobj); + handle->kobj = NULL; + } +} + +void hclge_sysfs_uninit(struct hnae3_handle *handle) +{ + struct hclge_vport *vport = hclge_get_vport(handle); + + if (!handle->kobj) + return; + + kobject_put(&vport->kobj); + kobject_put(handle->kobj); + handle->kobj = NULL; +} diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.h b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.h new file mode 100644 index 000000000000..8eb33357c577 --- /dev/null +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0+ + * Copyright (c) 2018-2021 Hisilicon Limited. + */ + +#ifndef __HCLGE_SYSFS_H +#define __HCLGE_SYSFS_H + +void hclge_reset_pf_rate(struct hclge_dev *hdev); +int hclge_resume_pf_rate(struct hclge_dev *hdev); + +void hclge_sysfs_init(struct hnae3_handle *handle); +void hclge_sysfs_uninit(struct hnae3_handle *handle); + +#endif diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index b40cf43ba8c0..7b6d7c157747 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -9946,6 +9946,11 @@ static int hclge_init_nic_client_instance(struct hnae3_ae_dev *ae_dev, if (ret) return ret;
+#ifdef CONFIG_HNS3_TEST + if (ae_dev->ops->ext_init) + ae_dev->ops->ext_init(&vport->nic); +#endif + set_bit(HCLGE_STATE_NIC_REGISTERED, &hdev->state); if (test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state) || rst_cnt != hdev->rst_stats.reset_cnt) { @@ -10087,6 +10092,13 @@ static void hclge_uninit_client_instance(struct hnae3_client *client, struct hclge_vport *vport; int i;
+#ifdef CONFIG_HNS3_TEST + if (ae_dev->ops->ext_uninit) { + vport = &hdev->vport[0]; + ae_dev->ops->ext_uninit(&vport->nic); + } +#endif + for (i = 0; i < hdev->num_vmdq_vport + 1; i++) { vport = &hdev->vport[i]; if (hdev->roce_client) { @@ -10817,6 +10829,11 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev) if (ret) return ret;
+#ifdef CONFIG_HNS3_TEST + if (ae_dev->ops->ext_reset_done) + ae_dev->ops->ext_reset_done(&hdev->vport->nic); +#endif + dev_info(&pdev->dev, "Reset done, %s driver initialization finished.\n", HCLGE_DRIVER_NAME);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 08d17ba61960..f803a9d895c0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -970,6 +970,11 @@ struct hclge_vport { struct list_head uc_mac_list; /* Store VF unicast table */ struct list_head mc_mac_list; /* Store VF multicast table */ struct list_head vlan_list; /* Store VF vlan table */ + +#ifdef CONFIG_HNS3_TEST + /* for sysfs */ + struct kobject kobj; +#endif };
int hclge_set_vport_promisc_mode(struct hclge_vport *vport, bool en_uc_pmc,
From: Yonglong Liu liuyonglong@huawei.com
driver inclusion category: feature bugzilla: NA CVE: NA
-----------------------------
This patch is used to update driver version to 1.9.38.10.
Signed-off-by: Yonglong Liu liuyonglong@huawei.com Reviewed-by: li yongxin liyongxin1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 7b3938054e5f..d92c64c74451 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -30,7 +30,7 @@ #include <linux/pci.h> #include <linux/types.h>
-#define HNAE3_MOD_VERSION "1.9.38.9" +#define HNAE3_MOD_VERSION "1.9.38.10"
#define HNAE3_MIN_VECTOR_NUM 2 /* first one for misc, another for IO */
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h b/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h index 782ed542b1a4..5563070bd4a2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h @@ -4,7 +4,7 @@ #ifndef __HNS3_CAE_VERSION_H__ #define __HNS3_CAE_VERSION_H__
-#define HNS3_CAE_MOD_VERSION "1.9.38.9" +#define HNS3_CAE_MOD_VERSION "1.9.38.10"
#define CMT_ID_LEN 8 #define RESV_LEN 3 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index c174bdbb98c4..c44e3ba782e5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -8,7 +8,7 @@
#include "hnae3.h"
-#define HNS3_MOD_VERSION "1.9.38.9" +#define HNS3_MOD_VERSION "1.9.38.10"
extern char hns3_driver_version[];
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index f803a9d895c0..6e89f8f78f0f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -12,7 +12,7 @@ #include "hclge_cmd.h" #include "hnae3.h"
-#define HCLGE_MOD_VERSION "1.9.38.9" +#define HCLGE_MOD_VERSION "1.9.38.10" #define HCLGE_DRIVER_NAME "hclge"
#define HCLGE_MAX_PF_NUM 8 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h index 4f75b36c7888..3b133a3b4905 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h @@ -10,7 +10,7 @@ #include "hclgevf_cmd.h" #include "hnae3.h"
-#define HCLGEVF_MOD_VERSION "1.9.38.9" +#define HCLGEVF_MOD_VERSION "1.9.38.10" #define HCLGEVF_DRIVER_NAME "hclgevf"
#define HCLGEVF_MAX_VLAN_ID 4095
From: Yonglong Liu liuyonglong@huawei.com
driver inclusion category: bugfix bugzilla: NA CVE: NA
-----------------------------
This patch fixes the following warning: drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c:117:3: warning: ‘ret’ may be used uninitialized in this function [-Wmaybe-uninitialized] netdev_err(netdev, "failed to create kobj, ret = %d\n", ret);
Fixes: f0ab3ab6c0fa ("net: hns3: adds support for setting pf max tx rate via sysfs") Signed-off-by: Yonglong Liu liuyonglong@huawei.com Reviewed-by: Zhong Zhaohui zhongzhaohui@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c index 86e60938013c..a307b592dc92 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_extension/hns3pf/hclge_sysfs.c @@ -114,7 +114,7 @@ void hclge_sysfs_init(struct hnae3_handle *handle)
handle->kobj = kobject_create_and_add("kunpeng", &netdev->dev.kobj); if (!handle->kobj) { - netdev_err(netdev, "failed to create kobj, ret = %d\n", ret); + netdev_err(netdev, "failed to create kobj!\n"); return; }
From: Yonglong Liu liuyonglong@huawei.com
driver inclusion category: bugfix bugzilla: NA CVE: NA
-----------------------------
This patch is used to update driver version to 1.9.38.11.
Signed-off-by: Yonglong Liu liuyonglong@huawei.com Reviewed-by: Zhong Zhaohui zhongzhaohui@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index d92c64c74451..e407343fd095 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -30,7 +30,7 @@ #include <linux/pci.h> #include <linux/types.h>
-#define HNAE3_MOD_VERSION "1.9.38.10" +#define HNAE3_MOD_VERSION "1.9.38.11"
#define HNAE3_MIN_VECTOR_NUM 2 /* first one for misc, another for IO */
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h b/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h index 5563070bd4a2..20892a86599a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_cae/hns3_cae_version.h @@ -4,7 +4,7 @@ #ifndef __HNS3_CAE_VERSION_H__ #define __HNS3_CAE_VERSION_H__
-#define HNS3_CAE_MOD_VERSION "1.9.38.10" +#define HNS3_CAE_MOD_VERSION "1.9.38.11"
#define CMT_ID_LEN 8 #define RESV_LEN 3 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index c44e3ba782e5..88c843d562d0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -8,7 +8,7 @@
#include "hnae3.h"
-#define HNS3_MOD_VERSION "1.9.38.10" +#define HNS3_MOD_VERSION "1.9.38.11"
extern char hns3_driver_version[];
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 6e89f8f78f0f..6b9ca5a5c48b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -12,7 +12,7 @@ #include "hclge_cmd.h" #include "hnae3.h"
-#define HCLGE_MOD_VERSION "1.9.38.10" +#define HCLGE_MOD_VERSION "1.9.38.11" #define HCLGE_DRIVER_NAME "hclge"
#define HCLGE_MAX_PF_NUM 8 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h index 3b133a3b4905..17cf19719b15 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h @@ -10,7 +10,7 @@ #include "hclgevf_cmd.h" #include "hnae3.h"
-#define HCLGEVF_MOD_VERSION "1.9.38.10" +#define HCLGEVF_MOD_VERSION "1.9.38.11" #define HCLGEVF_DRIVER_NAME "hclgevf"
#define HCLGEVF_MAX_VLAN_ID 4095
From: shiyongbang shiyongbang@huawei.com
driver inclusion category: bugfix bugzilla: NA CVE: NA
After the connector is initialized, invoke the drm_connector_register interface to fix the erratic display seen during startup when installing from the ISO file.
Signed-off-by: Shiyongbang shiyongbang@huawei.com Reviewed-by: Wu yang wuyang7@huawei.com Reviewed-by: Li dongming lidongming5@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_vdac.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_vdac.c b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_vdac.c index 879ffb8c2413..90319a9025d3 100644 --- a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_vdac.c +++ b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_vdac.c @@ -123,6 +123,7 @@ hibmc_connector_init(struct hibmc_drm_private *priv) } drm_connector_helper_add(connector, &hibmc_connector_helper_funcs); + drm_connector_register(connector);
return connector; }
From: shiyongbang shiyongbang@huawei.com
driver inclusion category: bugfix bugzilla: NA CVE: NA
The hibmc driver's probe path duplicates the implementation of drm_get_pci_dev. Replace the duplicated code with the generic drm_get_pci_dev() helper.
Signed-off-by: Shiyongbang shiyongbang@huawei.com Reviewed-by: Wu yang wuyang7@huawei.com Reviewed-by: Li dongming lidongming5@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c | 52 +++---------------- .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h | 2 + 2 files changed, 8 insertions(+), 46 deletions(-)
diff --git a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c index e08655aaf14a..b9a6643a6515 100644 --- a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c +++ b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c @@ -58,6 +58,8 @@ irqreturn_t hibmc_drm_interrupt(int irq, void *arg) static struct drm_driver hibmc_driver = { .driver_features = DRIVER_GEM | DRIVER_MODESET | DRIVER_ATOMIC | DRIVER_HAVE_IRQ, + .load = hibmc_load, + .unload = hibmc_unload, .fops = &hibmc_fops, .name = "hibmc", .date = "20160828", @@ -284,7 +286,7 @@ static int hibmc_hw_init(struct hibmc_drm_private *priv) return 0; }
-static int hibmc_unload(struct drm_device *dev) +void hibmc_unload(struct drm_device *dev) { struct hibmc_drm_private *priv = dev->dev_private;
@@ -301,10 +303,9 @@ static int hibmc_unload(struct drm_device *dev) hibmc_mm_fini(priv); hibmc_hw_unmap(priv); dev->dev_private = NULL; - return 0; }
-static int hibmc_load(struct drm_device *dev) +int hibmc_load(struct drm_device *dev, unsigned long flags) { struct hibmc_drm_private *priv; int ret; @@ -366,56 +367,15 @@ static int hibmc_load(struct drm_device *dev) static int hibmc_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { - struct drm_device *dev; - int ret; - - dev = drm_dev_alloc(&hibmc_driver, &pdev->dev); - if (IS_ERR(dev)) { - DRM_ERROR("failed to allocate drm_device\n"); - return PTR_ERR(dev); - } - - dev->pdev = pdev; - pci_set_drvdata(pdev, dev); - - ret = pci_enable_device(pdev); - if (ret) { - DRM_ERROR("failed to enable pci device: %d\n", ret); - goto err_free; - } - - ret = hibmc_load(dev); - if (ret) { - DRM_ERROR("failed to load hibmc: %d\n", ret); - goto err_disable; - } - - ret = drm_dev_register(dev, 0); - if (ret) { - DRM_ERROR("failed to register drv for userspace access: %d\n", - ret); - goto err_unload; - } - return 0; - -err_unload: - hibmc_unload(dev); -err_disable: - pci_disable_device(pdev); -err_free: - drm_dev_unref(dev); - - return ret; + return drm_get_pci_dev(pdev, ent, &hibmc_driver); }
static void hibmc_pci_remove(struct pci_dev *pdev) { struct drm_device *dev = pci_get_drvdata(pdev);
- drm_dev_unregister(dev); - hibmc_unload(dev); + drm_put_dev(dev); pci_disable_device(pdev); - drm_dev_unref(dev); }
static void hibmc_pci_shutdown(struct pci_dev *pdev) diff --git a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h index e195521eb41e..4395dc6674bb 100644 --- a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h +++ b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h @@ -85,6 +85,8 @@ void hibmc_set_power_mode(struct hibmc_drm_private *priv, unsigned int power_mode); void hibmc_set_current_gate(struct hibmc_drm_private *priv, unsigned int gate); +int hibmc_load(struct drm_device *dev, unsigned long flags); +void hibmc_unload(struct drm_device *dev);
int hibmc_de_init(struct hibmc_drm_private *priv); int hibmc_vdac_init(struct hibmc_drm_private *priv);
From: shiyongbang shiyongbang@huawei.com
driver inclusion category: bugfix bugzilla: NA CVE: NA
Add a remove_conflicting_framebuffers call to fix a hang when switching from the GUI to a text console during installation from the ISO file.
Signed-off-by: Shiyongbang shiyongbang@huawei.com Reviewed-by: Wu yang wuyang7@huawei.com Reviewed-by: Li dongming lidongming5@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c index b9a6643a6515..7be784a77efa 100644 --- a/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c +++ b/drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.c @@ -55,6 +55,22 @@ irqreturn_t hibmc_drm_interrupt(int irq, void *arg) return IRQ_HANDLED; }
+static void hibmc_remove_framebuffers(struct pci_dev *pdev) +{ + struct apertures_struct *ap; + + ap = alloc_apertures(1); + if (!ap) + return; + + ap->ranges[0].base = pci_resource_start(pdev, 0); + ap->ranges[0].size = pci_resource_len(pdev, 0); + + drm_fb_helper_remove_conflicting_framebuffers(ap, "hibmcdrmfb", false); + + kfree(ap); +} + static struct drm_driver hibmc_driver = { .driver_features = DRIVER_GEM | DRIVER_MODESET | DRIVER_ATOMIC | DRIVER_HAVE_IRQ, @@ -367,6 +383,8 @@ int hibmc_load(struct drm_device *dev, unsigned long flags) static int hibmc_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { + hibmc_remove_framebuffers(pdev); + return drm_get_pci_dev(pdev, ent, &hibmc_driver); }
From: Dong Kai dongkai11@huawei.com
hulk inclusion category: bugfix bugzilla: 50425 CVE: NA
---------------------------
jump_label_apply_nops() is used to replace the default nops with ideal nops. It must be called only once per module. The current logic is incorrect because it is called per klp_object, so if a klp patch contains multiple klp_objects to be patched, it is called more than once and leads to the following crash:
livepatch_scsi_test: loading out-of-tree module taints kernel. livepatch_scsi_test: tainting kernel with TAINT_LIVEPATCH jump_label: Fatal kernel bug, unexpected op at patch_exit+0x0/0x38 [livepatch_scsi_test] [(____ptrval____)] (66 66 66 66 90) 80 ------------[ cut here ]------------ kernel BUG at arch/x86/kernel/jump_label.c:37! invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 PID: 116 Comm: insmod Tainted: G OE K --------- - - 4.18.0+ #13 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 RIP: 0010:bug_at+0x1d/0x20 Code: 02 ba d1 04 00 00 ee c3 90 90 90 90 90 66 66 66 66 90 41 89 f0 48 89 f9 48 89 fa 48 89 fe 48 c7 c7 98 79 c6 9d e8 01 79 0f 00 <0f> 0b 90 66 66 66 66 90 48 83 ec 18 48 8b 3f 65 48 8b 04 25 28
RSP: 0018:ffffb074c0563bb0 EFLAGS: 00000282 RAX: 000000000000007f RBX: ffffffffc036f018 RCX: ffffffff9de5abc8 RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000246 RBP: ffffffffc036f018 R08: 0000000000000175 R09: 20363628205d295f R10: ffffa00743f8b580 R11: 3038202930392036 R12: ffffa00743f7fb40 R13: ffffffffc036f100 R14: ffffa00743f22160 R15: 0000000000000001 FS: 0000000001a088c0(0000) GS:ffffa00747800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000001a0f000 CR3: 0000000003fb6000 CR4: 00000000000006f0 Call Trace: __jump_label_transform.isra.0+0x50/0x130 jump_label_apply_nops+0x64/0x70 klp_register_patch+0x55b/0x6f0 ? _cond_resched+0x15/0x40 ? kmem_cache_alloc_trace+0x1a6/0x1c0 ? patch_free_scaffold+0xa7/0xbf [livepatch_scsi_test] patch_init+0x43e/0x1000 [livepatch_scsi_test] ? 0xffffffffc037f000 do_one_initcall+0x46/0x1c8 ? free_unref_page_commit+0x95/0x120 ? _cond_resched+0x15/0x40 ? kmem_cache_alloc_trace+0x3e/0x1c0 do_init_module+0x5b/0x1fc load_module+0x15b1/0x1e50 ? vmap_page_range_noflush+0x33f/0x4a0 ? __do_sys_init_module+0x167/0x1a0 __do_sys_init_module+0x167/0x1a0 do_syscall_64+0x5b/0x1b0 entry_SYSCALL_64_after_hwframe+0x65/0xca RIP: 0033:0x4d63a9
To solve this, move the jump_label_apply_nops() call into klp_init_patch().
Fixes: 61ddd2126320 ("livepatch/core: support jump_label") Signed-off-by: Dong Kai dongkai11@huawei.com Reviewed-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/livepatch/core.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 81c8b02ce3d1..c981d400fe55 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -1074,17 +1074,10 @@ static int klp_init_object_loaded(struct klp_patch *patch, }
arch_klp_init_object_loaded(patch, obj); - - set_mod_klp_rel_state(patch->mod, MODULE_KLP_REL_DONE); - jump_label_apply_nops(patch->mod); module_enable_ro(patch->mod, true);
mutex_unlock(&text_mutex);
- ret = jump_label_register(patch->mod); - if (ret) - return ret; - klp_for_each_func(obj, func) { ret = klp_find_object_symbol(obj->name, func->old_name, func->old_sympos, @@ -1222,6 +1215,19 @@ static int klp_init_patch(struct klp_patch *patch) goto free; }
+ set_mod_klp_rel_state(patch->mod, MODULE_KLP_REL_DONE); + mutex_lock(&text_mutex); + module_disable_ro(patch->mod); + jump_label_apply_nops(patch->mod); + ret = jump_label_register(patch->mod); + if (ret) { + module_enable_ro(patch->mod, true); + mutex_unlock(&text_mutex); + goto free; + } + module_enable_ro(patch->mod, true); + mutex_unlock(&text_mutex); + #ifdef CONFIG_LIVEPATCH_WO_FTRACE klp_for_each_object(patch, obj) klp_load_hook(obj);
From: Martin Willi martin@strongswan.org
stable inclusion from linux-4.19.158 commit 200c2f68407601b1bed3dba5e027f34cd284ec92
--------------------------------
[ Upstream commit 9e2b7fa2df4365e99934901da4fb4af52d81e820 ]
VRF devices use an optimized direct path on output if a default qdisc is involved, calling Netfilter hooks directly. This path, however, does not consider Netfilter rules completing asynchronously, such as with NFQUEUE. The Netfilter okfn() is called for asynchronously accepted packets, but the VRF never passes that packet down the stack to send it out over the slave device. Using the slower redirect path for this seems not feasible, as we do not know beforehand if a Netfilter hook has asynchronously completing rules.
Fix the use of asynchronously completing Netfilter rules in OUTPUT and POSTROUTING by using a special completion function that additionally calls dst_output() to pass the packet down the stack. Also, slightly adjust the use of nf_reset_ct() so that it is called in the asynchronous case, too.
Fixes: dcdd43c41e60 ("net: vrf: performance improvements for IPv4") Fixes: a9ec54d1b0cd ("net: vrf: performance improvements for IPv6") Signed-off-by: Martin Willi martin@strongswan.org Link: https://lore.kernel.org/r/20201106073030.3974927-1-martin@strongswan.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/vrf.c | 92 +++++++++++++++++++++++++++++++++++------------ 1 file changed, 69 insertions(+), 23 deletions(-)
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 7e677182691b..e745e2c09578 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -336,8 +336,7 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev) return ret; }
-static int vrf_finish_direct(struct net *net, struct sock *sk, - struct sk_buff *skb) +static void vrf_finish_direct(struct sk_buff *skb) { struct net_device *vrf_dev = skb->dev;
@@ -356,7 +355,8 @@ static int vrf_finish_direct(struct net *net, struct sock *sk, skb_pull(skb, ETH_HLEN); }
- return 1; + /* reset skb device */ + nf_reset(skb); }
#if IS_ENABLED(CONFIG_IPV6) @@ -435,15 +435,41 @@ static struct sk_buff *vrf_ip6_out_redirect(struct net_device *vrf_dev, return skb; }
+static int vrf_output6_direct_finish(struct net *net, struct sock *sk, + struct sk_buff *skb) +{ + vrf_finish_direct(skb); + + return vrf_ip6_local_out(net, sk, skb); +} + static int vrf_output6_direct(struct net *net, struct sock *sk, struct sk_buff *skb) { + int err = 1; + skb->protocol = htons(ETH_P_IPV6);
- return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING, - net, sk, skb, NULL, skb->dev, - vrf_finish_direct, - !(IPCB(skb)->flags & IPSKB_REROUTED)); + if (!(IPCB(skb)->flags & IPSKB_REROUTED)) + err = nf_hook(NFPROTO_IPV6, NF_INET_POST_ROUTING, net, sk, skb, + NULL, skb->dev, vrf_output6_direct_finish); + + if (likely(err == 1)) + vrf_finish_direct(skb); + + return err; +} + +static int vrf_ip6_out_direct_finish(struct net *net, struct sock *sk, + struct sk_buff *skb) +{ + int err; + + err = vrf_output6_direct(net, sk, skb); + if (likely(err == 1)) + err = vrf_ip6_local_out(net, sk, skb); + + return err; }
static struct sk_buff *vrf_ip6_out_direct(struct net_device *vrf_dev, @@ -456,18 +482,15 @@ static struct sk_buff *vrf_ip6_out_direct(struct net_device *vrf_dev, skb->dev = vrf_dev;
err = nf_hook(NFPROTO_IPV6, NF_INET_LOCAL_OUT, net, sk, - skb, NULL, vrf_dev, vrf_output6_direct); + skb, NULL, vrf_dev, vrf_ip6_out_direct_finish);
if (likely(err == 1)) err = vrf_output6_direct(net, sk, skb);
- /* reset skb device */ if (likely(err == 1)) - nf_reset(skb); - else - skb = NULL; + return skb;
- return skb; + return NULL; }
static struct sk_buff *vrf_ip6_out(struct net_device *vrf_dev, @@ -649,15 +672,41 @@ static struct sk_buff *vrf_ip_out_redirect(struct net_device *vrf_dev, return skb; }
+static int vrf_output_direct_finish(struct net *net, struct sock *sk, + struct sk_buff *skb) +{ + vrf_finish_direct(skb); + + return vrf_ip_local_out(net, sk, skb); +} + static int vrf_output_direct(struct net *net, struct sock *sk, struct sk_buff *skb) { + int err = 1; + skb->protocol = htons(ETH_P_IP);
- return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, - net, sk, skb, NULL, skb->dev, - vrf_finish_direct, - !(IPCB(skb)->flags & IPSKB_REROUTED)); + if (!(IPCB(skb)->flags & IPSKB_REROUTED)) + err = nf_hook(NFPROTO_IPV4, NF_INET_POST_ROUTING, net, sk, skb, + NULL, skb->dev, vrf_output_direct_finish); + + if (likely(err == 1)) + vrf_finish_direct(skb); + + return err; +} + +static int vrf_ip_out_direct_finish(struct net *net, struct sock *sk, + struct sk_buff *skb) +{ + int err; + + err = vrf_output_direct(net, sk, skb); + if (likely(err == 1)) + err = vrf_ip_local_out(net, sk, skb); + + return err; }
static struct sk_buff *vrf_ip_out_direct(struct net_device *vrf_dev, @@ -670,18 +719,15 @@ static struct sk_buff *vrf_ip_out_direct(struct net_device *vrf_dev, skb->dev = vrf_dev;
err = nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, net, sk, - skb, NULL, vrf_dev, vrf_output_direct); + skb, NULL, vrf_dev, vrf_ip_out_direct_finish);
if (likely(err == 1)) err = vrf_output_direct(net, sk, skb);
- /* reset skb device */ if (likely(err == 1)) - nf_reset(skb); - else - skb = NULL; + return skb;
- return skb; + return NULL; }
static struct sk_buff *vrf_ip_out(struct net_device *vrf_dev,
From: Baptiste Lepers baptiste.lepers@gmail.com
stable inclusion from linux-4.19.170 commit 4669452b4c66268a184a51c543b525533013c14e
--------------------------------
[ Upstream commit fd2ddef043592e7de80af53f47fa46fd3573086e ]
reuse->socks[] is modified concurrently by reuseport_add_sock. To prevent reading values that have not been fully initialized, only read the array up until the last known safe index instead of incorrectly re-reading the last index of the array.
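A minimal userspace sketch of the idea behind the fix (all names and values below are illustrative, not the kernel code): wrap the scan on the snapshot taken before the loop instead of re-reading the live counter, so a slot that is still being initialized is never looked at.

#include <stdio.h>

int main(void)
{
	int socks = 4;                     /* snapshot of the socket count taken up front */
	int established[8] = {1, 1, 0, 1}; /* pretend only the first 'socks' slots are valid */
	int i = 3, j = 3;                  /* hypothetical start index from the hash */

	while (established[i]) {
		i++;
		if (i >= socks)            /* wrap on the snapshot, not on the live counter */
			i = 0;
		if (i == j)
			break;
	}
	printf("selected index %d\n", i);
	return 0;
}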
Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets") Signed-off-by: Baptiste Lepers baptiste.lepers@gmail.com Acked-by: Willem de Bruijn willemb@google.com Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/sock_reuseport.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 9c85ef2b7e1d..375a3bbe6485 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -299,7 +299,7 @@ struct sock *reuseport_select_sock(struct sock *sk, i = j = reciprocal_scale(hash, socks); while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) { i++; - if (i >= reuse->num_socks) + if (i >= socks) i = 0; if (i == j) goto out;
From: Willem de Bruijn willemb@google.com
stable inclusion from linux-4.19.170 commit d4ede0a453cb2658a72dfed7572415c5366cdf4d
--------------------------------
[ Upstream commit 9bd6b629c39e3fa9e14243a6d8820492be1a5b2e ]
esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for the esp trailer.
It accesses the page with kmap_atomic to handle highmem. But skb_page_frag_refill can return compound pages, of which kmap_atomic only maps the first underlying page.
skb_page_frag_refill does not return highmem, because the __GFP_HIGHMEM flag is not set. ESP uses it in the same manner as TCP. TCP also does not call kmap_atomic, but directly uses page_address(), in skb_copy_to_page_nocache. Do the same for ESP.
This issue has become easier to trigger with recent kmap local debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Willem de Bruijn willemb@google.com Acked-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/esp4.c | 7 +------ net/ipv6/esp6.c | 7 +------ 2 files changed, 2 insertions(+), 12 deletions(-)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 114f9def1ec5..0792a9e2a555 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -270,7 +270,6 @@ static int esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, struc int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *esp) { u8 *tail; - u8 *vaddr; int nfrags; int esph_offset; struct page *page; @@ -312,14 +311,10 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info * page = pfrag->page; get_page(page);
- vaddr = kmap_atomic(page); - - tail = vaddr + pfrag->offset; + tail = page_address(page) + pfrag->offset;
esp_output_fill_trailer(tail, esp->tfclen, esp->plen, esp->proto);
- kunmap_atomic(vaddr); - nfrags = skb_shinfo(skb)->nr_frags;
__skb_fill_page_desc(skb, nfrags, page, pfrag->offset, diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index a7d996148eed..25317d5ccf2c 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -237,7 +237,6 @@ static void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto) int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *esp) { u8 *tail; - u8 *vaddr; int nfrags; struct page *page; struct sk_buff *trailer; @@ -270,14 +269,10 @@ int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info page = pfrag->page; get_page(page);
- vaddr = kmap_atomic(page); - - tail = vaddr + pfrag->offset; + tail = page_address(page) + pfrag->offset;
esp_output_fill_trailer(tail, esp->tfclen, esp->plen, esp->proto);
- kunmap_atomic(vaddr); - nfrags = skb_shinfo(skb)->nr_frags;
__skb_fill_page_desc(skb, nfrags, page, pfrag->offset,
From: Jakub Kicinski kuba@kernel.org
stable inclusion from linux-4.19.170 commit 11e36dcef44e6c7ab269f79657ff3d2db9a82c15
--------------------------------
[ Upstream commit 47e4bb147a96f1c9b4e7691e7e994e53838bfff8 ]
We need to unregister the netdevice if config failed. .ndo_uninit takes care of most of the heavy lifting.
This was uncovered by recent commit c269a24ce057 ("net: make free_netdev() more lenient with unregistering devices"). Previously the partially-initialized device would be left in the system.
Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com Fixes: e2f1f072db8d ("sit: allow to configure 6rd tunnels via netlink") Acked-by: Nicolas Dichtel nicolas.dichtel@6wind.com Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv6/sit.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 98c108baf35e..bcf29201f87b 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1596,8 +1596,11 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev, }
#ifdef CONFIG_IPV6_SIT_6RD - if (ipip6_netlink_6rd_parms(data, &ip6rd)) + if (ipip6_netlink_6rd_parms(data, &ip6rd)) { err = ipip6_tunnel_update_6rd(nt, &ip6rd); + if (err < 0) + unregister_netdevice_queue(dev, NULL); + } #endif
return err;
From: Pablo Neira Ayuso pablo@netfilter.org
stable inclusion from linux-4.19.173 commit 95b1e6e1ffe99f4a69e2decf4b37ab63c3a06372
--------------------------------
commit 0c5b7a501e7400869ee905b4f7af3d6717802bcb upstream.
Otherwise, the newly created element shows no timeout when listing the ruleset. If the set definition does not specify a default timeout, then the set element only shows the expiration time, but not the timeout. This is a problem when restoring a stateful ruleset listing since it skips the timeout policy entirely.
Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nft_dynset.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c index eb7f9a5f2aeb..4e544044fc2d 100644 --- a/net/netfilter/nft_dynset.c +++ b/net/netfilter/nft_dynset.c @@ -213,8 +213,10 @@ static int nft_dynset_init(const struct nft_ctx *ctx, nft_set_ext_add_length(&priv->tmpl, NFT_SET_EXT_EXPR, priv->expr->ops->size); if (set->flags & NFT_SET_TIMEOUT) { - if (timeout || set->timeout) + if (timeout || set->timeout) { + nft_set_ext_add(&priv->tmpl, NFT_SET_EXT_TIMEOUT); nft_set_ext_add(&priv->tmpl, NFT_SET_EXT_EXPIRATION); + } }
priv->timeout = timeout;
From: Shmulik Ladkani shmulik@metanetworks.com
stable inclusion from linux-4.19.173 commit 6790d6f3cb46d6212e9cabfc82d8be0a34cb08bd
--------------------------------
[ Upstream commit 56ce7c25ae1525d83cf80a880cf506ead1914250 ]
When setting xfrm replay_window to values higher than 32, a rare page-fault occurs in xfrm_replay_advance_bmp:
BUG: unable to handle page fault for address: ffff8af350ad7920 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD ad001067 P4D ad001067 PUD 0 Oops: 0002 [#1] SMP PTI CPU: 3 PID: 30 Comm: ksoftirqd/3 Kdump: loaded Not tainted 5.4.52-050452-generic #202007160732 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:xfrm_replay_advance_bmp+0xbb/0x130 RSP: 0018:ffffa1304013ba40 EFLAGS: 00010206 RAX: 000000000000010d RBX: 0000000000000002 RCX: 00000000ffffff4b RDX: 0000000000000018 RSI: 00000000004c234c RDI: 00000000ffb3dbff RBP: ffffa1304013ba50 R08: ffff8af330ad7920 R09: 0000000007fffffa R10: 0000000000000800 R11: 0000000000000010 R12: ffff8af29d6258c0 R13: ffff8af28b95c700 R14: 0000000000000000 R15: ffff8af29d6258fc FS: 0000000000000000(0000) GS:ffff8af339ac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8af350ad7920 CR3: 0000000015ee4000 CR4: 00000000001406e0 Call Trace: xfrm_input+0x4e5/0xa10 xfrm4_rcv_encap+0xb5/0xe0 xfrm4_udp_encap_rcv+0x140/0x1c0
Analysis revealed offending code is when accessing:
replay_esn->bmp[nr] |= (1U << bitnr);
with 'nr' being 0x07fffffa.
This happened in an SMP system when reordering of packets was present; a packet arrived with a "too old" sequence number (outside the window, i.e. 'diff > replay_window'), and therefore the following calculation:
bitnr = replay_esn->replay_window - (diff - pos);
yields a negative result, but since bitnr is u32 we get a large unsigned quantity (in crash dump above: 0xffffff4b seen in ecx).
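A stand-alone sketch of that arithmetic (window and sequence values here are made up) showing how the u32 subtraction wraps to a huge bit index of the same shape as the 0xffffff4b seen in the dump:

#include <stdio.h>

int main(void)
{
	unsigned int replay_window = 128; /* hypothetical replay window */
	unsigned int diff = 309;          /* "too old": diff > replay_window */
	unsigned int pos = 0;

	/* arithmetically negative, but as a u32 it wraps to a huge value */
	unsigned int bitnr = replay_window - (diff - pos);

	printf("bitnr = %u (0x%x)\n", bitnr, bitnr);
	return 0;
}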
This was supposed to be protected by xfrm_input()'s former call to:
if (x->repl->check(x, skb, seq)) {
However, the state's spinlock x->lock is *released* after '->check()' is performed, and gets re-acquired before '->advance()' - which gives a chance for a different core to update the xfrm state, e.g. by advancing 'replay_esn->seq' when it encounters more packets - leading to a 'diff > replay_window' situation when original core continues to xfrm_replay_advance_bmp().
An attempt to fix this issue was suggested in commit bcf66bf54aab ("xfrm: Perform a replay check after return from async codepaths"), by calling 'x->repl->recheck()' after the lock is re-acquired, but the fix applied only to asynchronous crypto algorithms.
Augment the fix by *always* calling 'recheck()', irrespective of whether we're using async crypto.
Fixes: 0ebea8ef3559 ("[IPSEC]: Move state lock into x->type->input") Signed-off-by: Shmulik Ladkani shmulik.ladkani@gmail.com Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/xfrm/xfrm_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 0ee13d12782f..fcba8a139f61 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -420,7 +420,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) /* only the first xfrm gets the encap type */ encap_type = 0;
- if (async && x->repl->recheck(x, skb, seq)) { + if (x->repl->recheck(x, skb, seq)) { XFRM_INC_STATS(net, LINUX_MIB_XFRMINSTATESEQERROR); goto drop_unlock; }
From: Eyal Birger eyal.birger@gmail.com
stable inclusion from linux-4.19.173 commit 5292ab517275e62042e479aebec9755196643001
--------------------------------
[ Upstream commit 9f8550e4bd9d78a8436c2061ad2530215f875376 ]
The disable_xfrm flag signals that xfrm should not be performed during routing towards a device before reaching device xmit.
For xfrm interfaces this is usually desired as they perform the outbound policy lookup as part of their xmit using their if_id.
Before this change, enabling this flag on xfrm interfaces prevented them from xmitting, as xfrm_lookup_with_ifid() would not perform a policy lookup if the original dst had the DST_NOXFRM flag.
This optimization is incorrect when the lookup is done by the xfrm interface xmit logic.
Fix by performing policy lookup when invoked by xfrmi as if_id != 0.
Similarly it's unlikely for the 'no policy exists on net' check to yield any performance benefits when invoked from xfrmi.
Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Eyal Birger eyal.birger@gmail.com Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/xfrm/xfrm_policy.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 939f3adf075a..e9aea82f370d 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2101,8 +2101,8 @@ struct dst_entry *xfrm_lookup_with_ifid(struct net *net, xflo.flags = flags;
/* To accelerate a bit... */ - if ((dst_orig->flags & DST_NOXFRM) || - !net->xfrm.policy_count[XFRM_POLICY_OUT]) + if (!if_id && ((dst_orig->flags & DST_NOXFRM) || + !net->xfrm.policy_count[XFRM_POLICY_OUT])) goto nopol;
xdst = xfrm_bundle_lookup(net, fl, family, dir, &xflo, if_id);
From: Roi Dayan roid@nvidia.com
stable inclusion from linux-4.19.173 commit 8cbe9e3567ab4dea3e05e8cf791b5f6e6fdf8c9c
--------------------------------
[ Upstream commit 487c6ef81eb98d0a43cb08be91b1fcc9b4250626 ]
When we create the ft object we also initialize the rhltable in ft->fgs_hash, so in the error flow we need to destroy that rhltable before the kfree of ft.
Fixes: 693c6883bbc4 ("net/mlx5: Add hash table for flow groups in flow table") Signed-off-by: Roi Dayan roid@nvidia.com Reviewed-by: Maor Dickman maord@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index b16e0f45d28c..a38a0c86705a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -1004,6 +1004,7 @@ static struct mlx5_flow_table *__mlx5_create_flow_table(struct mlx5_flow_namespa destroy_ft: root->cmds->destroy_flow_table(root->dev, ft); free_ft: + rhltable_destroy(&ft->fgs_hash); kfree(ft); unlock_root: mutex_unlock(&root->chain_lock);
From: Pengcheng Yang yangpc@wangsu.com
stable inclusion from linux-4.19.173 commit dc7bc439a8d248686437f710a6eeb864a6f3d588
--------------------------------
commit 62d9f1a6945ba69c125e548e72a36d203b30596e upstream.
Upon receiving a cumulative ACK that changes the congestion state from Disorder to Open, the TLP timer is not set. If the sender is app-limited, it can only wait for the RTO timer to expire and retransmit.
The reason for this is that the TLP timer is set before the congestion state changes in tcp_ack(), so we delay the time point of calling tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer is set.
This commit has two additional benefits: 1) Make sure to reset the RTO according to RFC 6298 when receiving an ACK, to avoid spurious RTOs caused by the RTO timer expiring early. 2) Reduce the xmit timer reschedules to once per ACK when the RACK reorder timer is set.
Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed") Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangs... Signed-off-by: Pengcheng Yang yangpc@wangsu.com Acked-by: Neal Cardwell ncardwell@google.com Acked-by: Yuchung Cheng ycheng@google.com Cc: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.co... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/net/tcp.h | 2 +- net/ipv4/tcp_input.c | 10 ++++++---- net/ipv4/tcp_recovery.c | 5 +++-- 3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h index 241a2d3a77a0..444762485615 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1961,7 +1961,7 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb); void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced); extern s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, u32 reo_wnd); -extern void tcp_rack_mark_lost(struct sock *sk); +extern bool tcp_rack_mark_lost(struct sock *sk); extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq, u64 xmit_time); extern void tcp_rack_reo_timeout(struct sock *sk); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index b58c8fe44dab..c7c3df30d8bc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2754,7 +2754,8 @@ static void tcp_identify_packet_loss(struct sock *sk, int *ack_flag) } else if (tcp_is_rack(sk)) { u32 prior_retrans = tp->retrans_out;
- tcp_rack_mark_lost(sk); + if (tcp_rack_mark_lost(sk)) + *ack_flag &= ~FLAG_SET_XMIT_TIMER; if (prior_retrans > tp->retrans_out) *ack_flag |= FLAG_LOST_RETRANS; } @@ -3696,9 +3697,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (tp->tlp_high_seq) tcp_process_tlp_ack(sk, ack, flag); - /* If needed, reset TLP/RTO timer; RACK may later override this. */ - if (flag & FLAG_SET_XMIT_TIMER) - tcp_set_xmit_timer(sk);
if (tcp_ack_is_dubious(sk, flag)) { if (!(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP))) { @@ -3711,6 +3709,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) &rexmit); }
+ /* If needed, reset TLP/RTO timer when RACK doesn't set. */ + if (flag & FLAG_SET_XMIT_TIMER) + tcp_set_xmit_timer(sk); + if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) sk_dst_confirm(sk);
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c index c81aadff769b..0d96decba13d 100644 --- a/net/ipv4/tcp_recovery.c +++ b/net/ipv4/tcp_recovery.c @@ -109,13 +109,13 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout) } }
-void tcp_rack_mark_lost(struct sock *sk) +bool tcp_rack_mark_lost(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); u32 timeout;
if (!tp->rack.advanced) - return; + return false;
/* Reset the advanced flag to avoid unnecessary queue scanning */ tp->rack.advanced = 0; @@ -125,6 +125,7 @@ void tcp_rack_mark_lost(struct sock *sk) inet_csk_reset_xmit_timer(sk, ICSK_TIME_REO_TIMEOUT, timeout, inet_csk(sk)->icsk_rto); } + return !!timeout; }
/* Record the most recently (re)sent time among the (s)acked packets
From: Eric Dumazet edumazet@google.com
stable inclusion from linux-4.19.174 commit 69874c3152adac06d04360d40e1ea775788a2c12
--------------------------------
commit dd5e073381f2ada3630f36be42833c6e9c78b75e upstream
A syzbot report reminded us that very big ewma_log values were supported in the past, even if they made little sense.
tc qdisc replace dev xxx root est 1sec 131072sec ...
While fixing the bug, also add boundary checks for ewma_log, in line with range supported by iproute2.
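A stand-alone sketch of the bound added below (mirroring the range iproute2 accepts; the helper name is made up for illustration):

#include <stdio.h>

/* hypothetical stand-alone version of the new sanity check */
static int ewma_log_valid(unsigned int ewma_log)
{
	return ewma_log != 0 && ewma_log < 31;
}

int main(void)
{
	unsigned int vals[] = {0, 1, 10, 30, 31, 40};

	for (unsigned int i = 0; i < sizeof(vals) / sizeof(vals[0]); i++)
		printf("ewma_log=%u -> %s\n", vals[i],
		       ewma_log_valid(vals[i]) ? "ok" : "-EINVAL");
	return 0;
}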
UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38 shift exponent -1 is negative CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83 call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417 expire_timers kernel/time/timer.c:1462 [inline] __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731 __run_timers kernel/time/timer.c:1712 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744 __do_softirq+0x2bc/0xa77 kernel/softirq.c:343 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline] do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:226 [inline] __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420 irq_exit_rcu+0x5/0x20 kernel/softirq.c:432 sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628 RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline] RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline] RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline] RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516
Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet edumazet@google.com Reported-by: syzbot syzkaller@googlegroups.com Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org [sudip: adjust context] Signed-off-by: Sudip Mukherjee sudipm.mukherjee@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/core/gen_estimator.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c index e4e442d70c2d..752744db11ff 100644 --- a/net/core/gen_estimator.c +++ b/net/core/gen_estimator.c @@ -84,11 +84,11 @@ static void est_timer(struct timer_list *t) u64 rate, brate;
est_fetch_counters(est, &b); - brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - est->intvl_log); - brate -= (est->avbps >> est->ewma_log); + brate = (b.bytes - est->last_bytes) << (10 - est->intvl_log); + brate = (brate >> est->ewma_log) - (est->avbps >> est->ewma_log);
- rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - est->intvl_log); - rate -= (est->avpps >> est->ewma_log); + rate = (u64)(b.packets - est->last_packets) << (10 - est->intvl_log); + rate = (rate >> est->ewma_log) - (est->avpps >> est->ewma_log);
write_seqcount_begin(&est->seq); est->avbps += brate; @@ -147,6 +147,9 @@ int gen_new_estimator(struct gnet_stats_basic_packed *bstats, if (parm->interval < -2 || parm->interval > 3) return -EINVAL;
+ if (parm->ewma_log == 0 || parm->ewma_log >= 31) + return -EINVAL; + est = kzalloc(sizeof(*est), GFP_KERNEL); if (!est) return -ENOBUFS;
From: Vadim Fedorenko vfedorenko@novek.ru
stable inclusion from linux-4.19.175 commit 9c7b01fc3e22c49466c1fd0a9024bfda6b76440e
--------------------------------
commit 28e104d00281ade30250b24e098bf50887671ea4 upstream.
dev->hard_header_len for a tunnel interface is set only when header_ops are set too, and it already contains the full overhead of any tunnel encapsulation. That's why there is no need to count this overhead twice in the MTU calculation.
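A small arithmetic sketch of why subtracting both dev->hard_header_len and t_hlen counts the tunnel overhead twice (the header sizes below are hypothetical):

#include <stdio.h>

int main(void)
{
	int link_mtu = 1500;
	int t_hlen = 20 + 4;          /* hypothetical iphdr + tunnel header */
	int hard_header_len = t_hlen; /* for tunnels, already includes that overhead */

	printf("old calc: %d (overhead subtracted twice)\n",
	       link_mtu - (hard_header_len + t_hlen));
	printf("new calc: %d (overhead subtracted once)\n",
	       link_mtu - t_hlen);
	return 0;
}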
Fixes: fdafed459998 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly") Reported-by: Slava Bacherikov mail@slava.cc Signed-off-by: Vadim Fedorenko vfedorenko@novek.ru Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/ipv4/ip_tunnel.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 1cad731039c3..bdd073ea300a 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -330,7 +330,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev) }
dev->needed_headroom = t_hlen + hlen; - mtu -= (dev->hard_header_len + t_hlen); + mtu -= t_hlen;
if (mtu < IPV4_MIN_MTU) mtu = IPV4_MIN_MTU; @@ -360,7 +360,7 @@ static struct ip_tunnel *ip_tunnel_create(struct net *net, nt = netdev_priv(dev); t_hlen = nt->hlen + sizeof(struct iphdr); dev->min_mtu = ETH_MIN_MTU; - dev->max_mtu = IP_MAX_MTU - dev->hard_header_len - t_hlen; + dev->max_mtu = IP_MAX_MTU - t_hlen; ip_tunnel_add(itn, nt); return nt;
@@ -502,12 +502,11 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb, const struct iphdr *inner_iph) { struct ip_tunnel *tunnel = netdev_priv(dev); - int pkt_size = skb->len - tunnel->hlen - dev->hard_header_len; + int pkt_size = skb->len - tunnel->hlen; int mtu;
if (df) - mtu = dst_mtu(&rt->dst) - dev->hard_header_len - - sizeof(struct iphdr) - tunnel->hlen; + mtu = dst_mtu(&rt->dst) - (sizeof(struct iphdr) + tunnel->hlen); else mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
@@ -935,7 +934,7 @@ int __ip_tunnel_change_mtu(struct net_device *dev, int new_mtu, bool strict) { struct ip_tunnel *tunnel = netdev_priv(dev); int t_hlen = tunnel->hlen + sizeof(struct iphdr); - int max_mtu = IP_MAX_MTU - dev->hard_header_len - t_hlen; + int max_mtu = IP_MAX_MTU - t_hlen;
if (new_mtu < ETH_MIN_MTU) return -EINVAL; @@ -1112,10 +1111,9 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
mtu = ip_tunnel_bind_dev(dev); if (tb[IFLA_MTU]) { - unsigned int max = IP_MAX_MTU - dev->hard_header_len - nt->hlen; + unsigned int max = IP_MAX_MTU - (nt->hlen + sizeof(struct iphdr));
- mtu = clamp(dev->mtu, (unsigned int)ETH_MIN_MTU, - (unsigned int)(max - sizeof(struct iphdr))); + mtu = clamp(dev->mtu, (unsigned int)ETH_MIN_MTU, max); }
err = dev_set_mtu(dev, mtu);
From: Cong Wang cong.wang@bytedance.com
stable inclusion from linux-4.19.176 commit fac13c1dd417f7d117c28d4929175c3b05820234
--------------------------------
[ Upstream commit afbc293add6466f8f3f0c3d944d85f53709c170f ]
xfrm_probe_algs() probes kernel crypto modules and changes the availability of struct xfrm_algo_desc. But there is a small window where ealg->available and aalg->available get changed between count_ah_combs()/count_esp_combs() and dump_ah_combs()/dump_esp_combs(); in that case we may allocate a smaller skb but later put a larger amount of data into it and trigger the panic in skb_put().
Fix this by relaxing the checks when counting the size, that is, skipping the test of ->available. We may waste a few sizeof(struct sadb_comb) worth of memory, but that is still much better than a panic.
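A toy sketch of the principle: size the reply from an upper bound that ignores the volatile 'available' flag, so the later dump can never exceed the allocation (structures and sizes below are illustrative only).

#include <assert.h>
#include <stdio.h>

struct alg { int supported; volatile int available; };

static int count_combs(const struct alg *algs, int n)
{
	int sz = 0;

	for (int i = 0; i < n; i++)
		if (algs[i].supported)      /* ignore ->available: upper bound */
			sz += 8;            /* pretend sizeof(struct sadb_comb) */
	return sz;
}

static int dump_combs(const struct alg *algs, int n)
{
	int sz = 0;

	for (int i = 0; i < n; i++)
		if (algs[i].supported && algs[i].available)
			sz += 8;
	return sz;
}

int main(void)
{
	struct alg algs[] = { {1, 1}, {1, 0}, {0, 1}, {1, 1} };
	int budget = count_combs(algs, 4);

	/* even if ->available flips between count and dump, dump <= budget */
	assert(dump_combs(algs, 4) <= budget);
	printf("budget=%d dumped=%d\n", budget, dump_combs(algs, 4));
	return 0;
}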
Reported-by: syzbot+b2bf2652983d23734c5c@syzkaller.appspotmail.com Cc: Steffen Klassert steffen.klassert@secunet.com Cc: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Cong Wang cong.wang@bytedance.com Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/key/af_key.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/key/af_key.c b/net/key/af_key.c index e340e97224c3..c7d5a6015389 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -2908,7 +2908,7 @@ static int count_ah_combs(const struct xfrm_tmpl *t) break; if (!aalg->pfkey_supported) continue; - if (aalg_tmpl_set(t, aalg) && aalg->available) + if (aalg_tmpl_set(t, aalg)) sz += sizeof(struct sadb_comb); } return sz + sizeof(struct sadb_prop); @@ -2926,7 +2926,7 @@ static int count_esp_combs(const struct xfrm_tmpl *t) if (!ealg->pfkey_supported) continue;
- if (!(ealg_tmpl_set(t, ealg) && ealg->available)) + if (!(ealg_tmpl_set(t, ealg))) continue;
for (k = 1; ; k++) { @@ -2937,7 +2937,7 @@ static int count_esp_combs(const struct xfrm_tmpl *t) if (!aalg->pfkey_supported) continue;
- if (aalg_tmpl_set(t, aalg) && aalg->available) + if (aalg_tmpl_set(t, aalg)) sz += sizeof(struct sadb_comb); } }
From: Jozsef Kadlecsik kadlec@mail.kfki.hu
stable inclusion from linux-4.19.177 commit 28198be7e5f50d66a60ccf27fe518f55813e1b32
--------------------------------
[ Upstream commit b1bdde33b72366da20d10770ab7a49fe87b5e190 ]
When both the --reap and --update flags are specified, there is a code path on which the entry to be updated is reaped beforehand, which then leads to a kernel crash. Reap only entries which won't be updated.
Fixes kernel bugzilla #207773.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=207773 Reported-by: Reindl Harald h.reindl@thelounge.net Fixes: 0079c5aee348 ("netfilter: xt_recent: add an entry reaper") Signed-off-by: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/xt_recent.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c index 570144507df1..cb58bc7ae30d 100644 --- a/net/netfilter/xt_recent.c +++ b/net/netfilter/xt_recent.c @@ -155,7 +155,8 @@ static void recent_entry_remove(struct recent_table *t, struct recent_entry *e) /* * Drop entries with timestamps older then 'time'. */ -static void recent_entry_reap(struct recent_table *t, unsigned long time) +static void recent_entry_reap(struct recent_table *t, unsigned long time, + struct recent_entry *working, bool update) { struct recent_entry *e;
@@ -164,6 +165,12 @@ static void recent_entry_reap(struct recent_table *t, unsigned long time) */ e = list_entry(t->lru_list.next, struct recent_entry, lru_list);
+ /* + * Do not reap the entry which are going to be updated. + */ + if (e == working && update) + return; + /* * The last time stamp is the most recent. */ @@ -306,7 +313,8 @@ recent_mt(const struct sk_buff *skb, struct xt_action_param *par)
/* info->seconds must be non-zero */ if (info->check_set & XT_RECENT_REAP) - recent_entry_reap(t, time); + recent_entry_reap(t, time, e, + info->check_set & XT_RECENT_UPDATE && ret); }
if (info->check_set & XT_RECENT_SET ||
From: Sven Auhagen sven.auhagen@voleatech.de
stable inclusion from linux-4.19.177 commit 99109999f7c63ef833deaa8ebaa8e5a7bc6de15c
--------------------------------
[ Upstream commit 8d6bca156e47d68551750a384b3ff49384c67be3 ]
When updating the TCP or UDP header checksum on port NAT, the function inet_proto_csum_replace2() is called with the last parameter, pseudohdr, set to true. This leads to an error when GRO is used and the packets are later split up again by GSO: the TCP or UDP checksum of all resulting packets is incorrect.
The error is probably masked by the fact that most network drivers implement TCP/UDP checksum offloading. It also only happens when GRO is applied, not on single packets.
The error is most visible when using a PPPoE connection, which does not trigger the TCP/UDP checksum offload.
Fixes: ac2a66665e23 ("netfilter: add generic flow table infrastructure") Signed-off-by: Sven Auhagen sven.auhagen@voleatech.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nf_flow_table_core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c index 890799c16aa4..b3957fe7eced 100644 --- a/net/netfilter/nf_flow_table_core.c +++ b/net/netfilter/nf_flow_table_core.c @@ -360,7 +360,7 @@ static int nf_flow_nat_port_tcp(struct sk_buff *skb, unsigned int thoff, return -1;
tcph = (void *)(skb_network_header(skb) + thoff); - inet_proto_csum_replace2(&tcph->check, skb, port, new_port, true); + inet_proto_csum_replace2(&tcph->check, skb, port, new_port, false);
return 0; } @@ -377,7 +377,7 @@ static int nf_flow_nat_port_udp(struct sk_buff *skb, unsigned int thoff, udph = (void *)(skb_network_header(skb) + thoff); if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) { inet_proto_csum_replace2(&udph->check, skb, port, - new_port, true); + new_port, false); if (!udph->check) udph->check = CSUM_MANGLED_0; }
From: Florian Westphal fw@strlen.de
stable inclusion from linux-4.19.177 commit 9f7c3c838aed35c1aa54e1eaf3430ed018e578d5
--------------------------------
[ Upstream commit 07998281c268592963e1cd623fe6ab0270b65ae4 ]
The origin skip check needs to re-test the zone. Otherwise, we might skip a colliding tuple in the reply direction.
This only occurs when using 'directional zones' where origin tuples reside in different zones but the reply tuples share the same zone.
This causes the new conntrack entry to be dropped at confirmation time because NAT clash resolution was elided.
Fixes: 4e35c1cb9460240 ("netfilter: nf_nat: skip nat clash resolution for same-origin entries") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/netfilter/nf_conntrack_core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index c6073d17c324..4edf6c70a8fd 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1063,7 +1063,8 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, * Let nf_ct_resolve_clash() deal with this later. */ if (nf_ct_tuple_equal(&ignored_conntrack->tuplehash[IP_CT_DIR_ORIGINAL].tuple, - &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple)) + &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple) && + nf_ct_zone_equal(ct, zone, IP_CT_DIR_ORIGINAL)) continue;
NF_CT_STAT_INC_ATOMIC(net, found);
From: NeilBrown neilb@suse.de
stable inclusion from linux-4.19.177 commit 7436ceb2b78f334e292d72bb3d2ca4d5deb367df
--------------------------------
commit af8085f3a4712c57d0dd415ad543bac85780375c upstream.
The sctp transport seq_file iterators take a reference to the transport in the ->start and ->next functions and release the reference in the ->show function. The preferred handling for such resources is to release them in the subsequent ->next or ->stop function call.
Since commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") there is no guarantee that ->show will be called after ->next, so this function can now leak references.
So move the sctp_transport_put() call to ->next and ->stop.
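A toy userspace analogue of the rule the patch restores: a reference taken by ->start/->next is dropped by the following ->next/->stop, never by ->show (all names here are illustrative, not the seq_file API).

#include <stdio.h>

struct obj { int refs; int val; };

static struct obj *iter_next(struct obj *tbl, int n, int *pos)
{
	if (*pos >= n)
		return NULL;
	tbl[*pos].refs++;                          /* reference handed to the caller */
	return &tbl[(*pos)++];
}

static void iter_put(struct obj *o)
{
	if (o)
		o->refs--;
}

int main(void)
{
	struct obj tbl[3] = { {0, 1}, {0, 2}, {0, 3} };
	int pos = 0;
	struct obj *cur = iter_next(tbl, 3, &pos);  /* ->start */

	while (cur) {
		struct obj *next;

		printf("show %d\n", cur->val);      /* ->show: no put here */
		next = iter_next(tbl, 3, &pos);     /* ->next takes the next ref... */
		iter_put(cur);                      /* ...and drops the previous one */
		cur = next;
	}
	iter_put(cur);                              /* ->stop: drop any leftover ref */

	for (int i = 0; i < 3; i++)
		printf("refs[%d]=%d\n", i, tbl[i].refs); /* all back to 0 */
	return 0;
}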
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Reported-by: Xin Long lucien.xin@gmail.com Signed-off-by: NeilBrown neilb@suse.de Acked-by: Marcelo Ricardo Leitner marcelo.leitner@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sctp/proc.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/net/sctp/proc.c b/net/sctp/proc.c index a644292f9faf..84f79ac4b984 100644 --- a/net/sctp/proc.c +++ b/net/sctp/proc.c @@ -230,6 +230,12 @@ static void sctp_transport_seq_stop(struct seq_file *seq, void *v) { struct sctp_ht_iter *iter = seq->private;
+ if (v && v != SEQ_START_TOKEN) { + struct sctp_transport *transport = v; + + sctp_transport_put(transport); + } + sctp_transport_walk_stop(&iter->hti); }
@@ -237,6 +243,12 @@ static void *sctp_transport_seq_next(struct seq_file *seq, void *v, loff_t *pos) { struct sctp_ht_iter *iter = seq->private;
+ if (v && v != SEQ_START_TOKEN) { + struct sctp_transport *transport = v; + + sctp_transport_put(transport); + } + ++*pos;
return sctp_transport_get_next(seq_file_net(seq), &iter->hti); @@ -292,8 +304,6 @@ static int sctp_assocs_seq_show(struct seq_file *seq, void *v) sk->sk_rcvbuf); seq_printf(seq, "\n");
- sctp_transport_put(transport); - return 0; }
@@ -369,8 +379,6 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void *v) seq_printf(seq, "\n"); }
- sctp_transport_put(transport); - return 0; }
From: Sabyrzhan Tasbolatov snovitoll@gmail.com
stable inclusion from linux-4.19.177 commit 335bacb523e6207dac471e68488c7804923f0514
--------------------------------
commit a11148e6fcce2ae53f47f0a442d098d860b4f7db upstream.
syzbot found a WARNING in rds_rdma_extra_size() [1] when an RDS_CMSG_RDMA_ARGS control message is passed with a user-controlled args->nr_local of 0x40001, causing an order >= MAX_ORDER condition.
The value can be checked against UIO_MAXIOV, which is 0x400. So for kcalloc(), 0x400 iovecs of sizeof(struct rds_iovec) = 0x10 each is the closest limit, with 0x10 left over.
Same condition is currently done in rds_cmsg_rdma_args().
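A quick arithmetic sketch of the sizes involved (constants as cited in the message; a 64-bit sizeof(struct rds_iovec) is assumed):

#include <stdio.h>

int main(void)
{
	unsigned long nr_local   = 0x40001; /* value from the syzbot report */
	unsigned long iovec_size = 0x10;    /* sizeof(struct rds_iovec) on 64-bit */
	unsigned long uio_maxiov = 0x400;   /* UIO_MAXIOV */

	printf("reported request: %lu bytes\n", nr_local * iovec_size);   /* ~4 MiB */
	printf("new upper bound:  %lu bytes\n", uio_maxiov * iovec_size); /* 16 KiB */
	return 0;
}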
[1] WARNING: mm/page_alloc.c:5011 [..] Call Trace: alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267 alloc_pages include/linux/gfp.h:547 [inline] kmalloc_order+0x2e/0xb0 mm/slab_common.c:837 kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853 kmalloc_array include/linux/slab.h:592 [inline] kcalloc include/linux/slab.h:621 [inline] rds_rdma_extra_size+0xb2/0x3b0 net/rds/rdma.c:568 rds_rm_size net/rds/send.c:928 [inline]
Reported-by: syzbot+1bd2b07f93745fa38425@syzkaller.appspotmail.com Signed-off-by: Sabyrzhan Tasbolatov snovitoll@gmail.com Acked-by: Santosh Shilimkar santosh.shilimkar@oracle.com Link: https://lore.kernel.org/r/20210201203233.1324704-1-snovitoll@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Aichun Li liaichun@huawei.com reviewed-by: wangxiaopeng wangxiaopeng7@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/rds/rdma.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/rds/rdma.c b/net/rds/rdma.c index e1965d9cbcf8..9882cebfcad6 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -531,6 +531,9 @@ int rds_rdma_extra_size(struct rds_rdma_args *args, if (args->nr_local == 0) return -EINVAL;
+ if (args->nr_local > UIO_MAXIOV) + return -EMSGSIZE; + iov->iov = kcalloc(args->nr_local, sizeof(struct rds_iovec), GFP_KERNEL);
From: Wang Hai wanghai38@huawei.com
stable inclusion from linux-4.19.172 commit 08b227d9f380aff229fe7eb3e43c0d90ac14ec38
--------------------------------
commit 757fed1d0898b893d7daa84183947c70f27632f3 upstream.
This reverts commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2.
syzbot reported a double-free bug. The following case can cause this bug.
- mm/slab_common.c: create_cache(): if the __kmem_cache_create() fails, it does:
out_free_cache: kmem_cache_free(kmem_cache, s);
- but __kmem_cache_create() - at least for slub() - will have done
sysfs_slab_add(s) -> sysfs_create_group() .. fails .. -> kobject_del(&s->kobj); .. which frees s ...
We can't remove the kmem_cache_free() in create_cache(), because other error cases of __kmem_cache_create() do not free this.
So, revert the commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") to fix this.
Reported-by: syzbot+d0bd96b4696c1ef67991@syzkaller.appspotmail.com Fixes: dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") Acked-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Wang Hai wanghai38@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/slub.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c index 3cfec27018ac..45b5353e0938 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -5833,10 +5833,8 @@ static int sysfs_slab_add(struct kmem_cache *s)
s->kobj.kset = kset; err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name); - if (err) { - kobject_put(&s->kobj); + if (err) goto out; - }
err = sysfs_create_group(&s->kobj, &slab_attr_group); if (err)
From: Gaurav Kohli gkohli@codeaurora.org
stable inclusion from linux-4.19.172 commit acfa7ad7b7f6489e2bed20880ce090fdabdbb841
--------------------------------
commit bbeb97464eefc65f506084fd9f18f21653e01137 upstream.
Below race can occur if trace_open and a resize of the cpu buffer are running in parallel on different cpus:

CPUX                                  CPUY
                                      ring_buffer_resize
                                      atomic_read(&buffer->resize_disabled)
tracing_open
tracing_reset_online_cpus
ring_buffer_reset_cpu
rb_reset_cpu
                                      rb_update_pages
                                      remove/insert pages
resetting pointer
This race can cause a data abort or sometimes an infinite loop in rb_remove_pages and rb_insert_pages while checking pages for sanity.
Take buffer lock to fix this.
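A toy pthread sketch of the serialization the patch adds: both the resize path and the reset path take the same buffer mutex, so pages cannot be removed or inserted while the other side is resetting pointers (everything here is a stand-in, not the ring buffer code).

#include <pthread.h>
#include <stdio.h>

#define MAX_PAGES 16

static pthread_mutex_t buffer_mutex = PTHREAD_MUTEX_INITIALIZER;
static int nr_pages = 8;
static int page[MAX_PAGES];

static void *resize_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&buffer_mutex);      /* ring_buffer_resize() side */
	nr_pages = 16;
	for (int i = 0; i < nr_pages; i++)
		page[i] = i;                    /* remove/insert pages */
	pthread_mutex_unlock(&buffer_mutex);
	return NULL;
}

static void *reset_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&buffer_mutex);      /* ring_buffer_reset_cpu() side */
	for (int i = 0; i < nr_pages; i++)
		page[i] = 0;                    /* reset pointers on a stable page list */
	pthread_mutex_unlock(&buffer_mutex);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, resize_thread, NULL);
	pthread_create(&b, NULL, reset_thread, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("nr_pages = %d\n", nr_pages);
	return 0;
}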
Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeauror...
Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Denis Efremov efremov@linux.com Signed-off-by: Gaurav Kohli gkohli@codeaurora.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/ring_buffer.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 87ce9736043d..360129e47540 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4393,6 +4393,8 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
if (!cpumask_test_cpu(cpu, buffer->cpumask)) return; + /* prevent another thread from changing buffer sizes */ + mutex_lock(&buffer->mutex);
atomic_inc(&buffer->resize_disabled); atomic_inc(&cpu_buffer->record_disabled); @@ -4416,6 +4418,8 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
atomic_dec(&cpu_buffer->record_disabled); atomic_dec(&buffer->resize_disabled); + + mutex_unlock(&buffer->mutex); } EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
From: Mikulas Patocka mpatocka@redhat.com
stable inclusion from linux-4.19.172 commit 26b30d36cb5cff5e075114d05e5309043514770c
--------------------------------
commit 5c02406428d5219c367c5f53457698c58bc5f917 upstream.
Otherwise a malicious user could (ab)use the "recalculate" feature that makes dm-integrity calculate the checksums in the background while the device is already usable. When the system restarts before all checksums have been calculated, the calculation continues where it was interrupted even if the recalculate feature is not requested the next time the dm device is set up.
Disable recalculating if we use internal_hash or journal_hash with a key (e.g. HMAC) and we don't have the "legacy_recalculate" flag.
This may break activation of a volume, created by an older kernel, that is not yet fully recalculated -- if this happens, the user should add the "legacy_recalculate" flag to constructor parameters.
Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka mpatocka@redhat.com Reported-by: Daniel Glockner dg@emlix.com Signed-off-by: Mike Snitzer snitzer@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- Documentation/device-mapper/dm-integrity.txt | 7 ++++++ drivers/md/dm-integrity.c | 24 +++++++++++++++++++- 2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/Documentation/device-mapper/dm-integrity.txt b/Documentation/device-mapper/dm-integrity.txt index 297251b0d2d5..bf6af2ade0a6 100644 --- a/Documentation/device-mapper/dm-integrity.txt +++ b/Documentation/device-mapper/dm-integrity.txt @@ -146,6 +146,13 @@ block_size:number Supported values are 512, 1024, 2048 and 4096 bytes. If not specified the default block size is 512 bytes.
+legacy_recalculate + Allow recalculating of volumes with HMAC keys. This is disabled by + default for security reasons - an attacker could modify the volume, + set recalc_sector to zero, and the kernel would not detect the + modification. + + The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can be changed when reloading the target (load an inactive table and swap the tables with suspend and resume). The other arguments should not be changed diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index a3acb59272cf..66016ce253ce 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -237,6 +237,7 @@ struct dm_integrity_c {
bool journal_uptodate; bool just_formatted; + bool legacy_recalculate;
struct alg_spec internal_hash_alg; struct alg_spec journal_crypt_alg; @@ -346,6 +347,14 @@ static int dm_integrity_failed(struct dm_integrity_c *ic) return READ_ONCE(ic->failed); }
+static bool dm_integrity_disable_recalculate(struct dm_integrity_c *ic) +{ + if ((ic->internal_hash_alg.key || ic->journal_mac_alg.key) && + !ic->legacy_recalculate) + return true; + return false; +} + static commit_id_t dm_integrity_commit_id(struct dm_integrity_c *ic, unsigned i, unsigned j, unsigned char seq) { @@ -2516,6 +2525,7 @@ static void dm_integrity_status(struct dm_target *ti, status_type_t type, arg_count += !!ic->internal_hash_alg.alg_string; arg_count += !!ic->journal_crypt_alg.alg_string; arg_count += !!ic->journal_mac_alg.alg_string; + arg_count += ic->legacy_recalculate; DMEMIT("%s %llu %u %c %u", ic->dev->name, (unsigned long long)ic->start, ic->tag_size, ic->mode, arg_count); if (ic->meta_dev) @@ -2529,6 +2539,8 @@ static void dm_integrity_status(struct dm_target *ti, status_type_t type, DMEMIT(" buffer_sectors:%u", 1U << ic->log2_buffer_sectors); DMEMIT(" journal_watermark:%u", (unsigned)watermark_percentage); DMEMIT(" commit_time:%u", ic->autocommit_msec); + if (ic->legacy_recalculate) + DMEMIT(" legacy_recalculate");
#define EMIT_ALG(a, n) \ do { \ @@ -3131,7 +3143,7 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) unsigned extra_args; struct dm_arg_set as; static const struct dm_arg _args[] = { - {0, 15, "Invalid number of feature args"}, + {0, 12, "Invalid number of feature args"}, }; unsigned journal_sectors, interleave_sectors, buffer_sectors, journal_watermark, sync_msec; bool recalculate; @@ -3261,6 +3273,8 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) goto bad; } else if (!strcmp(opt_string, "recalculate")) { recalculate = true; + } else if (!strcmp(opt_string, "legacy_recalculate")) { + ic->legacy_recalculate = true; } else { r = -EINVAL; ti->error = "Invalid argument"; @@ -3528,6 +3542,14 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv) } }
+ if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING) && + le64_to_cpu(ic->sb->recalc_sector) < ic->provided_data_sectors && + dm_integrity_disable_recalculate(ic)) { + ti->error = "Recalculating with HMAC is disabled for security reasons - if you really need it, use the argument "legacy_recalculate""; + r = -EOPNOTSUPP; + goto bad; + } + ic->bufio = dm_bufio_client_create(ic->meta_dev ? ic->meta_dev->bdev : ic->dev->bdev, 1U << (SECTOR_SHIFT + ic->log2_buffer_sectors), 1, 0, NULL, NULL); if (IS_ERR(ic->bufio)) {
From: Jan Kara jack@suse.cz
stable inclusion from linux-4.19.172 commit 9f9623fc9340af731c3f3a09e6e9af0756b38a46
--------------------------------
commit 5fcd57505c002efc5823a7355e21f48dd02d5a51 upstream.
The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that an inode got there because the flush worker decided it's time to write back the dirty inode timestamps (either because we are syncing or because of age). However, we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with the I_DIRTY_TIME_EXPIRE flag.
Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Eric Biggers ebiggers@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/inode.c | 2 +- fs/fs-writeback.c | 28 +++++++++++----------------- fs/xfs/xfs_trans_inode.c | 4 ++-- include/linux/fs.h | 1 - include/trace/events/writeback.h | 1 - 5 files changed, 14 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 7f960f8cb0a4..5a4baa20e99b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5223,7 +5223,7 @@ static int other_inode_match(struct inode * inode, unsigned long ino, (inode->i_state & I_DIRTY_TIME)) { struct ext4_inode_info *ei = EXT4_I(inode);
- inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED); + inode->i_state &= ~I_DIRTY_TIME; spin_unlock(&inode->i_lock);
spin_lock(&ei->i_raw_lock); diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index f39c83758d94..0c22597794cb 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1157,7 +1157,7 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t) */ static int move_expired_inodes(struct list_head *delaying_queue, struct list_head *dispatch_queue, - int flags, unsigned long dirtied_before) + unsigned long dirtied_before) { LIST_HEAD(tmp); struct list_head *pos, *node; @@ -1173,8 +1173,6 @@ static int move_expired_inodes(struct list_head *delaying_queue, list_move(&inode->i_io_list, &tmp); moved++; spin_lock(&inode->i_lock); - if (flags & EXPIRE_DIRTY_ATIME) - inode->i_state |= I_DIRTY_TIME_EXPIRED; inode->i_state |= I_SYNC_QUEUED; spin_unlock(&inode->i_lock); if (sb_is_blkdev_sb(inode->i_sb)) @@ -1222,11 +1220,11 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work,
assert_spin_locked(&wb->list_lock); list_splice_init(&wb->b_more_io, &wb->b_io); - moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, dirtied_before); + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, dirtied_before); if (!work->for_sync) time_expire_jif = jiffies - dirtytime_expire_interval * HZ; moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, - EXPIRE_DIRTY_ATIME, time_expire_jif); + time_expire_jif); if (moved) wb_io_lists_populated(wb); trace_writeback_queue_io(wb, work, dirtied_before, moved); @@ -1402,18 +1400,14 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY; - if (inode->i_state & I_DIRTY_TIME) { - if ((dirty & I_DIRTY_INODE) || - wbc->sync_mode == WB_SYNC_ALL || - unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || - unlikely(time_after(jiffies, - (inode->dirtied_time_when + - dirtytime_expire_interval * HZ)))) { - dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED; - trace_writeback_lazytime(inode); - } - } else - inode->i_state &= ~I_DIRTY_TIME_EXPIRED; + if ((inode->i_state & I_DIRTY_TIME) && + ((dirty & I_DIRTY_INODE) || + wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync || + time_after(jiffies, inode->dirtied_time_when + + dirtytime_expire_interval * HZ))) { + dirty |= I_DIRTY_TIME; + trace_writeback_lazytime(inode); + } inode->i_state &= ~dirty;
/* diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index ae453dd236a6..6fcdf7e449fe 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -99,9 +99,9 @@ xfs_trans_log_inode( * to log the timestamps, or will clear already cleared fields in the * worst case. */ - if (inode->i_state & (I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED)) { + if (inode->i_state & I_DIRTY_TIME) { spin_lock(&inode->i_lock); - inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED); + inode->i_state &= ~I_DIRTY_TIME; spin_unlock(&inode->i_lock); }
diff --git a/include/linux/fs.h b/include/linux/fs.h index 7e9a64f52c6e..787c8cd420a0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2118,7 +2118,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) #define I_DIO_WAKEUP (1 << __I_DIO_WAKEUP) #define I_LINKABLE (1 << 10) #define I_DIRTY_TIME (1 << 11) -#define I_DIRTY_TIME_EXPIRED (1 << 12) #define I_WB_SWITCH (1 << 13) #define I_OVL_INUSE (1 << 14) #define I_CREATING (1 << 15) diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index c0b7c41b73f1..4e6af6c80698 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -20,7 +20,6 @@ {I_CLEAR, "I_CLEAR"}, \ {I_SYNC, "I_SYNC"}, \ {I_DIRTY_TIME, "I_DIRTY_TIME"}, \ - {I_DIRTY_TIME_EXPIRED, "I_DIRTY_TIME_EXPIRED"}, \ {I_REFERENCED, "I_REFERENCED"} \ )
From: Eric Biggers ebiggers@google.com
stable inclusion from linux-4.19.172 commit b4da738055acf42e00535c52b60d9bef808a7464
--------------------------------
commit 1e249cb5b7fc09ff216aa5a12f6c302e434e88f9 upstream.
When lazytime is enabled and an inode is being written due to its in-memory updated timestamps having expired, either due to a sync() or syncfs() system call or due to dirtytime_expire_interval having elapsed, the VFS needs to inform the filesystem so that the filesystem can copy the inode's timestamps out to the on-disk data structures.
This is done by __writeback_single_inode() calling mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).
However, this occurs after __writeback_single_inode() has already cleared the dirty flags from ->i_state. This causes two bugs:
- mark_inode_dirty_sync() redirties the inode, causing it to remain dirty. This wastefully causes the inode to be written twice. But more importantly, it breaks cases where sync_filesystem() is expected to clean dirty inodes. This includes the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (as reported at https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well as possibly filesystem freezing (freeze_super()).
- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is called from __writeback_single_inode() for lazytime expiration, xfs_fs_dirty_inode() ignores the notification. (XFS only cares about lazytime expirations, and it assumes that i_state will contain I_DIRTY_TIME during those.) Therefore, lazy timestamps aren't persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.
Fix this by moving the call to mark_inode_dirty_sync() to earlier in __writeback_single_inode(), before the dirty flags are cleared from i_state. This makes filesystems be properly notified of the timestamp expiration, and it avoids incorrectly redirtying the inode.
This fixes xfstest generic/580 (which tests FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime enabled. It also fixes the new lazytime xfstest I've proposed, which reproduces the above-mentioned XFS bug (https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).
Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly. But due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the right thing to do because mark_inode_dirty_sync() now knows not to move the inode to a writeback list if it is currently queued for sync.
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") Cc: stable@vger.kernel.org Depends-on: 5afced3bf281 ("writeback: Avoid skipping inode writeback") Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org Suggested-by: Jan Kara jack@suse.cz Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Eric Biggers ebiggers@google.com Signed-off-by: Jan Kara jack@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/fs-writeback.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 0c22597794cb..a9c7522e367c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1393,21 +1393,25 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) }
/* - * Some filesystems may redirty the inode during the writeback - * due to delalloc, clear dirty metadata flags right before - * write_inode() + * If the inode has dirty timestamps and we need to write them, call + * mark_inode_dirty_sync() to notify the filesystem about it and to + * change I_DIRTY_TIME into I_DIRTY_SYNC. */ - spin_lock(&inode->i_lock); - - dirty = inode->i_state & I_DIRTY; if ((inode->i_state & I_DIRTY_TIME) && - ((dirty & I_DIRTY_INODE) || - wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync || + (wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync || time_after(jiffies, inode->dirtied_time_when + dirtytime_expire_interval * HZ))) { - dirty |= I_DIRTY_TIME; trace_writeback_lazytime(inode); + mark_inode_dirty_sync(inode); } + + /* + * Some filesystems may redirty the inode during the writeback + * due to delalloc, clear dirty metadata flags right before + * write_inode() + */ + spin_lock(&inode->i_lock); + dirty = inode->i_state & I_DIRTY; inode->i_state &= ~dirty;
/* @@ -1428,8 +1432,6 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
spin_unlock(&inode->i_lock);
- if (dirty & I_DIRTY_TIME) - mark_inode_dirty_sync(inode); /* Don't write the inode if only I_DIRTY_PAGES was set */ if (dirty & ~I_DIRTY_PAGES) { int err = write_inode(inode, wbc);
From: Christian Brauner christian@brauner.io
stable inclusion from linux-4.19.174 commit 5cbe06fe63af46c74ec80c39bacd79546d89ea9c
--------------------------------
commit 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe upstream.
proc_get_long() is a funny function. It uses simple_strtoul() and for a good reason. proc_get_long() wants to always succeed the parse and return the maybe incorrect value and the trailing characters to check against a pre-defined list of acceptable trailing values. However, simple_strtoul() explicitly ignores overflows which can cause funny things like the following to happen:
echo 18446744073709551616 > /proc/sys/fs/file-max cat /proc/sys/fs/file-max 0
(Which will cause your system to silently die behind your back.)
On the other hand kstrtoul() does do overflow detection but does not return the trailing characters, and also fails the parse when anything other than '\n' is a trailing character whereas proc_get_long() wants to be more lenient.
Now, before adding another kstrtoul() function let's simply add a static parser strtoul_lenient() which: - fails on overflow with -ERANGE - returns the trailing characters to the caller
The reason why we should fail on ERANGE is that we already do a partial fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So we already reject values such as 184467440737095516160 (21 chars) but accept values such as 18446744073709551616 (20 chars) but both are overflows. So we should just always reject 64bit overflows and not special-case this based on the number of chars.
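As a rough userspace sketch of the strtoul_lenient() semantics described above (the name, signature and behavior here are illustrative only, not the kernel implementation):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse leniently: accept trailing characters, but reject overflow. */
static int strtoul_lenient(const char *s, const char **tail, int base,
                           unsigned long *res)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(s, &end, base);
	if (errno == ERANGE)
		return -ERANGE;		/* reject overflow instead of silently wrapping */

	*res = val;
	*tail = end;			/* caller inspects the trailing characters */
	return 0;
}

int main(void)
{
	const char *tail = NULL;
	unsigned long val = 0;

	/* 2^64 overflows unsigned long on 64-bit, so the parse is rejected. */
	int ret = strtoul_lenient("18446744073709551616\n", &tail, 10, &val);

	printf("ret=%d val=%lu\n", ret, val);
	return 0;
}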
Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io Signed-off-by: Christian Brauner christian@brauner.io Acked-by: Kees Cook keescook@chromium.org Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: Joe Lawrence joe.lawrence@redhat.com Cc: Waiman Long longman@redhat.com Cc: Dominik Brodowski linux@dominikbrodowski.net Cc: Al Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Cc: Joerg Vehlow lkml@jv-coder.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/sysctl.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index f8a376720e87..bc5ce37c6622 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -70,6 +70,8 @@
#include "../lib/kstrtox.h"
+#include "../lib/kstrtox.h" + #include <linux/uaccess.h> #include <asm/processor.h>
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.174 commit fbad32181af9c8c76698b16115c28a493c26a2ee
--------------------------------
[ Upstream commit ac687e6e8c26181a33270efd1a2e2241377924b0 ]
There is a need to distinguish genuine per-cpu kthreads from kthreads that happen to have a single CPU affinity.
Genuine per-cpu kthreads are kthreads that are CPU affine for correctness; these will obviously have PF_KTHREAD set, but must also have PF_NO_SETAFFINITY set, lest userspace modify their affinity and ruin things.
However, these two things are not sufficient, PF_NO_SETAFFINITY is also set on other tasks that have their affinities controlled through other means, like for instance workqueues.
Therefore another bit is needed; it turns out kthread_create_per_cpu() already has such a bit: KTHREAD_IS_PER_CPU, which is used to make kthread_park()/kthread_unpark() work correctly.
Expose this flag and remove the implicit setting of it from kthread_create_on_cpu(); the io_uring usage of it seems dubious at best.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Valentin Schneider valentin.schneider@arm.com Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/kthread.h | 3 +++ kernel/kthread.c | 27 ++++++++++++++++++++++++++- kernel/smpboot.c | 1 + 3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 33e1f0052940..8613e4981a97 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -32,6 +32,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data), unsigned int cpu, const char *namefmt);
+void kthread_set_per_cpu(struct task_struct *k, int cpu); +bool kthread_is_per_cpu(struct task_struct *k); + /** * kthread_run - create and wake a thread. * @threadfn: the function to run until signal_pending(current). diff --git a/kernel/kthread.c b/kernel/kthread.c index ba1aacf0755a..d866cf1a3e20 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -456,11 +456,36 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data), return p; kthread_bind(p, cpu); /* CPU hotplug need to bind once again when unparking the thread. */ - set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags); to_kthread(p)->cpu = cpu; return p; }
+void kthread_set_per_cpu(struct task_struct *k, int cpu) +{ + struct kthread *kthread = to_kthread(k); + if (!kthread) + return; + + WARN_ON_ONCE(!(k->flags & PF_NO_SETAFFINITY)); + + if (cpu < 0) { + clear_bit(KTHREAD_IS_PER_CPU, &kthread->flags); + return; + } + + kthread->cpu = cpu; + set_bit(KTHREAD_IS_PER_CPU, &kthread->flags); +} + +bool kthread_is_per_cpu(struct task_struct *k) +{ + struct kthread *kthread = to_kthread(k); + if (!kthread) + return false; + + return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags); +} + /** * kthread_unpark - unpark a thread created by kthread_create(). * @k: thread created by kthread_create(). diff --git a/kernel/smpboot.c b/kernel/smpboot.c index c230c2dd48e1..84c16654d859 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -187,6 +187,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu) kfree(td); return PTR_ERR(tsk); } + kthread_set_per_cpu(tsk, cpu); /* * Park the thread so that it could start right on the CPU * when it is available.
From: Peter Zijlstra peterz@infradead.org
stable inclusion from linux-4.19.174 commit 1746d1dcae9a960052b7a956523cc1719f71513c
--------------------------------
[ Upstream commit 640f17c82460e9724fd256f0a1f5d99e7ff0bda4 ]
create_worker() will already set the right affinity using kthread_bind_mask(), which means only the rescuer will need to change its affinity.
However, while in cpu-hot-unplug a regular task is not allowed to run on a CPU that is online && !active, as it would be pushed away quite aggressively. We need KTHREAD_IS_PER_CPU to survive in that environment.
Therefore set the affinity after getting that magic flag.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Valentin Schneider valentin.schneider@arm.com Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/workqueue.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 4f05c714f99c..1ffc523edb65 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1729,12 +1729,6 @@ static void worker_attach_to_pool(struct worker *worker, { mutex_lock(&wq_pool_attach_mutex);
- /* - * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any - * online CPUs. It'll be re-applied when any of the CPUs come up. - */ - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); - /* * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains * stable across this function. See the comments above the flag @@ -1743,6 +1737,9 @@ static void worker_attach_to_pool(struct worker *worker, if (pool->flags & POOL_DISASSOCIATED) worker->flags |= WORKER_UNBOUND;
+ if (worker->rescue_wq) + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); + list_add_tail(&worker->node, &pool->workers); worker->pool = pool;
From: Roman Gushchin guro@fb.com
stable inclusion from linux-4.19.175 commit 50b2181099c0b66e80bbb940c22881f2552ce7be
--------------------------------
[ Upstream commit 2dcb3964544177c51853a210b6ad400de78ef17d ]
With kaslr the kernel image is placed at a random place, so starting the bottom-up allocation with the kernel_end can result in an allocation failure and a warning like this one:
hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node ------------[ cut here ]------------ memblock: bottom-up allocation failed, memory hotremove may be affected WARNING: CPU: 0 PID: 0 at mm/memblock.c:332 memblock_find_in_range_node+0x178/0x25a Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #1169 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 RIP: 0010:memblock_find_in_range_node+0x178/0x25a Code: e9 6d ff ff ff 48 85 c0 0f 85 da 00 00 00 80 3d 9b 35 df 00 00 75 15 48 c7 c7 c0 75 59 88 c6 05 8b 35 df 00 01 e8 25 8a fa ff <0f> 0b 48 c7 44 24 20 ff ff ff ff 44 89 e6 44 89 ea 48 c7 c1 70 5c RSP: 0000:ffffffff88803d18 EFLAGS: 00010086 ORIG_RAX: 0000000000000000 RAX: 0000000000000000 RBX: 0000000240000000 RCX: 00000000ffffdfff RDX: 00000000ffffdfff RSI: 00000000ffffffea RDI: 0000000000000046 RBP: 0000000100000000 R08: ffffffff88922788 R09: 0000000000009ffb R10: 00000000ffffe000 R11: 3fffffffffffffff R12: 0000000000000000 R13: 0000000000000000 R14: 0000000080000000 R15: 00000001fb42c000 FS: 0000000000000000(0000) GS:ffffffff88f71000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa080fb401000 CR3: 00000001fa80a000 CR4: 00000000000406b0 Call Trace: memblock_alloc_range_nid+0x8d/0x11e cma_declare_contiguous_nid+0x2c4/0x38c hugetlb_cma_reserve+0xdc/0x128 flush_tlb_one_kernel+0xc/0x20 native_set_fixmap+0x82/0xd0 flat_get_apic_id+0x5/0x10 register_lapic_address+0x8e/0x97 setup_arch+0x8a5/0xc3f start_kernel+0x66/0x547 load_ucode_bsp+0x4c/0xcd secondary_startup_64_no_verify+0xb0/0xbb random: get_random_bytes called from __warn+0xab/0x110 with crng_init=0 ---[ end trace f151227d0b39be70 ]---
At the same time, the kernel image is protected with memblock_reserve(), so we can just start searching at PAGE_SIZE. In this case the bottom-up allocation has the same chances to success as a top-down allocation, so there is no reason to fallback in the case of a failure. All together it simplifies the logic.
Link: https://lkml.kernel.org/r/20201217201214.3414100-2-guro@fb.com Fixes: 8fabc623238e ("powerpc: Ensure that swiotlb buffer is allocated from low memory") Signed-off-by: Roman Gushchin guro@fb.com Reviewed-by: Mike Rapoport rppt@linux.ibm.com Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Cc: Michal Hocko mhocko@kernel.org Cc: Rik van Riel riel@surriel.com Cc: Wonhyuk Yang vvghjk1234@gmail.com Cc: Thiago Jung Bauermann bauerman@linux.ibm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/memblock.c | 49 ++++++------------------------------------------- 1 file changed, 6 insertions(+), 43 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c index 5c943854bdfb..13be610a381f 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -241,14 +241,6 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end, * * Find @size free area aligned to @align in the specified range and node. * - * When allocation direction is bottom-up, the @start should be greater - * than the end of the kernel image. Otherwise, it will be trimmed. The - * reason is that we want the bottom-up allocation just near the kernel - * image so it is highly likely that the allocated memory and the kernel - * will reside in the same node. - * - * If bottom-up allocation failed, will try to allocate memory top-down. - * * Return: * Found address on success, 0 on failure. */ @@ -257,8 +249,6 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size, phys_addr_t end, int nid, enum memblock_flags flags) { - phys_addr_t kernel_end, ret; - /* pump up @end */ if (end == MEMBLOCK_ALLOC_ACCESSIBLE || end == MEMBLOCK_ALLOC_KASAN) @@ -267,40 +257,13 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size, /* avoid allocating the first page */ start = max_t(phys_addr_t, start, PAGE_SIZE); end = max(start, end); - kernel_end = __pa_symbol(_end); - - /* - * try bottom-up allocation only when bottom-up mode - * is set and @end is above the kernel image. - */ - if (memblock_bottom_up() && end > kernel_end) { - phys_addr_t bottom_up_start; - - /* make sure we will allocate above the kernel */ - bottom_up_start = max(start, kernel_end);
- /* ok, try bottom-up allocation first */ - ret = __memblock_find_range_bottom_up(bottom_up_start, end, - size, align, nid, flags); - if (ret) - return ret; - - /* - * we always limit bottom-up allocation above the kernel, - * but top-down allocation doesn't have the limit, so - * retrying top-down allocation may succeed when bottom-up - * allocation failed. - * - * bottom-up allocation is expected to be fail very rarely, - * so we use WARN_ONCE() here to see the stack trace if - * fail happens. - */ - WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE), - "memblock: bottom-up allocation failed, memory hotremove may be affected\n"); - } - - return __memblock_find_range_top_down(start, end, size, align, nid, - flags); + if (memblock_bottom_up()) + return __memblock_find_range_bottom_up(start, end, size, align, + nid, flags); + else + return __memblock_find_range_top_down(start, end, size, align, + nid, flags); }
/**
From: Liangyan liangyan.peng@linux.alibaba.com
stable inclusion from linux-4.19.175 commit 307aa80243fd0e5e3197f89c9e464baedc01d377
--------------------------------
commit e04527fefba6e4e66492f122cf8cc6314f3cf3bf upstream.
We need to lock d_parent->d_lock before calling dget_dlock(), otherwise d_lockref may be updated in parallel as in the calltrace below, which will leak a dentry->d_lockref count and risk a crash.
CPU 0 CPU 1 ovl_set_redirect lookup_fast ovl_get_redirect __d_lookup dget_dlock //no lock protection here spin_lock(&dentry->d_lock) dentry->d_lockref.count++ dentry->d_lockref.count++
[ 49.799059] PGD 800000061fed7067 P4D 800000061fed7067 PUD 61fec5067 PMD 0 [ 49.799689] Oops: 0002 [#1] SMP PTI [ 49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1 [ 49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014 [ 49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20 [ 49.803470] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246 [ 49.803949] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000 [ 49.804600] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088 [ 49.805252] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040 [ 49.805898] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000 [ 49.806548] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0 [ 49.807200] FS: 00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000 [ 49.807935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 49.808461] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0 [ 49.809113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 49.809758] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 49.810410] Call Trace: [ 49.810653] d_delete+0x2c/0xb0 [ 49.810951] vfs_rmdir+0xfd/0x120 [ 49.811264] do_rmdir+0x14f/0x1a0 [ 49.811573] do_syscall_64+0x5b/0x190 [ 49.811917] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 49.812385] RIP: 0033:0x7ffbf505ffd7 [ 49.814404] RSP: 002b:00007ffbedffada8 EFLAGS: 00000297 ORIG_RAX: 0000000000000054 [ 49.815098] RAX: ffffffffffffffda RBX: 00007ffbedffb640 RCX: 00007ffbf505ffd7 [ 49.815744] RDX: 0000000004449700 RSI: 0000000000000000 RDI: 0000000006c8cd50 [ 49.816394] RBP: 00007ffbedffaea0 R08: 0000000000000000 R09: 0000000000017d0b [ 49.817038] R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000012 [ 49.817687] R13: 00000000072823d8 R14: 00007ffbedffb700 R15: 00000000072823d8 [ 49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042 [ 49.819052] CR2: 0000000000000088 [ 49.819368] ---[ end trace 4e652b8aa299aa2d ]--- [ 49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20 [ 49.821880] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246 [ 49.822363] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000 [ 49.823008] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088 [ 49.823658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040 [ 49.825404] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000 [ 49.827147] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0 [ 49.828890] FS: 00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000 [ 49.830725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 49.832359] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0 [ 49.834085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 49.835792] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
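As a toy userspace illustration of the lost-update hazard shown in the race diagram above (a plain increment racing with a locked increment; this is not the dcache code, only the general pattern):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long refcount;

/* Like __d_lookup(): bump the count with the lock held. */
static void *locked_get(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&lock);
		refcount++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* Like the old dget_dlock() call: bump the count with no lock held. */
static void *unlocked_get(void *arg)
{
	for (int i = 0; i < 1000000; i++)
		refcount++;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, locked_get, NULL);
	pthread_create(&b, NULL, unlocked_get, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Typically prints less than 2000000: some increments were lost. */
	printf("refcount = %ld\n", refcount);
	return 0;
}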
Cc: stable@vger.kernel.org Fixes: a6c606551141 ("ovl: redirect on rename-dir") Signed-off-by: Liangyan liangyan.peng@linux.alibaba.com Reviewed-by: Joseph Qi joseph.qi@linux.alibaba.com Suggested-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/overlayfs/dir.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index 2228b38abdb1..74340542824a 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -968,8 +968,8 @@ static char *ovl_get_redirect(struct dentry *dentry, bool abs_redirect)
buflen -= thislen; memcpy(&buf[buflen], name, thislen); - tmp = dget_dlock(d->d_parent); spin_unlock(&d->d_lock); + tmp = dget_parent(d);
dput(d); d = tmp;
From: Wang ShaoBo bobo.shaobowang@huawei.com
stable inclusion from linux-4.19.175 commit e0d7f2f9d11c441f060da4d7509f42a8ce6bf5bc
--------------------------------
commit 0188b87899ffc4a1d36a0badbe77d56c92fd91dc upstream.
Our system encountered a re-init error when re-registering the same kretprobe, where the kretprobe_instance in rp->free_instances is illegally accessed after re-init.
A check to avoid re-registration was introduced for kprobes earlier, but register_kretprobe() still lacks it. We must check whether the kprobe has already been registered before re-initializing the kretprobe, otherwise re-initialization destroys the data structures of the registered kretprobe, which can lead to memory leaks, system crashes, and other unexpected behaviors.
Use check_kprobe_rereg() to check whether the kprobe has already been registered before running register_kretprobe()'s body, so that a warning message can be given and the registration process terminated.
Link: https://lkml.kernel.org/r/20210128124427.2031088-1-bobo.shaobowang@huawei.co...
Cc: stable@vger.kernel.org Fixes: 1f0ab40976460 ("kprobes: Prevent re-registration of the same kprobe") [ The above commit should have been done for kretprobes too ] Acked-by: Naveen N. Rao naveen.n.rao@linux.vnet.ibm.com Acked-by: Ananth N Mavinakayanahalli ananth@linux.ibm.com Acked-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Wang ShaoBo bobo.shaobowang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/kprobes.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 42dc9347da14..273ac8cf35ca 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1915,6 +1915,10 @@ int register_kretprobe(struct kretprobe *rp) if (!kprobe_on_func_entry(rp->kp.addr, rp->kp.symbol_name, rp->kp.offset)) return -EINVAL;
+ /* If only rp->kp.addr is specified, check reregistering kprobes */ + if (rp->kp.addr && check_kprobe_rereg(&rp->kp)) + return -EINVAL; + if (kretprobe_blacklist_size) { addr = kprobe_addr(&rp->kp); if (IS_ERR(addr))
From: Marc Zyngier maz@kernel.org
stable inclusion from linux-4.19.175 commit 2bc9ccf2cdfdbd1e2bcab4114d44a09b61d59a7a
--------------------------------
commit 4c457e8cb75eda91906a4f89fc39bde3f9a43922 upstream.
When MSI_FLAG_ACTIVATE_EARLY is set (which is the case for PCI), __msi_domain_alloc_irqs() performs the activation of the interrupt (which in the case of PCI results in the endpoint being programmed) as soon as the interrupt is allocated.
But it appears that this is only done for the first vector, introducing an inconsistent behaviour for PCI Multi-MSI.
Fix it by iterating over the number of vectors allocated to each MSI descriptor. This is easily achieved by introducing a new "for_each_msi_vector" iterator, together with a tiny bit of refactoring.
Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early") Reported-by: Shameer Kolothum shameerali.kolothum.thodi@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Shameer Kolothum shameerali.kolothum.thodi@huawei.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210123122759.1781359-1-maz@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/msi.h | 6 ++++++ kernel/irq/msi.c | 44 ++++++++++++++++++++------------------------ 2 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/include/linux/msi.h b/include/linux/msi.h index be8ec813dbfb..5dd171849a27 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -118,6 +118,12 @@ struct msi_desc { list_for_each_entry((desc), dev_to_msi_list((dev)), list) #define for_each_msi_entry_safe(desc, tmp, dev) \ list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list) +#define for_each_msi_vector(desc, __irq, dev) \ + for_each_msi_entry((desc), (dev)) \ + if ((desc)->irq) \ + for (__irq = (desc)->irq; \ + __irq < ((desc)->irq + (desc)->nvec_used); \ + __irq++)
#ifdef CONFIG_PCI_MSI #define first_pci_msi_entry(pdev) first_msi_entry(&(pdev)->dev) diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index dc1186ce3ecd..604974f2afb1 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -437,22 +437,22 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
can_reserve = msi_check_reservation_mode(domain, info, dev);
- for_each_msi_entry(desc, dev) { - virq = desc->irq; - if (desc->nvec_used == 1) - dev_dbg(dev, "irq %d for MSI\n", virq); - else + /* + * This flag is set by the PCI layer as we need to activate + * the MSI entries before the PCI layer enables MSI in the + * card. Otherwise the card latches a random msi message. + */ + if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY)) + goto skip_activate; + + for_each_msi_vector(desc, i, dev) { + if (desc->irq == i) { + virq = desc->irq; dev_dbg(dev, "irq [%d-%d] for MSI\n", virq, virq + desc->nvec_used - 1); - /* - * This flag is set by the PCI layer as we need to activate - * the MSI entries before the PCI layer enables MSI in the - * card. Otherwise the card latches a random msi message. - */ - if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY)) - continue; + }
- irq_data = irq_domain_get_irq_data(domain, desc->irq); + irq_data = irq_domain_get_irq_data(domain, i); if (!can_reserve) { irqd_clr_can_reserve(irq_data); if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK) @@ -463,28 +463,24 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, goto cleanup; }
+skip_activate: /* * If these interrupts use reservation mode, clear the activated bit * so request_irq() will assign the final vector. */ if (can_reserve) { - for_each_msi_entry(desc, dev) { - irq_data = irq_domain_get_irq_data(domain, desc->irq); + for_each_msi_vector(desc, i, dev) { + irq_data = irq_domain_get_irq_data(domain, i); irqd_clr_activated(irq_data); } } return 0;
cleanup: - for_each_msi_entry(desc, dev) { - struct irq_data *irqd; - - if (desc->irq == virq) - break; - - irqd = irq_domain_get_irq_data(domain, desc->irq); - if (irqd_is_activated(irqd)) - irq_domain_deactivate_irq(irqd); + for_each_msi_vector(desc, i, dev) { + irq_data = irq_domain_get_irq_data(domain, i); + if (irqd_is_activated(irq_data)) + irq_domain_deactivate_irq(irq_data); } msi_domain_free_irqs(domain, dev); return ret;
From: Aurelien Aptel aaptel@suse.com
stable inclusion from linux-4.19.175 commit 613fa1fb62c82e6b9c170b7de6f8c8c850624f5a
--------------------------------
commit 21b200d091826a83aafc95d847139b2b0582f6d1 upstream.
Assuming - //HOST/a is mounted on /mnt - //HOST/b is mounted on /mnt/b
On a slow connection, running 'df' and killing it while it's processing /mnt/b can make cifs_get_inode_info() return -ERESTARTSYS.
This triggers the following chain of events: => the dentry revalidation fails => the dentry is put and released => the superblock associated with the dentry is put => /mnt/b is unmounted
This patch makes cifs_d_revalidate() return the error instead of 0 (invalid) when cifs_revalidate_dentry() fails, except for ENOENT (file deleted) and ESTALE (file recreated).
Signed-off-by: Aurelien Aptel aaptel@suse.com Suggested-by: Shyam Prasad N nspmangalore@gmail.com Reviewed-by: Shyam Prasad N nspmangalore@gmail.com CC: stable@vger.kernel.org Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/dir.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c index f6e3c0089825..c7e162c9383d 100644 --- a/fs/cifs/dir.c +++ b/fs/cifs/dir.c @@ -840,6 +840,7 @@ static int cifs_d_revalidate(struct dentry *direntry, unsigned int flags) { struct inode *inode; + int rc;
if (flags & LOOKUP_RCU) return -ECHILD; @@ -849,8 +850,25 @@ cifs_d_revalidate(struct dentry *direntry, unsigned int flags) if ((flags & LOOKUP_REVAL) && !CIFS_CACHE_READ(CIFS_I(inode))) CIFS_I(inode)->time = 0; /* force reval */
- if (cifs_revalidate_dentry(direntry)) - return 0; + rc = cifs_revalidate_dentry(direntry); + if (rc) { + cifs_dbg(FYI, "cifs_revalidate_dentry failed with rc=%d", rc); + switch (rc) { + case -ENOENT: + case -ESTALE: + /* + * Those errors mean the dentry is invalid + * (file was deleted or recreated) + */ + return 0; + default: + /* + * Otherwise some unexpected error happened + * report it as-is to VFS layer + */ + return rc; + } + } else { /* * If the inode wasn't known to be a dfs entry when
From: "Gustavo A. R. Silva" gustavoars@kernel.org
stable inclusion from linux-4.19.175 commit 8c323163303d9923927c176977abdbe998f217ff
--------------------------------
commit 8d8d1dbefc423d42d626cf5b81aac214870ebaab upstream.
While addressing some warnings generated by -Warray-bounds, I found this bug that was introduced back in 2017:
CC [M] fs/cifs/smb2pdu.o fs/cifs/smb2pdu.c: In function ‘SMB2_negotiate’: fs/cifs/smb2pdu.c:822:16: warning: array subscript 1 is above array bounds of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds] 822 | req->Dialects[1] = cpu_to_le16(SMB30_PROT_ID); | ~~~~~~~~~~~~~^~~ fs/cifs/smb2pdu.c:823:16: warning: array subscript 2 is above array bounds of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds] 823 | req->Dialects[2] = cpu_to_le16(SMB302_PROT_ID); | ~~~~~~~~~~~~~^~~ fs/cifs/smb2pdu.c:824:16: warning: array subscript 3 is above array bounds of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds] 824 | req->Dialects[3] = cpu_to_le16(SMB311_PROT_ID); | ~~~~~~~~~~~~~^~~ fs/cifs/smb2pdu.c:816:16: warning: array subscript 1 is above array bounds of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds] 816 | req->Dialects[1] = cpu_to_le16(SMB302_PROT_ID); | ~~~~~~~~~~~~~^~~
At the time, the size of array _Dialects_ was changed from 1 to 3 in struct validate_negotiate_info_req, and then in 2019 it was changed from 3 to 4, but those changes were never made in struct smb2_negotiate_req, which has led to a three-and-a-half-year-old out-of-bounds bug in function SMB2_negotiate() (fs/cifs/smb2pdu.c).
Fix this by increasing the size of array _Dialects_ in struct smb2_negotiate_req to 4.
Fixes: 9764c02fcbad ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)") Fixes: d5c7076b772a ("smb3: add smb3.1.1 to default dialect list") Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva gustavoars@kernel.org Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/cifs/smb2pdu.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/cifs/smb2pdu.h b/fs/cifs/smb2pdu.h index 308c682fa4d3..5c41798411a4 100644 --- a/fs/cifs/smb2pdu.h +++ b/fs/cifs/smb2pdu.h @@ -222,7 +222,7 @@ struct smb2_negotiate_req { __le32 NegotiateContextOffset; /* SMB3.1.1 only. MBZ earlier */ __le16 NegotiateContextCount; /* SMB3.1.1 only. MBZ earlier */ __le16 Reserved2; - __le16 Dialects[1]; /* One dialect (vers=) at a time for now */ + __le16 Dialects[4]; /* BB expand this if autonegotiate > 4 dialects */ } __packed;
/* Dialects */
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.175 commit b6e04c19c5b2060c91b07acec5d650a1beb6855f
--------------------------------
commit 585fc0d2871c9318c949fbf45b1f081edd489e96 upstream.
If a new hugetlb page is allocated during fallocate, it will not be marked as active (set_page_huge_active), which will result in a later isolate_huge_page failure when the page migration code would like to move that page. Such a failure would be unexpected and wrong.
Only export set_page_huge_active and leave clear_page_huge_active static, because it has no external users.
Link: https://lkml.kernel.org/r/20210115124942.46403-3-songmuchun@bytedance.com Fixes: 70c3547e36f5 (hugetlbfs: add hugetlbfs_fallocate()) Signed-off-by: Muchun Song songmuchun@bytedance.com Acked-by: Michal Hocko mhocko@suse.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Reviewed-by: Oscar Salvador osalvador@suse.de Cc: David Hildenbrand david@redhat.com Cc: Yang Shi shy828301@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/hugetlbfs/inode.c | 3 ++- include/linux/hugetlb.h | 3 +++ mm/hugetlb.c | 2 +- 3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a878f009347c..47a7b96cfe10 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -787,9 +787,10 @@ static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ set_page_huge_active(page); /* * unlock_page because locked by add_to_page_cache() - * page_put due to reference from alloc_huge_page() + * put_page() due to reference from alloc_huge_page() */ unlock_page(page); put_page(page); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3f355dfaf31f..8aec9d4220bd 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -549,6 +549,9 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr set_huge_pte_at(mm, addr, ptep, pte); } #endif + +void set_page_huge_active(struct page *page); + #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; #define alloc_huge_page(v, a, r) NULL diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ea0902276cb5..5e029780a36c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1219,7 +1219,7 @@ bool page_huge_active(struct page *page) }
/* never called for tail page */ -static void set_page_huge_active(struct page *page) +void set_page_huge_active(struct page *page) { VM_BUG_ON_PAGE(!PageHeadHuge(page), page); SetPagePrivate(&page[1]);
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.175 commit db510d8f98c38a953ae5d3b51b01f435740729b4
--------------------------------
commit 7ffddd499ba6122b1a07828f023d1d67629aa017 upstream.
There is a race condition between __free_huge_page() and dissolve_free_huge_page().
CPU0: CPU1:
// page_count(page) == 1 put_page(page) __free_huge_page(page) dissolve_free_huge_page(page) spin_lock(&hugetlb_lock) // PageHuge(page) && !page_count(page) update_and_free_page(page) // page is freed to the buddy spin_unlock(&hugetlb_lock) spin_lock(&hugetlb_lock) clear_page_huge_active(page) enqueue_huge_page(page) // It is wrong, the page is already freed spin_unlock(&hugetlb_lock)
The race window is between put_page() and dissolve_free_huge_page().
We should make sure that the page is already on the free list when it is dissolved.
Otherwise, __free_huge_page() would corrupt page(s) already freed to the buddy allocator.
Link: https://lkml.kernel.org/r/20210115124942.46403-4-songmuchun@bytedance.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Muchun Song songmuchun@bytedance.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Reviewed-by: Oscar Salvador osalvador@suse.de Acked-by: Michal Hocko mhocko@suse.com Cc: David Hildenbrand david@redhat.com Cc: Yang Shi shy828301@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/hugetlb.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5e029780a36c..c26873bcd925 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -70,6 +70,21 @@ DEFINE_SPINLOCK(hugetlb_lock); static int num_fault_mutexes; struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
+static inline bool PageHugeFreed(struct page *head) +{ + return page_private(head + 4) == -1UL; +} + +static inline void SetPageHugeFreed(struct page *head) +{ + set_page_private(head + 4, -1UL); +} + +static inline void ClearPageHugeFreed(struct page *head) +{ + set_page_private(head + 4, 0); +} + /* Forward declaration */ static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -868,6 +883,7 @@ static void enqueue_huge_page(struct hstate *h, struct page *page) list_move(&page->lru, &h->hugepage_freelists[nid]); h->free_huge_pages++; h->free_huge_pages_node[nid]++; + SetPageHugeFreed(page); }
static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) @@ -885,6 +901,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) return NULL; list_move(&page->lru, &h->hugepage_activelist); set_page_refcounted(page); + ClearPageHugeFreed(page); h->free_huge_pages--; h->free_huge_pages_node[nid]--; return page; @@ -1323,6 +1340,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) set_hugetlb_cgroup(page, NULL); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; + ClearPageHugeFreed(page); spin_unlock(&hugetlb_lock); }
@@ -1559,6 +1577,7 @@ int dissolve_free_huge_page(struct page *page) { int rc = -EBUSY;
+retry: /* Not to disrupt normal path by vainly holding hugetlb_lock */ if (!PageHuge(page)) return 0; @@ -1575,6 +1594,26 @@ int dissolve_free_huge_page(struct page *page) int nid = page_to_nid(head); if (h->free_huge_pages - h->resv_huge_pages == 0) goto out; + + /* + * We should make sure that the page is already on the free list + * when it is dissolved. + */ + if (unlikely(!PageHugeFreed(head))) { + spin_unlock(&hugetlb_lock); + cond_resched(); + + /* + * Theoretically, we should return -EBUSY when we + * encounter this race. In fact, we have a chance + * to successfully dissolve the page if we do a + * retry. Because the race window is quite small. + * If we seize this opportunity, it is an optimization + * for increasing the success rate of dissolving page. + */ + goto retry; + } + /* * Move PageHWPoison flag from head page to the raw error page, * which makes any subpages rather than the error page reusable.
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.175 commit 532574ae2586940729419253fd2defd9c9880490
--------------------------------
commit 0eb2df2b5629794020f75e94655e1994af63f0d4 upstream.
There is a race between isolate_huge_page() and __free_huge_page().
CPU0: CPU1:
if (PageHuge(page)) put_page(page) __free_huge_page(page) spin_lock(&hugetlb_lock) update_and_free_page(page) set_compound_page_dtor(page, NULL_COMPOUND_DTOR) spin_unlock(&hugetlb_lock) isolate_huge_page(page) // trigger BUG_ON VM_BUG_ON_PAGE(!PageHead(page), page) spin_lock(&hugetlb_lock) page_huge_active(page) // trigger BUG_ON VM_BUG_ON_PAGE(!PageHuge(page), page) spin_unlock(&hugetlb_lock)
When we isolate a HugeTLB page on CPU0 while it is being freed to the buddy allocator on CPU1, we can trigger a BUG_ON on CPU0, because the page has already been freed to the buddy allocator.
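A minimal userspace sketch of the pattern the fix uses (illustrative names, not the hugetlb code): every check that the page is still a candidate is made under the same lock the free path takes, so a concurrently freed page is simply skipped instead of tripping a BUG_ON.

#include <pthread.h>
#include <stdbool.h>

struct page_state {
	bool is_huge;		/* cleared by the free path under 'lock' */
	bool active;
	int refcount;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static bool isolate(struct page_state *p)
{
	bool ret = false;

	pthread_mutex_lock(&lock);
	/* Re-check everything under the lock rather than asserting beforehand. */
	if (p->is_huge && p->active && p->refcount != 0) {
		p->refcount++;
		ret = true;
	}
	pthread_mutex_unlock(&lock);
	return ret;
}

int main(void)
{
	struct page_state p = { .is_huge = true, .active = true, .refcount = 1 };

	return isolate(&p) ? 0 : 1;
}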
Link: https://lkml.kernel.org/r/20210115124942.46403-5-songmuchun@bytedance.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Muchun Song songmuchun@bytedance.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Acked-by: Michal Hocko mhocko@suse.com Reviewed-by: Oscar Salvador osalvador@suse.de Cc: David Hildenbrand david@redhat.com Cc: Yang Shi shy828301@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/hugetlb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c26873bcd925..ec3e9d832b5e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5148,9 +5148,9 @@ bool isolate_huge_page(struct page *page, struct list_head *list) { bool ret = true;
- VM_BUG_ON_PAGE(!PageHead(page), page); spin_lock(&hugetlb_lock); - if (!page_huge_active(page) || !get_page_unless_zero(page)) { + if (!PageHeadHuge(page) || !page_huge_active(page) || + !get_page_unless_zero(page)) { ret = false; goto unlock; }
From: Muchun Song songmuchun@bytedance.com
stable inclusion from linux-4.19.175 commit 6bf5461ae968b870f81c813a880e0e3a2684dfc1
--------------------------------
commit ecbf4724e6061b4b01be20f6d797d64d462b2bc8 upstream.
page_huge_active() can be called from scan_movable_pages(), which does not hold a reference count on the HugeTLB page. So when we call page_huge_active() from scan_movable_pages(), the HugeTLB page can be freed in parallel. Then we will trigger the BUG_ON in page_huge_active() when CONFIG_DEBUG_VM is enabled. Just remove the VM_BUG_ON_PAGE.
Link: https://lkml.kernel.org/r/20210115124942.46403-6-songmuchun@bytedance.com Fixes: 7e1f049efb86 ("mm: hugetlb: cleanup using paeg_huge_active()") Signed-off-by: Muchun Song songmuchun@bytedance.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Acked-by: Michal Hocko mhocko@suse.com Reviewed-by: Oscar Salvador osalvador@suse.de Cc: David Hildenbrand david@redhat.com Cc: Yang Shi shy828301@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/hugetlb.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ec3e9d832b5e..8f1fc2dc355c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1231,8 +1231,7 @@ struct hstate *size_to_hstate(unsigned long size) */ bool page_huge_active(struct page *page) { - VM_BUG_ON_PAGE(!PageHuge(page), page); - return PageHead(page) && PagePrivate(&page[1]); + return PageHeadHuge(page) && PagePrivate(&page[1]); }
/* never called for tail page */
From: Hugh Dickins hughd@google.com
stable inclusion from linux-4.19.175 commit 081438440a6e0787d0e4c933bc9e447ac5d217bb
--------------------------------
commit 1c2f67308af4c102b4e1e6cd6f69819ae59408e0 upstream.
Sergey reported deadlock between kswapd correctly doing its usual lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem), and madvise(MADV_REMOVE) on an madvise(MADV_HUGEPAGE) area doing down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).
This happened when shmem_fallocate(punch hole)'s unmap_mapping_range() reaches zap_pmd_range()'s call to __split_huge_pmd(). The same deadlock could occur when partially truncating a mapped huge tmpfs file, or using fallocate(FALLOC_FL_PUNCH_HOLE) on it.
__split_huge_pmd()'s page lock was added in 5.8, to make sure that any concurrent use of reuse_swap_page() (holding page lock) could not catch the anon THP's mapcounts and swapcounts while they were being split.
Fortunately, reuse_swap_page() is never applied to a shmem or file THP (not even by khugepaged, which checks PageSwapCache before calling), and anonymous THPs are never created in shmem or file areas: so that __split_huge_pmd()'s page lock can only be necessary for anonymous THPs, on which there is no risk of deadlock with i_mmap_rwsem.
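As a toy illustration of the lock-ordering inversion described above, two pthread mutexes stand in for the page lock and i_mmap_rwsem; with unlucky timing the two threads deadlock, which is exactly the hazard being removed (illustrative only, not the kernel paths):

#include <pthread.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_mmap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Like kswapd: page lock first, then i_mmap_rwsem. */
static void *reclaim_path(void *arg)
{
	pthread_mutex_lock(&page_lock);
	pthread_mutex_lock(&i_mmap_lock);
	pthread_mutex_unlock(&i_mmap_lock);
	pthread_mutex_unlock(&page_lock);
	return NULL;
}

/* Like madvise(MADV_REMOVE): the opposite order -> ABBA deadlock. */
static void *punch_hole_path(void *arg)
{
	pthread_mutex_lock(&i_mmap_lock);
	pthread_mutex_lock(&page_lock);
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&i_mmap_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, reclaim_path, NULL);
	pthread_create(&t2, NULL, punch_hole_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}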
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2101161409470.2022@eggly.anvils Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()") Signed-off-by: Hugh Dickins hughd@google.com Reported-by: Sergey Senozhatsky sergey.senozhatsky.work@gmail.com Reviewed-by: Andrea Arcangeli aarcange@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/huge_memory.c | 37 +++++++++++++++++++++++-------------- 1 file changed, 23 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 28e2ed5090e9..8f75efda7b35 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2340,7 +2340,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, spinlock_t *ptl; struct mm_struct *mm = vma->vm_mm; unsigned long haddr = address & HPAGE_PMD_MASK; - bool was_locked = false; + bool do_unlock_page = false; pmd_t _pmd;
mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); @@ -2353,7 +2353,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, VM_BUG_ON(freeze && !page); if (page) { VM_WARN_ON_ONCE(!PageLocked(page)); - was_locked = true; if (page != pmd_page(*pmd)) goto out; } @@ -2362,19 +2361,29 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (pmd_trans_huge(*pmd)) { if (!page) { page = pmd_page(*pmd); - if (unlikely(!trylock_page(page))) { - get_page(page); - _pmd = *pmd; - spin_unlock(ptl); - lock_page(page); - spin_lock(ptl); - if (unlikely(!pmd_same(*pmd, _pmd))) { - unlock_page(page); + /* + * An anonymous page must be locked, to ensure that a + * concurrent reuse_swap_page() sees stable mapcount; + * but reuse_swap_page() is not used on shmem or file, + * and page lock must not be taken when zap_pmd_range() + * calls __split_huge_pmd() while i_mmap_lock is held. + */ + if (PageAnon(page)) { + if (unlikely(!trylock_page(page))) { + get_page(page); + _pmd = *pmd; + spin_unlock(ptl); + lock_page(page); + spin_lock(ptl); + if (unlikely(!pmd_same(*pmd, _pmd))) { + unlock_page(page); + put_page(page); + page = NULL; + goto repeat; + } put_page(page); - page = NULL; - goto repeat; } - put_page(page); + do_unlock_page = true; } } if (PageMlocked(page)) @@ -2384,7 +2393,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, __split_huge_pmd_locked(vma, pmd, haddr, freeze); out: spin_unlock(ptl); - if (!was_locked && page) + if (do_unlock_page) unlock_page(page); /* * No need to double call mmu_notifier->invalidate_range() callback.
From: Xiao Ni xni@redhat.com
stable inclusion from linux-4.19.175 commit ada82f9940e92650221b677a46afd086632ed1c7
--------------------------------
commit dc5d17a3c39b06aef866afca19245a9cfb533a79 upstream.
One customer reported a crash caused by a flush request. It triggers a warning before the crash.
/* new request after previous flush is completed */ if (ktime_after(req_start, mddev->prev_flush_start)) { WARN_ON(mddev->flush_bio); mddev->flush_bio = bio; bio = NULL; }
The WARN_ON is triggered. We use a spin lock to protect prev_flush_start and flush_bio in md_flush_request. But there is no lock protection in md_submit_flush_data, which can set flush_bio to NULL before prev_flush_start is updated because the compiler may reorder the write instructions.
For example, flush bio1 sets flush_bio to NULL first in md_submit_flush_data. An interrupt or a VMware-induced extended stall happens between updating flush_bio and prev_flush_start. Because flush_bio is NULL, flush bio2 can take the lock and be submitted to the underlying disks. Then flush bio1 updates prev_flush_start after the interrupt or extended stall.
Then flush bio3 enters md_flush_request. The start time req_start is behind prev_flush_start, and flush_bio is not NULL (flush bio2 hasn't finished). So it can trigger the WARN_ON now, and then it calls INIT_WORK again. INIT_WORK() will re-initialize the list pointers in the work_struct, which can result in a corrupted work list and the work_struct being queued a second time. With the work list corrupted, invalid work items can be used and cause a crash in process_one_work.
We need to make sure only one flush bio can be handled at a time. So take the spin lock in md_submit_flush_data as well, so that prev_flush_start and flush_bio are updated atomically.
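A minimal sketch of the idea behind the fix (names are illustrative, not the md driver): the completion path must update the timestamp and clear the in-flight pointer under the same lock the submission path uses for its check, so no submitter ever sees the pair half-updated.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct flush_state {
	pthread_mutex_t lock;
	unsigned long prev_flush_start;
	void *flush_bio;		/* non-NULL while a flush is in flight */
};

/* Completion path: both fields become visible as one unit. */
static void flush_done(struct flush_state *st, unsigned long now)
{
	pthread_mutex_lock(&st->lock);
	st->prev_flush_start = now;
	st->flush_bio = NULL;
	pthread_mutex_unlock(&st->lock);
}

/* Submission path: the check and the claim happen under the same lock. */
static bool flush_submit(struct flush_state *st, void *bio, unsigned long start)
{
	bool queued = false;

	pthread_mutex_lock(&st->lock);
	if (start > st->prev_flush_start && !st->flush_bio) {
		st->flush_bio = bio;
		queued = true;
	}
	pthread_mutex_unlock(&st->lock);
	return queued;
}

int main(void)
{
	struct flush_state st = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.prev_flush_start = 0,
		.flush_bio = NULL,
	};
	int bio;

	if (flush_submit(&st, &bio, 1))
		flush_done(&st, 1);
	return 0;
}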
Reviewed-by: David Jeffery djeffery@redhat.com Signed-off-by: Xiao Ni xni@redhat.com Signed-off-by: Song Liu songliubraving@fb.com Signed-off-by: Jack Wang jinpu.wang@cloud.ionos.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/md/md.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c index a4b796ab6712..d83fe40c421c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -474,8 +474,10 @@ static void md_submit_flush_data(struct work_struct *ws) * could wait for this and below md_handle_request could wait for those * bios because of suspend check */ + spin_lock_irq(&mddev->lock); mddev->last_flush = mddev->start_flush; mddev->flush_bio = NULL; + spin_unlock_irq(&mddev->lock); wake_up(&mddev->sb_wait);
if (bio->bi_iter.bi_size == 0) {
From: Masami Hiramatsu mhiramat@kernel.org
stable inclusion from linux-4.19.176 commit 960434acef375b2a73168b467fdfc7fa0a11a68e
--------------------------------
commit 97c753e62e6c31a404183898d950d8c08d752dbd upstream.
Fix kprobe_on_func_entry() to return an error code instead of false so that register_kretprobe() can return an appropriate error code.
append_trace_kprobe() expects the kprobe registration to return -ENOENT when the target symbol is not found, and it checks whether the target module is unloaded or not. If the target module doesn't exist, it defers probing the target symbol until the module is loaded.
However, since register_kretprobe() returns -EINVAL instead of -ENOENT in that case, it always fails to put the kretprobe event on unloaded modules. e.g.
Kprobe event: /sys/kernel/debug/tracing # echo p xfs:xfs_end_io >> kprobe_events [ 16.515574] trace_kprobe: This probe might be able to register after target module is loaded. Continue.
Kretprobe event: (p -> r) /sys/kernel/debug/tracing # echo r xfs:xfs_end_io >> kprobe_events sh: write error: Invalid argument /sys/kernel/debug/tracing # cat error_log [ 41.122514] trace_kprobe: error: Failed to register probe event Command: r xfs:xfs_end_io ^
To fix this bug, change kprobe_on_func_entry() to detect symbol lookup failure and return -ENOENT in that case. Otherwise it returns -EINVAL, or 0 on success (the given address is on the function entry).
Link: https://lkml.kernel.org/r/161176187132.1067016.8118042342894378981.stgit@dev...
Cc: stable@vger.kernel.org Fixes: 59158ec4aef7 ("tracing/kprobes: Check the probe on unloaded module correctly") Reported-by: Jianlin Lv Jianlin.Lv@arm.com Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Conflicts: kernel/kprobes.c [yyl: adjust context] Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/kprobes.h | 2 +- kernel/kprobes.c | 34 +++++++++++++++++++++++++--------- kernel/trace/trace_kprobe.c | 4 ++-- 3 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h index 9f22652d69bb..c28204e22b54 100644 --- a/include/linux/kprobes.h +++ b/include/linux/kprobes.h @@ -245,7 +245,7 @@ extern void kprobes_inc_nmissed_count(struct kprobe *p); extern bool arch_within_kprobe_blacklist(unsigned long addr); extern int arch_populate_kprobe_blacklist(void); extern bool arch_kprobe_on_func_entry(unsigned long offset); -extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset); +extern int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
extern bool within_kprobe_blacklist(unsigned long addr); extern int kprobe_add_ksym_blacklist(unsigned long entry); diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 273ac8cf35ca..2e83379c16cf 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1888,23 +1888,38 @@ bool __weak arch_kprobe_on_func_entry(unsigned long offset) return !offset; }
-bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset) +/** + * kprobe_on_func_entry() -- check whether given address is function entry + * @addr: Target address + * @sym: Target symbol name + * @offset: The offset from the symbol or the address + * + * This checks whether the given @addr+@offset or @sym+@offset is on the + * function entry address or not. + * This returns 0 if it is the function entry, or -EINVAL if it is not. + * And also it returns -ENOENT if it fails the symbol or address lookup. + * Caller must pass @addr or @sym (either one must be NULL), or this + * returns -EINVAL. + */ +int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset) { kprobe_opcode_t *kp_addr = _kprobe_addr(addr, sym, offset);
if (IS_ERR(kp_addr)) - return false; + return PTR_ERR(kp_addr);
- if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset) || - !arch_kprobe_on_func_entry(offset)) - return false; + if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset)) + return -ENOENT;
- return true; + if (!arch_kprobe_on_func_entry(offset)) + return -EINVAL; + + return 0; }
int register_kretprobe(struct kretprobe *rp) { - int ret = 0; + int ret; struct kretprobe_instance *inst; int i; void *addr; @@ -1912,8 +1927,9 @@ int register_kretprobe(struct kretprobe *rp) if ((ssize_t)rp->data_size < 0) return -EINVAL;
- if (!kprobe_on_func_entry(rp->kp.addr, rp->kp.symbol_name, rp->kp.offset)) - return -EINVAL; + ret = kprobe_on_func_entry(rp->kp.addr, rp->kp.symbol_name, rp->kp.offset); + if (ret) + return ret;
/* If only rp->kp.addr is specified, check reregistering kprobes */ if (rp->kp.addr && check_kprobe_rereg(&rp->kp)) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index c61b2b0a99e9..f389de402fe2 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -112,9 +112,9 @@ bool trace_kprobe_on_func_entry(struct trace_event_call *call) { struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
- return kprobe_on_func_entry(tk->rp.kp.addr, + return (kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.addr ? NULL : tk->rp.kp.symbol_name, - tk->rp.kp.addr ? 0 : tk->rp.kp.offset); + tk->rp.kp.addr ? 0 : tk->rp.kp.offset) == 0); }
bool trace_kprobe_error_injectable(struct trace_event_call *call)
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.176 commit a19749a5fbd97f2147a8768813d084d584d9e045
--------------------------------
commit 7e0a9220467dbcfdc5bc62825724f3e52e50ab31 upstream.
On some archs, the idle task can call into cpu_suspend(). The cpu_suspend() will disable or pause function graph tracing, as there are some paths in bringing down the CPU that can have issues with their return addresses being modified. The task_struct structure has a "tracing_graph_pause" atomic counter; when it is set to something other than zero, the function graph tracer will not modify the return address.
The problem is that the tracing_graph_pause counter is initialized when the function graph tracer is enabled. This can corrupt the counter for the idle task if it is suspended in these architectures.
CPU 1                                  CPU 2
-----                                  -----
  do_idle()
    cpu_suspend()
      pause_graph_tracing()
        task_struct->tracing_graph_pause++ (0 -> 1)

                                       start_graph_tracing()
                                         for_each_online_cpu(cpu) {
                                           ftrace_graph_init_idle_task(cpu)
                                             task_struct->tracing_graph_pause = 0 (1 -> 0)

      unpause_graph_tracing()
        task_struct->tracing_graph_pause-- (0 -> -1)
The above should have gone from 1 to zero, and enabled function graph tracing again. But instead, it is set to -1, which keeps it disabled.
There's no reason that the tracing_graph_pause field on the task_struct cannot be initialized at boot up.
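For reference, a minimal sketch of the counter's semantics (paraphrasing the helpers in include/linux/ftrace.h; graph_tracing_paused() is an invented name used only for illustration). Any nonzero value of the counter, including the corrupted -1 above, keeps the graph tracer from touching return addresses:

	/* Paraphrased illustration, not the literal kernel source. */
	static inline void pause_graph_tracing(void)
	{
		atomic_inc(&current->tracing_graph_pause);
	}

	static inline void unpause_graph_tracing(void)
	{
		atomic_dec(&current->tracing_graph_pause);
	}

	/* The function graph entry path effectively does: */
	static inline bool graph_tracing_paused(void)
	{
		return atomic_read(&current->tracing_graph_pause) != 0;
	}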
Cc: stable@vger.kernel.org Fixes: 380c4b1411ccd ("tracing/function-graph-tracer: append the tracing_graph_flag") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211339 Reported-by: pierre.gondois@arm.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- init/init_task.c | 3 ++- kernel/trace/ftrace.c | 2 -- 2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/init/init_task.c b/init/init_task.c index 5aebe3be4d7c..994ffe018120 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -168,7 +168,8 @@ struct task_struct init_task .lockdep_recursion = 0, #endif #ifdef CONFIG_FUNCTION_GRAPH_TRACER - .ret_stack = NULL, + .ret_stack = NULL, + .tracing_graph_pause = ATOMIC_INIT(0), #endif #if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPT) .trace_recursion = 0, diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 85fe273cadae..e214ef5003e4 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -6867,7 +6867,6 @@ static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list) }
if (t->ret_stack == NULL) { - atomic_set(&t->tracing_graph_pause, 0); atomic_set(&t->trace_overrun, 0); t->curr_ret_stack = -1; t->curr_ret_depth = -1; @@ -7080,7 +7079,6 @@ static DEFINE_PER_CPU(struct ftrace_ret_stack *, idle_ret_stack); static void graph_init_task(struct task_struct *t, struct ftrace_ret_stack *ret_stack) { - atomic_set(&t->tracing_graph_pause, 0); atomic_set(&t->trace_overrun, 0); t->ftrace_timestamp = 0; /* make curr_ret_stack visible before we add the ret_stack */
From: Dave Wysochanski dwysocha@redhat.com
stable inclusion from linux-4.19.176 commit b5c6efba2b0d35b0f8920d93c5510ea029d17976
--------------------------------
[ Upstream commit ba6dfce47c4d002d96cd02a304132fca76981172 ]
Remove duplicated helper functions to parse opaque XDR objects and place inside new file net/sunrpc/auth_gss/auth_gss_internal.h. In the new file carry the license and copyright from the source file net/sunrpc/auth_gss/auth_gss.c. Finally, update the comment inside include/linux/sunrpc/xdr.h since lockd is not the only user of struct xdr_netobj.
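For orientation, a hedged caller-side sketch of the shared helpers (the local names below are invented for illustration; the real callers live in auth_gss.c and gss_krb5_mech.c):

	u32 window_size;
	struct xdr_netobj token = { };

	/* read a fixed-size field, then a length-prefixed opaque object */
	p = simple_get_bytes(p, end, &window_size, sizeof(window_size));
	if (IS_ERR(p))
		return PTR_ERR(p);

	p = simple_get_netobj(p, end, &token);  /* kmemdup()s token.len bytes */
	if (IS_ERR(p))
		return PTR_ERR(p);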
Signed-off-by: Dave Wysochanski dwysocha@redhat.com Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sunrpc/xdr.h | 3 +- net/sunrpc/auth_gss/auth_gss.c | 30 +----------------- net/sunrpc/auth_gss/auth_gss_internal.h | 42 +++++++++++++++++++++++++ net/sunrpc/auth_gss/gss_krb5_mech.c | 31 ++---------------- 4 files changed, 46 insertions(+), 60 deletions(-) create mode 100644 net/sunrpc/auth_gss/auth_gss_internal.h
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 2bd68177a442..33580cc72a43 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -26,8 +26,7 @@ struct rpc_rqst; #define XDR_QUADLEN(l) (((l) + 3) >> 2)
/* - * Generic opaque `network object.' At the kernel level, this type - * is used only by lockd. + * Generic opaque `network object.' */ #define XDR_MAX_NETOBJ 1024 struct xdr_netobj { diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c index 735109a3f312..cdc8a74dcaf8 100644 --- a/net/sunrpc/auth_gss/auth_gss.c +++ b/net/sunrpc/auth_gss/auth_gss.c @@ -53,6 +53,7 @@ #include <linux/uaccess.h> #include <linux/hashtable.h>
+#include "auth_gss_internal.h" #include "../netns.h"
static const struct rpc_authops authgss_ops; @@ -147,35 +148,6 @@ gss_cred_set_ctx(struct rpc_cred *cred, struct gss_cl_ctx *ctx) clear_bit(RPCAUTH_CRED_NEW, &cred->cr_flags); }
-static const void * -simple_get_bytes(const void *p, const void *end, void *res, size_t len) -{ - const void *q = (const void *)((const char *)p + len); - if (unlikely(q > end || q < p)) - return ERR_PTR(-EFAULT); - memcpy(res, p, len); - return q; -} - -static inline const void * -simple_get_netobj(const void *p, const void *end, struct xdr_netobj *dest) -{ - const void *q; - unsigned int len; - - p = simple_get_bytes(p, end, &len, sizeof(len)); - if (IS_ERR(p)) - return p; - q = (const void *)((const char *)p + len); - if (unlikely(q > end || q < p)) - return ERR_PTR(-EFAULT); - dest->data = kmemdup(p, len, GFP_NOFS); - if (unlikely(dest->data == NULL)) - return ERR_PTR(-ENOMEM); - dest->len = len; - return q; -} - static struct gss_cl_ctx * gss_cred_get_ctx(struct rpc_cred *cred) { diff --git a/net/sunrpc/auth_gss/auth_gss_internal.h b/net/sunrpc/auth_gss/auth_gss_internal.h new file mode 100644 index 000000000000..c5603242b54b --- /dev/null +++ b/net/sunrpc/auth_gss/auth_gss_internal.h @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: BSD-3-Clause +/* + * linux/net/sunrpc/auth_gss/auth_gss_internal.h + * + * Internal definitions for RPCSEC_GSS client authentication + * + * Copyright (c) 2000 The Regents of the University of Michigan. + * All rights reserved. + * + */ +#include <linux/err.h> +#include <linux/string.h> +#include <linux/sunrpc/xdr.h> + +static inline const void * +simple_get_bytes(const void *p, const void *end, void *res, size_t len) +{ + const void *q = (const void *)((const char *)p + len); + if (unlikely(q > end || q < p)) + return ERR_PTR(-EFAULT); + memcpy(res, p, len); + return q; +} + +static inline const void * +simple_get_netobj(const void *p, const void *end, struct xdr_netobj *dest) +{ + const void *q; + unsigned int len; + + p = simple_get_bytes(p, end, &len, sizeof(len)); + if (IS_ERR(p)) + return p; + q = (const void *)((const char *)p + len); + if (unlikely(q > end || q < p)) + return ERR_PTR(-EFAULT); + dest->data = kmemdup(p, len, GFP_NOFS); + if (unlikely(dest->data == NULL)) + return ERR_PTR(-ENOMEM); + dest->len = len; + return q; +} diff --git a/net/sunrpc/auth_gss/gss_krb5_mech.c b/net/sunrpc/auth_gss/gss_krb5_mech.c index 7bb2514aadd9..14f2823ad6c2 100644 --- a/net/sunrpc/auth_gss/gss_krb5_mech.c +++ b/net/sunrpc/auth_gss/gss_krb5_mech.c @@ -46,6 +46,8 @@ #include <linux/sunrpc/xdr.h> #include <linux/sunrpc/gss_krb5_enctypes.h>
+#include "auth_gss_internal.h" + #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) # define RPCDBG_FACILITY RPCDBG_AUTH #endif @@ -187,35 +189,6 @@ get_gss_krb5_enctype(int etype) return NULL; }
-static const void * -simple_get_bytes(const void *p, const void *end, void *res, int len) -{ - const void *q = (const void *)((const char *)p + len); - if (unlikely(q > end || q < p)) - return ERR_PTR(-EFAULT); - memcpy(res, p, len); - return q; -} - -static const void * -simple_get_netobj(const void *p, const void *end, struct xdr_netobj *res) -{ - const void *q; - unsigned int len; - - p = simple_get_bytes(p, end, &len, sizeof(len)); - if (IS_ERR(p)) - return p; - q = (const void *)((const char *)p + len); - if (unlikely(q > end || q < p)) - return ERR_PTR(-EFAULT); - res->data = kmemdup(p, len, GFP_NOFS); - if (unlikely(res->data == NULL)) - return ERR_PTR(-ENOMEM); - res->len = len; - return q; -} - static inline const void * get_key(const void *p, const void *end, struct krb5_ctx *ctx, struct crypto_skcipher **res)
From: Dave Wysochanski dwysocha@redhat.com
stable inclusion from linux-4.19.176 commit 9848962bfb28515c476bcf34a8ec9655ccee08e4
--------------------------------
[ Upstream commit e4a7d1f7707eb44fd953a31dd59eff82009d879c ]
When handling an auth_gss downcall, it's possible to get 0-length opaque object for the acceptor. In the case of a 0-length XDR object, make sure simple_get_netobj() fills in dest->data = NULL, and does not continue to kmemdup() which will set dest->data = ZERO_SIZE_PTR for the acceptor.
The trace event code can handle NULL but not ZERO_SIZE_PTR for a string, and so without this patch the rpcgss_context trace event will crash the kernel as follows:
[ 162.887992] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 162.898693] #PF: supervisor read access in kernel mode
[ 162.900830] #PF: error_code(0x0000) - not-present page
[ 162.902940] PGD 0 P4D 0
[ 162.904027] Oops: 0000 [#1] SMP PTI
[ 162.905493] CPU: 4 PID: 4321 Comm: rpc.gssd Kdump: loaded Not tainted 5.10.0 #133
[ 162.908548] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 162.910978] RIP: 0010:strlen+0x0/0x20
[ 162.912505] Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[ 162.920101] RSP: 0018:ffffaec900c77d90 EFLAGS: 00010202
[ 162.922263] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000fffde697
[ 162.925158] RDX: 000000000000002f RSI: 0000000000000080 RDI: 0000000000000010
[ 162.928073] RBP: 0000000000000010 R08: 0000000000000e10 R09: 0000000000000000
[ 162.930976] R10: ffff8e698a590cb8 R11: 0000000000000001 R12: 0000000000000e10
[ 162.933883] R13: 00000000fffde697 R14: 000000010034d517 R15: 0000000000070028
[ 162.936777] FS: 00007f1e1eb93700(0000) GS:ffff8e6ab7d00000(0000) knlGS:0000000000000000
[ 162.940067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 162.942417] CR2: 0000000000000010 CR3: 0000000104eba000 CR4: 00000000000406e0
[ 162.945300] Call Trace:
[ 162.946428] trace_event_raw_event_rpcgss_context+0x84/0x140 [auth_rpcgss]
[ 162.949308] ? __kmalloc_track_caller+0x35/0x5a0
[ 162.951224] ? gss_pipe_downcall+0x3a3/0x6a0 [auth_rpcgss]
[ 162.953484] gss_pipe_downcall+0x585/0x6a0 [auth_rpcgss]
[ 162.955953] rpc_pipe_write+0x58/0x70 [sunrpc]
[ 162.957849] vfs_write+0xcb/0x2c0
[ 162.959264] ksys_write+0x68/0xe0
[ 162.960706] do_syscall_64+0x33/0x40
[ 162.962238] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 162.964346] RIP: 0033:0x7f1e1f1e57df
Signed-off-by: Dave Wysochanski dwysocha@redhat.com Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/sunrpc/auth_gss/auth_gss_internal.h | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/sunrpc/auth_gss/auth_gss_internal.h b/net/sunrpc/auth_gss/auth_gss_internal.h index c5603242b54b..f6d9631bd9d0 100644 --- a/net/sunrpc/auth_gss/auth_gss_internal.h +++ b/net/sunrpc/auth_gss/auth_gss_internal.h @@ -34,9 +34,12 @@ simple_get_netobj(const void *p, const void *end, struct xdr_netobj *dest) q = (const void *)((const char *)p + len); if (unlikely(q > end || q < p)) return ERR_PTR(-EFAULT); - dest->data = kmemdup(p, len, GFP_NOFS); - if (unlikely(dest->data == NULL)) - return ERR_PTR(-ENOMEM); + if (len) { + dest->data = kmemdup(p, len, GFP_NOFS); + if (unlikely(dest->data == NULL)) + return ERR_PTR(-ENOMEM); + } else + dest->data = NULL; dest->len = len; return q; }
From: Ming Lei ming.lei@redhat.com
stable inclusion from linux-4.19.176 commit 44b07600cec4df051b7ecbd6831febdeca252e54
--------------------------------
commit c48dac137a62a5d6fa1ef3fa445cbd9c43655a76 upstream.
The original comment says:
q->sysfs_lock must be held to provide mutual exclusion between elevator_switch() and here.
Which is simply wrong. elevator_init_mq() is only called from blk_mq_init_allocated_queue, which is always called before the request queue is registered via blk_register_queue(), for dm-rq or normal rq based driver. However, queue's kobject is only exposed and added to sysfs in blk_register_queue(). So there isn't such race between elevator_switch() and elevator_init_mq().
So avoid holding q->sysfs_lock in elevator_init_mq().
Cc: Christoph Hellwig hch@infradead.org Cc: Hannes Reinecke hare@suse.com Cc: Greg KH gregkh@linuxfoundation.org Cc: Mike Snitzer snitzer@redhat.com Cc: Bart Van Assche bvanassche@acm.org Cc: Damien Le Moal damien.lemoal@wdc.com Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@cloud.ionos.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/elevator.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c index fae58b2f906f..ddbcd36616a8 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -980,23 +980,19 @@ int elevator_init_mq(struct request_queue *q) if (q->nr_hw_queues != 1) return 0;
- /* - * q->sysfs_lock must be held to provide mutual exclusion between - * elevator_switch() and here. - */ - mutex_lock(&q->sysfs_lock); + WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)); + if (unlikely(q->elevator)) - goto out_unlock; + goto out;
e = elevator_get(q, "mq-deadline", false); if (!e) - goto out_unlock; + goto out;
err = blk_mq_init_sched(q, e); if (err) elevator_put(e); -out_unlock: - mutex_unlock(&q->sysfs_lock); +out: return err; }
From: Ming Lei ming.lei@redhat.com
stable inclusion from linux-4.19.176 commit 6ff18507a7c165b90f1fe51822c65d10265f5714
--------------------------------
commit c6ba933358f0d7a6a042b894dba20cc70396a6d3 upstream.
blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues(). For the former caller, the kobject isn't exposed to userspace yet. For the latter caller, hctx sysfs entries and debugfs are un-registered before updating nr_hw_queues.
On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed") moves freeing hctx into the queue's release handler, so there won't be a race with the queue release path either.
So don't hold q->sysfs_lock in blk_mq_map_swqueue().
Cc: Christoph Hellwig hch@infradead.org Cc: Hannes Reinecke hare@suse.com Cc: Greg KH gregkh@linuxfoundation.org Cc: Mike Snitzer snitzer@redhat.com Cc: Bart Van Assche bvanassche@acm.org Reviewed-by: Bart Van Assche bvanassche@acm.org Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@cloud.ionos.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-mq.c | 7 ------- 1 file changed, 7 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index 795ed40ce415..b5c2a4f65402 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2417,11 +2417,6 @@ static void blk_mq_map_swqueue(struct request_queue *q) struct blk_mq_ctx *ctx; struct blk_mq_tag_set *set = q->tag_set;
- /* - * Avoid others reading imcomplete hctx->cpumask through sysfs - */ - mutex_lock(&q->sysfs_lock); - queue_for_each_hw_ctx(q, hctx, i) { cpumask_clear(hctx->cpumask); hctx->nr_ctx = 0; @@ -2455,8 +2450,6 @@ static void blk_mq_map_swqueue(struct request_queue *q) hctx->ctxs[hctx->nr_ctx++] = ctx; }
- mutex_unlock(&q->sysfs_lock); - queue_for_each_hw_ctx(q, hctx, i) { /* * If no software queues are mapped to this hardware queue,
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.177 commit 334d216dfe0700f9e043e6f29dcf20b8f0d13558
--------------------------------
commit 256cfdd6fdf70c6fcf0f7c8ddb0ebd73ce8f3bc9 upstream.
The file /sys/kernel/tracing/events/enable is used to enable all events by echoing in "1", or to disable all events by echoing in "0". To know whether all events are enabled, all are disabled, or only some are enabled, catting the file should show either "1" (all enabled), "0" (all disabled), or "X" (some enabled but not all of them). This works the same as the "enable" files in the individual system directories (like tracing/events/sched/enable).
But when all events are enabled, the top level "enable" file shows "X". The reason is that it's checking the "ftrace" events, which are special events that only exist for their format files. These include the formats for the function tracer events, which are enabled when the function tracer is enabled, but not by the "enable" file. The check includes these events, which will always be disabled, and so even though all true events are enabled, the top level "enable" file will show "X" instead of "1".
To fix this, have the check test the event's flags to see if it has the "IGNORE_ENABLE" flag set, and if so, not test it.
Cc: stable@vger.kernel.org Fixes: 553552ce1796c ("tracing: Combine event filter_active and enable into single flags field") Reported-by: "Yordan Karadzhov (VMware)" y.karadz@gmail.com Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/trace_events.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index e90135eca580..bb4c6ac51a6d 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -1111,7 +1111,8 @@ system_enable_read(struct file *filp, char __user *ubuf, size_t cnt, mutex_lock(&event_mutex); list_for_each_entry(file, &tr->events, list) { call = file->event_call; - if (!trace_event_name(call) || !call->class || !call->class->reg) + if ((call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) || + !trace_event_name(call) || !call->class || !call->class->reg) continue;
if (system && strcmp(call->class->system, system->name) != 0)
From: "Steven Rostedt (VMware)" rostedt@goodmis.org
stable inclusion from linux-4.19.177 commit 0572fc6a510add9029b113239eaabf4b5bce8ec9
--------------------------------
commit b220c049d5196dd94d992dd2dc8cba1a5e6123bf upstream.
When filters are used by trace events, a page is allocated on each CPU and used to copy the trace event fields to this page before writing to the ring buffer. The reason to use the filter and not write directly into the ring buffer is because a filter may discard the event and there's more overhead on discarding from the ring buffer than the extra copy.
The problem here is that there is no check against the size being allocated when using this page. If an event asks for more than a page size while being filtered, it will get only a page, leading to the caller writing more than what was allocated.
Check the length of the request, and if it is more than PAGE_SIZE minus the header, fall back to allocating from the ring buffer directly. The ring buffer may reject the event if it's too big anyway, but it won't overflow.
Link: https://lore.kernel.org/ath10k/1612839593-2308-1-git-send-email-wgong@codeau...
Cc: stable@vger.kernel.org Fixes: 0fc1b09ff1ff4 ("tracing: Use temp buffer when filtering events") Reported-by: Wen Gong wgong@codeaurora.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index fd5415230e4a..b21986839b9f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2290,7 +2290,7 @@ trace_event_buffer_lock_reserve(struct ring_buffer **current_rb, (entry = this_cpu_read(trace_buffered_event))) { /* Try to use the per cpu buffer first */ val = this_cpu_inc_return(trace_buffered_event_cnt); - if (val == 1) { + if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) { trace_event_setup(entry, type, flags, pc); entry->array[0] = len; return entry;
From: Miklos Szeredi mszeredi@redhat.com
stable inclusion from linux-4.19.177 commit f1cb5984909ce2737caa205b63509c59141bcf49
--------------------------------
[ Upstream commit 554677b97257b0b69378bd74e521edb7e94769ff ]
The vfs_getxattr() in ovl_xattr_set() is used to check whether an xattr exist on a lower layer file that is to be removed. If the xattr does not exist, then no need to copy up the file.
This call of vfs_getxattr() wasn't wrapped in a credential override, and this is probably okay. But for consistency, wrap this instance as well.
Reported-by: "Eric W. Biederman" ebiederm@xmission.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/overlayfs/inode.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c index 252a3c6b4744..d1e9d926150b 100644 --- a/fs/overlayfs/inode.c +++ b/fs/overlayfs/inode.c @@ -329,7 +329,9 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name, goto out;
if (!value && !upperdentry) { + old_cred = ovl_override_creds(dentry->d_sb); err = vfs_getxattr(realdentry, name, NULL, 0); + revert_creds(old_cred); if (err < 0) goto out_drop_write; }
From: Miklos Szeredi mszeredi@redhat.com
stable inclusion from linux-4.19.177 commit 31a8d90f7cda828e1b48d8eb40ec1c6345ef5b7e
--------------------------------
[ Upstream commit f2b00be488730522d0fb7a8a5de663febdcefe0a ]
If a capability is stored on disk in v2 format, cap_inode_getsecurity() will currently return it in v2 format unconditionally.
This is wrong: a v2 cap should be equivalent to a v3 cap with zero rootid, and so the same conversions should be performed on it.
If the rootid cannot be mapped, v3 is returned unconverted. Fix this so that both v2 and v3 return -EOVERFLOW if the rootid (or the owner of the fs user namespace in case of v2) cannot be mapped into the current user namespace.
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Acked-by: "Eric W. Biederman" ebiederm@xmission.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- security/commoncap.c | 67 ++++++++++++++++++++++++++++---------------- 1 file changed, 43 insertions(+), 24 deletions(-)
diff --git a/security/commoncap.c b/security/commoncap.c index 4ab1c37f2d19..2b221c0ef502 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -378,10 +378,11 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer, { int size, ret; kuid_t kroot; + u32 nsmagic, magic; uid_t root, mappedroot; char *tmpbuf = NULL; struct vfs_cap_data *cap; - struct vfs_ns_cap_data *nscap; + struct vfs_ns_cap_data *nscap = NULL; struct dentry *dentry; struct user_namespace *fs_ns;
@@ -403,46 +404,61 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer, fs_ns = inode->i_sb->s_user_ns; cap = (struct vfs_cap_data *) tmpbuf; if (is_v2header((size_t) ret, cap)) { - /* If this is sizeof(vfs_cap_data) then we're ok with the - * on-disk value, so return that. */ - if (alloc) - *buffer = tmpbuf; - else - kfree(tmpbuf); - return ret; - } else if (!is_v3header((size_t) ret, cap)) { - kfree(tmpbuf); - return -EINVAL; + root = 0; + } else if (is_v3header((size_t) ret, cap)) { + nscap = (struct vfs_ns_cap_data *) tmpbuf; + root = le32_to_cpu(nscap->rootid); + } else { + size = -EINVAL; + goto out_free; }
- nscap = (struct vfs_ns_cap_data *) tmpbuf; - root = le32_to_cpu(nscap->rootid); kroot = make_kuid(fs_ns, root);
/* If the root kuid maps to a valid uid in current ns, then return * this as a nscap. */ mappedroot = from_kuid(current_user_ns(), kroot); if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) { + size = sizeof(struct vfs_ns_cap_data); if (alloc) { - *buffer = tmpbuf; + if (!nscap) { + /* v2 -> v3 conversion */ + nscap = kzalloc(size, GFP_ATOMIC); + if (!nscap) { + size = -ENOMEM; + goto out_free; + } + nsmagic = VFS_CAP_REVISION_3; + magic = le32_to_cpu(cap->magic_etc); + if (magic & VFS_CAP_FLAGS_EFFECTIVE) + nsmagic |= VFS_CAP_FLAGS_EFFECTIVE; + memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32); + nscap->magic_etc = cpu_to_le32(nsmagic); + } else { + /* use allocated v3 buffer */ + tmpbuf = NULL; + } nscap->rootid = cpu_to_le32(mappedroot); - } else - kfree(tmpbuf); - return size; + *buffer = nscap; + } + goto out_free; }
if (!rootid_owns_currentns(kroot)) { - kfree(tmpbuf); - return -EOPNOTSUPP; + size = -EOVERFLOW; + goto out_free; }
/* This comes from a parent namespace. Return as a v2 capability */ size = sizeof(struct vfs_cap_data); if (alloc) { - *buffer = kmalloc(size, GFP_ATOMIC); - if (*buffer) { - struct vfs_cap_data *cap = *buffer; - __le32 nsmagic, magic; + if (nscap) { + /* v3 -> v2 conversion */ + cap = kzalloc(size, GFP_ATOMIC); + if (!cap) { + size = -ENOMEM; + goto out_free; + } magic = VFS_CAP_REVISION_2; nsmagic = le32_to_cpu(nscap->magic_etc); if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE) @@ -450,9 +466,12 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer, memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32); cap->magic_etc = cpu_to_le32(magic); } else { - size = -ENOMEM; + /* use unconverted v2 */ + tmpbuf = NULL; } + *buffer = cap; } +out_free: kfree(tmpbuf); return size; }
From: Amir Goldstein amir73il@gmail.com
stable inclusion from linux-4.19.177 commit 45f86fe44ef2d4e85c0ea795ec438e9761e04ad9
--------------------------------
[ Upstream commit 03fedf93593c82538b18476d8c4f0e8f8435ea70 ]
When an inode has no listxattr op of its own (e.g. squashfs), vfs_listxattr calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will intercept in inode_getxattr hooks.
When selinux LSM is installed but not initialized, it will list the security.selinux xattr in inode_listsecurity, but will not intercept it in inode_getxattr. This results in -ENODATA for a getxattr call for an xattr returned by listxattr.
This situation was manifested as overlayfs failure to copy up lower files from squashfs when selinux is built-in but not initialized, because ovl_copy_xattr() iterates the lower inode xattrs by vfs_listxattr() and vfs_getxattr().
ovl_copy_xattr() skips copy up of security labels that are identified by the inode_copy_up_xattr LSM hooks, but it does that after vfs_getxattr(). Since we are not going to copy them, skip vfs_getxattr() of the security labels.
Reported-by: Michael Labriola michael.d.labriola@gmail.com Tested-by: Michael Labriola michael.d.labriola@gmail.com Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.o... Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/overlayfs/copy_up.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index 6eb0b882ad23..e164f489d01d 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -79,6 +79,14 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
if (ovl_is_private_xattr(name)) continue; + + error = security_inode_copy_up_xattr(name); + if (error < 0 && error != -EOPNOTSUPP) + break; + if (error == 1) { + error = 0; + continue; /* Discard */ + } retry: size = vfs_getxattr(old, name, value, value_size); if (size == -ERANGE) @@ -102,13 +110,6 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new) goto retry; }
- error = security_inode_copy_up_xattr(name); - if (error < 0 && error != -EOPNOTSUPP) - break; - if (error == 1) { - error = 0; - continue; /* Discard */ - } error = vfs_setxattr(new, name, value, size, 0); if (error) break;
From: Lin Feng linf@wangsu.com
stable inclusion from linux-4.19.177 commit c869f173158aaf01a7445212420356247a8974b7
--------------------------------
[ Upstream commit 388c705b95f23f317fa43e6abf9ff07b583b721a ]
This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
bfq.limit_depth passes word_depths[] as shallow_depth down to the sbitmap core's sbitmap_get_shallow(), which uses just that number to limit the scan depth of each bitmap word, per the formula: scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%
That means the comment's percentiles of 50%, 75%, 18% and 37% in bfq are correct. But after the patch 'bfq: Fix computation of shallow depth', we use sbitmap.depth instead, as in the following example:
sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64. The results of the computed bfqd->word_depths[] are {128, 192, 48, 96}, and three of the numbers exceed the core driver's 'sbitmap_word.depth=64' limit, so that limit has no effect.
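For reference, the arithmetic behind those numbers (illustrative only, derived from the shifts in bfq_update_depths() for the case above):

	/* Reverted (buggy) computation, based on bt->sb.depth = 256:
	 *   word_depths[0][0] = 256 >> 1       = 128
	 *   word_depths[0][1] = (256 * 3) >> 2 = 192
	 *   word_depths[1][0] = (256 * 3) >> 4 =  48
	 *   word_depths[1][1] = (256 * 6) >> 4 =  96
	 * 128, 192 and 96 all exceed sbitmap_word.depth = 64, so those
	 * limits are never reached.
	 *
	 * Restored computation, based on 1U << bt->sb.shift = 64:
	 *   word_depths[0][0] = 64 >> 1        = 32   (~50%)
	 *   word_depths[0][1] = (64 * 3) >> 2  = 48   (~75%)
	 *   word_depths[1][0] = (64 * 3) >> 4  = 12   (~18%)
	 *   word_depths[1][1] = (64 * 6) >> 4  = 24   (~37%)
	 */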
Signed-off-by: Lin Feng linf@wangsu.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/bfq-iosched.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 0c2c0cd8546c..97b033b00d4e 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -5283,13 +5283,13 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd, * limit 'something'. */ /* no more than 50% of tags for async I/O */ - bfqd->word_depths[0][0] = max(bt->sb.depth >> 1, 1U); + bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U); /* * no more than 75% of tags for sync writes (25% extra tags * w.r.t. async I/O, to prevent async I/O from starving sync * writes) */ - bfqd->word_depths[0][1] = max((bt->sb.depth * 3) >> 2, 1U); + bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
/* * In-word depths in case some bfq_queue is being weight- @@ -5299,9 +5299,9 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd, * shortage. */ /* no more than ~18% of tags for async I/O */ - bfqd->word_depths[1][0] = max((bt->sb.depth * 3) >> 4, 1U); + bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U); /* no more than ~37% of tags for sync writes (~20% extra tags) */ - bfqd->word_depths[1][1] = max((bt->sb.depth * 6) >> 4, 1U); + bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
for (i = 0; i < 2; i++) for (j = 0; j < 2; j++)
From: Edwin Peer edwin.peer@broadcom.com
stable inclusion from linux-4.19.177 commit bb7addea77224d0c6e471f6b7290f5b609416319
--------------------------------
commit 3aa6bce9af0e25b735c9c1263739a5639a336ae8 upstream.
Prevent netif_tx_disable() running concurrently with dev_watchdog() by taking the device global xmit lock. Otherwise, the recommended:
netif_carrier_off(dev); netif_tx_disable(dev);
driver shutdown sequence can happen after the watchdog has already checked carrier, resulting in possible false alarms. This is because netif_tx_lock() only sets the frozen bit without maintaining the locks on the individual queues.
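For context, a hedged sketch of that shutdown sequence in a hypothetical driver's ndo_stop (foo_close is an invented name); with this change, netif_tx_disable() takes dev->tx_global_lock, so dev_watchdog() cannot run between the two calls and raise a false tx timeout:

	static int foo_close(struct net_device *dev)
	{
		netif_carrier_off(dev);   /* watchdog sees carrier down */
		netif_tx_disable(dev);    /* all queues stopped, now serialized
					   * against dev_watchdog() by the
					   * device global xmit lock */

		/* driver-specific teardown: stop DMA, free rings, ... */
		return 0;
	}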
Fixes: c3f26a269c24 ("netdev: Fix lockdep warnings in multiqueue configurations.") Signed-off-by: Edwin Peer edwin.peer@broadcom.com Reviewed-by: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/netdevice.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2b1197c83a5e..61ad607027f3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -4097,6 +4097,7 @@ static inline void netif_tx_disable(struct net_device *dev)
local_bh_disable(); cpu = smp_processor_id(); + spin_lock(&dev->tx_global_lock); for (i = 0; i < dev->num_tx_queues; i++) { struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
@@ -4104,6 +4105,7 @@ static inline void netif_tx_disable(struct net_device *dev) netif_tx_stop_queue(txq); __netif_tx_unlock(txq); } + spin_unlock(&dev->tx_global_lock); local_bh_enable(); }
From: Miklos Szeredi mszeredi@redhat.com
stable inclusion from linux-4.19.177 commit 1d775f15e900453c78eb43f5b9ff0bc10f229119
--------------------------------
commit cef4cbff06fbc3be54d6d79ee139edecc2ee8598 upstream.
There was a syzbot report with this warning but insufficient information...
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/overlayfs/super.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index c5aa30aa902d..40fd013871a3 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -82,7 +82,7 @@ static void ovl_dentry_release(struct dentry *dentry) static struct dentry *ovl_d_real(struct dentry *dentry, const struct inode *inode) { - struct dentry *real; + struct dentry *real = NULL, *lower;
/* It's an overlay file */ if (inode && d_inode(dentry) == inode) @@ -101,9 +101,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry, if (real && !inode && ovl_has_upperdata(d_inode(dentry))) return real;
- real = ovl_dentry_lowerdata(dentry); - if (!real) + lower = ovl_dentry_lowerdata(dentry); + if (!lower) goto bug; + real = lower;
/* Handle recursion */ real = d_real(real, inode); @@ -111,8 +112,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry, if (!inode || inode == d_inode(real)) return real; bug: - WARN(1, "ovl_d_real(%pd4, %s:%lu): real dentry not found\n", dentry, - inode ? inode->i_sb->s_id : "NULL", inode ? inode->i_ino : 0); + WARN(1, "%s(%pd4, %s:%lu): real dentry (%p/%lu) not found\n", + __func__, dentry, inode ? inode->i_sb->s_id : "NULL", + inode ? inode->i_ino : 0, real, + real && d_inode(real) ? d_inode(real)->i_ino : 0); return dentry; }
From: Alexey Gladkov gladkov.alexey@gmail.com
mainline inclusion from mainline-v5.2-rc1 commit 898490c010b5d2e499e03b7e815fc214209ac583 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I390TB CVE: NA
This patch is backported to fix an issue that happens when the 5.10 kernel is built on an openEuler 20.03 system.
On openEuler 20.03, some modules were built into the kernel. But in the 5.10 kernel, those corresponding modules will be built as KOs. For built-in modules, kmod 2.7+ fetches the modinfo from modules.builtin.modinfo, which is only generated by kernel 5.2+.
In the rpmbuild process, the kernel spec calls kmod to query the modinfo of the KO images. This fails because the file is missing.
With the mainline commit below backported, kmod can fetch any module's information from either the corresponding module image or modules.builtin.modinfo.
---------------------------
Problem:
When a kernel module is compiled as a separate module, some important information about the kernel module is available via .modinfo section of the module. In contrast, when the kernel module is compiled into the kernel, that information is not available.
Information about built-in modules is necessary in the following cases:
1. When it is necessary to find out what additional parameters can be passed to the kernel at boot time.
2. When you need to know which module names and their aliases are in the kernel. This is very useful for creating an initrd image.
Proposal:
At build time, the proposed patch keeps the .modinfo section with module information in the vmlinux object and saves it into a separate file during kernel linking, discarding it from the final vmlinux. So the kernel does not increase in size and no additional information remains in it. The information is stored in the same format as in separate modules (a null-terminated string array). Because the .modinfo section is already exported with separate modules, we are not creating a new API.
It can be easily read in the userspace:
$ tr '\0' '\n' < modules.builtin.modinfo
ext4.softdep=pre: crc32c
ext4.license=GPL
ext4.description=Fourth Extended Filesystem
ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
ext4.alias=fs-ext4
ext4.alias=ext3
ext4.alias=fs-ext3
ext4.alias=ext2
ext4.alias=fs-ext2
md_mod.alias=block-major-9-*
md_mod.alias=md
md_mod.description=MD RAID framework
md_mod.license=GPL
md_mod.parmtype=create_on_open:bool
md_mod.parmtype=start_dirty_degraded:int
...
Co-Developed-by: Gleb Fotengauer-Malinovskiy glebfm@altlinux.org Signed-off-by: Gleb Fotengauer-Malinovskiy glebfm@altlinux.org Signed-off-by: Alexey Gladkov gladkov.alexey@gmail.com Acked-by: Jessica Yu jeyu@kernel.org Signed-off-by: Masahiro Yamada yamada.masahiro@socionext.com Signed-off-by: Zhichang Yuan erik.yuan@arm.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- .gitignore | 1 + Documentation/dontdiff | 1 + Documentation/kbuild/kbuild.txt | 5 +++++ Makefile | 2 ++ include/asm-generic/vmlinux.lds.h | 1 + include/linux/module.h | 1 + include/linux/moduleparam.h | 12 +++++------- scripts/link-vmlinux.sh | 3 +++ 8 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/.gitignore b/.gitignore index 97ba6b79834c..e18bc625d330 100644 --- a/.gitignore +++ b/.gitignore @@ -57,6 +57,7 @@ modules.builtin /vmlinuz /System.map /Module.markers +/modules.builtin.modinfo
# # RPM spec file (make rpm-pkg) diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 2228fcc8e29f..3d4d5a402b8b 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -179,6 +179,7 @@ mktables mktree modpost modules.builtin +modules.builtin.modinfo modules.order modversions.h* nconf diff --git a/Documentation/kbuild/kbuild.txt b/Documentation/kbuild/kbuild.txt index 8390c360d4b3..7f48e48f3fd2 100644 --- a/Documentation/kbuild/kbuild.txt +++ b/Documentation/kbuild/kbuild.txt @@ -11,6 +11,11 @@ modules.builtin This file lists all modules that are built into the kernel. This is used by modprobe to not fail when trying to load something builtin.
+modules.builtin.modinfo +-------------------------------------------------- +This file contains modinfo from all modules that are built into the kernel. +Unlike modinfo of a separate module, all fields are prefixed with module name. +
Environment variables
diff --git a/Makefile b/Makefile index 39bdbafae517..2cad844497b0 100644 --- a/Makefile +++ b/Makefile @@ -1263,6 +1263,7 @@ _modinst_: fi @cp -f $(objtree)/modules.order $(MODLIB)/ @cp -f $(objtree)/modules.builtin $(MODLIB)/ + @cp -f $(objtree)/modules.builtin.modinfo $(MODLIB)/ $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modinst
# This depmod is only for convenience to give the initial @@ -1303,6 +1304,7 @@ endif # CONFIG_MODULES
# Directories & files removed with 'make clean' CLEAN_DIRS += $(MODVERDIR) include/ksym +CLEAN_FILES += modules.builtin.modinfo
# Directories & files removed with 'make mrproper' MRPROPER_DIRS += include/config usr/include include/generated \ diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index dd38c97933f1..ec5e0f1a6d97 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -843,6 +843,7 @@ EXIT_CALL \ *(.discard) \ *(.discard.*) \ + *(.modinfo) \ }
/** diff --git a/include/linux/module.h b/include/linux/module.h index e1f34183fe18..14b6e3b367b5 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -239,6 +239,7 @@ extern typeof(name) __mod_##type##__##name##_device_table \ #define MODULE_VERSION(_version) MODULE_INFO(version, _version) #else #define MODULE_VERSION(_version) \ + MODULE_INFO(version, _version); \ static struct module_version_attribute ___modver_attr = { \ .mattr = { \ .attr = { \ diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h index ba36506db4fb..5ba250d9172a 100644 --- a/include/linux/moduleparam.h +++ b/include/linux/moduleparam.h @@ -10,23 +10,21 @@ module name. */ #ifdef MODULE #define MODULE_PARAM_PREFIX /* empty */ +#define __MODULE_INFO_PREFIX /* empty */ #else #define MODULE_PARAM_PREFIX KBUILD_MODNAME "." +/* We cannot use MODULE_PARAM_PREFIX because some modules override it. */ +#define __MODULE_INFO_PREFIX KBUILD_MODNAME "." #endif
/* Chosen so that structs with an unsigned long line up. */ #define MAX_PARAM_PREFIX_LEN (64 - sizeof(unsigned long))
-#ifdef MODULE #define __MODULE_INFO(tag, name, info) \ static const char __UNIQUE_ID(name)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ - = __stringify(tag) "=" info -#else /* !MODULE */ -/* This struct is here for syntactic coherency, it is not used */ -#define __MODULE_INFO(tag, name, info) \ - struct __UNIQUE_ID(name) {} -#endif + = __MODULE_INFO_PREFIX __stringify(tag) "=" info + #define __MODULE_PARM_TYPE(name, _type) \ __MODULE_INFO(parmtype, name##type, #name ":" _type)
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh index c8cf45362bd6..c09e87e9c2b9 100755 --- a/scripts/link-vmlinux.sh +++ b/scripts/link-vmlinux.sh @@ -226,6 +226,9 @@ modpost_link vmlinux.o # modpost vmlinux.o to check for section mismatches ${MAKE} -f "${srctree}/scripts/Makefile.modpost" vmlinux.o
+info MODINFO modules.builtin.modinfo +${OBJCOPY} -j .modinfo -O binary vmlinux.o modules.builtin.modinfo + kallsymso="" kallsyms_vmlinux="" if [ -n "${CONFIG_KALLSYMS}" ]; then
From: liubo liubo254@huawei.com
euleros inclusion category: feature feature: etmem bugzilla: 49889
-------------------------------------------------
etmem, the memory vertical expansion technology, uses DRAM and new high-performance storage media to form multi-level memory storage. By grading the stored data, etmem migrates the data classified as cold to the high-performance storage medium, so as to expand memory capacity and reduce memory cost.
The etmem feature is mainly composed of two parts: etmem_scan and etmem_swap.
This patch is mainly used to generate etmem_scan.ko. etmem_scan.ko scans the virtual address space of the target process and returns the address access information to user mode, where it is used to grade hot and cold pages.
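As a hedged illustration of how user mode might drive this interface (assumptions: read offsets are page-aligned virtual addresses of the target process, and the returned bytes are the encoded page-type records defined in etmem_scan.h, which are not decoded here):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char path[64];
		unsigned char buf[4096];
		ssize_t n;
		int fd;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <pid> <start-vaddr-hex>\n", argv[0]);
			return 1;
		}

		snprintf(path, sizeof(path), "/proc/%s/idle_pages", argv[1]);
		fd = open(path, O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* seek to the page-aligned virtual address to start scanning at */
		n = pread(fd, buf, sizeof(buf), strtoul(argv[2], NULL, 16));
		if (n < 0)
			perror("pread");
		else
			printf("read %zd bytes of page access records\n", n);

		close(fd);
		return 0;
	}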
Signed-off-by: Fengguang Wu fengguang.wu@intel.com Signed-off-by: yanxiaodan yanxiaodan@huawei.com Signed-off-by: Feilong Lin linfeilong@huawei.com Signed-off-by: geruijun geruijun@huawei.com Signed-off-by: liubo liubo254@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/proc/Makefile | 1 + fs/proc/base.c | 2 + fs/proc/etmem_scan.c | 1046 ++++++++++++++++++++++++++++++++++++++ fs/proc/etmem_scan.h | 132 +++++ fs/proc/internal.h | 1 + fs/proc/task_mmu.c | 66 +++ include/linux/mm_types.h | 18 + lib/Kconfig | 6 + mm/pagewalk.c | 1 + virt/kvm/kvm_main.c | 6 + 10 files changed, 1279 insertions(+) create mode 100644 fs/proc/etmem_scan.c create mode 100644 fs/proc/etmem_scan.h
diff --git a/fs/proc/Makefile b/fs/proc/Makefile index ead487e80510..c1ebd017a83b 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -33,3 +33,4 @@ proc-$(CONFIG_PROC_KCORE) += kcore.o proc-$(CONFIG_PROC_VMCORE) += vmcore.o proc-$(CONFIG_PRINTK) += kmsg.o proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o +obj-$(CONFIG_ETMEM_SCAN) += etmem_scan.o diff --git a/fs/proc/base.c b/fs/proc/base.c index 5e705fa9a913..33ec3b6971b7 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2988,6 +2988,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), + REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), @@ -3373,6 +3374,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), + REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), diff --git a/fs/proc/etmem_scan.c b/fs/proc/etmem_scan.c new file mode 100644 index 000000000000..94bf125de607 --- /dev/null +++ b/fs/proc/etmem_scan.c @@ -0,0 +1,1046 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/pagemap.h> +#include <linux/mm.h> +#include <linux/hugetlb.h> +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/proc_fs.h> +#include <linux/uaccess.h> +#include <linux/kvm.h> +#include <linux/kvm_host.h> +#include <linux/bitmap.h> +#include <linux/sched/mm.h> +#include <linux/version.h> +#include <linux/module.h> +#include <linux/io.h> +#include <linux/uaccess.h> +#include <asm/cacheflush.h> +#include <asm/page.h> +#include <asm/pgalloc.h> +#include <asm/tlb.h> +#include <asm/pgtable.h> +#ifdef CONFIG_ARM64 +#include <asm/pgtable-types.h> +#include <asm/memory.h> +#include <asm/kvm_mmu.h> +#include <asm/kvm_arm.h> +#include <asm/stage2_pgtable.h> +#endif +#include "etmem_scan.h" + +#ifdef CONFIG_X86_64 +/* + * Fallback to false for kernel doens't support KVM_INVALID_SPTE + * ept_idle can sitll work in this situation but the scan accuracy may drop, + * depends on the access frequences of the workload. + */ +#ifdef KVM_INVALID_SPTE +#define KVM_CHECK_INVALID_SPTE(val) ((val) == KVM_INVALID_SPTE) +#else +#define KVM_CHECK_INVALID_SPTE(val) (0) +#endif + +# define kvm_arch_mmu_pointer(vcpu) (&(vcpu->arch.mmu)) +# define kvm_mmu_ad_disabled(mmu) (mmu->base_role.ad_disabled) +#endif /*CONFIG_X86_64*/ + +#ifdef CONFIG_ARM64 +#define if_pmd_thp_or_huge(pmd) (if_pmd_huge(pmd) || pmd_trans_huge(pmd)) +#endif /* CONFIG_ARM64 */ + +#ifdef DEBUG + +#define debug_printk trace_printk + +#define set_restart_gpa(val, note) ({ \ + unsigned long old_val = pic->restart_gpa; \ + pic->restart_gpa = (val); \ + trace_printk("restart_gpa=%lx %luK %s %s %d\n", \ + (val), (pic->restart_gpa - old_val) >> 10, \ + note, __func__, __LINE__); \ +}) + +#define set_next_hva(val, note) ({ \ + unsigned long old_val = pic->next_hva; \ + pic->next_hva = (val); \ + trace_printk(" next_hva=%lx %luK %s %s %d\n", \ + (val), (pic->next_hva - old_val) >> 10, \ + note, __func__, __LINE__); \ +}) + +#else + +#define debug_printk(...) 
+ +#define set_restart_gpa(val, note) ({ \ + pic->restart_gpa = (val); \ +}) + +#define set_next_hva(val, note) ({ \ + pic->next_hva = (val); \ +}) + +#endif + +static unsigned long pagetype_size[16] = { + [PTE_ACCESSED] = PAGE_SIZE, /* 4k page */ + [PMD_ACCESSED] = PMD_SIZE, /* 2M page */ + [PUD_PRESENT] = PUD_SIZE, /* 1G page */ + + [PTE_DIRTY_M] = PAGE_SIZE, + [PMD_DIRTY_M] = PMD_SIZE, + + [PTE_IDLE] = PAGE_SIZE, + [PMD_IDLE] = PMD_SIZE, + [PMD_IDLE_PTES] = PMD_SIZE, + + [PTE_HOLE] = PAGE_SIZE, + [PMD_HOLE] = PMD_SIZE, +}; + +static void u64_to_u8(uint64_t n, uint8_t *p) +{ + p += sizeof(uint64_t) - 1; + + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p = n; +} + +static void dump_pic(struct page_idle_ctrl *pic) +{ + debug_printk("page_idle_ctrl: pie_read=%d pie_read_max=%d", + pic->pie_read, + pic->pie_read_max); + debug_printk(" buf_size=%d bytes_copied=%d next_hva=%pK", + pic->buf_size, + pic->bytes_copied, + pic->next_hva); + debug_printk(" restart_gpa=%pK pa_to_hva=%pK\n", + pic->restart_gpa, + pic->gpa_to_hva); +} + +#ifdef CONFIG_ARM64 +static int if_pmd_huge(pmd_t pmd) +{ + return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT); +} + +static int if_pud_huge(pud_t pud) +{ +#ifndef __PAGETABLE_PMD_FOLDED + return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT); +#else + return 0; +#endif +} + +static inline bool if_stage2_pud_huge(struct kvm *kvm, pud_t pud) +{ + if (kvm_stage2_has_pmd(kvm)) + return if_pud_huge(pud); + else + return 0; +} +#endif + +static void pic_report_addr(struct page_idle_ctrl *pic, unsigned long addr) +{ + unsigned long hva; + + pic->kpie[pic->pie_read++] = PIP_CMD_SET_HVA; + hva = addr; + u64_to_u8(hva, &pic->kpie[pic->pie_read]); + pic->pie_read += sizeof(uint64_t); + dump_pic(pic); +} + +static int pic_add_page(struct page_idle_ctrl *pic, + unsigned long addr, + unsigned long next, + enum ProcIdlePageType page_type) +{ + unsigned long page_size = pagetype_size[page_type]; + + dump_pic(pic); + + /* align kernel/user vision of cursor position */ + next = round_up(next, page_size); + + if (!pic->pie_read || + addr + pic->gpa_to_hva != pic->next_hva) { + /* merge hole */ + if (page_type == PTE_HOLE || + page_type == PMD_HOLE) { + set_restart_gpa(next, "PTE_HOLE|PMD_HOLE"); + return 0; + } + + if (addr + pic->gpa_to_hva < pic->next_hva) { + debug_printk("page_idle: addr moves backwards\n"); + WARN_ONCE(1, "page_idle: addr moves backwards"); + } + + if (pic->pie_read + sizeof(uint64_t) + 2 >= pic->pie_read_max) { + set_restart_gpa(addr, "PAGE_IDLE_KBUF_FULL"); + return PAGE_IDLE_KBUF_FULL; + } + + pic_report_addr(pic, round_down(addr, page_size) + + pic->gpa_to_hva); + } else { + if (PIP_TYPE(pic->kpie[pic->pie_read - 1]) == page_type && + PIP_SIZE(pic->kpie[pic->pie_read - 1]) < 0xF) { + set_next_hva(next + pic->gpa_to_hva, "IN-PLACE INC"); + set_restart_gpa(next, "IN-PLACE INC"); + pic->kpie[pic->pie_read - 1]++; + WARN_ONCE(page_size < next-addr, "next-addr too large"); + return 0; + } + if (pic->pie_read >= pic->pie_read_max) { + set_restart_gpa(addr, "PAGE_IDLE_KBUF_FULL"); + return PAGE_IDLE_KBUF_FULL; + } + } + + set_next_hva(next + pic->gpa_to_hva, "NEW-ITEM"); + set_restart_gpa(next, "NEW-ITEM"); + pic->kpie[pic->pie_read] = PIP_COMPOSE(page_type, 1); + pic->pie_read++; + + return 0; +} + +static int init_page_idle_ctrl_buffer(struct page_idle_ctrl *pic) +{ + pic->pie_read = 0; + pic->pie_read_max = min(PAGE_IDLE_KBUF_SIZE, + pic->buf_size - 
pic->bytes_copied); + /* reserve space for PIP_CMD_SET_HVA in the end */ + pic->pie_read_max -= sizeof(uint64_t) + 1; + + /* + * Align with PAGE_IDLE_KBUF_FULL + * logic in pic_add_page(), to avoid pic->pie_read = 0 when + * PAGE_IDLE_KBUF_FULL happened. + */ + if (pic->pie_read_max <= sizeof(uint64_t) + 2) + return PAGE_IDLE_KBUF_FULL; + + memset(pic->kpie, 0, sizeof(pic->kpie)); + return 0; +} + +static void setup_page_idle_ctrl(struct page_idle_ctrl *pic, void *buf, + int buf_size, unsigned int flags) +{ + pic->buf = buf; + pic->buf_size = buf_size; + pic->bytes_copied = 0; + pic->next_hva = 0; + pic->gpa_to_hva = 0; + pic->restart_gpa = 0; + pic->last_va = 0; + pic->flags = flags; +} + +static int page_idle_copy_user(struct page_idle_ctrl *pic, + unsigned long start, unsigned long end) +{ + int bytes_read; + int lc = 0; /* last copy? */ + int ret; + + dump_pic(pic); + + /* Break out of loop on no more progress. */ + if (!pic->pie_read) { + lc = 1; + if (start < end) + start = end; + } + + if (start >= end && start > pic->next_hva) { + set_next_hva(start, "TAIL-HOLE"); + pic_report_addr(pic, start); + } + + bytes_read = pic->pie_read; + if (!bytes_read) + return 1; + + ret = copy_to_user(pic->buf, pic->kpie, bytes_read); + if (ret) + return -EFAULT; + + pic->buf += bytes_read; + pic->bytes_copied += bytes_read; + if (pic->bytes_copied >= pic->buf_size) + return PAGE_IDLE_BUF_FULL; + if (lc) + return lc; + + ret = init_page_idle_ctrl_buffer(pic); + if (ret) + return ret; + + cond_resched(); + return 0; +} + +#ifdef CONFIG_X86_64 +static int ept_pte_range(struct page_idle_ctrl *pic, + pmd_t *pmd, unsigned long addr, unsigned long end) +{ + pte_t *pte; + enum ProcIdlePageType page_type; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (KVM_CHECK_INVALID_SPTE(pte->pte)) { + page_type = PTE_IDLE; + } else if (!ept_pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else { + page_type = PTE_ACCESSED; + if (pic->flags & SCAN_DIRTY_PAGE) { + if (test_and_clear_bit(_PAGE_BIT_EPT_DIRTY, + (unsigned long *) &pte->pte)) + page_type = PTE_DIRTY_M; + } + } + + err = pic_add_page(pic, addr, addr + PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != end); + + return err; +} + + +static int ept_pmd_range(struct page_idle_ctrl *pic, + pud_t *pud, unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err = 0; + + if (pic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (KVM_CHECK_INVALID_SPTE(pmd->pmd)) + page_type = PMD_IDLE; + else if (!ept_pmd_present(*pmd)) + page_type = PMD_HOLE; /* likely won't hit here */ + else if (!pmd_large(*pmd)) + page_type = pte_page_type; + else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED, + (unsigned long *)pmd)) + page_type = PMD_IDLE; + else { + page_type = PMD_ACCESSED; + if ((pic->flags & SCAN_DIRTY_PAGE) && + test_and_clear_bit(_PAGE_BIT_EPT_DIRTY, + (unsigned long *) pmd)) + page_type = PMD_DIRTY_M; + } + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = pic_add_page(pic, addr, next, page_type); + else + err = ept_pte_range(pic, pmd, addr, next); + if (err) + break; + } while (pmd++, addr = next, addr != end); + + return err; +} + + +static int ept_pud_range(struct page_idle_ctrl 
*pic, + p4d_t *p4d, unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + int err = 0; + + pud = pud_offset(p4d, addr); + do { + next = pud_addr_end(addr, end); + + if (!ept_pud_present(*pud)) { + set_restart_gpa(next, "PUD_HOLE"); + continue; + } + + if (pud_large(*pud)) + err = pic_add_page(pic, addr, next, PUD_PRESENT); + else + err = ept_pmd_range(pic, pud, addr, next); + + if (err) + break; + } while (pud++, addr = next, addr != end); + + return err; +} + +static int ept_p4d_range(struct page_idle_ctrl *pic, + pgd_t *pgd, unsigned long addr, unsigned long end) +{ + p4d_t *p4d; + unsigned long next; + int err = 0; + + p4d = p4d_offset(pgd, addr); + do { + next = p4d_addr_end(addr, end); + if (!ept_p4d_present(*p4d)) { + set_restart_gpa(next, "P4D_HOLE"); + continue; + } + + err = ept_pud_range(pic, p4d, addr, next); + if (err) + break; + } while (p4d++, addr = next, addr != end); + + return err; +} + + +static int ept_page_range(struct page_idle_ctrl *pic, + unsigned long addr, + unsigned long end) +{ + struct kvm_vcpu *vcpu; + struct kvm_mmu *mmu; + pgd_t *ept_root; + pgd_t *pgd; + unsigned long next; + int err = 0; + + WARN_ON(addr >= end); + + spin_lock(&pic->kvm->mmu_lock); + + vcpu = kvm_get_vcpu(pic->kvm, 0); + if (!vcpu) { + spin_unlock(&pic->kvm->mmu_lock); + return -EINVAL; + } + + mmu = kvm_arch_mmu_pointer(vcpu); + if (!VALID_PAGE(mmu->root_hpa)) { + spin_unlock(&pic->kvm->mmu_lock); + return -EINVAL; + } + + ept_root = __va(mmu->root_hpa); + + spin_unlock(&pic->kvm->mmu_lock); + local_irq_disable(); + pgd = pgd_offset_pgd(ept_root, addr); + do { + next = pgd_addr_end(addr, end); + if (!ept_pgd_present(*pgd)) { + set_restart_gpa(next, "PGD_HOLE"); + continue; + } + + err = ept_p4d_range(pic, pgd, addr, next); + if (err) + break; + } while (pgd++, addr = next, addr != end); + local_irq_enable(); + return err; +} + +static int ept_idle_supports_cpu(struct kvm *kvm) +{ + struct kvm_vcpu *vcpu; + struct kvm_mmu *mmu; + int ret; + + vcpu = kvm_get_vcpu(kvm, 0); + if (!vcpu) + return -EINVAL; + + spin_lock(&kvm->mmu_lock); + mmu = kvm_arch_mmu_pointer(vcpu); + if (kvm_mmu_ad_disabled(mmu)) { + printk(KERN_NOTICE "CPU does not support EPT A/D bits tracking\n"); + ret = -EINVAL; + } else if (mmu->shadow_root_level != 4 + (!!pgtable_l5_enabled())) { + printk(KERN_NOTICE "Unsupported EPT level %d\n", mmu->shadow_root_level); + ret = -EINVAL; + } else + ret = 0; + spin_unlock(&kvm->mmu_lock); + + return ret; +} + +#else +static int arm_pte_range(struct page_idle_ctrl *pic, + pmd_t *pmd, unsigned long addr, unsigned long end) +{ + pte_t *pte; + enum ProcIdlePageType page_type; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else + page_type = PTE_ACCESSED; + + err = pic_add_page(pic, addr, addr + PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != end); + + return err; +} + +static int arm_pmd_range(struct page_idle_ctrl *pic, + pud_t *pud, unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + struct kvm *kvm = pic->kvm; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err = 0; + + if (pic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + pmd = stage2_pmd_offset(kvm, pud, addr); + do { + next = stage2_pmd_addr_end(kvm, addr, end); + if 
(!pmd_present(*pmd)) + page_type = PMD_HOLE; + else if (!if_pmd_thp_or_huge(*pmd)) + page_type = pte_page_type; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *)pmd)) + page_type = PMD_IDLE; + else + page_type = PMD_ACCESSED; + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = pic_add_page(pic, addr, next, page_type); + else + err = arm_pte_range(pic, pmd, addr, next); + if (err) + break; + } while (pmd++, addr = next, addr != end); + + return err; +} + +static int arm_pud_range(struct page_idle_ctrl *pic, + pgd_t *pgd, unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + struct kvm *kvm = pic->kvm; + int err = 0; + + pud = stage2_pud_offset(kvm, pgd, addr); + do { + next = stage2_pud_addr_end(kvm, addr, end); + if (!stage2_pud_present(kvm, *pud)) { + set_restart_gpa(next, "PUD_HOLE"); + continue; + } + + if (if_stage2_pud_huge(kvm, *pud)) + err = pic_add_page(pic, addr, next, PUD_PRESENT); + else + err = arm_pmd_range(pic, pud, addr, next); + if (err) + break; + } while (pud++, addr = next, addr != end); + + return err; +} + +static int arm_page_range(struct page_idle_ctrl *pic, + unsigned long addr, + unsigned long end) +{ + pgd_t *pgd; + unsigned long next; + struct kvm *kvm = pic->kvm; + int err = 0; + + WARN_ON(addr >= end); + + spin_lock(&pic->kvm->mmu_lock); + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr); + spin_unlock(&pic->kvm->mmu_lock); + + local_irq_disable(); + do { + next = stage2_pgd_addr_end(kvm, addr, end); + if (!stage2_pgd_present(kvm, *pgd)) { + set_restart_gpa(next, "PGD_HOLE"); + continue; + } + + err = arm_pud_range(pic, pgd, addr, next); + if (err) + break; + } while (pgd++, addr = next, addr != end); + + local_irq_enable(); + return err; +} +#endif + +/* + * Depending on whether hva falls in a memslot: + * + * 1) found => return gpa and remaining memslot size in *addr_range + * + * |<----- addr_range --------->| + * [ mem slot ] + * ^hva + * + * 2) not found => return hole size in *addr_range + * + * |<----- addr_range --------->| + * [first mem slot above hva ] + * ^hva + * + * If hva is above all mem slots, *addr_range will be ~0UL. + * We can finish read(2). 
+ */ +static unsigned long vm_idle_find_gpa(struct page_idle_ctrl *pic, + unsigned long hva, + unsigned long *addr_range) +{ + struct kvm *kvm = pic->kvm; + struct kvm_memslots *slots; + struct kvm_memory_slot *memslot; + unsigned long hva_end; + gfn_t gfn; + + *addr_range = ~0UL; + mutex_lock(&kvm->slots_lock); + slots = kvm_memslots(pic->kvm); + kvm_for_each_memslot(memslot, slots) { + hva_end = memslot->userspace_addr + + (memslot->npages << PAGE_SHIFT); + + if (hva >= memslot->userspace_addr && hva < hva_end) { + gpa_t gpa; + + gfn = hva_to_gfn_memslot(hva, memslot); + *addr_range = hva_end - hva; + gpa = gfn_to_gpa(gfn); + mutex_unlock(&kvm->slots_lock); + return gpa; + } + + if (memslot->userspace_addr > hva) + *addr_range = min(*addr_range, + memslot->userspace_addr - hva); + } + mutex_unlock(&kvm->slots_lock); + return INVALID_PAGE; +} + +static int vm_idle_walk_hva_range(struct page_idle_ctrl *pic, + unsigned long start, unsigned long end) +{ + unsigned long gpa_addr; + unsigned long addr_range; + unsigned long va_end; + int ret; + +#ifdef CONFIG_X86_64 + ret = ept_idle_supports_cpu(pic->kvm); + if (ret) + return ret; +#endif + + ret = init_page_idle_ctrl_buffer(pic); + if (ret) + return ret; + + for (; start < end;) { + gpa_addr = vm_idle_find_gpa(pic, start, &addr_range); + + if (gpa_addr == INVALID_PAGE) { + pic->gpa_to_hva = 0; + if (addr_range == ~0UL) { + set_restart_gpa(TASK_SIZE, "EOF"); + va_end = end; + } else { + start += addr_range; + set_restart_gpa(start, "OUT-OF-SLOT"); + va_end = start; + } + } else { + pic->gpa_to_hva = start - gpa_addr; +#ifdef CONFIG_ARM64 + arm_page_range(pic, gpa_addr, gpa_addr + addr_range); +#else + ept_page_range(pic, gpa_addr, gpa_addr + addr_range); +#endif + va_end = pic->gpa_to_hva + gpa_addr + addr_range; + } + + start = pic->restart_gpa + pic->gpa_to_hva; + ret = page_idle_copy_user(pic, start, va_end); + if (ret) + break; + } + + if (pic->bytes_copied) + ret = 0; + return ret; +} + +static ssize_t vm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + struct page_idle_ctrl *pic; + unsigned long hva_start = *ppos; + unsigned long hva_end = hva_start + (count << (3 + PAGE_SHIFT)); + int ret; + + pic = kzalloc(sizeof(*pic), GFP_KERNEL); + if (!pic) + return -ENOMEM; + + setup_page_idle_ctrl(pic, buf, count, file->f_flags); + pic->kvm = mm_kvm(mm); + + ret = vm_idle_walk_hva_range(pic, hva_start, hva_end); + if (ret) + goto out_kvm; + + ret = pic->bytes_copied; + *ppos = pic->next_hva; +out_kvm: + return ret; + +} + +static ssize_t mm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos); + +static ssize_t page_scan_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + unsigned long hva_start = *ppos; + unsigned long hva_end = hva_start + (count << (3 + PAGE_SHIFT)); + + if ((hva_start >= TASK_SIZE) || (hva_end >= TASK_SIZE)) { + debug_printk("page_idle_read past TASK_SIZE: %pK %pK %lx\n", + hva_start, hva_end, TASK_SIZE); + return 0; + } + if (hva_end <= hva_start) { + debug_printk("page_idle_read past EOF: %pK %pK\n", + hva_start, hva_end); + return 0; + } + if (*ppos & (PAGE_SIZE - 1)) { + debug_printk("page_idle_read unaligned ppos: %pK\n", + hva_start); + return -EINVAL; + } + if (count < PAGE_IDLE_BUF_MIN) { + debug_printk("page_idle_read small count: %lx\n", + (unsigned long)count); + return -EINVAL; + } + + if (!mm_kvm(mm)) + return mm_idle_read(file, buf, count, ppos); + + return 
vm_idle_read(file, buf, count, ppos); +} + +static int page_scan_open(struct inode *inode, struct file *file) +{ + if (!try_module_get(THIS_MODULE)) + return -EBUSY; + + return 0; +} + +static int page_scan_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data; + struct kvm *kvm; + int ret = 0; + + if (!mm) { + ret = -EBADF; + goto out; + } + + kvm = mm_kvm(mm); + if (!kvm) { + ret = -EINVAL; + goto out; + } +#ifdef CONFIG_X86_64 + spin_lock(&kvm->mmu_lock); + kvm_flush_remote_tlbs(kvm); + spin_unlock(&kvm->mmu_lock); +#endif + +out: + module_put(THIS_MODULE); + return ret; +} + +static int mm_idle_pmd_large(pmd_t pmd) +{ +#ifdef CONFIG_ARM64 + return if_pmd_thp_or_huge(pmd); +#else + return pmd_large(pmd); +#endif +} + +static int mm_idle_pte_range(struct page_idle_ctrl *pic, pmd_t *pmd, + unsigned long addr, unsigned long next) +{ + enum ProcIdlePageType page_type; + pte_t *pte; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else { + page_type = PTE_ACCESSED; + } + + err = pic_add_page(pic, addr, addr + PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != next); + + return err; +} + +static int mm_idle_pmd_entry(pmd_t *pmd, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct page_idle_ctrl *pic = walk->private; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err; + + /* + * Skip duplicate PMD_IDLE_PTES: when the PMD crosses VMA boundary, + * walk_page_range() can call on the same PMD twice. + */ + if ((addr & PMD_MASK) == (pic->last_va & PMD_MASK)) { + debug_printk("ignore duplicate addr %pK %pK\n", + addr, pic->last_va); + return 0; + } + pic->last_va = addr; + + if (pic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + if (!pmd_present(*pmd)) + page_type = PMD_HOLE; + else if (!mm_idle_pmd_large(*pmd)) + page_type = pte_page_type; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *)pmd)) + page_type = PMD_IDLE; + else + page_type = PMD_ACCESSED; + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = pic_add_page(pic, addr, next, page_type); + else + err = mm_idle_pte_range(pic, pmd, addr, next); + + return err; +} + +static int mm_idle_pud_entry(pud_t *pud, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct page_idle_ctrl *pic = walk->private; + + if ((addr & PUD_MASK) != (pic->last_va & PUD_MASK)) { + pic_add_page(pic, addr, next, PUD_PRESENT); + pic->last_va = addr; + } + return 1; +} + +static int mm_idle_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + + if (vma->vm_file) { + if ((vma->vm_flags & (VM_WRITE|VM_MAYSHARE)) == VM_WRITE) + return 0; + return 1; + } + + return 0; +} + +static int mm_idle_walk_range(struct page_idle_ctrl *pic, + unsigned long start, + unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma; + int ret = 0; + + ret = init_page_idle_ctrl_buffer(pic); + if (ret) + return ret; + + for (; start < end;) { + down_read(&walk->mm->mmap_sem); + vma = find_vma(walk->mm, start); + if (vma) { + if (end > vma->vm_start) { + local_irq_disable(); + ret = walk_page_range(start, end, walk); + local_irq_enable(); + } else + set_restart_gpa(vma->vm_start, "VMA-HOLE"); + } else + 
set_restart_gpa(TASK_SIZE, "EOF"); + up_read(&walk->mm->mmap_sem); + + WARN_ONCE(pic->gpa_to_hva, "non-zero gpa_to_hva"); + start = pic->restart_gpa; + ret = page_idle_copy_user(pic, start, end); + if (ret) + break; + } + + if (pic->bytes_copied) { + if (ret != PAGE_IDLE_BUF_FULL && pic->next_hva < end) + debug_printk("partial scan: next_hva=%pK end=%pK\n", + pic->next_hva, end); + ret = 0; + } else + WARN_ONCE(1, "nothing read"); + return ret; +} + +static ssize_t mm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + struct mm_walk mm_walk = {}; + struct page_idle_ctrl *pic; + unsigned long va_start = *ppos; + unsigned long va_end = va_start + (count << (3 + PAGE_SHIFT)); + int ret; + + if (va_end <= va_start) { + debug_printk("%s past EOF: %pK %pK\n", + __func__, va_start, va_end); + return 0; + } + if (*ppos & (PAGE_SIZE - 1)) { + debug_printk("%s unaligned ppos: %pK\n", + __func__, va_start); + return -EINVAL; + } + if (count < PAGE_IDLE_BUF_MIN) { + debug_printk("%s small count: %lx\n", + __func__, (unsigned long)count); + return -EINVAL; + } + + pic = kzalloc(sizeof(*pic), GFP_KERNEL); + if (!pic) + return -ENOMEM; + + setup_page_idle_ctrl(pic, buf, count, file->f_flags); + + mm_walk.mm = mm; + mm_walk.pmd_entry = mm_idle_pmd_entry; + mm_walk.pud_entry = mm_idle_pud_entry; + mm_walk.test_walk = mm_idle_test_walk; + mm_walk.private = pic; + + ret = mm_idle_walk_range(pic, va_start, va_end, &mm_walk); + if (ret) + goto out_free; + + ret = pic->bytes_copied; + *ppos = pic->next_hva; +out_free: + kfree(pic); + return ret; +} + +extern struct file_operations proc_page_scan_operations; + +static int page_scan_entry(void) +{ + proc_page_scan_operations.owner = THIS_MODULE; + proc_page_scan_operations.read = page_scan_read; + proc_page_scan_operations.open = page_scan_open; + proc_page_scan_operations.release = page_scan_release; + return 0; +} + +static void page_scan_exit(void) +{ + memset(&proc_page_scan_operations, 0, + sizeof(proc_page_scan_operations)); +} + +MODULE_LICENSE("GPL"); +module_init(page_scan_entry); +module_exit(page_scan_exit); diff --git a/fs/proc/etmem_scan.h b/fs/proc/etmem_scan.h new file mode 100644 index 000000000000..305739f92eef --- /dev/null +++ b/fs/proc/etmem_scan.h @@ -0,0 +1,132 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _PAGE_IDLE_H +#define _PAGE_IDLE_H + +#define SCAN_HUGE_PAGE O_NONBLOCK /* only huge page */ +#define SCAN_SKIM_IDLE O_NOFOLLOW /* stop on PMD_IDLE_PTES */ +#define SCAN_DIRTY_PAGE O_NOATIME /* report pte/pmd dirty bit */ + +enum ProcIdlePageType { + PTE_ACCESSED, /* 4k page */ + PMD_ACCESSED, /* 2M page */ + PUD_PRESENT, /* 1G page */ + + PTE_DIRTY_M, + PMD_DIRTY_M, + + PTE_IDLE, + PMD_IDLE, + PMD_IDLE_PTES, /* all PTE idle */ + + PTE_HOLE, + PMD_HOLE, + + PIP_CMD, + + IDLE_PAGE_TYPE_MAX +}; + +#define PIP_TYPE(a) (0xf & (a >> 4)) +#define PIP_SIZE(a) (0xf & a) +#define PIP_COMPOSE(type, nr) ((type << 4) | nr) + +#define PIP_CMD_SET_HVA PIP_COMPOSE(PIP_CMD, 0) + +#ifndef INVALID_PAGE +#define INVALID_PAGE ~0UL +#endif + +#ifdef CONFIG_ARM64 +#define _PAGE_MM_BIT_ACCESSED 10 +#else +#define _PAGE_MM_BIT_ACCESSED _PAGE_BIT_ACCESSED +#endif + +#ifdef CONFIG_X86_64 +#define _PAGE_BIT_EPT_ACCESSED 8 +#define _PAGE_BIT_EPT_DIRTY 9 +#define _PAGE_EPT_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_EPT_ACCESSED) +#define _PAGE_EPT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_EPT_DIRTY) + +#define _PAGE_EPT_PRESENT (_AT(pteval_t, 7)) + +static inline int ept_pte_present(pte_t a) +{ + 
return pte_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pmd_present(pmd_t a) +{ + return pmd_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pud_present(pud_t a) +{ + return pud_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_p4d_present(p4d_t a) +{ + return p4d_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pgd_present(pgd_t a) +{ + return pgd_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pte_accessed(pte_t a) +{ + return pte_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pmd_accessed(pmd_t a) +{ + return pmd_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pud_accessed(pud_t a) +{ + return pud_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_p4d_accessed(p4d_t a) +{ + return p4d_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pgd_accessed(pgd_t a) +{ + return pgd_flags(a) & _PAGE_EPT_ACCESSED; +} +#endif + +extern struct file_operations proc_page_scan_operations; + +#define PAGE_IDLE_KBUF_FULL 1 +#define PAGE_IDLE_BUF_FULL 2 +#define PAGE_IDLE_BUF_MIN (sizeof(uint64_t) * 2 + 3) + +#define PAGE_IDLE_KBUF_SIZE 8000 + +struct page_idle_ctrl { + struct mm_struct *mm; + struct kvm *kvm; + + uint8_t kpie[PAGE_IDLE_KBUF_SIZE]; + int pie_read; + int pie_read_max; + + void __user *buf; + int buf_size; + int bytes_copied; + + unsigned long next_hva; /* GPA for EPT; VA for PT */ + unsigned long gpa_to_hva; + unsigned long restart_gpa; + unsigned long last_va; + + unsigned int flags; +}; + +#endif diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 4f14906ef16b..55b4a9b716a9 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -299,6 +299,7 @@ extern const struct file_operations proc_pid_smaps_operations; extern const struct file_operations proc_pid_smaps_rollup_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; +extern const struct file_operations proc_mm_idle_operations;
extern unsigned long task_vsize(struct mm_struct *); extern unsigned long task_statm(struct mm_struct *, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 657b15922939..ee258adb631b 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1598,6 +1598,72 @@ const struct file_operations proc_pagemap_operations = { .open = pagemap_open, .release = pagemap_release, }; + +/* will be filled when kvm_ept_idle module loads */ +struct file_operations proc_page_scan_operations = { +}; +EXPORT_SYMBOL_GPL(proc_page_scan_operations); + +static ssize_t mm_idle_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + int ret = 0; + + if (!mm || !mmget_not_zero(mm)) { + ret = -ESRCH; + return ret; + } + if (proc_page_scan_operations.read) + ret = proc_page_scan_operations.read(file, buf, count, ppos); + + mmput(mm); + return ret; +} + +static int mm_idle_open(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = NULL; + + if (!file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + mm = proc_mem_open(inode, PTRACE_MODE_READ); + if (IS_ERR(mm)) + return PTR_ERR(mm); + + file->private_data = mm; + + if (proc_page_scan_operations.open) + return proc_page_scan_operations.open(inode, file); + + return 0; +} + +static int mm_idle_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data; + + if (mm) { + if (!mm_kvm(mm)) + flush_tlb_mm(mm); + mmdrop(mm); + } + + if (proc_page_scan_operations.release) + return proc_page_scan_operations.release(inode, file); + + return 0; +} + +const struct file_operations proc_mm_idle_operations = { + .llseek = mem_lseek, /* borrow this */ + .read = mm_idle_read, + .open = mm_idle_open, + .release = mm_idle_release, +}; + + #endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 9903776a967e..eee88ebbcf35 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -28,6 +28,7 @@ typedef int vm_fault_t; struct address_space; struct mem_cgroup; struct hmm; +struct kvm;
/* * Each physical page in the system has a struct page associated with @@ -513,7 +514,12 @@ struct mm_struct { #endif } __randomize_layout;
+#if IS_ENABLED(CONFIG_KVM) && !defined(__GENKSYMS__) + struct kvm *kvm; +#else KABI_RESERVE(1) +#endif + KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) @@ -531,6 +537,18 @@ struct mm_struct {
extern struct mm_struct init_mm;
+#if IS_ENABLED(CONFIG_KVM) +static inline struct kvm *mm_kvm(struct mm_struct *mm) +{ + return mm->kvm; +} +#else +static inline struct kvm *mm_kvm(struct mm_struct *mm) +{ + return NULL; +} +#endif + /* Pointer magic because the dynamic array size confuses some compilers. */ static inline void mm_init_cpumask(struct mm_struct *mm) { diff --git a/lib/Kconfig b/lib/Kconfig index a3928d4438b5..f332b0a05db2 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -599,6 +599,12 @@ config PARMAN config PRIME_NUMBERS tristate
+config ETMEM_SCAN + tristate "module: etmem page scan for etmem support" + help + etmem page scan feature + used to scan the virtual address of the target process + config STRING_SELFTEST tristate "Test string functions"
diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 3c0930d94a29..286f00cc1c06 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -338,6 +338,7 @@ int walk_page_range(unsigned long start, unsigned long end, } while (start = next, start < end); return err; } +EXPORT_SYMBOL_GPL(walk_page_range);
int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d8ae0458a049..0292f9f0a774 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -778,6 +778,9 @@ static void kvm_destroy_vm(struct kvm *kvm) struct mm_struct *mm = kvm->mm;
kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm); +#if IS_ENABLED(CONFIG_KVM) + mm->kvm = NULL; +#endif kvm_destroy_vm_debugfs(kvm); kvm_arch_sync_events(kvm); mutex_lock(&kvm_lock); @@ -3298,6 +3301,9 @@ static int kvm_dev_ioctl_create_vm(unsigned long type) fput(file); return -ENOMEM; } +#if IS_ENABLED(CONFIG_KVM) + kvm->mm->kvm = kvm; +#endif kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
fd_install(r, file);
From: liubo <liubo254@huawei.com>
euleros inclusion
category: feature
feature: etmem
bugzilla: 49889
-------------------------------------------------
In order to achieve memory expansion, cold pages need to be migrated to the swap partition; etmem_swap.ko serves this purpose.
This patch mainly adds etmem_swap.ko, which takes virtual addresses written from user space and migrates the corresponding pages out for swapping.
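As an illustration of the intended usage (not part of this patch), here is a minimal user-space sketch that feeds candidate cold-page addresses to /proc/<pid>/swap_pages; the pid and the address values are placeholders, and the caller needs the same privileges as for idle_pages:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* newline-separated hexadecimal virtual addresses, as parsed by swap_pages_write() */
	const char *addrs = "7f1a2b3c4000\n7f1a2b3c5000\n";
	char path[64];
	int fd;

	/* hypothetical target pid; in practice taken from the scan step */
	snprintf(path, sizeof(path), "/proc/%d/swap_pages", 1234);
	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, addrs, strlen(addrs)) < 0)
		perror("write");
	close(fd);
	return 0;
}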
Signed-off-by: yanxiaodan <yanxiaodan@huawei.com>
Signed-off-by: linmiaohe <linmiaohe@huawei.com>
Signed-off-by: louhongxiang <louhongxiang@huawei.com>
Signed-off-by: liubo <liubo254@huawei.com>
Signed-off-by: geruijun <geruijun@huawei.com>
Signed-off-by: liangchenshu <liangchenshu@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/proc/Makefile     |   1 +
 fs/proc/base.c       |   2 +
 fs/proc/etmem_swap.c | 102 +++++++++++++++++++++++++++++++++++++++
 fs/proc/internal.h   |   1 +
 fs/proc/task_mmu.c   |  51 ++++++++++++++++++++
 include/linux/swap.h |   5 ++
 lib/Kconfig          |   5 ++
 mm/vmscan.c          | 112 +++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 279 insertions(+)
 create mode 100644 fs/proc/etmem_swap.c
diff --git a/fs/proc/Makefile b/fs/proc/Makefile index c1ebd017a83b..b9c9f59aba45 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -34,3 +34,4 @@ proc-$(CONFIG_PROC_VMCORE) += vmcore.o proc-$(CONFIG_PRINTK) += kmsg.o proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o obj-$(CONFIG_ETMEM_SCAN) += etmem_scan.o +obj-$(CONFIG_ETMEM_SWAP) += etmem_swap.o diff --git a/fs/proc/base.c b/fs/proc/base.c index 33ec3b6971b7..32c7f7d69267 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2989,6 +2989,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), + REG("swap_pages", S_IWUSR, proc_mm_swap_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), @@ -3375,6 +3376,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), + REG("swap_pages", S_IWUSR, proc_mm_swap_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c new file mode 100644 index 000000000000..b24c706c3b2a --- /dev/null +++ b/fs/proc/etmem_swap.c @@ -0,0 +1,102 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/string.h> +#include <linux/proc_fs.h> +#include <linux/sched/mm.h> +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/mempolicy.h> +#include <linux/uaccess.h> +#include <linux/delay.h> + +static ssize_t swap_pages_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + char *p, *data, *data_ptr_res; + unsigned long vaddr; + struct mm_struct *mm = file->private_data; + struct page *page; + LIST_HEAD(pagelist); + int ret = 0; + + if (!mm || !mmget_not_zero(mm)) { + ret = -ESRCH; + goto out; + } + + if (count < 0) { + ret = -EOPNOTSUPP; + goto out_mm; + } + + data = memdup_user_nul(buf, count); + if (IS_ERR(data)) { + ret = PTR_ERR(data); + goto out_mm; + } + + data_ptr_res = data; + while ((p = strsep(&data, "\n")) != NULL) { + if (!*p) + continue; + + ret = kstrtoul(p, 16, &vaddr); + if (ret != 0) + continue; + /*If get page struct failed, ignore it, get next page*/ + page = get_page_from_vaddr(mm, vaddr); + if (!page) + continue; + + add_page_for_swap(page, &pagelist); + } + + if (!list_empty(&pagelist)) + reclaim_pages(&pagelist); + + ret = count; + kfree(data_ptr_res); +out_mm: + mmput(mm); +out: + return ret; +} + +static int swap_pages_open(struct inode *inode, struct file *file) +{ + if (!try_module_get(THIS_MODULE)) + return -EBUSY; + + return 0; +} + +static int swap_pages_release(struct inode *inode, struct file *file) +{ + module_put(THIS_MODULE); + return 0; +} + + +extern struct file_operations proc_swap_pages_operations; + +static int swap_pages_entry(void) +{ + proc_swap_pages_operations.owner = THIS_MODULE; + proc_swap_pages_operations.write = swap_pages_write; + proc_swap_pages_operations.open = swap_pages_open; + proc_swap_pages_operations.release = swap_pages_release; + + return 0; +} + +static void swap_pages_exit(void) +{ + memset(&proc_swap_pages_operations, 0, + sizeof(proc_swap_pages_operations)); +} + +MODULE_LICENSE("GPL"); 
+module_init(swap_pages_entry); +module_exit(swap_pages_exit); diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 55b4a9b716a9..2df432cc38af 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -300,6 +300,7 @@ extern const struct file_operations proc_pid_smaps_rollup_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; extern const struct file_operations proc_mm_idle_operations; +extern const struct file_operations proc_mm_swap_operations;
extern unsigned long task_vsize(struct mm_struct *); extern unsigned long task_statm(struct mm_struct *, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ee258adb631b..ac7f57badcfd 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1663,7 +1663,58 @@ const struct file_operations proc_mm_idle_operations = { .release = mm_idle_release, };
+/*swap pages*/ +struct file_operations proc_swap_pages_operations = { +}; +EXPORT_SYMBOL_GPL(proc_swap_pages_operations); + +static ssize_t mm_swap_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + if (proc_swap_pages_operations.write) + return proc_swap_pages_operations.write(file, buf, count, ppos); + + return -1; +} + +static int mm_swap_open(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = NULL; + + if (!file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + mm = proc_mem_open(inode, PTRACE_MODE_READ); + if (IS_ERR(mm)) + return PTR_ERR(mm); + + file->private_data = mm; + + if (proc_swap_pages_operations.open) + return proc_swap_pages_operations.open(inode, file); + + return 0; +} + +static int mm_swap_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data;
+ if (mm) + mmdrop(mm); + + if (proc_swap_pages_operations.release) + return proc_swap_pages_operations.release(inode, file); + + return 0; +} + +const struct file_operations proc_mm_swap_operations = { + .llseek = mem_lseek, + .write = mm_swap_write, + .open = mm_swap_open, + .release = mm_swap_release, +}; #endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA diff --git a/include/linux/swap.h b/include/linux/swap.h index ca7b98e4ce74..c6f9dba6d713 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -374,6 +374,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages;
+extern unsigned long reclaim_pages(struct list_head *page_list); +extern int add_page_for_swap(struct page *page, struct list_head *pagelist); +extern struct page *get_page_from_vaddr(struct mm_struct *mm, + unsigned long vaddr); + #ifdef CONFIG_SHRINK_PAGECACHE extern unsigned long vm_cache_limit_ratio; extern unsigned long vm_cache_limit_ratio_min; diff --git a/lib/Kconfig b/lib/Kconfig index f332b0a05db2..edb7d40d1f60 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -605,6 +605,11 @@ config ETMEM_SCAN etmem page scan feature used to scan the virtual address of the target process
+config ETMEM_SWAP + tristate "module: etmem page swap for etmem support" + help + etmem page swap feature + config STRING_SELFTEST tristate "Test string functions"
diff --git a/mm/vmscan.c b/mm/vmscan.c index 48976d98d5e8..cabf2c290be5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -36,6 +36,7 @@ #include <linux/topology.h> #include <linux/cpu.h> #include <linux/cpuset.h> +#include <linux/mempolicy.h> #include <linux/compaction.h> #include <linux/notifier.h> #include <linux/rwsem.h> @@ -4401,3 +4402,114 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages) } } #endif /* CONFIG_SHMEM */ + +unsigned long reclaim_pages(struct list_head *page_list) +{ + int nid = NUMA_NO_NODE; + unsigned long nr_reclaimed = 0; + LIST_HEAD(node_page_list); + struct reclaim_stat dummy_stat; + struct page *page; + struct scan_control sc = { + .gfp_mask = GFP_KERNEL, + .priority = DEF_PRIORITY, + .may_writepage = 1, + .may_unmap = 1, + .may_swap = 1, + }; + + while (!list_empty(page_list)) { + page = lru_to_page(page_list); + if (nid == NUMA_NO_NODE) { + nid = page_to_nid(page); + INIT_LIST_HEAD(&node_page_list); + } + + if (nid == page_to_nid(page)) { + ClearPageActive(page); + list_move(&page->lru, &node_page_list); + continue; + } + + nr_reclaimed += shrink_page_list(&node_page_list, + NODE_DATA(nid), + &sc, 0, + &dummy_stat, false); + while (!list_empty(&node_page_list)) { + page = lru_to_page(&node_page_list); + list_del(&page->lru); + putback_lru_page(page); + } + + nid = NUMA_NO_NODE; + } + + if (!list_empty(&node_page_list)) { + nr_reclaimed += shrink_page_list(&node_page_list, + NODE_DATA(nid), + &sc, 0, + &dummy_stat, false); + while (!list_empty(&node_page_list)) { + page = lru_to_page(&node_page_list); + list_del(&page->lru); + putback_lru_page(page); + } + } + + return nr_reclaimed; +} +EXPORT_SYMBOL_GPL(reclaim_pages); + +int add_page_for_swap(struct page *page, struct list_head *pagelist) +{ + int err = -EBUSY; + struct page *head; + + /*If the page is mapped by more than one process, do not swap it */ + if (page_mapcount(page) > 1) + return -EACCES; + + if (PageHuge(page)) + return -EACCES; + + head = compound_head(page); + err = isolate_lru_page(head); + if (err) { + put_page(page); + return err; + } + put_page(page); + if (PageUnevictable(page)) + putback_lru_page(page); + else + list_add_tail(&head->lru, pagelist); + + err = 0; + return err; +} +EXPORT_SYMBOL_GPL(add_page_for_swap); +struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr) +{ + struct page *page; + struct vm_area_struct *vma; + unsigned int follflags; + + down_read(&mm->mmap_sem); + + vma = find_vma(mm, vaddr); + if (!vma || vaddr < vma->vm_start || vma->vm_flags & VM_LOCKED) { + up_read(&mm->mmap_sem); + return NULL; + } + + follflags = FOLL_GET | FOLL_DUMP; + page = follow_page(vma, vaddr, follflags); + if (IS_ERR(page) || !page) { + up_read(&mm->mmap_sem); + return NULL; + } + + up_read(&mm->mmap_sem); + return page; +} +EXPORT_SYMBOL_GPL(get_page_from_vaddr);
From: liubo <liubo254@huawei.com>
euleros inclusion
category: feature
feature: etmem
bugzilla: 49889
-------------------------------------------------
Enable the etmem feature config options and set the default values of CONFIG_ETMEM_SCAN and CONFIG_ETMEM_SWAP.
Before using the etmem feature, the etmem_scan.ko and etmem_swap.ko modules need to be inserted first.
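As an illustration only (not part of this patch), a minimal user-space sketch of reading /proc/<pid>/idle_pages once the modules are inserted; the pid, the start address and the simplified handling of PIP_CMD records are assumptions derived from etmem_scan.h and page_scan_read():

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* byte encoding from etmem_scan.h */
#define PIP_TYPE(a)	(0xf & ((a) >> 4))
#define PIP_SIZE(a)	(0xf & (a))
#define PIP_CMD_TYPE	10	/* index of PIP_CMD in enum ProcIdlePageType */

int main(void)
{
	uint8_t buf[4096];
	char path[64];
	ssize_t n, i;
	int fd;

	/* hypothetical target pid and page-aligned start address */
	snprintf(path, sizeof(path), "/proc/%d/idle_pages", 1234);
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the file offset selects the virtual address where the scan starts */
	if (lseek(fd, 0x7f1a2b3c4000UL, SEEK_SET) == (off_t)-1) {
		perror("lseek");
		close(fd);
		return 1;
	}
	n = read(fd, buf, sizeof(buf));
	for (i = 0; i < n; i++) {
		if (PIP_TYPE(buf[i]) == PIP_CMD_TYPE) {
			/* PIP_CMD_SET_HVA is followed by a 64-bit address; skip it here */
			i += sizeof(uint64_t);
			continue;
		}
		printf("type=%d pages=%d\n", PIP_TYPE(buf[i]), PIP_SIZE(buf[i]));
	}
	close(fd);
	return 0;
}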
Signed-off-by: liubo <liubo254@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 arch/arm64/configs/hulk_defconfig      | 2 ++
 arch/arm64/configs/openeuler_defconfig | 2 ++
 arch/x86/configs/hulk_defconfig        | 2 ++
 arch/x86/configs/openeuler_defconfig   | 2 ++
 4 files changed, 8 insertions(+)
diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig
index 706f06dfcfd3..9467ed0c3ca0 100644
--- a/arch/arm64/configs/hulk_defconfig
+++ b/arch/arm64/configs/hulk_defconfig
@@ -5661,3 +5661,5 @@ CONFIG_IO_STRICT_DEVMEM=y
 # CONFIG_DEBUG_EFI is not set
 # CONFIG_ARM64_RELOC_TEST is not set
 # CONFIG_CORESIGHT is not set
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index ad321302cf2a..f93571a479d0 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -5968,3 +5968,5 @@ CONFIG_IO_STRICT_DEVMEM=y
 # CONFIG_ARM64_RELOC_TEST is not set
 # CONFIG_CORESIGHT is not set
 CONFIG_SMMU_BYPASS_DEV=y
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig
index c7c7d8776c2a..e11831046b41 100644
--- a/arch/x86/configs/hulk_defconfig
+++ b/arch/x86/configs/hulk_defconfig
@@ -7536,3 +7536,5 @@ CONFIG_OPTIMIZE_INLINING=y
 # CONFIG_PUNIT_ATOM_DEBUG is not set
 CONFIG_UNWINDER_ORC=y
 # CONFIG_UNWINDER_FRAME_POINTER is not set
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index a52e5932dcc6..36f41978fd8c 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -7537,3 +7537,5 @@ CONFIG_OPTIMIZE_INLINING=y
 # CONFIG_PUNIT_ATOM_DEBUG is not set
 CONFIG_UNWINDER_ORC=y
 # CONFIG_UNWINDER_FRAME_POINTER is not set
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
From: Ondrej Jirman <megous@megous.com>
mainline inclusion
from mainline-v5.2-rc1
commit e3062e05e1cfe378bb9b3fa0bef46711372bcf13
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3AUFW
CVE: NA
------------------------------------------------
SDIO-based brcm43456 is currently misdetected as brcm43455 and the wrong firmware name is used. Correct the detection and load the correct firmware file. The chiprev for brcm43456 is "9".
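For reference, a small stand-alone sketch (illustrative only, not brcmfmac code) of the revision-mask matching this change relies on: the mask in a firmware table entry is a bitmap of chip revisions, so chiprev 9 selects the new 0x00000200 entry for brcm43456 while the remaining revisions still fall into 0xFFFFFDC0 for brcm43455:

#include <stdint.h>
#include <stdio.h>

/* a firmware table entry matches when BIT(chiprev) is set in its rev mask */
static int rev_matches(uint32_t rev_mask, unsigned int chiprev)
{
	return (rev_mask & (1u << chiprev)) != 0;
}

int main(void)
{
	printf("rev 9 vs 0x00000200: %d\n", rev_matches(0x00000200, 9)); /* 1 -> brcm43456 */
	printf("rev 9 vs 0xFFFFFDC0: %d\n", rev_matches(0xFFFFFDC0, 9)); /* 0 -> no longer brcm43455 */
	printf("rev 6 vs 0xFFFFFDC0: %d\n", rev_matches(0xFFFFFDC0, 6)); /* 1 -> still brcm43455 */
	return 0;
}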
Signed-off-by: Ondrej Jirman <megous@megous.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Fang Yafen <yafen@iscas.ac.cn>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
index abaed2fa2def..18e9e52f8ee7 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
@@ -621,6 +621,7 @@ BRCMF_FW_DEF(43430A0, "brcmfmac43430a0-sdio");
 /* Note the names are not postfixed with a1 for backward compatibility */
 BRCMF_FW_DEF(43430A1, "brcmfmac43430-sdio");
 BRCMF_FW_DEF(43455, "brcmfmac43455-sdio");
+BRCMF_FW_DEF(43456, "brcmfmac43456-sdio");
 BRCMF_FW_DEF(4354, "brcmfmac4354-sdio");
 BRCMF_FW_DEF(4356, "brcmfmac4356-sdio");
 BRCMF_FW_DEF(4373, "brcmfmac4373-sdio");
@@ -640,7 +641,8 @@ static const struct brcmf_firmware_mapping brcmf_sdio_fwnames[] = {
 	BRCMF_FW_ENTRY(BRCM_CC_4339_CHIP_ID, 0xFFFFFFFF, 4339),
 	BRCMF_FW_ENTRY(BRCM_CC_43430_CHIP_ID, 0x00000001, 43430A0),
 	BRCMF_FW_ENTRY(BRCM_CC_43430_CHIP_ID, 0xFFFFFFFE, 43430A1),
-	BRCMF_FW_ENTRY(BRCM_CC_4345_CHIP_ID, 0xFFFFFFC0, 43455),
+	BRCMF_FW_ENTRY(BRCM_CC_4345_CHIP_ID, 0x00000200, 43456),
+	BRCMF_FW_ENTRY(BRCM_CC_4345_CHIP_ID, 0xFFFFFDC0, 43455),
 	BRCMF_FW_ENTRY(BRCM_CC_4354_CHIP_ID, 0xFFFFFFFF, 4354),
 	BRCMF_FW_ENTRY(BRCM_CC_4356_CHIP_ID, 0xFFFFFFFF, 4356),
 	BRCMF_FW_ENTRY(CY_CC_4373_CHIP_ID, 0xFFFFFFFF, 4373)