Kernel

kernel@openeuler.org

  • 26 participants
  • 18023 discussions
[PATCH openEuler-1.0-LTS 05/19] nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
by Yang Yingliang 11 May '21

From: Andreas Gruenbacher <agruenba(a)redhat.com>

stable inclusion
from linux-4.19.121
commit 7b4e9bfa245fa6b0c149326e1c8abb14cae5e9b8

--------------------------------

commit 7648f939cb919b9d15c21fff8cd9eba908d595dc upstream.

nfs3_set_acl keeps track of the acl it allocated locally to determine if
an acl needs to be released at the end. This results in a memory leak
when the function allocates an acl as well as a default acl. Fix by
releasing acls that differ from the acl originally passed into
nfs3_set_acl.

Fixes: b7fa0554cf1b ("[PATCH] NFS: Add support for NFSv3 ACLs")
Reported-by: Xiyu Yang <xiyuyang19(a)fudan.edu.cn>
Signed-off-by: Andreas Gruenbacher <agruenba(a)redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
Reviewed-by: Hou Tao <houtao1(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 fs/nfs/nfs3acl.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/nfs3acl.c b/fs/nfs/nfs3acl.c
index c5c3fc6e6c600..26c94b32d6f49 100644
--- a/fs/nfs/nfs3acl.c
+++ b/fs/nfs/nfs3acl.c
@@ -253,37 +253,45 @@ int nfs3_proc_setacls(struct inode *inode, struct posix_acl *acl,
 int nfs3_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
-    struct posix_acl *alloc = NULL, *dfacl = NULL;
+    struct posix_acl *orig = acl, *dfacl = NULL, *alloc;
     int status;
 
     if (S_ISDIR(inode->i_mode)) {
         switch(type) {
         case ACL_TYPE_ACCESS:
-            alloc = dfacl = get_acl(inode, ACL_TYPE_DEFAULT);
+            alloc = get_acl(inode, ACL_TYPE_DEFAULT);
             if (IS_ERR(alloc))
                 goto fail;
+            dfacl = alloc;
             break;
 
         case ACL_TYPE_DEFAULT:
-            dfacl = acl;
-            alloc = acl = get_acl(inode, ACL_TYPE_ACCESS);
+            alloc = get_acl(inode, ACL_TYPE_ACCESS);
             if (IS_ERR(alloc))
                 goto fail;
+            dfacl = acl;
+            acl = alloc;
             break;
         }
     }
 
     if (acl == NULL) {
-        alloc = acl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
+        alloc = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
         if (IS_ERR(alloc))
             goto fail;
+        acl = alloc;
     }
 
     status = __nfs3_proc_setacls(inode, acl, dfacl);
-    posix_acl_release(alloc);
+out:
+    if (acl != orig)
+        posix_acl_release(acl);
+    if (dfacl != orig)
+        posix_acl_release(dfacl);
     return status;
 
 fail:
-    return PTR_ERR(alloc);
+    status = PTR_ERR(alloc);
+    goto out;
 }
 
 const struct xattr_handler *nfs3_xattr_handlers[] = {
-- 
2.25.1
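The ownership rule the fix enforces generalizes beyond NFS: release
exactly the references you took locally, determined by comparing each
pointer against what the caller passed in, rather than remembering a
single "alloc" pointer that can only record one of two allocations. A
minimal standalone C sketch of that pattern (invented names, not the
kernel code):

#include <stdlib.h>

struct ref { int count; };

static struct ref *ref_get(void)
{
    struct ref *r = calloc(1, sizeof(*r));
    if (r)
        r->count = 1;
    return r;
}

static void ref_put(struct ref *r)
{
    if (r && --r->count == 0)
        free(r);
}

static void set_pair(struct ref *acl, struct ref *dfacl)
{
    struct ref *orig_acl = acl, *orig_dfacl = dfacl;

    if (!acl)
        acl = ref_get();    /* local allocation #1 */
    if (!dfacl)
        dfacl = ref_get();  /* local allocation #2: a single "alloc"
                             * pointer could only remember one of these,
                             * leaking the other */

    /* ... the real code would hand acl/dfacl to the server here ... */

    /* release only what was allocated locally */
    if (acl != orig_acl)
        ref_put(acl);
    if (dfacl != orig_dfacl)
        ref_put(dfacl);
}

int main(void)
{
    set_pair(NULL, NULL);   /* both allocated locally; both released */
    return 0;
}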
[PATCH openEuler-1.0-LTS 04/19] NFSv4/pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
by Yang Yingliang 11 May '21

From: Trond Myklebust <trond.myklebust(a)hammerspace.com>

stable inclusion
from linux-4.19.118
commit 401876dbcf6be94b31a957ccccb8e028e9d3d9cc

--------------------------------

[ Upstream commit d911c57a19551c6bef116a3b55c6b089901aacb0 ]

Make sure to test the stateid for validity so that we catch instances
where the server may have been reusing stateids in
nfs_layout_find_inode_by_stateid().

Fixes: 7b410d9ce460 ("pNFS: Delay getting the layout header in CB_LAYOUTRECALL handlers")
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
Reviewed-by: Hou Tao <houtao1(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 fs/nfs/callback_proc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 3159673549540..bcc51f131a496 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -130,6 +130,8 @@ static struct inode *nfs_layout_find_inode_by_stateid(struct nfs_client *clp,
     list_for_each_entry_rcu(server, &clp->cl_superblocks, client_link) {
         list_for_each_entry(lo, &server->layouts, plh_layouts) {
+            if (!pnfs_layout_is_valid(lo))
+                continue;
             if (stateid != NULL &&
                 !nfs4_stateid_match_other(stateid, &lo->plh_stateid))
                 continue;
-- 
2.25.1
[PATCH kernel-4.19] cgroup/files: support boot parameter to control if disable files cgroup
by Yang Yingliang 11 May '21

hulk inclusion
category: bugfix
bugzilla: NA
CVE: NA

--------------------------------

When the files cgroup is enabled, it leads to a syscall performance
regression in UnixBench. Add a helper files_cgroup_enabled() and use it
to control whether the files cgroup is used; cgroup_disable=files can be
passed on the kernel command line to disable it.

syscall score of UnixBench (larger is better):
  files cgroup enabled:          2868.5
  files cgroup disabled:         3177.0
  files cgroup config disabled:  3186.5

Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
Reviewed-by: Tao Hou <houtao1(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 .../admin-guide/kernel-parameters.txt |  7 ++++---
 fs/file.c                             | 21 ++++++++++++-------
 include/linux/filescontrol.h          |  6 ++++++
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 564b14b0b9b84..05b9b8c6034cd 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -492,9 +492,10 @@
             a single hierarchy
             - foo isn't visible as an individually mountable
               subsystem
-            {Currently only "memory" controller deal with this and
-            cut the overhead, others just disable the usage. So
-            only cgroup_disable=memory is actually worthy}
+            {Currently "memory" and "files" controller deal with
+            this and cut the overhead, others just disable the usage.
+            So cgroup_disable=memory and cgroup_disable=files are
+            actually worthy}
 
     cgroup_no_v1=   [KNL] Disable one, multiple, all cgroup controllers in v1
             Format: { controller[,controller...] | "all" }
diff --git a/fs/file.c b/fs/file.c
index 6dd55640e91fc..570b757fc469f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -295,7 +295,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
     new_fdt->full_fds_bits = newf->full_fds_bits_init;
     new_fdt->fd = &newf->fd_array[0];
 #ifdef CONFIG_CGROUP_FILES
-    files_cgroup_assign(newf);
+    if (files_cgroup_enabled())
+        files_cgroup_assign(newf);
 #endif
 
     spin_lock(&oldf->file_lock);
@@ -361,6 +362,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
     rcu_assign_pointer(newf->fdt, new_fdt);
 
 #ifdef CONFIG_CGROUP_FILES
+    if (!files_cgroup_enabled())
+        return newf;
     spin_lock(&newf->file_lock);
     if (!files_cgroup_alloc_fd(newf, files_cgroup_count_fds(newf))) {
         spin_unlock(&newf->file_lock);
@@ -385,7 +388,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 out_release:
 #ifdef CONFIG_CGROUP_FILES
-    files_cgroup_remove(newf);
+    if (files_cgroup_enabled())
+        files_cgroup_remove(newf);
 #endif
     kmem_cache_free(files_cachep, newf);
 out:
@@ -413,7 +417,8 @@ static struct fdtable *close_files(struct files_struct * files)
             struct file * file = xchg(&fdt->fd[i], NULL);
             if (file) {
 #ifdef CONFIG_CGROUP_FILES
-                files_cgroup_unalloc_fd(files, 1);
+                if (files_cgroup_enabled())
+                    files_cgroup_unalloc_fd(files, 1);
 #endif
                 filp_close(file, files);
                 cond_resched();
@@ -424,7 +429,8 @@ static struct fdtable *close_files(struct files_struct * files)
         }
     }
 #ifdef CONFIG_CGROUP_FILES
-    files_cgroup_remove(files);
+    if (files_cgroup_enabled())
+        files_cgroup_remove(files);
 #endif
 
     return fdt;
@@ -546,7 +552,7 @@ int __alloc_fd(struct files_struct *files,
     if (error)
         goto repeat;
 #ifdef CONFIG_CGROUP_FILES
-    if (files_cgroup_alloc_fd(files, 1)) {
+    if (files_cgroup_enabled() && files_cgroup_alloc_fd(files, 1)) {
         error = -EMFILE;
         goto out;
     }
@@ -589,7 +595,7 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
     struct fdtable *fdt = files_fdtable(files);
 #ifdef CONFIG_CGROUP_FILES
-    if (test_bit(fd, fdt->open_fds))
+    if (files_cgroup_enabled() && test_bit(fd, fdt->open_fds))
         files_cgroup_unalloc_fd(files, 1);
 #endif
     __clear_open_fd(fd, fdt);
@@ -878,7 +884,8 @@ __releases(&files->file_lock)
         goto out;
     }
 #ifdef CONFIG_CGROUP_FILES
-    if (!tofree && files_cgroup_alloc_fd(files, 1)) {
+    if (files_cgroup_enabled() &&
+        !tofree && files_cgroup_alloc_fd(files, 1)) {
         err = -EMFILE;
         goto out;
     }
diff --git a/include/linux/filescontrol.h b/include/linux/filescontrol.h
index 49dc620cf64ec..0182f145a3392 100644
--- a/include/linux/filescontrol.h
+++ b/include/linux/filescontrol.h
@@ -19,6 +19,7 @@
 #define _LINUX_FILESCONTROL_H
 
 #include <linux/fdtable.h>
+#include <linux/cgroup.h>
 
 #ifdef CONFIG_CGROUP_FILES
 
@@ -30,5 +31,10 @@ extern struct files_struct init_files;
 void files_cgroup_assign(struct files_struct *files);
 void files_cgroup_remove(struct files_struct *files);
 
+static inline bool files_cgroup_enabled(void)
+{
+    return cgroup_subsys_enabled(files_cgrp_subsys);
+}
+
 #endif /* CONFIG_CGROUP_FILES */
 #endif /* _LINUX_FILESCONTROL_H */
-- 
2.25.1
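For reference, usage is a one-line change to the boot command line (the
surrounding parameters below are illustrative):

    ... root=/dev/sda2 ro cgroup_disable=files

With the parameter present, files_cgroup_enabled() returns false and all
of the hooks guarded above are skipped at runtime; without it, the
controller behaves as before.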
[PATCH openEuler-1.0-LTS] efi: Fix a race and a buffer overflow while reading efivars via sysfs
by Zheng Zengkai 10 May '21

From: Vladis Dronov <vdronov(a)redhat.com>

mainline inclusion
from mainline-5.6-rc6
commit 286d3250c9d6437340203fb64938bea344729a0e
category: bugfix
bugzilla: 31390, https://gitee.com/openeuler/kernel/issues/I3QONX
CVE: NA

-------------------------------------------------

There is a race and a buffer overflow corrupting kernel memory while
reading an EFI variable larger than 1024 bytes via the older sysfs
method. This happens because accessing struct efi_variable in
efivar_{attr,size,data}_read() and friends is not protected from
concurrent access, leading to kernel memory corruption and, at best, to
a crash. The race scenario is the following:

CPU0:                                  CPU1:

efivar_attr_read()
  var->DataSize = 1024;
  efivar_entry_get(... &var->DataSize)
    down_interruptible(&efivars_lock)
                                       efivar_attr_read() // same EFI var
                                         var->DataSize = 1024;
                                         efivar_entry_get(... &var->DataSize)
                                           down_interruptible(&efivars_lock)
    virt_efi_get_variable()
    // returns EFI_BUFFER_TOO_SMALL but
    // var->DataSize is set to a real
    // var size more than 1024 bytes
    up(&efivars_lock)
                                           virt_efi_get_variable()
                                           // called with var->DataSize set
                                           // to a real var size, returns
                                           // successfully and overwrites
                                           // a 1024-bytes kernel buffer
                                           up(&efivars_lock)

This can be reproduced by concurrent reading of an EFI variable whose
size is more than 1024 bytes:

  ts# for cpu in $(seq 0 $(nproc --ignore=1)); do ( taskset -c $cpu \
      cat /sys/firmware/efi/vars/KEKDefault*/size & ) ; done

Fix this by using a local variable for a var's data buffer size so it
does not get overwritten.

Fixes: e14ab23dde12b80d ("efivars: efivar_entry API")
Reported-by: Bob Sanders <bob.sanders(a)hpe.com> and the LTP testsuite
Signed-off-by: Vladis Dronov <vdronov(a)redhat.com>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200305084041.24053-2-vdronov@redhat.com
Link: https://lore.kernel.org/r/20200308080859.21568-24-ardb@kernel.org
Signed-off-by: Zheng Zengkai <zhengzengkai(a)huawei.com>
---
 drivers/firmware/efi/efivars.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/firmware/efi/efivars.c b/drivers/firmware/efi/efivars.c
index 463b94a72703..8cdd01117731 100644
--- a/drivers/firmware/efi/efivars.c
+++ b/drivers/firmware/efi/efivars.c
@@ -139,13 +139,16 @@ static ssize_t
 efivar_attr_read(struct efivar_entry *entry, char *buf)
 {
     struct efi_variable *var = &entry->var;
+    unsigned long size = sizeof(var->Data);
     char *str = buf;
+    int ret;
 
     if (!entry || !buf)
         return -EINVAL;
 
-    var->DataSize = 1024;
-    if (efivar_entry_get(entry, &var->Attributes, &var->DataSize, var->Data))
+    ret = efivar_entry_get(entry, &var->Attributes, &size, var->Data);
+    var->DataSize = size;
+    if (ret)
         return -EIO;
 
     if (var->Attributes & EFI_VARIABLE_NON_VOLATILE)
@@ -172,13 +175,16 @@ static ssize_t
 efivar_size_read(struct efivar_entry *entry, char *buf)
 {
     struct efi_variable *var = &entry->var;
+    unsigned long size = sizeof(var->Data);
     char *str = buf;
+    int ret;
 
     if (!entry || !buf)
         return -EINVAL;
 
-    var->DataSize = 1024;
-    if (efivar_entry_get(entry, &var->Attributes, &var->DataSize, var->Data))
+    ret = efivar_entry_get(entry, &var->Attributes, &size, var->Data);
+    var->DataSize = size;
+    if (ret)
         return -EIO;
 
     str += sprintf(str, "0x%lx\n", var->DataSize);
@@ -189,12 +195,15 @@ static ssize_t
 efivar_data_read(struct efivar_entry *entry, char *buf)
 {
     struct efi_variable *var = &entry->var;
+    unsigned long size = sizeof(var->Data);
+    int ret;
 
     if (!entry || !buf)
         return -EINVAL;
 
-    var->DataSize = 1024;
-    if (efivar_entry_get(entry, &var->Attributes, &var->DataSize, var->Data))
+    ret = efivar_entry_get(entry, &var->Attributes, &size, var->Data);
+    var->DataSize = size;
+    if (ret)
         return -EIO;
 
     memcpy(buf, var->Data, var->DataSize);
@@ -314,14 +323,16 @@ efivar_show_raw(struct efivar_entry *entry, char *buf)
 {
     struct efi_variable *var = &entry->var;
     struct compat_efi_variable *compat;
+    unsigned long datasize = sizeof(var->Data);
     size_t size;
+    int ret;
 
     if (!entry || !buf)
         return 0;
 
-    var->DataSize = 1024;
-    if (efivar_entry_get(entry, &entry->var.Attributes,
-                         &entry->var.DataSize, entry->var.Data))
+    ret = efivar_entry_get(entry, &var->Attributes, &datasize, var->Data);
+    var->DataSize = datasize;
+    if (ret)
         return -EIO;
 
     if (is_compat()) {
-- 
2.20.1
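The essence of the fix is that the in/out size argument moves from a
structure shared by all readers to a stack local, so a concurrent reader
can no longer enlarge it between one reader's EFI_BUFFER_TOO_SMALL
round-trip and its next call. A standalone C sketch of the pattern
(single-threaded, invented names, not the driver code):

#include <stdio.h>
#include <string.h>

struct efi_var_like {
    unsigned long DataSize;     /* shared across all readers of this entry */
    unsigned char Data[1024];
};

/* get_variable-style callee: *size is in/out (capacity in, length out) */
static int get_var(unsigned char *dst, unsigned long *size,
                   const unsigned char *src, unsigned long real_len)
{
    if (*size < real_len) {
        *size = real_len;       /* report required size: BUFFER_TOO_SMALL */
        return -1;
    }
    memcpy(dst, src, real_len);
    *size = real_len;
    return 0;
}

static void read_fixed(struct efi_var_like *var,
                       const unsigned char *src, unsigned long real_len)
{
    /* stack local: racers each get their own copy, and the pre-fix
     * hazard (a racer growing the shared DataSize past the 1024-byte
     * capacity of Data[]) disappears */
    unsigned long size = sizeof(var->Data);
    int ret = get_var(var->Data, &size, src, real_len);

    var->DataSize = size;       /* publish after the call, as the patch does */
    if (ret)
        return;                 /* the driver returns -EIO here */
}

int main(void)
{
    static unsigned char big[2048];     /* variable larger than Data[] */
    struct efi_var_like v = { 0 };

    read_fixed(&v, big, sizeof(big));   /* fails cleanly, no overflow */
    printf("DataSize = %lu\n", v.DataSize);
    return 0;
}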
[PATCH kernel-4.19] RDMA/hns: Allocate one more recv SGE for HIP08
by Yang Yingliang 10 May '21

From: wanglin <wanglin137(a)huawei.com>

driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

The RQ/SRQ of HIP08 needs one special sge to stop receive reliably. So
the driver needs to allocate at least one SGE when creating RQ/SRQ and
ensure that at least one SGE is filled with the special value during
post_recv.

Besides, the kernel driver should only do this for kernel ULP. For
userspace ULP, the userspace driver will allocate the reserved SGE in
buffer, and the kernel driver just needs to pin the corresponding size
of memory based on the userspace driver's requirements.

Reviewed-by: Hu Chunzhi <huchunzhi(a)huawei.com>
Reviewed-by: Zhao Weibo <zhaoweibo3(a)huawei.com>
Signed-off-by: wanglin <wanglin137(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_device.h |  2 +
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c  | 24 +++++++-----
 drivers/infiniband/hw/hns/hns_roce_hw_v2.h  |  6 +--
 drivers/infiniband/hw/hns/hns_roce_qp.c     | 33 +++++++++++++++--
 drivers/infiniband/hw/hns/hns_roce_srq.c    | 41 +++++++++++++++++----
 5 files changed, 83 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index 533df3c68d2ff..c5edeaa6ee8e8 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -493,6 +493,7 @@ struct hns_roce_wq {
     int wqe_cnt; /* WQE num */
     u32 max_post;
     int max_gs;
+    u32 rsv_sge;
     int offset;
     int wqe_shift; /* WQE size */
     u32 head;
@@ -601,6 +602,7 @@ struct hns_roce_srq {
     unsigned long srqn;
     int max;
     int max_gs;
+    u32 rsv_sge;
     int wqe_shift;
     void __iomem *db_reg_l;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index aa0fe9fd7d352..d3a70dd82d337 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -742,6 +742,7 @@ static int hns_roce_v2_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
     struct device *dev = hr_dev->dev;
     unsigned long flags = 0;
     void *wqe = NULL;
+    u32 max_sge;
     int ret = 0;
     int nreq;
     int ind;
@@ -766,6 +767,7 @@ static int hns_roce_v2_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
         return -EINVAL;
     }
 
+    max_sge = hr_qp->rq.max_gs - hr_qp->rq.rsv_sge;
     for (nreq = 0; wr; ++nreq, wr = wr->next) {
         if (hns_roce_wq_overflow(&hr_qp->rq, nreq,
                                  hr_qp->ibqp.recv_cq)) {
@@ -774,9 +776,9 @@ static int hns_roce_v2_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
             goto out;
         }
 
-        if (unlikely(wr->num_sge >= hr_qp->rq.max_gs)) {
-            dev_err(dev, "RQ: sge num(%d) is larger or equal than max sge num(%d)\n",
-                wr->num_sge, hr_qp->rq.max_gs);
+        if (unlikely(wr->num_sge > max_sge)) {
+            dev_err(dev, "RQ: sge num(%d) is larger than max sge num(%d)\n",
+                wr->num_sge, max_sge);
             ret = -EINVAL;
             *bad_wr = wr;
             goto out;
@@ -791,7 +793,7 @@ static int hns_roce_v2_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
             dseg++;
         }
 
-        if (wr->num_sge < hr_qp->rq.max_gs) {
+        if (hr_qp->rq.rsv_sge) {
             dseg->lkey = cpu_to_le32(HNS_ROCE_INVALID_LKEY);
             dseg->addr = 0;
             dseg->len = cpu_to_le32(HNS_ROCE_INVALID_SGE_LENGTH);
@@ -1985,10 +1987,12 @@ static int hns_roce_query_pf_caps(struct hns_roce_dev *hr_dev)
     caps->max_sq_sg = le16_to_cpu(resp_a->max_sq_sg);
     caps->max_sq_inline = le16_to_cpu(resp_a->max_sq_inline);
     caps->max_rq_sg = le16_to_cpu(resp_a->max_rq_sg);
+    caps->max_rq_sg = roundup_pow_of_two(caps->max_rq_sg);
     caps->max_extend_sg = le32_to_cpu(resp_a->max_extend_sg);
     caps->num_qpc_timer = le16_to_cpu(resp_a->num_qpc_timer);
     caps->num_cqc_timer = le16_to_cpu(resp_a->num_cqc_timer);
     caps->max_srq_sges = le16_to_cpu(resp_a->max_srq_sges);
+    caps->max_srq_sges = roundup_pow_of_two(caps->max_srq_sges);
     caps->num_aeq_vectors = resp_a->num_aeq_vectors;
     caps->num_other_vectors = resp_a->num_other_vectors;
     caps->max_sq_desc_sz = resp_a->max_sq_desc_sz;
@@ -5429,7 +5433,7 @@ static int hns_roce_v2_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
 done:
     qp_attr->cur_qp_state = qp_attr->qp_state;
     qp_attr->cap.max_recv_wr = hr_qp->rq.wqe_cnt;
-    qp_attr->cap.max_recv_sge = hr_qp->rq.max_gs;
+    qp_attr->cap.max_recv_sge = hr_qp->rq.max_gs - hr_qp->rq.rsv_sge;
 
     if (!ibqp->uobject) {
         qp_attr->cap.max_send_wr = hr_qp->sq.wqe_cnt;
@@ -7051,7 +7055,7 @@ int hns_roce_v2_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr)
     attr->srq_limit = limit_wl;
     attr->max_wr = srq->max - 1;
-    attr->max_sge = srq->max_gs - HNS_ROCE_RESERVED_SGE;
+    attr->max_sge = srq->max_gs - srq->rsv_sge;
 
     memcpy(srq_context, mailbox->buf, sizeof(*srq_context));
@@ -7101,6 +7105,7 @@ static int hns_roce_v2_post_srq_recv(struct ib_srq *ibsrq,
     unsigned long flags;
     int ret = 0;
     int wqe_idx;
+    u32 max_sge;
     void *wqe;
     int nreq;
     int ind;
@@ -7109,11 +7114,12 @@ static int hns_roce_v2_post_srq_recv(struct ib_srq *ibsrq,
     spin_lock_irqsave(&srq->lock, flags);
 
     ind = srq->head & (srq->max - 1);
+    max_sge = srq->max_gs - srq->rsv_sge;
 
     for (nreq = 0; wr; ++nreq, wr = wr->next) {
-        if (unlikely(wr->num_sge >= srq->max_gs)) {
+        if (unlikely(wr->num_sge > max_sge)) {
             dev_err(hr_dev->dev,
                 "srq(0x%lx) wr sge num(%d) exceed the max num %d.\n",
-                srq->srqn, wr->num_sge, srq->max_gs);
+                srq->srqn, wr->num_sge, max_sge);
             ret = -EINVAL;
             *bad_wr = wr;
             break;
@@ -7138,7 +7144,7 @@ static int hns_roce_v2_post_srq_recv(struct ib_srq *ibsrq,
             dseg[i].addr = cpu_to_le64(wr->sg_list[i].addr);
         }
 
-        if (wr->num_sge < srq->max_gs) {
+        if (srq->rsv_sge) {
             dseg[i].len = cpu_to_le32(HNS_ROCE_INVALID_SGE_LENGTH);
             dseg[i].lkey = cpu_to_le32(HNS_ROCE_INVALID_LKEY);
             dseg[i].addr = 0;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
index dad3ac0b4731c..11318105826a5 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
@@ -58,16 +58,16 @@
 #define HNS_ROCE_V2_MAX_WQE_NUM         0x8000
 #define HNS_ROCE_V2_MAX_SRQ             0x100000
 #define HNS_ROCE_V2_MAX_SRQ_WR          0x8000
-#define HNS_ROCE_V2_MAX_SRQ_SGE         0xff
+#define HNS_ROCE_V2_MAX_SRQ_SGE         0x100
 #define HNS_ROCE_V2_MAX_CQ_NUM          0x100000
 #define HNS_ROCE_V2_MAX_CQC_TIMER_NUM   0x100
 #define HNS_ROCE_V2_MAX_SRQ_NUM         0x100000
 #define HNS_ROCE_V2_MAX_CQE_NUM         0x400000
 #define HNS_ROCE_V2_MAX_SRQWQE_NUM      0x8000
 /* reserve one sge to circumvent a hardware issue */
-#define HNS_ROCE_V2_MAX_RQ_SGE_NUM      0xff
+#define HNS_ROCE_V2_MAX_RQ_SGE_NUM      0x100
 #define HNS_ROCE_V2_MAX_SQ_SGE_NUM      0xff
-#define HNS_ROCE_V2_MAX_SRQ_SGE_NUM     0xff
+#define HNS_ROCE_V2_MAX_SRQ_SGE_NUM     0x100
 #define HNS_ROCE_V2_MAX_EXTEND_SGE_NUM  0x200000
 #define HNS_ROCE_V2_MAX_SQ_INLINE       0x20
 #define HNS_ROCE_V2_UAR_NUM             256
diff --git a/drivers/infiniband/hw/hns/hns_roce_qp.c b/drivers/infiniband/hw/hns/hns_roce_qp.c
index 520213d6461ed..e0f2413c2a0a3 100644
--- a/drivers/infiniband/hw/hns/hns_roce_qp.c
+++ b/drivers/infiniband/hw/hns/hns_roce_qp.c
@@ -344,17 +344,42 @@ void hns_roce_release_range_qp(struct hns_roce_dev *hr_dev, int base_qpn,
 }
 EXPORT_SYMBOL_GPL(hns_roce_release_range_qp);
 
+static u32 proc_rq_sge(struct hns_roce_dev *dev, struct hns_roce_qp *hr_qp,
+                       int user)
+{
+    u32 max_sge = dev->caps.max_rq_sg;
+
+    if (dev->pci_dev->revision > PCI_REVISION_ID_HIP08_B)
+        return max_sge;
+
+    /* Reserve SGEs only for HIP08 in kernel; The userspace driver will
+     * calculate number of max_sge with reserved SGEs when allocating wqe
+     * buf, so there is no need to do this again in kernel. But the number
+     * may exceed the capacity of SGEs recorded in the firmware, so the
+     * kernel driver should just adapt the value accordingly.
+     */
+    if (user)
+        max_sge = roundup_pow_of_two(max_sge + 1);
+    else
+        hr_qp->rq.rsv_sge = 1;
+
+    return max_sge;
+}
+
 static int hns_roce_set_rq_size(struct hns_roce_dev *hr_dev,
                                 struct ib_qp_cap *cap, int is_user, int has_rq,
                                 struct hns_roce_qp *hr_qp)
 {
+    u32 max_sge = proc_rq_sge(hr_dev, hr_qp, is_user);
     struct device *dev = hr_dev->dev;
     u32 max_cnt;
 
     /* Check the validity of QP support capacity */
     if (cap->max_recv_wr > hr_dev->caps.max_wqes ||
-        cap->max_recv_sge > hr_dev->caps.max_rq_sg) {
-        dev_err(dev, "RQ(0x%lx) WR or sge error!max_recv_wr=%d max_recv_sge=%d\n",
+        cap->max_recv_sge > max_sge) {
+        dev_err(dev, "RQ(0x%lx) WR or sge error, depth = %u, sge = %u\n",
             hr_qp->qpn, cap->max_recv_wr, cap->max_recv_sge);
         return -EINVAL;
     }
@@ -386,7 +411,7 @@ static int hns_roce_set_rq_size(struct hns_roce_dev *hr_dev,
         max_cnt = max(1U, cap->max_recv_sge);
         hr_qp->rq.max_gs = roundup_pow_of_two(max_cnt +
-                                              HNS_ROCE_RESERVED_SGE);
+                                              hr_qp->rq.rsv_sge);
 
         if (hr_dev->caps.max_rq_sg <= HNS_ROCE_MAX_SGE_NUM)
             hr_qp->rq.wqe_shift = ilog2(hr_dev->caps.max_rq_desc_sz);
@@ -397,7 +422,7 @@ static int hns_roce_set_rq_size(struct hns_roce_dev *hr_dev,
     }
 
     cap->max_recv_wr = hr_qp->rq.max_post = hr_qp->rq.wqe_cnt;
-    cap->max_recv_sge = hr_qp->rq.max_gs - HNS_ROCE_RESERVED_SGE;
+    cap->max_recv_sge = hr_qp->rq.max_gs - hr_qp->rq.rsv_sge;
 
     return 0;
 }
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
index 1e17484d3676a..9d4aec18be0fc 100644
--- a/drivers/infiniband/hw/hns/hns_roce_srq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
@@ -30,7 +30,7 @@
  * SOFTWARE.
  */
 #include "roce_k_compat.h"
-
+#include <linux/pci.h>
 #include <rdma/ib_umem.h>
 #include <rdma/hns-abi.h>
 #include "hns_roce_device.h"
@@ -431,6 +431,28 @@ static void destroy_kernel_srq(struct hns_roce_dev *hr_dev,
     hns_roce_buf_free(hr_dev, srq_buf_size, &srq->buf);
 }
 
+static u32 proc_srq_sge(struct hns_roce_dev *dev, struct hns_roce_srq *hr_srq,
+                        bool user)
+{
+    u32 max_sge = dev->caps.max_srq_sges;
+
+    if (dev->pci_dev->revision > PCI_REVISION_ID_HIP08_B)
+        return max_sge;
+
+    /* Reserve SGEs only for HIP08 in kernel; The userspace driver will
+     * calculate number of max_sge with reserved SGEs when allocating wqe
+     * buf, so there is no need to do this again in kernel. But the number
+     * may exceed the capacity of SGEs recorded in the firmware, so the
+     * kernel driver should just adapt the value accordingly.
+     */
+    if (user)
+        max_sge = roundup_pow_of_two(max_sge + 1);
+    else
+        hr_srq->rsv_sge = 1;
+
+    return max_sge;
+}
+
 struct ib_srq *hns_roce_create_srq(struct ib_pd *pd,
                                    struct ib_srq_init_attr *srq_init_attr,
                                    struct ib_udata *udata)
@@ -439,23 +461,26 @@ struct ib_srq *hns_roce_create_srq(struct ib_pd *pd,
     struct hns_roce_srq *srq;
     int srq_desc_size;
     int srq_buf_size;
+    u32 max_sge;
     int ret;
     u32 cqn;
 
-    /* Check the actual SRQ wqe and SRQ sge num */
-    if (srq_init_attr->attr.max_wr >= hr_dev->caps.max_srq_wrs ||
-        srq_init_attr->attr.max_sge > hr_dev->caps.max_srq_sges)
-        return ERR_PTR(-EINVAL);
-
     srq = kzalloc(sizeof(*srq), GFP_KERNEL);
     if (!srq)
         return ERR_PTR(-ENOMEM);
 
+    max_sge = proc_srq_sge(hr_dev, srq, !!udata);
+
+    if (srq_init_attr->attr.max_wr >= hr_dev->caps.max_srq_wrs ||
+        srq_init_attr->attr.max_sge > max_sge)
+        return ERR_PTR(-EINVAL);
+
     mutex_init(&srq->mutex);
     spin_lock_init(&srq->lock);
 
     srq->max = roundup_pow_of_two(srq_init_attr->attr.max_wr + 1);
-    srq->max_gs = srq_init_attr->attr.max_sge + HNS_ROCE_RESERVED_SGE;
+    srq->max_gs =
+        roundup_pow_of_two(srq_init_attr->attr.max_sge + srq->rsv_sge);
 
     srq_desc_size = max(HNS_ROCE_SGE_SIZE, HNS_ROCE_SGE_SIZE * srq->max_gs);
     srq_desc_size = roundup_pow_of_two(srq_desc_size);
@@ -499,6 +524,8 @@ struct ib_srq *hns_roce_create_srq(struct ib_pd *pd,
     srq->event = hns_roce_ib_srq_event;
     srq->ibsrq.ext.xrc.srq_num = srq->srqn;
+    srq_init_attr->attr.max_wr = srq->max;
+    srq_init_attr->attr.max_sge = srq->max_gs - srq->rsv_sge;
 
     if (pd->uobject) {
         if (ib_copy_to_udata(udata, &srq->srqn, sizeof(__u32))) {
-- 
2.25.1
[PATCH kernel-4.19 1/5] mm/vmscan: count lazyfree pages and fix nr_isolated_* mismatch
by Yang Yingliang 08 May '21

From: Jaewon Kim <jaewon31.kim(a)samsung.com>

mainline inclusion
from mainline-v5.8-rc1
commit 1f318a9b0dc3990490e98eef48f21e6f15185781
category: bugfix
bugzilla: 36239
CVE: NA

-------------------------------------------------

Fix an nr_isolate_* mismatch problem between cma and dirty lazyfree
pages.

If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree
page, then the lazyfree page is changed to a normal anon page having
SwapBacked by commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"). Even
with the change, reclaim context correctly counts isolated files because
it uses is_file_lru to distinguish file. And the change to anon does not
happen if try_to_unmap_one is used for migration. So migration context
like compaction also correctly counts isolated files even though it uses
page_is_file_lru instead of is_file_lru. Recently page_is_file_cache was
renamed to page_is_file_lru by commit 9de4f22a60f7 ("mm: code cleanup
for MADV_FREE").

But the nr_isolate_* mismatch problem happens on cma alloc. There is
reclaim_clean_pages_from_list which is being used only by cma. It was
introduced by commit 02c6de8d757c ("mm: cma: discard clean pages during
contiguous allocation instead of migration") to reclaim clean file pages
without migration. The cma alloc uses both reclaim_clean_pages_from_list
and migrate_pages, and it uses page_is_file_lru to count isolated files.
If there are dirty lazyfree pages allocated from the cma memory region,
the pages are counted as isolated file at the beginning but are counted
as isolated anon after finished.

Mem-Info:
Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB
 inactive_file:205636kB unevictable:10416kB isolated(anon):0kB
 isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB
 shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

Like the log above, there were too many isolated files, 37664kB, which
triggers too_many_isolated in reclaim even when there is no actually
isolated file system-wide. It could be reproduced by running two
programs, one writing to an MADV_FREE page and one doing cma alloc,
respectively. Although isolated anon is 0, I found that the internal
value of isolated anon was the negative of the isolated file count.

Fix this by compensating the isolated count for both LRU lists. Count
non-discarded lazyfree pages in shrink_page_list, then compensate the
counted number in reclaim_clean_pages_from_list.

Reported-by: Yong-Taek Lee <ytk.lee(a)samsung.com>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Marek Szyprowski <m.szyprowski(a)samsung.com>
Cc: Michal Nazarewicz <mina86(a)mina86.com>
Cc: Shaohua Li <shli(a)fb.com>
Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 include/linux/vmstat.h |  1 +
 mm/vmscan.c            | 27 ++++++++++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index f25cef84b41db..0beaea0e5b832 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -29,6 +29,7 @@ struct reclaim_stat {
     unsigned nr_activate;
     unsigned nr_ref_keep;
     unsigned nr_unmap_fail;
+    unsigned nr_lazyfree_fail;
 };
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f3c655fc8879..bedea8b4024a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,6 +1129,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
     unsigned nr_immediate = 0;
     unsigned nr_ref_keep = 0;
     unsigned nr_unmap_fail = 0;
+    unsigned nr_lazyfree_fail = 0;
 
     cond_resched();
 
@@ -1336,11 +1337,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
          */
         if (page_mapped(page)) {
             enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
+            bool was_swapbacked = PageSwapBacked(page);
 
             if (unlikely(PageTransHuge(page)))
                 flags |= TTU_SPLIT_HUGE_PMD;
+
             if (!try_to_unmap(page, flags)) {
                 nr_unmap_fail++;
+                if (!was_swapbacked && PageSwapBacked(page))
+                    nr_lazyfree_fail++;
                 goto activate_locked;
             }
         }
@@ -1519,6 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
         stat->nr_activate = pgactivate;
         stat->nr_ref_keep = nr_ref_keep;
         stat->nr_unmap_fail = nr_unmap_fail;
+        stat->nr_lazyfree_fail = nr_lazyfree_fail;
     }
     return nr_reclaimed;
 }
@@ -1531,7 +1537,8 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
         .priority = DEF_PRIORITY,
         .may_unmap = 1,
     };
-    unsigned long ret;
+    struct reclaim_stat stat;
+    unsigned long nr_reclaimed;
     struct page *page, *next;
     LIST_HEAD(clean_pages);
 
@@ -1543,11 +1550,21 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
         }
     }
 
-    ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
-            TTU_IGNORE_ACCESS, NULL, true);
+    nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
+                                    TTU_IGNORE_ACCESS, &stat, true);
     list_splice(&clean_pages, page_list);
-    mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
-    return ret;
+    mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -nr_reclaimed);
+    /*
+     * Since lazyfree pages are isolated from file LRU from the beginning,
+     * they will rotate back to anonymous LRU in the end if it failed to
+     * discard so isolated count will be mismatched.
+     * Compensate the isolated count for both LRU lists.
+     */
+    mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON,
+                        stat.nr_lazyfree_fail);
+    mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE,
+                        -stat.nr_lazyfree_fail);
+    return nr_reclaimed;
 }
 
 /*
-- 
2.25.1
[PATCH kernel-4.19 1/7] bpf: Move off_reg into sanitize_ptr_alu
by Yang Yingliang 08 May '21

From: Daniel Borkmann <daniel(a)iogearbox.net>

mainline inclusion
from mainline-v5.12-rc8
commit 6f55b2f2a1178856c19bbce2f71449926e731914
category: bugfix
bugzilla: NA
CVE: CVE-2021-29155

--------------------------------

Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on
can use off_reg for generalizing some of the checks for all pointer
types.

Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend(a)gmail.com>
Acked-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
Reviewed-by: Kuohai Xu <xukuohai(a)huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 kernel/bpf/verifier.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f9d8fc953b70d..4cb746168e4e6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2719,11 +2719,12 @@ static int sanitize_val_alu(struct bpf_verifier_env *env,
 static int sanitize_ptr_alu(struct bpf_verifier_env *env,
                             struct bpf_insn *insn,
                             const struct bpf_reg_state *ptr_reg,
-                            struct bpf_reg_state *dst_reg,
-                            bool off_is_neg)
+                            const struct bpf_reg_state *off_reg,
+                            struct bpf_reg_state *dst_reg)
 {
     struct bpf_verifier_state *vstate = env->cur_state;
     struct bpf_insn_aux_data *aux = cur_aux(env);
+    bool off_is_neg = off_reg->smin_value < 0;
     bool ptr_is_dst_reg = ptr_reg == dst_reg;
     u8 opcode = BPF_OP(insn->code);
     u32 alu_state, alu_limit;
@@ -2847,7 +2848,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 
     switch (opcode) {
     case BPF_ADD:
-        ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0);
+        ret = sanitize_ptr_alu(env, insn, ptr_reg, off_reg, dst_reg);
         if (ret < 0) {
             verbose(env, "R%d tried to add from different maps, paths, or prohibited types\n", dst);
             return ret;
@@ -2902,7 +2903,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
         }
         break;
     case BPF_SUB:
-        ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0);
+        ret = sanitize_ptr_alu(env, insn, ptr_reg, off_reg, dst_reg);
         if (ret < 0) {
             verbose(env, "R%d tried to sub from different maps, paths, or prohibited types\n", dst);
             return ret;
-- 
2.25.1
[PATCH openEuler-1.0-LTS 6/6] pid: fix pid recover method kabi change
by Yang Yingliang 08 May '21

From: Jingxian He <hejingxian(a)huawei.com>

hulk inclusion
category: feature
bugzilla: 48159
CVE: N/A

------------------------------

The pid recover method added a new member to task_struct, which causes a
kabi change problem. Use a KABI_RESERVE slot of task_struct instead of
adding a new member to task_struct.

Signed-off-by: Jingxian He <hejingxian(a)huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng(a)huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi(a)huawei.com>
Reviewed-by: Hanjun Guo <guohanjun(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 drivers/char/pin_memory.c |  2 +-
 include/linux/sched.h     | 18 +++++++++++++++---
 init/init_task.c          |  4 +++-
 kernel/pid.c              |  8 ++++----
 4 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/drivers/char/pin_memory.c b/drivers/char/pin_memory.c
index 05fa7cfde03b2..12ae7627bdf0c 100644
--- a/drivers/char/pin_memory.c
+++ b/drivers/char/pin_memory.c
@@ -164,7 +164,7 @@ static int set_fork_pid(unsigned long arg)
         goto fault;
     if (copy_from_user(&pid, buf, sizeof(int)))
         goto fault;
-    current->fork_pid = pid;
+    current->fork_pid_union.fork_pid = pid;
     return 0;
 fault:
     return -EFAULT;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e532338aab8e7..b4c6dfe5e1d9d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -596,6 +596,13 @@ struct wake_q_node {
     struct wake_q_node *next;
 };
 
+#ifdef CONFIG_PID_RESERVE
+typedef union {
+    pid_t fork_pid;
+    unsigned long pid_reserve;
+} fork_pid_t;
+#endif
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
     /*
@@ -1206,9 +1213,6 @@ struct task_struct {
     /* Used by LSM modules for access restriction: */
     void                *security;
 #endif
-#ifdef CONFIG_PID_RESERVE
-    int fork_pid;
-#endif
 
     /*
      * New fields for task_struct should be added above here, so that
@@ -1227,7 +1231,15 @@ struct task_struct {
     KABI_RESERVE(3)
     KABI_RESERVE(4)
 #endif
+#ifdef CONFIG_PID_RESERVE
+#ifndef __GENKSYMS__
+    fork_pid_t fork_pid_union;
+#else
+    KABI_RESERVE(5)
+#endif
+#else
     KABI_RESERVE(5)
+#endif
     KABI_RESERVE(6)
     KABI_RESERVE(7)
     KABI_RESERVE(8)
diff --git a/init/init_task.c b/init/init_task.c
index dd02e591490ae..57ff82ab98110 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -181,7 +181,9 @@ struct task_struct init_task
     .security   = NULL,
 #endif
 #ifdef CONFIG_PID_RESERVE
-    .fork_pid   = 0,
+    .fork_pid_union = {
+        .fork_pid = 0,
+    },
 #endif
 };
 EXPORT_SYMBOL(init_task);
diff --git a/kernel/pid.c b/kernel/pid.c
index d6a8009f96739..5666ce075487d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -189,7 +189,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
         pid_min = RESERVED_PIDS;
 
 #ifdef CONFIG_PID_RESERVE
-        if (!current->fork_pid) {
+        if (!current->fork_pid_union.fork_pid) {
             /*
              * Store a null pointer so find_pid_ns does not find
              * a partially initialized PID (see below).
@@ -199,9 +199,9 @@ struct pid *alloc_pid(struct pid_namespace *ns)
                           GFP_ATOMIC);
         } else {
             /* Try to free the reserved fork_pid, and then use it to alloc pid. */
-            free_reserved_pid(&tmp->idr, current->fork_pid);
-            pid_min = current->fork_pid;
-            current->fork_pid = 0;
+            free_reserved_pid(&tmp->idr, current->fork_pid_union.fork_pid);
+            pid_min = current->fork_pid_union.fork_pid;
+            current->fork_pid_union.fork_pid = 0;
             nr = idr_alloc(&tmp->idr, NULL, pid_min,
                            pid_min + 1, GFP_ATOMIC);
-- 
2.25.1
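The trick here is worth spelling out: a union overlays the new member on
an existing reserved word, so the structure's size and field offsets are
unchanged for already-built binary consumers. A minimal standalone C
sketch of the idea (userspace, invented struct name, not the kernel
code):

#include <stdio.h>

typedef union {
    int fork_pid;              /* new member (pid_t in the kernel),
                                * aliases the reserved slot */
    unsigned long pid_reserve; /* original KABI_RESERVE placeholder */
} fork_pid_t;

struct task_like {
    long other_state;
    fork_pid_t fork_pid_union; /* occupies exactly one reserved word */
};

/* layout is identical to a struct with a bare "unsigned long" slot */
_Static_assert(sizeof(fork_pid_t) == sizeof(unsigned long),
               "union must not grow past the reserved word");

int main(void)
{
    struct task_like t = { 0 };

    t.fork_pid_union.fork_pid = 1234;  /* write through the new view */
    printf("sizeof(struct task_like) = %zu\n", sizeof(t));
    return 0;
}

The __GENKSYMS__ guard in the real patch keeps the symbol-version
checksum computed from the old KABI_RESERVE(5) declaration, so modules
built against the old header still load.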
[PATCH openEuler-1.0-LTS 5/6] config: enable kernel hotupgrade features by default
by Yang Yingliang 08 May '21

From: Jingxian He <hejingxian(a)huawei.com>

hulk inclusion
category: feature
bugzilla: 48159
CVE: N/A

------------------------------

Enable kernel hot upgrade features by default:

1. add pin mem method for checkpoint and restore:
   CONFIG_PIN_MEMORY=y
   CONFIG_PIN_MEMORY_DEV=m
2. add pid recover method for checkpoint and restore:
   CONFIG_PID_RESERVE=y
3. add cpu park method:
   CONFIG_ARM64_CPU_PARK=y
4. add quick kexec support for kernel:
   CONFIG_QUICK_KEXEC=y

Signed-off-by: Sang Yan <sangyan(a)huawei.com>
Signed-off-by: Jingxian He <hejingxian(a)huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng(a)huawei.com>
Acked-by: Hanjun Guo <guohanjun(a)huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 arch/arm64/configs/hulk_defconfig      | 6 ++++++
 arch/arm64/configs/openeuler_defconfig | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig
index 8950fff02a43f..9a3db8a88170e 100644
--- a/arch/arm64/configs/hulk_defconfig
+++ b/arch/arm64/configs/hulk_defconfig
@@ -445,6 +445,7 @@ CONFIG_PARAVIRT=y
 CONFIG_PARAVIRT_TIME_ACCOUNTING=y
 CONFIG_KEXEC=y
 CONFIG_CRASH_DUMP=y
+CONFIG_ARM64_CPU_PARK=y
 # CONFIG_XEN is not set
 CONFIG_FORCE_MAX_ZONEORDER=11
 CONFIG_UNMAP_KERNEL_AT_EL0=y
@@ -701,6 +702,7 @@ CONFIG_CRYPTO_AES_ARM64_NEON_BLK=m
 #
 CONFIG_CRASH_CORE=y
 CONFIG_KEXEC_CORE=y
+CONFIG_QUICK_KEXEC=y
 CONFIG_OPROFILE_NMI_TIMER=y
 CONFIG_KPROBES=y
 CONFIG_JUMP_LABEL=y
@@ -983,6 +985,8 @@ CONFIG_IDLE_PAGE_TRACKING=y
 # CONFIG_PERCPU_STATS is not set
 # CONFIG_GUP_BENCHMARK is not set
 CONFIG_ARCH_HAS_PTE_SPECIAL=y
+CONFIG_PIN_MEMORY=y
+CONFIG_PID_RESERVE=y
 CONFIG_NET=y
 CONFIG_NET_INGRESS=y
 CONFIG_NET_EGRESS=y
@@ -2956,6 +2960,8 @@ CONFIG_TCG_CRB=y
 # CONFIG_DEVPORT is not set
 # CONFIG_XILLYBUS is not set
 CONFIG_HISI_SVM=y
+CONFIG_PIN_MEMORY_DEV=m
+
 #
 # I2C support
 #
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 89dec8e1954fd..34b34fa103d0a 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -446,6 +446,7 @@ CONFIG_PARAVIRT=y
 CONFIG_PARAVIRT_TIME_ACCOUNTING=y
 CONFIG_KEXEC=y
 CONFIG_CRASH_DUMP=y
+CONFIG_ARM64_CPU_PARK=y
 # CONFIG_XEN is not set
 CONFIG_FORCE_MAX_ZONEORDER=14
 CONFIG_UNMAP_KERNEL_AT_EL0=y
@@ -691,6 +692,7 @@ CONFIG_CRYPTO_AES_ARM64_NEON_BLK=m
 #
 CONFIG_CRASH_CORE=y
 CONFIG_KEXEC_CORE=y
+CONFIG_QUICK_KEXEC=y
 CONFIG_KPROBES=y
 CONFIG_JUMP_LABEL=y
 CONFIG_STATIC_KEYS_SELFTEST=y
@@ -976,6 +978,8 @@ CONFIG_FRAME_VECTOR=y
 # CONFIG_PERCPU_STATS is not set
 # CONFIG_GUP_BENCHMARK is not set
 CONFIG_ARCH_HAS_PTE_SPECIAL=y
+CONFIG_PIN_MEMORY=y
+CONFIG_PID_RESERVE=y
 CONFIG_NET=y
 CONFIG_COMPAT_NETLINK_MESSAGES=y
 CONFIG_NET_INGRESS=y
@@ -3045,6 +3049,7 @@ CONFIG_TCG_CRB=m
 # CONFIG_DEVPORT is not set
 # CONFIG_XILLYBUS is not set
 CONFIG_HISI_SVM=y
+CONFIG_PIN_MEMORY_DEV=m
 
 #
 # I2C support
-- 
2.25.1
[PATCH openEuler-1.0-LTS 4/6] kexec: Add quick kexec support for kernel
by Yang Yingliang 08 May '21

From: Jingxian He <hejingxian(a)huawei.com>

hulk inclusion
category: feature
bugzilla: 48159
CVE: N/A

------------------------------

In normal kexec, relocating the kernel may cost 5 ~ 10 seconds to copy
all segments from vmalloced memory to kernel boot memory, because the
mmu is disabled.

We introduce quick kexec to save the time of copying memory as above,
just like kdump (kexec on crash), by using the reserved memory "Quick
Kexec". A quick kimage is constructed the same way as the crash kernel,
then all segments of the kimage are simply copied to the reserved
memory.

We also add this support in syscall kexec_load using the flag
KEXEC_QUICK.

Signed-off-by: Sang Yan <sangyan(a)huawei.com>
Signed-off-by: Jingxian He <hejingxian(a)huawei.com>
Reviewed-by: Wang Xiongfeng <wangxiongfeng2(a)huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi(a)huawei.com>
Reviewed-by: Hanjun Guo <guohanjun(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
 arch/Kconfig               |  9 ++++++++
 arch/arm64/kernel/setup.c  |  7 +++++++
 arch/arm64/mm/init.c       | 43 ++++++++++++++++++++++++++++++++++++++
 include/linux/ioport.h     |  1 +
 include/linux/kexec.h      | 11 +++++++++-
 include/uapi/linux/kexec.h |  1 +
 kernel/kexec.c             | 10 +++++++++
 kernel/kexec_core.c        | 41 ++++++++++++++++++++++++++----------
 8 files changed, 113 insertions(+), 10 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 0dd478fcee0b6..e906cbb213444 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -18,6 +18,15 @@ config KEXEC_CORE
     select CRASH_CORE
     bool
 
+config QUICK_KEXEC
+    bool "Support for quick kexec"
+    depends on KEXEC_CORE
+    help
+      It uses pre-reserved memory to accelerate kexec, just like
+      crash kexec, loads new kernel and initrd to reserved memory,
+      and boots new kernel on that memory. It will save the time
+      of relocating kernel.
+
 config HAVE_IMA_KEXEC
     bool
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 76abf40f0949d..e23b804773874 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -282,6 +282,13 @@ static void __init request_standard_resources(void)
             request_resource(res, &pin_memory_resource);
 #endif
 
+#ifdef CONFIG_QUICK_KEXEC
+        if (quick_kexec_res.end &&
+            quick_kexec_res.start >= res->start &&
+            quick_kexec_res.end <= res->end)
+            request_resource(res, &quick_kexec_res);
+#endif
+
         for (j = 0; j < res_mem_count; j++) {
             if (res_resources[j].start >= res->start &&
                 res_resources[j].end <= res->end)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 71bfe4195f205..554824ce0f286 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -243,6 +243,45 @@ static void __init kexec_reserve_crashkres_pages(void)
 }
 #endif /* CONFIG_KEXEC_CORE */
 
+#ifdef CONFIG_QUICK_KEXEC
+static int __init parse_quick_kexec(char *p)
+{
+    if (!p)
+        return 0;
+
+    quick_kexec_res.end = PAGE_ALIGN(memparse(p, NULL));
+
+    return 0;
+}
+early_param("quickkexec", parse_quick_kexec);
+
+static void __init reserve_quick_kexec(void)
+{
+    unsigned long long mem_start, mem_len;
+
+    mem_len = quick_kexec_res.end;
+    if (mem_len == 0)
+        return;
+
+    /* Current arm64 boot protocol requires 2MB alignment */
+    mem_start = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+                                       mem_len, CRASH_ALIGN);
+    if (mem_start == 0) {
+        pr_warn("cannot allocate quick kexec mem (size:0x%llx)\n",
+                mem_len);
+        quick_kexec_res.end = 0;
+        return;
+    }
+
+    memblock_reserve(mem_start, mem_len);
+    pr_info("quick kexec mem reserved: 0x%016llx - 0x%016llx (%lld MB)\n",
+            mem_start, mem_start + mem_len, mem_len >> 20);
+
+    quick_kexec_res.start = mem_start;
+    quick_kexec_res.end = mem_start + mem_len - 1;
+}
+#endif
+
 #ifdef CONFIG_CRASH_DUMP
 static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
         const char *uname, int depth, void *data)
@@ -699,6 +738,10 @@ void __init arm64_memblock_init(void)
 
     reserve_crashkernel();
 
+#ifdef CONFIG_QUICK_KEXEC
+    reserve_quick_kexec();
+#endif
+
     reserve_elfcorehdr();
 
     high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 5330288da8dbb..1a438ecd15d01 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -139,6 +139,7 @@ enum {
     IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
     IORES_DESC_DEVICE_PRIVATE_MEMORY    = 6,
     IORES_DESC_DEVICE_PUBLIC_MEMORY     = 7,
+    IORES_DESC_QUICK_KEXEC              = 8,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d6b8d0a69720d..98cf0fb091579 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -233,9 +233,10 @@ struct kimage {
     unsigned long control_page;
 
     /* Flags to indicate special processing */
-    unsigned int type : 1;
+    unsigned int type : 2;
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
+#define KEXEC_TYPE_QUICK   2
     unsigned int preserve_context : 1;
     /* If set, we are using file mode kexec syscall */
     unsigned int file_mode:1;
@@ -296,6 +297,11 @@ extern int kexec_load_disabled;
 #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
 #endif
 
+#ifdef CONFIG_QUICK_KEXEC
+#undef KEXEC_FLAGS
+#define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_QUICK)
+#endif
+
 /* List of defined/legal kexec file flags */
 #define KEXEC_FILE_FLAGS    (KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH | \
                              KEXEC_FILE_NO_INITRAMFS)
@@ -305,6 +311,9 @@ extern int kexec_load_disabled;
 extern struct resource crashk_res;
 extern struct resource crashk_low_res;
 extern note_buf_t __percpu *crash_notes;
+#ifdef CONFIG_QUICK_KEXEC
+extern struct resource quick_kexec_res;
+#endif
 
 /* flag to track if kexec reboot is in progress */
 extern bool kexec_in_progress;
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 6d112868272da..ca3cebebd6b20 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -12,6 +12,7 @@
 /* kexec flags for different usage scenarios */
 #define KEXEC_ON_CRASH          0x00000001
 #define KEXEC_PRESERVE_CONTEXT  0x00000002
+#define KEXEC_QUICK             0x00000004
 #define KEXEC_ARCH_MASK         0xffff0000
 
 /*
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 68559808fdfa1..47dfad722b7ca 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -46,6 +46,9 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
     int ret;
     struct kimage *image;
     bool kexec_on_panic = flags & KEXEC_ON_CRASH;
+#ifdef CONFIG_QUICK_KEXEC
+    bool kexec_on_quick = flags & KEXEC_QUICK;
+#endif
 
     if (kexec_on_panic) {
         /* Verify we have a valid entry point */
@@ -71,6 +74,13 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
         image->type = KEXEC_TYPE_CRASH;
     }
 
+#ifdef CONFIG_QUICK_KEXEC
+    if (kexec_on_quick) {
+        image->control_page = quick_kexec_res.start;
+        image->type = KEXEC_TYPE_QUICK;
+    }
+#endif
+
     ret = sanity_check_segment_list(image);
     if (ret)
         goto out_free_image;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index b36c9c46cd2ce..595a757af6568 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -74,6 +74,16 @@ struct resource crashk_low_res = {
     .desc  = IORES_DESC_CRASH_KERNEL
 };
 
+#ifdef CONFIG_QUICK_KEXEC
+struct resource quick_kexec_res = {
+    .name  = "Quick kexec",
+    .start = 0,
+    .end   = 0,
+    .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+    .desc  = IORES_DESC_QUICK_KEXEC
+};
+#endif
+
 int kexec_should_crash(struct task_struct *p)
 {
     /*
@@ -470,8 +480,10 @@ static struct page *kimage_alloc_normal_control_pages(struct kimage *image,
     return pages;
 }
 
-static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
-                                                     unsigned int order)
+
+static struct page *kimage_alloc_special_control_pages(struct kimage *image,
+                                                       unsigned int order,
+                                                       unsigned long end)
 {
     /* Control pages are special, they are the intermediaries
      * that are needed while we copy the rest of the pages
@@ -501,7 +513,7 @@ static struct page *kimage_alloc_special_control_pages(struct kimage *image,
     size = (1 << order) << PAGE_SHIFT;
     hole_start = (image->control_page + (size - 1)) & ~(size - 1);
     hole_end   = hole_start + size - 1;
-    while (hole_end <= crashk_res.end) {
+    while (hole_end <= end) {
         unsigned long i;
 
         cond_resched();
@@ -536,7 +548,6 @@ static struct page *kimage_alloc_special_control_pages(struct kimage *image,
     return pages;
 }
 
-
 struct page *kimage_alloc_control_pages(struct kimage *image,
                                         unsigned int order)
 {
@@ -547,8 +558,15 @@ struct page *kimage_alloc_control_pages(struct kimage *image,
         pages = kimage_alloc_normal_control_pages(image, order);
         break;
     case KEXEC_TYPE_CRASH:
-        pages = kimage_alloc_crash_control_pages(image, order);
+        pages = kimage_alloc_special_control_pages(image, order,
+                                                   crashk_res.end);
+        break;
+#ifdef CONFIG_QUICK_KEXEC
+    case KEXEC_TYPE_QUICK:
+        pages = kimage_alloc_special_control_pages(image, order,
+                                                   quick_kexec_res.end);
         break;
+#endif
     }
 
     return pages;
@@ -898,11 +916,11 @@ static int kimage_load_normal_segment(struct kimage *image,
     return result;
 }
 
-static int kimage_load_crash_segment(struct kimage *image,
+static int kimage_load_special_segment(struct kimage *image,
                                      struct kexec_segment *segment)
 {
-    /* For crash dumps kernels we simply copy the data from
-     * user space to it's destination.
+    /* For crash dumps kernels and quick kexec kernels
+     * we simply copy the data from user space to it's destination.
      * We do things a page at a time for the sake of kmap.
      */
     unsigned long maddr;
@@ -976,8 +994,13 @@ int kimage_load_segment(struct kimage *image,
         result = kimage_load_normal_segment(image, segment);
         break;
     case KEXEC_TYPE_CRASH:
-        result = kimage_load_crash_segment(image, segment);
+        result = kimage_load_special_segment(image, segment);
+        break;
+#ifdef CONFIG_QUICK_KEXEC
+    case KEXEC_TYPE_QUICK:
+        result = kimage_load_special_segment(image, segment);
         break;
+#endif
     }
 
     return result;
-- 
2.25.1
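Usage has two parts: the reserved region comes from the new "quickkexec="
boot parameter handled by parse_quick_kexec() above (e.g. quickkexec=512M
on the kernel command line; the size is illustrative), and a loader then
requests the reserved path by passing the new KEXEC_QUICK flag to
kexec_load(2). Below is a hedged userspace sketch of only the flag
plumbing, not a working loader: segment setup is elided, nr_segments ==
0 merely performs an unload, and KEXEC_QUICK is defined locally because
headers that predate this patch do not have it.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/kexec.h>

#ifndef KEXEC_QUICK
#define KEXEC_QUICK 0x00000004  /* added by this patch in uapi/linux/kexec.h */
#endif

int main(void)
{
    /* Needs CAP_SYS_BOOT; with the patch applied, KEXEC_QUICK is part
     * of KEXEC_FLAGS and so passes the kernel's flag validation. */
    long ret = syscall(SYS_kexec_load, 0UL, 0UL, (void *)0,
                       KEXEC_ARCH_DEFAULT | KEXEC_QUICK);
    if (ret < 0)
        perror("kexec_load(KEXEC_QUICK)");
    return ret < 0 ? 1 : 0;
}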