[PATCH OLK-5.10 0/7] zcopy: A zero-copy data transfer mechanism
This patchset provides a PAGEATTACH mechanism, a zero-copy data transfer solution optimized for High Performance Computing workloads. It allows direct sharing of physical memory pages between distinct processes by mapping source pages to a target address space, eliminating redundant data copying and improving large data transfer efficiency. Key features: - Supports zero-copy communication between intra-node processes. - Handles both PTE-level small pages and PMD-level huge pages. Liu Mingrui (7): zcopy: Initialize zcopy module zcopy: Introduce the pageattach interface zcopy: Extend PMD trans hugepage mapping ability zcopy: Add tracepoint for PageAttach zcopy: Add debug interface dump_pagetable zcopy: Add ZCOPY documentation zcopy: enable ZCOPY module Documentation/misc-devices/zcopy.rst | 64 +++ MAINTAINERS | 7 + arch/arm64/configs/openeuler_defconfig | 6 + drivers/misc/Kconfig | 1 + drivers/misc/Makefile | 1 + drivers/misc/zcopy/Kconfig | 19 + drivers/misc/zcopy/Makefile | 2 + drivers/misc/zcopy/zcopy.c | 719 +++++++++++++++++++++++++ include/trace/events/attach.h | 157 ++++++ 9 files changed, 976 insertions(+) create mode 100644 Documentation/misc-devices/zcopy.rst create mode 100644 drivers/misc/zcopy/Kconfig create mode 100644 drivers/misc/zcopy/Makefile create mode 100644 drivers/misc/zcopy/zcopy.c create mode 100644 include/trace/events/attach.h -- 2.25.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE -------------------------------- Initialize the zcopy driver module framework. Signed-off-by: Liu Mingrui <liumingrui@huawei.com> --- MAINTAINERS | 5 ++ drivers/misc/Kconfig | 1 + drivers/misc/Makefile | 1 + drivers/misc/zcopy/Kconfig | 19 +++++++ drivers/misc/zcopy/Makefile | 2 + drivers/misc/zcopy/zcopy.c | 100 ++++++++++++++++++++++++++++++++++++ 6 files changed, 128 insertions(+) create mode 100644 drivers/misc/zcopy/Kconfig create mode 100644 drivers/misc/zcopy/Makefile create mode 100644 drivers/misc/zcopy/zcopy.c diff --git a/MAINTAINERS b/MAINTAINERS index 3eb4130ce6d0..fa863a4ab198 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19649,6 +19649,11 @@ L: linux-mm@kvack.org S: Maintained F: mm/zswap.c +ZCOPY DRIVER +M: Mingrui Liu <liumingrui@huawei.com> +S: Maintained +F: drivers/misc/zcopy/ + THE REST M: Linus Torvalds <torvalds@linux-foundation.org> L: linux-kernel@vger.kernel.org diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig index 6457efbb8099..194fa5cad439 100644 --- a/drivers/misc/Kconfig +++ b/drivers/misc/Kconfig @@ -500,4 +500,5 @@ source "drivers/misc/cardreader/Kconfig" source "drivers/misc/habanalabs/Kconfig" source "drivers/misc/uacce/Kconfig" source "drivers/misc/sdma-dae/Kconfig" +source "drivers/misc/zcopy/Kconfig" endmenu diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile index 1f9e143d107f..325e8c586e51 100644 --- a/drivers/misc/Makefile +++ b/drivers/misc/Makefile @@ -60,3 +60,4 @@ obj-$(CONFIG_SDMA_DAE) += sdma-dae/ obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o obj-$(CONFIG_VIRT_PLAT_DEV) += virt_plat_dev.o +obj-$(CONFIG_PAGEATTACH) += zcopy/ diff --git a/drivers/misc/zcopy/Kconfig b/drivers/misc/zcopy/Kconfig new file mode 100644 index 000000000000..1e7107039c3f --- /dev/null +++ b/drivers/misc/zcopy/Kconfig @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: GPL-2.0-only +config
PAGEATTACH + tristate "PAGEATTACH: A zero-copy data transfer mechanism" + depends on MMU && ARM64 + help + This option enables the PAGEATTACH mechanism, a zero-copy data transfer + solution optimized for High Performance Computing workloads. It + allows direct sharing of physical memory pages between distinct processes + by mapping source pages to a target address space, eliminating redundant + data copying and improving large data transfer efficiency. + + Key features: + - Supports zero-copy communication between intra-node processes. + - Handles both PTE-level pages and PMD-level huge pages. + - Preserves the read/write permissions of the source page in the target address space. + + This mechanism is intended for HPC applications requiring high-speed inter-process + data sharing. If your use case does not meet the above constraints or you are unsure, + disable this option by saying N. diff --git a/drivers/misc/zcopy/Makefile b/drivers/misc/zcopy/Makefile new file mode 100644 index 000000000000..60a6909da314 --- /dev/null +++ b/drivers/misc/zcopy/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_PAGEATTACH) += zcopy.o \ No newline at end of file diff --git a/drivers/misc/zcopy/zcopy.c b/drivers/misc/zcopy/zcopy.c new file mode 100644 index 000000000000..55a3de4a1256 --- /dev/null +++ b/drivers/misc/zcopy/zcopy.c @@ -0,0 +1,100 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* Copyright (C) 2025.
Huawei Technologies Co., Ltd */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/ioctl.h> +#include <linux/fs.h> +#include <linux/cdev.h> +#include <linux/uaccess.h> + +struct zcopy_cdev { + struct cdev chrdev; + dev_t dev; + int major; + struct class *dev_class; + struct device *dev_device; +}; + +static struct zcopy_cdev z_cdev; + +long zcopy_ioctl(struct file *file, unsigned int type, unsigned long ptr) +{ + return 0; +} + +static const struct file_operations zcopy_fops = { + .owner = THIS_MODULE, + .unlocked_ioctl = zcopy_ioctl, +}; + +static int register_device_zcopy(void) +{ + int ret; + + ret = alloc_chrdev_region(&z_cdev.dev, 0, 1, "zcopy"); + if (ret < 0) + goto err_out; + + z_cdev.major = MAJOR(z_cdev.dev); + + cdev_init(&z_cdev.chrdev, &zcopy_fops); + ret = cdev_add(&z_cdev.chrdev, z_cdev.dev, 1); + if (ret < 0) + goto err_unregister_chrdev; + + z_cdev.dev_class = class_create(THIS_MODULE, "zcopy"); + if (IS_ERR(z_cdev.dev_class)) { + ret = PTR_ERR(z_cdev.dev_class); + goto err_cdev_del; + } + + z_cdev.dev_device = device_create(z_cdev.dev_class, NULL, + MKDEV(z_cdev.major, 0), NULL, "zdax"); + if (IS_ERR(z_cdev.dev_device)) { + ret = PTR_ERR(z_cdev.dev_device); + goto err_class_destroy; + } + + return 0; + +err_class_destroy: + class_destroy(z_cdev.dev_class); +err_cdev_del: + cdev_del(&z_cdev.chrdev); +err_unregister_chrdev: + unregister_chrdev_region(z_cdev.dev, 1); +err_out: + return ret; +} + +static void unregister_device_zcopy(void) +{ + device_destroy(z_cdev.dev_class, MKDEV(z_cdev.major, 0)); + class_destroy(z_cdev.dev_class); + cdev_del(&z_cdev.chrdev); + unregister_chrdev_region(z_cdev.dev, 1); +} + +static int __init zcopy_init(void) +{ + int ret; + + ret = register_device_zcopy(); + if (ret) + return ret; + + return 0; +} + +static void __exit zcopy_exit(void) +{ + unregister_device_zcopy(); +} + +module_init(zcopy_init); +module_exit(zcopy_exit); + +MODULE_AUTHOR("liumingrui <liumingrui@huawei.com>"); 
+MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("PAGEATTACH: A zero-copy data transfer mechanism"); -- 2.25.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE -------------------------------- Provide an efficient intra-node data transfer interface, enabling applications to map the pages associated with a specified virtual memory address space in the source process to the virtual address space in the destination process. Signed-off-by: Liu Mingrui <liumingrui@huawei.com> --- drivers/misc/zcopy/zcopy.c | 475 ++++++++++++++++++++++++++++++++++++- 1 file changed, 474 insertions(+), 1 deletion(-) diff --git a/drivers/misc/zcopy/zcopy.c b/drivers/misc/zcopy/zcopy.c index 55a3de4a1256..937bef4f9fd6 100644 --- a/drivers/misc/zcopy/zcopy.c +++ b/drivers/misc/zcopy/zcopy.c @@ -7,6 +7,41 @@ #include <linux/fs.h> #include <linux/cdev.h> #include <linux/uaccess.h> +#include <linux/kallsyms.h> +#include <linux/mm.h> +#include <linux/kprobes.h> +#include <linux/huge_mm.h> +#include <linux/mm_types.h> +#include <linux/mm_types_task.h> +#include <linux/rmap.h> +#include <linux/sched/mm.h> +#include <linux/pgtable.h> +#include <asm-generic/pgalloc.h> +#include <asm/tlbflush.h> +#include <asm/pgtable-hwdef.h> + +#ifndef PUD_SHIFT +#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3) +#define PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1) +#endif + +enum pgt_entry { + NORMAL_PMD, + HPAGE_PMD, +}; + +enum { + IO_ATTACH = 1, + IO_MAX +}; + +struct zcopy_ioctl_pswap { + unsigned long src_addr; + unsigned long dst_addr; + int src_pid; + int dst_pid; + unsigned long size; +}; struct zcopy_cdev { struct cdev chrdev; @@ -18,16 +53,450 @@ struct zcopy_cdev { static struct zcopy_cdev z_cdev; -long zcopy_ioctl(struct file *file, unsigned int type, unsigned long ptr) +static int (*__zcopy_pte_alloc)(struct mm_struct *, pmd_t *); +static int (*__zcopy_pmd_alloc)(struct mm_struct *, pud_t *, unsigned long); +static int (*__zcopy_pud_alloc)(struct mm_struct *, p4d_t *, unsigned long); +static unsigned long 
(*kallsyms_lookup_name_funcp)(const char *); +static void (*zcopy_page_remove_rmap)(struct page *, bool); + +static struct kretprobe __kretprobe; + +static unsigned long __kprobe_lookup_name(const char *symbol_name) +{ + int ret; + void *addr; + + __kretprobe.kp.symbol_name = symbol_name; + ret = register_kretprobe(&__kretprobe); + if (ret < 0) + return 0; + + addr = __kretprobe.kp.addr; + unregister_kretprobe(&__kretprobe); + return (unsigned long)addr; +} + +static inline unsigned long __kallsyms_lookup_name(const char *symbol_name) +{ + if (kallsyms_lookup_name_funcp == NULL) + return 0; + return kallsyms_lookup_name_funcp(symbol_name); +} + +static inline pud_t *zcopy_pud_alloc(struct mm_struct *mm, p4d_t *p4d, + unsigned long address) +{ + return (unlikely(p4d_none(*p4d)) && + __zcopy_pud_alloc(mm, p4d, address)) ? NULL : pud_offset(p4d, address); +} + +static inline pmd_t *zcopy_pmd_alloc(struct mm_struct *mm, pud_t *pud, + unsigned long address) +{ + return (unlikely(pud_none(*pud)) && + __zcopy_pmd_alloc(mm, pud, address)) ? 
NULL : pmd_offset(pud, address); +} + +static inline bool zcopy_pte_alloc(struct mm_struct *mm, pmd_t *pmd) +{ + return unlikely(pmd_none(*pmd)) && __zcopy_pte_alloc(mm, pmd); +} + +static pud_t *zcopy_get_pud(struct mm_struct *mm, unsigned long addr) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + + pgd = pgd_offset(mm, addr); + if (pgd_none(*pgd)) + return NULL; + + p4d = p4d_offset(pgd, addr); + if (p4d_none(*p4d)) + return NULL; + + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) + return NULL; + + return pud; +} + +static pmd_t *zcopy_get_pmd(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pud; + pmd_t *pmd; + + pud = zcopy_get_pud(mm, addr); + if (!pud) + return NULL; + + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return NULL; + + return pmd; +} + +static pud_t *zcopy_alloc_new_pud(struct mm_struct *mm, unsigned long addr) +{ + pgd_t *pgd; + p4d_t *p4d; + + pgd = pgd_offset(mm, addr); + p4d = p4d_alloc(mm, pgd, addr); + if (!p4d) + return NULL; + + return zcopy_pud_alloc(mm, p4d, addr); +} + +static pmd_t *zcopy_alloc_pmd(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pud; + pmd_t *pmd; + + pud = zcopy_alloc_new_pud(mm, addr); + if (!pud) + return NULL; + + pmd = zcopy_pmd_alloc(mm, pud, addr); + if (!pmd) + return NULL; + + return pmd; +} + +static inline void zcopy_add_mm_counter(struct mm_struct *mm, int member, long value) +{ + atomic_long_add(value, &mm->rss_stat.count[member]); +} + +static inline void zcopy_add_mm_rss_vec(struct mm_struct *mm, int *rss) +{ + int i; + + for (i = 0; i < NR_MM_COUNTERS; i++) + if (rss[i]) + zcopy_add_mm_counter(mm, i, rss[i]); +} + +static __always_inline unsigned long get_extent(enum pgt_entry entry, + unsigned long old_addr, unsigned long old_end, + unsigned long new_addr) +{ + unsigned long next, extent, mask, size; + + switch (entry) { + case HPAGE_PMD: + case NORMAL_PMD: + mask = PMD_MASK; + size = PMD_SIZE; + break; + default: + BUILD_BUG(); + break; + } + + next = (old_addr + size) & mask; 
+ /* even if next overflowed, extent below will be ok */ + extent = next - old_addr; + if (extent > old_end - old_addr) + extent = old_end - old_addr; + next = (new_addr + size) & mask; + if (extent > next - new_addr) + extent = next - new_addr; + return extent; +} + +static inline void zcopy_set_pte_at(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) { + if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte)) + __sync_icache_dcache(pte); + + __check_racy_pte_update(mm, ptep, pte); + + set_pte(ptep, pte); +} + +static int attach_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + unsigned long dst_addr, unsigned long src_addr, pmd_t *dst_pmdp, + pmd_t *src_pmdp, unsigned long len) +{ + struct mm_struct *dst_mm = dst_vma->vm_mm; + pte_t *src_ptep, *dst_ptep, pte, orig_pte; + struct page *src_page, *orig_page; + spinlock_t *dst_ptl; + int rss[NR_MM_COUNTERS]; + unsigned long src_addr_end = src_addr + len; + + memset(rss, 0, sizeof(int) * NR_MM_COUNTERS); + + src_ptep = pte_offset_map(src_pmdp, src_addr); + dst_ptep = pte_offset_map(dst_pmdp, dst_addr); + dst_ptl = pte_lockptr(dst_mm, dst_pmdp); + spin_lock_nested(dst_ptl, SINGLE_DEPTH_NESTING); + + for (; src_addr < src_addr_end; src_ptep++, src_addr += PAGE_SIZE, + dst_ptep++, dst_addr += PAGE_SIZE) { + /* + * A special pte may have no corresponding page, so skip it + * along with empty and non-present entries. + */ + pte = ptep_get(src_ptep); + if (pte_none(pte) || pte_special(pte) || !pte_present(pte)) + continue; + + src_page = pte_page(pte); + get_page(src_page); + page_dup_rmap(src_page, false); + rss[MM_ANONPAGES]++; + + /* + * If the dst virtual address already has a page mapped, drop the + * original page's mapcount and refcount before setting up the new + * mapping.
+ */ + orig_pte = *dst_ptep; + if (!pte_none(orig_pte)) { + orig_page = pte_page(orig_pte); + put_page(orig_page); + zcopy_page_remove_rmap(orig_page, false); + rss[MM_ANONPAGES]--; + } + zcopy_set_pte_at(dst_mm, dst_addr, dst_ptep, pte); + } + + /* dst_addr was advanced past the range by the loop above */ + flush_tlb_range(dst_vma, dst_addr - len, dst_addr); + zcopy_add_mm_rss_vec(dst_mm, rss); + spin_unlock(dst_ptl); + return 0; } +static int attach_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, + unsigned long dst_addr, unsigned long src_addr, unsigned long size) +{ + struct vm_area_struct *src_vma, *dst_vma; + unsigned long extent, src_addr_end; + pmd_t *src_pmd, *dst_pmd; + int ret = 0; + + src_addr_end = src_addr + size; + src_vma = find_vma(src_mm, src_addr); + dst_vma = find_vma(dst_mm, dst_addr); + /* Make sure both VMAs still exist. */ + if (!src_vma || !dst_vma) + return -ENOENT; + + for (; src_addr < src_addr_end; src_addr += extent, dst_addr += extent) { + cond_resched(); + + extent = get_extent(NORMAL_PMD, src_addr, src_addr_end, dst_addr); + src_pmd = zcopy_get_pmd(src_mm, src_addr); + if (!src_pmd) + continue; + dst_pmd = zcopy_alloc_pmd(dst_mm, dst_addr); + if (!dst_pmd) { + ret = -ENOMEM; + break; + } + + if (pmd_trans_huge(*src_pmd)) { + /* Not support hugepage mapping */ + ret = -EOPNOTSUPP; + break; + } else if (is_swap_pmd(*src_pmd) || pmd_devmap(*src_pmd)) { + ret = -EOPNOTSUPP; + break; + } + + if (zcopy_pte_alloc(dst_mm, dst_pmd)) { + ret = -ENOMEM; + break; + } + + ret = attach_ptes(dst_vma, src_vma, dst_addr, src_addr, dst_pmd, + src_pmd, extent); + if (ret < 0) + break; + } + + return ret; +} + +static int attach_pages(unsigned long dst_addr, unsigned long src_addr, + int dst_pid, int src_pid, unsigned long size) +{ + struct mm_struct *dst_mm, *src_mm; + struct task_struct *src_task, *dst_task; + struct page **process_pages; + unsigned long nr_pages; + unsigned int flags = 0; + int pinned_pages; + int locked = 1; + int ret; + + ret = -EINVAL; + if (size == 0) + goto out; + + if
((src_addr & (PAGE_SIZE-1)) != 0 || + (dst_addr & (PAGE_SIZE-1)) != 0 || + (size & (PAGE_SIZE-1)) != 0) { + goto out; + } + + /* Both addresses must be userspace addresses; kernel addresses are not allowed. */ + if (!is_ttbr0_addr(dst_addr) || !is_ttbr0_addr(src_addr)) + goto out; + + ret = -ESRCH; + src_task = find_get_task_by_vpid(src_pid); + if (!src_task) + goto out; + + src_mm = mm_access(src_task, PTRACE_MODE_ATTACH_REALCREDS); + if (!src_mm || IS_ERR(src_mm)) { + ret = IS_ERR(src_mm) ? PTR_ERR(src_mm) : -ESRCH; + if (ret == -EACCES) + ret = -EPERM; + goto put_src_task; + } + + dst_task = find_get_task_by_vpid(dst_pid); + if (!dst_task) + goto put_src_mm; + + dst_mm = mm_access(dst_task, PTRACE_MODE_ATTACH_REALCREDS); + if (!dst_mm || IS_ERR(dst_mm)) { + ret = IS_ERR(dst_mm) ? PTR_ERR(dst_mm) : -ESRCH; + if (ret == -EACCES) + ret = -EPERM; + goto put_dst_task; + } + + if (src_mm == dst_mm) { + ret = -EINVAL; + goto put_dst_mm; + } + + nr_pages = (src_addr + size - 1) / PAGE_SIZE - src_addr / PAGE_SIZE + 1; + process_pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL); + if (!process_pages) { + ret = -ENOMEM; + goto put_dst_mm; + } + + mmap_read_lock(src_mm); + pinned_pages = pin_user_pages_remote(src_mm, src_addr, nr_pages, + flags, process_pages, + NULL, &locked); + if (locked) + mmap_read_unlock(src_mm); + + if (pinned_pages <= 0) { + ret = -EFAULT; + goto free_pages_array; + } + + ret = attach_page_range(dst_mm, src_mm, dst_addr, src_addr, size); + + unpin_user_pages_dirty_lock(process_pages, pinned_pages, 0); + +free_pages_array: + kvfree(process_pages); +put_dst_mm: + mmput(dst_mm); +put_dst_task: + put_task_struct(dst_task); +put_src_mm: + mmput(src_mm); +put_src_task: + put_task_struct(src_task); +out: + return ret; +} + +static long zcopy_ioctl(struct file *file, unsigned int type, unsigned long ptr) +{ + long ret = 0; + + switch (type) { + case IO_ATTACH: + { + struct zcopy_ioctl_pswap ctx; + + if (copy_from_user((void *)&ctx, (void *)ptr, + sizeof(struct
zcopy_ioctl_pswap))) { + ret = -EFAULT; + break; + } + ret = attach_pages(ctx.dst_addr, ctx.src_addr, ctx.dst_pid, + ctx.src_pid, ctx.size); + break; + } + default: + break; + } + + return ret; +} + + static const struct file_operations zcopy_fops = { .owner = THIS_MODULE, .unlocked_ioctl = zcopy_ioctl, }; +#define REGISTER_CHECK(_var, _errstr) ({ \ + int __ret = 0; \ + if (!(_var)) { \ + pr_warn("Not found %s\n", _errstr); \ + __ret = -ENOENT; \ + } \ + __ret; \ +}) + +static int register_unexport_func(void) +{ + int ret; + + kallsyms_lookup_name_funcp + = (unsigned long (*)(const char *))__kprobe_lookup_name("kallsyms_lookup_name"); + ret = REGISTER_CHECK(kallsyms_lookup_name_funcp, "kallsyms_lookup_name"); + if (ret) + goto out; + + __zcopy_pte_alloc + = (int (*)(struct mm_struct *, pmd_t *))__kallsyms_lookup_name("__pte_alloc"); + ret = REGISTER_CHECK(__zcopy_pte_alloc, "__pte_alloc"); + if (ret) + goto out; + + __zcopy_pmd_alloc + = (int (*)(struct mm_struct *, pud_t *, unsigned long)) + __kallsyms_lookup_name("__pmd_alloc"); + ret = REGISTER_CHECK(__zcopy_pmd_alloc, "__pmd_alloc"); + if (ret) + goto out; + + __zcopy_pud_alloc + = (int (*)(struct mm_struct *, p4d_t *, unsigned long)) + __kallsyms_lookup_name("__pud_alloc"); + ret = REGISTER_CHECK(__zcopy_pud_alloc, "__pud_alloc"); + if (ret) + goto out; + + zcopy_page_remove_rmap + = (void (*)(struct page *, bool))__kallsyms_lookup_name("page_remove_rmap"); + ret = REGISTER_CHECK(zcopy_page_remove_rmap, "page_remove_rmap"); + +out: + return ret; +} + static int register_device_zcopy(void) { int ret; @@ -80,6 +549,10 @@ static int __init zcopy_init(void) { int ret; + ret = register_unexport_func(); + if (ret) + return ret; + ret = register_device_zcopy(); if (ret) return ret; -- 2.25.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE -------------------------------- It extends support to 2MB Transparent Huge Page (THP) mapping, building on the existing small page mapping capability. Signed-off-by: Liu Mingrui <liumingrui@huawei.com> --- drivers/misc/zcopy/zcopy.c | 115 ++++++++++++++++++++++++++++++++++++- 1 file changed, 112 insertions(+), 3 deletions(-) diff --git a/drivers/misc/zcopy/zcopy.c b/drivers/misc/zcopy/zcopy.c index 937bef4f9fd6..ac2d5a082924 100644 --- a/drivers/misc/zcopy/zcopy.c +++ b/drivers/misc/zcopy/zcopy.c @@ -25,6 +25,9 @@ #define PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1) #endif +#define zcopy_set_pmd_at(mm, addr, pmdp, pmd) \ + zcopy_set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd)) + enum pgt_entry { NORMAL_PMD, HPAGE_PMD, @@ -61,6 +64,25 @@ static void (*zcopy_page_remove_rmap)(struct page *, bool); static struct kretprobe __kretprobe; +#if USE_SPLIT_PTE_PTLOCKS && ALLOC_SPLIT_PTLOCKS +static struct kmem_cache *zcopy_page_ptl_cachep; +bool ptlock_alloc(struct page *page) +{ + spinlock_t *ptl; + + ptl = kmem_cache_alloc(zcopy_page_ptl_cachep, GFP_KERNEL); + if (!ptl) + return false; + page->ptl = ptl; + return true; +} + +void ptlock_free(struct page *page) +{ + kmem_cache_free(zcopy_page_ptl_cachep, page->ptl); +} +#endif + static unsigned long __kprobe_lookup_name(const char *symbol_name) { int ret; @@ -221,6 +243,78 @@ static inline void zcopy_set_pte_at(struct mm_struct *mm, unsigned long addr, set_pte(ptep, pte); } +static void zcopy_pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, + pgtable_t pgtable) +{ + assert_spin_locked(pmd_lockptr(mm, pmdp)); + + /* FIFO */ + if (!pmd_huge_pte(mm, pmdp)) + INIT_LIST_HEAD(&pgtable->lru); + else + list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru); + pmd_huge_pte(mm, pmdp) = pgtable; +} + +int attach_huge_pmd(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + unsigned long dst_addr, unsigned 
long src_addr, pmd_t *dst_pmdp, pmd_t *src_pmdp) +{ + struct mm_struct *dst_mm, *src_mm; + spinlock_t *src_ptl, *dst_ptl; + struct page *src_thp_page, *orig_thp_page; + pmd_t pmd, orig_pmd; + pgtable_t pgtable; + + if (!vma_is_anonymous(dst_vma)) + return -EINVAL; + + dst_mm = dst_vma->vm_mm; + src_mm = src_vma->vm_mm; + + /* alloc a pgtable for the new pmdp */ + pgtable = pte_alloc_one(dst_mm); + if (unlikely(!pgtable)) + return -ENOMEM; + + src_ptl = pmd_lockptr(src_mm, src_pmdp); + dst_ptl = pmd_lockptr(dst_mm, dst_pmdp); + + spin_lock(src_ptl); + pmd = *src_pmdp; + src_thp_page = pmd_page(pmd); + if (unlikely(!PageHead(src_thp_page))) { + pr_err("VM assertion failed: it is not a head page\n"); + spin_unlock(src_ptl); + pte_free(dst_mm, pgtable); + return -EINVAL; + } + + get_page(src_thp_page); + atomic_inc(compound_mapcount_ptr(src_thp_page)); + spin_unlock(src_ptl); + + spin_lock_nested(dst_ptl, SINGLE_DEPTH_NESTING); + orig_pmd = *dst_pmdp; + /* unmap the old page mapping */ + if (!pmd_none(orig_pmd)) { + orig_thp_page = pmd_page(orig_pmd); + put_page(orig_thp_page); + atomic_dec(compound_mapcount_ptr(orig_thp_page)); + zcopy_add_mm_counter(dst_mm, MM_ANONPAGES, -HPAGE_PMD_NR); + mm_dec_nr_ptes(dst_mm); + } + + zcopy_add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + mm_inc_nr_ptes(dst_mm); + zcopy_pgtable_trans_huge_deposit(dst_mm, dst_pmdp, pgtable); + zcopy_set_pmd_at(dst_mm, dst_addr, dst_pmdp, pmd); + flush_tlb_range(dst_vma, dst_addr, dst_addr + HPAGE_PMD_SIZE); + spin_unlock(dst_ptl); + + return 0; +} + static int attach_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long dst_addr, unsigned long src_addr, pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long len) @@ -304,9 +398,16 @@ static int attach_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, } if (pmd_trans_huge(*src_pmd)) { - /* Not support hugepage mapping */ - ret = -EOPNOTSUPP; - break; + if (extent == HPAGE_PMD_SIZE) { + ret = attach_huge_pmd(dst_vma, src_vma, dst_addr,
src_addr, + dst_pmd, src_pmd); + if (ret) + return ret; + continue; + } else { + ret = -EOPNOTSUPP; + break; + } } else if (is_swap_pmd(*src_pmd) || pmd_devmap(*src_pmd)) { ret = -EOPNOTSUPP; break; @@ -463,6 +564,14 @@ static int register_unexport_func(void) { int ret; +#if USE_SPLIT_PTE_PTLOCKS && ALLOC_SPLIT_PTLOCKS + zcopy_page_ptl_cachep + = (struct kmem_cache *)__kallsyms_lookup_name("page_ptl_cachep"); + ret = REGISTER_CHECK(zcopy_page_ptl_cachep, "page_ptl_cachep"); + if (ret) + goto out; +#endif + kallsyms_lookup_name_funcp = (unsigned long (*)(const char *))__kprobe_lookup_name("kallsyms_lookup_name"); ret = REGISTER_CHECK(kallsyms_lookup_name_funcp, "kallsyms_lookup_name"); -- 2.25.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE -------------------------------- To help debug PageAttach, we provide some basic tracepoints, which are: - trace_attach_extent_start - trace_attach_extent_end - trace_attach_page_range_start - trace_attach_page_range_end Signed-off-by: Liu Mingrui <liumingrui@huawei.com> --- MAINTAINERS | 1 + drivers/misc/zcopy/zcopy.c | 13 +++ include/trace/events/attach.h | 157 ++++++++++++++++++++++++++++++++++ 3 files changed, 171 insertions(+) create mode 100644 include/trace/events/attach.h diff --git a/MAINTAINERS b/MAINTAINERS index fa863a4ab198..3b402b41ee63 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19653,6 +19653,7 @@ ZCOPY DRIVER M: Mingrui Liu <liumingrui@huawei.com> S: Maintained F: drivers/misc/zcopy/ +F: include/trace/events/attach.h THE REST M: Linus Torvalds <torvalds@linux-foundation.org> diff --git a/drivers/misc/zcopy/zcopy.c b/drivers/misc/zcopy/zcopy.c index ac2d5a082924..a20269a3d843 100644 --- a/drivers/misc/zcopy/zcopy.c +++ b/drivers/misc/zcopy/zcopy.c @@ -20,6 +20,9 @@ #include <asm/tlbflush.h> #include <asm/pgtable-hwdef.h> +#define CREATE_TRACE_POINTS +#include <trace/events/attach.h> + #ifndef PUD_SHIFT #define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3) #define PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1) @@ -399,8 +402,12 @@ static int attach_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (pmd_trans_huge(*src_pmd)) { if (extent == HPAGE_PMD_SIZE) { + trace_attach_extent_start(dst_mm, src_mm, dst_addr, src_addr, + dst_pmd, src_pmd, extent); ret = attach_huge_pmd(dst_vma, src_vma, dst_addr, src_addr, dst_pmd, src_pmd); + trace_attach_extent_end(dst_mm, src_mm, dst_addr, src_addr, + dst_pmd, src_pmd, ret); if (ret) return ret; continue; @@ -418,8 +425,12 @@ static int attach_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, break; } + trace_attach_extent_start(dst_mm, src_mm, dst_addr,
src_addr, dst_pmd, + src_pmd, extent); ret = attach_ptes(dst_vma, src_vma, dst_addr, src_addr, dst_pmd, src_pmd, extent); + trace_attach_extent_end(dst_mm, src_mm, dst_addr, src_addr, dst_pmd, + src_pmd, ret); if (ret < 0) break; } @@ -502,7 +513,9 @@ static int attach_pages(unsigned long dst_addr, unsigned long src_addr, goto free_pages_array; } + trace_attach_page_range_start(dst_mm, src_mm, dst_addr, src_addr, size); ret = attach_page_range(dst_mm, src_mm, dst_addr, src_addr, size); + trace_attach_page_range_end(dst_mm, src_mm, dst_addr, src_addr, ret); unpin_user_pages_dirty_lock(process_pages, pinned_pages, 0); diff --git a/include/trace/events/attach.h b/include/trace/events/attach.h new file mode 100644 index 000000000000..1a38cbeba747 --- /dev/null +++ b/include/trace/events/attach.h @@ -0,0 +1,157 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM attach + +#if !defined(_TRACE_ATTACH_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_ATTACH_H + +#include <linux/types.h> +#include <linux/tracepoint.h> +#include <linux/mm_types.h> + +TRACE_EVENT(attach_page_range_start, + + TP_PROTO(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + unsigned long dst_addr, + unsigned long src_addr, + unsigned long size), + + TP_ARGS(dst_mm, src_mm, dst_addr, src_addr, size), + + TP_STRUCT__entry( + __field(struct mm_struct *, dst_mm) + __field(struct mm_struct *, src_mm) + __field(unsigned long, dst_addr) + __field(unsigned long, src_addr) + __field(unsigned long, size) + ), + + TP_fast_assign( + __entry->dst_mm = dst_mm; + __entry->src_mm = src_mm; + __entry->dst_addr = dst_addr; + __entry->src_addr = src_addr; + __entry->size = size; + ), + + TP_printk("dst_mm=%p src_mm=%p dst_addr=0x%lx src_addr=0x%lx size=%ld", + __entry->dst_mm, __entry->src_mm, + __entry->dst_addr, __entry->src_addr, + __entry->size) +); + +TRACE_EVENT(attach_page_range_end, + + TP_PROTO(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + unsigned long 
dst_addr, + unsigned long src_addr, + int ret), + + TP_ARGS(dst_mm, src_mm, dst_addr, src_addr, ret), + + TP_STRUCT__entry( + __field(struct mm_struct *, dst_mm) + __field(struct mm_struct *, src_mm) + __field(unsigned long, dst_addr) + __field(unsigned long, src_addr) + __field(int, ret) + ), + + TP_fast_assign( + __entry->dst_mm = dst_mm; + __entry->src_mm = src_mm; + __entry->dst_addr = dst_addr; + __entry->src_addr = src_addr; + __entry->ret = ret; + ), + + TP_printk("dst_mm=%p src_mm=%p dst_addr=0x%lx src_addr=0x%lx ret=%d", + __entry->dst_mm, __entry->src_mm, + __entry->dst_addr, __entry->src_addr, + __entry->ret) +); + +TRACE_EVENT(attach_extent_start, + + TP_PROTO(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + unsigned long dst_addr, + unsigned long src_addr, + pmd_t *new_pmd, + pmd_t *old_pmd, + unsigned long extent), + + TP_ARGS(dst_mm, src_mm, dst_addr, src_addr, new_pmd, old_pmd, extent), + + TP_STRUCT__entry( + __field(struct mm_struct *, dst_mm) + __field(struct mm_struct *, src_mm) + __field(unsigned long, dst_addr) + __field(unsigned long, src_addr) + __field(u64, new_pmd) + __field(u64, old_pmd) + __field(unsigned long, extent) + ), + + TP_fast_assign( + __entry->dst_mm = dst_mm; + __entry->src_mm = src_mm; + __entry->dst_addr = dst_addr; + __entry->src_addr = src_addr; + /* capture the pmd values now; the pointers may be stale at output time */ + __entry->new_pmd = pmd_val(*new_pmd); + __entry->old_pmd = pmd_val(*old_pmd); + __entry->extent = extent; + ), + + TP_printk("dst_mm=%p src_mm=%p dst_addr=0x%lx src_addr=0x%lx new_pmd=%016llx old_pmd=%016llx extent=%ld", + __entry->dst_mm, __entry->src_mm, + __entry->dst_addr, __entry->src_addr, + __entry->new_pmd, __entry->old_pmd, + __entry->extent) +); + +TRACE_EVENT(attach_extent_end, + + TP_PROTO(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + unsigned long dst_addr, + unsigned long src_addr, + pmd_t *new_pmd, + pmd_t *old_pmd, + int ret), + + TP_ARGS(dst_mm, src_mm, dst_addr, src_addr, new_pmd, old_pmd, ret), + + TP_STRUCT__entry( + __field(struct mm_struct *, dst_mm) + __field(struct mm_struct *, src_mm) + __field(unsigned long, dst_addr) + __field(unsigned long, src_addr) + __field(u64, new_pmd) + __field(u64, old_pmd) + __field(int, ret) + ), + + TP_fast_assign( + __entry->dst_mm = dst_mm; + __entry->src_mm = src_mm; + __entry->dst_addr = dst_addr; + __entry->src_addr = src_addr; + __entry->new_pmd = pmd_val(*new_pmd); + __entry->old_pmd = pmd_val(*old_pmd); + __entry->ret = ret; + ), + + TP_printk("dst_mm=%p src_mm=%p dst_addr=0x%lx src_addr=0x%lx new_pmd=%016llx old_pmd=%016llx ret=%d", + __entry->dst_mm, __entry->src_mm, + __entry->dst_addr, __entry->src_addr, + __entry->new_pmd, __entry->old_pmd, + __entry->ret) +); + +#endif /* _TRACE_ATTACH_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> -- 2.25.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE -------------------------------- Add a new debug interface, dump_pagetable, which users can use to dump the page tables of a given virtual address. Signed-off-by: Liu Mingrui <liumingrui@huawei.com> --- drivers/misc/zcopy/zcopy.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/drivers/misc/zcopy/zcopy.c b/drivers/misc/zcopy/zcopy.c index a20269a3d843..295519412c1c 100644 --- a/drivers/misc/zcopy/zcopy.c +++ b/drivers/misc/zcopy/zcopy.c @@ -38,6 +38,7 @@ enum pgt_entry { enum { IO_ATTACH = 1, + IO_DUMP = 3, IO_MAX }; @@ -49,6 +50,11 @@ struct zcopy_ioctl_pswap { unsigned long size; }; +struct zcopy_ioctl_dump { + unsigned long size; + unsigned long addr; +}; + struct zcopy_cdev { struct cdev chrdev; dev_t dev; @@ -64,6 +70,7 @@ static int (*__zcopy_pmd_alloc)(struct mm_struct *, pud_t *, unsigned long); static int (*__zcopy_pud_alloc)(struct mm_struct *, p4d_t *, unsigned long); static unsigned long (*kallsyms_lookup_name_funcp)(const char *); static void (*zcopy_page_remove_rmap)(struct page *, bool); +static void (*dump_pagetable)(unsigned long addr); static struct kretprobe __kretprobe; @@ -551,6 +558,18 @@ static long zcopy_ioctl(struct file *file, unsigned int type, unsigned long ptr) ctx.src_pid, ctx.size); break; } + case IO_DUMP: + { + struct zcopy_ioctl_dump param; + + if (copy_from_user((void *)&param, (void *)ptr, + sizeof(struct zcopy_ioctl_dump))) { + ret = -EFAULT; + break; + } + dump_pagetable(param.addr); + break; + } default: break; } @@ -614,6 +633,11 @@ static int register_unexport_func(void) zcopy_page_remove_rmap = (void (*)(struct page *, bool))__kallsyms_lookup_name("page_remove_rmap"); ret = REGISTER_CHECK(zcopy_page_remove_rmap, "page_remove_rmap"); + if (ret) + goto out; + + dump_pagetable = (void (*)(unsigned long))__kallsyms_lookup_name("show_pte"); + ret = REGISTER_CHECK(dump_pagetable, "show_pte"); out: return ret; -- 2.25.1
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE

--------------------------------

Add documentation for the zcopy driver module.

Signed-off-by: Liu Mingrui <liumingrui@huawei.com>
---
 Documentation/misc-devices/zcopy.rst | 64 ++++++++++++++++++++++++++++
 MAINTAINERS                          |  1 +
 2 files changed, 65 insertions(+)
 create mode 100644 Documentation/misc-devices/zcopy.rst

diff --git a/Documentation/misc-devices/zcopy.rst b/Documentation/misc-devices/zcopy.rst
new file mode 100644
index 000000000000..20a4e43d2def
--- /dev/null
+++ b/Documentation/misc-devices/zcopy.rst
@@ -0,0 +1,64 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Kernel driver zcopy
+===================
+
+
+Description
+-----------
+The PAGEATTACH mechanism is a zero-copy data transfer solution optimized for
+High Performance Computing workloads. It allows direct sharing of physical
+memory pages between distinct processes by mapping source pages into a target
+address space, eliminating redundant data copying and improving large data
+transfer efficiency.
+
+
+Key features
+------------
+- Supports zero-copy communication between intra-node processes.
+- Handles both PTE-level small pages and PMD-level huge pages.
+- Preserves the read/write permissions of the source page in the target address space.
+
+
+Important constraints and requirements
+--------------------------------------
+1. Callers must ensure the source address is already mapped to physical pages,
+while the destination address is unused (no existing mappings). If old
+mappings exist, the return value is -EAGAIN; in that case, the caller should
+free the old dst_addr and allocate a new one to try again.
+
+2. No internal locking is implemented in the PageAttach interface; callers
+must manage memory mapping and release order to avoid race conditions.
+
+3. Source and destination must be different processes (not threads of the same process).
+
+4. Only user-space addresses are supported; kernel addresses cannot be mapped.
+
+5. Both source and destination processes must remain alive during the mapping operation.
+
+6. PUD-level huge pages are not supported in the current implementation.
+
+7. The start address and size of both source and destination must be PMD-size-aligned.
+
+8. Callers are responsible for ensuring safe access to mapped pages after attachment,
+as permissions are inherited from the source.
+
+
+Use
+---
+Process a
+  src_addr = aligned_alloc(ALIGN_SIZE_2M, size);
+  memset(src_addr, 0, size);
+
+Process b
+  #define ALIGN_SIZE_2M 2097152
+  int fd = open(SLS_DEVICE, O_RDWR);
+
+  dst_addr = aligned_alloc(ALIGN_SIZE_2M, size);
+  res = zcopy(fd, (void *)src_addr, dst_addr, src_pid, dst_pid, size);
+  while (res.ret == -EAGAIN && retry--) {
+          free(dst_addr);
+          dst_addr = aligned_alloc(ALIGN_SIZE_2M, size);
+          res = zcopy(fd, (void *)src_addr, dst_addr, src_pid, dst_pid, aligned_size);
+  }
diff --git a/MAINTAINERS b/MAINTAINERS
index 3b402b41ee63..4fb57819c0c8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19652,6 +19652,7 @@ F: mm/zswap.c

 ZCOPY DRIVER
 M: Mingrui Liu <liumingrui@huawei.com>
 S: Maintained
+F: Documentation/misc-devices/zcopy.rst
 F: drivers/zcopy/
 F: include/trace/events/attach.h
--
2.25.1
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/release-management/issues/ID3TGE

--------------------------------

Set CONFIG_PAGEATTACH=m in arm64.

Signed-off-by: Liu Mingrui <liumingrui@huawei.com>
---
 arch/arm64/configs/openeuler_defconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 4e955d06851c..6b8907c96e98 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -7487,3 +7487,9 @@ CONFIG_ARCH_HAS_KCOV=y
 # CONFIG_MEMTEST is not set
 # end of Kernel Testing and Coverage
 # end of Kernel hacking
+
+#
+# PAGEATTACH SUPPORT
+#
+CONFIG_PAGEATTACH=m
+# end of pageattach
--
2.25.1
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/18989
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/STE...
participants (2)
- Liu Mingrui
- patchwork bot