[PATCH openEuler-25.03 v2 0/4] Add TrIO support in EROFS

TrIO can accelerate the cold start of containers during on-demand loading. It
aggregates the read I/O operations that a container issues during its first
launch. On subsequent startups, TrIO pulls the necessary I/O data to the
container node in a single large I/O operation and uses this I/O information to
construct the runtime rootfs. By improving the efficiency of network I/O, TrIO
speeds up container startup in on-demand loading scenarios.

TrIO consists of both kernel-space and user-space code. The kernel-space code is
adapted at the EROFS layer and introduces the CONFIG_EROFS_TRIO option so the
feature stays isolated. The user-space code requires adaptation by the user;
detailed usage is described in tools/trio/README.md. Patches 1~2 contain the
kernel adaptations, while patches 3~4 add the scripts and best practices that
TrIO relies on for its operation.

Hongbo Li (4):
  erofs: trio: Add trio_manager in erofs
  erofs: trio: Support TrIO feature in erofs
  TrIO: Add tools for using TrIO
  TrIO: Add README.md

 fs/erofs/Kconfig                             |  11 +
 fs/erofs/Makefile                            |   1 +
 fs/erofs/fscache.c                           |  22 +-
 fs/erofs/internal.h                          |  39 ++
 fs/erofs/super.c                             |  45 +-
 fs/erofs/trio_manager.c                      | 337 ++++++++++++
 tools/trio/README.md                         | 507 +++++++++++++++++++
 tools/trio/bpf/iotracker/Makefile            |  99 ++++
 tools/trio/bpf/iotracker/iotracker.bpf.c     |  59 +++
 tools/trio/bpf/iotracker/iotracker.c         |  57 +++
 tools/trio/bpf/rio_tracker_mod/Makefile      |   9 +
 tools/trio/bpf/rio_tracker_mod/rio_tracker.c | 278 ++++++++++
 tools/trio/scripts/trace_parser.py           | 277 ++++++++++
 13 files changed, 1739 insertions(+), 2 deletions(-)
 create mode 100644 fs/erofs/trio_manager.c
 create mode 100644 tools/trio/README.md
 create mode 100644 tools/trio/bpf/iotracker/Makefile
 create mode 100644 tools/trio/bpf/iotracker/iotracker.bpf.c
 create mode 100644 tools/trio/bpf/iotracker/iotracker.c
 create mode 100644 tools/trio/bpf/rio_tracker_mod/Makefile
 create mode 100644 tools/trio/bpf/rio_tracker_mod/rio_tracker.c
 create mode 100644 tools/trio/scripts/trace_parser.py

--
2.34.1
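For orientation, the end-to-end flow below is a condensed sketch assembled from the steps in tools/trio/README.md; the hostname, fsid and file paths are placeholders, not values mandated by the patches:

```shell
# Trace node: record the container's first-start I/O.
insmod rio_tracker_mod/rio_tracker.ko tracker_output="/var/log/trace.txt"
iotracker/.output/iotracker &                      # attach the eBPF probes
echo -n "$CONTAINER_HOSTNAME" > /sys/kernel/rio_tracker/host_ns
echo 1 > /sys/kernel/rio_tracker/enable
# ... run the container workload once ...
echo 0 > /sys/kernel/rio_tracker/enable
echo 1 > /sys/kernel/rio_tracker/dump              # raw trace written to /var/log/trace.txt

# Convert the raw trace into the meta/data pair consumed by the kernel.
python3 scripts/trace_parser.py --trace_file=/var/log/trace.txt \
        --output_dir=/var/log --rootfs=/path/to/container/rootfs

# Run node: hand the generated files to erofs at mount time.
mount -t erofs none /mnt/rootfs \
      -o fsid="$FSCACHE_ID",trio_meta=/var/log/"$META_SHA256",trio_data=/var/log/"$DATA_SHA256"
```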

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/IBK2MJ -------------------------------- TrIO is used to boost the erofs read operation by trace buffer, which gether multiple small IO of the container during start-up. TrIO provides the following mainly APIs: - erofs_register_trio: load the target trace info into erofs. - erofs_unregister_trio: release the trace info. - erofs_read_from_trio: try to read data from trio. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> --- fs/erofs/trio_manager.c | 337 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 337 insertions(+) create mode 100644 fs/erofs/trio_manager.c diff --git a/fs/erofs/trio_manager.c b/fs/erofs/trio_manager.c new file mode 100644 index 000000000000..6f5ab0b85227 --- /dev/null +++ b/fs/erofs/trio_manager.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * + * Copyright (C) 2014 Huawei Inc. + */ + +#include <linux/sysfs.h> +#include <linux/kstrtox.h> +#include <linux/hashtable.h> +#include <linux/crc32.h> +#include <linux/printk.h> +#include <linux/mm.h> +#include <linux/vmalloc.h> +#include <linux/pfn_t.h> +#include <linux/fs.h> +#include <linux/slab.h> +#include <linux/uio.h> +#include <linux/string.h> + +#include "internal.h" + +#define PATH_IO_DELIMIT "|" +#define PATH_PATH_DELIMIT "@" +#define IO_IO_DELIMIT "+" +#define VAR_DELIMIT "," + +struct trace_object { + uint64_t ino; /* inode number of this trace object */ + struct list_head head; /* record the all trace io */ + struct hlist_node node; /* link into hashtable */ +}; + +struct trace_io { + struct list_head link; /* link of this io */ + uint64_t soff; /* offset in source file, trace.data */ + uint64_t len; /* length of this io */ + uint64_t doff; /* offset in destination */ +}; + +static struct kmem_cache *trio_iop; +static struct kmem_cache *trio_objp; + +static struct trace_object *alloc_trace_object(uint64_t ino) +{ + struct trace_object *obj = kmem_cache_zalloc(trio_objp, GFP_KERNEL); + + if (!obj) + return NULL; + + obj->ino = ino; + INIT_LIST_HEAD(&obj->head); + + return obj; +} + +static void free_trace_object(struct trace_object *obj) +{ + struct list_head *pos, *n; + struct trace_io *io; + + if (IS_ERR_OR_NULL(obj)) + return; + + list_for_each_safe(pos, n, &obj->head) { + io = list_entry(pos, struct trace_io, link); + list_del(&io->link); + kmem_cache_free(trio_iop, io); + } + kmem_cache_free(trio_objp, obj); +} + +static void hash_add_trace_object(struct hlist_head *trace_ht, + struct trace_object *obj) +{ + hlist_add_head(&obj->node, &trace_ht[hash_min(obj->ino, TRIO_HT_BITS)]); +} + +static const struct hlist_head *_find_by_hash(struct hlist_head *ht, uint64_t ino) +{ + struct hlist_head *h; + + h = &ht[hash_min(ino, TRIO_HT_BITS)]; + if (hlist_empty(h)) + return NULL; + + return h; +} + +static struct trace_object *find_trace_object(struct inode *inode) +{ + struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb); + unsigned long ino = inode->i_ino; + const struct hlist_head *handlers; + struct trace_object *obj; + + handlers = _find_by_hash(sbi->meta_ht, ino); + if (!handlers) + return NULL; + + hlist_for_each_entry(obj, handlers, node) + if (obj->ino == ino) + return obj; + + return NULL; +} + +struct trace_io *get_io_from_object(struct trace_object *obj, + loff_t off, size_t len, size_t *hit_len) +{ + struct list_head *tmp; + struct trace_io *io = NULL; + size_t can_read = 0; + + list_for_each(tmp, &obj->head) { + io = list_entry(tmp, struct trace_io, link); + /* next bigger one */ + if 
(io->doff + io->len <= off) + continue; + + /* last unmatch one */ + if (off + len <= io->doff) + break; + + /* io include the read range */ + if (io->doff <= off) { + can_read = min_t(size_t, len, io->doff + io->len - off); + break; + } + } + + *hit_len = can_read; + return io; +} + +static void *_read_data_inner(struct super_block *sb, const char *path) +{ + struct file *filp = filp_open(path, O_RDONLY, 0644); + loff_t size, off; + void *data; + char *buf; + ssize_t ret; + + if (IS_ERR(filp)) { + erofs_err(sb, "open target file:%s failed", path); + return NULL; + } + + size = filp->f_inode->i_size; + data = vmalloc(size + 1); + if (!data) { + erofs_err(sb, "alloc buffer for size:%lld failed", size); + goto close_data; + } + + off = 0; + ret = kernel_read(filp, data, size, &off); + if (ret < 0) { + erofs_err(sb, "read failed for size:%lld, ret:%ld", size, ret); + vfree(data); + data = NULL; + goto close_data; + } + + buf = (char *)data; + buf[size] = '\0'; + +close_data: + filp_close(filp, NULL); + return data; +} + +ssize_t erofs_read_from_trio(struct address_space *mapping, + loff_t pos, size_t len) +{ + struct inode *inode = mapping->host; + struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb); + struct trace_object *obj; + struct trace_io *target_io; + struct iov_iter iter; + loff_t new_foff; + size_t hit_len; + ssize_t ret; + + obj = find_trace_object(inode); + if (!obj) + return 0; + + target_io = get_io_from_object(obj, pos, len, &hit_len); + if (!target_io || !hit_len) + return 0; + + iov_iter_xarray(&iter, ITER_DEST, &mapping->i_pages, pos, hit_len); + new_foff = (pos - target_io->doff) + target_io->soff; + ret = copy_to_iter(sbi->buffer + new_foff, hit_len, &iter); + if (ret != hit_len) + return -EFAULT; + return ret; +} + +int erofs_register_trio(struct super_block *sb) +{ + struct erofs_sb_info *sbi = EROFS_SB(sb); + char *item, *path, *ios, *io, *s_toff, *s_len, *s_foff; + uint64_t ino, doff, soff, len; + char *meta_buffer; + struct trace_object *obj; + struct trace_io *io_item; + struct list_head *pos, *n; + LIST_HEAD(head); + int ret = -EINVAL; + + if (!sbi->trio_meta || !sbi->trio_data) { + erofs_err(sb, "trio_meta and trio_data must be set together"); + return ret; + } + + sbi->buffer = _read_data_inner(sb, sbi->trio_data); + if (!sbi->buffer) + return ret; + + meta_buffer = _read_data_inner(sb, sbi->trio_meta); + if (!meta_buffer) + goto free_obj; + + while ((item = strsep(&meta_buffer, PATH_PATH_DELIMIT)) != NULL) { + path = strsep(&item, PATH_IO_DELIMIT); + ios = item; + ret = kstrtou64(path, 10, &ino); + if (ret < 0) { + erofs_err(sb, "parse inode failed ino:%s failed", path); + goto free_obj; + } + + while ((io = strsep(&ios, IO_IO_DELIMIT)) != NULL) { + s_toff = strsep(&io, VAR_DELIMIT); + s_len = strsep(&io, VAR_DELIMIT); + s_foff = strsep(&io, VAR_DELIMIT); + + ret = kstrtou64(s_toff, 10, &doff); + if (ret < 0) { + erofs_err(sb, "set target_offset failed path:%s,io(%s,%s,%s)", + path, s_toff, s_len, s_foff); + goto free_obj; + } + + ret = kstrtou64(s_len, 10, &len); + if (ret < 0) { + erofs_err(sb, "set target_length failed path:%s,io(%s,%s,%s)", + path, s_toff, s_len, s_foff); + goto free_obj; + } + + ret = kstrtou64(s_foff, 10, &soff); + if (ret < 0) { + erofs_err(sb, "set source_offset failed path:%s,io(%s,%s,%s)", + path, s_toff, s_len, s_foff); + goto free_obj; + } + + io_item = kmem_cache_zalloc(trio_iop, GFP_KERNEL); + if (!io_item) { + erofs_err(sb, "alloc for trace io failed"); + goto free_obj; + } + INIT_LIST_HEAD(&io_item->link); + io_item->len = len; + 
io_item->doff = doff; + io_item->soff = soff; + + list_add_tail(&io_item->link, &head); + } + + obj = alloc_trace_object(ino); + if (obj == NULL) { + erofs_err(sb, "alloc trace object failed"); + goto free_obj; + } + + list_splice_init(&head, &obj->head); + hash_add_trace_object(sbi->meta_ht, obj); + } + ret = 0; + +free_meta: + if (meta_buffer) + vfree(meta_buffer); + return ret; +free_obj: + list_for_each_safe(pos, n, &head) { + io_item = list_entry(pos, struct trace_io, link); + list_del(&io_item->link); + kmem_cache_free(trio_iop, io_item); + } + if (sbi->buffer) + vfree(sbi->buffer); + sbi->buffer = NULL; + goto free_meta; +} + +void erofs_unregister_trio(struct super_block *sb) +{ + struct erofs_sb_info *sbi = EROFS_SB(sb); + struct hlist_node *tmp; + struct trace_object *obj; + int i; + + hash_for_each_safe(sbi->meta_ht, i, tmp, obj, node) { + hlist_del(&obj->node); + free_trace_object(obj); + } + if (sbi->buffer) + vfree(sbi->buffer); + sbi->buffer = NULL; +} + +int trio_manager_init(void) +{ + trio_objp = kmem_cache_create("trio_obj_pool", + sizeof(struct trace_object), 0, 0, NULL); + if (!trio_objp) + return -ENOMEM; + + trio_iop = kmem_cache_create("trio_io_pool", sizeof(struct trace_io), + 0, 0, NULL); + if (!trio_iop) { + kmem_cache_destroy(trio_objp); + return -ENOMEM; + } + + return 0; +} + +void trio_manager_exit(void) +{ + kmem_cache_destroy(trio_iop); + kmem_cache_destroy(trio_objp); +} -- 2.34.1
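For readers of the parser above, a hedged sketch of the trio_meta string it expects, derived from the delimiters and the strsep() order in erofs_register_trio(): entries are joined by "@", each entry is "<ino>|<I/O list>", and each I/O is "<offset in the inode>,<length>,<offset in trio_data>" with multiple I/Os joined by "+". The inode numbers and offsets below are purely illustrative:

```shell
# Hypothetical two-inode trace: inode 353 has two I/O ranges, inode 401 has one.
printf '353|0,8192,0+40960,4096,8192@401|0,4096,12288' > trio_meta
```

The matching trio_data file is simply the concatenation of those ranges, padded to page boundaries, as produced by tools/trio/scripts/trace_parser.py.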

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/IBK2MJ -------------------------------- Add CONFIG_EROFS_TRIO to control the trio feature. For reading data, we first check the data in trio. When the users want to use TrIO in erofs, they should mount with @trio_meta and @trio_data options. The target options is the valid path where the TrIO info stored. Erofs will parse the TrIO info and build the runtime buffer for reading. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> --- fs/erofs/Kconfig | 11 +++++++++++ fs/erofs/Makefile | 1 + fs/erofs/fscache.c | 22 +++++++++++++++++++++- fs/erofs/internal.h | 39 +++++++++++++++++++++++++++++++++++++++ fs/erofs/super.c | 45 ++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 116 insertions(+), 2 deletions(-) diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig index f6dc961e6c2b..c36bcaeffc14 100644 --- a/fs/erofs/Kconfig +++ b/fs/erofs/Kconfig @@ -125,6 +125,17 @@ config EROFS_FS_ONDEMAND If unsure, say N. +config EROFS_TRIO + bool "EROFS TrIO support" + depends on EROFS_FS_ONDEMAND + default n + help + If this config option is enabled then erofs trio will be enable. + This is used for container cases to boost on-demand loading for + container's images. + + If unsure, say N. + config EROFS_FS_PCPU_KTHREAD bool "EROFS per-cpu decompression kthread workers" depends on EROFS_FS_ZIP diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile index 994d0b9deddf..d7aeef9d50a4 100644 --- a/fs/erofs/Makefile +++ b/fs/erofs/Makefile @@ -7,3 +7,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP) += decompressor.o zmap.o zdata.o pcpubuf.o erofs-$(CONFIG_EROFS_FS_ZIP_LZMA) += decompressor_lzma.o erofs-$(CONFIG_EROFS_FS_ZIP_DEFLATE) += decompressor_deflate.o erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o +erofs-$(CONFIG_EROFS_TRIO) += trio_manager.o diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c index 7bcd1d261e7d..096384ea2a85 100644 --- a/fs/erofs/fscache.c +++ b/fs/erofs/fscache.c @@ -198,6 +198,21 @@ static int erofs_fscache_meta_read_folio(struct file *data, struct folio *folio) return ret; } +static bool erofs_fscache_read_trio(struct erofs_fscache_request *primary) +{ + struct address_space *mapping = primary->mapping; + loff_t pos = primary->start + primary->submitted; + + ssize_t ret = erofs_read_from_trio(mapping, pos, + primary->len - primary->submitted); + if (ret > 0) { + primary->submitted += ret; + return true; + } + + return false; +} + static int erofs_fscache_data_read_slice(struct erofs_fscache_request *primary) { struct address_space *mapping = primary->mapping; @@ -207,10 +222,15 @@ static int erofs_fscache_data_read_slice(struct erofs_fscache_request *primary) struct erofs_map_blocks map; struct erofs_map_dev mdev; struct iov_iter iter; - loff_t pos = primary->start + primary->submitted; + loff_t pos; size_t count; int ret; + /* first try mapping in trio */ + if (erofs_fscache_read_trio(primary)) + return 0; + + pos = primary->start + primary->submitted; map.m_la = pos; ret = erofs_map_blocks(inode, &map); if (ret) diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h index a53152693ae2..6d637f21ec75 100644 --- a/fs/erofs/internal.h +++ b/fs/erofs/internal.h @@ -16,6 +16,7 @@ #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/iomap.h> +#include <linux/hashtable.h> #include "erofs_fs.h" /* redefine pr_fmt "erofs: " */ @@ -177,6 +178,15 @@ struct erofs_sb_info { char *fsid; char *domain_id; bool ondemand_enabled; + +#ifdef CONFIG_EROFS_TRIO +#define TRIO_HT_BITS 10 + /* trio support */ + 
char *trio_meta; + char *trio_data; + void *buffer; + DECLARE_HASHTABLE(meta_ht, TRIO_HT_BITS); +#endif }; #define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info) @@ -525,6 +535,35 @@ static inline void erofs_fscache_unregister_cookie(struct erofs_fscache *fscache } #endif +#ifdef CONFIG_EROFS_TRIO +ssize_t erofs_read_from_trio(struct address_space *mapping, + loff_t pos, size_t len); +int erofs_register_trio(struct super_block *sb); +void erofs_unregister_trio(struct super_block *sb); +int trio_manager_init(void); +void trio_manager_exit(void); +#else +static inline ssize_t erofs_read_from_trio(struct address_space *mapping, + loff_t pos, size_t len) +{ + return 0; +} + +static inline int erofs_register_trio(struct super_block *sb) +{ + return 0; +} + +static inline void erofs_unregister_trio(struct super_block *sb) {} + +static inline int trio_manager_init(void) +{ + return 0; +} + +static inline void trio_manager_exit(void) {} +#endif + #define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */ #endif /* __EROFS_INTERNAL_H */ diff --git a/fs/erofs/super.c b/fs/erofs/super.c index 43e3a8322a6c..b5277ab6d3ec 100644 --- a/fs/erofs/super.c +++ b/fs/erofs/super.c @@ -13,6 +13,7 @@ #include <linux/fs_parser.h> #include <linux/dax.h> #include <linux/exportfs.h> +#include <linux/backing-dev-defs.h> #include "xattr.h" #define CREATE_TRACE_POINTS @@ -388,6 +389,8 @@ enum { Opt_device, Opt_fsid, Opt_domain_id, + Opt_trio_meta, + Opt_trio_data, Opt_err }; @@ -414,6 +417,8 @@ static const struct fs_parameter_spec erofs_fs_parameters[] = { fsparam_string("device", Opt_device), fsparam_string("fsid", Opt_fsid), fsparam_string("domain_id", Opt_domain_id), + fsparam_string("trio_meta", Opt_trio_meta), + fsparam_string("trio_data", Opt_trio_data), {} }; @@ -530,6 +535,25 @@ static int erofs_fc_parse_param(struct fs_context *fc, if (!sbi->domain_id) return -ENOMEM; break; +#ifdef CONFIG_EROFS_TRIO + case Opt_trio_meta: + kfree(sbi->trio_meta); + sbi->trio_meta = kstrdup(param->string, GFP_KERNEL); + if (!sbi->trio_meta) + return -ENOMEM; + break; + case Opt_trio_data: + kfree(sbi->trio_data); + sbi->trio_data = kstrdup(param->string, GFP_KERNEL); + if (!sbi->trio_data) + return -ENOMEM; + break; +#else + case Opt_trio_meta: + case Opt_trio_data: + errorfc(fc, "%s option not supported", erofs_fs_parameters[opt].name); + break; +#endif #else case Opt_fsid: case Opt_domain_id: @@ -596,6 +620,12 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc) sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; + if (sbi->trio_meta || sbi->trio_data) { + err = erofs_register_trio(sb); + if (err) + return err; + } + err = erofs_fscache_register_fs(sb); if (err) return err; @@ -603,6 +633,9 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc) err = super_setup_bdi(sb); if (err) return err; + + sb->s_bdi->ra_pages = 0; + sb->s_bdi->io_pages = 0; } else { if (!sb_set_blocksize(sb, PAGE_SIZE)) { errorfc(fc, "failed to set initial blksize"); @@ -809,7 +842,10 @@ static void erofs_kill_sb(struct super_block *sb) erofs_free_dev_context(sbi->devs); fs_put_dax(sbi->dax_dev, NULL); + erofs_unregister_trio(sb); erofs_fscache_unregister_fs(sb); + kfree(sbi->trio_meta); + kfree(sbi->trio_data); kfree(sbi->fsid); kfree(sbi->domain_id); kfree(sbi); @@ -833,6 +869,7 @@ static void erofs_put_super(struct super_block *sb) sbi->packed_inode = NULL; erofs_free_dev_context(sbi->devs); sbi->devs = NULL; + erofs_unregister_trio(sb); erofs_fscache_unregister_fs(sb); } @@ 
-883,8 +920,13 @@ static int __init erofs_module_init(void) if (err) goto fs_err; - return 0; + err = trio_manager_init(); + if (err) + goto trio_err; + return 0; +trio_err: + unregister_filesystem(&erofs_fs_type); fs_err: erofs_exit_sysfs(); sysfs_err: @@ -902,6 +944,7 @@ static int __init erofs_module_init(void) static void __exit erofs_module_exit(void) { + trio_manager_exit(); unregister_filesystem(&erofs_fs_type); /* Ensure all RCU free inodes / pclusters are safe to be destroyed. */ -- 2.34.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/IBK2MJ -------------------------------- In order to use TrIO, we should provide some basic tools. These are mainly about how to prepare the trace for TrIO. If the user want to use TrIO in container on-demand loading scenario, they may use these scripts and tools. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> --- tools/trio/bpf/iotracker/Makefile | 99 +++++++ tools/trio/bpf/iotracker/iotracker.bpf.c | 59 ++++ tools/trio/bpf/iotracker/iotracker.c | 57 ++++ tools/trio/bpf/rio_tracker_mod/Makefile | 9 + tools/trio/bpf/rio_tracker_mod/rio_tracker.c | 278 +++++++++++++++++++ tools/trio/scripts/trace_parser.py | 277 ++++++++++++++++++ 6 files changed, 779 insertions(+) create mode 100644 tools/trio/bpf/iotracker/Makefile create mode 100644 tools/trio/bpf/iotracker/iotracker.bpf.c create mode 100644 tools/trio/bpf/iotracker/iotracker.c create mode 100644 tools/trio/bpf/rio_tracker_mod/Makefile create mode 100644 tools/trio/bpf/rio_tracker_mod/rio_tracker.c create mode 100644 tools/trio/scripts/trace_parser.py diff --git a/tools/trio/bpf/iotracker/Makefile b/tools/trio/bpf/iotracker/Makefile new file mode 100644 index 000000000000..f5c279c62224 --- /dev/null +++ b/tools/trio/bpf/iotracker/Makefile @@ -0,0 +1,99 @@ +# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +include ../../../scripts/Makefile.include + +OUTPUT ?= $(abspath .output) + +BPFTOOL_OUTPUT := $(OUTPUT)bpftool/ +DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bootstrap/bpftool +BPFTOOL ?= $(DEFAULT_BPFTOOL) +LIBBPF_SRC := $(abspath ../../../lib/bpf) +BPFOBJ_OUTPUT := $(OUTPUT)libbpf/ +BPFOBJ := $(BPFOBJ_OUTPUT)libbpf.a +BPF_DESTDIR := $(BPFOBJ_OUTPUT) +BPF_INCLUDE := $(BPF_DESTDIR)/include +INCLUDES := -I$(OUTPUT) -I$(BPF_INCLUDE) -I$(abspath ../../../include/uapi) +CFLAGS := -g -Wall $(CLANG_CROSS_FLAGS) +CFLAGS += $(EXTRA_CFLAGS) +LDFLAGS += $(EXTRA_LDFLAGS) +LDLIBS += -lelf -lz +ifeq ($(shell uname -m), x86_64) + ARCH_FLAG := __TARGET_ARCH_x86 +else + ARCH_FLAG := __TARGET_ARCH_arm64 +endif + +# Try to detect best kernel BTF source +KERNEL_REL := $(shell uname -r) +VMLINUX_BTF_PATHS := $(if $(O),$(O)/vmlinux) \ + $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \ + ../../../../vmlinux /sys/kernel/btf/vmlinux \ + /boot/vmlinux-$(KERNEL_REL) +VMLINUX_BTF_PATH := $(or $(VMLINUX_BTF),$(firstword \ + $(wildcard $(VMLINUX_BTF_PATHS)))) + +ifeq ($(V),1) +Q = +else +Q = @ +MAKEFLAGS += --no-print-directory +submake_extras := feature_display=0 +endif + +.DELETE_ON_ERROR: + +.PHONY: all clean iotracker libbpf_hdrs +all: iotracker + +iotracker: $(OUTPUT)/iotracker + +clean: + $(call QUIET_CLEAN, iotracker) + $(Q)$(RM) -r $(BPFOBJ_OUTPUT) $(BPFTOOL_OUTPUT) + $(Q)$(RM) $(OUTPUT)*.o $(OUTPUT)*.d + $(Q)$(RM) $(OUTPUT)*.skel.h $(OUTPUT)vmlinux.h + $(Q)$(RM) $(OUTPUT)iotracker + $(Q)$(RM) -r .output + +libbpf_hdrs: $(BPFOBJ) + +$(OUTPUT)/iotracker: $(OUTPUT)/iotracker.o $(BPFOBJ) + $(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $^ $(LDLIBS) -o $@ + +$(OUTPUT)/iotracker.o: $(OUTPUT)/iotracker.skel.h \ + $(OUTPUT)/iotracker.bpf.o | libbpf_hdrs + +$(OUTPUT)/iotracker.bpf.o: $(OUTPUT)/vmlinux.h | libbpf_hdrs + +$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(BPFTOOL) + $(QUIET_GEN)$(BPFTOOL) gen skeleton $< > $@ + + +$(OUTPUT)/%.bpf.o: %.bpf.c $(BPFOBJ) | $(OUTPUT) + $(QUIET_GEN)$(CLANG) -g -O2 --target=bpf -D$(ARCH_FLAG) $(INCLUDES) \ + -c $(filter %.c,$^) -o $@ && \ + $(LLVM_STRIP) -g $@ + +$(OUTPUT)/%.o: %.c | $(OUTPUT) + $(QUIET_CC)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter 
%.c,$^) -o $@ + +$(OUTPUT) $(BPFOBJ_OUTPUT) $(BPFTOOL_OUTPUT): + $(QUIET_MKDIR)mkdir -p $@ + +$(OUTPUT)/vmlinux.h: $(VMLINUX_BTF_PATH) | $(OUTPUT) $(BPFTOOL) +ifeq ($(VMLINUX_H),) + $(Q)if [ ! -e "$(VMLINUX_BTF_PATH)" ] ; then \ + echo "Couldn't find kernel BTF; set VMLINUX_BTF to" \ + "specify its location." >&2; \ + exit 1;\ + fi + $(QUIET_GEN)$(BPFTOOL) btf dump file $(VMLINUX_BTF_PATH) format c > $@ +else + $(Q)cp "$(VMLINUX_H)" $@ +endif + +$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(BPFOBJ_OUTPUT) + $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) OUTPUT=$(BPFOBJ_OUTPUT) \ + DESTDIR=$(BPFOBJ_OUTPUT) prefix= $(abspath $@) install_headers + +$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT) + $(Q)$(MAKE) $(submake_extras) -C ../../../bpf/bpftool OUTPUT=$(BPFTOOL_OUTPUT) bootstrap diff --git a/tools/trio/bpf/iotracker/iotracker.bpf.c b/tools/trio/bpf/iotracker/iotracker.bpf.c new file mode 100644 index 000000000000..09b53d8419e2 --- /dev/null +++ b/tools/trio/bpf/iotracker/iotracker.bpf.c @@ -0,0 +1,59 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2025 Huawei Technologies Co., Ltd + */ + +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> +#include <bpf/bpf_tracing.h> + +#define PAGE_SZ 4096 +#define PAGE_ST 12 + +extern void bpf_get_dpath_mark(unsigned long addr, unsigned long off, + unsigned long len) __ksym; + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; + +static int _read_request(struct pt_regs *ctx, struct kiocb *iocb, struct iov_iter *to) +{ + struct file *filp; + unsigned long foff, len, count; + loff_t offset; + + bpf_core_read(&offset, sizeof(loff_t), &(iocb->ki_pos)); + bpf_core_read(&filp, sizeof(struct file *), &(iocb->ki_filp)); + bpf_core_read(&count, sizeof(size_t), &(to->count)); + + /* enlarge to the 4k-aligned(page-based) */ + foff = (offset >> PAGE_ST) << PAGE_ST; + len = ((count >> PAGE_ST) + 1) << PAGE_ST; + + bpf_get_dpath_mark((unsigned long)filp, foff, len); + return 0; +} + +SEC("kprobe/erofs_file_read_iter") +int BPF_KPROBE(erofs_file_read_iter_entry, struct kiocb *iocb, + struct iov_iter *to) +{ + return _read_request(ctx, iocb, to); +} + +SEC("kprobe/filemap_fault") +int BPF_KPROBE(filemap_fault_entry, struct vm_fault *vmf) +{ + struct file *file; + unsigned long foff, len; + + bpf_core_read(&foff, sizeof(unsigned long), &vmf->pgoff); + file = BPF_CORE_READ(vmf, vma, vm_file); + if (!file) + return 0; + + foff = foff * PAGE_SZ; + len = PAGE_SZ; + + bpf_get_dpath_mark((unsigned long)file, foff, len); + return 0; +} diff --git a/tools/trio/bpf/iotracker/iotracker.c b/tools/trio/bpf/iotracker/iotracker.c new file mode 100644 index 000000000000..a9af6087793a --- /dev/null +++ b/tools/trio/bpf/iotracker/iotracker.c @@ -0,0 +1,57 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2025 Huawei Technologies Co., Ltd + */ + +#include <stdio.h> +#include <unistd.h> +#include <sys/resource.h> +#include <bpf/libbpf.h> +#include "iotracker.skel.h" + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + return vfprintf(stderr, format, args); +} + +int main(int argc, char **argv) +{ + struct iotracker_bpf *skel; + int err; + + /* Set up libbpf errors and debug info callback */ + libbpf_set_print(libbpf_print_fn); + + /* Open BPF application */ + skel = iotracker_bpf__open(); + if (!skel) { + fprintf(stderr, "Failed to open BPF skeleton\n"); + return 1; + } + + /* Load & verify BPF programs */ + err = iotracker_bpf__load(skel); + if (err) { + fprintf(stderr, 
"Failed to load and verify BPF skeleton\n"); + goto cleanup; + } + + /* Attach tracepoint handler */ + err = iotracker_bpf__attach(skel); + if (err) { + fprintf(stderr, "Failed to attach BPF skeleton\n"); + goto cleanup; + } + + printf("Successfully started! Please run `sudo cat /sys/kernel/debug/tracing/trace_pipe`" + "to see output of the BPF programs.\n"); + + for (;;) { + /* trigger our BPF program */ + fprintf(stderr, "."); + sleep(1); + } + +cleanup: + iotracker_bpf__destroy(skel); + return -err; +} diff --git a/tools/trio/bpf/rio_tracker_mod/Makefile b/tools/trio/bpf/rio_tracker_mod/Makefile new file mode 100644 index 000000000000..22942d7124c7 --- /dev/null +++ b/tools/trio/bpf/rio_tracker_mod/Makefile @@ -0,0 +1,9 @@ +PWD = $(shell pwd) +KVERS =$(shell uname -r) +KERNDIR =/lib/modules/${KVERS}/build/ +obj-m += rio_tracker.o +build: kernel_modules +kernel_modules: + make -C $(KERNDIR) M=$(PWD) modules +clean: + make -C $(KERNDIR) M=$(PWD) clean diff --git a/tools/trio/bpf/rio_tracker_mod/rio_tracker.c b/tools/trio/bpf/rio_tracker_mod/rio_tracker.c new file mode 100644 index 000000000000..a9474fe738b6 --- /dev/null +++ b/tools/trio/bpf/rio_tracker_mod/rio_tracker.c @@ -0,0 +1,278 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2025 Huawei Technologies Co., Ltd + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/btf.h> +#include <linux/btf_ids.h> + +#include <linux/dcache.h> +#include <linux/string.h> +#include <asm/current.h> +#include <linux/uaccess.h> +#include <linux/vmalloc.h> +#include <linux/mutex.h> +#include <linux/fs.h> +#include <linux/err.h> +#include <linux/nsproxy.h> +#include <linux/utsname.h> +#include <linux/printk.h> + +static uint32_t tracker_buffer_size = 8388608; /* 8MB for default */ +module_param(tracker_buffer_size, uint, 0444); + +static char *tracker_output = "/"; +module_param(tracker_output, charp, 0444); +MODULE_PARM_DESC(tracker_output, "Must be set by the user."); + +struct rio_tracker_mgr { + bool enable; + struct kobject *object; + char *host_ns; + + /* buffer for trace */ + struct mutex lock; + char *data; + uint32_t pos; +}; + +static struct rio_tracker_mgr rtracker = {0}; + +ssize_t enable_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", rtracker.enable); +} + +ssize_t enable_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + int value; + + ret = kstrtoint(buf, 10, &value); + if (ret < 0) { + pr_err("store attr failed\n"); + return -EINVAL; + } + + if (0 != value && 1 != value) + return -EINVAL; + + rtracker.enable = value; + return count; +} + +static void _dump_trace(void) +{ + struct file *filp; + ssize_t ret; + loff_t pos; + + mutex_lock(&rtracker.lock); + rtracker.data[rtracker.pos] = '\0'; + filp = filp_open(tracker_output, O_RDWR | O_CREAT | O_TRUNC, 0644); + if (IS_ERR(filp)) { + mutex_unlock(&rtracker.lock); + pr_warn("dump failed, file(%s) open failed, err:%ld\n", + tracker_output, PTR_ERR(filp)); + return; + } + + pos = 0; + ret = kernel_write(filp, rtracker.data, rtracker.pos, &pos); + if (ret < 0) + pr_warn("dump failed, file(%s) write failed, err:%ld, len:%u\n", + tracker_output, ret, rtracker.pos); + else + pr_info("dump to %s %ld bytes successfully!\n", + tracker_output, ret); + mutex_unlock(&rtracker.lock); + filp_close(filp, NULL); +} + +ssize_t dump_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + _dump_trace(); + return count; +} + 
+ssize_t reset_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + mutex_lock(&rtracker.lock); + rtracker.pos = 0; + mutex_unlock(&rtracker.lock); + return count; +} + +ssize_t host_ns_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%s\n", rtracker.host_ns); +} + +ssize_t host_ns_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + char *new_prefix = kstrdup(buf, GFP_KERNEL); + + if (!new_prefix) + return -ENOMEM; + + swap(rtracker.host_ns, new_prefix); + kfree(new_prefix); + return count; +} + +static struct kobj_attribute enable_attr = + __ATTR(enable, 0664, enable_show, enable_store); +struct kobj_attribute dump_attr = + __ATTR(dump, 0200, NULL, dump_store); +struct kobj_attribute reset_attr = + __ATTR(reset, 0200, NULL, reset_store); +struct kobj_attribute host_ns_attr = + __ATTR(host_ns, 0664, host_ns_show, host_ns_store); + +static struct attribute *tracker_kobj_attrs[] = { + &enable_attr.attr, + &dump_attr.attr, + &reset_attr.attr, + &host_ns_attr.attr, + NULL, +}; + +const struct attribute_group tracker_attr_group = { + .attrs = tracker_kobj_attrs, +}; + +__diag_push(); +__diag_ignore(GCC, 8, "-Wmissing-prototypes", +"Global functions as their definitions will be in vmlinux BTF"); + +static inline bool _target_process(const char *name) +{ + if (!rtracker.host_ns) + return false; + + return !!str_has_prefix(name, rtracker.host_ns); +} + +__bpf_kfunc void bpf_get_dpath_mark(unsigned long addr, unsigned long off, + unsigned long len) +{ + const struct file *file = (const struct file *)addr; + const struct path *path = (const struct path *)&(file->f_path); + char buff[256] = {0}; + char *ret_path = NULL; + int written; + + if (!rtracker.enable) + return; + + /* only track regular file */ + if (!S_ISREG(file_inode(file)->i_mode)) + return; + + if (!_target_process(current->nsproxy->uts_ns->name.nodename)) + return; + + ret_path = d_path(path, buff, sizeof(buff)); + if (IS_ERR(ret_path)) { + pr_err("get fpath failed, ret:%ld\n", PTR_ERR(ret_path)); + return; + } + + mutex_lock(&rtracker.lock); + if (rtracker.pos >= tracker_buffer_size) { + mutex_unlock(&rtracker.lock); + pr_err("tracker buffer is not enough, please enlarge it!\n"); + return; + } + + /* fill each trace item */ + written = snprintf(rtracker.data + rtracker.pos, + tracker_buffer_size - rtracker.pos, "%s,%lu,%lu,%lu\n", + ret_path, file_inode(file)->i_ino, off, len); + if (written >= 0 && written <= tracker_buffer_size - rtracker.pos) { + rtracker.pos += written; + } else { + pr_warn("trace data append failed for path:%s, off:%lu, len:%lu\n", + ret_path, off, len); + } + mutex_unlock(&rtracker.lock); +} + +__diag_pop(); + +BTF_SET8_START(bpf_rio_tracker_ids) +BTF_ID_FLAGS(func, bpf_get_dpath_mark) +BTF_SET8_END(bpf_rio_tracker_ids) + +static const struct btf_kfunc_id_set kfuncs = { + .owner = THIS_MODULE, + .set = &bpf_rio_tracker_ids, +}; + +static __init int rio_tracker_init(void) +{ + struct file *filp = filp_open(tracker_output, O_RDWR | O_CREAT, 0644); + int ret; + + if (IS_ERR(filp)) { + pr_err("rio tracker parameter error, %s is invalid, err:%ld\n", + tracker_output, PTR_ERR(filp)); + return -EINVAL; + } + filp_close(filp, NULL); + + rtracker.enable = false; + rtracker.object = kobject_create_and_add("rio_tracker", kernel_kobj); + ret = sysfs_create_group(rtracker.object, &tracker_attr_group); + if (ret < 0) { + pr_err("rio tracker init failed, sysfs kobject create failed\n"); + 
kobject_put(rtracker.object); + return ret; + } + + rtracker.data = vmalloc(tracker_buffer_size + 1); + if (!rtracker.data) { + ret = -ENOMEM; + goto cleanup; + } + mutex_init(&rtracker.lock); + rtracker.pos = 0; + + /* register self-defined bpf helper */ + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &kfuncs); + if (ret) { + pr_err("register btf kfunc error with retcode:%d\n", ret); + goto cleanup; + } + + pr_info("rio tracker init success!\n"); + return 0; + +cleanup: + if (rtracker.data) + vfree(rtracker.data); + kobject_put(rtracker.object); + return ret; +} + +static __exit void rio_tracker_exit(void) +{ + _dump_trace(); + kfree(rtracker.host_ns); + vfree(rtracker.data); + kobject_put(rtracker.object); + pr_info("rio tracker exit\n"); +} + +module_init(rio_tracker_init); +module_exit(rio_tracker_exit); + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("Runtime io Tracker module!"); diff --git a/tools/trio/scripts/trace_parser.py b/tools/trio/scripts/trace_parser.py new file mode 100644 index 000000000000..cb83d749d4cc --- /dev/null +++ b/tools/trio/scripts/trace_parser.py @@ -0,0 +1,277 @@ +#! /usr/env python +# SPDX-License-Identifier: GPL-2.0 +import os +import sys +import copy +import json +import argparse +import shutil +import hashlib + + +PATH_IO_DELIMIT = "|" +PATH_PATH_DELIMIT = "@" +IO_IO_DELIMIT = "+" +VAR_DELIMIT = "," + + +class TraceParser(object): + PAGE_SIZE = 4096 + + def __init__(self, trace_file, output_dir, rootfs): + self.output = os.path.join(output_dir, "trace.json") + self.meta = os.path.join(output_dir, "trace.meta") + self.data = os.path.join(output_dir, "trace.data") + self.src_file = trace_file + self.parent = rootfs + self.trace_map = {} + self.fpath_map = {} + self.total_bytes = 0 + + @staticmethod + def merge_partition(entry, items): + target_begin = entry[0] + target_end = entry[1] + prev_begin = 0 + prev_end = 0 + + # copy one for operating + items_bak = copy.deepcopy(items) + pidx = -1 + # find the insert point + for item in items: + curr_begin = item[0] + curr_end = item[1] + if curr_begin > target_begin: + break + pidx += 1 + prev_begin = curr_begin + prev_end = curr_end + + # merge prev node + if pidx != -1: + if prev_end >= target_begin: + target_begin = prev_begin + if target_end < prev_end: + target_end = prev_end + # remove extra overlay item + del items_bak[pidx] + # put placeholder + items_bak.insert(pidx, (0, 0)) + + # merge next node + idx = pidx + 1 + while idx < len(items): + next_begin = items[idx][0] + next_end = items[idx][1] + if target_end < next_begin: + break + if target_end < next_end: + target_end = next_end + del items_bak[idx] + # put placeholder + items_bak.insert(idx, (0, 0)) + idx += 1 + + # release old list + del items + items_bak.insert(pidx + 1, (target_begin, target_end)) + for item in items_bak: + # remove placeholder + if item[0] == 0 and item[1] == 0: + items_bak.remove(item) + return items_bak + + def in_blacklist(self, path): + # not tracker the memory file system + if (path.startswith("/tmp/") or path.startswith("/proc/") + or path.startswith("/sys/")): + return True + return False + + def parse_trace(self): + with open(self.src_file, 'r') as f: + for line in f.readlines(): + buf_list = line.split(',') + filepath = buf_list[0].strip() + ino = int(buf_list[1].strip()) + off = int(buf_list[2].strip()) + len = int(buf_list[3].strip()) + real_path = filepath if not self.parent else "%s/%s" % (self.parent, filepath) + if not os.path.exists(real_path) or not os.path.isfile(real_path): + continue + + if 
self.in_blacklist(filepath): + print("Path %s in blacklist should be skip!" % filepath) + continue + + # verify file io + size = int(os.path.getsize(real_path)) + end = len + off + if off >= size: + continue + if end > size: + end = size + + if filepath not in self.trace_map: + self.trace_map[filepath] = [(off, end)] + self.fpath_map[filepath] = ino + else: + items = self.trace_map[filepath] + new_items = TraceParser.merge_partition((off, end), items) + self.trace_map[filepath] = new_items + + for value in self.trace_map.values(): + for item in value: + self.total_bytes += (item[1] - item[0]) + + def trans_data(self): + try: + with open(self.data, 'rb') as f: + data = f.read() + file_hash = hashlib.sha256(data).hexdigest() + file_hash = os.path.join(os.path.dirname(self.data), file_hash) + shutil.copyfile(self.data, file_hash) + print("trace data:%s" % file_hash) + except Exception as e: + raise Exception("trans data exception:%s" % str(e)) + + def trans_meta(self): + try: + all = "" + f = open(self.meta) + data = json.load(f) + entries = data["entries"] + for entry in entries: + #name = entry["name"] + ios = entry["io"] + ino = entry["ino"] + ios_str = "" + for io in ios: + target_off = io[0] + target_len = io[1] + source_off = io[2] + if ios_str == "": + ios_str = "%d%s%d%s%d" % (target_off, VAR_DELIMIT, target_len, VAR_DELIMIT, source_off) + continue + ios_str = "%s%s%d%s%d%s%d" % (ios_str, IO_IO_DELIMIT, target_off, VAR_DELIMIT, target_len, VAR_DELIMIT, source_off) + if all == "": + all = "%d%s%s" % (ino, PATH_IO_DELIMIT, ios_str) + continue + all = "%s%s%s%s%s" % (all, PATH_PATH_DELIMIT, ino, PATH_IO_DELIMIT, ios_str) + + # save file + hashobj = hashlib.sha256() + hashobj.update(all.encode()) + sha256 = hashobj.hexdigest() + sha256 = os.path.join(os.path.dirname(self.meta), sha256) + with open(sha256, 'w') as f: + f.write(all) + print("trace meta:%s" % sha256) + except Exception as e: + raise Exception("trans meta exception:%s" % str(e)) + + @staticmethod + def dump_map(map, path): + jsObj = json.dumps(map) + fd = open(path, 'w') + fd.write(jsObj) + fd.close() + + def dump_trace(self): + TraceParser.dump_map(self.trace_map, self.output) + + def generate_data(self): + def read_data(path, off, len): + with open(path, "rb") as fd: + fd.seek(off, 0) + text = fd.read(len) + return text + + def read_zero_data(len): + tmp_file = "/tmp/zero.bin" + if not os.path.exists(tmp_file): + zero_data = b'\x00' * TraceParser.PAGE_SIZE + with open(tmp_file, 'wb') as f: + f.write(zero_data) + with open(tmp_file, 'rb') as fd: + fd.seek(0, 0) + text = fd.read(len) + return text + + trace_meta = { + "version": 1, + "entries": [] + } + foff = 0 + with open(self.data, "wb") as file: + for key, value in self.trace_map.items(): + path = key + real_path = path if not self.parent else "%s/%s" % (self.parent, path) + if not os.path.exists(real_path): + continue + entry = { + "name": path, + "ino": self.fpath_map[path], + "io": [] + } + for item in value: + off = item[0] + len = (item[1] - item[0]) + data = read_data(real_path, off, len) + file.write(data) + entry["io"].append((off, len, foff)) + # padding with zero + pad_len = TraceParser.PAGE_SIZE - (len % TraceParser.PAGE_SIZE) + if pad_len != TraceParser.PAGE_SIZE: + pad_data = read_zero_data(pad_len) + file.write(pad_data) + len += pad_len + foff += len + trace_meta["entries"].append(entry) + TraceParser.dump_map(trace_meta, self.meta) + + +def main(argv): + parser = argparse.ArgumentParser('container trace parser') + parser.add_argument('--trace_file', + 
required=True, + type=str, + help='trace source') + parser.add_argument('--output_dir', + required=True, + type=str, + help='output directory') + parser.add_argument('--rootfs', + required=True, + type=str, + help='container rootfs') + try: + args = parser.parse_args() + trace_file = args.trace_file + output_dir = args.output_dir + rootfs = args.rootfs + if not os.path.exists(trace_file) or not os.path.exists(output_dir) \ + or not os.path.exists(rootfs): + print("Please input the valid path") + return -1 + parser = TraceParser(trace_file, output_dir, rootfs) + parser.parse_trace() + parser.dump_trace() # metadata to json + parser.generate_data() + parser.trans_meta() + parser.trans_data() + + return 0 + except Exception as e: + print("page cache build exception:%s" % str(e)) + return -1 + + +if __name__ == '__main__': + try: + ret = main(sys.argv[1:]) + except Exception as main_e: + print(str(main_e)) + ret = -1 + sys.exit(ret) -- 2.34.1
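As a reference for the tools above, every record that rio_tracker appends (and that trace_parser.py later splits on commas) has the shape "<path>,<inode>,<offset>,<length>", with offset and length already rounded to page granularity by the eBPF probes. The paths and numbers in this sample are invented for illustration:

```shell
$ head -n 3 /var/log/trace.txt
/usr/lib64/libpython3.9.so.1.0,1842,0,16384
/usr/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so,2101,4096,135168
/etc/ld.so.cache,353,0,8192
```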

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/release-management/issues/IBK2MJ -------------------------------- By providing the README.md file, it can guide users on how to use TrIO. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> --- tools/trio/README.md | 507 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 507 insertions(+) create mode 100644 tools/trio/README.md diff --git a/tools/trio/README.md b/tools/trio/README.md new file mode 100644 index 000000000000..246932414431 --- /dev/null +++ b/tools/trio/README.md @@ -0,0 +1,507 @@ +### About TrIO + +On-demand loading of container images significantly improves container startup performance. However, compared to traditional full-image loading solutions, this method triggers many discrete network I/O operations during container runtime, which can introduce considerable overhead during the container's running phase. TrIO can be used to accelerate this scenario. + +TrIO can be used to boost container startup based on Nydus, which is a typical container image on-demand loading solutions in container cases. It first tracks the I/O requests for reading image files during container runtime; the data corresponding to these I/O requests is the data needed for container running. Then, it orchestrates these I/O requests into the container's images and pushes them to the image repository. When a container is launched, TrIO can first pull these orchestrated I/O requests (referred to as I/O traces) to the local container node in the form of large I/O operations and use this data to reconstruct the rootfs required for container runtime. This rootfs will be used during container startup. + +The core idea of TrIO is to aggregate I/O operations. It orchestrates the I/O operations actually used during container runtime into a single large I/O operation to pull all the required data to the container node in a single I/O, thereby reducing the network overhead associated with image fetching. + +### Best Practice + +The functionality of loading trace into the rootfs has already been implemented in the kernel, but the creation of trace requires coordination with user-space programs. In this case, we leverage the capabilities of eBPF to orchestrate the trace. Here are the files we will use in the following steps: + +```shell +$ tree {KERNEL_TREE}/tools/trio/ +├── bpf +│ ├── iotracker +│ │ ├── iotracker.bpf.c # ebpf probe +│ │ ├── iotracker.c # ebpf loading program +│ │ └── Makefile +│ └── rio_tracker_mod +│ ├── Makefile +│ └── rio_tracker.c # provide the kfunc for probe +└── scripts + └── trace_parser.py # parse the raw trace +``` + +#### **Prerequisites** + +- **kernel config** + +If you want to enable TrIO, you should first compile the kernel with `CONFIG_EROFS_TRIO` enabled. + +- **apply patch on nydus-snapshotter** + +We assume that you have already set up the environment for on-demand container image loading and that you can successfully run the on-demand loading process for containers. To use TrIO, you can make some simple modifications to the snapshotter module of containerd. Here, we use `nydus-snapshotter-0.13.10` as an example for a brief explanation. Our goal is to make the functionality work, so we will make some simple adaptations within the snapshotter to fetch the I/O traces and load them into the kernel. 
The adaptation process can be as follow: + +```shell +$ mkdir nydus-snapshotter-0.13.10/pkg/utils/trace +$ vim helper.go +``` + +In `helper.go` the content is: + +```go +package trace + +import ( + "fmt" + "io" + "net/http" + "os" + "time" + "bufio" + "strings" +) + +const ( + BaseUrl = "${trace_repo_url}" // trace repo server, you can download traces from this url. Such as http://10.67.175.82:8080 + LocalTraceDir = "${trace_repo_dir}" // the directory where you download, Such as /home/trace-repo +) + +func downloadCost(start time.Time, path string) { + tc := time.Since(start) + fmt.Printf("Downloading trace:%s cost = %v\n", path, tc) +} + +func GetTraceHintFile() (string, string) { + trace_hint := "/var/log/trace_hint" + if _, err := os.Stat(trace_hint); os.IsNotExist(err) { + return "", "" + } + file, err := os.Open(trace_hint) + if err != nil { + return "", "" + } + defer file.Close() + + reader := bufio.NewReader(file) + content, err := reader.ReadString('\n') + if err != nil { + return "", "" + } + content = strings.TrimSuffix(content, "\n") + if len(content) == 0 { + return "", "" + } + strArray := strings.Split(content, ",") + return strArray[0], strArray[1] +} + +func GetTraceHintPath() (string, string) { + meta, data := GetTraceHintFile() + if len(meta) == 0 || len(data) == 0 { + return "", "" + } + real_meta := fmt.Sprintf("%s/%s", LocalTraceDir, meta) + real_data := fmt.Sprintf("%s/%s", LocalTraceDir, data) + return real_meta, real_data +} + +func FetchTraceFile(filename string) string { + localPath := fmt.Sprintf("%s/%s", LocalTraceDir, filename) + finfo, err := os.Stat(localPath) + if !os.IsNotExist(err) && finfo.Size() > 0 { + return localPath + } + url := fmt.Sprintf("%s/%s", BaseUrl, filename) + defer downloadCost(time.Now(), url) + + resp, err := http.Get(url) + if err != nil { + panic(err) + } + defer resp.Body.Close() + + /* create local file */ + file, err := os.Create(localPath) + if err != nil { + panic(err) + } + defer file.Close() + + /* copy http file to local */ + _, err = io.Copy(file, resp.Body) + if err != nil { + panic(err) + } + return localPath +} +``` + +> $trace_repo_url and $trace_repo_dir should be set. + +In order the fetch I/O traces, we should add pulling logical in `snapshot/snapshot.go` . 
Here are the changes on `snapshot/snapshot.go` and `snapshot/process.go`: + +```go +diff -Nuar -x bin ../nydus-snapshotter-0.13.10/pkg/daemon/daemon.go ./pkg/daemon/daemon.go +--- ../nydus-snapshotter-0.13.10/pkg/daemon/daemon.go 2024-03-19 09:30:29.000000000 +0800 ++++ ./pkg/daemon/daemon.go 2025-02-21 19:36:29.208911094 +0800 +@@ -20,6 +20,7 @@ + + "github.com/containerd/containerd/log" + ++ "github.com/containerd/nydus-snapshotter/pkg/utils/trace" + "github.com/containerd/nydus-snapshotter/config" + "github.com/containerd/nydus-snapshotter/config/daemonconfig" + "github.com/containerd/nydus-snapshotter/pkg/daemon/types" +@@ -306,7 +307,9 @@ + ra.AddAnnotation(rafs.AnnoFsCacheDomainID, cfg.DomainID) + ra.AddAnnotation(rafs.AnnoFsCacheID, fscacheID) + +- if err := erofs.Mount(cfg.DomainID, fscacheID, mountPoint); err != nil { ++ ++ meta, data := trace.GetTraceHintPath() ++ if err := erofs.Mount(cfg.DomainID, fscacheID, meta, data, mountPoint); err != nil { + if !errdefs.IsErofsMounted(err) { + return errors.Wrapf(err, "mount erofs to %s", mountPoint) + } +diff -Nuar -x bin ../nydus-snapshotter-0.13.10/pkg/utils/erofs/erofs.go ./pkg/utils/erofs/erofs.go +--- ../nydus-snapshotter-0.13.10/pkg/utils/erofs/erofs.go 2024-03-19 09:30:29.000000000 +0800 ++++ ./pkg/utils/erofs/erofs.go 2025-02-21 19:38:06.655911094 +0800 +@@ -15,16 +15,21 @@ + "golang.org/x/sys/unix" + ) + +-func Mount(domainID, fscacheID, mountpoint string) error { ++func Mount(domainID, fscacheID, meta, data, mountpoint string) error { + mount := unix.Mount + var opts string + + // Nydusd must have domain_id specified and it is set to fsid if it is + // never specified. ++ if meta != "" && data != "" { ++ opts = fmt.Sprintf("trio_meta=%s,trio_data=%s,", meta, data) ++ } else { ++ opts = "" ++ } + if domainID != "" && domainID != fscacheID { +- opts = fmt.Sprintf("domain_id=%s,fsid=%s", domainID, fscacheID) ++ opts = fmt.Sprintf("%sdomain_id=%s,fsid=%s", opts, domainID, fscacheID) + } else { +- opts = "fsid=" + fscacheID ++ opts = fmt.Sprintf("%sfsid=%s", opts, fscacheID) + } + log.L.Infof("Mount erofs to %s with options %s", mountpoint, opts) + +diff -Nuar -x bin ../nydus-snapshotter-0.13.10/pkg/utils/trace/helper.go ./pkg/utils/trace/helper.go +--- ../nydus-snapshotter-0.13.10/pkg/utils/trace/helper.go 1970-01-01 08:00:00.000000000 +0800 ++++ ./pkg/utils/trace/helper.go 2025-02-21 19:39:53.639911094 +0800 +@@ -0,0 +1,85 @@ ++package trace ++ ++import ( ++ "fmt" ++ "io" ++ "net/http" ++ "os" ++ "time" ++ "bufio" ++ "strings" ++) ++ ++const ( ++ BaseUrl = "http://10.67.175.82:8080" // trace repo server, you can download traces from this url. 
++ LocalTraceDir = "/home/l00574196/containers-env/trace-repo" // the directory where you download ++) ++ ++func downloadCost(start time.Time, path string) { ++ tc := time.Since(start) ++ fmt.Printf("Downloading trace:%s cost = %v\n", path, tc) ++} ++ ++func GetTraceHintFile() (string, string) { ++ trace_hint := "/var/log/trace_hint" ++ if _, err := os.Stat(trace_hint); os.IsNotExist(err) { ++ return "", "" ++ } ++ file, err := os.Open(trace_hint) ++ if err != nil { ++ return "", "" ++ } ++ defer file.Close() ++ ++ reader := bufio.NewReader(file) ++ content, err := reader.ReadString('\n') ++ if err != nil { ++ return "", "" ++ } ++ content = strings.TrimSuffix(content, "\n") ++ if len(content) == 0 { ++ return "", "" ++ } ++ strArray := strings.Split(content, ",") ++ return strArray[0], strArray[1] ++} ++ ++func GetTraceHintPath() (string, string) { ++ meta, data := GetTraceHintFile() ++ if len(meta) == 0 || len(data) == 0 { ++ return "", "" ++ } ++ real_meta := fmt.Sprintf("%s/%s", LocalTraceDir, meta) ++ real_data := fmt.Sprintf("%s/%s", LocalTraceDir, data) ++ return real_meta, real_data ++} ++ ++func FetchTraceFile(filename string) string { ++ localPath := fmt.Sprintf("%s/%s", LocalTraceDir, filename) ++ finfo, err := os.Stat(localPath) ++ if !os.IsNotExist(err) && finfo.Size() > 0 { ++ return localPath ++ } ++ url := fmt.Sprintf("%s/%s", BaseUrl, filename) ++ defer downloadCost(time.Now(), url) ++ ++ resp, err := http.Get(url) ++ if err != nil { ++ panic(err) ++ } ++ defer resp.Body.Close() ++ ++ /* create local file */ ++ file, err := os.Create(localPath) ++ if err != nil { ++ panic(err) ++ } ++ defer file.Close() ++ ++ /* copy http file to local */ ++ _, err = io.Copy(file, resp.Body) ++ if err != nil { ++ panic(err) ++ } ++ return localPath ++} +diff -Nuar -x bin ../nydus-snapshotter-0.13.10/snapshot/process.go ./snapshot/process.go +--- ../nydus-snapshotter-0.13.10/snapshot/process.go 2024-03-19 09:30:29.000000000 +0800 ++++ ./snapshot/process.go 2025-02-21 19:38:46.988911094 +0800 +@@ -41,6 +41,7 @@ + + remoteHandler := func(id string, labels map[string]string) func() (bool, []mount.Mount, error) { + return func() (bool, []mount.Mount, error) { ++ sn.traceSync.Wait() + logger.Debugf("Prepare remote snapshot %s", id) + if err := sn.fs.Mount(ctx, id, labels, &s); err != nil { + return false, nil, err +diff -Nuar -x bin ../nydus-snapshotter-0.13.10/snapshot/snapshot.go ./snapshot/snapshot.go +--- ../nydus-snapshotter-0.13.10/snapshot/snapshot.go 2024-03-19 09:30:29.000000000 +0800 ++++ ./snapshot/snapshot.go 2025-02-21 19:33:39.672911094 +0800 +@@ -13,6 +13,7 @@ + "os" + "path/filepath" + "strings" ++ "sync" + + "github.com/pkg/errors" + +@@ -26,6 +27,7 @@ + "github.com/containerd/nydus-snapshotter/config" + "github.com/containerd/nydus-snapshotter/config/daemonconfig" + ++ "github.com/containerd/nydus-snapshotter/pkg/utils/trace" + "github.com/containerd/nydus-snapshotter/pkg/cache" + "github.com/containerd/nydus-snapshotter/pkg/cgroup" + v2 "github.com/containerd/nydus-snapshotter/pkg/cgroup/v2" +@@ -58,6 +60,7 @@ + enableKataVolume bool + syncRemove bool + cleanupOnClose bool ++ traceSync sync.WaitGroup + } + + func NewSnapshotter(ctx context.Context, cfg *config.SnapshotterConfig) (snapshots.Snapshotter, error) { +@@ -454,6 +457,15 @@ + } + + logger.Debugf("[Prepare] snapshot with labels %v", info.Labels) ++ o.traceSync.Add(1) ++ go func() { ++ defer o.traceSync.Done() ++ meta, data := trace.GetTraceHintFile() ++ if len(meta) != 0 && len(data) != 0 { ++ 
trace.FetchTraceFile(meta) ++ trace.FetchTraceFile(data) ++ } ++ }() + + processor, target, err := chooseProcessor(ctx, logger, o, s, key, parent, info.Labels, func() string { return o.upperPath(s.ID) }) + if err != nil { +``` + +- **Trace Server** + +Here we start an independent server to act as the server-side of the trace repository. In more formal way, the trace can be packed in the container images. For testing, the server only provides file download functionality. + +```go +// trace_server.go +package main + +import ( + "fmt" + "io" + "net/http" + "net/url" + "os" + "strconv" + "time" +) + +func timeCost(start time.Time, path, action string) { + tc := time.Since(start) + fmt.Printf("%v file:%s cost = %v, at:%v\n", action, path, tc, time.Now().Format("2006-01-02 15:04:05.000")) +} + +func download(w http.ResponseWriter, req *http.Request) { + defer timeCost(time.Now(), req.RequestURI, "Downloading") + + filename := req.RequestURI[1:] + enEscapeUrl, err := url.QueryUnescape(filename) + if err != nil { + w.Write([]byte(err.Error())) + return + } + + f, err := os.Open("./" + enEscapeUrl) + if err != nil { + w.Write([]byte(err.Error())) + return + } + + info, err := f.Stat() + if err != nil { + w.Write([]byte(err.Error())) + return + } + + w.Header().Set("Content-Type", "application/octet-stream") + w.Header().Set("Content-Length", strconv.FormatInt(info.Size(), 10)) + + f.Seek(0, 0) + io.Copy(w, f) +} + +func main() { + fmt.Printf("linsten on :8080 \n") + http.HandleFunc("/", download) + http.ListenAndServe(":8080", nil) +} +``` + + + +#### How to work + +- **Track the runtime io for container** + +> Notes: The container node for tracing can be different with running nodes. + +###### Prepare + +```shell +$ cd $KERNEL_SRC/tools/trio/bpf/iotracker && make -j32; cd $KERNEL_SRC/tools/trio/bpf/rio_tracker_mod && make -j32 +$ cd $KERNEL_SRC/tools/trio/bpf +$ insmod rio_tracker_mod/rio_tracker.ko tracker_output="/var/log/trace.txt" +$ iotracker/.output/iotracker +``` + +Now you have prepared the tracker environment. Then you should open a new terminal to prepare the following steps: + +###### Tracker + +```shell +$ sync; echo 1 > /proc/sys/vm/drop_caches; sleep 5 +$ echo 4096 > /sys/kernel/debug/fault_around_bytes # The original value should be keep, default 65536. +$ echo 1 > /sys/kernel/rio_tracker/reset +$ echo 1 > /sys/kernel/rio_tracker/enable +$ echo -n TRACE_HOST_NAME > /sys/kernel/rio_tracker/host_ns # It will tracker the io in ${host_name} uts namespace. Here uts namespace is TRACE_HOST_NAME, it will be used when launch the container. +``` + +Then you just run your container task. Here we take running pytorch container in `Nydus` as an example: + +- Terminal A: + + ```shell + $ modprobe erofs; modprobe cachefiles; + $ containerd-nydus-grpc --config /etc/nydus/config.toml --nydusd-config /etc/nydus/nydusd-config.fscache.json --fs-driver fscache --log-to-stdout + ``` + +- Terminal B: + + ```shell + $ nerdctl --snapshotter=nydus run -ti --hostname TRACE_HOST_NAME --rm --insecure-registry 10.67.175.82:5001/nydus/pytorch:nydus python -c "import datetime; print(datetime.datetime.now()); import torch;print(torch.cuda.is_available()); print(datetime.datetime.now())" # If you want to run nginx, you may be like: nerdctl --snapshotter=nydus run --name test-nginx --hostname TRACE_HOST_NAME -p 9001:80 -d --insecure-registry 10.67.175.82:5001/nydus/nginx:nydus;./nginx_ok.sh + ``` + +> Notes: You should launch the target container with `--hostname TRACE_HOST_NAME` which you used in the before step. 
+ +In our example, we launch a pytorch task with the output results. If you want trace the container which runs in backend (such as web services), Then after launching the container service, you should prepare the condition that serves the request. For network services (e.g., nginx, httpd), HTTP status OK is suitable for internal probes (this can trigger to load the necessary libraries). Such as:`curl -kv 127.0.0.1:9001`. So in our pytorch exmple, when the task finished, we stop tracing. + +After the target condition is matched (finished), we can execute: + +```shell +$ echo 0 > /sys/kernel/rio_tracker/enable +$ echo 65536 > /sys/kernel/debug/fault_around_bytes # recovery +$ echo 1 > /sys/kernel/rio_tracker/dump +``` + +Then your tracing about the target container task is finished. And the raw trace source is dumped at `/var/log/trace.txt`. + +###### **Arrangement** + +Launch the container again to obtain the rootfs (need run container in backend, such as `-d` flags), it is like:`/run/containerd/io.containerd.runtime.v2.task/default/${container_id}/rootfs` (You can see this by `df -h` command). + +```shell +$ nerdctl --snapshotter=nydus run -d --insecure-registry 10.67.175.82:5001/nydus/pytorch:nydus sleep 999 +$ df -h # Then you can see the live rootfs +``` + +Then you can use the scripts `trace_parser.py` to process the traces. + +```shell +$ cd $KERNEL_SRC/tools/trio/scripts +$ python3 trace_parser.py --trace_file=/var/log/trace.txt --output_dir=/var/log --rootfs=${container rootfs} # such as python3 scripts/trace_parser.py --trace_file=/var/log/trace.txt --output_dir=/var/log --rootfs=/run/containerd/io.containerd.runtime.v2.task/default/c2e42ed2be52c79bc96fca9dccb188fe639de55fec6a6e3d2d2ad6aa2a3f65c1/rootfs +``` + +The output will show the trace data and metadata named as their md5 values. Then you can keep these trace files in the trace repository, here we can just use `scp` to transfer the traces to trace repository ( such as `scp /var/log/9af53bf836cc89591eb3f1df9a5302c5965c0049bb2622710a04519c06bd25c5 /var/log/fc8401b2850cc16d5850bf876543b85fd2b70fc5d112a846ca70e75a738cdbab root@10.67.175.82:/home/trace_hub`). + +> Notes: In fact, I/O traces can be arraged into container images by modifying the container management tools. To achieve this, you need to modify the `nerdctl` and `containerd` packages. + +- **Launch container by TrIO** + +> Normally, the running node is different with the tracing node. + +On the images or traces repository nodes (here is `10.67.175.82`), we can launch the traces server: + +```shell +$ cd /home/trace_hub +$ go run trace_server.go +``` + +> Notes: the trace should be put into the same directory of `server.go` + +Then on container node, we can start container. First, you should run container in on-demand loading mode. 
Take Nydus as example, we start nydus-snapshoter in one termimal, and run : + +- Termimal A + + ```shell + $ modprobe erofs; modprobe cachefiles; + $ containerd-nydus-grpc --config /etc/nydus/config.toml --nydusd-config /etc/nydus/nydusd-config.fscache.json --fs-driver fscache --log-to-stdout + ``` + +- Termimal B + +```shell +$ echo "${trace_meta_name},${trace_data_name}" > /var/log/trace_hint # such as echo "9af53bf836cc89591eb3f1df9a5302c5965c0049bb2622710a04519c06bd25c5,fc8401b2850cc16d5850bf876543b85fd2b70fc5d112a846ca70e75a738cdbab" > /var/log/trace_hint +$ nerdctl --snapshotter=nydus run -ti --rm --insecure-registry 10.67.175.82:5001/nydus/pytorch:nydus python -c "import datetime; print(datetime.datetime.now()); import torch;print(torch.cuda.is_available()); print(datetime.datetime.now())" +``` + +### Results + +We conducted simple experiments on the following containers. By leveraging TrIO, we were able to significantly improve the startup performance of containers under on-demand loading scenarios. + +| | nginx | redis | tomcat | pytorch | tensorflow | +| ---------- | ------- | ------- | -------- | --------- | ---------- | +| Base | 4.85 s | 4.756 s | 10.026 s | 127.986 s | 35.164 s | +| Nydus | 3.136 s | 2.705 s | 6.794 s | 19.136 s | 29.931 s | +| Nydus+TrIO | 2.873 s | 2.263 s | 4.954 s | 6.814 s | 8.496 s | + -- 2.34.1

FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list
has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/15182
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/M...