September 2024 - Kernel - mailweb.openeuler.org

[PATCH OLK-6.6 0/4] Some fixes About cpuset partition
by Chen Ridong 05 Sep '24


Chen Ridong (1):
  cgroup/cpuset: fix panic caused by partcmd_update

Waiman Long (3):
  cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
  cgroup/cpuset: Fix remote root partition creation problem
  cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set

 kernel/cgroup/cpuset.c | 68 +++++++++++++++++++++++++++++++++---------
 1 file changed, 54 insertions(+), 14 deletions(-)

--
2.34.1


[PATCH openEuler-23.09] Add dynamic kprobe-based process-level cgroup memory monitoring tool
by Taoxy2004 05 Sep '24


---
tools/probeCgroup: This patch introduces a new tool called "probeCgroup"
that enables dynamic monitoring of memory usage at the process level
within cgroups. By using kprobes at relevant cgroup functions, this tool
can track memory allocations and deallocations for individual processes
within a cgroup, providing detailed statistics on memory usage.

The key features of the tool include:
1. Dynamic insertion of kprobes at critical points in the cgroup subsystem.
2. Tracking memory allocation and deallocation events for each process by
   recording page addresses in a hash table.
3. Providing real-time statistics on memory usage at the process level.
4. Providing statistics on memory usage for processes that are OOM.

Signed-off-by: Taoxy2004 <221870066(a)smail.nju.edu.cn>
---
 tools/probeCgroup/Makefile                    |   7 +
 tools/probeCgroup/README.md                   |  29 +
 tools/probeCgroup/probeCgroup.c               | 612 ++++++++++++++++++
 tools/probeCgroup/probeCgroup.h               | 415 ++++++++++++
 tools/probeCgroup/run.sh                      |   8 +
 tools/probeCgroup/scripts/script1.sh          |  10 +
 tools/probeCgroup/scripts/script2.sh          |  14 +
 tools/probeCgroup/scripts/script3.sh          |  11 +
 .../testcases/1_load_unload_test.py           |  24 +
 .../testcases/2_multiple_process_test.py      |  48 ++
 .../testcases/3_multiple_cgroup_test.py       |  55 ++
 tools/probeCgroup/testcases/4_oom_test.py     |  52 ++
 .../testcases/5_multiple_threads_test.py      |  45 ++
 tools/probeCgroup/testcases/cgroup_utils.py   | 115 ++++
 tools/probeCgroup/testcases/mem-allocate.c    |  35 +
 .../testcases/multiple-thread-mem-allocate.c  |  60 ++
 tools/probeCgroup/testcases/run.py            |  32 +
 .../testcases/simple-mem-allocate.c           |  27 +
 18 files changed, 1599 insertions(+)
 create mode 100644 tools/probeCgroup/Makefile
 create mode 100644 tools/probeCgroup/README.md
 create mode 100644 tools/probeCgroup/probeCgroup.c
 create mode 100644 tools/probeCgroup/probeCgroup.h
 create mode 100755 tools/probeCgroup/run.sh
 create mode 100755 tools/probeCgroup/scripts/script1.sh
 create mode 100755 tools/probeCgroup/scripts/script2.sh
 create mode 100755 tools/probeCgroup/scripts/script3.sh
 create mode 100755 tools/probeCgroup/testcases/1_load_unload_test.py
 create mode 100755 tools/probeCgroup/testcases/2_multiple_process_test.py
 create mode 100755 tools/probeCgroup/testcases/3_multiple_cgroup_test.py
 create mode 100755 tools/probeCgroup/testcases/4_oom_test.py
 create mode 100755 tools/probeCgroup/testcases/5_multiple_threads_test.py
 create mode 100644 tools/probeCgroup/testcases/cgroup_utils.py
 create mode 100644 tools/probeCgroup/testcases/mem-allocate.c
 create mode 100644 tools/probeCgroup/testcases/multiple-thread-mem-allocate.c
 create mode 100755 tools/probeCgroup/testcases/run.py
 create mode 100644 tools/probeCgroup/testcases/simple-mem-allocate.c

diff --git a/tools/probeCgroup/Makefile b/tools/probeCgroup/Makefile
new file mode 100644
index 000000000000..606c951e5487
--- /dev/null
+++ b/tools/probeCgroup/Makefile
@@ -0,0 +1,7 @@
+obj-m := probeCgroup.o
+CROSS_COMPILE = ''
+KDIR := /lib/modules/$(shell uname -r)/build
+all:
+	make -C $(KDIR) M=$(PWD) modules
+clean:
+	rm -f *.ko *.o *.mod *.mod.o *.mod.c .*.cmd *.symvers module*
diff --git a/tools/probeCgroup/README.md b/tools/probeCgroup/README.md
new file mode 100644
index 000000000000..ff0b6fc21228
--- /dev/null
+++ b/tools/probeCgroup/README.md
@@ -0,0 +1,29 @@
+# probeCgroup
+
+#### Description
+probeCgroup is a process-level cgroup memory monitoring tool based on dynamic tracing (kprobe/kretprobe) technology. By inserting kprobes and kretprobes at the entry and exit points of relevant cgroup functions, this tool can track the memory usage of individual processes within each cgroup in real time.
+
+#### Software Architecture
+1. Dynamic Tracing : Insert kprobes and kretprobes at critical points in cgroup functions to capture memory allocation and release events.
+2. Hash Table Recording : Record the addresses of pages currently used by each process in a hash table, so that when a page is released, the process it belongs to can be identified.
+3. Real-Time Statistics : Provide real-time statistics showing the memory usage of individual processes within each cgroup.
+
+#### Instruction
+1. Compile and Load the Module
+    a. In the 'probeCgroup' directory, run the 'make' command to compile the module.
+    b. Load the module: 'insmod probeCgroup.ko'.
+    c. View memory statistics: 'cat /proc/cgroup_memory_usage_per_process'.
+    If an OOM (Out of Memory) event occurs in a cgroup, you can see "oom:" followed by the process that experienced the OOM and its memory usage at the time.
+
+2. Automate OOM Scenario
+    In the 'probeCgroup' directory, run './run.sh'. This script will automatically set up an OOM scenario and output the content of '/proc/cgroup_memory_usage_per_process' after execution.
+
+3. Perform More Tests
+    a. After compiling the module, in the 'testcases' directory, run './run.py'.
+    b. This script will perform various tests, including:
+        - Loading and unloading the module
+        - Each cgroup containing multiple processes
+        - Creating multiple cgroups
+        - OOM scenarios
+        - Multithreading
+    c. The tests will take approximately one minute to complete.
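
Editorial note, not part of the submitted patch: the README above points users at '/proc/cgroup_memory_usage_per_process' but does not describe its layout. The following is a minimal user-space sketch of how that report might be parsed. It assumes the layout implied by the seq_printf() calls in probeCgroup.c (a "cgroup name : <name>" line, a header row, rows of pid, command and memory usage in KB, and an optional "oom:" section). The function name parse_usage and the regular expression are hypothetical helpers introduced only for illustration.

```python
import re
from collections import defaultdict

# Assumed row layout, taken from seq_printf(m, "%10d %20s %20d\n", ...) in the
# patch: pid, command name, memory usage in KB. The command may contain spaces.
ROW = re.compile(r"^\s*(\d+)\s+(.*?)\s+(-?\d+)\s*$")


def parse_usage(text):
    """Parse the report text into {cgroup: {"tasks": [...], "oom": [...]}}.

    Each entry is a (pid, command, usage_kb) tuple. This is an illustrative
    helper, not something shipped with the patch.
    """
    report = defaultdict(lambda: {"tasks": [], "oom": []})
    cgroup, section = None, "tasks"
    for line in text.splitlines():
        if line.startswith("cgroup name :"):
            # A new cgroup block starts; rows belong to its live task list.
            cgroup = line.split(":", 1)[1].strip()
            section = "tasks"
        elif line.strip() == "oom:":
            # Rows after this marker describe OOM victims of the same cgroup.
            section = "oom"
        elif cgroup is not None:
            match = ROW.match(line)
            if match:  # header rows and blank lines do not match and are skipped
                pid, comm, kb = match.groups()
                report[cgroup][section].append((int(pid), comm, int(kb)))
    return dict(report)
```

With the module loaded, the function could be fed the text read from '/proc/cgroup_memory_usage_per_process'. If the report format changes in a later revision of the patch, the regular expression would need to follow it.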
diff --git a/tools/probeCgroup/probeCgroup.c b/tools/probeCgroup/probeCgroup.c new file mode 100644 index 000000000000..9883cb1e082d --- /dev/null +++ b/tools/probeCgroup/probeCgroup.c @@ -0,0 +1,612 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * probeCgroup.c - A tool used to get memory usage for each process in a cgroup + * + * Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + */ + +#include "probeCgroup.h" + +// kretprobe at mem_cgroup_charge +struct charge_data { + struct cgroup *cgrp; + struct mem_cgroup *memcg; + struct task_struct *task; + unsigned long addr; +}; + +static int mem_cgroup_charge_entry_handler(struct kretprobe_instance *ri, + struct pt_regs *regs) +{ + struct charge_data *data; + struct folio *page; + struct mm_struct *mm; + struct mem_cgroup *memcg; + struct cgroup_subsys_state css; + struct cgroup *cgrp; + + if (!current->mm) + return 1; + page = (struct folio *)regs->di; + mm = (struct mm_struct *)regs->si; + if (mm == NULL || page == NULL) + return -1; + memcg = get_mem_cgroup_from_mm(mm); + if (memcg != NULL) { + css = memcg->css; + cgrp = css.cgroup; + + data = (struct charge_data *)ri->data; + data->memcg = memcg; + data->addr = (unsigned long)page; + data->task = current; + data->cgrp = cgrp; + } + return 0; +} + +NOKPROBE_SYMBOL(mem_cgroup_charge_entry_handler); + +static int mem_cgroup_charge_ret_handler(struct kretprobe_instance *ri, + struct pt_regs *regs) +{ + unsigned long retval = regs_return_value(regs); + struct charge_data *data = (struct charge_data *)ri->data; + int id; + struct cgroup_info *cgrp_info; + struct task_info *tsk_info; + + if (data->memcg != NULL && retval == 0) { + id = ((data->memcg)->css).id; + + spin_lock(&lock); + cgrp_info = find_cgroup_info(id); + if (cgrp_info == NULL) { + cgrp_info = create_cgroup_info(data->cgrp, data->memcg); + if (cgrp_info == NULL) { + spin_unlock(&lock); + return -1; + } + add_cgroup_info(cgrp_info); + } + spin_unlock(&lock); + + read_lock(&cgrp_info->cgrp_lock); + 
tsk_info = find_task_info(cgrp_info, data->task->tgid); + read_unlock(&cgrp_info->cgrp_lock); + + // for some cases, task->comm changes over time + if (tsk_info != NULL + && strcmp(data->task->comm, tsk_info->comm) != 0) { + strscpy(tsk_info->comm, data->task->comm, + sizeof(tsk_info->comm)); + } + + if (tsk_info == NULL) { + tsk_info = create_task_info(data->task); + if (tsk_info == NULL) + return -1; + add_task_to_cgroup_info(cgrp_info, tsk_info); + } + + if (HashMap_insert(tsk_info->pages, data->addr)) { + //update counter + spin_lock(&(tsk_info->cnt_lock)); + tsk_info->count += + folio_nr_pages((struct folio *)data->addr); + spin_unlock(&(tsk_info->cnt_lock)); + } + } + + return 0; + +} + +NOKPROBE_SYMBOL(mem_cgroup_charge_ret_handler); + +static struct kretprobe mem_cgroup_charge_kretprobe = { + .handler = mem_cgroup_charge_ret_handler, + .entry_handler = mem_cgroup_charge_entry_handler, + .data_size = sizeof(struct charge_data), + .maxactive = 20, +}; + +static int mem_cgroup_charge_kretprobe_init(void) +{ + int ret; + + mem_cgroup_charge_kretprobe.kp.symbol_name = "__mem_cgroup_charge"; + ret = register_kretprobe(&mem_cgroup_charge_kretprobe); + if (ret < 0) { + pr_err("register_kretprobe failed, returned %d\n", ret); + return ret; + } + pr_info("Planted return probe at %s: %p\n", + mem_cgroup_charge_kretprobe.kp.symbol_name, + mem_cgroup_charge_kretprobe.kp.addr); + return 0; +} + +static void mem_cgroup_charge_kretprobe_exit(void) +{ + unregister_kretprobe(&mem_cgroup_charge_kretprobe); + pr_info("kretprobe at %p unregistered\n", + mem_cgroup_charge_kretprobe.kp.addr); + + /* nmissed > 0 suggests that maxactive was set too low. 
*/ + pr_info("Missed probing %d instances of %s\n", + mem_cgroup_charge_kretprobe.nmissed, + mem_cgroup_charge_kretprobe.kp.symbol_name); +} + +// kretprobe at uncharge_folio + +struct uncharge_data { + struct cgroup *cgrp; + struct mem_cgroup *memcg; + unsigned long addr; + bool isKmem; + int nr_pages; +}; + +static int uncharge_folio_entry_handler(struct kretprobe_instance *ri, + struct pt_regs *regs) +{ + struct uncharge_data *data; + struct folio *page; + struct mem_cgroup *memcg = NULL; + struct cgroup_subsys_state css; + struct cgroup *cgrp; + struct obj_cgroup *objcg; + int nr_pages = 0; + + data = (struct uncharge_data *)ri->data; + page = (struct folio *)regs->di; + if (page == NULL) { + data->memcg = NULL; + return -1; + } + if (page->memcg_data & MEMCG_DATA_KMEM) { // if the page belongs to kmem + if (!folio_test_large(page)) + nr_pages = 1; + else + nr_pages = page->_folio_nr_pages; + // nr_pages = thp_nr_pages(page); + objcg = __folio_objcg(page); + if (objcg != NULL) + memcg = objcg->memcg; + data->isKmem = true; + data->nr_pages = nr_pages; + } else { + memcg = __folio_memcg(page); + data->isKmem = false; + } + + if (memcg != NULL) { + css = memcg->css; + cgrp = css.cgroup; + + data->memcg = memcg; + data->addr = (unsigned long)page; + data->cgrp = cgrp; + } + return 0; +} + +NOKPROBE_SYMBOL(uncharge_folio_entry_handler); + +static int uncharge_folio_ret_handler(struct kretprobe_instance *ri, + struct pt_regs *regs) +{ + struct uncharge_data *data = (struct uncharge_data *)ri->data; + int id; + struct cgroup_info *cgrp_info; + int ret = -1; + + if (data->memcg != NULL) { + id = ((data->memcg)->css).id; + cgrp_info = find_cgroup_info(id); + if (cgrp_info == NULL) + return -1; + if (data->isKmem) + ret = -1; + else + ret = remove_page_from_cgroup_info(data->addr, cgrp_info); + } + + return ret; +} + +NOKPROBE_SYMBOL(uncharge_folio_ret_handler); + +static struct kretprobe uncharge_folio_kretprobe = { + .handler = uncharge_folio_ret_handler, + 
.entry_handler = uncharge_folio_entry_handler, + .data_size = sizeof(struct uncharge_data), + .maxactive = 20, +}; + +static int uncharge_folio_kretprobe_init(void) +{ + int ret; + + uncharge_folio_kretprobe.kp.symbol_name = "uncharge_folio"; + ret = register_kretprobe(&uncharge_folio_kretprobe); + if (ret < 0) { + pr_err("register_kretprobe failed, returned %d\n", ret); + return ret; + } + pr_info("Planted return probe at %s: %p\n", + uncharge_folio_kretprobe.kp.symbol_name, + uncharge_folio_kretprobe.kp.addr); + return 0; +} + +static void uncharge_folio_kretprobe_exit(void) +{ + unregister_kretprobe(&uncharge_folio_kretprobe); + pr_info("kretprobe at %p unregistered\n", + uncharge_folio_kretprobe.kp.addr); + + /* nmissed > 0 suggests that maxactive was set too low. */ + pr_info("Missed probing %d instances of %s\n", + uncharge_folio_kretprobe.nmissed, + uncharge_folio_kretprobe.kp.symbol_name); +} + +//kprobe at do_exit +static struct kprobe do_exit_kprobe; +static int do_exit_kprobe_pre_handler(struct kprobe *p, struct pt_regs *regs) +{ + struct task_struct *cur = current; + int tgid = cur->tgid; + struct mm_struct *mm = cur->mm; + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct cgroup_subsys_state css; + struct cgroup *cgrp; + struct cgroup_info *cgrp_info; + struct task_info *tsk_info; + int id; + + if (memcg != NULL) { + css = memcg->css; + cgrp = css.cgroup; + id = (memcg->css).id; + cgrp_info = find_cgroup_info(id); + if (cgrp_info != NULL) { + write_lock(&cgrp_info->cgrp_lock); + tsk_info = find_task_info(cgrp_info, tgid); + if (tsk_info != NULL) { + list_del(&tsk_info->list); + write_unlock(&cgrp_info->cgrp_lock); + remove_task_from_cgroup_info(cgrp_info, + tsk_info); + } else { + write_unlock(&cgrp_info->cgrp_lock); + } + return 0; + } + } + return 0; +} + +static void do_exit_kprobe_post_handler(struct kprobe *p, + struct pt_regs *regs, + unsigned long flags) +{ + +} + +static int do_exit_kprobe_init(void) +{ + 
do_exit_kprobe.pre_handler = do_exit_kprobe_pre_handler; + do_exit_kprobe.post_handler = do_exit_kprobe_post_handler; + do_exit_kprobe.symbol_name = "do_exit"; + if (register_kprobe(&do_exit_kprobe)) { + pr_alert("register_kprobe on do_exit failed!\n"); + return -EINVAL; + } + return 0; +} + +static void do_exit_kprobe_exit(void) +{ + unregister_kprobe(&do_exit_kprobe); +} + +//kprobe at mark_oom_victim +static struct kprobe mark_oom_victim_kprobe; + +static int mark_oom_victim_kprobe_pre_handler(struct kprobe *p, + struct pt_regs *regs) +{ + struct task_struct *victim; + int tgid; + struct mm_struct *mm; + struct mem_cgroup *memcg; + struct cgroup_subsys_state css; + struct cgroup *cgrp; + struct cgroup_info *cgrp_info; + struct task_info *tsk_info; + int id; + struct task_info *oom_info; + + victim = (struct task_struct *)regs->di; + tgid = victim->tgid; + mm = victim->mm; + memcg = get_mem_cgroup_from_mm(mm); + if (memcg != NULL) { + css = memcg->css; + cgrp = css.cgroup; + id = (memcg->css).id; + cgrp_info = find_cgroup_info(id); + if (cgrp_info != NULL) { + read_lock(&cgrp_info->cgrp_lock); + tsk_info = find_task_info(cgrp_info, tgid); + read_unlock(&cgrp_info->cgrp_lock); + if (tsk_info != NULL) { + oom_info = create_oom_task_info(tsk_info); + if (oom_info != NULL) { + add_oom_task_to_cgroup_info(cgrp_info, + oom_info); + } + return 0; + } + } + } + return 0; +} + +static void mark_oom_victim_kprobe_post_handler(struct kprobe *p, + struct pt_regs *regs, + unsigned long flags) +{ + +} + +static int mark_oom_victim_kprobe_init(void) +{ + mark_oom_victim_kprobe.pre_handler = mark_oom_victim_kprobe_pre_handler; + mark_oom_victim_kprobe.post_handler = + mark_oom_victim_kprobe_post_handler; + mark_oom_victim_kprobe.symbol_name = "mark_oom_victim"; + if (register_kprobe(&mark_oom_victim_kprobe)) { + pr_alert("register_kprobe on mark_oom_victim failed!\n"); + return -EINVAL; + } + return 0; +} + +static void mark_oom_victim_kprobe_exit(void) +{ + 
unregister_kprobe(&mark_oom_victim_kprobe); +} + +//kretporbe at cgroup_destroy_locked +struct destroy_data { + struct cgroup *cgrp; +}; + +static int cgroup_destroy_locked_entry_handler(struct kretprobe_instance + *ri, struct pt_regs *regs) +{ + struct destroy_data *data; + + data = (struct destroy_data *)ri->data; + data->cgrp = (struct cgroup *)regs->di; + return 0; +} + +NOKPROBE_SYMBOL(cgroup_destroy_locked_entry_handler); + +static int cgroup_destroy_locked_ret_handler(struct kretprobe_instance *ri, + struct pt_regs *regs) +{ + struct destroy_data *data = (struct destroy_data *)ri->data; + struct cgroup *cgrp = data->cgrp; + struct cgroup_info *cgrp_info = NULL; + unsigned long retval = regs_return_value(regs); + + if (!cgrp) + return -1; + if (retval != 0) + return -1; + list_for_each_entry(cgrp_info, &all_cgroup_info, list) { + if (cgrp_info->cgrp == cgrp) { + spin_lock(&lock); + list_del(&cgrp_info->list); + spin_unlock(&lock); + destroy_cgroup_info(cgrp_info); + return 0; + } + } + return -1; +} + +NOKPROBE_SYMBOL(cgroup_destroy_locked_ret_handler); + +static struct kretprobe cgroup_destroy_locked_kretprobe = { + .handler = cgroup_destroy_locked_ret_handler, + .entry_handler = cgroup_destroy_locked_entry_handler, + .data_size = sizeof(struct destroy_data), + .maxactive = 20, +}; + +static int cgroup_destroy_locked_kretprobe_init(void) +{ + int ret; + + cgroup_destroy_locked_kretprobe.kp.symbol_name = + "cgroup_destroy_locked"; + ret = register_kretprobe(&cgroup_destroy_locked_kretprobe); + if (ret < 0) { + pr_err("register_kretprobe failed, returned %d\n", ret); + return ret; + } + pr_info("Planted return probe at %s: %p\n", + cgroup_destroy_locked_kretprobe.kp.symbol_name, + cgroup_destroy_locked_kretprobe.kp.addr); + return 0; +} + +static void cgroup_destroy_locked_kretprobe_exit(void) +{ + unregister_kretprobe(&cgroup_destroy_locked_kretprobe); + pr_info("kretprobe at %p unregistered\n", + cgroup_destroy_locked_kretprobe.kp.addr); + + /* nmissed > 0 
suggests that maxactive was set too low. */ + pr_info("Missed probing %d instances of %s\n", + cgroup_destroy_locked_kretprobe.nmissed, + cgroup_destroy_locked_kretprobe.kp.symbol_name); +} + +// print the tasks in order of their memory usage +static void print_sorted_tasks_list(struct cgroup_info *cgrp_info, + int type, struct seq_file *m) +{ + struct list_head *cur, *insert_pos; + struct task_info *task, *insert_task; + struct list_head new_list = LIST_HEAD_INIT(new_list); + struct list_head *old_list; + struct task_info *new_task, *next_task; + + if (type == 0) { + if (cgrp_info == NULL) + return; + read_lock(&cgrp_info->cgrp_lock); + old_list = &cgrp_info->tasks_list; + } else { + if (cgrp_info == NULL) + return; + old_list = &cgrp_info->oom_list; + } + + list_for_each_entry_safe(task, insert_task, old_list, list) { + new_task = kmalloc(sizeof(struct task_info), GFP_ATOMIC); + if (!new_task) + return; + new_task->tgid = task->tgid; + strscpy(new_task->comm, task->comm, sizeof(new_task->comm)); + new_task->count = task->count; + new_task->pages = NULL; + INIT_LIST_HEAD(&new_task->list); + + //insertion sort + cur = &new_list; + insert_pos = cur->next; + while (insert_pos != &new_list) { + next_task = + list_entry(insert_pos, struct task_info, list); + if (new_task->count >= next_task->count) + break; + cur = insert_pos; + insert_pos = insert_pos->next; + } + + (&new_task->list)->prev = insert_pos->prev; + (insert_pos->prev)->next = (&new_task->list); + (&new_task->list)->next = insert_pos; + insert_pos->prev = (&new_task->list); + } + if (type == 0) + read_unlock(&cgrp_info->cgrp_lock); + + //print + if (type == 1 && (&new_list) != new_list.next) { + seq_puts(m, "oom:\n"); + seq_printf(m, "%10s %20s %20s\n", "pid", "command", + "memory usage (KB)"); + } + if (type == 0) + seq_printf(m, "%10s %20s %20s\n", "pid", "command", + "memory usage (KB)"); + list_for_each_entry_safe(task, insert_task, &new_list, list) { + seq_printf(m, "%10d %20s %20d\n", task->tgid, 
task->comm, + (task->count) * 4); + } + + list_for_each_entry_safe(task, insert_task, &new_list, list) { + list_del(&task->list); + kfree(task); + } +} + +static struct proc_dir_entry *cgroup_info_read; +#define procfs_file_read "cgroup_memory_usage_per_process" + +void seq_print_tasks(struct cgroup_info *cgroup_info, struct seq_file *m) +{ + if (!cgroup_info) + return; + + print_sorted_tasks_list(cgroup_info, 0, m); +} + +void seq_print_oom_tasks(struct cgroup_info *cgroup_info, struct seq_file *m) +{ + if (!cgroup_info) + return; + + print_sorted_tasks_list(cgroup_info, 1, m); +} + +void seq_print_cgroups(struct seq_file *m) +{ + struct cgroup_info *cgrp, *pos; + + spin_lock(&lock); + list_for_each_entry_safe(cgrp, pos, &all_cgroup_info, list) { + seq_printf(m, "cgroup name : %s\n", cgrp->name); + seq_print_tasks(cgrp, m); + seq_print_oom_tasks(cgrp, m); + seq_puts(m, "\n"); + } + spin_unlock(&lock); +} + +static int memory_usage_show(struct seq_file *m, void *v) +{ + seq_print_cgroups(m); + return 0; +} + +static int __init global_init(void) +{ + int ret = 0; + + cgroup_info_read = + proc_create_single(procfs_file_read, 0, NULL, memory_usage_show); + if (!cgroup_info_read) + return -ENOMEM; + ret = mem_cgroup_charge_kretprobe_init(); + uncharge_folio_kretprobe_init(); + do_exit_kprobe_init(); + mark_oom_victim_kprobe_init(); + cgroup_destroy_locked_kretprobe_init(); + + return ret; +} + +static void __exit global_exit(void) +{ + struct cgroup_info *cgrp_info, *pos; + + mem_cgroup_charge_kretprobe_exit(); + uncharge_folio_kretprobe_exit(); + do_exit_kprobe_exit(); + mark_oom_victim_kprobe_exit(); + cgroup_destroy_locked_kretprobe_exit(); + + remove_proc_entry(procfs_file_read, NULL); + + //release all memory use + list_for_each_entry_safe(cgrp_info, pos, &all_cgroup_info, list) { + list_del(&cgrp_info->list); + destroy_cgroup_info(cgrp_info); + } +} + +module_init(global_init) +module_exit(global_exit) +MODULE_LICENSE("GPL"); diff --git 
a/tools/probeCgroup/probeCgroup.h b/tools/probeCgroup/probeCgroup.h new file mode 100644 index 000000000000..953a6e0aca31 --- /dev/null +++ b/tools/probeCgroup/probeCgroup.h @@ -0,0 +1,415 @@ +/* SPDX-License-Identifier: GPL-2.0*/ +/* + * probeCgroup.h + * + * Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/kprobes.h> +#include <linux/ktime.h> +#include <linux/limits.h> +#include <linux/sched.h> +#include <linux/mm_types.h> +#include <linux/memcontrol.h> +#include <linux/cgroup-defs.h> +#include <linux/kernfs.h> +#include <linux/string.h> +#include <linux/list.h> +#include <linux/oom.h> +#include <linux/fs.h> +#include <linux/proc_fs.h> +#include <linux/huge_mm.h> +#include <linux/page-flags.h> +#include <linux/spinlock.h> +#include <linux/rwlock.h> + +static spinlock_t lock; // global lock for the list of cgroup_info + +struct HashNode { + unsigned long addr; + struct HashNode *next; +}; + +struct HashNode *HashNode_create(unsigned long addr) +{ + struct HashNode *node = NULL; + + node = kzalloc(sizeof(struct HashNode), GFP_ATOMIC); + if (node == NULL) + return NULL; + node->addr = addr; + node->next = NULL; + return node; +} + +struct HashBucket { + struct HashNode *head; + spinlock_t bkt_lock; +}; + +void HashBucket_init(struct HashBucket *bkt) +{ + bkt->head = NULL; + spin_lock_init(&bkt->bkt_lock); +} + +bool HashBucket_insert(struct HashBucket *bkt, unsigned long addr) +{ + struct HashNode *new_node; + struct HashNode *node; + struct HashNode *prev; + bool ret = true; + + if (bkt == NULL) + return false; + + prev = NULL; + new_node = NULL; + new_node = HashNode_create(addr); + + spin_lock(&bkt->bkt_lock); + node = bkt->head; + while (node != NULL && node->addr != addr) { + prev = node; + node = node->next; + } + if (node == NULL) { + if (new_node == NULL) { + pr_info("not enough memory for HashNode\n"); + spin_unlock(&bkt->bkt_lock); + return false; + } + if (bkt->head == 
NULL) + bkt->head = new_node; + else + prev->next = new_node; + spin_unlock(&bkt->bkt_lock); + ret = true; + } else { + spin_unlock(&bkt->bkt_lock); + kfree(new_node); + ret = false; + } + + return ret; +} + +bool HashBucket_erase(struct HashBucket *bkt, unsigned long addr) +{ + struct HashNode *node; + struct HashNode *prev; + bool ret = true; + + if (bkt == NULL) + return false; + + spin_lock(&bkt->bkt_lock); + node = bkt->head; + prev = NULL; + while (node != NULL && node->addr != addr) { + prev = node; + node = node->next; + } + if (node == NULL) { + spin_unlock(&bkt->bkt_lock); + ret = false; + } else { + if (bkt->head == node) + bkt->head = node->next; + else + prev->next = node->next; + kfree(node); + spin_unlock(&bkt->bkt_lock); + ret = true; + } + + return ret; +} + +void HashBucket_clear(struct HashBucket *bkt) +{ + struct HashNode *node; + struct HashNode *prev; + + if (bkt == NULL) + return; + + spin_lock(&bkt->bkt_lock); + node = bkt->head; + prev = NULL; + bkt->head = NULL; + while (node != NULL) { + prev = node; + node = node->next; + kfree(prev); + } + spin_unlock(&bkt->bkt_lock); +} + +struct HashMap { + unsigned long size; + struct HashBucket *HashTable; +}; + +unsigned long hash_func(unsigned long addr, unsigned long size) +{ + return addr % size; +} + +struct HashMap *HashMap_create(unsigned long size) +{ + struct HashMap *hm = NULL; + struct HashBucket *ht = NULL; + int i = 0; + + hm = kmalloc(sizeof(struct HashMap), GFP_ATOMIC); + if (hm == NULL) + return NULL; + ht = kmalloc((size * sizeof(struct HashBucket)), GFP_ATOMIC); + if (ht == NULL) { + kfree(hm); + return NULL; + } + for (i = 0; i < size; i++) + HashBucket_init(&(ht[i])); + + hm->size = size; + hm->HashTable = ht; + return hm; +} + +bool HashMap_insert(struct HashMap *hm, unsigned long addr) +{ + unsigned long index; + + if (hm == NULL) + return false; + index = hash_func(addr, hm->size); + if (hm->HashTable == NULL) + return false; + return HashBucket_insert(&(hm->HashTable[index]), 
addr); +} + +bool HashMap_erase(struct HashMap *hm, unsigned long addr) +{ + unsigned long index; + + if (hm == NULL) + return false; + index = hash_func(addr, hm->size); + if (hm->HashTable == NULL) + return false; + return HashBucket_erase(&(hm->HashTable[index]), addr); +} + +void HashMap_clear(struct HashMap *hm) +{ + unsigned long size; + struct HashBucket *ht; + int i; + + if (hm == NULL) + return; + size = hm->size; + ht = hm->HashTable; + if (ht == NULL) + return; + hm->HashTable = NULL; + for (i = 0; i < size; i++) + HashBucket_clear(&(ht[i])); + + kfree(ht); + kfree(hm); +} + +//struct that save the information for each task +struct task_info { + int tgid; + char comm[TASK_COMM_LEN]; + int count; // number of pages + struct HashMap *pages; + struct list_head list; + spinlock_t cnt_lock; +}; + +// struct that save the information for each cgroup +struct cgroup_info { + struct cgroup *cgrp; + struct mem_cgroup *memcg; + int id; + char name[64]; + struct list_head list; + struct list_head tasks_list; + struct list_head oom_list; + rwlock_t cgrp_lock; + unsigned int cached_bytes; +}; + +static LIST_HEAD(all_cgroup_info); // a list that linked all the cgroup_info struct + +static struct task_info *create_task_info(struct task_struct *cur_task) +{ + struct task_info *tsk_info = + kmalloc(sizeof(struct task_info), GFP_ATOMIC); + if (!tsk_info) + return NULL; + + // initialization + tsk_info->tgid = cur_task->tgid; + strscpy(tsk_info->comm, cur_task->comm, sizeof(tsk_info->comm)); + tsk_info->count = 0; + tsk_info->pages = NULL; + tsk_info->pages = HashMap_create(1023); + INIT_LIST_HEAD(&tsk_info->list); + spin_lock_init(&(tsk_info->cnt_lock)); + + return tsk_info; +} + +static int +add_task_to_cgroup_info(struct cgroup_info *cgrp, struct task_info *task) +{ + if (!cgrp || !task) + return -EINVAL; + + write_lock(&cgrp->cgrp_lock); + list_add_tail(&task->list, &cgrp->tasks_list); + write_unlock(&cgrp->cgrp_lock); + return 0; +} + +static int 
+remove_task_from_cgroup_info(struct cgroup_info *cgrp, struct task_info *task) +{ + if (cgrp == NULL || task == NULL) + return -EINVAL; + + HashMap_clear(task->pages); + // kfree(task->pages); + kfree(task); + return 0; +} + +static struct task_info *find_task_info(struct cgroup_info *cgrp, int tgid) +{ + struct task_info *tsk_info, *pos; + + list_for_each_entry_safe(tsk_info, pos, &cgrp->tasks_list, list) { + if (tsk_info->tgid == tgid) + return tsk_info; + } + return NULL; +} + +static int +remove_page_from_cgroup_info(unsigned long addr, struct cgroup_info *cgrp) +{ + struct task_info *tsk_info, *pos; + + read_lock(&cgrp->cgrp_lock); + list_for_each_entry_safe(tsk_info, pos, &cgrp->tasks_list, list) { + if (HashMap_erase(tsk_info->pages, addr)) { + spin_lock(&(tsk_info->cnt_lock)); + tsk_info->count -= folio_nr_pages((struct folio *)addr); + spin_unlock(&(tsk_info->cnt_lock)); + read_unlock(&cgrp->cgrp_lock); + return 0; + } + } + read_unlock(&cgrp->cgrp_lock); + return -1; +} + +static struct cgroup_info *create_cgroup_info(struct cgroup *cgrp, + struct mem_cgroup *memcg) +{ + struct cgroup_info *cgrp_info = + kmalloc(sizeof(struct cgroup_info), GFP_ATOMIC); + struct kernfs_node *kn; + + if (!cgrp_info) + return NULL; + + cgrp_info->cgrp = cgrp; + cgrp_info->memcg = memcg; + cgrp_info->id = (memcg->css).id; + kn = cgrp->kn; + strscpy(cgrp_info->name, kn->name, sizeof(cgrp_info->name)); + INIT_LIST_HEAD(&cgrp_info->list); + INIT_LIST_HEAD(&cgrp_info->tasks_list); + INIT_LIST_HEAD(&cgrp_info->oom_list); + rwlock_init(&(cgrp_info->cgrp_lock)); + cgrp_info->cached_bytes = 0; + + return cgrp_info; +} + +static void destroy_cgroup_info(struct cgroup_info *cgrp_info) +{ + struct task_info *task, *tmp; + + if (!cgrp_info) + return; + + write_lock(&cgrp_info->cgrp_lock); + list_for_each_entry_safe(task, tmp, &cgrp_info->tasks_list, list) { + list_del(&task->list); + remove_task_from_cgroup_info(cgrp_info, task); + } + write_unlock(&cgrp_info->cgrp_lock); + 
list_for_each_entry_safe(task, tmp, &cgrp_info->oom_list, list) { + list_del(&task->list); + remove_task_from_cgroup_info(cgrp_info, task); + } + + kfree(cgrp_info); +} + +static int add_cgroup_info(struct cgroup_info *cgrp_info) +{ + if (!cgrp_info) + return -EINVAL; + + list_add_tail(&cgrp_info->list, &all_cgroup_info); + return 0; +} + +static struct cgroup_info *find_cgroup_info(int id) +{ + struct cgroup_info *cgrp_info = NULL; + + list_for_each_entry(cgrp_info, &all_cgroup_info, list) { + if (cgrp_info->id == id) + return cgrp_info; + } + + return NULL; +} + +static struct task_info *create_oom_task_info(struct task_info *tsk_info) +{ + struct task_info *oom_tsk_info = + kmalloc(sizeof(struct task_info), GFP_ATOMIC); + if (!oom_tsk_info) + return NULL; + + oom_tsk_info->tgid = tsk_info->tgid; + strscpy(oom_tsk_info->comm, tsk_info->comm, sizeof(oom_tsk_info->comm)); + oom_tsk_info->count = tsk_info->count; + oom_tsk_info->pages = NULL; + INIT_LIST_HEAD(&oom_tsk_info->list); + + return oom_tsk_info; +} + +static int +add_oom_task_to_cgroup_info(struct cgroup_info *cgrp, + struct task_info *oom_task) +{ + if (!cgrp || !oom_task) + return -EINVAL; + list_add_tail(&oom_task->list, &cgrp->oom_list); + return 0; +} diff --git a/tools/probeCgroup/run.sh b/tools/probeCgroup/run.sh new file mode 100755 index 000000000000..7e1ffefe66d1 --- /dev/null +++ b/tools/probeCgroup/run.sh @@ -0,0 +1,8 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +cd scripts +./script1.sh +./script2.sh +./script3.sh diff --git a/tools/probeCgroup/scripts/script1.sh b/tools/probeCgroup/scripts/script1.sh new file mode 100755 index 000000000000..539e8258afb9 --- /dev/null +++ b/tools/probeCgroup/scripts/script1.sh @@ -0,0 +1,10 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +cd .. 
+make +insmod probeCgroup.ko + +cd testcases +gcc simple-mem-allocate.c -o simple-mem-allocate diff --git a/tools/probeCgroup/scripts/script2.sh b/tools/probeCgroup/scripts/script2.sh new file mode 100755 index 000000000000..2ad515cfb912 --- /dev/null +++ b/tools/probeCgroup/scripts/script2.sh @@ -0,0 +1,14 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +current_dir=$(pwd) +cd /sys/fs/cgroup/memory +mkdir test +cd test +sh -c "echo $$ >> cgroup.procs" +sh -c "echo 5M > memory.limit_in_bytes" +sh -c "echo 0 > memory.swappiness" +cd "$current_dir" +cd ../testcases +./simple-mem-allocate diff --git a/tools/probeCgroup/scripts/script3.sh b/tools/probeCgroup/scripts/script3.sh new file mode 100755 index 000000000000..127eb45de5c9 --- /dev/null +++ b/tools/probeCgroup/scripts/script3.sh @@ -0,0 +1,11 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +cd /proc +cat cgroup_memory_usage_per_process + +cat /sys/fs/cgroup/memory/test/cgroup.procs > /sys/fs/cgroup/memory/cgroup.procs +rmdir /sys/fs/cgroup/memory/test +# cat cgroup_memory_usage_per_process +rmmod probeCgroup diff --git a/tools/probeCgroup/testcases/1_load_unload_test.py b/tools/probeCgroup/testcases/1_load_unload_test.py new file mode 100755 index 000000000000..5389a14a1dac --- /dev/null +++ b/tools/probeCgroup/testcases/1_load_unload_test.py @@ -0,0 +1,24 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +import os +import subprocess +import time + +def test_module_load_unload(): + try: + subprocess.check_call(['insmod', '../probeCgroup.ko']) + time.sleep(1) + print('loading module successfully!') + subprocess.check_call(['rmmod', 'probeCgroup']) + output = subprocess.check_output(['lsmod']) + assert b'probeCgroup' not in output + print('unloading module successfully!') + except subprocess.CalledProcessError as 
e: + print('Load unload test failed. Insmod failed.') + except AssertionError as e: + print('Load unload test failed. Cannot remove module.') + +if __name__ == '__main__': + test_module_load_unload() \ No newline at end of file diff --git a/tools/probeCgroup/testcases/2_multiple_process_test.py b/tools/probeCgroup/testcases/2_multiple_process_test.py new file mode 100755 index 000000000000..d88c7f7f2952 --- /dev/null +++ b/tools/probeCgroup/testcases/2_multiple_process_test.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +from cgroup_utils import create_cgroup, add_process_to_cgroup, get_process_memory_usage, remove_cgroup, check_memory_usage, cleanup, check_kmem_usage +import os +import subprocess +import time + +def test_multiple_process(num_procs): + subprocess.check_call(['insmod', '../probeCgroup.ko']) + time.sleep(1) + + cgroup_name = 'test' + cgroup_path = create_cgroup(cgroup_name) + + processes = [] + pids = [] + + for i in range(num_procs): + process = subprocess.Popen(['./mem-allocate']) + pid = process.pid + add_process_to_cgroup(cgroup_path, pid) + processes.append(process) + pids.append(pid) + + time.sleep(0.1) + try: + count = 0 + for i in range (2000): + count += check_memory_usage(cgroup_name, pids, False) + time.sleep(0.01) + assert count <= 50, f"Memory read by probeCgroup is not accurate" + + remove_cgroup(cgroup_path, pids) + check_memory_usage(cgroup_name, pids, True) + cleanup(processes) + subprocess.check_call(['rmmod', 'probeCgroup']) + + print('pass multiple process test!') + except AssertionError as e: + print(f"Assertion failed: {e}") + remove_cgroup(cgroup_path, pids) + cleanup(processes) + subprocess.check_call(['rmmod', 'probeCgroup']) + +if __name__ == '__main__': + test_multiple_process(3) \ No newline at end of file diff --git a/tools/probeCgroup/testcases/3_multiple_cgroup_test.py b/tools/probeCgroup/testcases/3_multiple_cgroup_test.py new 
file mode 100755 index 000000000000..592a716df877 --- /dev/null +++ b/tools/probeCgroup/testcases/3_multiple_cgroup_test.py @@ -0,0 +1,55 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +from cgroup_utils import create_cgroup, add_process_to_cgroup, get_process_memory_usage, remove_cgroup, check_memory_usage, cleanup +import os +import subprocess +import time + +def test_multiple_cgroup(num_procs, num_cgroups): + subprocess.check_call(['insmod', '../probeCgroup.ko']) + time.sleep(1) + + cgroups = [] + processes = {} + pids = {} + for i in range(num_cgroups): + cgroup_name = f'test_{i}' + cgroup_path = create_cgroup(cgroup_name) + cgroups.append((cgroup_name, cgroup_path)) + + for j in range(num_procs): + process = subprocess.Popen(['./mem-allocate']) + pid = process.pid + add_process_to_cgroup(cgroup_path, pid) + if cgroup_path not in processes: + processes[cgroup_path] = [] + processes[cgroup_path].append(process) + if cgroup_path not in pids: + pids[cgroup_path] = [] + pids[cgroup_path].append(pid) + + time.sleep(0.1) + try: + for i in range (100): + for cgroup_name, cgroup_path in cgroups: + check_memory_usage(cgroup_name, pids[cgroup_path], False) + time.sleep(0.01) + + for cgroup_name, cgroup_path in cgroups: + remove_cgroup(cgroup_path, pids[cgroup_path]) + check_memory_usage(cgroup_name, pids[cgroup_path], True) + cleanup(processes[cgroup_path]) + subprocess.check_call(['rmmod', 'probeCgroup']) + + print('pass multiple cgroup test!') + except AssertionError as e: + print(f"Assertion failed: {e}") + for cgroup_name, cgroup_path in cgroups: + remove_cgroup(cgroup_path, pids[cgroup_path]) + cleanup(processes[cgroup_path]) + subprocess.check_call(['rmmod', 'probeCgroup']) + +if __name__ == '__main__': + test_multiple_cgroup(2,2) \ No newline at end of file diff --git a/tools/probeCgroup/testcases/4_oom_test.py b/tools/probeCgroup/testcases/4_oom_test.py new file mode 100755 index 
000000000000..128a258c56f5 --- /dev/null +++ b/tools/probeCgroup/testcases/4_oom_test.py @@ -0,0 +1,52 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +from cgroup_utils import create_cgroup, add_process_to_cgroup, get_process_memory_usage, remove_cgroup, check_memory_usage, cleanup, get_oom_process_memory_usage +import os +import subprocess +import time + +def test_oom(num_procs): + subprocess.check_call(['insmod', '../probeCgroup.ko']) + time.sleep(1) + + cgroup_name = 'test' + cgroup_path = create_cgroup(cgroup_name) + + with open(f"/sys/fs/cgroup/memory/{cgroup_name}/memory.limit_in_bytes", 'w') as limit_file: + limit_file.write("5M") + with open(f"/sys/fs/cgroup/memory/{cgroup_name}/memory.swappiness", 'w') as swap_file: + swap_file.write("0") + + processes = [] + pids = [] + for i in range(num_procs): + process = subprocess.Popen(['./simple-mem-allocate']) + pid = process.pid + add_process_to_cgroup(cgroup_path, pid) + processes.append(process) + pids.append(pid) + + time.sleep(6) + + try: + for pid in pids: + memory_usage = get_oom_process_memory_usage(pid, cgroup_name) + assert memory_usage is not None, f"Memory usage(oom) not found for PID {pid}" + assert memory_usage > 0, f"Memory usage should be greater than zero for PID {pid}" + + remove_cgroup(cgroup_path, pids) + check_memory_usage(cgroup_name, pids, True) + cleanup(processes) + subprocess.check_call(['rmmod', 'probeCgroup']) + + print('pass oom test!') + except AssertionError as e: + print(f"Assertion failed: {e}") + remove_cgroup(cgroup_path, pids) + cleanup(processes) + subprocess.check_call(['rmmod', 'probeCgroup']) + +if __name__ == '__main__': + test_oom(1) \ No newline at end of file diff --git a/tools/probeCgroup/testcases/5_multiple_threads_test.py b/tools/probeCgroup/testcases/5_multiple_threads_test.py new file mode 100755 index 000000000000..7e1b86dabe48 --- /dev/null +++ 
b/tools/probeCgroup/testcases/5_multiple_threads_test.py @@ -0,0 +1,45 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +from cgroup_utils import create_cgroup, add_process_to_cgroup, get_process_memory_usage, remove_cgroup, check_memory_usage, cleanup, check_kmem_usage +import os +import subprocess +import time + +def test_multiple_thread(num_procs): + subprocess.check_call(['insmod', '../probeCgroup.ko']) + time.sleep(1) + + cgroup_name = 'test' + cgroup_path = create_cgroup(cgroup_name) + + processes = [] + pids = [] + for i in range(num_procs): + process = subprocess.Popen(['./multiple-thread-mem-allocate']) + pid = process.pid + add_process_to_cgroup(cgroup_path, pid) + processes.append(process) + pids.append(pid) + + time.sleep(1) + try: + for i in range (200): + check_memory_usage(cgroup_name, pids, False) + time.sleep(0.01) + + remove_cgroup(cgroup_path, pids) + check_memory_usage(cgroup_name, pids, True) + cleanup(processes) + subprocess.check_call(['rmmod', 'probeCgroup']) + + print('pass multiple threads test!') + except AssertionError as e: + print(f"Assertion failed: {e}") + remove_cgroup(cgroup_path, pids) + cleanup(processes) + subprocess.check_call(['rmmod', 'probeCgroup']) + +if __name__ == '__main__': + test_multiple_thread(5) \ No newline at end of file diff --git a/tools/probeCgroup/testcases/cgroup_utils.py b/tools/probeCgroup/testcases/cgroup_utils.py new file mode 100644 index 000000000000..f70c68c1f188 --- /dev/null +++ b/tools/probeCgroup/testcases/cgroup_utils.py @@ -0,0 +1,115 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +import os +import subprocess +import time + +def create_cgroup(cgroup_name): + cgroup_path = f'/sys/fs/cgroup/memory/{cgroup_name}' + + try: + os.makedirs(cgroup_path) + except FileExistsError: + pass + + return cgroup_path + +def add_process_to_cgroup(cgroup_path, pid): + with 
open(os.path.join(cgroup_path, 'cgroup.procs'), 'w') as procs_file: + procs_file.write(str(pid)) + +def get_process_memory_usage(pid, cgroup_name): + cur_name = '' + with open('/proc/cgroup_memory_usage_per_process', 'r') as file: + for line in file: + parts = line.strip().split() + if len(parts) >= 4 and parts[0] == 'cgroup': + cur_name = parts[3] + if len(parts) >= 3 and parts[0] != 'cgroup' and cur_name == cgroup_name and parts[0] != 'pid' and int(parts[0]) == pid: + return int(parts[2]) + return None + +def get_process_kmem_usage(pid, cgroup_name): + cur_name = '' + with open('/proc/cgroup_memory_usage_per_process', 'r') as file: + for line in file: + parts = line.strip().split() + if len(parts) >= 4 and parts[0] == 'cgroup': + cur_name = parts[3] + if len(parts) >= 4 and parts[0] != 'cgroup' and cur_name == cgroup_name and parts[0] != 'pid' and int(parts[0]) == pid: + return int(parts[3]) + return None + +def remove_cgroup(cgroup_path, pids): + for pid in pids: + with open('/sys/fs/cgroup/memory/cgroup.procs', 'w') as backup_file: + backup_file.write(str(pid)) + os.rmdir(cgroup_path) + return + +def check_memory_usage(cgroup_name, pids, delete): + memory_sum = 0 + for pid in pids: + memory_usage = get_process_memory_usage(pid, cgroup_name) + if delete == False: + assert memory_usage is not None, f"Memory usage not found for PID {pid}" + assert memory_usage >= 0, f"Memory usage should be greater than zero for PID {pid}" + memory_sum += memory_usage + else: + assert memory_usage is None, f"Error: Memory usage should not be available for PID {pid} after deleting the cgroup." 
+ if delete == False: + with open(f"/sys/fs/cgroup/memory/{cgroup_name}/memory.usage_in_bytes", 'r') as file: + content = file.readline().strip() + memory_read = int(content) + memory_sum *= 1024 + delta = abs(memory_read - memory_sum) + # print(f"read: {memory_read}") + # print(f"sum : {memory_sum}") + if (delta > max(memory_read, memory_sum) * 0.1): + return 1 + else: + return 0 + else: + return 0 + +def check_kmem_usage(cgroup_name, pids, delete): + kmem_sum = 0 + for pid in pids: + kmem_usage = get_process_kmem_usage(pid, cgroup_name) + if delete == False: + assert kmem_usage is not None, f"Kmem usage not found for PID {pid}" + assert kmem_usage >= 0, f"Kmem usage should be greater than zero for PID {pid}" + kmem_sum += kmem_usage + else: + assert kmem_usage is None, f"Error: Kmem usage should not be available for PID {pid} after deleting the cgroup." + if delete == False: + with open(f"/sys/fs/cgroup/memory/{cgroup_name}/memory.kmem.usage_in_bytes", 'r') as file: + content = file.readline().strip() + kmem_read = int(content) + kmem_sum *= 1024 + delta = abs(kmem_read - kmem_sum) + # print(f"kmem read: {kmem_read}") + # print(f"kmem sum : {kmem_sum}") + # assert delta <= max(kmem_read, kmem_sum) * 0.2, f"Kmem read by probeCgroup is not accurate, {kmem_read}, {kmem_sum}" + +def cleanup(processes): + for process in processes: + process.terminate() + process.wait() + +def get_oom_process_memory_usage(pid, cgroup_name): + # + cur_name = '' + oom = False + with open('/proc/cgroup_memory_usage_per_process', 'r') as file: + for line in file: + parts = line.strip().split() + if len(parts) >= 4 and parts[0] == 'cgroup': + cur_name = parts[3] + oom = False + if len(parts) >= 1 and parts[0] == 'oom:': + oom = True + if len(parts) >= 3 and parts[0] != 'cgroup' and cur_name == cgroup_name and parts[0] != 'pid' and int(parts[0]) == pid and oom == True: + return int(parts[2]) + return None \ No newline at end of file diff --git a/tools/probeCgroup/testcases/mem-allocate.c
b/tools/probeCgroup/testcases/mem-allocate.c new file mode 100644 index 000000000000..e78e37cae61b --- /dev/null +++ b/tools/probeCgroup/testcases/mem-allocate.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * mem-allocate.c - The program to test probeCgroup + * + * Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +#define MB (1024 * 1024) + +char *arr[40]; + +int main(int argc, char *argv[]) +{ + char *p; + int i = 0; + + while (1) { + for (i = 0; i < 40; i++) { + p = (char *)malloc(MB); + memset(p, 0, MB); + arr[i] = p; + usleep(100000); + } + for (int i = 0; i < 40; i++) { + free(arr[i]); + usleep(100000); + } + } + return 0; +} diff --git a/tools/probeCgroup/testcases/multiple-thread-mem-allocate.c b/tools/probeCgroup/testcases/multiple-thread-mem-allocate.c new file mode 100644 index 000000000000..55f4c068f55e --- /dev/null +++ b/tools/probeCgroup/testcases/multiple-thread-mem-allocate.c @@ -0,0 +1,60 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * multiple-thread-mem-allocate.c - The program to test probeCgroup + * + * Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <pthread.h> + +#define MB (1024 * 1024) + +void *memory_test(void *) +{ + char *arr[25]; + char *p; + int i = 0; + int cnt = 0; + + while (1) { + for (i = 0; i < 20; i++) { + p = (char *)malloc(MB); + memset(p, 0, MB); + arr[i] = p; + usleep(10000); + } + for (int i = 0; i < 20; i++) { + free(arr[i]); + usleep(10000); + } + } +} + +int main(int argc, char *argv[]) +{ + pthread_t threads[4]; + int rc; + + // create threads + for (int i = 0; i < 4; i++) { + rc = pthread_create(&threads[i], NULL, memory_test, NULL); + if (rc != 0) { + fprintf(stderr, "Error creating thread: %d\n", rc); + return 1; + } + } + + for (int i = 0; i < 4; i++) { + rc = pthread_join(threads[i], NULL); + if (rc 
!= 0) { + fprintf(stderr, "Error joining thread: %d\n", rc); + return 1; + } + } + + return 0; +} diff --git a/tools/probeCgroup/testcases/run.py b/tools/probeCgroup/testcases/run.py new file mode 100755 index 000000000000..8ffd0ca720d8 --- /dev/null +++ b/tools/probeCgroup/testcases/run.py @@ -0,0 +1,32 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + +import os +import subprocess +import sys + +def run_tests(directory): + """Run all Python scripts in the given directory.""" + python_files = [f for f in os.listdir(directory) if f.endswith('test.py')] + python_files.sort() + + for filename in python_files: + try: + filepath = os.path.join(directory, filename) + + subprocess.check_call([sys.executable, filepath]) + except subprocess.CalledProcessError as e: + print(f"Error executing {filename}:") + return + except Exception as e: + print(f"Error executing {filename}:") + print(e) + return + +if __name__ == '__main__': + tests_directory = '.' 
+ subprocess.check_call(['gcc', 'mem-allocate.c', '-o', 'mem-allocate']) + subprocess.check_call(['gcc', 'simple-mem-allocate.c', '-o', 'simple-mem-allocate']) + subprocess.check_call(['gcc', 'multiple-thread-mem-allocate.c', '-o', 'multiple-thread-mem-allocate']) + run_tests(tests_directory) \ No newline at end of file diff --git a/tools/probeCgroup/testcases/simple-mem-allocate.c b/tools/probeCgroup/testcases/simple-mem-allocate.c new file mode 100644 index 000000000000..16328b10ba48 --- /dev/null +++ b/tools/probeCgroup/testcases/simple-mem-allocate.c @@ -0,0 +1,27 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * mem-allocate.c - The program to test probeCgroup + * + * Copyright (C) Taoxy2004 <221870066(a)smail.nju.edu.cn> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +#define MB (1024 * 1024) + +int main(int argc, char *argv[]) +{ + char *p; + int i = 0; + + while (1) { + p = (char *)malloc(MB); + memset(p, 0, MB); + sleep(1); + } + + return 0; +} -- 2.43.0
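The accuracy check that check_memory_usage() in cgroup_utils.py performs can be read in isolation as the sketch below. It assumes, as the helper's `memory_sum *= 1024` suggests, that the per-process figures in /proc/cgroup_memory_usage_per_process are in KiB. The name `usage_mismatch` and its `tolerance` parameter are illustrative and do not appear in the patch.

```python
def usage_mismatch(per_process_kib, cgroup_usage_bytes, tolerance=0.1):
    """Return 1 if the per-process total and the cgroup counter disagree.

    per_process_kib: per-process usage values, assumed to be in KiB.
    cgroup_usage_bytes: value read from memory.usage_in_bytes.
    tolerance: allowed relative difference (0.1 mirrors the 10% in the test).
    """
    total_bytes = sum(per_process_kib) * 1024
    delta = abs(cgroup_usage_bytes - total_bytes)
    # Same comparison as the test helper: relative to the larger of the two.
    return 1 if delta > max(cgroup_usage_bytes, total_bytes) * tolerance else 0
```

2_multiple_process_test.py adds up this 0/1 result over 2000 samples and asserts that the total stays at or below 50, so a small fraction of mismatching samples is tolerated.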

[PATCH openEuler-22.03-LTS-SP1 0/2] nfc: pn533: Wait for out_urb's completion in pn533_usb_send_frame()
by Kaixiong Yu 04 Sep '24

fix CVE-2023-52907 Fedor Pchelkin (1): nfc: pn533: initialize struct pn533_out_arg properly Minsuk Kang (1): nfc: pn533: Wait for out_urb's completion in pn533_usb_send_frame() drivers/nfc/pn533/usb.c | 45 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 42 insertions(+), 3 deletions(-) -- 2.25.1

[PATCH openEuler-1.0-LTS 0/2] nfc: pn533: Wait for out_urb's completion in pn533_usb_send_frame()
by Kaixiong Yu 04 Sep '24

fix CVE-2023-52907 Fedor Pchelkin (1): nfc: pn533: initialize struct pn533_out_arg properly Minsuk Kang (1): nfc: pn533: Wait for out_urb's completion in pn533_usb_send_frame() drivers/nfc/pn533/usb.c | 45 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 42 insertions(+), 3 deletions(-) -- 2.25.1

[PATCH openEuler-1.0-LTS] md/raid5: avoid BUG_ON() while continue reshape after reassembling
by Li Nan 04 Sep '24

From: Yu Kuai <yukuai3(a)huawei.com> stable inclusion from stable-v4.19.320 commit 2c92f8c1c456d556f15cbf51667b385026b2e6a0 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAMNBN CVE: CVE-2024-43914 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ] Currently, mdadm supports --revert-reshape to abort the reshape while reassembling, as in the test 07revert-grow. However, the following BUG_ON() can be triggered by the test: kernel BUG at drivers/md/raid5.c:6278! invalid opcode: 0000 [#1] PREEMPT SMP PTI irq event stamp: 158985 CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94 RIP: 0010:reshape_request+0x3f1/0xe60 Call Trace: <TASK> raid5_sync_request+0x43d/0x550 md_do_sync+0xb7a/0x2110 md_thread+0x294/0x2b0 kthread+0x147/0x1c0 ret_from_fork+0x59/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> The root cause is that --revert-reshape updates raid_disks from 5 to 4 while the reshape position is still set; after reassembling the array, the reshape position is read from the super block, and during reshape the check of 'writepos', which is calculated from the old reshape position, fails. Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON() and stopping the reshape if the checks fail. Note that mdadm must fix --revert-reshape as well, and md/raid should probably enhance metadata validation as well; however, this means reassembly will fail and there must be user tools to fix the wrong metadata.
Signed-off-by: Yu Kuai <yukuai3(a)huawei.com> Signed-off-by: Song Liu <song(a)kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Li Nan <linan122(a)huawei.com> --- drivers/md/raid5.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 1c2e2ff162dc..b2b35cdabac5 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5817,7 +5817,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk safepos = conf->reshape_safe; sector_div(safepos, data_disks); if (mddev->reshape_backwards) { - BUG_ON(writepos < reshape_sectors); + if (WARN_ON(writepos < reshape_sectors)) + return MaxSector; + writepos -= reshape_sectors; readpos += reshape_sectors; safepos += reshape_sectors; @@ -5835,14 +5837,18 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk * to set 'stripe_addr' which is where we will write to. */ if (mddev->reshape_backwards) { - BUG_ON(conf->reshape_progress == 0); + if (WARN_ON(conf->reshape_progress == 0)) + return MaxSector; + stripe_addr = writepos; - BUG_ON((mddev->dev_sectors & - ~((sector_t)reshape_sectors - 1)) - - reshape_sectors - stripe_addr - != sector_nr); + if (WARN_ON((mddev->dev_sectors & + ~((sector_t)reshape_sectors - 1)) - + reshape_sectors - stripe_addr != sector_nr)) + return MaxSector; } else { - BUG_ON(writepos != sector_nr + reshape_sectors); + if (WARN_ON(writepos != sector_nr + reshape_sectors)) + return MaxSector; + stripe_addr = sector_nr; } -- 2.39.2
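As a rough userspace model of what the hunks above change, the sketch below replaces "crash on inconsistency" with "report and stop". It is not kernel code: `check_reshape_positions` and `MAX_SECTOR` are hypothetical names, the code between the two hunks is left out, and MAX_SECTOR assumes a 64-bit sector_t. The arguments are the values each check sees at the point where the kernel evaluates it.

```python
MAX_SECTOR = 2**64 - 1  # stand-in for the kernel's MaxSector, assuming a 64-bit sector_t


def check_reshape_positions(backwards, writepos, reshape_sectors,
                            reshape_progress, dev_sectors, sector_nr):
    """Return stripe_addr, or MAX_SECTOR where the patch would WARN and stop."""
    if backwards:
        # First hunk: writepos must leave room for one reshape window.
        if writepos < reshape_sectors:
            return MAX_SECTOR
        writepos -= reshape_sectors
        # Second hunk: progress must be non-zero when reshaping backwards.
        if reshape_progress == 0:
            return MAX_SECTOR
        stripe_addr = writepos
        # Round dev_sectors down to a multiple of reshape_sectors
        # (reshape_sectors is assumed to be a power of two).
        aligned = dev_sectors & ~(reshape_sectors - 1)
        if aligned - reshape_sectors - stripe_addr != sector_nr:
            return MAX_SECTOR
        return stripe_addr
    # Forward reshape: writepos must sit exactly one window past sector_nr.
    if writepos != sector_nr + reshape_sectors:
        return MAX_SECTOR
    return sector_nr
```

With stale metadata such as a writepos smaller than reshape_sectors, the model returns MAX_SECTOR instead of raising, which mirrors the patch's `return MaxSector` in place of BUG_ON().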

[PATCH OLK-5.10] md/raid5: avoid BUG_ON() while continue reshape after reassembling
by Li Nan 04 Sep '24

From: Yu Kuai <yukuai3(a)huawei.com> stable inclusion from stable-v5.10.224 commit c384dd4f1fb3b14a2fd199360701cc163ea88705 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAMNBN CVE: CVE-2024-43914 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ] Currently, mdadm supports --revert-reshape to abort the reshape while reassembling, as in the test 07revert-grow. However, the following BUG_ON() can be triggered by the test: kernel BUG at drivers/md/raid5.c:6278! invalid opcode: 0000 [#1] PREEMPT SMP PTI irq event stamp: 158985 CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94 RIP: 0010:reshape_request+0x3f1/0xe60 Call Trace: <TASK> raid5_sync_request+0x43d/0x550 md_do_sync+0xb7a/0x2110 md_thread+0x294/0x2b0 kthread+0x147/0x1c0 ret_from_fork+0x59/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> The root cause is that --revert-reshape updates raid_disks from 5 to 4 while the reshape position is still set; after reassembling the array, the reshape position is read from the super block, and during reshape the check of 'writepos', which is calculated from the old reshape position, fails. Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON() and stopping the reshape if the checks fail. Note that mdadm must fix --revert-reshape as well, and md/raid should probably enhance metadata validation as well; however, this means reassembly will fail and there must be user tools to fix the wrong metadata.
Signed-off-by: Yu Kuai <yukuai3(a)huawei.com> Signed-off-by: Song Liu <song(a)kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Li Nan <linan122(a)huawei.com> --- drivers/md/raid5.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3cb90d7e88d9..126b9ecfe750 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5997,7 +5997,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk safepos = conf->reshape_safe; sector_div(safepos, data_disks); if (mddev->reshape_backwards) { - BUG_ON(writepos < reshape_sectors); + if (WARN_ON(writepos < reshape_sectors)) + return MaxSector; + writepos -= reshape_sectors; readpos += reshape_sectors; safepos += reshape_sectors; @@ -6015,14 +6017,18 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk * to set 'stripe_addr' which is where we will write to. */ if (mddev->reshape_backwards) { - BUG_ON(conf->reshape_progress == 0); + if (WARN_ON(conf->reshape_progress == 0)) + return MaxSector; + stripe_addr = writepos; - BUG_ON((mddev->dev_sectors & - ~((sector_t)reshape_sectors - 1)) - - reshape_sectors - stripe_addr - != sector_nr); + if (WARN_ON((mddev->dev_sectors & + ~((sector_t)reshape_sectors - 1)) - + reshape_sectors - stripe_addr != sector_nr)) + return MaxSector; } else { - BUG_ON(writepos != sector_nr + reshape_sectors); + if (WARN_ON(writepos != sector_nr + reshape_sectors)) + return MaxSector; + stripe_addr = sector_nr; } -- 2.39.2

[PATCH openEuler-22.03-LTS-SP1] md/raid5: avoid BUG_ON() while continue reshape after reassembling
by Li Nan 04 Sep '24

From: Yu Kuai <yukuai3(a)huawei.com> stable inclusion from stable-v5.10.224 commit c384dd4f1fb3b14a2fd199360701cc163ea88705 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAMNBN CVE: CVE-2024-43914 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ] Currently, mdadm supports --revert-reshape to abort the reshape while reassembling, as in the test 07revert-grow. However, the following BUG_ON() can be triggered by the test: kernel BUG at drivers/md/raid5.c:6278! invalid opcode: 0000 [#1] PREEMPT SMP PTI irq event stamp: 158985 CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94 RIP: 0010:reshape_request+0x3f1/0xe60 Call Trace: <TASK> raid5_sync_request+0x43d/0x550 md_do_sync+0xb7a/0x2110 md_thread+0x294/0x2b0 kthread+0x147/0x1c0 ret_from_fork+0x59/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> The root cause is that --revert-reshape updates raid_disks from 5 to 4 while the reshape position is still set; after reassembling the array, the reshape position is read from the super block, and during reshape the check of 'writepos', which is calculated from the old reshape position, fails. Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON() and stopping the reshape if the checks fail. Note that mdadm must fix --revert-reshape as well, and md/raid should probably enhance metadata validation as well; however, this means reassembly will fail and there must be user tools to fix the wrong metadata.
Signed-off-by: Yu Kuai <yukuai3(a)huawei.com> Signed-off-by: Song Liu <song(a)kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Li Nan <linan122(a)huawei.com> --- drivers/md/raid5.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3cb90d7e88d9..126b9ecfe750 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5997,7 +5997,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk safepos = conf->reshape_safe; sector_div(safepos, data_disks); if (mddev->reshape_backwards) { - BUG_ON(writepos < reshape_sectors); + if (WARN_ON(writepos < reshape_sectors)) + return MaxSector; + writepos -= reshape_sectors; readpos += reshape_sectors; safepos += reshape_sectors; @@ -6015,14 +6017,18 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk * to set 'stripe_addr' which is where we will write to. */ if (mddev->reshape_backwards) { - BUG_ON(conf->reshape_progress == 0); + if (WARN_ON(conf->reshape_progress == 0)) + return MaxSector; + stripe_addr = writepos; - BUG_ON((mddev->dev_sectors & - ~((sector_t)reshape_sectors - 1)) - - reshape_sectors - stripe_addr - != sector_nr); + if (WARN_ON((mddev->dev_sectors & + ~((sector_t)reshape_sectors - 1)) - + reshape_sectors - stripe_addr != sector_nr)) + return MaxSector; } else { - BUG_ON(writepos != sector_nr + reshape_sectors); + if (WARN_ON(writepos != sector_nr + reshape_sectors)) + return MaxSector; + stripe_addr = sector_nr; } -- 2.39.2

[PATCH OLK-6.6] net: missing check virtio
by Zhang Changzhong 04 Sep '24

From: Denis Arefev <arefev(a)swemel.ru> stable inclusion from stable-v6.6.44 commit 90d41ebe0cd4635f6410471efc1dd71b33e894cf category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAKQ33 CVE: CVE-2024-43817 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id… -------------------------------- [ Upstream commit e269d79c7d35aa3808b1f3c1737d63dab504ddc8 ] Two missing checks in virtio_net_hdr_to_skb() allowed syzbot to crash kernels again: 1. After the skb_segment function the buffer may become non-linear (nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere, the __skb_linearize function will not be executed and the buffer will remain non-linear. The condition (offset >= skb_headlen(skb)) then becomes true, which causes a WARN_ON_ONCE in skb_checksum_help. 2. The struct sk_buff and struct virtio_net_hdr members must be mathematically related. (gso_size) must be greater than (needed), otherwise WARN_ON_ONCE. (remainder) must be greater than (needed), otherwise WARN_ON_ONCE. (remainder) may be 0 if the division leaves no remainder.
offset+2 (4191) > skb_headlen() (1116) WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303 Modules linked in: CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted 6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023 RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303 Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b 53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef RSP: 0018:ffffc90003a9f338 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff888025125780 RCX: ffffffff814db209 RDX: ffff888015393b80 RSI: ffffffff814db216 RDI: 0000000000000001 RBP: ffff8880251257f4 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000045c R13: 000000000000105f R14: ffff8880251257f0 R15: 000000000000105d FS: 0000555555c24380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002000f000 CR3: 0000000023151000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777 ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584 ip_finish_output_gso net/ipv4/ip_output.c:286 [inline] __ip_finish_output net/ipv4/ip_output.c:308 [inline] __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295 ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323 NF_HOOK_COND include/linux/netfilter.h:303 [inline] ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433 dst_output include/net/dst.h:451 [inline] ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129 iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82 ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline] sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076 
__netdev_start_xmit include/linux/netdevice.h:4940 [inline] netdev_start_xmit include/linux/netdevice.h:4954 [inline] xmit_one net/core/dev.c:3545 [inline] dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561 __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346 dev_queue_xmit include/linux/netdevice.h:3134 [inline] packet_xmit+0x257/0x380 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3087 [inline] packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0xd5/0x180 net/socket.c:745 __sys_sendto+0x255/0x340 net/socket.c:2190 __do_sys_sendto net/socket.c:2202 [inline] __se_sys_sendto net/socket.c:2198 [inline] __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Found by Linux Verification Center (linuxtesting.org) with Syzkaller Fixes: 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head") Signed-off-by: Denis Arefev <arefev(a)swemel.ru> Message-Id: <20240613095448.27118-1-arefev(a)swemel.ru> Signed-off-by: Michael S. 
Tsirkin <mst(a)redhat.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> Signed-off-by: Zhang Changzhong <zhangchangzhong(a)huawei.com> --- include/linux/virtio_net.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 6c395a2..c824c52 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -56,6 +56,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, unsigned int thlen = 0; unsigned int p_off = 0; unsigned int ip_proto; + u64 ret, remainder, gso_size; if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { @@ -98,6 +99,16 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset); u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16)); + if (hdr->gso_size) { + gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size); + ret = div64_u64_rem(skb->len, gso_size, &remainder); + if (!(ret && (hdr->gso_size > needed) && + ((remainder > needed) || (remainder == 0)))) { + return -EINVAL; + } + skb_shinfo(skb)->tx_flags |= SKBFL_SHARED_FRAG; + } + if (!pskb_may_pull(skb, needed)) return -EINVAL; -- 2.9.5
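The condition added in the hunk above can be restated as a small model to make the accepted and rejected combinations easier to see. This is a sketch under simplifying assumptions: `gso_fields_valid` is a hypothetical name, and a single `gso_size` value is used throughout, whereas the patch compares the raw `hdr->gso_size` against `needed` and uses the endian-converted value for the division.

```python
def gso_fields_valid(skb_len, gso_size, needed):
    """Model of the patched check; False corresponds to returning -EINVAL."""
    if gso_size == 0:
        # The patch only applies the check when gso_size is non-zero.
        return True
    # div64_u64_rem() yields the quotient and stores the remainder.
    segments, remainder = divmod(skb_len, gso_size)
    return bool(segments
                and gso_size > needed
                and (remainder > needed or remainder == 0))
```

For example, with needed = 54 a 3010-byte packet and gso_size = 1000 is rejected because the 10-byte tail is shorter than the headers that must be pulled, while 3000 bytes (no tail) and 3100 bytes (100-byte tail) are accepted.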

[PATCH openEuler-22.03-LTS-SP1] net: missing check virtio
by Zhang Changzhong 04 Sep '24

From: Denis Arefev <arefev(a)swemel.ru>

mainline inclusion
from mainline-v6.11-rc1
commit e269d79c7d35aa3808b1f3c1737d63dab504ddc8
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAKQ33
CVE: CVE-2024-43817

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

Two missing check in virtio_net_hdr_to_skb() allowed syzbot
to crash kernels again

1. After the skb_segment function the buffer may become non-linear
(nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere
the __skb_linearize function will not be executed, then the buffer will
remain non-linear. Then the condition (offset >= skb_headlen(skb))
becomes true, which causes WARN_ON_ONCE in skb_checksum_help.

2. The struct sk_buff and struct virtio_net_hdr members must be
mathematically related.
(gso_size) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) may be 0 if division is without remainder.

offset+2 (4191) > skb_headlen() (1116)
WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Modules linked in:
CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted 6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b 53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef
RSP: 0018:ffffc90003a9f338 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888025125780 RCX: ffffffff814db209
RDX: ffff888015393b80 RSI: ffffffff814db216 RDI: 0000000000000001
RBP: ffff8880251257f4 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000045c
R13: 000000000000105f R14: ffff8880251257f0 R15: 000000000000105d
FS: 0000555555c24380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000f000 CR3: 0000000023151000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777
 ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584
 ip_finish_output_gso net/ipv4/ip_output.c:286 [inline]
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
 ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
 NF_HOOK_COND include/linux/netfilter.h:303 [inline]
 ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
 dst_output include/net/dst.h:451 [inline]
 ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
 iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
 ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
 sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
 __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
 netdev_start_xmit include/linux/netdevice.h:4954 [inline]
 xmit_one net/core/dev.c:3545 [inline]
 dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
 __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
 dev_queue_xmit include/linux/netdevice.h:3134 [inline]
 packet_xmit+0x257/0x380 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3087 [inline]
 packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0xd5/0x180 net/socket.c:745
 __sys_sendto+0x255/0x340 net/socket.c:2190
 __do_sys_sendto net/socket.c:2202 [inline]
 __se_sys_sendto net/socket.c:2198 [inline]
 __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

Found by Linux Verification Center (linuxtesting.org) with Syzkaller

Fixes: 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head")
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
Message-Id: <20240613095448.27118-1-arefev(a)swemel.ru>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong(a)huawei.com>
---
 include/linux/virtio_net.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index a960de6..b6a9d07 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 	unsigned int thlen = 0;
 	unsigned int p_off = 0;
 	unsigned int ip_proto;
+	u64 ret, remainder, gso_size;
 
 	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
 		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
@@ -87,6 +88,16 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 		u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
 		u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16));
 
+		if (hdr->gso_size) {
+			gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
+			ret = div64_u64_rem(skb->len, gso_size, &remainder);
+			if (!(ret && (hdr->gso_size > needed) &&
+						((remainder > needed) || (remainder == 0)))) {
+				return -EINVAL;
+			}
+			skb_shinfo(skb)->tx_flags |= SKBFL_SHARED_FRAG;
+		}
+
 		if (!pskb_may_pull(skb, needed))
 			return -EINVAL;
 
--
2.9.5
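
The two conditions listed in the commit message can be restated as a single predicate. The following is a minimal userspace sketch, not the kernel code: the function name gso_geometry_ok and its parameter names are hypothetical, plain / and % stand in for div64_u64_rem(), and every value is assumed to already be in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace restatement of the check added by the patch.
 *   len      - total packet length (skb->len in the patch)
 *   gso_size - segment size taken from the virtio-net header
 *   needed   - bytes that must be present in the linear head
 * Assumption: all three values are already in host byte order.
 */
static bool gso_geometry_ok(uint64_t len, uint64_t gso_size, uint64_t needed)
{
	uint64_t quotient, remainder;

	/* The patch only applies the check when gso_size is non-zero. */
	if (gso_size == 0)
		return true;

	quotient = len / gso_size;   /* number of full segments */
	remainder = len % gso_size;  /* length of the trailing segment */

	/* At least one full segment, and no segment shorter than `needed`. */
	return quotient != 0 &&
	       gso_size > needed &&
	       (remainder > needed || remainder == 0);
}
```

For instance, with needed = 100, a 3050-byte packet and gso_size = 1000 would be rejected because the 50-byte tail is shorter than needed, while 3000 or 3500 bytes would pass.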

[PATCH OLK-5.10] net: missing check virtio
by Zhang Changzhong 04 Sep '24

From: Denis Arefev <arefev(a)swemel.ru>

mainline inclusion
from mainline-v6.11-rc1
commit e269d79c7d35aa3808b1f3c1737d63dab504ddc8
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAKQ33
CVE: CVE-2024-43817

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

Two missing check in virtio_net_hdr_to_skb() allowed syzbot
to crash kernels again

1. After the skb_segment function the buffer may become non-linear
(nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere
the __skb_linearize function will not be executed, then the buffer will
remain non-linear. Then the condition (offset >= skb_headlen(skb))
becomes true, which causes WARN_ON_ONCE in skb_checksum_help.

2. The struct sk_buff and struct virtio_net_hdr members must be
mathematically related.
(gso_size) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) may be 0 if division is without remainder.

offset+2 (4191) > skb_headlen() (1116)
WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Modules linked in:
CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted 6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b 53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef
RSP: 0018:ffffc90003a9f338 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888025125780 RCX: ffffffff814db209
RDX: ffff888015393b80 RSI: ffffffff814db216 RDI: 0000000000000001
RBP: ffff8880251257f4 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000045c
R13: 000000000000105f R14: ffff8880251257f0 R15: 000000000000105d
FS: 0000555555c24380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000f000 CR3: 0000000023151000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777
 ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584
 ip_finish_output_gso net/ipv4/ip_output.c:286 [inline]
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
 ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
 NF_HOOK_COND include/linux/netfilter.h:303 [inline]
 ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
 dst_output include/net/dst.h:451 [inline]
 ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
 iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
 ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
 sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
 __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
 netdev_start_xmit include/linux/netdevice.h:4954 [inline]
 xmit_one net/core/dev.c:3545 [inline]
 dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
 __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
 dev_queue_xmit include/linux/netdevice.h:3134 [inline]
 packet_xmit+0x257/0x380 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3087 [inline]
 packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0xd5/0x180 net/socket.c:745
 __sys_sendto+0x255/0x340 net/socket.c:2190
 __do_sys_sendto net/socket.c:2202 [inline]
 __se_sys_sendto net/socket.c:2198 [inline]
 __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

Found by Linux Verification Center (linuxtesting.org) with Syzkaller

Fixes: 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head")
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
Message-Id: <20240613095448.27118-1-arefev(a)swemel.ru>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong(a)huawei.com>
---
 include/linux/virtio_net.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 6047058..29b19d0 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 	unsigned int thlen = 0;
 	unsigned int p_off = 0;
 	unsigned int ip_proto;
+	u64 ret, remainder, gso_size;
 
 	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
 		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
@@ -87,6 +88,16 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 		u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
 		u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16));
 
+		if (hdr->gso_size) {
+			gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
+			ret = div64_u64_rem(skb->len, gso_size, &remainder);
+			if (!(ret && (hdr->gso_size > needed) &&
+						((remainder > needed) || (remainder == 0)))) {
+				return -EINVAL;
+			}
+			skb_shinfo(skb)->tx_flags |= SKBFL_SHARED_FRAG;
+		}
+
 		if (!pskb_may_pull(skb, needed))
 			return -EINVAL;
 
--
2.9.5
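
The first line of the splat, "offset+2 (4191) > skb_headlen() (1116)", reports that the 2-byte checksum field would end at byte 4191 while only 1116 bytes sit in the linear head of the buffer. A small sketch of that bounds test follows; csum_field_in_head is a hypothetical name used for illustration and is not a kernel function, and the offset 4189 is simply 4191 minus the 2-byte field width.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a 2-byte checksum field that
 * starts at byte `csum_pos` lies entirely within the first `headlen`
 * bytes (the linear part) of a packet buffer.
 */
static bool csum_field_in_head(uint32_t csum_pos, uint32_t headlen)
{
	/* Widen before adding so the sum cannot wrap around. */
	return (uint64_t)csum_pos + sizeof(uint16_t) <= headlen;
}
```

With the numbers from the trace, csum_field_in_head(4189, 1116) is false, which is the situation the warning describes.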
