A group of optimizations and bug fixes for numa-affinity
Nanyong Sun (5):
  mm: thp: support to control numa migration
  mm: numa-affinity: add helper numa_affinity_sampling_enabled()
  mm: numa-affinity: adapt for should_numa_migrate_memory
  mm: numa-affinity: adapt for task_numa_placement
  mm: numa-affinity: fix build error when !CONFIG_PROC_SYSCTL
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++
 arch/arm64/Kconfig                         |  1 +
 arch/arm64/configs/openeuler_defconfig     |  1 +
 arch/x86/configs/openeuler_defconfig       |  1 +
 include/linux/huge_mm.h                    | 13 +++++++++
 include/linux/mem_sampling.h               | 13 +++++++++
 kernel/sched/fair.c                        | 30 ++++++++++++++------
 mm/Kconfig                                 | 10 +++++++
 mm/huge_memory.c                           | 33 ++++++++++++++++++++++
 mm/mem_sampling.c                          |  4 ---
 mm/migrate.c                               |  3 ++
 11 files changed, 105 insertions(+), 12 deletions(-)
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAHJKC
CVE: NA
--------------------------------
Sometimes migrating a THP is not beneficial; for example, when the 64K page size is used on ARM64, a THP is 512M and migration may cause a performance regression. This feature adds an interface to control THP migration during numa balancing: /sys/kernel/mm/transparent_hugepage/numa_control
The default value is 0, which keeps the default policy (THPs will be migrated). Writing 1 disables THP migration, while tasks still have a chance to collect numa group info and may migrate.
The control logic applies to both autonuma and SPE-based numa affinity.
The Spark benchmark shows a 5% performance improvement after writing 1 to numa_control.
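As a usage sketch (assuming a kernel built with CONFIG_THP_NUMA_CONTROL=y; the sysfs path is the one added by this series):

	# read the current policy; 0 means THPs may still be migrated
	cat /sys/kernel/mm/transparent_hugepage/numa_control
	# disable THP migration during numa balancing
	echo 1 > /sys/kernel/mm/transparent_hugepage/numa_control
	# restore the default behavior
	echo 0 > /sys/kernel/mm/transparent_hugepage/numa_control

Values other than 0 and 1 are rejected with -EINVAL by the store handler.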
Fixes: 34387bcad1cd ("mm: numa-affinity: support THP migration")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++
 arch/arm64/Kconfig                         |  1 +
 arch/arm64/configs/openeuler_defconfig     |  1 +
 arch/x86/configs/openeuler_defconfig       |  1 +
 include/linux/huge_mm.h                    | 13 +++++++++
 mm/Kconfig                                 | 10 +++++++
 mm/huge_memory.c                           | 33 ++++++++++++++++++++++
 mm/migrate.c                               |  3 ++
 8 files changed, 70 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 2bfb380e8380..fdff6c4247db 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -160,6 +160,14 @@ library) may want to know the size (in bytes) of a transparent hugepage::
 
 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+If CONFIG_THP_NUMA_CONTROL is on, the user can control THP migration
+when doing numa balancing. 0 is the default, which keeps the default
+behavior; writing 1 will disable THP migration while tasks still have
+a chance to migrate::
+
+	echo 0 > /sys/kernel/mm/transparent_hugepage/numa_control
+	echo 1 > /sys/kernel/mm/transparent_hugepage/numa_control
+
 khugepaged will be automatically started when
 transparent_hugepage/enabled is set to "always" or "madvise, and it'll
 be automatically shutdown if it's set to "never".
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index cae54a9bf65d..8b8f48b2a51e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -216,6 +216,7 @@ config ARM64
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select HAVE_LIVEPATCH_WO_FTRACE
+	select THP_NUMA_CONTROL if ARM64_64K_PAGES && NUMA_BALANCING && TRANSPARENT_HUGEPAGE
 	help
 	  ARM 64-bit (AArch64) Linux support.
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 5b928488b4c0..c26a9a7379a9 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -1182,6 +1182,7 @@ CONFIG_MEMORY_RELIABLE=y
 CONFIG_EXTEND_HUGEPAGE_MAPPING=y
 CONFIG_MEM_SAMPLING=y
 CONFIG_NUMABALANCING_MEM_SAMPLING=y
+# CONFIG_THP_NUMA_CONTROL is not set
 
 #
 # Data Access Monitoring
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index c522018b6481..c399055a52be 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -1110,6 +1110,7 @@ CONFIG_ARCH_HAS_PTE_SPECIAL=y
 CONFIG_MAPPING_DIRTY_HELPERS=y
 CONFIG_MEMORY_RELIABLE=y
 # CONFIG_CLEAR_FREELIST_PAGE is not set
+# CONFIG_THP_NUMA_CONTROL is not set
 
 #
 # Data Access Monitoring
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index efb370e79ac3..d9dde313d267 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -498,6 +498,19 @@ static inline unsigned long thp_size(struct page *page)
 	return PAGE_SIZE << thp_order(page);
 }
 
+#ifdef CONFIG_THP_NUMA_CONTROL
+#define THP_DISABLE_NUMA_MIGRATE 1
+extern unsigned long thp_numa_control;
+static inline bool thp_numa_migrate_disabled(void)
+{
+	return thp_numa_control == THP_DISABLE_NUMA_MIGRATE;
+}
+#else
+static inline bool thp_numa_migrate_disabled(void)
+{
+	return false;
+}
+#endif
 /*
  * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
  * limitations in the implementation like arm64 MTE can override this to
diff --git a/mm/Kconfig b/mm/Kconfig
index ccbad233f2b1..cc43f5124cb3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1038,6 +1038,16 @@ config NUMABALANCING_MEM_SAMPLING
if unsure, say N to disable the NUMABALANCING_MEM_SAMPLING.
 
+config THP_NUMA_CONTROL
+	bool "Control THP migration when numa balancing"
+	depends on NUMA_BALANCING && TRANSPARENT_HUGEPAGE
+	default n
+	help
+	  Sometimes migrating THP is not beneficial, for example, when the 64K
+	  page size is set on ARM64, THP will be 512M and migration will be
+	  expensive. This feature adds a switch to control the behavior of THP
+	  migration when doing numa balancing.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb293d17a104..f286261f5525 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -316,6 +316,36 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 static struct kobj_attribute hpage_pmd_size_attr =
 	__ATTR_RO(hpage_pmd_size);
 
+#ifdef CONFIG_THP_NUMA_CONTROL
+unsigned long thp_numa_control __read_mostly;
+
+static ssize_t numa_control_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", READ_ONCE(thp_numa_control));
+}
+
+static ssize_t numa_control_store(struct kobject *kobj,
+				  struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > THP_DISABLE_NUMA_MIGRATE)
+		return -EINVAL;
+
+	WRITE_ONCE(thp_numa_control, value);
+
+	return count;
+}
+
+static struct kobj_attribute numa_control_attr =
+	__ATTR(numa_control, 0644, numa_control_show, numa_control_store);
+#endif
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
@@ -323,6 +353,9 @@ static struct attribute *hugepage_attr[] = {
 	&hpage_pmd_size_attr.attr,
 #ifdef CONFIG_SHMEM
 	&shmem_enabled_attr.attr,
+#endif
+#ifdef CONFIG_THP_NUMA_CONTROL
+	&numa_control_attr.attr,
 #endif
 	NULL,
 };
diff --git a/mm/migrate.c b/mm/migrate.c
index 857c15e43497..cff5e11437d9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2161,6 +2161,9 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	 */
 	compound = PageTransHuge(page);
 
+	if (compound && thp_numa_migrate_disabled())
+		return 0;
+
 	if (compound)
 		new = alloc_misplaced_dst_page_thp;
 	else
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAHJKC
CVE: NA
--------------------------------
Numa affinity reuses some AutoNuma code, which was designed around page-fault based memory access awareness, so some of the logic needs adjustment for sampling-based awareness. Add a helper numa_affinity_sampling_enabled() to distinguish the two scenarios, and use it in task_tick_numa() to simplify the code.
Fixes: bdc4701337d7 ("mm/mem_sampling.c: Drive NUMA balancing via mem_sampling access data")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 include/linux/mem_sampling.h | 13 +++++++++++++
 kernel/sched/fair.c          | 11 ++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/include/linux/mem_sampling.h b/include/linux/mem_sampling.h
index 5c168bc60862..6978c11d5499 100644
--- a/include/linux/mem_sampling.h
+++ b/include/linux/mem_sampling.h
@@ -105,4 +105,17 @@ static inline int arm_spe_enabled(void)
 	return 0;
 }
 #endif /* CONFIG_ARM_SPE_MEM_SAMPLING */
+
+#ifdef CONFIG_NUMABALANCING_MEM_SAMPLING
+static inline bool numa_affinity_sampling_enabled(void)
+{
+	return static_branch_unlikely(&sched_numabalancing_mem_sampling);
+}
+#else
+static inline bool numa_affinity_sampling_enabled(void)
+{
+	return false;
+}
+#endif
+
 #endif /* __MEM_SAMPLING_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f56800b17da..2139edac2cb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,16 +2968,13 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	struct callback_head *work = &curr->numa_work;
 	u64 period, now;
 
-#ifdef CONFIG_NUMABALANCING_MEM_SAMPLING
 	/*
-	 * If we are using access hints from hardware (like using
-	 * SPE), don't scan the address space.
-	 * Note that currently PMD-level page migration is not
-	 * supported.
+	 * numa affinity uses hardware sampling to get numa info (like using
+	 * SPE for ARM64), so there is no need to scan the address space anymore.
 	 */
-	if (static_branch_unlikely(&sched_numabalancing_mem_sampling))
+	if (numa_affinity_sampling_enabled())
 		return;
-#endif
+
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAHJKC
CVE: NA
--------------------------------
The numa_scan_seq update depends on the numa scanning work being done, which is skipped when sampling-based numa affinity is on, so numa_scan_seq will always be 0. Skip this check to avoid false migration here.
The Spark benchmark shows a 1%~2% performance improvement after applying this.
Fixes: bdc4701337d7 ("mm/mem_sampling.c: Drive NUMA balancing via mem_sampling access data")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 kernel/sched/fair.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2139edac2cb1..d22936de5714 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1423,6 +1423,20 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
 	return 1000 * faults / total_faults;
 }
 
+static inline bool in_early_stage(struct task_struct *p, int early_seq)
+{
+	/*
+	 * For sampling based autonuma, numa_scan_seq never updates. Currently,
+	 * just skip here to avoid false migration. In the future, a real
+	 * lifetime judgment can be implemented if the workloads are very
+	 * sensitive to the starting stage of the process.
+	 */
+	if (numa_affinity_sampling_enabled())
+		return false;
+
+	return p->numa_scan_seq <= early_seq;
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu)
 {
@@ -1439,7 +1453,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	 * two full passes of the "multi-stage node selection" test that is
 	 * executed below.
 	 */
-	if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) &&
+	if ((p->numa_preferred_nid == NUMA_NO_NODE || in_early_stage(p, 4)) &&
 	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
 		return true;
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAHJKC
CVE: NA
--------------------------------
The numa_scan_seq never updates when sampling is enabled, so p->numa_scan_seq == seq is always true and task_numa_placement() always returns at the beginning; as a result, the numa_faults and numa_group information are never updated. Skip the numa_scan_seq check in task_numa_placement() to fix this.
Fixes: bdc4701337d7 ("mm/mem_sampling.c: Drive NUMA balancing via mem_sampling access data")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d22936de5714..0e47766bc591 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2405,6 +2405,8 @@ static void task_numa_placement(struct task_struct *p)
 	spinlock_t *group_lock = NULL;
 	struct numa_group *ng;
 
+	if (numa_affinity_sampling_enabled())
+		goto not_scan;
 	/*
 	 * The p->mm->numa_scan_seq field gets updated without
 	 * exclusive access. Use READ_ONCE() here to ensure
@@ -2416,6 +2418,7 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_scan_seq = seq;
 	p->numa_scan_period_max = task_scan_max(p);
 
+not_scan:
 	total_faults = p->numa_faults_locality[0] +
 		       p->numa_faults_locality[1];
 	runtime = numa_get_avg_runtime(p, &period);
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAHJKC
CVE: NA
--------------------------------
Fix the following build error when CONFIG_PROC_SYSCTL is not set:

mm/mem_sampling.c: error: ‘sysctl_mem_sampling_enable’ undeclared here
mm/mem_sampling.c: error: ‘sysctl_numabalancing_mem_sampling’ undeclared here
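For reference, a rough way to reproduce the error locally (the config/make invocations are illustrative, not part of this patch) is to disable CONFIG_PROC_SYSCTL and rebuild just this object:

	./scripts/config --file .config -d PROC_SYSCTL
	make olddefconfig
	make mm/mem_sampling.o

With the #ifdef CONFIG_PROC_SYSCTL guards removed, the handlers are always built, so the ctl_table entries that reference them no longer hit undeclared symbols.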
Fixes: 9878268b0b9f ("mm/mem_sampling.c: Add controlling interface for mem_sampling")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 mm/mem_sampling.c | 4 ----
 1 file changed, 4 deletions(-)
diff --git a/mm/mem_sampling.c b/mm/mem_sampling.c
index 1d8a831be531..5bff12212471 100644
--- a/mm/mem_sampling.c
+++ b/mm/mem_sampling.c
@@ -369,7 +369,6 @@ static void set_numabalancing_mem_sampling_state(bool enabled)
 	}
 }
 
-#ifdef CONFIG_PROC_SYSCTL
 int sysctl_numabalancing_mem_sampling(struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -391,7 +390,6 @@ int sysctl_numabalancing_mem_sampling(struct ctl_table *table, int write,
 
 	return err;
 }
-#endif
 #else
 static inline void set_numabalancing_mem_sampling_state(bool enabled)
 {
@@ -423,7 +421,6 @@ static void set_mem_sampling_state(bool enabled)
 	set_numabalancing_mem_sampling_state(enabled);
 }
 
-#ifdef CONFIG_PROC_SYSCTL
 static int sysctl_mem_sampling_enable(struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -443,7 +440,6 @@ static int sysctl_mem_sampling_enable(struct ctl_table *table, int write,
 		set_mem_sampling_state(state);
 	return err;
 }
-#endif
 
 static struct ctl_table ctl_table[] = {
 	{
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/11773 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...