From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

mainline inclusion
from tip/sched/core for v5.15-release
commit: c5e22feffdd736cb02b98b0f5b375c8ebc858dd4
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GEZS
CVE: NA

Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
------------------------------------------------------------------------
Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher-level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Node Structure representing a higher level of topology.

For example, Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 CPUs. All clusters share L3 cache data, but each cluster has a local L3 tag. On the other hand, each cluster will share some internal system bus.
+-----------------------------------+                          +---------+
|  +------+    +------+             +--------------------------+         |
|  | CPU0 |    | cpu1 |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+   cluster   |    |    tag    |         |         |
|  | CPU2 |    | CPU3 |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |   L3    |
                                                               |  data   |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          +---------+
That means spreading tasks among clusters brings more memory bandwidth, while packing tasks within one cluster leads to smaller cache-synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly, achieving either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput.
This patch exposes cluster topology to both kernel and userspace. Libraries like hwloc will know the cluster via cluster_cpus and related sysfs attributes; a short user-space read-out sketch follows the references below. A PoC of hwloc support is at [2].

Note this patch only handles the ACPI case.
Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level).
Note that arm64 / ACPI does not provide any means of identifying a die level in the topology, but that may be unrelated to the cluster level.
[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster
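For illustration, here is a minimal user-space sketch (not part of the patch) that reads the new attribute; it assumes a kernel with this patch applied and an ACPI PPTT that describes clusters:

  /* Print the CPUs sharing a cluster with CPU 0. */
  #include <stdio.h>

  int main(void)
  {
          char buf[256];
          FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");

          if (!f) {
                  perror("cluster_cpus_list");
                  return 1;
          }
          if (fgets(buf, sizeof(buf), f))
                  printf("CPUs in cpu0's cluster: %s", buf);
          fclose(f);
          return 0;
  }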
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: tao zeng <prime.zeng@hisilicon.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 Documentation/admin-guide/cputopology.rst | 26 +++++++--
 arch/arm64/kernel/topology.c              |  2 +
 drivers/acpi/pptt.c                       | 67 +++++++++++++++++++++++
 drivers/base/arch_topology.c              | 15 +++++
 drivers/base/topology.c                   | 10 ++++
 include/linux/acpi.h                      |  5 ++
 include/linux/arch_topology.h             |  5 ++
 include/linux/topology.h                  |  6 ++
 8 files changed, 132 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b90dafcc8237..57be98ae27b8 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -18,6 +18,11 @@ die_id:
 	identifier (rather than the kernel's).  The actual value is
 	architecture and platform dependent.

+cluster_id:
+
+	the cluster ID of cpuX.  Typically it is the hardware platform's
+	identifier (rather than the kernel's).  The actual value is
+	architecture and platform dependent.
+
 core_id:

 	the CPU core ID of cpuX.  Typically it is the hardware platform's
@@ -64,6 +69,15 @@ die_cpus_list:

 	human-readable list of CPUs within the same die.

+cluster_cpus:
+
+	internal kernel map of CPUs within the same cluster.
+
+cluster_cpus_list:
+
+	human-readable list of CPUs within the same cluster.
+	The format is like 0-3, 8-11, 14,17.
+
 book_siblings:

 	internal kernel map of cpuX's hardware threads within the same
@@ -96,11 +110,13 @@ these macros in include/asm-XXX/topology.h::

 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -116,10 +132,12 @@ not defined by include/asm-XXX/topology.h:

 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU

 For architectures that don't support books (CONFIG_SCHED_BOOK) there are
 no default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 19cedf882a6b..a80fcb6dd88a 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -114,6 +114,8 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index 20bd584f2516..3d9403ded527 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -849,6 +849,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
 					  ACPI_PPTT_PHYSICAL_PACKAGE);
 }

+/**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching IDs.
+ *
+ * The cluster, if present, is the level of topology above CPUs.  In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems.  In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found,
+ * or there is no topology level above the CPU.
+ * Otherwise returns a value which represents the cluster for this CPU.
+ */
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	u32 acpi_cpu_id;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+
+	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
+	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
+		retval = cluster_node->acpi_processor_id;
+	else
+		retval = ACPI_PTR_DIFF(cluster_node, table);
+
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
 /**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index de8587cc119e..21e63b6fab83 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -506,6 +506,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }

+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -523,6 +528,12 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;

+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
+		    cpuid_topo->cluster_id != -1) {
+			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
+			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
+		}
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

@@ -541,6 +552,9 @@ static void clear_cpu_topology(int cpu)
 	cpumask_clear(&cpu_topo->llc_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);

+	cpumask_clear(&cpu_topo->cluster_sibling);
+	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
+
 	cpumask_clear(&cpu_topo->core_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
 	cpumask_clear(&cpu_topo->thread_sibling);
@@ -556,6 +570,7 @@ void __init reset_cpu_topology(void)

 		cpu_topo->thread_id = -1;
 		cpu_topo->core_id = -1;
+		cpu_topo->cluster_id = -1;
 		cpu_topo->package_id = -1;
 		cpu_topo->llc_id = -1;
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 4d254fcc93d1..7157ac08ff57 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -46,6 +46,9 @@ static DEVICE_ATTR_RO(physical_package_id);
 define_id_show_func(die_id);
 static DEVICE_ATTR_RO(die_id);

+define_id_show_func(cluster_id);
+static DEVICE_ATTR_RO(cluster_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);

@@ -61,6 +64,10 @@ define_siblings_show_func(core_siblings, core_cpumask);
 static DEVICE_ATTR_RO(core_siblings);
 static DEVICE_ATTR_RO(core_siblings_list);

+define_siblings_show_func(cluster_cpus, cluster_cpumask);
+static DEVICE_ATTR_RO(cluster_cpus);
+static DEVICE_ATTR_RO(cluster_cpus_list);
+
 define_siblings_show_func(die_cpus, die_cpumask);
 static DEVICE_ATTR_RO(die_cpus);
 static DEVICE_ATTR_RO(die_cpus_list);
@@ -88,6 +95,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list);
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_die_id.attr,
+	&dev_attr_cluster_id.attr,
 	&dev_attr_core_id.attr,
 	&dev_attr_thread_siblings.attr,
 	&dev_attr_thread_siblings_list.attr,
@@ -95,6 +103,8 @@ static struct attribute *default_attrs[] = {
 	&dev_attr_core_cpus_list.attr,
 	&dev_attr_core_siblings.attr,
 	&dev_attr_core_siblings_list.attr,
+	&dev_attr_cluster_cpus.attr,
+	&dev_attr_cluster_cpus_list.attr,
 	&dev_attr_die_cpus.attr,
 	&dev_attr_die_cpus_list.attr,
 	&dev_attr_package_cpus.attr,
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index a3db177329c2..9045dfb6d19c 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1350,6 +1350,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
 int acpi_pptt_init(void);
 int acpi_pptt_cpu_is_thread(unsigned int cpu);
 int find_acpi_cpu_topology(unsigned int cpu, int level);
+int find_acpi_cpu_topology_cluster(unsigned int cpu);
 int find_acpi_cpu_topology_package(unsigned int cpu);
 int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
 int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@@ -1362,6 +1363,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
 {
 	return -EINVAL;
 }
+static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	return -EINVAL;
+}
 static inline int find_acpi_cpu_topology_package(unsigned int cpu)
 {
 	return -EINVAL;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 0f6cd6b73a61..987c7ea75291 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -49,10 +49,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
 struct cpu_topology {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int package_id;
 	int llc_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	cpumask_t cluster_sibling;
 	cpumask_t llc_sibling;
 };

@@ -60,13 +62,16 @@ struct cpu_topology {
 extern struct cpu_topology cpu_topology[NR_CPUS];

 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7634cd737061..80d27d717631 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)	((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)	((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)	((void)(cpu), 0)
 #endif
@@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)	cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)	cpumask_of(cpu)
 #endif
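As a usage sketch for the new accessors (hypothetical kernel-side code, not part of this patch), the cluster cpumask can be walked like any other topology mask:

  #include <linux/cpumask.h>
  #include <linux/printk.h>
  #include <linux/topology.h>

  /* Log every CPU that shares a cluster with @cpu. */
  static void print_cluster_peers(int cpu)
  {
          int sibling;

          for_each_cpu(sibling, topology_cluster_cpumask(cpu))
                  pr_info("cpu%d shares a cluster with cpu%d\n",
                          sibling, cpu);
  }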
From: Barry Song <song.bao.hua@hisilicon.com>

mainline inclusion
from tip/sched/core for v5.16
commit: 778c558f49a2cb3dc7b18a80ff515e82aa813627
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4H33U
CVE: NA

Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
------------------------------------------------------------------------
This patch adds a scheduler level for clusters and automatically enables load balancing among clusters. It will directly benefit many workloads that love more resources such as memory bandwidth and caches.
Testing has been done widely on two different hardware configurations of Kunpeng 920:

 24 cores in one NUMA node (6 clusters in each NUMA node);
 32 cores in one NUMA node (8 clusters in each NUMA node).

Workloads are run on either one NUMA node or four NUMA nodes; thus, this can estimate the effect of cluster spreading with and without NUMA load balancing.
* Stream benchmark:
4threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
Thus, it could help memory-bound workloads, especially under medium load. A similar improvement is also seen in lkp-pbzip2:
* lkp-pbzip2 benchmark
2-96 threads (on 4NUMA * 24cores = 96cores)
                  lkp-pbzip2               lkp-pbzip2
                  w/o patch                w/ patch
Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*

2-24 threads (on 1NUMA * 24cores = 24cores)
                  lkp-pbzip2               lkp-pbzip2
                  w/o patch                w/ patch
Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*
In the case of 6 threads and 8 threads, we see the greatest performance improvement.
A similar but smaller improvement can be seen on lkp-pixz:
* lkp-pixz benchmark
2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                  lkp-pixz               lkp-pixz
                  w/o patch              w/ patch
Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*
* SPECrate benchmark
4,8,16 copies mcf_r (on 1NUMA * 32cores = 32cores)
             Base                 Base
             Run Time             Rate
             -------              ---------
 4 Copies    w/o 580 (w/ 570)     w/o 11.1 (w/ 11.3)
 8 Copies    w/o 647 (w/ 605)     w/o 20.0 (w/ 21.4, +7%)
16 Copies    w/o 844 (w/ 844)     w/o 30.6 (w/ 30.6)

32 Copies (on 4NUMA * 32 cores = 128cores)

[w/o patch]
                  Base     Base       Base
Benchmarks       Copies  Run Time     Rate
---------------  ------  --------  ---------
500.perlbench_r      32       584       87.2  *
502.gcc_r            32       503       90.2  *
505.mcf_r            32       745       69.4  *
520.omnetpp_r        32      1031       40.7  *
523.xalancbmk_r      32       597       56.6  *
525.x264_r            1        --             CE
531.deepsjeng_r      32       336      109    *
541.leela_r          32       556       95.4  *
548.exchange2_r      32       513      163    *
557.xz_r             32       530       65.2  *
 Est. SPECrate2017_int_base             80.3

[w/ patch]
                  Base     Base       Base
Benchmarks       Copies  Run Time     Rate
---------------  ------  --------  ---------
500.perlbench_r      32       580       87.8 (+0.688%)  *
502.gcc_r            32       477       95.1 (+5.432%)  *
505.mcf_r            32       644       80.3 (+13.574%) *
520.omnetpp_r        32       942       44.6 (+9.58%)   *
523.xalancbmk_r      32       560       60.4 (+6.714%)  *
525.x264_r            1        --                       CE
531.deepsjeng_r      32       337      109   (+0.000%)  *
541.leela_r          32       554       95.6 (+0.210%)  *
548.exchange2_r      32       515      163   (+0.000%)  *
557.xz_r             32       524       66.0 (+1.227%)  *
 Est. SPECrate2017_int_base             83.7 (+4.062%)
On the other hand, it is slightly helpful to CPU-bound tasks like kernbench:
* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)

                     kernbench              kernbench
                     w/o cluster            w/ cluster
Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
Note this patch isn't a universal win; it might hurt workloads that can benefit from packing. While the kernel is not aware of clusters, tasks that want to take advantage of the lower communication latency within one cluster won't necessarily be packed into one cluster, though they have some chance of being randomly packed there. This patch will make them more likely to be spread.
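For workloads that do want packing, user space can still pin threads to a single cluster. A minimal sketch (not part of this patch; it assumes the 4-CPU-per-cluster Kunpeng 920 layout above):

  /* Pin the calling thread to one 4-CPU cluster (CPUs 0-3). */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t set;
          int cpu;

          CPU_ZERO(&set);
          for (cpu = 0; cpu < 4; cpu++)
                  CPU_SET(cpu, &set);
          if (sched_setaffinity(0, sizeof(set), &set)) {
                  perror("sched_setaffinity");
                  return 1;
          }
          /* ... run the communication-heavy work here ... */
          return 0;
  }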
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: tao zeng <prime.zeng@hisilicon.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/Kconfig             | 9 +++++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 28 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0b58ad078a34..507fdcb74153 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -998,6 +998,15 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.

+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters of CPUs.
+	  Cluster usually means a couple of CPUs which are placed closely
+	  by sharing mid-level caches, last-level cache tags or internal
+	  busses.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index e50f2b0a2444..a80826bfef44 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif

+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d717631..0b3704ad13c8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif

+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b22cdbffe1dd..9b4e3b25ddff 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1525,6 +1525,11 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
From: Adam Borowski <kilobyte@angband.pl>

mainline inclusion
from mainline-5.15-rc3
commit 892ab4bbd063cfe7f6bbb183e6be69d9907a61de
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
------------------------------------------------- gcc knows the true length too, and rightfully complains.
Link: https://lkml.kernel.org/r/20210912204447.10427-1-kilobyte@angband.pl
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Cc: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 892ab4bbd063cfe7f6bbb183e6be69d9907a61de)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/dbgfs-test.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/damon/dbgfs-test.h b/mm/damon/dbgfs-test.h
index 930e83bceef0..4eddcfa73996 100644
--- a/mm/damon/dbgfs-test.h
+++ b/mm/damon/dbgfs-test.h
@@ -20,27 +20,27 @@ static void damon_dbgfs_test_str_to_target_ids(struct kunit *test)
 	ssize_t nr_integers = 0, i;

 	question = "123";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers);
 	KUNIT_EXPECT_EQ(test, 123ul, answers[0]);
 	kfree(answers);

 	question = "123abc";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers);
 	KUNIT_EXPECT_EQ(test, 123ul, answers[0]);
 	kfree(answers);

 	question = "a123";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
 	kfree(answers);

 	question = "12 35";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers);
 	for (i = 0; i < nr_integers; i++)
@@ -48,7 +48,7 @@ static void damon_dbgfs_test_str_to_target_ids(struct kunit *test)
 	kfree(answers);

 	question = "12 35 46";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)3, nr_integers);
 	for (i = 0; i < nr_integers; i++)
@@ -56,7 +56,7 @@ static void damon_dbgfs_test_str_to_target_ids(struct kunit *test)
 	kfree(answers);

 	question = "12 35 abc 46";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers);
 	for (i = 0; i < 2; i++)
@@ -64,13 +64,13 @@ static void damon_dbgfs_test_str_to_target_ids(struct kunit *test)
 	kfree(answers);

 	question = "";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
 	kfree(answers);

 	question = "\n";
-	answers = str_to_target_ids(question, strnlen(question, 128),
+	answers = str_to_target_ids(question, strlen(question),
 			&nr_integers);
 	KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
 	kfree(answers);
From: SeongJae Park <sj@kernel.org>

mainline inclusion
from mainline-5.15
commit 2e014660b3e4b7bd0e75f845cdcf745c0f632889
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
The kunit test cases for 'damon_split_regions_of()' expect the number of regions after calling the function to be the same as requested ('nr_sub'). However, the requested number is just an upper limit, because the function randomly decides the size of each sub-region.
This fixes the wrong expectation.
Link: https://lkml.kernel.org/r/20211028090628.14948-1-sj@kernel.org
Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2e014660b3e4b7bd0e75f845cdcf745c0f632889)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/core-test.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h
index c938a9c34e6c..7008c3735e99 100644
--- a/mm/damon/core-test.h
+++ b/mm/damon/core-test.h
@@ -219,14 +219,14 @@ static void damon_test_split_regions_of(struct kunit *test)
 	r = damon_new_region(0, 22);
 	damon_add_region(r, t);
 	damon_split_regions_of(c, t, 2);
-	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 2u);
+	KUNIT_EXPECT_LE(test, damon_nr_regions(t), 2u);
 	damon_free_target(t);

 	t = damon_new_target(42);
 	r = damon_new_region(0, 220);
 	damon_add_region(r, t);
 	damon_split_regions_of(c, t, 4);
-	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 4u);
+	KUNIT_EXPECT_LE(test, damon_nr_regions(t), 4u);
 	damon_free_target(t);
 	damon_destroy_ctx(c);
 }
From: Geert Uytterhoeven <geert@linux-m68k.org>

mainline inclusion
from mainline-5.16
commit f24b0626076783d56ef41c6459fedf70ab6dcbd0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
------------------------------------------------- Correct a singular versus plural grammar mistake in the help text for the DAMON_VADDR config symbol.
Link: https://lkml.kernel.org/r/20210914073451.3883834-1-geert@linux-m68k.org
Fixes: 3f49584b262cf8f4 ("mm/damon: implement primitives for the virtual memory address spaces")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f24b0626076783d56ef41c6459fedf70ab6dcbd0)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig
index 37024798a97c..ba8898c7eb8e 100644
--- a/mm/damon/Kconfig
+++ b/mm/damon/Kconfig
@@ -30,7 +30,7 @@ config DAMON_VADDR
 	select PAGE_IDLE_FLAG
 	help
 	  This builds the default data access monitoring primitives for DAMON
-	  that works for virtual address spaces.
+	  that work for virtual address spaces.

 config DAMON_VADDR_KUNIT_TEST
 	bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS
From: SeongJae Park <sjpark@amazon.de>

mainline inclusion
from mainline-5.16
commit ad782c48df326eb13cf5dec7aab571b44be3e415
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
Most memory management user guide documents are in 'admin-guide/mm/', but two of them are in 'vm/'. This moves the two docs into 'admin-guide/mm' so the documents are easier to find.
Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ad782c48df326eb13cf5dec7aab571b44be3e415)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 Documentation/admin-guide/mm/index.rst   |  2 ++
 .../{vm => admin-guide/mm}/swap_numa.rst  |  0
 .../{vm => admin-guide/mm}/zswap.rst      |  0
 Documentation/vm/index.rst                | 26 ++++---------------
 4 files changed, 7 insertions(+), 21 deletions(-)
 rename Documentation/{vm => admin-guide/mm}/swap_numa.rst (100%)
 rename Documentation/{vm => admin-guide/mm}/zswap.rst (100%)
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 32c27fbf1913..db38553224c8 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -37,5 +37,7 @@ the Linux memory management.
    numaperf
    pagemap
    soft-dirty
+   swap_numa
    transhuge
    userfaultfd
+   zswap
diff --git a/Documentation/vm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst
similarity index 100%
rename from Documentation/vm/swap_numa.rst
rename to Documentation/admin-guide/mm/swap_numa.rst
diff --git a/Documentation/vm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
similarity index 100%
rename from Documentation/vm/zswap.rst
rename to Documentation/admin-guide/mm/zswap.rst
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index b51f0d8992f8..6f5ffef4b716 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -3,27 +3,11 @@ Linux Memory Management Documentation
 =====================================

 This is a collection of documents about the Linux memory management (mm)
-subsystem.  If you are looking for advice on simply allocating memory,
-see the :ref:`memory_allocation`.
-
-User guides for MM features
-===========================
-
-The following documents provide guides for controlling and tuning
-various features of the Linux memory management
-
-.. toctree::
-   :maxdepth: 1
-
-   swap_numa
-   zswap
-
-Kernel developers MM documentation
-==================================
-
-The below documents describe MM internals with different level of
-details ranging from notes and mailing list responses to elaborate
-descriptions of data structures and algorithms.
+subsystem internals with different level of details ranging from notes and
+mailing list responses for elaborating descriptions of data structures and
+algorithms.  If you are looking for advice on simply allocating memory, see the
+:ref:`memory_allocation`.  For controlling and tuning guides, see the
+:doc:`admin guide <../admin-guide/mm/index>`.

 .. toctree::
    :maxdepth: 1
From: SeongJae Park <sj@kernel.org>

mainline inclusion
from mainline-5.16
commit f9803a9918468d70d98d31c23cb7613d873e723f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
This updates SeongJae's email address in the MAINTAINERS file to his preferred one.
Link: https://lkml.kernel.org/r/20210917123958.3819-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f9803a9918468d70d98d31c23cb7613d873e723f)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 16a81020ead0..1b004a81ab15 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4873,7 +4873,7 @@ F:	net/ax25/ax25_timer.c
 F:	net/ax25/sysctl_net_ax25.c

 DATA ACCESS MONITOR
-M:	SeongJae Park <sjpark@amazon.de>
+M:	SeongJae Park <sj@kernel.org>
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/admin-guide/mm/damon/
From: SeongJae Park <sjpark@amazon.de>

mainline inclusion
from mainline-5.16
commit 876d0aac2e3af10fbaf1c7a814840c71e470dc5c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
-------------------------------------------------
Building the DAMON documents warns about a reference to a nonexistent document, as below:

  $ time make htmldocs
  [...]
  Documentation/vm/damon/index.rst:24: WARNING: toctree contains reference to nonexisting document 'vm/damon/plans'
This fixes the warning by removing the wrong reference.
Link: https://lkml.kernel.org/r/20210917123958.3819-4-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 876d0aac2e3af10fbaf1c7a814840c71e470dc5c)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 Documentation/vm/damon/index.rst | 1 -
 1 file changed, 1 deletion(-)
diff --git a/Documentation/vm/damon/index.rst b/Documentation/vm/damon/index.rst
index a2858baf3bf1..48c0bbff98b2 100644
--- a/Documentation/vm/damon/index.rst
+++ b/Documentation/vm/damon/index.rst
@@ -27,4 +27,3 @@ workloads and systems.
    faq
    design
    api
-   plans
From: SeongJae Park <sjpark@amazon.de>

mainline inclusion
from mainline-5.16
commit d2f272b35a84ace2ef04334a9822fd726a7f061b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
------------------------------------------------- A few Kernel-doc comments in 'damon.h' are broken. This fixes them.
Link: https://lkml.kernel.org/r/20210917123958.3819-5-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d2f272b35a84ace2ef04334a9822fd726a7f061b)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/damon.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index d68b67b8d458..755d70804705 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -62,7 +62,7 @@ struct damon_target {
 struct damon_ctx;

 /**
- * struct damon_primitive	Monitoring primitives for given use cases.
+ * struct damon_primitive - Monitoring primitives for given use cases.
  *
  * @init:	Initialize primitive-internal data structures.
  * @update:	Update primitive-internal data structures.
@@ -108,8 +108,8 @@ struct damon_primitive {
 	void (*cleanup)(struct damon_ctx *context);
 };

-/*
- * struct damon_callback	Monitoring events notification callbacks.
+/**
+ * struct damon_callback - Monitoring events notification callbacks.
  *
  * @before_start:	Called before starting the monitoring.
  * @after_sampling:	Called after each sampling.
From: SeongJae Park <sj@kernel.org>

mainline inclusion
from mainline-5.16
commit 704571f997424ecd64b10b37ca6097e65690240a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
Logging of kdamond startup is using 'pr_info()' unnecessarily. This makes it use 'pr_debug()' instead.
Link: https://lkml.kernel.org/r/20210917123958.3819-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 704571f997424ecd64b10b37ca6097e65690240a)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 30e9211f494a..874558a790a0 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -653,7 +653,7 @@ static int kdamond_fn(void *data)
 	unsigned long sz_limit = 0;

 	mutex_lock(&ctx->kdamond_lock);
-	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
+	pr_debug("kdamond (%d) starts\n", ctx->kdamond->pid);
 	mutex_unlock(&ctx->kdamond_lock);

 	if (ctx->primitive.init)
From: Changbin Du <changbin.du@gmail.com>

mainline inclusion
from mainline-5.16
commit 5f7fe2b9b827662cf349ab45406d6cbf0cc6251f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
------------------------------------------------- Just return from the kthread function.
Link: https://lkml.kernel.org/r/20210927232421.17694-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5f7fe2b9b827662cf349ab45406d6cbf0cc6251f)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 874558a790a0..61a9e3b37bc9 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -714,7 +714,7 @@ static int kdamond_fn(void *data)
 	nr_running_ctxs--;
 	mutex_unlock(&damon_lock);

-	do_exit(0);
+	return 0;
 }

 #include "core-test.h"
From: Changbin Du <changbin.du@gmail.com>

mainline inclusion
from mainline-5.16
commit 42e4cef5fe48333e0db6e98b019edf5f2c2f11fd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
Just get the PID via 'current->pid'. Meanwhile, to be symmetrical, make the 'starts' and 'finishes' logs both use the debug level.
Link: https://lkml.kernel.org/r/20210927232432.17750-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42e4cef5fe48333e0db6e98b019edf5f2c2f11fd)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/core.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 61a9e3b37bc9..8171e7dddc30 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -652,9 +652,7 @@ static int kdamond_fn(void *data)
 	unsigned int max_nr_accesses = 0;
 	unsigned long sz_limit = 0;

-	mutex_lock(&ctx->kdamond_lock);
-	pr_debug("kdamond (%d) starts\n", ctx->kdamond->pid);
-	mutex_unlock(&ctx->kdamond_lock);
+	pr_debug("kdamond (%d) starts\n", current->pid);

 	if (ctx->primitive.init)
 		ctx->primitive.init(ctx);
@@ -705,7 +703,7 @@ static int kdamond_fn(void *data)
 	if (ctx->primitive.cleanup)
 		ctx->primitive.cleanup(ctx);

-	pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
+	pr_debug("kdamond (%d) finishes\n", current->pid);
 	mutex_lock(&ctx->kdamond_lock);
 	ctx->kdamond = NULL;
 	mutex_unlock(&ctx->kdamond_lock);
From: Colin Ian King <colin.king@canonical.com>

mainline inclusion
from mainline-5.16
commit 7ec1992b891e59dba0f04e0327980786e8f61b13
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
------------------------------------------------- Currently a plain integer is being used to nullify the pointer ctx->kdamond. Use NULL instead. Cleans up sparse warning:
mm/damon/core.c:317:40: warning: Using plain integer as NULL pointer
Link: https://lkml.kernel.org/r/20210925215908.181226-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7ec1992b891e59dba0f04e0327980786e8f61b13)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 mm/damon/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 8171e7dddc30..d993db50280c 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -314,7 +314,7 @@ static int __damon_start(struct damon_ctx *ctx)
 				nr_running_ctxs);
 		if (IS_ERR(ctx->kdamond)) {
 			err = PTR_ERR(ctx->kdamond);
-			ctx->kdamond = 0;
+			ctx->kdamond = NULL;
 		}
 	}
 	mutex_unlock(&ctx->kdamond_lock);
From: SeongJae Park <sj@kernel.org>

mainline inclusion
from mainline-5.16
commit fda504fade7f124858d7022341dc46ff35b45274
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
------------------------------------------------- Patch series "Implement Data Access Monitoring-based Memory Operation Schemes".
Introduction
============
DAMON[1] can be used as a primitive for data access-aware memory management optimizations. For that, users who want such optimizations should run DAMON, read the monitoring results, analyze them, plan a new memory management scheme, and apply the new scheme by themselves. Such effort will be inevitable for some complicated optimizations.

However, in many other cases, the users would simply want the system to apply a memory management action to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB that has kept only rare accesses for more than 2 minutes", or "Do not use THP for a memory region larger than 2 MiB rarely accessed for more than 1 second".

To make such work easier and non-redundant, this patchset implements a new feature of DAMON, which is called Data Access Monitoring-based Operation Schemes (DAMOS). Using the feature, users can describe the normal schemes in a simple way and ask DAMON to execute those on its own.
[1] https://damonitor.github.io
Evaluations
===========
DAMOS is accurate and useful for memory management optimizations. An experimental DAMON-based operation scheme for THP, 'ethp', removes 76.15% of THP memory overheads while preserving 51.25% of the THP speedup. Another experimental DAMON-based 'proactive reclamation' implementation, 'prcl', reduces resident sets by 93.38% and the system memory footprint by 23.63% while incurring only 1.22% runtime overhead in the best case (parsec3/freqmine).

NOTE that the experimental THP optimization and proactive reclamation are not for production but only proofs of concept.
Please refer to the showcase web site's evaluation document[1] for detailed evaluation setup and results.
[1] https://damonitor.github.io/doc/html/v34/vm/damon/eval.html
Long-term Support Trees
-----------------------
For people who want to test DAMON but use LTS kernels, there are two more trees based on the two latest LTS kernels and containing the 'damon/master' backports:
- For v5.4.y:  https://git.kernel.org/sj/h/damon/for-v5.4.y
- For v5.10.y: https://git.kernel.org/sj/h/damon/for-v5.10.y
Sequence Of Patches
===================
The 1st patch accounts the age of each region. The 2nd patch implements the core of the DAMON-based operation schemes feature. The 3rd patch makes the default monitoring primitives for virtual address spaces support the schemes. From this point, kernel space users can use DAMOS. The 4th patch exports the feature to user space via the debugfs interface. The 5th patch implements a schemes statistics feature for easier tuning of the schemes and runtime access pattern analysis, and the 6th patch adds selftests for these changes. Finally, the 7th patch documents this new feature.
This patch (of 7):
DAMON can be used for data access pattern-aware memory management optimizations. For that, users should run DAMON, read the monitoring results, analyze them, plan a new memory management scheme, and apply the new scheme by themselves. It would not be too hard, but still requires some level of effort. For complicated cases, this effort is inevitable.

That said, in many cases, users would simply want to apply an action to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB but having a low access frequency for more than 10 minutes", or "Use THP for a memory region larger than 2 MiB having a high access frequency for more than 2 seconds".

For such optimizations, users would need to first account the age of each region themselves. To reduce this effort, this patch implements simple age accounting for each region in DAMON. For each aggregation step, DAMON compares the access frequency with that from the last aggregation and resets the age of the region if the change is significant; otherwise, the age is incremented. Also, when regions are merged, the region size-weighted average of the ages is set as the age of the merged new region; a toy illustration of that arithmetic follows below.
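A toy user-space illustration (not kernel code) of the size-weighted merge described above; merging a 30-page region of age 10 with a 10-page region of age 2 yields age (10*30 + 2*10) / (30+10) = 8:

  #include <stdio.h>

  /* Mirror of the merge arithmetic used by damon_merge_two_regions(). */
  static unsigned int merged_age(unsigned long sz_l, unsigned int age_l,
                                 unsigned long sz_r, unsigned int age_r)
  {
          return (age_l * sz_l + age_r * sz_r) / (sz_l + sz_r);
  }

  int main(void)
  {
          printf("%u\n", merged_age(30, 10, 10, 2)); /* prints 8 */
          return 0;
  }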
Link: https://lkml.kernel.org/r/20211001125604.29660-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211001125604.29660-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rienjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fda504fade7f124858d7022341dc46ff35b45274)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/damon.h | 10 ++++++++++
 mm/damon/core.c       | 13 +++++++++++++
 2 files changed, 23 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 755d70804705..3e8215debbd4 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -31,12 +31,22 @@ struct damon_addr_range {
  * @sampling_addr:	Address of the sample for the next access check.
  * @nr_accesses:	Access frequency of this region.
  * @list:		List head for siblings.
+ * @age:		Age of this region.
+ *
+ * @age is initially zero, increased for each aggregation interval, and reset
+ * to zero again if the access frequency is significantly changed.  If two
+ * regions are merged into a new region, both @nr_accesses and @age of the new
+ * region are set as region size-weighted average of those of the two regions.
  */
 struct damon_region {
 	struct damon_addr_range ar;
 	unsigned long sampling_addr;
 	unsigned int nr_accesses;
 	struct list_head list;
+
+	unsigned int age;
+/* private: Internal value for age calculation. */
+	unsigned int last_nr_accesses;
 };

 /**
diff --git a/mm/damon/core.c b/mm/damon/core.c
index d993db50280c..3efbe80779db 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -45,6 +45,9 @@ struct damon_region *damon_new_region(unsigned long start, unsigned long end)
 	region->nr_accesses = 0;
 	INIT_LIST_HEAD(&region->list);

+	region->age = 0;
+	region->last_nr_accesses = 0;
+
 	return region;
 }

@@ -444,6 +447,7 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)

 		damon_for_each_region(r, t) {
 			trace_damon_aggregated(t, r, damon_nr_regions(t));
+			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
 		}
 	}
@@ -461,6 +465,7 @@ static void damon_merge_two_regions(struct damon_target *t,

 	l->nr_accesses = (l->nr_accesses * sz_l + r->nr_accesses * sz_r) /
 			(sz_l + sz_r);
+	l->age = (l->age * sz_l + r->age * sz_r) / (sz_l + sz_r);
 	l->ar.end = r->ar.end;
 	damon_destroy_region(r, t);
 }
@@ -480,6 +485,11 @@ static void damon_merge_regions_of(struct damon_target *t, unsigned int thres,
 	struct damon_region *r, *prev = NULL, *next;

 	damon_for_each_region_safe(r, next, t) {
+		if (diff_of(r->nr_accesses, r->last_nr_accesses) > thres)
+			r->age = 0;
+		else
+			r->age++;
+
 		if (prev && prev->ar.end == r->ar.start &&
 		    diff_of(prev->nr_accesses, r->nr_accesses) <= thres &&
 		    sz_damon_region(prev) + sz_damon_region(r) <= sz_limit)
@@ -527,6 +537,9 @@ static void damon_split_region_at(struct damon_ctx *ctx,

 	r->ar.end = new->ar.start;

+	new->age = r->age;
+	new->last_nr_accesses = r->last_nr_accesses;
+
 	damon_insert_region(new, r, damon_next_region(r), t);
 }
From: SeongJae Park <sj@kernel.org>

mainline inclusion
from mainline-5.16
commit 1f366e421c8f69583ed37b56d86e3747331869c3
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
CVE: NA
In many cases, users might use DAMON for simple data access-aware memory management optimizations such as applying an operation scheme to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB but having a low access frequency for more than 10 minutes", or "Use THP for a memory region larger than 2 MiB having a high access frequency for more than 2 seconds".

The simplest form of the solution would be doing offline data access pattern profiling using DAMON and modifying the application source code or system configuration based on the profiling results. Alternatively, developing a daemon constructed with two modules (one for access monitoring and the other for applying memory management actions via mlock(), madvise(), sysctl, etc.) is imaginable.

To avoid users spending their time implementing such simple data access monitoring-based operation schemes, this makes DAMON handle such schemes directly. With this change, users can simply specify their desired schemes to DAMON. Then, DAMON will automatically apply the schemes to the user-specified target processes.
Each of the schemes is composed of conditions for filtering the target memory regions and the desired memory management action for the target. Specifically, the format is::
<min/max size> <min/max access frequency> <min/max age> <action>
The filtering conditions are the size of the memory region, the number of accesses to the region monitored by DAMON, and the age of the region. The age of a region is incremented periodically but reset when its addresses or access frequency has significantly changed, or when the action of a scheme was applied. For the action, the current implementation supports a few madvise()-like hints: ``WILLNEED``, ``COLD``, ``PAGEOUT``, ``HUGEPAGE``, and ``NOHUGEPAGE``.

Because DAMON supports various address spaces and the application of the actions to a monitoring target region is dependent on the type of the target address space, the application code should be implemented for each primitive and registered to the framework. Note that this only implements the framework part; the following commit will implement the action applications for the virtual address space primitives.
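As a kernel-side usage sketch (hypothetical; based only on the interface declared in this patch), the first example above could be expressed with damon_new_scheme(). Ages are counted in aggregation intervals, so 6000 assumes the default 100 ms interval:

  /*
   * "Page out regions larger than 100 MiB that stayed idle for at
   * least 10 minutes."  Assumes damon_new_scheme() returns NULL on
   * allocation failure.
   */
  static int example_install_scheme(struct damon_ctx *ctx)
  {
          struct damos *s = damon_new_scheme(
                          100 * 1024 * 1024,  /* min_sz_region: 100 MiB */
                          ULONG_MAX,          /* max_sz_region: unlimited */
                          0, 0,               /* min/max_nr_accesses: idle */
                          6000,               /* min_age_region: ~10 minutes */
                          UINT_MAX,           /* max_age_region: unlimited */
                          DAMOS_PAGEOUT);

          if (!s)
                  return -ENOMEM;
          damon_add_scheme(ctx, s);
          return 0;
  }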
Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f366e421c8f69583ed37b56d86e3747331869c3)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 include/linux/damon.h |  66 +++++++++++++++++++++++++
 mm/damon/core.c       | 109 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 175 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 3e8215debbd4..dbe18b0fb795 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -69,6 +69,48 @@ struct damon_target {
 	struct list_head list;
 };

+/**
+ * enum damos_action - Represents an action of a Data Access Monitoring-based
+ * Operation Scheme.
+ *
+ * @DAMOS_WILLNEED:	Call ``madvise()`` for the region with MADV_WILLNEED.
+ * @DAMOS_COLD:		Call ``madvise()`` for the region with MADV_COLD.
+ * @DAMOS_PAGEOUT:	Call ``madvise()`` for the region with MADV_PAGEOUT.
+ * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
+ * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
+ */
+enum damos_action {
+	DAMOS_WILLNEED,
+	DAMOS_COLD,
+	DAMOS_PAGEOUT,
+	DAMOS_HUGEPAGE,
+	DAMOS_NOHUGEPAGE,
+};
+
+/**
+ * struct damos - Represents a Data Access Monitoring-based Operation Scheme.
+ * @min_sz_region:	Minimum size of target regions.
+ * @max_sz_region:	Maximum size of target regions.
+ * @min_nr_accesses:	Minimum ``->nr_accesses`` of target regions.
+ * @max_nr_accesses:	Maximum ``->nr_accesses`` of target regions.
+ * @min_age_region:	Minimum age of target regions.
+ * @max_age_region:	Maximum age of target regions.
+ * @action:		&damos_action to be applied to the target regions.
+ * @list:		List head for siblings.
+ *
+ * Note that both the minimums and the maximums are inclusive.
+ */
+struct damos {
+	unsigned long min_sz_region;
+	unsigned long max_sz_region;
+	unsigned int min_nr_accesses;
+	unsigned int max_nr_accesses;
+	unsigned int min_age_region;
+	unsigned int max_age_region;
+	enum damos_action action;
+	struct list_head list;
+};
+
 struct damon_ctx;

 /**
@@ -79,6 +121,7 @@ struct damon_ctx;
  * @prepare_access_checks:	Prepare next access check of target regions.
  * @check_accesses:		Check the accesses to target regions.
  * @reset_aggregated:		Reset aggregated accesses monitoring results.
+ * @apply_scheme:		Apply a DAMON-based operation scheme.
  * @target_valid:		Determine if the target is valid.
  * @cleanup:			Clean up the context.
  *
@@ -104,6 +147,9 @@ struct damon_ctx;
  * of its update.  The value will be used for regions adjustment threshold.
  * @reset_aggregated should reset the access monitoring results that aggregated
  * by @check_accesses.
+ * @apply_scheme is called from @kdamond when a region for user provided
+ * DAMON-based operation scheme is found.  It should apply the scheme's action
+ * to the region.  This is not used for &DAMON_ARBITRARY_TARGET case.
  * @target_valid should check whether the target is still valid for the
  * monitoring.
  * @cleanup is called from @kdamond just before its termination.
@@ -114,6 +160,8 @@ struct damon_primitive {
 	void (*prepare_access_checks)(struct damon_ctx *context);
 	unsigned int (*check_accesses)(struct damon_ctx *context);
 	void (*reset_aggregated)(struct damon_ctx *context);
+	int (*apply_scheme)(struct damon_ctx *context, struct damon_target *t,
+			struct damon_region *r, struct damos *scheme);
 	bool (*target_valid)(void *target);
 	void (*cleanup)(struct damon_ctx *context);
 };
@@ -192,6 +240,7 @@ struct damon_callback {
 * @min_nr_regions:	The minimum number of adaptive monitoring regions.
 * @max_nr_regions:	The maximum number of adaptive monitoring regions.
 * @adaptive_targets:	Head of monitoring targets (&damon_target) list.
+ * @schemes:		Head of schemes (&damos) list.
 */
 struct damon_ctx {
 	unsigned long sample_interval;
@@ -213,6 +262,7 @@ struct damon_ctx {
 	unsigned long min_nr_regions;
 	unsigned long max_nr_regions;
 	struct list_head adaptive_targets;
+	struct list_head schemes;
 };

 #define damon_next_region(r) \
@@ -233,6 +283,12 @@ struct damon_ctx {
 #define damon_for_each_target_safe(t, next, ctx)	\
 	list_for_each_entry_safe(t, next, &(ctx)->adaptive_targets, list)

+#define damon_for_each_scheme(s, ctx) \
+	list_for_each_entry(s, &(ctx)->schemes, list)
+
+#define damon_for_each_scheme_safe(s, next, ctx) \
+	list_for_each_entry_safe(s, next, &(ctx)->schemes, list)
+
 #ifdef CONFIG_DAMON

 struct damon_region *damon_new_region(unsigned long start, unsigned long end);
@@ -242,6 +298,14 @@ inline void damon_insert_region(struct damon_region *r,
 void damon_add_region(struct damon_region *r, struct damon_target *t);
 void damon_destroy_region(struct damon_region *r, struct damon_target *t);

+struct damos *damon_new_scheme(
+		unsigned long min_sz_region, unsigned long max_sz_region,
+		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
+		unsigned int min_age_region, unsigned int max_age_region,
+		enum damos_action action);
+void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
+void damon_destroy_scheme(struct damos *s);
+
 struct damon_target *damon_new_target(unsigned long id);
 void damon_add_target(struct damon_ctx *ctx, struct damon_target *t);
 void damon_free_target(struct damon_target *t);
@@ -255,6 +319,8 @@ int damon_set_targets(struct damon_ctx *ctx,
 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
 		unsigned long aggr_int, unsigned long primitive_upd_int,
 		unsigned long min_nr_reg, unsigned long max_nr_reg);
+int damon_set_schemes(struct damon_ctx *ctx,
+			struct damos **schemes, ssize_t nr_schemes);
 int damon_nr_running_ctxs(void);

 int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 3efbe80779db..0ed97b21cbb6 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -85,6 +85,50 @@ void damon_destroy_region(struct damon_region *r, struct damon_target *t)
 	damon_free_region(r);
 }
+struct damos *damon_new_scheme( + unsigned long min_sz_region, unsigned long max_sz_region, + unsigned int min_nr_accesses, unsigned int max_nr_accesses, + unsigned int min_age_region, unsigned int max_age_region, + enum damos_action action) +{ + struct damos *scheme; + + scheme = kmalloc(sizeof(*scheme), GFP_KERNEL); + if (!scheme) + return NULL; + scheme->min_sz_region = min_sz_region; + scheme->max_sz_region = max_sz_region; + scheme->min_nr_accesses = min_nr_accesses; + scheme->max_nr_accesses = max_nr_accesses; + scheme->min_age_region = min_age_region; + scheme->max_age_region = max_age_region; + scheme->action = action; + INIT_LIST_HEAD(&scheme->list); + + return scheme; +} + +void damon_add_scheme(struct damon_ctx *ctx, struct damos *s) +{ + list_add_tail(&s->list, &ctx->schemes); +} + +static void damon_del_scheme(struct damos *s) +{ + list_del(&s->list); +} + +static void damon_free_scheme(struct damos *s) +{ + kfree(s); +} + +void damon_destroy_scheme(struct damos *s) +{ + damon_del_scheme(s); + damon_free_scheme(s); +} + /* * Construct a damon_target struct * @@ -156,6 +200,7 @@ struct damon_ctx *damon_new_ctx(void) ctx->max_nr_regions = 1000;
INIT_LIST_HEAD(&ctx->adaptive_targets); + INIT_LIST_HEAD(&ctx->schemes);
return ctx; } @@ -175,7 +220,13 @@ static void damon_destroy_targets(struct damon_ctx *ctx)
void damon_destroy_ctx(struct damon_ctx *ctx) { + struct damos *s, *next_s; + damon_destroy_targets(ctx); + + damon_for_each_scheme_safe(s, next_s, ctx) + damon_destroy_scheme(s); + kfree(ctx); }
@@ -250,6 +301,30 @@ int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, return 0; }
+/** + * damon_set_schemes() - Set data access monitoring based operation schemes. + * @ctx: monitoring context + * @schemes: array of the schemes + * @nr_schemes: number of entries in @schemes + * + * This function should not be called while the kdamond of the context is + * running. + * + * Return: 0 if success, or negative error code otherwise. + */ +int damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes, + ssize_t nr_schemes) +{ + struct damos *s, *next; + ssize_t i; + + damon_for_each_scheme_safe(s, next, ctx) + damon_destroy_scheme(s); + for (i = 0; i < nr_schemes; i++) + damon_add_scheme(ctx, schemes[i]); + return 0; +} + /** * damon_nr_running_ctxs() - Return number of currently running contexts. */ @@ -453,6 +528,39 @@ static void kdamond_reset_aggregated(struct damon_ctx *c) } }
+static void damon_do_apply_schemes(struct damon_ctx *c, + struct damon_target *t, + struct damon_region *r) +{ + struct damos *s; + unsigned long sz; + + damon_for_each_scheme(s, c) { + sz = r->ar.end - r->ar.start; + if (sz < s->min_sz_region || s->max_sz_region < sz) + continue; + if (r->nr_accesses < s->min_nr_accesses || + s->max_nr_accesses < r->nr_accesses) + continue; + if (r->age < s->min_age_region || s->max_age_region < r->age) + continue; + if (c->primitive.apply_scheme) + c->primitive.apply_scheme(c, t, r, s); + r->age = 0; + } +} + +static void kdamond_apply_schemes(struct damon_ctx *c) +{ + struct damon_target *t; + struct damon_region *r; + + damon_for_each_target(t, c) { + damon_for_each_region(r, t) + damon_do_apply_schemes(c, t, r); + } +} + #define sz_damon_region(r) (r->ar.end - r->ar.start)
/* @@ -693,6 +801,7 @@ static int kdamond_fn(void *data) if (ctx->callback.after_aggregation && ctx->callback.after_aggregation(ctx)) set_kdamond_stop(ctx); + kdamond_apply_schemes(ctx); kdamond_reset_aggregated(ctx); kdamond_split_regions(ctx); if (ctx->primitive.reset_aggregated)
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 6dea8add4d2875b80843e4a4c8acd334a4db8c8f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This makes DAMON's default primitives for virtual address spaces support DAMON-based Operation Schemes (DAMOS) by implementing an action application function and registering it to the monitoring context. The implementation simply maps the related DAMOS actions to 'madvise()' calls. That is, 'madvise(MADV_WILLNEED)' is called for the 'WILLNEED' DAMOS action, and similarly for the other actions ('COLD', 'PAGEOUT', 'HUGEPAGE', 'NOHUGEPAGE').
So, kernel space DAMON users can now use the DAMON-based optimizations with only a small amount of code, for example as sketched below.
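This is a sketch under assumptions: 'target_id' must be a pointer to a reference-counted 'struct pid' cast to unsigned long (as the vaddr primitives expect), and example_register_pageout_scheme() is the hypothetical helper from the earlier sketch::

    static int example_start_vaddr_damos(unsigned long target_id)
    {
            struct damon_ctx *ctx;
            int err;

            ctx = damon_new_ctx();
            if (!ctx)
                    return -ENOMEM;

            /*
             * Wire in the virtual address space primitives, including
             * the ->apply_scheme callback this commit adds.
             */
            damon_va_set_primitives(ctx);

            err = damon_set_targets(ctx, &target_id, 1);
            if (!err)
                    err = example_register_pageout_scheme(ctx);
            if (!err)
                    return damon_start(&ctx, 1);

            damon_destroy_ctx(ctx);
            return err;
    }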
Link: https://lkml.kernel.org/r/20211001125604.29660-4-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 6dea8add4d2875b80843e4a4c8acd334a4db8c8f) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 2 ++ mm/damon/vaddr.c | 56 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h index dbe18b0fb795..be6b6e81e8ee 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -337,6 +337,8 @@ void damon_va_prepare_access_checks(struct damon_ctx *ctx); unsigned int damon_va_check_accesses(struct damon_ctx *ctx); bool damon_va_target_valid(void *t); void damon_va_cleanup(struct damon_ctx *ctx); +int damon_va_apply_scheme(struct damon_ctx *context, struct damon_target *t, + struct damon_region *r, struct damos *scheme); void damon_va_set_primitives(struct damon_ctx *ctx);
#endif /* CONFIG_DAMON_VADDR */ diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 58c1fb2aafa9..3e1c74d36bab 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -7,6 +7,7 @@
#define pr_fmt(fmt) "damon-va: " fmt
+#include <asm-generic/mman-common.h> #include <linux/damon.h> #include <linux/hugetlb.h> #include <linux/mm.h> @@ -658,6 +659,60 @@ bool damon_va_target_valid(void *target) return false; }
+#ifndef CONFIG_ADVISE_SYSCALLS +static int damos_madvise(struct damon_target *target, struct damon_region *r, + int behavior) +{ + return -EINVAL; +} +#else +static int damos_madvise(struct damon_target *target, struct damon_region *r, + int behavior) +{ + struct mm_struct *mm; + int ret = -ENOMEM; + + mm = damon_get_mm(target); + if (!mm) + goto out; + + ret = do_madvise(mm, PAGE_ALIGN(r->ar.start), + PAGE_ALIGN(r->ar.end - r->ar.start), behavior); + mmput(mm); +out: + return ret; +} +#endif /* CONFIG_ADVISE_SYSCALLS */ + +int damon_va_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, + struct damon_region *r, struct damos *scheme) +{ + int madv_action; + + switch (scheme->action) { + case DAMOS_WILLNEED: + madv_action = MADV_WILLNEED; + break; + case DAMOS_COLD: + madv_action = MADV_COLD; + break; + case DAMOS_PAGEOUT: + madv_action = MADV_PAGEOUT; + break; + case DAMOS_HUGEPAGE: + madv_action = MADV_HUGEPAGE; + break; + case DAMOS_NOHUGEPAGE: + madv_action = MADV_NOHUGEPAGE; + break; + default: + pr_warn("Wrong action %d\n", scheme->action); + return -EINVAL; + } + + return damos_madvise(t, r, madv_action); +} + void damon_va_set_primitives(struct damon_ctx *ctx) { ctx->primitive.init = damon_va_init; @@ -667,6 +722,7 @@ void damon_va_set_primitives(struct damon_ctx *ctx) ctx->primitive.reset_aggregated = NULL; ctx->primitive.target_valid = damon_va_target_valid; ctx->primitive.cleanup = NULL; + ctx->primitive.apply_scheme = damon_va_apply_scheme; }
#include "vaddr-test.h"
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit af122dd8f3c0099349bc98ff69f0d90efd8b149f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This makes 'damon-dbgfs' support the data access monitoring-oriented memory management schemes. Users can read and update the schemes using the ``<debugfs>/damon/schemes`` file. The format is::
<min/max size> <min/max access frequency> <min/max age> <action>
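As an illustration, a userspace program could install a single scheme through this file roughly as below. This sketch assumes debugfs is mounted at /sys/kernel/debug and that no kdamond is currently running (writes fail with -EBUSY otherwise); the numbers are arbitrary::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /*
             * "4096 8192 0 5 10 20 2": for regions of 4-8 KiB showing
             * 0-5 accesses per aggregation interval for 10-20
             * aggregation intervals, apply action 2 (DAMOS_PAGEOUT).
             */
            const char *scheme = "4096 8192 0 5 10 20 2\n";
            int fd = open("/sys/kernel/debug/damon/schemes", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, scheme, strlen(scheme)) < 0)
                    perror("write");
            close(fd);
            return 0;
    }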
Link: https://lkml.kernel.org/r/20211001125604.29660-5-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit af122dd8f3c0099349bc98ff69f0d90efd8b149f) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 162 insertions(+), 3 deletions(-)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index faee070977d8..78b7a04490c5 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -98,6 +98,159 @@ static ssize_t dbgfs_attrs_write(struct file *file, return ret; }
+static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len) +{ + struct damos *s; + int written = 0; + int rc; + + damon_for_each_scheme(s, c) { + rc = scnprintf(&buf[written], len - written, + "%lu %lu %u %u %u %u %d\n", + s->min_sz_region, s->max_sz_region, + s->min_nr_accesses, s->max_nr_accesses, + s->min_age_region, s->max_age_region, + s->action); + if (!rc) + return -ENOMEM; + + written += rc; + } + return written; +} + +static ssize_t dbgfs_schemes_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct damon_ctx *ctx = file->private_data; + char *kbuf; + ssize_t len; + + kbuf = kmalloc(count, GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + mutex_lock(&ctx->kdamond_lock); + len = sprint_schemes(ctx, kbuf, count); + mutex_unlock(&ctx->kdamond_lock); + if (len < 0) + goto out; + len = simple_read_from_buffer(buf, count, ppos, kbuf, len); + +out: + kfree(kbuf); + return len; +} + +static void free_schemes_arr(struct damos **schemes, ssize_t nr_schemes) +{ + ssize_t i; + + for (i = 0; i < nr_schemes; i++) + kfree(schemes[i]); + kfree(schemes); +} + +static bool damos_action_valid(int action) +{ + switch (action) { + case DAMOS_WILLNEED: + case DAMOS_COLD: + case DAMOS_PAGEOUT: + case DAMOS_HUGEPAGE: + case DAMOS_NOHUGEPAGE: + return true; + default: + return false; + } +} + +/* + * Converts a string into an array of struct damos pointers + * + * Returns an array of struct damos pointers that converted if the conversion + * success, or NULL otherwise. + */ +static struct damos **str_to_schemes(const char *str, ssize_t len, + ssize_t *nr_schemes) +{ + struct damos *scheme, **schemes; + const int max_nr_schemes = 256; + int pos = 0, parsed, ret; + unsigned long min_sz, max_sz; + unsigned int min_nr_a, max_nr_a, min_age, max_age; + unsigned int action; + + schemes = kmalloc_array(max_nr_schemes, sizeof(scheme), + GFP_KERNEL); + if (!schemes) + return NULL; + + *nr_schemes = 0; + while (pos < len && *nr_schemes < max_nr_schemes) { + ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u%n", + &min_sz, &max_sz, &min_nr_a, &max_nr_a, + &min_age, &max_age, &action, &parsed); + if (ret != 7) + break; + if (!damos_action_valid(action)) { + pr_err("wrong action %d\n", action); + goto fail; + } + + pos += parsed; + scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a, + min_age, max_age, action); + if (!scheme) + goto fail; + + schemes[*nr_schemes] = scheme; + *nr_schemes += 1; + } + return schemes; +fail: + free_schemes_arr(schemes, *nr_schemes); + return NULL; +} + +static ssize_t dbgfs_schemes_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct damon_ctx *ctx = file->private_data; + char *kbuf; + struct damos **schemes; + ssize_t nr_schemes = 0, ret = count; + int err; + + kbuf = user_input_str(buf, count, ppos); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); + + schemes = str_to_schemes(kbuf, ret, &nr_schemes); + if (!schemes) { + ret = -EINVAL; + goto out; + } + + mutex_lock(&ctx->kdamond_lock); + if (ctx->kdamond) { + ret = -EBUSY; + goto unlock_out; + } + + err = damon_set_schemes(ctx, schemes, nr_schemes); + if (err) + ret = err; + else + nr_schemes = 0; +unlock_out: + mutex_unlock(&ctx->kdamond_lock); + free_schemes_arr(schemes, nr_schemes); +out: + kfree(kbuf); + return ret; +} + static inline bool targetid_is_pid(const struct damon_ctx *ctx) { return ctx->primitive.target_valid == damon_va_target_valid; @@ -279,6 +432,12 @@ static const struct file_operations attrs_fops = { .write = dbgfs_attrs_write, };
+static const struct file_operations schemes_fops = { + .open = damon_dbgfs_open, + .read = dbgfs_schemes_read, + .write = dbgfs_schemes_write, +}; + static const struct file_operations target_ids_fops = { .open = damon_dbgfs_open, .read = dbgfs_target_ids_read, @@ -292,10 +451,10 @@ static const struct file_operations kdamond_pid_fops = {
static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx) { - const char * const file_names[] = {"attrs", "target_ids", + const char * const file_names[] = {"attrs", "schemes", "target_ids", "kdamond_pid"}; - const struct file_operations *fops[] = {&attrs_fops, &target_ids_fops, - &kdamond_pid_fops}; + const struct file_operations *fops[] = {&attrs_fops, &schemes_fops, + &target_ids_fops, &kdamond_pid_fops}; int i;
for (i = 0; i < ARRAY_SIZE(file_names); i++)
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 2f0b548c9f03a78f4ce6ab48986e3108028936a6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- To tune the DAMON-based operation schemes, knowing how many and how large regions are affected by each scheme will be helpful. Those stats could be used not only for the tuning, but also for monitoring the working set size and the number of regions, if the scheme does not change the program behavior too much.
For this reason, this implements statistics for the schemes. The total number and size of the regions that each scheme is applied to are exported to users via '->stat_count' and '->stat_sz' of 'struct damos'. Admins can also check the numbers by reading the 'schemes' debugfs file; the last two integers of each line now represent the stats. To allow collecting the stats without changing the program behavior, this also adds a new scheme action, 'DAMOS_STAT'. Note that 'DAMOS_STAT' not only performs no memory operation, but also does not reset the age of regions.
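As a sketch of the monitoring use case, a kernel-side user could install a 'DAMOS_STAT' scheme that matches every region accessed at least once per aggregation interval; the difference of '->stat_sz' between two reads one aggregation interval apart then approximates the working set size. The bounds below are illustrative assumptions::

    static struct damos *example_new_stat_scheme(void)
    {
            /*
             * Match regions of any size and age that were accessed at
             * least once per aggregation interval.  DAMOS_STAT only
             * accumulates ->stat_count and ->stat_sz, and skips the
             * age reset that the other actions trigger.
             */
            return damon_new_scheme(0, ULONG_MAX, 1, UINT_MAX,
                            0, UINT_MAX, DAMOS_STAT);
    }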
Link: https://lkml.kernel.org/r/20211001125604.29660-6-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 2f0b548c9f03a78f4ce6ab48986e3108028936a6) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 10 +++++++++- mm/damon/core.c | 7 ++++++- mm/damon/dbgfs.c | 5 +++-- mm/damon/vaddr.c | 2 ++ 4 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index be6b6e81e8ee..f301bb53381c 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -78,6 +78,7 @@ struct damon_target { * @DAMOS_PAGEOUT: Call ``madvise()`` for the region with MADV_PAGEOUT. * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. + * @DAMOS_STAT: Do nothing but count the stat. */ enum damos_action { DAMOS_WILLNEED, @@ -85,6 +86,7 @@ enum damos_action { DAMOS_PAGEOUT, DAMOS_HUGEPAGE, DAMOS_NOHUGEPAGE, + DAMOS_STAT, /* Do nothing but only record the stat */ };
/** @@ -96,9 +98,13 @@ enum damos_action { * @min_age_region: Minimum age of target regions. * @max_age_region: Maximum age of target regions. * @action: &damo_action to be applied to the target regions. + * @stat_count: Total number of regions that this scheme is applied. + * @stat_sz: Total size of regions that this scheme is applied. * @list: List head for siblings. * - * Note that both the minimums and the maximums are inclusive. + * For each aggregation interval, DAMON applies @action to monitoring target + * regions fit in the condition and updates the statistics. Note that both + * the minimums and the maximums are inclusive. */ struct damos { unsigned long min_sz_region; @@ -108,6 +114,8 @@ struct damos { unsigned int min_age_region; unsigned int max_age_region; enum damos_action action; + unsigned long stat_count; + unsigned long stat_sz; struct list_head list; };
diff --git a/mm/damon/core.c b/mm/damon/core.c index 0ed97b21cbb6..2f6785737902 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -103,6 +103,8 @@ struct damos *damon_new_scheme( scheme->min_age_region = min_age_region; scheme->max_age_region = max_age_region; scheme->action = action; + scheme->stat_count = 0; + scheme->stat_sz = 0; INIT_LIST_HEAD(&scheme->list);
return scheme; @@ -544,9 +546,12 @@ static void damon_do_apply_schemes(struct damon_ctx *c, continue; if (r->age < s->min_age_region || s->max_age_region < r->age) continue; + s->stat_count++; + s->stat_sz += sz; if (c->primitive.apply_scheme) c->primitive.apply_scheme(c, t, r, s); - r->age = 0; + if (s->action != DAMOS_STAT) + r->age = 0; } }
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 78b7a04490c5..28d6abf27763 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -106,11 +106,11 @@ static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len)
damon_for_each_scheme(s, c) { rc = scnprintf(&buf[written], len - written, - "%lu %lu %u %u %u %u %d\n", + "%lu %lu %u %u %u %u %d %lu %lu\n", s->min_sz_region, s->max_sz_region, s->min_nr_accesses, s->max_nr_accesses, s->min_age_region, s->max_age_region, - s->action); + s->action, s->stat_count, s->stat_sz); if (!rc) return -ENOMEM;
@@ -159,6 +159,7 @@ static bool damos_action_valid(int action) case DAMOS_PAGEOUT: case DAMOS_HUGEPAGE: case DAMOS_NOHUGEPAGE: + case DAMOS_STAT: return true; default: return false; diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 3e1c74d36bab..953c145b4f08 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -705,6 +705,8 @@ int damon_va_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, case DAMOS_NOHUGEPAGE: madv_action = MADV_NOHUGEPAGE; break; + case DAMOS_STAT: + return 0; default: pr_warn("Wrong action %d\n", scheme->action); return -EINVAL;
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 8d5d4c6359054f3e680e1a2caca50e9b6d688b7d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This adds simple selftests for the 'schemes' debugfs file of DAMON.
Link: https://lkml.kernel.org/r/20211001125604.29660-7-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 8d5d4c6359054f3e680e1a2caca50e9b6d688b7d) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/testing/selftests/damon/debugfs_attrs.sh | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/tools/testing/selftests/damon/debugfs_attrs.sh b/tools/testing/selftests/damon/debugfs_attrs.sh index bfabb19dc0d3..639cfb6a1f65 100644 --- a/tools/testing/selftests/damon/debugfs_attrs.sh +++ b/tools/testing/selftests/damon/debugfs_attrs.sh @@ -57,6 +57,19 @@ test_write_fail "$file" "1 2 3 5 4" "$orig_content" \ test_content "$file" "$orig_content" "1 2 3 4 5" "successfully written" echo "$orig_content" > "$file"
+# Test schemes file +# ================= + +file="$DBGFS/schemes" +orig_content=$(cat "$file") + +test_write_succ "$file" "1 2 3 4 5 6 4" \ + "$orig_content" "valid input" +test_write_fail "$file" "1 2 +3 4 5 6 3" "$orig_content" "multi lines" +test_write_succ "$file" "" "$orig_content" "disabling" +echo "$orig_content" > "$file" + # Test target_ids file # ====================
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 68536f8e01e571f553f78fa058ba543de3834452 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This adds a description of the DAMON-based operation schemes to the DAMON documents.
Link: https://lkml.kernel.org/r/20211001125604.29660-8-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 68536f8e01e571f553f78fa058ba543de3834452) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/start.rst | 11 +++++ Documentation/admin-guide/mm/damon/usage.rst | 51 +++++++++++++++++++- 2 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index d5eb89a8fc38..51503cf90ca2 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -108,6 +108,17 @@ the results as separate image files. :: You can view the visualizations of this example workload at [1]_. Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
+ +Data Access Pattern Aware Memory Management +=========================================== + +Below three commands make every memory region of size >=4K that doesn't +accessed for >=60 seconds in your workload to be swapped out. :: + + $ echo "#min-size max-size min-acc max-acc min-age max-age action" > scheme + $ echo "4K max 0 0 60s max pageout" >> scheme + $ damo schemes -c my_thp_scheme <pid of your workload> + .. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#vis... .. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html .. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index a72cda374aba..c0296c14babf 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -34,8 +34,8 @@ the reason, this document describes only the debugfs interface debugfs Interface =================
-DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under -its debugfs directory, ``<debugfs>/damon/``. +DAMON exports four files, ``attrs``, ``target_ids``, ``schemes`` and +``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes @@ -74,6 +74,53 @@ check it again:: Note that setting the target ids doesn't start the monitoring.
+Schemes +------- + +For usual DAMON-based data access aware memory management optimizations, users +would simply want the system to apply a memory management action to a memory +region of a specific size having a specific access frequency for a specific +time. DAMON receives such formalized operation schemes from the user and +applies those to the target processes. It also counts the total number and +size of regions that each scheme is applied. This statistics can be used for +online analysis or tuning of the schemes. + +Users can get and set the schemes by reading from and writing to ``schemes`` +debugfs file. Reading the file also shows the statistics of each scheme. To +the file, each of the schemes should be represented in each line in below form: + + min-size max-size min-acc max-acc min-age max-age action + +Note that the ranges are closed interval. Bytes for the size of regions +(``min-size`` and ``max-size``), number of monitored accesses per aggregate +interval for access frequency (``min-acc`` and ``max-acc``), number of +aggregate intervals for the age of regions (``min-age`` and ``max-age``), and a +predefined integer for memory management actions should be used. The supported +numbers and their meanings are as below. + + - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED`` + - 1: Call ``madvise()`` for the region with ``MADV_COLD`` + - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` + - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` + - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` + - 5: Do nothing but count the statistics + +You can disable schemes by simply writing an empty string to the file. For +example, below commands applies a scheme saying "If a memory region of size in +[4KiB, 8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate +interval in [10, 20], page out the region", check the entered scheme again, and +finally remove the scheme. :: + + # cd <debugfs>/damon + # echo "4096 8192 0 5 10 20 2" > schemes + # cat schemes + 4096 8192 0 5 10 20 2 0 0 + # echo > schemes + +The last two integers in the 4th line of above example is the total number and +the total size of the regions that the scheme is applied. + + Turning On/Off --------------
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 90bebce9fcd6488ba6b010af3a16a0a0d7e44cb6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Patch series "DAMON: Support Physical Memory Address Space Monitoring".
DAMON currently supports only virtual address space monitoring. It can, however, be easily extended to various use cases and address spaces by configuring its monitoring primitives layer with appropriate primitive implementations. This patchset implements monitoring primitives for physical address space monitoring using that structure.
The first 3 patches allow user space users to manually set the monitoring regions. The 1st patch implements the feature in 'damon-dbgfs'. Then, patches adding a unit test (the 2nd patch) and updating the documentation (the 3rd patch) follow.
The following 4 patches implement the physical address space monitoring primitives. The 4th patch makes some functions of the virtual address space primitives reusable. The 5th patch implements the physical address space monitoring primitives. The 6th patch links the primitives to 'damon-dbgfs'. Finally, the 7th patch documents these new features.
This patch (of 7):
Some 'damon-dbgfs' users would want to monitor only a part of the entire virtual memory address space. Users of the in-kernel programming interface could use the '->before_start()' callback or directly set the regions inside the context struct as they want, but 'damon-dbgfs' users cannot.
For that reason, this introduces a new debugfs file called 'init_regions'. 'damon-dbgfs' users can specify the initial monitoring target address regions they want by writing input to the file. Each line of the input should describe one region in the form below:
<pid> <start address> <end address>
Note that the regions will be updated to cover the entire memory mapped regions after a 'regions update interval' has passed. If you want the regions not to be updated after the initial setting, you could set the interval to a very long time, say, a few decades.
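Kernel-space users can get the same effect programmatically through damon_set_attrs(). The sketch below keeps the sampling and aggregation intervals at what damon_new_ctx() appears to use as defaults (treat the exact values and the microsecond unit as assumptions) and pushes the primitive update interval out as far as possible::

    static int example_pin_init_regions(struct damon_ctx *ctx)
    {
            /*
             * Sample every 5ms and aggregate every 100ms, but
             * effectively never run the primitive (regions) update,
             * so manually set regions are never rebuilt.
             */
            return damon_set_attrs(ctx, 5000, 100000, ULONG_MAX,
                            10, 1000);
    }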
Link: https://lkml.kernel.org/r/20211012205711.29216-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211012205711.29216-2-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Jonathan Corbet corbet@lwn.net Cc: David Hildenbrand david@redhat.com Cc: David Woodhouse dwmw@amazon.com Cc: Marco Elver elver@google.com Cc: Leonard Foerster foersleo@amazon.de Cc: Greg Thelen gthelen@google.com Cc: Markus Boehme markubo@amazon.de Cc: David Rienjes rientjes@google.com Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Cc: Brendan Higgins brendanhiggins@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 90bebce9fcd6488ba6b010af3a16a0a0d7e44cb6) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 154 insertions(+), 2 deletions(-)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 28d6abf27763..1cce53cd241d 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -394,6 +394,152 @@ static ssize_t dbgfs_target_ids_write(struct file *file, return ret; }
+static ssize_t sprint_init_regions(struct damon_ctx *c, char *buf, ssize_t len) +{ + struct damon_target *t; + struct damon_region *r; + int written = 0; + int rc; + + damon_for_each_target(t, c) { + damon_for_each_region(r, t) { + rc = scnprintf(&buf[written], len - written, + "%lu %lu %lu\n", + t->id, r->ar.start, r->ar.end); + if (!rc) + return -ENOMEM; + written += rc; + } + } + return written; +} + +static ssize_t dbgfs_init_regions_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct damon_ctx *ctx = file->private_data; + char *kbuf; + ssize_t len; + + kbuf = kmalloc(count, GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + mutex_lock(&ctx->kdamond_lock); + if (ctx->kdamond) { + mutex_unlock(&ctx->kdamond_lock); + len = -EBUSY; + goto out; + } + + len = sprint_init_regions(ctx, kbuf, count); + mutex_unlock(&ctx->kdamond_lock); + if (len < 0) + goto out; + len = simple_read_from_buffer(buf, count, ppos, kbuf, len); + +out: + kfree(kbuf); + return len; +} + +static int add_init_region(struct damon_ctx *c, + unsigned long target_id, struct damon_addr_range *ar) +{ + struct damon_target *t; + struct damon_region *r, *prev; + unsigned long id; + int rc = -EINVAL; + + if (ar->start >= ar->end) + return -EINVAL; + + damon_for_each_target(t, c) { + id = t->id; + if (targetid_is_pid(c)) + id = (unsigned long)pid_vnr((struct pid *)id); + if (id == target_id) { + r = damon_new_region(ar->start, ar->end); + if (!r) + return -ENOMEM; + damon_add_region(r, t); + if (damon_nr_regions(t) > 1) { + prev = damon_prev_region(r); + if (prev->ar.end > r->ar.start) { + damon_destroy_region(r, t); + return -EINVAL; + } + } + rc = 0; + } + } + return rc; +} + +static int set_init_regions(struct damon_ctx *c, const char *str, ssize_t len) +{ + struct damon_target *t; + struct damon_region *r, *next; + int pos = 0, parsed, ret; + unsigned long target_id; + struct damon_addr_range ar; + int err; + + damon_for_each_target(t, c) { + damon_for_each_region_safe(r, next, t) + damon_destroy_region(r, t); + } + + while (pos < len) { + ret = sscanf(&str[pos], "%lu %lu %lu%n", + &target_id, &ar.start, &ar.end, &parsed); + if (ret != 3) + break; + err = add_init_region(c, target_id, &ar); + if (err) + goto fail; + pos += parsed; + } + + return 0; + +fail: + damon_for_each_target(t, c) { + damon_for_each_region_safe(r, next, t) + damon_destroy_region(r, t); + } + return err; +} + +static ssize_t dbgfs_init_regions_write(struct file *file, + const char __user *buf, size_t count, + loff_t *ppos) +{ + struct damon_ctx *ctx = file->private_data; + char *kbuf; + ssize_t ret = count; + int err; + + kbuf = user_input_str(buf, count, ppos); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); + + mutex_lock(&ctx->kdamond_lock); + if (ctx->kdamond) { + ret = -EBUSY; + goto unlock_out; + } + + err = set_init_regions(ctx, kbuf, ret); + if (err) + ret = err; + +unlock_out: + mutex_unlock(&ctx->kdamond_lock); + kfree(kbuf); + return ret; +} + static ssize_t dbgfs_kdamond_pid_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { @@ -445,6 +591,12 @@ static const struct file_operations target_ids_fops = { .write = dbgfs_target_ids_write, };
+static const struct file_operations init_regions_fops = { + .open = damon_dbgfs_open, + .read = dbgfs_init_regions_read, + .write = dbgfs_init_regions_write, +}; + static const struct file_operations kdamond_pid_fops = { .open = damon_dbgfs_open, .read = dbgfs_kdamond_pid_read, @@ -453,9 +605,9 @@ static const struct file_operations kdamond_pid_fops = { static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx) { const char * const file_names[] = {"attrs", "schemes", "target_ids", - "kdamond_pid"}; + "init_regions", "kdamond_pid"}; const struct file_operations *fops[] = {&attrs_fops, &schemes_fops, - &target_ids_fops, &kdamond_pid_fops}; + &target_ids_fops, &init_regions_fops, &kdamond_pid_fops}; int i;
for (i = 0; i < ARRAY_SIZE(file_names); i++)
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 1c2e11bfa649cc07e6322b0e5ea3cdbada9c43c3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This adds another test case for the new feature, 'init_regions'.
Link: https://lkml.kernel.org/r/20211012205711.29216-3-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Reviewed-by: Brendan Higgins brendanhiggins@google.com Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 1c2e11bfa649cc07e6322b0e5ea3cdbada9c43c3) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs-test.h | 54 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+)
diff --git a/mm/damon/dbgfs-test.h b/mm/damon/dbgfs-test.h index 4eddcfa73996..104b22957616 100644 --- a/mm/damon/dbgfs-test.h +++ b/mm/damon/dbgfs-test.h @@ -109,9 +109,63 @@ static void damon_dbgfs_test_set_targets(struct kunit *test) dbgfs_destroy_ctx(ctx); }
+static void damon_dbgfs_test_set_init_regions(struct kunit *test) +{ + struct damon_ctx *ctx = damon_new_ctx(); + unsigned long ids[] = {1, 2, 3}; + /* Each line represents one region in ``<target id> <start> <end>`` */ + char * const valid_inputs[] = {"2 10 20\n 2 20 30\n2 35 45", + "2 10 20\n", + "2 10 20\n1 39 59\n1 70 134\n 2 20 25\n", + ""}; + /* Reading the file again will show sorted, clean output */ + char * const valid_expects[] = {"2 10 20\n2 20 30\n2 35 45\n", + "2 10 20\n", + "1 39 59\n1 70 134\n2 10 20\n2 20 25\n", + ""}; + char * const invalid_inputs[] = {"4 10 20\n", /* target not exists */ + "2 10 20\n 2 14 26\n", /* regions overlap */ + "1 10 20\n2 30 40\n 1 5 8"}; /* not sorted by address */ + char *input, *expect; + int i, rc; + char buf[256]; + + damon_set_targets(ctx, ids, 3); + + /* Put valid inputs and check the results */ + for (i = 0; i < ARRAY_SIZE(valid_inputs); i++) { + input = valid_inputs[i]; + expect = valid_expects[i]; + + rc = set_init_regions(ctx, input, strnlen(input, 256)); + KUNIT_EXPECT_EQ(test, rc, 0); + + memset(buf, 0, 256); + sprint_init_regions(ctx, buf, 256); + + KUNIT_EXPECT_STREQ(test, (char *)buf, expect); + } + /* Put invlid inputs and check the return error code */ + for (i = 0; i < ARRAY_SIZE(invalid_inputs); i++) { + input = invalid_inputs[i]; + pr_info("input: %s\n", input); + rc = set_init_regions(ctx, input, strnlen(input, 256)); + KUNIT_EXPECT_EQ(test, rc, -EINVAL); + + memset(buf, 0, 256); + sprint_init_regions(ctx, buf, 256); + + KUNIT_EXPECT_STREQ(test, (char *)buf, ""); + } + + damon_set_targets(ctx, NULL, 0); + damon_destroy_ctx(ctx); +} + static struct kunit_case damon_test_cases[] = { KUNIT_CASE(damon_dbgfs_test_str_to_target_ids), KUNIT_CASE(damon_dbgfs_test_set_targets), + KUNIT_CASE(damon_dbgfs_test_set_init_regions), {}, };
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit c2fe4987ed31c32591c9aea5a1e8e2540ce66e12 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This adds a description of the 'init_regions' feature to the DAMON usage document.
Link: https://lkml.kernel.org/r/20211012205711.29216-4-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Brendan Higgins brendanhiggins@google.com Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit c2fe4987ed31c32591c9aea5a1e8e2540ce66e12) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/usage.rst | 41 +++++++++++++++++++- 1 file changed, 39 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index c0296c14babf..f7d5cfbb50c2 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -34,8 +34,9 @@ the reason, this document describes only the debugfs interface debugfs Interface =================
-DAMON exports four files, ``attrs``, ``target_ids``, ``schemes`` and -``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``. +DAMON exports five files, ``attrs``, ``target_ids``, ``init_regions``, +``schemes`` and ``monitor_on`` under its debugfs directory, +``<debugfs>/damon/``.
Attributes @@ -74,6 +75,42 @@ check it again:: Note that setting the target ids doesn't start the monitoring.
+Initial Monitoring Target Regions +--------------------------------- + +In case of the debugfs based monitoring, DAMON automatically sets and updates +the monitoring target regions so that entire memory mappings of target +processes can be covered. However, users can want to limit the monitoring +region to specific address ranges, such as the heap, the stack, or specific +file-mapped area. Or, some users can know the initial access pattern of their +workloads and therefore want to set optimal initial regions for the 'adaptive +regions adjustment'. + +In such cases, users can explicitly set the initial monitoring target regions +as they want, by writing proper values to the ``init_regions`` file. Each line +of the input should represent one region in below form.:: + + <target id> <start address> <end address> + +The ``target id`` should already in ``target_ids`` file, and the regions should +be passed in address order. For example, below commands will set a couple of +address ranges, ``1-100`` and ``100-200`` as the initial monitoring target +region of process 42, and another couple of address ranges, ``20-40`` and +``50-100`` as that of process 4242.:: + + # cd <debugfs>/damon + # echo "42 1 100 + 42 100 200 + 4242 20 40 + 4242 50 100" > init_regions + +Note that this sets the initial monitoring target regions only. In case of +virtual memory monitoring, DAMON will automatically updates the boundary of the +regions after one ``regions update interval``. Therefore, users should set the +``regions update interval`` large enough in this case, if they don't want the +update. + + Schemes -------
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 46c3a0accdc48c86928157fd073e66807f338485 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This moves functions of the default virtual address space monitoring primitives that are commonly usable from other address spaces, such as the physical address space, into common files ('prmtv-common.c' and 'prmtv-common.h'). Those will be reused by the physical address space monitoring primitives, which will be implemented by a following commit.
[sj@kernel.org: include 'highmem.h' to fix a build failure] Link: https://lkml.kernel.org/r/20211014110848.5204-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211012205711.29216-5-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Brendan Higgins brendanhiggins@google.com Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 46c3a0accdc48c86928157fd073e66807f338485) Conflicts: mm/damon/vaddr.c (need to include <linux/sched/mm.h> for get_task_mm/mmput function) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/Makefile | 2 +- mm/damon/prmtv-common.c | 87 +++++++++++++++++++++++++++++++++++++++++ mm/damon/prmtv-common.h | 17 ++++++++ mm/damon/vaddr.c | 87 ++--------------------------------------- 4 files changed, 108 insertions(+), 85 deletions(-) create mode 100644 mm/damon/prmtv-common.c create mode 100644 mm/damon/prmtv-common.h
diff --git a/mm/damon/Makefile b/mm/damon/Makefile index fed4be3bace3..99b1bfe01ff5 100644 --- a/mm/damon/Makefile +++ b/mm/damon/Makefile @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_DAMON) := core.o -obj-$(CONFIG_DAMON_VADDR) += vaddr.o +obj-$(CONFIG_DAMON_VADDR) += prmtv-common.o vaddr.o obj-$(CONFIG_DAMON_DBGFS) += dbgfs.o diff --git a/mm/damon/prmtv-common.c b/mm/damon/prmtv-common.c new file mode 100644 index 000000000000..7e62ee54fb54 --- /dev/null +++ b/mm/damon/prmtv-common.c @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Common Primitives for Data Access Monitoring + * + * Author: SeongJae Park sj@kernel.org + */ + +#include <linux/mmu_notifier.h> +#include <linux/page_idle.h> +#include <linux/pagemap.h> +#include <linux/rmap.h> + +#include "prmtv-common.h" + +/* + * Get an online page for a pfn if it's in the LRU list. Otherwise, returns + * NULL. + * + * The body of this function is stolen from the 'page_idle_get_page()'. We + * steal rather than reuse it because the code is quite simple. + */ +struct page *damon_get_page(unsigned long pfn) +{ + struct page *page = pfn_to_online_page(pfn); + + if (!page || !PageLRU(page) || !get_page_unless_zero(page)) + return NULL; + + if (unlikely(!PageLRU(page))) { + put_page(page); + page = NULL; + } + return page; +} + +void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr) +{ + bool referenced = false; + struct page *page = damon_get_page(pte_pfn(*pte)); + + if (!page) + return; + + if (pte_young(*pte)) { + referenced = true; + *pte = pte_mkold(*pte); + } + +#ifdef CONFIG_MMU_NOTIFIER + if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE)) + referenced = true; +#endif /* CONFIG_MMU_NOTIFIER */ + + if (referenced) + set_page_young(page); + + set_page_idle(page); + put_page(page); +} + +void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + bool referenced = false; + struct page *page = damon_get_page(pmd_pfn(*pmd)); + + if (!page) + return; + + if (pmd_young(*pmd)) { + referenced = true; + *pmd = pmd_mkold(*pmd); + } + +#ifdef CONFIG_MMU_NOTIFIER + if (mmu_notifier_clear_young(mm, addr, + addr + ((1UL) << HPAGE_PMD_SHIFT))) + referenced = true; +#endif /* CONFIG_MMU_NOTIFIER */ + + if (referenced) + set_page_young(page); + + set_page_idle(page); + put_page(page); +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +} diff --git a/mm/damon/prmtv-common.h b/mm/damon/prmtv-common.h new file mode 100644 index 000000000000..7093d19e5d42 --- /dev/null +++ b/mm/damon/prmtv-common.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Common Primitives for Data Access Monitoring + * + * Author: SeongJae Park sj@kernel.org + */ + +#include <linux/damon.h> +#include <linux/random.h> + +/* Get a random number in [l, r) */ +#define damon_rand(l, r) (l + prandom_u32_max(r - l)) + +struct page *damon_get_page(unsigned long pfn); + +void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr); +void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr); diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 953c145b4f08..cdeca92a34dc 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -8,25 +8,20 @@ #define pr_fmt(fmt) "damon-va: " fmt
#include <asm-generic/mman-common.h> -#include <linux/damon.h> +#include <linux/highmem.h> #include <linux/hugetlb.h> -#include <linux/mm.h> #include <linux/mmu_notifier.h> -#include <linux/highmem.h> #include <linux/page_idle.h> #include <linux/pagewalk.h> -#include <linux/random.h> #include <linux/sched/mm.h> -#include <linux/slab.h> + +#include "prmtv-common.h"
#ifdef CONFIG_DAMON_VADDR_KUNIT_TEST #undef DAMON_MIN_REGION #define DAMON_MIN_REGION 1 #endif
-/* Get a random number in [l, r) */ -#define damon_rand(l, r) (l + prandom_u32_max(r - l)) - /* * 't->id' should be the pointer to the relevant 'struct pid' having reference * count. Caller must put the returned task, unless it is NULL. @@ -373,82 +368,6 @@ void damon_va_update(struct damon_ctx *ctx) } }
-/* - * Get an online page for a pfn if it's in the LRU list. Otherwise, returns - * NULL. - * - * The body of this function is stolen from the 'page_idle_get_page()'. We - * steal rather than reuse it because the code is quite simple. - */ -static struct page *damon_get_page(unsigned long pfn) -{ - struct page *page = pfn_to_online_page(pfn); - - if (!page || !PageLRU(page) || !get_page_unless_zero(page)) - return NULL; - - if (unlikely(!PageLRU(page))) { - put_page(page); - page = NULL; - } - return page; -} - -static void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, - unsigned long addr) -{ - bool referenced = false; - struct page *page = damon_get_page(pte_pfn(*pte)); - - if (!page) - return; - - if (pte_young(*pte)) { - referenced = true; - *pte = pte_mkold(*pte); - } - -#ifdef CONFIG_MMU_NOTIFIER - if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE)) - referenced = true; -#endif /* CONFIG_MMU_NOTIFIER */ - - if (referenced) - set_page_young(page); - - set_page_idle(page); - put_page(page); -} - -static void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, - unsigned long addr) -{ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - bool referenced = false; - struct page *page = damon_get_page(pmd_pfn(*pmd)); - - if (!page) - return; - - if (pmd_young(*pmd)) { - referenced = true; - *pmd = pmd_mkold(*pmd); - } - -#ifdef CONFIG_MMU_NOTIFIER - if (mmu_notifier_clear_young(mm, addr, - addr + ((1UL) << HPAGE_PMD_SHIFT))) - referenced = true; -#endif /* CONFIG_MMU_NOTIFIER */ - - if (referenced) - set_page_young(page); - - set_page_idle(page); - put_page(page); -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ -} - static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) {
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit a28397beb55b68bd0f15c6778540e8ae1bc26d21 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This implements the monitoring primitives for the physical memory address space. Internally, it uses the PTE Accessed bit, like the virtual address space monitoring primitives do. It supports only user memory pages, as idle page tracking does. If the monitoring target physical memory address range contains non-user memory pages, the access check simply treats those pages as not accessed.
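Since the declarations add no automatic region construction primitives for the physical address space, a kernel-side user would seed the target regions by hand, roughly as below. The target id (which the paddr primitives do not interpret) and the monitored range are illustrative assumptions::

    static int example_setup_paddr(struct damon_ctx *ctx)
    {
            struct damon_target *t;
            struct damon_region *r;

            damon_pa_set_primitives(ctx);

            t = damon_new_target(42);  /* id unused by paddr checks */
            if (!t)
                    return -ENOMEM;

            /* Monitor physical addresses in [4 GiB, 5 GiB). */
            r = damon_new_region(0x100000000UL, 0x140000000UL);
            if (!r) {
                    damon_free_target(t);
                    return -ENOMEM;
            }
            damon_add_region(r, t);
            damon_add_target(ctx, t);
            return 0;
    }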
Link: https://lkml.kernel.org/r/20211012205711.29216-6-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Brendan Higgins brendanhiggins@google.com Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit a28397beb55b68bd0f15c6778540e8ae1bc26d21) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 10 ++ mm/damon/Kconfig | 8 ++ mm/damon/Makefile | 1 + mm/damon/paddr.c | 224 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 243 insertions(+) create mode 100644 mm/damon/paddr.c
diff --git a/include/linux/damon.h b/include/linux/damon.h index f301bb53381c..715dadd21f7c 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -351,4 +351,14 @@ void damon_va_set_primitives(struct damon_ctx *ctx);
#endif /* CONFIG_DAMON_VADDR */
+#ifdef CONFIG_DAMON_PADDR + +/* Monitoring primitives for the physical memory address space */ +void damon_pa_prepare_access_checks(struct damon_ctx *ctx); +unsigned int damon_pa_check_accesses(struct damon_ctx *ctx); +bool damon_pa_target_valid(void *t); +void damon_pa_set_primitives(struct damon_ctx *ctx); + +#endif /* CONFIG_DAMON_PADDR */ + #endif /* _DAMON_H */ diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig index ba8898c7eb8e..2a5923be631e 100644 --- a/mm/damon/Kconfig +++ b/mm/damon/Kconfig @@ -32,6 +32,14 @@ config DAMON_VADDR This builds the default data access monitoring primitives for DAMON that work for virtual address spaces.
+config DAMON_PADDR + bool "Data access monitoring primitives for the physical address space" + depends on DAMON && MMU + select PAGE_IDLE_FLAG + help + This builds the default data access monitoring primitives for DAMON + that works for the physical address space. + config DAMON_VADDR_KUNIT_TEST bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS depends on DAMON_VADDR && KUNIT=y diff --git a/mm/damon/Makefile b/mm/damon/Makefile index 99b1bfe01ff5..8d9b0df79702 100644 --- a/mm/damon/Makefile +++ b/mm/damon/Makefile @@ -2,4 +2,5 @@
obj-$(CONFIG_DAMON) := core.o obj-$(CONFIG_DAMON_VADDR) += prmtv-common.o vaddr.o +obj-$(CONFIG_DAMON_PADDR) += prmtv-common.o paddr.o obj-$(CONFIG_DAMON_DBGFS) += dbgfs.o diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c new file mode 100644 index 000000000000..d7a2ecd09ed0 --- /dev/null +++ b/mm/damon/paddr.c @@ -0,0 +1,224 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * DAMON Primitives for The Physical Address Space + * + * Author: SeongJae Park sj@kernel.org + */ + +#define pr_fmt(fmt) "damon-pa: " fmt + +#include <linux/mmu_notifier.h> +#include <linux/page_idle.h> +#include <linux/pagemap.h> +#include <linux/rmap.h> + +#include "prmtv-common.h" + +static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma, + unsigned long addr, void *arg) +{ + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = addr, + }; + + while (page_vma_mapped_walk(&pvmw)) { + addr = pvmw.address; + if (pvmw.pte) + damon_ptep_mkold(pvmw.pte, vma->vm_mm, addr); + else + damon_pmdp_mkold(pvmw.pmd, vma->vm_mm, addr); + } + return true; +} + +static void damon_pa_mkold(unsigned long paddr) +{ + struct page *page = damon_get_page(PHYS_PFN(paddr)); + struct rmap_walk_control rwc = { + .rmap_one = __damon_pa_mkold, + .anon_lock = page_lock_anon_vma_read, + }; + bool need_lock; + + if (!page) + return; + + if (!page_mapped(page) || !page_rmapping(page)) { + set_page_idle(page); + goto out; + } + + need_lock = !PageAnon(page) || PageKsm(page); + if (need_lock && !trylock_page(page)) + goto out; + + rmap_walk(page, &rwc); + + if (need_lock) + unlock_page(page); + +out: + put_page(page); +} + +static void __damon_pa_prepare_access_check(struct damon_ctx *ctx, + struct damon_region *r) +{ + r->sampling_addr = damon_rand(r->ar.start, r->ar.end); + + damon_pa_mkold(r->sampling_addr); +} + +void damon_pa_prepare_access_checks(struct damon_ctx *ctx) +{ + struct damon_target *t; + struct damon_region *r; + + damon_for_each_target(t, ctx) { + damon_for_each_region(r, t) + __damon_pa_prepare_access_check(ctx, r); + } +} + +struct damon_pa_access_chk_result { + unsigned long page_sz; + bool accessed; +}; + +static bool __damon_pa_young(struct page *page, struct vm_area_struct *vma, + unsigned long addr, void *arg) +{ + struct damon_pa_access_chk_result *result = arg; + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = addr, + }; + + result->accessed = false; + result->page_sz = PAGE_SIZE; + while (page_vma_mapped_walk(&pvmw)) { + addr = pvmw.address; + if (pvmw.pte) { + result->accessed = pte_young(*pvmw.pte) || + !page_is_idle(page) || + mmu_notifier_test_young(vma->vm_mm, addr); + } else { +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + result->accessed = pmd_young(*pvmw.pmd) || + !page_is_idle(page) || + mmu_notifier_test_young(vma->vm_mm, addr); + result->page_sz = ((1UL) << HPAGE_PMD_SHIFT); +#else + WARN_ON_ONCE(1); +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + } + if (result->accessed) { + page_vma_mapped_walk_done(&pvmw); + break; + } + } + + /* If accessed, stop walking */ + return !result->accessed; +} + +static bool damon_pa_young(unsigned long paddr, unsigned long *page_sz) +{ + struct page *page = damon_get_page(PHYS_PFN(paddr)); + struct damon_pa_access_chk_result result = { + .page_sz = PAGE_SIZE, + .accessed = false, + }; + struct rmap_walk_control rwc = { + .arg = &result, + .rmap_one = __damon_pa_young, + .anon_lock = page_lock_anon_vma_read, + }; + bool need_lock; + + if (!page) + return false; + + if (!page_mapped(page) || !page_rmapping(page)) { + if 
(page_is_idle(page)) + result.accessed = false; + else + result.accessed = true; + put_page(page); + goto out; + } + + need_lock = !PageAnon(page) || PageKsm(page); + if (need_lock && !trylock_page(page)) { + put_page(page); + return NULL; + } + + rmap_walk(page, &rwc); + + if (need_lock) + unlock_page(page); + put_page(page); + +out: + *page_sz = result.page_sz; + return result.accessed; +} + +static void __damon_pa_check_access(struct damon_ctx *ctx, + struct damon_region *r) +{ + static unsigned long last_addr; + static unsigned long last_page_sz = PAGE_SIZE; + static bool last_accessed; + + /* If the region is in the last checked page, reuse the result */ + if (ALIGN_DOWN(last_addr, last_page_sz) == + ALIGN_DOWN(r->sampling_addr, last_page_sz)) { + if (last_accessed) + r->nr_accesses++; + return; + } + + last_accessed = damon_pa_young(r->sampling_addr, &last_page_sz); + if (last_accessed) + r->nr_accesses++; + + last_addr = r->sampling_addr; +} + +unsigned int damon_pa_check_accesses(struct damon_ctx *ctx) +{ + struct damon_target *t; + struct damon_region *r; + unsigned int max_nr_accesses = 0; + + damon_for_each_target(t, ctx) { + damon_for_each_region(r, t) { + __damon_pa_check_access(ctx, r); + max_nr_accesses = max(r->nr_accesses, max_nr_accesses); + } + } + + return max_nr_accesses; +} + +bool damon_pa_target_valid(void *t) +{ + return true; +} + +void damon_pa_set_primitives(struct damon_ctx *ctx) +{ + ctx->primitive.init = NULL; + ctx->primitive.update = NULL; + ctx->primitive.prepare_access_checks = damon_pa_prepare_access_checks; + ctx->primitive.check_accesses = damon_pa_check_accesses; + ctx->primitive.reset_aggregated = NULL; + ctx->primitive.target_valid = damon_pa_target_valid; + ctx->primitive.cleanup = NULL; + ctx->primitive.apply_scheme = NULL; +}
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit c026291ab88f02247999959d01182cb8eb6e6a5b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This makes 'damon-dbgfs' support the physical memory monitoring, in addition to the virtual memory monitoring.
Users can do the physical memory monitoring by writing a special keyword, 'paddr', to the 'target_ids' debugfs file. Then, DAMON will check for the special keyword and configure the monitoring context to run with the primitives for the physical address space.
Unlike the virtual memory monitoring, the monitoring target region is not automatically set. Therefore, users should also set the monitoring target address region using the 'init_regions' debugfs file.
Also, note that the physical memory monitoring will not be automatically terminated. The user should explicitly turn the monitoring off by writing 'off' to the 'monitor_on' debugfs file.
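As a rough end-to-end illustration of the workflow above (not part of the patch; the debugfs mount point, the example address range, and the helper are assumptions of this sketch), a user space program could drive the interface as below. The target id '42' matches the fake id that the 'target_ids' file reports for 'paddr':

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Write a string to a debugfs file; returns 0 on success. */
    static int write_file(const char *path, const char *buf)
    {
            int fd = open(path, O_WRONLY);
            int ret = -1;

            if (fd < 0)
                    return -1;
            if (write(fd, buf, strlen(buf)) == (ssize_t)strlen(buf))
                    ret = 0;
            close(fd);
            return ret;
    }

    int main(void)
    {
            /* Select the physical address space primitives. */
            if (write_file("/sys/kernel/debug/damon/target_ids", "paddr\n"))
                    return 1;
            /* Unlike vaddr, the target region must be set manually:
             * "<target id> <start address> <end address>". */
            if (write_file("/sys/kernel/debug/damon/init_regions",
                           "42 0x100000000 0x200000000\n"))
                    return 1;
            if (write_file("/sys/kernel/debug/damon/monitor_on", "on\n"))
                    return 1;
            /* ... consume the results, then stop explicitly; paddr
             * monitoring never terminates by itself. */
            return write_file("/sys/kernel/debug/damon/monitor_on", "off\n");
    }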
Link: https://lkml.kernel.org/r/20211012205711.29216-7-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Brendan Higgins brendanhiggins@google.com Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit c026291ab88f02247999959d01182cb8eb6e6a5b) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/Kconfig | 2 +- mm/damon/dbgfs.c | 21 ++++++++++++++++++--- 2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig index 2a5923be631e..ca33b289ebbe 100644 --- a/mm/damon/Kconfig +++ b/mm/damon/Kconfig @@ -54,7 +54,7 @@ config DAMON_VADDR_KUNIT_TEST
config DAMON_DBGFS bool "DAMON debugfs interface" - depends on DAMON_VADDR && DEBUG_FS + depends on DAMON_VADDR && DAMON_PADDR && DEBUG_FS help This builds the debugfs interface for DAMON. The user space admins can use the interface for arbitrary data access monitoring. diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 1cce53cd241d..38188347d8ab 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -339,6 +339,7 @@ static ssize_t dbgfs_target_ids_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { struct damon_ctx *ctx = file->private_data; + bool id_is_pid = true; char *kbuf, *nrs; unsigned long *targets; ssize_t nr_targets; @@ -351,6 +352,11 @@ static ssize_t dbgfs_target_ids_write(struct file *file, return PTR_ERR(kbuf);
nrs = kbuf; + if (!strncmp(kbuf, "paddr\n", count)) { + id_is_pid = false; + /* target id is meaningless here, but we set it just for fun */ + scnprintf(kbuf, count, "42 "); + }
targets = str_to_target_ids(nrs, ret, &nr_targets); if (!targets) { @@ -358,7 +364,7 @@ static ssize_t dbgfs_target_ids_write(struct file *file, goto out; }
- if (targetid_is_pid(ctx)) { + if (id_is_pid) { for (i = 0; i < nr_targets; i++) { targets[i] = (unsigned long)find_get_pid( (int)targets[i]); @@ -372,15 +378,24 @@ static ssize_t dbgfs_target_ids_write(struct file *file,
mutex_lock(&ctx->kdamond_lock); if (ctx->kdamond) { - if (targetid_is_pid(ctx)) + if (id_is_pid) dbgfs_put_pids(targets, nr_targets); ret = -EBUSY; goto unlock_out; }
+ /* remove targets with previously-set primitive */ + damon_set_targets(ctx, NULL, 0); + + /* Configure the context for the address space type */ + if (id_is_pid) + damon_va_set_primitives(ctx); + else + damon_pa_set_primitives(ctx); + err = damon_set_targets(ctx, targets, nr_targets); if (err) { - if (targetid_is_pid(ctx)) + if (id_is_pid) dbgfs_put_pids(targets, nr_targets); ret = err; }
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit c638072107f52ec35f292c97b6f3df9b9f2ed87d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This updates the DAMON documents for the physical memory address space monitoring support.
Link: https://lkml.kernel.org/r/20211012205711.29216-8-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Brendan Higgins brendanhiggins@google.com Cc: David Hildenbrand david@redhat.com Cc: David Rienjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit c638072107f52ec35f292c97b6f3df9b9f2ed87d) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/usage.rst | 25 +++++++++++++---- Documentation/vm/damon/design.rst | 29 ++++++++++++-------- Documentation/vm/damon/faq.rst | 5 ++-- 3 files changed, 40 insertions(+), 19 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index f7d5cfbb50c2..ed96bbf0daff 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -10,15 +10,16 @@ DAMON provides below three interfaces for different users. This is for privileged people such as system administrators who want a just-working human-friendly interface. Using this, users can use the DAMON’s major features in a human-friendly way. It may not be highly tuned for - special cases, though. It supports only virtual address spaces monitoring. + special cases, though. It supports both virtual and physical address spaces + monitoring. - *debugfs interface.* This is for privileged user space programmers who want more optimized use of DAMON. Using this, users can use DAMON’s major features by reading from and writing to special debugfs files. Therefore, you can write and use your personalized DAMON debugfs wrapper programs that reads/writes the debugfs files instead of you. The DAMON user space tool is also a reference - implementation of such programs. It supports only virtual address spaces - monitoring. + implementation of such programs. It supports both virtual and physical + address spaces monitoring. - *Kernel Space Programming Interface.* This is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by writing kernel space @@ -72,20 +73,34 @@ check it again:: # cat target_ids 42 4242
+Users can also monitor the physical memory address space of the system by +writing a special keyword, "``paddr\n``" to the file. Because physical address +space monitoring doesn't support multiple targets, reading the file will show a +fake value, ``42``, as below:: + + # cd <debugfs>/damon + # echo paddr > target_ids + # cat target_ids + 42 + Note that setting the target ids doesn't start the monitoring.
Initial Monitoring Target Regions ---------------------------------
-In case of the debugfs based monitoring, DAMON automatically sets and updates -the monitoring target regions so that entire memory mappings of target +In case of the virtual address space monitoring, DAMON automatically sets and +updates the monitoring target regions so that entire memory mappings of target processes can be covered. However, users can want to limit the monitoring region to specific address ranges, such as the heap, the stack, or specific file-mapped area. Or, some users can know the initial access pattern of their workloads and therefore want to set optimal initial regions for the 'adaptive regions adjustment'.
+In contrast, DAMON do not automatically sets and updates the monitoring target +regions in case of physical memory monitoring. Therefore, users should set the +monitoring target regions by themselves. + In such cases, users can explicitly set the initial monitoring target regions as they want, by writing proper values to the ``init_regions`` file. Each line of the input should represent one region in below form.:: diff --git a/Documentation/vm/damon/design.rst b/Documentation/vm/damon/design.rst index b05159c295f4..210f0f50efd8 100644 --- a/Documentation/vm/damon/design.rst +++ b/Documentation/vm/damon/design.rst @@ -35,13 +35,17 @@ two parts: 1. Identification of the monitoring target address range for the address space. 2. Access check of specific address range in the target space.
-DAMON currently provides the implementation of the primitives for only the -virtual address spaces. Below two subsections describe how it works. +DAMON currently provides the implementations of the primitives for the physical +and virtual address spaces. Below two subsections describe how those work.
VMA-based Target Address Range Construction -------------------------------------------
+This is only for the virtual address space primitives implementation. That for +the physical address space simply asks users to manually set the monitoring +target address ranges. + Only small parts in the super-huge virtual address space of the processes are mapped to the physical memory and accessed. Thus, tracking the unmapped address regions is just wasteful. However, because DAMON can deal with some @@ -71,15 +75,18 @@ to make a reasonable trade-off. Below shows this in detail:: PTE Accessed-bit Based Access Check -----------------------------------
-The implementation for the virtual address space uses PTE Accessed-bit for -basic access checks. It finds the relevant PTE Accessed bit from the address -by walking the page table for the target task of the address. In this way, the -implementation finds and clears the bit for next sampling target address and -checks whether the bit set again after one sampling period. This could disturb -other kernel subsystems using the Accessed bits, namely Idle page tracking and -the reclaim logic. To avoid such disturbances, DAMON makes it mutually -exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page -flags to solve the conflict with the reclaim logic, as Idle page tracking does. +Both of the implementations for physical and virtual address spaces use PTE +Accessed-bit for basic access checks. Only one difference is the way of +finding the relevant PTE Accessed bit(s) from the address. While the +implementation for the virtual address walks the page table for the target task +of the address, the implementation for the physical address walks every page +table having a mapping to the address. In this way, the implementations find +and clear the bit(s) for next sampling target address and checks whether the +bit(s) set again after one sampling period. This could disturb other kernel +subsystems using the Accessed bits, namely Idle page tracking and the reclaim +logic. To avoid such disturbances, DAMON makes it mutually exclusive with Idle +page tracking and uses ``PG_idle`` and ``PG_young`` page flags to solve the +conflict with the reclaim logic, as Idle page tracking does.
Address Space Independent Core Mechanisms diff --git a/Documentation/vm/damon/faq.rst b/Documentation/vm/damon/faq.rst index cb3d8b585a8b..11aea40eb328 100644 --- a/Documentation/vm/damon/faq.rst +++ b/Documentation/vm/damon/faq.rst @@ -36,10 +36,9 @@ constructions and actual access checks can be implemented and configured on the DAMON core by the users. In this way, DAMON users can monitor any address space with any access check technique.
-Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based +Nonetheless, DAMON provides vma/rmap tracking and PTE Accessed bit check based implementations of the address space dependent functions for the virtual memory -by default, for a reference and convenient use. In near future, we will -provide those for physical memory address space. +and the physical memory by default, for a reference and convenient use.
Can I simply monitor page granularity?
From: Rikard Falkeborn rikard.falkeborn@gmail.com
mainline inclusion from mainline-5.16 commit 199b50f4c9485c46c2403d8b3e0eca90ec401ed6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- The only usage of these structs is to pass their addresses to walk_page_range(), which takes a pointer to const mm_walk_ops as argument. Make them const to allow the compiler to put them in read-only memory.
Link: https://lkml.kernel.org/r/20211014075042.17174-2-rikard.falkeborn@gmail.com Signed-off-by: Rikard Falkeborn rikard.falkeborn@gmail.com Reviewed-by: SeongJae Park sj@kernel.org Reviewed-by: Anshuman Khandual anshuman.khandual@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 199b50f4c9485c46c2403d8b3e0eca90ec401ed6) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/vaddr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index cdeca92a34dc..14768575f906 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -395,7 +395,7 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, return 0; }
-static struct mm_walk_ops damon_mkold_ops = { +static const struct mm_walk_ops damon_mkold_ops = { .pmd_entry = damon_mkold_pmd_entry, };
@@ -491,7 +491,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, return 0; }
-static struct mm_walk_ops damon_young_ops = { +static const struct mm_walk_ops damon_young_ops = { .pmd_entry = damon_young_pmd_entry, };
From: Rongwei Wang rongwei.wang@linux.alibaba.com
mainline inclusion from mainline-5.16 commit 9210622ab81f7e722da7563166d93b2a028a79d4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- In some functions, it is unnecessary to declare the 'err' and 'ret' variables at the same time. This patch simplifies such declarations by reusing a single variable.
Link: https://lkml.kernel.org/r/20211014073014.35754-1-sj@kernel.org Signed-off-by: Rongwei Wang rongwei.wang@linux.alibaba.com Signed-off-by: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 9210622ab81f7e722da7563166d93b2a028a79d4) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs.c | 66 +++++++++++++++++++++++------------------------- 1 file changed, 31 insertions(+), 35 deletions(-)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 38188347d8ab..c90988a20fa4 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -69,8 +69,7 @@ static ssize_t dbgfs_attrs_write(struct file *file, struct damon_ctx *ctx = file->private_data; unsigned long s, a, r, minr, maxr; char *kbuf; - ssize_t ret = count; - int err; + ssize_t ret;
kbuf = user_input_str(buf, count, ppos); if (IS_ERR(kbuf)) @@ -88,9 +87,9 @@ static ssize_t dbgfs_attrs_write(struct file *file, goto unlock_out; }
- err = damon_set_attrs(ctx, s, a, r, minr, maxr); - if (err) - ret = err; + ret = damon_set_attrs(ctx, s, a, r, minr, maxr); + if (!ret) + ret = count; unlock_out: mutex_unlock(&ctx->kdamond_lock); out: @@ -220,14 +219,13 @@ static ssize_t dbgfs_schemes_write(struct file *file, const char __user *buf, struct damon_ctx *ctx = file->private_data; char *kbuf; struct damos **schemes; - ssize_t nr_schemes = 0, ret = count; - int err; + ssize_t nr_schemes = 0, ret;
kbuf = user_input_str(buf, count, ppos); if (IS_ERR(kbuf)) return PTR_ERR(kbuf);
- schemes = str_to_schemes(kbuf, ret, &nr_schemes); + schemes = str_to_schemes(kbuf, count, &nr_schemes); if (!schemes) { ret = -EINVAL; goto out; @@ -239,11 +237,12 @@ static ssize_t dbgfs_schemes_write(struct file *file, const char __user *buf, goto unlock_out; }
- err = damon_set_schemes(ctx, schemes, nr_schemes); - if (err) - ret = err; - else + ret = damon_set_schemes(ctx, schemes, nr_schemes); + if (!ret) { + ret = count; nr_schemes = 0; + } + unlock_out: mutex_unlock(&ctx->kdamond_lock); free_schemes_arr(schemes, nr_schemes); @@ -343,9 +342,8 @@ static ssize_t dbgfs_target_ids_write(struct file *file, char *kbuf, *nrs; unsigned long *targets; ssize_t nr_targets; - ssize_t ret = count; + ssize_t ret; int i; - int err;
kbuf = user_input_str(buf, count, ppos); if (IS_ERR(kbuf)) @@ -358,7 +356,7 @@ static ssize_t dbgfs_target_ids_write(struct file *file, scnprintf(kbuf, count, "42 "); }
- targets = str_to_target_ids(nrs, ret, &nr_targets); + targets = str_to_target_ids(nrs, count, &nr_targets); if (!targets) { ret = -ENOMEM; goto out; @@ -393,11 +391,12 @@ static ssize_t dbgfs_target_ids_write(struct file *file, else damon_pa_set_primitives(ctx);
- err = damon_set_targets(ctx, targets, nr_targets); - if (err) { + ret = damon_set_targets(ctx, targets, nr_targets); + if (ret) { if (id_is_pid) dbgfs_put_pids(targets, nr_targets); - ret = err; + } else { + ret = count; }
unlock_out: @@ -715,8 +714,7 @@ static ssize_t dbgfs_mk_context_write(struct file *file, { char *kbuf; char *ctx_name; - ssize_t ret = count; - int err; + ssize_t ret;
kbuf = user_input_str(buf, count, ppos); if (IS_ERR(kbuf)) @@ -734,9 +732,9 @@ static ssize_t dbgfs_mk_context_write(struct file *file, }
mutex_lock(&damon_dbgfs_lock); - err = dbgfs_mk_context(ctx_name); - if (err) - ret = err; + ret = dbgfs_mk_context(ctx_name); + if (!ret) + ret = count; mutex_unlock(&damon_dbgfs_lock);
out: @@ -805,8 +803,7 @@ static ssize_t dbgfs_rm_context_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { char *kbuf; - ssize_t ret = count; - int err; + ssize_t ret; char *ctx_name;
kbuf = user_input_str(buf, count, ppos); @@ -825,9 +822,9 @@ static ssize_t dbgfs_rm_context_write(struct file *file, }
mutex_lock(&damon_dbgfs_lock); - err = dbgfs_rm_context(ctx_name); - if (err) - ret = err; + ret = dbgfs_rm_context(ctx_name); + if (!ret) + ret = count; mutex_unlock(&damon_dbgfs_lock);
out: @@ -851,9 +848,8 @@ static ssize_t dbgfs_monitor_on_read(struct file *file, static ssize_t dbgfs_monitor_on_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { - ssize_t ret = count; + ssize_t ret; char *kbuf; - int err;
kbuf = user_input_str(buf, count, ppos); if (IS_ERR(kbuf)) @@ -866,14 +862,14 @@ static ssize_t dbgfs_monitor_on_write(struct file *file, }
if (!strncmp(kbuf, "on", count)) - err = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs); + ret = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs); else if (!strncmp(kbuf, "off", count)) - err = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs); + ret = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs); else - err = -EINVAL; + ret = -EINVAL;
- if (err) - ret = err; + if (!ret) + ret = count; kfree(kbuf); return ret; }
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 57223ac295845b1d72ec1bd02b5fab992b77a021 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Introduction
============
This patchset 1) makes the engine for general data access pattern-oriented memory management (DAMOS) more useful for production environments, and 2) implements a static kernel module for lightweight proactive reclamation using the engine.
Proactive Reclamation ---------------------
On general memory over-committed systems, proactively reclaiming cold pages helps save memory and reduce the latency spikes incurred by direct reclaim or by the CPU consumption of kswapd, while causing only minimal performance degradation[2].
A Free Pages Reporting[8] based memory over-commit virtualization system would be one more specific use case. In such a system, the guest VMs report their free memory to the host, and the host reallocates the reported memory to other guests. As a result, the system's memory utilization can be maximized. However, the guests may not be memory-frugal, because some kernel subsystems and user-space applications are designed to use as much memory as available. In that case, guests would report only a small amount of memory as free to the host, resulting in poor memory utilization. Running the proactive reclamation in such guests could help mitigate this problem.
Google has also implemented this idea and is using it in their data centers. They further proposed upstreaming it in LSFMM'19, and "the general consensus was that, while this sort of proactive reclaim would be useful for a number of users, the cost of this particular solution was too high to consider merging it upstream"[3]. The cost mainly comes from the coldness tracking. Roughly speaking, the implementation periodically scans the 'Accessed' bit of each page. Hence, the overhead increases linearly with the size of the memory and the scanning frequency. As a result, Google is known to dedicate one CPU to the work. That is a reasonable option for someone like Google, but not for many others.
DAMON and DAMOS: An engine for data access pattern-oriented memory management -----------------------------------------------------------------------------
DAMON[4] is a framework for general data access monitoring. Its adaptive monitoring overhead control feature minimizes its monitoring overhead. It also lets the upper bound of the overhead be configurable by clients, regardless of the size of the monitoring target memory. While monitoring 70 GiB of memory of a production system every 5 milliseconds, it consumes less than 1% of a single CPU's time. For this, it may sacrifice some of the quality of the monitoring results. Nevertheless, the lower bound of the quality is configurable, and it uses a best-effort algorithm for better quality. Our test results[5] show the quality is practical enough. From the production system monitoring, we were able to find a 4 KiB region in the 70 GiB memory that shows the highest access frequency.
We normally don't monitor the data access pattern just for fun but to improve something like memory management. Proactive reclamation is one such usage. For such general cases, DAMON provides a feature called DAMON-based Operation Schemes (DAMOS)[6]. It makes DAMON an engine for general data access pattern-oriented memory management. Using this, clients can ask DAMON to find memory regions of a specific data access pattern and apply some memory management action (e.g., page out, move to the head of the LRU list, use huge pages, ...). We call such a request a 'scheme'.
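For illustration only (not part of the patch set), a scheme could be installed through this tree's debugfs 'schemes' file as below. The seven fields of the pre-quota format are min/max region size in bytes, min/max access frequency, min/max age in aggregation intervals, and the action number; the concrete values, and the assumption that DAMOS_PAGEOUT maps to action number 2, are assumptions of this sketch:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Install "page out >= 4 KiB regions not accessed for >= 600
     * aggregation intervals" (all thresholds are example values). */
    int install_pageout_scheme(void)
    {
            const char *s =
                    "4096 18446744073709551615 0 0 600 4294967295 2\n";
            int fd = open("/sys/kernel/debug/damon/schemes", O_WRONLY);
            int ret = -1;

            if (fd < 0)
                    return -1;
            if (write(fd, s, strlen(s)) == (ssize_t)strlen(s))
                    ret = 0;
            close(fd);
            return ret;
    }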
Proactive Reclamation on top of DAMON/DAMOS -------------------------------------------
Therefore, by using DAMON for the cold page detection, the proactive reclamation's monitoring overhead issue can be solved. Indeed, we previously implemented a version of proactive reclamation using DAMOS and achieved noticeable improvements with our evaluation setup[5]. Nevertheless, it was more of a proof of concept than something for production use. It supports only virtual address spaces of processes, and requires additional tuning effort for the given workloads and hardware. For the tuning, we introduced a simple auto-tuning user space tool[7]. Google is also known to use a similar ML-based approach for their fleets[2]. But making it just work with intuitive knobs in the kernel would be helpful for general users.
To this end, this patchset improves DAMOS to be ready for such production usages, and implements another version of the proactive reclamation, namely DAMON_RECLAIM, on top of it.
DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks --------------------------------------------------------------------------
First of all, the current version of DAMOS supports only virtual address spaces. This patchset makes it support the physical address space for the page out action.
The next major problem of the current version of DAMOS is the lack of aggressiveness control, which can result in arbitrary overhead. For example, if huge memory regions having the data access pattern of interest are found, applying the requested action to all of the regions could incur significant overhead. This can be controlled by tuning the target data access pattern with manual or automated approaches[2,7]. But some people would prefer the kernel to just work with only intuitive tuning or default values.
For such cases, this patchset implements a safeguard, namely time/size quotas. Using this, clients can specify up to how much time can be used for applying the action, and/or up to how much memory the action can be applied to, within a user-specified time duration. A follow-up question is: to which memory regions should the action be applied within the limits? We implement a simple region prioritization mechanism for each action and make DAMOS apply the action to high-priority regions first. It also allows clients to tune the prioritization mechanism with different weights for the size, access frequency, and age of memory regions. This means we could use not only LRU but also LFU or some fancy algorithms like CAR[9] with lightweight overhead.
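The prioritization patches themselves (patches 7-10) are not shown in this excerpt. Purely as a hypothetical sketch of the idea, with names, formula, and weight handling that are assumptions rather than DAMON's actual implementation, a weighted score for a pageout-style scheme could look like:

    /*
     * Hypothetical sketch only: combine region size, access frequency,
     * and age into one priority score using client-set weights. For a
     * pageout scheme, colder (older, less accessed) regions score higher.
     */
    static unsigned int region_cold_priority(unsigned long sz_bytes,
                    unsigned int nr_accesses, unsigned int age,
                    unsigned int w_sz, unsigned int w_freq,
                    unsigned int w_age)
    {
            unsigned int score = 0;

            score += w_sz * (sz_bytes >> 12);  /* bigger regions first */
            score += w_age * age;              /* older regions first */
            /* frequently accessed regions should be reclaimed last */
            return score / (1 + w_freq * nr_accesses);
    }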
Though DAMON is lightweight, someone may want to remove even the cold page monitoring overhead when it is unnecessary. Currently, the schemes must be manually turned on and off by clients, but some clients would simply want to turn them on and off based on some metric like the free memory ratio or memory fragmentation. For such cases, this patchset implements a watermarks-based automatic activation feature. It allows the clients to configure the metric of their interest and three watermarks for the metric. If the metric is higher than the high watermark or lower than the low watermark, the scheme is deactivated. If the metric is lower than the mid watermark but higher than the low watermark, the scheme is activated.
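A minimal sketch of the described activation rule follows (again only an illustration with assumed names; treating the band between the mid and high watermarks as "keep the previous state" is this sketch's reading, not something this excerpt confirms):

    #include <stdbool.h>

    /* Decide whether a scheme should be active for the current metric. */
    static bool wmarks_scheme_active(unsigned long metric,
                    unsigned long high, unsigned long mid,
                    unsigned long low, bool was_active)
    {
            /* Higher than high or lower than low: deactivate. */
            if (metric > high || metric < low)
                    return false;
            /* Lower than mid but higher than low: activate. */
            if (metric < mid)
                    return true;
            /* Between mid and high: keep the previous state (assumed). */
            return was_active;
    }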
DAMON-based Reclaim -------------------
Using the improved version of DAMOS, this patchset implements a static kernel module called 'damon_reclaim'. It finds memory regions that have not been accessed for a specific time duration and pages them out. Consuming too much CPU for the pageout operations, or doing pageout too frequently, can be critical for systems configuring their swap devices with software-defined in-memory block devices like zram/zswap, or with write-count limited devices like SSDs, respectively. To avoid such problems, the time/size quotas can be configured. Under the quotas, it pages out the memory regions that have not been accessed for the longest time first. Also, to remove the monitoring overhead in peaceful situations, and to fall back to the LRU-list based page-granularity reclamation when it does not make progress, the three-watermarks based activation mechanism is used, with the free memory ratio as the watermark metric.
For convenient configuration, it provides several module parameters. Using these, sysadmins can enable or disable it and tune its parameters, including the coldness identification time threshold, the time/size quotas, and the three watermarks.
Evaluation ==========
In short, DAMON_RECLAIM with a 50ms/s time quota and the regions prioritization on a v5.15-rc5 Linux kernel with a ZRAM swap device achieves 38.58% memory saving with only 1.94% runtime overhead. For this, DAMON_RECLAIM consumes only 4.97% of a single CPU's time.
Setup -----
We evaluate DAMON_RECLAIM to show how each of the DAMOS improvements makes an effect. For this, we measure DAMON_RECLAIM's CPU consumption, entire system memory footprint, total number of major page faults, and runtime of 24 realistic workloads in the PARSEC3 and SPLASH-2X benchmark suites on my QEMU/KVM based virtual machine. The virtual machine runs on an i3.metal AWS instance, has 130 GiB memory, and runs a Linux kernel built on the latest -mm tree[1] plus this patchset. It also utilizes a 4 GiB ZRAM swap device. We repeat the measurement 5 times and use the averages.
[1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55
Detailed Results ----------------
The results are summarized in the below table.
With a coldness identification threshold of 5 seconds, DAMON_RECLAIM without the time quota-based speed limit achieves 47.21% memory saving, but incurs 4.59% runtime slowdown to the workloads on average. For this, DAMON_RECLAIM consumes about 11.28% of a single CPU's time.
Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the regions prioritization reduces the slowdown to 4.89%, 2.65%, and 1.5%, respectively. The time quota of 200ms/s (20%) makes no real change compared to the quota-unapplied version, because the quota-unapplied version consumes only 11.28% CPU time anyway. DAMON_RECLAIM's CPU utilization was also similarly reduced: 11.24%, 5.51%, and 2.01% of a single CPU's time. That is, the overhead is proportional to the speed limit. Nevertheless, the speed limit also reduces the memory saving, because it makes DAMON_RECLAIM less aggressive. In detail, the three variants show 48.76%, 37.83%, and 7.85% memory saving, respectively.
Applying the regions prioritization (paging out the regions that have not been accessed for the longest time first, within the time quota) further reduces the performance degradation. The runtime slowdown and the increase in the total number of major page faults changed from 4.89%/218,690% to 4.39%/166,136% (200ms/s), from 2.65%/111,886% to 1.94%/59,053% (50ms/s), and from 1.5%/34,973.40% to 2.08%/8,781.75% (10ms/s). The runtime under the 10ms/s time quota has increased with prioritization, but apparently that is within the margin of error.
time quota   prioritization   memory_saving   cpu_util   slowdown   pgmajfaults overhead
N            N                47.21%          11.28%     4.59%      194,802%
200ms/s      N                48.76%          11.24%     4.89%      218,690%
50ms/s       N                37.83%          5.51%      2.65%      111,886%
10ms/s       N                7.85%           2.01%      1.5%       34,793.40%
200ms/s      Y                50.08%          10.38%     4.39%      166,136%
50ms/s       Y                38.58%          4.97%      1.94%      59,053%
10ms/s       Y                3.63%           1.73%      2.08%      8,781.75%
Baseline and Complete Git Trees ===============================
The patches are based on the latest -mm tree (v5.15-rc5-mmots-2021-10-13-19-55). You can also clone the complete git tree from:
$ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1
The web is also available: https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_re...
Sequence Of Patches ===================
The first patch makes DAMOS support the physical address space for the page out action. The following five patches (patches 2-6) implement the time/size quotas. The next four patches (patches 7-10) implement the memory regions prioritization within the limit. Then, the three following patches (patches 11-13) implement the watermarks-based schemes activation.
Finally, the last two patches (patches 14-15) implement and document the DAMON-based reclamation using the advanced DAMOS.
[1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html [2] https://research.google/pubs/pub48551/ [3] https://lwn.net/Articles/787611/ [4] https://damonitor.github.io [5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html [6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/ [7] https://github.com/awslabs/damoos [8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html [9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement
This patch (of 15):
This makes the DAMON primitives for the physical address space support the pageout action for DAMON-based Operation Schemes. Hence, with this commit, users can easily implement system-level data access-aware reclamation using DAMOS.
[sj@kernel.org: fix missing-prototype build warning] Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Jonathan Corbet corbet@lwn.net Cc: David Hildenbrand david@redhat.com Cc: David Woodhouse dwmw@amazon.com Cc: Marco Elver elver@google.com Cc: Leonard Foerster foersleo@amazon.de Cc: Greg Thelen gthelen@google.com Cc: Markus Boehme markubo@amazon.de Cc: David Rientjes rientjes@google.com Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 57223ac295845b1d72ec1bd02b5fab992b77a021) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 2 ++ mm/damon/paddr.c | 37 ++++++++++++++++++++++++++++++++++++- 2 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 715dadd21f7c..9a327bc787b5 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -357,6 +357,8 @@ void damon_va_set_primitives(struct damon_ctx *ctx); void damon_pa_prepare_access_checks(struct damon_ctx *ctx); unsigned int damon_pa_check_accesses(struct damon_ctx *ctx); bool damon_pa_target_valid(void *t); +int damon_pa_apply_scheme(struct damon_ctx *context, struct damon_target *t, + struct damon_region *r, struct damos *scheme); void damon_pa_set_primitives(struct damon_ctx *ctx);
#endif /* CONFIG_DAMON_PADDR */ diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index d7a2ecd09ed0..957ada55de77 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -11,7 +11,9 @@ #include <linux/page_idle.h> #include <linux/pagemap.h> #include <linux/rmap.h> +#include <linux/swap.h>
+#include "../internal.h" #include "prmtv-common.h"
static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma, @@ -211,6 +213,39 @@ bool damon_pa_target_valid(void *t) return true; }
+int damon_pa_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, + struct damon_region *r, struct damos *scheme) +{ + unsigned long addr; + LIST_HEAD(page_list); + + if (scheme->action != DAMOS_PAGEOUT) + return -EINVAL; + + for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) { + struct page *page = damon_get_page(PHYS_PFN(addr)); + + if (!page) + continue; + + ClearPageReferenced(page); + test_and_clear_page_young(page); + if (isolate_lru_page(page)) { + put_page(page); + continue; + } + if (PageUnevictable(page)) { + putback_lru_page(page); + } else { + list_add(&page->lru, &page_list); + put_page(page); + } + } + reclaim_pages(&page_list); + cond_resched(); + return 0; +} + void damon_pa_set_primitives(struct damon_ctx *ctx) { ctx->primitive.init = NULL; @@ -220,5 +255,5 @@ void damon_pa_set_primitives(struct damon_ctx *ctx) ctx->primitive.reset_aggregated = NULL; ctx->primitive.target_valid = damon_pa_target_valid; ctx->primitive.cleanup = NULL; - ctx->primitive.apply_scheme = NULL; + ctx->primitive.apply_scheme = damon_pa_apply_scheme; }
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 2b8a248d5873343aa16f6c5ede30517693995f13 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- There could be arbitrarily large memory regions fulfilling the target data access pattern of a DAMON-based operation scheme. In that case, applying the action of the scheme could incur too high overhead. To provide an intuitive way of avoiding that, this implements a feature called size quota. If the quota is set, DAMON tries to apply the action only up to the given amount of memory regions within a given time window.
Link: https://lkml.kernel.org/r/20211019150731.16699-3-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 2b8a248d5873343aa16f6c5ede30517693995f13) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 36 +++++++++++++++++++++++--- mm/damon/core.c | 60 +++++++++++++++++++++++++++++++++++++------ mm/damon/dbgfs.c | 4 ++- 3 files changed, 87 insertions(+), 13 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 9a327bc787b5..3a1ce9d9921c 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -89,6 +89,26 @@ enum damos_action { DAMOS_STAT, /* Do nothing but only record the stat */ };
+/** + * struct damos_quota - Controls the aggressiveness of the given scheme. + * @sz: Maximum bytes of memory that the action can be applied. + * @reset_interval: Charge reset interval in milliseconds. + * + * To avoid consuming too much CPU time or IO resources for applying the + * &struct damos->action to large memory, DAMON allows users to set a size + * quota. The quota can be set by writing non-zero values to &sz. If the size + * quota is set, DAMON tries to apply the action only up to &sz bytes within + * &reset_interval. + */ +struct damos_quota { + unsigned long sz; + unsigned long reset_interval; + +/* private: For charging the quota */ + unsigned long charged_sz; + unsigned long charged_from; +}; + /** * struct damos - Represents a Data Access Monitoring-based Operation Scheme. * @min_sz_region: Minimum size of target regions. @@ -98,13 +118,20 @@ enum damos_action { * @min_age_region: Minimum age of target regions. * @max_age_region: Maximum age of target regions. * @action: &damo_action to be applied to the target regions. + * @quota: Control the aggressiveness of this scheme. * @stat_count: Total number of regions that this scheme is applied. * @stat_sz: Total size of regions that this scheme is applied. * @list: List head for siblings. * - * For each aggregation interval, DAMON applies @action to monitoring target - * regions fit in the condition and updates the statistics. Note that both - * the minimums and the maximums are inclusive. + * For each aggregation interval, DAMON finds regions which fit in the + * condition (&min_sz_region, &max_sz_region, &min_nr_accesses, + * &max_nr_accesses, &min_age_region, &max_age_region) and applies &action to + * those. To avoid consuming too much CPU time or IO resources for the + * &action, &quota is used. + * + * After applying the &action to each region, &stat_count and &stat_sz is + * updated to reflect the number of regions and total size of regions that the + * &action is applied. */ struct damos { unsigned long min_sz_region; @@ -114,6 +141,7 @@ struct damos { unsigned int min_age_region; unsigned int max_age_region; enum damos_action action; + struct damos_quota quota; unsigned long stat_count; unsigned long stat_sz; struct list_head list; @@ -310,7 +338,7 @@ struct damos *damon_new_scheme( unsigned long min_sz_region, unsigned long max_sz_region, unsigned int min_nr_accesses, unsigned int max_nr_accesses, unsigned int min_age_region, unsigned int max_age_region, - enum damos_action action); + enum damos_action action, struct damos_quota *quota); void damon_add_scheme(struct damon_ctx *ctx, struct damos *s); void damon_destroy_scheme(struct damos *s);
diff --git a/mm/damon/core.c b/mm/damon/core.c index 2f6785737902..cce14a0d5c72 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -89,7 +89,7 @@ struct damos *damon_new_scheme( unsigned long min_sz_region, unsigned long max_sz_region, unsigned int min_nr_accesses, unsigned int max_nr_accesses, unsigned int min_age_region, unsigned int max_age_region, - enum damos_action action) + enum damos_action action, struct damos_quota *quota) { struct damos *scheme;
@@ -107,6 +107,11 @@ struct damos *damon_new_scheme( scheme->stat_sz = 0; INIT_LIST_HEAD(&scheme->list);
+ scheme->quota.sz = quota->sz; + scheme->quota.reset_interval = quota->reset_interval; + scheme->quota.charged_sz = 0; + scheme->quota.charged_from = 0; + return scheme; }
@@ -530,15 +535,25 @@ static void kdamond_reset_aggregated(struct damon_ctx *c) } }
+static void damon_split_region_at(struct damon_ctx *ctx, + struct damon_target *t, struct damon_region *r, + unsigned long sz_r); + static void damon_do_apply_schemes(struct damon_ctx *c, struct damon_target *t, struct damon_region *r) { struct damos *s; - unsigned long sz;
damon_for_each_scheme(s, c) { - sz = r->ar.end - r->ar.start; + struct damos_quota *quota = &s->quota; + unsigned long sz = r->ar.end - r->ar.start; + + /* Check the quota */ + if (quota->sz && quota->charged_sz >= quota->sz) + continue; + + /* Check the target regions condition */ if (sz < s->min_sz_region || s->max_sz_region < sz) continue; if (r->nr_accesses < s->min_nr_accesses || @@ -546,22 +561,51 @@ static void damon_do_apply_schemes(struct damon_ctx *c, continue; if (r->age < s->min_age_region || s->max_age_region < r->age) continue; - s->stat_count++; - s->stat_sz += sz; - if (c->primitive.apply_scheme) + + /* Apply the scheme */ + if (c->primitive.apply_scheme) { + if (quota->sz && quota->charged_sz + sz > quota->sz) { + sz = ALIGN_DOWN(quota->sz - quota->charged_sz, + DAMON_MIN_REGION); + if (!sz) + goto update_stat; + damon_split_region_at(c, t, r, sz); + } c->primitive.apply_scheme(c, t, r, s); + quota->charged_sz += sz; + } if (s->action != DAMOS_STAT) r->age = 0; + +update_stat: + s->stat_count++; + s->stat_sz += sz; } }
static void kdamond_apply_schemes(struct damon_ctx *c) { struct damon_target *t; - struct damon_region *r; + struct damon_region *r, *next_r; + struct damos *s; + + damon_for_each_scheme(s, c) { + struct damos_quota *quota = &s->quota; + + if (!quota->sz) + continue; + + /* New charge window starts */ + if (time_after_eq(jiffies, quota->charged_from + + msecs_to_jiffies( + quota->reset_interval))) { + quota->charged_from = jiffies; + quota->charged_sz = 0; + } + }
damon_for_each_target(t, c) { - damon_for_each_region(r, t) + damon_for_each_region_safe(r, next_r, t) damon_do_apply_schemes(c, t, r); } } diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index c90988a20fa4..a04bd50cc4c4 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -188,6 +188,8 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,
*nr_schemes = 0; while (pos < len && *nr_schemes < max_nr_schemes) { + struct damos_quota quota = {}; + ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u%n", &min_sz, &max_sz, &min_nr_a, &max_nr_a, &min_age, &max_age, &action, &parsed); @@ -200,7 +202,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,
pos += parsed; scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a, - min_age, max_age, action); + min_age, max_age, action, &quota); if (!scheme) goto fail;
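For kernel-space users of the changed function, creating a quota-limited scheme might look roughly as below. This is a sketch against the signature added above; the concrete limits and the scheme conditions are arbitrary examples, not anything this series prescribes:

    #include <linux/damon.h>
    #include <linux/kernel.h>
    #include <linux/mm.h>

    /* Page out never-accessed regions, but at most 256 MiB per second. */
    static struct damos *example_quota_scheme(void)
    {
            struct damos_quota quota = {
                    .sz = 256 * 1024 * 1024,  /* size quota: 256 MiB... */
                    .reset_interval = 1000,   /* ...per 1000 ms window */
            };

            return damon_new_scheme(PAGE_SIZE, ULONG_MAX,  /* any size */
                                    0, 0,                  /* zero accesses */
                                    0, UINT_MAX,           /* any age */
                                    DAMOS_PAGEOUT, &quota);
    }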
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 50585192bc2ef9309d32dabdbb5e735679f4f128 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- If DAMOS has stopped applying an action in the middle of a group of memory regions due to its size quota, it starts the work again from the beginning of the address space in the next charge window. If there is a huge memory region at the beginning of the address space and it always fulfills the scheme's target data access pattern, the action will be applied to only that region.
This mitigates the case by skipping, at the beginning of the next charge window, the memory regions that were already charged in the current charge window.
Link: https://lkml.kernel.org/r/20211019150731.16699-4-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 50585192bc2ef9309d32dabdbb5e735679f4f128) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 5 +++++ mm/damon/core.c | 37 +++++++++++++++++++++++++++++++++++++ 2 files changed, 42 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 3a1ce9d9921c..585d985768fd 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -107,6 +107,8 @@ struct damos_quota { /* private: For charging the quota */ unsigned long charged_sz; unsigned long charged_from; + struct damon_target *charge_target_from; + unsigned long charge_addr_from; };
/** @@ -307,6 +309,9 @@ struct damon_ctx { #define damon_prev_region(r) \ (container_of(r->list.prev, struct damon_region, list))
+#define damon_last_region(t) \ + (list_last_entry(&t->regions_list, struct damon_region, list)) + #define damon_for_each_region(r, t) \ list_for_each_entry(r, &t->regions_list, list)
diff --git a/mm/damon/core.c b/mm/damon/core.c index cce14a0d5c72..693b75bc3450 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -111,6 +111,8 @@ struct damos *damon_new_scheme( scheme->quota.reset_interval = quota->reset_interval; scheme->quota.charged_sz = 0; scheme->quota.charged_from = 0; + scheme->quota.charge_target_from = NULL; + scheme->quota.charge_addr_from = 0;
return scheme; } @@ -553,6 +555,37 @@ static void damon_do_apply_schemes(struct damon_ctx *c, if (quota->sz && quota->charged_sz >= quota->sz) continue;
+ /* Skip previously charged regions */ + if (quota->charge_target_from) { + if (t != quota->charge_target_from) + continue; + if (r == damon_last_region(t)) { + quota->charge_target_from = NULL; + quota->charge_addr_from = 0; + continue; + } + if (quota->charge_addr_from && + r->ar.end <= quota->charge_addr_from) + continue; + + if (quota->charge_addr_from && r->ar.start < + quota->charge_addr_from) { + sz = ALIGN_DOWN(quota->charge_addr_from - + r->ar.start, DAMON_MIN_REGION); + if (!sz) { + if (r->ar.end - r->ar.start <= + DAMON_MIN_REGION) + continue; + sz = DAMON_MIN_REGION; + } + damon_split_region_at(c, t, r, sz); + r = damon_next_region(r); + sz = r->ar.end - r->ar.start; + } + quota->charge_target_from = NULL; + quota->charge_addr_from = 0; + } + /* Check the target regions condition */ if (sz < s->min_sz_region || s->max_sz_region < sz) continue; @@ -573,6 +606,10 @@ static void damon_do_apply_schemes(struct damon_ctx *c, } c->primitive.apply_scheme(c, t, r, s); quota->charged_sz += sz; + if (quota->sz && quota->charged_sz >= quota->sz) { + quota->charge_target_from = t; + quota->charge_addr_from = r->ar.end + 1; + } } if (s->action != DAMOS_STAT) r->age = 0;
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 1cd2430300594a230dba9178ac9e286d868d9da2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- The size quota feature of DAMOS is useful for IO resource-critical systems, but not so intuitive for CPU time-critical systems. Systems using zram or zswap-like swap devices are such examples.
To provide another intuitive way for such systems, this implements a time-based quota for DAMON-based Operation Schemes. If the quota is set, DAMOS tries to use only up to the user-defined quota of CPU time within a given time window.
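As a rough worked example of the internal transformation the kernel-doc below describes (the time quota is converted to an effective size quota using the estimated throughput of the scheme's action; the numbers here are illustrative only): if earlier charge windows applied 100 MiB of regions while spending 500 ms of apply time, the estimated throughput is 200 MiB per second, so a 10 ms/s time quota becomes an effective size quota of 2 MiB per one-second reset interval. If a smaller size quota is also set, the smaller of the two is used.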
Link: https://lkml.kernel.org/r/20211019150731.16699-5-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 1cd2430300594a230dba9178ac9e286d868d9da2) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 25 +++++++++++++++++++----- mm/damon/core.c | 45 ++++++++++++++++++++++++++++++++++++++----- 2 files changed, 60 insertions(+), 10 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 585d985768fd..1e7671bf3d23 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -91,20 +91,35 @@ enum damos_action {
/** * struct damos_quota - Controls the aggressiveness of the given scheme. + * @ms: Maximum milliseconds that the scheme can use. * @sz: Maximum bytes of memory that the action can be applied. * @reset_interval: Charge reset interval in milliseconds. * * To avoid consuming too much CPU time or IO resources for applying the - * &struct damos->action to large memory, DAMON allows users to set a size - * quota. The quota can be set by writing non-zero values to &sz. If the size - * quota is set, DAMON tries to apply the action only up to &sz bytes within - * &reset_interval. + * &struct damos->action to large memory, DAMON allows users to set time and/or + * size quotas. The quotas can be set by writing non-zero values to &ms and + * &sz, respectively. If the time quota is set, DAMON tries to use only up to + * &ms milliseconds within &reset_interval for applying the action. If the + * size quota is set, DAMON tries to apply the action only up to &sz bytes + * within &reset_interval. + * + * Internally, the time quota is transformed to a size quota using estimated + * throughput of the scheme's action. DAMON then compares it against &sz and + * uses smaller one as the effective quota. */ struct damos_quota { + unsigned long ms; unsigned long sz; unsigned long reset_interval;
-/* private: For charging the quota */ +/* private: */ + /* For throughput estimation */ + unsigned long total_charged_sz; + unsigned long total_charged_ns; + + unsigned long esz; /* Effective size quota in bytes */ + + /* For charging the quota */ unsigned long charged_sz; unsigned long charged_from; struct damon_target *charge_target_from; diff --git a/mm/damon/core.c b/mm/damon/core.c index 693b75bc3450..d1da4bef96ed 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -107,8 +107,12 @@ struct damos *damon_new_scheme( scheme->stat_sz = 0; INIT_LIST_HEAD(&scheme->list);
+ scheme->quota.ms = quota->ms; scheme->quota.sz = quota->sz; scheme->quota.reset_interval = quota->reset_interval; + scheme->quota.total_charged_sz = 0; + scheme->quota.total_charged_ns = 0; + scheme->quota.esz = 0; scheme->quota.charged_sz = 0; scheme->quota.charged_from = 0; scheme->quota.charge_target_from = NULL; @@ -550,9 +554,10 @@ static void damon_do_apply_schemes(struct damon_ctx *c, damon_for_each_scheme(s, c) { struct damos_quota *quota = &s->quota; unsigned long sz = r->ar.end - r->ar.start; + struct timespec64 begin, end;
/* Check the quota */ - if (quota->sz && quota->charged_sz >= quota->sz) + if (quota->esz && quota->charged_sz >= quota->esz) continue;
/* Skip previously charged regions */ @@ -597,16 +602,21 @@ static void damon_do_apply_schemes(struct damon_ctx *c,
/* Apply the scheme */ if (c->primitive.apply_scheme) { - if (quota->sz && quota->charged_sz + sz > quota->sz) { - sz = ALIGN_DOWN(quota->sz - quota->charged_sz, + if (quota->esz && + quota->charged_sz + sz > quota->esz) { + sz = ALIGN_DOWN(quota->esz - quota->charged_sz, DAMON_MIN_REGION); if (!sz) goto update_stat; damon_split_region_at(c, t, r, sz); } + ktime_get_coarse_ts64(&begin); c->primitive.apply_scheme(c, t, r, s); + ktime_get_coarse_ts64(&end); + quota->total_charged_ns += timespec64_to_ns(&end) - + timespec64_to_ns(&begin); quota->charged_sz += sz; - if (quota->sz && quota->charged_sz >= quota->sz) { + if (quota->esz && quota->charged_sz >= quota->esz) { quota->charge_target_from = t; quota->charge_addr_from = r->ar.end + 1; } @@ -620,6 +630,29 @@ static void damon_do_apply_schemes(struct damon_ctx *c, } }
+/* Shouldn't be called if quota->ms and quota->sz are zero */ +static void damos_set_effective_quota(struct damos_quota *quota) +{ + unsigned long throughput; + unsigned long esz; + + if (!quota->ms) { + quota->esz = quota->sz; + return; + } + + if (quota->total_charged_ns) + throughput = quota->total_charged_sz * 1000000 / + quota->total_charged_ns; + else + throughput = PAGE_SIZE * 1024; + esz = throughput * quota->ms; + + if (quota->sz && quota->sz < esz) + esz = quota->sz; + quota->esz = esz; +} + static void kdamond_apply_schemes(struct damon_ctx *c) { struct damon_target *t; @@ -629,15 +662,17 @@ static void kdamond_apply_schemes(struct damon_ctx *c) damon_for_each_scheme(s, c) { struct damos_quota *quota = &s->quota;
- if (!quota->sz) + if (!quota->ms && !quota->sz) continue;
/* New charge window starts */ if (time_after_eq(jiffies, quota->charged_from + msecs_to_jiffies( quota->reset_interval))) { + quota->total_charged_sz += quota->charged_sz; quota->charged_from = jiffies; quota->charged_sz = 0; + damos_set_effective_quota(quota); } }
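For intuition about the arithmetic in damos_set_effective_quota() above: total_charged_sz is in bytes and total_charged_ns in nanoseconds, so the estimated throughput comes out in bytes per millisecond, and multiplying it by the millisecond quota yields bytes. A minimal shell sketch with made-up numbers (none of these values come from the patch):

# hypothetical: 200 MiB charged over 100 ms of cumulative apply_scheme() time
total_charged_sz=$((200 * 1024 * 1024))
total_charged_ns=$((100 * 1000 * 1000))
throughput=$((total_charged_sz * 1000000 / total_charged_ns))  # 2097152 bytes/ms
quota_ms=10
esz=$((throughput * quota_ms))                                 # 20 MiB
quota_sz=$((16 * 1024 * 1024))
[ "$quota_sz" -ne 0 ] && [ "$quota_sz" -lt "$esz" ] && esz=$quota_sz
echo "$esz"                                                    # 16777216: the size quota wins here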
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit d7d0ec85e983945079364db3c3d2d80cc795a48c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This makes the debugfs interface of DAMON support the scheme quotas by changing the format of the input for the schemes file.
Link: https://lkml.kernel.org/r/20211019150731.16699-6-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit d7d0ec85e983945079364db3c3d2d80cc795a48c) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index a04bd50cc4c4..097e6745ba75 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -105,11 +105,14 @@ static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len)
damon_for_each_scheme(s, c) { rc = scnprintf(&buf[written], len - written, - "%lu %lu %u %u %u %u %d %lu %lu\n", + "%lu %lu %u %u %u %u %d %lu %lu %lu %lu %lu\n", s->min_sz_region, s->max_sz_region, s->min_nr_accesses, s->max_nr_accesses, s->min_age_region, s->max_age_region, - s->action, s->stat_count, s->stat_sz); + s->action, + s->quota.ms, s->quota.sz, + s->quota.reset_interval, + s->stat_count, s->stat_sz); if (!rc) return -ENOMEM;
@@ -190,10 +193,11 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, while (pos < len && *nr_schemes < max_nr_schemes) { struct damos_quota quota = {};
- ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u%n", + ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u %lu %lu %lu%n", &min_sz, &max_sz, &min_nr_a, &max_nr_a, - &min_age, &max_age, &action, &parsed); - if (ret != 7) + &min_age, &max_age, &action, &quota.ms, + &quota.sz, &quota.reset_interval, &parsed); + if (ret != 10) break; if (!damos_action_valid(action)) { pr_err("wrong action %d\n", action);
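As a usage illustration of the new 10-field 'schemes' input (the debugfs mount point and the action encoding are assumptions here; DAMOS_PAGEOUT is taken to be action value 2), the following would request paging out 4 KiB-or-larger regions that were not accessed at all for 10 to 20 aggregation intervals, under a 10 ms / 1 GiB quota per 1 s window:

# fields: min_sz max_sz min_nr_accesses max_nr_accesses min_age max_age
#         action quota_ms quota_sz quota_reset_interval_ms
echo "4096 18446744073709551615 0 0 10 20 2 10 1073741824 1000" > \
	/sys/kernel/debug/damon/schemes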
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit a2cb4dd0d40d3dcb7288a963d0f66271934417b6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This updates the DAMON selftests to support the updated 'schemes' debugfs file format for the quotas.
Link: https://lkml.kernel.org/r/20211019150731.16699-7-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit a2cb4dd0d40d3dcb7288a963d0f66271934417b6) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/testing/selftests/damon/debugfs_attrs.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/damon/debugfs_attrs.sh b/tools/testing/selftests/damon/debugfs_attrs.sh index 639cfb6a1f65..8e33a7b584e7 100644 --- a/tools/testing/selftests/damon/debugfs_attrs.sh +++ b/tools/testing/selftests/damon/debugfs_attrs.sh @@ -63,10 +63,10 @@ echo "$orig_content" > "$file" file="$DBGFS/schemes" orig_content=$(cat "$file")
-test_write_succ "$file" "1 2 3 4 5 6 4" \ +test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0" \ "$orig_content" "valid input" test_write_fail "$file" "1 2 -3 4 5 6 3" "$orig_content" "multi lines" +3 4 5 6 3 0 0 0" "$orig_content" "multi lines" test_write_succ "$file" "" "$orig_content" "disabling" echo "$orig_content" > "$file"
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 38683e003153f7abfa612d7b7fe147efa4624af2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This makes DAMON apply schemes to regions having higher priority first, if it cannot apply schemes to all regions due to the quotas.
The prioritization function should be implemented in the monitoring primitives. Those would commonly calculate the priority of the region using attributes of regions, namely 'size', 'nr_accesses', and 'age'. For example, some primitive would calculate the priority of each region using a weighted sum of 'nr_accesses' and 'age' of the region.
The optimal weights would depend on the given environment, so this makes those customizable. Nevertheless, the score calculation functions are only encouraged to respect the weights, not mandated.
Link: https://lkml.kernel.org/r/20211019150731.16699-8-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 38683e003153f7abfa612d7b7fe147efa4624af2) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 26 ++++++++++++++++++ mm/damon/core.c | 62 ++++++++++++++++++++++++++++++++++++++----- 2 files changed, 81 insertions(+), 7 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 1e7671bf3d23..5d47ad9e3911 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -14,6 +14,8 @@
/* Minimal region size. Every damon_region is aligned by this. */ #define DAMON_MIN_REGION PAGE_SIZE +/* Max priority score for DAMON-based operation schemes */ +#define DAMOS_MAX_SCORE (99)
/** * struct damon_addr_range - Represents an address region of [@start, @end). @@ -95,6 +97,10 @@ enum damos_action { * @sz: Maximum bytes of memory that the action can be applied. * @reset_interval: Charge reset interval in milliseconds. * + * @weight_sz: Weight of the region's size for prioritization. + * @weight_nr_accesses: Weight of the region's nr_accesses for prioritization. + * @weight_age: Weight of the region's age for prioritization. + * * To avoid consuming too much CPU time or IO resources for applying the * &struct damos->action to large memory, DAMON allows users to set time and/or * size quotas. The quotas can be set by writing non-zero values to &ms and @@ -106,12 +112,22 @@ enum damos_action { * Internally, the time quota is transformed to a size quota using estimated * throughput of the scheme's action. DAMON then compares it against &sz and * uses smaller one as the effective quota. + * + * For selecting regions within the quota, DAMON prioritizes current scheme's + * target memory regions using the &struct damon_primitive->get_scheme_score. + * You could customize the prioritization logic by setting &weight_sz, + * &weight_nr_accesses, and &weight_age, because monitoring primitives are + * encouraged to respect those. */ struct damos_quota { unsigned long ms; unsigned long sz; unsigned long reset_interval;
+ unsigned int weight_sz; + unsigned int weight_nr_accesses; + unsigned int weight_age; + /* private: */ /* For throughput estimation */ unsigned long total_charged_sz; @@ -124,6 +140,10 @@ struct damos_quota { unsigned long charged_from; struct damon_target *charge_target_from; unsigned long charge_addr_from; + + /* For prioritization */ + unsigned long histogram[DAMOS_MAX_SCORE + 1]; + unsigned int min_score; };
/** @@ -174,6 +194,7 @@ struct damon_ctx; * @prepare_access_checks: Prepare next access check of target regions. * @check_accesses: Check the accesses to target regions. * @reset_aggregated: Reset aggregated accesses monitoring results. + * @get_scheme_score: Get the score of a region for a scheme. * @apply_scheme: Apply a DAMON-based operation scheme. * @target_valid: Determine if the target is valid. * @cleanup: Clean up the context. @@ -200,6 +221,8 @@ struct damon_ctx; * of its update. The value will be used for regions adjustment threshold. * @reset_aggregated should reset the access monitoring results that aggregated * by @check_accesses. + * @get_scheme_score should return the priority score of a region for a scheme + * as an integer in [0, &DAMOS_MAX_SCORE]. * @apply_scheme is called from @kdamond when a region for user provided * DAMON-based operation scheme is found. It should apply the scheme's action * to the region. This is not used for &DAMON_ARBITRARY_TARGET case. @@ -213,6 +236,9 @@ struct damon_primitive { void (*prepare_access_checks)(struct damon_ctx *context); unsigned int (*check_accesses)(struct damon_ctx *context); void (*reset_aggregated)(struct damon_ctx *context); + int (*get_scheme_score)(struct damon_ctx *context, + struct damon_target *t, struct damon_region *r, + struct damos *scheme); int (*apply_scheme)(struct damon_ctx *context, struct damon_target *t, struct damon_region *r, struct damos *scheme); bool (*target_valid)(void *target); diff --git a/mm/damon/core.c b/mm/damon/core.c index d1da4bef96ed..fad25778e2ec 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -12,6 +12,7 @@ #include <linux/kthread.h> #include <linux/random.h> #include <linux/slab.h> +#include <linux/string.h>
#define CREATE_TRACE_POINTS #include <trace/events/damon.h> @@ -110,6 +111,9 @@ struct damos *damon_new_scheme( scheme->quota.ms = quota->ms; scheme->quota.sz = quota->sz; scheme->quota.reset_interval = quota->reset_interval; + scheme->quota.weight_sz = quota->weight_sz; + scheme->quota.weight_nr_accesses = quota->weight_nr_accesses; + scheme->quota.weight_age = quota->weight_age; scheme->quota.total_charged_sz = 0; scheme->quota.total_charged_ns = 0; scheme->quota.esz = 0; @@ -545,6 +549,28 @@ static void damon_split_region_at(struct damon_ctx *ctx, struct damon_target *t, struct damon_region *r, unsigned long sz_r);
+static bool __damos_valid_target(struct damon_region *r, struct damos *s) +{ + unsigned long sz; + + sz = r->ar.end - r->ar.start; + return s->min_sz_region <= sz && sz <= s->max_sz_region && + s->min_nr_accesses <= r->nr_accesses && + r->nr_accesses <= s->max_nr_accesses && + s->min_age_region <= r->age && r->age <= s->max_age_region; +} + +static bool damos_valid_target(struct damon_ctx *c, struct damon_target *t, + struct damon_region *r, struct damos *s) +{ + bool ret = __damos_valid_target(r, s); + + if (!ret || !s->quota.esz || !c->primitive.get_scheme_score) + return ret; + + return c->primitive.get_scheme_score(c, t, r, s) >= s->quota.min_score; +} + static void damon_do_apply_schemes(struct damon_ctx *c, struct damon_target *t, struct damon_region *r) @@ -591,13 +617,7 @@ static void damon_do_apply_schemes(struct damon_ctx *c, quota->charge_addr_from = 0; }
- /* Check the target regions condition */ - if (sz < s->min_sz_region || s->max_sz_region < sz) - continue; - if (r->nr_accesses < s->min_nr_accesses || - s->max_nr_accesses < r->nr_accesses) - continue; - if (r->age < s->min_age_region || s->max_age_region < r->age) + if (!damos_valid_target(c, t, r, s)) continue;
/* Apply the scheme */ @@ -661,6 +681,8 @@ static void kdamond_apply_schemes(struct damon_ctx *c)
damon_for_each_scheme(s, c) { struct damos_quota *quota = &s->quota; + unsigned long cumulated_sz; + unsigned int score, max_score = 0;
if (!quota->ms && !quota->sz) continue; @@ -674,6 +696,32 @@ static void kdamond_apply_schemes(struct damon_ctx *c) quota->charged_sz = 0; damos_set_effective_quota(quota); } + + if (!c->primitive.get_scheme_score) + continue; + + /* Fill up the score histogram */ + memset(quota->histogram, 0, sizeof(quota->histogram)); + damon_for_each_target(t, c) { + damon_for_each_region(r, t) { + if (!__damos_valid_target(r, s)) + continue; + score = c->primitive.get_scheme_score( + c, t, r, s); + quota->histogram[score] += + r->ar.end - r->ar.start; + if (score > max_score) + max_score = score; + } + } + + /* Set the min score limit */ + for (cumulated_sz = 0, score = max_score; ; score--) { + cumulated_sz += quota->histogram[score]; + if (cumulated_sz >= quota->esz || !score) + break; + } + quota->min_score = score; }
damon_for_each_target(t, c) {
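To see how the score histogram filled in kdamond_apply_schemes() turns the effective size quota into a minimum score cutoff, here is a small shell sketch of the same loop, with invented histogram contents (scores and sizes are hypothetical, not from the patch):

#!/bin/bash
# hypothetical bytes of qualifying regions per priority score
declare -A histogram=( [90]=$((40 * 1024 * 1024)) [70]=$((30 * 1024 * 1024)) [50]=$((100 * 1024 * 1024)) )
esz=$((64 * 1024 * 1024))	# effective quota: 64 MiB
cumulated=0
for ((score = 99; ; score--)); do
	cumulated=$((cumulated + ${histogram[$score]:-0}))
	if [ "$cumulated" -ge "$esz" ] || [ "$score" -eq 0 ]; then
		break
	fi
done
echo "min_score=$score"	# 70: only regions scoring >= 70 get the action this window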
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 198f0f4c58b9f481e4e51c8c70a6ab9852bbab7f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This makes the default monitoring primitives for the virtual address spaces and the physical address space support memory region prioritization for the 'PAGEOUT' DAMOS action. It calculates the hotness of each region as a weighted sum of the region's 'nr_accesses' and 'age', and derives the priority score as the reverse of the hotness, so that cold regions can be paged out first.
Link: https://lkml.kernel.org/r/20211019150731.16699-9-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 198f0f4c58b9f481e4e51c8c70a6ab9852bbab7f) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 4 ++++ mm/damon/paddr.c | 14 +++++++++++++ mm/damon/prmtv-common.c | 46 +++++++++++++++++++++++++++++++++++++++++ mm/damon/prmtv-common.h | 3 +++ mm/damon/vaddr.c | 15 ++++++++++++++ 5 files changed, 82 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 5d47ad9e3911..1217566a0ebc 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -421,6 +421,8 @@ bool damon_va_target_valid(void *t); void damon_va_cleanup(struct damon_ctx *ctx); int damon_va_apply_scheme(struct damon_ctx *context, struct damon_target *t, struct damon_region *r, struct damos *scheme); +int damon_va_scheme_score(struct damon_ctx *context, struct damon_target *t, + struct damon_region *r, struct damos *scheme); void damon_va_set_primitives(struct damon_ctx *ctx);
#endif /* CONFIG_DAMON_VADDR */ @@ -433,6 +435,8 @@ unsigned int damon_pa_check_accesses(struct damon_ctx *ctx); bool damon_pa_target_valid(void *t); int damon_pa_apply_scheme(struct damon_ctx *context, struct damon_target *t, struct damon_region *r, struct damos *scheme); +int damon_pa_scheme_score(struct damon_ctx *context, struct damon_target *t, + struct damon_region *r, struct damos *scheme); void damon_pa_set_primitives(struct damon_ctx *ctx);
#endif /* CONFIG_DAMON_PADDR */ diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 957ada55de77..a496d6f203d6 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -246,6 +246,19 @@ int damon_pa_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, return 0; }
+int damon_pa_scheme_score(struct damon_ctx *context, struct damon_target *t, + struct damon_region *r, struct damos *scheme) +{ + switch (scheme->action) { + case DAMOS_PAGEOUT: + return damon_pageout_score(context, r, scheme); + default: + break; + } + + return DAMOS_MAX_SCORE; +} + void damon_pa_set_primitives(struct damon_ctx *ctx) { ctx->primitive.init = NULL; @@ -256,4 +269,5 @@ void damon_pa_set_primitives(struct damon_ctx *ctx) ctx->primitive.target_valid = damon_pa_target_valid; ctx->primitive.cleanup = NULL; ctx->primitive.apply_scheme = damon_pa_apply_scheme; + ctx->primitive.get_scheme_score = damon_pa_scheme_score; } diff --git a/mm/damon/prmtv-common.c b/mm/damon/prmtv-common.c index 7e62ee54fb54..92a04f5831d6 100644 --- a/mm/damon/prmtv-common.c +++ b/mm/damon/prmtv-common.c @@ -85,3 +85,49 @@ void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr) put_page(page); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ } + +#define DAMON_MAX_SUBSCORE (100) +#define DAMON_MAX_AGE_IN_LOG (32) + +int damon_pageout_score(struct damon_ctx *c, struct damon_region *r, + struct damos *s) +{ + unsigned int max_nr_accesses; + int freq_subscore; + unsigned int age_in_sec; + int age_in_log, age_subscore; + unsigned int freq_weight = s->quota.weight_nr_accesses; + unsigned int age_weight = s->quota.weight_age; + int hotness; + + max_nr_accesses = c->aggr_interval / c->sample_interval; + freq_subscore = r->nr_accesses * DAMON_MAX_SUBSCORE / max_nr_accesses; + + age_in_sec = (unsigned long)r->age * c->aggr_interval / 1000000; + for (age_in_log = 0; age_in_log < DAMON_MAX_AGE_IN_LOG && age_in_sec; + age_in_log++, age_in_sec >>= 1) + ; + + /* If frequency is 0, higher age means it's colder */ + if (freq_subscore == 0) + age_in_log *= -1; + + /* + * Now age_in_log is in [-DAMON_MAX_AGE_IN_LOG, DAMON_MAX_AGE_IN_LOG]. + * Scale it to be in [0, 100] and set it as age subscore. + */ + age_in_log += DAMON_MAX_AGE_IN_LOG; + age_subscore = age_in_log * DAMON_MAX_SUBSCORE / + DAMON_MAX_AGE_IN_LOG / 2; + + hotness = (freq_weight * freq_subscore + age_weight * age_subscore); + if (freq_weight + age_weight) + hotness /= freq_weight + age_weight; + /* + * Transform it to fit in [0, DAMOS_MAX_SCORE] + */ + hotness = hotness * DAMOS_MAX_SCORE / DAMON_MAX_SUBSCORE; + + /* Return coldness of the region */ + return DAMOS_MAX_SCORE - hotness; +} diff --git a/mm/damon/prmtv-common.h b/mm/damon/prmtv-common.h index 7093d19e5d42..61f27037603e 100644 --- a/mm/damon/prmtv-common.h +++ b/mm/damon/prmtv-common.h @@ -15,3 +15,6 @@ struct page *damon_get_page(unsigned long pfn);
void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr); void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr); + +int damon_pageout_score(struct damon_ctx *c, struct damon_region *r, + struct damos *s); diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 14768575f906..0d8685d63b3f 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -634,6 +634,20 @@ int damon_va_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, return damos_madvise(t, r, madv_action); }
+int damon_va_scheme_score(struct damon_ctx *context, struct damon_target *t, + struct damon_region *r, struct damos *scheme) +{ + + switch (scheme->action) { + case DAMOS_PAGEOUT: + return damon_pageout_score(context, r, scheme); + default: + break; + } + + return DAMOS_MAX_SCORE; +} + void damon_va_set_primitives(struct damon_ctx *ctx) { ctx->primitive.init = damon_va_init; @@ -644,6 +658,7 @@ void damon_va_set_primitives(struct damon_ctx *ctx) ctx->primitive.target_valid = damon_va_target_valid; ctx->primitive.cleanup = NULL; ctx->primitive.apply_scheme = damon_va_apply_scheme; + ctx->primitive.get_scheme_score = damon_va_scheme_score; }
#include "vaddr-test.h"
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit f4a68b4a04e6db9397f7776c51d0f9715bd1a60e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This allows DAMON debugfs interface users to set the prioritization weights by putting three more numbers to the 'schemes' file.
Link: https://lkml.kernel.org/r/20211019150731.16699-10-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit f4a68b4a04e6db9397f7776c51d0f9715bd1a60e) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 097e6745ba75..20c4feb8b918 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -105,13 +105,16 @@ static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len)
damon_for_each_scheme(s, c) { rc = scnprintf(&buf[written], len - written, - "%lu %lu %u %u %u %u %d %lu %lu %lu %lu %lu\n", + "%lu %lu %u %u %u %u %d %lu %lu %lu %u %u %u %lu %lu\n", s->min_sz_region, s->max_sz_region, s->min_nr_accesses, s->max_nr_accesses, s->min_age_region, s->max_age_region, s->action, s->quota.ms, s->quota.sz, s->quota.reset_interval, + s->quota.weight_sz, + s->quota.weight_nr_accesses, + s->quota.weight_age, s->stat_count, s->stat_sz); if (!rc) return -ENOMEM; @@ -193,11 +196,14 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, while (pos < len && *nr_schemes < max_nr_schemes) { struct damos_quota quota = {};
- ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u %lu %lu %lu%n", + ret = sscanf(&str[pos], + "%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u%n", &min_sz, &max_sz, &min_nr_a, &max_nr_a, &min_age, &max_age, &action, &quota.ms, - &quota.sz, &quota.reset_interval, &parsed); - if (ret != 10) + &quota.sz, &quota.reset_interval, + &quota.weight_sz, &quota.weight_nr_accesses, + &quota.weight_age, &parsed); + if (ret != 13) break; if (!damos_action_valid(action)) { pr_err("wrong action %d\n", action);
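Extending the earlier illustrative command to the 13-field format, the three trailing values are the size, nr_accesses, and age weights (the path and action value remain assumptions, as before):

# same hypothetical scheme, prioritizing by age only (weights: sz=0, nr_accesses=0, age=1)
echo "4096 18446744073709551615 0 0 10 20 2 10 1073741824 1000 0 0 1" > \
	/sys/kernel/debug/damon/schemes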
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 5a0d6a08b81162fbe1e207f02571ace6d888f8b0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This updates the DAMON selftests for the 'schemes' debugfs file, as the file format is updated.
Link: https://lkml.kernel.org/r/20211019150731.16699-11-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 5a0d6a08b81162fbe1e207f02571ace6d888f8b0) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/testing/selftests/damon/debugfs_attrs.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/damon/debugfs_attrs.sh b/tools/testing/selftests/damon/debugfs_attrs.sh index 8e33a7b584e7..466dbeb37e31 100644 --- a/tools/testing/selftests/damon/debugfs_attrs.sh +++ b/tools/testing/selftests/damon/debugfs_attrs.sh @@ -63,10 +63,10 @@ echo "$orig_content" > "$file" file="$DBGFS/schemes" orig_content=$(cat "$file")
-test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0" \ +test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0 1 2 3" \ "$orig_content" "valid input" test_write_fail "$file" "1 2 -3 4 5 6 3 0 0 0" "$orig_content" "multi lines" +3 4 5 6 3 0 0 0 1 2 3" "$orig_content" "multi lines" test_write_succ "$file" "" "$orig_content" "disabling" echo "$orig_content" > "$file"
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit ee801b7dd7822a82fd7663048ad649545fac6df3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- DAMON-based operation schemes need to be manually turned on and off. In some use cases, however, the condition for turning a scheme on or off would depend on the system's situation. For example, schemes for proactive page reclamation would need to be turned on when some memory pressure is detected, and turned off when the system has enough free memory.
For easier control of schemes activation based on the system situation, this introduces a watermarks-based mechanism. The client can describe the watermark metric (e.g., amount of free memory in the system), the watermark check interval, and the three watermarks, namely high, mid, and low. If a scheme is deactivated, DAMON only gets the metric and compares it to the three watermarks at every check interval. If the metric is higher than the high watermark, the scheme is deactivated. If the metric is between the mid watermark and the low watermark, the scheme is activated. If the metric is lower than the low watermark, the scheme is deactivated again. This is to allow users to fall back to traditional page-granularity mechanisms.
Link: https://lkml.kernel.org/r/20211019150731.16699-12-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit ee801b7dd7822a82fd7663048ad649545fac6df3) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 52 ++++++++++++++++++++++- mm/damon/core.c | 97 ++++++++++++++++++++++++++++++++++++++++++- mm/damon/dbgfs.c | 5 ++- 3 files changed, 151 insertions(+), 3 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 1217566a0ebc..c93325efddd7 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -146,6 +146,45 @@ struct damos_quota { unsigned int min_score; };
+/** + * enum damos_wmark_metric - Represents the watermark metric. + * + * @DAMOS_WMARK_NONE: Ignore the watermarks of the given scheme. + * @DAMOS_WMARK_FREE_MEM_RATE: Free memory rate of the system in [0,1000]. + */ +enum damos_wmark_metric { + DAMOS_WMARK_NONE, + DAMOS_WMARK_FREE_MEM_RATE, +}; + +/** + * struct damos_watermarks - Controls when a given scheme should be activated. + * @metric: Metric for the watermarks. + * @interval: Watermarks check time interval in microseconds. + * @high: High watermark. + * @mid: Middle watermark. + * @low: Low watermark. + * + * If &metric is &DAMOS_WMARK_NONE, the scheme is always active. Being active + * means DAMON does monitoring and applying the action of the scheme to + * appropriate memory regions. Else, DAMON checks &metric of the system for at + * least every &interval microseconds and works as below. + * + * If &metric is higher than &high, the scheme is inactivated. If &metric is + * between &mid and &low, the scheme is activated. If &metric is lower than + * &low, the scheme is inactivated. + */ +struct damos_watermarks { + enum damos_wmark_metric metric; + unsigned long interval; + unsigned long high; + unsigned long mid; + unsigned long low; + +/* private: */ + bool activated; +}; + /** * struct damos - Represents a Data Access Monitoring-based Operation Scheme. * @min_sz_region: Minimum size of target regions. @@ -156,6 +195,7 @@ struct damos_quota { * @max_age_region: Maximum age of target regions. * @action: &damos_action to be applied to the target regions. * @quota: Control the aggressiveness of this scheme. + * @wmarks: Watermarks for automated (in)activation of this scheme. * @stat_count: Total number of regions that this scheme is applied. * @stat_sz: Total size of regions that this scheme is applied. * @list: List head for siblings. @@ -166,6 +206,14 @@ struct damos_quota { * those. To avoid consuming too much CPU time or IO resources for the * &action, &quota is used. + * + * To do the work only when needed, schemes can be activated for specific + * system situations using &wmarks. If all schemes that are registered to the + * monitoring context are inactive, DAMON stops monitoring as well, and just + * repeatedly checks the watermarks. + * * After applying the &action to each region, &stat_count and &stat_sz are * updated to reflect the number of regions and total size of regions that the * &action is applied. @@ -179,6 +227,7 @@ struct damos { unsigned int max_age_region; enum damos_action action; struct damos_quota quota; + struct damos_watermarks wmarks; unsigned long stat_count; unsigned long stat_sz; struct list_head list; @@ -384,7 +433,8 @@ struct damos *damon_new_scheme( unsigned long min_sz_region, unsigned long max_sz_region, unsigned int min_nr_accesses, unsigned int max_nr_accesses, unsigned int min_age_region, unsigned int max_age_region, - enum damos_action action, struct damos_quota *quota); + enum damos_action action, struct damos_quota *quota, + struct damos_watermarks *wmarks); void damon_add_scheme(struct damon_ctx *ctx, struct damos *s); void damon_destroy_scheme(struct damos *s);
diff --git a/mm/damon/core.c b/mm/damon/core.c index fad25778e2ec..6993c60ae31c 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -10,6 +10,7 @@ #include <linux/damon.h> #include <linux/delay.h> #include <linux/kthread.h> +#include <linux/mm.h> #include <linux/random.h> #include <linux/slab.h> #include <linux/string.h> @@ -90,7 +91,8 @@ struct damos *damon_new_scheme( unsigned long min_sz_region, unsigned long max_sz_region, unsigned int min_nr_accesses, unsigned int max_nr_accesses, unsigned int min_age_region, unsigned int max_age_region, - enum damos_action action, struct damos_quota *quota) + enum damos_action action, struct damos_quota *quota, + struct damos_watermarks *wmarks) { struct damos *scheme;
@@ -122,6 +124,13 @@ struct damos *damon_new_scheme( scheme->quota.charge_target_from = NULL; scheme->quota.charge_addr_from = 0;
+ scheme->wmarks.metric = wmarks->metric; + scheme->wmarks.interval = wmarks->interval; + scheme->wmarks.high = wmarks->high; + scheme->wmarks.mid = wmarks->mid; + scheme->wmarks.low = wmarks->low; + scheme->wmarks.activated = true; + return scheme; }
@@ -582,6 +591,9 @@ static void damon_do_apply_schemes(struct damon_ctx *c, unsigned long sz = r->ar.end - r->ar.start; struct timespec64 begin, end;
+ if (!s->wmarks.activated) + continue; + /* Check the quota */ if (quota->esz && quota->charged_sz >= quota->esz) continue; @@ -684,6 +696,9 @@ static void kdamond_apply_schemes(struct damon_ctx *c) unsigned long cumulated_sz; unsigned int score, max_score = 0;
+ if (!s->wmarks.activated) + continue; + if (!quota->ms && !quota->sz) continue;
@@ -924,6 +939,83 @@ static bool kdamond_need_stop(struct damon_ctx *ctx) return true; }
+static unsigned long damos_wmark_metric_value(enum damos_wmark_metric metric) +{ + struct sysinfo i; + + switch (metric) { + case DAMOS_WMARK_FREE_MEM_RATE: + si_meminfo(&i); + return i.freeram * 1000 / i.totalram; + default: + break; + } + return -EINVAL; +} + +/* + * Returns zero if the scheme is active. Else, returns time to wait for next + * watermark check in micro-seconds. + */ +static unsigned long damos_wmark_wait_us(struct damos *scheme) +{ + unsigned long metric; + + if (scheme->wmarks.metric == DAMOS_WMARK_NONE) + return 0; + + metric = damos_wmark_metric_value(scheme->wmarks.metric); + /* higher than high watermark or lower than low watermark */ + if (metric > scheme->wmarks.high || scheme->wmarks.low > metric) { + if (scheme->wmarks.activated) + pr_debug("inactivate a scheme (%d) for %s wmark\n", + scheme->action, + metric > scheme->wmarks.high ? + "high" : "low"); + scheme->wmarks.activated = false; + return scheme->wmarks.interval; + } + + /* inactive and higher than middle watermark */ + if ((scheme->wmarks.high >= metric && metric >= scheme->wmarks.mid) && + !scheme->wmarks.activated) + return scheme->wmarks.interval; + + if (!scheme->wmarks.activated) + pr_debug("activate a scheme (%d)\n", scheme->action); + scheme->wmarks.activated = true; + return 0; +} + +static void kdamond_usleep(unsigned long usecs) +{ + if (usecs > 100 * 1000) + schedule_timeout_interruptible(usecs_to_jiffies(usecs)); + else + usleep_range(usecs, usecs + 1); +} + +/* Returns negative error code if it's not activated but should return */ +static int kdamond_wait_activation(struct damon_ctx *ctx) +{ + struct damos *s; + unsigned long wait_time; + unsigned long min_wait_time = 0; + + while (!kdamond_need_stop(ctx)) { + damon_for_each_scheme(s, ctx) { + wait_time = damos_wmark_wait_us(s); + if (!min_wait_time || wait_time < min_wait_time) + min_wait_time = wait_time; + } + if (!min_wait_time) + return 0; + + kdamond_usleep(min_wait_time); + } + return -EBUSY; +} + static void set_kdamond_stop(struct damon_ctx *ctx) { mutex_lock(&ctx->kdamond_lock); @@ -952,6 +1044,9 @@ static int kdamond_fn(void *data) sz_limit = damon_region_sz_limit(ctx);
while (!kdamond_need_stop(ctx)) { + if (kdamond_wait_activation(ctx)) + continue; + if (ctx->primitive.prepare_access_checks) ctx->primitive.prepare_access_checks(ctx); if (ctx->callback.after_sampling && diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 20c4feb8b918..9f13060d1058 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -195,6 +195,9 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, *nr_schemes = 0; while (pos < len && *nr_schemes < max_nr_schemes) { struct damos_quota quota = {}; + struct damos_watermarks wmarks = { + .metric = DAMOS_WMARK_NONE, + };
ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u%n", @@ -212,7 +215,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,
pos += parsed; scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a, - min_age, max_age, action, &quota); + min_age, max_age, action, &quota, &wmarks); if (!scheme) goto fail;
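For a feel of the DAMOS_WMARK_FREE_MEM_RATE arithmetic in damos_wmark_metric_value() above, here is the same per-thousand computation in shell, with invented meminfo numbers:

freeram=$((1 * 1024 * 1024))	# hypothetical: 1M pages free
totalram=$((4 * 1024 * 1024))	# hypothetical: 4M pages total
metric=$((freeram * 1000 / totalram))	# 250, i.e. 25.0% free
# against watermarks high=500, mid=400, low=200: 250 is not above high, not
# below low, and is under mid, so a deactivated scheme would be activated
echo "$metric"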
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit ae666a6dddfd119da55cc1bad54f7cbd8b2ef54c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This updates the DAMON debugfs interface to support the watermarks-based schemes activation. For this, the 'schemes' file now receives five more values.
Link: https://lkml.kernel.org/r/20211019150731.16699-13-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit ae666a6dddfd119da55cc1bad54f7cbd8b2ef54c) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/dbgfs.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 9f13060d1058..6828e463348b 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -105,7 +105,7 @@ static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len)
damon_for_each_scheme(s, c) { rc = scnprintf(&buf[written], len - written, - "%lu %lu %u %u %u %u %d %lu %lu %lu %u %u %u %lu %lu\n", + "%lu %lu %u %u %u %u %d %lu %lu %lu %u %u %u %d %lu %lu %lu %lu %lu %lu\n", s->min_sz_region, s->max_sz_region, s->min_nr_accesses, s->max_nr_accesses, s->min_age_region, s->max_age_region, @@ -115,6 +115,8 @@ static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len) s->quota.weight_sz, s->quota.weight_nr_accesses, s->quota.weight_age, + s->wmarks.metric, s->wmarks.interval, + s->wmarks.high, s->wmarks.mid, s->wmarks.low, s->stat_count, s->stat_sz); if (!rc) return -ENOMEM; @@ -195,18 +197,18 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, *nr_schemes = 0; while (pos < len && *nr_schemes < max_nr_schemes) { struct damos_quota quota = {}; - struct damos_watermarks wmarks = { - .metric = DAMOS_WMARK_NONE, - }; + struct damos_watermarks wmarks;
ret = sscanf(&str[pos], - "%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u%n", + "%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u %u %lu %lu %lu %lu%n", &min_sz, &max_sz, &min_nr_a, &max_nr_a, &min_age, &max_age, &action, &quota.ms, &quota.sz, &quota.reset_interval, &quota.weight_sz, &quota.weight_nr_accesses, - &quota.weight_age, &parsed); - if (ret != 13) + &quota.weight_age, &wmarks.metric, + &wmarks.interval, &wmarks.high, &wmarks.mid, + &wmarks.low, &parsed); + if (ret != 18) break; if (!damos_action_valid(action)) { pr_err("wrong action %d\n", action);
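Completing the illustrative 'schemes' command for the 18-field format, the five trailing values are the watermark metric (1 is assumed to encode DAMOS_WMARK_FREE_MEM_RATE), the check interval in microseconds, and the high, mid, and low thresholds:

# as before, plus: check free memory every 5 s, act only between 20.0% and 50.0% free
echo "4096 18446744073709551615 0 0 10 20 2 10 1073741824 1000 0 0 1 1 5000000 500 400 200" > \
	/sys/kernel/debug/damon/schemes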
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 1dc90ccd15c55cc3edec508466db9248a841acad category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This updates the DAMON selftests for the 'schemes' debugfs file to reflect the changes in the format.
Link: https://lkml.kernel.org/r/20211019150731.16699-14-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 1dc90ccd15c55cc3edec508466db9248a841acad) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- tools/testing/selftests/damon/debugfs_attrs.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/damon/debugfs_attrs.sh b/tools/testing/selftests/damon/debugfs_attrs.sh index 466dbeb37e31..196b6640bf37 100644 --- a/tools/testing/selftests/damon/debugfs_attrs.sh +++ b/tools/testing/selftests/damon/debugfs_attrs.sh @@ -63,10 +63,10 @@ echo "$orig_content" > "$file" file="$DBGFS/schemes" orig_content=$(cat "$file")
-test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0 1 2 3" \ +test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0 1 2 3 1 100 3 2 1" \ "$orig_content" "valid input" test_write_fail "$file" "1 2 -3 4 5 6 3 0 0 0 1 2 3" "$orig_content" "multi lines" +3 4 5 6 3 0 0 0 1 2 3 1 100 3 2 1" "$orig_content" "multi lines" test_write_succ "$file" "" "$orig_content" "disabling" echo "$orig_content" > "$file"
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 43b0536cb4710e7bb591edfda7e68a1c327a3409 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This implements a new kernel subsystem that finds cold memory regions using DAMON and reclaims those immediately. It is intended to be used as a proactive and lightweight reclamation logic for light memory pressure. For heavy memory pressure, it could be deactivated to fall back to the traditional page-scanning based reclamation.
It's implemented on top of the DAMON framework to use the DAMON-based Operation Schemes (DAMOS) feature. It utilizes all of the DAMOS features, including the speed limit, prioritization, and watermarks.
It can be enabled and tuned at boot time via kernel boot parameters, and at run time via its module parameters interface ('/sys/module/damon_reclaim/parameters/').
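For instance, a kernel command line fragment like the following could enable it at boot with a 60-second coldness threshold and a 5 ms / 100 MiB per-second quota (the values are illustrative, not recommendations):

damon_reclaim.enabled=Y damon_reclaim.min_age=60000000 damon_reclaim.quota_ms=5 damon_reclaim.quota_sz=104857600 damon_reclaim.quota_reset_interval_ms=1000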
[yangyingliang@huawei.com: fix error return code in damon_reclaim_turn()] Link: https://lkml.kernel.org/r/20211025124500.2758060-1-yangyingliang@huawei.com
Link: https://lkml.kernel.org/r/20211019150731.16699-15-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Signed-off-by: Yang Yingliang yangyingliang@huawei.com Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 43b0536cb4710e7bb591edfda7e68a1c327a3409) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/Kconfig | 12 ++ mm/damon/Makefile | 1 + mm/damon/reclaim.c | 356 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 369 insertions(+) create mode 100644 mm/damon/reclaim.c
diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig index ca33b289ebbe..5bcf05851ad0 100644 --- a/mm/damon/Kconfig +++ b/mm/damon/Kconfig @@ -73,4 +73,16 @@ config DAMON_DBGFS_KUNIT_TEST
If unsure, say N.
+config DAMON_RECLAIM + bool "Build DAMON-based reclaim (DAMON_RECLAIM)" + depends on DAMON_PADDR + help + This builds the DAMON-based reclamation subsystem. It finds pages + that are not accessed for a long time (cold) using DAMON and reclaims + those. + + This is suggested to be used as a proactive and lightweight + reclamation under light memory pressure, while the traditional page + scanning-based reclamation is used for heavy pressure. + endmenu diff --git a/mm/damon/Makefile b/mm/damon/Makefile index 8d9b0df79702..f7d5ac377a2b 100644 --- a/mm/damon/Makefile +++ b/mm/damon/Makefile @@ -4,3 +4,4 @@ obj-$(CONFIG_DAMON) := core.o obj-$(CONFIG_DAMON_VADDR) += prmtv-common.o vaddr.o obj-$(CONFIG_DAMON_PADDR) += prmtv-common.o paddr.o obj-$(CONFIG_DAMON_DBGFS) += dbgfs.o +obj-$(CONFIG_DAMON_RECLAIM) += reclaim.o diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c new file mode 100644 index 000000000000..dc1485044eaf --- /dev/null +++ b/mm/damon/reclaim.c @@ -0,0 +1,356 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * DAMON-based page reclamation + * + * Author: SeongJae Park sj@kernel.org + */ + +#define pr_fmt(fmt) "damon-reclaim: " fmt + +#include <linux/damon.h> +#include <linux/ioport.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/workqueue.h> + +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "damon_reclaim." + +/* + * Enable or disable DAMON_RECLAIM. + * + * You can enable DAMON_RECLAIM by setting the value of this parameter as + * ``Y``. Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM + * could do no real monitoring and reclamation due to the watermarks-based + * activation condition. Refer to the below descriptions of the watermark + * parameters for this. + */ +static bool enabled __read_mostly; +module_param(enabled, bool, 0600); + +/* + * Time threshold for cold memory regions identification in microseconds. + * + * If a memory region is not accessed for this or longer time, DAMON_RECLAIM + * identifies the region as cold, and reclaims it. 120 seconds by default. + */ +static unsigned long min_age __read_mostly = 120000000; +module_param(min_age, ulong, 0600); + +/* + * Limit of time for trying the reclamation in milliseconds. + * + * DAMON_RECLAIM tries to use only up to this time within a time window + * (quota_reset_interval_ms) for trying reclamation of cold pages. This can be + * used for limiting CPU consumption of DAMON_RECLAIM. If the value is zero, + * the limit is disabled. + * + * 10 ms by default. + */ +static unsigned long quota_ms __read_mostly = 10; +module_param(quota_ms, ulong, 0600); + +/* + * Limit of size of memory for the reclamation in bytes. + * + * DAMON_RECLAIM charges the amount of memory which it tried to reclaim within + * a time window (quota_reset_interval_ms) and makes sure no more than this + * limit is tried. This can be used for limiting consumption of CPU and IO. + * If this value is zero, the limit is disabled. + * + * 128 MiB by default. + */ +static unsigned long quota_sz __read_mostly = 128 * 1024 * 1024; +module_param(quota_sz, ulong, 0600); + +/* + * The time/size quota charge reset interval in milliseconds. + * + * The charge reset interval for the quota of time (quota_ms) and size + * (quota_sz). That is, DAMON_RECLAIM does not try reclamation for more than + * quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms + * milliseconds. + * + * 1 second by default.
+ */ +static unsigned long quota_reset_interval_ms __read_mostly = 1000; +module_param(quota_reset_interval_ms, ulong, 0600); + +/* + * The watermarks check time interval in microseconds. + * + * Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is + * enabled but inactive due to its watermarks rule. 5 seconds by default. + */ +static unsigned long wmarks_interval __read_mostly = 5000000; +module_param(wmarks_interval, ulong, 0600); + +/* + * Free memory rate (per thousand) for the high watermark. + * + * If free memory of the system in bytes per thousand bytes is higher than + * this, DAMON_RECLAIM becomes inactive, so it does nothing but periodically + * checks the watermarks. 500 (50%) by default. + */ +static unsigned long wmarks_high __read_mostly = 500; +module_param(wmarks_high, ulong, 0600); + +/* + * Free memory rate (per thousand) for the middle watermark. + * + * If free memory of the system in bytes per thousand bytes is between this and + * the low watermark, DAMON_RECLAIM becomes active, so it starts the monitoring + * and the reclaiming. 400 (40%) by default. + */ +static unsigned long wmarks_mid __read_mostly = 400; +module_param(wmarks_mid, ulong, 0600); + +/* + * Free memory rate (per thousand) for the low watermark. + * + * If free memory of the system in bytes per thousand bytes is lower than this, + * DAMON_RECLAIM becomes inactive, so it does nothing but periodically checks + * the watermarks. In that case, the system falls back to the LRU-based page + * granularity reclamation logic. 200 (20%) by default. + */ +static unsigned long wmarks_low __read_mostly = 200; +module_param(wmarks_low, ulong, 0600); + +/* + * Sampling interval for the monitoring in microseconds. + * + * The sampling interval of DAMON for the cold memory monitoring. Please refer + * to the DAMON documentation for more detail. 5 ms by default. + */ +static unsigned long sample_interval __read_mostly = 5000; +module_param(sample_interval, ulong, 0600); + +/* + * Aggregation interval for the monitoring in microseconds. + * + * The aggregation interval of DAMON for the cold memory monitoring. Please + * refer to the DAMON documentation for more detail. 100 ms by default. + */ +static unsigned long aggr_interval __read_mostly = 100000; +module_param(aggr_interval, ulong, 0600); + +/* + * Minimum number of monitoring regions. + * + * The minimal number of monitoring regions of DAMON for the cold memory + * monitoring. This can be used to set a lower-bound of the monitoring + * quality. But, setting this too high could result in increased monitoring + * overhead. Please refer to the DAMON documentation for more detail. 10 by + * default. + */ +static unsigned long min_nr_regions __read_mostly = 10; +module_param(min_nr_regions, ulong, 0600); + +/* + * Maximum number of monitoring regions. + * + * The maximum number of monitoring regions of DAMON for the cold memory + * monitoring. This can be used to set an upper-bound of the monitoring + * overhead. However, setting this too low could result in bad monitoring + * quality. Please refer to the DAMON documentation for more detail. 1000 by + * default. + */ +static unsigned long max_nr_regions __read_mostly = 1000; +module_param(max_nr_regions, ulong, 0600); + +/* + * Start of the target memory region in physical address. + * + * The start physical address of memory region that DAMON_RECLAIM will do work + * against. By default, the biggest System RAM is used as the region.
+ */ +static unsigned long monitor_region_start __read_mostly; +module_param(monitor_region_start, ulong, 0600); + +/* + * End of the target memory region in physical address. + * + * The end physical address of memory region that DAMON_RECLAIM will do work + * against. By default, the biggest System RAM is used as the region. + */ +static unsigned long monitor_region_end __read_mostly; +module_param(monitor_region_end, ulong, 0600); + +/* + * PID of the DAMON thread + * + * If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. + * Else, -1. + */ +static int kdamond_pid __read_mostly = -1; +module_param(kdamond_pid, int, 0400); + +static struct damon_ctx *ctx; +static struct damon_target *target; + +struct damon_reclaim_ram_walk_arg { + unsigned long start; + unsigned long end; +}; + +static int walk_system_ram(struct resource *res, void *arg) +{ + struct damon_reclaim_ram_walk_arg *a = arg; + + if (a->end - a->start < res->end - res->start) { + a->start = res->start; + a->end = res->end; + } + return 0; +} + +/* + * Find the biggest 'System RAM' resource and store its start and end address + * in @start and @end, respectively. If no System RAM is found, returns false. + */ +static bool get_monitoring_region(unsigned long *start, unsigned long *end) +{ + struct damon_reclaim_ram_walk_arg arg = {}; + + walk_system_ram_res(0, ULONG_MAX, &arg, walk_system_ram); + if (arg.end <= arg.start) + return false; + + *start = arg.start; + *end = arg.end; + return true; +} + +static struct damos *damon_reclaim_new_scheme(void) +{ + struct damos_watermarks wmarks = { + .metric = DAMOS_WMARK_FREE_MEM_RATE, + .interval = wmarks_interval, + .high = wmarks_high, + .mid = wmarks_mid, + .low = wmarks_low, + }; + struct damos_quota quota = { + /* + * Do not try reclamation for more than quota_ms milliseconds + * or quota_sz bytes within quota_reset_interval_ms. + */ + .ms = quota_ms, + .sz = quota_sz, + .reset_interval = quota_reset_interval_ms, + /* Within the quota, page out older regions first. */ + .weight_sz = 0, + .weight_nr_accesses = 0, + .weight_age = 1 + }; + struct damos *scheme = damon_new_scheme( + /* Find regions having PAGE_SIZE or larger size */ + PAGE_SIZE, ULONG_MAX, + /* and not accessed at all */ + 0, 0, + /* for min_age or more micro-seconds, and */ + min_age / aggr_interval, UINT_MAX, + /* page out those, as soon as found */ + DAMOS_PAGEOUT, + /* under the quota. */ + &quota, + /* (De)activate this according to the watermarks.
*/ + &wmarks); + + return scheme; +} + +static int damon_reclaim_turn(bool on) +{ + struct damon_region *region; + struct damos *scheme; + int err; + + if (!on) { + err = damon_stop(&ctx, 1); + if (!err) + kdamond_pid = -1; + return err; + } + + err = damon_set_attrs(ctx, sample_interval, aggr_interval, 0, + min_nr_regions, max_nr_regions); + if (err) + return err; + + if (monitor_region_start > monitor_region_end) + return -EINVAL; + if (!monitor_region_start && !monitor_region_end && + !get_monitoring_region(&monitor_region_start, + &monitor_region_end)) + return -EINVAL; + /* DAMON will free this on its own when finish monitoring */ + region = damon_new_region(monitor_region_start, monitor_region_end); + if (!region) + return -ENOMEM; + damon_add_region(region, target); + + /* Will be freed by 'damon_set_schemes()' below */ + scheme = damon_reclaim_new_scheme(); + if (!scheme) { + err = -ENOMEM; + goto free_region_out; + } + err = damon_set_schemes(ctx, &scheme, 1); + if (err) + goto free_scheme_out; + + err = damon_start(&ctx, 1); + if (!err) { + kdamond_pid = ctx->kdamond->pid; + return 0; + } + +free_scheme_out: + damon_destroy_scheme(scheme); +free_region_out: + damon_destroy_region(region, target); + return err; +} + +#define ENABLE_CHECK_INTERVAL_MS 1000 +static struct delayed_work damon_reclaim_timer; +static void damon_reclaim_timer_fn(struct work_struct *work) +{ + static bool last_enabled; + bool now_enabled; + + now_enabled = enabled; + if (last_enabled != now_enabled) { + if (!damon_reclaim_turn(now_enabled)) + last_enabled = now_enabled; + else + enabled = last_enabled; + } + + schedule_delayed_work(&damon_reclaim_timer, + msecs_to_jiffies(ENABLE_CHECK_INTERVAL_MS)); +} +static DECLARE_DELAYED_WORK(damon_reclaim_timer, damon_reclaim_timer_fn); + +static int __init damon_reclaim_init(void) +{ + ctx = damon_new_ctx(); + if (!ctx) + return -ENOMEM; + + damon_pa_set_primitives(ctx); + + /* 4242 means nothing but fun */ + target = damon_new_target(4242); + if (!target) { + damon_destroy_ctx(ctx); + return -ENOMEM; + } + damon_add_target(ctx, target); + + schedule_delayed_work(&damon_reclaim_timer, 0); + return 0; +} + +module_init(damon_reclaim_init);
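One detail worth noting in damon_reclaim_new_scheme() above is the unit conversion: the min_age module parameter is in microseconds, while a scheme's minimum region age is counted in aggregation intervals. With the module defaults, the conversion works out as:

min_age=120000000	# 120 s, in microseconds (module default)
aggr_interval=100000	# 100 ms, in microseconds (module default)
echo $((min_age / aggr_interval))	# 1200 intervals of no access before paging out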
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit bec976b691437d056a92964cb7af07ee1a54221a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- This adds an admin-guide document for DAMON-based Reclamation.
Link: https://lkml.kernel.org/r/20211019150731.16699-16-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Amit Shah amit@kernel.org Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: David Hildenbrand david@redhat.com Cc: David Rientjes rientjes@google.com Cc: David Woodhouse dwmw@amazon.com Cc: Greg Thelen gthelen@google.com Cc: Jonathan Cameron Jonathan.Cameron@huawei.com Cc: Jonathan Corbet corbet@lwn.net Cc: Leonard Foerster foersleo@amazon.de Cc: Marco Elver elver@google.com Cc: Markus Boehme markubo@amazon.de Cc: Shakeel Butt shakeelb@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit bec976b691437d056a92964cb7af07ee1a54221a) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/index.rst | 1 + .../admin-guide/mm/damon/reclaim.rst | 235 ++++++++++++++++++ 2 files changed, 236 insertions(+) create mode 100644 Documentation/admin-guide/mm/damon/reclaim.rst
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 8c5dde3a5754..61aff88347f3 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -13,3 +13,4 @@ optimize those.
start usage + reclaim diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst new file mode 100644 index 000000000000..fb9def3a7355 --- /dev/null +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -0,0 +1,235 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +DAMON-based Reclamation +======================= + +DAMON-based Reclamation (DAMON_RECLAIM) is a static kernel module that is +aimed to be used for proactive and lightweight reclamation under light memory +pressure. It doesn't aim to replace the LRU-list based page granularity +reclamation, but to be selectively used for different levels of memory +pressure and requirements. + +Where Is Proactive Reclamation Required? +======================================== + +On general memory over-committed systems, proactively reclaiming cold pages +helps save memory and reduce the latency spikes incurred by the direct reclaim +of the process or the CPU consumption of kswapd, while incurring only minimal +performance degradation [1]_ [2]_ . + +Free Pages Reporting [3]_ based memory over-commit virtualization systems are +a good example of such cases. In such systems, the guest VMs report their +free memory to the host, and the host reallocates the reported memory to other +guests. As a result, the memory of the systems is fully utilized. However, +the guests may not be so memory-frugal, mainly because some kernel subsystems +and user-space applications are designed to use as much memory as available. +Then, the guests could report only a small amount of memory as free to the +host, resulting in a drop of the memory utilization of the systems. Running +the proactive reclamation in the guests could mitigate this problem. + +How Does It Work? +================= + +DAMON_RECLAIM finds memory regions that haven't been accessed for a specific +time duration and pages them out. To avoid consuming too much CPU for the +paging out operation, a speed limit can be configured. Under the speed limit, +it pages out the memory regions that haven't been accessed for a longer time +first. System administrators can also configure under what situation this +scheme should be automatically activated and deactivated with three memory +pressure watermarks. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a +kernel that is built with ``CONFIG_DAMON_RECLAIM=y``. + +To let sysadmins enable or disable it and tune it for the given system, +DAMON_RECLAIM utilizes module parameters. That is, you can put +``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write +proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files. + +Note that the parameter values except ``enabled`` are applied only when +DAMON_RECLAIM starts. Therefore, if you want to apply new parameter values at +runtime while DAMON_RECLAIM is already enabled, you should disable and +re-enable it via the ``enabled`` parameter file. The new values should be +written to the proper parameter files before the re-enablement. + +Below are the descriptions of each parameter. + +enabled +------- + +Enable or disable DAMON_RECLAIM. + +You can enable DAMON_RECLAIM by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do +no real monitoring and reclamation due to the watermarks-based activation +condition. Refer to the descriptions of the watermark parameters below for +this.
+ +min_age +------- + +Time threshold for the identification of cold memory regions, in +microseconds. + +If a memory region is not accessed for this amount of time or longer, +DAMON_RECLAIM identifies the region as cold, and reclaims it. + +120 seconds by default. + +quota_ms +-------- + +Time limit for the reclamation, in milliseconds. + +DAMON_RECLAIM tries to use only up to this time within a time window +(quota_reset_interval_ms) for trying reclamation of cold pages. This can be +used for limiting the CPU consumption of DAMON_RECLAIM. If the value is zero, +the limit is disabled. + +10 ms by default. + +quota_sz +-------- + +Size limit of memory for the reclamation, in bytes. + +DAMON_RECLAIM charges the amount of memory which it tried to reclaim within a +time window (quota_reset_interval_ms) and ensures that no more than this limit +is tried. This can be used for limiting the consumption of CPU and IO. If +this value is zero, the limit is disabled. + +128 MiB by default. + +quota_reset_interval_ms +----------------------- + +The time/size quota charge reset interval, in milliseconds. + +The charge reset interval for the quota of time (quota_ms) and size +(quota_sz). That is, DAMON_RECLAIM does not try reclamation for more than +quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms +milliseconds. + +1 second by default. + +wmarks_interval +--------------- + +Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is +enabled but inactive due to its watermarks rule. + +wmarks_high +----------- + +Free memory rate (per thousand) for the high watermark. + +If the free memory of the system in bytes per thousand bytes is higher than +this, DAMON_RECLAIM becomes inactive, so it does nothing but periodically +check the watermarks. + +wmarks_mid +---------- + +Free memory rate (per thousand) for the middle watermark. + +If the free memory of the system in bytes per thousand bytes is between this +and the low watermark, DAMON_RECLAIM becomes active, so it starts the +monitoring and the reclaiming. + +wmarks_low +---------- + +Free memory rate (per thousand) for the low watermark. + +If the free memory of the system in bytes per thousand bytes is lower than +this, DAMON_RECLAIM becomes inactive, so it does nothing but periodically +check the watermarks. In that case, the system falls back to the LRU-list +based page granularity reclamation logic. + +sample_interval +--------------- + +Sampling interval for the monitoring, in microseconds. + +The sampling interval of DAMON for the cold memory monitoring. Please refer +to the DAMON documentation (:doc:`usage`) for more detail. + +aggr_interval +------------- + +Aggregation interval for the monitoring, in microseconds. + +The aggregation interval of DAMON for the cold memory monitoring. Please +refer to the DAMON documentation (:doc:`usage`) for more detail. + +min_nr_regions +-------------- + +Minimum number of monitoring regions. + +The minimal number of monitoring regions of DAMON for the cold memory +monitoring. This can be used to set a lower bound on the monitoring quality. +However, setting this too high could result in increased monitoring overhead. +Please refer to the DAMON documentation (:doc:`usage`) for more detail. + +max_nr_regions +-------------- + +Maximum number of monitoring regions. + +The maximum number of monitoring regions of DAMON for the cold memory +monitoring. This can be used to set an upper bound on the monitoring +overhead. However, setting this too low could result in bad monitoring +quality.
Please +refer to the DAMON documentation (:doc:`usage`) for more detail. + +monitor_region_start +-------------------- + +Start of the target memory region in physical address. + +The start physical address of the memory region that DAMON_RECLAIM will do +work against. That is, DAMON_RECLAIM will find cold memory regions in this +region and reclaim them. By default, the biggest System RAM region is used. + +monitor_region_end +------------------ + +End of the target memory region in physical address. + +The end physical address of the memory region that DAMON_RECLAIM will do work +against. That is, DAMON_RECLAIM will find cold memory regions in this region +and reclaim them. By default, the biggest System RAM region is used. + +kdamond_pid +----------- + +PID of the DAMON thread. + +If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. +Else, -1. + +Example +======= + +The example commands below make DAMON_RECLAIM find memory regions that have +not been accessed for 30 seconds or more and page them out. The reclamation +is limited to at most 1 GiB per second to avoid DAMON_RECLAIM consuming too +much CPU time for the paging out operation. It also asks DAMON_RECLAIM to do +nothing if the system's free memory rate is more than 50%, but to start the +real work if it becomes lower than 40%. If DAMON_RECLAIM doesn't make +progress and therefore the free memory rate becomes lower than 20%, it asks +DAMON_RECLAIM to do nothing again, so that we can fall back to the LRU-list +based page granularity reclamation. :: + + # cd /sys/module/damon_reclaim/parameters + # echo 30000000 > min_age + # echo $((1 * 1024 * 1024 * 1024)) > quota_sz + # echo 1000 > quota_reset_interval_ms + # echo 500 > wmarks_high + # echo 400 > wmarks_mid + # echo 200 > wmarks_low + # echo Y > enabled + +.. [1] https://research.google/pubs/pub48551/ +.. [2] https://lwn.net/Articles/787611/ +.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
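As a usage illustration, the documented example tuning can also be applied programmatically by writing the module parameter files. The following user-space sketch is hypothetical (the helper name and the minimal error handling are not part of DAMON); it assumes only the parameter paths and values from the document above:

#include <stdio.h>

/* Hypothetical helper: write one DAMON_RECLAIM module parameter. */
static int write_param(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/module/damon_reclaim/parameters/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* 30 s cold threshold, 1 GiB/s quota, 50%/40%/20% watermarks */
	write_param("min_age", "30000000");
	write_param("quota_sz", "1073741824");
	write_param("quota_reset_interval_ms", "1000");
	write_param("wmarks_high", "500");
	write_param("wmarks_mid", "400");
	write_param("wmarks_low", "200");
	return write_param("enabled", "Y");
}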
From: Xin Hao xhao@linux.alibaba.com
mainline inclusion from mainline-5.16 commit a460a36034bad4403c2c62e04a521bc6987ae5db category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Patch series "mm/damon: Fix some small bugs", v4.
This patch (of 2):
In 'damon_va_apply_three_regions' there is no need to set variable 'i' to zero.
Link: https://lkml.kernel.org/r/b7df8d3dad0943a37e01f60c441b1968b2b20354.163472032... Link: https://lkml.kernel.org/r/cover.1634720326.git.xhao@linux.alibaba.com Signed-off-by: Xin Hao xhao@linux.alibaba.com Reviewed-by: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit a460a36034bad4403c2c62e04a521bc6987ae5db) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/vaddr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 0d8685d63b3f..47f47f60440e 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -307,7 +307,7 @@ static void damon_va_apply_three_regions(struct damon_target *t, struct damon_addr_range bregions[3]) { struct damon_region *r, *next; - unsigned int i = 0; + unsigned int i;
/* Remove regions which are not in the three big regions now */ damon_for_each_region_safe(r, next, t) {
From: Xin Hao xhao@linux.alibaba.com
mainline inclusion from mainline-5.16 commit b5ca3e83ddb05342b1b30700b999cb9b107511f6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- When the ctx->adaptive_targets list is empty, I did some tests on the monitor_on interface like this.
# cat /sys/kernel/debug/damon/target_ids # # echo on > /sys/kernel/debug/damon/monitor_on # damon: kdamond (5390) starts
Though the ctx->adaptive_targets list is empty, kthread_run() is still called and the kdamond.x thread is still created, which is meaningless.
So this adds a check in 'dbgfs_monitor_on_write': if the ctx->adaptive_targets list is empty, return -EINVAL.
Link: https://lkml.kernel.org/r/0a60a6e8ec9d71989e0848a4dc3311996ca3b5d4.163472032... Signed-off-by: Xin Hao xhao@linux.alibaba.com Reviewed-by: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit b5ca3e83ddb05342b1b30700b999cb9b107511f6) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 1 + mm/damon/core.c | 5 +++++ mm/damon/dbgfs.c | 15 ++++++++++++--- 3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index c93325efddd7..fa7f32614b65 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -440,6 +440,7 @@ void damon_destroy_scheme(struct damos *s);
struct damon_target *damon_new_target(unsigned long id); void damon_add_target(struct damon_ctx *ctx, struct damon_target *t); +bool damon_targets_empty(struct damon_ctx *ctx); void damon_free_target(struct damon_target *t); void damon_destroy_target(struct damon_target *t); unsigned int damon_nr_regions(struct damon_target *t); diff --git a/mm/damon/core.c b/mm/damon/core.c index 6993c60ae31c..46a6afea3030 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -180,6 +180,11 @@ void damon_add_target(struct damon_ctx *ctx, struct damon_target *t) list_add_tail(&t->list, &ctx->adaptive_targets); }
+bool damon_targets_empty(struct damon_ctx *ctx) +{ + return list_empty(&ctx->adaptive_targets); +} + static void damon_del_target(struct damon_target *t) { list_del(&t->list); diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 6828e463348b..befb27a29aab 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -878,12 +878,21 @@ static ssize_t dbgfs_monitor_on_write(struct file *file, return -EINVAL; }
- if (!strncmp(kbuf, "on", count)) + if (!strncmp(kbuf, "on", count)) { + int i; + + for (i = 0; i < dbgfs_nr_ctxs; i++) { + if (damon_targets_empty(dbgfs_ctxs[i])) { + kfree(kbuf); + return -EINVAL; + } + } ret = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs); - else if (!strncmp(kbuf, "off", count)) + } else if (!strncmp(kbuf, "off", count)) { ret = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs); - else + } else { ret = -EINVAL; + }
if (!ret) ret = count;
From: Changbin Du changbin.du@gmail.com
mainline inclusion from mainline-5.16 commit 0f91d13366a402420bf98eaaf393db03946c13e0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- A kernel thread can exit gracefully with kthread_stop(), so we don't need the extra flag 'kdamond_stop'. And to make sure the task struct is not freed while accessing it, take a reference to it before termination.
Link: https://lkml.kernel.org/r/20211027130517.4404-1-changbin.du@gmail.com Signed-off-by: Changbin Du changbin.du@gmail.com Reviewed-by: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 0f91d13366a402420bf98eaaf393db03946c13e0) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 1 - mm/damon/core.c | 51 +++++++++++++------------------------------ 2 files changed, 15 insertions(+), 37 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index fa7f32614b65..321de9d72360 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -381,7 +381,6 @@ struct damon_ctx {
/* public: */ struct task_struct *kdamond; - bool kdamond_stop; struct mutex kdamond_lock;
struct damon_primitive primitive; diff --git a/mm/damon/core.c b/mm/damon/core.c index 46a6afea3030..f37c17b53814 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -390,17 +390,6 @@ static unsigned long damon_region_sz_limit(struct damon_ctx *ctx) return sz; }
-static bool damon_kdamond_running(struct damon_ctx *ctx) -{ - bool running; - - mutex_lock(&ctx->kdamond_lock); - running = ctx->kdamond != NULL; - mutex_unlock(&ctx->kdamond_lock); - - return running; -} - static int kdamond_fn(void *data);
/* @@ -418,7 +407,6 @@ static int __damon_start(struct damon_ctx *ctx) mutex_lock(&ctx->kdamond_lock); if (!ctx->kdamond) { err = 0; - ctx->kdamond_stop = false; ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d", nr_running_ctxs); if (IS_ERR(ctx->kdamond)) { @@ -474,13 +462,15 @@ int damon_start(struct damon_ctx **ctxs, int nr_ctxs) */ static int __damon_stop(struct damon_ctx *ctx) { + struct task_struct *tsk; + mutex_lock(&ctx->kdamond_lock); - if (ctx->kdamond) { - ctx->kdamond_stop = true; + tsk = ctx->kdamond; + if (tsk) { + get_task_struct(tsk); mutex_unlock(&ctx->kdamond_lock); - while (damon_kdamond_running(ctx)) - usleep_range(ctx->sample_interval, - ctx->sample_interval * 2); + kthread_stop(tsk); + put_task_struct(tsk); return 0; } mutex_unlock(&ctx->kdamond_lock); @@ -925,12 +915,8 @@ static bool kdamond_need_update_primitive(struct damon_ctx *ctx) static bool kdamond_need_stop(struct damon_ctx *ctx) { struct damon_target *t; - bool stop;
- mutex_lock(&ctx->kdamond_lock); - stop = ctx->kdamond_stop; - mutex_unlock(&ctx->kdamond_lock); - if (stop) + if (kthread_should_stop()) return true;
if (!ctx->primitive.target_valid) @@ -1021,13 +1007,6 @@ static int kdamond_wait_activation(struct damon_ctx *ctx) return -EBUSY; }
-static void set_kdamond_stop(struct damon_ctx *ctx) -{ - mutex_lock(&ctx->kdamond_lock); - ctx->kdamond_stop = true; - mutex_unlock(&ctx->kdamond_lock); -} - /* * The monitoring daemon that runs as a kernel thread */ @@ -1038,17 +1017,18 @@ static int kdamond_fn(void *data) struct damon_region *r, *next; unsigned int max_nr_accesses = 0; unsigned long sz_limit = 0; + bool done = false;
pr_debug("kdamond (%d) starts\n", current->pid);
if (ctx->primitive.init) ctx->primitive.init(ctx); if (ctx->callback.before_start && ctx->callback.before_start(ctx)) - set_kdamond_stop(ctx); + done = true;
sz_limit = damon_region_sz_limit(ctx);
- while (!kdamond_need_stop(ctx)) { + while (!kdamond_need_stop(ctx) && !done) { if (kdamond_wait_activation(ctx)) continue;
@@ -1056,7 +1036,7 @@ static int kdamond_fn(void *data) ctx->primitive.prepare_access_checks(ctx); if (ctx->callback.after_sampling && ctx->callback.after_sampling(ctx)) - set_kdamond_stop(ctx); + done = true;
usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
@@ -1069,7 +1049,7 @@ static int kdamond_fn(void *data) sz_limit); if (ctx->callback.after_aggregation && ctx->callback.after_aggregation(ctx)) - set_kdamond_stop(ctx); + done = true; kdamond_apply_schemes(ctx); kdamond_reset_aggregated(ctx); kdamond_split_regions(ctx); @@ -1088,9 +1068,8 @@ static int kdamond_fn(void *data) damon_destroy_region(r, t); }
- if (ctx->callback.before_terminate && - ctx->callback.before_terminate(ctx)) - set_kdamond_stop(ctx); + if (ctx->callback.before_terminate) + ctx->callback.before_terminate(ctx); if (ctx->primitive.cleanup) ctx->primitive.cleanup(ctx);
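The lifetime rule this patch relies on can be shown in isolation: the worker polls kthread_should_stop(), and the stopper pins the task_struct before calling kthread_stop() so the struct cannot be freed while it is still being accessed. A minimal sketch with illustrative names that are not part of DAMON:

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/mutex.h>
#include <linux/sched/task.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
	while (!kthread_should_stop())
		msleep(100);	/* periodic work goes here */
	return 0;
}

static void stop_worker(struct mutex *lock)
{
	struct task_struct *tsk;

	mutex_lock(lock);
	tsk = worker;
	if (tsk)
		get_task_struct(tsk);	/* pin before dropping the lock */
	worker = NULL;
	mutex_unlock(lock);

	if (tsk) {
		kthread_stop(tsk);	/* waits until worker_fn() returns */
		put_task_struct(tsk);
	}
}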
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 82e3fff55d0010310d3ee9005fec366c9cb3836a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Patch series "Fix trivial nits in Documentation/admin-guide/mm".
This patchset fixes trivial nits in admin guide documents for DAMON and pagemap.
This patch (of 4):
Some of the example commands in the DAMON 'Getting Started' guide are outdated, missing sudo, or just wrong. This fixes those.
Link: https://lkml.kernel.org/r/20211022090311.3856-2-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Jonathan Corbet corbet@lwn.net Cc: Peter Xu peterx@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 82e3fff55d0010310d3ee9005fec366c9cb3836a) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/start.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 51503cf90ca2..3ad8bbed9b18 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -19,7 +19,7 @@ your workload. :: # mount -t debugfs none /sys/kernel/debug/ # git clone https://github.com/awslabs/damo # ./damo/damo record $(pidof <your workload>) - # ./damo/damo report heat --plot_ascii + # ./damo/damo report heats --heatmap stdout
The final command draws the access heatmap of ``<your workload>``. The heatmap shows which memory region (x-axis) is accessed when (y-axis) and how frequently @@ -94,9 +94,9 @@ Visualizing Recorded Patterns The following three commands visualize the recorded access patterns and save the results as separate image files. ::
- $ damo report heats --heatmap access_pattern_heatmap.png - $ damo report wss --range 0 101 1 --plot wss_dist.png - $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png + $ sudo damo report heats --heatmap access_pattern_heatmap.png + $ sudo damo report wss --range 0 101 1 --plot wss_dist.png + $ sudo damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
- ``access_pattern_heatmap.png`` will visualize the data access pattern in a heatmap, showing which memory region (y-axis) got accessed when (x-axis) @@ -115,9 +115,9 @@ Data Access Pattern Aware Memory Management Below three commands make every memory region of size >=4K that doesn't accessed for >=60 seconds in your workload to be swapped out. ::
- $ echo "#min-size max-size min-acc max-acc min-age max-age action" > scheme - $ echo "4K max 0 0 60s max pageout" >> scheme - $ damo schemes -c my_thp_scheme <pid of your workload> + $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme + $ echo "4K max 0 0 60s max pageout" >> test_scheme + $ damo schemes -c test_scheme <pid of your workload>
.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#vis... .. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 49ce7dee10891c709da769a731a45b42cf7c54f6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- The 'Getting Started' guide of DAMON provides a link to DAMON's user interface document while talking about the user space tool's detailed usage. This fixes the link.
Link: https://lkml.kernel.org/r/20211022090311.3856-3-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Jonathan Corbet corbet@lwn.net Cc: Peter Xu peterx@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 49ce7dee10891c709da769a731a45b42cf7c54f6) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/start.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 3ad8bbed9b18..5f3b22cafc76 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -6,7 +6,9 @@ Getting Started
This document briefly describes how you can use DAMON by demonstrating its default user space tool. Please note that this document describes only a part -of its features for brevity. Please refer to :doc:`usage` for more details. +of its features for brevity. Please refer to the usage `doc +<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more +details.
TL; DR
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit b1eee3c5486003b247127538210f15fd6ebb5ee5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Information in the 'TL; DR' section of 'Getting Started' is duplicated in other parts of the doc. It also asks readers to visit the access pattern visualizations gallery website to see the results of the example visualization commands, while the users of the commands can simply use terminal output.
To make the doc simple, this removes the duplicated 'TL; DR' section and replaces the visualization example commands with versions using terminal outputs.
Link: https://lkml.kernel.org/r/20211022090311.3856-4-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Jonathan Corbet corbet@lwn.net Cc: Peter Xu peterx@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit b1eee3c5486003b247127538210f15fd6ebb5ee5) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/damon/start.rst | 113 ++++++++++--------- 1 file changed, 60 insertions(+), 53 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 5f3b22cafc76..4d5ca2c46288 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -11,38 +11,6 @@ of its features for brevity. Please refer to the usage `doc details.
-TL; DR -====== - -Follow the commands below to monitor and visualize the memory access pattern of -your workload. :: - - # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot - # mount -t debugfs none /sys/kernel/debug/ - # git clone https://github.com/awslabs/damo - # ./damo/damo record $(pidof <your workload>) - # ./damo/damo report heats --heatmap stdout - -The final command draws the access heatmap of ``<your workload>``. The heatmap -shows which memory region (x-axis) is accessed when (y-axis) and how frequently -(number; the higher the more accesses have been observed). :: - - 111111111111111111111111111111111111111111111111111111110000 - 111121111111111111111111111111211111111111111111111111110000 - 000000000000000000000000000000000000000000000000001555552000 - 000000000000000000000000000000000000000000000222223555552000 - 000000000000000000000000000000000000000011111677775000000000 - 000000000000000000000000000000000000000488888000000000000000 - 000000000000000000000000000000000177888400000000000000000000 - 000000000000000000000000000046666522222100000000000000000000 - 000000000000000000000014444344444300000000000000000000000000 - 000000000000000002222245555510000000000000000000000000000000 - # access_frequency: 0 1 2 3 4 5 6 7 8 9 - # x-axis: space (140286319947776-140286426374096: 101.496 MiB) - # y-axis: time (605442256436361-605479951866441: 37.695430s) - # resolution: 60x10 (1.692 MiB and 3.770s for each character) - - Prerequisites =============
@@ -93,22 +61,66 @@ pattern in the ``damon.data`` file. Visualizing Recorded Patterns =============================
-The following three commands visualize the recorded access patterns and save -the results as separate image files. :: - - $ sudo damo report heats --heatmap access_pattern_heatmap.png - $ sudo damo report wss --range 0 101 1 --plot wss_dist.png - $ sudo damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png - -- ``access_pattern_heatmap.png`` will visualize the data access pattern in a - heatmap, showing which memory region (y-axis) got accessed when (x-axis) - and how frequently (color). -- ``wss_dist.png`` will show the distribution of the working set size. -- ``wss_chron_change.png`` will show how the working set size has - chronologically changed. - -You can view the visualizations of this example workload at [1]_. -Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_. +You can visualize the pattern in a heatmap, showing which memory region +(x-axis) got accessed when (y-axis) and how frequently (number).:: + + $ sudo damo report heats --heatmap stdout + 22222222222222222222222222222222222222211111111111111111111111111111111111111100 + 44444444444444444444444444444444444444434444444444444444444444444444444444443200 + 44444444444444444444444444444444444444433444444444444444444444444444444444444200 + 33333333333333333333333333333333333333344555555555555555555555555555555555555200 + 33333333333333333333333333333333333344444444444444444444444444444444444444444200 + 22222222222222222222222222222222222223355555555555555555555555555555555555555200 + 00000000000000000000000000000000000000288888888888888888888888888888888888888400 + 00000000000000000000000000000000000000288888888888888888888888888888888888888400 + 33333333333333333333333333333333333333355555555555555555555555555555555555555200 + 88888888888888888888888888888888888888600000000000000000000000000000000000000000 + 88888888888888888888888888888888888888600000000000000000000000000000000000000000 + 33333333333333333333333333333333333333444444444444444444444444444444444444443200 + 00000000000000000000000000000000000000288888888888888888888888888888888888888400 + [...] 
+ # access_frequency: 0 1 2 3 4 5 6 7 8 9 + # x-axis: space (139728247021568-139728453431248: 196.848 MiB) + # y-axis: time (15256597248362-15326899978162: 1 m 10.303 s) + # resolution: 80x40 (2.461 MiB and 1.758 s for each character) + +You can also visualize the distribution of the working set size, sorted by the +size.:: + + $ sudo damo report wss --range 0 101 10 + # <percentile> <wss> + # target_id 18446632103789443072 + # avr: 107.708 MiB + 0 0 B | | + 10 95.328 MiB |**************************** | + 20 95.332 MiB |**************************** | + 30 95.340 MiB |**************************** | + 40 95.387 MiB |**************************** | + 50 95.387 MiB |**************************** | + 60 95.398 MiB |**************************** | + 70 95.398 MiB |**************************** | + 80 95.504 MiB |**************************** | + 90 190.703 MiB |********************************************************* | + 100 196.875 MiB |***********************************************************| + +Using ``--sortby`` option with the above command, you can show how the working +set size has chronologically changed.:: + + $ sudo damo report wss --range 0 101 10 --sortby time + # <percentile> <wss> + # target_id 18446632103789443072 + # avr: 107.708 MiB + 0 3.051 MiB | | + 10 190.703 MiB |***********************************************************| + 20 95.336 MiB |***************************** | + 30 95.328 MiB |***************************** | + 40 95.387 MiB |***************************** | + 50 95.332 MiB |***************************** | + 60 95.320 MiB |***************************** | + 70 95.398 MiB |***************************** | + 80 95.398 MiB |***************************** | + 90 95.340 MiB |***************************** | + 100 95.398 MiB |***************************** |
Data Access Pattern Aware Memory Management @@ -120,8 +132,3 @@ accessed for >=60 seconds in your workload to be swapped out. :: $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme $ echo "4K max 0 0 60s max pageout" >> test_scheme $ damo schemes -c test_scheme <pid of your workload> - -.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#vis... -.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html -.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html -.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
From: SeongJae Park sj@kernel.org
mainline inclusion from mainline-5.16 commit 0d16cfd46b48689de6a5ca594bc1a68105f6658b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Some descriptions of page flags in 'pagemap.rst' are written on the assumption of non-rst rendering, which respects every new line, as below:
7 - SLAB page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator When compound page is used, SLUB/SLQB will only set this flag on the head
Because rst ignores the new line between the first sentence and the second sentence, the resulting html looks a little bit weird, as below.
7 - SLAB page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator When ^ compound page is used, SLUB/SLQB will only set this flag on the head page; SLOB will not flag it at all.
This change makes it more natural and consistent with other parts in the rendered version.
Link: https://lkml.kernel.org/r/20211022090311.3856-5-sj@kernel.org Signed-off-by: SeongJae Park sj@kernel.org Cc: Jonathan Corbet corbet@lwn.net Cc: Peter Xu peterx@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 0d16cfd46b48689de6a5ca594bc1a68105f6658b) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- Documentation/admin-guide/mm/pagemap.rst | 53 ++++++++++++------------ 1 file changed, 27 insertions(+), 26 deletions(-)
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 340a5aee9b80..2f472dd3ee9f 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -88,13 +88,14 @@ Short descriptions to the page flags ====================================
0 - LOCKED - page is being locked for exclusive access, e.g. by undergoing read/write IO + The page is being locked for exclusive access, e.g. by undergoing read/write + IO. 7 - SLAB - page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + The page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator. When compound page is used, SLUB/SLQB will only set this flag on the head page; SLOB will not flag it at all. 10 - BUDDY - a free memory block managed by the buddy system allocator + A free memory block managed by the buddy system allocator. The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag set for and _only_ for the first page. @@ -110,65 +111,65 @@ Short descriptions to the page flags 16 - COMPOUND_TAIL A compound page tail (see description above). 17 - HUGE - this is an integral part of a HugeTLB page + This is an integral part of a HugeTLB page. 19 - HWPOISON - hardware detected memory corruption on this page: don't touch the data! + Hardware detected memory corruption on this page: don't touch the data! 20 - NOPAGE - no page frame exists at the requested address + No page frame exists at the requested address. 21 - KSM - identical memory pages dynamically shared between one or more processes + Identical memory pages dynamically shared between one or more processes. 22 - THP - contiguous pages which construct transparent hugepages + Contiguous pages which construct transparent hugepages. 23 - OFFLINE - page is logically offline + The page is logically offline. 24 - ZERO_PAGE - zero page for pfn_zero or huge_zero page + Zero page for pfn_zero or huge_zero page. 25 - IDLE - page has not been accessed since it was marked idle (see + The page has not been accessed since it was marked idle (see :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. 26 - PGTABLE - page is in use as a page table + The page is in use as a page table.
IO related page flags ---------------------
1 - ERROR - IO error occurred + IO error occurred. 3 - UPTODATE - page has up-to-date data + The page has up-to-date data. ie. for file backed page: (in-memory data revision >= on-disk one) 4 - DIRTY - page has been written to, hence contains new data + The page has been written to, hence contains new data. i.e. for file backed page: (in-memory data revision > on-disk one) 8 - WRITEBACK - page is being synced to disk + The page is being synced to disk.
LRU related page flags ----------------------
5 - LRU - page is in one of the LRU lists + The page is in one of the LRU lists. 6 - ACTIVE - page is in the active LRU list + The page is in the active LRU list. 18 - UNEVICTABLE - page is in the unevictable (non-)LRU list It is somehow pinned and + The page is in the unevictable (non-)LRU list It is somehow pinned and not a candidate for LRU page reclaims, e.g. ramfs pages, - shmctl(SHM_LOCK) and mlock() memory segments + shmctl(SHM_LOCK) and mlock() memory segments. 2 - REFERENCED - page has been referenced since last LRU list enqueue/requeue + The page has been referenced since last LRU list enqueue/requeue. 9 - RECLAIM - page will be reclaimed soon after its pageout IO completed + The page will be reclaimed soon after its pageout IO completed. 11 - MMAP - a memory mapped page + A memory mapped page. 12 - ANON - a memory mapped page that is not part of a file + A memory mapped page that is not part of a file. 13 - SWAPCACHE - page is mapped to swap space, i.e. has an associated swap entry + The page is mapped to swap space, i.e. has an associated swap entry. 14 - SWAPBACKED - page is backed by swap/RAM + The page is backed by swap/RAM.
The page-types tool in the tools/vm directory can be used to query the above flags.
From: Colin Ian King colin.i.king@googlemail.com
mainline inclusion from mainline-5.16 commit 0107865541961ee128149c9873996d32143a74d0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- There are a few spelling mistakes in the code. Fix these.
Link: https://lkml.kernel.org/r/20211028184157.614544-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King colin.i.king@gmail.com Reviewed-by: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 0107865541961ee128149c9873996d32143a74d0) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/damon/core.c | 2 +- mm/damon/dbgfs-test.h | 2 +- mm/damon/vaddr-test.h | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/damon/core.c b/mm/damon/core.c index f37c17b53814..c381b3c525d0 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -959,7 +959,7 @@ static unsigned long damos_wmark_wait_us(struct damos *scheme) /* higher than high watermark or lower than low watermark */ if (metric > scheme->wmarks.high || scheme->wmarks.low > metric) { if (scheme->wmarks.activated) - pr_debug("inactivate a scheme (%d) for %s wmark\n", + pr_debug("deactivate a scheme (%d) for %s wmark\n", scheme->action, metric > scheme->wmarks.high ? "high" : "low"); diff --git a/mm/damon/dbgfs-test.h b/mm/damon/dbgfs-test.h index 104b22957616..86b9f9528231 100644 --- a/mm/damon/dbgfs-test.h +++ b/mm/damon/dbgfs-test.h @@ -145,7 +145,7 @@ static void damon_dbgfs_test_set_init_regions(struct kunit *test)
KUNIT_EXPECT_STREQ(test, (char *)buf, expect); } - /* Put invlid inputs and check the return error code */ + /* Put invalid inputs and check the return error code */ for (i = 0; i < ARRAY_SIZE(invalid_inputs); i++) { input = invalid_inputs[i]; pr_info("input: %s\n", input); diff --git a/mm/damon/vaddr-test.h b/mm/damon/vaddr-test.h index 1f5c13257dba..ecfd0b2ed222 100644 --- a/mm/damon/vaddr-test.h +++ b/mm/damon/vaddr-test.h @@ -233,7 +233,7 @@ static void damon_test_apply_three_regions3(struct kunit *test) * and 70-100) has totally freed and mapped to different area (30-32 and * 65-68). The target regions which were in the old second and third big * regions should now be removed and new target regions covering the new second - * and third big regions should be crated. + * and third big regions should be created. */ static void damon_test_apply_three_regions4(struct kunit *test) {
From: Changbin Du changbin.du@gmail.com
mainline inclusion from mainline-5.16 commit 658f9ae761b5965893727dd4edcdad56e5a439bb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA
------------------------------------------------- Since the return value of the 'before_terminate' callback is never used, we make it have no return value.
Link: https://lkml.kernel.org/r/20211029005023.8895-1-changbin.du@gmail.com Signed-off-by: Changbin Du changbin.du@gmail.com Reviewed-by: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 658f9ae761b5965893727dd4edcdad56e5a439bb) Signed-off-by: Yue Zou zouyue3@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/damon.h | 2 +- mm/damon/dbgfs.c | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h index 321de9d72360..b4d4be3cc987 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -322,7 +322,7 @@ struct damon_callback { int (*before_start)(struct damon_ctx *context); int (*after_sampling)(struct damon_ctx *context); int (*after_aggregation)(struct damon_ctx *context); - int (*before_terminate)(struct damon_ctx *context); + void (*before_terminate)(struct damon_ctx *context); };
/** diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index befb27a29aab..eccc14b34901 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -645,18 +645,17 @@ static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx) debugfs_create_file(file_names[i], 0600, dir, ctx, fops[i]); }
-static int dbgfs_before_terminate(struct damon_ctx *ctx) +static void dbgfs_before_terminate(struct damon_ctx *ctx) { struct damon_target *t, *next;
if (!targetid_is_pid(ctx)) - return 0; + return;
damon_for_each_target_safe(t, next, ctx) { put_pid((struct pid *)t->id); damon_destroy_target(t); } - return 0; }
static struct damon_ctx *dbgfs_new_ctx(void)
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit e435a6b5315a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, there are two ways for the PF to set the unicast MAC address space size: it is either specified by config parameters in firmware or set to a default value.
That means if the config parameter in firmware is zero, the driver will divide the whole unicast MAC address space equally among 8 PFs. However, in this case, much of the unicast MAC address space will be wasted when the hardware actually has fewer than 8 PFs. On the other hand, if one PF has many more VFs than the other PFs, then each function of this PF will have much less address space than those of the other PFs.
In order to ameliorate the above two situations, introduce a third way of unicast MAC address space assignment: firmware divides the whole unicast MAC address space equally among the functions of all PFs, and calculates the space size of each PF according to its function number. The PF queries the space size with the device specification query command during its initialization process.
The third assignment method has lower priority than specification by config parameters: it is used only if the config parameter is zero. And if the firmware does not support the third method, the driver still divides the whole unicast MAC address space equally among 8 PFs.
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 2 ++ .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 4 +++- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 11 ++++++++--- 4 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 85b432068157..30cdae7c6bae 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -340,6 +340,7 @@ struct hnae3_dev_specs { u8 max_non_tso_bd_num; /* max BD number of one non-TSO packet */ u16 max_frm_size; u16 max_qset_num; + u16 umv_size; };
struct hnae3_client_ops { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 2b66c59f5eaf..d297f22f5af9 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -924,6 +924,8 @@ hns3_dbg_dev_specs(struct hnae3_handle *h, char *buf, int len, int *pos) dev_specs->max_tm_rate); *pos += scnprintf(buf + *pos, len - *pos, "MAX QSET number: %u\n", dev_specs->max_qset_num); + *pos += scnprintf(buf + *pos, len - *pos, "umv size: %u\n", + dev_specs->umv_size); }
static int hns3_dbg_dev_info(struct hnae3_handle *h, char *buf, int len) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 33244472e0d0..cfbb7c51b0cb 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -1188,7 +1188,9 @@ struct hclge_dev_specs_1_cmd { __le16 max_frm_size; __le16 max_qset_num; __le16 max_int_gl; - u8 rsv1[18]; + u8 rsv0[2]; + __le16 umv_size; + u8 rsv1[14]; };
/* mac speed type defined in firmware command */ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index b384be5a44bc..62da9bb6fe2e 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -1342,8 +1342,6 @@ static void hclge_parse_cfg(struct hclge_cfg *cfg, struct hclge_desc *desc) cfg->umv_space = hnae3_get_field(__le32_to_cpu(req->param[1]), HCLGE_CFG_UMV_TBL_SPACE_M, HCLGE_CFG_UMV_TBL_SPACE_S); - if (!cfg->umv_space) - cfg->umv_space = HCLGE_DEFAULT_UMV_SPACE_PER_PF;
cfg->pf_rss_size_max = hnae3_get_field(__le32_to_cpu(req->param[2]), HCLGE_CFG_PF_RSS_SIZE_M, @@ -1419,6 +1417,7 @@ static void hclge_set_default_dev_specs(struct hclge_dev *hdev) ae_dev->dev_specs.max_int_gl = HCLGE_DEF_MAX_INT_GL; ae_dev->dev_specs.max_frm_size = HCLGE_MAC_MAX_FRAME; ae_dev->dev_specs.max_qset_num = HCLGE_MAX_QSET_NUM; + ae_dev->dev_specs.umv_size = HCLGE_DEFAULT_UMV_SPACE_PER_PF; }
static void hclge_parse_dev_specs(struct hclge_dev *hdev, @@ -1440,6 +1439,7 @@ static void hclge_parse_dev_specs(struct hclge_dev *hdev, ae_dev->dev_specs.max_qset_num = le16_to_cpu(req1->max_qset_num); ae_dev->dev_specs.max_int_gl = le16_to_cpu(req1->max_int_gl); ae_dev->dev_specs.max_frm_size = le16_to_cpu(req1->max_frm_size); + ae_dev->dev_specs.umv_size = le16_to_cpu(req1->umv_size); }
static void hclge_check_dev_specs(struct hclge_dev *hdev) @@ -1460,6 +1460,8 @@ static void hclge_check_dev_specs(struct hclge_dev *hdev) dev_specs->max_int_gl = HCLGE_DEF_MAX_INT_GL; if (!dev_specs->max_frm_size) dev_specs->max_frm_size = HCLGE_MAC_MAX_FRAME; + if (!dev_specs->umv_size) + dev_specs->umv_size = HCLGE_DEFAULT_UMV_SPACE_PER_PF; }
static int hclge_query_dev_specs(struct hclge_dev *hdev) @@ -1549,7 +1551,10 @@ static int hclge_configure(struct hclge_dev *hdev) hdev->tm_info.num_pg = 1; hdev->tc_max = cfg.tc_num; hdev->tm_info.hw_pfc_map = 0; - hdev->wanted_umv_size = cfg.umv_space; + if (cfg.umv_space) + hdev->wanted_umv_size = cfg.umv_space; + else + hdev->wanted_umv_size = hdev->ae_dev->dev_specs.umv_size; hdev->tx_spare_buf_size = cfg.tx_spare_buf_size; hdev->gro_en = true; if (cfg.vlan_fliter_cap == HCLGE_VLAN_FLTR_CAN_MDF)
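The resulting priority order can be summarized in one illustrative helper. This is not the driver code itself (the two fallback steps are actually split between hclge_check_dev_specs() and hclge_configure() in the diff above), but it captures the decision:

/*
 * Illustrative only: a non-zero config parameter wins; otherwise use the
 * per-PF size the new firmware reports via the device specification
 * query; otherwise fall back to the legacy default, i.e. an equal split
 * among 8 PFs.
 */
static u16 pick_umv_size(u16 cfg_umv_space, u16 spec_umv_size)
{
	if (cfg_umv_space)
		return cfg_umv_space;
	if (spec_umv_size)
		return spec_umv_size;
	return HCLGE_DEFAULT_UMV_SPACE_PER_PF;
}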
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 5c56ff486dfc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
------------------------------------------------- The new firmware supports dividing the whole multicast MAC address space equally among the functions of all PFs, and calculating the space size of each PF according to its function number.
To support this feature, the PF driver adds querying of the multicast MAC address space size from firmware and limits the number of used addresses according to the space size.
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 5 +++ .../ethernet/hisilicon/hns3/hns3_debugfs.c | 2 ++ .../hisilicon/hns3/hns3pf/hclge_cmd.h | 3 +- .../hisilicon/hns3/hns3pf/hclge_debugfs.c | 3 ++ .../hisilicon/hns3/hns3pf/hclge_main.c | 35 +++++++++++++++---- .../hisilicon/hns3/hns3pf/hclge_main.h | 2 ++ 6 files changed, 43 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 30cdae7c6bae..eb9893bd733f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -94,6 +94,7 @@ enum HNAE3_DEV_CAP_BITS { HNAE3_DEV_SUPPORT_RXD_ADV_LAYOUT_B, HNAE3_DEV_SUPPORT_PORT_VLAN_BYPASS_B, HNAE3_DEV_SUPPORT_VLAN_FLTR_MDF_B, + HNAE3_DEV_SUPPORT_MC_MAC_MNG_B, };
#define hnae3_dev_fd_supported(hdev) \ @@ -150,6 +151,9 @@ enum HNAE3_DEV_CAP_BITS { #define hnae3_ae_dev_rxd_adv_layout_supported(ae_dev) \ test_bit(HNAE3_DEV_SUPPORT_RXD_ADV_LAYOUT_B, (ae_dev)->caps)
+#define hnae3_ae_dev_mc_mac_mng_supported(ae_dev) \ + test_bit(HNAE3_DEV_SUPPORT_MC_MAC_MNG_B, (ae_dev)->caps) + enum HNAE3_PF_CAP_BITS { HNAE3_PF_SUPPORT_VLAN_FLTR_MDF_B = 0, }; @@ -341,6 +345,7 @@ struct hnae3_dev_specs { u16 max_frm_size; u16 max_qset_num; u16 umv_size; + u16 mc_mac_size; };
struct hnae3_client_ops { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index d297f22f5af9..a1555f074e06 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -926,6 +926,8 @@ hns3_dbg_dev_specs(struct hnae3_handle *h, char *buf, int len, int *pos) dev_specs->max_qset_num); *pos += scnprintf(buf + *pos, len - *pos, "umv size: %u\n", dev_specs->umv_size); + *pos += scnprintf(buf + *pos, len - *pos, "mc mac size: %u\n", + dev_specs->mc_mac_size); }
static int hns3_dbg_dev_info(struct hnae3_handle *h, char *buf, int len) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index cfbb7c51b0cb..bfcfefa9d2b5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -1190,7 +1190,8 @@ struct hclge_dev_specs_1_cmd { __le16 max_int_gl; u8 rsv0[2]; __le16 umv_size; - u8 rsv1[14]; + __le16 mc_mac_size; + u8 rsv1[12]; };
/* mac speed type defined in firmware command */ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c index 87d96f82c318..a983d012e26f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c @@ -1972,6 +1972,9 @@ static int hclge_dbg_dump_umv_info(struct hclge_dev *hdev, char *buf, int len) } mutex_unlock(&hdev->vport_lock);
+ pos += scnprintf(buf + pos, len - pos, "used_mc_mac_num : %u\n", + hdev->used_mc_mac_num); + return 0; }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 62da9bb6fe2e..4c23a2541282 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -1440,6 +1440,7 @@ static void hclge_parse_dev_specs(struct hclge_dev *hdev, ae_dev->dev_specs.max_int_gl = le16_to_cpu(req1->max_int_gl); ae_dev->dev_specs.max_frm_size = le16_to_cpu(req1->max_frm_size); ae_dev->dev_specs.umv_size = le16_to_cpu(req1->umv_size); + ae_dev->dev_specs.mc_mac_size = le16_to_cpu(req1->mc_mac_size); }
static void hclge_check_dev_specs(struct hclge_dev *hdev) @@ -8503,6 +8504,9 @@ static int hclge_init_umv_space(struct hclge_dev *hdev) hdev->share_umv_size = hdev->priv_umv_size + hdev->max_umv_size % (hdev->num_alloc_vport + 1);
+ if (hdev->ae_dev->dev_specs.mc_mac_size) + set_bit(HNAE3_DEV_SUPPORT_MC_MAC_MNG_B, hdev->ae_dev->caps); + return 0; }
@@ -8520,6 +8524,8 @@ static void hclge_reset_umv_space(struct hclge_dev *hdev) hdev->share_umv_size = hdev->priv_umv_size + hdev->max_umv_size % (hdev->num_alloc_vport + 1); mutex_unlock(&hdev->vport_lock); + + hdev->used_mc_mac_num = 0; }
static bool hclge_is_umv_space_full(struct hclge_vport *vport, bool need_lock) @@ -8774,6 +8780,7 @@ int hclge_add_mc_addr_common(struct hclge_vport *vport, struct hclge_dev *hdev = vport->back; struct hclge_mac_vlan_tbl_entry_cmd req; struct hclge_desc desc[3]; + bool is_new_addr = false; int status;
/* mac addr check */ @@ -8787,6 +8794,13 @@ int hclge_add_mc_addr_common(struct hclge_vport *vport, hclge_prepare_mac_addr(&req, addr, true); status = hclge_lookup_mac_vlan_tbl(vport, &req, desc, true); if (status) { + if (hnae3_ae_dev_mc_mac_mng_supported(hdev->ae_dev) && + hdev->used_mc_mac_num >= + hdev->ae_dev->dev_specs.mc_mac_size) + goto err_no_space; + + is_new_addr = true; + /* This mac addr do not exist, add new entry for it */ memset(desc[0].data, 0, sizeof(desc[0].data)); memset(desc[1].data, 0, sizeof(desc[0].data)); @@ -8796,12 +8810,18 @@ int hclge_add_mc_addr_common(struct hclge_vport *vport, if (status) return status; status = hclge_add_mac_vlan_tbl(vport, &req, desc); - /* if already overflow, not to print each time */ - if (status == -ENOSPC && - !(vport->overflow_promisc_flags & HNAE3_OVERFLOW_MPE)) - dev_err(&hdev->pdev->dev, "mc mac vlan table is full\n"); + if (status == -ENOSPC) + goto err_no_space; + else if (!status && is_new_addr) + hdev->used_mc_mac_num++;
return status; + +err_no_space: + /* if already overflow, not to print each time */ + if (!(vport->overflow_promisc_flags & HNAE3_OVERFLOW_MPE)) + dev_err(&hdev->pdev->dev, "mc mac vlan table is full\n"); + return -ENOSPC; }
static int hclge_rm_mc_addr(struct hnae3_handle *handle, @@ -8838,12 +8858,15 @@ int hclge_rm_mc_addr_common(struct hclge_vport *vport, if (status) return status;
- if (hclge_is_all_function_id_zero(desc)) + if (hclge_is_all_function_id_zero(desc)) { /* All the vfid is zero, so need to delete this entry */ status = hclge_remove_mac_vlan_tbl(vport, &req); - else + if (!status) + hdev->used_mc_mac_num--; + } else { /* Not all the vfid is zero, update the vfid */ status = hclge_add_mac_vlan_tbl(vport, &req, desc); + } } else if (status == -ENOENT) { status = 0; } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index de6afbcbfbac..ca25e2edf3f0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -938,6 +938,8 @@ struct hclge_dev { u16 priv_umv_size; /* unicast mac vlan space shared by PF and its VFs */ u16 share_umv_size; + /* multicast mac address number used by PF and its VFs */ + u16 used_mc_mac_num;
DECLARE_KFIFO(mac_tnl_log, struct hclge_mac_tnl_stats, HCLGE_MAC_TNL_LOG_SIZE);
From: Arnd Bergmann arnd@arndb.de
mainline inclusion from mainline-v5.15-rc4 commit c894b51e2a23 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
This function copies strings around between multiple buffers, including a large on-stack array, which causes a build warning on 32-bit systems:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c: In function 'hclge_dbg_dump_tm_pg': drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c:782:1: error: the frame size of 1424 bytes is larger than 1400 bytes [-Werror=frame-larger-than=]
The function could probably be cleaned up a lot, going back to printing directly into the output buffer, but dynamically allocating the structure is a simpler workaround for now.
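A minimal sketch of the pattern applied here (names and sizes simplified, not the driver's exact code): the large scratch array moves off the stack into a heap allocation made by a thin wrapper, and the worker function carves the flat buffer into per-row pointers. calloc()/free() stand in for the kernel's kcalloc()/kfree().

#include <stdio.h>
#include <stdlib.h>

#define ROWS 14     /* stands in for ARRAY_SIZE(tm_pg_items) */
#define ROW_LEN 32  /* stands in for HCLGE_DBG_DATA_STR_LEN */

static int worker(char *data_str)
{
	char *result[ROWS];
	int i;

	/* carve the flat heap buffer into fixed-size rows */
	for (i = 0; i < ROWS; i++) {
		result[i] = data_str;
		data_str += ROW_LEN;
	}
	snprintf(result[0], ROW_LEN, "example row");
	printf("%s\n", result[0]);
	return 0;
}

static int wrapper(void)
{
	/* kcalloc() in the kernel; calloc() here */
	char *data_str = calloc(ROWS, ROW_LEN);
	int ret;

	if (!data_str)
		return -1; /* -ENOMEM in the kernel */
	ret = worker(data_str);
	free(data_str);
	return ret;
}

int main(void)
{
	return wrapper();
}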
Fixes: 04d96139ddb3 ("net: hns3: refine function hclge_dbg_dump_tm_pri()") Signed-off-by: Arnd Bergmann arnd@arndb.de Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_debugfs.c | 28 ++++++++++++++++--- 1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c index a983d012e26f..f0aa4fbd2200 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c @@ -719,9 +719,9 @@ static void hclge_dbg_fill_shaper_content(struct hclge_tm_shaper_para *para, sprintf(result[(*index)++], "%6u", para->rate); }
-static int hclge_dbg_dump_tm_pg(struct hclge_dev *hdev, char *buf, int len) +static int __hclge_dbg_dump_tm_pg(struct hclge_dev *hdev, char *data_str, + char *buf, int len) { - char data_str[ARRAY_SIZE(tm_pg_items)][HCLGE_DBG_DATA_STR_LEN]; struct hclge_tm_shaper_para c_shaper_para, p_shaper_para; char *result[ARRAY_SIZE(tm_pg_items)], *sch_mode_str; u8 pg_id, sch_mode, weight, pri_bit_map, i, j; @@ -729,8 +729,10 @@ static int hclge_dbg_dump_tm_pg(struct hclge_dev *hdev, char *buf, int len) int pos = 0; int ret;
- for (i = 0; i < ARRAY_SIZE(tm_pg_items); i++) - result[i] = &data_str[i][0]; + for (i = 0; i < ARRAY_SIZE(tm_pg_items); i++) { + result[i] = data_str; + data_str += HCLGE_DBG_DATA_STR_LEN; + }
hclge_dbg_fill_content(content, sizeof(content), tm_pg_items, NULL, ARRAY_SIZE(tm_pg_items)); @@ -781,6 +783,24 @@ static int hclge_dbg_dump_tm_pg(struct hclge_dev *hdev, char *buf, int len) return 0; }
+static int hclge_dbg_dump_tm_pg(struct hclge_dev *hdev, char *buf, int len) +{ + char *data_str; + int ret; + + data_str = kcalloc(ARRAY_SIZE(tm_pg_items), + HCLGE_DBG_DATA_STR_LEN, GFP_KERNEL); + + if (!data_str) + return -ENOMEM; + + ret = __hclge_dbg_dump_tm_pg(hdev, data_str, buf, len); + + kfree(data_str); + + return ret; +} + static int hclge_dbg_dump_tm_port(struct hclge_dev *hdev, char *buf, int len) { struct hclge_tm_shaper_para shaper_para;
From: Jian Shen shenjian15@huawei.com
mainline inclusion from mainline-v5.15-rc4 commit a8e76fefe3de category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, in function hns3_nic_set_real_num_queue(), the driver doesn't report the queue count and offset for a disabled tc. If the user enables multiple TCs but only maps user priorities to some of them, the queue range of the unmapped TCs may be displayed abnormally.
Fix it by removing the tc enable checking and ensuring the queue count is not zero.
With this change, tc_en is now unused, so remove it.
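The fix relies on enabled TCs being the contiguous range 0..num_tc-1 (num_tc is documented in hnae3.h as the total number of enabled TCs), so iterating that range replaces the per-TC bitmap test. A runnable model of the reporting loop after the fix, with illustrative queue values:

#include <stdio.h>

#define MAX_TC 8

int main(void)
{
	/* illustrative layout: 2 TCs enabled, 4 queues each */
	unsigned short tqp_count[MAX_TC]  = { 4, 4, 1, 1, 1, 1, 1, 1 };
	unsigned short tqp_offset[MAX_TC] = { 0, 4, 0, 0, 0, 0, 0, 0 };
	int num_tc = 2;
	int i;

	/* after the fix: report every TC in [0, num_tc), no bitmap test */
	for (i = 0; i < num_tc; i++)
		printf("tc %d: count %u offset %u\n",
		       i, tqp_count[i], tqp_offset[i]);
	return 0;
}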
Fixes: a75a8efa00c5 ("net: hns3: Fix tc setup when netdev is first up") Signed-off-by: Jian Shen shenjian15@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 - drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 11 ++--------- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c | 5 ----- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 2 -- 4 files changed, 2 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index eb9893bd733f..4b68a277a56c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -757,7 +757,6 @@ struct hnae3_tc_info { u8 prio_tc[HNAE3_MAX_USER_PRIO]; /* TC indexed by prio */ u16 tqp_count[HNAE3_MAX_TC]; u16 tqp_offset[HNAE3_MAX_TC]; - unsigned long tc_en; /* bitmap of TC enabled */ u8 num_tc; /* Total number of enabled TCs */ bool mqprio_active; }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 6a0bbe5bcf9b..2deb4efdd553 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -623,13 +623,9 @@ static int hns3_nic_set_real_num_queue(struct net_device *netdev) return ret; }
- for (i = 0; i < HNAE3_MAX_TC; i++) { - if (!test_bit(i, &tc_info->tc_en)) - continue; - + for (i = 0; i < tc_info->num_tc; i++) netdev_set_tc_queue(netdev, i, tc_info->tqp_count[i], tc_info->tqp_offset[i]); - } }
ret = netif_set_real_num_tx_queues(netdev, queue_size); @@ -4867,12 +4863,9 @@ static void hns3_init_tx_ring_tc(struct hns3_nic_priv *priv) struct hnae3_tc_info *tc_info = &kinfo->tc_info; int i;
- for (i = 0; i < HNAE3_MAX_TC; i++) { + for (i = 0; i < tc_info->num_tc; i++) { int j;
- if (!test_bit(i, &tc_info->tc_en)) - continue; - for (j = 0; j < tc_info->tqp_count[i]; j++) { struct hnae3_queue *q;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index f8c1dca14d8f..019316c30878 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -453,8 +453,6 @@ static int hclge_mqprio_qopt_check(struct hclge_dev *hdev, static void hclge_sync_mqprio_qopt(struct hnae3_tc_info *tc_info, struct tc_mqprio_qopt_offload *mqprio_qopt) { - int i; - memset(tc_info, 0, sizeof(*tc_info)); tc_info->num_tc = mqprio_qopt->qopt.num_tc; memcpy(tc_info->prio_tc, mqprio_qopt->qopt.prio_tc_map, @@ -463,9 +461,6 @@ static void hclge_sync_mqprio_qopt(struct hnae3_tc_info *tc_info, sizeof_field(struct hnae3_tc_info, tqp_count)); memcpy(tc_info->tqp_offset, mqprio_qopt->qopt.offset, sizeof_field(struct hnae3_tc_info, tqp_offset)); - - for (i = 0; i < HNAE3_MAX_USER_PRIO; i++) - set_bit(tc_info->prio_tc[i], &tc_info->tc_en); }
static int hclge_config_tc(struct hclge_dev *hdev, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index bc36566a0a24..95074e91a846 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -687,12 +687,10 @@ static void hclge_tm_vport_tc_info_update(struct hclge_vport *vport)
for (i = 0; i < HNAE3_MAX_TC; i++) { if (hdev->hw_tc_map & BIT(i) && i < kinfo->tc_info.num_tc) { - set_bit(i, &kinfo->tc_info.tc_en); kinfo->tc_info.tqp_offset[i] = i * kinfo->rss_size; kinfo->tc_info.tqp_count[i] = kinfo->rss_size; } else { /* Set to default queue if TC is disable */ - clear_bit(i, &kinfo->tc_info.tc_en); kinfo->tc_info.tqp_offset[i] = 0; kinfo->tc_info.tqp_count[i] = 1; }
From: Jian Shen shenjian15@huawei.com
mainline inclusion from mainline-v5.15-rc4 commit d82650be60ee category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Since destroying mqprio is irreversible in the stack, it's unnecessary to roll back the tc configuration when destroying mqprio fails. Otherwise, the configuration may become inconsistent between the driver and the network stack.
As the failure is usually caused by reset, and the driver will restore the configuration after reset, the configuration can be kept consistent between the driver and the hardware.
Fixes: 5a5c90917467 ("net: hns3: add support for tc mqprio offload") Signed-off-by: Jian Shen shenjian15@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index 019316c30878..91cb578f56b8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -526,12 +526,17 @@ static int hclge_setup_tc(struct hnae3_handle *h, return hclge_notify_init_up(hdev);
err_out: - /* roll-back */ - memcpy(&kinfo->tc_info, &old_tc_info, sizeof(old_tc_info)); - if (hclge_config_tc(hdev, &kinfo->tc_info)) - dev_err(&hdev->pdev->dev, - "failed to roll back tc configuration\n"); - + if (!tc) { + dev_warn(&hdev->pdev->dev, + "failed to destroy mqprio, will active after reset, ret = %d\n", + ret); + } else { + /* roll-back */ + memcpy(&kinfo->tc_info, &old_tc_info, sizeof(old_tc_info)); + if (hclge_config_tc(hdev, &kinfo->tc_info)) + dev_err(&hdev->pdev->dev, + "failed to roll back tc configuration\n"); + } hclge_notify_init_up(hdev);
return ret;
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-v5.15-rc4 commit 276e60421668 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
If the unicast mac address table is full and the user adds a new mac address, the unicast promisc needs to be enabled so that the new unicast mac address can be used. The same applies to the multicast promisc.
Now this feature has been implemented for the PF, and it should be implemented for the VF too. When the mac table of a VF overflows, the PF will enable promisc for this VF.
Fixes: 1e6e76101fd9 ("net: hns3: configure promisc mode for VF asynchronously") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 4c23a2541282..153ea5fd06fc 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -12824,8 +12824,12 @@ static void hclge_sync_promisc_mode(struct hclge_dev *hdev) continue;
if (vport->vf_info.trusted) { - uc_en = vport->vf_info.request_uc_en > 0; - mc_en = vport->vf_info.request_mc_en > 0; + uc_en = vport->vf_info.request_uc_en > 0 || + vport->overflow_promisc_flags & + HNAE3_OVERFLOW_UPE; + mc_en = vport->vf_info.request_mc_en > 0 || + vport->overflow_promisc_flags & + HNAE3_OVERFLOW_MPE; } bc_en = vport->vf_info.request_bc_en > 0;
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-v5.15-rc4 commit 0178839ccca3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the firmware compatible features are enabled in the PF driver initialization process, but they are not disabled in the PF driver deinitialization process, so firmware keeps these features in enabled status.
In this case, if an old PF driver which does not support the firmware compatible features is loaded (for example, in a VM), firmware will still send mailbox messages to the PF when the link status changes, and the PF will print "un-supported mailbox message, code = 201".
To fix this problem, disable these firmware compatible features in the PF driver deinitialization process.
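As the hunk below shows, the same firmware command does both jobs: with en == false the payload stays zeroed, which clears every compat bit. A simplified model of the payload construction (plain C bit ops standing in for the kernel's hnae3_set_bit()/cpu_to_le32() helpers):

#include <stdint.h>
#include <stdio.h>

#define LINK_EVENT_REPORT_EN_B  0
#define NCSI_ERROR_REPORT_EN_B  1
#define PHY_IMP_EN_B            2

static uint32_t build_compat_payload(int en, int phy_imp_supported)
{
	uint32_t compat = 0;

	if (en) {
		compat |= 1U << LINK_EVENT_REPORT_EN_B;
		compat |= 1U << NCSI_ERROR_REPORT_EN_B;
		if (phy_imp_supported)
			compat |= 1U << PHY_IMP_EN_B;
	}
	/* en == 0 leaves the payload zeroed: all features disabled */
	return compat;
}

int main(void)
{
	printf("enable:  0x%x\n", build_compat_payload(1, 0));
	printf("disable: 0x%x\n", build_compat_payload(0, 0));
	return 0;
}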
Fixes: ed8fb4b262ae ("net: hns3: add link change event report") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_cmd.c | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c index ac9b69513332..9c2eeaa82294 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c @@ -467,7 +467,7 @@ int hclge_cmd_queue_init(struct hclge_dev *hdev) return ret; }
-static int hclge_firmware_compat_config(struct hclge_dev *hdev) +static int hclge_firmware_compat_config(struct hclge_dev *hdev, bool en) { struct hclge_firmware_compat_cmd *req; struct hclge_desc desc; @@ -475,13 +475,16 @@ static int hclge_firmware_compat_config(struct hclge_dev *hdev)
hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_IMP_COMPAT_CFG, false);
- req = (struct hclge_firmware_compat_cmd *)desc.data; + if (en) { + req = (struct hclge_firmware_compat_cmd *)desc.data;
- hnae3_set_bit(compat, HCLGE_LINK_EVENT_REPORT_EN_B, 1); - hnae3_set_bit(compat, HCLGE_NCSI_ERROR_REPORT_EN_B, 1); - if (hnae3_dev_phy_imp_supported(hdev)) - hnae3_set_bit(compat, HCLGE_PHY_IMP_EN_B, 1); - req->compat = cpu_to_le32(compat); + hnae3_set_bit(compat, HCLGE_LINK_EVENT_REPORT_EN_B, 1); + hnae3_set_bit(compat, HCLGE_NCSI_ERROR_REPORT_EN_B, 1); + if (hnae3_dev_phy_imp_supported(hdev)) + hnae3_set_bit(compat, HCLGE_PHY_IMP_EN_B, 1); + + req->compat = cpu_to_le32(compat); + }
return hclge_cmd_send(&hdev->hw, &desc, 1); } @@ -538,7 +541,7 @@ int hclge_cmd_init(struct hclge_dev *hdev) /* ask the firmware to enable some features, driver can work without * it. */ - ret = hclge_firmware_compat_config(hdev); + ret = hclge_firmware_compat_config(hdev, true); if (ret) dev_warn(&hdev->pdev->dev, "Firmware compatible features not enabled(%d).\n", @@ -568,6 +571,8 @@ static void hclge_cmd_uninit_regs(struct hclge_hw *hw)
void hclge_cmd_uninit(struct hclge_dev *hdev) { + hclge_firmware_compat_config(hdev, false); + set_bit(HCLGE_STATE_CMD_DISABLE, &hdev->state); /* wait to ensure that the firmware completes the possible left * over commands.
From: Hao Chen chenhao288@hisilicon.com
mainline inclusion from mainline-master commit 850bfb912a6d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Add a debugfs file node "page_pool_info"; cat this file node to dump the page pool info as below:
QUEUE_ID  ALLOCATE_CNT  FREE_CNT  POOL_SIZE(PAGE_NUM)  ORDER  NUMA_ID  MAX_LEN
0         512           0         512                  0      2        4K
1         512           0         512                  0      2        4K
2         512           0         512                  0      2        4K
3         512           0         512                  0      2        4K
4         512           0         512                  0      2        4K
Signed-off-by: Hao Chen chenhao288@hisilicon.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + .../ethernet/hisilicon/hns3/hns3_debugfs.c | 74 +++++++++++++++++++ 2 files changed, 75 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 4b68a277a56c..393a830c281c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -297,6 +297,7 @@ enum hnae3_dbg_cmd { HNAE3_DBG_CMD_MAC_TNL_STATUS, HNAE3_DBG_CMD_SERV_INFO, HNAE3_DBG_CMD_UMV_INFO, + HNAE3_DBG_CMD_PAGE_POOL_INFO, HNAE3_DBG_CMD_UNKNOWN, };
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index a1555f074e06..b26d43c9c088 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -336,6 +336,13 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .buf_len = HNS3_DBG_READ_LEN, .init = hns3_dbg_common_file_init, }, + { + .name = "page_pool_info", + .cmd = HNAE3_DBG_CMD_PAGE_POOL_INFO, + .dentry = HNS3_DBG_DENTRY_COMMON, + .buf_len = HNS3_DBG_READ_LEN, + .init = hns3_dbg_common_file_init, + }, };
static struct hns3_dbg_cap_info hns3_dbg_cap[] = { @@ -941,6 +948,69 @@ static int hns3_dbg_dev_info(struct hnae3_handle *h, char *buf, int len) return 0; }
+static const struct hns3_dbg_item page_pool_info_items[] = { + { "QUEUE_ID", 2 }, + { "ALLOCATE_CNT", 2 }, + { "FREE_CNT", 6 }, + { "POOL_SIZE(PAGE_NUM)", 2 }, + { "ORDER", 2 }, + { "NUMA_ID", 2 }, + { "MAX_LEN", 2 }, +}; + +static void hns3_dump_page_pool_info(struct hns3_enet_ring *ring, + char **result, u32 index) +{ + u32 j = 0; + + sprintf(result[j++], "%u", index); + sprintf(result[j++], "%u", ring->page_pool->pages_state_hold_cnt); + sprintf(result[j++], "%u", + atomic_read(&ring->page_pool->pages_state_release_cnt)); + sprintf(result[j++], "%u", ring->page_pool->p.pool_size); + sprintf(result[j++], "%u", ring->page_pool->p.order); + sprintf(result[j++], "%d", ring->page_pool->p.nid); + sprintf(result[j++], "%uK", ring->page_pool->p.max_len / 1024); +} + +static int +hns3_dbg_page_pool_info(struct hnae3_handle *h, char *buf, int len) +{ + char data_str[ARRAY_SIZE(page_pool_info_items)][HNS3_DBG_DATA_STR_LEN]; + char *result[ARRAY_SIZE(page_pool_info_items)]; + struct hns3_nic_priv *priv = h->priv; + char content[HNS3_DBG_INFO_LEN]; + struct hns3_enet_ring *ring; + int pos = 0; + u32 i; + + if (!priv->ring) { + dev_err(&h->pdev->dev, "priv->ring is NULL\n"); + return -EFAULT; + } + + for (i = 0; i < ARRAY_SIZE(page_pool_info_items); i++) + result[i] = &data_str[i][0]; + + hns3_dbg_fill_content(content, sizeof(content), page_pool_info_items, + NULL, ARRAY_SIZE(page_pool_info_items)); + pos += scnprintf(buf + pos, len - pos, "%s", content); + for (i = 0; i < h->kinfo.num_tqps; i++) { + if (!test_bit(HNS3_NIC_STATE_INITED, &priv->state) || + test_bit(HNS3_NIC_STATE_RESETTING, &priv->state)) + return -EPERM; + ring = &priv->ring[(u32)(i + h->kinfo.num_tqps)]; + hns3_dump_page_pool_info(ring, result, i); + hns3_dbg_fill_content(content, sizeof(content), + page_pool_info_items, + (const char **)result, + ARRAY_SIZE(page_pool_info_items)); + pos += scnprintf(buf + pos, len - pos, "%s", content); + } + + return 0; +} + static int hns3_dbg_get_cmd_index(struct hns3_dbg_data *dbg_data, u32 *index) { u32 i; @@ -982,6 +1052,10 @@ static const struct hns3_dbg_func hns3_dbg_cmd_func[] = { .cmd = HNAE3_DBG_CMD_TX_QUEUE_INFO, .dbg_dump = hns3_dbg_tx_queue_info, }, + { + .cmd = HNAE3_DBG_CMD_PAGE_POOL_INFO, + .dbg_dump = hns3_dbg_page_pool_info, + }, };
static int hns3_dbg_read_cmd(struct hns3_dbg_data *dbg_data,
From: Uwe Kleine-König u.kleine-koenig@pengutronix.de
mainline inclusion from mainline-master commit e519d9ea62e8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Replace pdev->driver->name by dev_driver_string() for the corresponding struct device. This is a step toward removing pci_dev->driver.
[bhelgaas: split to separate patch] Link: https://lore.kernel.org/r/20211004125935.2300113-8-u.kleine-koenig@pengutron... Signed-off-by: Uwe Kleine-König u.kleine-koenig@pengutronix.de Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 2319c89f5670..23b512174531 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -608,7 +608,7 @@ static void hns3_get_drvinfo(struct net_device *netdev, return; }
- strncpy(drvinfo->driver, h->pdev->driver->name, + strncpy(drvinfo->driver, dev_driver_string(&h->pdev->dev), sizeof(drvinfo->driver)); drvinfo->driver[sizeof(drvinfo->driver) - 1] = '\0';
From: Jiaran Zhang zhangjiaran@huawei.com
mainline inclusion from mainline-v5.15-rc7 commit 60484103d5c3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Add configuration of the interrupt type and the FIFO interrupt enable for the TM QCN error event when it is enabled; otherwise this event will not be reported when an error occurs.
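After the fix, the interrupt type is written unconditionally while the FIFO and memory interrupt enables are set only when the event is being enabled. A small model of how the two command data words are composed, using the constants from the hunk below:

#include <stdint.h>
#include <stdio.h>

#define TM_QCN_ERR_INT_TYPE    0x29
#define TM_QCN_FIFO_INT_EN     0xFFFF00
#define TM_QCN_MEM_ERR_INT_EN  0xFFFFFF

int main(void)
{
	uint32_t data0, data1;
	int en = 1;

	data0 = TM_QCN_ERR_INT_TYPE; /* always configure the type */
	data1 = 0;
	if (en) {
		data0 |= TM_QCN_FIFO_INT_EN;
		data1 = TM_QCN_MEM_ERR_INT_EN;
	}
	printf("data[0]=0x%x data[1]=0x%x\n", data0, data1);
	return 0;
}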
Fixes: d914971df022 ("net: hns3: remove redundant query in hclge_config_tm_hw_err_int()") Signed-off-by: Jiaran Zhang zhangjiaran@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 5 ++++- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h | 2 ++ 2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c index bb9b026ae88e..93aa7f2bdc13 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c @@ -1560,8 +1560,11 @@ static int hclge_config_tm_hw_err_int(struct hclge_dev *hdev, bool en)
/* configure TM QCN hw errors */ hclge_cmd_setup_basic_desc(&desc, HCLGE_TM_QCN_MEM_INT_CFG, false); - if (en) + desc.data[0] = cpu_to_le32(HCLGE_TM_QCN_ERR_INT_TYPE); + if (en) { + desc.data[0] |= cpu_to_le32(HCLGE_TM_QCN_FIFO_INT_EN); desc.data[1] = cpu_to_le32(HCLGE_TM_QCN_MEM_ERR_INT_EN); + }
ret = hclge_cmd_send(&hdev->hw, &desc, 1); if (ret) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h index 07987fb8332e..d811eeefe2c0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h @@ -50,6 +50,8 @@ #define HCLGE_PPP_MPF_ECC_ERR_INT3_EN 0x003F #define HCLGE_PPP_MPF_ECC_ERR_INT3_EN_MASK 0x003F #define HCLGE_TM_SCH_ECC_ERR_INT_EN 0x3 +#define HCLGE_TM_QCN_ERR_INT_TYPE 0x29 +#define HCLGE_TM_QCN_FIFO_INT_EN 0xFFFF00 #define HCLGE_TM_QCN_MEM_ERR_INT_EN 0xFFFFFF #define HCLGE_NCSI_ERR_INT_EN 0x3 #define HCLGE_NCSI_ERR_INT_TYPE 0x9
From: Yunsheng Lin linyunsheng@huawei.com
mainline inclusion from mainline-v5.15-rc7 commit 9f9f0f19994b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
An rx unused desc is a desc that needs a new buffer attached before being refilled to hw to receive a new packet; the number of descs needing a new buffer is calculated using next_to_use and next_to_clean. When next_to_use == next_to_clean, the hns3 driver currently assumes that all the descs have a buffer attached, but 'next_to_use == next_to_clean' also means that all the descs need a new buffer attached if hw has consumed all the descs and the driver has not attached any buffer to them yet.
This patch adds a 'refill' field to desc_cb to indicate whether a new buffer has been refilled to a desc.
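A runnable model of the disambiguation: from next_to_clean (ntc) and next_to_use (ntu) alone, ntc == ntu is ambiguous, so the per-desc refill flag decides between a ring full of buffers (0 unused) and a fully consumed ring (desc_num unused), matching the hns3_desc_unused() change in the hunk below.

#include <stdio.h>

#define DESC_NUM 8

static int desc_unused(int ntc, int ntu, const int refill[])
{
	/* ntc == ntu with the current desc not refilled means the
	 * whole ring needs new buffers, not zero of them
	 */
	if (ntc == ntu && !refill[ntc])
		return DESC_NUM;
	return ((ntc >= ntu) ? 0 : DESC_NUM) + ntc - ntu;
}

int main(void)
{
	int all_refilled[DESC_NUM]  = { 1, 1, 1, 1, 1, 1, 1, 1 };
	int none_refilled[DESC_NUM] = { 0 };

	printf("full ring:     %d unused\n", desc_unused(3, 3, all_refilled));
	printf("consumed ring: %d unused\n", desc_unused(3, 3, none_refilled));
	return 0;
}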
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin linyunsheng@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 8 ++++++++ drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 1 + 2 files changed, 9 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 2deb4efdd553..655ecc009f77 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3255,6 +3255,7 @@ static void hns3_buffer_detach(struct hns3_enet_ring *ring, int i) { hns3_unmap_buffer(ring, &ring->desc_cb[i]); ring->desc[i].addr = 0; + ring->desc_cb[i].refill = 0; }
static void hns3_free_buffer_detach(struct hns3_enet_ring *ring, int i, @@ -3333,6 +3334,7 @@ static int hns3_alloc_and_attach_buffer(struct hns3_enet_ring *ring, int i)
ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + ring->desc_cb[i].page_offset); + ring->desc_cb[i].refill = 1;
return 0; } @@ -3362,6 +3364,7 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i, { hns3_unmap_buffer(ring, &ring->desc_cb[i]); ring->desc_cb[i] = *res_cb; + ring->desc_cb[i].refill = 1; ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + ring->desc_cb[i].page_offset); ring->desc[i].rx.bd_base_info = 0; @@ -3370,6 +3373,7 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i, static void hns3_reuse_buffer(struct hns3_enet_ring *ring, int i) { ring->desc_cb[i].reuse_flag = 0; + ring->desc_cb[i].refill = 1; ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + ring->desc_cb[i].page_offset); ring->desc[i].rx.bd_base_info = 0; @@ -3476,6 +3480,9 @@ static int hns3_desc_unused(struct hns3_enet_ring *ring) int ntc = ring->next_to_clean; int ntu = ring->next_to_use;
+ if (unlikely(ntc == ntu && !ring->desc_cb[ntc].refill)) + return ring->desc_num; + return ((ntc >= ntu) ? 0 : ring->desc_num) + ntc - ntu; }
@@ -3821,6 +3828,7 @@ static void hns3_rx_ring_move_fw(struct hns3_enet_ring *ring) { ring->desc[ring->next_to_clean].rx.bd_base_info &= cpu_to_le32(~BIT(HNS3_RXD_VLD_B)); + ring->desc_cb[ring->next_to_clean].refill = 0; ring->next_to_clean += 1;
if (unlikely(ring->next_to_clean == ring->desc_num)) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index 9d9be3665bb1..f09a61d9c626 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -330,6 +330,7 @@ struct hns3_desc_cb { u32 length; /* length of the buffer */
u16 reuse_flag; + u16 refill;
/* desc type, used by the ring user to mark the type of the priv data */ u16 type;
From: Yunsheng Lin linyunsheng@huawei.com
mainline inclusion from mainline-v5.15-rc7 commit 68752b24f51a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, when there is an rx page allocation failure, polling may be stopped if there is no more packet to be received, which may cause a queue stall problem under memory pressure.
This patch makes sure polling is scheduled again when there is any rx page allocation failure, and polling will try to allocate receive buffers until it succeeds.
Now that the allocation retry is added, it is unnecessary to do the rx page allocation at the end of rx cleaning, so remove it. Also reset unused_count to zero after calling hns3_nic_alloc_rx_buffers() to avoid calling hns3_nic_alloc_rx_buffers() repeatedly under memory pressure.
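The retry works by claiming the whole budget: returning budget from the poll routine tells the NAPI core that work remains, so polling is rescheduled and the allocation is retried, whereas returning recv_pkts < budget would let NAPI complete and stall the queue. A simplified, compilable sketch of the control flow (stub allocator, not the driver's exact code):

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_ONCE 16

/* stub: pretend every allocation fails under memory pressure */
static bool alloc_rx_buffers(int count) { (void)count; return true; }

static int clean_rx_ring(int budget)
{
	int unused_count = ALLOC_ONCE; /* pretend the ring needs refilling */
	bool failure = false;
	int recv_pkts = 0;

	while (recv_pkts < budget) {
		if (unused_count >= ALLOC_ONCE) {
			failure = failure || alloc_rx_buffers(unused_count);
			unused_count = 0; /* don't re-enter immediately */
		}
		break; /* stand-in for "no more packets to poll" */
	}
	/* claim the whole budget on failure so NAPI schedules
	 * polling again and the allocation is retried
	 */
	return failure ? budget : recv_pkts;
}

int main(void)
{
	printf("poll returned %d of budget 64\n", clean_rx_ring(64));
	return 0;
}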
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin linyunsheng@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../net/ethernet/hisilicon/hns3/hns3_enet.c | 22 ++++++++++--------- 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 655ecc009f77..b2a8abfc4eec 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3486,7 +3486,8 @@ static int hns3_desc_unused(struct hns3_enet_ring *ring) return ((ntc >= ntu) ? 0 : ring->desc_num) + ntc - ntu; }
-static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, +/* Return true if there is any allocation failure */ +static bool hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, int cleand_count) { struct hns3_desc_cb *desc_cb; @@ -3511,7 +3512,10 @@ static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, hns3_rl_err(ring_to_netdev(ring), "alloc rx buffer failed: %d\n", ret); - break; + + writel(i, ring->tqp->io_base + + HNS3_RING_RX_RING_HEAD_REG); + return true; } hns3_replace_buffer(ring, ring->next_to_use, &res_cbs);
@@ -3524,6 +3528,7 @@ static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, }
writel(i, ring->tqp->io_base + HNS3_RING_RX_RING_HEAD_REG); + return false; }
static bool hns3_can_reuse_page(struct hns3_desc_cb *cb) @@ -4175,6 +4180,7 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, { #define RCB_NOF_ALLOC_RX_BUFF_ONCE 16 int unused_count = hns3_desc_unused(ring); + bool failure = false; int recv_pkts = 0; int err;
@@ -4183,9 +4189,9 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, while (recv_pkts < budget) { /* Reuse or realloc buffers */ if (unused_count >= RCB_NOF_ALLOC_RX_BUFF_ONCE) { - hns3_nic_alloc_rx_buffers(ring, unused_count); - unused_count = hns3_desc_unused(ring) - - ring->pending_buf; + failure = failure || + hns3_nic_alloc_rx_buffers(ring, unused_count); + unused_count = 0; }
/* Poll one pkt */ @@ -4204,11 +4210,7 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, }
out: - /* Make all data has been write before submit */ - if (unused_count > 0) - hns3_nic_alloc_rx_buffers(ring, unused_count); - - return recv_pkts; + return failure ? budget : recv_pkts; }
static void hns3_update_rx_int_coalesce(struct hns3_enet_tqp_vector *tqp_vector)
From: Huazhong Tan tanhuazhong@huawei.com
mainline inclusion from mainline-master commit c99fead7cb07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Since the user may need to check the current interrupt coalesce configuration, add debugfs support for querying this info.
Create a single file "coalesce_info" for it; querying it with "cat coalesce_info" returns the result to userspace.
For devices whose version is V3 or above, the GL register contains both the usecs value and the 1us unit configuration. Reading the usecs configuration from this register would include the confusing unit bit, so add a GL mask to get the correct value, and add a QL mask for the frames configuration as well.
The display style is below:
$ cat coalesce_info
tx interrupt coalesce info:
VEC_ID  ALGO_STATE  PROFILE_ID  CQE_MODE  TUNE_STATE  STEPS_LEFT...
0       IN_PROG     4           EQE       ON_TOP      0...
1       START       3           EQE       LEFT        1...

rx interrupt coalesce info:
VEC_ID  ALGO_STATE  PROFILE_ID  CQE_MODE  TUNE_STATE  STEPS_LEFT...
0       IN_PROG     3           EQE       LEFT        1...
1       IN_PROG     0           EQE       ON_TOP      0...
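The HW_GL/HW_QL columns come from the masked register reads described above. A runnable model of the masking, with GENMASK reimplemented locally for a standalone build (the mask widths match the HNS3_VECTOR_GL_MASK/HNS3_VECTOR_QL_MASK definitions in the hunk below; the raw register value is illustrative):

#include <stdint.h>
#include <stdio.h>

#define GENMASK(h, l) (((~0U) << (l)) & (~0U >> (31 - (h))))

#define VECTOR_GL_MASK GENMASK(11, 0) /* usecs field: low 12 bits */
#define VECTOR_QL_MASK GENMASK(9, 0)  /* frames field: low 10 bits */

int main(void)
{
	/* illustrative raw value: 1us unit bit set above the usecs field */
	uint32_t raw_gl = (1U << 12) | 20;

	printf("raw 0x%x -> gl %u usecs\n", raw_gl, raw_gl & VECTOR_GL_MASK);
	return 0;
}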
Signed-off-by: Huazhong Tan tanhuazhong@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + .../ethernet/hisilicon/hns3/hns3_debugfs.c | 119 ++++++++++++++++++ .../net/ethernet/hisilicon/hns3/hns3_enet.h | 3 +- 3 files changed, 122 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 393a830c281c..7cdc127ae217 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -298,6 +298,7 @@ enum hnae3_dbg_cmd { HNAE3_DBG_CMD_SERV_INFO, HNAE3_DBG_CMD_UMV_INFO, HNAE3_DBG_CMD_PAGE_POOL_INFO, + HNAE3_DBG_CMD_COAL_INFO, HNAE3_DBG_CMD_UNKNOWN, };
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index b26d43c9c088..578d693b23c2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -343,6 +343,13 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .buf_len = HNS3_DBG_READ_LEN, .init = hns3_dbg_common_file_init, }, + { + .name = "coalesce_info", + .cmd = HNAE3_DBG_CMD_COAL_INFO, + .dentry = HNS3_DBG_DENTRY_COMMON, + .buf_len = HNS3_DBG_READ_LEN_1MB, + .init = hns3_dbg_common_file_init, + }, };
static struct hns3_dbg_cap_info hns3_dbg_cap[] = { @@ -391,6 +398,26 @@ static struct hns3_dbg_cap_info hns3_dbg_cap[] = { } };
+static const struct hns3_dbg_item coal_info_items[] = { + { "VEC_ID", 2 }, + { "ALGO_STATE", 2 }, + { "PROFILE_ID", 2 }, + { "CQE_MODE", 2 }, + { "TUNE_STATE", 2 }, + { "STEPS_LEFT", 2 }, + { "STEPS_RIGHT", 2 }, + { "TIRED", 2 }, + { "SW_GL", 2 }, + { "SW_QL", 2 }, + { "HW_GL", 2 }, + { "HW_QL", 2 }, +}; + +static const char * const dim_cqe_mode_str[] = { "EQE", "CQE" }; +static const char * const dim_state_str[] = { "START", "IN_PROG", "APPLY" }; +static const char * const +dim_tune_stat_str[] = { "ON_TOP", "TIRED", "RIGHT", "LEFT" }; + static void hns3_dbg_fill_content(char *content, u16 len, const struct hns3_dbg_item *items, const char **result, u16 size) @@ -412,6 +439,94 @@ static void hns3_dbg_fill_content(char *content, u16 len, *pos++ = '\0'; }
+static void hns3_get_coal_info(struct hns3_enet_tqp_vector *tqp_vector, + char **result, int i, bool is_tx) +{ + unsigned int gl_offset, ql_offset; + struct hns3_enet_coalesce *coal; + unsigned int reg_val; + unsigned int j = 0; + struct dim *dim; + bool ql_enable; + + if (is_tx) { + coal = &tqp_vector->tx_group.coal; + dim = &tqp_vector->tx_group.dim; + gl_offset = HNS3_VECTOR_GL1_OFFSET; + ql_offset = HNS3_VECTOR_TX_QL_OFFSET; + ql_enable = tqp_vector->tx_group.coal.ql_enable; + } else { + coal = &tqp_vector->rx_group.coal; + dim = &tqp_vector->rx_group.dim; + gl_offset = HNS3_VECTOR_GL0_OFFSET; + ql_offset = HNS3_VECTOR_RX_QL_OFFSET; + ql_enable = tqp_vector->rx_group.coal.ql_enable; + } + + sprintf(result[j++], "%d", i); + sprintf(result[j++], "%s", dim_state_str[dim->state]); + sprintf(result[j++], "%u", dim->profile_ix); + sprintf(result[j++], "%s", dim_cqe_mode_str[dim->mode]); + sprintf(result[j++], "%s", + dim_tune_stat_str[dim->tune_state]); + sprintf(result[j++], "%u", dim->steps_left); + sprintf(result[j++], "%u", dim->steps_right); + sprintf(result[j++], "%u", dim->tired); + sprintf(result[j++], "%u", coal->int_gl); + sprintf(result[j++], "%u", coal->int_ql); + reg_val = readl(tqp_vector->mask_addr + gl_offset) & + HNS3_VECTOR_GL_MASK; + sprintf(result[j++], "%u", reg_val); + if (ql_enable) { + reg_val = readl(tqp_vector->mask_addr + ql_offset) & + HNS3_VECTOR_QL_MASK; + sprintf(result[j++], "%u", reg_val); + } else { + sprintf(result[j++], "NA"); + } +} + +static void hns3_dump_coal_info(struct hnae3_handle *h, char *buf, int len, + int *pos, bool is_tx) +{ + char data_str[ARRAY_SIZE(coal_info_items)][HNS3_DBG_DATA_STR_LEN]; + char *result[ARRAY_SIZE(coal_info_items)]; + struct hns3_enet_tqp_vector *tqp_vector; + struct hns3_nic_priv *priv = h->priv; + char content[HNS3_DBG_INFO_LEN]; + unsigned int i; + + for (i = 0; i < ARRAY_SIZE(coal_info_items); i++) + result[i] = &data_str[i][0]; + + *pos += scnprintf(buf + *pos, len - *pos, + "%s interrupt coalesce info:\n", + is_tx ? "tx" : "rx"); + hns3_dbg_fill_content(content, sizeof(content), coal_info_items, + NULL, ARRAY_SIZE(coal_info_items)); + *pos += scnprintf(buf + *pos, len - *pos, "%s", content); + + for (i = 0; i < priv->vector_num; i++) { + tqp_vector = &priv->tqp_vector[i]; + hns3_get_coal_info(tqp_vector, result, i, is_tx); + hns3_dbg_fill_content(content, sizeof(content), coal_info_items, + (const char **)result, + ARRAY_SIZE(coal_info_items)); + *pos += scnprintf(buf + *pos, len - *pos, "%s", content); + } +} + +static int hns3_dbg_coal_info(struct hnae3_handle *h, char *buf, int len) +{ + int pos = 0; + + hns3_dump_coal_info(h, buf, len, &pos, true); + pos += scnprintf(buf + pos, len - pos, "\n"); + hns3_dump_coal_info(h, buf, len, &pos, false); + + return 0; +} + static const struct hns3_dbg_item tx_spare_info_items[] = { { "QUEUE_ID", 2 }, { "COPYBREAK", 2 }, @@ -1056,6 +1171,10 @@ static const struct hns3_dbg_func hns3_dbg_cmd_func[] = { .cmd = HNAE3_DBG_CMD_PAGE_POOL_INFO, .dbg_dump = hns3_dbg_page_pool_info, }, + { + .cmd = HNAE3_DBG_CMD_COAL_INFO, + .dbg_dump = hns3_dbg_coal_info, + }, };
static int hns3_dbg_read_cmd(struct hns3_dbg_data *dbg_data, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index f09a61d9c626..1715c98d906d 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -189,12 +189,13 @@ enum hns3_nic_state { #define HNS3_MAX_TSO_SIZE 1048576U #define HNS3_MAX_NON_TSO_SIZE 9728U
- +#define HNS3_VECTOR_GL_MASK GENMASK(11, 0) #define HNS3_VECTOR_GL0_OFFSET 0x100 #define HNS3_VECTOR_GL1_OFFSET 0x200 #define HNS3_VECTOR_GL2_OFFSET 0x300 #define HNS3_VECTOR_RL_OFFSET 0x900 #define HNS3_VECTOR_RL_EN_B 6 +#define HNS3_VECTOR_QL_MASK GENMASK(9, 0) #define HNS3_VECTOR_TX_QL_OFFSET 0xe00 #define HNS3_VECTOR_RX_QL_OFFSET 0xf00
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 0bd7e894dffa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
After querying mac statistics from firmware, the driver copies data from the descriptors to struct mac_stats of hdev, and the number of copied data elements is just the register number queried from firmware. The problem is that if the register number queried from firmware is larger than the number of data elements in struct mac_stats, it will cause a copy overflow.
So if the firmware adds more mac statistics in a later version, it is not compatible with drivers of older versions.
To fix this problem, the number of copied data elements needs to be the minimum of the register number queried from firmware and the number of data elements in struct mac_stats.
The first descriptor has three data elements and one reserved; to optimize the copy process, add this reserved data element to struct mac_stats.
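A runnable model of the clamp: the copy loop runs for min(reg_num reported by firmware, slots in struct mac_stats), so a newer firmware reporting extra registers can no longer overflow an older driver's structure (the slot count here is illustrative, not the real struct size):

#include <stdint.h>
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	uint64_t mac_stats[84] = { 0 };  /* illustrative struct size */
	uint32_t struct_slots = sizeof(mac_stats) / sizeof(uint64_t);
	uint32_t reg_num = 100;          /* newer firmware reports more */
	uint32_t data_size = MIN(struct_slots, reg_num);

	printf("copy %u of %u reported registers (struct holds %u)\n",
	       data_size, reg_num, struct_slots);
	return 0;
}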
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_main.c | 90 +++++++++---------- .../hisilicon/hns3/hns3pf/hclge_main.h | 5 +- 2 files changed, 48 insertions(+), 47 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 153ea5fd06fc..8f0fba36f486 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -451,8 +451,9 @@ static int hclge_mac_update_stats_defective(struct hclge_dev *hdev) u64 *data = (u64 *)(&hdev->mac_stats); struct hclge_desc desc[HCLGE_MAC_CMD_NUM]; __le64 *desc_data; - int i, k, n; + u32 data_size; int ret; + u32 i;
hclge_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_STATS_MAC, true); ret = hclge_cmd_send(&hdev->hw, desc, HCLGE_MAC_CMD_NUM); @@ -463,33 +464,36 @@ static int hclge_mac_update_stats_defective(struct hclge_dev *hdev) return ret; }
- for (i = 0; i < HCLGE_MAC_CMD_NUM; i++) { - /* for special opcode 0032, only the first desc has the head */ - if (unlikely(i == 0)) { - desc_data = (__le64 *)(&desc[i].data[0]); - n = HCLGE_RD_FIRST_STATS_NUM; - } else { - desc_data = (__le64 *)(&desc[i]); - n = HCLGE_RD_OTHER_STATS_NUM; - } + /* The first desc has a 64-bit header, so data size need to minus 1 */ + data_size = sizeof(desc) / (sizeof(u64)) - 1;
- for (k = 0; k < n; k++) { - *data += le64_to_cpu(*desc_data); - data++; - desc_data++; - } + desc_data = (__le64 *)(&desc[0].data[0]); + for (i = 0; i < data_size; i++) { + /* data memory is continuous because only the first desc has a + * header in this command + */ + *data += le64_to_cpu(*desc_data); + data++; + desc_data++; }
return 0; }
-static int hclge_mac_update_stats_complete(struct hclge_dev *hdev, u32 desc_num) +static int hclge_mac_update_stats_complete(struct hclge_dev *hdev, u32 reg_num) { +#define HCLGE_REG_NUM_PER_DESC 4 + u64 *data = (u64 *)(&hdev->mac_stats); struct hclge_desc *desc; __le64 *desc_data; - u16 i, k, n; + u32 data_size; + u32 desc_num; int ret; + u32 i; + + /* The first desc has a 64-bit header, so need to consider it */ + desc_num = reg_num / HCLGE_REG_NUM_PER_DESC + 1;
/* This may be called inside atomic sections, * so GFP_ATOMIC is more suitable here @@ -505,21 +509,16 @@ static int hclge_mac_update_stats_complete(struct hclge_dev *hdev, u32 desc_num) return ret; }
- for (i = 0; i < desc_num; i++) { - /* for special opcode 0034, only the first desc has the head */ - if (i == 0) { - desc_data = (__le64 *)(&desc[i].data[0]); - n = HCLGE_RD_FIRST_STATS_NUM; - } else { - desc_data = (__le64 *)(&desc[i]); - n = HCLGE_RD_OTHER_STATS_NUM; - } + data_size = min_t(u32, sizeof(hdev->mac_stats) / sizeof(u64), reg_num);
- for (k = 0; k < n; k++) { - *data += le64_to_cpu(*desc_data); - data++; - desc_data++; - } + desc_data = (__le64 *)(&desc[0].data[0]); + for (i = 0; i < data_size; i++) { + /* data memory is continuous because only the first desc has a + * header in this command + */ + *data += le64_to_cpu(*desc_data); + data++; + desc_data++; }
kfree(desc); @@ -527,40 +526,41 @@ static int hclge_mac_update_stats_complete(struct hclge_dev *hdev, u32 desc_num) return 0; }
-static int hclge_mac_query_reg_num(struct hclge_dev *hdev, u32 *desc_num) +static int hclge_mac_query_reg_num(struct hclge_dev *hdev, u32 *reg_num) { struct hclge_desc desc; - __le32 *desc_data; - u32 reg_num; int ret;
hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_QUERY_MAC_REG_NUM, true); ret = hclge_cmd_send(&hdev->hw, &desc, 1); - if (ret) + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to query mac statistic reg number, ret = %d\n", + ret); return ret; + }
- desc_data = (__le32 *)(&desc.data[0]); - reg_num = le32_to_cpu(*desc_data); - - *desc_num = 1 + ((reg_num - 3) >> 2) + - (u32)(((reg_num - 3) & 0x3) ? 1 : 0); + *reg_num = le32_to_cpu(desc.data[0]); + if (*reg_num == 0) { + dev_err(&hdev->pdev->dev, + "mac statistic reg number is invalid!\n"); + return -ENODATA; + }
return 0; }
static int hclge_mac_update_stats(struct hclge_dev *hdev) { - u32 desc_num; + u32 reg_num; int ret;
- ret = hclge_mac_query_reg_num(hdev, &desc_num); + ret = hclge_mac_query_reg_num(hdev, ®_num); /* The firmware supports the new statistics acquisition method */ if (!ret) - ret = hclge_mac_update_stats_complete(hdev, desc_num); + ret = hclge_mac_update_stats_complete(hdev, reg_num); else if (ret == -EOPNOTSUPP) ret = hclge_mac_update_stats_defective(hdev); - else - dev_err(&hdev->pdev->dev, "query mac reg num fail!\n");
return ret; } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index ca25e2edf3f0..36f1847b1c59 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -412,6 +412,7 @@ struct hclge_comm_stats_str { struct hclge_mac_stats { u64 mac_tx_mac_pause_num; u64 mac_rx_mac_pause_num; + u64 rsv0; u64 mac_tx_pfc_pri0_pkt_num; u64 mac_tx_pfc_pri1_pkt_num; u64 mac_tx_pfc_pri2_pkt_num; @@ -448,7 +449,7 @@ struct hclge_mac_stats { u64 mac_tx_1519_2047_oct_pkt_num; u64 mac_tx_2048_4095_oct_pkt_num; u64 mac_tx_4096_8191_oct_pkt_num; - u64 rsv0; + u64 rsv1; u64 mac_tx_8192_9216_oct_pkt_num; u64 mac_tx_9217_12287_oct_pkt_num; u64 mac_tx_12288_16383_oct_pkt_num; @@ -475,7 +476,7 @@ struct hclge_mac_stats { u64 mac_rx_1519_2047_oct_pkt_num; u64 mac_rx_2048_4095_oct_pkt_num; u64 mac_rx_4096_8191_oct_pkt_num; - u64 rsv1; + u64 rsv2; u64 mac_rx_8192_9216_oct_pkt_num; u64 mac_rx_9217_12287_oct_pkt_num; u64 mac_rx_12288_16383_oct_pkt_num;
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 4e4c03f6ab63 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the driver queries the number of mac statistics before querying the mac statistics themselves. As the number of mac statistics is a fixed value in firmware, it is redundant to query this number every time before querying mac statistics; it can just be queried once during the initialization process and saved in the device specifications.
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + .../ethernet/hisilicon/hns3/hns3_debugfs.c | 2 ++ .../hisilicon/hns3/hns3pf/hclge_main.c | 34 +++++++++++++------ 3 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 7cdc127ae217..785d1f1f2a08 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -348,6 +348,7 @@ struct hnae3_dev_specs { u16 max_qset_num; u16 umv_size; u16 mc_mac_size; + u32 mac_stats_num; };
struct hnae3_client_ops { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 578d693b23c2..1a1bebd453d3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -1050,6 +1050,8 @@ hns3_dbg_dev_specs(struct hnae3_handle *h, char *buf, int len, int *pos) dev_specs->umv_size); *pos += scnprintf(buf + *pos, len - *pos, "mc mac size: %u\n", dev_specs->mc_mac_size); + *pos += scnprintf(buf + *pos, len - *pos, "MAC statistics number: %u\n", + dev_specs->mac_stats_num); }
static int hns3_dbg_dev_info(struct hnae3_handle *h, char *buf, int len) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 8f0fba36f486..0cfbd4a9d993 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -480,10 +480,11 @@ static int hclge_mac_update_stats_defective(struct hclge_dev *hdev) return 0; }
-static int hclge_mac_update_stats_complete(struct hclge_dev *hdev, u32 reg_num) +static int hclge_mac_update_stats_complete(struct hclge_dev *hdev) { #define HCLGE_REG_NUM_PER_DESC 4
+ u32 reg_num = hdev->ae_dev->dev_specs.mac_stats_num; u64 *data = (u64 *)(&hdev->mac_stats); struct hclge_desc *desc; __le64 *desc_data; @@ -552,17 +553,11 @@ static int hclge_mac_query_reg_num(struct hclge_dev *hdev, u32 *reg_num)
static int hclge_mac_update_stats(struct hclge_dev *hdev) { - u32 reg_num; - int ret; - - ret = hclge_mac_query_reg_num(hdev, ®_num); /* The firmware supports the new statistics acquisition method */ - if (!ret) - ret = hclge_mac_update_stats_complete(hdev, reg_num); - else if (ret == -EOPNOTSUPP) - ret = hclge_mac_update_stats_defective(hdev); - - return ret; + if (hdev->ae_dev->dev_specs.mac_stats_num) + return hclge_mac_update_stats_complete(hdev); + else + return hclge_mac_update_stats_defective(hdev); }
static int hclge_tqps_update_stats(struct hnae3_handle *handle) @@ -1465,12 +1460,29 @@ static void hclge_check_dev_specs(struct hclge_dev *hdev) dev_specs->umv_size = HCLGE_DEFAULT_UMV_SPACE_PER_PF; }
+static int hclge_query_mac_stats_num(struct hclge_dev *hdev) +{ + u32 reg_num = 0; + int ret; + + ret = hclge_mac_query_reg_num(hdev, ®_num); + if (ret && ret != -EOPNOTSUPP) + return ret; + + hdev->ae_dev->dev_specs.mac_stats_num = reg_num; + return 0; +} + static int hclge_query_dev_specs(struct hclge_dev *hdev) { struct hclge_desc desc[HCLGE_QUERY_DEV_SPECS_BD_NUM]; int ret; int i;
+ ret = hclge_query_mac_stats_num(hdev); + if (ret) + return ret; + /* set default specifications as devices lower than version V3 do not * support querying specifications from firmware. */
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit c8af2887c941 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
The mac statistics add pause/pfc durations in device version V3; we can get the total active cycles of pause/pfc from these durations.
As the driver gets the register number from firmware to calculate the desc number needed to query mac statistics, it needs to set the mac statistics extended enable bit in firmware command 0x701A to tell firmware that the driver supports extended mac statistics; otherwise firmware only returns the register number of version V1.
As pause/pfc durations are not supported by hardware of older versions, they should not be shown by command "ethtool -S ethX" in that case, so record for each mac statistic the maximum register number of the version it belongs to. If the maximum register number of a mac statistic is greater than the register number got from firmware, it means hardware does not support this statistic, so ignore it when getting the strings and data of the mac statistics.
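A runnable model of the per-statistic gate: each entry records the maximum register number of the version that introduced it, and an entry is skipped whenever that exceeds the register number the firmware reported. The STATS_MAX_NUM_V1/V2 values below are illustrative placeholders, not the driver's actual HCLGE_MAC_STATS_MAX_NUM_* constants.

#include <stdint.h>
#include <stdio.h>

#define STATS_MAX_NUM_V1 84  /* illustrative placeholder */
#define STATS_MAX_NUM_V2 105 /* illustrative placeholder */

struct stats_str {
	const char *name;
	uint32_t stats_num; /* max register number of its version */
};

static const struct stats_str strings[] = {
	{ "mac_tx_mac_pause_num",   STATS_MAX_NUM_V1 },
	{ "mac_tx_pause_xoff_time", STATS_MAX_NUM_V2 },
};

int main(void)
{
	uint32_t reg_num_from_fw = STATS_MAX_NUM_V1; /* old hardware */
	size_t i;

	for (i = 0; i < sizeof(strings) / sizeof(strings[0]); i++) {
		if (strings[i].stats_num > reg_num_from_fw)
			continue; /* unsupported: hide from ethtool -S */
		printf("%s\n", strings[i].name);
	}
	return 0;
}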
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_cmd.c | 1 + .../hisilicon/hns3/hns3pf/hclge_cmd.h | 1 + .../hisilicon/hns3/hns3pf/hclge_main.c | 245 +++++++++++------- .../hisilicon/hns3/hns3pf/hclge_main.h | 27 ++ 4 files changed, 182 insertions(+), 92 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c index 9c2eeaa82294..c327df9dbac4 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c @@ -482,6 +482,7 @@ static int hclge_firmware_compat_config(struct hclge_dev *hdev, bool en) hnae3_set_bit(compat, HCLGE_NCSI_ERROR_REPORT_EN_B, 1); if (hnae3_dev_phy_imp_supported(hdev)) hnae3_set_bit(compat, HCLGE_PHY_IMP_EN_B, 1); + hnae3_set_bit(compat, HCLGE_MAC_STATS_EXT_EN_B, 1);
req->compat = cpu_to_le32(compat); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index bfcfefa9d2b5..c38b57fc6c6a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -1150,6 +1150,7 @@ struct hclge_query_ppu_pf_other_int_dfx_cmd { #define HCLGE_LINK_EVENT_REPORT_EN_B 0 #define HCLGE_NCSI_ERROR_REPORT_EN_B 1 #define HCLGE_PHY_IMP_EN_B 2 +#define HCLGE_MAC_STATS_EXT_EN_B 3 struct hclge_firmware_compat_cmd { __le32 compat; u8 rsv[20]; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 0cfbd4a9d993..fb3431e6acc5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -156,174 +156,210 @@ static const char hns3_nic_test_strs[][ETH_GSTRING_LEN] = { };
static const struct hclge_comm_stats_str g_mac_stats_string[] = { - {"mac_tx_mac_pause_num", + {"mac_tx_mac_pause_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_mac_pause_num)}, - {"mac_rx_mac_pause_num", + {"mac_rx_mac_pause_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_mac_pause_num)}, - {"mac_tx_control_pkt_num", + {"mac_tx_pause_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pause_xoff_time)}, + {"mac_rx_pause_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pause_xoff_time)}, + {"mac_tx_control_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_ctrl_pkt_num)}, - {"mac_rx_control_pkt_num", + {"mac_rx_control_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_ctrl_pkt_num)}, - {"mac_tx_pfc_pkt_num", + {"mac_tx_pfc_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pause_pkt_num)}, - {"mac_tx_pfc_pri0_pkt_num", + {"mac_tx_pfc_pri0_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri0_pkt_num)}, - {"mac_tx_pfc_pri1_pkt_num", + {"mac_tx_pfc_pri1_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri1_pkt_num)}, - {"mac_tx_pfc_pri2_pkt_num", + {"mac_tx_pfc_pri2_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri2_pkt_num)}, - {"mac_tx_pfc_pri3_pkt_num", + {"mac_tx_pfc_pri3_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri3_pkt_num)}, - {"mac_tx_pfc_pri4_pkt_num", + {"mac_tx_pfc_pri4_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri4_pkt_num)}, - {"mac_tx_pfc_pri5_pkt_num", + {"mac_tx_pfc_pri5_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri5_pkt_num)}, - {"mac_tx_pfc_pri6_pkt_num", + {"mac_tx_pfc_pri6_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri6_pkt_num)}, - {"mac_tx_pfc_pri7_pkt_num", + {"mac_tx_pfc_pri7_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri7_pkt_num)}, - {"mac_rx_pfc_pkt_num", + {"mac_tx_pfc_pri0_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri0_xoff_time)}, + {"mac_tx_pfc_pri1_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri1_xoff_time)}, + {"mac_tx_pfc_pri2_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri2_xoff_time)}, + {"mac_tx_pfc_pri3_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri3_xoff_time)}, + {"mac_tx_pfc_pri4_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri4_xoff_time)}, + {"mac_tx_pfc_pri5_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri5_xoff_time)}, + {"mac_tx_pfc_pri6_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri6_xoff_time)}, + {"mac_tx_pfc_pri7_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri7_xoff_time)}, + {"mac_rx_pfc_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pause_pkt_num)}, - {"mac_rx_pfc_pri0_pkt_num", + {"mac_rx_pfc_pri0_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri0_pkt_num)}, - {"mac_rx_pfc_pri1_pkt_num", + {"mac_rx_pfc_pri1_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri1_pkt_num)}, - {"mac_rx_pfc_pri2_pkt_num", + {"mac_rx_pfc_pri2_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, 
HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri2_pkt_num)}, - {"mac_rx_pfc_pri3_pkt_num", + {"mac_rx_pfc_pri3_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri3_pkt_num)}, - {"mac_rx_pfc_pri4_pkt_num", + {"mac_rx_pfc_pri4_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri4_pkt_num)}, - {"mac_rx_pfc_pri5_pkt_num", + {"mac_rx_pfc_pri5_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri5_pkt_num)}, - {"mac_rx_pfc_pri6_pkt_num", + {"mac_rx_pfc_pri6_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri6_pkt_num)}, - {"mac_rx_pfc_pri7_pkt_num", + {"mac_rx_pfc_pri7_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri7_pkt_num)}, - {"mac_tx_total_pkt_num", + {"mac_rx_pfc_pri0_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri0_xoff_time)}, + {"mac_rx_pfc_pri1_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri1_xoff_time)}, + {"mac_rx_pfc_pri2_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri2_xoff_time)}, + {"mac_rx_pfc_pri3_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri3_xoff_time)}, + {"mac_rx_pfc_pri4_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri4_xoff_time)}, + {"mac_rx_pfc_pri5_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri5_xoff_time)}, + {"mac_rx_pfc_pri6_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri6_xoff_time)}, + {"mac_rx_pfc_pri7_xoff_time", HCLGE_MAC_STATS_MAX_NUM_V2, + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri7_xoff_time)}, + {"mac_tx_total_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_total_pkt_num)}, - {"mac_tx_total_oct_num", + {"mac_tx_total_oct_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_total_oct_num)}, - {"mac_tx_good_pkt_num", + {"mac_tx_good_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_good_pkt_num)}, - {"mac_tx_bad_pkt_num", + {"mac_tx_bad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_bad_pkt_num)}, - {"mac_tx_good_oct_num", + {"mac_tx_good_oct_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_good_oct_num)}, - {"mac_tx_bad_oct_num", + {"mac_tx_bad_oct_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_bad_oct_num)}, - {"mac_tx_uni_pkt_num", + {"mac_tx_uni_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_uni_pkt_num)}, - {"mac_tx_multi_pkt_num", + {"mac_tx_multi_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_multi_pkt_num)}, - {"mac_tx_broad_pkt_num", + {"mac_tx_broad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_broad_pkt_num)}, - {"mac_tx_undersize_pkt_num", + {"mac_tx_undersize_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_undersize_pkt_num)}, - {"mac_tx_oversize_pkt_num", + {"mac_tx_oversize_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_oversize_pkt_num)}, - {"mac_tx_64_oct_pkt_num", + {"mac_tx_64_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_64_oct_pkt_num)}, - {"mac_tx_65_127_oct_pkt_num", + {"mac_tx_65_127_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_65_127_oct_pkt_num)}, - {"mac_tx_128_255_oct_pkt_num", + {"mac_tx_128_255_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, 
HCLGE_MAC_STATS_FIELD_OFF(mac_tx_128_255_oct_pkt_num)}, - {"mac_tx_256_511_oct_pkt_num", + {"mac_tx_256_511_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_256_511_oct_pkt_num)}, - {"mac_tx_512_1023_oct_pkt_num", + {"mac_tx_512_1023_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_512_1023_oct_pkt_num)}, - {"mac_tx_1024_1518_oct_pkt_num", + {"mac_tx_1024_1518_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_1024_1518_oct_pkt_num)}, - {"mac_tx_1519_2047_oct_pkt_num", + {"mac_tx_1519_2047_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_1519_2047_oct_pkt_num)}, - {"mac_tx_2048_4095_oct_pkt_num", + {"mac_tx_2048_4095_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_2048_4095_oct_pkt_num)}, - {"mac_tx_4096_8191_oct_pkt_num", + {"mac_tx_4096_8191_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_4096_8191_oct_pkt_num)}, - {"mac_tx_8192_9216_oct_pkt_num", + {"mac_tx_8192_9216_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_8192_9216_oct_pkt_num)}, - {"mac_tx_9217_12287_oct_pkt_num", + {"mac_tx_9217_12287_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_9217_12287_oct_pkt_num)}, - {"mac_tx_12288_16383_oct_pkt_num", + {"mac_tx_12288_16383_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_12288_16383_oct_pkt_num)}, - {"mac_tx_1519_max_good_pkt_num", + {"mac_tx_1519_max_good_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_1519_max_good_oct_pkt_num)}, - {"mac_tx_1519_max_bad_pkt_num", + {"mac_tx_1519_max_bad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_1519_max_bad_oct_pkt_num)}, - {"mac_rx_total_pkt_num", + {"mac_rx_total_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_total_pkt_num)}, - {"mac_rx_total_oct_num", + {"mac_rx_total_oct_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_total_oct_num)}, - {"mac_rx_good_pkt_num", + {"mac_rx_good_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_good_pkt_num)}, - {"mac_rx_bad_pkt_num", + {"mac_rx_bad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_bad_pkt_num)}, - {"mac_rx_good_oct_num", + {"mac_rx_good_oct_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_good_oct_num)}, - {"mac_rx_bad_oct_num", + {"mac_rx_bad_oct_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_bad_oct_num)}, - {"mac_rx_uni_pkt_num", + {"mac_rx_uni_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_uni_pkt_num)}, - {"mac_rx_multi_pkt_num", + {"mac_rx_multi_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_multi_pkt_num)}, - {"mac_rx_broad_pkt_num", + {"mac_rx_broad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_broad_pkt_num)}, - {"mac_rx_undersize_pkt_num", + {"mac_rx_undersize_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_undersize_pkt_num)}, - {"mac_rx_oversize_pkt_num", + {"mac_rx_oversize_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_oversize_pkt_num)}, - {"mac_rx_64_oct_pkt_num", + {"mac_rx_64_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_64_oct_pkt_num)}, - {"mac_rx_65_127_oct_pkt_num", + {"mac_rx_65_127_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_65_127_oct_pkt_num)}, - {"mac_rx_128_255_oct_pkt_num", + 
{"mac_rx_128_255_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_128_255_oct_pkt_num)}, - {"mac_rx_256_511_oct_pkt_num", + {"mac_rx_256_511_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_256_511_oct_pkt_num)}, - {"mac_rx_512_1023_oct_pkt_num", + {"mac_rx_512_1023_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_512_1023_oct_pkt_num)}, - {"mac_rx_1024_1518_oct_pkt_num", + {"mac_rx_1024_1518_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_1024_1518_oct_pkt_num)}, - {"mac_rx_1519_2047_oct_pkt_num", + {"mac_rx_1519_2047_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_1519_2047_oct_pkt_num)}, - {"mac_rx_2048_4095_oct_pkt_num", + {"mac_rx_2048_4095_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_2048_4095_oct_pkt_num)}, - {"mac_rx_4096_8191_oct_pkt_num", + {"mac_rx_4096_8191_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_4096_8191_oct_pkt_num)}, - {"mac_rx_8192_9216_oct_pkt_num", + {"mac_rx_8192_9216_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_8192_9216_oct_pkt_num)}, - {"mac_rx_9217_12287_oct_pkt_num", + {"mac_rx_9217_12287_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_9217_12287_oct_pkt_num)}, - {"mac_rx_12288_16383_oct_pkt_num", + {"mac_rx_12288_16383_oct_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_12288_16383_oct_pkt_num)}, - {"mac_rx_1519_max_good_pkt_num", + {"mac_rx_1519_max_good_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_1519_max_good_oct_pkt_num)}, - {"mac_rx_1519_max_bad_pkt_num", + {"mac_rx_1519_max_bad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_1519_max_bad_oct_pkt_num)},
- {"mac_tx_fragment_pkt_num", + {"mac_tx_fragment_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_fragment_pkt_num)}, - {"mac_tx_undermin_pkt_num", + {"mac_tx_undermin_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_undermin_pkt_num)}, - {"mac_tx_jabber_pkt_num", + {"mac_tx_jabber_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_jabber_pkt_num)}, - {"mac_tx_err_all_pkt_num", + {"mac_tx_err_all_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_err_all_pkt_num)}, - {"mac_tx_from_app_good_pkt_num", + {"mac_tx_from_app_good_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_from_app_good_pkt_num)}, - {"mac_tx_from_app_bad_pkt_num", + {"mac_tx_from_app_bad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_tx_from_app_bad_pkt_num)}, - {"mac_rx_fragment_pkt_num", + {"mac_rx_fragment_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_fragment_pkt_num)}, - {"mac_rx_undermin_pkt_num", + {"mac_rx_undermin_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_undermin_pkt_num)}, - {"mac_rx_jabber_pkt_num", + {"mac_rx_jabber_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_jabber_pkt_num)}, - {"mac_rx_fcs_err_pkt_num", + {"mac_rx_fcs_err_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_fcs_err_pkt_num)}, - {"mac_rx_send_app_good_pkt_num", + {"mac_rx_send_app_good_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_send_app_good_pkt_num)}, - {"mac_rx_send_app_bad_pkt_num", + {"mac_rx_send_app_bad_pkt_num", HCLGE_MAC_STATS_MAX_NUM_V1, HCLGE_MAC_STATS_FIELD_OFF(mac_rx_send_app_bad_pkt_num)} };
@@ -665,20 +701,39 @@ static u8 *hclge_tqps_get_strings(struct hnae3_handle *handle, u8 *data) return buff; }
-static u64 *hclge_comm_get_stats(const void *comm_stats, +static int hclge_comm_get_count(struct hclge_dev *hdev, + const struct hclge_comm_stats_str strs[], + u32 size) +{ + int count = 0; + u32 i; + + for (i = 0; i < size; i++) + if (strs[i].stats_num <= hdev->ae_dev->dev_specs.mac_stats_num) + count++; + + return count; +} + +static u64 *hclge_comm_get_stats(struct hclge_dev *hdev, const struct hclge_comm_stats_str strs[], int size, u64 *data) { u64 *buf = data; u32 i;
- for (i = 0; i < size; i++) - buf[i] = HCLGE_STATS_READ(comm_stats, strs[i].offset); + for (i = 0; i < size; i++) { + if (strs[i].stats_num > hdev->ae_dev->dev_specs.mac_stats_num) + continue; + + *buf = HCLGE_STATS_READ(&hdev->mac_stats, strs[i].offset); + buf++; + }
- return buf + size; + return buf; }
-static u8 *hclge_comm_get_strings(u32 stringset, +static u8 *hclge_comm_get_strings(struct hclge_dev *hdev, u32 stringset, const struct hclge_comm_stats_str strs[], int size, u8 *data) { @@ -689,6 +744,9 @@ static u8 *hclge_comm_get_strings(u32 stringset, return buff;
for (i = 0; i < size; i++) { + if (strs[i].stats_num > hdev->ae_dev->dev_specs.mac_stats_num) + continue; + snprintf(buff, ETH_GSTRING_LEN, "%s", strs[i].desc); buff = buff + ETH_GSTRING_LEN; } @@ -780,7 +838,8 @@ static int hclge_get_sset_count(struct hnae3_handle *handle, int stringset) handle->flags |= HNAE3_SUPPORT_PHY_LOOPBACK; } } else if (stringset == ETH_SS_STATS) { - count = ARRAY_SIZE(g_mac_stats_string) + + count = hclge_comm_get_count(hdev, g_mac_stats_string, + ARRAY_SIZE(g_mac_stats_string)) + hclge_tqps_get_sset_count(handle, stringset); }
@@ -790,12 +849,14 @@ static int hclge_get_sset_count(struct hnae3_handle *handle, int stringset) static void hclge_get_strings(struct hnae3_handle *handle, u32 stringset, u8 *data) { + struct hclge_vport *vport = hclge_get_vport(handle); + struct hclge_dev *hdev = vport->back; u8 *p = (char *)data; int size;
if (stringset == ETH_SS_STATS) { size = ARRAY_SIZE(g_mac_stats_string); - p = hclge_comm_get_strings(stringset, g_mac_stats_string, + p = hclge_comm_get_strings(hdev, stringset, g_mac_stats_string, size, p); p = hclge_tqps_get_strings(handle, p); } else if (stringset == ETH_SS_TEST) { @@ -829,7 +890,7 @@ static void hclge_get_stats(struct hnae3_handle *handle, u64 *data) struct hclge_dev *hdev = vport->back; u64 *p;
- p = hclge_comm_get_stats(&hdev->mac_stats, g_mac_stats_string, + p = hclge_comm_get_stats(hdev, g_mac_stats_string, ARRAY_SIZE(g_mac_stats_string), data); p = hclge_tqps_get_stats(handle, p); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 36f1847b1c59..4f8403af84be 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -403,8 +403,13 @@ struct hclge_tm_info { u8 pfc_en; /* PFC enabled or not for user priority */ };
+/* max number of mac statistics on each version */ +#define HCLGE_MAC_STATS_MAX_NUM_V1 84 +#define HCLGE_MAC_STATS_MAX_NUM_V2 105 + struct hclge_comm_stats_str { char desc[ETH_GSTRING_LEN]; + u32 stats_num; unsigned long offset; };
@@ -499,6 +504,28 @@ struct hclge_mac_stats { u64 mac_rx_pfc_pause_pkt_num; u64 mac_tx_ctrl_pkt_num; u64 mac_rx_ctrl_pkt_num; + + /* duration of pfc */ + u64 mac_tx_pfc_pri0_xoff_time; + u64 mac_tx_pfc_pri1_xoff_time; + u64 mac_tx_pfc_pri2_xoff_time; + u64 mac_tx_pfc_pri3_xoff_time; + u64 mac_tx_pfc_pri4_xoff_time; + u64 mac_tx_pfc_pri5_xoff_time; + u64 mac_tx_pfc_pri6_xoff_time; + u64 mac_tx_pfc_pri7_xoff_time; + u64 mac_rx_pfc_pri0_xoff_time; + u64 mac_rx_pfc_pri1_xoff_time; + u64 mac_rx_pfc_pri2_xoff_time; + u64 mac_rx_pfc_pri3_xoff_time; + u64 mac_rx_pfc_pri4_xoff_time; + u64 mac_rx_pfc_pri5_xoff_time; + u64 mac_rx_pfc_pri6_xoff_time; + u64 mac_rx_pfc_pri7_xoff_time; + + /* duration of pause */ + u64 mac_tx_pause_xoff_time; + u64 mac_rx_pause_xoff_time; };
#define HCLGE_STATS_TIMER_INTERVAL 300UL
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 58cb422ef625 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
The functions that convert speed ability to ethtool link modes currently only support setting mac->supported. To reuse these functions for setting other link-mode bitmaps (e.g. advertising), delete the mac argument and add a link_mode argument.
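A minimal sketch of the resulting call pattern (mirroring the diff below, not a new driver API): the same helper can now fill either bitmap.

	/* before: helper hard-wired to mac->supported */
	hclge_convert_setting_sr(mac, speed_ability);

	/* after: the caller chooses the target link-mode bitmap */
	hclge_convert_setting_sr(speed_ability, mac->supported);
	hclge_convert_setting_sr(speed_ability, mac->advertising);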
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_main.c | 70 ++++++++++--------- 1 file changed, 37 insertions(+), 33 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index fb3431e6acc5..d3521c48f336 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -1093,96 +1093,100 @@ static int hclge_check_port_speed(struct hnae3_handle *handle, u32 speed) return -EINVAL; }
-static void hclge_convert_setting_sr(struct hclge_mac *mac, u16 speed_ability) +static void hclge_convert_setting_sr(u16 speed_ability, + unsigned long *link_mode) { if (speed_ability & HCLGE_SUPPORT_10G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_10000baseSR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_25G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_25000baseSR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_40G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_40000baseSR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_50G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_50000baseSR2_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_100G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_100000baseSR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_200G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_200000baseSR4_Full_BIT, - mac->supported); + link_mode); }
-static void hclge_convert_setting_lr(struct hclge_mac *mac, u16 speed_ability) +static void hclge_convert_setting_lr(u16 speed_ability, + unsigned long *link_mode) { if (speed_ability & HCLGE_SUPPORT_10G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_10000baseLR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_25G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_25000baseSR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_50G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_50000baseLR_ER_FR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_40G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_40000baseLR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_100G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_100000baseLR4_ER4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_200G_BIT) linkmode_set_bit( ETHTOOL_LINK_MODE_200000baseLR4_ER4_FR4_Full_BIT, - mac->supported); + link_mode); }
-static void hclge_convert_setting_cr(struct hclge_mac *mac, u16 speed_ability) +static void hclge_convert_setting_cr(u16 speed_ability, + unsigned long *link_mode) { if (speed_ability & HCLGE_SUPPORT_10G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_10000baseCR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_25G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_25000baseCR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_40G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_40000baseCR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_50G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_50000baseCR2_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_100G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_100000baseCR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_200G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_200000baseCR4_Full_BIT, - mac->supported); + link_mode); }
-static void hclge_convert_setting_kr(struct hclge_mac *mac, u16 speed_ability) +static void hclge_convert_setting_kr(u16 speed_ability, + unsigned long *link_mode) { if (speed_ability & HCLGE_SUPPORT_1G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseKX_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_10G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_10000baseKR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_25G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_25000baseKR_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_40G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_40000baseKR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_50G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_50000baseKR2_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_100G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_100000baseKR4_Full_BIT, - mac->supported); + link_mode); if (speed_ability & HCLGE_SUPPORT_200G_BIT) linkmode_set_bit(ETHTOOL_LINK_MODE_200000baseKR4_Full_BIT, - mac->supported); + link_mode); }
static void hclge_convert_setting_fec(struct hclge_mac *mac) @@ -1226,9 +1230,9 @@ static void hclge_parse_fiber_link_mode(struct hclge_dev *hdev, linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT, mac->supported);
- hclge_convert_setting_sr(mac, speed_ability); - hclge_convert_setting_lr(mac, speed_ability); - hclge_convert_setting_cr(mac, speed_ability); + hclge_convert_setting_sr(speed_ability, mac->supported); + hclge_convert_setting_lr(speed_ability, mac->supported); + hclge_convert_setting_cr(speed_ability, mac->supported); if (hnae3_dev_fec_supported(hdev)) hclge_convert_setting_fec(mac);
@@ -1244,7 +1248,7 @@ static void hclge_parse_backplane_link_mode(struct hclge_dev *hdev, { struct hclge_mac *mac = &hdev->hw.mac;
- hclge_convert_setting_kr(mac, speed_ability); + hclge_convert_setting_kr(speed_ability, mac->supported); if (hnae3_dev_fec_supported(hdev)) hclge_convert_setting_fec(mac);
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 6eaed433ee5f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the ethtool advertised link modes of a FIBRE port are cleared to zero when autoneg is off, so users cannot get the advertised link modes info directly from the "ethtool <dev>" command.
To improve this situation, update the speed, FEC and pause data of the advertised link modes when autoneg is off.
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_main.c | 78 ++++++++++++++++++- 1 file changed, 77 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index d3521c48f336..6ac190439274 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3051,6 +3051,82 @@ static void hclge_update_link_status(struct hclge_dev *hdev) clear_bit(HCLGE_STATE_LINK_UPDATING, &hdev->state); }
+static void hclge_update_speed_advertising(struct hclge_mac *mac) +{ + u32 speed_ability; + + if (hclge_get_speed_bit(mac->speed, &speed_ability)) + return; + + switch (mac->module_type) { + case HNAE3_MODULE_TYPE_FIBRE_LR: + hclge_convert_setting_lr(speed_ability, mac->advertising); + break; + case HNAE3_MODULE_TYPE_FIBRE_SR: + case HNAE3_MODULE_TYPE_AOC: + hclge_convert_setting_sr(speed_ability, mac->advertising); + break; + case HNAE3_MODULE_TYPE_CR: + hclge_convert_setting_cr(speed_ability, mac->advertising); + break; + case HNAE3_MODULE_TYPE_KR: + hclge_convert_setting_kr(speed_ability, mac->advertising); + break; + default: + break; + } +} + +static void hclge_update_fec_advertising(struct hclge_mac *mac) +{ + if (mac->fec_mode & BIT(HNAE3_FEC_RS)) + linkmode_set_bit(ETHTOOL_LINK_MODE_FEC_RS_BIT, + mac->advertising); + else if (mac->fec_mode & BIT(HNAE3_FEC_BASER)) + linkmode_set_bit(ETHTOOL_LINK_MODE_FEC_BASER_BIT, + mac->advertising); + else + linkmode_set_bit(ETHTOOL_LINK_MODE_FEC_NONE_BIT, + mac->advertising); +} + +static void hclge_update_pause_advertising(struct hclge_dev *hdev) +{ + struct hclge_mac *mac = &hdev->hw.mac; + bool rx_en, tx_en; + + switch (hdev->fc_mode_last_time) { + case HCLGE_FC_RX_PAUSE: + rx_en = true; + tx_en = false; + break; + case HCLGE_FC_TX_PAUSE: + rx_en = false; + tx_en = true; + break; + case HCLGE_FC_FULL: + rx_en = true; + tx_en = true; + break; + default: + rx_en = false; + tx_en = false; + break; + } + + linkmode_set_pause(mac->advertising, tx_en, rx_en); +} + +static void hclge_update_advertising(struct hclge_dev *hdev) +{ + struct hclge_mac *mac = &hdev->hw.mac; + + linkmode_zero(mac->advertising); + hclge_update_speed_advertising(mac); + hclge_update_fec_advertising(mac); + hclge_update_pause_advertising(hdev); +} + static void hclge_update_port_capability(struct hclge_dev *hdev, struct hclge_mac *mac) { @@ -3073,7 +3149,7 @@ static void hclge_update_port_capability(struct hclge_dev *hdev, } else { linkmode_clear_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, mac->supported); - linkmode_zero(mac->advertising); + hclge_update_advertising(hdev); } }
From: Weihang Li liweihang@huawei.com
mainline inclusion from mainline-master commit b566ef60394c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
This patch adds one bus-related RAS error for RoCE; this error covers RRESP/BRESP and read poison errors.
Signed-off-by: Weihang Li liweihang@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 5 ++++- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h | 1 + 2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c index 93aa7f2bdc13..59df3c477c36 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c @@ -1321,7 +1321,10 @@ static const struct hclge_hw_type_id hclge_hw_type_id_st[] = { }, { .type_id = ROCEE_OVF_ERR, .msg = "rocee_ovf_error" - } + }, { + .type_id = ROCEE_BUS_ERR, + .msg = "rocee_bus_error" + }, };
static void hclge_log_error(struct device *dev, char *reg, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h index d811eeefe2c0..2f4f4c71a5ec 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h @@ -169,6 +169,7 @@ enum hclge_err_type_list { /* add new ERROR TYPE for NIC here in order */ ROCEE_NORMAL_ERR = 40, ROCEE_OVF_ERR = 41, + ROCEE_BUS_ERR = 42, /* add new ERROR TYPE for ROCEE here in order */ };
From: Jiaran Zhang zhangjiaran@huawei.com
mainline inclusion from mainline-master commit da3fea80fea4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
This patch adds the himac error recovery module, as well as the link_error and ptp_error types for himac.
Signed-off-by: Jiaran Zhang zhangjiaran@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 9 +++++++++ drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h | 3 +++ 2 files changed, 12 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c index 59df3c477c36..20e628c2bd44 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c @@ -1242,6 +1242,9 @@ static const struct hclge_hw_module_id hclge_hw_module_id_st[] = { }, { .module_id = MODULE_MASTER, .msg = "MODULE_MASTER" + }, { + .module_id = MODULE_HIMAC, + .msg = "MODULE_HIMAC" }, { .module_id = MODULE_ROCEE_TOP, .msg = "MODULE_ROCEE_TOP" @@ -1315,6 +1318,12 @@ static const struct hclge_hw_type_id hclge_hw_type_id_st[] = { }, { .type_id = GLB_ERROR, .msg = "glb_error" + }, { + .type_id = LINK_ERROR, + .msg = "link_error" + }, { + .type_id = PTP_ERROR, + .msg = "ptp_error" }, { .type_id = ROCEE_NORMAL_ERR, .msg = "rocee_normal_error" diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h index 2f4f4c71a5ec..86be6fb32990 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h @@ -138,6 +138,7 @@ enum hclge_mod_name_list { MODULE_RCB_TX = 12, MODULE_TXDMA = 13, MODULE_MASTER = 14, + MODULE_HIMAC = 15, /* add new MODULE NAME for NIC here in order */ MODULE_ROCEE_TOP = 40, MODULE_ROCEE_TIMER = 41, @@ -166,6 +167,8 @@ enum hclge_err_type_list { ETS_ERROR = 10, NCSI_ERROR = 11, GLB_ERROR = 12, + LINK_ERROR = 13, + PTP_ERROR = 14, /* add new ERROR TYPE for NIC here in order */ ROCEE_NORMAL_ERR = 40, ROCEE_OVF_ERR = 41,
From: Yufeng Mo moyufeng@huawei.com
mainline inclusion from mainline-v5.15 commit f29da4088fb4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the workqueue of hclge/hclgevf executes by default on the CPU that initiated the scheduling request. In stress scenarios, that CPU may be busy, so the queued work may only complete after a long delay. To avoid this situation and get proper scheduling, use the WQ_UNBOUND mode instead. This way, the work can be performed on a relatively idle CPU.
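A minimal before/after sketch of the scheduling change (names taken from the diff below):

	/* before: delayed work pinned to the first CPU in the affinity mask */
	mod_delayed_work_on(cpumask_first(&hdev->affinity_mask),
			    hclge_wq, &hdev->service_task, 0);

	/* after: an unbound workqueue lets the scheduler pick an idle CPU */
	hclge_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, HCLGE_NAME);
	mod_delayed_work(hclge_wq, &hdev->service_task, 0);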
Signed-off-by: Yufeng Mo moyufeng@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_main.c | 34 +++---------------- .../hisilicon/hns3/hns3pf/hclge_main.h | 1 - .../hisilicon/hns3/hns3vf/hclgevf_main.c | 2 +- 3 files changed, 6 insertions(+), 31 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 6ac190439274..1de0a1c0123f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2930,33 +2930,28 @@ static void hclge_mbx_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_and_set_bit(HCLGE_STATE_MBX_SERVICE_SCHED, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, 0); + mod_delayed_work(hclge_wq, &hdev->service_task, 0); }
static void hclge_reset_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_and_set_bit(HCLGE_STATE_RST_SERVICE_SCHED, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, 0); + mod_delayed_work(hclge_wq, &hdev->service_task, 0); }
static void hclge_errhand_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_and_set_bit(HCLGE_STATE_ERR_SERVICE_SCHED, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, 0); + mod_delayed_work(hclge_wq, &hdev->service_task, 0); }
void hclge_task_schedule(struct hclge_dev *hdev, unsigned long delay_time) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_bit(HCLGE_STATE_RST_FAIL, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, - delay_time); + mod_delayed_work(hclge_wq, &hdev->service_task, delay_time); }
static int hclge_get_mac_link_status(struct hclge_dev *hdev, int *link_status) @@ -3650,33 +3645,14 @@ static void hclge_get_misc_vector(struct hclge_dev *hdev) hdev->num_msi_used += 1; }
-static void hclge_irq_affinity_notify(struct irq_affinity_notify *notify, - const cpumask_t *mask) -{ - struct hclge_dev *hdev = container_of(notify, struct hclge_dev, - affinity_notify); - - cpumask_copy(&hdev->affinity_mask, mask); -} - -static void hclge_irq_affinity_release(struct kref *ref) -{ -} - static void hclge_misc_affinity_setup(struct hclge_dev *hdev) { irq_set_affinity_hint(hdev->misc_vector.vector_irq, &hdev->affinity_mask); - - hdev->affinity_notify.notify = hclge_irq_affinity_notify; - hdev->affinity_notify.release = hclge_irq_affinity_release; - irq_set_affinity_notifier(hdev->misc_vector.vector_irq, - &hdev->affinity_notify); }
static void hclge_misc_affinity_teardown(struct hclge_dev *hdev) { - irq_set_affinity_notifier(hdev->misc_vector.vector_irq, NULL); irq_set_affinity_hint(hdev->misc_vector.vector_irq, NULL); }
@@ -13233,7 +13209,7 @@ static int hclge_init(void) { pr_info("%s is initializing\n", HCLGE_NAME);
- hclge_wq = alloc_workqueue("%s", 0, 0, HCLGE_NAME); + hclge_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, HCLGE_NAME); if (!hclge_wq) { pr_err("%s: failed to create workqueue\n", HCLGE_NAME); return -ENOMEM; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 4f8403af84be..9e1eede599ec 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -974,7 +974,6 @@ struct hclge_dev {
/* affinity mask and notify for misc interrupt */ cpumask_t affinity_mask; - struct irq_affinity_notify affinity_notify; struct hclge_ptp *ptp; struct devlink *devlink; }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index bef6b98e2f50..5efa5420297d 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -3899,7 +3899,7 @@ static int hclgevf_init(void) { pr_info("%s is initializing\n", HCLGEVF_NAME);
- hclgevf_wq = alloc_workqueue("%s", 0, 0, HCLGEVF_NAME); + hclgevf_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, HCLGEVF_NAME); if (!hclgevf_wq) { pr_err("%s: failed to create workqueue\n", HCLGEVF_NAME); return -ENOMEM;
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-v5.15 commit 0251d196b0e1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, if a reset event triggered by RAS arrives while the device is still in the initialization process, the driver may run the reset process concurrently with initialization. This can cause problems. For example, the RSS indirection table may not have been allocated yet during initialization, but the reset process uses it, causing a call trace like this:
[61228.744836] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 ... [61228.897677] Workqueue: hclgevf hclgevf_service_task [hclgevf] [61228.911390] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) [61228.918670] pc : hclgevf_set_rss_indir_table+0xb4/0x190 [hclgevf] [61228.927812] lr : hclgevf_set_rss_indir_table+0x90/0x190 [hclgevf] [61228.937248] sp : ffff8000162ebb50 [61228.941087] x29: ffff8000162ebb50 x28: ffffb77add72dbc0 x27: ffff0820c7dc8080 [61228.949516] x26: 0000000000000000 x25: ffff0820ad4fc880 x24: ffff0820c7dc8080 [61228.958220] x23: ffff0820c7dc8090 x22: 00000000ffffffff x21: 0000000000000040 [61228.966360] x20: ffffb77add72b9c0 x19: 0000000000000000 x18: 0000000000000030 [61228.974646] x17: 0000000000000000 x16: ffffb77ae713feb0 x15: ffff0820ad4fcce8 [61228.982808] x14: ffffffffffffffff x13: ffff8000962eb7f7 x12: 00003834ec70c960 [61228.991990] x11: 00e0fafa8c206982 x10: 9670facc78a8f9a8 x9 : ffffb77add717530 [61229.001123] x8 : ffff0820ad4fd6b8 x7 : 0000000000000000 x6 : 0000000000000011 [61229.010249] x5 : 00000000000cb1b0 x4 : 0000000000002adb x3 : 0000000000000049 [61229.018662] x2 : ffff8000162ebbb8 x1 : 0000000000000000 x0 : 0000000000000480 [61229.027002] Call trace: [61229.030177] hclgevf_set_rss_indir_table+0xb4/0x190 [hclgevf] [61229.039009] hclgevf_rss_init_hw+0x128/0x1b4 [hclgevf] [61229.046809] hclgevf_reset_rebuild+0x17c/0x69c [hclgevf] [61229.053862] hclgevf_reset_service_task+0x4cc/0xa80 [hclgevf] [61229.061306] hclgevf_service_task+0x6c/0x630 [hclgevf] [61229.068491] process_one_work+0x1dc/0x48c [61229.074121] worker_thread+0x15c/0x464 [61229.078562] kthread+0x168/0x16c [61229.082873] ret_from_fork+0x10/0x18 [61229.088221] Code: 7900e7f6 f904a683 d503201f 9101a3e2 (38616b43) [61229.095357] ---[ end trace 153661a538f6768c ]---
To fix this problem, do not schedule the reset task before the initialization process is done.
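A minimal sketch of the guard (identifiers from the VF diff below): a state bit set only at the end of init now gates reset scheduling.

	/* set as the last step of hclgevf_init_hdev() */
	set_bit(HCLGEVF_STATE_SERVICE_INITED, &hdev->state);

	/* reset scheduling additionally requires that bit */
	if (!test_bit(HCLGEVF_STATE_REMOVING, &hdev->state) &&
	    test_bit(HCLGEVF_STATE_SERVICE_INITED, &hdev->state) &&
	    !test_and_set_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state))
		mod_delayed_work(hclgevf_wq, &hdev->service_task, 0);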
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 1 + drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +++ drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 1 + 3 files changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 1de0a1c0123f..590c1d635f5a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2936,6 +2936,7 @@ static void hclge_mbx_task_schedule(struct hclge_dev *hdev) static void hclge_reset_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && + test_bit(HCLGE_STATE_SERVICE_INITED, &hdev->state) && !test_and_set_bit(HCLGE_STATE_RST_SERVICE_SCHED, &hdev->state)) mod_delayed_work(hclge_wq, &hdev->service_task, 0); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 5efa5420297d..cf00ad7bb881 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -2232,6 +2232,7 @@ static void hclgevf_get_misc_vector(struct hclgevf_dev *hdev) void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev) { if (!test_bit(HCLGEVF_STATE_REMOVING, &hdev->state) && + test_bit(HCLGEVF_STATE_SERVICE_INITED, &hdev->state) && !test_and_set_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state)) mod_delayed_work(hclgevf_wq, &hdev->service_task, 0); @@ -3449,6 +3450,8 @@ static int hclgevf_init_hdev(struct hclgevf_dev *hdev)
hclgevf_init_rxd_adv_layout(hdev);
+ set_bit(HCLGEVF_STATE_SERVICE_INITED, &hdev->state); + hdev->last_reset_time = jiffies; dev_info(&hdev->pdev->dev, "finished initializing %s driver\n", HCLGEVF_DRIVER_NAME); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h index 883130a9b48f..28288d7e3303 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h @@ -146,6 +146,7 @@ enum hclgevf_states { HCLGEVF_STATE_REMOVING, HCLGEVF_STATE_NIC_REGISTERED, HCLGEVF_STATE_ROCE_REGISTERED, + HCLGEVF_STATE_SERVICE_INITED, /* task states */ HCLGEVF_STATE_RST_SERVICE_SCHED, HCLGEVF_STATE_RST_HANDLING,
From: Jie Wang wangjie125@huawei.com
mainline inclusion from mainline-v5.15 commit 2a21dab594a9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
The member data in struct hclge_desc is of type __le32 and needs endian conversion before use, but some debugfs functions didn't do that, so this patch fixes them.
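A minimal sketch of the fix pattern (field names from the diff below): byte-swap the little-endian word first instead of aliasing it through a struct pointer.

	/* before: reinterprets raw LE bytes; wrong on big-endian hosts */
	bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1];

	/* after: convert once, then read the bit fields */
	req.bitmap = (u8)le32_to_cpu(desc.data[1]);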
Fixes: c0ebebb9ccc1 ("net: hns3: Add "dcb register" status information query function") Signed-off-by: Jie Wang wangjie125@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_debugfs.c | 30 +++++++++---------- 1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c index f0aa4fbd2200..4e0a8c2f7c05 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c @@ -391,7 +391,7 @@ static int hclge_dbg_dump_mac(struct hclge_dev *hdev, char *buf, int len) static int hclge_dbg_dump_dcb_qset(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u16 qset_id, qset_num; int ret; @@ -408,12 +408,12 @@ static int hclge_dbg_dump_dcb_qset(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret;
- bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]);
*pos += scnprintf(buf + *pos, len - *pos, "%04u %#x %#x %#x %#x\n", - qset_id, bitmap->bit0, bitmap->bit1, - bitmap->bit2, bitmap->bit3); + qset_id, req.bit0, req.bit1, req.bit2, + req.bit3); }
return 0; @@ -422,7 +422,7 @@ static int hclge_dbg_dump_dcb_qset(struct hclge_dev *hdev, char *buf, int len, static int hclge_dbg_dump_dcb_pri(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u8 pri_id, pri_num; int ret; @@ -439,12 +439,11 @@ static int hclge_dbg_dump_dcb_pri(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret;
- bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]);
*pos += scnprintf(buf + *pos, len - *pos, "%03u %#x %#x %#x\n", - pri_id, bitmap->bit0, bitmap->bit1, - bitmap->bit2); + pri_id, req.bit0, req.bit1, req.bit2); }
return 0; @@ -453,7 +452,7 @@ static int hclge_dbg_dump_dcb_pri(struct hclge_dev *hdev, char *buf, int len, static int hclge_dbg_dump_dcb_pg(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u8 pg_id; int ret; @@ -466,12 +465,11 @@ static int hclge_dbg_dump_dcb_pg(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret;
- bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]);
*pos += scnprintf(buf + *pos, len - *pos, "%03u %#x %#x %#x\n", - pg_id, bitmap->bit0, bitmap->bit1, - bitmap->bit2); + pg_id, req.bit0, req.bit1, req.bit2); }
return 0; @@ -511,7 +509,7 @@ static int hclge_dbg_dump_dcb_queue(struct hclge_dev *hdev, char *buf, int len, static int hclge_dbg_dump_dcb_port(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u8 port_id = 0; int ret; @@ -521,12 +519,12 @@ static int hclge_dbg_dump_dcb_port(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret;
- bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]);
*pos += scnprintf(buf + *pos, len - *pos, "port_mask: %#x\n", - bitmap->bit0); + req.bit0); *pos += scnprintf(buf + *pos, len - *pos, "port_shaping_pass: %#x\n", - bitmap->bit1); + req.bit1);
return 0; }
From: Jie Wang wangjie125@huawei.com
mainline inclusion from mainline-v5.15 commit 6754614a787c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
As the packet-number registers are 32 bits wide, printing them in decimal needs up to 10 characters (U32_MAX is 4294967295), but the current string space is not enough, so this patch fixes it.
Fixes: e44c495d95e ("net: hns3: refactor queue info of debugfs") Signed-off-by: Jie Wang wangjie125@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 1a1bebd453d3..d4eb122fda7a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -584,7 +584,7 @@ static const struct hns3_dbg_item rx_queue_info_items[] = { { "TAIL", 2 }, { "HEAD", 2 }, { "FBDNUM", 2 }, - { "PKTNUM", 2 }, + { "PKTNUM", 5 }, { "COPYBREAK", 2 }, { "RING_EN", 2 }, { "RX_RING_EN", 2 }, @@ -687,7 +687,7 @@ static const struct hns3_dbg_item tx_queue_info_items[] = { { "HEAD", 2 }, { "FBDNUM", 2 }, { "OFFSET", 2 }, - { "PKTNUM", 2 }, + { "PKTNUM", 5 }, { "RING_EN", 2 }, { "TX_RING_EN", 2 }, { "BASE_ADDR", 10 },
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-v5.15 commit c7a6e3978ea9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
The buffer lengths specified for the three debugfs files fd_tcam, uc and tqp are not enough for their maximum output, so this patch enlarges them.
Fixes: b5a0b70d77b9 ("net: hns3: refactor dump fd tcam of debugfs") Fixes: 1556ea9120ff ("net: hns3: refactor dump mac list of debugfs") Fixes: d96b0e59468d ("net: hns3: refactor dump reg of debugfs") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index d4eb122fda7a..c2565c4d20b2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -137,7 +137,7 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .name = "uc", .cmd = HNAE3_DBG_CMD_MAC_UC, .dentry = HNS3_DBG_DENTRY_MAC, - .buf_len = HNS3_DBG_READ_LEN, + .buf_len = HNS3_DBG_READ_LEN_128KB, .init = hns3_dbg_common_file_init, }, { @@ -256,7 +256,7 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .name = "tqp", .cmd = HNAE3_DBG_CMD_REG_TQP, .dentry = HNS3_DBG_DENTRY_REG, - .buf_len = HNS3_DBG_READ_LEN, + .buf_len = HNS3_DBG_READ_LEN_128KB, .init = hns3_dbg_common_file_init, }, { @@ -298,7 +298,7 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .name = "fd_tcam", .cmd = HNAE3_DBG_CMD_FD_TCAM, .dentry = HNS3_DBG_DENTRY_FD, - .buf_len = HNS3_DBG_READ_LEN, + .buf_len = HNS3_DBG_READ_LEN_1MB, .init = hns3_dbg_common_file_init, }, {
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-v5.15 commit 630a6738da82 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
This patch adjusts the string spacing of some tx bd info fields in debugfs according to their maximum widths.
Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index c2565c4d20b2..67364ab63a1f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -912,13 +912,13 @@ static int hns3_dbg_rx_bd_info(struct hns3_dbg_data *d, char *buf, int len) }
static const struct hns3_dbg_item tx_bd_info_items[] = { - { "BD_IDX", 5 }, - { "ADDRESS", 2 }, + { "BD_IDX", 2 }, + { "ADDRESS", 13 }, { "VLAN_TAG", 2 }, { "SIZE", 2 }, { "T_CS_VLAN_TSO", 2 }, { "OT_VLAN_TAG", 3 }, - { "TV", 2 }, + { "TV", 5 }, { "OLT_VLAN_LEN", 2 }, { "PAYLEN_OL4CS", 2 }, { "BD_FE_SC_VLD", 2 },
From: Randy Dunlap rdunlap@infradead.org
mainline inclusion from mainline-master commit 85879f131d78 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Fix kernel-doc warnings and spacing in hns3_ethtool.c:
hns3_ethtool.c:246: warning: No description found for return value of 'hns3_lp_run_test' hns3_ethtool.c:408: warning: expecting prototype for hns3_nic_self_test(). Prototype was for hns3_self_test() instead
Signed-off-by: Randy Dunlap rdunlap@infradead.org Reported-by: kernel test robot lkp@intel.com Cc: Peng Li lipeng321@huawei.com Cc: Guangbin Huang huangguangbin2@huawei.com Cc: Yisen Zhuang yisen.zhuang@huawei.com Cc: Salil Mehta salil.mehta@huawei.com Cc: "David S. Miller" davem@davemloft.net Cc: Jakub Kicinski kuba@kernel.org Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com (fix conflicts: remove hns3_do_selftest's fix due to a previous fix: 65b017dc58d6 "net: hns3: fix prototype warning") Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 23b512174531..c8442b86df94 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -238,9 +238,11 @@ static void hns3_lb_clear_tx_ring(struct hns3_nic_priv *priv, u32 start_ringid, }
/** - * hns3_lp_run_test - run loopback test + * hns3_lp_run_test - run loopback test * @ndev: net device * @mode: loopback type + * + * Return: %0 for success or a NIC loopback test error code on failure */ static int hns3_lp_run_test(struct net_device *ndev, enum hnae3_loop mode) {
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 3b4c6566c158 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, when the driver fails to add a new multicast MAC address to hardware because the multicast MAC table is full, it returns directly. In this case, if the multicast MAC list contains reusable addresses after the new address, those reusable addresses will never be added to hardware.
To fix this problem, when hclge_add_mc_addr_common() returns -ENOSPC, hclge_sync_vport_mac_list() should judge whether to continue or stop adding the next address.
As hclge_sync_vport_mac_list() needs a mac_type parameter to know whether it is handling unicast or multicast addresses, refine this function to take mac_type and drop the sync callback parameter. Do the same for hclge_unsync_vport_mac_list().
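A minimal sketch of the refactored selection (mirroring the diff below): the callback is derived from mac_type inside the helper rather than passed in by every caller.

	int (*sync)(struct hclge_vport *vport, const unsigned char *addr);

	sync = (mac_type == HCLGE_MAC_ADDR_UC) ? hclge_add_uc_addr_common :
						 hclge_add_mc_addr_common;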
Fixes: ee4bcd3b7ae4 ("net: hns3: refactor the MAC address configure") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_main.c | 50 ++++++++++--------- 1 file changed, 27 insertions(+), 23 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 590c1d635f5a..afa61439e1df 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -8949,8 +8949,11 @@ int hclge_add_mc_addr_common(struct hclge_vport *vport,
err_no_space: /* if already overflow, not to print each time */ - if (!(vport->overflow_promisc_flags & HNAE3_OVERFLOW_MPE)) + if (!(vport->overflow_promisc_flags & HNAE3_OVERFLOW_MPE)) { + vport->overflow_promisc_flags |= HNAE3_OVERFLOW_MPE; dev_err(&hdev->pdev->dev, "mc mac vlan table is full\n"); + } + return -ENOSPC; }
@@ -9006,12 +9009,17 @@ int hclge_rm_mc_addr_common(struct hclge_vport *vport,
static void hclge_sync_vport_mac_list(struct hclge_vport *vport, struct list_head *list, - int (*sync)(struct hclge_vport *, - const unsigned char *)) + enum HCLGE_MAC_ADDR_TYPE mac_type) { + int (*sync)(struct hclge_vport *vport, const unsigned char *addr); struct hclge_mac_node *mac_node, *tmp; int ret;
+ if (mac_type == HCLGE_MAC_ADDR_UC) + sync = hclge_add_uc_addr_common; + else + sync = hclge_add_mc_addr_common; + list_for_each_entry_safe(mac_node, tmp, list, node) { ret = sync(vport, mac_node->mac_addr); if (!ret) { @@ -9023,8 +9031,13 @@ static void hclge_sync_vport_mac_list(struct hclge_vport *vport, /* If one unicast mac address is existing in hardware, * we need to try whether other unicast mac addresses * are new addresses that can be added. + * Multicast mac address can be reusable, even though + * there is no space to add new multicast mac address, + * we should check whether other mac addresses are + * existing in hardware for reuse. */ - if (ret != -EEXIST) + if ((mac_type == HCLGE_MAC_ADDR_UC && ret != -EEXIST) || + (mac_type == HCLGE_MAC_ADDR_MC && ret != -ENOSPC)) break; } } @@ -9032,12 +9045,17 @@ static void hclge_sync_vport_mac_list(struct hclge_vport *vport,
static void hclge_unsync_vport_mac_list(struct hclge_vport *vport, struct list_head *list, - int (*unsync)(struct hclge_vport *, - const unsigned char *)) + enum HCLGE_MAC_ADDR_TYPE mac_type) { + int (*unsync)(struct hclge_vport *vport, const unsigned char *addr); struct hclge_mac_node *mac_node, *tmp; int ret;
+ if (mac_type == HCLGE_MAC_ADDR_UC) + unsync = hclge_rm_uc_addr_common; + else + unsync = hclge_rm_mc_addr_common; + list_for_each_entry_safe(mac_node, tmp, list, node) { ret = unsync(vport, mac_node->mac_addr); if (!ret || ret == -ENOENT) { @@ -9168,17 +9186,8 @@ static void hclge_sync_vport_mac_table(struct hclge_vport *vport, spin_unlock_bh(&vport->mac_list_lock);
/* delete first, in order to get max mac table space for adding */ - if (mac_type == HCLGE_MAC_ADDR_UC) { - hclge_unsync_vport_mac_list(vport, &tmp_del_list, - hclge_rm_uc_addr_common); - hclge_sync_vport_mac_list(vport, &tmp_add_list, - hclge_add_uc_addr_common); - } else { - hclge_unsync_vport_mac_list(vport, &tmp_del_list, - hclge_rm_mc_addr_common); - hclge_sync_vport_mac_list(vport, &tmp_add_list, - hclge_add_mc_addr_common); - } + hclge_unsync_vport_mac_list(vport, &tmp_del_list, mac_type); + hclge_sync_vport_mac_list(vport, &tmp_add_list, mac_type);
/* if some mac addresses were added/deleted fail, move back to the * mac_list, and retry at next time. @@ -9337,12 +9346,7 @@ static void hclge_uninit_vport_mac_list(struct hclge_vport *vport,
spin_unlock_bh(&vport->mac_list_lock);
- if (mac_type == HCLGE_MAC_ADDR_UC) - hclge_unsync_vport_mac_list(vport, &tmp_del_list, - hclge_rm_uc_addr_common); - else - hclge_unsync_vport_mac_list(vport, &tmp_del_list, - hclge_rm_mc_addr_common); + hclge_unsync_vport_mac_list(vport, &tmp_del_list, mac_type);
if (!list_empty(&tmp_del_list)) dev_warn(&hdev->pdev->dev,
From: Jie Wang wangjie125@huawei.com
mainline inclusion from mainline-master commit beb27ca451a5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the NIC driver initializes the ROCE base vector with an absolute MSI-X interrupt number. But ROCE uses pci_irq_vector() to get its interrupt, and that helper adds the relative vector offset again, so it ends up with the wrong interrupt vector.
So fix it by handing the relative vector offset to ROCE instead of the absolute MSI-X interrupt number, and delete the now unused base_msi_vector and roce_base_vector members from hclge_dev and hclgevf_dev.
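For context, a minimal sketch of the double offset (illustrative only; "idx" stands for a hypothetical per-vector index the client iterates over):

	/* pci_irq_vector() translates a vector index relative to the
	 * device into the absolute Linux IRQ number; if base_vector
	 * already holds an absolute MSI-X number, the offset is applied
	 * twice and the wrong IRQ is returned */
	int irq = pci_irq_vector(hdev->pdev, roce->rinfo.base_vector + idx);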
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Jie Wang wangjie125@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 6 +----- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 2 -- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 5 +---- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 2 -- 4 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index afa61439e1df..273d9087db88 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2581,7 +2581,7 @@ static int hclge_init_roce_base_info(struct hclge_vport *vport) if (hdev->num_msi < hdev->num_nic_msi + hdev->num_roce_msi) return -EINVAL;
- roce->rinfo.base_vector = hdev->roce_base_vector; + roce->rinfo.base_vector = hdev->num_nic_msi;
roce->rinfo.netdev = nic->kinfo.netdev; roce->rinfo.roce_io_base = hdev->hw.io_base; @@ -2617,10 +2617,6 @@ static int hclge_init_msi(struct hclge_dev *hdev) hdev->num_msi = vectors; hdev->num_msi_left = vectors;
- hdev->base_msi_vector = pdev->irq; - hdev->roce_base_vector = hdev->base_msi_vector + - hdev->num_nic_msi; - hdev->vector_status = devm_kcalloc(&pdev->dev, hdev->num_msi, sizeof(u16), GFP_KERNEL); if (!hdev->vector_status) { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 9e1eede599ec..21013776de55 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -904,12 +904,10 @@ struct hclge_dev { u16 num_msi; u16 num_msi_left; u16 num_msi_used; - u32 base_msi_vector; u16 *vector_status; int *vector_irq; u16 num_nic_msi; /* Num of nic vectors for this PF */ u16 num_roce_msi; /* Num of roce vectors for this PF */ - int roce_base_vector;
unsigned long service_timer_period; unsigned long service_timer_previous; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index cf00ad7bb881..4cf34f169354 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -2557,7 +2557,7 @@ static int hclgevf_init_roce_base_info(struct hclgevf_dev *hdev) hdev->num_msi_left == 0) return -EINVAL;
- roce->rinfo.base_vector = hdev->roce_base_vector; + roce->rinfo.base_vector = hdev->roce_base_msix_offset;
roce->rinfo.netdev = nic->kinfo.netdev; roce->rinfo.roce_io_base = hdev->hw.io_base; @@ -2823,9 +2823,6 @@ static int hclgevf_init_msi(struct hclgevf_dev *hdev) hdev->num_msi = vectors; hdev->num_msi_left = vectors;
- hdev->base_msi_vector = pdev->irq; - hdev->roce_base_vector = pdev->irq + hdev->roce_base_msix_offset; - hdev->vector_status = devm_kcalloc(&pdev->dev, hdev->num_msi, sizeof(u16), GFP_KERNEL); if (!hdev->vector_status) { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h index 28288d7e3303..4bd922b47501 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h @@ -308,8 +308,6 @@ struct hclgevf_dev { u16 num_nic_msix; /* Num of nic vectors for this VF */ u16 num_roce_msix; /* Num of roce vectors for this VF */ u16 roce_base_msix_offset; - int roce_base_vector; - u32 base_msi_vector; u16 *vector_status; int *vector_irq;
From: Jie Wang wangjie125@huawei.com
mainline inclusion from mainline-master commit 0b653a81a26d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the driver sends a command to the firmware to query the PFC packet number whenever the user uses the dcb tool to get the PFC parameters. However, the periodic service task also periodically queries and records the MAC statistics, which include the PFC packet number.
As the hardware statistics registers are cleared on read, the PFC packet numbers in the MAC statistics become incorrect once the dcb tool has been used to get the PFC parameters.
To fix this problem, when the user uses the dcb tool to get the PFC parameters, the driver first updates the MAC statistics and then gets the PFC packet numbers from them.
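As a sketch of the read path this enables (the macros are the ones moved into hclge_main.h by this patch; the local variable is illustrative):

	/* the hardware counters are clear-on-read, so only
	 * hclge_mac_update_stats() touches the registers, accumulating
	 * into hdev->mac_stats; every other consumer reads that
	 * software copy */
	u64 tx_pri0 = HCLGE_STATS_READ(&hdev->mac_stats,
			HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri0_pkt_num));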
Fixes: 64fd2300fcc1 ("net: hns3: add support for querying pfc puase packets statistic") Signed-off-by: Jie Wang wangjie125@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_dcb.c | 18 ++--- .../hisilicon/hns3/hns3pf/hclge_main.c | 4 +- .../hisilicon/hns3/hns3pf/hclge_main.h | 4 ++ .../ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 68 +++++++++---------- .../ethernet/hisilicon/hns3/hns3pf/hclge_tm.h | 4 +- 5 files changed, 48 insertions(+), 50 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index 91cb578f56b8..90013c131e94 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -286,28 +286,24 @@ static int hclge_ieee_setets(struct hnae3_handle *h, struct ieee_ets *ets)
static int hclge_ieee_getpfc(struct hnae3_handle *h, struct ieee_pfc *pfc) { - u64 requests[HNAE3_MAX_TC], indications[HNAE3_MAX_TC]; struct hclge_vport *vport = hclge_get_vport(h); struct hclge_dev *hdev = vport->back; int ret; - u8 i;
memset(pfc, 0, sizeof(*pfc)); pfc->pfc_cap = hdev->pfc_max; pfc->pfc_en = hdev->tm_info.pfc_en;
- ret = hclge_pfc_tx_stats_get(hdev, requests); - if (ret) + ret = hclge_mac_update_stats(hdev); + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to update MAC stats, ret = %d.\n", ret); return ret; + }
- ret = hclge_pfc_rx_stats_get(hdev, indications); - if (ret) - return ret; + hclge_pfc_tx_stats_get(hdev, pfc->requests); + hclge_pfc_rx_stats_get(hdev, pfc->indications);
- for (i = 0; i < HCLGE_MAX_TC_NUM; i++) { - pfc->requests[i] = requests[i]; - pfc->indications[i] = indications[i]; - } return 0; }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 273d9087db88..959b39aeedc9 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -26,8 +26,6 @@ #include "hclge_devlink.h"
#define HCLGE_NAME "hclge" -#define HCLGE_STATS_READ(p, offset) (*(u64 *)((u8 *)(p) + (offset))) -#define HCLGE_MAC_STATS_FIELD_OFF(f) (offsetof(struct hclge_mac_stats, f))
#define HCLGE_BUF_SIZE_UNIT 256U #define HCLGE_BUF_MUL_BY 2 @@ -587,7 +585,7 @@ static int hclge_mac_query_reg_num(struct hclge_dev *hdev, u32 *reg_num) return 0; }
-static int hclge_mac_update_stats(struct hclge_dev *hdev) +int hclge_mac_update_stats(struct hclge_dev *hdev) { /* The firmware supports the new statistics acquisition method */ if (hdev->ae_dev->dev_specs.mac_stats_num) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 21013776de55..3c95c957d1e3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -852,6 +852,9 @@ struct hclge_vf_vlan_cfg { (y) = (_k_ ^ ~_v_) & (_k_); \ } while (0)
+#define HCLGE_MAC_STATS_FIELD_OFF(f) (offsetof(struct hclge_mac_stats, f)) +#define HCLGE_STATS_READ(p, offset) (*(u64 *)((u8 *)(p) + (offset))) + #define HCLGE_MAC_TNL_LOG_SIZE 8 #define HCLGE_VPORT_NUM 256 struct hclge_dev { @@ -1166,4 +1169,5 @@ void hclge_inform_vf_promisc_info(struct hclge_vport *vport); int hclge_dbg_dump_rst_info(struct hclge_dev *hdev, char *buf, int len); int hclge_push_vf_link_status(struct hclge_vport *vport); int hclge_enable_vport_vlan_filter(struct hclge_vport *vport, bool request_en); +int hclge_mac_update_stats(struct hclge_dev *hdev); #endif diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index 95074e91a846..a50e2edbf4a0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -113,50 +113,50 @@ static int hclge_shaper_para_calc(u32 ir, u8 shaper_level, return 0; }
-static int hclge_pfc_stats_get(struct hclge_dev *hdev, - enum hclge_opcode_type opcode, u64 *stats) -{ - struct hclge_desc desc[HCLGE_TM_PFC_PKT_GET_CMD_NUM]; - int ret, i, j; - - if (!(opcode == HCLGE_OPC_QUERY_PFC_RX_PKT_CNT || - opcode == HCLGE_OPC_QUERY_PFC_TX_PKT_CNT)) - return -EINVAL; - - for (i = 0; i < HCLGE_TM_PFC_PKT_GET_CMD_NUM - 1; i++) { - hclge_cmd_setup_basic_desc(&desc[i], opcode, true); - desc[i].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT); - } - - hclge_cmd_setup_basic_desc(&desc[i], opcode, true); +static const u16 hclge_pfc_tx_stats_offset[] = { + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri0_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri1_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri2_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri3_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri4_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri5_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri6_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_tx_pfc_pri7_pkt_num) +};
- ret = hclge_cmd_send(&hdev->hw, desc, HCLGE_TM_PFC_PKT_GET_CMD_NUM); - if (ret) - return ret; +static const u16 hclge_pfc_rx_stats_offset[] = { + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri0_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri1_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri2_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri3_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri4_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri5_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri6_pkt_num), + HCLGE_MAC_STATS_FIELD_OFF(mac_rx_pfc_pri7_pkt_num) +};
- for (i = 0; i < HCLGE_TM_PFC_PKT_GET_CMD_NUM; i++) { - struct hclge_pfc_stats_cmd *pfc_stats = - (struct hclge_pfc_stats_cmd *)desc[i].data; +static void hclge_pfc_stats_get(struct hclge_dev *hdev, bool tx, u64 *stats) +{ + const u16 *offset; + int i;
- for (j = 0; j < HCLGE_TM_PFC_NUM_GET_PER_CMD; j++) { - u32 index = i * HCLGE_TM_PFC_PKT_GET_CMD_NUM + j; + if (tx) + offset = hclge_pfc_tx_stats_offset; + else + offset = hclge_pfc_rx_stats_offset;
- if (index < HCLGE_MAX_TC_NUM) - stats[index] = - le64_to_cpu(pfc_stats->pkt_num[j]); - } - } - return 0; + for (i = 0; i < HCLGE_MAX_TC_NUM; i++) + stats[i] = HCLGE_STATS_READ(&hdev->mac_stats, offset[i]); }
-int hclge_pfc_rx_stats_get(struct hclge_dev *hdev, u64 *stats) +void hclge_pfc_rx_stats_get(struct hclge_dev *hdev, u64 *stats) { - return hclge_pfc_stats_get(hdev, HCLGE_OPC_QUERY_PFC_RX_PKT_CNT, stats); + hclge_pfc_stats_get(hdev, false, stats); }
-int hclge_pfc_tx_stats_get(struct hclge_dev *hdev, u64 *stats) +void hclge_pfc_tx_stats_get(struct hclge_dev *hdev, u64 *stats) { - return hclge_pfc_stats_get(hdev, HCLGE_OPC_QUERY_PFC_TX_PKT_CNT, stats); + hclge_pfc_stats_get(hdev, true, stats); }
int hclge_mac_pause_en_cfg(struct hclge_dev *hdev, bool tx, bool rx) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h index 2ee9b795f71d..1db7f40b4525 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h @@ -228,8 +228,8 @@ int hclge_tm_dwrr_cfg(struct hclge_dev *hdev); int hclge_tm_init_hw(struct hclge_dev *hdev, bool init); int hclge_mac_pause_en_cfg(struct hclge_dev *hdev, bool tx, bool rx); int hclge_pause_addr_cfg(struct hclge_dev *hdev, const u8 *mac_addr); -int hclge_pfc_rx_stats_get(struct hclge_dev *hdev, u64 *stats); -int hclge_pfc_tx_stats_get(struct hclge_dev *hdev, u64 *stats); +void hclge_pfc_rx_stats_get(struct hclge_dev *hdev, u64 *stats); +void hclge_pfc_tx_stats_get(struct hclge_dev *hdev, u64 *stats); int hclge_tm_qs_shaper_cfg(struct hclge_vport *vport, int max_tx_rate); int hclge_tm_get_qset_num(struct hclge_dev *hdev, u16 *qset_num); int hclge_tm_get_pri_num(struct hclge_dev *hdev, u8 *pri_num);
From: Yufeng Mo moyufeng@huawei.com
mainline inclusion from mainline-master commit 3b6db4a0492b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
When the driver processes rx packets, the head pointer is updated only after the number of received packets reaches 16. However, hardware relies on the head pointer to calculate the number of FBDs, so the hardware calculates the FBD number incorrectly. Therefore, the driver proactively updates the head pointer in each common poll to ensure that the number of FBDs calculated by the hardware is correct.
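For context, a minimal sketch of the batching that leaves the head pointer stale (the threshold constant and refill flow are simplified from the driver; treat them as illustrative):

	/* inside the rx poll loop: the refill, and with it the head
	 * pointer write-back, only fires once 16 BDs have accumulated,
	 * so up to 15 consumed BDs stay invisible to hardware when the
	 * poll exits early */
	if (unused_count >= 16) {
		failure = failure ||
			  hns3_nic_alloc_rx_buffers(ring, unused_count);
		unused_count = 0;
	}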
Fixes: 68752b24f51a ("net: hns3: schedule the polling again when allocation fails") Signed-off-by: Yufeng Mo moyufeng@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../net/ethernet/hisilicon/hns3/hns3_enet.c | 7 ++++ .../hisilicon/hns3/hns3pf/hclge_cmd.c | 1 + .../hisilicon/hns3/hns3pf/hclge_cmd.h | 1 + .../hisilicon/hns3/hns3vf/hclgevf_cmd.c | 32 +++++++++++++++++++ .../hisilicon/hns3/hns3vf/hclgevf_cmd.h | 9 ++++++ 5 files changed, 50 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index b2a8abfc4eec..68082f8d6ae8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -4210,6 +4210,13 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, }
out: + /* sync head pointer before exiting, since hardware will calculate + * FBD number with head pointer + */ + if (unused_count > 0) + failure = failure || + hns3_nic_alloc_rx_buffers(ring, unused_count); + return failure ? budget : recv_pkts; }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c index c327df9dbac4..c5d5466810bb 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c @@ -483,6 +483,7 @@ static int hclge_firmware_compat_config(struct hclge_dev *hdev, bool en) if (hnae3_dev_phy_imp_supported(hdev)) hnae3_set_bit(compat, HCLGE_PHY_IMP_EN_B, 1); hnae3_set_bit(compat, HCLGE_MAC_STATS_EXT_EN_B, 1); + hnae3_set_bit(compat, HCLGE_SYNC_RX_RING_HEAD_EN_B, 1);
req->compat = cpu_to_le32(compat); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index c38b57fc6c6a..d24e59028798 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -1151,6 +1151,7 @@ struct hclge_query_ppu_pf_other_int_dfx_cmd { #define HCLGE_NCSI_ERROR_REPORT_EN_B 1 #define HCLGE_PHY_IMP_EN_B 2 #define HCLGE_MAC_STATS_EXT_EN_B 3 +#define HCLGE_SYNC_RX_RING_HEAD_EN_B 4 struct hclge_firmware_compat_cmd { __le32 compat; u8 rsv[20]; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c index f89bfb352adf..e605c2c5bcce 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c @@ -434,8 +434,28 @@ int hclgevf_cmd_queue_init(struct hclgevf_dev *hdev) return ret; }
+static int hclgevf_firmware_compat_config(struct hclgevf_dev *hdev, bool en) +{ + struct hclgevf_firmware_compat_cmd *req; + struct hclgevf_desc desc; + u32 compat = 0; + + hclgevf_cmd_setup_basic_desc(&desc, HCLGEVF_OPC_IMP_COMPAT_CFG, false); + + if (en) { + req = (struct hclgevf_firmware_compat_cmd *)desc.data; + + hnae3_set_bit(compat, HCLGEVF_SYNC_RX_RING_HEAD_EN_B, 1); + + req->compat = cpu_to_le32(compat); + } + + return hclgevf_cmd_send(&hdev->hw, &desc, 1); +} + int hclgevf_cmd_init(struct hclgevf_dev *hdev) { + struct hnae3_ae_dev *ae_dev = pci_get_drvdata(hdev->pdev); int ret;
spin_lock_bh(&hdev->hw.cmq.csq.lock); @@ -484,6 +504,17 @@ int hclgevf_cmd_init(struct hclgevf_dev *hdev) hnae3_get_field(hdev->fw_version, HNAE3_FW_VERSION_BYTE0_MASK, HNAE3_FW_VERSION_BYTE0_SHIFT));
+ if (ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V3) { + /* ask the firmware to enable some features, driver can work + * without it. + */ + ret = hclgevf_firmware_compat_config(hdev, true); + if (ret) + dev_warn(&hdev->pdev->dev, + "Firmware compatible features not enabled(%d).\n", + ret); + } + return 0;
err_cmd_init: @@ -508,6 +539,7 @@ static void hclgevf_cmd_uninit_regs(struct hclgevf_hw *hw)
void hclgevf_cmd_uninit(struct hclgevf_dev *hdev) { + hclgevf_firmware_compat_config(hdev, false); set_bit(HCLGEVF_STATE_CMD_DISABLE, &hdev->state); /* wait to ensure that the firmware completes the possible left * over commands. diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h index 39d0b589c720..edc9e154061a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h @@ -15,6 +15,12 @@ struct hclgevf_hw; struct hclgevf_dev;
+#define HCLGEVF_SYNC_RX_RING_HEAD_EN_B 4 +struct hclgevf_firmware_compat_cmd { + __le32 compat; + u8 rsv[20]; +}; + struct hclgevf_desc { __le16 opcode; __le16 flag; @@ -107,6 +113,9 @@ enum hclgevf_opcode_type { HCLGEVF_OPC_RSS_TC_MODE = 0x0D08, /* Mailbox cmd */ HCLGEVF_OPC_MBX_VF_TO_PF = 0x2001, + + /* IMP stats command */ + HCLGEVF_OPC_IMP_COMPAT_CFG = 0x701A, };
#define HCLGEVF_TQP_REG_OFFSET 0x80000
From: Yufeng Mo moyufeng@huawei.com
mainline inclusion from mainline-master commit e140c7983e30 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Fully configuring VLANs for a VF and then unloading the VF while a PF reset is being triggered causes a kernel crash, because the IRQs have already been uninitialized.
[ 293.177579] ------------[ cut here ]------------ [ 293.183502] kernel BUG at drivers/pci/msi.c:352! [ 293.189547] Internal error: Oops - BUG: 0 [#1] SMP ...... [ 293.390124] Workqueue: hclgevf hclgevf_service_task [hclgevf] [ 293.402627] pstate: 80c00009 (Nzcv daif +PAN +UAO) [ 293.414324] pc : free_msi_irqs+0x19c/0x1b8 [ 293.425429] lr : free_msi_irqs+0x18c/0x1b8 [ 293.436545] sp : ffff00002716fbb0 [ 293.446950] x29: ffff00002716fbb0 x28: 0000000000000000 [ 293.459519] x27: 0000000000000000 x26: ffff45b91ea16b00 [ 293.472183] x25: 0000000000000000 x24: ffffa587b08f4700 [ 293.484717] x23: ffffc591ac30e000 x22: ffffa587b08f8428 [ 293.497190] x21: ffffc591ac30e300 x20: 0000000000000000 [ 293.509594] x19: ffffa58a062a8300 x18: 0000000000000000 [ 293.521949] x17: 0000000000000000 x16: ffff45b91dcc3f48 [ 293.534013] x15: 0000000000000000 x14: 0000000000000000 [ 293.545883] x13: 0000000000000040 x12: 0000000000000228 [ 293.557508] x11: 0000000000000020 x10: 0000000000000040 [ 293.568889] x9 : ffff45b91ea1e190 x8 : ffffc591802d0000 [ 293.580123] x7 : ffffc591802d0148 x6 : 0000000000000120 [ 293.591190] x5 : ffffc591802d0000 x4 : 0000000000000000 [ 293.602015] x3 : 0000000000000000 x2 : 0000000000000000 [ 293.612624] x1 : 00000000000004a4 x0 : ffffa58a1e0c6b80 [ 293.623028] Call trace: [ 293.630340] free_msi_irqs+0x19c/0x1b8 [ 293.638849] pci_disable_msix+0x118/0x140 [ 293.647452] pci_free_irq_vectors+0x20/0x38 [ 293.656081] hclgevf_uninit_msi+0x44/0x58 [hclgevf] [ 293.665309] hclgevf_reset_rebuild+0x1ac/0x2e0 [hclgevf] [ 293.674866] hclgevf_reset+0x358/0x400 [hclgevf] [ 293.683545] hclgevf_reset_service_task+0xd0/0x1b0 [hclgevf] [ 293.693325] hclgevf_service_task+0x4c/0x2e8 [hclgevf] [ 293.702307] process_one_work+0x1b0/0x448 [ 293.710034] worker_thread+0x54/0x468 [ 293.717331] kthread+0x134/0x138 [ 293.724114] ret_from_fork+0x10/0x18 [ 293.731324] Code: f940b000 b4ffff00 a903e7b8 f90017b6 (d4210000)
This patch fixes the problem by waiting for the VF reset to finish before unloading the VF.
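The guard added to both uninit paths is a plain poll; HCLGEVF_WAIT_RESET_DONE, defined below as 100, is the msleep() interval in milliseconds:

	/* do not tear down the client (and later the MSI-X vectors)
	 * while the reset task is still running */
	while (test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state))
		msleep(HCLGEVF_WAIT_RESET_DONE);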
Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support") Signed-off-by: Yufeng Mo moyufeng@huawei.com Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 5 +++++ drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 2 ++ 2 files changed, 7 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 4cf34f169354..3b8bde58613a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -3010,7 +3010,10 @@ static void hclgevf_uninit_client_instance(struct hnae3_client *client,
/* un-init roce, if it exists */ if (hdev->roce_client) { + while (test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state)) + msleep(HCLGEVF_WAIT_RESET_DONE); clear_bit(HCLGEVF_STATE_ROCE_REGISTERED, &hdev->state); + hdev->roce_client->ops->uninit_instance(&hdev->roce, 0); hdev->roce_client = NULL; hdev->roce.client = NULL; @@ -3019,6 +3022,8 @@ static void hclgevf_uninit_client_instance(struct hnae3_client *client, /* un-init nic/unic, if this was not called by roce client */ if (client->ops->uninit_instance && hdev->nic_client && client->type != HNAE3_CLIENT_ROCE) { + while (test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state)) + msleep(HCLGEVF_WAIT_RESET_DONE); clear_bit(HCLGEVF_STATE_NIC_REGISTERED, &hdev->state);
client->ops->uninit_instance(&hdev->nic, 0); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h index 4bd922b47501..f6f736c0091c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h @@ -109,6 +109,8 @@ #define HCLGEVF_VF_RST_ING 0x07008 #define HCLGEVF_VF_RST_ING_BIT BIT(16)
+#define HCLGEVF_WAIT_RESET_DONE 100 + #define HCLGEVF_RSS_IND_TBL_SIZE 512 #define HCLGEVF_RSS_SET_BITMAP_MSK 0xffff #define HCLGEVF_RSS_KEY_SIZE 40
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 1122eac19476 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
When the driver queries the register number of the MAC statistics from the firmware, old firmware running on device version V2 returns only the number of valid registers, excluding the three reserved registers among them. This causes the driver to miss the last three counters when querying the MAC statistics.
To fix this problem, the driver no longer queries the register number on device version V2 and instead uses a fixed value that includes the three reserved registers.
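A rough sketch of why the undercount drops data (the copy loop and variable names here are illustrative, not the driver's exact code): the queried register number bounds how many u64 counters are copied out of the firmware response.

	/* with reg_num reported as 84 instead of 87, the last three
	 * valid counters are never copied out */
	for (i = 0; i < reg_num; i++)
		data[i] = le64_to_cpu(desc_data[i]);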
Fixes: c8af2887c941 ("net: hns3: add support pause/pfc durations for mac statistics") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 10 ++++++++++ .../net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 2 +- 2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 959b39aeedc9..588b7e75c897 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -566,6 +566,16 @@ static int hclge_mac_query_reg_num(struct hclge_dev *hdev, u32 *reg_num) struct hclge_desc desc; int ret;
+ /* Driver needs total register number of both valid registers and + * reserved registers, but the old firmware only returns number + * of valid registers in device V2. To be compatible with these + * devices, driver uses a fixed value. + */ + if (hdev->ae_dev->dev_version == HNAE3_DEVICE_VERSION_V2) { + *reg_num = HCLGE_MAC_STATS_MAX_NUM_V1; + return 0; + } + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_QUERY_MAC_REG_NUM, true); ret = hclge_cmd_send(&hdev->hw, &desc, 1); if (ret) { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 3c95c957d1e3..ebba603483a0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -404,7 +404,7 @@ struct hclge_tm_info { };
/* max number of mac statistics on each version */ -#define HCLGE_MAC_STATS_MAX_NUM_V1 84 +#define HCLGE_MAC_STATS_MAX_NUM_V1 87 #define HCLGE_MAC_STATS_MAX_NUM_V2 105
struct hclge_comm_stats_str {
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 91fcc79bff40 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
If users set unicast MAC addresses for VFs through the PF, they need to guarantee that every VF's address is different. This patch removes the check for an existing VF MAC address, so users can refresh the MAC addresses of all VFs directly without first having to change an existing address to some other value.
Fixes: 8e6de441b8e6 ("net: hns3: add support for configuring VF MAC from the host") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- .../hisilicon/hns3/hns3pf/hclge_main.c | 36 ------------------- 1 file changed, 36 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 588b7e75c897..6d99170756c3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -9418,36 +9418,6 @@ static int hclge_get_mac_ethertype_cmd_status(struct hclge_dev *hdev, return return_status; }
-static bool hclge_check_vf_mac_exist(struct hclge_vport *vport, int vf_idx, - u8 *mac_addr) -{ - struct hclge_mac_vlan_tbl_entry_cmd req; - struct hclge_dev *hdev = vport->back; - struct hclge_desc desc; - u16 egress_port = 0; - int i; - - if (is_zero_ether_addr(mac_addr)) - return false; - - memset(&req, 0, sizeof(req)); - hnae3_set_field(egress_port, HCLGE_MAC_EPORT_VFID_M, - HCLGE_MAC_EPORT_VFID_S, vport->vport_id); - req.egress_port = cpu_to_le16(egress_port); - hclge_prepare_mac_addr(&req, mac_addr, false); - - if (hclge_lookup_mac_vlan_tbl(vport, &req, &desc, false) != -ENOENT) - return true; - - vf_idx += HCLGE_VF_VPORT_START_NUM; - for (i = HCLGE_VF_VPORT_START_NUM; i < hdev->num_alloc_vport; i++) - if (i != vf_idx && - ether_addr_equal(mac_addr, hdev->vport[i].vf_info.mac)) - return true; - - return false; -} - static int hclge_set_vf_mac(struct hnae3_handle *handle, int vf, u8 *mac_addr) { @@ -9465,12 +9435,6 @@ static int hclge_set_vf_mac(struct hnae3_handle *handle, int vf, return 0; }
- if (hclge_check_vf_mac_exist(vport, vf, mac_addr)) { - dev_err(&hdev->pdev->dev, "Specified MAC(=%pM) exists!\n", - mac_addr); - return -EEXIST; - } - ether_addr_copy(vport->vf_info.mac, mac_addr);
if (test_bit(HCLGE_VPORT_STATE_ALIVE, &vport->state)) {
From: Guangbin Huang huangguangbin2@huawei.com
mainline inclusion from mainline-master commit 688db0c7a4a6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4I7P7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently, the driver only allows configuring the ETS bandwidth of TCs up to the max TC number queried from the firmware. However, the hardware actually supports 8 TCs and users may need to configure the ETS bandwidth of all of them, so remove the restriction.
Fixes: 330baff5423b ("net: hns3: add ETS TC weight setting in SSU module") Signed-off-by: Guangbin Huang huangguangbin2@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Reviewed-by: Yongxin Li liyongxin1@huawei.com Signed-off-by: Junxin Chen chenjunxin1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 9 +-------- 2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index 90013c131e94..375ebf105a9a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -129,7 +129,7 @@ static int hclge_ets_sch_mode_validate(struct hclge_dev *hdev, u32 total_ets_bw = 0; u8 i;
- for (i = 0; i < hdev->tc_max; i++) { + for (i = 0; i < HNAE3_MAX_TC; i++) { switch (ets->tc_tsa[i]) { case IEEE_8021QAZ_TSA_STRICT: if (hdev->tm_info.tc_info[i].tc_sch_mode != diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index a50e2edbf4a0..429652a8cde1 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -1123,7 +1123,6 @@ static int hclge_tm_pri_tc_base_dwrr_cfg(struct hclge_dev *hdev)
static int hclge_tm_ets_tc_dwrr_cfg(struct hclge_dev *hdev) { -#define DEFAULT_TC_WEIGHT 1 #define DEFAULT_TC_OFFSET 14
struct hclge_ets_tc_weight_cmd *ets_weight; @@ -1136,13 +1135,7 @@ static int hclge_tm_ets_tc_dwrr_cfg(struct hclge_dev *hdev) for (i = 0; i < HNAE3_MAX_TC; i++) { struct hclge_pg_info *pg_info;
- ets_weight->tc_weight[i] = DEFAULT_TC_WEIGHT; - - if (!(hdev->hw_tc_map & BIT(i))) - continue; - - pg_info = - &hdev->tm_info.pg_info[hdev->tm_info.tc_info[i].pgid]; + pg_info = &hdev->tm_info.pg_info[hdev->tm_info.tc_info[i].pgid]; ets_weight->tc_weight[i] = pg_info->tc_dwrr[i]; }